Efficient Context Integration through Factorized Pyramidal Learning for Ultra-Lightweight Semantic Segmentation
Abstract
Semantic segmentation is a pixel-level prediction task that classifies each pixel of the input image. Deep learning models, such as convolutional neural networks (CNNs), have been extremely successful in achieving excellent performance in this domain. However, mobile applications, such as autonomous driving, demand real-time processing of an incoming stream of images. Hence, achieving efficient architectures along with enhanced accuracy is of paramount importance. Since accuracy and model size of CNNs are intrinsically in contention, the challenge is to achieve a decent trade-off between accuracy and model size. To address this, we propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner. On one hand, it uses a bank of convolutional filters with multiple dilation rates, which leads to multi-scale context aggregation; crucial in achieving better accuracy. On the other hand, parameters are reduced by a careful factorization of the employed filters; crucial in achieving lightweight models. Moreover, we decompose the spatial pyramid into two stages, which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect. We also design a dedicated Feature-Image Reinforcement (FIR) unit to carry out the fusion of shallow and deep features with downsampled versions of the input image. This gives an accuracy enhancement without increasing model parameters. Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves a state-of-the-art accuracy-efficiency trade-off. More specifically, with fewer than 0.5 million parameters, the proposed network achieves 66.93% and 66.28% mIoU on the Cityscapes validation and test sets, respectively. Moreover, FPLNet has a processing speed of 95.5 frames per second (FPS).
Index Terms:
Semantic segmentation, deep learning, real-time, autonomous driving
I Introduction
Semantic segmentation is a computer vision task that deals with classifying every single pixel of an image. Being a pixel-level prediction task, it is one of the most challenging tasks in the visual recognition domain [1, 2, 3]. The rapid growth of deep learning (a sub-field of machine learning) during the last decade has revolutionized the field of computer vision [4, 5, 6, 7, 8]. Consequently, semantic segmentation has also greatly benefited from the recent developments in convolutional neural networks (CNNs), a class of deep learning models [9, 10, 11]. Autonomous driving, robotics, virtual and augmented reality, and aerial imagery are some of the application areas of semantic segmentation. Out of these, autonomous driving is currently a hot topic of research both in industry and in academia [12, 13]. As a result, a massive amount of work has been published in the literature reporting great performance in terms of segmentation accuracy [14]. However, in addition to high accuracy, mobile application scenarios such as driverless cars and drones demand small models for real-time processing of an incoming stream of images. So, it is equally important to consider the model size while designing segmentation networks for resource-constrained devices. This prompted researchers to scale down model sizes, and a number of lightweight networks have been developed as a result [15, 16]. However, these small model sizes were achieved at the cost of an excessive reduction in inference accuracy, which is not suitable for practical applications. It is therefore extremely important to keep both attributes, i.e., accuracy and number of parameters, in mind while designing semantic segmentation networks. In other words, achieving a decent trade-off between accuracy and model size is crucial in designing networks for resource-constrained real-time applications. As a result, it is currently a very promising area of research [11, 17, 18, 19, 20].

The overall accuracy of a network depends upon two types of information: high-level contextual information and low-level spatial information [21]. The high-level contextual information is crucial in producing globally consistent segmentation, whereas the low-level spatial information preserves the finer local details [10, 22]. Many recent works exploit these two types of information separately in two parallel branches [10, 9, 22]. Although the resulting performances of these two-branch approaches are quite exciting, the corresponding models still remain bulky, having millions of parameters. So, achieving a decent accuracy-efficiency trade-off demands an efficient extraction of multi-scale context. To achieve this goal, we propose a Factorized Pyramidal Learning (FPL) module which is inspired by the basic block of the classical lightweight ESPNet [16]. The ESP module employs a set of dilated convolutions with different dilation rates. To reduce parameters, the ESP module excessively compresses the channel dimension of the feature blocks (to 1/5 of that of the input block). This causes a significant information loss along the channel dimension. To address this, the spatial pyramid in FPL is decomposed into two stages. The first stage employs a single conventional convolution layer while the second stage is comprised of a bank of factorized filters. This decomposition allows us to feed 25% more channels (1/4 of the input channels) to the convolutional filters in comparison to the ESP module. Although the pyramid decomposition increases the information flow along the channel dimension, it makes the module bulkier than the ESP module. So, to counter this effect, we employ factorization of the convolutional filters in the second stage of the spatial pyramid, which results in a huge parameter saving. We empirically found that the factorization of filters in the second stage has very little effect on the accuracy. Moreover, this two-step transformation enables a simple solution to the problem of gridding or checkerboard effect, caused by the direct concatenation of the outputs of a set of dilated convolutions. More specifically, we propose an efficient Pairwise Feature Fusion (PFF) involving only two feature tensors, as opposed to the Hierarchical Feature Fusion (HFF) of ESPNet which involves all the feature tensors of the module. A comparison of the skeletal structures of the ESP and FPL modules is shown in Fig. 1.
To improve the information flow, feature maps from the first and last blocks of an encoder stage [20] are fused. To enhance accuracy, input image insertion is also a common practice, with only a slight increment in the number of parameters [16, 18]. The intra-stage feature fusion and the input image insertion are usually done by concatenating feature tensors and a downsampled (usually by a pooling operation) version of the input image. However, in this work, we design a dedicated Feature-Image Reinforcement (FIR) unit to carry out this fusion operation, which offers a simple yet effective solution without any increase in network parameters. We also design a simple and sequential asymmetric decoder that helps in recovering the low-level details that are lost during the downsampling operations in the encoder. The decoder employed in the proposed network is free from any encoder-to-decoder skip connections and is kept purely sequential. This keeps the decoder fairly simple by eliminating the requirement to store feature maps from different stages of the encoder. This fares well in terms of Graphics Processing Unit (GPU) memory utilization and ultimately leads to better efficiency.
Based on the FPL module, FIR unit and the asymmetric decoder, we propose an ultra-lightweight network, called FPLNet, to perform semantic segmentation in real time.
The main contributions of this work are summarized as follows:
•
A novel FPL module is proposed, inspired by the classical ESP module. On one hand, it harvests rich contextual information by capturing context at multiple scales; on the other hand, it achieves a huge parameter saving by introducing factorization into the spatial pyramid of filters.
•
The pyramidal decomposition enables a simpler solution to the gridding artifact problem. The proposed solution is called Pairwise Feature Fusion (PFF) and requires only two feature blocks at a time.
•
A simple Feature-Image Reinforcement (FIR) unit is proposed to jointly fuse deep and shallow features with the image at the end of each stage. It improves the accuracy further without requiring additional network parameters.
•
Based on the FPL module, FIR unit and a simple asymmetric decoder, we propose an ultra-lightweight network, called FPLNet, which achieves a state-of-the-art accuracy-efficiency trade-off among lightweight networks.
The rest of the paper is organized as follows. In Section II, a brief survey of related works is presented; it also briefly outlines the gap in the existing literature which the proposed work fills. Section III presents the basic building blocks of the proposed network, including the initial module, the FPL module and the FIR unit, and describes the overall architecture of the corresponding FPLNet in detail. In Section IV, various experimental results and discussions are presented, including the dataset, training details, performance analysis and a comparison of our FPLNet with other state-of-the-art networks; Section IV also presents a detailed ablation study of various important design choices. Finally, Section V concludes the paper.
II Related Work
To be consistent with our objective of developing a small-size model capable of achieving decent accuracy, we categorize the existing works based on their corresponding model sizes. More specifically, we divide the existing networks into four broad categories: large-scale, mid-scale, lightweight and ultra-lightweight networks. Large-scale networks include models having 5 million or more parameters. Similarly, mid-scale, lightweight and ultra-lightweight networks include models having 1-5, 0.5-1 and less than 0.5 million parameters, respectively. This categorization makes the survey of the massive literature on semantic segmentation concise and easily tractable.
II-A Large-scale networks
The first end-to-end trainable network in this field was proposed by [1]. They adapted the famous classification networks AlexNet [7], GoogLeNet [4] and VGG Net [5] and, through transfer learning, fine-tuned these networks for semantic segmentation. Reference [2] employed a symmetric encoder-decoder network where the upsampling of the low-resolution feature maps from the encoder was done using pooling indices. To increase the receptive fields of the convolutional kernels, dilated convolutions were used by the authors of [23]. This enabled them to harvest more contextual information, which is very conducive to accurate segmentation. To capture contextual information at multiple scales, [24] proposed the Pyramid Pooling Module (PPM); variants of the PPM are still being used to harvest contextual information [9]. Based on the PPM module and ResNet [6] as the backbone model, they proposed PSPNet. To preserve the finer details that are lost due to downsampling operations such as strided convolution and pooling, [25] proposed RefineNet. They demonstrated that features from all the layers of a network are crucial, so they designed their network such that it refines low-resolution (coarse) features from deep layers using high-resolution (fine-grained) features from shallow layers. PSPNet was adopted by the authors of ICNet [26] as their backbone network: the PSPNet is incorporated in the primary branch, which operates at very low resolution, and two additional shallow branches are used to harvest finer spatial details. BiSeNet [10] uses a two-branch approach; a deep branch for semantic information and a shallow one for spatial details. The outputs from these two branches are then integrated to give finely delineated prediction maps. HyperSeg [17] uses EfficientNet [27] as its backbone architecture. It also uses a weight prediction network to progressively upsample the low-resolution feature maps from the context head module. The decoder receives feature maps from different stages of the encoder and fuses them in a step-by-step fashion.
II-B Mid-scale networks
ERFNet [28] used a factorization strategy and applied it to conventional convolutional kernels, decomposing each kernel into a pair of 1-dimensional asymmetric convolutions. MaskNet [29] adopts ERFNet and uses a shallow parallel branch to apply a semantic masking technique to improve the accuracy of small classes. BiAttenNet [30] adopts a two-branched approach, separating the attention blocks into a spatial detailing branch to explore details in a specialized manner. It also employs an FCN-style ResNet to achieve a rough segmentation, and then merges the coarse branch and the detail branch to achieve a speed-accuracy balance. RegSeg [31] designed a basic block inspired by ResNeXt [32] blocks and used two parallel dilated convolutional blocks to enlarge the receptive field while also preserving low-level details. Inspired by BiSeNet [21], the authors of CABiNet [22] also proposed a two-branched methodology where global aggregation and local distribution blocks [33] have been employed to harvest long-range and short-range contextual information, which is key to achieving a decent accuracy-efficiency balance.

II-C Lightweight networks
The authors of DABNet [18] proposed a simple architecture to achieve state-of-the-art performance in the lightweight scenario. More specifically, they designed a simple depth-wise asymmetric bottleneck to process the incoming stream of features. Each module consists of two depth-wise convolutions; one conventional and the other dilated. Following the success of the two-branched approach, ContextNet [19] fuses low-resolution deep semantic features with high-resolution finer spatial details.
II-D Ultra-lightweight networks
In contrast to accuracy-oriented networks, ENet [15] focuses on the other extreme and consequently achieved a real-time network through aggressive parameter reduction. This work introduced a new direction of research, i.e., real-time methods. As a result, other researchers followed suit and many efficient networks were proposed. ESPNet [16] is one such efficient model that has less than 0.5 million parameters. It uses a bank of convolutions with different dilation rates, allowing it to capture multi-scale context. CGNet [11], in its CG module, uses three types of context aggregation: a surrounding context extractor, a joint feature extractor and a global context extractor. This is also an ultra-lightweight network with less than 0.5 million parameters, suitable for resource-constrained devices.
Gap in the literature: In summary, the large-scale models achieve great accuracies with networks having millions of parameters, which are not suitable for mobile devices. Similarly, the mid-scale networks, though smaller than their large-scale counterparts, still have millions of parameters, prohibiting them from being a decent choice for physical deployment in practical scenarios. On the other extreme of the spectrum, the accuracies of the ultra-lightweight networks are not sufficient for practical deployments, especially in applications that involve risk. So, to fill this gap, ultra-lightweight networks with decent accuracies are required. In this work, the FPLNet has been proposed to fill this gap.
III FPLNet: Proposed Network
In this section, the basic building blocks of the proposed FPLNet will be discussed first. A comparison of conventional convolution block, ESP module and the proposed FPL module has also been presented along with their merits and demerits to give deeper insights. Finally, the overall architecture design of the FPLNet is presented which includes the attached decoder.
III-A Basic Building Blocks
The basic building blocks of the proposed FPLNet include the initial module, the FPL module and the FIR unit, which are discussed in detail in the following subsections.
III-A1 Initial module
In real-time semantic segmentation networks, to reduce the number of parameters and the computational cost, the spatial resolution is usually hastily downsampled to quarter resolution by using two consecutive downsampling blocks [16, 28]. This strategy undoubtedly saves some parameters and convolution operations. However, such an aggressive reduction in spatial resolution so early has a significant negative effect on the fine delineation of object boundaries, which ultimately leads to poor accuracy [15]. This is because much of the finer spatial information that could have been harvested in the first stage is lost due to early downsampling. So, to avoid the loss of finer details, we use two successive convolutional layers (excluding the downsampler) in stage-1. In this way, we are able to harvest better spatial information at half the spatial resolution in the first stage, which yields a better feature representation at the input of the second stage.
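A minimal PyTorch sketch of such a stage-1 block is given below. The layer names, the ConvBNReLU helper and the BN/ReLU placement are illustrative assumptions; channel widths follow Table II, and the concatenation with the downsampled image (layer 4 in Table II) is handled outside this block.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """3x3 convolution followed by batch norm and ReLU (assumed helper)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class InitialModule(nn.Module):
    """Stage-1: one strided downsampler plus two full convolutions at 1/2 resolution."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.down = ConvBNReLU(in_ch, out_ch, stride=2)   # layer 1: downsample to 1/2 resolution
        self.conv1 = ConvBNReLU(out_ch, out_ch)           # layers 2-3: harvest finer spatial details
        self.conv2 = ConvBNReLU(out_ch, out_ch)
    def forward(self, x):
        return self.conv2(self.conv1(self.down(x)))

# usage: InitialModule()(torch.randn(1, 3, 512, 1024)).shape -> (1, 32, 256, 512)
```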
III-A2 FPL module
In order to correctly recognize a small region of an image, we need the information of surrounding regions as well [11]. So, context plays a very important role in vision problems and especially in semantic segmentation.
The simplest way to increase context integration is to enlarge the receptive fields of the kernels by increasing their sizes. So, researchers have used kernels of different sizes, ranging from 3×3 to 11×11 [7]. However, this strategy is very inefficient as it leads to an exponential increase in parameters. For example, a conventional 3×3 kernel has 9 parameters, whereas an 11×11 kernel has 121 parameters per channel. One simple solution to this problem is to use dilated convolutions [34]. Dilation allows us to increase the receptive field without increasing the number of parameters. For example, a 3×3 dilated convolution with a dilation factor of 16 spans a 33×33 region of the input feature map, whereas to span the same region a conventional kernel would require 1089 parameters, 121× more than its dilated counterpart. So, with this strategy one can save a huge number of parameters. However, in complex pixel-level dense prediction tasks, such as semantic segmentation, single-scale context is not sufficient and we need multi-scale context.
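The arithmetic behind this comparison can be checked with a few lines of Python (a sketch; the span formula assumes a square 3×3 kernel):

```python
def dilated_span(n=3, d=16):
    # spatial extent covered by an n x n kernel with dilation d
    return (n - 1) * d + 1

def conv_params_per_channel(n):
    # parameters of an n x n kernel for a single input/output channel pair
    return n * n

span = dilated_span(3, 16)                      # 33
dense_params = conv_params_per_channel(span)    # 33 * 33 = 1089
dilated_params = conv_params_per_channel(3)     # 9
print(span, dense_params, dilated_params, dense_params // dilated_params)  # 33 1089 9 121
```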
In the literature, the most popular way to harvest multi-scale context is to employ a spatial pyramid [16]. A spatial pyramid is nothing but multi-scale context aggregation. This multi-scale context can be integrated either by pooling features with different window sizes [24, 23, 9] or by employing a bank of convolutional filters with different dilation rates [16]. The former is used once in the network, usually before the final prediction stage [24], whereas the latter is usually distributed across the network and built into the basic blocks themselves [16]. This eliminates the need for a separate pyramid module for context aggregation, as the basic blocks themselves are well equipped to carry out the required task. Therefore, we adopted the latter approach, i.e., context aggregation as an in-built functionality of the basic blocks. More specifically, we adapted the ESP module of the classical ESPNet and reformulated it. The ESP module, in order to save parameters and computations, drastically squeezes the channel depth of the input feature, to 1/5 to be more specific. This saves parameters at the cost of information loss along the channel dimension.

The proposed approach, however, is distinct from [16] in two important ways. Firstly, we decompose the spatial pyramid into two consecutive stages. We employ a conventional convolution in the first stage to gather local information, whereas a bank of dilated convolutions is employed in the second stage to gather multi-scale context. Secondly, we factorize the spatial pyramid (i.e., the bank of dilated convolutions) in the second stage, which leads to significant parameter saving while incurring only a slight toll on accuracy. To be more specific, each dilated 3×3 kernel is factorized into a pair of 3×1 and 1×3 kernels. In this way, we save three parameters in each channel of the kernel. The factorization also allows us to insert an extra layer of non-linearity between the two asymmetric kernels. This adds an additional linear region in the manifold space, which leads to better accuracy with a negligible increment in computation cost [28]. Placed before the factorized spatial pyramid of kernels, the conventional convolution in the first stage adds an additional layer of transformation, which increases the effective depth of the network without increasing the number of basic modules. Moreover, being a symmetric convolution, it helps in balancing the accuracy loss incurred by the bank of factorized asymmetric 1D convolutional filters. This two-stage pyramid also enhances the effective receptive field with respect to the ESP module [35]. More importantly, the pyramid decomposition enables us to solve the problem of gridding artifacts in a quite simple way. To be more specific, we apply a simple pairwise feature fusion (PFF) that adds only two features at a time, as opposed to the HFF of the ESP module which engages all five features of the module.
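A minimal PyTorch sketch of the resulting FPL module is given below. The dilation rates (1, 2, 4, 8), the 3×1/1×3 factorization order, the BN/ReLU placement and the exact pairing used in PFF are assumptions made for illustration, not the authors' exact implementation; the sketch also assumes the output channel count is divisible by K = 4.

```python
import torch
import torch.nn as nn

class FactorizedDilatedBranch(nn.Module):
    """One stage-2 branch: a 3x3 dilated kernel factorized into 3x1 and 1x3 convolutions."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, (3, 1), padding=(dilation, 0), dilation=(dilation, 1), bias=False),
            nn.ReLU(inplace=True),   # extra non-linearity between the asymmetric kernels
            nn.Conv2d(ch, ch, (1, 3), padding=(0, dilation), dilation=(1, dilation), bias=False),
            nn.BatchNorm2d(ch),
        )
    def forward(self, x):
        return self.conv(x)

class FPLModule(nn.Module):
    """Sketch of the Factorized Pyramidal Learning module (K = 4 branches assumed)."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        d = out_ch // len(dilations)                       # each branch carries 1/4 of the channels
        self.reduce = nn.Conv2d(in_ch, d, 1, bias=False)   # point-wise channel reduction
        self.stage1 = nn.Sequential(                       # stage 1: conventional 3x3 convolution
            nn.Conv2d(d, d, 3, padding=1, bias=False),
            nn.BatchNorm2d(d), nn.ReLU(inplace=True))
        self.stage2 = nn.ModuleList(
            [FactorizedDilatedBranch(d, r) for r in dilations])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        local = self.stage1(self.reduce(x))
        # Pairwise Feature Fusion (assumed form): each dilated branch is fused with the
        # stage-1 feature only, instead of ESPNet's hierarchical fusion of all branches.
        outs = [branch(local) + local for branch in self.stage2]
        return self.act(torch.cat(outs, dim=1))

# usage: FPLModule(64, 64)(torch.randn(1, 64, 128, 256)).shape -> (1, 64, 128, 256)
```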
Modules | Parameters (general form) | Number of parameters (M = N = 60)
Convolution | M · N · n² | 32,400
ESP-original | (M · N)/K + K · (N/K)² · n² | 7,200
FPL-decomp. | (M · N)/K + (N/K)² · n² + K · (N/K)² · n² | 11,025
FPL (proposed) | (M · N)/K + (N/K)² · n² + 2K · (N/K)² · n | 8,325
Table I shows the comparison of different modules. M is the number of channels in the incoming feature map and N is the number of channels in the outgoing one. n is the size of the symmetric convolutional kernel and K is the number of branches the incoming feature tensor is split into. The exact number of parameters per module, shown in Table I, is computed for M = N = 60, n = 3, and K = 5 for ESP-original and K = 4 for both FPL-decomp. and FPL. Please note that the decomposition of the pyramid in both FPL-decomp. and FPL results in an enhanced receptive field (hence wider context) and allows the depth of each branch to increase from 1/5 to 1/4 of the channels. It is clear from Table I that the FPL-decomp. module requires 53.125% more parameters than the ESP-original module. Our proposed final version of the FPL module, on the other hand, achieves the same objectives as FPL-decomp. but with only 15.6% more parameters, while having only a slight effect on the final accuracy.
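The parameter counts in Table I can be reproduced with the short script below (a sketch of the assumed per-module formulas, evaluated for M = N = 60 and n = 3):

```python
def conv_params(M, N, n=3):
    return M * N * n * n                                   # conventional n x n convolution

def esp_params(M, N, K=5, n=3):
    d = N // K
    return M * d + K * d * d * n * n                       # 1x1 reduce + K dilated n x n convs

def fpl_decomp_params(M, N, K=4, n=3):
    d = N // K
    return M * d + d * d * n * n + K * d * d * n * n       # reduce + stage-1 conv + K dilated convs

def fpl_params(M, N, K=4, n=3):
    d = N // K
    return M * d + d * d * n * n + K * d * d * 2 * n       # stage-2 factorized into 3x1 and 1x3

print(conv_params(60, 60),        # 32400
      esp_params(60, 60),         # 7200
      fpl_decomp_params(60, 60),  # 11025
      fpl_params(60, 60))         # 8325
```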

III-A3 Feature-Image Reinforcement Unit (FIR unit)
Intra-stage skip connections are used to enhance the information flow of the network [20, 16]. Image insertion is also done to improve the feature representation at different stages of the encoder [18]. To accomplish this, additive or concatenative fusion is usually used. Concatenation increases the parameters but performs slightly better than additive fusion. In this work, to improve this further, we design a dedicated FIR unit to jointly combine the deep and shallow features with the downsampled images. The proposed FIR unit employs a hierarchical combination of additive and concatenative fusion. This gives significantly better performance than addition or concatenation alone with the same number of trainable parameters. The FIR unit is shown in Fig. 3. Features from deep and shallow layers are first added. The resultant feature tensor is then concatenated with the feature from the deep layer, followed by image insertion.
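A minimal sketch of the FIR unit is shown below. The use of adaptive average pooling to resize the input image and the exact fusion order are assumptions based on the description above; the channel counts in the comment follow Table II.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FIRUnit(nn.Module):
    """Feature-Image Reinforcement: fuse deep/shallow features with the downsampled image.
    Parameter-free: only addition, concatenation and pooling are used."""
    def forward(self, deep, shallow, image):
        fused = deep + shallow                                   # additive fusion (same shape)
        image_ds = F.adaptive_avg_pool2d(image, deep.shape[2:])  # resize the image to the feature size
        return torch.cat([fused, deep, image_ds], dim=1)         # hierarchical concatenation

# usage at the end of stage-2 (channel counts from Table II):
# deep, shallow: (1, 64, 128, 256); image: (1, 3, 512, 1024) -> output (1, 131, 128, 256)
```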
III-B Network Architecture
The overall network architecture of the FPLNet is shown in Fig. 4. A stage of a network is the section in which consecutive feature maps share the same spatial resolution. In this way, the encoder of the proposed network is composed of three stages: stage-1, stage-2 and stage-3. Stage-1 is instantiated by the initial module, which has 2 convolutional layers for finer feature extraction. There are 4 and 8 FPL modules in the second and third stage, respectively, excluding the modules acting as downsamplers. The downsampling of feature maps is employed in deep CNNs for two reasons. Firstly, it allows the deeper layers to receive rich contextual information, which in turn provides better semantic information. Secondly, it reduces the computational cost by reducing the spatial dimensions of the feature maps. However, these two benefits are achieved at the cost of spatial information loss at the local level, which is crucial to recover finer details in the final predicted map. Pooling operations such as max-pooling or average pooling are generally used for downsampling [7, 2], but we use the same FPL module for downsampling as well, simply by using strided convolution. This way we do not have to rely on pooling operations separately. The details of the architecture design are presented in Table II.
 | Layer | Operation | Out Ch. | Out Res. (of input)
ENCODER | 1 | Downsample (Conv-3) | 32 | 1/2
 | 2-3 | Conv-3 | 32 | 1/2
 | 4 | Concatenation | 35 | 1/2
 | 5 | Downsample (FPL) | 64 | 1/4
 | 6-9 | 4 × FPL module | 64 | 1/4
 | 10 | FIR unit | 131 | 1/4
 | 11 | Downsample (FPL) | 128 | 1/8
 | 12-19 | 8 × FPL module | 128 | 1/8
 | 20 | FIR unit | 259 | 1/8
DECODER | 21 | Projection (Conv-1×1) | 64 | 1/8
 | 22 | Upsampler | 64 | 1/4
 | 23-24 | 2 × FPL module | 64 | 1/4
 | 25 | Upsampler | 16 | 1/2
 | 26-27 | 2 × FPL module | 16 | 1/2
 | 28 | Projection (Deconv) | C | 1 (full)
FPLNet | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic light | Traffic sign | Vegetation | Terrain | Sky | Pedestrian | Rider | Car | Truck | Bus | Train | Motorbike | Bicycle | mIoU (%)
Val. set | 94.46 | 76.42 | 87.37 | 54.77 | 52.55 | 50.09 | 46.70 | 62.43 | 89.04 | 58.92 | 88.96 | 65.36 | 45.73 | 88.71 | 63.11 | 75.79 | 66.29 | 42.25 | 62.68 | 66.93
Test set | 95.93 | 75.95 | 88.75 | 60.32 | 57.65 | 51.41 | 49.15 | 61.69 | 89.28 | 54.65 | 92.02 | 62.19 | 37.40 | 89.13 | 68.88 | 74.56 | 63.54 | 31.50 | 55.36 | 66.28
The first block of each stage is a downsampling module, shown in pink in Fig. 4. The spatial resolutions of the first, second and third stages are 1/2, 1/4 and 1/8 of that of the input image, respectively. We use the FIR units at the end of stage-2 and stage-3 to enrich the feature representation and to improve the flow of information. It should be noted here that this technique of input image insertion is very different from target/label insertion [21, 26]. The objective of target/label insertion is to provide an auxiliary loss mechanism at different depths of the network for enhanced learning. So, the target/label insertion mechanism is only used during the training phase and is absent during inference. The technique used in the proposed work, i.e., input image insertion using the FIR unit, is used both during training and inference and hence does not rely on annotated images.
To produce a segmentation map from the encoder itself, its output is directly projected to a C-dimensional space using a pointwise convolution. Then, a bilinear-interpolation layer is used to directly upsample the low-resolution map to high resolution. This is depicted as a transition layer between the encoder and decoder sections in Fig. 4. To keep a network lightweight, decoders are either omitted altogether [11, 18] or designed to be extremely lightweight [16]. On the other hand, many model-size-agnostic networks, where achieving high accuracy is the main goal, employ very complex decoders [17, 36]. Since our objective is to find a balance between accuracy and model size, we adopt a middle approach. More specifically, we design a small, simple, asymmetric decoder to progressively recover the finer spatial details. We empirically found that encoder-to-decoder skip connections do not offer any accuracy gain in our case, so we do not use such connections and keep our decoder purely sequential. Moreover, this sequential nature of the decoder makes it memory efficient, especially during the training phase.
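A rough sketch of such a sequential decoder is given below, reusing the FPLModule class from the sketch in Section III-A2. The channel widths follow Table II, but the choice of transposed convolutions as upsamplers and their kernel sizes are assumptions.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the sequential asymmetric decoder; assumes FPLModule from the earlier sketch."""
    def __init__(self, in_ch=259, num_classes=19):
        super().__init__()
        self.project = nn.Conv2d(in_ch, 64, 1, bias=False)                 # 1x1 projection
        self.up1 = nn.ConvTranspose2d(64, 64, 3, stride=2, padding=1, output_padding=1)
        self.block1 = nn.Sequential(FPLModule(64, 64), FPLModule(64, 64))  # 2 FPL at 1/4 resolution
        self.up2 = nn.ConvTranspose2d(64, 16, 3, stride=2, padding=1, output_padding=1)
        self.block2 = nn.Sequential(FPLModule(16, 16), FPLModule(16, 16))  # 2 FPL at 1/2 resolution
        self.classifier = nn.ConvTranspose2d(16, num_classes, 2, stride=2) # final projection to C maps

    def forward(self, x):
        x = self.block1(self.up1(self.project(x)))
        x = self.block2(self.up2(x))
        return self.classifier(x)   # full-resolution class scores
```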
IV Experiments
In order to show the effectiveness of the proposed module and the corresponding network, extensive experiments have been carried out. The general experimental setup used to conduct the experiments is presented along with the dataset and the evaluation metrics used to evaluate the performance of the proposed FPLNet. Most importantly, a comparison with other state-of-the-art methods is presented to show the effectiveness of the proposed methodology. A detailed ablation study of different design choices is also presented.
IV-A Dataset
There are many publicly available road-scene datasets, for example, Cityscapes and CamVid [12, 13]. However, Cityscapes is one of the most widely used. The images in this dataset are captured across 50 different cities with high-quality pixel-level annotations, which makes it highly diverse. So, it has been used for training and evaluation of the proposed network. There are a total of 5000 finely annotated images, divided into training, testing and validation sets having 2975, 1525 and 500 images, respectively. Test labels are not included in the dataset but predictions can be evaluated on an online test server. The images of this dataset are of very high resolution (2048 × 1024). Hence, the downscaled version (i.e., 1024 × 512) is usually used by researchers for training.
IV-B Training details
All the experiments in this work have been conducted on a Tesla V100 GPU using the PyTorch framework with CUDA 10.2 and cuDNN backends. Mini-batch stochastic gradient descent (SGD) [37] with momentum 0.9 and weight-decay regularization is used as the optimizer for training the FPLNet. We employed the "poly" learning rate strategy, which is given by

$$lr = lr_{0} \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$
where $lr_{0}$ corresponds to the initial learning rate and the power is set to 0.9. The cross-entropy loss is used as the loss function. The training is done in two stages. The encoder part of the network is trained in the first stage with a batch size of 12, where the weights were initialized randomly. In the second stage, the decoder is attached to the pretrained encoder (trained in the first stage), and the batch size is 6 for training the whole network. Though the original resolution of the images is 2048 × 1024, we trained our network by sub-sampling the original images by a factor of 2. We trained the encoder for 500 epochs and the complete network for 1000 epochs. For data augmentation, standard strategies such as horizontal flipping, cropping and scaling have been employed during training. Following [15], a class weighting scheme has also been used to mitigate the class-imbalance problem. As per this scheme, different weights are assigned to different classes during training, giving more weight to rare classes and less weight to dominant classes. To be more specific, each class is assigned the following weight:
$$w_{class} = \frac{1}{\ln(c + p_{class})}$$

where $p_{class}$ is the normalized frequency of the class and $c$ is a constant set to 1.02.
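For illustration, both the learning-rate schedule and the class weights can be computed as follows (a minimal sketch; the base learning rate and the pixel counts in the example are placeholders, not values from the paper):

```python
import numpy as np

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate schedule: lr = base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

def class_weights(pixel_counts, c=1.02):
    """ENet-style class weighting: w = 1 / ln(c + p), with p the normalized class frequency."""
    p = pixel_counts / pixel_counts.sum()
    return 1.0 / np.log(c + p)

# example: a rare class receives a much larger weight than a dominant one
counts = np.array([9.0e8, 1.0e6])    # hypothetical pixel counts for two classes
print(class_weights(counts))          # -> roughly [1.4, 47.9]
```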
Methods | Backbone network | ImageNet pretrain | mIoU Val (%) | mIoU Test (%) | Parameters (M) | Resolution | Speed (FPS)
FCN-8s [1] | VGG16 [5] | ✓ | - | 65.3 | 134.5 | - | -
ICNet [26] | PSPNet50 [24] | ✓ | - | 69.5 | 26.5 | 2048 × 1024 | 30.3
SegNet [2] | self | ✓ | - | 57 | 29.5 | 640 × 360 | 16.7
BiSeNetV1 [21] | Xception39 [8] | ✓ | 69.0 | 68.4 | 5.8 | 1536 × 768 | 105.8
HyperSeg [17] | EfficientNet-B1 [27] | ✓ | 78.2 | 78.1 | 10.2 | 1536 × 768 | 16.1
HSBNet [38] | MobileNetV2 [39] | ✓ | 73.1 | 73.1 | 12.1 | 2048 × 1024 | 32.2
ERFNet [28] | self | ✗ | 70 | 68 | 2.2 | 1024 × 512 | 41.7
CABiNet [22] | MobileNetV3 [40] | ✓ | 76.6 | 75.9 | 2.64 | 2048 × 1024 | 76.5
BiSeNetV2 [10] | self | ✓ | 75.8 | 75.3 | 4.59 | 1536 × 768 | 47.3
RegSeg [31] | self | ✗ | 78.13 | 78.3 | 3.34 | 2048 × 1024 | 30
BiAttenNet [30] | ResNet-34 [6] | ✓ | 71.4 | 74.7 | 2.2 | 1024 × 512 | 89.2
DABNet [18] | self | ✗ | 69.1 | 70.1 | 0.75 | 2048 × 1024 | 27.7
ContextNet [19] | self | ✗ | 65.7 | 66.1 | 0.88 | 2048 × 1024 | 54
EDANet [20] | self | ✗ | - | 67.3 | 0.68 | - | 81.3
ENet [15] | self | ✗ | - | 57 | 0.36 | 1024 × 512 | 74.9
ESPNet [16] | self | ✗ | 61.4 | 60.3 | 0.36 | 1024 × 512 | 112
CGNet [11] | self | ✗ | 63.5 | 64.8 | 0.5 | 2048 × 1024 | 17.6
FPLNet (Proposed) | self | ✗ | 66.93 | 66.28 | 0.49 | 1024 × 512 | 95.5
IV-C Performance analysis
Mean Intersection over Union (mIoU) is the most commonly used metric in the field of semantic segmentation to evaluate the accuracy of a model [1]. The Intersection over Union (IoU) for a particular class is defined as the ratio of overlap and union between the class prediction and the class ground truth. In the case of multi-class segmentation, the mIoU of the model is calculated by taking the IoU of each class and averaging over all the classes present in all the predicted images:

$$IoU = \frac{TP}{TP + FP + FN}$$

where TP, FP and FN are, respectively, the number of true positives, false positives and false negatives at the pixel level. Table III shows the class-wise and mean IoUs for both the validation and test sets.
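A compact way to compute these quantities from prediction and ground-truth label maps is sketched below (the ignore_index value and array shapes are assumptions; Cityscapes conventionally marks ignored pixels with 255):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a (num_classes x num_classes) confusion matrix from flat label arrays."""
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN); mIoU is the average over classes."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou, iou.mean()

# usage (hypothetical 19-class Cityscapes labels):
# conf = confusion_matrix(pred.flatten(), gt.flatten(), num_classes=19)
# per_class_iou, miou = mean_iou(conf)
```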
IV-C1 Comparison of the proposed FPLNet with other state-of-the-art methods
The comparison of FPLNet with other state-of-the-art networks is summarized in Table IV. To measure accuracy, mIoU (%) has been used and, for efficiency, we use parameters (M) and speed (FPS). Apart from that, other comparative indicators, such as the type of backbone network and pretraining information, are also included to facilitate a multifaceted perspective. To keep the discussion tractable, we divide the semantic segmentation models into four broad categories based on the number of parameters, as shown in Table V.
Large-scale | Mid-scale | Lightweight | Ultra-lightweight
≥ 5 M | 1-5 M | 0.5-1 M | < 0.5 M
Recent large-scale networks such as BiSeNetV1 [21], HyperSeg [17] and HSBNet [38] achieve excellent accuracies. However, to achieve such high accuracies, they rely on networks with a huge number of parameters. More specifically, BiSeNetV1, HyperSeg and HSBNet achieve 68.4, 78.1 and 73.1% mIoU with 5.8, 10.2 and 12.1 million parameters, respectively. It is quite interesting to note here that HyperSeg [17] is 5% more accurate than HSBNet despite having almost 2 million fewer parameters. This shows that a straightforward increase in parameters does not guarantee a proportional accuracy boost. This can be further observed when we compare mid-scale networks with large-scale ones. When we compare CABiNet [22] with BiSeNetV1, we find that CABiNet is 7.5% more accurate than BiSeNetV1 while being less than half the size. Similarly, RegSeg [31] and BiAttenNet [30] are 0.2 and 1.6% more accurate than HyperSeg [17] and HSBNet [38], while being roughly 3× and 5.5× smaller, respectively. A similar effect can be seen when we compare BiSeNetV1 and BiSeNetV2 [10]. Despite having 1.21 million fewer parameters, BiSeNetV2 is almost 7% more accurate than BiSeNetV1. These observations demonstrate that, through smart architecture design, smaller networks can achieve accuracies similar to those of networks several times their size. Exploiting this possibility is extremely important because the model sizes of large-scale and mid-scale networks become a bottleneck in achieving real-time processing on resource-constrained devices [22]. To address this issue, many networks have been developed with less than 1 million parameters.
To further scale down the network size while still maintaining a decent accuracy, we propose the ultra-lightweight real-time FPLNet. Experimental results show that the proposed network achieves 66.93% and 66.28% mIoU on the validation and test set, respectively. It achieves accuracy similar to that of lightweight models such as DABNet [18] and EDANet [20] with a much smaller number of parameters. When compared to other state-of-the-art ultra-lightweight (less than 0.5 million parameters) networks [16, 15, 19, 11], the proposed FPLNet outperforms them in terms of accuracy (almost 1-10% accuracy improvement) while having a similar number of parameters; 0.49 million to be more specific. This shows the effectiveness of our network in achieving a decent accuracy-efficiency trade-off in the ultra-lightweight range. More importantly, it achieves accuracy close to the large-scale BiSeNetV1 [21] and the mid-scale ERFNet [28] while being 11.8× and 4.5× smaller, respectively.
To put everything into perspective, a comparison of the proposed FPLNet with other lightweight and ultra-lightweight networks is shown in Fig. 5. The inherent contention between accuracy and efficiency can also be clearly observed in Fig. 5. For calculating the trainable parameters of the networks, the pytorch-OpCounter library [41] has been used [17]. To give a visual perspective, qualitative results are shown in Fig. 6.
IV-C2 Ablation studies
In this section, a detailed ablation study of different design choices is presented. It should be noted that all the experiments in this study are conducted considering only the encoder of the FPLNet unless stated otherwise. Once the required efficient encoder was found, a lightweight decoder was attached to it to complete the proposed network, i.e., FPLNet.
ESP vs FPL module: To show the effectiveness of the proposed FPL module and the corresponding FPLNet, we conducted a series of experiments as shown in Table VI. By applying the ESP module in FPLNet and FPL module in ESPNet, we are able to show the effectiveness of both the FPL module and the architecture of our network. It is clear from Table VI that by merely replacing ESP module with FPL module in the ESPNet (while keeping other variables constant) we are able to improve the mIoU of ESPNet by 2.43 %; reflecting the effectiveness of FPL module. Similarly, applying ESP module in our FPLNet (while keeping other variables constant) gives a 2.99 % accuracy boost over ESPNet; showing the effectiveness of the proposed architecture.
Model | mIoU (%) | Parameters (M) |
ESPNet | 53.30 | 0.34 |
ESPNet-FPL | 55.73 | 0.38 |
FPLNet | 58.83 | 0.42 |
FPLNet-ESP | 56.29 | 0.38 |
Hasty downsampling vs delayed downsampling: In order to reduce the parameters contributed by the first stage, many works adopt a hasty downsampling strategy [16, 28]. This does not allow any feature extraction at stage-1 (i.e., at half of the input resolution) and performs two back-to-back downsampling operations to reduce the feature maps to 1/4 of the input resolution. However, it has an adverse effect on the accuracy of the network, as low-level finer details are lost. Hence, we employ two conventional layers in this stage to extract the finer details, which are crucial for the fine delineation of object boundaries in the final segmented map. Table VII shows the advantage of delayed downsampling over the hasty one.
Downsampling | mIoU (%) | Parameters (M) |
Delayed | 58.83 | 0.42 |
Hasty | 56.97 | 0.40 |
Different fusion strategies within the encoder: Table VIII presents a comparison between different variants of intra-stage feature fusion (ISFF) and image fusion (IF). It is clear from Table VIII that concatenation works better for intra-stage feature fusion than addition, with a slight increment in the network parameters. The proposed FIR unit, which hierarchically combines additive and concatenative fusion together with image insertion, performs the best, giving a 1.08% mIoU gain over the best simple combination (IF & ISFF-concat) with exactly the same number of parameters.
Fusion strategies | mIoU (%) | Parameters (k) |
IF | 57.00 | 414.810 |
ISFF-add | 56.42 | 414.636 |
ISFF-concat | 56.97 | 419.820 |
IF & ISFF-add | 57.20 | 414.810 |
IF & ISFF-concat | 57.75 | 419.994 |
FIR (proposed) | 58.83 | 419.994 |
Number of FPL modules in different stages: A series of experiments has been conducted to find the optimal number of FPL modules to be used in different stages. Table IX presents the performance of the FPLNet encoder for different numbers of FPL modules in stage-2 and stage-3. After performing extensive experiments, it is found that the optimal configuration is 4-8, i.e., 4 and 8 FPL modules in stage-2 and stage-3, respectively.
(#FPLS2, #FPLS3) | (2, 6) | (2, 8) | (2, 10) | (2, 12) | (4, 6) | (4, 8) | (4, 10) | (4, 12) |
mIoU (%) | 56.0 | 56.1 | 55.9 | 55 | 57.4 | 58.83 | 57.2 | 57.98 |
Parameters (M) | 0.32 | 0.40 | 0.47 | 0.57 | 0.34 | 0.42 | 0.49 | 0.57 |
Extent of factorization: Factorization of convolutional kernels saves parameters but affects accuracy. Hence, a trade-off is required. To find this trade-off, a set of experiments has been conducted to reveal how much factorization should be done within the FPL module. Table X presents the results of the corresponding experiments.
Factorization | mIoU (%) | Parameters (M) |
✗ | 59.10 | 0.54 |
All | 56.79 | 0.38 |
Proposed (composite) | 58.83 | 0.42 |
In the proposed composite scheme, the first stage of the pyramid, i.e., the conventional 3×3 convolution, is kept symmetric, while the second stage, i.e., the bank of dilated convolutions, is factorized. It is clear from Table X that the proposed setting of factorization in the FPL module results in a balanced trade-off. With no factorization at all, we observe only a 0.27% mIoU increment at the expense of 0.12 million more parameters compared to the composite scheme. Hence, we can conclude that the proposed configuration of factorization offers the optimal balance between mIoU and number of parameters.
V Conclusion and Future work
In this article, a novel module has been proposed which employs a spatial pyramid to extract multi-scale context in an efficient way. The resulting module is called the Factorized Pyramidal Learning (FPL) module. To allow more information embedding along the channel dimension, the spatial pyramid is decomposed into two stages. The first stage is a conventional convolution whereas the second stage employs a bank of factorized convolutional filters with different dilation rates. As a result, the module is able to capture both short-range and long-range context, which greatly enhances the segmentation accuracy of the model. The proposed FPL module carefully factorizes the pyramid filters, resulting in a huge saving of overall trainable parameters. Moreover, to improve the information flow and to enable enhanced learning, the shallow and deep features of each stage are fused with the downsampled image using a dedicated Feature-Image Reinforcement (FIR) unit. This gives an accuracy boost without increasing the network parameters compared to simple concatenative fusion. To complete the network, we also design a small, simple and sequential asymmetric decoder for the recovery of local spatial details in the final segmentation map. Based on the FPL module, FIR unit and the asymmetric decoder, we propose an ultra-lightweight real-time network, FPLNet, which achieves a state-of-the-art accuracy-efficiency trade-off. A detailed ablation study is presented to provide deep insight into the network response under various design choices.
References
- [1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- [2] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
- [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
- [5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- [8] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
- [9] Y. Hong, H. Pan, W. Sun, and Y. Jia, “Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes,” arXiv preprint arXiv:2101.06085, 2021.
- [10] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3051–3068, 2021.
- [11] T. Wu, S. Tang, R. Zhang, J. Cao, and Y. Zhang, “Cgnet: A light-weight context guided network for semantic segmentation,” IEEE Transactions on Image Processing, vol. 30, pp. 1169–1179, 2020.
- [12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
- [13] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
- [14] N. Atif, M. Bhuyan, and S. Ahamed, “A review on semantic segmentation from a modern perspective,” in 2019 international conference on electrical, electronics and computer engineering (UPCON). IEEE, 2019, pp. 1–6.
- [15] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
- [16] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in Proceedings of the european conference on computer vision (ECCV), 2018, pp. 552–568.
- [17] Y. Nirkin, L. Wolf, and T. Hassner, “Hyperseg: Patch-wise hypernetwork for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
- [18] G. Li, I. Yun, J. Kim, and J. Kim, “Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation,” arXiv preprint arXiv:1907.11357, 2019.
- [19] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach, “Contextnet: Exploring context and detail for semantic segmentation in real-time,” arXiv preprint arXiv:1805.04554, 2018.
- [20] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Efficient dense modules of asymmetric convolution for real-time semantic segmentation,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
- [21] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 325–341.
- [22] S. Kumaar, Y. Lyu, F. Nex, and M. Y. Yang, “Cabinet: efficient context aggregation network for low-latency semantic segmentation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 13 517–13 524.
- [23] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [24] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
- [25] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
- [26] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.
- [27] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
- [28] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
- [29] N. Atif, H. Balaji, S. Mazhar, S. R. Ahamad, and M. Bhuyan, “Semantic masking: A novel technique to mitigate the class-imbalance problem in real-time semantic segmentation,” in 2022 National Conference on Communications (NCC). IEEE, 2022, pp. 407–412.
- [30] G. Li, L. Li, and J. Zhang, “Biattnnet: bilateral attention for improving real-time semantic segmentation,” IEEE Signal Processing Letters, vol. 29, pp. 46–50, 2021.
- [31] R. Gao, “Rethink dilated convolution for real-time semantic segmentation,” arXiv preprint arXiv:2111.09957, 2021.
- [32] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
- [33] X. Li, L. Zhang, A. You, M. Yang, K. Yang, and Y. Tong, “Global aggregation then local distribution in fully convolutional networks,” arXiv preprint arXiv:1909.07229, 2019.
- [34] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets. Springer, 1990, pp. 286–297.
- [35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- [36] L. Rosas-Arias, G. Benitez-Garcia, J. Portillo-Portillo, J. Olivares-Mercado, G. Sanchez-Perez, and K. Yanai, “Fassd-net: Fast and accurate real-time semantic segmentation for embedded systems,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14 349–14 360, 2021.
- [37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
- [38] G. Li, L. Li, and J. Zhang, “Hierarchical semantic broadcasting network for real-time semantic segmentation,” IEEE Signal Processing Letters, vol. 29, pp. 309–313, 2021.
- [39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
- [40] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324.
- [41] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, “In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 607–12 616.
- [42] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
- [43] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Icml, 2010.