Real-time Semantic Segmentation via Spatial-detail Guided Context Propagation
Abstract
Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision computing tasks, e.g. semantic segmentation, are usually computationally expensive, posing a challenge to computing systems that are resource-constrained but require fast response speed. Therefore, it is valuable to develop accurate and real-time vision processing models that only require limited computational resources. To this end, we propose the Spatial-detail Guided Context Propagation Network (SGCPNet) for achieving real-time semantic segmentation. In SGCPNet, we propose the strategy of spatial-detail guided context propagation. It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. In this way, the need for maintaining high-resolution features along the network is freed, therefore largely improving the model efficiency. On the other hand, due to the effective reconstruction of spatial details, the segmentation accuracy can still be preserved. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet model. On the Cityscapes dataset, for example, our SGCPNet achieves mIoU segmentation accuracy, while its speed reaches FPS on images on a GeForce GTX 1080 Ti GPU card. In addition, SGCPNet is very lightweight and only contains 0.61 M parameters. The code will be released at https://github.com/zhouyuan888888/SGCPNet.
Index Terms:
Semantic segmentation, deep learning, contextual information, accuracy, speed.

I Introduction
Vision computing plays an increasingly important role in many real-world applications, e.g. human action recognition [1, 2], object detection [3, 4, 5] and object tracking [6, 7]. A common characteristic of these applications is that a large amount of data is produced by vision sensors while a fast processing speed is required, so that a system architecture based solely on cloud computing is not always efficient [8]. In this context, edge computing has become a good complement, as this paradigm aims to drive computation from the cloud to network edges, moving the processing as close as possible to the data generation sources.


Recently, the success of deep learning has simultaneously brought opportunities and challenges to edge computing [9, 10]. On one hand, deep learning models are able to significantly improve the system's performance. On the other hand, however, these models usually have large computation and storage costs, posing challenges to the system's implementation speed and power consumption [11, 12]. This challenge typically exists in many vision-based computation tasks, especially for the semantic segmentation task that aims to assign a semantic label to each pixel, as shown in Fig.1. Being able to provide pixel-level semantic information, semantic segmentation can be the cornerstone of many vision-involved applications, such as autonomous driving [13] and medical assistance systems [14].
Despite the success of the deep-learning-based semantic segmentation models proposed recently [15, 16, 17], it is difficult to directly apply them in resource-constrained scenarios due to their large model size or high complexity. Model compression, such as network pruning and quantization [18, 19, 20], knowledge distillation [21], and hybrid schemes [22], is a feasible way of alleviating these issues [9]. On the other hand, it is more attractive to directly design a lightweight semantic segmentation model that simultaneously satisfies the following demands: being fast and accurate, while requiring low hardware consumption.
We first briefly analyze why current segmentation models are computationally expensive. Aiming at high accuracy, on one hand, current methods [23, 24, 25, 16, 15] mainly adopt the strategy of keeping high-resolution feature maps along the network pipeline, so as to effectively preserve spatial detail information, as presented in Fig.3 (a). On the other hand, the increased size of the feature maps reduces the relative receptive field of the convolution kernels, and thus these methods use dilated convolution [26] to aggregate as much context information as possible. Nevertheless, this roadmap has a limitation: the high-resolution feature maps kept in the pipeline lead to expensive computational costs. We conduct an experiment on the well-known ResNet [27] to empirically verify the influence of feature resolution on Floating Point Operations (FLOPs) and runtime. Of note, for a fair comparison, the last fully-connected layer is removed. The results in Table.I indicate that when the feature maps are maintained at high resolution, the FLOPs increase substantially and the speed drops sharply.
Based on the above observation, we conclude that maintaining high-resolution feature maps is the bottleneck in designing a fast and accurate segmentation model. On one hand, it is possible to achieve low computational complexity by downsampling the feature maps quickly along the network. The rationale comes from the fact that semantic contexts are region-level information, and pixels in a neighborhood usually share a common semantic label. Therefore, it is unnecessary to keep aggregating contexts on high-resolution feature maps. On the other hand, however, directly using low-resolution global contexts seems to conflict with the demand for high-quality segmentation results, as the spatial information is severely lost in the global context features. To resolve this dilemma, in this paper, we aim to free the model from maintaining high-resolution feature maps by efficiently reconstructing the lost spatial information of the global contexts, so that the demands of accuracy and speed can be satisfied at the same time.
TABLE I: FLOPs and runtime of ResNet-101 (Res-101) and ResNet-50 (Res-50), with the final fully-connected layer removed, under different feature-map resolutions.
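The influence of feature-map resolution on runtime can be reproduced with a short script like the minimal sketch below. It is our own illustration rather than the original measurement code: it times a ResNet-50 trunk with the final fully-connected layer removed, and the listed input resolutions, warm-up passes and repeat count are arbitrary assumptions.

```python
import time

import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 trunk with the final fully-connected layer removed,
# mirroring the setup of the resolution experiment above.
trunk = nn.Sequential(*list(resnet50().children())[:-1]).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
trunk.to(device)


@torch.no_grad()
def avg_runtime_ms(model, height, width, repeats=20):
    x = torch.randn(1, 3, height, width, device=device)
    for _ in range(5):                      # warm-up passes
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / repeats * 1000.0


# Hypothetical input resolutions; runtime grows quickly as resolution increases.
for h, w in [(256, 512), (512, 1024), (1024, 2048)]:
    print(f"{h}x{w}: {avg_runtime_ms(trunk, h, w):.1f} ms")
```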
To this end, we propose the strategy of spatial-detail guided context propagation, as shown in Fig.3 (b). It uses the spatial details of the shallow layers to guide the propagation of the global contexts to neighboring positions, and therefore helps to reconstruct the lost spatial information of the global contexts. As described in Fig.3 (b), in our method, the resolution of the feature map is reduced gradually along the network pipeline, until the final global context information is obtained. Then, the proposed spatial-detail guided context propagation strategy is applied. We hope that the context propagation meets two requirements. First, during the context propagation, the context information should be consistent with its neighboring spatial details, thus guaranteeing the effectiveness of the propagation. Second, after the propagation, the original global context information should be recoverable from the propagated context map as accurately as possible. In this way, the propagation's accuracy can be ensured. Based on the above roadmap, we design our context propagation in a bi-directional way, i.e. 1) propagating the context information of the current position to its neighborhood, and 2) gathering the contexts from the neighboring pixels back to the current position. We realize them by building a lightweight bi-directional network structure, where we introduce a top-down path and a bottom-up path respectively. Specifically, in the top-down path, the global contexts are gradually propagated to neighboring positions under the guidance of spatial details. In the bottom-up path, by contrast, the context information of the local region is progressively gathered using the pooling operation, and thereby the global contexts can be re-extracted. The re-extracted context information is supposed to be the same as the global contexts before propagation.
Based on the proposed strategy, we build a lightweight network for efficient and accurate semantic segmentation, named Spatial-detail Guided Context Propagation Network (SGCPNet), as presented in Fig.5. Our SGCPNet has low storage and computation costs: it only contains parameters, and costs only FLOPs in segmenting an image. As a result, our SGCPNet has a fast implementation speed. For example, it can realize FPS on inputs or FPS on inputs, based on a single GTX 1080Ti GPU card. Even with an Intel Xeon Silver 4210 CPU, segmenting an RGB image only needs runtime. In addition to the low costs and fast speed, the accuracy of our SGCPNet is still kept at a high level. For example, on the public semantic segmentation datasets Cityscapes [28] and CamVid [29], SGCPNet obtains promising segmentation performance, such as mIoU on the Cityscapes test set and mIoU on the CamVid test set. Additionally, in Fig.2, we provide a chart that compares our SGCPNet with recent related methods in terms of model size (x-axis), segmentation accuracy (y-axis) and execution speed (mark area). We can see that our SGCPNet achieves a good balance between these performance indices, showing its advantage in resource-constrained semantic segmentation.
The contributions of this paper are summarized as the following aspects:
• We propose a new spatial-detail guided context propagation strategy. It effectively reconstructs the lost spatial information in the global contexts, so that the need for maintaining high-resolution feature maps throughout the network pipeline can be freed.
• We construct the Spatial-detail Guided Context Propagation Network (SGCPNet), which realizes the spatial-detail guided context propagation with high efficiency via a bi-directional network structure.
• Last but not least, our SGCPNet presents competitive performance in terms of the balance between segmentation accuracy and efficiency.
The rest of this paper is organized as follows. We first review the related works in Section II. Then, in Section III and Section IV, we respectively introduce our proposed method and the experimental results in detail. Finally, the paper is concluded in Section V.

II Related Work
In this section, we first review the relevant approaches that concentrate on boosting segmentation accuracy, and then review the methods aiming at improving segmentation efficiency.
II-A Methods for Boosting Segmentation Accuracy
Due to the significance of obtaining sufficient context and spatial information for the semantic segmentation task, Yu et al. [26] extend the conventional convolution operation by proposing the dilated convolution. On one hand, the dilated convolution enlarges the receptive field of convolution by inserting "holes" into the convolution kernel, so that more context information can be aggregated. On the other hand, some downsampling operations (e.g., pooling) for enlarging the receptive field can be avoided, and thereby the spatial details can be maintained within the network pipeline. Based on [26], Chen et al. [16] propose the Atrous Spatial Pyramid Pooling (ASPP) module that equips the dilated convolution with a spatial pyramid structure, thus realizing context aggregation with multiple different receptive fields. Zhao et al. [15] further extend ASPP to the Pyramid Pooling Module (PPM), which additionally considers global context information. This helps the model to understand the visual scene from a more global perspective. Aiming to help the network perceive more context information, Huang et al. [30] propose the Criss-Cross Network (CCNet) that aggregates the contexts lying on the criss-cross path for all pixels, while in [17], Fu et al. propose to simultaneously aggregate global contexts and cross-channel dependencies. Taking advantage of dictionary learning, Zhang et al. [24] propose to build a learnable dictionary to preserve the semantic contexts of the whole training dataset. In contrast, Ma et al. [31] design a Semantics Conformity Module (SCM) to better preserve the spatial details along the network pipeline. He et al. [25] propose to use global-guided local affinity to guide the aggregation of pyramid context information, making the deep model more robust to the diversity of object sizes and shapes. Recently, Zhang et al. [32] propose to further enhance the segmentation results by explicitly considering the guidance of edges and salient objects.
Discussion. Aiming to simultaneously obtain sufficient context information and spatial detail information, the above methods mainly choose to maintain spatial details along the network pipeline. Although these methods can provide accurate segmentation results, their computation is generally expensive and their execution speed tends to be far from the demands of real-time processing. For example, both DeepLab and PSPNet run far below real-time frame rates on high-resolution images because of their large FLOPs. Therefore, they are less suitable for applications that are resource-constrained but require fast segmentation.

II-B Methods for Improving Segmentation Efficiency
In addition to segmentation accuracy, segmentation efficiency is also vital to many real-world applications. Recently, great efforts have been devoted to building lightweight and fast semantic segmentation models. Badrinarayanan et al. [33] propose an encoder-decoder model called SegNet. In the encoder, the features are gradually pooled to a low resolution, and the corresponding pooling indices are saved. In the decoder, upsampling is performed by reusing the recorded pooling indices, which avoids learning how to upsample the low-resolution feature maps and therefore substantially improves segmentation speed. Zhao et al. [34] propose ICNet, which adopts multi-resolution branches with different network depths. Specifically, the deep branches are used to extract semantic context information from the low-resolution inputs, while the shallow branches concentrate on capturing spatial details from the high-resolution inputs. In this way, both the context information and the spatial details can be obtained, while the computation costs are reduced as much as possible. Furthermore, Yu et al. [35] and Poudel et al. [36] propose to separately construct a spatial path and a context path, which individually learn spatial details and context information. Following [34], in [35] and [36], the spatial path is kept shallow, while the context path is designed to be deep. In [37], Yu et al. augment [35] by further introducing a guided aggregation layer that realizes a better fusion between spatial details and context information. Differently, Li et al. [38] propose a feature reuse strategy, with the goal of fully exploiting the information contained in the previous layers. Li et al. [39] propose SFNet, which aims to effectively align low-resolution and high-resolution feature maps while maintaining high efficiency. Aiming to boost segmentation accuracy and efficiency simultaneously, Lo et al. [40] design a network with an asymmetric convolution structure and dense connections. Notably, by designing a small decoder and using an early downsampling strategy, Paszke et al. [41] propose the lightweight ENet, which contains very few network parameters.
Discussion. The current methods for improving segmentation efficiency mainly aim to maintain spatial details and aggregate context information individually, e.g. [34, 35, 36, 37]. We have some observations on this decoupling strategy. First, for a typical CNN-based framework, contexts and spatial details can be obtained from the deep layers and shallow layers simultaneously. Therefore, it is possible to save some of the computation spent on the decoupled learning process. Second, the fusion part also needs a careful design, especially in situations with limited computational resources. Based on these observations, in this paper, we propose to learn spatial details and contexts from the network at the same time. Besides, we further propose to explicitly consider the relatedness between the context and spatial detail information: we use the spatial details to guide the aggregation of context information. Of note, our SGCPNet clearly differs from [33], as we realize the spatial-detail guidance in a learnable manner, rather than by reusing saved pooling indices.
III Proposed Method
In this section, we introduce our proposed method in detail. We first elaborate the strategy of spatial-detail guided context propagation, and then give the details of our SGCPNet.
III-A Spatial-detail Guided Context Propagation Strategy
The motivation of our spatial-detail guided context propagation strategy is simple and straightforward. We hope the global contexts can be propagated to a higher-resolution grid and be consistent with the low-level spatial details. To this end, the basic operations of our spatial-detail guided context propagation are illustrated in Fig.4. The context features $\mathbf{C}$, obtained from relatively deep layers, are first upsampled to a higher resolution by the nearest-neighbor interpolation operation Upsample. This can be seen as a naive context propagation without any spatial-detail guidance. Then, we use the spatial details $\mathbf{S}$ of relatively shallower layers to refine the naively propagated contexts through the exchange function Exchange, which facilitates the interaction between $\mathbf{C}$ and $\mathbf{S}$ and therefore reconstructs the lost spatial information of $\mathbf{C}$. The whole process can be described as Eq.1,
$\hat{\mathbf{C}} = \mathrm{Exchange}\big(\mathrm{Upsample}(\mathbf{C}),\, \mathrm{Conv}(\mathbf{S})\big), \qquad (1)$
where $\hat{\mathbf{C}}$ indicates the desired contexts propagated to a higher resolution. Specifically, on one hand, as indicated in Eq.1, we also adopt the convolution Conv to further refine the spatial detail information of $\mathbf{S}$, because the features of shallow layers tend to contain much noise. Considering model efficiency, Conv is implemented by separable convolution [42] in our work. On the other hand, the exchange function Exchange can be realized in different forms. For example, we can simply fuse the two inputs with a direct summation, or learn a pixel-wise attention mechanism. However, aiming at a good balance between model effectiveness and efficiency, the exchange function Exchange is built as a simple linear combination,
$\mathrm{Exchange}(\mathbf{x}_1, \mathbf{x}_2) = \alpha\,\mathbf{x}_1 + \beta\,\mathbf{x}_2, \qquad (2)$
where $\alpha$ and $\beta$ are learnable scalar weights for the inputs $\mathbf{x}_1$ (e.g. the contexts) and $\mathbf{x}_2$ (e.g. the spatial details). The learned function refines the incorrect places in the contexts by imposing a large $\beta$, thereby automatically giving more weight to the spatial detail information. In the right part of Fig.4, we provide a toy example to explain this process. Suppose the initially upsampled contexts are locally inaccurate. After applying the exchange function, these inaccurate places can be updated under the guidance of spatial detail information. For example, a few places inferred as "person" are corrected into "car", and several places inferred as "car" are corrected into "road". In this way, the upsampled contexts become more consistent with the spatial details of the original image, and thus the lost spatial information can be recovered, as exemplified in Fig.6.
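To make Eqs. 1 and 2 concrete, the following is a minimal PyTorch sketch of one propagation step. Only its overall structure (nearest-neighbor upsampling, separable-convolution refinement of the details, and the scalar-weighted exchange of Eq. 2) follows the description above; the channel sizes, module names and the initialization of the scalars are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableConv(nn.Module):
    """Depthwise + pointwise convolution, used here as the lightweight Conv(.)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))


class DetailGuidedPropagation(nn.Module):
    """One propagation step: C_hat = Exchange(Upsample(C), Conv(S)), Eqs. 1-2."""
    def __init__(self, channels):
        super().__init__()
        self.refine = SeparableConv(channels, channels)  # Conv(.) applied to the details S
        self.alpha = nn.Parameter(torch.ones(1))         # weight for the upsampled contexts
        self.beta = nn.Parameter(torch.ones(1))          # weight for the refined details

    def forward(self, contexts, details):
        # Naive propagation: nearest-neighbor upsampling of the low-resolution contexts.
        up = F.interpolate(contexts, size=details.shape[-2:], mode="nearest")
        # Exchange(., .) realized as a learnable linear combination (Eq. 2).
        return self.alpha * up + self.beta * self.refine(details)


# Toy usage: contexts at 1/32 resolution guided by details at 1/16 resolution.
ctx = torch.randn(1, 64, 8, 16)
det = torch.randn(1, 64, 16, 32)
out = DetailGuidedPropagation(64)(ctx, det)   # shape: (1, 64, 16, 32)
```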
As mentioned before, we hope that our context propagation satisfies two requirements, i.e. 1) during the context propagation, the context information should be consistent with the spatial details contained in the neighboring pixels; 2) after the context propagation, the original global contexts should be recoverable as accurately as possible. To satisfy these two requirements, the spatial-detail guided context propagation strategy is realized by building bi-directional paths, which we respectively call the top-down path and the bottom-up path. As presented in Fig.3 (b), along the top-down path, the global contexts are propagated back to the shallow layers and constantly interact with the spatial detail information, leading to the reconstruction of the spatial details lost in the contexts. The bottom-up path, in turn, aims at re-extracting the global contexts, which are supposed to be the same as the contexts before propagation. The context re-extraction can be realized by using max pooling or average pooling combined with convolution operations. Particularly, in this paper, we adopt max pooling to realize context re-extraction, as it is advantageous in selecting more discriminative responses; the ablation results presented in Table.VI also indicate that max pooling yields better performance. After the alternating usage of the top-down and bottom-up paths, more accurate high-resolution contexts are produced, which contain sufficient context information and spatial details simultaneously. In this way, the need for continuously maintaining high-resolution feature maps within the network can be freed, thus largely improving segmentation efficiency, while the segmentation accuracy is still maintained as much as possible.
III-B Architecture of SGCPNet
Based on the proposed spatial-detail guided context propagation strategy, we construct the Spatial-detail Guided Context Propagation Network (SGCPNet). As shown in Fig.5, our SGCPNet is a variant of the encoder-decoder framework. For the encoder, we choose the lightweight MobileNet [43] as the backbone network. Specifically, for high efficiency, we do not maintain spatial details in the network pipeline. Instead, the feature map is gradually downsampled to 1/32 of the original resolution. This substantially reduces computational expenses and therefore endows the model with a faster segmentation speed. Nevertheless, due to the lost spatial detail information, this may lead to a substantial decrease in segmentation accuracy. To avoid this issue, we design a lightweight Spatial-detail Guided Context Propagation (SGCP) module as the decoder, as shown in Fig.5 (b). It is able to effectively recover the lost spatial information of the contexts, so that segmentation accuracy can be preserved as much as possible. To enable effective context propagation at a low computational cost, the SGCP module is designed as a bi-directional structure. At last, a convolutional classifier is applied to the outputs of the SGCP module and produces the final segmentation results. In Section III-C, we provide the details of our SGCP module.

III-C Details of SGCP module
As shown in Fig.5 (b), three convolution operations are first applied to the features produced by the last three backbone layers, so that the features are encoded into higher-dimensional representations that contain richer feature descriptions. We term the three resulting layers Layer-3†, Layer-4† and Layer-5†, respectively. Then, to aggregate more global context information, we successively apply two max-pooling operations to the feature map of Layer-5†. The two max-pooling operations share the same kernel size, and their strides are set to 2. In this way, we obtain the final context map, whose resolution is condensed to 1/128 of the input. The final context map contains more global semantic contexts, while almost all of its spatial information is lost.
To reconstruct the lost spatial detail information in the aggregated global contexts, the proposed spatial-detail guided context propagation strategy is used. To realize effective context propagation, we design a bi-directional network structure, where we introduce a top-down path and a bottom-up path respectively. Specifically, the top-down and bottom-up paths have similar network structures, and both consist of convolution operations and the scalar-weighted fusion. Nevertheless, to realize different functions for the two paths, the nearest-neighbor interpolation operation is employed in the top-down path, while the max-pooling operation is further adopted in the bottom-up path. Following [38, 37], we use the separable convolution [42] to construct the convolution layers of our SGCP module, due to its modest computation and effectiveness in feature extraction. The separable convolution is composed of a point-wise convolution and a depth-wise convolution. Specifically, the depth-wise convolution individually extracts the features contained in each channel of the feature map, while the point-wise convolution linearly combines the features extracted by the depth-wise convolution. As presented in Fig.5, to aggregate the information coming from neighboring layers, the scalar-weighted fusion is employed in the top-down and bottom-up paths,
$\mathbf{F}^{out} = \begin{cases} \alpha\,\mathbf{F}_i + \beta\,\mathbf{F}_{i+1}, & \text{without skip connection},\\ \alpha\,\mathbf{F}_i + \beta\,\mathbf{F}_{i+1} + \gamma\,\mathbf{F}_i^{skip}, & \text{with skip connection}, \end{cases} \qquad (3)$
where $\mathbf{F}_i$ and $\mathbf{F}_{i+1}$ respectively indicate the feature maps coming from the $i$-th and $(i+1)$-th layers. Moreover, $\mathbf{F}_i$ contains more spatial detail information, and $\mathbf{F}_{i+1}$ contains more context information. As Eq.3 indicates, the formula is conditioned on whether skip connections exist. When skip connections exist, the features $\mathbf{F}_i^{skip}$ coming from the corresponding layer of the earlier path are further considered. Additionally, the scalars $\alpha$, $\beta$ and $\gamma$ are learnable weights for controlling the balance between $\mathbf{F}_i$, $\mathbf{F}_{i+1}$ and $\mathbf{F}_i^{skip}$. According to Eq.3, the spatial-detail guided context propagation can be re-interpreted as follows: the contexts contained in $\mathbf{F}_{i+1}$ are propagated to the neighborhood by leveraging the guidance of the spatial details involved in $\mathbf{F}_i$. Therefore, in the propagated context map $\mathbf{F}^{out}$, the spatial detail information is reconstructed to some extent.
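A compact sketch of the scalar-weighted fusion in Eq.3 is given below. It is our illustration, not the released implementation, and it assumes that the two (or three) input feature maps have already been resized to a common resolution by the interpolation or pooling operations of the corresponding path.

```python
import torch
import torch.nn as nn


class ScalarWeightedFusion(nn.Module):
    """Eq. 3: alpha*F_i + beta*F_{i+1}, plus gamma*F_skip when a skip connection exists."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # weight for the detail-rich map F_i
        self.beta = nn.Parameter(torch.ones(1))    # weight for the context-rich map F_{i+1}
        self.gamma = nn.Parameter(torch.ones(1))   # weight for the optional skip feature

    def forward(self, f_i, f_next, f_skip=None):
        out = self.alpha * f_i + self.beta * f_next
        if f_skip is not None:                     # skip connection from an earlier path
            out = out + self.gamma * f_skip
        return out


# Toy usage: all inputs are assumed to share the same spatial size and channel count.
fuse = ScalarWeightedFusion()
f_i, f_next = torch.randn(1, 64, 32, 64), torch.randn(1, 64, 32, 64)
fused = fuse(f_i, f_next, f_skip=torch.randn(1, 64, 32, 64))
```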
As can be seen from Fig.5 (b), in our SGCP module, we first build a top-down path, with the goal of forcing continuous interaction between the global contexts and the spatial details. In this way, the global contexts are gradually propagated to the neighborhood under the guidance of spatial details, until the preliminarily reconstructed context map, with a resolution of 1/8 of the input image, is obtained. Then, we build a bottom-up path, aiming to re-extract the global contexts from the propagated context map. Specifically, we hope that the similarity between the original global context map and the re-extracted one is as high as possible. Therefore, we build the skip connections presented in Fig.5 (b), so that the original context and spatial detail information can be inserted into the corresponding layers of the bottom-up path. At last, a second top-down path is constructed, which produces the final propagated context map by leveraging the information contained in the previous paths. Of note, the final pixel-wise segmentation results are produced by the convolutional classifier.
As indicated in Table.VI, our SGCP module is very lightweight and only contains parameters. Nevertheless, it still yields satisfying performance. For example, we visualize the global context map processed by our SGCP module in Fig.6. As can be seen from Fig.6 (b), on one hand, the spatial information contains many details, e.g. boundaries and textures, but it is not aware of the scene semantics. On the other hand, the global context information is aware of the semantic regions at a large scale, but its spatial details are severely lost. As presented in Fig.6 (c), our SGCP module is able to effectively reconstruct the lost spatial information of the contexts, and thus the propagated context maps can clearly reflect the scene semantics and the spatial details at the same time. Additionally, the ablation study in Section IV-D also statistically validates the usefulness of our SGCP module. For example, the backbone network alone only achieves mIoU on the Cityscapes validation set after being trained with training epochs, but based on the same training policy, the segmentation accuracy can be further improved to mIoU by adding our SGCP module, while only bringing about parameters.
TABLE II: Comparison of model efficiency (Params, and FLOPs/FPS at multiple input resolutions). Large-size models: DeepLab [16], PSPNet [15], DANet [17], CCNet [30]. Medium-size models: SegNet [33], SQ [44], SFNet-Res18/DF2/DF1 [39], FRRN [45], FCN-8S [46], BiSeNetV2-L/S [37], TwoColumn [47], BiSeNetV1-L/S [35], ICNet [34], DFANet-A/B [38]. Small-size models: ContextNet [36], EDANet [40], ENet [41], and our SGCPNet.
IV Experiments
In this section, we first introduce the implementation details of our experiments, and then compare our SPCNet with the relevant state-of-the-art approaches. At last, we conduct the ablation study to investigate the influence of each component in our approach.
IV-A Implementation Details
All our code is built on PyTorch (https://pytorch.org). Following the previous works [25, 16], we employ the "poly" learning rate policy, $lr = lr_{base} \times (1 - \frac{iter}{iter_{max}})^{power}$, where $lr_{base}$ denotes the base learning rate and $power$ controls the decay. In addition, following [32, 31], data augmentation policies such as random flipping and random scaling of the inputs are adopted to alleviate over-fitting. During the training phase, we choose the Stochastic Gradient Descent (SGD) optimizer with momentum and weight decay. We validate our SGCPNet on two public semantic segmentation datasets, CamVid [29] and Cityscapes [28], whose training details are slightly different. Specifically, the crop size, batch size and number of training epochs are set separately for the two datasets, since CamVid [29] has relatively fewer samples and a lower image resolution, whereas Cityscapes [28] contains many high-resolution images; due to our limited GPU resources, the batch size on Cityscapes is set to 36. Furthermore, in Table.II, Table.III and Table.IV, the execution speed of our SGCPNet is tested on a single GTX 1080Ti GPU card, while in Table.V it is obtained with an Intel Xeon Silver 4210 CPU.
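For reference, the "poly" schedule described above can be realized with a standard PyTorch LambdaLR scheduler, as in the sketch below; the base learning rate, power value, iteration budget, momentum and weight decay shown here are placeholder values, not the ones used in the paper.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 19, 1)              # stand-in module instead of the full SGCPNet
base_lr, power, max_iter = 0.01, 0.9, 100000   # placeholder hyperparameters

optimizer = SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=5e-4)
# "poly" policy: lr = base_lr * (1 - iter / max_iter) ** power
scheduler = LambdaLR(optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** power)

for it in range(max_iter):
    # ... forward pass, loss computation and loss.backward() would go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```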
TABLE III: Comparison of accuracy (mIoU) and efficiency (Params, input size, FLOPs, FPS) on the Cityscapes dataset. Large-size models: DeepLab [16], PSPNet [15], DANet [17], CCNet [30]. Medium-size models: SegNet [33], SQ [44], SFNet-Res18/DF2/DF1 [39], FRRN [45], FCN-8S [46], BiSeNetV2-L/S [37], TwoColumn [47], BiSeNetV1-L/S [35], ICNet [34], DFANet-A/B [38]. Small-size models: ContextNet [36], EDANet [40], ENet [41], SGCPNet1 and SGCPNet2.
IV-B Comparison by Considering Efficiency Only
In this section, we compare our SGCPNet with its counterparts in terms of model efficiency, using three efficiency-related evaluation metrics, i.e., the number of model parameters (Params), Floating Point Operations (FLOPs) and Frames Per Second (FPS).
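Params and FPS can be measured with a few lines of PyTorch, as in the hedged sketch below; the segmentation model and input resolution are placeholders standing in for SGCPNet, and FLOPs counting would additionally require an external profiler, so it is omitted here.

```python
import time

import torch
from torchvision.models.segmentation import lraspp_mobilenet_v3_large

# Placeholder lightweight segmentation model standing in for SGCPNet.
model = lraspp_mobilenet_v3_large(num_classes=19).eval()
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.2f} M")


@torch.no_grad()
def measure_fps(net, height, width, repeats=50):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net.to(device)
    x = torch.randn(1, 3, height, width, device=device)
    for _ in range(5):                      # warm-up passes
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return repeats / (time.time() - start)


print(f"FPS at 1024x2048: {measure_fps(model, 1024, 2048):.1f}")
```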
IV-B1 Model Categorization
For better comparison, the compared models are divided into three categories, i.e., the models of (i) large size, (ii) medium size and (iii) small size, according to the following rules:
• The large-size model indicates the model's Params and FLOPs should be more than and simultaneously;
• The medium-size model indicates the model's Params are between and , or the FLOPs are between and ;
• In the small-size model, the model's Params and FLOPs should be less than and in the meantime.
According to the above categorization, our SGCPNet clearly belongs to the small-size category, as it just contains parameters and its FLOPs are obviously below the corresponding threshold; for example, even when segmenting a high-resolution image, its computational cost is only FLOPs.
IV-B2 Performance Analysis
In Table.II, we compare our SGCPNet with the related segmentation models in terms of Params, FLOPs and FPS under the different image resolutions listed in the table. Specifically, for a fair comparison, we test our model's efficiency on all the image resolutions in the table.
From Table.II, we can make the following observations. Generally, the large-size models concentrate on improving segmentation accuracy, and thus tend to involve expensive computation. For example, DeepLab [16] contains parameters in total; when segmenting an image, it costs FLOPs, and its segmentation speed is only FPS. PSPNet [15] contains parameters and costs FLOPs when processing an input image, running at FPS. Although CCNet [30] and DANet [17] have far fewer parameters, their FLOPs remain at a high level, and their execution speeds are far from the demands of real-time applications.
The storage and computation burdens of the medium-size models are much smaller than those of the large-size models. Some of them achieve good efficiency, e.g., SFNet-DF1 [39] and DFANet [38] realize segmentation speeds near or over FPS while containing fewer than 10 M parameters.
As can be seen from Table.II, the small-size models all have very small storage costs, and thus they are suitable for resource-restricted environments. Our SGCPNet has the second smallest model size among all the listed models; it is tens or even hundreds of times smaller than the medium- or large-size competitors. Furthermore, SGCPNet has the smallest FLOPs at almost all the image resolutions. Accordingly, our model's segmentation speed is faster than that of the compared models by a large margin.
IV-C Comparison by Considering the Balance between Accuracy and Efficiency
For the task of real-time semantic segmentation, the balance between accuracy and efficiency is a vital criterion in model evaluation. Therefore, in this part, we compare our SGCPNet with the counterparts in terms of the balance between accuracy and efficiency on the Cityscapes [28] and CamVid [29] datasets.
IV-C1 Cityscapes
The Cityscapes [28] dataset contains 25,000 road scene images. Specifically, 5,000 images are finely annotated, and the remaining 20,000 images are coarsely annotated. In our experiments, we only employ the finely annotated subset, of which 2,975, 500 and 1,525 images are used for training, validation and testing, respectively. The finely annotated subset involves 30 semantic categories; following the previous works [48, 15], we adopt 19 categories in model evaluation. To clearly show our model's efficiency, we evaluate our SGCPNet on this dataset at two different image resolutions, and according to the input resolution, the model is named SGCPNet1 and SGCPNet2, respectively. As Table.III indicates, SGCPNet1 is evaluated at the same image resolution as [34, 44, 39]; the high-resolution input poses more of a challenge for real-time semantic segmentation. SGCPNet2 is trained and tested on images with a smaller resolution. The results of our methods and their competitors are summarized in Table.III, and some examples of our segmentation results are presented in Fig.7.
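If one reproduces this setting with torchvision, the finely annotated splits can be loaded roughly as follows; this is a sketch that assumes the official Cityscapes archives have been extracted to ./cityscapes, and it omits the crop and scale augmentations described in Section IV-A.

```python
from torchvision import datasets, transforms

# Finely annotated Cityscapes splits; 'semantic' targets give the label maps, from which
# the 19 evaluation classes are selected afterwards via the official label mapping.
to_tensor = transforms.ToTensor()
train_set = datasets.Cityscapes(root="./cityscapes", split="train", mode="fine",
                                target_type="semantic", transform=to_tensor)
val_set = datasets.Cityscapes(root="./cityscapes", split="val", mode="fine",
                              target_type="semantic", transform=to_tensor)
print(len(train_set), len(val_set))   # 2975 and 500 images
```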
Performance Analysis. Compared with the small-size models, the two versions of our SGCPNet achieve a better trade-off between efficiency and accuracy than ContextNet [36] and EDANet [40]. Although our proposed model has slightly more parameters than ENet [41], SGCPNet1 and SGCPNet2 have much higher segmentation accuracy. Additionally, our method consumes far fewer FLOPs: even when segmenting a higher-resolution image, our SGCPNet2 still needs far fewer FLOPs than ENet costs on processing a smaller image. Accordingly, our SGCPNet2 has a much faster execution speed.
Compared with the medium-size models, our method also presents a promising trade-off between accuracy and efficiency. On one hand, our method outperforms some of these models in both efficiency and accuracy, such as SegNet [33], SQ [44] and FCN-8S [46]. On the other hand, the accuracy of SGCPNet1 and SGCPNet2 is at the same level as models such as FRRN [45], TwoColumn [47], ICNet [34], BiSeNetV2-S [37], BiSeNetV1-S [35], DFANet-A and DFANet-B [38], while SGCPNet1 and SGCPNet2 have much fewer FLOPs and a higher FPS in most cases. As for the models with high segmentation accuracy, such as SFNet [39], BiSeNetV1-L [35] and BiSeNetV2-L [37], their FLOPs are tens or even one hundred times larger than ours. These comparisons show that our model achieves a better balance between accuracy and efficiency and is thus more appropriate for resource-constrained semantic segmentation.
In this part, we also compare our SGCPNet with some large-size models, aiming to show the importance of further boosting model efficiency. As shown in Table.III, the large-size models are very advantageous in segmentation accuracy, e.g., their accuracy is around 10% higher than that of our SGCPNet1 and SGCPNet2. However, these methods are hardly applicable to real-time systems, due to their expensive computation and high latency. For example, when segmenting an image, DeepLab costs FLOPs, and its speed is only FPS. In contrast, our SGCPNet has obviously higher efficiency, e.g., the FLOPs of our SGCPNet1 are only around 1/250 of those of DANet [17] and CCNet [30].


TABLE IV: Comparison of accuracy (mIoU) and efficiency (Params, input size, FPS) on the CamVid dataset. Medium-size models: BiSeNetV2-L/S [37], BiSeNetV1-L/S [35], SegNet [33], ICNet [34], SFNet-Res18/DF2 [39], DFANet-A/B [38]. Small-size models: EDANet [40], ENet [41], and our SGCPNet.
TABLE V: Runtime on an Intel Xeon Silver 4210 CPU and mIoU, evaluated on the Cityscapes validation and test sets, for MobileNet-V2-large/small and MobileNet-V3-large/small [43], CCNet [30], DANet [17] and our SGCPNet. Among the recovered entries, MobileNet-V3-small requires 327 ms per image, while our SGCPNet needs 665 ms and 151 ms at the two tested settings.
IV-C2 CamVid
The CamVid dataset [29] is collected from high-resolution video sequences of road scenes. It contains 32 semantic classes and 701 annotated images in total; 367, 101 and 233 images are used for training, validation and testing, respectively. Following previous works, such as [34, 38], we only adopt 11 classes in model evaluation. The comparisons between our method and the recent related works on this dataset are summarized in Table.IV.
Performance Analysis. Compared with the other two small-size models, our SGCPNet achieves a clearly better trade-off between accuracy and efficiency. For example, our SGCPNet is FPS faster than EDANet [40] and FPS faster than ENet [41], while its accuracy is still higher than that of [40, 41] by and , respectively.
TABLE VI: Ablation study on the components of SGCPNet (Cityscapes validation set).
BK | HFR | SW-Sum | TD1 | BU | TD2 | Pooling | Params | Runtime (GPU) | Runtime (CPU) | mIoU
---|---|---|---|---|---|---|---|---|---|---
✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ||||
✓ | ✗ | ✓ | ✓ | ✗ | ✗ | -Max | ||||
✓ | ✗ | ✓ | ✓ | ✓ | ✗ | -Max | ||||
✓ | ✗ | ✓ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✗ | ✗ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✗ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Max | | | |
✓ | ✓ | ✗ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Avg |

Compared with the medium-size models, the accuracy of our SGCPNet still ranks fourth, while it is much smaller and faster than all the medium-size models. For example, the model size of our SGCPNet is twenty times smaller than that of SFNet-Res18 [39], and our SGCPNet is several times faster than the two versions of BiSeNetV2 [37]. As for the remaining seven medium-size models, SGCPNet consistently outperforms them in both accuracy and efficiency. The experimental results on CamVid indicate that our SGCPNet is well suited to resource-constrained situations.
IV-D Ablation Study
In this part, we conduct ablation studies for our method. We first show the performance of our model on the CPU hardware platform, and then investigate the influence of each component of our SGCPNet.
IV-D1 Performance on Different Hardware Platforms
Considering that the speed reported in Table.II, Table.III and Table.IV is evaluated on the GPU platform, we further investigate our model's execution speed on a CPU platform equipped with an Intel Xeon Silver 4210 CPU. As can be seen from Table.V, our method still yields promising performance: segmenting an input image requires only a short runtime, and even at a large input resolution the runtime remains modest. These results show that our model also achieves good efficiency on the CPU platform. Besides, from Table.V, we can observe that our model outperforms another lightweight model, MobileNet-V3-small, in terms of both segmentation accuracy and speed.
IV-D2 Ablation Study on the Components in SGCPNet
In this part, we investigate the contribution of each component of SGCPNet in terms of accuracy, speed and model size. Without loss of generality, we only adopt 200 training epochs in these experiments for convenience. The results are summarized in Table.VI.
From Table.VI, we can see that the backbone network (BK), containing parameters, achieves mIoU on the Cityscapes validation set. By adding the first top-down path (TD1), the accuracy increases to mIoU, at the cost of additional parameters and GPU runtime. After being further equipped with the bottom-up path (BU), the model achieves mIoU; correspondingly, the GPU runtime and the number of parameters increase to and , respectively. When the last top-down path (TD2) is employed, the model performance reaches mIoU, at the cost of parameters and GPU runtime. In addition, we visualize some typical segmentation results in Fig.8. As the figure shows, the backbone network (BK) generally fails to clearly segment object contours, whereas with our top-down and bottom-up paths the segmentation results become increasingly accurate. These experimental results again empirically validate the effectiveness and efficiency of our bi-directional paths.
When we further introduce the high-dimension feature representations (HFR) into the model, the accuracy is further improved to mIoU, at the cost of only parameters and GPU runtime. We also investigate the influence of the scalar-weighted sum (SW-Sum) and of the different types of pooling operations (Pooling) used in the SGCP module. When the scalar-weighted sum is replaced with a conventional sum operation, the accuracy decreases by nearly mIoU, because the input feature maps are treated equally. As can be seen from Table.VI, replacing max pooling with average pooling also leads to a decrease in accuracy. Furthermore, enlarging the kernels of the max pooling degrades the performance as well.
According to the results presented in Table.VI, on one hand, we can conclude that each component in our SGCPNet contributes to the final segmentation accuracy. On the other hand, the increased model size and runtime brought by them (both on GPU and CPU) are small and acceptable.
V Conclusion
In this paper, we design the Spatial-detail Guided Context Propagation Network (SGCPNet) for the real-time semantic segmentation task. SGCPNet frees the need of maintaining high-resolution feature maps in the network pipeline, while still effectively incorporating the context and spatial detail information. Therefore, SGCPNet achieves a state-of-the-art balance between segmentation accuracy and efficiency, which makes it very suitable for resource-constrained systems. Extensive experiments on the CamVid and Cityscapes datasets demonstrate the effectiveness and efficiency of our method. For example, our SGCPNet achieves mIoU on the Cityscapes dataset, with a very small model size and over FPS execution speed on input images. In the future, we plan to extend our SGCPNet to realize real-time semantic segmentation in few-shot scenarios, thereby relieving the intensive labor spent on data collection and annotation and further boosting the practicality of the semantic segmentation technique in real applications.
Acknowledgement
This work was supported by the National Key Research and Development Program under Grant No. 2019YFA0706200, the National Natural Science Foundation of China under Grants No. 62172137, 62072152 and 61725203, and the Fundamental Research Funds for the Central Universities under Grant No. PA2020GDKC0023.
References
- [1] H. Rahmani, A. Mian, and M. Shah, “Learning a deep model for human action recognition from novel viewpoints,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2017.
- [2] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive neural networks for high performance skeleton-based human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1963–1978, 2019.
- [3] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 207–219, 2020.
- [4] Y. Li, Y. Pang, J. Cao, J. Shen, and L. Shao, “Improving single shot object detection with feature scale unmixing,” IEEE Transactions on Image Processing, vol. 30, pp. 2708–2721, 2021.
- [5] J. Xie, Y. Pang, H. Cholakkal, R. Anwer, F. Khan, and L. Shao, “Psc-net: learning part spatial co-occurrence for occluded pedestrian detection,” Science China Information Sciences, vol. 64, no. 2, pp. 1–13, 2021.
- [6] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi, “Action-driven visual object tracking with deep reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2239–2252, 2018.
- [7] L. Zhao, X. Gao, D. Tao, and X. Li, “Learning a tracking and estimation integrated graphical model for human pose tracking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3176–3186, 2015.
- [8] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
- [9] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
- [10] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
- [11] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, “The architectural implications of autonomous driving: Constraints and acceleration,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2018, pp. 751–766.
- [12] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Computer architectures for autonomous driving,” Computer, vol. 50, no. 8, pp. 18–25, 2017.
- [13] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, “Edge computing for autonomous driving: Opportunities and challenges,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1697–1716, 2019.
- [14] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning techniques for medical image segmentation: Achievements and challenges,” Journal of Digital Imaging, vol. 32, no. 4, pp. 582–596, 2019.
- [15] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
- [16] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [17] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
- [18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- [19] S. Bhattacharya and N. D. Lane, “Sparsification and separation of deep learning layers for constrained resource inference on wearables,” in Proceedings of the ACM Conference on Embedded Network Sensor Systems CD-ROM, 2016, pp. 176–189.
- [20] S. Yao, Y. Zhao, A. Zhang, L. Su, and T. Abdelzaher, “Deepiot: Compressing deep neural network structures for sensing systems with a compressor-critic framework,” in Proceedings of the ACM Conference on Embedded Network Sensor Systems, 2017, pp. 1–14.
- [21] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [22] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, and J. Du, “On-demand deep model compression for mobile devices: A usage-driven model selection framework,” in Proceedings of the Annual International Conference on Mobile Systems, Applications, and Services, 2018, pp. 389–400.
- [23] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
- [24] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
- [25] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid context network for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
- [26] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [28] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
- [29] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
- [30] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 603–612.
- [31] S. Ma, Y. Pang, J. Pan, and L. Shao, “Preserving details in semantics-aware context for scene parsing,” Science China Information Sciences, vol. 63, no. 2, pp. 1–14, 2020.
- [32] Z. Zhang and Y. Pang, “Cgnet: cross-guidance network for semantic segmentation,” Science China Information Sciences, vol. 63, no. 2, pp. 1–16, 2020.
- [33] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [34] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 405–420.
- [35] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 325–341.
- [36] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach, “Contextnet: Exploring context and detail for semantic segmentation in real-time,” arXiv preprint arXiv:1805.04554, 2018.
- [37] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” arXiv preprint arXiv:2004.02147, 2020.
- [38] H. Li, P. Xiong, H. Fan, and J. Sun, “Dfanet: Deep feature aggregation for real-time semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9522–9531.
- [39] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, and Y. Tong, “Semantic flow for fast and accurate scene parsing,” arXiv preprint arXiv:2002.10120, 2020.
- [40] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Efficient dense modules of asymmetric convolution for real-time semantic segmentation,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
- [41] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
- [42] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
- [43] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
- [44] M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich et al., “Speeding up semantic segmentation for autonomous driving,” in MLITS, NIPS Workshop, vol. 2, 2016, p. 7.
- [45] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution residual networks for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4151–4160.
- [46] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- [47] Z. Wu, C. Shen, and A. v. d. Hengel, “Real-time semantic image segmentation via spatial sparsity,” arXiv preprint arXiv:1712.00213, 2017.
- [48] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “Psanet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 267–283.
Shijie Hao is an associate professor at the School of Computer Science and Information Engineering, Hefei University of Technology (HFUT). He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. He received his Ph.D. degree at HFUT in 2012. His research interests include image processing and multimedia content analysis.

Yuan Zhou is pursuing his Ph.D. degree at the School of Computer and Information, Hefei University of Technology. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. His research interests include image segmentation and few-shot learning.

Yanrong Guo is an associate professor at the School of Computer and Information, Hefei University of Technology (HFUT). She is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. She received her Ph.D. degree at HFUT in 2013. She was a postdoctoral researcher at the University of North Carolina at Chapel Hill (UNC) from 2013 to 2016. Her research interests include biomedical image segmentation and analysis.

Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He was a Research Fellow of the School of Computing with the National University of Singapore from 2008 to 2010. He is currently a Professor with the Hefei University of Technology, Hefei. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. He has coauthored over 70 publications in the areas of his research interests, which include multimedia content analysis and social media. He is a member of the ACM and the Executive Committee Member of the ACM SIGMM China Chapter. He was a recipient of the Best Paper Award from the ACM Multimedia 2010, the Best Paper Award from the ACM ICMR 2015, and the Honorable Mention of the IEEE Transactions on Multimedia Best Paper Award. He has served as the Technical Program Chair of MMM 2016. He has served as an Associate Editor of IEEE Multimedia Magazine, Neural Processing Letters (Springer), Information Sciences (Elsevier) and Signal Processing (Elsevier).

Jun Cheng received the B.Eng. and M.Eng. degrees from the University of Science and Technology of China, Hefei, China, in 1999 and 2002, respectively, and the Ph.D. degree from the Chinese University of Hong Kong, Hong Kong, in 2006. He is currently a Professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, machine intelligence, and control.

Meng Wang received the BE and PhD degrees in the Special Class for the Gifted Young and the Department of Electronic Engineering and Information Science from the University of Science and Technology of China (USTC), Hefei, China, in 2003 and 2008, respectively. He is a professor at the Hefei University of Technology, China. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored more than 200 book chapters, journal and conference papers in these areas. He is the recipient of the ACM SIGMM Rising Star Award 2014. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), the IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT), and the IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS).