Real-time Semantic Segmentation via Spatial-detail Guided Context Propagation
Abstract
Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision computing tasks, e.g. semantic segmentation, are usually computationally expensive, posing a challenge to computing systems that are resource-constrained but require fast response speed. Therefore, it is valuable to develop accurate and real-time vision processing models that only require limited computational resources. To this end, we propose the Spatial-detail Guided Context Propagation Network (SGCPNet) for achieving real-time semantic segmentation. In SGCPNet, we propose the strategy of spatial-detail guided context propagation. It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. In this way, the need for maintaining high-resolution features along the network is freed, therefore largely improving the model efficiency. On the other hand, due to the effective reconstruction of spatial details, the segmentation accuracy can still be preserved. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet model. On the Cityscapes dataset, for example, our SGCPNet achieves mIoU segmentation accuracy, while its speed reaches FPS on images on a GeForce GTX 1080 Ti GPU card. In addition, SGCPNet is very lightweight and only contains 0.61 M parameters. The code will be released at https://github.com/zhouyuan888888/SGCPNet.
Index Terms:
Semantic segmentation, deep learning, contextual information, accuracy, speed.

I Introduction
Vision computing plays an increasingly important role in many real-world applications, e.g. human action recognition [1, 2], object detection [3, 4, 5] and object tracking [6, 7]. A common characteristic of these applications is that a large amount of data is produced by vision sensors while a fast processing speed is required, so that a system architecture based solely on cloud computing is not always efficient [8]. In this context, edge computing has become a good complement, as this paradigm aims to drive computation from the cloud to network edges, moving the processing as close as possible to the data generation sources.


Recently, the success of deep learning has simultaneously brought opportunities and challenges to edge computing [9, 10]. On one hand, deep learning models are able to significantly improve the system's performance. On the other hand, however, these models usually have large computation and storage costs, posing challenges to the system's implementation speed and power consumption [11, 12]. This challenge typically exists in many vision-based computation tasks, especially for the semantic segmentation task that aims to assign a semantic label to each pixel, as shown in Fig.1. Being able to provide pixel-level semantic information, semantic segmentation can be the cornerstone of many vision-involved applications, such as autonomous driving [13] and medical assistance systems [14].
Despite the success of the deep-learning-based semantic segmentation models proposed recently [15, 16, 17], it is difficult to directly apply them in resource-constrained scenarios due to their large model size or high complexity. Model compression, such as network pruning and quantization [18, 19, 20], knowledge distillation [21], and hybrid schemes [22], is a feasible way of alleviating these issues [9]. On the other hand, it is more attractive to directly design a lightweight semantic segmentation model that simultaneously satisfies the following demands: being fast and accurate, while requiring low hardware consumption.
We first briefly analyze why current segmentation models are computationally expensive. Aiming at high accuracy, on one hand, current methods [23, 24, 25, 16, 15] mainly adopt the strategy of keeping high-resolution feature maps along the network pipeline, so as to effectively preserve spatial detail information, as presented in Fig.3 (a). On the other hand, the increased size of the feature maps reduces the relative receptive field of the convolution kernels, and thus these methods use dilated convolution [26] to aggregate as much context information as possible. Nevertheless, this roadmap has a limitation: the high-resolution feature maps kept in the pipeline lead to expensive computational costs. We conduct an experiment on the well-known ResNet [27] to empirically verify the influence of feature resolution on Floating Point Operations (FLOPs) and runtime. Of note, for a fair comparison, the last fully-connected layer is removed. The results in Table.I indicate that when the feature maps are maintained at high resolution, the FLOPs increase substantially and the speed drops sharply.
Based on the above observation, we conclude that maintaining high-resolution feature maps is the bottleneck in designing a fast and accurate segmentation model. On one hand, it is possible to achieve low computational complexity by downsampling the feature maps quickly along the network. The rationale comes from the fact that semantic contexts are region-level information, and pixels in a neighborhood usually share a common semantic label. Therefore, it is unnecessary to keep aggregating contexts on high-resolution feature maps. On the other hand, however, directly using low-resolution global contexts seems to conflict with the demand for high-quality segmentation results, as the spatial information is severely lost in the global context features. To resolve this dilemma, in this paper, we aim to free the model from maintaining high-resolution feature maps by efficiently reconstructing the lost spatial information of the global contexts, so that the demands of accuracy and speed can be satisfied at the same time.
TABLE I: FLOPs and runtime of ResNet-101 (Res-101) and ResNet-50 (Res-50), with the final fully-connected layer removed, under different feature-map resolutions.
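The influence of feature-map resolution on runtime can be reproduced with a short script like the minimal sketch below. It is our own illustration rather than the original measurement code: it times a ResNet-50 trunk with the final fully-connected layer removed, and the listed input resolutions, warm-up passes and repeat count are arbitrary assumptions.

```python
import time

import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 trunk with the final fully-connected layer removed,
# mirroring the setup of the resolution experiment above.
trunk = nn.Sequential(*list(resnet50().children())[:-1]).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
trunk.to(device)


@torch.no_grad()
def avg_runtime_ms(model, height, width, repeats=20):
    x = torch.randn(1, 3, height, width, device=device)
    for _ in range(5):                      # warm-up passes
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / repeats * 1000.0


# Hypothetical input resolutions; runtime grows quickly as resolution increases.
for h, w in [(256, 512), (512, 1024), (1024, 2048)]:
    print(f"{h}x{w}: {avg_runtime_ms(trunk, h, w):.1f} ms")
```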
To this end, we propose the strategy of spatial-detail guided context propagation, as shown in Fig.3 (b). It uses the spatial details of the shallow layers to guide the propagation of the global contexts to neighboring positions, and therefore helps to reconstruct the lost spatial information of the global contexts. As described in Fig.3 (b), in our method, the resolution of the feature map is reduced gradually along the network pipeline, until the final global context information is obtained. Then, the proposed spatial-detail guided context propagation strategy is applied. We hope that the context propagation meets two requirements. First, during the context propagation, the context information should be consistent with its neighboring spatial details, thus guaranteeing the effectiveness of the propagation. Second, after the propagation, the original global context information should be recoverable from the propagated context map as accurately as possible. In this way, the propagation's accuracy can be ensured. Based on the above roadmap, we design our context propagation in a bi-directional way, i.e. 1) propagating the context information of the current position to its neighborhood, and 2) gathering the contexts from the neighboring pixels back to the current position. We realize them by building a lightweight bi-directional network structure, where we introduce a top-down path and a bottom-up path respectively. Specifically, in the top-down path, the global contexts are gradually propagated to neighboring positions under the guidance of spatial details. In the bottom-up path, by contrast, the context information of the local region is progressively gathered using the pooling operation, and thereby the global contexts can be re-extracted. The re-extracted context information is supposed to be the same as the global contexts before propagation.
Based on the proposed strategy, we build a lightweight network for efficient and accurate semantic segmentation, named Spatial-detail Guided Context Propagation Network (SGCPNet), as presented in Fig.5. Our SGCPNet has low storage and computation costs: it only contains parameters, and costs only FLOPs in segmenting an image. As a result, our SGCPNet has a fast implementation speed. For example, it can realize FPS on inputs or FPS on inputs, based on a single GTX 1080Ti GPU card. Even with an Intel Xeon Silver 4210 CPU, segmenting an RGB image only needs runtime. In addition to the low costs and fast speed, the accuracy of our SGCPNet is still kept at a high level. For example, on the public semantic segmentation datasets Cityscapes [28] and CamVid [29], SGCPNet obtains promising segmentation performance, such as mIoU on the Cityscapes test set and mIoU on the CamVid test set. Additionally, in Fig.2, we provide a chart that compares our SGCPNet with recent related methods in terms of model size (x-axis), segmentation accuracy (y-axis) and execution speed (mark area). We can see that our SGCPNet achieves a good balance between these performance indices, showing its advantage in resource-constrained semantic segmentation.
The contributions of this paper are summarized as the following aspects:
• We propose a new spatial-detail guided context propagation strategy. It effectively reconstructs the lost spatial information in the global contexts, so that the need for maintaining high-resolution feature maps throughout the network pipeline can be freed.
• We construct the Spatial-detail Guided Context Propagation Network (SGCPNet), which realizes the spatial-detail guided context propagation with high efficiency via a bi-directional network structure.
• Last but not least, our SGCPNet presents competitive performance in terms of the balance between segmentation accuracy and efficiency.
The rest of this paper is organized as follows. We first review the related works in Section II. Then, in Section III and Section IV, we respectively introduce our proposed method and the experimental results in detail. Finally, the paper is concluded in Section V.

II Related Work
In this section, we first review the relevant approaches that concentrate on boosting segmentation accuracy, and then review the methods aiming at improving segmentation efficiency.
II-A Methods for Boosting Segmentation Accuracy
Due to the significance of obtaining sufficient context and spatial information for the semantic segmentation task, Yu et al. [26] extend the conventional convolution operation by proposing the dilated convolution. On one hand, the dilated convolution enlarges the receptive field of convolution by inserting "holes" into the convolution kernel, so that more context information can be aggregated. On the other hand, some downsampling operations (e.g., pooling) for enlarging the receptive field can be avoided, and thereby the spatial details can be maintained within the network pipeline. Based on [26], Chen et al. [16] propose the Atrous Spatial Pyramid Pooling (ASPP) module that equips the dilated convolution with a spatial pyramid structure, thus realizing context aggregation with multiple different receptive fields. Zhao et al. [15] further extend ASPP to the Pyramid Pooling Module (PPM), which additionally considers global context information. This helps the model to understand the visual scene from a more global perspective. Aiming to help the network perceive more context information, Huang et al. [30] propose the Criss-Cross Network (CCNet) that aggregates the contexts lying on the criss-cross path for all pixels, while in [17], Fu et al. propose to simultaneously aggregate global contexts and cross-channel dependencies. Taking advantage of dictionary learning, Zhang et al. [24] propose to build a learnable dictionary to preserve the semantic contexts of the whole training dataset. In contrast, Ma et al. [31] design a Semantics Conformity Module (SCM) to better preserve the spatial details along the network pipeline. He et al. [25] propose to use global-guided local affinity to guide the aggregation of pyramid context information, making the deep model more robust to the diversity of object sizes and shapes. Recently, Zhang et al. [32] propose to further enhance the segmentation results by explicitly considering the guidance of edges and salient objects.
Discussion. Aiming to simultaneously obtain sufficient context information and spatial detail information, the above methods mainly choose to maintain spatial details along the network pipeline. Although these methods can provide accurate segmentation results, their computation is generally expensive and their execution speed tends to be far from the demands of real-time processing. For example, both DeepLab and PSPNet run far below real-time frame rates on high-resolution images because of their large FLOPs. Therefore, they are less suitable for applications that are resource-constrained but require fast segmentation.

II-B Methods for Improving Segmentation Efficiency
In addition to segmentation accuracy, segmentation efficiency is also vital to many real-world applications. Recently, great efforts have been devoted to building lightweight and fast semantic segmentation models. Badrinarayanan et al. [33] propose an encoder-decoder model called SegNet. In the encoder, the features are gradually pooled to a low resolution, and the corresponding pooling indices are saved. In the decoder, upsampling is performed by reusing the recorded pooling indices, which avoids learning how to upsample the low-resolution feature maps and therefore substantially improves segmentation speed. Zhao et al. [34] propose ICNet, which adopts multi-resolution branches with different network depths. Specifically, the deep branches are used to extract semantic context information from the low-resolution inputs, while the shallow branches concentrate on capturing spatial details from the high-resolution inputs. In this way, both the context information and the spatial details can be obtained, while the computation costs are reduced as much as possible. Furthermore, Yu et al. [35] and Poudel et al. [36] propose to separately construct a spatial path and a context path, which individually learn spatial details and context information. Following [34], in [35] and [36], the spatial path is kept shallow, while the context path is designed to be deep. In [37], Yu et al. augment [35] by further introducing a guided aggregation layer that realizes a better fusion between spatial details and context information. Differently, Li et al. [38] propose a feature reuse strategy, with the goal of fully exploiting the information contained in the previous layers. Li et al. [39] propose SFNet, which aims to effectively align low-resolution and high-resolution feature maps while maintaining high efficiency. Aiming to boost segmentation accuracy and efficiency simultaneously, Lo et al. [40] design a network with an asymmetric convolution structure and dense connections. Notably, by designing a small decoder and using an early downsampling strategy, Paszke et al. [41] propose the lightweight ENet, which contains very few network parameters.
Discussion. The current methods for improving segmentation efficiency mainly aim to maintain spatial details and aggregate context information individually, e.g. [34, 35, 36, 37]. We have some observations on this decoupling strategy. First, for a typical CNN-based framework, contexts and spatial details can be obtained from the deep layers and shallow layers simultaneously. Therefore, it is possible to save some of the computation spent on the decoupled learning process. Second, the fusion part also needs a careful design, especially in situations with limited computational resources. Based on these observations, in this paper, we propose to learn spatial details and contexts from the network at the same time. Besides, we further propose to explicitly consider the relatedness between the context and spatial detail information: we use the spatial details to guide the aggregation of context information. Of note, our SGCPNet clearly differs from [33], as we realize the spatial-detail guidance in a learnable manner, rather than by reusing saved pooling indices.
III Proposed Method
In this section, we introduce our proposed method in detail. We first elaborate the strategy of spatial-detail guided context propagation, and then give the details of our SGCPNet.
III-A Spatial-detail Guided Context Propagation Strategy
The motivation of our spatial-detail guided context propagation strategy is simple and straightforward. We hope the global contexts can be propagated to a higher-resolution grid and be consistent with the low-level spatial details. To this end, the basic operations of our spatial-detail guided context propagation are illustrated in Fig.4. The context features $\mathbf{C}$, obtained from relatively deep layers, are first upsampled to a higher resolution by the nearest-neighbor interpolation operation Upsample. This can be seen as a naive context propagation without any spatial-detail guidance. Then, we use the spatial details $\mathbf{S}$ of relatively shallower layers to refine the naively propagated contexts through the exchange function Exchange, which facilitates the interaction between $\mathbf{C}$ and $\mathbf{S}$ and therefore reconstructs the lost spatial information of $\mathbf{C}$. The whole process can be described as Eq.1,
$\hat{\mathbf{C}} = \mathrm{Exchange}\big(\mathrm{Upsample}(\mathbf{C}),\, \mathrm{Conv}(\mathbf{S})\big), \qquad (1)$
where $\hat{\mathbf{C}}$ indicates the desired contexts propagated to a higher resolution. Specifically, on one hand, as indicated in Eq.1, we also adopt the convolution Conv to further refine the spatial detail information of $\mathbf{S}$, because the features of shallow layers tend to contain much noise. Considering model efficiency, Conv is implemented by separable convolution [42] in our work. On the other hand, the exchange function Exchange can be realized in different forms. For example, we can simply fuse the two inputs with a direct summation, or learn a pixel-wise attention mechanism. However, aiming at a good balance between model effectiveness and efficiency, the exchange function Exchange is built as a simple linear combination,
$\mathrm{Exchange}(\mathbf{x}_1, \mathbf{x}_2) = \alpha\,\mathbf{x}_1 + \beta\,\mathbf{x}_2, \qquad (2)$
where $\alpha$ and $\beta$ are learnable scalar weights for the inputs $\mathbf{x}_1$ (e.g. the contexts) and $\mathbf{x}_2$ (e.g. the spatial details). The learned function refines the incorrect places in the contexts by imposing a large $\beta$, thereby automatically giving more weight to the spatial detail information. In the right part of Fig.4, we provide a toy example to explain this process. Suppose the initially upsampled contexts are locally inaccurate. After applying the exchange function, these inaccurate places can be updated under the guidance of spatial detail information. For example, a few places inferred as "person" are corrected into "car", and several places inferred as "car" are corrected into "road". In this way, the upsampled contexts become more consistent with the spatial details of the original image, and thus the lost spatial information can be recovered, as exemplified in Fig.6.
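To make Eqs. 1 and 2 concrete, the following is a minimal PyTorch sketch of one propagation step. Only its overall structure (nearest-neighbor upsampling, separable-convolution refinement of the details, and the scalar-weighted exchange of Eq. 2) follows the description above; the channel sizes, module names and the initialization of the scalars are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeparableConv(nn.Module):
    """Depthwise + pointwise convolution, used here as the lightweight Conv(.)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))


class DetailGuidedPropagation(nn.Module):
    """One propagation step: C_hat = Exchange(Upsample(C), Conv(S)), Eqs. 1-2."""
    def __init__(self, channels):
        super().__init__()
        self.refine = SeparableConv(channels, channels)  # Conv(.) applied to the details S
        self.alpha = nn.Parameter(torch.ones(1))         # weight for the upsampled contexts
        self.beta = nn.Parameter(torch.ones(1))          # weight for the refined details

    def forward(self, contexts, details):
        # Naive propagation: nearest-neighbor upsampling of the low-resolution contexts.
        up = F.interpolate(contexts, size=details.shape[-2:], mode="nearest")
        # Exchange(., .) realized as a learnable linear combination (Eq. 2).
        return self.alpha * up + self.beta * self.refine(details)


# Toy usage: contexts at 1/32 resolution guided by details at 1/16 resolution.
ctx = torch.randn(1, 64, 8, 16)
det = torch.randn(1, 64, 16, 32)
out = DetailGuidedPropagation(64)(ctx, det)   # shape: (1, 64, 16, 32)
```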
As mentioned before, we hope that our context propagation satisfies two requirements, i.e. 1) during the context propagation, the context information should be consistent with the spatial details contained in the neighboring pixels; 2) after the context propagation, the original global contexts should be recoverable as accurately as possible. To satisfy these two requirements, the spatial-detail guided context propagation strategy is realized by building bi-directional paths, which we respectively call the top-down path and the bottom-up path. As presented in Fig.3 (b), along the top-down path, the global contexts are propagated back to the shallow layers and constantly interact with the spatial detail information, leading to the reconstruction of the spatial details lost in the contexts. The bottom-up path, in turn, aims at re-extracting the global contexts, which are supposed to be the same as the contexts before propagation. The context re-extraction can be realized by using max pooling or average pooling combined with convolution operations. Particularly, in this paper, we adopt max pooling to realize context re-extraction, as it is advantageous in selecting more discriminative responses; the ablation results presented in Table.VI also indicate that max pooling yields better performance. After the alternating usage of the top-down and bottom-up paths, more accurate high-resolution contexts are produced, which contain sufficient context information and spatial details simultaneously. In this way, the need for continuously maintaining high-resolution feature maps within the network can be freed, thus largely improving segmentation efficiency, while the segmentation accuracy is still maintained as much as possible.
III-B Architecture of SGCPNet
Based on the proposed spatial-detail guided context propagation strategy, we construct the Spatial-detail Guided Context Propagation Network (SGCPNet). As shown in Fig.5, our SGCPNet is a variant of the encoder-decoder framework. For the encoder, we choose the lightweight MobileNet [43] as the backbone network. Specifically, for high efficiency, we do not maintain spatial details in the network pipeline. Instead, the feature map is gradually downsampled to 1/32 of the original resolution. This substantially reduces computational expenses and therefore endows the model with a faster segmentation speed. Nevertheless, due to the lost spatial detail information, this may lead to a substantial decrease in segmentation accuracy. To avoid this issue, we design a lightweight Spatial-detail Guided Context Propagation (SGCP) module as the decoder, as shown in Fig.5 (b). It is able to effectively recover the lost spatial information of the contexts, so that segmentation accuracy can be preserved as much as possible. To enable effective context propagation at a low computational cost, the SGCP module is designed as a bi-directional structure. At last, a convolutional classifier is applied to the outputs of the SGCP module and produces the final segmentation results. In Section III-C, we provide the details of our SGCP module.

III-C Details of SGCP module
As shown in Fig.5 (b), three convolution operations are first applied to the features produced by the last three backbone layers, so that the features are encoded into higher-dimensional representations that contain richer feature descriptions. We term the three resulting layers Layer-3†, Layer-4† and Layer-5†, respectively. Then, to aggregate more global context information, we successively apply two max-pooling operations to the feature map of Layer-5†. The two max-pooling operations share the same kernel size, and their strides are set to 2. In this way, we obtain the final context map, whose resolution is condensed to 1/128 of the input. The final context map contains more global semantic contexts, while almost all of its spatial information is lost.
To reconstruct the lost spatial detail information in the aggregated global contexts, the proposed spatial-detail guided context propagation strategy is used. To realize effective context propagation, we design a bi-directional network structure, where we introduce a top-down path and a bottom-up path respectively. Specifically, the top-down and bottom-up paths have similar network structures, and both consist of convolution operations and the scalar-weighted fusion. Nevertheless, to realize different functions for the two paths, the nearest-neighbor interpolation operation is employed in the top-down path, while the max-pooling operation is further adopted in the bottom-up path. Following [38, 37], we use the separable convolution [42] to construct the convolution layers of our SGCP module, due to its modest computation and effectiveness in feature extraction. The separable convolution is composed of a point-wise convolution and a depth-wise convolution. Specifically, the depth-wise convolution individually extracts the features contained in each channel of the feature map, while the point-wise convolution linearly combines the features extracted by the depth-wise convolution. As presented in Fig.5, to aggregate the information coming from neighboring layers, the scalar-weighted fusion is employed in the top-down and bottom-up paths,
$\mathbf{F}^{out} = \begin{cases} \alpha\,\mathbf{F}_i + \beta\,\mathbf{F}_{i+1}, & \text{without skip connection},\\ \alpha\,\mathbf{F}_i + \beta\,\mathbf{F}_{i+1} + \gamma\,\mathbf{F}_i^{skip}, & \text{with skip connection}, \end{cases} \qquad (3)$
where $\mathbf{F}_i$ and $\mathbf{F}_{i+1}$ respectively indicate the feature maps coming from the $i$-th and $(i+1)$-th layers. Moreover, $\mathbf{F}_i$ contains more spatial detail information, and $\mathbf{F}_{i+1}$ contains more context information. As Eq.3 indicates, the formula is conditioned on whether skip connections exist. When skip connections exist, the features $\mathbf{F}_i^{skip}$ coming from the corresponding layer of the earlier path are further considered. Additionally, the scalars $\alpha$, $\beta$ and $\gamma$ are learnable weights for controlling the balance between $\mathbf{F}_i$, $\mathbf{F}_{i+1}$ and $\mathbf{F}_i^{skip}$. According to Eq.3, the spatial-detail guided context propagation can be re-interpreted as follows: the contexts contained in $\mathbf{F}_{i+1}$ are propagated to the neighborhood by leveraging the guidance of the spatial details involved in $\mathbf{F}_i$. Therefore, in the propagated context map $\mathbf{F}^{out}$, the spatial detail information is reconstructed to some extent.
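A compact sketch of the scalar-weighted fusion in Eq.3 is given below. It is our illustration, not the released implementation, and it assumes that the two (or three) input feature maps have already been resized to a common resolution by the interpolation or pooling operations of the corresponding path.

```python
import torch
import torch.nn as nn


class ScalarWeightedFusion(nn.Module):
    """Eq. 3: alpha*F_i + beta*F_{i+1}, plus gamma*F_skip when a skip connection exists."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # weight for the detail-rich map F_i
        self.beta = nn.Parameter(torch.ones(1))    # weight for the context-rich map F_{i+1}
        self.gamma = nn.Parameter(torch.ones(1))   # weight for the optional skip feature

    def forward(self, f_i, f_next, f_skip=None):
        out = self.alpha * f_i + self.beta * f_next
        if f_skip is not None:                     # skip connection from an earlier path
            out = out + self.gamma * f_skip
        return out


# Toy usage: all inputs are assumed to share the same spatial size and channel count.
fuse = ScalarWeightedFusion()
f_i, f_next = torch.randn(1, 64, 32, 64), torch.randn(1, 64, 32, 64)
fused = fuse(f_i, f_next, f_skip=torch.randn(1, 64, 32, 64))
```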
As can be seen from Fig.5 (b), in our SGCP module, we first build a top-down path, with the goal of forcing continuous interaction between the global contexts and the spatial details. In this way, the global contexts are gradually propagated to the neighborhood under the guidance of spatial details, until the preliminarily reconstructed context map, with a resolution of 1/8 of the input image, is obtained. Then, we build a bottom-up path, aiming to re-extract the global contexts from the propagated context map. Specifically, we hope that the similarity between the original global context map and the re-extracted one is as high as possible. Therefore, we build the skip connections presented in Fig.5 (b), so that the original context and spatial detail information can be inserted into the corresponding layers of the bottom-up path. At last, a second top-down path is constructed, which produces the final propagated context map by leveraging the information contained in the previous paths. Of note, the final pixel-wise segmentation results are produced by the convolutional classifier.
As indicated in Table.VI, our SGCP module is very lightweight and only contains parameters. Nevertheless, it still yields satisfying performance. For example, we visualize the global context map processed by our SGCP module in Fig.6. As can be seen from Fig.6 (b), on one hand, the spatial information contains many details, e.g. boundaries and textures, but it is not aware of the scene semantics. On the other hand, the global context information is aware of the semantic regions at a large scale, but its spatial details are severely lost. As presented in Fig.6 (c), our SGCP module is able to effectively reconstruct the lost spatial information of the contexts, and thus the propagated context maps can clearly reflect the scene semantics and the spatial details at the same time. Additionally, the ablation study in Section IV-D also statistically validates the usefulness of our SGCP module. For example, the backbone network alone only achieves mIoU on the Cityscapes validation set after being trained with training epochs, but based on the same training policy, the segmentation accuracy can be further improved to mIoU by adding our SGCP module, while only bringing about parameters.
TABLE II: Comparison of model efficiency (Params, and FLOPs/FPS at multiple input resolutions). Large-size models: DeepLab [16], PSPNet [15], DANet [17], CCNet [30]. Medium-size models: SegNet [33], SQ [44], SFNet-Res18/DF2/DF1 [39], FRRN [45], FCN-8S [46], BiSeNetV2-L/S [37], TwoColumn [47], BiSeNetV1-L/S [35], ICNet [34], DFANet-A/B [38]. Small-size models: ContextNet [36], EDANet [40], ENet [41], and our SGCPNet.
IV Experiments
In this section, we first introduce the implementation details of our experiments, and then compare our SPCNet with the relevant state-of-the-art approaches. At last, we conduct the ablation study to investigate the influence of each component in our approach.
IV-A Implementation Details
All our code is built on PyTorch (https://pytorch.org). Following the previous works [25, 16], we employ the "poly" learning rate policy, $lr = lr_{base} \times (1 - \frac{iter}{iter_{max}})^{power}$, where $lr_{base}$ denotes the base learning rate and $power$ controls the decay. In addition, following [32, 31], data augmentation policies such as random flipping and random scaling of the inputs are adopted to alleviate over-fitting. During the training phase, we choose the Stochastic Gradient Descent (SGD) optimizer with momentum and weight decay. We validate our SGCPNet on two public semantic segmentation datasets, CamVid [29] and Cityscapes [28], whose training details are slightly different. Specifically, the crop size, batch size and number of training epochs are set separately for the two datasets, since CamVid [29] has relatively fewer samples and a lower image resolution, whereas Cityscapes [28] contains many high-resolution images; due to our limited GPU resources, the batch size on Cityscapes is set to 36. Furthermore, in Table.II, Table.III and Table.IV, the execution speed of our SGCPNet is tested on a single GTX 1080Ti GPU card, while in Table.V it is obtained with an Intel Xeon Silver 4210 CPU.
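For reference, the "poly" schedule described above can be realized with a standard PyTorch LambdaLR scheduler, as in the sketch below; the base learning rate, power value, iteration budget, momentum and weight decay shown here are placeholder values, not the ones used in the paper.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 19, 1)              # stand-in module instead of the full SGCPNet
base_lr, power, max_iter = 0.01, 0.9, 100000   # placeholder hyperparameters

optimizer = SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=5e-4)
# "poly" policy: lr = base_lr * (1 - iter / max_iter) ** power
scheduler = LambdaLR(optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** power)

for it in range(max_iter):
    # ... forward pass, loss computation and loss.backward() would go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```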
TABLE III: Comparison of accuracy (mIoU) and efficiency (Params, input size, FLOPs, FPS) on the Cityscapes dataset. Large-size models: DeepLab [16], PSPNet [15], DANet [17], CCNet [30]. Medium-size models: SegNet [33], SQ [44], SFNet-Res18/DF2/DF1 [39], FRRN [45], FCN-8S [46], BiSeNetV2-L/S [37], TwoColumn [47], BiSeNetV1-L/S [35], ICNet [34], DFANet-A/B [38]. Small-size models: ContextNet [36], EDANet [40], ENet [41], SGCPNet1 and SGCPNet2.
IV-B Comparison by Considering Efficiency Only
In this section, we compare our SGCPNet with its counterparts in terms of model efficiency, using three efficiency-related evaluation metrics, i.e., the number of model parameters (Params), Floating Point Operations (FLOPs) and Frames Per Second (FPS).
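Params and FPS can be measured with a few lines of PyTorch, as in the hedged sketch below; the segmentation model and input resolution are placeholders standing in for SGCPNet, and FLOPs counting would additionally require an external profiler, so it is omitted here.

```python
import time

import torch
from torchvision.models.segmentation import lraspp_mobilenet_v3_large

# Placeholder lightweight segmentation model standing in for SGCPNet.
model = lraspp_mobilenet_v3_large(num_classes=19).eval()
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.2f} M")


@torch.no_grad()
def measure_fps(net, height, width, repeats=50):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net.to(device)
    x = torch.randn(1, 3, height, width, device=device)
    for _ in range(5):                      # warm-up passes
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        net(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return repeats / (time.time() - start)


print(f"FPS at 1024x2048: {measure_fps(model, 1024, 2048):.1f}")
```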
IV-B1 Model Categorization
For better comparison, the compared models are divided into three categories, i.e., the models of (i) large size, (ii) medium size and (iii) small size, according to the following rules:
• The large-size model indicates the model's Params and FLOPs should be more than and simultaneously;
• The medium-size model indicates the model's Params are between and , or the FLOPs are between and ;
• In the small-size model, the model's Params and FLOPs should be less than and in the meantime.
According to the above categorization, our SGCPNet clearly belongs to the small-size category, as it just contains parameters and its FLOPs are obviously below the corresponding threshold; for example, even when segmenting a high-resolution image, its computational cost is only FLOPs.
IV-B2 Performance Analysis
In Table.II, we compare our SGCPNet with the related segmentation models in terms of Params, FLOPs and FPS under the different image resolutions listed in the table. Specifically, for a fair comparison, we test our model's efficiency on all the image resolutions in the table.
From Table.II, we can make the following observations. Generally, the large-size models concentrate on improving segmentation accuracy, and thus tend to involve expensive computation. For example, DeepLab [16] contains parameters in total; when segmenting an image, it costs FLOPs, and its segmentation speed is only FPS. PSPNet [15] contains parameters and costs FLOPs when processing an input image, running at FPS. Although CCNet [30] and DANet [17] have far fewer parameters, their FLOPs remain at a high level, and their execution speeds are far from the demands of real-time applications.
The storage and computation burdens of the medium-size models are much smaller than those of the large-size models. Some of them achieve good efficiency, e.g., SFNet-DF1 [39] and DFANet [38] realize segmentation speeds near or over FPS while containing fewer than 10 M parameters.
As can be seen from Table.II, the small-size models all have very small storage costs, and thus they are suitable for resource-restricted environments. Our SGCPNet has the second smallest model size among all the listed models; it is tens or even hundreds of times smaller than the medium- or large-size competitors. Furthermore, SGCPNet has the smallest FLOPs at almost all the image resolutions. Accordingly, our model's segmentation speed is faster than that of the compared models by a large margin.
IV-C Comparison by Considering the Balance between Accuracy and Efficiency
For the task of real-time semantic segmentation, the balance between accuracy and efficiency is a vital criterion in model evaluation. Therefore, in this part, we compare our SGCPNet with the counterparts in terms of the balance between accuracy and efficiency on the Cityscapes [28] and CamVid [29] datasets.
IV-C1 Cityscapes
The Cityscapes [28] dataset contains 25,000 road scene images. Specifically, 5,000 images are finely annotated, and the remaining 20,000 images are coarsely annotated. In our experiments, we only employ the finely annotated subset, of which 2,975, 500 and 1,525 images are used for training, validation and testing, respectively. The finely annotated subset involves 30 semantic categories; following the previous works [48, 15], we adopt 19 categories in model evaluation. To clearly show our model's efficiency, we evaluate our SGCPNet on this dataset at two different image resolutions, and according to the input resolution, the model is named SGCPNet1 and SGCPNet2, respectively. As Table.III indicates, SGCPNet1 is evaluated at the same image resolution as [34, 44, 39]; the high-resolution input poses more of a challenge for real-time semantic segmentation. SGCPNet2 is trained and tested on images with a smaller resolution. The results of our methods and their competitors are summarized in Table.III, and some examples of our segmentation results are presented in Fig.7.
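If one reproduces this setting with torchvision, the finely annotated splits can be loaded roughly as follows; this is a sketch that assumes the official Cityscapes archives have been extracted to ./cityscapes, and it omits the crop and scale augmentations described in Section IV-A.

```python
from torchvision import datasets, transforms

# Finely annotated Cityscapes splits; 'semantic' targets give the label maps, from which
# the 19 evaluation classes are selected afterwards via the official label mapping.
to_tensor = transforms.ToTensor()
train_set = datasets.Cityscapes(root="./cityscapes", split="train", mode="fine",
                                target_type="semantic", transform=to_tensor)
val_set = datasets.Cityscapes(root="./cityscapes", split="val", mode="fine",
                              target_type="semantic", transform=to_tensor)
print(len(train_set), len(val_set))   # 2975 and 500 images
```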
Performance Analysis. Compared with the small-size models, the two versions of our SGCPNet achieve a better trade-off between efficiency and accuracy than ContextNet [36] and EDANet [40]. Although our proposed model has slightly more parameters than ENet [41], SGCPNet1 and SGCPNet2 have much higher segmentation accuracy. Additionally, our method consumes far fewer FLOPs: even when segmenting a higher-resolution image, our SGCPNet2 still needs far fewer FLOPs than ENet costs on processing a smaller image. Accordingly, our SGCPNet2 has a much faster execution speed.
Compared with the medium-size models, our method also presents a promising trade-off between accuracy and efficiency. On one hand, our method outperforms some of these models in both efficiency and accuracy, such as SegNet [33], SQ [44] and FCN-8S [46]. On the other hand, the accuracy of SGCPNet1 and SGCPNet2 is at the same level as models such as FRRN [45], TwoColumn [47], ICNet [34], BiSeNetV2-S [37], BiSeNetV1-S [35], DFANet-A and DFANet-B [38], while SGCPNet1 and SGCPNet2 have much fewer FLOPs and a higher FPS in most cases. As for the models with high segmentation accuracy, such as SFNet [39], BiSeNetV1-L [35] and BiSeNetV2-L [37], their FLOPs are tens or even one hundred times larger than ours. These comparisons show that our model achieves a better balance between accuracy and efficiency and is thus more appropriate for resource-constrained semantic segmentation.
In this part, we also compare our SGCPNet with some large-size models, aiming to show the importance of further boosting model efficiency. As shown in Table.III, the large-size models are very advantageous in segmentation accuracy, e.g., their accuracy is around 10% higher than that of our SGCPNet1 and SGCPNet2. However, these methods are hardly applicable to real-time systems, due to their expensive computation and high latency. For example, when segmenting an image, DeepLab costs FLOPs, and its speed is only FPS. In contrast, our SGCPNet has obviously higher efficiency, e.g., the FLOPs of our SGCPNet1 are only around 1/250 of those of DANet [17] and CCNet [30].


TABLE IV: Comparison of accuracy (mIoU) and efficiency (Params, input size, FPS) on the CamVid dataset. Medium-size models: BiSeNetV2-L/S [37], BiSeNetV1-L/S [35], SegNet [33], ICNet [34], SFNet-Res18/DF2 [39], DFANet-A/B [38]. Small-size models: EDANet [40], ENet [41], and our SGCPNet.
TABLE V: Runtime on an Intel Xeon Silver 4210 CPU and mIoU, evaluated on the Cityscapes validation and test sets, for MobileNet-V2-large/small and MobileNet-V3-large/small [43], CCNet [30], DANet [17] and our SGCPNet. Among the recovered entries, MobileNet-V3-small requires 327 ms per image, while our SGCPNet needs 665 ms and 151 ms at the two tested settings.
IV-C2 CamVid
The CamVid dataset [29] is collected from high-resolution video sequences of road scenes. It contains 32 semantic classes and 701 annotated images in total; 367, 101 and 233 images are used for training, validation and testing, respectively. Following previous works, such as [34, 38], we only adopt 11 classes in model evaluation. The comparisons between our method and the recent related works on this dataset are summarized in Table.IV.
Performance Analysis. Compared with the other two small-size models, our SGCPNet achieves a clearly better trade-off between accuracy and efficiency. For example, our SGCPNet is FPS faster than EDANet [40] and FPS faster than ENet [41], while its accuracy is still higher than that of [40, 41] by and , respectively.
TABLE VI: Ablation study on the components of SGCPNet (Cityscapes validation set).
BK | HFR | SW-Sum | TD1 | BU | TD2 | Pooling | Params | Runtime (GPU) | Runtime (CPU) | mIoU
---|---|---|---|---|---|---|---|---|---|---
✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ||||
✓ | ✗ | ✓ | ✓ | ✗ | ✗ | -Max | ||||
✓ | ✗ | ✓ | ✓ | ✓ | ✗ | -Max | ||||
✓ | ✗ | ✓ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✗ | ✗ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✗ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Max | | | |
✓ | ✓ | ✗ | ✓ | ✓ | ✓ | -Max | ||||
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | -Avg |

Compared with the medium-size models, the accuracy of our SGCPNet still ranks fourth, while it is much smaller and faster than all the medium-size models. For example, the model size of our SGCPNet is twenty times smaller than that of SFNet-Res18 [39], and our SGCPNet is several times faster than the two versions of BiSeNetV2 [37]. As for the remaining seven medium-size models, SGCPNet consistently outperforms them in both accuracy and efficiency. The experimental results on CamVid indicate that our SGCPNet is well suited to resource-constrained situations.
IV-D Ablation Study
In this part, we conduct ablation studies for our method. We first show the performance of our model on the CPU hardware platform, and then investigate the influence of each component of our SGCPNet.
IV-D1 Performance on Different Hardware Platforms
Considering that the speed reported in Table.II, Table.III and Table.IV is evaluated on the GPU platform, we further investigate our model's execution speed on a CPU platform equipped with an Intel Xeon Silver 4210 CPU. As can be seen from Table.V, our method still yields promising performance: segmenting an input image requires only a short runtime, and even at a large input resolution the runtime remains modest. These results show that our model also achieves good efficiency on the CPU platform. Besides, from Table.V, we can observe that our model outperforms another lightweight model, MobileNet-V3-small, in terms of both segmentation accuracy and speed.
IV-D2 Ablation Study on the Components in SGCPNet
In this part, we investigate the contribution of each component of SGCPNet in terms of accuracy, speed and model size. Without loss of generality, we only adopt 200 training epochs in these experiments for convenience. The results are summarized in Table.VI.
From Table.VI, we can see that the backbone network (BK), containing parameters, achieves mIoU on the Cityscapes validation set. By adding the first top-down path (TD1), the accuracy increases to mIoU, at the cost of additional parameters and GPU runtime. After being further equipped with the bottom-up path (BU), the model achieves mIoU; correspondingly, the GPU runtime and the number of parameters increase to and , respectively. When the last top-down path (TD2) is employed, the model performance reaches mIoU, at the cost of parameters and GPU runtime. In addition, we visualize some typical segmentation results in Fig.8. As the figure shows, the backbone network (BK) generally fails to clearly segment object contours, whereas with our top-down and bottom-up paths the segmentation results become increasingly accurate. These experimental results again empirically validate the effectiveness and efficiency of our bi-directional paths.
When we further introduce the high-dimension feature representations (HFR) into the model, the accuracy is further improved to mIoU, at the cost of only parameters and GPU runtime. We also investigate the influence of the scalar-weighted sum (SW-Sum) and of the different types of pooling operations (Pooling) used in the SGCP module. When the scalar-weighted sum is replaced with a conventional sum operation, the accuracy decreases by nearly mIoU, because the input feature maps are treated equally. As can be seen from Table.VI, replacing max pooling with average pooling also leads to a decrease in accuracy. Furthermore, enlarging the kernels of the max pooling degrades the performance as well.
According to the results presented in Table.VI, on one hand, we can conclude that each component in our SGCPNet contributes to the final segmentation accuracy. On the other hand, the increased model size and runtime brought by them (both on GPU and CPU) are small and acceptable.
V Conclusion
In this paper, we design the Spatial-detail Guided Context Propagation Network (SGCPNet) for the real-time semantic segmentation task. SGCPNet frees the need of maintaining high-resolution feature maps in the network pipeline, while still effectively incorporating the context and spatial detail information. Therefore, SGCPNet achieves a state-of-the-art balance between segmentation accuracy and efficiency, which makes it very suitable for resource-constrained systems. Extensive experiments on the CamVid and Cityscapes datasets demonstrate the effectiveness and efficiency of our method. For example, our SGCPNet achieves mIoU on the Cityscapes dataset, with a very small model size and over FPS execution speed on input images. In the future, we plan to extend our SGCPNet to realize real-time semantic segmentation in few-shot scenarios, thereby relieving the intensive labor spent on data collection and annotation and further boosting the practicality of the semantic segmentation technique in real applications.
Acknowledgement
This work was supported by the National Key Research and Development Program under Grant No. 2019YFA0706200, the National Natural Science Foundation of China under Grants No. 62172137, 62072152 and 61725203, and the Fundamental Research Funds for the Central Universities under Grant No. PA2020GDKC0023.
References
- [1] H. Rahmani, A. Mian, and M. Shah, “Learning a deep model for human action recognition from novel viewpoints,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667–681, 2017.
- [2] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive neural networks for high performance skeleton-based human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1963–1978, 2019.
- [3] Y. Pang, J. Cao, Y. Li, J. Xie, H. Sun, and J. Gong, “Tju-dhd: A diverse high-resolution dataset for object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 207–219, 2020.
- [4] Y. Li, Y. Pang, J. Cao, J. Shen, and L. Shao, “Improving single shot object detection with feature scale unmixing,” IEEE Transactions on Image Processing, vol. 30, pp. 2708–2721, 2021.
- [5] J. Xie, Y. Pang, H. Cholakkal, R. Anwer, F. Khan, and L. Shao, “Psc-net: learning part spatial co-occurrence for occluded pedestrian detection,” Science China Information Sciences, vol. 64, no. 2, pp. 1–13, 2021.
- [6] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi, “Action-driven visual object tracking with deep reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2239–2252, 2018.
- [7] L. Zhao, X. Gao, D. Tao, and X. Li, “Learning a tracking and estimation integrated graphical model for human pose tracking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3176–3186, 2015.
- [8] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
- [9] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
- [10] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
- [11] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, “The architectural implications of autonomous driving: Constraints and acceleration,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2018, pp. 751–766.
- [12] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, “Computer architectures for autonomous driving,” Computer, vol. 50, no. 8, pp. 18–25, 2017.
- [13] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, “Edge computing for autonomous driving: Opportunities and challenges,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1697–1716, 2019.
- [14] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning techniques for medical image segmentation: Achievements and challenges,” Journal of Digital Imaging, vol. 32, no. 4, pp. 582–596, 2019.
- [15] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
- [16] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [17] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
- [18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- [19] S. Bhattacharya and N. D. Lane, “Sparsification and separation of deep learning layers for constrained resource inference on wearables,” in Proceedings of the ACM Conference on Embedded Network Sensor Systems CD-ROM, 2016, pp. 176–189.
- [20] S. Yao, Y. Zhao, A. Zhang, L. Su, and T. Abdelzaher, “Deepiot: Compressing deep neural network structures for sensing systems with a compressor-critic framework,” in Proceedings of the ACM Conference on Embedded Network Sensor Systems, 2017, pp. 1–14.
- [21] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [22] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, and J. Du, “On-demand deep model compression for mobile devices: A usage-driven model selection framework,” in Proceedings of the Annual International Conference on Mobile Systems, Applications, and Services, 2018, pp. 389–400.
- [23] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
- [24] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
- [25] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid context network for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7519–7528.
- [26] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [28] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
- [29] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
- [30] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 603–612.
- [31] S. Ma, Y. Pang, J. Pan, and L. Shao, “Preserving details in semantics-aware context for scene parsing,” Science China Information Sciences, vol. 63, no. 2, pp. 1–14, 2020.
- [32] Z. Zhang and Y. Pang, “Cgnet: cross-guidance network for semantic segmentation,” Science China Information Sciences, vol. 63, no. 2, pp. 1–16, 2020.
- [33] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [34] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 405–420.
- [35] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 325–341.
- [36] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach, “Contextnet: Exploring context and detail for semantic segmentation in real-time,” arXiv preprint arXiv:1805.04554, 2018.
- [37] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” arXiv preprint arXiv:2004.02147, 2020.
- [38] H. Li, P. Xiong, H. Fan, and J. Sun, “Dfanet: Deep feature aggregation for real-time semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9522–9531.
- [39] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, and Y. Tong, “Semantic flow for fast and accurate scene parsing,” arXiv preprint arXiv:2002.10120, 2020.
- [40] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin, “Efficient dense modules of asymmetric convolution for real-time semantic segmentation,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
- [41] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
- [42] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
- [43] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
- [44] M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich et al., “Speeding up semantic segmentation for autonomous driving,” in MLITS, NIPS Workshop, vol. 2, 2016, p. 7.
- [45] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution residual networks for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4151–4160.
- [46] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- [47] Z. Wu, C. Shen, and A. v. d. Hengel, “Real-time semantic image segmentation via spatial sparsity,” arXiv preprint arXiv:1712.00213, 2017.
- [48] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “Psanet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 267–283.
Shijie Hao is an associate professor at the School of Computer Science and Information Engineering, Hefei University of Technology (HFUT). He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. He received his Ph.D. degree at HFUT in 2012. His research interests include image processing and multimedia content analysis.

Yuan Zhou is pursuing his Ph.D. degree at the School of Computer and Information, Hefei University of Technology. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. His research interests include image segmentation and few-shot learning.

Yanrong Guo is an associate professor at the School of Computer and Information, Hefei University of Technology (HFUT). She is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. She received her Ph.D. degree at HFUT in 2013. She was a postdoctoral researcher at the University of North Carolina at Chapel Hill (UNC) from 2013 to 2016. Her research interests include biomedical image segmentation and analysis.

Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He was a Research Fellow of the School of Computing with the National University of Singapore from 2008 to 2010. He is currently a Professor with the Hefei University of Technology, Hefei. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. He has coauthored over 70 publications in the areas of his research interests, which include multimedia content analysis and social media. He is a member of the ACM and the Executive Committee Member of the ACM SIGMM China Chapter. He was a recipient of the Best Paper Award from the ACM Multimedia 2010, the Best Paper Award from the ACM ICMR 2015, and the Honorable Mention of the IEEE Transactions on Multimedia Best Paper Award. He has served as the Technical Program Chair of MMM 2016. He has served as an Associate Editor of IEEE Multimedia Magazine, Neural Processing Letters (Springer), Information Sciences (Elsevier) and Signal Processing (Elsevier).

Jun Cheng received the B.Eng. and M.Eng. degrees from the University of Science and Technology of China, Hefei, China, in 1999 and 2002, respectively, and the Ph.D. degree from the Chinese University of Hong Kong, Hong Kong, in 2006. He is currently a Professor with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, machine intelligence, and control.

Meng Wang received the BE and PhD degrees in the Special Class for the Gifted Young and the Department of Electronic Engineering and Information Science from the University of Science and Technology of China (USTC), Hefei, China, in 2003 and 2008, respectively. He is a professor at the Hefei University of Technology, China. He is also with the Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education. His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored more than 200 book chapters, journal and conference papers in these areas. He is the recipient of the ACM SIGMM Rising Star Award 2014. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), the IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT), and the IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS).