
Dilated SpineNet for Semantic Segmentation

Abdullah Rashwan*  Xianzhi Du*  Xiaoqi Yin  Jing Li
Google
{arashwan,xianzhi,yinx,jingli}@google.com
* Authors contributed equally.
Abstract

Scale-permuted networks have shown promising results on object bounding box detection and instance segmentation. Scale permutation and cross-scale fusion of features enable the network to capture multi-scale semantics while preserving spatial resolution. In this work, we evaluate this meta-architecture design on semantic segmentation, another vision task that benefits from high spatial resolution and multi-scale feature fusion at different network stages. By further leveraging dilated convolution operations, we propose SpineNet-Seg, a network discovered by NAS starting from the DeepLabv3 system. SpineNet-Seg is designed with a better scale-permuted network topology and customized dilation ratios per block, searched directly on a semantic segmentation task. SpineNet-Seg models outperform the DeepLabv3/v3+ baselines at all model scales on multiple popular benchmarks in speed and accuracy. In particular, our SpineNet-S143+ model achieves the new state-of-the-art on the popular Cityscapes benchmark at 83.04% mIoU and attains strong performance on the PASCAL VOC2012 benchmark at 85.56% mIoU. SpineNet-Seg models also show promising results on a challenging Street View segmentation dataset. Code and checkpoints will be open-sourced.

Figure 1: Performance comparisons of SpineNet-Seg, DeepLabv3 and DeepLabv3+ on the PASCAL VOC2012 val set. The proposed SpineNet-Seg models outperform the other two families of models at all model scales. SpineNet-Seg adopts SpineNet-S49/S96/S143 backbones and DeepLabv3 and DeepLabv3+ adopt ResNet-50/101/152 backbones. Controlled experiments and detailed experimental settings can be found in Section 5.

1 Introduction

Preserving feature resolution and aggregating multi-scale feature information have long been challenges in achieving better semantic segmentation performance. Convolutional neural networks designed for image-level classification tasks [22, 36, 39, 40, 37, 8, 19, 33, 20] successively reduce feature resolution by pooling operations and strided convolutions at different network stages. Such networks only output low-resolution features with strong semantics, e.g., ResNet [19] reduces feature resolution to $1/32$ of the input resolution at the end of its $C_5$ stage and only outputs the $C_5$ features. This design is not optimal for semantic segmentation, as the pixel-wise classification task benefits from detailed spatial information and aggregation of features from multiple scales.

To solve these problems, researchers have proposed better network operations and architecture designs. The dilated convolution operator [29, 48, 3, 4, 5, 6] is one of the most popular methods for preserving feature resolution. The 'convolution with holes' design allows the network to use upsampled convolution kernels to extract abstract semantics without reducing feature resolution. Recently, scale-permuted networks discovered by neural architecture search (NAS) [14, 13] have shown promising results on the task of object detection. Scale permutation of the intermediate building blocks enables the network to capture strong semantics and retain high feature resolution throughout network stages. Cross-scale feature fusion aggregates multi-scale semantics, helping the network recognize objects at different scales.
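
To make the resolution-preserving property concrete, the following minimal TensorFlow sketch (illustrative only, not the released code) applies a 3×3 convolution with different dilation rates to the same feature map; with stride 1 and 'same' padding the spatial size never shrinks, while the receptive field grows with the dilation rate.

```python
import tensorflow as tf

# A dummy feature map: batch 1, 64x64 spatial resolution, 256 channels.
x = tf.random.normal([1, 64, 64, 256])

for rate in (1, 2, 4):
    # 'Convolution with holes': the 3x3 kernel samples inputs `rate` pixels
    # apart, enlarging the receptive field without any downsampling.
    conv = tf.keras.layers.Conv2D(
        filters=256, kernel_size=3, padding="same", dilation_rate=rate)
    y = conv(x)
    print(rate, y.shape)  # the spatial size stays (64, 64) for every rate
```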

In this work, we first explore the effectiveness of scale-permuted networks on the task of semantic segmentation. We simplify the search space proposed in [14] and use the backbone of DeepLabv3 [5] as the baseline for NAS. The architecture found by NAS improves over the baseline DeepLabv3 model by +2.06% mIoU on the PASCAL VOC2012 benchmark while using less computational resources. Secondly, we combine dilated convolution with the scale-permuted network to further improve semantic segmentation. We carefully design a joint search space over scale permutations, cross-scale connections, block adjustments and block dilation ratios. The final architecture, called SpineNet-Seg-49 (SpineNet-S49), improves mIoU by +2.47% over the baseline on the PASCAL VOC2012 benchmark while using 15% less computation. Lastly, we scale and modify the SpineNet-S49 architecture to generate two model families for regular-size and mobile-size semantic segmentation. In particular, our SpineNet-S143+ model achieves new state-of-the-art performance on Cityscapes at 83.04% mIoU and strong performance on PASCAL VOC2012 at 85.56% mIoU, under the setting of single-model single-scale inference without using extra data. Our mobile-size SpineNet-S49- outperforms the MobileNetv3-based DeepLab model by +2.5% while using less computational resources.

Our contributions are summarized as below:

  • We show that scale-permuted networks improve semantic segmentation.

  • We propose a novel search space that jointly searches for four components for semantic segmentation and design a proxy task for NAS.

  • We outperform the baseline DeepLabv3/v3+ models at all model scales by 2-3% mIoU on the PASCAL VOC2012 benchmark while using less computation.

  • We achieve new state-of-the-art on the Cityscapes benchmark at 83.04% mIoU by using single-model single-scale inference without extra training data.

  • We provide a family of Mobile SpineNet-Seg models for mobile-size semantic segmentation that outperform popular MobileNetv2/v3 based segmentation models.

The remainder of the paper is organized as follows. We discuss related work in Section 2. We describe our search space design and final architectures in Section 3. The application details of our regular-size and mobile-size segmentation systems are described in Section 4. Our main results and ablation studies are presented in Section 5. We conclude this work in Section 6.

2 Related Work

Semantic segmentation:

Performance of convolutional neural networks on the task of semantic segmentation has improved in recent years through better backbones and better network designs for semantic segmentation. Since the development of convolutional neural networks, researchers have proposed stronger network architectures with better designs and larger scales for the task of image classification, e.g., AlexNet [22], VGG [36], Inception [38, 39, 40], ResNet [19], Xception [8], DenseNet [21], MobileNet [33], Wide ResNet [49] and ResNeXt [45]. Such networks not only improve image classification but also transfer to downstream tasks such as object detection, semantic segmentation and depth estimation. On the other hand, better architecture designs for semantic segmentation have been proposed to preserve object details and to aggregate multi-scale context. [35, 28, 32, 1, 25, 55] propose an encoder-decoder design that first reduces feature resolution with an encoder to capture deep and coarse semantics, then recovers spatial resolution via upsampling or deconvolution [50] with a decoder. Shortcut connections can be used between the two components to aggregate multi-scale context. [4, 5, 6, 54, 18] adopt spatial pyramid pooling modules to aggregate context information from local to global at multiple grid scales. [29, 48, 3, 4, 5, 6, 44, 52] advocate using dilated convolution at certain stages of existing architectures to expand the receptive field of convolution kernels without downsampling. The resulting architectures are able to capture dense semantics without losing resolution.

NAS and search space designs:

NAS automates the design of neural network architectures to find better architectures in a predetermined search space on a target task. Recent architectures discovered by NAS have shown more promising results than handcrafted models on vision tasks including image classification [57, 31, 41, 20], object detection [16, 14, 13, 23, 43, 46, 24] and semantic segmentation [27, 34]. For image classification, typical search space designs include searching for the kernel size and filter size of convolutional layers, the number of layers per network stage, and additional operations such as shortcut connections, attention modules and activation functions. Recent works have developed customized search spaces for downstream tasks. For object detection, NAS-FPN [16] proposes a search space over layer scales and lateral connections for FPN [25]. SpineNet [14] designs a search space that searches for a new ordering of the network blocks of a baseline architecture and for cross-scale connections to connect all blocks. CR-NAS [24] redistributes computational resources by searching for better block repeat counts per network stage for ResNet models. Auto-DeepLab [27] is one of the pioneering works exploring NAS for semantic segmentation. Auto-DeepLab proposes a two-level hierarchical search space that learns operations at the block level and block resolutions at the network level, with a few handcrafted constraints reflecting common network design choices for semantic segmentation.

3 Methodology

This section starts by introducing our search space for semantic segmentation in Section 3.1. The baseline architecture and the computation allocation for NAS are explained in Section 3.2. The final SpineNet-Seg architecture discovered by NAS and its variants are described in Sections 3.3 and 3.4.

3.1 Search Space Design

The proposed search space consists of four components: a scale permutation for the building blocks of a baseline architecture; one cross-scale connection for each block; a level adjustment for each block (following [14], we use "level" to represent the resolution of a block: an $L_i$ block has a resolution of $\frac{1}{2^i}$ of the input resolution); and a dilation ratio for the convolution within each block.
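
To make the four components concrete, the sketch below shows one hypothetical way to encode a candidate architecture as a list of per-block decisions; the field names are ours for illustration, not identifiers from the released code, and the example values are taken from blocks $B_2$-$B_4$ of Tab. 2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlockSpec:
    """One searchable block of a scale-permuted candidate architecture."""
    block_id: int                     # position in the searched permutation
    level: int                        # L_i: resolution is 1/2^i of the input
    long_range_input: Optional[int]   # searched cross-scale connection (block id)
    level_adjustment: int             # searched offset, e.g. -1 or 0
    dilation_rate: int                # searched dilation for the 3x3 convolution

# A few blocks of one candidate; the short-range connection to the
# immediately preceding block is implicit and is not searched.
candidate = [
    BlockSpec(block_id=2, level=3, long_range_input=0, level_adjustment=-1, dilation_rate=1),
    BlockSpec(block_id=3, level=4, long_range_input=1, level_adjustment=-1, dilation_rate=2),
    BlockSpec(block_id=4, level=3, long_range_input=1, level_adjustment=0,  dilation_rate=1),
]
```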

Scale permutations and cross-scale connections:

Inspired by [14], we define the search space for scale permutation as permuting the ordering of the intermediate blocks. This results in a search space size of $N!$, where $N$ is the total number of blocks to be permuted. Unlike [14], where two cross-scale connections are searched per block, we only search for one long-range connection for each block and simplify the short-range connection to be between each pair of adjacent blocks. This greatly reduces the number of candidates in the search space from $\prod_{i=m}^{N+m-1} C^i_2$ to $\prod_{i=m}^{N+m-1} i$, where $m$ is the number of initial blocks, while not introducing any performance drop in architecture search.

Block level adjustments:

As the default block level distribution might not be optimal for the target task, we allow each block to search for a level adjustment from a list of integer candidates $\{A_1, A_2, \dots, A_a\}$. This results in a search space size of $a^N$.

Dilation ratios:

Lastly, we introduce the popular dilated convolution operator to the search space. We allow each block to search for one dilation ratio from a list of candidates $\{D_1, D_2, \dots, D_d\}$. This results in a search space size of $d^N$.
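
Combining the four components, the total number of candidates is $N! \cdot \prod_{i=m}^{N+m-1} i \cdot a^N \cdot d^N$. The short plain-Python calculation below evaluates this for the block counts used later in Section 3.2 ($N=19$ intermediate blocks, $m=2$ initial blocks, $a=2$ level adjustments, $d=3$ dilation ratios); the separately handled output block is ignored for simplicity.

```python
import math

N = 19   # intermediate blocks to permute: 1xL2 + 3xL3 + 6xL4 + 9xL5
m = 2    # initial blocks
a = 2    # level-adjustment candidates: {-1, 0}
d = 3    # dilation-ratio candidates: {1, 2, 4}

permutations = math.factorial(N)
# The i-th block added can pick its one long-range connection from any of
# the i blocks that already exist, for i = m .. N+m-1.
connections = math.prod(range(m, N + m))
adjustments = a ** N
dilations = d ** N

total = permutations * connections * adjustments * dilations
print(f"roughly 10^{math.log10(total):.0f} candidate architectures")
```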

3.2 Baseline and Computation Allocations

Model Downsample FLOPs (B) mIoU
ResNet-50 end 117 79.4
ResNet-S50 beginning 85 79.2
Table 1: A performance comparison of the original DeepLabv3+ ResNet-50 backbone that downsamples at the end of each stage and our modified ResNet-S50 backbone that downsamples at the beginning of each stage. Results are reported with the DeepLabv3+ system on the PASCAL VOC2012 val dataset.
Block id  BP   CC    LA  DR  FD
B_0       L_2  -     -   -   64
B_1       L_2  -     -   -   64
B_2       L_3  B_0   -1  1   64
B_3       L_4  B_1   -1  2   128
B_4       L_3  B_1   0   1   128
B_5       L_3  B_2   0   1   128
B_6       L_6  B_3   0   1   512
B_7       L_6  B_5   -1  2   512
B_8       L_7  B_4   0   1   512
B_9       L_7  B_6   0   4   512
B_10      L_5  B_6   0   1   512
B_11      L_5  B_7   0   2   512
B_12      L_4  B_8   0   4   256
B_13      L_4  B_9   0   1   256
B_14      L_5  B_11  0   4   512
B_15      L_4  B_11  0   4   256
B_16      L_4  B_12  0   1   256
B_17      L_2  B_14  0   4   64
B_18      L_7  B_16  -1  2   512
B_19      L_6  B_15  0   1   512
B_20      L_4  B_17  -1  4   128
B_21      -    B_0   0   1   128
Table 2: Learned network configurations for the SpineNet-S49 architecture. We show the detailed configurations for each block for the search space components described in Section 3.1. BP: block permutation. CC: cross-scale connection. LA: level adjustment. DR: dilation ratio. FD: feature dimension.

Searching for a scale-permuted network starts from a baseline network. In this work, we adopt the ResNet-50 backbone of DeepLabv3 [5] with an output stride of 16 and with stage 5 repeated twice. Unlike DeepLabv3, which downsamples the features at the end of each network stage, we modify the downsampling to happen at the beginning of each stage. This saves about 30% of the computational cost with negligible loss in performance. A study of the effect of this modification is shown in Tab. 1. We refer to the modified backbone as ResNet-Seg-50 (ResNet-S50), while we refer to the original DeepLabv3+ backbones as ResNet-50/101/152 (unless stated otherwise, stage 5 is repeated twice when referring to different ResNet models, e.g., ResNet-50/101/152).
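
The sketch below shows a simplified Keras-style residual stage (not the released implementation) that illustrates the ResNet-S modification: the stride-2 downsampling is applied by the first block of the stage, so every subsequent block in the stage already operates on the smaller feature map, which is where the roughly 30% FLOPs saving comes from.

```python
import tensorflow as tf

def bottleneck(x, filters, stride=1):
    """A plain bottleneck residual block (1x1 -> 3x3 -> 1x1); BN omitted."""
    shortcut = x
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = tf.keras.layers.Conv2D(4 * filters, 1, strides=stride)(x)
    y = tf.keras.layers.Conv2D(filters, 1, activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same",
                               activation="relu")(y)
    y = tf.keras.layers.Conv2D(4 * filters, 1)(y)
    return tf.keras.layers.ReLU()(y + shortcut)

def stage(x, filters, num_blocks):
    # Downsample at the *beginning* of the stage: all remaining blocks of
    # the stage run at the reduced resolution.
    x = bottleneck(x, filters, stride=2)
    for _ in range(num_blocks - 1):
        x = bottleneck(x, filters)
    return x

inputs = tf.keras.Input([512, 512, 64])
outputs = stage(inputs, filters=64, num_blocks=3)   # 512x512 -> 256x256
model = tf.keras.Model(inputs, outputs)
```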

Figure 2: The final SpineNet-S49 model for semantic segmentation. $\{L_2, L_3, L_4, L_5\}$ blocks are colored in purple, yellow, green and blue, respectively.

ResNet-S50 provides a block allocation of $\{3\times L_2, 4\times L_3, 6\times L_4, 9\times L_5\}$ bottleneck blocks. We take two $L_2$ blocks to build an initial network that forms the initial search space for cross-scale connections and reserve one $L_3$ block to construct an output block. For the intermediate blocks, we first search for a permutation of the remaining block allocation $\{1\times L_2, 3\times L_3, 6\times L_4, 9\times L_5\}$. Secondly, for each block, we connect it to its immediately preceding block and search for one cross-scale connection from the other previous blocks. The same resampling strategy as in [14] is adopted when merging blocks. Thirdly, we search for one level adjustment within $\{-1, 0\}$. This is because ResNet-S50 keeps feature resolution but increases feature dimension for stage 4 and above. In order to constrain the computation of candidates in the search space to be no larger than the baseline, we only allow each block to either keep or decrease its level (i.e., increase feature resolution). Lastly, we search for a dilation ratio within $\{1, 2, 4\}$. The dilation ratio is applied to the 3×3 convolution in each bottleneck block. For the output block, following the common output design of semantic segmentation networks [5, 6, 20], we append an $L_3$ block (i.e., output stride 8) at the top and only search for a cross-scale connection and a dilation ratio.
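
The cross-scale merge can be sketched as below, assuming bilinear resizing plus a 1×1 projection for resampling; the actual resampling strategy follows [14] and may differ in details such as normalization and the choice of resizing operator.

```python
import tensorflow as tf

def resample(x, source_level, target_level, target_dim):
    """Bring a parent feature map to the target block's resolution and width."""
    scale = 2 ** (source_level - target_level)   # >1 upsamples, <1 downsamples
    h, w = x.shape[1], x.shape[2]
    x = tf.image.resize(x, [int(h * scale), int(w * scale)])
    return tf.keras.layers.Conv2D(target_dim, 1)(x)  # match the feature dimension

def merge_inputs(short_range, short_level, long_range, long_level,
                 target_level, target_dim):
    """Sum the resampled short-range and searched long-range inputs."""
    a = resample(short_range, short_level, target_level, target_dim)
    b = resample(long_range, long_level, target_level, target_dim)
    return tf.keras.layers.ReLU()(a + b)

# Example: merge an L5 parent and an L3 parent into an L4 block of width 256.
p5 = tf.random.normal([1, 16, 16, 512])   # level-5 feature of a 512x512 input
p3 = tf.random.normal([1, 64, 64, 128])   # level-3 feature
l4_input = merge_inputs(p5, 5, p3, 3, target_level=4, target_dim=256)
```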

3.3 SpineNet-Seg Architectures

SpineNet-Seg architectures are searched with a DeepLabv3 system on a semantic segmentation task. The final SpineNet-S49 configuration discovered by NAS is shown in Tab. 2. Feature dimensions $\{32, 64, 128, 256, 512\}$ are used for $\{L_1, L_2, L_3, L_4, L_5\}$ blocks, respectively. Based on SpineNet-S49, we construct three larger models, named SpineNet-S96, SpineNet-S143 and SpineNet-S143+, by repeating each block in SpineNet-S49 twice or three times. When repeating one block, we construct replicas of the block and connect them to the original block sequentially without introducing any new cross-scale connections. For our largest model, SpineNet-S143+, we further uniformly upscale the feature dimension of the convolutional layers by $1.3\times$.
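
The block-repeat scaling rule can be sketched as follows, reusing the hypothetical BlockSpec encoding from Section 3.1: every searched block is expanded into a short sequential chain of replicas, and only the original block keeps its searched cross-scale connection.

```python
import dataclasses

def expand_block_repeats(block_specs, repeats):
    """Repeat each block `repeats` times without adding cross-scale connections.

    Replicas are chained sequentially after the original block; only the
    original block keeps its searched long-range input.
    """
    expanded = []
    for spec in block_specs:
        expanded.append(spec)
        for _ in range(repeats - 1):
            expanded.append(dataclasses.replace(spec, long_range_input=None))
    return expanded

# SpineNet-S96 corresponds to repeats=2 and SpineNet-S143/S143+ to repeats=3;
# S143+ additionally widens every convolutional layer by 1.3x.
```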

We control the output stride of the model by changing the size of the output block and its cross-scale connection, while preserving the sizes of the rest of the layers. This is computationally more efficient than changing all layers whose feature maps are smaller than the required output resolution, as proposed by DeepLabv3 [5]. Unless stated otherwise, our models have an output stride of 8.

3.4 Mobile-size SpineNet-Seg Architectures

Unlike regular semantic segmentation systems that use standard convolution operations, mobile-size systems require low-computation operations. Inspired by [33, 20, 13], we adopt the inverted bottleneck block, which employs depthwise separable convolution [8] as its main operator, to build mobile-size SpineNet-Seg models.

We construct Mobile SpineNet-S49 and Mobile SpineNet-S49- by replacing all bottleneck blocks with inverted bottleneck blocks. Feature dimensions $\{16, 24, 40, 80, 112\}$ and an expansion ratio of 6 are used for $\{L_1, L_2, L_3, L_4, L_5\}$ blocks in Mobile SpineNet-S49, respectively. Mobile SpineNet-S49- uniformly downscales the feature dimension of all convolutional layers by $0.65\times$.
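
For reference, below is a minimal Keras sketch of a generic inverted bottleneck block (MobileNetV2-style expansion, depthwise 3×3, linear projection, batch normalization omitted); the released blocks may differ in details such as normalization and activation.

```python
import tensorflow as tf

def inverted_bottleneck(x, out_dim, expansion=6, stride=1, dilation=1):
    """Expand -> depthwise 3x3 -> linear projection, with a residual when
    the input and output shapes match."""
    in_dim = x.shape[-1]
    y = tf.keras.layers.Conv2D(in_dim * expansion, 1, activation="relu")(x)
    y = tf.keras.layers.DepthwiseConv2D(
        3, strides=stride, padding="same",
        dilation_rate=1 if stride > 1 else dilation)(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(out_dim, 1)(y)        # linear (no activation)
    if stride == 1 and in_dim == out_dim:
        y = y + x                                    # residual connection
    return y

# Example: an L3 block of Mobile SpineNet-S49 uses out_dim=40, expansion=6.
features = tf.random.normal([1, 64, 64, 40])
out = inverted_bottleneck(features, out_dim=40, dilation=2)
```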

Model Backbone FLOPs (B) #Params (M) mIoU
DeepLabv3 ResNet-50 115 74 78.3
DeepLabv3+ ResNet-50 117 75 79.4
SpineNet-Seg SpineNet-S49 98 69 81.4
DeepLabv3 ResNet-101 193 93 79.3
DeepLabv3+ ResNet-101 195 94 79.9
SpineNet-Seg SpineNet-S96 154 116 82.6
DeepLabv3 ResNet-152 271 109 80.2
DeepLabv3+ ResNet-152 273 110 81.0
SpineNet-Seg SpineNet-S143 210 162 83.3
Table 3: Result comparisons on the PASCAL VOC2012 val set. The proposed SpineNet-Seg models outperform the DeepLabv3 baselines and DeepLabv3+ models at all scales. All models are trained under the same settings.

4 Applications

We plug in SpineNet-Seg models as the backbones of the DeepLabv3 system for semantic segmentation. On top of the backbone, we apply an Atrous Spatial Pyramid Pooling (ASPP) module, $n$ convolutions with kernel size 3 and feature dimension 256 followed by batch normalization and activation, and a final classification layer with kernel size 3 to compute pixel-wise predictions. The final architecture of the SpineNet-S49 model is shown in Fig. 2. Normally, we directly build the final classification layer on top of the ASPP module ($n=0$). However, we found that using 2 convolutional layers ($n=2$) is essential for stable training when the output stride is 4 and for larger model sizes. In particular, the SpineNet-S143+ model is trained with 2 convolutional layers at the head ($n=2$), while the rest of the models are trained with $n=0$. For mobile-size systems, we replace regular convolutions with depthwise separable convolutions in the ASPP module and the segmentation head.
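
A simplified sketch of this head is given below (batch normalization omitted, and with the PASCAL VOC2012 ASPP rates of 12, 24 and 36 from Section 5.2 as defaults); it is meant to illustrate the structure rather than reproduce the exact released head.

```python
import tensorflow as tf

def aspp(x, dim=256, rates=(12, 24, 36)):
    """Simplified ASPP: a 1x1 branch, parallel dilated 3x3 branches and
    image-level pooling, concatenated and projected back to `dim` channels."""
    branches = [tf.keras.layers.Conv2D(dim, 1, activation="relu")(x)]
    for rate in rates:
        branches.append(tf.keras.layers.Conv2D(
            dim, 3, padding="same", dilation_rate=rate, activation="relu")(x))
    pooled = tf.reduce_mean(x, axis=[1, 2], keepdims=True)
    pooled = tf.keras.layers.Conv2D(dim, 1, activation="relu")(pooled)
    branches.append(tf.image.resize(pooled, tf.shape(x)[1:3]))
    y = tf.keras.layers.Concatenate()(branches)
    return tf.keras.layers.Conv2D(dim, 1, activation="relu")(y)

def segmentation_head(features, num_classes, n=0):
    """ASPP, n extra 3x3 convolutions, then a 3x3 pixel-wise classifier."""
    y = aspp(features)
    for _ in range(n):   # n=2 for SpineNet-S143+, n=0 otherwise
        y = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(y)
    return tf.keras.layers.Conv2D(num_classes, 3, padding="same")(y)
```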

EMA  640×640  COCO  mIoU
-    -        -     81.02
✓    -        -     81.39
✓    ✓        -     82.09
✓    ✓        ✓     83.49
Table 4: An ablation study of the training settings. Results are reported with the SpineNet-S49 model on the PASCAL VOC2012 val set. EMA: using an exponential moving average of the model weights during training. 640×640: using training and evaluation image sizes of 640×640. COCO: model pretrained on the COCO dataset.

5 Experimental Results

We evaluate our SpineNet-Seg models on the PASCAL VOC2012 benchmark [15], the Cityscapes benchmark [9] and a challenging large-scale Street View dataset.

Unless stated otherwise, all experiments (including baselines) use the SGD optimizer with momentum of 0.9, a cosine learning rate schedule, and an exponential moving average (EMA) of the model weights with decay of 0.9998. For pretraining, we use the ImageNet [11] and COCO [26] datasets. Unless stated otherwise, the reported results (including DeepLabv3 and DeepLabv3+) use ImageNet pretraining. All results in this section are computed with single-scale inference. For fair comparison, we always scale mask predictions to the original image sizes to compute mIoU.
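
The shared optimizer setup can be written roughly as follows (a TensorFlow sketch; the step count and initial learning rate shown are the COCO pretraining values from Section 5.1 and differ per dataset).

```python
import tensorflow as tf

total_steps = 64_000   # e.g. the COCO pretraining schedule (Section 5.1)
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.08, decay_steps=total_steps)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# Exponential moving average of the model weights with decay 0.9998:
# call ema.apply(model.trainable_variables) after each training step and
# evaluate with the averaged weights.
ema = tf.train.ExponentialMovingAverage(decay=0.9998)
```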

Model Backbone mIoU
DFN [47] ResNet-101 80.46
Auto-Deeplab [27] Auto-Deeplab-L 80.75
GCN [30] ResNet-GCN 81.00
DeepLabv3+ [6] Xception-65 82.45
ExFuse [53] ResNeXt-131 85.40
SpineNet-Seg SpineNet-S49 83.49
SpineNet-Seg SpineNet-S96 85.16
SpineNet-Seg SpineNet-S143 85.64
Table 5: Results on the PASCAL VOC2012 val set for different SpineNet-Seg models compared to other models. Results in this table use single-scale inference with ImageNet and COCO pretraining.

5.1 Pretraining

ImageNet pretrain:

We pretrain the models on the ImageNet-1k dataset for 350 epochs with a batch size of 4096. We use the following setup for ImageNet pretraining: a cosine learning rate schedule with an initial learning rate of 1.6, regular batch normalization with momentum of 0.99, L2 weight decay of 4e-5, label smoothing of 0.1, training on random crops of 320×320, and RandAugment [10] for image augmentation. EMA is not adopted for ImageNet pretraining.

COCO pretrain:

We use COCO-Things semantic segmentation labels, where only the annotations for things are used as foreground classes and the rest is treated as background. We build the ASPP module such that it matches the ASPP used for the target dataset. We use the following setup for COCO pretraining: image sizes of 512×512 with random horizontal flips and scale jittering of [0.5, 2.0], a batch size of 256, sync batch normalization with momentum of 0.99, and L2 weight decay of 1e-5. We train the models for 64k steps with a cosine learning rate schedule with an initial learning rate of 0.08. EMA is not used for COCO pretraining.

Model Backbone FLOPs (B) #Params (M) mIoU
DeepLabv3+ ResNet-50 1092 76 79.84
SpineNet-Seg SpineNet-S49 798 69 81.06
SpineNet-Seg SpineNet-S96 1272 117 81.45
SpineNet-Seg SpineNet-S143 1722 164 82.11
SpineNet-Seg SpineNet-S143+ 2766 275 83.04
Table 6: Result comparisons on the Cityscapes val set. SpineNet-S49 outperforms DeepLabv3+ with a ResNet-50 backbone in both accuracy and speed. SpineNet-S49/S96/S143 and DeepLabv3+ models are trained under the same settings. SpineNet-S143+ additionally adopts the best training recipe to achieve the best performance.
Model Backbone Multi-scale Test mIoU
DeepLabv3+ [6] Xception-71 79.55
MDEQ-XL [2] MDEQ 80.30
AutoDeeplab [27] AutoDeeplab-L 80.33
RepVGG [12] RepVGG-B2 80.57
HRNetV2 [42] HRNetV2-W48 81.10
Panoptic-DeepLab [7] - 81.50
HRNetV2 + OCR [42] HRNetV2-W48 81.60
ResNeSt [51] ResNeSt-200 82.70
SpineNet-Seg SpineNet-S143+ 83.04
Table 7: State-of-the-art on the Cityscapes val set. We compare our best model on the Cityscapes val set to other models reported in literature. Note that our model uses single-scale input for inference and is trained without using extra data.
EMA  OS=4  COCO  mIoU
-    -     -     81.92
✓    -     -     82.11
✓    ✓     -     82.67
✓    ✓     ✓     83.04
Table 8: Cityscapes val set results. These results are obtained using single-scale inference with no horizontal flipping. OS: the output stride of the SpineNet-Seg model; the default output stride is 8.

5.2 Results on PASCAL VOC2012

PASCAL VOC2012 [15] is a semantic segmentation dataset with 20 foreground classes and 1 background class. For training, we use an augmented version of the dataset [17] with extra annotations for 10582 images (trainaug). The default training setup uses training image sizes of 512×512 with scale jittering of [0.5, 2.0] and random horizontal image flipping. We use a batch size of 32 and sync batch normalization with momentum of 0.9997. We use dilation rates of 12, 24 and 36 to build the ASPP module. We train for 20k steps.
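
The augmentation pipeline described above (scale jittering in [0.5, 2.0], a 512×512 random crop and random horizontal flipping) can be sketched as below; this is a generic TensorFlow version assuming an H×W×3 float image and an H×W×1 integer label, with 255 assumed as the ignore index for padded label pixels.

```python
import tensorflow as tf

def augment(image, label, crop_size=512, scale_range=(0.5, 2.0)):
    """Random scale jittering, random crop and random horizontal flip."""
    # Scale jittering.
    scale = tf.random.uniform([], scale_range[0], scale_range[1])
    size = tf.cast(tf.cast(tf.shape(image)[:2], tf.float32) * scale, tf.int32)
    image = tf.image.resize(image, size)
    label = tf.image.resize(label, size, method="nearest")

    # Pad so a crop_size x crop_size crop always fits, then crop the image
    # and label with the same random offsets.
    pad_h = tf.maximum(crop_size - size[0], 0)
    pad_w = tf.maximum(crop_size - size[1], 0)
    image = tf.pad(image, [[0, pad_h], [0, pad_w], [0, 0]])
    label = tf.pad(label, [[0, pad_h], [0, pad_w], [0, 0]],
                   constant_values=255)   # assumed ignore index
    combined = tf.concat([image, tf.cast(label, image.dtype)], axis=-1)
    combined = tf.image.random_crop(combined, [crop_size, crop_size, 4])
    image, label = combined[..., :3], tf.cast(combined[..., 3:], tf.int32)

    # Random horizontal flip, applied jointly to image and label.
    if tf.random.uniform([]) > 0.5:
        image, label = image[:, ::-1, :], label[:, ::-1, :]
    return image, label
```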

Tab. 3 and Fig. 1 show performance comparisons of our SpineNet-Seg models against DeepLabv3 and DeepLabv3+ with counterpart ResNet backbones. All models are trained using the same experimental setup. Our results show consistent +3% and +2% improvements in mIoU across all model scales compared to the DeepLabv3 and DeepLabv3+ models, respectively. Specifically, SpineNet-S49, SpineNet-S96 and SpineNet-S143 show improvements of +2%, +2.66% and +2.29% in mIoU compared to DeepLabv3+ with ResNet-50, ResNet-101 and ResNet-152 backbones, respectively. While achieving significant gains over the DeepLab models, the SpineNet models are also less computationally expensive than their DeepLab ResNet counterparts.

Tab. 4 studies the effect of different training setups on the PASCAL VOC2012 val set. We found that using an EMA of the model weights improves mIoU by 0.37%. We also increase the training and evaluation image sizes to 640×640 and resize the prediction masks to the original image sizes for fair comparison. Using image sizes of 640×640 brings an improvement of 0.7%. Finally, pretraining on the COCO dataset brings an improvement of 1.4% in mIoU. As a result, our SpineNet-S49 model achieves 83.49% mIoU on the PASCAL VOC2012 val set.

Tab. 5 summarizes the effect of scaling up the model size using block repeats of 1, 2 and 3 (SpineNet-S49, SpineNet-S96 and SpineNet-S143, respectively). In these experiments, we use the best training setup from Tab. 4. SpineNet-S96 improves mIoU by 1.67% on the PASCAL VOC2012 val set, and SpineNet-S143 improves mIoU by another 0.48%. Our best model using single-scale inference achieves 85.64% mIoU on the PASCAL VOC2012 val set. We also compare our best models with previous work in Tab. 5.

5.3 Results on Cityscapes

Cityscapes [9] contains high-quality pixel-level annotations for 5000 images (2975, 500 and 1525 for the train, validation and test splits, respectively). It also contains 20000 coarsely annotated images for training. In our experiments, we only use the finely annotated train split for training and evaluate on the validation split. Following [9], we train and evaluate on 19 semantic labels and ignore the void label.

We train on crops of 512×1024 with scale jittering of [0.5, 2.0] and random horizontal image flipping. We use a batch size of 64 and sync batch normalization with momentum of 0.99. We use dilation rates of 12, 24, 36 and 72 to build the ASPP module. We train each experiment for 100k iterations.

Tab. 6 compares SpineNet-S49 to DeepLabv3+ with a ResNet-50 backbone. Both models are trained using the same training setup, including ImageNet pretraining, batch size and EMA of the model weights. Our SpineNet-S49 model shows an improvement of +1.22% in mIoU compared to its DeepLabv3+ ResNet-50 counterpart. In the same table, we show the effect of scaling up the model using block repeats of 1, 2 and 3 (SpineNet-S49, SpineNet-S96 and SpineNet-S143, respectively). SpineNet-S96 improves performance by 0.39%, while SpineNet-S143 further improves mIoU by another 0.66%.

Tab. 8 shows the effect of different training setups on the Cityscapes val set. We use the SpineNet-S143 model (block repeats of 3 and output stride of 8) as the baseline for this ablation study. We found that using an exponential moving average of the model weights improves mIoU by 0.19%. Moreover, we adopt the largest backbone, SpineNet-S143+, and change the output stride of the model to 4, which improves mIoU by 0.56% on the Cityscapes val set. Finally, we pretrain SpineNet-S143+ on COCO and finetune on Cityscapes, which further improves performance by 0.37%. As shown in Tab. 7, our best model, at 83.04% mIoU on the Cityscapes val set, achieves the new state-of-the-art under single-scale inference.

5.3.1 Mobile SpineNet-Seg Results

Model Params (M) mIoU
MobileNetV3 [20] 3.60 72.64
Mobile SpineNet-S49- 3.15 75.18
Mobile SpineNet-S49 4.40 77.41
Table 9: Mobile SpineNet-Seg results on the Cityscapes val set. We compare our models to the MobileNetV3-based segmentation model.

We follow the training setup of the SpineNet-Seg models to train two mobile-size models: Mobile SpineNet-S49 and Mobile SpineNet-S49-. As shown in Tab. 9, the Mobile SpineNet-Seg models achieve significantly better mIoU at comparable or smaller model sizes than the MobileNetV3-based model.

5.4 Results on Street View

Figure 3: Example Street View images with semantic labels

We further evaluate SpineNet-Seg on a challenging dataset of Street View images. The dataset contains 57k train images and 13k test images, with 44 semantic categories covering typical street scenes, e.g., buildings, sidewalks, traffic signs and cars. In addition, the dataset is collected across 6 continents, 39 countries and 80+ cities worldwide under diverse conditions. Fig. 3 shows a typical example from the dataset. Given its complexity, size and geo-diversity, we believe this is a practical stress test for the proposed model.

For a controlled experiment, we compare our SpineNet-S143+ to ResNet-152 using the same experimental settings. We use train and eval image sizes of 1152×768 with scale jittering of [0.5, 2.0] and random horizontal image flipping. We use a batch size of 128 and sync batch normalization with momentum of 0.99. We use dilation rates of 12, 24, 36 and 72 to build the ASPP module. We train each experiment for 100k iterations. We also compare to DeepLabv3+ with an Xception-65 backbone trained according to the settings suggested in [6]. All experiments are trained and evaluated on the same image sizes.

Tab. 10 compares SpineNet-Seg to the DeepLabv3+ models. Our SpineNet-S143+ model improves mIoU by 1.64% compared to DeepLabv3+ with a ResNet-152 backbone. We also observe a 2.46% improvement in mIoU compared to DeepLabv3+ with an Xception-65 backbone. These results support our claim that SpineNet-Seg outperforms DeepLabv3+ models in challenging real-world scenarios.

Model Backbone mIoU
DeepLabv3+ Xception-65 57.06
DeepLabv3+ ResNet-152 57.88
SpineNet-Seg SpineNet-S143+ 59.52
Table 10: Results on Street View Dataset.

5.5 NAS Experiments

5.5.1 NAS implementation details

We run NAS for 10k trials and evaluate the best 10 architectures on PASCAL VOC2012. Due to the large number of trials, we design a proxy search task to quickly evaluate the architecture candidates. For all search experiments, we uniformly downscale the feature dimension of the convolutional layers of the candidate models by $0.5\times$ and use image sizes of 384×384 for training and evaluation. We run training for 30k steps with a batch size of 64 and collect the final evaluation mIoU as the reward for the controller [56].
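
Structurally, the search couples a controller with this proxy task roughly as sketched below; `controller`, `build_candidate` and `train_and_eval_proxy` are hypothetical callables standing in for the actual NAS infrastructure, which is not described in code in this paper.

```python
def run_search(controller, build_candidate, train_and_eval_proxy,
               num_trials=10_000, top_k=10):
    """RL-style NAS loop: the proxy task's final mIoU is the controller reward."""
    history = []
    for _ in range(num_trials):
        # Sample one candidate: block permutation, cross-scale connections,
        # level adjustments and dilation ratios (Section 3.1).
        config = controller.sample()
        # Proxy task: 0.5x feature dimensions, 384x384 images, 30k steps,
        # batch size 64; the final evaluation mIoU is the reward.
        model = build_candidate(config, filter_multiplier=0.5)
        reward = train_and_eval_proxy(model, image_size=384,
                                      steps=30_000, batch_size=64)
        controller.update(config, reward)   # policy-gradient update as in [56]
        history.append((config, reward))
    # The best candidates are then fully trained and evaluated on PASCAL VOC2012.
    return sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
```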

Model  Baseline  R-S50†  R-S50  R-S101
mIoU Gain (%)  +0  +0.79  +2.47  +2.02
Table 11: Effectiveness of different search baselines. We show the mIoU gain on the PASCAL VOC2012 val set when using different architectures, ResNet-S50 and ResNet-S101, as the baselines for NAS. Backbones marked with † indicate that stage 5 is not repeated.
Search Components  Baseline  SP  SP+DR
mIoU Gain +0 +2.15 +2.47
Table 12: An ablation of searching for scale-permuted network and dilation ratios. We show mIoU gain on PASCAL VOC2012 val set when searching for scale-permuted network or jointly searching for scale-permuted network and dilation ratios. SP: scale-permuted network. DR: dilation ratio.
Output Stride Baseline 4 8 16
mIoU gain +0 +0.67 +2.15 +1.47
Table 13: Impact of different output strides. This table shows mIoU gain on PASCAL VOC2012 val set when searching architectures with different output strides.
Search Dataset Baseline COCO Things Stuff+Things
mIoU Gain +0 +2.47 +2.06
Table 14: A study on the search dataset. We show mIoU gain on PASCAL VOC2012 val set when using COCO Things or COCO Stuff+Things for NAS.

5.5.2 Ablation studies of the search designs

First, we experiment with different baseline architectures to search from. Inspired by DeepLabv3, we consider three architectures, ResNet-S50†, ResNet-S50 and ResNet-S101, where † indicates the absence of stages 6 and 7. Tab. 11 shows that searching from ResNet-S50 yields the best improvement, with a +2.47% mIoU gain over the baseline. We also study the effect of searching for the scale-permuted network and the dilation ratios in Tab. 12. We found that searching for the scale-permuted network alone yields a +2.15% mIoU gain. Further searching for dilation ratios improves mIoU by another +0.32%. Finally, we study the effect of fixing the output stride to 4, 8 or 16 during search in Tab. 13, while also changing the feature dimension of the output block accordingly such that the model size is preserved across the search jobs. We found that an output stride of 8 yields the best gain of +2.15% in mIoU.

5.5.3 A study on search dataset

The search dataset is important since the evaluation signals influence the quality of the searched architectures. For instance, performance on the PASCAL VOC2012 dataset depends heavily on the quality of the pretrained checkpoint, hence it is difficult to use for NAS, which trains proxy tasks from scratch. We decided to use the COCO dataset since it is diverse and, unlike Cityscapes, has significantly smaller images. Training proxy tasks from scratch on the COCO dataset usually converges, and the evaluation signals can be used as stable rewards to update the NAS controller. We study the effect of using COCO Things annotations versus COCO Things+Stuff annotations. Tab. 14 shows that COCO Things achieves better performance (+2.47% mIoU gain) compared to COCO Things+Stuff (+2.06%).

6 Conclusion

In this work, we evaluated the effectiveness of scale-permuted architectures on the task of semantic segmentation, a vision task that benefits from high feature resolution and multi-scale feature fusion. We proposed a new search space that simplifies the SpineNet search space and introduces new search components for semantic segmentation, and learned the SpineNet-S49 architecture by NAS with a carefully designed proxy task. We further constructed two families of models based on SpineNet-S49: SpineNet-Seg models and Mobile SpineNet-Seg models. SpineNet-Seg models outperform the popular DeepLabv3/v3+ models on the PASCAL VOC2012 benchmark and a challenging Street View dataset, and achieve state-of-the-art performance on the Cityscapes benchmark. Mobile SpineNet-Seg models achieve new state-of-the-art performance on mobile-size semantic segmentation, surpassing popular mobile segmentation systems such as MobileNetV2/V3. We expect scale-permuted networks with task-specific designs to benefit more computer vision tasks in the future.

References

  • [1] Vijay Badrinarayanan, Alex Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:2481–2495, 2017.
  • [2] Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5238–5250. Curran Associates, Inc., 2020.
  • [3] Liang-Chieh Chen, G. Papandreou, I. Kokkinos, Kevin Murphy, and A. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR, abs/1412.7062, 2015.
  • [4] Liang-Chieh Chen, G. Papandreou, I. Kokkinos, Kevin Murphy, and A. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:834–848, 2018.
  • [5] Liang-Chieh Chen, G. Papandreou, Florian Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. ArXiv, abs/1706.05587, 2017.
  • [6] Liang-Chieh Chen, Y. Zhu, G. Papandreou, Florian Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. ArXiv, abs/1802.02611, 2018.
  • [7] Bowen Cheng, Maxwell D. Collins, Y. Zhu, T. Liu, T. Huang, H. Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12472–12482, 2020.
  • [8] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
  • [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [10] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space, 2019.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [12] Xiaohan Ding, X. Zhang, Ningning Ma, J. Han, G. Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. ArXiv, abs/2101.03697, 2021.
  • [13] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, M. Tan, Quoc V. Le, and Xiaodan Song. Efficient scale-permuted backbone with learned resource distribution. ArXiv, abs/2010.11426, 2020.
  • [14] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • [16] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
  • [17] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
  • [18] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37:1904–1916, 2015.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [20] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, 2019.
  • [21] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [23] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In ECCV, 2018.
  • [24] Feng Liang, Chen Lin, Ronghao Guo, Ming Sun, Wei Wu, J. Yan, and Wanli Ouyang. Computation reallocation for object detection. ArXiv, abs/1912.11234, 2020.
  • [25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [27] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
  • [28] Hyeonwoo Noh, Seunghoon Hong, and B. Han. Learning deconvolution network for semantic segmentation. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015.
  • [29] G. Papandreou, I. Kokkinos, and P. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 390–399, 2015.
  • [30] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • [31] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI, 2019.
  • [32] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597, 2015.
  • [33] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [34] Albert Eaton Shaw, D. Hunter, Forrest N. Iandola, and S. Sidhu. Squeezenas: Fast neural architecture search for faster semantic segmentation. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 2014–2024, 2019.
  • [35] Evan Shelhamer, J. Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:640–651, 2017.
  • [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
  • [38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
  • [40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [41] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
  • [42] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. PAMI, 2020.
  • [43] Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, and Chunhua Shen. NAS-FCOS: Fast neural architecture search for object detection. arXiv preprint arXiv:1906.04423, 2019.
  • [44] Zifeng Wu, Chunhua Shen, and A. V. D. Hengel. Bridging category-level and instance-level semantic image segmentation. ArXiv, abs/1605.06885, 2016.
  • [45] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [46] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In ICCV, 2019.
  • [47] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1857–1866, 2018.
  • [48] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2016.
  • [49] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • [50] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pages 2018–2025, 2011.
  • [51] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. Resnest: Split-attention networks, 2020.
  • [52] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan. Scale-adaptive convolutions for scene parsing. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2050–2058, 2017.
  • [53] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 269–284, 2018.
  • [54] Hengshuang Zhao, J. Shi, Xiaojuan Qi, Xiaogang Wang, and J. Jia. Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6230–6239, 2017.
  • [55] Barret Zoph, G. Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, E. D. Cubuk, and Quoc V. Le. Rethinking pre-training and self-training. ArXiv, abs/2006.06882, 2020.
  • [56] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
  • [57] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.