NAS-based Recursive Stage Partial Network (RSPNet) for Light-Weight Semantic Segmentation

Anonymous Authors

Abstract

Current NAS-based semantic segmentation methods focus on accuracy improvements rather than light-weight design. In this paper, we propose a two-stage framework to design our NAS-based RSPNet model for light-weight semantic segmentation. The first architecture search determines the inner cell structure, and the second architecture search considers exponentially growing paths to finalize the outer structure of the network. It has been shown in the literature that fusing high- and low-resolution feature maps produces stronger representations. To find the expected macro structure without manual design, we adopt a new path-attention mechanism to efficiently search for suitable paths to fuse useful information for better segmentation. Our search for repeatable micro-structures from cells leads to a superior network architecture for semantic segmentation. In addition, we propose an RSP (Recursive Stage Partial) architecture to search for a light-weight design for NAS-based semantic segmentation. The proposed architecture is simple, efficient, and effective: both the macro- and micro-structure searches can be completed in five days of computation on two V100 GPUs. The light-weight NAS architecture, with only 1/4 of the parameters of SoTA architectures, achieves SoTA performance on semantic segmentation on the Cityscapes dataset without using any backbone.

Introduction

Neural Architecture Search (NAS) (?) is a computational approach for automating the design of neural architectures. Although deep learning has been widely used for image segmentation, the most common deep networks used in practice are still designed manually. In this work, we focus on applying NAS to semantic image segmentation as the targeted application. To optimize the NAS that looks for the best architecture for image segmentation, the search task can be decomposed into three parts: (i) a supernet that generates all possible architecture candidates, (ii) a global search over neural architecture paths in the supernet, and (iii) a local search over the cell architectures, namely the operations, including the conv/deconv kernels and the pooling parameters. The NAS search space grows exponentially with the number of generated candidates, the paths between nodes, the network depth, and the available cell operations to choose from. The computational burden of NAS for image segmentation is much higher than for other tasks such as image classification, so each architecture verification step takes longer to complete. As a result, fewer NAS methods work successfully for image segmentation. In addition, none is designed for light-weight semantic segmentation, which is very important for AV (Automobile Vehicle)-related applications.

Figure 1: Top: Our fully connected model with the Path-Attention Module (PAM) used to search the macro structure. Bottom Left: Before entering the next layer, the feature maps generated by the previous layers pass through the PAM, which selects the best of them to enter the cell. Bottom Right: For efficiency and to limit the parameter size, we use only one cell at each layer when searching the cell with DARTS. During training, we stack four copies of the searched cell.

The main challenge of NAS is how to deal with the exponentially large search space when exploring and evaluating neural architectures. We tackle this problem with a formulation of what needs to be considered first and how to effectively reduce search complexity. Most segmentation network designs (?; ?; ?; ?) use U-Nets to achieve better accuracy in image segmentation. For example, Auto-DeepLab (?) designs a level-based U-Net as the supernet, whose search space grows exponentially with its level number $L$ and depth parameter $D$ . Jointly searching the network-level and cell-level architectures creates huge challenges and inefficiency in determining the best architecture. To avoid exponential growth in the cell search space, only one path in Auto-DeepLab is selected and sent to the next node. This limit is restrictive, since more inputs to the next node can generate richer features for image segmentation.

A “repeatable” concept is adopted in this paper to construct our model. Similar to repeatable cell architecture design, our model contains repeated units that share the same structure. The proposed model architecture for image segmentation is shown in Fig. 1. It is based on differentiable learning and is as efficient as the DARTS (Differentiable ARchiTecture Search) method (?), in contrast to other NAS methods based on reinforcement learning (RL) and evolution (EV). In addition, we modify the concept of CSPNet (?) to recursively pass only half of the channels through the searched cell. The resulting RSPNet (Recursive Stage Partial Network) makes our search procedure much more efficient and yields a light-weight architecture for semantic segmentation. The proposed architecture is simple, efficient, and effective in image segmentation. Both the macro- and micro-structure searches can be completed in five days of computation on two V100 GPUs. The light-weight NAS network, with only 1/4 of the parameters of SoTA architectures, achieves state-of-the-art performance on the Cityscapes dataset without any backbone. The main contributions of this paper are summarized as follows:

• We propose a two-stage search method that decreases memory usage and speeds up the search.
• We design a cell-based architecture that can construct a complex model by stacking the searched cell.
• RSPNet makes our search procedure much more efficient and results in a light-weight architecture for semantic segmentation.
• The proposed PAM selects the paths and fuses more inputs than Auto-DeepLab (?) and HiNAS (?) to generate richer features for better image segmentation.
• Without using any backbone, our architecture outperforms SoTA methods with improved accuracy and efficiency on the Cityscapes (?) dataset.

Related Works

Mainstream NAS algorithms generally consist of three basic steps: supernet generation, architecture search, and network cell parameter optimization. Approaches for architecture search can be organized into three categories (?): reinforcement learning (RL) based, evolution (EV) based, and gradient based. In what follows, the details of each method are discussed.

RL-based methods (?; ?) use a controller to sample neural network architectures (NNAs) to learn a reward function to generate better architectures from exploration and exploitation. Although an RL-based NAS approach can construct a stable architecture for evolution, it needs a huge number of tries to get a positive reward for updating architectures and thus is very computationally expensive. For example, in (?), a cell-based Q-learning approach is evaluated on ImageNet, and the search takes up to 9 GPU-days to run.

Evolution-based methods (?; ?; ?; ?) apply evolution operators (e.g., crossover and mutation) based on the genetic algorithm (GA) to continuously adjust NNAs and improve their quality across generations. These methods suffer from high computational cost in optimizing the model generator when validating the accuracy of each candidate architecture. Compared to these methods, which rely on optimization over discrete search spaces, gradient-based methods can optimize much faster by searching in continuous spaces.

Gradient-based methods: The Differentiable ARchiTecture Search (DARTS) (?) significantly improves search efficiency by computing a convex combination of a set of operations where the best architecture can be optimized by gradient descent algorithms. In DARTS, a supernet is constructed by placing a mixture of candidate operations on each edge rather than applying a single operation to a node. An attention mechanism on the connections is adopted to remove weak connections, such that all supernet weights can be efficiently optimized jointly with a continuous relaxation of the search space via gradient descent. The best architecture is found efficiently by restricting the search space to the subgraphs of the supernet. However, this acceleration comes with a penalty when limiting the search variety, resulting in architectures with inferior accuracy when compared with RL- or EV-based methods.

Network parameter sharing methods (?; ?; ?; ?; ?; ?) use a weight-sharing concept to reduce gradient updates on network parameters. For example, in (?), the knowledge of well-trained architectures is compressed and transferred through network compression operations to improve the efficiency and effectiveness of model learning.

NAS for image segmentation is less investigated due to large memory demands and model validation costs. In (?), a U-like backbone is used to search down-sampling and up-sampling cells of repeatable structures for medical image segmentation. NAS-UNet (?) is a U-like architecture in which the micro structure is automatically adjusted by a differentiable search for a repeatable cell structure. In Auto-DeepLab (?), a two-level hierarchical architecture search space is used to find the best architecture. Similar to (?), the first search determines the inner cell structure, and the second search considers exponentially many paths to determine the outer structure of the network. Significant results are achieved, but with 3 days of GPU computation.

Path-Attention NAS: The path search method in (?; ?) takes only three previous layers’ outputs as inputs and uses the Viterbi decoding algorithm to select the path with the maximum probability as the final result. Different from Auto-DeepLab (?) and HiNAS (?), our path selection method takes all previous layers’ outputs and the current layer’s output as inputs and uses the DARTS algorithm to choose the best two as inputs to the next searched cell. Our path-attention method can thus fuse information from more inputs to generate better segmentation results.

Background

This section first addresses the vanishing gradient problem, which often occurs when training a neural network by backpropagation. We then discuss the relation between the vanishing gradient problem and CSPNet (?). Finally, we modify the concept of CSPNet into RSPNet, which recursively passes only half of the channels through the searched cell. RSPNet makes our search procedure more efficient in finding the desired light-weight architecture for semantic segmentation.

Fig. 3 shows the forward and backward procedures of a neural network. Let $X_{i}$ denote the output of the $i$ th layer, and let $W_{i+1}$ and $f_{i+1}$ denote the weight matrix and the output function of the $(i+1)$ th layer. The relation between $X_{i}$ and $X_{i+1}$ can be written as follows:

X_{i+1}=f_{i+1}\left(X_{i},W_{i+1}\right).

(1)

Let $C$ be a cost function. For the $(L-1)$ th layer, according to the chain rule, the gradient of its weight matrix $W_{L-1}$ can be written as

\frac{\partial}{{\partial{W_{L-1}}}}C=\frac{{\partial C}}{{\partial{X_{L}}}}\frac{{\partial{X_{L}}}}{{\partial{X_{L-1}}}}\frac{{\partial{X_{L-1}}}}{{\partial{W_{L-1}}}}.

(2)

Based on Eq.(1), we have $X_{L}=f_{L}$ . Then, Eq.(2) can be rewritten as

\frac{\partial}{{\partial{W_{L-1}}}}C=\frac{{\partial C}}{{\partial{X_{L}}}}\frac{{\partial{f_{L}}}}{{\partial{X_{L-1}}}}\frac{{\partial{f_{L-1}}}}{{\partial{W_{L-1}}}}.

(3)

For the $(L-2)$ th layer, according to Eq.(3) and the chain rule, the gradient of $C$ with respect to $W_{L-2}$ can be written as follows.

\frac{\partial}{{\partial{W_{L-2}}}}C=\frac{{\partial C}}{{\partial{X_{L}}}}\left({\prod\limits_{k=1}^{2}{\frac{{\partial{f_{L-k+1}}}}{{\partial{X_{L-k}}}}}}\right)\frac{{\partial{f_{L-2}}}}{{\partial{W_{L-2}}}}.

(4)

For the $(L-i)$ th layer, based on the form of Eq.(4), the gradient of $C$ with respect to $W_{L-i}$ can be computed recursively by backpropagation, $i.e.$ ,

\frac{\partial}{{\partial{W_{L-i}}}}C=\frac{{\partial C}}{{\partial{X_{L}}}}\left({\prod\limits_{k=1}^{i}{\frac{{\partial{f_{L-k+1}}}}{{\partial{X_{L-k}}}}}}\right)\frac{{\partial{f_{L-i}}}}{{\partial{W_{L-i}}}}.

(5)

When the activation function used in $f_{i}$ is a sigmoid function, whose derivative is at most $1/4$ , the recursive term in Eq.(5) tends to zero as $i$ grows, which results in the vanishing gradient problem; that is,

\prod\limits_{k=1}^{i}{\frac{{\partial{f_{L-k+1}}}}{{\partial{X_{L-k}}}}}\to 0.

(6)
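As a numerical illustration of Eq.(6), the following sketch multiplies per-layer sigmoid derivatives. It assumes scalar layers with unit weights, so that each factor $\partial f/\partial X$ reduces to the sigmoid derivative; the function names are hypothetical and chosen only for this example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_product(pre_activations):
    """Product of per-layer factors, i.e. the recursive term of Eq.(5)
    under the scalar, unit-weight assumption stated above."""
    prod = 1.0
    for z in pre_activations:
        prod *= sigmoid_grad(z)
    return prod


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        # Best case for the sigmoid: every pre-activation is 0,
        # so every factor takes its maximum value 0.25.
        print(depth, chain_product([0.0] * depth))
```

Even in this most favourable case the product shrinks as $0.25^{i}$, so the gradient reaching early layers becomes negligible after a few tens of layers.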

To solve this problem in Eq.(6), one solution is to change the activation function to the ReLU function. Another solution proposed in ResNet (?) explicitly preserves information through additive identity transformation, $i.e.$ , skip connection. With the skip connection, Eq.(1) can be rewritten as

X_{i+1}=f_{i+1}\left(X_{i},W_{i+1}\right)+X_{i}.

(7)

Fig. 4(a) shows the architecture of ResNet (?). Based on Eq.(7), Eq.(5) can be rewritten as

\frac{\partial}{{\partial{W_{L-i}}}}C=\frac{{\partial C}}{{\partial{X_{L}}}}\left({\prod\limits_{k=1}^{i}\left({\frac{{\partial{f_{L-k+1}}}}{{\partial{X_{L-k}}}}}+1\right)}\right)\frac{{\partial{f_{L-i}}}}{{\partial{W_{L-i}}}}.

(8)
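The effect of the skip connection can be checked numerically. Differentiating Eq.(7) with respect to $X_{i}$ gives a per-layer factor of $\partial f_{i+1}/\partial X_{i}+1$, and the sketch below compares the product of such factors with the plain product. It is only an illustration under the assumption of scalar layers with unit weights and sigmoid activations; the function names are hypothetical.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid (always in (0, 0.25])."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def plain_product(pre_activations):
    """Product of per-layer factors without skip connections."""
    prod = 1.0
    for z in pre_activations:
        prod *= sigmoid_grad(z)
    return prod


def residual_product(pre_activations):
    """Product of per-layer factors with the identity skip of Eq.(7):
    each factor gains a '+ 1' from the additive X_i term."""
    prod = 1.0
    for z in pre_activations:
        prod *= sigmoid_grad(z) + 1.0
    return prod


if __name__ == "__main__":
    zs = [0.0] * 20
    print("plain   :", plain_product(zs))
    print("residual:", residual_product(zs))
```

Under these assumptions every residual factor is at least 1, so the product cannot shrink toward zero, whereas the plain product does.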

Based on Eq.(8), the recursive term in Eq.(5) will not converge to zero, and thus the problem of vanishing gradient can be solved. Unlike ResNet (?), CSPNet (?) adopts a split-transform-merge strategy to separate the feature map of the base layer into two parts; one part will go through a convolution block and a transition layer; the other part is then combined with the transmitted feature map to the next stage. Then, Eq.(7) can be modified as

X_{i+1}=f_{i+1}\left({1\over 2}X_{i},W_{i+1}\right)+{1\over 2}X_{i}.

(9)

Here, ${1\over 2}X_{i}$ denotes one half of the channels of $X_{i}$ . This is another way to keep the recursive term in Eq.(5) from converging to zero, and thus the problem of vanishing gradient is solved. Eq.(9) inspires the design of our RSPNet. The architecture of CSPNet is shown in Fig. 4(b). Owing to the split-transform-merge strategy, rich features can be extracted, which enhances both the efficiency and accuracy of object detection. However, CSPNet separates the feature map into two parts only on the base layer. We apply the split-transform-merge strategy recursively, not only on the base layer but also on the other layers. The recursively mixed two-path design preserves feature reuse while preventing an excessive amount of duplicate gradient information by truncating the gradient flow. We name this design RSPNet (Recursive Stage Partial Network).

Methods

This paper proposes a two-stage method for searching the best architecture via NAS. Fig. 1 shows the architecture created by our method. First, a light-weight cell structure is searched by NAS (see Fig. 1, bottom right). Then, we fix the cell structure and repeat it four times to construct our architecture. In what follows, the details of our method are described.

Cell Architecture Search

For cell structure search, only one type of cell is used: the Normal Cell (Norm-Cell) as in Fig. 2. We use two intermediate nodes to construct each cell structure. Each cell keeps the number of channels and neither reduces the spatial dimension from $W\times H$ to $\frac{W}{2}\times\frac{H}{2}$ nor enlarges the spatial dimension from $W\times H$ to $2W\times 2H$ .

Referring to Fig. 1 (bottom left) and Fig. 2, for each Norm-Cell with one input, the feature maps produced by the previous layers, which have different shapes, are first reshaped by convolution. The reshaped maps are added together and then fed into the subsequent operations, which are searched by DARTS, to generate a feature map with the same size $C\times{H}\times{W}$ .

The set of operation types consists of the following 5 operators, all prevalent in modern CNNs:

• $3\times 3$ Convolution
• $5\times 5$ Convolution
• Dilated Convolution
• Depthwise Convolution
• Identity

These are all the primitive operations used for cell search, as shown in Fig. 2. For every edge between two intermediate nodes, the weights of the candidate operations are obtained by applying a softmax to real-valued architecture parameters.
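The softmax weighting of candidate operations on an edge can be sketched as follows. This is a toy illustration of the DARTS-style continuous relaxation, not the implementation used in the paper: the five operations are replaced by hypothetical scalar stand-ins, and the names `CANDIDATE_OPS`, `mixed_op`, and `select_op` are assumptions made for this example.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of real-valued parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scalar stand-ins for the five candidate operations.
# In a real cell each entry would be a convolution (or identity) on feature maps.
CANDIDATE_OPS = {
    "conv_3x3": lambda x: 0.5 * x,
    "conv_5x5": lambda x: 0.25 * x,
    "dilated_conv": lambda x: 2.0 * x,
    "depthwise_conv": lambda x: 1.5 * x,
    "identity": lambda x: x,
}


def mixed_op(x, alphas):
    """Edge output during search: softmax-weighted sum of all candidate outputs."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def select_op(alphas):
    """After search, keep the operation with the largest architecture parameter."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda k: alphas[k])
    return names[best]


if __name__ == "__main__":
    alphas = [0.1, 2.0, -1.0, 0.0, 0.5]  # illustrative architecture parameters
    print(mixed_op(1.0, alphas))
    print(select_op(alphas))
```

Because the mixture is differentiable in the architecture parameters, they can be updated by gradient descent together with the network weights, and a discrete cell is obtained afterwards by keeping the strongest operation on each edge.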

Two Stage Search

As noted earlier, the main challenge of NAS is how to deal with the exponentially large search space when exploring and evaluating neural architectures. We use a two-stage method to search and train our model. In the first stage, we keep the architecture light-weight by searching with fewer cells: as shown in Fig. 1 (bottom right), we use only one cell in each layer. After the search is completed, we stack the cell found by DARTS (?) four times to construct our model. Our experimental results show that this method not only makes the search more efficient but also outperforms SoTA methods in accuracy.

Path Attention Module

To avoid the combinatorial explosion of path choices, we propose the Path Attention Module (PAM), which reduces the path search problem to a linear weight optimization. Our PAM considers the outputs from all previous layers and the current layer. It chooses the best two of them as inputs for the next cells to generate better segmentation results.

Fig. 5 depicts the details of the PAM. Given a node, the PAM calculates the attention value of each path and keeps the desired number of outputs with the highest attention (denoted in red in Fig. 5(a)) to construct the final macro structure. The PAM uses a $1\times 1$ convolution and an interpolation operation to normalize all five inputs (denoted by four blue arrows and one green arrow in Fig. 5(b)) to the same size $C\times H\times W$ . They are then sent to a Channel Attention Module (CAM) to calculate their channel attention. Details of the CAM are depicted in Fig. 6. For any pair of $i^{th}$ and $j^{th}$ inputs $F_{i}^{c}$ and $F_{j}^{c}$ on the $c^{th}$ channel, we take their inner product to obtain the channel attention $S^{c}_{i,j}\in R^{C\times C}$ :

S_{i,j}^{c}=\frac{\exp(F_{i}^{c}\cdot F_{j}^{c})}{\sum_{c=1}^{C}\exp(F_{i}^{c}\cdot F_{j}^{c})},

(10)

where $S_{i,j}^{c}$ is the impact of the $i^{th}$ path on the $j^{th}$ path on the $c^{th}$ channel. Then, the attention $S_{i}^{Path}$ of the $i^{th}$ path is obtained as follows:

S_{i}^{Path}=\sum_{j=1}^{N}\sum_{c=1}^{C}S_{i,j}^{c},

(11)

where $N$ is the number of paths (or inputs). Based on $S_{i}^{Path}$ , the best two are chosen from the $N$ paths. As described above, our path selection approach is very different from that of Auto-DeepLab (?) and HiNAS (?), which take only three previous layers’ outputs as inputs and then adopt the Viterbi decoding algorithm to select only the one input with the highest probability to generate the final result. In contrast, our method takes all previous layers’ outputs and the current layer’s output as inputs and chooses the best two layers’ feature maps for the next searched cell. Since more layers and inputs are considered, our method fuses more information to generate better segmentation results.
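The selection step can be sketched as a top-2 choice over the path attention values followed by an element-wise sum of the chosen inputs. The sketch assumes the values $S_{i}^{Path}$ have already been computed and that all inputs have been normalized to the same size; the scores, the flat-list feature representation, and the names `select_paths` and `fuse` are hypothetical.

```python
def select_paths(path_scores, k=2):
    """Return the indices of the k paths with the highest attention, in index order."""
    ranked = sorted(range(len(path_scores)),
                    key=lambda i: path_scores[i], reverse=True)
    return sorted(ranked[:k])


def fuse(features, chosen):
    """Element-wise sum of the chosen inputs (all assumed to share one shape)."""
    return [sum(values) for values in zip(*(features[i] for i in chosen))]


if __name__ == "__main__":
    # Hypothetical attention values for N = 5 candidate inputs.
    scores = [0.8, 1.9, 0.4, 1.2, 1.5]
    # Hypothetical flattened feature maps, one per candidate input.
    features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]

    chosen = select_paths(scores)
    print(chosen)
    print(fuse(features, chosen))
```

Ranking $N$ scalar scores is linear in the number of candidate inputs, in contrast to enumerating every combination of paths.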

Recursive Stage Partial (RSP) Architecture for Light-Weight Semantic Segmentation

Neural Architecture Search (NAS) can successfully identify neural network architectures that outperform hand-designed ones. However, such success relies heavily on costly computational resources. To reduce computation while maintaining accuracy, we modify the concept of Cross Stage Partial (CSP) (?) to recursively pass only half of the channels through the searched cell. As shown in Fig. 2, our cell architecture recursively passes only half of the channels through each operation. Before entering the next cell, we concatenate the output with the unused part of the input. This RSP architecture makes the search process much more efficient and results in a light-weight design for semantic segmentation.
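A minimal sketch of the split, transform, and concatenate step is given below, together with a weight count for a convolution on all channels versus half of them. It makes several assumptions that the text does not fix: the first half of the channels is the one that is transformed, channels are represented as a plain list, biases are ignored, and the names `rsp_step`, `rsp_stack`, and `conv_weight_count` are hypothetical.

```python
def conv_weight_count(c_in, c_out, kernel):
    """Number of weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * kernel * kernel


def rsp_step(channels, transform):
    """Pass half of the channels through `transform` and concatenate the
    result with the untouched half (assumption: the first half is transformed)."""
    half = len(channels) // 2
    active, bypass = channels[:half], channels[half:]
    return transform(active) + bypass


def rsp_stack(channels, transforms):
    """Apply the partial step at every layer rather than only at the base layer."""
    for transform in transforms:
        channels = rsp_step(channels, transform)
    return channels


if __name__ == "__main__":
    full = conv_weight_count(64, 64, 3)
    partial = conv_weight_count(32, 32, 3)
    print(full, partial, full / partial)

    def double(xs):
        return [2 * x for x in xs]

    print(rsp_stack([1, 2, 3, 4], [double, double]))
```

A convolution acting on half of the channels has a quarter of the weights of the full one, which is of the same order as the reduction from 68.67M to 17.20M parameters reported in Table 1; the paper does not spell out this arithmetic, so it should be read as an illustration rather than as the source of that figure.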

Table 1: Performance evaluations of the model on Cityscapes validation set.

| Methods | Backbone | Coarse | ImageNet | mIOU(%) | Params(M) |
|---|---|---|---|---|---|
| PSPNet (?) | Dilated-ResNet-101 | $\times$ | $\checkmark$ | 76.2 | $\times$ |
| PSANet (?) | Dilated-ResNet-101 | $\times$ | $\checkmark$ | 77.3 | $\times$ |
| PADNet (?) | Dilated-ResNet-101 | $\times$ | $\checkmark$ | 78.1 | $\times$ |
| DenseASPP (?) | WDenseNet-161 | $\times$ | $\checkmark$ | 77.8 | $\times$ |
| DeepLabv3 (?) | ResNet-101 | $\checkmark$ | $\checkmark$ | 79.5 | $\times$ |
| Auto-DeepLab (?) | $\times$ | $\checkmark$ | $\times$ | 80.3 | 44.42 |
| HRNetV2 (?) | $\times$ | $\times$ | $\checkmark$ | 80.9 | 69.06 |
| Ours | $\times$ | $\times$ | $\checkmark$ | 81.4 | 68.67 |
| Ours + RSP | $\times$ | $\times$ | $\checkmark$ | 81.0 | 17.20 |

Table 2: Performance evaluations of the model on Cityscapes validation set. Training with the Mapillary Vistas dataset.

| Methods | Backbone | Mapillary | mIOU(%) | Params(M) |
|---|---|---|---|---|
| Mapillary (?) | ResNeXt-101 | $\checkmark$ | 80.6 | $\times$ |
| HANet (?) | ResNet-101 | $\checkmark$ | 81.7 | $\times$ |
| HRNetV2+OCR (?) | HRNetV2 | $\checkmark$ | 81.8 | 70.37 |
| DecoupleSegNets | Wide-ResNet | $\checkmark$ | 81.6 | $\times$ |
| DCNAS (?) | $\times$ | $\checkmark$ | 81.3 | $\times$ |
| Ours | $\times$ | $\times$ | 81.4 | 68.67 |
| Ours | $\times$ | $\checkmark$ | 82.1 | 68.67 |
| Ours + RSP | $\times$ | $\checkmark$ | 81.7 | 17.20 |

Experimental Results

Experiments are conducted on the Cityscapes (?) urban scene understanding dataset. Auto-DeepLab (?), U-Net (?), and NAS-UNet (?) are compared on Cityscapes as baselines. We use the standard mean intersection-over-union (mIOU) as the performance evaluation metric for semantic segmentation.
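For reference, mIOU averages the per-class ratio of true positives to the union of predicted and ground-truth pixels. The sketch below computes it on flat label lists; the function name, the `ignore_index` value of 255, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions of this example, not details taken from the paper.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes that occur in pred or target.

    pred and target are flat sequences of integer class labels of equal length;
    pixels whose target label equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[t] += 1         # false negative for class t
            union[p] += 1         # false positive for class p
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]
    print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is about 0.722. In a benchmark evaluation the intersection and union counts are normally accumulated over the whole validation set before the ratio is taken.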

Results on Cityscapes scene segmentation

The Cityscapes dataset (?) is a large-scale urban scene dataset containing a diverse set of stereo video sequences from 50 cities. It contains high-quality pixel-level annotations of $5,000$ images of size $1,024\times 2,048$ , split into $2,975$ training, $500$ validation, and $1,525$ test images, plus an additional $20,000$ weakly annotated frames. It is an order of magnitude larger than similar previous datasets.

We consider 2 intermediate nodes in all cells with one input. For each cell, we keep the channel number and the height and width of the feature tensor. Fig. 7 shows the cell architectures found by NAS on the Cityscapes dataset. Fig. 9 shows that the validation accuracy exceeded 40% during the macro search, which is higher than the accuracy Auto-DeepLab (?) obtained during search. Random image crops of size $512\times 102$ are used. In the DARTS search, the batch size is 6 due to GPU memory limitations, and the architecture search optimization is conducted for $300$ epochs. For learning the network weights $w$ , an SGD optimizer with momentum $0.95$ and weight decay $0.0005$ is used. For learning the architecture, an SGD optimizer with learning rate $0.005$ and weight decay $0.0001$ is used. The entire architecture search optimization takes about five days on two V100 GPUs.

Table 1 shows that our NAS model outperforms the SoTA methods on Cityscapes. Without using any backbone, our best model reaches 81.4% mIOU, higher than all compared methods, including HRNetV2 (80.9%). The last row of Table 1 shows that the light-weight RSP design, with 17.20M parameters (about 1/4 of the 69.06M of HRNetV2), still attains 81.0% mIOU. Fig. 8 visualizes the results of our model on the Cityscapes (?) validation and test sets.

The Mapillary Vistas Dataset is a large-scale street-level image dataset containing 25,000 high-resolution images annotated into 66/124 object categories, of which 37/70 classes are instance-specific labels (v1.2 and v2.0, respectively). Annotation is performed in a dense and fine-grained style by using polygons to delineate individual objects. The dataset contains images from all around the world, captured under various conditions regarding weather, season, and time of day. The images come from different imaging devices (mobile phones, tablets, action cameras, professional capturing rigs) and from photographers with different levels of experience. We also adopt the Mapillary Vistas Dataset (?) in our training procedure. Because Cityscapes has fewer classes than Mapillary Vistas, we map the Mapillary Vistas categories to the corresponding ones in Cityscapes. Table 2 shows that, with Mapillary Vistas (?) training data, our NAS model (82.1% mIOU) outperforms the SoTA methods on Cityscapes. Our light-weight RSP design (81.7% mIOU) remains competitive with these methods even with only about 1/4 of the parameters.
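The category mapping can be sketched as a lookup from Mapillary Vistas class names to Cityscapes training IDs. The paper does not list the mapping it used, so the table below is a hypothetical excerpt for illustration only; the class-name strings, the `IGNORE` value of 255 for unmapped categories, and the function name `remap` are assumptions and should be checked against the dataset configuration files.

```python
IGNORE = 255  # assumed label for categories without a Cityscapes counterpart

# Hypothetical excerpt of a name-level mapping (not the mapping used in the paper).
MAPILLARY_TO_CITYSCAPES = {
    "construction--flat--road": "road",
    "construction--flat--sidewalk": "sidewalk",
    "nature--sky": "sky",
    "human--person": "person",
    "object--vehicle--car": "car",
}

# Cityscapes training IDs for the classes used in this excerpt.
CITYSCAPES_TRAIN_ID = {"road": 0, "sidewalk": 1, "sky": 10, "person": 11, "car": 13}


def remap(mapillary_names):
    """Translate Mapillary class names to Cityscapes training IDs;
    anything without an entry in the mapping becomes IGNORE."""
    out = []
    for name in mapillary_names:
        target = MAPILLARY_TO_CITYSCAPES.get(name)
        out.append(CITYSCAPES_TRAIN_ID[target] if target is not None else IGNORE)
    return out


if __name__ == "__main__":
    print(remap(["construction--flat--road", "object--banner", "human--person"]))
```

Pixels mapped to the ignore label would typically be excluded from the training loss, so categories without a Cityscapes counterpart do not affect the 19-class objective.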

To better aggregate the context, we also adopt Object-Contextual Representations (OCR) (?). Table 3 shows that, with pretraining and OCR, our NAS model outperforms the other models on the Cityscapes dataset. Without using any backbone, our light-weight architecture still outperforms HRNetV2 and DCNAS.

Table 3: Performance evaluation on the Cityscapes validation set.

| Methods | OCR | ImageNet | mIOU(%) |
|---|---|---|---|
| HRNetV2 (?) | $\times$ | $\times$ | 76.16 |
| HRNetV2 | $\checkmark$ | $\times$ | 78.2 |
| HRNetV2 | $\times$ | $\checkmark$ | 80.9 |
| HRNetV2 | $\checkmark$ | $\checkmark$ | 81.6 |
| DCNAS (?) | $\times$ | $\times$ | 81.9 |
| Ours | $\times$ | $\checkmark$ | 81.4 |
| Ours | $\checkmark$ | $\checkmark$ | 83.2 |
| Ours + RSP | $\checkmark$ | $\checkmark$ | 82.5 |

Discussions

We discuss practical considerations in deploying our method. The first is the sensitivity to the number $N$ of cells that are stacked during training. We use a fixed $N=4$ in this paper. A larger $N$ involves more search parameters and consumes more time; however, better image segmentation performance might be obtained. The second is the potential negative environmental impact of our work, since automatic model search requires significant GPU computation. That said, the proposed micro search can be completed in two days of computation on modern GPUs.

Conclusion

This paper designs a two-stage architecture search that allows us to search the micro structure for image segmentation with less memory. The complexity reduction in search space and path selection makes the entire architecture search optimization very efficient. It takes about 2 days on two V100 GPUs to find the desired micro architecture. RSPNet makes our search procedure much more efficient and leads to our light-weight design for semantic segmentation.

Future work includes generalizing the framework and considering additional computation structures and cells as a basis that can be incorporated into it. The model can be replaced with other simple structures for memory- or power-aware applications. Another issue is the selection of the number $N$ . The current version sets the same $N$ for each module; allowing a different number for each module may further improve the system performance. In addition, we envision that the proposed NAS can be applied straightforwardly to other computer vision and NLP tasks.

References

[Ashok et al. 2018] Ashok, A.; Rhinehart, N.; Beainy, F.; and Kitani, K. M. 2018. N2N learning: Network to network compression via policy gradient reinforcement learning. In ICLR.
[Baker et al. 2017] Baker, B.; Gupta, O.; Naik, N.; and Raskar, R. 2017. Designing neural network architectures using reinforcement learning. In ICLR.
[Brock et al. 2018] Brock, A.; Lim, T.; Ritchie, J. M.; and Weston, N. 2018. SMASH: One-shot model architecture search through hyperNetworks. In ICLR.
[Cai et al. 2018] Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018. Efficient architecture search by network transformation. In AAAI.
[Chen et al. 2017] Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation.
[Cho et al. 2020] Cho, K.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2020. Memory-efficient hierarchical neural architecture search for image denoising. In CVPR.
[Cordts et al. 2016] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
[Elsken, Metzen, and Hutter 2019a] Elsken, T.; Metzen, J. H.; and Hutter, F. 2019a. Efficient multi-objective neural architecture search via lamarckian evolution. In ICLR.
[Elsken, Metzen, and Hutter 2019b] Elsken, T.; Metzen, J. H.; and Hutter, F. 2019b. Neural architecture search: A survey. JMLR 1–21.
[Fourure et al. 2017] Fourure, D.; Emonet, R.; Fromont, E.; Muselet, D.; Tremeau, A.; and Wolf, C. 2017. Residual conv-deconv grid network for semantic segmentation. In BMVC.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
[Hengshuang et al. 2018] Hengshuang, Z.; Yi, Z.; Shu, L.; Jianping, S.; Chen, Change, L.; Dahua, L.; and Jiaya, J. 2018. Psanet: Point-wise spatial attention network for scene parsing. In ECCV.
[Liu et al. 2018] Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; and Kavukcuoglu, K. 2018. Hierarchical representations for efficient architecture search. In ICLR.
[Liu et al. 2019] Liu, C.; Chen, L.-C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A. L.; and Fei-Fei, L. 2019. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In CVPR.
[Liu, Simonyan, and Yang 2019] Liu, H.; Simonyan, K.; and Yang, Y. 2019. DARTS: Differentiable architecture search. In ICLR.
[Maoke et al. 2018] Maoke, Y.; Kun, Y.; Chi, Z.; Zhiwei, L.; and Kuiyuan, Y. 2018. Denseaspp for semantic segmentation in street scenes. In CVPR.
[Neuhold et al. 2017] Neuhold, G.; Ollmann, T.; S. Rota, B.; and Kontschieder, P. 2017. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV.
[Pham et al. 2018] Pham, H.; Guan, M. Y.; Zoph, B.; Le, Q. V.; and Dean, J. 2018. Efficient neural architecture search via parameter sharing. In ICML.
[Real et al. 2017] Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y. L.; Tan, J.; Le, Q.; and Kurakin, A. 2017. Large-scale evolution of image classifiers. In ICML, volume 70, 2902–2911.
[Real et al. 2019] Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. V. 2019. Regularized evolution for image classifier architecture search. In AAAI, 4780–4789.
[Ronneberger, Fischer, and Brox 2015] Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241.
[Sun et al. 2019] Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; and Wang, J. 2019. High-resolution representations for labeling pixels and regions.
[Sungha, Joanne, and Jaegul 2020] Sungha, C.; Joanne, T, K.; and Jaegul, C. 2020. Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In CVPR.
[Wang et al. 2020] Wang, C.-Y.; Liao, H.-Y. M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; and Yeh, I.-H. 2020. CSPNet: A new backbone that can enhance learning capability of cnn. In CVPR Workshop.
[Weng et al. 2019] Weng, Y.; Zhou, T.; Li, Y.; and Qiu, X. 2019. NAS-Unet: Neural architecture search for medical image segmentation. IEEE Access 7:44247–44257.
[Xu et al. 2018] Xu, D.; Ouyang, W.; Wang, X.; and Sebe, N. 2018. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing.
[Yang et al. 2020] Yang, Z.; Wang, Y.; Chen, X.; Shi, B.; Xu, C.; Xu, C.; Tian, Q.; and Xu, C. 2020. CARS: Continuous evolution for efficient neural architecture search. In CVPR.
[Yuan et al. 2019] Yuan, Y.; Chen, X.; Chen, X.; and Wang, J. 2019. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv:1909.11065.
[Zhang et al. 2020] Zhang, X.; Xu, H.; Mo, H.; Tan, J.; Yang, C.; Wang, L.; and Ren, W. 2020. Dcnas: Densely connected neural architecture search for semantic image segmentation.
[Zhong et al. 2018] Zhong, Z.; Yang, Z.; Deng, B.; Yan, J.; Wu, W.; Shao, J.; and Liu, C.-L. 2018. BlockQNN: Efficient block-wise neural network architecture generation. In CVPR.
[Zoph and Le 2017] Zoph, B., and Le, Q. V. 2017. Neural architecture search with reinforcement learning. In ICLR.
[Zoph et al. 2018] Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. In CVPR.