PredNAS: A Universal and Sample Efficient Neural Architecture Search Framework

Liuchun Yuan, Zehao Huang, Naiyan Wang L. Yuan, Z. Huang, N. Wang are work for Tusimple Inc. Email: [email protected], [email protected], [email protected]

Abstract

In this paper, we present a general and effective framework for Neural Architecture Search (NAS), named PredNAS. The motivation is that given a differentiable performance estimation function, we can directly optimize the architecture towards higher performance by simple gradient ascent. Specifically, we adopt a neural predictor as the performance predictor. Surprisingly, PredNAS can achieve state-of-the-art performances on NAS benchmarks with only a few training samples (less than 100). To validate the universality of our method, we also apply our method on large-scale tasks and compare our method with RegNet on ImageNet and YOLOX on MSCOCO. The results demonstrate that our PredNAS can explore novel architectures with competitive performances under specific computational complexity constraints.

Index Terms:

Neural Architecture Search, Neural Predictor, Object Detection

I Introduction

Recent years have witnessed the extraordinary success of deep neural networks (DNNs) in various fields such as computer vision and neural language processing. From AlexNet [31] to ResNet[23] and MobileNet[26], DNNs become more and more compact and efficient, which shows the importance of “network engineering”. To reduce the enormous effort involved in hand-crafted network design, recent work [51, 36, 22, 2] focus on Neural Architecture Search (NAS), which is a technique for discovering better network architecture under constrained resources automatically.

The pioneering work [51] showed NAS could design better network architecture than hand-crafted deep models. However, the search cost is immense and unaffordable. Therefore, recent researches on NAS paid much attention to improve the search efficiency. Several work [52, 6, 27, 40, 35] focused on efficient sampling and progressive search. Nevertheless, the number of architectures needed to train and evaluate is still large. Another stream of work such as one-shot methods [1, 22] and gradient-based algorithms [36] adopted the strategy of SuperNet and weight-sharing to save the cost of model training. Among them, SPOS [22] and DARTS [36] are two representatives. They both train a supernet which contains all models in the search space as subnets. Then the performance of each subnet is used as the reference of corresponding model. Differently, SPOS only uses the SuperNet for fast evaluation, while DARTS utilizes gradients to update architecture parameters along with the SuperNet training. However, these methods usually suffer from unfair sampling [9] and unstable training [4, 10, 5] issues due to the weight-sharing strategy. Moreover, the search space may be limited by GPU memory since it needs to store the whole computational graph of SuperNet in GPU. The last category of work relates to our method most, which adopts a neural network, generally named predictor, for fast performance evaluation [45, 11, 50]. The main cost of predictor based methods is the collection of training samples for the predictor, namely the architecture-accuracy pairs. To handle this problem, recent researchers proposed several techniques such as ranking-based loss functions [50], data augmentation [37] and predictor pre-training [11, 50] to improve sample efficiency. Even though, hundreds of samples are still needed, which hinders the use of predictor based methods in large scale network search.

In this paper, we focus on neural predictor based methods and propose a general and efficient framework, named PredNAS. Our method enjoys the both advantages of predictor and gradient based NAS methods, and significantly reduce the number of training samples (a.k.a the times of training networks). Moreover, our method is general enough to accommodate different types of architecture and connection search. This property makes our method a flexible and universal tool for AutoML. Specifically, we randomly sample a small number of models and train a predictor for performance estimation. Then the predictor can be seen as a differentiable function which models the relationship between network architecture encoding and its corresponding performance on a specific task. Then given an initial architecture encoding ${\bm{a}}$ , we can optimize a new architecture ${\bm{a}}^{\prime}$ starting from ${\bm{a}}$ by the predictor guided gradient ascent. With this simple search strategy, our method can find new architectures with comparable performance with the state-of-the-art methods on several NAS benchmarks [16] using less than 100 training samples. To validate the universality of our method, we also conduct experiments on large scale tasks, such as image classification on ImageNet dataset [14] and object detection on MSCOCO [33]. On ImageNet, we adopt our method on the largest unconstrained AnyNet search space proposed by [40], and show that our PredNAS can find a series of models with comparable performances to RegNet without shrinking the search space by human heuristics. In object detection task, we validate our method on a recent work YOLOX [20]. The results demonstrate that our method can explore better models under different FLOPs constraints.

To summarize, the contributions of our work are in the following two folds:

•

We propose a general and efficient framework for predictor based neural architecture search. With a simple gradient based search strategy, our method can adapt to various NAS problems with less than 100 training samples.
•

With the same framework, we conduct comprehensive experiments in various applications and search spaces to show the universality of our method. The results show that our PredNAS could consistently achieve comparable or even better performance than existing methods.

II Related Works

II-A Sampling based Neural Architecture Search

To the best of our knowledge, [51] is the pioneering work of NAS. In [51], a recurrent network, also named controller, is used to propose new network architectures for evaluation. The controller is trained by reinforcement learning to maximize the performance of sampled networks on specific task. Though [51] showed the proposed method could design state-of-the-art models, the search cost was prohibitive. Several following works [52, 35, 27, 40] tried to mitigate this problem by reducing the search space. Different from searching the whole network on target dataset, [52] proposed to search stackable cells on proxy tasks. PNAS [35] adopted a progressive search approach to reduce the search cost. [40] focused on the design of search space and they discovered several principles to progressively simply the search space. Similarly, ABS [27] presented an angle-based metric to drop candidates during search.

II-B Weight-sharing based Neural Architecture Search

To alleviate the high cost of evaluation of each sampled architecture in NAS, some other works [1, 39, 36, 22, 3, 2, 47] utilized the weight-sharing mechanism. The key idea is to allow the samples to share weights to reduce the training cost of each architecture. The first work in this spirit is ENAS [39]. Then one-shot methods [1, 22, 36] proposed to train a SuperNet capable of enumerating the child models in the search space for fast model evaluation. SPOS [22] adopted an uniform sampling strategy for the training of SuperNet and used evolutionary algorithm for architecture search. DARTS [36] and DSO-NAS [49] formulated NAS as an optimization problem and optimized the architecture parameters by gradient descent. However, SuperNet based methods may suffer from the problem of unreliable ranking [9] and performance collapse [48, 4, 5]. FairNAS [9] proposed a strict fairness sampling and training strategy to improve the ranking correlation of SPOS. [48] proposed to early stop the search process to handle the instability problem of DARTS. SmoothDARTS [4] found the rounding step of deriving the discrete architecture from continuous optimization in DARTS could introduce large performance drop. They proposed a perturbation-based regularization to smooth the loss landscape of DARTS.

II-C Predictor based Neural Architecture Search

In the context of hyperparameter optimization, early work adopted probabilistic models [15] or Bayesian neural networks [29] to estimate the performance of neural network. The subsequent work tried to predict the accuracy of neural networks by neural networks. Peephole [13] used an LSTM [25] network to integrate the information of different layers following the network topology and adopted a Multiple Layer Perceptron (MLP) to predict the accuracy of the input network at a specific training epoch. [45] proposed to use Graph Convolutional Networks (GCNs) [28] to represent the connections of network architecture. The network operations are represented as one-hot codes and the topology of the neural network is formulated by a adjacency matrix. FBNetV3 [11] adopted a simple MLP to search both architectures and training recipes jointly. Additionally, they proposed a pre-training approach to improve the sample efficiency of predictor. Recently, instead of improving the regression accuracy of predictor, ReNAS [46] and AceNAS [50] adopted rank based loss for reliable predictions. The closest work to our PredNAS is NAO[38], which adopted gradient to do architecture optimization. However, NAO utilized encoder-decoder framework to do transformation between architecture and network embedding. Hundreds of training samples and an extra structure reconstruction loss are needed for the training. In our framework, the design of network encoding and projection function allows us to do architecture embedding transformation without encoder and decoder. The training samples used in our work (30) in much less than NAO (600), and we show our PredNAS can directly search on large-scale tasks such as ImageNet without transferring the searched architecture from CIFAR to ImageNet.

II-D NAS Applications on Different Tasks

Designing the structure of backbone network on image classification task and then transfer the ImageNet pretrained model to downstream tasks is a de facto paradigm in computation vision community. However, recent works show that this paradigm may be sub-optimal due to the gap between classification task and downstream applications. DetNAS [7] adopted the technique of one-shot SuperNet for object detection backbone search. Auto-DeepLab [34] constructed a hierarchical architecture search space for semantic segmentation task and adopted gradient-based method for search. SpineNet [18] proposed a scale-permuted model and showed the learned backbone could achieve better performance than regular scale-decreased models on both object detection and image classification tasks. Other works adopted NAS for component search, such as the architecture of feature pyramid [21] and prediction head [44] in object detection. In this paper, we adopted our PredNAS on a very efficient object detection framework, YOLOX [20], and show we can explore better architectures based on a large search space proposed by us.

III Method

In this section, we will first introduce the motivation of our method, and then describe the framework of PredNAS step by step, including the formulation of search space and the details of predictor training and architecture search.

III-A Motivation

Given the architecture search space $\mathbf{\Omega}$ and target resource budget $c$ , NAS can be formulated as the following optimization problem:

{\bm{a}}^{*}=\operatorname*{argmax}_{{\bm{a}}\in\mathbf{\Omega}}\ {\mathcal{F}}({\bm{a}}),\operatorname{s.t.}\ {\mathcal{C}}({\bm{a}})<=c,

(1)

where ${\mathcal{F}}({\bm{a}})$ is the performance indicator of architecture ${\bm{a}}$ on a specific dataset and ${\mathcal{C}}({\bm{a}})$ is the corresponding resource cost such as FLOPs, latency or parameters. For simplicity, we denote ${\bm{a}}$ as the encoding of its corresponding architecture. Following [43], we are more interested in finding multiple Pareto-optimal [12] solutions instead of a single architecture with highest accuracy. We construct a formulation of weighted sum of objectives to approximate Pareto optimal solutions:

{\bm{a}}^{*}=\operatorname*{argmax}_{{\bm{a}}\in\mathbf{\Omega}}\ {\mathcal{F}}({\bm{a}})-\alpha({\mathcal{C}}({\bm{a}})-c),

(2)

where $\alpha$ is a tunable weight. Then the key problem becomes: 1) How could we approximate ${\mathcal{F}}(\cdot)$ and ${\mathcal{C}}(\cdot)$ with limited training data? 2) How could we optimize the above problem efficiently? Our answer is surprisingly simple: neural network with gradient descent¹¹1Though the formulation of ${\mathcal{C}}(\cdot)$ with FLOPs (or parameters) guided constraints can be derived directly, it is difficult to formulate latency or power guided computational complexity in a parametric form. For a general formulation, we adopt a predictor to approximate ${\mathcal{C}}(\cdot)$ .. In this paper, we adopt two neural networks, the main predictor and the auxiliary predictor to approximate ${\mathcal{F}}(\cdot)$ and ${\mathcal{C}}(\cdot)$ respectively. Then the gradient of ${\bm{a}}$ respect to ${\mathcal{F}}({\bm{a}})$ and ${\mathcal{C}}({\bm{a}})$ can be computed by back-propagation easily for the optimization of ${\bm{a}}$ .

III-B Search Space and Network Encoding

Refer to caption — Figure 1: Top: The Process of Different search space. First row, the macro skeleton of each architecture. Second row, the architecture encoding methods. Third row, the network structures of predictors. Bottom: The search space and predictors used in different tasks. The top is methods compared in PredNAS. The bottom is other methods.

Method	Type	Pred.	Num.	Task
AnyNet[40]	size	MLP	$\approx 10^{18}$	cls.
YOLOX[20]	size	MLP	$\approx 10^{100}$	det.
Bench-201[17]	topology	GCN	$\approx 10^{5}$	cls.
SPOS[22]	size	MLP	$\approx 10^{12}$	cls.
DARTs[36]	topology	GCN	$\approx 10^{9}$	cls.
NAS-FPN[21]	topology	GCN	$\approx 10^{16}$	det.

Defining search space is the first step in NAS. Generally, the search space in NAS can be divided into two categories[16] as shown in Fig. 1(left): (1) Topology Search Space (TSS) adopted in DARTS[36] and NAS-Bench-201[17], concerning the connection topology and the associated operations on the connections; (2) Size Search Space (SSS) focuses on the width, depth or other parameters with the same topology [22, 40, 2]. Following [11] and [45], we adopt different network encoding methods to represent the architectures sampled from these two search space. For TSS, an adjacency matrix is used to represent the topology of architecture, and the operations on each connections are encoded as an operation matrix which consists of one-hot vectors. As for SSS, we simply concatenate the values of network depths, channel widths and other architecture parameters into a vector.

III-C Training Predictors

We train two predictors to approximate ${\mathcal{F}}(\cdot)$ and ${\mathcal{C}}(\cdot)$ : (1) main predictor ${\mathcal{P}}_{m}$ , estimating the performance of given architectures, e.g. accuracy or mean average precision (mAP); (2) auxiliary predictor ${\mathcal{P}}_{aux}$ , predicting the resource cost of given architectures. These two predictors share the same network structure. For encoding of network structure, it highly depends on the search space of each dataset, so we leave it in the experiment section. For TSS, we adopt Graph Neural Network (GCN) [28] to handle graph-structured training samples following [45]. As for SSS, we use a simple MLP.

The top of Fig. 2 shows the training process. For simplicity, we consider FLOPs constraints and size search space for illustration in the following paper. We first randomly sample $N$ ( $N<=50$ ) architectures from search space, and then collect the training samples as a dataset of $({\bm{a}},acc.,\text{FLOPs})$ triplets. The performance of architecture can be obtained via training the architecture on the target task or a proxy task to reduce the overhead. Following [11], we adopt the Huber loss to train both the main and auxiliary predictors.

III-D Gradient-based Search with Predictors

In the search stage, most of the predictor based methods [45, 11, 50] adopt predictors only for architecture evaluation. They usually randomly sample a large number (about $10^{6}$ ) of architectures, and then take the networks with top-K performances to retrain. In this paper, we argue that these methods essentially utilize naive random shooting method for optimization, and ignore the valuable differentiable property of the predictor. Actually, we can update the sampled architecture according to the gradient of predictors. The bottom of Fig. 2 shows the search procedure of our PredNAS. Given an initial architecture ${\bm{a}}$ , the main predictor ${\mathcal{P}}_{m}$ and the auxiliary predictor ${\mathcal{P}}_{aux}$ , we update ${\bm{a}}$ iteratively by projected gradient ascent:

{\bm{a}}^{(t+1)}=\mathbf{P}_{\mathbf{\Omega}}({\bm{a}}^{(t)}+\eta*(\frac{\partial{{\mathcal{P}}_{m}({\bm{a}}^{(t)})}}{\partial{{\bm{a}}^{(t)}}}-\alpha\frac{\partial{{\mathcal{P}}_{aux}({\bm{a}}^{(t)})}}{\partial{{\bm{a}}^{(t)}}})),

(3)

where $\eta$ is the learning rate, $t$ is the number of iterations, $\alpha$ is a trade-off between performance and resource constraints and $\mathbf{P}_{\mathbf{\Omega}}$ is a projection function which projects the updated architecture embedding back onto search space $\mathbf{\Omega}$ . We formulate $\mathbf{P}_{\mathbf{\Omega}}$ as the following optimization problem:

\mathbf{P}_{\mathbf{\Omega}}(\bm{a}^{(t)})=\operatorname*{argmin}_{\bm{a}\in\mathbf{\Omega}}{\frac{1}{2}}||\bm{a}-\bm{a}^{(t)}||.

(4)

In size search space, the output of this projection function can be obtained by rounding and clipping. As for topology search space, we select the operation with maximum probability as the chosen operation. Updating ${\bm{a}}^{(t)}$ with gradient $\frac{\partial{{\mathcal{P}}_{m}({\bm{a}}^{(t)})}}{\partial{{\bm{a}}^{(t)}}}$ increases the performance of ${\bm{a}}^{(t)}$ while $\frac{\partial{{\mathcal{P}}_{aux}({\bm{a}}^{(t)})}}{\partial{{\bm{a}}^{(t)}}}$ reduces the computational complexity. With a suitable $\alpha$ , the gradient-based search strategy will optimize the architecture ${\bm{a}}$ towards to higher performance while reducing the resource cost. Given a number of initial architectures, we can explore new networks with higher performances based on the proposed gradient based search method efficiently. To obtain architectures under a hard resource constraints, we adopt a grid search strategy to tune the value of $\alpha$ and all the models under the target constraints during update will be added into the model pool.

Alg. 1 describes the whole process of search. After training predictors, we randomly sample $T_{max}$ architectures from search space. Then we update each architecture following Eqn.(3) until reaching max iteration $t_{max}$ . We collect all models within constraints to a model pool and sort them by the predictor scores. Finally, we will retrain top-K of them on a specific task and select the best model as the final result.

Input: Size search space

\mathbf{\Omega}\in\mathbb{R}^{n}

, target FLOPs interval

{[f-\delta:f+\delta]}

the loss weight

\alpha

, the main and auxiliary predictors

{\mathcal{P}}_{m}

{\mathcal{P}}_{aux}

and learning rate

\eta

ModelPool=[];

for $T=1,2,...,T_{max}$ do

Random sample initial architecture

{\bm{a}}^{(1)}\in\Omega

;

{\bm{a}}^{(1)}

.requires_grad = True;

for $t=1,2,...,t_{max}$ do

\text{loss}=-{\mathcal{P}}_{m}({\bm{a}}^{(t)})+\alpha*{\mathcal{P}}_{aux}({\bm{a}}^{(t)})

;

loss.backward();

{\bm{a}}^{(t+1)}={\bm{a}}^{(t)}-\eta*{\bm{a}}^{(t)}.\text{grad}

;

{\bm{a}}^{(t+1)}\leftarrow\mathbf{P}_{\mathbf{\Omega}}({\bm{a}}^{(t+1)})

;

calculate FLOPs of

{\bm{a}}^{(t+1)}

and get

f_{{\bm{a}}^{(t+1)}}

;

if $f_{{\bm{a}}^{(t+1)}}\in[f-\delta:f+\delta]$ then

score =

{\mathcal{P}}_{m}({\bm{a}}^{(t+1)})

;

ModelPool.append

\{({\bm{a}}^{(t+1)},\text{score})\}

;

end if

end for

Sort ModelPool according to score in descending order;

return ModelPool[:K]

Algorithm 1 FLOPs guided gradient search algorithm in PredNAS

IV Experiments

We first conduct our experiments on NAS-Bench-201 [16] which is a standard benchmark for network topology search. To validate the generalization of our PredNAS on large search space, we consider two other size search spaces: AnyNet search space (Sec. IV-B) proposed in [40] and a new object detection search space (Sec. IV-C) proposed by us based on YOLOX [20]. Fig. 1(right) summarizes these three search space. The detailed search space definition can be found in the following subsections.

We adopt SGD with momentum [42] both for the predictor training and search. During training, we set initial learning rate as 0.01 and decay it with cosine learning rate schedule. We use a weight decay of 0.0001 and a momentum of 0.9. As for search, the learning rate starts from 0.02 and is divided by 10 at 1/3 and 2/3 of the total number of iterations without specific statement.

The Structure of Predictors

In size search space, we adopt a 3-layer MLP followed by one fully-connected layer as our predictor. As for topology search space, we use a 3-layer GCNs followed by two fully-connected layers. We utilize GeLU [24] as activation function and set the dropout as 0.05 in the last fully-connected layers and GCNs layers. Table I summarizes the structures of predictors. Following [45], each layer in GCNs can be formulated as:

\small{\mathbf{V}}_{l+1}=\frac{1}{2}\text{ReLU}({\mathbf{J}}{\mathbf{V}}_{l}{\mathbf{W}}_{l}^{1})+\frac{1}{2}\text{ReLU}({\mathbf{J}}^{T}{\mathbf{V}}_{l}{\mathbf{W}}_{l}^{2}),

(5)

where $l$ indicates the index of GCNs layers. ${\mathbf{J}}\in\mathbb{R}^{I\times I}$ is the adjacency matrix, $I$ means the number of nodes. The operation matrix is ${\mathbf{V}}_{0}\in\mathbb{R}^{I\times D}$ , $D$ is the number of operation candidates. ${\mathbf{W}}_{l}^{1}$ and ${\mathbf{W}}_{l}^{2}$ are trainable weights in GCNs.

Predictors Layers Embeddings Dimension Linear Dimension FC Layers Activations MLP 3 1000 1000 1 GeLU GCNs 3 144 128 2 GeLU

TABLE I: The structure of predictors in our experiments.

IV-A NAS-Bench-201

NAS-Bench-201 [17] consists of 15625 neural cell candidates with different topologies on CIFAR-10, CIFAR-100 [30] and ImageNet-16-120 [8]. We introduce how we encode the architectures firstly. Following [45], we formulate the structure of cell as a densely connected Directed Acyclic Graph (DAG) and adopt an adjacency matrix and operation matrix to represent the structure. The DAG contains an input node, an output node and up to 6 interior nodes. Each interior node can be one of the following operations: (1) zeroize, (2) skip connection, (3) 1-by-1 convolution, (4) 3-by-3 convolution and (5) 3-by-3 average pooling layer. So the number of nodes in the DAG is $6+2=8$ and the number of candidate options per node is $5+2=7$ ²²2Following [45], the input and output are also considered as candidate options.. Then the topology of cell can be represented by a $8\times 8$ adjacency matrix. The operation matrix is a $8\times 7$ matrix which consists of 8 one-hot vectors. Since there have zeroize and skip connection operations in the candidate operation options, the adjacency matrix can be fixed to a graph with maximum connections in the search space, and only the operation matrix is updated.

We adopt the same structure of GCN proposed in [45] as our predictors. Since this benchmark usually take budget (the number of evaluated models) into consideration, we do not use the resource predictor. As for search, we set the number of iterations to 200 and decay the learning rate by a factor of 2. A softmax operation is applied on each row of the operation matrix to ensure that the sum of each row is 1. After gradient ascent, we select the operator with maximum probability in each row as the chosen operator.

We train main predictors with 30 samples and query performances of the top-40 networks in the model pool, which indicates that PredNAS trains $30+40=70$ models to get the final performance. Tabel II shows the results of our method. With the same number of query samples, we achieve better performance than BOHB[19], REA[41], RL[51] and a very strong baseline AceNAS[50] on all the three datasets. On ImageNet-16-120 dataset, our PredNAS achieve about $0.6\%-1.5\%$ higher accuracy than other methods. For fair comparison, we also report the results with 110 query samples following AceNAS. The results are comparable. These results validate the sampling efficiency of our method under low query budget. As shown in the following experiments, this advantage becomes significant when enlarging the search space.

Method Querys CIFAR-10 CIFAR-100 IN-16-120 Optimal^‡ 15625 94.37 73.51 47.31 Random 70 93.80 $\pm$ 0.28 71.18 $\pm$ 1.0 45.38 $\pm$ 0.91 BOHB[19] 70 93.57 $\pm$ 0.39 71.36 $\pm$ 0.71 45.04 $\pm$ 0.94 RL[51] 70 93.88 $\pm$ 0.26 71.63 $\pm$ 0.93 45.03 $\pm$ 0.78 REA[41] 70 94.04 $\pm$ 0.25 71.78 $\pm$ 0.90 45.52 $\pm$ 0.65 AceNAS[50] 70 94.15 $\pm$ 0.28 72.64 $\pm$ 1.0 45.99 $\pm$ 0.76 PredNAS 70 94.20 $\pm$ 0.14 73.09 $\pm$ 0.48 46.72 $\pm$ 0.35 AceNAS[50] 110 94.30 $\pm$ 0.19 73.23 $\pm$ 0.54 46.47 $\pm$ 0.38 PredNAS 110 94.30 $\pm$ 0.08 73.17 $\pm$ 0.38 46.67 $\pm$ 0.27

TABLE II: Compared with other NAS methods on NAS-Bench 201. We repeat 15 times to obtain mean and variance.

\ddagger

: average results.

IV-B AnyNet Search Space

Search Modules	Number ( $D_{i}$ )	Widths ( $W_{i}$ )	Bottleneck Ratio ( $R_{i}$ )	Group ( $G_{i}$ )
Range	$0<D_{i}\leq 16$	$24^{\dagger}\leq W_{i}\leq 1024$	$\{0.25,0.5,1\}$	$0<G_{i}\leq 32$
RegNetX-600MF	[1,3,5,7]	[48,96,240,528]	[1,1,1,1]	[2,4,10,22]
PredNAS-600MF	[3,3,6,12]	[64,152,256,1024]	[0.5,0.5,0.5,0.25]	[2,19,32,22]

TABLE III: AnyNet search space proposed in RegNet[40]. The below shows the structures of RegNetX-600MF and PredNAS-600MF. ^†: We increase the lower bound of widths from 0 to 24, and width should be divisible by 8.

Training Method Sampled 200MF 400MF 600MF 800MF 1.6GF 3.2GF 4GF 6.4GF 8.0GF 12GF Proxy RegNetX[40] $2750$ 59.3 63.2 65.1 66.1 68.9 70.0 71.4 72.6 73.4 73.7 PredNAS $330$ 59.2 63.4 65.4 66.5 69.0 70.8 71.3 72.5 73.2 73.5 PredNAS^† $330$ 58.5 63.0 65.0 66.2 69.0 70.7 71.2 72.5 73.2 73.5 Full RegNetX[40] $-$ 68.9 72.7 74.1 75.2 77.0 78.3 78.6 79.2 79.3 79.7 PredNAS $-$ 67.9 72.3 73.6 74.8 76.8 77.1 78.0 79.2 79.1 79.5 PredNAS^† $-$ 69.0 72.5 74.2 75.0 76.8 78.1 78.4 79.2 79.1 79.5

TABLE IV: Comparison between RegNetX models and our PredNAS on ImageNet validation set. We report the top-1 accuracy of models with proxy and full training schedule respectively.

We validate our method on ImageNet dataset and compare the searched networks with RegNetX models. RegNetX models are a series networks with different FLOPs obtained from the RegNet search space [40]. RegNet is obtained by progressively shrinking an unconstrained search space AnyNet by human heuristics. The search space of AnyNet contains 16 variables including the number of blocks, block width, bottleneck ratio and the number of group in 4 stages. Table III summarizes the search space of AnyNet, which includes about $(16\cdot 128\cdot 3\cdot 6)^{4}\approx 10^{18}$ architectures. The authors of [40] proposes several prior knowledge to progressively refine the AnyNet search space and sample 500 models every step to validate whether the refinement is worthy. They finally explore a RegNet search space which contains about $10^{8}$ models after 5 times refinement. Then each of the RegNetX models is obtained by picking the best model from 25 random architectures sampled from the RegNet search space with a specific FLOPs constraints. In this paper, we argue that this strategy still needs intensive human knowledge and trial and error, which is contradict to the spirit of AutoML in fact. Consequently, we adopt our PredNAS directly on the unconstrained AnyNet search space, and compare the searched architectures to RegNetX models.

We random sample 30 models from the AnyNet search space and train each model for 10 epochs on the ImageNet dataset [14] following the proxy task setting in [40]. As [40] provides RegNetX models with a variety of FLOPs, we adjust the weight of auxiliary predictor $\alpha$ to get the models with different target FLOPs during search. For example, the $\alpha$ is set to 1 for 200MF FLOPs regime while 0.2 for 3.2GF FLOPs.

For each FLOPs regime, we choose top-30 models from the model pool and pick the best one according to its performance on the proxy task. Then we train the best models with 100 epochs following [40]. Finally, we sampled $30+30\times 10=330$ models to get 10 architectures with different FLOPs, which is much less than 2750, the number of models sampled in RegNet [40].

As shown in Table IV, we achieve comparable performances to RegNetX from 200MF to 12GF by directly adopting our method on the AnyNet search space. We found some structures of models we explore are not consistent with the design principles introduced in RegNet. For example, PredNAS-600MF is deeper than RegNetX-600MF (see Table III), and the bottleneck ratio is not strict to 1. This reveal the fact that human hand-crafted design may still introduce inductive biases which should be avoided. Other architectures of searched model can be found in Appendix.

Another interesting finding is that we find there still exists a non-negligible gap between proxy and full train setting. PredNAS usually finds significantly better model than RegNetX in the proxy setting, however in full training, the conclusion contradicts. To further validate the gap between proxy and full train setting, we show another series of models PredNAS^† in Table IV. For each FLOPs regime, We select top-3 models according to their performances on the proxy task, and retrain all of them with 100 epochs. Then the models with best performances are selected as PredNAS^†. The architectures with best performances on full train setting may yield worse results on the proxy task. Consequently, we believe designing a better proxy setting that is more consistent with full training results is a crucial task for efficient NAS, however it is beyond the scope of this work and we leave it for future work.

IV-C YOLOX Search Space

Search Module Search Space Number Stem Width ( $S$ ) $0<S\leq 80$ 10 Backbone Depth ( $b$ ) $0<b\leq 12$ $(12\cdot 10\cdot 3)^{4}$ Width ( $B^{k}$ ) $0<B^{k}\leq B^{k-1}*2,$ $B^{0}=80$ Expand ratio $\{0.25,0.5,1\}$ Neck Depth ( $n$ ) $0<n\leq 3$ Width ( $N^{k}$ ) $0<N^{k}\leq 1280$ $(3\cdot 10\cdot 3)^{4}$ Expand ratio $\{0.25,0.5,1\}$ Shared convs Depth ( $s$ ) $0<s\leq 4$ $8^{(4+3+2+1)*3}$ Width ( $W_{s_{i}}$ ) $0<W_{s_{i}}\leq 512,0<s_{i}\leq s$ Cls head Depth ( $l$ ) $0<l\leq 4$ $8^{(4+3+2+1)*3}$ Width ( $W_{l_{i}}$ ) $0<W_{l_{i}}\leq 512,0<l_{i}\leq l$ Reg head Depth ( $r$ ) $0<r\leq 4$ $8^{(4+3+2+1)*3}$ Width ( $W_{r_{i}}$ ) $0<W_{r_{i}}\leq 512,0<r_{i}\leq r$

TABLE V: The proposed search space of YOLOX. There are four stages in the backbone and neck. We search the depth of each stage. The width and expand ratio of different layers are shared in the

k

-th stage. The width in stem (

S

), backbone (

B^{k}

) and neck (

N^{k}

) of each stage will be divided into 10 values. The width in heads will be divided into 8 values. Three decoupled detection heads are responsible for three level of FPN features respectively. In total, the number of architectures in such search space is

10^{100}

We also adopt our method on the MSCOCO [33] dataset. The baseline is YOLOX [20], a highly efficient method proposed very recently. We propose a new search space based on YOLOX. The network architecture in YOLOX consists of three modules: backbone, neck and decoupled detection heads. The backbone and neck are composed of the same type of basic blocks. The searchable dimensions are depth (i.e. the number of blocks), widths and expand ratio of blocks. There are three decoupled detection heads in YOLOX, and each of them is responsible for a level of FPN [32] feature. The decoupled detection head contains convolution layers (shared convs) followed by two parallel branches for classification (cls head) and regression (reg head), respectively. We search for the depths and widths of convolution layers in each head. The summary of our proposed YOLOX search space is shown in Table V. This search space includes about $10^{100}$ possible architectures.

Model	FLOPs (G)	Parameters (M)	AP (%)	$\text{AP}_{50}$	$\text{AP}_{75}$	$\text{AP}_{S}$	$\text{AP}_{M}$	$\text{AP}_{L}$
YOLOX-S	26.8	9.0	40.5	59.3	43.7	23.2	44.8	54.1
PredNAS–S	28.4	10.6	42.5	61.2	46.0	23.9	47.2	55.7
YOLOX-M	73.8	25.3	46.9	65.6	51.1	29.0	52.1	62.3
PredNAS–M	76.0	24.1	47.1	66.0	51.0	28.6	52.0	62.2
YOLOX-L	155.6	54.2	49.7	68.0	53.9	32.2	55.0	65.1
PredNAS–L	158.2	55.8	50.5	69.5	54.8	32.1	55.6	65.0

TABLE VI: Comparison between YOLOX and the models searched by our PredNAS in terms of AP (

\%

) on COCO val2017.

Search Modules YOLOX-S PredNAS-S Stem W 32 16 Backbone D [1,3,3,1] [4,3,3,1] W [64,128,256,512] [72,80,320,488] R [0.5,0.5,0.5,0.5] [0.25,1,0.5,0.25] Neck D [1,1,1,1] [1,3,1,3] W [256,128,256,512] [512,256,576,304] R [0.5,0.5,0.5,0.5] [0.5,0.25,0.5,0.25] Shared convs D [1,1,1] [2,2,3] W [[128],[128],[128]]] [[224, 40],[224, 176],[256, 136, 176]] Cls head D [2,2,2] [2,1,3] W [[128, 128],[128, 128],[128, 128]] [[56, 96],[32], [176, 40, 224]] Reg head D [2,2,2] [1,2,1] W [[128, 128],[128, 128],[128, 128]] [[32],[104, 224], [192]]

TABLE VII: The structure of YOLOX-S[20] and PredNAS-S. W, D and R denote width, depth and ratio, respectively.

For such huge search space, we merely randomly sample 40 models to build training data. We train each sampled model following [20], but using 6% train data to alleviate the training cost.

For each target FLOPs, we select the best architecture from top-30 models in the model pool.

Table VI shows the results of our searched models. With FLOPs constraints, our models (PredNAS-S/M/L) consistently achieve better performances than YOLOX-S/M/L on the COCO val2017 set. Especially, comparing with YOLOX-S, our PredNAS-S achieves 42.5% AP, improving YOLOX-S by 2.0%. This result shows the effectiveness of our method on small scale models in object detection task. We further analyze the discrepancy of architectures between YOLOX-S and PredNAS-S (see Table V). PredNAS-S has a deeper backbone and computation resources are allocated more on the shared convolutions in head while less on the classification and regression heads.

V Discussion

Step	Architecture	Valid-Acc	Pred-Acc
0	[[4],[2,2],[3,3,3]]	55.71	57.88
60	[[4],[2,2],[3,3,3]]	55.71	57.88
120	[[4],[2,2],[3,3,1]]	63.02	64.57
160	[[2],[2,2],[2,3,1]]	70.34	76.88
200	[[2],[2,2],[2,3,1]]	70.34	76.88
0	[[5],[1,3],[4,1,2]]	68.98	64.00
60	[[5],[1,3],[4,1,2]]	68.98	64.00
120	[[5],[1,3],[4,1,2]]	68.98	64.00
160	[[2],[1,2],[4,1,2]]	73.14	76.01
200	[[2],[1,2],[4,1,2]]	73.14	76.01

V-A Effectiveness of Gradient Search

Gradient search can explore better architectures. In Fig.3 (left), we randomly pick some intermediate architectures during the procedure of gradient search on NAS-Bench-201. Given an initial architecture, gradient update can explore novel architecture with better performance. With the increase of prediction score, the accuracy of architecture on the validation set is also improving. Fig. 3 (right) also shows the accuracy curve of models with different initialization. After 200 iterations, most of the architectures achieve 68%-72% accuracy at the last step, even with poor initialization.

Comparison with random search. We compare gradient search with random search on the AnyNet search space. If we can traverse all the search space, the result obtained by random search is the same as gradient search due to the identical predictor for ranking models. However, it is unrealistic to traverse all the architectures in a large search space. For fair comparison, we randomly sample 100000 models following the number of models explored in gradient search from the AnyNet search space. Then we select top-15 models according to the scores of the main predictor for 5 different FLOPs regimes respectively. For each FLOPs regime, we train the top-15 models on the proxy task and obtain the performance of each model. Fig. 4 (left) shows the best models of different target FLOPs obtained by random search and gradient search. Obviously, gradient search can explore better architecture than random search. Moreover, we show the performances of top-15 models found by random search and gradient search in Fig. 4 (right). Most of the models obtained by gradient search achieve higher accuracy than random search.

Model	Random	Gradient
200MF	56.9%	59.2%
800MF	65.5%	66.5%
1.6GF	67.9%	69.0%
4GF	71.1%	71.3%
12GF	72.3%	73.5%

V-B N + K Ablation Study

We conduct experiments on NAS-Bench-201 with different N and K. N is the number of training samples for predictor and K is the number of top models selected for evaluation. The top of Table VIII shows the results with different number of training samples. Surprisingly, we found a small number of N is enough for training a predictor which can distinguish the performance of architectures coarsely. With K = 40, training predictor with only 10 samples can achieve better results than several baseline methods shown in Table II. The bottom of Table VIII shows the results with different K. We found K is more important in our experiments. With the increase of K, the performances are improving stably. This ablation shows that we can train a predictor with few samples and adopt it to do coarse model selection. Then the evaluation of top-K models can be seen as a refinement procedure.

#Samples N (K = 40)	10	20	30	40	50
CIFAR-100	72.85 $\pm$ 0.70	72.85 $\pm$ 0.37	73.09 $\pm$ 0.48	72.66 $\pm$ 0.54	72.77 $\pm$ 0.71
#Top K (N = 30)	10	20	30	40	50
CIFAR-100	71.92 $\pm$ 0.71	72.56 $\pm$ 0.62	72.74 $\pm$ 0.48	73.09 $\pm$ 0.48	73.17 $\pm$ 0.27

TABLE VIII: The results of PredNAS with different N and K on NAS-Bench-201. We repeat 15 times to obtain mean and variance.

VI Conclusion

In this paper, we have proposed a universal framework named PredNAS for neural architecture search. In our framework, we adopted neural predictor to estimate the performance of architectures and used a simple gradient search strategy to do architecture search. We validate our method on NAS-Bench-201, ImageNet and MSCOCO. With less than 100 training samples, our PredNAS could achieve comparable or even better performances than existing state-of-the-art methods. To reduce the training cost, we adopted proxy task to get the performance of architectures. However, we found the consistency of model performances between proxy task and the final task would limit the effectiveness of our method. It is interesting to investigate how to design the proxy task by the technique of NAS, which we leave it as our future work.

References

[1] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018.
[2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-All: Train one network and specialize it for efficient deployment. In ICLR, 2020.
[3] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
[4] Xiangning Chen and Cho-Jui Hsieh. Stabilizing differentiable architecture search via perturbation-based regularization. In ICML, 2020.
[5] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In ICCV, 2019.
[6] Yukang Chen, Gaofeng Meng, Qian Zhang, Shiming Xiang, Chang Huang, Lisen Mu, and Xinggang Wang. ReNAS: Reinforced evolutionary neural architecture search. In CVPR, 2019.
[7] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. DetNAS: Backbone search for object detection. In NeurIPS, 2019.
[8] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
[9] Xiangxiang Chu, Bo Zhang, and Ruijun Xu. FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search. In ICCV, 2021.
[10] Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. Fair DATRS: Eliminating unfair advantages in differentiable architecture search. In ECCV, 2020.
[11] Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. FBNetV3: Joint architecture-recipe search using neural acquisition function. In CVPR, 2021.
[12] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 403–449. Springer, 2014.
[13] Boyang Deng, Junjie Yan, and Dahua Lin. Peephole: Predicting network performance before training. arXiv preprint arXiv:1712.03351, 2017.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[15] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.
[16] Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking nas algorithms for architecture topology and size. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[17] Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In ICLR, 2019.
[18] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In CVPR, 2020.
[19] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In ICML, 2018.
[20] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[21] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
[22] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. In ECCV, 2020.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[24] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[26] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[27] Yiming Hu, Yuding Liang, Zichao Guo, Ruosi Wan, Xiangyu Zhang, Yichen Wei, Qingyi Gu, and Jian Sun. Angle-based search space shrinking for neural architecture search. In ECCV, 2020.
[28] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[29] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016.
[30] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[32] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[34] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
[35] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
[36] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2018.
[37] Yuqiao Liu, Yehui Tang, and Yanan Sun. Homogeneous architecture augmentation for neural predictor. In ICCV, 2021.
[38] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. NeurIPS, 2018.
[39] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In ICML, 2018.
[40] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollar. Designing network design spaces. In CVPR, 2020.
[41] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI, 2019.
[42] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[43] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MNASNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
[44] Ning Wang, Yang Gao, Hao Chen, Peng Wang, Zhi Tian, Chunhua Shen, and Yanning Zhang. Nas-fcos: Fast neural architecture search for object detection. In CVPR, 2020.
[45] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. In ECCV, 2020.
[46] Yixing Xu, Yunhe Wang, Kai Han, Yehui Tang, Shangling Jui, Chunjing Xu, and Chang Xu. ReNAS: Relativistic evaluation of neural architecture search. In CVPR, 2021.
[47] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, and Quoc Le. BigNAS: Scaling up neural architecture search with big single-stage models. In ECCV, 2020.
[48] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and robustifying differentiable architecture search. In ICLR, 2020.
[49] Xinbang Zhang, Zehao Huang, Naiyan Wang, Shiming Xiang, and Chunhong Pan. You only search once: Single shot neural architecture search via direct sparse optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9):2891–2904, 2020.
[50] Yuge Zhang, Chenqian Yan, Quanlu Zhang, Li Lyna Zhang, Yaming Yang, Xiaotian Gao, and Yuqing Yang. AceNAS: Learning to rank ace neural architectures with weak supervision of weight sharing. In ICCV, 2021.
[51] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2016.
[52] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.

Supplementary

Searched Models

RegNet. Table IX shows the searched models of PredNAS and PredNAS^† in RegNet search space. Following to RegNet, we increase the upper bound of search space to get large scale models during search. Therefore, some dimensions in the searched models may outrange the AnyNet search space described in the main paper. For example, the width of PredNAS-12GF in the last stage is 1368, larger than 1024. Interestingly, most of the searched models have increasing widths as the RegNet suggested.

YOLOX. We adjust the lower bound to obtain models with small FLOPs during search. Table X describes the models we found in the YOLOX search space. PredNAS-S/M/L are deeper than YOLOX-S/M/L and the proportion of share convs is increasing.

Search Module Depth Width Ratio Groups^∗ Group Width Flops(B) Acc(%) RegNetX-200MF [1,1,4,7] [24,56,152,368] [1,1,1,1] [3,7,9,46] 8 0.2 68.9 PredNAS-200MF [5,2,16,1] [24,32,96,1600] [1,1,1,0.25] [3,1,12,16] [8,32,8,25] 0.22 67.9 PredNAS^†-200MF [2,11,13,14] [24, 88, 56, 616] [0.5,0.25,1,0.25] [6,11,8,22] [2,2,7,7] 0.22 69.0 RegNetX-400MF [1,2,7,12] [32,64,160,384] [1,1,1,1] [2,4,10,24] 16 0.4 72.7 PredNAS-400MF [4,2,10,8] [64,232,64,760] [0.5,0.25,1,0.5] [8,29,32,38] [4,2,2,10] 0.41 72.3 PredNAS^†-400MF [3,3,15,8] [64,64,64,792] [0.5,0.5,1,0.5] [4,8,8,33] [8,4,8,12] 0.4 72.5 RegNetX-600MF [1,3,5,7] [48,96,240,528] [1,1,1,1] [2,4,10,22] 24 0.6 74.1 PredNAS-600MF [3,3,6,12] [64,152,256,1024] [0.5,0.5,0.5,0.25] [2,19,32,32] [16,4,4,8] 0.61 73.6 PredNAS^†-600MF [13,2,11,8] [32,128,208,880] [0.5,1,0.5,0.5] [16,8,26,22] [1,16,4,20] 0.61 74.2 RegNetX-800MF [1,3,7,5] [64,128,288,672] [1,1,1,1] [4,8,18,42] 16 0.8 75.2 PredNAS-800MF [6,3,13,2] [96,168,168,1048] [0.25,0.5,1,1] [8,4,28,8] [3,21,6,131] 0.77 74.8 PredNAS^†-800MF [4,9,13,11] [64,64,168,888] [0.5,1,0.5,0.5] [16,4,12,37] [2,16,7,12] 0.76 75.0 RegNetX-1.6GF [2,4,10,2] [72,168,408,912] [1,1,1,1] [3,7,17,38] 24 1.6 77.0 PredNAS-1.6GF [4,1,8,11] [128,240,472,968] [0.5,0.5,0.5,0.5] [8,10,4,11] [8,12,59,44] 1.64 76.8 PredNAS^†-1.6GF [4,1,8,11] [128,240,472,968] [0.5,0.5,0.5,0.5] [8,10,4,11] [8,12,59,44] 1.64 76.8 RegNetX-3.2GF [2,6,15,2] [96,192,432,1008] [1,1,1,1] [2,4,9,21] 32 3.2 78.3 PredNAS-3.2GF [5,1,14,13] [128,128,496,1144] [0.5,0.5,0.5,1] [32,16,31,26] [2,4,8,44] 3.1 78.3 PredNAS^†-3.2GF [12,1,16,6] [128,552,528,984] [0.5,0.5,0.5,1] [8,46,33,12] [8,6,8,82] 3.0 78.1 RegNetX-4.0GF [2,5,14,2] [80,240,560,1360] [1,1,1,1] [2,6,14,34] 40 4.0 78.6 PredNAS-4.0GF [13,5,14,1] [128,128,992,1024] [0.5,0.5,0.5,1] [32,16,31,4] [2,4,16,256] 4.1 78.0 PredNAS^†-4.0GF [10,2,19,10] [88,152,704,1368] [1,0.5,0.5,0.5] [8,19,32,12] [11,4,11,57] 4.0 78.4 RegNetX-6.4GF [2,4,10,1] [168,392,784,1624] [1,1,1,1] [3,7,14,29] 56 6.5 79.2 PredNAS-6.4GF [12,2,17,1] [184,296,1240,1316] [0.5,0.5,0.25,1] [4,37,10,1] [23,4,31,1316] 6.4 79.2 PredNAS^†-6.4GF [12,2,17,1] [184,296,1240,1316] [0.5,0.5,0.25,1] [4,37,10,1] [23,4,31,1316] 6.4 79.2 RegNetX-8.0GF [2,5,15,1] [80,240,720,1920] [1,1,1,1] [1,2,6,16] 80 8.0 79.3 PredNAS-8.0GF [8,2,15,1] [304,512,1184,1408] [0.25,0.5,0.5,1] [38,16,37,1] [2,16,16,1408] 7.9 79.1 PredNAS^†-8.0GF [8,2,15,1] [304,512,1184,1408] [0.25,0.5,0.5,1] [38,16,37,1] [2,16,16,1408] 7.9 79.1 RegNetX-12GF [2,5,11,1] [224,448,896,2240] [1,1,1,1] [2,4,8,20] 112 12.1 79.7 PredNAS-12GF [13,2,16,2] [136,720,1184,1368] [1,0.5,0.5,1] [8,2,8,19] [17,160,74,72] 11.8 79.5 PredNAS^†-12GF [13,2,16,2] [136,720,1184,1368] [1,0.5,0.5,1] [8,2,8,19] [17,160,74,72] 11.8 79.5

TABLE IX: Searched Models in RegNet Search Space. ^∗: In gradient search, we will convert the group to the nearst integer of itself making it can divided by the input channel. Group Width is equal to the input channel divided by group.

Search Modules Backbone Neck Share Conv Cls Head Reg Head depth width ratio depth width ratio depth width depth width depth width YOLOX-S $[1,3,3,1]$ $[32,64,128,256,512]$ $[0.5,0.5,0.5,0.5]$ $[1,1,1,1]$ $[256,128,256,512]$ $[0.5,0.5,0.5,0.5]$ $[1,1,1]$ $[[128],[128],[128]]$ $[2,2,2]$ $[[128,128],[128,128],[128,128]]$ $[2,2,2]$ $[[128,128],[128,128],[128,128]]$ PredNAS-S $[4,3,3,1]$ $[16,72,80,320,488]$ $[0.25,1,0.5,0.25]$ $[1,3,1,3]$ $[512,256,576,304]$ $[0.5,0.25,0.5,0.25]$ $[2,2,3]$ $[[224,40],[224,176],[256,136,176]]$ $[2,1,3]$ $[[56,96],[32],[176,40,224]]$ $[1,2,1]$ $[[32],[104,224],[192]]$ PredNAS-S^T $[1,1,1,6]$ $[8,48,96,256,336]$ $[0.25,0.5,0.25,0.25]$ $[1,2,3,3]$ $[688,288,788,472]$ $[0.25,0.25,0.25,1]$ $[1,1,1]$ $[[32],[128],[96]]$ $[1,1,3]$ $[[32],[120],[32,248,224]]$ $[2,2,4]$ $[[64,32],[232,80],[32,112,144,144]]$ YOLOX-M $[2,6,6,2]$ $[48,96,192,384,768]$ $[0.5,0.5,0.5,0.5]$ $[2,2,2,2]$ $[384,192,384,768]$ $[0.5,0.5,0.5,0.5]$ $[1,1,1]$ $[[192],[192],[192]]$ $[2,2,2]$ $[[192,192],[192,192],[192,192]]$ $[2,2,2]$ $[[192,192],[192,192],[192,192]]$ PredNAS-M $[3,8,6,2]$ $[48,96,192,384,768]$ $[0.25,0.25,0.5,0.25]$ $[2,2,3,2]$ $[384,192,640,768]$ $[0.5,0.25,0.25,0.25]$ $[3,1,3]$ $[[192,248,192],[192],[192,192,192]]$ $[2,2,2]$ $[[192,192],[192,240],[192,192]]$ $[2,2,3]$ $[[192,192],[192,192],[192,232,208]]$ PredNAS-M^T $[1,1,10,2]$ $[64,96,216,512,664]$ $[0.5,0.5,0.25,0.5]$ $[1,3,3,1]$ $[752,128,1024,976]$ $[0.25,0.5,0.5,1]$ $[1,1,3]$ $[[32],[32],[488,312,160]]$ $[1,1,4]$ $[[32,],[48],[104,408,456,480]]$ $[1,4,4]$ $[[168],[32,320,208,344],[328,232,160,376]]$ YOLOX-L $[3,9,9,3]$ $[64,128,256,512,1024]$ $[0.5,0.5,0.5,0.5]$ $[3,3,3,3]$ $[512,256,512,1024]$ $[0.5,0.5,0.5,0.5]$ $[1,1,1]$ $[[256],[256],[256]]$ $[2,2,2]$ $[[256,256],[256,256],[256,256]]$ $[2,2,2]$ $[[256,256],[256,256],[256,256]]$ PredNAS-L $[12,12,11,10]$ $[48,120,240,384,960]$ $[0.5,0.5,0.5,0.5]$ $[3,2,2,3]$ $[384,568,640,768]$ $[0.25,0.25,0.5,0.25]$ $[3,1,3]$ $[[192,208,192],[256],[192,256,192]]$ $[3,2,3]$ $[[192,192,192],[192,256],[256,256,256]]$ $[2,3,2]$ $[[192,192],[256,256,256],[256,192]]$

TABLE X: Searched models in YOLOX search space.