RATs-NAS: Redirection of Adjacent Trails on GCN for Neural Architecture Search
Abstract
Various hand-designed CNN architectures, such as VGG, ResNet, and DenseNet, have achieved state-of-the-art (SoTA) results on different tasks. Neural Architecture Search (NAS) instead aims to find the best CNN architecture for such tasks automatically. However, verifying a searched architecture is very time-consuming, which makes predictor-based methods an essential branch of NAS. Two commonly used techniques for building predictors are graph convolutional networks (GCN) and multilayer perceptrons (MLP). In this paper, we examine how GCN and MLP differ in their treatment of adjacent operation trails and propose Redirected Adjacent Trails NAS (RATs-NAS) to quickly search for the desired neural network architecture. RATs-NAS consists of two components: the Redirected Adjacent Trails GCN (RATs-GCN) and the Predictor-based Search Space Sampling (P3S) module. RATs-GCN can redirect trails and adjust their strengths to search for a better architecture, while P3S rapidly focuses on tighter FLOPs intervals of the search space. Based on our observations of cell-based NAS, we believe that architectures with similar FLOPs perform similarly. Combining RATs-GCN and P3S, RATs-NAS beats WeakNAS, Arch-Graph, and other methods by a significant margin on the three sub-datasets of NASBench-201.
Index Terms— Neural Architecture Search, predictor-based NAS, cell-based NAS.
1 Introduction
Many Convolutional Neural Networks (CNNs) have been proposed and achieved great success in the past decade [1, 2, 3]. However, designing a handcrafted CNN architecture requires human intuition and experience, and building an optimal CNN takes considerable effort and time. NAS [4] addresses this problem by automatically searching for the best neural network architecture with a specific strategy in a specific search space [5, 6], and many methods have been proposed recently. Cell-based NAS relies on a meta-architecture to reduce the complexity of the search. The meta-architecture is a CNN model with pre-defined hyperparameters, such as the number of channels and the number of stacked cells, and the stacked cells are composed of operations such as convolution and pooling. Searching for a CNN architecture is therefore equivalent to searching for a cell. However, it is time-consuming to verify a searched architecture candidate. Predictor-based NAS methods encode an architecture with an adjacency matrix and an operation matrix to quickly predict its performance. The adjacency matrix indicates the adjacent trails of operations in a cell, and the operation matrix indicates which operations are used in the cell. In general, a GCN-based predictor uses both matrices as input to predict the performance of an architecture, whereas an MLP-based predictor uses only the operation matrix. For example, Neural Predictor [7] and BRP-NAS [8] build their predictors with GCN. However, WeakNAS [9] obtains a significant improvement over BRP-NAS simply by applying a more elaborate sampling method to a weaker MLP predictor. Notably, the MLP does not use the prior adjacent trails of operations (the adjacency matrix) as the GCN does, which suggests that this prior knowledge may not be necessary and motivates us to explore the gap between GCN and MLP. In our experiments, we found that GCN is only sometimes better than MLP and is even worse than MLP in many experimental settings.
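To make this encoding concrete, the following minimal sketch (ours, not from the paper) shows how a hypothetical 7-node NASBench-101-style cell could be represented by an adjacency matrix and a one-hot operation matrix; the operation names are simplified placeholders.

```python
import numpy as np

# Hypothetical 7-node cell (NASBench-101 style): input, 5 operation nodes, output.
# adjacency[i][j] = 1 means the output of node i feeds node j (upper-triangular DAG).
adjacency = np.array([
    [0, 1, 1, 0, 0, 0, 0],   # input -> nodes 1, 2
    [0, 0, 0, 1, 0, 0, 0],   # node 1 -> node 3
    [0, 0, 0, 0, 1, 0, 0],   # node 2 -> node 4
    [0, 0, 0, 0, 0, 1, 0],   # node 3 -> node 5
    [0, 0, 0, 0, 0, 1, 0],   # node 4 -> node 5
    [0, 0, 0, 0, 0, 0, 1],   # node 5 -> output
    [0, 0, 0, 0, 0, 0, 0],
], dtype=np.float32)

# One-hot operation matrix: rows are nodes, columns are candidate operations
# (simplified placeholder names, including input/output markers).
OPS = ["input", "conv1x1", "conv3x3", "maxpool3x3", "output"]
node_ops = ["input", "conv3x3", "conv1x1", "conv3x3", "maxpool3x3", "conv3x3", "output"]
operations = np.zeros((7, len(OPS)), dtype=np.float32)
for node, op in enumerate(node_ops):
    operations[node, OPS.index(op)] = 1

# A GCN-based predictor consumes (adjacency, operations);
# an MLP-based predictor flattens and consumes only `operations`.
```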

This phenomenon may be due to an information propagation barrier caused by the fixed adjacent trails and the matrix multiplication in GCN. The proposed Redirected Adjacent Trails GCN (RATs-GCN) is therefore an adaptive intermediate between GCN and MLP: it keeps the prior knowledge of adjacent trails while avoiding the information transmission obstacles that GCN may cause. It can change the trails through learning and replaces the binary state of each trail with a learned weight in [0, 1]. In addition, based on our observations of cell-based NAS methods, we believe that architectures with similar FLOPs perform similarly. We therefore propose a Predictor-based Search Space Sampling (P3S) module that rapidly focuses on tighter FLOPs intervals of the search space to search efficiently for the desired architecture. Finally, the proposed RATs-NAS method surpasses WeakNAS and Arch-Graph [10] and achieves SoTA performance on NASBench-201.
2 RELATED WORK
2.1 Various Types of NAS
There have been many studies on NAS in the past. Some methods are based on reinforcement learning [4, 11, 12], while others build on evolutionary algorithms [13, 14, 15, 16]. Predictor-based NAS methods focus on training a predictor that estimates the performance of a CNN architecture and quickly filters out unpromising candidates [17, 18, 7, 8, 19, 9, 10], which reduces the verification time required to evaluate an architecture candidate. A cell-based search space shares a fixed meta-architecture and the same hyperparameters, so the search space is reduced to a small cell containing several operations such as convolution and pooling. The predictors can be GCN-based or MLP-based: the GCN-based predictor is more accurate but needs more training time than the MLP-based one. This paper proposes a predictor called RATs-GCN, which combines the advantages of both GCN and MLP to better find the desired architecture.
2.2 Predictor-based NAS
Many types of predictors [17, 18, 7, 8, 19] have been proposed for NAS. Neural Predictor [7] is the most common approach; it encodes a cell as an adjacency matrix and an operation matrix, where the adjacency matrix indicates the adjacent trails of operations and the operation matrix describes the features of the operations. This GCN-based predictor shows significant performance on NASBench-101 [5]. Since the cells in the meta-architecture all have the same number of nodes, the hyperparameters of the GCN can be fixed. It uses multiple graph convolution layers [20] to extract high-level features and directionality from the above two matrices and then uses a fully connected layer to obtain the prediction, with which promising architectures are identified in the search space. BRP-NAS [8] instead proposes a binary predictor that takes two different architectures as input and predicts which one is better rather than predicting their accuracies directly, which dramatically improves prediction performance compared to Neural Predictor [7]. WeakNAS [9] proposes a more robust sampling method combined with an MLP predictor; surprisingly, even though a weak MLP-based predictor is used, it still surpasses Neural Predictor and BRP-NAS. Arch-Graph [10] proposes a transferable predictor and can find promising architectures on NASBench-101 and NASBench-201 [6] with a limited budget.
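For reference, the sketch below is our own simplified illustration of such a GCN-based predictor: a few graph convolutions in the spirit of [20] applied to the (adjacency, operation) pair, followed by a fully connected regression head. The layer sizes and the symmetric normalization are assumptions for illustration, not the exact design of Neural Predictor or BRP-NAS.

```python
import torch
import torch.nn as nn

class GCNPredictor(nn.Module):
    """Simplified GCN-based accuracy predictor: graph convolutions + FC head."""
    def __init__(self, num_ops: int, hidden: int = 32, layers: int = 3):
        super().__init__()
        dims = [num_ops] + [hidden] * layers
        self.gcs = nn.ModuleList([nn.Linear(i, o) for i, o in zip(dims[:-1], dims[1:])])
        self.head = nn.Linear(hidden, 1)  # regresses the predicted accuracy

    def forward(self, adj: torch.Tensor, ops: torch.Tensor) -> torch.Tensor:
        # adj: (B, N, N) adjacency matrices, ops: (B, N, num_ops) one-hot operations.
        # Symmetrize and add self-loops so information can flow in both directions.
        a = adj + adj.transpose(1, 2) + torch.eye(adj.size(1), device=adj.device)
        a = a / a.sum(dim=-1, keepdim=True)           # simple row normalization
        x = ops
        for gc in self.gcs:
            x = torch.relu(gc(a @ x))                 # A_norm · X · W per layer
        return self.head(x.mean(dim=1)).squeeze(-1)   # pool over nodes, predict accuracy
```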


Fig. 1. Adjacent trails of operations in (a) MLP, (b) GCN, and (c) RATs-GCN.
3 APPROACH
3.1 Redirected Adjacent Trails GCN (RATs-GCN)
As mentioned above, predictors can be GCN-based or MLP-based. As shown in Fig. 1, a GCN uses the adjacent trails of operations (adjacency matrix) and the operation features (operation matrix) of a cell as input, while an MLP uses only the operation features. This means that the GCN has more prior knowledge than the MLP, and the MLP can be regarded as treating the adjacent trails as fully connected. However, as shown in Tab. 1, we found that GCN is only sometimes better than MLP across the experimental settings, because its fixed adjacent trails hinder the information flow through matrix multiplication. The adjacency matrix used in GCN may also suffer from negative effects caused by the directions stored in the inherent adjacent trails. To address this problem, the trails stored in the adjacency matrix should be changed adaptively. This paper therefore proposes a RATs (Redirected Adjacent Trails) module and attaches it to the backbone of GCN to adaptively tune the trail directions and their weights stored in the adjacency matrix. This module allows the GCN to change each trail with a newly learned weight. In extreme cases, RATs-GCN can reduce to either GCN or MLP.
3.2 Redirected Adjacent Trails Module (RATs)
As described above, the trails stored in the adjacency matrix of GCN are fixed. If the trails are set wrongly, negative effects propagate to the predictor and degrade its accuracy. Fig. 1 illustrates the concept of our RATs predictor. The trails in the original GCN-based predictor (Fig. 1(b)) are fixed during training and inference, whereas the trails in our RATs-GCN (Fig. 1(c)) are adaptively adjusted according to the embedded code generated from the operation matrix. Fig. 2 shows the detailed designs of the RATs module and RATs-GCN. Unlike the original GCN-based predictor, our RATs-GCN predictor attaches a new RATs module. This module first converts the operation matrix into three feature vectors, query (Q), key (K), and value (V), by self-attention [21]. An embedded code is then obtained by concatenating Q, K, and V with the original adjacency matrix. With this embedded code as input, the trail offsets and operation strengths are generated by a linear projection and a sigmoid function. Finally, a new adjacency matrix is produced by adding the offsets to the adjacency matrix and taking the Hadamard product with the strengths. In this way, the RATs module can redetermine the trails and their strengths.
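The snippet below is a speculative sketch of how the RATs module could be implemented from this description. It collapses the self-attention of [21] into a single Q/K/V projection and guesses the projection sizes; only the overall flow (embedded code, then offsets and strengths, then a redirected adjacency matrix) follows the text.

```python
import torch
import torch.nn as nn

class RATsModule(nn.Module):
    """Sketch of a Redirected Adjacent Trails (RATs) module, per our reading of Sec. 3.2."""
    def __init__(self, num_nodes: int, num_ops: int, dim: int = 32):
        super().__init__()
        self.qkv = nn.Linear(num_ops, 3 * dim)            # Q, K, V projections (assumed sizes)
        in_dim = num_nodes * (3 * dim + num_nodes)        # concat of Q, K, V and adjacency rows
        self.offset = nn.Linear(in_dim, num_nodes * num_nodes)
        self.strength = nn.Linear(in_dim, num_nodes * num_nodes)

    def forward(self, adj: torch.Tensor, ops: torch.Tensor) -> torch.Tensor:
        # adj: (B, N, N) adjacency, ops: (B, N, num_ops) one-hot operations.
        b, n, _ = adj.shape
        qkv = self.qkv(ops)                               # (B, N, 3*dim)
        code = torch.cat([qkv, adj], dim=-1).flatten(1)   # embedded code per cell
        offset = self.offset(code).view(b, n, n)          # trail offsets
        strength = torch.sigmoid(self.strength(code)).view(b, n, n)  # weights in [0, 1]
        return (adj + offset) * strength                  # redirected, re-weighted trails
```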
Table 1. Comparison of MLP, GCN, BI-GCN, and RATs-GCN under different training budgets (mAcc and Psp in %).

NASBench-101 (423,624 unique cells):

| Method | Budget | mAcc (%) | Psp (%) |
|---|---|---|---|
| MLP | 300 | 90.78 | 30.38 |
| GCN | 300 | 89.54 | 1.93 |
| BI-GCN | 300 | 91.48 | 43.82 |
| RATs-GCN | 300 | 92.80 | 60.80 |
| MLP | 600 | 91.72 | 42.87 |
| GCN | 600 | 91.04 | 18.52 |
| BI-GCN | 600 | 91.56 | 38.56 |
| RATs-GCN | 600 | 92.94 | 70.24 |
| MLP | 900 | 92.03 | 48.45 |
| GCN | 900 | 90.94 | 27.16 |
| BI-GCN | 900 | 92.15 | 53.71 |
| RATs-GCN | 900 | 93.01 | 70.58 |

NASBench-201 (15,625 unique cells):

| Method | Budget | CIFAR-10 mAcc | CIFAR-10 Psp | CIFAR-100 mAcc | CIFAR-100 Psp | ImgNet-16 mAcc | ImgNet-16 Psp |
|---|---|---|---|---|---|---|---|
| MLP | 30 | 88.54 | 10.39 | 64.68 | 19.39 | 37.98 | 22.97 |
| GCN | 30 | 84.86 | -0.04 | 65.12 | 31.89 | 37.69 | 33.06 |
| BI-GCN | 30 | 86.26 | 21.02 | 61.96 | 34.06 | 38.28 | 40.61 |
| RATs-GCN | 30 | 89.68 | 47.61 | 69.81 | 65.72 | 43.11 | 67.18 |
| MLP | 60 | 90.93 | 27.36 | 68.31 | 42.68 | 41.49 | 47.32 |
| GCN | 60 | 87.87 | 29.93 | 67.42 | 42.77 | 38.89 | 44.05 |
| BI-GCN | 60 | 87.82 | 18.24 | 64.28 | 47.39 | 39.32 | 54.22 |
| RATs-GCN | 60 | 92.72 | 61.67 | 70.05 | 73.60 | 43.79 | 74.64 |
| MLP | 90 | 91.69 | 42.86 | 65.64 | 51.72 | 41.98 | 56.22 |
| GCN | 90 | 90.83 | 40.30 | 67.64 | 44.99 | 39.15 | 45.51 |
| BI-GCN | 90 | 89.71 | 36.14 | 68.11 | 62.22 | 42.17 | 65.51 |
| RATs-GCN | 90 | 93.17 | 70.50 | 69.66 | 74.98 | 44.16 | 77.39 |
3.3 Predictor-based Search Space Sampling (P3S)
Although the proposed RATs-GCN already provides more flexible plasticity than GCN and MLP, the performance of a predictor-based NAS depends not only on the predictor design but also on the sampling method; WeakNAS reaches SoTA performance with a weaker predictor and a stronger sampling method, so a good sampling strategy is bound to bring considerable improvement. The proposed P3S is based on two observations about cell-based NAS: (1) the architectures constructed in a cell-based approach share the same meta-architecture and the same candidate operations for the cell search; (2) each layer of the meta-architecture has the same hyperparameters, such as filters and strides, so the candidate operations of a cell have the same input and output shapes. In short, many architectures are very similar because they share the same meta-architecture, a limited set of candidate operations, and the same hyperparameters. These observations lead to our P3S method, which rapidly narrows the search space to tighter FLOPs intervals by roughly the following steps:
(1) Sort the search space by FLOPs and initialize the focus interval to the whole sorted space.
(2) Select the top 1% of the architectures in the current focus interval as ranked by the predictor P_t at iteration t.
(3) If at least 75% of the selected architectures lie in the first or the second half of the FLOPs-sorted interval, move the focus to that half; otherwise, keep the previous interval.
(4) Take the top architectures of the focus interval ranked by P_t, evaluate them, and add them to the sample pool S.
(5) Train P_{t+1} on S, then return to step (2).
At the beginning (t = 0), we randomly select samples from the search space as the initial training set for P_0. The above steps continue until we find the global optimal cell or exceed the budget. This process aims to rapidly narrow down to a tighter FLOPs range because we believe that architectures with similar FLOPs perform similarly, and corrective measures such as Step (2) help avoid falling into the wrong range.
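The following sketch summarizes our reading of the P3S loop. The helper callables `evaluate` and `train_predictor`, as well as the initial sample count and per-iteration query size, are placeholders rather than the paper's exact settings.

```python
import random

def p3s_search(search_space, flops, evaluate, train_predictor,
               budget=150, init_samples=30, top_k=10):
    """Sketch of the P3S loop (Sec. 3.3); helpers and sample counts are placeholders."""
    order = sorted(range(len(search_space)), key=lambda i: flops[i])  # indices sorted by FLOPs
    pos = {arch: p for p, arch in enumerate(order)}                   # FLOPs rank of each arch
    lo, hi = 0, len(order)                                            # current focus interval
    acc = {i: evaluate(search_space[i]) for i in random.sample(order, init_samples)}

    while len(acc) < budget:
        predictor = train_predictor([(search_space[i], a) for i, a in acc.items()])
        interval = order[lo:hi]
        ranked = sorted(interval, key=lambda i: predictor(search_space[i]), reverse=True)
        top = ranked[:max(1, len(interval) // 100)]                   # top 1% by prediction

        # If >= 75% of the promising cells fall in one half of the interval, focus on it.
        mid = (lo + hi) // 2
        first = sum(1 for i in top if pos[i] < mid)
        if first >= 0.75 * len(top):
            hi = mid
        elif len(top) - first >= 0.75 * len(top):
            lo = mid                                                  # else keep the interval

        # Query ground truth for the predictor's best unseen picks and grow the pool.
        new = [i for i in ranked if i not in acc][:top_k]
        if not new:
            break                                                     # nothing left to query here
        for i in new:
            if len(acc) < budget:
                acc[i] = evaluate(search_space[i])

    return search_space[max(acc, key=acc.get)]                        # best architecture found
```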
Table 2. Average number of architecture evaluations required to find the global optimal cell on NASBench-201 (lower is better).

| Method | CIFAR-10 | CIFAR-100 | ImgNet-16 |
|---|---|---|---|
| Random Search | 7782.1 | 7621.2 | 7726.1 |
| Reg Evolution | 563.2 | 438.2 | 715.1 |
| MCTS | 528.3 | 405.4 | 578.2 |
| LaNAS | 247.1 | 187.5 | 292.4 |
| WeakNAS | 182.1 | 78.4 | 268.4 |
| RATs-NAS | 114.6 | 74.3 | 146.7 |
4 EXPERIMENTS
4.1 Comparison of RATs-GCN with GCNs and MLP
For a fair comparison, we extensively test MLP, GCN, BI-GCN, and RATs-GCN under similar model settings with 30 runs. GCN, BI-GCN, and RATs-GCN have three graph convolution layers with 32 filters followed by one fully connected (FC) layer with a single output that predicts accuracy, and the MLP has three FC layers with 32 filters followed by one FC layer with a single output. All predictors use random sampling to obtain training architectures from the search space. We evaluate them on NASBench-101 with training budgets of 300, 600, and 900 architectures, and on the three sub-datasets (CIFAR-10, CIFAR-100, ImageNet-16) of NASBench-201 with training budgets of 30, 60, and 90. As shown in Tab. 1, RATs-GCN surpasses the others by about 1% to 5% in mAcc and about 10% to 50% in Psp under different budgets. Here, mAcc denotes the average ground-truth accuracy of the top 100 architectures ranked by the predictor, and Psp denotes the Spearman correlation between the predicted ranking and the ground-truth ranking.
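For clarity, both metrics can be computed as in the short sketch below, where `pred` and `gt` are hypothetical arrays holding the predicted scores and ground-truth accuracies of all candidate architectures.

```python
import numpy as np
from scipy.stats import spearmanr

def predictor_metrics(pred: np.ndarray, gt: np.ndarray, k: int = 100):
    """mAcc: mean ground-truth accuracy of the top-k architectures ranked by the predictor.
    Psp: Spearman correlation between predicted and ground-truth rankings."""
    top_k = np.argsort(pred)[::-1][:k]     # indices of the k highest predicted scores
    m_acc = gt[top_k].mean()
    p_sp, _ = spearmanr(pred, gt)
    return m_acc, p_sp
```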
4.2 Comparison of RATs-NAS and SOTAs
We design two experiments with 30 runs to verify the performance of RATs-NAS and compare it with other SoTA methods. The first experiment evaluates how quickly a NAS method finds the optimal architecture in the search space. As shown in Tab. 2, RATs-NAS needs fewer architecture evaluations than the other methods; it finds the global optimal cell at an average cost of 146.7 architectures on ImageNet-16, nearly twice as fast as WeakNAS. The second experiment examines how good an architecture can be found within a budget of 150 architecture evaluations. As shown in Tab. 3, RATs-NAS finds an architecture with an accuracy of 73.50% on NASBench-201 (CIFAR-100), better than the other SoTA methods. Considering that the optimal accuracies on the three sub-datasets of NASBench-201 are 94.37%, 73.51%, and 47.31%, RATs-NAS finds architectures with 94.36%, 73.50%, and 47.07% at such a small cost, showing significant performance and beating the others by a considerable margin.
4.3 Visualization of Adjacent Trails
To obtain more evidence supporting the RATs module beyond the performance-oriented experiments, we also visualize the trails of operations in a single cell. We randomly select an architecture (cell) from NASBench-101, draw its adjacent trails to represent GCN, draw full trails to represent MLP, and draw the new adjacent trails produced by the last RATs module in RATs-GCN. As shown in Fig. 4, the trails of the proposed RATs-GCN differ from those of GCN and MLP. Part (c) of Fig. 4 shows that, starting from the GCN trails, RATs-GCN learns approximately the MLP trails but with weights between 0 and 1.
Table 3. Accuracy (%) of the best architecture found on NASBench-201 within a budget of 150 architecture evaluations.

| Method | CIFAR-10 | CIFAR-100 | ImgNet-16 |
|---|---|---|---|
| NP-MLP | 93.95 | 72.15 | 46.30 |
| NP-GCN | 94.04 | 72.37 | 46.28 |
| NP-BI-GCN | 94.07 | 72.18 | 46.39 |
| NP-RATs | 94.17 | 72.78 | 46.58 |
| Random Search | 93.91 | 71.80 | 46.03 |
| Reg Evolution | - | 72.70 | - |
| BONAS | - | 72.84 | - |
| WeakNAS | 94.23 | 73.42 | 46.79 |
| Arch-Graph | - | 73.38 | - |
| RATs-NAS | 94.36 | 73.50 | 47.07 |
5 Conclusion
We proposed the RATs-GCN predictor to improve the performance of GCN-based predictors. It can redirect trails and assign them different weights, and it performs much better than GCN, MLP, and BI-GCN on NASBench-101 and NASBench-201. We further proposed the P3S method to rapidly divide the search space and focus on tighter FLOPs intervals. The resulting RATs-NAS, which consists of RATs-GCN and P3S, outperforms WeakNAS and Arch-Graph on NASBench-201 by a considerable margin.
References
- [1] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [3] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [4] Barret Zoph and Quoc V Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- [5] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter, “NAS-Bench-101: Towards reproducible neural architecture search,” in International Conference on Machine Learning. PMLR, 2019, pp. 7105–7114.
- [6] Xuanyi Dong and Yi Yang, “NAS-Bench-201: Extending the scope of reproducible neural architecture search,” arXiv preprint arXiv:2001.00326, 2020.
- [7] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans, “Neural predictor for neural architecture search,” in European Conference on Computer Vision. Springer, 2020, pp. 660–676.
- [8] Lukasz Dudziak, Thomas Chau, Mohamed Abdelfattah, Royson Lee, Hyeji Kim, and Nicholas Lane, “BRP-NAS: Prediction-based NAS using GCNs,” Advances in Neural Information Processing Systems, vol. 33, pp. 10480–10490, 2020.
- [9] Junru Wu, Xiyang Dai, Dongdong Chen, Yinpeng Chen, Mengchen Liu, Ye Yu, Zhangyang Wang, Zicheng Liu, Mei Chen, and Lu Yuan, “Stronger NAS with weaker predictors,” Advances in Neural Information Processing Systems, vol. 34, pp. 28904–28918, 2021.
- [10] Minbin Huang, Zhijian Huang, Changlin Li, Xin Chen, Hang Xu, Zhenguo Li, and Xiaodan Liang, “Arch-graph: Acyclic architecture relation predictor for task-transferable neural architecture search,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11881–11891.
- [11] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar, “Designing neural network architectures using reinforcement learning,” arXiv preprint arXiv:1611.02167, 2016.
- [12] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le, “MnasNet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2820–2828.
- [13] Zhichao Lu, Ian Whalen, Vishnu Boddeti, Yashesh Dhebar, Kalyanmoy Deb, Erik Goodman, and Wolfgang Banzhaf, “NSGA-Net: Neural architecture search using multi-objective genetic algorithm,” in Proceedings of the genetic and evolutionary computation conference, 2019, pp. 419–427.
- [14] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the AAAI conference on artificial intelligence, 2019, vol. 33, pp. 4780–4789.
- [15] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin, “Large-scale evolution of image classifiers,” in International Conference on Machine Learning. PMLR, 2017, pp. 2902–2911.
- [16] Lingxi Xie and Alan Yuille, “Genetic CNN,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1379–1388.
- [17] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy, “Progressive neural architecture search,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 19–34.
- [18] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu, “Neural architecture optimization,” Advances in neural information processing systems, vol. 31, 2018.
- [19] Chen Wei, Chuang Niu, Yiping Tang, Yue Wang, Haihong Hu, and Jimin Liang, “NPENAS: Neural predictor guided evolution for neural architecture search,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [20] Thomas N Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
- [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.