
Efficient Training Under Limited Resources

Mahdi Zolnouri
Huawei Noah’s Ark Lab
Montréal, QC, Canada
[email protected]

Dounia Lakhmiri
GERAD and Polytechnique Montréal
Montréal, QC, Canada
[email protected]

Christophe Tribes
GERAD and Polytechnique Montréal
Montréal, QC, Canada
[email protected]

Eyyüb Sari
Huawei Noah’s Ark Lab
Montréal, QC, Canada
[email protected]

Sébastien Le Digabel
GERAD and Polytechnique Montréal
Montréal, QC, Canada
[email protected]
Abstract

The training time budget and the size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyperparameter Optimization (HPO), and Data Augmentation help DNNs perform much better when these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values, in addition to a good combination of data augmentation techniques, requires many experiments under low resources. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining its performance relative to the original model, preventing model overfitting when the dataset is small, and performing hyperparameter tuning. We used NOMAD, a blackbox optimization software package based on a derivative-free algorithm, for both NAS and HPO. Our work achieved an accuracy of 86.0% on a tiny subset of Mini-ImageNet (Vinyals et al., 2016) at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code at github.com/DouniaLakhmiri/ICLR_HAET2021.

1 Introduction

DNN compression mainly targets the inference side; recent works also aim to compress the training process, but this remains challenging. The ICLR 2021 HAET Challenge is an annual competition that evaluates the performance of classification models given limited training time and data. The time budget for training is 10 minutes on an NVIDIA V100 GPU with 32 GB of memory, paired with an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz and 12 GB of RAM. The dataset is a tiny, undisclosed dataset of 10 classes containing 5K images for training and 1K images for testing during the development phase. The inputs to the model are 32×32 RGB images, as in CIFAR-10. A subset of Mini-ImageNet with 80×80 images is used to evaluate the candidate models during the evaluation phase. Applicants are allowed to use their own optimizer, training loop, and data augmentation process. Introducing such limitations to the training process creates two main challenges. First, most classifiers need much more than 10 minutes to converge because of their large number of parameters: the general trend in state-of-the-art DNNs is to go deeper and wider and to increase the model’s capacity (Simonyan & Zisserman, 2014; Szegedy et al., 2016; 2017; He et al., 2016), and a larger network requires more time and computational resources to reach its best accuracy. In this paper, we apply Neural Architecture Search to obtain a compressed model with fewer computation operations. Second, a small dataset leads the model to overfit: the model learns the training images too well and performs accurately on the training set but poorly on unseen images. Since Data Augmentation (DA) is an effective technique for generating additional training data (Taylor & Nitschke, 2017), we added a combination of several DA policies to our pipeline to overcome the overfitting problem.

2 Related Work

Since AlexNet (Krizhevsky et al., 2012) won the ImageNet ILSVRC-2012 competition by stacking convolution layers, DNNs have become more accurate by increasing the number of parameters and hidden layers. In 2014, VGGNet (Simonyan & Zisserman, 2014) with 19 layers and GoogleNet (Szegedy et al., 2015) with 22 layers achieved the lowest error rates in the localization and classification tasks of the ILSVRC competition, respectively. In 2015, ResNet (He et al., 2016), the ILSVRC winner, proposed skip connections to overcome the vanishing-gradient issue and to go even deeper, stacking up to 152 layers. SENet (Hu et al., 2018) won the ImageNet competition in 2017; it has 145M parameters, almost twenty times more than GoogleNet’s 6.8M. Recently, NFNet-F4+ (Brock et al., 2021) achieved 89.2% top-1 accuracy with 527M parameters, setting a new state of the art on ImageNet. On the other hand, some networks aim for a good trade-off between model size and accuracy. For example, MobileNet (Howard et al., 2017) introduced width and resolution multipliers, showing that users can reduce the model’s size at the cost of some accuracy. EfficientNet (Tan & Le, 2019) used NAS to design a new baseline model and introduced a scaling method to build a family of networks that achieved state-of-the-art results. These works mainly focus on the inference side, and the proposed models are trained without limitations on data or resources during the training process.

3 Proposed Approach

3.1 Simulating The Challenge Evaluation

As an initial step, we implemented a training pipeline. We started from the code provided on the challenge website and added new modules for data preprocessing, model building, and training. In the preprocessing module, we added a data sampler to build the dataset specified by the competition: 5K images for the training set and 1K images for the validation set. We selected CIFAR-10 as a proxy dataset because it is the most common ten-class classification dataset, and its size allowed us to build various subsets by shuffling the selected image indices before subsampling the data, which helped our candidate model generalize better. The model-building module instantiates the requested architecture, since we evaluate NAS variants of a known model. The trainer module runs the training loop for exactly ten minutes to simulate the competition environment. A sketch of this pipeline is given below.
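The following is a minimal sketch of such a pipeline, not the exact competition code: the helper names, the way the subsets are sampled, and the SGD settings are assumptions used only for illustration.

```python
import time
import numpy as np
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader, Subset

def make_proxy_subsets(root="./data", n_train=5000, n_val=1000, seed=0):
    """Draw a 5K/1K proxy dataset from CIFAR-10 (hypothetical sampler)."""
    transform = T.ToTensor()
    train_full = torchvision.datasets.CIFAR10(root, train=True, download=True, transform=transform)
    val_full = torchvision.datasets.CIFAR10(root, train=False, download=True, transform=transform)
    rng = np.random.default_rng(seed)
    train_idx = rng.choice(len(train_full), size=n_train, replace=False).tolist()
    val_idx = rng.choice(len(val_full), size=n_val, replace=False).tolist()
    return Subset(train_full, train_idx), Subset(val_full, val_idx)

def train_with_budget(model, train_set, budget_s=600, batch_size=128, device="cuda"):
    """Train until the 10-minute wall-clock budget is exhausted."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    start = time.time()
    while time.time() - start < budget_s:
        for x, y in loader:
            if time.time() - start >= budget_s:
                break
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```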

3.2 Searching For A Baseline Network

The larger the search space, the longer and more expensive NAS becomes, and efficient use of resources was a top priority for the challenge. Therefore, instead of designing an architecture from scratch, the second step of our approach was to find a baseline network from which to start the search. We drew up a list of well-known classifiers, mainly winners of the ImageNet ILSVRC challenge, and adapted the implementation of most of them from Kuang (2021). We then trained each classifier ten times on distinct subsets drawn from CIFAR-10; training each network multiple times was important to smooth out the variance induced by the small training sets. All classifiers were trained with our pipeline, the 10-minute training time budget, and the same dataset, while the other hyperparameters followed the respective paper of each model. We finally selected the two models with the highest average accuracy. Table 1 shows that SENet-18 with 77.0% and ResNet-18 with 75.5% validation accuracy top the list. A sketch of this ranking procedure follows.
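Below is a rough sketch of the ranking loop under the same assumptions as the previous listing; it reuses the hypothetical make_proxy_subsets and train_with_budget helpers, and the candidate constructors are assumed to come from the kuangliu/pytorch-cifar implementations.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def evaluate_accuracy(model, val_set, device="cuda"):
    """Top-1 accuracy on the held-out 1K validation subset."""
    loader = DataLoader(val_set, batch_size=256)
    model.to(device).eval()
    correct = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
    return correct / len(val_set)

def rank_candidates(candidates, n_runs=10):
    """candidates: dict mapping a model name to a zero-argument constructor."""
    results = {}
    for name, build_model in candidates.items():
        accs = []
        for seed in range(n_runs):
            train_set, val_set = make_proxy_subsets(seed=seed)   # distinct subset per run
            model = train_with_budget(build_model(), train_set)  # 10-minute budget
            accs.append(evaluate_accuracy(model, val_set))
        results[name] = float(np.mean(accs))
    # Sort by mean validation accuracy, best first (cf. Table 1).
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))
```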

3.3 Neural Architecture Search

In order to reduce the training epoch time, we performed Neural Architecture Search on SENet-18 and ResNet-18 using Formulation (1). The goal of this third step was to find a smaller variant of the model that performs at least as well as the baseline. We used the NOMAD (Le Digabel, 2011) blackbox optimization software, which implements the MADS algorithm (Audet & Dennis, Jr., 2006) and optimizes an objective function under constraints without requiring any knowledge of the internals of the function. We defined $g(\bm{\phi})$ as an explicit function of resource consumption, namely the number of Multiply-Accumulate (MAC) operations, and we constrained the model’s performance to be greater than or equal to that of the baseline. Following Howard et al. (2017), we scaled the baseline model with three multipliers for three dimensions of the model: depth, width, and input resolution. The depth multiplier determines the number of blocks of the model, the width multiplier thins the network by uniformly reducing the number of input and output channels of each layer, and the input resolution multiplier scales the input image and, consequently, the internal representation of every layer. For each NAS trial, we return to NOMAD the number of MACs computed from the new dimensions of the model and the accuracy of the trained model.

$$\min_{\bm{\phi}\in\Phi} \; g(\bm{\phi}) \quad \text{subject to} \quad f(\mathbf{w},\bm{\phi}\mid\mathbf{x},\mathbf{y}) \;\geq\; f(\mathbf{w}_0,\bm{\phi}_0\mid\mathbf{x},\mathbf{y}) \qquad (1)$$

where $\mathbf{w}$ denotes the weights of the neural network, $\Phi$ the hyperparameter space of the network, $\bm{\phi}\in\Phi$ a vector of depth, width, and resolution values that defines the architecture, and $(\mathbf{x},\mathbf{y})$ the data features and labels. We write $f(\mathbf{w},\bm{\phi}\mid\mathbf{x},\mathbf{y})$ for the validation accuracy of the network after training and $g(\bm{\phi})$ for the number of MAC operations.
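As an illustration of how one NAS trial could be wired to NOMAD, the sketch below follows Formulation (1) under stated assumptions: build_senet18 is a hypothetical builder accepting depth and width multipliers, MACs are counted with a third-party profiler such as thop, the baseline accuracy f0 is measured beforehand, and the subset and training helpers are those sketched in Section 3.1. In practice NOMAD calls a standalone blackbox executable or its Python interface; only the evaluation logic is shown here.

```python
import torch

def nas_blackbox(depth_mult, width_mult, res_mult, f0, base_resolution=32):
    """One NAS trial: returns (objective, constraint) in NOMAD's convention,
    i.e. minimize MACs subject to f0 - accuracy <= 0."""
    from thop import profile                       # MAC counter (assumed available)

    # Hypothetical builder that applies the depth and width multipliers.
    model = build_senet18(depth_mult=depth_mult, width_mult=width_mult)

    # The resolution multiplier rescales the 32x32 inputs.
    resolution = max(8, int(round(base_resolution * res_mult)))
    dummy = torch.randn(1, 3, resolution, resolution)
    macs, _ = profile(model, inputs=(dummy,))

    # Train the scaled model under the 10-minute budget and measure accuracy.
    train_set, val_set = make_proxy_subsets()
    model = train_with_budget(model, train_set)
    acc = evaluate_accuracy(model, val_set)

    return macs, f0 - acc                          # constraint satisfied when <= 0
```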

3.4 Data Augmentation Method

Given the small size of the dataset, we observed that our candidate model is prone to overfitting: with only basic image augmentation techniques, the model does not reach sufficient performance during training. We therefore used AutoAugment (Cubuk et al., 2018), an automatic method for selecting effective data augmentation policies, and added Cutout (DeVries & Taylor, 2017), which masks out random sections of the input images during training. A sketch of the resulting transform pipeline follows.
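The snippet below is a plausible version of such a pipeline, not the exact competition transforms: it combines torchvision’s built-in AutoAugment with the CIFAR-10 policy and a simple Cutout implementation; the crop/flip steps, the patch size, and the normalization statistics are assumptions.

```python
import torch
import torchvision.transforms as T
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

class Cutout:
    """Zero out one random square patch of an image tensor (DeVries & Taylor, 2017)."""
    def __init__(self, size=8):
        self.size = size

    def __call__(self, img):                        # img: (C, H, W) float tensor
        _, h, w = img.shape
        cy = torch.randint(0, h, (1,)).item()
        cx = torch.randint(0, w, (1,)).item()
        y1, y2 = max(0, cy - self.size // 2), min(h, cy + self.size // 2)
        x1, x2 = max(0, cx - self.size // 2), min(w, cx + self.size // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    AutoAugment(policy=AutoAugmentPolicy.CIFAR10),   # learned CIFAR-10 policy
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    Cutout(size=8),                                  # applied after normalization
])
```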

3.5 Hyperparameter Optimization

Finally, we perform HPO on our candidate model with NOMAD to find the best hyperparameter values. Formulation (2) shows the objective function that we propose.

$$\max_{\lambda\in\Lambda} \; f(\lambda,\mathbf{w}\mid\mathbf{x},\mathbf{y}) \qquad (2)$$

where $\lambda$ is a vector of values from the hyperparameter space $\Lambda$ and $f(\lambda,\mathbf{w}\mid\mathbf{x},\mathbf{y})$ denotes the validation accuracy of the neural network. We selected three standard hyperparameters known to strongly influence validation performance: the initial learning rate, the weight decay, and the optimizer type. Since the dataset is tiny, we chose to tune the batch size as well. Because the typical range of learning rates differs from one optimizer to another, we rescaled the searched learning rate according to an expected value for each optimizer; for example, $0.1$ is usually used for SGD, but $0.001$ for Adam. The search space of the HPO experiment was as follows (a decoding sketch is given after the list):

  • Learning rate, of uniform type, in [1E-3; 0.6]

  • Weight decay, of discrete type, in {0, 0.00005, 0.0005, 0.005, 0.05, 0.5}

  • Optimizer, of discrete type, in {“Adadelta” (1), “Adagrad” (0.01), “SGD” (0.1), “Adam” (0.01), “AdamW” (0.01), “Adamax” (0.002), “ASGD”}, where the value in parentheses is the expected learning rate of that optimizer

  • Batch size, of discrete type, in {128, 256, 512}
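To make the mapping from a NOMAD point to a training configuration concrete, here is a hedged sketch. The per-optimizer rescaling of the searched learning rate is our reading of the description above and is marked as an assumption, as is the reference value used for ASGD (the list gives none).

```python
import torch

WEIGHT_DECAYS = [0.0, 0.00005, 0.0005, 0.005, 0.05, 0.5]
BATCH_SIZES = [128, 256, 512]
# (name, optimizer class, expected learning rate from the list above);
# ASGD has no listed value, so the PyTorch default 0.01 is assumed here.
OPTIMIZERS = [
    ("Adadelta", torch.optim.Adadelta, 1.0),
    ("Adagrad",  torch.optim.Adagrad,  0.01),
    ("SGD",      torch.optim.SGD,      0.1),
    ("Adam",     torch.optim.Adam,     0.01),
    ("AdamW",    torch.optim.AdamW,    0.01),
    ("Adamax",   torch.optim.Adamax,   0.002),
    ("ASGD",     torch.optim.ASGD,     0.01),
]

def decode_point(lr, wd_idx, opt_idx, bs_idx, model):
    """Turn one NOMAD evaluation point into an optimizer and a batch size."""
    name, opt_cls, expected_lr = OPTIMIZERS[int(opt_idx)]
    # Assumption: the searched lr in [1E-3; 0.6] is expressed on SGD's scale
    # and rescaled to each optimizer's expected value.
    effective_lr = lr * expected_lr / 0.1
    optimizer = opt_cls(model.parameters(),
                        lr=effective_lr,
                        weight_decay=WEIGHT_DECAYS[int(wd_idx)])
    return optimizer, BATCH_SIZES[int(bs_idx)], name
```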

The HPO increased the accuracy of our model by 1.1%.

4 Experiments and Results

In this section, we conduct experiments to investigate the effectiveness of our approach across several model architectures and datasets.

4.1 CIFAR-10

CIFAR-10 is a collection of 60K 32×32 color images in ten classes, split into a 50K training set and a 10K validation set. Since it closely matches the description of the competition’s evaluation dataset and is large enough to provide several distinct subsamples, we used it as a proxy dataset during the development phase. In all our experiments, the training set has 5K images and the validation set 1K images. Table 1 shows the intermediate results of our search for a baseline model: all classifiers are trained for ten minutes on a subset of CIFAR-10, and SENet-18 with 77.0% and ResNet-18 with 75.5% achieve higher accuracy than the other classifiers.

Table 1: Searching for a baseline network. Final validation accuracy (%): mean over ten runs.
Model Validation Accuracy (%)
SENet-18 77.0
ResNet-18 75.5
DenseNet121 74.6
GoogleNet 74.2
MobileNet V2 74.0
EfficientNet B0 70.4
ShuffleNet V2 70.0
RegNetX_200MF 69.8
ResNet32 69.6
SimpleDLA 68.7
ResNet50 67.6
VGG19 67.2
MobileNet V1 64.8
ResNeXt29 (2x64d) 64.7
ShuffleNet G2 62.0
ResNet110 58.3
PreActResNet18 52.7
DPN92 41.9

By applying NAS to SENet-18 and ResNet-18, we found a smaller variant of each model that performs similarly to its baseline. NOMAD-NAS-SENet-18 has the same depth and input resolution as SENet-18, but its width is 67% of the original. NOMAD-NAS-ResNet-18 has the same depth as ResNet-18, but its width and input resolution are 73% and 118% of the original ones, respectively. Although its input resolution is 18% larger, NOMAD-NAS-ResNet-18 has 22% fewer MAC operations and 47% fewer parameters than ResNet-18. We observed that both candidate models are prone to overfitting, so we extended our data transformation list with the AutoAugment and Cutout techniques. Table 2 shows the performance of SENet-18 and ResNet-18 after applying NAS and DA. After this step, we continued with hyperparameter optimization on NOMAD-NAS-SENet-18. Table 3 shows the validation accuracy of our candidate model, NOMAD-NAS-SENet-18, after applying HPO. We used NOMAD to find the best hyperparameter values of the model by performing a blackbox optimization with the MADS algorithm over the hyperparameters of our final candidate. As described in Section 3.5, we obtained a set of tuned values for the hyperparameters: learning rate = 0.042, weight decay = 0.005, optimizer = SGD, batch size = 512. We performed several tests to confirm that these values achieve the best accuracy, and observed an improvement in validation accuracy from 86.7% to 87.8%.

Table 2: Applying NAS and DA on SENet-18 and ResNet-18. Final validation accuracy: mean over ten runs.
Model Validation Accuracy (%)
SENet-18 86.7
ResNet-18 84.8
Table 3: Applying HPO on NOMAD-NAS-SENet-18. Final validation accuracy: mean over ten runs.
Model Validation Accuracy (%)
NOMAD-NAS-SENet-18 87.8

4.2 Mini-ImageNet

After the competition deadline, the ICLR HAET Challenge committee announced the results and the name of the evaluation dataset. Table 4 shows the final result of our model on a subset of Mini-ImageNet. There is a considerable gap of 1.8% between our evaluation on CIFAR-10 and the committee’s evaluation on Mini-ImageNet. Other differences, such as hardware and training loop, may also contribute to this gap. However, the main factor affecting our model’s performance is an extra operation in the data transformation that was not included in the challenge description: Mini-ImageNet images are 80×80, so the committee added a resize operation to the data transformation list to convert them to the initially announced 32×32. We used cosine annealing as the learning rate decay scheduler, with a maximum of 240 epochs, within the ten-minute budget. The extra data transformation is costly and delayed our training by several epochs, consequently reducing our model’s performance. We find the results fair because this extra cost affected all other submissions equally. A sketch of this preprocessing and learning rate schedule is given after Table 4.

Table 4: NOMAD-NAS-SENet-18 on Mini-ImageNet. Final validation accuracy: mean over five runs. The dataset is a subset of Mini-ImageNet with a training set of 5K images and a validation set of 1K images.
Model Validation Accuracy (%)
NOMAD-NAS-SENet-18 86.0
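As an illustration only (not the committee’s exact code), the snippet below shows the kind of resize preprocessing and cosine-annealing schedule discussed above; the stand-in model is hypothetical, while the SGD settings reuse the tuned values reported in Section 4.1.

```python
import torch
import torchvision.transforms as T

# Evaluation-phase preprocessing: 80x80 Mini-ImageNet images resized to 32x32.
eval_transform = T.Compose([
    T.Resize((32, 32)),      # the extra, per-image cost discussed above
    T.ToTensor(),
])

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder for NOMAD-NAS-SENet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.042, weight_decay=0.005)

# Cosine annealing over a nominal 240 epochs; the 10-minute budget stops
# training before the schedule completes. scheduler.step() is called per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=240)
```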

5 Conclusion

The ICLR 2021 HAET Challenge focuses on training a neural network under low resources, more precisely limited data and training time. Most research in the model compression field has aimed at optimizing the inference side of DNNs. To train a model efficiently, we proposed an approach that finds a classifier performing accurately with a tiny dataset and a limited training time budget. We used NOMAD, which implements a derivative-free optimization algorithm, for both NAS and HPO, and we applied data augmentation techniques to improve the performance of the proposed model. Our model, NOMAD-NAS-SENet-18, achieved 86% accuracy on a subset of Mini-ImageNet within 10 minutes of training and won second place in the competition.

References

  • Audet & Dennis, Jr. (2006) C. Audet and J.E. Dennis, Jr. Mesh Adaptive Direct Search Algorithms for Constrained Optimization. SIAM Journal on Optimization, 17(1):188–217, 2006. doi: 10.1137/040603371.
  • Brock et al. (2021) Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
  • Cubuk et al. (2018) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
  • DeVries & Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  7132–7141, 2018.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • Kuang (2021) Liu Kuang. kuangliu/pytorch-cifar. https://github.com/kuangliu/pytorch-cifar, 2021.
  • Le Digabel (2011) S. Le Digabel. Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm. ACM Transactions on Mathematical Software, 37(4):44:1–44:15, 2011. doi: 10.1145/1916461.1916468. URL http://dx.doi.org/10.1145/1916461.1916468.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1–9, 2015.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • Tan & Le (2019) Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.
  • Taylor & Nitschke (2017) Luke Taylor and Geoff Nitschke. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for One Shot Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp.  3637–3645, 2016.