Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima
Abstract.
The learning rate is one of the most important hyper-parameters for neural network training, with a significant influence on model performance. Learning rate schedules are widely used in practice to adjust the learning rate according to pre-defined schedules for fast convergence and good generalization. However, existing learning rate schedules are all heuristic algorithms and lack theoretical support. Therefore, practitioners usually choose learning rate schedules through multiple ad-hoc trials, and the obtained schedules are sub-optimal. To boost the performance of such sub-optimal learning rate schedules, we propose a generic learning rate schedule plugin, called LEArning Rate Perturbation (LEAP), which can be applied to various learning rate schedules to improve model training by introducing a certain perturbation to the learning rate. We find that, with this simple yet effective strategy, the training process exponentially favors flat minima over sharp minima with guaranteed convergence, which leads to better generalization. In addition, we conduct extensive experiments showing that training with LEAP improves the performance of various deep learning models on diverse datasets using various learning rate schedules (including the constant learning rate).
1. Introduction
Deep neural networks are the basis of state-of-the-art results for broad tasks such as image recognition, speech recognition, machine translation, driver-less car technology, source code understanding, social analysis, and tabular data understanding (Mao et al., 2021; Fu et al., 2021; Shi et al., 2021a; Shi et al., 2021b; Song et al., 2020; Du et al., 2021). It is known that the learning rate is one of the most important hyper-parameters, with a significant influence on training deep neural networks. A learning rate that is too large may make it difficult to find minima or even cause non-convergence, while a learning rate that is too small greatly slows down training and may easily get stuck in sharp minima. How to adjust the learning rate is thus one of the key challenges of training deep learning models.
Learning rate schedules adjust the learning rate during the training process according to a pre-defined schedule. Various learning rate schedules have been widely used and have shown practical effectiveness for better training. Existing learning rate schedules can be roughly divided into several categories: constant learning rate schedules, learning rate decay schedules (You et al., 2019), learning rate warm restart schedules (Smith, 2017; Loshchilov and Hutter, 2016; Mishra and Sarawadekar, 2019), and adaptive learning rate schedules (Preechakul and Kijsirikul, 2019; Zeiler, 2012; Liu et al., 2019; Lewkowycz, 2021). However, existing learning rate schedules are all heuristic algorithms and lack theoretical support. This is because the mapping functions of DNN models are usually highly complex and non-linear, which makes it difficult to analyze the influence of the learning rate on the training process. Therefore, practitioners may choose a learning rate schedule only through multiple ad-hoc trials, and the obtained schedule is probably sub-optimal.
As an effective approach to improving the generalization ability of DNN models, perturbation addition has been widely studied (Reed and MarksII, 1999). It has been demonstrated that training with noise can indeed improve network generalization (Bishop et al., 1995). Existing works attempt to introduce perturbation in the feature space (Bishop, 1995), the activation function (Gulcehre et al., 2016), and the gradient (Neelakantan et al., 2015). (Bishop, 1995) shows that adding noise to the feature space during training is equivalent to Tikhonov regularization. (Gulcehre et al., 2016) proposes to inject appropriate perturbation into the layer activations so that gradients may flow more easily (i.e., mitigating the vanishing gradient problem). (Neelakantan et al., 2015) adds perturbation to gradients to improve the robustness of the training process; the amount of noise can start high at the beginning of training and decrease over time, much like a decaying learning rate. However, no existing work attempts to introduce randomness into hyper-parameters (e.g., the learning rate).
To boost the obtained sub-optimal learning rate schedule, we propose a generic plugin of learning rate schedules, called LEArning Rate Perturbation (LEAP), which can be applied to various learning rate schedules to improve model training by introducing a certain type of perturbation to the learning rate. We leverage an existing theoretical framework (Xie et al., 2020) to analyze the impact of LEAP on the training process, and find that training with LEAP favors flat minima exponentially more than sharp minima. It is well studied that flat minima are closely related to better generalization.
In deep learning, there are already generic strategies or plugins that can boost training performance, such as Dropout and Batch Normalization. We consider them plugins because each can be applied to a particular aspect of DNN training with wide versatility. In more detail, Dropout can be considered a generic plugin applied to the network structure of deep learning models, which prevents over-reliance on any single neural unit by randomly ignoring some weights during training. Batch Normalization is a generic plugin applied to layer inputs/outputs, which improves model performance by normalizing the value range of intermediate layer inputs/outputs. From a different perspective than Dropout and Batch Normalization, LEAP is a generic plugin applied to the learning rate, which improves model performance by letting the training process favor flatter minima.
The main contributions of this work are outlined as follows:
• We propose a simple yet effective plugin of learning rate schedules, called LEArning Rate Perturbation (LEAP), which can be applied to various learning rate schedules to improve model training by introducing a certain perturbation to the learning rate.
• To the best of our knowledge, our work is the first to propose a generic strategy with theoretical guarantees that boosts the training performance of various learning rate schedules by letting the training process favor flat minima.
• Extensive experiments show that LEAP can effectively improve the training performance of various DNN architectures, including Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), Graph Neural Networks (GNN), and Transformers, with different learning rate schedules and optimizers on diverse domains.
2. Learning Rate Perturbation
In this section, we introduce the implementation details of LEAP and give the pseudo code of the training process with LEAP.
First, we denote the training dataset as $\mathcal{D}$, one mini-batch of training data as $\mathcal{B} \subset \mathcal{D}$, the learning rate given by the learning rate schedule as $\eta$, the model parameters as $\theta$, and the loss function as $\ell$. For simplicity, we denote the training loss as $L(\theta)$. LEAP adds a perturbation drawn from a Gaussian distribution, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^{2})$, to the learning rate $\eta$. Here, $\sigma$ is the hyper-parameter of LEAP that controls the perturbation intensity. The training process with LEAP for updating parameters is shown in Algorithm 1.
Here, we define the learning rate after applying LEAP as a vector $\tilde{\eta} = (\tilde{\eta}_{1}, \dots, \tilde{\eta}_{N})$, where $\tilde{\eta}_{i}$ is the learning rate used for updating the $i$-th parameter, calculated by adding a learning rate perturbation $\epsilon_{i} \sim \mathcal{N}(0, \sigma^{2})$ to the original learning rate $\eta$ given by the learning rate schedule, i.e., $\tilde{\eta}_{i} = \eta + \epsilon_{i}$ for $i = 1, \dots, N$ ($N$ is the total number of parameters). Note that the perturbations on different dimensions ($\epsilon_{i}$ and $\epsilon_{j}$, $i \neq j$) are independent, so the overall perturbation can lie along diverse directions. Obviously, the learning rate in LEAP obeys a high-dimensional Gaussian distribution of the following form:
$\tilde{\eta} \sim \mathcal{N}(\eta \mathbf{1},\ \sigma^{2} I)$   (1)
where $\mathbf{1}$ is the vector whose elements are all 1, and $I$ is the identity matrix.
By combining Eq. 1, the parameter update of the training process with LEAP is as follows:
$\theta_{t+1} = \theta_{t} - \tilde{\eta} \odot \nabla_{\theta} L_{j}(\theta_{t})$   (2)
$\theta_{t+1} = \theta_{t} - \eta\,\nabla_{\theta} L_{j}(\theta_{t}) - \epsilon \odot \nabla_{\theta} L_{j}(\theta_{t})$   (3)
where $L_{j}$ is the loss of the $j$-th mini-batch, $\odot$ is the Hadamard product, $[\nabla_{\theta} L_{j}(\theta_{t})]_{i}$ is the gradient of the $i$-th parameter in the $t$-th update, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^{2} I)$ is a zero-mean high-dimensional Gaussian perturbation.
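To make the update of Eq. 2 and Eq. 3 concrete, the following is a minimal PyTorch-style sketch of one LEAP step on top of plain SGD. The function name leap_sgd_step, the toy model, and the hyper-parameter values are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def leap_sgd_step(params, loss, lr, sigma):
    """One plain-SGD update with LEAP (Eq. 2-3): every parameter entry i gets
    its own perturbed learning rate lr + eps_i with eps_i ~ N(0, sigma^2),
    drawn independently per entry and per step."""
    grads = torch.autograd.grad(loss, params)      # gradients of the mini-batch loss
    with torch.no_grad():
        for p, g in zip(params, grads):
            eps = sigma * torch.randn_like(p)      # element-wise Gaussian perturbation
            p -= (lr + eps) * g                    # Hadamard product with the gradient

# Illustrative usage on a tiny model (shapes and hyper-parameters are arbitrary).
model = torch.nn.Sequential(torch.nn.Linear(784, 100), torch.nn.ReLU(), torch.nn.Linear(100, 10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
leap_sgd_step(list(model.parameters()), loss, lr=0.1, sigma=1e-3)
```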
Table 1. Error rate (%, mean ± std over five runs) of image classification under different optimizers (SGD and Adam).
Dataset | Model | SGD Vanilla | SGD LEAP (Ours) | SGD Gain | Adam Vanilla | Adam LEAP (Ours) | Adam Gain
Cifar-10 | ResNet-18 | 5.60 ± 0.11 | 5.32 ± 0.25 | 5.00% | 9.26 ± 0.29 | 8.69 ± 0.08 | 6.16%
Cifar-10 | ResNet-50 | 5.26 ± 0.07 | 4.78 ± 0.13 | 9.13% | 7.84 ± 0.52 | 7.17 ± 0.12 | 8.78%
Cifar-10 | ResNet-101 | 5.09 ± 0.09 | 4.69 ± 0.08 | 7.86% | 7.13 ± 0.75 | 6.62 ± 0.04 | 7.15%
Table 2. Error rate (%, mean ± std over five runs) of image classification with a constant learning rate.
Dataset | Model | Vanilla | LEAP (Ours) | Gain
MNIST | MLP-3 | 2.24 ± 0.09 | 2.02 ± 0.07 | 9.82%
MNIST | MLP-4 | 2.23 ± 0.06 | 2.04 ± 0.04 | 8.52%
MNIST | MLP-6 | 2.24 ± 0.05 | 1.99 ± 0.13 | 11.16%
MNIST | MLP-8 | 2.54 ± 0.09 | 2.26 ± 0.10 | 11.02%
MNIST | MLP-10 | 2.44 ± 0.10 | 2.25 ± 0.9 | 7.79%
Cifar-10 | ResNet-18 | 7.11 ± 0.17 | 6.78 ± 0.06 | 4.64%
Cifar-10 | ResNet-50 | 6.93 ± 0.10 | 6.57 ± 0.18 | 5.19%
Cifar-10 | ResNet-101 | 6.67 ± 0.08 | 6.26 ± 0.20 | 6.15%
IN | ResNet-50 | 28.38 ± 0.09 | 26.74 ± 0.15 | 5.78%
IN | VGG-19 | 31.17 ± 0.11 | 29.51 ± 0.17 | 5.33%
Table 3. Error rate (%, mean ± std over five runs) of graph node classification with a constant learning rate.
Dataset | Model | Vanilla | LEAP (Ours) | Gain
Cora | GCN | 25.38 ± 0.75 | 22.60 ± 0.57 | 10.95%
Cora | GAT | 29.72 ± 1.21 | 26.18 ± 1.71 | 11.91%
Cora | GIN | 37.46 ± 2.93 | 33.02 ± 2.71 | 11.85%
Cora | GraphSage | 26.24 ± 0.45 | 23.50 ± 0.66 | 10.44%
PubMed | GCN | 25.96 ± 0.73 | 23.98 ± 1.00 | 7.63%
PubMed | GAT | 26.60 ± 0.49 | 24.90 ± 1.63 | 6.39%
PubMed | GIN | 29.24 ± 1.33 | 27.22 ± 1.87 | 6.91%
PubMed | GraphSage | 27.02 ± 0.33 | 25.26 ± 1.26 | 6.51%
CiteSeer | GCN | 37.38 ± 1.36 | 33.68 ± 0.96 | 9.90%
CiteSeer | GAT | 37.96 ± 3.65 | 35.36 ± 2.01 | 6.85%
CiteSeer | GIN | 49.38 ± 3.30 | 46.02 ± 2.18 | 6.80%
CiteSeer | GraphSage | 38.02 ± 0.90 | 34.64 ± 0.38 | 8.89%
Table 4. Error rate (%, mean ± std over five runs) of image classification with the learning rate decay and warm restart schedules.
Dataset | Model | Decay Vanilla | Decay LEAP (Ours) | Decay Gain | Warm Restart Vanilla | Warm Restart LEAP (Ours) | Warm Restart Gain
MNIST | MLP-3 | 2.48 ± 0.06 | 2.22 ± 0.04 | 10.48% | 2.34 ± 0.05 | 2.12 ± 0.03 | 9.40%
MNIST | MLP-4 | 1.96 ± 0.09 | 1.79 ± 0.06 | 8.67% | 2.20 ± 0.11 | 2.02 ± 0.11 | 8.18%
MNIST | MLP-6 | 2.24 ± 0.04 | 2.01 ± 0.13 | 10.27% | 2.20 ± 0.20 | 1.99 ± 0.04 | 9.55%
MNIST | MLP-8 | 2.54 ± 0.09 | 2.25 ± 0.10 | 11.42% | 2.62 ± 0.07 | 2.41 ± 0.08 | 8.02%
MNIST | MLP-10 | 2.49 ± 0.12 | 2.15 ± 0.10 | 13.65% | 2.25 ± 0.05 | 1.96 ± 0.09 | 12.89%
Cifar-10 | ResNet-18 | 5.60 ± 0.11 | 5.32 ± 0.25 | 5.00% | 5.73 ± 0.09 | 5.27 ± 0.10 | 8.03%
Cifar-10 | ResNet-50 | 5.26 ± 0.07 | 4.78 ± 0.13 | 9.13% | 5.38 ± 0.16 | 4.86 ± 0.06 | 9.67%
Cifar-10 | ResNet-101 | 5.09 ± 0.09 | 4.69 ± 0.08 | 7.86% | 5.19 ± 0.09 | 4.68 ± 0.14 | 9.83%
IN | ResNet-50 | 24.48 ± 0.07 | 22.87 ± 0.04 | 6.58% | 25.24 ± 0.21 | 23.51 ± 0.12 | 6.85%
IN | VGG-19 | 28.87 ± 0.10 | 27.33 ± 0.09 | 5.33% | 29.17 ± 0.16 | 27.47 ± 0.08 | 5.83%
IN | Swin Transformer | - | - | - | 18.86 ± 0.05 | 17.86 ± 0.03 | 5.30%
Table 5. Error rate (%, mean ± std over five runs) of graph node classification with the learning rate decay and warm restart schedules.
Dataset | Model | Decay Vanilla | Decay LEAP (Ours) | Decay Gain | Warm Restart Vanilla | Warm Restart LEAP (Ours) | Warm Restart Gain
Cora | GCN | 25.84 ± 0.84 | 23.18 ± 1.40 | 10.29% | 25.50 ± 0.71 | 22.88 ± 0.72 | 10.27%
Cora | GAT | 28.80 ± 1.10 | 25.98 ± 2.08 | 9.79% | 29.76 ± 1.25 | 26.68 ± 1.72 | 10.35%
Cora | GIN | 38.06 ± 3.44 | 33.58 ± 1.71 | 11.77% | 37.18 ± 1.48 | 34.38 ± 2.06 | 7.53%
Cora | GraphSage | 26.64 ± 0.35 | 23.66 ± 0.42 | 11.19% | 27.78 ± 0.88 | 23.54 ± 0.69 | 15.26%
PubMed | GCN | 28.40 ± 0.97 | 25.20 ± 0.70 | 11.27% | 27.24 ± 1.33 | 25.18 ± 0.32 | 7.56%
PubMed | GAT | 27.86 ± 0.71 | 26.16 ± 0.63 | 6.10% | 28.88 ± 1.62 | 26.96 ± 1.02 | 6.65%
PubMed | GIN | 29.54 ± 1.77 | 27.46 ± 1.00 | 7.04% | 31.36 ± 1.33 | 28.86 ± 2.43 | 7.97%
PubMed | GraphSage | 29.46 ± 0.97 | 25.98 ± 0.49 | 11.81% | 28.18 ± 1.07 | 25.30 ± 0.72 | 10.22%
CiteSeer | GCN | 36.70 ± 0.73 | 33.90 ± 1.13 | 7.63% | 37.96 ± 0.85 | 34.24 ± 1.00 | 9.80%
CiteSeer | GAT | 38.46 ± 4.04 | 34.72 ± 2.04 | 9.72% | 37.92 ± 4.41 | 34.94 ± 1.42 | 7.86%
CiteSeer | GIN | 53.02 ± 4.57 | 48.96 ± 2.42 | 7.66% | 50.88 ± 4.86 | 45.10 ± 2.97 | 11.36%
CiteSeer | GraphSage | 39.82 ± 1.47 | 35.68 ± 0.61 | 10.40% | 38.72 ± 0.45 | 34.58 ± 1.09 | 10.69%
3. Minima Preference Analysis of LEAP
By exploiting the theoretical framework of (Xie et al., 2020), we replace the stochastic gradient noise with the LEAP perturbation term in Eq. 3 and obtain Theorem 1, which presents an analysis of the escape time of LEAP at any minimum $a$.
Theorem 1.
LEAP Escape Time at a Minimum. Suppose the loss function $L(\theta)$ is of class $C^{2}$ and $N$-dimensional. If the dynamics is governed by Eq. 3, then the mean escape time $\tau$ from a minimum $a$ to the outside of its basin through a saddle point $b$ depends exponentially on $1/H_{ae}$,
where $s$ is a path-dependent parameter, $e$ is the escape direction, and $H_{ae}$ and $H_{be}$ are, respectively, the eigenvalues of the Hessian $\nabla^{2} L$ at the minimum $a$ and at the saddle point $b$ corresponding to the escape direction $e$.
Explanation. We can see that the mean escape time depends exponentially on $H_{ae}$, the eigenvalue of the Hessian at the minimum along the escape direction. The flatter the minimum, the smaller this eigenvalue, and the longer the training process stays in the minimum. Therefore, we conclude that LEAP stays in flat minima exponentially longer than in sharp minima.
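For intuition, consider a hedged back-of-the-envelope comparison. Assume the escape time along the escape direction scales as $\tau \propto \exp(c / H_{ae})$ for some positive constant $c$ collecting the remaining factors (an assumption consistent with the exponential dependence stated above), and compare a sharp minimum $a_1$ with a flatter minimum $a_2$ whose curvature along the escape direction is halved:

```latex
% Assumed scaling: \tau(a) \propto \exp(c / H_{ae}) with c > 0.
% Compare a sharp minimum a_1 and a flatter minimum a_2 with H_{a_2 e} = H_{a_1 e} / 2.
\frac{\tau(a_2)}{\tau(a_1)}
  = \frac{\exp\!\left(c / H_{a_2 e}\right)}{\exp\!\left(c / H_{a_1 e}\right)}
  = \exp\!\left(\frac{2c}{H_{a_1 e}} - \frac{c}{H_{a_1 e}}\right)
  = \exp\!\left(\frac{c}{H_{a_1 e}}\right)
```

That is, halving the curvature multiplies the escape time by another factor of $\exp(c / H_{a_1 e})$, which is exactly the "exponentially longer" behavior described above.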
Minima selection. By exploiting the theoretical framework of (Xie et al., 2020), we can relate the probability of converging to a minimum to its mean escape time. In deep learning, the loss landscape contains many good minima and bad minima, and the training process transits from one minimum to another. The mean escape time at a minimum corresponds to the number of updates that the training process spends in that minimum, so the escape time is naturally proportional to the probability of selecting that minimum. We therefore conclude that LEAP favors flat minima exponentially more than sharp minima.
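Written out over a finite set of candidate minima $\{a_1, \dots, a_K\}$ (a simplification we assume purely for illustration), this proportionality reads:

```latex
P(\text{training converges to } a_k) \;\propto\; \tau(a_k),
\qquad\text{i.e.,}\qquad
P(a_k) \;=\; \frac{\tau(a_k)}{\sum_{j=1}^{K} \tau(a_j)} .
```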
Convergence. Since the expectation of the perturbation in LEAP is zero, i.e., $\mathbb{E}[\epsilon] = \mathbf{0}$, LEAP guarantees that $L(\theta_{t}) - L(\theta^{\ast})$ decreases with $t$ in expectation when the convex loss $L$ is smooth and the learning rate $\eta$ is sufficiently small, where $\theta_{t}$ is the current point and $\theta^{\ast}$ is the global minimum. When the perturbation is not too large, LEAP does not affect convergence. During training, we can ensure convergence by choosing an appropriate $\sigma$.
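Concretely, because the perturbation $\epsilon$ is zero-mean and drawn independently of the mini-batch gradient, the expected LEAP update in Eq. 3 coincides with the unperturbed gradient step:

```latex
\mathbb{E}_{\epsilon}\!\left[\theta_{t+1}\right]
  = \theta_t - \eta \,\nabla_{\theta} L_j(\theta_t)
    - \mathbb{E}_{\epsilon}[\epsilon] \odot \nabla_{\theta} L_j(\theta_t)
  = \theta_t - \eta \,\nabla_{\theta} L_j(\theta_t),
\qquad\text{since } \mathbb{E}_{\epsilon}[\epsilon] = \mathbf{0}.
```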
4. Experiment
4.1. Experiment Setup
Datasets. We take 6 real datasets from the Computer Vision and Graph Learning domains to evaluate our method: Cora, CiteSeer, and PubMed (Yang et al., 2016) for the Graph Learning domain (Du et al., 2022a; Chen et al., 2021; Du et al., 2018), and MNIST (LeCun, 1998), CIFAR-10 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) (denoted as IN in the experiments) for the Computer Vision domain. Cora, CiteSeer, and PubMed are citation network datasets. MNIST is one of the most researched datasets in machine learning and is used to classify handwritten digits. CIFAR-10 and ImageNet are collections of images commonly used to train machine learning and computer vision algorithms. For the graph datasets, we use the public data splits provided by (Yang et al., 2016). For MNIST, we keep the test set unchanged, randomly select 50,000 training images for training, and use the other 10,000 images as a validation set for hyper-parameter tuning. For CIFAR-10 and ImageNet, we use the public data splits provided by (Krizhevsky et al., 2009; Russakovsky et al., 2015).
Network Architecture. To verify the effectiveness of our method across different neural architectures, we employ Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Transformers, and Graph Neural Networks (GNNs) (Du et al., 2022b) for evaluation. We adopt ResNet-18/50/101 (He et al., 2016) and VGG-19 (Simonyan and Zisserman, 2014) for CNNs; GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), GIN (Xu et al., 2018), and GraphSage (Hamilton et al., 2017) for GNNs; and Swin Transformer (Liu et al., 2021) for Transformers. We run five MLP models, denoted as MLP-L, where L indicates the number of layers including the input layer.
Learning Rate Schedules and Optimizers. We verify LEAP on three different learning rate schedules (constant learning rate, learning rate decay, and the warm restart schedule (Loshchilov and Hutter, 2016)). In addition, we also explore the performance improvement of LEAP with two optimizers (SGD (Ruder, 2016) and Adam (Kingma and Ba, 2014)).
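As a hedged illustration of how LEAP plugs into an existing schedule, the sketch below wraps the perturbed update of Eq. 3 in a torch.optim.Optimizer so that a standard scheduler (here cosine annealing with warm restarts) still drives the unperturbed base learning rate. The class name LEAPSGD, the omission of momentum, and the toy training loop are our assumptions, not the authors' exact implementation.

```python
import torch

class LEAPSGD(torch.optim.Optimizer):
    """Plain SGD whose per-entry step size is perturbed as in Eq. 3.
    Any torch.optim.lr_scheduler can still control the base learning rate."""
    def __init__(self, params, lr, sigma=1e-3):
        super().__init__(params, defaults=dict(lr=lr))
        self.sigma = sigma

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group["lr"]                             # base learning rate from the schedule
            for p in group["params"]:
                if p.grad is None:
                    continue
                eps = self.sigma * torch.randn_like(p)   # N(0, sigma^2) per entry
                p.add_(-(lr + eps) * p.grad)             # perturbed element-wise step

model = torch.nn.Linear(784, 10)
optimizer = LEAPSGD(model.parameters(), lr=0.1, sigma=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
for step in range(100):
    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    torch.nn.functional.cross_entropy(model(x), y).backward()
    optimizer.step()
    scheduler.step()
```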
Hyperparameter Settings. For models without classical hyperparameter settings, we performed a hyperparameter search to find the best configuration for the baseline model, and the search process ensures that each model in the same domain has the same search space. For all GNN models, the learning rate is searched in {0.1, 0.05, 0.01, 5e-3, 1e-3}, and the weight decay is searched in {0.05, 0.01, 5e-3, 1e-3, 5e-4, 1e-4}. For ResNet and MLP, the learning rate is also searched over a grid, and the weight decay is searched in {5e-3, 1e-3, 5e-4, 1e-4}. The above describes the hyperparameter settings and search for the baselines. For LEAP, we search for the hyperparameter $\sigma$ in {0.1, 0.05, 0.01, 5e-3, 1e-3, 5e-4, 1e-4} while keeping all other hyperparameters unchanged. Note that we do not apply the constant learning rate and learning rate decay schedules to the Swin Transformer because these two schedules are not suitable for it. We run each experiment with five random seeds and report the average. Model training is done on Nvidia Tesla V100 GPUs.
4.2. Experiment Results
We evaluate the effectiveness of our method along four dimensions: (1) different learning rate schedules; (2) different neural architectures; (3) datasets from different domains; and (4) different optimizers.
Under the setting without a learning rate schedule, i.e., with a constant learning rate, the error rates of the image classification task for the vanilla models with and without LEAP are shown in Table 2, and the graph node classification results are in Table 3. Under the settings with a learning rate schedule, i.e., learning rate decay and the warm restart schedule, the error rates of the image classification task are shown in Table 4, and the graph node classification results are in Table 5. Table 1 shows the error rates of the image classification task under different optimizers. Combining the results in Tables 1 to 5, we conclude that LEAP brings consistent improvements along these four dimensions. First, LEAP is a generic plugin that works with all tested learning rate schedules, including learning rate decay and the warm restart schedule; moreover, LEAP improves the performance of deep learning models even without a learning rate schedule. We obtain 5.00% to 13.65% relative error rate reduction with learning rate decay, and 6.65% to 15.26% with the warm restart schedule. Second, LEAP brings consistent improvements across all neural architectures, including MLPs, CNNs, Transformers, and GNNs. Third, vanilla models with LEAP show consistent improvements across both the CV and Graph Learning domains: we obtain 4.64% to 12.89% relative error rate reduction in the CV domain, and 6.10% to 15.26% in the Graph domain. Last, the choice of optimizer does not affect the effectiveness of LEAP. SGD with momentum and Adam are the two most widely used optimizers, and Table 1 shows that LEAP achieves similarly strong results with both, which demonstrates the significant practical potential of LEAP.
5. Conclusion
We propose a simple and effective learning rate schedule plugin, called LEArning Rate Perturbation (LEAP). Similar to Dropout and Batch Normalization, LEAP is a generic plugin for improving the performance of deep learning models. We found that LEAP makes optimizers prefer flatter minima to sharp minima. Experiments show that training with LEAP improves the performance of various deep learning models on diverse datasets from different domains.
6. Acknowledgments
Hengyu Liu’s work described in this paper was in part supported by the computing resources funded by the National Natural Science Foundation of China (Nos. U1811261, 62137001).
References
- Bishop (1995) Chris M Bishop. 1995. Training with noise is equivalent to Tikhonov regularization. Neural computation 7, 1 (1995), 108–116.
- Bishop et al. (1995) Christopher M Bishop et al. 1995. Neural networks for pattern recognition. Oxford university press.
- Chen et al. (2021) Xu Chen, Lun Du, Mengyuan Chen, Yun Wang, Qingqing Long, and Kunqing Xie. 2021. Fast Hierarchy Preserving Graph Embedding via Subspace Constraints. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3580–3584.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
- Du et al. (2022a) Lun Du, Xu Chen, Fei Gao, Qiang Fu, Kunqing Xie, Shi Han, and Dongmei Zhang. 2022a. Understanding and Improvement of Adversarial Training for Network Embedding from an Optimization Perspective. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 230–240.
- Du et al. (2021) Lun Du, Fei Gao, Xu Chen, Ran Jia, Junshan Wang, Jiang Zhang, Shi Han, and Dongmei Zhang. 2021. TabularNet: A neural network architecture for understanding semantic structures of tabular data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 322–331.
- Du et al. (2022b) Lun Du, Xiaozhou Shi, Qiang Fu, Xiaojun Ma, Hengyu Liu, Shi Han, and Dongmei Zhang. 2022b. GBK-GNN: Gated Bi-Kernel Graph Neural Networks for Modeling Both Homophily and Heterophily. In Proceedings of the ACM Web Conference 2022. 1550–1558.
- Du et al. (2018) Lun Du, Guojie Song, Yiming Wang, Jipeng Huang, Mengfei Ruan, and Zhanyuan Yu. 2018. Traffic events oriented dynamic traffic assignment model for expressway network: a network flow approach. IEEE Intelligent Transportation Systems Magazine 10, 1 (2018), 107–120.
- Fu et al. (2021) Qiang Fu, Lun Du, Haitao Mao, Xu Chen, Wei Fang, Shi Han, and Dongmei Zhang. 2021. Neuron with Steady Response Leads to Better Generalization. arXiv preprint arXiv:2111.15414 (2021).
- Gulcehre et al. (2016) Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. In International conference on machine learning. PMLR, 3059–3068.
- Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. CoRR abs/1706.02216 (2017). arXiv:1706.02216 http://arxiv.org/abs/1706.02216
- Haykin (1994) Simon Haykin. 1994. Neural networks: a comprehensive foundation. Prentice Hall PTR.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
- LeCun (1998) Yann LeCun. 1998. The MNIST database of handwritten digits. http://yann. lecun. com/exdb/mnist/ (1998).
- Lewkowycz (2021) Aitor Lewkowycz. 2021. How to decay your learning rate. arXiv preprint arXiv:2103.12682 (2021).
- Liu et al. (2019) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2019. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265 (2019).
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021).
- Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
- Mao et al. (2021) Haitao Mao, Xu Chen, Qiang Fu, Lun Du, Shi Han, and Dongmei Zhang. 2021. Neuron Campaign for Initialization Guided by Information Bottleneck Theory. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3328–3332.
- Mishra and Sarawadekar (2019) Purnendu Mishra and Kishor Sarawadekar. 2019. Polynomial learning rate policy with warm restart for deep neural network. In TENCON 2019-2019 IEEE Region 10 Conference (TENCON). IEEE, 2087–2092.
- Neelakantan et al. (2015) Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 (2015).
- Preechakul and Kijsirikul (2019) Konpat Preechakul and Boonserm Kijsirikul. 2019. CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity. arXiv preprint arXiv:1912.11493 (2019).
- Reed and MarksII (1999) Russell Reed and Robert J MarksII. 1999. Neural smithing: supervised learning in feedforward artificial neural networks. Mit Press.
- Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 3 (2015), 211–252.
- Shi et al. (2021a) Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. 2021a. Neural Code Summarization: How Far Are We? arXiv preprint arXiv:2107.07112 (2021).
- Shi et al. (2021b) Ensheng Shi, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. 2021b. Cast: Enhancing code summarization with hierarchical splitting and reconstruction of abstract syntax trees. arXiv preprint arXiv:2108.12987 (2021).
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Smith (2017) Leslie N Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV). IEEE, 464–472.
- Song et al. (2020) Guojie Song, Yuanhao Li, Junshan Wang, and Lun Du. 2020. Inferring explicit and implicit social ties simultaneously in mobile social networks. Science China Information Sciences 63, 4 (2020), 1–3.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
- Xie et al. (2020) Zeke Xie, Issei Sato, and Masashi Sugiyama. 2020. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. arXiv preprint arXiv:2002.03495 (2020).
- Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
- Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning. PMLR, 40–48.
- You et al. (2019) Kaichao You, Mingsheng Long, Jianmin Wang, and Michael I Jordan. 2019. How does learning rate decay help modern neural networks? arXiv preprint arXiv:1908.01878 (2019).
- Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
Appendix A Details of Reproduction
A.1. Datasets
We take 6 real datasets from the Computer Vision and Graph Learning domains to evaluate our method. The detailed descriptions are as follows:
• Cora, CiteSeer and PubMed (Yang et al., 2016) are citation network datasets. In these datasets, nodes represent papers and edges represent citations of one paper by another. Node features are the bag-of-words representation of papers, and the node label denotes the academic topic of a paper.
• MNIST (LeCun, 1998) is one of the most researched datasets in machine learning and is used to classify handwritten digits.
• CIFAR-10 (Krizhevsky et al., 2009) is a collection of images commonly used to train machine learning and computer vision algorithms, and is one of the most widely used datasets for machine learning research. It contains 60,000 32x32 color images in 10 different classes, which represent airplanes, cars, birds, etc.
• ImageNet (Deng et al., 2009) is a large visual database designed for use in visual object recognition research. More than 14 million images have been hand-annotated by the project to indicate which objects are pictured, and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images.
For the graph datasets, we use the public data splits provided by (Yang et al., 2016). For MNIST, we keep the test set unchanged, randomly select 50,000 out of the 60,000 training images for training, and use the other 10,000 images as a validation set for hyper-parameter tuning. For CIFAR-10 and ImageNet, we use the public data splits provided by (Krizhevsky et al., 2009; Russakovsky et al., 2015).
A.2. Network Architectures
To verify the effectiveness of our method across different neural architectures, we employ Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Transformers, and Graph Neural Networks (GNNs) for evaluation.
• MLP (Haykin, 1994) is the most classic neural network, with multiple layers between the input and output layers.
• ResNet (He et al., 2016) is a neural network that builds on constructs known from pyramidal cells in the cerebral cortex. ResNet does this by utilizing skip connections, or shortcuts, to jump over some layers.
• VGG (Simonyan and Zisserman, 2014) is a Convolutional Neural Network architecture that was submitted to the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014).
• Swin Transformer (Liu et al., 2021) is a hierarchical Transformer whose representation is computed with shifted windows.
• GCN (Kipf and Welling, 2016) is a semi-supervised graph convolutional network model that learns node representations by aggregating information from neighbors.
• GAT (Veličković et al., 2017) is a graph neural network model that uses an attention mechanism to aggregate node features.
• GIN (Xu et al., 2018) is a graph neural network that is as powerful as the Weisfeiler-Lehman test in distinguishing graph structures.
• GraphSage (Hamilton et al., 2017) is a general inductive framework that leverages node feature information to efficiently generate node embeddings for previously unseen data.
For MLPs, we run five MLP models, denoted as MLP-L, where L indicates the number of layers including the input layer; the detailed hidden sizes are listed below. We adopt ResNet-18, ResNet-50, ResNet-101 (He et al., 2016) and VGG-19 (Simonyan and Zisserman, 2014) for CNNs, Swin Transformer (Liu et al., 2021) for Transformers, and GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), GIN (Xu et al., 2018), and GraphSage (Hamilton et al., 2017) for GNNs. ResNet and VGG are two conventional CNN models, Swin is a Transformer-based SOTA model for CV-related tasks, and GCN, GAT, GIN, and GraphSage are four widely used GNN models for the node classification task. To be specific, we use 2-layer GNN models with hidden size 16 and the Tiny Swin Transformer (Liu et al., 2021).
Model | Hidden layer dimension |
MLP-3 | [784, 100, 10] |
MLP-4 | [784, 256, 100, 10] |
MLP-6 | [784, 256, 128, 64, 32, 10] |
MLP-8 | [784, 256, 128, 64, 64, 32, 32, 10] |
MLP-10 | [784, 256, 128, 64, 64, 32, 32, 16, 16, 10] |
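For concreteness, the following sketch shows one way the MLP-L configurations in the table above could be instantiated; the choice of ReLU activations between layers and the build_mlp helper are our assumptions.

```python
import torch

# Layer dimensions from the table above (input layer included in L).
MLP_DIMS = {
    "MLP-3": [784, 100, 10],
    "MLP-4": [784, 256, 100, 10],
    "MLP-6": [784, 256, 128, 64, 32, 10],
    "MLP-8": [784, 256, 128, 64, 64, 32, 32, 10],
    "MLP-10": [784, 256, 128, 64, 64, 32, 32, 16, 16, 10],
}

def build_mlp(dims):
    """Stack Linear layers between consecutive dimensions, with ReLU in between."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(torch.nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:                # no activation after the output layer
            layers.append(torch.nn.ReLU())
    return torch.nn.Sequential(*layers)

model = build_mlp(MLP_DIMS["MLP-3"])         # 784 -> 100 -> 10 classifier for MNIST
```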
A.3. Learning Rate Schedule
We verify that LEAP improves the performance of the CV and GNN models on three different learning rate schedules:
• Constant Learning Rate uses the same learning rate during the whole training process. It can be considered as training without a learning rate schedule.
• Learning Rate Decay selects an initial learning rate and then gradually reduces it in accordance with a schedule.
• Warm Restart Schedule (Loshchilov and Hutter, 2016) is a simple warm restart technique for stochastic gradient descent that improves its anytime performance when training deep neural networks.
A.4. Optimizer
Since our method LEAP directly interferes with the optimization process, how the optimizer affects the proposed method becomes an important research question. We adopt Stochastic Gradient Descent (SGD) (Ruder, 2016) with momentum and its widely-used adaptive variant Adam (Kingma and Ba, 2014) to test the effects of different optimizers.
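As a hedged sketch of how the learning rate perturbation could be combined with a momentum buffer (the interaction is not spelled out above, so applying the perturbation only to the final parameter step is our assumption), one SGD-with-momentum update might look as follows:

```python
import torch

def leap_momentum_step(params, velocities, loss, lr, momentum, sigma):
    """SGD with momentum where only the final parameter step uses the
    perturbed, per-entry learning rate lr + eps, eps ~ N(0, sigma^2)."""
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, v, g in zip(params, velocities, grads):
            v.mul_(momentum).add_(g)               # velocity update: v = mu * v + g
            eps = sigma * torch.randn_like(p)      # element-wise Gaussian perturbation
            p -= (lr + eps) * v                    # perturbed step along the velocity

# Illustrative usage with toy parameters and a toy loss.
params = [torch.nn.Parameter(torch.randn(10, 5)), torch.nn.Parameter(torch.zeros(10))]
velocities = [torch.zeros_like(p) for p in params]
loss = (params[0].sum() + params[1].sum()) ** 2
leap_momentum_step(params, velocities, loss, lr=0.01, momentum=0.9, sigma=1e-3)
```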
A.5. Hyperparameter Settings.
We use the hyperparameters of Swin Transformer and VGG-19 provided by the Model Zoo (https://mmclassification.readthedocs.io/en/latest/model_zoo.html). For models without classical hyperparameter settings, we performed a hyperparameter search to find the best configuration for the baseline model, and the search process ensures that each model in the same domain has the same search space. For all GNN models, the learning rate is searched in {0.1, 0.05, 0.01, 5e-3, 1e-3}, and the weight decay is searched in {0.05, 0.01, 5e-3, 1e-3, 5e-4, 1e-4}. For ResNet and MLP, the learning rate is also searched over a grid, and the weight decay is searched in {5e-3, 1e-3, 5e-4, 1e-4}. The above describes the hyperparameter settings and search for the baselines. For LEAP, we search for the hyperparameter $\sigma$ in {0.1, 0.05, 0.01, 5e-3, 1e-3, 5e-4, 1e-4} while keeping all other hyperparameters unchanged. Note that we did not apply the constant learning rate and learning rate decay schedules to the Swin Transformer because these two schedules are not suitable for it; in fact, the performance of the Swin Transformer degrades greatly with these two schedules. We run each experiment with five random seeds and report the average. Model training is done on Nvidia Tesla V100 GPUs.