Provable Super-Convergence with a Large
Cyclical Learning Rate
Abstract
Conventional wisdom dictates that the learning rate should be in the stable regime so that gradient-based algorithms don't blow up. This letter introduces a simple scenario where an unstably large learning rate scheme leads to super-fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate (CLR) where we periodically take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019], who show that CLR with a large maximum learning rate can dramatically accelerate learning and lead to so-called “super-convergence”. We prove that our scheme excels on problems where the Hessian exhibits a bimodal spectrum and the eigenvalues can be grouped into two clusters (small and large). The unstably large step is the key to enabling fast convergence over the small eigen-spectrum.
Index Terms:
Convergence of numerical methods, Iterative algorithms, Gradient methods
I Introduction
Consider a least-squares problem with design matrix $\mathbf{A} \in \mathbb{R}^{n \times p}$ and labels $\mathbf{y} \in \mathbb{R}^{n}$. We wish to solve for
$$\min_{\mathbf{x}}\ \mathcal{L}(\mathbf{x}) = \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2.$$
If we use a gradient-based algorithm, the rate of convergence depends on the condition number $\kappa$ of $\mathbf{A}^\top\mathbf{A}$. Here $\kappa = \beta/\alpha$, where the smoothness $\beta$ and the strong convexity $\alpha$ of the problem are given by the maximum and minimum eigenvalues of the Hessian matrix $\mathbf{A}^\top\mathbf{A}$ as follows:
$$\beta = \sigma_{\max}(\mathbf{A})^2, \qquad \alpha = \sigma_{\min}(\mathbf{A})^2.$$
Here, $\sigma_{\min}(\cdot)$ and $\sigma_{\max}(\cdot)$ denote the smallest and largest singular values of a matrix, respectively. Standard gradient descent (GD) requires $\mathcal{O}(\kappa\log(1/\varepsilon))$ iterations to achieve $\varepsilon$-accuracy. Nesterov's acceleration can improve this to $\mathcal{O}(\sqrt{\kappa}\log(1/\varepsilon))$. In general, consider the iterations
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t\nabla\mathcal{L}(\mathbf{x}_t) = \mathbf{x}_t - \eta_t\,\mathbf{A}^\top(\mathbf{A}\mathbf{x}_t - \mathbf{y}).$$
Here, the contraction matrix $\mathbf{I} - \eta_t\mathbf{A}^\top\mathbf{A}$ governs the rate of convergence. Over the $i$th eigen-direction of $\mathbf{A}^\top\mathbf{A}$ with eigenvalue $\lambda_i$, the convergence/contraction rate is given by $|1 - \eta_t\lambda_i|$. Setting a fixed stable learning rate of $\eta_t = 1/\beta$, the issue is that gradient descent optimizes small eigen-directions much more slowly than the large eigen-directions. Faster learning over small eigen-directions can be facilitated by a large learning rate so that $1 - \eta_t\lambda_i$ is closer to $0$ even for small $\lambda_i$'s. However, this would lead to instability, i.e. $|1 - \eta_t\lambda_1| > 1$.
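To make this tension concrete, the following NumPy sketch (our own illustration with a synthetic bimodal spectrum) compares the per-direction contraction factors $|1 - \eta\lambda_i|$ under the stable step size $1/\beta$ and under a much larger, unstable one:

```python
import numpy as np

# Hypothetical bimodal spectrum of A^T A: a few large eigenvalues, many small ones.
eigs = np.concatenate([np.linspace(90, 100, 5), np.linspace(0.9, 1.0, 95)])
beta = eigs.max()   # smoothness = largest eigenvalue

for eta, name in [(1 / beta, "stable eta = 1/beta"), (1 / eigs.min(), "unstably large eta")]:
    factors = np.abs(1 - eta * eigs)      # per-direction contraction |1 - eta * lambda_i|
    print(f"{name:>20}:  worst large-eig factor {factors[:5].max():.2f},"
          f"  worst small-eig factor {factors[5:].max():.2f}")
```

The stable rate leaves the small eigen-directions contracting at roughly $0.99$ per step, while the large rate makes them contract quickly but blows up the large directions (factor far above $1$).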
Here, we point out the possibility that one can use an unstably large learning rate once in a while to provide huge improvements over the small directions. The resulting instability can be compensated quickly by following this unstable step with multiple stable steps which keep the larger directions under control. Overall, for problems with a bimodal Hessian spectrum, where eigenvalues are clustered into large and small groups, this leads to substantial improvements with only a logarithmic dependency on the condition number. Figure 1 highlights this phenomenon. Our approach can be formalized by using a cyclical (aka periodic) learning rate schedule [1, 2]. We remark that the cyclical learning rate (CLR) is related to SGD with Restarts (SGDR) and Stochastic Weight Averaging, which have attracted significant attention due to their fast convergence, generalization benefits and flexibility [3, 4, 5]. [6] assesses certain theoretical benefits of cosine learning rates, which are periodic. [7, 8] investigate periodic learning rates to facilitate escape from saddle points with stochastic gradients. Large initial learning rates are also known to play a critical role in generalization [9, 10, 11]. Recent work [12] provides further empirical evidence that practical learning rates can be at the edge of stability. However, to the best of our knowledge, prior works do not consider the potential theoretical benefits of an unstably large learning rate choice. The closest work is by Smith and Topin [3], where the authors observe that CLR can operate with very large maximum learning rates and converge “super fast”. We believe this letter provides rigorous theoretical support for such observations. We show that the maximum learning rate can in fact be unstable and can be much larger than the maximum stable learning rate. We also show that “super fast” convergence can be as fast as logarithmic in the condition number (which is drastically better than GD or Nesterov's AGD) for suitable problems (discussed further below).
Our CLR scheme simply takes two values as described below.
Definition 1
Fix an integer $K \ge 1$ and positive scalars $\eta_{\text{small}} < \eta_{\text{large}}$. Set the periodic learning rate for $t \ge 0$ as
$$\eta_t = \begin{cases}\eta_{\text{large}} & \text{if } (t+1) \bmod K = 0,\\ \eta_{\text{small}} & \text{otherwise.}\end{cases}$$
We call this schedule Unstable CLR if $\eta_{\text{large}} > 2/\beta$, where $\beta$ is the smoothness of the problem. Indeed, when this happens, the algorithm is susceptible to blowing up in certain eigen-directions, as the contraction matrix $\mathbf{I} - \eta_{\text{large}}\mathbf{A}^\top\mathbf{A}$ has operator norm larger than $1$.
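A minimal generator sketch of this schedule (the variable names `K`, `eta_small`, `eta_large` mirror the reconstruction of Def. 1 above; the specific values below are arbitrary):

```python
from itertools import islice

def unstable_clr_schedule(K, eta_small, eta_large):
    """Periodic learning rate of Def. 1: K - 1 small stable steps followed by
    one large (possibly unstable) step, repeated indefinitely."""
    t = 0
    while True:
        yield eta_large if (t + 1) % K == 0 else eta_small
        t += 1

# With K = 5, steps 0-3 use eta_small and step 4 uses eta_large, and so on.
print(list(islice(unstable_clr_schedule(K=5, eta_small=0.01, eta_large=1.0), 10)))
```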
Theorem 1 (Linear regression)
Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ be the decreasingly sorted eigenvalues of $\mathbf{A}^\top\mathbf{A}$ with $\lambda_1 = \beta$ and $\lambda_p = \alpha > 0$. Fix some integer $r$ with $1 \le r < p$. Introduce the quantities
$$\kappa_S = \frac{\lambda_1}{\lambda_r}, \qquad \kappa_{\bar S} = \frac{\lambda_{r+1}}{\lambda_p}, \qquad K = 1 + \left\lceil \kappa_S\log\!\frac{\kappa}{1 - 1/\kappa_{\bar S}}\right\rceil.$$
Set the period and the learning rates according to Def. 1 with $\eta_{\text{small}} = 1/\lambda_1$ and $\eta_{\text{large}} = 1/\lambda_{r+1}$. For all $t \ge 0$ with $t \bmod K = 0$, the iterations obey
$$\|\mathbf{x}_t - \mathbf{x}^\star\|_2 \le \left(1 - \frac{1}{\kappa_{\bar S}}\right)^{t/K}\|\mathbf{x}_0 - \mathbf{x}^\star\|_2. \qquad (1)$$
Alternatively, for , we have for .
Interpretation. Observe that, in order to achieve $\varepsilon$-accuracy, the number of required iterations grows as
$$\mathcal{O}\!\left(\kappa_{\bar S}\big(1 + \kappa_S\log\kappa\big)\log(1/\varepsilon)\right). \qquad (2)$$
Here, $\kappa_S$ and $\kappa_{\bar S}$ are the local condition numbers of the eigen-spectrum: $\kappa_S$ is the condition number over the subspace $S$ (i.e., the eigenvalues $\lambda_1$ through $\lambda_r$) and $\kappa_{\bar S}$ is the condition number over the orthogonal complement $\bar S$ (i.e., $\lambda_{r+1}$ through $\lambda_p$). Observe that each of these is always upper bounded by the overall condition number, i.e. $\kappa_S, \kappa_{\bar S} \le \kappa$. However, if the eigenvalues over $S$ and $\bar S$ are narrowly clustered, we can have $\kappa_S, \kappa_{\bar S} \ll \kappa$, leading to much faster convergence. Finally, observe that the dependence on the (global) condition number $\kappa$ is only logarithmic. Thus, in the worst-case scenario of $\kappa_S\kappa_{\bar S} \approx \kappa$, the iteration complexity (2) is sub-optimal by a $\log\kappa$ factor compared to standard gradient descent, which requires $\mathcal{O}(\kappa\log(1/\varepsilon))$ iterations. However, there is roughly a factor of $\kappa/(\kappa_S\kappa_{\bar S})$ improvement when the eigenvalues are clustered and $\kappa_S, \kappa_{\bar S}$ are small. Figure 1 demonstrates the comparisons to gradient descent (with learning rate $1/\beta$) and Nesterov's accelerated GD. In these examples, we fix the local condition numbers whereas the global condition number $\kappa$ varies over several orders of magnitude. As $\kappa$ grows larger, Unstable CLR shines over the alternatives as the iteration count only grows as $\log\kappa$. Finally, note that the inverse of the stable learning rate, $1/\eta_{\text{small}} = \lambda_1$, is equal to the smoothness (top Hessian eigenvalue) of the whole problem, whereas $\eta_{\text{large}} = 1/\lambda_{r+1}$ is chosen based on the smoothness over the lower spectrum.
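The following self-contained NumPy sketch (not the authors' experiment behind Figure 1; the spectrum, period, and step sizes are illustrative choices consistent with the discussion above) reproduces the qualitative phenomenon on a diagonal quadratic with a bimodal Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
# Diagonal quadratic 0.5*(x - x*)^T H (x - x*) whose Hessian H has a bimodal
# spectrum: r large eigenvalues near 100 and p - r small eigenvalues near 1.
r, p = 5, 100
eigs = np.sort(np.concatenate([np.linspace(95, 100, r),
                               np.linspace(0.95, 1.0, p - r)]))[::-1]
x_star = rng.standard_normal(p)

def run(lr_fn, T):
    """Gradient descent x <- x - eta_t * H (x - x*) with a per-step learning rate."""
    x = np.zeros(p)
    for t in range(T):
        x = x - lr_fn(t) * eigs * (x - x_star)
    return np.linalg.norm(x - x_star)

beta = eigs[0]
K = 25                                        # period: 24 stable steps + 1 unstable step
eta_small, eta_large = 1 / beta, 1 / eigs[r]  # unstable step targets the lower spectrum

err_gd = run(lambda t: eta_small, 500)
err_clr = run(lambda t: eta_large if (t + 1) % K == 0 else eta_small, 500)
print(f"error after 500 steps   GD: {err_gd:.1e}   Unstable CLR: {err_clr:.1e}")
```

On this toy problem, plain GD is still limited by the small eigenvalues after 500 steps, while the cyclical schedule drives the error down to machine precision.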
Bimodal Hessian. The bimodal Hessian spectrum is crucial for enabling super-fast convergence. In essence, with the bimodal structure, Unstable CLR acts as a preconditioner for the problem and helps not only the large eigenvalue directions but also the small ones. It is possible that other cyclical schemes will accelerate a broader class of Hessian spectra. That said, we briefly mention that the bimodal spectrum has empirical support in the deep learning literature. Several works studied the empirical Hessians (and Jacobians) of deep neural networks [13, 14, 15]. A common observation is that the Hessian spectrum has relatively few large eigenvalues and many more smaller eigenvalues [13, 15]. These observations are closely related to the seminal spiked covariance model [16], where the covariance spectrum has a few large eigenvalues and many more smaller eigenvalues. In connection, [14, 17, 18] studied the Jacobian spectrum. Note that, in the linear regression setting of Theorem 1, the Jacobian is simply $\mathbf{A}$. They similarly found that practical deep nets have a bimodal spectrum and the Jacobian is approximately low-rank. It is also known that the behavior of wide deep nets (i.e., many neurons per layer) can be approximated by their Jacobian-based linearization at initialization via the neural tangent kernel [19, 20]. Thus the spectrum indeed closely governs the optimization dynamics [21], similar to our simple linear regression setup. While the larger eigen-directions of a deep net can be optimized quickly [14, 18], intuitively their existence will slow down the small eigen-directions. However, it is also observed that learning such small eigen-directions is critical for the success of deep learning, since typically achieving zero training loss (aka interpolation) leads to the best generalization performance [22, 23, 24]. In light of this discussion, the related empirical observations of [3], and our theory, it is indeed plausible that Unstable CLR does a stellar job at learning these small eigen-directions, leading to much faster interpolation and improved generalization.
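For intuition only (our own illustration, not an experiment from this letter), a spiked-covariance design in the spirit of [16] readily produces such a bimodal Hessian spectrum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 2000, 100, 5
# Spiked covariance: k strong directions (spikes) plus isotropic noise.
U, _ = np.linalg.qr(rng.standard_normal((p, k)))
A = rng.standard_normal((n, k)) * 10 @ U.T + rng.standard_normal((n, p))
hessian_eigs = np.linalg.eigvalsh(A.T @ A / n)[::-1]
print(hessian_eigs[:8])   # a handful of eigenvalues near 100, then a flat bulk near 1
```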
II Extension to Nonlinear Problems
To conclude our discussion, we next provide a more general result that applies to strongly convex functions. Recall that $\bar S$ denotes the complement of a subspace $S$. Let $\mathbf{P}_S$ denote the matrix that projects onto $S$. Our goal is solving $\min_{\mathbf{x}} f(\mathbf{x})$ via the gradient iterations $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t\nabla f(\mathbf{x}_t)$.
Definition 2 (Smoothness and strong-convexity)
Let $f$ be a smooth convex function and fix $\beta \ge \alpha > 0$. $f$ is $\beta$-smooth and $\alpha$-strongly convex if its Hessian satisfies the following for all $\mathbf{x}$:
$$\alpha\mathbf{I} \preceq \nabla^2 f(\mathbf{x}) \preceq \beta\mathbf{I}.$$
First, we recall a classical convergence result for strongly convex functions for the sake of completeness.
Proposition 1
Let $f$ obey Definition 2 and suppose $\mathbf{x}^\star$ is its minimizer. Pick a learning rate $\eta = 1/\beta$ and use the iterations $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\nabla f(\mathbf{x}_t)$. The iterates obey $\|\mathbf{x}_t - \mathbf{x}^\star\|_2 \le \left(1 - \alpha/\beta\right)^t\|\mathbf{x}_0 - \mathbf{x}^\star\|_2$.
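As a quick numerical sanity check of this classical contraction (our own sketch; the ridge-regularized logistic loss, its smoothness estimate, and all constants are illustrative assumptions, not part of the letter):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 200, 20, 0.1
A = rng.standard_normal((n, p))
y = np.sign(rng.standard_normal(n))

def grad(x):
    """Gradient of the ridge-regularized logistic loss (strongly convex and smooth)."""
    margins = -y * (A @ x)
    return A.T @ (-y / (1 + np.exp(-margins))) / n + mu * x

alpha = mu                                          # strong convexity from the ridge term
beta = mu + np.linalg.norm(A, 2) ** 2 / (4 * n)     # smoothness upper bound
eta, rate = 1 / beta, 1 - alpha / beta

x = np.zeros(p)
for _ in range(5000):                               # long run: treat the result as x*
    x -= eta * grad(x)
x_star = x.copy()

x0 = np.ones(p)
x = x0.copy()
for _ in range(50):
    x -= eta * grad(x)
# Proposition 1 predicts ||x_50 - x*|| <= (1 - alpha/beta)^50 ||x_0 - x*||.
print(np.linalg.norm(x - x_star) <= rate ** 50 * np.linalg.norm(x0 - x_star))
```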
This is the usual setup for gradient descent analysis. Instead, we will employ the bimodal Hessian definition.
Definition 3 (Bimodal Hessian)
Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function. Additionally, suppose there exist a subspace $S$, local condition numbers $\kappa_S, \kappa_{\bar S}$ and a cross-smoothness $\epsilon$ such that the Hessian of $f$ satisfies
(3)
Here, $\kappa_S$ and $\kappa_{\bar S}$ are the local condition numbers as previously. Observe that both of these are upper bounded by the global condition number $\kappa$ since $f$ obeys Def. 2 as well. The cross-smoothness $\epsilon$ controls the interaction between the two subspaces. For linear regression, $\epsilon = 0$ by picking $S$ to be an eigenspace of the Hessian. For general nonlinear models, as long as the problem can be approximated linearly (e.g. wide deep nets can be approximated by their linearization [19, 20]), it is plausible that the cross-smoothness is small for a suitable choice of $S$. We have the following analogue of Theorem 1.
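To make these quantities concrete, the following sketch (ours) takes $S$ to be the exact top-$r$ eigenspace of a fixed symmetric Hessian, computes the local condition numbers, and measures the cross-term $\|\mathbf{P}_S\mathbf{H}\mathbf{P}_{\bar S}\|$, which vanishes for this choice of $S$:

```python
import numpy as np

def bimodal_diagnostics(H, r):
    """Split a symmetric Hessian H w.r.t. its top-r eigenspace S and report the
    local condition numbers and the cross-smoothness ||P_S H P_Sbar||."""
    eigvals, eigvecs = np.linalg.eigh(H)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    P_S = eigvecs[:, :r] @ eigvecs[:, :r].T              # projector onto S
    P_Sc = np.eye(H.shape[0]) - P_S                      # projector onto the complement
    kappa_S = eigvals[0] / eigvals[r - 1]
    kappa_Sc = eigvals[r] / eigvals[-1]
    cross = np.linalg.norm(P_S @ H @ P_Sc, 2)            # ~0 when S is an exact eigenspace
    return kappa_S, kappa_Sc, cross

# Example: a quadratic Hessian with a bimodal spectrum in a random basis.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
H = Q @ np.diag(np.concatenate([np.linspace(95, 100, 5),
                                np.linspace(0.9, 1.0, 45)])) @ Q.T
print(bimodal_diagnostics(H, r=5))
```

For a genuinely nonlinear model one would instead estimate these quantities from Hessian-vector products, and the cross-term would generally be nonzero.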
Theorem 2 (Nonlinear problems)
Interpretation. This result is in a similar spirit to the linear regression setup of Theorem 1. The required number of iterations is still governed by the quantity (2). A key difference is the cross-smoothness $\epsilon$, which controls the cross-Hessian matrix $\mathbf{P}_S\nabla^2 f\,\mathbf{P}_{\bar S}$. This term was simply equal to $0$ for the linear problem. In essence, our condition for nonlinear problems essentially requires the cross-Hessian to be dominated by the Hessian over the lower spectrum. In particular, $\epsilon$ should be upper bounded by the strong convexity parameter $\alpha$ as well as by a quantity governed by the smoothness over the lower spectrum. It would be interesting to explore to what extent the conditions on the cross-Hessian can be relaxed. However, as mentioned earlier, the fact that wide artificial neural networks behave close to linear models [19] provides a decent justification for a small cross-Hessian.
III Proofs
III-A Proof of Theorem 1
Proof:
Following Theorem 1's statement, introduce the residual $\mathbf{r}_t = \mathbf{x}_t - \mathbf{x}^\star$. Each gradient iteration can be written in terms of the residual and has the following form
$$\mathbf{r}_{t+1} = \left(\mathbf{I} - \eta_t\mathbf{A}^\top\mathbf{A}\right)\mathbf{r}_t.$$
Let $S$ be the principal eigenspace induced by the first $r$ eigenvectors of $\mathbf{A}^\top\mathbf{A}$ and let $\bar S$ be its complement. Let $\mathbf{P}_S$ be the projection operator onto $S$ and $\mathbf{P}_{\bar S}$ the projection onto $\bar S$. Set $\mathbf{a}_t = \mathbf{P}_S\mathbf{r}_t$ and $\mathbf{b}_t = \mathbf{P}_{\bar S}\mathbf{r}_t$. Since the learning rate is periodic, we analyze a single period starting from $t = 0$. During the first $K - 1$ iterations, with $\eta_t = \eta_{\text{small}} = 1/\lambda_1$, we have that
$$\|\mathbf{a}_{t+1}\|_2 \le \left(1 - \tfrac{1}{\kappa_S}\right)\|\mathbf{a}_t\|_2, \qquad \|\mathbf{b}_{t+1}\|_2 \le \|\mathbf{b}_t\|_2.$$
At the final (unstably large) iteration $t = K - 1$, with $\eta_t = \eta_{\text{large}} = 1/\lambda_{r+1}$, we have
$$\|\mathbf{a}_K\|_2 \le \left(\eta_{\text{large}}\lambda_1 - 1\right)\|\mathbf{a}_{K-1}\|_2, \qquad \|\mathbf{b}_K\|_2 \le \left(1 - \tfrac{1}{\kappa_{\bar S}}\right)\|\mathbf{b}_{K-1}\|_2.$$
Now the term $\|\mathbf{b}_t\|_2$ is clearly non-increasing during the stable steps and obeys $\|\mathbf{b}_K\|_2 \le \left(1 - \tfrac{1}{\kappa_{\bar S}}\right)\|\mathbf{b}_0\|_2$ over the period. Note that we forego the additional contraction of $\mathbf{b}_t$ during the stable steps for the sake of simplicity. We wish to make the growth of $\|\mathbf{a}_t\|_2$ similarly small by enforcing
$$\left(1 - \tfrac{1}{\kappa_S}\right)^{K-1}\left(\eta_{\text{large}}\lambda_1 - 1\right) \le 1 - \tfrac{1}{\kappa_{\bar S}}.$$
Observe that $\eta_{\text{large}}\lambda_1 - 1 \le \lambda_1/\lambda_p = \kappa$. Thus, we simply need
$$\left(1 - \tfrac{1}{\kappa_S}\right)^{K-1}\kappa \le 1 - \tfrac{1}{\kappa_{\bar S}},$$
which holds as soon as $K - 1 \ge \kappa_S\log\!\big(\kappa/(1 - 1/\kappa_{\bar S})\big)$, since $\left(1 - \tfrac{1}{\kappa_S}\right)^{K-1} \le e^{-(K-1)/\kappa_S}$.
In short, at the end of a single period we are guaranteed to have (1) after observing that both $\|\mathbf{a}_t\|_2$ and $\|\mathbf{b}_t\|_2$ contract by a factor of $1 - 1/\kappa_{\bar S}$ or better over each period. ∎
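The single-period argument above can also be checked numerically. The sketch below (ours; the bimodal spectrum is synthetic and the period follows the $K - 1 \approx \kappa_S\log\kappa$ rule of thumb) multiplies the per-step contraction factors over one full period:

```python
import numpy as np

# One period = K-1 stable steps followed by one unstable step. We verify that the
# combined per-direction factor is below 1 over both the top-r eigenspace S and its
# complement, even though the unstable step alone blows up the top directions.
eigs = np.sort(np.concatenate([np.linspace(95, 100, 5),
                               np.linspace(0.9, 1.0, 95)]))[::-1]
r = 5
lam1, lam_r, lam_r1 = eigs[0], eigs[r - 1], eigs[r]
eta_small, eta_large = 1 / lam1, 1 / lam_r1
kappa, kappa_S = lam1 / eigs[-1], lam1 / lam_r
K = 1 + int(np.ceil(kappa_S * np.log(kappa)))         # K - 1 stable steps per period

period_factor = np.abs(1 - eta_small * eigs) ** (K - 1) * np.abs(1 - eta_large * eigs)
print("max factor over S     :", period_factor[:r].max())   # < 1 despite the unstable step
print("max factor over S-bar :", period_factor[r:].max())   # roughly 1 - 1/kappa_Sbar
```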
III-B Proof of Theorem 2
Proof:
The proof idea is the same as that of Theorem 1, but we additionally control the cross-smoothness terms. Denote the residual by $\mathbf{r}_t = \mathbf{x}_t - \mathbf{x}^\star$. We will work with the projections of the residual onto the subspaces $S$ and $\bar S$, denoted by $\mathbf{a}_t = \mathbf{P}_S\mathbf{r}_t$ and $\mathbf{b}_t = \mathbf{P}_{\bar S}\mathbf{r}_t$, respectively. Additionally, observe that by this definition $\mathbf{r}_t = \mathbf{a}_t + \mathbf{b}_t$ and $\|\mathbf{r}_t\|_2^2 = \|\mathbf{a}_t\|_2^2 + \|\mathbf{b}_t\|_2^2$.
We will prove that, at the end of a single period, the pair $(\|\mathbf{a}_K\|_2, \|\mathbf{b}_K\|_2)$ obeys
(4)
The overall result follows inductively from this as follows. First, applying (4) over consecutive periods, we would achieve the corresponding contraction at every integer multiple of the period $K$. That would in turn yield the bound stated in Theorem 2.
Establishing (4): Thus, let us show (4) for a single period of the learning rate schedule, i.e. $0 \le t \le K - 1$. Since $\nabla f(\mathbf{x}^\star) = \mathbf{0}$, the gradient is given by
$$\nabla f(\mathbf{x}_t) = \nabla f(\mathbf{x}_t) - \nabla f(\mathbf{x}^\star) = \mathbf{H}_t\,\mathbf{r}_t,$$
where $\mathbf{H}_t = \int_0^1\nabla^2 f\big(\mathbf{x}^\star + s(\mathbf{x}_t - \mathbf{x}^\star)\big)\,\mathrm{d}s$ is obtained by integrating the Hessian along the path from $\mathbf{x}^\star$ to $\mathbf{x}_t$. Write the next iterate as
$$\mathbf{r}_{t+1} = \left(\mathbf{I} - \eta_t\mathbf{H}_t\right)\mathbf{r}_t.$$
By Definition 3 and the triangle inequality, $\mathbf{H}_t$ satisfies the bimodal Hessian inequalities described in (3). To analyze a single period, we first focus on the lower learning rate $\eta_{\text{small}}$, which spans the initial $K - 1$ iterations. Hence, suppose $0 \le t \le K - 2$. Note that, from the strong convexity of $f$ over $S$, we have that
Following this with $\eta_t = \eta_{\text{small}}$, for $0 \le t \le K - 2$, we find the recursions
(5)
Using the identical argument for the $\bar S$ component $\mathbf{b}_t$ yields
Setting , we obtain
(6)
Additionally, recall that, since is strongly-convex we have . Thus using we have
That is, we are guaranteed to have . Thus, using (6), recursively, for all , satisfies
(7)
At time $t = K - 1$, we use the larger learning rate $\eta_{\text{large}}$. The following holds via an identical argument as (5)
(8)
To bound at time , we recall and the bound , to obtain
Bounding is what remains. Noticing , our period choice obeys
for . Thus, using the assumption , and the bounds (7) and (8), the bound on is obtained via
where we noted . The above bounds on $\|\mathbf{a}_K\|_2$ and $\|\mathbf{b}_K\|_2$ establish (4) and conclude the proof. ∎
IV Conclusions
This letter introduced a setting where remarkably fast convergence can be attained by using a cyclical learning rate schedule that carefully targets the bimodal structure of the Hessian of the problem. This is accomplished by replacing the global condition number with its local counterparts. While the bimodal Hessian seems to be a strong assumption, recent literature provides rich empirical justification for our theory. Besides relaxing the assumptions in Theorem 2, there are multiple interesting directions for Unstable CLR. The results can likely be extended to minibatch SGD, overparameterized settings, or to non-convex problems (e.g. via the Polyak-Łojasiewicz condition [25]). While our attention was restricted to the bimodal Hessian and a CLR scheme with two values ($\eta_{\text{small}}, \eta_{\text{large}}$ as in Def. 1), it would be exciting to explore whether more sophisticated CLR schemes can provably accelerate optimization for a richer class of Hessian structures. On the practical side, further empirical investigation (beyond [3, 12]) can help verify and explore the potential benefits of large cyclical learning rates.
References
- [1] L. N. Smith, “Cyclical learning rates for training neural networks,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 464–472.
- [2] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
- [3] L. N. Smith and N. Topin, “Super-convergence: Very fast training of residual networks using large learning rates,” arXiv preprint arXiv:1708.07120, 2017.
- [4] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating kl vanishing,” arXiv preprint arXiv:1903.10145, 2019.
- [5] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018. Association For Uncertainty in Artificial Intelligence (AUAI), 2018, pp. 876–885.
- [6] X. Li, Z. Zhuang, and F. Orabona, “Exponential step sizes for non-convex optimization,” arXiv preprint arXiv:2002.05273, 2020.
- [7] K. Zhang, A. Koppel, H. Zhu, and T. Basar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM Journal on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, 2020.
- [8] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann, “Escaping saddles with stochastic gradients,” in International Conference on Machine Learning. PMLR, 2018, pp. 1155–1164.
- [9] G. Leclerc and A. Madry, “The two regimes of deep network training,” arXiv preprint arXiv:2002.10376, 2020.
- [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
- [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [12] J. M. Cohen, S. Kaur, Y. Li, Z. Kolter, and A. Talwalkar, “Gradient descent on neural networks typically occurs at the edge of stability,” to appear at ICLR, 2021.
- [13] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, “Empirical analysis of the hessian of over-parametrized neural networks,” arXiv preprint arXiv:1706.04454, 2017.
- [14] S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi, “Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian,” arXiv preprint arXiv:1906.05392, 2019.
- [15] X. Li, Q. Gu, Y. Zhou, T. Chen, and A. Banerjee, “Hessian based analysis of sgd for deep nets: Dynamics and generalization,” in Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020, pp. 190–198.
- [16] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, pp. 1617–1642, 2007.
- [17] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 4313–4324.
- [18] D. Kopitkov and V. Indelman, “Neural spectrum alignment: Empirical study,” in International Conference on Artificial Neural Networks. Springer, 2020, pp. 168–179.
- [19] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” Conference on Neural Information Processing Systems, 2018.
- [20] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” arXiv preprint arXiv:1902.06720, 2019.
- [21] G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” arXiv preprint arXiv:1812.04754, 2018.
- [22] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
- [23] M. Belkin, D. Hsu, and P. Mitra, “Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161, 2018.
- [24] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar, “Theory of deep learning iii: explaining the non-overfitting puzzle,” arXiv preprint arXiv:1801.00173, 2017.
- [25] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.