
Provable Super-Convergence with a Large Cyclical Learning Rate

Samet Oymak
Submitted on April 6, 2021. This work was supported in part by the NSF under CNS grant 1932254 and the CAREER award 2046816.
Abstract

Conventional wisdom dictates that the learning rate should be in the stable regime so that gradient-based algorithms don’t blow up. This letter introduces a simple scenario where an unstably large learning rate scheme leads to super-fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate (CLR) where we periodically take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019], who show that CLR with a large maximum learning rate can dramatically accelerate learning and lead to so-called “super-convergence”. We prove that our scheme excels in problems where the Hessian exhibits a bimodal spectrum and the eigenvalues can be grouped into two clusters (small and large). The unstably large step is the key to enabling fast convergence over the small eigen-spectrum.

Index Terms:
Convergence of numerical methods, Iterative algorithms, Gradient methods

I Introduction

Consider a least-squares problem with design matrix $\bm{X}\in\mathbb{R}^{n\times p}$ and labels $\bm{y}\in\mathbb{R}^{n}$. We wish to solve for

\bm{\theta}_{\star}=\arg\min_{\bm{\theta}\in\mathbb{R}^{p}}\frac{1}{2}\|\bm{y}-\bm{X}\bm{\theta}\|_{\ell_{2}}^{2}.

If we use a gradient-based algorithm, the rate of convergence depends on the condition number $\kappa$ of $\bm{X}$. Here $\kappa=L/\mu$, where the smoothness $L$ and the strong convexity $\mu$ of the problem are given by the maximum and minimum eigenvalues of the Hessian matrix $\bm{X}^{\top}\bm{X}$ as follows

L=\|\bm{X}^{\top}\bm{X}\|,\quad\mu=\sigma_{\min}(\bm{X}^{\top}\bm{X}).

Here, $\sigma_{\min}(\cdot)$ and $\|\cdot\|$ denote the smallest singular value and the spectral norm (largest singular value) of a matrix, respectively. Standard gradient descent (GD) requires $\kappa\log(\varepsilon^{-1})$ iterations to achieve $\varepsilon$-accuracy. Nesterov’s acceleration improves this to $\sqrt{\kappa}\log(\varepsilon^{-1})$. In general, consider the iterations

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\bm{X}^{\top}(\bm{X}\bm{\theta}_{t}-\bm{y}).

Here, the contraction matrix $\bm{C}_{t}=\bm{I}-\eta_{t}\bm{X}^{\top}\bm{X}$ governs the rate of convergence. Over the $i$th eigen-direction of $\bm{X}^{\top}\bm{X}$ with eigenvalue $\lambda_{i}$, the convergence/contraction rate is given by $1-\eta_{t}\lambda_{i}$. With a fixed stable learning rate of $\eta_{t}=1/L$, the issue is that gradient descent optimizes the small eigen-directions much more slowly than the large eigen-directions. Faster learning over the small eigen-directions can be facilitated by a large learning rate so that $1-\eta_{t}\lambda_{i}$ is closer to $0$ even for small $\lambda_{i}$’s. However, this would lead to instability, i.e. $\|\bm{C}_{t}\|>1$.
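To make these contraction rates concrete, the following minimal sketch (a synthetic least-squares instance of our own, not one of the experiments of this letter) runs the iteration above with the stable step $\eta_{t}=1/L$ and prints the per-eigendirection factors $1-\eta_{t}\lambda_{i}$.

```python
import numpy as np

# Sketch: gradient descent on 0.5*||y - X theta||^2 with the stable step eta = 1/L.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star

lam = np.linalg.eigvalsh(X.T @ X)          # eigenvalues of the Hessian, ascending
L, mu = lam[-1], lam[0]
eta = 1.0 / L                              # stable learning rate

theta = np.zeros(p)
for _ in range(500):
    theta = theta - eta * X.T @ (X @ theta - y)   # gradient step

print("contraction over the largest eigen-direction :", 1 - eta * L)   # close to 0 (fast)
print("contraction over the smallest eigen-direction:", 1 - eta * mu)  # close to 1 (slow)
print("distance to the solution:", np.linalg.norm(theta - theta_star))
```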

Here, we point out that one can use an unstably large learning rate once in a while to provide huge improvements over the small directions. The resulting instability can be compensated quickly by following this unstable step with multiple stable steps which keep the larger directions under control. Overall, for problems with a bimodal Hessian spectrum, where eigenvalues are clustered in large and small groups, this leads to substantial improvements with only a logarithmic dependency on the condition number. Figure 1 highlights this phenomenon. Our approach can be formalized by using a cyclical (aka periodic) learning rate schedule [1, 2]. We remark that the cyclical learning rate (CLR) is related to SGD with Restarts (SGDR) and Stochastic Weight Averaging, which attracted significant attention due to their fast convergence, generalization benefits and flexibility [3, 4, 5]. [6] assesses certain theoretical benefits of cosine learning rates, which are periodic. [7, 8] investigate periodic learning rates to facilitate escape from saddle points with stochastic gradients. Large initial learning rates are also known to play a critical role in generalization [9, 10, 11]. Recent work [12] provides further empirical evidence that practical learning rates can be at the edge of stability. However, to the best of our knowledge, prior works do not consider the potential theoretical benefits of an unstably large learning rate choice. The closest work is by Smith and Topin [3], where the authors observe that CLR can operate with very large maximum learning rates and converge “super fast”. We believe this letter provides rigorous theoretical support for such observations. We show that the maximum learning rate can in fact be unstable and can be much larger than the maximum stable learning rate. We also show that “super fast” convergence can be as fast as logarithmic in the condition number (which is drastically better than GD or Nesterov AGD) for suitable problems (discussed further below).

Our CLR scheme simply takes two values $\eta_{\pm}$ as described below.

Definition 1

Fix an integer $T>1$ and positive scalars $\eta_{+},\eta_{-}$. Set the periodic learning rate $\eta_{t}$ for $t\geq 0$ as

\eta_{t}=\begin{cases}\eta_{-}&\text{if }\mathrm{mod}(t,T)=T-1\\ \eta_{+}&\text{else}\end{cases}.
[Figure 1: four panels, (a) $\kappa=10$, (b) $\kappa=100$, (c) $\kappa=1000$, (d) $\kappa=10000$.]
Figure 1: Convergence performance of gradient descent with $\eta_{t}=1/L$, Nesterov’s acceleration (AGD) and the Unstable Cyclical Learning Rate of Theorem 1 on a linear regression task. In Figures (a) to (d), we vary the condition number $\kappa$ from $10$ to $10^{4}$. In these experiments, the eigen-spectrum of the Hessian is bimodal, with eigenvalues lying over the intervals $[1,2]$ and $[\kappa/2,\kappa]$. This means the local condition numbers are $\kappa_{+}=\kappa_{-}=2$. As the global condition number grows, Unstable CLR outperforms standard gradient descent and Nesterov AGD as it only requires a number of iterations logarithmic in $\kappa$. We note that Unstable CLR can potentially further benefit from acceleration.

We call this schedule Unstable CLR if $\eta_{-}>2/L$ where $L$ is the smoothness of the problem. Indeed, when this happens, the algorithm is susceptible to blowing up in certain eigen-directions as the contraction matrix $\bm{I}-\eta_{-}\bm{X}^{\top}\bm{X}$ has operator norm larger than $1$.
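For concreteness, here is a minimal sketch of the schedule in Definition 1 (the function name is ours and purely illustrative); given a period $T$ and the two step sizes, it returns $\eta_{-}$ once per period and $\eta_{+}$ otherwise.

```python
# Sketch of the two-value cyclical schedule of Definition 1.
def cyclical_lr(t: int, T: int, eta_plus: float, eta_minus: float) -> float:
    # One unstable step at the end of each period of length T, stable steps otherwise.
    return eta_minus if t % T == T - 1 else eta_plus
```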

Theorem 1 (Linear regression)

Let $\bm{\lambda}\in\mathbb{R}_{+}^{p}$ be the decreasingly sorted eigenvalues of $\bm{X}^{\top}\bm{X}$ with $L=\lambda_{1}$ and $\mu=\lambda_{p}$. Fix some integer $r$ with $p>r\geq 1$. Introduce the quantities

\kappa=\frac{L}{\mu},\quad\kappa_{+}=\frac{L}{\lambda_{r}},\quad\kappa_{-}=\frac{\lambda_{r+1}}{\mu}.

Set the period $T\geq\kappa_{+}\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)+1$ and the learning rate according to Def. 1 with $\eta_{+}=\frac{1}{L}$ and $\eta_{-}=\frac{1}{\kappa_{-}\mu}$. For all $t$ with $\mathrm{mod}(t,T)=0$, the iterations obey

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\left(1-\frac{1}{2\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}. \qquad (1)

Alternatively, for $t\geq 2T\kappa_{-}\log(\varepsilon^{-1})$, we have $\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\varepsilon\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}$ for any $0<\varepsilon<1$.

Interpretation. Observe that, in order to achieve $\varepsilon$-accuracy, the number of required iterations grows as

\boxed{\kappa_{+}\kappa_{-}\log(\kappa)\log(\varepsilon^{-1}).} \qquad (2)

Here $\kappa_{+},\kappa_{-}$ are the local condition numbers of the eigen-spectrum: $\kappa_{+}$ is the condition number over the subspace $\mathcal{S}$ (i.e. the eigen-directions associated with $\lambda_{1}$ through $\lambda_{r}$) and $\kappa_{-}$ is the condition number over the orthogonal complement $\mathcal{S}^{c}$. Observe that $\kappa_{+}\times\kappa_{-}$ is always upper bounded by the overall condition number $\kappa$, i.e. $\kappa_{+}\kappa_{-}\leq\kappa$. However, if the eigenvalues over $\mathcal{S}$ and $\mathcal{S}^{c}$ are narrowly clustered, we can have $\kappa_{+}\kappa_{-}\ll\kappa$, leading to much faster convergence. Finally, observe that the dependence on the (global) condition number $\kappa$ is only logarithmic. Thus, in the worst-case scenario of $\kappa_{+}\kappa_{-}=\kappa$, the iteration complexity (2) is sub-optimal by a $\log(\kappa)$-factor compared to standard gradient descent, which requires $\kappa\log(\varepsilon^{-1})$ iterations. However, there is a factor of $\kappa/\log(\kappa)$ improvement when the eigenvalues are clustered and $\kappa_{+},\kappa_{-}$ are small. Figure 1 demonstrates the comparisons to gradient descent (with $\eta=1/L$) and Nesterov’s Accelerated GD. In these examples, we set the local condition numbers to $\kappa_{-}=\kappa_{+}=2$ whereas $\kappa$ varies from $10$ to $10000$. As $\kappa$ grows larger, Unstable CLR shines over the alternatives as the iteration count only grows as $\log(\kappa)$. Finally, note that the inverse learning rate $\eta_{+}^{-1}=L$ is equal to the smoothness (top Hessian eigenvalue) of the whole problem, whereas $\eta_{-}^{-1}=\kappa_{-}\mu$ is chosen based on the smoothness over the lower spectrum.
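The rough sketch below mirrors the flavor of Figure 1 on a synthetic bimodal instance (our own parameters, not the exact experimental configuration), comparing plain gradient descent against Theorem 1's schedule; it reuses the `cyclical_lr` sketch given after Definition 1.

```python
import numpy as np

# Sketch in the spirit of Figure 1: bimodal spectrum with clusters [1, 2] and [kmax/2, kmax],
# so the local condition numbers are roughly kappa_+ = kappa_- = 2.
rng = np.random.default_rng(0)
p, r, kmax = 100, 10, 1000.0
lam = np.sort(np.concatenate([rng.uniform(kmax / 2, kmax, r),
                              rng.uniform(1.0, 2.0, p - r)]))[::-1]   # decreasing order
H = np.diag(lam)                             # work in the eigenbasis, so X^T X = H
theta_star = rng.standard_normal(p)

L, mu = lam[0], lam[-1]
kappa, kappa_plus, kappa_minus = L / mu, L / lam[r - 1], lam[r] / mu
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)               # Theorem 1's choices
T = int(np.ceil(kappa_plus * np.log(2 * kappa / (2 * kappa_minus - 1)))) + 1

def run(eta_fn, iters):
    theta = np.zeros(p)
    for t in range(iters):
        theta = theta - eta_fn(t) * H @ (theta - theta_star)          # gradient step
    return np.linalg.norm(theta - theta_star)

iters = 20 * T
print("plain GD error    :", run(lambda t: eta_plus, iters))
print("Unstable CLR error:", run(lambda t: cyclical_lr(t, T, eta_plus, eta_minus), iters))
```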

Bimodal Hessian. The bimodal Hessian spectrum is crucial for enabling super-fast convergence. In essence, with the bimodal structure, Unstable CLR acts as a preconditioner for the problem and helps not only the large eigenvalues/directions but also the small ones. It is possible that other cyclical schemes will accelerate a broader class of Hessian spectra. That said, we briefly mention that the bimodal spectrum has empirical support in the deep learning literature. Several works studied the empirical Hessians (and Jacobians) of deep neural networks [13, 14, 15]. A common observation is that the Hessian spectrum has relatively few large eigenvalues and many more smaller eigenvalues [13, 15]. These observations are closely related to the seminal spiked covariance model [16] where the covariance spectrum has a few large eigenvalues and many more smaller eigenvalues. In connection, [14, 17, 18] studied the Jacobian spectrum. Note that, in the linear regression setting of Theorem 1, the Jacobian is simply $\bm{X}$. They similarly found that practical deep nets have a bimodal spectrum and that the Jacobian is approximately low-rank. It is also known that the behavior of wide deep nets (i.e., many neurons per layer) can be approximated by their Jacobian-based linearization at initialization via the neural tangent kernel [19, 20]. Thus the spectrum indeed closely governs the optimization dynamics [21], similar to our simple linear regression setup. While the larger eigen-directions of a deep net can be optimized quickly [14, 18], intuitively their existence will slow down the small eigen-directions. However, it is also observed that learning such small eigen-directions is critical for the success of deep learning, since typically achieving zero training loss (aka interpolation) leads to the best generalization performance [22, 23, 24]. In light of this discussion, the related empirical observations of [3] and our theory, it is indeed plausible that Unstable CLR does a stellar job at learning these small eigen-directions, leading to much faster interpolation and improved generalization.

II Extension to Nonlinear Problems

To conclude our discussion, we next provide a more general result that applies to strongly convex functions. Recall that $\mathcal{S}^{c}$ denotes the complement of a subspace $\mathcal{S}$ and let $\bm{\Pi}_{\mathcal{S}}$ denote the matrix that projects onto $\mathcal{S}$. Our goal is solving $\bm{\theta}_{\star}=\arg\min_{\bm{\theta}\in\mathbb{R}^{p}}f(\bm{\theta})$ via the gradient iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})$.

Definition 2 (Smoothness and strong-convexity)

Let $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a smooth convex function and fix $L>\mu>0$. $f$ is $L$-smooth and $\mu$-strongly convex if its Hessian satisfies the following for all $\bm{\theta}\in\mathbb{R}^{p}$

L\bm{I}_{p}\succeq\nabla^{2}f(\bm{\theta})\succeq\mu\bm{I}_{p}.

First, we recall a classical convergence result for strongly convex functions for the sake of completeness.

Proposition 1

Let $f$ obey Definition 2 and suppose $\bm{\theta}_{\star}$ is its minimizer. Pick a learning rate $\eta\leq 1/L$ and use the iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla f(\bm{\theta}_{t})$. The iterates obey $\|\bm{\theta}_{\tau}-\bm{\theta}_{\star}\|_{\ell_{2}}^{2}\leq(1-\eta\mu)^{\tau}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}^{2}$.

This is the usual setup for gradient descent analysis. Instead, we will employ the bimodal Hessian definition.

Definition 3 (Bimodal Hessian)

Let $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be an $L$-smooth, $\mu$-strongly convex function. Additionally, suppose there exist a subspace $\mathcal{S}\subseteq\mathbb{R}^{p}$, local condition numbers $\kappa_{+},\kappa_{-}\geq 1$ and a cross-smoothness $\varepsilon\geq 0$ such that the Hessian of $f$ satisfies, for all $\bm{\theta}\in\mathbb{R}^{p}$,

\text{Upper spectrum:}\quad L\bm{I}_{p}\succeq\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}}\succeq(L/\kappa_{+})\bm{I}_{p}, \qquad (3)
\text{Lower spectrum:}\quad\kappa_{-}\mu\bm{I}_{p}\succeq\bm{\Pi}_{\mathcal{S}^{c}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}\succeq\mu\bm{I}_{p},
\text{Cross spectrum:}\quad\|\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}\|\leq\varepsilon.

Here $\kappa_{+},\kappa_{-}$ are the local condition numbers as previously. Observe that both of these are upper bounded by the global condition number $\kappa=L/\mu$, as $f$ obeys Def. 2 as well. The cross-smoothness controls the interaction between the two subspaces. For linear regression, $\varepsilon=0$ by picking $\mathcal{S}$ to be an eigenspace of $\bm{X}^{\top}\bm{X}$. For general nonlinear models, as long as the problem can be approximated linearly (e.g. wide deep nets can be approximated by their linearization [19, 20]), it is plausible that the cross-smoothness is small for a suitable choice of $\mathcal{S}$. We have the following analogue of Theorem 1.
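Given a Hessian and a candidate subspace, the constants of Definition 3 can be measured numerically. The helper below is a sketch of our own (its name and interface are illustrative, not part of this letter's experiments).

```python
import numpy as np

# Sketch: measure the constants of Definition 3 for a symmetric Hessian H and
# an orthonormal basis S (p x r) of the candidate subspace.
def bimodal_constants(H, S):
    p, r = H.shape[0], S.shape[1]
    P = S @ S.T                                  # projector onto the subspace
    Pc = np.eye(p) - P                           # projector onto the complement
    Sc = np.linalg.svd(Pc)[0][:, : p - r]        # orthonormal basis of the complement
    upper = np.linalg.eigvalsh(S.T @ H @ S)      # Hessian spectrum restricted to S
    lower = np.linalg.eigvalsh(Sc.T @ H @ Sc)    # Hessian spectrum restricted to S^c
    kappa_plus = upper[-1] / upper[0]            # local condition number, upper cluster
    kappa_minus = lower[-1] / lower[0]           # local condition number, lower cluster
    eps = np.linalg.norm(P @ H @ Pc, 2)          # cross-smoothness
    return kappa_plus, kappa_minus, eps
```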

Theorem 2 (Nonlinear problems)

Let $f$ obey Definition 3 with non-negative scalars $L,\mu,\kappa_{+},\kappa_{-},\varepsilon$. Consider the learning rate schedule of Definition 1 and set

T\geq 2\kappa_{+}\log\left(\frac{2L}{\kappa_{-}\mu}\right)+1,\quad\eta_{-}=\frac{1}{\kappa_{-}\mu},\quad\eta_{+}=\frac{1}{L}.

Additionally, suppose the cross-smoothness $\varepsilon$ satisfies the following upper bound

4\varepsilon\leq\min(1,\kappa_{-}/T)\,\mu.

Let $\bm{\theta}_{\star}$ be the minimizer of $f(\bm{\theta})$. Starting from $\bm{\theta}_{0}$, apply the gradient iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})$. For all iterations $t\geq 0$ with $\mathrm{mod}(t,T)=0$, we have

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\sqrt{2}\left(1-\frac{1}{4\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}.

Interpretation. This result is in a similar spirit to the linear regression setup of Theorem 1. The required number of iterations is still governed by the quantity (2). A key difference is the cross-smoothness $\varepsilon$, which controls the cross-Hessian matrix $\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}$. This term was simply equal to $0$ for the linear problem. Our condition for nonlinear problems essentially requires the cross-Hessian to be dominated by the Hessian over the lower spectrum, i.e. $\bm{\Pi}_{\mathcal{S}^{c}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}$. In particular, $\varepsilon$ should be upper bounded by the strong convexity parameter $\mu$ as well as by $\kappa_{-}\mu/T$, where $\kappa_{-}\mu$ is the smoothness over the lower spectrum. It would be interesting to explore to what extent the conditions on the cross-Hessian can be relaxed. However, as mentioned earlier, the fact that wide artificial neural networks behave close to linear models [19] provides a decent justification for a small cross-Hessian.
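As a sanity check, the toy construction below (our own, not from the letter's experiments) builds a strongly convex nonlinear $f$ whose Hessian remains bimodal with zero cross-smoothness, and runs the gradient iterations with Theorem 2's schedule.

```python
import numpy as np

# Toy nonlinear example: f(theta) = 0.5*sum(lam*d^2) + delta*sum(log cosh(d)), d = theta - theta_star.
# Its Hessian is diagonal with entries in [lam_i, lam_i + delta], so the bimodal structure persists
# and the cross-smoothness is zero for the coordinate-aligned subspace.
rng = np.random.default_rng(1)
p, r, kmax, delta = 60, 6, 1000.0, 0.5
lam = np.concatenate([np.linspace(kmax, kmax / 2, r), np.linspace(2.0, 1.0, p - r)])
theta_star = rng.standard_normal(p)

def grad(theta):
    d = theta - theta_star
    return lam * d + delta * np.tanh(d)

# Conservative constants of Definition 3 for this construction.
L, mu = kmax + delta, 1.0
kappa_plus, kappa_minus = (kmax + delta) / (kmax / 2), 2.0 + delta
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)
T = int(np.ceil(2 * kappa_plus * np.log(2 * L / (kappa_minus * mu)))) + 1

theta = np.zeros(p)
for t in range(30 * T):
    eta = eta_minus if t % T == T - 1 else eta_plus
    theta = theta - eta * grad(theta)
print("distance to the minimizer:", np.linalg.norm(theta - theta_star))
```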

III Proofs

III-A Proof of Theorem 1

Proof:

Following Theorem 1’s statement, introduce $L=\lambda_{1}$, $\mu=\lambda_{p}$. Each gradient iteration can be written in terms of the residual $\bm{w}_{t}=\bm{\theta}_{t}-\bm{\theta}_{\star}$ and has the following form

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\bm{X}^{\top}(\bm{X}\bm{\theta}_{t}-\bm{y})
\implies\bm{w}_{t+1}:=\bm{\theta}_{t+1}-\bm{\theta}_{\star}=(\bm{I}-\eta_{t}\bm{X}^{\top}\bm{X})\bm{w}_{t}.

Let $\mathcal{S}$ be the principal eigenspace induced by the first $r$ eigenvectors and $\mathcal{S}^{c}$ be its complement. Let $\bm{\Pi}_{\mathcal{S}}$ be the projection operator onto $\mathcal{S}$. Set $\bm{a}_{t}=\bm{\Pi}_{\mathcal{S}}(\bm{w}_{t})$, $\bm{b}_{t}=\bm{\Pi}_{\mathcal{S}^{c}}(\bm{w}_{t})$. Also set $a_{t}=\|\bm{a}_{t}\|_{\ell_{2}}$, $b_{t}=\|\bm{b}_{t}\|_{\ell_{2}}$. Since the learning rate is periodic, we analyze a single period starting from $\bm{a}_{0},\bm{b}_{0}$. During the first $T-1$ iterations (where $\eta_{t}=\eta_{+}=1/L$), we have

a_{t}\leq\left(1-\frac{1}{\kappa_{+}}\right)^{t}a_{0}\quad\text{and}\quad b_{t}\leq\left(1-\frac{\mu}{L}\right)^{t}b_{0}=\left(1-\frac{1}{\kappa}\right)^{t}b_{0}.

At the final (unstably large) iteration $t=T-1$, we have

a_{T}\leq\frac{L}{\kappa_{-}\mu}a_{T-1}=\frac{\kappa}{\kappa_{-}}a_{T-1}\quad\text{and}\quad b_{T}\leq\left(1-\frac{1}{\kappa_{-}}\right)b_{T-1}.

Now, the $b_{t}$ term is clearly non-increasing and obeys $b_{T}\leq(1-\frac{1}{\kappa_{-}})(1-\frac{1}{\kappa})^{T-1}b_{0}\leq(1-\frac{1}{\kappa_{-}})b_{0}$. Note that we forgo the $(1-\frac{1}{\kappa})^{T-1}$ factor for the sake of simplicity. We wish to make the growth of $a_{T}$ similarly small by enforcing

\frac{\kappa}{\kappa_{-}}\left(1-\frac{1}{\kappa_{+}}\right)^{T-1}\leq 1-\frac{1}{2\kappa_{-}}\iff\frac{2\kappa}{2\kappa_{-}-1}\leq\left(\left(1-\frac{1}{\kappa_{+}}\right)^{-1}\right)^{T-1}.

Observe that $(1-\frac{1}{\kappa_{+}})^{-1}\geq\mathrm{e}^{1/\kappa_{+}}$. Thus, we simply need

\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)\leq\frac{T-1}{\kappa_{+}}\iff T\geq\kappa_{+}\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)+1.

In short, at the end of a single period both $a_{T}\leq(1-\frac{1}{2\kappa_{-}})a_{0}$ and $b_{T}\leq(1-\frac{1}{2\kappa_{-}})b_{0}$ hold; since $\|\bm{w}_{t}\|_{\ell_{2}}^{2}=a_{t}^{2}+b_{t}^{2}$, this yields (1). ∎
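The per-period contraction established above is easy to check numerically. The sketch below (our own check, using a synthetic bimodal spectrum) multiplies the $T-1$ stable factors and the one unstable factor for every eigen-direction and compares the worst case against $1-\frac{1}{2\kappa_{-}}$.

```python
import numpy as np

# Numerical check of the single-period contraction in the proof of Theorem 1.
lam = np.concatenate([np.linspace(1000.0, 500.0, 10), np.linspace(2.0, 1.0, 90)])  # decreasing
r = 10
L, mu = lam[0], lam[-1]
kappa, kappa_plus, kappa_minus = L / mu, L / lam[r - 1], lam[r] / mu
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)
T = int(np.ceil(kappa_plus * np.log(2 * kappa / (2 * kappa_minus - 1)))) + 1

# One period = (T - 1) stable steps followed by one unstable step, per eigen-direction.
per_period = np.abs(1 - eta_minus * lam) * np.abs(1 - eta_plus * lam) ** (T - 1)
print("worst per-period factor:", per_period.max(), "<=", 1 - 1 / (2 * kappa_minus))
```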

III-B Proof of Theorem 2

Proof:

The proof idea is the same as Theorem 1, but we additionally control the cross-smoothness terms. Denote the residual by $\bm{w}_{t}=\bm{\theta}_{t}-\bm{\theta}_{\star}$. We will work with the projections of the residual onto the subspaces $\mathcal{S},\mathcal{S}^{c}$, denoted by $\bm{a}_{t},\bm{b}_{t}$ respectively. Additionally, set $B=\max(\|\bm{a}_{0}\|_{\ell_{2}},\|\bm{b}_{0}\|_{\ell_{2}})$ and set $a_{t}=\|\bm{a}_{t}\|_{\ell_{2}}/B$, $b_{t}=\|\bm{b}_{t}\|_{\ell_{2}}/B$. Observe that by this definition $B\leq\|\bm{\theta}_{\star}-\bm{\theta}_{0}\|_{\ell_{2}}$ and

\max(a_{0},b_{0})=1.

We will prove that, at the end of a single period, the $(a_{T},b_{T})$ pair obeys

\max(a_{T},b_{T})\leq 1-\frac{1}{4\kappa_{-}}. \qquad (4)

The overall result follows from this by induction: applying (4) over consecutive periods, we achieve $\max(a_{t},b_{t})\leq(1-\frac{1}{4\kappa_{-}})^{t/T}$ for $\mathrm{mod}(t,T)=0$. That would in turn yield

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\sqrt{2}\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}})\leq\sqrt{2}B\max(a_{t},b_{t})\leq\sqrt{2}\left(1-\frac{1}{4\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{\star}-\bm{\theta}_{0}\|_{\ell_{2}}.

Establishing (4): Let us now show (4) for a single period of the learning rate, i.e. $0\leq t\leq T$. The gradient is given by

\nabla f(\bm{\theta}_{t})=\bm{H}\bm{w}_{t},

where $\bm{H}$ is obtained by integrating the Hessian along the path from $\bm{\theta}_{\star}$ to $\bm{\theta}_{t}$. Write the next iterate as

\|\bm{w}_{t+1}\|_{\ell_{2}}^{2}=\|\bm{w}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})\|_{\ell_{2}}^{2}=\|\bm{w}_{t}-\eta_{t}\bm{H}\bm{w}_{t}\|_{\ell_{2}}^{2}.

By Definition 3 and the triangle inequality, $\bm{H}$ satisfies the bimodal Hessian inequalities described in (3). To analyze a single period, we first focus on the smaller learning rate $\eta_{+}$, which spans the initial $T-1$ iterations. Hence, suppose $\eta_{t}=\eta_{+}=1/L$. Note that, from the strong convexity of $\bm{H}$ over $\mathcal{S},\mathcal{S}^{c}$, we have that

\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}^{2}\leq(1-\eta_{t}L/\kappa_{+})\|\bm{a}_{t}\|_{\ell_{2}}^{2}\quad\text{for}\quad\eta_{t}\leq 1/L,
\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}^{2}\leq(1-\eta_{t}\mu)\|\bm{b}_{t}\|_{\ell_{2}}^{2}\quad\text{for}\quad\eta_{t}\leq 1/(\kappa_{-}\mu).

Applying this with $\eta_{t}=\eta_{+}=1/L$ to $\bm{a}_{t}$, we find the recursion

\|\bm{a}_{t+1}\|_{\ell_{2}}=\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{w}_{t}\|_{\ell_{2}}
\leq\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}+\eta_{t}\|\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}
\leq\left(1-\frac{\eta_{t}L}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\eta_{t}\varepsilon\|\bm{b}_{t}\|_{\ell_{2}}
=\left(1-\frac{1}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\frac{\varepsilon}{L}\|\bm{b}_{t}\|_{\ell_{2}}. \qquad (5)

Using the identical argument for $\bm{b}_{t}$ yields

\|\bm{b}_{t+1}\|_{\ell_{2}}=\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{w}_{t}\|_{\ell_{2}}
\leq\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}+\eta_{t}\|\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}
\leq\left(1-\frac{\mu}{2L}\right)\|\bm{b}_{t}\|_{\ell_{2}}+\frac{\varepsilon}{L}\|\bm{a}_{t}\|_{\ell_{2}}.

Setting $\bar{\varepsilon}=\frac{\varepsilon}{L}$ and dividing both sides by $B$, we obtain

a_{t+1}\leq\left(1-\frac{1}{2\kappa_{+}}\right)a_{t}+\bar{\varepsilon}b_{t}. \qquad (6)

Additionally, recall that, since $f$ is $\mu$-strongly convex, we have $L/\kappa_{+}\geq\mu$. Thus, using $L/\kappa_{+}\geq\mu\geq 2\varepsilon$, we have

\|\bm{b}_{t+1}\|_{\ell_{2}}\leq\left(1-\frac{\mu}{2L}\right)\|\bm{b}_{t}\|_{\ell_{2}}+\bar{\varepsilon}\|\bm{a}_{t}\|_{\ell_{2}}\leq\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}}),
\|\bm{a}_{t+1}\|_{\ell_{2}}\leq\left(1-\frac{1}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\bar{\varepsilon}\|\bm{b}_{t}\|_{\ell_{2}}\leq\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}}).

That is, we are guaranteed to have $1\geq a_{t},b_{t}\geq 0$ throughout. Thus, using (6) recursively, for all $0\leq t\leq T-1$, $a_{t}$ satisfies

a_{t}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+\bar{\varepsilon}\sum_{\tau=0}^{t-1}\left(1-\frac{1}{2\kappa_{+}}\right)^{t-\tau-1}b_{\tau}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+\bar{\varepsilon}\sum_{\tau=0}^{t-1}b_{\tau}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+t\bar{\varepsilon}. \qquad (7)

At time $t=T-1$, we use the larger learning rate $\eta_{-}=\frac{1}{\kappa_{-}\mu}$. The following holds via an argument identical to (5):

a_{t+1}\leq\frac{L}{\kappa_{-}\mu}a_{t}+\frac{\varepsilon}{\kappa_{-}\mu}b_{t}, \qquad (8)
b_{t+1}\leq\left(1-\frac{1}{2\kappa_{-}}\right)b_{t}+\frac{\varepsilon}{\kappa_{-}\mu}a_{t}.

To bound $b_{T}$, we recall $b_{T-1},a_{T-1}\leq 1$ and the bound $\varepsilon\leq\mu/4$ to obtain

b_{T}\leq 1-\frac{1}{2\kappa_{-}}+\frac{\varepsilon}{\kappa_{-}\mu}\leq 1-\frac{1}{4\kappa_{-}}.

Bounding $a_{T}$ is what remains. Noticing $\log(1-x)\leq-x$, our choice of the period $T$ obeys

T\geq 2\kappa_{+}\log(2L/(\kappa_{-}\mu))+1\geq 1-\frac{\log(2L/(\kappa_{-}\mu))}{\log\left(1-\frac{1}{2\kappa_{+}}\right)}

for $\kappa_{+}\geq 1$. Thus, using the assumption $\varepsilon\leq\frac{\kappa_{-}\mu}{4T}$ and the bounds (7) and (8), the bound on $a_{T}$ is obtained via

a_{T}\leq\frac{L}{\kappa_{-}\mu}\left(\left(1-\frac{1}{2\kappa_{+}}\right)^{T-1}+(T-1)\frac{\varepsilon}{L}\right)+\frac{\varepsilon}{\kappa_{-}\mu}
\leq\frac{1}{2}+T\frac{\varepsilon}{\kappa_{-}\mu}\leq\frac{3}{4}\leq 1-\frac{1}{4\kappa_{-}},

where we noted $\kappa_{-}\geq 1$. The above bounds on $a_{T},b_{T}$ establish (4) and conclude the proof. ∎

IV Conclusions

This letter introduced a setting where remarkably fast convergence can be attained by using a cyclical learning rate schedule that carefully targets the bimodal structure of the Hessian of the problem. This is accomplished by replacing the global condition number with local counterparts. While a bimodal Hessian seems to be a strong assumption, recent literature provides rich empirical justification for our theory. Besides relaxing the assumptions in Theorem 2, there are multiple interesting directions for Unstable CLR. The results can likely be extended to minibatch SGD, overparameterized settings, or to non-convex problems (e.g. via the Polyak-Lojasiewicz condition [25]). While our attention was restricted to a bimodal Hessian and a CLR scheme with two values ($\eta_{+},\eta_{-}$ as in Def. 1), it would be exciting to explore whether more sophisticated CLR schemes can provably accelerate optimization for a richer class of Hessian structures. On the practical side, further empirical investigation (beyond [3, 12]) can help verify and explore the potential benefits of large cyclical learning rates.

References

  • [1] L. N. Smith, “Cyclical learning rates for training neural networks,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on.   IEEE, 2017, pp. 464–472.
  • [2] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
  • [3] L. N. Smith and N. Topin, “Super-convergence: Very fast training of residual networks using large learning rates,” arXiv preprint arXiv:1708.07120, 2017.
  • [4] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating KL vanishing,” arXiv preprint arXiv:1903.10145, 2019.
  • [5] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018.   Association For Uncertainty in Artificial Intelligence (AUAI), 2018, pp. 876–885.
  • [6] X. Li, Z. Zhuang, and F. Orabona, “Exponential step sizes for non-convex optimization,” arXiv preprint arXiv:2002.05273, 2020.
  • [7] K. Zhang, A. Koppel, H. Zhu, and T. Basar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM Journal on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, 2020.
  • [8] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann, “Escaping saddles with stochastic gradients,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1155–1164.
  • [9] G. Leclerc and A. Madry, “The two regimes of deep network training,” arXiv preprint arXiv:2002.10376, 2020.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
  • [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [12] J. M. Cohen, S. Kaur, Y. Li, Z. Kolter, and A. Talwalkar, “Gradient descent on neural networks typically occurs at the edge of stability,” to appear at ICLR, 2021.
  • [13] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, “Empirical analysis of the Hessian of over-parametrized neural networks,” arXiv preprint arXiv:1706.04454, 2017.
  • [14] S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi, “Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian,” arXiv preprint arXiv:1906.05392, 2019.
  • [15] X. Li, Q. Gu, Y. Zhou, T. Chen, and A. Banerjee, “Hessian-based analysis of SGD for deep nets: Dynamics and generalization,” in Proceedings of the 2020 SIAM International Conference on Data Mining.   SIAM, 2020, pp. 190–198.
  • [16] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, pp. 1617–1642, 2007.
  • [17] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 4313–4324.
  • [18] D. Kopitkov and V. Indelman, “Neural spectrum alignment: Empirical study,” in International Conference on Artificial Neural Networks.   Springer, 2020, pp. 168–179.
  • [19] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” Conference on Neural Information Processing Systems, 2018.
  • [20] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” arXiv preprint arXiv:1902.06720, 2019.
  • [21] G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” arXiv preprint arXiv:1812.04754, 2018.
  • [22] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019.
  • [23] M. Belkin, D. Hsu, and P. Mitra, “Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161, 2018.
  • [24] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar, “Theory of deep learning iii: explaining the non-overfitting puzzle,” arXiv preprint arXiv:1801.00173, 2017.
  • [25] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2016, pp. 795–811.