
Provable Super-Convergence with a Large Cyclical Learning Rate

Samet Oymak
Submitted on April 6, 2021. This work was supported in part by the NSF under CNS grant 1932254 and the CAREER award 2046816.
Abstract

Conventional wisdom dictates that the learning rate should be in the stable regime so that gradient-based algorithms don’t blow up. This letter introduces a simple scenario where an unstably large learning rate scheme leads to super-fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate (CLR) where we periodically take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019], who show that CLR with a large maximum learning rate can dramatically accelerate learning and lead to so-called “super-convergence”. We prove that our scheme excels in problems where the Hessian exhibits a bimodal spectrum and the eigenvalues can be grouped into two clusters (small and large). The unstably large step is the key to enabling fast convergence over the small eigen-spectrum.

Index Terms:
Convergence of numerical methods, Iterative algorithms, Gradient methods

I Introduction

Consider a least-squares problem with design matrix $\bm{X}\in\mathbb{R}^{n\times p}$ and labels $\bm{y}\in\mathbb{R}^{n}$. We wish to solve for

\bm{\theta}_{\star}=\arg\min_{\bm{\theta}\in\mathbb{R}^{p}}\frac{1}{2}\|\bm{y}-\bm{X}\bm{\theta}\|_{\ell_{2}}^{2}.

If we use a gradient-based algorithm, the rate of convergence depends on the condition number $\kappa$ of $\bm{X}$. Here $\kappa=L/\mu$, where the smoothness $L$ and the strong convexity $\mu$ of the problem are given by the maximum and minimum eigenvalues of the Hessian matrix $\bm{X}^{\top}\bm{X}$ as follows

L=\|\bm{X}^{\top}\bm{X}\|,\quad\mu=\sigma_{\min}(\bm{X}^{\top}\bm{X}).

Here, $\sigma_{\min}(\cdot)$ and $\|\cdot\|$ denote the smallest singular value and the spectral norm (largest singular value) of a matrix, respectively. Standard gradient descent (GD) requires $\kappa\log(\varepsilon^{-1})$ iterations to achieve $\varepsilon$-accuracy. Nesterov’s acceleration improves this to $\sqrt{\kappa}\log(\varepsilon^{-1})$. In general, consider the iterations

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\bm{X}^{\top}(\bm{X}\bm{\theta}_{t}-\bm{y}).

Here, the contraction matrix $\bm{C}_{t}=\bm{I}-\eta_{t}\bm{X}^{\top}\bm{X}$ governs the rate of convergence. Over the $i$th eigen-direction of $\bm{X}^{\top}\bm{X}$ with eigenvalue $\lambda_{i}$, the convergence/contraction rate is given by $1-\eta_{t}\lambda_{i}$. With a fixed stable learning rate of $\eta_{t}=1/L$, the issue is that gradient descent optimizes the small eigen-directions much more slowly than the large eigen-directions. Faster learning over the small eigen-directions can be facilitated by a large learning rate so that $1-\eta_{t}\lambda_{i}$ is closer to $0$ even for small $\lambda_{i}$’s. However, this would lead to instability, i.e. $\|\bm{C}_{t}\|>1$.
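To make these contraction rates concrete, the following minimal sketch (a synthetic least-squares instance of our own, not one of the experiments of this letter) runs the iteration above with the stable step $\eta_{t}=1/L$ and prints the per-eigendirection factors $1-\eta_{t}\lambda_{i}$.

```python
import numpy as np

# Sketch: gradient descent on 0.5*||y - X theta||^2 with the stable step eta = 1/L.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
y = X @ theta_star

lam = np.linalg.eigvalsh(X.T @ X)          # eigenvalues of the Hessian, ascending
L, mu = lam[-1], lam[0]
eta = 1.0 / L                              # stable learning rate

theta = np.zeros(p)
for _ in range(500):
    theta = theta - eta * X.T @ (X @ theta - y)   # gradient step

print("contraction over the largest eigen-direction :", 1 - eta * L)   # close to 0 (fast)
print("contraction over the smallest eigen-direction:", 1 - eta * mu)  # close to 1 (slow)
print("distance to the solution:", np.linalg.norm(theta - theta_star))
```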

Here, we point out that one can use an unstably large learning rate once in a while to provide huge improvements over the small directions. The resulting instability can be compensated quickly by following this unstable step with multiple stable steps which keep the larger directions under control. Overall, for problems with a bimodal Hessian spectrum, where eigenvalues are clustered in large and small groups, this leads to substantial improvements with only a logarithmic dependency on the condition number. Figure 1 highlights this phenomenon. Our approach can be formalized by using a cyclical (aka periodic) learning rate schedule [1, 2]. We remark that the cyclical learning rate (CLR) is related to SGD with Restarts (SGDR) and Stochastic Weight Averaging, which attracted significant attention due to their fast convergence, generalization benefits and flexibility [3, 4, 5]. [6] assesses certain theoretical benefits of cosine learning rates, which are periodic. [7, 8] investigate periodic learning rates to facilitate escape from saddle points with stochastic gradients. Large initial learning rates are also known to play a critical role in generalization [9, 10, 11]. Recent work [12] provides further empirical evidence that practical learning rates can be at the edge of stability. However, to the best of our knowledge, prior works do not consider the potential theoretical benefits of an unstably large learning rate choice. The closest work is by Smith and Topin [3], where the authors observe that CLR can operate with very large maximum learning rates and converge “super fast”. We believe this letter provides rigorous theoretical support for such observations. We show that the maximum learning rate can in fact be unstable and can be much larger than the maximum stable learning rate. We also show that “super fast” convergence can be as fast as logarithmic in the condition number (which is drastically better than GD or Nesterov AGD) for suitable problems (discussed further below).

Our CLR scheme simply takes two values $\eta_{\pm}$ as described below.

Definition 1

Fix an integer $T>1$ and positive scalars $\eta_{+},\eta_{-}$. Set the periodic learning rate $\eta_{t}$ for $t\geq 0$ as

\eta_{t}=\begin{cases}\eta_{-}&\text{if }\mathrm{mod}(t,T)=T-1\\ \eta_{+}&\text{else}\end{cases}.
[Figure 1: four panels, (a) $\kappa=10$, (b) $\kappa=100$, (c) $\kappa=1000$, (d) $\kappa=10000$.]
Figure 1: Convergence performance of gradient descent with $\eta_{t}=1/L$, Nesterov’s acceleration (AGD) and the Unstable Cyclical Learning Rate of Theorem 1 on a linear regression task. In Figures (a) to (d), we vary the condition number $\kappa$ from $10$ to $10^{4}$. In these experiments, the eigen-spectrum of the Hessian is bimodal, with eigenvalues lying over the intervals $[1,2]$ and $[\kappa/2,\kappa]$. This means the local condition numbers are $\kappa_{+}=\kappa_{-}=2$. As the global condition number grows, Unstable CLR outperforms standard gradient descent and Nesterov AGD as it only requires a number of iterations logarithmic in $\kappa$. We note that Unstable CLR can potentially further benefit from acceleration.

We call this schedule Unstable CLR if $\eta_{-}>2/L$ where $L$ is the smoothness of the problem. Indeed, when this happens, the algorithm is susceptible to blowing up in certain eigen-directions as the contraction matrix $\bm{I}-\eta_{-}\bm{X}^{\top}\bm{X}$ has operator norm larger than $1$.
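For concreteness, here is a minimal sketch of the schedule in Definition 1 (the function name is ours and purely illustrative); given a period $T$ and the two step sizes, it returns $\eta_{-}$ once per period and $\eta_{+}$ otherwise.

```python
# Sketch of the two-value cyclical schedule of Definition 1.
def cyclical_lr(t: int, T: int, eta_plus: float, eta_minus: float) -> float:
    # One unstable step at the end of each period of length T, stable steps otherwise.
    return eta_minus if t % T == T - 1 else eta_plus
```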

Theorem 1 (Linear regression)

Let $\bm{\lambda}\in\mathbb{R}_{+}^{p}$ be the decreasingly sorted eigenvalues of $\bm{X}^{\top}\bm{X}$ with $L=\lambda_{1}$ and $\mu=\lambda_{p}$. Fix some integer $r$ with $p>r\geq 1$. Introduce the quantities

\kappa=\frac{L}{\mu},\quad\kappa_{+}=\frac{L}{\lambda_{r}},\quad\kappa_{-}=\frac{\lambda_{r+1}}{\mu}.

Set the period $T\geq\kappa_{+}\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)+1$ and the learning rate according to Def. 1 with $\eta_{+}=\frac{1}{L}$ and $\eta_{-}=\frac{1}{\kappa_{-}\mu}$. For all $t$ with $\mathrm{mod}(t,T)=0$, the iterations obey

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\left(1-\frac{1}{2\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}. \qquad (1)

Alternatively, for $t\geq 2T\kappa_{-}\log(\varepsilon^{-1})$, we have $\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\varepsilon\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}$ for any $0<\varepsilon<1$.

Interpretation. Observe that, in order to achieve $\varepsilon$-accuracy, the number of required iterations grows as

\boxed{\kappa_{+}\kappa_{-}\log(\kappa)\log(\varepsilon^{-1}).} \qquad (2)

Here $\kappa_{+},\kappa_{-}$ are the local condition numbers of the eigen-spectrum: $\kappa_{+}$ is the condition number over the subspace $\mathcal{S}$ (i.e. the eigen-directions associated with $\lambda_{1}$ through $\lambda_{r}$) and $\kappa_{-}$ is the condition number over the orthogonal complement $\mathcal{S}^{c}$. Observe that $\kappa_{+}\times\kappa_{-}$ is always upper bounded by the overall condition number $\kappa$, i.e. $\kappa_{+}\kappa_{-}\leq\kappa$. However, if the eigenvalues over $\mathcal{S}$ and $\mathcal{S}^{c}$ are narrowly clustered, we can have $\kappa_{+}\kappa_{-}\ll\kappa$, leading to much faster convergence. Finally, observe that the dependence on the (global) condition number $\kappa$ is only logarithmic. Thus, in the worst-case scenario of $\kappa_{+}\kappa_{-}=\kappa$, the iteration complexity (2) is sub-optimal by a $\log(\kappa)$-factor compared to standard gradient descent, which requires $\kappa\log(\varepsilon^{-1})$ iterations. However, there is a factor of $\kappa/\log(\kappa)$ improvement when the eigenvalues are clustered and $\kappa_{+},\kappa_{-}$ are small. Figure 1 demonstrates the comparisons to gradient descent (with $\eta=1/L$) and Nesterov’s Accelerated GD. In these examples, we set the local condition numbers to $\kappa_{-}=\kappa_{+}=2$ whereas $\kappa$ varies from $10$ to $10000$. As $\kappa$ grows larger, Unstable CLR shines over the alternatives as the iteration count only grows as $\log(\kappa)$. Finally, note that the inverse learning rate $\eta_{+}^{-1}=L$ is equal to the smoothness (top Hessian eigenvalue) of the whole problem, whereas $\eta_{-}^{-1}=\kappa_{-}\mu$ is chosen based on the smoothness over the lower spectrum.
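The rough sketch below mirrors the flavor of Figure 1 on a synthetic bimodal instance (our own parameters, not the exact experimental configuration), comparing plain gradient descent against Theorem 1's schedule; it reuses the `cyclical_lr` sketch given after Definition 1.

```python
import numpy as np

# Sketch in the spirit of Figure 1: bimodal spectrum with clusters [1, 2] and [kmax/2, kmax],
# so the local condition numbers are roughly kappa_+ = kappa_- = 2.
rng = np.random.default_rng(0)
p, r, kmax = 100, 10, 1000.0
lam = np.sort(np.concatenate([rng.uniform(kmax / 2, kmax, r),
                              rng.uniform(1.0, 2.0, p - r)]))[::-1]   # decreasing order
H = np.diag(lam)                             # work in the eigenbasis, so X^T X = H
theta_star = rng.standard_normal(p)

L, mu = lam[0], lam[-1]
kappa, kappa_plus, kappa_minus = L / mu, L / lam[r - 1], lam[r] / mu
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)               # Theorem 1's choices
T = int(np.ceil(kappa_plus * np.log(2 * kappa / (2 * kappa_minus - 1)))) + 1

def run(eta_fn, iters):
    theta = np.zeros(p)
    for t in range(iters):
        theta = theta - eta_fn(t) * H @ (theta - theta_star)          # gradient step
    return np.linalg.norm(theta - theta_star)

iters = 20 * T
print("plain GD error    :", run(lambda t: eta_plus, iters))
print("Unstable CLR error:", run(lambda t: cyclical_lr(t, T, eta_plus, eta_minus), iters))
```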

Bimodal Hessian. The bimodal Hessian spectrum is crucial for enabling super-fast convergence. In essence, with the bimodal structure, Unstable CLR acts as a preconditioner for the problem and helps not only the large eigenvalues/directions but also the small ones. It is possible that other cyclical schemes will accelerate a broader class of Hessian spectra. That said, we briefly mention that the bimodal spectrum has empirical support in the deep learning literature. Several works studied the empirical Hessians (and Jacobians) of deep neural networks [13, 14, 15]. A common observation is that the Hessian spectrum has relatively few large eigenvalues and many more smaller eigenvalues [13, 15]. These observations are closely related to the seminal spiked covariance model [16] where the covariance spectrum has a few large eigenvalues and many more smaller eigenvalues. In connection, [14, 17, 18] studied the Jacobian spectrum. Note that, in the linear regression setting of Theorem 1, the Jacobian is simply $\bm{X}$. They similarly found that practical deep nets have a bimodal spectrum and that the Jacobian is approximately low-rank. It is also known that the behavior of wide deep nets (i.e., many neurons per layer) can be approximated by their Jacobian-based linearization at initialization via the neural tangent kernel [19, 20]. Thus the spectrum indeed closely governs the optimization dynamics [21], similar to our simple linear regression setup. While the larger eigen-directions of a deep net can be optimized quickly [14, 18], intuitively their existence will slow down the small eigen-directions. However, it is also observed that learning such small eigen-directions is critical for the success of deep learning, since typically achieving zero training loss (aka interpolation) leads to the best generalization performance [22, 23, 24]. In light of this discussion, the related empirical observations of [3] and our theory, it is indeed plausible that Unstable CLR does a stellar job at learning these small eigen-directions, leading to much faster interpolation and improved generalization.

II Extension to Nonlinear Problems

To conclude our discussion, we next provide a more general result that applies to strongly convex functions. Recall that $\mathcal{S}^{c}$ denotes the complement of a subspace $\mathcal{S}$ and let $\bm{\Pi}_{\mathcal{S}}$ denote the matrix that projects onto $\mathcal{S}$. Our goal is solving $\bm{\theta}_{\star}=\arg\min_{\bm{\theta}\in\mathbb{R}^{p}}f(\bm{\theta})$ via the gradient iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})$.

Definition 2 (Smoothness and strong-convexity)

Let $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a smooth convex function and fix $L>\mu>0$. $f$ is $L$-smooth and $\mu$-strongly convex if its Hessian satisfies the following for all $\bm{\theta}\in\mathbb{R}^{p}$

L\bm{I}_{p}\succeq\nabla^{2}f(\bm{\theta})\succeq\mu\bm{I}_{p}.

First, we recall a classical convergence result for strongly convex functions for the sake of completeness.

Proposition 1

Let $f$ obey Definition 2 and suppose $\bm{\theta}_{\star}$ is its minimizer. Pick a learning rate $\eta\leq 1/L$ and use the iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla f(\bm{\theta}_{t})$. The iterates obey $\|\bm{\theta}_{\tau}-\bm{\theta}_{\star}\|_{\ell_{2}}^{2}\leq(1-\eta\mu)^{\tau}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}^{2}$.

This is the usual setup for gradient descent analysis. Instead, we will employ the bimodal Hessian definition.

Definition 3 (Bimodal Hessian)

Let $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be an $L$-smooth, $\mu$-strongly convex function. Additionally, suppose there exist a subspace $\mathcal{S}\subseteq\mathbb{R}^{p}$, local condition numbers $\kappa_{+},\kappa_{-}\geq 1$ and a cross-smoothness $\varepsilon\geq 0$ such that the Hessian of $f$ satisfies, for all $\bm{\theta}\in\mathbb{R}^{p}$,

\text{Upper spectrum:}\quad L\bm{I}_{p}\succeq\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}}\succeq(L/\kappa_{+})\bm{I}_{p}, \qquad (3)
\text{Lower spectrum:}\quad\kappa_{-}\mu\bm{I}_{p}\succeq\bm{\Pi}_{\mathcal{S}^{c}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}\succeq\mu\bm{I}_{p},
\text{Cross spectrum:}\quad\|\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}\|\leq\varepsilon.

Here $\kappa_{+},\kappa_{-}$ are the local condition numbers as previously. Observe that both of these are upper bounded by the global condition number $\kappa=L/\mu$, as $f$ obeys Def. 2 as well. The cross-smoothness controls the interaction between the two subspaces. For linear regression, $\varepsilon=0$ by picking $\mathcal{S}$ to be an eigenspace of $\bm{X}^{\top}\bm{X}$. For general nonlinear models, as long as the problem can be approximated linearly (e.g. wide deep nets can be approximated by their linearization [19, 20]), it is plausible that the cross-smoothness is small for a suitable choice of $\mathcal{S}$. We have the following analogue of Theorem 1.
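Given a Hessian and a candidate subspace, the constants of Definition 3 can be measured numerically. The helper below is a sketch of our own (its name and interface are illustrative, not part of this letter's experiments).

```python
import numpy as np

# Sketch: measure the constants of Definition 3 for a symmetric Hessian H and
# an orthonormal basis S (p x r) of the candidate subspace.
def bimodal_constants(H, S):
    p, r = H.shape[0], S.shape[1]
    P = S @ S.T                                  # projector onto the subspace
    Pc = np.eye(p) - P                           # projector onto the complement
    Sc = np.linalg.svd(Pc)[0][:, : p - r]        # orthonormal basis of the complement
    upper = np.linalg.eigvalsh(S.T @ H @ S)      # Hessian spectrum restricted to S
    lower = np.linalg.eigvalsh(Sc.T @ H @ Sc)    # Hessian spectrum restricted to S^c
    kappa_plus = upper[-1] / upper[0]            # local condition number, upper cluster
    kappa_minus = lower[-1] / lower[0]           # local condition number, lower cluster
    eps = np.linalg.norm(P @ H @ Pc, 2)          # cross-smoothness
    return kappa_plus, kappa_minus, eps
```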

Theorem 2 (Nonlinear problems)

Let $f$ obey Definition 3 with non-negative scalars $L,\mu,\kappa_{+},\kappa_{-},\varepsilon$. Consider the learning rate schedule of Definition 1 and set

T\geq 2\kappa_{+}\log\left(\frac{2L}{\kappa_{-}\mu}\right)+1,\quad\eta_{-}=\frac{1}{\kappa_{-}\mu},\quad\eta_{+}=\frac{1}{L}.

Additionally, suppose the cross-smoothness $\varepsilon$ satisfies the following upper bound

4\varepsilon\leq\min(1,\kappa_{-}/T)\,\mu.

Let $\bm{\theta}_{\star}$ be the minimizer of $f(\bm{\theta})$. Starting from $\bm{\theta}_{0}$, apply the gradient iterations $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})$. For all iterations $t\geq 0$ with $\mathrm{mod}(t,T)=0$, we have

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\sqrt{2}\left(1-\frac{1}{4\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{0}-\bm{\theta}_{\star}\|_{\ell_{2}}.

Interpretation. This result is in a similar spirit to the linear regression setup of Theorem 1. The required number of iterations is still governed by the quantity (2). A key difference is the cross-smoothness $\varepsilon$, which controls the cross-Hessian matrix $\bm{\Pi}_{\mathcal{S}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}$. This term was simply equal to $0$ for the linear problem. Our condition for nonlinear problems essentially requires the cross-Hessian to be dominated by the Hessian over the lower spectrum, i.e. $\bm{\Pi}_{\mathcal{S}^{c}}\nabla^{2}f(\bm{\theta})\bm{\Pi}_{\mathcal{S}^{c}}$. In particular, $\varepsilon$ should be upper bounded by the strong convexity parameter $\mu$ as well as by $\kappa_{-}\mu/T$, where $\kappa_{-}\mu$ is the smoothness over the lower spectrum. It would be interesting to explore to what extent the conditions on the cross-Hessian can be relaxed. However, as mentioned earlier, the fact that wide artificial neural networks behave close to linear models [19] provides a decent justification for a small cross-Hessian.
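As a sanity check, the toy construction below (our own, not from the letter's experiments) builds a strongly convex nonlinear $f$ whose Hessian remains bimodal with zero cross-smoothness, and runs the gradient iterations with Theorem 2's schedule.

```python
import numpy as np

# Toy nonlinear example: f(theta) = 0.5*sum(lam*d^2) + delta*sum(log cosh(d)), d = theta - theta_star.
# Its Hessian is diagonal with entries in [lam_i, lam_i + delta], so the bimodal structure persists
# and the cross-smoothness is zero for the coordinate-aligned subspace.
rng = np.random.default_rng(1)
p, r, kmax, delta = 60, 6, 1000.0, 0.5
lam = np.concatenate([np.linspace(kmax, kmax / 2, r), np.linspace(2.0, 1.0, p - r)])
theta_star = rng.standard_normal(p)

def grad(theta):
    d = theta - theta_star
    return lam * d + delta * np.tanh(d)

# Conservative constants of Definition 3 for this construction.
L, mu = kmax + delta, 1.0
kappa_plus, kappa_minus = (kmax + delta) / (kmax / 2), 2.0 + delta
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)
T = int(np.ceil(2 * kappa_plus * np.log(2 * L / (kappa_minus * mu)))) + 1

theta = np.zeros(p)
for t in range(30 * T):
    eta = eta_minus if t % T == T - 1 else eta_plus
    theta = theta - eta * grad(theta)
print("distance to the minimizer:", np.linalg.norm(theta - theta_star))
```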

III Proofs

III-A Proof of Theorem 1

Proof:

Following Theorem 1’s statement, introduce $L=\lambda_{1}$, $\mu=\lambda_{p}$. Each gradient iteration can be written in terms of the residual $\bm{w}_{t}=\bm{\theta}_{t}-\bm{\theta}_{\star}$ and has the following form

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta_{t}\bm{X}^{\top}(\bm{X}\bm{\theta}_{t}-\bm{y})
\implies\bm{w}_{t+1}:=\bm{\theta}_{t+1}-\bm{\theta}_{\star}=(\bm{I}-\eta_{t}\bm{X}^{\top}\bm{X})\bm{w}_{t}.

Let $\mathcal{S}$ be the principal eigenspace induced by the first $r$ eigenvectors and $\mathcal{S}^{c}$ be its complement. Let $\bm{\Pi}_{\mathcal{S}}$ be the projection operator onto $\mathcal{S}$. Set $\bm{a}_{t}=\bm{\Pi}_{\mathcal{S}}(\bm{w}_{t})$, $\bm{b}_{t}=\bm{\Pi}_{\mathcal{S}^{c}}(\bm{w}_{t})$. Also set $a_{t}=\|\bm{a}_{t}\|_{\ell_{2}}$, $b_{t}=\|\bm{b}_{t}\|_{\ell_{2}}$. Since the learning rate is periodic, we analyze a single period starting from $\bm{a}_{0},\bm{b}_{0}$. During the first $T-1$ iterations (where $\eta_{t}=\eta_{+}=1/L$), we have

a_{t}\leq\left(1-\frac{1}{\kappa_{+}}\right)^{t}a_{0}\quad\text{and}\quad b_{t}\leq\left(1-\frac{\mu}{L}\right)^{t}b_{0}=\left(1-\frac{1}{\kappa}\right)^{t}b_{0}.

At the final (unstably large) iteration $t=T-1$, we have

a_{T}\leq\frac{L}{\kappa_{-}\mu}a_{T-1}=\frac{\kappa}{\kappa_{-}}a_{T-1}\quad\text{and}\quad b_{T}\leq\left(1-\frac{1}{\kappa_{-}}\right)b_{T-1}.

Now, the $b_{t}$ term is clearly non-increasing and obeys $b_{T}\leq(1-\frac{1}{\kappa_{-}})(1-\frac{1}{\kappa})^{T-1}b_{0}\leq(1-\frac{1}{\kappa_{-}})b_{0}$. Note that we forgo the $(1-\frac{1}{\kappa})^{T-1}$ factor for the sake of simplicity. We wish to make the growth of $a_{T}$ similarly small by enforcing

\frac{\kappa}{\kappa_{-}}\left(1-\frac{1}{\kappa_{+}}\right)^{T-1}\leq 1-\frac{1}{2\kappa_{-}}\iff\frac{2\kappa}{2\kappa_{-}-1}\leq\left(\left(1-\frac{1}{\kappa_{+}}\right)^{-1}\right)^{T-1}.

Observe that $(1-\frac{1}{\kappa_{+}})^{-1}\geq\mathrm{e}^{1/\kappa_{+}}$. Thus, we simply need

\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)\leq\frac{T-1}{\kappa_{+}}\iff T\geq\kappa_{+}\log\left(\frac{2\kappa}{2\kappa_{-}-1}\right)+1.

In short, at the end of a single period both $a_{T}\leq(1-\frac{1}{2\kappa_{-}})a_{0}$ and $b_{T}\leq(1-\frac{1}{2\kappa_{-}})b_{0}$ hold; since $\|\bm{w}_{t}\|_{\ell_{2}}^{2}=a_{t}^{2}+b_{t}^{2}$, this yields (1). ∎
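The per-period contraction established above is easy to check numerically. The sketch below (our own check, using a synthetic bimodal spectrum) multiplies the $T-1$ stable factors and the one unstable factor for every eigen-direction and compares the worst case against $1-\frac{1}{2\kappa_{-}}$.

```python
import numpy as np

# Numerical check of the single-period contraction in the proof of Theorem 1.
lam = np.concatenate([np.linspace(1000.0, 500.0, 10), np.linspace(2.0, 1.0, 90)])  # decreasing
r = 10
L, mu = lam[0], lam[-1]
kappa, kappa_plus, kappa_minus = L / mu, L / lam[r - 1], lam[r] / mu
eta_plus, eta_minus = 1.0 / L, 1.0 / (kappa_minus * mu)
T = int(np.ceil(kappa_plus * np.log(2 * kappa / (2 * kappa_minus - 1)))) + 1

# One period = (T - 1) stable steps followed by one unstable step, per eigen-direction.
per_period = np.abs(1 - eta_minus * lam) * np.abs(1 - eta_plus * lam) ** (T - 1)
print("worst per-period factor:", per_period.max(), "<=", 1 - 1 / (2 * kappa_minus))
```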

III-B Proof of Theorem 2

Proof:

The proof idea is the same as Theorem 1, but we additionally control the cross-smoothness terms. Denote the residual by $\bm{w}_{t}=\bm{\theta}_{t}-\bm{\theta}_{\star}$. We will work with the projections of the residual onto the subspaces $\mathcal{S},\mathcal{S}^{c}$, denoted by $\bm{a}_{t},\bm{b}_{t}$ respectively. Additionally, set $B=\max(\|\bm{a}_{0}\|_{\ell_{2}},\|\bm{b}_{0}\|_{\ell_{2}})$ and set $a_{t}=\|\bm{a}_{t}\|_{\ell_{2}}/B$, $b_{t}=\|\bm{b}_{t}\|_{\ell_{2}}/B$. Observe that by this definition $B\leq\|\bm{\theta}_{\star}-\bm{\theta}_{0}\|_{\ell_{2}}$ and

\max(a_{0},b_{0})=1.

We will prove that, at the end of a single period, the $(a_{T},b_{T})$ pair obeys

\max(a_{T},b_{T})\leq 1-\frac{1}{4\kappa_{-}}. \qquad (4)

The overall result follows from this by induction: applying (4) over consecutive periods, we achieve $\max(a_{t},b_{t})\leq(1-\frac{1}{4\kappa_{-}})^{t/T}$ for $\mathrm{mod}(t,T)=0$. That would in turn yield

\|\bm{\theta}_{t}-\bm{\theta}_{\star}\|_{\ell_{2}}\leq\sqrt{2}\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}})\leq\sqrt{2}B\max(a_{t},b_{t})\leq\sqrt{2}\left(1-\frac{1}{4\kappa_{-}}\right)^{t/T}\|\bm{\theta}_{\star}-\bm{\theta}_{0}\|_{\ell_{2}}.

Establishing (4): Let us now show (4) for a single period of the learning rate, i.e. $0\leq t\leq T$. The gradient is given by

\nabla f(\bm{\theta}_{t})=\bm{H}\bm{w}_{t},

where $\bm{H}$ is obtained by integrating the Hessian along the path from $\bm{\theta}_{\star}$ to $\bm{\theta}_{t}$. Write the next iterate as

\|\bm{w}_{t+1}\|_{\ell_{2}}^{2}=\|\bm{w}_{t}-\eta_{t}\nabla f(\bm{\theta}_{t})\|_{\ell_{2}}^{2}=\|\bm{w}_{t}-\eta_{t}\bm{H}\bm{w}_{t}\|_{\ell_{2}}^{2}.

By Definition 3 and the triangle inequality, $\bm{H}$ satisfies the bimodal Hessian inequalities described in (3). To analyze a single period, we first focus on the smaller learning rate $\eta_{+}$, which spans the initial $T-1$ iterations. Hence, suppose $\eta_{t}=\eta_{+}=1/L$. Note that, from the strong convexity of $\bm{H}$ over $\mathcal{S},\mathcal{S}^{c}$, we have that

\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}^{2}\leq(1-\eta_{t}L/\kappa_{+})\|\bm{a}_{t}\|_{\ell_{2}}^{2}\quad\text{for}\quad\eta_{t}\leq 1/L,
\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}^{2}\leq(1-\eta_{t}\mu)\|\bm{b}_{t}\|_{\ell_{2}}^{2}\quad\text{for}\quad\eta_{t}\leq 1/(\kappa_{-}\mu).

Applying this with $\eta_{t}=\eta_{+}=1/L$ to $\bm{a}_{t}$, we find the recursion

\|\bm{a}_{t+1}\|_{\ell_{2}}=\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{w}_{t}\|_{\ell_{2}}
\leq\|\bm{a}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}+\eta_{t}\|\bm{\Pi}_{\mathcal{S}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}
\leq\left(1-\frac{\eta_{t}L}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\eta_{t}\varepsilon\|\bm{b}_{t}\|_{\ell_{2}}
=\left(1-\frac{1}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\frac{\varepsilon}{L}\|\bm{b}_{t}\|_{\ell_{2}}. \qquad (5)

Using the identical argument for $\bm{b}_{t}$ yields

\|\bm{b}_{t+1}\|_{\ell_{2}}=\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{w}_{t}\|_{\ell_{2}}
\leq\|\bm{b}_{t}-\eta_{t}\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{b}_{t}\|_{\ell_{2}}+\eta_{t}\|\bm{\Pi}_{\mathcal{S}^{c}}\bm{H}\bm{a}_{t}\|_{\ell_{2}}
\leq\left(1-\frac{\mu}{2L}\right)\|\bm{b}_{t}\|_{\ell_{2}}+\frac{\varepsilon}{L}\|\bm{a}_{t}\|_{\ell_{2}}.

Setting $\bar{\varepsilon}=\frac{\varepsilon}{L}$ and dividing both sides by $B$, we obtain

a_{t+1}\leq\left(1-\frac{1}{2\kappa_{+}}\right)a_{t}+\bar{\varepsilon}b_{t}. \qquad (6)

Additionally, recall that, since $f$ is $\mu$-strongly convex, we have $L/\kappa_{+}\geq\mu$. Thus, using $L/\kappa_{+}\geq\mu\geq 2\varepsilon$, we have

\|\bm{b}_{t+1}\|_{\ell_{2}}\leq\left(1-\frac{\mu}{2L}\right)\|\bm{b}_{t}\|_{\ell_{2}}+\bar{\varepsilon}\|\bm{a}_{t}\|_{\ell_{2}}\leq\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}}),
\|\bm{a}_{t+1}\|_{\ell_{2}}\leq\left(1-\frac{1}{2\kappa_{+}}\right)\|\bm{a}_{t}\|_{\ell_{2}}+\bar{\varepsilon}\|\bm{b}_{t}\|_{\ell_{2}}\leq\max(\|\bm{a}_{t}\|_{\ell_{2}},\|\bm{b}_{t}\|_{\ell_{2}}).

That is, we are guaranteed to have $1\geq a_{t},b_{t}\geq 0$ throughout. Thus, using (6) recursively, for all $0\leq t\leq T-1$, $a_{t}$ satisfies

a_{t}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+\bar{\varepsilon}\sum_{\tau=0}^{t-1}\left(1-\frac{1}{2\kappa_{+}}\right)^{t-\tau-1}b_{\tau}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+\bar{\varepsilon}\sum_{\tau=0}^{t-1}b_{\tau}\leq\left(1-\frac{1}{2\kappa_{+}}\right)^{t}+t\bar{\varepsilon}. \qquad (7)

At time $t=T-1$, we use the larger learning rate $\eta_{-}=\frac{1}{\kappa_{-}\mu}$. The following holds via an argument identical to (5):

a_{t+1}\leq\frac{L}{\kappa_{-}\mu}a_{t}+\frac{\varepsilon}{\kappa_{-}\mu}b_{t}, \qquad (8)
b_{t+1}\leq\left(1-\frac{1}{2\kappa_{-}}\right)b_{t}+\frac{\varepsilon}{\kappa_{-}\mu}a_{t}.

To bound $b_{T}$, we recall $b_{T-1},a_{T-1}\leq 1$ and the bound $\varepsilon\leq\mu/4$ to obtain

b_{T}\leq 1-\frac{1}{2\kappa_{-}}+\frac{\varepsilon}{\kappa_{-}\mu}\leq 1-\frac{1}{4\kappa_{-}}.

Bounding $a_{T}$ is what remains. Noticing $\log(1-x)\leq-x$, our choice of the period $T$ obeys

T\geq 2\kappa_{+}\log(2L/(\kappa_{-}\mu))+1\geq 1-\frac{\log(2L/(\kappa_{-}\mu))}{\log\left(1-\frac{1}{2\kappa_{+}}\right)}

for $\kappa_{+}\geq 1$. Thus, using the assumption $\varepsilon\leq\frac{\kappa_{-}\mu}{4T}$ and the bounds (7) and (8), the bound on $a_{T}$ is obtained via

a_{T}\leq\frac{L}{\kappa_{-}\mu}\left(\left(1-\frac{1}{2\kappa_{+}}\right)^{T-1}+(T-1)\frac{\varepsilon}{L}\right)+\frac{\varepsilon}{\kappa_{-}\mu}
\leq\frac{1}{2}+T\frac{\varepsilon}{\kappa_{-}\mu}\leq\frac{3}{4}\leq 1-\frac{1}{4\kappa_{-}},

where we noted $\kappa_{-}\geq 1$. The above bounds on $a_{T},b_{T}$ establish (4) and conclude the proof. ∎

IV Conclusions

This letter introduced a setting where remarkably fast convergence can be attained by using a cyclical learning rate schedule that carefully targets the bimodal structure of the Hessian of the problem. This is accomplished by replacing the global condition number with local counterparts. While a bimodal Hessian seems to be a strong assumption, recent literature provides rich empirical justification for our theory. Besides relaxing the assumptions in Theorem 2, there are multiple interesting directions for Unstable CLR. The results can likely be extended to minibatch SGD, overparameterized settings, or to non-convex problems (e.g. via the Polyak-Lojasiewicz condition [25]). While our attention was restricted to a bimodal Hessian and a CLR scheme with two values ($\eta_{+},\eta_{-}$ as in Def. 1), it would be exciting to explore whether more sophisticated CLR schemes can provably accelerate optimization for a richer class of Hessian structures. On the practical side, further empirical investigation (beyond [3, 12]) can help verify and explore the potential benefits of large cyclical learning rates.

References

  • [1] L. N. Smith, “Cyclical learning rates for training neural networks,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on.   IEEE, 2017, pp. 464–472.
  • [2] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
  • [3] L. N. Smith and N. Topin, “Super-convergence: Very fast training of residual networks using large learning rates,” arXiv preprint arXiv:1708.07120, 2017.
  • [4] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating KL vanishing,” arXiv preprint arXiv:1903.10145, 2019.
  • [5] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018.   Association For Uncertainty in Artificial Intelligence (AUAI), 2018, pp. 876–885.
  • [6] X. Li, Z. Zhuang, and F. Orabona, “Exponential step sizes for non-convex optimization,” arXiv preprint arXiv:2002.05273, 2020.
  • [7] K. Zhang, A. Koppel, H. Zhu, and T. Basar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM Journal on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, 2020.
  • [8] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann, “Escaping saddles with stochastic gradients,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1155–1164.
  • [9] G. Leclerc and A. Madry, “The two regimes of deep network training,” arXiv preprint arXiv:2002.10376, 2020.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
  • [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [12] J. M. Cohen, S. Kaur, Y. Li, Z. Kolter, and A. Talwalkar, “Gradient descent on neural networks typically occurs at the edge of stability,” to appear at ICLR, 2021.
  • [13] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, “Empirical analysis of the Hessian of over-parametrized neural networks,” arXiv preprint arXiv:1706.04454, 2017.
  • [14] S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi, “Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian,” arXiv preprint arXiv:1906.05392, 2019.
  • [15] X. Li, Q. Gu, Y. Zhou, T. Chen, and A. Banerjee, “Hessian-based analysis of SGD for deep nets: Dynamics and generalization,” in Proceedings of the 2020 SIAM International Conference on Data Mining.   SIAM, 2020, pp. 190–198.
  • [16] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, pp. 1617–1642, 2007.
  • [17] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 4313–4324.
  • [18] D. Kopitkov and V. Indelman, “Neural spectrum alignment: Empirical study,” in International Conference on Artificial Neural Networks.   Springer, 2020, pp. 168–179.
  • [19] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” Conference on Neural Information Processing Systems, 2018.
  • [20] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” arXiv preprint arXiv:1902.06720, 2019.
  • [21] G. Gur-Ari, D. A. Roberts, and E. Dyer, “Gradient descent happens in a tiny subspace,” arXiv preprint arXiv:1812.04754, 2018.
  • [22] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019.
  • [23] M. Belkin, D. Hsu, and P. Mitra, “Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161, 2018.
  • [24] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, and H. Mhaskar, “Theory of deep learning iii: explaining the non-overfitting puzzle,” arXiv preprint arXiv:1801.00173, 2017.
  • [25] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2016, pp. 795–811.