
Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums

Rui Pan^1, Haishan Ye^2, Tong Zhang^1
1The Hong Kong University of Science and Technology
2Xi’an Jiaotong University
[email protected], [email protected]
Equal contribution. Corresponding author is Haishan Ye. Jointly with Google Research.
Abstract

Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known which schedules of SGD achieve the best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. This condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that approximate eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay in such situations. For other situations, the proposed schedulers are superior to cosine decay.

1 Introduction

Many machine learning models can be represented as the following optimization problem:

\min_{w}f(w)\triangleq\frac{1}{n}\sum_{i=1}^{n}f_{i}(w), \qquad (1.1)

such as logistic regression and deep neural networks. To solve the above problem, stochastic gradient descent (SGD) (Robbins & Monro, 1951) has been widely adopted due to its computational efficiency in large-scale learning problems (Bottou & Bousquet, 2008), especially for training deep neural networks.

Given the popularity of SGD in this field, different learning rate schedules have been proposed to further improve its convergence rate. Among them, the most famous and widely used ones are inverse time decay, step decay (Goffin, 1977), and the cosine scheduler (Loshchilov & Hutter, 2017). The learning rates generated by the inverse time decay scheduler depend inversely on the current iteration number. Such a scheduling strategy comes from the theory of SGD on strongly convex functions, and has been extended to non-convex objectives like neural networks while still achieving good performance. The step decay scheduler keeps the learning rate piecewise constant and decreases it by a factor after a given number of epochs. It is theoretically proved in Ge et al. (2019) that when the objective is quadratic, the step decay scheduler outperforms inverse time decay. Empirical results are also provided in the same work to demonstrate the better convergence of step decay in training neural networks compared with inverse time decay. However, even though step decay is proven to be near-optimal on quadratic objectives, it is not truly optimal. There still exists a $\log T$ gap to the minimax optimal convergence rate, which turns out to be non-trivial in a wide range of settings and may greatly impact step decay's empirical performance. The cosine decay scheduler (Loshchilov & Hutter, 2017) generates cosine-like learning rates over the range $[0,T]$, with $T$ being the maximum number of iterations. It is a heuristic scheduling strategy which relies on the observation that good performance in practice can be achieved by slowly decreasing the learning rate at the beginning and “refining” the solution at the end with a very small learning rate. Its convergence on smooth non-convex functions has been shown in Li et al. (2021), but the provided bound is still not tight enough to explain its success in practice.

Except for the cosine decay scheduler, all the aforementioned learning rate schedulers have (or will be shown to have) tight convergence bounds on quadratic objectives. In fact, studying their convergence on quadratic objective functions is quite important for understanding their behaviors in general non-convex problems. Recent studies of the Neural Tangent Kernel (NTK) (Arora et al., 2019; Jacot et al., 2020) suggest that when neural networks are sufficiently wide, the gradient descent dynamics of neural networks can be approximated by the NTK. In particular, when the loss function is the least-squares loss, the neural network's inference is equivalent in expectation to kernel ridge regression with respect to the NTK. In other words, for regression tasks, the non-convex objective of neural networks resembles a quadratic objective when the network is wide enough.

The $\log T$ gap in step decay's convergence upper bound, which we prove to be tight in a wide range of settings, implies that there is still room for improvement in theory. Meanwhile, the cosine decay scheduler, which has no strong theoretical convergence guarantee but possesses good empirical performance in certain tasks, suggests that its convergence rate may depend on some specific properties of the objective determined by the network and dataset in practice. Hence it is natural to ask what those key properties may be, and whether it is possible to find theoretically optimal schedulers whose empirical performance is comparable to cosine decay when those properties are available. In this paper, we offer an answer to these questions. We first derive a novel eigenvalue-distribution-based learning rate scheduler called eigencurve for quadratic functions. Combined with the eigenvalue distributions of different types of networks, new neural-network-based learning rate schedulers can be generated from our proposed paradigm, which achieve better convergence properties than the step decay in Ge et al. (2019). Specifically, eigencurve closes the $\log T$ gap of step decay and reaches the minimax optimal convergence rate when the Hessian spectrum is skewed. We summarize the main contributions of this paper as follows.

  1. To the best of our knowledge, this is the first work that incorporates the eigenvalue distribution of the objective function's Hessian matrix into the design of learning rate schedulers. Based on this eigenvalue distribution, we propose a novel eigenvalue-distribution-based learning rate scheduler named eigencurve.

  2. Theoretically, eigencurve achieves the optimal convergence rate (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the Hessian is skewed. Furthermore, even when the Hessian spectrum is not skewed, eigencurve still achieves a convergence rate no worse than that of the step decay schedule in Ge et al. (2019), whose convergence rate is proven to be sub-optimal in a wide range of settings.

  3. Empirically, on image classification tasks, eigencurve achieves the optimal convergence rate for several models on CIFAR-10 and ImageNet if the loss can be approximated by quadratic objectives. Moreover, it obtains much better performance than step decay on CIFAR-10, especially when the number of epochs is small.

  4. Intuitively, our learning rate scheduler sheds light on the theoretical properties of cosine decay and offers a perspective for understanding why it achieves good performance on image classification tasks. The same idea inspires several simple families of schedules that work in practice.

Problem Setup

For the theoretical analysis and to derive our eigenvalue-dependent learning rate schedulers, we mainly focus on quadratic objectives, that is,

\min_{w}f(w)\triangleq\mathbb{E}_{\xi}\left[f(w,\xi)\right],\mbox{ where }f(w,\xi)=\frac{1}{2}w^{\top}H(\xi)w-b(\xi)^{\top}w, \qquad (1.2)

where $\xi$ denotes the data sample. Hence, the Hessian of $f(w)$ is

H=\mathbb{E}_{\xi}\left[H(\xi)\right]. \qquad (1.3)

Denoting $b=\mathbb{E}_{\xi}[b(\xi)]$, we can obtain the optimum of problem (1.2):

w_{*}=H^{-1}b. \qquad (1.4)

Given an initial iterate $w_{0}$ and the learning rate sequence $\{\eta_{t}\}$, the stochastic gradient update is

w_{t+1}=w_{t}-\eta_{t}\nabla f(w_{t},\xi)=w_{t}-\eta_{t}(H(\xi)w_{t}-b(\xi)). \qquad (1.5)

We denote that

n_{t}=Hw_{t}-b-\left(H(\xi)w_{t}-b(\xi)\right),\quad\mu\triangleq\lambda_{\min}(H),\quad L\triangleq\lambda_{\max}(H),\quad\mbox{and}\quad\kappa\triangleq L/\mu. \qquad (1.6)

In this paper, we assume that

\mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]\preceq\sigma^{2}H. \qquad (1.7)

The reason for this assumption is presented in Appendix G.5.
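To make the setting concrete, the following is a minimal sketch (not the paper's code) of SGD on a synthetic quadratic instance of the form (1.2), using the update of Eqn. (1.5); the data-generating choices, dimensions and the placeholder schedule are illustrative assumptions.

```python
# Minimal sketch of the setup in Eqns. (1.2)-(1.7): SGD on a synthetic quadratic
# objective. All dimensions and constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 20, 1000, 5000

# Least-squares instance: f(w, xi) = 0.5 * w^T (x x^T) w - (y x)^T w,
# so that H = E[x x^T] and b = E[y x].
X = rng.normal(size=(n, d)) * np.linspace(1.0, 0.05, d)  # skewed feature scales
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

H = X.T @ X / n
b = X.T @ y / n
w_star = np.linalg.solve(H, b)          # optimum, Eqn. (1.4)
L = np.linalg.eigvalsh(H)[-1]           # largest eigenvalue of H

def f(w):
    return 0.5 * w @ H @ w - b @ w

w = np.zeros(d)
for t in range(T):
    i = rng.integers(n)                               # sample xi
    x_i, y_i = X[i], y[i]
    grad = x_i * (x_i @ w) - y_i * x_i                # H(xi) w - b(xi)
    eta_t = 1.0 / (L * (1 + t))                       # placeholder schedule
    w = w - eta_t * grad                              # Eqn. (1.5)

print("final loss gap:", f(w) - f(w_star))
```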

Related Work

Table 1: Convergence rate of SGD with common schedulers on quadratic objectives.
| Scheduler | Convergence rate of SGD on quadratic objectives |
| Constant | Not guaranteed to converge |
| Inverse Time Decay | $\Theta\left(\frac{d\sigma^{2}}{T}\cdot\kappa\right)$ |
| Step Decay | $\Theta\left(\frac{d\sigma^{2}}{T}\cdot\log T\right)$ (Ge et al. (2019); Wu et al. (2021); this work, Theorem 4) |
| Eigencurve | $\mathcal{O}\left(\frac{d\sigma^{2}}{T}\right)$ with skewed Hessian spectrums; $\mathcal{O}\left(\frac{d\sigma^{2}}{T}\cdot\log\kappa\right)$ in the worst case (this work, Theorem 1, Corollaries 2, 3) |

In convergence analysis, one key property that separates SGD from vanilla gradient descent is that in SGD, the noise in the gradients dominates. In gradient descent (GD), a constant learning rate can achieve linear convergence $\mathcal{O}(c^{T})$ with $0<c<1$ for strongly convex objectives, i.e. obtaining $f(w^{(t)})-f(w^{*})\leq\epsilon$ in $\mathcal{O}(\log(\frac{1}{\epsilon}))$ iterations. However, in SGD, $f(w^{(t)})$ cannot even be guaranteed to converge to $f(w^{*})$ due to the existence of gradient noise (Bottou et al., 2018). Intuitively, this noise leads to a variance proportional to the learning rate, so a constant learning rate always introduces an $\Omega(\eta_{t})=\Omega(\eta_{0})$ gap compared with the convergence rate of GD. Fortunately, the inverse time decay scheduler solves this problem by decaying the learning rate inversely proportionally to the iteration number $t$, which achieves an $\mathcal{O}(\frac{1}{T})$ convergence rate for strongly convex objectives, specifically $\mathcal{O}(\frac{d\sigma^{2}}{T}\cdot\kappa)$. However, this is sub-optimal, since the minimax optimal rate for SGD is $\mathcal{O}(\frac{d\sigma^{2}}{T})$ (Ge et al., 2019; Jain et al., 2018). Moreover, in practice, $\kappa$ can be very large for large neural networks, which makes the inverse time decay scheduler undesirable for those models. This is where step decay (Goffin, 1977) comes into play. Empirically, it is widely adopted in tasks such as image classification and serves as a baseline for many models. Theoretically, it has been proven that step decay can achieve a nearly optimal convergence rate $\mathcal{O}(\frac{d\sigma^{2}}{T}\cdot\log T)$ for strongly convex least-squares regression (Ge et al., 2019). A tighter set of instance-dependent bounds in a recent work (Wu et al., 2021), carried out independently of ours, also proves its near-optimality. Nevertheless, step decay is not always the best choice for image classification tasks. In practice, cosine decay (Loshchilov & Hutter, 2017) can achieve comparable or even better performance, but the reason behind this superior performance is still unknown in theory (Gotmare et al., 2018). All the aforementioned results are summarized in Table 1, along with our results in this paper. It is worth mentioning that the minimax optimal rate $\mathcal{O}(\frac{d\sigma^{2}}{T})$ can be achieved by iterate averaging methods (Jain et al., 2018; Bach & Moulines, 2013; Défossez & Bach, 2015; Frostig et al., 2015; Jain et al., 2016; Neu & Rosasco, 2018), but it is not common practice to use them in training deep neural networks, so only the final-iterate (Shamir, 2012) behavior of SGD is analyzed in this paper, i.e. the point right after the last iteration.

Paper organization: Section 2 describes the motivation of our eigencurve scheduler. Section 3 presents the exact form and convergence rate of the proposed eigencurve scheduler, along with the lower bound for step decay. Section 4 shows the experimental results. Section 5 discusses the discoveries and limitations of eigencurve, and Section 6 gives our conclusion.

2 Motivation

In this section, we give the main motivation and intuition behind our eigencurve learning rate scheduler. We first give the scheduling strategy that achieves the optimal convergence rate in the case that the Hessian is diagonal. Then, we show that the inverse time decay scheduler is sub-optimal in most cases. Comparing these two scheduling methods reveals why we should design eigenvalue-distribution-dependent learning rate schedulers.

Letting $H$ be a diagonal matrix $\mathrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{d})$ and reformulating Eqn. (1.5), we have

\begin{aligned}
w_{t+1}-w_{*} &= w_{t}-w_{*}-\eta_{t}(H(\xi)w_{t}-b(\xi)) \\
&= w_{t}-w_{*}-\eta_{t}(Hw_{t}-b)+\eta_{t}\left(Hw_{t}-b-(H(\xi)w_{t}-b(\xi))\right) \\
&= w_{t}-w_{*}-\eta_{t}(Hw_{t}-b-(Hw_{*}-b))+\eta_{t}\left(Hw_{t}-b-(H(\xi)w_{t}-b(\xi))\right) \\
&= \left(I-\eta_{t}H\right)(w_{t}-w_{*})+\eta_{t}n_{t}.
\end{aligned}

It follows,

\begin{aligned}
\mathbb{E}\left[\lambda_{j}\left(w_{t+1,j}-w_{*,j}\right)^{2}\right] &\overset{\mathbb{E}\left[n_{t}\right]=0}{=} \lambda_{j}\left\{(1-\eta_{t}\lambda_{j})^{2}\,\mathbb{E}\left[(w_{t,j}-w_{*,j})^{2}\right]+\eta_{t}^{2}\,\mathbb{E}\left\|[n_{t}]_{j}\right\|^{2}\right\} \qquad (2.1) \\
&\overset{(1.7)}{\leq} (1-\eta_{t}\lambda_{j})^{2}\cdot\lambda_{j}\,\mathbb{E}\left[(w_{t,j}-w_{*,j})^{2}\right]+\eta_{t}^{2}\lambda_{j}^{2}\sigma^{2}.
\end{aligned}

Since $H$ is diagonal, we can schedule the step size for each dimension separately. Choosing the step size coordinate-wise, with the step size $\eta_{t,j}=\frac{1}{\lambda_{j}(t+1)}$ being optimal for the $j$-th coordinate, we have the following proposition.

Proposition 1.

Assume that $H$ is a diagonal matrix with eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{d}\geq 0$ and that Eqn. (1.7) holds. If we set the step size $\eta_{t,j}=\frac{1}{\lambda_{j}(t+1)}$, it holds that

2\,\mathbb{E}\left[f(w_{t+1})-f(w_{*})\right]=\mathbb{E}\left[\sum_{j=1}^{d}\lambda_{j}\left(w_{t+1,j}-w_{*,j}\right)^{2}\right]\leq\frac{\sum_{j=1}^{d}\lambda_{j}(w_{1,j}-w_{*,j})^{2}}{(t+1)^{2}}+\frac{t}{(t+1)^{2}}\cdot d\sigma^{2}. \qquad (2.2)

The leading equality here is proved in Appendix G.1, and the subsequent inequality is proved in Appendix E. From Eqn. (2.2), we can observe that choosing proper step sizes coordinate-wise can achieve the optimal convergence rate (Ge et al., 2019; Jain et al., 2018). Instead, if the widely used inverse time decay scheduler $\eta_{t}=1/(L+\mu t)$ is chosen, we can show that only a sub-optimal convergence rate can be obtained, especially when the $\lambda_{j}$'s vary from each other.

Proposition 2.

If we set the inverse time decay step size $\eta_{t}=\frac{1}{L+\mu t}$, then we have

\begin{aligned}
2\,\mathbb{E}\left[f(w_{t+1})-f(w_{*})\right]=\mathbb{E}\left[\sum_{j=1}^{d}\lambda_{j}\left(w_{t+1,j}-w_{*,j}\right)^{2}\right] \qquad (2.3) \\
\leq \left(\frac{L+\mu}{L+\mu t}\right)^{2}\left(\sum_{j=1}^{d}\lambda_{j}(w_{1,j}-w_{*,j})^{2}\right)+\sum_{j=1}^{d}\left(\frac{\lambda_{j}^{2}}{2\lambda_{j}-\mu}\cdot\frac{\sigma^{2}}{L+\mu t}+\frac{\lambda_{j}^{2}\sigma^{2}}{(L+\mu t)^{2}}\right).
\end{aligned}
Remark 1.

Eqn. (2.2) shows that if one can choose the step size coordinate-wise with $\eta_{t,j}=\frac{1}{\lambda_{j}(t+1)}$, then SGD can achieve a convergence rate of

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq\mathcal{O}\left(\frac{d}{T}\cdot\sigma^{2}\right), \qquad (2.4)

which matches the lower bound (Ge et al., 2019; Jain et al., 2018). In contrast, substituting $L=\lambda_{1}$ and $\mu=\lambda_{d}$ into Proposition 2, we obtain that the convergence rate of SGD is

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq\mathcal{O}\left(\frac{1}{T}\sum_{j=1}^{d}\frac{\lambda_{j}}{\lambda_{d}}\cdot\sigma^{2}\right). \qquad (2.5)

Since it holds that $\lambda_{j}\geq\lambda_{d}$, the convergence rate in Eqn. (2.4) is better than the one in Eqn. (2.5), especially when the eigenvalues of the Hessian (the matrix $H$) decay rapidly. In fact, the upper bound in Eqn. (2.5) is tight for the inverse time decay scheduler, as proven in Ge et al. (2019).

Main Intuition

The diagonal case $H=\mathrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{d})$ provides an important intuition for designing eigenvalue-dependent learning rate schedules. In fact, for general non-diagonal $H$, letting $H=U\Lambda U^{\top}$ be the spectral decomposition of the Hessian and setting $w^{\prime}=U^{\top}w$, the Hessian becomes a diagonal matrix from the perspective of updating $w^{\prime}$, with the variance of the stochastic gradient unchanged since $U$ is a unitary matrix. This is also the core idea of Newton's method and many second-order methods (Huang et al., 2020). However, since our focus in this paper is learning rate schedulers only, we defer the relevant discussion of this relationship to Appendix H.

Propositions 1 and 2 imply that a good learning rate scheduler should decrease the error of each coordinate. The inverse time decay scheduler is only optimal for the coordinate related to the smallest eigenvalue, which is the reason why it is sub-optimal overall. Thus, we should reduce the learning rate gradually so that the optimal learning rate associated with each $\lambda_{j}$ is run long enough to sufficiently reduce the error of the $j$-th coordinate. Furthermore, given the total number of iterations $T$ and the eigenvalue distribution of the Hessian, we should allocate the running time for each optimal learning rate associated with $\lambda_{j}$.
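The following is a minimal numerical sketch (not from the paper) contrasting the coordinate-wise step sizes of Proposition 1 with the inverse time decay of Proposition 2 on a diagonal quadratic; the eigenvalue spectrum, noise level and iteration count are illustrative assumptions.

```python
# Minimal sketch contrasting Propositions 1 and 2 on a diagonal quadratic:
# coordinate-wise steps 1/(lambda_j (t+1)) vs. inverse time decay 1/(L + mu t).
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 50, 20000, 0.1
lam = np.logspace(0, -3, d)          # skewed eigenvalues, L = 1, mu = 1e-3
L, mu = lam[0], lam[-1]

def run(coordinate_wise):
    w = np.ones(d)                   # w_* = 0, so the error is w itself
    for t in range(T):
        noise = sigma * np.sqrt(lam) * rng.normal(size=d)   # E[n n^T] = sigma^2 H
        grad = lam * w + noise
        if coordinate_wise:
            eta = 1.0 / (lam * (t + 1))                     # Proposition 1
        else:
            eta = 1.0 / (L + mu * t)                        # Proposition 2
        w = w - eta * grad
    return 0.5 * np.sum(lam * w ** 2)                       # f(w) - f(w_*)

print("coordinate-wise steps:", run(True))
print("inverse time decay   :", run(False))
```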

3 Eigenvalue Dependent Step Scheduling

Figure 1: Eigencurve: piecewise inverse time decay scheduling.

As discussed in Section 2, to obtain a better convergence rate for SGD, we should consider the Hessian's eigenvalue distribution and schedule the learning rate based on that distribution. In this section, we propose a novel learning rate scheduler for this task, which can be regarded as piecewise inverse time decay (see Figure 1). The method is very simple: we group eigenvalues according to their values and denote by $s_{i}$ the number of eigenvalues lying in the range $\mathcal{R}_{i}=[\mu\cdot 2^{i},\mu\cdot 2^{i+1})$, that is,

s_{i}=\#\lambda_{j}\in[\mu\cdot 2^{i},\mu\cdot 2^{i+1}). \qquad (3.1)

Then, there are at most $I_{\max}=\log_{2}\kappa$ such ranges. By the inverse time decay theory, the optimal learning rate associated with eigenvalues in the range $\mathcal{R}_{i}$ should be

\eta_{t}=\mathcal{O}\left(\frac{1}{2^{i-1}\mu\cdot(t-t_{i})}\right),\mbox{ with }0=t_{0}<t_{1}<t_{2}<\dots<t_{I_{\max}}=T. \qquad (3.2)

Our scheduling strategy is to run the optimal learning rate for eigenvalues in each $\mathcal{R}_{i}$ for a period of time, so as to sufficiently decrease the error associated with eigenvalues in $\mathcal{R}_{i}$.

To make the step size sequence $\{\eta_{t}\}_{t=1}^{T}$ monotonically decreasing, we define the step sizes as

\eta_{t}=\frac{1}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+2^{i-1}\mu(t-t_{i-1})}\quad\mbox{ if }\quad t\in[t_{i-1},t_{i}), \qquad (3.3)

where

0=t_{0}<t_{1}<t_{2}<\dots<t_{I_{\max}}=T,\quad\Delta_{i}=t_{i}-t_{i-1},\quad\mbox{and}\quad I_{\max}=\log_{2}\kappa. \qquad (3.4)

To make the total error, that is, the sum of the errors associated with each $\mathcal{R}_{i}$, small, we should allocate $\Delta_{i}$ according to the $s_{i-1}$'s. Intuitively, a range $\mathcal{R}_{i}$ containing a large portion of the eigenvalues should be allocated a large portion of the iterations. Specifically, we propose to allocate $\Delta_{i}$ as follows:

\Delta_{i}=\frac{\sqrt{s_{i-1}}}{\sum_{i^{\prime}=0}^{I_{\max}-1}\sqrt{s_{i^{\prime}}}}\cdot T,\mbox{ with }s_{i}=\#\lambda_{j}\in[\mu\cdot 2^{i},\mu\cdot 2^{i+1}). \qquad (3.5)

In the rest of this section, we show that the step size schedule defined by Eqn. (3.3) and (3.5) achieves a better convergence rate than the one in Ge et al. (2019) when the $s_{i}$'s are non-uniformly distributed. In fact, an even better $\Delta_{i}$ allocation can be calculated using numerical optimization.
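For concreteness, here is a minimal sketch (not the released implementation) of this construction: it bins the eigenvalues as in Eqn. (3.1), allocates the interval lengths $\Delta_{i}$ as in Eqn. (3.5), and evaluates the piecewise inverse time decay step sizes of Eqn. (3.3). The synthetic spectrum used in the example is an illustrative assumption.

```python
# Minimal sketch of the eigencurve construction in Eqns. (3.1)-(3.5).
import numpy as np

def eigencurve_schedule(eigvals, T):
    eigvals = np.asarray(eigvals, dtype=float)
    mu, L = eigvals.min(), eigvals.max()
    I_max = int(np.ceil(np.log2(L / mu)))                    # number of ranges
    # s_i = #eigenvalues in [mu*2^i, mu*2^{i+1})              (Eqn. 3.1)
    idx = np.clip(np.floor(np.log2(eigvals / mu)).astype(int), 0, I_max - 1)
    s = np.bincount(idx, minlength=I_max)
    # Delta_i proportional to sqrt(s_{i-1})                   (Eqn. 3.5)
    delta = np.sqrt(s) / np.sqrt(s).sum() * T
    boundaries = np.concatenate([[0.0], np.cumsum(delta)])    # t_0, ..., t_{I_max}
    etas = np.empty(T)
    for t in range(T):
        i = min(np.searchsorted(boundaries, t, side="right"), I_max)  # t in [t_{i-1}, t_i)
        offset = mu * np.sum(delta[: i - 1] * 2.0 ** np.arange(i - 1))
        etas[t] = 1.0 / (L + offset + 2.0 ** (i - 1) * mu * (t - boundaries[i - 1]))
    return etas

# Example with a synthetic skewed spectrum (illustrative only).
eigvals = np.random.default_rng(0).pareto(2.0, size=1000) + 1e-3
print(eigencurve_schedule(eigvals, T=100)[:5])
```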

3.1 Theoretical Analysis

Lemma 1.

Let the objective function $f(w)$ be quadratic and Assumption (1.7) hold. Running SGD for $T$ steps starting from $w_{0}$ with a learning rate sequence $\{\eta_{t}\}_{t=1}^{T}$ defined in Eqn. (3.3), the final iterate $w_{T+1}$ satisfies

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{\Delta_{1}^{2}}+\frac{15}{2}\cdot\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}. \qquad (3.6)

Since the bias term is a higher-order term, the variance term in Eqn. (3.6) dominates the error of $w_{T+1}$. For simplicity, instead of using numerical methods to find the optimal $\{\Delta_{i}\}$, we propose to use the $\Delta_{i}$ defined in Eqn. (3.5). The value of $\Delta_{i}$ is proportional to the square root of the number of eigenvalues lying in the range $[\mu\cdot 2^{i-1},\mu\cdot 2^{i})$. With this choice of $\Delta_{i}$, eigencurve has the following convergence property.

Theorem 1.

Let the objective function $f(w)$ be quadratic and Assumption (1.7) hold. Running SGD for $T$ steps starting from $w_{0}$, with the learning rate sequence $\{\eta_{t}\}_{t=1}^{T}$ defined in Eqn. (3.3) and $\Delta_{i}$ defined in Eqn. (3.5), the final iterate $w_{T+1}$ satisfies

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}\cdot\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}T^{2}}+\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{T}\cdot\sigma^{2}.

Please refer to Appendices D, F and G for the full proofs of Lemma 1 and Theorem 1. The variance term $\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{T}\cdot\sigma^{2}$ in the above theorem shows that when the $s_{i}$'s vary greatly from each other, the variance can be close to $\mathcal{O}\left(\frac{d}{T}\cdot\sigma^{2}\right)$, which matches the lower bound (Ge et al., 2019). For example, letting $I_{\max}=100$, $s_{0}=0.99d$ and $s_{i}=\frac{0.01}{99}d$ for $i\geq 1$, we obtain

\frac{\left(\sum_{i=0}^{99}\sqrt{s_{i}}\right)^{2}}{T}\cdot\sigma^{2}=\left(\sqrt{0.99}+99\times\sqrt{0.01/99}\right)^{2}\cdot\frac{d}{T}\cdot\sigma^{2}<\frac{4d}{T}\cdot\sigma^{2}.
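This constant can be verified with a one-line computation using exactly the numbers of the example above:

```python
# Quick check of the worked example: with s_0 = 0.99 d and the remaining
# 99 ranges each holding (0.01/99) d eigenvalues, the constant stays below 4.
import numpy as np
print((np.sqrt(0.99) + 99 * np.sqrt(0.01 / 99)) ** 2)   # ~3.96 < 4
```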
Figure 2: Power law observed in ResNet-18 on ImageNet; both the eigenvalues (x-axis) and the density (y-axis) are plotted in log scale.

We can observe that if the variance of the $s_{i}$'s is large, the variance term in Theorem 1 can be close to $d\sigma^{2}/T$. More generally, as rigorously stated in Corollary 2, eigencurve achieves the minimax optimal convergence rate if the Hessian spectrum satisfies an extra "power law" assumption: the density of eigenvalue $\lambda$ decays exponentially with increasing $\ln(\lambda)$. This assumption comes from observations of estimated Hessian spectrums in practice (see Figure 2), which will be further illustrated in Section 4.1.

Corollary 2.

Given the same setting as in Theorem 1, when the Hessian $H$'s eigenvalue distribution $p(\lambda)$ satisfies the "power law", i.e.

p(\lambda)=\frac{1}{Z}\cdot\exp(-\alpha(\ln(\lambda)-\ln(\mu)))=\frac{1}{Z}\cdot\left(\frac{\mu}{\lambda}\right)^{\alpha} \qquad (3.7)

for some $\alpha>1$, where $Z=\int_{\mu}^{L}(\mu/\lambda)^{\alpha}\,d\lambda$, there exists a constant $C(\alpha)$ which only depends on $\alpha$, such that the final iterate $w_{T+1}$ satisfies

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq\left((f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{T^{2}}+\frac{d\sigma^{2}}{T}\right)\cdot C(\alpha).

Please refer to Appendix G.3 for the proof. As for the worst-case guarantee, it is easy to check that the variance term reaches its maximum only when the $s_{i}$'s are all equal, that is, $s_{i}=d/I_{\max}=d/\log_{2}(\kappa)$.
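To illustrate Corollary 2 numerically, the following sketch samples eigenvalues from the power-law density of Eqn. (3.7) and computes the extra constant $\left(\sum_{i}\sqrt{s_{i}}\right)^{2}/d$ appearing in Theorem 1; the values of $d$, $\mu$, $L$ and $\alpha$ are illustrative assumptions.

```python
# Minimal sketch: sample eigenvalues from the "power law" density of Eqn. (3.7),
# p(lambda) ~ (mu/lambda)^alpha on [mu, L], and compute the variance constant
# (sum_i sqrt(s_i))^2 / d from Theorem 1. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, mu, L, alpha = 100000, 1e-4, 10.0, 2.0

# Inverse-CDF sampling of the power law truncated to [mu, L].
a = 1.0 - alpha
u = rng.uniform(size=d)
lam = (mu**a + u * (L**a - mu**a)) ** (1.0 / a)

I_max = int(np.ceil(np.log2(L / mu)))
idx = np.clip(np.floor(np.log2(lam / mu)).astype(int), 0, I_max - 1)
s = np.bincount(idx, minlength=I_max)

print("log2(kappa)        :", np.log2(L / mu))            # worst-case constant
print("power-law constant :", np.sqrt(s).sum() ** 2 / d)  # stays O(1) as d grows
```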

Corollary 3.

Given the same setting as in Theorem 1, when $s_{i}=d/\log_{2}(\kappa)$ for all $0\leq i\leq I_{\max}-1$, the variance term in Theorem 1 reaches its maximum and $w_{T+1}$ satisfies

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}\log^{2}\kappa}{T^{2}}+\frac{15d\cdot\log\kappa}{T}\sigma^{2}.
Remark 2.

When the $s_{i}$'s vary from each other, our eigenvalue-dependent learning rate scheduler can achieve a faster convergence rate than eigenvalue-independent schedulers such as step decay, which suffers from an extra $\log T$ term (Ge et al., 2019). Only when the $s_{i}$'s are all equal does Corollary 3 show that the variance bound matches the lower bound up to a $\log\kappa$ factor, which is the same as the one in Proposition 3 of Ge et al. (2019).

Furthermore, we show that this $\log T$ gap between step decay and eigencurve indeed exists for problem instances with skewed Hessian spectrums. For simplicity, we only discuss the case where $H$ is diagonal.

Theorem 4.

Let the objective function $f(w)$ be quadratic. We run SGD for $T$ steps starting from $w_{0}$ with a step decay learning rate sequence $\{\eta_{t}\}_{t=1}^{T}$ defined in Algorithm 1 of Ge et al. (2019) with $\eta_{1}\leq 1/L$. As long as (1) $H$ is diagonal, (2) the equality in Assumption (1.7) holds, i.e. $\mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]=\sigma^{2}H$, and (3) $\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\neq 0$ for all $j=1,2,\dots,d$, the final iterate $w_{T+1}$ satisfies

\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]=\Omega\left(\frac{d\sigma^{2}}{T}\cdot\log T\right).

The proof is provided in Appendix G.4. Removing this extra $\log T$ term may not seem like a big deal in theory, but the experimental results suggest otherwise.

4 Experiments

To demonstrate eigencurve's practical value, empirical experiments are conducted on image classification tasks (code: https://github.com/opensource12345678/why_cosine_works/tree/main). Two well-known datasets are used: CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009). For full experimental results on more datasets, please refer to Appendix A.

4.1 Hessian Spectrum’s Skewness in Practice

According to the estimated eigenvalue distributions of the Hessian on CIFAR-10 and ImageNet (please refer to Appendix B.2 for details of the estimation and preprocessing procedure), shown in Figure 3, it can be observed that all of them are highly skewed and share a similar tendency: a large portion of small eigenvalues and a tiny portion of large eigenvalues. This phenomenon has also been observed and explained by other researchers in the past (Sagun et al., 2017; Arjevani & Field, 2020). On top of that, when we plot both the eigenvalues and the density in log scale, the "power law" emerges. Therefore, if the loss surface can be approximated by quadratic objectives, then eigencurve already achieves the optimal convergence rate in these practical settings. The exact values of the extra constant terms are presented in Appendix A.2.

Figure 3: The estimated eigenvalue distributions of the Hessian for ResNet-18 on CIFAR-10, GoogLeNet on CIFAR-10 and ResNet-18 on ImageNet, respectively. Note that the density is shown in log scale in all plots. First row: original scale for eigenvalues. Second row: log scale for preprocessed eigenvalues.

4.2 Image Classification on CIFAR-10 with Eigencurve Scheduling

This theoretical optimality translates into eigencurve's superior performance in practice, as demonstrated in Table 2 and Figure 4. The full set of figures is available in Appendix A.8. All models are trained with stochastic gradient descent (SGD), no momentum, batch size 128 and weight decay $wd=0.0005$. For full details of the experiment setup, please refer to Appendix B.

Table 2: CIFAR-10: training losses and test accuracy of different schedules. Step Decay denotes the scheduler proposed in Ge et al. (2019) and General Step Decay means the same type of scheduler with searched interval numbers and decay rates. “*” before a number means at least one occurrence of loss explosion among all 5 trial experiments.
| #Epoch | Schedule | ResNet-18 Loss | ResNet-18 Acc(%) | GoogLeNet Loss | GoogLeNet Acc(%) | VGG16 Loss | VGG16 Acc(%) |
| 10 | Inverse Time Decay | 1.58±0.02 | 79.45±1.00 | 2.61±0.00 | 86.54±0.94 | 2.26±0.00 | 84.47±0.74 |
| 10 | Step Decay | 1.82±0.04 | 73.77±1.48 | 2.59±0.02 | 87.04±0.48 | 2.42±0.45 | 82.98±0.27 |
| 10 | General Step Decay | 1.52±0.02 | 81.99±0.35 | 1.93±0.03 | 88.32±1.32 | 2.14±0.42 | 86.79±0.36 |
| 10 | Cosine Decay | 1.42±0.01 | 84.23±0.07 | 1.94±0.00 | 90.56±0.31 | 2.03±0.00 | 87.99±0.13 |
| 10 | Eigencurve | 1.36±0.01 | 85.62±0.28 | 1.33±0.00 | 90.65±0.15 | 1.87±0.00 | 88.73±0.11 |
| 100 | Inverse Time Decay | 0.73±0.00 | 90.82±0.43 | 0.62±0.02 | 92.05±0.69 | 1.32±0.62 | *76.24±13.77 |
| 100 | Step Decay | 0.26±0.01 | 91.39±1.03 | 0.28±0.00 | 92.83±0.15 | 0.59±0.00 | 91.37±0.20 |
| 100 | General Step Decay | 0.17±0.00 | 93.97±0.21 | 0.13±0.00 | 94.18±0.18 | 0.20±0.00 | *92.36±0.46 |
| 100 | Cosine Decay | 0.17±0.00 | 94.04±0.21 | 0.12±0.00 | 94.62±0.11 | 0.20±0.00 | 93.17±0.05 |
| 100 | Eigencurve | 0.14±0.00 | 94.05±0.18 | 0.12±0.00 | 94.75±0.15 | 0.18±0.00 | 92.88±0.24 |
Figure 4: Example: CIFAR-10 results for ResNet-18, with #Epoch = 100. Left: training losses. Right: test accuracy. For the full figures of this experiment, please refer to Appendix A.8.

4.3 Inspired Practical Schedules with Simple Forms

By simplifying the form of eigencurve and capturing some of its key properties, two simple and practical schedules are proposed: Elastic Step Decay and Cosine-power Decay, whose empirical performance is better than or at least comparable to that of cosine decay. Due to the page limit, we leave all the experimental results to Appendices A.5, A.6 and A.7.

Elastic Step Decay: \eta_{t}=\eta_{0}/2^{k}, \mbox{ if } t\in\left[(1-r^{k})T,(1-r^{k+1})T\right) \qquad (4.1)
Cosine-power Decay: \eta_{t}=\eta_{\min}+(\eta_{0}-\eta_{\min})\left[\frac{1}{2}\left(1+\cos\left(\frac{t}{t_{\max}}\pi\right)\right)\right]^{\alpha} \qquad (4.2)
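A minimal sketch (not the released code) of the two schedules in Eqns. (4.1) and (4.2) is given below; the hyperparameter values in the usage example are illustrative.

```python
# Minimal implementations of Elastic Step Decay (4.1) and Cosine-power Decay (4.2).
import math

def elastic_step_decay(t, T, eta0, r=0.5):
    # eta_t = eta0 / 2^k  for t in [(1 - r^k) T, (1 - r^(k+1)) T)
    assert 0 <= t < T
    k = 0
    while t >= (1.0 - r ** (k + 1)) * T:
        k += 1
    return eta0 / 2 ** k

def cosine_power_decay(t, t_max, eta0, eta_min=0.0, alpha=1.0):
    # eta_t = eta_min + (eta0 - eta_min) * [0.5 * (1 + cos(pi * t / t_max))]^alpha
    return eta_min + (eta0 - eta_min) * (0.5 * (1.0 + math.cos(math.pi * t / t_max))) ** alpha

T = 100
print([round(elastic_step_decay(t, T, eta0=1.0, r=0.5), 4) for t in range(0, T, 10)])
print([round(cosine_power_decay(t, T, eta0=1.0, alpha=2.0), 4) for t in range(0, T, 10)])
```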

5 Discussion

Cosine Decay and Eigencurve

For ResNet-18 on the CIFAR-10 dataset, the eigencurve scheduler produces a learning rate curve extremely similar to that of cosine decay, especially when the number of training epochs is set to 100, as shown in Figure 5. This directly links cosine decay to our theory: the empirically superior performance of cosine decay is very likely to stem from exploiting the "skewness" of the Hessian matrix's eigenvalues. In other situations, especially when the number of iterations is small, as shown in Table 2, eigencurve performs better than cosine decay.

Figure 5: Eigencurve's learning rate curves generated by the estimated eigenvalue distribution for ResNet-18 on CIFAR-10 when training for 50/100/200 epochs. The cosine decay learning rate curve (green) is also provided for comparison.

Sensitivity to Hessian's Eigenvalue Distributions

One limitation of eigencurve is that it requires a precomputed eigenvalue distribution of the objective function's Hessian matrix, which can be time-consuming to obtain for large models. This issue can be overcome by reusing the estimated eigenvalue distribution from similar settings. Further experiments on CIFAR-10 demonstrate the effectiveness of this approach; please refer to Appendix A.3 for more details. This evidence suggests that eigencurve's performance is not very sensitive to the estimated eigenvalue distribution.

Relationship with Numerically Near-optimal Schedulers

In Zhang et al. (2019), a dynamic programming algorithm was proposed to find almost optimal schedulers when the exact loss of the quadratic objective is accessible. While this is certainly the case, eigencurve still possesses several additional advantages over this type of approach. First, eigencurve can be used to find schedulers with simple forms. Compared with schedulers numerically computed by dynamic programming, eigencurve provides an analytic framework, so it can bypass the Hessian spectrum estimation process if useful assumptions on the Hessian spectrum, such as the "power law", are available. Second, eigencurve has a clear theoretical convergence guarantee. Dynamic programming can find almost optimal schedulers, but the convergence properties of the computed schedulers remain unclear. Our work fills this gap.

6 Conclusion

In this paper, a novel learning rate schedule named eigencurve is proposed, which utilizes the "skewness" of the objective's Hessian eigenvalue distribution and reaches minimax optimal convergence rates for SGD on quadratic objectives with skewed Hessian spectrums. This condition of skewed Hessian spectrums is observed and indeed satisfied in practical settings of image classification. Theoretically, eigencurve achieves a convergence guarantee no worse than step decay for quadratic functions and reaches the minimax optimal convergence rate (up to a constant) with skewed Hessian spectrums, e.g. under the "power law". Empirically, experimental results on CIFAR-10 show that eigencurve significantly outperforms step decay, especially when the number of epochs is small. The idea of eigencurve offers a possible explanation for cosine decay's effectiveness in practice and inspires two practical families of schedules with simple forms.

Acknowledgement

This work is supported by GRF 16201320. Rui Pan acknowledges support from the Hong Kong PhD Fellowship Scheme (HKPFS). The work of Haishan Ye was supported in part by National Natural Science Foundation of China under Grant No. 12101491.

References

  • Arjevani & Field (2020) Yossi Arjevani and Michael Field. Analytic characterization of the hessian in shallow relu models: A tale of symmetry, 2020.
  • Arora et al. (2019) Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net, 2019.
  • Bach & Moulines (2013) Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o (1/n). arXiv preprint arXiv:1306.2119, 2013.
  • Botev et al. (2017) Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  557–565. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/botev17a.html.
  • Bottou & Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis (eds.), Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2007/file/0d3180d672e08b4c5312dcdafdf6ef36-Paper.pdf.
  • Bottou et al. (2018) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning, 2018.
  • Byrd et al. (2016) Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
  • Chang & Lin (2011) Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  • Défossez & Bach (2015) Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, pp.  205–213. PMLR, 2015.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. IEEE, 2009.
  • Dua & Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Erdogdu & Montanari (2015) Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/404dcc91b2aeaa7caa47487d1483e48a-Paper.pdf.
  • Frostig et al. (2015) Roy Frostig, Rong Ge, Sham M Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on learning theory, pp.  728–763. PMLR, 2015.
  • Ge et al. (2019) Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. In Advances in Neural Information Processing Systems, pp. 14977–14988, 2019.
  • Goffin (1977) Jean-Louis Goffin. On convergence rates of subgradient optimization methods. Mathematical programming, 13(1):329–347, 1977.
  • Gotmare et al. (2018) Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation, 2018.
  • Goyal et al. (2018) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2018.
  • Grosse & Martens (2016) Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.  573–582, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/grosse16.html.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
  • Huang et al. (2020) Xunpeng Huang, Xianfeng Liang, Zhengyang Liu, Lei Li, Yue Yu, and Yitan Li. Span: A stochastic projected approximate newton method. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  1520–1527, 2020.
  • Jacot et al. (2020) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2020.
  • Jain et al. (2016) Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. arXiv preprint arXiv:1610.03774, 2016.
  • Jain et al. (2018) Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression, 2018.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. (2021) Xiaoyu Li, Zhenxun Zhuang, and Francesco Orabona. A second look at exponential and cosine step sizes: Simplicity, adaptivity, and performance, 2021.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
  • Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004.
  • Neu & Rosasco (2018) Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pp.  3222–3242. PMLR, 2018.
  • Robbins & Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp.  400–407, 1951.
  • Sagun et al. (2017) Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond, 2017.
  • Schraudolph (2002) Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural computation, 14(7):1723–1738, 2002.
  • Shamir (2012) Ohad Shamir. Is averaging needed for strongly convex stochastic gradient descent. Open problem presented at COLT, 2012.
  • Shamir & Zhang (2012) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, 2012.
  • Wu et al. (2021) Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham M Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. arXiv preprint arXiv:2110.06198, 2021.
  • Yang et al. (2021) Minghan Yang, Dong Xu, Hongyu Chen, Zaiwen Wen, and Mengyun Chen. Enhance curvature information by structured stochastic quasi-newton methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  10654–10663, June 2021.
  • Yao et al. (2020) Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. Pyhessian: Neural networks through the lens of the hessian, 2020.
  • Zaremba et al. (2015) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2015.
  • Zhang et al. (2019) Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model, 2019.

Appendix A More Experimental Results

A.1 Ridge Regression

We compare different types of schedulers on ridge regression:

f(w)=\frac{1}{n}\|Xw-Y\|_{2}^{2}+\alpha\|w\|_{2}^{2}.

This experiment serves only as an empirical verification of our theory. In fact, the optimum of ridge regression has a closed form and can be directly computed as

w_{*}=\left(X^{\top}X+n\alpha I\right)^{-1}X^{\top}Y.

Thus the optimal training loss $f(w_{*})$ can be calculated accordingly. In all experiments, we use the loss gap $f(w^{T})-f(w_{*})$ as our performance metric.

Experiments are conducted on the a4a dataset (Chang & Lin, 2011; Dua & Graff, 2017) (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a4a), which contains 4,781 samples and 123 features. This dataset is chosen mainly because it has a moderate number of samples and features, which enables us to compute the exact Hessian matrix $H=2(X^{\top}X/n+\alpha I)$ and its corresponding eigenvalue distribution with acceptable time and space consumption.
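The following is a minimal sketch of the closed-form optimum and Hessian computation described above; synthetic data of the same size stands in for a4a, whose loading step is omitted here.

```python
# Minimal sketch: closed-form optimum and exact Hessian for the ridge regression
# objective above, using synthetic data in place of a4a (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 4781, 123, 1e-3                    # a4a-sized, synthetic features
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

def loss(w):
    return np.sum((X @ w - Y) ** 2) / n + alpha * np.sum(w ** 2)

w_star = np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ Y)
H = 2.0 * (X.T @ X / n + alpha * np.eye(d))      # exact Hessian
eigvals = np.linalg.eigvalsh(H)                  # spectrum fed to eigencurve

print("optimal loss f(w_*):", loss(w_star))
print("mu, L, kappa:", eigvals[0], eigvals[-1], eigvals[-1] / eigvals[0])
```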

In all of our experiments, we set $\alpha=10^{-3}$. The model is optimized via SGD without momentum, with batch size 1, initial learning rate $\eta_{0}\in\{$0.1, 0.06, 0.03, 0.02, 0.01, 0.006, 0.003, 0.002, 0.001, 0.0006, 0.0003, 0.0002, 0.0001$\}$ and learning rate of the last iteration $\eta_{\min}\in\{$0.1, 0.01, 0.001, 0.0001, 0.00001, 0, "UNRESTRICTED"$\}$. Here "UNRESTRICTED" denotes the case where $\eta_{\min}$ is not set, which is useful for eigencurve, which can decide the learning rate curve without setting $\eta_{\min}$. Given $\eta_{0}$ and $\eta_{\min}$, we adjust all schedulers as follows. For inverse time decay $\eta_{t}=\eta_{0}/(1+\gamma\eta_{0}t)$ and exponential decay $\eta_{t}=\gamma^{t}\eta_{0}$, the hyperparameter $\gamma$ is computed accordingly based on $\eta_{0}$ and $\eta_{\min}$. For cosine decay, $\eta_{0}$ and $\eta_{\min}$ are directly used, with no restarts adopted. For eigencurve, the learning rate curve is linearly scaled to match the given $\eta_{\min}$.
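As a sketch of how $\gamma$ can be derived from $\eta_{0}$, $\eta_{\min}$ and the total number of iterations $T$ for the two decays above (the exact fitting convention used in the experiments is an assumption here):

```python
# Sketch of fitting gamma so that the schedule ends at eta_min after T iterations;
# this convention is an assumption, not necessarily the exact one used in the paper.
def inverse_time_gamma(eta0, eta_min, T):
    # eta_T = eta0 / (1 + gamma * eta0 * T) = eta_min
    return (eta0 / eta_min - 1.0) / (eta0 * T)

def exponential_gamma(eta0, eta_min, T):
    # eta_T = gamma^T * eta0 = eta_min
    return (eta_min / eta0) ** (1.0 / T)

eta0, eta_min, T = 0.1, 1e-4, 5 * 4781            # e.g. 5 epochs at batch size 1 on a4a
print(inverse_time_gamma(eta0, eta_min, T))
print(exponential_gamma(eta0, eta_min, T))
```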

In addition, for eigencurve, we use the eigenvalue distribution of the Hessian matrix, which is directly computed via eigenvalue decomposition, as shown in Figure 6.

Figure 6: The eigenvalue distribution of the Hessian for ridge regression on a4a. Left: original scale for eigenvalues. Right: log scale for eigenvalues. Note that the density is shown in log scale.

All experimental results demonstrate that eigencurve can obtain similar or better training losses when compared with other schedulers, as shown in Table 3.

Table 3: Ridge regression: training loss gaps of different schedules over 5 trials.
Training loss − optimal training loss: $f(w^{T})-f(w_{*})$
| Schedule | #Epoch = 1 | #Epoch = 5 | #Epoch = 25 | #Epoch = 250 |
| Constant | 0.014963±0.001369 | 0.004787±0.000175 | 0.001351±0.000179 | 0.000122±0.000009 |
| Inverse Time Decay | 0.007284±0.000190 | 0.002098±0.000160 | 0.000637±0.000143 | 0.000011±0.000001 |
| Exponential Decay | 0.008351±0.000360 | 0.000931±0.000100 | 0.000048±0.000007 | 0.000000±0.000000 |
| Cosine Decay | 0.007767±0.000006 | 0.001167±0.000142 | 0.000054±0.000005 | 0.000000±0.000000 |
| Eigencurve | 0.006977±0.000197 | 0.000676±0.000069 | 0.000045±0.000008 | 0.000000±0.000000 |

A.2 Exact Value of the Extra Term on CIFAR-10 Experiments

In Section 4.1, we have already given qualitative evidence showing eigencurve's optimality for practical settings on CIFAR-10. Here we strengthen this argument by providing quantitative evidence as well. The exact values of the extra term are presented in Table 4, where we assume CIFAR-10 uses batch size 128, 200 epochs and weight decay $5\times 10^{-4}$, while ImageNet uses batch size 256, 90 epochs and weight decay $10^{-4}$.

Table 4: Convergence rate of SGD with common schedulers given the estimated eigenvalue distribution of Hessian, assuming the objective is quadratic.
(The last four columns give the value of the extra term.)
| Scheduler | Convergence rate of SGD on quadratic objectives | CIFAR-10 + ResNet-18 | CIFAR-10 + GoogLeNet | CIFAR-10 + VGG16 | ImageNet + ResNet-18 |
| Inverse Time Decay | $\Theta\left(\frac{d\sigma^{2}}{T}\cdot\kappa\right)$ | $3.39\times 10^{5}$ | $4.92\times 10^{5}$ | $6.50\times 10^{5}$ | $6.80\times 10^{6}$ |
| Step Decay | $\Theta\left(\frac{d\sigma^{2}}{T}\cdot\log T\right)$ | 16.25 | 16.25 | 16.25 | 18.78 |
| Eigencurve | $\mathcal{O}\left(\frac{d\sigma^{2}}{T}\cdot\frac{\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{d}\right)$, where $I_{\max}=\log_{2}\kappa$, $s_{i}=\#\lambda_{j}\in[\mu\cdot 2^{i},\mu\cdot 2^{i+1})$ | 8.15 | 5.97 | 7.12 | 12.61 |
| Minimax optimal rate | $\Omega\left(\frac{d\sigma^{2}}{T}\right)$ | 1 | 1 | 1 | 1 |

It is worth noting that eigencurve's extra term is independent of the number of iterations $T$, since the value $\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}/d$ only depends on the Hessian spectrum. Hence eigencurve essentially achieves the minimax optimal rate (up to a constant) for the models and datasets listed in Table 4, provided the loss landscape around the optimum can be approximated by quadratic functions. For full details of the estimation process, please refer to Appendix B.

A.3 Reusing eigencurve for Different Models on CIFAR-10

For image classification tasks on CIFAR-10, we check the performance of reusing ResNet-18's eigenvalue distribution for other models. As shown in Table 5, the experimental results demonstrate that the Hessian eigenvalue distribution of ResNet-18 on CIFAR-10 can be applied to GoogLeNet/VGG16 and still achieves good performance. The experiment settings are exactly the same as in Section 4.2 of the main paper.

Table 5: CIFAR-10: training losses and test accuracy of different schedules over 5 trials. Here all eigencurve schedules are generated based on ResNet-18’s Hessian spectrums. “*” before a number means at least one occurrence of loss explosion among all 5 trial experiments.
| #Epoch | Schedule | GoogLeNet Loss | GoogLeNet Acc(%) | VGG16 Loss | VGG16 Acc(%) |
| 10 | Inverse Time Decay | 2.61±0.00 | 86.54±0.94 | 2.26±0.00 | 84.47±0.74 |
| 10 | Step Decay | 2.59±0.02 | 87.04±0.48 | 2.42±0.45 | 82.98±0.27 |
| 10 | General Step Decay | 1.93±0.03 | 88.32±1.32 | 2.14±0.42 | 86.79±0.36 |
| 10 | Cosine Decay | 1.94±0.00 | 90.56±0.31 | 2.03±0.00 | 87.99±0.13 |
| 10 | Eigencurve (transferred) | 1.65±0.00 | 91.17±0.20 | 1.89±0.00 | 88.17±0.32 |
| 100 | Inverse Time Decay | 0.62±0.02 | 92.05±0.69 | 1.32±0.62 | *76.24±13.77 |
| 100 | Step Decay | 0.28±0.00 | 92.83±0.15 | 0.59±0.00 | 91.37±0.20 |
| 100 | General Step Decay | 0.13±0.00 | 94.18±0.18 | 0.20±0.00 | *92.36±0.46 |
| 100 | Cosine Decay | 0.12±0.00 | 94.62±0.11 | 0.20±0.00 | 93.17±0.05 |
| 100 | Eigencurve (transferred) | 0.11±0.00 | 94.81±0.19 | 0.20±0.00 | 93.17±0.09 |

A.4 Comparison with Exponential Moving Average on CIFAR-10

Besides learning rate schedules, the Exponential Moving Average (EMA) method

\overline{w}_{t}=\alpha\sum_{k=0}^{t}(1-\alpha)^{t-k}w_{k}\quad\Longleftrightarrow\quad\overline{w}_{t}=\alpha w_{t}+(1-\alpha)\overline{w}_{t-1}

is another competitive practical method that is commonly adopted in training neural networks with SGD. Thus, it is natural to ask whether eigencurve can beat this method as well. The answer is yes. In Table 6, we present additional experimental results on CIFAR-10 comparing the performance of eigencurve and exponential moving averaging. A large performance gap between the two methods can be observed.
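For reference, the following is a minimal sketch of the EMA recursion above applied to constant-step SGD on a toy diagonal quadratic; the objective, noise model and parameter values are illustrative assumptions, not the CIFAR-10 setting.

```python
# Minimal sketch of the EMA of SGD iterates, w_bar_t = alpha*w_t + (1-alpha)*w_bar_{t-1},
# on a toy diagonal quadratic (all parameter values are illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, T, alpha, eta = 10, 2000, 0.1, 0.05
lam = np.linspace(1.0, 0.1, d)                   # diagonal quadratic with w_* = 0

w = np.ones(d)
w_bar = w.copy()
for t in range(T):
    grad = lam * w + 0.1 * np.sqrt(lam) * rng.normal(size=d)
    w = w - eta * grad                           # constant-step SGD
    w_bar = alpha * w + (1.0 - alpha) * w_bar    # EMA of the iterates

print("last iterate loss:", 0.5 * np.sum(lam * w ** 2))
print("EMA iterate loss :", 0.5 * np.sum(lam * w_bar ** 2))
```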

Table 6: CIFAR-10: training losses and test accuracy of Exponential Moving Average (EMA) and eigencurve with #Epoch = 100 over 5 trials. For EMA, we search its constant learning rate $\eta_{t}=\eta_{0}\in\{1.0,0.6,0.3,0.2,0.1\}$ and decay $\alpha\in\{0.9,0.95,0.99,0.995,0.999\}$. Other settings remain the same as in Section 4.2.
| Method/Schedule | ResNet-18 Loss | ResNet-18 Acc(%) | GoogLeNet Loss | GoogLeNet Acc(%) | VGG16 Loss | VGG16 Acc(%) |
| EMA | 0.30±0.01 | 90.09±0.82 | 0.33±0.01 | 93.42±0.26 | 0.49±0.00 | 91.87±0.82 |
| Eigencurve | 0.14±0.00 | 94.05±0.18 | 0.12±0.00 | 94.75±0.15 | 0.18±0.00 | 92.88±0.24 |

A.5 ImageNet Classification with Elastic Step Decay

One key observation from the CIFAR-10 experiments is the existence of the "power law" shown in Figure 3. Also, notice that in the form of eigencurve, specifically Eqn. (3.5), the iteration interval length $\Delta_{i}$ is proportional to the square root of the eigenvalue density $s_{i}$ in the range $[\mu\cdot 2^{i},\mu\cdot 2^{i+1})$. Combining these two facts suggests that the lengths of the "learning rate intervals" should decrease exponentially.

Based on this idea, Elastic Step Decay (ESD) is proposed, which has the following form,

\eta_{t}=\eta_{0}/2^{k}, \mbox{ if } t\in\left[(1-r^{k})T,(1-r^{k+1})T\right)

Compared to general step decay with adjustable interval lengths, Elastic Step Decay does not require manual adjustment of the length of each interval. Instead, the lengths are all controlled by a single hyperparameter $r\in(0,1)$, which determines the "shrinking speed" of the interval lengths. Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate its superiority in practice, as shown in Tables 7 and 8.
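As a small illustration of this "shrinking speed", the snippet below lists the decay points $(1-r^{k})T$ for a hypothetical 100-epoch run; the two values of $r$ are the ones searched in the ImageNet experiments below.

```python
# Decay points (1 - r^k) * T of Elastic Step Decay for T = 100 (illustrative).
T = 100
for r in (1 / 2, 2 ** -0.5):
    bounds = []
    k = 1
    while (1 - r ** k) * T < T - 1e-9 and k <= 10:
        bounds.append(round((1 - r ** k) * T, 1))
        k += 1
    print(f"r = {r:.3f}: decay points at epochs {bounds}")
```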

For the experiments on CIFAR-10/CIFAR-100, we adopt the same settings as for eigencurve, except that we use common step decay with three equal-length intervals and a decay factor of 10.

Table 7: Elastic Step Decay on CIFAR-10/CIFAR-100: test accuracy(%) of different schedules over 5 trials. “*” before a number means at least one occurrence of loss explosion among all 5 trial experiments.
| #Epoch | Schedule | ResNet-18 CIFAR-10 | ResNet-18 CIFAR-100 | GoogLeNet CIFAR-10 | GoogLeNet CIFAR-100 | VGG16 CIFAR-10 | VGG16 CIFAR-100 |
| 10 | Inverse Time Decay | 79.45±1.00 | 48.73±1.66 | 86.54±0.94 | 57.90±1.27 | 84.47±0.74 | 50.04±0.83 |
| 10 | Step Decay | 79.67±0.74 | 54.54±0.26 | 88.37±0.13 | 63.05±0.35 | 85.18±0.06 | 45.86±0.31 |
| 10 | Cosine Decay | 84.23±0.07 | 61.26±1.11 | 90.56±0.31 | 69.09±0.27 | 87.99±0.13 | 55.42±0.28 |
| 10 | ESD | 85.38±0.38 | 64.17±0.57 | 91.23±0.33 | 70.46±0.41 | 88.67±0.21 | 57.23±0.39 |
| 100 | Inverse Time Decay | 90.82±0.43 | 69.82±0.37 | 92.05±0.69 | 73.54±0.28 | *76.24±13.77 | 67.70±0.49 |
| 100 | Step Decay | 93.68±0.07 | 73.13±0.12 | 94.13±0.32 | 76.80±0.16 | 92.62±0.15 | 70.02±0.41 |
| 100 | Cosine Decay | 94.04±0.21 | 74.65±0.41 | 94.62±0.11 | 78.13±0.54 | 93.17±0.05 | 72.47±0.28 |
| 100 | ESD | 94.06±0.11 | 74.76±0.33 | 94.65±0.11 | 78.23±0.20 | 93.25±0.12 | 72.50±0.26 |

For the experiments on ImageNet, we use ResNet-50 trained via SGD without momentum, batch size 256 and weight decay $wd=10^{-4}$. Since no momentum is used, the initial learning rate is set to $\eta_{0}=1.0$ instead of $\eta_{0}=0.1$. Two step decay baselines are adopted. "Step Decay [30-60]" is the common choice that decays the learning rate 10-fold at the end of epoch 30 and epoch 60. "Step Decay [30-60-80]" is another popular choice for the ImageNet setting (Goyal et al., 2018), which further decays the learning rate 10-fold at epoch 80. For the cosine decay scheduler, the hyperparameter $\eta_{\min}$ is set to 0. As for the dataset, we use the common ILSVRC 2012 dataset, which contains 1000 classes, around 1.2M images for training and 50,000 images for validation. For all experiments, we search $r\in\{1/2,1/\sqrt{2}\}$ for ESD, with the other hyperparameter search and selection processes being the same as for eigencurve.

Table 8: Elastic Step Decay on ImageNet-1k: losses and validation accuracy of different schedules for ResNet-50 with #Epoch = 90 over 3 trials.
| Schedule | Training loss | Top-1 validation acc(%) | Top-5 validation acc(%) |
| Step Decay [30-60] | 1.4726±0.0057 | 75.55±0.13 | 92.63±0.08 |
| Step Decay [30-60-80] | 1.4738±0.0080 | 76.05±0.33 | 92.83±0.15 |
| Cosine Decay | 1.4697±0.0049 | 76.57±0.07 | 93.25±0.05 |
| ESD ($r=1/\sqrt{2}$) | 1.4317±0.0027 | 76.79±0.10 | 93.31±0.05 |

A.6 Language Modeling with Elastic Step Decay

More experiments on language modeling are conducted to further demonstrate Elastic Step Decay’s superiority over other schedulers.

For all experiments, we follow almost the same setting as in Zaremba et al. (2015), where a large regularized LSTM recurrent neural network (Hochreiter & Schmidhuber, 1997) is trained on Penn Treebank (Marcus et al., 1993) for the language modeling task. The Penn Treebank dataset has a training set of 929k words, a validation set of 73k words and a test set of 82k words. SGD without momentum is adopted for training, with batch size 20 and 35 unrolling steps in the LSTM.

Other details are exactly the same, except for the number of training epochs. Zaremba et al. (2015) train the large regularized LSTM for 55 epochs, which we reduce to 30 epochs in our setting, since we found that the model starts overfitting after 30 epochs. We conducted a hyperparameter search for all schedules, as shown in Table 9; a code sketch of these parameterizations is given after the table.

Table 9: Hyperparameter search for schedulers.
Scheduler Form Hyperparameter choices
Inverse Time Decay ηt=η01+λη0t\eta_{t}=\frac{\eta_{0}}{1+\lambda\cdot\eta_{0}\cdot t} η0{100,101,102,103}\eta_{0}\in\{10^{0},10^{-1},10^{-2},10^{-3}\}, and set λ\lambda, so that ηmin{102,103,104,105,106}\eta_{\min}\in\{10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\}
General Step Decay ηt=η0γk\eta_{t}=\eta_{0}\cdot\gamma^{k}, if t[k,k+1)TKt\in\left[k,k+1\right)\cdot\frac{T}{K} η0=1\eta_{0}=1, K{3,4,5,logT,logT+1}K\in\{3,4,5,\left\lfloor\log T\right\rfloor,\left\lfloor\log T\right\rfloor+1\}, γ{12,15,110}\gamma\in\{\frac{1}{2},\frac{1}{5},\frac{1}{10}\}
Cosine Decay ηt=ηmin+12(η0ηmin)(1+cos(tπT))\eta_{t}=\eta_{\min}+\frac{1}{2}\left(\eta_{0}-\eta_{\min}\right)\left(1+\cos\left(\frac{t\pi}{T}\right)\right) η0{100,101,102,103}\eta_{0}\in\{10^{0},10^{-1},10^{-2},10^{-3}\}, ηmin{102,103,104,0}\eta_{\min}\in\{10^{-2},10^{-3},10^{-4},0\}
Elastic Step Decay ηt=η0/2k,\eta_{t}=\eta_{0}/2^{k}, if t[(1rk)T,(1rk+1)T)\mbox{ if }t\in\left[(1-r^{k})T,(1-r^{k+1})T\right) η0=1\eta_{0}=1, r{21,21/2,21/3,21/5,21/20}r\in\{2^{-1},2^{-1/2},2^{-1/3},2^{-1/5},2^{-1/20}\},
Baseline ηt={η0 for first 14 epochsη01.15k for epoch k+14\eta_{t}=\begin{cases}\eta_{0}&\mbox{ for first 14 epochs}\\ \frac{\eta_{0}}{1.15^{k}}&\mbox{ for epoch }k+14\end{cases} η0=1\eta_{0}=1
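For reference, the parameterizations in Table 9 can be written as the following minimal sketch (function names are ours; the elastic step decay form was already sketched earlier). For inverse time decay, \lambda is obtained by solving \eta_0/(1+\lambda\eta_0 T)=\eta_{\min}.

```python
import math

def inverse_time_decay_lr(t, T, eta0, eta_min):
    # Set lam so that the learning rate at the last iteration equals eta_min.
    lam = (eta0 / eta_min - 1.0) / (eta0 * T)
    return eta0 / (1.0 + lam * eta0 * t)

def general_step_decay_lr(t, T, eta0, K, gamma):
    # K equal-length intervals; multiply by gamma when crossing each boundary.
    k = min(int(t * K / T), K - 1)
    return eta0 * gamma ** k

def cosine_decay_lr(t, T, eta0, eta_min):
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(t * math.pi / T))
```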

Experimental results show that Elastic Step Decay significantly outperforms other schedulers, as shown in Table 10.

Table 10: Scheduler performance on LSTM + Penn Treebank over 5 trials.
Scheduler Validation perplexity Test perplexity
Inverse Time Decay 114.9±\pm1.1 112.7±\pm1.1
General Step Decay 82.4±\pm0.1 79.1±\pm0.2
Baseline (Zaremba et al., 2015) 82.2 78.4
Cosine Decay 82.4±\pm0.4 78.5±\pm0.4
Elastic Step Decay 81.1±\pm0.2 77.4±\pm0.3

A.7 Image Classification on ImageNet with Cosine-power Scheduling

Another key observation from the CIFAR-10 experiments is that the shape of eigencurve's learning rate curve changes with a consistent tendency: the fewer the training epochs, the more “concave” the learning rate curve. This inspires the cosine-power schedule of the following form.

Cosine-power:ηt=ηmin+(η0ηmin)[12(1+cos(ttmaxπ))]α\displaystyle\text{Cosine-power}:\eta_{t}=\eta_{\min}+(\eta_{0}-\eta_{\min})\left[\frac{1}{2}(1+\cos(\frac{t}{t_{\max}}\pi))\right]^{\alpha}
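A minimal sketch of this schedule follows (variable names are ours). Note that \frac{1}{2}(1+\cos(\frac{t}{t_{\max}}\pi))=\cos^{2}(\frac{\pi t}{2t_{\max}}), so \alpha=0.5 gives the concave curve \cos(\frac{\pi t}{2t_{\max}}), while larger \alpha decays the learning rate earlier.

```python
import math

def cosine_power_lr(t, t_max, alpha, eta0, eta_min=0.0):
    # alpha = 1 recovers the standard cosine decay; alpha = 0.5 and alpha = 2
    # correspond to the sqrt-Cosine and Cosine^2 schedules in Table 11.
    base = 0.5 * (1.0 + math.cos(t * math.pi / t_max))
    return eta_min + (eta0 - eta_min) * base ** alpha
```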

Results in Table 11 show the schedules' performance with \alpha=0.5/1/2, denoted as \sqrt{\text{Cosine}}/Cosine/\text{Cosine}^{2} respectively. Notice that the best schedule gradually moves from small \alpha to larger \alpha as the number of epochs increases. For #Epoch=270, the number of epochs is large enough for the model to converge, so it is reasonable that the accuracy gap between all schedules is small.

For experiments on ImageNet, we use ResNet-18 trained via SGD without momentum, batch size 256 and weight decay wd=10^{-4}. Since no momentum is used, the initial learning rate is set to \eta_0=1.0 instead of \eta_0=0.1. The hyperparameter \eta_{\min} is set to 0 for all cosine-power schedulers. As for the dataset, we use the common ILSVRC 2012 dataset, which contains 1000 classes, around 1.2M images for training and 50,000 images for validation.

Table 11: Cosine-power Decay on ImageNet: training losses and validation accuracy (%) of different schedules for ResNet-18 over 3 trials. Settings with #Epoch\geq 90 have only 1 trial due to resource and time constraints.
#Epoch Schedule Training loss Top-1 validation acc (%) Top-5 validation acc (%)
1 Cosine\sqrt{\text{Cosine}} 5.4085±\pm0.0080 30.01±\pm0.21 55.26±\pm0.33
Cosine 5.4330±\pm0.0106 26.43±\pm0.31 50.85±\pm0.43
Cosine2\text{Cosine}^{2} 5.4939±\pm0.0157 21.81±\pm0.21 44.53±\pm0.09
5 Cosine\sqrt{\text{Cosine}} 2.9515±\pm0.0057 57.27±\pm0.15 80.71±\pm0.12
Cosine 2.8389±\pm0.0061 55.67±\pm0.08 79.46±\pm0.16
Cosine2\text{Cosine}^{2} 2.9160±\pm0.0099 52.75±\pm0.20 77.11±\pm0.08
30 Cosine\sqrt{\text{Cosine}} 2.1739±\pm0.0046 67.56±\pm0.03 87.82±\pm0.09
Cosine 2.0402±\pm0.0031 67.97±\pm0.10 88.12±\pm0.03
Cosine2\text{Cosine}^{2} 2.0525±\pm0.0032 67.41±\pm0.05 87.70±\pm0.10
90 Cosine\sqrt{\text{Cosine}} 1.9056 69.85 89.46
Cosine 1.7676 70.46 89.75
Cosine2\text{Cosine}^{2} 1.7403 70.42 89.69
270 Cosine\sqrt{\text{Cosine}} 1.7178 71.37 90.31
Cosine 1.5756 71.93 90.33
Cosine2\text{Cosine}^{2} 1.5250 71.69 90.37
Figure 7: Learning rate curves of the three cosine-power schedulers. Top: original scale; Bottom: log scale.

A.8 Full Figures for Eigencurve Experiments in Section 4.2

Please refer to Figures 8, 9, 10, 11, 12 and 13.

Figure 8: CIFAR-10 results for ResNet-18, with #Epoch = 10. Left: training losses. Right: test accuracy.
Figure 9: CIFAR-10 results for GoogLeNet, with #Epoch = 10. Left: training losses. Right: test accuracy.
Figure 10: CIFAR-10 results for VGG16, with #Epoch = 10. Left: training losses. Right: test accuracy.
Figure 11: CIFAR-10 results for ResNet-18, with #Epoch = 100. Left: training losses. Right: test accuracy.
Figure 12: CIFAR-10 results for GoogLeNet, with #Epoch = 100. Left: training losses. Right: test accuracy.
Figure 13: CIFAR-10 results for VGG16, with #Epoch = 100. Left: training losses. Right: test accuracy.

Appendix B Detailed Experimental Settings for Image Classification on CIFAR-10/CIFAR-100

B.1 Basic Settings

As mentioned in the main paper, all models are trained with stochastic gradient descent (SGD), no momentum, batch size 128 and weight decay wd=0.0005. Furthermore, we perform a grid search to choose the best hyperparameters for all schedulers, using a validation set of 5,000 samples taken from the training set, i.e. one-tenth of the training set; the remaining 45,000 samples are used for training the model. After obtaining the hyperparameters with the best validation accuracy, we train the model again on the full training set and evaluate it on the test set, running 5 trials. The mean and standard deviation of the test results are reported.

Here the grid search explores hyperparameters \eta_0\in\{1.0,0.6,0.3,0.2,0.1\} and \eta_{\min}\in\{0.01,0.001,0.0001,0,\text{“UNRESTRICTED”}\}, where \eta_0 denotes the initial learning rate and \eta_{\min} stands for the learning rate of the last iteration. “UNRESTRICTED” denotes the case where \eta_{\min} is not set, which is useful for eigencurve, since it can decide the learning rate curve without setting \eta_{\min}. Given \eta_0 and \eta_{\min}, we adjust all schedulers as follows. For inverse time decay, the hyperparameter \gamma is computed accordingly from \eta_0 and \eta_{\min}. For cosine decay, \eta_0 and \eta_{\min} are used directly, with no restarts. For general step decay, we search the interval number in \{3,4,5,\lfloor\log T\rfloor,\lfloor\log T\rfloor+1\} and the decay factor in \{2,5,10\}. For the step decay proposed in Ge et al. (2019), the interval number is fixed to \lfloor\log T\rfloor, along with a decay factor of 2. For eigencurve, two major modifications are made to make it more suitable for practical settings:

ηt=1/L1+1κj=1i1Δj2j1+2i1κ(tti1)=𝜼𝟎1+1κj=1i1Δj𝜷j1+𝜷i1κ(tti1).\displaystyle\eta_{t}=\frac{1/L}{1+\frac{1}{\kappa}\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+\frac{2^{i-1}}{\kappa}(t-t_{i-1})}=\frac{\bm{\eta_{0}}}{1+\frac{1}{\kappa}\sum_{j=1}^{i-1}\Delta_{j}\bm{\beta}^{j-1}+\frac{\bm{\beta}^{i-1}}{\kappa}(t-t_{i-1})}.

Here we change 1/L1/L to η0\eta_{0} so that it is possible to adjust the initial learning rate of eigencurve. We also change the fixed constant 22 to a general constant β>1\beta>1, which is aimed at making the learning rate curve smoother. The learning rate curve of eigencurve is then linearly scaled to match the given ηmin\eta_{\min}.
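A minimal sketch of this practical form is given below, assuming the interval boundaries t_0=0, t_1, \dots, t_I (and hence the \Delta_j) and the estimated condition number \kappa have already been produced by the eigencurve construction; the final linear rescaling to a given \eta_{\min} is left out.

```python
import bisect

def practical_eigencurve_lr(t, boundaries, eta0, kappa, beta):
    """Learning rate at step t for the practical eigencurve schedule (a sketch).

    boundaries = [t_0 = 0, t_1, ..., t_I] are precomputed interval end points,
    kappa is the estimated condition number, and beta > 1 replaces the original
    constant 2; beta close to 1 yields a smoother curve.
    """
    i = bisect.bisect_right(boundaries, t)          # t lies in [t_{i-1}, t_i)
    acc = sum((boundaries[j] - boundaries[j - 1]) * beta ** (j - 1)
              for j in range(1, i))                 # sum_{j=1}^{i-1} Delta_j beta^{j-1}
    denom = 1.0 + acc / kappa + beta ** (i - 1) * (t - boundaries[i - 1]) / kappa
    return eta0 / denom
```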

Notice that the learning rate \eta_0 can be larger than 1/L while the loss still does not explode. There are several explanations for this phenomenon. First, in the basic non-smooth analysis of GD and SGD with an inverse time decay scheduler, the learning rate can be larger than 1/L as long as the gradient norm is bounded (Shamir & Zhang, 2012). Second, deep learning has a non-convex loss landscape, especially when the parameters are far away from the optimum, so it is common to use a larger learning rate at first; this is acceptable as long as the loss does not explode. We therefore still include large learning rates \eta_0 in our grid search.

B.2 Settings for Eigencurve

In addition, for our eigencurve scheduler, we use PyHessian (Yao et al., 2020) to estimate the Hessian's eigenvalue distribution for all models. The whole process consists of three phases, described as follows.

1) Training the model

Almost all CNN models on CIFAR-10 have non-convex objectives, so the Hessian's eigenvalue distribution differs at different parameter values. Ideally, we want this distribution to reflect the overall tendency of most of the training process. According to the phenomenon demonstrated in Appendix E, Figures A.11-A.17 of Yao et al. (2020), the eigenvalue distribution of ResNet's Hessian presents a similar tendency after 30 epochs of training, which suggests that the Hessian's eigenvalue distribution can be used after sufficient training.

In all CIFAR-10 experiments, we use the Hessian’s eigenvalue distribution of models after training 180 epochs. Since the goal here is to sufficiently train the model, not to obtain good performance, common baseline settings are adopted for training. For all models used for eigenvalue distribution estimation, we adopt SGD with momentum =0.9=0.9, batch size 128128, weight decay wd=0.0005wd=0.0005 and initial learning rate 0.10.1. On top of that, we use step decay, which decays the learning rate by a factor of 1010 at epoch 8080 and 120120. All of them are default settings of the PyHessian code (https://github.com/amirgholami/PyHessian/blob/master/training.py, commit: f4c3f77).
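In PyTorch terms, this baseline training recipe corresponds roughly to the following sketch; `model` is a placeholder for one of the compressed CNNs described later, and this is our restatement of the settings above, not the PyHessian training script itself.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the (compressed) CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by a factor of 10 at epochs 80 and 120 (of 180 total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 120], gamma=0.1)
```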

ImageNet adopts a similar setting, with 90 training epochs, SGD with momentum 0.9, batch size 256, weight decay wd=0.0001, initial learning rate 0.1 and a step decay schedule that decays the learning rate by a factor of 10 at epochs 30 and 60.

2) Estimating Hessian matrix’s eigenvalue distribution for the trained model

After obtaining the checkpoint of a sufficiently trained model, we run PyHessian to estimate the Hessian's eigenvalue distribution for that checkpoint. The goal here is to obtain the Hessian's eigenvalue distribution with sufficient precision, i.e. with sufficiently short intervals around each estimated eigenvalue. PyHessian estimates the eigenvalue spectral density (ESD) of a model's Hessian: the output is a list of eigenvalue intervals together with the density of each interval, where the densities sum up to 1. Precision here refers to the interval length.

The estimation precision is naturally tied to the complexity of the PyHessian algorithm: the better the precision, the more time and space it consumes. More specifically, the algorithm has a time complexity of \mathcal{O}(Nn_v^2 d) and a space complexity of \mathcal{O}(Bd+n_v d), where d is the number of model parameters, N is the number of samples used for estimating the ESD, B is the batch size and n_v is the number of iterations of the Stochastic Lanczos Quadrature used in PyHessian, which controls the estimation precision (see Algorithm 1 of Yao et al. (2020)).

In our experiments, we use nv=5000n_{v}=5000 for ResNet-18 and nv=3000n_{v}=3000 for GoogLeNet/VGG16, which gives an eigenvalue distribution estimation with precision around 10510^{-5} to 10410^{-4}. NN and BB are both set to 200200 due to GPU memory constraint, i.e. we use one mini-batch to estimate the eigenvalue distribution. It turns out that this one-batch estimation is good enough and yields similar results to full dataset settings shown in Yao et al. (2020).

However, space complexity is still a bottleneck here. Due to the large value of n_v and the \mathcal{O}(Bd+n_v d) space complexity of PyHessian, d cannot be very large. In practice, with an NVIDIA GeForce 2080 Ti GPU, which has around 11GB of memory, the maximum acceptable number of parameters d is around 200K-400K. This implies that the model has to be compressed. In our experiments, we reduce the number of channels by a factor of C for all models: C=16 for ResNet-18, C=4 for GoogLeNet and C=8 for VGG16. Notice that these compressed models are only used for eigenvalue distribution estimation; in the experiments comparing different schedulers, we still use the original models with no compression.
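As a rough, assumption-laden illustration of this bottleneck (counting only the n_v Lanczos vectors in single precision and ignoring the model, activations and optimizer state):

```python
# Approximate memory for the n_v * d term of the O(Bd + n_v d) space cost.
n_v, d = 5000, 400_000            # ResNet-18 setting and the upper end of d
bytes_needed = n_v * d * 4        # 4 bytes per float32 entry
print(f"{bytes_needed / 2**30:.1f} GiB")  # ~7.5 GiB, already close to the ~11 GB of a 2080 Ti
```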

One may refer to https://github.com/opensource12345678/why_cosine_works/tree/main/eigenvalue_distribution for generated eigenvalue distributions.

3) Generating eigencurve scheduler with the estimated eigenvalue distribution

After obtaining the eigenvalue distribution, we apply a preprocessing step before plugging it into our eigencurve scheduler.

First, we notice that there are negative eigenvalues in the final distribution. Theoretically, if the parameters were exactly at the optimal point, the Hessian matrix would have no negative eigenvalues. We therefore conjecture that these negative eigenvalues arise because the model is close to, but not exactly at, the optimum w_*; the estimation precision loss can be another cause. In fact, most of these negative eigenvalues are small, e.g. 98.6% of them lie in [-0.1,0), and can generally be ignored without much loss. In our case, we set them to their absolute values.

Second, for a given weight decay value wd, we need to take the implicit L2 regularization into account, since it affects the Hessian matrix as well. Therefore, after the first step, we add wd to all eigenvalues.
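A minimal numpy sketch of this two-step preprocessing (the function name is ours):

```python
import numpy as np

def preprocess_eigenvalues(eigenvalues, wd):
    """Preprocess the estimated eigenvalue distribution before building eigencurve.

    Step 1: replace the (small) negative eigenvalues by their absolute values,
            treating them as artifacts of not being exactly at the optimum
            and of finite estimation precision.
    Step 2: add the weight decay wd to all eigenvalues, accounting for the
            implicit L2 regularization's contribution to the Hessian.
    """
    eig = np.abs(np.asarray(eigenvalues, dtype=np.float64))
    return eig + wd
```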

After preprocessing, we plug the eigenvalue distribution into our eigencurve scheduler and generate the exact form of eigencurve.

ηt=1/L1+1κj=1i1Δj2j1+2i1κ(tti1)=𝜼𝟎1+1κj=1i1Δj𝜷j1+𝜷i1κ(tti1)\displaystyle\eta_{t}=\frac{1/L}{1+\frac{1}{\kappa}\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+\frac{2^{i-1}}{\kappa}(t-t_{i-1})}=\frac{\bm{\eta_{0}}}{1+\frac{1}{\kappa}\sum_{j=1}^{i-1}\Delta_{j}\bm{\beta}^{j-1}+\frac{\bm{\beta}^{i-1}}{\kappa}(t-t_{i-1})}

For experiments with 100 epochs, we set \beta=1.000005 so that the learning rate curve is much smoother. For experiments with 10 epochs, we set \beta=2.0. In our experiments, \beta serves as a fixed constant, not a hyperparameter, so no hyperparameter search is conducted on \beta. One can do that in practice though, if computation resources allow.

B.3 Compute Resource and Implementation Details

All the code for results in main paper can be found in https://github.com/opensource12345678/why_cosine_works/tree/main, which is released under the MIT license.

All experiments on CIFAR-10/CIFAR-100 are conducted on a single NVIDIA GeForce 2080 Ti GPU, where ResNet-18/GoogLeNet/VGG16 takes around 20mins/90mins/40mins to train 100 epochs, respectively. High-precision eigenvalue distribution estimation, e.g. nv3000n_{v}\geq 3000, requires around 1-2 days to complete, but this is no longer necessary given the released results.

The ResNet-18 model is implemented in TensorFlow 2.0; we use tensorflow-gpu 2.3.1 in our code. The GoogLeNet and VGG16 models are implemented in PyTorch, specifically 1.7.0+cu101.

B.4 License of PyHessian

According to https://github.com/amirgholami/PyHessian/blob/master/LICENSE, PyHessian (Yao et al., 2020) is released under the MIT License.

Appendix C Detailed Experimental Settings for Image Classification on ImageNet

One may refer to https://www.image-net.org/download for the specific terms of access for ImageNet. The dataset can be downloaded from https://image-net.org/challenges/LSVRC/2012/2012-downloads.php, with the training set being “Training images (Task 1 & 2)” and the validation set being “Validation images (all tasks)”. Notice that registration and institutional verification are required for a successful download.

ResNet-18 experiments on ImageNet are conducted on two NVIDIA GeForce 2080 Ti GPUs with data parallelism, while ResNet-50 experiments are conducted on 4 GPUs in a similar fashion. Both models take around 2 days to train for 90 epochs, i.e. about 20-30 minutes per epoch. These ResNet models on ImageNet are implemented in PyTorch, specifically 1.7.0+cu101.

Appendix D Important Propositions and Lemmas

Proposition 3.

Letting f(x) be a monotonically increasing function in the range [t_{0},\tilde{t}], it holds that

k=t0t~1f(k)t0t~f(x)𝑑x.\displaystyle\sum_{k=t_{0}}^{\tilde{t}-1}f(k)\leq\int_{t_{0}}^{\tilde{t}}f(x)\;dx. (D.1)

If f(x)f(x) is monotonically decreasing in the range [t0,t~][t_{0},\tilde{t}], then it holds that

t0t~f(x)𝑑xk=t0t~1f(k)k=t0t~f(k)f(t0)+t0t~f(x)𝑑x.\displaystyle\int_{t_{0}}^{\tilde{t}}f(x)\;dx\leq\sum_{k=t_{0}}^{\tilde{t}-1}f(k)\leq\sum_{k=t_{0}}^{\tilde{t}}f(k)\leq f(t_{0})+\int_{t_{0}}^{\tilde{t}}f(x)\;dx. (D.2)
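As a quick numerical sanity check of Eqn. (D.2) (not part of the proof), take the decreasing function f(x)=1/x on [1,10]:

```python
import math

t0, t_tilde = 1, 10
integral = math.log(t_tilde / t0)                          # ~ 2.303
sum_to_t_minus_1 = sum(1 / k for k in range(t0, t_tilde))  # ~ 2.829
sum_to_t = sum(1 / k for k in range(t0, t_tilde + 1))      # ~ 2.929
assert integral <= sum_to_t_minus_1 <= sum_to_t <= 1 / t0 + integral
```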
Lemma 2.

The function f(x)=\exp(-\alpha x)x^{2} with \alpha>0 is monotonically increasing in the range x\in[0,\frac{2}{\alpha}] and monotonically decreasing in the range x\in(\frac{2}{\alpha},+\infty).

Proof.

We can obtain the derivative of f(x)f(x) as

f(x)=xexp(αx)(2αx).\displaystyle\nabla f(x)=x\exp(-\alpha x)(2-\alpha x).

Thus, it holds that f(x)0\nabla f(x)\geq 0 when x[0,2α]x\in[0,\frac{2}{\alpha}]. This implies that f(x)f(x) is monotone increasing when x[0,2α]x\in[0,\frac{2}{\alpha}]. Similarly, we can obtain that f(x)f(x) is monotone decreasing when x(2α,+)x\in(\frac{2}{\alpha},+\infty). ∎

Lemma 3.

It holds that

exp(αx)x𝑑x=α1(xexp(αx)+α1exp(αx)).\displaystyle\int\exp(-\alpha x)x\;dx=-\alpha^{-1}(x\exp(-\alpha x)+\alpha^{-1}\exp(-\alpha x)). (D.3)
Proof.
exp(αx)x𝑑x=\displaystyle\int\exp(-\alpha x)x\;dx= α1xdexp(αx)=α1(xexp(αx)exp(αx)𝑑x)\displaystyle-\alpha^{-1}\int x\;d\exp(-\alpha x)=-\alpha^{-1}(x\exp(-\alpha x)-\int\exp(-\alpha x)dx)
=\displaystyle= α1(xexp(αx)+α1exp(αx)).\displaystyle-\alpha^{-1}(x\exp(-\alpha x)+\alpha^{-1}\exp(-\alpha x)).

Appendix E Proof of Section 2

Proof of Proposition 1.

By iteratively applying Eqn. (2.1), we can obtain that

𝔼[λj(wt+1,jw,j)2](2.1)\displaystyle\mathbb{E}\left[\lambda_{j}\left(w_{t+1,j}-w_{*,j}\right)^{2}\right]\overset{\eqref{eq:proc_j}}{\leq} i=1t(1ηi,jλj)2λj(w1,jw,j)2\displaystyle\prod_{i=1}^{t}(1-\eta_{i,j}\lambda_{j})^{2}\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}
+λj2σ2k=1ti=k+1t(1ηi,jλj)2ηk,j2\displaystyle+\lambda_{j}^{2}\sigma^{2}\cdot\sum_{k=1}^{t}\prod_{i=k+1}^{t}(1-\eta_{i,j}\lambda_{j})^{2}{\eta^{2}_{k,j}}
=\displaystyle= λj(w1,jw,j)2(t+1)2+σ2k=1t(k+1)2(t+1)21(k+1)2\displaystyle\frac{\lambda_{j}(w_{1,j}-w_{*,j})^{2}}{(t+1)^{2}}+\sigma^{2}\cdot\sum_{k=1}^{t}\frac{(k+1)^{2}}{(t+1)^{2}}\cdot\frac{1}{(k+1)^{2}}
=\displaystyle= λj(w1,jw,j)2(t+1)2+t(t+1)2σ2.\displaystyle\frac{\lambda_{j}(w_{1,j}-w_{*,j})^{2}}{(t+1)^{2}}+\frac{t}{(t+1)^{2}}\cdot\sigma^{2}.

Summing over all coordinates, we can obtain the result. ∎

Proof of Proposition 2.

Let us denote σj2=λj2σ2\sigma_{j}^{2}=\lambda_{j}^{2}\sigma^{2}. By Eqn. (2.1), we can obtain that

𝔼[λj(wt+1,jw,j)2]\displaystyle\mathbb{E}\left[\lambda_{j}\left(w_{t+1,j}-w_{*,j}\right)^{2}\right]
\displaystyle\leq Πi=1t(1ηi,jλj)2λj(w1,jw,j)2+k=1tΠi=k(1ηi,jλj)ηk,j2σj2\displaystyle\Pi_{i=1}^{t}(1-\eta_{i,j}\lambda_{j})^{2}\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\sum_{k=1}^{t}\Pi_{i=k}(1-\eta_{i,j}\lambda_{j}){\eta^{2}_{k,j}}\sigma_{j}^{2}
\displaystyle\leq exp(2i=1tηi,jλj)λj(w1,jw,j)2+k=1texp(2i=ktηi,jλj)ηk,j2σj2\displaystyle\exp\left(-2\sum_{i=1}^{t}\eta_{i,j}\lambda_{j}\right)\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\sum_{k=1}^{t}\exp\left(-2\sum_{i=k}^{t}\eta_{i,j}\lambda_{j}\right)\eta^{2}_{k,j}\sigma_{j}^{2}
=\displaystyle= exp(2λji=1t1L+μi)λj(w1,jw,j)2+k=1texp(2λji=kt1L+μi)ηk,j2σj2\displaystyle\exp\left(-2\lambda_{j}\sum_{i=1}^{t}\frac{1}{L+\mu i}\right)\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\sum_{k=1}^{t}\exp\left(-2\lambda_{j}\sum_{i=k}^{t}\frac{1}{L+\mu i}\right)\eta^{2}_{k,j}\sigma_{j}^{2}
\displaystyle\leq exp(2λjμln(L+μL+μt))λj(w1,jw,j)2+k=1texp(2λjμln(L+μkL+μt))σj2(L+μk)2\displaystyle\exp\left(\frac{2\lambda_{j}}{\mu}\ln\left(\frac{L+\mu}{L+\mu t}\right)\right)\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\sum_{k=1}^{t}\exp\left(\frac{2\lambda_{j}}{\mu}\ln\left(\frac{L+\mu k}{L+\mu t}\right)\right)\cdot\frac{\sigma_{j}^{2}}{(L+\mu k)^{2}}
=\displaystyle= (L+μL+μt)2λjμλj(w1,jw,j)2+k=1t(L+μk)(2λjμ2)(L+μt)2λjμσj2\displaystyle\left(\frac{L+\mu}{L+\mu t}\right)^{\frac{2\lambda_{j}}{\mu}}\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\sum_{k=1}^{t}\frac{(L+\mu k)^{\left(\frac{2\lambda_{j}}{\mu}-2\right)}}{(L+\mu t)^{\frac{2\lambda_{j}}{\mu}}}\cdot\sigma_{j}^{2}
\displaystyle\leq (L+μL+μt)2λjμλj(w1,jw,j)2+σj22λjμ(L+μt)2λjμ1(L+μt)2λjμ+σj2(L+μt)2\displaystyle\left(\frac{L+\mu}{L+\mu t}\right)^{\frac{2\lambda_{j}}{\mu}}\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\frac{\sigma_{j}^{2}}{2\lambda_{j}-\mu}\frac{(L+\mu t)^{\frac{2\lambda_{j}}{\mu}-1}}{(L+\mu t)^{\frac{2\lambda_{j}}{\mu}}}+\frac{\sigma_{j}^{2}}{(L+\mu t)^{2}}
=\displaystyle= (L+μL+μt)2λjμλj(w1,jw,j)2+12λjμ1L+μtσj2+σj2(L+μt)2.\displaystyle\left(\frac{L+\mu}{L+\mu t}\right)^{\frac{2\lambda_{j}}{\mu}}\cdot\lambda_{j}(w_{1,j}-w_{*,j})^{2}+\frac{1}{2\lambda_{j}-\mu}\cdot\frac{1}{L+\mu t}\cdot\sigma_{j}^{2}+\frac{\sigma_{j}^{2}}{(L+\mu t)^{2}}.

The third inequality is because function F(x)=1/(L+μx)F(x)=1/(L+\mu x) is monotone decreasing in the range [1,)[1,\infty), and it holds that

i=kt1L+μii=kt1L+μi𝑑i=1μln(L+μtL+μk).\displaystyle\sum_{i=k}^{t}\frac{1}{L+\mu i}\geq\int_{i=k}^{t}\frac{1}{L+\mu i}\;di=\frac{1}{\mu}\ln\left(\frac{L+\mu t}{L+\mu k}\right).

The last inequality is because function F(x)=(L+μx)2λj/μ2F(x)=(L+\mu x)^{2\lambda_{j}/\mu-2} is monotone increasing in the range [0,)[0,\infty), and it holds that

k=1t(L+μk)2λjμ2\displaystyle\sum_{k=1}^{t}(L+\mu k)^{\frac{2\lambda_{j}}{\mu}-2}\leq k=1t(L+μk)2λjμ2𝑑k+(L+μt)2λjμ2\displaystyle\int_{k=1}^{t}(L+\mu k)^{\frac{2\lambda_{j}}{\mu}-2}dk+(L+\mu t)^{\frac{2\lambda_{j}}{\mu}-2}
=\displaystyle= 1μ(2λjμ1)((L+μt)2λjμ1(L+μ)2λjμ1)+(L+μt)2λjμ2\displaystyle\frac{1}{\mu\left(\frac{2\lambda_{j}}{\mu}-1\right)}\cdot\left((L+\mu t)^{\frac{2\lambda_{j}}{\mu}-1}-(L+\mu)^{\frac{2\lambda_{j}}{\mu}-1}\right)+(L+\mu t)^{\frac{2\lambda_{j}}{\mu}-2}
<\displaystyle< 12λjμ(L+μt)2λjμ1+(L+μt)2λjμ2.\displaystyle\frac{1}{2\lambda_{j}-\mu}(L+\mu t)^{\frac{2\lambda_{j}}{\mu}-1}+(L+\mu t)^{\frac{2\lambda_{j}}{\mu}-2}.

By \mu\leq\lambda_{j}, \sigma_{j}^{2}=\lambda_{j}^{2}\sigma^{2}, and summing up from j=1 to d, we can obtain the result. ∎

Appendix F Preliminaries

Lemma 4.

Let the objective function f(x) be quadratic. Running SGD for T steps starting from w_{0} with a learning rate sequence \{\eta_{t}\}_{t=1}^{T}, the final iterate w_{T+1} satisfies

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right] (F.1)
=\displaystyle= 𝔼[(w0w)PTP0HP0PT(w0w)]\displaystyle\mathbb{E}\left[(w_{0}-w_{*})^{\top}\cdot P_{T}\dots P_{0}HP_{0}\dots P_{T}\cdot(w_{0}-w_{*})\right]
+τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ],\displaystyle+\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right],

where Pt=IηtHP_{t}=I-\eta_{t}H.

Proof.

Reformulating Eqn. (1.5), we have

wt+1w=\displaystyle w_{t+1}-w_{*}= wtwηt(H(ξ)wtb(ξ))\displaystyle w_{t}-w_{*}-\eta_{t}(H(\xi)w_{t}-b(\xi))
=\displaystyle= wtwηt(Hwtb)+ηt(Hwtb(H(ξ)wtb(ξ)))\displaystyle w_{t}-w_{*}-\eta_{t}(Hw_{t}-b)+\eta_{t}\left(Hw_{t}-b-(H(\xi)w_{t}-b(\xi))\right)
=\displaystyle= wtwηt(Hwtb(Hwb))+ηt(Hwtb(H(ξ)wtb(ξ)))\displaystyle w_{t}-w_{*}-\eta_{t}(Hw_{t}-b-(Hw_{*}-b))+\eta_{t}\left(Hw_{t}-b-(H(\xi)w_{t}-b(\xi))\right)
=\displaystyle= (IηtH)(wtw)+ηtnt\displaystyle\left(I-\eta_{t}H\right)(w_{t}-w_{*})+\eta_{t}n_{t}
=\displaystyle= Pt(wtw)+ηtnt.\displaystyle P_{t}(w_{t}-w_{*})+\eta_{t}n_{t}.

Thus, we can obtain that

wt+1w=PtP0(w0w)+τ=0tPtPτ+1ητnτ.\displaystyle w_{t+1}-w_{*}=P_{t}\dots P_{0}(w_{0}-w_{*})+\sum_{\tau=0}^{t}P_{t}\dots P_{\tau+1}\eta_{\tau}n_{\tau}. (F.2)

We can decompose the above stochastic process associated with SGD's update into two simpler processes as follows:

wt+1bw=Pt(wtbw),andwt+1vw=Pt(wtvw)+ηtnt, with w0v=w,\displaystyle w_{t+1}^{b}-w_{*}=P_{t}(w_{t}^{b}-w_{*}),\quad\mbox{and}\quad w_{t+1}^{v}-w_{*}=P_{t}(w_{t}^{v}-w_{*})+\eta_{t}n_{t},\mbox{ with }w_{0}^{v}=w_{*}, (F.3)

which entails that

wt+1bw=PtP0(w0bw)=P0Pt(w0bw)\displaystyle w_{t+1}^{b}-w_{*}=P_{t}\dots P_{0}(w_{0}^{b}-w_{*})=P_{0}\dots P_{t}(w_{0}^{b}-w_{*}) (F.4)
wt+1vw=τ=0tPtPτ+1ητnτ=τ=0tPτ+1Ptητnτ\displaystyle w_{t+1}^{v}-w_{*}=\sum_{\tau=0}^{t}P_{t}\dots P_{\tau+1}\eta_{\tau}n_{\tau}=\sum_{\tau=0}^{t}P_{\tau+1}\dots P_{t}\eta_{\tau}n_{\tau} (F.5)
(F.2)\displaystyle\overset{\eqref{eq:rec}}{\Rightarrow} wt+1w=(wt+1bw)+(wt+1vw)\displaystyle w_{t+1}-w_{*}=\left(w_{t+1}^{b}-w_{*}\right)+\left(w_{t+1}^{v}-w_{*}\right) (F.6)

where the last equality in Eqn. (F.4) and Eqn. (F.5) is because the commutative property PtPt=(IηtH)(IηtH)=(IηtH)(IηtH)=PtPtP_{t}P_{t^{\prime}}=(I-\eta_{t}H)(I-\eta_{t^{\prime}}H)=(I-\eta_{t^{\prime}}H)(I-\eta_{t}H)=P_{t^{\prime}}P_{t} holds for t,t\forall t,t^{\prime}.

Thus, we have

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]
=(F.6)\displaystyle\overset{\eqref{eq:dec4}}{=} 𝔼[(wT+1bw)H(wT+1bw)+2(wT+1vw)H(wT+1bw)\displaystyle\mathbb{E}\left[(w^{b}_{T+1}-w_{*})^{\top}H(w^{b}_{T+1}-w_{*})+2(w^{v}_{T+1}-w_{*})^{\top}H(w^{b}_{T+1}-w_{*})\right.
+(wT+1vw)H(wT+1vw)]\displaystyle\left.+(w^{v}_{T+1}-w_{*})^{\top}H(w^{v}_{T+1}-w_{*})\right]
=(F.4)\displaystyle\overset{\eqref{eq:dec2}}{=} 𝔼[(w0bw)PTP0HP0PT(w0bw)+2(wT+1vw)HP0PT(w0bw)\displaystyle\mathbb{E}\left[(w^{b}_{0}-w_{*})^{\top}P_{T}\dots P_{0}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})+2(w^{v}_{T+1}-w_{*})^{\top}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})\right.
+(wT+1vw)H(wT+1vw)]\displaystyle\left.+(w^{v}_{T+1}-w_{*})^{\top}H(w^{v}_{T+1}-w_{*})\right]
=(F.5)\displaystyle\overset{\eqref{eq:dec3}}{=} 𝔼[(w0bw)PTP0HP0PT(w0bw)\displaystyle\mathbb{E}\left[(w^{b}_{0}-w_{*})^{\top}P_{T}\dots P_{0}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})\right.
+2(τ=0TPTPτ+1ητnτ)HP0PT(w0bw)\displaystyle+2\left(\sum_{\tau=0}^{T}P_{T}\dots P_{\tau+1}\eta_{\tau}n_{\tau}\right)^{\top}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})
+(τ=0TPTPτ+1ητnτ)H(τ=0TPTPτ+1ητnτ)]\displaystyle\left.+\left(\sum_{\tau=0}^{T}P_{T}\dots P_{\tau+1}\eta_{\tau}n_{\tau}\right)^{\top}H\left(\sum_{\tau=0}^{T}P_{T}\dots P_{\tau+1}\eta_{\tau}n_{\tau}\right)\right]
=𝔼[nτ]=0\displaystyle\overset{\mathbb{E}[n_{\tau}]=0}{=} 𝔼[(w0bw)PTP0HP0PT(w0bw)]\displaystyle\mathbb{E}\left[(w^{b}_{0}-w_{*})^{\top}P_{T}\dots P_{0}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})\right]
+𝔼[τ=0,τ=0TητητnτPTPτ+1HPτ+1PTnτ]\displaystyle+\mathbb{E}\left[\sum_{\tau=0,\tau^{\prime}=0}^{T}\eta_{\tau}\eta_{\tau^{\prime}}\cdot n_{\tau}^{\top}P_{T}\dots P_{\tau+1}\cdot H\cdot P_{\tau^{\prime}+1}\dots P_{T}n_{\tau^{\prime}}\right]
=\displaystyle= 𝔼[(w0bw)PTP0HP0PT(w0bw)]\displaystyle\mathbb{E}\left[(w^{b}_{0}-w_{*})^{\top}P_{T}\dots P_{0}HP_{0}\dots P_{T}(w^{b}_{0}-w_{*})\right]
+τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ],\displaystyle+\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right],

where the last equality is because when τ\tau and τ\tau^{\prime} are different, it holds that

\mathbb{E}[n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau^{\prime}+1}\dots P_{T}\cdot n_{\tau^{\prime}}]=0

due to independence between nτn_{\tau} and nτn_{\tau^{\prime}}. ∎

Lemma 5.

Given the assumption that \mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]\preceq\sigma^{2}H, the variance term satisfies

τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ]σ2j=1dλj2k=0Tηk2i=k+1T(1ηiλj)2,\displaystyle\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right]\leq\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\sum_{k=0}^{T}\eta_{k}^{2}\prod_{i=k+1}^{T}\left(1-\eta_{i}\lambda_{j}\right)^{2}, (F.7)

where Pt=IηtHP_{t}=I-\eta_{t}H.

Proof.

Denote AτPTPτ+1H12A_{\tau}\triangleq P_{T}\dots P_{\tau+1}H^{\frac{1}{2}}, then

Aτ=(PTPτ+1H12)=(H12)Pτ+1PT=H12Pτ+1PT,\displaystyle A_{\tau}^{\top}=\left(P_{T}\dots P_{\tau+1}H^{\frac{1}{2}}\right)^{\top}=\left(H^{\frac{1}{2}}\right)^{\top}P_{\tau+1}^{\top}\dots P_{T}^{\top}=H^{\frac{1}{2}}P_{\tau+1}\dots P_{T}, (F.8)

where the second equality is entailed by the fact that H12,Pτ+1,,PTH^{\frac{1}{2}},P_{\tau+1},\dots,P_{T} are symmetric matrices.

Therefore, we have,

τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ]\displaystyle\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right]
=(F.8)\displaystyle\overset{\eqref{eq:A_transpose}}{=} τ=0T𝔼[ητ2nτAτAτnτ]=τ=0Tητ2𝔼[tr(nτAτAτnτ)]=τ=0Tητ2𝔼[tr(AτnτnτAτ)]\displaystyle\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}A_{\tau}A_{\tau}^{\top}n_{\tau}\right]=\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathbb{E}\left[\mathrm{tr}\left(n_{\tau}^{\top}A_{\tau}A_{\tau}^{\top}n_{\tau}\right)\right]=\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathbb{E}\left[\mathrm{tr}\left(A_{\tau}^{\top}n_{\tau}n_{\tau}^{\top}A_{\tau}\right)\right]
=\displaystyle= τ=0Tητ2tr(𝔼[AτnτnτAτ])=τ=0Tητ2tr(Aτ𝔼[nτnτ]Aτ)\displaystyle\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathrm{tr}\left(\mathbb{E}\left[A_{\tau}^{\top}n_{\tau}n_{\tau}^{\top}A_{\tau}\right]\right)=\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathrm{tr}\left(A_{\tau}^{\top}\mathbb{E}\left[n_{\tau}n_{\tau}^{\top}\right]A_{\tau}\right)
\displaystyle\leq σ2τ=0Tητ2tr(AτHAτ)=σ2τ=0Tητ2tr(AτAτH)\displaystyle\sigma^{2}\cdot\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathrm{tr}\left(A_{\tau}^{\top}HA_{\tau}\right)=\sigma^{2}\cdot\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathrm{tr}\left(A_{\tau}A_{\tau}^{\top}H\right)
=\displaystyle= σ2τ=0Tητ2tr(PTPτ+1HPτ+1PTH)\displaystyle\sigma^{2}\cdot\sum_{\tau=0}^{T}\eta_{\tau}^{2}\mathrm{tr}\left(P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}H\right)
=\displaystyle= σ2j=1dλj2k=0Tηk2i=k+1T(1ηiλj)2,\displaystyle\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\sum_{k=0}^{T}\eta_{k}^{2}\prod_{i=k+1}^{T}\left(1-\eta_{i}\lambda_{j}\right)^{2},

where the third and sixth equality come from the cyclic property of trace, while the first inequality is because of the condition 𝔼ξ[ntnt]σ2H\mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]\preceq\sigma^{2}H, where

x,x𝔼[nτnτ]xσ2xHx\displaystyle\forall x,\quad x^{\top}\mathbb{E}[n_{\tau}n_{\tau}^{\top}]x\leq\sigma^{2}x^{\top}Hx
\displaystyle\Rightarrow\quad z,zAτ𝔼[nτnτ]Aτz=(Aτz)𝔼[nτnτ](Aτz)σ2(Aτz)H(Aτz)=σ2zAτHAτz\displaystyle\forall z,\quad z^{\top}A_{\tau}^{\top}\mathbb{E}[n_{\tau}n_{\tau}^{\top}]A_{\tau}z=(A_{\tau}z)^{\top}\mathbb{E}[n_{\tau}n_{\tau}^{\top}](A_{\tau}z)\leq\sigma^{2}(A_{\tau}z)^{\top}H(A_{\tau}z)=\sigma^{2}z^{\top}A_{\tau}^{\top}HA_{\tau}z
\displaystyle\Rightarrow\quad Aτ𝔼[nτnτ]Aτσ2AτHAτ\displaystyle A_{\tau}^{\top}\mathbb{E}[n_{\tau}n_{\tau}^{\top}]A_{\tau}\preceq\sigma^{2}A_{\tau}^{\top}HA_{\tau}
\displaystyle\Rightarrow\quad tr(Aτ𝔼[nτnτ]Aτ)σ2tr(AτHAτ).\displaystyle\mathrm{tr}\left(A_{\tau}^{\top}\mathbb{E}[n_{\tau}n_{\tau}^{\top}]A_{\tau}\right)\leq\sigma^{2}\mathrm{tr}\left(A_{\tau}^{\top}HA_{\tau}\right).

Lemma 6.

Letting \lambda_{\tilde{j}} be the smallest positive eigenvalue of H, the bias term satisfies

𝔼[(w0w)PTP0HP0PT(w0w)]\displaystyle\mathbb{E}\left[(w_{0}-w_{*})^{\top}\cdot P_{T}\dots P_{0}HP_{0}\dots P_{T}\cdot(w_{0}-w_{*})\right]
\displaystyle\leq (w0w)H(w0w)exp(2λj~k=0Tηk).\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\exp\left(-2\lambda_{\tilde{j}}\sum_{k=0}^{T}\eta_{k}\right). (F.9)
Proof.

Letting H=UΛUH=U\Lambda U^{\top} be the spectral decomposition of HH and uju_{j} be jj-th column of UU, we can obtain that

𝔼[(w0w)PTP0HP0PT(w0w)]\displaystyle\mathbb{E}\left[(w_{0}-w_{*})^{\top}\cdot P_{T}\dots P_{0}HP_{0}\dots P_{T}\cdot(w_{0}-w_{*})\right]
=\displaystyle= j=1dλj(uj(w0w))2k=0T(1ηkλj)2\displaystyle\sum_{j=1}^{d}\lambda_{j}\cdot(u_{j}^{\top}(w_{0}-w_{*}))^{2}\cdot\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}
\displaystyle\leq j=1dλj(uj(w0w))2exp(2λjk=0Tηk).\displaystyle\sum_{j=1}^{d}\lambda_{j}\cdot(u_{j}^{\top}(w_{0}-w_{*}))^{2}\cdot\exp\left(-2\lambda_{j}\sum_{k=0}^{T}\eta_{k}\right).

Since λj~\lambda_{\tilde{j}} is the smallest positive eigenvalue of HH, it holds that

j=1dλj(uj(w0w))2exp(2λjk=0Tηk)\displaystyle\sum_{j=1}^{d}\lambda_{j}\cdot(u_{j}^{\top}(w_{0}-w_{*}))^{2}\cdot\exp\left(-2\lambda_{j}\sum_{k=0}^{T}\eta_{k}\right)\leq j=1dλj(uj(w0w))2exp(2λj~k=0Tηk)\displaystyle\sum_{j=1}^{d}\lambda_{j}\cdot(u_{j}^{\top}(w_{0}-w_{*}))^{2}\cdot\exp\left(-2\lambda_{\tilde{j}}\sum_{k=0}^{T}\eta_{k}\right)
=\displaystyle= (w0w)H(w0w)exp(2λj~k=0Tηk).\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\exp\left(-2\lambda_{\tilde{j}}\sum_{k=0}^{T}\eta_{k}\right).

Appendix G Proof of Theorems

Lemma 7.

Let the learning rate \eta_{t} be defined as in Eqn. (3.3). Assuming k\in[t_{i^{\prime}-1},t_{i^{\prime}}] with 1\leq i^{\prime}\leq\tilde{i}\leq T, the sequence \{\eta_{t}\}_{t=0}^{T} satisfies

t=kti~+11ηti=i+1i~+112i1μlnαiαi1+12i1μlnαiαi1+2i1μ(kti1),\displaystyle\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\geq\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\frac{1}{2^{i-1}\mu}\ln\frac{\alpha_{i}}{\alpha_{i-1}}+\frac{1}{2^{i^{\prime}-1}\mu}\ln\frac{\alpha_{i^{\prime}}}{\alpha_{i^{\prime}-1}+2^{i^{\prime}-1}\mu(k-t_{i^{\prime}-1})}, (G.1)

where αi\alpha_{i} is defined as

αiL+μj=1iΔj2j1=1ηti.\displaystyle\alpha_{i}\triangleq L+\mu\sum_{j=1}^{i}\Delta_{j}2^{j-1}=\frac{1}{\eta_{t_{i}}}. (G.2)
Proof.

First, we divide the learning rates into two groups: those that are guaranteed to cover a full interval and those that may not.

t=kti~+11ηt=\displaystyle\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}= t=titi~+11ηt+t=kti1ηt=i=i+1i~+1t=ti1ti1ηt+t=kti1ηt\displaystyle\sum_{t=t_{i^{\prime}}}^{t_{\tilde{i}+1}-1}\eta_{t}+\sum_{t=k}^{t_{i^{\prime}}-1}\eta_{t}=\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\sum_{t=t_{i-1}}^{t_{i}-1}\eta_{t}+\sum_{t=k}^{t_{i^{\prime}}-1}\eta_{t}

Furthermore, because ηt\eta_{t} is monotonically decreasing with respect to tt, by Proposition 3, we have

t=kti~+11ηt(D.2)\displaystyle\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\overset{\eqref{eq:dec_int}}{\geq} i=i+1i~+1ti1tiηt𝑑t+ktiηt𝑑t\displaystyle\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\int_{t_{i-1}}^{t_{i}}\eta_{t}dt+\int_{k}^{t_{i^{\prime}}}\eta_{t}dt
=(3.3)\displaystyle\overset{\eqref{eq:eta_t}}{=} i=i+1i~+1ti1ti1L+μj=1i1Δj2j1+2i1μ(tti1)𝑑t\displaystyle\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\int_{t_{i-1}}^{t_{i}}\frac{1}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+2^{i-1}\mu(t-t_{i-1})}dt
+kti1L+μj=1i1Δj2j1+2i1μ(tti1)𝑑t\displaystyle+\int_{k}^{t_{i^{\prime}}}\frac{1}{L+\mu\sum_{j=1}^{i^{\prime}-1}\Delta_{j}2^{j-1}+2^{i^{\prime}-1}\mu(t-t_{i^{\prime}-1})}dt
=\displaystyle= i=i+1i~+112i1μlnL+μj=1i1Δj2j1+2i1μ(titi1)L+μj=1i1Δj2j1\displaystyle\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\frac{1}{2^{i-1}\mu}\ln\frac{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+2^{i-1}\mu(t_{i}-t_{i-1})}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}}
+12i1μlnL+μj=1i1Δj2j1+2i1μ(titi1)L+μj=1i1Δj2j1+2i1μ(kti1)\displaystyle+\frac{1}{2^{i^{\prime}-1}\mu}\ln\frac{L+\mu\sum_{j=1}^{i^{\prime}-1}\Delta_{j}2^{j-1}+2^{i^{\prime}-1}\mu(t_{i^{\prime}}-t_{i^{\prime}-1})}{L+\mu\sum_{j=1}^{i^{\prime}-1}\Delta_{j}2^{j-1}+2^{i^{\prime}-1}\mu(k-t_{i^{\prime}-1})}
=\displaystyle= i=i+1i~+112i1μlnL+μj=1iΔj2j1L+μj=1i1Δj2j1\displaystyle\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\frac{1}{2^{i-1}\mu}\ln\frac{L+\mu\sum_{j=1}^{i}\Delta_{j}2^{j-1}}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}}
+12i1μlnL+μj=1iΔj2j1L+μj=1i1Δj2j1+2i1μ(kti1)\displaystyle+\frac{1}{2^{i^{\prime}-1}\mu}\ln\frac{L+\mu\sum_{j=1}^{i^{\prime}}\Delta_{j}2^{j-1}}{L+\mu\sum_{j=1}^{i^{\prime}-1}\Delta_{j}2^{j-1}+2^{i^{\prime}-1}\mu(k-t_{i^{\prime}-1})}
=\displaystyle= i=i+1i~+112i1μlnαiαi1+12i1μlnαiαi1+2i1μ(kti1).\displaystyle\sum_{i=i^{\prime}+1}^{\tilde{i}+1}\frac{1}{2^{i-1}\mu}\ln\frac{\alpha_{i}}{\alpha_{i-1}}+\frac{1}{2^{i^{\prime}-1}\mu}\ln\frac{\alpha_{i^{\prime}}}{\alpha_{i^{\prime}-1}+2^{i^{\prime}-1}\mu(k-t_{i^{\prime}-1})}.

Lemma 8.

Letting sequence {αi}\{\alpha_{i}\} be defined in Eqn. (G.2), given 1i~1\leq\tilde{i}, it holds that

i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2k=ti1ti1(αi1+2i1μ(kti1))2i~i+22\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\sum_{k=t_{i-1}}^{t_{i}-1}(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1}))^{2^{\tilde{i}-i+2}-2} (G.3)
\displaystyle\leq 212i~+1μi=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2.\displaystyle 2\cdot\frac{1}{2^{\tilde{i}+1}\mu}\cdot\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}.
Proof.

Noticing that g(k):=(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1}))^{2^{\tilde{i}-i+2}-2} is monotonically increasing in k, we have

i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2k=ti1ti1(αi1+2i1μ(kti1))2i~i+22\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\sum_{k=t_{i-1}}^{t_{i}-1}(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1}))^{2^{\tilde{i}-i+2}-2}
(D.1)\displaystyle\overset{\eqref{eq:inc_int}}{\leq} i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2ti1ti(αi1+2i1μ(tti1))2i~i+22𝑑t\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\int_{t_{i-1}}^{t_{i}}(\alpha_{i-1}+2^{i-1}\mu(t-t_{i-1}))^{2^{\tilde{i}-i+2}-2}dt
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2((2i~i+21)2i1μ)1(αi2i~i+21αi12i~i+21)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}((2^{\tilde{i}-i+2}-1)\cdot 2^{i-1}\mu)^{-1}\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+21(2i~+12i1)μ(αi2i~i+21αi12i~i+21)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\frac{1}{(2^{\tilde{i}+1}-2^{i-1})\mu}\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)
\displaystyle\leq i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+21(2i~+12i~)μ(αi2i~i+21αi12i~i+21)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\frac{1}{(2^{\tilde{i}+1}-2^{\tilde{i}})\mu}\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+212i~+1μ1112(αi2i~i+21αi12i~i+21)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\frac{1}{2^{\tilde{i}+1}\mu}\frac{1}{1-\frac{1}{2}}\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)
=\displaystyle= 212i~+1μi=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2(αi2i~i+21αi12i~i+21)\displaystyle 2\cdot\frac{1}{2^{\tilde{i}+1}\mu}\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)
=\displaystyle= 212i~+1μi=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2.\displaystyle 2\cdot\frac{1}{2^{\tilde{i}+1}\mu}\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}.

Lemma 9.

Letting {αi}\{\alpha_{i}\} be a positive sequence, given 1i~1\leq\tilde{i}, it holds that

i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2αi~+11.\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}\leq\alpha_{\tilde{i}+1}^{-1}. (G.4)
Proof.

First, we have

i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi1αi12i~i+21αi2i~i+2)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{-1}-\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right)
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi1i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi12i~i+21αi2i~i+2)\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-1}-\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right)
=\displaystyle= αi~+11+i=1i~[j=i+1i~+1(αj1αj)2i~j+2]αi1i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi12i~i+21αi2i~i+2).\displaystyle\alpha_{\tilde{i}+1}^{-1}+\sum_{i=1}^{\tilde{i}}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-1}-\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right).

Furthermore, we reformulate the term i=1i~[j=i+1i~+1(αj1αj)2i~j+2]αi1\sum_{i=1}^{\tilde{i}}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-1} as follows

i=1i~[j=i+1i~+1(αj1αj)2i~j+2]αi1=\displaystyle\sum_{i=1}^{\tilde{i}}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-1}= i=1i~[j=i+2i~+1(αj1αj)2i~j+2](αiαi+1)2i~i+1αi1\displaystyle\sum_{i=1}^{\tilde{i}}\left[\prod_{j=i+2}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i}}{\alpha_{i+1}}\right)^{2^{\tilde{i}-i+1}}\alpha_{i}^{-1} (G.5)
=\displaystyle= i=1i~[j=i+2i~+1(αj1αj)2i~j+2](αi2i~i+11αi+12i~i+1)\displaystyle\sum_{i=1}^{\tilde{i}}\left[\prod_{j=i+2}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i}^{2^{\tilde{i}-i+1}-1}}{\alpha_{i+1}^{2^{\tilde{i}-i+1}}}\right) (G.6)
=i′′=i+1\displaystyle\stackrel{{\scriptstyle i^{\prime\prime}=i+1}}{{=}} i′′=2i~+1[j=i′′+1i~+1(αj1αj)2i~j+2](αi′′12i~i′′+21αi′′2i~i′′+2)\displaystyle\sum_{i^{\prime\prime}=2}^{\tilde{i}+1}\left[\prod_{j=i^{\prime\prime}+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i^{\prime\prime}-1}^{2^{\tilde{i}-i^{\prime\prime}+2}-1}}{\alpha_{i^{\prime\prime}}^{2^{\tilde{i}-i^{\prime\prime}+2}}}\right) (G.7)
=i=i′′\displaystyle\stackrel{{\scriptstyle i=i^{\prime\prime}}}{{=}} i=2i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi12i~i+21αi2i~i+2).\displaystyle\sum_{i=2}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right). (G.8)

Combining above results, we can obtain that

i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}
=\displaystyle= αi~+11+i=2i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi12i~i+21αi2i~i+2)\displaystyle\alpha_{\tilde{i}+1}^{-1}+\sum_{i=2}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right)
i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi12i~i+21αi2i~i+2)\displaystyle-\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}}{\alpha_{i}^{2^{\tilde{i}-i+2}}}\right)
=\displaystyle= αi~+11[j=2i~+1(αj1αj)2i~j+2](α02i~+11α12i~+1)\displaystyle\alpha_{\tilde{i}+1}^{-1}-\left[\prod_{j=2}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\frac{\alpha_{0}^{2^{\tilde{i}+1}-1}}{\alpha_{1}^{2^{\tilde{i}+1}}}\right)
\displaystyle\leq αi~+11.\displaystyle\alpha_{\tilde{i}+1}^{-1}.

Lemma 10.

Let us denote v_{t+1,j}\triangleq\sum_{k=0}^{t}\eta_{k}^{2}\prod_{i=k+1}^{t}\left(1-\eta_{i}\lambda_{j}\right)^{2} with \eta_{i} defined in Eqn. (3.3). For 1\leq t\leq t^{\prime}, it holds that

vt,jmax(vt,j,ηt/λj).\displaystyle v_{t^{\prime},j}\leq\max(v_{t,j},\eta_{t}/\lambda_{j}). (G.9)
Proof.

If vt+1,jmax(vt,j,ηt/λj)v_{t+1,j}\leq\max(v_{t,j},\eta_{t}/\lambda_{j}) holds for t1\forall t\geq 1, then it naturally follows that

vt,j\displaystyle v_{t^{\prime},j}\leq max(vt1,j,ηt1λj)max(vt2,j,ηt2λj,ηt1λj))\displaystyle\max\left(v_{t^{\prime}-1,j},\frac{\eta_{t^{\prime}-1}}{\lambda_{j}}\right)\leq\max\left(v_{t^{\prime}-2,j},\frac{\eta_{t^{\prime}-2}}{\lambda_{j}},\frac{\eta_{t^{\prime}-1}}{\lambda_{j}})\right)
\displaystyle\leq \displaystyle\dots
\displaystyle\leq max(vt,j,ηtλj,ηt2λj,ηt1λj)\displaystyle\max\left(v_{t,j},\frac{\eta_{t}}{\lambda_{j}},\dots\frac{\eta_{t^{\prime}-2}}{\lambda_{j}},\frac{\eta_{t^{\prime}-1}}{\lambda_{j}}\right)
=\displaystyle= max(vt,j,ηtλj)\displaystyle\max\left(v_{t,j},\frac{\eta_{t}}{\lambda_{j}}\right)

where the last equality is entailed by the fact that ttt\leq t^{\prime} and ηt\eta_{t} defined in Eqn. (3.3) is monotonically decreasing. We then prove vt+1,jmax(vt,j,ηt/λj)v_{t+1,j}\leq\max(v_{t,j},\eta_{t}/\lambda_{j}) holds for t1\forall t\geq 1.

For t1\forall t\geq 1, we have

vt+1,j=\displaystyle v_{t+1,j}= k=0tηk2i=k+1t(1ηiλj)2\displaystyle\sum_{k=0}^{t}\eta^{2}_{k}\prod_{i=k+1}^{t}(1-\eta_{i}\lambda_{j})^{2} (G.10)
=\displaystyle= ηt2+k=0t1ηk2i=k+1t(1ηiλj)2\displaystyle\eta^{2}_{t}+\sum_{k=0}^{t-1}\eta^{2}_{k}\prod_{i=k+1}^{t}(1-\eta_{i}\lambda_{j})^{2}
=\displaystyle= ηt2+(1ηtλj)2k=0t1ηk2i=k+1t1(1ηiλj)2\displaystyle\eta^{2}_{t}+(1-\eta_{t}\lambda_{j})^{2}\sum_{k=0}^{t-1}\eta^{2}_{k}\prod_{i=k+1}^{t-1}(1-\eta_{i}\lambda_{j})^{2}
=\displaystyle= ηt2+(1ηtλj)2vt,j\displaystyle\eta^{2}_{t}+(1-\eta_{t}\lambda_{j})^{2}v_{t,j}

1) If vt+1,jvt,jv_{t+1,j}\leq v_{t,j}, then it naturally follows vt+1,jmax(vt,j,ηt/λj)v_{t+1,j}\leq\max(v_{t,j},\eta_{t}/\lambda_{j}).

2) If vt+1,j>vt,jv_{t+1,j}>v_{t,j}, denote a(1ηtλj)2,bηt2a\triangleq(1-\eta_{t}\lambda_{j})^{2},b\triangleq\eta^{2}_{t}, we have vt+1,j=avt,j+bv_{t+1,j}=av_{t,j}+b, where a[0,1)a\in[0,1) and b0b\geq 0. It follows,

vt+1,j>vt,j\displaystyle v_{t+1,j}>v_{t,j}
\displaystyle\Rightarrow\quad avt,j+b>vt,j\displaystyle av_{t,j}+b>v_{t,j}
\displaystyle\Rightarrow\quad vt,j<b1a\displaystyle v_{t,j}<\frac{b}{1-a}
\displaystyle\Rightarrow\quad vt+1,j=avt,j+b<ab1a+b=b1a\displaystyle v_{t+1,j}=av_{t,j}+b<a\cdot\frac{b}{1-a}+b=\frac{b}{1-a}

Therefore,

vt+1,j\displaystyle v_{t+1,j} <b1a=ηt21(1ηtλj)2<ηt21(1ηtλj)=ηtλjmax(vt,j,ηtλj),\displaystyle<\frac{b}{1-a}=\frac{\eta_{t}^{2}}{1-(1-\eta_{t}\lambda_{j})^{2}}<\frac{\eta_{t}^{2}}{1-(1-\eta_{t}\lambda_{j})}=\frac{\eta_{t}}{\lambda_{j}}\leq\max\left(v_{t,j},\frac{\eta_{t}}{\lambda_{j}}\right),

where the second inequality is entailed by the fact that 1ηtλj[0,1)1-\eta_{t}\lambda_{j}\in[0,1).

Lemma 11.

Letting v_{t,j} be defined as in Lemma 10 and the index \tilde{i} satisfy \lambda_{j}\in[\mu\cdot 2^{\tilde{i}},\mu\cdot 2^{\tilde{i}+1}), then v_{t_{\tilde{i}+1},j} has the following property

vti~+1,j15ηti~+1λj.\displaystyle v_{t_{\tilde{i}+1},j}\leq 15\cdot\frac{\eta_{t_{\tilde{i}+1}}}{\lambda_{j}}. (G.11)
Proof.

By the fact that (1x)exp(x)(1-x)\leq\exp(-x), we have

vt+1,jk=0texp(2t=k+1tηtλj)ηk2.\displaystyle v_{t+1,j}\leq\sum_{k=0}^{t}\exp\left(-2\sum_{t^{\prime}=k+1}^{t}\eta_{t^{\prime}}\lambda_{j}\right)\eta_{k}^{2}.

Setting t=t_{\tilde{i}+1}-1 in the above equation, we have

vti~+1,jk=0ti~+11exp(2t=k+1ti~+11ηtλj)ηk2.\displaystyle v_{t_{\tilde{i}+1},j}\leq\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t^{\prime}=k+1}^{t_{\tilde{i}+1}-1}\eta_{t^{\prime}}\lambda_{j}\right)\eta_{k}^{2}. (G.12)

Now we bound the variance term. First, we have

k=0ti~+11exp(2t=k+1ti~+11ηtλj)ηk2=\displaystyle\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t={k+1}}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta^{2}_{k}= k=0ti~+11exp(2t=kti~+11ηtλj)exp(2ηkλj)ηk2\displaystyle\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\exp(2\eta_{k}\lambda_{j})\eta^{2}_{k}
\displaystyle\leq k=0ti~+11exp(2t=kti~+11ηtλj)exp(2λjL)ηk2\displaystyle\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\exp\left(\frac{2\lambda_{j}}{L}\right)\eta^{2}_{k}
\displaystyle\leq exp(2)k=0ti~+11exp(2t=kti~+11ηtλj)ηk2,\displaystyle\exp(2)\cdot\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta^{2}_{k},

where the first inequality is because ηk1/L\eta_{k}\leq 1/L. Hence, we can obtain

k=0ti~+11exp(2t=k+1ti~+11ηtλj)ηk2\displaystyle\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t={k+1}}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta^{2}_{k}
\displaystyle\leq exp(2)k=0ti~+11exp(2t=kti~+11ηtλj)ηk2\displaystyle\exp(2)\cdot\sum_{k=0}^{t_{\tilde{i}+1}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta^{2}_{k}
=\displaystyle= exp(2)i=1i~+1k=ti1ti1exp(2t=kti~+11ηtλj)ηk2.\displaystyle\exp(2)\cdot\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta_{k}^{2}.

Furthermore, combining with Eqn. (G.1) and the condition λj[μ2i~,μ2i~+1)\lambda_{j}\in[\mu\cdot 2^{\tilde{i}},\mu\cdot 2^{\tilde{i}+1}), we can obtain

i=1i~+1k=ti1ti1exp(2t=kti~+11ηtλj)ηk2i=1i~+1k=ti1ti1exp(2t=kti~+11ηtμ2i~)ηk2\displaystyle\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\lambda_{j}\right)\eta_{k}^{2}\leq\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\exp\left(-2\sum_{t=k}^{t_{\tilde{i}+1}-1}\eta_{t}\mu\cdot 2^{\tilde{i}}\right)\eta_{k}^{2}
(G.1)\displaystyle\overset{\eqref{eq:ge_sum}}{\leq} i=1i~+1k=ti1ti1exp(2j=i+1i~+12i~j+1lnαjαj122i~i+1lnαiαi1+2i1μ(kti1))ηk2\displaystyle\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\exp\left(-2\sum_{j=i+1}^{\tilde{i}+1}2^{\tilde{i}-j+1}\ln\frac{\alpha_{j}}{\alpha_{j-1}}-2\cdot 2^{\tilde{i}-i+1}\ln\frac{\alpha_{i}}{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}\right)\eta_{k}^{2}
=\displaystyle= i=1i~+1k=ti1ti1exp(j=i+1i~+12i~j+2lnαj1αj+2i~i+2lnαi1+2i1μ(kti1)αi)ηk2\displaystyle\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\exp\left(\sum_{j=i+1}^{\tilde{i}+1}2^{\tilde{i}-j+2}\ln\frac{\alpha_{j-1}}{\alpha_{j}}+2^{\tilde{i}-i+2}\ln\frac{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}{\alpha_{i}}\right)\eta_{k}^{2}
=\displaystyle= i=1i~+1k=ti1ti1[j=i+1i~+1(αj1αj)2i~j+2](αi1+2i1μ(kti1)αi)2i~i+2ηk2\displaystyle\sum_{i=1}^{\tilde{i}+1}\sum_{k=t_{i-1}}^{t_{i}-1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\cdot\left(\frac{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}{\alpha_{i}}\right)^{2^{\tilde{i}-i+2}}\eta_{k}^{2}
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]k=ti1ti1(αi1+2i1μ(kti1)αi)2i~i+2ηk2\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\sum_{k=t_{i-1}}^{t_{i}-1}\left(\frac{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}{\alpha_{i}}\right)^{2^{\tilde{i}-i+2}}\eta_{k}^{2}
=(3.3)\displaystyle\overset{\eqref{eq:eta_t}}{=} i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]
k=ti1ti1(αi1+2i1μ(kti1)αi)2i~i+2(L+μj=1i1Δj2j1+2i1μ(kti1))2\displaystyle\cdot\sum_{k=t_{i-1}}^{t_{i}-1}\left(\frac{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}{\alpha_{i}}\right)^{2^{\tilde{i}-i+2}}\left(L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+2^{i-1}\mu(k-t_{i-1})\right)^{-2}
=(G.2)\displaystyle\overset{\eqref{eq:alpha_def}}{=} i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]
k=ti1ti1(αi1+2i1μ(kti1)αi)2i~i+2(αi1+2i1μ(kti1))2\displaystyle\cdot\sum_{k=t_{i-1}}^{t_{i}-1}\left(\frac{\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})}{\alpha_{i}}\right)^{2^{\tilde{i}-i+2}}\left(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})\right)^{-2}
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}
k=ti1ti1(αi1+2i1μ(kti1))2i~i+2(αi1+2i1μ(kti1))2\displaystyle\cdot\sum_{k=t_{i-1}}^{t_{i}-1}(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1}))^{2^{\tilde{i}-i+2}}\left(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1})\right)^{-2}
=\displaystyle= i=1i~+1[j=i+1i~+1(αj1αj)2i~j+2]αi2i~i+2k=ti1ti1(αi1+2i1μ(kti1))2i~i+22\displaystyle\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\alpha_{i}^{-2^{\tilde{i}-i+2}}\sum_{k=t_{i-1}}^{t_{i}-1}(\alpha_{i-1}+2^{i-1}\mu(k-t_{i-1}))^{2^{\tilde{i}-i+2}-2}
(G.3)\displaystyle\overset{\eqref{eq:aa}}{\leq} 212i~+1μi=1i~+1[j=i+1i~+1(αj1αj)2i~j+2](αi2i~i+21αi12i~i+21)αi2i~i+2\displaystyle 2\cdot\frac{1}{2^{\tilde{i}+1}\mu}\cdot\sum_{i=1}^{\tilde{i}+1}\left[\prod_{j=i+1}^{\tilde{i}+1}\left(\frac{\alpha_{j-1}}{\alpha_{j}}\right)^{2^{\tilde{i}-j+2}}\right]\left(\alpha_{i}^{2^{\tilde{i}-i+2}-1}-\alpha_{i-1}^{2^{\tilde{i}-i+2}-1}\right)\alpha_{i}^{-2^{\tilde{i}-i+2}}
(G.4)\displaystyle\overset{\eqref{eq:alpha_cancel}}{\leq} 212i~+1μαi~+112ηti~+1λj,\displaystyle 2\cdot\frac{1}{2^{\tilde{i}+1}\mu}\cdot\alpha_{\tilde{i}+1}^{-1}\leq 2\cdot\frac{\eta_{t_{\tilde{i}+1}}}{\lambda_{j}},

where the last inequality is because of the condition λj[μ2i~,μ2i~+1)\lambda_{j}\in[\mu\cdot 2^{\tilde{i}},\mu\cdot 2^{\tilde{i}+1}) and the definition of αi\alpha_{i}.

Therefore, we have

vti~+1,j2exp(2)ηti~+1λj15ηti~+1λj.\displaystyle v_{t_{\tilde{i}+1},j}\leq 2\exp(2)\cdot\frac{\eta_{t_{\tilde{i}+1}}}{\lambda_{j}}\leq 15\cdot\frac{\eta_{t_{\tilde{i}+1}}}{\lambda_{j}}.

Lemma 12.

Let the objective function f(x) be quadratic and Assumption (1.7) hold. Running SGD for T steps starting from w_{0} with the learning rate sequence \{\eta_{t}\}_{t=1}^{T} defined in Eqn. (3.3), the final iterate w_{T+1} satisfies

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]\leq (w0w)H(w0w)exp(2μk=0Tηk)\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\exp\left(-2\mu\sum_{k=0}^{T}\eta_{k}\right)
+15σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle+15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}.
Proof.

The goal of this lemma is to obtain an explicit bound on the variance term. By the definition of v_{t+1,j} in Lemma 10, we can obtain that

τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ]\displaystyle\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right]
(F.7)\displaystyle\overset{\eqref{eq:var}}{\leq} σ2j=1dλj2vT+1,j\displaystyle\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\cdot v_{T+1,j}
(G.9)\displaystyle\overset{\eqref{eq:var_trans}}{\leq} σ2j=1dλj2max(vti~+1+1,j,ηti~+1+1λj)\displaystyle\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\cdot\max\left(v_{t_{\tilde{i}+1}+1,j},\frac{\eta_{t_{\tilde{i}+1}+1}}{\lambda_{j}}\right)
(G.11)\displaystyle\overset{\eqref{eq:ti}}{\leq} σ2j=1dλj2max(15ηti~+1+1λj,ηti~+1+1λj)\displaystyle\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\cdot\max\left(15\cdot\frac{\eta_{t_{\tilde{i}+1}+1}}{\lambda_{j}},\frac{\eta_{t_{\tilde{i}+1}+1}}{\lambda_{j}}\right)
=\displaystyle= 15σ2j=1dλjηti~+1+115σ2j=1dλjηti~+115σ2μi~=0Imax12i~+1si~ηti~+1,\displaystyle 15\sigma^{2}\sum_{j=1}^{d}\lambda_{j}\cdot\eta_{t_{\tilde{i}+1}+1}\leq 15\sigma^{2}\sum_{j=1}^{d}\lambda_{j}\cdot\eta_{t_{\tilde{i}+1}}\leq 15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}2^{\tilde{i}+1}s_{\tilde{i}}\cdot\eta_{t_{\tilde{i}+1}},

where the last inequality holds because λj[2i~μ, 2i~+1μ)\lambda_{j}\in[2^{\tilde{i}}\mu,\;2^{\tilde{i}+1}\mu) and there are si~s_{\tilde{i}} such λj\lambda_{j}’s lying in this range. By Eqn. (3.3), we have

ηti~+1=1L+μj=1i~+1Δj2j1.\displaystyle\eta_{t_{\tilde{i}+1}}=\frac{1}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}. (G.13)

Therefore, we have

τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ]15σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right]\leq 15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}. (G.14)

Combining with Lemma 4 and Lemma 6, we can obtain that

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]\leq (w0w)H(w0w)exp(2μk=0Tηk)\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\exp\left(-2\mu\sum_{k=0}^{T}\eta_{k}\right)
+15σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle+15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}.

Lemma 13.

For any t0t\geq 0, the learning rate sequence {ηt}t=1T\{\eta_{t}\}_{t=1}^{T} defined in Eqn. (3.3) satisfies

ηt1L+μt\displaystyle\eta_{t}\leq\frac{1}{L+\mu t} (G.15)
Proof.

For any t0t\geq 0, there exists i1i\geq 1 such that t[ti1,ti)t\in[t_{i-1},t_{i}). Given the form defined in Eqn. (3.3), we have

ηt=\displaystyle\eta_{t}= 1L+μj=1i1Δj2j1+2i1μ(tti1)\displaystyle\frac{1}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}2^{j-1}+2^{i-1}\mu(t-t_{i-1})}
\displaystyle\leq 1L+μj=1i1Δj+μ(tti1)\displaystyle\frac{1}{L+\mu\sum_{j=1}^{i-1}\Delta_{j}+\mu(t-t_{i-1})}
=(3.4)\displaystyle\overset{\eqref{eq:delta_and_t}}{=} 1L+μj=1i1(tjtj1)+μ(tti1)\displaystyle\frac{1}{L+\mu\sum_{j=1}^{i-1}(t_{j}-t_{j-1})+\mu(t-t_{i-1})}
=\displaystyle= 1L+μ(ti1t0)+μ(tti1)\displaystyle\frac{1}{L+\mu(t_{i-1}-t_{0})+\mu(t-t_{i-1})}
=\displaystyle= 1L+μ(tt0)\displaystyle\frac{1}{L+\mu(t-t_{0})}
=\displaystyle= 1L+μt\displaystyle\frac{1}{L+\mu t}
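As a sanity check, the piecewise form of the learning rate used above is easy to evaluate numerically. The following minimal Python sketch (ours, not part of the analysis) implements this form for assumed values of L, mu and the interval lengths Delta_i, and checks the bound of Lemma 13 on a toy configuration; the function name and all constants are illustrative only.

```python
def eigencurve_lr(t, L, mu, deltas):
    """Piecewise learning rate from the proof of Lemma 13: for t in [t_{i-1}, t_i),
    eta_t = 1 / (L + mu * sum_{j<i} Delta_j * 2^{j-1} + 2^{i-1} * mu * (t - t_{i-1}))."""
    t_prev, acc = 0, 0.0  # t_{i-1} and mu * sum_{j=1}^{i-1} Delta_j * 2^{j-1}
    for i, delta in enumerate(deltas, start=1):
        if t < t_prev + delta:
            return 1.0 / (L + acc + 2 ** (i - 1) * mu * (t - t_prev))
        acc += mu * delta * 2 ** (i - 1)
        t_prev += delta
    return 1.0 / (L + acc)  # after the last interval, keep the final value

# Toy check of Lemma 13: eta_t <= 1 / (L + mu * t) for all t.
L_, mu_ = 100.0, 1.0          # illustrative constants (kappa = 100)
deltas_ = [40, 60, 80, 120]   # illustrative interval lengths Delta_1, ..., Delta_4
assert all(eigencurve_lr(t, L_, mu_, deltas_) <= 1.0 / (L_ + mu_ * t) + 1e-12
           for t in range(sum(deltas_)))
```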

Lemma 14.

Let the objective function f(x)f(x) be quadratic and Assumption (1.7) hold. Running SGD for TT steps starting from w0w_{0} with the learning rate sequence {ηt}t=1T\{\eta_{t}\}_{t=1}^{T} defined in Eqn. (3.3), the final iterate wT+1w_{T+1} satisfies

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]\leq (w0w)H(w0w)κ2Δ12\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\frac{\kappa^{2}}{\Delta_{1}^{2}}
+15σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle+15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}.
Proof.

The goal of this lemma is to derive an explicit bound on the bias term.

First, by Eqn. (G.1) and the condition λj[μ2i~,μ2i~+1)\lambda_{j}\in[\mu\cdot 2^{\tilde{i}},\mu\cdot 2^{\tilde{i}+1}), we have

exp(2λjk=0Tηk)\displaystyle\exp\left(-2\lambda_{j}\sum_{k=0}^{T}\eta_{k}\right)\leq exp(2k=0ti~+11ηkλj)exp(2i=1i~+112i1μlnαiαi1λj)\displaystyle\exp\left(-2\sum_{k=0}^{t_{\tilde{i}+1}-1}\eta_{k}\lambda_{j}\right)\leq\exp\left(-2\sum_{i=1}^{\tilde{i}+1}\frac{1}{2^{i-1}\mu}\ln\frac{\alpha_{i}}{\alpha_{i-1}}\lambda_{j}\right) (G.16)
\displaystyle\leq exp(i=1i~+12i~i+2lnαiαi1)=i=1i~+1(αi1αi)2i~i+2\displaystyle\exp\left(-\sum_{i=1}^{\tilde{i}+1}2^{\tilde{i}-i+2}\ln\frac{\alpha_{i}}{\alpha_{i-1}}\right)=\prod_{i=1}^{\tilde{i}+1}\left(\frac{\alpha_{i-1}}{\alpha_{i}}\right)^{2^{\tilde{i}-i+2}}
\displaystyle\leq i=1i~+1(αi1αi)2=(α0αi~+1)2=L2ηti~+12\prod_{i=1}^{\tilde{i}+1}\left(\frac{\alpha_{i-1}}{\alpha_{i}}\right)^{2}=\left(\frac{\alpha_{0}}{\alpha_{\tilde{i}+1}}\right)^{2}=L^{2}\cdot\eta_{t_{\tilde{i}+1}}^{2}

For λj=μ\lambda_{j}=\mu, since μ[μ2i~,μ2i~+1)\mu\in[\mu\cdot 2^{\tilde{i}},\mu\cdot 2^{\tilde{i}+1}) for i~=0\tilde{i}=0, it follows,

exp(2μk=0Tηk)L2ηt12(G.15)(LL+μt1)2=(3.4)(LL+μΔ1)2(LμΔ1)2=κ2Δ12\displaystyle\exp\left(-2\mu\sum_{k=0}^{T}\eta_{k}\right)\leq L^{2}\cdot\eta_{t_{1}}^{2}\overset{\eqref{eq:eta_upper}}{\leq}\left(\frac{L}{L+\mu t_{1}}\right)^{2}\overset{\eqref{eq:delta_and_t}}{=}\left(\frac{L}{L+\mu\Delta_{1}}\right)^{2}\leq\left(\frac{L}{\mu\Delta_{1}}\right)^{2}=\frac{\kappa^{2}}{\Delta_{1}^{2}} (G.17)

Combining with Lemma 12, we obtain that,

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]\leq (w0w)H(w0w)κ2Δ12\displaystyle(w_{0}-w_{*})^{\top}H(w_{0}-w_{*})\cdot\frac{\kappa^{2}}{\Delta_{1}^{2}}
+15σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle+15\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}.

G.1 Proof of Lemma 1

See 1

Proof.

For any t0t\geq 0, we have

f(wt)f(w)=(1.2)\displaystyle f(w_{t})-f(w_{*})\overset{\eqref{eq:prob}}{=} 𝔼[12wtH(ξ)wtb(ξ)wt]𝔼[12wH(ξ)wb(ξ)w]\displaystyle\mathbb{E}\left[\frac{1}{2}w_{t}^{\top}H(\xi)w_{t}-b(\xi)^{\top}w_{t}\right]-\mathbb{E}\left[\frac{1}{2}w_{*}^{\top}H(\xi)w_{*}-b(\xi)^{\top}w_{*}\right]
=\displaystyle= (12wt𝔼[H(ξ)]wt𝔼[b(ξ)]wt)(12w𝔼[H(ξ)]w𝔼[b(ξ)]w)\displaystyle\left(\frac{1}{2}w_{t}^{\top}\mathbb{E}[H(\xi)]w_{t}-\mathbb{E}[b(\xi)]^{\top}w_{t}\right)-\left(\frac{1}{2}w_{*}^{\top}\mathbb{E}[H(\xi)]w_{*}-\mathbb{E}[b(\xi)]^{\top}w_{*}\right)
=\displaystyle= (12wtHwtbwt)(12wHwbw)\displaystyle\left(\frac{1}{2}w_{t}^{\top}Hw_{t}-b^{\top}w_{t}\right)-\left(\frac{1}{2}w_{*}^{\top}Hw_{*}-b^{\top}w_{*}\right)
=(1.4)\displaystyle\overset{\eqref{eq:optima}}{=} (12wtHwtbwt)(12b(H)1bbH1b)\displaystyle\left(\frac{1}{2}w_{t}^{\top}Hw_{t}-b^{\top}w_{t}\right)-\left(\frac{1}{2}b^{\top}\left(H^{\top}\right)^{-1}b-b^{\top}H^{-1}b\right)
=\displaystyle= (12wtHwtbwt)(12bH1bbH1b)\displaystyle\left(\frac{1}{2}w_{t}^{\top}Hw_{t}-b^{\top}w_{t}\right)-\left(\frac{1}{2}b^{\top}H^{-1}b-b^{\top}H^{-1}b\right)
=\displaystyle= 12wtHwtbwt+12bH1b\displaystyle\frac{1}{2}w_{t}^{\top}Hw_{t}-b^{\top}w_{t}+\frac{1}{2}b^{\top}H^{-1}b
=\displaystyle= 12wtHwt12bwt12bwt+12bH1b\displaystyle\frac{1}{2}w_{t}^{\top}Hw_{t}-\frac{1}{2}b^{\top}w_{t}-\frac{1}{2}b^{\top}w_{t}+\frac{1}{2}b^{\top}H^{-1}b
=\displaystyle= 12wtHwt12wtb12bwt+12bH1b\displaystyle\frac{1}{2}w_{t}^{\top}Hw_{t}-\frac{1}{2}w_{t}^{\top}b-\frac{1}{2}b^{\top}w_{t}+\frac{1}{2}b^{\top}H^{-1}b
=\displaystyle= 12wtHwt12wtHw12wHwt+12wHw\displaystyle\frac{1}{2}w_{t}^{\top}Hw_{t}-\frac{1}{2}w_{t}^{\top}Hw_{*}-\frac{1}{2}w_{*}^{\top}Hw_{t}+\frac{1}{2}w_{*}^{\top}Hw_{*}
=\displaystyle= 12(wtw)H(wtw),\displaystyle\frac{1}{2}(w_{t}-w_{*})^{\top}H(w_{t}-w_{*}),

where the 5th equality is entailed by the fact that HH is symmetric, i.e. H=HH^{\top}=H, and the 9th equality uses both H=HH^{\top}=H and Eqn. (1.4).

Combining the above result with Lemma 14, we obtain that

𝔼[f(wT+1)f(w)]\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq (f(w0)f(w))κ2Δ12+152σ2μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1.\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{\Delta_{1}^{2}}+\frac{15}{2}\cdot\sigma^{2}\mu\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}.

G.2 Proof of Theorem 1

See 1

Proof.

We have

μi~=0Imax12i~+1si~L+μj=1i~+1Δj2j1<\displaystyle\mu\cdot\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{L+\mu\sum_{j=1}^{\tilde{i}+1}\Delta_{j}2^{j-1}}< μi~=0Imax12i~+1si~μ2i~Δi~+1=(3.5)2i~=0Imax1si~si~i=0Imax1siT\displaystyle\mu\cdot\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{2^{\tilde{i}+1}s_{\tilde{i}}}{\mu 2^{\tilde{i}}\Delta_{\tilde{i}+1}}\overset{\eqref{eq:delta}}{=}2\sum_{\tilde{i}=0}^{I_{\max}-1}\frac{s_{\tilde{i}}}{\frac{\sqrt{s_{\tilde{i}}}}{\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}}\cdot T}
=\displaystyle= 2Ti=0Imax1sii~=0Imax1si~=2(i=0Imax1si)2T.\displaystyle\frac{2}{T}\cdot\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\cdot\sum_{\tilde{i}=0}^{I_{\max}-1}\sqrt{s_{\tilde{i}}}=\frac{2\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{T}.

Combining with Lemma 1 and the definition of Δ1\Delta_{1}, we can obtain that

𝔼[f(wT+1)f(w)]\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]\leq (f(w0)f(w))κ2(i=0Imax1si)2s0T2+15(i=0Imax1si)2Tσ2.\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}\cdot\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}T^{2}}+\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{T}\cdot\sigma^{2}.

G.3 Proof of Corollary 2

See 2

Proof.

According to Theorem 1,

𝔼[f(wT+1)f(w)]\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]
\displaystyle\leq (f(w0)f(w))κ2(i=0Imax1si)2s0T2+15(i=0Imax1si)2Tσ2\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}\cdot\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}T^{2}}+\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{T}\cdot\sigma^{2}
=\displaystyle= (f(w0)f(w))κ2T2(i=0Imax1si)2s0+dσ2T15(i=0Imax1si)2d.\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{T^{2}}\cdot\frac{\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}}+\frac{d\sigma^{2}}{T}\cdot\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{d}.

The key terms here are C1(i=0Imax1si)2/s0C_{1}\triangleq\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}/s_{0} and C215(i=0Imax1si)2/dC_{2}\triangleq 15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}/d. As long as we can bound both terms with a constant C(α)C(\alpha), the corollary will be directly proved.

1) If κ<2\kappa<2, then there is only one interval, with s0=ds_{0}=d. Setting C(α)=max(C1,C2)=15C(\alpha)=\max(C_{1},C_{2})=15 completes the proof.

2) If κ2\kappa\geq 2, then C1C_{1} and C2C_{2} can be bounded by computing the value of sis_{i} under the power law. For every interval ii except the last one, we have

sid=(3.1)\displaystyle\frac{s_{i}}{d}\overset{\eqref{eq:def_si}}{=} #λj[μ2i,μ2i+1)=μ2iμ2i+1p(λ)𝑑λ\displaystyle\#\lambda_{j}\in[\mu\cdot 2^{i},\mu\cdot 2^{i+1})=\int_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}p(\lambda)d\lambda
=(3.7)\displaystyle\overset{\eqref{eq:power_law}}{=} μ2iμ2i+11Z(μλ)α𝑑λ=1Zμαμ2iμ2i+1λα𝑑λ\displaystyle\int_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\frac{1}{Z}\cdot\left(\frac{\mu}{\lambda}\right)^{\alpha}d\lambda=\frac{1}{Z}\cdot\mu^{\alpha}\cdot\int_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\lambda^{-\alpha}d\lambda
=\displaystyle= (μL(μλ)α𝑑λ)1μαμ2iμ2i+1λα𝑑λ=(μLλα𝑑λ)1μ2iμ2i+1λα𝑑λ\displaystyle\left(\int_{\mu}^{L}\left(\frac{\mu}{\lambda}\right)^{\alpha}d\lambda\right)^{-1}\cdot\mu^{\alpha}\cdot\int_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\lambda^{-\alpha}d\lambda=\left(\int_{\mu}^{L}\lambda^{-\alpha}d\lambda\right)^{-1}\cdot\int_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\lambda^{-\alpha}d\lambda
=\displaystyle= (λ1α1α|μL)1(λ1α1α|μ2iμ2i+1)=(λ1α|μL)1(λ1α|μ2iμ2i+1)\displaystyle\left(\frac{\lambda^{1-\alpha}}{1-\alpha}\Big{|}_{\mu}^{L}\right)^{-1}\cdot\left(\frac{\lambda^{1-\alpha}}{1-\alpha}\Big{|}_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\right)=\left(\lambda^{1-\alpha}\Big{|}_{\mu}^{L}\right)^{-1}\cdot\left(\lambda^{1-\alpha}\Big{|}_{\mu\cdot 2^{i}}^{\mu\cdot 2^{i+1}}\right)
=\displaystyle= (L1αμ1α)1(μ1α(2i+1)1αμ1α(2i)1α)\displaystyle\left(L^{1-\alpha}-\mu^{1-\alpha}\right)^{-1}\cdot\left(\mu^{1-\alpha}\cdot\left(2^{i+1}\right)^{1-\alpha}-\mu^{1-\alpha}\cdot\left(2^{i}\right)^{1-\alpha}\right)
=\displaystyle= μ1α(2i+1)1αμ1α(2i)1αL1αμ1α=(2i+1)1α(2i)1ακ1α1\displaystyle\frac{\mu^{1-\alpha}\cdot\left(2^{i+1}\right)^{1-\alpha}-\mu^{1-\alpha}\cdot\left(2^{i}\right)^{1-\alpha}}{L^{1-\alpha}-\mu^{1-\alpha}}=\frac{\left(2^{i+1}\right)^{1-\alpha}-\left(2^{i}\right)^{1-\alpha}}{\kappa^{1-\alpha}-1}
=\displaystyle= 2i(1α)21α1κ1α1\displaystyle 2^{i(1-\alpha)}\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}

Therefore, we have

si=d2i(1α)21α1κ1α1=d21α1κ1α12i(1α)\displaystyle s_{i}=d\cdot 2^{i(1-\alpha)}\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}=d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot 2^{i(1-\alpha)} (G.18)

holds for every interval ii except the last interval i=Imax1=log2κ1>0i^{\prime}=I_{\max}-1=\log_{2}\kappa-1>0. This last interval may not completely cover [μ2i,μ2i+1)[\mu\cdot 2^{i^{\prime}},\mu\cdot 2^{i^{\prime}+1}), since its right end is truncated by LL, but we still have

sid21α1κ1α12i(1α)\displaystyle s_{i^{\prime}}\leq d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot 2^{i^{\prime}(1-\alpha)}

It follows,

(i=0Imax1si)2\displaystyle\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}\leq d21α1κ1α1(i=0Imax12i(1α))2=d21α1κ1α1(i=0Imax12i(1α)/2)2\displaystyle d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\sum_{i=0}^{I_{\max}-1}\sqrt{2^{i(1-\alpha)}}\right)^{2}=d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\sum_{i=0}^{I_{\max}-1}2^{i(1-\alpha)/2}\right)^{2}
=\displaystyle= d21α1κ1α1(i=0Imax1(12)i(α1)/2)2\displaystyle d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\sum_{i=0}^{I_{\max}-1}\left(\frac{1}{2}\right)^{i(\alpha-1)/2}\right)^{2}
\displaystyle\leq d21α1κ1α1(i=0(12)i(α1)/2)2=d21α1κ1α1(i=0(12(α1)/2)i)2\displaystyle d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\sum_{i=0}^{\infty}\left(\frac{1}{2}\right)^{i(\alpha-1)/2}\right)^{2}=d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\sum_{i=0}^{\infty}\left(\frac{1}{2^{(\alpha-1)/2}}\right)^{i}\right)^{2}
=\displaystyle= d21α1κ1α1(1112(α1)/2)2=d21α1κ1α1(112(1α)/2)2.\displaystyle d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\frac{1}{1-\frac{1}{2^{(\alpha-1)/2}}}\right)^{2}=d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}.

Thus,

C1=\displaystyle C_{1}= (i=0Imax1si)2s0d21α1κ1α1(112(1α)/2)2d21α1κ1α120(1α)=(112(1α)/2)2\displaystyle\frac{\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}}\leq\frac{d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}}{d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot 2^{0(1-\alpha)}}=\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}
C2=\displaystyle C_{2}= 15(i=0Imax1si)2d15dd21α1κ1α1(112(1α)/2)2\displaystyle\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{d}\leq\frac{15}{d}\cdot d\cdot\frac{2^{1-\alpha}-1}{\kappa^{1-\alpha}-1}\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}
=\displaystyle= 151(12)α11(1κ)α1(112(1α)/2)2\displaystyle 15\cdot\frac{1-\left(\frac{1}{2}\right)^{\alpha-1}}{1-\left(\frac{1}{\kappa}\right)^{\alpha-1}}\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}
\displaystyle\leq 15(112(1α)/2)2.\displaystyle 15\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}.

Here the last inequality for C2C_{2} is entailed by κ2\kappa\geq 2 and α>1\alpha>1.

By setting C(α)=max(C1,C2)=15(112(1α)/2)2C(\alpha)=\max(C_{1},C_{2})=15\cdot\left(\frac{1}{1-2^{(1-\alpha)/2}}\right)^{2}, we obtain

𝔼[f(wT+1)f(w)]\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]
\displaystyle\leq (f(w0)f(w))κ2T2(i=0Imax1si)2s0+dσ2T15(i=0Imax1si)2d.\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{T^{2}}\cdot\frac{\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{s_{0}}+\frac{d\sigma^{2}}{T}\cdot\frac{15\left(\sum_{i=0}^{I_{\max}-1}\sqrt{s_{i}}\right)^{2}}{d}.
=\displaystyle= (f(w0)f(w))κ2T2C1+dσ2TC2\displaystyle(f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{T^{2}}\cdot C_{1}+\frac{d\sigma^{2}}{T}\cdot C_{2}
\displaystyle\leq ((f(w0)f(w))κ2T2+dσ2T)C(α).\displaystyle\left((f(w_{0})-f(w_{*}))\cdot\frac{\kappa^{2}}{T^{2}}+\frac{d\sigma^{2}}{T}\right)\cdot C(\alpha).
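To make the constants above concrete, the short NumPy sketch below (ours, purely illustrative) evaluates the sis_{i} formula of Eqn. (G.18) for an assumed power-law spectrum and checks the bounds on C_1 and C_2 numerically. The values of d, kappa and alpha are illustrative, and kappa is taken as a power of two so that I_max = log_2(kappa) is an integer.

```python
import numpy as np

# Numerical check (ours) of the bounds on C_1 and C_2 derived above.
d, kappa, alpha = 1000, 2 ** 10, 2.0   # illustrative dimension, condition number, exponent
I_max = int(np.log2(kappa))
i = np.arange(I_max)
# s_i from Eqn. (G.18); for the last (truncated) interval this is an upper bound.
s = d * (2 ** (1 - alpha) - 1) / (kappa ** (1 - alpha) - 1) * 2.0 ** (i * (1 - alpha))
C1 = np.sqrt(s).sum() ** 2 / s[0]
C2 = 15 * np.sqrt(s).sum() ** 2 / d
geo = (1.0 / (1.0 - 2 ** ((1 - alpha) / 2))) ** 2   # (1 / (1 - 2^{(1-alpha)/2}))^2
assert C1 <= geo + 1e-9 and C2 <= 15 * geo + 1e-9
```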

G.4 Proof of Theorem 4

See 4

Proof.

The lower bound here is an asymptotic bound. Specifically, we require

TlogTmax(216,16,1256σ2minjλj(w0,jw,j)2).\displaystyle\frac{T}{\log T}\geq\max\left(2^{16},16,\frac{1}{256}\cdot\frac{\sigma^{2}}{\min_{j}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}}\right). (G.19)

In Ge et al. (2019), step decay has the following learning rate sequence:

ηt=η12if t[1+TlogT,TlogT(+1)],\displaystyle\eta_{t}=\frac{\eta_{1}}{2^{\ell}}\quad\mbox{if }t\in\left[1+\frac{T}{\log T}\cdot\ell,\frac{T}{\log T}\cdot(\ell+1)\right], (G.20)

where =0,1,,logT1\ell=0,1,\dots,\log T-1. Notice that the index starts from t=1t=1 instead of t=0t=0. For consistency with our framework, we set η0=0\eta_{0}=0, which produces exactly the same step decay scheduler while adding only one extra iteration, and thus does not affect the overall asymptotic bound.
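For concreteness, a minimal Python sketch of this step decay schedule (ours) is given below; it assumes that log T denotes the base-2 logarithm and that T is a power of two so that the block boundaries are (approximately) integers, and it keeps the eta_0 = 0 convention above.

```python
import math

def step_decay_lr(t, T, eta1):
    """Step decay of Ge et al. (2019), Eqn. (G.20): eta_t = eta_1 / 2^ell on the
    ell-th block of length T / log T, for ell = 0, ..., log T - 1, with eta_0 = 0."""
    if t == 0:
        return 0.0
    n_blocks = int(math.log2(T))   # log T decay stages
    block = T // n_blocks          # block length T / log T (integer division)
    ell = min((t - 1) // block, n_blocks - 1)
    return eta1 / 2 ** ell

# e.g. T = 1024 gives 10 blocks of length 102; the rate starts at eta1
# and halves after every block.
```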

We first translate the general notation to the diagonal case so that the idea of the proof becomes clearer.

Since f(x)f(x) is quadratic, according to the proof of Lemma 1 in Appendix G.1,

f(wT+1)f(w)=\displaystyle f(w_{T+1})-f(w_{*})= 12(wT+1w)H(wT+1w).\displaystyle\frac{1}{2}(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*}).

Furthermore, according to Lemma 4, where Pt=IηtHP_{t}=I-\eta_{t}H,

𝔼[(wT+1w)H(wT+1w)]\displaystyle\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right]
=\displaystyle= 𝔼[(w0w)PTP0HP0PT(w0w)]\displaystyle\mathbb{E}\left[(w_{0}-w_{*})^{\top}\cdot P_{T}\dots P_{0}HP_{0}\dots P_{T}\cdot(w_{0}-w_{*})\right]
+τ=0T𝔼[ητ2nτPTPτ+1HPτ+1PTnτ]\displaystyle+\sum_{\tau=0}^{T}\mathbb{E}\left[\eta_{\tau}^{2}n_{\tau}^{\top}\cdot P_{T}\dots P_{\tau+1}HP_{\tau+1}\dots P_{T}\cdot n_{\tau}\right]
=\displaystyle= j=1dλj(w0,jw,j)2k=0T(1ηkλj)2+τ=0Tητ2j=1dk=τ+1Tλj(1ηkλj)2𝔼[nτ,j2]\displaystyle\sum_{j=1}^{d}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}+\sum_{\tau=0}^{T}\eta_{\tau}^{2}\sum_{j=1}^{d}\prod_{k=\tau+1}^{T}\lambda_{j}(1-\eta_{k}\lambda_{j})^{2}\mathbb{E}\left[n_{\tau,j}^{2}\right]
=\displaystyle= j=1dλj(w0,jw,j)2k=0T(1ηkλj)2+τ=0Tητ2j=1dk=τ+1Tλj(1ηkλj)2λjσ2\displaystyle\sum_{j=1}^{d}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}+\sum_{\tau=0}^{T}\eta_{\tau}^{2}\sum_{j=1}^{d}\prod_{k=\tau+1}^{T}\lambda_{j}(1-\eta_{k}\lambda_{j})^{2}\cdot\lambda_{j}\sigma^{2}
=\displaystyle= j=1dλj(w0,jw,j)2k=0T(1ηkλj)2+σ2j=1dλj2τ=0Tητ2k=τ+1T(1ηkλj)2.\displaystyle\sum_{j=1}^{d}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}+\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}\sum_{\tau=0}^{T}\eta_{\tau}^{2}\prod_{k=\tau+1}^{T}(1-\eta_{k}\lambda_{j})^{2}.

Here the second equality is entailed by the fact that HH and PtP_{t} are diagonal, and the third equality comes from 𝔼ξ[ntnt]=σ2H\mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]=\sigma^{2}H. Thus, by denoting bjλj(w0,jw,j)2k=0T(1ηkλj)2b_{j}\triangleq\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2} and vjτ=0Tητ2k=τ+1T(1ηkλj)2v_{j}\triangleq\sum_{\tau=0}^{T}\eta_{\tau}^{2}\prod_{k=\tau+1}^{T}(1-\eta_{k}\lambda_{j})^{2}, we have,

𝔼[f(wT+1)f(w)]\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right] =12𝔼[(wT+1w)H(wT+1w)]\displaystyle=\frac{1}{2}\mathbb{E}\left[(w_{T+1}-w_{*})^{\top}H(w_{T+1}-w_{*})\right] (G.21)
=12[(j=1dbj)+(σ2j=1dλj2vj)].\displaystyle=\frac{1}{2}\left[\left(\sum_{j=1}^{d}b_{j}\right)+\left(\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}v_{j}\right)\right].
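Since the problem is diagonal here, the quantities b_j and v_j admit direct numerical evaluation for any learning rate sequence. As a purely illustrative aid (ours, not part of the proof), the following Python sketch computes the right-hand side of Eqn. (G.21); the function name excess_risk and the example inputs are assumptions.

```python
import numpy as np

def excess_risk(lambdas, w0_err, etas, sigma):
    """Closed form of Eqn. (G.21) for SGD on a diagonal quadratic:
       E[f(w_{T+1}) - f(w_*)] = 0.5 * (sum_j b_j + sigma^2 * sum_j lambda_j^2 * v_j),
       b_j = lambda_j * (w_{0,j} - w_{*,j})^2 * prod_{k=0}^{T} (1 - eta_k * lambda_j)^2,
       v_j = sum_{tau=0}^{T} eta_tau^2 * prod_{k=tau+1}^{T} (1 - eta_k * lambda_j)^2."""
    lambdas, w0_err, etas = map(np.asarray, (lambdas, w0_err, etas))
    contraction = (1.0 - np.outer(etas, lambdas)) ** 2        # (T+1, d): (1 - eta_k * lambda_j)^2
    bias = lambdas * w0_err ** 2 * contraction.prod(axis=0)   # b_j
    suffix = np.ones_like(contraction)                        # prod_{k=tau+1}^{T} (...) for each tau
    suffix[:-1] = contraction[::-1].cumprod(axis=0)[::-1][1:]
    variance = (etas[:, None] ** 2 * suffix).sum(axis=0)      # v_j
    return 0.5 * (bias.sum() + sigma ** 2 * (lambdas ** 2 * variance).sum())

# e.g. two eigenvalues, a constant learning rate and unit initial error:
# excess_risk([1.0, 10.0], [1.0, 1.0], [0.05] * 100, sigma=0.1)
```

For instance, evaluating this closed form with the step decay schedule of Eqn. (G.20) gives a direct numerical handle on the lower bound derived in the rest of this proof.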

To proceed with the analysis, we divide all eigenvalues {λj}\{\lambda_{j}\} into two groups:

𝒜={j|λj>logT8η1T},={j|λjlogT8η1T},\displaystyle\mathcal{A}=\left\{j\middle|\lambda_{j}>\frac{\log T}{8\eta_{1}T}\right\},\quad\mathcal{B}=\left\{j\middle|\lambda_{j}\leq\frac{\log T}{8\eta_{1}T}\right\}, (G.22)

where group 𝒜\mathcal{A} contains the large eigenvalues for which the variance term vjv_{j} eventually dominates, and group \mathcal{B} contains the small eigenvalues for which the bias term bjb_{j} eventually dominates. Rigorously speaking,

a) For j𝒜\forall j\in\mathcal{A}:

Step decay’s bottleneck in the variance term actually occurs at the first interval \ell that satisfies

2λjη18TlogT\displaystyle 2^{\ell}\geq\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T} (G.23)

We first show that interval \ell is well-defined for any dimension j𝒜j\in\mathcal{A}. Since j𝒜j\in\mathcal{A}, it follows from the definition of 𝒜\mathcal{A} in Eqn. (G.22),

λj>logT8η1Tλjη18TlogT>1=20\displaystyle\lambda_{j}>\frac{\log T}{8\eta_{1}T}\quad\Longrightarrow\quad\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}>1=2^{0}

On the other hand, since we assume T/logT216T/\log T\geq 2^{16} in Eqn. (G.19), which implies T216T\geq 2^{16} and hence logT16\log T\geq 16, it follows

λjη18TlogTλjη1T2λjLT2T2=2logT1,\displaystyle\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}\leq\lambda_{j}\eta_{1}\cdot\frac{T}{2}\leq\frac{\lambda_{j}}{L}\cdot\frac{T}{2}\leq\frac{T}{2}=2^{\log T-1},

where the second inequality comes from η11/L\eta_{1}\leq 1/L in assumption (1), and the third inequality is entailed by λjL\lambda_{j}\leq L given the definition of LL in Eqn. (1.6).

As a result, we have

λjη18TlogT(20,2logT1]\displaystyle\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}\in\left(2^{0},2^{\log T-1}\right]

thus

2λjη18TlogT\displaystyle 2^{\ell}\geq\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}

is guaranteed to be satisfied for some interval =1,,logT1\ell=1,\dots,\log T-1. Since interval \ell is the first interval that satisfies Eqn. (G.23), we also have

21<λjη18TlogT2<λjη116TlogT\displaystyle 2^{\ell-1}<\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}\quad\Longrightarrow\quad 2^{\ell}<\lambda_{j}\eta_{1}\cdot\frac{16T}{\log T} (G.24)

Back to our analysis for the lower bound, by focusing on the variance produced by interval \ell only, we have,

vj=\displaystyle v_{j}= τ=0Tητ2k=τ+1T(1ηkλj)2τ=TlogT+1(+1)TlogTητ2k=τ+1T(1ηkλj)2\displaystyle\sum_{\tau=0}^{T}\eta_{\tau}^{2}\prod_{k=\tau+1}^{T}(1-\eta_{k}\lambda_{j})^{2}\geq\sum_{\tau=\ell\cdot\frac{T}{\log T}+1}^{(\ell+1)\cdot\frac{T}{\log T}}\eta_{\tau}^{2}\prod_{k=\tau+1}^{T}(1-\eta_{k}\lambda_{j})^{2}
\displaystyle\geq τ=TlogT+1(+1)TlogTητ2k=TlogT+1T(1ηkλj)2=τ=TlogT+1(+1)TlogT(η12)2k=TlogT+1T(1ηkλj)2\displaystyle\sum_{\tau=\ell\cdot\frac{T}{\log T}+1}^{(\ell+1)\cdot\frac{T}{\log T}}\eta_{\tau}^{2}\prod_{k=\ell\cdot\frac{T}{\log T}+1}^{T}(1-\eta_{k}\lambda_{j})^{2}=\sum_{\tau=\ell\cdot\frac{T}{\log T}+1}^{(\ell+1)\cdot\frac{T}{\log T}}\left(\frac{\eta_{1}}{2^{\ell}}\right)^{2}\prod_{k=\ell\cdot\frac{T}{\log T}+1}^{T}(1-\eta_{k}\lambda_{j})^{2}
=\displaystyle= TlogT(η12)2k=TlogT+1T(1ηkλj)2\displaystyle\frac{T}{\log T}\cdot\left(\frac{\eta_{1}}{2^{\ell}}\right)^{2}\prod_{k=\ell\cdot\frac{T}{\log T}+1}^{T}(1-\eta_{k}\lambda_{j})^{2}
>(G.24)\displaystyle\overset{\eqref{eq:steplower_first_interval_2}}{>} TlogT(η1λjη116TlogT)2k=TlogT+1T(1ηkλj)2\displaystyle\frac{T}{\log T}\cdot\left(\frac{\eta_{1}}{\lambda_{j}\eta_{1}\cdot\frac{16T}{\log T}}\right)^{2}\prod_{k=\ell\cdot\frac{T}{\log T}+1}^{T}(1-\eta_{k}\lambda_{j})^{2}
=\displaystyle= 1256logTT1λj2k=TlogT+1T(1ηkλj)2\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\prod_{k=\ell\cdot\frac{T}{\log T}+1}^{T}(1-\eta_{k}\lambda_{j})^{2}
\displaystyle\geq 1256logTT1λj2(1k=TlogT+1T2ηkλj)=1256logTT1λj2(12λjk=TlogT+1Tηk)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-\sum_{k=\ell\cdot\frac{T}{\log T}+1}^{T}2\eta_{k}\lambda_{j}\right)=\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-2\lambda_{j}\sum_{k=\ell\cdot\frac{T}{\log T}+1}^{T}\eta_{k}\right)
=\displaystyle= 1256logTT1λj2(12λji=logT1k=iTlogT+1(i+1)TlogTηk)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-2\lambda_{j}\sum_{i=\ell}^{\log T-1}\sum_{k=i\cdot\frac{T}{\log T}+1}^{(i+1)\cdot\frac{T}{\log T}}\eta_{k}\right)
=\displaystyle= 1256logTT1λj2(12λji=logT1k=iTlogT+1(i+1)TlogTη12i)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-2\lambda_{j}\sum_{i=\ell}^{\log T-1}\sum_{k=i\cdot\frac{T}{\log T}+1}^{(i+1)\cdot\frac{T}{\log T}}\frac{\eta_{1}}{2^{i}}\right)
=\displaystyle= 1256logTT1λj2(12λji=logT1TlogTη12i)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-2\lambda_{j}\sum_{i=\ell}^{\log T-1}\frac{T}{\log T}\cdot\frac{\eta_{1}}{2^{i}}\right)
\displaystyle\geq 1256logTT1λj2(12λjTlogTη121)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-2\lambda_{j}\cdot\frac{T}{\log T}\cdot\frac{\eta_{1}}{2^{\ell-1}}\right)
(G.23)\displaystyle\overset{\eqref{eq:steplower_first_interval_1}}{\geq} 1256logTT1λj2(14λjTlogTη1λjη18TlogT)\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\left(1-4\lambda_{j}\cdot\frac{T}{\log T}\cdot\frac{\eta_{1}}{\lambda_{j}\eta_{1}\cdot\frac{8T}{\log T}}\right)
=\displaystyle= 1256logTT1λj212\displaystyle\frac{1}{256}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}\cdot\frac{1}{2}
=\displaystyle= 1512logTT1λj2\displaystyle\frac{1}{512}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}

Here the first inequality is obtained by focusing on the variance generated in interval \ell only. The second inequality utilizes τT/logT\tau\geq\ell\cdot T/\log T. The fourth inequality is entailed by (1a1)(1a2)=1a1a2+a1a21a1a2(1-a_{1})(1-a_{2})=1-a_{1}-a_{2}+a_{1}a_{2}\geq 1-a_{1}-a_{2} for all a1,a2[0,1]a_{1},a_{2}\in[0,1], which by mathematical induction extends to i=1n(1ai)1i=1nai\prod_{i=1}^{n}(1-a_{i})\geq 1-\sum_{i=1}^{n}a_{i} as long as i=1nai1\sum_{i=1}^{n}a_{i}\leq 1. The fifth inequality comes from i=logT11/2ii=1/2i=1/21\sum_{i=\ell}^{\log T-1}1/2^{i}\leq\sum_{i=\ell}^{\infty}1/2^{i}=1/2^{\ell-1}.

b) For j\forall j\in\mathcal{B}:

Step decay’s bottleneck will occur in the bias term. Since jj\in\mathcal{B}, it follows from the definition of \mathcal{B} in Eqn. (G.22),

λjlogT8η1Tη1λjlogT8T,\displaystyle\lambda_{j}\leq\frac{\log T}{8\eta_{1}T}\quad\Longrightarrow\quad\eta_{1}\lambda_{j}\leq\frac{\log T}{8T},

we have

bj=\displaystyle b_{j}= λj(w0,jw,j)2k=0T(1ηkλj)2\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}
\displaystyle\geq λj(w0,jw,j)2(1k=0T2ηkλj)=λj(w0,jw,j)2(1k=1T2ηkλj)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-\sum_{k=0}^{T}2\eta_{k}\lambda_{j}\right)=\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-\sum_{k=1}^{T}2\eta_{k}\lambda_{j}\right)
=\displaystyle= λj(w0,jw,j)2(1i=0logT1k=iTlogT+1(i+1)TlogT2ηkλj)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-\sum_{i=0}^{\log T-1}\sum_{k=i\cdot\frac{T}{\log T}+1}^{(i+1)\cdot\frac{T}{\log T}}2\eta_{k}\lambda_{j}\right)
=\displaystyle= λj(w0,jw,j)2(1i=0logT1k=iTlogT+1(i+1)TlogTη1λj2i1)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-\sum_{i=0}^{\log T-1}\sum_{k=i\cdot\frac{T}{\log T}+1}^{(i+1)\cdot\frac{T}{\log T}}\frac{\eta_{1}\lambda_{j}}{2^{i-1}}\right)
=\displaystyle= λj(w0,jw,j)2(1η1λji=0logT1TlogT12i1)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-\eta_{1}\lambda_{j}\sum_{i=0}^{\log T-1}\frac{T}{\log T}\cdot\frac{1}{2^{i-1}}\right)
\displaystyle\geq λj(w0,jw,j)2(14η1λjTlogT)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-4\eta_{1}\lambda_{j}\cdot\frac{T}{\log T}\right)
\displaystyle\geq λj(w0,jw,j)2(14logT8TTlogT)\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\left(1-4\cdot\frac{\log T}{8T}\cdot\frac{T}{\log T}\right)
=\displaystyle= λj(w0,jw,j)212,\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\frac{1}{2},

where the first inequality follows from (1a1)(1a2)=1a1a2+a1a21a1a2(1-a_{1})(1-a_{2})=1-a_{1}-a_{2}+a_{1}a_{2}\geq 1-a_{1}-a_{2} for all a1,a2[0,1]a_{1},a_{2}\in[0,1], extended by mathematical induction to i=1n(1ai)1i=1nai\prod_{i=1}^{n}(1-a_{i})\geq 1-\sum_{i=1}^{n}a_{i} as long as i=1nai1\sum_{i=1}^{n}a_{i}\leq 1. The second equality holds because η0=0\eta_{0}=0. The second inequality comes from i=0logT11/2i1i=01/2i1=4\sum_{i=0}^{\log T-1}1/2^{i-1}\leq\sum_{i=0}^{\infty}1/2^{i-1}=4. The last inequality follows from η1λjlogT/(8T)\eta_{1}\lambda_{j}\leq\log T/(8T).

From assumption (3), we know λj(w0,jw,j)2>0\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}>0. Furthermore, as we require

TlogT1256σ2minjλj(w0,jw,j)2\displaystyle\frac{T}{\log T}\geq\frac{1}{256}\cdot\frac{\sigma^{2}}{\min_{j}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}}

in Eqn. (G.19),

bj\displaystyle b_{j}\geq λj(w0,jw,j)212minjλj(w0,jw,j)2121512σ2TlogT.\displaystyle\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\frac{1}{2}\geq\min_{j}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\cdot\frac{1}{2}\geq\frac{1}{512}\cdot\frac{\sigma^{2}}{T}\cdot\log T.

In sum, we have obtained

j𝒜,\displaystyle\forall j\in\mathcal{A},\quad vj1512logTT1λj2\displaystyle v_{j}\geq\frac{1}{512}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}
j,\displaystyle\forall j\in\mathcal{B},\quad bj1512σ2TlogT\displaystyle b_{j}\geq\frac{1}{512}\cdot\frac{\sigma^{2}}{T}\cdot\log T

By combining with Eqn. (G.21), we have

𝔼[f(wT+1)f(w)]=\displaystyle\mathbb{E}\left[f(w_{T+1})-f(w_{*})\right]= 12[(j=1dbj)+(σ2j=1dλj2vj)]\displaystyle\frac{1}{2}\left[\left(\sum_{j=1}^{d}b_{j}\right)+\left(\sigma^{2}\sum_{j=1}^{d}\lambda_{j}^{2}v_{j}\right)\right]
\displaystyle\geq 12[(jbj)+(σ2j𝒜λj2vj)]\displaystyle\frac{1}{2}\left[\left(\sum_{j\in\mathcal{B}}b_{j}\right)+\left(\sigma^{2}\sum_{j\in\mathcal{A}}\lambda_{j}^{2}v_{j}\right)\right]
\displaystyle\geq ||(11024σ2TlogT)+j𝒜σ2λj211024logTT1λj2\displaystyle|\mathcal{B}|\cdot\left(\frac{1}{1024}\cdot\frac{\sigma^{2}}{T}\cdot\log T\right)+\sum_{j\in\mathcal{A}}\sigma^{2}\cdot\lambda_{j}^{2}\cdot\frac{1}{1024}\cdot\frac{\log T}{T}\cdot\frac{1}{\lambda_{j}^{2}}
=\displaystyle= ||(11024σ2TlogT)+|𝒜|(11024σ2TlogT)\displaystyle|\mathcal{B}|\cdot\left(\frac{1}{1024}\cdot\frac{\sigma^{2}}{T}\cdot\log T\right)+|\mathcal{A}|\cdot\left(\frac{1}{1024}\cdot\frac{\sigma^{2}}{T}\cdot\log T\right)
=\displaystyle= (|𝒜|+||)(11024σ2TlogT)\displaystyle\left(|\mathcal{A}|+|\mathcal{B}|\right)\cdot\left(\frac{1}{1024}\cdot\frac{\sigma^{2}}{T}\cdot\log T\right)
=\displaystyle= d11024σ2TlogT\displaystyle d\cdot\frac{1}{1024}\cdot\frac{\sigma^{2}}{T}\cdot\log T
=\displaystyle= Ω(dσ2TlogT),\displaystyle\Omega\left(\frac{d\sigma^{2}}{T}\cdot\log T\right),

where the first inequality is because both the bias and variance terms are non-negative, given bj=λj(w0,jw,j)2k=0T(1ηkλj)20b_{j}=\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\prod_{k=0}^{T}(1-\eta_{k}\lambda_{j})^{2}\geq 0 and vj=τ=0Tητ2k=τ+1T(1ηkλj)20v_{j}=\sum_{\tau=0}^{T}\eta_{\tau}^{2}\prod_{k=\tau+1}^{T}(1-\eta_{k}\lambda_{j})^{2}\geq 0.

Remark 3.

The requirement T/logT1/256σ2/(minjλj(w0,jw,j)2)T/\log T\geq 1/256\cdot\sigma^{2}/\left(\min_{j}\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\right) and assumption λj(w0,jw,j)20\lambda_{j}\left(w_{0,j}-w_{*,j}\right)^{2}\neq 0 for j=1,2,,d\forall j=1,2,\dots,d can be replaced with T/logT>1/(8η1μ)T/\log T>1/(8\eta_{1}\mu), since in that case j𝒜j\in\mathcal{A} holds for j=1,2,,d\forall j=1,2,\dots,d and =\mathcal{B}=\emptyset. In particular, if η1=1/L\eta_{1}=1/L, this requirement on TT becomes T/logTκ/8T/\log T\geq\kappa/8.

G.5 The Reason of Using Assumption (1.7)

In all of our analysis, we employ assumption (1.7)

𝔼ξ[ntnt]σ2H where nt=Hwtb(H(ξ)wtb(ξ))\displaystyle\mathbb{E}_{\xi}\left[n_{t}n_{t}^{\top}\right]\preceq\sigma^{2}H\quad\mbox{ where }n_{t}=Hw_{t}-b-\left(H(\xi)w_{t}-b(\xi)\right)

which is the same as the one used in Theorem 13, Appendix C of Ge et al. (2019). This key theorem marks the major difference between our work and Ge et al. (2019), and it directly entails their main theorem by instantiating σ\sigma with specific values under their assumptions.

On the other hand, it is possible to use the assumptions in Ge et al. (2019); Bach & Moulines (2013); Jain et al. (2016) instead of our assumption (1.7) for least squares regression:

minwf(w) where f(w)12𝔼(x,y)𝒟[(ywx)2]\displaystyle\min_{w}f(w)\quad\mbox{ where }f(w)\triangleq\frac{1}{2}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[(y-w^{\top}x)^{2}\right] (G.25)
y=wx+ϵ with ϵ satisfying 𝔼(x,y)𝒟[ϵ2xx]σ2H for (x,y)𝒟\displaystyle y=w_{*}^{\top}x+\epsilon\mbox{ with $\epsilon$ satisfying }\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\epsilon^{2}xx^{\top}\right]\preceq\sigma^{2}H\mbox{ for }\forall(x,y)\sim\mathcal{D} (G.26)
𝔼[x2xx]R2H\displaystyle\mathbb{E}\left[||x||^{2}xx^{\top}\right]\preceq R^{2}H (G.27)

By combining our Lemma 1 and assumption (1.7) with Lemma 5, Lemma 8 and Lemma 9 in Ge et al. (2019), one can obtain results similar to those in this paper under their assumptions. For simplicity, we just use assumption (1.7) here.
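As an aside, in the common special case of the model (G.25)-(G.26) where the label noise is independent of x with variance sigma^2, assumption (G.26) in fact holds with equality, since E[eps^2 x x^T] = E[eps^2] E[x x^T] = sigma^2 H. The Monte-Carlo sketch below (ours, purely illustrative) checks this on a toy diagonal H; the dimensions, seed and tolerance are arbitrary.

```python
import numpy as np

# Special case of Eqn. (G.26): eps independent of x with E[eps^2] = sigma^2
# gives E[eps^2 x x^T] = sigma^2 * E[x x^T] = sigma^2 * H (equality).
rng = np.random.default_rng(0)
d, sigma, n = 5, 0.3, 200_000                       # illustrative sizes
H = np.diag(np.linspace(1.0, 10.0, d))              # assumed diagonal Hessian E[x x^T]
x = rng.multivariate_normal(np.zeros(d), H, size=n)
eps = sigma * rng.standard_normal(n)
emp = np.einsum('n,ni,nj->ij', eps ** 2, x, x) / n  # empirical E[eps^2 x x^T]
print(np.allclose(emp, sigma ** 2 * H, atol=0.05))  # True up to Monte-Carlo error
```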

Appendix H Relationship with (Stochastic) Newton’s Method

Our motivation in Proposition 1 shares a similar idea with (stochastic) Newton’s method on quadratic objectives

wt+1=\displaystyle w_{t+1}= wtηtH1f(wt,ξ),\displaystyle w_{t}-\eta_{t}H^{-1}\nabla f(w_{t},\xi),

where the parameters are also updated coordinate-wise in the “rotated space”, i.e. given H=UΛUH=U\Lambda U^{\top} and w=Uww^{\prime}=U^{\top}w. In particular, when the Hessian HH is diagonal and ηt=1/(t+1)\eta_{t}=1/(t+1), the update formula is exactly the same as the one in Proposition 1.
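The following minimal sketch (ours) illustrates this coordinate-wise view on an assumed toy diagonal quadratic: with a diagonal Hessian, one Newton-style step with learning rate eta_t is identical to a per-coordinate SGD step with learning rate eta_t / lambda_j; all numerical values are illustrative only.

```python
import numpy as np

# Toy illustration (ours): with a diagonal Hessian, the stochastic Newton step
# w <- w - eta_t * H^{-1} * g equals per-coordinate SGD with step size eta_t / lambda_j.
rng = np.random.default_rng(1)
lam = np.array([1.0, 4.0, 16.0])      # assumed diagonal Hessian spectrum
b = np.array([1.0, -2.0, 0.5])        # assumed linear term, so grad = lam * w - b + noise
w_newton = np.zeros(3)
w_coord = np.zeros(3)
for t in range(100):
    noise = 0.1 * rng.standard_normal(3)
    eta = 1.0 / (t + 1)
    grad = lam * w_newton - b + noise
    w_newton = w_newton - eta * grad / lam                         # Newton-style step
    w_coord = w_coord - (eta / lam) * (lam * w_coord - b + noise)  # per-coordinate step
print(np.allclose(w_newton, w_coord))  # True: the two trajectories coincide
```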

Despite this similarity, our method differs from Newton’s method and its practical variants in several aspects. First of all, our method focuses on learning rate schedulers and is a first-order method. This property is especially salient when we consider the schedulers derived from eigencurve in Section 4.3: only hyperparameter search is needed, just like for other common learning rate schedulers. In addition, most second-order methods, e.g. Schraudolph (2002); Erdogdu & Montanari (2015); Grosse & Martens (2016); Byrd et al. (2016); Botev et al. (2017); Huang et al. (2020); Yang et al. (2021), approximate the Hessian matrix or its inverse and exploit the curvature information, while eigencurve only utilizes a rough estimate of the Hessian spectrum. On top of that, this estimation is a one-time cost and can even be removed entirely for similar models. These key differences highlight eigencurve’s advantages over most second-order methods in practice.