
Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation

Aaron Mishkin [1], Mert Pilanci [2], Mark Schmidt [3,4]

[1] Department of Computer Science, Stanford University
[2] Department of Electrical Engineering, Stanford University
[3] Department of Computer Science, University of British Columbia
[4] Canada CIFAR AI Chair, Amii
Abstract

We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions. Unlike previous analyses, our approach accelerates any stochastic gradient method which makes sufficient progress in expectation. The proof, which proceeds using the estimating sequences framework, applies to both convex and strongly convex functions and is easily specialized to accelerated SGD under the strong growth condition. In this special case, our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work. This improvement is comparable to a square-root of the condition number in the worst case and addresses criticism that guarantees for stochastic acceleration could be worse than those for SGD.

keywords:
acceleration, SGD, strong growth, interpolation

1 Introduction

A continuing trend in machine learning is the adoption of powerful prediction models which can exactly fit, or interpolate, their training data [1]. Methods such as over-parameterized neural networks [2, 3], kernel machines [4], and boosting [5] have all been shown to achieve zero training loss in practice. This phenomenon is particularly prevalent in modern deep learning, where interpolation is conjectured to be key to both optimization [6, 7] and generalization [8].

Recent experimental and theoretical evidence shows stochastic gradient descent (SGD) matches the fast convergence rates of deterministic gradient methods up to problem-dependent constants when training interpolating models [9, 10, 11]. With additional assumptions, interpolation also implies the strong [12] and weak [13, 14] growth conditions, which bound the second moment of the stochastic gradients. Under strong/weak growth, variance-reduced algorithms typically exhibit slower convergence than stochastic gradient methods despite using more computation or memory [15, 10], perhaps because these conditions already imply a form of “automatic variance reduction” [6]. A combination of interpolation and growth conditions has been used to prove fast convergence rates for SGD with line-search [14], with the stochastic Polyak step-size [16, 17], for mirror descent [18], and for model-based methods [19].

While these results show interpolation is sufficient to break the $\Omega(\epsilon^{-4})$ complexity barrier for computing stationary points of smooth, convex functions with stochastic, first-order oracles [20], significantly less work has been done to obtain the accelerated rates possible in the deterministic setting [21]. Vaswani et al. [22] analyze a stochastic version of Nesterov’s accelerated gradient method (AGD) [23] under the strong growth condition, but their bounds have a linear dependence on the strong growth constant and can be slower than SGD [24]. In contrast, Liu and Belkin [24] propose a modified version of stochastic AGD and extend the statistical condition number approach of Jain et al. [25] to the interpolation setting. However, their results apply primarily to quadratics and are not accelerated for general convex functions.

In this work, we apply the estimating sequences analysis developed by Nesterov [26] to the interpolation setting. Our approach hinges on a simple, in-expectation progress guarantee for SGD, which we prove is a sufficient condition for generic acceleration of stochastic algorithms. This proof technique is completely different from that used by Vaswani et al. [22] and yields an improved dependence on the strong growth constant. In the worst-case, the improvement is proportional to the square-root of the condition number and guarantees stochastic AGD is always at least as fast as SGD.  In what follows, all proofs are deferred to the corresponding section of the appendix.

1.1 Additional Related Work

A large literature on stochastic optimization under interpolation rapidly developed following the seminal work by Bassily et al. [13] and Vaswani et al. [22]. For instance, Xiao et al. [27] analyze Frank-Wolfe under interpolation, Vaswani et al. [28] prove fast convergence for Adagrad-type methods [29], and Meng et al. [30] show fast rates for sub-sampled Newton methods under interpolation. Interpolation has also been used to study last-iterate convergence of SGD [31] and subgradient methods [32].

Interpolation is a key sufficient condition for growth conditions, which are a set of general assumptions controlling the magnitude of the noise in stochastic gradients. Strong growth was introduced by Polyak [12, Section 4.2.5], who called the condition relative random noise and used it to prove q-linear convergence of SGD. Solodov [33] and Tseng [34] later used a variation of strong growth to analyze incremental gradient methods; their variation was also used by Schmidt and Le Roux [35] to prove linear convergence of SGD with a constant step-size for strongly-convex functions. Vaswani et al. [22] returned to the original definition given by Polyak [12] and used it to analyze stochastic AGD, as we do in this paper.

Many authors have tried to accelerate stochastic gradient methods, including under a variety of growth conditions. By accelerate, we mean improve the dependence on the condition number or improve the order of convergence (i.e., from $O(1/k)$ to $O(1/k^{2})$) as compared to SGD. For example, Schmidt et al. [36] establish orders of growth on the gradient noise which still permit stochastic proximal-gradient methods to be accelerated. In contrast, d’Aspremont [37] and Devolder et al. [38] assume bounded, deterministic gradient errors to derive convergence rates, while Cohen et al. [39] develop a noise-resistant acceleration scheme. Most recently, Chen et al. [40] analyze stochastic AGD under expected smoothness, but their rate only holds under interpolation when the strong growth constant is less than two.

As discussed above, Liu and Belkin [24] extend the approach of Jain et al. [25] to the interpolation setting. Their assumptions imply strong growth and the analysis is limited to least-squares problems, although similar rates have been obtained for continuized AGD applied to kernel least-squares [41]. Valls et al. [42] take a different view and extend the work by Vaswani et al. [22] to constrained optimization. Unfortunately, none of these algorithms are necessarily accelerated and both Assran and Rabbat [43] and Liu and Belkin [24] prove that stochastic AGD may not obtain accelerated rates of convergence even under interpolation. We address this criticism and make detailed comparisons between convergence rates in Section 4.

2 Assumptions

Consider the unconstrained minimization of a smooth, convex function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$. We assume $f$ has at least one minimizer $w^{*}$ and denote the optimal set by $\mathcal{W}^{*}$. At each iteration $k$, we sample a stochastic gradient $\nabla f(w_{k},z_{k})$ such that,

\mathbb{E}_{z_{k}}\left[\nabla f(w_{k},z_{k})\right]=\nabla f(w_{k}),

meaning the stochastic gradient is unbiased. We assume that $z_{k},z_{j}$ are independent for $k\neq j$, but they do not need to have the same distribution. The stochastic gradients satisfy the strong growth condition when there exists $\rho\geq 1$ for which

\mathbb{E}_{z_{k}}\left[\|\nabla f(w_{k},z_{k})\|_{2}^{2}\right]\leq\rho\|\nabla f(w_{k})\|_{2}^{2},    (1)

holds given any sequence $\{w_{k},z_{k}\}$. We say that interpolation is satisfied if

\nabla f(w)=0\implies\nabla f(w,z_{k})=0,

for all $z_{k}$. That is, stationarity of $f$ implies stationarity of the stochastic gradients. Although the strong growth condition implies interpolation, we will see that the converse requires further assumptions on $f$ and on the stochastic gradients.
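As a concrete illustration, the following Python sketch (ours, not part of the original paper) numerically probes the strong growth constant on a synthetic interpolating least-squares problem; the problem sizes and single-sample oracle are illustrative assumptions.

```python
# Sketch: empirically lower-bounding the strong growth constant on an
# interpolating finite-sum least-squares problem (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                      # interpolation: every f_i is minimized at w_star

def full_grad(w):
    return X.T @ (X @ w - y) / n

def stoch_grad_sq_moment(w):
    # E_i ||grad f_i(w)||^2 with f_i(w) = 0.5 * (x_i^T w - y_i)^2
    res = X @ w - y
    return np.mean(np.sum(X**2, axis=1) * res**2)

# the ratio E||grad f_i(w)||^2 / ||grad f(w)||^2 lower-bounds rho at each w
ratios = []
for _ in range(1000):
    w = w_star + rng.standard_normal(d)
    g = full_grad(w)
    ratios.append(stoch_grad_sq_moment(w) / np.dot(g, g))
print("empirical lower bound on rho:", max(ratios))
```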

We assume that $f$ is $L$-smooth, meaning $\nabla f$ is $L$-Lipschitz continuous. $L$-smoothness of $f$ implies the following quadratic upper-bound for all $w,u\in\mathbb{R}^{d}$ [26]:

f(u)\leq f(w)+\left\langle\nabla f(w),u-w\right\rangle+\frac{L}{2}\|u-w\|_{2}^{2}.    (2)

Similarly, we will sometimes require the stochastic gradients to be $L_{\text{max}}$-individually smooth, meaning they satisfy

f(u,z_{k})\leq f(w,z_{k})+\left\langle\nabla f(w,z_{k}),u-w\right\rangle+\frac{L_{\text{max}}}{2}\|u-w\|_{2}^{2},    (3)

almost surely for all $k$. We also assume that $f$ is $\mu$-strongly convex, by which we mean

f(u)\geq f(w)+\left\langle\nabla f(w),u-w\right\rangle+\frac{\mu}{2}\|u-w\|_{2}^{2},    (4)

holds for all $w,u\in\mathbb{R}^{d}$ and some $\mu\geq 0$. When $\mu=0$, strong convexity reduces to convexity of $f$. If $f$ is strongly convex with $\mu>0$, the stochastic gradients are $L_{\text{max}}$-individually smooth, and interpolation holds, then the strong growth constant is bounded as $\rho\leq L_{\text{max}}/\mu$ (see Lemma 13). Recalling that the ratio $\kappa=L/\mu$ is the condition number of $f$, we see that the strong growth constant is bounded by a quantity like the condition number of the worst-conditioned stochastic function. (Footnote 1: Note that $f(u,z_{k})$ is typically not strongly convex, so this analogy is not formal.)

3 Convergence of Stochastic AGD

Our analysis in this section builds on the estimating sequences approach developed by Nesterov [44]. However, we consider variable step-sizes $\eta_{k}$ and allow the iterates $w_{k}$ to be set using a general update scheme. The procedure takes $\gamma_{0}>0$ as input and uses the following updates, which we call generalized stochastic AGD:

\alpha_{k}^{2} = \eta_{k}(1-\alpha_{k})\gamma_{k}+\eta_{k}\alpha_{k}\mu    (5)
\gamma_{k+1} = (1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu
y_{k} = \frac{1}{\gamma_{k}+\alpha_{k}\mu}\left[\alpha_{k}\gamma_{k}v_{k}+\gamma_{k+1}w_{k}\right]
w_{k+1} = m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))
v_{k+1} = \frac{1}{\gamma_{k+1}}\left[(1-\alpha_{k})\gamma_{k}v_{k}+\alpha_{k}\mu y_{k}-\alpha_{k}\nabla f(y_{k})\right],

where $m$ is an as-yet unspecified update for the “primal sequence” $w_{k}$ and $v_{0}=w_{0}$ (which implies $y_{0}=w_{0}$). Note that the step-size $\eta_{k}$ is required at the start of the iteration to compute $\alpha_{k}$. As a result, local search methods like the Armijo line-search [45] must re-evaluate $\gamma_{k+1},y_{k}$, and $w_{k+1}$ for each candidate step-size.

Choosing $m$ to be one step of SGD with step-size $\eta_{k}$ yields the familiar updates of the standard version of stochastic AGD (see Nesterov [26, Eq. 2.2.20]):

w_{k+1} = y_{k}-\eta_{k}\nabla f(y_{k},z_{k})    (6)
\alpha_{k+1}^{2} = (1-\alpha_{k+1})\alpha_{k}^{2}\frac{\eta_{k+1}}{\eta_{k}}+\eta_{k+1}\alpha_{k+1}\mu
y_{k+1} = w_{k+1}+\frac{\alpha_{k}(1-\alpha_{k})}{\alpha_{k}^{2}+\alpha_{k+1}}(w_{k+1}-w_{k}).
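For concreteness, here is a minimal Python sketch of the recursion in Eq. 6 for the strongly convex case with a constant step-size and $\gamma_{0}=\mu$ (so that $\alpha_{0}=\sqrt{\eta\mu}$); the function stoch_grad is a hypothetical black-box returning an unbiased stochastic gradient of $f$.

```python
# Sketch of the stochastic AGD updates in Eq. 6 with a constant step size.
import numpy as np

def stochastic_agd(w0, stoch_grad, eta, mu, num_iters):
    w = y = w0.copy()
    alpha = np.sqrt(eta * mu)           # alpha_0 when gamma_0 = mu
    for _ in range(num_iters):
        w_next = y - eta * stoch_grad(y)                       # primal (SGD) step
        # solve alpha_{k+1}^2 = (1 - alpha_{k+1}) alpha_k^2 + eta*mu*alpha_{k+1}
        b = alpha**2 - eta * mu
        alpha_next = (-b + np.sqrt(b**2 + 4 * alpha**2)) / 2
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_next)   # momentum weight
        y = w_next + beta * (w_next - w)
        w, alpha = w_next, alpha_next
    return w
```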

Our approach hinges on the fact that, under strong growth, stochastic gradient updates for $w_{k}$ give a similar per-step progress condition as deterministic gradient descent. Since descent in the primal step is the only link between $w_{k}$ and the “dual sequence” $y_{k}$, generalized stochastic AGD with any primal update obtaining similar per-iteration progress can be analyzed in the same fashion as stochastic AGD. This allows us to derive a fast convergence rate for the general scheme in Eq. 5.

We start by deriving the progress condition for SGD. It is straightforward to prove the following bound using $L$-smoothness, the definition of $w_{k+1}$, and strong growth:

Lemma 1.

Suppose $f$ is $L$-smooth, the strong growth condition holds, and $\eta_{k}$ is independent of $z_{k}$. Then the stochastic gradient step in Eq. 6 makes progress as,

\mathbb{E}_{z_{k}}\left[f(w_{k+1})\right]\leq f(y_{k})-\eta_{k}\left(1-\frac{\eta_{k}\rho L}{2}\right)\|\nabla f(y_{k})\|_{2}^{2}.    (7)

Substituting any fixed step-size $\eta_{k}\leq 1/\rho L$ into Eq. 7 gives the following bound, which we call the expected progress condition:

\mathbb{E}_{z_{k}}\left[f(w_{k+1})\right]\leq f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f(y_{k})\|_{2}^{2},    (8)

which is equivalent to the progress made by gradient descent with step-size $\eta_{k}\leq 1/L$ up to a factor of $\rho$ [46]. In order to make use of the expected progress condition, we now introduce the estimating sequences framework.
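Before doing so, a quick numerical sanity check: the sketch below verifies Eq. 8 for a single SGD step on a small interpolating least-squares problem, with $\rho$ replaced by its worst-case bound $L_{\text{max}}/\mu$ from Lemma 13 (problem sizes are illustrative assumptions).

```python
# Sketch: checking the expected progress condition (Eq. 8) for one SGD step.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                     # interpolation holds
H = X.T @ X / n
evals = np.linalg.eigvalsh(H)
L, mu = evals[-1], evals[0]
rho_bound = np.max(np.sum(X**2, axis=1)) / mu      # rho <= L_max / mu (Lemma 13)
eta = 1.0 / (rho_bound * L)

f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y) / n

w = rng.standard_normal(d)
# exact expectation of f(w - eta * grad_i(w)) over the n stochastic gradients
steps = w - eta * X * (X @ w - y)[:, None]
lhs = np.mean([f(s) for s in steps])
rhs = f(w) - 0.5 * eta * np.dot(grad(w), grad(w))
print(lhs <= rhs, lhs, rhs)
```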

Definition 2 (Estimating Sequences).

Two sequences $\lambda_{k}$, $\phi_{k}$ are estimating sequences if the following hold almost surely: (i) $\lambda_{k}\geq 0$, (ii) $\lim_{k}\lambda_{k}=0$, and (iii) for all $w\in\mathbb{R}^{d}$,

\phi_{k}(w)\leq(1-\lambda_{k})f(w)+\lambda_{k}\phi_{0}(w).    (9)

Unlike the standard definition, we permit $\lambda_{k}$ and $\phi_{k}$ to depend on $z_{0},\ldots,z_{k-1}$, making them random variables. Typically the first function $\phi_{0}$ is deterministic and chosen to satisfy $\phi_{0}(w)\geq f(w)$ for all $w$ near $w_{0}$. Since Eq. 9 guarantees $\lim_{k}\phi_{k}(w)\leq f(w)$, $\phi_{k}$ can be interpreted as a sequence of relaxing upper-bounds, where the rate of relaxation is controlled by $\lambda_{k}$. The next condition captures when $\phi_{k}$ is a good local model of $f$.

Definition 3.

The local upper-bound property holds in expectation if

\mathbb{E}_{z_{0},\ldots,z_{k-1}}\left[f(w_{k})\right]\leq\mathbb{E}_{z_{0},\ldots,z_{k-1}}\left[\inf_{u}\phi_{k}(u)\right].    (10)

In what follows, we use $\mathbb{E}$ without subscripts to denote the total expectation with respect to $z_{0},\ldots,z_{k-1}$, that is, all randomness in the procedure up to iteration $k$. If $\phi_{k}$ maintains the local upper-bound property in expectation, then Eq. 9 guarantees

\mathbb{E}\left[f(w_{k})\right]\leq\mathbb{E}\left[\inf_{u}\phi_{k}(u)\right]\leq\mathbb{E}\left[\phi_{k}(w^{*})\right]\leq\mathbb{E}\left[(1-\lambda_{k})f(w^{*})+\lambda_{k}\phi_{0}(w^{*})\right]
\implies\mathbb{E}\left[f(w_{k})\right]-f(w^{*})\leq\mathbb{E}\left[\lambda_{k}(\phi_{0}(w^{*})-f(w^{*}))\right],    (11)

which shows that generalized stochastic AGD converges in expectation at a rate controlled by $\lambda_{k}$. As a result, the bulk of our analysis focuses on establishing the local upper-bound property for a suitable choice of estimating sequences.

Following Nesterov [26], choose $\lambda_{0}=1$, $\phi_{0}(w)=f(w_{0})+\frac{\gamma_{0}}{2}\|w-w_{0}\|_{2}^{2}$, and

\lambda_{k+1} = (1-\alpha_{k})\lambda_{k}    (12)
\phi_{k+1}(w) = (1-\alpha_{k})\phi_{k}(w)+\alpha_{k}\left(f(y_{k})+\left\langle\nabla f(y_{k}),w-y_{k}\right\rangle+\frac{\mu}{2}\|w-y_{k}\|_{2}^{2}\right).

The initial curvature $\gamma_{0}$ is an input parameter; differentiating shows that $v_{k+1}$ is actually the minimizer of $\phi_{k+1}$, while $\nabla^{2}\phi_{k+1}=\gamma_{k+1}I$ (Lemma 14). Thus, the auxiliary sequences $\gamma_{k},v_{k}$ can be viewed as arising from our choice of local model. The next lemma proves these are valid estimating sequences when the step-size sequence is well-behaved. In what follows, we use the convention $1/0=\infty$ to cover the case of non-strongly convex functions.
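The following sketch numerically tracks these recursions for a constant step-size with $\gamma_{0}=\mu$, checking the identity $\gamma_{k+1}=\alpha_{k}^{2}/\eta_{k}$ from Lemma 14 and the decay of $\lambda_{k}$; the numerical values of $\eta$ and $\mu$ are illustrative.

```python
# Sketch: tracking the estimating-sequence parameters in Eq. 5 / Eq. 12.
import numpy as np

eta, mu = 0.01, 0.1        # illustrative values with eta < 1/mu
gamma, lam = mu, 1.0       # gamma_0 = mu, lambda_0 = 1
for k in range(5):
    # alpha_k solves alpha^2/eta = (1 - alpha)*gamma + alpha*mu
    a, b, c = 1.0 / eta, gamma - mu, -gamma
    alpha = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    gamma_next = (1 - alpha) * gamma + alpha * mu
    assert np.isclose(gamma_next, alpha**2 / eta)     # Lemma 14 identity
    lam *= (1 - alpha)
    gamma = gamma_next
    print(k, alpha, lam, (1 - np.sqrt(eta * mu)) ** (k + 1))
```

With $\gamma_{0}=\mu$, the printed $\lambda_{k}$ matches $(1-\sqrt{\eta\mu})^{k+1}$ exactly, which is the contraction factor appearing in Theorem 6.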

Lemma 4.

Assume $f$ is $\mu$-strongly convex with $\mu\geq 0$ and $\eta_{\text{min}}\leq\eta_{k}<1/\mu$ almost surely. Then $\lambda_{k}$ and $\phi_{k}$ given in Eq. 12 are estimating sequences.

The parameter $\eta_{\text{min}}>0$ is required for $\lambda_{k}$ to decrease sufficiently fast, while the upper-bound $\eta_{k}<1/\mu$ is only necessary when $\mu>0$. In this case, it guarantees $\lambda_{k}\geq 0$. This choice of estimating sequences also satisfies the local upper-bound property in expectation when $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ matches the progress of fixed step-size SGD.

Proposition 5.

If $f$ is $L$-smooth and $\mu$-strongly convex with $\mu\geq 0$, $\eta_{\text{min}}\leq\eta_{k}<1/\mu$ almost surely for every $k\in\mathbb{N}$, and the primal update $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ satisfies the sufficient progress condition (Eq. 8), then $\phi_{k}$ has the local upper-bound property in expectation. That is, for every $k\in\mathbb{N}$,

\mathbb{E}\left[f(w_{k})\right]\leq\mathbb{E}\left[\inf_{u}\phi_{k}(u)\right].

Proposition 5 is our main theoretical contribution and immediately leads to two accelerated convergence rates for the generalized stochastic AGD scheme.

Theorem 6.

Suppose $f$ is $L$-smooth and $\mu$-strongly convex with $\mu>0$, $\eta_{\text{min}}\leq\eta_{k}<1/\mu$ almost surely for every $k\in\mathbb{N}$, and the primal update $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ satisfies the sufficient progress condition (Eq. 8). If $\gamma_{0}=\mu$, then generalized stochastic AGD has the following rate of convergence:

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*}) \leq \mathbb{E}\left[\prod_{i=0}^{k}\left(1-\sqrt{\eta_{i}\mu}\right)\right]\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right]    (13)
\leq \left(1-\sqrt{\eta_{\text{min}}\mu}\right)^{k+1}\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].

This linear rate of convergence requires knowledge of the strong convexity constant in order to set $\gamma_{0}$. However, we can still obtain a $O(1/k^{2})$ rate without knowing $\mu$ so long as the smoothness constant can be estimated. The following theorem is a generalization of Nesterov [26] to stochastic optimization under interpolation.

Theorem 7.

Suppose $f$ is $L$-smooth and $\mu$-strongly convex with $\mu\geq 0$, $\eta_{\text{min}}\leq\eta_{k}<1/\mu$ almost surely for every $k\in\mathbb{N}$, and the primal update $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ satisfies the sufficient progress condition (Eq. 8). If $\gamma_{0}\in(\mu,3/\eta_{\text{min}})$, then generalized stochastic AGD has the following rate of convergence:

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*})\leq\frac{4}{\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}}\left[f(w_{0})-f(w^{*})+\frac{\gamma_{0}}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].    (14)

3.1 Specializations

Theorems 6 and 7 provide accelerated guarantees for any stochastic primal update $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ satisfying the sufficient progress condition. Assuming strong growth holds, we may specialize $m$ to fixed step-size SGD with $\eta_{k}=1/\rho L$ (sufficient progress is satisfied according to Lemma 1). This yields the standard version of stochastic AGD analyzed by Vaswani et al. [22]. However, instantiating our convergence guarantees shows a faster rate for stochastic AGD with an improved dependence on $\rho$.

Corollary 8.

If $f$ is $L$-smooth and $\mu$-strongly convex with $\mu>0$, strong growth holds, and $\eta_{k}=1/\rho L$, then stochastic AGD with $\gamma_{0}=\mu$ converges as,

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*})\leq\left(1-\sqrt{\frac{\mu}{\rho L}}\right)^{k+1}\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].    (15)

Alternatively, if $\mu\geq 0$ and $\gamma_{0}\in(\mu,3\rho L)$, then stochastic AGD satisfies,

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*})\leq\frac{4\rho L}{(\gamma_{0}-\mu)(k+1)^{2}}\left[f(w_{0})-f(w^{*})+\frac{\gamma_{0}}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].    (16)

Corollary 8 shows that SGD can be accelerated to obtain the same convergence rate as deterministic AGD up to a factor of $\rho$. We emphasize that some dependence on $\rho$ cannot be avoided; generic acceleration of SGD is not possible, even in the interpolation setting [43], so convergence rates must incorporate some measure of hardness due to stochasticity. In the next section, we compare our convergence guarantees against other results from the literature and give simple conditions under which acceleration is achieved.
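As a rough empirical companion to Corollary 8, the sketch below runs plain SGD and the stochastic AGD recursion (using the stochastic_agd sketch given after Eq. 6) on a synthetic interpolating least-squares problem; $\rho$ is replaced by its worst-case bound $L_{\text{max}}/\mu$, so the step-size is a conservative, illustrative choice rather than a tuned one.

```python
# Sketch: SGD vs. stochastic AGD on an interpolating least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                                   # interpolation holds

H = X.T @ X / n
evals = np.linalg.eigvalsh(H)
L, mu = evals[-1], evals[0]
L_max = np.max(np.sum(X**2, axis=1))
rho = L_max / mu                                 # worst-case bound on rho
eta = 1.0 / (rho * L)

def stoch_grad(w):
    i = rng.integers(n)
    return X[i] * (X[i] @ w - y[i])

f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

w_sgd = np.zeros(d)
for _ in range(5000):
    w_sgd -= (1.0 / L_max) * stoch_grad(w_sgd)   # SGD step size under interpolation

w_agd = stochastic_agd(np.zeros(d), stoch_grad, eta, mu, 5000)
print("f(SGD):", f(w_sgd), "  f(S-AGD):", f(w_agd))
```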

An advantage of our analysis is that it also extends to more complex methods, such as SGD with full matrix preconditioning,

w_{k+1}=w_{k}-\eta_{k}D_{k}^{-1}\nabla f(w_{k},z_{k}),    (17)

where $D_{k}\in\mathbb{R}^{d\times d}$ is a positive-definite matrix and $\eta_{k}$ is a step-size sequence. We say that matrix strong growth holds in the norm $\|x\|_{D_{k}}^{2}=x^{\top}D_{k}x$ with constant $\rho_{D_{k}}$ if

\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|_{D_{k}^{-1}}^{2}\right]\leq\rho_{D_{k}}\|\nabla f(w)\|_{D_{k}^{-1}}^{2}.    (18)

If interpolation holds, $\nabla f(u,z_{k})$ are $L_{\text{max}}^{D_{k}}$-individually smooth in $\|\cdot\|_{D_{k}}$, and $f$ is $\mu_{D_{k}}$-strongly convex in $\|\cdot\|_{D_{k}}$, then $\rho_{D_{k}}\leq L_{\text{max}}^{D_{k}}/\mu_{D_{k}}$ (see Lemma 16) and the following convergence rate is obtained by combining Lemma 17 with our main convergence theorems for generalized stochastic AGD.

Corollary 9.

Assume $f$ is $\mu$-strongly convex and $0\prec D_{k}\preceq I$ for every $k\in\mathbb{N}$. Suppose $f$ is $L_{D_{k}}$-smooth, $\rho_{D_{k}}$ matrix strong growth holds, and $\eta_{k}=1/\rho_{D_{k}}L_{D_{k}}$. Let $C_{\infty}=\sup_{k}\left\{\rho_{D_{k}}L_{D_{k}}\right\}$. If $\gamma_{0}=\mu>0$, then stochastic preconditioned AGD satisfies,

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*})\leq\prod_{i=0}^{k}\left(1-\sqrt{\frac{\mu}{\rho_{D_{i}}L_{D_{i}}}}\right)\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].    (19)

Alternatively, if $\mu\geq 0$ and $\gamma_{0}\in(\mu,3C_{\infty})$, then we obtain,

\mathbb{E}\left[f(w_{k+1})\right]-f(w^{*})\leq\frac{4C_{\infty}}{(\gamma_{0}-\mu)(k+1)^{2}}\left[f(w_{0})-f(w^{*})+\frac{\gamma_{0}}{2}\|w_{0}-w^{*}\|_{2}^{2}\right].    (20)

Compared to standard stochastic AGD, preconditioning allows us to measure stochasticity in the norm induced by $D_{k}$. This is advantageous when $\rho_{D_{k}}L_{D_{k}}\leq\rho L$, meaning $f$ is smoother and/or the stochastic gradients are better conditioned in $\|\cdot\|_{D_{k}}$ than in the standard Euclidean norm. In such a setting, our theory suggests preconditioning is a simple way to further speed up accelerated stochastic optimization.
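The following is a minimal sketch of the preconditioned primal update in Eq. 17, written so it can stand in for $m(\eta_{k},y_{k},\nabla f(y_{k},z_{k}))$ in the generalized scheme of Eq. 5 (when used there, the step is taken from $y_{k}$); the particular matrix $D$ in the usage example is purely illustrative.

```python
# Sketch: a preconditioned primal update that can be plugged into Eq. 5.
import numpy as np

def preconditioned_primal_update(eta, y, grad, D):
    """One preconditioned stochastic gradient step: y - eta * D^{-1} grad."""
    return y - eta * np.linalg.solve(D, grad)

# illustrative usage with a fixed diagonal preconditioner scaled so that D <= I
d = 5
D = np.diag(np.linspace(0.2, 1.0, d))
y = np.ones(d)
grad = 0.1 * np.ones(d)          # stands in for grad f(y_k, z_k)
w_next = preconditioned_primal_update(0.5, y, grad, D)
```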

4 Comparison to Existing Rates

Assumptions | SGD | S-AGD (Ours) | S-AGD (VSB) | MaSS
Strongly Convex | $O\left(\frac{L_{\text{max}}}{\mu}\log(\frac{1}{\epsilon})\right)$ | $O\left(\sqrt{\frac{\rho L}{\mu}}\log(\frac{1}{\epsilon})\right)$ | $O\left(\rho\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon})\right)$ | $\sqrt{\kappa\tilde{\kappa}}\log(\frac{1}{\epsilon})$
Convex | $O\left(\frac{L_{\text{max}}}{\epsilon}\right)$ | $O\left(\sqrt{\frac{\rho L}{\epsilon}}\right)$ | $O\left(\rho\sqrt{\frac{L}{\epsilon}}\right)$ | N/A

Table 1: Comparison of iteration complexities for stochastic acceleration schemes under strong growth and individual smoothness. VSB indicates Vaswani et al. [22] and MaSS is the modified stochastic AGD iteration proposed by Liu and Belkin [24]. The strongly-convex rate for MaSS applies only to quadratics; although MaSS has a convergence guarantee for convex functions, we omit it here because it relies on a hard-to-interpret assumption and is not accelerated.

Now we compare our rates to those existing in the literature. Throughout this section, we assume that $f$ is individually smooth, interpolation holds, and the strong growth condition is satisfied. Recall that the strong growth constant is bounded above as $\rho\leq L_{\text{max}}/\mu$ under these conditions. This worst-case bound on $\rho$ is critical to understanding when stochastic AGD does or does not accelerate.

Before proceeding, we introduce the notion of statistical condition number proposed by Jain et al. [25] and used by Liu and Belkin [24] to analyze their modified version of stochastic AGD (called MaSS) in the least-squares setting. Let $\mathcal{P}$ be a probability distribution over $(x,y)$ and define the least-squares objective as

f_{\text{ls}}(w)=\mathbb{E}_{(x,y)\sim\mathcal{P}}\left[\frac{1}{2}(x^{\top}w-y)^{2}\right].    (21)

Define the stochastic functions and gradients to be $f_{\text{ls}}(w,z_{k})=\frac{1}{2}(x_{z_{k}}^{\top}w-y_{z_{k}})^{2}$ and $\nabla f_{\text{ls}}(w,z_{k})=x_{z_{k}}(x_{z_{k}}^{\top}w-y_{z_{k}})$, where $(x_{z_{k}},y_{z_{k}})\sim\mathcal{P}$. These stochastic gradients are $L_{\text{max}}$-individually smooth with $L_{\text{max}}=\sup\left\{\|x_{z_{k}}\|_{2}^{2}:(x_{z_{k}},y_{z_{k}})\in\text{Supp}(\mathcal{P})\right\}$. Assuming we may interchange expectation and differentiation, the Hessian is $H=\mathbb{E}_{x}\left[xx^{\top}\right]$ and the condition number $\kappa$ and statistical condition number $\tilde{\kappa}$ are defined as,

\kappa=\inf\left\{t/\mu:\mathbb{E}_{x}\left[\|x\|_{2}^{2}(xx^{\top})\right]\preceq tH\right\},\qquad\tilde{\kappa}=\inf\left\{t:\mathbb{E}_{x}\left[\|x\|^{2}_{H^{-1}}(xx^{\top})\right]\preceq tH\right\}.    (22)

It is straightforward to prove $\kappa,\tilde{\kappa}\leq L_{\text{max}}/\mu$, similar to the strong growth constant.
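The following sketch (ours, assuming an empirical distribution over the rows of a data matrix X) estimates $\kappa$ and $\tilde{\kappa}$ from Eq. 22 by computing the smallest feasible $t$ in each semidefinite constraint as a maximal generalized eigenvalue.

```python
# Sketch: estimating kappa and kappa_tilde (Eq. 22) for rows of X sampled uniformly.
import numpy as np

def condition_numbers(X):
    n = X.shape[0]
    H = X.T @ X / n                                     # E[x x^T]
    mu = np.linalg.eigvalsh(H)[0]
    Hinv = np.linalg.inv(H)
    # E[||x||_2^2 x x^T]  and  E[||x||_{H^{-1}}^2 x x^T]
    M1 = (X * np.sum(X**2, axis=1, keepdims=True)).T @ X / n
    M2 = (X * np.einsum('ij,jk,ik->i', X, Hinv, X)[:, None]).T @ X / n
    # smallest t with M <= t H is the largest eigenvalue of s^{-1} M s^{-T}, H = s s^T
    s = np.linalg.cholesky(H)
    t1 = np.linalg.eigvalsh(np.linalg.solve(s, np.linalg.solve(s, M1).T)).max()
    t2 = np.linalg.eigvalsh(np.linalg.solve(s, np.linalg.solve(s, M2).T)).max()
    return t1 / mu, t2                                  # kappa, kappa_tilde
```

For the distribution of Example 10 below (rows $e_{1},\ldots,e_{n}$, i.e., X equal to the identity), this returns $\kappa=\tilde{\kappa}=n$, matching the values stated there.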

Table 1 compares our iteration complexities for stochastic AGD to the complexity of SGD under interpolation, the analysis of stochastic AGD by Vaswani et al. [22], and the complexity of MaSS. Unlike Vaswani et al. [22], who use strong growth to show both the optimality gap and distance to a minimizer decrease in expectation at each iteration, our approach only requires the sufficient progress condition. This allows us to shrink the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$; since $\rho\leq L_{\text{max}}/\mu$, this improvement can be larger than $\sqrt{\kappa}$ in the worst case. Substituting this worst-case bound into the complexity estimate shows stochastic AGD requires $O((\sqrt{LL_{\text{max}}}/\mu)\log(1/\epsilon))$ iterations to reach $\epsilon$-sub-optimality. That is, stochastic AGD is always at least as fast as SGD and faster when $L_{\text{max}}\gg L$.

Our convergence rate for stochastic AGD also improves over that for SGD under the strong growth condition [35]. The improvement is by a factor of $\sqrt{\rho/\mu}$, indicating that acceleration actually shrinks the dependence on the noise level. This is quite different from results in the general stochastic setting, where accelerated methods are typically more sensitive to noise [47]. For example, Schmidt et al. [36] show that the noise level must decrease faster for accelerated methods to converge compared to (proximal) SGD when interpolation does not hold. We conclude that interpolation seems to be key when proving fast rates for stochastic AGD.

Comparing our results against MaSS is more difficult due to the dependence on $\tilde{\kappa}$. To understand the difference in convergence rates, we consider two finite-sum example problems. In what follows, let $e_{1},\ldots,e_{n}$ be the standard basis for $\mathbb{R}^{n}$.

Example 10 ($L_{\text{max}}\gg L$).

Consider the least-squares problem setting in Eq. 21 and choose $y=0$, $x\sim\text{Uniform}(e_{1},\ldots,e_{n})$. A short calculation shows $L_{\text{max}}=1$, $\rho=n$, $L=\mu=1/n$, and $\kappa=\tilde{\kappa}=n$. As a result, stochastic AGD and MaSS have the following complexity bounds:

\text{S-AGD: } O\left(\sqrt{n}\log\left(\frac{1}{\epsilon}\right)\right)\quad\text{vs}\quad\text{MaSS: } O\left(n\log\left(\frac{1}{\epsilon}\right)\right)    (23)

As expected, stochastic AGD accelerates due to the gap between the smoothness and individual smoothness constants. In comparison, the complexity bound for MaSS is not accelerated and only matches that for SGD. The next example considers the opposite setting, where $L\approx L_{\text{max}}$ and we do not expect stochastic AGD to be faster than SGD.

Example 11 ($L_{\text{max}}\approx L$).

Consider the least-squares problem setting in Eq. 21. Let $y=0$ and $x$ be distributed as follows: $P(x=e_{1})=1-1/n$ and $P(x=e_{2})=1/n$. It is straightforward to show that $L_{\text{max}}=1$, $\mu=1/n$, and $L=(n-1)/n$, while $\rho=n$ and $\tilde{\kappa}=\kappa=n$. As a result, the complexity estimates for stochastic AGD and MaSS are,

\text{S-AGD: } O\left(\sqrt{n(n-1)}\log\left(\frac{1}{\epsilon}\right)\right)\quad\text{vs}\quad\text{MaSS: } O\left(n\log\left(\frac{1}{\epsilon}\right)\right).    (24)

As $n\rightarrow\infty$, stochastic AGD attains the same complexity as SGD and is not accelerated. In comparison, the guarantee for MaSS always matches SGD and is slower than stochastic AGD for every finite $n$. We conclude that while both methods are restricted by lower bounds on stochastic acceleration, AGD can accelerate on some simple problems where MaSS fails.

5 Conclusion

We derive new convergence rates for a generalized version of stochastic Nesterov acceleration. Our approach extends the estimating sequences framework to the stochastic setting and shows that any update scheme making sufficient progress in expectation can be accelerated. As this sufficient progress condition is satisfied by SGD under the strong growth condition, our proof immediately specializes to give fast rates for stochastic AGD. Compared to previous work, our convergence bounds improve the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$. This improvement can be larger than the square-root of the condition number, shows stochastic AGD is at least as fast as SGD, and explains the strong empirical performance of stochastic acceleration shown by Vaswani et al. [22]. We also leverage our generalized algorithm to prove convergence guarantees for stochastic AGD with preconditioning. In particular, we show that preconditioning further speeds up accelerated SGD when the stochastic gradients are small in the matrix norm induced by the preconditioner.

In addition to these results, the utility of our theoretical approach is further demonstrated by recent literature. Our core result for stochastic AGD (Proposition 5) was previously made available in a master’s thesis [48] and the proof technique has since been leveraged to give optimal bounds for stochastic acceleration in the general setting [49]. Yet, several questions remain unanswered. For example, the convergence of stochastic AGD under relaxed conditions, like weak growth or with a stochastic line-search [22], has not been proved. And while our generalized AGD scheme also suggests accelerating methods like stochastic proximal-point, establishing the expected progress condition appears difficult and new insights may be required. We leave these questions to future work.


Acknowledgments

We would like to thank Frederik Kunstner, Victor Sanchez-Portella, and Sharan Vaswani for many insightful discussions.


Funding

Aaron Mishkin was supported by NSF GRF Grant No. DGE-1656518 and by NSERC PGS D Grant No. PGSD3-547242-2020. Mert Pilanci was supported by the NSF under Grant ECCS-2037304 and Grant DMS-2134248, by an NSF CAREER Award under Grant CCF-2236829, by the U.S. Army Research Office Early Career Award under Grant W911NF-21-1-0242, and by the Stanford Precourt Institute. Mark Schmidt was partially supported by the Canada CIFAR AI Chair Program and NSERC Discovery Grant No. RGPIN-2022-03669.


Data Availability

This paper does not rely on any data, public or private. Derivations for all theoretical results are included in the appendix.

References

  • Zhang et al. [2017] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net (2017)
  • Zhang and Yin [2013] Zhang, H., Yin, W.: Gradient methods for convex minimization: Better rates under weaker conditions. arXiv preprint arXiv:1303.4645 (2013)
  • Belkin et al. [2019a] Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (2019)
  • Belkin et al. [2019b] Belkin, M., Rakhlin, A., Tsybakov, A.B.: Does data interpolation contradict statistical optimality? In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019. Proceedings of Machine Learning Research, vol. 89, pp. 1611–1619. PMLR (2019)
  • Schapire et al. [1997] Schapire, R.E., Freund, Y., Barlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. In: Fisher, D.H. (ed.) Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), pp. 322–330. Morgan Kaufmann (1997)
  • Liu et al. [2022] Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis 59, 85–116 (2022)
  • Oymak and Soltanolkotabi [2019] Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: Gradient descent takes the shortest path? In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Proceedings of Machine Learning Research, vol. 97, pp. 4951–4960. PMLR (2019)
  • Belkin [2021] Belkin, M.: Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numer. 30, 203–248 (2021)
  • Arora et al. [2018] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: Implicit acceleration by overparameterization. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 244–253. PMLR (2018)
  • Ma et al. [2018] Ma, S., Bassily, R., Belkin, M.: The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 3331–3340. PMLR (2018)
  • Zou and Gu [2019] Zou, D., Gu, Q.: An improved analysis of training over-parameterized deep neural networks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 2053–2062 (2019)
  • Polyak [1987] Polyak, B.T.: Introduction to optimization (1987)
  • Bassily et al. [2018] Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)
  • Vaswani et al. [2019] Vaswani, S., Mishkin, A., Laradji, I.H., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: Interpolation, line-search, and convergence rates. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 3727–3740 (2019)
  • Defazio and Bottou [2019] Defazio, A., Bottou, L.: On the ineffectiveness of variance reduced optimization for deep learning. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 1753–1763 (2019)
  • Loizou et al. [2020] Loizou, N., Vaswani, S., Laradji, I., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint arXiv:2002.10542 (2020)
  • Berrada et al. [2020] Berrada, L., Zisserman, A., Kumar, M.P.: Training neural networks for and by interpolation. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 799–809. PMLR (2020)
  • D’Orazio et al. [2021] D’Orazio, R., Loizou, N., Laradji, I.H., Mitliagkas, I.: Stochastic mirror descent: Convergence analysis and adaptive variants via the mirror stochastic polyak stepsize. CoRR abs/2110.15412 (2021)
  • Asi and Duchi [2019] Asi, H., Duchi, J.C.: Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization 29(3), 2257–2290 (2019)
  • Arjevani et al. [2019] Arjevani, Y., Carmon, Y., Duchi, J.C., Foster, D.J., Srebro, N., Woodworth, B.: Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365 (2019)
  • Nemirovsky and Nesterov [1985] Nemirovsky, A.S., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985)
  • Vaswani et al. [2019] Vaswani, S., Bach, F., Schmidt, M.W.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019. Proceedings of Machine Learning Research, vol. 89, pp. 1195–1204. PMLR (2019)
  • Nesterov [1983] Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence O(1/k2){O}(1/k^{2}). In: Doklady an USSR, vol. 269, pp. 543–547 (1983)
  • Liu and Belkin [2020] Liu, C., Belkin, M.: Accelerating SGD with momentum for over-parameterized learning. In: 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net (2020)
  • Jain et al. [2018] Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent for least squares regression. In: Bubeck, S., Perchet, V., Rigollet, P. (eds.) Conference On Learning Theory, COLT 2018. Proceedings of Machine Learning Research, vol. 75, pp. 545–604. PMLR (2018)
  • Nesterov [2004] Nesterov, Y.E.: Introductory Lectures on Convex Optimization - A Basic Course. Applied Optimization, vol. 87. Springer (2004)
  • Xiao et al. [2022] Xiao, T., Balasubramanian, K., Ghadimi, S.: Improved complexities for stochastic conditional gradient methods under interpolation-like conditions. Oper. Res. Lett. 50(2), 184–189 (2022)
  • Vaswani et al. [2020] Vaswani, S., Kunstner, F., Laradji, I., Meng, S.Y., Schmidt, M., Lacoste-Julien, S.: Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835 (2020)
  • Duchi et al. [2011] Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
  • Meng et al. [2020] Meng, S.Y., Vaswani, S., Laradji, I.H., Schmidt, M., Lacoste-Julien, S.: Fast and furious convergence: Stochastic second order methods under interpolation. In: Chiappa, S., Calandra, R. (eds.) The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020. Proceedings of Machine Learning Research, vol. 108, pp. 1375–1386. PMLR (2020)
  • Varre et al. [2021] Varre, A.V., Pillaud-Vivien, L., Flammarion, N.: Last iterate convergence of SGD for least-squares in the interpolation regime. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, pp. 21581–21591 (2021)
  • Fang et al. [2021] Fang, H., Fan, Z., Friedlander, M.P.: Fast convergence of stochastic subgradient method under interpolation. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net (2021)
  • Solodov [1998] Solodov, M.V.: Incremental gradient algorithms with stepsizes bounded away from zero. Comp. Opt. and Appl. 11(1), 23–35 (1998)
  • Tseng [1998] Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8(2), 506–531 (1998)
  • Schmidt and Le Roux [2013] Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)
  • Schmidt et al. [2011] Schmidt, M., Le Roux, N., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24: NeurIPS 2011, pp. 1458–1466 (2011)
  • d’Aspremont [2008] d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
  • Devolder et al. [2014] Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming 146(1-2), 37–75 (2014)
  • Cohen et al. [2018] Cohen, M., Diakonikolas, J., Orecchia, L.: On acceleration with noise-corrupted gradients. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 1018–1027. PMLR (2018)
  • Chen et al. [2020] Chen, Y.-L., Na, S., Kolar, M.: Convergence analysis of accelerated stochastic gradient descent under the growth condition. arXiv preprint arXiv:2006.06782 (2020)
  • Even et al. [2021] Even, M., Berthier, R., Bach, F.R., Flammarion, N., Gaillard, P., Hendrikx, H., Massoulié, L., Taylor, A.B.: A continuized view on nesterov acceleration for stochastic gradient descent and randomized gossip. CoRR abs/2106.07644 (2021)
  • Valls et al. [2022] Valls, V., Wang, S., Jiang, Y., Tassiulas, L.: Accelerated convex optimization with stochastic gradients: Generalizing the strong-growth condition. arXiv preprint arXiv:2207.11833 (2022)
  • Assran and Rabbat [2020] Assran, M., Rabbat, M.: On the convergence of Nesterov’s accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414 (2020)
  • Nesterov [1988] Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Mateaticheskie Metody 24(3), 509–517 (1988)
  • Armijo [1966] Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1), 1–3 (1966)
  • Bertsekas [1997] Bertsekas, D.P.: Nonlinear programming. Journal of the Operational Research Society 48(3), 334–334 (1997)
  • Honorio [2012] Honorio, J.: Convergence rates of biased stochastic optimization for learning sparse Ising models. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012. icml.cc / Omnipress (2012)
  • Mishkin [2020] Mishkin, A.: Interpolation, growth conditions, and stochastic gradient descent. PhD thesis, University of British Columbia (2020)
  • Vaswani et al. [2022] Vaswani, S., Dubois-Taine, B., Babanezhad, R.: Towards noise-adaptive, problem-adaptive (accelerated) stochastic gradient descent. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. Proceedings of Machine Learning Research, vol. 162, pp. 22015–22059. PMLR (2022)

Appendix A Assumptions: Proofs

Lemma 12.

Suppose $f$ is convex and $L$-smooth, the stochastic gradients $\left\{\nabla f(w_{k},z_{k})\right\}$ are $L_{\text{max}}$-individually smooth, and interpolation holds. Then the weak growth condition holds with constant $\alpha\leq L_{\text{max}}/L$.

Proof.

First, recall that the weak growth condition [22] is given by

\mathbb{E}_{z_{k}}\left[\|\nabla f(w_{k},z_{k})\|_{2}^{2}\right]\leq 2\alpha L\left(f(w_{k})-f(w^{*})\right).    (25)

Now, starting from $L_{\text{max}}$-individual smoothness,

f(u,z_{k}) \leq f(w,z_{k})+\left\langle\nabla f(w,z_{k}),u-w\right\rangle+\frac{L_{\text{max}}}{2}\|u-w\|^{2},

and choosing $u=w-\frac{1}{L_{\text{max}}}\nabla f(w,z_{k})$, we obtain

f(u,z_{k}) \leq f(w,z_{k})-\frac{1}{L_{\text{max}}}\|\nabla f(w,z_{k})\|^{2}+\frac{1}{2L_{\text{max}}}\|\nabla f(w,z_{k})\|^{2}
= f(w,z_{k})-\frac{1}{2L_{\text{max}}}\|\nabla f(w,z_{k})\|^{2}.

Noting that $f(u,z_{k})\geq f(w^{*},z_{k})$ by convexity of $f$ and interpolation, and taking expectations with respect to $z_{k}$ gives the following:

f(w^{*},z_{k}) \leq f(w,z_{k})-\frac{1}{2L_{\text{max}}}\|\nabla f(w,z_{k})\|^{2}
\implies \mathbb{E}_{z_{k}}\left[f(w^{*},z_{k})\right] \leq \mathbb{E}_{z_{k}}\left[f(w,z_{k})\right]-\frac{1}{2L_{\text{max}}}\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|^{2}\right]
\implies f(w^{*}) \leq f(w)-\frac{1}{2L_{\text{max}}}\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|^{2}\right].

Re-arranging this final equation gives the desired result as follows:

\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|^{2}\right] \leq 2L_{\text{max}}\left(f(w)-f(w^{*})\right)
= 2\left(\frac{L_{\text{max}}}{L}\right)L\left(f(w)-f(w^{*})\right).

We conclude that weak growth holds with $\alpha\leq\frac{L_{\text{max}}}{L}$. ∎

Lemma 13.

Suppose $f$ is $L$-smooth and $\mu$-strongly convex, the stochastic gradients $\left\{\nabla f(w_{k},z_{k})\right\}$ are $L_{\text{max}}$-individually smooth, and interpolation holds. Then strong growth holds with constant $\rho\leq L_{\text{max}}/\mu$.

Proof.

Lemma 12 implies that $f$ satisfies the weak growth condition with parameter

\alpha\leq\frac{L_{\text{max}}}{L}.

Vaswani et al. [22, Proposition 1] now implies that $f$ satisfies strong growth with parameter

\rho\leq\frac{\alpha L}{\mu}\leq\frac{L_{\text{max}}}{\mu}.

This concludes the proof. ∎

Appendix B Convergence of Stochastic AGD: Proofs

See Lemma 1.

Proof.

The proof is a modification of the standard descent lemma. Starting from $L$-smoothness of $f$, we obtain

f(w_{k+1}) \leq f(y_{k})+\left\langle\nabla f(y_{k}),w_{k+1}-y_{k}\right\rangle+\frac{L}{2}\|w_{k+1}-y_{k}\|_{2}^{2}
= f(y_{k})-\eta_{k}\left\langle\nabla f(y_{k}),\nabla f(y_{k},z_{k})\right\rangle+\frac{\eta_{k}^{2}L}{2}\|\nabla f(y_{k},z_{k})\|_{2}^{2}.

Taking expectations with respect to $z_{k}$ and using the strong growth condition,

\mathbb{E}_{z_{k}}\left[f(w_{k+1})\right] \leq f(y_{k})-\eta_{k}\left\langle\nabla f(y_{k}),\mathbb{E}_{z_{k}}\left[\nabla f(y_{k},z_{k})\right]\right\rangle+\frac{\eta_{k}^{2}L}{2}\mathbb{E}_{z_{k}}\left[\|\nabla f(y_{k},z_{k})\|_{2}^{2}\right]
\leq f(y_{k})-\eta_{k}\|\nabla f(y_{k})\|_{2}^{2}+\frac{\eta_{k}^{2}L\rho}{2}\|\nabla f(y_{k})\|_{2}^{2}
= f(y_{k})-\eta_{k}\left(1-\frac{\eta_{k}L\rho}{2}\right)\|\nabla f(y_{k})\|_{2}^{2}. ∎

Lemma 14.

The $\phi_{k}$ sequence in Eq. 12 satisfies the following canonical form:

\phi_{k+1}(w)=\phi_{k+1}^{*}+\frac{\gamma_{k+1}}{2}\|w-v_{k+1}\|_{2}^{2},    (26)

where the curvature $\gamma_{k+1}$, minimizer $v_{k+1}$, and minimum value $\phi_{k+1}^{*}$ are given as follows:

\gamma_{k+1} = (1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu    (27)
v_{k+1} = \frac{1}{\gamma_{k+1}}\left[(1-\alpha_{k})\gamma_{k}v_{k}+\alpha_{k}\mu y_{k}-\alpha_{k}\nabla f(y_{k})\right]
\phi_{k+1}^{*} = (1-\alpha_{k})\phi_{k}^{*}+\alpha_{k}f(y_{k})-\frac{\alpha_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(y_{k})\|_{2}^{2}+\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|_{2}^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right).

Furthermore, the relationship between $\gamma_{k+1}$ and $\alpha_{k}$ is the following:

\gamma_{k+1}=\alpha_{k}^{2}/\eta_{k}.    (28)
Proof.

The canonical form for $\phi_{k+1}$ follows directly from Nesterov [26, Lemma 2.2.3]. To see the relationship between $\gamma_{k+1}$ and $\alpha_{k}$, re-arrange the update for $\alpha_{k}$ in Eq. 5 to obtain,

\frac{\alpha_{k}^{2}}{\eta_{k}}=(1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu.    (29)

By comparison to the update for $\gamma_{k}$, we deduce that $\gamma_{k+1}=\alpha_{k}^{2}/\eta_{k}$. ∎

Lemma 15.

Assume $\alpha_{k}\in(0,1)$ and $\eta_{\text{min}}\leq\eta_{k}\leq 1/\mu$ almost surely for all $k\in\mathbb{N}$. If $\mu>0$ and $\gamma_{0}=\mu$, then

\lambda_{k}\leq\prod_{i=0}^{k-1}(1-\sqrt{\eta_{i}\mu}).    (30)

Alternately, if $\gamma_{0}\in(\mu,\mu+3/\eta_{\text{min}})$, we obtain

\lambda_{k}\leq\frac{4}{\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}}.    (31)
Proof.

Case 1: $\gamma_{0}=\mu>0$. Then $\gamma_{k}=\mu$ for all $k$ and

\alpha_{k}^{2} = (1-\alpha_{k})\eta_{k}\mu+\alpha_{k}\eta_{k}\mu = \eta_{k}\mu.

Thus, we deduce that

\lambda_{k}=\prod_{i=0}^{k-1}(1-\sqrt{\eta_{i}\mu}),

and if $\eta_{k}\geq\eta_{\text{min}}$, then $\alpha_{k}\geq\sqrt{\eta_{\text{min}}\mu}$ and

\lambda_{k}\leq(1-\sqrt{\eta_{\text{min}}\mu})^{k}.

Case 2: $\gamma_{0}\in\left(\mu,\mu+3/\eta_{\text{min}}\right)$. Using the update rule for $\gamma_{k}$ in Lemma 14, we find

\gamma_{k+1}-\mu = (1-\alpha_{k})\gamma_{k}+(\alpha_{k}-1)\mu = (1-\alpha_{k})(\gamma_{k}-\mu).

Recursing on this equality implies

\gamma_{k+1}-\mu = (\gamma_{0}-\mu)\prod_{i=0}^{k}(1-\alpha_{i}) = \lambda_{k+1}(\gamma_{0}-\mu).

Similarly, using $\lambda_{k+1}=(1-\alpha_{k})\lambda_{k}$ and $\alpha_{k}^{2}/\gamma_{k+1}=\eta_{k}$ yields

1-\frac{\lambda_{k+1}}{\lambda_{k}} = \alpha_{k} = \left(\gamma_{k+1}\eta_{k}\right)^{1/2} = \left(\eta_{k}\mu+\eta_{k}\lambda_{k+1}(\gamma_{0}-\mu)\right)^{1/2}
\implies \frac{1}{\lambda_{k+1}}-\frac{1}{\lambda_{k}} = \frac{1}{\lambda_{k+1}^{1/2}}\left[\frac{\eta_{k}\mu}{\lambda_{k+1}}+\eta_{k}(\gamma_{0}-\mu)\right]^{1/2} \geq \frac{1}{\lambda_{k+1}^{1/2}}\left[\frac{\eta_{\text{min}}\mu}{\lambda_{k+1}}+\eta_{\text{min}}(\gamma_{0}-\mu)\right]^{1/2}.

Finally, this implies

\frac{2}{\lambda_{k+1}^{1/2}}\left(\frac{1}{\lambda_{k+1}^{1/2}}-\frac{1}{\lambda_{k}^{1/2}}\right) \geq \left(\frac{1}{\lambda_{k+1}^{1/2}}-\frac{1}{\lambda_{k}^{1/2}}\right)\left(\frac{1}{\lambda_{k+1}^{1/2}}+\frac{1}{\lambda_{k}^{1/2}}\right) \geq \frac{1}{\lambda_{k+1}^{1/2}}\left[\frac{\eta_{\text{min}}\mu}{\lambda_{k+1}}+\eta_{\text{min}}(\gamma_{0}-\mu)\right]^{1/2}.

Moreover, this bound holds uniformly for all $k\in\mathbb{N}$. We have now exactly reached Eq. 2.2.11 of Nesterov [26, Lemma 2.2.4] with $L$ replaced by $\eta_{\text{min}}$. Applying that lemma with this modification, we obtain

\lambda_{k}\leq\frac{4}{\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}},

which completes the proof.

See Lemma 4.

Proof.

Recalling the update for $\alpha_{k}$, we obtain,

\alpha_{k}^{2}=(1-\alpha_{k})\gamma_{k}\eta_{k}+\alpha_{k}\eta_{k}\mu.

Define $\hat{L}_{k}=1/\eta_{k}$ to obtain the following quadratic formula,

\hat{L}_{k}\alpha_{k}^{2}+(\gamma_{k}-\mu)\alpha_{k}-\gamma_{k}=0.

Using the quadratic equation, we find

\alpha_{k}=\frac{\mu-\gamma_{k}\pm\sqrt{(\mu-\gamma_{k})^{2}+4\hat{L}_{k}\gamma_{k}}}{2\hat{L}_{k}}.

As a result, $\alpha_{k}>0$ if and only if

(\mu-\gamma_{k})+\left((\mu-\gamma_{k})^{2}+4\hat{L}_{k}\gamma_{k}\right)^{1/2} > 0.

If $\mu\geq\gamma_{k}$, then this holds trivially. Otherwise, we require,

(\mu-\gamma_{k})^{2}+4\hat{L}_{k}\gamma_{k} > (\mu-\gamma_{k})^{2},

which holds if and only if $\eta_{k},\gamma_{k}>0$. Similarly, $\alpha_{k}<1$ if and only if

4\hat{L}_{k}^{2}+4\hat{L}_{k}(\gamma_{k}-\mu)+(\mu-\gamma_{k})^{2} > (\mu-\gamma_{k})^{2}+4\hat{L}_{k}\gamma_{k}\iff\eta_{k}<\frac{1}{\mu}.

Since this condition holds by assumption, we have $\alpha_{k}\in(0,1)$ for all $k$.

Recall $\lambda_{0}=1$ and $\lambda_{k+1}=(1-\alpha_{k})\lambda_{k}$. Since $\alpha_{k}\in(0,1)$, $\lambda_{k}\geq 0$ holds by induction. It remains to show that $\lambda_{k}$ tends to zero. Invoking Lemma 15 establishes this result almost surely.

Finally, we must show,

\phi_{k}(w)\leq(1-\lambda_{k})f(w)+\lambda_{k}\phi_{0}(w).

We proceed by induction. Since $\lambda_{0}=1$, we immediately obtain

\phi_{0}(w)=(1-\lambda_{0})f(w)+\lambda_{0}\phi_{0}(w).

Now assume the condition holds at $\phi_{k}$; by the construction of $\phi_{k+1}$, we have

\phi_{k+1}(w) = (1-\alpha_{k})\phi_{k}(w)+\alpha_{k}\left[f(y_{k})+\left\langle\nabla f(y_{k}),w-y_{k}\right\rangle+\frac{\mu}{2}\|w-y_{k}\|_{2}^{2}\right]
\leq (1-\alpha_{k})\phi_{k}(w)+\alpha_{k}f(w)
\leq (1-\alpha_{k})\left[(1-\lambda_{k})f(w)+\lambda_{k}\phi_{0}(w)\right]+\alpha_{k}f(w)
= \lambda_{k+1}\phi_{0}(w)-\lambda_{k+1}f(w)+(1-\alpha_{k})f(w)+\alpha_{k}f(w)
= \lambda_{k+1}\phi_{0}(w)+(1-\lambda_{k+1})f(w).

This completes the proof. ∎

See Proposition 5.

Proof.

The choice of $\phi^{*}_{0}=f(w_{0})$ ensures $\inf_{w}\phi_{0}(w)=f(w_{0})$ deterministically, which is the base case for the induction. The inductive assumption is $\mathbb{E}\left[\inf_{w}\phi_{k}(w)\right]\geq\mathbb{E}\left[f(w_{k})\right]$; let us use this to show

\mathbb{E}\left[\inf_{w}\phi_{k+1}(w)\right]\geq\mathbb{E}\left[f(w_{k+1})\right].

Lemma 14 implies that the explicit form of the minimum value $\inf_{w}\phi_{k+1}(w)=\phi_{k+1}^{*}$ is

ϕk+1\displaystyle\phi_{k+1}^{*} =(1αk)ϕk+αkf(yk)αk22γk+1f(yk)2\displaystyle=(1-\alpha_{k})\phi^{*}_{k}+\alpha_{k}f(y_{k})-\frac{\alpha_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(y_{k})\|^{2}
+αk(1αk)γkγk+1(μ2ykvk2+f(yk),vkyk)\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)
Taking expectations with respect to z0,,zkz_{0},\ldots,z_{k} and using linearity of expectation:
𝔼[ϕk+1]\displaystyle\mathbb{E}[\phi_{k+1}^{*}] =𝔼[(1αk)ϕk]+𝔼[αkf(yk)αk22γk+1f(yk)2]\displaystyle=\mathbb{E}[(1-\alpha_{k})\phi^{*}_{k}]+\mathbb{E}\left[\alpha_{k}f(y_{k})-\frac{\alpha_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(y_{k})\|^{2}\right]
+𝔼[αk(1αk)γkγk+1(μ2ykvk2+f(yk),vkyk)]\displaystyle\hskip 50.00008pt+\mathbb{E}\left[\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\right]
𝔼[(1αk)f(wk)]+𝔼[αkf(yk)αk22γk+1f(yk)2]\displaystyle\geq\mathbb{E}\left[(1-\alpha_{k})f(w_{k})\right]+\mathbb{E}\left[\alpha_{k}f(y_{k})-\frac{\alpha_{k}^{2}}{2\gamma_{k+1}}\|\nabla f(y_{k})\|^{2}\right]
+𝔼[αk(1αk)γkγk+1(μ2ykvk2+f(yk),vkyk)],\displaystyle\hskip 50.00008pt+\mathbb{E}\left[\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\right],
where the inequality follows from the inductive assumption. Convexity of ff implies f(wk)f(yk)+f(yk),wkykf(w_{k})\geq f(y_{k})+\left\langle\nabla f(y_{k}),w_{k}-y_{k}\right\rangle. Recalling αk2γk+1=ηk\frac{\alpha_{k}^{2}}{\gamma_{k+1}}=\eta_{k} from Lemma 14 allows us to obtain,
𝔼[ϕk+1]\displaystyle\mathbb{E}[\phi_{k+1}^{*}] 𝔼[(1αk)(f(yk)+f(yk),wkyk)]+𝔼[αkf(yk)ηk2f(yk)2]\displaystyle\geq\mathbb{E}[(1-\alpha_{k})\left(f(y_{k})+\left\langle\nabla f(y_{k}),w_{k}-y_{k}\right\rangle\right)]+\mathbb{E}\bigg{[}\alpha_{k}f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f(y_{k})\|^{2}\bigg{]}
+𝔼[αk(1αk)γkγk+1(μ2ykvk2+f(yk),vkyk)]\displaystyle\hskip 50.00008pt+\mathbb{E}\left[\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\right]
=𝔼[f(yk)]+𝔼[(1αk)f(yk),wkyk]𝔼[ηk2f(yk)2]\displaystyle=\mathbb{E}\left[f(y_{k})\right]+\mathbb{E}[(1-\alpha_{k})\left\langle\nabla f(y_{k}),w_{k}-y_{k}\right\rangle]-\mathbb{E}\bigg{[}\frac{\eta_{k}}{2}\|\nabla f(y_{k})\|^{2}\bigg{]}
+𝔼[αk(1αk)γkγk+1(μ2ykvk2+f(yk),vkyk)]\displaystyle\hskip 50.00008pt+\mathbb{E}\left[\frac{\alpha_{k}(1-\alpha_{k})\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\right]
=𝔼[f(yk)ηk2f(yk)2]+𝔼[(1αk)(f(yk),wkyk\displaystyle=\mathbb{E}\left[f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f(y_{k})\|^{2}\right]+\mathbb{E}\bigg{[}(1-\alpha_{k})\bigg{(}\left\langle\nabla f(y_{k}),w_{k}-y_{k}\right\rangle
+αkγkγk+1(μ2ykvk2+f(yk),vkyk))]\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\bigg{)}\bigg{]}
The sufficient progress condition (Eq. 8) now implies
𝔼[ϕk+1]\displaystyle\mathbb{E}[\phi_{k+1}^{*}] 𝔼[f(wk+1)]+𝔼[(1αk)(f(yk),wkyk\displaystyle\geq\mathbb{E}\left[f(w_{k+1})\right]+\mathbb{E}\bigg{[}(1-\alpha_{k})\bigg{(}\left\langle\nabla f(y_{k}),w_{k}-y_{k}\right\rangle
+αkγkγk+1(μ2ykvk2+f(yk),vkyk))].\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}\gamma_{k}}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\right)\bigg{)}\bigg{]}.
The remainder of the proof is largely unchanged from the deterministic case. The definition of yky_{k} gives wkyk=αkγkγk+αkμ(wkvk)w_{k}-y_{k}=\frac{\alpha_{k}\gamma_{k}}{\gamma_{k}+\alpha_{k}\mu}(w_{k}-v_{k}), which we use to obtain
𝔼[ϕk+1]\displaystyle\mathbb{E}\left[\phi_{k+1}^{*}\right] 𝔼[f(wk+1)]+𝔼[(1αk)(αkγkγk+αkμf(yk),wkvk\displaystyle\geq\mathbb{E}[f(w_{k+1})]+\mathbb{E}\bigg{[}(1-\alpha_{k})\bigg{(}\frac{\alpha_{k}\gamma_{k}}{\gamma_{k}+\alpha_{k}\mu}\left\langle\nabla f(y_{k}),w_{k}-v_{k}\right\rangle
+αkγkγk+1(μ2ykvk2+f(yk),vkyk))]\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}\gamma_{k}}{\gamma_{k+1}}\Big{(}\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\Big{)}\bigg{)}\bigg{]}
Noting that vkyk=γk+1γk+αkμ(vkwk)v_{k}-y_{k}=\frac{\gamma_{k+1}}{\gamma_{k}+\alpha_{k}\mu}\left(v_{k}-w_{k}\right) gives
𝔼[ϕk+1]\displaystyle\mathbb{E}\left[\phi_{k+1}^{*}\right] 𝔼[f(wk+1)]+𝔼[(1αk)(αkγkγk+αkμf(yk),wkvk\displaystyle\geq\mathbb{E}[f(w_{k+1})]+\mathbb{E}\bigg{[}(1-\alpha_{k})\bigg{(}\frac{\alpha_{k}\gamma_{k}}{\gamma_{k}+\alpha_{k}\mu}\left\langle\nabla f(y_{k}),w_{k}-v_{k}\right\rangle
+αkγkγk+1(μ2ykvk2+f(yk),vkyk))]\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}\gamma_{k}}{\gamma_{k+1}}\Big{(}\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\left\langle\nabla f(y_{k}),v_{k}-y_{k}\right\rangle\Big{)}\bigg{)}\bigg{]}
=𝔼[f(wk+1)]+𝔼[(1αk)(αkγkγk+αkμf(yk),wkvk\displaystyle=\mathbb{E}[f(w_{k+1})]+\mathbb{E}\bigg{[}(1-\alpha_{k})\bigg{(}\frac{\alpha_{k}\gamma_{k}}{\gamma_{k}+\alpha_{k}\mu}\left\langle\nabla f(y_{k}),w_{k}-v_{k}\right\rangle
+αkγkγk+1(μ2ykvk2+γk+1γk+αkμf(yk),vkwk))]\displaystyle\hskip 50.00008pt+\frac{\alpha_{k}\gamma_{k}}{\gamma_{k+1}}\Big{(}\frac{\mu}{2}\|y_{k}-v_{k}\|^{2}+\frac{\gamma_{k+1}}{\gamma_{k}+\alpha_{k}\mu}\left\langle\nabla f(y_{k}),v_{k}-w_{k}\right\rangle\Big{)}\bigg{)}\bigg{]}
=𝔼[f(wk+1)]+𝔼[μαk(1αk)γk2γk+1ykvk2]\displaystyle=\mathbb{E}[f(w_{k+1})]+\mathbb{E}\left[\frac{\mu\alpha_{k}(1-\alpha_{k})\gamma_{k}}{2\gamma_{k+1}}\|y_{k}-v_{k}\|^{2}\right]
𝔼[f(wk+1)].\displaystyle\geq\mathbb{E}[f(w_{k+1})].

The final inequality holds because \frac{\mu\alpha_{k}(1-\alpha_{k})\gamma_{k}}{2\gamma_{k+1}}\geq 0. We conclude that \mathbb{E}[\inf_{w}\phi_{k}(w)]\geq\mathbb{E}[f(w_{k})] holds for all k\in\mathbb{N} by induction. ∎
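The cancellation of the two inner-product terms in the last steps can be verified numerically. A minimal sketch, assuming the identity y_{k}=w_{k}-\frac{\alpha_{k}\gamma_{k}}{\gamma_{k}+\alpha_{k}\mu}(w_{k}-v_{k}) and the recursion \gamma_{k+1}=(1-\alpha_{k})\gamma_{k}+\alpha_{k}\mu used above (random test vectors stand in for \nabla f(y_{k}), w_{k}, and v_{k}):

```python
import numpy as np

# Check that <g, w - y> + (alpha * gamma / gamma_next) * <g, v - y> = 0 for any g,
# where y = w - (alpha * gamma / (gamma + alpha * mu)) * (w - v) and
# gamma_next = (1 - alpha) * gamma + alpha * mu.
rng = np.random.default_rng(0)
for _ in range(1000):
    d = 5
    w, v, g = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    mu, gamma, alpha = rng.uniform(0, 1), rng.uniform(0.1, 5), rng.uniform(0.01, 0.99)
    gamma_next = (1 - alpha) * gamma + alpha * mu
    y = w - (alpha * gamma / (gamma + alpha * mu)) * (w - v)
    total = g @ (w - y) + (alpha * gamma / gamma_next) * (g @ (v - y))
    assert abs(total) < 1e-10
print("inner-product terms cancel as claimed")
```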

See 6

Proof.

Under the conditions of the theorem, Proposition 5 holds and Eq. 11 implies

𝔼[f(wk)]f(w)\displaystyle\mathbb{E}\left[f(w_{k})\right]-f(w^{*}) 𝔼[λk(ϕ0(w)f(w))]\displaystyle\leq\mathbb{E}\left[\lambda_{k}(\phi_{0}(w^{*})-f(w^{*}))\right]
Since γ0=μ>0\gamma_{0}=\mu>0, Lemma 15 implies λk=i=0k1(1ηiμ)\lambda_{k}=\prod_{i=0}^{k-1}(1-\sqrt{\eta_{i}\mu}) and we obtain
𝔼[f(wk)]f(w)\displaystyle\mathbb{E}\left[f(w_{k})\right]-f(w^{*}) 𝔼[i=0k1(1ηiμ)](ϕ0(w)f(w))\displaystyle\leq\mathbb{E}\left[\prod_{i=0}^{k-1}(1-\sqrt{\eta_{i}\mu})\right](\phi_{0}(w^{*})-f(w^{*}))
=𝔼[i=0k1(1ηiμ)][f(w0)f(w)+μ2w0w22],\displaystyle=\mathbb{E}\left[\prod_{i=0}^{k-1}(1-\sqrt{\eta_{i}\mu})\right]\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right],

which completes the proof. ∎
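For concreteness, consider SGD under the strong growth condition: Lemma 17 below (with D=I) shows the sufficient progress condition holds for any step size \eta_{k}\leq 1/(\rho L). Substituting the illustrative constant choice \eta_{i}\equiv 1/(\rho L) into the bound above gives

\mathbb{E}\left[f(w_{k})\right]-f(w^{*})\leq\left(1-\sqrt{\frac{\mu}{\rho L}}\right)^{k}\left[f(w_{0})-f(w^{*})+\frac{\mu}{2}\|w_{0}-w^{*}\|_{2}^{2}\right],

so the number of iterations needed to reach an \epsilon-accurate solution scales as \sqrt{\rho L/\mu}\log(1/\epsilon).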

See 7

Proof.

Under the conditions of the theorem, Proposition 5 holds and Eq. 11 implies

𝔼[f(wk)]f(w)\displaystyle\mathbb{E}\left[f(w_{k})\right]-f(w^{*}) 𝔼[λk(ϕ0(w)f(w))]\displaystyle\leq\mathbb{E}\left[\lambda_{k}(\phi_{0}(w^{*})-f(w^{*}))\right]
Since \gamma_{0}\in(\mu,\mu+3/\eta_{\text{min}}), Lemma 15 implies \lambda_{k}\leq 4/\left(\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}\right) and we obtain
𝔼[f(wk)]f(w)\displaystyle\mathbb{E}\left[f(w_{k})\right]-f(w^{*}) 4ηmin(γ0μ)(k+1)2(ϕ0(w)f(w))\displaystyle\leq\frac{4}{\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}}(\phi_{0}(w^{*})-f(w^{*}))
=4ηmin(γ0μ)(k+1)2[f(w0)f(w)+γ02w0w22],\displaystyle=\frac{4}{\eta_{\text{min}}(\gamma_{0}-\mu)(k+1)^{2}}\left[f(w_{0})-f(w^{*})+\frac{\gamma_{0}}{2}\|w_{0}-w^{*}\|_{2}^{2}\right],

which completes the proof. ∎
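As above, an illustrative specialization: taking the constant step size \eta_{\text{min}}=1/(\rho L) permitted by Lemma 17 (with D=I) for SGD under strong growth, the bound becomes

\mathbb{E}\left[f(w_{k})\right]-f(w^{*})\leq\frac{4\rho L}{(\gamma_{0}-\mu)(k+1)^{2}}\left[f(w_{0})-f(w^{*})+\frac{\gamma_{0}}{2}\|w_{0}-w^{*}\|_{2}^{2}\right],

so reaching an \epsilon-accurate solution requires O(\sqrt{\rho L/\epsilon}) iterations.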

B.1 Specializations: Proofs

Lemma 16.

Let D\in\mathbb{R}^{d\times d} be independent of z_{k}. Suppose f is convex and L-smooth, the stochastic gradients \nabla f(\cdot,z_{k}) are L_{\text{max}}^{D} individually smooth with respect to the matrix norm \|\cdot\|_{D}, and interpolation holds. If f is also \mu_{D}-strongly convex with respect to \|\cdot\|_{D}, then

𝔼zk[f(w,zk)D12]LmaxDμDf(w)D12.\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}\right]\leq\frac{L_{\text{max}}^{D}}{\mu_{D}}\|\nabla f(w)\|_{D^{-1}}^{2}. (32)

That is, strong growth holds in D\|\cdot\|_{D} with constant ρDLmaxDμD\rho_{D}\leq\frac{L_{\text{max}}^{D}}{\mu_{D}}.

Proof.

Starting from LmaxDL_{\text{max}}^{D} individual-smoothness,

f(u,zk)\displaystyle f(u,z_{k}) f(w,zk)+f(w,zk),uw+LmaxD2uwD2,\displaystyle\leq f(w,z_{k})+\left\langle\nabla f(w,z_{k}),u-w\right\rangle+\frac{L_{\text{max}}^{D}}{2}\|u-w\|_{D}^{2},
and choosing u=w1LmaxDD1f(w,zk)u=w-\frac{1}{L_{\text{max}}^{D}}D^{-1}\nabla f(w,z_{k}), we obtain
f(u,zk)\displaystyle f(u,z_{k}) f(w,zk)1LmaxDf(w,zk),D1f(w,zk)+12LmaxDD1f(w,zk)D2\displaystyle\leq f(w,z_{k})-\frac{1}{L_{\text{max}}^{D}}\left\langle\nabla f(w,z_{k}),D^{-1}\nabla f(w,z_{k})\right\rangle+\frac{1}{2L_{\text{max}}^{D}}\|D^{-1}\nabla f(w,z_{k})\|_{D}^{2}
=f(w,zk)12LmaxDf(w,zk)D12.\displaystyle=f(w,z_{k})-\frac{1}{2L_{\text{max}}^{D}}\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}.

Noting that f(u,zk)f(w,zk)f(u,z_{k})\geq f(w^{*},z_{k}) by convexity of ff and interpolation and taking expectations with respect to zkz_{k} gives the following:

f(w,zk)\displaystyle f(w^{*},z_{k}) f(w,zk)12LmaxDf(w,zk)D12\displaystyle\leq f(w,z_{k})-\frac{1}{2L_{\text{max}}^{D}}\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}
𝔼zk[f(w,zk)]\displaystyle\implies\mathbb{E}_{z_{k}}\left[f(w^{*},z_{k})\right] 𝔼zk[f(w,zk)]12LmaxD𝔼zk[f(w,zk)D12]\displaystyle\leq\mathbb{E}_{z_{k}}\left[f(w,z_{k})\right]-\frac{1}{2L_{\text{max}}^{D}}\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}\right]
f(w)\displaystyle\implies f(w^{*}) f(w)12LmaxD𝔼zk[f(w,zk)D12].\displaystyle\leq f(w)-\frac{1}{2L_{\text{max}}^{D}}\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}\right].

Rearranging this final inequality and using \mu_{D}-strong convexity gives the desired result,

𝔼zk[f(w,zk)D12]\displaystyle\mathbb{E}_{z_{k}}\left[\|\nabla f(w,z_{k})\|_{D^{-1}}^{2}\right] 2LmaxD(f(w)f(w))\displaystyle\leq 2L_{\text{max}}^{D}\left(f(w)-f(w^{*})\right)
\displaystyle\leq\frac{L_{\text{max}}^{D}}{\mu_{D}}\|\nabla f(w)\|_{D^{-1}}^{2}. ∎
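Eq. 32 can be checked numerically in the Euclidean case D=I. A minimal sketch on a synthetic interpolating least-squares problem (the data, problem sizes, and variable names below are assumptions local to this sketch, not part of the lemma):

```python
import numpy as np

# Check E_z ||grad f(w, z)||^2 <= (L_max / mu) ||grad f(w)||^2 for D = I on an
# interpolating least-squares problem f(w, z_i) = 0.5 * (x_i^T (w - w_star))^2.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)              # interpolation: every f(., z_i) is minimized at w_star
H = X.T @ X / n                          # Hessian of the full objective
L_max = np.max(np.sum(X ** 2, axis=1))   # individual smoothness constant
mu = np.min(np.linalg.eigvalsh(H))       # strong convexity constant (> 0 since n > d)
for _ in range(100):
    w = rng.normal(size=d)
    r = X @ (w - w_star)                                    # residuals x_i^T (w - w_star)
    lhs = np.mean(np.sum((X * r[:, None]) ** 2, axis=1))    # E_z ||grad f(w, z)||^2
    rhs = (L_max / mu) * np.sum((H @ (w - w_star)) ** 2)    # (L_max / mu) ||grad f(w)||^2
    assert lhs <= rhs + 1e-8
print("strong growth with rho <= L_max / mu verified on this instance")
```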

Lemma 17.

Let D\in\mathbb{R}^{d\times d}. Assume f is L_{D}-smooth and satisfies strong growth with constant \rho_{D} in the matrix norm \|\cdot\|_{D}. If 0\prec D\preceq I and \eta_{k}\leq\frac{1}{\rho_{D}L_{D}}, then preconditioned SGD satisfies,

𝔼zk[f(wk+1)]f(yk)ηk2f(yk)2.\mathbb{E}_{z_{k}}\left[f(w_{k+1})\right]\leq f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f\left(y_{k}\right)\|^{2}.
Proof.

Starting from smoothness in D\|\cdot\|_{D},

f(wk+1)\displaystyle f(w_{k+1}) f(yk)+f(yk),wk+1yk+LD2wk+1ykD2\displaystyle\leq f(y_{k})+\left\langle\nabla f(y_{k}),w_{k+1}-y_{k}\right\rangle+\frac{L_{D}}{2}\|w_{k+1}-y_{k}\|_{D}^{2}
f(yk)ηkf(yk),D1f(yk,zk)+ηk2LD2f(yk,zk)D12.\displaystyle\leq f(y_{k})-\eta_{k}\left\langle\nabla f(y_{k}),D^{-1}\nabla f\left(y_{k},z_{k}\right)\right\rangle+\frac{\eta_{k}^{2}L_{D}}{2}\|\nabla f\left(y_{k},z_{k}\right)\|_{D^{-1}}^{2}.
where the second step substitutes the preconditioned SGD update w_{k+1}=y_{k}-\eta_{k}D^{-1}\nabla f(y_{k},z_{k}). Taking expectations with respect to z_{k} and then applying strong growth in \|\cdot\|_{D},
\displaystyle\mathbb{E}_{z_{k}}\left[f(w_{k+1})\right] \leq f(y_{k})-\eta_{k}\left\langle\nabla f(y_{k}),D^{-1}\nabla f\left(y_{k}\right)\right\rangle+\frac{\eta_{k}^{2}L_{D}}{2}\mathbb{E}_{z_{k}}\left[\|\nabla f\left(y_{k},z_{k}\right)\|_{D^{-1}}^{2}\right]
\displaystyle\leq f(y_{k})-\eta_{k}\left\langle\nabla f(y_{k}),D^{-1}\nabla f\left(y_{k}\right)\right\rangle+\frac{\eta_{k}^{2}\rho_{D}L_{D}}{2}\|\nabla f\left(y_{k}\right)\|_{D^{-1}}^{2}
=f(yk)ηk(1ηkρDLD2)f(yk)D12\displaystyle=f(y_{k})-\eta_{k}\left(1-\frac{\eta_{k}\rho_{D}L_{D}}{2}\right)\|\nabla f\left(y_{k}\right)\|_{D^{-1}}^{2}
f(yk)ηk2f(yk)D12\displaystyle\leq f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f\left(y_{k}\right)\|_{D^{-1}}^{2}
\displaystyle\leq f(y_{k})-\frac{\eta_{k}}{2}\|\nabla f\left(y_{k}\right)\|_{2}^{2},

where the penultimate inequality uses \eta_{k}\leq\frac{1}{\rho_{D}L_{D}} and the final inequality uses 0\prec D\preceq I, so that \|\cdot\|_{D^{-1}}^{2}\geq\|\cdot\|_{2}^{2}. ∎
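The progress bound can also be checked numerically. A minimal sketch, assuming the update w_{k+1}=y_{k}-\eta_{k}D^{-1}\nabla f(y_{k},z_{k}) above, a synthetic interpolating least-squares instance, a random diagonal D with 0\prec D\preceq I, and the conservative bound \rho_{D}\leq L_{\text{max}}^{D}/\mu_{D} from Lemma 16 (all constants are illustrative):

```python
import numpy as np

# Check E_z[f(y - eta * D^{-1} grad f(y, z))] <= f(y) - (eta / 2) ||grad f(y)||^2
# with eta = 1 / (rho_D * L_D) on an interpolating least-squares problem.
rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
D = np.diag(rng.uniform(0.2, 1.0, size=d))           # 0 < D <= I
D_inv = np.linalg.inv(D)
H = X.T @ X / n
D_half_inv = np.diag(1.0 / np.sqrt(np.diag(D)))
M = D_half_inv @ H @ D_half_inv                      # D^{-1/2} H D^{-1/2}
L_D, mu_D = np.max(np.linalg.eigvalsh(M)), np.min(np.linalg.eigvalsh(M))
L_max_D = np.max(np.sum((X @ D_inv) * X, axis=1))    # max_i x_i^T D^{-1} x_i
eta = 1.0 / ((L_max_D / mu_D) * L_D)                 # eta <= 1 / (rho_D * L_D)
f = lambda w: 0.5 * np.mean((X @ (w - w_star)) ** 2)
for _ in range(100):
    y = rng.normal(size=d)
    grad_full = H @ (y - w_star)
    # exact expectation over the n samples of f(y - eta * D^{-1} grad f(y, z_i))
    steps = y - eta * (X * (X @ (y - w_star))[:, None]) @ D_inv
    lhs = np.mean([f(w_plus) for w_plus in steps])
    assert lhs <= f(y) - 0.5 * eta * np.sum(grad_full ** 2) + 1e-10
print("preconditioned SGD progress condition verified on this instance")
```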

Appendix C Comparison to Existing Rates: Proofs

See 10

Proof.

It is easy to see that Lmax=1L_{\text{max}}=1. Taking expectations, we find that

fls(w)=12nw22,f_{\text{ls}}(w)=\frac{1}{2n}\|w\|_{2}^{2},

which implies L=μ=1/nL=\mu=1/n. It is also straightforward to compute ρ\rho:

𝔼[fls(w,zk)22]\displaystyle\mathbb{E}\left[\|\nabla f_{\text{ls}}(w,z_{k})\|_{2}^{2}\right] =1ni=1nei(eiw)22=1ni=1nwi2=1nw22=nfls(w)22,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\|e_{i}(e_{i}^{\top}w)\|^{2}_{2}=\frac{1}{n}\sum_{i=1}^{n}w_{i}^{2}=\frac{1}{n}\|w\|_{2}^{2}=n\|\nabla f_{\text{ls}}(w)\|_{2}^{2},

which implies ρ=n\rho=n. Note that this is tight with the bound ρLmax/μ\rho\leq L_{\text{max}}/\mu.

Now we compute the values of κ\kappa and κ~\tilde{\kappa}. Since the Hessian satisfies H=I/nH=I/n, it is straightforward to see that

𝔼x[x22(xx)]\displaystyle\mathbb{E}_{x}\left[\|x\|_{2}^{2}(xx^{\top})\right] =1ni=1nei22eiei=H\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\|e_{i}\|_{2}^{2}e_{i}e_{i}^{\top}=H
𝔼x[xH12(xx)]\displaystyle\mathbb{E}_{x}\left[\|x\|^{2}_{H^{-1}}(xx^{\top})\right] =1ni=1neiH12eiei=i=1nei22eiei=nH,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\|e_{i}\|_{H^{-1}}^{2}e_{i}e_{i}^{\top}=\sum_{i=1}^{n}\|e_{i}\|_{2}^{2}e_{i}e_{i}^{\top}=nH,

which implies that κ=κ~=n\kappa=\tilde{\kappa}=n. Substituting these values into the complexity bounds for stochastic AGD and MaSS completes the example. ∎
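A quick numeric check of the strong growth computation above, using the displayed expressions for the expected and full gradient norms (the dimension n=10 is illustrative):

```python
import numpy as np

# For f_ls(w, z_i) = 0.5 * (e_i^T w)^2 with z_i uniform on {1, ..., n}:
# E ||grad f_ls(w, z)||^2 = (1/n) ||w||^2 and grad f_ls(w) = w / n, so rho = n exactly.
n = 10
rng = np.random.default_rng(0)
for _ in range(100):
    w = rng.normal(size=n)
    expected_sq_norm = np.mean(w ** 2)        # E ||grad f_ls(w, z)||^2
    full_sq_norm = np.sum((w / n) ** 2)       # ||grad f_ls(w)||^2
    assert np.isclose(expected_sq_norm, n * full_sq_norm)
print("rho = n verified for the first example")
```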

See 11

Proof.

Again, it is easy to see that Lmax=1L_{\text{max}}=1. Taking expectations, we find that

fls(w)=n12nw12+12nw22,f_{\text{ls}}(w)=\frac{n-1}{2n}w_{1}^{2}+\frac{1}{2n}w_{2}^{2},

which implies L=n1nL=\frac{n-1}{n} and μ=1n\mu=\frac{1}{n}. The strong growth constant is given by

𝔼[f(w,zk)22]\displaystyle\mathbb{E}\left[\|\nabla f(w,z_{k})\|_{2}^{2}\right] =n1nw12+1nw22\displaystyle=\frac{n-1}{n}w_{1}^{2}+\frac{1}{n}w_{2}^{2}
=n(n1n2w12+1n2w22)\displaystyle=n\left(\frac{n-1}{n^{2}}w_{1}^{2}+\frac{1}{n^{2}}w_{2}^{2}\right)
n((n1)2n2w12+1n2w22)\displaystyle\leq n\left(\frac{(n-1)^{2}}{n^{2}}w_{1}^{2}+\frac{1}{n^{2}}w_{2}^{2}\right)
=nf(w)22,\displaystyle=n\|\nabla f(w)\|_{2}^{2},

which implies \rho\leq n. This bound is tight, as can be seen by taking w_{1}=0 and w_{2}\neq 0.

Now we compute the values of κ\kappa and κ~\tilde{\kappa}. The Hessian is given by

H=[n1n001n],H=\begin{bmatrix}\frac{n-1}{n}&0\\ 0&\frac{1}{n}\end{bmatrix},

and thus

𝔼x[x22(xx)]\displaystyle\mathbb{E}_{x}\left[\|x\|_{2}^{2}(xx^{\top})\right] =n1ne1e1+1ne2e2=H\displaystyle=\frac{n-1}{n}e_{1}e_{1}^{\top}+\frac{1}{n}e_{2}e_{2}^{\top}=H
𝔼x[xH12(xx)]\displaystyle\mathbb{E}_{x}\left[\|x\|^{2}_{H^{-1}}(xx^{\top})\right] =e1e1+e2e2=InH,\displaystyle=e_{1}e_{1}^{\top}+e_{2}e_{2}^{\top}=I\preceq nH,

which implies κ~=κ=n\tilde{\kappa}=\kappa=n. Substituting these values into the complexity bounds for stochastic AGD and MaSS completes the example. ∎
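A quick numeric check of the inequality chain above, again using only the displayed expressions for the expected and full gradient norms (n=10 is illustrative):

```python
import numpy as np

# Check E ||grad f(w, z)||^2 <= n ||grad f(w)||^2 for the second example,
# with equality at w_1 = 0.
n = 10
rng = np.random.default_rng(0)
def expected_sq(w1, w2):
    return (n - 1) / n * w1 ** 2 + 1 / n * w2 ** 2            # E ||grad f(w, z)||^2
def full_sq_times_n(w1, w2):
    return n * (((n - 1) / n * w1) ** 2 + (1 / n * w2) ** 2)  # n ||grad f(w)||^2
for _ in range(100):
    w1, w2 = rng.normal(), rng.normal()
    assert expected_sq(w1, w2) <= full_sq_times_n(w1, w2) + 1e-12
assert np.isclose(expected_sq(0.0, 1.0), full_sq_times_n(0.0, 1.0))  # tight at w_1 = 0
print("rho <= n, with equality at w_1 = 0, verified for the second example")
```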