Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation

Aaron Mishkin¹

¹Department of Computer Science, Stanford University. ²Department of Electrical Engineering, Stanford University. ³Department of Computer Science, University of British Columbia. ⁴Canada CIFAR AI Chair, Amii.
Abstract
We prove new convergence rates for a generalized version of stochastic Nesterov acceleration under interpolation conditions. Unlike previous analyses, our approach accelerates any stochastic gradient method which makes sufficient progress in expectation. The proof, which proceeds using the estimating sequences framework, applies to both convex and strongly convex functions and is easily specialized to accelerated SGD under the strong growth condition. In this special case, our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work. This improvement is comparable to a square-root of the condition number in the worst case and addresses criticism that guarantees for stochastic acceleration could be worse than those for SGD.
keywords: acceleration, SGD, strong growth, interpolation

1 Introduction
A continuing trend in machine learning is the adoption of powerful prediction models which can exactly fit, or interpolate, their training data [1]. Methods such as over-parameterized neural networks [2, 3], kernel machines [4], and boosting [5] have all been shown to achieve zero training loss in practice. This phenomenon is particularly prevalent in modern deep learning, where interpolation is conjectured to be key to both optimization [6, 7] and generalization [8].
Recent experimental and theoretical evidence shows stochastic gradient descent (SGD) matches the fast convergence rates of deterministic gradient methods up to problem-dependent constants when training interpolating models [9, 10, 11]. With additional assumptions, interpolation also implies the strong [12] and weak [13, 14] growth conditions, which bound the second moment of the stochastic gradients. Under strong/weak growth, variance-reduced algorithms typically exhibit slower convergence than stochastic gradient methods despite using more computation or memory [15, 10], perhaps because these conditions already imply a form of “automatic variance reduction” [6]. A combination of interpolation and growth conditions has been used to prove fast convergence rates for SGD with line-search [14], with the stochastic Polyak step-size [16, 17], for mirror descent [18], and for model-based methods [19].
While these results show interpolation is sufficient to break the complexity barrier for computing stationary points of smooth, convex functions with stochastic, first-order oracles [20], significantly less work has been done to obtain the accelerated rates possible in the deterministic setting [21]. Vaswani et al. [22] analyze a stochastic version of Nesterov’s accelerated gradient method (AGD) [23] under the strong growth condition, but their bounds have a linear dependence on the strong growth constant $\rho$ and can be slower than those for SGD [24]. In contrast, Liu and Belkin [24] propose a modified version of stochastic AGD and extend the statistical condition number approach of Jain et al. [25] to the interpolation setting. However, their results apply primarily to quadratics and are not accelerated for general convex functions.
In this work, we apply the estimating sequences analysis developed by Nesterov [26] to the interpolation setting. Our approach hinges on a simple, in-expectation progress guarantee for SGD, which we prove is a sufficient condition for generic acceleration of stochastic algorithms. This proof technique is completely different from that used by Vaswani et al. [22] and yields an improved dependence on the strong growth constant. In the worst case, the improvement is proportional to the square-root of the condition number and guarantees stochastic AGD is always at least as fast as SGD. In what follows, all proofs are deferred to the corresponding section of the appendix.
1.1 Additional Related Work
A large literature on stochastic optimization under interpolation rapidly developed following the seminal work by Bassily et al. [13] and Vaswani et al. [22]. For instance, Xiao et al. [27] analyze Frank-Wolfe under interpolation, Vaswani et al. [28] prove fast convergence for Adagrad-type methods [29], and Meng et al. [30] show fast rates for the sub-sampled Newton method under interpolation. Interpolation has also been used to study last-iterate convergence of SGD [31] and subgradient methods [32].
Interpolation is a key sufficient condition for growth conditions, which are a set of general assumptions controlling the magnitude of the noise in stochastic gradients. Strong growth was introduced by Polyak [12, Section 4.2.5], who called the condition relative random noise and used it to prove q-linear convergence of SGD. Solodov [33] and Tseng [34] later used a variation of strong growth to analyze incremental gradient methods; their variation was also used by Schmidt and Le Roux [35] to prove linear convergence of SGD with a constant step-size for strongly-convex functions. Vaswani et al. [22] returned to the original definition given by Polyak [12] and used it to analyze stochastic AGD, as we do in this paper.
Many authors have tried to accelerate stochastic gradient methods, including under a variety of growth conditions. By accelerate, we mean improve the dependence on the condition number or improve the order of convergence (i.e., from $\mathcal{O}(1/k)$ to $\mathcal{O}(1/k^2)$) as compared to SGD. For example, Schmidt et al. [36] establish orders of growth on the gradient noise which still permit stochastic proximal-gradient methods to be accelerated. In contrast, d’Aspremont [37] and Devolder et al. [38] assume bounded, deterministic gradient errors to derive convergence rates, while Cohen et al. [39] develop a noise-resistant acceleration scheme. Most recently, Chen et al. [40] analyze stochastic AGD under expected smoothness, but their rate only holds under interpolation when the strong growth constant is less than two.
As discussed above, Liu and Belkin [24] extend the approach of Jain et al. [25] to the interpolation setting. Their assumptions imply strong growth and the analysis is limited to least-squares problems, although similar rates have been obtained for continuized AGD applied to kernel least-squares [41]. Valls et al. [42] take a different view and extend the work by Vaswani et al. [22] to constrained optimization. Unfortunately, none of these algorithms are necessarily accelerated and both Assran and Rabbat [43] and Liu and Belkin [24] prove that stochastic AGD may not obtain accelerated rates of convergence even under interpolation. We address this criticism and make detailed comparisons between convergence rates in Section 4.
2 Assumptions
Consider the unconstrained minimization of a smooth, convex function $f$. We assume $f$ has at least one minimizer $w^*$ and denote the optimal set by $\mathcal{W}^*$. At each iteration $k$, we sample a stochastic gradient $\nabla f(w, z_k)$ such that,

$$\mathbb{E}_{z_k}\left[\nabla f(w, z_k)\right] = \nabla f(w),$$

meaning the stochastic gradient is unbiased. We assume that $z_j$ and $z_k$ are independent for $j \neq k$, but they do not need to have the same distribution. The stochastic gradients satisfy the strong growth condition when there exists $\rho > 0$ for which

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_2^2\right] \leq \rho\, \|\nabla f(w)\|_2^2 \tag{1}$$

holds for every $w \in \mathbb{R}^d$ and any sequence $(z_k)$. We say that interpolation is satisfied if

$$\nabla f(w^*) = 0 \implies \nabla f(w^*, z_k) = 0 \text{ almost surely}$$

for all $w^* \in \mathcal{W}^*$. That is, stationarity of $f$ implies stationarity of the stochastic gradients. Although the strong growth condition implies interpolation, we will see that the converse requires further assumptions on $f$ and on the stochastic gradients.
We assume that $f$ is $L$-smooth, meaning $\nabla f$ is $L$-Lipschitz continuous. $L$-smoothness of $f$ implies the following quadratic upper-bound for all $w, v \in \mathbb{R}^d$ [26]:

$$f(w) \leq f(v) + \langle \nabla f(v), w - v \rangle + \frac{L}{2}\|w - v\|_2^2. \tag{2}$$

Similarly, we will sometimes require the stochastic gradients to be $L_{\max}$-individually smooth, meaning they satisfy

$$f(w, z_k) \leq f(v, z_k) + \langle \nabla f(v, z_k), w - v \rangle + \frac{L_{\max}}{2}\|w - v\|_2^2 \tag{3}$$

almost surely for all $w, v \in \mathbb{R}^d$. We also assume that $f$ is $\mu$-strongly convex, by which we mean

$$f(w) \geq f(v) + \langle \nabla f(v), w - v \rangle + \frac{\mu}{2}\|w - v\|_2^2 \tag{4}$$

holds for all $w, v \in \mathbb{R}^d$ and some $\mu \geq 0$. When $\mu = 0$, strong convexity reduces to convexity of $f$. If $f$ is strongly convex with $\mu > 0$, the stochastic gradients are $L_{\max}$-individually smooth, and interpolation holds, then the strong growth constant is bounded as $\rho \leq L_{\max}/\mu$ (see Lemma 13). Recalling that the ratio $L/\mu$ is the condition number of $f$, we see that the strong-growth constant is bounded by a quantity like the condition number of the worst-conditioned stochastic function.¹

¹Note that $f(\cdot, z_k)$ is typically not strongly convex, so this analogy is not formal.
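As a sanity check, the bound $\rho \leq L_{\max}/\mu$ can be verified numerically on a small interpolating least-squares problem. The following is a minimal sketch (all names are our own illustrative choices), where the stochastic gradient is the gradient of one uniformly sampled component of a consistent linear system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
b = A @ w_star  # consistent system, so interpolation holds

# f(w) = (1/2n) ||Aw - b||^2; stochastic gradients use single rows.
grad = lambda w: A.T @ (A @ w - b) / n
grad_i = lambda w, i: A[i] * (A[i] @ w - b[i])

H = A.T @ A / n
mu, L = np.linalg.eigvalsh(H)[[0, -1]]      # strong convexity and smoothness
L_max = max(np.linalg.norm(A[i]) ** 2 for i in range(n))  # individual smoothness

# Empirical strong growth ratio E_i ||grad_i(w)||^2 / ||grad(w)||^2 over
# random test points; Lemma 13 says it is bounded by L_max / mu.
rho_hat = 0.0
for _ in range(1000):
    w = w_star + rng.standard_normal(d)
    second_moment = np.mean([np.linalg.norm(grad_i(w, i)) ** 2 for i in range(n)])
    rho_hat = max(rho_hat, second_moment / np.linalg.norm(grad(w)) ** 2)

print(f"empirical rho = {rho_hat:.2f}, bound L_max/mu = {L_max / mu:.2f}")
```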
3 Convergence of Stochastic AGD
Our analysis in this section builds on the estimating sequences approach developed by Nesterov [44]. However, we consider variable step-sizes and allow the iterates $w_{k+1}$ to be set using a general update scheme. The procedure takes $w_0 = v_0 \in \mathbb{R}^d$ and $\gamma_0 > 0$ as input and uses the following updates, which we call generalized stochastic AGD:

$$\begin{aligned}
\alpha_k &\in (0, 1) \text{ solving } \alpha_k^2 = \eta_k\left[(1 - \alpha_k)\gamma_k + \alpha_k \mu\right], \\
\gamma_{k+1} &= (1 - \alpha_k)\gamma_k + \alpha_k \mu, \\
y_k &= \frac{\alpha_k \gamma_k v_k + \gamma_{k+1} w_k}{\gamma_k + \alpha_k \mu}, \\
w_{k+1} &= \mathcal{T}(y_k, \eta_k, z_k), \\
v_{k+1} &= \frac{(1 - \alpha_k)\gamma_k v_k + \alpha_k \mu\, y_k - \alpha_k \nabla f(y_k, z_k)}{\gamma_{k+1}},
\end{aligned} \tag{5}$$

where $\mathcal{T}$ is an as-yet unspecified update for the “primal sequence” $(w_k)$ and $\alpha_k \in (0, 1)$ (which implies $\gamma_{k+1} > 0$). Note that the step-size $\eta_k$ is required at the start of the iteration to compute $\alpha_k$ and $y_k$. As a result, local search methods like the Armijo line-search [45] must re-evaluate $\alpha_k$, $y_k$, and $w_{k+1}$ for each candidate step-size.
Choosing $\mathcal{T}$ to be one step of SGD with step-size $\eta_k$ yields the familiar updates of the standard version of stochastic AGD (see Nesterov [26, Eq. 2.2.20]):

$$\begin{aligned}
w_{k+1} &= y_k - \eta_k \nabla f(y_k, z_k), \\
y_{k+1} &= w_{k+1} + \beta_k \left(w_{k+1} - w_k\right),
\end{aligned} \tag{6}$$

where $\beta_k$ is the momentum parameter induced by the sequences $(\alpha_k)$ and $(\gamma_k)$.
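For concreteness, here is a minimal sketch of the generalized scheme with the SGD primal step of Eq. 6 plugged in. The routine and helper names are ours, and the code assumes the update layout displayed in Eq. 5 (with $\alpha_k$ obtained from the quadratic formula):

```python
import numpy as np

def generalized_stochastic_agd(stoch_grad, primal_update, w0, eta, mu, gamma0, n_iters):
    """Sketch of generalized stochastic AGD (Eq. 5).

    stoch_grad(y, k) returns an unbiased stochastic gradient at y;
    primal_update(y, eta, g) is the generic primal step T, e.g. one SGD step.
    """
    w, v, gamma = w0.copy(), w0.copy(), gamma0
    for k in range(n_iters):
        # alpha_k solves alpha^2 = eta * ((1 - alpha) * gamma + alpha * mu),
        # i.e. alpha^2 + eta * (gamma - mu) * alpha - eta * gamma = 0.
        c = eta * (gamma - mu)
        alpha = (-c + np.sqrt(c ** 2 + 4 * eta * gamma)) / 2
        gamma_next = (1 - alpha) * gamma + alpha * mu
        y = (alpha * gamma * v + gamma_next * w) / (gamma + alpha * mu)
        g = stoch_grad(y, k)
        w = primal_update(y, eta, g)  # e.g. y - eta * g for SGD (Eq. 6)
        v = ((1 - alpha) * gamma * v + alpha * mu * y - alpha * g) / gamma_next
        gamma = gamma_next
    return w

sgd_step = lambda y, eta, g: y - eta * g  # the primal update of Eq. 6
```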
Our approach hinges on the fact that, under strong growth, stochastic gradient updates for $w_{k+1}$ give a similar per-step progress condition as deterministic gradient descent. Since descent in the primal step is the only link between $(w_k)$ and the “dual sequence” $(v_k)$, generalized stochastic AGD with any primal update obtaining similar per-iteration progress can be analyzed in the same fashion as stochastic AGD. This allows us to derive a fast convergence rate for the general scheme in Eq. 5.

We start by deriving the progress condition for SGD. It is straightforward to prove the following bound using $L$-smoothness, the definition of $w_{k+1}$, and strong growth:
Lemma 1.
Suppose $f$ is $L$-smooth, the strong growth condition holds, and $z_k$ is independent of $y_k$. Then the stochastic gradient step in Eq. 6 makes progress as,

$$\mathbb{E}_{z_k}\left[f(w_{k+1})\right] \leq f(y_k) - \eta_k\left(1 - \frac{\rho L \eta_k}{2}\right)\|\nabla f(y_k)\|_2^2. \tag{7}$$
Substituting the fixed step-size $\eta_k = 1/(\rho L)$ into Eq. 7 gives the following inequality, which we call the expected progress condition:

$$\mathbb{E}_{z_k}\left[f(w_{k+1})\right] \leq f(y_k) - \frac{1}{2\rho L}\|\nabla f(y_k)\|_2^2. \tag{8}$$
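Continuing the least-squares sketch from Section 2, the expected progress condition is easy to test at random points (illustrative code; `A`, `b`, `n`, `grad`, `grad_i`, `mu`, `L`, and `L_max` are defined in the previous snippet, and we use the provable bound $\rho \leq L_{\max}/\mu$):

```python
# Check the expected progress condition (Eq. 8) at random points.
rho = L_max / mu
eta = 1.0 / (rho * L)
f = lambda w: 0.5 * np.linalg.norm(A @ w - b) ** 2 / n

for trial in range(5):
    y = w_star + rng.standard_normal(d)
    # E_z[f(y - eta * grad(y, z))], averaging over the n components.
    expected_f = np.mean([f(y - eta * grad_i(y, i)) for i in range(n)])
    bound = f(y) - np.linalg.norm(grad(y)) ** 2 / (2 * rho * L)
    assert expected_f <= bound + 1e-12
```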
Eq. 8 is equivalent to the progress made by gradient descent with step-size $1/L$ up to a factor of $1/\rho$ [46]. In order to make use of the expected progress condition, we now introduce the estimating sequences framework.
Definition 2 (Estimating Sequences).
Two sequences $(\phi_k)_{k=0}^{\infty}$ and $(\lambda_k)_{k=0}^{\infty}$ are estimating sequences if the following hold almost surely: (i) $\lambda_0 = 1$; (ii) $\lambda_k \rightarrow 0$; and (iii) for all $w \in \mathbb{R}^d$ and $k \geq 0$,

$$\phi_k(w) \leq (1 - \lambda_k) f(w) + \lambda_k \phi_0(w). \tag{9}$$

Unlike the standard definition, we permit $\phi_k$ and $\lambda_k$ to depend on $z_0, \ldots, z_{k-1}$, making them random variables. Typically the first function $\phi_0$ is deterministic and chosen to satisfy $\phi_0(w) \geq f(w)$ for all $w$ near $w_0$. Since Eq. 9 guarantees $\phi_k(w) - f(w) \leq \lambda_k\left(\phi_0(w) - f(w)\right)$, $(\phi_k)$ can be interpreted as a sequence of relaxing upper-bounds, where the rate of relaxation is controlled by $(\lambda_k)$. The next condition captures when $\phi_k$ is a good local model of $f$.
Definition 3.
The local upper-bound property holds in expectation if

$$\mathbb{E}\left[f(w_k)\right] \leq \mathbb{E}\left[\phi_k^*\right], \quad \text{where} \quad \phi_k^* := \min_{w \in \mathbb{R}^d} \phi_k(w). \tag{10}$$
In what follows, we use $\mathbb{E}$ without subscripts to denote the total expectation with respect to $z_0, \ldots, z_{k-1}$, that is, all randomness in the procedure up to iteration $k$. If $(\phi_k)$ maintains the local upper-bound property in expectation, then Eq. 9 guarantees

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \mathbb{E}\left[\lambda_k\right]\left(\phi_0(w^*) - f(w^*)\right). \tag{11}$$
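The chain of inequalities behind Eq. 11 is short and worth recording (using that $\phi_0$ is deterministic):

```latex
\mathbb{E}[f(w_k)]
  \le \mathbb{E}[\phi_k^*]                        % local upper-bound property (Eq. 10)
  \le \mathbb{E}[\phi_k(w^*)]                     % \phi_k^* minimizes \phi_k
  \le \mathbb{E}\left[(1 - \lambda_k) f(w^*) + \lambda_k \phi_0(w^*)\right]  % Eq. 9
  = f(w^*) + \mathbb{E}[\lambda_k]\left(\phi_0(w^*) - f(w^*)\right).
```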
This bound shows that generalized stochastic AGD converges in expectation at a rate controlled by $(\lambda_k)$. As a result, the bulk of our analysis focuses on establishing the local upper-bound property for a suitable choice of estimating sequences.
Following Nesterov [26], choose $\lambda_0 = 1$, $\phi_0(w) = f(w_0) + \frac{\gamma_0}{2}\|w - w_0\|_2^2$, and

$$\begin{aligned}
\lambda_{k+1} &= (1 - \alpha_k)\lambda_k, \\
\phi_{k+1}(w) &= (1 - \alpha_k)\phi_k(w) + \alpha_k\left[f(y_k) + \langle \nabla f(y_k, z_k), w - y_k\rangle + \frac{\mu}{2}\|w - y_k\|_2^2\right].
\end{aligned} \tag{12}$$

The initial curvature $\gamma_0$ is an input parameter; differentiating shows that $v_k$ is actually the minimizer of $\phi_k$, while $\gamma_k$ is its curvature (Lemma 14). Thus, the auxiliary sequences $(v_k)$ and $(\gamma_k)$ can be viewed as arising from our choice of local model. The next lemma proves these are valid estimating sequences when the step-size sequence is well-behaved. In what follows, we use the convention $1/\mu = \infty$ when $\mu = 0$ to cover the case of non-strongly convex functions.
Lemma 4.
Assume $f$ is $\mu$-strongly convex with $\gamma_0 > 0$ and $\eta_{\min} \leq \eta_k < 1/\mu$ almost surely for some $\eta_{\min} > 0$. Then $(\phi_k)$ and $(\lambda_k)$ given in Eq. 12 are estimating sequences.
The parameter $\eta_{\min} > 0$ is required for $\lambda_k$ to decrease sufficiently fast, while the upper-bound $\eta_k < 1/\mu$ is only necessary when $\mu > 0$. In this case, it guarantees $\alpha_k \in (0, 1)$. This choice of estimating sequences also satisfies the local upper-bound property in expectation when $\mathcal{T}$ matches the progress of fixed step-size SGD.
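The role of $\eta_{\min}$ can be seen by iterating the recursions for $\alpha_k$, $\gamma_k$, and $\lambda_k$ directly; a toy check with hypothetical constants (in the $\gamma_0 = \mu$ case, $\alpha_k = \sqrt{\mu \eta}$, so $\lambda_k$ decays geometrically):

```python
import numpy as np

# Iterate the recursions from Eqs. 5 and 12 with a fixed step-size eta
# and gamma_0 = mu (hypothetical constants for illustration).
mu, eta, n_iters = 0.1, 0.05, 200
gamma, lam = mu, 1.0
for k in range(n_iters):
    c = eta * (gamma - mu)
    alpha = (-c + np.sqrt(c ** 2 + 4 * eta * gamma)) / 2
    gamma = (1 - alpha) * gamma + alpha * mu
    lam *= 1 - alpha
    # Geometric decay claimed in Lemma 15 (Eq. 30).
    assert lam <= (1 - np.sqrt(mu * eta)) ** (k + 1) + 1e-12
```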
Proposition 5.
If $f$ is $L$-smooth and $\mu$-strongly convex with $\mu \geq 0$, $\eta_k = 1/(\rho L)$ almost surely for every $k$, and the primal update $\mathcal{T}$ satisfies the sufficient progress condition (Eq. 8), then $(\phi_k)$ has the local upper-bound property in expectation. That is, for every $k \geq 0$, $\mathbb{E}\left[f(w_k)\right] \leq \mathbb{E}\left[\phi_k^*\right]$.
Proposition 5 is our main theoretical contribution and immediately leads to two accelerated convergence rates for the generalized stochastic AGD scheme.
Theorem 6.
Suppose $f$ is $L$-smooth and $\mu$-strongly convex with $\mu > 0$, $\eta_k = 1/(\rho L)$ almost surely for every $k$, and the primal update satisfies the sufficient progress condition (Eq. 8). If $\gamma_0 = \mu$, then generalized stochastic AGD has the following rate of convergence:

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^{k}\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w_0 - w^*\|_2^2\right). \tag{13}$$
This linear rate of convergence requires knowledge of the strong convexity constant $\mu$ in order to set $\gamma_0$. However, we can still obtain a rate without knowing $\mu$ so long as the smoothness constant can be estimated. The following theorem is a generalization of Nesterov [26] to stochastic optimization under interpolation.
Theorem 7.
Suppose $f$ is $L$-smooth and $\mu$-strongly convex with $\mu \geq 0$, $\eta_k = 1/(\rho L)$ almost surely for every $k$, and the primal update satisfies the sufficient progress condition (Eq. 8). If $\gamma_0 = \rho L$, then generalized stochastic AGD has the following rate of convergence:

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \frac{4}{(k+2)^2}\left(f(w_0) - f(w^*) + \frac{\rho L}{2}\|w_0 - w^*\|_2^2\right). \tag{14}$$
3.1 Specializations
Theorems 6 and 7 provide accelerated guarantees for any stochastic primal update satisfying the sufficient progress condition. Assuming strong growth holds, we may specialize $\mathcal{T}$ to fixed step-size SGD with $\eta_k = 1/(\rho L)$ (sufficient progress is satisfied according to Lemma 1). This yields the standard version of stochastic AGD analyzed by Vaswani et al. [22]. However, instantiating our convergence guarantees shows a faster rate for stochastic AGD with an improved dependence on $\rho$.
Corollary 8.
If $f$ is $L$-smooth and $\mu$-strongly convex with $\mu > 0$, strong growth holds, and $\gamma_0 = \mu$, then stochastic AGD with $\eta_k = 1/(\rho L)$ converges as,

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^{k}\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w_0 - w^*\|_2^2\right). \tag{15}$$

Alternatively, if $\mu = 0$ and $\gamma_0 = \rho L$, then stochastic AGD satisfies,

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \frac{4}{(k+2)^2}\left(f(w_0) - f(w^*) + \frac{\rho L}{2}\|w_0 - w^*\|_2^2\right). \tag{16}$$
Corollary 8 shows that SGD can be accelerated to obtain the same convergence rate as deterministic AGD up to a factor of $\sqrt{\rho}$. We emphasize that some dependence on $\rho$ cannot be avoided; generic acceleration of SGD is not possible, even in the interpolation setting [43], so convergence rates must incorporate some measure of hardness due to stochasticity. In the next section, we compare our convergence guarantees against other results from the literature and give simple conditions under which acceleration is achieved.
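As a usage example, the sketch from earlier in this section can be run in the setting of Corollary 8 on the interpolating least-squares problem built above (illustrative only; it reuses `grad_i`, `rho`, `mu`, `L`, and `f` from the previous snippets):

```python
# Stochastic AGD (Eq. 6) with eta = 1/(rho * L) and gamma_0 = mu.
stoch_grad = lambda y, k: grad_i(y, rng.integers(n))
w_final = generalized_stochastic_agd(
    stoch_grad, sgd_step, w0=np.zeros(d),
    eta=1.0 / (rho * L), mu=mu, gamma0=mu, n_iters=500,
)
print(f"optimality gap f(w_k) - f(w*): {f(w_final):.3e}")  # f(w*) = 0 here
```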
An advantage of our analysis is that it also extends to more complex methods, such as SGD with full matrix preconditioning,

$$w_{k+1} = y_k - \eta_k P^{-1} \nabla f(y_k, z_k), \tag{17}$$

where $P$ is a positive-definite matrix and $(\eta_k)$ is a step-size sequence. We say that matrix strong growth holds in the norm $\|\cdot\|_{P^{-1}}$ with constant $\rho_P$ if

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_{P^{-1}}^2\right] \leq \rho_P\, \|\nabla f(w)\|_{P^{-1}}^2. \tag{18}$$

If interpolation holds, the stochastic functions $f(\cdot, z_k)$ are $L_P$-individually smooth in $\|\cdot\|_P$, and $f$ is $\mu_P$-strongly convex in $\|\cdot\|_P$, then $\rho_P \leq L_P/\mu_P$ (see Lemma 16) and the following convergence rate is obtained by combining Lemma 17 with our main convergence theorems for generalized stochastic AGD.
Corollary 9.
Assume interpolation holds and the stochastic functions $f(\cdot, z_k)$ are $L_P$-individually smooth in $\|\cdot\|_P$ for every $k$. Suppose $f$ is $L_P$-smooth and $\mu_P$-strongly convex in $\|\cdot\|_P$ with $\mu_P > 0$, matrix strong growth holds, and $\eta_k = 1/(\rho_P L_P)$. If $\gamma_0 = \mu_P$, then stochastic preconditioned AGD satisfies,

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \left(1 - \sqrt{\frac{\mu_P}{\rho_P L_P}}\right)^{k}\left(f(w_0) - f(w^*) + \frac{\mu_P}{2}\|w_0 - w^*\|_P^2\right). \tag{19}$$

Alternatively, if $\mu_P = 0$ and $\gamma_0 = \rho_P L_P$, then we obtain,

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \frac{4}{(k+2)^2}\left(f(w_0) - f(w^*) + \frac{\rho_P L_P}{2}\|w_0 - w^*\|_P^2\right). \tag{20}$$
Compared to standard stochastic AGD, preconditioning allows us to measure stochasticity in the norm induced by $P$. This is advantageous when $\rho_P L_P < \rho L$, meaning $f$ is smoother and/or the stochastic gradients are better conditioned in $\|\cdot\|_P$ than in the standard Euclidean norm. In such a setting, our theory suggests preconditioning is a simple way to further speed-up accelerated stochastic optimization.
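A preconditioned primal update drops into the earlier sketch with a one-line change; the choice of $P$ below is our own illustrative one (a damped diagonal of the Hessian `H` from the earlier snippet):

```python
# Preconditioned SGD primal update (Eq. 17), usable as the primal step
# T in generalized_stochastic_agd. Here P is a diagonal matrix, so the
# solve P^{-1} g reduces to element-wise division.
P_diag = np.diag(H) + 1e-8
precond_step = lambda y, eta, g: y - eta * g / P_diag
```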
4 Comparison to Existing Rates
Assumptions | SGD | S-AGD (Ours) | S-AGD (VSB) | MaSS
---|---|---|---|---
Strongly Convex | $\frac{\rho L}{\mu}\log(1/\epsilon)$ | $\sqrt{\frac{\rho L}{\mu}}\log(1/\epsilon)$ | $\rho\sqrt{\frac{L}{\mu}}\log(1/\epsilon)$ | $\sqrt{\kappa\tilde{\kappa}}\log(1/\epsilon)$
Convex | $\frac{\rho L}{\epsilon}$ | $\sqrt{\frac{\rho L}{\epsilon}}$ | $\rho\sqrt{\frac{L}{\epsilon}}$ | N/A

Table 1: Iteration complexities (up to constants) to reach an $\epsilon$-sub-optimal point for SGD, stochastic AGD under our analysis, stochastic AGD as analyzed by Vaswani et al. [22] (VSB), and MaSS.
Now we compare our rates to those existing in the literature. Throughout this section, we assume that the stochastic gradients are $L_{\max}$-individually smooth, interpolation holds, and the strong growth condition is satisfied. Recall that the strong growth constant is bounded above as $\rho \leq L_{\max}/\mu$ under these conditions. This worst-case bound on $\rho$ is critical to understanding when stochastic AGD does or does not accelerate.
Before proceeding, we introduce the notion of statistical condition number proposed by Jain et al. [25] and used by Liu and Belkin [24] to analyze their modified version of stochastic AGD (called MaSS) in the least-squares setting. Let $\mathcal{D}$ be a probability distribution over $\mathbb{R}^d \times \mathbb{R}$ and define the least-squares objective as

$$f(w) = \frac{1}{2}\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\left(\langle w, x\rangle - y\right)^2\right]. \tag{21}$$
Define the stochastic functions and gradients to be $f(w, z) = \frac{1}{2}\left(\langle w, x\rangle - y\right)^2$ and $\nabla f(w, z) = \left(\langle w, x\rangle - y\right)x$, where $z = (x, y) \sim \mathcal{D}$. These stochastic gradients are $L_{\max}$-individually smooth with $L_{\max} = \sup_x \|x\|_2^2$. Assuming we may interchange expectation and differentiation, the Hessian is $H = \mathbb{E}\left[x x^\top\right]$, and with $\mu = \lambda_{\min}(H)$ the condition number and statistical condition number are defined as,

$$\kappa := \frac{L_{\max}}{\mu}, \qquad \tilde{\kappa} := \inf\left\{c > 0 : \mathbb{E}\left[\|x\|_{H^{-1}}^2\, x x^\top\right] \preceq c\, H\right\}. \tag{22}$$

It is straightforward to prove $\tilde{\kappa} \leq \kappa$, similar to the strong growth constant.
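For an empirical distribution, both quantities in Eq. 22 reduce to eigenvalue computations. The following is a sketch under our reading of the definitions, computing $\tilde{\kappa}$ as the top eigenvalue of $H^{-1/2}\,\mathbb{E}[\|x\|_{H^{-1}}^2\, x x^\top]\,H^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))  # rows are samples of x

H = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(H)
mu = evals[0]
kappa = max(np.linalg.norm(x) ** 2 for x in X) / mu

# kappa_tilde: smallest c with E[||x||_{H^{-1}}^2 x x^T] <= c H.
H_inv = evecs @ np.diag(1 / evals) @ evecs.T
M = np.mean([(x @ H_inv @ x) * np.outer(x, x) for x in X], axis=0)
H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
kappa_tilde = np.linalg.eigvalsh(H_inv_sqrt @ M @ H_inv_sqrt)[-1]

print(f"kappa = {kappa:.2f}, kappa_tilde = {kappa_tilde:.2f}")
```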
Table 1 compares our iteration complexities for stochastic AGD to the complexity of SGD under interpolation, the analysis of stochastic AGD by Vaswani et al. [22], and the complexity of MaSS. Unlike Vaswani et al. [22], who use strong growth to show both the optimality gap and distance to a minimizer decrease in expectation at each iteration, our approach only requires the sufficient progress condition. This allows us to shrink the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$, which, since $\rho \leq L_{\max}/\mu$, can be larger than $\sqrt{L/\mu}$ in the worst case. Substituting this bound into the complexity estimate shows stochastic AGD requires at most $\mathcal{O}\left(\frac{\sqrt{L_{\max} L}}{\mu}\log(1/\epsilon)\right)$ iterations to reach $\epsilon$-sub-optimality. That is, stochastic AGD is always at least as fast as SGD and strictly faster whenever $\rho L > \mu$.
Our convergence rate for stochastic AGD also improves over that for SGD under the strong growth condition [35]. The improvement is by a factor of $\sqrt{\rho L/\mu}$, indicating that acceleration actually shrinks the dependence on the noise level. This is quite different from results in the general stochastic setting, where accelerated methods are typically more sensitive to noise [47]. For example, Schmidt et al. [36] show that the noise level must decrease faster for accelerated methods to converge compared to (proximal) SGD when interpolation does not hold. We conclude that interpolation seems to be key when proving fast rates for stochastic AGD.
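The improvement factor quoted above is a one-line comparison of the strongly convex complexities in Table 1:

```latex
\frac{\text{SGD complexity}}{\text{S-AGD complexity}}
  = \frac{\frac{\rho L}{\mu}\log(1/\epsilon)}{\sqrt{\frac{\rho L}{\mu}}\log(1/\epsilon)}
  = \sqrt{\frac{\rho L}{\mu}}.
```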
Comparing our results against MaSS is more difficult due to the dependence on the statistical condition number $\tilde{\kappa}$. To understand the difference in convergence rates, we consider two finite-sum example problems. In what follows, let $e_1, \ldots, e_d$ be the standard basis for $\mathbb{R}^d$.
Example 10.
Consider the least-squares problem setting in Eq. 21 with a sampling distribution chosen so that the smoothness constant $L$ is much smaller than the individual smoothness constant $L_{\max}$. A short calculation yields the constants $L$, $L_{\max}$, $\mu$, $\rho$, and $\tilde{\kappa}$, and substituting them into Table 1 gives the complexity bounds for stochastic AGD and MaSS.
As expected, stochastic AGD accelerates due to the gap between the smoothness and individual smoothness constants. In comparison, the complexity bound for MaSS is not accelerated and only matches that for SGD. The next example considers the opposite setting, where the bound $\rho \leq L_{\max}/\mu$ is tight and we do not expect stochastic AGD to be faster than SGD.
Example 11.
Consider the least-squares problem setting in Eq. 21 with $x$ distributed over scaled standard basis vectors so that the strong growth bound $\rho \leq L_{\max}/\mu$ is attained. It is again straightforward to compute the problem constants, and substituting them into Table 1 gives the complexity estimates for stochastic AGD and MaSS.
As $d \rightarrow \infty$, stochastic AGD attains the same complexity as SGD and is not accelerated. In comparison, the guarantee for MaSS always matches SGD and is slower than stochastic AGD for every finite $d$. We conclude that while both methods are restricted by lower bounds on stochastic acceleration, AGD can accelerate on some simple problems where MaSS fails.
5 Conclusion
We derive new convergence rates for a generalized version of stochastic Nesterov acceleration. Our approach extends the estimating sequences framework to the stochastic setting and shows that any update scheme making sufficient progress in expectation can be accelerated. As this sufficient progress condition is satisfied by SGD under the strong growth condition, our proof immediately specializes to give fast rates for stochastic AGD. Compared to previous work, our convergence bounds improve the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$. This improvement can be larger than the square-root of the condition number, shows stochastic AGD is at least as fast as SGD, and explains the strong empirical performance of stochastic acceleration shown by Vaswani et al. [22]. We also leverage our generalized algorithm to prove convergence guarantees for stochastic AGD with preconditioning. In particular, we show that preconditioning further speeds-up accelerated SGD when the stochastic gradients are small in the matrix norm induced by the preconditioner.
In addition to these results, the utility of our theoretical approach is further demonstrated by recent literature. Our core result for stochastic AGD (Proposition 5) was previously made available in a master’s thesis [48] and the proof technique has since been leveraged to give optimal bounds for stochastic acceleration in the general setting [49]. Yet, several questions remain unanswered. For example, the convergence of stochastic AGD under relaxed conditions, like weak growth or with a stochastic line-search [22], has not been proved. And while our generalized AGD scheme also suggests accelerating methods like stochastic proximal-point, establishing the expected progress condition appears difficult and new insights may be required. We leave these questions to future work.
Acknowledgments
We would like to thank Frederik Kunstner, Victor Sanchez-Portella, and Sharan Vaswani for many insightful discussions.
Funding
Aaron Mishkin was supported by NSF GRF Grant No. DGE-1656518 and by NSERC PGS D Grant No. PGSD3-547242-2020. Mert Pilanci was supported by the NSF under Grant ECCS-2037304 and Grant DMS-2134248, by an NSF CAREER Award under Grant CCF-2236829, by the U.S. Army Research Office Early Career Award under Grant W911NF-21-1-0242, and by the Stanford Precourt Institute. Mark Schmidt was partially supported by the Canada CIFAR AI Chair Program and NSERC Discovery Grant No. RGPIN-2022-03669.
Data Availability
This paper does not rely on any data, public or private. Derivations for all theoretical results are included in the appendix.
References
- Zhang et al. [2017] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net (2017)
- Zhang and Yin [2013] Zhang, H., Yin, W.: Gradient methods for convex minimization: Better rates under weaker conditions. arXiv preprint arXiv:1303.4645 (2013)
- Belkin et al. [2019a] Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (2019)
- Belkin et al. [2019b] Belkin, M., Rakhlin, A., Tsybakov, A.B.: Does data interpolation contradict statistical optimality? In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019. Proceedings of Machine Learning Research, vol. 89, pp. 1611–1619. PMLR (2019)
- Schapire et al. [1997] Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. In: Fisher, D.H. (ed.) Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), pp. 322–330. Morgan Kaufmann (1997)
- Liu et al. [2022] Liu, C., Zhu, L., Belkin, M.: Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis 59, 85–116 (2022)
- Oymak and Soltanolkotabi [2019] Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: Gradient descent takes the shortest path? In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Proceedings of Machine Learning Research, vol. 97, pp. 4951–4960. PMLR (2019)
- Belkin [2021] Belkin, M.: Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numer. 30, 203–248 (2021)
- Arora et al. [2018] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: Implicit acceleration by overparameterization. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 244–253. PMLR (2018)
- Ma et al. [2018] Ma, S., Bassily, R., Belkin, M.: The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 3331–3340. PMLR (2018)
- Zou and Gu [2019] Zou, D., Gu, Q.: An improved analysis of training over-parameterized deep neural networks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 2053–2062 (2019)
- Polyak [1987] Polyak, B.T.: Introduction to optimization (1987)
- Bassily et al. [2018] Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564 (2018)
- Vaswani et al. [2019] Vaswani, S., Mishkin, A., Laradji, I.H., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: Interpolation, line-search, and convergence rates. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 3727–3740 (2019)
- Defazio and Bottou [2019] Defazio, A., Bottou, L.: On the ineffectiveness of variance reduced optimization for deep learning. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019, pp. 1753–1763 (2019)
- Loizou et al. [2020] Loizou, N., Vaswani, S., Laradji, I., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint arXiv:2002.10542 (2020)
- Berrada et al. [2020] Berrada, L., Zisserman, A., Kumar, M.P.: Training neural networks for and by interpolation. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 799–809. PMLR (2020)
- D’Orazio et al. [2021] D’Orazio, R., Loizou, N., Laradji, I.H., Mitliagkas, I.: Stochastic mirror descent: Convergence analysis and adaptive variants via the mirror stochastic polyak stepsize. CoRR abs/2110.15412 (2021)
- Asi and Duchi [2019] Asi, H., Duchi, J.C.: Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization 29(3), 2257–2290 (2019)
- Arjevani et al. [2019] Arjevani, Y., Carmon, Y., Duchi, J.C., Foster, D.J., Srebro, N., Woodworth, B.: Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365 (2019)
- Nemirovsky and Nesterov [1985] Nemirovsky, A.S., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985)
- Vaswani et al. [2019] Vaswani, S., Bach, F., Schmidt, M.W.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019. Proceedings of Machine Learning Research, vol. 89, pp. 1195–1204. PMLR (2019)
- Nesterov [1983] Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence $\mathcal{O}(1/k^2)$. In: Doklady AN SSSR, vol. 269, pp. 543–547 (1983)
- Liu and Belkin [2020] Liu, C., Belkin, M.: Accelerating SGD with momentum for over-parameterized learning. In: 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net (2020)
- Jain et al. [2018] Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent for least squares regression. In: Bubeck, S., Perchet, V., Rigollet, P. (eds.) Conference On Learning Theory, COLT 2018. Proceedings of Machine Learning Research, vol. 75, pp. 545–604. PMLR (2018)
- Nesterov [2004] Nesterov, Y.E.: Introductory Lectures on Convex Optimization - A Basic Course. Applied Optimization, vol. 87. Springer (2004)
- Xiao et al. [2022] Xiao, T., Balasubramanian, K., Ghadimi, S.: Improved complexities for stochastic conditional gradient methods under interpolation-like conditions. Oper. Res. Lett. 50(2), 184–189 (2022)
- Vaswani et al. [2020] Vaswani, S., Kunstner, F., Laradji, I., Meng, S.Y., Schmidt, M., Lacoste-Julien, S.: Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835 (2020)
- Duchi et al. [2011] Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
- Meng et al. [2020] Meng, S.Y., Vaswani, S., Laradji, I.H., Schmidt, M., Lacoste-Julien, S.: Fast and furious convergence: Stochastic second order methods under interpolation. In: Chiappa, S., Calandra, R. (eds.) The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020. Proceedings of Machine Learning Research, vol. 108, pp. 1375–1386. PMLR (2020)
- Varre et al. [2021] Varre, A.V., Pillaud-Vivien, L., Flammarion, N.: Last iterate convergence of SGD for least-squares in the interpolation regime. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, pp. 21581–21591 (2021)
- Fang et al. [2021] Fang, H., Fan, Z., Friedlander, M.P.: Fast convergence of stochastic subgradient method under interpolation. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net (2021)
- Solodov [1998] Solodov, M.V.: Incremental gradient algorithms with stepsizes bounded away from zero. Comp. Opt. and Appl. 11(1), 23–35 (1998)
- Tseng [1998] Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8(2), 506–531 (1998)
- Schmidt and Le Roux [2013] Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)
- Schmidt et al. [2011] Schmidt, M., Le Roux, N., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24: NeurIPS 2011, pp. 1458–1466 (2011)
- d’Aspremont [2008] d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
- Devolder et al. [2014] Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming 146(1-2), 37–75 (2014)
- Cohen et al. [2018] Cohen, M., Diakonikolas, J., Orecchia, L.: On acceleration with noise-corrupted gradients. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Proceedings of Machine Learning Research, vol. 80, pp. 1018–1027. PMLR (2018)
- Chen et al. [2020] Chen, Y.-L., Na, S., Kolar, M.: Convergence analysis of accelerated stochastic gradient descent under the growth condition. arXiv preprint arXiv:2006.06782 (2020)
- Even et al. [2021] Even, M., Berthier, R., Bach, F.R., Flammarion, N., Gaillard, P., Hendrikx, H., Massoulié, L., Taylor, A.B.: A continuized view on nesterov acceleration for stochastic gradient descent and randomized gossip. CoRR abs/2106.07644 (2021)
- Valls et al. [2022] Valls, V., Wang, S., Jiang, Y., Tassiulas, L.: Accelerated convex optimization with stochastic gradients: Generalizing the strong-growth condition. arXiv preprint arXiv:2207.11833 (2022)
- Assran and Rabbat [2020] Assran, M., Rabbat, M.: On the convergence of Nesterov’s accelerated gradient method in stochastic settings. arXiv preprint arXiv:2002.12414 (2020)
- Nesterov [1988] Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody 24(3), 509–517 (1988)
- Armijo [1966] Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1), 1–3 (1966)
- Bertsekas [1997] Bertsekas, D.P.: Nonlinear programming. Journal of the Operational Research Society 48(3), 334–334 (1997)
- Honorio [2012] Honorio, J.: Convergence rates of biased stochastic optimization for learning sparse Ising models. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012. icml.cc / Omnipress (2012)
- Mishkin [2020] Mishkin, A.: Interpolation, growth conditions, and stochastic gradient descent. Master's thesis, University of British Columbia (2020)
- Vaswani et al. [2022] Vaswani, S., Dubois-Taine, B., Babanezhad, R.: Towards noise-adaptive, problem-adaptive (accelerated) stochastic gradient descent. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. Proceedings of Machine Learning Research, vol. 162, pp. 22015–22059. PMLR (2022)
Appendix A Assumptions: Proofs
Lemma 12.
Suppose $f$ is convex and $L$-smooth, the stochastic gradients are $L_{\max}$-individually smooth, and interpolation holds. Then the weak growth condition holds with constant $\alpha = L_{\max}/L$.
Proof.
First, recall that the weak growth condition [22] is given by

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_2^2\right] \leq 2\alpha L\left(f(w) - f(w^*)\right). \tag{25}$$

Now, starting from individual-smoothness,

$$f(v, z_k) \leq f(w, z_k) + \langle \nabla f(w, z_k), v - w\rangle + \frac{L_{\max}}{2}\|v - w\|_2^2,$$

and choosing $v = w - \frac{1}{L_{\max}}\nabla f(w, z_k)$, we obtain

$$f(v, z_k) \leq f(w, z_k) - \frac{1}{2 L_{\max}}\|\nabla f(w, z_k)\|_2^2.$$

Noting that $f(w^*, z_k) \leq f(v, z_k)$ by convexity of $f(\cdot, z_k)$ and interpolation and taking expectations with respect to $z_k$ gives the following:

$$f(w^*) \leq f(w) - \frac{1}{2 L_{\max}}\,\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_2^2\right].$$

Re-arranging this final equation gives the desired result as follows:

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_2^2\right] \leq 2 L_{\max}\left(f(w) - f(w^*)\right) = 2\,\frac{L_{\max}}{L}\,L\left(f(w) - f(w^*)\right).$$

We conclude that weak growth holds with $\alpha = L_{\max}/L$. ∎
Lemma 13.
Suppose $f$ is $L$-smooth and $\mu$-strongly convex, the stochastic gradients are $L_{\max}$-individually smooth, and interpolation holds. Then strong growth holds with constant $\rho \leq L_{\max}/\mu$.
Appendix B Convergence of Stochastic AGD: Proofs
See 1
Proof.
The proof is a modification of the standard descent lemma. Starting from $L$-smoothness of $f$, we obtain

$$f(w_{k+1}) \leq f(y_k) - \eta_k\langle \nabla f(y_k), \nabla f(y_k, z_k)\rangle + \frac{L\eta_k^2}{2}\|\nabla f(y_k, z_k)\|_2^2.$$

Taking expectations with respect to $z_k$ and using the strong growth condition,

$$\mathbb{E}_{z_k}\left[f(w_{k+1})\right] \leq f(y_k) - \eta_k\|\nabla f(y_k)\|_2^2 + \frac{\rho L\eta_k^2}{2}\|\nabla f(y_k)\|_2^2 = f(y_k) - \eta_k\left(1 - \frac{\rho L\eta_k}{2}\right)\|\nabla f(y_k)\|_2^2. \qquad ∎$$
Lemma 14.
The sequence $(\phi_k)$ in Eq. 12 satisfies the following canonical form:

$$\phi_k(w) = \phi_k^* + \frac{\gamma_k}{2}\|w - v_k\|_2^2, \tag{26}$$

where the curvature $\gamma_k$, minimizer $v_k$, and minimum value $\phi_k^*$ are given as follows:

$$\begin{aligned}
\gamma_{k+1} &= (1 - \alpha_k)\gamma_k + \alpha_k\mu, \\
v_{k+1} &= \frac{(1 - \alpha_k)\gamma_k v_k + \alpha_k\mu\, y_k - \alpha_k\nabla f(y_k, z_k)}{\gamma_{k+1}}, \\
\phi_{k+1}^* &= (1 - \alpha_k)\phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k, z_k)\|_2^2 \\
&\quad + \frac{\alpha_k(1 - \alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|_2^2 + \langle \nabla f(y_k, z_k), v_k - y_k\rangle\right).
\end{aligned} \tag{27}$$

Furthermore, the relationship between $\gamma_k$ and $\lambda_k$ is the following:

$$\gamma_k = \lambda_k\gamma_0 + (1 - \lambda_k)\mu. \tag{28}$$
Proof.
The canonical form follows from the same induction as Nesterov [26, Lemma 2.2.3], differentiating the recursion in Eq. 12 twice to identify the curvature and once to identify the minimizer; the stochastic gradient $\nabla f(y_k, z_k)$ simply replaces $\nabla f(y_k)$. The relationship in Eq. 28 follows by induction on the recursions for $\gamma_{k+1}$ and $\lambda_{k+1}$. ∎
Lemma 15.
Assume $\gamma_0 > 0$ and $\eta_k \geq \eta_{\min} > 0$ almost surely for all $k$. If $\mu > 0$ and $\gamma_0 = \mu$, then

$$\lambda_k \leq \left(1 - \sqrt{\mu\,\eta_{\min}}\right)^{k}. \tag{30}$$

Alternately, if $\mu \geq 0$, we obtain

$$\lambda_k \leq \frac{4}{\left(2 + k\sqrt{\gamma_0\,\eta_{\min}}\right)^2}. \tag{31}$$
Proof.
Case 1: $\gamma_0 = \mu$. Then $\gamma_k = \mu$ for all $k$ and

$$\alpha_k^2 = \eta_k\left[(1 - \alpha_k)\mu + \alpha_k\mu\right] = \eta_k\,\mu.$$

Thus, we deduce that

$$\alpha_k = \sqrt{\eta_k\,\mu},$$

and if $\eta_k \geq \eta_{\min}$, then $\alpha_k \geq \sqrt{\eta_{\min}\,\mu}$ and

$$\lambda_k = \prod_{i=0}^{k-1}\left(1 - \alpha_i\right) \leq \left(1 - \sqrt{\eta_{\min}\,\mu}\right)^{k}.$$

Case 2: $\gamma_0 > 0$ arbitrary. Using the update rule for $\gamma_{k+1}$ in Lemma 14, we find

$$\gamma_{k+1} = (1 - \alpha_k)\gamma_k + \alpha_k\mu \geq (1 - \alpha_k)\gamma_k.$$

Recursing on this inequality implies

$$\gamma_{k+1} \geq \gamma_0\prod_{i=0}^{k}\left(1 - \alpha_i\right) = \lambda_{k+1}\,\gamma_0.$$

Similarly, using $\alpha_k^2 = \eta_k\gamma_{k+1}$ and $\eta_k \geq \eta_{\min}$ yields

$$\alpha_k^2 \geq \eta_{\min}\,\gamma_{k+1}.$$

Finally, this implies

$$\alpha_k \geq \sqrt{\eta_{\min}\,\lambda_{k+1}\,\gamma_0}.$$

Moreover, this bound holds uniformly for all $k$. We have now exactly reached Eq. 2.2.11 of Nesterov [26, Lemma 2.2.4] with $1/L$ replaced by $\eta_{\min}$. Applying that Lemma with this modification, we obtain

$$\lambda_k \leq \frac{4}{\left(2 + k\sqrt{\eta_{\min}\,\gamma_0}\right)^2},$$

which completes the proof.
∎
See 4
Proof.
Recalling the update for $\alpha_k$, we obtain,

$$\alpha_k^2 = \eta_k\left[(1 - \alpha_k)\gamma_k + \alpha_k\mu\right].$$

Re-arranging terms gives the following quadratic equation,

$$\alpha_k^2 + \eta_k(\gamma_k - \mu)\,\alpha_k - \eta_k\gamma_k = 0.$$

Using the quadratic formula, we find

$$\alpha_k = \frac{-\eta_k(\gamma_k - \mu) + \sqrt{\eta_k^2(\gamma_k - \mu)^2 + 4\,\eta_k\gamma_k}}{2}.$$

As a result, $\alpha_k > 0$ if and only if $\eta_k\gamma_k > 0$, which holds since $\eta_k, \gamma_k > 0$. Similarly, evaluating the quadratic at $\alpha_k = 1$ shows $\alpha_k < 1$ if and only if

$$1 + \eta_k(\gamma_k - \mu) - \eta_k\gamma_k = 1 - \eta_k\,\mu > 0.$$

If $\mu = 0$, then this holds trivially. Otherwise, we require $\eta_k < 1/\mu$. Since this condition holds by assumption, we have $\alpha_k \in (0, 1)$ for all $k$.

Recall $\lambda_0 = 1$ and $\lambda_{k+1} = (1 - \alpha_k)\lambda_k$. Since $\alpha_k \in (0, 1)$, $\lambda_k \geq 0$ holds by induction. It remains to show that $\lambda_k$ tends to zero. Invoking Lemma 15 establishes this result almost surely.

Finally, we must show,

$$\phi_k(w) \leq (1 - \lambda_k)f(w) + \lambda_k\phi_0(w).$$

We proceed by induction. Since $\lambda_0 = 1$, we immediately obtain

$$\phi_0(w) \leq (1 - \lambda_0)f(w) + \lambda_0\phi_0(w) = \phi_0(w).$$

Now assume the condition holds at $k$; by the construction of $\phi_{k+1}$ in Eq. 12, where the bracketed term lower-bounds $f(w)$ (in expectation over $z_k$, by $\mu$-strong convexity and unbiasedness of the stochastic gradients), we have

$$\phi_{k+1}(w) \leq (1 - \alpha_k)\left[(1 - \lambda_k)f(w) + \lambda_k\phi_0(w)\right] + \alpha_k f(w) = (1 - \lambda_{k+1})f(w) + \lambda_{k+1}\phi_0(w).$$

This completes the proof. ∎
See 5
Proof.
The choice of $\phi_0$ ensures $f(w_0) \leq \phi_0^* = f(w_0)$ deterministically, which is the base case for induction. The inductive assumption is $\mathbb{E}[f(w_k)] \leq \mathbb{E}[\phi_k^*]$; let us use this to show $\mathbb{E}[f(w_{k+1})] \leq \mathbb{E}[\phi_{k+1}^*]$.

Lemma 14 implies that the explicit form of the minimum value is

$$\phi_{k+1}^* = (1 - \alpha_k)\phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|\nabla f(y_k, z_k)\|_2^2 + \frac{\alpha_k(1 - \alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|_2^2 + \langle \nabla f(y_k, z_k), v_k - y_k\rangle\right).$$

Taking expectations and using linearity of expectation, the inductive assumption lower-bounds the contribution of $\phi_k^*$ by $\mathbb{E}[f(w_k)]$. Convexity of $f$ implies $f(w_k) \geq f(y_k) + \langle \nabla f(y_k), w_k - y_k\rangle$, and recalling $\alpha_k^2 = \eta_k\gamma_{k+1}$ from the update scheme simplifies the coefficient of the squared gradient norm. The sufficient progress condition (Eq. 8) then bounds $\mathbb{E}[f(w_{k+1})]$ by the resulting expression. The remainder of the proof is largely unchanged from the deterministic case. The definition of $y_k$ gives $w_k - y_k + \frac{\alpha_k\gamma_k}{\gamma_{k+1}}(v_k - y_k) = 0$, which we use to cancel the remaining linear terms. Noting that $\frac{\mu}{2}\|y_k - v_k\|_2^2 \geq 0$ gives $\mathbb{E}[\phi_{k+1}^*] \geq \mathbb{E}[f(w_{k+1})]$, since $\eta_k = 1/(\rho L)$. We conclude that $\mathbb{E}[f(w_k)] \leq \mathbb{E}[\phi_k^*]$ holds for all $k$ by induction. ∎
See 6
Proof.
Under the conditions of the theorem, Proposition 5 holds and Eq. 11 implies

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \mathbb{E}\left[\lambda_k\right]\left(f(w_0) - f(w^*) + \frac{\gamma_0}{2}\|w_0 - w^*\|_2^2\right).$$

Since $\gamma_0 = \mu$ and $\eta_k = 1/(\rho L)$, Lemma 15 implies $\lambda_k \leq \left(1 - \sqrt{\mu/(\rho L)}\right)^k$ and we obtain

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \left(1 - \sqrt{\frac{\mu}{\rho L}}\right)^{k}\left(f(w_0) - f(w^*) + \frac{\mu}{2}\|w_0 - w^*\|_2^2\right),$$

which completes the proof. ∎
See 7
Proof.
Under the conditions of the theorem, Proposition 5 holds and Eq. 11 implies

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \mathbb{E}\left[\lambda_k\right]\left(f(w_0) - f(w^*) + \frac{\gamma_0}{2}\|w_0 - w^*\|_2^2\right).$$

Since $\gamma_0 = \rho L$ and $\eta_k = 1/(\rho L)$, Lemma 15 implies $\lambda_k \leq 4/(k+2)^2$ and we obtain

$$\mathbb{E}\left[f(w_k)\right] - f(w^*) \leq \frac{4}{(k+2)^2}\left(f(w_0) - f(w^*) + \frac{\rho L}{2}\|w_0 - w^*\|_2^2\right),$$

which completes the proof. ∎
B.1 Specializations: Proofs
Lemma 16.
Let $z_k$ be independent of $w$. Suppose $f$ is convex, the stochastic gradients are $L_P$-individually smooth with respect to the matrix norm $\|\cdot\|_P$, and interpolation holds. If $f$ is also $\mu_P$-strongly convex with respect to $\|\cdot\|_P$, then

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_{P^{-1}}^2\right] \leq \frac{L_P}{\mu_P}\,\|\nabla f(w)\|_{P^{-1}}^2. \tag{32}$$

That is, strong growth holds in $\|\cdot\|_{P^{-1}}$ with constant $\rho_P \leq L_P/\mu_P$.
Proof.
Starting from individual-smoothness,

$$f(v, z_k) \leq f(w, z_k) + \langle \nabla f(w, z_k), v - w\rangle + \frac{L_P}{2}\|v - w\|_P^2,$$

and choosing $v = w - \frac{1}{L_P}P^{-1}\nabla f(w, z_k)$, we obtain

$$f(v, z_k) \leq f(w, z_k) - \frac{1}{2 L_P}\|\nabla f(w, z_k)\|_{P^{-1}}^2.$$

Noting that $f(w^*, z_k) \leq f(v, z_k)$ by convexity of $f(\cdot, z_k)$ and interpolation and taking expectations with respect to $z_k$ gives the following:

$$f(w^*) \leq f(w) - \frac{1}{2 L_P}\,\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_{P^{-1}}^2\right].$$

Re-arranging this final equation and using $\mu_P$-strong convexity gives the desired result,

$$\mathbb{E}_{z_k}\left[\|\nabla f(w, z_k)\|_{P^{-1}}^2\right] \leq 2 L_P\left(f(w) - f(w^*)\right) \leq \frac{L_P}{\mu_P}\,\|\nabla f(w)\|_{P^{-1}}^2,$$

where the last step uses the bound $f(w) - f(w^*) \leq \frac{1}{2\mu_P}\|\nabla f(w)\|_{P^{-1}}^2$ implied by $\mu_P$-strong convexity in $\|\cdot\|_P$. ∎
Lemma 17.
Let $w_{k+1} = y_k - \eta_k P^{-1}\nabla f(y_k, z_k)$. Assume $f$ is both $L_P$-smooth and satisfies strong growth with constant $\rho_P$ in the matrix norm $\|\cdot\|_P$. If $z_k$ is independent of $y_k$ and $\eta_k = 1/(\rho_P L_P)$, then preconditioned SGD satisfies,

$$\mathbb{E}_{z_k}\left[f(w_{k+1})\right] \leq f(y_k) - \frac{1}{2\rho_P L_P}\,\|\nabla f(y_k)\|_{P^{-1}}^2.$$
Proof.
Starting from smoothness in $\|\cdot\|_P$,

$$f(w_{k+1}) \leq f(y_k) - \eta_k\langle \nabla f(y_k), P^{-1}\nabla f(y_k, z_k)\rangle + \frac{L_P\eta_k^2}{2}\|\nabla f(y_k, z_k)\|_{P^{-1}}^2.$$

Taking expectations with respect to $z_k$ and using matrix strong growth,

$$\mathbb{E}_{z_k}\left[f(w_{k+1})\right] \leq f(y_k) - \eta_k\left(1 - \frac{\rho_P L_P\eta_k}{2}\right)\|\nabla f(y_k)\|_{P^{-1}}^2 = f(y_k) - \frac{1}{2\rho_P L_P}\|\nabla f(y_k)\|_{P^{-1}}^2. \qquad ∎$$
Appendix C Comparison to Existing Rates: Proofs
See 10
Proof.
It is easy to see that the problem satisfies interpolation. Taking expectations yields the Hessian and hence $L$ and $\mu$, while computing the second moment of the stochastic gradients yields the strong growth constant $\rho$; note that $\rho$ is tight with the bound $\rho \leq L_{\max}/\mu$. Now we compute the values of $\kappa$ and $\tilde{\kappa}$ directly from the Hessian. Substituting these values into the complexity bounds for stochastic AGD and MaSS completes the example. ∎
See 11
Proof.
Again, it is easy to see that the problem satisfies interpolation. Taking expectations yields the Hessian and hence $L$ and $\mu$, and the strong growth constant follows from the second moment of the stochastic gradients; it is easy to see by direct calculation that the bound $\rho \leq L_{\max}/\mu$ is tight. Computing the Hessian then yields $\kappa$ and $\tilde{\kappa}$. Substituting these values into the complexity bounds for stochastic AGD and MaSS completes the example. ∎