Does Momentum Help in Stochastic Optimization?
A Sample Complexity Analysis.
Abstract
Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG) are popular momentum methods in stochastic optimization. While the benefits of such acceleration ideas in deterministic settings are well understood, their advantages in stochastic optimization are still unclear. In fact, in some specific instances, it is known that momentum does not help in the sample complexity sense. Our work shows that a similar outcome holds for the whole of quadratic optimization. Specifically, we obtain a lower bound on the sample complexity of SHB and ASG for this family and show that the same bound can be achieved by vanilla SGD. We note that there exist results claiming the superiority of momentum-based methods in quadratic optimization, but these are based on one-sided or flawed analyses.
1 Introduction
In deterministic convex optimization (when one has access to exact gradients), Gradient Descent (GD) is a popular optimization algorithm Cauchy, (1847). However, in practice, exact gradients are not available and one has to rely on noisy observations. This brings forth the idea of Stochastic Gradient Descent (SGD). For GD, two classic momentum ideas, Heavy Ball (HB) Polyak, (1964, 1987); Qian, (1999) and Nesterov’s Accelerated Gradient (NAG) Nesterov, (1983, 2014, 2005), are used to speed up its convergence. Naturally, these momentum-based methods and their variants have also gained significant interest in stochastic settings Sutskever et al., (2013); Nitanda, (2014); Hu et al., (2009). Our work shows that the stochastic variants of HB and NAG, i.e., Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG), are no better than vanilla SGD for the whole of quadratic optimization. Specifically, we show that the sample complexities (i.e., the number of iterations required to reach an ε-neighborhood of the solution) of SHB, ASG, and SGD are all of the same order.
More formally, in (deterministic) quadratic optimization with condition number κ, GD converges to an ε-optimal solution in O(κ log(1/ε)) iterations, while both HB and NAG need only O(√κ log(1/ε)) steps. (Throughout this work, we only consider algorithms with constant step-sizes; these are popular in practice and ensure faster convergence.) A pertinent question then is “Does one get such advantages even in stochastic settings?” The current literature is, however, divided on whether SHB and ASG are better than SGD.
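The deterministic gap is easy to reproduce numerically. The sketch below is illustrative (the quadratic f(x) = (μx₁² + Lx₂²)/2 with κ = 100, the step-sizes, and the classical Polyak parameters are our own choices, not from this paper): it counts the iterations GD and HB need to bring the error below 10⁻⁶.

```python
import math

mu, L = 1.0, 100.0              # curvature extremes; condition number kappa = L/mu = 100
kappa = L / mu
tol = 1e-6

def grad(x):
    # gradient of f(x) = 0.5*(mu*x1^2 + L*x2^2)
    return (mu * x[0], L * x[1])

def norm(x):
    return math.sqrt(x[0] ** 2 + x[1] ** 2)

def run_gd(alpha, max_iter=100000):
    x = (1.0, 1.0)
    for n in range(1, max_iter + 1):
        g = grad(x)
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
        if norm(x) <= tol:
            return n
    return max_iter

def run_hb(alpha, beta, max_iter=100000):
    # Polyak heavy ball: x_{n+1} = x_n - alpha*grad + beta*(x_n - x_{n-1})
    x, x_prev = (1.0, 1.0), (1.0, 1.0)
    for n in range(1, max_iter + 1):
        g = grad(x)
        x, x_prev = (
            x[0] - alpha * g[0] + beta * (x[0] - x_prev[0]),
            x[1] - alpha * g[1] + beta * (x[1] - x_prev[1]),
        ), x
        if norm(x) <= tol:
            return n
    return max_iter

gd_iters = run_gd(alpha=1.0 / L)
# classical Polyak parameters for quadratics
alpha_hb = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
beta_hb = ((math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)) ** 2
hb_iters = run_hb(alpha_hb, beta_hb)
print(gd_iters, hb_iters)   # HB needs on the order of sqrt(kappa) fewer iterations
```

With exact gradients the iteration counts differ by roughly a √κ factor, which is exactly the advantage whose stochastic counterpart is in question.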
Some recent results Loizou and Richtárik, (2020); Mou et al., (2020); Assran and Rabbat, (2020); Can et al., (2019) claim that these momentum methods are better than SGD in quadratic or least-squares settings. However, Loizou and Richtárik, (2020) needs a strong assumption on the noise, one that (Kidambi et al., 2018, Section 6) shows is information-theoretically impossible to satisfy even in the simple least squares regression problem. The other results are either based on a one-sided analysis Can et al., (2019) (which only considers the bias while ignoring the variance) or have a major flaw Mou et al., (2020); Assran and Rabbat, (2020); see Appendices A.1–A.3 for details.
On the other hand, there are also a few recent negative results on these momentum methods. For general convex optimization, Yuan et al., (2016) shows that SHB and ASG are equivalent to SGD with a rescaled step-size. However, this result requires that the step-size be sufficiently small and the momentum parameter be bounded away from 1. In Kidambi et al., (2018); Liu and Belkin, (2020), for one specific instance of least squares regression with vanishing noise, it is shown that the performance of SHB and ASG cannot be better than that of SGD. Finally, Zhang et al., (2019) considers SHB for quadratic objectives in the same noisy setting as our work and provides upper bounds on the rate at which the objective function decreases. They also argue that rescaled SGD performs as well as SHB and demonstrate this empirically, but fall short of rigorously establishing a lower bound that supports their claim.
The current literature can thus be summarized as follows.
Research Gap: Existing works on SHB and ASG fall into two groups: i.) positive, where the results claim advantages of these methods over SGD, and ii.) negative, where the results claim the opposite. Results in the positive group either have a one-sided Can et al., (2019) or a flawed analysis Mou et al., (2020); Assran and Rabbat, (2020), while the ones in the negative group apply to special cases (e.g., Yuan et al., (2016) requires sufficiently small step-sizes and momentum parameters bounded away from 1, while Kidambi et al., (2018); Liu and Belkin, (2020) only apply to one instance of a least squares problem with vanishing noise).
Key Contribution: Our work belongs to the negative group and extends the claims in Kidambi et al., (2018); Liu and Belkin, (2020) to more general settings. Specifically, for all quadratic optimization problems with persistent noise (noise variance bounded away from zero) and any small enough ε > 0, we show that the number of iterations needed by SHB and ASG to find an ε-optimal solution is of the same order as that of SGD. More technically, we obtain a lower bound on the sample complexities of SHB and ASG (Theorem 2.1) and show that these are of the same order as the corresponding upper bound for SGD (Theorem 2.2). We emphasize that our result applies for all positive step-sizes and all momentum parameters in the range [0, 1), unlike in Yuan et al., (2016). Moreover, our proof techniques are also significantly different from those used in existing lower bounds such as Kidambi et al., (2018); Liu and Belkin, (2020). This is because, under persistent noise, the expected error contains an additional term that cannot be accounted for by their analyses (see Remark 3).
2 Main Results
We state our main results here, which provide lower and upper bounds on the sample complexities of SHB and ASG. We use these bounds, along with those of SGD, to show that all these methods need a similar effort to find an ε-optimal solution. We emphasize that our results apply to the whole family of quadratic optimization.
To provide a unified analysis, we look at a generic update rule that includes SHB, ASG, and SGD as special cases. In particular, we consider the linear stochastic approximation with momentum (LSA-M) iterate given by

y_n = x_n + βη(x_n − x_{n−1}),   (1)

x_{n+1} = x_n + η(x_n − x_{n−1}) + α(b − A y_n + M_{n+1}),   (2)

where α > 0 is the step-size, η ∈ [0, 1) is the momentum parameter, β ∈ {0, 1}, and (M_n) is noise. Note that we do not presume A is symmetric, and this is what makes the above update a stochastic approximation. Also, when A is symmetric, LSA-M subsumes as special cases SGD (let η = 0 in (2)), SHB (let β = 0 in (2)), and ASG (let β = 1 in (2)).
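Our reading of the update (1)–(2) can be sketched in a few lines of code; here α denotes the step-size, η the momentum parameter, and β the selector distinguishing SHB (β = 0) from ASG (β = 1), with η = 0 recovering SGD. The scalar instance a = 1, b = 2 is an illustrative choice of ours; in the noiseless case all three variants drive the iterate to x* = b/a.

```python
def lsa_m_step(x, x_prev, a, b, alpha, eta, beta, noise=0.0):
    """One LSA-M step: y_n = x_n + beta*eta*(x_n - x_{n-1});
    x_{n+1} = x_n + eta*(x_n - x_{n-1}) + alpha*(b - a*y_n + noise)."""
    y = x + beta * eta * (x - x_prev)
    return x + eta * (x - x_prev) + alpha * (b - a * y + noise)

def run(alpha, eta, beta, n_iters=500):
    x_prev = x = 0.0
    for _ in range(n_iters):
        x, x_prev = lsa_m_step(x, x_prev, a=1.0, b=2.0,
                               alpha=alpha, eta=eta, beta=beta), x
    return x

# noiseless runs: all three special cases should approach x* = b/a = 2
x_sgd = run(alpha=0.5, eta=0.0, beta=0.0)   # SGD: eta = 0
x_shb = run(alpha=0.1, eta=0.5, beta=0.0)   # SHB: beta = 0
x_asg = run(alpha=0.1, eta=0.5, beta=1.0)   # ASG: beta = 1
print(x_sgd, x_shb, x_asg)
```

The stochastic versions are obtained by passing a random `noise` term at each step; all later results concern exactly that regime.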
With regard to A and the noise sequence (M_n), we make the following assumptions.
Assumption 1.
(Driving matrix feature) A is diagonalizable and all its eigenvalues are real and positive.
Assumption 1 holds trivially when A is a symmetric positive definite matrix. Consequently, our results apply to all members of the quadratic optimization family.
Also, under the above assumption, one would expect the iterates in (2) to go to a neighborhood of x* = A⁻¹b.
We next state two assumptions on the noise; the first is used in Theorem 2.1, while the other is used in Theorem 2.2 and Corollary 2.3.
Assumption 2.a.
(Noise attributes for Theorem 2.1) (M_n) is a martingale difference sequence w.r.t. the filtration (F_n), where F_n = σ(x_0, M_1, …, M_n). Further, there exists a constant C > 0 such that E[M_{n+1} M_{n+1}ᵀ | F_n] ⪰ C I a.s. for all n ≥ 0.
The notation B ⪰ D for square matrices B and D is used above to imply that B − D is positive semi-definite. Also, the symbol I refers to the identity matrix.
Assumption 2.b.
(Noise attributes for Theorem 2.2) (M_n) is a martingale difference sequence w.r.t. the filtration (F_n), where F_n = σ(x_0, M_1, …, M_n). Further, there exists a constant K > 0 such that E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²) a.s. for all n ≥ 0.
Assumptions 2.a and 2.b are standard Mandt et al., (2017); Jastrzębski et al., (2018); Cheng et al., (2020); Borkar, (2008). The first of these holds if and only if all the eigenvalues of the conditional noise covariance are bounded from below by a positive constant, i.e., the noise is persistent (or non-vanishing) in all directions. On the other hand, Assumption 2.b requires that the trace of this conditional covariance be bounded from above. This bound can scale with the norm of the iterate and need not vanish near the solution.
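A consequence of Assumption 2.a worth keeping in mind: with a constant step-size, the expected error of SGD cannot decay to zero but settles at a noise floor of order α. A minimal 1-D sanity check (λ, α, and σ² below are illustrative choices of ours, with noise of variance exactly σ²) iterates the exact mean-squared-error recursion E[x̃_{n+1}²] = (1 − αλ)² E[x̃_n²] + α²σ², whose cross term vanishes by the martingale difference property.

```python
lam, alpha, sigma2 = 1.0, 0.1, 1.0   # curvature, step-size, noise variance (illustrative)

s = 1.0                              # E[x_tilde_0^2], the initial squared error
for _ in range(2000):
    # exact MSE recursion for 1-D SGD with additive noise of variance sigma2
    s = (1.0 - alpha * lam) ** 2 * s + alpha ** 2 * sigma2

# fixed point of the recursion: the stationary noise floor
floor = alpha ** 2 * sigma2 / (1.0 - (1.0 - alpha * lam) ** 2)
print(s, floor)
```

Driving the error below a target ε therefore forces the step-size to shrink with ε, which is consistent with the role persistent noise plays in the lower bound of Theorem 2.1.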
Next, we define the sample complexity, which quantifies the effort required by LSA-M to obtain an ε-close solution to x*.
Definition 1.
(Sample Complexity). The sample complexity of (2) is the minimum number of iterations N_ε such that the expected error satisfies E‖x_n − x*‖ ≤ ε for all n ≥ N_ε.
| Method | β | η | α |
|---|---|---|---|
| SGD | – | 0 | |
| SHB | 0 | | |
| ASG | 1 | | |
To enable easy comparison between different algorithms, we shall look at the order of their sample complexities. Towards that, we shall use the notation f = Θ(g) to imply that there exist constants c₁, c₂ > 0 (independent of ε) such that c₁ g ≤ f ≤ c₂ g. The notation Θ̃ has a similar meaning, but hides the dependence on logarithmic terms as well.
Theorem 2.1 (Lower bound on sample complexity).
Remark 1.
As stated below (2), LSA-M includes SHB and Nesterov’s ASG method as special cases and, hence, the above result directly applies to them. In fact, this is the first lower bound on SHB and ASG’s sample complexities in quadratic optimization with persistent noise.
Remark 2.
The value of N_ε above is obtained by optimizing over all possible values of the step-size and momentum parameters and hence does not depend on their specific choices. This implies that there is no “smart” combination of α and η that will give a better sample complexity (for ε small enough).
Remark 3.
The lower bounds in Kidambi et al., (2018) and Liu and Belkin, (2020) are obtained by viewing the expected error in the SHB and ASG iterates for least squares as update rules of the form e_{n+1} = B e_n for some matrix B ((Kidambi et al., 2018, Appendix A, p. 16) and (Liu and Belkin, 2020, Appendix C, p. 12)). In particular, they obtain bounds on the eigenvalues of B to get the desired claim. In contrast, the error relations for the SHB and ASG methods in our setup (quadratic optimization with persistent noise) have the form z_{n+1} = Ā z_n + α ē M_{n+1} for some matrix Ā and vector ē (cf. (3)). This forces us to develop a new proof technique that jointly looks at the resulting bias and variance terms and shows that at least one of them remains large for the choice of N_ε given in Theorem 2.1.
We next state our upper bound on the sample complexity in Theorem 2.2 and Corollary 2.3. Similar bounds already exist in the literature when A is assumed to be symmetric and the noise is assumed to be iid with bounded variance (Can et al., (2019); Zhang et al., (2019)). Here, we show that a similar upper bound holds in more general settings: i.) A is not symmetric but satisfies Assumption 1, and ii.) the noise is a martingale difference sequence satisfying Assumption 2.b.
Theorem 2.2.
Consider the LSA-M iterate and let Assumptions 1 and 2.b hold. Then, for every ε > 0, there exists a choice of α, η, and β in LSA-M (see Table 1 for exact values) such that the expected error E‖x_n − x*‖ ≤ ε for all n ≥ N_ε, where,
(i) when η = 0 (SGD), and
(ii) when η ∈ (0, 1) (SHB and ASG).
From Table 1, we see that the step-size α for each case is a minimum of three terms. The first term arises due to the unbounded noise (Assumption 2.b), the second due to the target neighborhood, and the third from the optimal choice of step-size in the deterministic case. Since the bound provided in Theorem 2.2 is in terms of α, the minimum of the three terms dictates the sample complexity. If ε is small enough that the second term is the minimum, we obtain the following bound:
Corollary 2.3 (Upper bound on sample complexity).
Remark 4.
From Corollary 2.3, we see that for small enough ε, the upper bounds on the sample complexity of SGD, SHB, and ASG match their lower bound in Theorem 2.1. In particular, since an upper bound on the sample complexity of SGD matches a lower bound on SHB and ASG, these methods cannot outperform SGD from a sample complexity perspective.
Remark 5.
When the noise is assumed to be bounded, the first term in the choice of α in Table 1 vanishes for all three parameter choices. Under such an assumption, if the target neighbourhood is large enough or the noise variance is small enough that the third term in the choice of α is the minimum, then the sample complexity of both SHB and ASG is better than that of SGD. We emphasize that such improvements are lost when the noise variance is large or the neighbourhood under consideration is small.
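Remark 5’s dichotomy can be probed numerically in one dimension. The sketch below (λ, σ², η, α are illustrative choices of ours) computes the exact stationary mean-squared error of SHB by iterating the 2×2 covariance (Lyapunov) recursion of its companion form, and compares it with that of SGD run with the rescaled step-size α/(1−η): the two noise floors agree up to a constant factor close to 1.

```python
lam, sigma2 = 1.0, 1.0
alpha, eta = 0.01, 0.9                    # SHB parameters (illustrative)
alpha_r = alpha / (1.0 - eta)             # rescaled SGD step-size

# SHB in companion form: z_n = (x_n, x_{n-1}),  z_{n+1} = P z_n + alpha*(M, 0)
c = 1.0 + eta - alpha * lam
P = [[c, -eta], [1.0, 0.0]]

def step_cov(S):
    # S <- P S P^T + alpha^2*sigma2 * e1 e1^T   (2x2, pure python)
    PS = [[sum(P[i][k] * S[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    PSPt = [[sum(PS[i][k] * P[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
    PSPt[0][0] += alpha ** 2 * sigma2
    return PSPt

S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(20000):                     # iterate to (numerical) stationarity
    S = step_cov(S)
v_shb = S[0][0]                            # stationary E[x_n^2] under SHB

# SGD with the rescaled step: fixed point of the scalar MSE recursion
v_sgd = alpha_r ** 2 * sigma2 / (1.0 - (1.0 - alpha_r * lam) ** 2)
print(v_shb, v_sgd, v_shb / v_sgd)
```

So, once SGD’s step-size is rescaled, momentum buys no order-of-magnitude reduction in the stationary error, in line with the matching bounds above.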
3 Proof Outline
In this section, we provide the proof outlines for Theorem 2.1 and Theorem 2.2. Section 3.1 provides an outline of all the key steps to prove the lower bound on the sample complexity. The proofs of all the intermediate lemmas are deferred to Appendix B. In Section 3.2, we provide a sketch of the proof of Theorem 2.2. The detailed proof can be found in Appendix C.
3.1 Proof Outline for Theorem 2.1
For simplicity, let d = 1 and x* = 0, so that A = λ for some λ > 0. The general multivariate case is handled subsequently. We begin by defining the transformed iterates z_n = (x_n, x_{n−1})ᵀ and rewriting (2) as:

z_{n+1} = Ā z_n + α ē M_{n+1},   (3)

where Ā = [[1 + η − αλ(1 + βη), −η(1 − αβλ)], [1, 0]] and ē = (1, 0)ᵀ. For ease of exposition, we first consider the case β = 0, corresponding to SHB, and take up the general β later. The objective here is to find an N_ε such that the error E‖z_n‖² is lower bounded by ε² for all n ≤ N_ε. Towards this, we decompose the error expression as follows:

E‖z_n‖² = ‖Āⁿ z_0‖² + α² Σ_{k=1}^{n} E[M_k²] ‖Ā^{n−k} ē‖² ≥ ‖Āⁿ z_0‖² + α² C Σ_{k=0}^{n−1} ‖Ā^k ē‖²,   (4)

where the second equality follows since (M_n) is a martingale difference sequence and the inequality follows from Assumption 2.a. The first term here corresponds to the bias and the second term corresponds to the variance.
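The equality step of the decomposition in (4) can be checked numerically: writing the SHB recursion (β = 0) in companion form z_n = (x_n, x_{n−1})ᵀ, the exact covariance recursion and the bias-plus-variance sum coincide for iid noise of variance σ². All parameter values below are illustrative choices of ours.

```python
alpha, eta, lam, sigma2 = 0.1, 0.5, 1.0, 1.0   # illustrative SHB parameters (beta = 0)
c = 1.0 + eta - alpha * lam
A_bar = [[c, -eta], [1.0, 0.0]]                 # companion matrix of the SHB recursion
z0 = [1.0, 1.0]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

n = 50
# left-hand side: E||z_n||^2 via the exact recursion S <- A S A^T + alpha^2*sigma2*e1 e1^T
S = [[z0[0] * z0[0], z0[0] * z0[1]], [z0[1] * z0[0], z0[1] * z0[1]]]
for _ in range(n):
    AS = mat_mul(A_bar, S)
    S = mat_mul(AS, [[A_bar[0][0], A_bar[1][0]], [A_bar[0][1], A_bar[1][1]]])
    S[0][0] += alpha ** 2 * sigma2
lhs = S[0][0] + S[1][1]

# right-hand side: bias ||A^n z0||^2 plus variance alpha^2*sigma2*sum_k ||A^k e1||^2
b, e, var = z0[:], [1.0, 0.0], 0.0
for k in range(n):
    var += alpha ** 2 * sigma2 * (e[0] ** 2 + e[1] ** 2)
    b = mat_vec(A_bar, b)
    e = mat_vec(A_bar, e)
bias = b[0] ** 2 + b[1] ** 2
print(lhs, bias + var)   # the two sides of the decomposition agree
```

Under Assumption 2.a the variance term can only be larger than the iid value used here, which is exactly how the lower bound exploits persistent noise.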
Let N_ε be as specified in Theorem 2.1. We will show that for this choice of N_ε, the expected error is at least ε² for all α and η. From (3.1):
If (α, η) is such that the bias term alone is at least ε², then the claim immediately follows for this choice of α and η. We now consider the case where α and η are such that the bias term is smaller than ε². Here, we will show that for this choice of α and η, the variance term will necessarily be greater than ε². Let μ₁ and μ₂ be the eigenvalues of Ā.
Lemma 1.
For all , , if for small enough then the variance term is bounded by
Note that, from the above lemma, the variance bound improves (i.e., decreases) when decreases. Using this, we show that the variance is ultimately bounded below by a linearly decreasing function of in the following lemma.
Lemma 2.
For all , , if for small enough , then
Recall that and . Using this, we get the following lower bound on :
Lemma 3.
If and , then .
This completes the proof for the univariate case when .
We now consider the case where . We notice that in the univariate case, LSA-M with parameters and is equivalent to LSA-M with parameters and . If , regardless of choice of , we have from (3.1):
Thus, when , the expected error will always be larger than . When , then for all . Since this is equivalent to LSA-M with and , the result follows from the lower bound on univariate LSA-M with . This completes the proof for the univariate case for all , .
Finally, we consider the multivariate case, i.e., when . For the multivariate case, we study the transformed iterates , where . Since we get
where
Let and , where each and . We will show that the update rule for each is the same as that of the univariate case. Notice that
where and . This is because and
Let . Then, from the results in the univariate case, such that for all and , for some . Letting , it follows that for some constant. We ignore this term as it is a constant; it is in fact equal to 1 when A is symmetric (the diagonalizing matrix of a symmetric matrix is orthogonal). Finally,
for some .
3.2 Proof Outline of Theorem 2.2
Here we provide a quick sketch of the proof when η = 0 (corresponding to SGD) and when β = 0 with η ∈ (0, 1) (corresponding to SHB). A similar analysis follows for β = 1 (ASG), and the detailed proofs of all three cases can be found in Appendix C.
The LSA-M iterate in (2) with η = 0 corresponds to:

x_{n+1} = x_n + α(b − A x_n + M_{n+1}).

Using Assumption 2.b, the expected norm of the iterates can be bounded as follows:
We next use the following lemma to bound .
Lemma 4.
Let B be a square matrix and let ρ(B) denote its spectral radius. Then, for any δ > 0,
‖Bⁿ‖ ≤ C (ρ(B) + δ)ⁿ for all n ≥ 0, where C is a constant that depends on B and δ. Furthermore, if the eigenvalues of B are distinct, then ‖Bⁿ‖ ≤ ‖V‖ ‖V⁻¹‖ ρ(B)ⁿ, where V is a matrix that diagonalizes B.
Proof.
See Appendix D.1. ∎
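Lemma 4 can be sanity-checked on a small example, consistent with its distinct-eigenvalue case. Below, the 2×2 upper-triangular matrix (an arbitrary choice of ours, with distinct eigenvalues 0.9 and 0.7) has ‖Bⁿ‖_F / ρ(B)ⁿ bounded by a constant depending only on B, while ‖Bⁿ‖_F itself decays geometrically.

```python
import math

B = [[0.9, 0.5], [0.0, 0.7]]   # upper-triangular: distinct eigenvalues 0.9, 0.7
rho = 0.9                      # spectral radius of B

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def fro(X):
    # Frobenius norm of a 2x2 matrix
    return math.sqrt(sum(X[i][j] ** 2 for i in range(2) for j in range(2)))

ratios = []
Bn = [[1.0, 0.0], [0.0, 1.0]]
for n in range(1, 61):
    Bn = matmul(Bn, B)
    ratios.append(fro(Bn) / rho ** n)

print(max(ratios), fro(Bn))   # bounded ratio; the matrix power itself decays to 0
```

The bounded ratio is what lets the proof trade ‖Bⁿ‖ for ρ(B)ⁿ at the cost of a fixed constant.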
Using Assumption 1 and Lemma 4, we have
(5)
Here, V is the matrix in the Jordan decomposition of I − αA and σ_min(V) is the smallest singular value of V. We define the sequence as below:
Observe that and that the sequence satisfies
Therefore, we have
Since , one can show that
We assume that and therefore .
Choose as below. Then,
when , the sample complexity is:
As in the proof of Theorem 2.1, we begin by transforming the iterate given by (2) into the following iterate for :
where, and
Using Assumption 2.b, we can obtain the following bound on the mean squared error:
Here, the first term corresponds to the bias and the second term corresponds to the variance. A spectral analysis of tells us that the best choice of is given by with . However, to simplify the analysis, we choose the parameter , which maintains the same rate of decay of the bias term. Using Assumption 1 and Lemma 9, we show that
where and depends on and . The dependence of the second term on appears because we do not assume the noise variance to be uniformly bounded by a constant. This makes analyzing the above recursion challenging. Towards this, we define a sequence ,
Choosing
ensures that Using Lemma 10, we bound in terms of . Here, when the matrix is symmetric and when is not symmetric, where denotes the smallest singular value and is the matrix that diagonalizes , i.e., , a diagonal matrix. With this we obtain
Now, for it can be shown that . Next, we choose small enough: to ensure that the second term in the above inequality is within an boundary, and large enough: to ensure that the first term is within an boundary. Finally,
gives the desired bound on the sample complexity for .
4 Concluding Remarks
In this work, we analyze the sample complexity of SHB and ASG and provide matching lower and upper bounds, up to constants and logarithmic terms. More importantly, we show that the same sample complexity bound can be obtained by standard SGD. Our work also calls into question some of the recent positive results in favour of momentum methods in the stochastic regime: such results either do not take into account all the terms involved in the error decomposition or have major flaws. Although some other results do question the superiority of momentum methods in the stochastic regime, the assumptions and settings that these works consider do not correspond to those in the positive results; the existing negative results either only consider small step-sizes and momentum parameters not close to 1, or apply to specific instances of linear regression. In contrast, our work shows that SHB and ASG cannot obtain an improvement in terms of sample complexity (for small neighbourhoods) over SGD for the entire family of quadratic optimization, and this holds for all step-sizes and all momentum parameters in [0, 1).
References
- Assran and Rabbat, (2020) Assran, M. and Rabbat, M. (2020). On the convergence of nesterov’s accelerated gradient method in stochastic settings. Proceedings of the 37th International Conference on Machine Learning, PMLR, 119:410–420.
- Borkar, (2008) Borkar, V. S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
- Can et al., (2019) Can, B., Gurbuzbalaban, M., and Zhu, L. (2019). Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 891–901. PMLR.
- Cauchy, (1847) Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d’équations simultanées. C. R. Acad. Sci. Paris, 25:536–538.
- Cheng et al., (2020) Cheng, X., Yin, D., Bartlett, P., and Jordan, M. (2020). Stochastic gradient and Langevin processes. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1810–1819. PMLR.
- Foucart, (2012) Foucart, S. (2012). University lecture, m504, lecture 6: Matrix norms and spectral radii.
- Horn and Johnson, (1990) Horn, R. and Johnson, C. (1990). Matrix analysis (Corrected reprint of the 1985 original). Cambridge University Press.
- Hu et al., (2009) Hu, C., Pan, W., and Kwok, J. (2009). Accelerated gradient methods for stochastic optimization and online learning. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.
- Jastrzębski et al., (2018) Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Storkey, A., and Bengio, Y. (2018). Three factors influencing minima in SGD.
- Kidambi et al., (2018) Kidambi, R., Netrapalli, P., Jain, P., and Kakade, S. M. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. CoRR, abs/1803.05591.
- Liu and Belkin, (2020) Liu, C. and Belkin, M. (2020). Accelerating sgd with momentum for over-parameterized learning. In International Conference on Learning Representations.
- Loizou and Richtárik, (2020) Loizou, N. and Richtárik, P. (2020). Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77:653–710.
- Mandt et al., (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research, 18:1–35.
- Mou et al., (2020) Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2947–2997. PMLR.
- Nesterov, (1983) Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 269:543–547.
- Nesterov, (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152.
- Nesterov, (2014) Nesterov, Y. (2014). Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition.
- Nitanda, (2014) Nitanda, A. (2014). Stochastic proximal gradient descent with acceleration techniques. In NIPS.
- Polyak, (1964) Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4:1–17.
- Polyak, (1987) Polyak, B. (1987). Introduction to Optimization. Optimization Software, NY.
- Qian, (1999) Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Netw., 12(1):145–151.
- Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA. PMLR.
- Yuan et al., (2016) Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1–66.
- Zhang et al., (2019) Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Appendix
Appendix A Comparison with recent works
A.1 Comparison with Mou et al. (2020)
Claim 1 in (Mou et al., 2020, p. 20) analyzes the asymptotic covariance of the heavy ball momentum algorithm (with Polyak averaging) and claims a correction term that satisfies:
where as in the decomposition of in Lemma 1 of Mou et al., (2020) and .
However, in the proof of Claim 1, we are not sure if the following bound holds, since the matrix is not symmetric:
Our calculation points towards the following bound:
We outline the proof for the univariate case, when for some . Then,
Observe that and therefore . Using this we have,
(6)
The second inequality follows as in Mou et al., (2020). Next, we analyze the dependence of on . Again, for simplicity, we consider the univariate case where . Let be diagonalizable. Therefore,
where and are the eigenvalues of . Let . We therefore have,
Solving the system of equations, we get:
Now, . Using the choice of as in Mou et al., (2020), we have:
For (which is the case where the momentum algorithm is claimed to improve the mixing time in Mou et al., (2020)), . As in the proof of Lemma 5 in Appendix D.2, and using the fact that , we can show that . Combining with (6), we have:
A similar analysis can be carried for the multivariate case to show that
The correction term in SGD is (See Mou et al., (2020), Claim 1). The stationary distribution for the momentum algorithm is larger than that of SGD when . Indeed if we enforce that the two asymptotic covariances are of the same size by choosing the step-size for momentum iterate in terms of the step-size of SGD, i.e.,
then we must choose . In Appendix C.1 of Mou et al., (2020), the mixing time of the momentum iterate is shown to be , while the mixing time of SGD is . When we choose , the mixing time of the momentum algorithm turns out to be the same as that of SGD. This behaviour is identical to what we observe in Theorem 2.2, where choosing the same step-size yields an improvement by a square-root factor.
A.2 Comparison with SHB, Can et al. (2019)
For strongly convex quadratic functions of the form:
where , is p.s.d, , and , Can et al., (2019) shows acceleration in Wasserstein distance by a factor of . The trace of the asymptotic covariance matrix is given by (See Appendix C.2 of Can et al., (2019)):
where is the step-size, is the momentum parameter, and is the eigenvalue of . We show that the asymptotic covariance matrix is worse compared to when no momentum is used, i.e., and the optimal step-size is used. Substituting these values for and , we get:
To compute the size of the stationary distribution with the iterates of heavy ball we set the step-size and momentum parameter and get:
The above calculation shows that the size of the asymptotic covariance matrix of SHB is worse by a factor of .
A.3 Comparison with Assran and Rabbat (2020)
Theorem 1 of Assran and Rabbat, (2020) is stated for a constant (not to be confused with in the current paper) and uses (a term that depends on ). However, in Corollary 1.1, the result is stated for a decreasing sequence for which is undefined (equation (26)).
Using the expressions provided in the Appendix of Assran and Rabbat, (2020), we find that the second term in Corollary 1.1 (the variance bound) is in fact of order instead of . Using equation (29) and the display below it, we see that for ASG, . It follows that
However, for SGD,
As a result, the variance bound for ASG is , while the variance bound for SGD is . Though the bias term in their result improves by a factor of , the variance term also worsens by a factor of .
Appendix B Proof of Theorem 2.1
Observe that
Choose such an with . If , . Therefore we assume which implies and .
Lemma 1.
For all , , if for small enough then the variance term is bounded by
Proof.
We see that
The sum of fractions in the bracket can be expressed as a single fraction with denominator . The numerator turns out to be:
First consider the case . The eigenvalues and in this case are real. Therefore, all the terms in the numerator and the denominator are real. Next, consider the case . In this case, the eigenvalues and are complex. Let and . Observe that
Therefore, all the terms in the numerator and the denominator are again real. Now observe that,
The second inequality follows because and therefore . Next,
The last inequality follows since . It follows that . Also, note that and therefore . It follows that . Using these bounds, taking and noting that , we get
(7)
Finally, consider the case when . In this case .
Let , where . Let . Since , we get and . We now solve for .
It follows that the following equations hold
Solving the above equations with , we get,
Observe that,
where the last inequality follows from (7). Therefore,
∎
Lemma 2.
For all , , if for small enough , then
Proof.
Before proceeding to the proof, we introduce a new function
to simplify calculations. First we consider the case when and are real. Since and are only functions of and , is well-defined.
The proof of the Lemma follows by showing . Note that since . Similarly, . Thus, . Since , it follows that . Thus,
We see that
Observe that the denominator in the above expression is positive. We consider the following two cases: (i) and (ii) . When , we can directly bound . Since and , we get .
Now consider the case . We have that,
When , for all , and thus . This implies that the numerator of the partial derivative is also positive, and we have that . Since the partial derivative is positive, is an increasing function of , and thus the maximum is achieved at and is given by:
Next, for , the eigenvalues and are complex. We have,
First we show that . Notice,
It follows that,
Observe that is a decreasing function of and therefore, the infimum is attained for . For this choice of ,
∎
Lemma 3.
If and , then .
Proof.
From we have . Then,
Thus, we have and . By re-arranging and choosing such that , we have
(8)
∎
Appendix C Proof of Theorem 2.2
We prove the theorem separately for (corresponding to SGD), (corresponding to SHB) and (corresponding to ASG).
Case-1: η = 0 (SGD)
The LSA-M iterate in (2) with η = 0 corresponds to:

x_{n+1} = x_n + α(b − A x_n + M_{n+1}).   (9)

Let x̃_n = x_n − x*. Then, equation (9) can be rewritten as:

x̃_{n+1} = (I − αA) x̃_n + α M_{n+1}.   (10)
Taking the square of the norm on both sides of the above equation, we obtain
Now we take expectation on both sides to obtain
Now, from Assumption 2.b, E[M_{k+1} | F_k] = 0; therefore, the second term becomes 0. Next, consider the term inside the double summation. First, consider the case j ≠ k. Without loss of generality, suppose j < k.
The last equality follows from Assumption 2.b. When j = k,
Substituting the above values, we get
We next use the following lemma to bound .
Lemma 4.
Let B be a square matrix and let ρ(B) denote its spectral radius. Then, for any δ > 0,
‖Bⁿ‖ ≤ C (ρ(B) + δ)ⁿ for all n ≥ 0, where C is a constant that depends on B and δ. Furthermore, if the eigenvalues of B are distinct, then ‖Bⁿ‖ ≤ ‖V‖ ‖V⁻¹‖ ρ(B)ⁿ, where V is a matrix that diagonalizes B.
Proof.
See Appendix D.1. ∎
We let . Using Assumption 1 and Lemma 4, we have
(11)
Here, V is the matrix in the Jordan decomposition of I − αA and σ_min(V) is the smallest singular value of V. We define the sequence as below:
Observe that and that the sequence satisfies
Therefore, we have
To ensure that , choose as follows:
We assume that and therefore .
Choose as below:
Then,
when the sample complexity is:
Case-2: (SHB)
LSA-M iterate with can be re-written as:
This can be re-written as:
Let us define
Note that and . It follows that,
The norm square of the above equation gives:
Taking expectation on both sides as well as using the fact that and , we have
Without loss of generality assume . Therefore, . As before, for a matrix , let denote the spectral radius of . Next, we compute . Consider the characteristic equation of P:
When and commute, we have the following formula for the determinant of a block matrix (Horn and Johnson, (1990)):
Using this, the characteristic equation of simplifies to:
We note that when , the LHS of the above equation becomes . Thus, can never be a solution of the characteristic equation of whenever . We now further simplify the characteristic equation of to a more convenient form:
The only zeros of the characteristic equation of a matrix are its eigenvalues. Let be the eigenvalue of with so that
The above is a quadratic equation in and the solution is given by
When the roots are complex, the absolute value of the eigenvalues of P is independent of the corresponding eigenvalue of A and equals √η:
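This step is easy to verify directly. In our notation (α the step-size, η the momentum parameter, λ an eigenvalue of A), the per-eigenvalue characteristic equation is the quadratic z² − (1 + η − αλ)z + η = 0; whenever its roots are complex, they are conjugate with product η, so each has modulus √η regardless of λ. A quick stdlib check (the parameter values are illustrative):

```python
import cmath
import math

alpha, eta = 0.05, 0.9
for lam in [0.5, 1.0, 2.0]:
    c = 1.0 + eta - alpha * lam
    disc = c * c - 4.0 * eta
    # complex-root regime: alpha*lam in ((1 - sqrt(eta))^2, (1 + sqrt(eta))^2)
    assert disc < 0
    r1 = (c + cmath.sqrt(disc)) / 2.0
    r2 = (c - cmath.sqrt(disc)) / 2.0
    # both roots of z^2 - c z + eta have modulus sqrt(eta), independent of lam
    assert abs(abs(r1) - math.sqrt(eta)) < 1e-12
    assert abs(abs(r2) - math.sqrt(eta)) < 1e-12
print("spectral radius =", math.sqrt(eta))
```

This is why, in this regime, the contraction rate of SHB is governed by √η alone, with the step-size entering only through the location of the complex-root window.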
To ensure that , we must have
For the spectral radius of to be , the above must hold for all . We choose as:
and as:
Observe that if we choose the momentum parameter , then has two repeated roots since . To ensure that does not have any repeated root, we choose the momentum parameter . Therefore, . Using Assumption 1 and Lemma 4, we get
However, unlike in Case-1, here the constant is not independent of and . The following lemma specifies an upper bound on .
Lemma 5.
, where is as defined in (11).
Proof.
See Appendix D.2. ∎
We define the sequence as follows
Observe that , and that satisfies
Therefore, we have
To ensure that , we need to choose such that
Next, using Lemma 5, the above can be ensured by choosing such that
The recursion for the sequence then follows
Unrolling the recursion, we get
Choose as below:
Then,
when is as follows:
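Before moving on, the eigenvalue computation in Case-2 can be checked numerically. The sketch below assumes the standard heavy-ball recursion on a quadratic, written mode-by-mode: for each eigenvalue mu of the driving matrix, the iterate pair evolves under a 2x2 companion matrix whose characteristic polynomial is the quadratic derived above. In the complex-root regime, the two roots are conjugates with product equal to the momentum parameter, so their common modulus is its square root, independent of the step size and of mu. (The companion form and the names alpha, eta, mu, along with the specific values, are illustrative assumptions.)

```python
import numpy as np

def shb_mode_matrix(alpha, eta, mu):
    """Heavy-ball update for one eigen-mode mu of the driving matrix:
    x_{n+1} = (1 + eta - alpha*mu) * x_n - eta * x_{n-1}, stacked as
    [x_{n+1}; x_n] = P @ [x_n; x_{n-1}]."""
    return np.array([[1.0 + eta - alpha * mu, -eta],
                     [1.0, 0.0]])

alpha, eta = 0.1, 0.5
# For these mu, the discriminant (1 + eta - alpha*mu)^2 - 4*eta is negative,
# so the eigenvalues are complex conjugates with product eta.
moduli = [abs(np.linalg.eigvals(shb_mode_matrix(alpha, eta, mu))[0])
          for mu in (1.0, 2.0, 5.0)]
# Each entry of moduli equals sqrt(eta), regardless of alpha and mu.
```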
Case-3: (ASG)
The proof proceeds along the same lines as Case-2. It is easy to see that the following relation holds with a modified definition of the matrix .
where,
As in the previous case, we compute the eigenvalues of . The characteristic equation of is given by:
As before, this can be simplified to the following equation:
We now further simplify the characteristic equation of to a more convenient form:
Progressing as in the previous case, we get that the eigenvalues of satisfy:
When , we have that
This implies that,
(12)
To ensure that , we must have
We assume , and therefore holds for all . Using this, the above relation simplifies to:
For the above to hold, we must have that
Since the above must hold, we choose as:
As in Case-2, if we choose the momentum parameter , then has a repeated root. To ensure that has no repeated roots, we choose the momentum parameter
Using (12), we get , which is the same as in Case-2, and therefore we have:
The above expression is the same as that in Case-2, except for the term . Since the matrix is different when , the quantity in Case-2 need not equal that in Case-3. However, we next show that satisfies the same upper bound as in Case-2 (Lemma 5). Towards this, we have the following lemma:
Lemma 6.
, where is as defined in (11).
Proof.
See Appendix D.3. ∎
Thereafter, we can proceed exactly as in Case-2 to prove the theorem.
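As in Case-2, the per-mode eigenvalue relation for ASG can be verified numerically. The sketch assumes the standard Nesterov recursion on a quadratic, which for each eigen-mode mu yields a 2x2 companion matrix; in the complex-root regime, the squared modulus of the eigenvalues equals the product of the roots, which, unlike in Case-2, depends on both the step size and the mode. (The companion form and the names alpha, eta, mu are assumptions for illustration.)

```python
import numpy as np

def asg_mode_matrix(alpha, eta, mu):
    """Nesterov-accelerated update for one eigen-mode mu of the driving matrix:
    x_{n+1} = (1 + eta)(1 - alpha*mu) x_n - eta (1 - alpha*mu) x_{n-1}."""
    r = 1.0 - alpha * mu
    return np.array([[(1.0 + eta) * r, -eta * r],
                     [1.0, 0.0]])

alpha, eta, mu = 0.1, 0.9, 1.0
lam = np.linalg.eigvals(asg_mode_matrix(alpha, eta, mu))
# Complex-root regime: |lambda|^2 equals the product of the roots,
# eta * (1 - alpha*mu), so the modulus depends on alpha and mu.
modulus = abs(lam[0])
```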
Appendix D Proofs of Auxiliary Lemmas
D.1 Proof of Lemma 4
As in Foucart, (2012), we first construct a matrix norm such that . Consider the Jordan canonical form of
We define
where
Therefore,
where,
We define the matrix norm as
where is the matrix norm induced by the vector -norm. Using the fact that , where is the -th entry of , we have
It can easily be seen that . Therefore, it follows that
where denotes the smallest singular value and we have used the fact that repeatedly. For , and defined as the size of the largest Jordan block of
We conclude the first half of the lemma by defining as .
In the case that the eigenvalues are distinct, and defined above becomes independent of . Moreover, when all eigenvalues are distinct, each Jordan block is and the second half follows.
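The scaling trick in the proof above can be illustrated numerically for a single Jordan block: conjugating by the diagonal matrix with entries eps, eps^2, ..., eps^d shrinks the superdiagonal ones to eps while fixing the diagonal, so the induced infinity-norm lands within eps of the spectral radius. (The block size and numerical values below are illustrative.)

```python
import numpy as np

lam, eps, d = 0.9, 0.01, 4

# A single d x d Jordan block with eigenvalue lam.
J = lam * np.eye(d) + np.diag(np.ones(d - 1), 1)

# Diagonal scaling D = diag(eps, eps^2, ..., eps^d): conjugation shrinks the
# superdiagonal 1's to eps while leaving the diagonal untouched.
D = np.diag(eps ** np.arange(1, d + 1))
M = np.linalg.inv(D) @ J @ D

norm_inf = np.linalg.norm(M, np.inf)  # induced infinity-norm: max abs row sum
# norm_inf equals lam + eps = rho(J) + eps, up to floating-point error.
```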
D.2 Proof of Lemma 5
Let be the matrix in the Jordan decomposition of , i.e., , where is a diagonal matrix with the eigenvalues of as its diagonal elements. Then,
where is the zero matrix of dimension . For ease of exposition, suppose is a matrix with eigenvalues and . Then
Suppose is the elementary matrix associated with exchanging rows 2 and 3. It is easy to see that and that,
where,
Suppose, and,
Here and , where . Solving the above equation we get,
Setting ,
For a general matrix , using a similar procedure, it can be shown that
where and are permutation matrices that transform the matrix between them to a block diagonal matrix. Let . Therefore,
In order to simplify the expression for , we require the following lemma:
Lemma 7.
For all invertible matrices of order , the following identity holds:
where denote the singular values of .
Proof.
By definition, are the eigenvalues of . Then, the eigenvalues of are . Note that and are similar since . Consequently, and have the same set of eigenvalues, and we find that the singular values of are
Thus,
and the result follows. ∎
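Lemma 7 is straightforward to confirm numerically: the singular values of the inverse are the reciprocals of the singular values of the original matrix. A minimal check (the random test matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))  # a generic, hence invertible, matrix

s = np.linalg.svd(B, compute_uv=False)                     # singular values of B
s_inv = np.linalg.svd(np.linalg.inv(B), compute_uv=False)  # of B^{-1}

# After sorting, the singular values of B^{-1} coincide with the
# reciprocals of the singular values of B.
```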
Using Lemma 7, we have
where is as defined in (11). Now, for any matrix of order ,
where denotes the Frobenius norm. Using the above inequality, and . Next we lower bound .
For a complex number , observe that = . Now,
(13)
Using ,
Using this, we get , and therefore
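For reference, the comparison between the spectral and Frobenius norms invoked above is the standard one, and it is easy to verify numerically (the random test matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))

spec = np.linalg.norm(B, 2)      # spectral norm: largest singular value
frob = np.linalg.norm(B, 'fro')  # Frobenius norm: sqrt of sum of squared entries

# ||B||_2 <= ||B||_F <= sqrt(n) * ||B||_2 holds for every n x n matrix B.
```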
D.3 Proof of Lemma 6
The proof of this lemma proceeds exactly as the proof of Lemma 5. The only difference is that one needs to lower bound for . We define and therefore . Using this, we get
The expression for is exactly the same as in the proof of Lemma 5 (cf. (13)). Thereafter, we proceed as in the proof of Lemma 5 to prove the claim.