Does Momentum Help in Stochastic Optimization?
A Sample Complexity Analysis.
Abstract
Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG) are popular momentum methods in stochastic optimization. While the benefits of such acceleration ideas in deterministic settings are well understood, their advantages in stochastic optimization are still unclear. In fact, in some specific instances, it is known that momentum does not help in the sample complexity sense. Our work shows that a similar outcome holds for the whole of quadratic optimization. Specifically, we obtain a lower bound on the sample complexity of SHB and ASG for this family and show that the same bound can be achieved by vanilla SGD. We note that there exist results claiming the superiority of momentum-based methods in quadratic optimization, but these are based on one-sided or flawed analyses.
1 Introduction
In deterministic convex optimization (when one has access to exact gradients), Gradient Descent (GD) is a popular optimization algorithm Cauchy, (1847). However, in practice, exact gradients are not available and one has to rely on noisy observations. This brings forth the idea of Stochastic Gradient Descent (SGD). For GD, two classic momentum ideas, Heavy Ball (HB) Polyak, (1964, 1987); Qian, (1999) and Nesterov’s Accelerated Gradient (NAG) Nesterov, (1983, 2014, 2005), are used to speed up its convergence. Naturally, these momentum-based methods and their variants have also gained significant interest in stochastic settings Sutskever et al., (2013); Nitanda, (2014); Hu et al., (2009). Our work shows that the stochastic variants of HB and NAG, i.e., Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG), are no better than vanilla SGD for the whole of quadratic optimization. Specifically, we show that the sample complexities (i.e., the number of iterations required to reach an ε-neighborhood of the solution) of SHB, ASG, and SGD are all of the same order.
More formally, in (deterministic) quadratic optimization with condition number κ, GD converges to an ε-optimal solution in O(κ log(1/ε)) iterations, while both HB and NAG need only O(√κ log(1/ε)) steps. (Throughout this work, we only consider algorithms with constant step-sizes; these are popular in practice and ensure faster convergence.) A pertinent question then is “Does one get such advantages even in stochastic settings?” The current literature is, however, divided on whether SHB and ASG are better than SGD.
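The deterministic gap is easy to reproduce numerically. The sketch below is illustrative (the quadratic f(x) = (μx₁² + Lx₂²)/2 with κ = 100, the step-sizes, and the classical Polyak parameters are our own choices, not from this paper): it counts the iterations GD and HB need to bring the error below 10⁻⁶.

```python
import math

mu, L = 1.0, 100.0              # curvature extremes; condition number kappa = L/mu = 100
kappa = L / mu
tol = 1e-6

def grad(x):
    # gradient of f(x) = 0.5*(mu*x1^2 + L*x2^2)
    return (mu * x[0], L * x[1])

def norm(x):
    return math.sqrt(x[0] ** 2 + x[1] ** 2)

def run_gd(alpha, max_iter=100000):
    x = (1.0, 1.0)
    for n in range(1, max_iter + 1):
        g = grad(x)
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
        if norm(x) <= tol:
            return n
    return max_iter

def run_hb(alpha, beta, max_iter=100000):
    # Polyak heavy ball: x_{n+1} = x_n - alpha*grad + beta*(x_n - x_{n-1})
    x, x_prev = (1.0, 1.0), (1.0, 1.0)
    for n in range(1, max_iter + 1):
        g = grad(x)
        x, x_prev = (
            x[0] - alpha * g[0] + beta * (x[0] - x_prev[0]),
            x[1] - alpha * g[1] + beta * (x[1] - x_prev[1]),
        ), x
        if norm(x) <= tol:
            return n
    return max_iter

gd_iters = run_gd(alpha=1.0 / L)
# classical Polyak parameters for quadratics
alpha_hb = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
beta_hb = ((math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)) ** 2
hb_iters = run_hb(alpha_hb, beta_hb)
print(gd_iters, hb_iters)   # HB needs on the order of sqrt(kappa) fewer iterations
```

With exact gradients the iteration counts differ by roughly a √κ factor, which is exactly the advantage whose stochastic counterpart is in question.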
Some recent results Loizou and Richtárik, (2020); Mou et al., (2020); Assran and Rabbat, (2020); Can et al., (2019) claim that these momentum methods are better than SGD in quadratic or least-squares settings. However, Loizou and Richtárik, (2020) needs a strong assumption on the noise, one that (Kidambi et al., 2018, Section 6) shows is information-theoretically impossible to satisfy even in the simple least squares regression problem. The other results are either based on a one-sided analysis Can et al., (2019) (which only considers the bias while ignoring the variance) or have a major flaw Mou et al., (2020); Assran and Rabbat, (2020); see Appendices A.1–A.3 for details.
On the other hand, there are also a few recent negative results on these momentum methods. For general convex optimization, Yuan et al., (2016) shows that SHB and ASG are equivalent to SGD with a rescaled step-size. However, this result requires that the step-size be sufficiently small and the momentum parameter be bounded away from 1. In Kidambi et al., (2018); Liu and Belkin, (2020), for one specific instance of least squares regression with vanishing noise, it is shown that the performance of SHB and ASG cannot be better than that of SGD. Finally, Zhang et al., (2019) considers SHB for quadratic objectives in the same noisy setting as our work and provides upper bounds on the rate at which the objective function decreases. They also argue that rescaled SGD performs as well as SHB and demonstrate this empirically, but fall short of rigorously establishing a lower bound that supports their claim.
The current literature can thus be summarized as follows.
Research Gap: Existing works on SHB and ASG fall into two groups: i.) positive, where the results claim advantages of these methods over SGD, and ii.) negative, where the results claim the opposite. Results in the positive group either have a one-sided Can et al., (2019) or a flawed analysis Mou et al., (2020); Assran and Rabbat, (2020), while the ones in the negative group apply to special cases (e.g., Yuan et al., (2016) requires sufficiently small step-sizes and momentum parameters bounded away from 1, while Kidambi et al., (2018); Liu and Belkin, (2020) only apply to one instance of a least squares problem with vanishing noise).
Key Contribution: Our work belongs to the negative group and extends the claims in Kidambi et al., (2018); Liu and Belkin, (2020) to more general settings. Specifically, for all quadratic optimization problems with persistent noise (noise variance bounded away from zero) and any small enough ε > 0, we show that the number of iterations needed by SHB and ASG to find an ε-optimal solution is of the same order as that of SGD. More technically, we obtain a lower bound on the sample complexities of SHB and ASG (Theorem 2.1) and show that these are of the same order as the corresponding upper bound for SGD (Theorem 2.2). We emphasize that our result applies for all positive step-sizes and all momentum parameters in the range [0, 1), unlike in Yuan et al., (2016). Moreover, our proof techniques are also significantly different from those used in existing lower bounds such as Kidambi et al., (2018); Liu and Belkin, (2020). This is because, under persistent noise, the expected error contains an additional term that cannot be accounted for by their analyses (see Remark 3).
2 Main Results
We state our main results here, which provide lower and upper bounds on the sample complexities of SHB and ASG. We use these bounds, along with those of SGD, to show that all these methods need a similar effort to find an ε-optimal solution. We emphasize that our results apply to the whole family of quadratic optimization.
To provide a unified analysis, we look at a generic update rule that includes SHB, ASG, and SGD as special cases. In particular, we consider the linear stochastic approximation with momentum (LSA-M) iterate given by

y_n = x_n + βη(x_n − x_{n−1}),   (1)

x_{n+1} = x_n + η(x_n − x_{n−1}) + α(b − A y_n + M_{n+1}),   (2)

where α > 0 is the step-size, η ∈ [0, 1) is the momentum parameter, β ∈ {0, 1}, and (M_n) is noise. Note that we do not presume A is symmetric, and this is what makes the above update a stochastic approximation. Also, when A is symmetric, LSA-M subsumes as special cases SGD (let η = 0 in (2)), SHB (let β = 0 in (2)), and ASG (let β = 1 in (2)).
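Our reading of the update (1)–(2) can be sketched in a few lines of code; here α denotes the step-size, η the momentum parameter, and β the selector distinguishing SHB (β = 0) from ASG (β = 1), with η = 0 recovering SGD. The scalar instance a = 1, b = 2 is an illustrative choice of ours; in the noiseless case all three variants drive the iterate to x* = b/a.

```python
def lsa_m_step(x, x_prev, a, b, alpha, eta, beta, noise=0.0):
    """One LSA-M step: y_n = x_n + beta*eta*(x_n - x_{n-1});
    x_{n+1} = x_n + eta*(x_n - x_{n-1}) + alpha*(b - a*y_n + noise)."""
    y = x + beta * eta * (x - x_prev)
    return x + eta * (x - x_prev) + alpha * (b - a * y + noise)

def run(alpha, eta, beta, n_iters=500):
    x_prev = x = 0.0
    for _ in range(n_iters):
        x, x_prev = lsa_m_step(x, x_prev, a=1.0, b=2.0,
                               alpha=alpha, eta=eta, beta=beta), x
    return x

# noiseless runs: all three special cases should approach x* = b/a = 2
x_sgd = run(alpha=0.5, eta=0.0, beta=0.0)   # SGD: eta = 0
x_shb = run(alpha=0.1, eta=0.5, beta=0.0)   # SHB: beta = 0
x_asg = run(alpha=0.1, eta=0.5, beta=1.0)   # ASG: beta = 1
print(x_sgd, x_shb, x_asg)
```

The stochastic versions are obtained by passing a random `noise` term at each step; all later results concern exactly that regime.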
With regard to A and the noise sequence (M_n), we make the following assumptions.
Assumption 1.
(Driving matrix feature) A is diagonalizable and all its eigenvalues are real and positive.
Assumption 1 holds trivially when A is a symmetric positive definite matrix. Consequently, our results apply to all members of the quadratic optimization family.
Also, under the above assumption, one would expect the iterates in (2) to go to a neighborhood of x* = A⁻¹b.
We next state two assumptions on the noise; the first is used in Theorem 2.1, while the other is used in Theorem 2.2 and Corollary 2.3.
Assumption 2.a.
(Noise attributes for Theorem 2.1) (M_n) is a martingale difference sequence w.r.t. the filtration (F_n), where F_n = σ(x_0, M_1, …, M_n). Further, there exists a constant C > 0 such that E[M_{n+1} M_{n+1}ᵀ | F_n] ⪰ C I a.s. for all n ≥ 0.
The notation B ⪰ D for square matrices B and D is used above to imply that B − D is positive semi-definite. Also, the symbol I refers to the identity matrix.
Assumption 2.b.
(Noise attributes for Theorem 2.2) (M_n) is a martingale difference sequence w.r.t. the filtration (F_n), where F_n = σ(x_0, M_1, …, M_n). Further, there exists a constant K > 0 such that E[‖M_{n+1}‖² | F_n] ≤ K(1 + ‖x_n‖²) a.s. for all n ≥ 0.
Assumptions 2.a and 2.b are standard Mandt et al., (2017); Jastrzębski et al., (2018); Cheng et al., (2020); Borkar, (2008). The first of these holds if and only if all the eigenvalues of the conditional noise covariance are bounded from below by a positive constant, i.e., the noise is persistent (or non-vanishing) in all directions. On the other hand, Assumption 2.b requires that the trace of this conditional covariance be bounded from above. This bound can scale with the norm of the iterate and need not vanish near the solution.
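A consequence of Assumption 2.a worth keeping in mind: with a constant step-size, the expected error of SGD cannot decay to zero but settles at a noise floor of order α. A minimal 1-D sanity check (λ, α, and σ² below are illustrative choices of ours, with noise of variance exactly σ²) iterates the exact mean-squared-error recursion E[x̃_{n+1}²] = (1 − αλ)² E[x̃_n²] + α²σ², whose cross term vanishes by the martingale difference property.

```python
lam, alpha, sigma2 = 1.0, 0.1, 1.0   # curvature, step-size, noise variance (illustrative)

s = 1.0                              # E[x_tilde_0^2], the initial squared error
for _ in range(2000):
    # exact MSE recursion for 1-D SGD with additive noise of variance sigma2
    s = (1.0 - alpha * lam) ** 2 * s + alpha ** 2 * sigma2

# fixed point of the recursion: the stationary noise floor
floor = alpha ** 2 * sigma2 / (1.0 - (1.0 - alpha * lam) ** 2)
print(s, floor)
```

Driving the error below a target ε therefore forces the step-size to shrink with ε, which is consistent with the role persistent noise plays in the lower bound of Theorem 2.1.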
Next, we define the sample complexity, which quantifies the effort required by LSA-M to obtain an ε-close solution to x*.
Definition 1.
(Sample Complexity). The sample complexity of (2) is the minimum number of iterations N_ε such that the expected error satisfies E‖x_n − x*‖ ≤ ε for all n ≥ N_ε.
| Method | β | η | α |
|---|---|---|---|
| SGD | – | 0 | |
| SHB | 0 | | |
| ASG | 1 | | |
To enable easy comparison between different algorithms, we shall look at the order of their sample complexities. Towards that, we shall use the notation f = Θ(g) to imply that there exist constants c₁, c₂ > 0 (independent of ε) such that c₁ g ≤ f ≤ c₂ g. The notation Θ̃ has a similar meaning, but hides the dependence on logarithmic terms as well.
Theorem 2.1 (Lower bound on sample complexity).
Remark 1.
As stated below (2), LSA-M includes SHB and Nesterov’s ASG method as special cases and, hence, the above result directly applies to them. In fact, this is the first lower bound on SHB and ASG’s sample complexities in quadratic optimization with persistent noise.
Remark 2.
The value of N_ε above is obtained by optimizing over all possible values of the step-size and momentum parameters and hence does not depend on their specific choices. This implies that there is no “smart” combination of α and η that will give a better sample complexity (for ε small enough).
Remark 3.
The lower bounds in Kidambi et al., (2018) and Liu and Belkin, (2020) are obtained by viewing the expected error in the SHB and ASG iterates for least squares as update rules of the form e_{n+1} = B e_n for some matrix B ((Kidambi et al., 2018, Appendix A, p. 16) and (Liu and Belkin, 2020, Appendix C, p. 12)). In particular, they obtain bounds on the eigenvalues of B to get the desired claim. In contrast, the error relations for the SHB and ASG methods in our setup (quadratic optimization with persistent noise) have the form z_{n+1} = Ā z_n + α ē M_{n+1} for some matrix Ā and vector ē (cf. (3)). This forces us to develop a new proof technique that jointly looks at the resulting bias and variance terms and shows that at least one of them remains large for the choice of N_ε given in Theorem 2.1.
We next state our upper bound on the sample complexity in Theorem 2.2 and Corollary 2.3. Similar bounds already exist in the literature when A is assumed to be symmetric and the noise is assumed to be iid with bounded variance (Can et al., (2019); Zhang et al., (2019)). Here, we show that a similar upper bound holds in more general settings: i.) A is not symmetric but satisfies Assumption 1, and ii.) the noise is a martingale difference sequence satisfying Assumption 2.b.
Theorem 2.2.
Consider the LSA-M iterate and let Assumptions 1 and 2.b hold. Then, for every ε > 0, there exists a choice of α, η, and β in LSA-M (see Table 1 for exact values) such that the expected error E‖x_n − x*‖ ≤ ε for all n ≥ N_ε, where,
(i) when η = 0 (SGD), and
(ii) when η ∈ (0, 1) (SHB and ASG).
From Table 1, we see that the step-size α for each case is a minimum of three terms. The first term arises due to the unbounded noise (Assumption 2.b), the second due to the target neighborhood, and the third from the optimal choice of step-size in the deterministic case. Since the bound provided in Theorem 2.2 is in terms of α, the minimum of the three terms dictates the sample complexity. If ε is small enough that the second term is the minimum, we obtain the following bound:
Corollary 2.3 (Upper bound on sample complexity).
Remark 4.
From Corollary 2.3, we see that for small enough ε, the upper bounds on the sample complexity of SGD, SHB, and ASG match their lower bound in Theorem 2.1. In particular, since an upper bound on the sample complexity of SGD matches a lower bound on SHB and ASG, these methods cannot outperform SGD from a sample complexity perspective.
Remark 5.
When the noise is assumed to be bounded, the first term in the choice of α in Table 1 vanishes for all three parameter choices. Under such an assumption, if the target neighbourhood is large enough or the noise variance is small enough that the third term in the choice of α is the minimum, then the sample complexity of both SHB and ASG is better than that of SGD. We emphasize that such improvements are lost when the noise variance is large or the neighbourhood under consideration is small.
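Remark 5’s dichotomy can be probed numerically in one dimension. The sketch below (λ, σ², η, α are illustrative choices of ours) computes the exact stationary mean-squared error of SHB by iterating the 2×2 covariance (Lyapunov) recursion of its companion form, and compares it with that of SGD run with the rescaled step-size α/(1−η): the two noise floors agree up to a constant factor close to 1.

```python
lam, sigma2 = 1.0, 1.0
alpha, eta = 0.01, 0.9                    # SHB parameters (illustrative)
alpha_r = alpha / (1.0 - eta)             # rescaled SGD step-size

# SHB in companion form: z_n = (x_n, x_{n-1}),  z_{n+1} = P z_n + alpha*(M, 0)
c = 1.0 + eta - alpha * lam
P = [[c, -eta], [1.0, 0.0]]

def step_cov(S):
    # S <- P S P^T + alpha^2*sigma2 * e1 e1^T   (2x2, pure python)
    PS = [[sum(P[i][k] * S[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    PSPt = [[sum(PS[i][k] * P[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
    PSPt[0][0] += alpha ** 2 * sigma2
    return PSPt

S = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(20000):                     # iterate to (numerical) stationarity
    S = step_cov(S)
v_shb = S[0][0]                            # stationary E[x_n^2] under SHB

# SGD with the rescaled step: fixed point of the scalar MSE recursion
v_sgd = alpha_r ** 2 * sigma2 / (1.0 - (1.0 - alpha_r * lam) ** 2)
print(v_shb, v_sgd, v_shb / v_sgd)
```

So, once SGD’s step-size is rescaled, momentum buys no order-of-magnitude reduction in the stationary error, in line with the matching bounds above.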
3 Proof Outline
In this section, we provide the proof outlines for Theorem 2.1 and Theorem 2.2. Section 3.1 provides an outline of all the key steps to prove the lower bound on the sample complexity. The proofs of all the intermediate lemmas are deferred to Appendix B. In Section 3.2, we provide a sketch of the proof of Theorem 2.2. The detailed proof can be found in Appendix C.
3.1 Proof Outline for Theorem 2.1
For simplicity, let d = 1 and x* = 0, so that A = λ for some λ > 0. The general multivariate case is handled subsequently. We begin by defining the transformed iterates z_n = (x_n, x_{n−1})ᵀ and rewriting (2) as:

z_{n+1} = Ā z_n + α ē M_{n+1},   (3)

where Ā = [[1 + η − αλ(1 + βη), −η(1 − αβλ)], [1, 0]] and ē = (1, 0)ᵀ. For ease of exposition, we first consider the case β = 0, corresponding to SHB, and take up the general β later. The objective here is to find an N_ε such that the error E‖z_n‖² is lower bounded by ε² for all n ≤ N_ε. Towards this, we decompose the error expression as follows:

E‖z_n‖² = ‖Āⁿ z_0‖² + α² Σ_{k=1}^{n} E[M_k²] ‖Ā^{n−k} ē‖² ≥ ‖Āⁿ z_0‖² + α² C Σ_{k=0}^{n−1} ‖Ā^k ē‖²,   (4)

where the second equality follows since (M_n) is a martingale difference sequence and the inequality follows from Assumption 2.a. The first term here corresponds to the bias and the second term corresponds to the variance.
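The equality step of the decomposition in (4) can be checked numerically: writing the SHB recursion (β = 0) in companion form z_n = (x_n, x_{n−1})ᵀ, the exact covariance recursion and the bias-plus-variance sum coincide for iid noise of variance σ². All parameter values below are illustrative choices of ours.

```python
alpha, eta, lam, sigma2 = 0.1, 0.5, 1.0, 1.0   # illustrative SHB parameters (beta = 0)
c = 1.0 + eta - alpha * lam
A_bar = [[c, -eta], [1.0, 0.0]]                 # companion matrix of the SHB recursion
z0 = [1.0, 1.0]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

n = 50
# left-hand side: E||z_n||^2 via the exact recursion S <- A S A^T + alpha^2*sigma2*e1 e1^T
S = [[z0[0] * z0[0], z0[0] * z0[1]], [z0[1] * z0[0], z0[1] * z0[1]]]
for _ in range(n):
    AS = mat_mul(A_bar, S)
    S = mat_mul(AS, [[A_bar[0][0], A_bar[1][0]], [A_bar[0][1], A_bar[1][1]]])
    S[0][0] += alpha ** 2 * sigma2
lhs = S[0][0] + S[1][1]

# right-hand side: bias ||A^n z0||^2 plus variance alpha^2*sigma2*sum_k ||A^k e1||^2
b, e, var = z0[:], [1.0, 0.0], 0.0
for k in range(n):
    var += alpha ** 2 * sigma2 * (e[0] ** 2 + e[1] ** 2)
    b = mat_vec(A_bar, b)
    e = mat_vec(A_bar, e)
bias = b[0] ** 2 + b[1] ** 2
print(lhs, bias + var)   # the two sides of the decomposition agree
```

Under Assumption 2.a the variance term can only be larger than the iid value used here, which is exactly how the lower bound exploits persistent noise.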
Let N_ε be as specified in Theorem 2.1. We will show that for this choice of N_ε, the expected error is at least ε² for all α and η. From (3.1):
If (α, η) is such that the bias term alone is at least ε², then the claim immediately follows for this choice of α and η. We now consider the case where α and η are such that the bias term is smaller than ε². Here, we will show that for this choice of α and η, the variance term will necessarily be greater than ε². Let μ₁ and μ₂ be the eigenvalues of Ā.
Lemma 1.
For all , , if for small enough then the variance term is bounded by
Note that, from the above lemma, the variance bound improves (i.e., decreases) when decreases. Using this, we show that the variance is ultimately bounded below by a linearly decreasing function of in the following lemma.
Lemma 2.
For all , , if for small enough , then
Recall that and . Using this, we get the following lower bound on :
Lemma 3.
If and , then .
This completes the proof for the univariate case when .
We now consider the case where . We notice that in the univariate case, LSA-M with parameters and is equivalent to LSA-M with parameters and . If , regardless of choice of , we have from (3.1):
Thus, when , the expected error will always be larger than . When , then for all . Since this is equivalent to LSA-M with and , the result follows from the lower bound on univariate LSA-M with . This completes the proof for the univariate case for all , .
Finally, we consider the multivariate case, i.e., when . For the multivariate case, we study the transformed iterates , where . Since we get
where
Let and , where each and . We will show that the update rule for each is the same as that of the univariate case. Notice that
where and . This is because and
Let . Then, from the results in the univariate case, such that for all and , for some . Letting , it follows that for some constant. We ignore this term as it is a constant; it is in fact equal to 1 when A is symmetric (the diagonalizing matrix of a symmetric matrix is orthogonal). Finally,
for some .
3.2 Proof Outline of Theorem 2.2
Here we provide a quick sketch of the proof when η = 0 (corresponding to SGD) and when β = 0 with η ∈ (0, 1) (corresponding to SHB). A similar analysis follows for β = 1 (ASG), and the detailed proofs of all three cases can be found in Appendix C.
The LSA-M iterate in (2) with η = 0 corresponds to:

x_{n+1} = x_n + α(b − A x_n + M_{n+1}).

Using Assumption 2.b, the expected norm of the iterates can be bounded as follows:
We next use the following lemma to bound .
Lemma 4.
Let B be a square matrix and let ρ(B) denote its spectral radius. Then, for any δ > 0,
‖Bⁿ‖ ≤ C (ρ(B) + δ)ⁿ for all n ≥ 0, where C is a constant that depends on B and δ. Furthermore, if the eigenvalues of B are distinct, then ‖Bⁿ‖ ≤ ‖V‖ ‖V⁻¹‖ ρ(B)ⁿ, where V is a matrix that diagonalizes B.
Proof.
See Appendix D.1. ∎
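Lemma 4 can be sanity-checked on a small example, consistent with its distinct-eigenvalue case. Below, the 2×2 upper-triangular matrix (an arbitrary choice of ours, with distinct eigenvalues 0.9 and 0.7) has ‖Bⁿ‖_F / ρ(B)ⁿ bounded by a constant depending only on B, while ‖Bⁿ‖_F itself decays geometrically.

```python
import math

B = [[0.9, 0.5], [0.0, 0.7]]   # upper-triangular: distinct eigenvalues 0.9, 0.7
rho = 0.9                      # spectral radius of B

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def fro(X):
    # Frobenius norm of a 2x2 matrix
    return math.sqrt(sum(X[i][j] ** 2 for i in range(2) for j in range(2)))

ratios = []
Bn = [[1.0, 0.0], [0.0, 1.0]]
for n in range(1, 61):
    Bn = matmul(Bn, B)
    ratios.append(fro(Bn) / rho ** n)

print(max(ratios), fro(Bn))   # bounded ratio; the matrix power itself decays to 0
```

The bounded ratio is what lets the proof trade ‖Bⁿ‖ for ρ(B)ⁿ at the cost of a fixed constant.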
Using Assumption 1 and Lemma 4, we have
(5)
Here, V is the matrix in the Jordan decomposition of I − αA and σ_min(V) is the smallest singular value of V. We define the sequence as below:
Observe that and that the sequence satisfies
Therefore, we have
Since , one can show that
We assume that and therefore .
Choose as below. Then,
when , the sample complexity is:
As in the proof of Theorem 2.1, we begin by transforming the iterate given by (2) into the following iterate for :
where, and
Using Assumption 2.b, we can obtain the following bound on the mean squared error:
Here, the first term corresponds to the bias and the second term corresponds to the variance. A spectral analysis of tells us that the best choice of is given by with . However, to simplify the analysis, we choose the parameter , which maintains the same rate of decay of the bias term. Using Assumption 1 and Lemma 9, we show that
where and depends on and . The dependence of the second term on appears because we do not assume the noise variance to be uniformly bounded by a constant. This makes analyzing the above recursion challenging. Towards this, we define a sequence ,
Choosing
ensures that Using Lemma 10, we bound in terms of . Here, when the matrix is symmetric and when is not symmetric, where denotes the smallest singular value and is the matrix that diagonalizes , i.e., , a diagonal matrix. With this we obtain
Now, for it can be shown that . Next, we choose small enough: to ensure that the second term in the above inequality is within an boundary, and large enough: to ensure that the first term is within an boundary. Finally,
gives the desired bound on the sample complexity for .
4 Concluding Remarks
In this work, we analyze the sample complexity of SHB and ASG and provide matching lower and upper bounds, up to constants and logarithmic terms. More importantly, we show that the same sample complexity bound can be obtained by standard SGD. Our work also calls into question some of the recent positive results in favour of momentum methods in the stochastic regime: such results either do not take into account all the terms involved in the error decomposition or have major flaws. Although some other results do question the superiority of momentum methods in the stochastic regime, the assumptions and settings that these works consider do not correspond to those in the positive results; the existing negative results either only consider small step-sizes and momentum parameters not close to 1, or apply to specific instances of linear regression. In contrast, our work shows that SHB and ASG cannot obtain an improvement in terms of sample complexity (for small neighbourhoods) over SGD for the entire family of quadratic optimization, and this holds for all step-sizes and all momentum parameters in [0, 1).
References
- Assran and Rabbat, (2020) Assran, M. and Rabbat, M. (2020). On the convergence of nesterov’s accelerated gradient method in stochastic settings. Proceedings of the 37th International Conference on Machine Learning, PMLR, 119:410–420.
- Borkar, (2008) Borkar, V. S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
- Can et al., (2019) Can, B., Gurbuzbalaban, M., and Zhu, L. (2019). Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 891–901. PMLR.
- Cauchy, (1847) Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d’équations simultanées. C. R. Acad. Sci. Paris, 25:536–538.
- Cheng et al., (2020) Cheng, X., Yin, D., Bartlett, P., and Jordan, M. (2020). Stochastic gradient and Langevin processes. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1810–1819. PMLR.
- Foucart, (2012) Foucart, S. (2012). University lecture, m504, lecture 6: Matrix norms and spectral radii.
- Horn and Johnson, (1990) Horn, R. and Johnson, C. (1990). Matrix analysis (Corrected reprint of the 1985 original). Cambridge University Press.
- Hu et al., (2009) Hu, C., Pan, W., and Kwok, J. (2009). Accelerated gradient methods for stochastic optimization and online learning. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.
- Jastrzębski et al., (2018) Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Storkey, A., and Bengio, Y. (2018). Three factors influencing minima in SGD.
- Kidambi et al., (2018) Kidambi, R., Netrapalli, P., Jain, P., and Kakade, S. M. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. CoRR, abs/1803.05591.
- Liu and Belkin, (2020) Liu, C. and Belkin, M. (2020). Accelerating sgd with momentum for over-parameterized learning. In International Conference on Learning Representations.
- Loizou and Richtárik, (2020) Loizou, N. and Richtárik, P. (2020). Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77:653–710.
- Mandt et al., (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research, 18:1–35.
- Mou et al., (2020) Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2947–2997. PMLR.
- Nesterov, (1983) Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 269:543–547.
- Nesterov, (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152.
- Nesterov, (2014) Nesterov, Y. (2014). Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition.
- Nitanda, (2014) Nitanda, A. (2014). Stochastic proximal gradient descent with acceleration techniques. In NIPS.
- Polyak, (1964) Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4:1–17.
- Polyak, (1987) Polyak, B. (1987). Introduction to Optimization. Optimization Software, NY.
- Qian, (1999) Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Netw., 12(1):145–151.
- Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA. PMLR.
- Yuan et al., (2016) Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1–66.
- Zhang et al., (2019) Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Appendix
Appendix A Comparison with recent works
A.1 Comparison with Mou et al. (2020)
Claim 1 in (Mou et al., 2020, p. 20) analyzes the asymptotic covariance of the heavy ball momentum algorithm (with Polyak averaging) and claims a correction term that satisfies:
where as in the decomposition of in Lemma 1 of Mou et al., (2020) and .
However, in the proof of Claim 1, we are not sure if the following bound holds, since the matrix is not symmetric:
Our calculation points towards the following bound:
We outline the proof for the univariate case, when for some . Then,
Observe that and therefore . Using this we have,
(6)
The second inequality follows as in Mou et al., (2020). Next, we analyze the dependence of on . Again, for simplicity, we consider the univariate case where . Let be diagonalizable. Therefore,
where and are the eigenvalues of . Let . We therefore have,
Solving the system of equations, we get:
Now, . Using the choice of as in Mou et al., (2020), we have:
For (which is the case where the momentum algorithm is claimed to improve the mixing time in Mou et al., (2020)), . As in the proof of Lemma 5 in Appendix D.2, and using the fact that , we can show that . Combining with (6), we have:
A similar analysis can be carried for the multivariate case to show that
The correction term in SGD is (See Mou et al., (2020), Claim 1). The stationary distribution for the momentum algorithm is larger than that of SGD when . Indeed if we enforce that the two asymptotic covariances are of the same size by choosing the step-size for momentum iterate in terms of the step-size of SGD, i.e.,
then we must choose . In Appendix C.1 of Mou et al., (2020), the mixing time of the momentum iterate is shown to be , while the mixing time of SGD is . When we choose , the mixing time of the momentum algorithm turns out to be the same as that of SGD. This behaviour is identical to what we observe in Theorem 2.2, where choosing the same step-size yields an improvement by a square-root factor.
A.2 Comparison with SHB, Can et al. (2019)
For strongly convex quadratic functions of the form:
where , is p.s.d, , and , Can et al., (2019) shows acceleration in Wasserstein distance by a factor of . The trace of the asymptotic covariance matrix is given by (See Appendix C.2 of Can et al., (2019)):
where is the step-size, is the momentum parameter, and is the eigenvalue of . We show that the asymptotic covariance matrix is worse compared to when no momentum is used, i.e., and the optimal step-size is used. Substituting these values for and , we get:
To compute the size of the stationary distribution with the iterates of heavy ball we set the step-size and momentum parameter and get:
The above calculation shows that the size of the asymptotic covariance matrix of SHB is worse by a factor of .
A.3 Comparison with Assran and Rabbat (2020)
Theorem 1 of Assran and Rabbat, (2020) is stated for a constant (not to be confused with in the current paper) and uses (a term that depends on ). However, in Corollary 1.1, the result is stated for a decreasing sequence for which is undefined (equation (26)).
Using the expressions provided in the Appendix of Assran and Rabbat, (2020), we find that the second term in Corollary 1.1 (the variance bound) is in fact of order instead of . Using equation (29) and the display below it, we see that for ASG, . It follows that
However, for SGD,
As a result, the variance bound for ASG is , while the variance bound for SGD is . Though the bias term in their result improves by a factor of , the variance term also worsens by a factor of .
Appendix B Proof of Theorem 2.1
Observe that
Choose such an with . If , . Therefore we assume which implies and .
Lemma 1.
For all , , if for small enough then the variance term is bounded by
Proof.
We see that
The sum of fractions in the bracket can be expressed as a single fraction with denominator . The numerator turns out to be:
First consider the case . The eigenvalues and in this case are real. Therefore, all the terms in the numerator and the denominator are real. Next, consider the case . In this case, the eigenvalues and are complex. Let and . Observe that
Therefore, all the terms in the numerator and the denominator are again real. Now observe that,
The second inequality follows because and therefore . Next,
The last inequality follows since . It follows that . Also, note that and therefore . It follows that . Using these bounds, taking and noting that , we get
(7)
Finally, consider the case when . In this case .
Let , where . Let . Since , we get and . We now solve for .
It follows that the following equations hold
Solving the above equations with , we get,
Observe that,
where the last inequality follows from (7). Therefore,
∎
Lemma 2.
For all , , if for small enough , then
Proof.
Before proceeding to the proof, we introduce a new function
to simplify calculations. First we consider the case when and are real. Since and are only functions of and , is well-defined.
The proof of the Lemma follows by showing . Note that since . Similarly, . Thus, . Since , it follows that . Thus,
We see that
Observe that the denominator in the above expression is positive. We consider the following two cases: (i) and (ii) . When , we can directly bound . Since and , we get .
Now consider the case . We have that,
When , for all , and thus . This implies that the numerator of the partial derivative is also positive, and we have that . Since the partial derivative is positive, is an increasing function of , and thus the maximum is achieved at and is given by:
Next, for , the eigenvalues and are complex. We have,
First we show that . Notice,
It follows that,
Observe that is a decreasing function of and therefore, the infimum is attained for . For this choice of ,
∎
Lemma 3.
If and , then .
Proof.
From we have . Then,
Thus, we have and . By re-arranging and choosing such that , we have
(8)
∎
Appendix C Proof of Theorem 2.2
We prove the theorem separately for (corresponding to SGD), (corresponding to SHB) and (corresponding to ASG).
Case-1: η = 0 (SGD)
The LSA-M iterate in (2) with η = 0 corresponds to:

x_{n+1} = x_n + α(b − A x_n + M_{n+1}).   (9)

Let x̃_n = x_n − x*. Then, equation (9) can be rewritten as:

x̃_{n+1} = (I − αA) x̃_n + α M_{n+1}.   (10)
Taking the square of the norm on both sides of the above equation, we obtain
Now we take expectation on both sides to obtain
Now, from Assumption 2.b, E[M_{k+1} | F_k] = 0; therefore, the second term becomes 0. Next, consider the term inside the double summation. First, consider the case j ≠ k. Without loss of generality, suppose j < k.
The last equality follows from Assumption 2.b. When j = k,
Substituting the above values, we get
We next use the following lemma to bound .
Lemma 4.
Let B be a square matrix and let ρ(B) denote its spectral radius. Then, for any δ > 0,
‖Bⁿ‖ ≤ C (ρ(B) + δ)ⁿ for all n ≥ 0, where C is a constant that depends on B and δ. Furthermore, if the eigenvalues of B are distinct, then ‖Bⁿ‖ ≤ ‖V‖ ‖V⁻¹‖ ρ(B)ⁿ, where V is a matrix that diagonalizes B.
Proof.
See Appendix D.1. ∎
We let . Using Assumption 1 and Lemma 4, we have
(11)
Here, V is the matrix in the Jordan decomposition of I − αA and σ_min(V) is the smallest singular value of V. We define the sequence as below:
Observe that and that the sequence satisfies
Therefore, we have
To ensure that , choose as follows:
We assume that and therefore .
Choose as below:
Then,
when the sample complexity is:
Case-2: (SHB)
LSA-M iterate with can be re-written as:
This can be re-written as:
Let us define
Note that and . It follows that,
The norm square of the above equation gives:
Taking expectation on both sides as well as using the fact that and , we have
Without loss of generality assume . Therefore, . As before, for a matrix , let denote the spectral radius of . Next, we compute . Consider the characteristic equation of P:
When and commute, we have the following formula for the determinant of a block matrix (Horn and Johnson, (1990)):
Using this, the characteristic equation of simplifies to:
We note that when , the LHS of the above equation becomes . Thus, can never be a solution of the characteristic equation of whenever . We now further simplify the characteristic equation of to a more convenient form:
The only zeros of the characteristic equation of a matrix are its eigenvalues. Let be the eigenvalue of with so that
The above is a quadratic equation in and the solution is given by
When the roots are complex, the absolute value of the eigenvalues of P is independent of the corresponding eigenvalue of A and equals √η:
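This step is easy to verify directly. In our notation (α the step-size, η the momentum parameter, λ an eigenvalue of A), the per-eigenvalue characteristic equation is the quadratic z² − (1 + η − αλ)z + η = 0; whenever its roots are complex, they are conjugate with product η, so each has modulus √η regardless of λ. A quick stdlib check (the parameter values are illustrative):

```python
import cmath
import math

alpha, eta = 0.05, 0.9
for lam in [0.5, 1.0, 2.0]:
    c = 1.0 + eta - alpha * lam
    disc = c * c - 4.0 * eta
    # complex-root regime: alpha*lam in ((1 - sqrt(eta))^2, (1 + sqrt(eta))^2)
    assert disc < 0
    r1 = (c + cmath.sqrt(disc)) / 2.0
    r2 = (c - cmath.sqrt(disc)) / 2.0
    # both roots of z^2 - c z + eta have modulus sqrt(eta), independent of lam
    assert abs(abs(r1) - math.sqrt(eta)) < 1e-12
    assert abs(abs(r2) - math.sqrt(eta)) < 1e-12
print("spectral radius =", math.sqrt(eta))
```

This is why, in this regime, the contraction rate of SHB is governed by √η alone, with the step-size entering only through the location of the complex-root window.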
To ensure that , we must have
For the spectral radius of to be , the above must hold for all . We choose as:
and as:
Observe that if we choose the momentum parameter , then has two repeated roots since . To ensure that does not have any repeated root, we choose the momentum parameter . Therefore, . Using Assumption 1 and Lemma 4, we get
However, unlike in Case-1, here the constant is not independent of and . The following lemma specifies an upper bound on .
Lemma 5.
, where is as defined in (11).
Proof.
See Appendix D.2. ∎
We define the sequence as follows
Observe that , and that satisfies
Therefore, we have
To ensure that , we need to choose such that
Next, using Lemma 5, the above can be ensured by choosing such that
The recursion for the sequence then follows
Unrolling the recursion, we get
Choose as below:
Then,
when is as follows:
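Before moving on, the eigenvalue computation in Case-2 can be checked numerically. The sketch below assumes the standard heavy-ball recursion on a quadratic, written mode-by-mode: for each eigenvalue mu of the driving matrix, the iterate pair evolves under a 2x2 companion matrix whose characteristic polynomial is the quadratic derived above. In the complex-root regime, the two roots are conjugates with product equal to the momentum parameter, so their common modulus is its square root, independent of the step size and of mu. (The companion form and the names alpha, eta, mu, along with the specific values, are illustrative assumptions.)

```python
import numpy as np

def shb_mode_matrix(alpha, eta, mu):
    """Heavy-ball update for one eigen-mode mu of the driving matrix:
    x_{n+1} = (1 + eta - alpha*mu) * x_n - eta * x_{n-1}, stacked as
    [x_{n+1}; x_n] = P @ [x_n; x_{n-1}]."""
    return np.array([[1.0 + eta - alpha * mu, -eta],
                     [1.0, 0.0]])

alpha, eta = 0.1, 0.5
# For these mu, the discriminant (1 + eta - alpha*mu)^2 - 4*eta is negative,
# so the eigenvalues are complex conjugates with product eta.
moduli = [abs(np.linalg.eigvals(shb_mode_matrix(alpha, eta, mu))[0])
          for mu in (1.0, 2.0, 5.0)]
# Each entry of moduli equals sqrt(eta), regardless of alpha and mu.
```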
Case-3: (ASG)
The proof proceeds along the same lines as Case-2. It is easy to see that the following relation holds with a modified definition of the matrix .
where,
As in the previous case, we compute the eigenvalues of . The characteristic equation of is given by:
As before, this can be simplified to the following equation:
We now further simplify the characteristic equation of to a more convenient form:
Progressing as in the previous case, we get that the eigenvalues of satisfy:
When , we have that
This implies that,
(12)
To ensure that , we must have
We assume , and therefore holds for all . Using this, the above relation simplifies to:
For the above to hold, we must have that
Since the above must hold, we choose as:
As in Case-2, if we choose the momentum parameter , then has a repeated root. To ensure that has no repeated roots, we choose the momentum parameter
Using (12), we get , which is the same as in Case-2, and therefore we have:
The above expression is the same as that in Case-2, except for the term . Since the matrix is different when , the quantity in Case-2 need not equal that in Case-3. However, we next show that satisfies the same upper bound as in Case-2 (Lemma 5). Towards this, we have the following lemma:
Lemma 6.
, where is as defined in (11).
Proof.
See Appendix D.3. ∎
Thereafter, we can proceed exactly as in Case-2 to prove the theorem.
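As in Case-2, the per-mode eigenvalue relation for ASG can be verified numerically. The sketch assumes the standard Nesterov recursion on a quadratic, which for each eigen-mode mu yields a 2x2 companion matrix; in the complex-root regime, the squared modulus of the eigenvalues equals the product of the roots, which, unlike in Case-2, depends on both the step size and the mode. (The companion form and the names alpha, eta, mu are assumptions for illustration.)

```python
import numpy as np

def asg_mode_matrix(alpha, eta, mu):
    """Nesterov-accelerated update for one eigen-mode mu of the driving matrix:
    x_{n+1} = (1 + eta)(1 - alpha*mu) x_n - eta (1 - alpha*mu) x_{n-1}."""
    r = 1.0 - alpha * mu
    return np.array([[(1.0 + eta) * r, -eta * r],
                     [1.0, 0.0]])

alpha, eta, mu = 0.1, 0.9, 1.0
lam = np.linalg.eigvals(asg_mode_matrix(alpha, eta, mu))
# Complex-root regime: |lambda|^2 equals the product of the roots,
# eta * (1 - alpha*mu), so the modulus depends on alpha and mu.
modulus = abs(lam[0])
```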
Appendix D Proofs of Auxiliary Lemmas
D.1 Proof of Lemma 4
As in Foucart, (2012), we first construct a matrix norm such that . Consider the Jordan canonical form of
We define
where
Therefore,
where,
We define the matrix norm as
where is the matrix norm induced by the vector -norm. Using the fact that , where is the -th entry of , we have
It can easily be seen that . Therefore, it follows that
where denotes the smallest singular value and we have used the fact that repeatedly. For , and defined as the size of the largest Jordan block of
We conclude the first half of the lemma by defining as .
In the case that the eigenvalues are distinct, and defined above becomes independent of . Moreover, when all eigenvalues are distinct, each Jordan block is and the second half follows.
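The scaling trick in the proof above can be illustrated numerically for a single Jordan block: conjugating by the diagonal matrix with entries eps, eps^2, ..., eps^d shrinks the superdiagonal ones to eps while fixing the diagonal, so the induced infinity-norm lands within eps of the spectral radius. (The block size and numerical values below are illustrative.)

```python
import numpy as np

lam, eps, d = 0.9, 0.01, 4

# A single d x d Jordan block with eigenvalue lam.
J = lam * np.eye(d) + np.diag(np.ones(d - 1), 1)

# Diagonal scaling D = diag(eps, eps^2, ..., eps^d): conjugation shrinks the
# superdiagonal 1's to eps while leaving the diagonal untouched.
D = np.diag(eps ** np.arange(1, d + 1))
M = np.linalg.inv(D) @ J @ D

norm_inf = np.linalg.norm(M, np.inf)  # induced infinity-norm: max abs row sum
# norm_inf equals lam + eps = rho(J) + eps, up to floating-point error.
```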
D.2 Proof of Lemma 5
Let be the matrix in the Jordan decomposition of , i.e., , where is a diagonal matrix with the eigenvalues of as its diagonal elements. Then,
where is the zero matrix of dimension . For ease of exposition, suppose is a matrix with eigenvalues and . Then
Suppose is the elementary matrix associated with exchanging rows 2 and 3. It is easy to see that and that,
where,
Suppose, and,
Here and , where . Solving the above equation we get,
Setting ,
For a general matrix , using a similar procedure, it can be shown that
where and are permutation matrices that transform the matrix between them to a block diagonal matrix. Let . Therefore,
In order to simplify the expression for , we require the following lemma:
Lemma 7.
For all invertible matrices of order , the following identity holds:
where denote the singular values of .
Proof.
By definition, are the eigenvalues of . Then, the eigenvalues of are . Note that and are similar since . Consequently, and have the same set of eigenvalues, and we find that the singular values of are
Thus,
and the result follows. ∎
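Lemma 7 is straightforward to confirm numerically: the singular values of the inverse are the reciprocals of the singular values of the original matrix. A minimal check (the random test matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))  # a generic, hence invertible, matrix

s = np.linalg.svd(B, compute_uv=False)                     # singular values of B
s_inv = np.linalg.svd(np.linalg.inv(B), compute_uv=False)  # of B^{-1}

# After sorting, the singular values of B^{-1} coincide with the
# reciprocals of the singular values of B.
```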
Using Lemma 7, we have
where is as defined in (11). Now, for any matrix of order ,
where denotes the Frobenius norm. Using the above inequality, and . Next we lower bound .
For a complex number , observe that = . Now,
(13)
Using ,
Using this, we get , and therefore
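For reference, the comparison between the spectral and Frobenius norms invoked above is the standard one, and it is easy to verify numerically (the random test matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))

spec = np.linalg.norm(B, 2)      # spectral norm: largest singular value
frob = np.linalg.norm(B, 'fro')  # Frobenius norm: sqrt of sum of squared entries

# ||B||_2 <= ||B||_F <= sqrt(n) * ||B||_2 holds for every n x n matrix B.
```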
D.3 Proof of Lemma 6
The proof of this lemma proceeds exactly as the proof of Lemma 5. The only difference is that one needs to lower bound for . We define and therefore . Using this, we get
The expression for is exactly the same as in the proof of Lemma 5 (cf. (13)). Thereafter, we proceed as in the proof of Lemma 5 to prove the claim.