
Does Momentum Help in Stochastic Optimization?
A Sample Complexity Analysis.

Swetha Ganesh
Department of Computer Science and Automation
Indian Institute of Science, Bangalore
[email protected]

Rohan Deb
Department of Computer Science and Automation
Indian Institute of Science, Bangalore
[email protected]

Gugan Thoppe
Department of Computer Science and Automation
Indian Institute of Science, Bangalore
[email protected]

Amarjit Budhiraja
Department of Statistics and Operations Research
University of North Carolina at Chapel Hill
[email protected]

Equal contribution.
Abstract

Stochastic Heavy Ball (SHB) and Nesterov's Accelerated Stochastic Gradient (ASG) are popular momentum methods in stochastic optimization. While the benefits of such acceleration ideas in deterministic settings are well understood, their advantages in stochastic optimization are still unclear. In fact, in some specific instances, it is known that momentum does not help in the sample complexity sense. Our work shows that a similar outcome holds for the whole of quadratic optimization. Specifically, we obtain a lower bound on the sample complexity of SHB and ASG for this family and show that the same bound can be achieved by vanilla SGD. We note that there exist results claiming the superiority of momentum-based methods in quadratic optimization, but these are based on one-sided or flawed analyses.

1 Introduction

In deterministic convex optimization (when one has access to exact gradients), Gradient Descent (GD) is a popular optimization algorithm Cauchy, (1847). However, in practice, exact gradients are not available and one has to rely on noisy observations. This brings forth the idea of Stochastic Gradient Descent (SGD). For GD, two classic momentum ideas, Heavy Ball (HB) Polyak, (1964, 1987); Qian, (1999) and Nesterov's Accelerated Gradient (NAG) Nesterov, (1983, 2014, 2005), are used to speed up its convergence. Naturally, these momentum-based methods and their variants have also gained significant interest in stochastic settings Sutskever et al., (2013); Nitanda, (2014); Hu et al., (2009). Our work shows that the stochastic variants of HB and NAG, i.e., the Stochastic Heavy Ball (SHB) and Nesterov's Accelerated Stochastic Gradient (ASG), are not better than vanilla SGD for the whole of quadratic optimization. Specifically, we show that the sample complexities (the number of iterations required to reach an $\epsilon$-boundary of the solution) of SHB, ASG, and SGD are all of the same order.

More formally, in (deterministic) quadratic optimization with condition number $\kappa$, GD converges to an $\epsilon$-optimal solution in $\mathcal{O}(\kappa\log\frac{1}{\epsilon})$ iterations, while both HB and NAG only need $\mathcal{O}(\sqrt{\kappa}\log\frac{1}{\epsilon})$ steps (throughout this work, we only consider algorithms with constant step-sizes; these are popular in practice and ensure faster convergence). A pertinent question then is: "Does one get such advantages even in stochastic settings?" The current literature is, however, divided on whether SHB and ASG are better than SGD.

Some recent results Loizou and Richtárik, (2020); Mou et al., (2020); Assran and Rabbat, (2020); Can et al., (2019) claim that these momentum methods are better than SGD in quadratic or least-squares settings. However, Loizou and Richtárik, (2020) needs a strong assumption on the noise, which (Kidambi et al., 2018, Section 6) claims is information-theoretically impossible even in the simple least squares regression problem. The other results are either based on a one-sided analysis Can et al., (2019) (which only considers the bias while ignoring the variance) or have a major flaw Mou et al., (2020); Assran and Rabbat, (2020); see Appendix A.1–A.3 for details.

On the other hand, there are also a few recent negative results on these momentum methods. For general convex optimization, Yuan et al., (2016) shows that SHB and ASG are equivalent to SGD with a rescaled step-size. However, this result requires that the step-size be sufficiently small and the momentum parameter be bounded away from $1$. In Kidambi et al., (2018); Liu and Belkin, (2020), for one specific instance of least squares regression with vanishing noise, it is shown that the performance of SHB and ASG cannot be better than that of SGD. Finally, Zhang et al., (2019) considers SHB for quadratic objectives in the same noisy setting as our work and provides upper bounds on the rate at which the objective function decreases. They also argue that rescaled SGD performs as well as SHB and demonstrate this empirically, but fall short of rigorously establishing a lower bound that supports their claim.

The current literature can thus be summarized as follows.

Research Gap: Existing works on SHB and ASG fall into two groups: i.) positive, where the results claim advantages of these methods over SGD, and ii.) negative, where the results claim the opposite. Results in the positive group either have a one-sided Can et al., (2019) or a flawed analysis Mou et al., (2020); Assran and Rabbat, (2020), while the ones in the negative group apply to special cases (e.g., Yuan et al., (2016) requires sufficiently small step-sizes and momentum parameters bounded away from $1$, while Kidambi et al., (2018); Liu and Belkin, (2020) only apply to one instance of a least squares problem with vanishing noise).

Key Contribution: Our work belongs to the negative group and extends the claims in Kidambi et al., (2018); Liu and Belkin, (2020) to more general settings. Specifically, for all quadratic optimization problems with persistent noise (noise variance bounded away from zero) and any small $\epsilon>0$, we show that the number of iterations needed by SHB and ASG to find an $\epsilon$-optimal solution is of the same order as that of SGD. More technically, we obtain a lower bound on the sample complexities of SHB and ASG (Theorem 2.1) and show that it is of the same order as the corresponding upper bound for SGD (Theorem 2.2). We emphasize that our result applies for all positive step-sizes and momentum parameters in the range $[0,1]$, unlike in Yuan et al., (2016). Moreover, our proof techniques are significantly different from those used in existing lower bounds such as Kidambi et al., (2018); Liu and Belkin, (2020). This is because, under persistent noise, the expected error contains an additional term that cannot be accounted for by their analyses (see Remark 3).

2 Main Results

We state our main results here, which provide lower and upper bounds on the sample complexities of SHB and ASG. We use these bounds along with those of SGD to show that all these methods need a similar effort to find an $\epsilon$-optimal solution. We emphasize that our results apply to the whole family of quadratic optimization.

To provide a unified analysis, we look at a generic update rule that includes as special cases SHB, ASG, and SGD. In particular, we consider the linear stochastic approximation with momentum (LSA-M) iterate given by

\begin{aligned}
x_{n+1} &= x_{n}+\alpha(b-Ax_{n}+M_{n+1})+\eta(I-\alpha\beta A)(x_{n}-x_{n-1}) &\quad (1)\\
&= x_{n}+\alpha\big(b-A(x_{n}+\eta\beta(x_{n}-x_{n-1}))+M_{n+1}\big)+\eta(x_{n}-x_{n-1}), &\quad (2)
\end{aligned}

where $b\in\mathbb{R}^{d}$, $A\in\mathbb{R}^{d\times d}$, and $M_{n+1}\in\mathbb{R}^{d}$ is noise. Note that we do not presume $A$ is symmetric, and this is what makes the above update a stochastic approximation. Also, when $A$ is symmetric, LSA-M subsumes as special cases SGD ($\eta=0$ in (2)), SHB ($\beta=0$ in (2)), and ASG ($\beta=1$ in (2)).
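To make the update concrete, the following Python sketch (our own illustration, not part of the paper's analysis) implements (2) with Gaussian noise standing in for $M_{n+1}$; setting $\eta=0$, $\beta=0$, or $\beta=1$ recovers SGD, SHB, and ASG respectively when $A$ is symmetric. The noise scale `sigma` is an assumption made only for this sketch.

import numpy as np

def lsa_m(A, b, alpha, eta, beta, n_steps, sigma=1.0, x0=None, seed=0):
    """Run the LSA-M update (2) with i.i.d. Gaussian noise as a stand-in for M_{n+1}."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    x_prev = x_curr = np.zeros(d) if x0 is None else x0.copy()
    for _ in range(n_steps):
        noise = sigma * rng.standard_normal(d)               # plays the role of M_{n+1}
        look_ahead = x_curr + eta * beta * (x_curr - x_prev)  # Nesterov-style look-ahead point
        x_next = (x_curr
                  + alpha * (b - A @ look_ahead + noise)
                  + eta * (x_curr - x_prev))                  # heavy-ball momentum term
        x_prev, x_curr = x_curr, x_next
    return x_curr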

With regard to $A$ and the noise sequence $(M_{n})$, we make the following assumptions.

Assumption 1.

(Driving matrix feature) $A$ is diagonalizable and all its eigenvalues $\{\lambda_{i}(A)\}_{i=1}^{d}$ are real and positive.

Assumption 1 holds trivially when $A$ is a symmetric positive definite matrix. Consequently, our results apply to all members of the quadratic optimization family.

Also, under the above assumption, one would expect the iterates in (2) to go to a neighborhood of $x^{*}:=A^{-1}b$.

We next state two assumptions for $(M_{n})$: the first is used in Theorem 2.1, while the other is used in Theorem 2.2 and Corollary 2.3.

Assumption 2.a.

(Noise attributes for Theorem 2.1) $(M_{n})$ is a martingale difference sequence w.r.t. the filtration $(\mathcal{F}_{n})$, where $\mathcal{F}_{n}=\sigma(x_{m},M_{m};m\leq n)$. Further, $\exists K>0$ such that $\mathbb{E}[M_{n+1}M_{n+1}^{T}|\mathcal{F}_{n}]\succeq KI_{d}$ a.s. for all $n\geq 0$.

The notation $A\succeq B$ for $A,B\in\mathbb{R}^{d\times d}$ is used above to mean that $A-B$ is positive semi-definite. Also, $I_{d}$ denotes the identity matrix.

Assumption 2.b.

(Noise attributes for Theorem 2.2) $(M_{n})$ is a martingale difference sequence w.r.t. the filtration $(\mathcal{F}_{n})$, where $\mathcal{F}_{n}=\sigma(x_{m},M_{m};m\leq n)$. Further, $\exists K\geq 0$ such that $\mathbb{E}[\|M_{n+1}\|^{2}|\mathcal{F}_{n}]\leq K(1+\|x_{n}-x^{*}\|^{2})$ a.s. for all $n\geq 0$.

Assumptions 2.a and 2.b are standard Mandt et al., (2017); Jastrzębski et al., (2018); Cheng et al., (2020); Borkar, (2008). The first of these holds if and only if all the eigenvalues of $\mathbb{E}[M_{n+1}M_{n+1}^{T}|\mathcal{F}_{n}]$ are bounded from below by $K$, i.e., the noise is persistent (non-vanishing) in all directions. On the other hand, Assumption 2.b requires that the trace of $\mathbb{E}[M_{n+1}M_{n+1}^{T}|\mathcal{F}_{n}]$ be bounded from above. This bound can scale with $\|x_{n}-x^{*}\|$ and need not vanish near $x^{*}$.

Next, we define the sample complexity, which quantifies the effort required by LSA-M to obtain an $\epsilon$-close solution to $x^{*}$.

Definition 1.

(Sample Complexity). The sample complexity of (2) is the minimum number of iterations $n_{0}$ such that the expected error $\mathbb{E}[\|x_{n}-x^{*}\|^{2}]\leq\epsilon$ $\forall n\geq n_{0}$.

Method: SGD
  $\beta$: --,  $\eta$: $0$,
  $\alpha$: $\min\Big(\frac{\lambda_{\min}(A)}{\frac{3}{4}\lambda_{\min}(A)^{2}+C^{2}K},\ \frac{\epsilon\lambda_{\min}(A)}{4C^{2}K},\ \frac{2}{\lambda_{\max}(A)+\lambda_{\min}(A)}\Big)$

Method: SHB
  $\beta$: $0$,  $\eta$: $\big(1-\frac{\sqrt{\alpha\lambda_{\min}(A)}}{2}\big)^{2}$,
  $\alpha$: $\min\Big(\big(\frac{\lambda_{\min}(A)^{3/2}}{\frac{3}{8}\lambda_{\min}(A)^{2}+25C^{2}K}\big)^{2},\ \big(\frac{\epsilon\lambda_{\min}(A)^{3/2}}{200C^{2}K}\big)^{2},\ \big(\frac{2}{\sqrt{\lambda_{\min}(A)}+\sqrt{\lambda_{\max}(A)}}\big)^{2}\Big)$

Method: ASG
  $\beta$: $1$,  $\eta$: $\frac{\big(1-\frac{\sqrt{\alpha\lambda_{\min}(A)}}{2}\big)^{2}}{1-\alpha\lambda_{\min}(A)}$,
  $\alpha$: $\min\Big(\big(\frac{\lambda_{\min}(A)^{3/2}}{\frac{3}{8}\lambda_{\min}(A)^{2}+25C^{2}K}\big)^{2},\ \big(\frac{\epsilon\lambda_{\min}(A)^{3/2}}{200C^{2}K}\big)^{2},\ \frac{1}{\lambda_{\max}(A)}\Big)$

Table 1: Parameter choices for Theorem 2.2. Here $C=1$ when the matrix $A$ is symmetric and $C=\frac{\sqrt{d}}{\sigma_{\min}(S)\sigma_{\min}(S^{-1})}$ when $A$ is not symmetric, where $\sigma_{\min}(\cdot)$ denotes the smallest singular value and $S$ is the matrix that diagonalizes $A$, i.e., $S^{-1}AS=D$, a diagonal matrix. When $A$ is symmetric, the three parameter choices indeed correspond to SGD, SHB, and ASG. With a slight abuse of notation, we keep the same naming convention even when the driving matrix $A$ is not symmetric.
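For convenience, the following sketch (a hypothetical helper, with $\lambda_{\min}$, $\lambda_{\max}$, $K$, $\epsilon$, and $C$ supplied by the user) evaluates the step-size and momentum choices listed in Table 1.

import numpy as np

def table1_parameters(lam_min, lam_max, K, eps, C=1.0):
    """Evaluate the (beta, eta, alpha) choices of Table 1; C = 1 for symmetric A."""
    alpha_sgd = min(lam_min / (0.75 * lam_min**2 + C**2 * K),
                    eps * lam_min / (4 * C**2 * K),
                    2.0 / (lam_max + lam_min))
    alpha_shb = min((lam_min**1.5 / (0.375 * lam_min**2 + 25 * C**2 * K))**2,
                    (eps * lam_min**1.5 / (200 * C**2 * K))**2,
                    (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max)))**2)
    eta_shb = (1 - np.sqrt(alpha_shb * lam_min) / 2)**2
    alpha_asg = min((lam_min**1.5 / (0.375 * lam_min**2 + 25 * C**2 * K))**2,
                    (eps * lam_min**1.5 / (200 * C**2 * K))**2,
                    1.0 / lam_max)
    eta_asg = (1 - np.sqrt(alpha_asg * lam_min) / 2)**2 / (1 - alpha_asg * lam_min)
    return {"SGD": dict(beta=None, eta=0.0, alpha=alpha_sgd),
            "SHB": dict(beta=0.0, eta=eta_shb, alpha=alpha_shb),
            "ASG": dict(beta=1.0, eta=eta_asg, alpha=alpha_asg)}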

To enable easy comparison between different algorithms, we shall look at the order of their sample complexities. Towards that, we use the notation $n_{0}\in\Theta(t)$ to mean that there exist constants $c_{1}$ and $c_{2}$ (independent of $t$) such that $c_{1}t\leq n_{0}\leq c_{2}t$. The notation $\tilde{\Theta}(t)$ has a similar meaning but also hides the dependence on logarithmic terms.

Theorem 2.1 (Lower bound on sample complexity).

Consider the LSA-M iterate in (2) and suppose Assumptions 1 and 2.a hold. Let $\epsilon>0$ be small enough and $\lambda_{\min}(A)=\min_{i}\lambda_{i}(A)$. Then, for any choice of $\alpha>0$, $\beta\in[0,1]$, and $\eta\in[0,1]$, the expected error satisfies $\mathbb{E}[\|x_{n_{0}}-x^{*}\|^{2}]\geq\epsilon$ for $n_{0}=\frac{K}{64\epsilon\lambda_{\min}(A)^{2}}\log\big(\frac{\|x_{0}-x^{*}\|^{2}}{\epsilon}\big)\in\tilde{\Theta}\big(\frac{K}{\epsilon\lambda_{\min}(A)^{2}}\big)$.
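Since the bound is explicit, it can be evaluated directly; the small helper below (our own illustration, assuming the constants $K$, $\epsilon$, $\lambda_{\min}(A)$, and $\|x_{0}-x^{*}\|^{2}$ are known) computes the $n_{0}$ of Theorem 2.1.

import numpy as np

def lower_bound_n0(K, eps, lam_min, init_err_sq):
    """Theorem 2.1: n_0 = K/(64*eps*lam_min^2) * log(||x_0 - x*||^2 / eps)."""
    return K / (64 * eps * lam_min**2) * np.log(init_err_sq / eps)

# Example: K = 1, eps = 1e-2, lam_min = 1, ||x_0 - x*||^2 = 10
print(int(np.ceil(lower_bound_n0(1.0, 1e-2, 1.0, 10.0))))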

Remark 1.

As stated below (2), LSA-M includes SHB and Nesterov’s ASG method as special cases and, hence, the above result directly applies to them. In fact, this is the first lower bound on SHB and ASG’s sample complexities in quadratic optimization with persistent noise.

Remark 2.

The value of $n_{0}$ above is obtained by optimizing over all possible values of $\alpha,\beta,\eta$ and hence does not depend on their specific choices. This implies that there is no "smart" combination of $\alpha$, $\beta$, and $\eta$ that gives a better sample complexity (for $\epsilon$ small enough).

Remark 3.

The lower bounds in Kidambi et al., (2018) and Liu and Belkin, (2020) are obtained by viewing the expected error in the SHB and ASG iterates for least squares as update rules of the form $z_{n+1}=Pz_{n}$ for some matrix $P$; see (Kidambi et al., 2018, Appendix A, p. 16) and (Liu and Belkin, 2020, Appendix C, p. 12). In particular, they obtain bounds on the eigenvalues of $P$ to get the desired claim. In contrast, the error relations for the SHB and ASG methods in our setup (quadratic optimization with persistent noise) have the form $z_{n+1}=Pz_{n}+\alpha W_{n}$ for some matrix $P$ and vector $W_{n}$ (cf. (3)). This forces us to develop a new proof technique that jointly looks at both these terms and shows that at least one of them remains larger than $\epsilon$ for the choice of $n_{0}$ given in Theorem 2.1.

We next state our upper bound on the sample complexity in Theorem 2.2 and Corollary 2.3. Similar bounds already exist in the literature when $A$ is assumed to be symmetric and the noise is assumed to be i.i.d. with bounded variance (Can et al., (2019); Zhang et al., (2019)). Here, we show that a similar upper bound holds under more general settings: i.) $A$ is not necessarily symmetric but satisfies Assumption 1, and ii.) the noise is a martingale difference sequence satisfying Assumption 2.b.

Theorem 2.2.

Consider the LSA-M iterate and let Assumptions 1 and 2.b hold. Then, $\forall\epsilon>0$, there exists a choice of $\alpha$, $\beta$, and $\eta$ in LSA-M (see Table 1 for exact values) such that the expected error $\mathbb{E}[\|x_{n}-x^{*}\|^{2}]\leq\epsilon$, $\forall n>n_{0}$, where

  (i) $n_{0}\in\tilde{\Theta}\big(\frac{1}{\alpha\lambda_{\min}(A)}\big)$, when $\eta=0$;

  (ii) $n_{0}\in\tilde{\Theta}\big(\frac{1}{\sqrt{\alpha\lambda_{\min}(A)}}\big)$, when $\eta>0$.

From Table 1, we see that in each case $\alpha$ is the minimum of three terms. The first term arises due to the unbounded noise (Assumption 2.b), the second due to the target neighborhood $\epsilon$, and the third from the optimal choice of step-size in the deterministic case. Since the bound provided in Theorem 2.2 is in terms of $\alpha$, the minimum of the three terms dictates the sample complexity. If $\epsilon$ is small enough that the second term is the minimum, we obtain the following bound:

Corollary 2.3 (Upper bound on sample complexity).

Consider the LSA-M iterate and suppose Assumptions 1 and 2.b hold. Then, for the choice of parameters in Table 1 and $\forall\epsilon>0$ small enough, $\exists n_{0}\in\tilde{\Theta}\big(\frac{K}{\epsilon\lambda_{\min}(A)^{2}}\big)$ such that $\mathbb{E}[\|x_{n}-x^{*}\|^{2}]\leq\epsilon$, $\forall n\geq n_{0}$.

Remark 4.

From Corollary 2.3, we see that for small enough $\epsilon>0$ the upper bounds on the sample complexity of SGD, SHB, and ASG match their lower bound in Theorem 2.1. In particular, since an upper bound on the sample complexity of SGD matches a lower bound for SHB and ASG, these methods cannot outperform SGD from a sample complexity perspective.

Remark 5.

When the noise is assumed to be bounded, the first term in the choice of $\alpha$ in Table 1 vanishes for all three parameter choices. Under such an assumption, if $\epsilon$ is large enough or $K$ is small enough that the third term in the choice of $\alpha$ is the minimum, then the sample complexity of both SHB and ASG is better than that of SGD. We emphasize that such improvements are lost when the noise variance is large or the neighbourhood under consideration is small.
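As an aside, Definition 1 can be approximated empirically. The sketch below (our own Monte Carlo proxy, assuming a univariate quadratic with $A=\lambda$ and i.i.d. $N(0,K)$ noise) returns the first iteration at which the averaged squared error drops below $\epsilon$; running it with the SGD, SHB, and ASG parameter choices of Table 1 is one way to observe the behaviour described in Remarks 4 and 5.

import numpy as np

def empirical_sample_complexity(lam, b, alpha, eta, beta, K, eps,
                                x0=10.0, max_iter=50000, trials=2000, seed=0):
    """First n at which the Monte Carlo estimate of E||x_n - x*||^2 falls below eps."""
    rng = np.random.default_rng(seed)
    x_star = b / lam
    x_prev = np.full(trials, x0)
    x_curr = np.full(trials, x0)
    for n in range(1, max_iter + 1):
        noise = np.sqrt(K) * rng.standard_normal(trials)
        look_ahead = x_curr + eta * beta * (x_curr - x_prev)
        x_next = (x_curr + alpha * (b - lam * look_ahead + noise)
                  + eta * (x_curr - x_prev))
        x_prev, x_curr = x_curr, x_next
        if np.mean((x_curr - x_star)**2) <= eps:
            return n
    return None  # target accuracy not reached within max_iter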

3 Proof Outline

In this section, we provide the proof outlines for Theorem 2.1 and Theorem 2.2. Section 3.1 outlines the key steps used to prove the lower bound on the sample complexity; the proofs of all the intermediate lemmas are deferred to Appendix B. In Section 3.2, we provide a sketch of the proof of Theorem 2.2. The detailed proof can be found in Appendix C.

3.1 Proof Outline for Theorem 2.1

For simplicity, let $d=1$ and $A\equiv\lambda\in\mathbb{R}$. The general multivariate case is handled subsequently. We begin by defining the transformed iterates $\tilde{x}_{n}=x_{n}-x^{*}$ and rewrite (2) as:

\tilde{X}_{n}=P\tilde{X}_{n-1}+\alpha W_{n}, \qquad (3)

where $\tilde{X}_{n}\triangleq\begin{bmatrix}\tilde{x}_{n}\\ \tilde{x}_{n-1}\end{bmatrix}$, $W_{n}\triangleq\begin{bmatrix}M_{n}\\ 0\end{bmatrix}$, and $P\triangleq\begin{bmatrix}1-\alpha\lambda+\eta(1-\alpha\beta\lambda)&-\eta(1-\alpha\beta\lambda)\\ 1&0\end{bmatrix}$. For ease of exposition, we first consider the case $\beta=0$, corresponding to SHB, and treat the general case $\beta\in[0,1]$ later. The objective is to find an $n$ such that the error $\mathbb{E}[\|\tilde{X}_{n}\|^{2}]$ is lower bounded by $\epsilon$. Towards this, we decompose the error expression as follows:

\begin{aligned}
\mathbb{E}[\|\tilde{X}_{n}\|^{2}] &= \mathbb{E}\Big[\Big\|P^{n}\tilde{X}_{0}+\alpha\sum_{j=0}^{n-1}P^{n-1-j}W_{j+1}\Big\|^{2}\Big]\\
&= \|P^{n}\tilde{X}_{0}\|^{2}+\alpha^{2}\sum_{j=0}^{n-1}\mathbb{E}\big[\|P^{n-1-j}W_{j+1}\|^{2}\big]\\
&\geq \|P^{n}\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}, \qquad (4)
\end{aligned}

where $e_{1}=\begin{pmatrix}1\\ 0\end{pmatrix}$. Here, the second equality follows since $(W_{j})$ is a martingale difference sequence, and the inequality follows from Assumption 2.a. The first term corresponds to the bias and the second term corresponds to the variance.
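Before proceeding, the decomposition (4) can be checked numerically. The sketch below (outside the proof, with parameter values chosen only for illustration) simulates the recursion (3) with i.i.d. $N(0,K)$ noise, for which the inequality in (4) holds with equality, and compares the Monte Carlo estimate of $\mathbb{E}[\|\tilde{X}_{n}\|^{2}]$ with the bias-plus-variance expression.

import numpy as np

alpha, lam, eta, K, n = 0.1, 1.0, 0.5, 1.0, 30
P = np.array([[1 - alpha * lam + eta, -eta], [1.0, 0.0]])  # beta = 0 (SHB) companion matrix
e1 = np.array([1.0, 0.0])
X0 = np.array([1.0, 1.0])                                   # \tilde{X}_0

# Monte Carlo estimate of E||X_n||^2
rng = np.random.default_rng(0)
trials, mc = 20000, 0.0
for _ in range(trials):
    X = X0.copy()
    for _ in range(n):
        X = P @ X + alpha * np.array([np.sqrt(K) * rng.standard_normal(), 0.0])
    mc += X @ X
mc /= trials

# Bias + variance terms of (4)
bias = np.linalg.matrix_power(P, n) @ X0
var, v = 0.0, e1.copy()
for _ in range(n):
    var += alpha**2 * K * (v @ v)
    v = P @ v
print(mc, bias @ bias + var)   # the two numbers should agree up to Monte Carlo error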

Let $n=\frac{K}{64\epsilon\lambda^{2}}\log\big(\frac{\|\tilde{X}_{0}\|^{2}}{\epsilon}\big)$. We will show that, for this choice of $n$, the expected error is at least $\epsilon$ for all $\alpha>0$ and $\eta\in[0,1]$. From (4):

\mathbb{E}\|\tilde{X}_{n}\|^{2}\geq\|P^{n}\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}.

If $\alpha$ is such that $\|P^{n}\tilde{X}_{0}\|^{2}\geq\epsilon$, then the claim immediately follows for this choice of $\alpha$ and $\eta$. We now consider the case where $\alpha$ and $\eta$ are such that $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$. Here we show that, for such a choice of $\alpha$ and $\eta$, the variance term is necessarily greater than $\epsilon$. Let $\mu_{+}$ and $\mu_{-}$ be the eigenvalues of $P$.

Lemma 1.

For all $\alpha>0$, $\eta\in[0,1]$, if $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$ for small enough $\epsilon>0$, then the variance term is bounded below as

\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}\geq\frac{\alpha^{2}K}{2(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\eta)}.
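A quick numerical spot check of Lemma 1 (our own illustration, not a proof) for one parameter choice with real eigenvalues:

import numpy as np

alpha, lam, eta, K, n = 0.1, 1.0, 0.3, 1.0, 2000
P = np.array([[1 - alpha * lam + eta, -eta], [1.0, 0.0]])
mu_plus, mu_minus = np.linalg.eigvals(P).real   # real since the discriminant is positive here
e1 = np.array([1.0, 0.0])

lhs, v = 0.0, e1.copy()
for _ in range(n):                               # alpha^2 K sum_j ||P^j e1||^2
    lhs += alpha**2 * K * (v @ v)
    v = P @ v
rhs = alpha**2 * K / (2 * (1 - mu_plus**2) * (1 - mu_minus**2) * (1 - eta))
assert lhs >= rhs                                # the variance term dominates the Lemma 1 bound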

Note that, from the above lemma, the variance bound improves (i.e., decreases) when $\eta$ decreases. Using this, we show in the following lemma that the variance is ultimately bounded below by a linearly decreasing function of $\rho(P)$.

Lemma 2.

For all $\alpha>0$, $\eta\in[0,1]$, if $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$ for small enough $\epsilon>0$, then $\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}\geq\frac{K}{16\lambda^{2}}(1-\rho(P))$.

Recall that $n=\frac{K}{64\epsilon\lambda^{2}}\log\big(\frac{\|\tilde{X}_{0}\|^{2}}{\epsilon}\big)$ and $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$. Using this, we get the following lower bound on $1-\rho(P)$:

Lemma 3.

If $n=\frac{K}{64\epsilon\lambda^{2}}\log\big(\frac{\|\tilde{X}_{0}\|^{2}}{\epsilon}\big)$ and $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$, then $1-\rho(P)\geq\frac{16\epsilon\lambda^{2}}{K}$.

Combining Lemmas 2 and 3, the variance term is at least $\frac{K}{16\lambda^{2}}\cdot\frac{16\epsilon\lambda^{2}}{K}=\epsilon$. This completes the proof for the univariate case when $\beta=0$.

We now consider the case $\beta\in[0,1]$. Notice that, in the univariate case, LSA-M with parameters $\beta$ and $\eta$ is equivalent to LSA-M with parameters $\beta=0$ and $\eta^{\prime}=\eta(1-\alpha\lambda\beta)$. If $\alpha\lambda>1$, regardless of the choice of $\eta$, we have from (4):

\mathbb{E}\|\tilde{X}_{n}\|^{2}\geq\|P^{n}\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}\geq\alpha^{2}K>\frac{K}{\lambda^{2}}.

Thus, when $\epsilon<\frac{K}{\lambda^{2}}$, the expected error is always larger than $\epsilon$. When $\alpha\lambda<1$, we have $\eta^{\prime}=\eta(1-\alpha\lambda\beta)\in[0,1]$ for all $\eta,\beta\in[0,1]$. Since this is equivalent to LSA-M with $\beta=0$ and $\eta^{\prime}\in[0,1]$, the result follows from the lower bound on univariate LSA-M with $\beta=0$. This completes the proof for the univariate case for all $\alpha>0$ and $\eta,\beta\in[0,1]$.
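The reduction used above is easy to verify numerically: the companion matrix of univariate LSA-M with parameters $(\beta,\eta)$ coincides with that of LSA-M with parameters $(0,\eta^{\prime})$, as the following sketch (with arbitrary illustrative values) checks.

import numpy as np

alpha, lam, beta, eta = 0.05, 2.0, 0.7, 0.9
eta_prime = eta * (1 - alpha * lam * beta)

# companion matrix of (3) with general (beta, eta)
P_general = np.array([[1 - alpha * lam + eta * (1 - alpha * beta * lam),
                       -eta * (1 - alpha * beta * lam)], [1.0, 0.0]])
# companion matrix with beta = 0 and the rescaled momentum eta'
P_reduced = np.array([[1 - alpha * lam + eta_prime, -eta_prime], [1.0, 0.0]])
assert np.allclose(P_general, P_reduced)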

Finally, we consider the multivariate case, i.e., $d>1$. Here, we study the transformed iterates $\tilde{Y}_{n}=Z\tilde{X}_{n}$, where $Z:=E_{d\times d}\begin{pmatrix}S&0\\ 0&S\end{pmatrix}$. Since $\tilde{X}_{n}=P\tilde{X}_{n-1}+\alpha W_{n}$, we get

\tilde{Y}_{n}=Z\tilde{X}_{n}=ZP\tilde{X}_{n-1}+\alpha ZW_{n}=ZPZ^{-1}\tilde{Y}_{n-1}+\alpha ZW_{n}=B\tilde{Y}_{n-1}+\alpha ZW_{n},

where

B=\begin{pmatrix}B_{1}&0&\cdots&0\\ 0&B_{2}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&B_{d}\end{pmatrix},\quad\mbox{and}\quad B_{i}=\begin{pmatrix}1+\eta-\alpha\lambda_{i}&-\eta\\ 1&0\end{pmatrix}.

Let $\tilde{Y}_{n}=\big(\tilde{Y}_{n}^{(1)},\tilde{Y}_{n}^{(2)},\ldots,\tilde{Y}_{n}^{(d)}\big)$ and $\hat{M}_{n}=SM_{n}=\big(\hat{M}_{n}^{(1)},\hat{M}_{n}^{(2)},\ldots,\hat{M}_{n}^{(d)}\big)$, where each $\tilde{Y}_{n}^{(i)}\in\mathbb{R}^{2}$ and $\hat{M}_{n}^{(i)}\in\mathbb{R}$. We will show that the update rule for each $\tilde{Y}_{n}^{(i)}$ is the same as in the univariate case. Notice that

\tilde{Y}_{n}^{(i)}=B_{i}\tilde{Y}_{n-1}^{(i)}+\alpha\begin{pmatrix}\hat{M}^{(i)}_{n}\\ 0\end{pmatrix},

where $\mathbb{E}[\hat{M}^{(i)}_{n}|\mathcal{F}_{n-1}]=0$ and $\mathbb{E}[(\hat{M}^{(i)}_{n})^{2}|\mathcal{F}_{n-1}]\geq\sigma_{\min}^{2}(S)K$. This is because $\mathbb{E}[\hat{M}_{n}|\mathcal{F}_{n-1}]=\mathbb{E}[SM_{n}|\mathcal{F}_{n-1}]=0$ and

\begin{aligned}
\mathbb{E}[(\hat{M}^{(i)}_{n})^{2}|\mathcal{F}_{n-1}] &\geq \min_{\|v\|=1}\mathbb{E}[v^{T}\hat{M}_{n}\hat{M}^{T}_{n}v|\mathcal{F}_{n-1}]\\
&= \min_{\|v\|=1}\mathbb{E}[v^{T}SM_{n}M^{T}_{n}S^{T}v|\mathcal{F}_{n-1}]\\
&\geq \min_{\|v\|=1}\sigma_{\min}^{2}(S)\,\mathbb{E}[v^{T}M_{n}M^{T}_{n}v|\mathcal{F}_{n-1}]\\
&\geq \sigma_{\min}^{2}(S)K.
\end{aligned}

Let $\lambda_{1}=\min_{i}\lambda_{i}$. Then, from the results in the univariate case, $\exists\epsilon^{\prime}>0$ such that for all $\alpha>0$ and $\beta,\eta\in[0,1]$, $\mathbb{E}[\|\tilde{Y}_{n}^{(1)}\|^{2}]\geq\epsilon^{\prime}$ for some $n\in\Theta\big(\frac{\sigma_{\min}^{2}(S)K}{\epsilon^{\prime}\lambda_{\min}^{2}}\big)$. Letting $\epsilon=\frac{\epsilon^{\prime}}{\sigma_{\max}^{2}(S)}$, it follows that $\mathbb{E}[\|\tilde{Y}_{n}^{(1)}\|^{2}]\geq\sigma_{\max}^{2}(S)\epsilon$ for some $n\in\Theta\big(\frac{\sigma_{\min}^{2}(S)K}{\sigma_{\max}^{2}(S)\epsilon\lambda_{\min}^{2}}\big)=\Theta\big(\frac{K}{\epsilon\lambda_{\min}^{2}}\big)$. We ignore the factor $\frac{\sigma_{\min}^{2}(S)}{\sigma_{\max}^{2}(S)}$ as it is a constant; in fact, it equals $1$ when $A$ is symmetric (diagonalizing matrices of a symmetric matrix are orthogonal). Finally,

\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\geq\sigma^{2}_{\min}(Z^{-1})\,\mathbb{E}[\|\tilde{Y}_{n}\|^{2}]\geq\sigma^{2}_{\min}(S^{-1})\,\mathbb{E}[\|\tilde{Y}_{n}^{(1)}\|^{2}]=\frac{1}{\sigma^{2}_{\max}(S)}\mathbb{E}[\|\tilde{Y}_{n}^{(1)}\|^{2}]\geq\epsilon,

for some $n\in\Theta\big(\frac{K}{\epsilon\lambda_{\min}^{2}}\big)$.
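The block-diagonalization $ZPZ^{-1}=B$ used above can also be verified numerically. The sketch below (with one concrete choice of the interleaving permutation $E_{d\times d}$, fixed only for illustration) checks it for a random diagonalizable $A$ and $\beta=0$.

import numpy as np

rng = np.random.default_rng(0)
d, alpha, eta = 3, 0.05, 0.8
lam = np.array([1.0, 2.0, 5.0])
S = rng.standard_normal((d, d))                 # S diagonalizes A (invertible almost surely)
A = S @ np.diag(lam) @ np.linalg.inv(S)

# 2d x 2d companion matrix of the stacked iterates (beta = 0)
P = np.block([[np.eye(d) - alpha * A + eta * np.eye(d), -eta * np.eye(d)],
              [np.eye(d), np.zeros((d, d))]])

# one choice of E: interleave the coordinates of (S x_n, S x_{n-1}) pairwise
E = np.zeros((2 * d, 2 * d))
for i in range(d):
    E[2 * i, i] = 1.0
    E[2 * i + 1, d + i] = 1.0
Z = E @ np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])

B = np.zeros((2 * d, 2 * d))
for i, l in enumerate(lam):
    B[2 * i:2 * i + 2, 2 * i:2 * i + 2] = np.array([[1 + eta - alpha * l, -eta], [1.0, 0.0]])

assert np.allclose(Z @ P @ np.linalg.inv(Z), B)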

3.2 Proof Outline of Theorem 2.2

Here we provide a quick sketch of the proof for the cases $\eta=0$ (corresponding to SGD) and $\beta=0$ (corresponding to SHB). A similar analysis holds for $\beta=1$, and the detailed proofs of all three cases can be found in Appendix C.

The LSA-M iterate in (2) with $\eta=0$ corresponds to:

\begin{aligned}
\tilde{x}_{n} &= (I-\alpha A)\tilde{x}_{n-1}+\alpha M_{n}\\
&= (I-\alpha A)^{n}\tilde{x}_{0}+\alpha\sum_{i=0}^{n-1}(I-\alpha A)^{n-1-i}M_{i+1}.
\end{aligned}

Using Assumption 2.b, the expected norm of the iterates can be bounded as follows:

\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq\|(I-\alpha A)^{n}\|^{2}\Lambda+\alpha^{2}K\sum_{i=0}^{n-1}\|(I-\alpha A)^{n-1-i}\|^{2}\big(1+\mathbb{E}[\|\tilde{x}_{i}\|^{2}]\big),

where $\Lambda=\|\tilde{x}_{0}\|^{2}$.

We next use the following lemma to bound $\|(I-\alpha A)^{i}\|$.

Lemma 4.

Let $M\in\mathbb{R}^{d\times d}$ be a matrix and let $\lambda_{i}(M)$ denote its $i^{th}$ eigenvalue. Then, $\forall\delta>0$,

\|M^{n}\|\leq C_{\delta}(\rho(M)+\delta)^{n},

where $\rho(M)=\max_{i}|\lambda_{i}(M)|$ is the spectral radius of $M$ and $C_{\delta}$ is a constant that depends on $\delta$. Furthermore, if the eigenvalues of $M$ are distinct, then

\|M^{n}\|\leq C(\rho(M))^{n}.
Proof.

See Appendix D. ∎
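The following snippet (an illustration, not the proof in Appendix D) shows the behaviour asserted in Lemma 4 for a matrix with distinct eigenvalues: the ratio $\|M^{n}\|/\rho(M)^{n}$ stays bounded by a constant.

import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4))
M = S @ np.diag([0.9, 0.7, 0.5, 0.2]) @ np.linalg.inv(S)   # distinct eigenvalues, rho(M) = 0.9
rho = 0.9
for n in (5, 20, 50):
    # spectral norm of M^n divided by rho(M)^n stays bounded by a constant C depending on S
    print(n, np.linalg.norm(np.linalg.matrix_power(M, n), 2) / rho**n)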

Using Assumption 1 and Lemma 4, we have

\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq C^{2}(1-\alpha\lambda_{\min}(A))^{2n}\Lambda+C^{2}\alpha^{2}K\sum_{i=0}^{n-1}(1-\alpha\lambda_{\min}(A))^{2(n-1-i)}\big(1+\mathbb{E}[\|\tilde{x}_{i}\|^{2}]\big), \qquad (5)

where $C=\frac{\sqrt{d}}{\sigma_{\min}(S)\sigma_{\min}(S^{-1})}$,

$S$ is the matrix in the Jordan decomposition of $A$, and $\sigma_{\min}(S)$ is its smallest singular value. We define the sequence $\{U_{k}\}$ as follows:

U_{k}=C^{2}(1-\alpha\lambda_{\min}(A))^{2k}\Lambda+C^{2}\alpha^{2}K\sum_{i=0}^{k-1}(1-\alpha\lambda_{\min}(A))^{2(k-1-i)}(1+U_{i}).

Observe that $\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq U_{n}$ and that the sequence $\{U_{k}\}$ satisfies

U_{k+1}=(1-\alpha\lambda_{\min}(A))^{2}U_{k}+C^{2}K\alpha^{2}(1+U_{k}),\qquad U_{0}=C^{2}\Lambda.

Therefore, we have

U_{k+1}=\left((1-\alpha\lambda_{\min}(A))^{2}+C^{2}K\alpha^{2}\right)U_{k}+\alpha^{2}C^{2}K.

Since $\alpha\leq\frac{\lambda_{\min}(A)}{\frac{3}{4}\lambda_{\min}(A)^{2}+C^{2}K}$, one can show that

U_{n}\leq\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2n}U_{0}+\alpha^{2}C^{2}K\,\frac{2}{\alpha\lambda_{\min}(A)}.

We assume that $\alpha\leq\frac{1}{\lambda_{\min}(A)}$, and therefore $\big(1-\frac{\alpha\lambda_{\min}(A)}{2}\big)^{2}\leq e^{-\alpha\lambda_{\min}(A)}$. Thus,

U_{n}\leq e^{-n\alpha\lambda_{\min}(A)}C^{2}\Lambda+\frac{2\alpha C^{2}K}{\lambda_{\min}(A)}.

Choose $\alpha\leq\frac{\epsilon\lambda_{\min}(A)}{4C^{2}K}$. Then,

\frac{2\alpha C^{2}K}{\lambda_{\min}(A)}\leq\frac{\epsilon}{2}\ \Rightarrow\ \mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq U_{n}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon,

when the sample complexity is:

n=\frac{1}{\alpha\lambda_{\min}(A)}\log\left(\frac{2C^{2}\Lambda}{\epsilon}\right).
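The recursion for $U_{k}$ and the resulting sample complexity can be checked directly. The sketch below (with placeholder constants satisfying the step-size conditions above) iterates $U_{k}$ and confirms that the error bound is below $\epsilon$ at the $n$ given in the display.

import numpy as np

def sgd_sample_complexity(alpha, lam_min, C, Lambda, eps):
    """n = (1/(alpha*lam_min)) * log(2*C^2*Lambda/eps) from the display above."""
    return int(np.ceil(np.log(2 * C**2 * Lambda / eps) / (alpha * lam_min)))

def U_recursion(alpha, lam_min, C, K, Lambda, n):
    """Iterate U_{k+1} = ((1-alpha*lam_min)^2 + C^2*K*alpha^2) U_k + alpha^2*C^2*K."""
    U = C**2 * Lambda
    rho = (1 - alpha * lam_min)**2 + C**2 * K * alpha**2
    for _ in range(n):
        U = rho * U + alpha**2 * C**2 * K
    return U

alpha, lam_min, C, K, Lambda, eps = 1e-3, 1.0, 1.0, 1.0, 10.0, 0.1
n = sgd_sample_complexity(alpha, lam_min, C, Lambda, eps)
print(n, U_recursion(alpha, lam_min, C, K, Lambda, n) <= eps)   # expect True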

We next consider $\beta=0$ (SHB). As in the proof of Theorem 2.1, we begin by transforming the iterate given by (2) into the following form:

\tilde{X}_{n}=P\tilde{X}_{n-1}+\alpha W_{n},

where $\tilde{X}_{n}\triangleq\begin{bmatrix}\tilde{x}_{n}\\ \tilde{x}_{n-1}\end{bmatrix}$, $W_{n}\triangleq\begin{bmatrix}M_{n}\\ 0\end{bmatrix}$, and $P\triangleq\begin{bmatrix}I-\alpha A+\eta I&-\eta I\\ I&0\end{bmatrix}$.

Using Assumption 2.b, we can obtain the following bound on the mean squared error:

\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\|P^{n}\|^{2}\|\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{i=0}^{n-1}\|P^{n-1-i}\|^{2}\big(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}]\big).

Here, the first term corresponds to the bias and the second term corresponds to the variance. A spectral analysis of $P$ tells us that the best choice of $\eta$ is $\eta=(1-\sqrt{\lambda_{\min}(A)\alpha})^{2}$ with $\alpha\leq\big(\frac{2}{\sqrt{\lambda_{\min}(A)}+\sqrt{\lambda_{\max}(A)}}\big)^{2}$. However, to simplify the analysis, we choose $\eta=\big(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\big)^{2}$, which maintains the same rate of decay of the bias term. Using Assumption 1 and Lemma 9, we show that

\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\hat{C}^{2}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2n}\Lambda+\alpha^{2}\hat{C}^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2(n-1-i)}\big(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}]\big),

where $\Lambda=\|\tilde{x}_{0}\|^{2}$ and $\hat{C}$ depends on $\alpha$ and $\lambda$. The dependence of the second term on $\tilde{X}_{i}$ appears because we do not assume the noise variance to be uniformly bounded by a constant. This makes analyzing the above recursion challenging. Towards this, we define a sequence $\{V_{n}\}$,

V_{n}=\hat{C}^{2}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2n}\Lambda+\alpha^{2}\hat{C}^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2(n-1-i)}(1+V_{i}).

Choosing

\alpha\leq\min\Big(\Big(\frac{\lambda_{\min}(A)^{3/2}}{\frac{3}{8}\lambda_{\min}(A)^{2}+25C^{2}K}\Big)^{2},\ \Big(\frac{\epsilon\lambda_{\min}(A)^{3/2}}{200C^{2}K}\Big)^{2},\ \Big(\frac{2}{\sqrt{\lambda_{\min}(A)}+\sqrt{\lambda_{\max}(A)}}\Big)^{2}\Big)

ensures that $\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq V_{n}\leq e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}\hat{C}^{2}\Lambda+\frac{4\alpha^{2}\hat{C}^{2}K}{\sqrt{\alpha\lambda_{\min}(A)}}$. Using Lemma 10, we bound $\hat{C}$ in terms of $C$, where $C=1$ when $A$ is symmetric and $C=\frac{\sqrt{d}}{\sigma_{\min}(S)\sigma_{\min}(S^{-1})}$ otherwise, with $\sigma_{\min}(\cdot)$ denoting the smallest singular value and $S$ the matrix that diagonalizes $A$, i.e., $S^{-1}AS=D$, a diagonal matrix. With this we obtain

V_{n}\leq e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}\frac{25C^{2}}{\lambda_{\min}(A)\alpha}\Lambda+\alpha^{2}\frac{100C^{2}K}{\left(\lambda_{\min}(A)\alpha\right)^{3/2}}.

Now, for $n\geq\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\big(\frac{1}{\lambda_{\min}(A)\alpha}\big)$, it can be shown that $V_{n}\leq 25C^{2}\Lambda e^{-\frac{n}{4}\sqrt{\lambda_{\min}(A)\alpha}}+\frac{100\sqrt{\alpha}C^{2}K}{\lambda_{\min}(A)^{3/2}}$. Next, we choose $\alpha$ small enough, $\alpha\leq\big(\frac{\epsilon\lambda_{\min}(A)^{3/2}}{200C^{2}K}\big)^{2}$, to ensure that the second term in the above inequality is within an $\epsilon$-boundary, and $n$ large enough, $n=\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\big(\frac{50C^{2}\Lambda}{\epsilon}\big)$, to ensure that the first term is within an $\epsilon$-boundary. Finally,

n=\max\Bigg(\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\left(\frac{50C^{2}\Lambda}{\epsilon}\right),\ \frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\left(\frac{1}{\lambda_{\min}(A)\alpha}\right)\Bigg)

gives the desired bound on the sample complexity for $\beta=0$.
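For completeness, the following sketch (again with the Table 1 constants plugged in, purely as an illustration) collects the $\beta=0$ step-size, the corresponding $\eta$, and the resulting $n$.

import numpy as np

def shb_parameters_and_n(lam_min, lam_max, K, C, Lambda, eps):
    """Return (alpha, eta, n) for the beta = 0 (SHB) case with the Table 1 choices."""
    alpha = min((lam_min**1.5 / (0.375 * lam_min**2 + 25 * C**2 * K))**2,
                (eps * lam_min**1.5 / (200 * C**2 * K))**2,
                (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max)))**2)
    eta = (1 - np.sqrt(alpha * lam_min) / 2)**2
    n = max(4 / np.sqrt(alpha * lam_min) * np.log(50 * C**2 * Lambda / eps),
            4 / np.sqrt(alpha * lam_min) * np.log(1 / (lam_min * alpha)))
    return alpha, eta, int(np.ceil(n))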

4 Concluding Remarks

In this work, we analyze the sample complexity of SHB and ASG and provide matching lower and upper bounds up to constants and logarithmic terms. More importantly, we show that the same sample complexity bound can be obtained by standard SGD. Our work also calls into question some of the recent positive results in favour of momentum methods in the stochastic regime: such claimed improvements either do not take into account all the terms involved in the error decomposition or rest on flawed analyses. Although some other results do question the superiority of momentum methods in the stochastic regime, the assumptions and settings they consider do not correspond to those in the positive results; in particular, these negative results either only allow small step-sizes and momentum parameters not close to 1 or apply to specific instances of linear regression. In contrast, our work shows that SHB and ASG cannot obtain an improvement in terms of sample complexity (for small neighbourhoods) over SGD for the entire family of quadratic optimization, and this holds for all step-sizes and all momentum parameters in $[0,1]$.

References

  • Assran and Rabbat, (2020) Assran, M. and Rabbat, M. (2020). On the convergence of nesterov’s accelerated gradient method in stochastic settings. Proceedings of the 37th International Conference on Machine Learning, PMLR, 119:410–420.
  • Borkar, (2008) Borkar, V. S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
  • Can et al., (2019) Can, B., Gurbuzbalaban, M., and Zhu, L. (2019). Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 891–901. PMLR.
  • Cauchy, (1847) Cauchy, L. A. (1847). Methode generale pour la resolution des systemes d’equations simultanees. C.R. Acad. Sci. Paris, 25:536–538.
  • Cheng et al., (2020) Cheng, X., Yin, D., Bartlett, P., and Jordan, M. (2020). Stochastic gradient and Langevin processes. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1810–1819. PMLR.
  • Foucart, (2012) Foucart, S. (2012). University lecture, m504, lecture 6: Matrix norms and spectral radii.
  • Horn and Johnson, (1990) Horn, R. and Johnson, C. (1990). Matrix analysis (Corrected reprint of the 1985 original). Cambridge University Press.
  • Hu et al., (2009) Hu, C., Pan, W., and Kwok, J. (2009). Accelerated gradient methods for stochastic optimization and online learning. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.
  • Jastrzębski et al., (2018) Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Storkey, A., and Bengio, Y. (2018). Three factors influencing minima in SGD.
  • Kidambi et al., (2018) Kidambi, R., Netrapalli, P., Jain, P., and Kakade, S. M. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. CoRR, abs/1803.05591.
  • Liu and Belkin, (2020) Liu, C. and Belkin, M. (2020). Accelerating sgd with momentum for over-parameterized learning. In International Conference on Learning Representations.
  • Loizou and Richtárik, (2020) Loizou, N. and Richtárik, P. (2020). Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77:653–710.
  • Mandt et al., (2017) Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research, 18:1–35.
  • Mou et al., (2020) Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2947–2997. PMLR.
  • Nesterov, (1983) Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O\big(\frac{1}{k^{2}}\big)$. Soviet Mathematics Doklady, 269:543–547.
  • Nesterov, (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152.
  • Nesterov, (2014) Nesterov, Y. (2014). Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition.
  • Nitanda, (2014) Nitanda, A. (2014). Stochastic proximal gradient descent with acceleration techniques. In NIPS.
  • Polyak, (1964) Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17.
  • Polyak, (1987) Polyak, B. (1987). Introduction to Optimization. Optimization Software, NY.
  • Qian, (1999) Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Netw., 12(1):145–151.
  • Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA. PMLR.
  • Yuan et al., (2016) Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1–66.
  • Zhang et al., (2019) Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Appendix

 

Appendix A Comparison with recent works

A.1 Comparison with Mou et al. (2020)

Claim 1 of (Mou et al., 2020, p. 20) analyzes the asymptotic covariance of the heavy ball momentum algorithm (with Polyak averaging) and claims a correction term that satisfies:

\mbox{Tr}(L_{\eta})\lesssim\mathcal{O}\Big(\eta\frac{\kappa^{2}(U)}{\lambda_{\min}(A)^{3/2}}\Big),

where $\tilde{A}=\begin{pmatrix}0&I_{d}\\ -\bar{A}&\alpha I_{d}+\eta\bar{A}\end{pmatrix}=UDU^{-1}$ as in the decomposition of $\tilde{A}$ in Lemma 1 of Mou et al., (2020), and $\kappa(U)=\|U\|_{op}\|U^{-1}\|_{op}$.

However, in the proof of Claim 1, we are not sure if the following bound holds, since the matrix $\tilde{A}$ is not symmetric:

Tr\big(\tilde{A}^{-1}\mathbb{E}(\tilde{\Xi}_{A}\Lambda_{\eta}^{*}(\tilde{\Xi}_{A})^{T})(\tilde{A}^{-1})^{T}\big)\leq(\min_{i}|\lambda_{i}(\tilde{A})|)^{-2}(1+\eta^{2})v_{A}^{2}\,\mathbb{E}_{\pi_{\eta}}\|x_{t}-x^{*}\|_{2}^{2}.

Our calculation points towards the following bound:

\mbox{Tr}(L_{\eta})\lesssim\mathcal{O}\Big(\eta\frac{\kappa^{2}(U)}{\lambda_{\min}(A)^{5/2}}\Big).

We outline the proof for the univariate case, where $\bar{A}=\lambda$ for some $\lambda>0$. Then,

\tilde{A}=\begin{pmatrix}0&1\\ -\lambda&\alpha+\eta\lambda\end{pmatrix},\quad\mbox{and}\quad\tilde{A}^{-1}=\frac{1}{\lambda}\begin{pmatrix}\alpha+\eta\lambda&-1\\ \lambda&0\end{pmatrix}.

Observe that $\tilde{A}^{-1}(\tilde{A}^{-1})^{T}=\frac{1}{\lambda^{2}}\begin{bmatrix}1+(\alpha+\eta\lambda)^{2}&\lambda(\alpha+\eta\lambda)\\ \lambda(\alpha+\eta\lambda)&\lambda^{2}\end{bmatrix}$ and therefore $Tr(\tilde{A}^{-1}(\tilde{A}^{-1})^{T})=\mathcal{O}\big(\frac{1}{\lambda^{2}}\big)$. Using this we have,

Tr\big(\tilde{A}^{-1}\mathbb{E}(\tilde{\Xi}_{A}\Lambda_{\eta}^{*}(\tilde{\Xi}_{A})^{T})(\tilde{A}^{-1})^{T}\big)\leq\mathcal{O}\Big(\frac{1}{\lambda^{2}}\Big)Tr\big(\mathbb{E}(\tilde{\Xi}_{A}\Lambda_{\eta}^{*}(\tilde{\Xi}_{A})^{T})\big)\lesssim\mathcal{O}\Big(\eta\frac{\kappa^{2}(U)}{\lambda^{5/2}}\Big). \qquad (6)

The second inequality follows as in Mou et al., (2020). Next, we analyze the dependence of $\kappa^{2}(U)$ on $\lambda$. Again, for simplicity, we consider the univariate case where $\bar{A}=\lambda$. Let $\tilde{A}=\begin{pmatrix}0&1\\ -\lambda&\alpha+\eta\lambda\end{pmatrix}$ be diagonalizable. Therefore,

\tilde{A}=U\begin{bmatrix}\mu_{+}&0\\ 0&\mu_{-}\end{bmatrix}U^{-1},

where $\mu_{+}$ and $\mu_{-}$ are the eigenvalues of $\tilde{A}$. Let $U=\begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}$. We therefore have

\begin{bmatrix}0&1\\ -\lambda&\alpha+\eta\lambda\end{bmatrix}\begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}=\begin{bmatrix}u_{1}&u_{2}\\ u_{3}&u_{4}\end{bmatrix}\begin{bmatrix}\mu_{+}&0\\ 0&\mu_{-}\end{bmatrix}.

Solving the system of equations, we get:

U=\begin{bmatrix}1&1\\ \mu_{+}&\mu_{-}\end{bmatrix}\quad\mbox{and}\quad U^{-1}=\frac{1}{\mu_{-}-\mu_{+}}\begin{bmatrix}\mu_{-}&-1\\ -\mu_{+}&1\end{bmatrix}.

Now, $\mu_{+}-\mu_{-}=\sqrt{(\alpha+\eta\lambda)^{2}-4\lambda}$. Using the choice $\alpha=\sqrt{\lambda}$ as in Mou et al., (2020), we have:

\mu_{+}-\mu_{-}=\sqrt{\lambda+\eta^{2}\lambda^{2}+2\alpha\eta\lambda-4\lambda}=\sqrt{\lambda}\sqrt{\eta^{2}\lambda+2\eta\sqrt{\lambda}-3}.

For $\lambda\ll 1$ (which is the regime where the momentum algorithm is claimed to improve the mixing time in Mou et al., (2020)), $\mu_{+}-\mu_{-}\geq\mathcal{O}(\sqrt{\lambda})$. As in the proof of Lemma 5 in Appendix D.2, and using the fact that $\|U\|_{op}\|U^{-1}\|_{op}=\sigma_{\max}(U)\sigma_{\max}(U^{-1})$, we can show that $\kappa^{2}(U)\leq\mathcal{O}(\frac{1}{\lambda})$. Combining with (6), we have:

\mbox{Tr}(L_{\eta})\lesssim\mathcal{O}(\eta\lambda^{-7/2}).

A similar analysis can be carried out for the multivariate case to show that

\mbox{Tr}(L_{\eta})\lesssim\mathcal{O}(\eta\lambda_{\min}(\bar{A})^{-7/2}).

The corresponding correction term for SGD is $\mathcal{O}(\eta\lambda_{\min}(\bar{A})^{-3})$ (see Mou et al., (2020), Claim 1). Hence, the stationary distribution for the momentum algorithm is larger than that of SGD when $\lambda_{\min}(A)\ll 1$. Indeed, if we enforce that the two asymptotic covariances are of the same size by choosing the step-size of the momentum iterate $\eta^{m}$ in terms of the step-size of SGD, i.e.,

\mathcal{O}\Big(\eta\frac{1}{\lambda_{\min}(\bar{A})^{3}}\Big)=\mathcal{O}\Big(\eta^{m}\frac{1}{\lambda_{\min}(\bar{A})^{7/2}}\Big),

then we must choose $\eta^{m}=\mathcal{O}(\eta\sqrt{\lambda_{\min}(\bar{A})})$. In Appendix C.1 of Mou et al., (2020), the mixing time of the momentum iterate is shown to be $\mathcal{O}\big(\frac{1}{\eta^{m}\sqrt{\lambda_{\min}(\bar{A})}}\big)$, while the mixing time of SGD is $\mathcal{O}\big(\frac{1}{\eta\lambda_{\min}(\bar{A})}\big)$. When we choose $\eta^{m}=\mathcal{O}(\sqrt{\lambda_{\min}(\bar{A})}\,\eta)$, the mixing time of the momentum algorithm turns out to be the same as that of SGD. This behaviour is identical to what we observe in Theorem 2.2, where, if we choose the same step-size, then there is an improvement by a square-root factor.

A.2 Comparison with SHB, Can et al. (2019)

For strongly convex quadratic functions of the form:

f(x)=\frac{1}{2}x^{T}Qx+a^{T}x+b,

where $x\in\mathbb{R}^{d}$, $Q\in\mathbb{R}^{d\times d}$ is p.s.d., $a\in\mathbb{R}^{d}$, $b\in\mathbb{R}$, and $\mu I_{d}\preceq Q\preceq LI_{d}$, Can et al., (2019) shows acceleration in Wasserstein distance by a factor of $\sqrt{\kappa}=\sqrt{\frac{L}{\mu}}$. The trace of the asymptotic covariance matrix $X_{HB}$ is given by (see Appendix C.2 of Can et al., (2019)):

\mbox{Tr}(X_{HB})=\sum_{i=1}^{d}\frac{2\alpha(1+\beta)}{(1-\beta)\lambda_{i}(2+2\beta-\alpha\lambda_{i})},

where $\alpha$ is the step-size, $\beta$ is the momentum parameter, and $\lambda_{i}$ is the $i^{th}$ eigenvalue of $Q$. We show that the asymptotic covariance matrix is worse than when no momentum is used, i.e., when $\beta=0$ and the optimal step-size $\alpha=\frac{2}{\mu+L}$ is used. Substituting these values of $\beta$ and $\alpha$, we get:

\begin{aligned}
\mbox{Tr}(X_{\beta=0}) &= \sum_{i=1}^{d}\frac{2\frac{2}{\mu+L}}{\lambda_{i}\big(2-\frac{2}{\mu+L}\lambda_{i}\big)}
= \sum_{i=1}^{d}\frac{2\frac{2}{\mu+L}}{\lambda_{i}\frac{2}{\mu+L}(\mu+L-\lambda_{i})}
= \sum_{i=1}^{d}\frac{2}{\lambda_{i}(\mu+L-\lambda_{i})}.
\end{aligned}

To compute the size of the stationary distribution of the heavy ball iterates, we set the step-size $\alpha=\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}$ and the momentum parameter $\beta=\big(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\big)^{2}$ and get:

\begin{aligned}
\mbox{Tr}(X_{HB}) &= \sum_{i=1}^{d}\frac{2\big(\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}\big)(1+\beta)}{(1-\beta)\lambda_{i}\Big(2+2\big(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\big)^{2}-\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}\lambda_{i}\Big)}\\
&= \sum_{i=1}^{d}\frac{2\big(\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}\big)(1+\beta)}{(1-\beta)\lambda_{i}\Big(\frac{2(\sqrt{L}+\sqrt{\mu})^{2}+2(\sqrt{L}-\sqrt{\mu})^{2}-4\lambda_{i}}{(\sqrt{L}+\sqrt{\mu})^{2}}\Big)}\\
&= \sum_{i=1}^{d}\frac{2\big(\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}\big)(1+\beta)}{(1-\beta)\lambda_{i}\big(\frac{4}{(\sqrt{\mu}+\sqrt{L})^{2}}\big)(\mu+L-\lambda_{i})}\\
&= \sum_{i=1}^{d}\frac{2(1+\beta)}{(1-\beta)\lambda_{i}(\mu+L-\lambda_{i})}\\
&= \sum_{i=1}^{d}\frac{2}{\lambda_{i}(\mu+L-\lambda_{i})}\left(\frac{1+\big(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\big)^{2}}{1-\big(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\big)^{2}}\right)\\
&= \sum_{i=1}^{d}\frac{2}{\lambda_{i}(\mu+L-\lambda_{i})}\,\frac{L+\mu}{2\sqrt{\mu L}}
= \sum_{i=1}^{d}\frac{2}{\lambda_{i}(\mu+L-\lambda_{i})}\,\frac{1}{2}\Big(\sqrt{\kappa}+\frac{1}{\sqrt{\kappa}}\Big)\\
&= \mbox{Tr}(X_{\beta=0})\,\mathcal{O}(\sqrt{\kappa}).
\end{aligned}

The above calculation shows that the size of the asymptotic covariance matrix of SHB is worse by a factor of $\mathcal{O}(\sqrt{\kappa})$.
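This comparison is easy to reproduce numerically; the sketch below (with an arbitrary spectrum between $\mu$ and $L$, chosen only for illustration) evaluates the closed-form traces and recovers the $\frac{1}{2}(\sqrt{\kappa}+1/\sqrt{\kappa})$ factor.

import numpy as np

def trace_cov(alpha, beta, lams):
    """Closed-form Tr(X) = sum_i 2*alpha*(1+beta) / ((1-beta)*lam_i*(2+2*beta-alpha*lam_i))."""
    return sum(2 * alpha * (1 + beta) / ((1 - beta) * l * (2 + 2 * beta - alpha * l))
               for l in lams)

mu, L = 1.0, 100.0
lams = np.linspace(mu, L, 10)
tr_gd = trace_cov(2 / (mu + L), 0.0, lams)                       # beta = 0, alpha = 2/(mu+L)
alpha_hb = 4 / (np.sqrt(mu) + np.sqrt(L))**2
beta_hb = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))**2
tr_hb = trace_cov(alpha_hb, beta_hb, lams)
print(tr_hb / tr_gd, 0.5 * (np.sqrt(L / mu) + np.sqrt(mu / L)))  # the two factors coincide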

A.3 Comparison with Assran and Rabbat (2020)

Theorem 1 of Assran and Rabbat, (2020) is stated for a constant $\epsilon$ (not to be confused with the $\epsilon$ in the current paper) and uses $C_{\epsilon}$ (a term that depends on $\epsilon$). However, in Corollary 1.1, the result is stated for a decreasing sequence $\epsilon_{k}$ for which $C_{\epsilon}$ is undefined (equation (26)).

Using the expressions provided in the Appendix of Assran and Rabbat, (2020), we find that the second term in Corollary 1.1 (the variance bound) is in fact of order $Q^{3/2}$ instead of $Q^{1/2}$. Using equation (29) and the display below it, we see that for ASG, $\|A^{k}\|\leq k\big(1-\frac{1}{\sqrt{Q}}\big)^{k+1}$. It follows that

\sum_{k=0}^{n-1}\|A^{k}\|^{2}\leq\sum_{k=0}^{n-1}k^{2}\Big(1-\frac{1}{\sqrt{Q}}\Big)^{2k+2}=O\Bigg(\frac{1}{\big(1-(1-\frac{1}{\sqrt{Q}})^{2}\big)^{3}}\Bigg)=O(Q^{3/2}).

However, for SGD,

\sum_{k=0}^{n-1}\|A^{k}\|^{2}\leq\sum_{k=0}^{n-1}\Big(1-\frac{2}{Q-1}\Big)^{2k}=O\Bigg(\frac{1}{1-(1-\frac{2}{Q-1})^{2}}\Bigg)=O(Q).

As a result, the variance bound for ASG is $O(\sigma^{2}Q^{3/2})$, while the variance bound for SGD is $O(\sigma^{2}Q)$. Though the bias term in their result improves by a factor of $\sqrt{Q}$, the variance term also worsens by a factor of $\sqrt{Q}$.

Appendix B Proof of Theorem 2.1

Observe that

\begin{aligned}
\|P^{n}\tilde{X}_{0}\|^{2} &= \|P^{n}\|^{2}\|\tilde{X}_{0}\|^{2}\ \mbox{ for some }\tilde{X}_{0}\\
&\geq \rho(P)^{2n}\|\tilde{X}_{0}\|^{2}.
\end{aligned}

Choose such an $\tilde{X}_{0}$ with $\|\tilde{X}_{0}\|\geq 1$. If $\|P^{n}\tilde{X}_{0}\|^{2}>\epsilon$, then $\mathbb{E}[\|\tilde{X}_{n}\|^{2}]>\epsilon$. Therefore, we assume $\|P^{n}\tilde{X}_{0}\|^{2}\leq\epsilon$, which implies $\rho(P)^{2n}\leq\epsilon$ and $\|P^{n}e_{1}\|^{2}\leq\epsilon$.

Lemma 1.

For all $\alpha>0$, $\eta\in[0,(1-\sqrt{\alpha\lambda})^{2})$, if $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$ for small enough $\epsilon>0$, then the variance term is bounded below as

\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}\geq\frac{\alpha^{2}K}{2(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\eta)}.
Proof.

We see that

\begin{aligned}
\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2} &\geq \sum_{j=0}^{n-1}\left(\frac{\mu_{+}^{j}-\mu_{-}^{j}}{\mu_{+}-\mu_{-}}\right)^{2}\\
&= \frac{1}{(\mu_{+}-\mu_{-})^{2}}\sum_{j=0}^{n-1}\big(\mu_{+}^{2j}+\mu_{-}^{2j}-2(\mu_{+}\mu_{-})^{j}\big)\\
&= \frac{1}{(\mu_{+}-\mu_{-})^{2}}\left(\sum_{j=0}^{n-1}\mu_{+}^{2j}+\sum_{j=0}^{n-1}\mu_{-}^{2j}-2\sum_{j=0}^{n-1}(\mu_{+}\mu_{-})^{j}\right)\\
&= \frac{1}{(\mu_{+}-\mu_{-})^{2}}\left(\frac{1-\mu_{+}^{2n}}{1-\mu_{+}^{2}}+\frac{1-\mu_{-}^{2n}}{1-\mu_{-}^{2}}-2\,\frac{1-(\mu_{+}\mu_{-})^{n}}{1-\mu_{+}\mu_{-}}\right).
\end{aligned}

The sum of fractions in the bracket can be expressed as a single fraction with denominator (1μ+2)(1μ2)(1μ+μ)(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\mu_{+}\mu_{-}). The numerator turns out to be:

\begin{aligned}
num ={} &(\mu_{+}-\mu_{-})^{2}+\mu_{+}\mu_{-}(\mu_{+}-\mu_{-})^{2}+\mu_{+}\mu_{-}(\mu_{+}^{n}-\mu_{-}^{n})^{2}+(\mu_{+}\mu_{-})^{2}(\mu_{+}^{n-1}-\mu_{-}^{n-1})^{2}\\
&-(\mu_{+}\mu_{-})^{3}(\mu_{+}^{n-1}-\mu_{-}^{n-1})^{2}-2(\mu_{+}\mu_{-})^{n}(\mu_{+}-\mu_{-})^{2}-(\mu_{+}^{n}-\mu_{-}^{n})^{2}.
\end{aligned}

First consider the case $\eta\in[0,(1-\sqrt{\alpha\lambda})^{2})$. The eigenvalues $\mu_{+}$ and $\mu_{-}$ are real in this case, and therefore all the terms in the numerator and the denominator are real. Next, consider the case $\eta\in((1-\sqrt{\alpha\lambda})^{2},1]$. In this case the eigenvalues $\mu_{+}$ and $\mu_{-}$ are complex. Let $\mu_{+}=\rho(P)(\cos\omega+i\sin\omega)$ and $\mu_{-}=\rho(P)(\cos\omega-i\sin\omega)$. Observe that

\begin{aligned}
(\mu_{+}-\mu_{-})^{2} &= \big(\rho(P)(\cos\omega+i\sin\omega)-\rho(P)(\cos\omega-i\sin\omega)\big)^{2}=-4\rho(P)^{2}\sin^{2}\omega,\\
(\mu_{+}^{m}-\mu_{-}^{m})^{2} &= -4\rho(P)^{2m}\sin^{2}m\omega,\\
\mu_{+}\mu_{-} &= \rho(P)^{2}(\sin^{2}\omega+\cos^{2}\omega)=\rho(P)^{2}=\eta.
\end{aligned}

Therefore, all the terms in the numerator and the denominator are again real. Now observe that,

\begin{aligned}
&\mu_{+}\mu_{-}(\mu_{+}-\mu_{-})^{2}\geq 0,\\
&(\mu_{+}\mu_{-})^{2}(\mu_{+}^{n-1}-\mu_{-}^{n-1})^{2}-(\mu_{+}\mu_{-})^{3}(\mu_{+}^{n-1}-\mu_{-}^{n-1})^{2}\geq 0.
\end{aligned}

The second inequality follows because $\mu_{+}\mu_{-}=\frac{(1-\alpha\lambda+\eta)^{2}-((1-\alpha\lambda+\eta)^{2}-4\eta)}{4}=\eta$ and therefore $(\mu_{+}\mu_{-})^{2}\geq(\mu_{+}\mu_{-})^{3}$. Next,

\begin{aligned}
\mu_{+}\mu_{-}(\mu_{+}^{n}-\mu_{-}^{n})^{2} &= \mu_{+}\mu_{-}(\mu_{+}-\mu_{-})^{2}\Big(\sum_{i=0}^{n-1}\mu_{+}^{i}\mu_{-}^{n-1-i}\Big)^{2}\\
&\geq \mu_{+}\mu_{-}(\mu_{+}-\mu_{-})^{2}(\mu_{+}^{n-1}+\mu_{-}^{n-1})^{2}\\
&\geq 4(\mu_{+}\mu_{-})^{n}(\mu_{+}-\mu_{-})^{2}.
\end{aligned}

The last inequality follows since $\mu_{+}^{n-1}+\mu_{-}^{n-1}\geq 2(\mu_{+}\mu_{-})^{\frac{n-1}{2}}$. It follows that $\mu_{+}\mu_{-}(\mu_{+}^{n}-\mu_{-}^{n})^{2}-2(\mu_{+}\mu_{-})^{n}(\mu_{+}-\mu_{-})^{2}\geq 0$. Also, note that $\big(\frac{\mu_{+}^{n}-\mu_{-}^{n}}{\mu_{+}-\mu_{-}}\big)^{2}\leq\|P^{n}e_{1}\|^{2}<\epsilon$ and therefore $(\mu_{+}^{n}-\mu_{-}^{n})^{2}\leq\epsilon(\mu_{+}-\mu_{-})^{2}$. It follows that $num\geq(1-\epsilon)(\mu_{+}-\mu_{-})^{2}$. Using these bounds, taking $\epsilon\leq 1/2$, and noting that $h(\eta,\alpha\lambda):=\frac{(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\mu_{+}\mu_{-})}{(\alpha\lambda)^{2}}$, we get

\begin{aligned}
\alpha^{2}K\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2} &\geq \frac{\alpha^{2}K}{(\mu_{+}-\mu_{-})^{2}}\left(\frac{1-\mu_{+}^{2n}}{1-\mu_{+}^{2}}+\frac{1-\mu_{-}^{2n}}{1-\mu_{-}^{2}}-2\,\frac{1-(\mu_{+}\mu_{-})^{n}}{1-\mu_{+}\mu_{-}}\right)\\
&\geq \frac{\alpha^{2}K}{2(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\mu_{+}\mu_{-})}. \qquad (7)
\end{aligned}

Finally, consider the case $\eta=(1-\sqrt{\alpha\lambda})^{2}$. In this case $\mu_{+}=\mu_{-}=\mu$.

Let $P=S^{-1}JS$, where $J=\begin{bmatrix}\mu&1\\ 0&\mu\end{bmatrix}$ and $S=\begin{bmatrix}x&y\\ w&z\end{bmatrix}$. Since $\eta=(1-\sqrt{\alpha\lambda})^{2}$, we get $\mu=\sqrt{\eta}=1-\sqrt{\alpha\lambda}$ and $1-\alpha\lambda+\eta=2\sqrt{\eta}$. We now solve for $S$.

[xywz][1αλ+ηη10]=[μ10μ][xywz].\begin{bmatrix}x&y\\ w&z\end{bmatrix}\begin{bmatrix}1-\alpha\lambda+\eta&-\eta\\ 1&0\end{bmatrix}=\begin{bmatrix}\mu&1\\ 0&\mu\end{bmatrix}\begin{bmatrix}x&y\\ w&z\end{bmatrix}.

It follows that the following equations hold

x(1αλ+η)+y\displaystyle x(1-\alpha\lambda+\eta)+y =μx+w\displaystyle=\mu x+w
w(1αλ+η)+z\displaystyle w(1-\alpha\lambda+\eta)+z =μw\displaystyle=\mu w
\displaystyle-\eta x=\mu y+z
\displaystyle-\eta w=\mu z.

Solving the above equations with w=1w=1, we get,

S=[011η]S=\begin{bmatrix}0&1\\ 1&-\sqrt{\eta}\end{bmatrix}
Pj\displaystyle P^{j} =S1JjS\displaystyle=S^{-1}J^{j}S
\displaystyle=\begin{bmatrix}\sqrt{\eta}&1\\ 1&0\end{bmatrix}\begin{bmatrix}\mu^{j}&j\mu^{j-1}\\ 0&\mu^{j}\end{bmatrix}\begin{bmatrix}0&1\\ 1&-\sqrt{\eta}\end{bmatrix}
\displaystyle=\begin{bmatrix}\mu^{j+1}&(j+1)\mu^{j}\\ \mu^{j}&j\mu^{j-1}\end{bmatrix}\begin{bmatrix}0&1\\ 1&-\mu\end{bmatrix}
=[(j+1)μjjμj+1jμj1(j1)μj]\displaystyle=\begin{bmatrix}(j+1)\mu^{j}&-j\mu^{j+1}\\ j\mu^{j-1}&-(j-1)\mu^{j}\end{bmatrix}

Observe that,

\displaystyle\sum_{j=0}^{n-1}\|P^{j}e_{1}\|^{2}=\sum_{j=0}^{n-1}\left((j+1)^{2}\mu^{2j}+j^{2}\mu^{2(j-1)}\right)\geq\sum_{j=0}^{n-1}(j\mu^{j-1})^{2}
=limμ+μj=0n1(μ+jμjμ+μ)2\displaystyle=\lim_{\mu_{+}\rightarrow\mu_{-}}\sum_{j=0}^{n-1}\left(\frac{\mu_{+}^{j}-\mu_{-}^{j}}{\mu_{+}-\mu_{-}}\right)^{2}
limμ+μ12(1μ+2)(1μ2)(1μ+μ),\displaystyle\geq\lim_{\mu_{+}\rightarrow\mu_{-}}\frac{1}{2(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\mu_{+}\mu_{-})},

where the last inequality follows from (7). Therefore,

α2Kj=0n1Pje12α2K2(1μ2)(1μ2)(1η).\alpha^{2}K\sum_{j=0}^{n-1}\Big{\|}P^{j}e_{1}\Big{\|}^{2}\geq\frac{\alpha^{2}K}{2(1-\mu^{2})(1-\mu^{2})(1-\eta)}.
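The lower bound in (7) can also be sanity checked numerically. The snippet below is a minimal sketch (not part of the proof; the values of $\alpha\lambda$, $\eta$, and $n$ are illustrative assumptions chosen so that $\mu_{+},\mu_{-}$ are real): it accumulates $\sum_{j<n}\|P^{j}e_{1}\|^{2}$ for the $2\times 2$ matrix $P$ and compares it with the right-hand side of (7) divided by $\alpha^{2}K$.

```python
import numpy as np

# Sanity check of (7): sum_{j<n} ||P^j e_1||^2 >= 1 / (2(1-mu_+^2)(1-mu_-^2)(1-mu_+ mu_-))
# for the 2x2 matrix P = [[1 - a + eta, -eta], [1, 0]] with a = alpha*lambda.
# Illustrative values only (eta < (1 - sqrt(a))^2, so the eigenvalues are real).
a, eta, n = 0.1, 0.4, 2000

P = np.array([[1.0 - a + eta, -eta],
              [1.0,            0.0]])
e1 = np.array([1.0, 0.0])

lhs, v = 0.0, e1.copy()
for _ in range(n):
    lhs += v @ v                      # accumulates ||P^j e_1||^2
    v = P @ v

mu_p, mu_m = np.linalg.eigvals(P)
rhs = 1.0 / (2.0 * (1 - mu_p**2) * (1 - mu_m**2) * (1 - mu_p * mu_m))
print(lhs >= rhs.real, lhs, rhs.real)   # expect True
```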

Lemma 2.

For all α>0\alpha>0, η[0,(1αλ)2)\eta\in[0,(1-\sqrt{\alpha\lambda})^{2}), if PnX0~2<ϵ\|P^{n}\tilde{X_{0}}\|^{2}<\epsilon for small enough ϵ>0\epsilon>0, then α2Kj=0n1Pje12K16λ2(1ρ(P)).\alpha^{2}K\sum_{j=0}^{n-1}\Big{\|}P^{j}e_{1}\Big{\|}^{2}\geq\frac{K}{16\lambda^{2}}(1-\rho(P)).

Proof.

Before proceeding to the proof, we introduce a new function

h(η,αλ)(1μ+2)(1μ2)(1η)(αλ)2\displaystyle h(\eta,\alpha\lambda)\triangleq\frac{(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\eta)}{(\alpha\lambda)^{2}}

to simplify calculations. First we consider the case η[0,(1αλ)2]\eta\in[0,(1-\sqrt{\alpha\lambda})^{2}] when μ+\mu_{+} and μ\mu_{-} are real. Since μ+\mu_{+} and μ\mu_{-} are only functions of αλ\alpha\lambda and η\eta, hh is well-defined.

The proof of the Lemma follows by showing h(η,αλ)(1ρ(P))8h(\eta,\alpha\lambda)(1-\rho(P))\leq 8. Note that 1μ+2=(1μ+)(1+μ+)2(1μ+)1-\mu_{+}^{2}=(1-\mu_{+})(1+\mu_{+})\leq 2(1-\mu_{+}) since μ+1\mu_{+}\leq 1. Similarly, (1μ2)2(1μ)(1-\mu_{-}^{2})\leq 2(1-\mu_{-}). Thus, h(η,αλ)(1μ+)4(1μ+)2(1μ)(1μ+μ)(αλ)2h(\eta,\alpha\lambda)(1-\mu_{+})\leq\frac{4(1-\mu_{+})^{2}(1-\mu_{-})(1-\mu_{+}\mu_{-})}{(\alpha\lambda)^{2}}. Since μ++μ=1αλ+η\mu_{+}+\mu_{-}=1-\alpha\lambda+\eta, it follows that (1μ+)(1μ)=1+μ+μμ+μ=1(1αλ+η)+η=αλ(1-\mu_{+})(1-\mu_{-})=1+\mu_{+}\mu_{-}-\mu_{+}-\mu_{-}=1-(1-\alpha\lambda+\eta)+\eta=\alpha\lambda. Thus,

h(η,αλ)(1μ+)\displaystyle h(\eta,\alpha\lambda)(1-\mu_{+}) 4(1μ+)(1η)αλ\displaystyle\leq\frac{4(1-\mu_{+})(1-\eta)}{\alpha\lambda}
=4(1η)αλ(1(1+ηλα)+(λα1η)24η2)g(η,αλ).\displaystyle=\frac{4(1-\eta)}{\alpha\lambda}\Bigg{(}1-\frac{(1+\eta-\lambda\alpha)+\sqrt{(\lambda\alpha-1-\eta)^{2}-4\eta}}{2}\Bigg{)}\triangleq g(\eta,\alpha\lambda).

We see that

g(η,αλ)η=2((1αλ+η)24η+η1)((1αλ+η)24η+ηαλ1)αλ(1αλ+η)24η\displaystyle\frac{\partial g(\eta,\alpha\lambda)}{\partial\eta}=\frac{2\Big{(}\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}+\eta-1\Big{)}\Big{(}\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}+\eta-\alpha\lambda-1\Big{)}}{\alpha\lambda\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}}

Observe that the denominator in the above expression is positive. We consider the following two cases: (i) $\alpha\lambda\geq 1$ and (ii) $\alpha\lambda<1$. When $\alpha\lambda\geq 1$, we can directly bound $g(\eta,\alpha\lambda)=\frac{4(1-\mu_{+})(1-\eta)}{\alpha\lambda}$: since $\mu_{+}\geq-1$ and $\eta\geq 0$, we get $g(\eta,\alpha\lambda)\leq 8$.

Now consider the case αλ<1\alpha\lambda<1. We have that,

(1αλ+η)24η+η1\displaystyle\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}+\eta-1 =(1η)2+(αλ)22αλ(1+η)(1η)\displaystyle=\sqrt{(1-\eta)^{2}+(\alpha\lambda)^{2}-2\alpha\lambda(1+\eta)}-(1-\eta)

When $\alpha\lambda<1$, $(\alpha\lambda)^{2}-2\alpha\lambda(1+\eta)<0$ for all $\eta\geq 0$; thus $\Big{(}\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}+\eta-\alpha\lambda-1\Big{)}\leq\Big{(}\sqrt{(1-\alpha\lambda+\eta)^{2}-4\eta}+\eta-1\Big{)}<0$. Since both factors are negative, the numerator of the partial derivative is also positive, and hence $\displaystyle\frac{\partial g(\eta,\alpha\lambda)}{\partial\eta}>0$. Therefore $g(\eta,\alpha\lambda)$ is an increasing function of $\eta$, and its maximum over the range of interest is attained at $\eta=(1-\sqrt{\alpha\lambda})^{2}$ and is given by:

g(η,αλ)\displaystyle g(\eta,\alpha\lambda) 4(1μ+)(1η)αλ\displaystyle\leq\frac{4(1-\mu_{+})(1-\eta)}{\alpha\lambda}
=4(1ρ(P))(1η)αλ\displaystyle=\frac{4(1-\rho(P))(1-\eta)}{\alpha\lambda}
=4(1(1αλ))(1η)αλ\displaystyle=\frac{4(1-(1-\sqrt{\alpha\lambda}))(1-\eta)}{\alpha\lambda}
=4αλ(2αλαλ)αλ\displaystyle=\frac{4\sqrt{\alpha\lambda}(2\sqrt{\alpha\lambda}-\alpha\lambda)}{\alpha\lambda}
8\displaystyle\leq 8

Next, for $\eta\in((1-\sqrt{\alpha\lambda})^{2},1]$, the eigenvalues $\mu_{+}$ and $\mu_{-}$ are complex. We have,

h(η,αλ)=(1μ+2)(1μ2)(1η)(αλ)2.\displaystyle h(\eta,\alpha\lambda)=\frac{(1-\mu_{+}^{2})(1-\mu_{-}^{2})(1-\eta)}{(\alpha\lambda)^{2}}.

First we show that (1μ+2)(1μ2)4(1μ+)(1μ)(1-\mu_{+}^{2})(1-\mu_{-}^{2})\leq 4(1-\mu_{+})(1-\mu_{-}). Notice,

(1μ+2)(1μ2)\displaystyle(1-\mu_{+}^{2})(1-\mu_{-}^{2}) =(1μ+)(1+μ+)(1μ)(1+μ)\displaystyle=(1-\mu_{+})(1+\mu_{+})(1-\mu_{-})(1+\mu_{-})
\displaystyle=(1-\mu_{+})(1-\mu_{-})(1+\mu_{+}+\mu_{-}+\mu_{+}\mu_{-})
=(1μ+)(1μ)(1+(1αλ+η)+η)\displaystyle=(1-\mu_{+})(1-\mu_{-})(1+(1-\alpha\lambda+\eta)+\eta)
4(1μ+)(1μ).\displaystyle\leq 4(1-\mu_{+})(1-\mu_{-}).

It follows that,

h(η,αλ)(1ρ(P))\displaystyle h(\eta,\alpha\lambda)(1-\rho(P)) 4(1ρ(P))(1η)αλ\displaystyle\leq\frac{4(1-\rho(P))(1-\eta)}{\alpha\lambda}
=4(1η)(1η)αλl(η,αλ)\displaystyle=\frac{4(1-\sqrt{\eta})(1-\eta)}{\alpha\lambda}\triangleq l(\eta,\alpha\lambda)

Observe that $l(\eta,\alpha\lambda)$ is a decreasing function of $\eta$; hence its supremum over $\eta\in((1-\sqrt{\alpha\lambda})^{2},1]$ is approached as $\eta$ tends to $(1-\sqrt{\alpha\lambda})^{2}$. Evaluating at $\eta=(1-\sqrt{\alpha\lambda})^{2}$,

l(η,αλ)\displaystyle l(\eta,\alpha\lambda) =4(1(1αλ))(1(1αλ)2)αλ8\displaystyle=\frac{4(1-(1-\sqrt{\alpha\lambda}))(1-(1-\sqrt{\alpha\lambda})^{2})}{\alpha\lambda}\leq 8
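The key inequality of this proof, $h(\eta,\alpha\lambda)(1-\rho(P))\leq 8$, can also be checked numerically. The following sketch is an illustrative grid search (not a substitute for the argument above); the grid ranges are assumptions.

```python
import numpy as np

# Grid check that h(eta, a) * (1 - rho(P)) <= 8, where a = alpha*lambda and mu_± are the
# roots of mu^2 - (1 - a + eta) mu + eta = 0 (covers both the real and complex regimes).
for a in np.linspace(0.01, 1.0, 50):
    for eta in np.linspace(0.0, 1.0, 101):
        b = 1.0 - a + eta
        disc = complex(b * b - 4.0 * eta)
        mu_p = (b + np.sqrt(disc)) / 2.0
        mu_m = (b - np.sqrt(disc)) / 2.0
        h = ((1 - mu_p**2) * (1 - mu_m**2) * (1 - eta)).real / a**2
        rho = max(abs(mu_p), abs(mu_m))
        assert h * (1 - rho) <= 8.0 + 1e-9, (a, eta)
print("h(eta, alpha*lambda) * (1 - rho(P)) <= 8 holds on the grid")
```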

Lemma 3.

If n=K64ϵλ2log(X0~2ϵ)n=\frac{K}{64\epsilon\lambda^{2}}\log(\frac{\|\tilde{X_{0}}\|^{2}}{\epsilon}) and PnX~02<ϵ\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon, then 1μ+16ϵλ2K1-\mu_{+}\geq\frac{16\epsilon\lambda^{2}}{K}.

Proof.

From $\|P^{n}\tilde{X}_{0}\|^{2}<\epsilon$ we have $\rho(P)^{2n}\leq\epsilon$. Then,

ϵρ(P)2ne2n1ρ(P)ρ(P)\displaystyle\epsilon\geq\rho(P)^{2n}\geq e^{-2n\frac{1-\rho(P)}{\rho(P)}}

Thus, we have $e^{-2n\frac{1-\rho(P)}{\rho(P)}}<\epsilon$, which gives $n\geq\frac{\rho(P)}{2(1-\rho(P))}\log(\frac{1}{\epsilon})$. Combining this with $n=\frac{K}{64\epsilon\lambda^{2}}\log(\frac{1}{\epsilon})$, re-arranging, and choosing $\epsilon$ such that $1\leq\frac{K}{32\epsilon\lambda^{2}}$, we have

11ρ(P)K32ϵλ2+1K16ϵλ2.\displaystyle\frac{1}{1-\rho(P)}\leq\frac{K}{32\epsilon\lambda^{2}}+1\leq\frac{K}{16\epsilon\lambda^{2}}. (8)

Appendix C Proof of Theorem 2.2

We prove the theorem separately for η=0\eta=0 (corresponding to SGD), β=0\beta=0 (corresponding to SHB) and β=1\beta=1 (corresponding to ASG).

Case-1: η=0\eta=0 (SGD)

The LSA-M iterate in (2), written in terms of $x_{n}-x^{*}$, is shown in (10) below; setting $\eta=0$ gives (9):

xn+1x=xnx+α(AxAxn+Mn+1)\displaystyle x_{n+1}-x^{*}=x_{n}-x^{*}+\alpha(Ax^{*}-Ax_{n}+M_{n+1}) (9)
xn+1x=xnx+α(AxAxn+Mn+1)+η((xnx)(xn1x))\displaystyle x_{n+1}-x^{*}=x_{n}-x^{*}+\alpha(Ax^{*}-Ax_{n}+M_{n+1})+\eta((x_{n}-x^{*})-(x_{n-1}-x^{*})) (10)

Let x~n=xnx\tilde{x}_{n}=x_{n}-x^{*}. Then, equation (9) can be rewritten as:

x~n=x~n1+α(Ax~n1+Mn)=(IαA)x~n1+αMn=(IαA)nx~0+αi=0n1[(IαA)n1iMi+1]\begin{split}\tilde{x}_{n}&=\tilde{x}_{n-1}+\alpha(-A\tilde{x}_{n-1}+M_{n})=(I-\alpha A)\tilde{x}_{n-1}+\alpha M_{n}\\ &=(I-\alpha A)^{n}\tilde{x}_{0}+\alpha\sum_{i=0}^{n-1}[(I-\alpha A)^{n-1-i}M_{i+1}]\end{split}

Taking the square of the norm on both sides of the above equation, we obtain

x~n2=(IαA)nx~02+2α((IαA)nx~0)T(i=0n1(IαA)n1iMi+1)+α2i=0n1j=0n1((IαA)n1iMi+1)T((IαA)(n1j)Mj+1))\begin{split}\|\tilde{x}_{n}\|^{2}&=\|(I-\alpha A)^{n}\tilde{x}_{0}\|^{2}+2\alpha\left((I-\alpha A)^{n}\tilde{x}_{0}\right)^{T}\left(\sum_{i=0}^{n-1}(I-\alpha A)^{n-1-i}M_{i+1}\right)\\ &+\alpha^{2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}((I-\alpha A)^{n-1-i}M_{i+1})^{T}((I-\alpha A)^{(n-1-j)}M_{j+1}))\end{split}

Now we take expectation on both sides to obtain

\begin{split}\mathbb{E}[\|\tilde{x}_{n}\|^{2}]&\leq\|(I-\alpha A)^{n}\|^{2}\|\tilde{x}_{0}\|^{2}+2\alpha\left((I-\alpha A)^{n}\tilde{x}_{0}\right)^{T}\left(\sum_{i=0}^{n-1}(I-\alpha A)^{(n-1-i)}\mathbb{E}[M_{i+1}]\right)\\ &+\alpha^{2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\mathbb{E}\left[((I-\alpha A)^{(n-1-i)}M_{i+1})^{T}((I-\alpha A)^{(n-1-j)}M_{j+1})\right]\end{split}

Now, from Assumption 2.a, 𝔼[Mi+1]=𝔼[𝔼[Mi+1|i]]=0.\mathbb{E}[M_{i+1}]=\mathbb{E}[\mathbb{E}[M_{i+1}|\mathcal{F}_{i}]]=0. Therefore the second term becomes 0. Next consider the term inside the double summation. First consider the case iji\neq j. Without loss of generality, suppose i<ji<j.

𝔼[Mi+1T((IαA)(n1i))T(IαA)(n1j)Mj+1]=𝔼[𝔼[Mi+1T((IαA)(n1i))T(IαA)(n1j)Mj+1|j]]=𝔼[Mi+1T((IαA)(n1i))T(IαA)(n1j)𝔼[Mj+1|j]]=0\begin{split}&\mathbb{E}\left[M_{i+1}^{T}\left((I-\alpha A)^{(n-1-i)}\right)^{T}(I-\alpha A)^{(n-1-j)}M_{j+1}\right]\\ &=\mathbb{E}\left[\mathbb{E}\left[M_{i+1}^{T}\left((I-\alpha A)^{(n-1-i)}\right)^{T}(I-\alpha A)^{(n-1-j)}M_{j+1}|\mathcal{F}_{j}\right]\right]\\ &=\mathbb{E}\left[M_{i+1}^{T}\left((I-\alpha A)^{(n-1-i)}\right)^{T}(I-\alpha A)^{(n-1-j)}\mathbb{E}[M_{j+1}|\mathcal{F}_{j}]\right]=0\end{split}

The last equality follows from Assumption 2.a. When i=ji=j,

𝔼[Mi+1T((IαA)(n1i))T(IαA)(n1i)Mi+1]𝔼[(IαA)(n1i)2𝔼[Mi+12|i]](IαA)(n1i)2K(1+𝔼[x~i2])\begin{split}&\mathbb{E}\left[M_{i+1}^{T}\left((I-\alpha A)^{(n-1-i)}\right)^{T}(I-\alpha A)^{(n-1-i)}M_{i+1}\right]\\ &\leq\mathbb{E}\left[\|(I-\alpha A)^{(n-1-i)}\|^{2}\mathbb{E}\left[\|M_{i+1}\|^{2}|\mathcal{F}_{i}\right]\right]\leq\|(I-\alpha A)^{(n-1-i)}\|^{2}K\left(1+\mathbb{E}[\|\tilde{x}_{i}\|^{2}]\right)\end{split}

Substituting the above values and using Λ=x0x2\Lambda=||x_{0}-x^{*}||^{2} we get

𝔼[xn~2](IαA)n2Λ+α2Ki=0n1(IαA)(n1i)2(1+𝔼[x~i2])\begin{split}\mathbb{E}[\|\tilde{x_{n}}\|^{2}]&\leq\|(I-\alpha A)^{n}\|^{2}\Lambda+\alpha^{2}K\sum_{i=0}^{n-1}\|(I-\alpha A)^{(n-1-i)}\|^{2}(1+\mathbb{E}[\|\tilde{x}_{i}\|^{2}])\end{split}

We next use the following lemma to bound (IαA)i\|(I-\alpha A)^{i}\|.

Lemma 4.

Let $M\in\mathbb{R}^{d\times d}$ be a matrix and let $\lambda_{i}(M)$ denote the $i^{th}$ eigenvalue of $M$. Then, for all $\delta>0$,

MnCδ(ρ(M)+δ)n\|M^{n}\|\leq C_{\delta}(\rho(M)+\delta)^{n}

where $\rho(M)=\max_{i}|\lambda_{i}(M)|$ is the spectral radius of $M$ and $C_{\delta}$ is a constant that depends on $\delta$. Furthermore, if the eigenvalues of $M$ are distinct, then

MnC(ρ(M))n\|M^{n}\|\leq C(\rho(M))^{n}
Proof.

See Appendix D.1. ∎

We let λmin(A)=miniλi(A)\lambda_{\min}(A)=\min_{i}\lambda_{i}(A). Using Assumption 1 and Lemma 4, we have

𝔼[x~n2]C2(1αλmin(A))2nΛ+C2α2Ki=0n1(1αλmin(A))2(n1i)(1+𝔼[x~i2])\begin{split}\mathbb{E}[\|\tilde{x}_{n}\|^{2}]&\leq C^{2}(1-\alpha\lambda_{\min}(A))^{2n}\Lambda+C^{2}\alpha^{2}K\sum_{i=0}^{n-1}(1-\alpha\lambda_{\min}(A))^{2(n-1-i)}(1+\mathbb{E}[\|\tilde{x}_{i}\|^{2}])\end{split}
where, C=dσmin(S)σmin(S1),\mbox{where, \quad}C=\frac{\sqrt{d}}{\sigma_{\min}(S)\sigma_{\min}(S^{-1})}, (11)

SS is the matrix in Jordan decomposition of AA and σmin(S)\sigma_{\min}(S) is the smallest singular value of SS. We define the sequence {Uk}\{U_{k}\} as below:

Uk=C2(1αλmin(A))2kΛ+C2α2Ki=0k1(1αλmin(A))2(k1i)(1+Ui)U_{k}=C^{2}(1-\alpha\lambda_{\min}(A))^{2k}\Lambda+C^{2}\alpha^{2}K\sum_{i=0}^{k-1}(1-\alpha\lambda_{\min}(A))^{2(k-1-i)}(1+U_{i})

Observe that 𝔼[x~n2]Un\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq U_{n} and that the sequence {Uk}\{U_{k}\} satisfies

Uk+1=(1αλmin(A))2Uk+C2Kα2(1+Uk);U0=C2Λ\displaystyle U_{k+1}=(1-\alpha\lambda_{\min}(A))^{2}U_{k}+C^{2}K\alpha^{2}(1+U_{k});\quad U_{0}=C^{2}\Lambda

Therefore, we have

Uk+1=((1αλmin(A))2+C2Kα2)Uk+α2C2KU_{k+1}=\left((1-\alpha\lambda_{\min}(A))^{2}+C^{2}K\alpha^{2}\right)U_{k}+\alpha^{2}C^{2}K

To ensure that (1αλmin(A))2+C2Kα2(1αλmin(A)/2)2(1-\alpha\lambda_{\min}(A))^{2}+C^{2}K\alpha^{2}\leq(1-\alpha\lambda_{\min}(A)/2)^{2}, choose α\alpha as follows:

α2λmin(A)22αλmin(A)+C2Kα2α2λmin(A)24αλmin(A)\displaystyle\alpha^{2}\lambda_{\min}(A)^{2}-2\alpha\lambda_{\min}(A)+C^{2}K\alpha^{2}\leq\frac{\alpha^{2}\lambda_{\min}(A)^{2}}{4}-\alpha\lambda_{\min}(A)
 or αλmin(A)34λmin(A)2+C2K\displaystyle\mbox{ or }\alpha\leq\frac{\lambda_{\min}(A)}{\frac{3}{4}\lambda_{\min}(A)^{2}+C^{2}K}
Un(1αλmin(A)2)2Un1+α2C2K(1αλmin(A)2)2nU0+α2C2Ki=0n1(1αλmin(A)2)2i(1αλmin(A)2)2nU0+α2C2K11(1αλmin(A)2)2(1αλmin(A)2)2nU0+α2C2K2αλmin(A)\begin{split}U_{n}&\leq\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2}U_{n-1}+\alpha^{2}C^{2}K\\ &\leq\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2n}U_{0}+\alpha^{2}C^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2i}\\ &\leq\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2n}U_{0}+\alpha^{2}C^{2}K\frac{1}{1-\left(1-\frac{\alpha\lambda_{min}(A)}{2}\right)^{2}}\\ &\leq\left(1-\frac{\alpha\lambda_{\min}(A)}{2}\right)^{2n}U_{0}+\alpha^{2}C^{2}K\frac{2}{\alpha\lambda_{min}(A)}\\ \end{split}

We assume that α1λmin(A)\alpha\leq\frac{1}{\lambda_{\min}(A)} and therefore (1αλmin(A)2)2eαλmin(A)(1-\frac{\alpha\lambda_{\min}(A)}{2})^{2}\leq e^{-\alpha\lambda_{\min}(A)}.

Unenαλmin(A)C2Λ+2αC2Kλmin(A)\begin{split}U_{n}\leq e^{-n\alpha\lambda_{\min}(A)}C^{2}\Lambda+\frac{2\alpha C^{2}K}{\lambda_{min}(A)}\end{split}

Choose α\alpha as below:

αϵλmin(A)4C2K\alpha\leq\frac{\epsilon\lambda_{min}(A)}{4C^{2}K}

Then,

2αC2Kλmin(A)ϵ2𝔼[x~n2]Unϵ2+ϵ2=ϵ,\frac{2\alpha C^{2}K}{\lambda_{min}(A)}\leq\frac{\epsilon}{2}\Rightarrow\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq U_{n}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon,

when the sample complexity is:

n=1αλmin(A)log(2C2Λϵ)n=\frac{1}{\alpha\lambda_{\min}(A)}\log\left(\frac{2C^{2}\Lambda}{\epsilon}\right)
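As an illustration of this conclusion, the sketch below runs the SGD iterate (9) on a small quadratic with the step size and iteration count derived above, and reports the final mean squared error. The matrix $A$, the point $x^{*}$, the noise model, and $\epsilon$ are arbitrary choices made only for this sketch.

```python
import numpy as np

# Monte Carlo sketch of Case-1: run the SGD iterate (9) with the step size and iteration
# count derived above; A, x*, the noise model, and epsilon are arbitrary choices.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 10.0])                      # lambda_min(A) = 1
x_star = np.array([1.0, -1.0, 2.0])
d, lam_min = 3, 1.0
K, eps = 1.0, 1e-2
C = np.sqrt(d)                                     # A is diagonal, so S = I in (11)

alpha = eps * lam_min / (4 * C**2 * K)             # step-size choice from Case-1
Lam = float(np.sum(x_star**2))                     # ||x_0 - x*||^2 with x_0 = 0
n = int(np.ceil(np.log(2 * C**2 * Lam / eps) / (alpha * lam_min)))

def run_sgd():
    x = np.zeros(d)
    for _ in range(n):
        noise = rng.standard_normal(d) * np.sqrt(K / d)    # E||M_{k+1}||^2 = K
        x = x + alpha * (A @ (x_star - x) + noise)         # iterate (9)
    return float(np.sum((x - x_star)**2))

err = np.mean([run_sgd() for _ in range(100)])
print(f"alpha = {alpha:.1e}, n = {n}, mean ||x_n - x*||^2 = {err:.2e} (epsilon = {eps})")
```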

Case-2: β=0\beta=0 (SHB)

LSA-M iterate with β=0\beta=0 can be re-written as:

x~n+1=(IαA)x~n+α(Mn+1)+η(x~nx~n1)\tilde{x}_{n+1}=(I-\alpha A)\tilde{x}_{n}+\alpha(M_{n+1})+\eta(\tilde{x}_{n}-\tilde{x}_{n-1})

This can be re-written as:

[x~n+1x~n]=[IαA+ηIηII0][x~nx~n1]+α[Mn+10]\begin{bmatrix}\tilde{x}_{n+1}\\ \tilde{x}_{n}\end{bmatrix}=\begin{bmatrix}I-\alpha A+\eta I&-\eta I\\ I&0\end{bmatrix}\begin{bmatrix}\tilde{x}_{n}\\ \tilde{x}_{n-1}\end{bmatrix}+\alpha\begin{bmatrix}M_{n+1}\\ 0\end{bmatrix}

Let us define

X~n[x~nx~n1],P[IαA+ηIηII0] and Wn[Mn0].\tilde{X}_{n}\triangleq\begin{bmatrix}\tilde{x}_{n}\\ \tilde{x}_{n-1}\end{bmatrix},P\triangleq\begin{bmatrix}I-\alpha A+\eta I&-\eta I\\ I&0\end{bmatrix}\mbox{ and }W_{n}\triangleq\begin{bmatrix}M_{n}\\ 0\end{bmatrix}.

Note that 𝔼[Wn+1|n]=0\mathbb{E}[W_{n+1}|\mathcal{F}_{n}]=0 and 𝔼[Wn+12|n]=𝔼[Mn+12|n]K(1+𝔼[x~n2])K(1+𝔼[X~n2])\mathbb{E}[\|W_{n+1}\|^{2}|\mathcal{F}_{n}]=\mathbb{E}[\|M_{n+1}\|^{2}|\mathcal{F}_{n}]\leq K(1+\mathbb{E}[\|\tilde{x}_{n}\|^{2}])\leq K(1+\mathbb{E}[\|\tilde{X}_{n}\|^{2}]). It follows that,

X~n=PX~n1+αWn=PnX~0+αi=0n1Pn1iWi+1\displaystyle\tilde{X}_{n}=P\tilde{X}_{n-1}+\alpha W_{n}=P^{n}\tilde{X}_{0}+\alpha\sum_{i=0}^{n-1}P^{n-1-i}W_{i+1}

The norm square of the above equation gives:

X~n2=PnX~02+α(PnX~0)T(i=0n1P(n1i)Wi+1)+α(i=0n1P(n1i)Wi+1)T(PnX~0)+α2(i=0n1P(n1i)Wi+1)T(i=0n1P(n1i)Wi+1)\begin{split}\|\tilde{X}_{n}\|^{2}&=\|P^{n}\tilde{X}_{0}\|^{2}+\alpha\left(P^{n}\tilde{X}_{0}\right)^{T}\left(\sum_{i=0}^{n-1}P^{(n-1-i)}W_{i+1}\right)+\alpha\left(\sum_{i=0}^{n-1}P^{(n-1-i)}W_{i+1}\right)^{T}\left(P^{n}\tilde{X}_{0}\right)\\ &+\alpha^{2}\left(\sum_{i=0}^{n-1}P^{(n-1-i)}W_{i+1}\right)^{T}\left(\sum_{i=0}^{n-1}P^{(n-1-i)}W_{i+1}\right)\end{split}

Taking expectation on both sides as well as using the fact that 𝔼[Wk+1|k]=0\mathbb{E}[W_{k+1}|\mathcal{F}_{k}]=0 and 𝔼[Wk+12|k]K(1+𝔼[X~k2])\mathbb{E}[\|W_{k+1}\|^{2}|\mathcal{F}_{k}]\leq K(1+\mathbb{E}[\|\tilde{X}_{k}\|^{2}]), we have

𝔼[X~n2]Pn2X~02+α2Ki=0n1Pn1i2(1+𝔼[X~i2])\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\|P^{n}\|^{2}\|\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{i=0}^{n-1}\|P^{n-1-i}\|^{2}(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}])

Without loss of generality assume x~1=𝟎\tilde{x}_{-1}=\mathbf{0}. Therefore, X~02=x~02=Λ||\tilde{X}_{0}||^{2}=||\tilde{x}_{0}||^{2}=\Lambda. As before, for a matrix MM, let ρ(M)=maxi|λi(M)|\rho(M)=\max_{i}|\lambda_{i}(M)| denote the spectral radius of MM. Next, we compute ρ(P)\rho(P). Consider the characteristic equation of P:

det([IαA+ηIμIηIIμI])=0det\left(\begin{bmatrix}I-\alpha A+\eta I-\mu I&-\eta I\\ I&-\mu I\end{bmatrix}\right)=0

When A21A_{21} and A22A_{22} commute, we have the following formula for determinant of a block matrix (Horn and Johnson, (1990)):

det([A11A12A21A22])=det(A11A22A12A21)det\left(\begin{bmatrix}A_{11}&A_{12}\\ A_{21}&A_{22}\end{bmatrix}\right)=det\left(A_{11}A_{22}-A_{12}A_{21}\right)
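A quick numerical illustration of this identity (not needed for the argument; the matrices below are arbitrary, with $A_{22}$ taken to be a scalar matrix so that it commutes with $A_{21}$):

```python
import numpy as np

# Block-determinant identity det([[A11, A12], [A21, A22]]) = det(A11 A22 - A12 A21)
# when A21 and A22 commute; here A22 = 2.5*I commutes with any A21.
rng = np.random.default_rng(1)
d = 3
A11, A12, A21 = (rng.standard_normal((d, d)) for _ in range(3))
A22 = 2.5 * np.eye(d)
M = np.block([[A11, A12], [A21, A22]])
print(np.isclose(np.linalg.det(M), np.linalg.det(A11 @ A22 - A12 @ A21)))   # expect True
```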

Using this, the characteristic equation of PP simplifies to:

det(μI+αμAημI+μ2I+ηI)=0\displaystyle det(-\mu I+\alpha\mu A-\eta\mu I+\mu^{2}I+\eta I)=0

We note that when $\mu=0$, the LHS of the above equation becomes $\det(\eta I)$. Thus, $\mu=0$ can never be a solution of the characteristic equation of $P$ whenever $\eta\neq 0$. We now further simplify the characteristic equation of $P$ to a more convenient form:

det(AI(μ+ημμ2ηαμ))=0det\left(A-I\left(\frac{\mu+\eta\mu-\mu^{2}-\eta}{\alpha\mu}\right)\right)=0

The only zeros of the characteristic equation of a matrix are its eigenvalues. Let $\lambda_{i}(A)$ be the eigenvalue of $A$ with $\lambda_{i}(A)=\frac{\mu+\eta\mu-\mu^{2}-\eta}{\alpha\mu}$, so that

μ2+μ(αλi(A)1η)+η=0\displaystyle\mu^{2}+\mu(\alpha\lambda_{i}(A)-1-\eta)+\eta=0

The above is a quadratic equation in μ\mu and the solution is given by

μ=(λi(A)α1η)±(λi(A)α1η)24η2\displaystyle\mu=\frac{-(\lambda_{i}(A)\alpha-1-\eta)\pm\sqrt{(\lambda_{i}(A)\alpha-1-\eta)^{2}-4\eta}}{2}

When $(\lambda_{i}(A)\alpha-1-\eta)^{2}-4\eta\leq 0$, the absolute value of the corresponding eigenvalue of $P$ is independent of $\alpha$ and

|μ|=12((λi(A)α1η)2+|(λi(A)α1η)24η|)=η|\mu|=\frac{1}{2}\left(\sqrt{(\lambda_{i}(A)\alpha-1-\eta)^{2}+|(\lambda_{i}(A)\alpha-1-\eta)^{2}-4\eta|}\right)=\sqrt{\eta}

To ensure that (λi(A)α1η)24η0(\lambda_{i}(A)\alpha-1-\eta)^{2}-4\eta\leq 0, we must have

(αλi(A)+1)2λi(A)αη(αλi(A)+1)+2λi(A)α(\alpha\lambda_{i}(A)+1)-2\sqrt{\lambda_{i}(A)\alpha}\leq\eta\leq(\alpha\lambda_{i}(A)+1)+2\sqrt{\lambda_{i}(A)\alpha}
(1λi(A)α)2η(1+λi(A)α)2\left(1-\sqrt{\lambda_{i}(A)\alpha}\right)^{2}\leq\eta\leq\left(1+\sqrt{\lambda_{i}(A)\alpha}\right)^{2}

For the spectral radius of PP to be η\sqrt{\eta}, the above must hold for all ii. We choose α\alpha as:

α(2λmin(A)+λmax(A))2\alpha\leq\left(\frac{2}{\sqrt{\lambda_{min}(A)}+\sqrt{\lambda_{max}(A)}}\right)^{2}

and η\eta as:

(1λmin(A)α)2η(1+λmin(A)α)2(1-\sqrt{\lambda_{\min}(A)\alpha})^{2}\leq\eta\leq(1+\sqrt{\lambda_{\min}(A)\alpha})^{2}
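As a sanity check of the above eigenvalue computation (illustrative only; the matrix $A$ below is an assumed example), one can verify numerically that for $\alpha$ and $\eta$ in the ranges above, in particular the specific choice of $\eta$ made just below, the spectral radius of the $2d\times 2d$ matrix $P$ equals $\sqrt{\eta}$:

```python
import numpy as np

# SHB check: with alpha and eta in the ranges above, rho(P) = sqrt(eta).
lams = np.array([1.0, 3.0, 9.0])                  # eigenvalues of an example diagonal A
A, d = np.diag(lams), len(lams)
lam_min, lam_max = lams.min(), lams.max()

alpha = (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max)))**2
eta = (1.0 - np.sqrt(lam_min * alpha) / 2.0)**2   # the choice of eta made in the text below

P = np.block([[np.eye(d) - alpha * A + eta * np.eye(d), -eta * np.eye(d)],
              [np.eye(d),                                np.zeros((d, d))]])
rho = max(abs(np.linalg.eigvals(P)))
print(rho, np.sqrt(eta))                          # the two agree up to numerical error
```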

Observe that if we choose the momentum parameter $\eta=\left(1-\sqrt{\lambda_{\min}(A)\alpha}\right)^{2}$, then $P$ has a repeated eigenvalue, since the discriminant $(\lambda_{\min}(A)\alpha-1-\eta)^{2}-4\eta$ vanishes. To ensure that $P$ does not have any repeated eigenvalue, we choose the momentum parameter $\eta=\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2}$. Therefore, $\rho(P)=\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)$. Using Assumption 1 and Lemma 4, we get

𝔼[X~n2]C^2(1λmin(A)α2)2nΛ+α2C^2Ki=0n1(1λmin(A)α2)2(n1i)(1+𝔼[X~i2])\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\hat{C}^{2}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2n}\Lambda+\alpha^{2}\hat{C}^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2(n-1-i)}(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}])

However, unlike in Case-1, here the constant C^\hat{C} is not independent of α\alpha and λmin(A)\lambda_{min}(A). The following lemma specifies an upper bound on C^\hat{C}.

Lemma 5.

C^C5αλmin(A)\hat{C}\leq C\frac{5}{\sqrt{\alpha\lambda_{\min}(A)}}, where CC is as defined in (11).

Proof.

See Appendix D.2

We define the sequence {Vn}\{V_{n}\} as follows

Vn=C^2(1λmin(A)α2)2nΛ+α2C^2Ki=0n1(1λmin(A)α2)2(n1i)(1+Vi)V_{n}=\hat{C}^{2}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2n}\Lambda+\alpha^{2}\hat{C}^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2(n-1-i)}(1+V_{i})

Observe that 𝔼[X~n2]Vn\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq V_{n}, and that {Vk}\{V_{k}\} satisfies

Vk+1=(1λmin(A)α2)2Vk+C^2Kα2(1+Vk);V0=C^2Λ\displaystyle V_{k+1}=\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2}V_{k}+\hat{C}^{2}K\alpha^{2}(1+V_{k});\quad V_{0}=\hat{C}^{2}\Lambda

Therefore, we have

Vk+1=((1λmin(A)α2)2+C^2Kα2)Vk+C^2Kα2V_{k+1}=\left(\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2}+\hat{C}^{2}K\alpha^{2}\right)V_{k}+\hat{C}^{2}K\alpha^{2}

To ensure that ((1λmin(A)α2)2+C^2Kα2)(1λmin(A)α4)2\left(\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2}+\hat{C}^{2}K\alpha^{2}\right)\leq\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2} we need to choose α\alpha such that

1+λmin(A)α4λmin(A)α+C^2Kα21+λmin(A)α16λmin(A)α21+\frac{\lambda_{\min}(A)\alpha}{4}-\sqrt{\lambda_{\min}(A)\alpha}+\hat{C}^{2}K\alpha^{2}\leq 1+\frac{\lambda_{\min}(A)\alpha}{16}-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}
or 3αλmin(A)16+C^2Kα32λmin(A)2\mbox{or \quad}\frac{3\sqrt{\alpha}\lambda_{\min}(A)}{16}+\hat{C}^{2}K\alpha^{\frac{3}{2}}\leq\frac{\sqrt{\lambda_{\min}(A)}}{2}

Next, using Lemma 5, the above can be ensured by choosing α\alpha such that

3αλmin(A)16+C225αλminKα32λmin(A)2\frac{3\sqrt{\alpha}\lambda_{\min}(A)}{16}+C^{2}\frac{25}{\alpha\lambda_{\min}}K\alpha^{\frac{3}{2}}\leq\frac{\sqrt{\lambda_{\min}(A)}}{2}
\mbox{ or }\alpha\leq\left(\frac{\sqrt{\lambda_{\min}(A)}}{\frac{3}{8}\lambda_{\min}(A)+\frac{50C^{2}K}{\lambda_{\min}(A)}}\right)^{2}

The recursion for the sequence {Vk+1}\{V_{k+1}\} then follows

Vk+1(1λmin(A)α4)2Vk+C^2Kα2\begin{split}V_{k+1}&\leq\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2}V_{k}+\hat{C}^{2}K\alpha^{2}\end{split}

Unrolling the recursion, we get

\begin{split}V_{n}&\leq\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2n}V_{0}+\hat{C}^{2}K\alpha^{2}\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2i}\\ &\leq\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2n}V_{0}+\hat{C}^{2}K\alpha^{2}\frac{1}{1-\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2}}\\ &\leq\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2n}V_{0}+\hat{C}^{2}K\alpha^{2}\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\end{split}

Further it follows from α(2λmin(A)+λmax(A))2\alpha\leq\left(\frac{2}{\sqrt{\lambda_{min}(A)}+\sqrt{\lambda_{max}(A)}}\right)^{2} that α1λmin(A)\alpha\leq\frac{1}{\lambda_{\min}(A)} and (1λmin(A)α4)2eλmin(A)α2\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}\right)^{2}\leq e^{-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}.

Vnenλmin(A)α2C^2Λ+4α2C^2Kαλmin(A)V_{n}\leq e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}\hat{C}^{2}\Lambda+\frac{4\alpha^{2}\hat{C}^{2}K}{\sqrt{\alpha\lambda_{min}(A)}}

Again using Lemma 5,

Vnenλmin(A)α225C2λmin(A)αΛ+α2100C2K(λmin(A)α)3/2V_{n}\leq e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}\frac{25C^{2}}{\lambda_{\min}(A)\alpha}\Lambda+\alpha^{2}\frac{100C^{2}K}{\left(\lambda_{\min}(A)\alpha\right)^{3/2}}

Observe that

enλmin(A)α2λmin(A)αenλmin(A)α4\frac{e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}}}{\lambda_{\min}(A)\alpha}\leq e^{-n\frac{\sqrt{\lambda_{\min}(A)\alpha}}{4}}
 for n4αλmin(A)log(1λmin(A)α).\mbox{ for }n\geq\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\left(\frac{1}{\lambda_{\min}(A)\alpha}\right).

Let nn be as above. Then,

V_{n}\leq 25C^{2}\Lambda e^{-\frac{n}{4}\sqrt{\lambda_{\min}(A)\alpha}}+\sqrt{\alpha}\frac{100C^{2}K}{\left(\lambda_{\min}(A)\right)^{3/2}}

Choose α\alpha as below:

α(ϵ(λmin(A))3/2200C2K)2\alpha\leq\left(\frac{\epsilon(\lambda_{\min}(A))^{3/2}}{200C^{2}K}\right)^{2}

Then,

α100C2K(λmin(A))3/2ϵ2𝔼[x~n2]𝔼[X~n2]Vnϵ2+ϵ2=ϵ,\sqrt{\alpha}\frac{100C^{2}K}{\left(\lambda_{\min}(A)\right)^{3/2}}\leq\frac{\epsilon}{2}\Rightarrow\mathbb{E}[\|\tilde{x}_{n}\|^{2}]\leq\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq V_{n}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon,

when nn is as follows:

n=\max\left(\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\left(\frac{50C^{2}\Lambda}{\epsilon}\right),\frac{4}{\sqrt{\alpha\lambda_{\min}(A)}}\log\left(\frac{1}{\lambda_{\min}(A)\alpha}\right)\right)

Case-3: β=1\beta=1 (ASG)

The proof progresses in a similar way as in Case-2. It is easy to see that the following relation holds with a modified definition of the matrix PP.

𝔼[X~n2]Pn2X~02+α2Ki=0n1Pn1i2(1+𝔼[X~i2]),\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\|P^{n}\|^{2}\|\tilde{X}_{0}\|^{2}+\alpha^{2}K\sum_{i=0}^{n-1}\|P^{n-1-i}\|^{2}(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}]),

where,

P[IαA+η(IαA)η(IαA)I0]P\triangleq\begin{bmatrix}I-\alpha A+\eta(I-\alpha A)&-\eta(I-\alpha A)\\ I&0\end{bmatrix}

As in the previous case, we compute the eigenvalues of $P$. The characteristic equation of $P$ is given by:

det([IαA+η(IαA)μIη(IαA)IμI])=0det\left(\begin{bmatrix}I-\alpha A+\eta(I-\alpha A)-\mu I&-\eta(I-\alpha A)\\ I&-\mu I\end{bmatrix}\right)=0

As in the previous case, this can be simplified to the following equation:

det(μI+αμAμη(IαA)+μ2I+η(IαA))=0\displaystyle det(-\mu I+\alpha\mu A-\mu\eta(I-\alpha A)+\mu^{2}I+\eta(I-\alpha A))=0

We now further simplify the characteristic equation of $P$ to a more convenient form:

det(AI(μ+ημμ2ηαμ+αμηηα))=0det\left(A-I\left(\frac{\mu+\eta\mu-\mu^{2}-\eta}{\alpha\mu+\alpha\mu\eta-\eta\alpha}\right)\right)=0

Progressing as in the previous case, we get that the eigenvalues of $P$ satisfy:

μ=(λi(A)α(1+η)1η)±(λi(A)α(1+η)1η)24η(1αλi(A))2\displaystyle\mu=\frac{-(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)\pm\sqrt{(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)^{2}-4\eta(1-\alpha\lambda_{i}(A))}}{2}

When (λi(A)α(1+η)1η)24η(1αλi(A))0(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)^{2}-4\eta(1-\alpha\lambda_{i}(A))\leq 0, we have that

|μ|=12((λi(A)α(1+η)1η)2(λi(A)α(1+η)1η)2+4η(1αλi(A)))=η(1αλi(A))|\mu|=\frac{1}{2}\left(\sqrt{(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)^{2}-(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)^{2}+4\eta(1-\alpha\lambda_{i}(A))}\right)=\sqrt{\eta(1-\alpha\lambda_{i}(A))}

This implies that,

ρ(P)=η(1αλmin(A))\rho(P)=\sqrt{\eta(1-\alpha\lambda_{min}(A))} (12)

To ensure that (λi(A)α(1+η)1η)24η(1αλi(A))0(\lambda_{i}(A)\alpha(1+\eta)-1-\eta)^{2}-4\eta(1-\alpha\lambda_{i}(A))\leq 0, we must have

\eta^{2}(1-\alpha\lambda_{i}(A))^{2}-2\eta(1-\alpha^{2}\lambda_{i}(A)^{2})+(1-\alpha\lambda_{i}(A))^{2}\leq 0

We assume α1λmax(A)\alpha\leq\frac{1}{\lambda_{max}(A)} and therefore, (1αλi(A))0(1-\alpha\lambda_{i}(A))\geq 0 holds for all ii. Using this, the above relation simplifies to:

\eta^{2}(1-\alpha\lambda_{i}(A))-2\eta(1+\alpha\lambda_{i}(A))+(1-\alpha\lambda_{i}(A))\leq 0

For the above to hold, we must have that

\frac{2(1+\alpha\lambda_{i}(A))-4\sqrt{\alpha\lambda_{i}(A)}}{2(1-\alpha\lambda_{i}(A))}\leq\eta\leq\frac{2(1+\alpha\lambda_{i}(A))+4\sqrt{\alpha\lambda_{i}(A)}}{2(1-\alpha\lambda_{i}(A))}
(1αλi(A))2(1αλi(A))η(1+αλi(A))2(1αλi(A))\frac{(1-\sqrt{\alpha\lambda_{i}(A)})^{2}}{(1-\alpha\lambda_{i}(A))}\leq\eta\leq\frac{(1+\sqrt{\alpha\lambda_{i}(A)})^{2}}{(1-\alpha\lambda_{i}(A))}

The above must hold i\forall i and therefore we choose η\eta as:

(1αλmin(A))2(1αλmin(A))η(1+αλmin(A))2(1αλmin(A))\frac{(1-\sqrt{\alpha\lambda_{min}(A)})^{2}}{(1-\alpha\lambda_{min}(A))}\leq\eta\leq\frac{(1+\sqrt{\alpha\lambda_{min}(A)})^{2}}{(1-\alpha\lambda_{min}(A))}
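An analogous sanity check for ASG (again with an assumed example $A$; not part of the proof) verifies (12) numerically for an $\eta$ in the above range, in particular the choice made just below:

```python
import numpy as np

# ASG check: rho(P) = sqrt(eta * (1 - alpha*lambda_min)), as in (12).
lams = np.array([1.0, 3.0, 9.0])
A, d = np.diag(lams), len(lams)
lam_min, lam_max = lams.min(), lams.max()

alpha = 0.1                                                     # satisfies alpha <= 1/lambda_max
eta = (1.0 - np.sqrt(alpha * lam_min) / 2.0)**2 / (1.0 - alpha * lam_min)   # choice made below

B = np.eye(d) - alpha * A
P = np.block([[B + eta * B, -eta * B],
              [np.eye(d),    np.zeros((d, d))]])
rho = max(abs(np.linalg.eigvals(P)))
print(rho, np.sqrt(eta * (1.0 - alpha * lam_min)))              # the two agree
```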

As in Case-2, if we choose the momentum parameter η=(1λmin(A)α)2(1αλmin(A))\displaystyle\eta=\frac{(1-\sqrt{\lambda_{\min}(A)\alpha})^{2}}{(1-\alpha\lambda_{min}(A))}, then PP has two repeated roots. To ensure that PP does not have any repeated root we choose the momentum parameter

η=(1λmin(A)α2)2(1αλmin(A)).\displaystyle\eta=\frac{\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2}}{(1-\alpha\lambda_{min}(A))}.

Using (12), we get $\rho(P)=\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)$, which is the same as in Case-2, and therefore we have:

𝔼[X~n2]C~2(1λmin(A)α2)2nΛ+α2C~2Ki=0n1(1λmin(A)α2)2(n1i)(1+𝔼[X~i2]).\mathbb{E}[\|\tilde{X}_{n}\|^{2}]\leq\tilde{C}^{2}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2n}\Lambda+\alpha^{2}\tilde{C}^{2}K\sum_{i=0}^{n-1}\left(1-\frac{\sqrt{\lambda_{\min}(A)\alpha}}{2}\right)^{2(n-1-i)}(1+\mathbb{E}[\|\tilde{X}_{i}\|^{2}]).

The above expression is the same as that in Case-2 except for the term $\tilde{C}$. Since the matrix $P$ is different when $\beta=1$, $\hat{C}$ in Case-2 need not equal $\tilde{C}$ in Case-3. However, we next show that $\tilde{C}$ satisfies the exact same upper bound as in Case-2 (Lemma 5). Towards this, we have the following lemma:

Lemma 6.

C~C5αλmin(A)\tilde{C}\leq C\frac{5}{\sqrt{\alpha\lambda_{\min}(A)}}, where CC is as defined in (11).

Proof.

See Appendix D.3

Thereafter, we can proceed exactly as in Case-2 to prove the theorem.

Appendix D Proof of Auxiliary Lemmas

D.1 Proof of Lemma 4

As in Foucart, (2012), we first construct a matrix norm ||||||{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} such that |M|=ρ(M)+δ{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\rho(M)+\delta. Consider the Jordan canonical form of MM

M=S[Jn1(λ1(M))000Jn2(λ2(M))000Jnk(λk(M))]S1M=S\begin{bmatrix}J_{n_{1}}(\lambda_{1}(M))&0&\ldots&0\\ 0&J_{n_{2}}(\lambda_{2}(M))&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\ldots&0&J_{n_{k}}(\lambda_{k}(M))\end{bmatrix}S^{-1}

We define

D(δ)=[Dn1(δ)000Dn2(δ)000Dnk(δ)],D(\delta)=\begin{bmatrix}D_{n_{1}}(\delta)&0&\ldots&0\\ 0&D_{n_{2}}(\delta)&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\ldots&0&D_{n_{k}}(\delta)\end{bmatrix},

where

Dj(δ)=[δ000δ2000δj]D_{j}(\delta)=\begin{bmatrix}\delta&0&\ldots&0\\ 0&\delta^{2}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\ldots&0&\delta^{j}\end{bmatrix}

Therefore,

D(1δ)S1MSD(δ)=[Bn1(λ1(M),δ)000Bn2(λ2(M),δ)000Bnk(λk(M),δ)]D(\frac{1}{\delta})S^{-1}MSD(\delta)=\begin{bmatrix}B_{n_{1}}(\lambda_{1}(M),\delta)&0&\ldots&0\\ 0&B_{n_{2}}(\lambda_{2}(M),\delta)&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\ldots&0&B_{n_{k}}(\lambda_{k}(M),\delta)\end{bmatrix}

where,

Bi(λ,δ)=Di(1δ)Ji(λ)Di(δ)=[λδ000λδ00λδ000λ]B_{i}(\lambda,\delta)=D_{i}(\frac{1}{\delta})J_{i}(\lambda)D_{i}(\delta)=\begin{bmatrix}\lambda&\delta&0&\ldots&0\\ 0&\lambda&\delta&\ddots&\vdots\\ 0&\ddots&\ddots&\ddots&0\\ \vdots&\ddots&\ddots&\lambda&\delta\\ 0&\ldots&0&0&\lambda\end{bmatrix}

We define the matrix norm ||||||{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} as

|M|D(1δ)S1MSD(δ)1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\triangleq\|D(\frac{1}{\delta})S^{-1}MSD(\delta)\|_{1}

where 1\|\cdot\|_{1} is the matrix norm induced by the vector L1L_{1}-norm. Using the fact that M1=maxj[1:d]i=1d|mi,j|\|M\|_{1}=\max_{j\in[1:d]}\sum_{i=1}^{d}|m_{i,j}|, where mi,jm_{i,j} is the i,ji,j-th entry of MM, we have

|M|=maxj[1:d](|λj|+δ)=ρ(M)+δ.{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\max_{j\in[1:d]}(|\lambda_{j}|+\delta)=\rho(M)+\delta.
 and |Mn||M|n(ρ(M)+δ)n\mbox{ and }{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M^{n}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{n}\leq(\rho(M)+\delta)^{n}

It can be easily seen that M11dM2\|M\|_{1}\geq\frac{1}{\sqrt{d}}\|M\|_{2}. Therefore it follows that

|M|=D(1δ)S1MSD(δ)11dD(1δ)S1MSD(δ)21dσmin(D(1δ))σmin(S1)M2σmin(S)σmin(D(δ))\begin{split}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}&=\|D\left(\frac{1}{\delta}\right)S^{-1}MSD(\delta)\|_{1}\\ &\geq\frac{1}{\sqrt{d}}\|D\left(\frac{1}{\delta}\right)S^{-1}MSD(\delta)\|_{2}\\ &\geq\frac{1}{\sqrt{d}}\sigma_{\min}\left(D\left(\frac{1}{\delta}\right)\right)\sigma_{\min}(S^{-1})\|M\|_{2}\sigma_{\min}(S)\sigma_{\min}(D(\delta))\end{split}

where $\sigma_{\min}(\cdot)$ denotes the smallest singular value and we have used the fact that $\|AB\|\geq\sigma_{\min}(A)\|B\|$ repeatedly. For $\delta<1$, with $r$ defined as the size of the largest Jordan block of $M$,

σmin(D(1δ))σmin(D(δ))=δr1δ=δr1.\sigma_{\min}\left(D\left(\frac{1}{\delta}\right)\right)\sigma_{\min}\left(D\left(\delta\right)\right)=\delta^{r}\frac{1}{\delta}=\delta^{r-1}.

We conclude the first half of the lemma by defining CδC_{\delta} as dδr1σmin(S)σmin(S1)\frac{\sqrt{d}}{\delta^{r-1}\sigma_{\min}(S)\sigma_{\min}(S^{-1})}.

In the case that the eigenvalues are distinct, $r=1$ and $C_{\delta}$ defined above becomes independent of $\delta$. Moreover, when all eigenvalues are distinct, each Jordan block is $J_{n_{i}}(\lambda_{i}(M))=[\lambda_{i}(M)]$ and the second half follows.
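The second part of Lemma 4 can be illustrated numerically. The sketch below (an assumed random example, not part of the proof) builds the eigenbasis $S$ of a matrix with distinct eigenvalues, forms $C=\frac{\sqrt{d}}{\sigma_{\min}(S)\sigma_{\min}(S^{-1})}$, and checks $\|M^{n}\|\leq C\rho(M)^{n}$ for a range of $n$:

```python
import numpy as np

# Lemma 4 (distinct-eigenvalue case): ||M^n|| <= C * rho(M)^n with
# C = sqrt(d) / (sigma_min(S) * sigma_min(S^{-1})).
rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d)) / (2 * np.sqrt(d))     # rho(M) < 1 with high probability

eigvals, S = np.linalg.eig(M)                          # M = S diag(eigvals) S^{-1}
rho = max(abs(eigvals))
sigma_min = lambda X: np.linalg.svd(X, compute_uv=False)[-1]
C = np.sqrt(d) / (sigma_min(S) * sigma_min(np.linalg.inv(S)))

Mn, ok = np.eye(d), True
for n in range(60):
    ok &= bool(np.linalg.norm(Mn, 2) <= C * rho**n + 1e-9)
    Mn = Mn @ M
print(f"rho(M) = {rho:.3f}, C = {C:.2f}, bound holds for n < 60: {ok}")
```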

D.2 Proof of Lemma 5

Let SS be the matrix in Jordan decomposition of AA, i.e., SAS1=DSAS^{-1}=D, where DD is a diagonal matrix with eigenvalues of AA as its diagonal elements. Then,

(S00S)P(S100S1)=(IαSAS1+ηIηSS1SS10)=(IαD+ηIηII0),\begin{pmatrix}S&0\\ 0&S\end{pmatrix}P\begin{pmatrix}S^{-1}&0\\ 0&S^{-1}\end{pmatrix}=\begin{pmatrix}I-\alpha SAS^{-1}+\eta I&-\eta SS^{-1}\\ SS^{-1}&0\end{pmatrix}=\begin{pmatrix}I-\alpha D+\eta I&-\eta I\\ I&0\end{pmatrix},

where $0$ denotes the $d\times d$ zero matrix. For ease of exposition, suppose $A$ is a $2\times 2$ matrix with eigenvalues $\lambda_{1}$ and $\lambda_{2}$. Then

(S00S)P(S100S1)=(1+ηαλ10η001+ηαλ20η10000100)\begin{pmatrix}S&0\\ 0&S\end{pmatrix}P\begin{pmatrix}S^{-1}&0\\ 0&S^{-1}\end{pmatrix}=\begin{pmatrix}1+\eta-\alpha\lambda_{1}&0&-\eta&0\\ 0&1+\eta-\alpha\lambda_{2}&0&-\eta\\ 1&0&0&0\\ 0&1&0&0\end{pmatrix}

Suppose EE is the elementary matrix associated with the exchange of row-2 and row-3. It is easy to see that E=ET=E1E=E^{T}=E^{-1} and that,

E(S00S)P(S100S1)E1=(1+ηαλ1η001000001+ηαλ2η0010)=(B100B2)E\begin{pmatrix}S&0\\ 0&S\end{pmatrix}P\begin{pmatrix}S^{-1}&0\\ 0&S^{-1}\end{pmatrix}E^{-1}=\begin{pmatrix}1+\eta-\alpha\lambda_{1}&-\eta&0&0\\ 1&0&0&0\\ 0&0&1+\eta-\alpha\lambda_{2}&-\eta\\ 0&0&1&0\end{pmatrix}=\begin{pmatrix}B_{1}&0\\ 0&B_{2}\end{pmatrix}

where,

Bi=(1+ηαλiη10)B_{i}=\begin{pmatrix}1+\eta-\alpha\lambda_{i}&-\eta\\ 1&0\end{pmatrix}

Suppose, Xi=(xi,1xi,2xi,3xi,4)X_{i}=\begin{pmatrix}x_{i,1}&x_{i,2}\\ x_{i,3}&x_{i,4}\end{pmatrix} and,

Xi1BiXi=(μi,+00μi,).X_{i}^{-1}B_{i}X_{i}=\begin{pmatrix}\mu_{i,+}&0\\ 0&\mu_{i,-}\end{pmatrix}.

Here μi,+=(1αλi+η)+Δi2\mu_{i,+}=\frac{(1-\alpha\lambda_{i}+\eta)+\sqrt{\Delta_{i}}}{2} and μi,=(1αλi+η)Δi2\mu_{i,-}=\frac{(1-\alpha\lambda_{i}+\eta)-\sqrt{\Delta_{i}}}{2}, where Δi=(1+ηαλi)24η\Delta_{i}=(1+\eta-\alpha\lambda_{i})^{2}-4\eta. Solving the above equation we get,

Xi=(xi,3μi,+xi,4μi,xi,3xi,4)X_{i}=\begin{pmatrix}x_{i,3}\mu_{i,+}&x_{i,4}\mu_{i,-}\\ x_{i,3}&x_{i,4}\end{pmatrix}

Setting xi,3=xi,4=1x_{i,3}=x_{i,4}=1,

X_{i}=\begin{pmatrix}\mu_{i,+}&\mu_{i,-}\\ 1&1\end{pmatrix}\mbox{ and }X_{i}^{-1}=\frac{1}{\mu_{i,+}-\mu_{i,-}}\begin{pmatrix}1&-\mu_{i,-}\\ -1&\mu_{i,+}\end{pmatrix}

For a general d×dd\times d matrix AA, using a similar procedure, it can be shown that

(X10000Xd)Ed×d(S00S)P(S100S1)Ed×d1(X110000Xd1)\begin{pmatrix}X_{1}&0&0\\ 0&\ddots&\vdots\\ 0&\cdots&X_{d}\\ \end{pmatrix}E_{d\times d}\begin{pmatrix}S&0\\ 0&S\end{pmatrix}P\begin{pmatrix}S^{-1}&0\\ 0&S^{-1}\end{pmatrix}E_{d\times d}^{-1}\begin{pmatrix}X_{1}^{-1}&0&0\\ 0&\ddots&\vdots\\ 0&\cdots&X_{d}^{-1}\\ \end{pmatrix}
=(μ1,+0000μ1,000000μd,+0000μd,)=\begin{pmatrix}\mu_{1,+}&0&0&\cdots&0\\ 0&\mu_{1,-}&0&\cdots&0\\ \vdots&0&\ddots&\cdots&0\\ 0&\cdots&0&\mu_{d,+}&0\\ 0&\cdots&0&0&\mu_{d,-}\end{pmatrix}

where Ed×dE_{d\times d} and Ed×d1E_{d\times d}^{-1} are permutation matrices that transform the matrix between them to a block diagonal matrix. Let S^=(X10000Xd)Ed×d(S00S)\hat{S}=\begin{pmatrix}X_{1}&0&0\\ 0&\ddots&\vdots\\ 0&\cdots&X_{d}\\ \end{pmatrix}E_{d\times d}\begin{pmatrix}S&0\\ 0&S\end{pmatrix}. Therefore,

C^=dσmin(S^)σmin(S^1)\hat{C}=\frac{\sqrt{d}}{\sigma_{\min}(\hat{S})\sigma_{\min}(\hat{S}^{-1})}

In order to simplify the expression for C^\hat{C}, we require the following lemma:

Lemma 7.

For all invertible matrices MM of order d×dd\times d, the following identity holds:

1σd(M)σd(M1)=σ1(M)σ1(M1),\frac{1}{\sigma_{d}(M)\sigma_{d}(M^{-1})}=\sigma_{1}(M)\sigma_{1}(M^{-1}),

where σ1(X)σd(X)\sigma_{1}(X)\geq\cdots\geq\sigma_{d}(X) denote the singular values of XX.

Proof.

By definition, σ12(M)σd2(M)\sigma_{1}^{2}(M)\geq\cdots\geq\sigma_{d}^{2}(M) are the eigenvalues of MTMM^{T}M. Then, the eigenvalues of (MTM)1=M1(M1)T(M^{T}M)^{-1}=M^{-1}(M^{-1})^{T} are 1σd(M)21σ1(M)2.\frac{1}{\sigma_{d}(M)^{2}}\geq\cdots\geq\frac{1}{\sigma_{1}(M)^{2}}. Note that M1(M1)TM^{-1}(M^{-1})^{T} and (M1)TM1(M^{-1})^{T}M^{-1} are similar since (M1)TM1=M(M1(M1)T)M1(M^{-1})^{T}M^{-1}=M(M^{-1}(M^{-1})^{T})M^{-1}. Consequently, M1(M1)TM^{-1}(M^{-1})^{T} and (M1)TM1(M^{-1})^{T}M^{-1} have the same set of eigenvalues and we find that the singular values of M1M^{-1} are 1σd(M)1σ1(M).\frac{1}{\sigma_{d}(M)}\geq\cdots\geq\frac{1}{\sigma_{1}(M)}.

Thus,

σ1(M1)=1σd(M),\sigma_{1}(M^{-1})=\frac{1}{\sigma_{d}(M)},
σd(M1)=1σ1(M)\sigma_{d}(M^{-1})=\frac{1}{\sigma_{1}(M)}

and the result follows. ∎
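For completeness, the identity in Lemma 7 is easy to confirm numerically on an arbitrary example:

```python
import numpy as np

# Lemma 7: 1 / (sigma_d(M) * sigma_d(M^{-1})) = sigma_1(M) * sigma_1(M^{-1}).
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
s = np.linalg.svd(M, compute_uv=False)                          # sigma_1 >= ... >= sigma_d
s_inv = np.linalg.svd(np.linalg.inv(M), compute_uv=False)
print(np.isclose(1.0 / (s[-1] * s_inv[-1]), s[0] * s_inv[0]))   # expect True
```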

Using Lemma 7, we have

C^=dσmax(S^)σmax(S^1)dmaxi{σmax(Xi)}σmax(S)σmax(S1)maxi{σmax(Xi1)}=Cmaxi{σmax(Xi)}maxi{σmax(Xi1)},\begin{split}\hat{C}&=\sqrt{d}\sigma_{\max}(\hat{S})\sigma_{\max}(\hat{S}^{-1})\\ &\leq\sqrt{d}\max_{i}\{\sigma_{\max}(X_{i})\}\sigma_{\max}(S)\sigma_{\max}(S^{-1})\max_{i}\{\sigma_{\max}(X_{i}^{-1})\}\\ &=C\max_{i}\{\sigma_{\max}(X_{i})\}\max_{i}\{\sigma_{\max}(X_{i}^{-1})\},\end{split}

where CC is as defined in (11). Now, for any matrix XX of order d×dd\times d,

σmax(X)=X2XF=(i,j|xij|2)1/2dmaxi,j|xij|,\sigma_{\max}(X)=\|X\|_{2}\leq\|X\|_{F}=(\sum_{i,j}|x_{ij}|^{2})^{1/2}\leq d\max_{i,j}|x_{ij}|,

where F\|\cdot\|_{F} denotes the Frobenius norm. Using the above inequality, σmax(Xi)2\sigma_{\max}(X_{i})\leq 2 and σmax(Xi1)2|μi,+μi,|=2|Δi|\sigma_{\max}(X_{i}^{-1})\leq\frac{2}{|\mu_{i,+}-\mu_{i,-}|}=\frac{2}{|\sqrt{\Delta_{i}}|}. Next we lower bound |Δi||\sqrt{\Delta_{i}}|.

For a complex number zz, observe that |z||\sqrt{z}| = |z|\sqrt{|z|}. Now,

|Δi|\displaystyle|\Delta_{i}| =4η(1+ηαλi(A))2\displaystyle=4\eta-(1+\eta-\alpha\lambda_{i}(A))^{2} (13)
4η(1+ηαλmin(A))2\displaystyle\geq 4\eta-(1+\eta-\alpha\lambda_{min}(A))^{2}

Using η=(1αλmin2)2\eta=\left(1-\frac{\sqrt{\alpha\lambda_{\min}}}{2}\right)^{2},

|Δi|4η(1+1+αλmin4αλminαλmin)2=4η(2η3α4λmin)2=4[(η)2(η3αλmin8)2]=3αλmin2(2η3αλmin8)=3αλmin2(2αλmin3αλmin8)3αλmin2(2138)=1516αλmin\begin{split}|\Delta_{i}|&\geq 4\eta-\left(1+1+\frac{\alpha\lambda_{\min}}{4}-\sqrt{\alpha\lambda_{\min}}-\alpha\lambda_{\min}\right)^{2}\\ &=4\eta-(2\sqrt{\eta}-\frac{3\alpha}{4}\lambda_{\min})^{2}\\ &=4\left[(\sqrt{\eta})^{2}-\left(\sqrt{\eta}-\frac{3\alpha\lambda_{\min}}{8}\right)^{2}\right]\\ &=\frac{3\alpha\lambda_{\min}}{2}\left(2\sqrt{\eta}-\frac{3\alpha\lambda_{\min}}{8}\right)\\ &=\frac{3\alpha\lambda_{\min}}{2}\left(2-\sqrt{\alpha\lambda_{\min}}-\frac{3\alpha\lambda_{\min}}{8}\right)\\ &\geq\frac{3\alpha\lambda_{\min}}{2}\left(2-1-\frac{3}{8}\right)\\ &=\frac{15}{16}\alpha\lambda_{\min}\end{split}

Using this, we get maxi{σmax(Xi1)}216151αλmin\max_{i}\{\sigma_{\max}(X_{i}^{-1})\}\leq 2\sqrt{\frac{16}{15}}\frac{1}{\sqrt{\alpha\lambda_{\min}}}, and therefore C^5Cαλmin\hat{C}\leq\frac{5C}{\sqrt{\alpha\lambda_{\min}}}
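Combining Lemmas 4 and 5, the bound actually used in Case-2 is $\|P^{n}\|\leq\frac{5C}{\sqrt{\alpha\lambda_{\min}(A)}}\rho(P)^{n}$. The sketch below (an assumed diagonal example $A$, with the Case-2 choices of $\alpha$ and $\eta$) checks this numerically:

```python
import numpy as np

# Check ||P^n|| <= (5C / sqrt(alpha*lambda_min)) * rho(P)^n for the Case-2 SHB matrix P.
lams = np.array([1.0, 3.0, 9.0])
A, d = np.diag(lams), len(lams)
lam_min, lam_max = lams.min(), lams.max()
C = np.sqrt(d)                                    # A is diagonal, so S = I in (11)

alpha = (2.0 / (np.sqrt(lam_min) + np.sqrt(lam_max)))**2
eta = (1.0 - np.sqrt(alpha * lam_min) / 2.0)**2
P = np.block([[np.eye(d) - alpha * A + eta * np.eye(d), -eta * np.eye(d)],
              [np.eye(d),                                np.zeros((d, d))]])
rho = 1.0 - np.sqrt(alpha * lam_min) / 2.0        # spectral radius for this choice of eta
bound = 5.0 * C / np.sqrt(alpha * lam_min)

Pn, ok = np.eye(2 * d), True
for n in range(200):
    ok &= bool(np.linalg.norm(Pn, 2) <= bound * rho**n + 1e-9)
    Pn = Pn @ P
print(f"||P^n|| <= {bound:.1f} * {rho:.2f}^n for n < 200: {ok}")
```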

D.3 Proof of Lemma 6

The proof of the lemma proceeds exactly as the proof of Lemma 5 in Appendix D.2. The only difference is that one needs to lower bound $|\Delta_{i}|=4\eta(1-\alpha\lambda_{i}(A))-(\alpha(1+\eta)\lambda_{i}(A)-1-\eta)^{2}$ for $\eta=\frac{\left(1-\frac{\sqrt{\alpha\lambda_{\min}}}{2}\right)^{2}}{(1-\alpha\lambda_{i}(A))}$. We define $\eta^{\prime}=\left(1-\frac{\sqrt{\alpha\lambda_{\min}}}{2}\right)^{2}$, so that $\eta=\frac{\eta^{\prime}}{(1-\alpha\lambda_{i}(A))}$. Using this, we get

|Δi|\displaystyle|\Delta_{i}| =4η(1αλi(A))(1+ηα(1+η)λi(A))2\displaystyle=4\eta(1-\alpha\lambda_{i}(A))-(1+\eta-\alpha(1+\eta)\lambda_{i}(A))^{2}
=4η(1+ηαλi(A)(1+η))2\displaystyle=4\eta^{\prime}-(1+\eta-\alpha\lambda_{i}(A)(1+\eta))^{2}
=4η((1+η)(1αλi(A)))2\displaystyle=4\eta^{\prime}-\Big{(}(1+\eta)(1-\alpha\lambda_{i}(A))\Big{)}^{2}
=4η((1+η1αλi(A))(1αλi(A)))2\displaystyle=4\eta^{\prime}-\bigg{(}\Big{(}1+\frac{\eta^{\prime}}{1-\alpha\lambda_{i}(A)}\Big{)}(1-\alpha\lambda_{i}(A))\bigg{)}^{2}
\displaystyle=4\eta^{\prime}-(1+\eta^{\prime}-\alpha\lambda_{i}(A))^{2}

The expression for $|\Delta_{i}|$ is exactly the same as in the proof of Lemma 5 (cf. (13)). Thereafter, we proceed as in the proof of Lemma 5 to prove the claim.