
Permutation-Based SGD: Is Random Optimal?

Shashank Rajput    Kangwook Lee    Dimitris Papailiopoulos

University of Wisconsin-Madison

Correspondence to: Shashank Rajput <[email protected]>
Abstract

A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling. However, is random optimal? We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent. We first show that for 1-dimensional strongly convex functions, with smooth second derivatives, there exist permutations that offer exponentially faster convergence compared to random. However, for general strongly convex functions, random permutations are optimal. Finally, we show that for quadratic, strongly-convex functions, there are easy-to-construct permutations that lead to accelerated convergence compared to random. Our results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.

1 Introduction

Finite sum optimization seeks to solve the following:

$$\min_{\bm{x}} F(\bm{x}) := \frac{1}{n}\sum_{i=1}^{n} f_i(\bm{x}). \tag{1}$$

Stochastic gradient descent (SGD) approximately solves finite sum problems by iteratively updating the optimization variables according to the following rule:

$$\bm{x}_{t+1} := \bm{x}_t - \alpha \nabla f_{\sigma_t}(\bm{x}_t), \tag{2}$$

where $\alpha$ is the step size and $\sigma_t \in [n] = \{1, 2, \ldots, n\}$ is the index of the function sampled at iteration $t$. There exist various ways of sampling $\sigma_t$, with the most common being with- and without-replacement sampling. In the former, $\sigma_t$ is chosen uniformly at random from $[n]$, and in the latter, $\sigma_t$ is the $t$-th element of a random permutation of $[n]$. We henceforth refer to these two SGD variants as vanilla and permutation-based, respectively.
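For concreteness, the following minimal Python sketch (our own illustration, not code from the paper) contrasts the two sampling schemes; grads is assumed to be a list of the $n$ component gradient oracles:

import numpy as np

def sgd(grads, x, alpha, epochs, permutation_based, seed=0):
    rng = np.random.default_rng(seed)
    n = len(grads)
    for _ in range(epochs):
        if permutation_based:
            idxs = rng.permutation(n)     # each f_i visited exactly once per epoch
        else:
            idxs = rng.integers(0, n, n)  # with-replacement (vanilla) sampling
        for i in idxs:
            x = x - alpha * grads[i](x)   # x_{t+1} = x_t - alpha * grad f_{sigma_t}(x_t)
    return x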

Although permutation-based SGD has been widely observed to perform better in practice (Bottou, 2009; Recht & Ré, 2012; 2013), the vanilla version has attracted the vast majority of theoretical analysis. This is because, at each iteration, the update is in expectation a scaled version of the true gradient, allowing for simple performance analyses of the algorithm; see, e.g., (Bubeck et al., 2015).

Permutation-based SGD has resisted a tight analysis for a long time. However, a recent line of breakthrough results provides the first tight convergence guarantees for several classes of convex functions $F$ (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020; Mishchenko et al., 2020; Ahn et al., 2020; Nguyen et al., 2020). These recent studies mainly focus on two variants of permutation-based SGD, where (1) a new random permutation is sampled at each epoch (also known as Random Reshuffle) (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020), and (2) a random permutation is sampled once and reused throughout all SGD epochs (Single Shuffle) (Safran & Shamir, 2019; Mishchenko et al., 2020; Ahn et al., 2020).

Perhaps interestingly, Random Reshuffle and Single Shuffle exhibit different convergence rates, with a performance gap that varies across function classes. In particular, when run for $K$ epochs, the convergence rate for strongly convex functions is $\widetilde{O}(1/nK^2)$ for both Random Reshuffle and Single Shuffle (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020). However, when run specifically on strongly convex quadratics, Random Reshuffle experiences an acceleration of rates, whereas Single Shuffle does not (Safran & Shamir, 2019; Rajput et al., 2020; Ahn et al., 2020; Mishchenko et al., 2020). All the above rates have been paired with matching lower bounds, at least up to constants and sometimes log factors (Safran & Shamir, 2019; Rajput et al., 2020).

Algorithm 1 Permutation-based SGD variants

1:  Input: Initialization $\bm{x}_0^1$, step size $\alpha$, epochs $K$
2:  $\sigma =$ a random permutation of $[n]$
3:  for $k = 1, \dots, K$ do
4:      if IGD then
5:          $\sigma^k = (1, 2, \dots, n)$
6:      else if Single Shuffle then
7:          $\sigma^k = \sigma$
8:      else if Random Reshuffle then
9:          $\sigma^k =$ a random permutation of $[n]$
10:     end if
11:     if FlipFlop and $k$ is even then
12:         $\sigma^k =$ reverse of $\sigma^{k-1}$
13:     end if
14:     for $i = 1, \dots, n$ do                            (epoch $k$)
15:         $\bm{x}_i^k := \bm{x}_{i-1}^k - \alpha \nabla f_{\sigma_i^k}(\bm{x}_{i-1}^k)$
16:     end for
17:     $\bm{x}_0^{k+1} := \bm{x}_n^k$
18: end for
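As a complement to the pseudocode, the following is a minimal executable Python rendering of Algorithm 1 (a sketch under assumed conventions, not the paper's released implementation); grads is again a list of the $n$ gradient oracles:

import numpy as np

def permutation_sgd(grads, x0, alpha, K, variant="RR", flipflop=False, seed=0):
    """K epochs of IGD / Single Shuffle (SS) / Random Reshuffle (RR),
    optionally with FlipFlop (every even epoch reverses the previous one)."""
    rng = np.random.default_rng(seed)
    n, x = len(grads), x0
    sigma = rng.permutation(n)          # drawn once, used by Single Shuffle
    prev = None
    for k in range(1, K + 1):
        if variant == "IGD":
            perm = np.arange(n)
        elif variant == "SS":
            perm = sigma
        else:                           # Random Reshuffle
            perm = rng.permutation(n)
        if flipflop and k % 2 == 0:
            perm = prev[::-1]           # flip the previous epoch's order
        for i in perm:                  # one epoch of SGD updates
            x = x - alpha * grads[i](x)
        prev = perm
    return x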
          Plain                                                    with FlipFlop
RR    $\Omega\left(\frac{1}{n^2K^2}+\frac{1}{nK^3}\right)$    $\widetilde{O}\left(\frac{1}{n^2K^2}+\frac{1}{nK^5}\right)$   (Thm. 5)
SS    $\Omega\left(\frac{1}{nK^2}\right)$                     $\widetilde{O}\left(\frac{1}{n^2K^2}+\frac{1}{nK^4}\right)$   (Thm. 4)
IGD   $\Omega\left(\frac{1}{K^2}\right)$                      $\widetilde{O}\left(\frac{1}{n^2K^2}+\frac{1}{K^3}\right)$    (Thm. 6)

Table 1: Convergence rates of Random Reshuffle (RR), Single Shuffle (SS) and Incremental Gradient Descent (IGD) on strongly convex quadratics: plain vs. FlipFlop. Lower bounds for the "plain" versions are taken from (Safran & Shamir, 2019). When $n \gg K$, that is, when the training set is much larger than the number of epochs (arguably the case in practice), the convergence rates of Random Reshuffle, Single Shuffle, and Incremental Gradient Descent are $\Omega(\frac{1}{nK^3})$, $\Omega(\frac{1}{nK^2})$, and $\Omega(\frac{1}{K^2})$, respectively. On the other hand, combining these methods with FlipFlop makes the convergence rates faster, i.e., $\widetilde{O}(\frac{1}{nK^5})$, $\widetilde{O}(\frac{1}{nK^4})$, and $\widetilde{O}(\frac{1}{K^3})$, respectively.

From the above, we observe that reshuffling at the beginning of every epoch may not always help. Yet there are cases where Random Reshuffle is faster than Single Shuffle, implying that certain ways of generating permutations are better suited to certain subfamilies of functions.

The goal of our paper is to take a first step towards exploring the relationship between convergence rates and the particular choice of permutations. We are particularly interested in understanding whether random permutations are as good as optimal, or whether SGD can experience faster rates with carefully crafted permutations. As we see in the following, the answer to the above is not straightforward, and depends heavily on the function class at hand.

Our Contributions:

We define permutation-based SGD to be any variant of the iterates in (2) where the permutation of the $n$ functions used in each epoch can be generated deterministically, randomly, or with a combination of the two. For example, Single Shuffle, Random Reshuffle, and Incremental Gradient Descent (IGD) are all permutation-based SGD variants (see Algorithm 1).

We first want to understand—even in the absence of computational constraints in picking the optimal permutations—what is the fastest rate one can get for permutation-based SGD? In other words, are there permutations that are better than random in the eyes of SGD?

Perhaps surprisingly, we show that there exist permutations that may offer up to exponentially faster convergence than random permutations, but for a limited set of functions. Specifically, we show this for 1-dimensional functions (Theorem 1). However, such exponential improvement is no longer possible in higher dimensions (Theorem 2), or for general strongly convex objectives (Theorem 3), where random is optimal. The above highlights that an analysis of how permutations affect convergence rates needs to be nuanced enough to account for the structure of the functions at hand; otherwise, in the absence of further assumptions, random permutations may just appear to be as good as optimal.

In this work, we further identify a subfamily of convex functions for which there exist easy-to-generate permutations that lead to accelerated convergence. We specifically introduce a new technique, FlipFlop, which can be used in conjunction with existing permutation-based methods, e.g., Random Reshuffle, Single Shuffle, or Incremental Gradient Descent, to provably improve their convergence on quadratic functions (Theorems 4, 5, and 6). The way FlipFlop works is rather simple: every even epoch uses the flipped (or reversed) version of the previous epoch's permutation. The intuition behind why FlipFlop leads to faster convergence is as follows. Towards the end of an epoch, the contribution of earlier gradients gets attenuated. To counter this, we flip the permutation for the next epoch, so that every function's contribution is diluted (approximately) equally over the course of two consecutive epochs. FlipFlop demonstrates that finding better permutations for specific classes of functions might be computationally easy. We summarize FlipFlop's convergence rates in Table 1 and report the results of numerical verification in Section 6.2.

Note that in this work, we focus on the dependence of the error on the number of iterations, and in particular, the number of epochs. However, we acknowledge that its dependence on other parameters like the condition number is also very important. We leave such analysis for future work.

Notation:

We use lowercase for scalars ($a$), lower boldface for vectors ($\bm{a}$), and upper boldface for matrices ($\bm{A}$).

2 Related Work

Gürbüzbalaban et al. (2019a; b) provided the first theoretical results establishing that Random Reshuffle and Incremental Gradient Descent (and hence Single Shuffle) are indeed faster than vanilla SGD, as they offer an asymptotic rate of $O(1/K^2)$ for strongly convex functions, which beats the $O(1/nK)$ convergence rate of vanilla SGD when $K = \Omega(n)$. Shamir (2016) used techniques from online learning and transductive learning theory to prove an optimal convergence rate of $\widetilde{O}(1/n)$ for the first epoch of Random Reshuffle (and hence Single Shuffle). Later, Haochen & Sra (2019) established a non-asymptotic convergence rate of $\widetilde{O}\left(\frac{1}{n^2K^2}+\frac{1}{K^3}\right)$ when the objective function is quadratic or has a smooth Hessian.

Nagaraj et al. (2019) used a very interesting iterate-coupling approach to give a new upper bound on the error rate of Random Reshuffle, thus proving for the first time that for general smooth, strongly convex functions, it converges faster than vanilla SGD in all regimes of $n$ and $K$. This was followed by (Safran & Shamir, 2019), where the authors established the first lower bounds, in terms of both $n$ and $K$, for Random Reshuffle. However, there was a gap between these upper and lower bounds. The gap was closed by Rajput et al. (2020), who showed that the upper bound given by Nagaraj et al. (2019) and the lower bound given by Safran & Shamir (2019) are both tight up to logarithmic terms.

For Single Shuffle, Mishchenko et al. (2020) and Ahn et al. (2020) showed an upper bound of $\widetilde{O}\left(\frac{1}{nK^2}\right)$, which matched the lower bound given earlier by (Safran & Shamir, 2019) up to logarithmic terms. Ahn et al. (2020) and Mishchenko et al. (2020) also proved tight upper bounds for Random Reshuffle, with a simpler analysis and under more relaxed assumptions than (Nagaraj et al., 2019) and (Rajput et al., 2020). In particular, the results by Ahn et al. (2020) work under the PŁ condition and do not require individual component convexity.

Incremental Gradient Descent on strongly convex functions has also been well studied in the literature (Nedić & Bertsekas, 2001; Bertsekas, 2011; Gürbüzbalaban et al., 2019a). More recently, Nguyen et al. (2020) provided a unified analysis for all permutation-based algorithms. The dependence of their convergence rates on the number of epochs $K$ is also optimal for Incremental Gradient Descent, Single Shuffle, and Random Reshuffle.

There has also been some recent work on the analysis of Random Reshuffle on non-strongly convex and non-convex functions. Specifically, Nguyen et al. (2020) and Mishchenko et al. (2020) show that even there, Random Reshuffle outperforms SGD under certain conditions. Mishchenko et al. (2020) show that Random Reshuffle and Single Shuffle beat vanilla SGD on non-strongly convex functions after $\Omega(n)$ epochs, and that Random Reshuffle is faster than vanilla SGD on non-convex objectives if the desired error is $O(1/\sqrt{n})$.

Speeding up convergence by combining without replacement sampling with other techniques like variance reduction (Shamir, 2016; Ying et al., 2020) and momentum (Tran et al., 2020) has also received some attention. In this work, we solely focus on the power of “good permutations” to achieve fast convergence.

3 Preliminaries

We will use combinations of the following assumptions:

Assumption 1 (Component convexity).

$f_i(\bm{x})$'s are convex.

Assumption 2 (Component smoothness).

$f_i(\bm{x})$'s are $L$-smooth, i.e.,

$$\forall \bm{x}, \bm{y}: \|\nabla f_i(\bm{x}) - \nabla f_i(\bm{y})\| \leq L\|\bm{x} - \bm{y}\|.$$

Note that Assumption 2 immediately implies that $F$ also has $L$-Lipschitz gradients:

$$\forall \bm{x}, \bm{y}: \|\nabla F(\bm{x}) - \nabla F(\bm{y})\| \leq L\|\bm{x} - \bm{y}\|.$$

Assumption 3 (Objective strong convexity).

$F$ is $\mu$-strongly convex, i.e.,

$$\forall \bm{x}, \bm{y}: F(\bm{y}) \geq F(\bm{x}) + \langle \nabla F(\bm{x}), \bm{y} - \bm{x} \rangle + \frac{\mu}{2}\|\bm{y} - \bm{x}\|^2.$$

Note that Assumption 3 implies

$$\forall \bm{x}, \bm{y}: \langle \nabla F(\bm{x}) - \nabla F(\bm{y}), \bm{x} - \bm{y} \rangle \geq \mu\|\bm{y} - \bm{x}\|^2. \tag{3}$$

We denote the condition number by $\kappa$, defined as $\kappa = \frac{L}{\mu}$; it is easy to see that $\kappa \geq 1$ always. Let $\bm{x}^*$ denote the minimizer of Eq. (1), that is, $\bm{x}^* = \operatorname*{arg\,min}_{\bm{x}} F(\bm{x})$.

We will study permutation-based algorithms in the constant step size regime, that is, the step size is chosen at the beginning of the algorithm and then remains fixed throughout. We denote the iterate after the $i$-th iteration of the $k$-th epoch by $\bm{x}_i^k$; hence, the initialization point is $\bm{x}_0^1$. Similarly, the permutation of $(1, \dots, n)$ used in the $k$-th epoch is denoted by $\sigma^k$, and its $i$-th element is denoted by $\sigma_i^k$. Note that if the ambient space is 1-dimensional, we represent the iterates and the minimizer using non-bold characters, i.e., $x_i^k$ and $x^*$, to remain consistent with the notation.

In the following, due to lack of space, we often provide only proof sketches. The full proofs of the lemmas and theorems are provided in the Appendix.

4 Exponential Convergence in 1-Dimension

In this section, we show that for Hessian-smooth 1-dimensional functions, there exist permutations that lead to exponentially faster convergence compared to random.

Assumption 4 (Component Hessian-smoothness).

$f_i(x)$'s have $L_H$-smooth second derivatives, that is,

$$\forall x, y: |\nabla^2 f_i(x) - \nabla^2 f_i(y)| \leq L_H|x - y|.$$

We also define the following instance-dependent constants: $G^* := \max_i \|\nabla f_i(\bm{x}^*)\|$, $D = \max\left\{\|\bm{x}_0^1 - \bm{x}^*\|, \frac{G^*}{2L}\right\}$, and $G = G^* + 2DL$.

Theorem 1.

Let Assumptions 1, 2, 3 and 4 hold. Let $D$ and $G$ be as defined above. If $\alpha = \frac{\mu}{8n(L^2 + L_H G)}$, then there exists a sequence of permutations $\sigma^1, \sigma^2, \dots, \sigma^K$ such that using those permutations from any initialization point $x_0^1$ gives the error

$$|x_n^K - x^*| \leq (D + 4n\alpha G)e^{-CK},$$

where $C = \frac{\mu^2}{16(L^2 + L_H G)}$.


Figure 1: (A graphical depiction of Theorem 1's proof sketch.) Assume that the minimizer is at the origin. The proof of Theorem 1 first shows that there exist an initialization and a sequence of permutations such that, using those, we get to the exact minimizer. Let the sequence of iterates for this run be $x_{opt}^k$. Consider a parallel run which uses the same sequence of permutations but an arbitrary initialization point; let this sequence be $x^k$. The figure shows how $x_{opt}^k$ converges to the exact optimum, while the distance between $x_{opt}^k$ and $x^k$ decreases exponentially, leading to exponential convergence for $x^k$.

An important thing to notice in the theorem statement is that the sequence of permutations $\sigma^1, \sigma^2, \dots, \sigma^K$ depends only on the function, and not on the initialization point $x_0^1$. This implies that for any such function, there exists a sequence of permutations $\sigma^1, \sigma^2, \dots, \sigma^K$ which gives exponentially fast convergence, regardless of the initialization. Note that the convergence rate is slower than that of Gradient Descent, for which the constant $C$ would be larger. However, here we are purely interested in the convergence rates of the best permutations and their (asymptotic) dependence on $K$.

Proof sketch

The core idea is to establish that there exist an initialization point $x_0^1$ (close to the minimizer $x^*$) and a sequence of permutations such that starting from $x_0^1$ and using that sequence of permutations leads us exactly to the minimizer. Once we have proved this, we show that if two parallel runs of the optimization process are initialized from two different iterates, and they are coupled so that they use the exact same permutations, then they approach each other at an exponential rate. Thus, using the same permutations from any initialization point, we converge to the minimizer at an exponential rate. See Figure 1 for a graphical depiction of this sketch. We note that the figure is not the result of an actual run, but only serves to explain the proof sketch.

5 Lower Bounds for Permutation-based SGD

The result in the previous section leads us to wonder whether exponentially fast convergence can also be achieved in higher dimensions. Unfortunately, in higher dimensions there exist strongly convex quadratic functions for which no sequence of permutations leads to an exponential convergence rate. This is formalized in the following theorem.

Theorem 2.

For any even $n \geq 4$, there exists a $(2n+1)$-dimensional strongly convex function $F$ which is the mean of $n$ convex quadratic functions, such that for every permutation-based algorithm with any step size,

$$\|\bm{x}_n^{K-1} - \bm{x}^*\|^2 = \Omega\left(\frac{1}{n^3K^2}\right).$$

This theorem shows that we cannot hope to develop constant step size algorithms that exhibit exponentially fast convergence in multiple dimensions.

Proof sketch

In this sketch, we explain a simpler version of the theorem, which works for $n = 2$ in 2 dimensions. Consider $F(x, y) = \frac{1}{2}f_1(x, y) + \frac{1}{2}f_2(x, y)$ with

$$f_1(x, y) = \frac{x^2}{2} - x + y, \quad\text{and}\quad f_2(x, y) = \frac{y^2}{2} - y + x.$$

Hence, $F = \frac{1}{4}(x^2 + y^2)$, which has its minimizer at the origin. Each epoch has two possible permutations, $\sigma = (1, 2)$ or $\sigma = (2, 1)$. Working out the details manually, it can be seen that regardless of the permutation, $x_0^{k+1} + y_0^{k+1} > x_0^k + y_0^k$, that is, the sum of the coordinates keeps increasing. This can be used to get a bound on the error term $\|[x^K \;\; y^K]^\top\|^2$.
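This behavior is easy to check numerically. The following short script (our own illustration) runs the two-function example above starting at the minimizer and prints $x + y$ after each epoch, choosing one of the two permutations at random:

import numpy as np

g1 = lambda v: np.array([v[0] - 1.0, 1.0])   # grad f1 = (x - 1, 1)
g2 = lambda v: np.array([1.0, v[1] - 1.0])   # grad f2 = (1, y - 1)

alpha, v = 0.1, np.zeros(2)                  # start at the minimizer (origin)
rng = np.random.default_rng(0)
for k in range(20):
    order = [g1, g2] if rng.random() < 0.5 else [g2, g1]  # either permutation
    for g in order:
        v = v - alpha * g(v)
    print(k, v.sum())   # x + y strictly increases every epoch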

Next, we show that even in 1 dimension, individual function convexity might be necessary to obtain faster rates than Random Reshuffle.

Theorem 3.

There exists a 1-dimensional strongly convex function $F$, the mean of two quadratic functions $f_1$ and $f_2$ with one of them non-convex, such that every permutation-based algorithm with constant step size $\alpha \leq \frac{1}{L}$ gives an error of at least

$$\|\bm{x}_n^{K-1} - \bm{x}^*\|^2 = \Omega\left(\frac{1}{K^2}\right).$$

Proof sketch

The idea is to make one of the two component functions strongly concave. The farther the iterate is from that function's maximizer, the harder the function pushes the iterate away; this essentially increases the deviation in each epoch, leading to a slow convergence rate.

In the setting where the individual $f_i$ may be non-convex, Nguyen et al. (2020) and Ahn et al. (2020) show that Single Shuffle, Random Reshuffle, and Incremental Gradient Descent achieve an error rate of $\widetilde{\mathcal{O}}(\frac{1}{K^2})$ for a fixed $n$. In particular, their results only need the component functions to be smooth, and hence they apply to the function $F$ from Theorem 3. The theorem above essentially shows that when $n = 2$, this is the best possible error rate for any permutation-based algorithm, deterministic or random. Hence, at least for $n = 2$, the three algorithms are optimal when the component functions can be non-convex. However, note that here we only consider the dependence of the convergence rate on $K$; these algorithms may not be optimal once we also account for the dependence on the combination of both $n$ and $K$. Indeed, if we consider the dependence on $n$ as well, Incremental Gradient Descent has a convergence rate of $\Omega(1/K^2)$ (Safran & Shamir, 2019), whereas the other two have a convergence rate of $\widetilde{O}(1/nK^2)$ (Ahn et al., 2020).
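As a sanity check on the $\widetilde{\mathcal{O}}(1/K^2)$ rate for $n = 2$, the following sketch (our own construction in the spirit of Theorem 3, with one concave component) runs Random Reshuffle with a constant step size of order $\log(nK)/(\mu nK)$; the printed squared errors shrink roughly as $1/K^2$, up to log factors:

import numpy as np

grads = [lambda x: -x + 1.0,        # f1(x) = -x^2/2 + x, the concave component
         lambda x: 3.0 * x - 1.0]   # f2(x) = 3x^2/2 - x, convex; F = x^2/2, mu = 1

rng = np.random.default_rng(0)
for K in [100, 1000, 10000]:
    alpha = np.log(2 * K) / (2 * K)   # constant step of order log(nK)/(mu n K)
    x = 1.0
    for _ in range(K):
        for i in rng.permutation(2):  # Random Reshuffle over the two components
            x -= alpha * grads[i](x)
    print(K, x ** 2)                  # squared error shrinks roughly like 1/K^2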

6 Flipping Permutations for Faster Convergence in Quadratics

In this section, we introduce a new algorithm, FlipFlop, which can improve the convergence rates of Single Shuffle, Random Reshuffle, and Incremental Gradient Descent on strongly convex quadratic functions.

We will need one additional assumption:

Assumption 5.

$f_i(\bm{x})$'s are quadratic.

The following theorem gives the convergence rate of FlipFlop with Single Shuffle:

Theorem 4.

If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Single Shuffle for $K$ epochs, where $K \geq 80\kappa^{3/2}\log(nK)\max\left\{1, \frac{\sqrt{\kappa}}{n}\right\}$ is an even integer, with step size $\alpha = \frac{10\log(nK)}{\mu nK}$ gives the error

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \widetilde{\mathcal{O}}\left(\frac{1}{n^2K^2} + \frac{1}{nK^4}\right). \tag{4}$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Single Shuffle:

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \Omega\left(\frac{1}{nK^2}\right). \tag{5}$$

Note that both terms in Eq. (4) are smaller than the term in Eq. (5). In particular, when $n \gg K^2$ and $n$ is fixed as we vary $K$, the RHS of Eq. (5) decays as $\widetilde{\mathcal{O}}(\frac{1}{K^2})$, whereas the RHS of Eq. (4) decays as $\widetilde{\mathcal{O}}(\frac{1}{K^4})$. Otherwise, when $K^2 \gg n$ and $K$ is fixed as we vary $n$, the RHS of Eq. (5) decays as $\widetilde{\mathcal{O}}(\frac{1}{n})$, whereas the RHS of Eq. (4) decays as $\widetilde{\mathcal{O}}(\frac{1}{n^2})$. Hence, in both cases, FlipFlop with Single Shuffle outperforms Single Shuffle.

The next theorem shows that FlipFlop improves the convergence rate of Random Reshuffle:

Theorem 5.

If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Random Reshuffle for $K$ epochs, where $K \geq 55\kappa\log(nK)\max\left\{1, \sqrt{\frac{n}{\kappa}}\right\}$ is an even integer, with step size $\alpha = \frac{10\log(nK)}{\mu nK}$ gives the error

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \widetilde{\mathcal{O}}\left(\frac{1}{n^2K^2} + \frac{1}{nK^5}\right).$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Random Reshuffle:

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \Omega\left(\frac{1}{n^2K^2} + \frac{1}{nK^3}\right).$$

Hence, we see that in the regime $n \gg K$, which happens when the number of components in the finite sum of $F$ is much larger than the number of epochs, FlipFlop with Random Reshuffle is much faster than vanilla Random Reshuffle.

Note that the theorems above do not contradict Theorem 2, because for a fixed $n$, both give a convergence rate of $\widetilde{\mathcal{O}}(1/K^2)$.

We also note that the theorems above require the number of epochs to be much larger than $\kappa$, a regime in which Gradient Descent performs better than with- or without-replacement SGD, and hence GD should be preferred over SGD in that case. However, we believe this requirement on the number of epochs is a limitation of our analysis, rather than of the algorithms.

Figure 2: Dependence of convergence rates on the number of epochs $K$ for quadratic functions. The figures show the median and inter-quartile range over 10 runs of each algorithm, with random initializations and random permutation seeds (note that SS and IGD exhibit extremely small variance). We set $n = 800$, so that $n \gg K$ and hence the higher-order terms of $K$ dominate the convergence rates. Note that both axes are in logarithmic scale.

Finally, the next theorem shows that FlipFlop improves the convergence rate of Incremental Gradient Descent.

Theorem 6.

If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Incremental Gradient Descent for $K$ epochs, where $K \geq 36\kappa\log(nK)$ is an even integer, with step size $\alpha = \frac{6\log(nK)}{\mu nK}$ gives the error

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \widetilde{\mathcal{O}}\left(\frac{1}{n^2K^2} + \frac{1}{K^3}\right).$$

For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Incremental Gradient Descent:

$$\mathbb{E}\left[\|\bm{x}_n^{K-1} - \bm{x}^*\|^2\right] = \Omega\left(\frac{1}{K^2}\right).$$

In the next subsection, we give a short sketch of the proof of these theorems.

6.1 Proof sketch

In this proof sketch, we consider scalar quadratic functions; the same intuition carries over to multi-dimensional quadratics but requires a more involved analysis. Let $f_i(x) := \frac{a_i x^2}{2} + b_i x + c$. Assume that $F(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$ has its minimizer at 0. This assumption is valid because it can be achieved by a simple translation of the origin (see (Safran & Shamir, 2019) for a more detailed explanation). It implies that $\sum_{i=1}^n b_i = 0$.

For the sake of this sketch, assume $x_0^1 = 0$, that is, we start at the minimizer itself. Further, without loss of generality, assume that $\sigma = (1, 2, \dots, n)$. Then, for the last iteration of the first epoch,

$$x_n^1 = x_{n-1}^1 - \alpha f'_n(x_{n-1}^1) = x_{n-1}^1 - \alpha(a_n x_{n-1}^1 + b_n) = (1 - \alpha a_n)x_{n-1}^1 - \alpha b_n.$$

Applying this to all iterations of the first epoch, we get

$$x_n^1 = \prod_{i=1}^n (1 - \alpha a_i)\, x_0^1 - \alpha\sum_{i=1}^n b_i \prod_{j=i+1}^n (1 - \alpha a_j). \tag{6}$$

Substituting $x_0^1 = 0$, we get

$$x_n^1 = -\alpha\sum_{i=1}^n b_i \prod_{j=i+1}^n (1 - \alpha a_j). \tag{7}$$

Note that the sum above is not weighted uniformly: $b_1$ is multiplied by $\prod_{j=2}^n (1 - \alpha a_j)$, whereas $b_n$ is multiplied by 1. Because $(1 - \alpha a_j) < 1$, we see that $b_1$'s weight is much smaller than $b_n$'s. If the weights were all 1, then we would get $x_n^1 = -\alpha\sum_{i=1}^n b_i = 0$, i.e., we would not move away from the minimizer. Since we want to stay close to the minimizer, we want the weights of all the $b_i$ to be roughly equal.

The idea behind FlipFlop is to add something like $-\alpha\sum_{i=1}^n b_i \prod_{j=1}^{i} (1 - \alpha a_j)$ in the next epoch, to counteract the bias in Eq. (7). To achieve this, we simply take the permutation that the algorithm used in the previous epoch and flip it for the next epoch. Roughly speaking, in the next epoch $b_1$ will get multiplied by 1 whereas $b_n$ will get multiplied by $\prod_{j=1}^{n-1} (1 - \alpha a_j)$. Thus, over two epochs, both get scaled approximately the same.

To see this more concretely, we look at the first-order approximation of Eq. (7):

$$x_n^1 = -\alpha\sum_{i=1}^n b_i \prod_{j=i+1}^n (1 - \alpha a_j) \approx -\alpha\sum_{i=1}^n b_i \left(1 - \sum_{j=i+1}^n \alpha a_j\right) = \alpha^2\sum_{i=1}^n b_i \sum_{j=i+1}^n a_j, \tag{8}$$

where in the last step we used the fact that $\sum_{i=1}^n b_i = 0$. Now, let us see what happens in the second epoch if we use FlipFlop. Doing a recursion analogous to how we got Eq. (6), but reversing the order of the functions, we get:

$$x_n^2 = \prod_{i=1}^n (1 - \alpha a_{n-i+1})\, x_0^2 - \alpha\sum_{i=1}^n b_{n-i+1} \prod_{j=i+1}^n (1 - \alpha a_{n-j+1}).$$

Carefully doing a change of variables in the equation above, we get:

$$x_n^2 = \prod_{i=1}^n (1 - \alpha a_i)\, x_0^2 - \alpha\sum_{i=1}^n b_i \prod_{j=1}^{i-1} (1 - \alpha a_j). \tag{9}$$

Note that the product in the second term in the equation above is almost complementary to the product in Eq. (7); this is because we flipped the order in the second epoch, and it will play an important part in cancelling out much of the bias in Eq. (7). Continuing from Eq. (9), we again do a linear approximation similar to before and substitute Eq. (8) (using the fact that $x_0^2 = x_n^1$):

$$x_n^2 \approx \left(1 - \alpha\sum_{i=1}^n a_i\right) x_0^2 + \alpha^2\sum_{i=1}^n b_i \sum_{j=1}^{i-1} a_j \approx \left(1 - \alpha\sum_{i=1}^n a_i\right)\left(\alpha^2\sum_{i=1}^n b_i \sum_{j=i+1}^n a_j\right) + \alpha^2\sum_{i=1}^n b_i \sum_{j=1}^{i-1} a_j.$$

We assume that $\alpha$ is small and hence only focus on the quadratic terms:

$$x_n^2 = \alpha^2\sum_{i=1}^n b_i \sum_{j\neq i} a_j + O(\alpha^3) = \alpha^2\left(\sum_{i=1}^n b_i \sum_{j=1}^n a_j\right) - \alpha^2\left(\sum_{i=1}^n b_i a_i\right) + O(\alpha^3).$$

Now, since $\sum_{i=1}^n b_i = 0$, we get

$$x_n^2 \approx -\alpha^2\left(\sum_{i=1}^n b_i a_i\right) + O(\alpha^3). \tag{10}$$

Now, comparing the coefficients of the $\alpha^2$ terms in Eq. (8) and Eq. (10), we see that the former has $O(n^2)$ terms whereas the latter has only $n$ terms. This correction of the error is exactly what eventually manifests as the faster convergence rate of FlipFlop.
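This bias cancellation is easy to verify numerically. The following sketch (our own illustration) runs two epochs on random scalar quadratics starting from the minimizer, once repeating the same permutation and once flipping it; the flipped run ends much closer to $x^* = 0$:

import numpy as np

rng = np.random.default_rng(0)
n, alpha = 100, 1e-4
a = rng.uniform(1.0, 2.0, n)                 # curvatures a_i of f_i
b = rng.normal(size=n); b -= b.mean()        # enforce sum_i b_i = 0, so x* = 0

def epoch(x, order):
    for i in order:
        x -= alpha * (a[i] * x + b[i])       # gradient step on f_i = a_i x^2/2 + b_i x
    return x

perm = rng.permutation(n)
x_rep  = epoch(epoch(0.0, perm), perm)        # repeat the same permutation
x_flip = epoch(epoch(0.0, perm), perm[::-1])  # FlipFlop: reverse it instead
print(abs(x_rep), abs(x_flip))                # the flipped run ends far closer to 0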

The main reason why the analysis for multi-dimensional quadratics is not as simple as the 1-dimensional analysis above is that, unlike scalar multiplication, matrix multiplication is not commutative, and the AM-GM inequality does not hold in higher dimensions (Lai & Lim, 2020; De Sa, 2020). One way to bypass the lack of an AM-GM inequality is to use the following inequality, which holds for small enough $\alpha$:

$$\left\|\prod_{i=1}^n (\bm{I} - \alpha\bm{A}_i) \prod_{i=1}^n (\bm{I} - \alpha\bm{A}_{n-i+1})\right\| \leq 1 - \alpha n\mu.$$

Ahn et al. (2020) proved a stochastic version of this (see Lemma 6 in their paper). We prove a deterministic version in Lemma 3 (in the Appendix).
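As a quick numerical sanity check of the displayed inequality (not a substitute for Lemma 3), one can draw random symmetric PSD matrices whose average is positive definite and compare the two sides for a small step size:

import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 20, 2.0
As = []
for _ in range(n):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    As.append(Q @ np.diag(rng.uniform(0.0, L, d)) @ Q.T)  # eigenvalues in [0, L]

mu = np.linalg.eigvalsh(sum(As) / n).min()   # strong convexity of the average
alpha = 0.1 / (n * L)                        # "small enough" step size

P = np.eye(d)
for A in As + As[::-1]:                      # forward epoch, then flipped epoch
    P = (np.eye(d) - alpha * A) @ P
print(np.linalg.norm(P, 2), "<=", 1 - alpha * n * mu)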

6.2 Numerical Verification

We verify the theorems numerically by running Random Reshuffle, Single Shuffle, and their FlipFlop versions on the task of mean computation. We randomly sample $n = 800$ points from a $100$-dimensional sphere. Let the points be $\bm{x}_i$ for $i = 1, \dots, n$. Then, their mean is the minimizer of the quadratic problem $\arg\min_{\bm{x}} F(\bm{x})$, where $F(\bm{x}) = \frac{1}{n}\sum_{i=1}^n \|\bm{x} - \bm{x}_i\|^2$. We solve this problem using the given algorithms and report the results in Figure 2. The results are plotted on a log-log graph, so that we can see the dependence of the error on the power of $K$.
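For reference, a condensed version of this experiment can be written in a few lines (a sketch with assumed constants rather than the exact released code; note that $\mu = 2$ for this $F$, and we compare Single Shuffle with and without FlipFlop):

import numpy as np

rng = np.random.default_rng(0)
n, d = 800, 100
pts = rng.normal(size=(n, d))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # points on the unit sphere
mean = pts.mean(axis=0)

def run(K, flipflop):
    alpha = 10 * np.log(n * K) / (2 * n * K)  # step size of Thm. 4 with mu = 2
    x, perm = np.zeros(d), rng.permutation(n) # Single Shuffle: one fixed permutation
    for k in range(K):
        order = perm[::-1] if (flipflop and k % 2 == 1) else perm
        for i in order:
            x -= alpha * 2 * (x - pts[i])     # gradient of ||x - x_i||^2
    return np.linalg.norm(x - mean) ** 2

for K in [4, 8, 16, 32, 64]:
    print(K, run(K, False), run(K, True))     # error vs. K; log-log slopes differ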

Note that since the points are sampled randomly, Incremental Gradient Descent essentially becomes Single Shuffle. Hence, to verify Theorem 6, we need 'hard' instances for Incremental Gradient Descent, and in particular we use the ones from Theorem 3 in (Safran & Shamir, 2019). These results are also reported on a log-log graph in Figure 2.

We also tried FlipFlop in the training of deep neural networks, but unfortunately we did not see a big speedup there. We also ran experiments on logistic regression for 1-dimensional artificial data; the results are in Appendix H. The code for all the experiments can be found at https://github.com/shashankrajput/flipflop.

7 Conclusion and Future Work

In this paper, we explore the theoretical limits of permutation-based SGD for solving finite sum optimization problems. We focus on the power of good, carefully designed permutations and whether they can lead to a much better convergence rate than random. We prove that for 1-dimensional, strongly convex functions, good sequences of permutations indeed exist, leading to a convergence rate exponentially faster than that of random permutations. We also show that, unfortunately, this is not true in higher dimensions, and that for general strongly convex functions, random permutations might be optimal.

However, we think that for some subfamilies of strongly convex functions, good permutations might exist and may be easy to generate. Towards that end, we introduce a very simple technique, FlipFlop, to generate permutations that lead to faster convergence on strongly convex quadratics. This is a black-box technique, that is, it does not look at the optimization problem to come up with the permutations, and it can be implemented easily. It serves as an example that for other classes of functions, there can exist other techniques for coming up with good permutations. Finally, note that we only consider constant step sizes in this work, for both upper and lower bounds. Exploring regimes in which the step size changes, e.g., diminishing step sizes, is a very interesting open problem, which we leave for future work. We think that the upper and lower bounds in this paper give some important insights and can help in the development of better algorithms or heuristics. We strongly believe that under nice distributional assumptions on the component functions, there can exist good heuristics for generating good permutations, and this should also be investigated in future work.

References

  • Ahn et al. (2020) Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: Optimal rates without component convexity and large epoch requirements, 2020.
  • Bertsekas (2011) Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
  • Bottou (2009) Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009. URL http://leon.bottou.org/papers/bottou-slds-open-problem-2009.
  • Bubeck et al. (2015) Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • De Sa (2020) Christopher M De Sa. Random reshuffling is not always better. Advances in Neural Information Processing Systems, 33, 2020.
  • Gürbüzbalaban et al. (2019a) M Gürbüzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. Convergence rate of incremental gradient and incremental newton methods. SIAM Journal on Optimization, 29(4):2542–2565, 2019a.
  • Gürbüzbalaban et al. (2019b) Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, pp.  1–36, 2019b.
  • Haochen & Sra (2019) Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pp. 2624–2633, 2019.
  • Lai & Lim (2020) Zehua Lai and Lek-Heng Lim. Recht-Ré noncommutative arithmetic-geometric mean conjecture is false. In International Conference on Machine Learning, pp. 5608–5617. PMLR, 2020.
  • Mishchenko et al. (2020) Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. ArXiv, abs/2006.05988, 2020.
  • Nagaraj et al. (2019) Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.
  • Nedić & Bertsekas (2001) Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pp. 223–264. Springer, 2001.
  • Nesterov (2004) Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004. ISBN 978-1-4613-4691-3. doi: 10.1007/978-1-4419-8853-9. URL https://doi.org/10.1007/978-1-4419-8853-9.
  • Nguyen et al. (2020) Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.
  • Nocedal & Wright (2006) Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Rajput et al. (2020) Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of sgd without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.
  • Recht & Ré (2012) Benjamin Recht and Christopher Ré. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. arXiv preprint arXiv:1202.4184, 2012.
  • Recht & Ré (2013) Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.
  • Safran & Shamir (2019) Itay Safran and Ohad Shamir. How good is SGD with random shuffling?, 2019.
  • Safran & Shamir (2021) Itay Safran and Ohad Shamir. Random shuffling beats SGD only after many epochs on ill-conditioned problems. arXiv preprint arXiv:2106.06880, 2021.
  • Schneider (2016) M. Schneider. Probability inequalities for kernel embeddings in sampling without replacement. In AISTATS, 2016.
  • Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 46–54, 2016.
  • Tran et al. (2020) Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. Shuffling gradient-based methods with momentum. arXiv preprint arXiv:2011.11884, 2020.
  • Ying et al. (2020) Bicheng Ying, Kun Yuan, and Ali H Sayed. Variance-reduced stochastic learning under random reshuffling. IEEE Transactions on Signal Processing, 68:1390–1408, 2020.

Appendix A Discussion and Future Work

To the best of our knowledge, this is the first paper to theoretically analyze the optimality of random permutations, and we therefore limited our scope to a specific but common theoretical setting: strongly convex functions with constant step size. We think that future work can generalize the results of this paper to settings with non-convexity and variable step sizes, as well as techniques like variance reduction, momentum, etc.

A.1 Lower bounds for variable step sizes

All the existing lower bounds (to the best of our knowledge) work in the constant step size regime (Safran & Shamir (2019); Rajput et al. (2020); Safran & Shamir (2021)). Thus, generalizing the lower bounds to variable step size algorithms would be a very important direction for future research.

However, proving lower bounds can be tricky when the step sizes are not constant, since the step size could potentially depend on the permutation and the current iterate. A more reasonable setting for proving lower bounds could be the case when the step sizes follow a schedule over epochs, similar to what happens in practice.

A.2 Faster permutations for non-quadratic objectives

The analysis of FlipFlop leverages the fact that the Hessians of quadratic functions are constant. We think that the analysis of FlipFlop might be extended to strongly convex functions, or even PŁ functions (which are non-convex in general), under some assumptions on the Lipschitz continuity of the Hessians, similar to how Haochen & Sra (2019) extended their analysis of quadratic functions to more general classes. A key takeaway from FlipFlop is that we had to understand how permutation-based SGD works specifically for quadratic functions, that is, we did a white-box analysis. In general, we feel that depending on the specific class of non-convex functions (say deep neural networks), practitioners will have to think about permutation-based SGD in a white-box fashion to come up with better heuristics for shuffling.

In concurrent work (https://openreview.net/pdf?id=7gWSJrP3opB), it is shown that by greedily sorting stale gradients, a permutation order can be found that converges faster on some deep learning tasks. Hence, there do exist permutations better than random, even for deep learning tasks.

A.3 FlipFlop on Random Coordinate Descent

A shuffling scheme similar to FlipFlop has been used in random coordinate descent for faster practical convergence (see page 231 in Nocedal & Wright (2006)). This should be further investigated empirically and theoretically in future work. Even though the current analysis of FlipFlop does not directly go through for random coordinate descent, we think the analysis can be adapted to work.

Appendix B Proof of Theorem 1

In certain places in the proof, we will need that $\alpha \leq \frac{1}{4nL}$. To see how this is satisfied, note that we have assumed $\alpha \leq \frac{\mu}{8n(L^2 + L_H G)}$ in the theorem statement. Using the inequality $\mu \leq L$ in $\alpha \leq \frac{\mu}{4n(L^2 + L_H G)}$ gives that $\alpha \leq \frac{1}{4n(L + (L_H G/\mu))} \leq \frac{1}{4nL}$.

In this proof, we assume that the minimizer of $F$ is at 0, to make the analysis simpler. This assumption can be satisfied by simply translating the origin to the minimizer (Safran & Shamir, 2019).

There are three main components in the proof:

  1. Say that an epoch starts off at the minimizer. We prove that there exists at least one pair of permutations such that if we do two separate runs of the epoch, the first using the first permutation and the second using the second, then at the end of that epoch the iterates corresponding to the two runs end up on opposite sides of the minimizer.

  2. There exist a sequence of permutations and a point in the neighborhood of the minimizer such that, initializing at that point and using this sequence of permutations, we converge exactly to the minimizer.

  3. Starting from any other point, we can couple the iterates with the iterates of the previous component, to get that the two sequences of iterates approach each other exponentially fast.

We prove the first and second components in Subsections B.1 and B.2, and conclude the proof in Subsection B.3, where we also prove the third component.

B.1 Permutations in one epoch

In this subsection, we prove that if $x_0, x_1, \dots, x_n$ are the iterates in an epoch with $x_0 = 0$, then there exists a permutation of the functions such that $x_n \geq 0$. By the same logic, there exists a permutation of the functions such that $x_n \leq 0$. These will give us control over the movement of the iterates across epochs.

Order the gradients at the minimizer, $\nabla f_i(0)$, in decreasing order. WLOG assume that this order is just $\nabla f_1, \nabla f_2, \dots, \nabla f_n$. We claim that this permutation leads to $x_n \geq 0$.

We will need the following intermediate result. Let $y_i, y_{i-1}$ be such that $y_i = y_{i-1} - \alpha\nabla f_i(y_{i-1})$. Assume $\alpha \leq 1/L$ and $y_{i-1} \geq x_{i-1}$. Then,

$$y_i - x_i = y_{i-1} - x_{i-1} - \alpha(\nabla f_i(y_{i-1}) - \nabla f_i(x_{i-1})) \geq y_{i-1} - x_{i-1} - \alpha L(y_{i-1} - x_{i-1}) = (1 - \alpha L)(y_{i-1} - x_{i-1}) \geq 0, \tag{11}$$

that is, $y_{i-1} \geq x_{i-1} \implies y_i \geq x_i$.

Because 0 is the minimizer, we know that $\sum_{i=1}^n \nabla f_i(0) = 0$. Also, recall that $x_0 = 0$. There can be two cases:

  1. $\forall i \in [1, n]: x_i < 0$. This cannot be true, because if $\forall i: 1 \leq i \leq n-1: x_i < 0$, then

     $$x_n = \sum_{i=1}^n -\alpha\nabla f_i(x_{i-1}) \geq -\alpha\sum_{i=1}^n \nabla f_i(0) \geq 0,$$

     where we used the fact that $\nabla f_i(x) \leq \nabla f_i(y)$ if $x \leq y$ and $f_i$ is convex.

  2. Thus, $\exists i \in [1, n]: x_i \geq 0$. Now, consider the sequence $y_i, y_{i+1}, \dots, y_n$ such that $y_i = 0$ and for $j \geq i+1$, $y_j = y_{j-1} - \alpha\nabla f_j(y_{j-1})$. Then, because $\alpha \leq 1/L$ and $x_i \geq y_i = 0$, we get that $x_j \geq y_j$ for all $j \geq i$ (using Ineq. (11)).

Hence, there is an $i \geq 1$ such that $x_i \geq y_i = 0$, and further $x_j \geq y_j$ for all $j \geq i$. Next, we repeat the process above for $y_i, \dots, y_n$. That is, there can be two cases:

  1. $\forall j \in [i+1, n]: y_j < 0$. This cannot be true, because if $\forall j: i+1 \leq j \leq n-1: y_j < 0$, then

     $$y_n = \sum_{j=i}^n -\alpha\nabla f_j(y_j) \geq -\alpha\sum_{j=i}^n \nabla f_j(0) \geq 0.$$

  2. Thus, $\exists j \in [i+1, n]: y_j \geq 0$. Now, consider the sequence $z_j, z_{j+1}, \dots, z_n$ such that $z_j = 0$ and for $k \geq j+1$, $z_k = z_{k-1} - \alpha\nabla f_k(z_{k-1})$. Then, because $\alpha \leq 1/L$ and $y_j \geq z_j = 0$, we get that $y_k \geq z_k$ for all $k \geq j$ (using Ineq. (11)).

Hence, there exists an integer $j > i > 0$ such that $y_j \geq 0$. We have already proved that $x_j \geq y_j$; thus, $x_j \geq 0$. We can continue repeating this process (applying the same two cases to $z_j, z_{j+1}, \dots, z_n$, and so on) to get that $x_n \geq 0$. We define $p$ to be this non-negative value of $x_n$. Note that the following lemma gives us that the gradients are bounded by $G$.

Lemma 1.

Define $G^* := \max_i \|\nabla f_i(\bm{x}^*)\|$ and $D = \max\left\{\|\bm{x}_0^1 - \bm{x}^*\|, \frac{G^*}{2L}\right\}$. If Assumptions 2 and 3 hold, and $\alpha < \frac{1}{8\kappa nL}$, then for any permutation-based algorithm (deterministic or random), we have

$$\forall i, j, k:\quad \|\bm{x}_i^k - \bm{x}^*\| \leq 2D \quad\text{and}\quad \|\nabla f_j(\bm{x}_i^k)\| \leq G^* + 2DL.$$

Because the gradients are bounded by $G$, we get that $p \leq n\alpha G$.

Similarly, we can show that the reverse permutation leads to $x_n \leq 0$. We define $q$ to be this non-positive value of $x_n$. Because the gradients are bounded by $G$, we get that $q \geq -n\alpha G$.

B.2 Exact convergence to the minimizer

In this section, we show that there exists a point such that if we initialize there and follow a specific permutation sequence, we land exactly at the minimizer.

In particular, we will show the following: there exists a point in $[4q, 4p]$ such that if we initialize there and follow a specific permutation sequence, then we land exactly at the minimizer.

We show this recursively. We will prove that there exists a point $m^K \in [4q, 4p]$ such that if the last epoch begins there, that is, $x_0^K = m^K$, then we land at the minimizer at the end of the last epoch, that is, $x_n^K = 0$. Then, we will show that there exists a point $m^{K-1} \in [4q, 4p]$ such that if $x_0^{K-1} = m^{K-1}$, then $x_0^K = x_n^{K-1} = m^K$. Repeating this for $K-2, \dots, 1$, we get that there exists a point $m^0 \in [4q, 4p]$ such that if we initialize the first epoch there, that is, $x_0^1 = m^0$, then there is a permutation sequence that ultimately gives $x_n^K = 0$.

We prove that any point $m^j \in [4q, 4p]$ can be reached at the end of an epoch by beginning the epoch at some point $m^{j-1} \in [4q, 4p]$, that is, if $x_0^{j-1} = m^{j-1}$, then $x_n^{j-1} = m^j$.

  • Case: $m^j \in [p, 4p]$. In this case, we show that $m^{j-1} \in [0, 4p]$. We have proved in the previous subsection that there exists a permutation $\sigma$ such that if $x_0^{j-1} = 0$, then $x_n^{j-1} = p$.

    Next, we have the following helpful lemma that we will also use later.

    Lemma 2.

    Let $x_0, x_1, \dots, x_n$ be a sequence of iterates in an epoch and $y_0, y_1, \dots, y_n$ be another sequence of iterates in an epoch such that both use the same permutation of functions. If $\alpha \leq \frac{\mu}{2n(L^2 + L_H G)}$, then

    $$(1 - n\alpha L)|y_0 - x_0| \leq (1 - L\alpha)^n|y_0 - x_0| \leq |y_n - x_n| \leq \left(1 - \frac{1}{2}n\mu\alpha\right)|y_0 - x_0|.$$

    If we set $x_0 = 4p$, $y_0 = 0$ in Lemma 2 and we follow the permutation $\sigma$, then we get that

    $$x_n - y_n \in (x_0 - y_0)\left[1 - \alpha nL,\; 1 - \frac{\alpha n\mu}{2}\right] \implies x_n - p \in (4p - 0)\left[1 - \alpha nL,\; 1 - \frac{\alpha n\mu}{2}\right] \implies x_n \geq 4p,$$

    where the second step uses that running from $y_0 = 0$ with $\sigma$ results in $y_n = p$, and the last step uses the fact that $\alpha \leq \frac{1}{4nL}$.

    Thus, if $x_0^{j-1} = 4p$ and we follow the permutation $\sigma$, then $x_n^{j-1} \geq 4p$.

    Next, note that

    $$x_1^{j-1} = x_0^{j-1} - \alpha\nabla f_{\sigma(0)}(x_0^{j-1}),\quad x_2^{j-1} = x_1^{j-1} - \alpha\nabla f_{\sigma(1)}(x_1^{j-1}),\quad \dots,\quad x_n^{j-1} = x_{n-1}^{j-1} - \alpha\nabla f_{\sigma(n-1)}(x_{n-1}^{j-1}).$$

    Looking above, we see that $x_1^{j-1}$ is a continuous function of $x_0^{j-1}$; $x_2^{j-1}$ is a continuous function of $x_1^{j-1}$; and so on. Thus, using the fact that a composition of continuous functions is continuous, we get that $x_n^{j-1}$ is also a continuous function of $x_0^{j-1}$. We have shown that if $x_0^{j-1} = 0$, then $x_n^{j-1} = p$, and if $x_0^{j-1} = 4p$, then $x_n^{j-1} \geq 4p$. Thus, by the intermediate value theorem, for any point $m^j \in [p, 4p]$ there is at least one point $m^{j-1} \in [0, 4p]$ such that $x_0^{j-1} = m^{j-1}$ leads to $x_n^{j-1} = m^j$.

  • Case: $m^j \in [4q, q]$. We can apply the same logic as above to show that there is at least one point $m^{j-1} \in [4q, 0]$ such that $x_0^{j-1} = m^{j-1} \implies x_n^{j-1} = m^j$.

  • Case: $m^j \in [q, p]$. WLOG assume that $|q| < |p|$. Let $\sigma_q$ be the permutation such that if $x_0^{j-1} = 0$ and the epoch uses this permutation, then we end up at $x_n^{j-1} = q$.

    If we set $x_0 = 4p$, $y_0 = 0$ in Lemma 2 and we follow the permutation $\sigma_q$, then we get that

    $$x_n - y_n \in (x_0 - y_0)\left[1 - \alpha nL,\; 1 - \frac{\alpha n\mu}{2}\right] \implies x_n - q \in (4p - 0)\left[1 - \alpha nL,\; 1 - \frac{\alpha n\mu}{2}\right] \implies x_n \geq q + 4p(1 - \alpha nL) \geq q + 3p \geq 2p,$$

    where the second step uses that running from $y_0 = 0$ with $\sigma_q$ results in $y_n = q$, and the last step uses the fact that $\alpha \leq \frac{1}{4nL}$.

    Thus, if $x_0^{j-1} = 4p$ and we follow the permutation $\sigma_q$, then $x_n^{j-1} \geq 2p$.

    Thus, using a continuity argument similar to the first case, we know that there is a point $m^{j-1} \in [0, 4p]$ such that $x_0^{j-1} = m^{j-1}$ leads to $x_n^{j-1} = m^j$ when we use the permutation $\sigma_q$.

B.3 Same sequence permutations get closer

In the previous subsection, we have shown that there exists a point $m^0 \in [4q, 4p]$ and a sequence of permutations $\sigma^1, \sigma^2, \dots, \sigma^K$ such that if $x_0^1 = m^0$ and epoch $j$ uses permutation $\sigma^j$, then $x_n^K = 0$. In this subsection, we will show that if $x_0^1$ is initialized at any other point with $|x_0^1| \leq D$, then using the same permutations $\sigma^1, \sigma^2, \dots, \sigma^K$ gives us that $|x_n^K| \leq (D + 4n\alpha G)e^{-\frac{\mu\alpha n}{2}K}$. For this we will repeatedly apply Lemma 2 over all $K$ epochs. Assume that $x_0^1 = \nu^0$ with $|\nu^0| \leq D$.

Let $y_i^j$ be the sequence of iterates such that $y_0^1 = m^0$, using the permutation sequence $\sigma^1, \sigma^2, \dots, \sigma^K$. Then, we know that $y_n^K = 0$. Let $x_i^j$ be the sequence of iterates such that $x_0^1 = \nu^0$, using the same permutation sequence $\sigma^1, \sigma^2, \dots, \sigma^K$.

Then, using Lemma 2 gives us that $|y_n^1 - x_n^1| \leq |\nu^0 - m^0|(1 - \frac{\mu\alpha n}{2})$. Thus, we get that $|y_0^2 - x_0^2| \leq |\nu^0 - m^0|(1 - \frac{\mu\alpha n}{2})$. Again applying Lemma 2 gives us that $|y_n^2 - x_n^2| \leq |\nu^0 - m^0|(1 - \frac{\mu\alpha n}{2})^2$. Therefore, after applying it $K$ times, we get

$$|y_n^K - x_n^K| \leq |\nu^0 - m^0|\left(1 - \frac{\mu\alpha n}{2}\right)^K.$$

Note that $|x_0^1| = |\nu^0| \leq D$ and $|y_0^1| = |m^0|$, with $m^0 \in [4q, 4p]$. We showed earlier in Subsection B.1 that $|p| \leq n\alpha G$ and $|q| \leq n\alpha G$. Therefore,

$$|y_n^K - x_n^K| \leq (D + 4n\alpha G)\left(1 - \frac{\mu\alpha n}{2}\right)^K.$$

Further, we know that $y_n^K = 0$. Thus,

$$|x_n^K| \leq (D + 4n\alpha G)\left(1 - \frac{\mu\alpha n}{2}\right)^K \leq (D + 4n\alpha G)\, e^{-\frac{\mu\alpha n}{2}K}.$$

Substituting the value of $\alpha$ completes the proof. Next, we prove the lemmas used in this proof.

B.4 Proof of Lemma 1

We restate the lemma below.

Lemma.

Define G:=maxifi(𝐱)G^{*}:=\max_{i}\|\nabla f_{i}({\bm{x}}^{*})\| and D=max{𝐱01𝐱,G2L}D=\max\left\{\|{\bm{x}}_{0}^{1}-{\bm{x}}^{*}\|,\frac{G^{*}}{2L}\right\}. If Assumptions 2 and 3 hold, and α<18κnL\alpha<\frac{1}{8\kappa nL}, then for any permutation-based algorithm (deterministic or random), we have

i,j,k:\displaystyle\forall i,j,k: 𝒙ik𝒙2D,and\displaystyle\|{\bm{x}}_{i}^{k}-{\bm{x}}^{*}\|\leq 2D,\quad\text{and}
fj(𝒙ik)G+2DL.\displaystyle\|\nabla f_{j}({\bm{x}}_{i}^{k})\|\leq G^{*}+2DL.

This lemma says that for any permutation-based algorithm, the iterates and the gradient norms stay bounded during the optimization. This means that we can assume bounds on the norms of the iterates and gradients, which is not true in general for unconstrained SGD. This makes analyzing such algorithms much easier, and hence this lemma may be of independent interest for proving future results for permutation-based algorithms.

Remark 1.

This lemma does not hold in general for vanilla SGD, where sampling is done with replacement. Consider the example with two functions f_{1}(x)=x^{2}-x and f_{2}(x)=x, so that F(x)=\frac{1}{2}(f_{1}(x)+f_{2}(x)). This satisfies Assumptions 2 and 3, but with-replacement sampling may choose f_{2} consecutively for arbitrarily many iterations, which will lead the iterates a proportional distance away from the minimizer. This kind of situation can never happen for permutation-based SGD, because we see every function exactly once in every epoch: no particular function can attract the iterates too much towards its own minimizer, and by the end of each epoch most of the noise cancels out.
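The sketch below makes the remark concrete; the step size is illustrative.

```python
# Sketch of Remark 1 with an illustrative step size. f1(x) = x^2 - x and
# f2(x) = x; their average is minimized at the origin, yet repeatedly
# sampling f2 (possible under with-replacement sampling) drags the iterate
# proportionally far away.

alpha = 0.1
grad = {1: lambda x: 2 * x - 1, 2: lambda x: 1.0}

x = 0.0
for _ in range(50):            # adversarial with-replacement run: always f2
    x -= alpha * grad[2](x)
print("after 50 draws of f2:", x)       # -5.0: far from the minimizer

x = 0.0
for i in (1, 2):               # a permutation epoch sees each f_i once
    x -= alpha * grad[i](x)
print("after one epoch (1,2):", x)      # 0.0: the opposing terms cancel
```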

Note that the requirement α<18κnL\alpha<\frac{1}{8\kappa nL} is stricter than the usual requirement α=O(1nL)\alpha=O\left(\frac{1}{nL}\right), but we believe the lemma should also hold in that regime. For the current paper, however, this stronger requirement suffices.

Proof.

To prove this lemma, we show two facts: If for some epoch k, \|{\bm{x}}^{k}_{0}-{\bm{x}}^{*}\|\leq D, then a) \forall i:\|{\bm{x}}^{k}_{i}-{\bm{x}}^{*}\|\leq 2D and b) \|{\bm{x}}^{k+1}_{0}-{\bm{x}}^{*}\|=\|{\bm{x}}^{k}_{n}-{\bm{x}}^{*}\|\leq D. To see how a) and b) are sufficient to prove the lemma, assume that they are true. Then, since the first epoch begins inside the bounded region \|{\bm{x}}^{1}_{0}-{\bm{x}}^{*}\|\leq D, we see using b) that every subsequent epoch begins inside the same bounded region, that is, \|{\bm{x}}^{k}_{0}-{\bm{x}}^{*}\|\leq D as well. Hence using a) we get that during these epochs, the iterates satisfy \|{\bm{x}}^{k}_{i}-{\bm{x}}^{*}\|\leq 2D, which is the first part of the lemma. Further, this bound together with gradient Lipschitzness directly gives the upper bound G^{*}+2DL on the gradients. Thus, all we need to do to prove this lemma is to prove a) and b), which we do next.

We will prove a) and b) for D=max{𝒙01𝒙,2κnαG14κnαL}D=\max\{\|{\bm{x}}_{0}^{1}-{\bm{x}}^{*}\|,\frac{2\kappa n\alpha G^{*}}{1-4\kappa n\alpha L}\}. Once we do this, using α<18κnL\alpha<\frac{1}{8\kappa nL} will give us the exact statement of the lemma.

Let \|{\bm{x}}^{k}_{0}-{\bm{x}}^{*}\|\leq D for some epoch k. Then, we lower bound the number of iterations i needed so that \|{\bm{x}}^{k}_{i}-{\bm{x}}^{*}\|\geq 2D. Within the region \|{\bm{x}}-{\bm{x}}^{*}\|\leq 2D, the gradient norm is bounded by G^{*}+2DL, so each iteration moves the iterate by at most \alpha(G^{*}+2DL). Thus, the number of iterations needed is at least \frac{2D-D}{\alpha(G^{*}+2DL)}. However,

2DDα(G+2DL)\displaystyle\frac{2D-D}{\alpha(G^{*}+2DL)} =1α(GD+2L)\displaystyle=\frac{1}{\alpha(\frac{G^{*}}{D}+2L)}
1α(G14κnαL2κnαG+2L)\displaystyle\geq\frac{1}{\alpha(G^{*}\frac{1-4\kappa n\alpha L}{2\kappa n\alpha G^{*}}+2L)} (Using the fact that D2κnαG14κnαL.D\geq\frac{2\kappa n\alpha G^{*}}{1-4\kappa n\alpha L}.)
=1α(14κnαL2κnα+2L)\displaystyle=\frac{1}{\alpha(\frac{1-4\kappa n\alpha L}{2\kappa n\alpha}+2L)}
=2κn\displaystyle={2\kappa n}
2n.\displaystyle\geq 2n.

Thus, the minimum number of iterations needed to violate the bound \|{\bm{x}}^{k}_{i}-{\bm{x}}^{*}\|\leq 2D exceeds the length of the epoch. This implies that within the epoch, \|{\bm{x}}^{k}_{i}-{\bm{x}}^{*}\|\leq 2D, which proves a).

We prove b) next:

\displaystyle\|{\bm{x}}_{n}^{k}-{\bm{x}}^{*}\| \displaystyle=\left\|\left({\bm{x}}_{0}^{k}-\alpha\sum_{i=0}^{n-1}\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{i}^{k})\right)-{\bm{x}}^{*}\right\|
\displaystyle=\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}-\alpha\sum_{i=0}^{n-1}\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k})+\alpha\sum_{i=0}^{n-1}(\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k})-\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{i}^{k}))\right\|

Note that \sum_{i=0}^{n-1}\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k}) is just the sum of all n component gradients at {\bm{x}}_{0}^{k}, that is, \sum_{i=0}^{n-1}\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k})=n\nabla F({\bm{x}}_{0}^{k}). Using this, we get

\displaystyle\|{\bm{x}}_{n}^{k}-{\bm{x}}^{*}\| \displaystyle=\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}-n\alpha\nabla F({\bm{x}}_{0}^{k})+\alpha\sum_{i=0}^{n-1}(\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k})-\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{i}^{k}))\right\|
\displaystyle\leq\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}-n\alpha\nabla F({\bm{x}}_{0}^{k})\right\|+\alpha\sum_{i=0}^{n-1}\left\|\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{0}^{k})-\nabla f_{\sigma^{k}_{i}}({\bm{x}}_{i}^{k})\right\|
\displaystyle\leq\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}-n\alpha\nabla F({\bm{x}}_{0}^{k})\right\|+\alpha L\sum_{i=0}^{n-1}\left\|{\bm{x}}_{0}^{k}-{\bm{x}}_{i}^{k}\right\|, (12)

where we used gradient Lipschitzness (Assumption 2) in the last step.

To bound the first term above, we use the standard analysis of gradient descent on smooth, strongly convex functions as follows

𝒙0k𝒙nαF(𝒙0k)2\displaystyle\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}-n\alpha\nabla F({\bm{x}}_{0}^{k})\right\|^{2} =𝒙0k𝒙22nα𝒙0k𝒙,F(𝒙0k)+n2α2F(𝒙0k)2\displaystyle=\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2}-2n\alpha\langle{\bm{x}}_{0}^{k}-{\bm{x}}^{*},\nabla F({\bm{x}}_{0}^{k})\rangle+n^{2}\alpha^{2}\|\nabla F({\bm{x}}_{0}^{k})\|^{2}
𝒙0k𝒙22nαμ𝒙0k𝒙2+n2α2F(𝒙0k)2\displaystyle\leq\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2}-2n\alpha\mu\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\|^{2}+n^{2}\alpha^{2}\|\nabla F({\bm{x}}_{0}^{k})\|^{2} (Using Ineq. (3))
=(1nαμ)𝒙0k𝒙2+nα(nαF(𝒙0k)2μ𝒙0k𝒙2)\displaystyle=(1-n\alpha\mu)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2}+n\alpha(n\alpha\|\nabla F({\bm{x}}_{0}^{k})\|^{2}-\mu\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\|^{2})
(1nαμ)𝒙0k𝒙2+nα(nαL2𝒙0k𝒙2μ𝒙0k𝒙2)\displaystyle\leq(1-n\alpha\mu)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2}+n\alpha(n\alpha L^{2}\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\|^{2}-\mu\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\|^{2}) (Using gradient Lipschitzness)
=(1nαμ)𝒙0k𝒙2+nα(nαL2μ)𝒙0k𝒙2\displaystyle=(1-n\alpha\mu)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2}+n\alpha(n\alpha L^{2}-\mu)\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\|^{2}
(1nαμ)𝒙0k𝒙2,\displaystyle\leq(1-n\alpha\mu)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|^{2},

where in the last step we used that αμnL2\alpha{}\leq\frac{\mu}{nL^{2}} since α18κnL\alpha{}\leq\frac{1}{8\kappa nL}. Substituting this inequality in Ineq. (12), we get

\displaystyle\|{\bm{x}}_{n}^{k}-{\bm{x}}^{*}\| \displaystyle\leq\sqrt{1-n\alpha\mu}\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|+\alpha L\sum_{i=0}^{n-1}\left\|{\bm{x}}_{0}^{k}-{\bm{x}}_{i}^{k}\right\|
\displaystyle\leq\left(1-\frac{1}{2}n\alpha\mu\right)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|+\alpha L\sum_{i=0}^{n-1}\left\|{\bm{x}}_{0}^{k}-{\bm{x}}_{i}^{k}\right\|.

We have already proven a) that says that the iterates 𝒙ik{\bm{x}}_{i}^{k} satisfy 𝒙ik𝒙2D\|{\bm{x}}_{i}^{k}-{\bm{x}}^{*}\|\leq 2D. Using gradient Lipschitzness, this implies that the gradient norms stay bounded by G+2DLG^{*}+2DL. Hence, 𝒙0k𝒙ikαi(G+2DL)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}_{i}^{k}\right\|\leq\alpha i(G^{*}+2DL). Using this,

\displaystyle\|{\bm{x}}_{n}^{k}-{\bm{x}}^{*}\| \displaystyle\leq\left(1-\frac{1}{2}n\alpha\mu\right)\left\|{\bm{x}}_{0}^{k}-{\bm{x}}^{*}\right\|+\alpha L\sum_{i=0}^{n-1}\alpha i(G^{*}+2DL)
\displaystyle\leq\left(1-\frac{1}{2}n\alpha\mu\right)D+\alpha L\sum_{i=0}^{n-1}\alpha i(G^{*}+2DL)
\displaystyle\leq\left(1-\frac{1}{2}n\alpha\mu\right)D+n^{2}\alpha^{2}L(G^{*}+2DL)
D,\displaystyle\leq D,

where we used the fact that D2κnαG14κnαLD\geq\frac{2\kappa n\alpha G^{*}}{1-4\kappa n\alpha L} in the last step. ∎
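As a numerical sanity check of the two facts a) and b), consider the following sketch on an assumed toy problem (quadratic components with a common curvature, so \kappa=1; the quantity D below is a crude stand-in for the lemma's D, and all constants are illustrative):

```python
import numpy as np

# Check of invariants a) and b) on an assumed toy problem: quadratic
# components (L/2)||x - s_i||^2 with zero-mean shifts, a fresh permutation
# each epoch, and alpha well below 1/(8 * kappa * n * L) with kappa = 1.

rng = np.random.default_rng(1)
n, d, L = 10, 3, 1.0
shifts = rng.normal(size=(n, d))
shifts -= shifts.mean(axis=0)            # the average F is minimized at x* = 0
alpha = 1.0 / (16 * n * L)

x = rng.normal(size=d)
D = max(np.linalg.norm(x), 1.0)          # crude stand-in for the lemma's D
for _ in range(100):
    assert np.linalg.norm(x) <= D + 1e-9          # b): epoch starts stay in the D-ball
    for i in rng.permutation(n):
        x -= alpha * L * (x - shifts[i])          # grad of (L/2)||x - s_i||^2
        assert np.linalg.norm(x) <= 2 * D + 1e-9  # a): iterates stay in the 2D-ball
print("invariants held; final distance to x* =", np.linalg.norm(x))
```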

B.5 Proof of Lemma 2

Without loss of generality, let σ=(1,2,3,,n)\sigma=(1,2,3,\dots,n). This is only done for ease of notation. The analysis goes through for any other permutation σ\sigma too.

First we show the lower bound. WLOG assume y0>x0y_{0}>x_{0}. Because α<1/L\alpha<1/L, we have that i,yi>xi\forall i,y_{i}>x_{i} by induction (see the equations below). Then,

yixi\displaystyle y_{i}-x_{i} =yi1xi1α(fi(yi1)fi(xi1))\displaystyle=y_{i-1}-x_{i-1}-\alpha(\nabla f_{i}(y_{i-1})-\nabla f_{i}(x_{i-1}))
yi1xi1αL(yi1xi1)\displaystyle\geq y_{i-1}-x_{i-1}-\alpha L(y_{i-1}-x_{i-1})
=(1αL)(yi1xi1)\displaystyle=(1-\alpha L)(y_{i-1}-x_{i-1})
\displaystyle\qquad\vdots
\displaystyle\geq(1-\alpha L)^{i}(y_{0}-x_{0})
(1iαL)(y0x0).\displaystyle\geq(1-i\alpha L)(y_{0}-x_{0}). (13)

Next we show the upper bound

ynxn=y0x0αi=1n(fi(yi1)fi(xi1))\displaystyle y_{n}-x_{n}=y_{0}-x_{0}-\alpha\sum_{i=1}^{n}(\nabla f_{i}(y_{i-1})-\nabla f_{i}(x_{i-1}))
=y0x0αi=1n(fi(y0)fi(x0))+αi=1n(fi(y0)fi(x0)fi(yi1)+fi(xi1))\displaystyle=y_{0}-x_{0}-\alpha\sum_{i=1}^{n}(\nabla f_{i}(y_{0})-\nabla f_{i}(x_{0}))+\alpha\sum_{i=1}^{n}(\nabla f_{i}(y_{0})-\nabla f_{i}(x_{0})-\nabla f_{i}(y_{i-1})+\nabla f_{i}(x_{i-1}))
=y0x0nα(F(y0)F(x0))+αi=1n(fi(y0)fi(x0)fi(yi1)+fi(xi1))\displaystyle=y_{0}-x_{0}-n\alpha(\nabla F(y_{0})-\nabla F(x_{0}))+\alpha\sum_{i=1}^{n}(\nabla f_{i}(y_{0})-\nabla f_{i}(x_{0})-\nabla f_{i}(y_{i-1})+\nabla f_{i}(x_{i-1}))
(1nαμ)(y0x0)+αi=1n(fi(y0)fi(x0)fi(yi1)+fi(xi1)).\displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}(\nabla f_{i}(y_{0})-\nabla f_{i}(x_{0})-\nabla f_{i}(y_{i-1})+\nabla f_{i}(x_{i-1})). (Using strong convexity)

We use the fact that the function is twice differentiable:

ynxn=(1nαμ)(y0x0)+αi=1n(x0y02fi(t)dtxi1yi12fi(t)dt)\displaystyle y_{n}-x_{n}=(1-n\alpha\mu)(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}\left(\int_{x_{0}}^{y_{0}}\nabla^{2}f_{i}(t)\textnormal{d}t-\int_{x_{i-1}}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t\right)
=(1nαμ)(y0x0)\displaystyle=(1-n\alpha\mu)(y_{0}-x_{0})
+αi=1n(x0y02fi(t)dtxi1xi1+(y0x0)2fi(t)dtxi1+(y0x0)yi12fi(t)dt)\displaystyle\quad+\alpha\sum_{i=1}^{n}\left(\int_{x_{0}}^{y_{0}}\nabla^{2}f_{i}(t)\textnormal{d}t-\int_{x_{i-1}}^{x_{i-1}+(y_{0}-x_{0})}\nabla^{2}f_{i}(t)\textnormal{d}t-\int_{x_{i-1}+(y_{0}-x_{0})}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t\right)
=(1nαμ)(y0x0)\displaystyle=(1-n\alpha\mu)(y_{0}-x_{0})
+αi=1n(x0y0(2fi(t)2fi(xi1x0+t))dtxi1+(y0x0)yi12fi(t)dt).\displaystyle\quad+\alpha\sum_{i=1}^{n}\left(\int_{x_{0}}^{y_{0}}(\nabla^{2}f_{i}(t)-\nabla^{2}f_{i}(x_{i-1}-x_{0}+t))\textnormal{d}t-\int_{x_{i-1}+(y_{0}-x_{0})}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t\right).

In the above, we used the convention that abf(x)𝑑x\int_{a}^{b}f(x)dx is the same as baf(x)𝑑x-\int_{b}^{a}f(x)dx if a>ba>b.

Now, we can use the Hessian Lipschitzness to bound the term as follows

\displaystyle y_{n}-x_{n} \displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}\left(\int_{x_{0}}^{y_{0}}L_{H}|x_{i-1}-x_{0}|\textnormal{d}t-\int_{x_{i-1}+(y_{0}-x_{0})}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t\right)
\displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}\left(\int_{x_{0}}^{y_{0}}L_{H}G\alpha n\textnormal{d}t-\int_{x_{i-1}+(y_{0}-x_{0})}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t\right)
\displaystyle=(1-n\alpha\mu)(y_{0}-x_{0})+L_{H}G\alpha^{2}n^{2}(y_{0}-x_{0})-\alpha\sum_{i=1}^{n}\int_{x_{i-1}+(y_{0}-x_{0})}^{y_{i-1}}\nabla^{2}f_{i}(t)\textnormal{d}t
\displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+L_{H}G\alpha^{2}n^{2}(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}L((y_{0}-x_{0})-(y_{i}-x_{i}))
\displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+L_{H}G\alpha^{2}n^{2}(y_{0}-x_{0})+\alpha\sum_{i=1}^{n}L(i\alpha L)(y_{0}-x_{0}) (Using Ineq. (13).)
\displaystyle\leq(1-n\alpha\mu)(y_{0}-x_{0})+L_{H}G\alpha^{2}n^{2}(y_{0}-x_{0})+\alpha^{2}n^{2}L^{2}(y_{0}-x_{0}).

Thus, if we have αμ2n(L2+LHG)\alpha\leq\frac{\mu}{2n(L^{2}+L_{H}G)}, then

ynxn(1μnα2)(y0x0).y_{n}-x_{n}\leq\left(1-\frac{\mu n\alpha}{2}\right)(y_{0}-x_{0}).
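The two-sided bound of Lemma 2 is easy to check numerically. The sketch below uses toy 1-D quadratics, for which the Hessian-Lipschitz constant L_{H} is zero, and illustrative constants:

```python
import numpy as np

# Sanity check of Lemma 2 on toy 1-D quadratics (illustrative constants):
# two trajectories driven by the same permutation satisfy
# (y0 - x0)(1 - alpha*n*L) <= yn - xn <= (y0 - x0)(1 - alpha*n*mu/2).

rng = np.random.default_rng(2)
n, L = 8, 1.0
mu = L                                # each component below is L-strongly convex
alpha = mu / (2 * n * (L**2 + 1.0))   # mimics alpha <= mu/(2n(L^2 + L_H G));
                                      # here L_H = 0 and the 1.0 is a stand-in
shifts = rng.normal(size=n)

def run_epoch(x, perm):
    for i in perm:
        x -= alpha * L * (x - shifts[i])  # f_i'(x) for f_i = (L/2)(x - s_i)^2
    return x

x0, y0 = -0.3, 0.7
perm = rng.permutation(n)
gap = (run_epoch(y0, perm) - run_epoch(x0, perm)) / (y0 - x0)
print(1 - alpha * n * L, "<=", gap, "<=", 1 - alpha * n * mu / 2)
assert 1 - alpha * n * L <= gap <= 1 - alpha * n * mu / 2
```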

Appendix C Proof of Theorem 2

To prove this theorem, we consider three step-size ranges and do a case by case analysis for each of them. We construct functions for each range such that the convergence of any permutation-based algorithm is “slow” for the functions on their corresponding step-size regime. The final lower bound is the minimum among the lower bounds obtained for the three regimes.

In this proof, we will use notation different from the rest of the paper because we work with scalars, and hence the superscript will denote the scalar power, not the epoch number. We will use {\bm{x}}_{k,i} to denote the i-th iterate of the k-th epoch.

We will construct three functions F1(𝒙)F_{1}({\bm{x}}), F2(𝒚)F_{2}({\bm{y}}), and F3(z)F_{3}(z), each the means of nn component functions, such that

  • Any permutation-based algorithm on F1(𝒙)F_{1}({\bm{x}}) with α[12(n1)KL,3nL]\alpha\in[\frac{1}{2(n-1)KL},\frac{3}{nL}] and initialization 𝒙1,0=𝟎{\bm{x}}_{1,0}={\bm{0}} results in

    𝒙K,n2=Ω(1n3K2).\displaystyle\|{\bm{x}}_{K,n}\|^{2}=\Omega\left(\frac{1}{n^{3}K^{2}}\right).

    F1F_{1} will be an nn-dimensional function, that is 𝒙n{\bm{x}}\in{\mathbb{R}}^{n}. This function will have minimizer at 𝟎n{\bm{0}}\in{\mathbb{R}}^{n}. NOTE: This is the ‘key’ step-size range, and the proof sketch explained in the main paper corresponds to this function’s construction.

  • Any permutation-based algorithm on F2(𝒚)F_{2}({\bm{y}}) with α[3nL,1L]\alpha\in[\frac{3}{nL},\frac{1}{L}] and initialization 𝒚1,0=𝟎{\bm{y}}_{1,0}={\bm{0}} results in

    𝒚K,n2=Ω(1n2)\displaystyle\|{\bm{y}}_{K,n}\|^{2}=\Omega\left(\frac{1}{n^{2}}\right)

    F2F_{2} will be an nn-dimensional function, that is 𝒚n{\bm{y}}\in{\mathbb{R}}^{n}. This function will have minimizer at 𝟎n{\bm{0}}\in{\mathbb{R}}^{n}. The construction for this is also inspired by the construction for F1F_{1}, but this is constructed significantly differently due to the different step-size range.

  • Any permutation-based algorithm on F3(z)F_{3}(z) with α[12(n1)KL,1L]\alpha\notin[\frac{1}{2(n-1)KL},\frac{1}{L}] and initialization z1,0=1z_{1,0}=1 results in

    |zK,n|2=Ω(1)\displaystyle|z_{K,n}|^{2}=\Omega\left(1\right)

    F3F_{3} will be an 11-dimensional function, that is zz\in{\mathbb{R}}. This function will have minimizer at 0.

Then, the (2n+1)-dimensional function F([{\bm{x}}^{\top},{\bm{y}}^{\top},z]^{\top})=F_{1}({\bm{x}})+F_{2}({\bm{y}})+F_{3}(z) exhibits slow convergence in every step-size regime. This function has its minimizer at {\bm{0}}\in{\mathbb{R}}^{2n+1}. Furthermore,

n1nL𝑰2F1,2F2,2F3,2F2L𝑰,\displaystyle\frac{n-1}{n}L{\bm{I}}\preceq\nabla^{2}F_{1},\nabla^{2}F_{2},\nabla^{2}F_{3},\nabla^{2}F\preceq 2L{\bm{I}},

that is, F1F_{1}, F2F_{2}, F3F_{3}, and FF are all n1nL\frac{n-1}{n}L-strongly convex and 2L2L-smooth.

In the subsequent subsections, we prove the lower bounds for F1F_{1}, F2F_{2}, and F3F_{3} separately.

NOTE: In Appendix C.4, we have discussed how the lower bound is actually invariant to initialization.

C.1 Lower bound for F1F_{1}, α[12(n1)KL,3nL]\alpha\in[\frac{1}{2(n-1)KL},\frac{3}{nL}]

We will work in n dimensions (n is even) and represent a vector in the space as {\bm{z}}=[x_{1},y_{1},\dots,x_{n/2},y_{n/2}]. These x_{i} and y_{i} are not related to the vectors used for F_{2} and F_{3} later; we only use x_{i} and y_{i} to make the proof for this part easier to follow.

We start off by defining the nn component functions: For i[n/2]i\in[n/2], define

fi(𝒛)\displaystyle f_{i}({\bm{z}}) :=L2xi2xi+yi+ji(L2xj2+L2yj2), and\displaystyle:=\frac{L}{2}x_{i}^{2}-x_{i}+y_{i}+\sum_{j\neq i}\left(\frac{L}{2}x_{j}^{2}+\frac{L}{2}y_{j}^{2}\right),\text{ and}
gi(𝒛)\displaystyle g_{i}({\bm{z}}) :=L2yi2yi+xi+ji(L2xj2+L2yj2).\displaystyle:=\frac{L}{2}y_{i}^{2}-y_{i}+x_{i}+\sum_{j\neq i}\left(\frac{L}{2}x_{j}^{2}+\frac{L}{2}y_{j}^{2}\right).

Thus, F(𝒛):=1n(i=1n/2fi+i=1n/2gi)=(n1n)L2𝒛2F({\bm{z}}):=\frac{1}{n}\left(\sum_{i=1}^{n/2}f_{i}+\sum_{i=1}^{n/2}g_{i}\right)=\left(\frac{n-1}{n}\right)\frac{L}{2}\|{\bm{z}}\|^{2}. This function has minimizer at 𝒛=𝟎{\bm{z}}^{*}={\bm{0}}.

Let {\bm{z}}_{k,j} denote {\bm{z}} at the j-th iteration in the k-th epoch. We initialize at {\bm{z}}_{1,0}={\bm{0}}. For the majority of this proof, we will work inside a given epoch, so we will skip the subscript denoting the epoch. We denote the j-th iterate within the epoch as {\bm{z}}_{j} and its coordinates as {\bm{z}}_{j}=[x_{j,1},y_{j,1},\dots,x_{j,n/2},y_{j,n/2}].

In the current epoch, let σ\sigma be the permutation of {f1,,fn/2,g1,,gn/2}\{f_{1},\dots,f_{n/2},g_{1},\dots,g_{n/2}\} used.

For any i[n/2]i\in[n/2], let pp and qq be indices such that σp=fi\sigma_{p}=f_{i} and σq=gi\sigma_{q}=g_{i}. Let us consider the case that p<qp<q (the case when p>qp>q will be analogous). Then, it can be seen that

xn,i\displaystyle x_{n,i} =(1αL)n1x0,i+(1αL)np1α(1αL)nqα, and\displaystyle=(1-\alpha L)^{n-1}x_{0,i}+(1-\alpha L)^{n-p-1}\alpha-(1-\alpha L)^{n-q}\alpha,\text{ and}
yn,i\displaystyle y_{n,i} =(1αL)n1y0,i(1αL)npα+(1αL)nqα.\displaystyle=(1-\alpha L)^{n-1}y_{0,i}-(1-\alpha L)^{n-p}\alpha+(1-\alpha L)^{n-q}\alpha. (14)

Hence,

xn,i+yn,i\displaystyle x_{n,i}+y_{n,i} =(1αL)n1(x0,i+y0,i)+2(1αL)np1α2L\displaystyle=(1-\alpha L)^{n-1}(x_{0,i}+y_{0,i})+2(1-\alpha L)^{n-p-1}\alpha^{2}L
(1αL)n1(x0,i+y0,i)+2(1αL)n1α2L.\displaystyle\geq(1-\alpha L)^{n-1}(x_{0,i}+y_{0,i})+2(1-\alpha L)^{n-1}\alpha^{2}L.

In the other case, when p>qp>q, we will get the same inequality. Let xK,n,ix_{K,n,i} and yK,n,iy_{K,n,i} denote the value of xn,ix_{n,i} and yn,iy_{n,i} at the KK-th epoch. Then, recalling that 𝒛{\bm{z}} was initialized to 𝟎{\bm{0}}, we use the inequality above to get

xK,n,i+yK,n,i\displaystyle x_{K,n,i}+y_{K,n,i} (1αL)(n1)K0+2(1αL)n1α2L1(1αL)(n1)K1(1αL)n1\displaystyle\geq(1-\alpha L)^{({n-1})K}\cdot 0+2(1-\alpha L)^{n-1}\alpha^{2}L\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{n-1}}
=2(1αL)n1α2L1(1αL)(n1)K1(1αL)n1\displaystyle=2(1-\alpha L)^{n-1}\alpha^{2}L\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{n-1}} (15)

Since this inequality is valid for all ii, we get that

𝒛K,n2\displaystyle\|{\bm{z}}_{K,n}\|^{2} =i=1n/2(xK,n,i2+yK,n,i2)\displaystyle=\sum_{i=1}^{n/2}(x_{K,n,i}^{2}+y_{K,n,i}^{2})
i=1n/212(xK,n,i+yK,n,i)2\displaystyle\geq\sum_{i=1}^{n/2}\frac{1}{2}(x_{K,n,i}+y_{K,n,i})^{2}
i=1n/212(2(1αL)n1α2L1(1αL)(n1)K1(1αL)n1)2\displaystyle\geq\sum_{i=1}^{n/2}\frac{1}{2}\left(2(1-\alpha L)^{n-1}\alpha^{2}L\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{n-1}}\right)^{2}
=n((1αL)n1α2L1(1αL)(n1)K1(1αL)n1)2.\displaystyle=n\left((1-\alpha L)^{{n-1}}\alpha^{2}L\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{{n-1}}}\right)^{2}. (16)

Note that if α3nL\alpha\leq\frac{3}{nL} and n4n\geq 4, then (1αL)n118(1-\alpha L)^{n-1}\geq\frac{1}{8}. Using this in (16), we get

𝒛K,n2\displaystyle\|{\bm{z}}_{K,n}\|^{2} n((1αL)n1α2L1(1αL)(n1)K1(1αL)n1)2\displaystyle\geq n\left((1-\alpha L)^{n-1}\alpha^{2}L\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{n-1}}\right)^{2}
n128α4L2(1(1αL)(n1)K1(1αL)n1)2\displaystyle\geq\frac{n}{128}\alpha^{4}L^{2}\left(\frac{1-(1-\alpha L)^{({n-1})K}}{1-(1-\alpha L)^{{n-1}}}\right)^{2}
=n128L2((αL)21(1αL)n1)2(1(1αL)(n1)K)2\displaystyle=\frac{n}{128L^{2}}\left(\frac{(\alpha L)^{2}}{1-(1-\alpha L)^{n-1}}\right)^{2}\left({1-(1-\alpha L)^{({n-1})K}}\right)^{2}

We consider two cases:

  1. Case A, \alpha L\leq\frac{1}{2(n+2)}: It can be verified that \frac{(\alpha L)^{2}}{1-(1-\alpha L)^{n-1}} is an increasing function of \alpha when \alpha L\leq\frac{1}{2(n+2)}. Since we are working in the range \alpha\geq\frac{1}{2(n-1)KL}, we get

    𝒛K,n2\displaystyle\|{\bm{z}}_{K,n}\|^{2} n128L2((12(n1)K)21(112(n1)K)n1)2(1(1αL)(n1)K)2\displaystyle\geq\frac{n}{128L^{2}}\left(\frac{\left(\frac{1}{2(n-1)K}\right)^{2}}{1-\left(1-\frac{1}{2(n-1)K}\right)^{n-1}}\right)^{2}\left({1-(1-\alpha L)^{({n-1})K}}\right)^{2}
    n128L2((12(n1)K)212(n1)K(n1))2(1(1αL)(n1)K)2\displaystyle\geq\frac{n}{128L^{2}}\left(\frac{\left(\frac{1}{2(n-1)K}\right)^{2}}{\frac{1}{2(n-1)K}{(n-1)}}\right)^{2}\left({1-(1-\alpha L)^{({n-1})K}}\right)^{2}
    =n512(n1)4K2L2(1(1αL)(n1)K)2\displaystyle=\frac{n}{512(n-1)^{4}K^{2}L^{2}}\left({1-(1-\alpha L)^{({n-1})K}}\right)^{2}
    n512(n1)4K2L2(1eαL(n1)K)2\displaystyle\geq\frac{n}{512(n-1)^{4}K^{2}L^{2}}\left({1-e^{-\alpha L({n-1})K}}\right)^{2}
    n512(n1)4K2L2(1e1/2)2\displaystyle\geq\frac{n}{512(n-1)^{4}K^{2}L^{2}}\left({1-e^{-1/2}}\right)^{2}
    =Ω(1n3K2L2).\displaystyle=\Omega\left(\frac{1}{n^{3}K^{2}L^{2}}\right).
  2. Case B, \alpha L>\frac{1}{2(n+2)}: In this case,

    𝒛K,n2\displaystyle\|{\bm{z}}_{K,n}\|^{2} n128L2((12(n+2))21)2(1(1αL)(n1)K)2\displaystyle\geq\frac{n}{128L^{2}}\left(\frac{\left(\frac{1}{2(n+2)}\right)^{2}}{1}\right)^{2}\left({1-(1-\alpha L)^{({n-1})K}}\right)^{2}
    n128L2((12(n+2))21)2(1eαL(n1)K)2\displaystyle\geq\frac{n}{128L^{2}}\left(\frac{\left(\frac{1}{2(n+2)}\right)^{2}}{1}\right)^{2}\left(1-e^{-\alpha L(n-1)K}\right)^{2}
    n128L2((12(n+2))21)2(1e3(n1)K2(n+2))2\displaystyle\geq\frac{n}{128L^{2}}\left(\frac{\left(\frac{1}{2(n+2)}\right)^{2}}{1}\right)^{2}\left(1-e^{-\frac{3(n-1)K}{2(n+2)}}\right)^{2}
    =Ω(1n3L2).\displaystyle=\Omega\left(\frac{1}{n^{3}L^{2}}\right).
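To see the construction in action, the following simulation (illustrative n and K, with a fresh random permutation every epoch; the lower bound applies to any permutation-based scheme) tracks \|{\bm{z}}_{K,n}\|^{2} and compares it against the explicit floor obtained in Case A:

```python
import numpy as np

# Simulation of the F_1 construction (coordinates paired as (x_i, y_i); all
# constants illustrative): each pair picks up a positive bias of order
# alpha^2 * L per epoch regardless of the permutations used, so for these
# draws ||z_{K,n}||^2 stays above the explicit Case A floor
# n/(512 (n-1)^4 K^2 L^2) * (1 - e^{-1/2})^2.

rng = np.random.default_rng(3)
n, L, K = 8, 1.0, 20
alpha = 1.0 / (2 * (n - 1) * K * L)   # left end of the step-size range

def gradient(idx, z):
    g = L * z.copy()                  # quadratic part shared by all components
    i, is_f = idx % (n // 2), idx < n // 2
    if is_f:                          # f_i: (L/2)x_i^2 - x_i + y_i + ...
        g[2 * i], g[2 * i + 1] = L * z[2 * i] - 1.0, 1.0
    else:                             # g_i: (L/2)y_i^2 - y_i + x_i + ...
        g[2 * i], g[2 * i + 1] = 1.0, L * z[2 * i + 1] - 1.0
    return g

best = np.inf
for _ in range(200):                  # a fresh random permutation every epoch
    z = np.zeros(n)
    for _ in range(K):
        for idx in rng.permutation(n):
            z -= alpha * gradient(idx, z)
    best = min(best, float(np.sum(z**2)))

floor = n / (512 * (n - 1)**4 * K**2 * L**2) * (1 - np.exp(-0.5))**2
print(f"min ||z_Kn||^2 over trials: {best:.3e}  >=  Case A floor: {floor:.3e}")
```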

C.2 Lower bound for F2F_{2}, α[3nL,1L]\alpha\in[\frac{3}{nL},\frac{1}{L}]

We will work in nn-dimensions and represent a vector in the space as 𝒚=[y1,,yn]{\bm{y}}=[y_{1},\dots,y_{n}].

We start off by defining the nn component functions: For i[n]i\in[n], define

fi(𝒚)\displaystyle f_{i}({\bm{y}}) :=yi+ji(Lyj22+yjn1)\displaystyle:=-y_{i}+\sum_{j\neq i}\left(\frac{Ly_{j}^{2}}{2}+\frac{y_{j}}{n-1}\right)

Thus, F(𝒚):=1ni=1nfi=(n1n)L2𝒚2F({\bm{y}}):=\frac{1}{n}\sum_{i=1}^{n}f_{i}=\left(\frac{n-1}{n}\right)\frac{L}{2}\|{\bm{y}}\|^{2}. This function has minimizer at 𝒚=𝟎{\bm{y}}^{*}={\bm{0}}.

Let {\bm{y}}_{k,j} denote {\bm{y}} at the j-th iteration in the k-th epoch. We initialize at {\bm{y}}_{1,0}={\bm{0}}. For the majority of this proof, we will work inside a given epoch. We denote the j-th iterate within the epoch as {\bm{y}}_{j} and its coordinates as {\bm{y}}_{j}=[y_{j,1},\dots,y_{j,n}].

In the current epoch, let σ\sigma be the permutation of {f1,,fn}\{f_{1},\dots,f_{n}\} used. Let ii be the index such that σn=i\sigma_{n}=i, that is, ii is the last element of the permutation σ\sigma. Then at the end of the epoch,

yn,i\displaystyle y_{n,i} =(1αL)n1y0,i+ααn1j=0n2(1αL)j\displaystyle=(1-\alpha L)^{n-1}y_{0,i}+\alpha-\frac{\alpha}{n-1}\sum_{j=0}^{n-2}(1-\alpha L)^{j}
=(1αL)n1y0,i+α(1(1αL)n1)(n1)L.\displaystyle=(1-\alpha L)^{n-1}y_{0,i}+\alpha-\frac{(1-(1-\alpha L)^{n-1})}{(n-1)L}. (17)

For some j[n]j\in[n], let ss be the integer such that σs=j\sigma_{s}=j, that is jj is the ss-th element in the permutation σ\sigma. Then for any jj and any epoch,

\displaystyle y_{n,j} \displaystyle=(1-\alpha L)^{n-1}y_{0,j}+\alpha(1-\alpha L)^{n-s}-\frac{\alpha}{n-1}\sum_{l=0}^{n-2}(1-\alpha L)^{l}.

Then,

\displaystyle y_{n,j} \displaystyle\geq(1-\alpha L)^{n-1}y_{0,j}-\frac{\alpha}{n-1}\sum_{l=0}^{n-2}(1-\alpha L)^{l}
\displaystyle=(1-\alpha L)^{n-1}y_{0,j}-\frac{(1-(1-\alpha L)^{n-1})}{(n-1)L}.

Note that the above is independent of σ\sigma, and hence applicable for all epochs. Applying it recursively and noting that we initialized 𝒚1,0=𝟎{\bm{y}}_{1,0}={\bm{0}}, we get

\displaystyle y_{n,j} \displaystyle\geq-\frac{(1-(1-\alpha L)^{n-1})}{(n-1)L}\sum_{k=0}^{K-1}(1-\alpha L)^{(n-1)k}
\displaystyle\geq-\frac{1}{(n-1)L}.

Note that y0,iy_{0,i} is just yn,iy_{n,i} from the previous epoch. Hence we can substitute the inequality above for y0,iy_{0,i} in (17). Thus,

yn,i\displaystyle y_{n,i} (1αL)n1(n1)L+α(1(1αL)n1)(n1)L\displaystyle\geq-\frac{(1-\alpha L)^{n-1}}{(n-1)L}+\alpha-\frac{(1-(1-\alpha L)^{n-1})}{(n-1)L}
=α1(n1)L\displaystyle=\alpha-\frac{1}{(n-1)L}
3nL1(n1)L\displaystyle\geq\frac{3}{nL}-\frac{1}{(n-1)L}
=Ω(1nL).\displaystyle=\Omega\left(\frac{1}{nL}\right).

This gives us that 𝒚k,n2=Ω(1n2L2)\|{\bm{y}}_{k,n}\|^{2}=\Omega\left(\frac{1}{n^{2}L^{2}}\right) for any kk.
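A short simulation of this argument (illustrative constants; the gradients are exactly those of the components f_{i} defined above):

```python
import numpy as np

# Sketch of the F_2 argument with illustrative constants: whichever component
# is ordered last in an epoch has its coordinate end the epoch at least
# alpha - 1/((n-1)L), which is Omega(1/(nL)) once alpha >= 3/(nL).

rng = np.random.default_rng(4)
n, L, K = 10, 1.0, 30
alpha = 3.0 / (n * L)

y = np.zeros(n)
for _ in range(K):
    perm = rng.permutation(n)
    for i in perm:
        g = L * y + 1.0 / (n - 1)  # grad of sum_{j != i}(L y_j^2/2 + y_j/(n-1))
        g[i] = -1.0                # component i itself contributes only -y_i
        y -= alpha * g
    assert y[perm[-1]] >= alpha - 1.0 / ((n - 1) * L) - 1e-12
print("last coordinate:", y[perm[-1]], ">=", alpha - 1.0 / ((n - 1) * L))
```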

C.3 Lower bound for F3F_{3}, α[12(n1)KL,1L]\alpha\notin[\frac{1}{2(n-1)KL},\frac{1}{L}]

Consider the function F(z)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(z), where f_{i}(z)=Lz^{2} for all i. Note that the gradient of every component at z is 2Lz. Hence, regardless of the permutation, if we start the shuffling-based gradient descent method at z_{1,0}=1, we get that

zK,n=(12αL)nKz1,0=(12αL)nK.\displaystyle z_{K,n}=(1-2\alpha L)^{nK}z_{1,0}=(1-2\alpha L)^{nK}.

In the case when α12(n1)KL\alpha\leq\frac{1}{2(n-1)KL}, we see that

zK,n\displaystyle z_{K,n} (1212(n1)KLL)nK\displaystyle\geq\left(1-2\frac{1}{2(n-1)KL}L\right)^{nK}
(11(n1)K)nK\displaystyle\geq\left(1-\frac{1}{(n-1)K}\right)^{nK}
=Ω(1),\displaystyle=\Omega(1),

for n,K2n,K\geq 2. Finally, in the case when α1L\alpha\geq\frac{1}{L}, we see that

|zK,n|\displaystyle|z_{K,n}| =|12αL|nK\displaystyle=\left|1-2\alpha L\right|^{nK}
|121LL|nK\displaystyle\geq\left|1-2\frac{1}{L}L\right|^{nK}
1nK\displaystyle\geq 1^{nK}
=Ω(1).\displaystyle=\Omega(1).

C.4 Discussion about initialization

The lower bound partitions the step size into three ranges:

  • In the step size ranges α[12(n1)KL,3nL]\alpha\in[\frac{1}{2(n-1)KL},\frac{3}{nL}] and α[3nL,1L]\alpha\in[\frac{3}{nL},\frac{1}{L}], the initializations are done at the minimizer and it is shown that any permutation-based algorithm will still move away from the minimizer. The choice of initializing at the minimizer was solely for the convenience of analysis and calculations, and the proof should work for any other initialization as well.

    Furthermore, the effect of initializing at any arbitrary (non-zero) point decays exponentially fast with the number of epochs anyway. To see why, note that every epoch can be treated as one full gradient descent step with step size n\alpha plus some noise; the full gradient descent part keeps shrinking the effect of the initialization exponentially, and what we are left with is the noise accumulated in each epoch. Thus, it was more convenient for us to just assume initialization at the minimizer and focus only on the noise in each epoch.

  • The step size range α[12(n1)KL,1L]\alpha\notin[\frac{1}{2(n-1)KL},\frac{1}{L}] can be divided into two parts, α[0,12(n1)KL]\alpha\in[0,\frac{1}{2(n-1)KL}] and α[1L,)\alpha\in[\frac{1}{L},\infty).

    For the range α[0,12(n1)KL]\alpha\in[0,\frac{1}{2(n-1)KL}], we essentially show that the step size is too small to make any meaningful progress towards the minimizer. Hence, instead of initializing at 11, initializing at any other arbitrary (non-zero) point will also give the same slow convergence rate.

    For the range \alpha\in[\frac{1}{L},\infty), we show that the optimization algorithm will in fact diverge since the step size is too large. Hence, even here, initializing at any other arbitrary (non-zero) point will also lead to divergence.

Appendix D Proof of Theorem 3

Define f_{1}(x):=\frac{L}{2}x^{2}-x and f_{2}(x):=-\frac{L}{4}x^{2}+x. Thus, F(x)=\frac{1}{2}(f_{1}(x)+f_{2}(x))=\frac{L}{8}x^{2}. This function has its minimizer at x^{*}=0. For this proof, we will use the convention that x_{i,j} is the iterate after the i-th iteration of the j-th epoch. Further, the number in the superscript will be the scalar power. For example, x_{i,j}^{2}=x_{i,j}\cdot x_{i,j}.

Initialize at x_{0,0}=\frac{1}{L}. Then at epoch k, there are two possible permutations: \sigma=(1,2) and \sigma=(2,1). In the first case, \sigma=(1,2), we get that after the first iteration of the epoch,

\displaystyle x_{1,k} \displaystyle=x_{0,k}-\alpha\nabla f_{1}(x_{0,k})
\displaystyle=(1-\alpha L)x_{0,k}+\alpha.

Continuing on, in the second iteration, we get

\displaystyle x_{2,k} \displaystyle=x_{1,k}-\alpha\nabla f_{2}(x_{1,k})
\displaystyle=\left(1+\frac{1}{2}\alpha L\right)x_{1,k}-\alpha
=(1+12αL)((1αL)x0,k+α)α\displaystyle=\left(1+\frac{1}{2}\alpha L\right)((1-\alpha L)x_{0,k}+\alpha)-\alpha
=(1+12αL)(1αL)x0,k+12α2L.\displaystyle=\left(1+\frac{1}{2}\alpha L\right)(1-\alpha L)x_{0,k}+\frac{1}{2}\alpha^{2}L.

Note that x0,k+1=x2,kx_{0,{k+1}}=x_{2,{k}}. Thus, x0,k+1=(1+12αL)(1αL)x0,k+12α2Lx_{0,{k+1}}=\left(1+\frac{1}{2}\alpha L\right)(1-\alpha L)x_{0,k}+\frac{1}{2}\alpha^{2}L.

Similarly, for the other possible permutation, σ=(2,1)\sigma=(2,1), we get x0,k+1=(1+12αL)(1αL)x0,k+α2Lx_{0,k+1}=\left(1+\frac{1}{2}\alpha L\right)(1-\alpha L)x_{0,k}+\alpha^{2}L. Thus, regardless of what permutation we use, we get that

x0,k+1(1+12αL)(1αL)x0,k+12α2L(1αL)x0,k+12α2L.x_{0,{k+1}}\geq\left(1+\frac{1}{2}\alpha L\right)(1-\alpha L)x_{0,k}+\frac{1}{2}\alpha^{2}L\geq(1-\alpha L)x_{0,k}+\frac{1}{2}\alpha^{2}L.

Hence, recalling that we initialized at x0,0=1Lx_{0,0}=\frac{1}{L}, we get

xn,K\displaystyle x_{n,K} =x0,K+1\displaystyle=x_{0,K+1}
(1αL)K1L+12α2Li=0K1(1αL)i\displaystyle\geq(1-\alpha L)^{K}\frac{1}{L}+\frac{1}{2}\alpha^{2}L\sum_{i=0}^{K-1}(1-\alpha L)^{i}
=1L(1αL)K+12α2L1(1αL)K1(1αL)\displaystyle=\frac{1}{L}(1-\alpha L)^{K}+\frac{1}{2}\alpha^{2}L\frac{1-(1-\alpha L)^{K}}{1-(1-\alpha L)}
=1L(1αL)K+12α(1(1αL)K).\displaystyle=\frac{1}{L}(1-\alpha L)^{K}+\frac{1}{2}\alpha\left(1-(1-\alpha L)^{K}\right). (18)

Now, if α1KL\alpha\geq\frac{1}{KL}, then

12α(1(1αL)K)α8=Ω(1KL),\frac{1}{2}\alpha\left(1-(1-\alpha L)^{K}\right)\geq\frac{\alpha}{8}=\Omega\left(\frac{1}{KL}\right),

and hence, xn,K2=Ω(1K2L2)x_{n,K}^{2}=\Omega(\frac{1}{K^{2}L^{2}}). Otherwise, if α1KL\alpha\leq\frac{1}{KL}, then continuing on from (18),

\displaystyle x_{n,K} \displaystyle\geq\frac{1}{L}(1-\alpha L)^{K}+\frac{1}{2}\alpha\left(1-(1-\alpha L)^{K}\right)
\displaystyle\geq\frac{1}{L}(1-\alpha L)^{K}
\displaystyle\geq\frac{1}{L}e^{-2\alpha LK} (Since 1-t\geq e^{-2t} for t\in[0,\frac{1}{2}], and \alpha L\leq\frac{1}{K}\leq\frac{1}{2} for K\geq 2.)
\displaystyle\geq\frac{1}{L}e^{-2}
\displaystyle=\Omega(1/L),

and hence in this case, xn,K2=Ω(1L2)x_{n,K}^{2}=\Omega(\frac{1}{L^{2}}).
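The recursion above is straightforward to simulate. The sketch below (illustrative K and L) alternates between the two possible permutations, which is immaterial since both satisfy the same recursion, and reports the final iterate in the two step-size regimes:

```python
# Simulation of the Theorem 3 example (illustrative constants): whichever of
# the two permutations each epoch uses, the iterate obeys
# x_{0,k+1} >= (1 - alpha*L) x_{0,k} + alpha^2 L / 2 and stays away from 0.

L, K = 1.0, 100
for alpha in (1.0 / (K * L), 0.1 / (K * L)):   # one alpha from each regime
    x = 1.0 / L
    for k in range(K):
        order = (1, 2) if k % 2 == 0 else (2, 1)   # alternate, for variety
        for i in order:
            g = L * x - 1.0 if i == 1 else -(L / 2.0) * x + 1.0
            x -= alpha * g
    print(f"alpha = {alpha:.4g}: x_(n,K) = {x:.4g}")
# alpha >= 1/(KL) leaves x_(n,K) = Omega(1/(KL)); smaller alpha leaves
# x_(n,K) = Omega(1/L), matching the two cases after Eq. (18).
```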

Appendix E Proof of Theorem 4

Let F({\bm{x}}):=\frac{1}{n}\sum_{i=1}^{n}f_{i}({\bm{x}}) be such that its minimizer is at the origin. This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019). Since the f_{i} are convex quadratics, we can write them as f_{i}({\bm{x}})=\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{i}{\bm{x}}-{\bm{b}}_{i}^{\top}{\bm{x}}+c_{i}, where the {\bm{A}}_{i} are symmetric, positive-semidefinite matrices. We can omit the constants c_{i} because they do not affect the minimizer or the gradients. Because we assume that the minimizer of F({\bm{x}}) is at the origin, we get that

i=1n𝒃i=𝟎.\displaystyle\sum_{i=1}^{n}{\bm{b}}_{i}={\bm{0}}. (19)

Let \sigma=(\sigma_{1},\sigma_{2},\dots,\sigma_{n}) be the random permutation of (1,2,\dots,n) generated at the beginning of the algorithm. Then for k\in\{1,2,\dots,K/2\}, epoch 2k-1 sees the n functions in the following sequence:

(12𝒙𝑨σ1𝒙𝒃σ1𝒙,12𝒙𝑨σ2𝒙𝒃σ2𝒙,,12𝒙𝑨σn𝒙𝒃σn𝒙),\left(\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{1}}{\bm{x}}-{\bm{b}}_{\sigma_{1}}^{\top}{\bm{x}},\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{2}}{\bm{x}}-{\bm{b}}_{\sigma_{2}}^{\top}{\bm{x}},\dots,\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n}}{\bm{x}}-{\bm{b}}_{\sigma_{n}}^{\top}{\bm{x}}\right),

whereas epoch 2k2k sees the nn functions in the reverse sequence:

(12𝒙𝑨σn𝒙𝒃σn𝒙,12𝒙𝑨σn1𝒙𝒃σn1𝒙,,12𝒙𝑨σ1𝒙𝒃σ1𝒙).\left(\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n}}{\bm{x}}-{\bm{b}}_{\sigma_{n}}^{\top}{\bm{x}},\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n-1}}{\bm{x}}-{\bm{b}}_{\sigma_{n-1}}^{\top}{\bm{x}},\dots,\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{1}}{\bm{x}}-{\bm{b}}_{\sigma_{1}}^{\top}{\bm{x}}\right).
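In code, this pairing scheme (a single permutation \sigma played forward on odd epochs and in reverse on even epochs) looks as follows. The quadratics below are randomly generated for illustration, and the small K used here is far below the theorem's requirement, so the sketch only illustrates the ordering, not the convergence rate:

```python
import numpy as np

# Sketch of the permutation scheme analyzed here: sigma is drawn once, odd
# epochs use sigma, even epochs use its reverse. Matrices, dimensions, and K
# are illustrative assumptions of the sketch.

rng = np.random.default_rng(5)
n, d, K, L, mu = 10, 4, 40, 1.0, 0.5

def random_psd():
    Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
    return Q @ np.diag(rng.uniform(mu, L, size=d)) @ Q.T

A = [random_psd() for _ in range(n)]
b = rng.normal(size=(n, d))
b -= b.mean(axis=0)                          # sum_i b_i = 0: minimizer x* = 0
alpha = 10 * np.log(n * K) / (mu * n * K)    # the alpha used in the proof

sigma = rng.permutation(n)
x = rng.normal(size=d)
for k in range(1, K + 1):
    order = sigma if k % 2 == 1 else sigma[::-1]
    for i in order:
        x -= alpha * (A[i] @ x - b[i])       # grad of (1/2)x^T A_i x - b_i^T x
print("||x_n^K||^2 =", float(x @ x))
```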

We define 𝑺i:=α𝑨σi{\bm{S}}_{i}:=\alpha{\bm{A}}_{\sigma_{i}} and 𝒕i=α𝒃σi{\bm{t}}_{i}=\alpha{\bm{b}}_{\sigma_{i}} for convenience of notation. We start off by computing the progress made during an even indexed epoch. Since the even epochs use the reverse permutation, we get

𝒙02k+1\displaystyle{\bm{x}}_{0}^{2k+1} =𝒙n2k\displaystyle={\bm{x}}_{n}^{2k}
=𝒙n12kα(𝑨σ1𝒙n12k𝒃σ1)\displaystyle={\bm{x}}^{2k}_{n-1}-\alpha\left({\bm{A}}_{\sigma_{1}}{\bm{x}}_{n-1}^{2k}-{\bm{b}}_{\sigma_{1}}\right) (fσ1f_{\sigma_{1}} is used at the last iteration of even epochs.)
=(𝑰α𝑨σ1)𝒙n12k+α𝒃σ1\displaystyle=\left({\bm{I}}-\alpha{\bm{A}}_{\sigma_{1}}\right){\bm{x}}^{2k}_{n-1}+\alpha{\bm{b}}_{\sigma_{1}}
=(𝑰𝑺1)𝒙n12k+𝒕1.\displaystyle=({\bm{I}}-{\bm{S}}_{1}){\bm{x}}_{n-1}^{2k}+{\bm{t}}_{1}.

We recursively apply the same procedure as above to the whole epoch to get the following

𝒙02k+1\displaystyle{\bm{x}}_{0}^{2k+1} =(𝑰𝑺1)𝒙n12k+𝒕1\displaystyle=({\bm{I}}-{\bm{S}}_{1}){\bm{x}}_{n-1}^{2k}+{\bm{t}}_{1}
=(𝑰𝑺1)((𝑰𝑺2)𝒙n22k+𝒕2)+𝒕1\displaystyle=({\bm{I}}-{\bm{S}}_{1})\left(({\bm{I}}-{\bm{S}}_{2}){\bm{x}}_{n-2}^{2k}+{\bm{t}}_{2}\right)+{\bm{t}}_{1}
=(𝑰𝑺1)(𝑰𝑺2)𝒙n22k+(𝑰𝑺1)𝒕2+𝒕1\displaystyle=({\bm{I}}-{\bm{S}}_{1})({\bm{I}}-{\bm{S}}_{2}){\bm{x}}_{n-2}^{2k}+({\bm{I}}-{\bm{S}}_{1}){\bm{t}}_{2}+{\bm{t}}_{1}
\displaystyle\qquad\vdots
=(i=1n(𝑰𝑺i))𝒙02k+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i,\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right){\bm{x}}_{0}^{2k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}, (20)

where the product of matrices \{{\bm{M}}_{i}\} is defined as \prod_{i=l}^{m}{\bm{M}}_{i}={\bm{M}}_{l}{\bm{M}}_{l+1}\dots{\bm{M}}_{m} if m\geq l, and as the identity {\bm{I}} otherwise. Similar to Eq. (20), we can compute the progress made during an odd indexed epoch. Recall that the only difference is that the odd indexed epochs see the permutations in the order (\sigma_{1},\sigma_{2},\dots,\sigma_{n}) instead of (\sigma_{n},\sigma_{n-1},\dots,\sigma_{1}). After doing the computation, we get the following equation:

𝒙02k=(i=1n(𝑰𝑺ni+1))𝒙02k1+i=1n(j=1ni(𝑰𝑺nj+1))𝒕i.\displaystyle{\bm{x}}_{0}^{2k}=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right){\bm{x}}_{0}^{2k-1}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n-j+1})\right){\bm{t}}_{i}.

Combining the results above, we can get the total progress made after the pair of epoch 2k12k-1 and 2k2k:

𝒙02k+1=(i=1n(𝑰𝑺i))𝒙02k+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i\displaystyle{\bm{x}}_{0}^{2k+1}=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right){\bm{x}}_{0}^{2k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}
=(i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1))𝒙02k1+(j=1n(𝑰𝑺j))i=1n(j=1ni(𝑰𝑺n+1j))𝒕i\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right){\bm{x}}_{0}^{2k-1}+\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}
+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i.\displaystyle\quad+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}. (21)

In the sum above, the first term decays exponentially, hence we need to control the next two terms. We denote the sum of these two terms by {\bm{z}} (see the definition below) and we will control its norm later in this proof.

𝒛\displaystyle{\bm{z}} :=(j=1n(𝑰𝑺j))i=1n(j=1ni(𝑰𝑺n+1j))𝒕i+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i\displaystyle:=\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}
=i=1n(j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))𝒕i+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i.\displaystyle=\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}.

To see where the iterates end up after K epochs, we simply set 2k=K in Eq. (21) and then keep applying the equation recursively to the preceding epochs. Then, we get

𝒙nK=𝒙0K+1=(i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1))𝒙0K1+𝒛\displaystyle{\bm{x}}_{n}^{K}={\bm{x}}_{0}^{K+1}=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right){\bm{x}}_{0}^{K-1}+{\bm{z}}
=((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))2𝒙0K3+(i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1))𝒛+𝒛\displaystyle=\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{2}{\bm{x}}_{0}^{K-3}+\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right){\bm{z}}+{\bm{z}}
\displaystyle\qquad\vdots
=((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙01+k=0K21((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))k𝒛.\displaystyle=\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}+\sum_{k=0}^{\frac{K}{2}-1}\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{k}{\bm{z}}.

Taking squared norms and expectations on both sides, we get

𝔼[\displaystyle\mathbb{E}[ 𝒙nK2]=𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙01\displaystyle\|{\bm{x}}_{n}^{K}\|^{2}]=\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right.\right.
+k=0K21((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))k𝒛2]\displaystyle\quad\left.\left.+\sum_{k=0}^{\frac{K}{2}-1}\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{k}{\bm{z}}\right\|^{2}\right]
2𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙012]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right\|^{2}\right]
+2𝔼[k=0K21((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))k𝒛2]\displaystyle\quad+2\mathbb{E}\left[\left\|\sum_{k=0}^{\frac{K}{2}-1}\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{k}{\bm{z}}\right\|^{2}\right] (Since (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2})
2𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙012]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right\|^{2}\right]
+2𝔼[(𝒛k=0K21i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)k)2].\displaystyle\quad+2\mathbb{E}\left[\left(\|{\bm{z}}\|\sum_{k=0}^{\frac{K}{2}-1}\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\|^{k}\right)^{2}\right].

We assumed that the functions f_{i} have L-Lipschitz gradients (Assumption 2). This translates to the {\bm{A}}_{i} having maximum eigenvalue at most L. Hence, if \alpha\leq 1/L, we get that {\bm{I}}-\alpha{\bm{A}}_{i} is positive semi-definite with maximum eigenvalue bounded by 1. Hence, \|{\bm{I}}-{\bm{S}}_{i}\|\leq 1. Using this and the fact that for matrices {\bm{M}}_{1} and {\bm{M}}_{2} we have \|{\bm{M}}_{1}{\bm{M}}_{2}\|\leq\|{\bm{M}}_{1}\|\|{\bm{M}}_{2}\|, we get that

𝔼[𝒙nK2]\displaystyle\mathbb{E}\left[\left\|{\bm{x}}_{n}^{K}\right\|^{2}\right] 2𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙012]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right\|^{2}\right]
+2𝔼[(𝒛k=0K21(i=1n𝑰𝑺ii=1n𝑰𝑺ni+1)k)2]\displaystyle\quad+2\mathbb{E}\left[\left(\|{\bm{z}}\|\sum_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{i}\|\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{n-i+1}\|\right)^{k}\right)^{2}\right]
2𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙012]+2𝔼[(𝒛k=0K211)2]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right\|^{2}\right]+2\mathbb{E}\left[\left(\|{\bm{z}}\|\sum_{k=0}^{\frac{K}{2}-1}1\right)^{2}\right]
=2𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))K/2𝒙012]+K22𝔼[𝒛2].\displaystyle=2\mathbb{E}\left[\left\|\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{K/2}{\bm{x}}_{0}^{1}\right\|^{2}\right]+\frac{K^{2}}{2}\mathbb{E}\left[\|{\bm{z}}\|^{2}\right].

We handle the two terms above separately. For the first term, we have the following bound:

Lemma 3.

If α18κLmin{2,κn}\alpha\leq\frac{1}{8\kappa L}\min\left\{2,\frac{\sqrt{\kappa}}{n}\right\}, then

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)1αnμ\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\|\leq 1-\alpha n\mu

Note that K80κ3/2log(nK)max{1,κn}α18κLmin{2,κn}K\geq 80\kappa^{3/2}\log(nK)\max\left\{1,\frac{\sqrt{\kappa}}{n}\right\}\implies\alpha\leq\frac{1}{8\kappa L}\min\left\{2,\frac{\sqrt{\kappa}}{n}\right\}.

We also have the following lemma that bounds the expected squared norm of 𝒛{\bm{z}}.

Lemma 4.

If α1L\alpha\leq\frac{1}{L}, then

𝔼[𝒛2]2n2α4L2(G)2+170n5α6L4G2logn,\mathbb{E}\left[\left\|{\bm{z}}\right\|^{2}\right]\leq 2n^{2}\alpha^{4}L^{2}(G^{*})^{2}+170n^{5}\alpha^{6}L^{4}G^{2}\log n,

where G=maxi𝐛iG^{*}=\max_{i}\|{\bm{b}}_{i}\|, and the expectation is taken over the randomness of 𝐳{\bm{z}}.

Using these lemmas, we get that

𝔼[𝒙nK2]\displaystyle\mathbb{E}\left[\left\|{\bm{x}}_{n}^{K}\right\|^{2}\right] 2(1nαμ)K/2𝒙012+K2n2α4L2G2+85K2n5α6L4G2logn\displaystyle\leq 2\left(1-n\alpha\mu\right)^{K/2}\left\|{\bm{x}}_{0}^{1}\right\|^{2}+K^{2}n^{2}\alpha^{4}L^{2}G^{2}+85K^{2}n^{5}\alpha^{6}L^{4}G^{2}\log n
2e12nαμK𝒙012+K2n2α4L2G2+85K2n5α6L4G2logn.\displaystyle\leq 2e^{-\frac{1}{2}n\alpha\mu K}\left\|{\bm{x}}_{0}^{1}\right\|^{2}+K^{2}n^{2}\alpha^{4}L^{2}G^{2}+85K^{2}n^{5}\alpha^{6}L^{4}G^{2}\log n.

Substituting α=10lognKμnK\alpha=\frac{10\log nK}{\mu nK} gives us the result.

E.1 Proof of Lemma 3

We define (𝑺~1,,𝑺~n):=(𝑺1,,𝑺n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{n}):=({\bm{S}}_{1},\dots,{\bm{S}}_{n}) and (𝑺~n+1,,𝑺~2n):=(𝐒n,,𝑺1)(\widetilde{{\bm{S}}}_{n+1},\dots,\widetilde{{\bm{S}}}_{2n}):=(\mathbf{S}_{n},\dots,{\bm{S}}_{1}). Then,

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)\displaystyle\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\| =i=12n(𝑰𝑺~i)\displaystyle=\left\|\prod_{i=1}^{2n}({\bm{I}}-\widetilde{{\bm{S}}}_{i})\right\|
=𝑰i=12n𝑺~i+i<j𝑺~i𝑺~j\displaystyle=\left\|{\bm{I}}-\sum_{i=1}^{2n}\widetilde{{\bm{S}}}_{i}+\sum_{i<j}\widetilde{{\bm{S}}}_{i}\widetilde{{\bm{S}}}_{j}-\dots\right\|
𝑰i=12n𝑺~i+i<j𝑺~i𝑺~j+i<j<k𝑺~i𝑺~j𝑺~k+\displaystyle\leq\left\|{\bm{I}}-\sum_{i=1}^{2n}\widetilde{{\bm{S}}}_{i}+\sum_{i<j}\widetilde{{\bm{S}}}_{i}\widetilde{{\bm{S}}}_{j}\right\|+\left\|\sum_{i<j<k}\widetilde{{\bm{S}}}_{i}\widetilde{{\bm{S}}}_{j}\widetilde{{\bm{S}}}_{k}\right\|+\dots
𝑰i=12n𝑺~i+i<j𝑺~i𝑺~j+i<j<k𝑺~i𝑺~j𝑺~k+.\displaystyle\leq\left\|{\bm{I}}-\sum_{i=1}^{2n}\widetilde{{\bm{S}}}_{i}+\sum_{i<j}\widetilde{{\bm{S}}}_{i}\widetilde{{\bm{S}}}_{j}\right\|+\sum_{i<j<k}\left\|\widetilde{{\bm{S}}}_{i}\right\|\left\|\widetilde{{\bm{S}}}_{j}\right\|\left\|\widetilde{{\bm{S}}}_{k}\right\|+\dots.

Note that

i<j𝑺~i𝑺~j\displaystyle\sum_{i<j}\widetilde{{\bm{S}}}_{i}\widetilde{{\bm{S}}}_{j} =2ij𝑺i𝑺j+i=1n𝑺i2\displaystyle=2\sum_{i\neq j}{\bm{S}}_{i}{\bm{S}}_{j}+\sum_{i=1}^{n}{\bm{S}}_{i}^{2}
=2(i=1n𝑺i)(i=1n𝑺i)i=1n𝑺i2.\displaystyle=2\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)-\sum_{i=1}^{n}{\bm{S}}_{i}^{2}.

Substituting this and noting that i=12n𝑺~i=2i=1n𝑺i\sum_{i=1}^{2n}\widetilde{{\bm{S}}}_{i}=2\sum_{i=1}^{n}{\bm{S}}_{i}, we get

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)\displaystyle\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\| 𝑰2i=1n𝑺i+2(i=1n𝑺i)(i=1n𝑺i)i=1n𝑺i2\displaystyle\leq\left\|{\bm{I}}-2\sum_{i=1}^{n}{\bm{S}}_{i}+2\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)-\sum_{i=1}^{n}{\bm{S}}_{i}^{2}\right\|
+i<j<k𝑺~i𝑺~j𝑺~k+\displaystyle\quad+\sum_{i<j<k}\left\|\widetilde{{\bm{S}}}_{i}\right\|\left\|\widetilde{{\bm{S}}}_{j}\right\|\left\|\widetilde{{\bm{S}}}_{k}\right\|+\dots
𝑰2i=1n𝑺i+2(i=1n𝑺i)(i=1n𝑺i)+i=1n𝑺i2\displaystyle\leq\left\|{\bm{I}}-2\sum_{i=1}^{n}{\bm{S}}_{i}+2\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\right\|+\left\|\sum_{i=1}^{n}{\bm{S}}_{i}^{2}\right\|
+i<j<k𝑺~i𝑺~j𝑺~k+.\displaystyle\quad+\sum_{i<j<k}\left\|\widetilde{{\bm{S}}}_{i}\right\|\left\|\widetilde{{\bm{S}}}_{j}\right\|\left\|\widetilde{{\bm{S}}}_{k}\right\|+\dots.

Let us denote {\bm{T}}:=\sum_{i=1}^{n}{\bm{S}}_{i}. Then, we know by Assumptions 2 and 3 that {\bm{T}} has eigenvalues in [n\alpha\mu,n\alpha L]. As long as \alpha\leq\frac{1}{4nL}, every eigenvalue \lambda of {\bm{T}} lies in [n\alpha\mu,\frac{1}{4}], and since 1-2\lambda+2\lambda^{2}\leq 1-\frac{3}{2}\lambda for \lambda\in[0,\frac{1}{4}], we get that

𝑰2i=1n𝑺i+2(i=1n𝑺i)(i=1n𝑺i)\displaystyle\left\|{\bm{I}}-2\sum_{i=1}^{n}{\bm{S}}_{i}+2\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\left(\sum_{i=1}^{n}{\bm{S}}_{i}\right)\right\| =𝑰2𝑻+2𝑻2\displaystyle=\left\|{\bm{I}}-2{\bm{T}}+2{\bm{T}}^{2}\right\|
132nαμ.\displaystyle\leq 1-\frac{3}{2}n\alpha\mu.

Substituting this, we get

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)\displaystyle\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\| (132nαμ)+i=1n𝑺i2+i<j<k𝑺~i𝑺~j𝑺~k+.\displaystyle\leq\left(1-\frac{3}{2}n\alpha\mu\right)+\left\|\sum_{i=1}^{n}{\bm{S}}_{i}^{2}\right\|+\sum_{i<j<k}\left\|\widetilde{{\bm{S}}}_{i}\right\|\left\|\widetilde{{\bm{S}}}_{j}\right\|\left\|\widetilde{{\bm{S}}}_{k}\right\|+\dots.

By Assumption 2, we know that AiL\|A_{i}\|\leq L. Hence, 𝐒~iαL\|\widetilde{\mathbf{S}}_{i}\|\leq\alpha L. Hence,

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)\displaystyle\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\| (132nαμ)+nα2L2+((2n3)α3L3+(2n4)α4L4+)\displaystyle\leq\left(1-\frac{3}{2}n\alpha\mu\right)+n\alpha^{2}L^{2}+\left(\binom{2n}{3}\alpha^{3}L^{3}+\binom{2n}{4}\alpha^{4}L^{4}+\dots\right)
132nαμ+nα2L2+i=32n(2nαL)i\displaystyle\leq 1-\frac{3}{2}n\alpha\mu+n\alpha^{2}L^{2}+\sum_{i=3}^{2n}(2n\alpha L)^{i}
132nαμ+nα2L2+8n3α3L312nαL\displaystyle\leq 1-\frac{3}{2}n\alpha\mu+n\alpha^{2}L^{2}+\frac{8n^{3}\alpha^{3}L^{3}}{1-2n\alpha L}
132nαμ+nα2L2+16n3α3L3.\displaystyle\leq 1-\frac{3}{2}n\alpha\mu+n\alpha^{2}L^{2}+16n^{3}\alpha^{3}L^{3}. (Since α1/4nL\alpha\leq 1/4nL.)

Finally, as long as α14κL\alpha\leq\frac{1}{4\kappa L} and α18nLκ\alpha\leq\frac{1}{8nL\sqrt{\kappa}},

i=1n(𝑰𝑺i)i=1n(𝑰𝑺ni+1)\displaystyle\left\|\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right\| 1nαμ.\displaystyle\leq 1-n\alpha\mu.
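A numerical spot-check of Lemma 3 on random symmetric matrices whose eigenvalues lie in [\mu,L] (a special case of Assumptions 2 and 3; all constants are illustrative):

```python
import numpy as np

# Spot-check of Lemma 3: for random symmetric A_i with eigenvalues in
# [mu, L] (a special case of the assumptions) and alpha satisfying the
# lemma's condition, the forward-then-reversed product of the (I - S_i)
# factors has operator norm at most 1 - alpha * n * mu.

rng = np.random.default_rng(6)
n, d, L, mu = 6, 5, 1.0, 0.25
kappa = L / mu
alpha = min(2.0, np.sqrt(kappa) / n) / (8 * kappa * L)  # lemma's condition

ok = True
for trial in range(100):
    S = []
    for i in range(n):
        Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
        S.append(alpha * (Q @ np.diag(rng.uniform(mu, L, size=d)) @ Q.T))
    fwd = np.eye(d)
    for Si in S:                       # prod_i (I - S_i)
        fwd = fwd @ (np.eye(d) - Si)
    rev = np.eye(d)
    for Si in reversed(S):             # prod_i (I - S_{n-i+1})
        rev = rev @ (np.eye(d) - Si)
    ok = ok and np.linalg.norm(fwd @ rev, 2) <= 1 - alpha * n * mu + 1e-12
print("bound held on all trials:", ok)
```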

E.2 Proof of Lemma 4

We start off by computing the first order expansion of 𝒛{\bm{z}}. We have the following lemma for this:

Lemma 5.
𝒛=i=1n𝑺i𝒕i+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i),{\bm{z}}=\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right),

where (𝐒~1,,𝐒~n):=(𝐒1,,𝐒n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{n}):=({\bm{S}}_{1},\dots,{\bm{S}}_{n}) and (𝐒~n+1,,𝐒~2n):=(𝐒n,,𝐒1)(\widetilde{{\bm{S}}}_{n+1},\dots,\widetilde{{\bm{S}}}_{2n}):=(\mathbf{S}_{n},\dots,{\bm{S}}_{1}).

The proof of this lemma is quite algebraic and hence has been pushed to the end, in Appendix E.4.

The strategy is to bound \|{\bm{S}}_{i}{\bm{t}}_{i}\|, \|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\|, and \|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\|. Hence, we apply Lemma 5 and use the triangle inequality:

\displaystyle\mathbb{E}\left[\left\|{\bm{z}}\right\|^{2}\right] \displaystyle=\mathbb{E}\left[\left\|\sum_{i=1}^{n}{\bm{S}}_{i}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right.\right.
\displaystyle\qquad\quad\left.\left.+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{\bm{S}}_{p})\right){\bm{S}}_{l}{\bm{S}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right\|^{2}\right]
\displaystyle\leq\mathbb{E}\left[\left(\sum_{i=1}^{n}\|{\bm{S}}_{i}\|\|{\bm{t}}_{i}\|+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{p}\|\right)\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\left(\sum_{i=1}^{2n-j}\|{\bm{t}}_{i}\|\right)\right.\right.
\displaystyle\qquad\quad\left.\left.+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-{\bm{S}}_{p}\|\right)\|{\bm{S}}_{l}\|\|{\bm{S}}_{j}\|\left(\sum_{i=1}^{n-j}\|{\bm{t}}_{n+1-i}\|\right)\right)^{2}\right].

Now, we recall that \|{\bm{S}}_{i}\|\leq\alpha L and \|{\bm{t}}_{i}\|\leq\alpha G, where here we write G for G^{*}=\max_{i}\|{\bm{b}}_{i}\| to reduce clutter. Because \alpha\leq 1/L, we also get that \|{\bm{I}}-{\bm{S}}_{i}\|\leq 1. Using these,

𝔼[𝒛2]\displaystyle\mathbb{E}[\|{\bm{z}}\|^{2}] 𝔼[(i=1nα2LG+j=n+12n1l=1j1(p=1l11)αLαLi=12nj𝒕i\displaystyle\leq\mathbb{E}\left[\left(\sum_{i=1}^{n}\alpha^{2}LG+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}1\right)\alpha L\alpha L\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|\right.\right.
+j=1n1l=1j1(p=1l11)αLαLi=1nj𝒕n+1i)2]\displaystyle\qquad\quad\left.\left.+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}1\right)\alpha L\alpha L\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)^{2}\right]
𝔼[(nα2LG+2nα2L2j=n+12n1i=12nj𝒕i+nα2L2j=1n1i=1nj𝒕n+1i)2]\displaystyle\leq\mathbb{E}\left[\left(n\alpha^{2}LG+2n\alpha^{2}L^{2}\sum_{j=n+1}^{2n-1}\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|+n\alpha^{2}L^{2}\sum_{j=1}^{n-1}\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)^{2}\right]
=n2α4L2𝔼[(G+2Lj=n+12n1i=12nj𝒕i+Lj=1n1i=1nj𝒕n+1i)2]\displaystyle=n^{2}\alpha^{4}L^{2}\mathbb{E}\left[\left(G+2L\sum_{j=n+1}^{2n-1}\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|+L\sum_{j=1}^{n-1}\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)^{2}\right]
2n2α4L2(G2+L2𝔼[(2j=n+12n1i=12nj𝒕i+j=1n1i=1nj𝒕n+1i)2]).\displaystyle\leq 2n^{2}\alpha^{4}L^{2}\left(G^{2}+L^{2}\mathbb{E}\left[\left(2\sum_{j=n+1}^{2n-1}\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|+\sum_{j=1}^{n-1}\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)^{2}\right]\right). (Since (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2})

Using the Hoeffding–Serfling inequality for bounded random vectors (Schneider, 2016, Theorem 2), we get the following lemma.

Lemma 6.

For all j,l[1,n]j,l\in[1,n] we have that

𝔼[i=1j𝒕i2]18jα2(G)2log(n)\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|^{2}\right]\leq 18j\alpha^{2}(G^{*})^{2}\log(n)
𝔼[i=1j𝒕ii=1l𝒕i]18jlα2(G)2log(n),\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|\left\|\sum_{i=1}^{l}{\bm{t}}_{i}\right\|\right]\leq 18\sqrt{jl}\alpha^{2}(G^{*})^{2}\log(n),

where G=maxi𝐛iG^{*}=\max_{i}\|{\bm{b}}_{i}\|, and the expectation is taken over the randomness of 𝐭i{\bm{t}}_{i}.

Writing out the expansion of (2j=n+12n1i=12nj𝒕i+j=1n1i=1nj𝒕n+1i)2\left(2\sum_{j=n+1}^{2n-1}\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|+\sum_{j=1}^{n-1}\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)^{2} and using the lemma above on the individual terms, we get

𝔼[𝒛2]\displaystyle\mathbb{E}\left[\left\|{\bm{z}}\right\|^{2}\right] 2n2α4L2G2+2n2α4L4(90α2n3G2logn).\displaystyle\leq 2n^{2}\alpha^{4}L^{2}G^{2}+2n^{2}\alpha^{4}L^{4}(90\alpha^{2}n^{3}G^{2}\log n).

E.3 Proof of Lemma 6

This proof is similar to the one in (Ahn et al., 2020). Define G:=maxi𝒃iG^{*}:=\max_{i}\|{\bm{b}}_{i}\|. We use the following theorem adapted to our setting

Theorem 7.

[(Schneider, 2016, Theorem 2)] With probability at least 1δn1-\frac{\delta}{n},

i=1j𝒕iαG8j(1j1n)log2nδ.\displaystyle\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|\leq\alpha G^{*}\sqrt{8j\left(1-\frac{j-1}{n}\right)\log\frac{2n}{\delta}}.

Then taking a union bound over j=1,,nj=1,\dots,n, we get that with probability at least 1δ1-\delta,

j[1,n]:i=1j𝒕iαG8j(1j1n)log2nδαG8jlog2nδ.\displaystyle\forall j\in[1,n]:\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|\leq\alpha G^{*}\sqrt{8j\left(1-\frac{j-1}{n}\right)\log\frac{2n}{\delta}}\leq\alpha G^{*}\sqrt{8j\log\frac{2n}{\delta}}.

Then, for the complementary event (which happens with probability δ\delta), we use the fact that 𝒕i=α𝒃σiαG\|{\bm{t}}_{i}\|=\|\alpha{\bm{b}}_{\sigma_{i}}\|\leq\alpha G^{*} to get the following:

j[1,n]:i=1j𝒕ii=1j𝒕iαGj.\forall j\in[1,n]:\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|\leq\sum_{i=1}^{j}\left\|{\bm{t}}_{i}\right\|\leq\alpha G^{*}j.

Now, choose δ=1/n\delta=1/n. Then, we get that

𝔼[i=1j𝒕i2]\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|^{2}\right] (11n)8jα2G2log(2n2)+1n(αGj)2\displaystyle\leq\left(1-\frac{1}{n}\right)8j\alpha^{2}G^{2}\log(2n^{2})+\frac{1}{n}(\alpha G^{*}j)^{2}
18jα2(G)2logn.\displaystyle\leq 18j\alpha^{2}(G^{*})^{2}\log n.

Similarly, we can also get

𝔼[i=1j𝒕ii=1l𝒕i]18jlα2(G)2log(n).\displaystyle\mathbb{E}\left[\left\|\sum_{i=1}^{j}{\bm{t}}_{i}\right\|\left\|\sum_{i=1}^{l}{\bm{t}}_{i}\right\|\right]\leq 18\sqrt{jl}\alpha^{2}(G^{*})^{2}\log(n).
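The first bound can be checked empirically; in the sketch below the vectors {\bm{b}}_{i} are random draws forced to sum to zero, and the constants are illustrative:

```python
import numpy as np

# Empirical check of Lemma 6's first bound: prefix sums of a randomly
# permuted zero-sum collection concentrate, and the empirical
# E||sum_{i<=j} t_i||^2 stays below 18 * j * alpha^2 * (G*)^2 * log(n)
# for these (assumed) random draws.

rng = np.random.default_rng(7)
n, d, alpha, trials = 50, 3, 0.01, 2000
b = rng.normal(size=(n, d))
b -= b.mean(axis=0)                      # enforce sum_i b_i = 0
G_star = np.linalg.norm(b, axis=1).max()

j = n // 4
acc = 0.0
for _ in range(trials):
    t = alpha * b[rng.permutation(n)]
    acc += float(np.sum(t[:j].sum(axis=0) ** 2))
empirical = acc / trials
bound = 18 * j * alpha**2 * G_star**2 * np.log(n)
print(f"E||prefix sum||^2 ~ {empirical:.3e}  <=  bound {bound:.3e}")
```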

E.4 Proof of Lemma 5

As in the lemma’s statement, define (𝑺~1,,𝑺~2n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{2n}) as (𝑺~1,,𝑺~n)=(𝑺1,,𝑺n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{n})=({\bm{S}}_{1},\dots,{\bm{S}}_{n}) and (𝑺~n+1,,𝑺~2n)=(𝑺n,,𝑺1)(\widetilde{{\bm{S}}}_{n+1},\dots,\widetilde{{\bm{S}}}_{2n})=({\bm{S}}_{n},\dots,{\bm{S}}_{1}). As a reminder, the term 𝒛{\bm{z}} is defined as follows:

𝒛:=i=1n(j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))𝒕i+i=1n(j=1ni(𝑰𝑺j))𝒕n+1i.\displaystyle{\bm{z}}:=\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i}. (22)

First, we analyze the first term in 𝒛{\bm{z}}. Towards that end, we start by expanding (j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right):

(j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))\displaystyle\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right) =(j=12ni(𝑰𝑺~j))\displaystyle=\left(\prod_{j=1}^{2n-i}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right) (23)
=(j=12ni1(𝑰𝑺~j))(𝑰𝑺~2ni)\displaystyle=\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\left({\bm{I}}-\widetilde{{\bm{S}}}_{2n-i}\right)
=(j=12ni1(𝑰𝑺~j))(j=12ni1(𝑰𝑺~j))𝑺~2ni.\displaystyle=\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)-\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i}.

Similarly, we expand the term \left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right) and keep doing so recursively to get the following:

(j=1n(𝑰𝑺j))\displaystyle\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right) (j=1ni(𝑰𝑺n+1j))=(j=12ni1(𝑰𝑺~j))(j=12ni1(𝑰𝑺~j))𝑺~2ni\displaystyle\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right)=\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)-\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i}
=(j=12ni2(𝑰𝑺~j))(j=12ni2(𝑰𝑺~j))𝑺~2ni1(j=12ni1(𝑰𝑺~j))𝑺~2ni\displaystyle=\left(\prod_{j=1}^{2n-i-2}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)-\left(\prod_{j=1}^{2n-i-2}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i-1}-\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i}
=\displaystyle=\qquad\vdots
=(𝑰𝑺~1)(𝑰𝑺~1)𝑺~2\displaystyle=({\bm{I}}-\widetilde{{\bm{S}}}_{1})-({\bm{I}}-\widetilde{{\bm{S}}}_{1})\widetilde{{\bm{S}}}_{2}-\ldots
(j=12ni2(𝑰𝑺~j))𝑺~2ni1(j=12ni1(𝑰𝑺~j))𝑺~2ni\displaystyle\quad-\left(\prod_{j=1}^{2n-i-2}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i-1}-\left(\prod_{j=1}^{2n-i-1}({\bm{I}}-\widetilde{{\bm{S}}}_{j})\right)\widetilde{{\bm{S}}}_{2n-i}
=𝑰j=12ni(l=1j1(𝑰𝑺~l))𝑺~j.\displaystyle={\bm{I}}-\sum_{j=1}^{2n-i}\left(\prod_{l=1}^{j-1}({\bm{I}}-\widetilde{{\bm{S}}}_{l})\right)\widetilde{{\bm{S}}}_{j}.

Note that the term l=1j1(𝑰𝑺~l)\prod_{l=1}^{j-1}({\bm{I}}-\widetilde{{\bm{S}}}_{l}) above is similar to the RHS of Eq. (23). Hence, we repeat the process on this term to get the following:

(j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))\displaystyle\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right) =𝑰j=12ni(𝑰l=1j1(p=1l1(𝑰𝑺~p))𝑺~l)𝑺~j\displaystyle={\bm{I}}-\sum_{j=1}^{2n-i}\left({\bm{I}}-\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\right)\widetilde{{\bm{S}}}_{j}
=𝑰j=12ni𝑺~j+j=12nil=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j.\displaystyle={\bm{I}}-\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}+\sum_{j=1}^{2n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}. (24)
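As a quick numerical sanity check of Eq. (24) (illustrative only; the matrices and dimensions below are arbitrary), the following Python snippet verifies the second-order expansion of a left-to-right product of factors 𝑰𝑺~j{\bm{I}}-\widetilde{{\bm{S}}}_{j}.

```python
import numpy as np

# Illustrative check of the expansion in Eq. (24): for a left-to-right product,
# prod_{j<=m} (I - S_j) = I - sum_j S_j + sum_j sum_{l<j} (prod_{p<l}(I - S_p)) S_l S_j.
rng = np.random.default_rng(1)
d, m = 4, 7
S = []
for _ in range(m):
    B = rng.normal(size=(d, d))
    S.append(0.05 * (B @ B.T))            # symmetric PSD, mimicking alpha * A_i
I = np.eye(d)

def lprod(mats):
    """Left-to-right matrix product; the empty product is the identity."""
    out = I.copy()
    for M in mats:
        out = out @ M
    return out

lhs = lprod([I - Sj for Sj in S])
rhs = I - sum(S)
for j in range(m):                        # 0-based indices for the 1-based sums
    for l in range(j):
        rhs += lprod([I - S[p] for p in range(l)]) @ S[l] @ S[j]
np.testing.assert_allclose(lhs, rhs, atol=1e-12)
```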

Using this in the first term i=1n(j=1n(𝑰𝑺j))(j=1ni(𝑰𝑺n+1j))𝒕i\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i} (in Eq. (22)), we get

i=1n(j=1n(𝑰𝑺j))\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right) (j=1ni(𝑰𝑺n+1j))𝒕i=i=1n(𝑰j=12ni𝑺~j+j=12nil=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝒕i\displaystyle\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}=\sum_{i=1}^{n}\left({\bm{I}}-\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}+\sum_{j=1}^{2n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right){\bm{t}}_{i}
=i=1n𝒕ii=1nj=12ni𝑺~j𝒕i+i=1nj=12nil=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j𝒕i.\displaystyle=\sum_{i=1}^{n}{\bm{t}}_{i}-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}.

Now, we use the fact that i=1n𝒃i=𝟎\sum_{i=1}^{n}{\bm{b}}_{i}={\bm{0}} (Eq. (19)) to get that i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}}. Then,

i=1n(j=1n(𝑰𝑺j))\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right) (j=1ni(𝑰𝑺n+1j))𝒕i=i=1nj=12ni𝑺~j𝒕i+i=1nj=12nil=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j𝒕i\displaystyle\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}

For convenience, we define 𝑴~j:=l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j\widetilde{{\bm{M}}}_{j}:=\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}. Then,

i=1n(j=1n(𝑰𝑺j))\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right) (j=1ni(𝑰𝑺n+1j))𝒕i=i=1nj=12ni𝑺~j𝒕i+i=1nj=12ni𝑴~j𝒕i\displaystyle\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}
=i=1nj=12ni𝑺~j𝒕i+i=1nj=1n𝑴~j𝒕i+i=1nj=n+12ni𝑴~j𝒕i\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=1}^{n}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}
=i=1nj=12ni𝑺~j𝒕i+j=1n𝑴~ji=1n𝒕i+i=1nj=n+12ni𝑴~j𝒕i\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=1}^{n}\widetilde{{\bm{M}}}_{j}\sum_{i=1}^{n}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}
=i=1nj=12ni𝑺~j𝒕i+i=1nj=n+12ni𝑴~j𝒕i.\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}. (Since i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}})

Note that i=1nj=n+12ni𝑴~j𝒕i=j=n+12n1𝑴~j(i=12nj𝒕i)\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{M}}}_{j}{\bm{t}}_{i}=\sum_{j=n+1}^{2n-1}\widetilde{{\bm{M}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right). Hence,

i=1n(j=1n(𝑰𝑺j))\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j})\right) (j=1ni(𝑰𝑺n+1j))𝒕i=i=1nj=12ni𝑺~j𝒕i+j=n+12n1𝑴~j(i=12nj𝒕i)\displaystyle\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j})\right){\bm{t}}_{i}=-\sum_{i=1}^{n}\sum_{j=1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\widetilde{{\bm{M}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)
=i=1nj=1n𝑺~j𝒕ii=1nj=n+12ni𝑺~j𝒕i+j=n+12n1𝑴~j(i=12nj𝒕i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{n}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}-\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\widetilde{{\bm{M}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)
=j=1n𝑺~ji=1n𝒕ii=1nj=n+12ni𝑺~j𝒕i+j=n+12n1𝑴~j(i=12nj𝒕i)\displaystyle=-\sum_{j=1}^{n}\widetilde{{\bm{S}}}_{j}\sum_{i=1}^{n}{\bm{t}}_{i}-\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\widetilde{{\bm{M}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)
=i=1nj=n+12ni𝑺~j𝒕i+j=n+12n1𝑴~j(i=12nj𝒕i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\widetilde{{\bm{M}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right) (Since i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}})
=i=1nj=n+12ni𝑺~j𝒕i+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=n+1}^{2n-i}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)
=i=1nj=i+1n𝑺j𝒕i+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i).\displaystyle=-\sum_{i=1}^{n}\sum_{j=i+1}^{n}{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right). (25)

Next, we analyze the second term in 𝒛{\bm{z}}. For this, we start by expanding j=1ni(𝑰𝑺j)\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}) similarly to Eq. (24):

j=1ni(𝑰𝑺j)\displaystyle\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}) =𝑰j=1ni𝑺j+j=1nil=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j\displaystyle={\bm{I}}-\sum_{j=1}^{n-i}{{\bm{S}}}_{j}+\sum_{j=1}^{n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{\bm{S}}_{j} (26)

Using this, we get

i=1n(j=1ni(𝑰𝑺j))\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right) 𝒕n+1i=i=1n(𝑰j=1ni𝑺j+j=1nil=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j)𝒕n+1i\displaystyle{\bm{t}}_{n+1-i}=\sum_{i=1}^{n}\left({\bm{I}}-\sum_{j=1}^{n-i}{{\bm{S}}}_{j}+\sum_{j=1}^{n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\right){\bm{t}}_{n+1-i}
=i=1n𝒕n+1ii=1nj=1ni𝑺j𝒕n+1i+i=1nj=1nil=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j𝒕n+1i\displaystyle=\sum_{i=1}^{n}{\bm{t}}_{n+1-i}-\sum_{i=1}^{n}\sum_{j=1}^{n-i}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}+\sum_{i=1}^{n}\sum_{j=1}^{n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}
=i=1nj=1ni𝑺j𝒕n+1i+i=1nj=1nil=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j𝒕n+1i,\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{n-i}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}+\sum_{i=1}^{n}\sum_{j=1}^{n-i}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i},

where we used the fact that i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}} in the last equality. For convenience, we define 𝑴j:=l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j{\bm{M}}_{j}:=\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}. Then,

i=1n(j=1ni(𝑰𝑺j))𝒕n+1i\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i} =i=1nj=1ni𝑺j𝒕n+1i+i=1nj=1ni𝑴j𝒕n+1i.\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{n-i}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}+\sum_{i=1}^{n}\sum_{j=1}^{n-i}{\bm{M}}_{j}{\bm{t}}_{n+1-i}.

Since i=1nj=1ni𝑴j𝒕n+1i=j=1n1𝑴j(i=1nj𝒕n+1i)\sum_{i=1}^{n}\sum_{j=1}^{n-i}{\bm{M}}_{j}{\bm{t}}_{n+1-i}=\sum_{j=1}^{n-1}{\bm{M}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right), we get

i=1n(j=1ni(𝑰𝑺j))𝒕n+1i\displaystyle\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j})\right){\bm{t}}_{n+1-i} =i=1nj=1ni𝑺j𝒕n+1i+j=1n1𝑴j(i=1nj𝒕n+1i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{n-i}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}+\sum_{j=1}^{n-1}{\bm{M}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=l=1nj=1l1𝑺j𝒕l+j=1n1𝑴j(i=1nj𝒕n+1i)\displaystyle=-\sum_{l=1}^{n}\sum_{j=1}^{l-1}{{\bm{S}}}_{j}{\bm{t}}_{l}+\sum_{j=1}^{n-1}{\bm{M}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=i=1nj=1i1𝑺j𝒕i+j=1n1𝑴j(i=1nj𝒕n+1i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{i-1}{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=1}^{n-1}{\bm{M}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=i=1nj=1i1𝑺j𝒕i+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i).\displaystyle=-\sum_{i=1}^{n}\sum_{j=1}^{i-1}{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right). (27)

Finally, substituting Eq. (25) and (27) in the definition of 𝒛{\bm{z}} (Eq. (22)), we get

𝒛\displaystyle{\bm{z}} =i=1nj=i+1n𝑺j𝒕i+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)\displaystyle=-\sum_{i=1}^{n}\sum_{j=i+1}^{n}{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)
i=1nj=1i1𝑺j𝒕i+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)\displaystyle\quad-\sum_{i=1}^{n}\sum_{j=1}^{i-1}{{\bm{S}}}_{j}{\bm{t}}_{i}+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=i=1n(j=1i1𝑺j+j=i+1n𝑺j)𝒕i\displaystyle=-\sum_{i=1}^{n}\left(\sum_{j=1}^{i-1}{{\bm{S}}}_{j}+\sum_{j=i+1}^{n}{{\bm{S}}}_{j}\right){\bm{t}}_{i}
+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)\displaystyle\quad+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=i=1n(j=1n𝑺j)𝒕i+i=1n𝑺i𝒕i\displaystyle=-\sum_{i=1}^{n}\left(\sum_{j=1}^{n}{{\bm{S}}}_{j}\right){\bm{t}}_{i}+\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}
+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)\displaystyle\quad+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=(j=1n𝑺j)i=1n𝒕i+i=1n𝑺i𝒕i\displaystyle=-\left(\sum_{j=1}^{n}{{\bm{S}}}_{j}\right)\sum_{i=1}^{n}{\bm{t}}_{i}+\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}
+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)\displaystyle\quad+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)
=i=1n𝑺i𝒕i+j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)+j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i),\displaystyle=\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right),

where we used the fact that i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}} in the last equality.
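To make the chain of manipulations above easy to audit, here is a small Python sketch (illustrative only; random data standing in for the 𝑺i{\bm{S}}_{i} and 𝒕i{\bm{t}}_{i}) that evaluates 𝒛{\bm{z}} directly from its definition in Eq. (22) and from the final expression just derived, and checks that the two agree when i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}}.

```python
import numpy as np

# Illustrative check of the identity proved above: compute z from Eq. (22)
# and from the final three-term expression, and verify that they agree.
rng = np.random.default_rng(6)
n, d = 6, 3
t = rng.normal(size=(n, d))
t -= t.mean(axis=0)                                  # sum_i t_i = 0
S = []
for _ in range(n):
    B = rng.normal(size=(d, d))
    S.append(0.05 * (B @ B.T))
St = S + S[::-1]                                     # (S~_1, ..., S~_2n)
I = np.eye(d)

def lprod(mats):
    out = I.copy()
    for M in mats:
        out = out @ M
    return out

# z as defined in Eq. (22); loop indices below are 1-based as in the text.
full = lprod([I - Sj for Sj in S])
z = sum(full @ lprod([I - S[n - j] for j in range(1, n - i + 1)]) @ t[i - 1]
        for i in range(1, n + 1))
z = z + sum(lprod([I - S[j - 1] for j in range(1, n - i + 1)]) @ t[n - i]
            for i in range(1, n + 1))

# The final three-term expression from the last display of this proof.
rhs = sum(S[i] @ t[i] for i in range(n))
for j in range(n + 1, 2 * n):
    for l in range(1, j):
        rhs = rhs + lprod([I - St[p - 1] for p in range(1, l)]) \
            @ St[l - 1] @ St[j - 1] @ t[:2 * n - j].sum(axis=0)
for j in range(1, n):
    for l in range(1, j):
        rhs = rhs + lprod([I - S[p - 1] for p in range(1, l)]) \
            @ S[l - 1] @ S[j - 1] @ t[j:].sum(axis=0)
np.testing.assert_allclose(z, rhs, atol=1e-10)
```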

Appendix F Proof of Theorem 5

The proof is similar to that of Theorem 4, except that here we leverage the independence of the random permutations across epoch pairs. The setup is the same as in Theorem 4, but we repeat it here for completeness.

Let F(𝒙):=1ni=1nfi(𝒙)F({\bm{x}}):=\frac{1}{n}\sum_{i=1}^{n}f_{i}({\bm{x}}) be such that its minimizer is at the origin. This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019). Since the fif_{i} are convex quadratics, we can write them as fi(𝒙)=12𝒙𝑨i𝒙𝒃i𝒙+cif_{i}({\bm{x}})=\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{i}{\bm{x}}-{\bm{b}}_{i}^{\top}{\bm{x}}+c_{i}, where 𝑨i{\bm{A}}_{i} are symmetric, positive-semidefinite matrices. We can omit the constants cic_{i} because they do not affect the minimizer or the gradients. Because we assume that the minimizer of F(𝒙)F({\bm{x}}) is at the origin, we get that

i=1n𝒃i=𝟎.\displaystyle\sum_{i=1}^{n}{\bm{b}}_{i}={\bm{0}}. (28)

Let σk=(σ1k,σ2k,,σnk)\sigma^{k}=(\sigma^{k}_{1},\sigma^{k}_{2},\dots,\sigma^{k}_{n}) be the random permutation of (1,2,,n)(1,2,\dots,n) sampled in epoch 2k12k-1. Then epoch 2k12k-1 sees the nn functions in the sequence:

(12𝒙𝑨σ1k𝒙𝒃σ1k𝒙,12𝒙𝑨σ2k𝒙𝒃σ2k𝒙,,12𝒙𝑨σnk𝒙𝒃σnk𝒙),\left(\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{1}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{1}^{k}}^{\top}{\bm{x}},\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{2}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{2}^{k}}^{\top}{\bm{x}},\dots,\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{n}^{k}}^{\top}{\bm{x}}\right),

whereas epoch 2k2k sees the nn functions in the reverse sequence:

(12𝒙𝑨σnk𝒙𝒃σnk𝒙,12𝒙𝑨σn1k𝒙𝒃σn1k𝒙,,12𝒙𝑨σ1k𝒙𝒃σ1k𝒙).\left(\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{n}^{k}}^{\top}{\bm{x}},\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{n-1}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{n-1}^{k}}^{\top}{\bm{x}},\dots,\frac{1}{2}{\bm{x}}^{\top}{\bm{A}}_{\sigma_{1}^{k}}{\bm{x}}-{\bm{b}}_{\sigma_{1}^{k}}^{\top}{\bm{x}}\right).

We define 𝑺ik:=α𝑨σik{\bm{S}}_{i}^{k}:=\alpha{\bm{A}}_{\sigma_{i}^{k}} and 𝒕ik:=α𝒃σik{\bm{t}}_{i}^{k}:=\alpha{\bm{b}}_{\sigma_{i}^{k}} for convenience of notation. We start off by computing the progress made during an even-indexed epoch. Since the even epochs use the reverse permutation of σk\sigma^{k}, we get

𝒙02k+1\displaystyle{\bm{x}}_{0}^{2k+1} =𝒙n2k\displaystyle={\bm{x}}_{n}^{2k}
=𝒙n12kα(𝑨σ1k𝒙n12k𝒃σ1k)\displaystyle={\bm{x}}^{2k}_{n-1}-\alpha\left({\bm{A}}_{\sigma_{1}^{k}}{\bm{x}}_{n-1}^{2k}-{\bm{b}}_{\sigma_{1}^{k}}\right) (fσ1kf_{\sigma_{1}^{k}} used at the last iteration of epoch 2k2k.)
=(𝑰α𝑨σ1k)𝒙n12k+α𝒃σ1k\displaystyle=\left({\bm{I}}-\alpha{\bm{A}}_{\sigma_{1}^{k}}\right){\bm{x}}^{2k}_{n-1}+\alpha{\bm{b}}_{\sigma_{1}^{k}}
=(𝑰𝑺1k)𝒙n12k+𝒕1k.\displaystyle=({\bm{I}}-{\bm{S}}_{1}^{k}){\bm{x}}_{n-1}^{2k}+{\bm{t}}_{1}^{k}.

We recursively apply the same procedure as above to the whole epoch to get the following

𝒙02k+1\displaystyle{\bm{x}}_{0}^{2k+1} =(𝑰𝑺1k)𝒙n12k+𝒕1k\displaystyle=({\bm{I}}-{\bm{S}}_{1}^{k}){\bm{x}}_{n-1}^{2k}+{\bm{t}}_{1}^{k}
=(𝑰𝑺1k)((𝑰𝑺2k)𝒙n22k+𝒕2k)+𝒕1k\displaystyle=({\bm{I}}-{\bm{S}}_{1}^{k})\left(({\bm{I}}-{\bm{S}}_{2}^{k}){\bm{x}}_{n-2}^{2k}+{\bm{t}}_{2}^{k}\right)+{\bm{t}}_{1}^{k}
=(𝑰𝑺1k)(𝑰𝑺2k)𝒙n22k+(𝑰𝑺1k)𝒕2k+𝒕1k\displaystyle=({\bm{I}}-{\bm{S}}_{1}^{k})({\bm{I}}-{\bm{S}}_{2}^{k}){\bm{x}}_{n-2}^{2k}+({\bm{I}}-{\bm{S}}_{1}^{k}){\bm{t}}_{2}^{k}+{\bm{t}}_{1}^{k}
\displaystyle\qquad\vdots
=(i=1n(𝑰𝑺ik))𝒙02k+i=1n(j=1ni(𝑰𝑺jk))𝒕n+1ik,\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{k})\right){\bm{x}}_{0}^{2k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}^{k})\right){\bm{t}}_{n+1-i}^{k}, (29)

where the product of matrices {𝑴i}\{{\bm{M}}_{i}\} is defined as i=lm𝑴i=𝑴l𝑴l+1𝑴m\prod_{i=l}^{m}{\bm{M}}_{i}={\bm{M}}_{l}{\bm{M}}_{l+1}\dots\bm{M}_{m} if mlm\geq l and 𝑰{\bm{I}} otherwise. Similar to Eq. (29), we can compute the progress made during an odd-indexed epoch. Recall that the only difference is that the odd-indexed epochs see the functions in the order (σ1k,σ2k,,σnk)(\sigma_{1}^{k},\sigma_{2}^{k},\dots,\sigma_{n}^{k}) instead of (σnk,σn1k,,σ1k)(\sigma_{n}^{k},\sigma_{n-1}^{k},\dots,\sigma_{1}^{k}). After doing the computation, we get the following equation:

𝒙02k=(i=1n(𝑰𝑺ni+1k))𝒙02k1+i=1n(j=1ni(𝑰𝑺nj+1k))𝒕ik.\displaystyle{\bm{x}}_{0}^{2k}=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{k})\right){\bm{x}}_{0}^{2k-1}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n-j+1}^{k})\right){\bm{t}}_{i}^{k}.
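As a sanity check of the two epoch maps above, the following Python sketch (illustrative only; the quadratics are randomly generated) simulates one FlipFlop epoch pair, a permutation σ\sigma followed by its reverse, and verifies that the resulting iterate matches the corresponding affine map.

```python
import numpy as np

# Illustrative check: run SGD on f_i(x) = 0.5 x^T A_i x - b_i^T x through one
# forward epoch (order sigma) and one reversed epoch, and compare against the
# affine epoch map x -> P x + v implied by the recursions above.
rng = np.random.default_rng(2)
n, d, alpha = 8, 3, 0.01
A = []
for _ in range(n):
    B = rng.normal(size=(d, d))
    A.append(B @ B.T)                     # symmetric PSD component Hessians
b = rng.normal(size=(n, d))
b -= b.mean(axis=0)                       # sum_i b_i = 0, minimizer at origin

sigma = rng.permutation(n)
steps = list(sigma) + list(sigma[::-1])   # epoch 2k-1 forward, epoch 2k reversed

x0 = rng.normal(size=d)
x = x0.copy()
for idx in steps:
    x = x - alpha * (A[idx] @ x - b[idx])           # one SGD step

P, v = np.eye(d), np.zeros(d)
for idx in steps:
    M = np.eye(d) - alpha * A[idx]        # the factor I - S for this step
    P, v = M @ P, M @ v + alpha * b[idx]  # later factors multiply on the left
np.testing.assert_allclose(x, P @ x0 + v, atol=1e-10)
```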

Combining the results above, we can get the total progress made after the pair of epochs 2k12k-1 and 2k2k:

𝒙02k+1\displaystyle{\bm{x}}_{0}^{2k+1} =(i=1n(𝑰𝑺ik))𝒙02k+i=1n(j=1ni(𝑰𝑺jk))𝒕n+1ik\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{k})\right){\bm{x}}_{0}^{2k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}^{k})\right){\bm{t}}_{n+1-i}^{k}
=(i=1n(𝑰𝑺ik))(i=1n(𝑰𝑺ni+1k))𝒙02k1+(j=1n(𝑰𝑺jk))i=1n(j=1ni(𝑰𝑺n+1jk))𝒕ik\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{k})\right){\bm{x}}_{0}^{2k-1}+\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j}^{k})\right)\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j}^{k})\right){\bm{t}}_{i}^{k}
+i=1n(j=1ni(𝑰𝑺jk))𝒕n+1ik.\displaystyle\quad+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}^{k})\right){\bm{t}}_{n+1-i}^{k}. (30)

In the sum above, the first term decays exponentially, so we need to control the next two terms. Similar to Theorem 4, we denote the sum of these two terms by 𝒛k{\bm{z}}^{k}:

𝒛k\displaystyle{\bm{z}}^{k} :=(j=1n(𝑰𝑺jk))i=1n(j=1ni(𝑰𝑺n+1jk))𝒕ik+i=1n(j=1ni(𝑰𝑺jk))𝒕n+1ik\displaystyle:=\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j}^{k})\right)\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j}^{k})\right){\bm{t}}_{i}^{k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}^{k})\right){\bm{t}}_{n+1-i}^{k}
=i=1n(j=1n(𝑰𝑺jk))(j=1ni(𝑰𝑺n+1jk))𝒕ik+i=1n(j=1ni(𝑰𝑺jk))𝒕n+1ik.\displaystyle=\sum_{i=1}^{n}\left(\prod_{j=1}^{n}({\bm{I}}-{\bm{S}}_{j}^{k})\right)\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{n+1-j}^{k})\right){\bm{t}}_{i}^{k}+\sum_{i=1}^{n}\left(\prod_{j=1}^{n-i}({\bm{I}}-{\bm{S}}_{j}^{k})\right){\bm{t}}_{n+1-i}^{k}.

To see where the iterates end up after KK epochs, we simply set 2k=K2k=K in Eq. (30) and then apply the equation recursively over the preceding epochs. Then, we get

𝒙nK=𝒙0K+1\displaystyle{\bm{x}}_{n}^{K}={\bm{x}}_{0}^{K+1} =(i=1n(𝑰𝑺iK2))(i=1n(𝑰𝑺ni+1K2))𝒙0K1+𝒛K2\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}})\right){\bm{x}}_{0}^{K-1}+{\bm{z}}^{\frac{K}{2}}
=((i=1n(𝑰𝑺iK2))(i=1n(𝑰𝑺ni+1K2)))((i=1n(𝑰𝑺iK21))(i=1n(𝑰𝑺ni+1K21)))𝒙0K3\displaystyle=\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}})\right)\right)\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-1})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-1})\right)\right){\bm{x}}_{0}^{K-3}
+(i=1n(𝑰𝑺iK2))(i=1n(𝑰𝑺ni+1K2))𝒛K21+𝒛K2\displaystyle\quad+\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}})\right){\bm{z}}^{\frac{K}{2}-1}+{\bm{z}}^{\frac{K}{2}}
\displaystyle\qquad\vdots
=(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙01\displaystyle=\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}
+k=0K21(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k.\displaystyle\quad+\sum_{k=0}^{\frac{K}{2}-1}\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}.

Taking squared norms and expectations on both sides, we get

𝔼[𝒙nK2]\displaystyle\mathbb{E}[\|{\bm{x}}_{n}^{K}\|^{2}] =𝔼[(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙01\displaystyle=\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}\right.\right.
+k=0K21(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k2]\displaystyle\qquad\quad+\left.\left.\sum_{k=0}^{\frac{K}{2}-1}\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right]
2𝔼[(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙012]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}\right\|^{2}\right]
+2𝔼[k=0K21(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k2],\displaystyle\quad+2\mathbb{E}\left[\left\|\sum_{k=0}^{\frac{K}{2}-1}\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right], (31)

where we used the fact that (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}. Next, we expand the second term above to get

𝔼[𝒙nK2]\displaystyle\mathbb{E}[\|{\bm{x}}_{n}^{K}\|^{2}] 2𝔼[(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙012]\displaystyle\leq 2\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}\right\|^{2}\right]
+2k=0K21𝔼[(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k2]\displaystyle\quad+2\sum_{k=0}^{\frac{K}{2}-1}\mathbb{E}\left[\left\|\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right]
+40k<kK21𝔼[(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\quad+4\sum_{0\leq k^{\prime}<k\leq\frac{K}{2}-1}\mathbb{E}\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k].\displaystyle\quad\qquad\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]. (32)

We handle each of the three terms separately. Let 𝑵k:=(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)){\bm{N}}_{k}:=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right). Then the first term can be written as:

𝔼[(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙012]=𝔼[(k=0K21𝑵k)𝒙012]\displaystyle\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}\right\|^{2}\right]=\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right\|^{2}\right]
=𝔼[(𝒙01)(k=0K21𝑵k)(k=0K21𝑵k)𝒙01]\displaystyle\qquad\qquad=\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=0}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}\left(\prod_{k=0}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right]
=𝔼[(𝒙01)(k=1K21𝑵k)𝑵0𝑵0(k=1K21𝑵k)𝒙01]\displaystyle\qquad\qquad=\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}{\bm{N}}_{0}^{\top}{\bm{N}}_{0}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right]
=𝔼[(𝒙01)(k=1K21𝑵k)𝔼[𝑵0𝑵0](k=1K21𝑵k)𝒙01],\displaystyle\qquad\qquad=\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}\mathbb{E}\left[{\bm{N}}_{0}^{\top}{\bm{N}}_{0}\right]\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right],

where, in the last line, we used the fact that the permutations in every epoch are independent of the permutations in other epochs. Next, we have the following lemma that bounds the spectral norm of 𝔼[𝑵k𝑵k]\mathbb{E}\left[{\bm{N}}_{k}^{\top}{\bm{N}}_{k}\right] for any k[0,K21]k\in[0,\frac{K}{2}-1]:

Lemma 7.

For any 0α316nLmin{1,nκ}0\leq\alpha\leq\frac{3}{16nL}\min\{1,\frac{n}{\kappa}\},

𝔼[𝑵k𝑵k]1αnμ\displaystyle\left\|\mathbb{E}[{\bm{N}}_{k}^{\top}{\bm{N}}_{k}]\right\|\leq 1-\alpha n\mu

Note that K55κlog(nK)max{1,nκ}α316nLmin{1,nκ}K\geq 55\kappa\log(nK)\max\left\{1,\sqrt{\frac{n}{\kappa}}\right\}\implies\alpha\leq\frac{3}{16nL}\min\{1,\frac{n}{\kappa}\}, and thus the hypothesis of the lemma is satisfied. Using it, we get

𝔼[(k=0K21(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝒙012]\displaystyle\mathbb{E}\left[\left\|\left(\prod_{k=0}^{\frac{K}{2}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right){\bm{x}}_{0}^{1}\right\|^{2}\right]
=𝔼[(𝒙01)(k=1K21𝑵k)𝔼[𝑵0𝑵0](k=1K21𝑵k)𝒙01]\displaystyle\qquad\qquad\qquad\qquad=\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}\mathbb{E}\left[{\bm{N}}_{0}^{\top}{\bm{N}}_{0}\right]\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right]
(1nαμ)𝔼[(𝒙01)(k=1K21𝑵k)(k=1K21𝑵k)𝒙01]\displaystyle\qquad\qquad\qquad\qquad\leq(1-n\alpha\mu)\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}\left(\prod_{k=1}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right]
=(1nαμ)𝔼[(𝒙01)(k=2K21𝑵k)𝔼[𝑵1𝑵1](k=2K21𝑵k)𝒙01]\displaystyle\qquad\qquad\qquad\qquad=(1-n\alpha\mu)\mathbb{E}\left[({\bm{x}}_{0}^{1})^{\top}\left(\prod_{k=2}^{\frac{K}{2}-1}{\bm{N}}_{k}\right)^{\top}\mathbb{E}\left[{\bm{N}}_{1}^{\top}{\bm{N}}_{1}\right]\left(\prod_{k=2}^{\frac{K}{2}-1}{\bm{N}}_{k}\right){\bm{x}}_{0}^{1}\right]
\displaystyle\qquad\qquad\qquad\qquad\quad\vdots
(1nαμ)K/2𝒙012.\displaystyle\qquad\qquad\qquad\qquad\leq(1-n\alpha\mu)^{K/2}\|{\bm{x}}_{0}^{1}\|^{2}. (33)

Next we look at the second term in Ineq. (32):

𝔼[(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k2]\displaystyle\mathbb{E}\left[\left\|\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right]
𝔼[(l=0k1(i=1n𝑰𝑺iK2l2)(i=1n𝑰𝑺ni+1K2l2))𝒛K2k2]\displaystyle\quad\leq\mathbb{E}\left[\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l}\|^{2}\right)\left(\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l}\|^{2}\right)\right)\|{\bm{z}}^{\frac{K}{2}-k}\|^{2}\right]
𝔼[𝒛K2k2],\displaystyle\quad\leq\mathbb{E}\left[\left\|{\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right],

where in the last step we used that 𝑰𝑺ik1\|{\bm{I}}-{\bm{S}}_{i}^{k}\|\leq 1 for all i,ki,k. To see why this is true, recall that 𝑺ik=α𝑨σik{\bm{S}}_{i}^{k}=\alpha{\bm{A}}_{\sigma_{i}^{k}}. Further, by Assumption 2, 𝑨σikL\|{\bm{A}}_{\sigma_{i}^{k}}\|\leq L; since the 𝑨σik{\bm{A}}_{\sigma_{i}^{k}} are also positive semidefinite, as long as α1/L\alpha\leq 1/L we have that 𝑰𝑺ik1\|{\bm{I}}-{\bm{S}}_{i}^{k}\|\leq 1.

Next, note that for any kk we can apply Lemma 4 to 𝔼[𝒛K2k2]\mathbb{E}[\|{\bm{z}}^{\frac{K}{2}-k}\|^{2}]. Hence, we get the following bound on the second term:

2k=0K21𝔼[(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k2]\displaystyle 2\sum_{k=0}^{\frac{K}{2}-1}\mathbb{E}\left[\left\|\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k}\right\|^{2}\right]
2K2(2n2α4L2(G)2+170n5α6L4(G)2logn).\displaystyle\leq 2\frac{K}{2}(2n^{2}\alpha^{4}L^{2}(G^{*})^{2}+170n^{5}\alpha^{6}L^{4}(G^{*})^{2}\log n). (34)

Finally, we focus on the third term in Ineq. (32). We have the following lemma that gives an upper bound for it:

Lemma 8.

Let α12nL\alpha\leq\frac{1}{2nL} and n>6n>6. Then for 0k<kK210\leq k^{\prime}<k\leq\frac{K}{2}-1,

𝔼\displaystyle\mathbb{E} [(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]
1000n2α4L2(G)2+2000n6α7L5(G)2logn.\displaystyle\qquad\qquad\leq 1000n^{2}\alpha^{4}L^{2}(G^{*})^{2}+2000n^{6}\alpha^{7}L^{5}(G^{*})^{2}\log n.

Using Lemma 8 and inequalities (33) and (34) in Ineq. (32), we get

𝔼[𝒙nK2]\displaystyle\mathbb{E}[\|{\bm{x}}_{n}^{K}\|^{2}] 2(1nαμ)K/2𝒙012+2n2Kα4L2(G)2+170n5Kα6L4(G)2logn\displaystyle\leq 2(1-n\alpha\mu)^{K/2}\|{\bm{x}}_{0}^{1}\|^{2}+2n^{2}K\alpha^{4}L^{2}(G^{*})^{2}+170n^{5}K\alpha^{6}L^{4}(G^{*})^{2}\log n
+1000n2K2α4L2(G)2+2000n6K2α7L5(G)2logn\displaystyle\quad+1000n^{2}K^{2}\alpha^{4}L^{2}(G^{*})^{2}+2000n^{6}K^{2}\alpha^{7}L^{5}(G^{*})^{2}\log n
2(1nαμ)K/2𝒙012+1002n2K2α4L2(G)2+2170n6K2α7L5(G)2logn.\displaystyle\leq 2(1-n\alpha\mu)^{K/2}\|{\bm{x}}_{0}^{1}\|^{2}+1002n^{2}K^{2}\alpha^{4}L^{2}(G^{*})^{2}+2170n^{6}K^{2}\alpha^{7}L^{5}(G^{*})^{2}\log n.

Substituting α=10lognKμnK\alpha=\frac{10\log nK}{\mu nK} gives us the desired result.
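As an illustrative aside (a synthetic experiment, not part of the proof; the problem instance and all constants below are hypothetical), the sketch compares Random Reshuffle against the FlipFlop scheme analyzed here, where each sampled permutation is immediately followed by its reverse, using the step size α=10lognKμnK\alpha=\frac{10\log nK}{\mu nK} from the proof.

```python
import numpy as np

# Illustrative comparison (hypothetical problem instance): Random Reshuffle
# vs. FlipFlop on quadratics f_i(x) = 0.5 x^T A_i x - b_i^T x whose sum is
# minimized at the origin, so ||x||^2 measures the final error.
rng = np.random.default_rng(3)
n, d, K = 50, 4, 40
b = rng.normal(size=(n, d))
b -= b.mean(axis=0)
A = []
for _ in range(n):
    B = rng.normal(size=(d, d))
    A.append(B @ B.T / d + 0.1 * np.eye(d))    # strongly convex components
mu = np.linalg.eigvalsh(sum(A) / n).min()
alpha = 10 * np.log(n * K) / (mu * n * K)      # step size from the proof

def run(flipflop):
    x = np.ones(d)
    order = rng.permutation(n)
    for k in range(K):
        if flipflop and k % 2 == 1:
            order = order[::-1]                # even-indexed epoch: reversed
        else:
            order = rng.permutation(n)         # odd-indexed epoch: fresh draw
        for i in order:
            x = x - alpha * (A[i] @ x - b[i])
    return float(np.linalg.norm(x) ** 2)

print("random reshuffle:", np.mean([run(False) for _ in range(20)]))
print("flipflop:        ", np.mean([run(True) for _ in range(20)]))
```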

F.1 Proof of Lemma 7

Recall that we have defined 𝑵k:=(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)){\bm{N}}_{k}:=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right). Hence,

𝑵k\displaystyle{\bm{N}}_{k}^{\top} =((i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))\displaystyle=\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right)^{\top}
=(i=1n(𝑰𝑺ni+1K2k))(i=1n(𝑰𝑺iK2k))\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)^{\top}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)^{\top}
=(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k))\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})^{\top}\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})^{\top}\right)
=(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k))\displaystyle=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)
=𝑵k,\displaystyle={\bm{N}}_{k},

where we used the fact that 𝑺iK2k{\bm{S}}_{i}^{\frac{K}{2}-k} are symmetric. Hence 𝑵k{\bm{N}}_{k} is symmetric. Then,

𝔼[𝑵k𝑵k]\displaystyle\|{\mathbb{E}}[{\bm{N}}_{k}^{\top}{\bm{N}}_{k}]\| =max𝒙:𝒙=1𝔼[𝒙𝑵k𝑵k𝒙]\displaystyle=\max_{{\bm{x}}:\|{\bm{x}}\|=1}{\mathbb{E}}[{\bm{x}}^{\top}{\bm{N}}_{k}^{\top}{\bm{N}}_{k}{\bm{x}}]
=max𝒙:𝒙=1𝔼[𝒙𝑵k𝑵k𝒙].\displaystyle=\max_{{\bm{x}}:\|{\bm{x}}\|=1}{\mathbb{E}}[{\bm{x}}^{\top}{\bm{N}}_{k}{\bm{N}}_{k}{\bm{x}}].

Next, note that 𝑵k1\|{\bm{N}}_{k}\|\leq 1 as long as α1nL\alpha\leq\frac{1}{nL}. Moreover, by the reversal identity used above, 𝑵k=(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺iK2k)){\bm{N}}_{k}=\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)^{\top}, so 𝑵k{\bm{N}}_{k} is positive semidefinite and its eigenvalues lie in [0,1][0,1]. This gives us that 𝒙𝑵k𝑵k𝒙𝒙𝑵k𝒙{\bm{x}}^{\top}{\bm{N}}_{k}{\bm{N}}_{k}{\bm{x}}\leq{\bm{x}}^{\top}{\bm{N}}_{k}{\bm{x}}. Hence, we get

𝔼[𝑵k𝑵k]\displaystyle\|{\mathbb{E}}[{\bm{N}}_{k}^{\top}{\bm{N}}_{k}]\| max𝒙:𝒙=1𝔼[𝒙𝑵k𝒙]\displaystyle\leq\max_{{\bm{x}}:\|{\bm{x}}\|=1}{\mathbb{E}}[{\bm{x}}^{\top}{\bm{N}}_{k}{\bm{x}}]
=𝔼[𝑵k]\displaystyle=\|{\mathbb{E}}[{\bm{N}}_{k}]\|
=𝔼[(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k))]\displaystyle=\left\|{\mathbb{E}}\left[\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k})\right)\right]\right\|
=𝔼[(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺iK2k))].\displaystyle=\left\|{\mathbb{E}}\left[\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)^{\top}\right]\right\|.

To complete the proof, we apply the following lemma from Ahn et al. (2020):

Lemma 9 (Lemma 6, Ahn et al. (2020)).

For any 0α316nLmin{1,nκ}0\leq\alpha\leq\frac{3}{16nL}\min\{1,\frac{n}{\kappa}\} and k[K]k\in[K],

𝔼[(i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺iK2k))]1αnμ.\left\|{\mathbb{E}}\left[\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k})\right)^{\top}\right]\right\|\leq 1-\alpha n\mu.
Remark 2.

To avoid confusion, we remark here that the term ‘𝐒k{\bm{S}}_{k}’ in Ahn et al. (2020) is the same as the term ‘i=1n(𝐈𝐒ik)\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{k})’ in our paper; hence, the original lemma statement in their paper looks different from the one written above.
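The structure exploited in this proof can also be checked numerically. The sketch below (illustrative only; random PSD matrices standing in for the α𝑨σi\alpha{\bm{A}}_{\sigma_{i}}) confirms that 𝑵=𝑷𝑷{\bm{N}}={\bm{P}}{\bm{P}}^{\top} with 𝑷=i=1n(𝑰𝑺i){\bm{P}}=\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}) is symmetric, positive semidefinite, and a contraction when α1/L\alpha\leq 1/L.

```python
import numpy as np

# Illustrative check: with symmetric PSD S_i = alpha * A_i and alpha <= 1/L,
# the epoch-pair matrix N = prod_i (I - S_i) * prod_i (I - S_{n-i+1}) equals
# P P^T for P = prod_i (I - S_i), hence is symmetric PSD with ||N|| <= 1.
rng = np.random.default_rng(4)
n, d = 10, 5
A = []
for _ in range(n):
    B = rng.normal(size=(d, d))
    A.append(B @ B.T)
alpha = 1.0 / max(np.linalg.norm(Ai, 2) for Ai in A)   # alpha = 1/L

P = np.eye(d)
for Ai in A:
    P = P @ (np.eye(d) - alpha * Ai)   # left-to-right product over the epoch
N = P @ P.T                            # the reversed epoch contributes P^T

assert np.allclose(N, N.T)                        # symmetric
assert np.linalg.eigvalsh(N).min() >= -1e-12      # positive semidefinite
assert np.linalg.norm(N, 2) <= 1 + 1e-12          # contraction
```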

F.2 Proof of Lemma 8

We begin by decomposing the product into a product of independent terms, similarly to the proof of Lemma 8 in Ahn et al. (2020). After that, however, we diverge from their proof, since we use a FlipFlop-specific analysis.

𝔼\displaystyle\mathbb{E} [(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]
=𝔼[(𝒛K2k)(l=k+1k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle=\mathbb{E}\left[\left({\bm{z}}^{\frac{K}{2}-k}\right)^{\top}\left(\prod_{l=k^{\prime}+1}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.
(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\qquad\quad\left.\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k].\displaystyle\qquad\quad\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right].

Since k<kk^{\prime}<k, we get that (𝒛K2k)\left({\bm{z}}^{\frac{K}{2}-k}\right)^{\top}, (l=k+1k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\left(\prod_{l=k^{\prime}+1}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top} and

(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k\displaystyle\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}

are independent. Hence, we can write the expectation as a product of expectations:

𝔼\displaystyle\mathbb{E} [(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]
=𝔼[(𝒛K2k)]𝔼[(l=k+1k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))]\displaystyle=\mathbb{E}\left[\left({\bm{z}}^{\frac{K}{2}-k}\right)^{\top}\right]\mathbb{E}\left[\left(\prod_{l=k^{\prime}+1}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right]
𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\qquad\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k].\displaystyle\qquad\qquad\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right].

Applying the Cauchy–Schwarz inequality to the decomposition above, we get

𝔼\displaystyle\mathbb{E} [(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]
𝔼[𝒛K2k]𝔼[l=k+1k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l))]\displaystyle\leq\left\|\mathbb{E}\left[{\bm{z}}^{\frac{K}{2}-k}\right]\right\|\left\|\mathbb{E}\left[\prod_{l=k^{\prime}+1}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right]\right\|
𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\ \quad\left\|\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\ \quad\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]\right\|
𝔼[𝒛K2k]𝔼[l=k+1k1(i=1n𝑰𝑺iK2l)(i=1n𝑰𝑺ni+1K2l)]\displaystyle\leq\left\|\mathbb{E}\left[{\bm{z}}^{\frac{K}{2}-k}\right]\right\|\mathbb{E}\left[\prod_{l=k^{\prime}+1}^{k-1}\left(\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l}\|\right)\left(\prod_{i=1}^{n}\|{\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l}\|\right)\right]
𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\ \quad\left\|\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\ \quad\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]\right\|
𝔼[𝒛K2k]\displaystyle\leq\left\|\mathbb{E}\left[{\bm{z}}^{\frac{K}{2}-k}\right]\right\|
𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\ \quad\left\|\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k],\displaystyle\ \quad\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]\right\|,

where in the last step we used that 𝑰𝑺ik1\|{\bm{I}}-{\bm{S}}_{i}^{k}\|\leq 1 for all i,ki,k. To see why this is true, recall that 𝑺ik=α𝑨σik{\bm{S}}_{i}^{k}=\alpha{\bm{A}}_{\sigma_{i}^{k}}. Further, by Assumption 2, 𝑨σikL\|{\bm{A}}_{\sigma_{i}^{k}}\|\leq L; since the 𝑨σik{\bm{A}}_{\sigma_{i}^{k}} are also positive semidefinite, as long as α1/L\alpha\leq 1/L we have 𝑰𝑺ik1\|{\bm{I}}-{\bm{S}}_{i}^{k}\|\leq 1.

For the two terms in the product above, we have the following two lemmas:

Lemma 10.

If n>6n>6 and α1nL\alpha\leq\frac{1}{nL}, then

𝔼[𝒛K2k]28nα2LG+9α5L4n4G2nlogn.\displaystyle\|\mathbb{E}[{\bm{z}}^{\frac{K}{2}-k}]\|\leq 28n\alpha^{2}LG^{*}+9\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n}.
Lemma 11.

If n>6n>6 and α12nL\alpha\leq\frac{1}{2nL}, then

𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\left\|\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]\right\|
32nα2LG+24α5L4n4G2nlogn.\displaystyle\qquad\qquad\qquad\qquad\qquad\leq 32n\alpha^{2}LG^{*}+24\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n}.

Finally, using these lemmas, we get

𝔼\displaystyle\mathbb{E} [(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k,\displaystyle\left[\left\langle\left(\prod_{l=0}^{k-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k},\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\ \quad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right\rangle\right]
(28nα2LG+9α5L4n4G2nlogn)(32nα2LG+24α5L4n4G2nlogn)\displaystyle\qquad\qquad\leq\left(28n\alpha^{2}LG^{*}+9\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n}\right)\cdot\left(32n\alpha^{2}LG^{*}+24\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n}\right)
896n2α4L2(G)2+960α7L5n5(G)22nlogn+432α10L8n9(G)2logn\displaystyle\qquad\qquad\leq 896n^{2}\alpha^{4}L^{2}(G^{*})^{2}+960\alpha^{7}L^{5}n^{5}(G^{*})^{2}\sqrt{2n\log n}+432\alpha^{10}L^{8}n^{9}(G^{*})^{2}\log n
1000n2α4L2(G)2+2000α7L5n6(G)2logn.\displaystyle\qquad\qquad\leq 1000n^{2}\alpha^{4}L^{2}(G^{*})^{2}+2000\alpha^{7}L^{5}n^{6}(G^{*})^{2}\log n.

F.3 Proof of Lemma 10

Since we are dealing with just a single epoch, we will drop the superscript. Using Lemma 5, we get

𝔼[𝒛]\displaystyle\|\mathbb{E}[{\bm{z}}]\| =𝔼[i=1n𝑺i𝒕i]+𝔼[j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle=\left\|\mathbb{E}\left[\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}\right]+\mathbb{E}\left[\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]\right.
+𝔼[j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle\quad+\left.\mathbb{E}\left[\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]\right\|
i=1n𝔼[𝑺i𝒕i]+j=n+12n1l=1j1𝔼[(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle\leq\sum_{i=1}^{n}\mathbb{E}\left[\|{{\bm{S}}}_{i}\|\|{\bm{t}}_{i}\|\right]+\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left\|\mathbb{E}\left[\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]\right\|
+j=1n1l=1j1𝔼[(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)].\displaystyle\quad+\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left\|\mathbb{E}\left[\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]\right\|. (35)

Recall that G:=maxi𝒃iG^{*}:=\max_{i}\|{\bm{b}}_{i}\|. Then 𝑺i𝒕i=α𝑨σiα𝒃σiα2LG\|{\bm{S}}_{i}\|\|{\bm{t}}_{i}\|=\|\alpha{\bm{A}}_{\sigma_{i}}\|\|\alpha{\bm{b}}_{\sigma_{i}}\|\leq\alpha^{2}LG^{*} and hence

i=1n𝔼[𝑺i𝒕i]nα2LG.\displaystyle\sum_{i=1}^{n}\mathbb{E}\left[\|{{\bm{S}}}_{i}\|\|{\bm{t}}_{i}\|\right]\leq n\alpha^{2}LG^{*}. (36)

Next we bound the other two terms. Using Eq. (26), we get that for any l<jl<j,

𝔼\displaystyle\mathbb{E} [(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle\left[\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]
=𝔼[(𝑰p=1l1𝑺p+p=1l1q=1p1(r=1q1(𝑰𝑺r))𝑺q𝑺p)𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle=\mathbb{E}\left[\left({\bm{I}}-\sum_{p=1}^{l-1}{{\bm{S}}}_{p}+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left(\prod_{r=1}^{q-1}({\bm{I}}-{{\bm{S}}}_{r})\right){{\bm{S}}}_{q}{{\bm{S}}}_{p}\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]
=i=1nj𝔼[𝑺l𝑺j𝒕n+1i]𝔼[p=1l1𝑺p𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle=\sum_{i=1}^{n-j}\mathbb{E}\left[{{\bm{S}}}_{l}{{\bm{S}}}_{j}{\bm{t}}_{n+1-i}\right]-\mathbb{E}\left[\sum_{p=1}^{l-1}{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]
+𝔼[p=1l1q=1p1(r=1q1(𝑰𝑺r))𝑺q𝑺p𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle\quad+\mathbb{E}\left[\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left(\prod_{r=1}^{q-1}({\bm{I}}-{{\bm{S}}}_{r})\right){{\bm{S}}}_{q}{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]
=i>j,l𝔼[𝑺l𝑺j𝒕i]p<l,j<i𝔼[𝑺p𝑺l𝑺j𝒕i]+p=1l1q=1p1𝔼[(r=1q1(𝑰𝑺r))𝑺q𝑺p𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle=\sum_{i>j,l}\mathbb{E}\left[{{\bm{S}}}_{l}{{\bm{S}}}_{j}{\bm{t}}_{i}\right]-\sum_{p<l,j<i}\mathbb{E}\left[{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}{\bm{t}}_{i}\right]+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left(\prod_{r=1}^{q-1}({\bm{I}}-{{\bm{S}}}_{r})\right){{\bm{S}}}_{q}{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right]
=i>j,l𝔼[𝑺l𝑺j𝔼[𝒕i|𝑺l,𝑺j]]p<l,j<i𝔼[𝑺p𝑺l𝑺j𝔼[𝒕i|𝑺l,𝑺j,𝑺p]]\displaystyle=\sum_{i>j,l}\mathbb{E}\left[{{\bm{S}}}_{l}{{\bm{S}}}_{j}\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j}]\right]-\sum_{p<l,j<i}\mathbb{E}\left[{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j},{\bm{S}}_{p}]\right]
+p=1l1q=1p1𝔼[(r=1q1(𝑰𝑺r))𝑺q𝑺p𝑺l𝑺j(i=1nj𝒕n+1i)].\displaystyle\quad+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left(\prod_{r=1}^{q-1}({\bm{I}}-{{\bm{S}}}_{r})\right){{\bm{S}}}_{q}{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right].

Since i=1n𝒕i=𝟎\sum_{i=1}^{n}{\bm{t}}_{i}={\bm{0}} and we use uniform random permutations, 𝔼[𝒕i|𝑺l,𝑺j]=𝒕i𝒕l,𝒕i𝒕j𝒕in2=𝒕j𝒕ln2\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j}]=\sum_{\begin{subarray}{c}{\bm{t}}_{i}\neq{\bm{t}}_{l},\\ {\bm{t}}_{i}\neq{\bm{t}}_{j}\end{subarray}}\frac{{\bm{t}}_{i}}{n-2}=\frac{-{\bm{t}}_{j}-{\bm{t}}_{l}}{n-2}. Similarly, 𝔼[𝒕i|𝑺l,𝑺j,𝑺p]=𝒕j𝒕l𝒕pn3\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j},{{\bm{S}}}_{p}]=\frac{-{\bm{t}}_{j}-{\bm{t}}_{l}-{\bm{t}}_{p}}{n-3}. Hence,

𝔼[(p=1l1\displaystyle\Bigg{\|}\mathbb{E}\Bigg{[}\Bigg{(}\prod_{p=1}^{l-1} (𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i)]\displaystyle({\bm{I}}-{{\bm{S}}}_{p})\Bigg{)}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\Bigg{(}\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\Bigg{)}\Bigg{]}\Bigg{\|}
i>j,l𝔼[𝑺l𝑺j𝔼[𝒕i|𝑺l,𝑺j]]+p<l,j<i𝔼[𝑺p𝑺l𝑺j𝔼[𝒕i|𝑺l,𝑺j,𝑺p]]\displaystyle\leq\sum_{i>j,l}\left\|\mathbb{E}\left[{{\bm{S}}}_{l}{{\bm{S}}}_{j}\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j}]\right]\right\|+\sum_{p<l,j<i}\left\|\mathbb{E}\left[{{\bm{S}}}_{p}{{\bm{S}}}_{l}{{\bm{S}}}_{j}\mathbb{E}[{\bm{t}}_{i}|{{\bm{S}}}_{l},{{\bm{S}}}_{j},{\bm{S}}_{p}]\right]\right\|
+p=1l1q=1p1𝔼[(r=1q1𝑰𝑺r)𝑺q𝑺p𝑺l𝑺ji=1nj𝒕n+1i]\displaystyle+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left(\prod_{r=1}^{q-1}\|{\bm{I}}-{{\bm{S}}}_{r}\|\right)\|{{\bm{S}}}_{q}\|\|{{\bm{S}}}_{p}\|\|{{\bm{S}}}_{l}\|\|{{\bm{S}}}_{j}\|\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right]
i>j,l𝔼[𝑺l𝑺j𝒕l+𝒕jn2]+p<l,j<i𝔼[𝑺p𝑺l𝑺j𝒕l+𝒕j+𝒕pn3]\displaystyle\leq\sum_{i>j,l}\mathbb{E}\left[\|{{\bm{S}}}_{l}\|\|{{\bm{S}}}_{j}\|\frac{\|{\bm{t}}_{l}+{\bm{t}}_{j}\|}{n-2}\right]+\sum_{p<l,j<i}\mathbb{E}\left[\|{{\bm{S}}}_{p}\|\|{{\bm{S}}}_{l}\|\|{{\bm{S}}}_{j}\|\frac{\|{\bm{t}}_{l}+{\bm{t}}_{j}+{\bm{t}}_{p}\|}{n-3}\right]
+p=1l1q=1p1𝔼[(r=1q1𝑰𝑺r)𝑺q𝑺p𝑺l𝑺ji=1nj𝒕n+1i]\displaystyle+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left(\prod_{r=1}^{q-1}\|{\bm{I}}-{{\bm{S}}}_{r}\|\right)\|{{\bm{S}}}_{q}\|\|{{\bm{S}}}_{p}\|\|{{\bm{S}}}_{l}\|\|{{\bm{S}}}_{j}\|\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right]
2nn2α3L2G+3n22(n3)α4L3G+α4L4p=1l1q=1p1𝔼[i=1nj𝒕n+1i]\displaystyle\leq\frac{2n}{n-2}\alpha^{3}L^{2}G^{*}+\frac{3n^{2}}{2(n-3)}\alpha^{4}L^{3}G^{*}+\alpha^{4}L^{4}\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right]
4α3L2G+3nα4L3G+α5L4n2G18nlogn,\displaystyle\leq 4\alpha^{3}L^{2}G^{*}+3n\alpha^{4}L^{3}G^{*}+\alpha^{5}L^{4}n^{2}G^{*}\sqrt{18n\log n}, (37)

where we used Lemma 6 and the assumption that n>6n>6 in the last step.
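The conditioning identity used in this derivation is easy to test empirically. The following Monte Carlo sketch (illustrative only; the positions and function indices are chosen arbitrarily) conditions on the functions occupying two fixed slots of a uniformly random permutation and compares the conditional mean of a third slot with (𝒕l+𝒕j)/(n2)-({\bm{t}}_{l}+{\bm{t}}_{j})/(n-2).

```python
import numpy as np

# Illustrative Monte Carlo check of E[t_i | S_l, S_j] = -(t_l + t_j)/(n - 2):
# condition on which functions land in two fixed positions of a uniform
# random permutation, and average the value appearing in a third position.
rng = np.random.default_rng(5)
n, d, trials = 10, 3, 200_000
t = rng.normal(size=(n, d))
t -= t.mean(axis=0)                     # enforce sum_i t_i = 0

pos_l, pos_j, pos_i = 0, 1, 2           # three distinct positions in the epoch
acc, hits = np.zeros(d), 0
for _ in range(trials):
    perm = rng.permutation(n)
    if perm[pos_l] == 3 and perm[pos_j] == 7:   # condition on the two slots
        acc += t[perm[pos_i]]
        hits += 1
print(acc / hits)                       # approximately -(t[3] + t[7])/(n - 2)
print(-(t[3] + t[7]) / (n - 2))
```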

The second term in Ineq. (35) can be handled similarly. For any l,jl,j:

𝔼[(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle\mathbb{E}\left[\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]
=𝔼[(𝑰p=1l1𝑺~p+p=1l1q=1p1(r=1q1(𝑰𝑺~r))𝑺~q𝑺~p)𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle=\mathbb{E}\left[\left({\bm{I}}-\sum_{p=1}^{l-1}\widetilde{{\bm{S}}}_{p}+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left(\prod_{r=1}^{q-1}({\bm{I}}-\widetilde{{\bm{S}}}_{r})\right)\widetilde{{\bm{S}}}_{q}\widetilde{{\bm{S}}}_{p}\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]
=i=12nj𝔼[𝑺~l𝑺~j𝒕i]p=1l1i=12nj𝔼[𝑺~p𝑺~l𝑺~j𝒕i]+𝔼[p=1l1q=1p1(r=1q1(𝑰𝑺~r))𝑺~q𝑺~p𝑺~l𝑺~j(i=12nj𝒕i)].\displaystyle=\sum_{i=1}^{2n-j}\mathbb{E}\left[\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}\right]-\sum_{p=1}^{l-1}\sum_{i=1}^{2n-j}\mathbb{E}\left[\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}\right]+\mathbb{E}\left[\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left(\prod_{r=1}^{q-1}({\bm{I}}-\widetilde{{\bm{S}}}_{r})\right)\widetilde{{\bm{S}}}_{q}\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right].

Now, it is easy to see that iji\neq j and i2nj+1i\neq 2n-j+1. Then, if i=li=l or i=2nl+1i=2n-l+1, we use the fact that 𝔼[𝑺~l𝑺~j𝒕i]α3L2G\|\mathbb{E}[\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}]\|\leq\alpha^{3}L^{2}G^{*}. For all other ii, we can again use that 𝔼[𝒕i|𝑺~l,𝑺~j]=𝒕l𝒕jn2\mathbb{E}[{\bm{t}}_{i}|\widetilde{{\bm{S}}}_{l},\widetilde{{\bm{S}}}_{j}]=\frac{-{\bm{t}}_{l}-{\bm{t}}_{j}}{n-2} if lnl\leq n, or 𝔼[𝒕i|𝑺~l,𝑺~j]=𝒕2nl+1𝒕jn2\mathbb{E}[{\bm{t}}_{i}|\widetilde{{\bm{S}}}_{l},\widetilde{{\bm{S}}}_{j}]=\frac{-{\bm{t}}_{2n-l+1}-{\bm{t}}_{j}}{n-2} otherwise. Similarly, we can bound 𝔼[𝑺~p𝑺~l𝑺~j𝒕i]\left\|\mathbb{E}\left[\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}\right]\right\|. Proceeding similarly to the derivation of Ineq. (37), we get

𝔼[(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle\left\|\mathbb{E}\left[\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]\right\|
i=12nj𝔼[𝑺~l𝑺~j𝒕i]+p=1l1i=12nj𝔼[𝑺~p𝑺~l𝑺~j𝒕i]\displaystyle\qquad\qquad\leq\sum_{i=1}^{2n-j}\left\|\mathbb{E}\left[\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}\right]\right\|+\sum_{p=1}^{l-1}\sum_{i=1}^{2n-j}\left\|\mathbb{E}\left[\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}{\bm{t}}_{i}\right]\right\|
+p=1l1q=1p1𝔼[(r=1q1(𝑰𝑺~r))𝑺~q𝑺~p𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle\qquad\qquad\quad+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left\|\mathbb{E}\left[\left(\prod_{r=1}^{q-1}({\bm{I}}-\widetilde{{\bm{S}}}_{r})\right)\widetilde{{\bm{S}}}_{q}\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]\right\|
α3L2G+2nn2α3L2G+α4L3G+nα4L3G+3n22(n3)α4L3G\displaystyle\qquad\qquad\leq\alpha^{3}L^{2}G^{*}+\frac{2n}{n-2}\alpha^{3}L^{2}G^{*}+\alpha^{4}L^{3}G^{*}+n\alpha^{4}L^{3}G^{*}+\frac{3n^{2}}{2(n-3)}\alpha^{4}L^{3}G^{*}
+p=1l1q=1p1𝔼[(r=1q1(𝑰𝑺~r))𝑺~q𝑺~p𝑺~l𝑺~j(i=12nj𝒕i)]\displaystyle\qquad\qquad\quad+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\left\|\mathbb{E}\left[\left(\prod_{r=1}^{q-1}({\bm{I}}-\widetilde{{\bm{S}}}_{r})\right)\widetilde{{\bm{S}}}_{q}\widetilde{{\bm{S}}}_{p}\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right]\right\|
5α3L2G+5nα4L3G\displaystyle\qquad\qquad\leq 5\alpha^{3}L^{2}G^{*}+5n\alpha^{4}L^{3}G^{*}
+p=1l1q=1p1𝔼[(r=1q1𝑰𝑺~r)𝑺~q𝑺~p𝑺~l𝑺~ji=12nj𝒕i]\displaystyle\qquad\qquad\quad+\sum_{p=1}^{l-1}\sum_{q=1}^{p-1}\mathbb{E}\left[\left(\prod_{r=1}^{q-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{r}\|\right)\|\widetilde{{\bm{S}}}_{q}\|\|\widetilde{{\bm{S}}}_{p}\|\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|\right]
5α3L2G+5nα4L3G+α5L4n2G18nlogn.\displaystyle\qquad\qquad\leq 5\alpha^{3}L^{2}G^{*}+5n\alpha^{4}L^{3}G^{*}+\alpha^{5}L^{4}n^{2}G^{*}\sqrt{18n\log n}. (38)

Substituting Ineqs. (36), (37), and (38) into (35), we get

𝔼[𝒛]\displaystyle\|\mathbb{E}[{\bm{z}}]\| nα2LG+10n2α3L2G+10n3α4L3G+2α5L4n4G18nlogn\displaystyle\leq n\alpha^{2}LG^{*}+10n^{2}\alpha^{3}L^{2}G^{*}+10n^{3}\alpha^{4}L^{3}G^{*}+2\alpha^{5}L^{4}n^{4}G^{*}\sqrt{18n\log n}
+4n2α3L2G+3n3α4L3G+α5L4n4G18nlogn\displaystyle\quad+4n^{2}\alpha^{3}L^{2}G^{*}+3n^{3}\alpha^{4}L^{3}G^{*}+\alpha^{5}L^{4}n^{4}G^{*}\sqrt{18n\log n}
28nα2LG+9α5L4n4G2nlogn,\displaystyle\leq 28n\alpha^{2}LG^{*}+9\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n},

where we used the assumption that α1nL\alpha\leq\frac{1}{nL} in the last step.

F.4 Proof of Lemma 11

Define the matrix 𝑴{\bm{M}} as

𝑴:=(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l))).\displaystyle{\bm{M}}:=\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right).

Since 𝑴{\bm{M}} is independent of (i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k))\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k^{\prime}})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k^{\prime}})\right) and 𝒛K2k{\bm{z}}^{\frac{K}{2}-k^{\prime}}, using the tower rule, we get

𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\qquad\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]
=𝔼[((i=1n(𝑰𝑺iK2k))(i=1n(𝑰𝑺ni+1K2k)))𝔼[𝑴]𝒛K2k].\displaystyle\qquad=\mathbb{E}\left[\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-k^{\prime}})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-k^{\prime}})\right)\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}^{\frac{K}{2}-k^{\prime}}\right].

We will now drop the superscript K2k\frac{K}{2}-k^{\prime} for convenience. Hence, we need to control the following term:

𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))𝔼[𝑴]𝒛]\displaystyle\mathbb{E}\left[\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]

We define (𝑺~1,,𝑺~2n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{2n}) as (𝑺~1,,𝑺~n)=(𝑺1,,𝑺n)(\widetilde{{\bm{S}}}_{1},\dots,\widetilde{{\bm{S}}}_{n})=({\bm{S}}_{1},\dots,{\bm{S}}_{n}) and (𝑺~n+1,,𝑺~2n)=(𝑺n,,𝑺1)(\widetilde{{\bm{S}}}_{n+1},\dots,\widetilde{{\bm{S}}}_{2n})=({\bm{S}}_{n},\dots,{\bm{S}}_{1}). Then, we use Eq. (24) to get

((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))\displaystyle\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right) =i=12n(𝑰𝑺~i)\displaystyle=\prod_{i=1}^{2n}({\bm{I}}-\widetilde{{\bm{S}}}_{i})
=𝑰j=12n𝑺~j+j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j.\displaystyle={\bm{I}}-\sum_{j=1}^{2n}\widetilde{{\bm{S}}}_{j}+\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}.
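As a sanity check, the second-order expansion above can be verified numerically. Below is a minimal sketch (the dimensions, scaling, and variable names are ours, chosen for illustration) that compares both sides of the identity for random positive semidefinite matrices standing in for the \widetilde{{\bm{S}}}_{i}, with the product taken left to right:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 6  # dimension and number of factors (illustrative values)

# Random PSD matrices standing in for the S~_i, scaled so that ||S~_i|| < 1.
S = []
for _ in range(m):
    B = rng.standard_normal((d, d))
    S.append(0.05 * (B @ B.T))

I = np.eye(d)

# Left-hand side: the ordered product (I - S~_1)(I - S~_2)...(I - S~_m).
lhs = I.copy()
for Si in S:
    lhs = lhs @ (I - Si)

# Right-hand side: I - sum_j S~_j + sum_j sum_{l<j} (prod_{p<l}(I - S~_p)) S~_l S~_j.
rhs = I - sum(S)
for j in range(m):
    prefix = I.copy()  # prod_{p=1}^{l-1}(I - S~_p), built up as l increases
    for l in range(j):
        rhs = rhs + prefix @ S[l] @ S[j]
        prefix = prefix @ (I - S[l])

print(np.max(np.abs(lhs - rhs)))  # ~1e-16: the identity is exact
```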

Note that {\bm{I}}-\sum_{j=1}^{2n}\widetilde{{\bm{S}}}_{j} is a constant matrix. Since \widetilde{{\bm{S}}}_{j}=\alpha{\bm{A}}_{\sigma_{j}}, we have that \|\widetilde{{\bm{S}}}_{j}\|\leq\alpha L by Assumption 2. Hence, if \alpha\leq\frac{1}{2nL}, then \|{\bm{I}}-\sum_{j=1}^{2n}\widetilde{{\bm{S}}}_{j}\|\leq 1. Further, \alpha\leq 1/L implies that \|{\bm{I}}-{\bm{S}}_{i}\|\leq 1, which in turn implies \|{\bm{M}}\|\leq 1. Hence,

𝔼[((i=1n(𝑰𝑺i))(i=1n(𝑰𝑺ni+1)))𝔼[𝑴]𝒛]\displaystyle\left\|\mathbb{E}\left[\left(\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1})\right)\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
𝔼[(𝑰j=12n𝑺~j)𝔼[𝑴]𝒛]\displaystyle\leq\left\|\mathbb{E}\left[\left({\bm{I}}-\sum_{j=1}^{2n}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛]\displaystyle\qquad+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
=(𝑰j=12n𝑺~j)𝔼[𝑴]𝔼[𝒛]+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛]\displaystyle=\left\|\left({\bm{I}}-\sum_{j=1}^{2n}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\mathbb{E}\left[{\bm{z}}\right]\right\|+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
𝔼[𝒛]+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛].\displaystyle\leq\left\|\mathbb{E}\left[{\bm{z}}\right]\right\|+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|.

We can apply Lemma 10 to bound \|\mathbb{E}[{\bm{z}}]\|, so we focus on the second term. Using Lemma 5,

𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛]\displaystyle\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴](i=1n𝑺i𝒕i)]\displaystyle\leq\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\left(\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}\right)\right]\right\|
+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴](j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i))]\displaystyle\quad+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\left(\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right)\right]\right\|
+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴](j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i))]\displaystyle\quad+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\left(\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right)\right]\right\|
=𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)]𝔼[𝑴](i=1n𝑺i𝒕i)\displaystyle=\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\right]\mathbb{E}[{\bm{M}}]\left(\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}\right)\right\|
+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴](j=n+12n1l=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j(i=12nj𝒕i))]\displaystyle\quad+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\left(\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\left(\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right)\right)\right]\right\|
+𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴](j=1n1l=1j1(p=1l1(𝑰𝑺p))𝑺l𝑺j(i=1nj𝒕n+1i))]\displaystyle\quad+\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]\left(\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-{{\bm{S}}}_{p})\right){{\bm{S}}}_{l}{{\bm{S}}}_{j}\left(\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right)\right)\right]\right\|
𝔼[j=12nl=1j1(p=1l1𝑰𝑺~p)𝑺~l𝑺~j]𝔼[𝑴]i=1n𝑺i𝒕i\displaystyle\leq\mathbb{E}\left[\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{p}\|\right)\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\right]\mathbb{E}[\|{\bm{M}}\|]\left\|\sum_{i=1}^{n}{{\bm{S}}}_{i}{\bm{t}}_{i}\right\|
+𝔼[(j=12nl=1j1(p=1l1𝑰𝑺~p)𝑺~l𝑺~j)𝔼[𝑴](j=n+12n1l=1j1(p=1l1𝑰𝑺~p)𝑺~l𝑺~ji=12nj𝒕i)]\displaystyle\quad+\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{p}\|\right)\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\right)\mathbb{E}[\|{\bm{M}}\|]\left(\sum_{j=n+1}^{2n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{p}\|\right)\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|\right)\right]
+𝔼[(j=12nl=1j1(p=1l1𝑰𝑺~p)𝑺~l𝑺~j)𝔼[𝑴](j=1n1l=1j1(p=1l1𝑰𝑺p)𝑺l𝑺ji=1nj𝒕n+1i)].\displaystyle\quad+\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-\widetilde{{\bm{S}}}_{p}\|\right)\|\widetilde{{\bm{S}}}_{l}\|\|\widetilde{{\bm{S}}}_{j}\|\right)\mathbb{E}[\|{\bm{M}}\|]\left(\sum_{j=1}^{n-1}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}\|{\bm{I}}-{{\bm{S}}}_{p}\|\right)\|{{\bm{S}}}_{l}\|\|{{\bm{S}}}_{j}\|\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right)\right].

Now, we use that \|{\bm{M}}\|\leq 1, \|{\bm{I}}-\widetilde{{\bm{S}}}_{i}\|\leq 1, \|\widetilde{{\bm{S}}}_{i}\|\leq\alpha L, and \|{\bm{t}}_{i}\|\leq\alpha G^{*}:

𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛]\displaystyle\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|
4n3α4L3G+4n4α4L4𝔼[i=12nj𝒕i]+n4α4L4𝔼[i=1nj𝒕n+1i].\displaystyle\qquad\qquad\leq 4n^{3}\alpha^{4}L^{3}G^{*}+4n^{4}\alpha^{4}L^{4}\mathbb{E}\left[\left\|\sum_{i=1}^{2n-j}{\bm{t}}_{i}\right\|\right]+n^{4}\alpha^{4}L^{4}\mathbb{E}\left[\left\|\sum_{i=1}^{n-j}{\bm{t}}_{n+1-i}\right\|\right].

Using Lemma 6, we get

𝔼[(j=12nl=1j1(p=1l1(𝑰𝑺~p))𝑺~l𝑺~j)𝔼[𝑴]𝒛]4n3α4L3G+15n4α5L4G2nlogn.\displaystyle\left\|\mathbb{E}\left[\left(\sum_{j=1}^{2n}\sum_{l=1}^{j-1}\left(\prod_{p=1}^{l-1}({\bm{I}}-\widetilde{{\bm{S}}}_{p})\right)\widetilde{{\bm{S}}}_{l}\widetilde{{\bm{S}}}_{j}\right)^{\top}\mathbb{E}[{\bm{M}}]{\bm{z}}\right]\right\|\leq 4n^{3}\alpha^{4}L^{3}G^{*}+15n^{4}\alpha^{5}L^{4}G^{*}\sqrt{2n\log n}.

Putting everything together,

𝔼[(l=0k(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))\displaystyle\left\|\mathbb{E}\left[\left(\prod_{l=0}^{k^{\prime}}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right)^{\top}\right.\right.
(l=0k1(i=1n(𝑰𝑺iK2l))(i=1n(𝑰𝑺ni+1K2l)))𝒛K2k]\displaystyle\qquad\left.\left.\left(\prod_{l=0}^{k^{\prime}-1}\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{i}^{\frac{K}{2}-l})\right)\left(\prod_{i=1}^{n}({\bm{I}}-{\bm{S}}_{n-i+1}^{\frac{K}{2}-l})\right)\right){\bm{z}}^{\frac{K}{2}-k^{\prime}}\right]\right\|
(28nα2LG+9α5L4n4G2nlogn)+(4n3α4L3G+15n4α5L4G2nlogn)\displaystyle\qquad\qquad\leq(28n\alpha^{2}LG^{*}+9\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n})+(4n^{3}\alpha^{4}L^{3}G^{*}+15n^{4}\alpha^{5}L^{4}G^{*}\sqrt{2n\log n})
32nα2LG+24α5L4n4G2nlogn.\displaystyle\qquad\qquad\leq 32n\alpha^{2}LG^{*}+24\alpha^{5}L^{4}n^{4}G^{*}\sqrt{2n\log n}.

Appendix G Proof of Theorem 6

Proof.

We start off by defining the error term

𝒓k=(i=1nfi(𝒙i12k1)i=1nfi(𝒙02k1))+(i=1nfni+1(𝒙i12k)i=1nfni+1(𝒙02k1)),{\bm{r}}^{k}=\left(\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{0}\right)\right)+\left(\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k-1}_{0}\right)\right),

where k\in[K/2]. This captures the difference between the gradients that the algorithm actually observes and the gradients that a full step of gradient descent from {\bm{x}}^{2k-1}_{0} would have used.
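To make this definition concrete, the following minimal numerical sketch (illustrative sizes, random quadratics, and helper names are our own) runs one FlipFlop epoch pair with Incremental GD and evaluates {\bm{r}}^{k} exactly as defined above, together with the norm bound derived below:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 5, 3, 1e-3  # illustrative sizes and step size

# Quadratics f_i(x) = 0.5 x^T H_i x - b_i^T x, so grad f_i(x) = H_i x - b_i.
H = [np.diag(rng.uniform(0.5, 1.5, d)) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]

def grad(i, x):
    return H[i] @ x - b[i]

x0 = rng.standard_normal(d)  # x_0^{2k-1}: the start of the epoch pair

# Epoch 2k-1: Incremental GD in the order 1, ..., n (0-indexed below).
xs_fwd = [x0]
for i in range(n):
    xs_fwd.append(xs_fwd[-1] - alpha * grad(i, xs_fwd[-1]))

# Epoch 2k: FlipFlop replays the same order reversed, i.e. n, ..., 1.
xs_bwd = [xs_fwd[-1]]
for i in reversed(range(n)):
    xs_bwd.append(xs_bwd[-1] - alpha * grad(i, xs_bwd[-1]))

# r^k: the gradients the algorithm actually saw, minus the same gradients
# evaluated at the reference point x_0^{2k-1}.
r = (sum(grad(i, xs_fwd[i]) - grad(i, x0) for i in range(n))
     + sum(grad(n - 1 - i, xs_bwd[i]) - grad(n - 1 - i, x0) for i in range(n)))

# Empirical check against the bound n(2n-1)*alpha*G*L derived below,
# with L and G estimated along the trajectory.
L = max(np.linalg.norm(Hi, 2) for Hi in H)
G = max(np.linalg.norm(grad(i, x)) for i in range(n) for x in xs_fwd + xs_bwd)
print(np.linalg.norm(r), "<=", n * (2 * n - 1) * alpha * G * L)
```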

For two consecutive epochs of FlipFlop with Incremental GD, we have the following inequality

\displaystyle\|{\bm{x}}^{2k}_{n}-{\bm{x}}^{*}\|^{2}=\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)+\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)\right\rangle
\displaystyle\quad+\alpha^{2}\left\|\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)+\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)\right\|^{2}
=𝒙02k1𝒙22α𝒙02k1𝒙,2nF(𝒙02k1)\displaystyle=\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},2n\nabla F({\bm{x}}^{2k-1}_{0})\right\rangle
2α𝒙02k1𝒙,𝒓k+α22nF(𝒙02k1)+𝒓k2\displaystyle\quad-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle+\alpha^{2}\left\|2n\nabla F({\bm{x}}^{2k-1}_{0})+{\bm{r}}^{k}\right\|^{2}
𝒙02k1𝒙24nα[LμL+μ𝒙02k1𝒙2+1L+μF(𝒙02k1)2]\displaystyle\leq\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-4n\alpha\left[\frac{L\mu}{L+\mu}\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}+\frac{1}{L+\mu}\left\|\nabla F\left({\bm{x}}^{2k-1}_{0}\right)\right\|^{2}\right]
2α𝒙02k1𝒙,𝒓k+α22nF(𝒙02k1)+𝒓k2\displaystyle\quad-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle+\alpha^{2}\left\|2n\nabla F({\bm{x}}^{2k-1}_{0})+{\bm{r}}^{k}\right\|^{2}
𝒙02k1𝒙24nα[LμL+μ𝒙02k1𝒙2+1L+μF(𝒙02k1)2]\displaystyle\leq\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-4n\alpha\left[\frac{L\mu}{L+\mu}\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}+\frac{1}{L+\mu}\left\|\nabla F\left({\bm{x}}^{2k-1}_{0}\right)\right\|^{2}\right]
2α𝒙02k1𝒙,𝒓k+8α2n2F(𝒙02k1)2+2α2𝒓k2\displaystyle\quad-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle+8\alpha^{2}n^{2}\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}+2\alpha^{2}\left\|{\bm{r}}^{k}\right\|^{2}
=(14nαLμL+μ)𝒙02k1𝒙2(4nα1L+μ8α2n2)F(𝒙02k1)2\displaystyle=\left(1-4n\alpha\frac{L\mu}{L+\mu}\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-8\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
2α𝒙02k1𝒙,𝒓k+2α2𝒓k2,\displaystyle\quad-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle+2\alpha^{2}\left\|{\bm{r}}^{k}\right\|^{2}, (39)

where the first inequality is due to Theorem 2.1.11 in Nesterov (2004) and the second one is simply (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}.
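(For reference, the inequality used here states that for a \mu-strongly convex and L-smooth function F,

\left\langle\nabla F(x)-\nabla F(y),x-y\right\rangle\geq\frac{L\mu}{L+\mu}\|x-y\|^{2}+\frac{1}{L+\mu}\|\nabla F(x)-\nabla F(y)\|^{2},

which we apply with y={\bm{x}}^{*}, using \nabla F({\bm{x}}^{*})=0.)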

It remains to bound the two terms that depend on {\bm{r}}^{k}. First, we bound the norm of {\bm{r}}^{k}:

𝒓k\displaystyle\|{\bm{r}}^{k}\| =(i=1nfi(𝒙i12k1)i=1nfi(𝒙02k1))+(i=1nfni+1(𝒙i12k)i=1nfni+1(𝒙02k1))\displaystyle=\left\|\left(\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{0}\right)\right)+\left(\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k-1}_{0}\right)\right)\right\|
i=1nfi(𝒙i12k1)fi(𝒙02k1)+i=1nfni+1(𝒙i12k)fni+1(𝒙02k1).\displaystyle\leq\sum_{i=1}^{n}\left\|\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)-\nabla f_{i}\left({\bm{x}}^{2k-1}_{0}\right)\right\|+\sum_{i=1}^{n}\left\|\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)-\nabla f_{n-i+1}\left({\bm{x}}^{2k-1}_{0}\right)\right\|.

Next, we use the smoothness assumption and the bounded gradients property (Lemma 1): the iterate {\bm{x}}^{2k-1}_{i-1} is reached after i-1 steps from {\bm{x}}^{2k-1}_{0}, and {\bm{x}}^{2k}_{i-1} after n+i-1 steps, each step moving the iterate by at most \alpha G.

𝒓k\displaystyle\|{\bm{r}}^{k}\| Li=1n𝒙i12k1𝒙02k1+Li=1n𝒙i12k𝒙02k1\displaystyle\leq L\sum_{i=1}^{n}\left\|{\bm{x}}^{2k-1}_{i-1}-{\bm{x}}^{2k-1}_{0}\right\|+L\sum_{i=1}^{n}\left\|{\bm{x}}^{2k}_{i-1}-{\bm{x}}^{2k-1}_{0}\right\|
\displaystyle\leq LG\alpha\sum_{i=1}^{n}(i-1)+LG\alpha\sum_{i=1}^{n}(n+i-1)
\displaystyle=LG\alpha\left(\frac{n(n-1)}{2}+n^{2}+\frac{n(n-1)}{2}\right)
\displaystyle=n(2n-1)\alpha GL.

Hence,

𝒓k24n4α2G2L2.\displaystyle\left\|{\bm{r}}^{k}\right\|^{2}\leq 4n^{4}\alpha^{2}G^{2}L^{2}. (40)

For the inner product term involving {\bm{r}}^{k}, we need a more careful bound. Since the Hessian of a quadratic function is constant, we use {\bm{H}}_{i} to denote the Hessian matrix of the function f_{i}(\cdot). We start off by using the definition of {\bm{r}}^{k}:

𝒓k\displaystyle{\bm{r}}^{k} =(i=1nfi(𝒙i12k1)i=1nfi(𝒙02k1))+(i=1nfni+1(𝒙i12k)i=1nfni+1(𝒙02k1))\displaystyle=\left(\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{i}\left({\bm{x}}^{2k-1}_{0}\right)\right)+\left(\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)-\sum_{i=1}^{n}\nabla f_{n-i+1}\left({\bm{x}}^{2k-1}_{0}\right)\right)
=i=1n(fi(𝒙i12k1)fi(𝒙02k1))+i=1n(fni+1(𝒙i12k)fni+1(𝒙02k1))\displaystyle=\sum_{i=1}^{n}\left(\nabla f_{i}\left({\bm{x}}^{2k-1}_{i-1}\right)-\nabla f_{i}\left({\bm{x}}^{2k-1}_{0}\right)\right)+\sum_{i=1}^{n}\left(\nabla f_{n-i+1}\left({\bm{x}}^{2k}_{i-1}\right)-\nabla f_{n-i+1}\left({\bm{x}}^{2k-1}_{0}\right)\right)
=i=1n𝑯i(𝒙i12k1𝒙02k1)+i=1n𝑯ni+1(𝒙i12k𝒙02k1),\displaystyle=\sum_{i=1}^{n}{\bm{H}}_{i}\left({\bm{x}}^{2k-1}_{i-1}-{\bm{x}}^{2k-1}_{0}\right)+\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left({\bm{x}}^{2k}_{i-1}-{\bm{x}}^{2k-1}_{0}\right),

where we used the fact that for a quadratic function f with Hessian {\bm{H}}, we have \nabla f(x)-\nabla f(y)={\bm{H}}(x-y). Next, we express {\bm{x}}^{2k-1}_{i-1}-{\bm{x}}^{2k-1}_{0} and {\bm{x}}^{2k}_{i-1}-{\bm{x}}^{2k-1}_{0} as sums of gradient descent steps:

𝒓k\displaystyle{\bm{r}}^{k} =i=1n𝑯i(j=1i1αfj(𝒙j12k1))+i=1n𝑯ni+1(j=1nαfj(𝒙j12k1)+j=1i1αfnj+1(𝒙j12k))\displaystyle=\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}-\alpha\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})\right)+\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{n}-\alpha\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})+\sum_{j=1}^{i-1}-\alpha\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})\right)
=αi=1n𝑯i(j=1i1fj(𝒙j12k1))αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k))\displaystyle=-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})\right)-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})\right)
αi=1n𝑯ni+1(j=1nfj(𝒙j12k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})\right)
=αi=1n𝑯i(j=1i1fj(𝒙02k1))αi=1n𝑯ni+1(j=1i1fnj+1(𝒙02k))\displaystyle=-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯ni+1(j=1nfj(𝒙02k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k)fnj+1(𝒙02k))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})-\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯ni+1(j=1nfj(𝒙j12k1)fj(𝒙02k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
=2αi=1n𝑯i(j=1nfj(𝒙02k1))+αi=1n𝑯ifi(𝒙02k1)\displaystyle=-2\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{2k-1}_{0})
\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k)fnj+1(𝒙02k))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})-\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯i(j=1nfj(𝒙j12k1)fj(𝒙02k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)

Next, we use the fact that \sum_{j=1}^{n}\nabla f_{j}({\bm{x}})=n\nabla F({\bm{x}}), and we again use that for a quadratic function f with Hessian {\bm{H}}, \nabla f(x)-\nabla f(y)={\bm{H}}(x-y):

𝒓k\displaystyle{\bm{r}}^{k} =2αi=1n𝑯i(nF(𝒙02k1))+αi=1n𝑯i(fi(𝒙02k1)fi(𝒙))+αi=1n𝑯ifi(𝒙)\displaystyle=-2\alpha\sum_{i=1}^{n}{\bm{H}}_{i}(n\nabla F({\bm{x}}^{2k-1}_{0}))+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}(\nabla f_{i}({\bm{x}}^{2k-1}_{0})-\nabla f_{i}({\bm{x}}^{*}))+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*})
\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k)fnj+1(𝒙02k))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})-\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯i(j=1nfj(𝒙j12k1)fj(𝒙02k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
=2α(i=1n𝑯i)2(𝒙02k1𝒙)+αi=1n𝑯i2(𝒙02k1𝒙)+αi=1n𝑯ifi(𝒙)\displaystyle=-2\alpha\left(\sum_{i=1}^{n}{\bm{H}}_{i}\right)^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*})
\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k)fnj+1(𝒙02k))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})-\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯i(j=1nfj(𝒙j12k1)fj(𝒙02k1))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
=𝒂k+𝒃k,\displaystyle={\bm{a}}^{k}+{\bm{b}}^{k},

where the random variables {\bm{a}}^{k} and {\bm{b}}^{k} are defined as

𝒂k\displaystyle{\bm{a}}^{k} :=2α(i=1n𝑯i)2(𝒙02k1𝒙)+αi=1n𝑯i2(𝒙02k1𝒙)+αi=1n𝑯ifi(𝒙), and\displaystyle:=-2\alpha\left(\sum_{i=1}^{n}{\bm{H}}_{i}\right)^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})+\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*}),\text{ and}
\displaystyle{\bm{b}}^{k}:=-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{i-1}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right)
αi=1n𝑯ni+1(j=1i1fnj+1(𝒙j12k)fnj+1(𝒙02k))\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{n-i+1}\left(\sum_{j=1}^{i-1}\nabla f_{n-j+1}({\bm{x}}^{2k}_{j-1})-\nabla f_{n-j+1}({\bm{x}}^{2k}_{0})\right)
αi=1n𝑯i(j=1nfj(𝒙j12k1)fj(𝒙02k1)).\displaystyle\quad-\alpha\sum_{i=1}^{n}{\bm{H}}_{i}\left(\sum_{j=1}^{n}\nabla f_{j}({\bm{x}}^{2k-1}_{j-1})-\nabla f_{j}({\bm{x}}^{2k-1}_{0})\right).

Again, using the smoothness assumption and the bounded gradients property (Lemma 1), and noting that each of the three double sums in {\bm{b}}^{k} contains at most n^{2} gradient differences, each of norm at most L\cdot n\alpha G, we get

𝒃k\displaystyle\|{\bm{b}}^{k}\| 3α2L2Gn3.\displaystyle\leq 3\alpha^{2}L^{2}Gn^{3}. (41)

Next, we decompose the inner product of {\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*} and {\bm{r}}^{k} appearing in Eq. (39):

2α𝒙02k1𝒙,𝒓k\displaystyle-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle =2α𝒙02k1𝒙,𝒂k+𝒃k\displaystyle=-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{a}}^{k}+{\bm{b}}^{k}\right\rangle
=2α𝒙02k1𝒙,𝒂k2α𝒙02k1𝒙,𝒃k\displaystyle=-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{a}}^{k}\right\rangle-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{b}}^{k}\right\rangle (42)

For the first term in (42),

2α𝒙02k1𝒙,𝒂k\displaystyle-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{a}}^{k}\right\rangle =4α2𝒙02k1𝒙,(i=1n𝑯i)2(𝒙02k1𝒙)\displaystyle=4\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\left(\sum_{i=1}^{n}{\bm{H}}_{i}\right)^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})\right\rangle
2α2𝒙02k1𝒙,i=1n𝑯i2(𝒙02k1𝒙)\displaystyle\quad-2\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\sum_{i=1}^{n}{\bm{H}}_{i}^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})\right\rangle
2α2𝒙02k1𝒙,i=1n𝑯ifi(𝒙)\displaystyle\quad-2\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*})\right\rangle
4α2𝒙02k1𝒙,(i=1n𝑯i)2(𝒙02k1𝒙)\displaystyle\leq 4\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\left(\sum_{i=1}^{n}{\bm{H}}_{i}\right)^{2}({\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*})\right\rangle
2α2𝒙02k1𝒙,i=1n𝑯ifi(𝒙)\displaystyle\quad-2\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*})\right\rangle
=4α2n2F(𝒙02k1)22α2𝒙02k1𝒙,i=1n𝑯ifi(𝒙)\displaystyle=4\alpha^{2}n^{2}\|\nabla F({\bm{x}}_{0}^{2k-1})\|^{2}-2\alpha^{2}\left\langle{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*},\sum_{i=1}^{n}{\bm{H}}_{i}\nabla f_{i}({\bm{x}}^{*})\right\rangle
4α2n2F(𝒙02k1)2+2α2nLG𝒙02k1𝒙.\displaystyle\leq 4\alpha^{2}n^{2}\|\nabla F({\bm{x}}_{0}^{2k-1})\|^{2}+2\alpha^{2}nLG\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|. (43)

For the second term in (42), we use the Cauchy-Schwarz inequality and Ineq. (41):

2α𝒙02k1𝒙,𝒃k6α3L2Gn3𝒙02k1𝒙.\displaystyle-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{b}}^{k}\right\rangle\leq 6\alpha^{3}L^{2}Gn^{3}\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|. (44)

Substituting (43) and (44) back into (42), we get

2α𝒙02k1𝒙,𝒓k\displaystyle-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle 4α2n2F(𝒙02k1)2+2α2nLG𝒙02k1𝒙\displaystyle\leq 4\alpha^{2}n^{2}\|\nabla F({\bm{x}}_{0}^{2k-1})\|^{2}+2\alpha^{2}nLG\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|
+6α3L2Gn3𝒙02k1𝒙.\displaystyle\quad+6\alpha^{3}L^{2}Gn^{3}\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|. (45)

Substituting (40) and (45) back into (39), we finally get a recursion bound for one pair of epochs:

𝒙n2k𝒙2\displaystyle\|{\bm{x}}^{2k}_{n}-{\bm{x}}^{*}\|^{2} (14nαLμL+μ)𝒙02k1𝒙2(4nα1L+μ8α2n2)F(𝒙02k1)2\displaystyle\leq\left(1-4n\alpha\frac{L\mu}{L+\mu}\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-8\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
2α𝒙02k1𝒙,𝒓k+2α2𝒓k2\displaystyle\quad-2\alpha\left\langle{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*},{\bm{r}}^{k}\right\rangle+2\alpha^{2}\left\|{\bm{r}}^{k}\right\|^{2}
(14nαLμL+μ)𝒙02k1𝒙2(4nα1L+μ8α2n2)F(𝒙02k1)2\displaystyle\leq\left(1-4n\alpha\frac{L\mu}{L+\mu}\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-8\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
\displaystyle\quad+4\alpha^{2}n^{2}\|\nabla F({\bm{x}}_{0}^{2k-1})\|^{2}+2\alpha^{2}nLG\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|+6\alpha^{3}L^{2}Gn^{3}\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|
+8α4n4G2L2\displaystyle\quad+8\alpha^{4}n^{4}G^{2}L^{2}
=(14nαLμL+μ)𝒙02k1𝒙2(4nα1L+μ12α2n2)F(𝒙02k1)2\displaystyle=\left(1-4n\alpha\frac{L\mu}{L+\mu}\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-12\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
+2α2nLG(1+3αLn2)𝒙02k1𝒙+8α4n4G2L2.\displaystyle\quad+2\alpha^{2}nLG(1+3\alpha Ln^{2})\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|+8\alpha^{4}n^{4}G^{2}L^{2}.

Next, we use the fact that 2ab\leq\lambda a^{2}+b^{2}/\lambda (for any \lambda>0), with \lambda=n\alpha\mu, on the term 2\alpha^{2}nLG(1+3\alpha Ln^{2})\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\| to get that

2α2nLG(1+3αLn2)𝒙02k1𝒙\displaystyle 2\alpha^{2}nLG(1+3\alpha Ln^{2})\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\| (α2nLG(1+3αLn2))2/(nαμ)+nαμ𝒙02k1𝒙2\displaystyle\leq\left(\alpha^{2}nLG(1+3\alpha Ln^{2})\right)^{2}/(n\alpha\mu)+n\alpha\mu\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|^{2}
\displaystyle=\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+n\alpha\mu\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|^{2}.

Substituting this back we get,

𝒙n2k𝒙2\displaystyle\|{\bm{x}}^{2k}_{n}-{\bm{x}}^{*}\|^{2} (14nαLμL+μ)𝒙02k1𝒙2(4nα1L+μ12α2n2)F(𝒙02k1)2\displaystyle\leq\left(1-4n\alpha\frac{L\mu}{L+\mu}\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-12\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
+μ1α3nL2G2(1+3αLn2)2+nαμ𝒙02k1𝒙2+8α4n4G2L2\displaystyle\quad+\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+n\alpha\mu\|{\bm{x}}_{0}^{2k-1}-{\bm{x}}^{*}\|^{2}+8\alpha^{4}n^{4}G^{2}L^{2}
(12nαμ+nαμ)𝒙02k1𝒙2(4nα1L+μ12α2n2)F(𝒙02k1)2\displaystyle\leq\left(1-2n\alpha\mu+n\alpha\mu\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-12\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
+μ1α3nL2G2(1+3αLn2)2+8α4n4G2L2\displaystyle\quad+\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+8\alpha^{4}n^{4}G^{2}L^{2} (Since μL\mu\leq L)
=(1nαμ)𝒙02k1𝒙2(4nα1L+μ12α2n2)F(𝒙02k1)2\displaystyle=\left(1-n\alpha\mu\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}-\left(4n\alpha\frac{1}{L+\mu}-12\alpha^{2}n^{2}\right)\left\|\nabla F({\bm{x}}^{2k-1}_{0})\right\|^{2}
+μ1α3nL2G2(1+3αLn2)2+8α4n4G2L2\displaystyle\quad+\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+8\alpha^{4}n^{4}G^{2}L^{2}

Now, substituting the value of \alpha and using the bound on K, we get that 4n\alpha\frac{1}{L+\mu}-12\alpha^{2}n^{2}\geq 0, and hence

𝒙n2k𝒙2\displaystyle\|{\bm{x}}^{2k}_{n}-{\bm{x}}^{*}\|^{2} (1nαμ)𝒙02k1𝒙2+μ1α3nL2G2(1+3αLn2)2+8α4n4G2L2\displaystyle\leq\left(1-n\alpha\mu\right)\|{\bm{x}}^{2k-1}_{0}-{\bm{x}}^{*}\|^{2}+\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+8\alpha^{4}n^{4}G^{2}L^{2}

Now, iterating this for K/2K/2 epoch pairs, we get

𝒙nK𝒙2\displaystyle\|{\bm{x}}^{K}_{n}-{\bm{x}}^{*}\|^{2} (1nαμ)K/2𝒙01𝒙2+K2μ1α3nL2G2(1+3αLn2)2+4Kα4n4G2L2\displaystyle\leq\left(1-n\alpha\mu\right)^{K/2}\|{\bm{x}}^{1}_{0}-{\bm{x}}^{*}\|^{2}+\frac{K}{2}\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+4K\alpha^{4}n^{4}G^{2}L^{2}
enαμK/2𝒙01𝒙2+K2μ1α3nL2G2(1+3αLn2)2+4Kα4n4G2L2\displaystyle\leq e^{-n\alpha\mu K/2}\|{\bm{x}}^{1}_{0}-{\bm{x}}^{*}\|^{2}+\frac{K}{2}\mu^{-1}\alpha^{3}nL^{2}G^{2}(1+3\alpha Ln^{2})^{2}+4K\alpha^{4}n^{4}G^{2}L^{2}
enαμK/2𝒙01𝒙2+Kμ1α3nL2G2+9Kμ1α5n5L4G2+4Kα4n4G2L2\displaystyle\leq e^{-n\alpha\mu K/2}\|{\bm{x}}^{1}_{0}-{\bm{x}}^{*}\|^{2}+K\mu^{-1}\alpha^{3}nL^{2}G^{2}+9K\mu^{-1}\alpha^{5}n^{5}L^{4}G^{2}+4K\alpha^{4}n^{4}G^{2}L^{2} (Since (1+a)22+2a2(1+a)^{2}\leq 2+2a^{2})

Substituting α=6lognKμnK\alpha=\frac{6\log nK}{\mu nK} gives us the desired result.
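To spell out the arithmetic of this last substitution: with \alpha=\frac{6\log nK}{\mu nK}, the contraction term becomes

e^{-n\alpha\mu K/2}=e^{-3\log nK}=\frac{1}{(nK)^{3}},

and the leading remainder term becomes

K\mu^{-1}\alpha^{3}nL^{2}G^{2}=\frac{216L^{2}G^{2}\log^{3}nK}{\mu^{4}n^{2}K^{2}};

the remaining terms are handled in the same way.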

Appendix H Additional Experiments

Figure 3: Dependence of convergence rates on the number of epochs K for logistic regression. The figures show the median and inter-quartile range over 10 runs of each algorithm, with random initializations and random permutation seeds (note that IGD exhibits extremely small variance). We set n=800, so that n\gg K and hence the higher order terms in K dominate the convergence rates. Note that both axes are on a logarithmic scale.

Although our theoretical guarantees for FlipFlop hold only for quadratic objectives, we conjecture that FlipFlop may also improve convergence for other classes of functions whose Hessians are smooth near the minimizer. To investigate this, we also ran experiments on 1-dimensional logistic regression. As Figure 3 shows, the convergence rates are very similar to those observed on quadratic functions. The data was generated synthetically so that the objective function is strongly convex and well conditioned near the minimizer. Note that the logistic loss is not strongly convex on linearly separable data; therefore, to make the loss strongly convex, we ensured that the data was not linearly separable. Specifically, the dataset was the following: every datapoint was z=\pm 1, and its label was y={\mathbbm{1}}_{z>0} with probability 3/4 and y={\mathbbm{1}}_{z<0} with probability 1/4. Framing this as an optimization problem, we have

minxF(x):=𝔼[ylog(h(xz))(1y)log(1h(xz))],\displaystyle\min_{x}F(x):={\mathbb{E}}\left[-y\log(h(xz))-(1-y)\log(1-h(xz))\right],

where h(xz)=1/(1+e^{-xz}). Note that x=\log 3 is the minimizer of this function, which is helpful because we can use it to compute the exact error. As in the experiment on quadratic functions, n was set to 800, and the step size was set in the same regime as in Theorems 4, 5, and 6.
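For reference, a minimal sketch of this experiment could look as follows; the epoch count, the random seed, the choice \mu\approx 1 in the step size, and all function names are our own placeholders, and the error is measured against the population minimizer noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 800, 100  # K is illustrative; the figure sweeps over many values of K

# Non-separable 1-D data: z = +/-1, with the "correct" label kept w.p. 3/4.
z = rng.choice([-1.0, 1.0], size=n)
y = np.where(rng.random(n) < 0.75, (z > 0), (z < 0)).astype(float)

def grad(x, i):
    """Per-sample logistic-loss gradient at the scalar parameter x."""
    h = 1.0 / (1.0 + np.exp(-x * z[i]))
    return (h - y[i]) * z[i]

x_star = np.log(3.0)  # population minimizer of F, as noted above

def run(order_fn, alpha):
    """Permutation-based SGD; order_fn(k) gives the index order for epoch k."""
    x = 0.0
    for k in range(K):
        for i in order_fn(k):
            x -= alpha * grad(x, i)
    return abs(x - x_star)

fwd = np.arange(n)
igd = lambda k: fwd                                     # Incremental GD
flipflop = lambda k: fwd if k % 2 == 0 else fwd[::-1]   # FlipFlop with IGD

alpha = 6 * np.log(n * K) / (n * K)  # step-size regime of Theorem 6 with mu ~ 1
print("IGD:", run(igd, alpha), " FlipFlop:", run(flipflop, alpha))
```

On runs like this, one would expect FlipFlop to end up closer to the minimizer than plain Incremental GD, qualitatively matching Figure 3.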