Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos
University of Wisconsin-Madison
Correspondence to: Shashank Rajput <[email protected]>
Abstract
A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling.
However, is random optimal?
We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent.
We first show that for 1-dimensional strongly convex functions with smooth second derivatives, there exist
permutations that offer exponentially faster convergence compared to random.
However, for general strongly convex functions, random permutations are optimal.
Finally, we show that for quadratic, strongly-convex functions, there are easy-to-construct permutations that lead to accelerated convergence compared to random.
Our results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.
1 Introduction
Finite sum optimization seeks to solve the following:
$$\min_{\mathbf{x} \in \mathbb{R}^d} \; F(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}). \qquad (1)$$
Stochastic gradient descent (SGD) approximately solves finite sum problems, by iteratively updating the optimization variables according to the following rule:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f_{i_t}(\mathbf{x}_t), \qquad (2)$$
where $\alpha$ is the step size and $i_t$ is the index of the function sampled at iteration $t$.
There exist various ways of sampling $i_t$, with the most common being with- and without-replacement sampling. In the former, $i_t$ is chosen uniformly at random from $\{1, \dots, n\}$, and in the latter, $i_t$ represents the $t$-th element of a random permutation of $\{1, \dots, n\}$. We henceforth refer to these two SGD variants as vanilla and permutation-based, respectively.
Although permutation-based SGD has been widely observed to perform better in practice (Bottou, 2009; Recht & Ré, 2012; 2013), the vanilla version has attracted the vast majority of theoretical analysis. This is because, at each iteration, the update is in expectation a scaled version of the true gradient, allowing for simple performance analyses of the algorithm, e.g., see (Bubeck et al., 2015).
Permutation-based SGD has resisted a tight analysis for a long time.
However, a recent line of breakthrough results provides the first tight convergence guarantees for several classes of convex functions (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020; Mishchenko et al., 2020; Ahn et al., 2020; Nguyen et al., 2020). These recent studies mainly focus on two variants of permutation-based SGD where (1) a new random permutation is sampled at each epoch (also known as Random Reshuffle) (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020), and (2) a random permutation is sampled once and is reused throughout all SGD epochs (Single Shuffle) (Safran & Shamir, 2019; Mishchenko et al., 2020; Ahn et al., 2020).
Perhaps interestingly, Random Reshuffle and Single Shuffle exhibit different convergence rates, and a performance gap that varies across function classes. In particular, when run for $K$ epochs, the convergence rate for strongly convex functions is $\tilde{O}\big(1/(nK^2)\big)$ for both Random Reshuffle and Single Shuffle (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020).
However, when run specifically on strongly convex quadratics, Random Reshuffle enjoys an accelerated rate, whereas Single Shuffle does not (Safran & Shamir, 2019; Rajput et al., 2020; Ahn et al., 2020; Mishchenko et al., 2020). All the above rates are accompanied by matching lower bounds, at least up to constants and sometimes log factors (Safran & Shamir, 2019; Rajput et al., 2020).
Table 1: Convergence rates of Random Reshuffle (RR), Single Shuffle (SS), and Incremental Gradient Descent (IGD) on strongly convex quadratics: plain vs. FlipFlop. Lower bounds for the "plain" versions are taken from (Safran & Shamir, 2019). When the training set is much larger than the number of epochs, which arguably is the case in practice, the convergence rates of Random Reshuffle, Single Shuffle, and Incremental Gradient Descent are , , and respectively.
On the other hand, by combining these methods with FlipFlop the convergence rates become faster, i.e., , , and , respectively.
From the above we observe that reshuffling at the beginning of every epoch may not always help.
But then there are cases where Random Reshuffle is faster than Single Shuffle, implying that certain ways of generating permutations are more suited for certain subfamilies of functions.
The goal of our paper is to take a first step into exploring the relationship between convergence rates and the particular choice of permutations. We are particularly interested in understanding whether random permutations are as good as optimal ones, or whether SGD can experience faster rates with carefully crafted permutations. As we see in the following, the answer to the above is not straightforward, and depends heavily on the function class at hand.
Our Contributions:
We define permutation-based SGD to be any variant of the iterates in (2) where the permutation of the functions used in each epoch can be generated, at the start of that epoch, deterministically, randomly, or with a combination of the two. For example, Single Shuffle, Random Reshuffle, and Incremental Gradient Descent (IGD) are all permutation-based SGD variants (see Algorithm 1).
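For concreteness, the following is a minimal Python sketch of this family (our own illustration of the structure of Algorithm 1, not the paper's pseudocode verbatim); the three classical variants differ only in the `permute` argument:

```python
import numpy as np

def permutation_based_sgd(grads, x0, lr, num_epochs, permute):
    """`grads[i]` returns the gradient of f_i; `permute(k, prev)` returns the
    permutation for epoch k (deterministic, random, or a mix of the two)."""
    x, perm = np.asarray(x0, dtype=float), None
    for k in range(num_epochs):
        perm = permute(k, perm)
        for i in perm:                  # one pass over all n components
            x = x - lr * grads[i](x)
    return x

# The variants differ only in how each epoch's permutation is produced:
n, rng = 10, np.random.default_rng(0)
igd              = lambda k, prev: np.arange(n)                         # fixed order
single_shuffle   = lambda k, prev: rng.permutation(n) if k == 0 else prev
random_reshuffle = lambda k, prev: rng.permutation(n)                   # fresh shuffle
```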
We first want to understand—even in the absence of computational constraints in picking the optimal permutations—what is the fastest rate one can get for permutation-based SGD? In other words, are there permutations that are better than random in the eyes of SGD?
Perhaps surprisingly, we show that there exist
permutations that may offer up to exponentially faster convergence than random permutations, but for a limited set of functions.
Specifically, we show this for 1-dimensional functions (Theorem 1).
However, such exponential improvement is no longer possible in higher dimensions (Theorem 2), or for general strongly convex objectives (Theorem 3), where random is optimal.
The above highlights that an analysis of how permutations affect convergence rates needs to be nuanced enough to account for the structure of the functions at hand. Otherwise, in the absence of further assumptions, random permutations may just appear to be as good as optimal.
In this work, we further identify a subfamily of convex functions for which there exist easy-to-generate permutations that lead to accelerated convergence.
We specifically introduce a new technique, FlipFlop, which can be used in conjunction with existing permutation-based methods, e.g., Random Reshuffle, Single Shuffle, or Incremental Gradient Descent, to provably improve their convergence on quadratic functions (Theorems 4, 5, and 6). The way that FlipFlop works is rather simple: every even epoch uses the flipped (or reversed) version of the previous epoch’s permutation. The intuition behind why FlipFlop leads to faster convergence is as follows.
Towards the end of an epoch, the contribution of earlier gradients gets attenuated.
To counter this, we flip the permutation for the next epoch so that every function’s contribution is diluted (approximately) equally over the course of two consecutive epochs.
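Concretely, the permutation schedule can be generated as below (a minimal sketch of our own; the function name and the `base` argument are illustrative, not from the paper):

```python
import numpy as np

def flipflop_permutations(n, num_epochs, base="single_shuffle", seed=0):
    """Yield the permutation used in each epoch. Odd-numbered epochs come from
    the base scheme (SS / RR / IGD); each even-numbered epoch replays the
    previous epoch's permutation in reverse."""
    rng = np.random.default_rng(seed)
    perm = np.arange(n) if base == "igd" else rng.permutation(n)
    for epoch in range(num_epochs):
        if epoch % 2 == 0:                       # a "fresh" epoch
            if base == "random_reshuffle" and epoch > 0:
                perm = rng.permutation(n)        # RR: reshuffle before each pair
            yield perm
        else:
            yield perm[::-1]                     # the flipped epoch
```

For Single Shuffle and Incremental Gradient Descent the base permutation is fixed, so the schedule simply alternates between one order and its reverse; for Random Reshuffle, a fresh permutation is drawn in every odd epoch and flipped in the following one.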
FlipFlop demonstrates that finding better permutations for specific classes of functions might be computationally easy.
We summarize FlipFlop’s convergence rates in Table 1 and report the results of numerical verification in Section 6.2.
Note that in this work, we focus on the dependence of the error on the number of iterations, and in particular, the number of epochs.
However, we acknowledge that its dependence on other parameters like the condition number is also very important.
We leave such analysis for future work.
Notation:
We use lowercase for scalars (e.g., $x$), lower boldface for vectors (e.g., $\mathbf{x}$), and upper boldface for matrices (e.g., $\mathbf{X}$).
2 Related Work
Gürbüzbalaban et al. (2019a; b) provided the first theoretical results establishing that Random Reshuffle and Incremental Gradient Descent (and hence Single Shuffle) were indeed faster than vanilla SGD, as they offered an asymptotic rate of for strongly convex functions, which beats the convergence rate of for vanilla SGD when . Shamir (2016) used techniques from online learning and transductive learning theory to prove an optimal convergence rate of for the first epoch of Random Reshuffle (and hence Single Shuffle). Later, Haochen & Sra (2019) also established a non-asymptotic convergence rate of , when the objective function is quadratic, or has smooth Hessian.
Nagaraj et al. (2019) used a very interesting iterate coupling based approach to give a new upper bound on the error rate of Random Reshuffle, thus proving for the first time that for general strongly convex smooth functions,
it converges faster than vanilla SGD in all regimes of $n$ and $K$.
This was followed by (Safran & Shamir, 2019), where the authors were able to establish the first lower bounds, in terms of both $n$ and $K$, for Random Reshuffle.
However, there was a gap in these upper and lower bounds. The gap in the convergence rates was closed by Rajput et al. (2020), who showed that the upper bound given by Nagaraj et al. (2019) and the lower bound given by Safran & Shamir (2019) were both tight up to logarithmic terms.
For Single Shuffle, Mishchenko et al. (2020) and Ahn et al. (2020) showed an upper bound of , which matched the lower bound given earlier by (Safran & Shamir, 2019), up to logarithmic terms. Ahn et al. (2020) and Mishchenko et al. (2020) also proved tight upper bounds for Random Reshuffle, with a simpler analysis and using more relaxed assumptions than (Nagaraj et al., 2019) and (Rajput et al., 2020). In particular, the results by Ahn et al. (2020) work under the PŁ condition and do not require individual component convexity.
Incremental Gradient Descent on strongly convex functions has also been studied well in literature (Nedić & Bertsekas, 2001; Bertsekas, 2011; Gürbüzbalaban et al., 2019a).
More recently, Nguyen et al. (2020) provide a unified analysis for all permutation-based algorithms. The dependence of their convergence rates on the number of epochs is also optimal for Incremental Gradient Descent, Single Shuffle and Random Reshuffle.
There has also been some recent work on the analysis of Random Reshuffle on non-strongly convex functions and non-convex functions. Specifically, Nguyen et al. (2020) and Mishchenko et al. (2020) show that even there, Random Reshuffle outperforms SGD under certain conditions. Mishchenko et al. (2020) show that Random Reshuffle and Single Shuffle beat vanilla SGD on non-strongly convex functions after epochs, and that Random Reshuffle is faster than vanilla SGD on non-convex objectives if the desired error is .
Speeding up convergence by combining without replacement sampling with other techniques like variance reduction (Shamir, 2016; Ying et al., 2020) and momentum (Tran et al., 2020) has also received some attention. In this work, we solely focus on the power of “good permutations” to achieve fast convergence.
3 Preliminaries
We will use combinations of the following assumptions:
Assumption 1 (Component convexity).
$f_i$'s are convex.
Assumption 2 (Component smoothness).
$f_i$'s are $L$-smooth, i.e., $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$.
Note that Assumption 2 immediately implies that $F$ also has $L$-Lipschitz gradients: $\|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$.
Assumption 3 (Strong convexity). $F$ is $\mu$-strongly convex, i.e., $F(\mathbf{x}) \ge F(\mathbf{y}) + \langle \nabla F(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2$ for all $\mathbf{x}, \mathbf{y}$.
We denote the condition number by $\kappa$, which is defined as $\kappa := L/\mu$.
It can easily be seen that $\kappa \ge 1$ always.
Let $\mathbf{x}^*$ denote the minimizer of Eq. (1), that is, $\mathbf{x}^* := \arg\min_{\mathbf{x} \in \mathbb{R}^d} F(\mathbf{x})$.
We will study permutation-based algorithms in the constant step size regime, that is, the step size is chosen at the beginning of the algorithm, and then remains fixed throughout.
We denote the iterate after the $i$-th iteration of the $k$-th epoch by $\mathbf{x}^k_i$. Hence, the initialization point is $\mathbf{x}^1_0$. Similarly, the permutation of $(1, \dots, n)$ used in the $k$-th epoch is denoted by $\sigma_k$, and its $i$-th ordered element is denoted by $\sigma_k(i)$. Note that if the ambient space is 1-dimensional, then we represent the iterates and the minimizer using non-bold characters, i.e., $x^k_i$ and $x^*$, to remain consistent with the notation.
In the following, due to lack of space, we only provide sketches of the full proofs, when possible.
The full proofs of the lemmas and theorems are provided in the Appendix.
4 Exponential Convergence in 1-Dimension
In this section, we show that there exist
permutations for Hessian-smooth 1-dimensional functions that lead to exponentially faster convergence compared to random.
Assumption 4 (Component Hessian-smoothness).
$f_i$'s have $L_H$-smooth second derivatives, that is, $|f_i''(x) - f_i''(y)| \le L_H |x - y|$ for all $x, y \in \mathbb{R}$.
We also define the following instance dependent constants: , , and .
Theorem 1.
Let Assumptions 1, 2, 3, and 4 hold.
Let and be as defined above.
If , then there exists a sequence of permutations such that using those permutations from any initialization point gives the error
where .
Figure 1: (A graphical depiction of Theorem 1's proof sketch.) Assume that the minimizer is at the origin. The proof of Theorem 1 first shows that there exists an initialization and a sequence of permutations such that, using those, we get to the exact minimizer. Let the sequence of iterates for this run be . Consider a parallel run, which uses the same sequence of permutations, but an arbitrary initialization point. Let this sequence be . The figure shows how converges to the exact optimum, and the distance between and decreases exponentially, leading to an exponential convergence for .
An important thing to notice in the theorem statement is that the sequence of permutations only depends on the function, and not on the initialization point . This implies that for any such function, there exists a sequence of permutations , which gives exponentially fast convergence regardless of the initialization. Note that the convergence rate is slower than Gradient Descent, for which the constant '' would be larger. However, here we are purely interested in the convergence rates of the best permutations and their (asymptotic) dependence on .
Proof sketch
The core idea is to establish that there exists an initialization point (close to the minimizer $x^*$) and a sequence of permutations such that starting from that point and using that sequence of permutations leads us exactly to the minimizer. Once we have proved this, we show that if two parallel runs of the optimization process are initialized from two different iterates, and they are coupled so that they use the exact same permutations, then they approach each other at an exponential rate. Thus, starting from any initialization point and using the same permutation sequence, SGD converges to the minimizer at an exponential rate. See Figure 1 for a graphical depiction of this sketch. We note that the figure is not the result of an actual run, but only serves to explain the proof sketch.
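The coupling step is easy to see numerically for quadratics, where each SGD step is affine, so the gap between two coupled runs contracts by exactly $(1 - \alpha a_i)$ per step, independently of which permutations are chosen. Below is a toy sketch (our own arbitrary two-component example, not the construction from the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = np.array([1.0, 3.0]), np.array([2.0, -2.0])  # f_i = a_i/2 x^2 + b_i x, sum(b)=0
lr = 0.05

def epoch(x, perm):
    for i in perm:
        x -= lr * (a[i] * x + b[i])
    return x

x, y = 5.0, -7.0                         # two arbitrary initializations
for _ in range(20):
    perm = rng.permutation(2)            # the SAME permutation for both runs
    x, y = epoch(x, perm), epoch(y, perm)
print(abs(x - y))                        # = 12 * (0.95 * 0.85)**20 ≈ 0.17
```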
5 Lower Bounds for Permutation-based SGD
The result in the previous section leads us to wonder if exponentially fast convergence can be also achieved in higher dimensions. Unfortunately, for higher dimensions, there exist strongly convex quadratic functions for which there does not exist any sequence of permutations that lead to an exponential convergence rate. This is formalized in the following theorem.
Theorem 2.
For any ( must be even), there exists a -dimensional strongly convex function which is the mean of convex quadratic functions, such that for every permutation-based algorithm with any step size,
This theorem shows that we cannot hope to develop constant step size algorithms that exhibit exponentially fast convergence in multiple dimensions.
Proof sketch
In this sketch, we explain a simpler version of the theorem, which works for $n = 2$ and two dimensions. Consider $f_1, f_2$ such that
Hence, $F = \frac{1}{2}(f_1 + f_2)$ has its minimizer at the origin.
Each epoch has two possible permutations, $(f_1, f_2)$ or $(f_2, f_1)$. Working out the details manually, it can be seen that regardless of the permutation, the sum of the coordinates keeps increasing. This can be used to get a lower bound on the error term; a hypothetical instance of this behavior is sketched below.
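To make this tangible, here is a hypothetical instance consistent with the sketch (our own choice of components; the paper's exact construction is in Appendix C). Both components are convex quadratics and $F$ is strongly convex with minimizer at the origin, yet the coordinate sum $s = x + y$ obeys $s \leftarrow (1 - \alpha)s + \alpha^2$ under either per-epoch order, so it increases toward $\alpha$ instead of staying at $0$:

```python
import numpy as np

# Hypothetical 2-D instance (our own, not the paper's):
#   f_1(x, y) = y^2/2 + (x - y),   f_2(x, y) = x^2/2 - (x - y),
# so F = (f_1 + f_2)/2 = (x^2 + y^2)/4 has its minimizer at the origin.
A = [np.diag([0.0, 1.0]), np.diag([1.0, 0.0])]
b = [np.array([1.0, -1.0]), np.array([-1.0, 1.0])]
lr = 0.1

def epoch(z, order):
    for i in order:
        z = z - lr * (A[i] @ z + b[i])
    return z

z = np.zeros(2)
for k in range(100):
    order = [0, 1] if k % 2 == 0 else [1, 0]   # either order adds lr^2 to the sum
    z = epoch(z, order)
print(z.sum())   # ≈ lr = 0.1, so the error ||z||^2 stalls at Ω(lr^2)
```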
Next, we show that even in one dimension, individual function convexity might be necessary to obtain faster rates than Random Reshuffle.
Theorem 3.
There exists a 1-dimensional strongly convex function $F$ which is the mean of two quadratic functions $f_1$ and $f_2$, such that one of the component functions is non-convex, and every permutation-based algorithm with constant step size gives an error of at least
Proof sketch
The idea behind the construction is to make one of the two component functions strongly concave. The farther the iterate is from that component's maximum, the more strongly the component pushes the iterate away. Hence, it essentially keeps increasing the deviation in each epoch, which leads to a slow convergence rate.
In the setting where the individual $f_i$ may be non-convex, Nguyen et al. (2020) and Ahn et al. (2020) show that Single Shuffle, Random Reshuffle, and Incremental Gradient Descent achieve the error rate of , for a fixed . In particular, their results only need the component functions to be smooth, and hence they apply to the function from Theorem 3. The theorem above essentially shows that when , this is the best possible error rate for any permutation-based algorithm, deterministic or random. Hence, at least for , the three algorithms are optimal when the component functions can possibly be non-convex. However, note that here we are only considering the dependence of the convergence rate on $K$. It is possible that these are not optimal if we further take into account the dependence of the convergence rate on both $n$ and $K$. Indeed, if we consider the dependence on $n$ as well, Incremental Gradient Descent has a convergence rate of (Safran & Shamir, 2019), whereas the other two have a convergence rate of (Ahn et al., 2020).
6 Flipping Permutations for Faster Convergence in Quadratics
In this section, we introduce a new algorithm, FlipFlop, which can improve the convergence rate of Single Shuffle, Random Reshuffle, and Incremental Gradient Descent on strongly convex quadratic functions.
The following theorem gives the convergence rate of FlipFlop with Single Shuffle:
Assumption 5 (Component quadratics).
$f_i$'s are quadratic.
Theorem 4.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Single Shuffle
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
(4)
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Single Shuffle:
(5)
Note that both the terms in Eq. (4) are smaller than the term in Eq. (5). In particular, when and is fixed as we vary , the RHS of Eq. (5) decays as , whereas the RHS of Eq. (4) decays as . Otherwise, when and is fixed as we vary , the RHS of Eq. (5) decays as , whereas the RHS of Eq. (4) decays as . Hence, in both the cases, FlipFlop with Single Shuffle outperforms Single Shuffle.
The next theorem shows that FlipFlop improves the convergence rate of Random Reshuffle:
Theorem 5.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Random Reshuffle
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Random Reshuffle:
Hence, we see that in the regime when , which happens when the number of components $n$ in the finite sum $F$ is much larger than the number of epochs $K$, FlipFlop with Random Reshuffle is much faster than vanilla Random Reshuffle.
Note that the theorems above do not contradict Theorem 2, because for a fixed , both the theorems above give a convergence rate of .
We also note here that the theorems above need the number of epochs to be much larger than , in which regime (full) Gradient Descent performs better than with- or without-replacement SGD, and hence GD should be preferred over SGD in that case. However, we think that this requirement on the number of epochs is a limitation of our analysis, rather than of the algorithms themselves.
Figure 2: Dependence of convergence rates on the number of epochs for quadratic functions. The figures show the median and inter-quartile range after 10 runs of each algorithm, with random initializations and random permutation seeds (note that SS and IGD exhibit extremely small variance). We set , so that and hence the higher order terms of dominate the convergence rates. Note that both the axes are in logarithmic scale.
Finally, the next theorem shows that FlipFlop improves the convergence rate of Incremental Gradient Descent.
Theorem 6.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Incremental GD
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Incremental Gradient Descent:
In the next subsection, we give a short sketch of the proof of these theorems.
6.1 Proof sketch
In the proof sketch, we consider scalar quadratic functions; the same intuition carries over to multi-dimensional quadratics, but requires a more involved analysis. Let $f_i(x) = \frac{a_i}{2}x^2 + b_i x$. Assume that $F$ has its minimizer at $0$. This assumption is valid because it can be achieved by a simple translation of the origin (see (Safran & Shamir, 2019) for a more detailed explanation). This implies that $F'(0) = \frac{1}{n}\sum_{i=1}^n b_i = 0$.
For the sake of this sketch, assume $x^1_0 = 0$, that is, we are starting at the minimizer itself. Further, without loss of generality, assume that $\sigma_1 = (1, 2, \dots, n)$. Then, for the last iteration of the first epoch,
$$x^1_n = x^1_{n-1} - \alpha f_n'(x^1_{n-1}) = (1 - \alpha a_n)\, x^1_{n-1} - \alpha b_n.$$
Applying this to all iterations of the first epoch, we get
$$x^1_n = \prod_{i=1}^{n} (1 - \alpha a_i)\, x^1_0 - \alpha \sum_{i=1}^{n} \Big( \prod_{j=i+1}^{n} (1 - \alpha a_j) \Big) b_i. \qquad (6)$$
Substituting $x^1_0 = 0$, we get
$$x^1_n = -\alpha \sum_{i=1}^{n} \Big( \prod_{j=i+1}^{n} (1 - \alpha a_j) \Big) b_i. \qquad (7)$$
Note that the sum above is not weighted uniformly: $b_1$ is multiplied by $\prod_{j=2}^{n}(1 - \alpha a_j)$, whereas $b_n$ is multiplied by $1$. Because each factor $(1 - \alpha a_j) < 1$, we see that $b_1$'s weight is much smaller than $b_n$'s. If the weights were all $1$, then we would get $x^1_n = -\alpha \sum_{i=1}^n b_i = 0$, i.e., we would not move away from the minimizer. Since we want to stay close to the minimizer, we want the weights of all the $b_i$ to be roughly equal.
The idea behind FlipFlop is to add the complementary bias, roughly $-\alpha \sum_{i=1}^{n} \big( \prod_{j=1}^{i-1}(1 - \alpha a_j) \big) b_i$, in the next epoch, to counteract the bias in Eq. (7). To achieve this, we simply take the permutation that the algorithm used in the previous epoch and flip it for the next epoch. Roughly speaking, in the next epoch $b_1$ will get multiplied by $1$, whereas $b_n$ will get multiplied by $\prod_{j=1}^{n-1}(1 - \alpha a_j)$. Thus, over two epochs, both get scaled approximately the same.
To see this more concretely, we look at the first-order approximation of Eq. (7):
$$x^1_n \approx -\alpha \sum_{i=1}^{n} \Big( 1 - \alpha \sum_{j=i+1}^{n} a_j \Big) b_i = \alpha^2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_j b_i, \qquad (8)$$
where in the last step above we used the fact that $\sum_{i=1}^{n} b_i = 0$.
Now, let us see what happens in the second epoch if we use FlipFlop. Doing a recursion analogous to how we got Eq. (6), but reversing the order of the functions, we get:
$$x^2_n = \prod_{i=1}^{n} (1 - \alpha a_i)\, x^2_0 - \alpha \sum_{i=1}^{n} \Big( \prod_{j=1}^{i-1} (1 - \alpha a_j) \Big) b_i. \qquad (9)$$
Note that the product in the second term in the equation above is almost complementary to the product in Eq. (7): there, $b_i$ was weighted by the factors with indices $j > i$, whereas here it is weighted by the factors with indices $j < i$. This is because we flipped the order in the second epoch. This will play an important part in cancelling out much of the bias in Eq. (7).
Continuing on from Eq. (9), we again do a linear approximation similar to before, substitute Eq. (8) for $x^2_0 = x^1_n$, and use the fact that $\sum_{i=1}^{n} b_i = 0$:
$$x^2_n \approx \Big( 1 - \alpha \sum_{i=1}^{n} a_i \Big) x^1_n + \alpha^2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} a_j b_i.$$
We assume that $\alpha$ is small and hence we only focus on the quadratic terms:
$$x^2_n \approx \alpha^2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_j b_i + \alpha^2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} a_j b_i = \alpha^2 \sum_{i \neq j} a_j b_i.$$
Now, since $\sum_{i=1}^{n} b_i = 0$, we get
$$x^2_n \approx \alpha^2 \Big( \sum_{j=1}^{n} a_j \Big) \Big( \sum_{i=1}^{n} b_i \Big) - \alpha^2 \sum_{i=1}^{n} a_i b_i = -\alpha^2 \sum_{i=1}^{n} a_i b_i. \qquad (10)$$
Now, comparing the coefficients of the $\alpha^2$ terms in Eq. (8) and Eq. (10), we see that the former has $\Theta(n^2)$ terms whereas the latter has only $n$ terms. This correction of the error is exactly what eventually manifests into the faster convergence rate of FlipFlop.
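The cancellation above is easy to verify numerically. The following minimal sketch (our own; $n$, $\alpha$, and the random $a_i, b_i$ are arbitrary choices) runs two epochs from the minimizer with and without flipping, and compares the result against the prediction of Eq. (10):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr = 50, 1e-4
a = rng.uniform(0.5, 1.5, n)            # f_i(x) = a_i/2 * x^2 + b_i * x
b = rng.normal(size=n)
b -= b.mean()                           # sum(b) = 0, so F has its minimizer at 0

def epoch(x, order):
    for i in order:
        x -= lr * (a[i] * x + b[i])     # one SGD step on f_i
    return x

perm = rng.permutation(n)
same = epoch(epoch(0.0, perm), perm)        # same order twice (Single Shuffle)
flip = epoch(epoch(0.0, perm), perm[::-1])  # FlipFlop: second epoch reversed
print(abs(same), abs(flip))                 # flipping leaves a much smaller bias
print(flip, -lr**2 * np.sum(a * b))         # ≈ Eq. (10), up to O(lr^3) terms
```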
The main reason that the analysis for multidimensional quadratics is not as simple as the 1-dimensional analysis done above is that, unlike scalar multiplication, matrix multiplication is not commutative, and the AM-GM inequality is not true in higher dimensions (Lai & Lim, 2020; De Sa, 2020). One way to bypass this inequality is by using the following inequality for small enough $\alpha$:
$$\Big\| \prod_{i=1}^{n} (\mathbf{I} - \alpha \mathbf{A}_i) \Big\| \le 1 - \frac{\alpha \mu n}{2}.$$
Ahn et al. (2020) proved a stochastic version of this (see Lemma 6 in their paper). We prove a deterministic version in Lemma 3 (in the Appendix).
6.2 Numerical Verification
We verify the theorems numerically by running Random Reshuffle, Single Shuffle, and their FlipFlop versions on the task of mean computation. We randomly sample $n$ points from a $d$-dimensional sphere. Let the points be $\mathbf{a}_i$ for $i \in \{1, \dots, n\}$. Then, their mean is the solution to the following quadratic problem: $\min_{\mathbf{x}} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \|\mathbf{x} - \mathbf{a}_i\|^2$.
We solve this problem by using the given algorithms.
The results are reported in Figure 2. The results are plotted on a log-log graph, so that we can read off the dependence of the error on the power of $K$.
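For reference, a minimal sketch of this experiment is given below (our own reimplementation of the setup; the actual $n$, $d$, $K$, and step size used for Figure 2 are not recoverable from the text, so the values here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 10, 50                          # placeholder sizes; K even
lr = 0.5 / n                                   # assumed small constant step size
a = rng.normal(size=(n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)  # n points on the unit sphere
x_star = a.mean(axis=0)                        # minimizer of (1/n) sum ||x - a_i||^2 / 2

def run(flip):
    x = np.zeros(d)
    perm = np.random.default_rng(1).permutation(n)   # same SS base for both runs
    for k in range(K):
        order = perm[::-1] if (flip and k % 2 == 1) else perm
        for i in order:
            x -= lr * (x - a[i])               # gradient of f_i(x) = ||x - a_i||^2 / 2
    return np.linalg.norm(x - x_star) ** 2

print("SS:", run(False), "  FlipFlop+SS:", run(True))
```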
Note that since the points are sampled randomly, Incremental Gradient Descent essentially becomes Single Shuffle. Hence, to verify Theorem 6, we need 'hard' instances for Incremental Gradient Descent, and in particular we use the ones from Theorem 3 in (Safran & Shamir, 2019). These results are also reported on a log-log graph in Figure 2.
We also tried FlipFlop in the training of deep neural networks, but unfortunately we did not see a big speedup there. We also ran experiments on logistic regression for 1-dimensional artificial data, the results for which are in Appendix H. The code for all the experiments can be found at https://github.com/shashankrajput/flipflop .
7 Conclusion and Future Work
In this paper, we explore the theoretical limits of permutation-based SGD for solving finite sum optimization problems.
We focus on the power of good, carefully designed permutations and whether they can lead to a much better convergence rate than random.
We prove that for 1-dimensional strongly convex functions, good sequences of permutations indeed exist, and they lead to convergence that is exponentially faster than with random permutations.
We also show that unfortunately, this is not true for higher dimensions, and that for general strongly convex functions, random permutations might be optimal.
However, we think that for some subfamilies of strongly convex functions, good permutations might exist and may be easy to generate.
Towards that end, we introduce a very simple technique, FlipFlop, to generate permutations that lead to faster convergence on strongly convex quadratics. This is a black-box technique, that is, it does not look at the optimization problem to come up with the permutations, and it can be implemented easily. It serves as an example that for other classes of functions, there may exist other techniques for coming up with good permutations.
Finally, note that we only consider constant step sizes in this work for both upper and lower bounds.
Exploring regimes in which the step size changes, e.g., diminishing step sizes, is a very interesting open problem, which we leave for future work.
We think that the upper and lower bounds in this paper give some important insights and can help in the development of better algorithms or heuristics.
We strongly believe that under nice distributional assumptions on the component functions, there can exist good heuristics to generate good permutations, and this should also be investigated in future work.
References
Ahn et al. (2020)
Kwangjun Ahn, Chulhee Yun, and Suvrit Sra.
SGD with shuffling: Optimal rates without component convexity and
large epoch requirements, 2020.
Bertsekas (2011)
Dimitri P Bertsekas.
Incremental gradient, subgradient, and proximal methods for convex
optimization: A survey.
Optimization for Machine Learning, 2010(1-38):3, 2011.
Bottou (2009)
Léon Bottou.
Curiously fast convergence of some stochastic gradient descent
algorithms.
Unpublished open problem offered to the attendance of the SLDS 2009
conference, 2009.
URL
http://leon.bottou.org/papers/bottou-slds-open-problem-2009.
Bubeck et al. (2015)
Sébastien Bubeck et al.
Convex optimization: Algorithms and complexity.
Foundations and Trends® in Machine Learning,
8(3-4):231–357, 2015.
De Sa (2020)
Christopher M De Sa.
Random reshuffling is not always better.
Advances in Neural Information Processing Systems, 33, 2020.
Gürbüzbalaban et al. (2019a)
M Gürbüzbalaban, Asuman Ozdaglar, and Pablo A Parrilo.
Convergence rate of incremental gradient and incremental newton
methods.
SIAM Journal on Optimization, 29(4):2542–2565, 2019a.
Gürbüzbalaban et al. (2019b)
Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo.
Why random reshuffling beats stochastic gradient descent.
Mathematical Programming, pp. 1–36, 2019b.
Haochen & Sra (2019)
Jeff Haochen and Suvrit Sra.
Random shuffling beats SGD after finite epochs.
In International Conference on Machine Learning, pp. 2624–2633, 2019.
Lai & Lim (2020)
Zehua Lai and Lek-Heng Lim.
Recht-Ré noncommutative arithmetic-geometric mean conjecture is
false.
In International Conference on Machine Learning, pp. 5608–5617. PMLR, 2020.
Mishchenko et al. (2020)
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik.
Random reshuffling: Simple analysis with vast improvements.
ArXiv, abs/2006.05988, 2020.
Nagaraj et al. (2019)
Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli.
SGD without replacement: Sharper rates for general smooth convex
functions.
In International Conference on Machine Learning, pp. 4703–4711, 2019.
Nedić & Bertsekas (2001)
Angelia Nedić and Dimitri Bertsekas.
Convergence rate of incremental subgradient algorithms.
In Stochastic optimization: algorithms and applications, pp. 223–264. Springer, 2001.
Nesterov (2004)
Yurii Nesterov.
Introductory Lectures on Convex Optimization - A Basic
Course, volume 87 of Applied Optimization.
Springer, 2004.
ISBN 978-1-4613-4691-3.
doi: 10.1007/978-1-4419-8853-9.
URL https://doi.org/10.1007/978-1-4419-8853-9.
Nguyen et al. (2020)
Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten van
Dijk.
A unified convergence analysis for shuffling-type gradient methods.
arXiv preprint arXiv:2002.08246, 2020.
Nocedal & Wright (2006)
Jorge Nocedal and Stephen Wright.
Numerical optimization.
Springer Science & Business Media, 2006.
Rajput et al. (2020)
Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos.
Closing the convergence gap of sgd without replacement.
In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.
Recht & Ré (2012)
Benjamin Recht and Christopher Ré.
Beneath the valley of the noncommutative arithmetic-geometric mean
inequality: conjectures, case-studies, and consequences.
arXiv preprint arXiv:1202.4184, 2012.
Recht & Ré (2013)
Benjamin Recht and Christopher Ré.
Parallel stochastic gradient algorithms for large-scale matrix
completion.
Mathematical Programming Computation, 5(2):201–226, 2013.
Safran & Shamir (2019)
Itay Safran and Ohad Shamir.
How good is SGD with random shuffling?, 2019.
Safran & Shamir (2021)
Itay Safran and Ohad Shamir.
Random shuffling beats sgd only after many epochs on ill-conditioned
problems.
arXiv preprint arXiv:2106.06880, 2021.
Schneider (2016)
M. Schneider.
Probability inequalities for kernel embeddings in sampling without
replacement.
In AISTATS, 2016.
Shamir (2016)
Ohad Shamir.
Without-replacement sampling for stochastic gradient methods.
In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pp. 46–54, 2016.
Tran et al. (2020)
Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh.
Shuffling gradient-based methods with momentum.
arXiv preprint arXiv:2011.11884, 2020.
Ying et al. (2020)
Bicheng Ying, Kun Yuan, and Ali H Sayed.
Variance-reduced stochastic learning under random reshuffling.
IEEE Transactions on Signal Processing, 68:1390–1408, 2020.
Appendix A Discussion and Future Work
As the first paper (to the best of our knowledge) to theoretically analyze the optimality of random permutations, we limited our scope to a specific but common theoretical setting: strongly convex functions with constant step size. We think that future work can generalize the results of this paper to settings with non-convexity and variable step sizes, as well as to techniques like variance reduction, momentum, etc.
A.1 Lower bounds for variable step sizes
All the existing lower bounds (to the best of our knowledge) work in the constant step size regime (Safran & Shamir (2019); Rajput et al. (2020); Safran & Shamir (2021)). Thus, generalizing the lower bounds to variable step size algorithms would be a very important direction for future research.
However, proving lower bounds can be tricky when the step sizes are not constant, since the step size could potentially depend on the permutation and the current iterate. A more tractable setting for proving lower bounds could be the case when the step sizes follow a schedule over epochs, similar to what happens in practice.
A.2 Faster permutations for non-quadratic objectives
The analysis of FlipFlop leverages the fact that the Hessians of quadratic functions are constant. We think that the analysis of FlipFlop might be extended to strongly convex functions or even PŁ functions (which are non-convex in general), under some assumptions on the Lipschitz continuity of the Hessians, similar to how Haochen & Sra (2019) extended their analysis of quadratic functions to more general classes. A key take-away from FlipFlop is that we had to understand how permutation based SGD works specifically for quadratic functions, that is, we did a white-box analysis. In general, we feel that depending on the specific class of non-convex functions (say deep neural networks), practitioners would have to think about permutation-based SGD in a white-box fashion, to come up with better heuristics for shuffling.
In concurrent work (https://openreview.net/pdf?id=7gWSJrP3opB), it is shown that by greedily sorting stale gradients, one can find a permutation order that converges faster on some deep learning tasks. Hence, there do exist permutations better than random, even for deep learning tasks.
A.3 FlipFlop on Random Coordinate Descent
A shuffling scheme similar to FlipFlop has been used in random coordinate descent for faster practical convergence (see page 231 in Nocedal & Wright (2006)). This should be further investigated empirically and theoretically in future work.
Even though the current analysis of FlipFlop does not directly go through for random coordinate descent, we think the analysis can be adapted to work.
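As a toy illustration of such a scheme (our own sketch on a synthetic positive-definite quadratic, using exact coordinate minimization; this is not an analysis from the references above):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                 # SPD quadratic: minimize x'Ax/2 - b'x
b = rng.normal(size=d)
x = np.zeros(d)
for sweep in range(100):
    # alternate forward and backward coordinate orders, FlipFlop-style
    order = range(d) if sweep % 2 == 0 else range(d - 1, -1, -1)
    for j in order:
        # exact minimization over coordinate j (Gauss-Seidel step)
        x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
print(np.linalg.norm(A @ x - b))        # residual after the alternating sweeps
```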
Appendix B Proof of Theorem 1
In certain places in the proof, we would need that . To see how this is satisfied, note that we have assumed that in the theorem statement. Using the inequality in gives that .
In this proof, we assume that the minimizer of is at to make the analysis simpler. This assumption can be satisfied by simply translating the origin to the minimizer (Safran & Shamir, 2019).
There are three main components in the proof:
1.
Say that an epoch starts off at the minimizer. We prove that there exists at least one pair of permutations such that if we do two separate runs of the epoch, the first using the first permutation and the second using the second, then at the end of that epoch, the iterates corresponding to the two runs end up on opposite sides of the minimizer.
2.
There exists a sequence of permutations and a point in the neighborhood of the minimizer, such that intializing at that point and using these sequence of permutations, we converge exactly to the minimizer.
3.
Starting from any other point, we can couple the iterates with the iterates which were shown in the previous component, to get that these two sequences of iterates come close to each other exponentially fast.
We prove the first and second components in the Subsections B.1 and B.2; and conclude the proof in Subsection B.3 where we also prove the third component.
B.1 Permutations in one epoch
In this subsection, we prove that if are the iterates in an epoch such that , then there exists a permutation of functions such that . By the same logic, we show that
there exists a permutation of functions such that .
These will give us control over movement of iterates across epochs.
Order the gradients at the minimizer, , in decreasing order. WLOG assume that it is just . We claim that this permutation leads to .
We will need the following intermediate result. Let be such that . Assume and . Then,
(11)
that is, .
Because is the minimizer, we know that . Also, recall that .
There can be two cases:
1.
. This cannot be true because if , then
where we used the fact that if and is convex.
2.
Thus, . Now, consider the sequence such that and for , . Then because and , we get that for all (Using Ineq. (11)).
Hence, there is an such that , and further for all . Next, we repeat the process above for . That is, there can be two cases:
1.
. This cannot be true because if , then
2.
Thus, . Now, consider the sequence such that and for , . Then because and , we get that for all (Using Ineq. (11)).
Hence, there exists an integer such that . We have already proved that . Thus, we have that . We can continue repeating this process (apply the same two cases above for , and so on), to get that . We define to be this non-negative value of . Note that the following lemma gives us that the gradients are bounded by
Lemma 1.
Define and . If Assumptions 2 and 3 hold, and , then for any permutation-based algorithm (deterministic or random), we have
Because the gradients are bounded by , we get that .
Similarly, we can show that the reverse permutation leads to . We define to be this non-positive value of .
Because we have assumed that the gradients are bounded by , we get that .
B.2 Exact convergence to the minimizer
In this section, we show that there exists a point such that if we initialize there and follow a specific permutation sequence, we land exactly at the minimizer.
In particular, we will show the following: There exists a point in such that if we initialize there and follow a specific permutation sequence, then we land exactly at the minimizer.
We show this recursively. We will prove that there exists a point such that if the last epoch begins there, that is , then we land at the minimizer at the end of the last epoch, that is, . Then, we will show that there exists a point such that if the , then . Repeating this for , we get that there exists a point such that if we initialize the first epoch there, that is , then there is a permutation sequence such that ultimately .
We prove that any point can be reached at the end of an epoch by beginning the epoch at some point , that is if , then .
•
Case: . In this case, we show that . We have proved in the previous subsection that there exists a permutation such that if then .
Next, we have the following helpful lemma that we will also use later.
Lemma 2.
Let be a sequence of iterates in an epoch and be another sequence of iterates in an epoch such that both use the same permutation of functions. If , then
If we set in Lemma 2 and we follow the permutation , then we get that
(Since using and results in .)
where we used the fact that is the last step.
Thus, if and we follow the permutation , then
.
Next, note that
Looking above, we see that is a continuous function of ; is a continuous function of ; and so on.
Thus, using the fact that composition of continuous functions is continuous, we get that is also a continuous function of .
We have shown that if , then and if , then . Thus, using the fact that that is a continuous function of , we get that for any point , there is at least one point , such that leads to .
•
Case: . We can apply the same logic as above to show that there is at least one point , such that .
•
Case: . WLOG assume that . Let be the permutation such that if and the epoch uses this permutation, then we end up at .
If we set in Lemma 2 and we follow the permutation , then we get that
(Since using and results in .)
where we used the fact that is the last step.
Thus, if and we follow the permutation , then
.
Thus, using similar argument of continuity as the first case, we know that there is a point , such that leads to when we use the permutation .
B.3 Same sequence permutations get closer
In the previous subsection, we have shown that there exists a point and a sequence of permutations such that if and epoch uses permutation , then . In this subsection, we will show that if is initialized at any other point such that , then using the same permutations gives us that . For this we will repeatedly apply Lemma 2 on all the epochs. Assume that with .
Let be the sequence of iterates such that and uses the permutation sequence . Then, we know that . Let be the sequence of iterates such that and uses the same permutation sequence .
Then, using Lemma 2 gives us that . Thus, we get that
. Again applying Lemma 2 gives us that . Therefore, after applying it times, we get
Note that , and , with .
We showed earlier in Subsection B.1 that and . Therefore,
Further, we know that . Thus,
Substituting the value of completes the proof. Next we prove the lemmas used in this proof.
Define and . If Assumptions 2 and 3 hold, and , then for any permutation-based algorithm (deterministic or random), we have
This lemma says that for any permutation-based algorithm, the domain of iterates and the norm of the gradient stays bounded during the optimization. This means that we can assume bounds on norm of iterates and gradients, which is not true in general for unconstrained SGD. This makes analyzing such algorithms much easier, and hence this lemma can be of independent interest for proving future results for permutation-based algorithms.
Remark 1.
This lemma does not hold in general for vanilla SGD where sampling is done with replacement. Consider the example with two functions , and ; and . This satisfies Assumptions 2 and 3, but one may choose consecutively for arbitrary many iterations, which will lead the iterates a proportional distance away from the minimizer. This kind of situation can never happen for permutation-based SGD because we see every function exactly once in every epoch and hence no particular function can attract the iterates too much towards its minimizer, and by the end of the epochs most of the noise gets cancelled out.
Note that the requirement is stricter than the usual requirement , but we believe the lemma should also hold in that regime. For the current paper, however, this stronger requirement suffices.
Proof.
To prove this lemma, we show two facts: If for some epoch , , then a) and b) .
To see how a) and b) are sufficient to prove the lemma, assume that they are true.
Then, since the first epoch begins inside the bounded region , we see using b) that every subsequent epoch begins inside the same bounded region, that is as well.
Hence using a) we get that during these epochs, the iterates satisfy , which is the first part of the lemma. Further, this bound together with the gradient Lipshitzness directly gives the upper bound on the gradients.
Thus, all we need to do to prove this lemma is to prove a) and b), which we do next.
We will prove a) and b) for . Once we do this, using will give us the exact statement of the lemma.
Let for some epoch . Then, we try to find the minimum number of iterations needed so that . Within this region, the gradient is bounded by . Thus, the minimum number of iterations needed are . However,
(Using the fact that )
Thus, the minimum number of iterations needed to go outside the bound is more than the length of the epoch. This implies that within the epoch, , which proves a).
We prove b) next:
Note that is just the sum of all component gradients at , that is . Using this, we get
(12)
where we used gradient Lipschitzness (Assumption 2) in the last step.
To bound the first term above, we use the standard analysis of gradient descent on smooth, strongly convex functions as follows
where in the last step we used that since .
Substituting this inequality in Ineq. (12), we get
We have already proven a) that says that the iterates satisfy . Using gradient Lipschitzness, this implies that the gradient norms stay bounded by . Hence, . Using this,
Appendix C Proof of Theorem 2
To prove this theorem, we consider three step-size ranges and do a case-by-case analysis for each of them. For each range, we construct functions such that the convergence of any permutation-based algorithm is "slow" on the corresponding step-size range. The final lower bound is the minimum among the lower bounds obtained for the three regimes.
In this proof, we will use different notation from the rest of the paper because we work with scalars, and hence the superscript will denote the scalar power, and not the epoch number. We will use to denote the -th iterate of the -th epoch.
We will construct three functions , , and , each the mean of component functions, such that
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at . NOTE: This is the ‘key’ step-size range, and the proof sketch explained in the main paper corresponds to this function’s construction.
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at . The construction for this is also inspired by the construction for , but this is constructed significantly differently due to the different step-size range.
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at .
Then, the -dimensional function will show bad convergence in any step-size regime. This function will have minimizer at . Furthermore,
that is, , , , and are all -strongly convex and -smooth.
In the subsequent subsections, we prove the lower bounds for , , and separately.
NOTE: In Appendix C.4, we have discussed how the lower bound is actually invariant to initialization.
C.1 Lower bound for ,
We will work in -dimensions ( is even) and represent a vector in the space as . These and are not related to the vectors used by and later, we only use and to make the proof for this part easier to understand.
We start off by defining the component functions:
For , define
Thus, . This function has minimizer at .
Let denote at the -th iteration in -th epoch. We initialize at .
For the majority of this proof, we will work inside a given epoch, so we will skip the subscript denoting the epoch. We denote the -th iterate within the epoch as and the coordinates as
In the current epoch, let be the permutation of used.
For any , let and be indices such that and . Let us consider the case that (the case when will be analogous). Then, it can be seen that
(14)
Hence,
In the other case, when , we will get the same inequality. Let and denote the value of and at the -th epoch. Then, recalling that was initialized to , we use the inequality above to get
(15)
Since this inequality is valid for all , we get that
(16)
Note that if and , then .
Using this in (16), we get
We consider two cases:
1.
Case A, : It can be verified that is an increasing function of when . Noting that we are working in the range when , then
2.
Case B, : In this case,
C.2 Lower bound for ,
We will work in -dimensions and represent a vector in the space as .
We start off by defining the component functions:
For , define
Thus, . This function has minimizer at .
Let denote at the -th iteration in -th epoch. We initialize at . For the majority of this proof, we will work inside a given epoch. We denote the -th iterate within the epoch as and the coordinates as
In the current epoch, let be the permutation of used. Let be the index such that , that is, is the last element of the permutation . Then at the end of the epoch,
(17)
For some , let be the integer such that , that is is the -th element in the permutation .
Then for any and any epoch,
Then,
Note that the above is independent of , and hence applicable for all epochs. Applying it recursively and noting that we initialized , we get
Note that is just from the previous epoch. Hence we can substitute the inequality above for in (17). Thus,
This gives us that for any .
C.3 Lower bound for ,
Consider the function , where for all . Note that the gradient of any function at is just . Hence, regardless of the permutation, if we start the shuffling based gradient descent method at , we get that
In the case when , we see that
for . Finally, in the case when , we see that
C.4 Discussion about initialization
The lower bound partitions the step size into three ranges:
•
In the step size ranges and , the initializations are done at the minimizer and it is shown that any permutation-based algorithm will still move away from the minimizer. The choice of initializing at the minimizer was solely for the convenience of analysis and calculations, and the proof should work for any other initialization as well.
Furthermore, the effect of initializing at any arbitrary (non-zero) point will decay exponentially fast with epochs anyway. To see how, note that every epoch can be treated as steps of full gradient descent and some noise, and hence the full gradient descent part will essentially keep decreasing the effect of initialization exponentially, and what we would be left with is the noise in each epoch. Thus, it was more convenient for us to just assume initialization at the minimizer and only focus on the noise in each epoch.
•
The step size range can be divided into two parts, and .
For the range , we essentially show that the step size is too small to make any meaningful progress towards the minimizer. Hence, instead of initializing at , initializing at any other arbitrary (non-zero) point will also give the same slow convergence rate.
For the range , we show that the optimization algorithm will in fact diverge, since the step size is too large. Hence, even here, any other arbitrary (non-zero) initialization point will also lead to divergence.
Appendix D Proof of Theorem 3
Define and . Thus, . This function has minimizer at . For this proof, we will use the convention that is the iterate after the -th iteration of the -th epoch. Further, the number in the superscript will be the scalar power. For example .
Initialize at . Then at epoch , there are two possible permutations: and . In the first case , , we get that after the first iteration of the epoch,
Continuing on, in the second iteration, we get
Note that . Thus, .
Similarly, for the other possible permutation, , we get . Thus, regardless of what permutation we use, we get that
Hence, recalling that we initialized at , we get
(18)
Now, if , then
and hence, . Otherwise, if , then continuing on from (18),
Appendix E Proof of Theorem 4
Let be such that its minimizer is at the origin.
This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019).
Since the are convex quadratics, we can write them as , where are symmetric, positive-semidefinite matrices.
We can omit the constants because they do not affect the minimizer or the gradients.
Because we assume that the minimizer of is at the origin, we get that
(19)
Let be the random permutation of generated at the beginning of the algorithm.
Then for , epoch sees the functions in the following sequence:
whereas epoch sees the functions in the reverse sequence:
We define and for convenience of notation.
We start off by computing the progress made during an even indexed epoch. Since the even epochs use the reverse permutation, we get
( is used at the last iteration of even epochs.)
We recursively apply the same procedure as above to the whole epoch to get the following
(20)
where the product of matrices is defined as if and otherwise. Similar to Eq. (20), we can compute the progress made during an odd indexed epoch. Recall that the only difference will be that the odd indexed epochs see the permutations in the order instead of . After doing the computation, we get the following equation:
Combining the results above, we can get the total progress made after the pair of epoch and :
(21)
In the sum above, the first term will have an exponential decay, hence we need to control the next two terms.
We denote the sum of the terms as (see the definition below) and we will control its norm later in this proof.
To see where the iterates end up after epochs, we simply set in Eq. 21 and then keep applying the equation recursively to preceding epochs. Then, we get
Taking squared norms and expectations on both sides, we get
(Since )
We assumed that the functions have -Lipschitz gradients (Assumption 2). This translates to having maximum eigenvalue less than . Hence, if , we get that is positive semi-definite with maximum eigenvalue bounded by 1. Hence, . Using this and the fact that for matrices and , , we get that
We handle the two terms above separately. For the first term, we have the following bound:
Lemma 3.
If , then
Note that .
We also have the following lemma that bounds the expected squared norm of .
Lemma 4.
If , then
where , and the expectation is taken over the randomness of .
Appendix F Proof of Theorem 5
The proof is similar to that of Theorem 4, except that here we leverage the independence of the random permutations in every other epoch. The setup is also the same as in Theorem 4, but we explain it again here for the completeness of this proof.
Let be such that its minimizer is at the origin.
This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019).
Since the are convex quadratics, we can write them as , where are symmetric, positive-semidefinite matrices.
We can omit the constants because they do not affect the minimizer or the gradients.
Because we assume that the minimizer of is at the origin, we get that
(28)
Let be the random permutation of sampled in epoch .
Then epoch sees the functions in the following sequence:
whereas epoch sees the functions in the reverse sequence:
We define and for convenience of notation. We start off by computing the progress made during an even indexed epoch. Since the even epochs use the reverse permutation of , we get
( used at the last iteration of epoch .)
We recursively apply the same procedure as above to the whole epoch to get the following
(29)
where the product of matrices is defined as if and otherwise. Similar to Eq. (29), we can compute the progress made during an odd indexed epoch. Recall that the only difference will be that the odd indexed epochs see the permutations in the order instead of . After doing the computation, we get the following equation:
Combining the results above, we can get the total progress made after the pair of epoch and :
(30)
In the sum above, the first term will have an exponential decay, hence we need to control the next two terms.
Similar to Theorem 4, we denote the sum of the terms as :
To see where the iterates end up after epochs, we simply set in Eq. 30 and then keep applying the equation recursively to preceding epochs. Then, we get
Taking squared norms and expectations on both sides, we get
(31)
where we used the fact that . Next, we expand the second term above to get
(32)
We handle each of the three terms separately. Let .
The first term can be written as:
where, in the last line, we used the fact that the permutations in every epoch are independent of the permutations in other epochs.
Next, we have the following lemma that bounds the spectral norm of for any :
To avoid confusion, we remark here that the term '' in the paper by Ahn et al. (2020) is the same as the term '' in our paper, and hence the original lemma statement in their paper looks different from what is written above.
We begin by decomposing the product into a product of independent terms, similar to the proof of Lemma 8 in Ahn et al. (2020).
However, after that we diverge from their proof, since we use FlipFlop-specific analysis.
Since , we get that , and
are independent. Hence, we can write the expectation as product of expectations:
Applying the Cauchy-Schwarz inequality on the decomposition above, we get
where in the last step we used that for all . To see why this is true, recall that . Further by Assumption 2, and hence as long as , we have .
For the two terms in the product above, we have the following lemma:
Since we are dealing with just a single epoch, we will skip the superscript. Using Lemma 5, we get
(35)
Define . Then and hence
(36)
Next we bound the other two terms.
Using Eq. (26), we get that for any ,
Since , and we use uniform random permutations, . Similarly, . Hence,
(37)
where we used Lemma 6 and the assumption that in the last step.
The third term in Ineq. (35) can be handled similarly. For any :
Now, it is easy to see that and . Then, if or we use for that case the fact that . For all other , we can again use that if or otherwise. Similarly, we can bound . Proceeding in a way similar to how we derived Ineq. (37), we get
(38)
Substituting Ineq. (36), (37) and (38) into (35), we get
where we used the assumption that in the last step.
Appendix G Proof of Theorem 6
where .
This captures the difference between the true gradients that the algorithm observes and the gradients that a full step of gradient descent would have seen.
For two consecutive epochs of FlipFlop with Incremental GD, we have the following inequality
(39)
where the first inequality is due to Theorem 2.1.11 in Nesterov (2004) and the second one is simply .
What remains to be done is to bound the two terms with dependence. Firstly, we give a bound on the norm of :
Next, we will use the smoothness assumption and bounded gradients property (Lemma 1).
Hence,
(40)
For the term, we need a more careful bound. Since the Hessian is constant for quadratic functions, we use to denote the Hessian matrix of function . We start off by using the definition of :
where we used the fact that for a quadratic function with Hessian , we have that . After that, we express and as sum of gradient descent steps:
Next, we use the fact that . We will also again use the fact that for a quadratic function with Hessian , we have that :
where we define the random variables as
Again, using smoothness assumption and bounded gradients property (Lemma 1) we get,
(41)
Next, we decompose the inner product of and in Eq. (39):
Substituting (40)-(45) back into (39), we finally get a recursion bound for one epoch:
Next, we use the fact that (for any ) on the term to get that
Substituting this back we get,
(Since )
Now, substituting the values of and the bound on , we get that and hence,
Now, iterating this for epoch pairs, we get
(Since )
Substituting gives us the desired result.
∎
Appendix H Additional Experiments
Figure 3: Dependence of convergence rates on the number of epochs for logistic regression. The figures show the median and inter-quartile range after 10 runs of each algorithm, with random initializations and random permutation seeds (note that IGD exhibits extremely small variance). We set , so that and hence the higher order terms of dominate the convergence rates. Note that both the axes are in logarithmic scale.
Although our theoretical guarantees for FlipFlop only hold for quadratic objectives, we conjecture that FlipFlop might also improve convergence for other classes of functions whose Hessians are smooth near their minimizers.
To explore this, we also ran some experiments on 1-dimensional logistic regression.
As we can see in Figure 3, the convergence rates are very similar to those on quadratic functions.
The data was synthetically generated such that the objective function becomes strongly convex and well conditioned near the minimizer.
Note that logistic loss is not strongly convex on linearly separable data.
Therefore, to make the loss strongly convex, we ensured that the data was not linearly separable.
Essentially, the dataset was the following: all the datapoints were , and their labels were with probability and with probability .
Framing this as an optimization problem, we have
where . Note that is the minimizer of this function, which is helpful because we can use it to compute the exact error. Similar to the experiment on quadratic functions, was set to
and the step size was set in the same regime as in Theorems 4, 5, and 6.
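For reference, the following is a minimal sketch of this experiment under assumed specifics (the elided dataset values above are not recoverable, so the choice $x_i = 1$ for all datapoints and the label probability $p = 0.8$ are our guesses; the empirical minimizer $w^* = \log(n_+/n_-)$ then gives the exact error):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, p = 500, 40, 0.8                         # placeholder sizes; K even
y = np.where(rng.random(n) < p, 1.0, -1.0)     # labels +1 w.p. p, -1 w.p. 1-p; x_i = 1
n_pos = int((y > 0).sum())
w_star = np.log(n_pos / (n - n_pos))           # exact empirical minimizer
lr = 2.0 / n                                   # assumed small constant step size

def run(flip):
    w = 0.0
    perm = np.random.default_rng(1).permutation(n)   # same SS base for both runs
    for k in range(K):
        order = perm[::-1] if (flip and k % 2 == 1) else perm
        for i in order:
            w += lr * y[i] / (1.0 + np.exp(y[i] * w))  # -grad of log(1 + e^{-y_i w})
    return (w - w_star) ** 2

print("SS:", run(False), "  FlipFlop+SS:", run(True))
```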