
SMG: A Shuffling Gradient-Based Method with Momentum

Trang H. Tran    Lam M. Nguyen    Quoc Tran-Dinh
Abstract

We combine two advanced ideas widely used in optimization for machine learning: shuffling strategy and momentum technique to develop a novel shuffling gradient-based method with momentum, coined Shuffling Momentum Gradient (SMG), for non-convex finite-sum optimization problems. While our method is inspired by momentum techniques, its update is fundamentally different from existing momentum-based methods. We establish state-of-the-art convergence rates of SMG for any shuffling strategy using either constant or diminishing learning rate under standard assumptions (i.e., L-smoothness and bounded variance). When the shuffling strategy is fixed, we develop another new algorithm that is similar to existing momentum methods, and prove the same convergence rates for this algorithm under the L-smoothness and bounded gradient assumptions. We demonstrate our algorithms via numerical simulations on standard datasets and compare them with existing shuffling methods. Our tests have shown encouraging performance of the new algorithms.

Machine Learning, ICML

1 Introduction

Most training tasks in supervised learning are boiled down to solving the following finite-sum minimization:

\min_{w\in\mathbb{R}^{d}}\Big\{F(w):=\frac{1}{n}\sum_{i=1}^{n}f(w;i)\Big\}, (1)

where f(\cdot;i):\mathbb{R}^{d}\to\mathbb{R} is a given smooth and possibly nonconvex function for i\in[n]:=\{1,\cdots,n\}.

Problem (1) looks simple, but covers various convex and nonconvex applications in machine learning and statistical learning, including, but not limited to, logistic regression, multi-kernel learning, conditional random fields, and neural networks. In particular, (1) covers empirical risk minimization as a special case. Solution methods for approximately solving (1) have been widely studied in the literature under different sets of assumptions. The most common approach is perhaps stochastic gradient (SGD)-type methods (Robbins & Monro, 1951; Ghadimi & Lan, 2013; Bottou et al., 2018; Nguyen et al., 2018, 2019) and their variants.

Motivation. While SGD and its variants rely on randomized sampling strategies with replacement, gradient-based methods using without-replacement strategies are often easier and faster to implement. Moreover, practical evidence (Bottou, 2009) has shown that they usually produce a faster decrease of the training loss. Randomized shuffling strategies (also viewed as sampling without replacement) allow the algorithm to use each function component f(\cdot;i) exactly once per epoch, in contrast to SGD, which only has statistical convergence guarantees (e.g., in expectation or with high probability). However, very often, the analysis of shuffling methods is more challenging than that of SGD due to the lack of statistical independence.

In the deterministic case, single permutation (also called shuffle once, or single shuffling) and incremental gradient methods can be considered as special cases of the shuffling gradient-based methods we study in this paper. One special shuffling strategy is randomized reshuffling, which is broadly used in practice, where we use a different random permutation at each epoch. In parallel, in recent years, it has been shown that many gradient-based methods with momentum updates can notably boost the convergence speed both in theory and practice (Nesterov, 2004; Dozat, 2016; Wang et al., 2020). These methods have been widely used in both convex and nonconvex settings, especially in the deep learning community. Remarkably, Nesterov's accelerated method (Nesterov, 1983) has revolutionized large-scale convex optimization in the last two decades, and has been largely exploited in nonconvex problems. The developments we have discussed here motivate us to raise the following research question:

Can we combine both shuffling strategy and momentum scheme to develop new provable gradient-based algorithms for handling (1)?

In this paper, we answer this question affirmatively by proposing a novel algorithm called Shuffling Momentum Gradient (SMG). We establish its convergence guarantees for different shuffling strategies, and in particular, randomized reshuffling strategy. We also investigate different variants of our method.

Our contribution. The contributions of this paper can be summarized as follows.

  • (a)

    We develop a novel shuffling gradient-based method with momentum (Algorithm 1 in Section 2) for approximating a stationary point of the nonconvex minimization problem (1). Our algorithm covers any shuffling strategy ranging from deterministic to randomized, including incremental, single shuffling, and randomized reshuffling variants.

  • (b)

    We establish the convergence of our method in the nonconvex setting and achieve the state-of-the-art \mathcal{O}\left(1/T^{2/3}\right) convergence rate under standard assumptions (i.e., the L-smoothness and bounded variance conditions), where T is the number of epochs. For the randomized reshuffling strategy, we can improve our convergence rate up to \mathcal{O}\left(1/(n^{1/3}T^{2/3})\right).

  • (c)

    We study different strategies for selecting learning rates (LR), including constant, diminishing, exponential, and cosine scheduled learning rates. In all cases, we prove the same convergence rate of the corresponding variants without any additional assumption.

  • (d)

    When a single shuffling strategy is used, we show that a momentum strategy can be incorporated directly at each iteration of the shuffling gradient method to obtain a different variant as presented in Algorithm 2. We analyze the convergence of this algorithm and achieve the same \mathcal{O}(1/T^{2/3}) epoch-wise convergence rate, but under a bounded gradient assumption instead of the bounded variance as for the SMG algorithm.

Our \mathcal{O}(1/T^{2/3}) convergence rate is the best known so far for shuffling gradient-type methods in nonconvex optimization (Nguyen et al., 2020; Mishchenko et al., 2020). However, like (Mishchenko et al., 2020), our SMG method only requires a generalized bounded variance assumption (Assumption 1(c)), which is weaker and more standard than the bounded component gradient assumption used in existing works. Algorithm 2 uses the same set of assumptions as in (Nguyen et al., 2020) to achieve the same rate, but has a momentum update. For the randomized reshuffling strategy, our \mathcal{O}\left(1/(n^{1/3}T^{2/3})\right) convergence rate also matches the rate of the without-momentum algorithm in (Mishchenko et al., 2020). This leads to a total of nT=\mathcal{O}(\sqrt{n}\varepsilon^{-3}) iterations.

We emphasize that, in many existing momentum variants, the momentum m_{i}^{(t)} is updated recursively at each iteration as m_{i+1}^{(t)}:=\beta m_{i}^{(t)}+(1-\beta)g_{i}^{(t)} for a given weight \beta\in(0,1). This update shows that the momentum m_{i+1}^{(t)} incorporates all the past gradient terms g_{i}^{(t)},g_{i-1}^{(t)},g_{i-2}^{(t)},\dots,g_{0}^{(t)} with exponential decay weights 1,\beta,\beta^{2},\dots,\beta^{i}, respectively. However, in shuffling methods, the convergence guarantee is often obtained per epoch. Based on this observation, we modify the classical momentum update in the shuffling method as shown in Algorithm 1. More specifically, the momentum term m_{0}^{(t)} is fixed at the beginning of each epoch, and an auxiliary sequence \{v_{i}^{(t)}\} is introduced to keep track of the gradient average used to update the momentum term in the next epoch. This modification makes Algorithm 1 fundamentally different from existing momentum-based methods. This new algorithm still achieves an \mathcal{O}(1/T^{2/3}) epoch-wise convergence rate under standard assumptions. To the best of our knowledge, our work is the first to analyze convergence rate guarantees of shuffling-type gradient methods with momentum under standard assumptions.

Besides Algorithm 1, we also exploit recent advanced strategies for selecting learning rates, including exponential and cosine scheduled learning rates. These two strategies have shown state-of-the-art performance in practice (Smith, 2017; Loshchilov & Hutter, 2017; Li et al., 2020). Therefore, it is worth incorporating them in shuffling methods.

Related work. Let us briefly review the most related works to our methods studied in this paper.

Shuffling gradient-based methods. Shuffling gradient-type methods for solving (1) have been widely studied in the literature in recent years (Bottou, 2009; Gürbüzbalaban et al., 2019; Shamir, 2016; Haochen & Sra, 2019; Nguyen et al., 2020) for both convex and nonconvex settings. This approach was empirically investigated in a short note (Bottou, 2009) and also discussed in (Bottou, 2012). These methods have also been implemented in several software packages broadly used in machine learning, such as TensorFlow and PyTorch (Abadi et al., 2015; Paszke et al., 2019).

In the strongly convex case, shuffling methods have been extensively studied in (Ahn et al., 2020; Gürbüzbalaban et al., 2019; Haochen & Sra, 2019; Safran & Shamir, 2020; Nagaraj et al., 2019; Rajput et al., 2020; Nguyen et al., 2020; Mishchenko et al., 2020) under different assumptions. The best known convergence rate in this case is \mathcal{O}(1/(nT)^{2}+1/(nT^{3})), which matches the lower bound studied in (Safran & Shamir, 2020) up to a constant factor. Most results in the convex case are for the incremental gradient variant, which is studied in (Nedic & Bertsekas, 2001; Nedić & Bertsekas, 2001). Convergence results of shuffling methods in the general convex case are investigated in (Shamir, 2016; Mishchenko et al., 2020), where (Mishchenko et al., 2020) provides a unified approach covering different settings. The authors in (Ying et al., 2017) combine randomized shuffling with a variance reduction technique (e.g., SAGA (Defazio et al., 2014) or SVRG (Johnson & Zhang, 2013)) to develop a new variant. They show a linear convergence rate for strongly convex problems, but measured via an energy function, and it remains unclear how to convert this guarantee into standard convergence criteria.

In the nonconvex case, (Nguyen et al., 2020) first shows an \mathcal{O}(1/T^{2/3}) convergence rate for a general class of shuffling gradient methods under the L-smoothness and bounded gradient assumptions on (1). This analysis is then extended in (Mishchenko et al., 2020) to a more relaxed assumption. The authors in (Meng et al., 2019) study different distributed SGD variants with shuffling for strongly convex, general convex, and nonconvex problems. An incremental gradient method for weakly convex problems is investigated in (Li et al., 2019), where the authors show an \mathcal{O}(1/T^{1/2}) convergence rate as in standard SGD. To the best of our knowledge, the best known rate of shuffling gradient methods for the nonconvex case under standard assumptions is \mathcal{O}(1/T^{2/3}), as shown in (Nguyen et al., 2020; Mishchenko et al., 2020).

Our Algorithm 1 developed in this paper is a nontrivial momentum variant of the general shuffling method in (Nguyen et al., 2020), but our analysis uses a standard bounded variance assumption instead of the bounded gradient one.

Momentum-based methods. Gradient methods with momentum were studied in early works for convex problems, such as the heavy-ball, inertial, and Nesterov's accelerated gradient methods (Polyak, 1964; Nesterov, 2004). Nesterov's accelerated method is the most influential scheme and achieves the optimal convergence rate for convex problems. While momentum-based methods are not yet known to improve theoretical convergence rates in the nonconvex setting, they show significantly encouraging performance in practice (Dozat, 2016; Wang et al., 2020), especially in the deep learning community. However, the momentum strategy has not yet been exploited in shuffling methods.

Adaptive learning rate schemes. Gradient-type methods with adaptive learning rates such as AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) have shown state-of-the-art performance in several optimization applications. Recently, many adaptive schemes for learning rates have been proposed such as diminishing (Nguyen et al., 2020), exponential scheduled (Li et al., 2020), and cosine scheduled (Smith, 2017; Loshchilov & Hutter, 2017). In (Li et al., 2020), the authors analyze convergence guarantees for the exponential and cosine learning rates in SGD. These adaptive learning rates have also been empirically studied in the literature, especially in machine learning tasks. Their convergence guarantees have also been investigated accordingly under certain assumptions.

Although the analyses for adaptive learning rates are generally nontrivial, our theoretical results in this paper are flexible enough to cover various learning rates. However, we only exploit the diminishing, exponential scheduled, and cosine scheduled schemes for our shuffling methods with momentum. We establish that in the last two cases, our algorithm still achieves the state-of-the-art \mathcal{O}(T^{-2/3}) (and possibly up to \mathcal{O}(n^{-1/3}T^{-2/3})) epoch-wise rates.

Content. The rest of this paper is organized as follows. Section 2 describes our novel method, Shuffling Momentum Gradient (Algorithm 1). Section 3 investigates its convergence rate under different shuffling-type strategies and different learning rates. Section 4 proposes an algorithm with traditional momentum update for single shuffling strategy. Section 5 presents our numerical simulations. Due to space limit, the convergence analysis of our methods, all technical proofs, and additional experiments are deferred to the Supplementary Document (Supp. Doc.).

2 Shuffling Gradient-Based Method with Momentum

In this section, we describe our new shuffling gradient algorithm with momentum in Algorithm 1. We also compare our algorithm with existing methods and discuss its per-iteration complexity and possible modifications, e.g., mini-batch.

Algorithm 1 Shuffling Momentum Gradient (SMG)
1:  Initialization: Choose \tilde{w}_{0}\in\mathbb{R}^{d} and set \tilde{m}_{0}:=\textbf{0};
2:  for t:=1,2,\cdots,T do
3:     Set w_{0}^{(t)}:=\tilde{w}_{t-1}; m_{0}^{(t)}:=\tilde{m}_{t-1}; and v_{0}^{(t)}:=\textbf{0};
4:     Generate an arbitrary deterministic or random permutation \pi^{(t)} of [n];
5:     for i:=0,\cdots,n-1 do
6:        Query g_{i}^{(t)}:=\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1));
7:        Choose \eta^{(t)}_{i}:=\frac{\eta_{t}}{n} and update
\left\{\begin{array}{lcl}m_{i+1}^{(t)}&:=&\beta m_{0}^{(t)}+(1-\beta)g_{i}^{(t)}\\ v_{i+1}^{(t)}&:=&v_{i}^{(t)}+\frac{1}{n}g_{i}^{(t)}\\ w_{i+1}^{(t)}&:=&w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)};\end{array}\right. (2)
8:     end for
9:     Set \tilde{w}_{t}:=w_{n}^{(t)} and \tilde{m}_{t}:=v_{n}^{(t)};
10:  end for
11:  Output: Choose \hat{w}_{T}\in\{\tilde{w}_{0},\cdots,\tilde{w}_{T-1}\} at random with probability \mathbb{P}[\hat{w}_{T}=\tilde{w}_{t-1}]=\frac{\eta_{t}}{\sum_{t=1}^{T}\eta_{t}}.

Clearly, if \beta=0, then Algorithm 1 reduces to the standard shuffling gradient method as in (Nguyen et al., 2020; Mishchenko et al., 2020). Since each inner iteration of our method uses one component f(\cdot;i), we use \eta_{i}^{(t)}=\frac{\eta_{t}}{n} in Algorithm 1 to make it consistent with one full gradient, which consists of n gradient components. This form looks different from the learning rates used in previous work on SGD (Shamir, 2016; Mishchenko et al., 2020); however, it does not necessarily make our learning rate smaller. In terms of the order of the number of training samples n, our learning rate matches the one in (Mishchenko et al., 2020) as well as the state-of-the-art complexity results. The detailed learning rate \eta_{t} used in Algorithm 1 will be specified in Section 3.

Comparison. Unlike existing momentum methods where m_{i}^{(t)} in (2) is updated recursively as m_{i+1}^{(t)}:=\beta m_{i}^{(t)}+(1-\beta)g_{i}^{(t)} for \beta\in(0,1), we instead fix the first term m_{0}^{(t)} in (2) within each epoch. It is only updated at the end of each epoch by averaging all the gradient components \{g_{i}^{(t)}\}_{i=0}^{n-1} evaluated in that epoch. To avoid storing these gradients, we introduce an auxiliary variable v_{i}^{(t)} to keep track of the gradient average. Here, we fix the weight \beta for the sake of our analysis, but it is possible to make \beta adaptive.
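To make the epoch structure concrete, the following is a minimal NumPy sketch (ours, not the paper's implementation) of one SMG epoch implementing update (2); the component-gradient oracle grad and the random generator rng are illustrative assumptions.

import numpy as np

def smg_epoch(w, m_prev, grad, n, eta_t, beta, rng):
    # One epoch of Algorithm 1 (SMG) with step size eta_i^{(t)} = eta_t / n.
    # grad(w, i) is assumed to return the component gradient of f(w; i), i = 1, ..., n.
    perm = rng.permutation(n) + 1          # any shuffling strategy is allowed here
    m0 = m_prev                            # momentum m_0^{(t)} is frozen within the epoch
    v = np.zeros_like(w)                   # running average of the component gradients
    for i in range(n):
        g = grad(w, perm[i])               # g_i^{(t)}
        m = beta * m0 + (1.0 - beta) * g   # update (2): momentum built from m_0^{(t)}
        v = v + g / n
        w = w - (eta_t / n) * m
    return w, v                            # (w~_t, v_n^{(t)}): v becomes the next epoch's momentum

Starting from \tilde{m}_{0}=\textbf{0} and calling this routine for t=1,\cdots,T with a step-size sequence \{\eta_{t}\} reproduces the outer loop of Algorithm 1.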

Our new method is based on the following observations. First, while SGD generally uses an unbiased estimator of the full gradient, shuffling gradient methods often do not have such a nice property, making them more difficult to analyze. Due to this fact, updating the momentum at each inner iteration does not seem preferable since it could make the estimator deviate further from the true gradient. Therefore, we consider updating the momentum only after each epoch. Second, unlike the traditional momentum with exponential decay weights, our momentum term m_{0}^{(t)} is an equal-weighted average of all the past gradients in an epoch, evaluated at the different points w_{i}^{(t-1)} of the inner loop. Based on these observations, the momentum term should aid the full gradient \nabla F rather than its components \nabla f(\cdot;i), leading to the update at the end of each epoch.

It is also worth noting that our algorithm is fundamentally different from variance reduction methods. For instance, SVRG and SARAH variants require evaluating a full gradient at each snapshot point, while our method always uses single component gradients. Hence, our algorithm does not require a full gradient evaluation at each outer iteration, and our momentum term does not require the full gradient of f.

When the learning rate \eta_{i}^{(t)} is fixed at each epoch as \eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, where \eta_{t}>0, we can derive from (2) that

w_{i+1}^{(t)}=w_{i}^{(t)}-\frac{(1-\beta)\eta_{t}}{n}g_{i}^{(t)}+\frac{\beta\eta_{t}}{n(1-\beta)\eta_{t-1}}e_{t},

where e_{t}:=\tilde{w}_{t-1}-\tilde{w}_{t-2}+\beta\eta_{t-1}\tilde{m}_{t-2}. Here, e_{t} plays the role of a momentum or an inertial term, but it is different from the usual momentum term. Nevertheless, we still name Algorithm 1 the Shuffling Momentum Gradient since it is inspired by momentum methods.
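For completeness, this identity follows from a short computation using (2) and the relations \tilde{w}_{t-1}=w_{n}^{(t-1)} and m_{0}^{(t)}=\tilde{m}_{t-1}=v_{n}^{(t-1)}=\frac{1}{n}\sum_{i=0}^{n-1}g_{i}^{(t-1)}:

\tilde{w}_{t-1}-\tilde{w}_{t-2}=-\frac{\eta_{t-1}}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t-1)}=-\beta\eta_{t-1}\tilde{m}_{t-2}-(1-\beta)\eta_{t-1}\tilde{m}_{t-1},

so that e_{t}=\tilde{w}_{t-1}-\tilde{w}_{t-2}+\beta\eta_{t-1}\tilde{m}_{t-2}=-(1-\beta)\eta_{t-1}m_{0}^{(t)}. Substituting m_{0}^{(t)}=-\frac{e_{t}}{(1-\beta)\eta_{t-1}} into (2) gives the displayed update.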

Per-iteration complexity. The per-iteration complexity of Algorithm 1 is almost the same as in standard shuffling gradient schemes, see, e.g., (Shamir, 2016). It needs to store two additional vectors m_{i+1}^{(t)} and v_{i}^{(t)}, and performs two vector additions and three scalar-vector multiplications. Such additional costs are very mild.

Shuffling strategies. Our convergence guarantee in Section 3 holds for any permutation \pi^{(t)} of \{1,2,\cdots,n\}, including deterministic and randomized ones. Therefore, our method covers any shuffling strategy, including the incremental, single shuffling, and randomized reshuffling variants as special cases. We highlight that our convergence result for the randomized reshuffling variant is significantly better than the general one, as we explain in detail in Section 3.

Mini-batch. Algorithm 1 also works with mini-batches, and our theory can be slightly adapted to establish the same convergence rate for mini-batch variants. However, it remains unclear to us how mini-batches can improve the convergence rate guarantee of Algorithm 1 in the general case where the shuffling strategy is not specified.

3 Convergence Analysis

We analyze the convergence of Algorithm 1 under standard assumptions. This section is organized as follows.

3.1 Technical Assumptions

Our analysis relies on the following assumptions:

Assumption 1.

Problem (1) satisfies:

  • (a)

    (Boundedness from below) \mathrm{dom}(F)\neq\emptyset and F is bounded from below, i.e., F_{*}:=\inf_{w\in\mathbb{R}^{d}}F(w)>-\infty.

  • (b)

    (L-Smoothness) f(\cdot;i) is L-smooth for all i\in[n], i.e., there exists a universal constant L>0 such that, for all w,w^{\prime}\in\mathrm{dom}(F), it holds that

    \|\nabla f(w;i)-\nabla f(w^{\prime};i)\|\leq L\|w-w^{\prime}\|. (3)

  • (c)

    (Generalized bounded variance) There exist two non-negative and finite constants \Theta and \sigma such that for any w\in\mathrm{dom}(F) we have

    \frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w;i)-\nabla F(w)\|^{2}\leq\Theta\|\nabla F(w)\|^{2}+\sigma^{2}. (4)

Assumption 1(a) is required in any algorithm to guarantee the well-definedness of (1). The L-smoothness condition (3) is standard in gradient-type methods for both stochastic and deterministic algorithms. From this assumption, we have for any w,w^{\prime}\in\mathrm{dom}(F) (see (Nesterov, 2004)):

F(w)\leq F(w^{\prime})+\langle\nabla F(w^{\prime}),w-w^{\prime}\rangle+\frac{L}{2}\|w-w^{\prime}\|^{2}. (5)

The condition (4) in Assumption 1(c) reduces to the standard bounded variance condition if \Theta=0. Therefore, (4) is more general than the bounded variance assumption, which is often used in stochastic optimization. Unlike recent existing works on momentum SGD and shuffling (Chen et al., 2018; Nguyen et al., 2020), we do not require the bounded gradient assumption on each f(\cdot;i) in Algorithm 1 (see Assumption 2). That condition is stronger than (4).

3.2 Main Result 1 and Its Consequences

Our first main result is the following convergence theorem for Algorithm 1 under any shuffling strategy.

Theorem 1.

Suppose that Assumption 1 holds for (1). Let \{w_{i}^{(t)}\}_{t=1}^{T} be generated by Algorithm 1 with a fixed momentum weight 0\leq\beta<1 and an epoch learning rate \eta_{i}^{(t)}:=\frac{\eta_{t}}{n} for every t\geq 1. Assume that \eta_{0}=\eta_{1}, \eta_{t}\geq\eta_{t+1}, and 0<\eta_{t}\leq\frac{1}{L\sqrt{K}} for t\geq 1, where K:=\max\big\{\frac{5}{2},\frac{9(5-3\beta)(\Theta+1)}{1-\beta}\big\}. Then, it holds that

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]=\frac{1}{\sum_{t=1}^{T}\eta_{t}}\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2}\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\sum_{t=1}^{T}\eta_{t}}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\left(\frac{\sum_{t=1}^{T}\eta_{t-1}^{3}}{\sum_{t=1}^{T}\eta_{t}}\right). (6)
Remark 1 (Convergence guarantee).

When a deterministic permutation \pi^{(t)} is used, our convergence rate can be achieved in a deterministic sense. However, to unify our analysis, we will express our convergence guarantees in expectation, where the expectation is taken over all the randomness generated by \pi^{(t)} and \hat{w}_{T} up to T iterations. Since we can choose the permutations \pi^{(t)} either deterministically or randomly, our bounds in the sequel will hold either deterministically or with probability 1 (w.p. 1), respectively. Without loss of generality, we write these results in expectation.

Next, we derive two direct consequences of Theorem 1 by choosing constant and diminishing learning rates.

Corollary 1 (Constant learning rate).

Let us fix the number of epochs T\geq 1, and choose a constant learning rate \eta_{t}:=\frac{\gamma}{T^{1/3}} for some \gamma>0 such that \frac{\gamma}{T^{1/3}}\leq\frac{1}{L\sqrt{K}} for t\geq 1 in Algorithm 1. Then, under the conditions of Theorem 1, \mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big] is upper bounded by

\frac{1}{T^{2/3}}\left(\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma}+\frac{9\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)}\right).

Consequently, the convergence rate of Algorithm 1 is \mathcal{O}(T^{-2/3}) in epoch.

With a constant LR as in Corollary 1, the convergence rate of Algorithm 1 is exactly expressed as

\mathcal{O}\left(\frac{[F(\tilde{w}_{0})-F_{*}]+\sigma^{2}}{T^{2/3}}\right),

which matches the best known rate in the literature (Mishchenko et al., 2020; Nguyen et al., 2020) in terms of T for general shuffling-type strategies.
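As a small illustrative helper (ours, not part of the algorithm), the constant step size of Corollary 1 can be computed together with the bound \eta_{t}\leq\frac{1}{L\sqrt{K}} from Theorem 1; here L, Theta, beta, and gamma are problem-dependent inputs.

def smg_constant_lr(T, L, Theta, beta, gamma):
    # Constant epoch learning rate eta_t = gamma / T^{1/3} from Corollary 1,
    # capped so that eta_t <= 1 / (L * sqrt(K)) with K as in Theorem 1.
    K = max(5.0 / 2.0, 9.0 * (5.0 - 3.0 * beta) * (Theta + 1.0) / (1.0 - beta))
    return min(gamma / T ** (1.0 / 3.0), 1.0 / (L * K ** 0.5))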

Corollary 2 (Diminishing learning rate).

Let us choose a diminishing learning rate \eta_{t}:=\frac{\gamma}{(t+\lambda)^{1/3}} for some \gamma>0 and \lambda\geq 0 for t\geq 1 such that \eta_{1}:=\frac{\gamma}{(1+\lambda)^{1/3}}\leq\frac{1}{L\sqrt{K}} in Algorithm 1. Then, under the conditions of Theorem 1, after T epochs with T\geq 2, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{C_{1}+C_{2}\log(T-1+\lambda)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}},

where C_{1} and C_{2} are respectively given by

C_{1}:=\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma}+\frac{18\sigma^{2}L^{2}(5-3\beta)\gamma^{2}}{(1-\beta)(1+\lambda)}\quad\text{and}\quad C_{2}:=\frac{9\sigma^{2}L^{2}(5-3\beta)\gamma^{2}}{(1-\beta)}.

Consequently, the convergence rate of Algorithm 1 is \mathcal{O}(T^{-2/3}\log(T)) in epoch.

The diminishing LR \eta_{t}:=\frac{\gamma}{(t+\lambda)^{1/3}} allows us to use larger learning rate values at early epochs compared to the constant case. However, we lose a \log(T) factor in the second term of our worst-case convergence rate bound.

We also derive the following two consequences of Theorem 1 for exponential and cosine scheduled learning rates.

Exponential scheduled learning rate. Given an epoch budget T\geq 1, and two positive constants \gamma>0 and \rho>0, we consider the following exponential LR, see (Li et al., 2020):

\eta_{t}:=\frac{\gamma\alpha^{t}}{T^{1/3}},\quad\text{where}\ \alpha:=\rho^{1/T}\in(0,1). (7)

The following corollary shows the convergence of Algorithm 1 using this LR without any additional assumption.

Corollary 3.

Let \{w_{i}^{(t)}\}_{t=1}^{T} be generated by Algorithm 1 with \eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, where \eta_{t} is given by (7) such that 0<\eta_{t}\leq\frac{1}{L\sqrt{K}}. Then, under Assumption 1, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma\rho T^{2/3}}+\frac{9\sigma^{2}L^{2}(5-3\beta)\gamma^{2}}{(1-\beta)\rho T^{2/3}}.

Cosine annealing learning rate: Alternatively, given an epoch budget T\geq 1 and a positive constant \gamma>0, we consider the following cosine LR for Algorithm 1:

\eta_{t}:=\frac{\gamma}{T^{1/3}}\left(1+\cos\frac{t\pi}{T}\right),\quad t=1,2,\cdots,T. (8)

This LR is adopted from (Loshchilov & Hutter, 2017; Smith, 2017). However, different from these works, we fix our learning rate at each epoch instead of updating it at every iteration as in (Loshchilov & Hutter, 2017; Smith, 2017).
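For reference, the four epoch-wise schedules considered in this paper (constant, diminishing, exponential (7), and cosine (8)) can be generated by a small helper like the sketch below; the function and parameter names are ours and purely illustrative.

import math

def epoch_lr(t, T, scheme="constant", gamma=0.1, lam=0.0, rho=0.5):
    # Epoch learning rate eta_t (t = 1, ..., T) for the schedules used with SMG.
    base = gamma / T ** (1.0 / 3.0)
    if scheme == "constant":
        return base
    if scheme == "diminishing":
        return gamma / (t + lam) ** (1.0 / 3.0)
    if scheme == "exponential":            # eq. (7) with alpha = rho^{1/T}
        return base * rho ** (t / T)
    if scheme == "cosine":                 # eq. (8)
        return base * (1.0 + math.cos(t * math.pi / T))
    raise ValueError("unknown scheme: " + scheme)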

Corollary 4.

Let \{w_{i}^{(t)}\}_{t=1}^{T} be generated by Algorithm 1 with \eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, where \eta_{t} is given by (8) such that 0<\eta_{t}\leq\frac{1}{L\sqrt{K}}. Then, under Assumption 1, and for T\geq 2, \mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big] is upper bounded by

\frac{1}{T^{2/3}}\left(\frac{8[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma}+\frac{144\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)}\right).

The scheduled LRs (7) and (8) still preserve our best known convergence rate \mathcal{O}(T^{-2/3}). Note that exponential learning rates are available in both TensorFlow and PyTorch, while cosine learning rates are also used in PyTorch.

3.3 Main Result 2: Randomized Reshuffling Variant

A variant of Algorithm 1 is called a randomized reshuffling variant if at each epoch t, the generated permutation \pi^{(t)}=\big(\pi^{(t)}(1),\cdots,\pi^{(t)}(n)\big) is uniformly sampled at random without replacement from \{1,\cdots,n\}. Since the randomized reshuffling strategy is extremely popular in practice, we analyze Algorithm 1 under this strategy.

Theorem 2.

Suppose that Assumption 1 holds for (1). Let \{w_{i}^{(t)}\}_{t=1}^{T} be generated by Algorithm 1 under a randomized reshuffling strategy, a fixed momentum weight 0\leq\beta<1, and an epoch learning rate \eta_{i}^{(t)}:=\frac{\eta_{t}}{n} for every t\geq 1. Assume that \eta_{t}\geq\eta_{t+1} and 0<\eta_{t}\leq\frac{1}{L\sqrt{D}} for t\geq 1, where D:=\max\left(\frac{5}{3},\frac{6(5-3\beta)(\Theta+n)}{n(1-\beta)}\right) and \eta_{0}=\eta_{1}. Then

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]=\frac{1}{\sum_{t=1}^{T}\eta_{t}}\sum_{t=1}^{T}\eta_{t}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right]\leq\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\sum_{t=1}^{T}\eta_{t}}+\frac{6\sigma^{2}(5-3\beta)L^{2}}{n(1-\beta)}\left(\frac{\sum_{t=1}^{T}\eta_{t-1}^{3}}{\sum_{t=1}^{T}\eta_{t}}\right). (9)

We can derive the following two consequences.

Corollary 5 (Constant learning rate).

Let us fix the number of epochs T\geq 1, and choose a constant learning rate \eta_{t}:=\frac{\gamma n^{1/3}}{T^{1/3}} for some \gamma>0 such that \frac{\gamma n^{1/3}}{T^{1/3}}\leq\frac{1}{L\sqrt{D}} for t\geq 1 in Algorithm 1. Then, under the conditions of Theorem 2, \mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big] is upper bounded by

\frac{1}{n^{1/3}T^{2/3}}\left(\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma}+\frac{6\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)}\right).

Consequently, the convergence rate of Algorithm 1 is \mathcal{O}(n^{-1/3}T^{-2/3}) in epoch.

With a constant LR as in Corollary 5, the convergence rate of Algorithm 1 is improved to

\mathcal{O}\left(\frac{[F(\tilde{w}_{0})-F_{*}]+\sigma^{2}}{n^{1/3}T^{2/3}}\right),

which matches the best known rate for the randomized reshuffling scheme, see, e.g., (Mishchenko et al., 2020).

In this case, the total number of iterations \mathcal{T}_{\textrm{tol}}:=nT needed to obtain \mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\varepsilon^{2} is \mathcal{T}_{\textrm{tol}}=\mathcal{O}(\sqrt{n}\varepsilon^{-3}). Compared to the \mathcal{O}(\varepsilon^{-4}) complexity of SGD, our randomized reshuffling variant is better than SGD whenever n\leq\mathcal{O}(\varepsilon^{-2}).
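This iteration bound follows from a one-line calculation: requiring the bound of Corollary 5 to be at most \varepsilon^{2} gives

\frac{C}{n^{1/3}T^{2/3}}\leq\varepsilon^{2}\quad\Longrightarrow\quad T\geq\frac{C^{3/2}}{n^{1/2}\varepsilon^{3}}\quad\Longrightarrow\quad nT=\mathcal{O}\big(\sqrt{n}\,\varepsilon^{-3}\big),

where C denotes the constant factor in the parentheses of Corollary 5.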

Corollary 6 (Diminishing learning rate).

Let us choose a diminishing learning rate \eta_{t}:=\frac{\gamma n^{1/3}}{(t+\lambda)^{1/3}} for some \gamma>0 and \lambda\geq 0 for t\geq 1 such that \eta_{1}:=\frac{\gamma n^{1/3}}{(1+\lambda)^{1/3}}\leq\frac{1}{L\sqrt{D}} in Algorithm 1. Then, under the conditions of Theorem 2, after T epochs with T\geq 2, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{C_{3}+C_{4}\log(T-1+\lambda)}{n^{1/3}\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]},

where C_{3} and C_{4} are respectively given by

C_{3}:=\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma}+\frac{12\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)(1+\lambda)}\quad\text{and}\quad C_{4}:=\frac{6\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)}.

Consequently, the convergence rate of Algorithm 1 is \mathcal{O}(n^{-1/3}T^{-2/3}\log(T)) in epoch.

Remark 2.

Algorithm 1 under randomized reshuffling still works with exponential and cosine scheduled learning rates, and the analysis is similar to the one for general shuffling schemes. However, we omit the details here.

4 Single Shuffling Variant

In this section, we modify the single-shuffling gradient method by directly incorporating a momentum term at each iteration. We prove for the first time that this variant still achieves a state-of-the-art convergence rate guarantee under the smoothness and bounded gradient (i.e., Assumption 2) assumptions, as in existing shuffling methods. This variant, though somewhat special, also covers an incremental gradient method with momentum as a special case.

Our new momentum algorithm using single shuffling strategy for solving (1) is presented in Algorithm 2.

Algorithm 2 Single Shuffling Momentum Gradient
1:  Initialization: Choose \tilde{w}_{0}\in\mathbb{R}^{d} and set \tilde{m}_{0}:=\textbf{0};
2:  Generate a permutation \pi of [n];
3:  for t:=1,2,\cdots,T do
4:     Set w_{0}^{(t)}:=\tilde{w}_{t-1} and m_{0}^{(t)}:=\tilde{m}_{t-1};
5:     for i:=0,\cdots,n-1 do
6:        Query g_{i}^{(t)}:=\nabla f(w_{i}^{(t)};\pi(i+1));
7:        Choose \eta^{(t)}_{i}:=\frac{\eta_{t}}{n} and update
\left\{\begin{array}{lcl}m_{i+1}^{(t)}&:=&\beta m_{i}^{(t)}+(1-\beta)g_{i}^{(t)}\\ w_{i+1}^{(t)}&:=&w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)};\end{array}\right.
8:     end for
9:     Set \tilde{w}_{t}:=w_{n}^{(t)} and \tilde{m}_{t}:=m_{n}^{(t)};
10:  end for
11:  Output: Choose \hat{w}_{T}\in\{\tilde{w}_{0},\cdots,\tilde{w}_{T-1}\} at random with probability \mathbb{P}[\hat{w}_{T}=\tilde{w}_{t-1}]=\frac{\eta_{t}}{\sum_{t=1}^{T}\eta_{t}}.

Besides fixing the permutation \pi, Algorithm 2 differs from Algorithm 1 in the way m_{i}^{(t)} is updated. Here, m_{i+1}^{(t)} is updated from m_{i}^{(t)} instead of from m_{0}^{(t)} as in Algorithm 1. In addition, we set \tilde{m}_{t}:=m_{n}^{(t)} instead of the epoch gradient average. Note that we can write the main update of Algorithm 2 as

w_{i+1}^{(t)}:=w_{i}^{(t)}-\frac{(1-\beta)\eta_{t}}{n}g_{i}^{(t)}+\beta(w^{(t)}_{i}-w_{i-1}^{(t)}),

which exactly reduces to existing momentum updates.
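A minimal sketch (ours) of the inner loop of Algorithm 2 is given below; it applies the classical recursive momentum update along a single fixed permutation, with grad an assumed component-gradient oracle.

def single_shuffling_momentum_epoch(w, m, grad, perm, eta_t, beta):
    # One epoch of Algorithm 2: classical momentum along a fixed permutation perm.
    n = len(perm)
    for i in range(n):
        g = grad(w, perm[i])               # component gradient at pi(i+1)
        m = beta * m + (1.0 - beta) * g    # recursive momentum update m_{i+1}^{(t)}
        w = w - (eta_t / n) * m            # step with eta_i^{(t)} = eta_t / n
    return w, m                            # m becomes m~_t for the next epoch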

Incremental gradient method with momentum. If we choose \pi:=[n] (i.e., the natural ordering), then we obtain the well-known incremental gradient variant, but with momentum. Hence, Algorithm 2 is still new compared to the standard incremental gradient algorithm (Bertsekas, 2011). To prove the convergence of Algorithm 2, we replace Assumption 1(c) by the following:

Assumption 2 (Bounded gradient).

There exists G>0 such that \|\nabla f(x;i)\|\leq G for all x\in\mathrm{dom}(F) and i\in[n].

Now, we state the convergence of Algorithm 2 in the following theorem as our third main result.

Theorem 3.

Let \{w_{i}^{(t)}\}_{t=1}^{T} be generated by Algorithm 2 with a LR \eta_{i}^{(t)}:=\frac{\eta_{t}}{n} and 0<\eta_{t}\leq\frac{1}{L} for t\geq 1. Then, under Assumption 1(a)-(b) and Assumption 2, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{\Delta_{1}}{\big(\sum_{t=1}^{T}\eta_{t}\big)(1-\beta^{n})}+L^{2}G^{2}\left(\frac{\sum_{t=1}^{T}\xi_{t}^{3}}{\sum_{t=1}^{T}\eta_{t}}\right)+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}, (10)

where \xi_{t}:=\max(\eta_{t},\eta_{t-1}) for t\geq 2, \xi_{1}=\eta_{1}, and

\Delta_{1}:=2[F(\tilde{w}_{0})-F_{*}]+\left(\frac{1}{L}+\eta_{1}\right)\|\nabla F(\tilde{w}_{0})\|^{2}+2L\eta_{1}^{2}G^{2}.

Compared to (9) in Theorem 2, the bound (10) strongly depends on the weight \beta, since the term \frac{4\beta^{n}G^{2}}{1-\beta^{n}} appears on the right-hand side of (10). Hence, \beta must be chosen in a specific form to obtain the desired convergence rate.

Theorem 3 provides a key bound to derive concrete convergence rates in the following two corollaries.

Corollary 7 (Constant learning rate).

In Algorithm 2, let us fix T\geq 1 and choose the parameters:

  • \beta:=\left(\frac{\nu}{T^{2/3}}\right)^{1/n} for some constant \nu\geq 0 such that \beta\leq\left(\frac{R-1}{R}\right)^{1/n} for some R\geq 1, and

  • \eta_{t}:=\frac{\gamma}{T^{1/3}} for some \gamma>0 such that \frac{\gamma}{T^{1/3}}\leq\frac{1}{L}.

Then, under the conditions of Theorem 3, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{D_{0}}{T^{2/3}}+\frac{R\|\nabla F(\tilde{w}_{0})\|^{2}}{T}+\frac{2LG^{2}R\gamma}{T^{4/3}},

where

D_{0}:=\frac{2R}{\gamma}[F(\tilde{w}_{0})-F_{*}]+\frac{R}{\gamma L}\|\nabla F(\tilde{w}_{0})\|^{2}+L^{2}G^{2}\gamma^{2}+4\nu G^{2}R.

Thus the convergence rate of Algorithm 2 is \mathcal{O}(T^{-2/3}).

With a constant learning rate as in Corollary 7, the convergence rate of Algorithm 2 is

\mathcal{O}\left(\frac{L[F(\tilde{w}_{0})-F_{*}]+\|\nabla F(\tilde{w}_{0})\|^{2}+G^{2}}{T^{2/3}}\right),

which depends on L[F(\tilde{w}_{0})-F_{*}], \|\nabla F(\tilde{w}_{0})\|^{2}, and G^{2}, and is slightly different from Corollary 1.

Corollary 8 (Diminishing learning rate).

In Algorithm 2, let us choose the parameters:

  • \beta:=\left(\frac{\nu}{T^{2/3}}\right)^{1/n} for some constant \nu\geq 0 such that \beta\leq\left(\frac{R-1}{R}\right)^{1/n} for some R\geq 1, and

  • a diminishing learning rate \eta_{t}:=\frac{\gamma}{(t+\lambda)^{1/3}} for all t\in[T] for some \gamma>0 and \lambda\geq 0 such that \eta_{1}=\frac{\gamma}{(1+\lambda)^{1/3}}\leq\frac{1}{L}.

Then, under the conditions of Theorem 3, we have

\mathbb{E}\big[\|\nabla F(\hat{w}_{T})\|^{2}\big]\leq\frac{D_{1}}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{4\nu G^{2}R}{T^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}\log(T-1+\lambda)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}},

for T\geq 2, where

D_{1}:=\frac{2R}{\gamma}[F(\tilde{w}_{0})-F_{*}]+\Big[\frac{R}{\gamma L}+\frac{R}{(1+\lambda)^{1/3}}\Big]\|\nabla F(\tilde{w}_{0})\|^{2}+\frac{2RL\gamma G^{2}}{(1+\lambda)^{2/3}}+\frac{2}{1+\lambda}L^{2}G^{2}\gamma^{2}.

Thus the convergence rate of Algorithm 2 is \mathcal{O}\big(\frac{\log(T)}{T^{2/3}}\big).

Remark 3.

Algorithm 2 still works with exponential and cosine scheduled LRs, and our analysis is similar to the one in Algorithm 1, which is deferred to the Supp. Doc.

5 Numerical Experiments

In order to examine our algorithms, we present two numerical experiments for different nonconvex problems and compare them with some state-of-the-art SGD-type and shuffling gradient methods.

5.1 Models and Datasets

Neural Networks. We test our Shuffling Momentum Gradient (SMG) algorithm using two standard network architectures: fully connected network (FCN) and convolutional neural network (CNN). For the fully connected setting, we train the classic LeNet-300-100 model (LeCun et al., 1998) on the Fashion-MNIST dataset (Xiao et al., 2017) with 60,000 images.

We also use the convolutional LeNet-5 (LeCun et al., 1998) architecture to train the well-known CIFAR-10 dataset (Krizhevsky & Hinton, 2009) with 50,000 samples. We repeatedly run the experiments for 10 random seeds and report the average results. All the algorithms are implemented and run in Python using the PyTorch package (Paszke et al., 2019).

Nonconvex Logistic Regression. Nonconvex regularizers are widely used in statistical learning, e.g., to approximate sparsity or to gain robustness. We consider the following nonconvex binary classification problem:

\min_{w\in\mathbb{R}^{d}}\Big\{F(w):=\frac{1}{n}\sum_{i=1}^{n}\log(1+\exp(-y_{i}x_{i}^{\top}w))+\lambda r(w)\Big\},

where \{(x_{i},y_{i})\}_{i=1}^{n} is a set of training samples; r(w):=\frac{1}{2}\sum_{j=1}^{d}\frac{w_{j}^{2}}{1+w_{j}^{2}} is a nonconvex regularizer; and \lambda:=0.01 is a regularization parameter. This example was also previously used in (Wang et al., 2019; Tran-Dinh et al., 2019; Nguyen et al., 2020).
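For reference, a direct NumPy implementation of this objective and its gradient might look as follows; the function name, the n×d sample matrix X, and the label vector y ∈ {−1,1}^n are our own illustrative choices, not from the paper's code.

import numpy as np

def ncvx_logreg_loss_grad(w, X, y, lam=0.01):
    # F(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + lam * r(w),
    # with the nonconvex regularizer r(w) = 0.5 * sum_j w_j^2 / (1 + w_j^2).
    n = X.shape[0]
    z = -y * (X @ w)                                  # z_i = -y_i x_i^T w
    loss = np.mean(np.logaddexp(0.0, z))              # stable log(1 + e^{z_i})
    reg = 0.5 * np.sum(w ** 2 / (1.0 + w ** 2))
    s = 0.5 * (1.0 + np.tanh(0.5 * z))                # sigmoid(z_i), computed stably
    grad = -(X.T @ (y * s)) / n + lam * w / (1.0 + w ** 2) ** 2
    return loss + lam * reg, grad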

We have conducted the experiments on two classification datasets, w8a (49,749 samples) and ijcnn1 (91,701 samples), from LIBSVM (Chang & Lin, 2011). The experiment is repeated 10 times with different random seeds and the average result is reported.

5.2 Comparing SMG with Other Methods

We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). For the latter two algorithms, we use the hyper-parameter settings recommended and widely used in practice (i.e., momentum 0.9 for SGD-M, and the two hyper-parameters \beta_{1}:=0.9 and \beta_{2}:=0.999 for Adam). For our new SMG algorithm, we fixed the parameter \beta:=0.5 since it usually performs the best in our experiments.

To have a fair comparison, we apply the randomized reshuffling scheme to all methods. Note that shuffling strategies are broadly used in practice and have been implemented in TensorFlow, PyTorch, and Keras (Chollet et al., 2015). We tune each algorithm using constant learning rate and report the best result obtained.

Results. Our first experiment is presented in Figure 1, where we depict the value of "train loss" (i.e., F(w) in (1)) on the y-axis and the "number of effective passes" (i.e., the number of epochs) on the x-axis. It has been observed that SGD-M and Adam work well for machine learning tasks (see, e.g., (Ruder, 2017)). In line with this fact, from Figure 1 we also observe that our SMG algorithm and SGD are slightly worse than SGD-M and Adam at the initial stage when training a neural network. However, SMG quickly catches up to Adam and demonstrates good performance at the end of the training process.

Figure 1: The train loss produced by SMG, SGD, SGD-M, and Adam for Fashion-MNIST and CIFAR-10, respectively.

For the nonconvex logistic regression problem, our result is reported in Figure 2. For two small datasets tested in our experiments, our algorithm performs significantly better than SGD and Adam, and slightly better than SGD with momentum.

Figure 2: The train loss produced by SMG, SGD, SGD-M, and Adam for the w8a and ijcnn1 datasets, respectively.

5.3 The Choice of Hyper-parameter β\beta

Since the hyper-parameter β\beta plays a critical role in the proposed SMG method, our next experiment is to investigate how this algorithm depends on β\beta, while using the same constant learning rate.

Results. Our result is presented in Figure 3. We can observe from this figure that in the early stage of the training process, the choice \beta:=0.5 gives comparably good performance relative to smaller values. This choice also results in the best train loss at the end of the training process. However, the difference is not really significant, showing that SMG seems robust to the choice of the momentum weight \beta in the range [0.1,0.5].

Figure 3: The train loss reported by SMG with different β\beta on the Fashion-MNIST and CIFAR-10 datasets, respectively.

In the nonconvex logistic regression problem, the same choices of \beta yield a similar outcome on the two datasets w8a and ijcnn1, as shown in Figure 4. We have also experimented with different choices of \beta\in\{0.6,0.7,0.8,0.9\}. However, these choices of \beta do not lead to good performance, and we therefore omit them here. This finding raises an open question on the optimal choice of \beta. Our empirical study here shows that the choice of \beta in [0.1,0.5] works reasonably well, and \beta:=0.5 seems to be the best in our tests.

Figure 4: The train loss produced by SMG under different values of β\beta on the w8a and ijcnn1 datasets, respectively.

5.4 Different Learning Rate Schemes

Our last experiment is to examine the effect of different learning rate variants on the performance of our SMG method, i.e. Algorithm 1.

Results. We conduct this test using four different learning rate variants: constant, diminishing, exponential decay, and cosine annealing learning rates. Our results are reported in Figure 5 and Figure 6. From Figure 5, we can observe that the cosine scheduled and the diminishing learning rates converge relatively fast at the early stage. However, the exponential decay and the constant learning rates make faster progress in the last epochs and tend to give the best result at the end of the training process in our neural network training experiments.

Figure 5: The train loss produced by SMG under four different learning rate schemes on the Fashion-MNIST and CIFAR-10 datasets, respectively.

For the nonconvex logistic regression, Figure 6 shows that the cosine learning rate has certain advantages compared to the other choices after a few dozen epochs.

Figure 6: The train loss produced by SMG using four different learning rates on the w8a and ijcnn1 datasets, respectively.

We note that the detailed settings and additional experiments in this section can be found in the Supp. Doc.

6 Conclusions and Future Work

We have developed two new shuffling gradient algorithms with momentum for solving nonconvex finite-sum minimization problems. Our Algorithm 1 is novel and can work with any shuffling strategy, while Algorithm 2 is similar to existing momentum methods using a single shuffling strategy. Our methods achieve the state-of-the-art \mathcal{O}(1/T^{2/3}) convergence rate under standard assumptions using different learning rates and shuffling strategies. When a randomized reshuffling strategy is exploited in Algorithm 1, we can further improve our rates by a factor of n^{1/3} of the data size n, matching the best known results for the non-momentum shuffling method. Our numerical results also show encouraging performance of the new methods.

We believe that our analysis framework can be extended to study non-asymptotic rates for some recent adaptive SGD schemes such as Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011) as well as variance-reduced methods such as SARAH (Nguyen et al., 2017) under shuffling strategies. It is also interesting to extend our framework to other problems such as minimax and federated learning. We will further investigate these opportunities in the future.

Acknowledgements

The authors would like to thank the reviewers for their useful comments and suggestions which helped to improve the exposition in this paper. The work of Q. Tran-Dinh has partly been supported by the Office of Naval Research under Grant No. ONR-N00014-20-1-2088 (2020-2023).

References

  • Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Ahn et al. (2020) Ahn, K., Yun, C., and Sra, S. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. arXiv preprint arXiv:2006.06946, 2020.
  • Bertsekas (2011) Bertsekas, D. Incremental proximal methods for large scale convex optimization. Math. Program., 129(2):163–195, 2011.
  • Bottou (2009) Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pp.  2624–2633, 2009.
  • Bottou (2012) Bottou, L. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp.  421–436. Springer, 2012.
  • Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev., 60(2):223–311, 2018.
  • Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Chen et al. (2018) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. ICLR, 2018.
  • Chollet et al. (2015) Chollet, F. et al. Keras, 2015. URL https://github.com/fchollet/keras.
  • Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Dozat (2016) Dozat, T. Incorporating nesterov momentum into ADAM. ICLR Workshop, 1:2013–2016, 2016.
  • Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
  • Gürbüzbalaban et al. (2019) Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, Oct 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01440-w. URL http://dx.doi.org/10.1007/s10107-019-01440-w.
  • Haochen & Sra (2019) Haochen, J. and Sra, S. Random shuffling beats sgd after finite epochs. In International Conference on Machine Learning, pp. 2624–2633. PMLR, 2019.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp.  315–323, 2013.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. ADAM: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2020) Li, L., Zhuang, Z., and Orabona, F. Exponential step sizes for non-convex optimization. arXiv preprint arXiv:2002.05273, 2020.
  • Li et al. (2019) Li, X., Zhu, Z., So, A., and Lee, J. D. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.
  • Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts, 2017.
  • Meng et al. (2019) Meng, Q., Chen, W., Wang, Y., Ma, Z.-M., and Liu, T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. Neurocomputing, 337:46–57, 2019.
  • Mishchenko et al. (2020) Mishchenko, K., Khaled Ragab Bayoumi, A., and Richtárik, P. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.
  • Nagaraj et al. (2019) Nagaraj, D., Jain, P., and Netrapalli, P. Sgd without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.
  • Nedić & Bertsekas (2001) Nedić, A. and Bertsekas, D. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pp. 223–264. Springer, 2001.
  • Nedic & Bertsekas (2001) Nedic, A. and Bertsekas, D. P. Incremental subgradient methods for nondifferentiable optimization. SIAM J. on Optim., 12(1):109–138, 2001.
  • Nesterov (1983) Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence \mathcal{O}(1/k^{2}). Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.
  • Nesterov (2004) Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
  • Nguyen et al. (2018) Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P., Scheinberg, K., and Takac, M. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning-Volume 80, pp.  3747–3755, 2018.
  • Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  2613–2621. JMLR. org, 2017.
  • Nguyen et al. (2019) Nguyen, L. M., Nguyen, P. H., Richtárik, P., Scheinberg, K., Takáč, M., and van Dijk, M. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20(176):1–49, 2019. URL http://jmlr.org/papers/v20/18-759.html.
  • Nguyen et al. (2020) Nguyen, L. M., Tran-Dinh, Q., Phan, D. T., Nguyen, P. H., and van Dijk, M. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
  • Polyak (1964) Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Rajput et al. (2020) Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of sgd without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.
  • Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Ruder (2017) Ruder, S. An overview of gradient descent optimization algorithms, 2017.
  • Safran & Shamir (2020) Safran, I. and Shamir, O. How good is sgd with random shuffling? In Conference on Learning Theory, pp.  3250–3284. PMLR, 2020.
  • Shamir (2016) Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in neural information processing systems, pp. 46–54, 2016.
  • Smith (2017) Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.  464–472. IEEE, 2017.
  • Tran-Dinh et al. (2019) Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.
  • Wang et al. (2020) Wang, B., Nguyen, T. M., Bertozzi, A. L., Baraniuk, R. G., and Osher, S. J. Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583, 2020.
  • Wang et al. (2019) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32:2406–2416, 2019.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Ying et al. (2017) Ying, B., Yuan, K., and Sayed, A. H. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2(3):6, 2017.

Supplementary Document:

SMG: A Shuffling Gradient-Based Method with Momentum

7 Technical Proofs of Theorem 1 and Its Consequences in Section 3

This section provides the full proofs of Theorem 1 and its consequences in Section 3.

7.1 Auxiliary Notation and Common Expressions

Remark 4.

We use the superscript “(t)(t)” for the epoch counter, and the subscript “ii” for the counter of the inner loop of the tt-th epoch. In addition, at each epoch tt, we fix the learning rate ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n} for given ηt>0\eta_{t}>0.

In this Supp. Doc., we repeatedly use some common notation in the analysis of three different shuffling strategies: Randomized Reshuffling, Shuffle Once, and Incremental Gradient. We introduce this notation here.

For any t1t\geq 1, i=0,,ni=0,\dots,n, w0(t)dw_{0}^{(t)}\in\mathbb{R}^{d} generated by Algorithm 1, and two permutations π(t)\pi^{(t)} and π(t1)\pi^{(t-1)} of [n][n] generated at epochs tt and t1t-1, we denote

Ai(t):=j=0i1gj(t)2,\displaystyle A_{i}^{(t)}:=\Big{\|}\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}, (11)
Bi(t):=j=in1gj(t)2,\displaystyle B_{i}^{(t)}:=\Big{\|}\sum_{j=i}^{n-1}g_{j}^{(t)}\Big{\|}^{2}, (12)
It:=i=0n1f(w0(t);π(t)(i+1))gi(t)2,\displaystyle I_{t}:=\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(i+1))-g_{i}^{(t)}\Big{\|}^{2}, (13)
Jt:=i=0n1f(w0(t);π(t1)(i+1))gi(t1)2,t2,\displaystyle J_{t}:=\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(i+1))-g_{i}^{(t-1)}\Big{\|}^{2},\quad t\geq 2, (14)
Kt:=n(Θ+1)F(w0(t))2+nσ2,\displaystyle K_{t}:=n(\Theta+1)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+n\sigma^{2},\vspace{1.5ex} (15)
Nt:=(n+Θ)F(w0(t))2+σ2,\displaystyle N_{t}:=(n+\Theta)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}, (16)

and

ξt:={max(ηt,ηt1)ift>1,η1ift=1.\xi_{t}:=\left\{\begin{array}[]{ll}\max(\eta_{t},\eta_{t-1})&\text{if}\ t>1,\\ \eta_{1}&\text{if}\ t=1.\end{array}\right. (17)

Note that we adopt the convention j=01gj(t)=0\sum_{j=0}^{-1}g_{j}^{(t)}=0, and j=nn1gj(t)=0\sum_{j=n}^{n-1}g_{j}^{(t)}=0 in the definitions of Ai(t)A_{i}^{(t)} and Bi(t)B_{i}^{(t)}.

For each epoch t=1,,Tt=1,\cdots,T, we denote by t\mathcal{F}_{t} the σ\sigma-algebra σ(w0(1),,w0(t))\sigma(w_{0}^{(1)},\cdots,w_{0}^{(t)}) generated by the iterates of Algorithm 1 up to the beginning of epoch tt. From Step 4 of Algorithm 1, we observe that the permutation π(t)\pi^{(t)} used at epoch tt is independent of the σ\sigma-algebra t\mathcal{F}_{t}. This fact is the key ingredient of our analysis for Randomized Reshuffling methods. We also denote by 𝔼t[]\mathbb{E}_{t}[\cdot] the conditional expectation 𝔼[t]\mathbb{E}[\cdot\mid\mathcal{F}_{t}] with respect to the σ\sigma-algebra t\mathcal{F}_{t}.

Now, we collect the following expressions derived from Algorithm 1, which are commonly used for every shuffling method. First, from the update rule at Steps 1, 4, and 5 of Algorithm 1, for t>1t>1, we have

m0(t)=m~t1=vn(t1)=(2)vn1(t1)+1ngn1(t1)=v0(t1)+1nj=0n1gj(t1)=1nj=0n1gj(t1).m_{0}^{(t)}=\tilde{m}_{t-1}=v_{n}^{(t-1)}\overset{\eqref{eq:main_update}}{=}v_{n-1}^{(t-1)}+\frac{1}{n}g_{n-1}^{(t-1)}=v_{0}^{(t-1)}+\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t-1)}=\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t-1)}. (18)

For the special case when t=1t=1, we set m0(1)=m~0=0dm_{0}^{(1)}=\tilde{m}_{0}=\textbf{0}\in\mathbb{R}^{d}.

Next, from the update wi+1(t):=wi(t)ηi(t)mi+1(t)w_{i+1}^{(t)}:=w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)} with ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, for i=1,2,,n1i=1,2,\dots,n-1, at Step 7 of Algorithm 1, we can derive

wi(t)\displaystyle w_{i}^{(t)} =(2)wi1(t)ηtnmi(t)=w0(t)ηtnj=0i1mj+1(t),t1,\displaystyle\overset{\eqref{eq:main_update}}{=}w_{i-1}^{(t)}-\frac{\eta_{t}}{n}m_{i}^{(t)}=w_{0}^{(t)}-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}m_{j+1}^{(t)},\quad t\geq 1, (19)
w0(t)=wn(t1)\displaystyle w_{0}^{(t)}=w_{n}^{(t-1)} =(2)wn1(t1)ηt1nmn(t1)=wi(t1)ηt1nj=in1mj+1(t1),t2.\displaystyle\overset{\eqref{eq:main_update}}{=}w_{n-1}^{(t-1)}-\frac{\eta_{t-1}}{n}m_{n}^{(t-1)}=w_{i}^{(t-1)}-\frac{\eta_{t-1}}{n}\sum_{j=i}^{n-1}m_{j+1}^{(t-1)},\quad t\geq 2. (20)

Note that j=01mj+1(t)=0\sum_{j=0}^{-1}m_{j+1}^{(t)}=0 and j=nn1mj+1(t1)=0\sum_{j=n}^{n-1}m_{j+1}^{(t-1)}=0 by convention, so these equations also hold true for i=0i=0 and i=ni=n.
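To make the per-epoch indexing above concrete, the following minimal NumPy sketch performs one epoch of Algorithm 1 exactly as in the recursions (18)–(20). It is an illustration only: the function names smg_epoch and grad_f and the quadratic toy components f(w;i) = (1/2)‖w − a_i‖² are our own assumptions and not part of the algorithm's statement.

import numpy as np

def smg_epoch(w0, m0, grad_f, perm, eta_t, beta):
    # One epoch t of Algorithm 1, following (18)-(20):
    #   g_i^{(t)}     = grad f(w_i^{(t)}; pi^{(t)}(i+1)),
    #   m_{i+1}^{(t)} = beta * m_0^{(t)} + (1 - beta) * g_i^{(t)},
    #   w_{i+1}^{(t)} = w_i^{(t)} - (eta_t / n) * m_{i+1}^{(t)}.
    # It returns w_0^{(t+1)} = w_n^{(t)} and m_0^{(t+1)} = (1/n) sum_j g_j^{(t)}, cf. (18).
    n = len(perm)
    w = w0.copy()
    g_sum = np.zeros_like(w0)
    for i in range(n):
        g = grad_f(w, perm[i])
        m = beta * m0 + (1.0 - beta) * g
        w = w - (eta_t / n) * m
        g_sum += g
    return w, g_sum / n

# Toy illustration (assumed data): f(w; i) = 0.5 * ||w - a_i||^2, so F is minimized at the mean of the a_i.
rng = np.random.default_rng(0)
n, d = 10, 5
A = rng.standard_normal((n, d))
grad_f = lambda w, i: w - A[i]

w, m0 = rng.standard_normal(d), np.zeros(d)   # m_0^{(1)} = 0, cf. the convention above
for t in range(1, 51):
    perm = rng.permutation(n)                 # Randomized Reshuffling; a fixed permutation gives Shuffle Once
    w, m0 = smg_epoch(w, m0, grad_f, perm, eta_t=0.5, beta=0.5)
print(np.linalg.norm(w - A.mean(axis=0)))     # distance to the minimizer; should be close to zero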

7.2 The Proof Sketch of Theorem 1

Since our proofs are rather technical, we first provide here a sketch of the proof of our first main result, Theorem 1; the full proof is given in the subsequent subsections.

The proof of Theorem 1 is divided into the following steps.

  • First, from the LL-smoothness of fif_{i} and the update of wi(t)w^{(t)}_{i}, we can bound

    F(w~t)F(w~t1)ηt2F(w~t1)2+𝒯[1],F(\tilde{w}_{t})\leq F(\tilde{w}_{t-1})-\frac{\eta_{t}}{2}\|\nabla{F}(\tilde{w}_{t-1})\|^{2}+\mathcal{T}_{[1]},

    where

    𝒯[1]:=ηt2F(w~t1)1nj=0n1(βgj(t1)+(1β)gj(t))2.\mathcal{T}_{[1]}:=\frac{\eta_{t}}{2}\big{\|}\nabla{F}(\tilde{w}_{t-1})-\frac{1}{n}\sum_{j=0}^{n-1}\big{(}\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\big{)}\big{\|}^{2}.

    Hence, the key step here is to upper bound the sum of the terms 𝒯[1]\mathcal{T}_{[1]} over tt in terms of t=1TηtF(w~t1)2\sum_{t=1}^{T}\eta_{t}\|\nabla{F}(\tilde{w}_{t-1})\|^{2}.

  • Next, we upper bound 𝒯[1]ηt2n(βJt+(1β)It)\mathcal{T}_{[1]}\leq\frac{\eta_{t}}{2n}(\beta J_{t}+(1-\beta)I_{t}), where ItI_{t} and JtJ_{t} are defined by (13) and (14), respectively; they collect the squared errors between gi(t)g_{i}^{(t)} (resp. gi(t1)g_{i}^{(t-1)}) and the gradients of ff evaluated at w0(t)w_{0}^{(t)} under the permutations of epochs tt and t1t-1, respectively.

  • Then, βJt+(1β)It\beta J_{t}+(1-\beta)I_{t} can be upper bounded as

    βJt+(1β)It𝒪(ξt2j=1tβtjF(w0(j))2+ξt2σ2),\beta J_{t}+(1-\beta)I_{t}\leq\operatorname{\mathcal{O}}\Big{(}\xi_{t}^{2}\sum_{j=1}^{t}\beta^{t-j}\|\nabla{F}(w_{0}^{(j)})\|^{2}+\xi^{2}_{t}\sigma^{2}\Big{)},

    as shown in Lemma 4, where ξt:=max{ηt1,ηt}\xi_{t}:=\max\{\eta_{t-1},\eta_{t}\} defined in (17).

  • We further upper bound the sum of the right-hand side of the last estimate over tt, as in Lemma 5, in terms of t=1TηtF(w~t1)2\sum_{t=1}^{T}\eta_{t}\|\nabla{F}(\tilde{w}_{t-1})\|^{2}.

Combining these steps and simplifying the result, we obtain (6) in Theorem 1.

7.3 Technical Lemmas

The proof of Theorem 1 requires the following lemmas as intermediate steps of its proof. Let us first upper bound wi(t)w0(t)2\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2} and wi(t1)w0(t)2\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2} in the following lemma.

Lemma 1.

Let {wi(t)}\{w_{i}^{(t)}\} be generated by Algorithm 1 with 0β<10\leq\beta<1 and ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n} for every t1t\geq 1. Then, for i=0,1,,ni=0,1,\dots,n, we have

(a)\displaystyle(\textbf{a})\ wi(t)w0(t)2{ξt2n2[βi2n2An(t1)+(1β)Ai(t)],ift>1,ξt2n2(1β)2Ai(t),ift=1,\displaystyle\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2}\leq\left\{\begin{array}[]{ll}\frac{\xi_{t}^{2}}{n^{2}}\left[\beta\frac{i^{2}}{n^{2}}A_{n}^{(t-1)}+(1-\beta)A_{i}^{(t)}\right],&\text{if}~{}t>1,\\ \frac{\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}A_{i}^{(t)},&\text{if}~{}t=1,\end{array}\right. (21)
(b)\displaystyle(\textbf{b})\ wi(t1)w0(t)2{ξt2n2[β(ni)2n2An(t2)+(1β)Bi(t1)],ift>2,ξt2n2(1β)2Bi(t1),ift=2.\displaystyle\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2}\leq\left\{\begin{array}[]{ll}\frac{\xi_{t}^{2}}{n^{2}}\left[\beta\frac{(n-i)^{2}}{n^{2}}A_{n}^{(t-2)}+(1-\beta)B_{i}^{(t-1)}\right],&\text{if}~{}t>2,\\ \frac{\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}B_{i}^{(t-1)},&\text{if}~{}t=2.\end{array}\right. (22)

Next, we upper bound the quantities An(t)A_{n}^{(t)}, i=0n1Ai(t)\sum_{i=0}^{n-1}A_{i}^{(t)}, and i=0n1Bi(t)\sum_{i=0}^{n-1}B^{(t)}_{i} defined above in terms of ItI_{t} and KtK_{t}.

Lemma 2.

Under the same setting as of Lemma 1 and Assumption 1(c), for t1t\geq 1, it holds that

(a)\displaystyle(\textbf{a})\qquad An(t)3n(It+Kt)andi=0n1Ai(t)3n22(It+Kt),\displaystyle A_{n}^{(t)}\leq 3n\Big{(}I_{t}+K_{t}\Big{)}\quad\text{and}\quad\sum_{i=0}^{n-1}A_{i}^{(t)}\leq\frac{3n^{2}}{2}\Big{(}I_{t}+K_{t}\Big{)}, (23)
(b)\displaystyle(\textbf{b})\qquad i=0n1Bi(t)3n2(It+Kt).\displaystyle\sum_{i=0}^{n-1}B_{i}^{(t)}\leq 3n^{2}\Big{(}I_{t}+K_{t}\Big{)}. (24)

The results of Lemma 3 below are direct consequences of Lemma 1, Lemma 2, and the fact that L2ξt225L^{2}\xi_{t}^{2}\leq\frac{2}{5}.

Lemma 3.

Under the same setting as of Lemma 2 and Assumption 1(b), it holds that

(a)\displaystyle(\textbf{a})\ It{L2ξt2[β(It1+Kt1)+32(1β)(It+Kt)],ift>1,32L2ξt2(1β)2(It+Kt),ift=1,\displaystyle I_{t}\leq\left\{\begin{array}[]{ll}L^{2}\xi_{t}^{2}\Big{[}\beta(I_{t-1}+K_{t-1})+\frac{3}{2}(1-\beta)(I_{t}+K_{t})\Big{]},&\text{if}~{}t>1,\\ \frac{3}{2}L^{2}\xi_{t}^{2}(1-\beta)^{2}\Big{(}I_{t}+K_{t}\Big{)},&\text{if}~{}t=1,\end{array}\right.
(b)\displaystyle(\textbf{b})\ Jt{3L2ξt2[β(It2+Kt2)+(1β)(It1+Kt1)],ift>2,3L2ξt2(It1+Kt1),ift=2,\displaystyle J_{t}\leq\left\{\begin{array}[]{ll}3L^{2}\xi_{t}^{2}\Big{[}\beta(I_{t-2}+K_{t-2})+(1-\beta)(I_{t-1}+K_{t-1})\Big{]},&\text{if}~{}t>2,\\ 3L^{2}\xi_{t}^{2}\Big{(}I_{t-1}+K_{t-1}\Big{)},&\text{if}~{}t=2,\end{array}\right.
(c)\displaystyle(\textbf{c})\ If L2ξt225, then It+Kt{β(It1+Kt1)+53β2Kt,ift>1,53β2K1,ift=1.\displaystyle\text{If }L^{2}\xi_{t}^{2}\leq\frac{2}{5},\text{ then }I_{t}+K_{t}\leq\left\{\begin{array}[]{ll}\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+\frac{5-3\beta}{2}K_{t},&\text{if}~{}t>1,\\ \frac{5-3\beta}{2}K_{1},&\text{if}~{}t=1.\end{array}\right.

One of our key estimates is to upper bound the quantity βJt+(1β)It\beta J_{t}+(1-\beta)I_{t}, which is derived in the following lemma.

Lemma 4.

Under the same conditions as in Lemma 3, for any t2t\geq 2, we have

βJt+(1β)It\displaystyle\beta J_{t}+(1-\beta)I_{t} 9(53β)L2ξt22(n(Θ+1)j=1tβtjF(w0(j))2+nσ21β).\displaystyle\leq\frac{9(5-3\beta)L^{2}\xi_{t}^{2}}{2}\left(n(\Theta+1)\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\frac{n\sigma^{2}}{1-\beta}\right). (25)
Lemma 5.

Under the same settings as in Theorem 1 (or Theorem 2, respectively) and ηtηt+1\eta_{t}\geq\eta_{t+1} for t1t\geq 1, we have

t=1Tηtj=1tβtjF(w~j1)211βt=1TηtF(w~t1)2.\displaystyle\sum_{t=1}^{T}\eta_{t}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}\leq\frac{1}{1-\beta}\sum_{t=1}^{T}\eta_{t}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}.
Lemma 6.

Under the same setting as in Theorem 1, we have

η12F(w~0)2F(w~0)F(w~1)(1β)+(1β)η14F(w~0)2+9σ2(53β)L24(1β)ξ13.\displaystyle\frac{\eta_{1}}{2}\|\nabla F(\tilde{w}_{0})\|^{2}\leq\frac{F(\tilde{w}_{0})-F(\tilde{w}_{1})}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}}{4(1-\beta)}\cdot\xi_{1}^{3}. (26)

7.4 The Proof of Theorem 1: Key Estimate for Algorithm 1

First, from the assumption 0<ηt1LK0<\eta_{t}\leq\frac{1}{L\sqrt{K}}, t1t\geq 1, we have 0<ηt21KL20<\eta_{t}^{2}\leq\frac{1}{KL^{2}}. Next, from (17), we have ξt=max(ηt;ηt1)\xi_{t}=\max(\eta_{t};\eta_{t-1}) for t>1t>1 and ξ1=η1\xi_{1}=\eta_{1}, which lead to 0<ξt21KL20<\xi_{t}^{2}\leq\frac{1}{KL^{2}} for t1t\geq 1. Moreover, from the definition of K=max(52,9(53β)(Θ+1)1β)K=\max\left(\frac{5}{2},\frac{9(5-3\beta)(\Theta+1)}{1-\beta}\right) in Theorem 1, we have L2ξt225L^{2}\xi_{t}^{2}\leq\frac{2}{5} and 9L2ξt2(53β)(Θ+1)1β9L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+1)\leq 1-\beta for t1.t\geq 1.
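As a quick numerical sanity check of these two consequences of the choice of K in Theorem 1 (a hedged illustration with arbitrary sample values of β, Θ, and L; it is not part of the proof):

import math

# Check that eta_t <= 1/(L*sqrt(K)) with K = max(5/2, 9*(5-3*beta)*(Theta+1)/(1-beta))
# implies L^2*xi_t^2 <= 2/5 and 9*L^2*xi_t^2*(5-3*beta)*(Theta+1) <= 1-beta.
for beta, Theta, L in [(0.0, 0.1, 2.0), (0.5, 1.0, 5.0), (0.9, 10.0, 1.0)]:
    K = max(5.0 / 2.0, 9.0 * (5.0 - 3.0 * beta) * (Theta + 1.0) / (1.0 - beta))
    xi = 1.0 / (L * math.sqrt(K))   # the largest admissible value of xi_t = max(eta_t, eta_{t-1})
    assert L ** 2 * xi ** 2 <= 2.0 / 5.0 + 1e-12
    assert 9.0 * L ** 2 * xi ** 2 * (5.0 - 3.0 * beta) * (Theta + 1.0) <= (1.0 - beta) + 1e-12
print("learning-rate conditions hold for all tested (beta, Theta, L)")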

Now, letting i=ni=n in Equation (19), for all t1t\geq 1, we have

wn(t)w0(t)\displaystyle w_{n}^{(t)}-w_{0}^{(t)} =ηtnj=0n1mj+1(t)=(2)ηtnj=0n1(βm0(t)+(1β)gj(t))=ηtn(nβm0(t)+(1β)j=0n1gj(t)).\displaystyle=-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}m_{j+1}^{(t)}\overset{\eqref{eq:main_update}}{=}-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}\big{(}\beta m_{0}^{(t)}+(1-\beta)g_{j}^{(t)}\big{)}=-\frac{\eta_{t}}{n}\Big{(}n\beta m_{0}^{(t)}+(1-\beta)\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{)}.

Since m0(t)=1nj=0n1gj(t1)m_{0}^{(t)}=\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t-1)} for t2t\geq 2 (due to (18)), we have the following update

wn(t)w0(t)\displaystyle w_{n}^{(t)}-w_{0}^{(t)} =ηtnj=0n1(βgj(t1)+(1β)gj(t)).\displaystyle=-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}\Big{(}\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\Big{)}. (27)

From the LL-smoothness of FF in Assumption 1(b), (27), and the fact that w0(t+1):=wn(t)w_{0}^{(t+1)}:=w_{n}^{(t)}, for every epoch t2t\geq 2, we can derive

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) (5)F(w0(t))+F(w0(t))(w0(t+1)w0(t))+L2w0(t+1)w0(t)2\displaystyle\overset{\eqref{eq:Lsmooth}}{\leq}F(w_{0}^{(t)})+\nabla F(w_{0}^{(t)})^{\top}(w_{0}^{(t+1)}-w_{0}^{(t)})+\frac{L}{2}\|w_{0}^{(t+1)}-w_{0}^{(t)}\|^{2}
=(27)F(w0(t))ηtF(w0(t))(1nj=0n1(βgj(t1)+(1β)gj(t)))\displaystyle\overset{\eqref{update_epoch_02}}{=}F(w_{0}^{(t)})-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}\left(\frac{1}{n}\sum_{j=0}^{n-1}\Big{(}\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\Big{)}\right)
+Lηt221nj=0n1(βgj(t1)+(1β)gj(t))2\displaystyle\quad+\frac{L\eta_{t}^{2}}{2}\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}\Big{(}\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\Big{)}\Big{\|}^{2}
=(a)F(w0(t))ηt2F(w0(t))2+ηt2F(w0(t))1nj=0n1(βgj(t1)+(1β)gj(t))2\displaystyle\overset{\tiny(a)}{=}F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2}\Big{\|}\nabla F(w_{0}^{(t)})-\frac{1}{n}\sum_{j=0}^{n-1}\left(\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\right)\Big{\|}^{2}
ηt2(1Lηt)1nj=0n1(βgj(t1)+(1β)gj(t))2\displaystyle\quad-\frac{\eta_{t}}{2}\left(1-L\eta_{t}\right)\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}\left(\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\right)\Big{\|}^{2}
F(w0(t))ηt2F(w0(t))2+ηt2F(w0(t))1nj=0n1(βgj(t1)+(1β)gj(t))2,\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2}\Big{\|}\nabla F(w_{0}^{(t)})-\frac{1}{n}\sum_{j=0}^{n-1}\left(\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\right)\Big{\|}^{2}, (28)

where (a) follows from the elementary equality uv=12(u2+v2uv2)u^{\top}v=\frac{1}{2}\left(\|u\|^{2}+\|v\|^{2}-\|u-v\|^{2}\right) and the last inequality comes from the fact that ηt25L1L\eta_{t}\leq\frac{\sqrt{2}}{\sqrt{5}L}\leq\frac{1}{L}, due to the choice of our learning rate ηt\eta_{t}.

Next, we use the fact that each of the permutations π(t)\pi^{(t)} and π(t1)\pi^{(t-1)} visits every component of [n][n] exactly once to rewrite the full gradient as

F(w0(t))\displaystyle\nabla F(w_{0}^{(t)}) =βF(w0(t))+(1β)F(w0(t))\displaystyle=\beta\nabla F(w_{0}^{(t)})+(1-\beta)\nabla F(w_{0}^{(t)})
=βnj=0n1f(w0(t);π(t1)(j+1))+(1β)nj=0n1f(w0(t);π(t)(j+1))\displaystyle=\frac{\beta}{n}\sum_{j=0}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(j+1))+\frac{(1-\beta)}{n}\sum_{j=0}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))
=1nj=0n1(βf(w0(t);π(t1)(j+1))+(1β)f(w0(t);π(t)(j+1))).\displaystyle=\frac{1}{n}\sum_{j=0}^{n-1}\left(\beta\nabla f(w_{0}^{(t)};\pi^{(t-1)}(j+1))+(1-\beta)\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\right).

Let us upper bound the last term of (28) as follows:

𝒯[1]\displaystyle\mathcal{T}_{[1]} :=ηt2F(w0(t))1nj=0n1(βgj(t1)+(1β)gj(t))2\displaystyle:=\frac{\eta_{t}}{2}\Big{\|}\nabla F(w_{0}^{(t)})-\frac{1}{n}\sum_{j=0}^{n-1}\left(\beta g_{j}^{(t-1)}+(1-\beta)g_{j}^{(t)}\right)\Big{\|}^{2}
=ηt21nj=0n1[β(f(w0(t);π(t1)(j+1))gj(t1))+(1β)(f(w0(t);π(t)(j+1))gj(t))]2\displaystyle=\frac{\eta_{t}}{2}\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}\left[\beta\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(j+1))-g_{j}^{(t-1)}\Big{)}+(1-\beta)\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-g_{j}^{(t)}\Big{)}\right]\Big{\|}^{2}
(b)ηt2nj=0n1β(f(w0(t);π(t1)(j+1))gj(t1))+(1β)(f(w0(t);π(t)(j+1))gj(t))2\displaystyle\overset{(b)}{\leq}\frac{\eta_{t}}{2n}\sum_{j=0}^{n-1}\Big{\|}\beta\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(j+1))-g_{j}^{(t-1)}\Big{)}+(1-\beta)\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-g_{j}^{(t)}\Big{)}\Big{\|}^{2}
(c)ηt2nj=0n1[βf(w0(t);π(t1)(j+1))gj(t1)2+(1β)f(w0(t);π(t)(j+1))gj(t)2]\displaystyle\overset{(c)}{\leq}\frac{\eta_{t}}{2n}\sum_{j=0}^{n-1}\left[\beta\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(j+1))-g_{j}^{(t-1)}\Big{\|}^{2}+(1-\beta)\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-g_{j}^{(t)}\Big{\|}^{2}\right]
(13),(14)ηt2n[βJt+(1β)It](by the definition of It and Jt),\displaystyle\overset{\eqref{define_I},\eqref{define_J}}{\leq}\frac{\eta_{t}}{2n}\Big{[}\beta J_{t}+(1-\beta)I_{t}\Big{]}\qquad\textrm{(by the definition of $I_{t}$ and $J_{t}$)},

where (b) is from the Cauchy-Schwarz inequality, (c) follows from the convexity of 2\|\cdot\|^{2} for 0β<10\leq\beta<1.

Applying this bound to (28), we get the following result:

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))ηt2F(w0(t))2+ηt2n[βJt+(1β)It].\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2n}\Big{[}\beta J_{t}+(1-\beta)I_{t}\Big{]}. (29)
Remark 5.

Note that the derived estimate (29) is true for all shuffling strategies, including Random Reshuffling, Shuffle Once, and Incremental Gradient (under the assumptions of Theorem 1 and 2, respectively).

Applying the result of Lemma 4 for t2t\geq 2, we obtain

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))ηt2F(w0(t))2+9(53β)ηtL2ξt24n[n(Θ+1)j=1tβtjF(w0(j))2+nσ21β],\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{9(5-3\beta)\eta_{t}L^{2}\xi_{t}^{2}}{4n}\Bigg{[}n(\Theta+1)\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\frac{n\sigma^{2}}{1-\beta}\Bigg{]},

which leads to

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))ηt2F(w0(t))2+9(53β)ηtL2ξt2(Θ+1)4j=1tβtjF(w0(j))2+9σ2(53β)L2ξt34(1β).\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{9(5-3\beta)\eta_{t}L^{2}\xi_{t}^{2}(\Theta+1)}{4}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}\xi_{t}^{3}}{4(1-\beta)}.

Since ξt\xi_{t} satisfies 9L2ξt2(53β)(Θ+1)1β9L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+1)\leq 1-\beta as proved above, we can deduce from the last estimate that

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))ηt2F(w0(t))2+(1β)ηt4j=1tβtjF(w0(j))2+9σ2(53β)L24(1β)ξt3.\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}}{4(1-\beta)}\cdot\xi_{t}^{3}.

Rearranging this inequality and noting that w~t1=w0(t)\tilde{w}_{t-1}=w_{0}^{(t)}, for t2t\geq 2, we obtain the following estimate:

ηt2F(w~t1)2F(w~t1)F(w~t)+(1β)ηt4j=1tβtjF(w~j1)2+9σ2(53β)L24(1β)ξt31β<1F(w~t1)F(w~t)(1β)+(1β)ηt4j=1tβtjF(w~j1)2+9σ2(53β)L24(1β)ξt3.\begin{array}[]{lcl}\frac{\eta_{t}}{2}\|\nabla F(\tilde{w}_{t-1})\|^{2}&\leq&F(\tilde{w}_{t-1})-F(\tilde{w}_{t})+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}}{4(1-\beta)}\cdot\xi_{t}^{3}\\ &\overset{1-\beta<1}{\leq}&\frac{F(\tilde{w}_{t-1})-F(\tilde{w}_{t})}{(1-\beta)}+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}}{4(1-\beta)}\cdot\xi_{t}^{3}.\end{array}

For t=1t=1, since ξ1=η1\xi_{1}=\eta_{1} as previously defined in (17), from the result of Lemma 6 we have

η12F(w~0)2F(w~0)F(w~1)(1β)+(1β)η14j=11β1jF(w~j1)2+9σ2(53β)L24(1β)ξ13.\displaystyle\frac{\eta_{1}}{2}\|\nabla F(\tilde{w}_{0})\|^{2}\leq\frac{F(\tilde{w}_{0})-F(\tilde{w}_{1})}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\sum_{j=1}^{1}\beta^{1-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}+\frac{9\sigma^{2}(5-3\beta)L^{2}}{4(1-\beta)}\cdot\xi_{1}^{3}.

Summing the previous estimate from t=2t=2 to t=Tt=T, and then adding the last estimate for t=1t=1, we obtain

12t=1TηtF(w~t1)2\displaystyle\frac{1}{2}\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2} F(w~0)F(1β)+(1β)4t=1Tηtj=1tβtjF(w~j1)2+9σ2L2(53β)4(1β)t=1Tξt3.\displaystyle\leq\frac{F(\tilde{w}_{0})-F_{*}}{(1-\beta)}+\frac{(1-\beta)}{4}\sum_{t=1}^{T}\eta_{t}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{4(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3}.

Applying Lemma 5 to the last estimate, we get

12t=1TηtF(w~t1)2\displaystyle\frac{1}{2}\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2} F(w~0)F(1β)+(1β)41(1β)t=1TηtF(w~t1)2+9σ2L2(53β)4(1β)t=1Tξt3,\displaystyle\leq\frac{F(\tilde{w}_{0})-F_{*}}{(1-\beta)}+\frac{(1-\beta)}{4}\frac{1}{(1-\beta)}\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{4(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3},

which is equivalent to

t=1TηtF(w~t1)2\displaystyle\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2} 4[F(w~0)F](1β)+9σ2L2(53β)(1β)t=1Tξt3.\displaystyle\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3}. (30)

Dividing both sides of the resulting estimate by t=1Tηt\sum_{t=1}^{T}\eta_{t}, we obtain (6). Finally, due to the choice of w^T\hat{w}_{T} at Step 11 in Algorithm 1, we have 𝔼[F(w^T)2]=1t=1Tηtt=1TηtF(w~t1)2\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}=\frac{1}{\sum_{t=1}^{T}\eta_{t}}\sum_{t=1}^{T}\eta_{t}\|\nabla F(\tilde{w}_{t-1})\|^{2}. \square
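For later reference, dividing (30) by \sum_{t=1}^{T}\eta_{t} and using the identity above for 𝔼[‖∇F(ŵ_T)‖²] yields

\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\sum_{t=1}^{T}\eta_{t}}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\cdot\frac{\sum_{t=1}^{T}\xi_{t}^{3}}{\sum_{t=1}^{T}\eta_{t}},

which should coincide with the bound (6) from the main text; this is the form into which the learning-rate choices are substituted in the corollaries below.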

7.5 The Proof of Corollaries 1 and 2: Constant and Diminishing Learning Rates

The proof of Corollary 1.

Since T1T\geq 1 and ηt:=γT1/3\eta_{t}:=\frac{\gamma}{T^{1/3}}, we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LK0<\eta_{t}\leq\frac{1}{L\sqrt{K}} for all t1t\geq 1, and t=1Tηt=γT2/3\sum_{t=1}^{T}\eta_{t}=\gamma T^{2/3} and t=1Tηt13=γ3\sum_{t=1}^{T}\eta_{t-1}^{3}=\gamma^{3}. Substituting these expressions into (6), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} 4[F(w~0)F](1β)γT2/3+9σ2L2(53β)(1β)γ2T2/3.\displaystyle\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma T^{2/3}}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\cdot\frac{\gamma^{2}}{T^{2/3}}.

which is our desired result. ∎

The proof of Corollary 2.

For t1t\geq 1, since ηt=γ(t+λ)1/3\eta_{t}=\frac{\gamma}{(t+\lambda)^{1/3}}, we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LK0<\eta_{t}\leq\frac{1}{L\sqrt{K}} for all t1t\geq 1, and

t=1Tηt=γt=1T1(t+λ)1/3γ1Tdτ(τ+λ)1/3γ[(T+λ)2/3(1+λ)2/3],t=1Tηt13=2η13+γ3t=3T1t1+λ2γ3(1+λ)+γ3t=2Tdττ1+λγ3[2(1+λ)+log(T1+λ)].\begin{array}[]{lll}&\sum_{t=1}^{T}\eta_{t}&=\gamma\sum_{t=1}^{T}\frac{1}{(t+\lambda)^{1/3}}\geq\gamma\int_{1}^{T}\frac{d\tau}{(\tau+\lambda)^{1/3}}\geq\gamma\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right],\\ &\sum_{t=1}^{T}\eta_{t-1}^{3}&=2\eta_{1}^{3}+\gamma^{3}\sum_{t=3}^{T}\frac{1}{t-1+\lambda}\leq\frac{2\gamma^{3}}{(1+\lambda)}+\gamma^{3}\int_{t=2}^{T}\frac{d\tau}{\tau-1+\lambda}\leq\gamma^{3}\left[\frac{2}{(1+\lambda)}+\log(T-1+\lambda)\right].\end{array}

Substituting these expressions into (6), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} 4[F(w~0)F](1β)γ[(T+λ)2/3(1+λ)2/3]+9σ2L2(53β)(1β)γ2[2(1+λ)+log(T1+λ)][(T+λ)2/3(1+λ)2/3].\displaystyle\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\cdot\frac{\gamma^{2}\left[\frac{2}{(1+\lambda)}+\log(T-1+\lambda)\right]}{\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]}.

Let us define C1C_{1} and C2C_{2} as follows:

C1:=4[F(w~0)F](1β)γ+18σ2L2(53β)γ2(1β)(1+λ), and C2:=9σ2L2(53β)γ2(1β).\displaystyle C_{1}:=\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma}+\frac{18\sigma^{2}L^{2}(5-3\beta)\gamma^{2}}{(1-\beta)(1+\lambda)},\text{ and }C_{2}:=\frac{9\sigma^{2}L^{2}(5-3\beta)\gamma^{2}}{(1-\beta)}.

Then, we obtain from the last estimate that

𝔼[F(w^T)2]C1+C2log(T1+λ)(T+λ)2/3(1+λ)2/3,\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}\leq\frac{C_{1}+C_{2}\log(T-1+\lambda)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}},

which completes our proof. ∎

7.6 The Proof of Corollaries 3 and 4: Scheduled Learning Rates

The proof of Corollary 3: Exponential LR.

For t1t\geq 1, since ηt=γαtT1/3\eta_{t}=\frac{\gamma\alpha^{t}}{T^{1/3}} where α:=ρ1/T(0,1)\alpha:=\rho^{1/T}\in(0,1), we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LK0<\eta_{t}\leq\frac{1}{L\sqrt{K}} for all t1t\geq 1 and

t=1Tηt=γT1/3t=1TαtγT1/3t=1TαT=γρT2/3,t=1Tηt13t=1Tη13=γ3α3γ3.\begin{array}[]{lll}&\sum_{t=1}^{T}\eta_{t}&=\frac{\gamma}{T^{1/3}}\sum_{t=1}^{T}\alpha^{t}\geq\frac{\gamma}{T^{1/3}}\sum_{t=1}^{T}\alpha^{T}=\gamma\rho T^{2/3},\\ &\sum_{t=1}^{T}\eta_{t-1}^{3}&\leq\sum_{t=1}^{T}\eta_{1}^{3}=\gamma^{3}\alpha^{3}\leq\gamma^{3}.\end{array}

Substituting these expressions into (6), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} 4[F(w~0)F](1β)γρT2/3+9σ2L2(53β)(1β)γ2ρT2/3.\displaystyle\leq\frac{4[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma\rho T^{2/3}}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\cdot\frac{\gamma^{2}}{\rho T^{2/3}}.

which is our desired result. ∎

The proof of Corollary 4: Cosine LR.

First, we would like to show that t=1Tcost𝝅T=1\sum_{t=1}^{T}\cos\frac{t\boldsymbol{\pi}}{T}=-1. Let us denote A:=t=1Tcost𝝅TA:=\sum_{t=1}^{T}\cos\frac{t\boldsymbol{\pi}}{T}. Multiplying this sum by sin𝝅2T\sin\frac{\boldsymbol{\pi}}{2T}, we get

Asin𝝅2T\displaystyle A\cdot\sin\frac{\boldsymbol{\pi}}{2T} =t=1Tcost𝝅Tsin𝝅2T=(a)12t=1T(sin(2t+1)𝝅2Tsin(2t1)𝝅2T)\displaystyle=\sum_{t=1}^{T}\cos\frac{t\boldsymbol{\pi}}{T}\sin\frac{\boldsymbol{\pi}}{2T}\overset{(a)}{=}\frac{1}{2}\sum_{t=1}^{T}\Big{(}\sin\frac{(2t+1)\boldsymbol{\pi}}{2T}-\sin\frac{(2t-1)\boldsymbol{\pi}}{2T}\Big{)}
=12(sin(2T+1)𝝅2Tsin𝝅2T)=12(sin𝝅2Tsin𝝅2T)=sin𝝅2T,\displaystyle=\frac{1}{2}\Big{(}\sin\frac{(2T+1)\boldsymbol{\pi}}{2T}-\sin\frac{\boldsymbol{\pi}}{2T}\Big{)}=\frac{1}{2}\Big{(}-\sin\frac{\boldsymbol{\pi}}{2T}-\sin\frac{\boldsymbol{\pi}}{2T}\Big{)}=-\sin\frac{\boldsymbol{\pi}}{2T},

where (a) comes from the identity cosasinb=12(sin(a+b)sin(ab))\cos a\cdot\sin b=\frac{1}{2}(\sin(a+b)-\sin(a-b)). Since sin𝝅2T\sin\frac{\boldsymbol{\pi}}{2T} is nonzero, we obtain A=t=1Tcost𝝅T=1.A=\sum_{t=1}^{T}\cos\frac{t\boldsymbol{\pi}}{T}=-1.
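This identity is also easy to confirm numerically; the following small check (our own illustration, independent of the proof) verifies it for several values of T:

import math

for T in (2, 5, 10, 1000):
    total = sum(math.cos(t * math.pi / T) for t in range(1, T + 1))
    assert abs(total - (-1.0)) < 1e-9, (T, total)
print("sum_{t=1}^{T} cos(t*pi/T) = -1 for all tested T")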

For 1tT1\leq t\leq T, since ηt=γT1/3(1+cost𝝅T)\eta_{t}=\frac{\gamma}{T^{1/3}}\left(1+\cos\frac{t\boldsymbol{\pi}}{T}\right), we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LK0<\eta_{t}\leq\frac{1}{L\sqrt{K}} for all t1t\geq 1 and

t=1Tηt=γT1/3t=1T(1+cost𝝅T)=(b)γT1/3(T1)γT1/3T2=γT2/32,forT2,t=1Tηt13γ3Tt=1T23=8γ3, since (1+cost𝝅T)323 for all t1,\begin{array}[]{lll}&\sum_{t=1}^{T}\eta_{t}&=\frac{\gamma}{T^{1/3}}\sum_{t=1}^{T}\left(1+\cos\frac{t\boldsymbol{\pi}}{T}\right)\overset{(b)}{=}\frac{\gamma}{T^{1/3}}(T-1)\geq\frac{\gamma}{T^{1/3}}\cdot\frac{T}{2}=\frac{\gamma T^{2/3}}{2},\ \text{for}\ T\geq 2,\\ &\sum_{t=1}^{T}\eta_{t-1}^{3}&\leq\frac{\gamma^{3}}{T}\sum_{t=1}^{T}2^{3}=8\gamma^{3}\text{, since }\left(1+\cos\frac{t\boldsymbol{\pi}}{T}\right)^{3}\leq 2^{3}\text{ for all }t\geq 1,\end{array}

where (b)(b) follows since t=1Tcost𝝅T=1\sum_{t=1}^{T}\cos\frac{t\boldsymbol{\pi}}{T}=-1 as shown above. Substituting these expressions into (6), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} 8[F(w~0)F](1β)γT2/3+9σ2L2(53β)(1β)16γ2T2/3.\displaystyle\leq\frac{8[F(\tilde{w}_{0})-F_{*}]}{(1-\beta)\gamma T^{2/3}}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{(1-\beta)}\cdot\frac{16\gamma^{2}}{T^{2/3}}.

which is our desired estimate. ∎
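The learning-rate schedules analyzed in Corollaries 1–4 can be summarized in the short Python sketch below (an illustration only; the constants gamma, lam, and rho are placeholders, and in practice every η_t must additionally satisfy η_t ≤ 1/(L√K) as required by Theorem 1):

import math

def lr_constant(t, T, gamma):            # Corollary 1: eta_t = gamma / T^(1/3)
    return gamma / T ** (1.0 / 3.0)

def lr_diminishing(t, T, gamma, lam):    # Corollary 2: eta_t = gamma / (t + lambda)^(1/3)
    return gamma / (t + lam) ** (1.0 / 3.0)

def lr_exponential(t, T, gamma, rho):    # Corollary 3: eta_t = gamma * alpha^t / T^(1/3), alpha = rho^(1/T)
    return gamma * rho ** (t / T) / T ** (1.0 / 3.0)

def lr_cosine(t, T, gamma):              # Corollary 4: eta_t = gamma * (1 + cos(t*pi/T)) / T^(1/3)
    return gamma * (1.0 + math.cos(t * math.pi / T)) / T ** (1.0 / 3.0)

T, gamma = 100, 0.1
for t in (1, T // 2, T):
    print(t, lr_constant(t, T, gamma), lr_diminishing(t, T, gamma, lam=1.0),
          lr_exponential(t, T, gamma, rho=0.5), lr_cosine(t, T, gamma))

All four schedules are non-increasing in t, which is the only structural property used in the proofs above.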

7.7 The Proof of Technical Lemmas

We now provide the proofs of the six lemmas used in the proof of Theorem 1 above.

7.7.1 Proof of Lemma 1: Upper Bounding Two Terms wi(t)w0(t)2\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2} and wi(t1)w0(t)2\|w_{i}^{(t-1)}-w_{0}^{(t)}\|^{2}

Remark 6.

Note that the result of Lemma 1 is true for all shuffling strategies including Random Reshuffling, Shuffle Once and Incremental Gradient (under the assumptions of Theorem 1 and 2, respectively).

(a) From Equation (19), for i=0,1,,ni=0,1,\dots,n and t1t\geq 1, we have

wi(t)w0(t)\displaystyle w_{i}^{(t)}-w_{0}^{(t)} =ηtnj=0i1mj+1(t)=(2)ηtnj=0i1(βm0(t)+(1β)gj(t))=ηtn(iβm0(t)+(1β)j=0i1gj(t)).\displaystyle=-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}m_{j+1}^{(t)}\overset{\eqref{eq:main_update}}{=}-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}\big{(}\beta m_{0}^{(t)}+(1-\beta)g_{j}^{(t)}\big{)}=-\frac{\eta_{t}}{n}\Big{(}i\beta m_{0}^{(t)}+(1-\beta)\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{)}.

Moreover, by (17), we have ηtξt\eta_{t}\leq\xi_{t} for all t1t\geq 1. Therefore, for t>1t>1 and i=0,1,,ni=0,1,\dots,n, we can derive that

wi(t)w0(t)2\displaystyle\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2} ξt2n2iβm0(t)+(1β)j=0i1gj(t)2\displaystyle\leq\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}i\beta m_{0}^{(t)}+(1-\beta)\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}
=(18)ξt2n2βinj=0n1gj(t1)+(1β)j=0i1gj(t)2\displaystyle\overset{\eqref{update_m_0^{T}}}{=}\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}\beta\frac{i}{n}\sum_{j=0}^{n-1}g_{j}^{(t-1)}+(1-\beta)\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}
(a)ξt2n2(βinj=0n1gj(t1)2+(1β)j=0i1gj(t)2)\displaystyle\overset{(a)}{\leq}\frac{\xi_{t}^{2}}{n^{2}}\left(\beta\Big{\|}\frac{i}{n}\sum_{j=0}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}+(1-\beta)\Big{\|}\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}\right)
=(11)ξt2n2(βi2n2An(t1)+(1β)Ai(t)),\displaystyle\overset{\eqref{define_A}}{=}\frac{\xi_{t}^{2}}{n^{2}}\left(\beta\frac{i^{2}}{n^{2}}A_{n}^{(t-1)}+(1-\beta)A_{i}^{(t)}\right),

where (a) follows from the convexity of 2\|\cdot\|^{2} and 0β<10\leq\beta<1.

For t=1t=1 and i=0,1,,ni=0,1,\dots,n, we have m0(t)=0m_{0}^{(t)}=\textbf{0}, which leads to

wi(t)w0(t)2\displaystyle\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2} ξt2n2iβm0(t)+(1β)j=0i1gj(t)2=ξt2n2(1β)j=0i1gj(t)2=ξt2n2(1β)2Ai(t).\displaystyle\leq\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}i\beta m_{0}^{(t)}+(1-\beta)\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}=\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}(1-\beta)\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}=\frac{\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}A_{i}^{(t)}.

Combining these two cases, we obtain (21).

(b) Using a similar argument with the second expression (20), for t2t\geq 2 and i=0,1,,ni=0,1,\dots,n, we can derive that

wi(t1)w0(t)\displaystyle w_{i}^{(t-1)}-w_{0}^{(t)} =ηt1nj=in1mj+1(t1)=(2)ηt1nj=in1[βm0(t1)+(1β)gj(t1)]\displaystyle=\frac{\eta_{t-1}}{n}\sum_{j=i}^{n-1}m_{j+1}^{(t-1)}\overset{\eqref{eq:main_update}}{=}\frac{\eta_{t-1}}{n}\sum_{j=i}^{n-1}\left[\beta m_{0}^{(t-1)}+(1-\beta)g_{j}^{(t-1)}\right]
=ηt1n[(ni)βm0(t1)+(1β)j=in1gj(t1)].\displaystyle=\frac{\eta_{t-1}}{n}\left[(n-i)\beta m_{0}^{(t-1)}+(1-\beta)\sum_{j=i}^{n-1}g_{j}^{(t-1)}\right].

Note again that ηt1ξt\eta_{t-1}\leq\xi_{t}. Then, for t>2t>2 and i=0,1,,ni=0,1,\dots,n, we can show that

wi(t1)w0(t)2\displaystyle\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2} ξt2n2(ni)βm0(t1)+(1β)j=in1gj(t1)2\displaystyle\leq\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}(n-i)\beta m_{0}^{(t-1)}+(1-\beta)\sum_{j=i}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}
=(18)ξt2n2βninj=0n1gj(t2)+(1β)j=in1gj(t1)2\displaystyle\overset{\eqref{update_m_0^{T}}}{=}\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}\beta\frac{n-i}{n}\sum_{j=0}^{n-1}g_{j}^{(t-2)}+(1-\beta)\sum_{j=i}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}
(b)ξt2n2[βninj=0n1gj(t2)2+(1β)j=in1gj(t1)2]\displaystyle\overset{(b)}{\leq}\frac{\xi_{t}^{2}}{n^{2}}\left[\beta\Big{\|}\frac{n-i}{n}\sum_{j=0}^{n-1}g_{j}^{(t-2)}\Big{\|}^{2}+(1-\beta)\Big{\|}\sum_{j=i}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}\right]
=(11),(12)ξt2n2[β(ni)2n2An(t2)+(1β)Bi(t1)],\displaystyle\overset{\eqref{define_A},\eqref{define_B}}{=}\frac{\xi_{t}^{2}}{n^{2}}\left[\beta\frac{(n-i)^{2}}{n^{2}}A_{n}^{(t-2)}+(1-\beta)B_{i}^{(t-1)}\right],

where in (b) we use again the convexity of 2\|\cdot\|^{2} and 0β<10\leq\beta<1.

For t=2t=2, we easily get

wi(t1)w0(t)2\displaystyle\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2} ξt2n2(ni)βm0(t1)+(1β)j=in1gj(t1)2ξt2n2(1β)j=in1gj(t1)2=ξt2n2(1β)2Bi(t1).\displaystyle\leq\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}(n-i)\beta m_{0}^{(t-1)}+(1-\beta)\sum_{j=i}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}\leq\frac{\xi_{t}^{2}}{n^{2}}\Big{\|}(1-\beta)\sum_{j=i}^{n-1}g_{j}^{(t-1)}\Big{\|}^{2}=\frac{\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}B_{i}^{(t-1)}.

Combining the two cases above, we obtain (22). \square

7.7.2 Proof of Lemma 2: Upper Bounding The Terms An(t)A_{n}^{(t)}, i=0n1Ai(t)\sum_{i=0}^{n-1}A_{i}^{(t)}, and i=0n1Bi(t)\sum_{i=0}^{n-1}B^{(t)}_{i}

(a) Let us first upper bound the term Ai(t)A_{i}^{(t)} defined by (11). Indeed, for i=0,,ni=0,\dots,n and t1t\geq 1, we have

Ai(t)\displaystyle A_{i}^{(t)} =j=0i1gj(t)2\displaystyle=\Big{\|}\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2}
(a)3j=0i1(gj(t)f(w0(t);π(t)(j+1)))2+3j=0i1(f(w0(t);π(t)(j+1))F(w0(t)))2+3j=0i1F(w0(t))2\displaystyle\overset{(a)}{\leq}3\Big{\|}\sum_{j=0}^{i-1}\Big{(}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{)}\Big{\|}^{2}+3\Big{\|}\sum_{j=0}^{i-1}\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{)}\Big{\|}^{2}+3\Big{\|}\sum_{j=0}^{i-1}\nabla F(w_{0}^{(t)})\Big{\|}^{2}
(a)3ij=0i1gj(t)f(w0(t);π(t)(j+1))2+3ij=0i1f(w0(t);π(t)(j+1))F(w0(t))2+3i2F(w0(t))2\displaystyle\overset{(a)}{\leq}3i\sum_{j=0}^{i-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}+3i\sum_{j=0}^{i-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}
3ij=0n1gj(t)f(w0(t);π(t)(j+1))2+3ij=0n1f(w0(t);π(t)(j+1))F(w0(t))2+3i2F(w0(t))2\displaystyle\leq 3i\sum_{j=0}^{n-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}+3i\sum_{j=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}
(b)3iIt+3in[ΘF(w0(t))2+σ2]+3i2F(w0(t))2\displaystyle\overset{(b)}{\leq}3iI_{t}+3in\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}+3i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}
3iIt+3(i2+inΘ)F(w0(t))2+3inσ2,\displaystyle\leq 3iI_{t}+3(i^{2}+in\Theta)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+3in\sigma^{2},

where we use the Cauchy-Schwarz inequality in (a), and (b) follows from Assumption 1(c).

Letting i=ni=n in the above estimate, we obtain the first estimate of Lemma 2, i.e.:

An(t)\displaystyle A_{n}^{(t)} 3nIt+3n2(Θ+1)F(w0(t))2+3n2σ23n(It+Kt),t1.\displaystyle\leq 3nI_{t}+3n^{2}(\Theta+1)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+3n^{2}\sigma^{2}\leq 3n\Big{(}I_{t}+K_{t}\Big{)},\ t\geq 1.

Now, we are ready to bound the sum i=0n1Ai(t)\sum_{i=0}^{n-1}A_{i}^{(t)} as

i=0n1Ai(t)\displaystyle\sum_{i=0}^{n-1}A_{i}^{(t)} 3Iti=0n1i+3i=0n1(i2+inΘ)F(w0(t))2+3nσ2i=0n1i\displaystyle\leq 3I_{t}\sum_{i=0}^{n-1}i+3\sum_{i=0}^{n-1}(i^{2}+in\Theta)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+3n\sigma^{2}\sum_{i=0}^{n-1}i
3n22It+3n32(Θ+1)F(w0(t))2+3n3σ223n22(It+Kt),t1,\displaystyle\leq\frac{3n^{2}}{2}I_{t}+\frac{3n^{3}}{2}(\Theta+1)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\frac{3n^{3}\sigma^{2}}{2}\leq\frac{3n^{2}}{2}\Big{(}I_{t}+K_{t}\Big{)},\ t\geq 1,

where the last line follows from i=0n1in22\sum_{i=0}^{n-1}i\leq\frac{n^{2}}{2} and i=0n1i2n32\sum_{i=0}^{n-1}i^{2}\leq\frac{n^{3}}{2}. This proves the second estimate of Lemma 2.

(b) Using a similar argument as above, for i=0,1,,n1i=0,1,\dots,n-1 and t1t\geq 1, we can derive

Bi(t)\displaystyle B_{i}^{(t)} =j=in1gj(t)2(a)3j=in1(gj(t)f(w0(t);π(t)(j+1)))2\displaystyle=\Big{\|}\sum_{j=i}^{n-1}g_{j}^{(t)}\Big{\|}^{2}\overset{(a)}{\leq}3\Big{\|}\sum_{j=i}^{n-1}\Big{(}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{)}\Big{\|}^{2}
+3j=in1(f(w0(t);π(t)(j+1))F(w0(t)))2+3j=in1F(w0(t))2\displaystyle\quad+3\Big{\|}\sum_{j=i}^{n-1}\Big{(}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{)}\Big{\|}^{2}+3\Big{\|}\sum_{j=i}^{n-1}\nabla F(w_{0}^{(t)})\Big{\|}^{2}
(a)3(ni)j=in1gj(t)f(w0(t);π(t)(j+1))2\displaystyle\overset{(a)}{\leq}3(n-i)\sum_{j=i}^{n-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
+3(ni)j=in1f(w0(t);π(t)(j+1))F(w0(t))2+3(ni)2F(w0(t))2\displaystyle\quad+3(n-i)\sum_{j=i}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3(n-i)^{2}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}
(13)3(ni)It+3(ni)j=0n1f(w0(t);π(t)(j+1))F(w0(t))2+3(ni)nF(w0(t))2\displaystyle\overset{\eqref{define_I}}{\leq}3(n-i)I_{t}+3(n-i)\sum_{j=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3(n-i)n\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}
(b)3(ni)It+3(ni)n[ΘF(w0(t))2+σ2]+3(ni)nF(w0(t))2\displaystyle\overset{(b)}{\leq}3(n-i)I_{t}+3(n-i)n\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}+3(n-i)n\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}
3(ni)It+3(ni)n(Θ+1)F(w0(t))2+3(ni)nσ2,\displaystyle\leq 3(n-i)I_{t}+3(n-i)n(\Theta+1)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+3(n-i)n\sigma^{2},

where we use again the Cauchy-Schwarz inequality in (a), and (b) follows from Assumption 1(c). Finally, for all t1t\geq 1, we bound the sum i=0n1Bi(t)\sum_{i=0}^{n-1}B_{i}^{(t)} as follows:

i=0n1Bi(t)\displaystyle\sum_{i=0}^{n-1}B_{i}^{(t)} 3Iti=0n1(ni)+3i=0n1(ni)n(Θ+1)F(w0(t))2+3nσ2i=0n1(ni)\displaystyle\leq 3I_{t}\sum_{i=0}^{n-1}(n-i)+3\sum_{i=0}^{n-1}(n-i)n(\Theta+1)\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3n\sigma^{2}\sum_{i=0}^{n-1}(n-i)
3n2It+3n3(Θ+1)F(w0(t))2+3n3σ2\displaystyle\leq 3n^{2}I_{t}+3n^{3}(\Theta+1)\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+3n^{3}\sigma^{2}
3n2(It+Kt),\displaystyle\leq 3n^{2}\Big{(}I_{t}+K_{t}\Big{)},

where the last line follows from the fact that i=0n1(ni)n2\sum_{i=0}^{n-1}(n-i)\leq n^{2}. This proves (24). \square
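The counting constants in Lemma 2 rely only on the inequality ‖a+b+c‖² ≤ 3(‖a‖²+‖b‖²+‖c‖²) and the Cauchy–Schwarz inequality, so the intermediate bound on A_i^{(t)} (before Assumption 1(c) is invoked) can be checked with arbitrary vectors. Below is a small randomized check (our own illustration; g, u, and v merely stand in for g_j^{(t)}, ∇f(w_0^{(t)};π^{(t)}(j+1)), and ∇F(w_0^{(t)})):

import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 8
g = rng.standard_normal((n, d))   # stands in for the gradients g_j^{(t)}
u = rng.standard_normal((n, d))   # stands in for grad f(w_0^{(t)}; pi^{(t)}(j+1))
v = u.mean(axis=0)                # stands in for grad F(w_0^{(t)}), the average of the u_j

for i in range(n + 1):
    lhs = np.linalg.norm(g[:i].sum(axis=0)) ** 2      # this is A_i^{(t)}
    rhs = (3 * i * np.sum(np.linalg.norm(g[:i] - u[:i], axis=1) ** 2)
           + 3 * i * np.sum(np.linalg.norm(u[:i] - v, axis=1) ** 2)
           + 3 * i ** 2 * np.linalg.norm(v) ** 2)
    assert lhs <= rhs + 1e-9
print("the intermediate bound on A_i^{(t)} holds for random vectors")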

7.7.3 Proof of Lemma 3: Upper Bounding The Terms ItI_{t}, JtJ_{t}, and It+KtI_{t}+K_{t}

(a) First, for every t1t\geq 1 and gi(t):=f(wi(t);π(t)(i+1))g_{i}^{(t)}:=\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1)), we have

It\displaystyle I_{t} =i=0n1f(w0(t);π(t)(i+1))gi(t)2\displaystyle=\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(i+1))-g_{i}^{(t)}\Big{\|}^{2}
=Step 6i=0n1f(w0(t);π(t)(i+1))f(wi(t);π(t)(i+1))2\displaystyle\overset{\tiny\textrm{Step~{}\ref{alg:A1_step3}}}{=}\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(i+1))-\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1))\Big{\|}^{2}
L2i=0n1wi(t)w0(t)2,\displaystyle\leq L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2}, (31)

where the last inequality follows from Assumption 1(b).

Applying Lemma 1 to (31), for t>1t>1, we have

It\displaystyle I_{t} L2ξt2n2i=0n1[βi2n2An(t1)+(1β)Ai(t)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\left[\beta\frac{i^{2}}{n^{2}}A_{n}^{(t-1)}+(1-\beta)A_{i}^{(t)}\right]
L2ξt2n2[βAn(t1)n2i=0n1i2+(1β)i=0n1Ai(t)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta\frac{A_{n}^{(t-1)}}{n^{2}}\sum_{i=0}^{n-1}i^{2}+(1-\beta)\sum_{i=0}^{n-1}A_{i}^{(t)}\right]
(a)L2ξt2n2[βn33n(It1+Kt1)+(1β)3n22(It+Kt)]\displaystyle\overset{(a)}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta\frac{n}{3}3n\Big{(}I_{t-1}+K_{t-1}\Big{)}+(1-\beta)\frac{3n^{2}}{2}\Big{(}I_{t}+K_{t}\Big{)}\right]
=L2ξt2[β(It1+Kt1)+32(1β)(It+Kt)],\displaystyle=L^{2}\xi_{t}^{2}\left[\beta\big{(}I_{t-1}+K_{t-1}\big{)}+\frac{3}{2}(1-\beta)\Big{(}I_{t}+K_{t}\Big{)}\right],

where (a) follows from the result (23) in Lemma 2 and the fact that i=0n1i2n33\sum_{i=0}^{n-1}i^{2}\leq\frac{n^{3}}{3}.

For t=1t=1, by applying Lemmas 1 and 2 consecutively to (31), we have

It\displaystyle I_{t} L2i=0n1wi(t)w0(t)2(21)L2ξt2n2i=0n1((1β)2Ai(t))L2ξt2(1β)2n2i=0n1Ai(t)\displaystyle\leq L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2}\overset{\eqref{update_distance_t}}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\left((1-\beta)^{2}A_{i}^{(t)}\right)\leq\frac{L^{2}\xi_{t}^{2}(1-\beta)^{2}}{n^{2}}\sum_{i=0}^{n-1}A_{i}^{(t)}
(23)L2ξt2(1β)2n23n22(It+Kt)=32L2ξt2(1β)2(It+Kt).\displaystyle\overset{\eqref{bound_A^{T}}}{\leq}\frac{L^{2}\xi_{t}^{2}(1-\beta)^{2}}{n^{2}}\frac{3n^{2}}{2}\Big{(}I_{t}+K_{t}\Big{)}=\frac{3}{2}L^{2}\xi_{t}^{2}(1-\beta)^{2}\Big{(}I_{t}+K_{t}\Big{)}. (32)

(b) Using a similar argument as above, we can derive that

Jt\displaystyle J_{t} =i=0n1f(w0(t);π(t1)(i+1))gi(t1)2\displaystyle=\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(i+1))-g_{i}^{(t-1)}\Big{\|}^{2}
=Step 6i=0n1f(w0(t);π(t1)(i+1))f(wi(t1);π(t1)(i+1))2\displaystyle\overset{\tiny\textrm{Step~{}\ref{alg:A1_step3}}}{=}\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(i+1))-\nabla f(w_{i}^{(t-1)};\pi^{(t-1)}(i+1))\Big{\|}^{2}
(3)L2i=0n1wi(t1)w0(t)2.\displaystyle\overset{\eqref{eq:Lsmooth_basic}}{\leq}L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2}.

Applying Lemma 1 for t>2t>2 to the above estimate, we get

Jt\displaystyle J_{t} L2ξt2n2i=0n1[β(ni)2n2An(t2)+(1β)Bi(t1)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\left[\beta\frac{(n-i)^{2}}{n^{2}}A_{n}^{(t-2)}+(1-\beta)B_{i}^{(t-1)}\right]
L2ξt2n2[βAn(t2)n2i=0n1(ni)2+(1β)i=0n1Bi(t1)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta\frac{A_{n}^{(t-2)}}{n^{2}}\sum_{i=0}^{n-1}(n-i)^{2}+(1-\beta)\sum_{i=0}^{n-1}B_{i}^{(t-1)}\right]
(b)L2ξt2n2[βn3n(It2+Kt2)+(1β)3n2(It1+Kt1)]\displaystyle\overset{(b)}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta n\cdot 3n\Big{(}I_{t-2}+K_{t-2}\Big{)}+(1-\beta)\cdot 3n^{2}\Big{(}I_{t-1}+K_{t-1}\Big{)}\right]
=3L2ξt2[β(It2+Kt2)+(1β)(It1+Kt1)],\displaystyle=3L^{2}\xi_{t}^{2}\left[\beta\Big{(}I_{t-2}+K_{t-2}\Big{)}+(1-\beta)\Big{(}I_{t-1}+K_{t-1}\Big{)}\right],

where (b) follows from the results (23), (24) in Lemma 2 and the fact that i=0n1(ni)2n3\sum_{i=0}^{n-1}(n-i)^{2}\leq n^{3}.

Now applying Lemma 1 for the special case t=2t=2, we obtain

Jt\displaystyle J_{t} L2ξt2n2i=0n1(1β)2Bi(t1)L2ξt2n2(1β)23n2(It1+Kt1)3L2ξt2(It1+Kt1),\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}(1-\beta)^{2}B_{i}^{(t-1)}\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}\cdot 3n^{2}\Big{(}I_{t-1}+K_{t-1}\Big{)}\leq 3L^{2}\xi_{t}^{2}\Big{(}I_{t-1}+K_{t-1}\Big{)},

where we use the result (24) from Lemma 2 and the fact that 1β11-\beta\leq 1.

(c) First, from the result of Part (a), for t>1t>1, we have

ItL2ξt2[β(It1+Kt1)+32(1β)Kt]+32(1β)L2ξt2It.\displaystyle I_{t}\leq L^{2}\xi_{t}^{2}\left[\beta\big{(}I_{t-1}+K_{t-1}\big{)}+\frac{3}{2}(1-\beta)K_{t}\right]+\frac{3}{2}(1-\beta)L^{2}\xi_{t}^{2}I_{t}.

Since L2ξt225L^{2}\xi_{t}^{2}\leq\frac{2}{5} (due to the choice of our learning rate) and 1β11-\beta\leq 1, we further get

It\displaystyle I_{t} 25[β(It1+Kt1)+32(1β)Kt]+35It,\displaystyle\leq\frac{2}{5}\left[\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+\frac{3}{2}(1-\beta)K_{t}\right]+\frac{3}{5}I_{t},

which is equivalent to

Itβ(It1+Kt1)+32(1β)Kt.\displaystyle I_{t}\leq\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+\frac{3}{2}(1-\beta)K_{t}.

Adding KtK_{t} to both sides of this inequality, for t>1t>1, we get

It+Ktβ(It1+Kt1)+53β2Kt.\displaystyle I_{t}+K_{t}\leq\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+\frac{5-3\beta}{2}K_{t}. (33)

Finally, for t=1t=1, since L2ξt225L^{2}\xi_{t}^{2}\leq\frac{2}{5} and 1β11-\beta\leq 1, we have

It\displaystyle I_{t} 32L2ξt2(1β)2(It+Kt)35(1β)2It+35(1β)2Kt35It+35(1β)Kt,\displaystyle\leq\frac{3}{2}L^{2}\xi_{t}^{2}(1-\beta)^{2}\Big{(}I_{t}+K_{t}\Big{)}\leq\frac{3}{5}(1-\beta)^{2}I_{t}+\frac{3}{5}(1-\beta)^{2}K_{t}\leq\frac{3}{5}I_{t}+\frac{3}{5}(1-\beta)K_{t},

which is equivalent to

25It35(1β)Kt.\displaystyle\frac{2}{5}I_{t}\leq\frac{3}{5}(1-\beta)K_{t}. (34)

This leads to our desired result for t=1t=1 as

I1+K132(1β)K1+K1=53β2K1.\displaystyle I_{1}+K_{1}\leq\frac{3}{2}(1-\beta)K_{1}+K_{1}=\frac{5-3\beta}{2}K_{1}. (35)

Combining the two cases, we obtain the desired result in Part (c). \square

7.7.4 Proof of Lemma 4: Upper Bounding The Key Quantity βJt+(1β)It\beta J_{t}+(1-\beta)I_{t}

First, we analyze the case t>2t>2 using the results of Lemma 3 as follows:

βJt+(1β)It\displaystyle\beta J_{t}+(1-\beta)I_{t} 3βL2ξt2[β(It2+Kt2)+(1β)(It1+Kt1)]\displaystyle\leq 3\beta L^{2}\xi_{t}^{2}\left[\beta\Big{(}I_{t-2}+K_{t-2}\Big{)}+(1-\beta)\Big{(}I_{t-1}+K_{t-1}\Big{)}\right]
+(1β)L2ξt2[β(It1+Kt1)+32(1β)(It+Kt)]\displaystyle\quad+(1-\beta)L^{2}\xi_{t}^{2}\left[\beta\big{(}I_{t-1}+K_{t-1}\big{)}+\frac{3}{2}(1-\beta)\Big{(}I_{t}+K_{t}\Big{)}\right]
L2ξt2[2(It+Kt)+4β(It1+Kt1)+3β2(It2+Kt2)]\displaystyle\leq L^{2}\xi_{t}^{2}\left[2\big{(}I_{t}+K_{t}\big{)}+4\beta\big{(}I_{t-1}+K_{t-1}\big{)}+3\beta^{2}\big{(}I_{t-2}+K_{t-2}\big{)}\right]
=:L2ξt2Pt,\displaystyle=:L^{2}\xi_{t}^{2}P_{t}, (36)

where the last two lines follow since 1β11-\beta\leq 1 and Pt:=2(It+Kt)+4β(It1+Kt1)+3β2(It2+Kt2)P_{t}:=2(I_{t}+K_{t})+4\beta(I_{t-1}+K_{t-1})+3\beta^{2}(I_{t-2}+K_{t-2}) for t>2t>2.

Next, we bound the term PtP_{t} in (36) as

Pt\displaystyle P_{t} =2(It+Kt)+4β(It1+Kt1)+3β2(It2+Kt2)\displaystyle=2\big{(}I_{t}+K_{t}\big{)}+4\beta\big{(}I_{t-1}+K_{t-1}\big{)}+3\beta^{2}\big{(}I_{t-2}+K_{t-2}\big{)}
2[β(It1+Kt1)+53β2Kt]+4β(It1+Kt1)+3β2(It2+Kt2)\displaystyle\leq 2\Big{[}\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+\frac{5-3\beta}{2}K_{t}\Big{]}+4\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+3\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)} apply (33) fort\displaystyle\text{apply \eqref{bound_sum_I_K>1} for}\ t
=(53β)Kt+6β(It1+Kt1)+3β2(It2+Kt2)\displaystyle=(5-3\beta)K_{t}+6\beta\Big{(}I_{t-1}+K_{t-1}\Big{)}+3\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)}
(53β)Kt+6β[β(It2+Kt2)+53β2Kt1]+3β2(It2+Kt2)\displaystyle\leq(5-3\beta)K_{t}+6\beta\Big{[}\beta\big{(}I_{t-2}+K_{t-2}\big{)}+\frac{5-3\beta}{2}K_{t-1}\Big{]}+3\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)} apply (33) fort1\displaystyle\text{apply \eqref{bound_sum_I_K>1} for}\ t-1
=(53β)Kt+3β(53β)Kt1+9β2(It2+Kt2).\displaystyle=(5-3\beta)K_{t}+3\beta(5-3\beta)K_{t-1}+9\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)}.

We consider the last term, which can be bounded as

β2(It2+Kt2)\displaystyle\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)} β3(It3+Kt3)+53β2β2Kt2\displaystyle\leq\beta^{3}\big{(}I_{t-3}+K_{t-3}\big{)}+\frac{5-3\beta}{2}\beta^{2}K_{t-2} apply (33) recursively fort2>1\displaystyle\text{apply \eqref{bound_sum_I_K>1} recursively for}\ t-2>1
βt1(I1+K1)+53β2j=2t2βtjKj\displaystyle\leq\beta^{t-1}\big{(}I_{1}+K_{1}\big{)}+\frac{5-3\beta}{2}\sum_{j=2}^{t-2}\beta^{t-j}K_{j}
βt153β2K1+53β2j=2t2βtjKj\displaystyle\leq\beta^{t-1}\frac{5-3\beta}{2}K_{1}+\frac{5-3\beta}{2}\sum_{j=2}^{t-2}\beta^{t-j}K_{j} apply (35)
53β2j=1t2βtjKj.\displaystyle\leq\frac{5-3\beta}{2}\sum_{j=1}^{t-2}\beta^{t-j}K_{j}.

Note that this bound is also true for the case t2=1t-2=1:

β2(It2+Kt2)\displaystyle\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)} =β2(I1+K1)β253β2K1\displaystyle=\beta^{2}\Big{(}I_{1}+K_{1}\Big{)}\leq\beta^{2}\frac{5-3\beta}{2}K_{1} apply (35)

Substituting this inequality into PtP_{t}, for t>2t>2, we get

Pt\displaystyle P_{t} (53β)Kt+3β(53β)Kt1+9β2(It2+Kt2)\displaystyle\leq(5-3\beta)K_{t}+3\beta(5-3\beta)K_{t-1}+9\beta^{2}\Big{(}I_{t-2}+K_{t-2}\Big{)}
92(53β)Kt+92(53β)βKt1+92(53β)j=1t2βtjKj\displaystyle\leq\frac{9}{2}(5-3\beta)K_{t}+\frac{9}{2}(5-3\beta)\beta K_{t-1}+\frac{9}{2}(5-3\beta)\sum_{j=1}^{t-2}\beta^{t-j}K_{j}
92(53β)j=1tβtjKj.\displaystyle\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}K_{j}.

Now, we analyze the case t=2t=2 similarly as follows:

βJ2+(1β)I2\displaystyle\beta J_{2}+(1-\beta)I_{2} 3βL2ξ22(I1+K1)+(1β)L2ξ22[β(I1+K1)+32(1β)(I2+K2)]\displaystyle\leq 3\beta L^{2}\xi_{2}^{2}\Big{(}I_{1}+K_{1}\Big{)}+(1-\beta)L^{2}\xi_{2}^{2}\left[\beta\big{(}I_{1}+K_{1}\big{)}+\frac{3}{2}(1-\beta)\Big{(}I_{2}+K_{2}\Big{)}\right]
L2ξ22[2(I2+K2)+4β(I1+K1)]\displaystyle\leq L^{2}\xi_{2}^{2}\left[2\Big{(}I_{2}+K_{2}\Big{)}+4\beta\Big{(}I_{1}+K_{1}\Big{)}\right]
=:L2ξ22P2,\displaystyle=:L^{2}\xi_{2}^{2}P_{2},

where the last line follows since 1β11-\beta\leq 1 and P2:=2(I2+K2)+4β(I1+K1)P_{2}:=2\Big{(}I_{2}+K_{2}\Big{)}+4\beta\Big{(}I_{1}+K_{1}\Big{)}.

Next, we bound the term P2P_{2} as follows:

P2\displaystyle P_{2} =2(I2+K2)+4β(I1+K1)\displaystyle=2\Big{(}I_{2}+K_{2}\Big{)}+4\beta\Big{(}I_{1}+K_{1}\Big{)}
2(β(I1+K1)+53β2K2)+4β(I1+K1)\displaystyle\leq 2\left(\beta\Big{(}I_{1}+K_{1}\Big{)}+\frac{5-3\beta}{2}K_{2}\right)+4\beta\Big{(}I_{1}+K_{1}\Big{)} apply (33) fort=2\displaystyle\text{apply \eqref{bound_sum_I_K>1} for}\ t=2
=(53β)K2+6β(I1+K1)\displaystyle=(5-3\beta)K_{2}+6\beta\Big{(}I_{1}+K_{1}\Big{)}
(53β)K2+6β53β2K1\displaystyle\leq(5-3\beta)K_{2}+6\beta\frac{5-3\beta}{2}K_{1} apply (35)
=(53β)K2+3β(53β)K1\displaystyle=(5-3\beta)K_{2}+3\beta(5-3\beta)K_{1}
92(53β)j=12β2jKj.\displaystyle\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{2}\beta^{2-j}K_{j}.

Hence, the statements βJt+(1β)ItL2ξt2Pt\beta J_{t}+(1-\beta)I_{t}\leq L^{2}\xi_{t}^{2}P_{t} and Pt92(53β)j=1tβtjKjP_{t}\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}K_{j} are true for every t2t\geq 2.

Combining these two cases, we have the following estimates:

βJt+(1β)ItL2ξt2Pt\displaystyle\beta J_{t}+(1-\beta)I_{t}\leq L^{2}\xi_{t}^{2}P_{t} L2ξt292(53β)j=1tβtjKj\displaystyle\leq L^{2}\xi_{t}^{2}\cdot\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}K_{j}
(15)L2ξt292(53β)j=1tβtj[n(Θ+1)F(w0(j))2+nσ2]\displaystyle\overset{\eqref{define_K}}{\leq}L^{2}\xi_{t}^{2}\cdot\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}\Big{[}n(\Theta+1)\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+n\sigma^{2}\Big{]}
L2ξt292(53β)[n(Θ+1)j=1tβtjF(w0(j))2+j=1tβtjnσ2]\displaystyle\leq L^{2}\xi_{t}^{2}\cdot\frac{9}{2}(5-3\beta)\Big{[}n(\Theta+1)\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\sum_{j=1}^{t}\beta^{t-j}n\sigma^{2}\Big{]}
9(53β)L2ξt22[n(Θ+1)j=1tβtjF(w0(j))2+nσ21β],\displaystyle\leq\frac{9(5-3\beta)L^{2}\xi_{t}^{2}}{2}\Big{[}n(\Theta+1)\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\frac{n\sigma^{2}}{1-\beta}\Big{]},

where the last line follows since j=1tβtj11β\sum_{j=1}^{t}\beta^{t-j}\leq\frac{1}{1-\beta} for every t2t\geq 2. Hence, we have proved (25). \square
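The bookkeeping in this proof can also be double-checked numerically: if S_t denotes the worst case of I_t + K_t allowed by Lemma 3(c) (i.e., the recursion (33) holding with equality and S_1 = (5−3β)K_1/2), then the bound P_t ≤ (9/2)(5−3β)∑_{j=1}^{t} β^{t−j}K_j still holds. A small randomized check with arbitrary nonnegative values K_j (our own illustration):

import numpy as np

rng = np.random.default_rng(3)
T, beta = 30, 0.6
K = rng.uniform(0.0, 2.0, size=T + 1)     # K_1, ..., K_T stored in K[1..T]; K[0] is unused

# Worst-case sequence S_t >= I_t + K_t obeying Lemma 3(c) with equality.
S = np.zeros(T + 1)
S[1] = (5 - 3 * beta) / 2 * K[1]
for t in range(2, T + 1):
    S[t] = beta * S[t - 1] + (5 - 3 * beta) / 2 * K[t]

for t in range(3, T + 1):
    P_t = 2 * S[t] + 4 * beta * S[t - 1] + 3 * beta ** 2 * S[t - 2]
    Q_t = sum(beta ** (t - j) * K[j] for j in range(1, t + 1))
    assert P_t <= 9 / 2 * (5 - 3 * beta) * Q_t + 1e-9
print("P_t <= (9/2)(5-3*beta) * sum_j beta^(t-j) * K_j verified for the worst-case recursion")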

7.7.5 The Proof of Lemma 5

Using the assumption that η1η2ηT\eta_{1}\geq\eta_{2}\geq\dots\geq\eta_{T}, we can bound the sum on the left-hand side as follows:

t=1Tηtj=1tβtjF(w~j1)2\displaystyle\sum_{t=1}^{T}\eta_{t}\sum_{j=1}^{t}\beta^{t-j}\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2} =t=1Tηt[βt1F(w~0)2+βt2F(w~1)2++β0F(w~t1)2]\displaystyle=\sum_{t=1}^{T}\eta_{t}\left[\beta^{t-1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\beta^{t-2}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}+\dots+\beta^{0}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}\right]
=η1β0F(w~0)2\displaystyle=\quad\eta_{1}\beta^{0}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}
+η2β1F(w~0)2+η2β0F(w~1)2\displaystyle\quad+\eta_{2}\beta^{1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\ \ +\eta_{2}\beta^{0}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}
+ηtβt1F(w~0)2+ηtβt2F(w~1)2++ηtβ0F(w~t1)2\displaystyle\quad+\eta_{t}\beta^{t-1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\ +\eta_{t}\beta^{t-2}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}+\dots+\eta_{t}\beta^{0}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}
+ηTβT1F(w~0)2+ηTβT2F(w~1)2++ηTβ0F(w~T1)2\displaystyle\quad+\eta_{T}\beta^{T-1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\eta_{T}\beta^{T-2}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}\ +\ \dots\ +\ \eta_{T}\beta^{0}\big{\|}\nabla F(\tilde{w}_{T-1})\big{\|}^{2}
η1β0F(w~0)2\displaystyle\leq\eta_{1}\beta^{0}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}
+η1β1F(w~0)2+η2β0F(w~1)2\displaystyle\quad+\eta_{1}\beta^{1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\eta_{2}\beta^{0}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}
+η1βt1F(w~0)2+η2βt2F(w~1)2++ηtβ0F(w~t1)2\displaystyle\quad+\eta_{1}\beta^{t-1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\ \ +\eta_{2}\beta^{t-2}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}+\dots+\eta_{t}\beta^{0}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}
+η1βT1F(w~0)2+η2βT2F(w~1)2++ηTβ0F(w~T1)2\displaystyle\quad+\eta_{1}\beta^{T-1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\ +\eta_{2}\beta^{T-2}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}\ +\ \dots\ +\ \eta_{T}\beta^{0}\big{\|}\nabla F(\tilde{w}_{T-1})\big{\|}^{2}
η1i=0T1βiF(w~0)2+η2i=0T2βiF(w~1)2++ηTβ0F(w~T1)2\displaystyle\leq\eta_{1}\sum_{i=0}^{T-1}\beta^{i}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\eta_{2}\sum_{i=0}^{T-2}\beta^{i}\big{\|}\nabla F(\tilde{w}_{1})\big{\|}^{2}\ +\ \dots\ +\ \eta_{T}\beta^{0}\big{\|}\nabla F(\tilde{w}_{T-1})\big{\|}^{2}
11βt=1TηtF(w~t1)2,\displaystyle\leq\frac{1}{1-\beta}\sum_{t=1}^{T}\eta_{t}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2},

where the last inequality follows since i=0t1βi11β\sum_{i=0}^{t-1}\beta^{i}\leq\frac{1}{1-\beta} for t=1,2,,Tt=1,2,\dots,T. \square
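Lemma 5 is a purely deterministic rearrangement of a double sum, so it can be checked directly with arbitrary non-increasing step sizes and arbitrary nonnegative values standing in for ‖∇F(w̃_{j−1})‖². A small randomized check (our own illustration):

import numpy as np

rng = np.random.default_rng(2)
T, beta = 50, 0.7
eta = np.sort(rng.uniform(0.01, 1.0, size=T))[::-1]   # eta_1 >= eta_2 >= ... >= eta_T
grad_sq = rng.uniform(0.0, 5.0, size=T)               # stands in for ||grad F(w~_{t-1})||^2

lhs = sum(eta[t - 1] * sum(beta ** (t - j) * grad_sq[j - 1] for j in range(1, t + 1))
          for t in range(1, T + 1))
rhs = float(np.dot(eta, grad_sq)) / (1.0 - beta)
assert lhs <= rhs + 1e-9
print("Lemma 5 inequality verified numerically")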

7.7.6 The Proof of Lemma 6

We analyze the special case t=1t=1 as follows. First, letting i=ni=n in Equation (19), for t1t\geq 1, we have

wn(t)w0(t)\displaystyle w_{n}^{(t)}-w_{0}^{(t)} =ηtnj=0n1mj+1(t)=(2)ηtnj=0n1[βm0(t)+(1β)gj(t)]=ηtn[nβm0(t)+(1β)j=0n1gj(t)].\displaystyle=-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}m_{j+1}^{(t)}\overset{\eqref{eq:main_update}}{=}-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}\big{[}\beta m_{0}^{(t)}+(1-\beta)g_{j}^{(t)}\big{]}=-\frac{\eta_{t}}{n}\Big{[}n\beta m_{0}^{(t)}+(1-\beta)\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{]}.

For t=1t=1, since m0(t)=𝟎m_{0}^{(t)}=\mathbf{0}, we have

wn(t)w0(t)=ηtn(1β)j=0n1gj(t).\displaystyle w_{n}^{(t)}-w_{0}^{(t)}=-\frac{\eta_{t}}{n}(1-\beta)\sum_{j=0}^{n-1}g_{j}^{(t)}. (37)

Using the LL-smoothness of FF in Assumption 1(b), w0(t+1):=wn(t)w_{0}^{(t+1)}:=w_{n}^{(t)}, and (37), we have

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) (5)F(w0(t))+F(w0(t))(w0(t+1)w0(t))+L2w0(t+1)w0(t)2\displaystyle\overset{\eqref{eq:Lsmooth}}{\leq}F(w_{0}^{(t)})+\nabla F(w_{0}^{(t)})^{\top}(w_{0}^{(t+1)}-w_{0}^{(t)})+\frac{L}{2}\|w_{0}^{(t+1)}-w_{0}^{(t)}\|^{2}
=(37)F(w0(t))ηt(1β)F(w0(t))(1nj=0n1gj(t))+Lηt22(1β)21nj=0n1gj(t)2\displaystyle\overset{\eqref{update_epoch_03}}{=}F(w_{0}^{(t)})-\eta_{t}(1-\beta)\nabla F(w_{0}^{(t)})^{\top}\left(\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t)}\right)+\frac{L\eta_{t}^{2}}{2}(1-\beta)^{2}\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{\|}^{2}
=(a)F(w0(t))ηt2(1β)F(w0(t))2+ηt2(1β)F(w0(t))1nj=0n1gj(t)2\displaystyle\overset{\tiny(a)}{=}F(w_{0}^{(t)})-\frac{\eta_{t}}{2}(1-\beta)\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2}(1-\beta)\Big{\|}\nabla F(w_{0}^{(t)})-\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{\|}^{2}
ηt2(1β)(1Lηt(1β))1nj=0n1gj(t)2\displaystyle\quad-\frac{\eta_{t}}{2}(1-\beta)\left(1-L\eta_{t}(1-\beta)\right)\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{\|}^{2}
F(w0(t))ηt2(1β)F(w0(t))2+ηt2(1β)F(w0(t))1nj=0n1gj(t)2,\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}(1-\beta)\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2}(1-\beta)\Big{\|}\nabla F(w_{0}^{(t)})-\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(t)}\Big{\|}^{2}, (38)

where (a) follows from the equality uv=12(u2+v2uv2)u^{\top}v=\frac{1}{2}\left(\|u\|^{2}+\|v\|^{2}-\|u-v\|^{2}\right) and the last inequality comes from the fact that ηt25L1L(1β)\eta_{t}\leq\frac{\sqrt{2}}{\sqrt{5}L}\leq\frac{1}{L(1-\beta)}.

Next, we bound the last term of (38) as follows:

F(w0(1))1nj=0n1gj(1)2\displaystyle\Big{\|}\nabla F(w_{0}^{(1)})-\frac{1}{n}\sum_{j=0}^{n-1}g_{j}^{(1)}\Big{\|}^{2} =1nj=0n1(f(w0(1);π(1)(j+1))gj(1))2\displaystyle=\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}\Big{(}\nabla f(w_{0}^{(1)};\pi^{(1)}(j+1))-g_{j}^{(1)}\Big{)}\Big{\|}^{2}
(b)1nj=0n1f(w0(1);π(1)(j+1))gj(1)2\displaystyle\overset{(b)}{\leq}\frac{1}{n}\sum_{j=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(1)};\pi^{(1)}(j+1))-g_{j}^{(1)}\Big{\|}^{2}
=(13)1nI1,\displaystyle\overset{\eqref{define_I}}{=}\frac{1}{n}I_{1},

where (b) is from the Cauchy-Schwarz inequality.

Applying this bound to (38), we get the following estimate:

F(w0(2))\displaystyle F(w_{0}^{(2)}) F(w0(1))η12(1β)F(w0(1))2+η12(1β)1nI1.\displaystyle\leq F(w_{0}^{(1)})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(w_{0}^{(1)})\|^{2}+\frac{\eta_{1}}{2}(1-\beta)\frac{1}{n}I_{1}. (39)
Remark 7.

Note that the derived estimate (39) is true for all shuffling strategies, including Random Reshuffling, Shuffle Once and Incremental Gradient (under the assumptions of Theorem 1 and 2, respectively).

Applying Lemma 3, we further have

F(w0(2))\displaystyle F(w_{0}^{(2)}) F(w0(1))η12(1β)F(w0(1))2+η12(1β)1nI1\displaystyle\leq F(w_{0}^{(1)})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(w_{0}^{(1)})\|^{2}+\frac{\eta_{1}}{2}(1-\beta)\frac{1}{n}I_{1}
(32)F(w0(1))η12(1β)F(w0(1))2+η12(1β)32nL2ξ12(1β)2(I1+K1)\displaystyle\overset{\eqref{bound_I=1}}{\leq}F(w_{0}^{(1)})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(w_{0}^{(1)})\|^{2}+\frac{\eta_{1}}{2}(1-\beta)\frac{3}{2n}L^{2}\xi_{1}^{2}(1-\beta)^{2}\Big{(}I_{1}+K_{1}\Big{)}
(35)F(w0(1))η12(1β)F(w0(1))2+η12(1β)32nL2ξ12(1β)253β2K1.\displaystyle\overset{\eqref{bound_sum_I_K=1}}{\leq}F(w_{0}^{(1)})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(w_{0}^{(1)})\|^{2}+\frac{\eta_{1}}{2}(1-\beta)\frac{3}{2n}L^{2}\xi_{1}^{2}(1-\beta)^{2}\frac{5-3\beta}{2}K_{1}.

Noting that w~t1=w0(t)\tilde{w}_{t-1}=w_{0}^{(t)}, 1β11-\beta\leq 1, and Kt:=n(Θ+1)F(w0(t))2+nσ2K_{t}:=n(\Theta+1)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+n\sigma^{2}, we get

F(w~1)F(w~0)η12(1β)F(w~0)2+3η1L2ξ12(1β)(53β)8[(Θ+1)F(w~0)2+σ2].\displaystyle F(\tilde{w}_{1})\leq F(\tilde{w}_{0})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(\tilde{w}_{0})\|^{2}+\frac{3\eta_{1}L^{2}\xi_{1}^{2}(1-\beta)(5-3\beta)}{8}\left[(\Theta+1)\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\sigma^{2}\right].

Rearranging the terms and dividing both sides by (1β)(1-\beta), we have

η12F(w~0)2F(w~0)F(w~1)(1β)+3η1L2ξ12(53β)8(Θ+1)F(w~0)2+3η1σ2L2ξ12(53β)8.\displaystyle\frac{\eta_{1}}{2}\|\nabla F(\tilde{w}_{0})\|^{2}\leq\frac{F(\tilde{w}_{0})-F(\tilde{w}_{1})}{(1-\beta)}+\frac{3\eta_{1}L^{2}\xi_{1}^{2}(5-3\beta)}{8}(\Theta+1)\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\frac{3\eta_{1}\sigma^{2}L^{2}\xi_{1}^{2}(5-3\beta)}{8}.

Since ξt\xi_{t} satisfies 9L2ξt2(53β)(Θ+1)1β9L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+1)\leq 1-\beta as above, we can deduce from the last estimate that

η12F(w~0)2F(w~0)F(w~1)(1β)+(1β)η14F(w~0)2+9σ2L2(53β)4(1β)ξ13,\displaystyle\frac{\eta_{1}}{2}\|\nabla F(\tilde{w}_{0})\|^{2}\leq\frac{F(\tilde{w}_{0})-F(\tilde{w}_{1})}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}+\frac{9\sigma^{2}L^{2}(5-3\beta)}{4(1-\beta)}\cdot\xi_{1}^{3},

which proves (26). \square

8 Convergence Analysis of Algorithm 1 for Randomized Reshuffling Strategy

In this section, we present the convergence results of Algorithm 1 for the Randomized Reshuffling strategy. Since this analysis is similar to that of Theorem 1, we highlight the similarities and discuss the differences between the two analyses.

8.1 The Proof Sketch of Theorem 2

The proof of Theorem 2 is similar to that of Theorem 1, except for the fact that 𝔼[βJt+(1β)It]\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}] is upper bounded by a different term. The proof is divided into the following steps.

  • From Theorem 1 we have

    F(w~t)F(w~t1)ηt2F(w~t1)2+ηt2n(βJt+(1β)It),F(\tilde{w}_{t})\leq F(\tilde{w}_{t-1})-\frac{\eta_{t}}{2}\|\nabla{F}(\tilde{w}_{t-1})\|^{2}+\frac{\eta_{t}}{2n}(\beta J_{t}+(1-\beta)I_{t}),

    where ItI_{t} and JtJ_{t} are defined by (13) and (14).

  • Then, 𝔼[βJt+(1β)It]\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}] can be upper bounded as

    𝔼[βJt+(1β)It]𝒪(ξt2j=1tβtj𝔼[F(w0(j))2]+ξt2σ2),\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}]\leq\operatorname{\mathcal{O}}\Big{(}\xi_{t}^{2}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\|\nabla{F}(w_{0}^{(j)})\|^{2}\right]+\xi^{2}_{t}\sigma^{2}\Big{)},

    as shown in Lemma 9, where ξt:=max{ηt1,ηt}\xi_{t}:=\max\{\eta_{t-1},\eta_{t}\} defined in (17).

  • We further upper bound the sum of the right-hand sides of these estimates, using Lemma 5, in terms of t=1Tηt𝔼[F(w~t1)2]\sum_{t=1}^{T}\eta_{t}\mathbb{E}\left[\|\nabla{F}(\tilde{w}_{t-1})\|^{2}\right].

Combining these steps and simplifying the resulting bound, we obtain (9) in Theorem 2.

8.2 Technical Lemmas

In this subsection, we introduce some intermediate results used in the proof of Theorem 2. Lemmas 7 and 8 below play a similar role to Lemmas 2 and 3 in the proof of Theorem 1.

Lemma 7.

Assume that the Randomized Reshuffling strategy is used in Algorithm 1. Then, under the same setting as of Lemma 1 and Assumption 1(c), for t1t\geq 1, it holds that

(a)\displaystyle(\textbf{a})\qquad 𝔼[An(t)]2n𝔼[It+Nt]andi=0n1𝔼[Ai(t)]n2𝔼[It+Nt],\displaystyle\mathbb{E}\big{[}A_{n}^{(t)}\big{]}\leq 2n\cdot\mathbb{E}\left[I_{t}+N_{t}\right]\quad\text{and}\quad\sum_{i=0}^{n-1}\mathbb{E}\big{[}A_{i}^{(t)}\big{]}\leq n^{2}\cdot\mathbb{E}\left[I_{t}+N_{t}\right], (40)
(b)\displaystyle(\textbf{b})\qquad i=0n1𝔼[Bi(t)]2n2𝔼[It+Nt].\displaystyle\sum_{i=0}^{n-1}\mathbb{E}\Big{[}B_{i}^{(t)}\Big{]}\leq 2n^{2}\cdot\mathbb{E}\left[I_{t}+N_{t}\right]. (41)

The results of Lemma 8 below are direct consequences of Lemma 1, Lemma 7, and the fact that L2ξt235L^{2}\xi_{t}^{2}\leq\frac{3}{5}.

Lemma 8.

Under the same setting as of Lemma 7 and Assumption 1(b), it holds that

(a)\displaystyle(\textbf{a})\ 𝔼[It]{L2ξt2[23β𝔼[It1+Nt1]+(1β)𝔼[It+Nt]],ift>1,L2ξt2(1β)2𝔼[It+Nt],ift=1,\displaystyle\mathbb{E}[I_{t}]\leq\left\{\begin{array}[]{ll}L^{2}\xi_{t}^{2}\Big{[}\frac{2}{3}\beta\cdot\mathbb{E}[I_{t-1}+N_{t-1}]+(1-\beta)\cdot\mathbb{E}[I_{t}+N_{t}]\Big{]},&\text{if}~{}t>1,\vspace{1ex}\\ L^{2}\xi_{t}^{2}(1-\beta)^{2}\cdot\mathbb{E}\left[I_{t}+N_{t}\right],&\text{if}~{}t=1,\end{array}\right.
(b)\displaystyle(\textbf{b})\ 𝔼[Jt]{2L2ξt2[β𝔼[It2+Nt2]+(1β)𝔼[It1+Nt1]],ift>2,2L2ξt2𝔼[It1+Nt1],ift=2,\displaystyle\mathbb{E}[J_{t}]\leq\left\{\begin{array}[]{ll}2L^{2}\xi_{t}^{2}\Big{[}\beta\cdot\mathbb{E}[I_{t-2}+N_{t-2}]+(1-\beta)\cdot\mathbb{E}[I_{t-1}+N_{t-1}]\Big{]},&\text{if}~{}t>2,\vspace{1ex}\\ 2L^{2}\xi_{t}^{2}\cdot\mathbb{E}\left[I_{t-1}+N_{t-1}\right],&\text{if}~{}t=2,\end{array}\right.
(c)\displaystyle(\textbf{c})\ If L2ξt235, then 𝔼[It+Nt]{β𝔼[It1+Nt1]+53β2𝔼[Nt],ift>1,53β2𝔼[N1],ift=1.\displaystyle\text{If }L^{2}\xi_{t}^{2}\leq\frac{3}{5},\text{ then }\mathbb{E}[I_{t}+N_{t}]\leq\left\{\begin{array}[]{ll}\beta\cdot\mathbb{E}[I_{t-1}+N_{t-1}]+\frac{5-3\beta}{2}\cdot\mathbb{E}[N_{t}],&\text{if}~{}t>1,\vspace{1ex}\\ \frac{5-3\beta}{2}\cdot\mathbb{E}[N_{1}],&\text{if}~{}t=1.\end{array}\right.

We next present Lemmas 9 and 10, which play a similar role to Lemmas 4 and 6 in the proof of Theorem 1.

Lemma 9.

Under the same conditions as in Lemma 8, for any t2t\geq 2, we have

𝔼[βJt+(1β)It]\displaystyle\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}] 3(53β)L2ξt2((Θ+n)j=1tβtj𝔼[F(w0(j))2]+σ21β).\displaystyle\leq 3(5-3\beta)L^{2}\xi_{t}^{2}\Big{(}(\Theta+n)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]+\frac{\sigma^{2}}{1-\beta}\Big{)}. (42)
Lemma 10.

Under the same setting as in Theorem 2, we have

η12𝔼[F(w~0)2]\displaystyle\frac{\eta_{1}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{0})\|^{2}\right] 𝔼[F(w~0)F(w~1)](1β)+(1β)η14𝔼[F(w~0)2]+3σ2(53β)L22n(1β)ξ13.\displaystyle\leq\frac{\mathbb{E}\left[F(\tilde{w}_{0})-F(\tilde{w}_{1})\right]}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\cdot\xi_{1}^{3}. (43)

8.3 The Proof of Theorem 2: Key Estimate for Algorithm 1

First, from the assumption 0<ηt1LD0<\eta_{t}\leq\frac{1}{L\sqrt{D}}, t1t\geq 1, we have 0<ηt21DL20<\eta_{t}^{2}\leq\frac{1}{DL^{2}}. Next, from (17), we have ξt=max(ηt;ηt1)\xi_{t}=\max(\eta_{t};\eta_{t-1}) for t>1t>1 and ξ1=η1\xi_{1}=\eta_{1}, which lead to 0<ξt21DL20<\xi_{t}^{2}\leq\frac{1}{DL^{2}} for t1t\geq 1. Moreover, from the definition of D=max(53,6(53β)(Θ+n)n(1β))D=\max\left(\frac{5}{3},\frac{6(5-3\beta)(\Theta+n)}{n(1-\beta)}\right) in Theorem 2, we have L2ξt235L^{2}\xi_{t}^{2}\leq\frac{3}{5} and 6L2ξt2(53β)(Θ+n)n(1β)6L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+n)\leq n(1-\beta) for t1.t\geq 1.

Similar to the estimate (29) in Theorem 1, we get the following result:

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))ηt2F(w0(t))2+ηt2n[βJt+(1β)It].\displaystyle\leq F(w_{0}^{(t)})-\frac{\eta_{t}}{2}\|\nabla F(w_{0}^{(t)})\|^{2}+\frac{\eta_{t}}{2n}\Big{[}\beta J_{t}+(1-\beta)I_{t}\Big{]}.

Taking expectation of both sides and applying Lemma 9, we get:

𝔼[F(w0(t+1))]\displaystyle\mathbb{E}\left[F(w_{0}^{(t+1)})\right] 𝔼[F(w0(t))]ηt2𝔼[F(w0(t))2]+ηt2n𝔼[βJt+(1β)It]\displaystyle\leq\mathbb{E}\left[F(w_{0}^{(t)})\right]-\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(w_{0}^{(t)})\|^{2}\right]+\frac{\eta_{t}}{2n}\mathbb{E}\Big{[}\beta J_{t}+(1-\beta)I_{t}\Big{]}
(42)𝔼[F(w0(t))]ηt2𝔼[F(w0(t))2]\displaystyle\overset{\eqref{bound_beta_IJ_RR}}{\leq}\mathbb{E}\left[F(w_{0}^{(t)})\right]-\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(w_{0}^{(t)})\|^{2}\right]
+ηt2n3(53β)L2ξt2[(Θ+n)j=1tβtj𝔼[F(w0(j))2]+σ21β]\displaystyle\quad+\frac{\eta_{t}}{2n}3(5-3\beta)L^{2}\xi_{t}^{2}\bigg{[}(\Theta+n)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]+\frac{\sigma^{2}}{1-\beta}\bigg{]}
𝔼[F(w0(t))]ηt2𝔼[F(w0(t))2]\displaystyle\leq\mathbb{E}\left[F(w_{0}^{(t)})\right]-\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(w_{0}^{(t)})\|^{2}\right]
+3(53β)ηtL2ξt2(Θ+n)2nj=1tβtj𝔼[F(w0(j))2]+3σ2(53β)ηtL2ξt22n(1β).\displaystyle\quad+\frac{3(5-3\beta)\eta_{t}L^{2}\xi_{t}^{2}(\Theta+n)}{2n}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)\eta_{t}L^{2}\xi_{t}^{2}}{2n(1-\beta)}.

Since ξt\xi_{t} satisfies 6L2ξt2(53β)(Θ+n)n(1β)6L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+n)\leq n(1-\beta) as proved above, we can deduce from the last estimate that

𝔼[F(w0(t+1))]\displaystyle\mathbb{E}\left[F(w_{0}^{(t+1)})\right] 𝔼[F(w0(t))]ηt2𝔼[F(w0(t))2]+(1β)ηt4j=1tβtj𝔼[F(w0(j))2]\displaystyle\leq\mathbb{E}\left[F(w_{0}^{(t)})\right]-\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(w_{0}^{(t)})\|^{2}\right]+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]
+3σ2(53β)ηtL2ξt22n(1β).\displaystyle\quad+\frac{3\sigma^{2}(5-3\beta)\eta_{t}L^{2}\xi_{t}^{2}}{2n(1-\beta)}.

Rearranging this inequality, and noting that ηtξt\eta_{t}\leq\xi_{t} and w~t1=w0(t)\tilde{w}_{t-1}=w_{0}^{(t)}, we obtain, for t2t\geq 2:

ηt2𝔼[F(w~t1)2]𝔼[F(w~t1)F(w~t)]+(1β)ηt4j=1tβtj𝔼[F(w~j1)2]+3σ2(53β)L22n(1β)ξt31β<1𝔼[F(w~t1)F(w~t)](1β)+(1β)ηt4j=1tβtj𝔼[F(w~j1)2]+3σ2(53β)L22n(1β)ξt3.\begin{array}[]{lcl}\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right]&\leq&\mathbb{E}\left[F(\tilde{w}_{t-1})-F(\tilde{w}_{t})\right]+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\cdot\xi_{t}^{3}\vspace{1ex}\\ &\overset{1-\beta<1}{\leq}&\frac{\mathbb{E}\left[F(\tilde{w}_{t-1})-F(\tilde{w}_{t})\right]}{(1-\beta)}+\frac{(1-\beta)\eta_{t}}{4}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\cdot\xi_{t}^{3}.\end{array}

For t=1t=1, since ξ1=η1\xi_{1}=\eta_{1} as previously defined in (17), from the result of Lemma 10 we have

η12𝔼[F(w~0)2]\displaystyle\frac{\eta_{1}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{0})\|^{2}\right] 𝔼[F(w~0)F(w~1)](1β)+(1β)η14j=11β1j𝔼[F(w~j1)2]+3σ2(53β)L22n(1β)ξ13.\displaystyle\leq\frac{\mathbb{E}\left[F(\tilde{w}_{0})-F(\tilde{w}_{1})\right]}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\sum_{j=1}^{1}\beta^{1-j}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\cdot\xi_{1}^{3}.

Summing the previous estimate from t=2t=2 to t=Tt=T, and then adding the estimate for t=1t=1, we obtain

t=1Tηt2𝔼[F(w~t1)2]\displaystyle\sum_{t=1}^{T}\frac{\eta_{t}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right] F(w~0)F(1β)+(1β)4t=1Tηtj=1tβtj𝔼[F(w~j1)2]+3σ2(53β)L22n(1β)t=1Tξt3.\displaystyle\leq\frac{F(\tilde{w}_{0})-F_{*}}{(1-\beta)}+\frac{(1-\beta)}{4}\sum_{t=1}^{T}\eta_{t}\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{j-1})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3}.

Applying the result of Lemma 5 to the last estimate, we get

12t=1Tηt𝔼[F(w~t1)2]\displaystyle\frac{1}{2}\sum_{t=1}^{T}\eta_{t}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right] F(w~0)F(1β)+(1β)41(1β)t=1Tηt𝔼[F(w~t1)2]+3σ2(53β)L22n(1β)t=1Tξt3,\displaystyle\leq\frac{F(\tilde{w}_{0})-F_{*}}{(1-\beta)}+\frac{(1-\beta)}{4}\frac{1}{(1-\beta)}\sum_{t=1}^{T}\eta_{t}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3},

which is equivalent to

t=1Tηt𝔼[F(w~t1)2]\displaystyle\sum_{t=1}^{T}\eta_{t}\mathbb{E}\left[\|\nabla F(\tilde{w}_{t-1})\|^{2}\right] 4[F(w~0)F](1β)+6σ2(53β)L2n(1β)t=1Tξt3.\displaystyle\leq\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)}+\frac{6\sigma^{2}(5-3\beta)L^{2}}{n(1-\beta)}\sum_{t=1}^{T}\xi_{t}^{3}.

Dividing both sides of the resulting estimate by t=1Tηt\sum_{t=1}^{T}\eta_{t}, we obtain (9). Finally, due to the choice of w^T\hat{w}_{T} at Step 11 in Algorithm 1, we have 𝔼[F(w^T)2]=1t=1Tηtt=1Tηt𝔼[F(w~t1)2]\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}=\frac{1}{\sum_{t=1}^{T}\eta_{t}}\sum_{t=1}^{T}\eta_{t}\mathbb{E}\big{[}\|\nabla F(\tilde{w}_{t-1})\|^{2}\big{]}. \square
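
The last identity is consistent with Step 11 selecting the output iterate at random from the stored epoch iterates with probability proportional to the corresponding learning rates. The following is a minimal sketch of such a selection rule, stated as an assumption consistent with the identity above rather than as the verbatim Step 11 of Algorithm 1; the iterates below are random placeholders.

```python
# Hypothetical sketch of the weighted output selection: draw w_hat from
# {w_tilde[0], ..., w_tilde[T-1]} with probability eta[t] / sum_s eta[s], so that
#   E||grad F(w_hat)||^2 = (1 / sum_t eta_t) * sum_t eta_t * E||grad F(w_tilde[t-1])||^2.
# The iterates here are random placeholders, not produced by Algorithm 1.
import numpy as np

rng = np.random.default_rng(1)
T, d = 20, 3
w_tilde = rng.normal(size=(T, d))               # stands in for w_tilde_0, ..., w_tilde_{T-1}
eta = 0.1 / np.arange(1, T + 1) ** (1.0 / 3.0)  # any positive learning-rate schedule

w_hat = w_tilde[rng.choice(T, p=eta / eta.sum())]
```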

8.4 The Proof of Corollaries 5 and 6: Constant and Diminishing Learning Rates

The proof of Corollary 5.

Since T1T\geq 1 and ηt:=γn1/3T1/3\eta_{t}:=\frac{\gamma n^{1/3}}{T^{1/3}}, we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LD0<\eta_{t}\leq\frac{1}{L\sqrt{D}} for all t1t\geq 1, and t=1Tηt=γn1/3T2/3\sum_{t=1}^{T}\eta_{t}=\gamma n^{1/3}T^{2/3} and t=1Tηt13=nγ3\sum_{t=1}^{T}\eta_{t-1}^{3}=n\gamma^{3}. Substituting these expressions into (9), we obtain

𝔼[F(w^T)2]4[F(w~0)F](1β)γn1/3T2/3+6σ2(53β)L2(1β)γ2n1/3T2/3,\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}\leq\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma n^{1/3}T^{2/3}}+\frac{6\sigma^{2}(5-3\beta)L^{2}}{(1-\beta)}\cdot\frac{\gamma^{2}}{n^{1/3}T^{2/3}},

which is our desired result. ∎

The proof of Corollary 6.

For t1t\geq 1, since ηt=γn1/3(t+λ)1/3\eta_{t}=\frac{\gamma n^{1/3}}{(t+\lambda)^{1/3}}, we have ηtηt+1\eta_{t}\geq\eta_{t+1}. We also have 0<ηt1LD0<\eta_{t}\leq\frac{1}{L\sqrt{D}} for all t1t\geq 1, and

t=1Tηt=γn1/3t=1T1(t+λ)1/3γn1/31Tdτ(τ+λ)1/3γn1/3[(T+λ)2/3(1+λ)2/3],t=1Tηt13=2η13+nγ3t=3T1t1+λ2nγ3(1+λ)+nγ3t=2Tdττ1+λnγ3[2(1+λ)+log(T1+λ)].\begin{array}[]{lll}&\sum_{t=1}^{T}\eta_{t}&=\gamma n^{1/3}\sum_{t=1}^{T}\frac{1}{(t+\lambda)^{1/3}}\geq\gamma n^{1/3}\int_{1}^{T}\frac{d\tau}{(\tau+\lambda)^{1/3}}\geq\gamma n^{1/3}\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right],\vspace{1.5ex}\\ &\sum_{t=1}^{T}\eta_{t-1}^{3}&=2\eta_{1}^{3}+n\gamma^{3}\sum_{t=3}^{T}\frac{1}{t-1+\lambda}\leq\frac{2n\gamma^{3}}{(1+\lambda)}+n\gamma^{3}\int_{t=2}^{T}\frac{d\tau}{\tau-1+\lambda}\leq n\gamma^{3}\left[\frac{2}{(1+\lambda)}+\log(T-1+\lambda)\right].\end{array}

Substituting these expressions into (9), we obtain

𝔼[F(w^T)2]4[F(w~0)F](1β)γn1/3[(T+λ)2/3(1+λ)2/3]+6σ2(53β)L2(1β)γ2[2(1+λ)+log(T1+λ)]n1/3[(T+λ)2/3(1+λ)2/3],\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}\leq\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma n^{1/3}\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]}+\frac{6\sigma^{2}(5-3\beta)L^{2}}{(1-\beta)}\cdot\frac{\gamma^{2}\left[\frac{2}{(1+\lambda)}+\log(T-1+\lambda)\right]}{n^{1/3}\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]},

Let C3C_{3} and C4C_{4} be defined respectively as

C3:=4[F(w~0)F](1β)γ+12σ2(53β)L2γ2(1β)(1+λ), and C4:=6σ2(53β)L2γ2(1β).\displaystyle C_{3}:=\frac{4\left[F(\tilde{w}_{0})-F_{*}\right]}{(1-\beta)\gamma}+\frac{12\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)(1+\lambda)},\quad\text{ and }\quad C_{4}:=\frac{6\sigma^{2}(5-3\beta)L^{2}\gamma^{2}}{(1-\beta)}.

Then, the last estimate leads to

𝔼[F(w^T)2]C3+C4log(T1+λ)n1/3[(T+λ)2/3(1+λ)2/3],\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]}\leq\frac{C_{3}+C_{4}\log(T-1+\lambda)}{n^{1/3}\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]},

which completes our proof. ∎
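
As a numerical sanity check of the sum-versus-integral comparisons used in the proofs of Corollaries 5 and 6, the following sketch verifies the two resulting inequalities for arbitrary illustrative values of TT and λ\lambda (the common factors γ\gamma and nn cancel and are omitted).

```python
# Numerical check of the sum/integral comparisons used above (illustrative T, lam):
#   sum_{t=1}^T (t + lam)^(-1/3)  >=  (T + lam)^(2/3) - (1 + lam)^(2/3),
#   sum_{t=3}^T 1 / (t - 1 + lam) <=  2 / (1 + lam) + log(T - 1 + lam).
import numpy as np

T, lam = 1000, 4.0

lhs1 = np.sum((np.arange(1, T + 1) + lam) ** (-1.0 / 3.0))
rhs1 = (T + lam) ** (2.0 / 3.0) - (1 + lam) ** (2.0 / 3.0)
assert lhs1 >= rhs1

lhs2 = np.sum(1.0 / (np.arange(3, T + 1) - 1 + lam))
rhs2 = 2.0 / (1 + lam) + np.log(T - 1 + lam)
assert lhs2 <= rhs2
```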

8.5 The Proof of Technical Lemmas

We now provide the proofs of the additional lemmas used in the proof of Theorem 2 above.

8.5.1 Proof of Lemma 7: Upper Bounding The Terms 𝔼[An(t)]\mathbb{E}\big{[}A_{n}^{(t)}\big{]}, i=0n1𝔼[Ai(t)]\sum_{i=0}^{n-1}\mathbb{E}\big{[}A_{i}^{(t)}\big{]}, and i=0n1𝔼[Bi(t)]\sum_{i=0}^{n-1}\mathbb{E}\big{[}B^{(t)}_{i}\big{]}

In this proof, we will use Lemma 1 of (Mishchenko et al., 2020) for sampling without replacement. For ease of reference, we recall it here.

Lemma 11 (Lemma 1 in (Mishchenko et al., 2020)).

Let X1,,XndX_{1},\cdots,X_{n}\in\mathbb{R}^{d} be fixed vectors, X¯:=1ni=1nXi\bar{X}:=\frac{1}{n}\sum_{i=1}^{n}X_{i} be their average and σ2:=1ni=1nXiX¯2\sigma^{2}:=\frac{1}{n}\sum_{i=1}^{n}\|X_{i}-\bar{X}\|^{2} be the population variance. Fix any k{1,,n}k\in\{1,\cdots,n\}, let Xπ1,,XπkX_{\pi_{1}},\cdots,X_{\pi_{k}} be sampled uniformly without replacement from {X1,,Xn}\{X_{1},\cdots,X_{n}\} and X¯π\bar{X}_{\pi} be their average. Then, the sample average and the variance are given, respectively by

𝔼[X¯π]=X¯and𝔼[X¯πX¯2]=nkk(n1)σ2.\displaystyle\mathbb{E}[\bar{X}_{\pi}]=\bar{X}\qquad\text{and}\qquad\mathbb{E}\left[\|\bar{X}_{\pi}-\bar{X}\|^{2}\right]=\frac{n-k}{k(n-1)}\sigma^{2}.
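
A quick Monte Carlo check of Lemma 11 can be done as follows; this is a minimal sketch, where the data, the sample size kk, and the number of trials are arbitrary illustrations.

```python
# Monte Carlo check of Lemma 11: for uniform sampling without replacement,
#   E[X_bar_pi] = X_bar   and   E||X_bar_pi - X_bar||^2 = (n - k) / (k (n - 1)) * sigma^2.
import numpy as np

rng = np.random.default_rng(2)
n, d, k, trials = 12, 3, 5, 50_000

X = rng.normal(size=(n, d))
X_bar = X.mean(axis=0)
sigma2 = np.mean(np.sum((X - X_bar) ** 2, axis=1))   # population variance

errs = np.empty(trials)
for s in range(trials):
    idx = rng.choice(n, size=k, replace=False)       # sample k points without replacement
    errs[s] = np.sum((X[idx].mean(axis=0) - X_bar) ** 2)

print(errs.mean(), (n - k) / (k * (n - 1)) * sigma2)  # the two values should nearly match
```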

Using this result, we now prove Lemma 7 as follows.

(a) Let us first upper bound the term Ai(t)A_{i}^{(t)} defined by (11). Indeed, for i=0,,n1i=0,\cdots,n-1 and t1t\geq 1, we have

Ai(t)=j=0i1gj(t)2\displaystyle A_{i}^{(t)}=\Big{\|}\sum_{j=0}^{i-1}g_{j}^{(t)}\Big{\|}^{2} (a)2j=0i1(gj(t)f(w0(t);π(t)(j+1)))2+2j=0i1f(w0(t);π(t)(j+1))2\displaystyle\overset{(a)}{\leq}2\Big{\|}\sum_{j=0}^{i-1}\Big{(}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{)}\Big{\|}^{2}+2\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
(a)2ij=0i1gj(t)f(w0(t);π(t)(j+1))2+2j=0i1f(w0(t);π(t)(j+1))2\displaystyle\overset{(a)}{\leq}2i\sum_{j=0}^{i-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}+2\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
2ij=0n1gj(t)f(w0(t);π(t)(j+1))2+2j=0i1f(w0(t);π(t)(j+1))2\displaystyle\leq 2i\sum_{j=0}^{n-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}+2\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
=(13)2iIt+2j=0i1f(w0(t);π(t)(j+1))2,\displaystyle\overset{\eqref{define_I}}{=}2iI_{t}+2\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2},

where we have used the Cauchy-Schwarz inequality in (a).

Now taking expectation conditioned on t\mathcal{F}_{t}, we get

𝔼t[Ai(t)]\displaystyle\mathbb{E}_{t}\big{[}A_{i}^{(t)}\big{]} 2i𝔼t[It]+2𝔼t[j=0i1f(w0(t);π(t)(j+1))2].\displaystyle\leq 2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\mathbb{E}_{t}\left[\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}\right].

Note that 𝔼t[f(w0(t);π(t)(j+1))]=F(w0(t))\mathbb{E}_{t}\left[\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\right]=\nabla F(w_{0}^{(t)}) for every index j=0,1,,i1.j=0,1,\cdots,i-1. Hence, we have

𝔼t[Ai(t)]\displaystyle\mathbb{E}_{t}\big{[}A_{i}^{(t)}\big{]} 2i𝔼t[It]+2𝔼t[j=0i1f(w0(t);π(t)(j+1))2]\displaystyle\leq 2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\mathbb{E}_{t}\bigg{[}\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}\bigg{]}
=2i𝔼t[It]+2iF(w0(t))2+2𝔼t[j=0i1f(w0(t);π(t)(j+1))iF(w0(t))2]\displaystyle=2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\big{\|}i\nabla F(w_{0}^{(t)})\big{\|}^{2}+2\mathbb{E}_{t}\bigg{[}\Big{\|}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-i\nabla F(w_{0}^{(t)})\Big{\|}^{2}\bigg{]}
=2i𝔼t[It]+2i2F(w0(t))2+2i2𝔼t[1ij=0i1f(w0(t);π(t)(j+1))F(w0(t))2]\displaystyle=2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+2i^{2}\mathbb{E}_{t}\bigg{[}\Big{\|}\frac{1}{i}\sum_{j=0}^{i-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}\bigg{]}
=2i𝔼t[It]+2i2F(w0(t))2+2i2nii(n1)1nj=0n1f(w0(t);π(t)(j+1))F(w0(t))2\displaystyle=2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+2i^{2}\frac{n-i}{i(n-1)}\frac{1}{n}\sum_{j=0}^{n-1}\big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\big{\|}^{2}
(b)2i𝔼t[It]+2i2F(w0(t))2+2i(ni)(n1)[ΘF(w0(t))2+σ2],\displaystyle\overset{(b)}{\leq}2i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2i^{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\frac{2i(n-i)}{(n-1)}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]},

where we apply the sample variance lemma (Lemma 11) from (Mishchenko et al., 2020) and (b) follows from Assumption 1(c).

Letting i=ni=n in the above estimate and taking total expectation, we obtain the first estimate of Lemma 7, i.e.:

𝔼[An(t)]\displaystyle\mathbb{E}\big{[}A_{n}^{(t)}\big{]} 2n𝔼[It]+2n2𝔼[F(w0(t))2]2n𝔼[It+Nt].\displaystyle\leq 2n\cdot\mathbb{E}\left[I_{t}\right]+2n^{2}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}\right]\leq 2n\cdot\mathbb{E}\left[I_{t}+N_{t}\right].

Now we are ready to calculate the sum i=0n1𝔼t[Ai(t)]\sum_{i=0}^{n-1}\mathbb{E}_{t}\big{[}A_{i}^{(t)}\big{]} as

i=0n1𝔼t[Ai(t)]\displaystyle\sum_{i=0}^{n-1}\mathbb{E}_{t}\big{[}A_{i}^{(t)}\big{]} 2i=0n1i𝔼t[It]+2i=0n1i2F(w0(t))2+2i=0n1i(ni)(n1)[ΘF(w0(t))2+σ2]\displaystyle\leq 2\sum_{i=0}^{n-1}i\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\sum_{i=0}^{n-1}i^{2}\cdot\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+2\sum_{i=0}^{n-1}\frac{i(n-i)}{(n-1)}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
n2𝔼t[It]+2n33F(w0(t))2+n(n+1)3[ΘF(w0(t))2+σ2]\displaystyle\leq n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+\frac{2n^{3}}{3}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\frac{n(n+1)}{3}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
n2𝔼t[It]+n3F(w0(t))2+n2[ΘF(w0(t))2+σ2]\displaystyle\leq n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+n^{3}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+n^{2}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
n2𝔼t[It]+n2[(n+Θ)F(w0(t))2+σ2](16)n2(𝔼t[It]+Nt),t1,\displaystyle\leq n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+n^{2}\Big{[}(n+\Theta)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}\overset{\eqref{define_N}}{\leq}n^{2}\left(\mathbb{E}_{t}\left[I_{t}\right]+N_{t}\right),\ t\geq 1,

where we use the facts that i=0n1i=n(n1)2n22\sum_{i=0}^{n-1}i=\frac{n(n-1)}{2}\leq\frac{n^{2}}{2} and i=0n1i2=n(n1)(2n1)6n33\sum_{i=0}^{n-1}i^{2}=\frac{n(n-1)(2n-1)}{6}\leq\frac{n^{3}}{3}.

Taking total expectation, we have the second estimate of Lemma 7.

(b) Using a similar argument as above, for i=0,1,,n1i=0,1,\cdots,n-1 and t1t\geq 1, we can derive

Bi(t)\displaystyle B_{i}^{(t)} =j=in1gj(t)2(a)2j=in1(gj(t)f(w0(t);π(t)(j+1)))2+2j=in1f(w0(t);π(t)(j+1))2\displaystyle=\Big{\|}\sum_{j=i}^{n-1}g_{j}^{(t)}\Big{\|}^{2}\overset{(a)}{\leq}2\Big{\|}\sum_{j=i}^{n-1}\Big{(}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{)}\Big{\|}^{2}+2\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
(a)2(ni)j=in1gj(t)f(w0(t);π(t)(j+1))2+2j=in1f(w0(t);π(t)(j+1))2\displaystyle\overset{(a)}{\leq}2(n-i)\sum_{j=i}^{n-1}\Big{\|}g_{j}^{(t)}-\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}+2\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}
(13)2(ni)It+2j=in1f(w0(t);π(t)(j+1))2,\displaystyle\overset{\eqref{define_I}}{\leq}2(n-i)I_{t}+2\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2},

where we use again the Cauchy-Schwarz inequality in (a).

Taking expectation conditioned on t\mathcal{F}_{t}, we get

𝔼t[Bi(t)]\displaystyle\mathbb{E}_{t}\big{[}B_{i}^{(t)}\big{]} 2(ni)𝔼t[It]+2𝔼t[j=in1f(w0(t);π(t)(j+1))2].\displaystyle\leq 2(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\mathbb{E}_{t}\bigg{[}\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}\bigg{]}.

Note that 𝔼t[f(w0(t);π(t)(j+1))]=F(w0(t))\mathbb{E}_{t}\big{[}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\big{]}=\nabla F(w_{0}^{(t)}) for every index j=i,,n1.j=i,\cdots,n-1. Hence

𝔼t[Bi(t)]\displaystyle\mathbb{E}_{t}\big{[}B_{i}^{(t)}\big{]} 2(ni)𝔼t[It]+2𝔼t[j=in1f(w0(t);π(t)(j+1))2]\displaystyle\leq 2(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\mathbb{E}_{t}\bigg{[}\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))\Big{\|}^{2}\bigg{]}
=2(ni)𝔼t[It]+2(ni)F(w0(t))2+2𝔼t[j=in1f(w0(t);π(t)(j+1))(ni)F(w0(t))2]\displaystyle=2(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\big{\|}(n-i)\nabla F(w_{0}^{(t)})\big{\|}^{2}+2\mathbb{E}_{t}\bigg{[}\Big{\|}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-(n-i)\nabla F(w_{0}^{(t)})\Big{\|}^{2}\bigg{]}
=2(ni)𝔼t[It]+2(ni)2F(w0(t))2\displaystyle=2(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2(n-i)^{2}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}
+2(ni)2𝔼t[1nij=in1f(w0(t);π(t)(j+1))F(w0(t))2]\displaystyle\quad+2(n-i)^{2}\mathbb{E}_{t}\bigg{[}\Big{\|}\frac{1}{n-i}\sum_{j=i}^{n-1}\nabla f(w_{0}^{(t)};\pi^{(t)}(j+1))-\nabla F(w_{0}^{(t)})\Big{\|}^{2}\bigg{]}
(b)2(ni)𝔼t[It]+2(ni)2F(w0(t))2+2i(ni)(n1)[ΘF(w0(t))2+σ2],\displaystyle\overset{(b)}{\leq}2(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2(n-i)^{2}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+\frac{2i(n-i)}{(n-1)}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]},

where we again apply the sample variance lemma (Lemma 11) and (b) follows from Assumption 1(c). Finally, for all t1t\geq 1, we bound the sum i=0n1𝔼t[Bi(t)]\sum_{i=0}^{n-1}\mathbb{E}_{t}\Big{[}B_{i}^{(t)}\Big{]} similarly as follows

i=0n1𝔼t[Bi(t)]\displaystyle\sum_{i=0}^{n-1}\mathbb{E}_{t}\Big{[}B_{i}^{(t)}\Big{]} 2i=0n1(ni)𝔼t[It]+2i=0n1(ni)2F(w0(t))2+i=0n12i(ni)(n1)[ΘF(w0(t))2+σ2]\displaystyle\leq 2\sum_{i=0}^{n-1}(n-i)\cdot\mathbb{E}_{t}\left[I_{t}\right]+2\sum_{i=0}^{n-1}(n-i)^{2}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+\sum_{i=0}^{n-1}\frac{2i(n-i)}{(n-1)}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
2n2𝔼t[It]+n(n+1)(2n+1)3F(w0(t))2+n(n+1)3[ΘF(w0(t))2+σ2]\displaystyle\leq 2n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+\frac{n(n+1)(2n+1)}{3}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+\frac{n(n+1)}{3}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
2n2𝔼t[It]+2n3F(w0(t))2+2n2[ΘF(w0(t))2+σ2]\displaystyle\leq 2n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+2n^{3}\Big{\|}\nabla F(w_{0}^{(t)})\Big{\|}^{2}+2n^{2}\Big{[}\Theta\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}
2n2𝔼t[It]+2n2[(n+Θ)F(w0(t))2+σ2](16)2n2(𝔼t[It]+Nt).\displaystyle\leq 2n^{2}\cdot\mathbb{E}_{t}\left[I_{t}\right]+2n^{2}\Big{[}(n+\Theta)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}\Big{]}\overset{\eqref{define_N}}{\leq}2n^{2}\left(\mathbb{E}_{t}\left[I_{t}\right]+N_{t}\right).

Taking total expectation, we get (41). \square

8.5.2 Proof of Lemma 8: Upper Bounding The Terms 𝔼[It]\mathbb{E}[I_{t}], 𝔼[Jt]\mathbb{E}[J_{t}], and 𝔼[It+Nt]\mathbb{E}[I_{t}+N_{t}]

(a) First, for every t1t\geq 1 and gi(t):=f(wi(t);π(t)(i+1))g_{i}^{(t)}:=\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1)), we have

It\displaystyle I_{t} =i=0n1f(w0(t);π(t)(i+1))gi(t)2\displaystyle=\sum_{i=0}^{n-1}\big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(i+1))-g_{i}^{(t)}\big{\|}^{2}
=Step 6i=0n1f(w0(t);π(t)(i+1))f(wi(t);π(t)(i+1))2\displaystyle\overset{\tiny\textrm{Step~{}\ref{alg:A1_step3}}}{=}\sum_{i=0}^{n-1}\big{\|}\nabla f(w_{0}^{(t)};\pi^{(t)}(i+1))-\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1))\big{\|}^{2}
L2i=0n1wi(t)w0(t)2,\displaystyle\leq L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2}, (44)

where the last inequality follows from Assumption 1(b).

Applying Lemma 1 to (44), for t>1t>1, we have

It\displaystyle I_{t} L2ξt2n2i=0n1[βi2n2An(t1)+(1β)Ai(t)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\bigg{[}\beta\frac{i^{2}}{n^{2}}A_{n}^{(t-1)}+(1-\beta)A_{i}^{(t)}\bigg{]}
L2ξt2n2[βAn(t1)n2i=0n1i2+(1β)i=0n1Ai(t)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\bigg{[}\beta\frac{A_{n}^{(t-1)}}{n^{2}}\sum_{i=0}^{n-1}i^{2}+(1-\beta)\sum_{i=0}^{n-1}A_{i}^{(t)}\bigg{]}
(a)L2ξt2n2[βn3An(t1)+(1β)i=0n1Ai(t)]\displaystyle\overset{(a)}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\bigg{[}\beta\cdot\frac{n}{3}\cdot A_{n}^{(t-1)}+(1-\beta)\sum_{i=0}^{n-1}A_{i}^{(t)}\bigg{]}

where (a) follows from the fact that i=0n1i2n33\sum_{i=0}^{n-1}i^{2}\leq\frac{n^{3}}{3}. Taking expectation and applying the result (40) of Lemma 7, we get

𝔼[It]\displaystyle\mathbb{E}[I_{t}] L2ξt2n2[βn3𝔼[An(t1)]+(1β)i=0n1𝔼[Ai(t)]]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\bigg{[}\beta\cdot\frac{n}{3}\cdot\mathbb{E}\Big{[}A_{n}^{(t-1)}\Big{]}+(1-\beta)\sum_{i=0}^{n-1}\mathbb{E}\Big{[}A_{i}^{(t)}\Big{]}\bigg{]}
L2ξt2n2[βn32n𝔼[It1+Nt1]+(1β)n2𝔼[It+Nt]]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\bigg{[}\beta\cdot\frac{n}{3}\cdot 2n\cdot\mathbb{E}\big{[}I_{t-1}+N_{t-1}\big{]}+(1-\beta)n^{2}\mathbb{E}\left[I_{t}+N_{t}\right]\bigg{]}
L2ξt2[23β𝔼[It1+Nt1]+(1β)𝔼[It+Nt]].\displaystyle\leq L^{2}\xi_{t}^{2}\bigg{[}\frac{2}{3}\beta\cdot\mathbb{E}\big{[}I_{t-1}+N_{t-1}\big{]}+(1-\beta)\mathbb{E}\left[I_{t}+N_{t}\right]\bigg{]}.

For t=1t=1, by applying Lemmas 1 and 7 consecutively to (44), we have

It\displaystyle I_{t} L2i=0n1wi(t)w0(t)2(21)L2ξt2n2i=0n1((1β)2Ai(t))L2ξt2(1β)2n2i=0n1Ai(t).\displaystyle\leq L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2}\overset{\eqref{update_distance_t}}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\left((1-\beta)^{2}A_{i}^{(t)}\right)\leq\frac{L^{2}\xi_{t}^{2}(1-\beta)^{2}}{n^{2}}\sum_{i=0}^{n-1}A_{i}^{(t)}.

Similarly we get

𝔼[It]\displaystyle\mathbb{E}[I_{t}] (40)L2ξt2(1β)2n2n2𝔼[It+Nt]=L2ξt2(1β)2𝔼[It+Nt].\displaystyle\overset{\eqref{bound_A^{T}_RR}}{\leq}\frac{L^{2}\xi_{t}^{2}(1-\beta)^{2}}{n^{2}}n^{2}\cdot\mathbb{E}\left[I_{t}+N_{t}\right]=L^{2}\xi_{t}^{2}(1-\beta)^{2}\mathbb{E}\left[I_{t}+N_{t}\right]. (45)

(b) Using a similar argument as above, we can derive that

Jt\displaystyle J_{t} =i=0n1f(w0(t);π(t1)(i+1))gi(t1)2\displaystyle=\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(i+1))-g_{i}^{(t-1)}\Big{\|}^{2}
=Step 6i=0n1f(w0(t);π(t1)(i+1))f(wi(t1);π(t1)(i+1))2\displaystyle\overset{\tiny\textrm{Step~{}\ref{alg:A1_step3}}}{=}\sum_{i=0}^{n-1}\Big{\|}\nabla f(w_{0}^{(t)};\pi^{(t-1)}(i+1))-\nabla f(w_{i}^{(t-1)};\pi^{(t-1)}(i+1))\Big{\|}^{2}
(3)L2i=0n1wi(t1)w0(t)2.\displaystyle\overset{\eqref{eq:Lsmooth_basic}}{\leq}L^{2}\sum_{i=0}^{n-1}\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2}.

Applying Lemma 1 for t>2t>2 to the above estimate, we get

Jt\displaystyle J_{t} L2ξt2n2i=0n1[β(ni)2n2An(t2)+(1β)Bi(t1)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}\left[\beta\frac{(n-i)^{2}}{n^{2}}A_{n}^{(t-2)}+(1-\beta)B_{i}^{(t-1)}\right]
L2ξt2n2[βAn(t2)n2i=0n1(ni)2+(1β)i=0n1Bi(t1)]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta\frac{A_{n}^{(t-2)}}{n^{2}}\sum_{i=0}^{n-1}(n-i)^{2}+(1-\beta)\sum_{i=0}^{n-1}B_{i}^{(t-1)}\right]
L2ξt2n2[βnAn(t2)+(1β)i=0n1Bi(t1)],\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta n\cdot A_{n}^{(t-2)}+(1-\beta)\sum_{i=0}^{n-1}B_{i}^{(t-1)}\right],

where the last line follows from the fact that i=0n1(ni)2n3\sum_{i=0}^{n-1}(n-i)^{2}\leq n^{3}. Similar to the previous part, taking total expectation we get

𝔼[Jt]\displaystyle\mathbb{E}\left[J_{t}\right] L2ξt2n2[βn𝔼[An(t2)]+(1β)i=0n1𝔼[Bi(t1)]]\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta n\cdot\mathbb{E}\Big{[}A_{n}^{(t-2)}\Big{]}+(1-\beta)\sum_{i=0}^{n-1}\mathbb{E}\Big{[}B_{i}^{(t-1)}\Big{]}\right]
(41)L2ξt2n2[βn𝔼[An(t2)]+(1β)2n2𝔼[It1+Nt1]]\displaystyle\overset{\eqref{bound_B^{T}_RR}}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\left[\beta n\cdot\mathbb{E}\Big{[}A_{n}^{(t-2)}\Big{]}+(1-\beta)2n^{2}\cdot\mathbb{E}\left[I_{t-1}+N_{t-1}\right]\right]
(40)L2ξt2n2[2βn2𝔼[It2+Nt2]+2(1β)n2𝔼[It1+Nt1]]\displaystyle\overset{\eqref{bound_A^{T}_RR}}{\leq}\frac{L^{2}\xi_{t}^{2}}{n^{2}}\Big{[}2\beta n^{2}\cdot\mathbb{E}\left[I_{t-2}+N_{t-2}\right]+2(1-\beta)n^{2}\cdot\mathbb{E}\left[I_{t-1}+N_{t-1}\right]\Big{]}
2L2ξt2[β𝔼[It2+Nt2]+(1β)𝔼[It1+Nt1]].\displaystyle\leq 2L^{2}\xi_{t}^{2}\Big{[}\beta\cdot\mathbb{E}\left[I_{t-2}+N_{t-2}\right]+(1-\beta)\cdot\mathbb{E}\left[I_{t-1}+N_{t-1}\right]\Big{]}.

Now applying Lemma 1 for the special case t=2t=2, we obtain

Jt\displaystyle J_{t} L2ξt2n2i=0n1(1β)2Bi(t1).\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}\sum_{i=0}^{n-1}(1-\beta)^{2}B_{i}^{(t-1)}.

Therefore, we have

𝔼[Jt]\displaystyle\mathbb{E}[J_{t}] L2ξt2n2(1β)2i=0n1𝔼[Bi(t1)]L2ξt2n2(1β)22n2𝔼[It1+Nt1]2L2ξt2𝔼[It1+Nt1],\displaystyle\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}\sum_{i=0}^{n-1}\mathbb{E}\left[B_{i}^{(t-1)}\right]\leq\frac{L^{2}\xi_{t}^{2}}{n^{2}}(1-\beta)^{2}\cdot 2n^{2}\mathbb{E}\left[I_{t-1}+N_{t-1}\right]\leq 2L^{2}\xi_{t}^{2}\cdot\mathbb{E}\left[I_{t-1}+N_{t-1}\right],

where we have used the result (41) from Lemma 7 and the fact that 1β11-\beta\leq 1.

(c) First, from the result of Part (a), for t>1t>1, we have

𝔼[It]L2ξt2[23β𝔼[It1+Nt1]+(1β)𝔼[Nt]]+(1β)L2ξt2𝔼[It].\displaystyle\mathbb{E}[I_{t}]\leq L^{2}\xi_{t}^{2}\left[\frac{2}{3}\beta\mathbb{E}[I_{t-1}+N_{t-1}]+(1-\beta)\mathbb{E}[N_{t}]\right]+(1-\beta)L^{2}\xi_{t}^{2}\mathbb{E}[I_{t}].

Since L2ξt235L^{2}\xi_{t}^{2}\leq\frac{3}{5} (due to the choice of our learning rate) and 1β11-\beta\leq 1, we further get

𝔼[It]\displaystyle\mathbb{E}[I_{t}] 25[β𝔼[It1+Nt1]+32(1β)𝔼[Nt]]+35𝔼[It],\displaystyle\leq\frac{2}{5}\left[\beta\mathbb{E}[I_{t-1}+N_{t-1}]+\frac{3}{2}(1-\beta)\mathbb{E}[N_{t}]\right]+\frac{3}{5}\mathbb{E}[I_{t}],

which is equivalent to

𝔼[It]β𝔼[It1+Nt1]+32(1β)𝔼[Nt].\displaystyle\mathbb{E}[I_{t}]\leq\beta\mathbb{E}[I_{t-1}+N_{t-1}]+\frac{3}{2}(1-\beta)\mathbb{E}[N_{t}].

Adding 𝔼[Nt]\mathbb{E}[N_{t}] to both sides of this inequality, for t>1t>1, we get

𝔼[It+Nt]β𝔼[It1+Nt1]+53β2𝔼[Nt].\displaystyle\mathbb{E}[I_{t}+N_{t}]\leq\beta\mathbb{E}[I_{t-1}+N_{t-1}]+\frac{5-3\beta}{2}\mathbb{E}[N_{t}]. (46)

Finally, for t=1t=1, since L2ξt235L^{2}\xi_{t}^{2}\leq\frac{3}{5} and 1β11-\beta\leq 1, we have

𝔼[It]\displaystyle\mathbb{E}[I_{t}] L2ξt2(1β)2(𝔼[It]+𝔼[Nt])35(1β)2𝔼[It]+35(1β)2𝔼[Nt]35𝔼[It]+35(1β)𝔼[Nt],\displaystyle\leq L^{2}\xi_{t}^{2}(1-\beta)^{2}\Big{(}\mathbb{E}[I_{t}]+\mathbb{E}[N_{t}]\Big{)}\leq\frac{3}{5}(1-\beta)^{2}\mathbb{E}[I_{t}]+\frac{3}{5}(1-\beta)^{2}\mathbb{E}[N_{t}]\leq\frac{3}{5}\mathbb{E}[I_{t}]+\frac{3}{5}(1-\beta)\mathbb{E}[N_{t}],

which is equivalent to

25𝔼[It]35(1β)𝔼[Nt].\displaystyle\frac{2}{5}\mathbb{E}[I_{t}]\leq\frac{3}{5}(1-\beta)\mathbb{E}[N_{t}]. (47)

This leads to our desired result for t=1t=1 as

𝔼[I1+N1]32(1β)𝔼[N1]+𝔼[N1]=53β2𝔼[N1].\displaystyle\mathbb{E}[I_{1}+N_{1}]\leq\frac{3}{2}(1-\beta)\mathbb{E}[N_{1}]+\mathbb{E}[N_{1}]=\frac{5-3\beta}{2}\mathbb{E}[N_{1}]. (48)

Combining two cases, we obtain the desired result in Part (c). \square

8.5.3 Proof of Lemma 9: Upper Bounding The Key Quantity 𝔼[βJt+(1β)It]\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}]

First, we analyze the case t>2t>2 using the results of Lemma 8 as follows:

β𝔼[Jt]+(1β)𝔼[It]\displaystyle\beta\mathbb{E}[J_{t}]+(1-\beta)\mathbb{E}[I_{t}] 2βL2ξt2[β𝔼[It2+Nt2]+(1β)𝔼[It1+Nt1]]\displaystyle\leq 2\beta L^{2}\xi_{t}^{2}\left[\beta\mathbb{E}[I_{t-2}+N_{t-2}]+(1-\beta)\mathbb{E}[I_{t-1}+N_{t-1}]\right]
+(1β)L2ξt2[23β𝔼[It1+Nt1]+(1β)𝔼[It+Nt]]\displaystyle\quad+(1-\beta)L^{2}\xi_{t}^{2}\left[\frac{2}{3}\beta\mathbb{E}[I_{t-1}+N_{t-1}]+(1-\beta)\mathbb{E}[I_{t}+N_{t}]\right]
23L2ξt2[2𝔼[It+Nt]+4β𝔼[It1+Nt1]+3β2𝔼[It2+Nt2]]\displaystyle\leq\frac{2}{3}L^{2}\xi_{t}^{2}\left[2\mathbb{E}[I_{t}+N_{t}]+4\beta\mathbb{E}[I_{t-1}+N_{t-1}]+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}]\right]
=:23L2ξt2St,\displaystyle=:\frac{2}{3}L^{2}\xi_{t}^{2}S_{t}, (49)

where the last line follows since 1β11-\beta\leq 1 and St:=2𝔼[It+Nt]+4β𝔼[It1+Nt1]+3β2𝔼[It2+Nt2]S_{t}:=2\mathbb{E}[I_{t}+N_{t}]+4\beta\mathbb{E}[I_{t-1}+N_{t-1}]+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}] for t>2t>2.

Next, we bound the term StS_{t} in (49) as

St\displaystyle S_{t} =2𝔼[It+Nt]+4β𝔼[It1+Nt1]+3β2𝔼[It2+Nt2]\displaystyle=2\mathbb{E}[I_{t}+N_{t}]+4\beta\mathbb{E}[I_{t-1}+N_{t-1}]+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}]
2[β𝔼[It1+Nt1]+53β2𝔼[Nt]]+4β𝔼[It1+Nt1]+3β2𝔼[It2+Nt2]\displaystyle\leq 2\left[\beta\mathbb{E}[I_{t-1}+N_{t-1}]+\frac{5-3\beta}{2}\mathbb{E}[N_{t}]\right]+4\beta\mathbb{E}[I_{t-1}+N_{t-1}]+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}] apply (46) fort\displaystyle\text{apply \eqref{bound_sum_I_N>1_RR} for}\ t
=(53β)𝔼[Nt]+6β𝔼[It1+Nt1]+3β2𝔼[It2+Nt2]\displaystyle=(5-3\beta)\mathbb{E}[N_{t}]+6\beta\mathbb{E}[I_{t-1}+N_{t-1}]+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}]
(53β)𝔼[Nt]+6β[β𝔼[It2+Nt2]+53β2𝔼[Nt1]]+3β2𝔼[It2+Nt2]\displaystyle\leq(5-3\beta)\mathbb{E}[N_{t}]+6\beta\Big{[}\beta\mathbb{E}[I_{t-2}+N_{t-2}]+\frac{5-3\beta}{2}\mathbb{E}[N_{t-1}]\Big{]}+3\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}] apply (46) fort1\displaystyle\text{apply \eqref{bound_sum_I_N>1_RR} for}\ t-1
=(53β)𝔼[Nt]+3β(53β)𝔼[Nt1]+9β2𝔼[It2+Nt2].\displaystyle=(5-3\beta)\mathbb{E}[N_{t}]+3\beta(5-3\beta)\mathbb{E}[N_{t-1}]+9\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}].

We consider the last term, which can be bounded as

β2𝔼[It2+Nt2]\displaystyle\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}] β3𝔼[It3+Nt3]+53β2β2𝔼[Nt2]\displaystyle\leq\beta^{3}\mathbb{E}[I_{t-3}+N_{t-3}]+\frac{5-3\beta}{2}\beta^{2}\mathbb{E}[N_{t-2}] apply (46) recursively fort2>1\text{apply \eqref{bound_sum_I_N>1_RR} recursively for}\ t-2>1
βt1𝔼[I1+N1]+53β2j=2t2βtj𝔼[Nj]\displaystyle\leq\beta^{t-1}\mathbb{E}[I_{1}+N_{1}]+\frac{5-3\beta}{2}\sum_{j=2}^{t-2}\beta^{t-j}\mathbb{E}[N_{j}]
βt153β2𝔼[N1]+53β2j=2t2βtj𝔼[Nj]\displaystyle\leq\beta^{t-1}\frac{5-3\beta}{2}\mathbb{E}[N_{1}]+\frac{5-3\beta}{2}\sum_{j=2}^{t-2}\beta^{t-j}\mathbb{E}[N_{j}] apply (48)
53β2j=1t2βtj𝔼[Nj].\displaystyle\leq\frac{5-3\beta}{2}\sum_{j=1}^{t-2}\beta^{t-j}\mathbb{E}[N_{j}].

Note that this bound is also true for the case t2=1t-2=1:

β2𝔼[It2+Nt2]\displaystyle\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}] =β2𝔼[I1+N1]β253β2𝔼[N1]\displaystyle=\beta^{2}\mathbb{E}[I_{1}+N_{1}]\leq\beta^{2}\frac{5-3\beta}{2}\mathbb{E}[N_{1}] apply (48)\text{apply \eqref{bound_sum_I_N=1_RR}}

Substituting this inequality into StS_{t}, for t>2t>2, we get

St\displaystyle S_{t} (53β)𝔼[Nt]+3β(53β)𝔼[Nt1]+9β2𝔼[It2+Nt2]\displaystyle\leq(5-3\beta)\mathbb{E}[N_{t}]+3\beta(5-3\beta)\mathbb{E}[N_{t-1}]+9\beta^{2}\mathbb{E}[I_{t-2}+N_{t-2}]
92(53β)𝔼[Nt]+92(53β)β𝔼[Nt1]+92(53β)j=1t2βtj𝔼[Nj]\displaystyle\leq\frac{9}{2}(5-3\beta)\mathbb{E}[N_{t}]+\frac{9}{2}(5-3\beta)\beta\mathbb{E}[N_{t-1}]+\frac{9}{2}(5-3\beta)\sum_{j=1}^{t-2}\beta^{t-j}\mathbb{E}[N_{j}]
92(53β)j=1tβtj𝔼[Nj].\displaystyle\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}[N_{j}].

Now we analyze similarly for the case t=2t=2 as follows:

β𝔼[J2]+(1β)𝔼[I2]\displaystyle\beta\mathbb{E}[J_{2}]+(1-\beta)\mathbb{E}[I_{2}] 2βL2ξ22𝔼[I1+N1]+(1β)L2ξ22[23β𝔼[I1+N1]+(1β)𝔼[I2+N2]]\displaystyle\leq 2\beta L^{2}\xi_{2}^{2}\mathbb{E}[I_{1}+N_{1}]+(1-\beta)L^{2}\xi_{2}^{2}\left[\frac{2}{3}\beta\mathbb{E}[I_{1}+N_{1}]+(1-\beta)\mathbb{E}[I_{2}+N_{2}]\right]
23L2ξ22[2𝔼[I2+N2]+4β𝔼[I1+N1]]\displaystyle\leq\frac{2}{3}L^{2}\xi_{2}^{2}\big{[}2\mathbb{E}[I_{2}+N_{2}]+4\beta\mathbb{E}[I_{1}+N_{1}]\big{]}
=:23L2ξ22S2,\displaystyle=:\frac{2}{3}L^{2}\xi_{2}^{2}S_{2},

where the last line follows since 1β11-\beta\leq 1 and S2:=2𝔼[I2+N2]+4β𝔼[I1+N1]S_{2}:=2\mathbb{E}[I_{2}+N_{2}]+4\beta\mathbb{E}[I_{1}+N_{1}].

Next, we bound the term S2S_{2} as follows:

S2\displaystyle S_{2} =2𝔼[I2+N2]+4β𝔼[I1+N1]\displaystyle=2\mathbb{E}[I_{2}+N_{2}]+4\beta\mathbb{E}[I_{1}+N_{1}]
2(β𝔼[I1+N1]+53β2𝔼[N2])+4β𝔼[I1+N1]\displaystyle\leq 2\left(\beta\mathbb{E}[I_{1}+N_{1}]+\frac{5-3\beta}{2}\mathbb{E}[N_{2}]\right)+4\beta\mathbb{E}[I_{1}+N_{1}] apply (46) fort=2\displaystyle\text{apply \eqref{bound_sum_I_N>1_RR} for}\ t=2
=(53β)𝔼[N2]+6β𝔼[I1+N1]\displaystyle=(5-3\beta)\mathbb{E}[N_{2}]+6\beta\mathbb{E}[I_{1}+N_{1}]
(53β)𝔼[N2]+6β53β2𝔼[N1]\displaystyle\leq(5-3\beta)\mathbb{E}[N_{2}]+6\beta\frac{5-3\beta}{2}\mathbb{E}[N_{1}] apply (48)
=(53β)𝔼[N2]+3β(53β)𝔼[N1]\displaystyle=(5-3\beta)\mathbb{E}[N_{2}]+3\beta(5-3\beta)\mathbb{E}[N_{1}]
92(53β)j=12β2j𝔼[Nj].\displaystyle\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{2}\beta^{2-j}\mathbb{E}[N_{j}].

Hence, the statements 𝔼[βJt+(1β)It]23L2ξt2St\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}]\leq\frac{2}{3}L^{2}\xi_{t}^{2}S_{t} and St92(53β)j=1tβtj𝔼[Nj]S_{t}\leq\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}[N_{j}] are true for every t2t\geq 2.

Combining these two cases, we have the following estimate

𝔼[βJt+(1β)It]23L2ξt2St\displaystyle\mathbb{E}[\beta J_{t}+(1-\beta)I_{t}]\leq\frac{2}{3}L^{2}\xi_{t}^{2}S_{t} L2ξt22392(53β)j=1tβtj𝔼[Nj]\displaystyle\leq L^{2}\xi_{t}^{2}\cdot\frac{2}{3}\cdot\frac{9}{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}[N_{j}]
(16)3L2ξt2(53β)j=1tβtj𝔼[(Θ+n)F(w0(j))2+σ2]\displaystyle\overset{\eqref{define_N}}{\leq}3L^{2}\xi_{t}^{2}(5-3\beta)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\Big{[}(\Theta+n)\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}+\sigma^{2}\Big{]}
3L2ξt2(53β)[(Θ+n)j=1tβtj𝔼[F(w0(j))2]+j=1tβtjσ2]\displaystyle\leq 3L^{2}\xi_{t}^{2}(5-3\beta)\bigg{[}(\Theta+n)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]+\sum_{j=1}^{t}\beta^{t-j}\sigma^{2}\bigg{]}
3L2ξt2(53β)[(Θ+n)j=1tβtj𝔼[F(w0(j))2]+σ21β],\displaystyle\leq 3L^{2}\xi_{t}^{2}(5-3\beta)\bigg{[}(\Theta+n)\sum_{j=1}^{t}\beta^{t-j}\mathbb{E}\left[\big{\|}\nabla F(w_{0}^{(j)})\big{\|}^{2}\right]+\frac{\sigma^{2}}{1-\beta}\bigg{]},

where the last line follows since j=1tβtj11β\sum_{j=1}^{t}\beta^{t-j}\leq\frac{1}{1-\beta} for every t2t\geq 2. Hence, we have proved (42). \square
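
The step "apply (46) recursively" used in the proof above is simply the unrolling of a linear recursion into a geometric sum. A minimal numeric illustration of this elementary step, checking the equality (worst) case with a placeholder sequence, is given below.

```python
# Unrolling a linear recursion into a geometric sum, as in the repeated use of (46):
# if x_1 <= c_1 and x_t <= beta * x_{t-1} + c_t, then x_t <= sum_{j=1}^t beta^(t-j) * c_j.
# We check the equality (worst) case; the sequence c_t is an arbitrary placeholder.
import numpy as np

rng = np.random.default_rng(3)
beta, T = 0.7, 30
c = rng.uniform(size=T + 1)          # c[1], ..., c[T] are used; c[0] is ignored

x = np.zeros(T + 1)
x[1] = c[1]
for t in range(2, T + 1):
    x[t] = beta * x[t - 1] + c[t]

unrolled = sum(beta ** (T - j) * c[j] for j in range(1, T + 1))
assert np.isclose(x[T], unrolled)
```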

8.5.4 The Proof of Lemma 10

First, from the assumption 0<ηt1LD0<\eta_{t}\leq\frac{1}{L\sqrt{D}}, t1t\geq 1, we have 0<ηt21DL20<\eta_{t}^{2}\leq\frac{1}{DL^{2}}. Next, from (17), we have ξt=max(ηt;ηt1)\xi_{t}=\max(\eta_{t};\eta_{t-1}) for t>1t>1 and ξ1=η1\xi_{1}=\eta_{1}, which lead to 0<ξt21DL20<\xi_{t}^{2}\leq\frac{1}{DL^{2}} for t1t\geq 1. Moreover, from the definition of D=max(53,6(53β)(Θ+n)n(1β))D=\max\left(\frac{5}{3},\frac{6(5-3\beta)(\Theta+n)}{n(1-\beta)}\right) in Theorem 2, we have L2ξt235L^{2}\xi_{t}^{2}\leq\frac{3}{5} and 6L2ξt2(53β)(Θ+n)n(1β)6L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+n)\leq n(1-\beta) for t1.t\geq 1.

Similar to the estimate (39), which holds for any shuffling strategy by Remark 7, we get the following result:

F(w0(2))\displaystyle F(w_{0}^{(2)}) F(w0(1))η12(1β)F(w0(1))2+η12(1β)1nI1.\displaystyle\leq F(w_{0}^{(1)})-\frac{\eta_{1}}{2}(1-\beta)\|\nabla F(w_{0}^{(1)})\|^{2}+\frac{\eta_{1}}{2}(1-\beta)\frac{1}{n}I_{1}.

Taking total expectation and applying Lemma 8 we further have:

𝔼[F(w0(2))]\displaystyle\mathbb{E}\left[F(w_{0}^{(2)})\right] 𝔼[F(w0(1))]η12(1β)𝔼[F(w0(1))2]+η12n(1β)𝔼[I1]\displaystyle\leq\mathbb{E}\left[F(w_{0}^{(1)})\right]-\frac{\eta_{1}}{2}(1-\beta)\mathbb{E}\left[\|\nabla F(w_{0}^{(1)})\|^{2}\right]+\frac{\eta_{1}}{2n}(1-\beta)\mathbb{E}[I_{1}]
(45)𝔼[F(w0(1))]η12(1β)𝔼[F(w0(1))2]+η12n(1β)L2ξ12(1β)2𝔼[I1+N1]\displaystyle\overset{\eqref{bound_I=1_RR}}{\leq}\mathbb{E}\left[F(w_{0}^{(1)})\right]-\frac{\eta_{1}}{2}(1-\beta)\mathbb{E}\left[\|\nabla F(w_{0}^{(1)})\|^{2}\right]+\frac{\eta_{1}}{2n}(1-\beta)L^{2}\xi_{1}^{2}(1-\beta)^{2}\cdot\mathbb{E}\left[I_{1}+N_{1}\right]
(48)𝔼[F(w0(1))]η12(1β)𝔼[F(w0(1))2]+η12n(1β)3L2ξ1253β2𝔼[N1].\displaystyle\overset{\eqref{bound_sum_I_N=1_RR}}{\leq}\mathbb{E}\left[F(w_{0}^{(1)})\right]-\frac{\eta_{1}}{2}(1-\beta)\mathbb{E}\left[\|\nabla F(w_{0}^{(1)})\|^{2}\right]+\frac{\eta_{1}}{2n}(1-\beta)^{3}L^{2}\xi_{1}^{2}\cdot\frac{5-3\beta}{2}\mathbb{E}[N_{1}].

Noting that w~t1=w0(t),1β<1\tilde{w}_{t-1}=w_{0}^{(t)},1-\beta<1 and Nt=(Θ+n)F(w0(t))2+σ2N_{t}=(\Theta+n)\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}+\sigma^{2}, we get

𝔼[F(w~1)]\displaystyle\mathbb{E}\left[F(\tilde{w}_{1})\right] 𝔼[F(w~0)]η12(1β)𝔼[F(w~0)2]+η1L2ξ12(1β)(53β)4n[(Θ+n)𝔼[F(w~0)2]+σ2].\displaystyle\leq\mathbb{E}\left[F(\tilde{w}_{0})\right]-\frac{\eta_{1}}{2}(1-\beta)\mathbb{E}\left[\|\nabla F(\tilde{w}_{0})\|^{2}\right]+\frac{\eta_{1}L^{2}\xi_{1}^{2}(1-\beta)(5-3\beta)}{4n}\Big{[}(\Theta+n)\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\right]+\sigma^{2}\Big{]}.

Rearranging the terms and dividing both sides by (1β)(1-\beta), we have

η12𝔼[F(w~0)2]\displaystyle\frac{\eta_{1}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{0})\|^{2}\right] 𝔼[F(w~0)F(w~1)](1β)+η1L2ξ12(53β)4n(Θ+n)𝔼[F(w~0)2]+η1σ2L2ξ12(53β)4n.\displaystyle\leq\frac{\mathbb{E}\left[F(\tilde{w}_{0})-F(\tilde{w}_{1})\right]}{(1-\beta)}+\frac{\eta_{1}L^{2}\xi_{1}^{2}\cdot(5-3\beta)}{4n}(\Theta+n)\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\right]+\frac{\eta_{1}\sigma^{2}L^{2}\xi_{1}^{2}(5-3\beta)}{4n}.

Since ξt\xi_{t} satisfies 6L2ξt2(53β)(Θ+n)n(1β)6L^{2}\xi_{t}^{2}(5-3\beta)(\Theta+n)\leq n(1-\beta) as above, we can deduce from the last estimate that

η12𝔼[F(w~0)2]\displaystyle\frac{\eta_{1}}{2}\mathbb{E}\left[\|\nabla F(\tilde{w}_{0})\|^{2}\right] 𝔼[F(w~0)F(w~1)](1β)+(1β)η14𝔼[F(w~0)2]+3σ2(53β)L22n(1β)ξ13,\displaystyle\leq\frac{\mathbb{E}\left[F(\tilde{w}_{0})-F(\tilde{w}_{1})\right]}{(1-\beta)}+\frac{(1-\beta)\eta_{1}}{4}\mathbb{E}\left[\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\right]+\frac{3\sigma^{2}(5-3\beta)L^{2}}{2n(1-\beta)}\cdot\xi_{1}^{3},

which proves (43). \square

9 The Proof of Technical Results in Section 4

This section provides the full proofs of the results in Section 4. Before proving Theorem 3, we introduce some common quantities and provide four technical lemmas.

9.1 Technical Lemmas

For the sake of our analysis, we introduce the following quantities:

hi(t):=gi(t)f(w0(t);π(i+1)),t1,i=0,,n1;\displaystyle h_{i}^{(t)}:=g_{i}^{(t)}-\nabla f(w_{0}^{(t)};\pi(i+1)),\hskip 40.40285ptt\geq 1,\ i=0,\dots,n-1; (50)
ki(t1):=gi(t1)f(w0(t);π(i+1)),t2,i=0,,n1;\displaystyle k_{i}^{(t-1)}:=g_{i}^{(t-1)}-\nabla f(w_{0}^{(t)};\pi(i+1)),\qquad t\geq 2,\ i=0,\dots,n-1; (51)
pt:=1ni=0n1mi+1(t1),t2;\displaystyle p_{t}:=\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t-1)},\qquad t\geq 2; (52)
qt:=1βni=0n1[(j=nin1βj)gi(t1)+(j=0ni1βj)gi(t)],t2.\displaystyle q_{t}:=\frac{1-\beta}{n}\sum_{i=0}^{n-1}\left[\left(\sum_{j=n-i}^{n-1}\beta^{j}\right)g_{i}^{(t-1)}+\left(\sum_{j=0}^{n-i-1}\beta^{j}\right)g_{i}^{(t)}\right],\qquad t\geq 2. (53)

The proof of Theorem 3 requires the following lemmas as intermediate steps.

Lemma 12.

Let {wi(t)}\{w_{i}^{(t)}\} be generated by Algorithm 2 with 0β<10\leq\beta<1 and ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n} for every t1t\geq 1. Suppose that Assumption 2 holds. Then we have

(a)\displaystyle(\textbf{a})\quad mi+1(t)2G2, for t1 and i=0,,n1;\displaystyle\|m_{i+1}^{(t)}\|^{2}\leq G^{2},\quad\text{ for }t\geq 1\text{ and }i=0,\dots,n-1; (54)
(b)\displaystyle(\textbf{b})\quad pt2G2, for t2andF(w0(t))2G2, for t1.\displaystyle\big{\|}p_{t}\big{\|}^{2}\leq G^{2},\text{ for }t\geq 2\quad\text{and}\quad\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}\leq G^{2},\text{ for }t\geq 1. (55)
Lemma 13.

Under the same setting as of Lemma 12 and Assumption 1(b), if ξt\xi_{t} is defined by (17), then for i=0,1,,n1i=0,1,\dots,n-1 and t2t\geq 2, it holds that

max(hi(t)2,ki(t1)2)L2G2ξt2.\displaystyle\max\Big{(}\big{\|}h_{i}^{(t)}\big{\|}^{2},\big{\|}k_{i}^{(t-1)}\big{\|}^{2}\Big{)}\leq L^{2}G^{2}\xi_{t}^{2}. (56)
Lemma 14.

Under the same setting as of Lemma 12, for i=0,1,,n1i=0,1,\dots,n-1 and t2t\geq 2, we have

mi+1(t)\displaystyle m_{i+1}^{(t)} =βnmi+1(t1)+(1β)(βn1gi+1(t1)++βi+1gn1(t1)+βig0(t)++βgi1(t)+gi(t)).\displaystyle=\beta^{n}m_{i+1}^{(t-1)}+(1-\beta)\Big{(}\beta^{n-1}g_{i+1}^{(t-1)}+\dots+\beta^{i+1}g_{n-1}^{(t-1)}+\beta^{i}g_{0}^{(t)}+\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)}.

From this expression we obtain the following sum:

1ni=0n1mi+1(t)=βnpt+qt, for t2;\displaystyle\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t)}=\beta^{n}p_{t}+q_{t},\text{ for }t\geq 2; (57)

where ptp_{t} and qtq_{t} are defined in (52) and (53), respectively.
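
Lemma 14 is a purely algebraic consequence of unrolling the momentum recursion across one epoch. The following sketch checks both the closed-form expansion and the sum identity (57) numerically, assuming the inner recursion has the form used in the expansion above (each new momentum term equals β\beta times the previous one plus 1β1-\beta times the current gradient, with the momentum carried over between epochs); the gradient vectors are random placeholders.

```python
# Numerical check of Lemma 14 and of the sum identity (57), assuming the inner
# momentum recursion  m_{i+1}^{(t)} = beta * m_i^{(t)} + (1 - beta) * g_i^{(t)}
# with  m_0^{(t)} = m_n^{(t-1)}  (momentum carried over between epochs).
# The gradients g_i^{(t)} are random placeholders.
import numpy as np

rng = np.random.default_rng(4)
n, d, beta = 6, 4, 0.8
g_prev = rng.normal(size=(n, d))     # g_0^{(t-1)}, ..., g_{n-1}^{(t-1)}
g_curr = rng.normal(size=(n, d))     # g_0^{(t)},   ..., g_{n-1}^{(t)}

m_prev = np.zeros((n + 1, d))        # m_0^{(t-1)}, ..., m_n^{(t-1)}
m_prev[0] = rng.normal(size=d)
for i in range(n):
    m_prev[i + 1] = beta * m_prev[i] + (1 - beta) * g_prev[i]

m_curr = np.zeros((n + 1, d))
m_curr[0] = m_prev[n]                # carry the momentum into epoch t
for i in range(n):
    m_curr[i + 1] = beta * m_curr[i] + (1 - beta) * g_curr[i]

# Closed form of Lemma 14 for each i = 0, ..., n-1.
for i in range(n):
    tail = sum(beta ** (n + i - k) * g_prev[k] for k in range(i + 1, n))
    head = sum(beta ** (i - k) * g_curr[k] for k in range(i + 1))
    assert np.allclose(m_curr[i + 1], beta ** n * m_prev[i + 1] + (1 - beta) * (tail + head))

# Sum identity (57): (1/n) * sum_i m_{i+1}^{(t)} = beta^n * p_t + q_t.
p_t = m_prev[1:].mean(axis=0)
q_t = (1 - beta) / n * sum(
    sum(beta ** j for j in range(n - i, n)) * g_prev[i]     # j = n-i, ..., n-1
    + sum(beta ** j for j in range(n - i)) * g_curr[i]      # j = 0, ..., n-i-1
    for i in range(n)
)
assert np.allclose(m_curr[1:].mean(axis=0), beta ** n * p_t + q_t)
```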

Lemma 15.

Under the same setting as in Theorem 3, the initial objective value F(w~1)F(\tilde{w}_{1}) is upper bounded by

F(w~1)\displaystyle F(\tilde{w}_{1}) F(w~0)+12LF(w~0)2+Lη12G2.\displaystyle\leq F(\tilde{w}_{0})+\frac{1}{2L}\|\nabla F(\tilde{w}_{0})\|^{2}+L\eta_{1}^{2}G^{2}. (58)

9.2 The Proof of Theorem 3: Key Bound for Algorithm 2

From the update wi+1(t):=wi(t)ηi(t)mi+1(t)w_{i+1}^{(t)}:=w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)} at Step 7 and Step 9 of Algorithm 2 with ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, for i=1,,ni=1,...,n, we have

wi(t)\displaystyle w_{i}^{(t)} =wi1(t)ηtnmi(t)=w0(t)ηtnj=0i1mj+1(t).\displaystyle=w_{i-1}^{(t)}-\frac{\eta_{t}}{n}m_{i}^{(t)}=w_{0}^{(t)}-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}m_{j+1}^{(t)}. (59)

Now, letting i=ni=n in the estimate and noting that w0(t+1)=wn(t)w_{0}^{(t+1)}=w_{n}^{(t)} for all t1t\geq 1, we obtain

w0(t+1)w0(t)=wn(t)w0(t)=ηtnj=0n1mj+1(t)\displaystyle w_{0}^{(t+1)}-w_{0}^{(t)}=w_{n}^{(t)}-w_{0}^{(t)}=-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}m_{j+1}^{(t)} (60)

From this update and the LL-smoothness of FF from Assumption 1(b), for t2t\geq 2, we can derive

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) (5)F(w0(t))+F(w0(t))(w0(t+1)w0(t))+L2w0(t+1)w0(t)2\displaystyle\overset{\eqref{eq:Lsmooth}}{\leq}F(w_{0}^{(t)})+\nabla F(w_{0}^{(t)})^{\top}(w_{0}^{(t+1)}-w_{0}^{(t)})+\frac{L}{2}\|w_{0}^{(t+1)}-w_{0}^{(t)}\|^{2}
=(60)F(w0(t))ηtF(w0(t))(1nj=0n1mj+1(t))+Lηt221nj=0n1mj+1(t)2\displaystyle\overset{\tiny\eqref{update_momentum_epoch_alg2}}{=}F(w_{0}^{(t)})-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}\bigg{(}\frac{1}{n}\sum_{j=0}^{n-1}m_{j+1}^{(t)}\bigg{)}+\frac{L\eta_{t}^{2}}{2}\Big{\|}\frac{1}{n}\sum_{j=0}^{n-1}m_{j+1}^{(t)}\Big{\|}^{2}
=(57)F(w0(t))ηtF(w0(t))(βnpt+qt)+Lηt22βnpt+qt2\displaystyle\overset{\tiny\eqref{eq_formula_sum}}{=}F(w_{0}^{(t)})-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}\left(\beta^{n}p_{t}+q_{t}\right)+\frac{L\eta_{t}^{2}}{2}\big{\|}\beta^{n}p_{t}+q_{t}\big{\|}^{2}
(a)F(w0(t))ηtβnF(w0(t))ptηtF(w0(t))qt+Lηt22βnpt2+Lηt22(1βn)qt2\displaystyle\overset{(a)}{\leq}F(w_{0}^{(t)})-\eta_{t}\beta^{n}\nabla F(w_{0}^{(t)})^{\top}p_{t}-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}q_{t}+\frac{L\eta_{t}^{2}}{2}\beta^{n}\|p_{t}\|^{2}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}
(b)F(w0(t))+ηtβnG2ηtF(w0(t))qt+Lηt22βnG2+Lηt22(1βn)qt2\displaystyle\overset{(b)}{\leq}F(w_{0}^{(t)})+\eta_{t}\beta^{n}G^{2}-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}q_{t}+\frac{L\eta_{t}^{2}}{2}\beta^{n}G^{2}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}
(c)F(w0(t))+2ηtβnG2ηtF(w0(t))qt+Lηt22(1βn)qt2\displaystyle\overset{(c)}{\leq}F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}q_{t}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}

where (a) comes from the convexity of the squared norm: βnpt+qt2βnpt2+(1βn)qt1βn2\|\beta^{n}p_{t}+q_{t}\|^{2}\leq\beta^{n}\|p_{t}\|^{2}+(1-\beta^{n})\big{\|}\frac{q_{t}}{1-\beta^{n}}\big{\|}^{2} for 0βn<10\leq\beta^{n}<1, (b) is from the inequalities F(w0(t))G\|\nabla F(w_{0}^{(t)})\|\leq G and ptG\|p_{t}\|\leq G noted in Lemma 12, and (c) follows from the fact that Lηt1L\eta_{t}\leq 1.
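
The convexity bound in (a) is elementary and can be checked numerically; a minimal sketch with placeholder vectors:

```python
# Check of the convexity bound used in (a): for 0 <= b < 1,
#   ||b*p + q||^2 <= b*||p||^2 + ||q||^2 / (1 - b),
# which follows from writing b*p + q = b*p + (1-b)*(q/(1-b)) and convexity of ||.||^2.
import numpy as np

rng = np.random.default_rng(5)
p, q = rng.normal(size=8), rng.normal(size=8)
b = 0.9 ** 6                                     # plays the role of beta^n
lhs = np.linalg.norm(b * p + q) ** 2
rhs = b * np.linalg.norm(p) ** 2 + np.linalg.norm(q) ** 2 / (1 - b)
assert lhs <= rhs + 1e-12
```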

Next, we focus on processing the term F(w0(t))qt\nabla F(w_{0}^{(t)})^{\top}q_{t}. We can further upper bound the above expression as

F(w0(t+1))\displaystyle F(w_{0}^{(t+1)}) F(w0(t))+2ηtβnG2ηtF(w0(t))qt+Lηt22(1βn)qt2\displaystyle\leq F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}-\eta_{t}\nabla F(w_{0}^{(t)})^{\top}q_{t}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}
=F(w0(t))+2ηtβnG2ηt1βn(1βn)F(w0(t))qt+Lηt22(1βn)qt2\displaystyle=F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}-\frac{\eta_{t}}{1-\beta^{n}}(1-\beta^{n})\nabla F(w_{0}^{(t)})^{\top}q_{t}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}
=(d)F(w0(t))+2ηtβnG2+ηt2(1βn)(1βn)F(w0(t))qt2\displaystyle\overset{(d)}{=}F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}+\frac{\eta_{t}}{2(1-\beta^{n})}\big{\|}(1-\beta^{n})\nabla F(w_{0}^{(t)})-q_{t}\big{\|}^{2}
ηt2(1βn)(1βn)F(w0(t))2ηt2(1βn)qt2+Lηt22(1βn)qt2\displaystyle\quad-\frac{\eta_{t}}{2(1-\beta^{n})}\big{\|}(1-\beta^{n})\nabla F(w_{0}^{(t)})\big{\|}^{2}-\frac{\eta_{t}}{2(1-\beta^{n})}\|q_{t}\|^{2}+\frac{L\eta_{t}^{2}}{2(1-\beta^{n})}\|q_{t}\|^{2}
(e)F(w0(t))+2ηtβnG2+ηt2(1βn)Mtηt(1βn)2F(w0(t))2,\displaystyle\overset{(e)}{\leq}F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}+\frac{\eta_{t}}{2(1-\beta^{n})}M_{t}-\frac{\eta_{t}(1-\beta^{n})}{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}, (61)

where Mt:=(1βn)F(w0(t))qt2M_{t}:=\|(1-\beta^{n})\nabla F(w_{0}^{(t)})-q_{t}\|^{2}. Here, (d) follows from the equality uv=12(u2+v2uv2)u^{\top}v=\frac{1}{2}\left(\|u\|^{2}+\|v\|^{2}-\|u-v\|^{2}\right), and (e) comes from the fact that ηt1L\eta_{t}\leq\frac{1}{L}.

Noting that 1βn=(1β)j=0n1βj1-\beta^{n}=(1-\beta)\sum_{j=0}^{n-1}\beta^{j} and F(w0(t))=1ni=0n1f(w0(t);π(i+1))\nabla F(w_{0}^{(t)})=\frac{1}{n}\sum_{i=0}^{n-1}\nabla f(w_{0}^{(t)};\pi(i+1)), we can rewrite

(1βn)F(w0(t))\displaystyle(1-\beta^{n})\nabla F(w_{0}^{(t)}) =1βni=0n1[(j=0n1βj)f(w0(t);π(i+1))].\displaystyle=\frac{1-\beta}{n}\sum_{i=0}^{n-1}\Bigg{[}\Big{(}\sum_{j=0}^{n-1}\beta^{j}\Big{)}\nabla f(w_{0}^{(t)};\pi(i+1))\Bigg{]}. (62)

Recall the definition of qtq_{t}, hi(t)h_{i}^{(t)}, and ki(t1)k_{i}^{(t-1)} from (53), (50), and (51), respectively, we can bound MtM_{t} as

Mt\displaystyle M_{t} :=qt(1βn)F(w0(t))2\displaystyle:=\big{\|}q_{t}-(1-\beta^{n})\nabla F(w_{0}^{(t)})\big{\|}^{2}
=(53),(62)1βni=0n1[(j=nin1βj)gi(t1)+(j=0ni1βj)gi(t)(j=0n1βj)f(w0(t);π(i+1))]2\displaystyle\overset{\eqref{define_q_t},\eqref{eq_rewrite_F}}{=}\left\|\frac{1-\beta}{n}\sum_{i=0}^{n-1}\left[\bigg{(}\sum_{j=n-i}^{n-1}\beta^{j}\bigg{)}g_{i}^{(t-1)}+\bigg{(}\sum_{j=0}^{n-i-1}\beta^{j}\bigg{)}g_{i}^{(t)}-\Big{(}\sum_{j=0}^{n-1}\beta^{j}\Big{)}\nabla f(w_{0}^{(t)};\pi(i+1))\right]\right\|^{2}
=(51),(50)1βni=0n1[(j=nin1βj)ki(t1)+(j=0ni1βj)hi(t)]2\displaystyle\overset{\eqref{define_k_t},\eqref{define_h_t}}{=}\left\|\frac{1-\beta}{n}\sum_{i=0}^{n-1}\left[\bigg{(}\sum_{j=n-i}^{n-1}\beta^{j}\bigg{)}k_{i}^{(t-1)}+\bigg{(}\sum_{j=0}^{n-i-1}\beta^{j}\bigg{)}h_{i}^{(t)}\right]\right\|^{2}
(1β)2ni=0n1(j=nin1βj)ki(t1)+(j=0ni1βj)hi(t)2,\displaystyle\leq\frac{(1-\beta)^{2}}{n}\sum_{i=0}^{n-1}\left\|\bigg{(}\sum_{j=n-i}^{n-1}\beta^{j}\bigg{)}k_{i}^{(t-1)}+\bigg{(}\sum_{j=0}^{n-i-1}\beta^{j}\bigg{)}h_{i}^{(t)}\right\|^{2},

where the last line comes from the Cauchy-Schwarz inequality. We normalize the last squared norms as follows:

Mt\displaystyle M_{t} (1β)2n(j=0n1βj)2i=0n1j=nin1βjj=0n1βjki(t1)+j=0ni1βjj=0n1βjhi(t)2\displaystyle\leq\frac{(1-\beta)^{2}}{n}\bigg{(}\sum_{j=0}^{n-1}\beta^{j}\bigg{)}^{2}\sum_{i=0}^{n-1}\left\|\frac{\sum_{j=n-i}^{n-1}\beta^{j}}{\sum_{j=0}^{n-1}\beta^{j}}k_{i}^{(t-1)}+\frac{\sum_{j=0}^{n-i-1}\beta^{j}}{\sum_{j=0}^{n-1}\beta^{j}}h_{i}^{(t)}\right\|^{2}
(1β)2n(j=0n1βj)2i=0n1[j=nin1βjj=0n1βjki(t1)2+j=0ni1βjj=0n1βjhi(t)2] by convexity of 2\displaystyle\leq\frac{(1-\beta)^{2}}{n}\bigg{(}\sum_{j=0}^{n-1}\beta^{j}\bigg{)}^{2}\sum_{i=0}^{n-1}\left[\frac{\sum_{j=n-i}^{n-1}\beta^{j}}{\sum_{j=0}^{n-1}\beta^{j}}\|k_{i}^{(t-1)}\|^{2}+\frac{\sum_{j=0}^{n-i-1}\beta^{j}}{\sum_{j=0}^{n-1}\beta^{j}}\|h_{i}^{(t)}\|^{2}\right]\qquad\text{ by convexity of }\|\cdot\|^{2}
(1β)2n(j=0n1βj)2i=0n1max(ki(t1)2;hi(t)2)\displaystyle\leq\frac{(1-\beta)^{2}}{n}\bigg{(}\sum_{j=0}^{n-1}\beta^{j}\bigg{)}^{2}\sum_{i=0}^{n-1}\max\Big{(}\|k_{i}^{(t-1)}\|^{2};\|h_{i}^{(t)}\|^{2}\Big{)}
=(1βn)2ni=0n1max(ki(t1)2;hi(t)2) since (1β)j=0n1βj=1βn\displaystyle=\frac{(1-\beta^{n})^{2}}{n}\sum_{i=0}^{n-1}\max\Big{(}\|k_{i}^{(t-1)}\|^{2};\|h_{i}^{(t)}\|^{2}\Big{)}\hskip 113.81102pt\text{ since }(1-\beta)\sum_{j=0}^{n-1}\beta^{j}=1-\beta^{n}
(1βn)2L2G2ξt2,\displaystyle\leq(1-\beta^{n})^{2}L^{2}G^{2}\xi_{t}^{2},

where the last estimate follows from Lemma 13 with ξt2=max(ηt2;ηt12)\xi_{t}^{2}=\max(\eta_{t}^{2};\eta_{t-1}^{2}) for t2t\geq 2. Substituting the last estimate into (61) for t2t\geq 2, we obtain

F(w0(t+1))F(w0(t))+2ηtβnG2+ηt(1βn)2L2G2ξt2ηt(1βn)2F(w0(t))2.\displaystyle F(w_{0}^{(t+1)})\leq F(w_{0}^{(t)})+2\eta_{t}\beta^{n}G^{2}+\frac{\eta_{t}(1-\beta^{n})}{2}L^{2}G^{2}\xi_{t}^{2}-\frac{\eta_{t}(1-\beta^{n})}{2}\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}.

Since w~t1=w0(t)\tilde{w}_{t-1}=w_{0}^{(t)} and w~t=w0(t+1)\tilde{w}_{t}=w_{0}^{(t+1)}, this inequality leads to

ηtF(w~t1)22[F(w~t1)F(w~t)](1βn)+4βnG21βnηt+L2G2ξt3,t2.\displaystyle\eta_{t}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}\leq\frac{2[F(\tilde{w}_{t-1})-F(\tilde{w}_{t})]}{(1-\beta^{n})}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}\eta_{t}+L^{2}G^{2}\xi_{t}^{3},\ \ t\geq 2.

Now, using the fact that ξ1=η1\xi_{1}=\eta_{1}, we can derive the following statement for t=1t=1:

η1F(w~0)2η1F(w~0)21βn+4βnG21βnη1+L2G2ξ13.\displaystyle\eta_{1}\big{\|}\nabla F(\tilde{w}_{0})\big{\|}^{2}\leq\frac{\eta_{1}\|\nabla F(\tilde{w}_{0})\|^{2}}{1-\beta^{n}}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}\eta_{1}+L^{2}G^{2}\xi_{1}^{3}.

Summing the previous statement for t=2,3,,Tt=2,3,\dots,T, and using the last one, we can deduce that

t=1T(ηtF(w~t1)2)\displaystyle\sum_{t=1}^{T}\left(\eta_{t}\big{\|}\nabla F(\tilde{w}_{t-1})\big{\|}^{2}\right) 2[F(w~1)F]+η1F(w~0)2(1βn)+4βnG21βnt=1Tηt+L2G2t=1Tξt3\displaystyle\leq\frac{2[F(\tilde{w}_{1})-F_{*}]+\eta_{1}\|\nabla F(\tilde{w}_{0})\|^{2}}{(1-\beta^{n})}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}\sum_{t=1}^{T}\eta_{t}+L^{2}G^{2}\sum_{t=1}^{T}\xi_{t}^{3}
(58)2[F(w~0)F]+(1L+η1)F(w~0)2+2Lη12G2(1βn)+4βnG21βnt=1Tηt+L2G2t=1Tξt3.\displaystyle\overset{\eqref{bound_F(w)_1_tilde_2}}{\leq}\frac{2[F(\tilde{w}_{0})-F_{*}]+\left(\frac{1}{L}+\eta_{1}\right)\|\nabla F(\tilde{w}_{0})\|^{2}+2L\eta_{1}^{2}G^{2}}{(1-\beta^{n})}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}\sum_{t=1}^{T}\eta_{t}+L^{2}G^{2}\sum_{t=1}^{T}\xi_{t}^{3}.

Finally, dividing both sides of this estimate by t=1Tηt\sum_{t=1}^{T}\eta_{t}, we obtain (10). \square

9.3 The Proof of Corollaries 7 and 8: Constant and Diminishing Learning Rates

The proof of Corollary 7.

First, from β=(νT2/3)1/n\beta=\left(\frac{\nu}{T^{2/3}}\right)^{1/n} we have βn=νT2/3\beta^{n}=\frac{\nu}{T^{2/3}}. Since β(R1R)1/n\beta\leq\left(\frac{R-1}{R}\right)^{1/n} for some R1R\geq 1, we have 11βnR\frac{1}{1-\beta^{n}}\leq R. We also have ηt:=γT1/3\eta_{t}:=\frac{\gamma}{T^{1/3}} and ηt1L\eta_{t}\leq\frac{1}{L} for all t1t\geq 1. Moreover, t=1Tηt=γT2/3\sum_{t=1}^{T}\eta_{t}=\gamma T^{2/3} and t=1Tξt3=γ3\sum_{t=1}^{T}\xi_{t}^{3}=\gamma^{3}. Substituting these expressions into (10), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} Δ1γ(1βn)T2/3+L2G2γ2T2/3+4βnG21βn\displaystyle\leq\displaystyle\frac{\Delta_{1}}{\gamma(1-\beta^{n})T^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}}{T^{2/3}}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}
Δ1RγT2/3+L2G2γ2T2/3+4βnG2R\displaystyle\leq\frac{\Delta_{1}R}{\gamma T^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}}{T^{2/3}}+4\beta^{n}G^{2}R
=Δ1RγT2/3+L2G2γ2T2/3+4νG2RT2/3\displaystyle=\displaystyle\frac{\Delta_{1}R}{\gamma T^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}}{T^{2/3}}+\frac{4\nu G^{2}R}{T^{2/3}}
RγΔ1+L2G2γ2+4νG2RT2/3.\displaystyle\leq\frac{\frac{R}{\gamma}\Delta_{1}+L^{2}G^{2}\gamma^{2}+4\nu G^{2}R}{T^{2/3}}.

Here, we have Δ1=2[F(w~0)F]+(1L+γT1/3)F(w~0)2+2LG2γ2T2/3\Delta_{1}=2[F(\tilde{w}_{0})-F_{*}]+\left(\frac{1}{L}+\frac{\gamma}{T^{1/3}}\right)\|\nabla F(\tilde{w}_{0})\|^{2}+\frac{2LG^{2}\gamma^{2}}{T^{2/3}} as defined in Theorem 3. Substituting this expression into the last inequality, we obtain the main inequality in Corollary 7. ∎

The proof of Corollary 8.

We have ηt=γ(t+λ)1/3\eta_{t}=\frac{\gamma}{(t+\lambda)^{1/3}} and ηt1L\eta_{t}\leq\frac{1}{L} for all t1t\geq 1. Moreover, we have

t=1Tηt=γt=1T1(t+λ)1/3γ1Tdτ(τ+λ)1/3γ[(T+λ)2/3(1+λ)2/3],t=1Tξt3=2η13+γ3t=3T1t1+λ2γ31+λ+γ3t=2Tdττ1+λγ3[21+λ+log(T1+λ)].\begin{array}[]{lll}&\sum_{t=1}^{T}\eta_{t}&=\gamma\sum_{t=1}^{T}\frac{1}{(t+\lambda)^{1/3}}\geq\gamma\int_{1}^{T}\frac{d\tau}{(\tau+\lambda)^{1/3}}\geq\gamma\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right],\\ &\sum_{t=1}^{T}\xi_{t}^{3}&=2\eta_{1}^{3}+\gamma^{3}\sum_{t=3}^{T}\frac{1}{t-1+\lambda}\leq\frac{2\gamma^{3}}{1+\lambda}+\gamma^{3}\int_{t=2}^{T}\frac{d\tau}{\tau-1+\lambda}\leq\gamma^{3}\left[\frac{2}{1+\lambda}+\log(T-1+\lambda)\right].\end{array}

Substituting these expressions with β=(νT2/3)1/n\beta=\left(\frac{\nu}{T^{2/3}}\right)^{1/n} into (10), we obtain

𝔼[F(w^T)2]\displaystyle\mathbb{E}\big{[}\|\nabla F(\hat{w}_{T})\|^{2}\big{]} Δ1γ[(T+λ)2/3(1+λ)2/3](1βn)+L2G2γ2(21+λ+log(T1+λ))(T+λ)2/3(1+λ)2/3+4βnG21βn\displaystyle\leq\displaystyle\frac{\Delta_{1}}{\gamma\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right](1-\beta^{n})}+\frac{L^{2}G^{2}\gamma^{2}\left(\frac{2}{1+\lambda}+\log(T-1+\lambda)\right)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{4\beta^{n}G^{2}}{1-\beta^{n}}
Δ1Rγ[(T+λ)2/3(1+λ)2/3]+L2G2γ2(21+λ+log(T1+λ))(T+λ)2/3(1+λ)2/3+4βnG2R\displaystyle\leq\frac{\Delta_{1}R}{\gamma\left[(T+\lambda)^{2/3}-(1+\lambda)^{2/3}\right]}+\frac{L^{2}G^{2}\gamma^{2}\left(\frac{2}{1+\lambda}+\log(T-1+\lambda)\right)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+4\beta^{n}G^{2}R
=RγΔ1(T+λ)2/3(1+λ)2/3+L2G2γ2(21+λ+log(T1+λ))(T+λ)2/3(1+λ)2/3+4νG2RT2/3\displaystyle=\frac{\frac{R}{\gamma}\Delta_{1}}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}\left(\frac{2}{1+\lambda}+\log(T-1+\lambda)\right)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{4\nu G^{2}R}{T^{2/3}}
=RγΔ1+21+λL2G2γ2(T+λ)2/3(1+λ)2/3+L2G2γ2log(T1+λ)(T+λ)2/3(1+λ)2/3+4νG2RT2/3.\displaystyle=\frac{\frac{R}{\gamma}\Delta_{1}+\frac{2}{1+\lambda}L^{2}G^{2}\gamma^{2}}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{L^{2}G^{2}\gamma^{2}\log(T-1+\lambda)}{(T+\lambda)^{2/3}-(1+\lambda)^{2/3}}+\frac{4\nu G^{2}R}{T^{2/3}}.

Substituting Δ1\Delta_{1} from Theorem 3 and η1=γ(1+λ)1/3\eta_{1}=\frac{\gamma}{(1+\lambda)^{1/3}} into the last estimate, we obtain the inequality in Corollary 8. ∎
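
As a side remark (not part of the proof), the two bounds on \sum_{t=1}^{T}\eta_{t} and \sum_{t=1}^{T}\xi_{t}^{3} used above are easy to check numerically. A minimal sketch in Python, where \gamma, \lambda, and T are arbitrary illustrative values (with \lambda\geq 0), is:

import math

# Numerical sanity check of the two bounds used in the proof of Corollary 8.
gamma, lam, T = 0.5, 8.0, 1000
eta = [gamma / (t + lam) ** (1.0 / 3.0) for t in range(1, T + 1)]      # eta_t for t = 1,...,T
xi = [eta[0]] + [max(eta[t], eta[t - 1]) for t in range(1, T)]         # xi_1 = eta_1, xi_t = max(eta_t, eta_{t-1})

sum_eta_lower = gamma * ((T + lam) ** (2.0 / 3.0) - (1 + lam) ** (2.0 / 3.0))
sum_xi3_upper = gamma ** 3 * (2.0 / (1 + lam) + math.log(T - 1 + lam))

assert sum(eta) >= sum_eta_lower                  # lower bound on the sum of step sizes
assert sum(x ** 3 for x in xi) <= sum_xi3_upper   # upper bound on the sum of xi_t^3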

9.4 The Proof of Technical Lemmas

We now provide the full proofs of the technical lemmas used in the proof of Theorem 3 above.

9.4.1 The Proof of Lemma 12

We prove the first estimate of Lemma 12 by induction.

  • For t=1t=1, it is obvious that m0(1)2=m~02=0G2\|m_{0}^{(1)}\|^{2}=\|\tilde{m}_{0}\|^{2}=0\leq G^{2}.

  • For t=1, assume that (54) holds for some index i\geq 0, that is, \|m_{i}^{(1)}\|^{2}\leq G^{2}. From the update m_{i+1}^{(1)}=\beta m_{i}^{(1)}+(1-\beta)g_{i}^{(1)}, by our induction hypothesis, convexity of \|\cdot\|^{2}, 0\leq\beta<1, and Assumption 2, we have

    mi+1(1)2βmi(1)2+(1β)gi(1)2βG2+(1β)G2=G2.\big{\|}m_{i+1}^{(1)}\big{\|}^{2}\leq\beta\big{\|}m_{i}^{(1)}\big{\|}^{2}+(1-\beta)\big{\|}g_{i}^{(1)}\big{\|}^{2}\leq\beta G^{2}+(1-\beta)G^{2}=G^{2}.

    Consequently, for t=1, we have \|m_{i+1}^{(1)}\|^{2}\leq G^{2} for all i=0,\dots,n-1.

  • Now, we prove (54) for t>1. Assume that (54) holds for epoch t-1, that is, \|m_{i+1}^{(t-1)}\|^{2}\leq G^{2} for every i=0,\dots,n-1. Since m_{0}^{(t)}:=m_{n}^{(t-1)}, we have \|m_{0}^{(t)}\|^{2}=\|\tilde{m}_{t-1}\|^{2}=\|m_{n}^{(t-1)}\|^{2}\leq G^{2}. Repeating the induction argument from the case t=1, we also get \|m_{i+1}^{(t)}\|^{2}\leq G^{2} for all i=0,\dots,n-1.

Combining all the cases above, we have shown that (54) holds for all i=0,,n1i=0,\dots,n-1 and t=1,,Tt=1,\dots,T.

Now using the Cauchy-Schwarz inequality, we get

pt2=1ni=0n1mi+1(t1)21ni=0n1mi+1(t1)2G2, for t2.\displaystyle\big{\|}p_{t}\big{\|}^{2}=\left\|\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t-1)}\right\|^{2}\leq\frac{1}{n}\sum_{i=0}^{n-1}\big{\|}m_{i+1}^{(t-1)}\big{\|}^{2}\leq G^{2},\text{ for }t\geq 2.

Similarly, for t1t\geq 1, we also have

F(w0(t))2=1ni=0n1f(w0(t);π(i+1))21ni=0n1f(w0(t);π(i+1))2G2,\displaystyle\big{\|}\nabla F(w_{0}^{(t)})\big{\|}^{2}=\left\|\frac{1}{n}\sum_{i=0}^{n-1}\nabla f(w_{0}^{(t)};\pi(i+1))\right\|^{2}\leq\frac{1}{n}\sum_{i=0}^{n-1}\big{\|}\nabla f(w_{0}^{(t)};\pi(i+1))\big{\|}^{2}\leq G^{2},

which proves the second estimate (55) of Lemma 12. \square
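
As an aside, the invariant of Lemma 12 can also be illustrated numerically: starting from a zero momentum term, the update m\leftarrow\beta m+(1-\beta)g never leaves the ball of radius G as long as every \|g\|\leq G. A minimal sketch (not part of the proof; d, G, \beta, and the number of steps below are arbitrary illustrative values) is:

import numpy as np

# Numerical illustration of the bounded-momentum invariant in Lemma 12.
rng = np.random.default_rng(0)
d, G, beta, steps = 10, 2.0, 0.5, 500
m = np.zeros(d)                                   # m_0 = 0
for _ in range(steps):
    g = rng.standard_normal(d)
    g *= rng.uniform(0.0, G) / np.linalg.norm(g)  # an arbitrary vector with ||g|| <= G
    m = beta * m + (1 - beta) * g                 # momentum update
    assert np.linalg.norm(m) <= G + 1e-12         # ||m|| never exceeds G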

9.4.2 The Proof of Lemma 13

Using the LL-smoothness assumption of f(;i)f(\cdot;i), we derive the following for t2t\geq 2:

hi(t)2=(50)f(wi(t);π(i+1))f(w0(t);π(i+1))2L2wi(t)w0(t)2,\displaystyle\big{\|}h_{i}^{(t)}\big{\|}^{2}\overset{\eqref{define_h_t}}{=}\big{\|}\nabla f(w_{i}^{(t)};\pi(i+1))-\nabla f(w_{0}^{(t)};\pi(i+1))\big{\|}^{2}\leq L^{2}\big{\|}w_{i}^{(t)}-w_{0}^{(t)}\big{\|}^{2},
ki(t1)2=(51)f(wi(t1);π(i+1))f(w0(t);π(i+1))2L2wi(t1)w0(t)2.\displaystyle\big{\|}k_{i}^{(t-1)}\big{\|}^{2}\overset{\eqref{define_k_t}}{=}\big{\|}\nabla f(w_{i}^{(t-1)};\pi(i+1))-\nabla f(w_{0}^{(t)};\pi(i+1))\big{\|}^{2}\leq L^{2}\big{\|}w_{i}^{(t-1)}-w_{0}^{(t)}\big{\|}^{2}.

Now from the update wi+1(t):=wi(t)ηi(t)mi+1(t)w_{i+1}^{(t)}:=w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)} in Algorithm 2 with ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n}, for i=1,,n1i=1,\dots,n-1 and t2t\geq 2, we have

wi(t)=wi1(t)ηtnmi(t)=w0(t)ηtnj=0i1mj+1(t),\displaystyle w_{i}^{(t)}=w_{i-1}^{(t)}-\frac{\eta_{t}}{n}m_{i}^{(t)}=w_{0}^{(t)}-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}m_{j+1}^{(t)},
w0(t)=wn(t1)=wn1(t1)ηt1nmn(t1)=wi(t1)ηt1nj=in1mj+1(t1).\displaystyle w_{0}^{(t)}=w_{n}^{(t-1)}=w_{n-1}^{(t-1)}-\frac{\eta_{t-1}}{n}m_{n}^{(t-1)}=w_{i}^{(t-1)}-\frac{\eta_{t-1}}{n}\sum_{j=i}^{n-1}m_{j+1}^{(t-1)}.

Therefore, for i=0,,n1i=0,\dots,n-1 and t2t\geq 2, using these expressions and (54), we can bound

wi(t)w0(t)2=ηt21nj=0i1mj+1(t)2(54)ηt2G2,\displaystyle\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}=\eta_{t}^{2}\Big{\|}\frac{1}{n}\sum_{j=0}^{i-1}m_{j+1}^{(t)}\Big{\|}^{2}\overset{\tiny\eqref{eq_bounded_momentum}}{\leq}\eta_{t}^{2}G^{2},
wi(t1)w0(t)2=ηt121nj=in1mj+1(t1)2(54)ηt12G2.\displaystyle\|w_{i}^{(t-1)}-w_{0}^{(t)}\|^{2}=\eta_{t-1}^{2}\Big{\|}\frac{1}{n}\sum_{j=i}^{n-1}m_{j+1}^{(t-1)}\Big{\|}^{2}\overset{\tiny\eqref{eq_bounded_momentum}}{\leq}\eta_{t-1}^{2}G^{2}.

Substituting these inequalities into the first two expressions, for i=0,,n1i=0,\dots,n-1 and t2t\geq 2, we get

max(hi(t)2,ki(t1)2)\displaystyle\max\Big{(}\big{\|}h_{i}^{(t)}\big{\|}^{2},\big{\|}k_{i}^{(t-1)}\big{\|}^{2}\Big{)} L2max(ηt2G2;ηt12G2)L2G2max(ηt2;ηt12)=L2G2max(ηt;ηt1)2=L2G2ξt2,\displaystyle\leq L^{2}\max\Big{(}\eta_{t}^{2}G^{2};\eta_{t-1}^{2}G^{2}\Big{)}\leq L^{2}G^{2}\max\big{(}\eta_{t}^{2};\eta_{t-1}^{2}\big{)}=L^{2}G^{2}\max(\eta_{t};\eta_{t-1})^{2}=L^{2}G^{2}\xi_{t}^{2},

which proves (56); here, the last two equalities follow from the fact that \eta_{t}>0 and \eta_{t-1}>0. \square

9.4.3 The Proof of Lemma 14

The first statement follows directly from the momentum update rule, i.e.:

mi+1(t)\displaystyle m_{i+1}^{(t)} =βmi(t)+(1β)gi(t)=β2mi1(t)+(1β)βgi1(t)+(1β)gi(t)\displaystyle=\beta m_{i}^{(t)}+(1-\beta)g_{i}^{(t)}=\beta^{2}m_{i-1}^{(t)}+(1-\beta)\beta g_{i-1}^{(t)}+(1-\beta)g_{i}^{(t)}
=βi+1m0(t)+(1β)(βig0(t)++βgi1(t)+gi(t))\displaystyle=\beta^{i+1}m_{0}^{(t)}+(1-\beta)\Big{(}\beta^{i}g_{0}^{(t)}+\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)}
=(a)βi+1mn(t1)+(1β)(βig0(t)++βgi1(t)+gi(t))\displaystyle\overset{(a)}{=}\beta^{i+1}m_{n}^{(t-1)}+(1-\beta)\Big{(}\beta^{i}g_{0}^{(t)}+\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)}
=βnmi+1(t1)+(1β)(βn1gi+1(t1)++βi+1gn1(t1)+βig0(t)++βgi1(t)+gi(t)),\displaystyle=\beta^{n}m_{i+1}^{(t-1)}+(1-\beta)\Big{(}\beta^{n-1}g_{i+1}^{(t-1)}+\dots+\beta^{i+1}g_{n-1}^{(t-1)}+\beta^{i}g_{0}^{(t)}+\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)},

where (a) follows from the rule m0(t)=m~t1=mn(t1)m_{0}^{(t)}=\tilde{m}_{t-1}=m_{n}^{(t-1)}.

Next, summing up this expression from i:=0i:=0 to i:=n1i:=n-1 and averaging, we get

1ni=0n1mi+1(t)=βnpt+qt,\displaystyle\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t)}=\beta^{n}p_{t}+q_{t}, (63)

where

pt:=1ni=0n1mi+1(t1)andqt:=1βni=0n1(βn1gi+1(t1)++βi+1gn1(t1)+βig0(t)++gi(t)).\displaystyle p_{t}:=\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(t-1)}\quad\text{and}\quad q_{t}:=\frac{1-\beta}{n}\sum_{i=0}^{n-1}\Big{(}\beta^{n-1}g_{i+1}^{(t-1)}+\dots+\beta^{i+1}g_{n-1}^{(t-1)}+\beta^{i}g_{0}^{(t)}+\dots+g_{i}^{(t)}\Big{)}.

Now we consider the term qtq_{t} in the last expression:

qt\displaystyle q_{t} =1βni=0n1(βn1gi+1(t1)++βi+1gn1(t1)+βig0(t)++βgi1(t)+gi(t))\displaystyle=\frac{1-\beta}{n}\sum_{i=0}^{n-1}\Big{(}\beta^{n-1}g_{i+1}^{(t-1)}+\dots+\beta^{i+1}g_{n-1}^{(t-1)}+\beta^{i}g_{0}^{(t)}+\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)}
=1βn(βn1g1(t1)+βn2g2(t1)+β1gn1(t1)+g0(t))+\displaystyle=\ \frac{1-\beta}{n}\Big{(}\beta^{n-1}g_{1}^{(t-1)}+\beta^{n-2}g_{2}^{(t-1)}\dots+\beta^{1}g_{n-1}^{(t-1)}+g_{0}^{(t)}\Big{)}+\dots
+1βn(βn1gi+1(t1)+βn2gi+2(t1)++βi+1gn1(t1)+βig0(t)+βgi1(t)+gi(t))+\displaystyle\quad+\frac{1-\beta}{n}\Big{(}\beta^{n-1}g_{i+1}^{(t-1)}+\beta^{n-2}g_{i+2}^{(t-1)}+\dots+\beta^{i+1}g_{n-1}^{(t-1)}+\beta^{i}g_{0}^{(t)}\dots+\beta g_{i-1}^{(t)}+g_{i}^{(t)}\Big{)}+\dots
+1βn(βn1g0(t)+βn2g1(t)+βgn2(t)+gn1(t)).\displaystyle\quad+\frac{1-\beta}{n}\Big{(}\beta^{n-1}g_{0}^{(t)}+\beta^{n-2}g_{1}^{(t)}\dots+\beta g_{n-2}^{(t)}+g_{n-1}^{(t)}\Big{)}.

Reordering the terms gi(t)g_{i}^{(t)} and gi(t1)g_{i}^{(t-1)} and noting that j=nn1βj=0\sum_{j=n}^{n-1}\beta^{j}=0 by convention, we get

qt\displaystyle q_{t} =1βni=1n1[(j=nin1βj)gi(t1)]+1βni=0n1[(j=0ni1βj)gi(t)]\displaystyle=\frac{1-\beta}{n}\sum_{i=1}^{n-1}\left[\bigg{(}\sum_{j=n-i}^{n-1}\beta^{j}\bigg{)}g_{i}^{(t-1)}\right]+\frac{1-\beta}{n}\sum_{i=0}^{n-1}\left[\bigg{(}\sum_{j=0}^{n-i-1}\beta^{j}\bigg{)}g_{i}^{(t)}\right]
=1βni=0n1[(j=nin1βj)gi(t1)+(j=0ni1βj)gi(t)].\displaystyle=\frac{1-\beta}{n}\sum_{i=0}^{n-1}\left[\bigg{(}\sum_{j=n-i}^{n-1}\beta^{j}\bigg{)}g_{i}^{(t-1)}+\Bigg{(}\sum_{j=0}^{n-i-1}\beta^{j}\Bigg{)}g_{i}^{(t)}\right].

Substituting this expression of qtq_{t} into (63), we obtain Equation (57) of Lemma 14. \square
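
As a sanity check (not part of the proof), the identity (57) can be verified numerically by running the momentum recursion over two consecutive epochs with arbitrary vectors in place of the gradients. A minimal sketch, where n, d, \beta, and the random vectors are arbitrary illustrative values, is:

import numpy as np

# Numerical sanity check of identity (57): (1/n) sum_i m_{i+1}^{(t)} = beta^n p_t + q_t.
rng = np.random.default_rng(1)
n, d, beta = 5, 3, 0.5
g_prev = rng.standard_normal((n, d))      # stands in for g_i^{(t-1)}, i = 0,...,n-1
g_curr = rng.standard_normal((n, d))      # stands in for g_i^{(t)},   i = 0,...,n-1

m_prev = [rng.standard_normal(d)]         # m_0^{(t-1)} (arbitrary starting momentum)
for i in range(n):
    m_prev.append(beta * m_prev[-1] + (1 - beta) * g_prev[i])   # m_{i+1}^{(t-1)}
m_curr = [m_prev[-1]]                                           # m_0^{(t)} = m_n^{(t-1)}
for i in range(n):
    m_curr.append(beta * m_curr[-1] + (1 - beta) * g_curr[i])   # m_{i+1}^{(t)}

lhs = sum(m_curr[1:]) / n
p_t = sum(m_prev[1:]) / n
q_t = (1 - beta) / n * sum(
    sum(beta ** j for j in range(n - i, n)) * g_prev[i]         # coefficient of g_i^{(t-1)}
    + sum(beta ** j for j in range(n - i)) * g_curr[i]          # coefficient of g_i^{(t)}
    for i in range(n)
)
assert np.allclose(lhs, beta ** n * p_t + q_t)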

9.4.4 The Proof of Lemma 15: Upper Bounding The Initial Objective Value

Using (54), we have

w0(2)w0(1)2=wn(1)w0(1)2=η121ni=0n1mi+1(1)2(54)η12G2.\displaystyle\|w_{0}^{(2)}-w_{0}^{(1)}\|^{2}=\|w_{n}^{(1)}-w_{0}^{(1)}\|^{2}=\eta_{1}^{2}\Big{\|}\frac{1}{n}\sum_{i=0}^{n-1}m_{i+1}^{(1)}\Big{\|}^{2}\overset{\eqref{eq_bounded_momentum}}{\leq}\eta_{1}^{2}G^{2}. (64)

Since FF is LL-smooth, we can derive

F(w0(2))\displaystyle F(w_{0}^{(2)}) (5)F(w0(1))+F(w0(1))(w0(2)w0(1))+L2w0(2)w0(1)2\displaystyle\overset{\eqref{eq:Lsmooth}}{\leq}F(w_{0}^{(1)})+\nabla F(w_{0}^{(1)})^{\top}(w_{0}^{(2)}-w_{0}^{(1)})+\frac{L}{2}\|w_{0}^{(2)}-w_{0}^{(1)}\|^{2}
(a)F(w0(1))+12LF(w0(1))2+Lw0(2)w0(1)2,\displaystyle\overset{(a)}{\leq}F(w_{0}^{(1)})+\frac{1}{2L}\|\nabla F(w_{0}^{(1)})\|^{2}+L\|w_{0}^{(2)}-w_{0}^{(1)}\|^{2},
(64)F(w0(1))+12LF(w0(1))2+Lη12G2,\displaystyle\overset{\eqref{eq_lem_bound_first_01}}{\leq}F(w_{0}^{(1)})+\frac{1}{2L}\|\nabla F(w_{0}^{(1)})\|^{2}+L\eta_{1}^{2}G^{2},

where (a) follows from Young's inequality, u^{\top}v\leq\frac{1}{2L}\|u\|^{2}+\frac{L}{2}\|v\|^{2}, applied with u:=\nabla F(w_{0}^{(1)}) and v:=w_{0}^{(2)}-w_{0}^{(1)}.

Substituting w0(2):=w~1w_{0}^{(2)}:=\tilde{w}_{1} and w0(1):=w~0w_{0}^{(1)}:=\tilde{w}_{0} into this estimate, we obtain (58). \square
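
As an aside, the final bound of Lemma 15 is easy to illustrate numerically on the L-smooth quadratic F(w)=\frac{L}{2}\|w\|^{2}. The sketch below (not part of the proof) uses arbitrary values of L, \eta_{1}, and G and a displacement satisfying (64):

import numpy as np

# Numerical illustration of the bound F(w2) <= F(w1) + (1/2L)||grad F(w1)||^2 + L*eta1^2*G^2.
rng = np.random.default_rng(2)
L, eta1, G, d = 2.0, 0.1, 3.0, 5

def F(w):
    return 0.5 * L * np.dot(w, w)        # an L-smooth quadratic

def gradF(w):
    return L * w

w1 = rng.standard_normal(d)              # plays the role of w_0^{(1)}
step = rng.standard_normal(d)
step *= eta1 * G / np.linalg.norm(step)  # ||w_0^{(2)} - w_0^{(1)}|| <= eta1 * G, as in (64)
w2 = w1 + step                           # plays the role of w_0^{(2)}

assert F(w2) <= F(w1) + 0.5 / L * np.linalg.norm(gradF(w1)) ** 2 + L * eta1 ** 2 * G ** 2 + 1e-9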

10 Detailed Implementation and Additional Experiments

In this supplementary document, we explain the detailed hyper-parameter tuning strategy used in Section 5 and provide additional experiments for our proposed methods.

10.1 Comparing SMG with Other Methods

We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). For a fair comparison, a random reshuffling strategy is applied to all methods. We tune each algorithm with a constant learning rate using a two-stage (coarse/fine) grid search and select the hyper-parameters that perform best; a code sketch of this search is given after the list below. We report additional results on the squared norm of the gradient for this experiment in Figure 7. The hyper-parameter tuning strategy for each method is as follows:

  • For SGD, the first (coarse) searching grid is \{0.1,0.01,0.001\}, and the fine grid is \{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} if 0.1 is the best value obtained in the first stage.

  • For our new SMG algorithm, we fix the parameter \beta:=0.5 since it usually performs best in our experiments. We let the coarse searching grid be \{1,0.1,0.01\} and the fine grid be \{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} if 0.1 is the best value from the first stage. Note that our SMG algorithm may work with a larger learning rate than the traditional SGD algorithm.

  • For SGD-M, we update the weights using the following rule:

    mi+1(t):=βmi(t)+gi(t)\displaystyle m_{i+1}^{(t)}:=\beta m_{i}^{(t)}+g_{i}^{(t)}
    wi+1(t):=wi(t)ηi(t)mi+1(t),\displaystyle w_{i+1}^{(t)}:=w_{i}^{(t)}-\eta_{i}^{(t)}m_{i+1}^{(t)},

    where gi(t)g_{i}^{(t)} is the (i+1)(i+1)-th gradient at epoch tt. Note that this momentum update is implemented in PyTorch with the default value β=0.9\beta=0.9. Hence, we choose this setting for SGD-M, and we tune the learning rate using the two searching grids {0.1,0.01,0.001}\{0.1,0.01,0.001\} and {0.5,0.4,0.2,0.1,0.08,0.06,0.05}\{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} as in the SGD algorithm.

  • For Adam, we fix the two hyper-parameters \beta_{1}:=0.9 and \beta_{2}:=0.999 as in the original paper. Since the default learning rate for Adam is 0.001, we let the coarse searching grid be \{0.01,0.001,0.0001\} and the fine grid be \{0.002,0.001,0.0005\} if 0.001 performs best in the first stage. We note that since the best learning rate for Adam is usually 0.001, its hyper-parameter tuning requires less effort than for the other algorithms in our experiments.
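
The following is a minimal sketch of the two-stage (coarse/fine) grid search described above. The routine train_and_evaluate is hypothetical: it stands for any procedure that trains the model with a given learning rate and returns the quantity we select on (e.g., the final training loss).

def two_stage_grid_search(train_and_evaluate, coarse_grid, fine_grid_for):
    # Stage 1: pick the best learning rate on the coarse grid.
    best_coarse = min(coarse_grid, key=train_and_evaluate)
    # Stage 2: refine around the best coarse value; fall back to it if no fine grid is given.
    fine_grid = fine_grid_for.get(best_coarse, [best_coarse])
    return min(fine_grid, key=train_and_evaluate)

# Example with the grids used for SGD in our experiments:
# best_lr = two_stage_grid_search(
#     train_and_evaluate,
#     coarse_grid=[0.1, 0.01, 0.001],
#     fine_grid_for={0.1: [0.5, 0.4, 0.2, 0.1, 0.08, 0.06, 0.05]},
# )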

Figure 7: The squared norm of gradient produced by SMG, SGD, SGD-M, and Adam for different datasets.

10.2 The Choice of Hyper-parameter β\beta

As presented in Section 5, in this experiment we investigate the sensitivity of our proposed SMG algorithm to the hyper-parameter \beta. We choose a constant learning rate for each dataset and run the algorithm with different values of \beta in the linear grid \{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}. However, choices of \beta\geq 0.6 do not lead to good performance, and we therefore omit them from our results. Figure 8 compares \|\nabla F(w)\|^{2} across different values of \beta on the FashionMNIST, CIFAR-10, w8a, and ijcnn1 datasets. Our empirical study shows that values of \beta in [0.1,0.5] work reasonably well, and \beta:=0.5 seems to be the best choice in our tests.

Figure 8: The squared norm of gradient reported by SMG with different \beta on the FashionMNIST, CIFAR-10, w8a, and ijcnn1 datasets.

10.3 Different Learning Rate Schemes

As presented in Section 5, in this last experiment we examine the performance of our SMG method under four different learning rate schemes; a sketch of these schedules is given after the list below. We choose the hyper-parameter \beta=0.5 since it tends to give the best results in our tests. The additional results on the squared gradient norm are reported in Figure 9.

  • Constant learning rate: As in Subsection 10.1, we set the coarse searching grid to \{1,0.1,0.01\} and the fine grid to \{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} if 0.1 is the best value from the first stage.

  • Cosine annealing learning rate: For a fixed number of epochs T, the cosine scheme \eta_{t}=\eta(1+\cos(t\pi/T)) requires one hyper-parameter \eta. We choose a coarse grid \{1,0.1,0.01,0.001\} and a fine grid \{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} if 0.1 is the best value for \eta in the first stage.

  • Diminishing learning rate: The diminishing scheme \eta_{t}=\frac{\gamma}{(t+\lambda)^{1/3}} requires two hyper-parameters \gamma and \lambda. We first let the searching grid for \lambda be \{1,2,4,8\}. For \gamma, we set its searching grid so that the initial value \eta_{1} lies in the coarse grid \{1,0.1,0.01\}, and then in a fine grid centered at the best value from the coarse stage.

  • Exponential learning rate: The exponential scheme \eta_{t}=\eta\alpha^{t} requires two hyper-parameters \eta and \alpha. We first let the searching grid for the decay rate \alpha be \{0.99,0.995,0.999\}. We then set the searching grid for \eta as in the diminishing case (so that the initial learning rate \eta_{1} lies in the coarse grid \{1,0.1,0.01\}, followed by a second grid centered at the best value from the first stage).
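
Below is a minimal sketch of the four learning rate schemes above for epochs t=1,\dots,T. The hyper-parameters \eta, \gamma, \lambda, and \alpha are the ones tuned by the grid searches described in this list; the values in the commented example are placeholders only.

import math

def constant_lr(t, T, eta):
    return eta                                    # eta_t = eta

def cosine_annealing_lr(t, T, eta):
    return eta * (1.0 + math.cos(t * math.pi / T))  # eta_t = eta * (1 + cos(t*pi/T))

def diminishing_lr(t, T, gamma, lam):
    return gamma / (t + lam) ** (1.0 / 3.0)       # eta_t = gamma / (t + lambda)^{1/3}

def exponential_lr(t, T, eta, alpha):
    return eta * alpha ** t                       # eta_t = eta * alpha^t

# Example: the learning rate at epoch t = 10 out of T = 200 under each scheme (placeholder values).
# constant_lr(10, 200, 0.1), cosine_annealing_lr(10, 200, 0.1),
# diminishing_lr(10, 200, 0.5, 8), exponential_lr(10, 200, 0.1, 0.99)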

Figure 9: The squared norm of gradient produced by SMG under four different learning rate schemes on the FashionMNIST, CIFAR-10, w8a, and ijcnn1 datasets.

10.4 Additional Experiment with Single Shuffling Scheme

Algorithm 2 (Single Shuffling Momentum Gradient, SSMG) uses the single shuffling strategy, which covers the incremental gradient method as a special case. Therefore, in this experiment, we compare our proposed methods (SMG and SSMG) with SGD, SGD with Momentum (SGD-M), and Adam, all using the single shuffling scheme.

Figure 10: The train loss produced by SMG, SSMG, SGD, SGD-M, and Adam on the four different datasets: Fashion-MNIST, CIFAR-10, w8a, and ijcnn1.

Figure 11: The squared norm of gradient produced by SMG, SSMG, SGD, SGD-M, and Adam on the four different datasets: Fashion-MNIST, CIFAR-10, w8a, and ijcnn1.

For the SSMG algorithm, the hyper-parameter \beta is chosen from the grid \{0.1,0.5,0.9\}. For the learning rate, we let the coarse searching grid be \{0.1,0.01,0.001\} and the fine grid be \{0.5,0.4,0.2,0.1,0.08,0.06,0.05\} if 0.1 is the best value from the first stage. For the other methods, the hyper-parameter tuning strategy follows the settings in Subsection 10.1.

The train loss F(w) and the squared norm of gradient \|\nabla F(w)\|^{2} of each method are reported in Figures 10 and 11. We observe that SSMG does not perform as well as SMG, SGD, and SGD-M on the first two datasets, but it performs reasonably well on the last two (binary classification) datasets.