
On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

Lam M. Nguyen
IBM Research
Thomas J. Watson Research Center
Yorktown Heights, NY, USA
[email protected]
Trang H. Tran
School of ORIE
Cornell University
Ithaca, NY, USA
[email protected]
Equal contribution. Corresponding author: Lam M. Nguyen
Abstract

The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than the previous literature. Nevertheless, we maintain the desired computational complexity that shuffling SGD has achieved in the general convex setting.

1 Introduction

In the last decade, neural network-based models have shown great success in many machine learning applications such as natural language processing (Collobert and Weston, 2008; Goldberg et al., 2018), computer vision and pattern recognition (Goodfellow et al., 2014; He and Sun, 2015). The training task of many learning models boils down to the following finite-sum minimization problem:

\min_{w\in\mathbb{R}^{d}}\left\{F(w):=\frac{1}{n}\sum_{i=1}^{n}f(w;i)\right\}, (1)

where $f(\cdot;i):\mathbb{R}^{d}\to\mathbb{R}$ is smooth and possibly non-convex for $i\in[n]:=\{1,\dots,n\}$. Solving the empirical risk minimization (1) has long been a difficult task due to the non-convexity and the complicated learning models. Later progress with stochastic gradient descent (SGD) and its variants (Robbins and Monro, 1951; Duchi et al., 2011; Kingma and Ba, 2014) has shown great performance in training deep neural networks. These stochastic first-order methods are favorable thanks to their scalability and efficiency in dealing with large-scale problems. At each iteration, SGD samples an index $i$ uniformly from the set $\{1,\dots,n\}$ and uses the individual gradient $\nabla f(\cdot;i)$ to update the weight.

While there has been much attention on the theoretical aspect of the traditional i.i.d. (independent and identically distributed) version of SGD (Nemirovski et al., 2009; Ghadimi and Lan, 2013; Bottou et al., 2018), practical heuristics often use without-replacement data sampling schemes. Also known as shuffling sampling schemes, these methods generate some random or deterministic permutation of the index set $\{1,2,\dots,n\}$ and apply gradient updates in that permuted order. Intuitively, a collection of such $n$ individual updates is a pass over all the data, or an epoch. One may create a new random permutation at the beginning of each epoch (the Random Reshuffling scheme), use one random permutation for every epoch (the Single Shuffling scheme), or use an Incremental Gradient scheme with a fixed deterministic order of indices. In this paper, we use the term unified shuffling SGD for the SGD method using any data permutations, which includes the three special schemes described above.
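As a concrete illustration (ours, not part of the original exposition), the following Python sketch shows how the three permutation rules could be generated with NumPy; the function names are our own.

import numpy as np

rng = np.random.default_rng(0)
n = 10

# Random Reshuffling: a fresh permutation at the start of every epoch.
def random_reshuffling(epoch):
    return rng.permutation(n)

# Single Shuffling: one permutation drawn once and reused in every epoch.
_single_perm = rng.permutation(n)
def single_shuffling(epoch):
    return _single_perm

# Incremental Gradient: a fixed deterministic order for every epoch.
def incremental_gradient(epoch):
    return np.arange(n)

for t in range(3):
    print(random_reshuffling(t), single_shuffling(t), incremental_gradient(t))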

Although shuffling sampling schemes usually show better empirical performance than SGD (Bottou, 2009), the theoretical guarantees for these schemes are often more limited than for the vanilla SGD version, due to the lack of statistical independence. Recent works have shown improvements in computational complexity for shuffling schemes over SGD in various settings (Gürbüzbalaban et al., 2019; Haochen and Sra, 2019; Safran and Shamir, 2020; Nagaraj et al., 2019; Nguyen et al., 2021; Mishchenko et al., 2020; Ahn et al., 2020). In particular, in a general non-convex setting, shuffling sampling schemes improve the computational complexity in terms of $\hat{\varepsilon}$ from $\mathcal{O}\left(\sigma^{2}/\hat{\varepsilon}^{2}\right)$ for SGD to $\mathcal{O}\left(n\sigma/\hat{\varepsilon}^{3/2}\right)$, where $\sigma$ is the bounded-variance constant (Ghadimi and Lan, 2013; Nguyen et al., 2021; Mishchenko et al., 2020).\footnote{The computational complexity is the number of (individual) gradient computations needed to reach an $\hat{\varepsilon}$-accurate stationary point, i.e., a point $\hat{w}\in\mathbb{R}^{d}$ that satisfies $\|\nabla F(\hat{w})\|^{2}\leq\hat{\varepsilon}$.} We summarize the detailed literature for multiple settings later in Table 1.

While global convergence is a desirable property for neural network training, the non-convex landscape of complex learning models makes it difficult to find a global minimizer. In addition, there is little to no work studying the convergence to a global solution of shuffling-type SGD algorithms in a general non-convex setting. The closest line of research investigates the Polyak-Lojasiewicz (PL) condition (a generalization of strong convexity), which yields convergence rates similar to the strongly convex rates for shuffling SGD methods (Haochen and Sra, 2019; Ahn et al., 2020; Nguyen et al., 2021). In another direction, Gower et al. (2021) and Khaled and Richtárik (2020) investigate the global convergence for some classes of non-convex functions, but only for the vanilla SGD method. Beznosikov and Takáč (2021) investigate a random-shuffled version of variance reduction methods (e.g. the SARAH algorithm, Nguyen et al. (2017)), but this approach can only show convergence to stationary points. With a focus on shuffling SGD methods and specific learning architectures, we arrive at the central question of this paper:

How can we establish the convergence to a global solution for a class of non-convex functions using shuffling-type SGD algorithms? Can we exploit the structure of neural networks to achieve this goal?

We answer this question affirmatively, and our contributions are summarized below.

Contributions.

  • We investigate a new framework for the convergence of a shuffling-type gradient algorithm to a global solution. We consider a relaxed set of assumptions and discuss their relations with previous settings. We show that our average PL inequality (Assumption 3) holds for a wide range of neural networks equipped with a squared loss function.

  • Our analysis uses a generalized function class that we call star-$M$-smooth-convex. This class contains non-convex functions and is more general than the class of star-convex smooth functions with respect to the minimizer (in the over-parameterized setting). In addition, our analysis does not use any bounded-gradient or bounded-weight assumptions.

  • We show a total complexity of $\mathcal{O}(\frac{n}{\hat{\varepsilon}^{3/2}})$ for a class of non-convex functions to reach an $\hat{\varepsilon}$-accurate global solution. This matches the gradient complexity to a stationary point known for unified shuffling methods in non-convex settings; however, we are able to show convergence to a global minimizer.

1.1 Related Work

In recent years, there have been different approaches to investigating global convergence in machine learning optimization. A popular line of research studies specific neural networks and exploits their architectures. The earliest works show the global convergence of Gradient Descent (GD) for simple linear networks and two-layer networks (Brutzkus et al., 2018; Soudry et al., 2018; Arora et al., 2019; Du et al., 2019b). These results were further extended to deep learning architectures (Allen-Zhu et al., 2019; Du et al., 2019a; Zou and Gu, 2019). This line of research continues with the Stochastic Gradient Descent (SGD) algorithm, proving the global convergence of SGD for deep neural networks with some probability depending on the initialization process and the number of input data (Brutzkus et al., 2018; Allen-Zhu et al., 2019; Zou et al., 2018; Zou and Gu, 2019). The common theme in most of these references is the over-parameterized setting, in which the number of parameters in the network is excessively large (Brutzkus et al., 2018; Soudry et al., 2018; Allen-Zhu et al., 2019; Du et al., 2019a; Zou and Gu, 2019). This setting is closely related to ours, and we discuss it throughout the paper.

Polyak-Lojasiewicz (PL) condition and related assumptions. An alternative approach is to investigate conditions on the optimization problem that guarantee global convergence. A popular assumption is the Polyak-Lojasiewicz (PL) inequality, a generalization of strong convexity (Polyak, 1964; Karimi et al., 2016; Nesterov and Polyak, 2006). Under the PL assumption, it can be shown that (stochastic) gradient descent achieves the same theoretical rate as in the strongly convex setting (i.e., linear convergence for GD and sublinear convergence for SGD) (Karimi et al., 2016; De et al., 2017; Gower et al., 2021). Recent works demonstrate similar results for shuffling-type SGD (Haochen and Sra, 2019; Ahn et al., 2020; Nguyen et al., 2021), both for unified and randomized shuffling schemes. On the other hand, (Schmidt and Roux, 2013; Vaswani et al., 2019) propose a new assumption called the Strong Growth Condition (SGC) that controls the rate at which the stochastic gradients decay compared to the full gradient. This condition implies that the stochastic gradients and their variances converge to zero at the optimal solution (Schmidt and Roux, 2013; Vaswani et al., 2019). While the PL condition for $F$ implies that every stationary point of $F$ is also a global solution, the SGC implies that such a point is also a stationary point of every individual function. However, complicated models such as deep feed-forward neural networks generally have non-optimal stationary points (Karimi et al., 2016). Thus, these assumptions are somewhat strong for non-convex settings.

Although plenty of works investigate the PL condition for the objective function $F$ (De et al., 2017; Vaswani et al., 2019; Gower et al., 2021), few are devoted to studying the PL inequality for the individual functions $f(\cdot;i)$. A recent work (Sankararaman et al., 2020) analyzes SGD through the notion of gradient confusion in over-parameterized settings where the individual functions satisfy the PL condition. It shows that the neighborhood in which SGD converges linearly depends on the level of gradient confusion (i.e., how much the individual gradients are negatively correlated). Taking a different approach, we investigate the PL property for individual functions and further show that our condition holds for a general class of neural networks with quadratic loss.

Over-parameterized settings for neural networks. Most modern learning architectures are deep and large networks, where the number of parameters is often far larger than the number of input data. As a result, the objective loss can be trained closer and closer to zero. In such settings, all the individual functions $f(\cdot;i)$ are minimized simultaneously at 0 and share a common minimizer. This condition is called the interpolation property (see e.g. (Schmidt and Roux, 2013; Ma et al., 2018; Meng et al., 2020; Loizou et al., 2021)) and is well studied in the literature (see e.g. (Zhou et al., 2019; Gower et al., 2021)). For comparison, functions satisfying the strong growth condition necessarily satisfy the interpolation property. This property implies zero variance of the individual gradients at the global minimizer, which allows good behavior of SGD near the solution. In this work, we slightly relax this assumption and only require a small variance up to some threshold level $\varepsilon$. Note that when $\varepsilon\to 0$, our assumption exactly recovers the interpolation property.

Star-convexity and related conditions. There has been much attention on a class of structured non-convex functions called star-convex (Nesterov and Polyak, 2006; Lee and Valiant, 2016; Bjorck et al., 2021). Star-convexity can be understood as convexity along the segment between an arbitrary point $w$ and the global minimizer $w_{*}$. The name star-convex comes from the fact that each sublevel set is star-shaped (Nesterov and Polyak, 2006; Lee and Valiant, 2016). Zhou et al. (2019) show that if SGD follows a star-convex path and there exists a common global minimizer for all component functions, then SGD converges to a global minimum.

In recent progress, Hinder et al. (2020) consider the class of quasar-convex functions, which further generalizes star-convexity. This property was introduced originally in (Hardt et al., 2018) under the name 'weakly quasi-convex' and has been investigated recently in the literature (Hinder et al., 2020; Jin, 2020; Gower et al., 2021). This class uses a parameter $\zeta\in(0,1]$ to control the non-convexity of the function, where $\zeta=1$ yields star-convexity and $\zeta$ approaching 0 indicates more non-convexity (Hinder et al., 2020). Intuitively, quasar-convex functions are unimodal on all lines that pass through a global minimizer. Gower et al. (2021) investigate the performance of SGD for smooth and quasar-convex functions using an expected residual assumption (which is comparable to the interpolation property). They show a convergence rate of $\mathcal{O}(1/\sqrt{K})$ for i.i.d. sampling SGD, where $K$ is the total number of iterations, which translates to a computational complexity of $\mathcal{O}\left(1/\hat{\varepsilon}^{2}\right)$. To the best of our knowledge, this paper is the first work studying a relaxation of star-convexity and the global convergence of SGD with shuffling sampling schemes, rather than the i.i.d. version.

2 Theoretical Setting

We first present the shuffling-type gradient algorithm below. Our convergence results hold for any permutation of the training data $\{1,2,\dots,n\}$, including deterministic and random ones. Thus, our theoretical framework is general and applicable to any shuffling strategy, including Incremental Gradient, Single Shuffling, and Random Reshuffling.

Algorithm 1 (Shuffling-Type Gradient Algorithm for Solving (1))
1:  Initialization: Choose an initial point $\tilde{w}_{0}\in\mathrm{dom}(F)$.
2:  for $t=1,2,\dots,T$ do
3:     Set $w_{0}^{(t)}:=\tilde{w}_{t-1}$;
4:     Generate any permutation $\pi^{(t)}$ of $[n]$ (either deterministic or random);
5:     for $i=1,\dots,n$ do
6:        Update $w_{i}^{(t)}:=w_{i-1}^{(t)}-\eta_{i}^{(t)}\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))$;
7:     end for
8:     Set $\tilde{w}_{t}:=w_{n}^{(t)}$;
9:  end for
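As an illustrative sketch (ours, not the paper's official code), the following Python/NumPy program implements Algorithm 1 on a simple least-squares instance of (1), where $f(w;i)=\frac{1}{2}(x_{i}^{\top}w-y_{i})^{2}$; the constant step size and the toy data are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)               # realizable data, so F_* = 0

def grad_f(w, i):
    # Gradient of the component loss f(w; i) = 0.5 * (x_i^T w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

def shuffling_sgd(w0, T, eta, scheme="rr"):
    w_tilde = w0.copy()
    for t in range(T):
        w = w_tilde.copy()                   # w_0^{(t)} := tilde{w}_{t-1}
        if scheme == "rr":                   # Random Reshuffling
            perm = rng.permutation(n)
        else:                                # Incremental Gradient
            perm = np.arange(n)
        for i in perm:                       # one epoch = one pass over the data
            w -= eta * grad_f(w, i)          # shuffled gradient update
        w_tilde = w                          # tilde{w}_t := w_n^{(t)}
    return w_tilde

w = shuffling_sgd(np.zeros(d), T=200, eta=0.05 / n)
print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))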

We further specify the choice of learning rate $\eta_{i}^{(t)}$ in the detailed analysis. Now we proceed to describe the set of assumptions used in our paper.

Assumption 1.

Suppose that $f_{i}^{*}:=\min_{w\in\mathbb{R}^{d}}f(w;i)>-\infty$ for all $i\in\{1,\dots,n\}$.

Assumption 2.

Suppose that $f(\cdot;i)$ is $L$-smooth for all $i\in\{1,\dots,n\}$, i.e. there exists a constant $L\in(0,+\infty)$ such that:

\|\nabla f(w;i)-\nabla f(w^{\prime};i)\|\leq L\|w-w^{\prime}\|,\quad\forall w,w^{\prime}\in\mathbb{R}^{d}. (2)

Assumption 1 is required by any algorithm to guarantee the well-definedness of (1). In most applications, the component losses are bounded from below. By Assumption 2, the objective function $F$ is also $L$-smooth. This Lipschitz-smoothness assumption is widely used for gradient-type methods. In addition, we denote the minimum value of the objective function by $F_{*}=\min_{w\in\mathbb{R}^{d}}F(w)$. It is worth noting the following relationship between $F_{*}$ and the component minimum values:

F_{*}=\min_{w\in\mathbb{R}^{d}}F(w)=\frac{1}{n}\min_{w\in\mathbb{R}^{d}}\left(\sum_{i=1}^{n}f(w;i)\right)\geq\frac{1}{n}\sum_{i=1}^{n}\min_{w\in\mathbb{R}^{d}}f(w;i)=\frac{1}{n}\sum_{i=1}^{n}f_{i}^{*}. (3)

We are interested in the case where the set of minimizers of $F$ is not empty. The equality $F_{*}=\frac{1}{n}\sum_{i=1}^{n}f_{i}^{*}$ holds if and only if a minimizer of $F$ is also a common minimizer of all component functions. This condition implies that the variance of the individual gradients is 0 at the common minimizer.

2.1 PL Condition for Component Functions

Now we are ready to discuss the Polyak-Lojasiewicz condition as follows.

Definition 1 (Polyak-Lojasiewicz condition).

We say that $f$ satisfies the Polyak-Lojasiewicz (PL) inequality for some constant $\mu>0$ if

\|\nabla f(w)\|^{2}\geq 2\mu[f(w)-f^{*}],\quad\forall w\in\mathbb{R}^{d}, (4)

where $f^{*}:=\min_{w\in\mathbb{R}^{d}}f(w)$.
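For intuition, consider a standard illustration borrowed from Karimi et al. (2016): the one-dimensional function

f(x)=x^{2}+3\sin^{2}(x)

is non-convex, since $f''(x)=2+6\cos(2x)$ takes negative values, yet it satisfies the PL inequality (4) with $f^{*}=0$ and the constant $\mu=1/32$ reported in that reference.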

The PL condition for the objective function $F$ is sufficient to show global convergence for (stochastic) gradient descent (Karimi et al., 2016; Nesterov and Polyak, 2006; Polyak, 1964). It is well known that a function satisfying the PL condition is not necessarily convex (Karimi et al., 2016), as the example above illustrates. However, this assumption on $F$ is somewhat strong because it implies that every stationary point of $F$ is also a global minimizer. Our goal is to consider a class of non-convex functions that is more relaxed than the PL condition on $F$, while still enjoying good global convergence properties. In this paper, we formulate an assumption called the "average PL inequality", specifically for the finite-sum setting:

Assumption 3.

Suppose that the functions $f(\cdot;i)$ satisfy the average PL inequality for some constant $\mu>0$, that is,

\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w;i)\|^{2}\geq 2\mu\frac{1}{n}\sum_{i=1}^{n}[f(w;i)-f_{i}^{*}],\quad\forall w\in\mathbb{R}^{d}, (5)

where $f_{i}^{*}:=\min_{w\in\mathbb{R}^{d}}f(w;i)$.

Comparisons. Assumption 3 is weaker than assuming the PL inequality for every component function $f(\cdot;i)$. In the general setting, Assumption 3 is not comparable to assuming the PL inequality for $F$. Formally, if $F$ satisfies the PL condition for some parameter $\tau>0$, then we have:

2\tau[F(w)-F_{*}]\leq\|\nabla F(w)\|^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w;i)\|^{2}. (6)

However, by equation (3) we have $F(w)-F_{*}\leq\frac{1}{n}\sum_{i=1}^{n}[f(w;i)-f_{i}^{*}]$. Therefore, the average PL inequality for the functions $f(\cdot;i)$ cannot directly imply the PL condition on $F$, and vice versa.

In the interpolated setting, where there is a common minimizer for all component functions $f(\cdot;i)$, it can be shown that the PL condition on $F$ is stronger than our average PL assumption:

2\tau\frac{1}{n}\sum_{i=1}^{n}[f(w;i)-f_{i}^{*}]=2\tau[F(w)-F_{*}]\leq\|\nabla F(w)\|^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w;i)\|^{2}.

On the other hand, our assumption cannot imply the PL inequality on $F$ unless we impose a strong relationship that upper bounds the average of individual squared gradients $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w;i)\|^{2}$ in terms of the full squared gradient $\|\nabla F(w)\|^{2}$ for every $w\in\mathbb{R}^{d}$. For these reasons, the average PL Assumption 3 is arguably more reasonable than assuming the PL inequality for the objective function $F$. Moreover, we show that Assumption 3 holds for a general class of neural networks with a final bias layer and a squared loss function. We have the following theorem.

Theorem 1.

Let $\{(x^{(i)},y^{(i)})\}_{i=1}^{n}$ be a training data set, where $x^{(i)}\in\mathbb{R}^{m}$ is the input data and $y^{(i)}\in\mathbb{R}^{c}$ is the output data for $i=1,\dots,n$. We consider an architecture $h(w;i)$, where $w$ is the vectorized weight and $h$ consists of a final bias layer $b$:

h(w;i)=W^{T}z(\theta;i)+b,

where $w=\textbf{\text{vec}}(\{\theta,W,b\})$ and $z(\theta;i)$ is some inner architecture, which can be chosen arbitrarily. Next, we consider the squared loss $f(w;i)=\frac{1}{2}\|h(w;i)-y^{(i)}\|^{2}$. Then

\|\nabla f(w;i)\|^{2}\geq 2[f(w;i)-f_{i}^{*}],\quad\forall w\in\mathbb{R}^{d}, (7)

where $f_{i}^{*}:=\min_{w\in\mathbb{R}^{d}}f(w;i)$.

Therefore, for this application, Assumption 3 holds with $\mu=1$.
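As a quick numerical sanity check of inequality (7) (our own sketch, not part of the formal argument), the following PyTorch snippet builds a small network with an arbitrary inner architecture $z(\theta;i)$, a final bias layer, and the squared loss, and verifies that $\|\nabla f(w;i)\|^{2}\geq 2f(w;i)\geq 2[f(w;i)-f_{i}^{*}]$ at random weights (here $f_{i}^{*}=0$ because the free bias can fit any single label).

import torch

torch.manual_seed(0)
m, hidden, c = 10, 16, 3                        # input, hidden, output sizes

# Inner architecture z(theta; i): an arbitrary two-layer feature map.
inner = torch.nn.Sequential(torch.nn.Linear(m, hidden), torch.nn.Tanh())
W = torch.nn.Parameter(torch.randn(hidden, c))
b = torch.nn.Parameter(torch.randn(c))          # final bias layer
params = list(inner.parameters()) + [W, b]

x_i, y_i = torch.randn(m), torch.randn(c)       # one training example (x^(i), y^(i))

h = inner(x_i) @ W + b                          # h(w; i) = W^T z(theta; i) + b
f = 0.5 * torch.sum((h - y_i) ** 2)             # squared loss f(w; i)

grads = torch.autograd.grad(f, params)
grad_sq_norm = sum(g.pow(2).sum() for g in grads)

# The bias gradient alone equals h - y_i, so ||grad f||^2 >= ||h - y_i||^2 = 2 f(w; i).
print(float(grad_sq_norm), ">=", float(2 * f))
assert grad_sq_norm >= 2 * f - 1e-6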

2.2 Small Variance at Global Solutions

In this section, we relax the interpolation property from previous literature (Ma et al., 2018; Meng et al., 2020; Loizou et al., 2021) to a small threshold. Let $\mathcal{W}_{*}$ denote the set of global solutions $w_{*}$ of $F$, and define

\sigma_{*}^{2}:=\inf_{w_{*}\in\mathcal{W}_{*}}\left(\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f(w_{*};i)\right\|^{2}\right). (8)

We can show that when there is a common minimizer for all component functions (i.e., when the equality $F_{*}=\frac{1}{n}\sum_{i=1}^{n}f_{i}^{*}$ holds), the best variance $\sigma_{*}^{2}$ is 0. It is sufficient for our theorem to impose an $\mathcal{O}(\varepsilon)$-level upper bound on the variance $\sigma_{*}^{2}$:

Assumption 4.

Suppose that the best variance at $w_{*}$ is small, that is, for $\varepsilon>0$,

\sigma_{*}^{2}\leq P\varepsilon, (9)

for some $P>0$.
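In practice, one can check Assumption 4 empirically at a trained weight; the short sketch below (ours, with a hypothetical per-sample gradient routine grad_f such as the one in the Algorithm 1 sketch above) simply averages the squared norms of the individual gradients.

import numpy as np

def empirical_sigma_star_sq(w_star, grad_f, n):
    # Estimate sigma_*^2 = (1/n) * sum_i ||grad f(w_*; i)||^2 at a candidate solution w_*.
    return np.mean([np.sum(grad_f(w_star, i) ** 2) for i in range(n)])

# Usage sketch: sigma_sq = empirical_sigma_star_sq(w, grad_f, n)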

It is important to note that, in the current non-convex literature, Assumption 4 alone (or assuming $\sigma_{*}^{2}=0$ alone) is not sufficient to guarantee a global convergence property for SGD. Typically, some other conditions on the landscape of the loss function are needed to complement the over-parameterized setting. This motivates our next assumption.

2.3 Generalized Star-Smooth-Convex Condition for Shuffling Type Algorithm

We introduce the definition of a star-smooth-convex function as follows.

Definition 2.

The function $g$ is star-$M$-smooth-convex with respect to a reference point $\hat{w}\in\mathbb{R}^{d}$ if

\|\nabla g(w)-\nabla g(\hat{w})\|^{2}\leq M\langle\nabla g(w)-\nabla g(\hat{w}),w-\hat{w}\rangle,\quad\forall w\in\mathbb{R}^{d}. (10)

It is well known that when $g$ is $L$-smooth and convex (Nesterov, 2004), we have the following general inequality for every $w,w^{\prime}\in\mathbb{R}^{d}$:

\|\nabla g(w)-\nabla g(w^{\prime})\|^{2}\leq L\langle\nabla g(w)-\nabla g(w^{\prime}),w-w^{\prime}\rangle. (11)

Our class of star-smooth-convex functions requires a similar inequality to hold only for the special point $w^{\prime}=\hat{w}$. Interestingly, this is related to the class of star-convex functions, which satisfy the convexity inequality at the minimizer $\hat{w}$:

\text{(star-convexity w.r.t. }\hat{w})\quad g(w)-g(\hat{w})\leq\langle\nabla g(w),w-\hat{w}\rangle,\ \forall w\in\mathbb{R}^{d}. (12)

This class of functions contains non-convex objective losses and is well studied in the literature (see e.g. (Zhou et al., 2019)). Our Lemma 1 shows that the class of star-smooth-convex functions is broader than the class of $L$-smooth and star-convex functions. Therefore, our problem of interest is non-convex in general.

Lemma 1.

The function $g$ is star-$M$-smooth-convex with respect to $\hat{w}$ for some constant $M>0$ if one of the two following conditions holds:

  1. $g$ is $L$-smooth and convex.

  2. $g$ is $L$-smooth and star-convex with respect to $\hat{w}$.

Proof.

The first statement of Lemma 1 follows directly from inequality (11): $g$ is star-$M$-smooth-convex with respect to any reference point, with $M=L$.

Now we proceed to the second statement. From the star-convexity of $g$ with respect to $\hat{w}$, we have

g(w)-g(\hat{w})\leq\langle\nabla g(w),w-\hat{w}\rangle,\ \forall w\in\mathbb{R}^{d},

and $\nabla g(\hat{w})=0$ since $\hat{w}$ is the global minimizer of $g$. On the other hand, $g$ is $L$-smooth and we have

g(\hat{w})\leq g\left(w-\frac{1}{L}\nabla g(w)\right)\leq g(w)-\frac{1}{2L}\|\nabla g(w)\|^{2},

which is equivalent to $\|\nabla g(w)\|^{2}\leq 2L[g(w)-g(\hat{w})]$. Since $\nabla g(\hat{w})=0$, we have for all $w\in\mathbb{R}^{d}$

\|\nabla g(w)-\nabla g(\hat{w})\|^{2}\leq 2L[g(w)-g(\hat{w})]\overset{(12)}{\leq}2L\langle\nabla g(w)-\nabla g(\hat{w}),w-\hat{w}\rangle.

Hence $g$ is star-$M$-smooth-convex as in Definition 2 with $M=2L$. ∎

For the analysis of the shuffling-type algorithm in this paper, we consider the following generalized star-smooth-convex condition for shuffling algorithms:

Assumption 5.

Using Algorithm 1, let us assume that there exist some constants $M>0$ and $N\geq 0$ such that at each epoch $t=1,\dots,T$, we have for $i=1,\dots,n$:

\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))-\nabla f(w_{*};\pi^{(t)}(i))\|^{2}\leq M\langle\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))-\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle+N\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}, (13)

where $w_{*}$ is a global solution of $F$.

We note that it is sufficient for our analysis to assume the case $N=0$, i.e. each individual function $f(\cdot;i)$ is star-$M$-smooth-convex with respect to $w_{*}$ for every $i=1,\dots,n$ as in Definition 2. In that case, the assumption does not depend on the algorithm's progress.

Assumption 5 is more flexible than the star-$M$-smooth-convex condition in (10): the additional term $N\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}$, for some constant $N>0$, accommodates the case where the right-hand-side term $\langle\nabla f(w;i)-\nabla f(w_{*};i),w-w_{*}\rangle$ could be negative for some $w\in\mathbb{R}^{d}$.

Note that we do not impose any assumptions on bounded weights or bounded gradients. Therefore, the term $\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}$ cannot be uniformly bounded by any universal constant.

3 New Framework for Convergence to a Global Solution

In this section, we present our theoretical results. Our Lemma 2 first provides a recursion to bound the squared distance term $\|\tilde{w}_{t}-w_{*}\|^{2}$:

Lemma 2.

Assume that Assumptions 1, 2, 3, and 5 hold. Let $\{\tilde{w}_{t}\}_{t=1}^{T}$ be the sequence generated by Algorithm 1 with $0<\eta_{t}\leq\min\left\{\frac{n}{2M},\frac{1}{2L}\right\}$. For every $\gamma>0$ we have

\|\tilde{w}_{t}-w_{*}\|^{2}\leq\left(1+C_{1}\eta_{t}^{3}\right)\|\tilde{w}_{t-1}-w_{*}\|^{2}+C_{2}\eta_{t}\sigma_{*}^{2}-C_{3}\eta_{t}[F(\tilde{w}_{t-1})-F_{*}], (14)

where $w_{*}$ is a global solution of $F$, $F_{*}=\min_{w\in\mathbb{R}^{d}}F(w)$, and

\begin{cases}C_{1}=\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4\gamma L^{4}}{6M},\\ C_{2}=\frac{2}{M}+1+\frac{5}{6L^{2}}+\frac{8N}{3ML^{2}}+\frac{5\gamma}{12M},\\ C_{3}=\frac{\gamma}{\gamma+1}\cdot\frac{\mu}{M}.\end{cases} (15)

Rearranging the results of Lemma 2, we have

F(\tilde{w}_{t-1})-F_{*}\leq\frac{1}{C_{3}}\left(\frac{1}{\eta_{t}}+C_{1}\eta_{t}^{2}\right)\|\tilde{w}_{t-1}-w_{*}\|^{2}-\frac{1}{C_{3}\eta_{t}}\|\tilde{w}_{t}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}. (16)

Therefore, with an appropriate choice of learning rate that guarantees $\left(1/\eta_{t}+C_{1}\eta_{t}^{2}\right)\leq 1/\eta_{t-1}$, we can unroll the recursion from Lemma 2. This yields our main result in the next theorem.

Theorem 2.

Assume that Assumptions 1, 2, 3, and 5 hold. Let $\{\tilde{w}_{t}\}_{t=1}^{T}$ be the sequence generated by Algorithm 1 with the learning rate $\eta_{i}^{(t)}=\frac{\eta_{t}}{n}$, where $0<\eta_{t}\leq\min\left\{\frac{n}{2M},\frac{1}{2L}\right\}$.

Let the number of iterations be $T=\frac{\lambda}{\varepsilon^{3/2}}$ for some $\lambda>0$ and $\varepsilon>0$. The constants $C_{1}$, $C_{2}$, and $C_{3}$ are defined in (15) for any $\gamma>0$. We further define $K=1+C_{1}D^{3}\varepsilon^{3/2}$ and specify the learning rate $\eta_{t}=K\eta_{t-1}=K^{t}\eta_{0}$ with $\eta_{0}=\frac{D\sqrt{\varepsilon}}{K\exp(\lambda C_{1}D^{3})}$ such that $\frac{D\sqrt{\varepsilon}}{K}\leq\min\left\{\frac{n}{2M},\frac{1}{2L}\right\}$, for some constant $D>0$. Then we have

\frac{1}{T}\sum_{t=1}^{T}[F(\tilde{w}_{t-1})-F_{*}]\leq\frac{K\exp(\lambda C_{1}D^{3})}{C_{3}D\lambda}\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}, (17)

where $F_{*}=\min_{w\in\mathbb{R}^{d}}F(w)$ and $\sigma_{*}^{2}$ is defined in (8).

Our analysis holds for arbitrary constant values of the parameters $\gamma,\lambda$, and $D$. In addition, our analysis covers every shuffling scheme. An interesting research question is whether the convergence results can be improved by analyzing a randomized shuffling scheme within this framework; we leave that question to future work.
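To make the learning-rate schedule of Theorem 2 concrete, the following sketch (with illustrative constants of our own choosing) computes $K$, $\eta_{0}$, the epoch-level rates $\eta_{t}=K^{t}\eta_{0}$, and the per-iteration rates $\eta_{i}^{(t)}=\eta_{t}/n$.

import math

# Illustrative constants (assumptions for this sketch only).
n, L, M, C1 = 1000, 10.0, 20.0, 50.0
eps, lam, D = 1e-2, 1.0, 0.1

T = math.ceil(lam / eps ** 1.5)                      # number of epochs
K = 1.0 + C1 * D ** 3 * eps ** 1.5                   # growth factor of the schedule
eta0 = D * math.sqrt(eps) / (K * math.exp(lam * C1 * D ** 3))
assert D * math.sqrt(eps) / K <= min(n / (2 * M), 1 / (2 * L))

etas = [K ** t * eta0 for t in range(1, T + 1)]      # epoch-level rates eta_t
step_sizes = [eta / n for eta in etas]               # per-iteration rates eta_i^(t) = eta_t / n
print(f"T = {T}, eta_1 = {etas[0]:.3e}, eta_T = {etas[-1]:.3e}")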

Using Assumption 4, we can show the total complexity of Algorithm 1 for our setting.

Corollary 1.

Suppose that the conditions in Theorem 2 and Assumption 4 hold. Choose $C_{1}D\lambda=1$ and $\varepsilon=\hat{\varepsilon}/G$ such that $0<\hat{\varepsilon}\leq G$, with the constants

G=\frac{2C_{1}D^{2}e}{C_{3}}\|\tilde{w}_{0}-w_{*}\|^{2}+\frac{C_{2}P}{C_{3}},\text{ where }
\begin{cases}C_{1}=\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4L^{2}}{3M},\\ C_{2}=\frac{2}{M}+1+\frac{5}{6L^{2}}+\frac{8N}{3ML^{2}}+\frac{5}{12ML},\\ C_{3}=\frac{1}{L^{2}+1}\cdot\frac{\mu}{M}.\end{cases}

Then we need $T=\frac{\lambda G^{3/2}}{\hat{\varepsilon}^{3/2}}$ epochs to guarantee

\min_{1\leq t\leq T}[F(\tilde{w}_{t-1})-F_{*}]\leq\frac{1}{T}\sum_{t=1}^{T}[F(\tilde{w}_{t-1})-F_{*}]\leq\hat{\varepsilon}.
Table 1: Comparisons of computational complexity (the number of individual gradient evaluations) needed by the SGD algorithm to reach an $\hat{\varepsilon}$-accurate solution $w$ that satisfies $F(w)-F(w_{*})\leq\hat{\varepsilon}$ (or $\|\nabla F(w)\|^{2}\leq\hat{\varepsilon}$ in the non-convex case).
Settings | References | Complexity | Shuffling Schemes | Global Solution
Convex | Nemirovski et al. (2009); Shamir and Zhang (2013) (1) | $\mathcal{O}\left(\frac{\Delta_{0}^{2}+G^{2}}{\hat{\varepsilon}^{2}}\right)$
Convex | Mishchenko et al. (2020); Nguyen et al. (2021) (2) | $\mathcal{O}\left(\frac{n}{\hat{\varepsilon}^{3/2}}\right)$
PL condition | Nguyen et al. (2021) | $\tilde{\mathcal{O}}\left(\frac{n\sigma^{2}}{\hat{\varepsilon}^{1/2}}\right)$
Star-convex related | Gower et al. (2021) (3) | $\mathcal{O}\left(\frac{1}{\hat{\varepsilon}^{2}}\right)$
Non-convex | Ghadimi and Lan (2013) (5) | $\mathcal{O}\left(\frac{\sigma^{2}}{\hat{\varepsilon}^{2}}\right)$
Non-convex | Nguyen et al. (2021); Mishchenko et al. (2020) (5) | $\mathcal{O}\left(\frac{n\sigma}{\hat{\varepsilon}^{3/2}}\right)$
Our setting (non-convex) | This paper, Corollary 1 (4) | $\mathcal{O}\left(\frac{n(N\vee 1)^{3/2}}{\hat{\varepsilon}^{3/2}}\right)$

(0) We note that the assumptions in this table are not directly comparable, and we only show the rough complexity in terms of $\hat{\varepsilon}$. In addition, to make fair comparisons, we only report the complexity of unified shuffling schemes.
(1) Standard results for SGD in the convex literature often use a different set of assumptions from ours (e.g. a bounded domain with $\|w-w_{*}\|^{2}\leq\Delta_{0}$ for each iterate $w$ and/or a bounded gradient with $\mathbb{E}[\|\nabla f(w;i)\|^{2}]\leq G^{2}$). We report the corresponding complexity for a rough comparison.
(2) Mishchenko et al. (2020) show a bound for Incremental Gradient, while Nguyen et al. (2021) have a unified setting. We translate the results from these references for unified shuffling schemes in the convex setting.
(3) Since we could not find a reference containing the convergence rate of vanilla SGD for star-convex functions, we adapt the reference Gower et al. (2021) here. That paper shows a result for $L$-smooth and quasar-convex functions under an additional Expected Residual (ER) assumption, which is weaker than assuming smoothness of $f(\cdot;i)$ and the interpolation property. The star-convex results hold when the quasar-convexity parameter is 1.
(4) Since we use a different set of assumptions than the other references, we only report the rough comparison in $n$, $N$, and $\hat{\varepsilon}$, where $N$ is the constant from Assumption 5 and $N\vee 1=\max(N,1)$. Note that $N=0$ in the framework of star-smooth-convex functions. In addition, we need $\sigma_{*}^{2}=0$ for the complexity to hold with arbitrary $\hat{\varepsilon}$. We explain the detailed complexity below and in the Appendix.
(5) Standard literature for SGD in the non-convex setting assumes a bounded variance $\mathbb{E}_{i}\big[\|\nabla f(w;i)-\nabla F(w)\|^{2}\big]\leq\sigma^{2}$; we report the rough comparison.

Computational complexity. Our global convergence result in this corollary holds for a fixed value of $\hat{\varepsilon}$ in Assumption 4. Thus, when $\varepsilon\to 0$, this assumption is equivalent to assuming $\sigma_{*}^{2}=0$. The total complexity of Corollary 1 is $\mathcal{O}\left(\frac{n}{\hat{\varepsilon}^{3/2}}\right)$. This rate matches the best known rate of unified sampling schemes for SGD in the convex setting (Mishchenko et al., 2020; Nguyen et al., 2021). However, our result holds for a broader class of functions that are possibly non-convex. Compared with the non-convex setting, the current literature (Mishchenko et al., 2020; Nguyen et al., 2021) also matches our rate in the order of $\hat{\varepsilon}$; however, for general non-convex functions one can only prove that SGD converges to a stationary point under the weaker criterion $\|\nabla F(w)\|^{2}\leq\hat{\varepsilon}$. Table 1 shows these comparisons in various settings. Note that when using a randomized shuffling scheme, SGD often achieves a better rate in terms of the data size $n$ in various settings, with and without (strong) convexity. For example, in the strongly convex and/or PL setting, the convergence rate of Random Reshuffling is $\tilde{\mathcal{O}}(\sqrt{n}/\sqrt{\hat{\varepsilon}})$, which is better than that of unified schemes, $\tilde{\mathcal{O}}(n/\sqrt{\hat{\varepsilon}})$ (Ahn et al., 2020). However, for a fair comparison, we do not report these results in Table 1, as our theoretical analysis is derived for unified shuffling schemes.

If we further assume that $L,M,N>1$, the detailed complexity with respect to these constants is

\mathcal{O}\left(\frac{L^{4}(M+N)^{3/2}}{\mu^{3/2}}\cdot\frac{n}{\hat{\varepsilon}^{3/2}}\right).

We present all detailed proofs in the Appendix. Our theoretical framework is new and tailored to the finite-sum minimization problem. Moreover, it exploits shuffling sampling schemes to obtain a better complexity in terms of $\hat{\varepsilon}$ than that of the vanilla i.i.d. sampling scheme.

4 Numerical Experiments

In this section, we present experiments with shuffling-type SGD algorithms to demonstrate our theoretical results on convergence to a global solution. Following the setting of Theorem 1, we consider a non-convex regression problem with the squared loss. We use fully connected neural networks in our implementation. We experiment with different regression datasets: the Diabetes dataset from the sklearn library (Efron et al., 2004; Pedregosa et al., 2011) with 442 samples in dimension 10, and the Life Expectancy dataset from WHO (Repository, 2016) with 1649 trainable data points and 19 features. In addition, we test the California Housing data from the StatLib repository (Repository, 1997; Pedregosa et al., 2011) with a training set of 16514 samples and 8 features.

For the small Diabetes dataset, we use the classic LeNet-300-100 model (LeCun et al., 1998). For the other, larger datasets, we use similar fully connected neural networks with an additional initial layer of 900 neurons. We apply the randomized reshuffling scheme using the PyTorch framework (Paszke et al., 2019). This shuffling scheme is the common heuristic in training neural networks and is implemented in many deep learning platforms (e.g. TensorFlow, PyTorch, and Keras (Abadi et al., 2015; Paszke et al., 2019; Chollet et al., 2015)).

For each dataset $\{x_{i},y_{i}\}$, we preprocess and lightly modify the initial data to guarantee the over-parameterized setting in our experiments, i.e., to make sure that there exists a solution that interpolates the data. We do this with a pre-training stage: first, we use the GD/SGD algorithm to find a weight $w$ that yields a sufficiently small value of the loss function (for the Diabetes dataset we train to $10^{-8}$ and for the other datasets to $10^{-2}$). Keeping the input data $x_{i}$ fixed, we change the labels to $\hat{y}_{i}$ such that the weight $w$ yields a small loss value of $\mathcal{O}(\epsilon)$ for the optimization problem associated with the data $\{x_{i},\hat{y}_{i}\}$, and the distance between $\hat{y}_{i}$ and $y_{i}$ is small. The modified data is then ready for the next stage. We summarize the data (after modification) used in our experiments in Table 2.

Table 2: Datasets used in our experiments
Data name | # Samples | # Features | Network layers | Sources
Diabetes | 442 | 10 | 300-100 | Efron et al. (2004)
Life Expectancy | 1649 | 19 | 900-300-100 | Repository (2016)
California Housing | 16514 | 8 | 900-300-100 | Repository (1997)
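The data-modification step described above can be sketched as follows (our own illustrative PyTorch code; the model, optimizer, and tolerances are assumptions, not the exact script used for the experiments).

import torch

def make_interpolable(model, X, Y, lr=1e-3, tol=1e-2, max_steps=50_000):
    # Pre-training stage: fit the model until the loss is below `tol`, then
    # replace the labels by the model's predictions so that the returned data
    # set is (approximately) interpolated by the found weight.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(max_steps):
        opt.zero_grad()
        loss = 0.5 * torch.mean((model(X) - Y) ** 2)
        if loss.item() < tol:
            break
        loss.backward()
        opt.step()
    with torch.no_grad():
        Y_hat = model(X).clone()   # modified labels, close to Y and interpolated by model
    return X, Y_hat

# Usage sketch (hypothetical fully connected model in the spirit of Table 2):
# model = torch.nn.Sequential(torch.nn.Linear(10, 300), torch.nn.ReLU(),
#                             torch.nn.Linear(300, 100), torch.nn.ReLU(),
#                             torch.nn.Linear(100, 1))
# X, Y_hat = make_interpolable(model, X, Y)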

For each dataset, we first tune the step size using a coarse grid search over $[0.0001,0.001,0.01,0.1,1]$ for 100 epochs. Then, for example, if 0.01 performs best, we run a second grid search over $[0.002,0.005,0.01,0.02,0.05]$ for 5000 epochs. Finally, we proceed to the training stage with $10^{6}$ epochs and repeat that experiment for 5 random seeds. We report the average results with confidence intervals in Figure 1.
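A minimal sketch of this two-stage tuning loop, assuming a hypothetical train(step_size, epochs) routine that returns the final training loss, could look like:

def tune_step_size(train, coarse=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    # Stage 1: coarse grid for 100 epochs.
    best = min(coarse, key=lambda lr: train(lr, epochs=100))
    # Stage 2: finer grid around the best coarse value for 5000 epochs
    # (e.g. around 0.01 this reproduces [0.002, 0.005, 0.01, 0.02, 0.05]).
    fine = (0.2 * best, 0.5 * best, best, 2 * best, 5 * best)
    return min(fine, key=lambda lr: train(lr, epochs=5000))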

Figure 1: The train loss produced by Shuffling SGD algorithm for three datasets: Diabetes, Life Expectancy and California Housing.

For the California Housing data, Shuffling SGD fluctuates toward the end of the training process. Nevertheless, for all three datasets it converges steadily to a small value of the loss function. In summary, this experiment supports our theoretical guarantee of convergence to a global solution for shuffling-type SGD algorithms in neural network settings.

5 Conclusion and Future Research

In this paper, our focus is on investigating the global convergence properties of shuffling-type SGD methods. We consider a more relaxed set of assumptions in the framework of star-smooth-convex functions. We demonstrate that our approach achieves a total complexity of $\mathcal{O}(\frac{n}{\hat{\varepsilon}^{3/2}})$ to attain an $\hat{\varepsilon}$-accurate global solution. Notably, this result aligns with the previous computational complexity of unified shuffling methods in non-convex settings, while ensuring that the algorithm converges to a global solution. Our theoretical framework revolves around shuffling sampling schemes for finite-sum minimization problems in machine learning.

We also provide insights into the relationships between our framework and well-established over-parameterized settings, as well as the existing literature on the star-convexity class of functions. Furthermore, we establish connections with neural network architectures and explore how these learning models align with our theoretical optimization frameworks.

This paper prompts several intriguing research questions, including practical network designs and more relaxed theoretical settings that can support the global convergence of shuffling SGD methods. Additionally, extending the global convergence framework to other stochastic gradient methods (Duchi et al., 2011; Kingma and Ba, 2014) and variance reduction methods (Le Roux et al., 2012; Defazio et al., 2014; Johnson and Zhang, 2013; Nguyen et al., 2017), all with shuffling sampling schemes, as well as the exploration of momentum shuffling methods (Tran et al., 2021, 2022), represents a promising direction.

An interesting research question that arises is whether the convergence results can be further enhanced by exploring the potential of a randomized shuffling scheme within this framework (Mishchenko et al., 2020). However, we leave this question for future research endeavors.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Ahn et al. (2020) Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17526–17535. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/cb8acb1dc9821bf74e6ca9068032d623-Paper.pdf.
  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 242–252. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/allen-zhu19a.html.
  • Arora et al. (2019) Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 322–332. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/arora19a.html.
  • Beznosikov and Takáč (2021) Aleksandr Beznosikov and Martin Takáč. Random-reshuffled sarah does not need a full gradient computations. arXiv preprint arXiv:2111.13322, 2021.
  • Bjorck et al. (2021) Johan Bjorck, Anmol Kabra, Kilian Q. Weinberger, and Carla Gomes. Characterizing the loss landscape in non-negative matrix factorization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6768–6776, May 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16836.
  • Bottou et al. (2018) L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Rev., 60(2):223–311, 2018.
  • Bottou (2009) Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms, 2009.
  • Brutzkus et al. (2018) Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJ33wwxRb.
  • Chollet et al. (2015) Francois Chollet et al. Keras. GitHub, 2015. URL https://github.com/fchollet/keras.
  • Collobert and Weston (2008) Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pages 160–167. ACM, 2008. doi: 10.1145/1390156.1390177. URL https://doi.org/10.1145/1390156.1390177.
  • De et al. (2017) Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated Inference with Adaptive Batches. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1504–1513. PMLR, 20–22 Apr 2017. URL https://proceedings.mlr.press/v54/de17a.html.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • Du et al. (2019a) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 09–15 Jun 2019a. URL http://proceedings.mlr.press/v97/du19c.html.
  • Du et al. (2019b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eK3i09YQ.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • Efron et al. (2004) Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Diabetes dataset, 2004. URL https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html.
  • Ghadimi and Lan (2013) S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
  • Goldberg et al. (2018) Yoav Goldberg, Graeme Hirst, Yang Liu, and Meng Zhang. Neural network methods for natural language processing. Comput. Linguistics, 44(1), 2018. doi: 10.1162/COLI\_r\_00312. URL https://doi.org/10.1162/COLI_r_00312.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay D. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6082.
  • Gower et al. (2021) Robert Gower, Othmane Sebbouh, and Nicolas Loizou. Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315–1323. PMLR, 2021.
  • Gürbüzbalaban et al. (2019) M. Gürbüzbalaban, A. Ozdaglar, and P. A. Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, Oct 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01440-w. URL http://dx.doi.org/10.1007/s10107-019-01440-w.
  • Haochen and Sra (2019) Jeff Haochen and Suvrit Sra. Random shuffling beats sgd after finite epochs. In International Conference on Machine Learning, pages 2624–2633. PMLR, 2019.
  • Hardt et al. (2018) Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(29):1–44, 2018. URL http://jmlr.org/papers/v19/16-465.html.
  • He and Sun (2015) Kaiming He and Jian Sun. Convolutional neural networks at constrained time cost. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5353–5360. IEEE Computer Society, 2015. doi: 10.1109/CVPR.2015.7299173. URL https://doi.org/10.1109/CVPR.2015.7299173.
  • Hinder et al. (2020) Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on Learning Theory, pages 1894–1938. PMLR, 2020.
  • Jin (2020) Jikai Jin. On the convergence of first order methods for quasar-convex optimization. arXiv preprint arXiv:2010.04937, 2020.
  • Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • Karimi et al. (2016) Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Paolo Frasconi, Niels Landwehr, Giuseppe Manco, and Jilles Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages 795–811, Cham, 2016. Springer International Publishing.
  • Khaled and Richtárik (2020) Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • Le Roux et al. (2012) Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671, 2012.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee and Valiant (2016) Jasper C.H. Lee and Paul Valiant. Optimizing star-convex functions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 603–614, 2016. doi: 10.1109/FOCS.2016.71.
  • Loizou et al. (2021) Nicolas Loizou, Sharan Vaswani, Issam Hadj Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1306–1314. PMLR, 13–15 Apr 2021. URL http://proceedings.mlr.press/v130/loizou21a.html.
  • Ma et al. (2018) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3325–3334. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/ma18a.html.
  • Meng et al. (2020) Si Yi Meng, Sharan Vaswani, Issam Hadj Laradji, Mark Schmidt, and Simon Lacoste-Julien. Fast and furious convergence: Stochastic second order methods under interpolation. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1375–1386. PMLR, 26–28 Aug 2020. URL http://proceedings.mlr.press/v108/meng20a.html.
  • Mishchenko et al. (2020) Konstantin Mishchenko, Ahmed Khaled Ragab Bayoumi, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.
  • Nagaraj et al. (2019) Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. Sgd without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pages 4703–4711, 2019.
  • Nemirovski et al. (2009) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574–1609, 2009.
  • Nesterov (2004) Yurii Nesterov. Introductory lectures on convex optimization : a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London, 2004. ISBN 1-4020-7553-7.
  • Nesterov and Polyak (2006) Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Nguyen et al. (2017) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017.
  • Nguyen et al. (2021) Lam M. Nguyen, Quoc Tran-Dinh, Dzung T. Phan, Phuong Ha Nguyen, and Marten van Dijk. A unified convergence analysis for shuffling-type gradient methods. Journal of Machine Learning Research, 22(207):1–44, 2021. URL http://jmlr.org/papers/v22/20-1238.html.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Polyak (1964) Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Repository (2016) Global Health Observatory Data Repository. Life expectancy and healthy life expectancy, 2016. URL https://apps.who.int/gho/data/view.main.SDG2016LEXREGv?lang=en.
  • Repository (1997) StatLib Repository. California housing, 1997. URL https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Safran and Shamir (2020) Itay Safran and Ohad Shamir. How good is sgd with random shuffling? In Conference on Learning Theory, pages 3250–3284. PMLR, 2020.
  • Sankararaman et al. (2020) Karthik Abinav Sankararaman, Soham De, Zheng Xu, W Ronny Huang, and Tom Goldstein. The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. In International conference on machine learning, pages 8469–8479. PMLR, 2020.
  • Schmidt and Roux (2013) Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition, 2013.
  • Shamir and Zhang (2013) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/shamir13.html.
  • Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. J. Mach. Learn. Res., 19(1):2822–2878, January 2018. ISSN 1532-4435.
  • Tran et al. (2021) Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. SMG: A shuffling gradient-based method with momentum. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10379–10389. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/tran21b.html.
  • Tran et al. (2022) Trang H Tran, Katya Scheinberg, and Lam M Nguyen. Nesterov accelerated shuffling gradient method for convex optimization. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 21703–21732. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/tran22a.html.
  • Vaswani et al. (2019) Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1195–1204. PMLR, 16–18 Apr 2019. URL https://proceedings.mlr.press/v89/vaswani19a.html.
  • Zhou et al. (2019) Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451, 2019.
  • Zou and Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 2053–2062, 2019.
  • Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks, 2018.

 
On the Convergence to a Global Solution
of Shuffling-Type Gradient Algorithms
Supplementary Material, NeurIPS 2023


 


Appendix A Theoretical settings: Proof of Theorem 1

A.1 Proof of Theorem 1

Proof.

Let us use the notation $f(w;i)=\phi_{i}(h(w;i))=\frac{1}{2}\|h(w;i)-y^{(i)}\|^{2}$. We consider an architecture $h(w;i)$, where $w$ is the vectorized weight and $h$ contains a final bias layer $b$:

h(w;i)\displaystyle h(w;i) =WTz(θ;i)+bc,\displaystyle=W^{T}z(\theta;i)+b\in\mathbb{R}^{c},

where $w=\text{vec}(\{\theta,W,b\})$ and $z(\theta;i)$ is the output of an inner architecture, which can be chosen arbitrarily.

Firstly, we compute the gradient of f(;i)f(\cdot;i) with respect to bcb\in\mathbb{R}^{c}. For j=1,,cj=1,\dots,c, we have

f(w;i)bj=ϕi(h(w;i))bj=k=1ch(w;i)kbjϕi(x)xk|x=h(w;i)\displaystyle\frac{\partial f(w;i)}{\partial b_{j}}=\frac{\partial\phi_{i}(h(w;i))}{\partial b_{j}}=\sum_{k=1}^{c}\frac{\partial h(w;i)_{k}}{\partial b_{j}}\cdot\frac{\partial\phi_{i}(x)}{\partial x_{k}}\Big{|}_{x=h(w;i)} =ϕi(x)xj|x=h(w;i),i=1,,n.\displaystyle=\frac{\partial\phi_{i}(x)}{\partial x_{j}}\Big{|}_{x=h(w;i)},\ i=1,\dots,n. (18)

The last equality follows since $\frac{\partial h(w;i)_{k}}{\partial b_{j}}=0$ for every $k\neq j$ and $\frac{\partial h(w;i)_{k}}{\partial b_{j}}=1$ for $k=j$; in other words, the Jacobian of $h(w;i)$ with respect to $b$ is the identity matrix.
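As a quick numerical illustration of (18), the following sketch compares a finite-difference approximation of $\partial f(w;i)/\partial b$ with the residual $h(w;i)-y^{(i)}=\nabla_{x}\phi_{i}(x)|_{x=h(w;i)}$. The one-hidden-layer choice of the inner architecture $z(\theta;i)$, the dimensions, and the random seed are arbitrary illustrative assumptions, not the setting of the paper.

import numpy as np

# Check (18): for a model with a final bias layer, the gradient of f(w; i) with
# respect to b equals the residual h(w; i) - y_i, i.e. grad_x phi_i at x = h(w; i).
rng = np.random.default_rng(1)
d_in, d_hidden, c = 4, 6, 3
theta = rng.standard_normal((d_in, d_hidden))
W = rng.standard_normal((d_hidden, c))
b = rng.standard_normal(c)
x_i = rng.standard_normal(d_in)      # input of sample i
y_i = rng.standard_normal(c)         # label of sample i

def h(b_vec):
    # h(w; i) = W^T z(theta; i) + b with an arbitrary inner architecture z
    return W.T @ np.tanh(theta.T @ x_i) + b_vec

def f(b_vec):
    return 0.5 * np.linalg.norm(h(b_vec) - y_i) ** 2

residual = h(b) - y_i                # grad_x phi_i at x = h(w; i)
fd_grad = np.array([(f(b + 1e-6 * e) - f(b - 1e-6 * e)) / 2e-6
                    for e in np.eye(c)])   # central finite differences in b
print(np.allclose(fd_grad, residual, atol=1e-5))   # expected: True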

Let us denote $f_{i}^{*}=\min_{w}f(w;i)$ and $\phi_{i}^{*}=\min_{x}\phi_{i}(x)$. We prove the following statement for $\mu=1$:

wf(w;i)2xϕi(x)|x=h(w;i)22μ[ϕi(h(w;i))ϕi]2μ[f(w;i)fi],\displaystyle\|\nabla_{w}f(w;i)\|^{2}\geq\|\nabla_{x}\phi_{i}(x)|_{x=h(w;i)}\|^{2}\geq 2\mu[\phi_{i}(h(w;i))-\phi_{i}^{*}]\geq 2\mu[f(w;i)-f_{i}^{*}],

for every wd,w\in\mathbb{R}^{d}, and i=1,,n.i=1,\dots,n.

We begin with the first inequality:

wf(w;i)2\displaystyle\|\nabla_{w}f(w;i)\|^{2} =j=1d(f(w;i)wj)2j=dc+1d(f(w;i)wj)2=j=1c(f(w;i)bj)2\displaystyle=\sum_{j=1}^{d}\Big{(}\frac{\partial f(w;i)}{\partial w_{j}}\Big{)}^{2}\geq\sum_{j=d-c+1}^{d}\Big{(}\frac{\partial f(w;i)}{\partial w_{j}}\Big{)}^{2}=\sum_{j=1}^{c}\Big{(}\frac{\partial f(w;i)}{\partial b_{j}}\Big{)}^{2}
=(18)j=1c(ϕi(x)xj|x=h(w;i))2=xϕi(x)|x=h(w;i)2.\displaystyle\overset{\eqref{eq_last_bias_f_phi}}{=}\sum_{j=1}^{c}\Big{(}\frac{\partial\phi_{i}(x)}{\partial x_{j}}\Big{|}_{x=h(w;i)}\Big{)}^{2}=\|\nabla_{x}\phi_{i}(x)|_{x=h(w;i)}\|^{2}.

Now let us prove the PL condition for each function ϕi(x)\phi_{i}(x), i.e., there exists a constant μ>0\mu>0 such that:

xϕi(x)22μ[ϕi(x)ϕi]xc,i=1,,n.\displaystyle\|\nabla_{x}\phi_{i}(x)\|^{2}\geq 2\mu[\phi_{i}(x)-\phi_{i}^{*}]\ \forall x\in\mathbb{R}^{c},\ i=1,\dots,n.

Recall the squared loss that ϕi(x)=12xy(i)2\phi_{i}(x)=\frac{1}{2}\|x-y^{(i)}\|^{2} and xϕi(x)=xy(i)\nabla_{x}\phi_{i}(x)=x-y^{(i)}. We can see that the constant μ=1\mu=1 satisfies the following inequality for every xc,i=1,,nx\in\mathbb{R}^{c},\ i=1,\dots,n:

xϕi(x)2=xy(i)2=212xy(i)2=2μϕi(x)2μ[ϕi(x)ϕi],\displaystyle\|\nabla_{x}\phi_{i}(x)\|^{2}=\|x-y^{(i)}\|^{2}=2\frac{1}{2}\|x-y^{(i)}\|^{2}=2\mu\phi_{i}(x)\geq 2\mu[\phi_{i}(x)-\phi_{i}^{*}],

where the last inequality follows since ϕi0\phi_{i}^{*}\geq 0.
The PL condition for $\phi_{i}$ directly implies the second inequality. The last inequality follows from the facts that $f(w;i)=\phi_{i}(h(w;i))$ and $f_{i}^{*}=\min_{w}f(w;i)\geq\min_{x}\phi_{i}(x)=\phi_{i}^{*}$, so that $\phi_{i}(h(w;i))-\phi_{i}^{*}\geq f(w;i)-f_{i}^{*}$. Hence, Theorem 1 is proved. ∎
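The conclusion can also be sanity-checked numerically: since $f_{i}^{*}\geq 0$, it suffices to verify $\|\nabla_{w}f(w;i)\|^{2}\geq 2f(w;i)\geq 2[f(w;i)-f_{i}^{*}]$ at arbitrary weights. The sketch below does this for an illustrative one-hidden-layer network with a final bias layer; the architecture and all sizes are arbitrary choices made only for this check.

import numpy as np

# Check ||grad_w f(w; i)||^2 >= 2 f(w; i), which together with f_i^* >= 0 gives
# the inequality of Theorem 1 with mu = 1.
rng = np.random.default_rng(2)
d_in, d_hidden, c = 4, 6, 3
theta = rng.standard_normal((d_in, d_hidden))
W = rng.standard_normal((d_hidden, c))
b = rng.standard_normal(c)
x_i, y_i = rng.standard_normal(d_in), rng.standard_normal(c)

z = np.tanh(theta.T @ x_i)                    # inner architecture z(theta; i)
r = (W.T @ z + b) - y_i                       # residual h(w; i) - y_i = grad_x phi_i(h(w; i))
grad_theta = np.outer(x_i, (1.0 - z ** 2) * (W @ r))   # chain rule through tanh
grad_W = np.outer(z, r)
grad_b = r                                    # equation (18)
grad_sq = sum(np.linalg.norm(g) ** 2 for g in (grad_theta, grad_W, grad_b))
f_val = 0.5 * np.linalg.norm(r) ** 2
print(grad_sq >= 2.0 * f_val)                 # expected: True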

Appendix B Preliminary results for SGD Shuffling Algorithm

In this section, we present the preliminary results for Algorithm 1. Firstly, from the choice of learning rate ηi(t):=ηtn\eta_{i}^{(t)}:=\frac{\eta_{t}}{n} and the update wi+1(t):=wi(t)ηi(t)f(wi(t);π(t)(i+1))w_{i+1}^{(t)}:=w_{i}^{(t)}-\eta_{i}^{(t)}\nabla f(w_{i}^{(t)};\pi^{(t)}(i+1)) in Algorithm 1, for i[n]i\in[n], we have

wi(t)=wi1(t)ηtnf(wi1(t);π(t)(i))=w0(t)ηtnj=0i1f(wj(t);π(t)(j+1)).\displaystyle w_{i}^{(t)}=w_{i-1}^{(t)}-\frac{\eta_{t}}{n}\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))=w_{0}^{(t)}-\frac{\eta_{t}}{n}\sum_{j=0}^{i-1}\nabla f(w_{j}^{(t)};\pi^{(t)}(j+1)). (19)

Hence,

w0(t+1)=wn(t)=w0(t)ηtnj=0n1f(wj(t);π(t)(j+1)).\displaystyle w_{0}^{(t+1)}=w_{n}^{(t)}=w_{0}^{(t)}-\frac{\eta_{t}}{n}\sum_{j=0}^{n-1}\nabla f(w_{j}^{(t)};\pi^{(t)}(j+1)). (20)
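For concreteness, one epoch of the update (19)-(20) can be written as the sketch below. The gradient oracle grad_f is a placeholder for $\nabla f(\cdot;i)$; the random permutation corresponds to the Random Reshuffling scheme, and replacing it by a fixed order gives the Incremental Gradient scheme.

import numpy as np

def shuffling_sgd_epoch(w0, grad_f, n, eta_t, rng):
    """One epoch of Algorithm 1 with per-step learning rate eta_t / n (cf. (19)-(20)).

    grad_f(w, i) should return the individual gradient of f(. ; i) at w.
    """
    perm = rng.permutation(n)        # Random Reshuffling; a fixed order gives Incremental Gradient
    w = w0.copy()
    for i in perm:                   # w_{i+1}^{(t)} = w_i^{(t)} - (eta_t / n) * grad f(w_i^{(t)}; pi^{(t)}(i+1))
        w = w - (eta_t / n) * grad_f(w, i)
    return w                         # w_n^{(t)} = w_0^{(t+1)}

A full run simply calls this routine $T$ times, feeding the returned iterate back in as the next $w_{0}^{(t)}$ and updating $\eta_{t}$ according to the chosen step-size schedule.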

Next, we recall a lemma from Nguyen et al. [2021] to bound the updates of shuffling SGD algorithms.

Lemma 3 (Lemma 5 in Nguyen et al. [2021]).

Suppose that Assumption 2 holds for (1). Let {wi(t)}\{w_{i}^{(t)}\} be generated by Algorithm 1 with the learning rate ηi(t):=ηtn>0\eta_{i}^{(t)}:=\frac{\eta_{t}}{n}>0 for a given positive sequence {ηt}\{\eta_{t}\}. If 0<ηt12L0<\eta_{t}\leq\frac{1}{2L} for all t1t\geq 1, we have

1nj=0n1wj(t)w2\displaystyle\frac{1}{n}\sum_{j=0}^{n-1}\|w_{j}^{(t)}-w_{*}\|^{2} 4w0(t)w2+8σ2ηt2,\displaystyle\leq 4\|w_{0}^{(t)}-w_{*}\|^{2}+8\sigma_{*}^{2}\cdot\eta_{t}^{2}, (21)
1nj=0n1wj(t)w0(t)2\displaystyle\frac{1}{n}\sum_{j=0}^{n-1}\|w_{j}^{(t)}-w_{0}^{(t)}\|^{2} ηt28L23w0(t)w2+16L2σ23ηt4+ 2σ2ηt2.\displaystyle\leq\eta_{t}^{2}\cdot\frac{8L^{2}}{3}\|w_{0}^{(t)}-w_{*}\|^{2}\ +\ \frac{16L^{2}\sigma_{*}^{2}}{3}\cdot\eta_{t}^{4}\ +\ 2\sigma_{*}^{2}\cdot\eta_{t}^{2}. (22)

Now considering the term wn(t)w0(t)2\|w_{n}^{(t)}-w_{0}^{(t)}\|^{2}, we get that

\displaystyle\|w_{n}^{(t)}-w_{0}^{(t)}\|^{2}\overset{\eqref{update_shuffling_02}}{=}\eta_{t}^{2}\left\|\frac{1}{n}\sum_{j=0}^{n-1}\nabla f(w_{j}^{(t)};\pi^{(t)}(j+1))\right\|^{2}
\displaystyle=\eta_{t}^{2}\left\|\frac{1}{n}\sum_{j=0}^{n-1}\big(\nabla f(w_{j}^{(t)};\pi^{(t)}(j+1))-\nabla f(w_{*};\pi^{(t)}(j+1))\big)\right\|^{2}
\displaystyle\leq\eta_{t}^{2}\cdot\frac{1}{n}\sum_{j=0}^{n-1}\left\|\nabla f(w_{j}^{(t)};\pi^{(t)}(j+1))-\nabla f(w_{*};\pi^{(t)}(j+1))\right\|^{2}
\displaystyle\overset{\eqref{eq:Lsmooth_basic}}{\leq}L^{2}\eta_{t}^{2}\cdot\frac{1}{n}\sum_{j=0}^{n-1}\|w_{j}^{(t)}-w_{*}\|^{2}
\displaystyle\overset{\eqref{eq_thm_weight_02}}{\leq}4L^{2}\eta_{t}^{2}\|w_{0}^{(t)}-w_{*}\|^{2}+8L^{2}\eta_{t}^{4}\sigma_{*}^{2},

where the second equality follows since $\frac{1}{n}\sum_{j=0}^{n-1}\nabla f(w_{*};\pi^{(t)}(j+1))=\nabla F(w_{*})=0$ and the first inequality follows from the convexity of $\|\cdot\|^{2}$.

We further have

1nj=0nwj(t)w0(t)2\displaystyle\frac{1}{n}\sum_{j=0}^{n}\|w_{j}^{(t)}-w_{0}^{(t)}\|^{2} =1nj=0n1wj(t)w0(t)2+1nwn(t)w0(t)2\displaystyle=\frac{1}{n}\sum_{j=0}^{n-1}\|w_{j}^{(t)}-w_{0}^{(t)}\|^{2}+\frac{1}{n}\|w_{n}^{(t)}-w_{0}^{(t)}\|^{2}
ηt28L23w0(t)w2+16L2σ23ηt4+ 2σ2ηt2\displaystyle\leq\eta_{t}^{2}\cdot\frac{8L^{2}}{3}\|w_{0}^{(t)}-w_{*}\|^{2}\ +\ \frac{16L^{2}\sigma_{*}^{2}}{3}\cdot\eta_{t}^{4}\ +\ 2\sigma_{*}^{2}\cdot\eta_{t}^{2}
+4L2ηt2nw0(t)w2+8L2ηt4nσ2.\displaystyle\quad+\frac{4L^{2}\eta_{t}^{2}}{n}\|w_{0}^{(t)}-w_{*}\|^{2}+\frac{8L^{2}\eta_{t}^{4}}{n}\sigma_{*}^{2}. (23)
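The following sketch gives a numerical sanity check of the bounds (21)-(22) on a toy least-squares instance $f(w;i)=\frac{1}{2}(a_{i}^{\top}w-b_{i})^{2}$, for which a minimizer $w_{*}$, the smoothness constant, and the gradient variance at $w_{*}$ are computable in closed form. It assumes, as in the derivation above, that Assumption 2 is per-component $L$-smoothness and that $\sigma_{*}^{2}$ is $\frac{1}{n}\sum_{i}\|\nabla f(w_{*};i)\|^{2}$ (cf. (8)); the problem sizes, seed, and step size are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

L = max(np.linalg.norm(A[i]) ** 2 for i in range(n))      # each f(.; i) is L-smooth with L = max_i ||a_i||^2
w_star = np.linalg.lstsq(A, b, rcond=None)[0]              # a global minimizer of F
sigma_star_sq = np.mean([(A[i] @ w_star - b[i]) ** 2 * np.linalg.norm(A[i]) ** 2
                         for i in range(n)])               # (1/n) sum_i ||grad f(w_*; i)||^2

eta_t = 1.0 / (2.0 * L)                                    # epoch step size with eta_t <= 1/(2L)
w0 = rng.standard_normal(d)                                # w_0^{(t)}
w, inner_iterates = w0.copy(), [w0.copy()]
for i in rng.permutation(n):                               # one shuffled epoch, per-step size eta_t / n
    w = w - (eta_t / n) * (A[i] @ w - b[i]) * A[i]
    inner_iterates.append(w.copy())

lhs21 = np.mean([np.linalg.norm(wj - w_star) ** 2 for wj in inner_iterates[:-1]])
rhs21 = 4.0 * np.linalg.norm(w0 - w_star) ** 2 + 8.0 * sigma_star_sq * eta_t ** 2
lhs22 = np.mean([np.linalg.norm(wj - w0) ** 2 for wj in inner_iterates[:-1]])
rhs22 = (eta_t ** 2 * (8.0 * L ** 2 / 3.0) * np.linalg.norm(w0 - w_star) ** 2
         + (16.0 * L ** 2 * sigma_star_sq / 3.0) * eta_t ** 4
         + 2.0 * sigma_star_sq * eta_t ** 2)
print(lhs21 <= rhs21, lhs22 <= rhs22)                      # expected: True True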

Appendix C Main results: Proofs of Lemma 4, Lemma 2, Theorem 2, and Corollary 1

C.1 Proof of Lemma 4

Lemma 4.

Let $\{w^{(t)}_{i}\}_{t=1}^{T}$ be the sequence generated by Algorithm 1 with $\eta_{i}^{(t)}=\frac{\eta_{t}}{n}$, where $0<\eta_{t}\leq\min\left\{\frac{n}{2M},\frac{1}{2L}\right\}$. Then, under Assumptions 1, 2, and 5, we have

w0(t+1)w2\displaystyle\|w_{0}^{(t+1)}-w_{*}\|^{2} (1+B1ηt3)w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2+B2ηtσ2,\displaystyle\leq\left(1+B_{1}\eta_{t}^{3}\right)\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+B_{2}\eta_{t}\sigma_{*}^{2}, (24)

where

{B1=8L23+14NL2M,B2=2M+1+56L2+8N3ML2.\displaystyle\begin{cases}B_{1}=\frac{8L^{2}}{3}+\frac{14NL^{2}}{M},\\ B_{2}=\frac{2}{M}+1+\frac{5}{6L^{2}}+\frac{8N}{3ML^{2}}.\end{cases} (25)
Proof.

We start with Assumption 5. Using the inequality 12a2b2ab2\frac{1}{2}\|a\|^{2}-\|b\|^{2}\leq\|a-b\|^{2}, we have for t=1,,Tt=1,\dots,T and i=1,,ni=1,\dots,n:

12f(wi1(t);π(t)(i))2f(w;π(t)(i))2\displaystyle\frac{1}{2}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}-\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
f(wi1(t);π(t)(i))f(w;π(t)(i))2\displaystyle\qquad\leq\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))-\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
(13)Mf(wi1(t);π(t)(i))f(w;π(t)(i)),wi1(t)w+N1ni=1nwi(t)w0(t)2\displaystyle\qquad\overset{\eqref{eq:new_property_fi_alg}}{\leq}M\langle\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))-\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle+N\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
=Mf(wi1(t);π(t)(i)),wi1(t)wMf(w;π(t)(i)),wi1(t)w\displaystyle\qquad=M\langle\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle-M\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle
+N1ni=1nwi(t)w0(t)2,\displaystyle\qquad\qquad+N\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2},

Dividing both sides by $M$ and rearranging, this statement is equivalent to

f(wi1(t);π(t)(i)),wi1(t)w\displaystyle-\langle\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle 12Mf(wi1(t);π(t)(i))2+1Mf(w;π(t)(i))2\displaystyle\leq-\frac{1}{2M}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+\frac{1}{M}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
f(w;π(t)(i)),wi1(t)w\displaystyle\quad-\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle
+NM1ni=1nwi(t)w0(t)2,\displaystyle\quad+\frac{N}{M}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}, (26)

For any wWw_{*}\in W^{*}, from the update (19) we have,

wi(t)w2\displaystyle\|w_{i}^{(t)}-w_{*}\|^{2} =(19)wi1(t)w22ηtnf(wi1(t);π(t)(i)),wi1(t)w+ηt2n2f(wi1(t);π(t)(i))2\displaystyle\overset{\eqref{update_shuffling_01}}{=}\|w_{i-1}^{(t)}-w_{*}\|^{2}-\frac{2\eta_{t}}{n}\langle\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle+\frac{\eta_{t}^{2}}{n^{2}}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
(26)wi1(t)w22ηt2Mnf(wi1(t);π(t)(i))2+2ηtMnf(w;π(t)(i))2\displaystyle\overset{\eqref{eq:new_property_fi_01}}{\leq}\|w_{i-1}^{(t)}-w_{*}\|^{2}-\frac{2\eta_{t}}{2Mn}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+\frac{2\eta_{t}}{Mn}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
2ηtnf(w;π(t)(i)),wi1(t)w+2ηtNMn1ni=1nwi(t)w0(t)2\displaystyle\qquad-\frac{2\eta_{t}}{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle+\frac{2\eta_{t}N}{Mn}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
+ηt2n2f(wi1(t);π(t)(i))2\displaystyle\qquad+\frac{\eta_{t}^{2}}{n^{2}}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
(a)wi1(t)w2ηt2Mnf(wi1(t);π(t)(i))2+2ηtMnf(w;π(t)(i))2\displaystyle\overset{(a)}{\leq}\|w_{i-1}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2Mn}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+\frac{2\eta_{t}}{Mn}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
2ηtnf(w;π(t)(i)),wi1(t)w+2ηtNMn1ni=1nwi(t)w0(t)2\displaystyle\qquad-\frac{2\eta_{t}}{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{*}\rangle+\frac{2\eta_{t}N}{Mn}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
=wi1(t)w2ηt2Mnf(wi1(t);π(t)(i))2+2ηtMnf(w;π(t)(i))2\displaystyle=\|w_{i-1}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2Mn}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+\frac{2\eta_{t}}{Mn}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
2ηtnf(w;π(t)(i)),wi1(t)w0(t)2ηtnf(w;π(t)(i)),w0(t)w\displaystyle\qquad-\frac{2\eta_{t}}{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{i-1}^{(t)}-w_{0}^{(t)}\rangle-\frac{2\eta_{t}}{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{0}^{(t)}-w_{*}\rangle
+2ηtNMn1ni=1nwi(t)w0(t)2\displaystyle\qquad+\frac{2\eta_{t}N}{Mn}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
(b)wi1(t)w2ηt2Mnf(wi1(t);π(t)(i))2+2ηtMnf(w;π(t)(i))2\displaystyle\overset{(b)}{\leq}\|w_{i-1}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2Mn}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+\frac{2\eta_{t}}{Mn}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}
+ηtnf(w;π(t)(i))2+ηtnwi1(t)w0(t)2\displaystyle\qquad+\frac{\eta_{t}}{n}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}+\frac{\eta_{t}}{n}\|w_{i-1}^{(t)}-w_{0}^{(t)}\|^{2}
2ηtnf(w;π(t)(i)),w0(t)w+2ηtNMn1ni=1nwi(t)w0(t)2,\displaystyle\qquad-\frac{2\eta_{t}}{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{0}^{(t)}-w_{*}\rangle+\frac{2\eta_{t}N}{Mn}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2},

where $(a)$ follows since $\eta_{t}\leq\frac{n}{2M}$ and $(b)$ follows from the inequality $2\langle a,b\rangle\leq\|a\|^{2}+\|b\|^{2}$.

Note that $\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(w_{*};\pi^{(t)}(i)),w_{0}^{(t)}-w_{*}\rangle=\langle\nabla F(w_{*}),w_{0}^{(t)}-w_{*}\rangle=0$ since $w_{*}$ is a global minimizer of $F$ and hence $\nabla F(w_{*})=0$. Now we sum the derived statement for $i=1,\dots,n$ and get

wn(t)w2\displaystyle\|w_{n}^{(t)}-w_{*}\|^{2} w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2\displaystyle\leq\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
+ηt(2M+1)1ni=1nf(w;π(t)(i))2+ηtni=1nwi1(t)w0(t)2\displaystyle\qquad+\eta_{t}\left(\frac{2}{M}+1\right)\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{*};\pi^{(t)}(i))\|^{2}+\frac{\eta_{t}}{n}\sum_{i=1}^{n}\|w_{i-1}^{(t)}-w_{0}^{(t)}\|^{2}
+2NηtM1ni=1nwi(t)w0(t)2\displaystyle\qquad+\frac{2N\eta_{t}}{M}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
(8),(22)w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2\displaystyle\overset{\eqref{variance_solution},\eqref{eq_thm_weight_01}}{\leq}\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
+(2M+1)ηtσ2+8L2ηt33w0(t)w2+16L2ηt53σ2+ 2ηt3σ2\displaystyle\qquad+\left(\frac{2}{M}+1\right)\eta_{t}\sigma_{*}^{2}+\frac{8L^{2}\eta_{t}^{3}}{3}\|w_{0}^{(t)}-w_{*}\|^{2}\ +\ \frac{16L^{2}\eta_{t}^{5}}{3}\sigma_{*}^{2}\ +\ 2\eta_{t}^{3}\sigma_{*}^{2}
+2NηtM1ni=1nwi(t)w0(t)2\displaystyle\qquad+\frac{2N\eta_{t}}{M}\frac{1}{n}\sum_{i=1}^{n}\|w_{i}^{(t)}-w_{0}^{(t)}\|^{2}
(23)w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2\displaystyle\overset{\eqref{eq_thm_weight_04}}{\leq}\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
+(2M+1)ηtσ2+8L2ηt33w0(t)w2+16L2ηt53σ2+ 2ηt3σ2\displaystyle\qquad+\left(\frac{2}{M}+1\right)\eta_{t}\sigma_{*}^{2}+\frac{8L^{2}\eta_{t}^{3}}{3}\|w_{0}^{(t)}-w_{*}\|^{2}\ +\ \frac{16L^{2}\eta_{t}^{5}}{3}\sigma_{*}^{2}\ +\ 2\eta_{t}^{3}\sigma_{*}^{2}
+16NL2ηt33Mw0(t)w2+32NL2ηt53Mσ2+4Nηt3Mσ2\displaystyle\qquad+\frac{16NL^{2}\eta_{t}^{3}}{3M}\|w_{0}^{(t)}-w_{*}\|^{2}+\frac{32NL^{2}\eta_{t}^{5}}{3M}\sigma_{*}^{2}+\frac{4N\eta_{t}^{3}}{M}\sigma_{*}^{2}
+8NL2ηt3Mnw0(t)w2+16NL2ηt5Mnσ2,\displaystyle\qquad+\frac{8NL^{2}\eta_{t}^{3}}{Mn}\|w_{0}^{(t)}-w_{*}\|^{2}+\frac{16NL^{2}\eta_{t}^{5}}{Mn}\sigma_{*}^{2},

where we apply the derivations from Lemma 3. Now noting that $\eta_{t}\leq\frac{1}{2L}$ and $\frac{1}{n}\leq 1$, and rearranging the terms, we get:

wn(t)w2\displaystyle\|w_{n}^{(t)}-w_{*}\|^{2} w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2\displaystyle\leq\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}
+(8L23+16NL23M+8NL2M)ηt3w0(t)w2\displaystyle\qquad+\left(\frac{8L^{2}}{3}+\frac{16NL^{2}}{3M}+\frac{8NL^{2}}{M}\right)\eta_{t}^{3}\|w_{0}^{(t)}-w_{*}\|^{2}
+(2M+1+13L2+12L2+2N3ML2+NML2+NML2)ηtσ2\displaystyle\qquad+\left(\frac{2}{M}+1+\frac{1}{3L^{2}}+\frac{1}{2L^{2}}+\frac{2N}{3ML^{2}}+\frac{N}{ML^{2}}+\frac{N}{ML^{2}}\right)\eta_{t}\sigma_{*}^{2}

Bounding the coefficients above by $B_{1}$ and $B_{2}$ defined in (25) and using $w_{n}^{(t)}=w_{0}^{(t+1)}=\tilde{w}_{t}$, we have the desired result in (24). ∎

C.2 Proof of Lemma 2

Proof.

From (24) where B1B_{1} and B2B_{2} are defined in (25), we have

w0(t+1)w2\displaystyle\qquad\|w_{0}^{(t+1)}-w_{*}\|^{2}
(1+B1ηt3)w0(t)w2ηt2M1ni=1nf(wi1(t);π(t)(i))2+B2ηtσ2\displaystyle\leq\left(1+B_{1}\eta_{t}^{3}\right)\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))\|^{2}+B_{2}\eta_{t}\sigma_{*}^{2}
(a)(1+B1ηt3)w0(t)w2γγ+1ηt2M1ni=1nf(w0(t);π(t)(i))2\displaystyle\overset{(a)}{\leq}\left(1+B_{1}\eta_{t}^{3}\right)\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\gamma}{\gamma+1}\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{0}^{(t)};\pi^{(t)}(i))\|^{2}
+ηtγ2M1ni=1nf(wi1(t);π(t)(i))f(w0(t);π(t)(i))2+B2ηtσ2\displaystyle\qquad+\frac{\eta_{t}\gamma}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{i-1}^{(t)};\pi^{(t)}(i))-\nabla f(w_{0}^{(t)};\pi^{(t)}(i))\|^{2}+B_{2}\eta_{t}\sigma_{*}^{2}
(2)(1+B1ηt3)w0(t)w2γγ+1ηt2M1ni=1nf(w0(t);π(t)(i))2\displaystyle\overset{\tiny\eqref{eq:Lsmooth_basic}}{\leq}\left(1+B_{1}\eta_{t}^{3}\right)\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\gamma}{\gamma+1}\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{0}^{(t)};\pi^{(t)}(i))\|^{2}
\displaystyle\qquad+\frac{\eta_{t}\gamma L^{2}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|w_{i-1}^{(t)}-w_{0}^{(t)}\|^{2}+B_{2}\eta_{t}\sigma_{*}^{2}
(22)(1+B1ηt3)w0(t)w2γγ+1ηt2M1ni=1nf(w0(t);π(t)(i))2\displaystyle\overset{\tiny\eqref{eq_thm_weight_01}}{\leq}\left(1+B_{1}\eta_{t}^{3}\right)\|w_{0}^{(t)}-w_{*}\|^{2}-\frac{\gamma}{\gamma+1}\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{0}^{(t)};\pi^{(t)}(i))\|^{2}
+ηt3γL22M(8L23w0(t)w2+16L2σ23ηt2+ 2σ2)+B2ηtσ2\displaystyle\qquad+\frac{\eta_{t}^{3}\gamma L^{2}}{2M}\left(\frac{8L^{2}}{3}\|w_{0}^{(t)}-w_{*}\|^{2}\ +\ \frac{16L^{2}\sigma_{*}^{2}}{3}\cdot\eta_{t}^{2}\ +\ 2\sigma_{*}^{2}\right)+B_{2}\eta_{t}\sigma_{*}^{2}
=(1+B1ηt3+4ηt3γL43M)w0(t)w2+(B2+ηt2γL2M+8ηt4γL43M)ηtσ2\displaystyle=\left(1+B_{1}\eta_{t}^{3}+\frac{4\eta_{t}^{3}\gamma L^{4}}{3M}\right)\|w_{0}^{(t)}-w_{*}\|^{2}+\left(B_{2}+\frac{\eta_{t}^{2}\gamma L^{2}}{M}+\frac{8\eta_{t}^{4}\gamma L^{4}}{3M}\right)\eta_{t}\sigma_{*}^{2}
γγ+1ηt2M1ni=1nf(w0(t);π(t)(i))2\displaystyle\qquad-\frac{\gamma}{\gamma+1}\frac{\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w_{0}^{(t)};\pi^{(t)}(i))\|^{2}
(5)(1+ηt3(B1+4γL43M))w0(t)w2+(B2+ηt2γL2M+8ηt4γL43M)ηtσ2\displaystyle\overset{\tiny\eqref{eq:average_PL_fi}}{\leq}\left(1+\eta_{t}^{3}\left(B_{1}+\frac{4\gamma L^{4}}{3M}\right)\right)\|w_{0}^{(t)}-w_{*}\|^{2}+\left(B_{2}+\frac{\eta_{t}^{2}\gamma L^{2}}{M}+\frac{8\eta_{t}^{4}\gamma L^{4}}{3M}\right)\eta_{t}\sigma_{*}^{2}
γγ+12μηt2M1ni=1n[f(w0(t);π(t)(i))fi]\displaystyle\qquad-\frac{\gamma}{\gamma+1}\frac{2\mu\eta_{t}}{2M}\frac{1}{n}\sum_{i=1}^{n}[f(w_{0}^{(t)};\pi^{(t)}(i))-f_{i}^{*}]
(3)(1+ηt3(B1+4γL43M))w0(t)w2+(B2+ηt2γL2M+8ηt4γL43M)ηtσ2\displaystyle\overset{\tiny\eqref{eq_loss_great_01}}{\leq}\left(1+\eta_{t}^{3}\left(B_{1}+\frac{4\gamma L^{4}}{3M}\right)\right)\|w_{0}^{(t)}-w_{*}\|^{2}+\left(B_{2}+\frac{\eta_{t}^{2}\gamma L^{2}}{M}+\frac{8\eta_{t}^{4}\gamma L^{4}}{3M}\right)\eta_{t}\sigma_{*}^{2}
γγ+1μηtM[F(w0(t))F]\displaystyle\qquad-\frac{\gamma}{\gamma+1}\frac{\mu\eta_{t}}{M}[F(w_{0}^{(t)})-F_{*}]
(b)(1+ηt3(B1+4γL43M))w0(t)w2+(B2+γ4M+γ6M)ηtσ2\displaystyle\overset{(b)}{\leq}\left(1+\eta_{t}^{3}\left(B_{1}+\frac{4\gamma L^{4}}{3M}\right)\right)\|w_{0}^{(t)}-w_{*}\|^{2}+\left(B_{2}+\frac{\gamma}{4M}+\frac{\gamma}{6M}\right)\eta_{t}\sigma_{*}^{2}
γγ+1μηtM[F(w0(t))F],\displaystyle\qquad-\frac{\gamma}{\gamma+1}\frac{\mu\eta_{t}}{M}[F(w_{0}^{(t)})-F_{*}],

where (a)(a) follows since b2γab2γγ+1a2-\|b\|^{2}\leq\gamma\|a-b\|^{2}-\frac{\gamma}{\gamma+1}\|a\|^{2} for any γ>0\gamma>0 and (b)(b) follows since ηt12L\eta_{t}\leq\frac{1}{2L}. Since w0(t+1)=w~tw_{0}^{(t+1)}=\tilde{w}_{t}, we obtain the desired result in (14). ∎
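For completeness, inequality $(a)$ is a standard Young-type estimate: for any $\gamma>0$,

\|a\|^{2}=\|(a-b)+b\|^{2}\leq(1+\gamma)\|a-b\|^{2}+\Big(1+\frac{1}{\gamma}\Big)\|b\|^{2},

and multiplying both sides by $\frac{\gamma}{\gamma+1}$ gives $\frac{\gamma}{\gamma+1}\|a\|^{2}\leq\gamma\|a-b\|^{2}+\|b\|^{2}$, which rearranges to the inequality used in $(a)$.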

C.3 Proof of Theorem 2

Proof.

For $t=1,\dots,T=\frac{\lambda}{\varepsilon^{3/2}}$ for some $\lambda>0$, we have

ηt\displaystyle\eta_{t} =(1+C1D3ε3/2)ηt1=(1+C1D3ε3/2)tη0(1+C1D3ε3/2)Tη0\displaystyle=(1+C_{1}D^{3}\varepsilon^{3/2})\eta_{t-1}=(1+C_{1}D^{3}\varepsilon^{3/2})^{t}\eta_{0}\leq(1+C_{1}D^{3}\varepsilon^{3/2})^{T}\eta_{0}
=(1+C1D3ε3/2)λ/ε3/2η0=(1+C1D3ε3/2)λ/ε3/2Dε(1+C1D3ε3/2)exp(λC1D3)\displaystyle\qquad=(1+C_{1}D^{3}\varepsilon^{3/2})^{\lambda/\varepsilon^{3/2}}\eta_{0}=(1+C_{1}D^{3}\varepsilon^{3/2})^{\lambda/\varepsilon^{3/2}}\frac{D\sqrt{\varepsilon}}{(1+C_{1}D^{3}\varepsilon^{3/2})\exp(\lambda C_{1}D^{3})}
Dε(1+C1D3ε3/2)min{n2M,12L},\displaystyle\qquad\leq\frac{D\sqrt{\varepsilon}}{(1+C_{1}D^{3}\varepsilon^{3/2})}\leq\min\left\{\frac{n}{2M},\frac{1}{2L}\right\}, (27)

since $(1+x)^{1/x}\leq e$ for all $x>0$. From (14), we have

[F(w~t1)F]\displaystyle[F(\tilde{w}_{t-1})-F_{*}] 1C3(1ηt+C1ηt2)w~t1w21C3ηtw~tw2+C2C3σ2.\displaystyle\leq\frac{1}{C_{3}}\left(\frac{1}{\eta_{t}}+C_{1}\eta_{t}^{2}\right)\|\tilde{w}_{t-1}-w_{*}\|^{2}-\frac{1}{C_{3}\eta_{t}}\|\tilde{w}_{t}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}. (28)

We proceed to prove the following inequality for t=1,,Tt=1,\dots,T,

1ηt+C1ηt21ηt1.\displaystyle\frac{1}{\eta_{t}}+C_{1}\eta_{t}^{2}\leq\frac{1}{\eta_{t-1}}. (29)

From (27), and ηt=Kηt1\eta_{t}=K\eta_{t-1} where K=(1+C1D3ε3/2)K=(1+C_{1}D^{3}\varepsilon^{3/2}), we have

C1ηt2\displaystyle C_{1}\eta_{t}^{2} =C1K2ηt12=C1K2ηt13ηt1\displaystyle=C_{1}K^{2}\eta_{t-1}^{2}=C_{1}K^{2}\frac{\eta_{t-1}^{3}}{\eta_{t-1}}
C1K2D3ε3/2K3ηt1=C1D3ε3/2Kηt1\displaystyle\leq C_{1}K^{2}\frac{D^{3}\varepsilon^{3/2}}{K^{3}\eta_{t-1}}=\frac{C_{1}D^{3}\varepsilon^{3/2}}{K\eta_{t-1}} since ηt1DεK\displaystyle\text{since }\eta_{t-1}\leq\frac{D\sqrt{\varepsilon}}{K}
=K1K1ηt1=1ηt11Kηt1\displaystyle=\frac{K-1}{K}\frac{1}{\eta_{t-1}}=\frac{1}{\eta_{t-1}}-\frac{1}{K\eta_{t-1}} since K=(1+C1D3ε3/2)\displaystyle\text{since }K=(1+C_{1}D^{3}\varepsilon^{3/2})
=1ηt11Kηt1=1ηt11ηt,\displaystyle=\frac{1}{\eta_{t-1}}-\frac{1}{K\eta_{t-1}}=\frac{1}{\eta_{t-1}}-\frac{1}{\eta_{t}}, since ηt=Kηt1.\displaystyle\text{since }\eta_{t}=K\eta_{t-1}.

for $t=1,\dots,T$, which proves (29). Hence, combining (29) with (28), we have

[F(w~t1)F]\displaystyle[F(\tilde{w}_{t-1})-F_{*}] 1C3(1ηt+C1ηt2)w~t1w21C3ηtw~tw2+C2C3σ2\displaystyle\leq\frac{1}{C_{3}}\left(\frac{1}{\eta_{t}}+C_{1}\eta_{t}^{2}\right)\|\tilde{w}_{t-1}-w_{*}\|^{2}-\frac{1}{C_{3}\eta_{t}}\|\tilde{w}_{t}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}
1C3ηt1w~t1w21C3ηtw~tw2+C2C3σ2.\displaystyle\leq\frac{1}{C_{3}\eta_{t-1}}\|\tilde{w}_{t-1}-w_{*}\|^{2}-\frac{1}{C_{3}\eta_{t}}\|\tilde{w}_{t}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}.

Averaging the statement above for t=1,,Tt=1,\dots,T, we have

1Tt=1T[F(w~t1)F]\displaystyle\frac{1}{T}\sum_{t=1}^{T}[F(\tilde{w}_{t-1})-F_{*}] 1C3η0Tw~0w2+C2C3σ2\displaystyle\leq\frac{1}{C_{3}\eta_{0}T}\|\tilde{w}_{0}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}
=(a)Kexp(λC1D3)C3Dεε3/2λw~0w2+C2C3σ2\displaystyle\overset{(a)}{=}\frac{K\exp(\lambda C_{1}D^{3})}{C_{3}D\sqrt{\varepsilon}}\frac{\varepsilon^{3/2}}{\lambda}\|\tilde{w}_{0}-w_{*}\|^{2}+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}
=Kexp(λC1D3)C3Dλw~0w2ε+C2C3σ2,\displaystyle=\frac{K\exp(\lambda C_{1}D^{3})}{C_{3}D\lambda}\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}\sigma_{*}^{2},

where (a)(a) follows since η0=DεKexp(λC1D3)\eta_{0}=\frac{D\sqrt{\varepsilon}}{K\exp(\lambda C_{1}D^{3})} and T=λε3/2T=\frac{\lambda}{\varepsilon^{3/2}}. ∎
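The increasing step-size schedule used above is easy to check numerically. The sketch below uses arbitrary illustrative constants (chosen so that $C_{1}D^{3}\lambda=1$) and verifies that $\eta_{t}\leq\frac{D\sqrt{\varepsilon}}{K}$ for every epoch and that inequality (29) holds along the whole schedule.

import math

# Illustrative constants only (any positive values with C1 * D**3 * lam = 1 work here).
C1, D, eps = 2.0, 0.5, 1e-2
lam = 1.0 / (C1 * D ** 3)                       # so that C1 * D^3 * lam = 1
T = math.floor(lam / eps ** 1.5)                # T = floor(lambda / eps^{3/2}) epochs (floor guards against rounding)
K = 1.0 + C1 * D ** 3 * eps ** 1.5
eta = D * math.sqrt(eps) / (K * math.exp(lam * C1 * D ** 3))   # eta_0

for t in range(1, T + 1):
    eta_prev, eta = eta, K * eta                # eta_t = K * eta_{t-1}
    assert eta <= D * math.sqrt(eps) / K + 1e-12                 # bound (27)
    assert 1.0 / eta + C1 * eta ** 2 <= 1.0 / eta_prev + 1e-12   # inequality (29)
print("schedule verified for", T, "epochs")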

C.4 Proof of Corollary 1

Proof.

Choosing $\gamma=\frac{1}{L^{2}}$, we have

\displaystyle\begin{cases}C_{1}=\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4L^{2}}{3M},\\ C_{2}=\frac{2}{M}+1+\frac{5}{6L^{2}}+\frac{8N}{3ML^{2}}+\frac{5}{12ML^{2}},\\ C_{3}=\frac{1}{L^{2}+1}\frac{\mu}{M}.\end{cases}

Note that $K=1+C_{1}D^{3}\varepsilon^{3/2}$ and $C_{1}D^{3}=1/\lambda$; hence $K=1+\varepsilon^{3/2}/\lambda=1+1/T\leq 2$. We continue from the statement of Theorem 2 and the choice $C_{1}D^{3}\lambda=1$:

1Tt=1T[F(w~t1)F]\displaystyle\frac{1}{T}\sum_{t=1}^{T}[F(\tilde{w}_{t-1})-F_{*}] Kexp(λC1D3)C3Dλw~0w2ε+C2C3σ2\displaystyle\leq\frac{K\exp(\lambda C_{1}D^{3})}{C_{3}D\lambda}\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}\sigma_{*}^{2}
2C1D3λC1D2exp(λC1D3)C3w~0w2ε+C2C3σ2\displaystyle\leq\frac{2}{C_{1}D^{3}\lambda}\cdot\frac{C_{1}D^{2}\exp(\lambda C_{1}D^{3})}{C_{3}}\cdot\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}\sigma_{*}^{2} since K2\displaystyle\text{ since }K\leq 2
2C1D2eC3w~0w2ε+C2C3σ2\displaystyle\leq\frac{2C_{1}D^{2}e}{C_{3}}\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}\sigma_{*}^{2} since C1D3λ=1\displaystyle\text{ since }C_{1}D^{3}\lambda=1
2C1D2eC3w~0w2ε+C2C3Pε\displaystyle\leq\frac{2C_{1}D^{2}e}{C_{3}}\|\tilde{w}_{0}-w_{*}\|^{2}\cdot\varepsilon+\frac{C_{2}}{C_{3}}P\varepsilon equation (9)\displaystyle\text{ equation }\eqref{eq_small_variance}
(2C1D2eC3w~0w2+C2PC3)ε=Gε\displaystyle\leq\left(\frac{2C_{1}D^{2}e}{C_{3}}\|\tilde{w}_{0}-w_{*}\|^{2}+\frac{C_{2}P}{C_{3}}\right)\varepsilon=G\varepsilon

with

G=2C1D2eC3w~0w2+C2PC3.\displaystyle G=\frac{2C_{1}D^{2}e}{C_{3}}\|\tilde{w}_{0}-w_{*}\|^{2}+\frac{C_{2}P}{C_{3}}.

Let 0<ε10<\varepsilon\leq 1 and choose ε^=Gε\hat{\varepsilon}=G\varepsilon. Then the number of iterations TT is

T\displaystyle T =λε3/2=λG3/2ε^3/2\displaystyle=\frac{\lambda}{{\varepsilon}^{3/2}}=\frac{\lambda G^{3/2}}{\hat{\varepsilon}^{3/2}}
=1ε^3/2C1D3(2C1D2ew~0w2+C2PC3)3/2\displaystyle=\frac{1}{\hat{\varepsilon}^{3/2}C_{1}D^{3}}\left(\frac{2C_{1}D^{2}e\|\tilde{w}_{0}-w_{*}\|^{2}+C_{2}P}{C_{3}}\right)^{3/2}
=1ε^3/21C1D3C33/2(2C1D2ew~0w2+C2P)3/2\displaystyle=\frac{1}{\hat{\varepsilon}^{3/2}}\cdot\frac{1}{C_{1}D^{3}C_{3}^{3/2}}\left(2C_{1}D^{2}e\|\tilde{w}_{0}-w_{*}\|^{2}+C_{2}P\right)^{3/2}
\displaystyle=\frac{1}{\hat{\varepsilon}^{3/2}}\cdot\frac{\left(2D^{2}e\|\tilde{w}_{0}-w_{*}\|^{2}\left(\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4L^{2}}{3M}\right)+\left(\frac{2+M}{M}+\frac{5}{6L^{2}}+\frac{8N}{3ML^{2}}+\frac{5}{12ML^{2}}\right)P\right)^{3/2}}{\left(\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4L^{2}}{3M}\right)D^{3}\left(\frac{1}{L^{2}+1}\frac{\mu}{M}\right)^{3/2}}

to guarantee

min1tT[F(w~t1)F]1Tt=1T[F(w~t1)F]ε^.\displaystyle\min_{1\leq t\leq T}[F(\tilde{w}_{t-1})-F_{*}]\leq\frac{1}{T}\sum_{t=1}^{T}[F(\tilde{w}_{t-1})-F_{*}]\leq\hat{\varepsilon}.

Hence, the total complexity (number of individual gradient computations needed to reach ε^\hat{\varepsilon} accuracy) is 𝒪(nε^3/2)\mathcal{O}\left(\frac{n}{\hat{\varepsilon}^{3/2}}\right).
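As a purely illustrative arithmetic instance (assuming the constant $\lambda G^{3/2}$ is of order one, which is not claimed by the analysis), reaching accuracy $\hat{\varepsilon}=10^{-2}$ on a dataset with $n=10^{4}$ samples would require on the order of

T=\frac{\lambda G^{3/2}}{\hat{\varepsilon}^{3/2}}=\mathcal{O}(10^{3})\ \text{epochs, i.e., } \ nT=\mathcal{O}(10^{7})\ \text{individual gradient evaluations.}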

If we further assume that L,M,N>1L,M,N>1:

\displaystyle T=\frac{1}{\hat{\varepsilon}^{3/2}}\cdot\frac{\left(2D^{2}e\|\tilde{w}_{0}-w_{*}\|^{2}\left(\frac{8ML^{2}}{3}+14NL^{2}+\frac{4L^{2}}{3}\right)+\left(2+M+\frac{5M}{6L^{2}}+\frac{8N}{3L^{2}}+\frac{5}{12L^{2}}\right)P\right)^{3/2}}{\left(\frac{8L^{2}}{3}+\frac{14NL^{2}}{M}+\frac{4L^{2}}{3M}\right)D^{3}\left(\frac{\mu}{L^{2}+1}\right)^{3/2}}
1ε^3/2(𝒪((M+N)L2))3/2𝒪(1/L2)(L2+1μ)3/2\displaystyle\leq\frac{1}{\hat{\varepsilon}^{3/2}}\cdot\left(\mathcal{O}((M+N)L^{2})\right)^{3/2}\cdot\mathcal{O}\left(1/L^{2}\right)\cdot\left(\frac{L^{2}+1}{\mu}\right)^{3/2}
=𝒪(L4(M+N)3/2μ3/21ε^3/2)\displaystyle=\mathcal{O}\left(\frac{L^{4}(M+N)^{3/2}}{\mu^{3/2}}\cdot\frac{1}{\hat{\varepsilon}^{3/2}}\right)

and the complexity is 𝒪(L4(M+N)3/2μ3/2nε^3/2)\mathcal{O}\left(\frac{L^{4}(M+N)^{3/2}}{\mu^{3/2}}\cdot\frac{n}{\hat{\varepsilon}^{3/2}}\right). ∎