SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization
Abstract
We consider distributed learning scenarios where machines interact with a parameter server along several communication rounds in order to minimize a joint objective function. Focusing on the heterogeneous case, where different machines may draw samples from different data distributions, we design the first local update method that provably benefits over the two most prominent distributed baselines: namely Minibatch-SGD and Local-SGD. Key to our approach is a slow querying technique that we customize to the distributed setting, which in turn enables a better mitigation of the bias caused by local updates.
1 Introduction
Federated Learning (FL) is a framework that enables huge-scale collaborative learning among a large number of heterogeneous clients (or machines); heterogeneous here refers to the data of each client, whose statistical properties may vary between different clients. FL may potentially promote fairness among participants, by allowing clients with small-scale datasets to participate in the learning process and affect the resulting model. Additionally, participants are not required to directly share data, which may improve privacy. For these reasons, FL has gained popularity in the past years, and found use in applications like voice recognition (APP, 2019; GGL, 2021), fraud detection (INT, 2020), drug discovery (MEL, 2020), and more (Wang et al., 2021a).
The two most prominent algorithmic approaches towards federated learning are Minibatch-SGD
(Dekel et al., 2012) and Local-SGD (a.k.a. Federated-Averaging) (Mangasarian and Solodov, 1993; McMahan et al., 2017; Stich, 2019).
In Minibatch-SGD all machines (or clients) always compute unbiased gradient estimates of the same query points, while using large
batch sizes; and it is well known that this approach is not degraded due to data heterogeneity (Woodworth et al., 2020b). On the downside, the number of model updates made by Minibatch-SGD may be considerably smaller than the number of gradient queries made by each machine, which is due to the use of minibatches. This suggests that there may be room to improve over this approach by employing local update methods like
Local-SGD, where the number of model updates and the number of gradient queries are the same.
And indeed, in the past years, local update methods have been extensively investigated, see e.g. (Kairouz et al., 2021) and references therein.
We can roughly divide the research on FL into two scenarios: the homogeneous case, where it is assumed that the data on each machine is drawn from the same distribution; and the more realistic heterogeneous case, where data distributions may vary between machines.
[Table 1 (Method | Rate): comparison of Minibatch-SGD, Local-SGD, SLowcal-SGD (this work), and accelerated Minibatch-SGD for the heterogeneous SCO setting; see Appendix A for the computations behind the table.]
For the homogeneous case it was shown in (Woodworth et al., 2020a; Glasgow et al., 2022) that the standard Local-SGD method is not superior to Minibatch-SGD. Nevertheless, (Yuan and Ma, 2020) have designed an accelerated variant of Local-SGD that provably benefits over the Minibatch baseline. These results are established for the fundamental Stochastic Convex Optimization (SCO) setting, which assumes that the learning objective is convex.
Similarly to the homogeneous case, it was shown in (Woodworth et al., 2020b; Glasgow et al., 2022) that Local-SGD is not superior to Minibatch-SGD in heterogeneous scenarios. Nevertheless, several local approaches that compare with the Minibatch baseline were designed in (Karimireddy et al., 2020b; Gorbunov et al., 2021). Unfortunately, we have so far been missing a local method that provably benefits over the Minibatch baseline in the heterogeneous SCO setting.
Our work focuses on the latter heterogeneous SCO setting, and provides a new Local-SGD-style algorithm that provably benefits over the minibatch baseline. Our algorithm, named SLowcal-SGD, builds on customizing a recent technique for incorporating a slowly-changing sequence of query points (Cutkosky, 2019; Kavis et al., 2019), which in turn enables us to better mitigate the bias induced by the local updates. Curiously, we also found importance weighting to be crucial in order to surpass the minibatch baseline.
In Table 1 we compare our results to the state-of-the-art methods for the heterogeneous SCO setting. We denote by $M$ the number of machines, by $K$ the number of local updates per round, and by $R$ the number of communication rounds; additionally, $\zeta_*$ (or $\zeta$) measures the dissimilarity between machines (see Section 2). Our table shows that Local-SGD requires many more communication rounds than Minibatch-SGD, and that the dissimilarity substantially degrades its performance. Conversely, one can see that even in the presence of dissimilarity, our approach SLowcal-SGD still requires fewer communication rounds than Minibatch-SGD.
Similarly to the homogeneous case, accelerated-Minibatch-SGD (Dekel et al., 2012; Lan, 2012) obtains the best performance among all current methods, and it remains open whether one can outperform this accelerated minibatch baseline. In App. A we elaborate on the computations behind Table 1.
Related Work. We focus here on centralized learning problems, where we aim to employ $M$ machines in order to minimize a joint learning objective. We allow the machines to synchronize during communication rounds through a central machine called the Parameter Server (PS), and allow each machine to draw samples and perform local gradient computations in every such communication round. We assume that each machine $m$ may draw i.i.d. samples from a distribution $\mathcal{D}_m$, which may vary between machines.
The most natural approach in this context is Minibatch-SGD and its accelerated variant (Dekel et al., 2012), which have been widely adopted both in academia and in industry, see e.g. (Hoffer et al., 2017; Smith et al., 2017; You et al., 2019). Local update methods like Local-SGD (McMahan et al., 2017) have recently gained much popularity due to the rise of FL, and have been extensively explored in the past years.
Focusing on the SCO setting, it is well known that the standard Local-SGD is not superior (actually in most regimes it is inferior) to Minibatch-SGD (Woodworth et al., 2020a, b; Glasgow et al., 2022). Nevertheless, (Yuan and Ma, 2020) devised a novel accelerated local approach that provably surpasses the Minibatch baseline in the homogeneous case.
The heterogeneous SCO case has also been extensively investigated, with several original and elegant approaches (Koloskova et al., 2020; Khaled et al., 2020; Karimireddy et al., 2020b; Woodworth et al., 2020b; Mitra et al., 2021; Gorbunov et al., 2021; Mishchenko et al., 2022; Patel et al.). Nevertheless, so far we have been missing a local approach that provably benefits over Minibatch-SGD. Note that Mitra et al. (2021); Mishchenko et al. (2022) improve the communication complexity with respect to the condition number of the objective; however, their performance does not improve as we increase the number of machines (implying that such methods do not obtain a wall-clock speedup from adding machines), which is inferior to the minibatch baseline.
The heterogeneous non-convex setting was also extensively explored (Karimireddy et al., 2020b, a; Gorbunov et al., 2021); and the recent work of (Patel et al., ) has developed a novel algorithm that provably benefits over the minibatch baseline in this case. The latter work also provides a lower bound which demonstrates that their upper bound is almost tight. Finally, for the special case of quadratic loss functions, it was shown in (Woodworth et al., 2020a) and in (Karimireddy et al., 2020b) that it is possible to surpass the minibatch baseline.
It is important to note that, excluding the special case of quadratic losses, there does not exist a local update algorithm that provably benefits over accelerated-Minibatch-SGD (Dekel et al., 2012); this applies to both homogeneous and heterogeneous SCO problems.
Our local update algorithm utilizes a recent technique of employing slowly-changing query points in SCO problems (Cutkosky, 2019). The latter has been shown to be useful in designing universal accelerated methods (Kavis et al., 2019; Ene et al., 2021; Antonakopoulos et al., 2022), as well as in improving asynchronous training methods (Aviv et al., 2021).
2 Setting: Parallel Stochastic Optimization
We consider parallel stochastic optimization problems where the objective $f: \mathbb{R}^d \to \mathbb{R}$ is convex and of the following form,
$$f(x) := \frac{1}{M}\sum_{m=1}^{M} f_m(x), \quad \text{where } f_m(x) := \mathbb{E}_{z \sim \mathcal{D}_m}\left[f_m(x; z)\right].$$
Thus, the objective is an average of $M$ functions $\{f_m\}_{m=1}^{M}$, and each such $f_m$ can be written as an expectation over losses $f_m(\cdot; z)$, where the samples $z$ are drawn from a distribution $\mathcal{D}_m$ which is unknown to the learner. For ease of notation, in what follows we will not explicitly denote $z \sim \mathcal{D}_m$, but rather use $\mathbb{E}$ to denote the expectation w.r.t. all randomization.
We assume that there exist $M$ machines (computation units), and that each machine $m$ may independently draw samples from the distribution $\mathcal{D}_m$, and can therefore compute unbiased estimates of the gradients of $f_m$. Most commonly, we allow the machines to synchronize during communication rounds through a central machine called the Parameter Server (PS), and allow each machine to perform local computations in every such communication round.
We consider first-order optimization methods that iteratively employ samples and generate a sequence of query points, eventually outputting a solution. Our performance measure is the expected excess loss of the output solution relative to a global minimum $w^*$ of $f$ in $\mathbb{R}^d$, where the expectation is w.r.t. the randomization of the samples.
More concretely, at every computation step, each machine $m$ may draw a fresh sample $z \sim \mathcal{D}_m$ and compute a gradient estimate at a given point $x$ as follows, $g = \nabla f_m(x; z)$; note that $\mathbb{E}[g] = \nabla f_m(x)$, i.e., $g$ is an unbiased estimate of $\nabla f_m(x)$.
General Parallelization Scheme.
A general scheme for parallel stochastic optimization is described in Alg. 1, and a minimal code sketch appears below. The PS communicates with the machines along $R$ communication rounds. In every round $r$ the PS distributes an anchor point, which is a starting point for the local computations in that round. Based on the anchor, each machine performs $K$ local gradient computations based on i.i.d. draws from its distribution, and yields a message. At the end of round $r$ the PS aggregates the messages from all machines and updates the anchor point. Finally, after the last round, the PS outputs a solution, which is computed based on the anchor points.
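The following Python sketch illustrates this round-based scheme; the helpers `local_computation` and `aggregate` are an assumed interface standing in for the method-specific steps:

```python
# A minimal sketch of the generic round-based scheme of Alg. 1 (an assumed
# interface, not the paper's exact pseudocode): in each of R rounds the PS
# broadcasts an anchor point, every machine runs its local computation on its
# own data, and the PS aggregates the resulting messages into the next anchor.
import numpy as np

def parallel_scheme(x0, R, M, local_computation, aggregate):
    x = x0                   # anchor point held by the parameter server (PS)
    anchors = [x0]
    for r in range(R):
        # each machine m receives the anchor and returns a message
        messages = [local_computation(x, machine=m) for m in range(M)]
        x = aggregate(x, messages)      # PS update of the anchor point
        anchors.append(x)
    return anchors           # the output solution is computed from the anchors
```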
Ideally, one would hope that using $M$ machines in parallel will enable us to accelerate the learning process by a factor of $M$. A rich line of works has shown that this is indeed possible to some extent, depending on the problem parameters and on the parallelization algorithm.
Next, we describe the two most prominent approaches to first-order parallel optimization.
(i) Minibatch SGD:
In terms of Alg. 1, one can describe Minibatch-SGD as an algorithm in which the PS sends a weight vector $x_r$ in every round $r$ as the anchor point. Based on that anchor, each machine $m$ computes an unbiased gradient estimate based on $K$ independent samples from $\mathcal{D}_m$, i.e., $g_r^m := \frac{1}{K}\sum_{k=1}^{K} \nabla f_m(x_r; z_{r,k}^m)$, and communicates it as the message to the PS. The latter aggregates the messages and computes the next anchor point,
$$x_{r+1} = x_r - \eta \cdot \frac{1}{M}\sum_{m=1}^{M} g_r^m,$$
where $\eta$ is the learning rate of the algorithm. The benefit of this approach is that all machines always compute gradient estimates at the same anchor points $\{x_r\}_r$, which highly simplifies its analysis. On the downside, in this approach the number of gradient updates, $R$, is smaller than the number of stochastic gradient computations made by each machine, which is $KR$. This gives the hope that there is room to improve upon Minibatch-SGD by mending this issue. A minimal code sketch follows.
(ii) Local SGD:
In terms of Alg. 1, one can describe Local-SGD as an algorithm in which the PS sends a weight vector $x_r$ in every round $r$ as the anchor information. Based on the anchor, each machine $m$ performs a sequence of $K$ local gradient updates based on independent samples from $\mathcal{D}_m$ as follows,
$$w_{k+1}^m = w_k^m - \eta \nabla f_m(w_k^m; z_k^m), \quad \forall k \in \{0, \dots, K-1\},$$
where for all machines we initialize $w_0^m = x_r$, and $\eta$ is the learning rate of the algorithm. At the end of round $r$, each machine communicates $w_K^m$ as the message to the PS, and the latter computes the next anchor by averaging, $x_{r+1} = \frac{1}{M}\sum_{m=1}^{M} w_K^m$.
In Local-SGD the number of gradient steps is equal to the number of stochastic gradient computations made by each machine, which is $KR$. The latter suggests that such an approach may potentially surpass Minibatch-SGD. Nevertheless, this potential benefit is hindered by the bias that is introduced between different machines during the local updates. And indeed, as we show in Table 1, this approach is inferior to Minibatch-SGD in prevalent regimes. A minimal code sketch follows.
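The sketch below instantiates the scheme for Local-SGD, with the same assumed oracle `grad(x, m)` as above:

```python
# A minimal sketch of Local-SGD (the oracle `grad` is an assumed interface):
# each machine performs K local updates from the shared anchor, so its query
# points drift apart from the other machines'; the PS averages the iterates.
import numpy as np

def local_sgd(x0, R, K, M, grad, eta):
    x = x0
    for r in range(R):
        local_iterates = []
        for m in range(M):
            w = x.copy()                  # initialize from the round's anchor
            for _ in range(K):
                w = w - eta * grad(w, m)  # KR updates in total per machine
            local_iterates.append(w)
        x = np.mean(local_iterates, axis=0)   # PS averages the local models
    return x
```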
Assumptions. We assume that $f$ is convex, and that the $f_m$'s are $L$-smooth, i.e., for all $x, y \in \mathbb{R}^d$,
$$\|\nabla f_m(x) - \nabla f_m(y)\| \le L \|x - y\|.$$
We also assume that the variance of the gradient estimates is bounded, i.e., that there exists $\sigma > 0$ such that for all $x$ and $m$,
$$\mathbb{E}\|\nabla f_m(x; z) - \nabla f_m(x)\|^2 \le \sigma^2.$$
Letting $w^*$ be a global minimum of $f$, we assume there exists $\zeta_* \ge 0$ such that,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le \zeta_*^2. \tag{1}$$
The above assumption, together with the smoothness and convexity, implies (see App. B),
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x)\|^2 \le 4L\left(f(x) - f(w^*)\right) + 2\zeta_*^2, \quad \forall x \in \mathbb{R}^d. \tag{2}$$
A stronger dissimilarity assumption that is often used in the literature is the following,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f(x)\|^2 \le \zeta^2, \quad \forall x \in \mathbb{R}^d. \tag{3}$$
Notation: For $n \in \mathbb{N}$ we denote $[n] := \{1, \dots, n\}$. For a sequence of non-negative weights $\{\alpha_t\}_t$ we denote $\alpha_{1:t} := \sum_{s=1}^{t} \alpha_s$.
3 Our Approach
Section 3.1 describes a basic (single machine) algorithmic template called Anytime-GD. Section 3.2 describes our SLowcal-SGD algorithm, which is a Local-SGD style algorithm in the spirit of Anytime GD. We describe our method in Alg. 2, and state its guarantees in Thm. 2.
3.1 Anytime GD
The standard GD algorithm computes a sequence of iterates and queries the gradients at these iterates. It was recently shown that one can design a GD-style scheme that computes a sequence of iterates $\{w_t\}_t$ yet queries the gradients at a different sequence $\{x_t\}_t$, which may be slowly-changing, in the sense that $\|x_{t+1} - x_t\|$ may be considerably smaller than $\|w_{t+1} - w_t\|$.
Concretely, the Anytime-GD algorithm (Cutkosky, 2019; Kavis et al., 2019), which we describe in Equations (4) and (5), employs a learning rate $\eta > 0$ and a sequence of non-negative weights $\{\alpha_t\}_t$. The algorithm maintains two sequences $\{w_t\}_t, \{x_t\}_t$ that are updated as follows,
$$w_{t+1} = w_t - \eta\, \alpha_t \nabla f(x_t), \tag{4}$$
and then,
$$x_{t+1} = \frac{\alpha_{1:t}\, x_t + \alpha_{t+1}\, w_{t+1}}{\alpha_{1:t+1}}. \tag{5}$$
It can be shown that the above implies $x_t = \frac{1}{\alpha_{1:t}}\sum_{s=1}^{t} \alpha_s w_s$, i.e., the $x_t$'s are weighted averages of the $w_t$'s. Thus, at every iterate $t$ the gradient is queried at $x_t$, which is a weighted average of past iterates, and then $w_{t+1}$ is updated similarly to GD with a weight $\alpha_t$ on the gradient $\nabla f(x_t)$. Moreover, at initialization $x_1 = w_1$.
Curiously, it was shown in Cutkosky (2019) that Anytime-GD obtains the same convergence rates as GD for convex loss functions (both smooth and non-smooth). It was further shown that one can employ a stochastic version (Anytime-SGD), where we query noisy gradients at $x_t$ instead of the exact ones, and that this approach performs similarly to SGD. A minimal sketch follows.
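The sketch below implements single-machine Anytime-SGD per Equations (4)-(5), assuming a stochastic gradient oracle `grad(x)` and a weight function `alpha(t)` (both assumed interfaces):

```python
# A minimal sketch of Anytime-SGD: gradients are queried at the running
# weighted average x_t of the iterates w_t, not at the iterates themselves.
import numpy as np

def anytime_sgd(w1, T, grad, eta, alpha):
    w, x = w1.copy(), w1.copy()      # at initialization x_1 = w_1
    cum = alpha(1)                   # running sum alpha_{1:t}
    for t in range(1, T):
        g = grad(x)                           # query at the averaged point x_t
        w = w - eta * alpha(t) * g            # Eq. (4): weighted gradient step
        cum_next = cum + alpha(t + 1)
        x = (cum * x + alpha(t + 1) * w) / cum_next   # Eq. (5): update average
        cum = cum_next
    return x
```

With `alpha = lambda t: 1` this queries at the uniform average of past iterates; with `alpha = lambda t: t` more mass is put on recent iterates.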
Slowly changing query points. A recent work (Aviv et al., 2021) demonstrates that if we use projected Anytime-SGD, i.e., project the sequence $\{w_t\}_t$ to a given bounded convex domain, then one can immediately show that for both $\alpha_t = 1$ and $\alpha_t = t$ we obtain $\|x_{t+1} - x_t\| \le O(D/t)$, where $D$ is the diameter of the convex domain. Conversely, for standard SGD we have $\|x_{t+1} - x_t\| = \eta \|g_t\|$, where here $g_t$ is a (possibly noisy) unbiased estimate of $\nabla f(x_t)$. Thus, while the change between consecutive SGD queries is controlled by the learning rate $\eta$ and by the magnitude of the stochastic gradients, for Anytime-SGD the change decays with time, irrespective of the learning rate $\eta$. In (Aviv et al., 2021), this is used to design better and more robust asynchronous training methods.
Relation to Momentum. In the appendix we show that Anytime-SGD can be explicitly written as a momentum method, and is therefore quite different from standard SGD. Concretely, for $\alpha_t = 1$ we show that $x_{t+1} = x_t + \frac{t-1}{t+1}(x_t - x_{t-1}) - \frac{\eta}{t+1}\, g_t$, and for $\alpha_t = t$ we show that $x_{t+1} = x_t + \frac{t-1}{t+2}(x_t - x_{t-1}) - \frac{2\eta t}{t+2}\, g_t$; where here $g_t$ is a (possibly noisy) unbiased estimate of $\nabla f(x_t)$.
This momentum interpretation provides a complementary intuition regarding the benefit of Anytime-SGD in the context of local update methods. Momentum brings more stability to the optimization process which in turn reduces the bias between different machines.
For the sake of this paper we will require a theorem that does not necessarily regard Anytime-GD, but is rather more general. We will require the following definition.
Definition. Let $\{\alpha_t\}_t$ be a sequence of non-negative weights, and let $\{w_t\}_t$ be an arbitrary sequence. We say that a sequence $\{x_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{w_t\}_t$ if $x_1 = w_1$, and for any $t \ge 1$ Eq. (5) is satisfied.
Next, we state the main theorem for this section, which applies to any sequence $\{w_t\}_t$.
Theorem 1 (Rephrased from Theorem 1 in Cutkosky (2019)).
Let $f$ be a convex function with a global minimum $w^*$. Also let $\{\alpha_t\}_t$ be non-negative weights, and let $\{w_t\}_t, \{x_t\}_t$ be sequences such that $\{x_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{w_t\}_t$. Then the following holds for any $T$,
$$\alpha_{1:T}\left(f(x_T) - f(w^*)\right) \le \sum_{t=1}^{T} \alpha_t \nabla f(x_t) \cdot (w_t - w^*).$$
3.2 SLowcal-SGD
Our approach is to employ an Anytime version of Local-SGD, which we name SLowcal-SGD.
Notation:
Prior to describing our algorithm, we define $t := rK + k$ to be the total number of per-machine local updates up to step $k$ of round $r$, resulting in $t \in \{0, 1, \dots, KR\}$. In what follows, we will often find it useful to denote the iterates and samples using the index $t$, rather than explicitly denoting $(r, k)$.
Additionally, we use $\{\alpha_t\}_t$ to denote a pre-defined sequence of non-negative weights. Finally, we denote $T := KR$, the total number of local updates per machine.
In the spirit of Anytime-SGD, our approach is to maintain two sequences per machine $m$: iterates $\{w_t^m\}_t$ and query points $\{x_t^m\}_t$. Our approach is depicted explicitly in Alg. 2. Next, we describe our algorithm in terms of the scheme depicted in Alg. 1:
(i) Distributing anchor. At the beginning of round $r$ the PS distributes the anchor (the current aggregated iterate and query point) to all machines.
(ii) Local Computations. Every machine $m$ initializes its local copies at the anchor, and for the next $K$ steps, i.e., for any $k \in \{0, \dots, K-1\}$, performs a sequence of local Anytime-SGD steps as follows,
$$w_{t+1}^m = w_t^m - \eta\, \alpha_t\, g_t^m, \tag{6}$$
where, similarly to Anytime-SGD, we query the gradients at the averages $x_t^m$, meaning $g_t^m := \nabla f_m(x_t^m; z_t^m)$. The query points are updated as weighted averages of past iterates,
$$x_{t+1}^m = \frac{\alpha_{1:t}\, x_t^m + \alpha_{t+1}\, w_{t+1}^m}{\alpha_{1:t+1}}. \tag{7}$$
At the end of round $r$, each machine communicates its local iterate and query point as a message to the PS.
(iii) Aggregation. The PS aggregates the messages and computes the next anchor by averaging across machines, i.e., $w_t = \frac{1}{M}\sum_{m=1}^{M} w_t^m$ and $x_t = \frac{1}{M}\sum_{m=1}^{M} x_t^m$.
Remark: Note that our notation for the local copies $w_t^m, x_t^m$ is inconsistent at round boundaries: at the end of round $r$ these values may vary between different machines, while at the beginning of round $r+1$ these values are all equal to the aggregated anchor.
Nevertheless, for simplicity we will abuse notation, and explicitly state the right definition when needed.
Importantly, in most of our analysis we will mainly need to refer to the machine averages, and note that the latter are consistent at the end and beginning of consecutive rounds due to the definition of the aggregation step. A minimal code sketch of one round follows.
3.2.1 Guarantees & Intuition
Below we state our main result for SLowcal-SGD (Alg. 2),
Theorem 2.
As Table 1 shows, Thm. 2 implies that SLowcal-SGD improves over all existing upper bounds for Minibatch and Local-SGD, by requiring fewer communication rounds to obtain a linear speedup of $M$.
Intuition. The degradation in Local-SGD schemes (both standard and Anytime) is due to the bias introduced between different machines during each round, which leads to a bias in their gradients. Intuitively, this bias is small if the machines query the gradients at a sequence of slowly-changing query points. This is exactly the benefit of SLowcal-SGD, which queries the gradients at the averaged iterates $x_t^m$. Intuitively, these averages are slowly-changing compared to the iterates $w_t^m$ themselves; and recall that the latter are the query points used by standard Local-SGD. A complementary intuition for the benefit of our approach is the interpretation of Anytime-SGD as a momentum method (see Sec. 3.1 and the appendix), which leads to decreased bias between machines.
To further simplify the more technical discussion here, we will assume the homogeneous case, i.e., that for any $m$ we have $\mathcal{D}_m = \mathcal{D}$ and $f_m = f$.
So, a bit more formally, let us discuss the bias between the query points of two machines in a given round $r$. The following holds for standard Local-SGD,
$$w_{t+1}^m = w_t^m - \eta\, \hat{g}_t^m, \tag{9}$$
where $\hat{g}_t^m$ is the noisy gradient that machine $m$ computes at $w_t^m$, and we can write $\hat{g}_t^m = \nabla f(w_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Thus, for two machines $m, m'$ we can write the difference $w_t^m - w_t^{m'}$ as a sum of a gradient-difference term and an accumulated-noise term. And it was shown in Woodworth et al. (2020a) that the noisy term is dominant, and we can therefore bound, for any step $t$ within the round,
$$\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta \sigma \sqrt{K}\right). \tag{10}$$
Similarly, for SLowcal-SGD we would like to bound $\|x_t^m - x_t^{m'}\|$ for two machines $m, m'$; and in order to simplify the discussion we will assume uniform weights, i.e., $\alpha_t = 1$. Now the update rule for the iterates $w_t^m$ is of the same form as in Eq. (9), only now $\hat{g}_t^m = \nabla f(x_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Consequently, the difference $x_t^m - x_t^{m'}$ is an average of the differences of the past iterates, where we take the crude approximation $\alpha_{1:t} \approx t$. Now, by the definition of $x_t^m$ and $w_t^m$, the accumulated noise is averaged over the $t$ past iterates; thus, for two machines $m, m'$, as we show in our analysis, the noisy term is dominant, so we can therefore bound,
$$\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta \sigma K^{3/2}/t\right). \tag{11}$$
Taking $t = \Theta(KR)$ above yields a bound of $O(\eta\sigma\sqrt{K}/R)$. Thus, Equations (10) and (11) illustrate that the bias of SLowcal-SGD is smaller by a factor of $t/K$ (which can be as large as $R$) compared to the bias of standard Local-SGD. In the appendix we demonstrate the same benefit of Anytime-SGD over SGD when both use $\alpha_t = t$.
Finally, note that the biases introduced by the local updates come into play in a slightly different manner in Local-SGD compared to SLowcal-SGD (a major challenge in our analysis is that for a given machine $m$, the sequence $\{x_t^m\}_t$ is not necessarily an $\{\alpha_t\}_t$-weighted average of the $\{w_t^m\}_t$). Consequently, the above discussion does not enable us to demonstrate the exact rates that we derive. Nevertheless, it provides some intuition regarding the benefit of our approach; the full and exact derivations appear in the appendix. The toy simulation below illustrates the smaller inter-machine bias of averaged query points.
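The following toy simulation (ours, for illustration only; all constants are arbitrary) compares the gap between the query points of two machines after K noisy local steps on the same quadratic, under Local-SGD-style updates (query at iterates) versus Anytime-style updates (query at running averages with uniform weights, whose weight history spans the earlier rounds):

```python
# Toy illustration of the smaller inter-machine bias of averaged query points:
# in some round r, two machines run K noisy local steps on f(w) = 0.5*||w||^2
# from a shared anchor; we compare the final gap between their query points.
import numpy as np

rng = np.random.default_rng(0)
K, eta, d, r, trials = 64, 0.1, 10, 9, 200
t0 = r * K                                  # global steps before this round
gap_local, gap_anytime = [], []
for _ in range(trials):
    w = [np.zeros(d), np.zeros(d)]          # Local-SGD iterates, machines 0, 1
    wa = [np.zeros(d), np.zeros(d)]         # Anytime iterates
    xa = [np.zeros(d), np.zeros(d)]         # Anytime query points (averages)
    for k in range(K):
        t = t0 + k + 1
        for m in range(2):
            w[m] = w[m] - eta * (w[m] + rng.normal(size=d))  # query at iterate
            wa[m] = wa[m] - eta * (xa[m] + rng.normal(size=d))
            xa[m] = (t * xa[m] + wa[m]) / (t + 1)   # uniform running average
    gap_local.append(np.linalg.norm(w[0] - w[1]))
    gap_anytime.append(np.linalg.norm(xa[0] - xa[1]))
print(f"Local-SGD query-point gap: {np.mean(gap_local):.3f}")
print(f"Anytime query-point gap:   {np.mean(gap_anytime):.3f}")   # smaller
```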
Importance Weights. One may wonder whether it is necessary to employ increasing weights, e.g., $\alpha_t = t$, rather than standard uniform weights $\alpha_t = 1$. Surprisingly, in our analysis we have found that increasing weights are crucial in order to obtain a benefit over Minibatch-SGD, and that upon using uniform weights, SLowcal-SGD performs worse compared to Minibatch-SGD! We elaborate on this in Appendix L. Below we provide an intuitive explanation.
Intuitive Explanation.
The intuition behind the importance of using increasing weights is the following: increasing weights are a technical tool to put more emphasis on the last rounds. Now, in the context of local update methods, the iterates of the last rounds are more attractive, since the bias between different machines shrinks as we progress. Intuitively, this happens because, as we progress with the optimization process, the expected value of the gradients that we compute goes to zero (since we converge); consequently, the bias between different machines shrinks as we progress. A quick numeric illustration follows.
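As a quick numeric illustration of this emphasis (with illustrative values only), the fraction of the total weight carried by the last of the R rounds roughly doubles when moving from uniform to linear weights:

```python
# Weight share of the last round under uniform vs linear weights alpha_t = t,
# for illustrative values K = 16 local steps and R = 32 rounds.
K, R = 16, 32
T = K * R
uniform_share = K / T                                           # = 1/R ~ 0.031
linear_share = sum(range(T - K + 1, T + 1)) / (T * (T + 1) / 2) # ~ 0.061
print(uniform_share, linear_share)
```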
3.3 Proof Sketch for Theorem 2
Proof Sketch for Theorem 2.
As a starting point for the analysis, for every iteration $t$ we define the averages of the local sequences across all machines as follows,
$$\bar{w}_t := \frac{1}{M}\sum_{m=1}^{M} w_t^m, \qquad \bar{x}_t := \frac{1}{M}\sum_{m=1}^{M} x_t^m.$$
Note that Alg. 2 explicitly computes these averages only once every $K$ local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,
(12) |
Further combining these definitions with Eq. (7) yields,
(13) |
The above implies that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$. This enables us to employ Thm. 1, which yields a bound on the weighted excess loss, where we denote $\bar{g}_t := \frac{1}{M}\sum_{m=1}^{M} g_t^m$. This bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased estimates of $\nabla f(\bar{x}_t)$, except for the first iterate of each round. Concretely, Eq. (12) implies that our algorithm effectively updates $\bar{w}_t$ using $\bar{g}_t$, which is a biased estimate of $\nabla f(\bar{x}_t)$.
It is therefore natural to decompose $\bar{g}_t$ into $\nabla f(\bar{x}_t)$ plus a bias term in the above bound, yielding,
$$\alpha_{1:T}\,\mathbb{E}\left[f(\bar{x}_T) - f(w^*)\right] \le \underbrace{\mathbb{E}\sum_{t=1}^{T} \alpha_t\, \bar{g}_t \cdot (\bar{w}_t - w^*)}_{(A)} + \underbrace{\mathbb{E}\sum_{t=1}^{T} \alpha_t \left(\nabla f(\bar{x}_t) - \bar{g}_t\right) \cdot (\bar{w}_t - w^*)}_{(B)}. \tag{14}$$
Thus, we intend to bound the weighted error by bounding two terms: (A), which is related to the update rule of the algorithm, and (B), which accounts for the bias between $\bar{g}_t$ and $\nabla f(\bar{x}_t)$.
Notation:
In what follows we will make use of several shorthand notations for the weighted sums of gradients, distances, and bias terms that appear below; throughout, $w^*$ denotes a global minimum of $f$.
Bounding (A):
Due to the update rule of Eq. (12), one can show by standard regret analysis that:
Bounding (B):
We can bound (B) in expectation as follows, for any $\rho > 0$:
Combining (A) and (B):
Combining the above bounds on (A) and (B) into Eq. (14), we obtain the following bound, which holds for any $\eta$ and $\rho > 0$,
(15) |
Now, to simplify the proof sketch, we shall make a simplifying assumption on the weighted bias term. Plugging this into the above equation and choosing $\rho$ appropriately gives,
(16) |
Next we will bound these two terms, and plug them back into Eq. (16).
Bounding the first term:
To bound it, it is natural to decompose the gradient estimate into its mean and noise components. Using this decomposition we show that,
Bounding the bias term:
The definition of this term shows that it encompasses the bias introduced by the local updates, which in turn relates to the distances between the local query points and their averages; it is therefore directly related to the dissimilarity between the machines. Our analysis shows the following: plugging the above into Eq. (16), and using our choice of $\eta$, gives an almost explicit bound,
The theorem follows by plugging the choices of $\eta$ and $\{\alpha_t\}_t$ into the above, and using a technical lemma. ∎
4 Experiments
To assess the effectiveness of our proposed approach, we conducted experiments on the MNIST (LeCun et al., 2010) dataset, a well-established benchmark in image classification comprising 70,000 grayscale images of handwritten digits (0-9), with 60,000 images designated for training and 10,000 for testing. The dataset was accessed via torchvision (version 0.16.2). We implemented a logistic regression model (Bishop and Nasrabadi, 2006) using the PyTorch framework and executed all computations on an NVIDIA L40S GPU. To ensure robustness, results were averaged over three different random seeds. The complete codebase for these experiments is publicly available on our GitHub repository: https://github.com/dahan198/slowcal-sgd.
We evaluated our approach, using parameters derived from our theoretical framework, in comparison to Local-SGD and Minibatch-SGD under various configurations. Specifically, experiments were conducted with 16, 32, and 64 workers to examine the scalability and robustness of the proposed method. We also varied the number of local updates (or minibatch sizes for Minibatch-SGD) among 4, 8, 16, 32, and 64 to investigate how different local iteration counts impact performance. Data subsets for each worker were generated using a Dirichlet distribution (Hsu et al., 2019) with a small concentration parameter to simulate real-world non-IID data scenarios characterized by high heterogeneity. For fairness, the learning rate was selected through grid search, with a value of 0.01 for SLowcal-SGD and Local-SGD, and 0.1 for Minibatch-SGD. More details about the data distribution across workers and complete experimental results are provided in Appendix M. A sketch of the setup follows.
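The following sketch summarizes the experimental configuration (a minimal reconstruction from the description above, not the released code):

```python
# A minimal sketch of the experimental setup: logistic regression on flattened
# 28x28 MNIST images, evaluated over a grid of worker counts and local-step
# counts; learning rates are the grid-searched values reported above.
import torch

model = torch.nn.Linear(28 * 28, 10)      # logistic regression = linear layer
loss_fn = torch.nn.CrossEntropyLoss()     # + softmax cross-entropy loss

workers_grid = [16, 32, 64]
local_steps_grid = [4, 8, 16, 32, 64]     # = minibatch size for Minibatch-SGD
lr = {"slowcal_sgd": 0.01, "local_sgd": 0.01, "minibatch_sgd": 0.1}
seeds = [0, 1, 2]                         # results averaged over three seeds
```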
[Figure 1: Test accuracy and test loss on MNIST for SLowcal-SGD, Local-SGD, and Minibatch-SGD, across different numbers of workers and local steps.]
Our results on the MNIST dataset, presented in Figure 1 and detailed in Appendix M.2, demonstrate the effectiveness of our approach, showing consistent performance improvements over Local-SGD and Minibatch-SGD as the number of local steps increases. Notably, this improvement becomes even more significant as the number of workers increases, underscoring the scalability of our method and aligning with the theoretical guarantees outlined in our framework. These results highlight the robustness of our approach in handling highly heterogeneous, distributed environments.
Upon closer inspection, when a small number of local steps are performed, the differences between the approaches are negligible, with a slight advantage for Minibatch-SGD. However, as the number of local steps increases, the minibatch size grows, and the need for significant variance reduction diminishes. In this regime, making more frequent optimization updates becomes more impactful, as demonstrated by the superior performance of the local approaches compared to Minibatch-SGD. Importantly, with SLowcal-SGD, which keeps local updates closely aligned among workers throughout the training process, we can achieve significantly better and more stable performance compared to both Minibatch-SGD and Local-SGD as the number of local steps and the number of workers increase.
5 Conclusion
We have presented the first local approach for the heterogeneous distributed Stochastic Convex Optimization (SCO) setting that provably benefits over the two most prominent baselines, namely Minibatch-SGD and Local-SGD. There are several interesting avenues for future exploration:
(a) developing an adaptive variant that does not require knowledge of problem parameters such as the smoothness $L$ and the noise $\sigma$; (b) allowing a per-dimension step-size, which could be beneficial in (the prevalent) scenarios where the scale of the gradients changes considerably between different dimensions, in the spirit of the well-known AdaGrad method (Duchi et al., 2011). Finally, (c) it would be interesting to understand whether we can find an algorithm that provably dominates the accelerated Minibatch-SGD baseline, which is an open question also in the homogeneous SCO setting.
Acknowledgement
This research was partially supported by Israel PBC-VATAT, the Technion Artificial Intelligent Hub (Tech.AI), and the Israel Science Foundation (grant No. 3109/24).
References
- APP [2019] Apple. Designing for Privacy (video and slide deck). Apple WWDC, 2019. URL https://developer.apple.com/videos/play/wwdc2019/708.
- INT [2020] Intel and Consilient. Intel and Consilient Join Forces to Fight Financial Fraud with AI, 2020. URL https://newsroom.intel.com/news/intel-consilient-join-forces-fight-financial-fraud-ai/.
- MEL [2020] MELLODDY. MELLODDY Project Meets its Year One Objective: Deployment of the World's First Secure Platform for Multi-Task Federated Learning in Drug Discovery Among 10 Pharmaceutical Companies, 2020. URL https://www.melloddy.eu/y1announcement.
- GGL [2021] Google. Your Voice and Audio Data Stays Private While Google Assistant Improves, 2021. URL https://support.google.com/assistant/answer/10176224.
- Antonakopoulos et al. [2022] Kimon Antonakopoulos, Dong Quan Vu, Volkan Cevher, Kfir Levy, and Panayotis Mertikopoulos. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pages 772–795. PMLR, 2022.
- Aviv et al. [2021] Rotem Zamir Aviv, Ido Hakimi, Assaf Schuster, and Kfir Yehuda Levy. Asynchronous distributed learning: Adapting to gradient delays without prior knowledge. In International Conference on Machine Learning, pages 436–445. PMLR, 2021.
- Bishop and Nasrabadi [2006] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
- [8] Ashok Cutkosky. Lecture notes for EC525: Optimization for Machine Learning.
- Cutkosky [2019] Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. In International conference on machine learning, pages 1446–1454. PMLR, 2019.
- Dekel et al. [2012] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.
- Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Ene et al. [2021] Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
- Glasgow et al. [2022] Margalit R Glasgow, Honglin Yuan, and Tengyu Ma. Sharp bounds for federated averaging (local sgd) and continuous perspective. In International Conference on Artificial Intelligence and Statistics, pages 9050–9090. PMLR, 2022.
- Gorbunov et al. [2021] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local sgd: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pages 3556–3564. PMLR, 2021.
- Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Hoffer et al. [2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
- Hsu et al. [2019] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
- Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Karimireddy et al. [2020a] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
- Karimireddy et al. [2020b] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020b.
- Kavis et al. [2019] Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
- Khaled et al. [2020] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
- Koloskova et al. [2020] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
- Lan [2012] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
- LeCun et al. [2010] Yann LeCun, Corinna Cortes, Chris Burges, et al. Mnist handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/. Licensed under CC BY-SA 3.0, available at https://creativecommons.org/licenses/by-sa/3.0/.
- Mangasarian and Solodov [1993] Olvi L Mangasarian and Mikhail V Solodov. Backpropagation convergence via deterministic nonmonotone perturbed minimization. Advances in Neural Information Processing Systems, 6, 1993.
- McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
- Mishchenko et al. [2022] Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! In International Conference on Machine Learning, pages 15750–15769. PMLR, 2022.
- Mitra et al. [2021] Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.
- [31] Kumar Kshitij Patel, Lingxiao Wang, Blake Woodworth, Brian Bullins, and Nathan Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Advances in Neural Information Processing Systems.
- Smith et al. [2017] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
- Stich [2019] Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
- Wang et al. [2021a] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Agüera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas N. Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horváth, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konečný, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtárik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake E. Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, and Wennan Zhu. A field guide to federated optimization. CoRR, abs/2107.06917, 2021a. URL https://arxiv.org/abs/2107.06917.
- Wang et al. [2021b] Jun-Kun Wang, Jacob Abernethy, and Kfir Y Levy. No-regret dynamics in the fenchel game: A unified framework for algorithmic convex optimization. arXiv e-prints, pages arXiv–2111, 2021b.
- Woodworth et al. [2020a] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020a.
- Woodworth et al. [2020b] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020b.
- You et al. [2019] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
- Yuan and Ma [2020] Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, 33:5332–5344, 2020.
Appendix A Explanations Regarding the Linear Speedup and Table 1
Here we elaborate on the computations done in Table 1. First, we will explain why the dominance of the noise term implies a linear speedup by a factor of $M$.
Explanation.
Recall that using SGD with a single machine yields a convergence rate whose dominant term is $\sigma/\sqrt{T}$ after $T$ steps. Thus, in order to obtain an excess loss smaller than some $\epsilon > 0$, SGD requires $T \ge \sigma^2/\epsilon^2$, where $T$ is also the wall-clock time required to compute the solution.
Now, when we use parallel optimization with $R$ communication rounds, $K$ local computations, and $M$ machines, the wall-clock time to compute a solution is still $KR$. Now, if the dominant term in the convergence rate of this algorithm is $\sigma/\sqrt{MKR}$, then the wall-clock time to obtain an $\epsilon$-optimal solution should be $KR \ge \sigma^2/(M\epsilon^2)$; and the latter is smaller by a factor of $M$ compared to a single machine. A worked instance follows.
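As a worked numeric instance (assuming the dominant terms above; the constants are illustrative):

```latex
\frac{\sigma}{\sqrt{T}} \le \epsilon \;\Rightarrow\; T \ge \frac{\sigma^2}{\epsilon^2},
\qquad
\frac{\sigma}{\sqrt{MKR}} \le \epsilon \;\Rightarrow\; KR \ge \frac{\sigma^2}{M\,\epsilon^2}.
```

For example, with $\sigma = 1$ and $\epsilon = 10^{-2}$, a single machine needs $T = 10^4$ steps, while $M = 100$ machines need only $KR = 10^2$ sequential steps each, i.e., a wall-clock speedup of $M$.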
Computation of the threshold values in Table 1.
The leading term appears in the bounds of all of the parallel optimization methods that we describe. Nevertheless, it is dominant only as long as the number of communication rounds is larger than some threshold value, which depends on the specific convergence rate. Clearly, smaller threshold values imply less communication. Thus, in the corresponding column of the table we compute the threshold for each method based on the term in the bound that compares least favourably against the leading term. These terms are bolded in the Rate column of the table.
Concretely, the threshold is the lowest number of rounds for which this less favourable term (which may also depend on the problem parameters; for simplicity of exposition we hide these dependencies in Table 1) is dominated by the leading term, i.e., the lowest $R$ which satisfies,
Appendix B On Heterogeneity Assumption
Let us assume that the following holds at the optimum $w^*$ (Eq. (1)),
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le \zeta_*^2.$$
Then we can show the following relation for any $x$,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x)\|^2 \le \frac{2}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(w^*)\|^2 + \frac{2}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le 4L\left(f(x) - f(w^*)\right) + 2\zeta_*^2,$$
where we used $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b \in \mathbb{R}^d$, and the last inequality follows by the lemma below, which we borrow from [Johnson and Zhang, 2013, Cutkosky].
Lemma 1.
Let $f := \frac{1}{M}\sum_{m=1}^{M} f_m$ be a convex function with a global minimum $w^*$, and assume that every $f_m$ is $L$-smooth. Then the following holds,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(w^*)\|^2 \le 2L\left(f(x) - f(w^*)\right), \quad \forall x \in \mathbb{R}^d.$$
Appendix C Interpreting Anytime-SGD as Momentum
Here we show how to interpret the Anytime-SGD algorithm that we present in Equations (4) and (5) as a momentum method. For completeness we rewrite the update equations below,
$$w_{t+1} = w_t - \eta\,\alpha_t\, g_t, \tag{17}$$
and then,
$$x_{t+1} = \frac{\alpha_{1:t}\, x_t + \alpha_{t+1}\, w_{t+1}}{\alpha_{1:t+1}}, \tag{18}$$
where $g_t$ is an unbiased gradient estimate at $x_t$, and $\{\alpha_t\}_t$ is a sequence of non-negative scalars. At initialization, $x_1 = w_1$.
First note that Eq. (18) directly implies that,
$$x_{t+1} - x_t = \frac{\alpha_{t+1}}{\alpha_{1:t+1}}\left(w_{t+1} - x_t\right).$$
Next, note that we can directly write,
$$w_t - x_t = \frac{\alpha_{1:t-1}}{\alpha_t}\left(x_t - x_{t-1}\right),$$
which follows by re-arranging Eq. (18) at the previous step. Plugging the above into the formula for $x_{t+1} - x_t$, together with Eq. (17), yields,
$$x_{t+1} - x_t = \frac{\alpha_{t+1}}{\alpha_{1:t+1}}\left(\frac{\alpha_{1:t-1}}{\alpha_t}\left(x_t - x_{t-1}\right) - \eta\,\alpha_t\, g_t\right). \tag{19}$$
Thus we can write,
$$x_{t+1} = x_t + \frac{\alpha_{t+1}\,\alpha_{1:t-1}}{\alpha_t\,\alpha_{1:t+1}}\left(x_t - x_{t-1}\right) - \frac{\eta\,\alpha_t\,\alpha_{t+1}}{\alpha_{1:t+1}}\, g_t.$$
Uniform Weights. Thus, taking uniform weights $\alpha_t = 1$ yields,
$$x_{t+1} = x_t + \frac{t-1}{t+1}\left(x_t - x_{t-1}\right) - \frac{\eta}{t+1}\, g_t.$$
Linear Weights. Similarly, taking linear weights $\alpha_t = t$ yields,
$$x_{t+1} = x_t + \frac{t-1}{t+2}\left(x_t - x_{t-1}\right) - \frac{2\eta\, t}{t+2}\, g_t.$$
A numeric sanity check follows.
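The following snippet numerically verifies the uniform-weights momentum recursion above (our own verification sketch; the quadratic objective and all constants are arbitrary):

```python
# A numeric sanity check that uniform-weight Anytime-SGD is a momentum method:
# d_t = x_{t+1} - x_t obeys d_t = ((t-1)*d_{t-1} - eta*g_t)/(t+1), with g_t
# the (noisy) gradient queried at x_t.
import numpy as np

rng = np.random.default_rng(1)
eta, T, dim = 0.1, 50, 5
D = rng.uniform(0.5, 2.0, size=dim)          # a diagonal convex quadratic
grad = lambda x: D * x - 1.0                 # gradient of 0.5*x'Dx - sum(x)

w = rng.normal(size=dim)
x = w.copy()                                 # x_1 = w_1 at initialization
d_prev = np.zeros(dim)
for t in range(1, T):
    g = grad(x) + rng.normal(size=dim)       # noisy gradient at x_t
    w = w - eta * g                          # uniform weights: alpha_t = 1
    x_next = (t * x + w) / (t + 1)           # running average of the iterates
    d_t = ((t - 1) * d_prev - eta * g) / (t + 1)
    assert np.allclose(x_next - x, d_t)      # the momentum recursion holds
    x, d_prev = x_next, d_t
print("momentum recursion verified")
```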
Appendix D Proof of Theorem 1
Proof of Theorem 1.
We rehearse the proof of Theorem 1 from Cutkosky [2019].
First, since $w^*$ is a global minimum and the weights are non-negative, then clearly,
Now, notice that the following holds,
Using the gradient inequality for $f$ gives,
where we have used the gradient inequality again, which implies that $f(w^*) \ge f(x_t) + \nabla f(x_t)\cdot(w^* - x_t)$.
Now, re-ordering, we obtain,
Telescoping the sum on the LHS, we conclude the proof,
∎
Appendix E More Intuition and Discussion Regarding the Benefit of SLowcal-SGD
More Elaborate Intuitive Explanation.
The intuition is the following: We have two extreme baselines. (1) Minibatch-SGD, where queries do not change at all during a round's updates, implying that there is no bias between different machines. However, Minibatch-SGD is "lazy", since among its $KR$ gradient queries it only performs $R$ gradient updates. Conversely, (2) Local-SGD is not "lazy", since each machine performs $KR$ gradient updates. Nevertheless, the queries of different machines change substantially during each round, which translates to bias between machines, which in turn degrades the convergence.
Ideally, we would like to have a "non-lazy" method where each machine performs KR gradient updates (like Local-SGD), but where the queries of each machine do not change at all during rounds (like Minibatch-SGD), and therefore no bias is introduced between machines. Of course, this is too good to exist, but our method is a step in this direction: it is "non-lazy", and the query points of different machines change slowly, therefore introducing less bias between machines. This translates to a better convergence rate.
Additional Technical Intuition for $\alpha_t = t$.
Here we extend the technical explanation that we provide in Sec. 3.2.1 to the case of linear weights $\alpha_t = t$, and show again that SLowcal-SGD yields a smaller bias between different machines compared to Local-SGD.
As in the intuition for the case of uniform weights, to simplify the more technical discussion we will assume the homogeneous case, i.e., that for any $m$ we have $\mathcal{D}_m = \mathcal{D}$ and $f_m = f$.
Note that upon employing linear weights, the normalization factor $\alpha_{1:T}$ that plays a major role in the convergence guarantees of Anytime-SGD (see Thm. 1) grows as $\Theta(T^2)$. Thus, in order to make a proper comparison, we should compare the bias of weighted Anytime-SGD to the appropriate weighted version of SGD, where a similar normalization factor plays a similar role in the guarantees (see e.g. Wang et al. [2021b]). This weighted SGD is as follows (Wang et al. [2021b]),
$$w_{t+1} = w_t - \eta\,\alpha_t\, \hat{g}_t, \tag{20}$$
and after $T$ iterations it outputs $\bar{w}_T := \frac{1}{\alpha_{1:T}}\sum_{t=1}^{T} \alpha_t w_t$. For $\alpha_t = t$, this version enjoys the same guarantees as standard SGD.
Next, we compare the Local-SGD version of the above weighted SGD (Eq. (20)) to our SLowcal-SGD when both employ $\alpha_t = t$. So, a bit more formally, let us discuss the bias between the query points of two machines in a given round $r$. The following holds for weighted Local-SGD,
$$w_{t+1}^m = w_t^m - \eta\,\alpha_t\, \hat{g}_t^m, \tag{21}$$
where $\hat{g}_t^m$ is the noisy gradient that machine $m$ computes at $w_t^m$, and we can write $\hat{g}_t^m = \nabla f(w_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Thus, for two machines $m, m'$ we can write the difference as a gradient-difference term plus an accumulated-noise term.
Now, we can be generous with respect to weighted SGD, taking only the second, noisy term into account and neglecting the first term (it was shown in Woodworth et al. [2020a] that the noisy term is dominant for standard SGD). Thus we obtain,
$$\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta\,\alpha_t\,\sigma\sqrt{K}\right), \tag{22}$$
where we used the fact that the weights of steps within the same round are of the same order, i.e., $\alpha_s = \Theta(\alpha_t)$ whenever $s$ and $t$ belong to the same round.
Similarly, for SLowcal-SGD we would like to bound $\|x_t^m - x_t^{m'}\|$ for two machines $m, m'$, while assuming linear weights, i.e., $\alpha_t = t$. Now the update rule for the iterates $w_t^m$ is of the same form as in Eq. (21), only now $\hat{g}_t^m = \nabla f(x_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Consequently, we can show the following,
where we take the crude approximation $\alpha_{1:t} \approx t^2/2$ (in the last step we also rename the summation variable).
Now, by the definition of $x_t^m$, for two machines $m, m'$ the difference is a weighted average of the differences of the past iterates,
where we have again used the fact that the weights of steps within the same round are of the same order.
As we show in our analysis, the noisy term is dominant, so we can therefore bound,
$$\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta\,\alpha_t\,\sigma\, K^{3/2}/t\right). \tag{23}$$
Thus, Equations (22) and (23) illustrate that for $\alpha_t = t$, the bias of SLowcal-SGD is smaller by a factor of $t/K$ compared to the bias of weighted Local-SGD.
Since $t$ can be as big as $KR$, this coincides with the benefit of SLowcal-SGD over standard SGD in the case where $\alpha_t = 1$, which we demonstrate in the main text.
Finally, note that upon dividing by the normalization factor $\alpha_{1:t}$, we have that for SLowcal-SGD, with either $\alpha_t = 1$ or $\alpha_t = t$,
$$\frac{1}{\alpha_{1:t}}\,\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta\,\sigma\, K^{3/2}/t^2\right). \tag{24}$$
Comparably, upon dividing by the normalization factor, we have that for Local-SGD, with either $\alpha_t = 1$ or $\alpha_t = t$,
$$\frac{1}{\alpha_{1:t}}\,\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta\,\sigma\sqrt{K}/t\right). \tag{25}$$
Thus, with respect to the approximate and intuitive analysis that we make here, SLowcal-SGD maintains a similar benefit over Local-SGD for both $\alpha_t = 1$ and $\alpha_t = t$.
As we explain in Appendix L, taking $\alpha_t = 1$ in SLowcal-SGD does not actually enable us to provide a benefit over Local-SGD. The reason is that for $\alpha_t = 1$, the condition under which the dominant term in the bias (between different machines) is the noisy term (which enables the approximate analysis that we make here and in the body of the paper) leads to a limitation on the learning rate, which in turn degrades the performance of SLowcal-SGD with $\alpha_t = 1$. Conversely, for $\alpha_t = t$ there is no such degradation due to the limitation of the learning rate. For more details and intuition please see Appendix L.
Appendix F Proof of Thm. 2
Proof of Thm. 2.
As a starting point for the analysis, for every iteration $t$ we define the averages of the local sequences across all machines as follows,
$$\bar{w}_t := \frac{1}{M}\sum_{m=1}^{M} w_t^m, \qquad \bar{x}_t := \frac{1}{M}\sum_{m=1}^{M} x_t^m.$$
Note that Alg. 2 explicitly computes these averages only once every $K$ local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,
(26) |
Further combining these definitions with Eq. (7) yields,
(27) |
And the above implies that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$. This enables us to employ Thm. 1, which yields,
where we denote $\bar{g}_t := \frac{1}{M}\sum_{m=1}^{M} g_t^m$. The above bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased estimates of $\nabla f(\bar{x}_t)$, except for the first iterate of each round. Concretely, Eq. (26) demonstrates that our algorithm effectively updates $\bar{w}_t$ using $\bar{g}_t$, which might be a biased estimate of $\nabla f(\bar{x}_t)$.
It is therefore natural to apply this decomposition in the above bound, leading to,
(28) |
Thus, we intend to bound the weighted error by bounding two terms: (A), which is directly related to the update rule of the algorithm (Eq. (26)), and (B), which accounts for the bias between $\bar{g}_t$ and $\nabla f(\bar{x}_t)$.
Notation:
In what follows we will find the following notation useful,
(29)
together with several shorthand notations for the weighted sums of distances and bias terms that appear below; throughout, $w^*$ denotes a global minimum of $f$.
Bounding (A):
Due to the update rule of Eq. (26), one can show by standard regret analysis (see Lemma 2 below) that,
(30) |
Lemma 2.
(OGD Regret Lemma; see e.g. Hazan et al. [2016]) Let $\eta > 0$ and $w_1 \in \mathbb{R}^d$. Also assume a sequence of non-negative weights $\{\alpha_t\}_t$ and vectors $\{g_t\}_t$, and assume an update rule of the form $w_{t+1} = w_t - \eta\,\alpha_t\, g_t$.
Then the following bound holds for any $T$ and any $w^* \in \mathbb{R}^d$,
$$\sum_{t=1}^{T} \alpha_t\, g_t \cdot (w_t - w^*) \le \frac{\|w_1 - w^*\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T} \alpha_t^2 \|g_t\|^2.$$
For completeness, we provide a proof in Appendix G.
Bounding (B):
Since our goal is to bound the expected excess loss, we will bound the expected value of (B); thus,
(31) |
where the second line follows by the definition in Eq. (29) and the fact that $\bar{w}_t$ is measurable with respect to the relevant filtration, implying that the cross term vanishes in expectation; the third line uses Young's inequality, which holds for any $\rho > 0$; and the last two lines use the shorthand notations defined above.
Combining (A) and (B):
Combining Equations (30) and (31) into Eq. (28), we obtain the following bound, which holds for any $\eta$ and $\rho > 0$,
(32) |
where we have used $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$, as well as a basic property of the weights, which holds since they are non-negative.
Next, we shall bound each of the above terms. The following lemma bounds the first of them,
Lemma 3.
The following bound holds for any $t$,
Combining the above lemma into Eq. (32) gives,
Since the above holds for any $\rho > 0$, let us pick a specific value of $\rho$; by doing so we obtain,
(33) |
Next, we would like to bound the remaining term; to do so, it is natural to decompose the gradient estimate into its mean and noise components. The next lemma provides a bound, and its proof goes directly through this decomposition,
Lemma 4.
The following holds,
Combining the above Lemma into Eq. (33) yields,
(34) |
where we have used a basic property of the weights, which holds since they are non-negative. The next lemma provides a bound for the remaining term,
Plugging the above bound back into Eq. (34) gives an almost explicit bound,
(35) |
and we used two inequalities which follow from the choice of the learning rate (see Eq. (8)). Next we use the above bound and invoke the following lemma,
Lemma 6.
Let $\{a_t\}_t$ be a sequence of non-negative elements, and assume that the following holds for any $T$,
Then the following bound holds,
Taking $\alpha_t = t$ and our choice of $\eta$ provides the following explicit bound,
(36) |
where we have used $\alpha_{1:T} \ge T^2/2$, as well as $T = KR$,
Recalling that $T = KR$, the above bound translates into,
(37)
(38)
Noting the definitions above and using the choice of $\eta$ gives the final bound,
which establishes the theorem. ∎
Appendix G Proof of Lemma 2
Proof of Lemma 2.
The update rule implies that for all $t$,
$$\|w_{t+1} - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta\,\alpha_t\, g_t \cdot (w_t - w^*) + \eta^2 \alpha_t^2 \|g_t\|^2.$$
Re-ordering gives,
$$2\eta\,\alpha_t\, g_t \cdot (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \alpha_t^2 \|g_t\|^2.$$
Summing over $t$ and telescoping, we obtain,
$$2\eta \sum_{t=1}^{T} \alpha_t\, g_t \cdot (w_t - w^*) \le \|w_1 - w^*\|^2 + \eta^2 \sum_{t=1}^{T} \alpha_t^2 \|g_t\|^2.$$
Dividing the above by $2\eta$ establishes the lemma. ∎
Appendix H Proof of Lemma 3
Proof of Lemma 3.
Recalling the notations above, our goal is to bound the weighted sum of squared distances. To do so, we will derive a recursive formula. Indeed, the update rule of Alg. 2 implies Eq. (26), which in turn leads to the following for any $t$,
Unrolling the above equation and taking expectation gives,
(39) |
The next lemma provides a bound on the relevant term,
Lemma 7.
The following holds for any $t$, where we recall the notations defined above.
Plugging the bound of Lemma 7 into Eq. (39), and using the notation above, we conclude that for any $t$,
where we used the condition on the learning rate. Summing the above equation over $t$ gives,
(40) |
We shall now require the following lemma,
Lemma 8.
Let $\eta > 0$, and assume that $\eta$ satisfies the condition of the theorem; then the following holds,
H.1 Proof of Lemma 7
Proof of Lemma 7.
Recall the notations above; we shall now focus on bounding the following term,
(42) |
where the first line is due to the definitions appearing in Eq. (29) (this is formalized in Lemma 9 below and in its proof); the third line follows by observing that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$, so that Theorem 1 applies for any $t$, as well as from Cauchy-Schwarz; the fourth line follows from the Cauchy-Schwarz inequality for random variables, which asserts that for any random variables $X, Y$ we have $\mathbb{E}[XY] \le \sqrt{\mathbb{E}[X^2]\cdot\mathbb{E}[Y^2]}$; and the fifth line is an application of an inequality
which holds for any two sequences, and also follows from the standard Cauchy-Schwarz inequality. Thus, Eq. (42) establishes the lemma.
We are left to show the identity used in the first line, which is established in the lemma below,
Lemma 9.
The following holds for any $t$,
∎
H.1.1 Proof of Lemma 9
Proof of Lemma 9.
Let $\{\mathcal{F}_t\}_t$ be the natural filtration induced by the history of samples up to every time step $t$. Then, according to the definitions above, we have,
where the first line follows by the law of total expectation; the second line follows since $\bar{w}_t$ is measurable w.r.t. the filtration; the third line follows by the definition of the gradient estimates; and the last line uses the law of total expectation. ∎
H.2 Proof of Lemma 8
Proof of Lemma 8.
We will divide the proof into two cases.
Case 1: . In this case,
Case 2: . In this case,
Dividing and taking the square implies,
And therefore the lemma holds. ∎
Appendix I Proof of Lemma 4
Proof of Lemma 4.
Recalling the definition of the bias term, we will decompose it into its mean and noise components, which gives,
(43) |
where the second line uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; the third line uses the definitions above, as well as the smoothness of the $f_m$'s (see Lemma 10 below); the fourth line invokes Lemma 11; and the fifth line uses the notation defined above.
Lemma 10.
Let $f$ be an $L$-smooth function with a global minimum $w^*$; then for any $x$ we have,
$$\|\nabla f(x)\|^2 \le 2L\left(f(x) - f(w^*)\right).$$
Lemma 11.
The following bound holds for any $t$,
∎
I.1 Proof of Lemma 10
Proof of Lemma 10.
The smoothness of $f$ implies that the following holds for any $x, y$,
$$f(y) \le f(x) + \nabla f(x) \cdot (y - x) + \frac{L}{2}\|y - x\|^2.$$
Taking $y = x - \frac{1}{L}\nabla f(x)$ we get,
$$f(y) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2.$$
Thus:
$$\|\nabla f(x)\|^2 \le 2L\left(f(x) - f(y)\right) \le 2L\left(f(x) - f(w^*)\right),$$
where in the last inequality we used $f(w^*) \le f(y)$, which holds since $w^*$ is the global minimum. ∎
I.2 Proof of Lemma 11
Proof of Lemma 11.
Recall that we can write,
where the per-machine noise terms are independent of each other. Thus, conditioning on the history, they are independent and zero-mean. Consequently,
Using the law of total expectation establishes the lemma. ∎
Appendix J Proof of Lemma 5
Proof of Lemma 5.
To bound the bias term, we will first employ its definition together with the smoothness of the $f_m$'s,
(44) |
where the first line uses the definition of the bias term, the second line uses Jensen's inequality, and the third line uses the smoothness of the $f_m$'s. The last line follows from Jensen's inequality.
We use the following notation for any $t$,
(45) |
and notice several basic relations between these quantities, which we will use below.
Thus, according to Eq. (44), it is enough to bound the following term as,
(46) |
Next we will bound the above term.
Bounding :
Let $t$ belong to round $r$. If $t$ is the first step of the round, then according to Alg. 2 the local copies of all machines coincide, and thus the relevant difference vanishes for any two machines.
More generally, if $t$ belongs to round $r$ but is not the first step of the round, then we can express the difference through the accumulated local updates. Using this notation, the update rule implies the following for any such $t$,
where we used the round's initialization. Thus, for any such $t$ we can write,
(47) |
So our next goal is to derive an expression for the accumulated updates. The update rule of Eq. (6) implies that for any $t$,
(48) |
where the second equality is due to the initialization of each round, implying that the local copies coincide at the round's start.
Next, we will require some additional notation for the accumulated sums; it is immediate to verify the basic identities below. Using this notation together with Eq. (48) implies that for any $t$ and any machine $m$ we have,
(49) |
and in the last line we use a shorthand notation (a more precise notation would carry an additional index, but to ease notation we absorb it).
Summing Eq. (49) over the steps of the round, we obtain,
(50) |
where in the last equation we renamed the summation index (only done to ease notation).
Plugging the above equation back into Eq. (47), we obtain for any $t$,
(51) |
where the first inequality uses the triangle inequality; the second uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; the third inequality uses the definitions above, together with basic properties of the weights, which hold since the weights are non-negative and non-decreasing; and the last inequality uses the fact that weights within the same round are of the same order, implying that the following holds,
Lemma 12.
The following holds for any ,
Using the above lemma inside Eq. (51) yields,
(52) |
where we have denoted the corresponding weighted sum (formally we should carry the round index in this definition; we absorb it to simplify the notation).
Summing Eq. (52) over the round and using the definitions above gives,
(53) |
where we used,
Now, dividing Eq. (53) by the appropriate normalization gives,
(54) |
where the second line follows from the dissimilarity bound of Eq. (2), and the last line is due to the notation defined above.
Thus, we can re-write the above equation as follows, for all $t$,
(55) |
where the quantities above are as defined earlier. Now, notice that the relevant coefficient satisfies the condition of Lemma 13, since we assume the learning-rate condition of the theorem (see Eq. (8)). This enables us to make use of Lemma 13 below to conclude,
(56) |
Lemma 13.
Let the constants be as stated, and assume a sequence of non-negative terms and another sequence that satisfy the following inequality for any $t$,
Then the following holds,
Recalling that we would like to bound the expectation of the LHS of Eq. (56), we will next bound the remaining term, which is done in the following lemma (recall that for simplicity of notation we drop an extra index),
Lemma 14.
The following bound holds for any ,
Since the above lemma holds for any $t$ and $m$, we can now bound the term as follows,
(57) |
where the last lines hold for any iteration $t$ that belongs to round $r$.
Final Bound on the Bias Term.
Finally, using the above bound together with Eq. (46) enables us to bound the bias term as follows,
where we have used the bounds above, and the last line uses the learning-rate condition. Consequently, we can bound,
which establishes the lemma. ∎
J.1 Proof of Lemma 12
Proof of Lemma 12.
First note that, by definition, we have for any $t$,
(59) |
where we have used Jensen's inequality and the definitions above.
Using the above inequality, we obtain,
where the second and third lines use $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; we also used the smoothness of the $f_m$'s; and the last line uses Eq. (59). ∎
J.2 Proof of Lemma 13
Proof of Lemma 13.
Since all the terms involved are non-negative, we can further bound, for all $t$, as follows,
Summing the above over gives,
where we used the assumed bound on the coefficient. Re-ordering the above equation immediately establishes the lemma. ∎
J.3 Proof of Lemma 14
Proof of Lemma 14.
Let $\{\mathcal{F}_t\}_t$ be the natural filtration induced by the random draws up to time $t$. By definition, each term of the sequence is measurable with respect to $\mathcal{F}_t$, and,
implying that the sequence is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}_t$. This implies the following,
where the first line follows by Lemma 15 below; the second line holds by the definitions above together with the learning-rate condition; the fourth line follows from the bounded-variance assumption; the fifth line uses a basic inequality that holds for any $t$; the sixth line follows from the properties of the weights; and the last line uses the notation defined above.
Lemma 15.
Let $\{X_t\}_t$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_t\}_t$; then the following holds for all time indexes $t$,
$$\mathbb{E}\left\|\sum_{s=1}^{t} X_s\right\|^2 = \sum_{s=1}^{t} \mathbb{E}\left\|X_s\right\|^2.$$
∎
J.3.1 Proof of Lemma 15
Proof of Lemma 15.
We shall prove the lemma by induction over $t$. The base case $t = 1$ clearly holds, since it boils down to the following identity,
Now, for the induction step, let us assume that the equality holds for $t$, and let us prove that it holds for $t+1$. Indeed,
where the third line follows from the induction hypothesis, as well as from the law of total expectation; the fourth line follows since $X_1, \dots, X_t$ are measurable w.r.t. $\mathcal{F}_t$; and the fifth line follows since $\mathbb{E}[X_{t+1}\,|\,\mathcal{F}_t] = 0$. Thus, we have established the induction step, and therefore the lemma holds. ∎
Appendix K Proof of Lemma 6
Proof of Lemma 6.
Summing the assumed inequality over $t$ gives,
Re-ordering we obtain,
Plugging this back into the original inequality gives,
which concludes the proof. ∎
Appendix L The Necessity of Non-uniform Weights
One may wonder why we should employ increasing weights $\alpha_t = t$ rather than standard uniform weights $\alpha_t = 1$. Here we explain why uniform weights are insufficient, and why increasing weights, e.g., $\alpha_t = t$, are crucial to obtain our result.
Intuitive Explanation.
Prior to providing an elaborate technical explanation, we provide some intuition. The intuition behind the importance of using increasing weights is the following: increasing weights are a technical tool to put more emphasis on the last rounds. Now, in the context of local update methods, the iterates of the last rounds are more attractive, since the bias between different machines shrinks as we progress. Intuitively, this happens since, as we progress with the optimization process, the bias in the stochastic gradients that we compute goes to zero (in expectation); consequently, the bias between different machines shrinks as we progress.
Technical Explanation.
Assume general weights $\{\alpha_t\}_t$, and let us go back to the proof of Lemma 5 (see Section J). Recall that in this proof we derive a bound of the following form (see Eq. (55)),
(60) |
where the quantities are as in that proof, and importantly the relevant coefficient takes a specific form (indeed, in the case where $\alpha_t = t$ we can take the value used in the proof of Lemma 5),
Now, a crucial part of the proof is the fact that this coefficient is suitably bounded, which in turn enables us to employ Lemma 13 in order to bound the error.
Now let us inspect the constraint for polynomial weights of the form $\alpha_t = t^p$, where $p \ge 0$. This condition boils down to,
implying that the following bound should apply to the learning rate $\eta$ for any $t$,
Now, since the bound should hold for any $t$, we can divide into two cases:
Case 1: $p \le 1$. In this case the bound is monotonically increasing with $t$, so the above condition should be satisfied for the smallest $t$, namely $t = 1$, implying,
The effect of this term on the overall error stems from the first term in the error analysis (see Appendix F), namely,
Now, for the extreme values $p = 0$ (uniform weights) and $p = 1$ (linear weights), the above expression results in an error term of the following form,
(61) |
Thus, for $p = 0$ the error term is considerably worse even compared to Minibatch-SGD, while $p = 1$ is clearly an improvement.
Case 2: $p > 1$. In this case the bound is monotonically decreasing with $t$, so the above condition should be satisfied for the largest $t$, namely $t = T$, implying,
Now, the effect of this term on the overall error stems from the first term in the error analysis (see Appendix F), namely,
Thus, for any $p > 1$ we obtain,
(62) |
Conclusion: As can be seen from Equations (61) and (62), using uniform weights, or even polynomial weights with $p \neq 1$, yields strictly worse guarantees compared to taking $\alpha_t = t$.
Appendix M Experiments
M.1 Data Distribution Across Workers
In federated learning or distributed machine learning settings, data heterogeneity among workers often arises due to non-identical data distributions. To simulate such scenarios, we split the MNIST dataset using a Dirichlet distribution. The Dirichlet distribution allows control over the degree of heterogeneity in the data assigned to each worker by adjusting the dirichlet-alpha parameter. Lower values of dirichlet-alpha result in more uneven class distributions across workers, simulating highly non-IID data.
Given $C$ classes and $M$ workers, the proportions for assigning data from each class $c$ to the workers are sampled from a Dirichlet distribution:
$$p_c \sim \mathrm{Dir}\left(\alpha \cdot \mathbf{1}_M\right),$$
where $\alpha$ is the concentration parameter controlling the level of heterogeneity. A smaller $\alpha$ value results in a more imbalanced distribution, meaning that each worker primarily receives data from a limited subset of classes. In this experiment, we set $\alpha$ to a small value to induce high heterogeneity. A sketch of this split appears below.
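The following is a minimal sketch of such a split (the function name and the default `alpha` are illustrative, not the released implementation):

```python
# Dirichlet-based non-IID split: for each class, sample per-worker proportions
# from Dir(alpha) and hand out contiguous chunks of that class accordingly.
import numpy as np

def dirichlet_split(labels, num_workers, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    worker_indices = [[] for _ in range(num_workers)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(num_workers))   # shares of class c
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for m, part in enumerate(np.split(idx, cuts)):
            worker_indices[m].extend(part.tolist())
    return worker_indices

# e.g.: parts = dirichlet_split(train_labels, num_workers=16, alpha=0.1)
```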
The following figures present scatter plots illustrating the class frequencies assigned to individual workers for different worker configurations—16, 32, and 64—using a specific random seed.
[Figures: per-class sample frequencies assigned to individual workers, for 16, 32, and 64 workers.]
M.2 Complete Experimental Results
This section presents the complete experimental results for 16, 32, and 64 workers, showing test accuracy and test loss as functions of the number of local iterations. The plots illustrate the method's scalability and performance, with higher accuracy and lower loss indicating better outcomes.
[Figures: test accuracy and test loss vs. number of local iterations, for 16, 32, and 64 workers.]