
SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization

Tehila Dahan
Department of Electrical Engineering
Technion
Haifa, Israel
[email protected]
Kfir Y. Levy
Department of Electrical Engineering
Technion
Haifa, Israel
[email protected]
Abstract

We consider distributed learning scenarios where $M$ machines interact with a parameter server along several communication rounds in order to minimize a joint objective function. Focusing on the heterogeneous case, where different machines may draw samples from different data distributions, we design the first local update method that provably benefits over the two most prominent distributed baselines, namely Minibatch-SGD and Local-SGD. Key to our approach is a slow querying technique that we customize to the distributed setting, which in turn enables a better mitigation of the bias caused by local updates.

1 Introduction

Federated Learning (FL) is a framework that enables huge scale collaborative learning among a large number of heterogeneous clients (or machines); here, heterogeneous refers to the data of each client, whose statistical properties may vary between different clients. FL may potentially promote fairness among participants, by allowing clients with small scale datasets to participate in the learning process and affect the resulting model. Additionally, participants are not required to directly share data, which may improve privacy. Due to these reasons, FL has gained popularity in the past years, and found use in applications like voice recognition (APP, 2019; GGL, 2021), fraud detection (INT, 2020), drug discovery (MEL, 2020), and more (Wang et al., 2021a).

The two most prominent algorithmic approaches towards federated learning are Minibatch-SGD 
(Dekel et al., 2012) and Local-SGD (a.k.a. Federated-Averaging) (Mangasarian and Solodov, 1993; McMahan et al., 2017; Stich, 2019). In Minibatch-SGD all machines (or clients) always compute unbiased gradient estimates at the same query points, while using large batch sizes; and it is well known that this approach is not degraded by data heterogeneity (Woodworth et al., 2020b). On the downside, the number of model updates made by Minibatch-SGD may be considerably smaller than the number of gradient queries made by each machine, due to the use of minibatches. This suggests that there may be room to improve over this approach by employing local update methods like Local-SGD, where the number of model updates and the number of gradient queries are the same. And indeed, in the past years, local update methods have been extensively investigated; see e.g. (Kairouz et al., 2021) and references therein.

We can roughly divide the research on FL into two scenarios: the homogeneous case, where it is assumed that the data on each machine is drawn from the same distribution; and the more realistic heterogeneous case, where it is assumed that data distributions may vary between machines.

Table 1: We compare the best known guarantees for parallel learning to our SLowcal-SGD approach for the heterogeneous SCO case. The bolded term in the Rate column is the one that compares least favourably against Minibatch SGD. Here $G$ and $G_{*}$ are the dissimilarity measures defined in Equations (1) and (3). The $R_{\min}$ column presents the minimal number of communication rounds required to obtain a linear speedup (we fix the values of $K,M$ and take $\sigma=1$). Note that we omit methods that do not enable a wall-clock linear speedup with $M$, e.g. (Mishchenko et al., 2022; Mitra et al., 2021).
Method | Rate | $R_{\min}$ ($\sigma=1$)
MiniBatch SGD (Dekel et al., 2012) | $\frac{1}{R}+\frac{\sigma}{\sqrt{MKR}}$ | $MK$
Accelerated MiniBatch SGD (Dekel et al., 2012; Lan, 2012) | $\frac{1}{R^{2}}+\frac{\sigma}{\sqrt{MKR}}$ | $(MK)^{1/3}$
Local SGD (Woodworth et al., 2020b) | $\mathbf{\frac{G^{2/3}}{R^{2/3}}+\frac{\sigma^{2/3}}{(\sqrt{K}R)^{2/3}}}+\frac{1}{KR}+\frac{\sigma}{\sqrt{MKR}}$ | $G^{4}\cdot(MK)^{3}+M^{3}K$
SCAFFOLD (Karimireddy et al., 2020b) | $\mathbf{\frac{1}{R}}+\frac{\sigma}{\sqrt{MKR}}$ | $MK$
SLowcal-SGD (This paper) | $\mathbf{\frac{\sigma^{1/2}+G_{*}^{1/2}}{K^{1/4}R}}+\frac{1}{KR}+\frac{1}{K^{1/3}R^{4/3}}+\frac{1}{R^{2}}+\frac{\sigma}{\sqrt{MKR}}$ | $G_{*}\cdot MK^{1/2}+MK^{1/2}$
Lower Bound: Local-SGD (Yuan and Ma, 2020) | $\mathbf{\frac{G_{*}^{2/3}}{R^{2/3}}+\frac{\sigma_{*}^{2/3}}{(\sqrt{K}R)^{2/3}}}+\frac{1}{KR}+\frac{\sigma}{\sqrt{MKR}}$ | $G_{*}^{4}\cdot(MK)^{3}+M^{3}K$

For the homogeneous case it was shown in (Woodworth et al., 2020a; Glasgow et al., 2022) that the standard Local-SGD method is not superior to Minibatch-SGD. Nevertheless, (Yuan and Ma, 2020) have designed an accelerated variant of Local-SGD that provably benefits over the Minibatch baseline. These results are established for the fundamental Stochastic Convex Optimization (SCO) setting, which assumes that the learning objective is convex.

Similarly to the homogeneous case, it was shown in (Woodworth et al., 2020b; Glasgow et al., 2022) that Local-SGD is not superior to Minibatch-SGD in heterogeneous scenarios. Nevertheless, several local approaches that compare with the Minibatch baseline were designed in (Karimireddy et al., 2020b; Gorbunov et al., 2021). Unfortunately, we have so far been missing a local method that provably benefits over the Minibatch baseline in the heterogeneous SCO setting.

Our work focuses on the latter heterogeneous SCO setting, and provides a new Local-SGD-style algorithm that provably benefits over the minibatch baseline. Our algorithm, named SLowcal-SGD, builds on customizing a recent technique for incorporating a slowly-changing sequence of query points (Cutkosky, 2019; Kavis et al., 2019), which in turn enables better mitigation of the bias induced by the local updates. Curiously, we also found importance weighting to be crucial in order to surpass the minibatch baseline.

In Table 1 we compare our results to the state-of-the-art methods for the heterogeneous SCO setting. We denote by $M$ the number of machines, by $K$ the number of local updates per round, and by $R$ the number of communication rounds. Additionally, $G$ (or $G_{*}$) measures the dissimilarity between machines. Our table shows that Local-SGD requires many more communication rounds than Minibatch-SGD, and that the dissimilarity $G$ (or $G_{*}$) substantially degrades its performance. Conversely, one can see that even if the dissimilarity measure is $G_{*}=O(1)$, our approach SLowcal-SGD still requires fewer communication rounds than Minibatch-SGD.

Similarly to the homogeneous case, accelerated-Minibatch-SGD (Dekel et al., 2012; Lan, 2012) obtains the best performance among all current methods, and it remains open whether one can outperform this accelerated minibatch baseline. In App. A we elaborate on the computation of $R_{\min}$ in Table 1.

Related Work.  We focus here on centralized learning problems, where we aim to employ $M$ machines in order to minimize a joint learning objective. We allow the machines to synchronize during $R$ communication rounds through a central machine called the Parameter Server ($\mathcal{PS}$); and allow each machine to draw $K$ samples and perform $K$ local gradient computations in every such communication round. We assume that each machine $i$ may draw i.i.d. samples from a distribution $\mathcal{D}_{i}$, which may vary between machines.

The most natural approach in this context is Minibatch-SGD and its accelerated variant (Dekel et al., 2012), which have been widely adopted both in academia and in industry; see e.g. (Hoffer et al., 2017; Smith et al., 2017; You et al., 2019). Local update methods like Local-SGD (McMahan et al., 2017) have recently gained much popularity due to the rise of FL, and have been extensively explored in the past years.

Focusing on the SCO setting, it is well known that the standard Local-SGD is not superior (actually in most regimes it is inferior) to Minibatch-SGD (Woodworth et al., 2020a, b; Glasgow et al., 2022). Nevertheless, (Yuan and Ma, 2020) devised a novel accelerated local approach that provably surpasses the Minibatch baseline in the homogeneous case.

The heterogeneous SCO case has also been extensively investigated, with several original and elegant approaches (Koloskova et al., 2020; Khaled et al., 2020; Karimireddy et al., 2020b; Woodworth et al., 2020b; Mitra et al., 2021; Gorbunov et al., 2021; Mishchenko et al., 2022; Patel et al., ). Nevertheless, so far we have been missing a local approach that provably benefits over Minibatch-SGD. Note that Mitra et al. (2021); Mishchenko et al. (2022) improve the communication complexity with respect to the condition number of the objective; however, their performance does not improve as we increase the number of machines $M$ (implying that such methods do not obtain a wall-clock speedup as we increase the number of machines), which is inferior to the minibatch baseline.

The heterogeneous non-convex setting was also extensively explored (Karimireddy et al., 2020b, a; Gorbunov et al., 2021); and the recent work of (Patel et al., ) has developed a novel algorithm that provably benefits over the minibatch baseline in this case. The latter work also provides a lower bound which demonstrates that their upper bound is almost tight. Finally, for the special case of quadratic loss functions, it was shown in (Woodworth et al., 2020a) and in (Karimireddy et al., 2020b) that it is possible to surpass the minibatch baseline.

It is important to note that, excluding the special case of quadratic losses, there does not exist a local update algorithm that provably benefits over accelerated-Minibatch-SGD (Dekel et al., 2012); and the latter applies to both homogeneous and heterogeneous SCO problems.

Our local update algorithm utilizes a recent technique of employing slowly changing query points in SCO problems (Cutkosky, 2019). The latter has been shown to be useful in designing universal accelerated methods (Kavis et al., 2019; Ene et al., 2021; Antonakopoulos et al., 2022), as well as in improving asynchronous training methods (Aviv et al., 2021).

2 Setting: Parallel Stochastic Optimization

We consider parallel stochastic optimization problems where the objective $f:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ is convex and is of the following form,

f(x):=\frac{1}{M}\sum_{i\in[M]}f_{i}(x):=\frac{1}{M}\sum_{i\in[M]}\mbox{\bf E}_{z^{i}\sim\mathcal{D}_{i}}f_{i}(x;z^{i})~.

Thus, the objective is an average of $M$ functions $\{f_{i}:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}\}_{i\in[M]}$, and each such $f_{i}(\cdot)$ can be written as an expectation over losses $f_{i}(\cdot,z^{i})$, where the $z^{i}$ are drawn from some distribution $\mathcal{D}_{i}$ which is unknown to the learner. For ease of notation, in what follows we will not explicitly denote $\mbox{\bf E}_{z^{i}\sim\mathcal{D}_{i}}$ but rather use $\mbox{\bf E}$ to denote the expectation w.r.t. all randomization.

We assume that there exist $M$ machines (computation units), and that each machine $i$ may independently draw samples from the distribution $\mathcal{D}_{i}$, and can therefore compute unbiased estimates of the gradients of $f_{i}(\cdot)$. Most commonly, we allow the machines to synchronize during $R$ communication rounds through a central machine called the Parameter Server ($\mathcal{PS}$); and allow each machine to perform $K$ local computations in every such communication round.

We consider first order optimization methods that iteratively employ samples and generate a sequence of query points, and eventually output a solution $x_{\rm output}$. Our performance measure is the expected excess loss, $\textbf{ExcessLoss}:=\mbox{\bf E}[f(x_{\rm output})]-f(w^{*})$, where the expectation is w.r.t. the randomization of the samples, and $w^{*}$ is a global minimum of $f(\cdot)$ in ${\mathbb{R}}^{d}$, i.e., $w^{*}\in\operatorname*{arg\,min}_{x\in{\mathbb{R}}^{d}}f(x)$.

More concretely, at every computation step, each machine $i\in[M]$ may draw a fresh sample $z^{i}\sim\mathcal{D}_{i}$ and compute a gradient estimate $g$ at a given point $x\in{\mathbb{R}}^{d}$ as follows, $g:=\nabla f_{i}(x,z^{i})$. Note that $\mbox{\bf E}[g|x]=\nabla f_{i}(x)$, i.e. $g$ is an unbiased estimate of $\nabla f_{i}(x)$.

General Parallelization Scheme.

A general scheme for parallel stochastic optimization is described in Alg. 1. It can be seen that the $\mathcal{PS}$ communicates with the machines along $R$ communication rounds. In every round $r\in[R]$ the $\mathcal{PS}$ distributes an anchor point $\Theta_{r}$, which is a starting point for the local computations in that round. Based on $\Theta_{r}$, each machine performs $K$ local gradient computations based on $K$ i.i.d. draws from $\mathcal{D}_{i}$, and yields a message $\Phi_{r}^{i}$. At the end of round $r$ the $\mathcal{PS}$ aggregates the messages from all machines and updates the anchor point $\Theta_{r+1}$. Finally, after the last round, the $\mathcal{PS}$ outputs $x_{\rm output}$, which is computed based on the anchor points $\{\Theta_{r}\}_{r=1}^{R}$.

Algorithm 1 Parallel Stochastic Optimization Template
  Input: $M$ machines, Parameter Server $\mathcal{PS}$, #Communication rounds $R$, #Local computations $K$, initial point $x_{0}$
  $\mathcal{PS}$ computes initial anchor point $\Theta_{0}$ using $x_{0}$
  for $r=0,\ldots,R-1$ do
     Distributing anchor: $\mathcal{PS}$ distributes anchor $\Theta_{r}$ to all $M$ machines
     Local Computations: Each machine $i\in[M]$ performs $K$ local gradient computations based on $K$ i.i.d. draws from $\mathcal{D}_{i}$, and yields a message $\Phi_{r}^{i}$
     Aggregation: $\mathcal{PS}$ aggregates $\{\Phi_{r}^{i}\}_{i\in[M]}$ from all machines, and computes a new anchor $\Theta_{r+1}$
  end for
  output: $\mathcal{PS}$ computes $x_{\rm output}$ based on $\{\Theta_{r}\}_{r=1}^{R}$

Ideally, one would hope that using $M$ machines in parallel will make it possible to accelerate the learning process by a factor of $M$. And there exists a rich line of works showing that this is indeed possible to some extent, depending on $K$, $R$, and on the parallelization algorithm.

Next, we describe the two most prominent approaches to first-order parallel optimization:

(i) Minibatch SGD:

In terms of Alg. 1, one can describe Minibatch-SGD as an algorithm in which the $\mathcal{PS}$ sends a weight vector $x_{r}\in{\mathbb{R}}^{d}$ in every round as the anchor point $\Theta_{r}$. Based on that anchor $\Theta_{r}:=x_{r}$, each machine $i$ computes an unbiased gradient estimate based on $K$ independent samples from $\mathcal{D}_{i}$, i.e. $g_{r}^{i}:=\frac{1}{K}\sum_{k=1}^{K}\nabla f_{i}(x_{r},z_{Kr+k}^{i})$, and communicates $g_{r}^{i}$ as the message $\Phi_{r}^{i}$ to the $\mathcal{PS}$. The latter aggregates the messages $\{\Phi_{r}^{i}:=g_{r}^{i}\}_{i\in[M]}$ and computes the next anchor point $x_{r+1}$,

x_{r+1}=x_{r}-\eta\cdot\frac{1}{M}\sum_{i\in[M]}g_{r}^{i}~,

where $\eta>0$ is the learning rate of the algorithm. The benefit of this approach is that all machines always compute gradient estimates at the same anchor points $\{x_{r}\}_{r}$, which greatly simplifies its analysis. On the downside, in this approach the number of gradient updates $R$ is smaller than the number of stochastic gradient computations made by each machine, which is $KR$. This gives hope that there is room to improve upon Minibatch SGD by addressing this issue.
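
To make the update above concrete, the following minimal Python sketch simulates Minibatch-SGD on a toy heterogeneous problem. The quadratic per-machine losses, the Gaussian gradient noise, and all constants are illustrative assumptions, chosen only to keep the sketch self-contained and runnable; they are not part of our setting or experiments.

```python
import numpy as np

# Minimal simulation of Minibatch-SGD (Alg. 1 with Theta_r := x_r).
# Toy assumption: f_i(x) = 0.5 * ||x - b_i||^2 with Gaussian gradient noise,
# so the global minimizer of f is the mean of the b_i's.

rng = np.random.default_rng(0)
d, M, K, R, eta, sigma = 10, 8, 16, 50, 0.1, 0.5
b = rng.normal(size=(M, d))          # per-machine optima (heterogeneity)
x = np.zeros(d)                      # anchor point x_0

def stoch_grad(i, x):
    """Unbiased estimate of grad f_i(x) = x - b_i."""
    return (x - b[i]) + sigma * rng.normal(size=d)

for r in range(R):
    # Each machine averages K stochastic gradients at the SAME anchor x_r ...
    g = np.mean([np.mean([stoch_grad(i, x) for _ in range(K)], axis=0)
                 for i in range(M)], axis=0)
    x = x - eta * g                  # ... and the PS makes one update per round.

print("distance to optimum:", np.linalg.norm(x - b.mean(axis=0)))
```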

(ii) Local SGD:

In terms of Alg. 1, one can describe Local-SGD as an algorithm in which the $\mathcal{PS}$ sends a weight vector $x_{rK}\in{\mathbb{R}}^{d}$ in every round $r\in[R]$ as the anchor information $\Theta_{r}$. Based on the anchor $\Theta_{r}:=x_{rK}$, each machine performs a sequence of local gradient updates based on $K$ independent samples from $\mathcal{D}_{i}$ as follows, $\forall k\in[K]$,

x_{rK+k+1}^{i}=x_{rK+k}^{i}-\eta\cdot\nabla f_{i}(x_{rK+k}^{i},z_{rK+k}^{i})~,

where for all machines $i\in[M]$ we initialize $x_{rK}^{i}=x_{rK}:=\Theta_{r}$, and $\eta>0$ is the learning rate of the algorithm. At the end of round $r$ each machine communicates $x_{(r+1)K}^{i}$ as the message $\Phi_{r}^{i}$ to the $\mathcal{PS}$, and the latter computes the next anchor as follows,

\Theta_{r+1}:=x_{(r+1)K}=\frac{1}{M}\sum_{i\in[M]}x_{(r+1)K}^{i}~.

In Local SGD the number of gradient steps is equal to the number of stochastic gradient computations made by each machine, which is $KR$. This suggests that such an approach may potentially surpass Minibatch SGD. Nevertheless, this potential benefit is hindered by the bias that is introduced between different machines during the local updates. And indeed, as we show in Table 1, this approach is inferior to Minibatch SGD in the prevalent case where $\sigma=O(1)$.
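
For comparison with the Minibatch-SGD sketch above, here is the corresponding Local-SGD simulation on the same illustrative toy problem (the quadratic losses, the noise model, and the constants remain assumptions made only for the sake of a runnable sketch).

```python
import numpy as np

# Minimal simulation of Local-SGD on the same toy heterogeneous quadratics,
# f_i(x) = 0.5 * ||x - b_i||^2, used in the Minibatch-SGD sketch above.

rng = np.random.default_rng(0)
d, M, K, R, eta, sigma = 10, 8, 16, 50, 0.05, 0.5
b = rng.normal(size=(M, d))
anchor = np.zeros(d)                              # x_{rK}

def stoch_grad(i, x):
    return (x - b[i]) + sigma * rng.normal(size=d)

for r in range(R):
    local = np.tile(anchor, (M, 1))               # x_{rK}^i = anchor, for all i
    for k in range(K):
        for i in range(M):                        # K local updates per machine
            local[i] -= eta * stoch_grad(i, local[i])
    anchor = local.mean(axis=0)                   # Theta_{r+1}: average of local iterates

print("distance to optimum:", np.linalg.norm(anchor - b.mean(axis=0)))
```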

Assumptions. We assume that $f(\cdot)$ is convex, and that the $f_{i}(\cdot)$ are smooth, i.e. $\exists L>0$ such that,

\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|~,\qquad\forall i\in[M]~,~\forall x,y\in{\mathbb{R}}^{d}

We also assume that the variance of the gradient estimates is bounded, i.e. that there exists $\sigma>0$ such that,

\mbox{\bf E}\|\nabla f_{i}(x;z)-\nabla f_{i}(x)\|^{2}\leq\sigma^{2}~,\qquad\forall x\in{\mathbb{R}}^{d}~,~\forall i\in[M]~.

Letting $w^{*}$ be a global minimum of $f(\cdot)$, we assume there exists $G_{*}\geq 0$ such that,

\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(w^{*})\|^{2}\leq G_{*}^{2}/2~,\qquad\textbf{($G_{*}$-Dissimilarity)}   (1)

The above assumption, together with smoothness and convexity, implies (see App. B),

\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(x)\|^{2}\leq G_{*}^{2}+4L(f(x)-f(w^{*}))~,\qquad\forall x\in{\mathbb{R}}^{d}   (2)

A stronger dissimilarity assumption that is often used in the literature is the following,

\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(x)-\nabla f(x)\|^{2}\leq G^{2}/2~,\qquad\forall x\in{\mathbb{R}}^{d}\qquad\textbf{($G$-Dissimilarity)}   (3)

Notation: For a sequence $\{y_{t}\}_{t}$ we denote $y_{t_{1}:t_{2}}:=\sum_{\tau=t_{1}}^{t_{2}}y_{\tau}$. For $N\in\mathbb{Z}^{+}$ we denote $[N]:=\{0,\ldots,N-1\}$.

3 Our Approach

Section 3.1 describes a basic (single machine) algorithmic template called Anytime-GD. Section 3.2 describes our SLowcal-SGD algorithm, which is a Local-SGD style algorithm in the spirit of Anytime GD. We describe our method in Alg. 2, and state its guarantees in Thm. 2.

3.1 Anytime GD

The standard GD algorithm computes a sequence of iterates $\{w_{t}\}_{t\in[T]}$ and queries the gradients at these iterates. It was recently shown that one can design a GD-style scheme that computes a sequence of iterates $\{w_{t}\}_{t\in[T]}$ yet queries the gradients at a different sequence $\{x_{t}\}_{t\in[T]}$, which may be slowly-changing, in the sense that $\|x_{t+1}-x_{t}\|$ may be considerably smaller than $\|w_{t+1}-w_{t}\|$.

Concretely, the Anytime-GD algorithm (Cutkosky, 2019; Kavis et al., 2019) that we describe in Equations (4) and (5) employs a learning rate $\eta>0$ and a sequence of non-negative weights $\{\alpha_{t}\}_{t}$. The algorithm maintains two sequences $\{w_{t}\}_{t},\{x_{t}\}_{t}$ that are updated as follows $\forall t$,

w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~,\quad\forall t\in[T]~,\quad\text{where } g_{t}=\nabla f(x_{t})~,   (4)

and then,

x_{t+1}=\frac{\alpha_{0:t}}{\alpha_{0:t+1}}x_{t}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}~.   (5)

It can be shown that the above implies $x_{t+1}=\frac{1}{\alpha_{0:t+1}}\sum_{\tau=0}^{t+1}\alpha_{\tau}w_{\tau}$, i.e. the $x_{t}$'s are weighted averages of the $w_{t}$'s. Thus, at every iterate the gradient $g_{t}$ is queried at $x_{t}$, which is a weighted average of past iterates, and then $w_{t+1}$ is updated similarly to GD with a weight $\alpha_{t}$ on the gradient $g_{t}$. Moreover, at initialization $x_{0}=w_{0}$.

Curiously, it was shown in Cutkosky (2019) that Anytime-GD obtains the same convergence rates as GD for convex loss functions (both smooth and non-smooth). It was further shown that one can employ a stochastic version (Anytime-SGD), where we query noisy gradients at $x_{t}$ instead of the exact ones, and that this approach performs similarly to SGD.
Slowly changing query points. A recent work (Aviv et al., 2021) demonstrates that if we use projected Anytime-SGD, i.e. project the $w_{t}$ sequence to a given bounded convex domain, then one can immediately show that for both $\alpha_{t}=1$ and $\alpha_{t}=t+1$ we obtain $\|x_{t+1}-x_{t}\|\leq 2D/t$, where $D$ is the diameter of the convex domain. Conversely, for standard SGD we have $\|w_{t+1}-w_{t}\|\leq\eta\|g_{t}\|$, where $g_{t}$ here is a (possibly noisy) unbiased estimate of $\nabla f(w_{t})$. Thus, while the change between consecutive SGD queries is controlled by $\eta$, which is usually $\propto 1/\sqrt{t}$, and by the magnitude of the stochastic gradients, for Anytime-SGD the change decays with time, irrespective of the learning rate $\eta$. In (Aviv et al., 2021) this is used to design better and more robust asynchronous training methods.
Relation to Momentum. In the appendix we show that Anytime-SGD can be explicitly written as a momentum method, and is therefore quite different from standard SGD. Concretely, for $\alpha_{t}=1$ we show that $x_{t+1}\approx x_{t}-\eta\sum_{\tau=1}^{t}(\tau/t^{2})\cdot g_{\tau}$, and for $\alpha_{t}\propto t$ we show that $x_{t+1}\approx x_{t}-\eta\sum_{\tau=1}^{t}(\tau/t)^{3}\cdot g_{\tau}$, where $g_{\tau}$ here is a (possibly noisy) unbiased estimate of $\nabla f(x_{\tau})$. This momentum interpretation provides complementary intuition regarding the benefit of Anytime-SGD in the context of local update methods: momentum brings more stability to the optimization process, which in turn reduces the bias between different machines.
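
To illustrate Equations (4) and (5) in isolation, here is a single-machine Anytime-SGD sketch with weights $\alpha_{t}=t+1$. The quadratic objective, the noise model, and the conservative step size (of order $1/(LT)$, in the spirit of the analysis) are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

# Single-machine Anytime-SGD, Eqs. (4)-(5), with weights alpha_t = t + 1.
# Toy assumption: f(x) = 0.5 * ||x - b||^2 (so L = 1) with Gaussian gradient noise.

rng = np.random.default_rng(0)
d, T, sigma = 10, 2000, 0.5
eta = 1.0 / (8.0 * (T + 1))                   # conservative step of order 1/(L*T)
b = rng.normal(size=d)                        # minimizer of the toy objective

w = np.zeros(d)                               # iterate sequence w_t
x = w.copy()                                  # query sequence x_t, with x_0 = w_0
alpha_sum = 1.0                               # alpha_{0:t}, starting from alpha_0 = 1

for t in range(T):
    g = (x - b) + sigma * rng.normal(size=d)  # noisy gradient queried at x_t
    w = w - eta * (t + 1.0) * g               # Eq. (4): weighted step on w_t
    alpha_next = t + 2.0                      # alpha_{t+1}
    alpha_sum += alpha_next                   # alpha_{0:t+1}
    x = (1 - alpha_next / alpha_sum) * x + (alpha_next / alpha_sum) * w   # Eq. (5)

print("f(x_T) - f(w*):", 0.5 * np.linalg.norm(x - b) ** 2)
```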

For the sake of this paper we will require a specific theorem that does not necessarily pertain to Anytime-GD, but is rather more general. We will require the following definition.

Definition. Let $\{\alpha_{t}\geq 0\}_{t}$ be a sequence of non-negative weights, and let $\{w_{t}\in{\mathbb{R}}^{d}\}_{t}$ be an arbitrary sequence. We say that a sequence $\{x_{t}\in{\mathbb{R}}^{d}\}_{t}$ is an $\{\alpha_{t}\}_{t}$ weighted average of $\{w_{t}\}_{t}$ if $x_{0}=w_{0}$, and for any $t>0$ Eq. (5) is satisfied.

Next, we state the main theorem for this section, which applies to any sequence $\{w_{t}\in{\mathbb{R}}^{d}\}_{t}$.

Theorem 1 (Rephrased from Theorem 1 in Cutkosky (2019)).

Let $f:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ be a convex function with a global minimum $w^{*}$. Also let $\{\alpha_{t}\geq 0\}_{t}$, and $\{w_{t}\in{\mathbb{R}}^{d}\}_{t},\{x_{t}\in{\mathbb{R}}^{d}\}_{t}$ such that $\{x_{t}\}_{t}$ is an $\{\alpha_{t}\}_{t}$ weighted average of $\{w_{t}\}_{t}$. Then the following holds for any $t\geq 0$,

0\leq\alpha_{0:t}\left(f(x_{t})-f(w^{*})\right)\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~.

3.2 SLowcal-SGD

Our approach is to employ an Anytime version of Local-SGD, which we name SLowcal-SGD.
Notation: Prior to describing our algorithm, we define $t$ to be the total number of per-machine local updates up to step $k$ of round $r$, resulting in $t:=rK+k$. In what follows, we will often find it useful to denote the iterates and samples using $t$, rather than explicitly writing $t=rK+k$. Additionally, we use $\{\alpha_{t}\}_{t}$ to denote a pre-defined sequence of non-negative weights. Finally, we denote $T:=RK$.

Algorithm 2 SLowcal-SGD
  Input: $M$ machines, Parameter Server $\mathcal{PS}$, #Communication rounds $R$, #Local computations $K$, initial point $x_{0}$, learning rate $\eta>0$, weights $\{\alpha_{t}\}_{t}$
  Initialize: set $w_{0}=x_{0}$, initialize anchor point $\Theta_{0}:=(w_{0},x_{0})$, and set $t=0$
  for $r=0,\ldots,R-1$ do
     Distributing anchor: $\mathcal{PS}$ distributes anchor $\Theta_{r}:=(w_{t},x_{t})$ to all machines; each machine $i\in[M]$ initializes $(w_{t}^{i},x_{t}^{i})=\Theta_{r}:=(w_{t},x_{t})$
     for $k=0,\ldots,K-1$ do
        Set $t=rK+k$
        Every machine $i\in[M]$ draws a fresh sample $z_{t}^{i}\sim\mathcal{D}_{i}$, and computes $g_{t}^{i}=\nabla f_{i}(x_{t}^{i},z_{t}^{i})$
        Update $w_{t+1}^{i}=w_{t}^{i}-\eta\alpha_{t}g_{t}^{i}$, and $x_{t+1}^{i}=(1-\frac{\alpha_{t+1}}{\alpha_{0:t+1}})x_{t}^{i}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}^{i}$
     end for
     Aggregation: $\mathcal{PS}$ aggregates $\{(w_{t+1}^{i},x_{t+1}^{i})\}_{i\in[M]}$ from all machines, and computes a new anchor $\Theta_{r+1}:=(w_{t+1},x_{t+1})=\left(\frac{1}{M}\sum_{i\in[M]}w_{t+1}^{i},\frac{1}{M}\sum_{i\in[M]}x_{t+1}^{i}\right)$
  end for
  output: $\mathcal{PS}$ outputs $x_{T}$ (recall $T=KR$)

In the spirit of Anytime-SGD, our approach is to maintain two sequences per machine $i\in[M]$: $\{w_{t}^{i}\in{\mathbb{R}}^{d}\}_{t}$ and $\{x_{t}^{i}\in{\mathbb{R}}^{d}\}_{t}$. Our approach is depicted explicitly in Alg. 2. Next, we describe our algorithm in terms of the scheme depicted in Alg. 1:
(i) Distributing anchor. At the beginning of round $r$ the $\mathcal{PS}$ distributes $\Theta_{r}=(w_{t},x_{t})=(w_{rK},x_{rK})\in{\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$ to all machines.
(ii) Local Computations. For $t=rK$, every machine initializes $(w_{t}^{i},x_{t}^{i})=\Theta_{r}$, and for the next $K$ steps, i.e. for any $rK\leq t\leq(r+1)K-1$, every machine performs a sequence of local Anytime-SGD steps as follows,

w_{t+1}^{i}=w_{t}^{i}-\eta\alpha_{t}g_{t}^{i}~,   (6)

where, similarly to Anytime-SGD, we query the gradients at the averages $x_{t}^{i}$, meaning $g_{t}^{i}=\nabla f_{i}(x_{t}^{i},z_{t}^{i})$. The query points are updated as weighted averages of the past iterates $\{w_{t}\}_{t}$,

x_{t+1}^{i}=\left(1-\frac{\alpha_{t+1}}{\alpha_{0:t+1}}\right)x_{t}^{i}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}^{i}~,\qquad\forall~rK\leq t\leq(r+1)K-1~.   (7)

At the end of round $r$, i.e. $t=(r+1)K$, each machine communicates $(w_{t}^{i},x_{t}^{i})$ as a message to the $\mathcal{PS}$.
(iii) Aggregation. The $\mathcal{PS}$ aggregates the messages and computes the next anchor point $\Theta_{r+1}=(w_{t},x_{t})=\frac{1}{M}\sum_{i\in[M]}\Phi_{r}^{i}:=\left(\frac{1}{M}\sum_{i\in[M]}w_{t}^{i},\frac{1}{M}\sum_{i\in[M]}x_{t}^{i}\right)$, where $t=(r+1)K$.
Remark: Note that for $t=rK$ our notation for $(w_{t}^{i},x_{t}^{i})$ is inconsistent: at the end of round $r-1$ these values may vary between different machines, while at the beginning of round $r$ they are all equal to $\Theta_{r}:=(w_{t},x_{t})$. Nevertheless, for simplicity we will abuse notation, and explicitly state the right definition when needed. Importantly, in most of our analysis we will mainly need to refer to the averages $\left(\frac{1}{M}\sum_{i\in[M]}w_{t}^{i},\frac{1}{M}\sum_{i\in[M]}x_{t}^{i}\right)$, and note that the latter are consistent at the end and beginning of consecutive rounds due to the definitions of $\Theta_{r}$ and $\Phi_{r-1}^{i}$.
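
To complement the pseudocode of Alg. 2, the following minimal single-process simulation runs SLowcal-SGD with weights $\alpha_{t}=t+1$ on the same illustrative toy problem used in the earlier sketches; the quadratic per-machine losses, the noise model, and the particular step size are assumptions made only so the sketch is runnable end to end.

```python
import numpy as np

# Minimal single-process simulation of SLowcal-SGD (Alg. 2), alpha_t = t + 1.
# Toy assumption: f_i(x) = 0.5 * ||x - b_i||^2 with Gaussian gradient noise.

rng = np.random.default_rng(0)
d, M, K, R, sigma = 10, 8, 16, 50, 0.5
T = K * R
eta = 1.0 / (48.0 * (T + 1))      # conservative choice, cf. the first term of Eq. (8) with L = 1
b = rng.normal(size=(M, d))

alpha = lambda t: t + 1.0
alpha_csum = np.cumsum([alpha(t) for t in range(T + 1)])      # alpha_{0:t}

w_anchor, x_anchor = np.zeros(d), np.zeros(d)                 # Theta_0 = (w_0, x_0)

def stoch_grad(i, x):
    return (x - b[i]) + sigma * rng.normal(size=d)

for r in range(R):
    w = np.tile(w_anchor, (M, 1))                             # every machine starts from the anchor
    x = np.tile(x_anchor, (M, 1))
    for k in range(K):
        t = r * K + k
        for i in range(M):
            g = stoch_grad(i, x[i])                           # gradient queried at the average x_t^i
            w[i] -= eta * alpha(t) * g                        # local Anytime step, Eq. (6)
            ratio = alpha(t + 1) / alpha_csum[t + 1]
            x[i] = (1 - ratio) * x[i] + ratio * w[i]          # Eq. (7)
    w_anchor, x_anchor = w.mean(axis=0), x.mean(axis=0)       # aggregation at the PS

w_star = b.mean(axis=0)                                       # minimizer of f for this toy problem
f = lambda x: 0.5 * np.mean([np.linalg.norm(x - b[i]) ** 2 for i in range(M)])
print("f(x_T) - f(w*):", f(x_anchor) - f(w_star))
```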

3.2.1 Guarantees & Intuition

Below we state our main result for SLowcal-SGD (Alg. 2),

Theorem 2.

Let $f(\cdot)$ be a convex and $L$-smooth function. Then, under the assumptions that we make in Sec. 2, invoking Alg. 2 with weights $\{\alpha_{t}=t+1\}_{t\in[T]}$ and an appropriate learning rate $\eta$ ensures,

\mbox{\bf E}\Delta_{T}~\leq~O\left(LB_{0}^{2}\left(\frac{1}{KR}+\frac{1}{R^{2}}+\frac{1}{K^{1/3}R^{4/3}}\right)+\frac{\sigma B_{0}}{\sqrt{MKR}}+\frac{L^{1/2}(\sigma^{1/2}+G_{*}^{1/2})\cdot B_{0}^{3/2}}{K^{1/4}R}\right)~,

where $\Delta_{T}:=f(x_{T})-f(w^{*})$, $B_{0}:=\|w_{0}-w^{*}\|$, and we choose the learning rate as follows,

\eta=\min\left\{\frac{1}{48L(T+1)},\frac{1}{10LK^{2}},\frac{1}{40LK(T+1)^{2/3}},\frac{\|w_{0}-w^{*}\|\sqrt{M}}{\sigma T^{3/2}},\frac{\|w_{0}-w^{*}\|^{1/2}}{L^{1/2}K^{7/4}R(\sigma^{1/2}+G_{*}^{1/2})}\right\}   (8)

As Table 1 shows, Thm. 2 implies that SLowcal-SGD improves over all existing upper bounds for Minibatch and Local SGD, by allowing fewer communication rounds to obtain a linear speedup of $M$.
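
For reference, here is a small helper that evaluates the step-size rule of Eq. (8); the function name and the example values are ours, and in practice the quantities $L$, $\sigma$, $G_{*}$, and $\|w_{0}-w^{*}\|$ would need to be known or estimated.

```python
def slowcal_learning_rate(L, M, K, R, sigma, G_star, dist0):
    """Evaluate the learning-rate rule of Eq. (8).

    L: smoothness constant, M: number of machines, K: local steps per round,
    R: communication rounds, sigma: gradient-noise level,
    G_star: dissimilarity at the optimum (Eq. (1)),
    dist0: ||w_0 - w*|| (unknown in practice; must be estimated).
    """
    T = K * R
    return min(
        1.0 / (48.0 * L * (T + 1)),
        1.0 / (10.0 * L * K ** 2),
        1.0 / (40.0 * L * K * (T + 1) ** (2.0 / 3.0)),
        dist0 * M ** 0.5 / (sigma * T ** 1.5),
        dist0 ** 0.5 / (L ** 0.5 * K ** 1.75 * R * (sigma ** 0.5 + G_star ** 0.5)),
    )

# Illustrative values only:
print(slowcal_learning_rate(L=1.0, M=16, K=32, R=100, sigma=1.0, G_star=1.0, dist0=1.0))
```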

Intuition. The degradation in local SGD schemes (both standard and Anytime) is due to the bias introduced between different machines during each round, which leads to a bias in their gradients. Intuitively, this bias is small if the machines query the gradients at a sequence of slowly changing query points. This is exactly the benefit of SLowcal-SGD, which queries the gradients at the averaged iterates $x_{t}^{i}$. Intuitively, these averages change slowly compared to the iterates $w_{t}^{i}$ themselves; and recall that the latter are the query points used by standard Local-SGD. A complementary intuition for the benefit of our approach is the interpretation of Anytime-SGD as a momentum method (see Sec. 3.1 and the appendix), which leads to decreased bias between machines.

To further simplify the more technical discussion here, we will assume the homogeneous case, i.e., that for any $i\in[M]$ we have $\mathcal{D}_{i}=\mathcal{D}$ and $f_{i}(\cdot)=f(\cdot)$.

So, a bit more formally, let us discuss the bias between query points in a given round $r\in[R]$, and let us denote $t_{0}=rK$. The following holds for standard Local SGD,

w_{t}^{i}=w_{t_{0}}-\eta\sum_{\tau=t_{0}}^{t-1}g_{\tau}^{i}~,\qquad\forall i\in[M],~t\in[t_{0},t_{0}+K]~.   (9)

where $g_{\tau}^{i}$ is the noisy gradient that Machine $i$ computes at $w_{\tau}^{i}$, and we can write $g_{\tau}^{i}:=\nabla f(w_{\tau}^{i})+\xi_{\tau}^{i}$, where $\xi_{\tau}^{i}$ is the noisy component of the gradient. Thus, for two machines $i\neq j$ we can write,

\mbox{\bf E}\|w_{t}^{i}-w_{t}^{j}\|^{2}=\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}g_{\tau}^{i}-g_{\tau}^{j}\right\|^{2}\approx\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\nabla f(w_{\tau}^{i})-\nabla f(w_{\tau}^{j})\right\|^{2}+\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}

It was shown in Woodworth et al. (2020a) that the noisy term is dominant, and therefore we can bound,

\frac{1}{\eta^{2}}\mbox{\bf E}\|w_{t}^{i}-w_{t}^{j}\|^{2}\lessapprox\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}\approx t-t_{0}\leq K~.   (10)

Similarly, for SLowcal-SGD we would like to bound $\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2}$ for two machines $i\neq j$; in order to simplify the discussion we will assume uniform weights, i.e. $\alpha_{t}=1~,~\forall t\in[T]$. Now the update rule for the iterates $w_{t}^{i}$ is of the same form as in Eq. (9), only now $g_{\tau}^{i}:=\nabla f(x_{\tau}^{i})+\xi_{\tau}^{i}$, where $\xi_{\tau}^{i}$ is the noisy component of the gradient. Consequently,

\sum_{\tau=t_{0}}^{t}(w_{\tau}^{i}-w_{\tau}^{j})\approx-\eta\sum_{\tau=t_{0}}^{t-1}(t-\tau)(g_{\tau}^{i}-g_{\tau}^{j})\approx-\eta K\sum_{\tau=t_{0}}^{t-1}(g_{\tau}^{i}-g_{\tau}^{j})~,

where we took the crude approximation $t-\tau\approx K$. Now, by the definition of $x_{t}^{i}$ and $\alpha_{t}=1$, $x_{t}^{i}=\frac{t_{0}}{t}\cdot x_{t_{0}}+\frac{1}{t}\sum_{\tau=t_{0}}^{t}w_{\tau}^{i}~,~\forall i\in[M],~t\in[t_{0},t_{0}+K]$. Thus, for two machines $i\neq j$ we have,

\frac{1}{\eta^{2}}\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2}=\frac{1}{\eta^{2}}\mbox{\bf E}\left\|\frac{1}{t}\sum_{\tau=t_{0}}^{t}w_{\tau}^{i}-w_{\tau}^{j}\right\|^{2}\approx\frac{1}{\eta^{2}}\cdot\frac{\eta^{2}K^{2}}{t^{2}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}g_{\tau}^{i}-g_{\tau}^{j}\right\|^{2}
\approx\frac{K^{2}}{t^{2}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\nabla f(x_{\tau}^{i})-\nabla f(x_{\tau}^{j})\right\|^{2}+\frac{K^{2}}{t^{2}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}~.

As we show in our analysis, the noisy term is dominant, and we can therefore bound,

\frac{1}{\eta^{2}}\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2}\lessapprox\frac{K^{2}}{t^{2}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}\approx\frac{K^{2}(t-t_{0})}{t^{2}}\leq\frac{K^{3}}{t^{2}}~.   (11)

Taking $t\approx T=RK$ above yields a bound of $O(K/R^{2})$. Thus, Equations (10) and (11) illustrate that the bias of SLowcal-SGD is smaller by a factor of $R^{2}$ than the bias of standard Local-SGD. In the appendix we demonstrate the same benefit of Anytime-SGD over SGD when both use $\alpha_{t}\propto t$.

Finally, note that the biases introduced by the local updates come into play in a slightly different manner in Local-SGD compared to SLowcal-SGD (a major challenge in our analysis is that for a given $i\in[M]$ the $\{x_{t}^{i}\}_{t}$ sequence is not necessarily an $\{\alpha_{t}\}_{t}$ weighted average of the $\{w_{t}^{i}\}_{t}$). Consequently, the above discussion does not capture the exact rates that we derive. Nevertheless, it provides some intuition regarding the benefit of our approach. The full and exact derivations appear in the appendix.
Importance Weights. One may wonder whether it is necessary to employ increasing weights $\alpha_{t}=t+1$, rather than standard uniform weights $\alpha_{t}=1~,\forall t$. Surprisingly, in our analysis we have found that increasing weights are crucial in order to obtain a benefit over Minibatch-SGD, and that upon using uniform weights SLowcal-SGD performs worse than Minibatch SGD! We elaborate on this in Appendix L. Below we provide an intuitive explanation.
Intuitive Explanation. The intuition behind the importance of using increasing weights is the following: increasing weights are a technical tool to put more emphasis on the last rounds. Now, in the context of local update methods, the iterates of the last rounds are more attractive, since the bias between different machines shrinks as we progress. Intuitively, this happens because, as the optimization process progresses, the expected value of the gradients that we compute goes to zero (since we converge); consequently, the bias between different machines shrinks.

3.3 Proof Sketch for Theorem 2

Proof Sketch for Theorem 2.

As a starting point for the analysis, for every iteration $t\in[T]$ we define the averages of $(w_{t}^{i},x_{t}^{i},g_{t}^{i})$ across all machines as follows,

w_{t}:=\frac{1}{M}\sum_{i\in[M]}w_{t}^{i}~,\qquad x_{t}:=\frac{1}{M}\sum_{i\in[M]}x_{t}^{i}~,\qquad g_{t}:=\frac{1}{M}\sum_{i\in[M]}g_{t}^{i}~.

Note that Alg. 2 explicitly computes $(w_{t},x_{t})$ only once every $K$ local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,

w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~,\qquad\forall t\in[T]   (12)

Further combining these definitions with Eq. (7) yields,

x_{t+1}=\left(1-\frac{\alpha_{t+1}}{\alpha_{0:t+1}}\right)x_{t}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}~,\qquad\forall t\in[T]   (13)

The above implies that the $\{x_{t}\}_{t\in[T]}$ sequence is an $\{\alpha_{t}\}_{t\in[T]}$ weighted average of $\{w_{t}\}_{t\in[T]}$. This enables us to employ Thm. 1, which yields $\alpha_{0:t}\Delta_{t}\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})$, where we denote $\Delta_{t}:=f(x_{t})-f(w^{*})$. This bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased estimates of $\nabla f(x_{t})$, except for the first iterate of each round. Concretely, Eq. (12) implies that our algorithm effectively updates using $g_{t}$, which is a biased estimate of $\nabla f(x_{t})$.

It is therefore natural to decompose $\nabla f(x_{\tau})=g_{\tau}+(\nabla f(x_{\tau})-g_{\tau})$ in the above bound, yielding,

\alpha_{0:t}\Delta_{t}\leq\underset{{\rm(A)}}{\underbrace{\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*})}}+\underset{{\rm(B)}}{\underbrace{\sum_{\tau=0}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-g_{\tau})\cdot(w_{\tau}-w^{*})}}   (14)

Thus, we intend to bound the weighted error $\alpha_{0:t}\Delta_{t}$ by bounding two terms: ${\rm(A)}$, which is related to the update rule of the algorithm, and ${\rm(B)}$, which accounts for the bias between $g_{t}$ and $\nabla f(x_{t})$.
Notation: In what follows we will find the following notation useful: $\bar{g}_{t}:=\frac{1}{M}\sum_{i\in[M]}\nabla f_{i}(x_{t}^{i})$, and note that $\bar{g}_{t}=\mbox{\bf E}\left[g_{t}|\{x_{t}^{i}\}_{i\in[M]}\right]$. We will also employ the notations $V_{t}:=\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|^{2}$ and $D_{t}:=\|w_{t}-w^{*}\|^{2}$, where $w^{*}$ is a global minimum of $f(\cdot)$. We will also denote $D_{0:t}:=\sum_{\tau=0}^{t}\|w_{\tau}-w^{*}\|^{2}$.

Bounding (A):

Due to the update rule of Eq. (12), one can show by standard regret analysis that ${\rm(A)}:=\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*})\leq\frac{\|w_{0}-w^{*}\|^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}~.$

Bounding (B):

We can bound ${\rm(B)}$ in expectation using $V_{t}$ and $D_{0:t}$ as follows, for any $\rho>0$: $\mbox{\bf E}\left[{\rm(B)}\right]\leq\frac{1}{2\rho}\mbox{\bf E}V_{t}+\frac{\rho}{2}\mbox{\bf E}D_{0:t}~.$

Combining (A) and (B):

Combining the above bounds on ${\rm(A)}$ and ${\rm(B)}$ into Eq. (14), we obtain the following bound, which holds for any $\rho>0$ and $t\in[T]$,

\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{2\eta}+\frac{\eta}{2}\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}+\frac{1}{2\rho}\mbox{\bf E}V_{T}+\frac{\rho}{2}\mbox{\bf E}D_{0:T}   (15)

Now, to simplify the proof sketch we shall assume that $D_{t}\leq D_{0}~\forall t$, implying that $D_{0:T}\leq TD_{0}$. Plugging this into the above equation and taking $\rho=\frac{1}{4\eta T}$ gives,

\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+\eta\cdot\underset{{\rm(*)}}{\underbrace{\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}}}+4\eta T\,\mbox{\bf E}V_{T}~.   (16)

Next we will bound ${\rm(*)}$ and $\mbox{\bf E}V_{T}$, and plug them back into Eq. (16).

Bounding ${\rm(*)}$:

To bound ${\rm(*)}$ it is natural to decompose $g_{\tau}=(g_{\tau}-\bar{g}_{\tau})+(\bar{g}_{\tau}-\nabla f(x_{\tau}))+\nabla f(x_{\tau})$. Using this decomposition we show that ${\rm(*)}~\lessapprox~3\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+3\mbox{\bf E}V_{T}+12L\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}~.$

Bounding $\mbox{\bf E}V_{T}$:

The definition of $V_{t}$ shows that it encompasses the bias introduced by the local updates, which in turn relates to the distances $\|x_{t}^{i}-x_{t}^{j}\|~,~\forall i,j\in[M]$. Thus, $\mbox{\bf E}V_{T}$ is directly related to the dissimilarity between the machines. Our analysis shows the following: $\mbox{\bf E}V_{T}\leq 400L^{2}\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}\cdot(G_{*}^{2}+4L\Delta_{\tau})+90L^{2}\eta^{2}K^{6}R^{3}\sigma^{2}~.$ Plugging the above into Eq. (16), and using our choice for $\eta$, gives an almost explicit bound,

\alpha_{0:t}\mbox{\bf E}\Delta_{t}\lessapprox\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}+L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}G_{*}^{2}+\frac{1}{2(T+1)}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}~.

The theorem follows by plugging the choices of $\eta$ and $\alpha_{t}$ into the above, and using a technical lemma. ∎

4 Experiments

To assess the effectiveness of our proposed approach, we conducted experiments on the MNIST (LeCun et al., 2010) dataset—a well-established benchmark in image classification comprising 70,000 grayscale images of handwritten digits (0–9), with 60,000 images designated for training and 10,000 for testing. The dataset was accessed via torchvision (version 0.16.2). We implemented a logistic regression model (Bishop and Nasrabadi, 2006) using the PyTorch framework and executed all computations on an NVIDIA L40S GPU. To ensure robustness, results were averaged over three different random seeds. The complete codebase for these experiments is publicly available on our GitHub repository: https://github.com/dahan198/slowcal-sgd

We evaluated our approach using parameters derived from our theoretical framework ($\alpha_{t}=t$) in comparison to Local-SGD and Minibatch-SGD under various configurations. Specifically, experiments were conducted with 16, 32, and 64 workers to examine the scalability and robustness of the proposed method. We also varied the number of local updates $K$ (or minibatch sizes for Minibatch-SGD) among 4, 8, 16, 32, and 64 to investigate how different local iteration counts impact performance. Data subsets for each worker were generated using a Dirichlet distribution (Hsu et al., 2019) with $\alpha=0.1$ to simulate real-world non-IID data scenarios characterized by high heterogeneity. For fairness, the learning rate was selected through grid search, with a value of 0.01 for SLowcal-SGD and Local-SGD, and 0.1 for Minibatch-SGD. More details about the data distribution across workers and complete experimental results are provided in Appendix M.
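
For concreteness, the sketch below shows one standard way to generate such Dirichlet-based non-IID splits (Hsu et al., 2019); it is an illustrative reimplementation of the recipe, not necessarily the exact partitioning code from our repository.

```python
import numpy as np

# Illustrative Dirichlet-based non-IID partition of a labeled dataset across
# workers, following the recipe of Hsu et al. (2019).

def dirichlet_partition(labels, num_workers, alpha=0.1, seed=0):
    """Return a list of index arrays, one per worker."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    worker_indices = [[] for _ in range(num_workers)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Sample per-worker proportions for this class from Dir(alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_workers))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for w, part in enumerate(np.split(idx, splits)):
            worker_indices[w].extend(part.tolist())
    return [np.array(ix) for ix in worker_indices]

# Toy usage with random labels standing in for MNIST targets.
fake_labels = np.random.default_rng(1).integers(0, 10, size=1000)
parts = dirichlet_partition(fake_labels, num_workers=16, alpha=0.1)
print([len(p) for p in parts])
```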

Figure 1: Performance vs. Local Iterations ($K$) for different numbers of workers ($M$). (a) Test Accuracy (higher is better). (b) Test Loss (lower is better).

Our results on the MNIST dataset, presented in Figure 1 and detailed in Appendix M.2, demonstrate the effectiveness of our approach, showing consistent performance improvements compared to Local-SGD and Minibatch-SGD as the number of local steps increases. Notably, this improvement becomes even more significant compared to the other methods as the number of workers increases, underscoring the scalability of our method and aligning with the theoretical guarantees outlined in our framework. These results highlight the robustness of our approach in handling highly heterogeneous, distributed environments.

Upon closer inspection, when a small number of local steps is performed, the differences between the approaches are negligible, with a slight advantage for Minibatch-SGD. However, as the number of local steps increases, the minibatch size grows, and the need for significant variance reduction diminishes. In this regime, making more frequent optimization updates becomes more impactful, as demonstrated by the superior performance of the local approaches compared to Minibatch-SGD. Importantly, with SLowcal-SGD, which keeps local updates closely aligned among workers throughout the training process, we achieve significantly better and more stable performance compared to both Minibatch-SGD and Local-SGD as the number of local steps $K$ and the number of workers $M$ increase.

5 Conclusion

We have presented the first local approach for the heterogeneous distributed Stochastic Convex Optimization (SCO) setting that provably benefits over the two most prominent baselines, namely Minibatch-SGD and Local-SGD. There are several interesting avenues for future exploration:
(a) developing an adaptive variant that does not require knowledge of problem parameters like $\sigma$ and $L$; (b) allowing a per-dimension step-size, which could be beneficial in (the prevalent) scenarios where the scale of the gradients changes considerably between different dimensions, in the spirit of the well-known AdaGrad method (Duchi et al., 2011); and finally, (c) it will be interesting to understand whether we can find an algorithm that provably dominates the Accelerated Minibatch-SGD baseline, which is an open question also in the homogeneous SCO setting.

Acknowledgement

This research was partially supported by Israel PBC-VATAT, the Technion Artificial Intelligent Hub (Tech.AI), and the Israel Science Foundation (grant No. 3109/24).

References

  • APP [2019] Apple. designing for privacy (video and slide deck). apple wwdc, 2019. URL https://developer.apple.com/videos/play/wwdc2019/708.
  • INT [2020] Intel and consilient. intel and consilient join forces to fight financial fraud with ai, 2020. URL https://newsroom.intel.com/news/intel-consilient-join-forces-fight-financial-fraud-ai/.
  • MEL [2020] Melloddy. melloddy project meets its year one objective: Deployment of the world’s first secure platform for multi-task federated learning in drug discovery among 10 pharmaceutical companies, 2020. URL https://www.melloddy.eu/y1announcement.
  • GGL [2021] Google. your voice and audio data stays private while google assistant improves, 2021. URL https://support.google.com/assistant/answer/10176224.
  • Antonakopoulos et al. [2022] Kimon Antonakopoulos, Dong Quan Vu, Volkan Cevher, Kfir Levy, and Panayotis Mertikopoulos. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pages 772–795. PMLR, 2022.
  • Aviv et al. [2021] Rotem Zamir Aviv, Ido Hakimi, Assaf Schuster, and Kfir Yehuda Levy. Asynchronous distributed learning: Adapting to gradient delays without prior knowledge. In International Conference on Machine Learning, pages 436–445. PMLR, 2021.
  • Bishop and Nasrabadi [2006] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
  • [8] Ashok Cutkosky. Lecture notes for ec525: Optimization for machine learning.
  • Cutkosky [2019] Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. In International conference on machine learning, pages 1446–1454. PMLR, 2019.
  • Dekel et al. [2012] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.
  • Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Ene et al. [2021] Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
  • Glasgow et al. [2022] Margalit R Glasgow, Honglin Yuan, and Tengyu Ma. Sharp bounds for federated averaging (local sgd) and continuous perspective. In International Conference on Artificial Intelligence and Statistics, pages 9050–9090. PMLR, 2022.
  • Gorbunov et al. [2021] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local sgd: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pages 3556–3564. PMLR, 2021.
  • Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hoffer et al. [2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
  • Hsu et al. [2019] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Karimireddy et al. [2020a] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
  • Karimireddy et al. [2020b] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020b.
  • Kavis et al. [2019] Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
  • Khaled et al. [2020] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
  • Koloskova et al. [2020] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
  • Lan [2012] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  • LeCun et al. [2010] Yann LeCun, Corinna Cortes, Chris Burges, et al. Mnist handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/. Licensed under CC BY-SA 3.0, available at https://creativecommons.org/licenses/by-sa/3.0/.
  • Mangasarian and Solodov [1993] Olvi L Mangasarian and Mikhail V Solodov. Backpropagation convergence via deterministic nonmonotone perturbed minimization. Advances in Neural Information Processing Systems, 6, 1993.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • Mishchenko et al. [2022] Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! In International Conference on Machine Learning, pages 15750–15769. PMLR, 2022.
  • Mitra et al. [2021] Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.
  • [31] Kumar Kshitij Patel, Lingxiao Wang, Blake Woodworth, Brian Bullins, and Nathan Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Advances in Neural Information Processing Systems.
  • Smith et al. [2017] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
  • Stich [2019] Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
  • Wang et al. [2021a] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Agüera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas N. Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horváth, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konečný, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtárik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake E. Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, and Wennan Zhu. A field guide to federated optimization. CoRR, abs/2107.06917, 2021a. URL https://arxiv.org/abs/2107.06917.
  • Wang et al. [2021b] Jun-Kun Wang, Jacob Abernethy, and Kfir Y Levy. No-regret dynamics in the fenchel game: A unified framework for algorithmic convex optimization. arXiv e-prints, pages arXiv–2111, 2021b.
  • Woodworth et al. [2020a] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020a.
  • Woodworth et al. [2020b] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020b.
  • You et al. [2019] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
  • Yuan and Ma [2020] Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, 33:5332–5344, 2020.

Appendix A Explanations Regarding the Linear Speedup and Table 1

Here we elaborate on the computations done in Table 1. First, we explain why the dominance of the term $\frac{1}{\sqrt{MKR}}$ implies a linear speedup by a factor of $M$.

Explanation.

Recall that using SGD with a single machine ($M=1$) yields a convergence rate of $\frac{1}{\sqrt{KR}}$ (as a dominant term). Thus, in order to obtain an excess loss smaller than some $\varepsilon>0$, SGD requires $RK\geq\Omega\left(\frac{1}{\varepsilon^{2}}\right)$, where $RK$ is the wall-clock time required to compute the solution.

Now, when we use parallel optimization with $R$ communication rounds, $K$ local computations, and $M$ machines, the wall-clock time to compute a solution is still $RK$. If the dominant term in the convergence rate of this algorithm is $\frac{1}{\sqrt{MKR}}$, then the wall-clock time to obtain an $\varepsilon$-optimal solution should be $RK\geq\Omega\left(\frac{1}{M\varepsilon^{2}}\right)$, and the latter is smaller by a factor of $M$ compared to a single machine.
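
As a concrete illustration (with arbitrary example numbers): for a target accuracy of $\varepsilon=10^{-2}$ and $M=100$ machines, the two requirements read

RK\geq\Omega\left(\frac{1}{\varepsilon^{2}}\right)=\Omega(10^{4})\quad\text{(single machine)},\qquad RK\geq\Omega\left(\frac{1}{M\varepsilon^{2}}\right)=\Omega(10^{2})\quad(M=100\text{ machines}),

i.e., the parallel scheme requires a factor of $M=100$ fewer wall-clock steps.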

Computation of RminR_{\min} in Table 1.

The term 1MKR\frac{1}{\sqrt{MKR}} appears in the bounds of all of the parallel optimization methods that we describe. Nevertheless, it is dominant only as long as the number of communication rounds RR is larger than some threshold value RminR_{\min} that depends on the specific convergence rate. Clearly, smaller values of RminR_{\min} imply less communication. Thus, in the 𝐑min\bf{R_{\min}} column of the table we compute RminR_{\min} for each method based on the term in the bound that compares least favourably against 1MKR\frac{1}{\sqrt{MKR}}. These terms are bolded in the Rate column of the table.

Concretely, denoting this least favourable term by parallel:=parallel(M,K,R,G)\mathcal{B}^{\rm parallel}:=\mathcal{B}^{\rm parallel}(M,K,R,G_{*})  555parallel\mathcal{B}^{\rm parallel} may also depend on σ,L,w0w\sigma,L,\|w_{0}-w^{*}\| but for simplicity of exposition we hide these dependencies in Table 1., then 𝐑min\bf{R_{\min}} is the smallest RR that satisfies,

parallel1MKR.\mathcal{B}^{\rm parallel}\leq\frac{1}{\sqrt{MKR}}~{}.
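The following short sketch carries out this computation numerically. The specific form of the least favourable term used here (a term decaying like G_*/R) is a hypothetical example, chosen only to illustrate how R_min is obtained from the inequality above:

def r_min(K, M, B_parallel, R_max=10**7):
    # Smallest R for which the least favourable term drops below 1/sqrt(M*K*R).
    for R in range(1, R_max):
        if B_parallel(R) <= 1.0 / (M * K * R) ** 0.5:
            return R
    return None

K, M, G_star = 100, 10, 1.0
# Hypothetical least-favourable term decaying like G_*/R (illustration only).
print(r_min(K, M, lambda R: G_star / R))  # here R_min = M*K*G_star**2 = 1000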

Appendix B On Heterogeneity Assumption

Let us assume that the following holds at the optimum ww^{*},

1Mi[M]fi(w)2G2/2\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(w^{*})\|^{2}\leq G_{*}^{2}/2

Then we can show the following relation for any wdw\in{\mathbb{R}}^{d},

1Mi[M]fi(w)2\displaystyle\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(w)\|^{2} =1Mi[M]fi(w)fi(w)+fi(w)2\displaystyle=\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(w)-\nabla f_{i}(w^{*})+\nabla f_{i}(w^{*})\|^{2}
2Mi[M]fi(w)fi(w)2+2Mi[M]fi(w)2\displaystyle\leq\frac{2}{M}\sum_{i\in[M]}\|\nabla f_{i}(w)-\nabla f_{i}(w^{*})\|^{2}+\frac{2}{M}\sum_{i\in[M]}\|\nabla f_{i}(w^{*})\|^{2}
4L(f(w)f(w))+G2.\displaystyle\leq 4L(f(w)-f(w^{*}))+G_{*}^{2}~{}.

where we used a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, which holds for any a,bda,b\in{\mathbb{R}}^{d}, and the last line follows from the lemma below, which we borrow from [Johnson and Zhang, 2013; Cutkosky, ].

Lemma 1.

Let (x)=1Mi[M]i(x)\mathcal{L}(x)=\frac{1}{M}\sum_{i\in[M]}\ell_{i}(x) be a convex function with global minimum ww^{*}, and assume that every i:d\ell_{i}:{\mathbb{R}}^{d}\mapsto{\mathbb{R}} is LL-smooth. Then the following holds,

1Mi[M]i(w)i(w)22L((w)(w)).\frac{1}{M}\sum_{i\in[M]}\|\nabla\ell_{i}(w)-\nabla\ell_{i}(w^{*})\|^{2}\leq 2L(\mathcal{L}(w)-\mathcal{L}(w^{*}))~{}.
Proof of Lemma 1.

The lemma follows immediately from lemma 27.1 in [Cutkosky, ], by taking v=wv=w^{*} therein. ∎
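The relation derived at the beginning of this appendix can also be sanity-checked numerically. Below is a minimal sketch using scalar quadratics f_i(w) = 0.5*(w - b_i)^2 (so L = 1 and w^* is the mean of the b_i's); the particular quadratics and the tight choice of G_*^2 are only for illustration:

import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=8)                       # f_i(w) = 0.5*(w - b_i)^2, hence L = 1
L = 1.0
w_star = b.mean()                            # minimizer of f(w) = (1/M) * sum_i f_i(w)
G2_star = 2.0 * np.mean((w_star - b) ** 2)   # tight G_*^2: (1/M) sum ||grad f_i(w*)||^2 = G_*^2 / 2

for w in rng.normal(size=5):
    lhs = np.mean((w - b) ** 2)                                    # (1/M) sum ||grad f_i(w)||^2
    f_gap = 0.5 * np.mean((w - b) ** 2) - 0.5 * np.mean((w_star - b) ** 2)
    assert lhs <= 4 * L * f_gap + G2_star + 1e-12                  # the derived relation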

Appendix C Interpreting Anytime-SGD as Momentum

Here we show how to interpret the Anytime-SGD algorithm that we present in Equations (4) and (5) as a momentum method. For completeness, we rewrite the update equations below,

wt+1=wtηαtgt,t[T],where gt=f(xt),\displaystyle w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~{},\forall t\in[T]~{},\text{where~{}}g_{t}=\nabla f(x_{t})~{}, (17)

and then,

xt+1=α0:tα0:t+1xt+αt+1α0:t+1wt+1.\displaystyle x_{t+1}=\frac{\alpha_{0:t}}{\alpha_{0:t+1}}x_{t}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}~{}. (18)

where gtg_{t} is an unbiased gradient estimate at xtx_{t}, and {αt}t\{\alpha_{t}\}_{t} is a sequence of non-negative scalars. At initialization, we set x0=w0x_{0}=w_{0}.

First note that Eq. (18) directly implies that,

xt+1=1α0:t+1τ=0t+1ατwτ.x_{t+1}=\frac{1}{\alpha_{0:t+1}}\sum_{\tau=0}^{t+1}\alpha_{\tau}w_{\tau}~{}.
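As a small sanity check of this identity, the sketch below runs the recursion (17)-(18) with exact gradients of a toy quadratic (the objective, step size, and horizon are arbitrary illustrative choices) and verifies that each x_{t+1} indeed equals the alpha-weighted average of w_0, ..., w_{t+1}:

import numpy as np

def anytime_sgd(alphas, eta=0.1, T=50):
    # Runs Eqs. (17)-(18) with exact gradients of f(x) = 0.5*||x||^2 (illustration only)
    # and checks that x_{t+1} equals the alpha-weighted average of w_0, ..., w_{t+1}.
    w = np.ones(3)
    x = w.copy()
    ws, cum = [w.copy()], alphas[0]
    for t in range(T):
        g = x                                                        # grad f(x_t) for f(x) = 0.5*||x||^2
        w = w - eta * alphas[t] * g                                  # Eq. (17)
        ws.append(w.copy())
        cum_next = cum + alphas[t + 1]                               # alpha_{0:t+1}
        x = (cum / cum_next) * x + (alphas[t + 1] / cum_next) * w    # Eq. (18)
        cum = cum_next
        avg = sum(a * wi for a, wi in zip(alphas, ws)) / cum
        assert np.allclose(x, avg)

anytime_sgd(alphas=[t + 1.0 for t in range(60)])                     # linear weights alpha_t = t + 1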

Next, note that we can directly write,

wτ=w0ηn=0τ1αngnw_{\tau}=w_{0}-\eta\sum_{n=0}^{\tau-1}\alpha_{n}g_{n}

Plugging the above into the formula for xt+1x_{t+1} yields,

xt+1\displaystyle x_{t+1} =w0η1α0:t+1τ=0t+1n=0τ1αταngn\displaystyle=w_{0}-\eta\frac{1}{\alpha_{0:t+1}}\sum_{\tau=0}^{t+1}\sum_{n=0}^{\tau-1}\alpha_{\tau}\alpha_{n}g_{n}
=w0η1α0:t+1n=0tτ=n+1t+1αταngn\displaystyle=w_{0}-\eta\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t}\sum_{\tau=n+1}^{t+1}\alpha_{\tau}\alpha_{n}g_{n}
=w0η1α0:t+1n=0tαn+1:t+1αngn.\displaystyle=w_{0}-\eta\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t}\alpha_{n+1:t+1}\alpha_{n}g_{n}~{}.

Thus,

1η(xt+1xt)\displaystyle\frac{1}{\eta}\left(x_{t+1}-x_{t}\right) =1α0:tn=0t1αn+1:tαngn1α0:t+1n=0tαn+1:t+1αngn\displaystyle=\frac{1}{\alpha_{0:t}}\sum_{n=0}^{t-1}\alpha_{n+1:t}\alpha_{n}g_{n}-\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t}\alpha_{n+1:t+1}\alpha_{n}g_{n}
=1α0:t+1n=0t1α0:t+1α0:tαn+1:tαngn1α0:t+1n=0t1αn+1:t+1αngn1α0:t+1αt+1αtgt\displaystyle=\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t-1}\frac{\alpha_{0:t+1}}{\alpha_{0:t}}\alpha_{n+1:t}\alpha_{n}g_{n}-\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t-1}\alpha_{n+1:t+1}\alpha_{n}g_{n}-\frac{1}{\alpha_{0:t+1}}\alpha_{t+1}\alpha_{t}g_{t}
=1α0:t+1n=0t1(αn+1:t+1α0:t+1α0:tαn+1:t)αngn1α0:t+1αt+1αtgt\displaystyle=-\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t-1}\left(\alpha_{n+1:t+1}-\frac{\alpha_{0:t+1}}{\alpha_{0:t}}\alpha_{n+1:t}\right)\alpha_{n}g_{n}-\frac{1}{\alpha_{0:t+1}}\alpha_{t+1}\alpha_{t}g_{t}
=1α0:t+1n=0t1αt+1α0:nα0:tαngn1α0:t+1αt+1αtgt\displaystyle=-\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t-1}\alpha_{t+1}\frac{\alpha_{0:n}}{\alpha_{0:t}}\alpha_{n}g_{n}-\frac{1}{\alpha_{0:t+1}}\alpha_{t+1}\alpha_{t}g_{t}
=1α0:t+1n=0tαt+1αnα0:nα0:tgn,\displaystyle=-\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t}\alpha_{t+1}\alpha_{n}\frac{\alpha_{0:n}}{\alpha_{0:t}}g_{n}~{}, (19)

where we used the equality below,

αn+1:t+1α0:t+1α0:tαn+1:t=αt+1αn+1:tαt+1α0:t=αt+1(1αn+1:tα0:t)=αt+1α0:nα0:t.\alpha_{n+1:t+1}-\frac{\alpha_{0:t+1}}{\alpha_{0:t}}\alpha_{n+1:t}=\alpha_{t+1}-\alpha_{n+1:t}\frac{\alpha_{t+1}}{\alpha_{0:t}}=\alpha_{t+1}(1-\frac{\alpha_{n+1:t}}{\alpha_{0:t}})=\alpha_{t+1}\frac{\alpha_{0:n}}{\alpha_{0:t}}~{}.

Thus we can write,

xt+1xtη1α0:t+1n=0tαt+1αnα0:nα0:tgn,x_{t+1}\approx x_{t}-\eta\frac{1}{\alpha_{0:t+1}}\sum_{n=0}^{t}\alpha_{t+1}\alpha_{n}\frac{\alpha_{0:n}}{\alpha_{0:t}}g_{n}~{},

Uniform Weights. Thus, taking uniform weights αt=1\alpha_{t}=1 yields,

xt+1xtηn=0tnt2gn.x_{t+1}\approx x_{t}-\eta\sum_{n=0}^{t}\frac{n}{t^{2}}g_{n}~{}.

Linear Weights. Similarly, taking linear weights αt=t+1\alpha_{t}=t+1 yields,

xt+1xtηn=0tn3t3gn.x_{t+1}\approx x_{t}-\eta\sum_{n=0}^{t}\frac{n^{3}}{t^{3}}g_{n}~{}.
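These two scalings can be checked directly from the exact coefficient of g_n in Eq. (19), namely alpha_{t+1} * alpha_n * alpha_{0:n} / (alpha_{0:t+1} * alpha_{0:t}). The sketch below is purely a numerical sanity check that, up to constant factors, the coefficients behave like n/t^2 for uniform weights and like n^3/t^3 for linear weights:

import numpy as np

def momentum_coeffs(alphas, t):
    # Exact coefficient of g_n, n = 0..t, in (x_t - x_{t+1})/eta, read off Eq. (19).
    a = np.asarray(alphas, dtype=float)
    A = np.cumsum(a)                          # A[k] = alpha_{0:k}
    n = np.arange(t + 1)
    return a[t + 1] * a[n] * A[n] / (A[t + 1] * A[t])

t = 1000
uniform = momentum_coeffs(np.ones(t + 2), t)           # alpha_t = 1
linear = momentum_coeffs(np.arange(1.0, t + 3), t)     # alpha_t = t + 1
n = np.arange(1, t + 1)
ratio_u = uniform[1:] / (n / t**2)
ratio_l = linear[1:] / (n**3 / t**3)
# For moderate-to-large n the ratios settle near constants (about 1 and 2 respectively),
# i.e. up to constants the coefficients indeed scale like n/t^2 and n^3/t^3.
print(ratio_u[t // 2:].min(), ratio_u[t // 2:].max())  # both close to 1
print(ratio_l[t // 2:].min(), ratio_l[t // 2:].max())  # both close to 2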

Appendix D Proof of Theorem 1

Proof of Theorem 1.

We reproduce the proof of Theorem 11 from Cutkosky [2019].

First, since ww^{*} is a global minimum and the weights α0:t\alpha_{0:t} are non-negative, it clearly holds that,

α0:t(f(xt)f(w))0.\alpha_{0:t}\left(f(x_{t})-f(w^{*})\right)\geq 0~{}.

Now, notice that the following holds,

αt(xtwt)=α0:t1(xt1xt)\alpha_{t}(x_{t}-w_{t})=\alpha_{0:t-1}(x_{t-1}-x_{t})

Using the gradient inequality for ff gives,

τ=0tατ(f(xτ)f(w))\displaystyle\sum_{\tau=0}^{t}\alpha_{\tau}(f(x_{\tau})-f(w^{*})) τ=0tατf(xτ)(xτw)\displaystyle\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(x_{\tau}-w^{*})
=τ=0tατf(xτ)(wτw)+τ=0tατf(xτ)(xτwτ)\displaystyle=\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})+\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(x_{\tau}-w_{\tau})
=τ=0tατf(xτ)(wτw)+τ=0tα0:τ1f(xτ)(xτ1xτ)\displaystyle=\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})+\sum_{\tau=0}^{t}\alpha_{0:\tau-1}\nabla f(x_{\tau})\cdot(x_{\tau-1}-x_{\tau})
τ=0tατf(xτ)(wτw)+τ=0tα0:τ1(f(xτ1)f(xτ)),\displaystyle\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})+\sum_{\tau=0}^{t}\alpha_{0:\tau-1}(f(x_{\tau-1})-f(x_{\tau}))~{},

where we have used the gradient inequality again which implies f(xτ)(xτ1xτ)f(xτ1)f(xτ)\nabla f(x_{\tau})\cdot(x_{\tau-1}-x_{\tau})\leq f(x_{\tau-1})-f(x_{\tau}).

Now, re-ordering, we obtain,

τ=0t(α0:τf(xτ)α0:τ1f(xτ1))α0:tf(w)\displaystyle\sum_{\tau=0}^{t}\left(\alpha_{0:\tau}f(x_{\tau})-\alpha_{0:\tau-1}f(x_{\tau-1})\right)-\alpha_{0:t}f(w^{*}) τ=0tατf(xτ)(wτw).\displaystyle\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~{}.

Telescoping the sum on the LHS, we conclude the proof,

α0:t(f(xt)f(w))\displaystyle\alpha_{0:t}\left(f(x_{t})-f(w^{*})\right) τ=0tατf(xτ)(wτw).\displaystyle\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~{}. ∎

Appendix E More Intuition and Discussion Regarding the Benefit of SLowcal-SGD

More Elaborate Intuitive Explanation.

The intuition is the following: we have two extreme baselines. (1) Minibatch-SGD, where queries do not change at all during updates, implying that there is no bias between different machines. However, Minibatch-SGD is “lazy” since among KRKR queries it only performs RR gradient updates. Conversely, (2) Local-SGD is not “lazy” since each machine performs KRKR gradient updates. Nevertheless, the queries of different machines change substantially during each round, which translates to bias between machines, which in turn degrades the convergence.

Ideally, we would like to have a “non-lazy” method where each machine performs KR gradient updates (like Local-SGD), but where the queries of each machine do not change at all during rounds (like Minibatch-SGD) and therefore no bias is introduced between machines. Of course, this is too good to exist, but our method is a step in this direction: it is “non-lazy” and the query points of different machines change slowly, and therefore introduce less bias between machines. This translates to a better convergence rate.

Additional Technical Intuition for αtt\alpha_{t}\propto t.

Here we extend the technical explanation that we provide in Sec. 3.2.1 to the case where αtt\alpha_{t}\propto t, and show again that SLowcal-SGD yields smaller bias between different machines compared to Local-SGD.

As in the intuition for the case of uniform weights, to simplify the more technical discussion, we will assume the homogeneous case, i.e., that for any i[M]i\in[M] we have 𝒟i=𝒟{\mathcal{D}}_{i}={\mathcal{D}} and fi()=f()f_{i}(\cdot)=f(\cdot).

Note that upon employing linear weights, the normalization factor α0:T\alpha_{0:T} that plays a major role in the convergence guarantees of Anytime-SGD (see Thm. 1) also grows as α0:TT2\alpha_{0:T}\propto T^{2}. Thus, in order to make a proper comparison, we should compare the bias of weighted Anytime-SGD to the appropriate weighted version of SGD, where the normalization factor α0:T\alpha_{0:T} plays a similar role in the guarantees (see e.g. Wang et al. [2021b]). This weighted SGD proceeds as follows (Wang et al. [2021b]): for all t0t\geq 0,

wt+1=wtηαtgt;where gt is unbiased of f(wt).\displaystyle w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~{};\quad\text{where $g_{t}$ is unbiased of $\nabla f(w_{t})$}~{}. (20)

and after TT iterations it outputs w¯T=1α0:Tt=0Tαtwt\overline{w}_{T}=\frac{1}{\alpha_{0:T}}\sum_{t=0}^{T}\alpha_{t}w_{t}. For αt=t+1\alpha_{t}=t+1, this version enjoys the same guarantees as standard SGD.
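For concreteness, here is a minimal sketch of this weighted-SGD baseline (Eq. (20)) with alpha_t = t+1 on a toy quadratic with Gaussian gradient noise; the objective, step size, and noise level are illustrative assumptions, and the point is only to make the weighted update and the weighted-average output explicit:

import numpy as np

def weighted_sgd(T=2000, eta=1e-4, d=5, sigma=0.1, seed=0):
    # Eq. (20) with alpha_t = t+1, on f(w) = 0.5*||w||^2, plus the weighted-average output.
    rng = np.random.default_rng(seed)
    w = np.ones(d)
    num, denom = np.zeros(d), 0.0
    for t in range(T + 1):
        alpha = t + 1.0
        num, denom = num + alpha * w, denom + alpha   # running sums for the weighted average
        g = w + sigma * rng.normal(size=d)            # unbiased estimate of grad f(w_t)
        w = w - eta * alpha * g                       # weighted SGD step
    return num / denom                                # \bar{w}_T = (1/alpha_{0:T}) sum_t alpha_t w_t

print(np.linalg.norm(weighted_sgd()))                 # small: close to the minimizer w* = 0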

Next, we compare the Local-SGD version of the above weighted SGD (Eq. (20)) to our SLowcal-SGD when both employ αt=t+1\alpha_{t}=t+1. A bit more formally, let us discuss the bias between query points in a given round r[R]r\in[R], and let us denote t0=rKt_{0}=rK. The following holds for weighted Local-SGD,

wti=wt0ητ=t0t1ατgτi,i[M],t[t0,t0+K].\displaystyle w_{t}^{i}=w_{t_{0}}-\eta\sum_{\tau=t_{0}}^{t-1}\alpha_{\tau}g_{\tau}^{i}~{},~{}~{}\forall i\in[M],t\in[t_{0},t_{0}+K]~{}. (21)

where gτig_{\tau}^{i} is the noisy gradient that machine ii computes at wτiw_{\tau}^{i}, and we can write gτi:=f(wτi)+ξτig_{\tau}^{i}:=\nabla f(w_{\tau}^{i})+\xi_{\tau}^{i}, where ξτi\xi_{\tau}^{i} is the noisy component of the gradient. Thus, for two machines iji\neq j we can write,

Ewtiwtj2=η2Eτ=t0t1ατ(gτigτj)2η2Eτ=t0t1ατ(f(wτi)f(wτj))2+η2Eτ=t0t1ατ(ξτiξτj)2\displaystyle\mbox{\bf E}\|w_{t}^{i}-w_{t}^{j}\|^{2}=\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\alpha_{\tau}(g_{\tau}^{i}-g_{\tau}^{j})\right\|^{2}\approx\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\alpha_{\tau}(\nabla f(w_{\tau}^{i})-\nabla f(w_{\tau}^{j}))\right\|^{2}+\eta^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\alpha_{\tau}(\xi_{\tau}^{i}-\xi_{\tau}^{j})\right\|^{2}

Now, we can be generous with respect to weighted SGD and take only the second, noisy term into account, neglecting the first term 666It was shown in Woodworth et al. [2020a] that the noisy term is dominant for standard SGD. Thus we obtain,

1η2Ewtiwtj2\displaystyle\frac{1}{\eta^{2}}\mbox{\bf E}\|w_{t}^{i}-w_{t}^{j}\|^{2} Eτ=t0t1ατ(ξτiξτj)2αt0+K2Eτ=t0t1(ξτiξτj)2(rK)2(tt0)r2K3.\displaystyle\lessapprox\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\alpha_{\tau}(\xi_{\tau}^{i}-\xi_{\tau}^{j})\right\|^{2}\approx\alpha_{t_{0}+K}^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}(\xi_{\tau}^{i}-\xi_{\tau}^{j})\right\|^{2}\approx(rK)^{2}\cdot(t-t_{0})\leq r^{2}K^{3}~{}. (22)

where we used αtαt0+K,tt0+K\alpha_{t}\leq\alpha_{t_{0}+K}~{},\forall t\leq t_{0}+K, as well as αt0+K2=((r+1)K+1)2(rK)2\alpha_{t_{0}+K}^{2}=((r+1)K+1)^{2}\approx(rK)^{2}. We also used tt0Kt-t_{0}\lessapprox K.

Similarly, for SLowcal-SGD we would like to bound Extixtj2\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2} for two machines iji\neq j; while assuming linear weights i.e., αt=t+1,t[T]\alpha_{t}=t+1~{},~{}\forall t\in[T]. Now the update rule for the iterates wtiw_{t}^{i}, is of the same form as in Eq. (21), only now gτi:=f(xτi)+ξτig_{\tau}^{i}:=\nabla f(x_{\tau}^{i})+\xi_{\tau}^{i}, where ξτi\xi_{\tau}^{i} is the noisy component of the gradient. Consequently, we can show the following,

τ=t0tατ(wτiwτj)ητ=t0tατn=t0τ1αn(gnignj)ηn=t0t1αn+1:tαn(gnignj)ηr2K3τ=t0t1(gτigτj),\displaystyle\sum_{\tau=t_{0}}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j})\approx-\eta\sum_{\tau=t_{0}}^{t}\alpha_{\tau}\sum_{n=t_{0}}^{\tau-1}\alpha_{n}(g_{n}^{i}-g_{n}^{j})\approx-\eta\sum_{n=t_{0}}^{t-1}\alpha_{n+1:t}\alpha_{n}(g_{n}^{i}-g_{n}^{j})\approx-\eta r^{2}K^{3}\sum_{\tau=t_{0}}^{t-1}(g_{\tau}^{i}-g_{\tau}^{j})~{},

where we took the crude approximation αn+1:tαnαt0:t+Kαt0rK2rK=r2K3\alpha_{n+1:t}\alpha_{n}\approx\alpha_{t_{0}:t+K}\alpha_{t_{0}}\lessapprox rK^{2}\cdot rK=r^{2}K^{3}. In the last "\approx" we also renamed the summation variable from nn to τ\tau.

Now, by definition, xtiα0:t0α0:txt0+1α0:tτ=t0tατwτi,i[M],t[t0,t0+K].x_{t}^{i}\approx\frac{\alpha_{0:t_{0}}}{\alpha_{0:t}}\cdot x_{t_{0}}+\frac{1}{\alpha_{0:t}}\sum_{\tau=t_{0}}^{t}\alpha_{\tau}w_{\tau}^{i}~{},~{}~{}\forall i\in[M],t\in[t_{0},t_{0}+K]~{}. Thus, for two machines iji\neq j we have,

1η2Extixtj2\displaystyle\frac{1}{\eta^{2}}\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2} =1η2E1α0:tτ=t0tατ(wτiwτj)21η2η2r4K6(α0:t)2Eτ=t0t1gτigτj2\displaystyle=\frac{1}{\eta^{2}}\mbox{\bf E}\left\|\frac{1}{\alpha_{0:t}}\sum_{\tau=t_{0}}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j})\right\|^{2}\approx\frac{1}{\eta^{2}}\cdot\frac{\eta^{2}r^{4}K^{6}}{(\alpha_{0:t})^{2}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}g_{\tau}^{i}-g_{\tau}^{j}\right\|^{2}
1η2η2r4K6r4K4Eτ=t0t1gτigτj2\displaystyle\approx\frac{1}{\eta^{2}}\cdot\frac{\eta^{2}r^{4}K^{6}}{r^{4}K^{4}}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}g_{\tau}^{i}-g_{\tau}^{j}\right\|^{2}
K2Eτ=t0t1f(xτi)f(xτj)2+K2Eτ=t0t1ξτiξτj2.\displaystyle\approx K^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\nabla f(x_{\tau}^{i})-\nabla f(x_{\tau}^{j})\right\|^{2}+K^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}~{}.

where we have used α0:tα0:t0t02r2K2\alpha_{0:t}\approx\alpha_{0:t_{0}}\propto t_{0}^{2}\approx r^{2}K^{2}.

As we show in our analysis, the noisy term is dominant, so we can therefore bound,

1η2Extixtj2K2Eτ=t0t1ξτiξτj2K2(tt0)K3.\displaystyle\frac{1}{\eta^{2}}\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|^{2}\lessapprox K^{2}\mbox{\bf E}\left\|\sum_{\tau=t_{0}}^{t-1}\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}\approx K^{2}(t-t_{0})\leq{K^{3}}~{}. (23)

Thus, Equations (22) and (23) illustrate that for αtt\alpha_{t}\propto t the bias of SLowcal-SGD is smaller by a factor of r2r^{2} compared to the bias of weighted Local-SGD.
Since r2r^{2} can be as big as R2R^{2}, this coincides with the benefit of SLowcal-SGD over Local-SGD in the case where αt=1\alpha_{t}=1, which we demonstrate in the main text.

Finally, note that upon dividing by the normalization factor α0:T\alpha_{0:T}, we have that for SLowcal-SGD, with either αt=1\alpha_{t}=1 or αtt\alpha_{t}\propto t,

1α0:T1ηExtixtj1R2K2K31RKKR2=1KR2\displaystyle\frac{1}{\alpha_{0:T}}\frac{1}{\eta}\mbox{\bf E}\|x_{t}^{i}-x_{t}^{j}\|\approx\frac{1}{R^{2}K^{2}}\cdot\sqrt{K^{3}}\approx\frac{1}{RK}\cdot\sqrt{\frac{K}{R^{2}}}=\frac{1}{\sqrt{K}R^{2}} (24)

Similarly, upon dividing by the normalization factor α0:T\alpha_{0:T}, we have that for Local-SGD, with either αt=1\alpha_{t}=1 or αtt\alpha_{t}\propto t,

1α0:T1ηEwtiwtj1R2K2R2K31RKK=1KR\displaystyle\frac{1}{\alpha_{0:T}}\frac{1}{\eta}\mbox{\bf E}\|w_{t}^{i}-w_{t}^{j}\|\approx\frac{1}{R^{2}K^{2}}\cdot\sqrt{R^{2}K^{3}}\approx\frac{1}{RK}\cdot\sqrt{{K}}=\frac{1}{\sqrt{K}R} (25)

Thus, with respect to the approximate and intuitive analysis that we make here, SLowcal-SGD maintains a similar benefit over Local-SGD for both αt=1\alpha_{t}=1 and αt=t+1\alpha_{t}=t+1.

As we explain in Appendix L, taking αt=1\alpha_{t}=1 in SLowcal-SGD does not actually enable us to provide a benefit over Local-SGD. The reason is that for αt=1\alpha_{t}=1, the condition under which the dominant term in the bias (between different machines) is the noisy term (this is what enables the approximate analysis that we make here and in the body of the paper) leads to a limitation on the learning rate, which in turn degrades the performance of SLowcal-SGD with αt=1\alpha_{t}=1. Conversely, for αtt\alpha_{t}\propto t there is no such degradation due to the limitation of the learning rate. For more details and intuition please see Appendix L.

Appendix F Proof of Thm. 2

Proof of Thm. 2.

As a starting point for the analysis, for every iteration t[T]t\in[T] we will define the averages of (wti,xti,gti)(w_{t}^{i},x_{t}^{i},g_{t}^{i}) across all machines as follows,

wt:=1Mi[M]wti,&xt:=1Mi[M]xti&gt:=1Mi[M]gti.w_{t}:=\frac{1}{M}\sum_{i\in[M]}w_{t}^{i}~{},\quad\&\quad x_{t}:=\frac{1}{M}\sum_{i\in[M]}x_{t}^{i}\quad\&\quad g_{t}:=\frac{1}{M}\sum_{i\in[M]}g_{t}^{i}~{}.

Note that Alg. 2 explicitly computes (wt,xt)(w_{t},x_{t}) only once every KK local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,

wt+1=wtηαtgt,t[T]\displaystyle w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~{},~{}\forall t\in[T] (26)

Further combining these definitions with Eq. (7) yields,

xt+1=(1αt+1α0:t+1)xt+αt+1α0:t+1wt+1,t[T]\displaystyle x_{t+1}=(1-\frac{\alpha_{t+1}}{\alpha_{0:t+1}})x_{t}+\frac{\alpha_{t+1}}{\alpha_{0:t+1}}w_{t+1}~{},~{}\forall t\in[T] (27)

The above implies that the {xt}t[T]\{x_{t}\}_{t\in[T]} sequence is an {αt}t[T]\{\alpha_{t}\}_{t\in[T]}-weighted average of {wt}t[T]\{w_{t}\}_{t\in[T]}. This enables us to employ Thm. 1, which yields,

α0:tΔt:=α0:t(f(xt)f(w))τ=0tατf(xτ)(wτw),\displaystyle\alpha_{0:t}\Delta_{t}:=\alpha_{0:t}(f(x_{t})-f(w^{*}))\leq\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~{},

where we denote Δt:=f(xt)f(w)\Delta_{t}:=f(x_{t})-f(w^{*}). The above bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased gradient estimates at xtx_{t}, except for the first iterate of each round. Concretely, Eq. (26) demonstrates that our algorithm effectively updates using gtg_{t}, which might be a biased estimate of f(xt)\nabla f(x_{t}).

It is therefore natural to decompose f(xτ)=gτ+(f(xτ)gτ)\nabla f(x_{\tau})=g_{\tau}+(\nabla f(x_{\tau})-g_{\tau}) in the above bound, leading to,

α0:tΔtτ=0tατgτ(wτw)(A)+τ=0tατ(f(xτ)gτ)(wτw)(B)\displaystyle\alpha_{0:t}\Delta_{t}\leq\underset{{\rm(A)}}{\underbrace{\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*})}}+\underset{{\rm(B)}}{\underbrace{\sum_{\tau=0}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-g_{\tau})\cdot(w_{\tau}-w^{*})}} (28)

Thus, we intend to bound the weighted error α0:tΔt\alpha_{0:t}\Delta_{t} by bounding two terms: (A){\rm(A)} which is directly related to the update rule of the algorithm (Eq. (26)), and (B){\rm(B)} which accounts for the bias between gtg_{t} and f(xt)\nabla f(x_{t}).
Notation: In what follows we will find the following notation useful,

g¯t:=1Mi[M]fi(xti)\displaystyle\bar{g}_{t}:=\frac{1}{M}\sum_{i\in[M]}\nabla f_{i}(x_{t}^{i}) (29)

and the above definition implies that g¯t=E[gt|{z0i}i[M],,{zt1i}i[M]]=E[gt|{xti}i[M]]\bar{g}_{t}=\mbox{\bf E}\left[g_{t}|\{z_{0}^{i}\}_{i\in[M]},\ldots,\{z_{t-1}^{i}\}_{i\in[M]}\right]=\mbox{\bf E}\left[g_{t}|\{x_{t}^{i}\}_{i\in[M]}\right]. We will also employ the following notations,

Vt:=τ=0tατ2g¯τf(xτ)2,&Dt:=wtw2V_{t}:=\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|^{2}~{},\quad\&\quad D_{t}:=\|w_{t}-w^{*}\|^{2}

where ww^{*} is a global minimum of f()f(\cdot). Moreover, we will also use the notation D0:t:=τ=0twτw2D_{0:t}:=\sum_{\tau=0}^{t}\|w_{\tau}-w^{*}\|^{2}.

Bounding (A):

Due to the update rule of Eq. (26), one can show by standard regret analysis (see Lemma 2 below) that,

(A):=τ=0tατgτ(wτw)w0w22η+η2τ=0tατ2gτ2,\displaystyle{\rm(A)}:=\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*})\leq\frac{\|w_{0}-w^{*}\|^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}~{}, (30)
Lemma 2.

(OGD Regret Lemma; see e.g. [Hazan et al., 2016]) Let w0dw_{0}\in{\mathbb{R}}^{d} and η>0\eta>0. Also assume a sequence of TT non-negative weights {αt0}t[T]\{\alpha_{t}\geq 0\}_{t\in[T]} and TT vectors {gtd}t[T]\{g_{t}\in{\mathbb{R}}^{d}\}_{t\in[T]}, and assume an update rule of the following form:

wt+1=wtηαtgt,t[T].w_{t+1}=w_{t}-\eta\alpha_{t}g_{t}~{},\forall t\in[T]~{}.

Then the following bound holds for any udu\in{\mathbb{R}}^{d}, and t[T]t\in[T],

τ=0tατgτ(wτu)w0u22η+η2τ=0tατ2gτ2.\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-u)\leq\frac{\|w_{0}-u\|^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}~{}.

For completeness, we provide a proof in Appendix G.

Bounding (B):

Since our goal is to bound the expected excess loss, we will bound the expected value of (B){\rm(B)}, thus,

E[(B)]\displaystyle\mbox{\bf E}\left[{\rm(B)}\right] =E[τ=0tατ(f(xτ)gτ)(wτw)]\displaystyle=\mbox{\bf E}\left[\sum_{\tau=0}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-g_{\tau})\cdot(w_{\tau}-w^{*})\right]
=E[τ=0tατ(f(xτ)g¯τ)(wτw)]\displaystyle=\mbox{\bf E}\left[\sum_{\tau=0}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-\bar{g}_{\tau})\cdot(w_{\tau}-w^{*})\right]
Eτ=0t(12ρατ2f(xτ)g¯τ2+ρ2Ewτw2)\displaystyle\leq\mbox{\bf E}\sum_{\tau=0}^{t}\left(\frac{1}{2\rho}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})-\bar{g}_{\tau}\|^{2}+\frac{\rho}{2}\mbox{\bf E}\|w_{\tau}-w^{*}\|^{2}\right)
=12ρEVt+ρ2ED0:t,\displaystyle=\frac{1}{2\rho}\mbox{\bf E}V_{t}+\frac{\rho}{2}\mbox{\bf E}D_{0:t}~{}, (31)

where the second line follows by the definition of g¯τ\bar{g}_{\tau} (see Eq. (29)) and due to the fact that wτw_{\tau} is measurable with respect to {{z0i}i[M],,{zτ1i}i[M]}\left\{\{z_{0}^{i}\}_{i\in[M]},\ldots,\{z_{\tau-1}^{i}\}_{i\in[M]}\right\} while g¯τ=E[gτ|{z0i}i[M],,{zτ1i}i[M]]\bar{g}_{\tau}=\mbox{\bf E}\left[g_{\tau}|\{z_{0}^{i}\}_{i\in[M]},\ldots,\{z_{\tau-1}^{i}\}_{i\in[M]}\right] implying that E[gτ(wτw)]=E[g¯τ(wτw)]\mbox{\bf E}[g_{\tau}\cdot(w_{\tau}-w^{*})]=\mbox{\bf E}[\bar{g}_{\tau}\cdot(w_{\tau}-w^{*})]; the third line uses Young’s inequality abinfρ>0{ρ2a2+12ρb2}a\cdot b\leq\inf_{\rho>0}\{\frac{\rho}{2}\|a\|^{2}+\frac{1}{2\rho}\|b\|^{2}\} which holds for any a,bda,b\in{\mathbb{R}}^{d}; and the last two lines use the definition of VtV_{t} and D0:TD_{0:T}.

Combining (A) and (B):

Combining Equations (30) and (31) into Eq. (28), we obtain the following bound, which holds for any ρ>0\rho>0 and t[T]t\in[T],

α0:tEΔtw0w22η+η2Eτ=0Tατ2gτ2+12ρEVT+ρ2ED0:t\displaystyle\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{2\eta}+\frac{\eta}{2}\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}+\frac{1}{2\rho}\mbox{\bf E}V_{T}+\frac{\rho}{2}\mbox{\bf E}D_{0:t} (32)

where we have used VtVTV_{t}\leq V_{T} which holds for any t[T]t\in[T], as well as Eτ=0tατ2gτ2Eτ=0Tατ2gτ2\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}\leq\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}, which holds since tTt\leq T.

Next, we shall bound each of the above terms. The following lemma bounds ED0:t\mbox{\bf E}D_{0:t},

Lemma 3.

The following bound holds for any t[T]t\in[T],

ED0:t=Eτ=0twτw22Tw0w2+2Tη2Et=0Tαt2gt2+16η2T2EVT\displaystyle\mbox{\bf E}D_{0:t}=\mbox{\bf E}\sum_{\tau=0}^{t}\|w_{\tau}-w^{*}\|^{2}\leq 2T\|w_{0}-w^{*}\|^{2}+2T\eta^{2}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|g_{t}\|^{2}+16\eta^{2}T^{2}\cdot\mbox{\bf E}V_{T}

Combining the above lemma with Eq. (32) gives,

α0:tEΔtw0w22η+(η2+ρTη2)Eτ=0Tατ2gτ2+(12ρ+8ρη2T2)EVT+ρTw0w2\displaystyle\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{2\eta}+\left(\frac{\eta}{2}+\rho T\eta^{2}\right)\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}+\left(\frac{1}{2\rho}+8\rho\eta^{2}T^{2}\right)\mbox{\bf E}V_{T}+\rho T\|w_{0}-w^{*}\|^{2}

Since the above holds for any ρ>0\rho>0 let us pick a specific value of ρ=14ηT\rho=\frac{1}{4\eta T}; by doing so we obtain,

α0:tEΔtw0w2η+ηEτ=0Tατ2gτ2(C)+4ηTEVT.\displaystyle\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+\eta\cdot\underset{{\rm(C)}}{\underbrace{\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}}}+4\eta T\mbox{\bf E}V_{T}~{}. (33)

Next, we would like to bound (C){\rm(C)}; to do so it is natural to decompose gτ=(gτg¯τ)+(g¯τf(xτ))+f(xτ)g_{\tau}=(g_{\tau}-\bar{g}_{\tau})+(\bar{g}_{\tau}-\nabla f(x_{\tau}))+\nabla f(x_{\tau}). The next lemma provides a bound, and its proof goes directly through this decomposition,

Lemma 4.

The following holds,

(C)3σ2Mt=0Tαt2+3EVT+12LEt=0Tα0:tΔt.\displaystyle{\rm(C)}\leq 3\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+3\mbox{\bf E}V_{T}+12L\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}~{}.

Combining the above Lemma into Eq. (33) yields,

α0:tEΔtw0w2η+3ησ2Mt=0Tαt2+8ηTEVT+12ηLEt=0Tα0:tΔt,\displaystyle\alpha_{0:t}\mbox{\bf E}\Delta_{t}\leq\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+3\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+8\eta T\mbox{\bf E}V_{T}+12\eta L\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}~{}, (34)

where we have used 34T3\leq 4T, which holds since T1T\geq 1. The next lemma provides a bound for EVt\mbox{\bf E}V_{t},

Lemma 5.

For any tT:=KRt\leq T:=KR, Alg. 2 with the learning rate choice in Eq. (8) ensures the following bound,

EVt400L2η2K3τ=0Tα0:τ(G2+4LΔτ)+90L2η2K6R3σ2.\mbox{\bf E}V_{t}\leq 400L^{2}\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}\cdot(G_{*}^{2}+4L\Delta_{\tau})+90L^{2}\eta^{2}K^{6}R^{3}\sigma^{2}~{}.

Plugging the above bound back into Eq. (34) gives an almost explicit bound,

α0:tEΔt\displaystyle\alpha_{0:t}\mbox{\bf E}\Delta_{t}
w0w2η+3ησ2Mt=0Tαt2+720L2η3TK6R3σ2\displaystyle\leq\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+3\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+720L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}
+4103L2η3TK3τ=0Tα0:τ(G2+4LΔτ))+12ηLEt=0Tα0:tΔt\displaystyle\quad+4\cdot 10^{3}L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}\cdot(G_{*}^{2}+4L\Delta_{\tau}))+12\eta L\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}
=w0w2η+3ησ2Mt=0Tαt2+720L2η3TK6R3σ2+4103L2η3TK3τ=0Tα0:τG2\displaystyle=\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+3\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+720L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}+4\cdot 10^{3}L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}G_{*}^{2}
+(12ηL+16103L3η3TK3)Et=0Tα0:tΔt\displaystyle\quad+(12\eta L+16\cdot 10^{3}L^{3}\eta^{3}TK^{3})\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}
w0w2η+3ησ2Mt=0Tαt2+720L2η3TK6R3σ2+4103L2η3TK3τ=0Tα0:τG2+12(T+1)Et=0Tα0:tΔt,\displaystyle\leq\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+3\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+720L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}+4\cdot 10^{3}L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}G_{*}^{2}+\frac{1}{2(T+1)}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{0:t}\Delta_{t}~{}, (35)

and we used 12ηL14(T+1)12\eta L\leq\frac{1}{4(T+1)}, which follows since η148L(T+1)\eta\leq\frac{1}{48L(T+1)} (see Eq. (8)), as well as 16103L3η3TK314(T+1)16\cdot 10^{3}L^{3}\eta^{3}TK^{3}\leq\frac{1}{4(T+1)}, which follows since η140LK(T+1)2/3\eta\leq\frac{1}{40LK(T+1)^{2/3}} (see Eq. (8)). Next, we use the above bound and invoke the following lemma,

Lemma 6.

Let {At}t[T]\{A_{t}\}_{t\in[T]} be a sequence of non-negative elements and \mathcal{B}\in{\mathbb{R}}, and assume that for any tTt\leq T,

At+12(T+1)t=0TAt,A_{t}\leq\mathcal{B}+\frac{1}{2(T+1)}\sum_{t=0}^{T}A_{t}~{},

Then the following bound holds,

AT2.A_{T}\leq 2\mathcal{B}~{}.

Taking Atα0:tEΔtA_{t}\leftarrow\alpha_{0:t}\mbox{\bf E}\Delta_{t} and w0w2η+3ησ2Mt=0Tαt2+720L2η3TK6R3σ2+4103L2η3TK3τ=0Tα0:τG2\mathcal{B}\leftarrow\frac{\|w_{0}-w^{*}\|^{2}}{\eta}+3\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+720L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}+4\cdot 10^{3}L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}G_{*}^{2} provides the following explicit bound,

α0:TEΔT\displaystyle\alpha_{0:T}\mbox{\bf E}\Delta_{T} 2w0w2η+6ησ2Mt=0Tαt2+2103L2η3TK6R3σ2+8103L2η3TK3τ=0Tα0:τG2\displaystyle\leq\frac{2\|w_{0}-w^{*}\|^{2}}{\eta}+6\eta\frac{\sigma^{2}}{M}\sum_{t=0}^{T}\alpha_{t}^{2}+2\cdot 10^{3}L^{2}\eta^{3}TK^{6}R^{3}\sigma^{2}+8\cdot 10^{3}L^{2}\eta^{3}TK^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}G_{*}^{2}
2w0w2η+6ησ2M(KR)3+2103L2η3K7R4σ2+8103L2η3K7R4G2,\displaystyle\leq\frac{2\|w_{0}-w^{*}\|^{2}}{\eta}+6\eta\frac{\sigma^{2}}{M}\cdot(KR)^{3}+2\cdot 10^{3}L^{2}\eta^{3}K^{7}R^{4}\sigma^{2}+8\cdot 10^{3}L^{2}\eta^{3}K^{7}R^{4}\cdot G_{*}^{2}~{}, (36)

where we have used τ=0Tα0:τt=0Tαt2r=0R1k=0K1(r+1)2K2K3R3\sum_{\tau=0}^{T}\alpha_{0:\tau}\leq\sum_{t=0}^{T}\alpha_{t}^{2}\leq\sum_{r=0}^{R-1}\sum_{k=0}^{K-1}(r+1)^{2}K^{2}\leq K^{3}R^{3}, as well as T=KRT=KR.

Recalling that T=KRT=KR and that,

η=min{148L(T+1),110LK2,140LK(T+1)2/3,w0wMσT3/2,w0w1/2L1/2K7/4R(σ1/2+G1/2)}\displaystyle\eta=\min\left\{\frac{1}{48L(T+1)},\frac{1}{10LK^{2}},\frac{1}{40LK(T+1)^{2/3}},\frac{\|w_{0}-w^{*}\|\sqrt{M}}{\sigma T^{3/2}},\frac{\|w_{0}-w^{*}\|^{1/2}}{L^{1/2}K^{7/4}R(\sigma^{1/2}+G_{*}^{1/2})}\right\}

The above bound translates into,

α0:T\displaystyle\small\alpha_{0:T} EΔT\displaystyle\mbox{\bf E}\Delta_{T}\leq (37)
O(L(T+K2+KT2/3)w0w2+σw0wT3/2M+L1/2K7/4R(σ1/2+G1/2)w0w3/2)\displaystyle O\left(L(T+K^{2}+KT^{2/3})\|w_{0}-w^{*}\|^{2}+\frac{\sigma\|w_{0}-w^{*}\|T^{3/2}}{\sqrt{M}}+L^{1/2}K^{7/4}R(\sigma^{1/2}+G_{*}^{1/2})\cdot\|w_{0}-w^{*}\|^{3/2}\right) (38)

Noting that α0:TΩ(T2)\alpha_{0:T}\geq\Omega(T^{2}) and using T=KRT=KR gives the final bound,

E ΔT\displaystyle\Delta_{T}\leq
O(Lw0w2KR+Lw0w2K1/3R4/3+Lw0w2R2+σw0wMKR+L1/2(σ1/2+G1/2)w0w3/2K1/4R).\displaystyle O\left(\frac{L\|w_{0}-w^{*}\|^{2}}{KR}+\frac{L\|w_{0}-w^{*}\|^{2}}{K^{1/3}R^{4/3}}+\frac{L\|w_{0}-w^{*}\|^{2}}{R^{2}}+\frac{\sigma\|w_{0}-w^{*}\|}{\sqrt{MKR}}+\frac{L^{1/2}(\sigma^{1/2}+G_{*}^{1/2})\cdot\|w_{0}-w^{*}\|^{3/2}}{K^{1/4}R}\right)~{}.

which establishes the Theorem. ∎

Appendix G Proof of Lemma 2

Proof of Lemma 2.

The update rule implies for all τ[T]\tau\in[T]

wτ+1u2\displaystyle\|w_{\tau+1}-u\|^{2} =(wτu)ηατgτ2\displaystyle=\|(w_{\tau}-u)-\eta\alpha_{\tau}g_{\tau}\|^{2}
=wτu22ηατgτ(wτu)+η2ατ2gτ2\displaystyle=\|w_{\tau}-u\|^{2}-2\eta\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-u)+\eta^{2}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}

Re-ordering gives,

2ηατgτ(wτu)=(wτu2wτ+1u2)+η2ατ2gτ2.\displaystyle 2\eta\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-u)=\left(\|w_{\tau}-u\|^{2}-\|w_{\tau+1}-u\|^{2}\right)+\eta^{2}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}~{}.

Summing over τ\tau and telescoping we obtain,

2ητ=0tατgτ(wτu)\displaystyle 2\eta\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-u) =(w0u2wt+1u2)+η2τ=0tατ2gτ2\displaystyle=\left(\|w_{0}-u\|^{2}-\|w_{t+1}-u\|^{2}\right)+\eta^{2}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}
w0u2+η2τ=0tατ2gτ2\displaystyle\leq\|w_{0}-u\|^{2}+\eta^{2}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}

Dividing the above by 2η2\eta establishes the lemma. ∎

Appendix H Proof of Lemma 3

Proof of Lemma 3.

Recalling the notations Dτ:=wτw2D_{\tau}:=\|w_{\tau}-w^{*}\|^{2}, our goal is to bound ED0:t\mbox{\bf E}D_{0:t}. To do so, we will derive a recursive formula for D0:tD_{0:t}. Indeed, the update rule of Alg. 2 implies Eq. (26), which in turn leads to the following for any t[T]t\in[T],

wt+1w2\displaystyle\|w_{t+1}-w^{*}\|^{2} =(wtw)ηαtgt2=wtw22ηαtgt(wtw)+η2αt2gt2\displaystyle=\|(w_{t}-w^{*})-\eta\alpha_{t}g_{t}\|^{2}=\|w_{t}-w^{*}\|^{2}-2\eta\alpha_{t}g_{t}\cdot(w_{t}-w^{*})+\eta^{2}\alpha_{t}^{2}\|g_{t}\|^{2}

Unrolling the above equation and taking expectation gives,

Ewt+1w2\displaystyle\mbox{\bf E}\|w_{t+1}-w^{*}\|^{2} =w0w22ηEτ=0tατgτ(wτw)()+η2Eτ=0tατ2gτ2\displaystyle=\|w_{0}-w^{*}\|^{2}\underset{{\rm(*)}}{\underbrace{-2\eta\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*})}}+\eta^{2}\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2} (39)

The next lemma provides a bound on (){\rm(*)},

Lemma 7.

The following holds for any t[T]t\in[T],

()2ηEVTED0:T\displaystyle{\rm(*)}\leq 2\eta\sqrt{\mbox{\bf E}V_{T}}\cdot\sqrt{\mbox{\bf E}D_{0:T}}

and recall that D0:T:=t=0Twtw2D_{0:T}:=\sum_{t=0}^{T}\|w_{t}-w^{*}\|^{2}, and VT:=t=0Tαt2g¯tf(xt)2V_{T}:=\sum_{t=0}^{T}\alpha_{t}^{2}\|\bar{g}_{t}-\nabla f(x_{t})\|^{2}.

Plugging the bound of Lemma 7 into Eq. (39), and using the notation of DtD_{t} we conclude that for any t[T]t\in[T],

EDt\displaystyle\mbox{\bf E}D_{t} D0+2ηEVTED0:T+η2Eτ=0tατ2gτ2\displaystyle\leq D_{0}+2\eta\sqrt{\mbox{\bf E}V_{T}}\cdot\sqrt{\mbox{\bf E}D_{0:T}}+\eta^{2}\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}
D0+2ηEVTED0:T+η2Et=0Tαt2gt2,\displaystyle\leq D_{0}+2\eta\sqrt{\mbox{\bf E}V_{T}}\cdot\sqrt{\mbox{\bf E}D_{0:T}}+\eta^{2}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|g_{t}\|^{2}~{},

where we used tTt\leq T. Summing the above equation over tt gives,

ED0:T\displaystyle\mbox{\bf E}D_{0:T} Tw0w2+2ηTEVTED0:T+Tη2Et=0Tαt2gt2\displaystyle\leq T\|w_{0}-w^{*}\|^{2}+2\eta T\cdot\sqrt{\mbox{\bf E}V_{T}}\cdot\sqrt{\mbox{\bf E}D_{0:T}}+T\eta^{2}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|g_{t}\|^{2} (40)

We shall now require the following lemma,

Lemma 8.

Let A,B,C0A,B,C\geq 0, and assume that AB+CAA\leq B+C\sqrt{A}, then the following holds,

A2B+4C2A\leq 2B+4C^{2}

Now, using the above Lemma with Eq. (40) implies,

ED0:T2Tw0w2+2Tη2Et=0Tαt2gt2+16η2T2EVT\displaystyle\mbox{\bf E}D_{0:T}\leq 2T\|w_{0}-w^{*}\|^{2}+2T\eta^{2}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|g_{t}\|^{2}+16\eta^{2}T^{2}\cdot\mbox{\bf E}V_{T} (41)

where we have taken AED0:TA\leftarrow\mbox{\bf E}D_{0:T}, BTw0w2+Tη2Et=0Tαt2gt2B\leftarrow T\|w_{0}-w^{*}\|^{2}+T\eta^{2}\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|g_{t}\|^{2}, and C2ηTEVTC\leftarrow 2\eta T\cdot\sqrt{\mbox{\bf E}V_{T}}. Thus, Eq. (41) establishes the lemma. ∎

H.1 Proof of Lemma 7

Proof of Lemma 7.

Recall that ()=2ηEτ=0tατgτ(wτw){\rm(*)}=-2\eta\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*}), we shall now focus on bounding ()/2η{\rm(*)}/2\eta,

Eτ=0tατgτ(wτw)\displaystyle-\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}g_{\tau}\cdot(w_{\tau}-w^{*}) =Eτ=0tατg¯τ(wτw)\displaystyle=-\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}\bar{g}_{\tau}\cdot(w_{\tau}-w^{*})
=Eτ=0tατf(xτ)(wτw)Eτ=0tατ(g¯τf(xτ))(wτw)\displaystyle=-\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})-\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}(\bar{g}_{\tau}-\nabla f(x_{\tau}))\cdot(w_{\tau}-w^{*})
0+Eτ=0tατg¯τf(xτ)wτw\displaystyle\leq 0+\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|\cdot\|w_{\tau}-w^{*}\|
0+τ=0tEατ2g¯τf(xτ)2Ewτw2\displaystyle\leq 0+\sum_{\tau=0}^{t}\sqrt{\mbox{\bf E}\alpha_{\tau}^{2}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|^{2}}\cdot\sqrt{\mbox{\bf E}\|w_{\tau}-w^{*}\|^{2}}
Eτ=0tατ2g¯τf(xτ)2Eτ=0twτw2\displaystyle\leq\sqrt{\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|^{2}}\cdot\sqrt{\mbox{\bf E}\sum_{\tau=0}^{t}\|w_{\tau}-w^{*}\|^{2}}
Et=0Tαt2g¯tf(xt)2Et=0Twtw2\displaystyle\leq\sqrt{\mbox{\bf E}\sum_{t=0}^{T}\alpha_{t}^{2}\|\bar{g}_{t}-\nabla f(x_{t})\|^{2}}\cdot\sqrt{\mbox{\bf E}\sum_{t=0}^{T}\|w_{t}-w^{*}\|^{2}}
:=EVTED0:T,\displaystyle:=\sqrt{\mbox{\bf E}V_{T}}\cdot\sqrt{\mbox{\bf E}D_{0:T}}~{}, (42)

where the first line is due to the definitions of gτg_{\tau} and g¯τ\bar{g}_{\tau} appearing in Eq. (29) (this is formalized in Lemma 9 below and in its proof); the third line follows by observing that the {xt}t\{x_{t}\}_{t} sequence is an {αt}t\{\alpha_{t}\}_{t}-weighted average of {wt}t\{w_{t}\}_{t}, and thus Theorem 1 implies that Eτ=0tατf(xτ)(wτw)0\mbox{\bf E}\sum_{\tau=0}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})\geq 0 for any tt, as well as from the Cauchy-Schwarz inequality; the fourth line follows from the Cauchy-Schwarz inequality for random variables, which asserts that for any random variables X,YX,Y we have E[XY]EX2EY2\mbox{\bf E}[XY]\leq\sqrt{\mbox{\bf E}X^{2}}\sqrt{\mbox{\bf E}Y^{2}}; the fifth line is an application of the following inequality

τ=0taτbττ=0taτ2τ=0tbτ2\sum_{\tau=0}^{t}a_{\tau}b_{\tau}\leq\sqrt{\sum_{\tau=0}^{t}a_{\tau}^{2}}\sqrt{\sum_{\tau=0}^{t}b_{\tau}^{2}}

which holds for any two sequences {aτ}τ,{bτ}τ\{a_{\tau}\in{\mathbb{R}}\}_{\tau},\{b_{\tau}\in{\mathbb{R}}\}_{\tau}, and which also follows from the standard Cauchy-Schwarz inequality. Thus, Eq. (42) establishes the lemma.

We are left to show that E[gτ(wτw)]=E[g¯τ(wτw)]\mbox{\bf E}\left[g_{\tau}\cdot(w_{\tau}-w^{*})\right]=\mbox{\bf E}\left[\bar{g}_{\tau}\cdot(w_{\tau}-w^{*})\right] which is established in the lemma below,

Lemma 9.

The following holds for any τ[T]\tau\in[T],

E[gτ(wτw)]=E[g¯τ(wτw)].\displaystyle\mbox{\bf E}\left[g_{\tau}\cdot(w_{\tau}-w^{*})\right]=\mbox{\bf E}\left[\bar{g}_{\tau}\cdot(w_{\tau}-w^{*})\right]~{}.

H.1.1 Proof of Lemma 9

Proof of Lemma 9.

Let {τ}τ[T]\{\mathcal{F}_{\tau}\}_{\tau\in[T]} be the natural filtration induced by the history of samples up to each time step τ\tau. Then, according to the definitions of gtg_{t} and g¯t\bar{g}_{t}, we have,

E[gτ(wτw)]\displaystyle\mbox{\bf E}\left[g_{\tau}\cdot(w_{\tau}-w^{*})\right] =E[E[gτ(wτw)|τ1]]\displaystyle=\mbox{\bf E}\left[\mbox{\bf E}\left[g_{\tau}\cdot(w_{\tau}-w^{*})|\mathcal{F}_{\tau-1}\right]\right]
=E[E[gτ|τ1](wτw)]\displaystyle=\mbox{\bf E}\left[\mbox{\bf E}\left[g_{\tau}|\mathcal{F}_{\tau-1}\right]\cdot(w_{\tau}-w^{*})\right]
=E[E[g¯τ|τ1](wτw)]\displaystyle=\mbox{\bf E}\left[\mbox{\bf E}\left[\bar{g}_{\tau}|\mathcal{F}_{\tau-1}\right]\cdot(w_{\tau}-w^{*})\right]
=E[g¯τ(wτw)],\displaystyle=\mbox{\bf E}\left[\bar{g}_{\tau}\cdot(w_{\tau}-w^{*})\right]~{},

where the first line follows from the law of total expectation; the second line follows since wτw_{\tau} is measurable w.r.t. τ1\mathcal{F}_{\tau-1}; the third line follows by the definitions of gτg_{\tau} and g¯τ\bar{g}_{\tau}; and the last line again uses the law of total expectation. ∎

H.2 Proof of Lemma 8

Proof of Lemma 8.

We divide the proof into two cases.
Case 1: BCAB\geq C\sqrt{A}. In this case,

AB+CA2B2B+4C2.A\leq B+C\sqrt{A}\leq 2B\leq 2B+4C^{2}~{}.

Case 2: BCAB\leq C\sqrt{A}. In this case,

AB+CA2CA,A\leq B+C\sqrt{A}\leq 2C\sqrt{A}~{},

dividing by A\sqrt{A} and squaring implies,

A4C22B+4C2.A\leq 4C^{2}\leq 2B+4C^{2}~{}.

And therefore the lemma holds. ∎
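As a quick numerical sanity check of Lemma 8 (random instances only, not a substitute for the proof above), one can verify the bound at the largest A satisfying the assumption:

import random

random.seed(0)
for _ in range(10_000):
    B, C = random.uniform(0, 10), random.uniform(0, 10)
    # Largest A with A <= B + C*sqrt(A): take sqrt(A) = (C + sqrt(C^2 + 4B)) / 2.
    A = ((C + (C * C + 4 * B) ** 0.5) / 2) ** 2
    assert A <= 2 * B + 4 * C * C + 1e-9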

Appendix I Proof of Lemma 4

Proof of Lemma 4.

Recalling that (C):=Eτ=0Tατ2gτ2{\rm(C)}:=\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}\|^{2}, we will decompose gτ=(gτg¯τ)+(g¯τf(xτ))+f(xτ)g_{\tau}=(g_{\tau}-\bar{g}_{\tau})+(\bar{g}_{\tau}-\nabla f(x_{\tau}))+\nabla f(x_{\tau}) which gives,

(C)\displaystyle{\rm(C)} :=Eτ=0Tατ2(gτg¯τ)+(g¯τf(xτ))+f(xτ)2\displaystyle:=\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|(g_{\tau}-\bar{g}_{\tau})+(\bar{g}_{\tau}-\nabla f(x_{\tau}))+\nabla f(x_{\tau})\|^{2}
3Eτ=0Tατ2gτg¯τ2+3Eτ=0Tατ2g¯τf(xτ)2+3Eτ=0Tατ2f(xτ)2\displaystyle\leq 3\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}-\bar{g}_{\tau}\|^{2}+3\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|\bar{g}_{\tau}-\nabla f(x_{\tau})\|^{2}+3\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})\|^{2}
3Eτ=0Tατ2gτg¯τ2+3EVT+6LEτ=0Tατ2Δτ\displaystyle\leq 3\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\|g_{\tau}-\bar{g}_{\tau}\|^{2}+3\mbox{\bf E}V_{T}+6L\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\Delta_{\tau}
3σ2MEτ=0Tατ2+3EVT+6LEτ=0Tατ2Δτ\displaystyle\leq 3\frac{\sigma^{2}}{M}\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}+3\mbox{\bf E}V_{T}+6L\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\Delta_{\tau}
3σ2MEτ=0Tατ2+3EVT+12LEτ=0Tα0:τΔτ,\displaystyle\leq 3\frac{\sigma^{2}}{M}\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}+3\mbox{\bf E}V_{T}+12L\mbox{\bf E}\sum_{\tau=0}^{T}\alpha_{0:\tau}\Delta_{\tau}~{}, (43)

where the second line uses a+b+c23(a2+b2+c2)\|a+b+c\|^{2}\leq 3(\|a\|^{2}+\|b\|^{2}+\|c\|^{2}) which holds for any a,b,cda,b,c\in{\mathbb{R}}^{d}; the third line uses the definition of VTV_{T} as well as the smoothness of f()f(\cdot) implying that f(xτ)22L(f(xτ)f(w)):=2LΔτ\|\nabla f(x_{\tau})\|^{2}\leq 2L(f(x_{\tau})-f(w^{*})):=2L\Delta_{\tau} (see Lemma 10 below); the fourth line invokes Lemma 11; the fifth line uses ατ2=(τ+1)22α0:τ\alpha_{\tau}^{2}=(\tau+1)^{2}\leq 2\alpha_{0:\tau}.

Lemma 10.

Let F:dF:{\mathbb{R}}^{d}\mapsto{\mathbb{R}} be an LL-smooth function with a global minimum xx^{*}, then for any xdx\in{\mathbb{R}}^{d} we have,

F(x)22L(F(x)F(x)).\|\nabla F(x)\|^{2}\leq 2L(F(x)-F(x^{*}))~{}.
Lemma 11.

The following bound holds for any t[T]t\in[T],

Egτg¯τ2\displaystyle\mbox{\bf E}\|g_{\tau}-\bar{g}_{\tau}\|^{2} σ2M.\displaystyle\leq\frac{\sigma^{2}}{M}~{}.

I.1 Proof of Lemma 10

Proof of Lemma 10.

The LL-smoothness of FF means that the following holds x,ud\forall x,u\in{\mathbb{R}}^{d},

F(x+u)F(x)+F(x)u+L2u2.F(x+u)\leq F(x)+\nabla F(x)^{\top}u+\frac{L}{2}\|u\|^{2}~{}.

Taking u=1LF(x)u=-\frac{1}{L}\nabla F(x) we get,

F(x+u)F(x)1LF(x)2+12LF(x)2=F(x)12LF(x)2.F(x+u)~{}\leq~{}F(x)-\frac{1}{L}\|\nabla F(x)\|^{2}+\frac{1}{2L}\|\nabla F(x)\|^{2}=F(x)-\frac{1}{2L}\|\nabla F(x)\|^{2}~{}.

Thus:

F(x)2\displaystyle\|\nabla F(x)\|^{2} 2L(F(x)F(x+u))\displaystyle~{}\leq~{}2L\big{(}F(x)-F(x+u)\big{)}
2L(F(x)F(x)),\displaystyle~{}\leq~{}2L\big{(}F(x)-F(x^{*})\big{)}~{},

where in the last inequality we used F(x)F(x+u)F(x^{*})\leq F(x+u) which holds since xx^{*} is the global minimum. ∎

I.2 Proof of Lemma 11

Proof of Lemma 11.

Recall that we can write,

gτg¯τ:=1Mi[M](gτig¯τi)g_{\tau}-\bar{g}_{\tau}:=\frac{1}{M}\sum_{i\in[M]}(g_{\tau}^{i}-\bar{g}_{\tau}^{i})

where g¯τi:=fi(xτi)\bar{g}_{\tau}^{i}:=\nabla f_{i}(x_{\tau}^{i}), gτi:=fi(xτi,zτi)g_{\tau}^{i}:=\nabla f_{i}(x_{\tau}^{i},z_{\tau}^{i}), and zτ1,,zτMz_{\tau}^{1},\ldots,z_{\tau}^{M} are independent of each other. Thus, conditioned on {xτi}i=1M\{x_{\tau}^{i}\}_{i=1}^{M}, the terms {gτig¯τi}i=1M\{g_{\tau}^{i}-\bar{g}_{\tau}^{i}\}_{i=1}^{M} are independent and zero mean, i.e., E[gτig¯τi|{xτi}i=1M]=0\mbox{\bf E}[g_{\tau}^{i}-\bar{g}_{\tau}^{i}|\{x_{\tau}^{i}\}_{i=1}^{M}]=0. Consequently,

E[gτg¯τ2|{xti}i=1M]\displaystyle\mbox{\bf E}\left[\|g_{\tau}-\bar{g}_{\tau}\|^{2}|\{x_{t}^{i}\}_{i=1}^{M}\right] =1M2E[i[M](gτig¯τi)2|{xti}i[M]]\displaystyle=\frac{1}{M^{2}}\mbox{\bf E}\left[\left\|\sum_{i\in[M]}(g_{\tau}^{i}-\bar{g}_{\tau}^{i})\right\|^{2}|\{x_{t}^{i}\}_{i\in[M]}\right]
=1M2i[M]E[gτig¯τi2|{xti}i=1M]\displaystyle=\frac{1}{M^{2}}\sum_{i\in[M]}\mbox{\bf E}\left[\|g_{\tau}^{i}-\bar{g}_{\tau}^{i}\|^{2}|\{x_{t}^{i}\}_{i=1}^{M}\right]
1M2i[M]σ2\displaystyle\leq\frac{1}{M^{2}}\sum_{i\in[M]}\sigma^{2}
σ2M.\displaystyle\leq\frac{\sigma^{2}}{M}~{}.

Using the law of total expectation implies that Egτg¯τ2σ2M\mbox{\bf E}\|g_{\tau}-\bar{g}_{\tau}\|^{2}\leq\frac{\sigma^{2}}{M}. ∎

Appendix J Proof of Lemma 5

Proof of Lemma 5.

To bound EVt\mbox{\bf E}V_{t} we will first employ the definition of xtx_{t} together with the smoothness of f()f(\cdot),

Ef(xτ)g¯τ2\displaystyle\mbox{\bf E}\|\nabla f(x_{\tau})-\bar{g}_{\tau}\|^{2} =E1Mi[M]fi(xτ)1Mi[M]fi(xτi)2\displaystyle=\mbox{\bf E}\left\|\frac{1}{M}\sum_{i\in[M]}\nabla f_{i}(x_{\tau})-\frac{1}{M}\sum_{i\in[M]}\nabla f_{i}(x_{\tau}^{i})\right\|^{2}
1Mi[M]Efi(xτ)fi(xτi)2\displaystyle\leq\frac{1}{M}\sum_{i\in[M]}\mbox{\bf E}\|\nabla f_{i}(x_{\tau})-\nabla f_{i}(x_{\tau}^{i})\|^{2}
L2Mi[M]Exτxτi2\displaystyle\leq\frac{L^{2}}{M}\sum_{i\in[M]}\mbox{\bf E}\|x_{\tau}-x_{\tau}^{i}\|^{2}
=L2Mi[M]E1Mj[M]xτjxτi2\displaystyle=\frac{L^{2}}{M}\sum_{i\in[M]}\mbox{\bf E}\left\|\frac{1}{M}\sum_{j\in[M]}x_{\tau}^{j}-x_{\tau}^{i}\right\|^{2}
L2M2i,j[M]Exτjxτi2,\displaystyle\leq\frac{L^{2}}{M^{2}}\sum_{i,j\in[M]}\mbox{\bf E}\|x_{\tau}^{j}-x_{\tau}^{i}\|^{2}~{}, (44)

where the first line uses the definition of g¯t\bar{g}_{t}, the second line uses Jensen’s inequality, and the third line uses the smoothness of fi()f_{i}(\cdot)’s. The last line follows from Jensen’s inequality.

We use the following notation for any τ[T]\tau\in[T]

qτi,j:=ατ2xτixτj2,&qτi:=ατ2j[M]xτixτj2,&Qτ:=1M2ατ2i,j[M]xτixτj2,\displaystyle q_{\tau}^{i,j}:=\alpha_{\tau}^{2}\|x_{\tau}^{i}-x_{\tau}^{j}\|^{2}~{},\quad\&\quad q_{\tau}^{i}:=\alpha_{\tau}^{2}\sum_{j\in[M]}\|x_{\tau}^{i}-x_{\tau}^{j}\|^{2}~{},\quad\&\quad Q_{\tau}:=\frac{1}{M^{2}}\alpha_{\tau}^{2}\sum_{i,j\in[M]}\|x_{\tau}^{i}-x_{\tau}^{j}\|^{2}~{}, (45)

and notice that j[M]qτi,j=qτi\sum_{j\in[M]}q_{\tau}^{i,j}=q_{\tau}^{i}, and that i,j[M]qτi,j=M2Qτ\sum_{i,j\in[M]}q_{\tau}^{i,j}=M^{2}Q_{\tau}. Moreover qτi,j=qτj,i,i,j[M]q_{\tau}^{i,j}=q_{\tau}^{j,i}~{},\forall i,j\in[M].

Thus, according to Eq. (44), it is enough to bound EVt\mbox{\bf E}V_{t} as follows,

EVt:=τ=0tατ2Ef(xτ)g¯τ2L2τ=0tQτ().\displaystyle\mbox{\bf E}V_{t}:=\sum_{\tau=0}^{t}\alpha_{\tau}^{2}\mbox{\bf E}\|\nabla f(x_{\tau})-\bar{g}_{\tau}\|^{2}\leq L^{2}\cdot\underset{{(\star)}}{\underbrace{\sum_{\tau=0}^{t}Q_{\tau}}}~{}. (46)

Next we will bound the above term.

Bounding (){(\star)}:

Let t[T]t\in[T]. If t=rKt=rK for some r[R]r\in[R], then according to Alg. 2 we have xti=xtx_{t}^{i}=x_{t} for any machine i[M]i\in[M], and thus xtixtj=0x_{t}^{i}-x_{t}^{j}=0 for any two machines i,j[M]i,j\in[M].

More generally, if t=rK+kt=rK+k for some r[R]r\in[R], and k[K]k\in[K], then by denoting t0:=rKt_{0}:=rK we can write t=t0+kt=t_{0}+k. Using this notation, the update rule for xτix_{\tau}^{i} implies the following for any i[M]i\in[M],

xti=α0:t0α0:txt0i+1α0:tτ=t0+1tατwτi=α0:t0α0:txt0+1α0:tτ=t0+1tατwτi,x_{t}^{i}=\frac{\alpha_{0:t_{0}}}{\alpha_{0:t}}x_{t_{0}}^{i}+\frac{1}{\alpha_{0:t}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}w_{\tau}^{i}=\frac{\alpha_{0:t_{0}}}{\alpha_{0:t}}x_{t_{0}}+\frac{1}{\alpha_{0:t}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}w_{\tau}^{i}~{},

where we used xt0i=xt0,i[M]x_{t_{0}}^{i}=x_{t_{0}}~{},~{}\forall i\in[M]. Thus, for any iji\neq j we can write,

αt2xtixtj2=αt2(α0:t)2τ=t0+1tατ(wτiwτj)2.\displaystyle\alpha_{t}^{2}\|x_{t}^{i}-x_{t}^{j}\|^{2}=\frac{\alpha_{t}^{2}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j})\right\|^{2}~{}. (47)

So our next goal is to derive an expression for τ=t0+1tατ(wτiwτj)\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j}). The update rule of Eq. (6) implies that for any τ[t0,t0+K]\tau\in[t_{0},t_{0}+K],

wτi=wt0iηn=t0+1ταngni=wt0ηn=t0+1ταngni\displaystyle w_{\tau}^{i}=w_{t_{0}}^{i}-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{n}g_{n}^{i}=w_{t_{0}}-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{n}g_{n}^{i} (48)

where the second equality is due to the initialization of each round implying that wt0i=wt0,i[M]w_{t_{0}}^{i}=w_{t_{0}}~{},\forall i\in[M].

Next, we will require the following notation g¯ti:=fi(xti)\bar{g}_{t}^{i}:=\nabla f_{i}(x_{t}^{i}), and ξti:=gtig¯ti\xi_{t}^{i}:=g_{t}^{i}-\bar{g}_{t}^{i}. We can therefore write, gti=g¯ti+ξtig_{t}^{i}=\bar{g}_{t}^{i}+\xi_{t}^{i} and it is immediate to show that E[ξti|xti]=0\mbox{\bf E}[\xi_{t}^{i}|x_{t}^{i}]=0. Using this notation together with Eq. (48), implies that for any τ[t0,t0+K]\tau\in[t_{0},t_{0}+K] and iji\neq j we have,

ατ(wτiwτj)\displaystyle\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j}) =ηn=t0+1ταταn(g¯nig¯nj)ηn=t0+1ταταn(ξniξnj)\displaystyle=-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}(\bar{g}_{n}^{i}-\bar{g}_{n}^{j})-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}(\xi_{n}^{i}-\xi_{n}^{j})
=ηn=t0+1ταταn(g¯nig¯nj)ηn=t0+1ταταnξn\displaystyle=-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}(\bar{g}_{n}^{i}-\bar{g}_{n}^{j})-\eta\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}\xi_{n} (49)

and in the last line we use the following notation ξn:=ξniξnj\xi_{n}:=\xi_{n}^{i}-\xi_{n}^{j} 777A more appropriate notation would be ξn(i,j):=ξniξnj\xi_{n}^{(i,j)}:=\xi_{n}^{i}-\xi_{n}^{j}, but to ease notation we absorb the (i,j)(i,j) notation into ξ\xi..

Summing Eq. (49) over τ[t0+1,t]\tau\in[t_{0}+1,t], we obtain,

τ=t0+1tατ(wτiwτj)\displaystyle\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j}) =ητ=t0+1tn=t0+1ταταn(g¯nig¯nj)ητ=t0+1tn=t0+1ταταnξn\displaystyle=-\eta\sum_{\tau=t_{0}+1}^{t}\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}(\bar{g}_{n}^{i}-\bar{g}_{n}^{j})-\eta\sum_{\tau=t_{0}+1}^{t}\sum_{n=t_{0}+1}^{\tau}\alpha_{\tau}\alpha_{n}\xi_{n}
=ηn=t0+1tτ=ntαταn(g¯nig¯nj)ηn=t0+1tτ=ntαταnξn\displaystyle=-\eta\sum_{n=t_{0}+1}^{t}\sum_{\tau=n}^{t}\alpha_{\tau}\alpha_{n}(\bar{g}_{n}^{i}-\bar{g}_{n}^{j})-\eta\sum_{n=t_{0}+1}^{t}\sum_{\tau=n}^{t}\alpha_{\tau}\alpha_{n}\xi_{n}
=ηn=t0+1tαn:tαn(g¯nig¯nj)ηn=t0+1tαn:tαnξn\displaystyle=-\eta\sum_{n=t_{0}+1}^{t}\alpha_{n:t}\alpha_{n}(\bar{g}_{n}^{i}-\bar{g}_{n}^{j})-\eta\sum_{n=t_{0}+1}^{t}\alpha_{n:t}\alpha_{n}\xi_{n}
=ητ=t0+1tατ:tατ(g¯τig¯τj)ητ=t0+1tατ:tατξτ,\displaystyle=-\eta\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}(\bar{g}_{\tau}^{i}-\bar{g}_{\tau}^{j})-\eta\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}~{}, (50)

where in the last equality we renamed the summation index from nn to τ\tau (only to ease notation).

Plugging the above equation back into Eq. (47) we obtain for any t[t0,t0+K]t\in[t_{0},t_{0}+K]

qti,j:=αt2\displaystyle q_{t}^{i,j}:=\alpha_{t}^{2} xtixtj2\displaystyle\|x_{t}^{i}-x_{t}^{j}\|^{2}
=αt2(α0:t)2τ=t0+1tατ(wτiwτj)2\displaystyle=\frac{\alpha_{t}^{2}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}(w_{\tau}^{i}-w_{\tau}^{j})\right\|^{2}
=αt2(α0:t)2ητ=t0+1tατ:tατ(g¯τig¯τj)+ητ=t0+1tατ:tατξτ2\displaystyle=\frac{\alpha_{t}^{2}}{(\alpha_{0:t})^{2}}\left\|\eta\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}(\bar{g}_{\tau}^{i}-\bar{g}_{\tau}^{j})+\eta\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
η22αt2(α0:t)2τ=t0+1tατ:tατ(g¯τig¯τj)2+η22αt2(α0:t)2τ=t0+1tατ:tατξτ2\displaystyle\leq\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}(\bar{g}_{\tau}^{i}-\bar{g}_{\tau}^{j})\right\|^{2}+\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
η22αt2(α0:t)2(tt0)τ=t0+1t(ατ:tατ)2g¯τig¯τj2+η22αt2(α0:t)2τ=t0+1tατ:tατξτ2\displaystyle\leq\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\cdot(t-t_{0})\sum_{\tau=t_{0}+1}^{t}(\alpha_{\tau:t}\alpha_{\tau})^{2}\|\bar{g}_{\tau}^{i}-\bar{g}_{\tau}^{j}\|^{2}+\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
η22αt2(α0:t)2K(Kαt)2τ=t0+1tατ2fi(xτi)fj(xτj)2+η22αt2(α0:t)2τ=t0+1tατ:tατξτ2\displaystyle\leq\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\cdot K\cdot(K\alpha_{t})^{2}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2}+\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
=η22αt4(α0:t)2K3L21L2τ=t0+1tατ2fi(xτi)fj(xτj)2+η22αt2(α0:t)2τ=t0+1tατ:tατξτ2\displaystyle=\eta^{2}\frac{2\alpha^{4}_{t}}{(\alpha_{0:t})^{2}}\cdot K^{3}L^{2}\cdot\frac{1}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2}+\eta^{2}\frac{2\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
8η2K3L21L2τ=t0+1tατ2fi(xτi)fj(xτj)2+(8η2/αt2)τ=t0+1tατ:tατξτ2\displaystyle\leq 8\eta^{2}K^{3}L^{2}\cdot\frac{1}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2}+(8\eta^{2}/\alpha^{2}_{t})\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\alpha_{\tau}\xi_{\tau}\right\|^{2}
=8η2K3L21L2τ=t0+1tατ2fi(xτi)fj(xτj)2+8η2αt2τ=t0+1tατ:tαταt2ξτ2,\displaystyle=8\eta^{2}K^{3}L^{2}\cdot\frac{1}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2}+8\eta^{2}\alpha_{t}^{2}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}~{}, (51)

where the first inequality uses a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, which holds a,bd\forall a,b\in{\mathbb{R}}^{d}; the second inequality uses n=N1+1N2an2(N2N1)n=N1+1N2an2\|\sum_{n=N_{1}+1}^{N_{2}}a_{n}\|^{2}\leq(N_{2}-N_{1})\sum_{n=N_{1}+1}^{N_{2}}\|a_{n}\|^{2}, which holds for any {and}n=N1+1N2\{a_{n}\in{\mathbb{R}}^{d}\}_{n=N_{1}+1}^{N_{2}}; the third inequality uses the definition of g¯τi\bar{g}_{\tau}^{i}, together with tt0Kt-t_{0}\leq K and (ατ:t)2(Kαt)2(\alpha_{\tau:t})^{2}\leq(K\alpha_{t})^{2}, which holds since the sum ατ:t\alpha_{\tau:t} contains at most KK terms, each bounded by αt\alpha_{t} (as αταt\alpha_{\tau}\leq\alpha_{t} for τt\tau\leq t); and the last inequality uses the fact that αt=t+1\alpha_{t}=t+1, implying that the following holds,

αt4(α0:t)24,&αt2(α0:t)24αt2\frac{\alpha^{4}_{t}}{(\alpha_{0:t})^{2}}\leq 4~{},\quad\&~{}\quad\frac{\alpha^{2}_{t}}{(\alpha_{0:t})^{2}}\leq\frac{4}{\alpha_{t}^{2}}
Lemma 12.

The following holds for any i,j[M]i,j\in[M],

fi(xτi)fj(xτj)2\displaystyle\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2} 3L2Mατ2(qτi+qτj)+6(fi(xτ)2+fj(xτ)2),\displaystyle\leq\frac{3L^{2}}{M\alpha_{\tau}^{2}}(q_{\tau}^{i}+q_{\tau}^{j})+6\left(\|\nabla f_{i}(x_{\tau})\|^{2}+\|\nabla f_{j}(x_{\tau})\|^{2}\right)~{},

Using the above lemma inside Eq. (51) yields,

qti,j\displaystyle q_{t}^{i,j} :=αt2xtixtj2\displaystyle:=\alpha_{t}^{2}\|x_{t}^{i}-x_{t}^{j}\|^{2}
24η2K3L21M(qt0+1:ti+qt0+1:tj)+48η2K3τ=t0+1tατ2(fi(xτ)2+fj(xτ)2)\displaystyle\leq 24\eta^{2}K^{3}L^{2}\cdot\frac{1}{M}(q_{t_{0}+1:t}^{i}+q_{t_{0}+1:t}^{j})+48\eta^{2}K^{3}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\left(\|\nabla f_{i}(x_{\tau})\|^{2}+\|\nabla f_{j}(x_{\tau})\|^{2}\right)
+8η2αt2τ=t0+1tατ:tαταt2ξτ2\displaystyle\quad+8\eta^{2}\alpha_{t}^{2}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}
=θ21M(qt0+1:ti+qt0+1:tj)+θL2τ=t0+1tατ2(fi(xτ)2+fj(xτ)2)+ti,j,\displaystyle=\frac{\theta}{2}\cdot\frac{1}{M}(q_{t_{0}+1:t}^{i}+q_{t_{0}+1:t}^{j})+\frac{\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\left(\|\nabla f_{i}(x_{\tau})\|^{2}+\|\nabla f_{j}(x_{\tau})\|^{2}\right)+\mathcal{B}_{t}^{i,j}~{}, (52)

where we have denoted 888 Formally we should use the i,ji,j upper script for ξτ\xi_{\tau} in the definition of ti,j\mathcal{B}_{t}^{i,j}, i.e. to define ti,j:=8η2αt2τ=t0+1tατ:tαταt2ξτi,j2\mathcal{B}_{t}^{i,j}:=8\eta^{2}\alpha_{t}^{2}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}^{i,j}\right\|^{2}. We absorb this notation into ξτ\xi_{\tau} to simplify the notation.

θ:=48η2K3L2,&ti,j:=8η2αt2τ=t0+1tατ:tαταt2ξτ2.\theta:=48\eta^{2}K^{3}L^{2}~{},\quad\&\quad\mathcal{B}_{t}^{i,j}:=8\eta^{2}\alpha_{t}^{2}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}~{}.

Summing Eq. (52) over i,j[M]i,j\in[M] and using the definition of QtQ_{t} gives,

M2Qt\displaystyle M^{2}Q_{t} =i,j[M]qti,j\displaystyle=\sum_{i,j\in[M]}q_{t}^{i,j}
θ21M2M3Qt0+1:t+θL22Mτ=t0+1tατ2i[M]fi(xτ)2+i,j[M]ti,j\displaystyle\leq\frac{\theta}{2}\cdot\frac{1}{M}\cdot 2M^{3}Q_{t_{0}+1:t}+\frac{\theta}{L^{2}}\cdot 2M\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\sum_{i\in[M]}\|\nabla f_{i}(x_{\tau})\|^{2}+\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}
=M2θQt0+1:t+2MθL2τ=t0+1tατ2i[M]fi(xτ)2+i,j[M]ti,j,\displaystyle=M^{2}\theta\cdot Q_{t_{0}+1:t}+\frac{2M\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\sum_{i\in[M]}\|\nabla f_{i}(x_{\tau})\|^{2}+\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}~{}, (53)

where we used,

i,j[M]qτi=j[M]i[M]qτi=j[M]M2Qτ=M3Qτ.\sum_{i,j\in[M]}q_{\tau}^{i}=\sum_{j\in[M]}\sum_{i\in[M]}q_{\tau}^{i}=\sum_{j\in[M]}M^{2}Q_{\tau}=M^{3}Q_{\tau}~{}.

Now, dividing Eq. (53) by M2M^{2} gives, t[t0,t0+K]\forall t\in[t_{0},t_{0}+K],

Qt\displaystyle Q_{t} θQt0+1:t+2θL2τ=t0+1tατ21Mi[M]fi(xτ)2+1M2i,j[M]ti,j\displaystyle\leq\theta\cdot Q_{t_{0}+1:t}+\frac{2\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\cdot\frac{1}{M}\sum_{i\in[M]}\|\nabla f_{i}(x_{\tau})\|^{2}+\frac{1}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}
θQt0+1:t+2θL2τ=t0+1tατ2(G2+4L(f(xτ)f(w)))+1M2i,j[M]ti,j\displaystyle\leq\theta\cdot Q_{t_{0}+1:t}+\frac{2\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L(f(x_{\tau})-f(w^{*})))+\frac{1}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}
θQt0+1:t+2θL2τ=t0+1tατ2(G2+4LΔτ))+1M2i,j[M]ti,j,\displaystyle\leq\theta\cdot Q_{t_{0}+1:t}+\frac{2\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau}))+\frac{1}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}~{}, (54)

where the second line follows from the dissimilarity assumption Eq. (2), and the last line is due to the definition of Δτ\Delta_{\tau}.

Thus, we can re-write the above equation as follows for all t\in[t_{0},t_{0}+K],

QtθQt0+1:t+Ht,\displaystyle Q_{t}\leq\theta Q_{t_{0}+1:t}+H_{t}~{}, (55)

where H_{t}=\frac{2\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+\frac{1}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t}^{i,j}, and recall that \theta:=48\eta^{2}K^{3}L^{2}. Now, notice that Q_{t},H_{t}\geq 0, and that \theta satisfies \theta K=48\eta^{2}K^{4}L^{2}\leq 1/2 since we assume that \eta^{2}\leq\frac{1}{100L^{2}K^{4}} (see Eq. (8)). This enables us to make use of Lemma 13 below to conclude,

\displaystyle Q_{t_{0}+1:t_{0}+K}\leq 2H_{t_{0}+1:t_{0}+K} =\frac{4\theta}{L^{2}}\sum_{\tau=t_{0}+1}^{t_{0}+K}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+\frac{2}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t_{0}+1:t_{0}+K}^{i,j}
\displaystyle\leq 200\eta^{2}K^{3}\sum_{\tau=t_{0}+1}^{t_{0}+K}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+\frac{2}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t_{0}+1:t_{0}+K}^{i,j}~{}. (56)
Lemma 13.

Let K,θ>0K,\theta>0, and assume θK1/2\theta K\leq 1/2. Also assume a sequence of non-negative terms {Qt0}t=t0+1t0+K\{Q_{t}\geq 0\}_{t=t_{0}+1}^{t_{0}+K} and another sequence {Ht0}t=t0+1t0+K\{H_{t}\geq 0\}_{t=t_{0}+1}^{t_{0}+K} that satisfy the following inequality for any t[t0,t0+K]t\in[t_{0},t_{0}+K],

QtθQt0+1:t+HtQ_{t}\leq\theta Q_{t_{0}+1:t}+H_{t}

Then the following holds,

Qt0+1:t0+K2Ht0+1:t0+K.Q_{t_{0}+1:t_{0}+K}\leq 2H_{t_{0}+1:t_{0}+K}~{}.

Recalling that we would like to bound the expectation of the LHS of Eq. (J), we next bound \left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}, which is done in the following lemma (recall that for simplicity of notation we write \xi_{\tau} rather than \xi_{\tau}^{i,j}),

Lemma 14.

The following bound holds for any t[t0,t0+K]t\in[t_{0},t_{0}+K],

Eτ=t0+1tατ:tαταt2ξτ24K3σ2.\mbox{\bf E}\left\|\sum_{\tau=t_{0}+1}^{t}\frac{\alpha_{\tau:t}\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}\leq 4K^{3}\sigma^{2}~{}.

Since the above lemma holds for any i,j and t\in[t_{0},t_{0}+K], we can now bound \mbox{\bf E}\mathcal{B}_{t_{0}+1:t_{0}+K}^{i,j} as follows,

\displaystyle\mbox{\bf E}\mathcal{B}_{t_{0}+1:t_{0}+K}^{i,j} =8\eta^{2}\sum_{t=t_{0}+1}^{t_{0}+K}\alpha^{2}_{t}\,\mbox{\bf E}\left\|\sum_{\tau=t_{0}+1}^{t}\frac{\alpha_{\tau:t}\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}
\displaystyle\leq 8\eta^{2}\sum_{t=t_{0}+1}^{t_{0}+K}\alpha^{2}_{t}\cdot 4K^{3}\sigma^{2}
\displaystyle\leq 32\eta^{2}K^{3}\sigma^{2}\cdot K\cdot\left((r+1)K\right)^{2}
\displaystyle=32\eta^{2}(r+1)^{2}K^{6}\sigma^{2}~{}, (57)

where the second line uses Lemma 14, and the third line uses \alpha_{t}\leq(r+1)K for any iteration t that belongs to round r, together with the fact that the sum contains K terms.

Since the above holds for any i,ji,j it follows that,

2M2i,j[M]t0+1:t0+Ki,j232η2(r+1)2K6σ2=64η2(r+1)2K6σ2.\frac{2}{M^{2}}\sum_{i,j\in[M]}\mathcal{B}_{t_{0}+1:t_{0}+K}^{i,j}\leq 2\cdot 32\eta^{2}(r+1)^{2}K^{6}\sigma^{2}=64\eta^{2}(r+1)^{2}K^{6}\sigma^{2}~{}.

Plugging the above back into Eq. (J) gives,

\displaystyle Q_{t_{0}+1:t_{0}+K} \leq 200\eta^{2}K^{3}\sum_{\tau=t_{0}+1}^{t_{0}+K}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+64\eta^{2}(r+1)^{2}K^{6}\sigma^{2}~{}. (58)
Final Bound on EVt\mbox{\bf E}V_{t}.

Finally, using the above bound together with Eq. (46) enables us to bound \mbox{\bf E}V_{t} as follows,

1L2EVt\displaystyle\frac{1}{L^{2}}\mbox{\bf E}V_{t} τ=0tQτ\displaystyle\leq\sum_{\tau=0}^{t}Q_{\tau}
τ=0TQτ\displaystyle\leq\sum_{\tau=0}^{T}Q_{\tau}
\displaystyle\leq 200\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+\sum_{r=0}^{R-1}64\eta^{2}(r+1)^{2}K^{6}\sigma^{2}
\displaystyle\leq 200\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{\tau}^{2}\cdot(G_{*}^{2}+4L\Delta_{\tau})+64\eta^{2}K^{6}\sigma^{2}\cdot\frac{8}{6}R^{3}
\displaystyle\leq 400\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}\cdot(G_{*}^{2}+4L\Delta_{\tau})+90\eta^{2}K^{6}R^{3}\sigma^{2}~{},

where the third line follows from applying Eq. (58) to each round and summing over r=0,\ldots,R-1, the fourth line uses \sum_{r=0}^{R-1}(r+1)^{2}\leq\frac{8}{6}R^{3}, and the last line uses \alpha_{\tau}^{2}=(\tau+1)^{2}\leq 2\alpha_{0:\tau}. Consequently, we can bound

\mbox{\bf E}V_{t}\leq 400L^{2}\eta^{2}K^{3}\sum_{\tau=0}^{T}\alpha_{0:\tau}\cdot(G_{*}^{2}+4L\Delta_{\tau})+90L^{2}\eta^{2}K^{6}R^{3}\sigma^{2}~{},

which establishes the lemma. ∎

J.1 Proof of Lemma 12

Proof of Lemma 12.

First note that by definition of xτx_{\tau} we have for any i[M]i\in[M],

xτxτi2=1Ml[M]xτlxτi2\displaystyle\|x_{\tau}-x_{\tau}^{i}\|^{2}=\left\|\frac{1}{M}\sum_{l\in[M]}x_{\tau}^{l}-x_{\tau}^{i}\right\|^{2} 1Ml[M]xτlxτi2=1Mατ2qτi.\displaystyle\leq\frac{1}{M}\sum_{l\in[M]}\|x_{\tau}^{l}-x_{\tau}^{i}\|^{2}=\frac{1}{M\alpha_{\tau}^{2}}q_{\tau}^{i}~{}. (59)

where we have used Jensen’s inequality, and the definition of qτiq_{\tau}^{i}.

Using the above inequality, we obtain,

fi(xτi)fj(xτj)2\displaystyle\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{j}(x_{\tau}^{j})\|^{2} =(fi(xτi)fi(xτ))+(fi(xτ)fj(xτ))(fj(xτj)fj(xτ))2\displaystyle=\|(\nabla f_{i}(x_{\tau}^{i})-\nabla f_{i}(x_{\tau}))+(\nabla f_{i}(x_{\tau})-\nabla f_{j}(x_{\tau}))-(\nabla f_{j}(x_{\tau}^{j})-\nabla f_{j}(x_{\tau}))\|^{2}
3fi(xτi)fi(xτ)2+3fi(xτ)fj(xτ)2+3fj(xτj)fj(xτ)2\displaystyle\leq 3\|\nabla f_{i}(x_{\tau}^{i})-\nabla f_{i}(x_{\tau})\|^{2}+3\|\nabla f_{i}(x_{\tau})-\nabla f_{j}(x_{\tau})\|^{2}+3\|\nabla f_{j}(x_{\tau}^{j})-\nabla f_{j}(x_{\tau})\|^{2}
3L2(xτxτi2+xτxτj2)+6(fi(xτ)2+fj(xτ)2)\displaystyle\leq 3L^{2}\left(\|x_{\tau}-x_{\tau}^{i}\|^{2}+\|x_{\tau}-x_{\tau}^{j}\|^{2}\right)+6\left(\|\nabla f_{i}(x_{\tau})\|^{2}+\|\nabla f_{j}(x_{\tau})\|^{2}\right)
3L2Mατ2(qτi+qτj)+6(fi(xτ)2+fj(xτ)2),\displaystyle\leq\frac{3L^{2}}{M\alpha_{\tau}^{2}}(q_{\tau}^{i}+q_{\tau}^{j})+6\left(\|\nabla f_{i}(x_{\tau})\|^{2}+\|\nabla f_{j}(x_{\tau})\|^{2}\right)~{},

where the second and third lines use n=1Nan2Nn=1Nan2\|\sum_{n=1}^{N}a_{n}\|^{2}\leq N\sum_{n=1}^{N}\|a_{n}\|^{2} which holds for any {and}n=1N\{a_{n}\in{\mathbb{R}}^{d}\}_{n=1}^{N}; we also used the smoothness of the fi()f_{i}(\cdot)’s; and the last line uses Eq. (59). ∎

J.2 Proof of Lemma 13

Proof of Lemma 13.

Since the QtQ_{t}’s and θ\theta are non-negative, we can further bound QtQ_{t} for all t[t0,t0+K]t\in[t_{0},t_{0}+K] as follows,

QtθQt0+1:t+HtθQt0+1:t0+K+Ht.Q_{t}\leq\theta Q_{t_{0}+1:t}+H_{t}\leq\theta Q_{t_{0}+1:t_{0}+K}+H_{t}~{}.

Summing the above over tt gives,

Qt0+1:t0+K:=t=t0+1t0+KQtθKQt0+1:t0+K+Ht0+1:t0+K12Qt0+1:t0+K+Ht0+1:t0+KQ_{t_{0}+1:t_{0}+K}:=\sum_{t=t_{0}+1}^{t_{0}+K}Q_{t}\leq\theta K\cdot Q_{t_{0}+1:t_{0}+K}+H_{t_{0}+1:t_{0}+K}\leq\frac{1}{2}\cdot Q_{t_{0}+1:t_{0}+K}+H_{t_{0}+1:t_{0}+K}

where we used θK1/2\theta K\leq 1/2. Re-ordering the above equation immediately establishes the lemma. ∎
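As a complementary illustration (not used anywhere in the analysis), the following short Python sketch numerically sanity-checks Lemma 13: it builds the worst-case sequence that satisfies the recursion Q_t \leq \theta Q_{t_0+1:t}+H_t with equality, and verifies that the total sum never exceeds twice the sum of the H_t's. The random ranges and the tolerance are arbitrary choices for this sketch.

# Numerical sanity check of Lemma 13 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    K = int(rng.integers(1, 50))             # horizon length
    theta = rng.uniform(0, 0.5 / K)          # ensures theta * K <= 1/2
    H = rng.uniform(0, 10, size=K)           # arbitrary non-negative H_t
    Q = np.zeros(K)
    prefix = 0.0                             # running sum Q_{t_0+1 : t-1}
    for t in range(K):
        # the tightest Q_t allowed by Q_t <= theta * (prefix + Q_t) + H_t
        Q[t] = (theta * prefix + H[t]) / (1.0 - theta)
        prefix += Q[t]
    assert Q.sum() <= 2.0 * H.sum() + 1e-9   # the conclusion of Lemma 13
print("Lemma 13 sanity check passed on all random instances.")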

J.3 Proof of Lemma 14

Proof of Lemma 14.

Let \{\mathcal{F}_{t}\}_{t} be the natural filtration induced by the random draws up to time t, i.e., by \{\{z_{1}^{i}\}_{i\in[M]},\ldots,\{z_{t}^{i}\}_{i\in[M]}\}. By the definition of \xi_{t}, it is clear that \xi_{t} is measurable with respect to \mathcal{F}_{t}, and that,

E[ξt|t1]=0.\mbox{\bf E}[\xi_{t}|\mathcal{F}_{t-1}]=0~{}.

This implies that \{\xi_{t}\}_{t} is a martingale difference sequence with respect to the filtration \{\mathcal{F}_{t}\}_{t}. Consequently,

Eτ=t0+1tατ:tαταt2ξτ2\displaystyle\mbox{\bf E}\left\|\sum_{\tau=t_{0}+1}^{t}\frac{\alpha_{\tau:t}\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2} τ=t0+1t(ατ:tαταt2)2Eξτ2\displaystyle\leq\sum_{\tau=t_{0}+1}^{t}\left(\frac{\alpha_{\tau:t}\alpha_{\tau}}{\alpha_{t}^{2}}\right)^{2}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}
τ=t0+1t(Kαtαtαt2)2Eξτ2\displaystyle\leq\sum_{\tau=t_{0}+1}^{t}\left(\frac{K\alpha_{t}\cdot\alpha_{t}}{\alpha_{t}^{2}}\right)^{2}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}
=K2τ=t0+1tEξτ2\displaystyle=K^{2}\sum_{\tau=t_{0}+1}^{t}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}
K2τ=t0+1tEξτiξτj2\displaystyle\leq K^{2}\sum_{\tau=t_{0}+1}^{t}\mbox{\bf E}\left\|\xi_{\tau}^{i}-\xi_{\tau}^{j}\right\|^{2}
\displaystyle\leq 2K^{2}\sum_{\tau=t_{0}+1}^{t}(\mbox{\bf E}\|\xi_{\tau}^{i}\|^{2}+\mbox{\bf E}\|\xi_{\tau}^{j}\|^{2})
2K2τ=t0+1t2σ2\displaystyle\leq 2K^{2}\sum_{\tau=t_{0}+1}^{t}2\sigma^{2}
4K2σ2(tt0)\displaystyle\leq 4K^{2}\sigma^{2}\cdot(t-t_{0})
4K3σ2,\displaystyle\leq 4K^{3}\sigma^{2}~{},

where the first line follows from Lemma 15 below; the second line holds since \alpha_{\tau}\leq\alpha_{t} (recall \tau\leq t), and therefore \alpha_{\tau:t}\leq K\alpha_{t}; the fourth line follows from \xi_{\tau}:=\xi_{\tau}^{i}-\xi_{\tau}^{j}; the fifth line uses \|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, which holds for any a,b\in{\mathbb{R}}^{d}; the sixth line follows since \mbox{\bf E}\|\xi_{\tau}^{i}\|^{2}:=\mbox{\bf E}\|g_{\tau}^{i}-\bar{g}_{\tau}^{i}\|^{2}:=\mbox{\bf E}\|\nabla f(x_{\tau}^{i},z_{\tau}^{i})-\nabla f(x_{\tau}^{i})\|^{2}\leq\sigma^{2}; and the last line uses (t-t_{0})\leq K.

Lemma 15.

Let \{\xi_{t}\}_{t} be a martingale difference sequence with respect to a filtration \{\mathcal{F}_{t}\}_{t}. Then the following holds for all time indices t_{2}\geq t_{1}\geq 0,

Eτ=t1t2ξτ2\displaystyle\mbox{\bf E}\left\|\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right\|^{2} =τ=t1t2Eξτ2.\displaystyle=\sum_{\tau=t_{1}}^{t_{2}}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}~{}.

J.3.1 Proof of Lemma 15

Proof of Lemma 15.

We shall prove the lemma by induction over t2t_{2}. The base case where t2=t1t_{2}=t_{1} clearly holds since it boils down to the following identity,

Eτ=t1t1ξτ2=Eξt12\displaystyle\mbox{\bf E}\left\|\sum_{\tau=t_{1}}^{t_{1}}\xi_{\tau}\right\|^{2}=\mbox{\bf E}\left\|\xi_{t_{1}}\right\|^{2} =τ=t1t1Eξτ2.\displaystyle=\sum_{\tau=t_{1}}^{t_{1}}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}~{}.

Now, for the induction step, let us assume that the equality holds for some t_{2}\geq t_{1}, and prove that it holds for t_{2}+1. Indeed,

Eτ=t1t2+1ξτ2\displaystyle\mbox{\bf E}\left\|\sum_{\tau=t_{1}}^{t_{2}+1}\xi_{\tau}\right\|^{2} =Eξt2+1+τ=t1t2ξτ2\displaystyle=\mbox{\bf E}\left\|\xi_{t_{2}+1}+\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right\|^{2}
=Eτ=t1t2ξτ2+Eξt2+12+2E(τ=t1t2ξτ)ξt2+1\displaystyle=\mbox{\bf E}\left\|\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right\|^{2}+\mbox{\bf E}\|\xi_{t_{2}+1}\|^{2}+2\mbox{\bf E}\left(\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right)\cdot\xi_{t_{2}+1}
=τ=t1t2+1Eξτ2+2E[E[(τ=t1t2ξτ)ξt2+1|t2]]\displaystyle=\sum_{\tau=t_{1}}^{t_{2}+1}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}+2\mbox{\bf E}\left[\mbox{\bf E}\left[\left(\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right)\cdot\xi_{t_{2}+1}|\mathcal{F}_{t_{2}}\right]\right]
=τ=t1t2+1Eξτ2+2E[(τ=t1t2ξτ)E[ξt2+1|t2]]\displaystyle=\sum_{\tau=t_{1}}^{t_{2}+1}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}+2\mbox{\bf E}\left[\left(\sum_{\tau=t_{1}}^{t_{2}}\xi_{\tau}\right)\cdot\mbox{\bf E}\left[\xi_{t_{2}+1}|\mathcal{F}_{t_{2}}\right]\right]
=τ=t1t2+1Eξτ2+0\displaystyle=\sum_{\tau=t_{1}}^{t_{2}+1}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}+0
=τ=t1t2+1Eξτ2,\displaystyle=\sum_{\tau=t_{1}}^{t_{2}+1}\mbox{\bf E}\left\|\xi_{\tau}\right\|^{2}~{},

where the third line follows from the induction hypothesis, as well as from the law of total expectation; the fourth line follows since \{\xi_{\tau}\}_{\tau=0}^{t_{2}} are measurable w.r.t. \mathcal{F}_{t_{2}}; and the fifth line follows since \mbox{\bf E}[\xi_{t_{2}+1}|\mathcal{F}_{t_{2}}]=0. Thus, we have established the induction step, and therefore the lemma holds. ∎
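As an illustrative aside (not part of the proof), the following Python sketch checks Lemma 15 by Monte Carlo: it simulates a martingale difference sequence with a history-dependent scale and compares the second moment of the sum against the sum of second moments. The particular construction of the sequence below is our own arbitrary choice.

# Monte Carlo check of Lemma 15 (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
d, T, n_samples = 3, 10, 50_000

lhs = 0.0
rhs = 0.0
for _ in range(n_samples):
    z = rng.standard_normal((T, d))
    xi = np.zeros((T, d))
    hist = 0.0
    for t in range(T):
        c = 1.0 + 0.5 * np.tanh(hist)      # predictable scale: depends only on the past
        xi[t] = c * z[t]                   # hence E[xi_t | F_{t-1}] = 0
        hist += z[t, 0]
    lhs += np.linalg.norm(xi.sum(axis=0)) ** 2   # ||sum_t xi_t||^2 for this sample path
    rhs += (xi ** 2).sum()                       # sum_t ||xi_t||^2 for this sample path
lhs /= n_samples
rhs /= n_samples
print(f"E||sum xi||^2 ~ {lhs:.3f}   sum_t E||xi_t||^2 ~ {rhs:.3f}")  # the two should nearly match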

Appendix K Proof of Lemma 6

Proof of Lemma 6.

Summing the inequality At+12(T+1)t=0TAtA_{t}\leq\mathcal{B}+\frac{1}{2(T+1)}\sum_{t=0}^{T}A_{t} over tt gives,

A0:T(T+1)+(T+1)12(T+1)A0:T=(T+1)+12A0:T,\displaystyle A_{0:T}\leq(T+1)\mathcal{B}+(T+1)\frac{1}{2(T+1)}A_{0:T}=(T+1)\mathcal{B}+\frac{1}{2}A_{0:T}~{},

Re-ordering, we obtain,

A0:T2(T+1).\displaystyle A_{0:T}\leq 2(T+1)\mathcal{B}~{}.

Plugging this back into the original inequality and taking t=T gives,

AT+12(T+1)A0:T2.A_{T}\leq\mathcal{B}+\frac{1}{2(T+1)}A_{0:T}\leq 2\mathcal{B}~{}.

which concludes the proof. ∎
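For illustration only, the following Python sketch numerically verifies Lemma 6 on random instances: for each sampled non-negative sequence it picks the tightest admissible \mathcal{B} and checks both the intermediate bound A_{0:T}\leq 2(T+1)\mathcal{B} and the conclusion A_{T}\leq 2\mathcal{B}. The sampling ranges are arbitrary.

# Numerical sanity check of Lemma 6 (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    T = int(rng.integers(1, 100))
    A = rng.uniform(0, 5, size=T + 1)              # A_0, ..., A_T >= 0
    S = A.sum()
    B = A.max() - S / (2 * (T + 1))                # smallest B for which the assumption holds
    assert np.all(A <= B + S / (2 * (T + 1)) + 1e-9)   # the assumption of Lemma 6
    assert S <= 2 * (T + 1) * B + 1e-9                 # intermediate bound A_{0:T} <= 2(T+1)B
    assert A[-1] <= 2 * B + 1e-9                       # the conclusion A_T <= 2B
print("Lemma 6 sanity check passed on all random instances.")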

Appendix L The Necessity of Non-uniform Weights

One may wonder why we should employ increasing weights \alpha_{t}\propto t rather than standard uniform weights \alpha_{t}=1~{},\forall t. Here we explain why uniform weights are insufficient, and why increasing weights, e.g. \alpha_{t}\propto t, are crucial for obtaining our result.

Intuitive Explanation.

Prior to providing an elaborate technical explanation, we provide some intuition. Increasing weights are a technical tool to put more emphasis on the last rounds. In the context of local update methods, the iterates of the last rounds are more attractive, since the bias in the stochastic gradients that we compute goes to zero (in expectation) as the optimization progresses, and consequently the discrepancy between different machines shrinks over the course of the run.

Technical Explanation.

Assume general weights {αt}t\{\alpha_{t}\}_{t}, and let us go back to the proof of Lemma 5 (see Section J). Recall that in this proof we derive a bound of the following form (see Eq. (55))

AtθAt0+1:t+Bt,\displaystyle A_{t}\leq\theta A_{t_{0}+1:t}+B_{t}~{}, (60)

where A_{t}:=\alpha_{t}^{2}\|x_{t}^{i}-x_{t}^{j}\|^{2}, B_{t}=8\eta^{2}\alpha_{t}^{2}\left\|\sum_{\tau=t_{0}+1}^{t}\alpha_{\tau:t}\frac{\alpha_{\tau}}{\alpha_{t}^{2}}\xi_{\tau}\right\|^{2}, and importantly (indeed, in the case where \alpha_{t}:=t+1 we can take \theta:=8\eta^{2}K^{3}L^{2}, and this is used in the proof of Lemma 5),

θ:=η22αt4(α0:t)2K3L2.\theta:=\eta^{2}\frac{2\alpha^{4}_{t}}{(\alpha_{0:t})^{2}}\cdot K^{3}L^{2}~{}.

Now, a crucial part of the proof is the fact that \theta K\leq 1/2, which in turn enables us to employ Lemma 13 in order to bound \mbox{\bf E}V_{t}.

Now let us inspect the constraint \theta K\leq 1/2 for polynomial weights of the form \alpha_{t}\propto t^{p} where p\geq 0. This condition boils down to,

η22αt4(α0:t)2K3L2K1/2,\eta^{2}\frac{2\alpha^{4}_{t}}{(\alpha_{0:t})^{2}}\cdot K^{3}L^{2}\cdot K\leq 1/2~{},

implying that the following bound on \eta should hold for any t\in[T],

η12LK2α0:tαt212LK2tp+1t2p=12LK2t1p\eta\leq\frac{1}{2LK^{2}}\cdot\frac{\alpha_{0:t}}{\alpha^{2}_{t}}\approx\frac{1}{2LK^{2}}\cdot\frac{t^{p+1}}{t^{2p}}=\frac{1}{2LK^{2}}\cdot t^{1-p}

Now, since the bound should hold for any t\in[T], we divide the analysis into two cases:
Case 1: p\leq 1. In this case t^{1-p} is monotonically increasing with t, so the above condition should be satisfied for the smallest t, namely t=1, implying,

η12LK2α0:tαt212LK2.\eta\leq\frac{1}{2LK^{2}}\cdot\frac{\alpha_{0:t}}{\alpha^{2}_{t}}\approx\frac{1}{2LK^{2}}~{}.

The effect of this term on the overall error stems from the first term in the error analysis (see e.g. Eq. (F)), namely,

1α0:Tw0w2η=2LK2w0w2Tp+1\frac{1}{\alpha_{0:T}}\cdot\frac{\|w_{0}-w^{*}\|^{2}}{\eta}=\frac{2LK^{2}\cdot\|w_{0}-w^{*}\|^{2}}{T^{p+1}}

Now, for the extreme values p=0 (uniform weights) and p=1 (linear weights), the above expression results in an error term of the following form,

Err(𝐩=𝟎)=O(K2/T)=O(K/R)&Err(𝐩=𝟏)=O(K2/T2)=O(1/R2).\displaystyle\text{Err}({\bf p=0})=O\left({K^{2}}/{T}\right)=O({K}/{R})~{}~{}~{}\&~{}~{}~{}\text{Err}({\bf p=1})=O({K^{2}}/{T^{2}})=O({1}/{R^{2}})~{}. (61)

Thus, for p=0p=0, the error term is considerably worse even compared to Minibatch-SGD, and p=1p=1 is clearly an improvement.
Case 2: p>1. In this case t^{1-p} is monotonically decreasing with t, so the above condition should be satisfied for the largest t, namely t=T, implying,

η12LK2α0:tαt212LK21Tp1.\eta\leq\frac{1}{2LK^{2}}\cdot\frac{\alpha_{0:t}}{\alpha^{2}_{t}}\approx\frac{1}{2LK^{2}}\cdot\frac{1}{T^{p-1}}~{}.

Now, the effect of this term on the overall error stems from the first term in the error analysis (see e.g. Eq. (F)), namely,

\frac{1}{\alpha_{0:T}}\cdot\frac{\|w_{0}-w^{*}\|^{2}}{\eta}=\frac{2LK^{2}\cdot\|w_{0}-w^{*}\|^{2}\cdot T^{p-1}}{T^{p+1}}=\frac{2LK^{2}\cdot\|w_{0}-w^{*}\|^{2}}{T^{2}}

Thus, for any p1p\geq 1 we obtain,

Err(𝐩)=O(K2/T2)=O(1/R2).\displaystyle\text{Err}({\bf p})=O({K^{2}}/{T^{2}})=O({1}/{R^{2}})~{}. (62)

Conclusion: As can be seen from Equations (61) and (62), using uniform weights, or more generally polynomial weights \alpha_{t}\propto t^{p} with p<1, yields strictly worse guarantees compared to taking p\geq 1.
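To illustrate the discussion above, the following Python sketch evaluates the first error term (1/\alpha_{0:T})\cdot\|w_{0}-w^{*}\|^{2}/\eta with the largest step size allowed by the constraint \theta K\leq 1/2, for polynomial weights \alpha_{t}=(t+1)^{p}. The constants L, K and \|w_{0}-w^{*}\|^{2} are arbitrary placeholders, and only this single term is computed (the remaining error terms are ignored).

# Scaling of the first error term for polynomial weights (illustrative only).
import numpy as np

L, K, D2 = 1.0, 16, 1.0                      # smoothness, local steps, ||w_0 - w*||^2 (arbitrary)
for p in [0.0, 0.5, 1.0, 2.0]:
    errs = []
    for R in [64, 128, 256, 512]:
        T = K * R
        t = np.arange(T + 1)
        alpha = (t + 1.0) ** p               # alpha_t = (t+1)^p
        # theta*K <= 1/2 requires eta <= (1/(2 L K^2)) * alpha_{0:t} / alpha_t^2 for all t
        eta = (alpha.cumsum() / alpha ** 2).min() / (2 * L * K ** 2)
        errs.append(D2 / (eta * alpha.sum()))    # first error term: (1/alpha_{0:T}) * D^2 / eta
    print(f"p={p}: first error term for R=64,128,256,512 -> {np.round(errs, 6)}")

The printed values decay roughly like K/R for p=0 and like 1/R^2 for p\geq 1, matching the scalings derived above.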

Appendix M Experiments

M.1 Data Distribution Across Workers

In federated learning or distributed machine learning settings, data heterogeneity among workers often arises due to non-identical data distributions. To simulate such scenarios, we split the MNIST dataset using a Dirichlet distribution. The Dirichlet distribution allows control over the degree of heterogeneity in the data assigned to each worker by adjusting the dirichlet-alpha parameter. Lower values of dirichlet-alpha result in more uneven class distributions across workers, simulating highly non-IID data.

Given C classes and M workers, the proportion p_{c,m} of class-c samples assigned to worker m is obtained by sampling from a Dirichlet distribution:

pc,1,pc,2,,pc,MDirichlet(α)p_{c,1},p_{c,2},\ldots,p_{c,M}\sim\text{Dirichlet}(\alpha)

where α\alpha is the concentration parameter controlling the level of heterogeneity. A smaller α\alpha value results in a more imbalanced distribution, meaning that each worker primarily receives data from a limited subset of classes. In this experiment, we set α=0.1\alpha=0.1 to induce high heterogeneity.
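For concreteness, the following Python sketch shows one simple way to implement the Dirichlet-based split described above; the function name and the exact assignment rule (shuffling each class and cutting it according to the sampled proportions) are our own illustrative choices, and not necessarily the exact procedure used in our experiments.

# Illustrative Dirichlet-based partition of labeled data across workers.
import numpy as np

def dirichlet_partition(labels, num_workers, alpha=0.1, seed=0):
    """Split sample indices across workers, class by class, via Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    workers = [[] for _ in range(num_workers)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(num_workers))    # p_{c,1},...,p_{c,M} ~ Dirichlet(alpha)
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)  # split points proportional to p
        for m, part in enumerate(np.split(idx, cuts)):
            workers[m].extend(part.tolist())
    return [np.array(w) for w in workers]

# Example usage with synthetic labels standing in for MNIST's 10 classes:
labels = np.random.default_rng(0).integers(0, 10, size=60_000)
parts = dirichlet_partition(labels, num_workers=16, alpha=0.1)
print(sorted(len(p) for p in parts))   # uneven shard sizes reflect the induced heterogeneity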

The following figures present scatter plots illustrating the class frequencies assigned to individual workers for different worker configurations—16, 32, and 64—using a specific random seed.

Figure 2 (scatter plots for (a) 16 Workers, (b) 32 Workers, (c) 64 Workers): Class distribution across workers for different numbers of workers (16, 32, and 64) on the MNIST dataset. The dataset was partitioned using a Dirichlet distribution, with the dirichlet-alpha parameter set to 0.1 to induce high heterogeneity. Each scatter plot illustrates class frequencies for each worker.

M.2 Complete Experimental Results

This section presents the complete experimental results for 16, 32, and 64 workers, showing test accuracy and test loss as functions of local iterations KK. The plots illustrate the method’s scalability and performance, with higher accuracy (\uparrow) and lower loss (\downarrow) indicating better outcomes.

Figure 3: Test Accuracy vs. Local Iterations (K) for 16 workers (\uparrow is better).
Figure 4: Test Accuracy vs. Local Iterations (K) for 32 workers (\uparrow is better).
Figure 5: Test Accuracy vs. Local Iterations (K) for 64 workers (\uparrow is better).
Figure 6: Test Loss vs. Local Iterations (K) for 16 workers (\downarrow is better).
Figure 7: Test Loss vs. Local Iterations (K) for 32 workers (\downarrow is better).
Figure 8: Test Loss vs. Local Iterations (K) for 64 workers (\downarrow is better).