
Stochastic Alternating Direction Method of Multipliers
for Byzantine-Robust Distributed Learning

Feng Lin1, Weiyu Li2, Qing Ling1,3,4
1. Sun Yat-Sen University
2. University of Science and Technology of China
3. Guangdong Province Key Laboratory of Computational Science
4. Pazhou Lab
Corresponding Author: Qing Ling
E-mail address: [email protected]
Abstract

This paper aims to solve a distributed learning problem under Byzantine attacks. In the underlying distributed system, a number of unknown but malicious workers (termed Byzantine workers) can send arbitrary messages to the master and bias the learning process, due to data corruptions, computation errors or malicious attacks. Prior work has considered a total variation (TV) norm-penalized approximation formulation to handle the Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. To solve the TV norm-penalized approximation formulation, we propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that fully utilizes the separable problem structure. Theoretically, we prove that the proposed method converges to a bounded neighborhood of the optimal solution at a rate of $O(1/k)$ under mild assumptions, where $k$ is the number of iterations and the size of the neighborhood is determined by the number of Byzantine workers. Numerical experiments on the MNIST and COVERTYPE datasets demonstrate the effectiveness of the proposed method against various Byzantine attacks.

keywords:
Distributed machine learning, alternating direction method of multipliers (ADMM), Byzantine attacks

1 Introduction

Most traditional machine learning algorithms require collecting the training data from their owners to a single computer or data center, which is not only communication-inefficient but also vulnerable to privacy leakage [1, 2, 3]. With the explosive growth of big data, federated learning has been proposed as a novel privacy-preserving distributed machine learning scheme, and has received extensive research interest recently [4, 5, 6]. In federated learning, the training data are stored at distributed workers, and the workers compute their local variables using local training data, under the coordination of a master. This scheme effectively reduces the risk of data leakage and protects privacy.

However, federated learning still faces significant security challenges. Some of the distributed workers, whose identities are unknown, could be unreliable and send wrong or even malicious messages to the master, due to data corruptions, computation errors or malicious attacks. To characterize the worst-case scenario, we adopt the Byzantine failure model, in which the Byzantine workers are aware of all information of the other workers, and are able to send arbitrary messages to the master [7, 8, 9, 10]. In this paper, we aim at solving the distributed learning problem under Byzantine attacks that potentially threaten federated learning applications.

Related works. With the rapid popularization of federated learning, Byzantine-robust distributed learning has become an attractive research topic in recent years. Most of the existing algorithms modify distributed stochastic gradient descent (SGD) into Byzantine-robust variants. In the standard distributed SGD, at every iteration, all the workers send their local stochastic gradients to the master, while the master averages all the received stochastic gradients and updates the optimization variable. When Byzantine workers are present, they can send faulty values other than true stochastic gradients to the master so as to bias the learning process. It has been shown that the standard distributed SGD with mean aggregation is vulnerable to Byzantine attacks [11].

When the training data are independently and identically distributed (i.i.d.) at the workers, the stochastic gradients of the regular workers are i.i.d. too. This fact motivates two mainstream methods to deal with Byzantine attacks: attack detection and robust aggregation. For attack detection, [12] and [13] propose to offline train an autoencoder, which is used to online calculate credit scores of the workers. The messages sent by the workers with lower credit scores are discarded in the mean aggregation. The robust subgradient push algorithm in [14] operates over a decentralized network; each worker calculates a score for each of its neighbors, and isolates those with lower scores. The works of [15, 16] detect the Byzantine workers with historic gradients so as to ensure robustness. The work of [17] uses redundant gradients for attack detection, but it requires overlapped data samples on multiple workers and does not fit the federated learning setting. For robust aggregation, the master can use the geometric median, instead of the mean, to aggregate the received messages [18, 19, 20]. When the number of Byzantine workers is less than the number of regular workers, the geometric median provides a reliable approximation to the average of the regular workers' stochastic gradients. Other similar robust aggregation rules include the marginal trimmed mean and the dimensional median [21, 22, 23]. Some aggregation rules select a representative stochastic gradient from all the received ones to update the global variable, e.g., Medoid [19] and Krum [11]. Medoid selects the stochastic gradient with the smallest distance from all the others, while Krum selects the one with the smallest squared distance to a fixed number of nearest stochastic gradients. An extension of Krum, termed $h$-Krum, selects $h$ stochastic gradients with Krum and uses their average. Bulyan [24] first selects a number of stochastic gradients with Krum or other robust selection/aggregation rules, and then uses their trimmed dimensional median.
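For illustration only, the dimensional median and the Krum selection rule described above can be sketched in a few lines of Python (this sketch and its function names are ours, not code from the cited works):

import numpy as np

def dimensional_median(grads):
    # grads: (m, d) array, one stochastic gradient per worker
    return np.median(grads, axis=0)

def krum(grads, q):
    # select the gradient with the smallest sum of squared distances to its
    # m - q - 2 nearest neighbors, where q is the assumed number of Byzantine workers
    m = grads.shape[0]
    dists = np.sum((grads[:, None, :] - grads[None, :, :]) ** 2, axis=2)
    scores = [np.sort(np.delete(dists[i], i))[: m - q - 2].sum() for i in range(m)]
    return grads[int(np.argmin(scores))]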

When the training data and the stochastic gradients are non-i.i.d. at the workers, which is common in federated learning applications [25], naive robust aggregation of stochastic gradients no longer works. The works of [26, 27] adopt a resampling strategy to alleviate the effect caused by non-i.i.d. training data. With a larger resampling parameter, the algorithms can handle higher data heterogeneity, at the cost of tolerating fewer Byzantine workers. Robust stochastic aggregation (RSA) aggregates local variables, instead of stochastic gradients [28]. To be specific, it considers a total variation (TV) norm-penalized approximation formulation to handle Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. Although the stochastic subgradient method proposed in [28] is able to solve the TV norm-penalized approximation formulation, it ignores the separable problem structure.

Other related works include [29, 30, 31, 32], which show that the stochastic gradient noise affects the effectiveness of robust aggregation rules. Thus, the robustness of Byzantine-resilient methods can be improved by reducing the variance of the stochastic gradients. Asynchronous Byzantine-robust SGD is considered in [33, 34, 35]. The work of [36] addresses saddle-point attacks in the non-convex setting, and [37, 38, 39, 40] consider Byzantine robustness in decentralized learning.

Our contributions. Our contributions are three-fold.

(i) We propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that utilizes the separable problem structure of the TV norm-penalized approximation formulation. The stochastic ADMM updates are further simplified, such that the iteration-wise communication and computation costs are the same as those of the stochastic subgradient method.

(ii) We theoretically prove that the proposed stochastic ADMM converges to a bounded neighborhood of the optimal solution at a rate of $O(1/k)$ under mild assumptions, where $k$ is the number of iterations and the size of the neighborhood is determined by the number of Byzantine workers.

(iii) We conduct numerical experiments on the MNIST and COVERTYPE datasets to demonstrate the effectiveness of the proposed stochastic ADMM against various Byzantine attacks.

2 Problem Formulation

Let us consider a distributed network with a master and $m$ workers, among which $q$ workers are Byzantine and the other $r=m-q$ workers are regular. The exact value of $q$ and the identities of the Byzantine workers are all unknown. We are interested in solving a stochastic optimization problem in the form of

$$\min_{\tilde{x}}~\sum^{m}_{i=1}\mathbb{E}[F(\tilde{x},\xi_{i})]+f_{0}(\tilde{x}), \qquad (1)$$

where $\tilde{x}\in\mathbb{R}^{d}$ is the optimization variable, $f_{0}(\tilde{x})$ is the regularization term known to the master, and $F(\tilde{x},\xi_{i})$ is the loss function of worker $i$ with respect to a random variable $\xi_{i}\sim\mathcal{D}_{i}$. Here we assume that the data distributions $\mathcal{D}_{i}$ on the workers can be different, which is common in federated learning applications.

Define $\mathcal{R}$ and $\mathcal{B}$ as the sets of regular workers and Byzantine workers, respectively. We have $|\mathcal{B}|=q$ and $|\mathcal{R}|=r$. Because of the existence of Byzantine workers, directly solving (1) without distinguishing between regular and Byzantine workers is meaningless. A less ambitious alternative is to minimize the summation of the regular workers' local expected cost functions plus the regularization term, in the form of

$$\min_{\tilde{x}}~\sum_{i\in\mathcal{R}}\mathbb{E}[F(\tilde{x},\xi_{i})]+f_{0}(\tilde{x}). \qquad (2)$$

Our proposed algorithm and RSA [28] both aggregate optimization variables, instead of stochastic gradients. To do so, denote $x_{i}$ as the local copy of $\tilde{x}$ at a regular worker $i\in\mathcal{R}$, and $x_{0}$ as the local copy at the master. Collecting the local copies in a vector $x=[x_{0};\cdots;x_{i};\cdots]\in\mathbb{R}^{(r+1)d}$, we know that (2) is equivalent to

$$\min_{x}~\sum_{i\in\mathcal{R}}\mathbb{E}[F(x_{i},\xi_{i})]+f_{0}(x_{0}),\quad\text{s.t.}\ x_{i}-x_{0}=0,~\forall i\in\mathcal{R}, \qquad (3)$$

where $x_{i}-x_{0}=0$, $\forall i\in\mathcal{R}$ are the consensus constraints to force the local copies to be the same.

RSA [28] considers a TV norm-penalized approximation formulation of (3), in the form of

$$\min_{x}~\sum_{i\in\mathcal{R}}\big(\mathbb{E}[F(x_{i},\xi_{i})]+\lambda\|x_{i}-x_{0}\|_{1}\big)+f_{0}(x_{0}), \qquad (4)$$

where $\lambda$ is a positive constant and $\sum_{i\in\mathcal{R}}\|x_{i}-x_{0}\|_{1}$ is the TV norm penalty for the constraints in (3). The TV norm penalty forces the regular workers' local optimization variables to be close to the master's, and meanwhile, tolerates the outliers when the Byzantine attackers are present. Due to the nonsmooth TV norm term, RSA solves (4) with the stochastic subgradient method. The updates of RSA, in the presence of Byzantine workers, are as follows. At time $k$, the master sends $x_{0}^{k}$ to the workers, every regular worker $i\in\mathcal{R}$ sends $x_{i}^{k}$ to the master, while every Byzantine worker $j\in\mathcal{B}$ sends an arbitrary malicious vector $u_{j}^{k}\in\mathbb{R}^{d}$ to the master. Then, the updates of $x^{k+1}_{i}$ for every regular worker $i$ and $x^{k+1}_{0}$ for the master are given by

$$x^{k+1}_{i}=x_{i}^{k}-\alpha^{k}\left(F^{\prime}(x_{i}^{k},\xi^{k}_{i})+\lambda\,\mathrm{sgn}(x_{i}^{k}-x_{0}^{k})\right),$$
$$x^{k+1}_{0}=x_{0}^{k}-\alpha^{k}\Big(f_{0}^{\prime}(x_{0}^{k})-\lambda\sum_{i\in\mathcal{R}}\mathrm{sgn}(x_{i}^{k}-x_{0}^{k})-\lambda\sum_{j\in\mathcal{B}}\mathrm{sgn}(u_{j}^{k}-x_{0}^{k})\Big), \qquad (5)$$

where $F^{\prime}(x_{i}^{k},\xi^{k}_{i})$ is a stochastic gradient at $x_{i}^{k}$ with respect to a random sample $\xi_{i}^{k}$ of regular worker $i$, $\mathrm{sgn}(\cdot)$ is the element-wise sign function ($\mathrm{sgn}(a)=1$ if $a>0$, $\mathrm{sgn}(a)=-1$ if $a<0$, and $\mathrm{sgn}(a)\in[-1,1]$ if $a=0$), and $\alpha^{k}$ is the diminishing learning rate at time $k$.
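To make the RSA updates (5) concrete, a minimal Python sketch of one iteration is given below; this sketch is ours, and the gradient oracles stoch_grad and reg_grad as well as the Byzantine message generator byzantine_msg are hypothetical placeholders:

import numpy as np

def rsa_step(x0, x_reg, k, lam, alpha, stoch_grad, reg_grad, byzantine_msg, byz_ids):
    # regular workers: subgradient step on the TV-penalized objective, cf. (5)
    x_reg_new = {i: xi - alpha * (stoch_grad(i, xi) + lam * np.sign(xi - x0))
                 for i, xi in x_reg.items()}
    # master: aggregate the signs of all received (possibly malicious) messages
    agg = sum(np.sign(xi - x0) for xi in x_reg.values())
    agg = agg + sum(np.sign(byzantine_msg(j, k) - x0) for j in byz_ids)
    x0_new = x0 - alpha * (reg_grad(x0) - lam * agg)
    return x0_new, x_reg_new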

Although RSA has been proven robust under Byzantine attacks [28], the sign functions therein enable the Byzantine workers to send slightly modified messages that remarkably bias the learning process. In addition, RSA fully ignores the special separable structure of the TV norm penalty. In this paper, we also consider the TV norm-penalized approximation formulation (4), propose a stochastic ADMM that utilizes the problem structure, and develop a novel Byzantine-robust algorithm.

3 Algorithm Development

In this section, we utilize the separable problem structure of (4) and propose a robust stochastic ADMM to solve it. The challenge is that the unknown Byzantine workers can send faulty messages during the optimization process. At this stage, we simply ignore the existence of Byzantine workers and develop an algorithm to solve (4). Then, we will consider the influence of Byzantine workers on the algorithm. We begin with applying the stochastic ADMM to solve (4), and then simplify the updates such that the iteration-wise communication and computation costs are the same as those of the stochastic subgradient method in [28].

Stochastic ADMM. Suppose that all the workers are regular such that $m=r$. To apply the stochastic ADMM, for every worker $i$, introduce auxiliary variables $z(0,i),z(i,0)\in\mathbb{R}^{d}$ on the directed edges $(0,i),(i,0)$, respectively. By introducing consensus constraints $z(i,0)=x_{0}$ and $z(0,i)=x_{i}$, (4) is equivalent to

$$\min_{x,z}~\sum_{i\in\mathcal{R}}\big(\mathbb{E}[F(x_{i},\xi_{i})]+\lambda\|z(0,i)-z(i,0)\|_{1}\big)+f_{0}(x_{0}),\quad\text{s.t.}\ z(i,0)-x_{0}=0,~z(0,i)-x_{i}=0,~\forall i\in\mathcal{R}. \qquad (6)$$

For the ease of presentation, we stack these auxiliary variables in a new variable $z\in\mathbb{R}^{2rd}$. As we will see below, the introduction of $z$ is to split the expectation term $\sum_{i\in\mathcal{R}}\mathbb{E}[F(x_{i},\xi_{i})]$ and the TV norm penalty term $\sum_{i\in\mathcal{R}}\|x_{i}-x_{0}\|_{1}$ so as to utilize the separable problem structure.

The augmented Lagrangian function of (6) is

$$\mathcal{L}_{\beta}(x,z,\eta)=\sum_{i\in\mathcal{R}}\big(\mathbb{E}[F(x_{i},\xi_{i})]+\lambda\|z(0,i)-z(i,0)\|_{1}\big)+f_{0}(x_{0})+\sum_{i\in\mathcal{R}}\left(\langle\eta(i,0),z(i,0)-x_{0}\rangle+\frac{\beta}{2}\|z(i,0)-x_{0}\|^{2}\right)+\sum_{i\in\mathcal{R}}\left(\langle\eta(0,i),z(0,i)-x_{i}\rangle+\frac{\beta}{2}\|z(0,i)-x_{i}\|^{2}\right), \qquad (7)$$

where $\beta$ is a positive constant, while $\eta(i,0)\in\mathbb{R}^{d}$ and $\eta(0,i)\in\mathbb{R}^{d}$ are the Lagrange multipliers attached to the consensus constraints $z(i,0)-x_{0}=0$ and $z(0,i)-x_{i}=0$, respectively. For convenience, we also collect all the Lagrange multipliers in a new variable $\eta\in\mathbb{R}^{2rd}$.

Given the augmented Lagrangian function (7), the vanilla ADMM works as follows. At time $k$, it first updates $x^{k+1}$ through minimizing the augmented Lagrangian function at $z=z^{k}$ and $\eta=\eta^{k}$, then updates $z^{k+1}$ through minimizing the augmented Lagrangian function at $x=x^{k+1}$ and $\eta=\eta^{k}$, and finally updates $\eta^{k+1}$ through dual gradient ascent. The updates are given by

$$x^{k+1}=\arg\min_{x}\mathcal{L}_{\beta}(x,z^{k},\eta^{k}), \qquad (8a)$$
$$z^{k+1}=\arg\min_{z}\mathcal{L}_{\beta}(x^{k+1},z,\eta^{k}), \qquad (8b)$$
$$\eta^{k+1}(i,0)=\eta^{k}(i,0)+\beta(z^{k+1}(i,0)-x_{0}^{k+1}),\quad\eta^{k+1}(0,i)=\eta^{k}(0,i)+\beta(z^{k+1}(0,i)-x_{i}^{k+1}). \qquad (8c)$$

However, the $x$-update in (8a) is an expectation minimization problem and hence nontrivial. To address this issue, [41] proposes to replace the augmented Lagrangian function $\mathcal{L}_{\beta}(x,z^{k},\eta^{k})$ with its stochastic counterpart, given by

$$\mathcal{L}_{\beta}^{k}(x,z,\eta)=\sum_{i\in\mathcal{R}}\lambda\|z(0,i)-z(i,0)\|_{1}+f_{0}(x_{0}^{k})+\langle f^{\prime}_{0}(x_{0}^{k}),x_{0}\rangle+\frac{\sigma^{k}}{2}\|x_{0}-x_{0}^{k}\|^{2}+\sum_{i\in\mathcal{R}}\left(F(x_{i}^{k},\xi_{i}^{k})+\langle F^{\prime}(x_{i}^{k},\xi_{i}^{k}),x_{i}\rangle+\frac{\sigma^{k}}{2}\|x_{i}-x_{i}^{k}\|^{2}\right)+\sum_{i\in\mathcal{R}}\left(\langle\eta(i,0),z(i,0)-x_{0}\rangle+\frac{\beta}{2}\|z(i,0)-x_{0}\|^{2}\right)+\sum_{i\in\mathcal{R}}\left(\langle\eta(0,i),z(0,i)-x_{i}\rangle+\frac{\beta}{2}\|z(0,i)-x_{i}\|^{2}\right), \qquad (9)$$

where $\xi_{i}^{k}$ is the random variable of worker $i$ at time $k$ and $\sigma^{k}$ is a positive stepsize. Observe that (9) is a stochastic first-order approximation to (7), in the sense that

$$\mathbb{E}[F(x_{i},\xi_{i})]\simeq F(x_{i}^{k},\xi_{i}^{k})+\langle F^{\prime}(x_{i}^{k},\xi_{i}^{k}),x_{i}\rangle+\frac{\sigma^{k}}{2}\|x_{i}-x_{i}^{k}\|^{2}\quad\text{and}\quad f_{0}(x_{0})\simeq f_{0}(x_{0}^{k})+\langle f^{\prime}_{0}(x_{0}^{k}),x_{0}\rangle+\frac{\sigma^{k}}{2}\|x_{0}-x_{0}^{k}\|^{2},$$

at the points $x_{i}=x_{i}^{k}$ and $x_{0}=x_{0}^{k}$, respectively.
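For readability, we spell out the intermediate step (a routine derivation added here): minimizing (9) over $x$ with $z=z^{k}$ and $\eta=\eta^{k}$ decouples across $x_{i}$ and $x_{0}$, and the first-order optimality conditions read

$$0=F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\sigma^{k}\big(x_{i}^{k+1}-x_{i}^{k}\big)-\eta^{k}(0,i)+\beta\big(x_{i}^{k+1}-z^{k}(0,i)\big),$$
$$0=f_{0}^{\prime}(x_{0}^{k})+\sigma^{k}\big(x_{0}^{k+1}-x_{0}^{k}\big)-\sum_{i\in\mathcal{R}}\eta^{k}(i,0)+\beta\sum_{i\in\mathcal{R}}\big(x_{0}^{k+1}-z^{k}(i,0)\big).$$

Solving these linear equations for $x_{i}^{k+1}$ and $x_{0}^{k+1}$ (recall $m=r$ here) yields the updates below.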

With the stochastic approximation, the explicit solutions of $x_{i}^{k+1}$ and $x_{0}^{k+1}$ are

$$x^{k+1}_{i}=\frac{1}{\sigma^{k}+\beta}\left(\sigma^{k}x_{i}^{k}+\beta z^{k}(0,i)+\eta^{k}(0,i)-F^{\prime}(x_{i}^{k},\xi^{k}_{i})\right),$$
$$x^{k+1}_{0}=\frac{1}{\sigma^{k}+m\beta}\Big(\sigma^{k}x_{0}^{k}+\sum_{i\in\mathcal{R}}\big(\beta z^{k}(i,0)+\eta^{k}(i,0)\big)-f_{0}^{\prime}(x_{0}^{k})\Big). \qquad (10)$$

For simplicity, we replace the parameter $\frac{1}{\sigma^{k}+\beta}$ by $\alpha_{i}^{k}$ and $\frac{1}{\sigma^{k}+\sum_{i\in\mathcal{R}}\beta}$ by $\alpha_{0}^{k}$. Thus, (10) is equivalent to

$$x^{k+1}_{i}=x_{i}^{k}-\alpha_{i}^{k}\left(F^{\prime}(x_{i}^{k},\xi^{k}_{i})+\beta x_{i}^{k}-\beta z^{k}(0,i)-\eta^{k}(0,i)\right),$$
$$x^{k+1}_{0}=x_{0}^{k}-\alpha_{0}^{k}\Big(f_{0}^{\prime}(x_{0}^{k})+\sum_{i\in\mathcal{R}}\big(\beta x_{0}^{k}-\beta z^{k}(i,0)-\eta^{k}(i,0)\big)\Big). \qquad (11)$$

Simplification. Observe that the $z$-update in (8b) is also challenging as the variables $z(0,i)$ and $z(i,0)$ are coupled by the TV norm penalty term. Next, we will simplify the three-variable updates in (11), (8b) and (8c) to eliminate the $z$-update and obtain a more compact algorithm. Note that the decentralized deterministic ADMM can also be simplified to eliminate auxiliary variables [42]. However, we are considering the distributed stochastic ADMM, and the TV norm penalty term makes the simplification much more challenging.

Proposition 1 (Simplified stochastic ADMM).

Suppose $m=r$. The updates (11), (8b) and (8c) can be simplified as

$$x^{k+1}_{i}=x_{i}^{k}-\alpha_{i}^{k}\left(F^{\prime}(x_{i}^{k},\xi^{k}_{i})+2\eta^{k}_{i}-\eta^{k-1}_{i}\right), \qquad (12)$$
$$x^{k+1}_{0}=x_{0}^{k}-\alpha_{0}^{k}\bigg(f_{0}^{\prime}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big(2\eta^{k}_{i}-\eta^{k-1}_{i}\big)\bigg), \qquad (13)$$
$$\eta^{k+1}_{i}:=\eta^{k+1}(i,0)=-\eta^{k+1}(0,i)=\mathrm{proj}_{\lambda}\left(\eta^{k}_{i}+\frac{\beta}{2}\big(x^{k+1}_{i}-x^{k+1}_{0}\big)\right), \qquad (14)$$

where $\mathrm{proj}_{\lambda}(\cdot)$ is the projection operator that, for each dimension, maps any point in $\mathbb{R}$ onto $[-\lambda,\lambda]$.

Proof.

See A.  

Presence of Byzantine workers. Now we consider how the stochastic ADMM updates (12), (13) and (14) are implemented when the Byzantine workers are present. At time $k$, every regular worker $i\in\mathcal{R}$ updates $x_{i}^{k+1}$ with (12) and $\eta^{k+1}_{i}$ with (14), and then sends $\eta_{i}^{k+1}$ to the master. Meanwhile, every Byzantine worker $j\in\mathcal{B}$ can cheat the master by sending an arbitrary $\eta_{j}^{k+1}\in[-\lambda,\lambda]^{d}$; otherwise, the Byzantine worker $j$ can be directly detected and eliminated by the master. This amounts to every Byzantine worker $j\in\mathcal{B}$ following an update rule similar to (14), as

$$\eta^{k+1}_{j}=\mathrm{proj}_{\lambda}\left(\eta^{k}_{j}+\frac{\beta}{2}\big(u^{k+1}_{j}-x^{k+1}_{0}\big)\right), \qquad (15)$$

where $u^{k+1}_{j}\in\mathbb{R}^{d}$ is an arbitrary vector. After receiving the messages $\eta_{i}^{k+1}$ from the regular workers $i\in\mathcal{R}$ and $\eta_{j}^{k+1}$ from the Byzantine workers $j\in\mathcal{B}$, the master updates $x_{0}^{k+1}$ via

$$x^{k+1}_{0}=x_{0}^{k}-\alpha_{0}^{k}\Big(f_{0}^{\prime}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big(2\eta^{k}_{i}-\eta^{k-1}_{i}\big)-\sum_{j\in\mathcal{B}}\big(2\eta^{k}_{j}-\eta^{k-1}_{j}\big)\Big). \qquad (16)$$

The Byzantine-robust stochastic ADMM for distributed learning is outlined in Algorithm 1 and illustrated in Figure 1. Observe that the communication and computation costs are the same as those of the stochastic subgradient method in [28]. The only extra cost is that every worker $i$ must store the dual variable $\eta^{k}_{i}$.

Comparing the stochastic subgradient updates (5) with the stochastic ADMM updates (12), (13) and (14), we can observe a primal-dual connection. In the stochastic subgradient method, the workers upload primal variables $x_{i}^{k}$, while in the stochastic ADMM, the workers upload dual variables $\eta_{i}^{k+1}$. The stochastic subgradient method controls the influence of a malicious message through the sign function: no matter what the malicious message is, its modification of each dimension is either $-\lambda$ or $\lambda$ (or a value within $[-\lambda,\lambda]$ when the values of the malicious worker and the master are identical). The stochastic ADMM controls the influence of a malicious message through the projection function, so that the modification of each dimension by the malicious message is within $[-\lambda,\lambda]$.

Algorithm 1 Byzantine-Robust Stochastic ADMM

Master

Initialize $x_{0}^{0}$, $\eta^{-1}_{i}$, and $\eta^{0}_{i}$.

1: while not stopped do
2:     Update $x_{0}^{k+1}$ via (16);
3:     Broadcast $x_{0}^{k+1}$ to all the workers;
4:     Receive $\eta_{i}^{k+1}$ from the regular workers $i\in\mathcal{R}$ and $\eta_{j}^{k+1}$ from the Byzantine workers $j\in\mathcal{B}$.

Regular Worker $i$

Initialize $x^{0}_{i}$, $\eta^{-1}_{i}$, and $\eta^{0}_{i}$.

1: while not stopped do
2:     Update $x_{i}^{k+1}$ via (12);
3:     Update $\eta^{k+1}_{i}$ via (14);
4:     Send $\eta^{k+1}_{i}$ to the master;
5:     Receive $x_{0}^{k+1}$ from the master.
Figure 1: An illustration of the distributed stochastic ADMM for Byzantine-robust learning. In the master-worker architecture, there are $m$ workers in total, among which $r$ are regular workers and $q=m-r$ are Byzantine workers.
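For illustration only, the per-iteration updates in Algorithm 1 can be sketched in Python as follows; this sketch is ours (not a released implementation), and the gradient oracles stoch_grad and reg_grad are hypothetical placeholders:

import numpy as np

def proj(v, lam):
    # element-wise projection onto [-lam, lam], i.e., proj_lambda in (14)
    return np.clip(v, -lam, lam)

def regular_worker_step(xi, eta, eta_prev, x0, lam, beta, alpha_i, stoch_grad):
    # x0: the latest master variable broadcast by the master
    xi_new = xi - alpha_i * (stoch_grad(xi) + 2.0 * eta - eta_prev)   # update (12)
    eta_new = proj(eta + 0.5 * beta * (xi_new - x0), lam)             # update (14)
    return xi_new, eta_new  # only eta_new is sent to the master

def master_step(x0, etas, etas_prev, alpha_0, reg_grad):
    # etas / etas_prev hold eta^k and eta^{k-1} from all workers, regular or
    # Byzantine; the master cannot tell them apart and simply applies (16)
    correction = sum(2.0 * e - ep for e, ep in zip(etas, etas_prev))
    return x0 - alpha_0 * (reg_grad(x0) - correction)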

4 Convergence Analysis

In this section, we analyze the convergence of the proposed Byzantine-robust stochastic ADMM. We make the following assumptions, which are common in analyzing distributed stochastic optimization algorithms.

Assumption 1 (Strong convexity).

The local cost functions $\mathbb{E}[F(x_{i},\xi_{i})]$ and the regularization term $f_{0}(x_{0})$ are strongly convex with constants $\mu_{i}$ and $\mu_{0}$, respectively.

Assumption 2 (Lipschitz continuous gradients).

The local cost functions $\mathbb{E}[F(x_{i},\xi_{i})]$ and the regularization term $f_{0}(x_{0})$ have Lipschitz continuous gradients with constants $L_{i}$ and $L_{0}$, respectively.

Assumption 3 (Bounded variance).

Within every worker $i$, the data sampling is i.i.d. with $\xi^{k}_{i}\sim\mathcal{D}_{i}$. The variance of the stochastic gradients $F^{\prime}(x,\xi_{i})$ is upper bounded by $\delta^{2}_{i}$, as

$$\mathbb{E}\|F^{\prime}(x,\xi_{i})-\mathbb{E}[F^{\prime}(x,\xi_{i})]\|^{2}\leq\delta^{2}_{i}. \qquad (17)$$

4.1 Main Results

First, we show the equivalence between (2) and (6). When the penalty parameter $\lambda$ is sufficiently large, it has been shown in Theorem 1 of [28] that the optimal primal variables of (6) are consensual and identical to the minimizer of (2). We repeat this conclusion in the following lemma.

Lemma 1 (Consensus and equivalence).

Suppose Assumptions 1 and 2 hold. If $\lambda\geq\lambda_{0}:=\max_{i\in\mathcal{R}}\|\mathbb{E}[F^{\prime}(\tilde{x}^{*},\xi_{i})]\|_{\infty}$, then for all $i\in\mathcal{R}$, we have

$$x_{i}^{*}=x_{0}^{*}=\tilde{x}^{*},$$

where $x_{i}^{*}$ and $x_{0}^{*}$ are the optimal primal variables of (6), and $\tilde{x}^{*}$ is the minimizer of (2).

Intuitively, setting a sufficiently large penalty parameter $\lambda$ ensures that the variables $x_{i}$ and $x_{0}$ are consensual, since a larger $\lambda$ puts more weight on the consensus constraints. When the training data at the workers are non-i.i.d., the local expected gradients $\mathbb{E}[F^{\prime}(\tilde{x}^{*},\xi_{i})]$ deviate from $0$, which leads to a large lower bound $\lambda_{0}$ to maintain consensus. Once the variables are consensual, (6) is equivalent to (2).

Now, we present the main theorem on the convergence of the proposed Byzantine-robust stochastic ADMM.

Theorem 1 ($O(1/k)$-convergence).

Suppose Assumptions 1, 2, and 3 hold. Let $\lambda\geq\lambda_{0}$ and the stepsizes be

$$\alpha_{0}^{k}=\min\left\{\frac{1}{ck+m\beta},\frac{1}{\mu_{0}+L_{0}},\frac{1}{\mu_{i}+L_{i}}\right\},\quad\alpha_{i}^{k}=\min\left\{\frac{1}{ck+\beta},\frac{1}{\mu_{0}+L_{0}},\frac{1}{\mu_{i}+L_{i}}\right\},\quad\forall i\in\mathcal{R},$$

for some positive constants $c<\min\left\{\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}},\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}:i\in\mathcal{R}\right\}$ and $\beta>0$. Then we have

$$\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\sum_{i\in\mathcal{R}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}=O(1/k)+O(\lambda^{2}q^{2}). \qquad (18)$$
Proof.

See C.  

Theorem 1 guarantees that if we choose the stepsizes for both the workers and the master in the order of $O(1/k)$, then the Byzantine-robust stochastic ADMM asymptotically approaches the $O(\lambda^{2}q^{2})$ neighborhood of the optimal solution $\tilde{x}^{*}$ of (2) (which equals $x_{0}^{*}$ and $x_{i}^{*}$, according to Lemma 1) at an $O(1/k)$ rate. Note that the $O(1/k)$ stepsizes are sensitive to their initial values [28]. Therefore, we use $O(1/\sqrt{k})$ stepsizes in the numerical experiments. We also provide in D an ergodic convergence rate of $O(\log k/\sqrt{k})$ with $O(1/\sqrt{k})$ stepsizes.

In (18), the asymptotic learning error is in the order of $O(\lambda^{2}q^{2})$, which is the same as that of RSA [28]. When more Byzantine workers are present, $q$ is larger and the asymptotic learning error increases. Using a larger $\lambda$ helps consensus as indicated in Lemma 1, but incurs a higher asymptotic learning error. In the numerical experiments, we will empirically demonstrate the influence of $q$ and $\lambda$.
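As an illustration of the $O(1/\sqrt{k})$ schedules used later in Section 5, the following small sketch is ours; the constants mirror the reported MNIST setting under Gaussian attack with $m=20$ and $\beta=0.5$, and the reading that the constant terms correspond to $m\beta$ and $\beta$ is our observation, not a statement from the paper:

import math

def alpha_0(k, beta=0.5, m=20, scale=10.0):
    # master stepsize ~ 1 / (m*beta + scale*sqrt(k)), e.g., 1/(10 + 10*sqrt(k))
    return 1.0 / (m * beta + scale * math.sqrt(k))

def alpha_i(k, beta=0.5, scale=10.0):
    # worker stepsize ~ 1 / (beta + scale*sqrt(k)), e.g., 1/(0.5 + 10*sqrt(k))
    return 1.0 / (beta + scale * math.sqrt(k))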

4.2 Comparison with RSA: Case Studies

The proposed Byzantine-robust stochastic ADMM and RSA [28] solve the same problem, while the former takes advantage of the separable problem structure. Below we briefly discuss the robustness of the two algorithms to different Byzantine attacks.

RSA is relatively sensitive to small perturbations. To perturb the update of $x_{0}^{k+1}$ in (5), Byzantine worker $j$ can generate a malicious $u_{j}^{k}$ that is very close to $x_{0}^{k}$, yet its influence on each dimension is still $\lambda$ or $-\lambda$. Potentially, this attack is able to lead the update to move toward a given wrong direction. In contrast, for the Byzantine-robust stochastic ADMM, small perturbations on $\eta_{j}^{k}$ change little in the update of $x_{0}^{k+1}$ in (16). To effectively attack the Byzantine-robust stochastic ADMM, Byzantine worker $j$ can set each element of $\eta_{j}^{k}$ to $\lambda(-1)^{k}$; then its influence on each dimension oscillates between $3\lambda$ and $-3\lambda$. In comparison, the influence of this attack on RSA is just $\lambda$ or $-\lambda$ on each dimension. However, such large oscillations are easy for the master to detect by screening the received messages. In addition, it is nontrivial for this attack to lead the update toward a given wrong direction.

Developing the Byzantine attacks that are most harmful to the Byzantine-robust stochastic ADMM and RSA, respectively, is beyond the scope of this paper. Instead, we give a toy example and develop two Byzantine attacks to justify the discussions above.

Example 1.

Consider a one-dimensional distributed machine learning task with $r=2$ regular workers (numbered $1$ and $2$) and $q=1$ Byzantine worker (numbered $3$). The functions are deterministic and quadratic, with $f_{0}(x_{0})=x_{0}^{2}/2$, $F_{1}(x_{1},\xi_{1})=(x_{1}-1)^{2}/4$, and $F_{2}(x_{2},\xi_{2})=(x_{2}-1)^{2}/4$. Therefore, $\tilde{x}^{*}=1/2$ is the minimizer of (2) and $\lambda_{0}=1/4$ by Lemma 1. The local primal variables are initialized as their local optima, i.e., $x_{0}^{0}=0$ and $x_{1}^{0}=x_{2}^{0}=1$ for both algorithms. The local dual variables of the Byzantine-robust stochastic ADMM are initialized as $\eta^{-1}_{i}=\eta_{i}^{0}=0$ for $i\in\{1,2,3\}$. We construct two simple attacks.

Small value attack. Byzantine worker $3$ generates $u_{3}^{k}=x_{0}^{k}-\frac{\epsilon}{\max\{k(k+1),1\}}$, where $\epsilon>0$ is a perturbation parameter.

Large value attack. Byzantine worker $3$ generates $u_{3}^{k}=x_{0}^{k}-\frac{4\lambda}{\beta}(-1)^{k}$.

We choose the parameters $\lambda=2\lambda_{0}=1/2$ and $\beta=1$, with stepsizes $\alpha_{0}^{k}=\frac{1}{k/8+3}$ and $\alpha_{1}^{k}=\alpha_{2}^{k}=\frac{1}{k/8+1}$. The perturbation parameter is set as $\epsilon=1/2$ for the small value attack. Figure 2 shows the values of the local primal variables on the master and the regular workers. For both algorithms and both attacks, the master and the regular workers are able to asymptotically reach consensus, as asserted by Lemma 1. Under the small value attack, RSA has a larger asymptotic learning error than the Byzantine-robust stochastic ADMM, as we have discussed, while under the large value attack, both algorithms coincidentally have zero asymptotic learning error. In addition, we can observe that the Byzantine-robust stochastic ADMM is more stable than RSA under both attacks.
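In code, the two attacks of Example 1 amount to the following message generators for Byzantine worker 3 (a sketch of ours, with the parameters chosen above; the function names are hypothetical):

def small_value_attack(x0_k, k, eps=0.5):
    # u_3^k = x_0^k - eps / max(k(k+1), 1): a vanishing perturbation below x_0^k
    return x0_k - eps / max(k * (k + 1), 1)

def large_value_attack(x0_k, k, lam=0.5, beta=1.0):
    # u_3^k = x_0^k - (4*lam/beta) * (-1)^k: an oscillation of amplitude 4*lam/beta
    return x0_k - (4.0 * lam / beta) * (-1) ** k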

Figure 2: Values of the local primal variables on the master ‘(m)’ and the regular workers ‘(w)’. (a) Small value attack; (b) large value attack.

5 Numerical Experiments

In this section, we evaluate the robustness of the proposed algorithm to various Byzantine attacks. We compare the proposed Byzantine-robust Stochastic ADMM with the following benchmark algorithms: (i) Ideal SGD without Byzantine attacks; (ii) SGD subject to Byzantine attacks; (iii) Geometric median stochastic gradient aggregation [18, 19]; (iv) Median stochastic gradient aggregation [18, 19]; (v) RSA [28]. All the parameters of the benchmark algorithms are hand-tuned to the best. Although the stochastic ADMM and RSA are rooted in the same problem formulation (4), they perform differently for the same value of $\lambda$ under Byzantine attacks, as we have observed in Example 1. Therefore, we hand-tune the best $\lambda$ for the stochastic ADMM and RSA, respectively. In the numerical experiments, we use two datasets, MNIST (http://yann.lecun.com/exdb/mnist) and COVERTYPE (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). The statistics of these datasets are shown in Table 1. We launch one master and 20 workers. In the i.i.d. case, we conduct experiments on both datasets by randomly and evenly splitting the data samples to the workers, while in the non-i.i.d. case we only use the MNIST dataset. Each regular worker uses a mini-batch of 32 samples to estimate the local gradient at each iteration. The loss functions $f_{i}(\tilde{x})$ of the workers are softmax regressions, and the regularization term is given by $f_{0}(\tilde{x})=\frac{0.01}{2}\|\tilde{x}\|^{2}$. Performance is evaluated by the top-1 accuracy.

Name Training Samples Testing Samples Attributes
COVERTYPE 465264 115748 54
MNIST 60000 10000 784
Table 1: Specifications of the datasets.
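As a concrete illustration of the local losses used here, the following sketch (ours; the array shapes and names are assumptions, not the authors' code) computes a regular worker's mini-batch softmax-regression loss and gradient, together with the gradient of the regularizer $f_0$:

import numpy as np

def softmax_loss_and_grad(X, A, y):
    # X: (features, classes) parameter matrix; A: (batch, features) samples;
    # y: (batch,) integer labels of the mini-batch
    logits = A @ X
    logits -= logits.max(axis=1, keepdims=True)       # for numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                  # predicted class probabilities
    n = A.shape[0]
    loss = -np.mean(np.log(P[np.arange(n), y] + 1e-12))
    P[np.arange(n), y] -= 1.0
    grad = A.T @ P / n                                 # mini-batch gradient in X
    return loss, grad

def f0_grad(x, rho=0.01):
    # gradient of the regularizer f_0(x) = (rho / 2) * ||x||^2
    return rho * x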

Gaussian attack. Under the Gaussian attack, at every iteration every Byzantine worker sends to the master a random vector whose elements follow the Gaussian distribution with standard deviation $100$. Here we set the number of Byzantine workers to $q=8$. For Stochastic ADMM on the MNIST dataset, we set $\lambda=0.5$, $\beta=0.5$, $\alpha_{0}^{k}=\frac{1}{10+10\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.5+10\sqrt{k}}$. As shown in Figure 3(a), SGD fails, while Stochastic ADMM, RSA and Geometric median perform very similarly and are close to Ideal SGD; Median is a little worse than the others. On the COVERTYPE dataset, we set $\lambda=0.5$, $\beta=0.1$, $\alpha_{0}^{k}=\frac{1}{2+50\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.1+50\sqrt{k}}$ for Stochastic ADMM. As shown in Figure 3(b), SGD performs the worst, while Stochastic ADMM, Geometric median, and Median are close to Ideal SGD. Among all the Byzantine-robust algorithms, Stochastic ADMM has the fastest convergence speed.
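A sketch of this attack (ours) simply draws the malicious message from a zero-mean Gaussian with standard deviation 100:

import numpy as np

def gaussian_attack_msg(d, std=100.0, rng=None):
    # each Byzantine worker sends a length-d random vector with std 100
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, std, size=d)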

Figure 3: Top-1 accuracy under Gaussian attack when $q=8$. (a) MNIST; (b) COVERTYPE.

Sign-flipping attack. Under the sign-flipping attack, at every iteration every Byzantine worker calculates its local variable, flips the sign by multiplying with a constant $\varepsilon<0$, and sends the result to the master. Here we set $\varepsilon=-3$ and the number of Byzantine workers to $q=8$. On the MNIST dataset, we set the parameters of Stochastic ADMM as $\lambda=0.05$, $\beta=0.1$, $\alpha_{0}^{k}=\frac{1}{2+5\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.1+5\sqrt{k}}$. Figure 4(a) shows that SGD also fails in this situation. Stochastic ADMM, RSA, and Geometric median are close to Ideal SGD, and achieve better accuracy than Median. Figure 4(b) shows the performance on the COVERTYPE dataset. The parameters of Stochastic ADMM are $\lambda=0.5$, $\beta=0.3$, $\alpha_{0}^{k}=\frac{1}{6+100\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.3+100\sqrt{k}}$. Stochastic ADMM and RSA are close to Ideal SGD, and outperform Geometric median and Median.
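In code, the sign-flipping attack scales the message an honest worker would have sent (a sketch of ours; the honest message and the value of epsilon follow the description above):

def sign_flipping_msg(honest_msg, eps=-3.0):
    # the Byzantine worker multiplies its would-be message by eps < 0
    return eps * honest_msg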

Figure 4: Top-1 accuracy under sign-flipping attack when $q=8$. (a) MNIST; (b) COVERTYPE.

Without Byzantine attack. We also investigate the case without Byzantine attacks on both the MNIST and COVERTYPE datasets, as shown in Figure 5. In Figure 5(a), Stochastic ADMM on the MNIST dataset uses the parameters $\lambda=0.5$, $\beta=0.5$, $\alpha_{0}^{k}=\frac{1}{10+10\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.5+10\sqrt{k}}$. Without Byzantine attacks, the performance of Stochastic ADMM, RSA, and Geometric median is very similar to that of Ideal SGD, while Median is worse than the other Byzantine-robust algorithms. On the COVERTYPE dataset, we set the parameters of Stochastic ADMM as $\lambda=0.5$, $\beta=0.3$, $\alpha_{0}^{k}=\frac{1}{6+10\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.3+10\sqrt{k}}$. As shown in Figure 5(b), Stochastic ADMM is the best among all the algorithms, and RSA outperforms Geometric median and Median. We conclude that although Stochastic ADMM introduces bias to the updates, it still works well in the attack-free case.

Figure 5: Top-1 accuracy without Byzantine attack. (a) MNIST; (b) COVERTYPE.

Impact of $\lambda$. Here we show how the performance of the proposed algorithm on the two datasets is affected by the choice of the penalty parameter $\lambda$. We use the sign-flipping attack with $\varepsilon=-3$ in the numerical experiments, and the number of Byzantine workers is $q=4$. The parameters $\beta$, $\alpha_{0}^{k}$ and $\alpha_{i}^{k}$ are hand-tuned to the best. As depicted in Figure 6, on both datasets, the performance of Stochastic ADMM degrades when $\lambda$ is too small. The reason is that, when $\lambda$ is too small, the regular workers rely more on their local data, leading to deficiency of the distributed learning system [28]. Meanwhile, a $\lambda$ that is too large also leads to worse performance.

Figure 6: Top-1 accuracy under sign-flipping attacks with different $\lambda$. (a) MNIST; (b) COVERTYPE.

Non-i.i.d. data. To demonstrate the robustness of the proposed algorithm against Byzantine attacks on non-i.i.d. data, we redistribute the MNIST dataset by letting every two workers share one digit. All Byzantine workers $j$ choose one regular worker indexed by $p$, and send $u_{j}^{k+1}=x_{p}^{k+1}$ to the master at every iteration $k$. When the number of Byzantine workers is $q=4$, the best reachable accuracy is around 0.8, because the data of two handwritten digits are absent. Similarly, when the number of Byzantine workers is $q=8$, the best reachable accuracy is around 0.6. When $q=4$, we set the parameters of Stochastic ADMM as $\lambda=0.8$, $\beta=0.2$, $\alpha_{0}^{k}=\frac{1}{4+10\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{0.2+10\sqrt{k}}$. As shown in Figure 7(a), Median fails, while Stochastic ADMM is close to Ideal SGD and outperforms all the other Byzantine-robust algorithms. When the number of Byzantine workers increases to $q=8$, we set $\lambda=0.5$, $\beta=3$, $\alpha_{0}^{k}=\frac{1}{60+500\sqrt{k}}$, and $\alpha_{i}^{k}=\frac{1}{3+500\sqrt{k}}$ for Stochastic ADMM. As depicted in Figure 7(b), Geometric median and Median fail because the stochastic gradients of regular worker $p$ dominate, such that only one digit can be recognized. Stochastic ADMM and RSA both work well, but Stochastic ADMM converges faster than RSA.
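The non-i.i.d. partition and the copy attack used here can be sketched as follows (ours; the pairing of workers to digits is one possible realization of "every two workers share one digit"):

import numpy as np

def partition_by_digit(labels, num_workers=20):
    # digit y (0..9) is split between workers 2y and 2y+1
    workers = [[] for _ in range(num_workers)]
    for idx, y in enumerate(labels):
        workers[2 * int(y) + idx % 2].append(idx)
    return workers

def copy_attack_msg(x_workers, p):
    # every Byzantine worker forwards regular worker p's local variable
    return np.array(x_workers[p], copy=True)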

Figure 7: Top-1 accuracy under non-i.i.d. data. (a) $q=4$; (b) $q=8$.

6 Conclusions

We proposed a stochastic ADMM to deal with the distributed learning problem under Byzantine attacks. We considered a TV norm-penalized approximation formulation to handle Byzantine attacks. Theoretically, we proved that the stochastic ADMM converges in expectation to a bounded neighborhood of the optimum at an $O(1/k)$ rate under mild assumptions. Numerically, we compared the proposed algorithm with other Byzantine-robust algorithms on two real datasets, showing the competitive performance of the Byzantine-robust stochastic ADMM.

Acknowledgement. Qing Ling is supported in part by NSF China Grant 61973324, Fundamental Research Funds for the Central Universities, and Guangdong Province Key Laboratory of Computational Science Grant 2020B1212060032. A preliminary version of this paper has appeared in IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 4–8, 2020.

References

  • [1] R. Agrawal and R. Srikant. “Privacy-preserving Data Mining,” Proceedings of ACM SIGMOD, 2000.
  • [2] S. Sicari, A. Rizzardi, L. Grieco, and A. Coen-Porisini, “Security, Privacy and Trust in Internet of Things: The Road Ahead,” Computer Networks, vol. 76, pp. 146–164, 2015.
  • [3] L. Zhou, K. Yeh, G. Hancke, Z. Liu, and C. Su, “Security and Privacy for the Industrial Internet of Things: An Overview of Approaches to Safeguarding Endpoints,” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 76–87, 2018.
  • [4] J. Konecny, H. McMahan, and D. Ramage, “Federated Optimization: Distributed Optimization Beyond the Datacenter,” arXiv: 1511.03575, 2015.
  • [5] J. Konecny, H. McMahan, F. Yu, P. Richtarik, A. Suresh, and D. Bacon, “Federated Learning: Strategies for Improving Communication Efficiency,” arXiv: 1610.05492, 2016.
  • [6] P. Kairouz and H. McMahan. “Advances and Open Problems in Federated Learning,” Foundations and Trends in Machine Learning, vol. 14, no. 1, 2021.
  • [7] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382–401, 1982.
  • [8] N. Lynch, Distributed Algorithms, Morgan Kaufmann Publishers, San Francisco, USA, 1996.
  • [9] A. Vempaty, L. Tong, and P. K. Varshney, “Distributed Inference with Byzantine Data: State-of-the-Art Review on Data Falsification Attacks.” IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 65–75, 2013.
  • [10] Y. Chen, S. Kar, and J. M. F. Moura, “The Internet of Things: Secure Distributed Inference.” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 64–75, 2018.
  • [11] P. Blanchard, E. M. E. Mhamdi, R. Guerraoui, and J. Stainer, “Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent,” Proceedings of NeurIPS, 2017.
  • [12] S. Li, Y. Cheng, W. Wang, Y. Liu, and T. Chen, “Abnormal Client Behavior Detection in Federated Learning,” arXiv: 1910.09933, 2019.
  • [13] S. Li, Y. Cheng, W. Wang, Y. Liu, and T. Chen, “Learning to Detect Malicious Clients for Robust Federated Learning,” arXiv: 2002.00211, 2020.
  • [14] N. Ravi and A. Scaglione, “Detection and Isolation of Adversaries in Decentralized Optimization for Non-Strongly Convex Objectives,” Proceedings of IFAC Workshop on Distributed Estimation and Control in Networked Systems, 2019.
  • [15] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine Stochastic Gradient Descent,” Proceedings of NeuIPS, 2018.
  • [16] Z. Allen-Zhu, F. Ebrahimianghazani, J. Li, and D. Alistarh, “Byzantine-Resilient Non-Convex Stochastic Gradient Descent,” Proceedings of ICLR, 2021.
  • [17] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient Distributed Training via Redundant Gradients,” Proceedings of ICML, 2018.
  • [18] Y. Chen, L. Su, and J. Xu, “Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2018.
  • [19] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine Tolerant SGD,” arXiv: 1802.10116, 2018.
  • [20] X. Cao and L. Lai, “Distributed Approximate Newton’s Method Robust to Byzantine Attackers,” IEEE Transactions on Signal Processing, vol. 68, pp. 6011–6025, 2020.
  • [21] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust Distributed Learning: Towards Optimal Statistical Rates,” Proceedings of ICML, 2018.
  • [22] C. Xie, S. Koyejo, and I. Gupta, “Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance,” Proceedings of ICML, 2019.
  • [23] C. Xie, O. Koyejo, and I. Gupta, “Phocas: Dimensional Byzantine-resilient Stochastic Gradient Descent,” arXiv: 1805.09682, 2018.
  • [24] E. M. E. Mhamdi, R. Guerraoui, and S. Rouault, “The Hidden Vulnerability of Distributed Learning in Byzantium,” Proceedings of ICML, 2018.
  • [25] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, “The Non-IID Data Quagmire of Decentralized Machine Learning,” Proceedings of ICML, 2020.
  • [26] L. He, S. P. Karimireddy, and M. Jaggi, “Byzantine-robust Learning on Heterogeneous Datasets via Resampling,” arXiv: 2006.09365, 2020.
  • [27] J. Peng, Z. Wu, Q. Ling, and T. Chen, “Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data,” arXiv: 2009.08161, 2020.
  • [28] L. Li, W. Xu, T. Chen, G. Giannakis, and Q. Ling, “RSA: Byzantine-robust Stochastic Aggregation Methods for Distributed Learning from Heterogeneous Datasets,” Proceedings of AAAI, 2019.
  • [29] Z. Wu, Q. Ling, T. Chen, and G. Giannakis, “Federated Variance-Reduced Stochastic Gradient Descent with Robustness to Byzantine Attacks,” IEEE Transactions on Signal Processing, vol. 68, pp. 4583–4596, 2020.
  • [30] E. M. E. Mhamdi, R. Guerraoui, and S. Rouault, “Distributed Momentum for Byzantine-resilient Learning,” Proceedings of ICLR, 2021.
  • [31] P. Khanduri, S. Bulusu, P. Sharma, and P. Varshney, “Byzantine Resilient Non-Convex SVRG with Distributed Batch Gradient Computations,” arXiv: 1912.04531, 2019.
  • [32] S. P. Karimireddy, L. He, and M. Jaggi, “Learning from History for Byzantine Robust Optimization,” Proceedings of ICML, 2021.
  • [33] G. Damaskinos, E. M. E. Mhamdi, R. Guerraoui, R. Patra, and M. Taziki, “Asynchronous Byzantine Machine Learning (the Case of SGD),” Proceedings of ICML, 2018.
  • [34] Y. Yang and W. Li, “BASGD: Buffered Asynchronous SGD for Byzantine Learning,” arXiv: 2003.00937, 2020.
  • [35] C. Xie, S. Koyejo, and I. Gupta, “Zeno++: Robust Fully Asynchronous SGD,” Proceedings of ICML, 2020.
  • [36] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning,” Proceedings of ICML, 2019.
  • [37] Z. Yang and W. U. Bajwa, “ByRDiE: Byzantine-Resilient Distributed Coordinate Descent for Decentralized Learning,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 4, pp. 611–627, 2019.
  • [38] Z. Yang and W. U. Bajwa, “BRIDGE: Byzantine-resilient Decentralized Gradient Descent,” arXiv: 1908.08098, 2019.
  • [39] S. Guo, T. Zhang, X. Xie, L. Ma, T. Xiang, and Y. Liu, “Towards Byzantine-resilient Learning in Decentralized Systems,” arXiv: 2002.08569, 2020
  • [40] J. Peng, W. Li, and Q. Ling, “Byzantine-robust Decentralized Stochastic Optimization over Static and Time-varying Networks,” Signal Processing, vol. 183, no. 108020, 2021.
  • [41] H. Ouyang, N. He, and A. Gray, “Stochastic ADMM for Nonsmooth Optimization,” arXiv: 1211.0632, 2012.
  • [42] W. Ben-Ameur, P. Bianchi, and J. Jakubowicz, “Robust Distributed Consensus Using Total Variation,” IEEE Transactions on Automatic Control, vol. 61, no. 6, pp. 1550–1564, 2016.
  • [43] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, Boston, USA, 2004.

Appendix A Proof of Proposition 1

The proof of Proposition 1 relies on the following Lemma.

Lemma 2.

Let $\{z_{1}^{*},z_{2}^{*}\}={\arg\min}_{z_{1},z_{2}\in\mathbb{R}}~\lambda|z_{1}-z_{2}|+\frac{1}{2}(z_{1}-a_{1})^{2}+\frac{1}{2}(z_{2}-a_{2})^{2}$, where $\lambda,a_{1},a_{2}\in\mathbb{R}$. Then

$$z_{1}^{*}+z_{2}^{*}=a_{1}+a_{2}\quad\text{and}\quad z_{1}^{*}-a_{1}=\mathrm{proj}_{\lambda}\left(\frac{a_{2}-a_{1}}{2}\right). \qquad (19)$$
Proof.

Note that $\{z_{1}^{*},z_{2}^{*}\}$ together with their difference $\Delta^{*}=z_{1}^{*}-z_{2}^{*}$ is also optimal to the bi-level minimization problem

$$\min_{\Delta\in\mathbb{R}}\ \min_{z_{1},z_{2}\in\mathbb{R}}~\lambda|z_{1}-z_{2}|+\frac{1}{2}(z_{1}-a_{1})^{2}+\frac{1}{2}(z_{2}-a_{2})^{2},\quad\text{s.t.}\ z_{1}-z_{2}=\Delta.$$

That is, we first optimize the constrained minimization problem with an artificially imposed constraint $z_{1}-z_{2}=\Delta$ over $\{z_{1},z_{2}\}$, and then optimize over $\Delta$.

For the inner-level constrained minimization problem, from its KKT (Karush-Kuhn-Tucker) conditions we know the minimizer is $\{z_{1}=\frac{a_{1}+a_{2}+\Delta}{2},z_{2}=\frac{a_{1}+a_{2}-\Delta}{2}\}$, and accordingly, the optimal value is $\lambda|\Delta|+\frac{(\Delta-(a_{1}-a_{2}))^{2}}{4}$. Therefore, for the outer-level unconstrained minimization problem, from its KKT conditions we know the minimizer can be written as $\Delta^{*}=(a_{1}-a_{2})-\mathrm{proj}_{2\lambda}(a_{1}-a_{2})$. Substituting this result into $\{z_{1}^{*}=\frac{a_{1}+a_{2}+\Delta^{*}}{2},z_{2}^{*}=\frac{a_{1}+a_{2}-\Delta^{*}}{2}\}$ yields $z_{1}^{*}=a_{1}+\mathrm{proj}_{\lambda}\left(\frac{a_{2}-a_{1}}{2}\right)$ and $z_{2}^{*}=a_{2}+\mathrm{proj}_{\lambda}\left(\frac{a_{1}-a_{2}}{2}\right)$. From them we obtain (19) and complete the proof.
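As a quick numerical sanity check of Lemma 2 (a sketch of ours, not part of the proof), one can minimize the objective by brute force over a grid and compare with the closed-form characterization (19):

import numpy as np

lam, a1, a2 = 0.3, 1.0, -0.5
grid = np.linspace(-3.0, 3.0, 1201)                  # candidate values for z1, z2
Z1, Z2 = np.meshgrid(grid, grid, indexing="ij")
obj = lam * np.abs(Z1 - Z2) + 0.5 * (Z1 - a1) ** 2 + 0.5 * (Z2 - a2) ** 2
i, j = np.unravel_index(np.argmin(obj), obj.shape)
z1_star, z2_star = grid[i], grid[j]

proj = lambda v, r: np.clip(v, -r, r)                # proj_r maps v onto [-r, r]
print("z1*+z2* =", z1_star + z2_star, "  a1+a2 =", a1 + a2)
print("z1*-a1  =", z1_star - a1, "  proj_lam((a2-a1)/2) =", proj((a2 - a1) / 2, lam))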

Now we begin to prove Proposition 1. Since (8b) is separable with respect to $i$, it is equivalent to

$$\big\{z^{k+1}(i,0),z^{k+1}(0,i)\big\}={\arg\min}_{z(i,0),z(0,i)}\ \lambda\|z(0,i)-z(i,0)\|_{1}+\frac{\beta}{2}\big\|z(i,0)-\big(x_{0}^{k+1}-\frac{1}{\beta}\eta^{k}(i,0)\big)\big\|^{2}+\frac{\beta}{2}\big\|z(0,i)-\big(x_{i}^{k+1}-\frac{1}{\beta}\eta^{k}(0,i)\big)\big\|^{2}. \qquad (20)$$

According to Lemma 2, (20) leads to

$$z^{k+1}(0,i)+z^{k+1}(i,0)=x_{i}^{k+1}+x_{0}^{k+1}-\frac{\eta^{k}(0,i)+\eta^{k}(i,0)}{\beta}, \qquad (21a)$$
$$z^{k+1}(i,0)-\big(x_{0}^{k+1}-\frac{1}{\beta}\eta^{k}(i,0)\big)=\mathrm{proj}_{\lambda/\beta}\left(\frac{\big(\eta^{k}(i,0)-\eta^{k}(0,i)\big)/\beta+x_{i}^{k+1}-x_{0}^{k+1}}{2}\right). \qquad (21b)$$

From (8c) and (21a) we have

$$\eta^{k+1}(0,i)+\eta^{k+1}(i,0)=0, \qquad (22)$$

which holds regardless of the initialization of $\eta$. If we further initialize $\eta^{0}(i,0)=-\eta^{0}(0,i)$, for simplicity we can define for any $k\geq 0$ that

$$\eta^{k}_{i}:=\eta^{k}(i,0)=-\eta^{k}(0,i).$$

With this notation, we rewrite (8c) as

$$\eta^{k+1}_{i}=\eta^{k}_{i}+\beta\big(z^{k+1}(i,0)-x_{0}^{k+1}\big)=\mathrm{proj}_{\lambda}\left(\eta^{k}_{i}+\frac{\beta}{2}\big(x^{k+1}_{i}-x^{k+1}_{0}\big)\right), \qquad (23)$$

where the last equality is from (21b).

Next we simplify the updates of $x_{i}^{k+1}$ and $x_{0}^{k+1}$ in (11). From (8c), we have

$$\beta x_{i}^{k}-\beta z^{k}(0,i)-\eta^{k}(0,i)=\eta^{k-1}(0,i)-\eta^{k}(0,i)-\eta^{k}(0,i)=2\eta^{k}_{i}-\eta^{k-1}_{i},$$

which simplifies (11) to

$$x^{k+1}_{i}=x_{i}^{k}-\alpha_{i}^{k}\left(F^{\prime}(x_{i}^{k},\xi^{k}_{i})+2\eta^{k}_{i}-\eta^{k-1}_{i}\right), \qquad (24)$$
$$x^{k+1}_{0}=x_{0}^{k}-\alpha_{0}^{k}\Big(f_{0}^{\prime}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big(2\eta^{k}_{i}-\eta^{k-1}_{i}\big)\Big). \qquad (25)$$

This completes the proof.  

Appendix B Supporting Lemmas

Lemma 3 (Optimality conditions of (6)).

The sufficient and necessary optimality conditions of (6) are

$$\begin{cases}\mathbb{E}[F^{\prime}(x_{i}^{*},\xi_{i})]=\eta^{*}(0,i),\quad f^{\prime}_{0}(x_{0}^{*})=\sum_{i\in\mathcal{R}}\eta^{*}(i,0),\\ \eta^{*}(0,i)=-\lambda g_{i}^{*},\quad\eta^{*}(i,0)=\lambda g_{i}^{*},\\ z^{*}(i,0)=x_{0}^{*},\quad z^{*}(0,i)=x_{i}^{*},\end{cases} \qquad (26)$$

for all $i\in\mathcal{R}$, where $g_{i}^{*}\in\mathrm{sgn}(z^{*}(0,i)-z^{*}(i,0))$. In particular, defining $\eta_{i}^{*}:=\eta^{*}(i,0)$, we have for all $i\in\mathcal{R}$ that

$$f^{\prime}_{0}(x_{0}^{*})=\sum_{i\in\mathcal{R}}\eta_{i}^{*},\quad\mathbb{E}[F^{\prime}(x_{i}^{*},\xi_{i})]=-\eta_{i}^{*},\quad\text{and}\quad\eta_{i}^{*}=\lambda g_{i}^{*}\in[-\lambda,\lambda]^{d}. \qquad (27)$$
Proof.

The KKT conditions of (6) are given by

$$\begin{cases}\mathbb{E}[F^{\prime}(x_{i}^{*},\xi_{i})]-\eta^{*}(0,i)=0,\quad f^{\prime}_{0}(x_{0}^{*})-\sum_{i\in\mathcal{R}}\eta^{*}(i,0)=0,\\ 0\in\lambda\partial_{z^{*}(0,i)}\|z^{*}(0,i)-z^{*}(i,0)\|_{1}+\eta^{*}(0,i),\quad 0\in\lambda\partial_{z^{*}(i,0)}\|z^{*}(0,i)-z^{*}(i,0)\|_{1}+\eta^{*}(i,0),\\ z^{*}(i,0)-x_{0}^{*}=0,\quad z^{*}(0,i)-x_{i}^{*}=0,\end{cases}$$

where $\partial_{z^{*}(0,i)}(\cdot)$ and $\partial_{z^{*}(i,0)}(\cdot)$ denote the sets of subgradients. Applying the definition of the element-wise sign function $\mathrm{sgn}(\cdot)$ yields (26). Further noticing $\eta_{i}^{*}:=\eta^{*}(i,0)=-\eta^{*}(0,i)$, we obtain (27).

Lemma 4 (Theorem 2.1.12, [43]).

With Assumptions 1 and 2, for any $x_{i}$ and $x_{0}$, it holds

$$\langle F^{\prime}_{i}(x_{i})-F^{\prime}_{i}(x_{i}^{*}),x_{i}-x_{i}^{*}\rangle\geq\frac{1}{\mu_{i}+L_{i}}\|F^{\prime}_{i}(x_{i})-F^{\prime}_{i}(x_{i}^{*})\|^{2}+\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\|x_{i}-x_{i}^{*}\|^{2}, \qquad (28)$$
$$\langle f^{\prime}_{0}(x_{0})-f^{\prime}_{0}(x_{0}^{*}),x_{0}-x_{0}^{*}\rangle\geq\frac{1}{\mu_{0}+L_{0}}\|f^{\prime}_{0}(x_{0})-f^{\prime}_{0}(x_{0}^{*})\|^{2}+\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\|x_{0}-x_{0}^{*}\|^{2}. \qquad (29)$$
Lemma 5.

With the simplified updates (12), (13), and (14), it holds for any regular worker $i\in\mathcal{R}$ that

$$\langle\eta_{i}^{k+1}-\eta_{i}^{*},x_{0}^{k+1}-x_{i}^{k+1}\rangle\leq-\frac{2}{\beta}\langle\eta_{i}^{k+1}-\eta_{i}^{*},\eta_{i}^{k+1}-\eta_{i}^{k}\rangle=\frac{1}{\beta}\left(\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}-\|\eta_{i}^{k+1}-\eta_{i}^{*}\|^{2}-\|\eta_{i}^{k+1}-\eta_{i}^{k}\|^{2}\right), \qquad (30)$$
$$\langle\eta_{i}^{k+1},x_{0}^{k+1}-x_{i}^{k+1}\rangle\leq-\frac{2}{\beta}\langle\eta_{i}^{k+1},\eta_{i}^{k+1}-\eta_{i}^{k}\rangle=\frac{1}{\beta}\left(\|\eta_{i}^{k}\|^{2}-\|\eta_{i}^{k+1}\|^{2}-\|\eta_{i}^{k+1}-\eta_{i}^{k}\|^{2}\right). \qquad (31)$$
Proof.

We first show the inequality in (30), and then modify it to prove (31). The relationships derived during the proof of Lemma 2 are useful here. Since $\eta^{k}_{i}:=\eta^{k}(i,0)=-\eta^{k}(0,i)$, the right-hand side of (21b) is exactly $\frac{1}{\beta}\eta_{i}^{k+1}$. Combining this fact with (21a) yields

$$z^{k+1}(i,0)=x_{0}^{k+1}+\frac{1}{\beta}\big(\eta_{i}^{k+1}-\eta_{i}^{k}\big),\quad z^{k+1}(0,i)=x_{i}^{k+1}-\frac{1}{\beta}\big(\eta_{i}^{k+1}-\eta_{i}^{k}\big). \qquad (32)$$

Recall that in (20), we minimize the function

$$\lambda\|z(0,i)-z(i,0)\|_{1}+\frac{\beta}{2}\Big\|z(i,0)-\Big(x_{0}^{k+1}-\frac{1}{\beta}\eta^{k}_{i}\Big)\Big\|^{2}+\frac{\beta}{2}\Big\|z(0,i)-\Big(x_{i}^{k+1}+\frac{1}{\beta}\eta^{k}_{i}\Big)\Big\|^{2},$$

with respect to $\big[z(i,0);z(0,i)\big]\in\mathbb{R}^{2d}$. From the first-order optimality condition, there exists a subgradient of $\lambda\|z(0,i)-z(i,0)\|_{1}$ at $\big[z^{k+1}(0,i);z^{k+1}(i,0)\big]$, denoted as $[\lambda g_{i}^{k+1};-\lambda g_{i}^{k+1}]\in[-\lambda,\lambda]^{2d}$, such that

$$0=\lambda g_{i}^{k+1}+\beta\Big(z^{k+1}(0,i)-\big(x_{i}^{k+1}+\frac{1}{\beta}\eta^{k}_{i}\big)\Big)=\lambda g_{i}^{k+1}-\eta_{i}^{k+1}. \qquad (33)$$

Applying the definition of subgradient of λz(0,i)z(i,0)1\lambda\|z(0,i)-z(i,0)\|_{1} at points [zk+1(0,i);zk+1(i,0)]\big{[}z^{k+1}(0,i);z^{k+1}(i,0)\big{]} and [z(0,i);z(i,0)]\big{[}z^{*}(0,i);z^{*}(i,0)\big{]} gives

\displaystyle\lambda\|z^{*}(0,i)-z^{*}(i,0)\|_{1}\geq~{}\lambda\|z^{k+1}(0,i)-z^{k+1}(i,0)\|_{1}
~{}+\big{\langle}\big{[}\lambda g_{i}^{k+1};-\lambda g_{i}^{k+1}\big{]},\big{[}z^{*}(0,i)-z^{k+1}(0,i);z^{*}(i,0)-z^{k+1}(i,0)\big{]}\big{\rangle}, (34)
\displaystyle\lambda\|z^{k+1}(0,i)-z^{k+1}(i,0)\|_{1}\geq~{}\lambda\|z^{*}(0,i)-z^{*}(i,0)\|_{1}
~{}+\big{\langle}\big{[}\lambda g_{i}^{*};-\lambda g_{i}^{*}\big{]},\big{[}z^{k+1}(0,i)-z^{*}(0,i);z^{k+1}(i,0)-z^{*}(i,0)\big{]}\big{\rangle}, (35)

where gi=ηiλg_{i}^{*}=\frac{\eta_{i}^{*}}{\lambda} is defined in Lemma 3. Summing up (34) and (35), we have

0\geq\big{\langle}\lambda g_{i}^{k+1}-\lambda g_{i}^{*},z^{k+1}(i,0)-z^{k+1}(0,i)\big{\rangle}=\big{\langle}\eta_{i}^{k+1}-\eta_{i}^{*},x_{0}^{k+1}-x_{i}^{k+1}+\frac{2}{\beta}\big{(}\eta_{i}^{k+1}-\eta_{i}^{k}\big{)}\big{\rangle},

where the last equality comes from (32) and (33). Rearranging the terms gives (30).

Further note that [λgi;λgi][\lambda g_{i}^{*};-\lambda g_{i}^{*}] in (35) can be replaced by any subgradient of λz(0,i)z(i,0)1\lambda\|z(0,i)-z(i,0)\|_{1} at [z(0,i);z(i,0)]\big{[}z^{*}(0,i);z^{*}(i,0)\big{]}. We hence replace it by [0;0][0;0], and then obtain (31).  
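
As a side remark on the proof above, the per-pair minimization in (20) admits a closed-form solution via element-wise soft-thresholding, which is consistent with the optimality condition (33). The Python sketch below is only our illustration of this observation; the function names and the variable pairing (z1 for z(i,0), z2 for z(0,i)) are our own notation.

# Illustrative closed-form solution of the per-pair z-subproblem in (20):
# minimize lambda*||z2 - z1||_1 + beta/2*||z1 - a||^2 + beta/2*||z2 - b||^2,
# with a = x0^{k+1} - eta_i^k / beta and b = x_i^{k+1} + eta_i^k / beta.
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def z_update(x0_new, xi_new, eta_i, lam, beta):
    a = x0_new - eta_i / beta
    b = xi_new + eta_i / beta
    diff = soft_threshold(b - a, 2.0 * lam / beta)   # optimal z2 - z1
    z1 = (a + b - diff) / 2.0                        # plays the role of z^{k+1}(i,0)
    z2 = (a + b + diff) / 2.0                        # plays the role of z^{k+1}(0,i)
    return z1, z2

z1, z2 = z_update(np.zeros(3), np.ones(3), 0.1 * np.ones(3), lam=0.5, beta=1.0)
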

Appendix C Proof of Theorem 1

Restatement of Theorem 1. Suppose Assumptions 1, 2, and 3 hold. Let \lambda\geq\lambda_{0} and the stepsizes be

α0k=min{1ck+mβ,A},αik=min{1ck+β,A},i,\alpha_{0}^{k}=\min\left\{\frac{1}{ck+m\beta},A\right\},\quad\alpha_{i}^{k}=\min\left\{\frac{1}{ck+\beta},A\right\},\ \forall i\in\mathcal{R},

for some positive constants c<min{μ0L0μ0+L0,μiLiμi+Li:i}c<\min\left\{\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}},\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}:i\in\mathcal{R}\right\}, β\beta, and Amin{1μ0+L0,1μi+Li:i}{A}\leq\min\left\{\frac{1}{\mu_{0}+L_{0}},\frac{1}{\mu_{i}+L_{i}}:i\in\mathcal{R}\right\}. Define constants

c0\displaystyle c_{0} =min{c+(m1)β,μ0L0μ0+L0,μiLiμi+Li:i},\displaystyle=\min\bigg{\{}c+(m-1)\beta,\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}},\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}:i\in\mathcal{R}\bigg{\}},
c1\displaystyle c_{1} =9d(μ0+L0)μ0L0λ2q2,\displaystyle=\frac{9d(\mu_{0}+L_{0})}{\mu_{0}L_{0}}\lambda^{2}q^{2},
c2\displaystyle c_{2} =[2(4r+3q)2+64r+16(m1)β(r2β+i(1μi+1Li))]λ2d+4iδi2.\displaystyle=\left[2(4r+3q)^{2}+64r+16(m-1)\beta\left(\frac{r}{2\beta}+\sum_{i}\bigg{(}\frac{1}{\mu_{i}}+\frac{1}{L_{i}}\bigg{)}\right)\right]\lambda^{2}d+4\sum_{i}\delta_{i}^{2}.

Also denote Vk=𝔼x0kx02+i(𝔼xikxi2+2αik1βηik1ηi2)V^{k}=\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\sum_{i\in\mathcal{R}}\big{(}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}+\frac{2\alpha_{i}^{k-1}}{\beta}\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}\big{)}. Then we have

Vk+1{Vk(1c0α0k)+c1α0k+c2(αik)2,k>k0,Vk(1c0A)+Ac1+A2c2,kk0,\displaystyle V^{k+1}\leq\begin{cases}V^{k}(1-c_{0}\alpha^{k}_{0})+c_{1}\alpha^{k}_{0}+c_{2}(\alpha^{k}_{i})^{2},\quad&k>k_{0},\\ V^{k}(1-c_{0}A)+{A}c_{1}+{A}^{2}c_{2},\quad&k\leq k_{0},\end{cases} (36)

where k0=min{k:1ck+β<A}k_{0}=\min\{k:\frac{1}{ck+\beta}<A\}. Consequently, it holds

V^{k}\leq\begin{cases}V^{0}(1-c_{0}A)^{k+1}+\frac{c_{1}+{A}c_{2}}{c_{0}},\quad&k\leq k_{0},\\ \frac{C}{c(k-1)+m\beta}+\frac{c_{1}}{c_{0}},\quad&k>k_{0},\end{cases} (37)

with

C=max{(ck0+mβck0+β)2c2c0c,(ck0+mβ)(V0(1c0A)k0+1+Ac2c0)}.C=\max\left\{\left(\frac{ck_{0}+m\beta}{ck_{0}+\beta}\right)^{2}\frac{c_{2}}{c_{0}-c},(ck_{0}+m\beta)\left(V^{0}(1-c_{0}A)^{k_{0}+1}+\frac{Ac_{2}}{c_{0}}\right)\right\}.
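
To make the stepsize rule concrete, the following Python sketch evaluates the schedules \alpha_0^k, \alpha_i^k and the switching index k_0 of Theorem 1; the constants c, beta, A, m are hypothetical placeholders rather than values used in the paper.

# Sketch of the stepsize schedule in Theorem 1 (all constants are hypothetical).
c, beta, A, m = 0.1, 1.0, 0.05, 10

def alpha0(k):                    # master stepsize
    return min(1.0 / (c * k + m * beta), A)

def alphai(k):                    # regular workers' stepsize
    return min(1.0 / (c * k + beta), A)

k0 = next(k for k in range(10**6) if 1.0 / (c * k + beta) < A)   # switching index k_0
print(k0, alpha0(k0), alphai(k0))
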
Proof.

Recall that the updates satisfy

x0k+1=\displaystyle x_{0}^{k+1}= x0kα0k(f0(x0k)i(2ηikηik1)j(2ηjkηjk1)),\displaystyle~{}x_{0}^{k}-\alpha_{0}^{k}\bigg{(}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{)}, (38)
xik+1=\displaystyle x_{i}^{k+1}= xikαik(F(xik,ξik)+(2ηikηik1)),i,\displaystyle~{}x_{i}^{k}-\alpha_{i}^{k}\bigg{(}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}\bigg{)},\forall i\in\mathcal{R}, (39)
ηik+1=\displaystyle\eta_{i}^{k+1}= projλ(ηik+β2(xik+1x0k+1))[λ,λ]d,iandηjk+1[λ,λ]d,j.\displaystyle~{}\mathrm{proj}_{\lambda}\left(\eta^{k}_{i}+\frac{\beta}{2}(x^{k+1}_{i}-x^{k+1}_{0})\right)\in[-\lambda,\lambda]^{d},\forall i\in\mathcal{R}\quad\text{and}\quad\eta_{j}^{k+1}\in[-\lambda,\lambda]^{d},\forall j\in\mathcal{B}. (40)
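
For concreteness, one iteration of the simplified updates (38), (39), and (40) can be sketched as follows. The Python snippet assumes least-squares local losses and synthetic data; the term 0.01*x0 standing in for f_0', the worker partition, and all constants are hypothetical and only illustrate the update structure.

# A minimal NumPy sketch of the simplified updates (38)-(40), assuming
# least-squares local losses; names and constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, m, q = 5, 10, 2              # dimension, number of workers, number of Byzantine workers
lam, beta, c = 0.5, 1.0, 0.1    # TV penalty lambda, ADMM parameter beta, stepsize constant
A_i = [rng.standard_normal((20, d)) for _ in range(m)]
b_i = [A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(20) for A in A_i]

x0 = np.zeros(d)                          # master variable
x = [np.zeros(d) for _ in range(m)]       # workers' local variables
eta_prev = [np.zeros(d) for _ in range(m)]
eta = [np.zeros(d) for _ in range(m)]

def stoch_grad(i, xi, batch=4):           # stochastic gradient of the local least-squares loss
    idx = rng.choice(len(b_i[i]), batch, replace=False)
    Ai, bi = A_i[i][idx], b_i[i][idx]
    return Ai.T @ (Ai @ xi - bi) / batch

for k in range(1, 200):
    alpha0 = 1.0 / (c * k + m * beta)     # master stepsize, as in Theorem 1
    alphai = 1.0 / (c * k + beta)         # regular workers' stepsize
    # (38): master update using the received dual messages from all workers
    x0 = x0 - alpha0 * (0.01 * x0 - sum(2 * eta[i] - eta_prev[i] for i in range(m)))
    for i in range(m):
        if i < m - q:                      # regular worker: primal update (39)
            x[i] = x[i] - alphai * (stoch_grad(i, x[i]) + (2 * eta[i] - eta_prev[i]))
    # (40): dual update with element-wise projection onto [-lambda, lambda]
    for i in range(m):
        eta_prev[i] = eta[i].copy()
        if i < m - q:
            eta[i] = np.clip(eta[i] + beta / 2 * (x[i] - x0), -lam, lam)
        else:                              # Byzantine worker: arbitrary message in [-lambda, lambda]^d
            eta[i] = rng.uniform(-lam, lam, d)

In the sketch, the Byzantine workers inject arbitrary dual messages within [-lambda, lambda]^d, which matches the range constraint in (40).
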

Step 1. At the master side, we have

𝔼x0k+1x02=(38)\displaystyle\mathbb{E}\|x_{0}^{k+1}-x_{0}^{*}\|^{2}\overset{\eqref{x0-adm-sim}}{=} 𝔼x0kx02+(α0k)2𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1)2\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+(\alpha_{0}^{k})^{2}\mathbb{E}\bigg{\|}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
2α0k𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1),x0kx0.\displaystyle~{}-2\alpha_{0}^{k}\mathbb{E}\bigg{\langle}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}. (41)

For the second term in (41), the inequality a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2} gives

𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1)2\displaystyle~{}\mathbb{E}\bigg{\|}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
=(27)\displaystyle\overset{\eqref{eq:opt}}{=} 𝔼f0(x0k)f0(x0)i(2ηikηik1ηi)j(2ηjkηjk1)2\displaystyle~{}\mathbb{E}\bigg{\|}f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta^{*}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
\displaystyle\leq 2𝔼f0(x0k)f0(x0)2+2𝔼i(2ηikηik1ηi)+j(2ηjkηjk1)2\displaystyle~{}2\mathbb{E}\big{\|}f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})\big{\|}^{2}+2\mathbb{E}\bigg{\|}\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta^{*}_{i}\big{)}+\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
(40)\displaystyle\overset{\eqref{eta-range}}{\leq} 2𝔼f0(x0k)f0(x0)2+2(4r+3q)2λ2d.\displaystyle~{}2\mathbb{E}\big{\|}f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})\big{\|}^{2}+2(4r+3q)^{2}\lambda^{2}d. (42)

Applying the inequality 2a,bϵa2+1ϵb22\langle a,b\rangle\leq\epsilon\|a\|^{2}+\frac{1}{\epsilon}\|b\|^{2} to the third term in (41) with ϵ=μ0L0μ0+L0\epsilon=\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}} yields

2𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1),x0kx0\displaystyle-2\mathbb{E}\bigg{\langle}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}
=(27)\displaystyle\overset{\eqref{eq:opt}}{=} 2𝔼f0(x0k)f0(x0),x0kx0+2i𝔼2ηikηik1ηi,x0kx0+2𝔼j2ηjkηjk1,x0kx0\displaystyle-2\mathbb{E}\big{\langle}f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*}),x_{0}^{k}-x_{0}^{*}\big{\rangle}+2\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}+2\mathbb{E}\bigg{\langle}\sum_{j\in\mathcal{B}}2\eta^{k}_{j}-\eta^{k-1}_{j},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}
(29)\displaystyle\overset{\eqref{eq:nes0}}{\leq} 2μ0+L0𝔼f0(x0k)f0(x0)22μ0L0μ0+L0𝔼x0kx02+2i𝔼2ηikηik1ηi,x0kx0\displaystyle-\frac{2}{\mu_{0}+L_{0}}\mathbb{E}\|f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})\|^{2}-\frac{2\mu_{0}L_{0}}{\mu_{0}+L_{0}}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+2\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}
+μ0L0μ0+L0𝔼x0kx02+μ0+L0μ0L0𝔼j(2ηjkηjk1)2\displaystyle+\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\frac{\mu_{0}+L_{0}}{\mu_{0}L_{0}}\mathbb{E}\bigg{\|}\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
\displaystyle\leq 2μ0+L0𝔼f0(x0k)f0(x0)2μ0L0μ0+L0𝔼x0kx02+μ0+L0μ0L0(3λq)2d\displaystyle-\frac{2}{\mu_{0}+L_{0}}\mathbb{E}\|f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})\|^{2}-\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\frac{\mu_{0}+L_{0}}{\mu_{0}L_{0}}(3\lambda q)^{2}d
+2i𝔼2ηikηik1ηi,x0kx0.\displaystyle+2\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}. (43)

Substituting (42) and (43) into (41) gives

𝔼x0k+1x02\displaystyle\mathbb{E}\|x_{0}^{k+1}-x_{0}^{*}\|^{2}\leq 𝔼x0kx02(1α0kμ0L0μ0+L0)𝔼f0(x0k)f0(x0)22α0k(1μ0+L0α0k)\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}\left(1-\alpha_{0}^{k}\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\right)-\mathbb{E}\big{\|}f^{\prime}_{0}(x_{0}^{k})-f^{\prime}_{0}(x_{0}^{*})\big{\|}^{2}2\alpha_{0}^{k}\left(\frac{1}{\mu_{0}+L_{0}}-\alpha_{0}^{k}\right)
+2λ2d(4r+3q)2(α0k)2+9λ2d(μ0+L0)μ0L0q2α0k+2α0ki𝔼2ηikηik1ηi,x0kx0\displaystyle~{}+2\lambda^{2}d(4r+3q)^{2}(\alpha_{0}^{k})^{2}+\frac{9\lambda^{2}d(\mu_{0}+L_{0})}{\mu_{0}L_{0}}q^{2}\alpha_{0}^{k}+2\alpha_{0}^{k}\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}
\displaystyle\leq 𝔼x0kx02(1α0kμ0L0μ0+L0)+2λ2d(4r+3q)2(α0k)2+9λ2d(μ0+L0)μ0L0q2α0k\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}\left(1-\alpha_{0}^{k}\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\right)+2\lambda^{2}d(4r+3q)^{2}(\alpha^{k}_{0})^{2}+\frac{9\lambda^{2}d(\mu_{0}+L_{0})}{\mu_{0}L_{0}}q^{2}\alpha^{k}_{0}
+2α0ki𝔼2ηikηik1ηi,x0kx0,\displaystyle+2\alpha_{0}^{k}\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}, (44)

where the last inequality comes from that α0kA1μ0+L0\alpha_{0}^{k}\leq{A}\leq\frac{1}{\mu_{0}+L_{0}}.

Step 2. Accordingly, at the regular worker side, we have for any ii\in\mathcal{R} that

𝔼xik+1xi2=(39)\displaystyle\mathbb{E}\|x_{i}^{k+1}-x_{i}^{*}\|^{2}\overset{\eqref{xi-adm-sim}}{=} 𝔼xikxi2+(αik)2𝔼F(xik,ξik)+(2ηikηik1)2\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}+(\alpha_{i}^{k})^{2}\mathbb{E}\bigg{\|}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}\bigg{\|}^{2} (45)
2αik𝔼F(xik,ξik)+(2ηikηik1),xikxi.\displaystyle~{}-2\alpha_{i}^{k}\mathbb{E}\bigg{\langle}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{i}^{k}-x_{i}^{*}\bigg{\rangle}.

For the second term in (45), the inequality a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2} gives that

𝔼F(xik,ξik)+(2ηikηik1)2=(27)\displaystyle\mathbb{E}\bigg{\|}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}\bigg{\|}^{2}\overset{\eqref{eq:opt}}{=} 𝔼F(xik,ξik)Fi(xik)+Fi(xik)Fi(xi)+(2ηikηik1ηi)2\displaystyle~{}\mathbb{E}\bigg{\|}F^{\prime}(x_{i}^{k},\xi_{i}^{k})-F^{\prime}_{i}(x_{i}^{k})+F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*}\big{)}\bigg{\|}^{2}
\displaystyle\leq 2𝔼Fi(xik)Fi(xi)2+4𝔼F(xik,ξik)Fi(xik)2+4𝔼2ηikηik1ηi2\displaystyle~{}2\mathbb{E}\big{\|}F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*})\big{\|}^{2}+4\mathbb{E}\|F^{\prime}(x_{i}^{k},\xi_{i}^{k})-F^{\prime}_{i}(x_{i}^{k})\|^{2}+4\mathbb{E}\|2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta^{*}_{i}\|^{2}
\displaystyle\leq 2𝔼Fi(xik)Fi(xi)2+4δi2+64λ2d,\displaystyle~{}2\mathbb{E}\big{\|}F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*})\big{\|}^{2}+4\delta_{i}^{2}+64\lambda^{2}d, (46)

where the last inequality comes from (17) and (40). Then the third term in (45) can be upper-bounded as

2𝔼F(xik,ξik)+(2ηikηik1),xikxi\displaystyle-2\mathbb{E}\bigg{\langle}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{i}^{k}-x_{i}^{*}\bigg{\rangle}
=\displaystyle= 2𝔼Fi(xik)+(2ηikηik1),xikxi\displaystyle-2\mathbb{E}\bigg{\langle}F^{\prime}_{i}(x_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{i}^{k}-x_{i}^{*}\bigg{\rangle}
=(27)\displaystyle\overset{\eqref{eq:opt}}{=} 2𝔼Fi(xik)Fi(xi),xikxi2𝔼2ηikηik1ηi,xikxi\displaystyle-2\mathbb{E}\big{\langle}F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*}),x_{i}^{k}-x_{i}^{*}\big{\rangle}-2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\rangle
(28)\displaystyle\overset{\eqref{eq:nesi}}{\leq} 2μi+Li𝔼Fi(xik)Fi(xi)22μiLiμi+Li𝔼xikxi22𝔼2ηikηik1ηi,xikxi,\displaystyle-\frac{2}{\mu_{i}+L_{i}}\mathbb{E}\|F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*})\|^{2}-\frac{2\mu_{i}L_{i}}{\mu_{i}+L_{i}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}-2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\rangle, (47)

where the first equality comes from taking expectation of the conditional expectation; that is, 𝔼x=𝔼[𝔼[x|k1]]\mathbb{E}x=\mathbb{E}\big{[}\mathbb{E}[x|\mathcal{F}_{k-1}]\big{]} with k1\mathcal{F}_{k-1} denoting the sigma-field generated by {ξil1,ηjl:lk,i,j}\{\xi_{i}^{l-1},\eta_{j}^{l}:l\leq k,i\in\mathcal{R},j\in\mathcal{B}\}.

Substituting (46) and (47) into (45) gives

𝔼xik+1xi2\displaystyle\mathbb{E}\|x_{i}^{k+1}-x_{i}^{*}\|^{2}\leq 𝔼xikxi2(1αikμiLiμi+Li)𝔼Fi(xik)Fi(xi)22αik(1μi+Liαik)\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\left(1-\alpha_{i}^{k}\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\right)-\mathbb{E}\big{\|}F^{\prime}_{i}(x_{i}^{k})-F^{\prime}_{i}(x_{i}^{*})\big{\|}^{2}2\alpha_{i}^{k}\left(\frac{1}{\mu_{i}+L_{i}}-\alpha_{i}^{k}\right)
+(4δi2+64λ2d)(αik)22αik𝔼2ηikηik1ηi,xikxi\displaystyle~{}+(4\delta_{i}^{2}+64\lambda^{2}d)(\alpha_{i}^{k})^{2}-2\alpha_{i}^{k}\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\rangle
\displaystyle\leq 𝔼xikxi2(1αikμiLiμi+Li)+(4δi2+64λ2d)(αik)22αik𝔼2ηikηik1ηi,xikxi,\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\left(1-\alpha_{i}^{k}\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\right)+(4\delta_{i}^{2}+64\lambda^{2}d)(\alpha_{i}^{k})^{2}-2\alpha_{i}^{k}\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\rangle, (48)

where the last inequality comes from that αikA1μi+Li\alpha_{i}^{k}\leq{A}\leq\frac{1}{\mu_{i}+L_{i}}.

Step 3. Now combine (44) with (48). Using the notation Vk=𝔼x0kx02+i𝔼(xikxi2+2αik1βηik1ηi2)V^{k}=\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\sum_{i}\mathbb{E}\big{(}\|x_{i}^{k}-x_{i}^{*}\|^{2}+\frac{2\alpha_{i}^{k-1}}{\beta}\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}\big{)}, we have

Vk+1\displaystyle V^{k+1}\leq 𝔼x0kx02(1α0kμ0L0μ0+L0)+i𝔼xikxi2(1αikμiLiμi+Li)\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}\left(1-\alpha_{0}^{k}\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\right)+\sum_{i\in\mathcal{R}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\left(1-\alpha_{i}^{k}\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\right) (49)
+λ2d[2(4r+3q)2(α0k)2+64i(αik)2]+4iδi2(αik)2+9λ2d(μ0+L0)μ0L0q2α0k\displaystyle~{}+\lambda^{2}d\bigg{[}2(4r+3q)^{2}(\alpha_{0}^{k})^{2}+64\sum_{i\in\mathcal{R}}(\alpha_{i}^{k})^{2}\bigg{]}+4\sum_{i\in\mathcal{R}}\delta_{i}^{2}(\alpha_{i}^{k})^{2}+\frac{9\lambda^{2}d(\mu_{0}+L_{0})}{\mu_{0}L_{0}}q^{2}\alpha^{k}_{0}
+i2αikβ𝔼ηikηi22i𝔼2ηikηik1ηi,αik(xikxi)α0k(x0kx0).\displaystyle~{}+\sum_{i\in\mathcal{R}}\frac{2\alpha_{i}^{k}}{\beta}\mathbb{E}\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}-2\sum_{i\in\mathcal{R}}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},\alpha_{i}^{k}(x_{i}^{k}-x_{i}^{*})-\alpha_{0}^{k}(x_{0}^{k}-x_{0}^{*})\big{\rangle}.

For the last term in (49), notice that

𝔼2ηikηik1ηi,αik(xikxi)α0k(x0kx0)\displaystyle~{}-\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},\alpha_{i}^{k}(x_{i}^{k}-x_{i}^{*})-\alpha_{0}^{k}(x_{0}^{k}-x_{0}^{*})\big{\rangle}
=\displaystyle= αik𝔼2ηikηik1ηi,xikx0k(αikα0k)𝔼2ηikηik1ηi,x0kx0\displaystyle~{}-\alpha_{i}^{k}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{0}^{k}\big{\rangle}-(\alpha_{i}^{k}-\alpha_{0}^{k})\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{0}^{k}-x_{0}^{*}\big{\rangle}
=\displaystyle= α0k𝔼ηikηi,xikx0kα0k𝔼ηikηik1,xikx0k(αikα0k)𝔼2ηikηik1ηi,xikxi,\displaystyle~{}-\alpha_{0}^{k}\mathbb{E}\big{\langle}\eta^{k}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{0}^{k}\big{\rangle}-\alpha_{0}^{k}\mathbb{E}\big{\langle}\eta^{k}_{i}-\eta_{i}^{k-1},x_{i}^{k}-x_{0}^{k}\big{\rangle}-(\alpha_{i}^{k}-\alpha_{0}^{k})\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\big{\rangle}, (50)

where the first equality comes from Corollary 1 that xi=x0x_{i}^{*}=x_{0}^{*}. For the first term in (50), Lemma 5 suggests that

α0k𝔼ηikηi,xikx0kα0kβ(ηik1ηi2ηikηi2ηikηik12)α0kβ(ηik1ηi2ηikηi2).-\alpha_{0}^{k}\mathbb{E}\big{\langle}\eta^{k}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{0}^{k}\big{\rangle}\leq\frac{\alpha_{0}^{k}}{\beta}\left(\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}-\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}-\|\eta_{i}^{k}-\eta_{i}^{k-1}\|^{2}\right)\leq\frac{\alpha_{0}^{k}}{\beta}\left(\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}-\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}\right). (51)

For the second term in (50), the projection operator in the \eta_{i}-update gives that

ηikηik1,xikx0k=projλ(ηik1+β2(xikx0k))ηik1,xikx0k0,\big{\langle}\eta^{k}_{i}-\eta_{i}^{k-1},x_{i}^{k}-x_{0}^{k}\big{\rangle}=\big{\langle}\mathrm{proj}_{\lambda}\big{(}\eta^{k-1}_{i}+\frac{\beta}{2}(x^{k}_{i}-x^{k}_{0})\big{)}-\eta_{i}^{k-1},x_{i}^{k}-x_{0}^{k}\big{\rangle}\geq 0, (52)

provided \eta_{i}^{k-1}\in[-\lambda,\lambda]^{d}. For the third term in (50), we apply the inequality 2\langle a,b\rangle\leq\frac{1}{\epsilon_{i}}\|a\|^{2}+{\epsilon_{i}}\|b\|^{2} with {\epsilon_{i}}=\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}} to obtain

(αikα0k)𝔼2ηikηik1ηi,xikxi\displaystyle-(\alpha_{i}^{k}-\alpha_{0}^{k})\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},x_{i}^{k}-x_{i}^{*}\big{\rangle}\leq αikα0k2(μi+LiμiLi(4λ)2d+μiLiμi+Li𝔼xikxi2).\displaystyle~{}\frac{\alpha_{i}^{k}-\alpha_{0}^{k}}{2}\left(\frac{\mu_{i}+L_{i}}{\mu_{i}L_{i}}(4\lambda)^{2}d+\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\right). (53)

Therefore, applying the bounds in (51), (52), and (53) to (50), we get

2𝔼2ηikηik1ηi,αik(xikxi)α0k(x0kx0)\displaystyle~{}-2\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i}-\eta_{i}^{*},\alpha_{i}^{k}(x_{i}^{k}-x_{i}^{*})-\alpha_{0}^{k}(x_{0}^{k}-x_{0}^{*})\big{\rangle}
\displaystyle\leq 2α0kβ(𝔼ηik1ηi2𝔼ηikηi2)+(αikα0k)(μi+LiμiLi16λ2d+μiLiμi+Li𝔼xikxi2)\displaystyle~{}\frac{2\alpha_{0}^{k}}{\beta}\left(\mathbb{E}\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}-\mathbb{E}\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}\right)+(\alpha_{i}^{k}-\alpha_{0}^{k})\left(\frac{\mu_{i}+L_{i}}{\mu_{i}L_{i}}16\lambda^{2}d+\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\right)
\displaystyle\leq 2α0kβ𝔼ηik1ηi22αikβ𝔼ηikηi2+(αikα0k)((12β+μi+LiμiLi)16λ2d+μiLiμi+Li𝔼xikxi2).\displaystyle~{}\frac{2\alpha_{0}^{k}}{\beta}\mathbb{E}\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}-\frac{2\alpha_{i}^{k}}{\beta}\mathbb{E}\|\eta_{i}^{k}-\eta_{i}^{*}\|^{2}+(\alpha_{i}^{k}-\alpha_{0}^{k})\left((\frac{1}{2\beta}+\frac{\mu_{i}+L_{i}}{\mu_{i}L_{i}})16\lambda^{2}d+\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\right). (54)

Consequently (49) becomes

Vk+1\displaystyle V^{k+1}\leq 𝔼x0kx02(1α0kμ0L0μ0+L0)+i𝔼xikxi2(1αik+α0k2μiLiμi+Li)+iα0kαik12αik1β𝔼ηik1ηi2\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}\left(1-\alpha^{k}_{0}\frac{\mu_{0}L_{0}}{\mu_{0}+L_{0}}\right)+\sum_{i\in\mathcal{R}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\left(1-\frac{\alpha_{i}^{k}+\alpha_{0}^{k}}{2}\frac{\mu_{i}L_{i}}{\mu_{i}+L_{i}}\right)+\sum_{i\in\mathcal{R}}\frac{\alpha_{0}^{k}}{\alpha_{i}^{k-1}}\frac{2\alpha_{i}^{k-1}}{\beta}\mathbb{E}\|\eta_{i}^{k-1}-\eta_{i}^{*}\|^{2}
+λ2d[2(4r+3q)2(α0k)2+64i(αik)2+16(iμi+LiμiLi+r2β)(αikα0k)]+4iδi2(αik)2+9λ2d(μ0+L0)μ0L0q2α0k\displaystyle~{}+\lambda^{2}d\bigg{[}2(4r+3q)^{2}(\alpha_{0}^{k})^{2}+64\sum_{i\in\mathcal{R}}(\alpha_{i}^{k})^{2}+16(\sum_{i\in\mathcal{R}}\frac{\mu_{i}+L_{i}}{\mu_{i}L_{i}}+\frac{r}{2\beta})(\alpha_{i}^{k}-\alpha_{0}^{k})\bigg{]}+4\sum_{i\in\mathcal{R}}\delta_{i}^{2}(\alpha_{i}^{k})^{2}+\frac{9\lambda^{2}d(\mu_{0}+L_{0})}{\mu_{0}L_{0}}q^{2}\alpha^{k}_{0}
\displaystyle\leq Vk(1c0α0k)+c1α0k+c2(αik)2,\displaystyle~{}V^{k}(1-c_{0}\alpha^{k}_{0})+c_{1}\alpha^{k}_{0}+c_{2}(\alpha^{k}_{i})^{2}, (55)

where the last inequality comes from the definition of c_{0}. Plugging in the choices of stepsizes, we obtain (36).

Step 4. Now we iteratively use (36) to derive the O(1/k)-convergence of V^{k}. First, when k\leq k_{0}, it holds

Vk+1\displaystyle V^{k+1}\leq V0(1c0A)k+1+(Ac1+A2c2)(1+(1c0A)++(1c0A)k)\displaystyle V^{0}(1-c_{0}A)^{k+1}+({A}c_{1}+{A}^{2}c_{2})\big{(}1+(1-c_{0}A)+\ldots+(1-c_{0}A)^{k}\big{)}
\displaystyle\leq V0(1c0A)k+1+c1+Ac2c0.\displaystyle V^{0}(1-c_{0}A)^{k+1}+\frac{c_{1}+{A}c_{2}}{c_{0}}.

For any k\geq k_{0}+1, initially it holds V^{k_{0}+1}\leq\frac{C}{ck_{0}+m\beta}+\frac{c_{1}}{c_{0}} from the definition of C. By induction, if (37) holds for k, then

Vk+1(36)\displaystyle V^{k+1}\overset{\eqref{eq:iter}}{\leq} (Cc(k1)+mβ+c1c0)(1c0ck+mβ)+c1ck+mβ+c2(ck+β)2\displaystyle~{}\bigg{(}\frac{C}{c(k-1)+m\beta}+\frac{c_{1}}{c_{0}}\bigg{)}\bigg{(}1-\frac{c_{0}}{ck+m\beta}\bigg{)}+\frac{c_{1}}{ck+m\beta}+\frac{c_{2}}{(ck+\beta)^{2}}
=\displaystyle= Cck+mβc0cc(k1)+mβCck+mβ+c2(ck+β)2+c1c0\displaystyle~{}\frac{C}{ck+m\beta}-\frac{c_{0}-c}{c(k-1)+m\beta}\frac{C}{ck+m\beta}+\frac{c_{2}}{(ck+\beta)^{2}}+\frac{c_{1}}{c_{0}}
\displaystyle\leq Cck+mβ+c1c0+1(ck+β)2(c2(ck+β)2(c0c)(ck+mβ)2C)\displaystyle~{}\frac{C}{ck+m\beta}+\frac{c_{1}}{c_{0}}+\frac{1}{(ck+\beta)^{2}}\bigg{(}{c_{2}}-\frac{(ck+\beta)^{2}(c_{0}-c)}{(ck+m\beta)^{2}}C\bigg{)}
\displaystyle\leq Cck+mβ+c1c0,\displaystyle~{}\frac{C}{ck+m\beta}+\frac{c_{1}}{c_{0}},

where the last inequality is from C(ck0+mβck0+β)2c2c0cC\geq(\frac{ck_{0}+m\beta}{ck_{0}+\beta})^{2}\frac{c_{2}}{c_{0}-c}. This completes the proof.  
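
To visualize the two regimes in (36) and the resulting O(1/k) decay toward the c_1/c_0 neighborhood in (37), one can iterate the recursion directly. The Python sketch below uses arbitrary placeholder constants and does not reproduce any experiment from the paper.

# Numerically iterate the recursion (36) with placeholder constants to
# visualize the O(1/k) decay of V^k toward the c1/c0 neighborhood.
c, beta, A, m = 0.1, 1.0, 0.05, 10
c0, c1, c2 = 0.3, 0.02, 5.0
V = 10.0                                   # placeholder for V^0
for k in range(1, 2001):
    a0 = min(1.0 / (c * k + m * beta), A)  # alpha_0^k
    ai = min(1.0 / (c * k + beta), A)      # alpha_i^k
    V = V * (1 - c0 * a0) + c1 * a0 + c2 * ai ** 2
    if k % 500 == 0:
        print(k, V, c1 / c0)               # V^k approaches the c1/c0 level at rate O(1/k)
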

Appendix D O(1/\sqrt{k})-ergodic convergence

Theorem 2.

Suppose Assumptions 1, 2, and 3 hold. Let \lambda\geq\lambda_{0} and the stepsizes be

\alpha_{0}^{k}=\min\left\{\frac{1}{\bar{c}\sqrt{k}+m\bar{\beta}},\bar{A}\right\},\quad\alpha_{i}^{k}=\min\left\{\frac{1}{\bar{c}\sqrt{k}+\bar{\beta}},\bar{A}\right\},\ \forall i\in\mathcal{R},

for some positive constants \bar{c}, \bar{\beta}, and \bar{A}\leq\min\left\{\frac{\mu_{0}}{4L_{0}^{2}},\frac{\mu_{i}}{2L_{i}^{2}+(m-1)\beta c}:i\in\mathcal{R}\right\}. Then the proposed algorithm converges in the ergodic sense that

\sum_{i\in\mathcal{R}}\mathbb{E}[F(\bar{x}_{i}^{k},\xi_{i})]+f_{0}(\bar{x}_{0}^{k})-\underbrace{\min_{\tilde{x}}\left(\sum_{i\in\mathcal{R}}\mathbb{E}[F({\tilde{x}},\xi_{i})]+f_{0}({\tilde{x}})\right)}_{\text{Our goal in (2)}}\leq m\bar{c}_{1}+{O}\left(\frac{\log k}{\sqrt{k}}\right), (56)

where \bar{x}_{i}^{k}=\sum_{l=1}^{k}\frac{\alpha^{l}x_{i}^{l}}{\sum_{l^{\prime}=1}^{k}\alpha^{l^{\prime}}},\ \bar{x}_{0}^{k}=\sum_{l=1}^{k}\frac{\alpha^{l}x_{0}^{l}}{\sum_{l^{\prime}=1}^{k}\alpha^{l^{\prime}}} are the weighted average variables, and the constant \bar{c}_{1}=\frac{18d}{\mu_{0}}\lambda^{2}q^{2}.
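
The guarantee (56) is stated for the weighted running averages \bar{x}_{i}^{k} and \bar{x}_{0}^{k}. The Python sketch below shows how such an average would be accumulated alongside the O(1/\sqrt{k}) stepsizes; the constants and the stand-in iterate sequence are hypothetical.

# Sketch of the Theorem 2 stepsizes and the weighted ergodic average
# (constants and the iterate sequence xs are hypothetical placeholders).
import numpy as np

c_bar, beta_bar, A_bar, m, d = 0.1, 1.0, 0.05, 10, 5
xs = [np.ones(d) / (k + 1) for k in range(100)]        # stand-in for the iterates x_0^l

weight_sum = 0.0
x_bar = np.zeros(d)
for l, x_l in enumerate(xs, start=1):
    alpha0 = min(1.0 / (c_bar * np.sqrt(l) + m * beta_bar), A_bar)
    weight_sum += alpha0
    x_bar += alpha0 * (x_l - x_bar) / weight_sum       # running weighted average of the iterates
print(x_bar)
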

Proof.

Step 1. At the master side, we still have (41), in the form of

𝔼x0k+1x02=\displaystyle\mathbb{E}\|x_{0}^{k+1}-x_{0}^{*}\|^{2}= 𝔼x0kx02+(α0k)2𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1)2\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+(\alpha_{0}^{k})^{2}\mathbb{E}\bigg{\|}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}
2α0k𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1),x0kx0.\displaystyle~{}-2\alpha_{0}^{k}\mathbb{E}\bigg{\langle}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}. (57)

For the second term in (57), we also have (42), which can be further bounded by

𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1)22L02𝔼x0kx02+2(4r+3q)2λ2d.\displaystyle~{}\mathbb{E}\bigg{\|}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}\leq 2L_{0}^{2}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+2(4r+3q)^{2}\lambda^{2}d. (58)

Here we use the fact that f0f_{0} has Lipschitz continuous gradients in Assumption 2. In addition, using the fact that f0f_{0} is strongly convex in Assumption 1 leads to

f0(x0k),x0kx0f0(x0k)f0(x0)+μ02x0kx02.\langle f_{0}^{\prime}(x_{0}^{k}),x_{0}^{k}-x_{0}^{*}\rangle\geq f_{0}(x_{0}^{k})-f_{0}(x_{0}^{*})+\frac{\mu_{0}}{2}\|x_{0}^{k}-x_{0}^{*}\|^{2}. (59)

Applying the inequality 2a,bϵa2+1ϵb22\langle a,b\rangle\leq\epsilon\|a\|^{2}+\frac{1}{\epsilon}\|b\|^{2} to the third term in (57) with ϵ=μ02\epsilon=\frac{\mu_{0}}{2} yields

2𝔼f0(x0k)i(2ηikηik1)j(2ηjkηjk1),x0kx0\displaystyle-2\mathbb{E}\bigg{\langle}f^{\prime}_{0}(x_{0}^{k})-\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}-\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}
=\displaystyle= 2𝔼f0(x0k),x0kx0+2𝔼i(2ηikηik1)+j(2ηjkηjk1),x0kx0\displaystyle-2\mathbb{E}\big{\langle}f^{\prime}_{0}(x_{0}^{k}),x_{0}^{k}-x_{0}^{*}\big{\rangle}+2\mathbb{E}\bigg{\langle}\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}+\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)},x_{0}^{k}-x_{0}^{*}\bigg{\rangle}
(59)\displaystyle\overset{\eqref{sc}}{\leq} 2𝔼(f0(x0k)f0(x0))μ0𝔼x0kx02+μ02𝔼x0kx02+2μ0𝔼j(2ηjkηjk1)2+2𝔼i(2ηikηik1),x0kx0\displaystyle-2\mathbb{E}\big{(}f_{0}(x_{0}^{k})-f_{0}(x_{0}^{*})\big{)}-\mu_{0}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\frac{\mu_{0}}{2}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\frac{2}{\mu_{0}}\mathbb{E}\bigg{\|}\sum_{j\in\mathcal{B}}\big{(}2\eta^{k}_{j}-\eta^{k-1}_{j}\big{)}\bigg{\|}^{2}+2\mathbb{E}\langle\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{0}^{k}-x_{0}^{*}\rangle
\displaystyle\leq 2𝔼(f0(x0k)f0(x0))μ02𝔼x0kx02+2μ09q2λ2d+2𝔼i(2ηikηik1),x0kx0.\displaystyle-2\mathbb{E}\big{(}f_{0}(x_{0}^{k})-f_{0}(x_{0}^{*})\big{)}-\frac{\mu_{0}}{2}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\frac{2}{\mu_{0}}9q^{2}\lambda^{2}d+2\mathbb{E}\langle\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{0}^{k}-x_{0}^{*}\rangle. (60)

Substituting (58) and (60) into (57) gives

𝔼x0k+1x02\displaystyle\mathbb{E}\|x_{0}^{k+1}-x_{0}^{*}\|^{2}\leq 𝔼x0kx02(1α0kμ02+(α0k)22L02)α0k2𝔼(f0(x0k)f0(x0))+α0k18dμ0λ2q2+(α0k)22(4r+3q)2dλ2\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}\left(1-\alpha_{0}^{k}\frac{\mu_{0}}{2}+(\alpha_{0}^{k})^{2}2L_{0}^{2}\right)-\alpha_{0}^{k}2\mathbb{E}\big{(}f_{0}(x_{0}^{k})-f_{0}(x_{0}^{*})\big{)}+\alpha_{0}^{k}\frac{18d}{\mu_{0}}\lambda^{2}q^{2}+(\alpha_{0}^{k})^{2}2(4r+3q)^{2}d\lambda^{2}
+α0k2𝔼i(2ηikηik1),x0kx0\displaystyle~{}+\alpha_{0}^{k}2\mathbb{E}\langle\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{0}^{k}-x_{0}^{*}\rangle
\displaystyle\leq 𝔼x0kx022α0k𝔼(f0(x0k)f0(x0))+α0k18dμ0λ2q2+(α0k)22(4r+3q)2dλ2\displaystyle~{}\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}-2\alpha_{0}^{k}\mathbb{E}\big{(}f_{0}(x_{0}^{k})-f_{0}(x_{0}^{*})\big{)}+\alpha_{0}^{k}\frac{18d}{\mu_{0}}\lambda^{2}q^{2}+(\alpha_{0}^{k})^{2}2(4r+3q)^{2}d\lambda^{2} (61)
+α0k2𝔼i(2ηikηik1),x0kx0,\displaystyle~{}+\alpha_{0}^{k}2\mathbb{E}\langle\sum_{i\in\mathcal{R}}\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{0}^{k}-x_{0}^{*}\rangle,

where the last inequality comes from that α0kA¯μ04L02\alpha_{0}^{k}\leq\bar{A}\leq\frac{\mu_{0}}{4L_{0}^{2}}.

Step 2. Accordingly, at the worker side, we have for any ii\in\mathcal{R} that (45) holds, as

𝔼xik+1xi2=\displaystyle\mathbb{E}\|x_{i}^{k+1}-x_{i}^{*}\|^{2}= 𝔼xikxi2+(αik)2𝔼F(xik,ξik)+(2ηikηik1)22αik𝔼F(xik,ξik)+(2ηikηik1),xikxi.\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}+(\alpha_{i}^{k})^{2}\mathbb{E}\big{\|}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}\big{\|}^{2}-2\alpha_{i}^{k}\mathbb{E}\big{\langle}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{i}^{k}-x_{i}^{*}\big{\rangle}. (62)

For the second term, we still have (46), which can be further bounded by

𝔼F(xik,ξik)+(2ηikηik1)22Li2𝔼xikxi2+4δi2+64λ2d.\displaystyle~{}\mathbb{E}\big{\|}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)}\big{\|}^{2}\leq 2L_{i}^{2}\mathbb{E}\big{\|}x_{i}^{k}-x_{i}^{*}\big{\|}^{2}+4\delta_{i}^{2}+64\lambda^{2}d. (63)

Here we use the fact that F_{i} has Lipschitz continuous gradients in Assumption 2. Then, using \langle F_{i}^{\prime}(x_{i}^{k}),x_{i}^{k}-x_{i}^{*}\rangle\geq F_{i}(x_{i}^{k})-F_{i}(x_{i}^{*})+\frac{\mu_{i}}{2}\|x_{i}^{k}-x_{i}^{*}\|^{2} as F_{i} is strongly convex in Assumption 1, we bound the third term in (62) as

2𝔼F(xik,ξik)+(2ηikηik1),xikxi=\displaystyle-2\mathbb{E}\big{\langle}F^{\prime}(x_{i}^{k},\xi_{i}^{k})+\big{(}2\eta^{k}_{i}-\eta^{k-1}_{i}\big{)},x_{i}^{k}-x_{i}^{*}\big{\rangle}= 2𝔼Fi(xik),xikxi2𝔼2ηikηik1,xikxi\displaystyle~{}-2\mathbb{E}\big{\langle}F^{\prime}_{i}(x_{i}^{k}),x_{i}^{k}-x_{i}^{*}\big{\rangle}-2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i},x_{i}^{k}-x_{i}^{*}\rangle
\displaystyle\leq 2𝔼(Fi(xik)Fi(xi))μi𝔼xikxi22𝔼2ηikηik1,xikxi.\displaystyle~{}-2\mathbb{E}\big{(}F_{i}(x_{i}^{k})-F_{i}(x_{i}^{*})\big{)}-\mu_{i}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}-2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i},x_{i}^{k}-x_{i}^{*}\rangle. (64)

Substituting (63) and (64) into (62) gives

𝔼xik+1xi2\displaystyle\mathbb{E}\|x_{i}^{k+1}-x_{i}^{*}\|^{2}\leq 𝔼xikxi2(1αikμi+(αik)22Li2)αik2𝔼(Fi(xik)Fi(xi))+(αik)2(64λ2d+4δi2)\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\left(1-\alpha_{i}^{k}\mu_{i}+(\alpha_{i}^{k})^{2}2L_{i}^{2}\right)-\alpha_{i}^{k}2\mathbb{E}\big{(}F_{i}(x_{i}^{k})-F_{i}(x_{i}^{*})\big{)}+(\alpha_{i}^{k})^{2}\big{(}64\lambda^{2}d+4\delta_{i}^{2}\big{)}
αik2𝔼2ηikηik1,xikxi\displaystyle~{}-\alpha_{i}^{k}2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i},x_{i}^{k}-x_{i}^{*}\rangle
\displaystyle\leq 𝔼xikxi2α0k1/αik1α0k/αik2αik𝔼(Fi(xik)Fi(xi))+(αik)2(64dλ2+4δi2)\displaystyle~{}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}\frac{\alpha_{0}^{k-1}/\alpha_{i}^{k-1}}{\alpha_{0}^{k}/\alpha_{i}^{k}}-2\alpha_{i}^{k}\mathbb{E}\big{(}F_{i}(x_{i}^{k})-F_{i}(x_{i}^{*})\big{)}+(\alpha_{i}^{k})^{2}\big{(}64d\lambda^{2}+4\delta_{i}^{2}\big{)} (65)
αik2𝔼2ηikηik1,xikxi,\displaystyle~{}-\alpha_{i}^{k}2\mathbb{E}\langle 2\eta^{k}_{i}-\eta^{k-1}_{i},x_{i}^{k}-x_{i}^{*}\rangle,

where the last inequality holds from the bound of \bar{A}, such that

\frac{\alpha_{0}^{k-1}/\alpha_{i}^{k-1}}{\alpha_{0}^{k}/\alpha_{i}^{k}}-\left(1-\alpha_{i}^{k}\mu_{i}+(\alpha_{i}^{k})^{2}2L_{i}^{2}\right)=\alpha_{i}^{k}\big{(}\mu_{i}-\alpha_{i}^{k}2L_{i}^{2}-\alpha_{0}^{k-1}\frac{(m-1)\beta c}{\sqrt{k}+\sqrt{k-1}}\big{)}\geq\alpha_{i}^{k}\big{(}\mu_{i}-(2L_{i}^{2}+(m-1)\beta c)\bar{A}\big{)}\geq 0.

Step 3. Denote

Fk=i𝔼[F(xik,ξik)]+f0(x0k)=iFi(xik)+f0(x0k),F=iFi(xi)+f0(x0)=minx~i𝔼[F(x~,ξi)]+f0(x~),F^{k}=\sum_{i\in\mathcal{R}}\mathbb{E}[F(x_{i}^{k},\xi_{i}^{k})]+f_{0}(x_{0}^{k})=\sum_{i\in\mathcal{R}}F_{i}(x_{i}^{k})+f_{0}(x_{0}^{k}),\quad F^{*}=\sum_{i\in\mathcal{R}}F_{i}(x_{i}^{*})+f_{0}(x_{0}^{*})=\min_{\tilde{x}}\sum_{i\in\mathcal{R}}\mathbb{E}[F({\tilde{x}},\xi_{i})]+f_{0}({\tilde{x}}),

and define a Lyapunov function \bar{V}^{k}=\mathbb{E}\|x_{0}^{k}-x_{0}^{*}\|^{2}+\sum_{i\in\mathcal{R}}\big{(}\frac{\alpha_{0}^{k-1}}{\alpha_{i}^{k-1}}\mathbb{E}\|x_{i}^{k}-x_{i}^{*}\|^{2}+\frac{2\alpha_{0}^{k-1}}{\beta}\|\eta_{i}^{k-1}\|^{2}\big{)}. From (31), we have

2α0k𝔼2ηikηik1,(xikxi)(x0kx0)2α0kβ(ηik12ηik2)2α0k1βηik122α0kβηik2.\displaystyle-2\alpha_{0}^{k}\mathbb{E}\big{\langle}2\eta^{k}_{i}-\eta^{k-1}_{i},(x_{i}^{k}-x_{i}^{*})-(x_{0}^{k}-x_{0}^{*})\big{\rangle}\leq\frac{2\alpha_{0}^{k}}{\beta}\left(\|\eta_{i}^{k-1}\|^{2}-\|\eta_{i}^{k}\|^{2}\right)\leq\frac{2\alpha_{0}^{k-1}}{\beta}\|\eta_{i}^{k-1}\|^{2}-\frac{2\alpha_{0}^{k}}{\beta}\|\eta_{i}^{k}\|^{2}. (66)

Consequently, combining (61), (65), and (66) together gives

2α0k(FkF)\displaystyle 2\alpha_{0}^{k}\big{(}F^{k}-F^{*}\big{)}\leq V¯kV¯k+1+α0k18dμ0λ2q2+α0k(α0k2(4r+3q)2dλ2+αik(64rdλ2+4iδi2))\displaystyle~{}\bar{V}^{k}-\bar{V}^{k+1}+\alpha_{0}^{k}\frac{18d}{\mu_{0}}\lambda^{2}q^{2}+\alpha_{0}^{k}\bigg{(}\alpha_{0}^{k}2(4r+3q)^{2}d\lambda^{2}+\alpha_{i}^{k}\big{(}64rd\lambda^{2}+4\sum_{i\in\mathcal{R}}\delta_{i}^{2}\big{)}\bigg{)}
\displaystyle\leq V¯kV¯k+1+c¯1αik+c¯2(αik)2,\displaystyle~{}\bar{V}^{k}-\bar{V}^{k+1}+\bar{c}_{1}\alpha_{i}^{k}+\bar{c}_{2}(\alpha_{i}^{k})^{2}, (67)

where \bar{c}_{1}=\frac{18d}{\mu_{0}}\lambda^{2}q^{2} and \bar{c}_{2}=\big{(}2(4r+3q)^{2}+64r\big{)}d\lambda^{2}+4\sum_{i\in\mathcal{R}}\delta_{i}^{2}. Summing up (67) over l=1,\ldots,k, we obtain

2l=1kα0l(FlF)\displaystyle 2\sum_{l=1}^{k}\alpha_{0}^{l}(F^{l}-F^{*})\leq V¯1V¯k+1+c¯1l=1kαil+c¯2l=1k(αil)2V¯1+c¯1l=1kαil+c¯2l=1k(αil)2.\displaystyle\bar{V}^{1}-\bar{V}^{k+1}+\bar{c}_{1}\sum_{l=1}^{k}\alpha_{i}^{l}+\bar{c}_{2}\sum_{l=1}^{k}(\alpha_{i}^{l})^{2}\leq\bar{V}^{1}+\bar{c}_{1}\sum_{l=1}^{k}\alpha_{i}^{l}+\bar{c}_{2}\sum_{l=1}^{k}(\alpha_{i}^{l})^{2}. (68)

Dividing both sides by 2l=1kα0l2\sum_{l=1}^{k}\alpha_{0}^{l} gives

\displaystyle\sum_{l=1}^{k}\frac{\alpha_{0}^{l}}{\sum_{l^{\prime}=1}^{k}\alpha_{0}^{l^{\prime}}}F^{l}-F^{*}\leq m\bar{c}_{1}+\frac{\bar{V}^{1}+\bar{c}_{2}\sum_{l=1}^{k}(\alpha_{i}^{l})^{2}}{2\sum_{l=1}^{k}\alpha_{0}^{l}}\leq m\bar{c}_{1}+\frac{\bar{V}^{1}+\frac{\bar{c}_{2}}{c^{2}}\log(k+1)}{4(c+m\beta)\sqrt{k}}. (69)

The convexity of FF and f0f_{0} leads to i𝔼[F(x¯ik,ξi)]+f0(x¯0k)Fl=1kα0ll=1kα0lFlF\sum_{i\in\mathcal{R}}\mathbb{E}[F(\bar{x}_{i}^{k},\xi_{i})]+f_{0}(\bar{x}_{0}^{k})-F^{*}\leq\sum_{l=1}^{k}\frac{\alpha_{0}^{l}}{\sum_{l^{\prime}=1}^{k}\alpha_{0}^{l^{\prime}}}F^{l}-F^{*}. Combining this inequality and (69), we complete the proof.