Stochastic Alternating Direction Method of Multipliers
for Byzantine-Robust Distributed Learning
Abstract
This paper aims to solve a distributed learning problem under Byzantine attacks. In the underlying distributed system, a number of unknown but malicious workers (termed Byzantine workers) can send arbitrary messages to the master and bias the learning process, due to data corruptions, computation errors or malicious attacks. Prior work has considered a total variation (TV) norm-penalized approximation formulation to handle Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. To solve the TV norm-penalized approximation formulation, we propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that fully utilizes the separable problem structure. Theoretically, we prove that the proposed method converges to a bounded neighborhood of the optimal solution at a rate of $O(1/k)$ under mild assumptions, where $k$ is the number of iterations and the size of the neighborhood is determined by the number of Byzantine workers. Numerical experiments on the MNIST and COVERTYPE datasets demonstrate the effectiveness of the proposed method against various Byzantine attacks.
keywords:
Distributed machine learning, alternating direction method of multipliers (ADMM), Byzantine attacks

1 Introduction
Most traditional machine learning algorithms require collecting training data from their owners onto a single computer or data center, which is not only communication-inefficient but also vulnerable to privacy leakage [1, 2, 3]. With the explosive growth of big data, federated learning has been proposed as a novel privacy-preserving distributed machine learning scheme, and has received extensive research interest recently [4, 5, 6]. In federated learning, the training data are stored at distributed workers, and the workers compute their local variables using local training data, under the coordination of a master. This scheme effectively reduces the risk of data leakage and protects privacy.
However, federated learning still faces significant security challenges. Some of the distributed workers, whose identities are unknown, could be unreliable and send wrong or even malicious messages to the master, due to data corruptions, computation errors or malicious attacks. To characterize the worst-case scenario, we adopt the Byzantine failure model, in which the Byzantine workers are aware of all information of the other workers, and able to send arbitrary messages to the master [7, 8, 9, 10]. In this paper, we aim at solving the distributed learning problem under Byzantine attacks that potentially threaten federated learning applications.
Related works. With the rapid popularization of federated learning, Byzantine-robust distributed learning has become an attractive research topic in recent years. Most of the existing algorithms modify distributed stochastic gradient descent (SGD) into Byzantine-robust variants. In the standard distributed SGD, at every iteration, all the workers send their local stochastic gradients to the master, and the master averages all the received stochastic gradients and updates the optimization variable. When Byzantine workers are present, they can send faulty values instead of true stochastic gradients to the master so as to bias the learning process. It has been shown that the standard distributed SGD with mean aggregation is vulnerable to Byzantine attacks [11].
When the training data are independently and identically distributed (i.i.d.) at the workers, the stochastic gradients of the regular workers are i.i.d. too. This fact motivates two mainstream methods to deal with Byzantine attacks: attack detection and robust aggregation. For attack detection, [12] and [13] propose to offline train an autoencoder, which is used to online calculate credit scores of the workers. The messages sent by the workers with lower credit scores are discarded in the mean aggregation. The robust subgradient push algorithm in [14] operates over a decentralized network: each worker calculates a score for each of its neighbors, and isolates those with lower scores. The works of [15, 16] detect the Byzantine workers with historic gradients so as to ensure robustness. The work of [17] uses redundant gradients for attack detection; however, it requires overlapped data samples on multiple workers, and does not fit the federated learning setting. For robust aggregation, the master can use the geometric median, instead of the mean, to aggregate the received messages [18, 19, 20]. When the number of Byzantine workers is less than the number of regular workers, the geometric median provides a reliable approximation to the average of the regular workers' stochastic gradients. Other similar robust aggregation rules include the marginal trimmed mean and the dimensional median [21, 22, 23]. Some aggregation rules select a representative stochastic gradient from all the received ones to update the global variable, e.g., Medoid [19] and Krum [11]. Medoid selects the stochastic gradient with the smallest distance from all the others, while Krum selects the one with the smallest squared distance to a fixed number of nearest stochastic gradients. An extension of Krum, termed multi-Krum, selects multiple stochastic gradients with Krum and uses their average. Bulyan [24] first selects a number of stochastic gradients with Krum or other robust selection/aggregation rules, and then uses their trimmed dimensional median. A sketch of several of these rules is given below.
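To make these rules concrete, the following Python sketch implements the coordinate-wise median, the geometric median, and Krum with NumPy. It is illustrative only: the function names, the Weiszfeld iteration count, and the tolerance are our own choices, not the reference implementations benchmarked later in this paper.

```python
import numpy as np

def coordinatewise_median(grads):
    """Dimensional (coordinate-wise) median of the received gradients."""
    return np.median(np.stack(grads), axis=0)

def geometric_median(grads, iters=50, eps=1e-8):
    """Geometric median approximated by Weiszfeld's fixed-point iterations."""
    pts = np.stack(grads)
    z = pts.mean(axis=0)                      # initialize at the mean
    for _ in range(iters):
        d = np.linalg.norm(pts - z, axis=1)
        w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
        z = (w[:, None] * pts).sum(axis=0) / w.sum()
    return z

def krum(grads, q):
    """Krum: return the gradient with the smallest summed squared distance
    to its m - q - 2 nearest neighbors, where q bounds the Byzantine count."""
    pts = np.stack(grads)
    m = len(pts)
    assert m > q + 2, "Krum requires m > q + 2"
    dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    scores = [np.sort(np.delete(dists[i], i))[: m - q - 2].sum()
              for i in range(m)]
    return pts[int(np.argmin(scores))]
```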
When the training data and hence the stochastic gradients are non-i.i.d. at the workers, which is common in federated learning applications [25], naive robust aggregation of stochastic gradients no longer works. The works of [26, 27] adopt a resampling strategy to alleviate the effect caused by non-i.i.d. training data. With a larger resampling parameter, the algorithms can handle higher data heterogeneity, at the cost of tolerating fewer Byzantine workers. Robust stochastic aggregation (RSA) aggregates local variables, instead of stochastic gradients [28]. To be specific, it considers a total variation (TV) norm-penalized approximation formulation to handle Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. Although the stochastic subgradient method proposed in [28] is able to solve the TV norm-penalized approximation formulation, it ignores the separable problem structure.
Other related works include [29, 30, 31, 32], which show that the stochastic gradient noise affects the effectiveness of robust aggregation rules; thus, the robustness of Byzantine-resilient methods can be improved by reducing the variance of the stochastic gradients. Asynchronous Byzantine-robust SGD is considered in [33, 34, 35]. The work of [36] addresses saddle-point attacks in the non-convex setting, and [37, 38, 39, 40] consider Byzantine robustness in decentralized learning.
Our contributions. Our contributions are three-fold.
(i) We propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that utilizes the separable problem structure of the TV norm-penalized approximation formulation. The stochastic ADMM updates are further simplified, such that the iteration-wise communication and computation costs are the same as those of the stochastic subgradient method.
(ii) We theoretically prove that the proposed stochastic ADMM converges to a bounded neighborhood of the optimal solution at a rate of $O(1/k)$ under mild assumptions, where $k$ is the number of iterations and the size of the neighborhood is determined by the number of Byzantine workers.
(iii) We conduct numerical experiments on the MNIST and COVERTYPE datasets to demonstrate the effectiveness of the proposed stochastic ADMM against various Byzantine attacks.
2 Problem Formulation
Let us consider a distributed network with a master and $m$ workers, among which $q$ workers are Byzantine and the other $m - q$ workers are regular. The exact value of $q$ and the identities of the Byzantine workers are all unknown. We are interested in solving a stochastic optimization problem in the form of

$$\min_{\tilde{x}} \ \sum_{i=1}^{m} \mathbb{E}\left[F(\tilde{x}, \xi_i)\right] + f_0(\tilde{x}), \qquad (1)$$
where $\tilde{x}$ is the optimization variable, $f_0(\tilde{x})$ is the regularization term known to the master, and $\mathbb{E}\left[F(\tilde{x}, \xi_i)\right]$ is the expected loss function of worker $i$ with respect to a random variable $\xi_i$. Here we assume that the data distributions on the workers can be different, which is common in federated learning applications.

Define $\mathcal{R}$ and $\mathcal{B}$ as the sets of regular workers and Byzantine workers, respectively. We have $|\mathcal{R}| = m - q$ and $|\mathcal{B}| = q$. Because of the existence of Byzantine workers, directly solving (1) without distinguishing between regular and Byzantine workers is meaningless. A less ambitious alternative is to minimize the summation of the regular workers' local expected cost functions plus the regularization term, in the form of
$$\tilde{x}^\star = \arg\min_{\tilde{x}} \ \sum_{i \in \mathcal{R}} \mathbb{E}\left[F(\tilde{x}, \xi_i)\right] + f_0(\tilde{x}). \qquad (2)$$
Our proposed algorithm and RSA [28] both aggregate optimization variables, instead of stochastic gradients. To do so, denote $x_i$ as the local copy of $\tilde{x}$ at a regular worker $i$, and $x_0$ as the local copy at the master. Collecting the local copies in a vector $x$, we know that (2) is equivalent to

$$\min_{x} \ \sum_{i \in \mathcal{R}} \mathbb{E}\left[F(x_i, \xi_i)\right] + f_0(x_0), \quad \text{s.t.} \ x_i = x_0, \ \forall i \in \mathcal{R}, \qquad (3)$$

where $x_i = x_0$, $\forall i \in \mathcal{R}$, are the consensus constraints to force the local copies to be the same.
RSA [28] considers a TV norm-penalized approximation formulation of (3), in the form of
$$\min_{x} \ \sum_{i \in \mathcal{R}} \mathbb{E}\left[F(x_i, \xi_i)\right] + f_0(x_0) + \lambda \sum_{i \in \mathcal{R}} \|x_i - x_0\|_1, \qquad (4)$$
where $\lambda$ is a positive constant and $\lambda \sum_{i \in \mathcal{R}} \|x_i - x_0\|_1$ is the TV norm penalty for the constraints in (3). The TV norm penalty forces the regular workers' local optimization variables to be close to the master's, and meanwhile, tolerates the outliers when the Byzantine attackers are present. Due to the existence of the nonsmooth TV norm term, RSA solves (4) with the stochastic subgradient method. The updates of RSA, in the presence of Byzantine workers, are as follows. At time $k$, the master sends $x_0^k$ to the workers; every regular worker $i \in \mathcal{R}$ sends $x_i^k$ to the master, while every Byzantine worker $j \in \mathcal{B}$ sends an arbitrary malicious vector $z_j^k$ to the master. Then, the updates of $x_i^{k+1}$ for every regular worker $i$ and of $x_0^{k+1}$ for the master are given by

$$x_i^{k+1} = x_i^k - \eta^k \left( \nabla F(x_i^k, \xi_i^k) + \lambda \, \mathrm{sign}(x_i^k - x_0^k) \right),$$
$$x_0^{k+1} = x_0^k - \eta^k \Big( \nabla f_0(x_0^k) + \lambda \sum_{j \in \mathcal{R} \cup \mathcal{B}} \mathrm{sign}(x_0^k - z_j^k) \Big), \qquad (5)$$

where $\nabla F(x_i^k, \xi_i^k)$ is a stochastic gradient at $x_i^k$ with respect to a random sample $\xi_i^k$ for regular worker $i$, $z_j^k = x_j^k$ for every regular worker $j \in \mathcal{R}$, $\mathrm{sign}(\cdot)$ is the element-wise sign function ($\mathrm{sign}(a) = 1$ if $a > 0$, $\mathrm{sign}(a) = -1$ if $a < 0$, and $\mathrm{sign}(a) \in [-1, 1]$ if $a = 0$), and $\eta^k$ is the diminishing learning rate at time $k$.
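For concreteness, here is a minimal Python sketch of one synchronous RSA round implementing (5). The gradient oracles grad_F and grad_f0 and the message container are our own notation; only the structure of the updates follows the description above.

```python
import numpy as np

def rsa_round(x_i, x0, received, eta, lam, grad_F, grad_f0):
    """One RSA round. `received` holds all workers' messages: regular
    workers send their local variables, Byzantine messages are arbitrary
    vectors of the same shape."""
    # Regular worker i: stochastic gradient plus the TV norm subgradient term.
    x_i_new = x_i - eta * (grad_F(x_i) + lam * np.sign(x_i - x0))
    # Master: regularizer gradient plus one sign term per received message.
    x0_new = x0 - eta * (grad_f0(x0)
                         + lam * sum(np.sign(x0 - z) for z in received))
    return x_i_new, x0_new
```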
Although RSA has been proven to be robust under Byzantine attacks [28], the sign functions therein enable the Byzantine workers to send slightly modified messages that remarkably bias the learning process. In addition, RSA fully ignores the special separable structure of the TV norm penalty. In this paper, we also consider the TV norm-penalized approximation formulation (4), propose a stochastic ADMM that utilizes the problem structure, and develop a novel Byzantine-robust algorithm.
3 Algorithm Development
In this section, we utilize the separable problem structure of (4) and propose a robust stochastic ADMM to solve it. The challenge is that the unknown Byzantine workers can send faulty messages during the optimization process. At this stage, we simply ignore the existence of Byzantine workers and develop an algorithm to solve (4); afterwards, we will consider the influence of Byzantine workers on the algorithm. We begin by applying the stochastic ADMM to solve (4), and then simplify the updates such that the iteration-wise communication and computation costs are the same as those of the stochastic subgradient method in [28].
Stochastic ADMM. Suppose for now that all the workers are regular, such that $\mathcal{B} = \emptyset$. To apply the stochastic ADMM, for every worker $i$, introduce auxiliary variables $z_{i0}$ and $z_{0i}$ on the directed edges $(i, 0)$ and $(0, i)$, respectively. By introducing the consensus constraints $x_i = z_{i0}$ and $x_0 = z_{0i}$, (4) is equivalent to

$$\min_{x, z} \ \sum_{i \in \mathcal{R}} \mathbb{E}\left[F(x_i, \xi_i)\right] + f_0(x_0) + \lambda \sum_{i \in \mathcal{R}} \|z_{i0} - z_{0i}\|_1, \quad \text{s.t.} \ x_i = z_{i0}, \ x_0 = z_{0i}, \ \forall i \in \mathcal{R}. \qquad (6)$$

For the ease of presentation, we stack these auxiliary variables in a new variable $z$. As we will see below, the introduction of $z$ splits the expectation term and the TV norm penalty term so as to utilize the separable problem structure.
The augmented Lagrangian function of (6) is
$$\mathcal{L}_\rho(x, z, \alpha) = \sum_{i \in \mathcal{R}} \mathbb{E}\left[F(x_i, \xi_i)\right] + f_0(x_0) + \lambda \sum_{i \in \mathcal{R}} \|z_{i0} - z_{0i}\|_1 + \sum_{i \in \mathcal{R}} \left( \langle \alpha_{i0}, x_i - z_{i0} \rangle + \langle \alpha_{0i}, x_0 - z_{0i} \rangle \right) + \frac{\rho}{2} \sum_{i \in \mathcal{R}} \left( \|x_i - z_{i0}\|^2 + \|x_0 - z_{0i}\|^2 \right), \qquad (7)$$

where $\rho$ is a positive constant, while $\alpha_{i0}$ and $\alpha_{0i}$ are the Lagrange multipliers attached to the consensus constraints $x_i = z_{i0}$ and $x_0 = z_{0i}$, respectively. For convenience, we also collect all the Lagrange multipliers in a new variable $\alpha$.
Given the augmented Lagrangian function (7), the vanilla ADMM works as follows. At time $k$, it first updates $x$ through minimizing the augmented Lagrangian function at $z = z^k$ and $\alpha = \alpha^k$, then updates $z$ through minimizing the augmented Lagrangian function at $x = x^{k+1}$ and $\alpha = \alpha^k$, and finally updates $\alpha$ through dual gradient ascent. The updates are given by

$$x^{k+1} = \arg\min_{x} \ \mathcal{L}_\rho(x, z^k, \alpha^k), \qquad (8a)$$
$$z^{k+1} = \arg\min_{z} \ \mathcal{L}_\rho(x^{k+1}, z, \alpha^k), \qquad (8b)$$
$$\alpha_{i0}^{k+1} = \alpha_{i0}^k + \rho \left( x_i^{k+1} - z_{i0}^{k+1} \right), \quad \alpha_{0i}^{k+1} = \alpha_{0i}^k + \rho \left( x_0^{k+1} - z_{0i}^{k+1} \right). \qquad (8c)$$
However, the $x$-update in (8a) is an expectation minimization problem and hence nontrivial. To address this issue, [41] proposes to replace the augmented Lagrangian function with its stochastic counterpart, given by

$$\hat{\mathcal{L}}^k(x, z, \alpha) = \sum_{i \in \mathcal{R}} \left( \langle \nabla F(x_i^k, \xi_i^k), x_i \rangle + \frac{\|x_i - x_i^k\|^2}{2\eta^{k+1}} \right) + \langle \nabla f_0(x_0^k), x_0 \rangle + \frac{\|x_0 - x_0^k\|^2}{2\eta^{k+1}} + \lambda \sum_{i \in \mathcal{R}} \|z_{i0} - z_{0i}\|_1 + \sum_{i \in \mathcal{R}} \left( \langle \alpha_{i0}, x_i - z_{i0} \rangle + \langle \alpha_{0i}, x_0 - z_{0i} \rangle \right) + \frac{\rho}{2} \sum_{i \in \mathcal{R}} \left( \|x_i - z_{i0}\|^2 + \|x_0 - z_{0i}\|^2 \right), \qquad (9)$$

where $\xi_i^k$ is the random variable of worker $i$ at time $k$ and $\eta^{k+1}$ is the positive stepsize. Observe that (9) is a stochastic first-order approximation to (7), in the sense that the cost functions of the workers and the master are replaced by stochastic linearizations, plus proximal terms, at the points $x_i^k$ and $x_0^k$, respectively.

With the stochastic approximation, the explicit solutions of $x_i^{k+1}$ and $x_0^{k+1}$ are

(10)

For simplicity, we absorb the penalty parameter $\rho$ and the proximal weight $1/\eta^{k+1}$ into rescaled stepsizes. Thus, (10) is equivalent to

(11)
Simplification. Observe that the $z$-update in (8b) is also challenging, as the variables $z_{i0}$ and $z_{0i}$ are coupled by the TV norm penalty term. Next, we simplify the three-variable updates in (11), (8b) and (8c) to eliminate the $z$-update and obtain a more compact algorithm. Note that the decentralized deterministic ADMM can also be simplified to eliminate auxiliary variables [42]. However, we are considering the distributed stochastic ADMM, and the TV norm penalty term makes the simplification much more challenging.
Proposition 1 (Simplified stochastic ADMM). The three-variable updates (11), (8b) and (8c) can be reduced to a primal update (12) at every worker, a primal update (13) at the master, and a projected dual update (14) at every worker, in which the auxiliary variable $z$ and the multiplier pairs $(\alpha_{i0}, \alpha_{0i})$ are eliminated in favor of a single dual variable $\alpha_i \in [-\lambda, \lambda]^d$ per worker.
Proof.
See A.
Presence of Byzantine workers. Now we consider how the stochastic ADMM updates (12), (13) and (14) are implemented when the Byzantine workers are present. At time $k$, every regular worker $i$ updates $x_i^{k+1}$ with (12) and the dual variable $\alpha_i^{k+1}$ with (14), and then sends $\alpha_i^{k+1}$ to the master. Meanwhile, every Byzantine worker $j$ can cheat the master by sending a vector $\alpha_j^{k+1}$ whose elements are arbitrary within $[-\lambda, \lambda]$; otherwise, the Byzantine worker can be directly detected and eliminated by the master. This amounts to saying that every Byzantine worker follows an update rule similar to (14), as
$$\alpha_j^{k+1} = \mathrm{proj}_{[-\lambda, \lambda]^d} \left( v_j^k \right), \qquad (15)$$
where $v_j^k$ is an arbitrary vector. After receiving the messages $\alpha_i^{k+1}$ from the regular workers $i \in \mathcal{R}$ and $\alpha_j^{k+1}$ from the Byzantine workers $j \in \mathcal{B}$, the master updates $x_0^{k+1}$ via
$$x_0^{k+1} = x_0^k - \eta^{k+1} \Big( \nabla f_0(x_0^k) - \sum_{i \in \mathcal{R}} \alpha_i^{k+1} - \sum_{j \in \mathcal{B}} \alpha_j^{k+1} \Big). \qquad (16)$$
The Byzantine-robust stochastic ADMM for distributed learning is outlined in Algorithm 1 and illustrated in Figure 1. Observe that the communication and computation costs are the same as those of the stochastic subgradient method in [28]. The only extra cost is that every worker must store its dual variable $\alpha_i$.
Comparing the stochastic subgradient updates (5) with the stochastic ADMM updates (12), (13) and (14), we can observe a primal-dual connection. In the stochastic subgradient method, the workers upload primal variables $x_i^k$, while in the stochastic ADMM, the workers upload dual variables $\alpha_i^{k+1}$. The stochastic subgradient method controls the influence of a malicious message by the sign function: no matter what the malicious message is, its modification on each dimension is $\lambda$, $-\lambda$, or a value within $[-\lambda, \lambda]$ if the values of the malicious worker and the master are identical (all scaled by the stepsize). The stochastic ADMM controls the influence of a malicious message by the projection function, so that the modification of the malicious message on each dimension is within $[-\lambda, \lambda]$.
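The exact simplified updates (12)-(14) are stated in Proposition 1; the sketch below only illustrates the pattern discussed above, namely a linearized primal step at each worker, a dual step projected onto $[-\lambda, \lambda]^d$, and a master step driven by the received dual variables. The signs, the placement of $\rho$, and the oracle names are our assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def project_box(v, lam):
    """Element-wise projection onto [-lam, lam]^d; this is what caps a
    Byzantine worker's per-dimension influence on the master update."""
    return np.clip(v, -lam, lam)

def admm_round(x_i, x0, dual_i, duals_received, eta_w, eta_m, rho, lam,
               grad_F, grad_f0):
    """One assumed round of the Byzantine-robust stochastic ADMM."""
    # Worker i: linearized primal step with its stochastic gradient, its
    # dual variable, and a proximity term toward the master's variable.
    x_i = x_i - eta_w * (grad_F(x_i) + dual_i + rho * (x_i - x0))
    # Worker i: dual ascent followed by the box projection.
    dual_i = project_box(dual_i + rho * (x_i - x0), lam)
    # Master: aggregates the received duals (a Byzantine dual is arbitrary,
    # but the box projection bounds each of its entries by lam).
    x0 = x0 - eta_m * (grad_f0(x0) - sum(duals_received))
    return x_i, x0, dual_i
```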
Algorithm 1 (Byzantine-robust stochastic ADMM). Master: initialize $x_0^0$ and the stepsizes. Regular worker $i$: initialize $x_i^0$ and the dual variable $\alpha_i^0$. At every time $k$, the workers and the master exchange messages and update as in (12), (13) and (14); Figure 1 illustrates the message exchange.
4 Convergence Analysis
In this section, we analyze the convergence of the proposed Byzantine-robust stochastic ADMM. We make the following assumptions, which are common in analyzing distributed stochastic optimization algorithms.
Assumption 1 (Strong convexity).
The local cost functions $\mathbb{E}\left[F(x, \xi_i)\right]$ and the regularization term $f_0(x)$ are strongly convex with constants $\mu_F$ and $\mu_0$, respectively.
Assumption 2 (Lipschitz continuous gradients).
The local cost functions $\mathbb{E}\left[F(x, \xi_i)\right]$ and the regularization term $f_0(x)$ have Lipschitz continuous gradients with constants $L_F$ and $L_0$, respectively.
Assumption 3 (Bounded variance).
Within every worker $i$, the data sampling is i.i.d. across time, with $\xi_i^k$ following the local data distribution. The variance of the stochastic gradients is upper bounded by $\delta^2$, as

$$\mathbb{E} \left\| \nabla F(x, \xi_i^k) - \nabla \mathbb{E}\left[F(x, \xi_i)\right] \right\|^2 \le \delta^2. \qquad (17)$$
4.1 Main Results
First, we show the equivalence between (2) and (6). When the penalty parameter $\lambda$ is sufficiently large, it has been shown in Theorem 1 of [28] that the optimal primal variables of (6) are consensual and identical to the minimizer of (2). We repeat this conclusion in the following lemma.
Lemma 1 (Consensus and equivalence). Suppose that the penalty parameter $\lambda$ is sufficiently large. Then at any optimal solution of (6), the primal variables are consensual, i.e., $x_i^\star = x_0^\star$ for all $i \in \mathcal{R}$, and they are identical to the minimizer $\tilde{x}^\star$ of (2).
Intuitively, setting a sufficiently large penalty parameter $\lambda$ ensures the variables $x_i^\star$ and $x_0^\star$ to be consensual, since a larger $\lambda$ gives more weight to the consensus-promoting TV norm penalty. When the training data at the workers are non-i.i.d., the local expected gradients deviate from 0 at the optimum, which leads to a larger lower bound on $\lambda$ to maintain consensus. Once the variables are consensual, (6) is equivalent to (2).
Now, we present the main theorem on the convergence of the proposed Byzantine-robust stochastic ADMM.
Theorem 1 ($O(1/k)$-convergence). Suppose Assumptions 1, 2 and 3 hold, and let the stepsizes of the workers and the master diminish in the order of $O(1/k)$. Then the iterates of the Byzantine-robust stochastic ADMM converge in expectation to a bounded neighborhood of the optimal solution of (2) at a rate of $O(1/k)$, where the size of the neighborhood is determined by the number of Byzantine workers $q$ and the penalty parameter $\lambda$, as characterized by the bound in (18).
Proof.
See C.
Theorem 1 guarantees that if we choose the stepsizes for both the workers and the master in the order of $O(1/k)$, then the Byzantine-robust stochastic ADMM asymptotically approaches a bounded neighborhood of the optimal solution of (2) (which equals $x_i^\star$ and $x_0^\star$, according to Lemma 1) at an $O(1/k)$ rate. Note that the stepsizes are sensitive to their initial values [28]; therefore, we carefully tune the initial stepsizes in the numerical experiments. We also provide in D an ergodic convergence rate of $O(1/\sqrt{k})$ with $O(1/\sqrt{k})$ stepsizes.
In (18), the asymptotic learning error is determined by the number of Byzantine workers $q$ and the penalty parameter $\lambda$, which is the same as that of RSA [28]. When more Byzantine workers are present, $q$ is larger and the asymptotic learning error increases. Using a larger $\lambda$ helps consensus as indicated in Lemma 1, but incurs a higher asymptotic learning error. In the numerical experiments, we will empirically demonstrate the influence of $q$ and $\lambda$.
4.2 Comparison with RSA: Case Studies
The proposed Byzantine-robust stochastic ADMM and RSA [28] solve the same problem, while the former takes advantage of the separable problem structure. Below we briefly discuss the robustness of the two algorithms to different Byzantine attacks.
RSA is relatively sensitive to small perturbations. To perturb the update of $x_0^{k+1}$ in (5), Byzantine worker $j$ can generate a malicious $z_j^k$ that is very close to $x_0^k$, but its influence on each dimension is still $-\lambda$ or $\lambda$. Potentially, this attack is able to lead the update to move toward a given wrong direction. In contrast, for the Byzantine-robust stochastic ADMM, small perturbations on $\alpha_j^{k+1}$ change little in the update of $x_0^{k+1}$ in (16). To effectively attack the Byzantine-robust stochastic ADMM, Byzantine worker $j$ can set each element of $\alpha_j^{k+1}$ to be $\lambda$ or $-\lambda$ alternately, so that its influence on each dimension oscillates between $-\lambda$ and $\lambda$. In comparison, the influence of this attack on RSA is just $-\lambda$ or $\lambda$ on each dimension. However, these large oscillations are easy to distinguish by the master through screening the received messages. In addition, it is nontrivial for this attack to lead the update to move toward a given wrong direction.
Developing the Byzantine attacks that are most harmful to the Byzantine-robust stochastic ADMM and RSA, respectively, is beyond the scope of this paper. Instead, we give a toy example and develop two Byzantine attacks to justify the discussions above.
Example 1.
Consider a one-dimensional distributed machine learning task with two regular workers (numbered 1 and 2) and one Byzantine worker (numbered 3). The local cost functions are deterministic and quadratic. Therefore, the minimizer $\tilde{x}^\star$ of (2) is known in closed form and, by Lemma 1, coincides with the consensual optimal solution of (6). The local primal variables are initialized at their local optima for both algorithms. The local dual variables of the Byzantine-robust stochastic ADMM are initialized as $\alpha_i^0 = 0$ for $i \in \{1, 2\}$. We construct two simple attacks.
Small value attack. Byzantine worker 3 generates a message that deviates from the value it would honestly send by a small perturbation $\delta$, where $\delta$ is a perturbation parameter.
Large value attack. Byzantine worker 3 generates a message whose entries have very large magnitudes.
We choose the penalty parameter $\lambda$ and the ADMM parameter $\rho$, with diminishing stepsizes at the workers and the master. The perturbation parameter $\delta$ is set to a small positive value for the small value attack. Figure 2 shows the values of the local primal variables at the master and the regular workers. For both algorithms and both attacks, the master and the regular workers are able to asymptotically reach consensus, as asserted by Lemma 1. Under the small value attack, RSA has a larger asymptotic learning error than the Byzantine-robust stochastic ADMM, as we have discussed, while under the large value attack, both algorithms coincidentally have zero asymptotic learning error. In addition, we can observe that the Byzantine-robust stochastic ADMM is more stable than RSA under both attacks. A hedged re-creation of this example is sketched below.
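The following sketch re-creates the RSA side of Example 1 under the two attacks. The quadratics, their minimizers, and all constants are hypothetical stand-ins, since the example's exact values are not reproduced here; the point is the structure: the small value attack injects a constant-sign drift into the master's subgradient, while the large value attack's sign term simply saturates, so each Byzantine message moves the master by at most $\lambda$ per dimension and stepsize in both cases.

```python
import numpy as np

def rsa_toy(attack, T=5000, lam=0.5, delta=0.01):
    """Run RSA (5) on a toy problem with F_i(x) = (x - a_i)^2 (assumed),
    two regular workers, one Byzantine worker, and f_0 = 0."""
    a = np.array([-1.0, 1.0])     # regular workers' local optima (assumed)
    x = a.copy()                  # workers start at their local optima
    x0 = 0.0                      # master's variable
    for k in range(1, T + 1):
        eta = 0.3 / k             # diminishing stepsize (assumed)
        z3 = x0 - delta if attack == "small" else 1e6  # Byzantine message
        msgs = np.array([x[0], x[1], z3])
        x0 = x0 - eta * lam * np.sign(x0 - msgs).sum()          # master
        x = x - eta * (2.0 * (x - a) + lam * np.sign(x - x0))   # workers
    return x0

print(rsa_toy("small"), rsa_toy("large"))
```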
[Figure 2: The local primal variables at the master and the regular workers under the small value attack and the large value attack.]
5 Numerical Experiments
In this section, we evaluate the robustness of the proposed algorithm to various Byzantine attacks. We compare the proposed Byzantine-robust Stochastic ADMM with the following benchmark algorithms: (i) Ideal SGD without Byzantine attacks; (ii) SGD subject to Byzantine attacks; (iii) Geometric median stochastic gradient aggregation [18, 19]; (iv) Median stochastic gradient aggregation [18, 19]; (v) RSA [28]. All the parameters of the benchmark algorithms are hand-tuned to the best. Although the stochastic ADMM and RSA are rooted in the same problem formulation (4), they perform differently for the same value of $\lambda$ under Byzantine attacks, as we have observed in Example 1. Therefore, we hand-tune the best $\lambda$ for the stochastic ADMM and RSA, respectively. In the numerical experiments, we use two datasets, MNIST (http://yann.lecun.com/exdb/mnist) and COVERTYPE (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). The statistics of these datasets are shown in Table 1. We launch one master and 20 workers. In the i.i.d. case, we conduct experiments on both datasets by randomly and evenly splitting the data samples to the workers, while in the non-i.i.d. case we only use the MNIST dataset. Each regular worker uses a mini-batch of 32 samples to estimate the local gradient at each iteration. The loss functions of the workers are softmax regressions, and the regularization term is an $\ell_2$-norm regularizer. Performance is evaluated by the top-1 classification accuracy.
Table 1: Statistics of the datasets.

Name | Training Samples | Testing Samples | Attributes
---|---|---|---
COVERTYPE | 465264 | 115748 | 54
MNIST | 60000 | 10000 | 784
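As a reference for the setup above, each worker's local stochastic oracle is a mini-batch softmax-regression loss and gradient. The sketch below is our own plain NumPy version, not the paper's code; the regularization term is handled separately by the master.

```python
import numpy as np

def softmax_loss_grad(W, X, y):
    """Mini-batch softmax regression. W: (d, C); X: (b, d); y: (b,) labels."""
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    b = X.shape[0]
    loss = -np.log(P[np.arange(b), y]).mean()
    P[np.arange(b), y] -= 1.0                     # dL/dlogits
    return loss, X.T @ P / b

# Example: one worker's mini-batch of 32 MNIST-sized samples, 10 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 784))
y = rng.integers(0, 10, size=32)
loss, grad = softmax_loss_grad(np.zeros((784, 10)), X, y)  # loss == log(10)
```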
Gaussian attack. Under the Gaussian attack, at every iteration, every Byzantine worker sends to the master a random vector whose elements follow a Gaussian distribution with a hand-tuned standard deviation. Here we fix the number of Byzantine workers $q$. For Stochastic ADMM on the MNIST dataset, we set $\lambda = 0.5$ and hand-tune the remaining parameters to the best. As shown in Figure 3(a), SGD fails; Stochastic ADMM, RSA and Geometric median perform very similarly and are close to Ideal SGD, while Median is a little worse than the others. On the COVERTYPE dataset, we set $\lambda = 0.5$ and hand-tune the remaining parameters for Stochastic ADMM. As shown in Figure 3(b), SGD performs the worst, while Stochastic ADMM, Geometric median, and Median are close to Ideal SGD. Among all the Byzantine-robust algorithms, Stochastic ADMM has the fastest convergence speed.
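A sketch of the Gaussian attack message is below; the standard deviation is hand-tuned in the experiments and its exact value is not reproduced here, so sigma is left as a parameter.

```python
import numpy as np

def gaussian_attack(d, sigma, rng=None):
    """Byzantine message: a pure-noise vector of dimension d whose elements
    follow a zero-mean Gaussian with standard deviation sigma (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, sigma, size=d)
```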
[Figure 3: Top-1 accuracy under the Gaussian attack on (a) MNIST and (b) COVERTYPE.]
Sign-flipping attack. Under the sign-flipping attack, at every iteration, every Byzantine worker calculates its local variable, flips the sign by multiplying it with a negative constant $\epsilon$, and sends the result to the master. Here we fix $\epsilon$ and the number of Byzantine workers $q$. On the MNIST dataset, the parameters of Stochastic ADMM, including $\lambda$, $\rho$ and the stepsizes, are hand-tuned to the best. Figure 4(a) shows that SGD also fails in this situation; Stochastic ADMM, RSA, and Geometric median are close to Ideal SGD, and achieve better accuracy than Median. Figure 4(b) shows the performance on the COVERTYPE dataset, with the parameters of Stochastic ADMM hand-tuned likewise. Stochastic ADMM and RSA are close to Ideal SGD, and outperform Geometric median and Median.
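The sign-flipping attack is equally short to express; the scaling constant used in the experiments is not reproduced here, so epsilon = -4 below is a placeholder.

```python
def sign_flipping_attack(honest_message, epsilon=-4.0):
    """Byzantine message: the honestly computed local message scaled by a
    negative constant (epsilon is a placeholder, not the tuned value)."""
    return epsilon * honest_message
```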
[Figure 4: Top-1 accuracy under the sign-flipping attack on (a) MNIST and (b) COVERTYPE.]
Without Byzantine attack. We also investigate the attack-free case on both the MNIST and COVERTYPE datasets, as shown in Figure 5. In Figure 5(a), the parameters of Stochastic ADMM on the MNIST dataset are hand-tuned to the best. Without Byzantine attacks, the performance of Stochastic ADMM, RSA, and Geometric median is very similar to that of Ideal SGD, while Median is worse than the other Byzantine-robust algorithms. On the COVERTYPE dataset, the parameters of Stochastic ADMM are hand-tuned likewise. As shown in Figure 5(b), Stochastic ADMM is the best among all the algorithms, and RSA outperforms Geometric median and Median. We conclude that although Stochastic ADMM introduces bias to the updates, it still works well in the attack-free case.
[Figure 5: Top-1 accuracy without Byzantine attacks on (a) MNIST and (b) COVERTYPE.]
Impact of $\lambda$. Here we show how the performance of the proposed algorithm on the two datasets is affected by the choice of the penalty parameter $\lambda$. We use the sign-flipping attack in the numerical experiments and fix the number of Byzantine workers $q$. The parameters other than $\lambda$ are hand-tuned to the best. As depicted in Figure 6, on both datasets, the performance of Stochastic ADMM degrades when $\lambda$ is too small. The reason is that, when $\lambda$ is too small, the regular workers rely more on their local data, which weakens the collaboration within the distributed learning system [28]. Meanwhile, a $\lambda$ that is too large also leads to worse performance.
[Figure 6: Impact of the penalty parameter $\lambda$ under the sign-flipping attack on (a) MNIST and (b) COVERTYPE.]
Non-i.i.d. data. To demonstrate the robustness of the proposed algorithm against Byzantine attacks on non-i.i.d. data, we redistribute the MNIST dataset by letting every two workers share one digit; a sketch of this partition is given below. All Byzantine workers choose one regular worker, indexed by $i'$, and send that worker's message to the master at every iteration. When the number of Byzantine workers is 4, the best reachable accuracy is around 0.8, because of the absence of two handwritten digits' data. Similarly, when the number of Byzantine workers is 8, the best reachable accuracy is around 0.6. The parameters of Stochastic ADMM are hand-tuned to the best in both cases. As shown in Figure 7(a), Median fails, while Stochastic ADMM is close to Ideal SGD and outperforms all the other Byzantine-robust algorithms. When the number of Byzantine workers increases to 8, as depicted in Figure 7(b), Geometric median and Median fail because the stochastic gradients of regular worker $i'$ dominate, such that only one digit can be recognized. Stochastic ADMM and RSA both work well, but Stochastic ADMM converges faster than RSA.
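The non-i.i.d. partition described above, with 20 workers and every two workers sharing one digit, can be sketched as follows; the pairing of workers to digits is our own convention.

```python
import numpy as np

def noniid_split(labels, seed=0):
    """Assign MNIST sample indices to 20 workers so that workers (2j, 2j+1)
    jointly hold all samples of digit j."""
    rng = np.random.default_rng(seed)
    assignment = {}
    for digit in range(10):
        idx = np.flatnonzero(labels == digit)
        rng.shuffle(idx)
        first, second = np.array_split(idx, 2)
        assignment[2 * digit] = first        # first worker of the pair
        assignment[2 * digit + 1] = second   # second worker of the pair
    return assignment
```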
[Figure 7: Top-1 accuracy on non-i.i.d. MNIST data with 4 Byzantine workers (a) and 8 Byzantine workers (b).]
6 Conclusions
We proposed a stochastic ADMM to solve the distributed learning problem under Byzantine attacks. We considered a TV norm-penalized approximation formulation to handle the Byzantine attacks. Theoretically, we proved that the stochastic ADMM converges in expectation to a bounded neighborhood of the optimum at an $O(1/k)$ rate under mild assumptions. Numerically, we compared the proposed algorithm with other Byzantine-robust algorithms on two real datasets, demonstrating the competitive performance of the Byzantine-robust stochastic ADMM.
Acknowledgement. Qing Ling is supported in part by NSF China Grant 61973324, Fundamental Research Funds for the Central Universities, and Guangdong Province Key Laboratory of Computational Science Grant 2020B1212060032. A preliminary version of this paper has appeared in IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 4–8, 2020.
References
- [1] R. Agrawal and R. Srikant, “Privacy-preserving Data Mining,” Proceedings of ACM SIGMOD, 2000.
- [2] S. Sicari, A. Rizzardi, L. Grieco, and A. Coen-Porisini, “Security, Privacy and Trust in Internet of Things: The Road Ahead,” Computer Networks, vol. 76, pp. 146–164, 2015.
- [3] L. Zhou, K. Yeh, G. Hancke, Z. Liu, and C. Su, “Security and Privacy for the Industrial Internet of Things: An Overview of Approaches to Safeguarding Endpoints,” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 76–87, 2018.
- [4] J. Konecny, H. McMahan, and D. Ramage, “Federated Optimization: Distributed Optimization Beyond the Datacenter,” arXiv: 1511.03575, 2015.
- [5] J. Konecny, H. McMahan, F. Yu, P. Richtarik, A. Suresh, and D. Bacon, “Federated Learning: Strategies for Improving Communication Efficiency,” arXiv: 1610.05492, 2016.
- [6] P. Kairouz and H. McMahan, “Advances and Open Problems in Federated Learning,” Foundations and Trends in Machine Learning, vol. 14, no. 1, 2021.
- [7] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382–401, 1982.
- [8] N. Lynch, Distributed Algorithms, Morgan Kaufmann Publishers, San Francisco, USA, 1996.
- [9] A. Vempaty, L. Tong, and P. K. Varshney, “Distributed Inference with Byzantine Data: State-of-the-Art Review on Data Falsification Attacks,” IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 65–75, 2013.
- [10] Y. Chen, S. Kar, and J. M. F. Moura, “The Internet of Things: Secure Distributed Inference,” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 64–75, 2018.
- [11] P. Blanchard, E. M. E. Mhamdi, R. Guerraoui, and J. Stainer, “Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent,” Proceedings of NeurIPS, 2017.
- [12] S. Li, Y. Cheng, W. Wang, Y. Liu, and T. Chen, “Abnormal Client Behavior Detection in Federated Learning,” arXiv: 1910.09933, 2019.
- [13] S. Li, Y. Cheng, W. Wang, Y. Liu, and T. Chen, “Learning to Detect Malicious Clients for Robust Federated Learning,” arXiv: 2002.00211, 2020.
- [14] N. Ravi and A. Scaglione, “Detection and Isolation of Adversaries in Decentralized Optimization for Non-Strongly Convex Objectives,” Proceedings of IFAC Workshop on Distributed Estimation and Control in Networked Systems, 2019.
- [15] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine Stochastic Gradient Descent,” Proceedings of NeurIPS, 2018.
- [16] Z. Allen-Zhu, F. Ebrahimianghazani, J. Li, and D. Alistarh, “Byzantine-Resilient Non-Convex Stochastic Gradient Descent,” Proceedings of ICLR, 2021.
- [17] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient Distributed Training via Redundant Gradients,” Proceedings of ICML, 2018.
- [18] Y. Chen, L. Su, and J. Xu, “Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2018.
- [19] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine Tolerant SGD,” arXiv: 1802.10116, 2018.
- [20] X. Cao and L. Lai, “Distributed Approximate Newton’s Method Robust to Byzantine Attackers,” IEEE Transactions on Signal Processing, vol. 68, pp. 6011–6025, 2020.
- [21] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust Distributed Learning: Towards Optimal Statistical Rates,” Proceedings of ICML, 2018.
- [22] C. Xie, S. Koyejo, and I. Gupta, “Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance,” Proceedings of ICML, 2019.
- [23] C. Xie, O. Koyejo, and I. Gupta, “Phocas: Dimensional Byzantine-resilient Stochastic Gradient Descent,” arXiv: 1805.09682, 2018.
- [24] E. M. E. Mhamdi, R. Guerraoui, and S. Rouault, “The Hidden Vulnerability of Distributed Learning in Byzantium,” Proceedings of ICML, 2018.
- [25] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, “The Non-IID Data Quagmire of Decentralized Machine Learning,” Proceedings of ICML, 2020.
- [26] L. He, S. P. Karimireddy, and M. Jaggi, “Byzantine-robust Learning on Heterogeneous Datasets via Resampling,” arXiv: 2006.09365, 2020.
- [27] J. Peng, Z. Wu, Q. Ling, and T. Chen, “Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data,” arXiv: 2009.08161, 2020.
- [28] L. Li, W. Xu, T. Chen, G. Giannakis, and Q. Ling, “RSA: Byzantine-robust Stochastic Aggregation Methods for Distributed Learning from Heterogeneous Datasets,” Proceedings of AAAI, 2019.
- [29] Z. Wu, Q. Ling, T. Chen, and G. Giannakis, “Federated Variance-Reduced Stochastic Gradient Descent with Robustness to Byzantine Attacks,” IEEE Transactions on Signal Processing, vol. 68, pp. 4583–4596, 2020.
- [30] E. M. E. Mhamdi, R. Guerraoui, and S. Rouault, “Distributed Momentum for Byzantine-resilient Learning,” Proceedings of ICLR, 2021.
- [31] P. Khanduri, S. Bulusu, P. Sharma, and P. Varshney, “Byzantine Resilient Non-Convex SVRG with Distributed Batch Gradient Computations,” arXiv: 1912.04531, 2019.
- [32] S. P. Karimireddy, L. He, and M. Jaggi, “Learning from History for Byzantine Robust Optimization,” Proceedings of ICML, 2021.
- [33] G. Damaskinos, E. M. E. Mhamdi, R. Guerraoui, R. Patra, and M. Taziki, “Asynchronous Byzantine Machine Learning (the Case of SGD),” Proceedings of ICML, 2018.
- [34] Y. Yang and W. Li, “BASGD: Buffered Asynchronous SGD for Byzantine Learning,” arXiv: 2003.00937, 2020.
- [35] C. Xie, S. Koyejo, and I. Gupta, “Zeno++: Robust Fully Asynchronous SGD,” Proceedings of ICML, 2020.
- [36] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning,” Proceedings of ICML, 2019.
- [37] Z. Yang and W. U. Bajwa, “ByRDiE: Byzantine-Resilient Distributed Coordinate Descent for Decentralized Learning,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 4, pp. 611–627, 2019.
- [38] Z. Yang and W. U. Bajwa, “BRIDGE: Byzantine-resilient Decentralized Gradient Descent,” arXiv: 1908.08098, 2019.
- [39] S. Guo, T. Zhang, X. Xie, L. Ma, T. Xiang, and Y. Liu, “Towards Byzantine-resilient Learning in Decentralized Systems,” arXiv: 2002.08569, 2020.
- [40] J. Peng, W. Li, and Q. Ling, “Byzantine-robust Decentralized Stochastic Optimization over Static and Time-varying Networks,” Signal Processing, vol. 183, no. 108020, 2021.
- [41] H. Ouyang, N. He, and A. Gray, “Stochastic ADMM for Nonsmooth Optimization,” arXiv: 1211.0632, 2012.
- [42] W. Ben-Ameur, P. Bianchi, and J. Jakubowicz, “Robust Distributed Consensus Using Total Variation,” IEEE Transactions on Automatic Control, vol. 61, no. 6, pp. 1550–1564, 2016.
- [43] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, Boston, USA, 2004.
Appendix A Proof of Proposition 1
The proof of Proposition 1 relies on the following lemma.
Lemma 2.
Let , where . Then
(19)
Proof.
Note that the minimizers, together with their difference, are also optimal to the bi-level minimization problem described below. That is, we first solve an inner constrained minimization problem with an artificially imposed constraint on the difference, and then optimize over that difference.

For the inner-level constrained minimization problem, from its KKT (Karush-Kuhn-Tucker) conditions we obtain the minimizer in closed form and, accordingly, the optimal value. Therefore, for the outer-level unconstrained minimization problem, from its KKT conditions we obtain the minimizer as well. Substituting these results back yields (19) and completes the proof.
Now we begin to prove Proposition 1. Since (8b) is separable with respect to the pairs of auxiliary variables $(z_{i0}, z_{0i})$, it is equivalent to

(20)
According to Lemma 2, (20) leads to

(21a)
(21b)
Appendix B Supporting Lemmas
Lemma 3 (Optimality conditions of (6)).
The necessary and sufficient optimality conditions of (6) are

(26)

for all $i \in \mathcal{R}$. In particular, we have for all $i \in \mathcal{R}$ that

(27)
Proof.
Lemma 5.
Proof.
We first show the inequality in (30), and then modify it to prove (31). The relationships derived in the proof of Lemma 2 are useful here. By those relationships, the right-hand side of (21b) can be written in closed form; combining this fact with (21a) yields

(32)

Recall that in (20), we minimize the function

with respect to the auxiliary variables. From the first-order optimality condition, there exists a subgradient of the nonsmooth term at the minimizer, such that

(33)

Applying the definition of the subgradient at the two points of interest gives

(34)
(35)

where the subgradient set is defined in Lemma 3. Summing up (34) and (35), we have

where the last equality comes from (32) and (33). Rearranging the terms gives (30).
Appendix C Proof of Theorem 1
Restatement of Theorem 1. Suppose Assumptions 1, 2, and 3 hold, and let the stepsizes of the workers and the master diminish in the order of $O(1/k)$ with suitably chosen positive initial constants. Then we have

(36)

where the expectation is taken over the random samples. Consequently, it holds

(37)

with a constant determined by the number of Byzantine workers $q$ and the penalty parameter $\lambda$.
Proof.
Recall that the updates satisfy
(38)
(39)
(40)
Step 1. At the master side, we have

(41)

For the second term in (41), the Young's inequality $2\langle a, b \rangle \le \beta \|a\|^2 + \beta^{-1} \|b\|^2$ for $\beta > 0$ gives

(42)

Applying the Young's inequality to the third term in (41) with a proper $\beta$ yields

(43)

Substituting (42) and (43) into (41) gives

(44)

where the last inequality comes from the choice of the stepsizes.
Step 2. Accordingly, at the regular worker side, we have for any $i \in \mathcal{R}$ that

(45)

For the second term in (45), the Young's inequality gives that

(46)

where the last inequality comes from (17) and (40). Then the third term in (45) can be upper-bounded as

(47)

where the first equality comes from taking the expectation of the conditional expectation; that is, $\mathbb{E}[\cdot] = \mathbb{E}\left[\mathbb{E}[\cdot \mid \mathcal{F}^k]\right]$, with $\mathcal{F}^k$ denoting the sigma-field generated by the random samples up to time $k$.
Step 3. Now combine (44) with (48). Using the shorthand notation introduced above, we have

(49)

For the last term in (49), notice that

(50)

where the first equality comes from Corollary 1. For the first term in (50), Lemma 5 suggests that

(51)

For the second term in (50), the projection operator in the dual update gives that

(52)

provided that the dual variables are initialized within $[-\lambda, \lambda]^d$. For the third term in (50), we apply the standard identity $2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$ to obtain

(53)
Appendix D $O(1/\sqrt{k})$-ergodic convergence

Theorem 2. Suppose Assumptions 1, 2, and 3 hold, and let the stepsizes of the workers and the master be in the order of $O(1/\sqrt{k})$. Then the ergodic averages of the iterates converge in expectation to a bounded neighborhood of the optimal solution of (2) at a rate of $O(1/\sqrt{k})$.
Proof.
Step 1. At the master side, we still have (41), in the form of

(57)

For the second term in (57), we also have (42), which can be further bounded by

(58)

Here we use the fact that $f_0$ has Lipschitz continuous gradients in Assumption 2. In addition, using the fact that $f_0$ is strongly convex in Assumption 1 leads to

(59)

Applying the Young's inequality to the third term in (57) with a proper $\beta$ yields

(60)

Substituting (58) and (60) into (57) gives

(61)

where the last inequality comes from the choice of the stepsizes.
Step 2. Accordingly, at the worker side, we have for any $i \in \mathcal{R}$ that (45) holds, as

(62)

For the second term, we still have (46), which can be further bounded by

(63)

Here we use the fact that $\mathbb{E}\left[F(\cdot, \xi_i)\right]$ has Lipschitz continuous gradients in Assumption 2. Then, using the strong convexity of $\mathbb{E}\left[F(\cdot, \xi_i)\right]$ in Assumption 1, we bound the third term in (62) as

(64)

Substituting (63) and (64) into (62) gives

(65)

where the last inequality holds from the bound on the stepsizes, such that
Step 3. Denote the relevant constants and define a Lyapunov function accordingly. From (31), we have

(66)

Consequently, combining (61), (65), and (66) together gives

(67)

where the constants collect the attack-dependent terms. Summing up (67) from $k = 1$ to $K$, we obtain

(68)

Dividing both sides by $K$ gives

(69)

The convexity of the local cost functions and the regularization term leads to a bound on the ergodic averages. Combining this inequality and (69), we complete the proof.