Byzantine-Robust Decentralized Stochastic Optimization with Stochastic Gradient Noise-Independent Learning Error

Jie Peng

{}^{\text{a}}

, Weiyu Li

{}^{\text{b}}

, Qing Ling

{}^{\text{a},}

^*^**Corresponding author. E-mail address: [email protected]

{}^{\text{a}}

School of Computer Science and Engineering and Guangdong Provincial Key Laboratory of Computationa Science, Sun Yat-Sen University, Guangzhou, Guangdong 510006, China

{}^{\text{b}}

School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138, USA

Abstract

This paper studies Byzantine-robust stochastic optimization over a decentralized network, where every agent periodically communicates with its neighbors to exchange local models, and then updates its own local model by stochastic gradient descent (SGD). The performance of such a method is affected by an unknown number of Byzantine agents, which conduct adversarially during the optimization process. To the best of our knowledge, there is no existing work that simultaneously achieves a linear convergence speed and a small learning error. We observe that the learning error is largely dependent on the intrinsic stochastic gradient noise. Motivated by this observation, we introduce two variance reduction methods, stochastic average gradient algorithm (SAGA) and loopless stochastic variance-reduced gradient (LSVRG), to Byzantine-robust decentralized stochastic optimization for eliminating the negative effect of the stochastic gradient noise. The two resulting methods, BRAVO-SAGA and BRAVO-LSVRG, enjoy both linear convergence speeds and stochastic gradient noise-independent learning errors. Such learning errors are optimal for a class of methods based on total variation (TV)-norm regularization and stochastic subgradient update. We conduct extensive numerical experiments to demonstrate their effectiveness under various Byzantine attacks.

keywords:

Decentralized stochastic optimization, variance reduction, Byzantine-robustness

1 Introduction

Decentralized stochastic optimization has attracted immense research interest in recent years, and found various applications in signal processing [1, 2, 3], machine learning [4, 5], systems control [6, 7], communications and networking [8, 9], to name a few. Let us consider a decentralized network with $N$ agents. The finite-sum form of the decentralized stochastic optimization problem is given by

\displaystyle\min_{\tilde{x}\in{\mathbb{R}}^{p}}F(\tilde{x}):=\frac{1}{N}\sum_{w=1}^{N}F_{w}(\tilde{x}),\quad\text{with}\quad F_{w}(\tilde{x}):=\frac{1}{J}\sum_{j=1}^{J}F_{w,j}(\tilde{x}).

(1)

Here, $\tilde{x}\in{\mathbb{R}}^{p}$ is the model to optimize, $F(\tilde{x})$ is the global cost that averages $N$ local costs $F_{w}(\tilde{x})$ , and $F_{w}(\tilde{x})$ is the local cost of agent $w$ that averages $J$ local sample costs $F_{w,j}(\tilde{x})$ . Without loss of generality, we consider that every agent has the same number of samples. At every iteration, every agent communicates with its neighbors to obtain their local models, aggregates them, and updates its own local model with one or a mini-batch of local samples [10, 11, 12]. This is different to decentralized deterministic optimization where all the local samples are used in every computation step [13, 14, 15], and hence fits for scenarios where the number of local samples on every agent is large.

However, during the operation of the decentralized system, a number of the agents may become malfunctioning or even behave adversarially due to various uncertainties, such as data corruptions, network failures and external malicious attacks. They can send destructive messages to their neighbors in the communication steps, and hence bias the optimization process. One popular way to describe such uncertainties is the Byzantine attacks model [16, 17]. The malfunctioning or adversarial agents are termed as Byzantine agents, and the sent destructive messages are termed as Byzantine attacks. This scenario is challenging since the number and identities of the Byzantine agents are unknown, while the Byzantine attacks can be arbitrarily malicious. In this paper, our goal is to develop decentralized stochastic optimization methods that are robust to Byzantine attacks.

1.1 Related works

Existing decentralized deterministic optimization methods, such as the classical decentralized gradient descent (DGD) [6], generally use weighted mean to aggregate neighboring models and are hence vulnerable to Byzantine attacks. The remedy is to replace weighted mean aggregation by robust aggregation. The works of [18, 19, 20] study Byzantine-robust decentralized deterministic optimization with scalar models, using trimmed mean to replace weighted mean. When the models are vectors, ByRDiE [21] performs coordinate gradient descent, and applies trimmed mean [18] to filter out potential malicious elements on every dimension. A centerpoint-based method is proposed in [22] to guarantee that the aggregated model always lies in the convex hull of the regular neighbors’ models. However, for high-dimensional problems, these two methods are not suitable. Coordinate gradient descent used in ByRDiE is slow, while the centerpoint-based method requires the number of neighbors for every agent to be larger than the problem dimension. BRIDGE [23] extends the work of ByRDiE [21], allowing all coordinates to be updated at one time with the help of gradient descent. The work of [24] provides a general algorithmic framework that includes BRIDGE as a special case, and establishes the linear convergence speed and approximate consensus. In [25, 26], every agent aggregates the signs of the differences between its own model and the neighbors’ such that the influence of Byzantine attacks is limited. The work of [27] considers both gradient aggregation and model aggregation. It combines two techniques, comparative gradient elimination to select relatively reliable neighboring gradients, and coordinate-wise trimmed mean to aggregate neighboring models. But [27] requires the network to be fully connected.

Although there are an increasing number of methods, such as decentralized parallel stochastic gradient descent (DPSGD) [4, 6], for decentralized stochastic optimization, few of them consider Byzantine-robustness. Motivated by [25] and [26], [28] introduces a total variation (TV)-norm penalized formulation to force the local models to be close, but allow possible outliers. Applying the stochastic subgradient method to solve this TV-norm penalized formulation yields a Byzantine-robust algorithm, and its sublinear convergence is established. The work of [29] proposes a robust decentralized centered-clipping rule to aggregate the received local models. The work of [30] provides a principled guideline for designing proper robust aggregation rules in Byzantine-robust decentralized stochastic optimization. It also proposes iterative outlier scissor as one of such robust aggregation rules. However, the robust aggregation rules introduced in [29] and [30] are computationally costly. In addition, the convergence rates established in [29] and [30] are both sublinear.

On the other hand, Byzantine-robust distributed stochastic optimization has been a popular topic. Therein, a central server maintains a single global model and aggregates messages from the distributed agents. Most of the methods modify distributed stochastic gradient descent (SGD) by robustly aggregating the stochastic gradients sent from the agents [31, 32]. Such robust aggregation rules include coordinate-wise median, geometric median, trimmed mean, etc. Byzantine-robust stochastic second-order methods have also been developed in [33]. To enhance Byzantine-robustness, [34] assumes that the central server has extra clean data, and uses the clean data to distinguish malicious messages from the true stochastic gradients. However, it is difficult to generalize the above-mentioned works to the decentralized scenario, since there is no central server to collect and aggregate messages in a decentralized network.

In stochastic optimization, the noise induced by calculating stochastic gradients plays a critical role and affects convergence. For reducing the negative effect of stochastic gradient noise, variance reduction methods have been employed in decentralized stochastic optimization [35, 36, 37]. In Byzantine-robust distributed stochastic optimization, stochastic gradient noise also significantly influences Byzantine-robustness. With large stochastic gradient noise, the regular stochastic gradients are statistically very different, which is conducive for the Byzantine agents to hide themselves and perform attacks [38]. In light of this observation, [38] proposes to use the stochastic average gradient algorithm (SAGA) to eliminate the impact of inner variation, namely, the stochastic gradient noise caused by the heterogeneity of every regular agent’s local samples. The works of [39, 40] show that momentum acceleration can effectively reduce the impact of stochastic gradient noise and enhance Byzantine-robustness. Aiming at non-convex problems, [41] combines the stochastic variance-reduced gradient (SVRG) method and robust aggregation, and empirically demonstrates the performance. However, none of the existing works considers variance reduction in Byzantine-robust decentralized stochastic optimization. Due to the absence of the central server, there is no common model in the decentralized system, and the regular agents have to calculate their stochastic gradients at different local models. Thus, the statistical difference between the regular agents’ local models is enlarged, and variance reduction methods are more difficult to analyze.

Another relevant line of research in the signal processing society is Byzantine-robust decentralized detection. An adaptive threshold is used in [42] to screen malicious messages and alleviate the negative effect of Byzantine agents. The alternating direction method of multipliers (ADMM) is applied in [43], where the trimmed mean aggregation rule is adopted to defend against Byzantine attacks. The work of [44] considers identifying Byzantine models under data falsification attacks. For a comprehensive survey, readers are referred to [17].

1.2 Our contributions

Although the methods proposed in [28, 29, 30] are proven effective in handling Byzantine attacks for decentralized stochastic optimization, the established convergence speeds are sublinear. In particular, for the decentralized Byzantine-robust stochastic aggregation (DRSA) algorithm proposed in [28], with a constant step size, it exhibits a linear convergence speed but the learning error is suboptimal and can be significant (see Section II). With a diminishing step size, DRSA achieves an optimal learning error in order (see Section IV), but only has a sublinear convergence speed. We reveal that the underlying reason of such an unsatisfactory trade-off is the existence of stochastic gradient noise. In this context, our main contributions are as follows.

(C1) In light of the unsatisfactory trade-off between convergence speed and learning error of the existing methods, this paper proposes BRAVO, Byzantine-Robust decentralized stochastic optimization Aided with Variance reductiOn. We introduce two variance reduction methods, SAGA and loopless SVRG (LSVRG), to Byzantine-robust decentralized stochastic optimization for eliminating the negative effect of the stochastic gradient noise. The two methods, BRAVO-SAGA and BRAVO-LSVRG, enjoy both linear convergence speeds and stochastic gradient noise-independent learning errors. We further show that the learning errors are optimal in order for any subgradient-based method that solves the TV-norm penalized decentralized stochastic optimization problem.

(C2) The analysis of BRAVO-SAGA and BRAVO-LSVRG is quite different to that in [28] due to the applications of the variance reduction techniques. The analysis is also more challenging than those of the Byzantine-robust distributed variance-reduced stochastic optimization methods [38], due to the absence of the central server and the existence of the non-smooth TV-norm penalty term. In addition, [38] aggregates all stochastic gradients corrected by variance reduction, while this paper considers aggregation of neighboring models. To tackle the above challenges, we introduce two novel Lyapunov functions that aim to handle the corrected stochastic gradients evaluated at different local models, and take advantages of the bounded subgradients of the TV-norm.

(C3) We conduct extensive numerical experiments to demonstrate the effectiveness of the proposed methods.

2 Problem Formulation

We consider a decentralized undirected network $\mathcal{G}=(\mathcal{V},\mathcal{E})$ of $N$ agents, in which $\mathcal{V}:=\{1,\cdots,N\}$ represents the set of agents and $\mathcal{E}$ represents the set of edges. If $e:=(w,v)\in\mathcal{E}$ , then agents $w$ and $v$ are neighbors and can communicate with each other. We consider the Byzantine attacks model where a number of agents might be Byzantine and behave arbitrarily. Note that the number and identities of the Byzantine agents are unknown. Denote $\mathcal{R}$ as the set of regular agents with $|\mathcal{R}|=R$ and $\mathcal{B}$ as the set of Byzantine agents with $|\mathcal{B}|=B$ . Then, $\mathcal{V}=\mathcal{R}\cup\mathcal{B}$ and $N=R+B$ . For agent $w$ , denote $\mathcal{R}_{w}$ and $\mathcal{B}_{w}$ as the sets of regular neighbors and Byzantine neighbors, respectively.

Since the Byzantine agents can always arbitrarily manipulate their local samples, it is impossible to solve (1). Therefore, a reasonable goal is to find a minimizer to the averaged local costs of the regular agents, given by

\displaystyle\tilde{x}^{*}=\arg\min_{\tilde{x}\in{\mathbb{R}}^{p}}\frac{1}{R}\sum_{w\in\mathcal{R}}F_{w}(\tilde{x}).

(2)

The main difficulty in solving (2) is that the Byzantine agents will send arbitrarily malicious messages to their neighbors, such that the optimization process is uncontrollable.

Before going further, we make a common assumption in Byzantine-robust decentralized optimization that the network composed of all regular agents is connected [28]. Intuitively, if one regular agent is surrounded by Byzantine agents, the best model it can expect is the one trained with its own local samples, and can be far away from the true model $\tilde{x}^{*}$ . Denote $\mathcal{E}_{R}$ as the set of edges that are not attached to any Byzantine agent. The assumption is as follows.

Assumption 1.

(Network Connectivity) The network composed of all regular agents, denoted as $(\mathcal{R},\mathcal{E}_{R})$ , is connected.

Let each regular agent $w\in\mathcal{R}$ have a local model $x_{w}\in\mathbb{R}^{p}$ and use $x\in\mathbb{R}^{pR}$ to stack all local models of the regular agents. With the connected regular network $(\mathcal{R},\mathcal{E}_{R})$ , we can rewrite (2) to a consensus-constrained form, given by

	$\displaystyle\min_{x\in\mathbb{R}^{pR}}~{}$	$\displaystyle\frac{1}{R}\sum_{w\in\mathcal{R}}F_{w}(x_{w}).$		(3)
	s.t.	$\displaystyle\ x_{w}=x_{v},\quad\forall w\in\mathcal{R},~{}\forall v\in\mathcal{R}_{w}.$

Given an optimal solution $\tilde{x}^{*}$ to (2), a longer vector $x^{*}$ that stacks $R$ vectors $\tilde{x}^{*}$ is also an optimal solution to (3).

2.1 Prior work of DRSA

Following the line of [25, 26, 28], we penalize the consensus constraints in (3) by a TV-norm term, which yields

\displaystyle x^{*}=\arg\min_{x\in\mathbb{R}^{pR}}\frac{1}{R}\sum_{w\in\mathcal{R}}\left(F_{w}(x_{w})+\frac{\lambda}{2}\sum_{v\in\mathcal{R}_{w}}\|x_{w}-x_{v}\|_{1}\right),

(4)

with $\lambda>0$ being the penalty parameter. The TV-norm penalty term forces the local models to be close, but allows possible outliers. Note that [25] and [26] consider decentralized deterministic optimization, while [28] investigates decentralized stochastic optimization in the expectation minimization form, not in the finite-sum minimization form here. Nevertheless, the DRSA algorithm in [28] can still be applied to solve (4). When the Byzantine agents are absent, applying the stochastic subgradient method to solve (4) yields the update

\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha^{k}\left(F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\right),

(5)

for any regular agent $w\in\mathcal{R}$ at time $k$ . Therein, $\alpha^{k}>0$ is the step size, and $i_{w}^{k}$ is the local sample index (one sample or a mini-batch of samples) selected by regular agent $w$ at time $k$ . To run (5), every regular agent $w\in\mathcal{R}$ needs to collect its neighboring models. However, when the Byzantine agents appear, (5) is not applicable since every regular agent $w\in\mathcal{R}$ does not know the identities of its Byzantine neighbors, while the Byzantine neighbors may send arbitrarily malicious messages for the sake of disturbing the optimization process. Denote $z_{v}^{k}\in\mathbb{R}^{p}$ as an arbitrary model that Byzantine agent $v\in\mathcal{B}$ sends to all neighbors. In fact, it can send different arbitrary models to different neighbors, but for notational convenience we use the same $z_{v}^{k}$ , which does not affect algorithm development and theoretical analysis. In this case, for regular agent $w\in\mathcal{R}$ , the update rule of DRSA is changed from (5) to

\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha^{k}\Bigg{(}F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})

\displaystyle+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Bigg{)}.

(6)

2.2 Weakness of DRSA

DRSA is a stochastic subgradient-based method and uses a decreasing step size by default. According to Theorem 2 in [28], the theoretical convergence rate of DRSA is sublinear, which is slow. With a proper constant step size, DRSA has a linear convergence rate, but the corresponding learning error is relative to the variance of stochastic gradients, which could be large when the local samples of each regular agent are statistically very different.

Here we illustrate the impact of stochastic gradient noise on DRSA with a simple decentralized stochastic least-squares problem. As shown in Fig. 1, when the batch size is small, the stochastic gradient noise is large and the iterate $x^{k}$ converges to a neighborhood far away from the optimal solution $x^{*}$ . As the batch size increases, the convergence error correspondingly decreases. However, a larger batch size means that more stochastic gradients must be calculated at every time, which is expensive. Inspired by this observation, we propose to utilize computationally cheap variance reduction techniques to reduce the stochastic gradient noise, and in consequence, lower the learning error.

Refer to caption — Figure 1: Squared distance between the iterate $x^{k}$ and the optimal solution $x^{*}$ under different batch sizes for DRSA. The problem dimension is $p=1$ . The network is a complete graph with $N=4$ agents, and every regular agent has $J=10000$ scalar samples $d_{w,j}$ following the standard normal distribution $\mathcal{N}(0,1)$ . The local cost of every regular agent $w\in\mathcal{R}$ is in the least-squares form $F_{w}(x_{w})=\frac{1}{J}\sum_{j=1}^{J}F_{w,j}(x_{w})=\frac{1}{2J}\sum_{j=1}^{J}(x_{w}-d_{w,j})^{2}$ . One of the agents is Byzantine, and performs Gaussian attacks by sending messages following Gaussian distribution $\mathcal{N}(0,100^{2})$ . A large batch size corresponds to small stochastic gradient noise. The step size is $\alpha=0.0008$ . The penalty parameter is $\lambda=0.005$ .

3 Our Proposed Methods

In this section, we propose two decentralized stochastic optimization methods equipped with variance reduction techniques to boost Byzantine-robustness, termed as BRAVO-SAGA and BRAVO-LSVRG. Similar to other works on stochastic optimization [11, 36], we need to assume that the samplings at the regular agents are independent and uniformly random.

Assumption 2.

(Independent and Uniformly Random Sampling) At time $k$ , every regular agent $w\in\mathcal{R}$ independently and uniformly randomly select a sample with index $i_{w}^{k}\in\{1,\cdots,J\}$ .

In BRAVO-SAGA, every regular agent $w\in\mathcal{R}$ maintains a stochastic gradient table in which every item corresponds to a local sample, and updates an item in the table when the corresponding sample is selected to calculate the stochastic gradient. Specifically, let

\displaystyle\phi^{k+1}_{w,j}=\left\{\begin{aligned} &\phi^{k}_{w,j},&&j\neq i_{w}^{k},\\ &x_{w}^{k},&&j=i_{w}^{k},\end{aligned}\right.

(7)

where $\phi^{k}_{w,j}$ represents the most recent local model used in calculating $F^{\prime}_{w,j}$ on agent $w$ , prior to time $k$ . If the selected sample index $i_{w}^{k}=j$ at time $k$ , then $\phi^{k+1}_{w,j}=x_{w}^{k}$ ; otherwise $\phi^{k+1}_{w,j}=\phi^{k}_{w,j}$ . Therefore, $F^{\prime}_{w,j}(\phi^{k}_{w,j})$ refers to the most recent stochastic gradient for sample $j$ on agent $w$ , prior to time $k$ .

Rather than using a stochastic gradient $F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})$ to update its local model, here every regular agent $w\in\mathcal{R}$ calculates a corrected stochastic gradient $g_{w}^{k}$ , which is defined as

\displaystyle g_{w}^{k}=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})+\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k}).

(8)

We can observe that the corrected stochastic gradient $g_{w}^{k}$ modifies the stochastic gradient $F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})$ by subtracting the stored one $F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})$ that also corresponds to sample index $i_{w}^{k}$ , and adding the average $\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k})$ of all stored stochastic gradients. Such a variance reduction operation works on the local samples and is irrelevant with the TV-norm penalty. Thus, we replace the original stochastic gradient $F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})$ in (6) by $g_{w}^{k}$ , and write the update rule of every regular agent $w\in\mathcal{R}$ in BRAVO-SAGA as

\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha\Bigg{(}g_{w}^{k}

\displaystyle+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Bigg{)}.

(9)

It is worth noting that we use a constant step size $\alpha>0$ here, taking advantage of the variance reduction property of SAGA [45]. In Section 4, we will show that BRAVO-SAGA can linearly converge with a constant step size.

Algorithm 1 BRAVO-SAGA

Input: $x_{w}^{0}\in{\mathbb{R}}^{p}$ and $F^{\prime}_{w,j}(\phi_{w,j}^{0})=F^{\prime}_{w,j}(x_{w}^{0}),j=1,\cdots,J$ , $\forall w\in\mathcal{R}$ . $\lambda>0$ and $\alpha>0$ .

1:for

k=0,1,\cdots

, every regular agent

w\in\mathcal{R}

2: Broadcast its current model

x_{w}^{k}

to all neighbors.

3: Receive

x_{v}^{k}

from regular neighbors

v\in\mathcal{R}_{w}

and

z^{k}_{v}

from Byzantine neighbors

v\in\mathcal{B}_{w}

4: Sample

i_{w}^{k}

from

\{1,\cdots,J\}

5: Store stochastic gradient

F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})

6: Update

g_{w}^{k}

according to (8).

7: Update local iterate

x^{k+1}_{w}

according to (9).

8:end for

Algorithm 2 BRAVO-LSVRG

Input: $x_{w}^{0}\in{\mathbb{R}}^{p}$ and $y_{w}^{0}=x_{w}^{0}$ , $\forall w\in\mathcal{R}$ . $\lambda>0$ and $\alpha>0$ .

1:for

k=0,1,\cdots

, every regular agent

w\in\mathcal{R}

2: Broadcast its current model

x_{w}^{k}

to all neighbors.

3: Receive

x_{v}^{k}

from regular neighbors

v\in\mathcal{R}_{w}

and

z^{k}_{v}

from Byzantine neighbors

v\in\mathcal{B}_{w}

4: Sample

i_{w}^{k}

from

\{1,\cdots,J\}

5: With probability

1/J

y_{w}^{k+1}=x_{w}^{k}

and calculate local

6: full gradient; otherwise,

y_{w}^{k+1}=y_{w}^{k}

7: Update

g_{w}^{k}

according to (11).

8: Update local iterate

x^{k+1}_{w}

according to (9).

9:end for

Inheriting the property of SAGA, BRAVO-SAGA requires every regular agent to maintain a stochastic gradient table, which brings extra storage cost. For storage-limited applications, we can replace SAGA by loopless SVRG (as known as LSVRG) [46], which incurs no extra storage cost, although doubles the computation cost per iteration.

Define

\displaystyle y_{w}^{k+1}=\left\{\begin{aligned} &x_{w}^{k},&&\text{with probability}\ \frac{1}{J},\\ &y_{w}^{k},&&\text{with probability}\ 1-\frac{1}{J},\end{aligned}\right.

(10)

where $y_{w}^{k+1}$ is a reference point at which the full gradient $F^{\prime}_{w}(x_{w})$ is calculated. In BRAVO-LSVRG, every regular agent $w\in\mathcal{R}$ also updates with (9), where the corrected stochastic gradient is defined as

\displaystyle g_{w}^{k}=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})+F^{\prime}_{w}(y_{w}^{k}).

(11)

Observe that at every time, with probability $\frac{1}{J}$ , $y_{w}^{k+1}$ is updated to $x_{w}^{k}$ , meaning that the local full gradient must be calculated. By doubling the number of the stochastic gradient calculations per iteration (in expectation), BRAVO-LSVRG avoids the high storage cost of BRAVO-SAGA.

The two proposed methods are outlined in Algorithm 1 and Algorithm 2, respectively.

4 Theoretical Analysis

In this section, we theoretically analyze the performance of the proposed BRAVO-SAGA and BRAVO-LSVRG.

4.1 Assumptions

We make the following assumptions that are common in the analysis of decentralized stochastic optimization [4, 28] and variance reduction [45, 46].

Assumption 3.

(Strong Convexity) For any model $\tilde{x}\in{\mathbb{R}}^{p}$ and every regular agent $w\in\mathcal{R}$ , the local cost $F_{w}(\tilde{x})$ is strongly convex with constant $\mu_{w}$ .

Assumption 4.

(Lipschitz Continuous Gradients) For any model $\tilde{x}\in{\mathbb{R}}^{p}$ and every regular agent $w\in\mathcal{R}$ , the local sample cost $F_{w,i_{w}^{k}}(\tilde{x})$ has Lipschitz continuous gradients with constant $L_{w}$ .

Assumption 5.

(Bounded Variance) For any model $\tilde{x}\in{\mathbb{R}}^{p}$ , the variance of $F^{\prime}_{w,i_{w}^{k}}(\tilde{x})$ is upper bounded by $\delta_{w}^{2}$ , namely, $\mathbb{E}_{i_{w}^{k}}\|F^{\prime}_{w,i_{w}^{k}}(\tilde{x})-F^{\prime}_{w}(\tilde{x})\|^{2}\leq\delta_{w}^{2}$ .

Assumption 5 bounds the variance of the local stochastic gradients. It is worth noting that BRAVO-SAGA and BRAVO-LSVRG only need Assumptions 3 and 4, but do not need Assumption 5 that is necessary for the theoretical analysis of DRSA [28]. The reason is that the adopted variance reduction methods are able to eliminate the effect of stochastic gradient noise. We do not assume homogeneity of regular agents; their local costs can be totally different.

4.2 Main theorems

As the network topology plays a critical role in the analysis, we define $A\in{\mathbb{R}}^{R\times|\mathcal{E}_{R}|}$ as the oriented agent-edge incidence matrix of $(\mathcal{R},\mathcal{E}_{R})$ . To be specific, for an edge $e=(w,v)\in\mathcal{E}_{R}$ with $w<v$ , the $(w,e)$ -th entry of $A$ is 1 and the $(v,e)$ -th entry of $A$ is $-1$ . We review the following lemma.

Lemma 1.

([28], Theorem 1) Suppose that Assumptions 1 and 3 hold true. If $\lambda\geq\lambda_{0}:=\frac{\sqrt{R}}{\tilde{\sigma}_{\min}(A)}\max_{w\in\mathcal{R}}\|F^{\prime}_{w}(\tilde{x}^{*})\|_{\infty}$ where $\tilde{\sigma}_{\min}(A)$ is the minimum nonzero singular value of A, then for the optimal solution $x^{*}$ of (4) and the optimal solution $\tilde{x}^{*}$ of (2), we have that $x^{*}$ stacks $R$ vectors $\tilde{x}^{*}$ as $[\cdots;\tilde{x}^{*};\cdots]$ .

Since BRAVO-SAGA, BRAVO-LSVRG and DRSA solve the same TV-norm penalized problem (4), they share Lemma 1, which shows that (4) is equivalent to the original problem (2) when the penalty parameter $\lambda$ is sufficiently large. Note that the threshold $\lambda_{0}$ is determined by the heterogeneity of the regular agents’ local costs. Higher heterogeneity leads to larger $\lambda_{0}$ . Below we only consider the regime of large $\lambda$ and focus on the convergence to (4).

The next theorem demonstrates that both BRAVO-SAGA and BRAVO-LSVRG can achieve linear convergence, and the learning errors are bounded.

Theorem 1.

Suppose Assumptions 1, 2, 3, 4 hold true. Denote $\eta=\min_{w\in\mathcal{R}}\{\frac{\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}-\frac{\epsilon}{2}>0$ , $\epsilon\in(0,\min_{w\in\mathcal{R}}\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\})$ , and $L=\max_{w\in\mathcal{R}}L_{w}$ . Set the step size of BRAVO-SAGA and BRAVO-LSVRG as $\alpha\leq\frac{\eta}{12L^{2}J}.$ We have

\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)^{k}V^{0}+\Delta,

(12)

where

\displaystyle V^{k}:=\|x^{k}-x^{*}\|^{2}+8J\alpha^{2}L^{2}S^{k},

(13)

\displaystyle\Delta:=\frac{\alpha}{\eta}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\frac{1}{\epsilon\eta}\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p.

(14)

For BRAVO-LSVRG, we define $S^{k}$ as

\displaystyle S^{k}:=\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}.

(15)

For BRAVO-SAGA, we define $S^{k}$ as

\displaystyle S^{k}:=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}.

(16)

In (12), the expectation is taken over all random variables.

The difference between (15) and (16) comes from the ways of SAGA and LSVRG to correct the stochastic gradients. Thus, we choose different Lyapunov functions $V^{k}$ and derive a unified convergence expression for the two methods. Theorem 1 asserts that BRAVO-SAGA and BRAVO-LSVRG can linearly converge to a neighborhood of the optimal solution $x^{*}$ of (4); the size of the neighborhood is determined by the penalty parameter $\lambda$ , the summation of the squared numbers of Byzantine neighbors $\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2}$ , the summation of the squared numbers of regular neighbors $\sum_{w\in\mathcal{R}}|\mathcal{R}_{w}|^{2}$ , and the problem dimension $p$ .

Remark 1.

According to Theorem 1, we know that the step size in BRAVO-SAGA and BRAVO-LSVRG is quite important to their performance. If we choose a small step size, the learning error $\Delta$ is small but the convergence speed is relatively slow. Otherwise, if we choose a large step size, the convergence speed will be fast but the learning error is large. One can balance the two performance metrics through judiciously tuning the step size.

Next, we show that the learning errors established in Theorem 1 are optimal in order for any subgradient-based method that solves the TV-norm penalized decentralized stochastic optimization problem in (4). The class of such methods satisfy the following assumption.

Assumption 6.

(Span Condition) The iterative method solving (4) generates the sequence of regular agents’ models $\{x_{w}^{k}\}_{w\in\mathcal{R}}$ via

\displaystyle x_{w}^{k}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0},\hat{h}_{w}^{1},\cdots,\hat{h}_{w}^{k-1}\},\quad\forall w\in\mathcal{R},

(17)

where $\hat{h}_{w}^{k}:=h_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})$ is a stochastic subgradient, while $h_{w}^{k}$ is a linear combination of $\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0}),F^{\prime}_{w,1}(x_{w}^{1}),\cdots,F^{\prime}_{w,J}(x_{w}^{1}),\cdots,F^{\prime}_{w,1}(x_{w}^{k}),\cdots,F^{\prime}_{w,J}(x_{w}^{k})\}$ and satisfies $\mathbb{E}[h_{w}^{k}]=F^{\prime}_{w}(x_{w}^{k})$ .

Such a span condition is standard in analyzing the performance bounds of (stochastic) first-order methods [47, 48]. Methods satisfying Assumption 6 include but are not limited to DRSA, BRAVO-SAGA and BRAVO-LSVRG.

Theorem 2.

Given any decentralized stochastic optimization method to solve (4) that: (i) satisfies Assumptions 2 and 6; (ii) initializes the regular agents’ models at the same point, we can: (i) find $R$ local costs $\{F_{1}(x),\cdots,F_{R}(x)\}$ satisfying Assumption 3 for the regular agents, among which each local cost $F_{w}(x)$ is composed of the local sample costs $\{F_{w,1}(x),\cdots,F_{w,J}(x)\}$ satisfying Assumption 4; (ii) find a network topology satisfying Assumption 1, within which each regular agent $w$ has $\mathcal{B}_{w}$ Byzantine neighbors, such that the learning error of this method is at least

\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p).

(18)

The lower bound on the learning error, as given in (18), is influenced by the heterogeneity across the local costs of regular agents (represented by the penalty parameter $\lambda$ ) and the impact of Byzantine agents (represented by the term $\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2}$ ). Note that the lower bound given in [49] for Byzantine-robust distributed stochastic optimization is also determined by the heterogeneity across the local costs of regular agents and the impact of Byzantine agents.

Remark 2.

Although the Lyapunov functions in Theorem 1 and Theorem 2 are different, the learning errors are still comparable. Observe that the learning error in Theorem 1 is optimal in order when the summation of the squared numbers of Byzantine neighbors $\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2}$ and the summation of the squared numbers of regular neighbors $\sum_{w\in\mathcal{R}}|\mathcal{R}_{w}|^{2}$ are close, or the step size $\alpha$ is sufficiently small.

To further illustrate the benefits of variance reduction to Byzantine-robust decentralized stochastic optimization, we compare the theoretical results with that of DRSA.

Lemma 2.

[28, Constant Step Size Variant of Theorem 2] Suppose Assumptions 1, 2, 3, 4, 5 hold true. Denote $\eta=\min\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}-\epsilon>0$ and $\epsilon\in(0,\min_{w\in\mathcal{R}}\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\})$ . Set the step size of DRSA as ${\alpha}=\min_{w\in\mathcal{R}}\left\{\frac{1}{4(\mu_{w}+L_{w})}\right\}.$ We have

\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(1-\eta{\alpha})^{k}\|x^{0}-x^{*}\|^{2}+\Delta+\frac{2\alpha}{\eta}\sum_{w\in\mathcal{R}}\delta_{w}^{2},

(19)

where $\Delta$ is defined in (14). Here the expectation is taken over all random variables.

Lemma 2 shows that with a constant step size, DRSA can also achieve a linear convergence rate, and the learning error is relative to the summed variance $\sum_{w\in\mathcal{R}}\delta_{w}^{2}$ . Comparing the learning errors in Theorem 1 and Lemma 2, we can observe that in BRAVO-SAGA and BRAVO-LSVRG the effect of stochastic gradient noise has been fully eliminated. On the other hand, Lemma 2 also implies that with a diminishing step size, DRSA has a learning error that matches the lower bound in (18), but the convergence rate is sublinear. To the best of our knowledge, the proposed BRAVO-SAGA and BRAVO-LSVRG are the first Byzantine-robust decentralized stochastic optimization methods that achieve stochastic gradient noise-independent learning errors and linear convergence speeds simultaneously.

5 Numerical Experiments

In this section, we conduct extensive numerical experiments to demonstrate the Byzantine-robustness of the proposed methods, BRAVO-SAGA and BRAVO-LSVRG.

We consider an undirected Erdos-Renyi graph consisting of $N=100$ agents, in which every edge $e=(w,v)$ for any $w,v\in\mathcal{V}$ is connected with probability $q\in[0,1]$ . We set $q=0.5$ . All experiments are conducted on the MNIST and Fashion-MNIST datasets using softmax regression. The global cost function is defined as

\displaystyle F(\tilde{x})=-\frac{1}{J}\sum_{j=1}^{J}\sum_{m=0}^{M-1}\left(I(b^{(j)}=m)\ln\left(\frac{\exp((\tilde{x})_{m}^{T}a^{(j)})}{\sum_{l=0}^{M-1}\exp((\tilde{x})_{l}^{T}a^{(j)})}\right)\right).

Here $J$ and $M$ are the numbers of samples and categories, respectively. Sample $j$ is represented by $(a^{(j)},b^{(j)})$ , where $a^{(j)}\in{\mathbb{R}}^{p/M}$ is the data and $b^{(j)}\in{\mathbb{R}}$ is the target. $I(b^{(j)}=m)$ is the indicator function; $I(b^{(j)}=m)=1$ if $b^{(j)}=m$ , and $I(b^{(j)}=m)=0$ otherwise. The model is $\tilde{x}\in{\mathbb{R}}^{p}$ and $(\tilde{x})_{m}\in{\mathbb{R}}^{p/M}$ is the $m$ -th block of $\tilde{x}$ . The MNIST dataset contains $M=10$ handwritten digits from 0 to 9, with 60,000 training images and 10,000 testing images whose dimensions are $p/M=784$ . The Fashion-MNIST dataset contains fashion images and the other attributes are the same as those of the MNIST dataset.

We consider i.i.d. (independent and identically distributed) and non-i.i.d. data distributions across the agents. In the i.i.d. case, all $N$ agents randomly and evenly partition the training samples. In the non-i.i.d. case, every 10 agents randomly and evenly partition the training samples of one class.

We compare our proposed methods BRAVO-SAGA and BRAVO-LSVRG with several benchmark methods, including DPSGD [4, 6], the stochastic version of ByRDiE [21] (called ByRDiE-S), the stochastic version of BRIDGE [23] (called BRIDGE-S), and DRSA [28]. For DPSGD, we choose the Metropolis weights [50] to ensure the mixing matrix is doubly stochastic. For ByRDiE-S, we set the number of inner iteration to be 1, and for fair comparison, refer one iteration as all dimensions being updated once. All step sizes of the benchmark methods are hand-tuned to the best. For DRSA, the step size is set to $\alpha^{k}=O(\frac{1}{\sqrt{k}})$ as in [28], unless otherwise specified.

The performance metrics include classification accuracy, model variance, and convergence error $\|x^{k}-x^{*}\|^{2}$ . When computing the classification accuracy over the testing samples, we randomly choose one regular agent to compute the classification accuracy on its local model. To see the consensus error between the regular agents, we also show the variance of the regular models. See http://github.com/pengj97/BRAVO.

5.1 Byzantine attacks

When Byzantine agents exist, we set their number as $B=20$ , unless otherwise specified. We randomly choose $B$ agents to be Byzantine but guarantee the network of regular agents to be connected. We consider the following Byzantine attacks in the numerical experiments.

Gaussian attacks. Every Byzantine agent $v\in\mathcal{B}$ sends to its neighbors malicious model $x_{v}^{k}$ , whose entries independently follow the Gaussian distribution $\mathcal{N}(0,100^{2})$ .

Sign-flipping attacks. Every Byzantine agent $v\in\mathcal{B}$ calculates its true model $x_{v}^{k}$ , multiplies with a negative constant $c$ , and sends the malicious model $z_{v}^{k}=cx_{v}^{k}$ to its neighbors. Here we set $c=-4$ .

Sample-duplicating attacks. The Byzantine agents collude to select a specific regular agent, copy the model of the selected agent, and send it to neighbors. This amounts to that the Byzantine agents duplicate the samples of the selected regular agent. Here the sample-duplicating attacks are only applied to the non-i.i.d. case.

5.2 Experiments on the i.i.d. case

Without attacks. When the data distribution is i.i.d. and the number of Byzantine agents is $B=0$ , the numerical results are shown in Fig. 2. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is $\lambda=0.005$ and the step size is $\alpha=0.01$ . All methods perform well in this case, although ByRDiE-S is slower than the others. On the Fashion-MNIST dataset, BRAVO-SAGA and BRAVO-LSVRG reach slightly higher accuracy than DRSA.

Gaussian attacks and sign-flipping attacks. Set $\lambda=0.005$ and $\alpha=0.01$ under Gaussian attacks, as well as $\lambda=0.0001$ and $\alpha=0.01$ under sign-flipping attacks for BRAVO-SAGA and BRAVO-LSVRG. The numerical results are shown in Figs. 3 and 4. DPSGD fails since it is not designed to defend against Byzantine attacks. All Byzantine-robust methods perform well, while BRAVO-SAGA and BRAVO-LSVRG demonstrate faster convergence speeds and better classification accuracies, especially in the Fashion-MNIST dataset.

Importance of variance reduction. As we can observe from Figs. 3 and 4, the performance gains of both BRAVO-SAGA and BRAVO-LSVRG over DRSA are more obvious in the Fashion-MNIST dataset than in MNIST. To investigate this phenomenon, we depict how the summed variance of the regular agents’ (corrected) stochastic gradients evolves for the three methods and on the two datasets, as shown in Fig. 5. We consider that there is no Byzantine agent so as to avoid the possible interference of Byzantine attacks. The parameters are the same as those in plotting Fig. 2. Observe that with the variance reduction methods, the stochastic gradient noise is indeed reduced. Also observe that the level of the uncorrected stochastic gradient noise on MNIST is much lower than that on Fashion-MNIST. Hence, the improvement of BRAVO-SAGA and BRAVO-LSVRG over DRSA is smaller. The numerical results corroborate the theoretical findings that the learning error of DRSA is relative to the summed variance $\sum_{w\in\mathcal{R}}\delta_{w}^{2}$ , while the learning errors of the two variance-reduced ones are independent with the stochastic gradient noise.

5.3 Experiments on the non-i.i.d. case

Sample-duplicating attacks. Consider the case that the data distribution is non-i.i.d. and the Byzantine attacks are sample-duplicating. It is worth noting that since the Byzantine agents own the samples of two classes, the best accuracy that the regular agents can obtain is 0.8. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is $\lambda=0.02$ and the step size is $\alpha=0.01$ . As depicted in Fig. 6, BRAVO-SAGA, BRAVO-LSVRG and DRSA all perform well and achieve near-optimal accuracy. The TV-norm penalty forces the regular models to reach consensus, and thus these methods are insensitive to the non-i.i.d. data distribution. In contrast, with the majority-voting scheme, ByRDiE-S and BRIDGE-S fail in this case.

5.4 Impact of $\lambda$ and $B/N$

Impact of penalty parameter $\lambda$ . We investigate the impact of the penalty parameter $\lambda$ . The settings are same as those in sample-duplicating attacks. As shown in Fig. 7, if we choose a relatively large or small value of $\lambda$ , such as $\lambda=0.2$ or $\lambda=0.002$ , the performance of BRAVO-SAGA and BRAVO-LSVRG is not as good as that with a proper value of $\lambda$ . For example, here $\lambda=0.02$ gives a favorable trade-off between convergence speed and learning error.

Impact of fraction of Byzantine agents $B/N$ . We consider the impact of the fraction of the Byzantine agents. Here the settings are the same as those in sample-duplicating attacks. Considering that ByRDiE-S and BRIDGE-S require $N>2B$ , we let the fraction of Byzantine agents be less than 0.5. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is $\lambda=0.005$ and the step size is $\alpha=0.01$ . As depicted in Fig. 8, with the increasing fraction of Byzantine agents, all methods suffer from performance losses, but the advantages of BRAVO-SAGA and BRAVO-LSVRG are obvious compared to ByRDiE-S and BRIDGE-S.

5.5 Experiments on convergence error

To further highlight the advantages of BRAVO-SAGA and BRAVO-LSVRG over DRSA, we also compare them on the MNIST and Fashion-MNIST datasets in terms of the convergence error $\|x^{k}-x^{*}\|^{2}$ . Since the step sizes of DRSA are different in theory and in practice, we choose three step size rules for DRSA, $\alpha^{k}=0.01/(k+1)$ , $\alpha^{k}=0.01/\sqrt{k+1}$ and $\alpha^{k}=0.01$ . The attacks are sign-flipping with i.i.d. data where we set $\lambda=0.0001$ and sample-duplicating with non-i.i.d. data where we set $\lambda=0.02$ . The experimental results are depicted in Fig. 9. Observe that BRAVO-LSVRG and BRAVO-SAGA, with the help of variance reduction, both achieve faster convergence speeds than DRSA with the diminishing step sizes, and achieve smaller learning errors than DRSA with a constant step size that suffers from the stochastic gradient noise. These experimental results corroborate the theoretical analysis in Section 4 and show the superior performance of BRAVO-SAGA and BRAVO-LSVRG.

6 Conclusions

This paper aims at developing Byzantine-robust stochastic optimization methods over a decentralized network, in which an unknown number of Byzantine agents collude to bias the optimization process. Motivated by fact that the stochastic gradient noise significantly affects the learning error, we develop two variance-reduced Byzantine-robust decentralized stochastic optimization methods, BRAVO-SAGA and BRAVO-LSVRG. These methods enjoy linear convergence speeds and provable learning errors, which are independent of the stochastic gradient noise and are shown to be optimal in order for any subgradient-based method solving the TV-norm penalized decentralized stochastic optimization problem.

In light of this work, one natural question arises: What is the lower bound of the learning errors of first-order methods when applied to solving the non-penalized Byzantine-robust decentralized stochastic optimization problem (2)? Answering this fundamental question is challenging in our current setting, since we impose no restrictions on the data distributions across the agents. We will investigate this issue for the i.i.d. data distribution in our future work.

Acknowledgement. Qing Ling is supported in part by NSF China Grants 61973324 and 12126610, Guangdong Basic and Applied Basic Research Foundation Grant 2021B1515020094, and Guangdong Provincial Key Laboratory of Computational Science Grant 2020- B1212060032. A short preliminary version of this paper has appeared in IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, May 22–27, 2022 [51].

Appendix A Proof of Theorem 1

Proof.

We prove Theorem 1 for BRAVO-LSVRG first, and then modify the proof for BRAVO-SAGA.

Step 1. We begin with taking the conditional expectation given the random variables up to time $k$ , namely $\{i_{w}^{l}:l<k,w\in\mathcal{R}\}$ , and then taking the expectation over these random variables. For notational simplicity, denote the conditional expectation $\mathbb{E}[\cdot|i_{w}^{l}:l<k,w\in\mathcal{R}]$ as $\mathbb{E}_{k}[\cdot]$ . From the update in (9), for every regular agent $w$ , we have

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(20)
$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k}-x_{w}^{*}-\alpha\cdot\Big{(}g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\Big{)}-\alpha\cdot\Big{(}\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Big{)}\\|^{2}$
$\displaystyle=$	$\displaystyle\\|x_{w}^{k}-x_{w}^{*}\\|^{2}+\alpha^{2}\cdot\mathbb{E}_{k}\\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
	$\displaystyle-2\alpha\cdot\langle\mathbb{E}_{k}[g_{w}^{k}]+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k}),x_{w}^{k}-x_{w}^{}\rangle-2\alpha\cdot\langle\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}),x_{w}^{k}-x_{w}^{}\rangle.$

We handle the terms at the right-hand side of (20) one by one.

For the second term at the right-hand side of (20), consider the optimality condition of (4), given by

\displaystyle F^{\prime}_{w}(x_{w}^{*})+\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}=0,

(21)

which holds true for some $s_{wv}\in[-1,1]^{p}$ and $s_{vw}=-s_{wv}$ . Then we have

	$\displaystyle\mathbb{E}_{k}\\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})-F^{\prime}_{w}(x_{w}^{*})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}+g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2\\|F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\ -F^{\prime}_{w}(x_{w}^{*})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\\|^{2}+2\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 4\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\\|^{2}+4\\|\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}+2\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 8\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\\|^{2}+8\\|\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\\|^{2}+4\\|\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}+2\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 8\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\\|^{2}+32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p+2\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}.$	(22)

For the last term at the right-hand side of (A), from (11), we have

	$\displaystyle\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$	(23)
$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(x_{w}^{k})-(F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(y_{w}^{k}))\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(x_{w}^{k})\\|^{2}+2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(y_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})\\|^{2}+2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})-F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2L_{w}^{2}\\|x_{w}^{k}-x_{w}^{}\\|^{2}+2L_{w}^{2}\\|y_{w}^{k}-x_{w}^{}\\|^{2}.$

where the second inequality is because of the fact $\mathbb{E}\|a-\mathbb{E}a\|^{2}=\mathbb{E}\|a\|^{2}-\|\mathbb{E}a\|^{2}\leq\mathbb{E}\|a\|^{2}$ and the last inequality comes from Assumption 4.

Substituting (23) into (A), we have

		$\displaystyle\mathbb{E}_{k}\\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
	$\displaystyle\leq$	$\displaystyle 8\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{})\\|^{2}+32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p+4L_{w}^{2}\\|x_{w}^{k}-x_{w}^{}\\|^{2}+4L_{w}^{2}\\|y_{w}^{k}-x_{w}^{*}\\|^{2}.$		(24)

For the third term at the right-hand side of (20), substituting the optimality condition (21), we have

	$\displaystyle-2\langle\mathbb{E}_{k}[g_{w}^{k}]+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle$	(25)
$\displaystyle=$	$\displaystyle-2\langle F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\quad-F^{\prime}_{w}(x_{w}^{})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{}\rangle$
$\displaystyle=$	$\displaystyle-2\langle F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{}),x_{w}^{k}-x_{w}^{}\rangle-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle$
$\displaystyle\leq$	$\displaystyle-\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\\|x_{w}^{k}-x_{w}^{}\\|^{2}-\frac{2}{\mu_{w}+L_{w}}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{})\\|^{2}-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle.$

where the last inequality comes from [47] since we assume that the functions $F_{w}(x_{w})$ are strongly convex and have Lipschitz continuous gradients (cf. Assumptions 3 and 4).

For the last term at the right-hand side of (20), with any $\epsilon>0$ we have

	$\displaystyle-2\langle\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle$	(26)
$\displaystyle\leq$	$\displaystyle\epsilon\\|x_{w}^{k}-x_{w}^{*}\\|^{2}+\frac{\lambda^{2}}{\epsilon}\\|\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle\epsilon\\|x_{w}^{k}-x_{w}^{*}\\|^{2}+\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}.$

Substituting (A), (25) and (26) into (20) and combining the terms, we have

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(27)
$\displaystyle\leq$	$\displaystyle(1-\alpha(\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}-\epsilon)+4\alpha^{2}L_{w}^{2})\\|x_{w}^{k}-x_{w}^{}\\|^{2}+4\alpha^{2}L_{w}^{2}\\|y_{w}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)+\alpha\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}$
	$\displaystyle-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{}\rangle-2\alpha(\frac{1}{\mu_{w}+L_{w}}-4\alpha)\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{})\\|^{2}.$

If we constrain the step size as

\displaystyle\alpha\leq\min_{w\in\mathcal{R}}\{\frac{1}{4(\mu_{w}+L_{w})}\},

(28)

we have $\frac{1}{\mu_{w}+L_{w}}-4\alpha\geq 0$ and hence can drop the last term at the right-hand side of (27). Also noticing the definitions of $\eta$ and $L$ , we can rewrite (27) as

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(29)
$\displaystyle\leq$	$\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\\|x_{w}^{k}-x_{w}^{}\\|^{2}+4\alpha^{2}L^{2}\mathbb{E}_{k}\\|y_{w}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)$
	$\displaystyle+\alpha\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle.$

Step 2. Here we define $I_{p}(x)=\frac{\lambda}{2}\sum_{w\in\mathcal{R}}\sum_{v\in\mathcal{R}_{w}}\|x_{w}-x_{v}\|_{1}$ . Since $I_{p}(x)$ is convex, we have

\displaystyle\langle\partial_{x}I_{p}(x^{k})-\partial_{x}I_{p}(x^{*}),x^{k}-x^{*}\rangle=\sum_{w\in\mathcal{R}}\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle\geq 0.

(30)

Summing up (29) over all regular agents $w\in\mathcal{R}$ and adding to (30), we have

		$\displaystyle\mathbb{E}_{k}\\|x^{k+1}-x^{*}\\|^{2}$		(31)
	$\displaystyle\leq$	$\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\\|x^{k}-x^{}\\|^{2}+4\alpha^{2}L^{2}\sum_{w\in\mathcal{R}}\\|y_{w}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}.$

Step 3. Defining $S^{k}:=\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}$ , we have

\displaystyle\mathbb{E}_{k}[S^{k+1}]=\sum_{w\in\mathcal{R}}\mathbb{E}_{k}\|y_{w}^{k+1}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}+\frac{1}{J}\sum_{w\in\mathcal{R}}\|x_{w}^{k}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})S^{k}+\frac{1}{J}\|x^{k}-x^{*}\|^{2}.

(32)

Combining (31) and (32), we have

		$\displaystyle\mathbb{E}_{k}\\|x^{k+1}-x^{*}\\|^{2}+8J\alpha^{2}L^{2}\mathbb{E}_{k}[S^{k+1}]$		(33)
	$\displaystyle\leq$	$\displaystyle(1-2\eta\alpha+12\alpha^{2}L^{2})\\|x^{k}-x^{*}\\|^{2}+(1-\frac{1}{2J})8J\alpha^{2}L^{2}S^{k}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}.$

If we constrain the step size as

\displaystyle 12\alpha^{2}L^{2}\leq\eta\alpha\quad\text{and}\quad\eta\alpha\leq\frac{1}{2J},

(34)

then the coefficients in (33) satisfy

	$\displaystyle 1-2\eta\alpha+12\alpha^{2}L^{2}\leq 1-\eta\alpha,$		(35)
	$\displaystyle 1-\frac{1}{2J}\leq 1-\eta\alpha.$		(36)

Construct a Lyapunov function as $V^{k}:=\|x^{k}-x^{*}\|^{2}+8J\alpha^{2}L^{2}S^{k}$ , we have

\displaystyle\mathbb{E}_{k}[V^{k}]\leq(1-\eta\alpha)V^{k-1}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

(37)

Taking the full expectation of (37), we obtain

\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)\mathbb{E}[V^{k-1}]+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

(38)

Using telescopic cancellation on (38) from time 1 to time $k$ , we have

\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)^{k}V^{0}+\Delta_{1},

(39)

where

\displaystyle\Delta:=\frac{\alpha}{\eta}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\frac{1}{\epsilon\eta}\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p.

(40)

In our derivation so far, the constraint on the step size $\alpha$ (cf. (28) and (34)) is

\displaystyle\alpha\leq\min\{\min_{w\in\mathcal{R}}\{\frac{1}{4(\mu_{w}+L_{w})}\},\frac{\eta}{12L^{2}},\frac{1}{2\eta J}\}.

(41)

Therefore, we simply choose

\displaystyle\alpha\leq\frac{\eta}{12L^{2}J},

(42)

which completes the proof for BRAVO-LSVRG.

Step 4. Below we prove Theorem 1 for BRAVO-SAGA. Since the update rules of BRAVO-SAGA and BRAVO-LSVRG are both (9) and the difference between them is the definition of $g_{w}^{k}$ , the intermediate results (20), (A), (25), and (26) also hold true for BRAVO-SAGA. Therefore, we just need to prove (23) and (A) for BRAVO-SAGA.

From the definition of $g_{w}^{k}$ in (8), at every regular agent $w$ , we have

	$\displaystyle\mathbb{E}_{k}\\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\\|^{2}$	(43)
$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(x_{w}^{k})-\Big{(}F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k})\Big{)}\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-F^{\prime}_{w}(x_{w}^{k})\\|^{2}+2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})+F^{\prime}_{w}(x_{w}^{})-\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})\\|^{2}+2\mathbb{E}_{k}\\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{})-F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})\\|^{2}$
$\displaystyle\leq$	$\displaystyle 2L_{w}^{2}\\|x_{w}^{k}-x_{w}^{}\\|^{2}+2L_{w}^{2}\mathbb{E}_{k}\\|\phi_{w,i_{w}^{k}}^{k}-x_{w}^{}\\|^{2}$
$\displaystyle=$	$\displaystyle 2L_{w}^{2}\\|x_{w}^{k}-x_{w}^{}\\|^{2}+2L_{w}^{2}\frac{1}{J}\sum_{j=1}^{J}\\|\phi_{w,j}^{k}-x_{w}^{}\\|^{2}.$

Substituting (43) into (A), we have

		$\displaystyle\mathbb{E}_{k}\\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
	$\displaystyle\leq$	$\displaystyle 8\mathbb{E}_{k}\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{})\\|^{2}+4L_{w}^{2}\frac{1}{J}\sum_{j=1}^{J}\\|\phi_{w,j}^{k}-x_{w}^{}\\|^{2}+4L_{w}^{2}\\|x_{w}^{k}-x_{w}^{*}\\|^{2}+32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p.$		(44)

Similar to the proof for BRAVO-LSVRG, we can get

		$\displaystyle\mathbb{E}_{k}\\|x^{k+1}-x^{*}\\|^{2}$		(45)
	$\displaystyle\leq$	$\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\\|x^{k}-x^{}\\|^{2}+4\alpha^{2}L^{2}\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\\|\phi_{w,j}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}.$

Define $S^{k}:=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}$ , we have

\displaystyle\mathbb{E}_{k}[S^{k+1}]=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\mathbb{E}_{k}\|\phi_{w,j}^{k+1}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}+\frac{1}{J}\sum_{w\in\mathcal{R}}\|x_{w}^{k}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})S^{k}+\frac{1}{J}\|x^{k}-x^{*}\|^{2}.

(46)

With (46), the rest of the proof follows Step 3 and the proof for BRAVO-SAGA is complete.

Appendix B Proof of Theorem 2

Proof.

We establish the lower bound on the learning error by constructing a case that the regular agents’ models converge to an unsatisfactory point due to the misleading of the Byzantine agents. We construct a network topology within which the regular network composed of the regular agents is connected and each regular agent has the same number of Byzantine neighbors, namely, $|\mathcal{B}_{w}|=|\mathcal{B}_{v}|$ for any $w,v\in\mathcal{R}$ .

We begin with the scalar case with $p=1$ . For each regular agent $w\in\mathcal{R}$ , we construct its local cost function as

\displaystyle F_{w}(\tilde{x})=\frac{1}{2}\tilde{x}^{2}-\lambda|\mathcal{B}_{w}|\tilde{x},

(47)

which is strongly convex. Define the local sample costs as

\displaystyle F_{w,j}(\tilde{x})=F_{w}(\tilde{x}),\quad\forall j\in\{1,\cdots,J\},

(48)

which have Lipschitz continuous gradients with constants 1. It is easy to verify that the optimal solution of (4) is $x^{*}=[\lambda|\mathcal{B}_{1}|;\lambda|\mathcal{B}_{2}|;\cdots;\lambda|\mathcal{B}_{R}|]$ in this case.

Since the regular agents’ models are initialized at the same point, without loss of generality, we let $x_{w}^{0}=0,\forall w\in\mathcal{R}$ . Otherwise, we can set $F_{w}(\tilde{x}-x_{w}^{0})$ as the local cost of agent $w$ . Note that for each regular agent $w\in\mathcal{R}$ , $h_{w}^{0}$ is a linear combination of $\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0})\}$ . With (48), we have $h_{w}^{0}\in\text{span}\{F^{\prime}_{w}(x_{w}^{0})\}$ . Because $\mathbb{E}[h_{w}^{0}]=F^{\prime}_{w}(x_{w}^{0})$ , we have

\displaystyle h_{w}^{0}=F^{\prime}_{w}(x_{w}^{0})=-\lambda|\mathcal{B}_{w}|.

(49)

Therefore, for each regular agent $w\in\mathcal{R}$ , we have

	$\displaystyle\hat{h}_{w}^{0}$	$\displaystyle=h_{w}^{0}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{0}-x_{v}^{0})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{0}-z_{v}^{0})$		(50)
		$\displaystyle=-\lambda\|\mathcal{B}_{w}\|+\lambda\|\mathcal{R}_{w}\|sign(0)+\lambda\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{0}).$

Although $sign(0)$ could be any value in $[-1,1]$ , we let $sign(0)=0$ by convention. Choosing the Byzantine attacks such that $\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{0})=|\mathcal{B}_{w}|$ leads to

\displaystyle\hat{h}_{w}^{0}=0.

(51)

With Assumption 6, we further have

\displaystyle x_{w}^{1}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0}\}=\{0\}.

(52)

At time $k=1$ , $h_{w}^{1}$ is linear combination of $\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0}),F^{\prime}_{w,1}(x_{w}^{1}),\cdots,F^{\prime}_{w,J}(x_{w}^{1})\}$ for each regular agent $w\in\mathcal{R}$ . With (48), we have $h_{w}^{1}\in\text{span}\{F^{\prime}_{w}(x_{w}^{0}),F^{\prime}_{w}(x_{w}^{1})\}=\text{span}\{F^{\prime}_{w}(0)\}$ . Since $\mathbb{E}[h_{w}^{1}]=F^{\prime}_{w}(x_{w}^{1})=F^{\prime}_{w}(0)$ , we have

\displaystyle h_{w}^{1}=F^{\prime}_{w}(0)=-\lambda|\mathcal{B}_{w}|.

(53)

Therefore, for each regular agent $w\in\mathcal{R}$ , we have

	$\displaystyle\hat{h}_{w}^{1}$	$\displaystyle=h_{w}^{1}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{1}-x_{v}^{1})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{1}-z_{v}^{1})$
		$\displaystyle=-\lambda\|\mathcal{B}_{w}\|+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{1}-x_{v}^{1})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{1}).$

Choosing the Byzantine attacks such that $\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{1})=|\mathcal{B}_{w}|$ , we have

\displaystyle\hat{h}_{w}^{1}=0.

(54)

With Assumption 6, we further have

\displaystyle x_{w}^{2}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0},\hat{h}_{w}^{1}\}=\{0\}.

(55)

As such, for each regular agent $w\in\mathcal{R}$ , we have

\displaystyle x_{w}^{k}=0\quad\text{and}\quad\mathbb{E}\|x_{w}^{k}-x_{w}^{*}\|=\|x_{w}^{k}-x_{w}^{*}\|=\lambda|\mathcal{B}_{w}|,\quad\forall k\in\{0,1,\cdots\}.

(56)

Therefore, we have

\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\sum_{w\in\mathcal{R}}\mathbb{E}\|x_{w}^{k}-x_{w}^{*}\|^{2}

\displaystyle=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}).

(57)

For the vector case with $p>1$ , we apply the constructions on the scalar case for all dimensions. Consider

\displaystyle F_{w}(\tilde{x})=\frac{1}{2}\tilde{x}^{T}\tilde{x}-\lambda|\mathcal{B}_{w}|\tilde{x}^{T}\bm{1},

(58)

where $x\in\mathbb{R}^{p}$ and $\bm{1}\in\mathbb{R}^{p}$ is the all-one vector. Define

\displaystyle F_{w,j}(\tilde{x})=F_{w}(\tilde{x}),\quad\forall j\in\{1,\cdots,J\}.

(59)

Also let $x_{w}^{0}=\bm{0}$ for all regular agents $w\in\mathcal{R}$ and choose the Byzantine attacks such that $\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{k})=|\mathcal{B}_{w}|\cdot\bm{1}$ at any time $k$ . Similar to the derivations of the scalar case, we have

\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p),

(60)

which completes the proof.

References

[1] K. Srivastava, A. Nedic, Distributed asynchronous constrained stochastic optimization, IEEE Journal of Selected Topics in Signal Processing 5 (4) (2011) 772–790.
[2] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, S. Lu, Distributed learning in the nonconvex world: From batch data to streaming and beyond, IEEE Signal Processing Magazine 37 (3) (2020) 26–38.
[3] S. Pu, A. Olshevsky, I. C. Paschalidis, Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent, IEEE Signal Processing Magazine 37 (3) (2020) 114–122.
[4] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
[5] M. Assran, N. Loizou, N. Ballas, M. G. Rabbat, Stochastic gradient push for distributed deep learning, in: International Conference on Machine Learning, 2019, pp. 344–353.
[6] A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control 54 (1) (2009) 48–61.
[7] Y. Du, K. You, Asynchronous stochastic gradient descent over decentralized datasets, IEEE Transactions on Control of Network Systems 8 (3) (2021) 1212–1224.
[8] G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, H. Zhu, Decentralized learning for wireless communications and networking, in: Splitting Methods in Communication, Imaging, Science, and Engineering, 2017, pp. 461–497.
[9] L. U. Khan, S. R. Pandey, N. H. Tran, W. Saad, Z. Han, M. N. Nguyen, C. S. Hong, Federated learning for edge networks: Resource optimization and incentive mechanism, IEEE Communications Magazine 58 (10) (2020) 88–93.
[10] H. Tang, X. Lian, M. Yan, C. Zhang, J. Liu, $d^{2}$ : Decentralized training over decentralized data, in: International Conference on Machine Learning, 2018, pp. 4848–4856.
[11] A. Mokhtari, A. Ribeiro, Dsa: Decentralized double stochastic averaging gradient algorithm, The Journal of Machine Learning Research 17 (1) (2016) 2165–2199.
[12] Z.-X. Cui, Q. Fan, C. Jia, Momentum methods for stochastic optimization over time-varying directed networks, Signal Processing 174 (2020) 107614.
[13] W. Shi, Q. Ling, G. Wu, W. Yin, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization 25 (2) (2015) 944–966.
[14] S. Pu, W. Shi, J. Xu, A. Nedic, Push–pull gradient methods for distributed optimization in networks, IEEE Transactions on Automatic Control 66 (1) (2020) 1–16.
[15] X. Zhang, S. Liu, N. Zhao, An accelerated algorithm for distributed optimization with barzilai-borwein step sizes, Signal Processing 202 (2023) 108748.
[16] L. Lamport, R. Shostak, M. Pease, The Byzantine generals problem, ACM Transactions on Programming Languages and Systems 4 (3) (1982) 382–401.
[17] Z. Yang, A. Gang, W. U. Bajwa, Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model, IEEE Signal Processing Magazine 37 (3) (2020) 146–159.
[18] L. Su, N. H. Vaidya, Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms, in: ACM Symposium on Principles of Distributed Computing, 2016, pp. 425–434.
[19] S. Sundaram, B. Gharesifard, Distributed optimization under adversarial nodes, IEEE Transactions on Automatic Control 64 (3) (2018) 1063–1076.
[20] L. Su, N. H. Vaidya, Byzantine-resilient multiagent optimization, IEEE Transactions on Automatic Control 66 (5) (2020) 2227–2233.
[21] Z. Yang, W. U. Bajwa, Byrdie: Byzantine-resilient distributed coordinate descent for decentralized learning, IEEE Transactions on Signal and Information Processing over Networks 5 (4) (2019) 611–627.
[22] J. Li, W. Abbas, M. Shabbir, X. Koutsoukos, Resilient distributed diffusion for multi-robot systems using centerpoint, in: Robotics: Science and Systems, 2020.
[23] Z. Yang, W. U. Bajwa, Bridge: Byzantine-resilient decentralized gradient descent, arXiv preprint arXiv:1908.08098.
[24] K. Kuwaranancharoen, S. Sundaram, On the geometric convergence of byzantine-resilient distributed optimization algorithms, arXiv preprint arXiv:2305.10810.
[25] W. Ben-Ameur, P. Bianchi, J. Jakubowicz, Robust distributed consensus using total variation, IEEE Transactions on Automatic Control 61 (6) (2015) 1550–1564.
[26] W. Xu, Z. Li, Q. Ling, Robust decentralized dynamic optimization at presence of malfunctioning agents, Signal Processing 153 (2018) 24–33.
[27] N. Gupta, N. H. Vaidya, Byzantine fault-tolerance in peer-to-peer distributed gradient-descent, arXiv preprint arXiv:2101.12316.
[28] J. Peng, W. Li, Q. Ling, Byzantine-robust decentralized stochastic optimization over static and time-varying networks, Signal Processing 183 (2021) 108020.
[29] L. He, S. P. Karimireddy, M. Jaggi, Byzantine-robust decentralized learning via clippedgossip, arXiv preprint arXiv:2202.01545.
[30] Z. Wu, T. Chen, Q. Ling, Byzantine-resilient decentralized stochastic optimization with robust aggregation rules, arXiv preprint arXiv:2206.04568.
[31] Y. Chen, L. Su, J. Xu, Distributed statistical machine learning in adversarial settings: Byzantine gradient descent, Proceedings of ACM on Measurement and Analysis of Computing Systems 1 (2) (2017) 1–25.
[32] P. Blanchard, E.-M. El-Mhamdi, R. Guerraoui, J. Stainer, Machine learning with adversaries: Byzantine tolerant gradient descent, in: Advances in Neural Information Processing Systems, 2017, pp. 119–129.
[33] X. Cao, L. Lai, Distributed approximate Newton’s method robust to Byzantine attackers, IEEE Transactions on Signal Processing 68 (2020) 6011–6025.
[34] X. Cao, L. Lai, Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers, IEEE Transactions on Signal Processing 67 (22) (2019) 5850–5864.
[35] K. Yuan, B. Ying, J. Liu, A. H. Sayed, Variance-reduced stochastic learning by networked agents under random reshuffling, IEEE Transactions on Signal Processing 67 (2) (2019) 351–366.
[36] R. Xin, U. A. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with accelerated convergence, IEEE Transactions on Signal Processing 68 (2020) 6255–6271.
[37] R. Xin, S. Kar, U. A. Khan, Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence, IEEE Signal Processing Magazine 37 (3) (2020) 102–113.
[38] Z. Wu, Q. Ling, T. Chen, G. B. Giannakis, Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks, IEEE Transactions on Signal Processing 68 (2020) 4583–4596.
[39] E.-M. El-Mhamdi, R. Guerraoui, S. Rouault, Distributed momentum for Byzantine-resilient learning, arXiv preprint arXiv:2003.00010.
[40] S. P. Karimireddy, L. He, M. Jaggi, Learning from history for byzantine robust optimization, in: International Conference on Machine Learning, 2021, pp. 5311–5319.
[41] P. Khanduri, S. Bulusu, P. Sharma, P. K. Varshney, Byzantine resilient non-convex svrg with distributed batch gradient computations, arXiv preprint arXiv:1912.04531.
[42] S. Liu, H. Zhu, S. Li, X. Li, C. Chen, X. Guan, An adaptive deviation-tolerant secure scheme for distributed cooperative spectrum sensing, in: IEEE Global Communications Conference, 2012, pp. 603–608.
[43] B. Kailkhura, P. Ray, D. Rajan, A. Yen, P. Barnes, R. Goldhahn, Byzantine-resilient locally optimum detection using collaborative autonomous networks, in: IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
[44] B. Kailkhura, S. Brahma, P. K. Varshney, Data falsification attacks on consensus-based detection systems, IEEE Transactions on Signal and Information Processing over Networks 3 (1) (2016) 145–158.
[45] A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, in: Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
[46] D. Kovalev, S. Horváth, P. Richtarik, Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop, in: Algorithmic Learning Theory, 2020, pp. 451–467.
[47] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Science & Business Media, 2003.
[48] G. Lan, Y. Zhou, An optimal randomized incremental gradient method, Mathematical programming 171 (2018) 167–215.
[49] S. P. Karimireddy, L. He, M. Jaggi, Byzantine-robust learning on heterogeneous datasets via bucketing, in: International Conference on Learning Representations, 2021.
[50] S. Boyd, P. Diaconis, L. Xiao, Fastest mixing Markov chain on a graph, SIAM Review 46 (4) (2004) 667–689.
[51] J. Peng, W. Li, Q. Ling, Variance reduction-boosted Byzantine robustness in decentralized stochastic optimization, in: International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 4283–4287.

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(20)
$\displaystyle=$	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k}-x_{w}^{*}-\alpha\cdot\Big{(}g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\Big{)}-\alpha\cdot\Big{(}\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Big{)}\\|^{2}$
$\displaystyle=$	$\displaystyle\\|x_{w}^{k}-x_{w}^{*}\\|^{2}+\alpha^{2}\cdot\mathbb{E}_{k}\\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\\|^{2}$
	$\displaystyle-2\alpha\cdot\langle\mathbb{E}_{k}[g_{w}^{k}]+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k}),x_{w}^{k}-x_{w}^{}\rangle-2\alpha\cdot\langle\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}),x_{w}^{k}-x_{w}^{}\rangle.$

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(27)
$\displaystyle\leq$	$\displaystyle(1-\alpha(\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}-\epsilon)+4\alpha^{2}L_{w}^{2})\\|x_{w}^{k}-x_{w}^{}\\|^{2}+4\alpha^{2}L_{w}^{2}\\|y_{w}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)+\alpha\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}$
	$\displaystyle-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{}\rangle-2\alpha(\frac{1}{\mu_{w}+L_{w}}-4\alpha)\\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{})\\|^{2}.$

	$\displaystyle\mathbb{E}_{k}\\|x_{w}^{k+1}-x_{w}^{*}\\|^{2}$	(29)
$\displaystyle\leq$	$\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\\|x_{w}^{k}-x_{w}^{}\\|^{2}+4\alpha^{2}L^{2}\mathbb{E}_{k}\\|y_{w}^{k}-x_{w}^{}\\|^{2}+\alpha^{2}(32\lambda^{2}\|\mathcal{R}_{w}\|^{2}p+4\lambda^{2}\|\mathcal{B}_{w}\|^{2}p)$
	$\displaystyle+\alpha\frac{\lambda^{2}\|\mathcal{B}_{w}\|^{2}p}{\epsilon}-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle.$

Byzantine-Robust Decentralized Stochastic Optimization with Stochastic Gradient Noise-Independent Learning Error

Abstract

keywords:

1 Introduction

1.1 Related works

1.2 Our contributions

2 Problem Formulation

Assumption 1.

2.1 Prior work of DRSA

2.2 Weakness of DRSA

3 Our Proposed Methods

Assumption 2.

4 Theoretical Analysis

4.1 Assumptions

Assumption 3.

Assumption 4.

Assumption 5.

4.2 Main theorems

Lemma 1.

Theorem 1.

Remark 1.

Assumption 6.

Theorem 2.

Remark 2.

Lemma 2.

5 Numerical Experiments

5.1 Byzantine attacks

5.2 Experiments on the i.i.d. case

5.3 Experiments on the non-i.i.d. case

5.4 Impact of λ\lambda and B/NB/N

5.5 Experiments on convergence error

6 Conclusions

Appendix A Proof of Theorem 1

Proof.

Appendix B Proof of Theorem 2

Proof.

References

5.4 Impact of $\lambda$ and $B/N$