This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Byzantine-Robust Decentralized Stochastic Optimization with Stochastic Gradient Noise-Independent Learning Error

Jie Penga{}^{\text{a}}, Weiyu Lib{}^{\text{b}}, Qing Linga,{}^{\text{a},}***Corresponding author. E-mail address: [email protected] a{}^{\text{a}} School of Computer Science and Engineering and Guangdong Provincial Key Laboratory of Computationa Science, Sun Yat-Sen University, Guangzhou, Guangdong 510006, China
b{}^{\text{b}} School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138, USA
Abstract

This paper studies Byzantine-robust stochastic optimization over a decentralized network, where every agent periodically communicates with its neighbors to exchange local models, and then updates its own local model by stochastic gradient descent (SGD). The performance of such a method is affected by an unknown number of Byzantine agents, which conduct adversarially during the optimization process. To the best of our knowledge, there is no existing work that simultaneously achieves a linear convergence speed and a small learning error. We observe that the learning error is largely dependent on the intrinsic stochastic gradient noise. Motivated by this observation, we introduce two variance reduction methods, stochastic average gradient algorithm (SAGA) and loopless stochastic variance-reduced gradient (LSVRG), to Byzantine-robust decentralized stochastic optimization for eliminating the negative effect of the stochastic gradient noise. The two resulting methods, BRAVO-SAGA and BRAVO-LSVRG, enjoy both linear convergence speeds and stochastic gradient noise-independent learning errors. Such learning errors are optimal for a class of methods based on total variation (TV)-norm regularization and stochastic subgradient update. We conduct extensive numerical experiments to demonstrate their effectiveness under various Byzantine attacks.

keywords:
Decentralized stochastic optimization, variance reduction, Byzantine-robustness

1 Introduction

Decentralized stochastic optimization has attracted immense research interest in recent years, and found various applications in signal processing [1, 2, 3], machine learning [4, 5], systems control [6, 7], communications and networking [8, 9], to name a few. Let us consider a decentralized network with NN agents. The finite-sum form of the decentralized stochastic optimization problem is given by

minx~pF(x~):=1Nw=1NFw(x~),withFw(x~):=1Jj=1JFw,j(x~).\displaystyle\min_{\tilde{x}\in{\mathbb{R}}^{p}}F(\tilde{x}):=\frac{1}{N}\sum_{w=1}^{N}F_{w}(\tilde{x}),\quad\text{with}\quad F_{w}(\tilde{x}):=\frac{1}{J}\sum_{j=1}^{J}F_{w,j}(\tilde{x}). (1)

Here, x~p\tilde{x}\in{\mathbb{R}}^{p} is the model to optimize, F(x~)F(\tilde{x}) is the global cost that averages NN local costs Fw(x~)F_{w}(\tilde{x}), and Fw(x~)F_{w}(\tilde{x}) is the local cost of agent ww that averages JJ local sample costs Fw,j(x~)F_{w,j}(\tilde{x}). Without loss of generality, we consider that every agent has the same number of samples. At every iteration, every agent communicates with its neighbors to obtain their local models, aggregates them, and updates its own local model with one or a mini-batch of local samples [10, 11, 12]. This is different to decentralized deterministic optimization where all the local samples are used in every computation step [13, 14, 15], and hence fits for scenarios where the number of local samples on every agent is large.

However, during the operation of the decentralized system, a number of the agents may become malfunctioning or even behave adversarially due to various uncertainties, such as data corruptions, network failures and external malicious attacks. They can send destructive messages to their neighbors in the communication steps, and hence bias the optimization process. One popular way to describe such uncertainties is the Byzantine attacks model [16, 17]. The malfunctioning or adversarial agents are termed as Byzantine agents, and the sent destructive messages are termed as Byzantine attacks. This scenario is challenging since the number and identities of the Byzantine agents are unknown, while the Byzantine attacks can be arbitrarily malicious. In this paper, our goal is to develop decentralized stochastic optimization methods that are robust to Byzantine attacks.

1.1 Related works

Existing decentralized deterministic optimization methods, such as the classical decentralized gradient descent (DGD) [6], generally use weighted mean to aggregate neighboring models and are hence vulnerable to Byzantine attacks. The remedy is to replace weighted mean aggregation by robust aggregation. The works of [18, 19, 20] study Byzantine-robust decentralized deterministic optimization with scalar models, using trimmed mean to replace weighted mean. When the models are vectors, ByRDiE [21] performs coordinate gradient descent, and applies trimmed mean [18] to filter out potential malicious elements on every dimension. A centerpoint-based method is proposed in [22] to guarantee that the aggregated model always lies in the convex hull of the regular neighbors’ models. However, for high-dimensional problems, these two methods are not suitable. Coordinate gradient descent used in ByRDiE is slow, while the centerpoint-based method requires the number of neighbors for every agent to be larger than the problem dimension. BRIDGE [23] extends the work of ByRDiE [21], allowing all coordinates to be updated at one time with the help of gradient descent. The work of [24] provides a general algorithmic framework that includes BRIDGE as a special case, and establishes the linear convergence speed and approximate consensus. In [25, 26], every agent aggregates the signs of the differences between its own model and the neighbors’ such that the influence of Byzantine attacks is limited. The work of [27] considers both gradient aggregation and model aggregation. It combines two techniques, comparative gradient elimination to select relatively reliable neighboring gradients, and coordinate-wise trimmed mean to aggregate neighboring models. But [27] requires the network to be fully connected.

Although there are an increasing number of methods, such as decentralized parallel stochastic gradient descent (DPSGD) [4, 6], for decentralized stochastic optimization, few of them consider Byzantine-robustness. Motivated by [25] and [26], [28] introduces a total variation (TV)-norm penalized formulation to force the local models to be close, but allow possible outliers. Applying the stochastic subgradient method to solve this TV-norm penalized formulation yields a Byzantine-robust algorithm, and its sublinear convergence is established. The work of [29] proposes a robust decentralized centered-clipping rule to aggregate the received local models. The work of [30] provides a principled guideline for designing proper robust aggregation rules in Byzantine-robust decentralized stochastic optimization. It also proposes iterative outlier scissor as one of such robust aggregation rules. However, the robust aggregation rules introduced in [29] and [30] are computationally costly. In addition, the convergence rates established in [29] and [30] are both sublinear.

On the other hand, Byzantine-robust distributed stochastic optimization has been a popular topic. Therein, a central server maintains a single global model and aggregates messages from the distributed agents. Most of the methods modify distributed stochastic gradient descent (SGD) by robustly aggregating the stochastic gradients sent from the agents [31, 32]. Such robust aggregation rules include coordinate-wise median, geometric median, trimmed mean, etc. Byzantine-robust stochastic second-order methods have also been developed in [33]. To enhance Byzantine-robustness, [34] assumes that the central server has extra clean data, and uses the clean data to distinguish malicious messages from the true stochastic gradients. However, it is difficult to generalize the above-mentioned works to the decentralized scenario, since there is no central server to collect and aggregate messages in a decentralized network.

In stochastic optimization, the noise induced by calculating stochastic gradients plays a critical role and affects convergence. For reducing the negative effect of stochastic gradient noise, variance reduction methods have been employed in decentralized stochastic optimization [35, 36, 37]. In Byzantine-robust distributed stochastic optimization, stochastic gradient noise also significantly influences Byzantine-robustness. With large stochastic gradient noise, the regular stochastic gradients are statistically very different, which is conducive for the Byzantine agents to hide themselves and perform attacks [38]. In light of this observation, [38] proposes to use the stochastic average gradient algorithm (SAGA) to eliminate the impact of inner variation, namely, the stochastic gradient noise caused by the heterogeneity of every regular agent’s local samples. The works of [39, 40] show that momentum acceleration can effectively reduce the impact of stochastic gradient noise and enhance Byzantine-robustness. Aiming at non-convex problems, [41] combines the stochastic variance-reduced gradient (SVRG) method and robust aggregation, and empirically demonstrates the performance. However, none of the existing works considers variance reduction in Byzantine-robust decentralized stochastic optimization. Due to the absence of the central server, there is no common model in the decentralized system, and the regular agents have to calculate their stochastic gradients at different local models. Thus, the statistical difference between the regular agents’ local models is enlarged, and variance reduction methods are more difficult to analyze.

Another relevant line of research in the signal processing society is Byzantine-robust decentralized detection. An adaptive threshold is used in [42] to screen malicious messages and alleviate the negative effect of Byzantine agents. The alternating direction method of multipliers (ADMM) is applied in [43], where the trimmed mean aggregation rule is adopted to defend against Byzantine attacks. The work of [44] considers identifying Byzantine models under data falsification attacks. For a comprehensive survey, readers are referred to [17].

1.2 Our contributions

Although the methods proposed in [28, 29, 30] are proven effective in handling Byzantine attacks for decentralized stochastic optimization, the established convergence speeds are sublinear. In particular, for the decentralized Byzantine-robust stochastic aggregation (DRSA) algorithm proposed in [28], with a constant step size, it exhibits a linear convergence speed but the learning error is suboptimal and can be significant (see Section II). With a diminishing step size, DRSA achieves an optimal learning error in order (see Section IV), but only has a sublinear convergence speed. We reveal that the underlying reason of such an unsatisfactory trade-off is the existence of stochastic gradient noise. In this context, our main contributions are as follows.

(C1) In light of the unsatisfactory trade-off between convergence speed and learning error of the existing methods, this paper proposes BRAVO, Byzantine-Robust decentralized stochastic optimization Aided with Variance reductiOn. We introduce two variance reduction methods, SAGA and loopless SVRG (LSVRG), to Byzantine-robust decentralized stochastic optimization for eliminating the negative effect of the stochastic gradient noise. The two methods, BRAVO-SAGA and BRAVO-LSVRG, enjoy both linear convergence speeds and stochastic gradient noise-independent learning errors. We further show that the learning errors are optimal in order for any subgradient-based method that solves the TV-norm penalized decentralized stochastic optimization problem.

(C2) The analysis of BRAVO-SAGA and BRAVO-LSVRG is quite different to that in [28] due to the applications of the variance reduction techniques. The analysis is also more challenging than those of the Byzantine-robust distributed variance-reduced stochastic optimization methods [38], due to the absence of the central server and the existence of the non-smooth TV-norm penalty term. In addition, [38] aggregates all stochastic gradients corrected by variance reduction, while this paper considers aggregation of neighboring models. To tackle the above challenges, we introduce two novel Lyapunov functions that aim to handle the corrected stochastic gradients evaluated at different local models, and take advantages of the bounded subgradients of the TV-norm.

(C3) We conduct extensive numerical experiments to demonstrate the effectiveness of the proposed methods.

2 Problem Formulation

We consider a decentralized undirected network 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) of NN agents, in which 𝒱:={1,,N}\mathcal{V}:=\{1,\cdots,N\} represents the set of agents and \mathcal{E} represents the set of edges. If e:=(w,v)e:=(w,v)\in\mathcal{E}, then agents ww and vv are neighbors and can communicate with each other. We consider the Byzantine attacks model where a number of agents might be Byzantine and behave arbitrarily. Note that the number and identities of the Byzantine agents are unknown. Denote \mathcal{R} as the set of regular agents with ||=R|\mathcal{R}|=R and \mathcal{B} as the set of Byzantine agents with ||=B|\mathcal{B}|=B. Then, 𝒱=\mathcal{V}=\mathcal{R}\cup\mathcal{B} and N=R+BN=R+B. For agent ww, denote w\mathcal{R}_{w} and w\mathcal{B}_{w} as the sets of regular neighbors and Byzantine neighbors, respectively.

Since the Byzantine agents can always arbitrarily manipulate their local samples, it is impossible to solve (1). Therefore, a reasonable goal is to find a minimizer to the averaged local costs of the regular agents, given by

x~=argminx~p1RwFw(x~).\displaystyle\tilde{x}^{*}=\arg\min_{\tilde{x}\in{\mathbb{R}}^{p}}\frac{1}{R}\sum_{w\in\mathcal{R}}F_{w}(\tilde{x}). (2)

The main difficulty in solving (2) is that the Byzantine agents will send arbitrarily malicious messages to their neighbors, such that the optimization process is uncontrollable.

Before going further, we make a common assumption in Byzantine-robust decentralized optimization that the network composed of all regular agents is connected [28]. Intuitively, if one regular agent is surrounded by Byzantine agents, the best model it can expect is the one trained with its own local samples, and can be far away from the true model x~\tilde{x}^{*}. Denote R\mathcal{E}_{R} as the set of edges that are not attached to any Byzantine agent. The assumption is as follows.

Assumption 1.

(Network Connectivity) The network composed of all regular agents, denoted as (,R)(\mathcal{R},\mathcal{E}_{R}), is connected.

Let each regular agent ww\in\mathcal{R} have a local model xwpx_{w}\in\mathbb{R}^{p} and use xpRx\in\mathbb{R}^{pR} to stack all local models of the regular agents. With the connected regular network (,R)(\mathcal{R},\mathcal{E}_{R}), we can rewrite (2) to a consensus-constrained form, given by

minxpR\displaystyle\min_{x\in\mathbb{R}^{pR}}~{} 1RwFw(xw).\displaystyle\frac{1}{R}\sum_{w\in\mathcal{R}}F_{w}(x_{w}). (3)
s.t. xw=xv,w,vw.\displaystyle\ x_{w}=x_{v},\quad\forall w\in\mathcal{R},~{}\forall v\in\mathcal{R}_{w}.

Given an optimal solution x~\tilde{x}^{*} to (2), a longer vector xx^{*} that stacks RR vectors x~\tilde{x}^{*} is also an optimal solution to (3).

2.1 Prior work of DRSA

Following the line of [25, 26, 28], we penalize the consensus constraints in (3) by a TV-norm term, which yields

x=argminxpR1Rw(Fw(xw)+λ2vwxwxv1),\displaystyle x^{*}=\arg\min_{x\in\mathbb{R}^{pR}}\frac{1}{R}\sum_{w\in\mathcal{R}}\left(F_{w}(x_{w})+\frac{\lambda}{2}\sum_{v\in\mathcal{R}_{w}}\|x_{w}-x_{v}\|_{1}\right), (4)

with λ>0\lambda>0 being the penalty parameter. The TV-norm penalty term forces the local models to be close, but allows possible outliers. Note that [25] and [26] consider decentralized deterministic optimization, while [28] investigates decentralized stochastic optimization in the expectation minimization form, not in the finite-sum minimization form here. Nevertheless, the DRSA algorithm in [28] can still be applied to solve (4). When the Byzantine agents are absent, applying the stochastic subgradient method to solve (4) yields the update

xwk+1=xwkαk(Fw,iwk(xwk)+λvwsign(xwkxvk)),\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha^{k}\left(F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\right), (5)

for any regular agent ww\in\mathcal{R} at time kk. Therein, αk>0\alpha^{k}>0 is the step size, and iwki_{w}^{k} is the local sample index (one sample or a mini-batch of samples) selected by regular agent ww at time kk. To run (5), every regular agent ww\in\mathcal{R} needs to collect its neighboring models. However, when the Byzantine agents appear, (5) is not applicable since every regular agent ww\in\mathcal{R} does not know the identities of its Byzantine neighbors, while the Byzantine neighbors may send arbitrarily malicious messages for the sake of disturbing the optimization process. Denote zvkpz_{v}^{k}\in\mathbb{R}^{p} as an arbitrary model that Byzantine agent vv\in\mathcal{B} sends to all neighbors. In fact, it can send different arbitrary models to different neighbors, but for notational convenience we use the same zvkz_{v}^{k}, which does not affect algorithm development and theoretical analysis. In this case, for regular agent ww\in\mathcal{R}, the update rule of DRSA is changed from (5) to

xwk+1=xwkαk(Fw,iwk(xwk)\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha^{k}\Bigg{(}F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k}) +λvwsign(xwkxvk)+λvwsign(xwkzvk)).\displaystyle+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Bigg{)}. (6)

2.2 Weakness of DRSA

DRSA is a stochastic subgradient-based method and uses a decreasing step size by default. According to Theorem 2 in [28], the theoretical convergence rate of DRSA is sublinear, which is slow. With a proper constant step size, DRSA has a linear convergence rate, but the corresponding learning error is relative to the variance of stochastic gradients, which could be large when the local samples of each regular agent are statistically very different.

Here we illustrate the impact of stochastic gradient noise on DRSA with a simple decentralized stochastic least-squares problem. As shown in Fig. 1, when the batch size is small, the stochastic gradient noise is large and the iterate xkx^{k} converges to a neighborhood far away from the optimal solution xx^{*}. As the batch size increases, the convergence error correspondingly decreases. However, a larger batch size means that more stochastic gradients must be calculated at every time, which is expensive. Inspired by this observation, we propose to utilize computationally cheap variance reduction techniques to reduce the stochastic gradient noise, and in consequence, lower the learning error.

Refer to caption
Figure 1: Squared distance between the iterate xkx^{k} and the optimal solution xx^{*} under different batch sizes for DRSA. The problem dimension is p=1p=1. The network is a complete graph with N=4N=4 agents, and every regular agent has J=10000J=10000 scalar samples dw,jd_{w,j} following the standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). The local cost of every regular agent ww\in\mathcal{R} is in the least-squares form Fw(xw)=1Jj=1JFw,j(xw)=12Jj=1J(xwdw,j)2F_{w}(x_{w})=\frac{1}{J}\sum_{j=1}^{J}F_{w,j}(x_{w})=\frac{1}{2J}\sum_{j=1}^{J}(x_{w}-d_{w,j})^{2}. One of the agents is Byzantine, and performs Gaussian attacks by sending messages following Gaussian distribution 𝒩(0,1002)\mathcal{N}(0,100^{2}). A large batch size corresponds to small stochastic gradient noise. The step size is α=0.0008\alpha=0.0008. The penalty parameter is λ=0.005\lambda=0.005.

3 Our Proposed Methods

In this section, we propose two decentralized stochastic optimization methods equipped with variance reduction techniques to boost Byzantine-robustness, termed as BRAVO-SAGA and BRAVO-LSVRG. Similar to other works on stochastic optimization [11, 36], we need to assume that the samplings at the regular agents are independent and uniformly random.

Assumption 2.

(Independent and Uniformly Random Sampling) At time kk, every regular agent ww\in\mathcal{R} independently and uniformly randomly select a sample with index iwk{1,,J}i_{w}^{k}\in\{1,\cdots,J\}.

In BRAVO-SAGA, every regular agent ww\in\mathcal{R} maintains a stochastic gradient table in which every item corresponds to a local sample, and updates an item in the table when the corresponding sample is selected to calculate the stochastic gradient. Specifically, let

ϕw,jk+1={ϕw,jk,jiwk,xwk,j=iwk,\displaystyle\phi^{k+1}_{w,j}=\left\{\begin{aligned} &\phi^{k}_{w,j},&&j\neq i_{w}^{k},\\ &x_{w}^{k},&&j=i_{w}^{k},\end{aligned}\right. (7)

where ϕw,jk\phi^{k}_{w,j} represents the most recent local model used in calculating Fw,jF^{\prime}_{w,j} on agent ww, prior to time kk. If the selected sample index iwk=ji_{w}^{k}=j at time kk, then ϕw,jk+1=xwk\phi^{k+1}_{w,j}=x_{w}^{k}; otherwise ϕw,jk+1=ϕw,jk\phi^{k+1}_{w,j}=\phi^{k}_{w,j}. Therefore, Fw,j(ϕw,jk)F^{\prime}_{w,j}(\phi^{k}_{w,j}) refers to the most recent stochastic gradient for sample jj on agent ww, prior to time kk.

Rather than using a stochastic gradient Fw,iwk(xwk)F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k}) to update its local model, here every regular agent ww\in\mathcal{R} calculates a corrected stochastic gradient gwkg_{w}^{k}, which is defined as

gwk=Fw,iwk(xwk)Fw,iwk(ϕw,iwkk)+1Jj=1JFw,j(ϕw,jk).\displaystyle g_{w}^{k}=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})+\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k}). (8)

We can observe that the corrected stochastic gradient gwkg_{w}^{k} modifies the stochastic gradient Fw,iwk(xwk)F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k}) by subtracting the stored one Fw,iwk(ϕw,iwkk)F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k}) that also corresponds to sample index iwki_{w}^{k}, and adding the average 1Jj=1JFw,j(ϕw,jk)\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k}) of all stored stochastic gradients. Such a variance reduction operation works on the local samples and is irrelevant with the TV-norm penalty. Thus, we replace the original stochastic gradient Fw,iwk(xwk)F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k}) in (6) by gwkg_{w}^{k}, and write the update rule of every regular agent ww\in\mathcal{R} in BRAVO-SAGA as

xwk+1=xwkα(gwk\displaystyle x_{w}^{k+1}=x_{w}^{k}-\alpha\Bigg{(}g_{w}^{k} +λvwsign(xwkxvk)+λvwsign(xwkzvk)).\displaystyle+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Bigg{)}. (9)

It is worth noting that we use a constant step size α>0\alpha>0 here, taking advantage of the variance reduction property of SAGA [45]. In Section 4, we will show that BRAVO-SAGA can linearly converge with a constant step size.

Algorithm 1 BRAVO-SAGA

Input: xw0px_{w}^{0}\in{\mathbb{R}}^{p} and Fw,j(ϕw,j0)=Fw,j(xw0),j=1,,JF^{\prime}_{w,j}(\phi_{w,j}^{0})=F^{\prime}_{w,j}(x_{w}^{0}),j=1,\cdots,J, w\forall w\in\mathcal{R}. λ>0\lambda>0 and α>0\alpha>0.

1:for k=0,1,k=0,1,\cdots, every regular agent ww\in\mathcal{R} do
2:     Broadcast its current model xwkx_{w}^{k} to all neighbors.
3:     Receive xvkx_{v}^{k} from regular neighbors vwv\in\mathcal{R}_{w} and zvkz^{k}_{v} from Byzantine neighbors vwv\in\mathcal{B}_{w}.
4:     Sample iwki_{w}^{k} from {1,,J}\{1,\cdots,J\}.
5:     Store stochastic gradient Fw,iwk(ϕw,iwkk)=Fw,iwk(xwk)F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k}).
6:     Update gwkg_{w}^{k} according to (8).
7:     Update local iterate xwk+1x^{k+1}_{w} according to (9).
8:end for
Algorithm 2 BRAVO-LSVRG

Input: xw0px_{w}^{0}\in{\mathbb{R}}^{p} and yw0=xw0y_{w}^{0}=x_{w}^{0}, w\forall w\in\mathcal{R}. λ>0\lambda>0 and α>0\alpha>0.

1:for k=0,1,k=0,1,\cdots, every regular agent ww\in\mathcal{R} do
2:     Broadcast its current model xwkx_{w}^{k} to all neighbors.
3:     Receive xvkx_{v}^{k} from regular neighbors vwv\in\mathcal{R}_{w} and zvkz^{k}_{v} from Byzantine neighbors vwv\in\mathcal{B}_{w}.
4:     Sample iwki_{w}^{k} from {1,,J}\{1,\cdots,J\}.
5:     With probability 1/J1/J, ywk+1=xwky_{w}^{k+1}=x_{w}^{k} and calculate local
6:      full gradient; otherwise, ywk+1=ywky_{w}^{k+1}=y_{w}^{k}.
7:     Update gwkg_{w}^{k} according to (11).
8:     Update local iterate xwk+1x^{k+1}_{w} according to (9).
9:end for

Inheriting the property of SAGA, BRAVO-SAGA requires every regular agent to maintain a stochastic gradient table, which brings extra storage cost. For storage-limited applications, we can replace SAGA by loopless SVRG (as known as LSVRG) [46], which incurs no extra storage cost, although doubles the computation cost per iteration.

Define

ywk+1={xwk,with probability1J,ywk,with probability 11J,\displaystyle y_{w}^{k+1}=\left\{\begin{aligned} &x_{w}^{k},&&\text{with probability}\ \frac{1}{J},\\ &y_{w}^{k},&&\text{with probability}\ 1-\frac{1}{J},\end{aligned}\right. (10)

where ywk+1y_{w}^{k+1} is a reference point at which the full gradient Fw(xw)F^{\prime}_{w}(x_{w}) is calculated. In BRAVO-LSVRG, every regular agent ww\in\mathcal{R} also updates with (9), where the corrected stochastic gradient is defined as

gwk=Fw,iwk(xwk)Fw,iwk(ywk)+Fw(ywk).\displaystyle g_{w}^{k}=F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})+F^{\prime}_{w}(y_{w}^{k}). (11)

Observe that at every time, with probability 1J\frac{1}{J}, ywk+1y_{w}^{k+1} is updated to xwkx_{w}^{k}, meaning that the local full gradient must be calculated. By doubling the number of the stochastic gradient calculations per iteration (in expectation), BRAVO-LSVRG avoids the high storage cost of BRAVO-SAGA.

The two proposed methods are outlined in Algorithm 1 and Algorithm 2, respectively.

4 Theoretical Analysis

In this section, we theoretically analyze the performance of the proposed BRAVO-SAGA and BRAVO-LSVRG.

4.1 Assumptions

We make the following assumptions that are common in the analysis of decentralized stochastic optimization [4, 28] and variance reduction [45, 46].

Assumption 3.

(Strong Convexity) For any model x~p\tilde{x}\in{\mathbb{R}}^{p} and every regular agent ww\in\mathcal{R}, the local cost Fw(x~)F_{w}(\tilde{x}) is strongly convex with constant μw\mu_{w}.

Assumption 4.

(Lipschitz Continuous Gradients) For any model x~p\tilde{x}\in{\mathbb{R}}^{p} and every regular agent ww\in\mathcal{R}, the local sample cost Fw,iwk(x~)F_{w,i_{w}^{k}}(\tilde{x}) has Lipschitz continuous gradients with constant LwL_{w}.

Assumption 5.

(Bounded Variance) For any model x~p\tilde{x}\in{\mathbb{R}}^{p}, the variance of Fw,iwk(x~)F^{\prime}_{w,i_{w}^{k}}(\tilde{x}) is upper bounded by δw2\delta_{w}^{2}, namely, 𝔼iwkFw,iwk(x~)Fw(x~)2δw2\mathbb{E}_{i_{w}^{k}}\|F^{\prime}_{w,i_{w}^{k}}(\tilde{x})-F^{\prime}_{w}(\tilde{x})\|^{2}\leq\delta_{w}^{2}.

Assumption 5 bounds the variance of the local stochastic gradients. It is worth noting that BRAVO-SAGA and BRAVO-LSVRG only need Assumptions 3 and 4, but do not need Assumption 5 that is necessary for the theoretical analysis of DRSA [28]. The reason is that the adopted variance reduction methods are able to eliminate the effect of stochastic gradient noise. We do not assume homogeneity of regular agents; their local costs can be totally different.

4.2 Main theorems

As the network topology plays a critical role in the analysis, we define AR×|R|A\in{\mathbb{R}}^{R\times|\mathcal{E}_{R}|} as the oriented agent-edge incidence matrix of (,R)(\mathcal{R},\mathcal{E}_{R}). To be specific, for an edge e=(w,v)Re=(w,v)\in\mathcal{E}_{R} with w<vw<v, the (w,e)(w,e)-th entry of AA is 1 and the (v,e)(v,e)-th entry of AA is 1-1. We review the following lemma.

Lemma 1.

([28], Theorem 1) Suppose that Assumptions 1 and 3 hold true. If λλ0:=Rσ~min(A)maxwFw(x~)\lambda\geq\lambda_{0}:=\frac{\sqrt{R}}{\tilde{\sigma}_{\min}(A)}\max_{w\in\mathcal{R}}\|F^{\prime}_{w}(\tilde{x}^{*})\|_{\infty} where σ~min(A)\tilde{\sigma}_{\min}(A) is the minimum nonzero singular value of A, then for the optimal solution xx^{*} of (4) and the optimal solution x~\tilde{x}^{*} of (2), we have that xx^{*} stacks RR vectors x~\tilde{x}^{*} as [;x~;][\cdots;\tilde{x}^{*};\cdots].

Since BRAVO-SAGA, BRAVO-LSVRG and DRSA solve the same TV-norm penalized problem (4), they share Lemma 1, which shows that (4) is equivalent to the original problem (2) when the penalty parameter λ\lambda is sufficiently large. Note that the threshold λ0\lambda_{0} is determined by the heterogeneity of the regular agents’ local costs. Higher heterogeneity leads to larger λ0\lambda_{0}. Below we only consider the regime of large λ\lambda and focus on the convergence to (4).

The next theorem demonstrates that both BRAVO-SAGA and BRAVO-LSVRG can achieve linear convergence, and the learning errors are bounded.

Theorem 1.

Suppose Assumptions 1, 2, 3, 4 hold true. Denote η=minw{μwLwμw+Lw}ϵ2>0\eta=\min_{w\in\mathcal{R}}\{\frac{\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}-\frac{\epsilon}{2}>0, ϵ(0,minw{2μwLwμw+Lw})\epsilon\in(0,\min_{w\in\mathcal{R}}\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}), and L=maxwLwL=\max_{w\in\mathcal{R}}L_{w}. Set the step size of BRAVO-SAGA and BRAVO-LSVRG as αη12L2J.\alpha\leq\frac{\eta}{12L^{2}J}. We have

𝔼[Vk](1ηα)kV0+Δ,\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)^{k}V^{0}+\Delta, (12)

where

Vk:=xkx2+8Jα2L2Sk,\displaystyle V^{k}:=\|x^{k}-x^{*}\|^{2}+8J\alpha^{2}L^{2}S^{k}, (13)
Δ:=αηw(32λ2|w|2p+4λ2|w|2p)+1ϵηwλ2|w|2p.\displaystyle\Delta:=\frac{\alpha}{\eta}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\frac{1}{\epsilon\eta}\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p. (14)

For BRAVO-LSVRG, we define SkS^{k} as

Sk:=wywkxw2.\displaystyle S^{k}:=\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}. (15)

For BRAVO-SAGA, we define SkS^{k} as

Sk:=w1Jj=1Jϕw,jkxw2.\displaystyle S^{k}:=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}. (16)

In (12), the expectation is taken over all random variables.

The difference between (15) and (16) comes from the ways of SAGA and LSVRG to correct the stochastic gradients. Thus, we choose different Lyapunov functions VkV^{k} and derive a unified convergence expression for the two methods. Theorem 1 asserts that BRAVO-SAGA and BRAVO-LSVRG can linearly converge to a neighborhood of the optimal solution xx^{*} of (4); the size of the neighborhood is determined by the penalty parameter λ\lambda, the summation of the squared numbers of Byzantine neighbors w|w|2\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2}, the summation of the squared numbers of regular neighbors w|w|2\sum_{w\in\mathcal{R}}|\mathcal{R}_{w}|^{2}, and the problem dimension pp.

Remark 1.

According to Theorem 1, we know that the step size in BRAVO-SAGA and BRAVO-LSVRG is quite important to their performance. If we choose a small step size, the learning error Δ\Delta is small but the convergence speed is relatively slow. Otherwise, if we choose a large step size, the convergence speed will be fast but the learning error is large. One can balance the two performance metrics through judiciously tuning the step size.

Next, we show that the learning errors established in Theorem 1 are optimal in order for any subgradient-based method that solves the TV-norm penalized decentralized stochastic optimization problem in (4). The class of such methods satisfy the following assumption.

Assumption 6.

(Span Condition) The iterative method solving (4) generates the sequence of regular agents’ models {xwk}w\{x_{w}^{k}\}_{w\in\mathcal{R}} via

xwkxw0+span{h^w0,h^w1,,h^wk1},w,\displaystyle x_{w}^{k}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0},\hat{h}_{w}^{1},\cdots,\hat{h}_{w}^{k-1}\},\quad\forall w\in\mathcal{R}, (17)

where h^wk:=hwk+λvwsign(xwkxvk)+λvwsign(xwkzvk)\hat{h}_{w}^{k}:=h_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}) is a stochastic subgradient, while hwkh_{w}^{k} is a linear combination of {Fw,1(xw0),,Fw,J(xw0),Fw,1(xw1),,Fw,J(xw1),,Fw,1(xwk),,Fw,J(xwk)}\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0}),F^{\prime}_{w,1}(x_{w}^{1}),\cdots,F^{\prime}_{w,J}(x_{w}^{1}),\cdots,F^{\prime}_{w,1}(x_{w}^{k}),\cdots,F^{\prime}_{w,J}(x_{w}^{k})\} and satisfies 𝔼[hwk]=Fw(xwk)\mathbb{E}[h_{w}^{k}]=F^{\prime}_{w}(x_{w}^{k}).

Such a span condition is standard in analyzing the performance bounds of (stochastic) first-order methods [47, 48]. Methods satisfying Assumption 6 include but are not limited to DRSA, BRAVO-SAGA and BRAVO-LSVRG.

Theorem 2.

Given any decentralized stochastic optimization method to solve (4) that: (i) satisfies Assumptions 2 and 6; (ii) initializes the regular agents’ models at the same point, we can: (i) find RR local costs {F1(x),,FR(x)}\{F_{1}(x),\cdots,F_{R}(x)\} satisfying Assumption 3 for the regular agents, among which each local cost Fw(x)F_{w}(x) is composed of the local sample costs {Fw,1(x),,Fw,J(x)}\{F_{w,1}(x),\cdots,F_{w,J}(x)\} satisfying Assumption 4; (ii) find a network topology satisfying Assumption 1, within which each regular agent ww has w\mathcal{B}_{w} Byzantine neighbors, such that the learning error of this method is at least

𝔼xkx2=Ω(wλ2|w|2p).\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p). (18)

The lower bound on the learning error, as given in (18), is influenced by the heterogeneity across the local costs of regular agents (represented by the penalty parameter λ\lambda) and the impact of Byzantine agents (represented by the term w|w|2\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2}). Note that the lower bound given in [49] for Byzantine-robust distributed stochastic optimization is also determined by the heterogeneity across the local costs of regular agents and the impact of Byzantine agents.

Remark 2.

Although the Lyapunov functions in Theorem 1 and Theorem 2 are different, the learning errors are still comparable. Observe that the learning error in Theorem 1 is optimal in order when the summation of the squared numbers of Byzantine neighbors w|w|2\sum_{w\in\mathcal{R}}|\mathcal{B}_{w}|^{2} and the summation of the squared numbers of regular neighbors w|w|2\sum_{w\in\mathcal{R}}|\mathcal{R}_{w}|^{2} are close, or the step size α\alpha is sufficiently small.

To further illustrate the benefits of variance reduction to Byzantine-robust decentralized stochastic optimization, we compare the theoretical results with that of DRSA.

Lemma 2.

[28, Constant Step Size Variant of Theorem 2] Suppose Assumptions 1, 2, 3, 4, 5 hold true. Denote η=min{2μwLwμw+Lw}ϵ>0\eta=\min\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}-\epsilon>0 and ϵ(0,minw{2μwLwμw+Lw})\epsilon\in(0,\min_{w\in\mathcal{R}}\{\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\}). Set the step size of DRSA as α=minw{14(μw+Lw)}.{\alpha}=\min_{w\in\mathcal{R}}\left\{\frac{1}{4(\mu_{w}+L_{w})}\right\}. We have

𝔼xkx2(1ηα)kx0x2+Δ+2αηwδw2,\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(1-\eta{\alpha})^{k}\|x^{0}-x^{*}\|^{2}+\Delta+\frac{2\alpha}{\eta}\sum_{w\in\mathcal{R}}\delta_{w}^{2}, (19)

where Δ\Delta is defined in (14). Here the expectation is taken over all random variables.

Lemma 2 shows that with a constant step size, DRSA can also achieve a linear convergence rate, and the learning error is relative to the summed variance wδw2\sum_{w\in\mathcal{R}}\delta_{w}^{2}. Comparing the learning errors in Theorem 1 and Lemma 2, we can observe that in BRAVO-SAGA and BRAVO-LSVRG the effect of stochastic gradient noise has been fully eliminated. On the other hand, Lemma 2 also implies that with a diminishing step size, DRSA has a learning error that matches the lower bound in (18), but the convergence rate is sublinear. To the best of our knowledge, the proposed BRAVO-SAGA and BRAVO-LSVRG are the first Byzantine-robust decentralized stochastic optimization methods that achieve stochastic gradient noise-independent learning errors and linear convergence speeds simultaneously.

5 Numerical Experiments

In this section, we conduct extensive numerical experiments to demonstrate the Byzantine-robustness of the proposed methods, BRAVO-SAGA and BRAVO-LSVRG.

We consider an undirected Erdos-Renyi graph consisting of N=100N=100 agents, in which every edge e=(w,v)e=(w,v) for any w,v𝒱w,v\in\mathcal{V} is connected with probability q[0,1]q\in[0,1]. We set q=0.5q=0.5. All experiments are conducted on the MNIST and Fashion-MNIST datasets using softmax regression. The global cost function is defined as

F(x~)=1Jj=1Jm=0M1(I(b(j)=m)ln(exp((x~)mTa(j))l=0M1exp((x~)lTa(j)))).\displaystyle F(\tilde{x})=-\frac{1}{J}\sum_{j=1}^{J}\sum_{m=0}^{M-1}\left(I(b^{(j)}=m)\ln\left(\frac{\exp((\tilde{x})_{m}^{T}a^{(j)})}{\sum_{l=0}^{M-1}\exp((\tilde{x})_{l}^{T}a^{(j)})}\right)\right).

Here JJ and MM are the numbers of samples and categories, respectively. Sample jj is represented by (a(j),b(j))(a^{(j)},b^{(j)}), where a(j)p/Ma^{(j)}\in{\mathbb{R}}^{p/M} is the data and b(j)b^{(j)}\in{\mathbb{R}} is the target. I(b(j)=m)I(b^{(j)}=m) is the indicator function; I(b(j)=m)=1I(b^{(j)}=m)=1 if b(j)=mb^{(j)}=m, and I(b(j)=m)=0I(b^{(j)}=m)=0 otherwise. The model is x~p\tilde{x}\in{\mathbb{R}}^{p} and (x~)mp/M(\tilde{x})_{m}\in{\mathbb{R}}^{p/M} is the mm-th block of x~\tilde{x}. The MNIST dataset contains M=10M=10 handwritten digits from 0 to 9, with 60,000 training images and 10,000 testing images whose dimensions are p/M=784p/M=784. The Fashion-MNIST dataset contains fashion images and the other attributes are the same as those of the MNIST dataset.

We consider i.i.d. (independent and identically distributed) and non-i.i.d. data distributions across the agents. In the i.i.d. case, all NN agents randomly and evenly partition the training samples. In the non-i.i.d. case, every 10 agents randomly and evenly partition the training samples of one class.

We compare our proposed methods BRAVO-SAGA and BRAVO-LSVRG with several benchmark methods, including DPSGD [4, 6], the stochastic version of ByRDiE [21] (called ByRDiE-S), the stochastic version of BRIDGE [23] (called BRIDGE-S), and DRSA [28]. For DPSGD, we choose the Metropolis weights [50] to ensure the mixing matrix is doubly stochastic. For ByRDiE-S, we set the number of inner iteration to be 1, and for fair comparison, refer one iteration as all dimensions being updated once. All step sizes of the benchmark methods are hand-tuned to the best. For DRSA, the step size is set to αk=O(1k)\alpha^{k}=O(\frac{1}{\sqrt{k}}) as in [28], unless otherwise specified.

The performance metrics include classification accuracy, model variance, and convergence error xkx2\|x^{k}-x^{*}\|^{2}. When computing the classification accuracy over the testing samples, we randomly choose one regular agent to compute the classification accuracy on its local model. To see the consensus error between the regular agents, we also show the variance of the regular models. See http://github.com/pengj97/BRAVO.

Refer to caption
Figure 2: Classification accuracy and variance of agents’ local models without Byzantine attacks with i.i.d. data.
Refer to caption
Figure 3: Classification accuracy and variance of regular agents’ local models under Gaussian attacks with i.i.d. data.

5.1 Byzantine attacks

When Byzantine agents exist, we set their number as B=20B=20, unless otherwise specified. We randomly choose BB agents to be Byzantine but guarantee the network of regular agents to be connected. We consider the following Byzantine attacks in the numerical experiments.

Gaussian attacks. Every Byzantine agent vv\in\mathcal{B} sends to its neighbors malicious model xvkx_{v}^{k}, whose entries independently follow the Gaussian distribution 𝒩(0,1002)\mathcal{N}(0,100^{2}).

Sign-flipping attacks. Every Byzantine agent vv\in\mathcal{B} calculates its true model xvkx_{v}^{k}, multiplies with a negative constant cc, and sends the malicious model zvk=cxvkz_{v}^{k}=cx_{v}^{k} to its neighbors. Here we set c=4c=-4.

Sample-duplicating attacks. The Byzantine agents collude to select a specific regular agent, copy the model of the selected agent, and send it to neighbors. This amounts to that the Byzantine agents duplicate the samples of the selected regular agent. Here the sample-duplicating attacks are only applied to the non-i.i.d. case.

Refer to caption
Figure 4: Classification accuracy and variance of regular agents’ local models under sign-flipping attacks with i.i.d. data.
Refer to caption
Figure 5: Summed variance of regular agents’ (corrected) stochastic gradients without Byzantine attacks with i.i.d. data.

5.2 Experiments on the i.i.d. case

Without attacks. When the data distribution is i.i.d. and the number of Byzantine agents is B=0B=0, the numerical results are shown in Fig. 2. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is λ=0.005\lambda=0.005 and the step size is α=0.01\alpha=0.01. All methods perform well in this case, although ByRDiE-S is slower than the others. On the Fashion-MNIST dataset, BRAVO-SAGA and BRAVO-LSVRG reach slightly higher accuracy than DRSA.

Gaussian attacks and sign-flipping attacks. Set λ=0.005\lambda=0.005 and α=0.01\alpha=0.01 under Gaussian attacks, as well as λ=0.0001\lambda=0.0001 and α=0.01\alpha=0.01 under sign-flipping attacks for BRAVO-SAGA and BRAVO-LSVRG. The numerical results are shown in Figs. 3 and 4. DPSGD fails since it is not designed to defend against Byzantine attacks. All Byzantine-robust methods perform well, while BRAVO-SAGA and BRAVO-LSVRG demonstrate faster convergence speeds and better classification accuracies, especially in the Fashion-MNIST dataset.

Importance of variance reduction. As we can observe from Figs. 3 and 4, the performance gains of both BRAVO-SAGA and BRAVO-LSVRG over DRSA are more obvious in the Fashion-MNIST dataset than in MNIST. To investigate this phenomenon, we depict how the summed variance of the regular agents’ (corrected) stochastic gradients evolves for the three methods and on the two datasets, as shown in Fig. 5. We consider that there is no Byzantine agent so as to avoid the possible interference of Byzantine attacks. The parameters are the same as those in plotting Fig. 2. Observe that with the variance reduction methods, the stochastic gradient noise is indeed reduced. Also observe that the level of the uncorrected stochastic gradient noise on MNIST is much lower than that on Fashion-MNIST. Hence, the improvement of BRAVO-SAGA and BRAVO-LSVRG over DRSA is smaller. The numerical results corroborate the theoretical findings that the learning error of DRSA is relative to the summed variance wδw2\sum_{w\in\mathcal{R}}\delta_{w}^{2}, while the learning errors of the two variance-reduced ones are independent with the stochastic gradient noise.

Refer to caption
Figure 6: Classification accuracy and variance of regular agents’ local models under sample-duplicating attacks with non-i.i.d. data.
Refer to caption
Figure 7: Impact of penalty parameter λ\lambda for BRAVO-SAGA and BRAVO-LSVRG under sample-duplicating attacks with non-i.i.d. data.
Refer to caption
Figure 8: Classification accuracy of regular agent’s local models with different fractions of Byzantine agents under sample-duplicating attacks with non-i.i.d. data.

5.3 Experiments on the non-i.i.d. case

Sample-duplicating attacks. Consider the case that the data distribution is non-i.i.d. and the Byzantine attacks are sample-duplicating. It is worth noting that since the Byzantine agents own the samples of two classes, the best accuracy that the regular agents can obtain is 0.8. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is λ=0.02\lambda=0.02 and the step size is α=0.01\alpha=0.01. As depicted in Fig. 6, BRAVO-SAGA, BRAVO-LSVRG and DRSA all perform well and achieve near-optimal accuracy. The TV-norm penalty forces the regular models to reach consensus, and thus these methods are insensitive to the non-i.i.d. data distribution. In contrast, with the majority-voting scheme, ByRDiE-S and BRIDGE-S fail in this case.

5.4 Impact of λ\lambda and B/NB/N

Impact of penalty parameter λ\lambda. We investigate the impact of the penalty parameter λ\lambda. The settings are same as those in sample-duplicating attacks. As shown in Fig. 7, if we choose a relatively large or small value of λ\lambda, such as λ=0.2\lambda=0.2 or λ=0.002\lambda=0.002, the performance of BRAVO-SAGA and BRAVO-LSVRG is not as good as that with a proper value of λ\lambda. For example, here λ=0.02\lambda=0.02 gives a favorable trade-off between convergence speed and learning error.

Impact of fraction of Byzantine agents B/NB/N. We consider the impact of the fraction of the Byzantine agents. Here the settings are the same as those in sample-duplicating attacks. Considering that ByRDiE-S and BRIDGE-S require N>2BN>2B, we let the fraction of Byzantine agents be less than 0.5. In BRAVO-SAGA and BRAVO-LSVRG, the penalty parameter is λ=0.005\lambda=0.005 and the step size is α=0.01\alpha=0.01. As depicted in Fig. 8, with the increasing fraction of Byzantine agents, all methods suffer from performance losses, but the advantages of BRAVO-SAGA and BRAVO-LSVRG are obvious compared to ByRDiE-S and BRIDGE-S.

5.5 Experiments on convergence error

Refer to caption
Figure 9: Convergence error of DRSA (αk=0.01/(k+1)\alpha^{k}=0.01/(k+1)), DRSA (0.01/k+10.01/\sqrt{k+1}), DRSA (α=0.01\alpha=0.01), BRAVO-SAGA (α=0.01\alpha=0.01) and BRAVO-LSVRG (α=0.01\alpha=0.01) under sign-flipping attacks with i.i.d. data and sample-duplicating attacks with non-i.i.d. data.

To further highlight the advantages of BRAVO-SAGA and BRAVO-LSVRG over DRSA, we also compare them on the MNIST and Fashion-MNIST datasets in terms of the convergence error xkx2\|x^{k}-x^{*}\|^{2}. Since the step sizes of DRSA are different in theory and in practice, we choose three step size rules for DRSA, αk=0.01/(k+1)\alpha^{k}=0.01/(k+1), αk=0.01/k+1\alpha^{k}=0.01/\sqrt{k+1} and αk=0.01\alpha^{k}=0.01. The attacks are sign-flipping with i.i.d. data where we set λ=0.0001\lambda=0.0001 and sample-duplicating with non-i.i.d. data where we set λ=0.02\lambda=0.02. The experimental results are depicted in Fig. 9. Observe that BRAVO-LSVRG and BRAVO-SAGA, with the help of variance reduction, both achieve faster convergence speeds than DRSA with the diminishing step sizes, and achieve smaller learning errors than DRSA with a constant step size that suffers from the stochastic gradient noise. These experimental results corroborate the theoretical analysis in Section 4 and show the superior performance of BRAVO-SAGA and BRAVO-LSVRG.

6 Conclusions

This paper aims at developing Byzantine-robust stochastic optimization methods over a decentralized network, in which an unknown number of Byzantine agents collude to bias the optimization process. Motivated by fact that the stochastic gradient noise significantly affects the learning error, we develop two variance-reduced Byzantine-robust decentralized stochastic optimization methods, BRAVO-SAGA and BRAVO-LSVRG. These methods enjoy linear convergence speeds and provable learning errors, which are independent of the stochastic gradient noise and are shown to be optimal in order for any subgradient-based method solving the TV-norm penalized decentralized stochastic optimization problem.

In light of this work, one natural question arises: What is the lower bound of the learning errors of first-order methods when applied to solving the non-penalized Byzantine-robust decentralized stochastic optimization problem (2)? Answering this fundamental question is challenging in our current setting, since we impose no restrictions on the data distributions across the agents. We will investigate this issue for the i.i.d. data distribution in our future work.

Acknowledgement. Qing Ling is supported in part by NSF China Grants 61973324 and 12126610, Guangdong Basic and Applied Basic Research Foundation Grant 2021B1515020094, and Guangdong Provincial Key Laboratory of Computational Science Grant 2020- B1212060032. A short preliminary version of this paper has appeared in IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, May 22–27, 2022 [51].

Appendix A Proof of Theorem 1

Proof.

We prove Theorem 1 for BRAVO-LSVRG first, and then modify the proof for BRAVO-SAGA.

Step 1. We begin with taking the conditional expectation given the random variables up to time kk, namely {iwl:l<k,w}\{i_{w}^{l}:l<k,w\in\mathcal{R}\}, and then taking the expectation over these random variables. For notational simplicity, denote the conditional expectation 𝔼[|iwl:l<k,w]\mathbb{E}[\cdot|i_{w}^{l}:l<k,w\in\mathcal{R}] as 𝔼k[]\mathbb{E}_{k}[\cdot]. From the update in (9), for every regular agent ww, we have

𝔼kxwk+1xw2\displaystyle\mathbb{E}_{k}\|x_{w}^{k+1}-x_{w}^{*}\|^{2} (20)
=\displaystyle= 𝔼kxwkxwα(gwk+λvwsign(xwkxvk))α(λvwsign(xwkzvk))2\displaystyle\mathbb{E}_{k}\|x_{w}^{k}-x_{w}^{*}-\alpha\cdot\Big{(}g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\Big{)}-\alpha\cdot\Big{(}\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\Big{)}\|^{2}
=\displaystyle= xwkxw2+α2𝔼kgwk+λvwsign(xwkxvk)+λvwsign(xwkzvk)2\displaystyle\|x_{w}^{k}-x_{w}^{*}\|^{2}+\alpha^{2}\cdot\mathbb{E}_{k}\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}
2α𝔼k[gwk]+λvwsign(xwkxvk),xwkxw2αλvwsign(xwkzvk),xwkxw.\displaystyle-2\alpha\cdot\langle\mathbb{E}_{k}[g_{w}^{k}]+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle-2\alpha\cdot\langle\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle.

We handle the terms at the right-hand side of (20) one by one.

For the second term at the right-hand side of (20), consider the optimality condition of (4), given by

Fw(xw)+λvwswv=0,\displaystyle F^{\prime}_{w}(x_{w}^{*})+\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}=0, (21)

which holds true for some swv[1,1]ps_{wv}\in[-1,1]^{p} and svw=swvs_{vw}=-s_{wv}. Then we have

𝔼kgwk+λvwsign(xwkxvk)+λvwsign(xwkzvk)2\displaystyle\mathbb{E}_{k}\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}
=\displaystyle= 𝔼kFw(xwk)+λvwsign(xwkxvk)+λvwsign(xwkzvk)Fw(xw)λvwswv+gwkFw(xwk)2\displaystyle\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})-F^{\prime}_{w}(x_{w}^{*})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}+g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2}
\displaystyle\leq 2Fw(xwk)+λvwsign(xwkxvk)+λvwsign(xwkzvk)Fw(xw)λvwswv2+2𝔼kgwkFw(xwk)2\displaystyle 2\|F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\ -F^{\prime}_{w}(x_{w}^{*})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\|^{2}+2\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2}
\displaystyle\leq 4𝔼kFw(xwk)Fw(xw)+λvwsign(xwkxvk)λvwswv2+4λvwsign(xwkzvk)2+2𝔼kgwkFw(xwk)2\displaystyle 4\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\|^{2}+4\|\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}+2\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2}
\displaystyle\leq 8𝔼kFw(xwk)Fw(xw)2+8λvwsign(xwkxvk)λvwswv2+4λvwsign(xwkzvk)2+2𝔼kgwkFw(xwk)2\displaystyle 8\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}+8\|\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv}\|^{2}+4\|\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}+2\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2}
\displaystyle\leq 8𝔼kFw(xwk)Fw(xw)2+32λ2|w|2p+4λ2|w|2p+2𝔼kgwkFw(xwk)2.\displaystyle 8\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}+32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p+2\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2}. (22)

For the last term at the right-hand side of (A), from (11), we have

𝔼kgwkFw(xwk)2\displaystyle\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2} (23)
=\displaystyle= 𝔼kFw,iwk(xwk)Fw,iwk(xw)+Fw(xw)Fw(xwk)(Fw,iwk(ywk)Fw,iwk(xw)+Fw(xw)Fw(ywk))2\displaystyle\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(x_{w}^{k})-(F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(y_{w}^{k}))\|^{2}
\displaystyle\leq 2𝔼kFw,iwk(xwk)Fw,iwk(xw)+Fw(xw)Fw(xwk)2+2𝔼kFw,iwk(ywk)Fw,iwk(xw)+Fw(xw)Fw(ywk)2\displaystyle 2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(x_{w}^{k})\|^{2}+2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(y_{w}^{k})\|^{2}
\displaystyle\leq 2𝔼kFw,iwk(xwk)Fw,iwk(xw)2+2𝔼kFw,iwk(xw)Fw,iwk(ywk)2\displaystyle 2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})\|^{2}+2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})-F^{\prime}_{w,i_{w}^{k}}(y_{w}^{k})\|^{2}
\displaystyle\leq 2Lw2xwkxw2+2Lw2ywkxw2.\displaystyle 2L_{w}^{2}\|x_{w}^{k}-x_{w}^{*}\|^{2}+2L_{w}^{2}\|y_{w}^{k}-x_{w}^{*}\|^{2}.

where the second inequality is because of the fact 𝔼a𝔼a2=𝔼a2𝔼a2𝔼a2\mathbb{E}\|a-\mathbb{E}a\|^{2}=\mathbb{E}\|a\|^{2}-\|\mathbb{E}a\|^{2}\leq\mathbb{E}\|a\|^{2} and the last inequality comes from Assumption 4.

Substituting (23) into (A), we have

𝔼kgwk+λvwsign(xwkxvk)+λvwsign(xwkzvk)2\displaystyle\mathbb{E}_{k}\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}
\displaystyle\leq 8𝔼kFw(xwk)Fw(xw)2+32λ2|w|2p+4λ2|w|2p+4Lw2xwkxw2+4Lw2ywkxw2.\displaystyle 8\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}+32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p+4L_{w}^{2}\|x_{w}^{k}-x_{w}^{*}\|^{2}+4L_{w}^{2}\|y_{w}^{k}-x_{w}^{*}\|^{2}. (24)

For the third term at the right-hand side of (20), substituting the optimality condition (21), we have

2𝔼k[gwk]+λvwsign(xwkxvk),xwkxw\displaystyle-2\langle\mathbb{E}_{k}[g_{w}^{k}]+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle (25)
=\displaystyle= 2Fw(xwk)+λvwsign(xwkxvk)Fw(xw)λvwswv,xwkxw\displaystyle-2\langle F^{\prime}_{w}(x_{w}^{k})+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})\quad-F^{\prime}_{w}(x_{w}^{*})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle
=\displaystyle= 2Fw(xwk)Fw(xw),xwkxw2λvwsign(xwkxvk)λvwswv,xwkxw\displaystyle-2\langle F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*}),x_{w}^{k}-x_{w}^{*}\rangle-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle
\displaystyle\leq 2μwLwμw+Lwxwkxw22μw+LwFw(xwk)Fw(xw)22λvwsign(xwkxvk)λvwswv,xwkxw.\displaystyle-\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}\|x_{w}^{k}-x_{w}^{*}\|^{2}-\frac{2}{\mu_{w}+L_{w}}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle.

where the last inequality comes from [47] since we assume that the functions Fw(xw)F_{w}(x_{w}) are strongly convex and have Lipschitz continuous gradients (cf. Assumptions 3 and 4).

For the last term at the right-hand side of (20), with any ϵ>0\epsilon>0 we have

2λvwsign(xwkzvk),xwkxw\displaystyle-2\langle\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k}),x_{w}^{k}-x_{w}^{*}\rangle (26)
\displaystyle\leq ϵxwkxw2+λ2ϵvwsign(xwkzvk)2\displaystyle\epsilon\|x_{w}^{k}-x_{w}^{*}\|^{2}+\frac{\lambda^{2}}{\epsilon}\|\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}
\displaystyle\leq ϵxwkxw2+λ2|w|2pϵ.\displaystyle\epsilon\|x_{w}^{k}-x_{w}^{*}\|^{2}+\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

Substituting (A), (25) and (26) into (20) and combining the terms, we have

𝔼kxwk+1xw2\displaystyle\mathbb{E}_{k}\|x_{w}^{k+1}-x_{w}^{*}\|^{2} (27)
\displaystyle\leq (1α(2μwLwμw+Lwϵ)+4α2Lw2)xwkxw2+4α2Lw2ywkxw2+α2(32λ2|w|2p+4λ2|w|2p)+αλ2|w|2pϵ\displaystyle(1-\alpha(\frac{2\mu_{w}L_{w}}{\mu_{w}+L_{w}}-\epsilon)+4\alpha^{2}L_{w}^{2})\|x_{w}^{k}-x_{w}^{*}\|^{2}+4\alpha^{2}L_{w}^{2}\|y_{w}^{k}-x_{w}^{*}\|^{2}+\alpha^{2}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}
2λvwsign(xwkxvk)λvwswv,xwkxw2α(1μw+Lw4α)Fw(xwk)Fw(xw)2.\displaystyle-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle-2\alpha(\frac{1}{\mu_{w}+L_{w}}-4\alpha)\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}.

If we constrain the step size as

αminw{14(μw+Lw)},\displaystyle\alpha\leq\min_{w\in\mathcal{R}}\{\frac{1}{4(\mu_{w}+L_{w})}\}, (28)

we have 1μw+Lw4α0\frac{1}{\mu_{w}+L_{w}}-4\alpha\geq 0 and hence can drop the last term at the right-hand side of (27). Also noticing the definitions of η\eta and LL, we can rewrite (27) as

𝔼kxwk+1xw2\displaystyle\mathbb{E}_{k}\|x_{w}^{k+1}-x_{w}^{*}\|^{2} (29)
\displaystyle\leq (12ηα+4α2L2)xwkxw2+4α2L2𝔼kywkxw2+α2(32λ2|w|2p+4λ2|w|2p)\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\|x_{w}^{k}-x_{w}^{*}\|^{2}+4\alpha^{2}L^{2}\mathbb{E}_{k}\|y_{w}^{k}-x_{w}^{*}\|^{2}+\alpha^{2}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)
+αλ2|w|2pϵ2λvwsign(xwkxvk)λvwswv,xwkxw.\displaystyle+\alpha\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}-2\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle.

Step 2. Here we define Ip(x)=λ2wvwxwxv1I_{p}(x)=\frac{\lambda}{2}\sum_{w\in\mathcal{R}}\sum_{v\in\mathcal{R}_{w}}\|x_{w}-x_{v}\|_{1}. Since Ip(x)I_{p}(x) is convex, we have

xIp(xk)xIp(x),xkx=wλvwsign(xwkxvk)λvwswv,xwkxw0.\displaystyle\langle\partial_{x}I_{p}(x^{k})-\partial_{x}I_{p}(x^{*}),x^{k}-x^{*}\rangle=\sum_{w\in\mathcal{R}}\langle\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})-\lambda\sum_{v\in\mathcal{R}_{w}}s_{wv},x_{w}^{k}-x_{w}^{*}\rangle\geq 0. (30)

Summing up (29) over all regular agents ww\in\mathcal{R} and adding to (30), we have

𝔼kxk+1x2\displaystyle\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2} (31)
\displaystyle\leq (12ηα+4α2L2)xkx2+4α2L2wywkxw2+α2w(32λ2|w|2p+4λ2|w|2p)+αwλ2|w|2pϵ.\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\|x^{k}-x^{*}\|^{2}+4\alpha^{2}L^{2}\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

Step 3. Defining Sk:=wywkxw2S^{k}:=\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}, we have

𝔼k[Sk+1]=w𝔼kywk+1xw2=(11J)wywkxw2+1Jwxwkxw2=(11J)Sk+1Jxkx2.\displaystyle\mathbb{E}_{k}[S^{k+1}]=\sum_{w\in\mathcal{R}}\mathbb{E}_{k}\|y_{w}^{k+1}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})\sum_{w\in\mathcal{R}}\|y_{w}^{k}-x_{w}^{*}\|^{2}+\frac{1}{J}\sum_{w\in\mathcal{R}}\|x_{w}^{k}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})S^{k}+\frac{1}{J}\|x^{k}-x^{*}\|^{2}. (32)

Combining (31) and (32), we have

𝔼kxk+1x2+8Jα2L2𝔼k[Sk+1]\displaystyle\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+8J\alpha^{2}L^{2}\mathbb{E}_{k}[S^{k+1}] (33)
\displaystyle\leq (12ηα+12α2L2)xkx2+(112J)8Jα2L2Sk+α2w(32λ2|w|2p+4λ2|w|2p)+αwλ2|w|2pϵ.\displaystyle(1-2\eta\alpha+12\alpha^{2}L^{2})\|x^{k}-x^{*}\|^{2}+(1-\frac{1}{2J})8J\alpha^{2}L^{2}S^{k}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

If we constrain the step size as

12α2L2ηαandηα12J,\displaystyle 12\alpha^{2}L^{2}\leq\eta\alpha\quad\text{and}\quad\eta\alpha\leq\frac{1}{2J}, (34)

then the coefficients in (33) satisfy

12ηα+12α2L21ηα,\displaystyle 1-2\eta\alpha+12\alpha^{2}L^{2}\leq 1-\eta\alpha, (35)
112J1ηα.\displaystyle 1-\frac{1}{2J}\leq 1-\eta\alpha. (36)

Construct a Lyapunov function as Vk:=xkx2+8Jα2L2SkV^{k}:=\|x^{k}-x^{*}\|^{2}+8J\alpha^{2}L^{2}S^{k}, we have

𝔼k[Vk](1ηα)Vk1+α2w(32λ2|w|2p+4λ2|w|2p)+αwλ2|w|2pϵ.\displaystyle\mathbb{E}_{k}[V^{k}]\leq(1-\eta\alpha)V^{k-1}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}. (37)

Taking the full expectation of (37), we obtain

𝔼[Vk](1ηα)𝔼[Vk1]+α2w(32λ2|w|2p+4λ2|w|2p)+αwλ2|w|2pϵ.\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)\mathbb{E}[V^{k-1}]+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}. (38)

Using telescopic cancellation on (38) from time 1 to time kk, we have

𝔼[Vk](1ηα)kV0+Δ1,\displaystyle\mathbb{E}[V^{k}]\leq(1-\eta\alpha)^{k}V^{0}+\Delta_{1}, (39)

where

Δ:=αηw(32λ2|w|2p+4λ2|w|2p)+1ϵηwλ2|w|2p.\displaystyle\Delta:=\frac{\alpha}{\eta}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\frac{1}{\epsilon\eta}\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p. (40)

In our derivation so far, the constraint on the step size α\alpha (cf. (28) and (34)) is

αmin{minw{14(μw+Lw)},η12L2,12ηJ}.\displaystyle\alpha\leq\min\{\min_{w\in\mathcal{R}}\{\frac{1}{4(\mu_{w}+L_{w})}\},\frac{\eta}{12L^{2}},\frac{1}{2\eta J}\}. (41)

Therefore, we simply choose

αη12L2J,\displaystyle\alpha\leq\frac{\eta}{12L^{2}J}, (42)

which completes the proof for BRAVO-LSVRG.

Step 4. Below we prove Theorem 1 for BRAVO-SAGA. Since the update rules of BRAVO-SAGA and BRAVO-LSVRG are both (9) and the difference between them is the definition of gwkg_{w}^{k}, the intermediate results (20), (A), (25), and (26) also hold true for BRAVO-SAGA. Therefore, we just need to prove (23) and (A) for BRAVO-SAGA.

From the definition of gwkg_{w}^{k} in (8), at every regular agent ww, we have

𝔼kgwkFw(xwk)2\displaystyle\mathbb{E}_{k}\|g_{w}^{k}-F^{\prime}_{w}(x_{w}^{k})\|^{2} (43)
=\displaystyle= 𝔼kFw,iwk(xwk)Fw,iwk(xw)+Fw(xw)Fw(xwk)(Fw,iwk(ϕw,iwkk)Fw,iwk(xw)+Fw(xw)1Jj=1JFw,j(ϕw,jk))2\displaystyle\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(x_{w}^{k})-\Big{(}F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k})\Big{)}\|^{2}
\displaystyle\leq 2𝔼kFw,iwk(xwk)Fw,iwk(xw)+Fw(xw)Fw(xwk)2+2𝔼kFw,iwk(ϕw,iwkk)Fw,iwk(xw)+Fw(xw)1Jj=1JFw,j(ϕw,jk)2\displaystyle 2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-F^{\prime}_{w}(x_{w}^{k})\|^{2}+2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})+F^{\prime}_{w}(x_{w}^{*})-\frac{1}{J}\sum_{j=1}^{J}F^{\prime}_{w,j}(\phi_{w,j}^{k})\|^{2}
\displaystyle\leq 2𝔼kFw,iwk(xwk)Fw,iwk(xw)2+2𝔼kFw,iwk(xw)Fw,iwk(ϕw,iwkk)2\displaystyle 2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{k})-F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})\|^{2}+2\mathbb{E}_{k}\|F^{\prime}_{w,i_{w}^{k}}(x_{w}^{*})-F^{\prime}_{w,i_{w}^{k}}(\phi_{w,i_{w}^{k}}^{k})\|^{2}
\displaystyle\leq 2Lw2xwkxw2+2Lw2𝔼kϕw,iwkkxw2\displaystyle 2L_{w}^{2}\|x_{w}^{k}-x_{w}^{*}\|^{2}+2L_{w}^{2}\mathbb{E}_{k}\|\phi_{w,i_{w}^{k}}^{k}-x_{w}^{*}\|^{2}
=\displaystyle= 2Lw2xwkxw2+2Lw21Jj=1Jϕw,jkxw2.\displaystyle 2L_{w}^{2}\|x_{w}^{k}-x_{w}^{*}\|^{2}+2L_{w}^{2}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}.

Substituting (43) into (A), we have

𝔼kgwk+λvwsign(xwkxvk)+λvwsign(xwkzvk)2\displaystyle\mathbb{E}_{k}\|g_{w}^{k}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{k}-x_{v}^{k})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{k}-z_{v}^{k})\|^{2}
\displaystyle\leq 8𝔼kFw(xwk)Fw(xw)2+4Lw21Jj=1Jϕw,jkxw2+4Lw2xwkxw2+32λ2|w|2p+4λ2|w|2p.\displaystyle 8\mathbb{E}_{k}\|F^{\prime}_{w}(x_{w}^{k})-F^{\prime}_{w}(x_{w}^{*})\|^{2}+4L_{w}^{2}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}+4L_{w}^{2}\|x_{w}^{k}-x_{w}^{*}\|^{2}+32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p. (44)

Similar to the proof for BRAVO-LSVRG, we can get

𝔼kxk+1x2\displaystyle\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2} (45)
\displaystyle\leq (12ηα+4α2L2)xkx2+4α2L2w1Jj=1Jϕw,jkxw2+α2w(32λ2|w|2p+4λ2|w|2p)+αwλ2|w|2pϵ.\displaystyle(1-2\eta\alpha+4\alpha^{2}L^{2})\|x^{k}-x^{*}\|^{2}+4\alpha^{2}L^{2}\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}+\alpha^{2}\sum_{w\in\mathcal{R}}(32\lambda^{2}|\mathcal{R}_{w}|^{2}p+4\lambda^{2}|\mathcal{B}_{w}|^{2}p)+\alpha\sum_{w\in\mathcal{R}}\frac{\lambda^{2}|\mathcal{B}_{w}|^{2}p}{\epsilon}.

Define Sk:=w1Jj=1Jϕw,jkxw2S^{k}:=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}, we have

𝔼k[Sk+1]=w1Jj=1J𝔼kϕw,jk+1xw2=(11J)w1Jj=1Jϕw,jkxw2+1Jwxwkxw2=(11J)Sk+1Jxkx2.\displaystyle\mathbb{E}_{k}[S^{k+1}]=\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\mathbb{E}_{k}\|\phi_{w,j}^{k+1}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})\sum_{w\in\mathcal{R}}\frac{1}{J}\sum_{j=1}^{J}\|\phi_{w,j}^{k}-x_{w}^{*}\|^{2}+\frac{1}{J}\sum_{w\in\mathcal{R}}\|x_{w}^{k}-x_{w}^{*}\|^{2}=(1-\frac{1}{J})S^{k}+\frac{1}{J}\|x^{k}-x^{*}\|^{2}. (46)

With (46), the rest of the proof follows Step 3 and the proof for BRAVO-SAGA is complete.  

Appendix B Proof of Theorem 2

Proof.

We establish the lower bound on the learning error by constructing a case that the regular agents’ models converge to an unsatisfactory point due to the misleading of the Byzantine agents. We construct a network topology within which the regular network composed of the regular agents is connected and each regular agent has the same number of Byzantine neighbors, namely, |w|=|v||\mathcal{B}_{w}|=|\mathcal{B}_{v}| for any w,vw,v\in\mathcal{R}.

We begin with the scalar case with p=1p=1. For each regular agent ww\in\mathcal{R}, we construct its local cost function as

Fw(x~)=12x~2λ|w|x~,\displaystyle F_{w}(\tilde{x})=\frac{1}{2}\tilde{x}^{2}-\lambda|\mathcal{B}_{w}|\tilde{x}, (47)

which is strongly convex. Define the local sample costs as

Fw,j(x~)=Fw(x~),j{1,,J},\displaystyle F_{w,j}(\tilde{x})=F_{w}(\tilde{x}),\quad\forall j\in\{1,\cdots,J\}, (48)

which have Lipschitz continuous gradients with constants 1. It is easy to verify that the optimal solution of (4) is x=[λ|1|;λ|2|;;λ|R|]x^{*}=[\lambda|\mathcal{B}_{1}|;\lambda|\mathcal{B}_{2}|;\cdots;\lambda|\mathcal{B}_{R}|] in this case.

Since the regular agents’ models are initialized at the same point, without loss of generality, we let xw0=0,wx_{w}^{0}=0,\forall w\in\mathcal{R}. Otherwise, we can set Fw(x~xw0)F_{w}(\tilde{x}-x_{w}^{0}) as the local cost of agent ww. Note that for each regular agent ww\in\mathcal{R}, hw0h_{w}^{0} is a linear combination of {Fw,1(xw0),,Fw,J(xw0)}\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0})\}. With (48), we have hw0span{Fw(xw0)}h_{w}^{0}\in\text{span}\{F^{\prime}_{w}(x_{w}^{0})\}. Because 𝔼[hw0]=Fw(xw0)\mathbb{E}[h_{w}^{0}]=F^{\prime}_{w}(x_{w}^{0}), we have

hw0=Fw(xw0)=λ|w|.\displaystyle h_{w}^{0}=F^{\prime}_{w}(x_{w}^{0})=-\lambda|\mathcal{B}_{w}|. (49)

Therefore, for each regular agent ww\in\mathcal{R}, we have

h^w0\displaystyle\hat{h}_{w}^{0} =hw0+λvwsign(xw0xv0)+λvwsign(xw0zv0)\displaystyle=h_{w}^{0}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{0}-x_{v}^{0})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{0}-z_{v}^{0}) (50)
=λ|w|+λ|w|sign(0)+λvwsign(zv0).\displaystyle=-\lambda|\mathcal{B}_{w}|+\lambda|\mathcal{R}_{w}|sign(0)+\lambda\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{0}).

Although sign(0)sign(0) could be any value in [1,1][-1,1], we let sign(0)=0sign(0)=0 by convention. Choosing the Byzantine attacks such that vwsign(zv0)=|w|\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{0})=|\mathcal{B}_{w}| leads to

h^w0=0.\displaystyle\hat{h}_{w}^{0}=0. (51)

With Assumption 6, we further have

xw1xw0+span{h^w0}={0}.\displaystyle x_{w}^{1}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0}\}=\{0\}. (52)

At time k=1k=1, hw1h_{w}^{1} is linear combination of {Fw,1(xw0),,Fw,J(xw0),Fw,1(xw1),,Fw,J(xw1)}\{F^{\prime}_{w,1}(x_{w}^{0}),\cdots,F^{\prime}_{w,J}(x_{w}^{0}),F^{\prime}_{w,1}(x_{w}^{1}),\cdots,F^{\prime}_{w,J}(x_{w}^{1})\} for each regular agent ww\in\mathcal{R}. With (48), we have hw1span{Fw(xw0),Fw(xw1)}=span{Fw(0)}h_{w}^{1}\in\text{span}\{F^{\prime}_{w}(x_{w}^{0}),F^{\prime}_{w}(x_{w}^{1})\}=\text{span}\{F^{\prime}_{w}(0)\}. Since 𝔼[hw1]=Fw(xw1)=Fw(0)\mathbb{E}[h_{w}^{1}]=F^{\prime}_{w}(x_{w}^{1})=F^{\prime}_{w}(0), we have

hw1=Fw(0)=λ|w|.\displaystyle h_{w}^{1}=F^{\prime}_{w}(0)=-\lambda|\mathcal{B}_{w}|. (53)

Therefore, for each regular agent ww\in\mathcal{R}, we have

h^w1\displaystyle\hat{h}_{w}^{1} =hw1+λvwsign(xw1xv1)+λvwsign(xw1zv1)\displaystyle=h_{w}^{1}+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{1}-x_{v}^{1})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(x_{w}^{1}-z_{v}^{1})
=λ|w|+λvwsign(xw1xv1)+λvwsign(zv1).\displaystyle=-\lambda|\mathcal{B}_{w}|+\lambda\sum_{v\in\mathcal{R}_{w}}sign(x_{w}^{1}-x_{v}^{1})+\lambda\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{1}).

Choosing the Byzantine attacks such that vwsign(zv1)=|w|\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{1})=|\mathcal{B}_{w}|, we have

h^w1=0.\displaystyle\hat{h}_{w}^{1}=0. (54)

With Assumption 6, we further have

xw2xw0+span{h^w0,h^w1}={0}.\displaystyle x_{w}^{2}\in x_{w}^{0}+\text{span}\{\hat{h}_{w}^{0},\hat{h}_{w}^{1}\}=\{0\}. (55)

As such, for each regular agent ww\in\mathcal{R}, we have

xwk=0and𝔼xwkxw=xwkxw=λ|w|,k{0,1,}.\displaystyle x_{w}^{k}=0\quad\text{and}\quad\mathbb{E}\|x_{w}^{k}-x_{w}^{*}\|=\|x_{w}^{k}-x_{w}^{*}\|=\lambda|\mathcal{B}_{w}|,\quad\forall k\in\{0,1,\cdots\}. (56)

Therefore, we have

𝔼xkx2=w𝔼xwkxw2\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\sum_{w\in\mathcal{R}}\mathbb{E}\|x_{w}^{k}-x_{w}^{*}\|^{2} =Ω(wλ2|w|2).\displaystyle=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}). (57)

For the vector case with p>1p>1, we apply the constructions on the scalar case for all dimensions. Consider

Fw(x~)=12x~Tx~λ|w|x~T𝟏,\displaystyle F_{w}(\tilde{x})=\frac{1}{2}\tilde{x}^{T}\tilde{x}-\lambda|\mathcal{B}_{w}|\tilde{x}^{T}\bm{1}, (58)

where xpx\in\mathbb{R}^{p} and 𝟏p\bm{1}\in\mathbb{R}^{p} is the all-one vector. Define

Fw,j(x~)=Fw(x~),j{1,,J}.\displaystyle F_{w,j}(\tilde{x})=F_{w}(\tilde{x}),\quad\forall j\in\{1,\cdots,J\}. (59)

Also let xw0=𝟎x_{w}^{0}=\bm{0} for all regular agents ww\in\mathcal{R} and choose the Byzantine attacks such that vwsign(zvk)=|w|𝟏\sum_{v\in\mathcal{B}_{w}}sign(-z_{v}^{k})=|\mathcal{B}_{w}|\cdot\bm{1} at any time kk. Similar to the derivations of the scalar case, we have

𝔼xkx2=Ω(wλ2|w|2p),\displaystyle\mathbb{E}\|x^{k}-x^{*}\|^{2}=\Omega(\sum_{w\in\mathcal{R}}\lambda^{2}|\mathcal{B}_{w}|^{2}p), (60)

which completes the proof.  

References

  • [1] K. Srivastava, A. Nedic, Distributed asynchronous constrained stochastic optimization, IEEE Journal of Selected Topics in Signal Processing 5 (4) (2011) 772–790.
  • [2] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, S. Lu, Distributed learning in the nonconvex world: From batch data to streaming and beyond, IEEE Signal Processing Magazine 37 (3) (2020) 26–38.
  • [3] S. Pu, A. Olshevsky, I. C. Paschalidis, Asymptotic network independence in distributed stochastic optimization for machine learning: Examining distributed and centralized stochastic gradient descent, IEEE Signal Processing Magazine 37 (3) (2020) 114–122.
  • [4] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.
  • [5] M. Assran, N. Loizou, N. Ballas, M. G. Rabbat, Stochastic gradient push for distributed deep learning, in: International Conference on Machine Learning, 2019, pp. 344–353.
  • [6] A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control 54 (1) (2009) 48–61.
  • [7] Y. Du, K. You, Asynchronous stochastic gradient descent over decentralized datasets, IEEE Transactions on Control of Network Systems 8 (3) (2021) 1212–1224.
  • [8] G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, H. Zhu, Decentralized learning for wireless communications and networking, in: Splitting Methods in Communication, Imaging, Science, and Engineering, 2017, pp. 461–497.
  • [9] L. U. Khan, S. R. Pandey, N. H. Tran, W. Saad, Z. Han, M. N. Nguyen, C. S. Hong, Federated learning for edge networks: Resource optimization and incentive mechanism, IEEE Communications Magazine 58 (10) (2020) 88–93.
  • [10] H. Tang, X. Lian, M. Yan, C. Zhang, J. Liu, d2d^{2}: Decentralized training over decentralized data, in: International Conference on Machine Learning, 2018, pp. 4848–4856.
  • [11] A. Mokhtari, A. Ribeiro, Dsa: Decentralized double stochastic averaging gradient algorithm, The Journal of Machine Learning Research 17 (1) (2016) 2165–2199.
  • [12] Z.-X. Cui, Q. Fan, C. Jia, Momentum methods for stochastic optimization over time-varying directed networks, Signal Processing 174 (2020) 107614.
  • [13] W. Shi, Q. Ling, G. Wu, W. Yin, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization 25 (2) (2015) 944–966.
  • [14] S. Pu, W. Shi, J. Xu, A. Nedic, Push–pull gradient methods for distributed optimization in networks, IEEE Transactions on Automatic Control 66 (1) (2020) 1–16.
  • [15] X. Zhang, S. Liu, N. Zhao, An accelerated algorithm for distributed optimization with barzilai-borwein step sizes, Signal Processing 202 (2023) 108748.
  • [16] L. Lamport, R. Shostak, M. Pease, The Byzantine generals problem, ACM Transactions on Programming Languages and Systems 4 (3) (1982) 382–401.
  • [17] Z. Yang, A. Gang, W. U. Bajwa, Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model, IEEE Signal Processing Magazine 37 (3) (2020) 146–159.
  • [18] L. Su, N. H. Vaidya, Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms, in: ACM Symposium on Principles of Distributed Computing, 2016, pp. 425–434.
  • [19] S. Sundaram, B. Gharesifard, Distributed optimization under adversarial nodes, IEEE Transactions on Automatic Control 64 (3) (2018) 1063–1076.
  • [20] L. Su, N. H. Vaidya, Byzantine-resilient multiagent optimization, IEEE Transactions on Automatic Control 66 (5) (2020) 2227–2233.
  • [21] Z. Yang, W. U. Bajwa, Byrdie: Byzantine-resilient distributed coordinate descent for decentralized learning, IEEE Transactions on Signal and Information Processing over Networks 5 (4) (2019) 611–627.
  • [22] J. Li, W. Abbas, M. Shabbir, X. Koutsoukos, Resilient distributed diffusion for multi-robot systems using centerpoint, in: Robotics: Science and Systems, 2020.
  • [23] Z. Yang, W. U. Bajwa, Bridge: Byzantine-resilient decentralized gradient descent, arXiv preprint arXiv:1908.08098.
  • [24] K. Kuwaranancharoen, S. Sundaram, On the geometric convergence of byzantine-resilient distributed optimization algorithms, arXiv preprint arXiv:2305.10810.
  • [25] W. Ben-Ameur, P. Bianchi, J. Jakubowicz, Robust distributed consensus using total variation, IEEE Transactions on Automatic Control 61 (6) (2015) 1550–1564.
  • [26] W. Xu, Z. Li, Q. Ling, Robust decentralized dynamic optimization at presence of malfunctioning agents, Signal Processing 153 (2018) 24–33.
  • [27] N. Gupta, N. H. Vaidya, Byzantine fault-tolerance in peer-to-peer distributed gradient-descent, arXiv preprint arXiv:2101.12316.
  • [28] J. Peng, W. Li, Q. Ling, Byzantine-robust decentralized stochastic optimization over static and time-varying networks, Signal Processing 183 (2021) 108020.
  • [29] L. He, S. P. Karimireddy, M. Jaggi, Byzantine-robust decentralized learning via clippedgossip, arXiv preprint arXiv:2202.01545.
  • [30] Z. Wu, T. Chen, Q. Ling, Byzantine-resilient decentralized stochastic optimization with robust aggregation rules, arXiv preprint arXiv:2206.04568.
  • [31] Y. Chen, L. Su, J. Xu, Distributed statistical machine learning in adversarial settings: Byzantine gradient descent, Proceedings of ACM on Measurement and Analysis of Computing Systems 1 (2) (2017) 1–25.
  • [32] P. Blanchard, E.-M. El-Mhamdi, R. Guerraoui, J. Stainer, Machine learning with adversaries: Byzantine tolerant gradient descent, in: Advances in Neural Information Processing Systems, 2017, pp. 119–129.
  • [33] X. Cao, L. Lai, Distributed approximate Newton’s method robust to Byzantine attackers, IEEE Transactions on Signal Processing 68 (2020) 6011–6025.
  • [34] X. Cao, L. Lai, Distributed gradient descent algorithm robust to an arbitrary number of Byzantine attackers, IEEE Transactions on Signal Processing 67 (22) (2019) 5850–5864.
  • [35] K. Yuan, B. Ying, J. Liu, A. H. Sayed, Variance-reduced stochastic learning by networked agents under random reshuffling, IEEE Transactions on Signal Processing 67 (2) (2019) 351–366.
  • [36] R. Xin, U. A. Khan, S. Kar, Variance-reduced decentralized stochastic optimization with accelerated convergence, IEEE Transactions on Signal Processing 68 (2020) 6255–6271.
  • [37] R. Xin, S. Kar, U. A. Khan, Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence, IEEE Signal Processing Magazine 37 (3) (2020) 102–113.
  • [38] Z. Wu, Q. Ling, T. Chen, G. B. Giannakis, Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks, IEEE Transactions on Signal Processing 68 (2020) 4583–4596.
  • [39] E.-M. El-Mhamdi, R. Guerraoui, S. Rouault, Distributed momentum for Byzantine-resilient learning, arXiv preprint arXiv:2003.00010.
  • [40] S. P. Karimireddy, L. He, M. Jaggi, Learning from history for byzantine robust optimization, in: International Conference on Machine Learning, 2021, pp. 5311–5319.
  • [41] P. Khanduri, S. Bulusu, P. Sharma, P. K. Varshney, Byzantine resilient non-convex svrg with distributed batch gradient computations, arXiv preprint arXiv:1912.04531.
  • [42] S. Liu, H. Zhu, S. Li, X. Li, C. Chen, X. Guan, An adaptive deviation-tolerant secure scheme for distributed cooperative spectrum sensing, in: IEEE Global Communications Conference, 2012, pp. 603–608.
  • [43] B. Kailkhura, P. Ray, D. Rajan, A. Yen, P. Barnes, R. Goldhahn, Byzantine-resilient locally optimum detection using collaborative autonomous networks, in: IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
  • [44] B. Kailkhura, S. Brahma, P. K. Varshney, Data falsification attacks on consensus-based detection systems, IEEE Transactions on Signal and Information Processing over Networks 3 (1) (2016) 145–158.
  • [45] A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, in: Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
  • [46] D. Kovalev, S. Horváth, P. Richtarik, Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop, in: Algorithmic Learning Theory, 2020, pp. 451–467.
  • [47] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Science & Business Media, 2003.
  • [48] G. Lan, Y. Zhou, An optimal randomized incremental gradient method, Mathematical programming 171 (2018) 167–215.
  • [49] S. P. Karimireddy, L. He, M. Jaggi, Byzantine-robust learning on heterogeneous datasets via bucketing, in: International Conference on Learning Representations, 2021.
  • [50] S. Boyd, P. Diaconis, L. Xiao, Fastest mixing Markov chain on a graph, SIAM Review 46 (4) (2004) 667–689.
  • [51] J. Peng, W. Li, Q. Ling, Variance reduction-boosted Byzantine robustness in decentralized stochastic optimization, in: International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 4283–4287.