

FedCut: A Spectral Analysis Framework for Reliable Detection of Byzantine Colluders

Hanlin Gu, Lixin Fan, Xingxing Tang, and Qiang Yang. Hanlin Gu and Lixin Fan are with WeBank AI Lab, WeBank, China. E-mail: {allengu, lixinfan}@webank.com, {ghltsl123, Lixin.Fan01}@gmail.com. Xingxing Tang is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. E-mail: [email protected]. Qiang Yang is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, and WeBank AI Lab, WeBank, China. E-mail: [email protected]. Corresponding author: Lixin Fan.
Abstract

This paper proposes a general spectral analysis framework that thwarts a security risk in federated learning caused by groups of malicious Byzantine attackers, or colluders, who conspire to upload vicious model updates to severely debase global model performances. The proposed framework delineates the strong consistency and temporal coherence between Byzantine colluders' model updates through a spectral analysis lens, and formulates the detection of Byzantine misbehaviours as a community detection problem in weighted graphs. A modified normalized graph cut is then utilized to discern attackers from benign participants. Moreover, spectral heuristics are adopted to make the detection robust against various attacks. The proposed Byzantine-colluder-resilient method, i.e., FedCut, is guaranteed to converge with bounded errors. Extensive experimental results under a variety of settings justify the superiority of FedCut, which demonstrates extremely robust model performance (MP) under various attacks. FedCut's averaged MP is 2.1% to 16.5% better than that of state-of-the-art Byzantine-resilient methods, and in terms of worst-case MP, FedCut is 17.6% to 69.5% better than these methods.

Index Terms:
Byzantine colluders; Byzantine resilience; spectral analysis; normalized cut; spectral heuristics; graph; federated learning; privacy-preserving computing

1 Introduction

Federated learning (FL) [40, 66] is a suite of privacy-preserving machine learning techniques that allow multiple parties to collaboratively train a global model without gathering or exchanging private data, for the sake of compliance with data protection regulations such as GDPR (applicable as of May 25th, 2018 in all European member states to harmonize data privacy laws across Europe; https://gdpr.eu/). It is the upgraded global model performance that motivates multiple parties to join FL; however, the risk of potential malicious attacks aiming to degrade FL model performance cannot be discounted. It was shown that even a single attacker (a.k.a. a Byzantine worker) may prevent the convergence of a naive FL aggregation rule by submitting vicious model updates that outweigh benign workers (see Lemma 1 of [6]). A great deal of research effort has since been devoted to developing numerous Byzantine-resilient methods which can effectively detect and attenuate such misbehaviours [6, 68, 45, 13, 61, 65, 21]. Nevertheless, recent research pointed out that a group of attackers, or colluders, may conspire to cause more damage than these Byzantine-resilient methods can deal with (see [4, 63, 18]). It is the misbehaviour of such Byzantine colluders that motivates our research to analyze their influence on global model performances within a spectral analysis framework and, based on our findings, to propose an effective defense algorithm that does away with Byzantine colluders.

There are two main challenges brought by Byzantine colluders. First, colluders may conspire to misbehave consistently and introduce statistical bias to break down Robust Statistics (RS) based resilient methods (e.g., [4, 18, 63]). Consequently, the global model performance may deteriorate significantly. Second, Byzantine colluders may conspire to violate the assumption that all malicious model updates form one group while benign ones form the other, which is invariably assumed by most clustering-based Byzantine-resilient methods [54, 51, 20]. By submitting multiple groups of such detrimental yet disguised model updates, colluders can therefore evade clustering-based methods and degrade the global model performance significantly.

Figure 1: Averaged (left) and worst-case (right) model performances under all attacks for different Byzantine-resilient methods (Krum [6], GeoMedian [13], Median, Trimmed mean [68], Bulyan [21], FLtrust [10], DnC [52], Kmeans [54] and our proposed FedCut) with the IID setting and 30 Byzantine clients for classification of Fashion MNIST (see more details in Sect. 7.1).

In order to address the challenges brought by colluders, we propose in this article a spectral analysis framework, called FedCut, which admits effective detection of Byzantine colluders in a variety of settings, and we provide a spectral analysis of different types of Byzantine behaviours, especially those of Byzantine colluders (see Sect. 4). The essential ingredients of the proposed framework are as follows. First, we build a Spatial-Temporal graph with all clients' model updates over multiple learning iterations as nodes, and similarities between respective pairs of model updates as edge weights (see Sect. 5.1). Second, an extension of the normalized cut (Ncut) [55, 42] provides the optimal c-partition Ncut (see Sect. 5.2), which allows us to detect colluders with consistent behaviour efficiently. Third, spectral heuristics [70] are used to determine the type of Byzantine attackers, the unknown number of colluder groups and the appropriate scaling factor $\sigma$ of the Gaussian kernels used for measuring similarities between model updates (see Sect. 5.3). By leveraging the aforementioned techniques together within a unified spectral analysis framework, we propose in Sect. 5 the FedCut method, which demonstrates superior robustness in the presence of different types of colluder attacks. Its convergence is theoretically proved in Sect. 6, and it compares favorably to state-of-the-art Byzantine-resilient methods in thorough empirical evaluations of model performance in Sect. 7. Our contributions are threefold.

  • First, we provide a spectral analysis of Byzantine attacks, especially those launched by colluders. Specifically, we categorize existing Byzantine attacks into four types and gain a deeper understanding of colluders' attacks through a spectral lens. Moreover, we delineate the root causes of failure cases of existing Byzantine-resilient methods in the face of colluder attacks (see Sect. 4.3).

  • Second, we propose the spectral analysis framework, called FedCut, which distinguishes benign clients from multiple groups of Byzantine colluders. Specifically, the normalized cut, temporal consistency and spectral heuristics are adopted to address the challenges brought by colluders. Moreover, we provide a theoretical analysis of the convergence of the proposed FedCut algorithm.

  • Finally, we thoroughly investigate both the averaged and the worst-case model performances of different Byzantine-resilient methods with extensive experiments under a variety of settings, including different types of models, datasets, extents of Non-IID-ness, fractions of attackers and combinations of colluder groups. It is demonstrated that the proposed FedCut consistently outperforms existing methods under all these settings (see Fig. 1 and Tab. I).

2 Related work

Related works are abundant in the literature, and we briefly review them below.
Byzantine attacks can be broadly categorized as Non-Collusion and Collusion attacks. The former were proposed to degrade global model performances by uploading Gaussian noise or flipping gradients [6, 35]. The latter launched consistent attacks that mislead defenses into misclassifying Byzantine attackers as benign [4, 18, 63].
Robust statistics based aggregation approaches treated malicious model updates as outliers, which are far away from benign clients' updates, and filtered out such outliers via robust statistics. For example, the coordinate-wise Median and some variants of Median, such as the geometric Median, were proposed to remove outliers [68, 45, 61]. Moreover, Blanchard et al. assumed that a few outliers were far away from benign updates, and therefore used the Krum function to select, as benign, model updates that are very close to at least half of the other updates [6]. However, the aforementioned methods are vulnerable to collusion attacks, which may conspire to induce biased estimations [4, 18, 63]. A different line of approaches [52, 16, 15] used a concentration filter to remove outliers which are far away from a concentration point (such as the mean). For instance, Shejwalkar et al. applied SVD to the covariance matrix of model updates and filtered out outliers that deviate largely from the empirical mean of model updates along the principal direction [52]. However, the effectiveness of the concentration filter may deteriorate in the presence of collusion attacks (see Sect. 4.3); moreover, the time complexity of this approach, $\tilde{O}(d^3)$, is high when the dimension $d$ of updates is large.
Clustering based robust aggregation grouped all clients into two clusters according to pairwise similarities between clients' updates and regarded the cluster with fewer members as the Byzantine cluster. For instance, some methods applied Kmeans to cluster benign and Byzantine clients [54, 9], and Sattler et al. separated benign and Byzantine clients by minimizing the cosine similarity of updates between two groups [51, 50]. Moreover, Ghosh et al. made use of iterative Kmeans clustering to remove Byzantine attackers [20]. However, naive clustering methods such as Kmeans and Kmedian [26, 38] have two shortcomings: 1) they are inefficient and easily trapped in local minima; 2) they break down when the number of colluder groups is unknown to the clustering method (see details in Sect. 4).
Server based robust aggregation assumed the server had an extra training dataset which was used to evaluate uploaded model updates. Either abnormal updates with low scores were filtered out [10, 62, 48, 46] or minority Byzantine updates were filtered out through majority votes [12, 47, 22]. However, this approach is not applicable when server-side data is unavailable, and it might break down if the distribution of the server's data deviates far from that of the clients' training data.
Historical Information based Byzantine-robust methods made use of historical information (such as distributed momentums [17]) to help correct the statistical bias brought by colluders during training, and thus to ensure the convergence of federated optimization [3, 27, 2, 11, 29, 19].
Other Byzantine-robust methods used signs of the gradients [5, 56], optimization strategies [35] or sampling methods [28] to achieve robust aggregation. Recently, some works studied Byzantine-robust algorithms in the decentralized setting without a server [21, 44, 24] and in the asynchronous setting with heterogeneous communication delays [14, 64, 67].
Community detection in graphs. In the proposed framework, the detection of Byzantine colluders is treated as the detection of multiple subgraphs or communities in a large weighted graph (see Sect. 5.1). Existing approaches [41] could be applied to detect Byzantine colluders. One important technique is to detect specific features of the graph, such as clustering coefficients [59]. Another is to leverage spectral properties of the graph, e.g., of the adjacency matrix or the normalized Laplacian matrix [53, 49, 8]. Moreover, the number of communities can be determined based on the eigengaps of these matrices [70] (see Sect. 5.3).

TABLE I: Overview of the performance of various Byzantine-resilient methods (robust statistics based: Krum [6], Median, Trimmed Mean [68] and DnC [52]; clustering based: Kmeans [54]; server based: FLtrust [10]; spectral based: FedCut, see more details in Sect. 2) under different Byzantine attacks (Non-Collusion attacks: Gaussian [6], label flipping [15] and sign flipping [15]; Collusion-diff attacks: same value attack [35], Fang-v1 (designed for trimmed mean) [18] and our designed Multi-collusion attack; Collusion-mimic attacks: Mimic [28] and Lie [4], see more details in Sect. 4.2). ✔, △ and ✖ denote a drop of model performance of less than 3%, from 3% to 10%, and above 10% on Fashion MNIST, respectively (see more comparisons in Sect. 7)
Robust statistics based | Clustering based | Server based | Spectral based
Krum Median Trimmed Mean DnC Kmeans FLtrust FedCut (Ours)
Non-Collusion Gaussian [6] △
label flipping [15] △
sign flipping [15] △ △ △ △
Collusion-diff same value [35] △
Fang-v1 [18] △ △
Multi-collusion △ △
Collusion-mimic Mimic [28] △ △ △
Lie [4] △

3 Preliminary

3.1 Federated Learning

We consider a horizontal federated learning [66, 40] setting consisting of one server and $K$ clients. We assume the $K$ clients (in this article we use the terms "client", "node", "participant" and "party" interchangeably) have local datasets $\mathcal{D}_i=\{(\mathbf{x}_{i,j},y_{i,j})\}_{j=1}^{n_i}, i=1,\cdots,K$, where $\mathbf{x}_{i,j}$ is the input data, $y_{i,j}$ is the label and $n_i$ is the total number of data points of the $i_{th}$ client. Training in federated learning is divided into three steps which run iteratively until the learning converges:

  • The $i_{th}$ client performs empirical risk minimization:

    $\min_{\mathbf{w}_i}F_i(\mathbf{w}_i,\mathcal{D}_i)=\min_{\mathbf{w}_i}\frac{1}{n_i}\sum_{j=1}^{n_i}\ell(\mathbf{w}_i,\mathbf{x}_{i,j},y_{i,j}),$ (1)

    where $\mathbf{w}_i\in\mathbb{R}^d$ is the $i_{th}$ client's local model weight and $\ell(\cdot)$ is a loss function that measures the accuracy of the prediction made by the model on each data point.

  • Each client sends its local model update $\nabla F_i$ to the server, and the server updates the global model $\mathbf{w}$ as $\mathbf{w}=\mathbf{w}-\eta\frac{1}{K}\sum_{i=1}^{K}\nabla F_i$, where $\eta$ is the learning rate.

  • The server distributes the updated global model $\mathbf{w}$ to all clients (a minimal code sketch of this loop follows below).
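The following is a minimal sketch of the three-step loop above, using plain NumPy and a least-squares local task as a stand-in; all names, shapes and hyper-parameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def local_update(w, X, y):
    """Step II at one client: gradient of the local empirical risk F_i."""
    return X.T @ (X @ w - y) / len(y)

def fedavg_round(w, clients, lr=0.1):
    # Step I: the server sends w to all clients (implicit in this call).
    grads = [local_update(w, X, y) for (X, y) in clients]   # Step II
    g = np.mean(grads, axis=0)                              # Step III: average
    return w - lr * g                                       # w <- w - eta * g

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(32, 5)), rng.normal(size=32)) for _ in range(10)]
w = np.zeros(5)
for _ in range(100):
    w = fedavg_round(w, clients)
```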

3.2 Byzantine Attack in Federated Learning

We assume a malicious threat model where an unknown number of participants out of the $K$ clients are Byzantine, i.e., they may upload arbitrarily corrupt updates $\mathbf{g}_b$ to degrade the global model performance (MP). Under this assumption, the behaviours of Byzantine clients and of the remaining benign clients can be summarized as follows:

$\mathbf{g}_i=\begin{cases}\nabla F_i & \text{Benign clients}\\ \mathbf{g}_b & \text{Byzantine clients}\end{cases}$ (2)

Note that under the assumed threat model, each adversarial node has access to the updates of all clients during the training procedure. Adversaries are aware of the Byzantine-resilient methods adopted in federated learning [6, 4, 63], and conspire to upload specially designed model updates that may overwhelm existing defending methods. Byzantine clients who behave consistently as such are referred to as colluders throughout this article. Moreover, we assume the server is honest and tries to defend against Byzantine attacks.

4 Failure Cases Caused by Byzantine Colluders

The challenge brought by colluders is detrimental to Byzantine-resilient algorithms in different ways. We first analyze below the behaviours of representative Byzantine attackers from a graph-theoretic perspective. We then present a toy example to show that existing Byzantine-resilient methods are vulnerable to collusion attacks.

Figure 2: Heatmaps of adjacency matrices for the 8 types of attack (see Sect. 7.1) with 100 clients including 30 attackers at the $1000_{th}$ iteration under the IID setting. In each subfigure, benign clients form a single coherent cluster residing in the upper-left block of the adjacency matrix, while attackers reside in the bottom-right part. The scaling factor is $\sigma^2=10$ and the task is logistic regression on the MNIST dataset. From left to right, top to bottom, the attack methods are the Gaussian attack [6], label flipping [15], sign flipping [15], our designed multi-collusion attack, the same value attack [35], Fang-v1 (designed for trimmed mean) [18], the Mimic attack, and Lie [4].

4.1 Weighted Undirected Graph in Federated Learning

We regard the model updates contributed by $K$ clients as an undirected graph $G=(V,E)$, where $V=\{v_1,\cdots,v_K\}$ represents the $K$ model updates and $E$ is a set of weighted edges representing similarities between the uploaded model updates corresponding to clients in $V$. We assume that the graph $G=(V,E)$ is weighted, and each edge between two nodes $v_i$ and $v_j$ carries a non-negative weight, e.g., $A_{ij}=\exp(-\|\mathbf{g}_i-\mathbf{g}_j\|^2/2\sigma^2)\geq 0$, where $\mathbf{g}_i$ is the uploaded gradient of the $i_{th}$ client and $\sigma$ is the Gaussian scaling factor. Let $G_R=(V_R,E_R)$ and $G_B=(V_B,E_B)$ respectively denote the two subgraphs of $G$ representing benign and Byzantine clients.

Moreover, the Byzantine problem [32] can be regarded as finding an optimal graph cut of $G$ that separates Byzantine from benign model updates. Since model updates from colluders form specific patterns (see Fig. 3 for examples), the aforementioned graph cut can be generalized to the so-called community detection problem [41], in which multiple subsets of closely connected nodes are to be separated from each other.
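The following is a minimal sketch (assumed shapes, not the paper's code) of building this weighted graph: nodes are the $K$ uploaded updates $\mathbf{g}_i$, and edge weights are the Gaussian-kernel similarities $A_{ij}$.

```python
import numpy as np

def adjacency(G, sigma):
    """G: (K, d) array of flattened model updates; returns (K, K) weights."""
    sq_dists = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)  # ||g_i - g_j||^2
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 0.1, size=(70, 20))     # one coherent benign cluster
colluders = np.ones((30, 20))                    # e.g., a same-value collusion
A = adjacency(np.vstack([benign, colluders]), sigma=1.0)
```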

4.2 Spectral Graph Analysis for Byzantine Attackers

We illustrate the spectral analysis of representative Byzantine attacks, especially those launched by colluders. For example, Fig. 2 shows adjacency matrices whose elements represent pairwise similarities between 70 benign clients under the IID setting and 30 attackers (the darker the element, the higher the pairwise similarity). It is clear in each subfigure that benign clients form a single coherent cluster residing in the upper-left block of the adjacency matrix (note that the order of the nodes illustrated in Fig. 2 is irrelevant; we separate benign nodes from Byzantine nodes only for better visual illustration), while attackers reside in the bottom-right part. We observe the following characteristics pertaining to benign as well as Byzantine model updates.

First, benign model updates form a single group (upper-left block of Fig. 2), which is formalized by Assumption 4.1, i.e., they all lie in a group (circle) with center $\nabla F$ and radius $\kappa$. Specifically, when the local datasets are homogeneous (IID), $\kappa$ is small, indicating strong similarity among benign model updates. When the local data of clients become heterogeneous, the radius $\kappa$ becomes larger. Note that Assumption 4.1 has been widely used to bound differences between model updates of benign clients, e.g., in [37, 69].

Assumption 4.1.

Assume the difference between the local gradients $\nabla F_i$ and the mean of benign model updates $\nabla F=\frac{1}{|V_R|}\sum_{i\in V_R}\nabla F_i$ is bounded ($V_R$ is the set of benign clients), i.e., there exists a finite $\kappa$ such that

$\|\nabla F_i-\nabla F\|\leq\kappa.$
Figure 3: Spectral analysis of four types of Byzantine attackers (each column represents one type of attack): Non-Collusion, Collusion-diff, Collusion-mimic and Mixture attacks. We take an example of 100 clients consisting of 70 benign clients and 30 attackers, view the model updates as nodes, and compute the pairwise similarities as edges. The first row provides an overview of the four attacks ((a), (b), (c) and (d) represent Non-Collusion, Collusion-diff, Collusion-mimic and Mixture attacks respectively, where green and red points represent benign and Byzantine updates respectively), while the second and third rows show the eigenvalues and eigengaps of the normalized adjacency matrix (see Def. 5.3).

Second, Byzantine model updates can be categorized into four types (Fig. 3):

  • Non-Collusion: $\|\mathbf{g}_b-\nabla F\|>\kappa$ and malicious updates ($\mathbf{g}_b$) are far away from each other (e.g., the Gaussian attack [6], label flipping [15] and sign flipping [15] attacks in Fig. 2(a), (b) and (c)).

  • Collusion-diff: $\|\mathbf{g}_b-\nabla F\|>\kappa$ and malicious updates ($\mathbf{g}_b$) form one or multiple clusters with small intra-cluster distances (e.g., our designed collusion attack (see Sect. 7.1), the same value attack [35], and the Fang-v1 [18] attack in Fig. 2(d), (e) and (f)).

  • Collusion-mimic: $\|\mathbf{g}_b-\nabla F\|<\kappa$ and the $\mathbf{g}_b$ of different attackers are almost identical ($\|\mathbf{g}_b^i-\mathbf{g}_b^j\|\ll\kappa$). Adversaries with strong connections form one or multiple clusters (small intra-cluster distance), and their behaviours are very similar to a few selected benign clients but different from the rest of the benign clients (e.g., the Mimic attack [28] and Lie [4] in Fig. 2(g) and (h)).

  • Mixture: adversaries may combine Non-Collusion, Collusion-diff and Collusion-mimic arbitrarily to obtain a mixture attack.

In order to address these complex types of both benign and Byzantine model updates, we adopt the eigengap technique [53, 70] to reliably detect benign and Byzantine clusters (or communities). We provide the following proposition to elucidate the characteristics of different types of Byzantine attacks through a spectral analysis lens. The proof of Proposition 4.1 is deferred to Appendix C.

Proposition 4.1.

Suppose $K$ clients consist of $m$ benign clients and $q$ attackers ($q<m-1$). If Assumption 4.1 holds for Non-Collusion and Collusion-diff attacks, then only the first $c$ eigenvalues are close to 1, and

  • $c=1+q<\frac{K}{2}$ for Non-Collusion attacks, provided that $\|\mathbf{g}_b-\nabla F\|\gg\kappa$ and malicious updates ($\mathbf{g}_b$) are far away from each other;

  • $c=1+B<\frac{K}{2}$ for Collusion-diff attacks, provided that malicious updates form $B$ groups and $\|\mathbf{g}_b-\nabla F\|\gg\kappa$;

  • $c>\frac{K}{2}$ for Collusion-mimic attacks, in which $\|\mathbf{g}_b-\nabla F\|<\kappa$ and malicious updates are almost identical.

Remark 4.1.

Proposition 4.1 illustrates that the eigenvalue 1 has multiplicity $c$, followed by a large gap to the $(c+1)_{th}$ eigenvalue $\lambda_{c+1}<1$. Therefore, we can use the position of this pronounced eigengap to determine the number of community clusters; specifically, the proposed method uses the position of the largest eigengap as the number of community clusters. Moreover, Proposition 4.1 demonstrates that the position of the pronounced eigengap is less than $\frac{K}{2}$ for Non-Collusion and Collusion-diff attacks, while it is larger than $\frac{K}{2}$ for Collusion-mimic attacks. This property is used in the proposed method to distinguish Collusion-mimic attacks from other types of attacks (see Sect. 5.3.3).
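A sketch of this heuristic (assumed helper names, not the paper's code): estimate the number of communities $c$ as the position of the largest eigengap of the normalized adjacency matrix $L=D^{-1/2}AD^{-1/2}$.

```python
import numpy as np

def num_clusters_by_eigengap(A):
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))              # D^{-1/2} A D^{-1/2}
    lam = np.sort(np.linalg.eigvalsh(L))[::-1]   # eigenvalues, descending
    gaps = lam[:-1] - lam[1:]                    # i-th eigengap
    return int(np.argmax(gaps)) + 1              # 1-indexed position c
```

Applied to the adjacency matrix A built in the Sect. 4.1 snippet, this estimator should return c = 2 (one benign group, one colluder group), matching the Collusion-diff case of Proposition 4.1.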

We illustrate with a concrete example the different spectra of four types of attacks in Fig. 3, in which each column represents one type of attack. We observe the following properties (all of the examples include 70 benign clients and 30 attackers):

  • for the Non-Collusion attack, we use the Gaussian attack [35] to upload largely different random updates following $\mathcal{N}(0,200)$. Fig. 3(e) and (i) show that the eigenvalues of the normalized adjacency matrix $L$ drop abruptly between indices 30 and 31, i.e., the largest eigengap lies at index 31. This indicates that the number of communities is 31, with the 70 benign clients forming one close group and the 30 individual attackers forming the other groups;

  • for the Collusion-diff attack, we use the same value attack [35] to upload all-one updates. Fig. 3(f) and (j) illustrate that the largest eigengap lies at index 2, elucidating that benign clients form one group while the strongly connected colluders form the other group;

  • for the Collusion-mimic attack, we use the mimic attack [28] to upload updates that mimic one benign update. The largest eigengap is at 70 because the Byzantine colluders and the mimicked benign client form one group while the remaining 69 benign clients form 69 separate groups, as Fig. 3(g) and (k) illustrate. This is because the connections among mimic colluders are much stronger than those among benign clients;

  • for the Mixture attack, we combine the Gaussian attack (5 attackers), the same value attack (5 attackers) and the mimic attack (20 attackers). Fig. 3(h) and (l) reveal that the largest eigengap is at 80, in which the mimic colluders and one mimicked benign client form a group, while the remaining 69 benign clients and 10 attackers form 79 separate groups. This example showcases that the mimic attack dominates the spectral characteristics of the Mixture attack. Therefore, in the proposed method, the Collusion-mimic type attack is first detected and removed, followed by the detection of the other two types of attack (see Algo. 3 in Sect. 5.3.3 and lines 3-4 of Algo. 2 in Sect. 5.2).

4.3 Failure Case Analysis

We evaluate two types of representative Byzantine-resilient methods (robust statistics based and clustering based aggregation methods) and the proposed FedCut method (see Sect. 5.2) under the four typical Byzantine attacks described in Sect. 4.2, in terms of their Byzantine Tolerance Rates (BTR) defined below. Specifically, we assume 10 benign model updates following the 1D Gaussian distribution $\mathcal{N}(0.1,0.1)$ and consider four scenarios of Byzantine attacks (S1-S4) as illustrated in Tab. II.

TABLE II: Four scenarios (S1-S4) of Byzantine attacks. S1 represents the Non-Collusion type; S2-s and S2-m represent Collusion-diff, the difference being that S2-s has a single cluster of colluders while S2-m consists of multiple clusters; S3 denotes the Collusion-mimic scenario; S4 denotes a mixture scenario comprising the Non-Collusion, Collusion-diff and Collusion-mimic types.
Scenario Attack Type Attack Distribution Number
S1 Non-Collusion $\mathcal{N}(0.1,1)$ 8
S2-s Collusion-diff $\mathcal{N}(-2,0.01)$ 8
S2-m Collusion-diff $\mathcal{N}(-2,0.01)$ 4
                    $\mathcal{N}(4,0.01)$ 4
S3 Collusion-mimic $\mathcal{N}(\mu,0.01)$ ($\mu$ is the minimum of the 10 benign model updates) 8
S4 Mixture $\mathcal{N}(-2,0.01)$ 3
           $\mathcal{N}(\mu,0.01)$ 3
           $\mathcal{N}(0.1,1)$ 1
TABLE III: The four scenarios (S1-S4) of Byzantine attacks are each repeated for 1000 runs to evaluate 5 representative Byzantine-resilient methods and the proposed FedCut method, in terms of their Byzantine Tolerance Rates (BTR) as in Def. 4.1
BTR Krum [6] Median [68] Trimmed Mean [68] DnC [52] Kmeans [54] FedCut (ours)
S1 96.0% 96.2% 92.6% 95.3% 70.5% 96.2%
S2-s 34.5% 29.5% 27.5% 89.6% 99.0% 99.0%
S2-m 86.1% 89.3% 88.2% 99.4% 3.5% 99.6%
S3 29.6% 36.9% 37.7% 52.9% 33.7% 98.7%
S4 59.5% 61.5% 67.6% 53.1% 85.6% 95.9%

Next, we provide a toy example of four types of Byzantine attacks mentioned above to illustrate failure cases of existing Byzantine-resilient methods.

In order to evaluate different Byzantine-resilient methods under the four scenarios, we define the Byzantine Tolerant Rate representing the fraction of Byzantine Tolerant cases over repeated runs against attacks as follows:

Definition 4.1.

(Byzantine Tolerant Rate) Suppose the server receives $(K-q)$ correct gradients $\mathcal{V}=\{v_1,\cdots,v_{K-q}\}$ and $q$ Byzantine gradients $\mathcal{U}=\{u_1,\cdots,u_q\}$. A Byzantine-robust method $\mathcal{A}$ is said to be Byzantine Tolerant to certain attacks [63] if

$\langle\mathbb{E}[\mathcal{V}],\,\mathbb{E}[\mathcal{A}(\mathcal{V}\cup\mathcal{U})]\rangle\geq 0$ (3)

Moreover, the Byzantine Tolerant Rate (BTR) for 𝒜\mathcal{A} is the fraction of Byzantine Tolerant cases over repeated runs against attacks.
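A sketch of this definition in the 1D toy setting (all names are assumed): an aggregator $\mathcal{A}$ is Byzantine Tolerant on one run if its output has a non-negative inner product with the mean of the correct gradients, and BTR is the success rate over repeated runs.

```python
import numpy as np

def byzantine_tolerant(aggregate, benign, byzantine):
    """One run of the check in Eq. (3)."""
    combined = np.concatenate([benign, byzantine])
    return np.dot(np.mean(benign, axis=0), aggregate(combined)) >= 0

def btr(aggregate, sample_benign, sample_byzantine, runs=1000):
    """Fraction of Byzantine Tolerant cases over repeated runs."""
    wins = sum(byzantine_tolerant(aggregate, sample_benign(), sample_byzantine())
               for _ in range(runs))
    return wins / runs
```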

Tab. III summarizes the toy example results for different Byzantine-resilient methods under the four types of Byzantine attacks mentioned above. We can draw the following conclusions:

  • First, for the Non-Collusion attack (S1), all Byzantine-resilient methods except Kmeans [54] perform well (i.e., the BTR is higher than 90%), which indicates that Non-Collusion attacks are easy to defend against.

  • Second, for the Collusion-diff attack (S2), robust statistics based methods such as Krum [6], Median, Trimmed Mean [68] and DnC [52] are vulnerable to S2-s. This failure is mainly ascribed to wrong estimations of the sample mean or median, misled by biased model updates from colluders. Moreover, the clustering based method, i.e., Kmeans [54], fails on S2-m, with a BTR as low as 3.5%. This is because clustering based methods rely on the assumption that only one group of colluders exists; the two or more groups of colluders in S2-m are therefore misclassified by naive clustering-based methods built on this wrong assumption.

  • Third, for the Collusion-mimic attack (S3), both robust statistics based methods and the clustering based method fail, with BTR no higher than 52.9%. The main reason is that colluders introduce statistical bias against benign updates, and the similar behaviours of colluders are hard to detect.

  • Finally, the proposed FedCut method is able to defend against all attacks with high BTR (more than 95%) by using the Spatial-Temporal framework and spectral heuristics illustrated in Sect. 5.

It is worth mentioning that the toy example used in this section only showcases some simplified failure cases that might defeat existing Byzantine-resilient methods. The attacking methods evaluated in Sect. 7 are more complex and detrimental, and we refer readers to the thorough experimental results in Sect. 7 and Appendix B.
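A sketch of the S1-S4 scenario generators of Tab. II (we assume the second argument of $\mathcal{N}(\cdot,\cdot)$ denotes a variance); these plug into the btr(...) helper above, e.g., btr(np.median, benign, s2_m) for the Median row of Tab. III. Note that S3 needs the benign draw to mimic its minimum.

```python
import numpy as np
rng = np.random.default_rng(0)

def benign():                                   # 10 benign updates ~ N(0.1, 0.1)
    return rng.normal(0.1, np.sqrt(0.1), size=10)

def s1():                                       # Non-Collusion
    return rng.normal(0.1, 1.0, size=8)

def s2_s():                                     # Collusion-diff, one cluster
    return rng.normal(-2.0, 0.1, size=8)

def s2_m():                                     # Collusion-diff, two clusters
    return np.concatenate([rng.normal(-2.0, 0.1, 4), rng.normal(4.0, 0.1, 4)])

def s3(b):                                      # Collusion-mimic: copy min benign
    return rng.normal(b.min(), 0.1, size=8)
```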

5 FedCut: Spectral Analysis against Byzantine Colluders

This section illustrates the proposed spectral analysis framework (see Algo. 1), in which distinguishing benign clients from one or more groups of Byzantine colluders is formulated as a community detection problem on a Spatial-Temporal graph [53], whose nodes represent all model updates and whose weighted edges represent similarities between respective pairs of model updates over all training iterations (see Sect. 5.1). The normalized graph cut with temporal consistency (called FedCut) [55, 42] is adopted to ensure that a globally optimal clustering is found (see Sect. 5.2). Moreover, spectral heuristics [70] are used to determine the Gaussian scaling factor, the number of colluder groups and the attack type (see Sect. 5.3). The gist of the proposed method is to discern colluders from benign clients by scrutinizing similarities between their respective model updates (see Fig. 4 for an overview).

Algorithm 1 FedCut Framework
1: $K$ clients with local training datasets $\mathcal{D}_i$, $i=1,2,\cdots,K$; number of global iterations $T$; learning rate $\eta$ and batch size $b$.
2: Global model $\mathbf{w}$.
3: $\mathbf{w}\leftarrow$ random initialization, $\tilde{L}^0=\mathbf{0}\in\mathbb{R}^{K\times K}$.
4: for $t=1,2,\cdots,T$ do
5:     Step I: The server sends the global model $\mathbf{w}$ to all
6:    clients $i=v_1,v_2,\cdots,v_K$.
7:     Step II: Training local models and server model.
8:     for $i=v_1,v_2,\cdots,v_K$ do in parallel
9:         $\mathbf{g}_i^t=ModelUpdate(\mathbf{w},\mathcal{D}_i,b,\eta)$.
10:         Send $\mathbf{g}_i^t$ to the server.
11:     end for
12:     Step III: Updating the global model via FedCut
13:     $\mathcal{I}_R,\tilde{L}^t=\text{C-Ncut}(\{\mathbf{g}_i^t\}_{i=1}^K,\tilde{L}^{t-1})$
14:     $\mathbf{g}^t=\text{Aggregate}(\mathbf{g}^t_{\mathcal{I}_R})$, where $\mathbf{g}^t_{\mathcal{I}_R}=\{\mathbf{g}^t_i\,|\,v_i\in\mathcal{I}_R\}$
15:     $\mathbf{w}\leftarrow\mathbf{w}-\eta\mathbf{g}^t$.
16: end for
17: return $\mathbf{w}$.

5.1 A Spatial-Temporal Graph in Federated Learning

In order to use all information accrued during training, we define a Spatial-Temporal graph that captures the behaviours of clients across all training iterations:

Definition 5.1.

(Spatial-Temporal Graph) Define a Spatial-Temporal graph $G=(V,E)$ as a sequence of snapshots $\langle G^1,\cdots,G^T\rangle$, where $G^t=(V,E^t)$ is an undirected graph at iteration $t$. $V$ denotes a fixed set of $K$ vertices representing the model updates belonging to the $K$ clients. $E^t$ is a set of weighted edges representing similarities between model updates corresponding to clients in $V$ at iteration $t$; the related adjacency matrix of the graph is $A^t$.

Given a set of model updates from KK clients over TT iterations during federated learning, one is able to construct such a Spatial-Temporal graph by assigning edge weights as the measured pair-wise similarity between model updates.

The construction of the Spatial-Temporal graph with historical information is motivated by a common shortcoming of Byzantine-resilient methods, e.g., reported in [29], which showed that methods without using historical information during the training might lead to deteriorated global model performances in the presence of colluders. Our ablation study in Appendix B also confirmed the importance of using the Spatial-Temporal graph.
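The following is a sketch (assumed names) of how each round's snapshot can be folded into this Spatial-Temporal view: keep a running average of normalized adjacency matrices, $\tilde{L}^t=\frac{t-1}{t}\tilde{L}^{t-1}+\frac{1}{t}L^t$, as later used by Algo. 2.

```python
import numpy as np

def normalized_adjacency(A):
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))           # D^{-1/2} A D^{-1/2}

def temporal_average(L_tilde_prev, A_t, t):
    """Update the average over t iterations with the iteration-t snapshot."""
    return ((t - 1) * L_tilde_prev + normalized_adjacency(A_t)) / t
```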

5.2 FedCut

Figure 4: FedCut provides a spatial-temporal cut that separates benign clients (green points) from two parties of colluders (orange and red points) across all iterations.

As mentioned in Sect. 5.1, we have defined a Spatial-Temporal graph $G=\langle G^1,\cdots,G^T\rangle$ based on the model updates of $K$ clients over $T$ iterations during federated learning. The Byzantine problem is then regarded as finding an optimal cut for $G$ such that the inter-cluster similarities (between benign clients and colluders) are as small as possible and the intra-cluster similarities are as large as possible (we view the cluster with the largest number of members as the benign clients, who constitute the largest community among all clients). One such optimal cut is known as the normalized cut (Ncut), which was initially applied to image segmentation [55, 42]. We adopt the Ncut approach to identify colluders, which have large intra-cluster similarities. Moreover, we extend the notion of Ncut to the proposed Spatial-Temporal Graph over all iterations to make the detection of Byzantine colluders more consistent over multiple iterations. The c-partition Ncut for the Spatial-Temporal Graph is defined as follows:

Definition 5.2.

(c-partition Ncut for Spatial-Temporal Graph) Let $G=(V,E)$ be a Spatial-Temporal graph as in Def. 5.1. Denote a c-partition of graph $G$ as $V=B_1\cup\cdots\cup B_c$ with $B_i\cap B_j=\varnothing$ for any $i\neq j$. The c-partition Ncut for the Spatial-Temporal Graph aims to optimize:

$\min_{(B_1\cup\cdots\cup B_c)=V}\ \sum_{t=1}^{T}\sum_{i=1}^{c}\frac{W^t(B_i,\overline{B_i})}{Vol^t(B_i)},$ (4)

where $\overline{B_i}$ is the complement of $B_i$, $W^t(B_i,B_j):=\frac{1}{2}\sum_{u\in B_i,v\in B_j}A^t_{uv}$ with $A^t_{uv}$ the edge weight between nodes $u$ and $v$, and $Vol^t(B_i):=\sum_{u\in B_i}\sum_{v\in V}A^t_{uv}$.

The proposed FedCut (Algo. 2) is used to remove malicious updates before model updates are aggregated. FedCut computes adjacency matrices from the uploaded gradients over all training iterations and then applies the normalized cut to distinguish benign clients from colluders. Specifically, three important parameters, namely the appropriate Gaussian kernel scale $\sigma_*$, the cluster number $c$ and the set of mimic colluders $\mathcal{I}$ (who behave similarly to benign clients), are first determined (see Algo. 3). We then calculate the adjacency matrix $A^t$ at the $t_{th}$ iteration while removing the mimic colluders, i.e., setting the connections between mimic colluders and benign clients to zero (lines 2-7 in Algo. 2). We further compute the normalized adjacency matrix $\tilde{L}^t$ by induction over $t$ iterations (lines 8-9 in Algo. 2). Finally, we apply Ncut to the normalized adjacency matrix $\tilde{L}^t$ to choose the cluster of benign clients (lines 10-12 in Algo. 2); a code sketch of this clustering step is given after Algo. 2.

It is worth noting that FedCut runs over multiple learning iterations and, as shown by Proposition 5.1, it is guaranteed to provide the optimal c-partition Ncut defined by Eq. (4) (See Proof in Appendix C).

Proposition 5.1.

FedCut (Algo. 2) solves the c-partition Ncut for the Spatial-Temporal Graph, i.e., it obtains an optimal solution to:

$\operatorname*{arg\,min}_{(B_1\cup\cdots\cup B_c)=V}\ \sum_{t=1}^{T}\sum_{i=1}^{c}\frac{W^t(B_i,\overline{B_i})}{Vol^t(B_i)}.$

Algorithm 2 c-NCut: c-partition Ncut for the Spatial-Temporal Graph
1: The uploaded gradients of the $K$ clients at the $t_{th}$ iteration: $\{\mathbf{g}_i^t\}_{i=1}^K$; normalized adjacency matrix averaged over the first $t-1$ iterations: $\tilde{L}^{t-1}$;
2: The set of benign clients: $\mathcal{I}_R$; normalized adjacency matrix averaged over $t$ iterations: $\tilde{L}^t$.
3: $c,\sigma_*,\mathcal{I}=$ PDSH($\{\mathbf{g}_i^t\}_{i=1}^K$);
4: Compute the adjacency matrix $A^t$ as:
5: if client $i\in\mathcal{I}$ then
6:     $A^t_{ij}=0$ ($i\neq j$)
7: else
8:     $A^t_{ij}=\exp(-\|\mathbf{g}_i^t-\mathbf{g}_j^t\|^2/2\sigma_*^2)$
9: end if
10: $D=\text{diag}(\text{Sum}(A^t))$, $L^t=D^{-1/2}A^tD^{-1/2}$;
11: $\tilde{L}^t=\frac{t-1}{t}\tilde{L}^{t-1}+\frac{1}{t}L^t$;
12: Eigendecomposition: $\tilde{L}^t=Q\Lambda Q^{-1}$;
13: Apply Kmeans to the top $c$ normalized eigenvectors ($Q$) to assign the $K$ clients into $c$ clusters.
14: $\mathcal{I}_R$ is the cluster with the largest number of clients;
15: return $\mathcal{I}_R$ and $\tilde{L}^t$.
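A sketch of the spectral-clustering step of Algo. 2 (lines 12-14), with scikit-learn's KMeans assumed available: eigendecompose the averaged matrix $\tilde{L}^t$, cluster the rows of the top-$c$ eigenvectors, and return the largest cluster as the benign set $\mathcal{I}_R$.

```python
import numpy as np
from sklearn.cluster import KMeans

def c_ncut_select(L_tilde, c):
    _, vecs = np.linalg.eigh(L_tilde)             # eigenvalues ascending
    Q = vecs[:, -c:]                              # top-c eigenvectors
    Q = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), 1e-12)
    labels = KMeans(n_clusters=c, n_init=10).fit_predict(Q)
    benign = np.bincount(labels).argmax()         # largest community
    return np.where(labels == benign)[0]          # indices of I_R
```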

5.3 Spectral Heuristics

The determination of three parameters (i.e., the scaling parameter of the Gaussian kernel $\sigma_*$, the number of clusters $c$ and the set of mimic colluders $\mathcal{I}$) is critical in thwarting Byzantine collusion attacks, as illustrated in Sect. 4.3. We adopt spectral heuristics based on the following definition of the eigengap to determine these three important parameters (a code sketch of the resulting procedure, Algo. 3, is given after its listing).

Definition 5.3.

The $i_{th}$ eigengap is defined as $|\lambda_i-\lambda_{i+1}|$, where the eigenvalues of the normalized adjacency matrix $L=D^{-1/2}AD^{-1/2}$ are ranked in descending order and $D=\text{diag}(\text{Sum}(A))$. Let $\delta$ be the largest eigengap among the $|\lambda_i-\lambda_{i+1}|$.

Algorithm 3 Parameter Determination via Spectral Heuristics (PDSH)
1: The uploaded gradients of the $K$ clients at the $t_{th}$ iteration: $\{\mathbf{g}_i^t\}_{i=1}^K$; a preselected set of $\sigma$: $\{\sigma_1,\cdots,\sigma_n\}$, where $\sigma_0\ll\sigma_1$
2: The cluster number $c$, the appropriate Gaussian kernel $\sigma_*$, the set of mimic colluders $\mathcal{I}$;
3: Initialize $\mathcal{I}=\emptyset$
4: while $\sigma\in\{\sigma_1,\cdots,\sigma_n\}$ do
5:     Compute the adjacency matrix $A^t$, where $A^t_{ij}=$
6:  $\exp(-\|\mathbf{g}_i^t-\mathbf{g}_j^t\|^2/2\sigma^2)$.
7:     $D=\text{diag}(\text{Sum}(A^t))$, $L^t=D^{-1/2}A^tD^{-1/2}$.
8:     Eigendecomposition: $L^t=P\Lambda P^{-1}$;
9:     Compute the eigengaps $\Delta_i=\lambda_i-\lambda_{i+1}$, where $\lambda_i$ is the
10:  $i_{th}$ eigenvalue;
11:     $j=\arg\max_i\Delta_i$, $\Delta_\sigma=\Delta_j$, $c_\sigma=j$;
12: end while
13: $\sigma_*=\arg\max_\sigma\Delta_\sigma$, $c=c_{\sigma_*}$;
14: if $c>\frac{K}{2}$ then
15:     Apply Kmeans to the top $c$ normalized eigenvectors
16:   to assign the $K$ clients into $c$ clusters;
17:     $\mathcal{I}$ are the clusters with more than one member.
18: end if
19: $\sigma_*=\arg\max_\sigma\{\Delta_\sigma\,|\,c_\sigma<\frac{K}{2}\}$, $c=c_{\sigma_*}$;
20: return $c$, $\sigma_*$ and the set of mimic colluders $\mathcal{I}$.
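Below is a sketch of Algo. 3 (PDSH), reusing adjacency(...) from the Sect. 4.1 snippet (assumed in scope): sweep the preselected sigmas, pick the one with the largest maximum eigengap, read off $c$, and flag mimic colluders when $c>K/2$.

```python
import numpy as np
from sklearn.cluster import KMeans

def eigengap_stats(A):
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                # D^{-1/2} A D^{-1/2}
    vals, vecs = np.linalg.eigh(L)                 # ascending order
    lam = vals[::-1]
    gaps = lam[:-1] - lam[1:]
    c = int(np.argmax(gaps)) + 1
    return gaps[c - 1], c, vecs

def pdsh(G, sigmas):
    K = len(G)
    stats = {s: eigengap_stats(adjacency(G, s)) for s in sigmas}
    sigma_star = max(stats, key=lambda s: stats[s][0])   # largest max eigengap
    _, c, vecs = stats[sigma_star]
    mimic = []
    if c > K / 2:                                  # Collusion-mimic suspected
        labels = KMeans(n_clusters=c, n_init=10).fit_predict(vecs[:, -c:])
        sizes = np.bincount(labels)
        mimic = [i for i in range(K) if sizes[labels[i]] > 1]
        valid = [s for s in sigmas if stats[s][1] < K / 2]  # line 19 of Algo. 3
        if valid:
            sigma_star = max(valid, key=lambda s: stats[s][0])
            c = stats[sigma_star][1]
    return c, sigma_star, mimic
```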
Figure 5: The change of the eigengap for different attacks with different numbers of attackers under the IID setting on the MNIST dataset. From left to right, top to bottom, the attack methods are the Gaussian attack [6], label flipping [15], sign flipping [15], our designed collusion attack (see Appendix A), the same value attack [35], Fang-v1 (designed for trimmed mean) [18], the Mimic attack [28], and Lie [4].

5.3.1 Gaussian Kernel Determination

The Gaussian kernel scaling parameter $\sigma$ in the similarity function $A_{ij}=\exp(-\|\mathbf{g}_i-\mathbf{g}_j\|^2/2\sigma^2)$ controls how rapidly $A_{ij}$ falls off with the distance between $\mathbf{g}_i$ and $\mathbf{g}_j$. An appropriate $\sigma$ is crucial for distinguishing Byzantine from benign clients. The following analysis shows that if the maximum eigengap $\delta$ (see Def. 5.3) is sufficiently large, a small perturbation of the normalized adjacency matrix $L$ will only affect the eigenvectors with bounded influence, and thus the clustering into $c$ clusters via the top $c$ eigenvectors is stable.

Proposition 5.2 (Stability [57]).

Let $\lambda$, $Y$ and $\delta$ be the eigenvalues, principal eigenvectors and maximum eigengap of $L$, respectively. Define a small perturbation of $L$ as $\tilde{L}=L+E$ such that $\|E\|_2$ is small enough, and let $\tilde{\lambda}$, $\tilde{Y}$ be the eigenvalues and principal eigenvectors of $\tilde{L}$. If the maximum eigengap $\delta$ is large enough, then

$\|Y-\tilde{Y}\|\leq\frac{4\|E\|}{\delta-\sqrt{2}\|E\|}$ (5)
Figure 6: The change of the maximum eigengap of the normalized adjacency matrix and of the clustering accuracy with different scaling factors $\sigma$ for the Gaussian attack [35], where the normalized adjacency matrix is $L=D^{-1/2}AD^{-1/2}$, $D=\text{diag}(\text{Sum}(A))$ and $A_{ij}=\exp(-\|\mathbf{g}_i-\mathbf{g}_j\|^2/2\sigma^2)$. The clustering accuracy is calculated by Ncut on the normalized adjacency matrix with 100 clients including 70 benign clients and 30 Byzantine clients.

It is noted that the error bound $\|Y-\tilde{Y}\|$ on the principal eigenvectors is affected by two factors: the perturbation $\|E\|$ of the affinity matrix, and the maximum eigengap $\delta$. While the perturbation is unknown and cannot be modified for a given matrix $\tilde{L}$, one can seek to maximize the eigengap $\delta$ so that the recovered principal eigenvectors are as stable as possible. According to Proposition 5.2, we select the optimal parameter $\sigma_*$ by maximizing the maximum eigengap, i.e.,

$\sigma_*=\operatorname{argmax}_\sigma\,\delta(\sigma)$ (6)

We implement this strategy in Algo. 3. Fig. 6 shows that one can select $\sigma$ such that the largest eigengap reaches its maximum (e.g., $\log(\sigma)$ in the range (1, 3.5) for the Gaussian attack), since the clustering accuracy achieves its largest value (100%) in this range. Note also that the clustering accuracy drops in the extreme case ($\sigma\geq 1e^4$) even though the maximum eigengap is still large. Therefore, we select the optimal $\sigma$ within a reasonable range that excludes the extreme case (see the detailed range in Sect. 7.1).

5.3.2 Determination of the Number of Clusters

The assumption made by many clustering-based Byzantine-resilient methods, that the model updates uploaded by all Byzantine attackers form a single group, might be violated in the presence of collusion (as shown by the toy example in Sect. 4.3 and the experimental results in Sect. 7). Therefore, it is necessary to discover the number of clusters, and one possible way is to analyze the spectrum of the adjacency matrix, normalized adjacency matrix or correlation matrix [53]. According to the spectral properties in Sect. 4.2, we determine the cluster number $c$ (see lines 3-11 in Algo. 3) as the position of the largest eigengap of the normalized adjacency matrix, i.e.,

$c=\operatorname{argmax}_k\,|\lambda_k-\lambda_{k+1}|$ (7)

Fig. 5 displays the change of the eigengap under different attacks, showing that the index of the largest eigengap is a good estimate of the number of clusters. Specifically,

  • for Non-Collusion attacks (i.e., the Gaussian attack [6], label flipping [15] and sign flipping [15]), the largest eigengap in Fig. 5(a), (b) and (c) with 10, 20 and 30 attackers lies at indices 11, 21 and 31 respectively, with benign clients forming one group and the 10, 20 and 30 attackers forming individual groups;

  • for Collusion-diff attacks, the largest eigengap in Fig. 5(e) and (f) lies between the second and third largest eigenvalues. The position of the largest eigengap indicates that there are two clusters (one cluster of benign clients and the other of colluders);

  • for Collusion-mimic attacks, the largest eigengap with 10, 20 and 30 attackers lies at 90, 80 and 70, indicating that the cluster numbers are 90, 80 and 70 (colluders form one group while the other benign clients form 90, 80 and 70 groups separately).

In addition, our empirical results in Appendix B also demonstrate the effectiveness of estimating the number of communities via the largest eigengap, even for heterogeneous datasets among clients.

5.3.3 Mimic Colluders Detection

Colluders may mimic the behaviours of some benign clients, over-emphasizing those clients and rendering other model updates useless. It is therefore hard to distinguish mimic colluders from benign clients, and existing Byzantine-resilient methods fail to defend against the Collusion-mimic attack (see the S3 case in Sect. 4.3). We leverage the spectral property that the position of the largest eigengap is larger than $\frac{K}{2}$ under such attacks (Proposition 4.1) to detect mimic colluders. Specifically, when the position of the largest eigengap is larger than $\frac{K}{2}$, we pick the clusters with more than one client as the set of mimic colluders (see lines 12-15 in Algo. 3).

6 Convergence Analysis

We prove the convergence of the iterative FedCut algorithm. The proof of Theorem 6.2 uses a similar technique to [15] (see details in Appendix C).

Assumption 6.1.

For malicious updates $\mathbf{g}_b$ with $\|\mathbf{g}_b-\nabla F\|>\kappa$, the difference between the mean of benign updates and the colluders' updates is at least $C\kappa$, where $C$ is a large constant, i.e.,

$\|\mathbf{g}_b-\nabla F\|>C\kappa.$
Remark 6.1.

$C$ in Assumption 6.1 is much larger than 1, reflecting a large distance between colluders and the mean of benign updates. If a single attacker uploads gradients within $\kappa$ distance of the mean of benign updates, we can regard this attacker as a benign client. For mimic colluders who introduce statistical bias, we detect and remove them according to the spectral heuristics (see Sect. 5.3.3).

Theorem 6.1.

Suppose an $0<\alpha<\frac{1}{2}$ fraction of clients are Byzantine attackers. If Assumptions 4.1 and 6.1 hold, we can find the estimate $\hat{\mathbf{g}}$ according to line 11 in Algo. 1 such that, with probability $1-\tilde{O}(\sqrt{Z_1})$, $\|\hat{\mathbf{g}}-\nabla F\|\leq\tilde{\mathcal{O}}(\alpha\kappa)$, where $Z_1=\frac{4K\sqrt{\alpha(1-\alpha)Z^{\frac{C^2}{4}}}}{\delta-K\sqrt{2\alpha(1-\alpha)Z^{\frac{C^2}{4}}}}$ and $Z=\exp(-2\kappa^2/\sigma^2)$.

Remark 6.2.

Theorem 6.1 illustrates that the distance between the gradient $\hat{\mathbf{g}}$ estimated via FedCut and the benign averaged gradient is bounded by $\tilde{\mathcal{O}}(\alpha\kappa)$. Moreover, $\alpha(1-\alpha)$ is increasing w.r.t. $\alpha$ when $\alpha<\frac{1}{2}$, and $Z_1$ is increasing w.r.t. $\alpha(1-\alpha)$. Therefore, $Z_1$ is increasing w.r.t. $\alpha$, which indicates that the probability $1-\tilde{O}(\sqrt{Z_1})$ decreases as the number of Byzantine attackers increases. In addition, for larger $C$, $Z_1$ tends to zero since $Z<1$. Consequently, $\|\hat{\mathbf{g}}-\nabla F\|\leq\tilde{\mathcal{O}}(\kappa)$ with a high probability close to 1.

Assumption 6.2.

The stochastic gradients sampled from any local dataset have uniformly bounded variance over $\mathcal{D}_i$ for all benign clients, i.e., there exists a finite $\sigma_0$ such that for all $\mathbf{x}_{i,j}\in\mathcal{D}_i$, $i\in[K]$, we have

$\mathbb{E}_j\|\nabla F_i(\mathbf{w}_i,\mathbf{x}_{i,j})-\nabla F_i(\mathbf{w}_i)\|^2\leq\sigma_0^2,$ (8)

where $\nabla F_i(\mathbf{w}_i)=\mathbb{E}_j\nabla F_i(\mathbf{w}_i,\mathbf{x}_{i,j})$.

Remark 6.3.

The difference between Assumptions 4.1 and 6.2 is that the former bounds the variance between model updates across clients, while the latter bounds the variance of gradient estimates within the same client.

Assumption 6.3.

We assume that $F$ is $L$-smooth and $\mu$-strongly convex.

Theorem 6.2.

Suppose an $0<\alpha<\frac{1}{2}$ fraction of clients are corrupted. For a global objective function $F:\mathbb{R}^d\to\mathbb{R}$, the server obtains a sequence of iterates $\{\mathbf{w}^t:t\in[0:T]\}$ (see Algo. 1) when run with a fixed step-size $\eta<\min\{\frac{1}{4L},\frac{1}{\mu}\}$. If Assumptions 4.1, 6.1, 6.2 and 6.3 hold, the sequence of iterates $\{\mathbf{w}^t:t\in[0:T]\}$ satisfies the following convergence guarantee:

$\|\mathbf{w}^T-\mathbf{w}^*\|^2\leq(1-\frac{C_1\mu}{L})^T\|\mathbf{w}^0-\mathbf{w}^*\|^2+\frac{\Gamma}{\mu^2},$ (9)

where $\Gamma=\tilde{\mathcal{O}}(\sigma_0^2+\kappa^2+\alpha^2\kappa^2)$, $C_1$ is a constant and $\mathbf{w}^*$ is the global optimum of federated learning.

Theorem 6.2 provides the convergence guarantee of the FedCut framework in the strongly convex case. As $T$ tends to infinity, the upper bound on $\|\mathbf{w}^T-\mathbf{w}^*\|^2$ grows with increasing $\sigma_0$ (variance of gradients within the same client), $\kappa$ (variance between model updates across benign clients) and $\alpha$ (the fraction of Byzantine attackers).

7 Experimental results

TABLE IV: Model performances of different Byzantine-resilient methods under different Byzantine attacks (IID setting with 30 Byzantine clients, for classification of MNIST, Fashion MNIST and CIFAR10). As a baseline, FedAvg [40] without any attacks achieves model performances of 92.5%, 90.1% and 69.4% on MNIST, Fashion MNIST and CIFAR10, respectively.
Krum [6] GeoMedian [13] Median [68] Trimmed Mean [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
MNIST No attack 90.5±0.1 92.4±0.0 90.5±0.1 90.3±0.1 89.6±0.1 89.4±0.1 92.2±0.3 92.4±0.3 92.5±0.1
Lie [4] 90.5±0.0 84.4±0.6 84.0±0.3 83.6±0.8 73.6±1.0 89.5±0.1 92.2±0.3 92.2±0.2 92.3±0.2
Fang-v1 [18] 90.2±0.1 45.8±0.5 43.4±1.5 37.8±2.6 85.5±0.3 75.8±3.5 92.3±0.3 92.3±0.2 92.3±0.1
Fang-v2 [18] 41.1±7.4 53.4±2.1 39.0±2.3 42.4±2.8 18.8±6.1 79.7±0.8 92.0±0.6 88.2±0.4 92.3±0.1
Same value [35] 90.6±0.1 85.6±0.0 77.0±0.4 75.1±0.6 88.8±0.4 89.8±0.3 92.3±0.2 92.2±0.3 92.2±0.1
Gaussian [6] 90.4±0.1 92.4±0.1 90.6±0.1 90.8±0.1 89.6±0.3 87.0±1.0 74.3±0.4 22.4±4.3 92.3±0.1
sign flipping [15] 90.6±0.1 91.3±0.1 90.2±0.1 90.2±0.1 89.6±0.2 72.2±5.5 90.7±0.1 49.9±1.3 91.9±0.2
label flipping [15] 90.4±0.1 89.4±0.1 85.6±0.2 85.4±0.4 89.5±0.4 89.3±0.6 92.2±0.2 92.1±0.2 92.1±0.4
mimic [28] 86.1±0.4 86.3±0.9 88.3±0.3 89.7±0.2 85.5±0.5 90.5±0.1 92.2±0.4 92.1±0.1 92.3±0.1
Collusion (Ours) 90.3±0.2 83.9±2.1 79.5±0.7 78.0±0.8 89.7±0.3 89.4±0.1 90.2±0.3 43.2±20.8 92.3±0.1
Averaged 85.1±0.9 80.5±0.7 76.8±0.6 76.3±0.9 80.0±1.0 85.3±1.2 90.1±0.3 75.7±2.8 92.2±0.2
Worst-case 41.1±7.4 45.8±0.5 39.0±2.3 37.8±2.6 18.8±6.1 72.2±5.5 74.3±0.4 22.4±4.3 91.9±0.2
FMNIST No attack 84.6±0.5 89.5±0.3 88.2±0.4 88.6±0.4 87.3±0.3 88.2±0.5 90.3±0.2 90.0±0.3 90.1±0.3
Lie [4] 85.3±0.6 60.7±12.9 50.5±22.7 68.9±6.9 55.5±10.3 75.8±4.3 78.4±5.0 89.0±0.8 89.7±0.6
Fang-v1 [18] 85.4±0.3 82.5±1.0 73.1±4.1 69.1±4.6 84.8±0.9 86.6±1.3 89.5±0.6 89.7±0.3 89.7±0.5
Fang-v2 [18] 10.0±11.1 12.6±4.8 64.2±3.3 46.6±16.6 10.5±2.4 80.1±2.3 71.5±10.4 69.3±1.4 89.3±0.4
Same value [35] 85.6±0.6 59.9±14.3 69.8±0.8 66.5±3.6 86.7±0.3 88.1±0.7 89.4±0.6 88.8±0.6 89.6±0.4
Gaussian [6] 85.5±0.3 89.5±0.5 87.5±0.5 88.2±0.4 87.1±0.8 87.9±0.7 69.8±7.7 17.9±27.3 89.7±0.2
sign flipping [15] 85.2±0.8 85.9±1.2 86.2±0.6 86.3±0.5 87.1±0.4 84.0±1.4 76.9±1.1 52.8±3.0 87.5±0.9
label flipping [15] 85.5±0.2 89.6±0.3 72.6±5.1 78.5±4.8 87.1±0.3 87.9±0.6 89.8±0.3 89.9±0.4 89.1±0.3
mimic [28] 79.7±0.6 81.2±0.4 83.0±1.3 86.7±0.8 79.8±0.7 87.7±0.2 88.8±0.2 89.1±0.4 89.4±0.7
Collusion (Ours) 84.7±0.5 35.9±22.7 71.2±6.3 69.5±1.8 86.8±0.6 88.4±0.3 85.5±1.7 76.9±13.6 90.0±0.5
Averaged 77.2±1.6 68.7±5.8 74.6±4.5 74.9±4.0 75.3±1.7 85.5±1.2 83.0±2.8 75.3±4.8 89.4±0.5
Worst-case 10.0±11.1 12.6±4.8 50.5±22.7 46.6±16.6 10.5±2.4 75.8±4.3 69.8±7.7 17.9±27.3 87.5±0.9
CIFAR10 No attack 64.6±0.1 64.9±2.1 39.0±25.2 66.6±1.6 29.5±6.6 68.3±0.0 69.7±0.9 68.6±0.6 68.0±0.4
Lie [4] 63.9±2.1 10.0±0.2 10.2±0.2 9.8±0.2 10.0±0.1 12.4±4.2 11.9±3.3 20.4±22.0 68.4±0.4
Fang-v1 [18] 64.4±0.5 63.9±0.3 11.2±2.1 17.5±12.3 17.3±3.4 68.1±0.8 66.8±0.3 68.0±0.9 67.7±3.9
Fang-v2 [18] 9.9±0.1 15.0±4.1 11.0±1.3 11.3±1.4 10.0±0.1 61.8±4.2 67.0±0.6 53.4±12.7 68.5±1.5
Same value [35] 64.8±0.7 50.2±0.3 10.0±0.0 10.0±0.0 54.8±0.2 29.2±33.3 65.1±5.2 65.3±11.3 66.4±1.8
Gaussian [6] 63.5±1.1 62.1±4.3 48.4±4.8 65.4±0.2 17.1±6.2 68.1±0.6 31.8±2.3 12.6±1.3 65.3±0.7
sign flipping [15] 65.0±0.5 58.6±12.5 16.6±5.6 51.9±1.7 40.2±14.9 28.7±32.4 47.5±9.4 17.7±22.6 63.1±5.5
label flipping [15] 63.7±1.4 36.1±22.7 18.6±3.3 50.2±2.1 13.0±2.7 61.3±2.5 57.1±0.8 58.3±1.5 63.8±2.0
mimic [28] 44.1±0.9 65.0±0.5 56.4±1.0 66.0±0.8 42.2±2.8 48.0±32.9 65.9±1.7 66.6±3.6 68.8±0.2
Collusion (Ours) 63.3±0.6 10.0±0.1 9.7±0.1 10.2±0.0 19.4±1.2 67.9±0.7 29.5±13.4 10.5±0.3 67.0±1.7
Averaged 56.7±0.8 43.6±4.7 23.1±4.4 35.9±2.0 25.3±3.8 51.4±11.2 51.2±3.8 44.1±7.7 66.7±1.8
Worst-case 9.9±0.1 10.0±0.2 9.7±0.1 9.8±0.2 10.0±0.1 12.4±4.2 11.9±3.3 10.5±0.3 63.1±5.5

This section presents experimental results for the proposed FedCut method compared with eight existing Byzantine-robust methods. We refer to Appendices A and B for the full report of extensive experimental results.

7.1 Setup and Evaluation Metrics

  • Models: three models, logistic regression, LeNet [34] and AlexNet [25], are used across all experimental settings (see results for other models and datasets in the supplementary material).

  • Datasets: MNIST [33], Fashion-MNIST [60], and CIFAR10 [30] are used for image classification tasks. To simulate Non-IID settings, the class labels assigned to each client follow a Dirichlet distribution Dir(β) [36].

  • Federated Learning Settings: We simulate a horizontal federated learning system with K = 100 clients on a stand-alone machine with 8 Tesla V100-SXM2 32 GB GPUs and 72 cores of Intel(R) Xeon(R) Gold 61xx CPUs. In each communication round, the clients upload their model updates, and the server adopts the FedAvg [40] algorithm to aggregate them. The detailed experimental hyper-parameters are listed in Appendix A.

  • Byzantine attacks: We set 10%, 20%, and 30% of clients, i.e., 10, 20, and 30 out of 100 clients, to be Byzantine attackers. The following attacking methods are used in experiments:

    • the same value attack: model updates of attackers are replaced by the all-ones vector;

    • the sign flipping attack: local gradients of attackers are scaled by a factor of -4;

    • the gaussian attack: local gradients of attackers are replaced by independent Gaussian random vectors drawn from 𝒩(0, 200);

    • the Lie attack: it was designed in [4];

    • the Fang-v1 attack and Fang-v2 attack: they were designed in [18] for coordinate-wise trimmed mean [68] and Krum [6] respectively;

    • Mimic attack: colluders mimic the behaviours of some benign clients so as to over-emphasize those clients and under-represent others [28].

    • Our designed multi-collusion attack: adversaries are separated into 4 groups, and attackers within the same group upload similar values. For example, each group is sampled from 𝒩(μ + μ_i, 0.0001) with a distinct μ_i per group, where μ is the mean of the uploaded gradients of all other benign clients.

  • Byzantine-resilient methods: Eight existing methods, i.e., statistics-based methods (Krum [6], GeoMedian [13], Median [68], Trimmed Mean [68], Bulyan [21], and DnC [52]), a server-evaluating method (FLtrust [10]), and a clustering-based method (Kmeans [54]), together with the proposed FedCut, are compared in terms of the following metrics.

  • Gaussian kernel scaling parameters of FedCut: the preselected set {σ_1, ⋯, σ_n} in Algo. 3 is a geometric sequence with common ratio 2 over (1, 10√10) for MNIST, (1, 10√10) for Fashion-MNIST, and (0.1, 10√10) for CIFAR10 (see the sketch following this list).
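The candidate grid can be generated programmatically; below is a minimal Python sketch under our reading that the sequence starts at the lower endpoint and doubles until it exceeds the upper endpoint (the helper name is ours, not from the released code):

```python
import numpy as np

def sigma_grid(low, high, ratio=2.0):
    """Geometric sequence of candidate Gaussian-kernel widths within (low, high)."""
    sigmas = [low]
    while sigmas[-1] * ratio < high:
        sigmas.append(sigmas[-1] * ratio)
    return np.array(sigmas)

# MNIST / Fashion-MNIST grid: 1, 2, 4, 8, 16 (upper bound 10*sqrt(10) ≈ 31.6)
print(sigma_grid(1.0, 10 * np.sqrt(10)))
```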

Figure 7: Detection accuracy of different Byzantine-resilient methods (Kmeans, DnC, NCut, and FedCut) among all 8 attacks with different Non-IID extents (left: IID, middle: Non-IID with β=0.5, right: Non-IID with β=0.1) on MNIST (first row) and Fashion MNIST (second row).

Evaluation metrics: two types of metrics are used in our evaluation.

  • Model Performance (MP) of the federated model is used to evaluate the defending capability of each method. To elucidate the robustness of each defending method, we also report its averaged and worst-case model performance over all possible attacks.

  • For methods that detect benign clients before aggregation, including DnC [52], Kmeans [54], and FedCut, the detection accuracy over all iterations is used to quantify their defending capabilities. Note that this accuracy is measured against the ground-truth membership of Byzantine attackers and should not be confused with the image classification accuracy of the main task (see the sketch below).
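To make this metric concrete, the following is a minimal sketch of how per-round detection accuracy can be computed against the ground-truth membership and averaged over iterations (function and variable names are ours, not from the released code):

```python
import numpy as np

def detection_accuracy(kept_per_round, benign_mask):
    """Fraction of clients whose benign/Byzantine label is predicted correctly,
    averaged over all communication rounds.

    kept_per_round: list over rounds; each entry is an iterable of client ids
                    that the defense keeps as benign in that round.
    benign_mask:    boolean array of shape (K,); True marks a benign client.
    """
    accs = []
    for kept in kept_per_round:
        pred = np.zeros(len(benign_mask), dtype=bool)
        pred[list(kept)] = True
        accs.append(np.mean(pred == benign_mask))
    return float(np.mean(accs))
```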

7.2 Comparison with Other Byzantine-resilient Methods

Tab. IV summarizes the Model Performance (MP) of 8 existing methods as well as the proposed FedCut method for classification of MNIST, Fashion MNIST, and CIFAR10 using logistic regression, LeNet, and AlexNet, respectively, under the IID setting with 30 attackers (see Appendix B for more results under other settings). There are four noticeable observations.

  • FedCut performs robustly under almost all attacks, with worst-case MP of 91.9% for MNIST, 87.5% for Fashion MNIST, and 63.1% for CIFAR10. This robust performance is in sharp contrast with the eight existing methods, each of which suffers a worst-case MP degradation ranging from 18.2% (i.e., DnC under the Gaussian attack, reaching only 74.3% MP) to 73.7% (i.e., Bulyan under Fang-v2, merely having 18.8% MP) on MNIST.

  • In terms of averaged MP, FedCut clearly outperforms the eight existing methods by noticeable margins, ranging from 2.1% to 16.5% on MNIST, 3.9% to 20.7% on Fashion MNIST, and 10.0% to 43.6% on CIFAR10.

  • Robust-statistics-based methods (e.g., Krum, GeoMedian, Median, and Trimmed Mean) were overwhelmed by Collusion-diff and Collusion-mimic attacks and incurred significant MP degradation on MNIST (e.g., the MP of Median drops by 49.1% under the Fang-v1 attack, and the MP of Krum drops by 51.4% under the Fang-v2 attack). In contrast, all collusion attacks cause a minor MP loss of less than 0.5% to FedCut.

  • The existing clustering-based method, i.e., Kmeans, performs robustly in the presence of single-group colluders, with MP degradation of less than 2% (e.g., under the Same Value attack on MNIST, Fashion MNIST, and CIFAR10), but incurs significant MP degradation of more than 50% in the face of multi-group colluders (e.g., under the Multi-Collusion attack on MNIST and the Gaussian attack on Fashion MNIST). In contrast, multi-group collusion does not cause significant MP loss to the proposed FedCut, which adopts spectral heuristics to estimate the number of colluder groups (see Sect. 5.3). Correspondingly, the detection accuracy of FedCut is higher than that of other methods, as Fig. 7 shows. For instance, the detection accuracy of FedCut on MNIST in the IID setting is 99.9%, while those of DnC and Kmeans drop to 93.6% and 87.5%, respectively.

Figure 8: Averaged and worst-case model performance among all 8 attacks for different Byzantine-resilient methods under the IID setting with different numbers of Byzantine attackers (yellow: 10 attackers, blue: 20 attackers, green: 30 attackers) on Fashion MNIST.
Figure 9: Averaged and worst-case model performance among all 8 attacks for different Byzantine-resilient methods with 30 Byzantine attackers under different Non-IID extents (yellow: Non-IID with β=0.1, blue: Non-IID with β=0.5, green: IID) on Fashion MNIST.

7.3 Robustness

In this subsection, we demonstrate the robustness of the FedCut framework under varying numbers of Byzantine attackers, varying Non-IID extents of clients' local data, and multiple collusion attacks.

7.3.1 Robustness under Varying Numbers of Attackers

Fig. 8 shows the model performance (MP) of different Byzantine-resilient methods under different types of attacks with 10, 20, and 30 Byzantine attackers. The results show that the MP of FedCut does not drop as the number of Byzantine attackers increases, while the MP of the other methods drops seriously (e.g., the averaged MP of Median [68] drops from 88.6% to 74.6% on the Fashion MNIST dataset). Clearly, the proposed FedCut is robust to the number of Byzantine attackers.

7.3.2 Robustness under Heterogeneous Datasets

Fig. 9 shows the model performance (MP) of different Byzantine-resilient methods under different types of attacks for different Non-IID extents of clients' datasets. The results demonstrate that the MP of FedCut drops by no more than 3% as the clients' local data become more heterogeneous, while the MP of the other methods drops seriously (e.g., the averaged MP of Trimmed Mean [68] drops from 74.9% to 55.4% on the Fashion MNIST dataset).

TABLE V: Computation time complexity of different Byzantine-resilient methods, where K is the number of clients, d is the dimension of updates, and T′ is the training time on the server's data for FLtrust.
Defense Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21]
Time Complexity Õ(K²d) Õ(K²d) Õ(K²d) Õ(K²d) Õ(Kd)
Defense FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
Time Complexity Õ(Kd+T′) Õ(d³+K²d) Õ(K²d) Õ(K²d+K³)

7.3.3 Robustness against Multi-Collusion Attacks

In the foregoing, we designed the 'Multi-Collusion attack', in which adversaries are separated into four groups and attackers within the same group upload similar values: each group is sampled from 𝒩(μ + μ_i, 0.0001) with a distinct μ_i per group, where μ is the mean of the uploaded gradients of all other benign clients. We further design more 'collusion attacks' with varying numbers of colluder groups as follows: 30 attackers are separated into 1) two groups of 15 attackers each; 2) three groups of ten attackers each; 3) four groups of 8, 8, 7, and 7 attackers, respectively; 4) five groups of six attackers each (a simulation sketch is given below).
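The following is a minimal numpy sketch of how such multi-group colluders can be simulated; the per-group shift scale and the random direction of each μ_i are our illustrative choices, not prescribed by the experiments above:

```python
import numpy as np

def multi_collusion_updates(benign_updates, group_sizes, shift_scale=1.0, seed=0):
    """Each colluder group g uploads near-identical vectors drawn from
    N(mu + mu_g, 0.0001), where mu is the mean of the benign updates."""
    rng = np.random.default_rng(seed)
    mu = benign_updates.mean(axis=0)
    d = mu.shape[0]
    noise_std = np.sqrt(1e-4)                        # variance 0.0001
    malicious = []
    for size in group_sizes:
        mu_g = shift_scale * rng.standard_normal(d)  # distinct shift per group
        for _ in range(size):
            malicious.append(mu + mu_g + rng.normal(0.0, noise_std, d))
    return np.stack(malicious)

# e.g., the 4-group setting above: 30 attackers split as 8, 8, 7, 7
# fake_updates = multi_collusion_updates(benign_updates, [8, 8, 7, 7])
```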

Tabs. VI and VII display the model performances under these different 'collusion attacks', illustrating that the various collusion attacks do not influence FedCut and NCut, while other Byzantine-resilient methods such as Kmeans are severely affected. Note that DnC performs well under 2-group collusion, but its MP drops as the number of colluder groups increases.

TABLE VI: Model performances of different Byzantine-resilient methods under different groups of collusion attacks (with Non-IID setting β=0.5 and 30 attackers for classification of MNIST). i-groups denotes the number of colluder groups.
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
2-groups 89.2±\pm0.4 79.2±\pm0.0 71.9±\pm0.1 69.0±\pm0.2 89.1±\pm0.2 88.5±\pm0.1 91.9±\pm0.1 45.7±\pm0.0 91.7±\pm0.0
3-groups 88.9±\pm0.1 81.1±\pm0.0 73.4±\pm0.0 70.7±\pm0.1 88.9±\pm0.1 88.0±\pm0.1 87.2±\pm0.8 45.9±\pm0.1 91.6±\pm0.1
4-groups 89.0±\pm0.5 82.2±\pm1.2 73.6±\pm1.2 72.3±\pm1.2 86.7±\pm1.2 89.2±\pm0.5 89.4±\pm0.5 31.2±\pm2.2 92.1±\pm0.1
5-groups 88.8±\pm1.1 84.2±\pm0.1 78.0±\pm0.1 75.1±\pm0.0 88.7±\pm0.1 88.4±\pm0.1 89.0±\pm0.3 54.1±\pm0.0 91.7±\pm0.1
TABLE VII: Model performances of different Byzantine-resilient methods under different groups of collusion attacks (with IID setting and 30 attackers for classification of MNIST). i-groups denotes the number of colluder groups.
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
2-groups 90.8±\pm0.0 80.2±\pm0.4 73.3±\pm0.1 72.7±\pm0.3 90.1±\pm0.0 88.8±\pm0.1 92.0±\pm0.1 46.4±\pm0.1 91.8±\pm0.0
3-groups 90.6±\pm0.1 82.3±\pm0.4 77.2±\pm0.7 75.7±\pm0.3 90.1±\pm0.1 88.9±\pm0.1 89.0±\pm1.4 49.5±\pm0.1 91.8±\pm0.0
4-groups 90.3±\pm0.2 83.9±\pm2.1 79.5±\pm0.7 78.0±\pm0.8 89.7±\pm0.3 89.4±\pm0.1 90.2±\pm0.3 43.2±\pm8.4 92.2±\pm0.2
5-groups 90.7±\pm0.1 84.4±\pm0.4 79.3±\pm0.1 78.9±\pm0.1 90.1±\pm0.1 88.6±\pm0.6 89.4±\pm0.1 51.9±\pm0.1 91.9±\pm0.0

7.4 Computation Complexity

We compare the per-iteration computation time complexity of different Byzantine-resilient methods on the server in Tab. V. It shows that the computation complexity of the proposed FedCut is Õ(K²d + K³), where implementing the singular value decomposition (SVD) needs Õ(K³) [23] and computing the adjacency matrix requires Õ(K²d). The time complexity of FedCut is thus comparable to that of statistics-based methods such as Krum [6] (Õ(K²d)) and Median [68] (Õ(K²d)) when the dimension d of the model is much larger than the number of participants K.
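To make these two dominant costs concrete, the following is a simplified single-round sketch of the spectral pipeline (not the full FedCut algorithm, which additionally employs the temporal aggregation and spectral heuristics of Sect. 5); it assumes numpy and scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(updates, sigma, num_clusters):
    """Spectral clustering of one round of client updates.

    updates: (K, d) array of model updates; sigma: Gaussian kernel width.
    Costs: O(K^2 d) for the adjacency matrix, O(K^3) for the eigendecomposition.
    """
    sq_dists = np.sum((updates[:, None, :] - updates[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2 * sigma ** 2))              # adjacency, O(K^2 d)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]     # D^{-1/2} A D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                        # O(K^3)
    emb = eigvecs[:, -num_clusters:]                      # top eigenvectors
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=num_clusters, n_init=10,
                  random_state=0).fit_predict(emb)
```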

Empirically, the proposed FedCut method allows the training of a DNN model (AlexNet) with 100 clients to run in three hours under a simulated environment (see Appendix A for details). We regard this time complexity as reasonable for practical federated learning applications across multiple institutions. For cross-device application scenarios in which K might be up to millions, randomized SVD [23] can be adopted to improve the time complexity from Õ(K³) to Õ(K), provided the normalized adjacency matrix has low rank. Applying randomized SVD to cross-device FL scenarios is left as future work.
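For reference, a minimal numpy sketch of the randomized algorithm of [23] is given below; when the K×K normalized adjacency matrix is (approximately) low-rank, a small target rank makes the factorization cost roughly linear in K:

```python
import numpy as np

def randomized_svd(M, rank, oversample=10, seed=0):
    """Truncated SVD via the randomized range finder of Halko et al. [23]."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((M.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(M @ omega)       # orthonormal basis capturing range(M)
    B = Q.T @ M                          # small (rank + oversample) x K matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]
```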

8 Discussion and Conclusion

This paper proposed a novel spectral analysis framework, called FedCut, to detect Byzantine colluders robustly and efficiently from a graph perspective. Specifically, FedCut ensures the optimal separation of benign clients and colluders in the spatial-temporal graph constructed from model updates uploaded over different iterations. We analyzed existing Byzantine attacks and Byzantine-resilient methods through the lens of spectral analysis, and showed that existing Byzantine-resilient methods may fail in the face of Byzantine colluders. Moreover, spectral heuristics are used in FedCut to determine the number of colluder groups, to select the scaling factor, and to expose mimic colluders, which significantly improves the Byzantine tolerance of colluder detection.

Our extensive experiments on multiple datasets and theoretical convergence analysis demonstrate that our proposed framework can achieve drastic improvements over the FL baseline in terms of model performance under various Byzantine attacks.

Finally, the proposed method and many existing Byzantine-resilient methods assume that the server can access clients' model updates to compute similarities. However, the leakage of model updates allows a semi-honest server to infer some information about the private data [72]. Therefore, the Byzantine problem should be considered jointly with privacy protection when implementing FL for privacy-preserving applications. In future work, we will explore to what extent FedCut can be used in conjunction with Differential Privacy [1] or Homomorphic Encryption mechanisms [71].

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. Advances in Neural Information Processing Systems, 31, 2018.
  • [3] Zeyuan Allen-Zhu, Faeze Ebrahimianghazani, Jerry Li, and Dan Alistarh. Byzantine-resilient non-convex stochastic gradient descent. In International Conference on Learning Representations, 2020.
  • [4] Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [5] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations, 2018.
  • [6] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in Neural Information Processing Systems, 30, 2017.
  • [7] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [8] Andries E Brouwer and Willem H Haemers. Spectra of graphs. Springer Science & Business Media, 2011.
  • [9] Saikiran Bulusu, Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat, and Pramod K Varshney. Byzantine resilient distributed clustering with redundant data assignment. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2143–2148. IEEE, 2021.
  • [10] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. Fltrust: Byzantine-robust federated learning via trust bootstrapping. arXiv preprint arXiv:2012.13995, 2020.
  • [11] Chen Chen, Lingjuan Lyu, Yuchen Liu, Fangzhao Wu, Chaochao Chen, and Gang Chen. Byzantine-resilient federated learning via gradient memorization.
  • [12] Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Draco: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pages 903–912. PMLR, 2018.
  • [13] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.
  • [14] Georgios Damaskinos, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki, et al. Asynchronous byzantine machine learning (the case of sgd). In International Conference on Machine Learning, pages 1145–1154. PMLR, 2018.
  • [15] Deepesh Data and Suhas Diggavi. Byzantine-resilient high-dimensional sgd with local iterations on heterogeneous data. In International Conference on Machine Learning, pages 2478–2488. PMLR, 2021.
  • [16] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
  • [17] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Louis Alexandre Rouault. Distributed momentum for byzantine-resilient stochastic gradient descent. In 9th International Conference on Learning Representations (ICLR), number CONF, 2021.
  • [18] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. Local model poisoning attacks to Byzantine-robust federated learning. In 29th USENIX Security Symposium (USENIX Security 20), pages 1605–1622, 2020.
  • [19] Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Rafael Pinot, and John Stephan. Byzantine machine learning made easy by resilient averaging of momentums. arXiv preprint arXiv:2205.12173, 2022.
  • [20] Avishek Ghosh, Justin Hong, Dong Yin, and Kannan Ramchandran. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.
  • [21] Rachid Guerraoui, Sébastien Rouault, et al. The hidden vulnerability of distributed learning in byzantium. In International Conference on Machine Learning, pages 3521–3530. PMLR, 2018.
  • [22] Nirupam Gupta, Thinh T Doan, and Nitin Vaidya. Byzantine fault-tolerance in federated local sgd under 2f-redundancy. arXiv preprint arXiv:2108.11769, 2021.
  • [23] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
  • [24] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. Byzantine-robust decentralized learning via self-centered clipping. arXiv preprint arXiv:2202.01545, 2022.
  • [25] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • [26] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
  • [27] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [28] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Byzantine-robust learning on heterogeneous datasets via bucketing. arXiv preprint arXiv:2006.09365, 2020.
  • [29] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for byzantine robust optimization. In International Conference on Machine Learning, pages 5311–5319. PMLR, 2021.
  • [30] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • [32] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. In Concurrency: the Works of Leslie Lamport, pages 203–226. 2019.
  • [33] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [34] Yann LeCun et al. Lenet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20(5):14, 2015.
  • [35] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544–1551, 2019.
  • [36] Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data silos: An experimental study. arXiv preprint arXiv:2102.02079, 2021.
  • [37] Xiang Li, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126, 5, 2019.
  • [38] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
  • [39] Helmut Lutkepohl. Handbook of matrices. Computational statistics and Data analysis, 2(25):243, 1997.
  • [40] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [41] Mark EJ Newman. Detecting community structure in networks. The European physical journal B, 38(2):321–330, 2004.
  • [42] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.
  • [43] Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22, 2013.
  • [44] Jie Peng and Qing Ling. Byzantine-robust decentralized stochastic optimization. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5935–5939. IEEE, 2020.
  • [45] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. arXiv preprint arXiv:1912.13445, 2019.
  • [46] Saurav Prakash and Amir Salman Avestimehr. Mitigating byzantine attacks in federated learning. arXiv preprint arXiv:2010.07541, 2020.
  • [47] Shashank Rajput, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Detox: A redundancy-based framework for faster and more robust gradient aggregation. Advances in Neural Information Processing Systems, 32, 2019.
  • [48] Jayanth Reddy Regatti, Hao Chen, and Abhishek Gupta. Bygars: Byzantine sgd with arbitrary number of attackers using reputation scores.
  • [49] Camellia Sarkar and Sarika Jalan. Spectral properties of complex networks. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(10):102101, 2018.
  • [50] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 32(8):3710–3722, 2020.
  • [51] Felix Sattler, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. On the byzantine robustness of clustered federated learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8861–8865. IEEE, 2020.
  • [52] Virat Shejwalkar and Amir Houmansadr. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS, 2021.
  • [53] Hua-Wei Shen and Xue-Qi Cheng. Spectral methods for the detection of network community structure: a comparative analysis. Journal of Statistical Mechanics: Theory and Experiment, 2010(10):P10020, 2010.
  • [54] Shiqi Shen, Shruti Tople, and Prateek Saxena. Auror: Defending against poisoning attacks in collaborative deep learning systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pages 508–519, 2016.
  • [55] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000.
  • [56] Jy-yong Sohn, Dong-Jun Han, Beongjun Choi, and Jaekyun Moon. Election coding for distributed learning: Protecting signsgd against byzantine attacks. Advances in Neural Information Processing Systems, 33:14615–14625, 2020.
  • [57] Gilbert W Stewart. Matrix perturbation theory. 1990.
  • [58] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
  • [59] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
  • [60] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [61] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116, 2018.
  • [62] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Zeno: Byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032, 24, 2018.
  • [63] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In Uncertainty in Artificial Intelligence, pages 261–270. PMLR, 2020.
  • [64] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno++: Robust fully asynchronous sgd. In International Conference on Machine Learning, pages 10495–10503. PMLR, 2020.
  • [65] Haibo Yang, Xin Zhang, Minghong Fang, and Jia Liu. Byzantine-resilient stochastic gradient descent for distributed learning: A lipschitz-inspired coordinate-wise median approach. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 5832–5837. IEEE, 2019.
  • [66] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • [67] Yi-Rui Yang and Wu-Jun Li. Basgd: Buffered asynchronous sgd for byzantine learning. In International Conference on Machine Learning, pages 11751–11761. PMLR, 2021.
  • [68] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR, 2018.
  • [69] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193. PMLR, 2019.
  • [70] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. Advances in neural information processing systems, 17, 2004.
  • [71] Chengliang Zhang, Suyi Li, Junzhe Xia, Wei Wang, Feng Yan, and Yang Liu. BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 493–506, 2020.
  • [72] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in neural information processing systems, 32, 2019.

Appendix A Experiment Setting

This section describes the experiment settings of the empirical study of the proposed FedCut framework in comparison with existing Byzantine-resilient methods.

Model Architectures: The architectures we investigated include the well-known logistic regression, LeNet [34], and AlexNet [31].

Datasets: We evaluate federated learning model performances on classification tasks with the standard MNIST, Fashion MNIST [60], and CIFAR10 datasets. The MNIST database of 10-class handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Fashion MNIST is a fashion-product counterpart of MNIST, also comprising 60,000 training images and 10,000 test images. The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. Respectively, we conduct standard image classification tasks on MNIST, Fashion MNIST, and CIFAR10 with logistic regression, LeNet, and AlexNet. According to the way we split the dataset among clients in federated learning, the experiments are divided into the IID setting and the Non-IID setting. Specifically, we consider a label-skew Non-IID federated learning setting, where each client's training examples are drawn with class labels following a Dirichlet distribution Dir(β) [36], in which β > 0 is the concentration parameter controlling the identicalness among clients (we use β = 0.1 and 0.5 in the following; a partition sketch is given below).
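A minimal sketch of this label-skew Dirichlet partition, assuming numpy (helper names are ours):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices among clients with class proportions ~ Dir(beta)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(np.full(num_clients, beta))  # per-client share of class c
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# smaller beta -> more skewed label distributions across clients
# parts = dirichlet_partition(train_labels, num_clients=100, beta=0.5)
```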

Federated Learning Settings: We simulate a horizontal federated learning system with K = 100 clients on a stand-alone machine with 8 Tesla V100-SXM2 32 GB GPUs and 72 cores of Intel(R) Xeon(R) Gold 61xx CPUs. In each communication round, the clients upload their model updates, and the server adopts the FedAvg [40] algorithm to aggregate them. The detailed experimental hyper-parameters are listed in Tab. VIII.

Hyper-parameter | Logistic Regression | LeNet | AlexNet
Activation function | ReLU | ReLU | ReLU
Optimization method | Adam | Adam | Adam
Learning rate | 0.0005 | 0.001 | 0.001
Weight decay | 0.001 | 0.002 | 0.001
Batch size | 32 | 32 | 64
Data distribution | IID and Non-IID (β=0.5, 0.1) | IID and Non-IID (β=0.5, 0.1) | IID
Iterations | 2000 | 3000 | 3000
Number of clients | 100 | 100 | 30
Number of attackers | 10, 20, 30 | 10, 20, 30 | 6
TABLE VIII: Training hyper-parameters

Attack: We set 10%, 20%, and 30% of clients, i.e., 10, 20, and 30 out of 100 clients, to be Byzantine attackers. The following attacking methods are used in experiments (a simulation sketch for the simpler attacks follows this list):

  1. the 'same value attack', where model updates of attackers are replaced by the all-ones vector;

  2. the 'label flipping attack', where attackers use wrong labels to generate the gradients they upload;

  3. the 'sign flipping attack', where local gradients of attackers are scaled by -4;

  4. the 'gaussian attack', where local gradients of attackers are replaced by independent Gaussian random vectors drawn from 𝒩(0, 200);

  5. the 'Lie attack', which was designed in [4];

  6. the 'Fang-v1 attack', which was designed in [18] for coordinate-wise trimmed mean [68];

  7. the 'Fang-v2 attack', which was designed in [18] for Krum [6];

  8. our designed 'collusion attack', where adversaries are separated into 4 groups and attackers within the same group upload similar values: each group is sampled from 𝒩(μ + μ_i, 0.0001) with a distinct μ_i per group, where μ is the mean of the uploaded gradients of all other benign clients.
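A hedged sketch of how the simpler untargeted attacks above can be simulated on a flat numpy update vector (we read 𝒩(0, 200) as variance 200, which is an assumption on our part):

```python
import numpy as np

def same_value_attack(update):
    return np.ones_like(update)             # replace update by the all-ones vector

def sign_flipping_attack(update, scale=-4.0):
    return scale * update                   # flip and scale the local gradient

def gaussian_attack(update, variance=200.0, seed=0):
    rng = np.random.default_rng(seed)       # replace update by N(0, variance) noise
    return rng.normal(0.0, np.sqrt(variance), size=update.shape)
```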

Byzantine-resilient methods: The following Byzantine-resilient methods are evaluated in experiments: Krum [6], GeoMedian [13], Median [68] (we adopt the codes from https://github.com/Liepill/RSA-Byzantine), Trimmed Mean [68], Bulyan [21], FLtrust [10] (we adopt the codes from https://people.duke.edu/~zg70/code/fltrust.zip), DnC [52], Kmeans [54], NCut [42] (applied in each iteration separately, as an ablation), and FedCut, the proposed method applied to the spatial-temporal network.

Appendix B Ablation Study

In this section, we report further experimental results for different Non-IID extents (β = 0.1 and β = 0.5), numbers of Byzantine attackers (10, 20, and 30), models (logistic regression, LeNet, and AlexNet), and datasets (MNIST, Fashion MNIST, and CIFAR10).

B.1 More Experimental Results

Tabs. IX–XXVI show the model performance (MP) of different Byzantine-resilient methods under different types of attacks in various settings. All results show that the proposed FedCut achieves the best model performance compared to other Byzantine-resilient methods as the Non-IID extent increases from β = 0.5 to 0.1 and the number of Byzantine attackers increases from 10 to 30. As shown by the ablation study, the MP of NCut is not as stable as that of the proposed FedCut (e.g., its MP drops to 31.7% in the Non-IID setting with 30 attackers under the Fang-v1 attack).

Figure 10: Average MP of FedCut and NCut (FedCut without temporal consistency) among all 8 attacks with different numbers of Byzantine attackers (left: 10 attackers, middle: 20 attackers, right: 30 attackers) on MNIST (first row) and Fashion MNIST (second row).
Figure 11: The change of eigengap for different attacks with different numbers of attackers under the IID setting on the MNIST dataset.
Figure 12: The change of eigengap for different attacks with different numbers of attackers under the Non-IID (β=0.1) setting on the MNIST dataset.
Figure 13: The change of the maximum eigengap of the normalized adjacency matrix and the clustering accuracy with different scaling factors σ for the Gaussian attack [6] on Non-IID datasets (β=0.1, 0.5), where the normalized adjacency matrix is L = D^{-1/2} A D^{-1/2}, D = diag(Sum(A)), and A_ij = exp(−||g_i − g_j||²/(2σ²)). The clustering accuracy is calculated by NCut on the normalized adjacency matrix with 100 clients, including 70 benign clients and 30 Byzantine clients.
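The quantities in Fig. 13 can be reproduced with the following minimal sketch, which builds the normalized adjacency matrix defined in the caption and returns the maximum eigengap together with the implied number of clusters (function name is ours):

```python
import numpy as np

def max_eigengap(updates, sigma):
    """Maximum eigengap of L = D^{-1/2} A D^{-1/2} with Gaussian-kernel A."""
    sq = np.sum((updates[:, None, :] - updates[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L))[::-1]       # eigenvalues, descending
    gaps = lam[:-1] - lam[1:]
    k = int(np.argmax(gaps)) + 1                     # implied cluster count
    return gaps[k - 1], k

# scanning sigma over the preselected grid and keeping the value with the
# largest eigengap matches the selection heuristic illustrated in Fig. 13
```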
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 75.7±\pm1.1 88.5±\pm0.9 76.5±\pm4.0 81.6±\pm1.4 75.5±\pm1.5 85.6±\pm1.6 89.1±\pm0.4 88.7±\pm0.7 88.9±\pm1.0
Lie [4] 73.8±\pm0.7 78.9±\pm1.7 40.0±\pm28.6 47.4±\pm22.8 72.1±\pm5.0 83.0±\pm2.9 75.3±\pm3.8 81.1±\pm0.9 88.7±\pm0.8
Fang-v1 [18] 75.4±\pm2.3 86.7±\pm1.1 64.7±\pm7.7 62.3±\pm17.1 74.4±\pm3.3 83.9±\pm3.0 89.9±\pm0.2 76.6±\pm8.1 89.2±\pm1.4
Fang-v2 [18] 10.0±\pm0.0 43.4±\pm5.7 68.8±\pm3.6 59.5±\pm13.0 56.8±\pm11.3 84.9±\pm0.7 88.4±\pm1.9 88.8±\pm0.4 88.9±\pm0.1
Same value [35] 75.8±\pm1.2 75.0±\pm1.6 78.3±\pm4.9 78.5±\pm3.5 76.7±\pm2.9 85.9±\pm2.0 89.5±\pm0.2 89.2±\pm0.5 89.3±\pm0.6
Gaussian [6] 77.2±\pm0.8 88.2±\pm0.7 78.2±\pm30.8 74.8±\pm7.9 64.8±\pm15.5 86.3±\pm0.9 80.4±\pm4.2 40.6±\pm24.2 89.6±\pm0.4
sign flipping [15] 75.9±\pm2.4 87.4±\pm1.1 78.9±\pm2.9 72.4±\pm14.3 67.7±\pm15.6 85.3±\pm1.3 85.5±\pm1.3 54.9±\pm10.0 84.4±\pm0.9
label flipping [15] 76.6±\pm0.9 88.1±\pm0.9 69.0±\pm7.5 78.7±\pm1.6 60.2±\pm20.9 85.8±\pm1.3 89.4±\pm0.5 87.8±\pm1.0 88.0±\pm0.4
mimic [28] 75.5±\pm1.4 87.9±\pm0.6 77.1±\pm2.1 86.0±\pm0.3 79.3±\pm0.3 87.4±\pm0.4 88.9±\pm0.3 88.9±\pm0.2 88.9±\pm0.7
Collusion (Ours) 75.9±\pm1.6 34.9±\pm11.2 72.7±\pm1.1 69.3±\pm2.2 71.8±\pm4.4 87.1±\pm0.1 66.6±\pm6.6 56.6±\pm22.3 89.1±\pm0.4
Averaged 69.2±\pm1.2 75.9±\pm2.6 70.4±\pm9.3 71.1±\pm8.4 69.9±\pm8.1 85.5±\pm1.4 84.3±\pm1.9 75.3±\pm6.8 88.5±\pm0.7
Worst-case 10.0±\pm0.0 34.9±\pm11.2 40.0±\pm28.6 47.4±\pm22.8 56.8±\pm11.3 83.0±\pm2.9 66.6±\pm6.6 40.6±\pm24.2 84.4±\pm0.9
TABLE IX: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1\beta=0.1 and 10 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 83.2±\pm0.6 89.9±\pm0.3 84.5±\pm0.9 84.3±\pm1.9 83.6±\pm2.3 87.4±\pm0.9 90.0±\pm0.4 89.6±\pm0.5 90.0±\pm0.4
Lie [4] 83.5±\pm0.7 84.3±\pm0.6 55.1±\pm18.7 75.8±\pm6.0 82.6±\pm1.8 83.6±\pm2.5 71.0±\pm29.0 80.2±\pm1.3 89.9±\pm0.6
Fang-v1 [18] 83.7±\pm0.4 89.1±\pm0.4 79.8±\pm3.4 79.7±\pm3.5 84.1±\pm1.8 85.3±\pm1.6 90.2±\pm0.3 89.6±\pm1.5 89.6±\pm0.3
Fang-v2 [18] 10.0±\pm0.0 60.3±\pm6.3 72.2±\pm2.3 73.1±\pm1.5 73.0±\pm2.0 85.7±\pm1.0 89.0±\pm1.3 88.7±\pm0.5 89.2±\pm0.4
Same value [35] 83.5±\pm0.3 76.3±\pm0.5 84.2±\pm0.6 83.4±\pm1.4 83.8±\pm1.3 87.1±\pm0.5 89.8±\pm0.2 89.9±\pm0.2 90.0±\pm0.5
Gaussian [6] 83.1±\pm0.4 89.7±\pm0.3 84.5±\pm1.3 85.3±\pm1.2 79.9±\pm4.6 87.2±\pm0.6 78.1±\pm4.6 49.6±\pm16.0 89.8±\pm0.1
sign flipping [15] 83.7±\pm0.7 88.9±\pm0.7 82.8±\pm3.8 84.6±\pm1.8 80.8±\pm2.7 86.7±\pm1.5 88.2±\pm0.5 71.7±\pm10.6 86.4±\pm0.6
label flipping [15] 83.6±\pm0.8 89.5±\pm0.4 81.1±\pm3.4 83.4±\pm3.4 81.1±\pm2.5 87.5±\pm0.6 90.0±\pm0.2 89.6±\pm0.2 89.8±\pm0.3
mimic [28] 83.6±\pm0.5 89.4±\pm0.1 81.5±\pm6.3 88.4±\pm0.6 85.1±\pm0.8 88.7±\pm0.6 89.4±\pm0.1 89.7±\pm0.2 89.2±\pm0.3
Collusion (Ours) 84.3±\pm0.6 27.8±\pm24.8 75.2±\pm1.9 75.5±\pm1.4 81.9±\pm2.8 88.7±\pm0.6 84.2±\pm0.7 54.5±\pm12.6 90.1±\pm0.1
Averaged 76.2±\pm0.5 78.5±\pm3.4 78.1±\pm4.3 81.4±\pm2.3 81.6±\pm2.3 86.8±\pm1.0 86.0±\pm3.7 79.3±\pm4.4 89.4±\pm0.4
Worst-case 10.0±\pm0.0 27.8±\pm24.8 55.1±\pm18.7 73.1±\pm1.5 73.0±\pm2.0 83.6±\pm2.5 71.0±\pm29.0 49.6±\pm16.0 86.4±\pm0.6
TABLE X: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5\beta=0.5 and 10 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 84.7±\pm0.7 90.0±\pm0.4 87.9±\pm0.4 88.2±\pm0.4 87.0±\pm0.6 88.0±\pm0.6 90.2±\pm0.2 89.8±\pm0.4 90.2±\pm0.1
Lie [4] 85.1±\pm0.3 66.8±\pm22.2 81.9±\pm2.2 82.7±\pm0.9 85.1±\pm0.5 85.2±\pm1.1 80.0±\pm3.8 78.6±\pm3.2 90.0±\pm0.4
Fang-v1 [18] 84.9±\pm0.6 88.7±\pm0.7 86.5±\pm0.4 86.6±\pm0.6 86.9±\pm0.4 87.4±\pm1.0 89.9±\pm0.4 90.1±\pm0.2 87.8±\pm2.1
Fang-v2 [18] 47.7±\pm31.0 78.2±\pm1.8 78.0±\pm0.3 84.7±\pm0.8 84.8±\pm1.0 84.3±\pm0.8 89.6±\pm0.9 89.3±\pm0.2 90.3±\pm0.4
Same value [35] 85.3±\pm0.5 76.9±\pm0.4 85.5±\pm0.5 85.1±\pm0.7 87.2±\pm0.5 88.2±\pm0.9 90.2±\pm0.6 90.3±\pm0.2 90.3±\pm0.1
Gaussian [6] 85.1±\pm0.9 90.0±\pm0.4 87.8±\pm0.5 88.1±\pm0.4 87.5±\pm0.4 88.3±\pm0.8 76.3±\pm5.2 30.2±\pm26.5 90.3±\pm0.2
sign flipping [15] 85.0±\pm0.4 89.4±\pm0.3 87.9±\pm0.4 88.2±\pm0.4 87.1±\pm0.4 86.5±\pm1.9 89.3±\pm0.6 87.7±\pm0.6 89.4±\pm1.0
label flipping [15] 84.9±\pm0.5 89.9±\pm0.2 85.6±\pm1.1 85.1±\pm0.7 87.1±\pm0.4 88.2±\pm0.8 90.4±\pm0.4 90.1±\pm0.2 88.0±\pm0.2
mimic [28] 85.3±\pm0.2 89.4±\pm0.4 87.9±\pm0.7 89.4±\pm0.5 86.9±\pm0.3 89.1±\pm0.3 89.8±\pm0.4 89.9±\pm0.2 89.8±\pm0.2
Collusion (Ours) 85.3±\pm0.5 54.7±\pm4.5 79.1±\pm1.3 78.1±\pm0.4 87.5±\pm0.4 88.7±\pm0.2 85.0±\pm0.5 67.6±\pm2.9 90.1±\pm0.0
Averaged 81.3±\pm3.6 81.4±\pm3.1 84.8±\pm0.8 85.6±\pm0.6 86.7±\pm0.5 87.4±\pm0.8 87.1±\pm1.3 80.4±\pm3.5 89.6±\pm0.5
Worst-case 47.7±\pm31.0 54.7±\pm4.5 78.0±\pm0.3 78.1±\pm0.4 84.8±\pm1.0 84.3±\pm0.8 76.3±\pm5.2 30.2±\pm26.5 87.8±\pm2.1
TABLE XI: Model performances of different Byzantine-resilient methods under different attacks (with IID and 10 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 72.0±\pm2.5 88.4±\pm0.7 70.1±\pm11.0 81.8±\pm0.3 81.8±\pm0.4 86.1±\pm1.1 88.9±\pm0.8 88.7±\pm0.6 89.4±\pm0.4
Lie [4] 75.4±\pm1.2 64.4±\pm9.9 35.5±\pm25.0 41.3±\pm19.5 26.9±\pm28.5 76.8±\pm4.0 72.1±\pm6.8 88.1±\pm0.6 88.1±\pm2.3
Fang-v1 [18] 76.4±\pm1.5 78.5±\pm0.8 27.9±\pm10.9 35.8±\pm9.2 71.6±\pm5.5 84.1±\pm2.9 88.7±\pm0.8 89.7±\pm0.4 89.2±\pm0.1
Fang-v2 [18] 10.0±\pm0.0 10.9±\pm1.2 18.8±\pm18.4 32.9±\pm5.8 23.6±\pm9.0 84.9±\pm0.4 82.2±\pm8.1 65.8±\pm10.3 89.0±\pm0.3
Same value [35] 76.6±\pm1.9 64.5±\pm8.9 57.8±\pm11.9 66.7±\pm3.7 81.2±\pm0.7 86.1±\pm1.0 89.5±\pm0.4 89.5±\pm0.4 89.4±\pm0.2
Gaussian [6] 76.4±\pm0.4 88.0±\pm0.7 76.4±\pm2.7 79.4±\pm1.4 78.2±\pm3.5 85.7±\pm1.9 76.4±\pm6.6 20.7±\pm28.9 89.3±\pm0.1
sign flipping [15] 75.5±\pm1.2 85.3±\pm5.5 73.0±\pm5.9 54.5±\pm25.3 80.5±\pm8.5 83.4±\pm2.5 76.7±\pm1.2 61.1±\pm3.4 82.9±\pm0.5
label flipping [15] 76.5±\pm0.7 88.3±\pm1.0 66.8±\pm13.1 75.5±\pm2.9 77.3±\pm10.4 85.6±\pm1.2 87.3±\pm2.0 87.0±\pm1.2 88.5±\pm0.5
mimic [28] 75.3±\pm1.4 84.6±\pm1.4 60.4±\pm13.3 81.1±\pm3.6 66.9±\pm4.7 87.4±\pm0.5 88.0±\pm0.1 88.7±\pm0.4 89.1±\pm0.2
Collusion (Ours) 75.0±\pm1.6 55.9±\pm11.4 67.7±\pm2.0 70.7±\pm0.4 71.7±\pm2.2 86.4±\pm0.5 86.1±\pm0.4 32.4±\pm20.0 89.4±\pm0.4
Averaged 68.9±\pm1.2 70.9±\pm4.2 55.4±\pm11.4 62.0±\pm7.2 66.0±\pm7.3 84.7±\pm1.6 83.6±\pm2.7 71.2±\pm6.6 88.4±\pm0.5
Worst-case 10.0±\pm0.0 10.9±\pm1.2 18.8±\pm18.4 32.9±\pm5.8 23.6±\pm9.0 76.8±\pm4.0 72.1±\pm6.8 20.7±\pm28.9 82.9±\pm0.5
TABLE XII: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1\beta=0.1 and 20 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 82.3±\pm0.7 90.0±\pm0.3 85.6±\pm1.0 86.5±\pm0.4 85.5±\pm2.8 88.1±\pm0.6 89.4±\pm1.0 88.4±\pm1.0 90.2±\pm0.2
Lie [4] 83.5±\pm1.0 53.7±\pm13.5 36.1±\pm15.6 51.9±\pm26.7 54.5±\pm12.9 76.3±\pm6.0 73.3±\pm7.8 87.9±\pm1.0 89.9±\pm0.1
Fang-v1 [18] 84.8±\pm0.7 85.5±\pm1.1 75.5±\pm2.5 71.8±\pm7.3 83.1±\pm0.9 84.4±\pm1.4 89.2±\pm1.3 88.8±\pm1.2 89.4±\pm0.2
Fang-v2 [18] 10.0±\pm8.3 18.9±\pm7.1 62.4±\pm5.1 55.8±\pm2.8 59.3±\pm5.4 84.9±\pm0.8 83.0±\pm8.3 72.4±\pm5.2 89.5±\pm0.2
Same value [35] 83.7±\pm0.8 64.8±\pm8.2 77.7±\pm2.7 71.5±\pm5.5 86.2±\pm1.0 86.9±\pm0.7 88.8±\pm1.2 89.0±\pm1.2 89.4±\pm0.1
Gaussian [6] 82.9±\pm0.4 89.6±\pm0.5 84.6±\pm2.5 86.2±\pm1.1 85.7±\pm0.7 88.4±\pm0.4 77.5±\pm7.2 38.2±\pm34.5 89.5±\pm0.3
sign flipping [15] 82.9±\pm1.4 87.7±\pm0.9 85.0±\pm1.0 85.0±\pm0.9 85.1±\pm1.4 84.4±\pm1.8 85.0±\pm1.5 74.3±\pm7.4 86.0±\pm0.1
label flipping [15] 82.4±\pm1.1 89.9±\pm0.4 79.4±\pm3.4 80.4±\pm1.8 80.3±\pm6.6 88.1±\pm0.2 88.4±\pm1.2 88.6±\pm1.5 89.3±\pm0.3
mimic [28] 81.1±\pm2.6 87.5±\pm1.1 83.0±\pm0.4 87.8±\pm0.3 76.3±\pm1.5 88.7±\pm0.3 89.3±\pm0.3 89.4±\pm0.3 88.9±\pm0.6
Collusion (Ours) 84.5±\pm1.0 41.7±\pm17.0 72.0±\pm1.2 61.4±\pm15.3 81.4±\pm5.0 88.7±\pm0.7 86.9±\pm0.5 56.5±\pm9.7 89.9±\pm0.3
Averaged 75.8±\pm1.8 70.9±\pm5.0 74.1±\pm3.5 73.8±\pm6.2 77.7±\pm3.8 85.9±\pm1.3 85.1±\pm3.0 77.4±\pm6.3 89.2±\pm0.2
Worst-case 10.0±\pm8.3 18.9±\pm7.1 36.1±\pm15.6 51.9±\pm26.7 54.5±\pm12.9 76.3±\pm6.0 73.3±\pm7.8 38.2±\pm34.5 86.0±\pm0.1
TABLE XIII: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5\beta=0.5 and 20 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 84.6±\pm0.6 90.1±\pm0.4 88.3±\pm0.2 88.0±\pm0.6 87.5±\pm0.2 88.3±\pm0.6 89.3±\pm1.0 89.2±\pm0.7 89.1±\pm0.9
Lie [4] 84.7±\pm0.7 76.3±\pm18.9 69.0±\pm26.6 75.2±\pm3.1 82.4±\pm0.2 79.0±\pm4.2 79.7±\pm4.6 89.0±\pm1.0 88.8±\pm1.0
Fang-v1 [18] 85.3±\pm0.4 86.4±\pm0.7 80.8±\pm1.4 80.3±\pm2.0 87.5±\pm0.3 86.6±\pm0.9 88.5±\pm1.0 89.2±\pm0.6 89.7±\pm0.2
Fang-v2 [18] 38.0±\pm21.1 46.4±\pm9.7 70.9±\pm1.4 71.0±\pm2.1 17.2±\pm12.5 82.9±\pm0.8 85.0±\pm6.4 86.6±\pm2.3 90.0±\pm0.2
Same value [35] 85.0±\pm0.5 74.0±\pm2.2 78.9±\pm2.0 77.8±\pm1.2 87.1±\pm0.4 88.4±\pm0.4 89.3±\pm0.8 89.2±\pm0.7 89.5±\pm0.2
Gaussian [6] 85.2±\pm0.5 89.9±\pm0.4 88.1±\pm0.3 88.5±\pm0.3 87.1±\pm0.4 88.1±\pm0.5 78.6±\pm6.6 40.7±\pm33.8 89.4±\pm0.2
sign flipping [15] 84.5±\pm0.5 88.5±\pm0.1 87.4±\pm0.3 87.4±\pm0.5 87.4±\pm0.2 86.4±\pm0.6 87.3±\pm0.8 75.7±\pm6.8 87.3±\pm0.5
label flipping [15] 85.5±\pm0.6 89.9±\pm0.6 82.3±\pm0.8 82.0±\pm0.5 87.3±\pm0.2 87.3±\pm0.9 89.2±\pm1.1 89.8±\pm0.7 89.7±\pm0.1
mimic [28] 81.8±\pm0.8 86.3±\pm0.8 86.4±\pm0.3 88.5±\pm0.2 80.0±\pm0.4 88.4±\pm0.3 89.1±\pm0.2 89.2±\pm0.9 89.8±\pm0.4
Collusion (Ours) 84.9±\pm0.7 35.7±\pm5.0 76.7±\pm1.1 69.7±\pm2.9 87.4±\pm0.2 88.6±\pm0.2 86.9±\pm0.4 66.7±\pm22.7 89.6±\pm1.0
Averaged 79.9±\pm2.6 76.4±\pm3.9 80.9±\pm3.4 80.8±\pm1.3 79.1±\pm1.5 86.4±\pm0.9 86.3±\pm2.3 80.5±\pm7.0 89.3±\pm0.5
Worst-case 38.0±\pm21.1 35.7±\pm5.0 69.0±\pm26.6 69.7±\pm2.9 17.2±\pm12.5 79.0±\pm4.2 78.6±\pm6.6 40.7±\pm33.8 87.3±\pm0.5
TABLE XIV: Model performances of different Byzantine-resilient methods under different attacks (with IID setting and 20 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 71.7±\pm1.9 88.6±\pm0.7 80.6±\pm1.2 79.5±\pm3.7 79.4±\pm0.9 86.2±\pm1.5 88.6±\pm0.2 89.3±\pm0.2 89.3±\pm0.5
Lie [4] 74.4±\pm0.8 37.9±\pm17.2 26.9±\pm14.5 30.1±\pm22.4 48.0±\pm10.9 66.5±\pm26.1 61.4±\pm22.2 87.7±\pm1.8 88.1±\pm2.3
Fang-v1 [18] 74.5±\pm1.9 61.3±\pm4.6 19.5±\pm5.4 21.3±\pm5.3 16.7±\pm33.8 82.0±\pm3.0 79.1±\pm2.4 89.1±\pm0.4 88.6±\pm0.6
Fang-v2 [18] 10.0±\pm0.0 10.0±\pm0.0 15.3±\pm7.7 17.6±\pm3.1 12.3±\pm3.0 84.4±\pm0.4 78.0±\pm14.6 15.0±\pm18.6 88.7±\pm0.2
Same value [35] 75.6±\pm1.0 43.2±\pm19.2 44.5±\pm8.2 41.0±\pm3.5 79.6±\pm1.1 86.0±\pm0.4 89.2±\pm0.7 89.4±\pm0.3 89.0±\pm0.5
Gaussian [6] 76.1±\pm2.1 87.9±\pm0.8 75.2±\pm3.7 80.9±\pm1.0 78.2±\pm3.7 85.9±\pm1.4 76.7±\pm11.1 13.3±\pm29.5 88.8±\pm0.1
sign flipping [15] 74.1±\pm1.0 79.3±\pm5.6 71.1±\pm6.1 70.0±\pm5.1 79.6±\pm1.0 80.7±\pm2.2 50.2±\pm3.2 39.4±\pm3.8 82.1±\pm0.5
label flipping [15] 76.5±\pm1.5 87.3±\pm1.0 67.7±\pm7.0 70.2±\pm5.1 80.1±\pm14.9 85.9±\pm1.6 83.7±\pm6.8 80.6±\pm3.2 86.7±\pm1.9
mimic [28] 72.6±\pm5.5 70.3±\pm2.2 73.8±\pm0.7 75.9±\pm7.8 53.6±\pm3.6 86.9±\pm0.7 86.9±\pm0.5 88.9±\pm0.2 88.9±\pm0.1
Collusion (Ours) 76.4±\pm1.5 39.7±\pm9.0 67.8±\pm2.2 67.5±\pm3.4 19.7±\pm6.3 86.1±\pm0.2 82.1±\pm1.1 54.0±\pm3.1 89.2±\pm0.3
Averaged 68.2±\pm1.7 60.6±\pm6.0 54.2±\pm5.7 55.4±\pm6.0 54.7±\pm7.9 83.1±\pm3.8 77.6±\pm6.3 64.7±\pm6.1 87.9±\pm0.7
Worst-case 10.0±\pm0.0 10.0±\pm0.0 15.3±\pm7.7 17.6±\pm3.1 12.3±\pm3.0 66.5±\pm26.1 50.2±\pm3.2 13.3±\pm29.5 82.1±\pm0.5
TABLE XV: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1\beta=0.1 and 30 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 82.6±\pm0.8 89.6±\pm0.7 85.4±\pm1.5 84.4±\pm1.9 83.7±\pm4.3 87.3±\pm0.8 89.5±\pm0.4 89.1±\pm0.6 89.7±\pm0.2
Lie [4] 83.0±\pm1.0 45.4±\pm14.6 39.0±\pm15.9 50.9±\pm10.5 59.5±\pm11.1 52.6±\pm19.5 79.8±\pm2.7 89.6±\pm0.7 89.3±\pm0.3
Fang-v1 [18] 83.3±\pm0.3 76.6±\pm2.2 47.6±\pm10.2 52.1±\pm5.9 81.2±\pm3.1 85.0±\pm1.6 85.5±\pm0.2 89.6±\pm0.3 88.8±\pm0.8
Fang-v2 [18] 10.0±\pm0.0 10.0±\pm0.0 47.4±\pm5.5 32.4±\pm14.8 12.3±\pm3.3 84.6±\pm0.9 80.8±\pm11.8 34.6±\pm24.0 88.8±\pm0.6
Same value [35] 84.2±\pm0.7 49.9±\pm15.9 53.5±\pm8.0 48.3±\pm10.3 85.8±\pm0.9 85.4±\pm1.7 89.4±\pm0.1 89.3±\pm0.4 89.1±\pm0.5
Gaussian [6] 83.4±\pm1.0 89.0±\pm0.4 84.1±\pm1.9 86.3±\pm1.3 85.1±\pm1.2 87.6±\pm0.9 73.0±\pm8.9 17.5±\pm29.4 89.5±\pm0.1
sign flipping [15] 83.0±\pm0.4 85.3±\pm1.7 83.6±\pm0.8 83.5±\pm2.0 83.1±\pm1.6 79.7±\pm2.1 59.6±\pm4.3 29.7±\pm23.1 87.9±\pm1.6
label flipping [15] 83.6±\pm0.3 89.7±\pm0.5 67.4±\pm15.1 75.5±\pm1.6 85.1±\pm1.1 87.4±\pm1.1 88.9±\pm1.2 89.3±\pm0.4 88.5±\pm0.4
mimic [28] 77.0±\pm1.7 78.3±\pm0.7 80.6±\pm1.0 84.1±\pm0.7 63.0±\pm7.4 88.3±\pm0.3 87.9±\pm0.6 88.9±\pm0.6 89.4±\pm0.5
Collusion (Ours) 83.6±\pm0.4 10.4±\pm8.3 65.2±\pm4.8 63.3±\pm7.1 85.3±\pm0.6 88.6±\pm0.5 83.6±\pm0.3 63.0±\pm24.1 89.0±\pm0.5
Averaged 75.4±\pm0.7 62.4±\pm4.5 65.4±\pm6.5 66.1±\pm5.6 72.4±\pm3.5 82.7±\pm2.9 81.8±\pm3.1 68.1±\pm10.4 89.0±\pm0.6
Worst-case 10.0±\pm0.0 10.0±\pm0.0 39.0±\pm15.9 32.4±\pm14.8 12.3±\pm3.3 52.6±\pm19.5 59.6±\pm4.3 17.5±\pm29.4 87.9±\pm1.6
TABLE XVI: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5\beta=0.5 and 30 attackers for classification of Fashion MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 64.2±\pm6.8 92.1±\pm0.1 66.9±\pm2.3 66.1±\pm2.1 55.6±\pm5.9 89.8±\pm0.2 91.7±\pm0.2 91.8±\pm0.3 92.2±\pm0.0
Lie [4] 61.6±\pm7.7 90.5±\pm0.2 75.8±\pm2.8 73.8±\pm3.7 63.1±\pm2.2 89.6±\pm0.1 92.1±\pm0.2 89.8±\pm0.2 92.2±\pm0.1
Fang-v1 [18] 62.7±\pm5.9 89.8±\pm0.2 50.9±\pm4.5 42.9±\pm6.6 57.5±\pm2.0 79.3±\pm1.5 92.0±\pm0.3 68.4±\pm5.3 92.0±\pm0.1
Fang-v2 [18] 58.4±\pm4.2 88.2±\pm0.1 40.9±\pm3.4 45.0±\pm2.7 41.6±\pm5.2 78.5±\pm2.4 91.4±\pm0.4 91.6±\pm0.1 92.3±\pm0.2
Same value [35] 63.0±\pm6.6 90.5±\pm0.2 58.8±\pm2.2 58.7±\pm2.2 55.2±\pm3.2 89.8±\pm0.2 92.1±\pm0.2 92.1±\pm0.3 92.2±\pm0.1
Gaussian [6] 63.6±\pm4.5 91.9±\pm0.1 68.5±\pm3.1 67.1±\pm4.9 54.5±\pm3.8 89.6±\pm0.4 83.3±\pm0.3 43.8±\pm3.4 92.3±\pm0.1
sign flipping [15] 66.2±\pm7.9 91.5±\pm0.1 66.8±\pm3.4 68.5±\pm5.1 53.6±\pm3.0 79.6±\pm0.6 90.3±\pm0.2 64.7±\pm1.3 91.5±\pm0.2
label flipping [15] 64.8±\pm2.1 91.3±\pm0.1 57.3±\pm2.5 58.1±\pm5.4 53.0±\pm7.2 89.9±\pm0.4 90.5±\pm0.2 86.4±\pm0.7 88.8±\pm0.1
mimic [28] 72.5±\pm3.1 91.3±\pm0.1 74.8±\pm0.9 88.0±\pm0.4 65.1±\pm4.2 90.2±\pm0.1 91.4±\pm0.1 91.1±\pm0.2 92.1±\pm0.2
Collusion (Ours) 66.6±\pm4.9 89.7±\pm0.6 69.7±\pm7.3 70.8±\pm6.4 58.5±\pm9.0 89.6±\pm0.4 89.2±\pm1.9 79.1±\pm0.3 92.1±\pm0.2
Averaged 64.4±\pm5.4 90.7±\pm0.2 63.0±\pm3.2 63.9±\pm4.0 55.8±\pm4.6 86.6±\pm0.6 90.4±\pm0.4 79.9±\pm1.2 91.8±\pm0.1
Worst-case 58.4±\pm4.2 88.2±\pm0.1 40.9±\pm3.4 42.9±\pm6.6 41.6±\pm5.2 78.5±\pm2.4 83.3±\pm0.3 43.8±\pm3.4 88.8±\pm0.1
TABLE XVII: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1\beta=0.1 and 10 attackers for classification of MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 89.0±\pm0.3 92.4±\pm0.1 85.3±\pm0.5 84.9±\pm0.1 81.7±\pm1.0 89.0±\pm0.4 92.2±\pm0.3 92.0±\pm0.4 92.4±\pm0.1
Lie [4] 89.3±\pm0.3 91.5±\pm0.2 87.7±\pm0.2 88.1±\pm0.5 85.1±\pm0.7 88.8±\pm0.7 92.3±\pm0.2 91.6±\pm0.0 92.4±\pm0.1
Fang-v1 [18] 89.2±\pm0.3 91.4±\pm0.2 72.5±\pm1.6 72.3±\pm1.3 80.1±\pm0.7 78.2±\pm2.0 92.3±\pm0.2 92.2±\pm0.2 92.3±\pm0.0
Fang-v2 [18] 38.1±\pm5.5 88.2±\pm0.2 70.2±\pm1.6 70.2±\pm2.3 60.5±\pm1.5 80.9±\pm0.6 91.9±\pm0.3 91.9±\pm0.2 92.4±\pm0.1
Same value [35] 89.3±\pm0.4 91.2±\pm0.2 79.3±\pm0.3 78.5±\pm1.1 81.1±\pm0.6 88.5±\pm0.9 92.3±\pm0.3 92.3±\pm0.2 92.4±\pm0.1
Gaussian [6] 89.2±\pm0.4 92.3±\pm0.2 85.7±\pm0.2 85.5±\pm0.3 80.6±\pm0.6 88.5±\pm0.5 83.5±\pm0.7 43.3±\pm5.3 92.3±\pm0.1
sign flipping [15] 89.1±\pm0.2 92.2±\pm0.1 86.0±\pm0.1 85.6±\pm0.5 81.0±\pm0.8 78.1±\pm0.8 92.0±\pm0.1 81.9±\pm8.3 91.9±\pm0.3
label flipping [15] 89.3±\pm0.5 91.8±\pm0.2 82.2±\pm0.2 82.6±\pm1.0 81.3±\pm1.6 88.9±\pm0.3 91.8±\pm0.1 90.4±\pm0.5 91.3±\pm1.3
mimic [28] 89.9±\pm0.3 92.1±\pm0.2 88.6±\pm0.4 91.4±\pm0.1 85.4±\pm0.3 90.6±\pm0.1 92.2±\pm0.1 92.1±\pm0.1 92.4±\pm0.1
Collusion (Ours) 89.2±\pm0.4 90.7±\pm0.7 84.5±\pm2.4 84.1±\pm2.6 82.3±\pm3.7 88.7±\pm0.2 90.1±\pm1.7 78.1±\pm5.3 92.2±\pm0.0
Averaged 84.2±\pm0.9 91.4±\pm0.2 82.2±\pm0.8 82.3±\pm1.0 79.9±\pm1.2 86.0±\pm0.7 91.1±\pm0.4 84.6±\pm2.1 92.2±\pm0.2
Worst-case 38.1±\pm5.5 88.2±\pm0.2 70.2±\pm1.6 70.2±\pm2.3 60.5±\pm1.5 78.1±\pm0.8 83.5±\pm0.7 43.3±\pm5.3 91.3±\pm1.3
TABLE XVIII: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5\beta=0.5 and 10 attackers for classification of MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 90.3±\pm0.1 92.5±\pm0.1 90.3±\pm0.1 90.1±\pm0.2 88.3±\pm0.3 89.6±\pm0.2 92.3±\pm0.3 92.3±\pm0.2 92.4±\pm0.1
Lie [4] 90.5±\pm0.1 91.8±\pm0.2 90.7±\pm0.2 90.9±\pm0.1 89.1±\pm0.1 89.6±\pm0.1 92.3±\pm0.2 91.8±\pm0.3 92.3±\pm0.1
Fang-v1 [18] 90.4±\pm0.1 91.7±\pm0.2 83.6±\pm0.2 82.8±\pm0.4 88.0±\pm0.2 77.1±\pm2.1 92.4±\pm0.2 91.9±\pm0.3 92.3±\pm0.1
Fang-v2 [18] 43.8±\pm2.0 89.5±\pm0.6 83.6±\pm0.5 83.2±\pm0.3 71.2±\pm0.6 81.6±\pm1.4 92.2±\pm0.3 92.1±\pm0.1 92.4±\pm0.0
Same value [35] 90.4±\pm0.2 91.6±\pm0.1 87.9±\pm0.2 87.6±\pm0.1 88.6±\pm0.2 89.5±\pm0.1 92.2±\pm0.1 92.3±\pm0.3 92.4±\pm0.1
Gaussian [6] 90.5±\pm0.2 92.4±\pm0.1 90.4±\pm0.3 90.4±\pm0.1 88.5±\pm0.2 88.7±\pm0.2 83.6±\pm0.5 50.7±\pm0.9 92.4±\pm0.1
sign flipping [15] 90.4±\pm0.1 92.3±\pm0.1 90.4±\pm0.2 90.3±\pm0.1 88.6±\pm0.1 75.6±\pm3.8 92.2±\pm0.2 91.5±\pm0.1 92.2±\pm0.1
label flipping [15] 90.4±\pm0.2 91.8±\pm0.1 89.4±\pm0.3 89.2±\pm0.2 88.6±\pm0.2 89.2±\pm0.5 92.2±\pm0.1 92.2±\pm0.3 92.2±\pm0.1
mimic [28] 90.7±\pm0.2 92.1±\pm0.2 90.9±\pm0.2 91.7±\pm0.1 89.6±\pm0.1 90.4±\pm0.2 92.5±\pm0.1 92.2±\pm0.1 92.4±\pm0.1
Collusion (Ours) 90.5±\pm0.2 91.0±\pm0.4 88.9±\pm0.8 88.6±\pm0.8 88.7±\pm0.6 89.0±\pm0.4 91.7±\pm0.4 76.5±\pm0.7 92.2±\pm0.2
Averaged 85.8±\pm0.3 91.7±\pm0.2 88.6±\pm0.3 88.5±\pm0.2 86.9±\pm0.3 86.0±\pm0.9 91.4±\pm0.2 86.4±\pm0.3 92.3±\pm0.1
Worst-case 43.8±\pm2.0 89.5±\pm0.6 83.6±\pm0.2 82.8±\pm0.4 71.2±\pm0.6 75.6±\pm3.8 83.6±\pm0.5 50.7±\pm0.9 92.2±\pm0.1
TABLE XIX: Model performances of different Byzantine-resilient methods under different attacks (with IID and 10 attackers for classification of MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 63.2±\pm8.0 92.1±\pm0.1 63.3±\pm3.4 66.0±\pm0.7 75.6±\pm2.6 89.9±\pm0.2 91.7±\pm0.3 91.8±\pm0.2 92.3±\pm0.1
Lie [4] 66.6±\pm2.7 86.4±\pm0.2 81.6±\pm0.2 83.5±\pm0.4 79.2±\pm0.5 88.7±\pm0.5 92.0±\pm0.3 92.0±\pm0.2 92.1±\pm0.1
Fang-v1 [18] 68.4±\pm2.5 79.0±\pm0.8 32.3±\pm4.2 30.7±\pm5.5 63.2±\pm6.2 78.8±\pm2.2 92.0±\pm0.3 92.0±\pm0.4 92.2±\pm0.1
Fang-v2 [18] 52.1±\pm3.9 62.5±\pm4.0 26.4±\pm1.9 28.2±\pm3.1 39.5±\pm9.1 81.0±\pm3.3 89.0±\pm1.3 91.5±\pm0.1 92.2±\pm0.1
Same value [35] 66.3±\pm6.7 86.8±\pm0.5 52.7±\pm5.1 47.9±\pm2.2 71.1±\pm4.4 89.8±\pm0.4 91.9±\pm0.3 92.0±\pm0.2 92.1±\pm0.1
Gaussian [6] 68.6±\pm1.9 91.7±\pm0.1 68.0±\pm3.6 70.0±\pm3.1 71.4±\pm3.5 89.4±\pm0.4 78.9±\pm0.2 27.1±\pm1.8 92.1±\pm0.2
sign flipping [15] 70.4±\pm4.6 90.5±\pm0.1 66.5±\pm5.2 66.9±\pm4.0 71.9±\pm3.5 78.6±\pm0.8 82.7±\pm1.7 66.7±\pm9.8 89.6±\pm0.7
label flipping [15] 65.3±\pm1.5 90.0±\pm0.1 51.4±\pm2.3 47.9±\pm7.2 72.2±\pm2.9 89.7±\pm0.5 88.2±\pm0.7 87.1±\pm0.2 85.4±\pm0.4
mimic [28] 68.0±\pm2.6 87.4±\pm1.8 81.4±\pm1.0 82.3±\pm3.1 65.3±\pm7.9 90.3±\pm0.2 90.5±\pm0.2 91.2±\pm0.1 92.0±\pm0.3
Collusion (Ours) 65.9±\pm4.8 84.6±\pm0.3 69.3±\pm5.2 69.2±\pm7.3 76.5±\pm6.3 89.3±\pm0.6 91.3±\pm0.2 41.2±\pm2.0 92.1±\pm0.1
Averaged 65.5±\pm3.9 85.1±\pm0.8 59.3±\pm3.2 59.3±\pm3.7 68.6±\pm4.7 86.6±\pm0.9 88.8±\pm0.6 77.3±\pm1.5 91.2±\pm0.2
Worst-case 52.1±\pm3.9 62.5±\pm4.0 26.4±\pm1.9 28.2±\pm3.1 39.5±\pm9.1 78.6±\pm0.8 78.9±\pm0.2 27.1±\pm1.8 85.4±\pm0.4
TABLE XX: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1\beta=0.1 and 20 attackers for classification of MNIST).
Krum [6] GeoMedian [13] Median [68] Trimmed [68] Bulyan [21] FLtrust [10] DnC [52] Kmeans [54] FedCut (Ours)
No attack 89.2±\pm0.6 92.4±\pm0.1 84.9±\pm0.6 84.6±\pm0.4 86.7±\pm0.2 89.0±\pm0.8 92.2±\pm0.2 92.0±\pm0.3 92.4±\pm0.1
Lie [4] 89.3±\pm0.3 88.3±\pm0.3 86.8±\pm0.2 86.8±\pm0.2 86.7±\pm0.3 89.0±\pm0.3 92.3±\pm0.2 92.3±\pm0.1 92.4±\pm0.1
Fang-v1 [18] 89.3±\pm0.4 80.4±\pm1.6 53.6±\pm4.4 52.8±\pm2.4 82.9±\pm1.3 78.9±\pm1.4 92.3±\pm0.3 92.3±\pm0.2 92.1±\pm0.1
Fang-v2 [18] 47.8±\pm5.7 70.2±\pm2.4 50.1±\pm4.5 47.3±\pm4.2 34.0±\pm2.3 80.3±\pm0.7 90.3±\pm0.9 91.3±\pm0.4 92.3±\pm0.0
Same value [35] 89.3±\pm0.2 88.5±\pm0.1 69.7±\pm1.9 70.8±\pm1.6 85.7±\pm0.5 88.9±\pm0.5 92.3±\pm0.2 92.4±\pm0.1 92.3±\pm0.1
Gaussian [6] 89.4±\pm0.4 92.1±\pm0.2 86.1±\pm0.4 86.2±\pm0.6 86.7±\pm0.2 87.9±\pm1.1 80.3±\pm0.6 31.1±\pm2.2 92.3±\pm0.0
sign flipping [15] 89.2±\pm0.2 91.9±\pm0.2 86.3±\pm0.4 86.4±\pm0.6 86.3±\pm0.5 76.9±\pm2.7 91.3±\pm0.2 76.9±\pm1.5 91.6±\pm0.4
label flipping [15] 89.4±\pm0.2 90.8±\pm0.1 78.1±\pm1.6 78.3±\pm1.5 86.5±\pm0.4 88.9±\pm1.0 90.9±\pm0.1 89.4±\pm0.5 91.3±\pm0.9
mimic [28] 86.6±\pm4.2 89.9±\pm0.4 88.5±\pm1.1 90.4±\pm0.3 75.6±\pm1.2 90.7±\pm0.2 92.2±\pm0.2 92.1±\pm0.1 92.3±\pm0.1
Collusion (Ours) 88.9±\pm0.3 86.3±\pm1.2 80.0±\pm1.4 79.4±\pm3.4 87.3±\pm1.4 88.9±\pm0.6 91.5±\pm0.2 48.0±\pm11.0 92.3±\pm0.1
Averaged 84.8±\pm1.3 87.1±\pm0.7 76.4±\pm1.7 76.3±\pm1.5 79.8±\pm0.8 85.9±\pm0.9 90.6±\pm0.3 79.8±\pm1.6 92.1±\pm0.2
Worst-case 47.8±\pm5.7 70.2±\pm2.4 50.1±\pm4.5 47.3±\pm4.2 34.0±\pm2.3 76.9±\pm2.7 80.3±\pm0.6 31.1±\pm2.2 91.3±\pm0.9
TABLE XXI: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5 and 20 attackers for classification of MNIST).

| Attack | Krum [6] | GeoMedian [13] | Median [68] | Trimmed [68] | Bulyan [21] | FLtrust [10] | DnC [52] | Kmeans [54] | FedCut (Ours) |
| No attack | 90.4±0.2 | 92.4±0.1 | 90.4±0.1 | 90.4±0.2 | 89.9±0.4 | 89.4±0.1 | 92.3±0.3 | 92.3±0.2 | 92.3±0.1 |
| Lie [4] | 90.3±0.3 | 89.6±0.1 | 87.8±0.5 | 87.5±0.1 | 88.6±0.1 | 89.7±0.2 | 92.3±0.2 | 92.3±0.3 | 92.4±0.1 |
| Fang-v1 [18] | 90.4±0.2 | 78.5±1.4 | 67.4±1.9 | 66.8±1.0 | 88.3±0.2 | 74.8±3.0 | 92.2±0.3 | 92.2±0.3 | 92.4±0.0 |
| Fang-v2 [18] | 47.1±2.4 | 72.8±1.0 | 69.0±1.1 | 69.1±0.5 | 31.6±2.8 | 79.3±1.7 | 90.7±0.5 | 91.4±0.3 | 92.3±0.1 |
| Same value [35] | 90.5±0.1 | 89.8±0.4 | 83.4±0.7 | 82.4±1.2 | 89.6±0.1 | 89.6±0.2 | 92.3±0.3 | 92.2±0.2 | 92.4±0.1 |
| Gaussian [6] | 90.5±0.1 | 92.4±0.1 | 90.7±0.1 | 90.4±0.1 | 90.0±0.0 | 87.8±0.9 | 79.4±0.6 | 31.9±4.8 | 92.4±0.1 |
| Sign flipping [15] | 90.5±0.4 | 92.1±0.1 | 90.5±0.2 | 90.4±0.1 | 90.1±0.1 | 73.9±4.7 | 91.8±0.1 | 65.8±5.1 | 92.2±0.1 |
| Label flipping [15] | 90.5±0.3 | 90.6±0.1 | 88.0±0.1 | 87.9±0.2 | 90.1±0.2 | 89.4±0.5 | 92.3±0.3 | 92.0±0.1 | 92.0±0.3 |
| Mimic [28] | 87.3±1.1 | 86.5±0.3 | 90.5±0.1 | 91.1±0.2 | 88.2±0.6 | 90.2±0.5 | 92.3±0.1 | 92.2±0.1 | 92.2±0.1 |
| Collusion (Ours) | 90.4±0.2 | 88.0±2.0 | 84.7±1.7 | 83.5±0.4 | 90.0±0.2 | 89.1±0.7 | 91.5±0.2 | 46.9±7.1 | 92.1±0.1 |
| Averaged | 85.8±0.5 | 87.3±0.6 | 84.2±0.6 | 83.9±0.4 | 83.6±0.5 | 85.3±1.3 | 90.7±0.3 | 78.9±1.9 | 92.3±0.1 |
| Worst-case | 47.1±2.4 | 72.8±1.0 | 67.4±1.9 | 66.8±1.0 | 31.6±2.8 | 73.9±4.7 | 79.4±0.6 | 31.9±4.8 | 92.0±0.3 |
TABLE XXII: Model performances of different Byzantine-resilient methods under different attacks (with IID setting and 20 attackers for classification of MNIST).

| Attack | Krum [6] | GeoMedian [13] | Median [68] | Trimmed [68] | Bulyan [21] | FLtrust [10] | DnC [52] | Kmeans [54] | FedCut (Ours) |
| No attack | 62.6±4.8 | 92.1±0.0 | 66.1±5.3 | 64.1±3.2 | 73.7±1.5 | 90.1±0.5 | 91.5±0.5 | 91.6±0.2 | 92.3±0.1 |
| Lie [4] | 68.8±0.9 | 78.7±0.4 | 79.4±1.3 | 79.1±0.8 | 71.5±3.1 | 86.6±2.0 | 92.0±0.2 | 92.1±0.3 | 92.1±0.1 |
| Fang-v1 [18] | 68.7±3.5 | 49.7±8.3 | 20.8±3.5 | 20.9±2.1 | 54.0±3.3 | 80.2±2.5 | 91.9±0.3 | 92.0±0.3 | 92.0±0.1 |
| Fang-v2 [18] | 30.7±17.1 | 34.2±4.8 | 26.9±6.6 | 25.2±2.6 | 12.2±6.2 | 80.8±2.0 | 82.3±4.6 | 88.4±1.3 | 92.0±0.0 |
| Same value [35] | 66.8±2.9 | 77.7±0.2 | 46.9±5.4 | 47.9±2.5 | 70.1±3.3 | 89.9±0.3 | 92.0±0.2 | 92.1±0.3 | 92.0±0.1 |
| Gaussian [6] | 67.9±0.5 | 91.7±0.0 | 72.1±4.0 | 71.4±2.7 | 70.6±5.1 | 89.0±0.5 | 74.9±1.3 | 20.2±2.2 | 92.1±0.1 |
| Sign flipping [15] | 66.3±3.9 | 88.6±0.5 | 73.6±4.1 | 73.0±4.2 | 72.0±4.9 | 76.9±1.6 | 52.5±1.2 | 37.5±6.4 | 89.5±0.4 |
| Label flipping [15] | 64.1±2.0 | 88.6±0.3 | 47.2±1.4 | 42.7±2.8 | 74.2±3.5 | 89.9±0.6 | 86.7±0.8 | 82.9±3.3 | 89.6±0.4 |
| Mimic [28] | 69.9±2.8 | 80.1±5.1 | 74.9±2.6 | 82.0±2.4 | 47.7±3.5 | 89.8±0.3 | 90.5±0.2 | 90.7±0.5 | 91.7±0.3 |
| Collusion (Ours) | 69.6±6.8 | 79.7±1.5 | 66.3±2.9 | 66.2±3.4 | 75.6±7.7 | 89.5±0.3 | 88.9±1.3 | 39.0±16.9 | 91.9±0.2 |
| Averaged | 63.5±4.5 | 76.1±2.1 | 57.4±3.7 | 57.3±2.7 | 62.2±4.2 | 86.3±1.1 | 84.3±1.1 | 72.7±3.2 | 91.5±0.2 |
| Worst-case | 30.7±17.1 | 34.2±4.8 | 20.8±3.5 | 20.9±2.1 | 12.2±6.2 | 76.9±1.6 | 52.5±1.2 | 20.2±2.2 | 89.5±0.4 |
TABLE XXIII: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1 and 30 attackers for classification of MNIST).

| Attack | Krum [6] | GeoMedian [13] | Median [68] | Trimmed [68] | Bulyan [21] | FLtrust [10] | DnC [52] | Kmeans [54] | FedCut (Ours) |
| No attack | 88.9±0.3 | 92.2±0.2 | 85.2±0.2 | 84.8±0.7 | 86.0±0.2 | 88.8±0.4 | 92.2±0.3 | 92.1±0.4 | 92.4±0.1 |
| Lie [4] | 89.1±0.1 | 82.2±0.6 | 82.3±1.5 | 81.6±1.0 | 83.8±0.8 | 88.9±0.5 | 92.3±0.3 | 92.3±0.2 | 92.2±0.0 |
| Fang-v1 [18] | 89.3±0.2 | 51.0±3.1 | 35.3±7.0 | 34.3±2.9 | 78.1±0.4 | 78.3±3.3 | 92.2±0.2 | 92.3±0.1 | 92.2±0.0 |
| Fang-v2 [18] | 33.5±6.9 | 40.6±3.3 | 40.6±4.4 | 34.9±3.3 | 14.9±3.3 | 81.7±1.6 | 84.4±1.8 | 87.4±0.5 | 92.3±0.1 |
| Same value [35] | 88.9±0.6 | 82.4±0.7 | 55.7±0.3 | 53.7±2.8 | 85.5±1.2 | 88.8±0.8 | 92.3±0.2 | 92.2±0.1 | 92.2±0.1 |
| Gaussian [6] | 89.2±0.3 | 92.3±0.1 | 87.1±0.7 | 87.4±0.4 | 85.7±0.4 | 87.7±0.6 | 74.0±1.7 | 20.3±4.4 | 92.3±0.0 |
| Sign flipping [15] | 89.0±0.6 | 91.2±0.3 | 87.2±0.3 | 87.5±0.5 | 86.0±0.7 | 75.4±0.6 | 83.1±1.5 | 64.7±9.1 | 91.5±0.9 |
| Label flipping [15] | 89.0±0.5 | 89.3±0.1 | 73.3±2.3 | 71.3±1.6 | 86.7±0.3 | 88.8±0.9 | 88.9±0.2 | 88.3±0.7 | 90.7±0.7 |
| Mimic [28] | 73.4±5.4 | 82.9±1.4 | 83.1±3.1 | 87.7±0.3 | 72.5±8.6 | 90.8±0.5 | 91.8±0.3 | 91.9±0.2 | 92.2±0.1 |
| Collusion (Ours) | 89.0±0.5 | 82.2±1.2 | 73.6±1.2 | 72.3±1.2 | 86.7±1.2 | 89.2±0.5 | 89.4±0.5 | 31.2±2.2 | 92.2±0.0 |
| Averaged | 81.9±1.5 | 78.6±1.1 | 70.3±2.1 | 69.6±1.5 | 76.6±1.7 | 85.8±1.0 | 88.1±0.7 | 75.3±1.8 | 92.0±0.2 |
| Worst-case | 33.5±6.9 | 40.6±3.3 | 35.3±7.0 | 34.3±2.9 | 14.9±3.3 | 75.4±0.6 | 74.0±1.7 | 20.3±4.4 | 90.7±0.7 |
TABLE XXIV: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5 and 30 attackers for classification of CIFAR10).

| Attack | Krum [6] | GeoMedian [13] | Median [68] | Trimmed [68] | Bulyan [21] | FLtrust [10] | DnC [52] | Kmeans [54] | FedCut (Ours) |
| No attack | 29.1±7.0 | 53.0±8.0 | 16.7±5.1 | 19.1±9.4 | 17.3±3.4 | 56.3±2.8 | 56.5±12.8 | 54.3±12.9 | 66.7±0.7 |
| Lie [4] | 26.6±6.5 | 11.2±2.5 | 9.8±0.2 | 11.8±2.4 | 10.0±0.2 | 11.2±1.8 | 10.4±0.7 | 47.0±18.2 | 63.8±0.2 |
| Fang-v1 [18] | 25.5±6.7 | 32.5±7.3 | 11.3±2.5 | 16.2±3.8 | 13.4±2.0 | 24.5±19.5 | 36.5±23.4 | 55.5±15.7 | 63.9±1.3 |
| Fang-v2 [18] | 10.0±0.0 | 10.0±0.0 | 9.9±0.3 | 10.0±0.2 | 10.0±0.0 | 10.0±0.0 | 55.9±10.0 | 10.0±0.0 | 64.4±1.5 |
| Same value [35] | 26.8±6.5 | 10.0±0.0 | 13.4±3.7 | 10.0±3.6 | 25.9±11.6 | 49.3±18.6 | 45.7±13.0 | 44.9±21.9 | 62.3±1.4 |
| Gaussian [6] | 20.7±9.9 | 27.6±20.3 | 16.6±2.5 | 27.9±11.9 | 11.7±2.0 | 55.5±2.0 | 26.3±7.9 | 13.3±1.3 | 61.6±0.8 |
| Sign flipping [15] | 27.6±8.3 | 28.1±17.9 | 16.5±3.8 | 25.6±6.9 | 14.6±4.7 | 37.4±23.2 | 14.3±3.4 | 11.2±1.5 | 60.5±1.3 |
| Label flipping [15] | 20.3±10.1 | 41.6±8.5 | 14.9±3.4 | 22.1±6.2 | 12.7±2.9 | 52.3±2.5 | 43.1±10.6 | 24.9±24.8 | 54.8±1.4 |
| Mimic [28] | 22.3±10.7 | 52.6±1.7 | 15.2±2.4 | 21.3±7.2 | 13.3±1.5 | 10.0±0.0 | 50.6±12.0 | 42.6±28.2 | 63.4±0.9 |
| Collusion (Ours) | 28.6±6.7 | 10.0±0.1 | 9.8±0.3 | 9.9±0.4 | 13.2±3.4 | 55.4±3.1 | 29.5±7.8 | 10.1±0.2 | 62.0±1.1 |
| Averaged | 23.8±7.2 | 27.7±6.6 | 13.4±2.4 | 17.4±5.2 | 14.2±3.2 | 36.2±7.4 | 36.9±10.2 | 31.4±12.5 | 62.3±1.1 |
| Worst-case | 10.0±0.0 | 10.0±0.0 | 9.8±0.2 | 9.9±0.4 | 10.0±0.2 | 10.0±0.0 | 10.4±0.7 | 10.0±0.0 | 54.8±1.4 |
TABLE XXV: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.1 and 6 attackers for classification of CIFAR10).

| Attack | Krum [6] | GeoMedian [13] | Median [68] | Trimmed [68] | Bulyan [21] | FLtrust [10] | DnC [52] | Kmeans [54] | FedCut (Ours) |
| No attack | 25.2±16.8 | 55.9±13.0 | 18.5±10.2 | 59.0±13.7 | 17.7±7.4 | 62.9±0.2 | 59.9±15.2 | 56.9±16.3 | 66.7±1.6 |
| Lie [4] | 37.5±9.1 | 10.0±0.0 | 10.0±0.1 | 11.7±2.9 | 10.7±1.2 | 16.7±7.6 | 10.0±0.0 | 31.5±30.8 | 67.3±0.5 |
| Fang-v1 [18] | 35.2±10.5 | 49.9±16.7 | 15.1±3.1 | 23.2±5.0 | 15.8±3.6 | 43.5±29.0 | 56.2±17.1 | 56.4±16.0 | 64.1±0.7 |
| Fang-v2 [18] | 10.0±0.0 | 10.0±0.0 | 13.8±2.4 | 13.1±6.4 | 10.0±0.1 | 10.0±0.0 | 63.3±4.0 | 15.9±4.8 | 65.4±0.6 |
| Same value [35] | 37.8±8.9 | 15.1±8.8 | 10.2±0.3 | 10.7±1.2 | 37.1±16.4 | 59.5±0.7 | 58.3±15.3 | 56.2±17.1 | 62.9±0.9 |
| Gaussian [6] | 37.5±11.5 | 56.1±14.2 | 13.7±3.0 | 47.9±27.7 | 15.2±4.6 | 62.3±1.6 | 26.6±8.4 | 12.6±0.5 | 64.9±1.2 |
| Sign flipping [15] | 36.3±11.1 | 51.1±20.8 | 17.0±2.2 | 41.7±12.1 | 16.2±3.1 | 60.8±0.6 | 17.4±1.4 | 20.6±9.5 | 63.8±0.6 |
| Label flipping [15] | 20.1±18.1 | 47.1±23.7 | 14.5±5.6 | 42.3±9.4 | 12.8±4.5 | 58.5±2.3 | 48.3±13.6 | 50.6±16.3 | 55.7±2.7 |
| Mimic [28] | 40.0±3.4 | 60.7±5.4 | 22.7±2.0 | 62.1±1.8 | 17.7±1.7 | 10.0±0.0 | 63.2±1.6 | 45.5±30.8 | 67.1±1.4 |
| Collusion (Ours) | 39.2±10.8 | 10.0±0.0 | 10.0±0.1 | 10.2±0.5 | 14.1±4.8 | 62.3±0.2 | 14.1±4.0 | 9.9±0.2 | 66.2±0.6 |
| Averaged | 31.9±10.0 | 36.6±10.3 | 14.6±2.9 | 32.2±8.1 | 16.7±4.7 | 44.7±4.2 | 41.7±8.1 | 35.6±14.2 | 64.4±1.1 |
| Worst-case | 10.0±0.0 | 10.0±0.0 | 10.0±0.1 | 10.2±0.5 | 10.0±0.1 | 10.0±0.0 | 10.0±0.0 | 9.9±0.2 | 55.7±2.7 |
TABLE XXVI: Model performances of different Byzantine-resilient methods under different attacks (with Non-IID setting β=0.5 and 6 attackers for classification of CIFAR10).

B.2 FedCut vs. FedCut without Temporal Consistency

In this part, we compare FedCut with NCut, which removes the temporal consistency. Fig. 10 shows the averaged MP of FedCut and NCut: FedCut remains stable, while the averaged MP of NCut is strongly influenced by the Non-IID extent of clients' data (e.g., the MP of NCut drops to 80% with Non-IID parameter β=0.1). The reason is that temporal consistency helps to distinguish benign from Byzantine clients during the training process.

B.3 Robustness for Spectral Heuristics in Heterogeneous Dataset

In this part, we demonstrate that the number of communities and the Gaussian scaling factor can be effectively estimated via the largest eigengap, even when clients hold heterogeneous datasets.

Fig. 11 and 12 display the position of the largest eigengap for Non-IID datasets (β=0.1, 0.5), illustrating that the largest eigengap remains a good estimate of the number of communities. Moreover, Fig. 13 shows that the eigengap and the clustering accuracy reach their optima simultaneously even for Non-IID datasets (β=0.1, 0.5), which justifies selecting the Gaussian scaling factor according to the largest eigengap.
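To make this heuristic concrete, the following minimal numpy sketch (our illustration, not the paper's implementation; the function names and the candidate grid are assumptions) scans candidate Gaussian scaling factors and keeps the one whose normalized-adjacency spectrum shows the most pronounced eigengap, whose position in turn estimates the number of communities:

```python
import numpy as np

def eigengap(updates, sigma):
    """Position and size of the largest eigengap of D^{-1/2} A D^{-1/2},
    where A_ij = exp(-||g_i - g_j||^2 / (2 sigma^2))."""
    sq = ((updates[:, None, :] - updates[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    d = A.sum(1)
    lam = np.sort(np.linalg.eigvalsh(A / np.sqrt(np.outer(d, d))))[::-1]
    gaps = lam[:-1] - lam[1:]
    return int(np.argmax(gaps)) + 1, float(np.max(gaps))  # (community count c, gap size)

def select_sigma(updates, candidates=(0.1, 0.3, 1.0, 3.0, 10.0)):
    """Pick the sigma whose spectrum exhibits the largest eigengap."""
    return max(candidates, key=lambda s: eigengap(updates, s)[1])
```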

Appendix C Proof

We consider a horizontal federated learning [66, 40] setting consisting of one server and $K$ clients. We assume the $K$ clients (in this article we use the terms "client", "node", "participant" and "party" interchangeably) have local datasets $\mathcal{D}_i=\{(\mathbf{x}_{i,j},y_{i,j})\}_{j=1}^{n_i}$, $i=1,\cdots,K$, where $\mathbf{x}_{i,j}$ is the input data, $y_{i,j}$ is the label and $n_i$ is the number of data points of the $i$-th client. Training in federated learning iterates over the following three steps until learning converges:

  • The $i$-th client performs empirical risk minimization:

    $\min_{\mathbf{w}_i}F_i(\mathbf{w}_i,\mathcal{D}_i)=\min_{\mathbf{w}_i}\frac{1}{n_i}\sum_{j=1}^{n_i}\ell(\mathbf{w}_i,\mathbf{x}_{i,j},y_{i,j}),$ (10)

    where $\mathbf{w}_i\in\mathbb{R}^d$ is the $i$-th client's local model weight and $\ell(\cdot)$ is a loss function that measures the accuracy of the prediction made by the model on each data point.

  • Each client sends its local model update $\nabla F_i$ to the server, and the server updates the global model $\mathbf{w}$ as $\mathbf{w}=\mathbf{w}-\eta\frac{1}{K}\sum_{i=1}^{K}\nabla F_i$, where $\eta$ is the learning rate.

  • The server distributes the updated global model $\mathbf{w}$ to all clients.

We assume a malicious threat model in which an unknown number of the $K$ clients are Byzantine, i.e., they may upload arbitrarily corrupted updates $\mathbf{g}_b$ to degrade the global model performance (MP). Under this assumption, the behaviours of the Byzantine clients and of the remaining benign clients can be summarized as follows:

$\mathbf{g}_i=\begin{cases}\nabla F_i & \text{benign clients}\\ \mathbf{g}_b & \text{Byzantine clients}\end{cases}$ (11)

Moreover, we regard the model updates contributed by the $K$ clients as an undirected graph $G=(V,E)$, where $V=\{v_1,\cdots,v_K\}$ represents the $K$ model updates and $E$ is a set of weighted edges representing similarities between the uploaded model updates of the clients in $V$. The graph is weighted, and each edge between two nodes $v_i$ and $v_j$ carries a non-negative weight, e.g., $A_{ij}=\exp(-\|\mathbf{g}_i-\mathbf{g}_j\|^2/2\sigma^2)\geq 0$, where $\mathbf{g}_i$ is the gradient uploaded by the $i$-th client and $\sigma$ is the Gaussian scaling factor. Let $G_R=(V_R,E_R)$ and $G_B=(V_B,E_B)$ denote the two subgraphs of $G$ representing benign and Byzantine clients, respectively.
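To make the construction concrete, below is a minimal numpy sketch of this weighted graph (a toy simulation of ours; the dimension, noise level and offset are illustrative assumptions, not values from the paper): benign updates scatter around a common mean, a small group of colluding updates sits far away, and the Gaussian adjacency matrix $A$ defined above is built from the pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, q = 20, 7, 3                              # toy dimension, benign and Byzantine counts
benign = rng.normal(0.0, 0.1, (m, d))           # benign updates cluster around a common mean
byzantine = 5.0 + rng.normal(0.0, 0.1, (q, d))  # colluders sit far from the benign mean
G = np.vstack([benign, byzantine])              # all uploaded updates g_1, ..., g_K

sigma = 1.0                                     # Gaussian scaling factor
sq_dists = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)
A = np.exp(-sq_dists / (2 * sigma ** 2))        # A_ij = exp(-||g_i - g_j||^2 / (2 sigma^2))
print(np.round(A, 2))                           # near-block structure: benign vs. Byzantine
```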

C.1 Proof of Proposition 1

Lemma C.1.

[58] Let $G$ be an undirected graph with non-negative weights. Then the multiplicity $c$ of the eigenvalue 1 of $L$ equals the number of connected components $B_1,\cdots,B_c$ in the graph.

Remark C.1.

The analysis given in [42, 58] shows that one can estimate $c$ by counting the number of eigenvalues equal to 1, as stated in Lemma C.1.

Lemma C.2 ([57]).

Let $\lambda$, $Y$ and $\delta$ be the eigenvalues, principal eigenvectors and the eigengap of $L$, respectively. Assume a small perturbation of $L$, $\tilde{L}=L+E$, such that $\|E\|$ ($\|\cdot\|$ in this paper denotes the $\ell_2$ norm) is small enough, and let $\tilde{\lambda}$, $\tilde{Y}$ be the eigenvalues and principal eigenvectors of $\tilde{L}$. Then

$\|\lambda-\tilde{\lambda}\|\leq\|E\|$ (12)

$\|Y-\tilde{Y}\|\leq\frac{4\|E\|}{\delta-\sqrt{2}\|E\|}$ (13)
Remark C.2.

From perturbation theory, Lemma C.2 demonstrates the stability of the eigenvectors when a block-diagonal adjacency matrix undergoes a small perturbation.

Assumption C.1.

Assume the difference between each local gradient $\nabla F_i$ and the mean of the benign model updates $\nabla F=\frac{1}{|V_R|}\sum_{i\in V_R}\nabla F_i$ is bounded ($V_R$ is the set of benign clients), i.e., there exists a finite $\kappa$ such that

$\|\nabla F_i-\nabla F\|\leq\kappa.$
Remark C.3.

The bound $\kappa$ in Assumption C.1 has also been used in earlier work to bound the heterogeneity of datasets [37]; in particular, when the data is homogeneous, we have $\kappa=0$ in Assumption C.1.

Proposition C.1.

Suppose the $K$ clients consist of $m$ benign clients and $q$ attackers ($q<m-1$). If Assumption C.1 holds for Non-collusion and Collusion-diff attacks, then only the first $c$ eigenvalues are close to 1, and

  • $c=1+q<\frac{K}{2}$ for Non-collusion attacks, provided that $\|\mathbf{g}_b-\nabla F\|\gg\kappa$ and the malicious updates $\mathbf{g}_b$ are far away from each other;

  • $c=1+B<\frac{K}{2}$ for Collusion-diff attacks, provided that the malicious updates form $B$ groups and $\|\mathbf{g}_b-\nabla F\|\gg\kappa$;

  • $c>\frac{K}{2}$ for Collusion-mimic attacks, in which $\|\mathbf{g}_b-\nabla F\|<\kappa$ and the malicious updates are almost identical.

Proof.

For a Non-collusion attack, $\|\mathbf{g}_b-\nabla F\|\gg\kappa$ and the malicious updates $\mathbf{g}_b$ are far away from each other. In the ideal case, the normalized adjacency matrix $A$ is therefore block-diagonal, with the benign clients forming one block that has no connections to any attacker. Consequently, by Lemma C.1, the largest $c$ eigenvalues of $A$ equal 1, where $c=1+q<\frac{K}{2}$. In the general case, the normalized adjacency matrix is $\tilde{A}=A+E$. Since the malicious updates $\mathbf{g}_b$ are far away from each other and $\|\mathbf{g}_b-\nabla F\|\gg\kappa$, the cross-block entries $\tilde{A}_{ij}=\exp(-\|\mathbf{g}_b-\nabla F_i\|^2/2\sigma^2)$ are small, so $\|E\|$ is small. Consequently, according to Lemma C.2,

$\tilde{\lambda}-\lambda\leq\|E\|$ (14)

As a result, the first $c$ eigenvalues of $\tilde{A}$ are close to 1 due to the small $\|E\|$. Similarly, we have $c=1+B<\frac{K}{2}$ for Collusion-diff attacks; for Collusion-mimic attacks the first $c$ eigenvalues of $\tilde{A}$ are close to 1, with $c\geq m+1>\frac{K}{2}$. ∎

Remark C.4.

Proposition C.1 illustrates that the eigenvalue 1 has multiplicity $c$ and that there is then a gap to the $(c+1)$-th eigenvalue $\lambda_{c+1}<1$. Therefore, we can use the position of this pronounced eigengap to determine the number of community clusters (we use the largest eigengap for this purpose). Moreover, Proposition C.1 demonstrates that the pronounced eigengap sits at different positions for Non-collusion, Collusion-diff and Collusion-mimic attacks.
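As a quick numerical illustration of this remark (a toy construction of ours, not an experiment from the paper), the block-diagonal case can be checked directly: with one fully connected benign block of $m$ clients and $q$ mutually disconnected attackers, the normalized adjacency has eigenvalue 1 with multiplicity exactly $c=1+q$, and the largest eigengap sits at position $c$.

```python
import numpy as np

# Toy block-diagonal adjacency for the Non-collusion case: one benign block of
# m clients (all pairwise similarity 1) and q isolated attackers (self-loop only).
m, q = 7, 3
A = np.zeros((m + q, m + q))
A[:m, :m] = 1.0                                      # benign clients fully connected
A[np.arange(m, m + q), np.arange(m, m + q)] = 1.0    # attackers only self-connected
d = A.sum(axis=1)
L = A / np.sqrt(np.outer(d, d))                      # D^{-1/2} A D^{-1/2}
lam = np.sort(np.linalg.eigvalsh(L))[::-1]
print(np.round(lam, 3))                              # exactly 1 + q eigenvalues equal to 1
gap_pos = int(np.argmax(lam[:-1] - lam[1:])) + 1
print(gap_pos)                                       # largest eigengap sits at c = 1 + q
```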

C.2 Proof of Proposition 2

Lemma C.3.

Minimizing $\sum_{i=1}^{c}W(B_i,\overline{B_i})/\mathrm{vol}(B_i)$ over all $c$-partitions $V=B_1\cup\cdots\cup B_c$ is equivalent to maximizing $\mathrm{tr}(H^TD^{-1/2}AD^{-1/2}H)$ over $\{H\,|\,H\in\mathbb{R}^{K\times c},H^TH=I\}$, where $W(B_i,B_j):=\sum_{v_m\in B_i,v_n\in B_j}A_{mn}$ and $\mathrm{tr}$ denotes the trace of a matrix.

Proof.

Define $\mathbf{f}_i=(f_{i1},\cdots,f_{iK})^T$ with

$f_{ij}=\begin{cases}\sqrt{\frac{1}{\mathrm{vol}(B_i)}}, & v_j\in B_i\\ 0, & v_j\notin B_i\end{cases}$ (15)

Then

$\mathbf{f}_i^T(D-A)\mathbf{f}_i=\frac{1}{2}\sum_{m=1}^{K}\sum_{n=1}^{K}A_{mn}(f_{im}-f_{in})^2
=\frac{1}{2}\Big[\sum_{v_m\in B_i,v_n\notin B_i}A_{mn}\Big(\sqrt{\tfrac{1}{\mathrm{vol}(B_i)}}-0\Big)^2+\sum_{v_n\in B_i,v_m\notin B_i}A_{mn}\Big(0-\sqrt{\tfrac{1}{\mathrm{vol}(B_i)}}\Big)^2\Big]
=W(B_i,\overline{B_i})/\mathrm{vol}(B_i),$ (16)

where $W(B_i,B_j):=\sum_{v_m\in B_i,v_n\in B_j}A_{mn}$. Moreover, we have $\mathbf{f}_i^TD\mathbf{f}_i=1$. Thus,

$\sum_{i=1}^{c}W(B_i,\overline{B_i})/\mathrm{vol}(B_i)=\sum_{i=1}^{c}\mathbf{f}_i^T(D-A)\mathbf{f}_i=\mathrm{tr}(F^T(D-A)F),$ (17)

where $F$ is the matrix whose columns are $\mathbf{f}_1,\cdots,\mathbf{f}_c$. Let $F=D^{-1/2}H$; then $H^TH=F^TDF=I$ and

$\min_{(B_1\cup\cdots\cup B_c)=V}\sum_{i=1}^{c}W(B_i,\overline{B_i})/\mathrm{vol}(B_i)=\min_{F}\mathrm{tr}(F^T(D-A)F)=\min_{H}\mathrm{tr}(H^TD^{-1/2}(D-A)D^{-1/2}H)=c-\max_{H}\mathrm{tr}(H^TD^{-1/2}AD^{-1/2}H),$ (18)

where the last equality uses $\mathrm{tr}(H^TD^{-1/2}DD^{-1/2}H)=\mathrm{tr}(H^TH)=c$. ∎
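The identity proved above is easy to verify numerically; the sketch below (our check, using an arbitrary random symmetric weight matrix and a fixed 3-partition) confirms that the Ncut objective equals $c-\mathrm{tr}(H^TD^{-1/2}AD^{-1/2}H)$ for the indicator-based $H$ constructed in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
K, c = 9, 3
A = rng.random((K, K)); A = (A + A.T) / 2           # symmetric non-negative weights
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])      # a fixed 3-partition
d = A.sum(axis=1)

# Left-hand side: sum of the Ncut terms W(B_i, ~B_i) / vol(B_i)
ncut = sum(A[np.ix_(labels == i, labels != i)].sum() / d[labels == i].sum()
           for i in range(c))

# Right-hand side: c - tr(H^T D^{-1/2} A D^{-1/2} H) with H = D^{1/2} F
F = np.zeros((K, c))
for i in range(c):
    F[labels == i, i] = 1.0 / np.sqrt(d[labels == i].sum())
H = np.diag(np.sqrt(d)) @ F                          # then H^T H = I
L = A / np.sqrt(np.outer(d, d))                      # D^{-1/2} A D^{-1/2}
print(np.isclose(ncut, c - np.trace(H.T @ L @ H)))   # True
```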

In federated learning, clients send their updates to the server in each iteration; we therefore consider a graph over all iterations, defined as follows:

Definition C.1.

(Spatial-Temporal Graph) Define a Spatial-Temporal graph $G=(V,E)$ as a sequence of snapshots $\langle G^1,\cdots,G^T\rangle$, where $G^t=(V,E^t)$ is an undirected graph at iteration $t$. $V$ denotes a fixed set of vertices representing the model updates belonging to the $K$ clients. $E^t$ is a set of weighted edges representing similarities between the model updates of the clients in $V$ at iteration $t$; the adjacency matrix of the graph at iteration $t$ is $A^t$.

The $c$-partition Ncut for a Spatial-Temporal graph is then defined as follows:

Definition C.2.

(c-partition Ncut for Spatial-Temporal Graph) Let $G=(V,E)$ be a Spatial-Temporal graph as in Def. C.1. Denote a $c$-partition of the graph $G$ by $V=B_1\cup\cdots\cup B_c$ with $B_i\cap B_j=\varnothing$ for any $i\neq j$. The $c$-partition Ncut for a Spatial-Temporal graph aims to optimize:

$\min_{(B_1\cup\cdots\cup B_c)=V}\sum_{t=1}^{T}\sum_{i=1}^{c}\frac{W^t(B_i,\overline{B_i})}{\mathrm{vol}^t(B_i)},$ (19)

where $\overline{B_i}$ is the complement of $B_i$, $W^t(B_i,B_j):=\sum_{v_m\in B_i,v_n\in B_j}A^t_{mn}$ with $A^t_{mn}$ the edge weight between $v_m$ and $v_n$ at iteration $t$, and $\mathrm{vol}^t(B_i):=\sum_{v_m\in B_i}\sum_{v_n\in V}A^t_{mn}$.

Proposition C.2.

FedCut (Algo. 2) solves the c-partition Ncut in Eq. (19).

Proof.

Firstly, according to line 2 of Algo. 2,

$\tilde{L}^t=\frac{t-1}{t}\tilde{L}^{t-1}+\frac{1}{t}L^t=\frac{t-1}{t}\Big(\frac{t-2}{t-1}\tilde{L}^{t-2}+\frac{1}{t-1}L^{t-1}\Big)+\frac{1}{t}L^t=\frac{t-2}{t}\tilde{L}^{t-2}+\frac{1}{t}(L^{t-1}+L^t)=\cdots=\frac{1}{t}\sum_{s=1}^{t}L^s,$ (20)

where $L^s={D^s}^{-1/2}A^s{D^s}^{-1/2}$. Furthermore, according to Lemma C.3 and using $\mathrm{tr}(H^TH)=c$ as in Eq. (18),

$\min_{(B_1\cup\cdots\cup B_c)=V}\sum_{s=1}^{t}\sum_{i=1}^{c}\frac{W^s(B_i,\overline{B_i})}{\mathrm{vol}^s(B_i)}=\min_{H}\sum_{s=1}^{t}\mathrm{tr}(H^T{D^s}^{-1/2}(D^s-A^s){D^s}^{-1/2}H)=\min_{H}\Big(ct-\sum_{s=1}^{t}\mathrm{tr}(H^TL^sH)\Big)=ct-\max_{H}\,t\,\mathrm{tr}(H^T\tilde{L}^tH),$ (21)

where $H^TH=I$. By the Rayleigh–Ritz theorem [39], the maximizer of $\mathrm{tr}(H^T\tilde{L}^tH)$ is the $H$ whose columns are the eigenvectors corresponding to the $c$ largest eigenvalues of $\tilde{L}^t$. Thus FedCut solves the $c$-partition Ncut in Eq. (19) for the Spatial-Temporal graph over $t$ iterations. ∎
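For concreteness, here is a minimal numpy sketch of one such round (our paraphrase of line 2 of Algo. 2 followed by the spectral step; the function name and the use of scikit-learn's KMeans as an off-the-shelf clusterer are our assumptions, not the paper's implementation): it maintains the running average $\tilde{L}^t$ and clusters the rows of its top-$c$ eigenvectors.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; any kmeans routine works

def fedcut_round(L_avg, A_t, t, c):
    """One FedCut-style round: update the running average of normalized
    adjacencies, then cluster the rows of its top-c eigenvectors.
    Start with L_avg = zeros and t = 1 on the first call."""
    d = A_t.sum(axis=1)
    L_t = A_t / np.sqrt(np.outer(d, d))          # D^{-1/2} A^t D^{-1/2}
    L_avg = (t - 1) / t * L_avg + L_t / t        # \tilde{L}^t per Eq. (20)
    lam, vecs = np.linalg.eigh(L_avg)            # ascending eigenvalues
    Y = vecs[:, -c:]                             # eigenvectors of the c largest eigenvalues
    labels = KMeans(n_clusters=c, n_init=10).fit_predict(Y)
    return L_avg, labels
```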

C.3 Proof of Theorem 1

Assumption C.2.

For malicious updates $\mathbf{g}_b$ with $\|\mathbf{g}_b-\nabla F\|>\kappa$, the difference between the colluders' updates and the mean of the benign updates is at least $C\kappa$, where $C$ is a large constant, i.e.,

$\|\mathbf{g}_b-\nabla F\|>C\kappa.$
Lemma C.4 ([42]).

[42] Let the off-diagonal blocks $A^{ij}$, $i\neq j$, of the adjacency matrix $A$ be zero (block-diagonal matrix), and assume that each cluster is connected. Then there exist $c$ orthogonal vectors $r_1,\cdots,r_c$ ($r_i^Tr_j=1$ if $i=j$, 0 otherwise) such that the top $c$ eigenvectors $Y\in\mathbb{R}^{K\times c}$ of the normalized adjacency matrix satisfy

$y_j^{(i)}=r_i,$ (22)

for all $i=1,\cdots,c$, $j=1,\cdots,n_i$, where $y_j^{(i)}$ denotes the $j$-th row of the $i$-th cluster of $Y$. In other words, there are $c$ mutually orthogonal points on the surface of the unit $c$-sphere around which the rows of $Y$ cluster. Moreover, these clusters correspond exactly to the true clustering of the original data.

Lemma C.5 (Theorem 4.14 in [43]).

With $Y$ defined as in Lemma C.4, the Kmeans algorithm that clusters $\tilde{Y}=Y+E_1$ into $c$ groups returns an optimal solution of cost at most $\frac{1-\epsilon^2}{1-32\epsilon^2}\epsilon^2$ with probability $1-\tilde{\mathcal{O}}(\sqrt{\epsilon})$, where $\epsilon=\|E_1\|$.

Remark C.5.

1) Lemma C.4 shows that NCut achieves 100% clustering accuracy when clustering the top $c$ eigenvectors of a block-diagonal adjacency matrix.
2) Lemma C.5 shows that Kmeans clustering of the eigenvectors $Y$ still reaches the optimal clustering with high probability when $Y$ is slightly perturbed.

Theorem C.1.

Suppose an $\alpha$ fraction of clients, $0<\alpha<\frac{1}{2}$, are Byzantine attackers. If Assumptions C.1 and C.2 hold, then the estimate $\hat{\mathbf{g}}$ of $\bar{\mathbf{g}}$ computed in line 12 of Algo. 1 satisfies, with probability $1-\tilde{\mathcal{O}}(\sqrt{Z_1})$, $\|\hat{\mathbf{g}}-\bar{\mathbf{g}}\|\leq\tilde{\mathcal{O}}(C\alpha\kappa)$, where $Z_1=\frac{4K\sqrt{Z^{C^2/4}}}{\delta-K\sqrt{2Z^{C^2/4}}}$ and $Z=\exp(-2\kappa^2/\sigma^2)$.

Proof.

Suppose the server receives $(K-q)$ correct gradients $\{\mathbf{g}_1,\cdots,\mathbf{g}_{K-q}\}$ with mean $\bar{\mathbf{g}}$. Then we have

$\|\mathbf{g}_i-\mathbf{g}_j\|\leq\|\mathbf{g}_i-\bar{\mathbf{g}}\|+\|\mathbf{g}_j-\bar{\mathbf{g}}\|\leq 2\kappa,$ (23)

for any $1\leq i,j\leq K-q$. Thus $A_{ij}\geq\exp(-2\kappa^2/\sigma^2)\triangleq Z$ ($0<Z<1$), for any $1\leq i,j\leq K-q$.

Moreover, for updates of clients belonging to different parties (such as $\mathbf{g}_i$, $\mathbf{g}_j$), Assumption C.2 gives

$\|\mathbf{g}_i-\mathbf{g}_j\|\geq C\kappa.$ (24)

Consequently, $A_{ij}<\exp(-C^2\kappa^2/(2\sigma^2))=Z^{C^2/4}$ for any $i$, $j$ in different parties.

On one hand, if $A_{ij}=0$ for any $i$, $j$ in different parties, then the adjacency matrix $A$ is block-diagonal, so we can apply Lemma C.4 to obtain 100% clustering accuracy when solving FedCut ($c$-partition NCut for the Spatial-Temporal graph) at the $t$-th iteration, i.e., the server selects the cluster $\mathcal{I}$ with the largest number of clients. Since $\alpha<0.5$, this largest cluster must include all benign updates, together with any malicious updates satisfying $\|\mathbf{g}_b-\bar{\mathbf{g}}\|<\kappa$. As a result,

$\|\hat{\mathbf{g}}-\bar{\mathbf{g}}\|\leq\frac{q\|\mathbf{g}_b-\bar{\mathbf{g}}\|}{K}\leq\tilde{\mathcal{O}}\Big(\frac{q\kappa}{K}\Big).$ (25)

On the other hand, in the general case where $\tilde{A}_{ij}>0$ for $i$, $j$ in different parties, write $\tilde{A}=A+E$, where $A$ is a block-diagonal matrix and $E$ is a perturbation matrix. For large $C$, $Z^{C^2/4}$ is small enough. Consequently, we have

$\|Y-\tilde{Y}\|\leq\frac{4\|E\|_2}{\delta-\sqrt{2}\|E\|_2}\leq\frac{4K\sqrt{\alpha(1-\alpha)Z^{C^2/4}}}{\delta-K\sqrt{2\alpha(1-\alpha)Z^{C^2/4}}}\triangleq Z_1.$ (26)

The first inequality follows from Lemma C.2 and the second from $\|E\|_2\leq K\sqrt{\alpha(1-\alpha)}\sqrt{Z^{C^2/4}}$. According to Lemma C.5, we then obtain the optimal clustering with probability $1-\tilde{\mathcal{O}}(\sqrt{Z_1})$. Similarly, we have

$\|\hat{\mathbf{g}}-\bar{\mathbf{g}}\|\leq\frac{q\|\mathbf{g}_b-\bar{\mathbf{g}}\|}{K}\leq\tilde{\mathcal{O}}\Big(\frac{q\kappa}{K}\Big)$ (27)

for line 12 in Algo. 1 with probability $1-\tilde{\mathcal{O}}(\sqrt{Z_1})$, where $Z_1=\frac{4K\sqrt{\alpha(1-\alpha)Z^{C^2/4}}}{\delta-K\sqrt{2\alpha(1-\alpha)Z^{C^2/4}}}$ and $Z=\exp(-2\kappa^2/\sigma^2)$. ∎
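The aggregation step analysed here admits a very short sketch (ours; the helper name is hypothetical): once clients are clustered, the server treats the most populous cluster as benign, which is valid when $\alpha<1/2$, and averages only its updates to obtain $\hat{\mathbf{g}}$.

```python
import numpy as np

def aggregate_largest_cluster(updates, labels):
    """Average the updates of the most populous cluster to obtain g_hat,
    treating that cluster as benign (valid when the Byzantine fraction < 1/2)."""
    counts = np.bincount(labels)
    benign = labels == np.argmax(counts)
    return updates[benign].mean(axis=0)

# Usage with the outputs of the clustering sketch above:
#   g_hat = aggregate_largest_cluster(G, labels)
#   w = w - eta * g_hat   # server update, cf. Eq. (28)
```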

C.4 Proof of Theorem 2

Following the definitions of federated learning introduced in [66, 40], we consider a horizontal federated learning setting consisting of one server and $K$ clients with local datasets $\mathcal{D}_i$, $i=1,\cdots,K$. In each step, the $i$-th client minimizes its local risk function $\min_{\mathbf{w}_i}F_i(\mathbf{w}_i)=\min_{\mathbf{w}_i}\frac{1}{|\mathcal{D}_i|}\sum_{j=1}^{|\mathcal{D}_i|}F_i(\mathbf{w},\mathbf{x}_{i,j})$, where $\mathbf{x}_{i,j}\in\mathcal{D}_i$, $j\in[|\mathcal{D}_i|]$. The clients then send their updates $\mathbf{g}_i=\nabla F_i(\mathbf{w}_i)$ to the server, the server aggregates the gradients to update the global model $\mathbf{w}$, and finally the server distributes $\mathbf{w}$ to all clients (see Algo. 1 for details). Specifically, at iteration $t+1$, the server updates the global model $\mathbf{w}^{t+1}$ as:

$\mathbf{w}^{t+1}=\mathbf{w}^t-\eta\hat{\mathbf{g}}^t$ (28)
Assumption C.3.

The stochastic gradients sampled from any local dataset have uniformly bounded variance over $\mathcal{D}_i$ for all clients, i.e., there exists a finite $\sigma_0$ such that for all $\mathbf{x}_{i,j}\in\mathcal{D}_i$, $i\in[K]$, we have

$\mathbb{E}_j\|\nabla F_i(\mathbf{w},\mathbf{x}_{i,j})-\nabla F_i(\mathbf{w})\|^2\leq\sigma_0^2.$ (29)
Remark C.6.

The difference between Assumptions C.1 and C.3 is that the latter bounds the variance across gradient estimates within the same client, while the former bounds the deviation between model updates across clients.

Assumption C.4.

We assume that $F$ is $L$-smooth and $\mu$-strongly convex.

Lemma C.6 ([7]).

If $F$ is $L$-smooth, then for the optimal solution $\mathbf{w}^*$ of $F$ and all $\mathbf{w}$:

$\frac{1}{2L}\|\nabla F(\mathbf{w})\|^2\leq F(\mathbf{w})-F(\mathbf{w}^*)\leq\frac{L}{2}\|\mathbf{w}-\mathbf{w}^*\|^2$ (30)
Theorem C.2.

Suppose an $\alpha$ fraction of clients, $0<\alpha<\frac{1}{2}$, are corrupted. For a global objective function $F:\mathbb{R}^d\to\mathbb{R}$, the server produces a sequence of iterates $\{\mathbf{w}^t:t\in[0:T]\}$ (see Algo. 1) when run with a fixed step size $\eta<\min\{\frac{1}{4L},\frac{1}{\mu}\}$. If Assumptions C.1, C.2, C.3 and C.4 hold, the sequence of iterates satisfies the following convergence guarantee (this statement contains some modifications compared with Theorem 2 in the main text; please refer to the statement here instead):

$\|\mathbf{w}^T-\mathbf{w}^*\|^2\leq\Big(1-\frac{C_1\mu}{L}\Big)^T\|\mathbf{w}^0-\mathbf{w}^*\|^2+\frac{\Gamma}{\mu^2},$ (31)

where $\Gamma=\tilde{\mathcal{O}}(\sigma_0^2+\kappa^2+C^2\alpha^2\kappa^2)$, $b$ is the batch size and $\mathbf{w}^*$ is the globally optimal weight in federated learning.

Proof.

Note that $\mathbf{w}^t=\frac{1}{K}\sum_{i=1}^{K}\mathbf{w}_i^t$, so we have:

$\mathbf{w}^{t+1}=\mathbf{w}^t-\eta\hat{\mathbf{g}}^t=\underbrace{[\mathbf{w}^t-\eta\nabla F(\mathbf{w}^t)]}_{=:U_1}+\eta\underbrace{\Big[\frac{1}{K}\sum_{i=1}^{K}\big(\nabla F(\mathbf{w}^t)-\nabla F_i(\mathbf{w}_i^t)\big)\Big]}_{=:U_2}+\eta\underbrace{\Big[\frac{1}{K}\sum_{i=1}^{K}\nabla F_i(\mathbf{w}_i^t)-\hat{\mathbf{g}}^t\Big]}_{=:U_3}$ (32)

Therefore,

$\mathbf{w}^{t+1}-\mathbf{w}^*=[U_1-\mathbf{w}^*]+\eta U_2+\eta U_3$ (33)

Taking the norm on both sides and squaring:

$\|\mathbf{w}^{t+1}-\mathbf{w}^*\|^2=\|U_1-\mathbf{w}^*\|^2+\eta^2\|U_2+U_3\|^2+2\eta\langle U_1-\mathbf{w}^*,U_2+U_3\rangle$ (34)

Using $2\langle\mathbf{a},\mathbf{b}\rangle\leq\|\mathbf{a}\|^2+\|\mathbf{b}\|^2$, we have

$2\eta\langle U_1-\mathbf{w}^*,U_2+U_3\rangle=2\Big\langle\sqrt{\tfrac{\eta\mu}{2}}(U_1-\mathbf{w}^*),\sqrt{\tfrac{2\eta}{\mu}}(U_2+U_3)\Big\rangle\leq\frac{\eta\mu}{2}\|U_1-\mathbf{w}^*\|^2+\frac{2\eta}{\mu}\|U_2+U_3\|^2$ (35)

Combining Eq. (34) and (35), we have:

$\|\mathbf{w}^{t+1}-\mathbf{w}^*\|^2\leq(1+\tfrac{\eta\mu}{2})\|U_1-\mathbf{w}^*\|^2+(\eta^2+\tfrac{2\eta}{\mu})\|U_2+U_3\|^2\leq(1+\tfrac{\eta\mu}{2})\|U_1-\mathbf{w}^*\|^2+(2\eta^2+\tfrac{4\eta}{\mu})\|U_2\|^2+(2\eta^2+\tfrac{4\eta}{\mu})\|U_3\|^2$ (36)

Substituting the values of $U_1$, $U_2$ and $U_3$ into Eq. (36) and taking expectation w.r.t. the stochastic sampling of gradients by the clients, we get:

$\|\mathbf{w}^{t+1}-\mathbf{w}^*\|^2\leq(1+\tfrac{\eta\mu}{2})\|\mathbf{w}^t-\eta\nabla F(\mathbf{w}^t)-\mathbf{w}^*\|^2+(2\eta^2+\tfrac{4\eta}{\mu})\Big\|\frac{1}{K}\sum_{i=1}^{K}\big(\nabla F(\mathbf{w}^t)-\nabla F_i(\mathbf{w}_i^t)\big)\Big\|^2+(2\eta^2+\tfrac{4\eta}{\mu})\Big\|\frac{1}{K}\sum_{i=1}^{K}\nabla F_i(\mathbf{w}_i^t)-\hat{\mathbf{g}}^t\Big\|^2$ (37)

Now we bound each of the three terms on the RHS of Eq. (37) separately.
Firstly, since $F$ is $\mu$-strongly convex, $\langle\mathbf{w}^*-\mathbf{w}^t,\nabla F(\mathbf{w}^t)\rangle\leq F(\mathbf{w}^*)-F(\mathbf{w}^t)-\frac{\mu}{2}\|\mathbf{w}^*-\mathbf{w}^t\|^2$. Also, according to Lemma C.6, we have:

$\|\mathbf{w}^t-\eta\nabla F(\mathbf{w}^t)-\mathbf{w}^*\|^2=\|\mathbf{w}^t-\mathbf{w}^*\|^2+\eta^2\|\nabla F(\mathbf{w}^t)\|^2+2\eta\langle\mathbf{w}^*-\mathbf{w}^t,\nabla F(\mathbf{w}^t)\rangle\leq(1-\mu\eta)\|\mathbf{w}^t-\mathbf{w}^*\|^2+(\eta^2-\tfrac{\eta}{L})\|\nabla F(\mathbf{w}^t)\|^2\leq(1-\mu\eta)\|\mathbf{w}^t-\mathbf{w}^*\|^2,$ (38)

where the last inequality holds because $\eta<\frac{1}{4L}$. Secondly, by Jensen's inequality,

$\Big\|\frac{1}{K}\sum_{i=1}^{K}\big(\nabla F(\mathbf{w}^t)-\nabla F_i(\mathbf{w}_i^t)\big)\Big\|^2\leq\frac{1}{K}\sum_{i=1}^{K}\|\nabla F(\mathbf{w}^t)-\nabla F_i(\mathbf{w}_i^t)\|^2\leq\frac{2}{K}\sum_{i=1}^{K}\big[\|\nabla F(\mathbf{w}^t)-\nabla F(\mathbf{w}_i^t)\|^2+\|\nabla F(\mathbf{w}_i^t)-\nabla F_i(\mathbf{w}_i^t)\|^2\big]\leq\frac{2L^2}{K}\sum_{i=1}^{K}\|\mathbf{w}^t-\mathbf{w}_i^t\|^2+2\kappa^2$ (39)

Note that

$\|\mathbf{w}^t-\mathbf{w}_i^t\|^2=\Big\|\frac{1}{K}\sum_{k=1}^{K}(\mathbf{w}_k^t-\mathbf{w}_i^t)\Big\|^2\leq\frac{1}{K}\sum_{k=1}^{K}\|\mathbf{w}_k^t-\mathbf{w}_i^t\|^2$ (40)

Moreover, for a batch of data $\mathcal{S}$, we have

$\|\mathbf{w}_k^t-\mathbf{w}_i^t\|^2=\frac{\eta^2}{|\mathcal{S}|^2}\Big\|\sum_{j}\big(\nabla F_i(\mathbf{w}_i^t,x_{ij})-\nabla F_k(\mathbf{w}_k^t,x_{kj})\big)\Big\|^2\leq\frac{\eta^2}{|\mathcal{S}|}\sum_{j}\|\nabla F_i(\mathbf{w}_i^t,x_{ij})-\nabla F_k(\mathbf{w}_k^t,x_{kj})\|^2\leq\frac{5\eta^2}{|\mathcal{S}|}\sum_{j}\big[\|\nabla F_i(\mathbf{w}_i^t,x_{ij})-\nabla F_i(\mathbf{w}_i^t)\|^2+\|\nabla F_k(\mathbf{w}_k^t)-\nabla F_k(\mathbf{w}_k^t,x_{kj})\|^2+\|\nabla F_i(\mathbf{w}_i^t)-\nabla F(\mathbf{w}_i^t)\|^2+\|\nabla F(\mathbf{w}_i^t)-\nabla F(\mathbf{w}_k^t)\|^2+\|\nabla F(\mathbf{w}_k^t)-\nabla F_k(\mathbf{w}_k^t)\|^2\big]\leq 5\eta^2\big[\sigma_0^2+\sigma_0^2+4\kappa^2+L^2\|\mathbf{w}_i^t-\mathbf{w}_k^t\|^2+4\kappa^2\big]$ (41)

The first two inequalities are due to the Cauchy–Schwarz inequality (the factor 5 accounting for the five terms); the third follows from Assumptions C.1, C.3 and C.4. Moving the $L^2\|\mathbf{w}_i^t-\mathbf{w}_k^t\|^2$ term to the LHS of Eq. (41), we derive:

$(1-5\eta^2L^2)\|\mathbf{w}_i^t-\mathbf{w}_k^t\|^2\leq 5\eta^2(2\sigma_0^2+8\kappa^2)$ (42)

Thus

$\|\mathbf{w}_i^t-\mathbf{w}_k^t\|^2\leq\frac{5\eta^2(2\sigma_0^2+8\kappa^2)}{1-5\eta^2L^2}$ (43)

Combining Eq. (39), (40) and (43), we obtain

$\Big\|\frac{1}{K}\sum_{i=1}^{K}\big(\nabla F(\mathbf{w}^t)-\nabla F_i(\mathbf{w}_i^t)\big)\Big\|^2\leq 2L^2\cdot\frac{5\eta^2(2\sigma_0^2+8\kappa^2)}{1-5\eta^2L^2}+2\kappa^2=\frac{10\eta^2L^2(2\sigma_0^2+8\kappa^2)}{1-5\eta^2L^2}+2\kappa^2$ (44)

Thirdly, according to Theorem C.1, we obtain

$\Big\|\frac{1}{K}\sum_{i=1}^{K}\nabla F_i(\mathbf{w}_i^t)-\hat{\mathbf{g}}^t\Big\|^2\leq\tilde{\mathcal{O}}(C^2\alpha^2\kappa^2)$ (45)

Finally, note that $1-5\eta^2L^2>\frac{1}{2}$ and $(1+\frac{\eta\mu}{2})(1-\mu\eta)<1-\frac{\mu\eta}{2}$ when $\eta<\frac{1}{4L}$, and that $2\eta^2+\frac{4\eta}{\mu}<\frac{6\eta}{\mu}$ when $\eta<\frac{1}{\mu}$. We can then bound Eq. (37) as:

$\|\mathbf{w}^{t+1}-\mathbf{w}^*\|^2\leq(1+\tfrac{\eta\mu}{2})(1-\mu\eta)\|\mathbf{w}^t-\mathbf{w}^*\|^2+(2\eta^2+\tfrac{4\eta}{\mu})\Big(\frac{10\eta^2L^2(2\sigma_0^2+8\kappa^2)}{1-5\eta^2L^2}+2\kappa^2\Big)+(2\eta^2+\tfrac{4\eta}{\mu})\tilde{\mathcal{O}}(C^2\alpha^2\kappa^2)\leq(1-\tfrac{\mu\eta}{2})\|\mathbf{w}^t-\mathbf{w}^*\|^2+\frac{240\eta}{\mu}(\sigma_0^2+4\kappa^2)+\frac{12\kappa^2\eta}{\mu}+\frac{6\eta}{\mu}\tilde{\mathcal{O}}(C^2\alpha^2\kappa^2)$ (46)

Therefore, applying Eq. (46) inductively for $t=0,\cdots,T-1$, we obtain

$\|\mathbf{w}^T-\mathbf{w}^*\|^2\leq(1-\tfrac{\eta\mu}{2})^T\|\mathbf{w}^0-\mathbf{w}^*\|^2+\Big[\frac{240\eta}{\mu}(\sigma_0^2+4\kappa^2)+\frac{12\kappa^2\eta}{\mu}+\frac{6\eta}{\mu}\tilde{\mathcal{O}}(C^2\alpha^2\kappa^2)\Big]\Big/\frac{\mu\eta}{2}=(1-\tfrac{\eta\mu}{2})^T\|\mathbf{w}^0-\mathbf{w}^*\|^2+\tilde{\mathcal{O}}(\sigma_0^2+\kappa^2+C^2\alpha^2\kappa^2)/\mu^2\leq\Big(1-\frac{C_1\mu}{L}\Big)^T\|\mathbf{w}^0-\mathbf{w}^*\|^2+\frac{\Gamma}{\mu^2},$ (47)

where $\Gamma=\tilde{\mathcal{O}}(\sigma_0^2+\kappa^2+C^2\alpha^2\kappa^2)$ and $C_1$ is a constant. ∎
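As a closing sanity check (a toy computation of ours with arbitrary constants, not results from the paper), the recursion in Eq. (46) can be iterated numerically to see the geometric decay toward the $\Gamma/\mu^2$-style floor of Eq. (31):

```python
import numpy as np

# Toy constants standing in for mu, L, kappa; the noise floor G stands in for the
# sigma_0^2 / kappa^2 terms of Eq. (46). All values are illustrative assumptions.
mu, L_smooth, kappa = 0.5, 2.0, 0.1
eta = min(1 / (4 * L_smooth), 1 / mu) * 0.9    # step size within the theorem's range
rho = 1 - mu * eta / 2                          # contraction factor per Eq. (46)
G = (240 * eta / mu) * (4 * kappa ** 2) + 12 * kappa ** 2 * eta / mu
err = 1.0                                       # ||w^0 - w*||^2
for t in range(200):
    err = rho * err + G                         # one application of Eq. (46)
print(err, G / (1 - rho))                       # iterated error vs. predicted floor
```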