
FedMCSA: Personalized Federated Learning via Model Components Self-Attention

Qi Guo, Yong Qi, Saiyu Qi, Di Wu, Qian Li
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi Province, China
Abstract

Federated learning (FL) facilitates multiple clients to jointly train a machine learning model without sharing their private data. However, the Non-IID data of clients presents a tough challenge for FL. Existing personalized FL approaches rely heavily on treating one complete model as the basic unit by default and ignore the significance of different layers on the Non-IID data of clients. In this work, we propose a new framework, federated model components self-attention (FedMCSA), to handle Non-IID data in FL, which employs a model components self-attention mechanism to granularly promote cooperation between different clients. This mechanism facilitates collaboration between similar model components while reducing interference between model components with large differences. We conduct extensive experiments to demonstrate that FedMCSA outperforms previous methods on four benchmark datasets. Furthermore, we empirically show the effectiveness of the model components self-attention mechanism, which is complementary to existing personalized FL and can significantly improve the performance of FL.

keywords:
Personalized Federated Learning, Non-IID, Model Components, Self-Attention

1 Introduction

Federated learning (FL) has attracted widespread attention as a paradigm of distributed learning with privacy protection [1, 2, 3]. The standard FL follows three steps: (i) at each iteration, the server distributes the global model to clients; (ii) each client trains a local model on its private data starting from the global model; (iii) the server aggregates the local models updated by clients to obtain a new global model; these steps are repeated until convergence [1, 4]. FL can ensure effective collaboration between different clients when the data distributions are independent and identically distributed (IID), i.e., the private data distributions of clients are similar to each other. However, in many application scenarios, the private data of clients may differ in size and class distribution, that is, the data distributions are not independent and identically distributed (Non-IID). In this case, FL may not achieve effective collaboration across clients due to the differences in their individual private data [5].

Various algorithms have been proposed to handle the Non-IID data in FL, which can be divided into two categories: average aggregation methods and model-based aggregation methods. As shown in Figure 1(a), average aggregation methods average all local models to generate a global model and distribute it to all clients, where an additional fine-tuning step is performed to train the personalized model in the clients [6, 7, 8, 9]. However, a single global model can hardly fit different clients with Non-IID data. As a result, as illustrated in Figure 1(b), model-based aggregation methods weight different local models to generate a personalized global model for each client, treating the entire model as a basic unit when calculating the weighting coefficient of each local model [10, 11, 12]. Nevertheless, these methods ignore the significance of different layers within the model and suffer from the curse of dimensionality when computing the similarity of high-dimensional models [13].

Figure 1: The illustrations of different FL methods: (a) Average, (b) Model-based, (c) Component-based (ours).

We argue that each layer’s significance should be considered when handling the Non-IID data of clients. It is inappropriate for model-based aggregation methods on the server to disregard the significance of layers and treat all layers equally. As illustrated in Figure 1(c), we regard each layer in a model as a basic unit, i.e., a model component, and present a component-based aggregation method to granularly facilitate collaboration between different clients.

In this work, we propose a novel framework, federated model components self-attention (FedMCSA), to handle Non-IID data in FL. The core of FedMCSA is a model components self-attention mechanism, which utilizes the component-based aggregation method to adaptively update model components on the server. This mechanism facilitates collaboration between similar model components while reducing interference between model components with large differences. Specifically, FedMCSA first decomposes the local models from clients to obtain model components on the server, then lets the model components perform parallel self-attention operations, and finally, generates complete personalized models to send them to the clients. In this way, FedMCSA achieves a complete personalized FL and promotes purposeful and efficient collaboration among clients. The experimental results not only show that FedMCSA outperforms FedAvg [1], Fedprox [14], Per-FedAvg [8], pFedMe [9], and HeurFedAMP [11] in different settings, but also empirically demonstrate the effectiveness of the model components self-attention mechanism.

Our contributions and novelty can be summarized as follows:

  • 1.

    We propose a novel framework, federated model components self-attention (FedMCSA), to handle Non-IID data in FL, which can achieve a complete personalized FL to adaptively update models.

  • 2.

    We devise a new model components self-attention mechanism to granularly address Non-IID data from the perspective of the internal layers of the model, which can be seamlessly integrated into FL.

  • 3.

    Extensive experiments on four datasets are conducted to compare the proposed FedMCSA with state-of-the-art methods as well as its ablation variants. The results suggest that FedMCSA achieves a significant improvement in performance for personalized FL.

2 Related Works

FL The first FL algorithm is FedAvg [1], an iterative algorithm with a client-server architecture. Current techniques aim to train a global model through cooperation between clients without leaking their private data to other clients, which can achieve better performance than each client working alone. Various challenges in FL have been investigated and addressed, including privacy protection [15, 16, 17, 18] and communication complexity [19, 20, 21]. A main challenge of FL is statistical diversity, meaning that the data distributions among clients are Non-IID, which affects its performance and convergence rate [8, 22, 23, 24, 25].

Personalized FL Diverse methods have been proposed to address the problem of personalized FL. Fedprox [14] adds a proximal term to the objective to address the challenges of heterogeneity. The goal of Per-FedAvg [8] is to obtain a global model as an initialization, after which one more step of gradient update is performed in each client to obtain the personalized model. pFedMe [9] uses the Moreau envelope as the clients’ regularized loss function to decouple personalized model optimization from global model learning, formulating a bi-level optimization problem in the client for personalized FL. Training a mixture of local and global models has also been considered as a personalized solution for each client [26, 27]. Under the assumption that the private model of a client is provided to other clients in addition to the server, the optimal weight combination based on the mutual benefits between the models is calculated for each client to achieve personalization [10], which works better than computing a single model average with constant weights for the entire federation as in traditional FL. For the cross-silo scenario, FedAMP and HeurFedAMP [11] design a federated attentive message passing method to conduct personalized FL while preserving privacy. However, FedAMP and HeurFedAMP both need knowledge of the clients’ data distributions to set constant weight hyperparameters, which is difficult to obtain in actual applications. Meta-learning can also be used for personalization [8, 28]. The objective of the work [8] is to investigate a personalized variant of FL whose purpose is to develop an initial shared model that existing or prospective clients can easily adapt to their local datasets by performing one or several steps of gradient descent on their own data. MOCHA [29] is a federated multi-task learning framework that addresses data heterogeneity and communication efficiency. For more details about FL and personalized FL, we recommend the comprehensive surveys [3, 30], which not only investigate the problem of heterogeneity for personalized FL, but also discuss the unique characteristics and challenges of FL, provide a broad overview of current approaches, and present an extensive collection of future work.

Self-Attention The attention mechanism is widely used in many fields, which fully demonstrates the effectiveness of attention [31, 32]. The self-attention mechanism is a variant of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features [33, 34].

Different from existing works, we deeply explore the potential of the interaction between personalized models from the perspective of the internal components of the model, and achieve adaptive updates of different personalized models through our specially designed FedMCSA.

3 Problem Definition of Personalized Federated Learning

The conventional formulation of FL aims to find a global model \Theta by minimizing the overall population loss

\min_{\Theta\in\mathbb{R}^{d}}\left\{f(\Theta):=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\Theta)\right\}, (1)

where the function f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}, i=1,\ldots,N, denotes the expected loss of client i, which depends only on its own data distribution \xi_{i}.

Instead of solving the conventional FL problem (1), we aim to adaptively solve personalized FL with Non-IID private data from the perspective of internal model components. Motivated by [10, 11, 12], we allow each client to own a personalized model on the server, which does not depend on a single global model. Consider N clients C_{1}, …, C_{N} corresponding to N datasets D_{1}, …, D_{N}, with their own personalized models \Theta_{1}, …, \Theta_{N} under the same model structure. For a neural network typically consisting of L layers, the model parameter of the l-th layer of \Theta_{i} is denoted as \theta_{i,l}. The purpose of the personalized FL problem is to use the private training data and the component-based aggregation method to train the personalized model \Theta_{i} so that it approaches the best performance achievable on the distribution \xi_{i}. We want to allow similar model components to collaborate more and to reduce interference between model components with large differences. Accordingly, each personalized model can learn as much useful knowledge as possible from other models to improve the performance of FL. Therefore, the personalized FL problem is formulated as

\min_{U=[\Theta_{1},\ldots,\Theta_{N}]}\left\{F(U):=\underbrace{\frac{1}{N}\sum_{i=1}^{N}f_{i}(\Theta_{i})}_{:=f(U)}+\gamma\underbrace{\frac{1}{2N}\sum_{l=1}^{L}\sum_{i\neq j}^{N}d\left(\left\|\theta_{i,l}-\theta_{j,l}\right\|^{2}\right)}_{:=\phi(U)}\right\}, (2)

where U=[\Theta_{1},\ldots,\Theta_{N}] is the set of personalized models, f_{i} is the expected loss of client i corresponding to the data distribution \xi_{i}, \gamma is a regularization parameter, and d\left(\left\|\theta_{i,l}-\theta_{j,l}\right\|^{2}\right) is the difference measurement function between the model components \theta_{i,l} and \theta_{j,l}. Following the work [11], we naturally assume that d:[0,\infty)\to\mathbb{R} is a non-linear function that is increasing and concave on [0,\infty), continuously differentiable on (0,\infty), and satisfies d(0)=0 with \lim_{t\rightarrow 0^{+}}d^{\prime}(t) finite.
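To make the layer-wise regularizer \phi(U) concrete, a minimal NumPy sketch that evaluates it is given below. The exponential choice d(t)=1-e^{-t/\sigma_d} (which satisfies the properties above) and the toy parameters are illustrative assumptions, not prescriptions of this paper.

```python
import numpy as np

def d(t, sigma_d=1.0):
    # An illustrative increasing, concave difference function with d(0) = 0
    # and a finite right derivative at 0, as required above.
    return 1.0 - np.exp(-t / sigma_d)

def phi(models):
    # models: list of N personalized models, each a list of L per-layer arrays.
    N, L = len(models), len(models[0])
    total = 0.0
    for l in range(L):                       # sum over layers (components)
        for i in range(N):
            for j in range(N):
                if i != j:                   # pairwise differences, i != j
                    diff = models[i][l] - models[j][l]
                    total += d(np.sum(diff ** 2))
    return total / (2.0 * N)                 # factor 1 / (2N) as in Eq. (2)

# Toy usage: three clients, two layers.
rng = np.random.default_rng(0)
U = [[rng.normal(size=(4,)), rng.normal(size=(3,))] for _ in range(3)]
print(phi(U))
```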

As will be illustrated in the next section, our novel use of the model components self-attention mechanism facilitates adaptive collaboration between clients by promoting similar model components to collaborate more with each other. Furthermore, the model components self-attention mechanism is agnostic to the clients’ private data distributions, which allows each client to perform on arbitrary target distributions according to its requirements. The model components self-attention mechanism not only boosts the performance of personalized FL dramatically but can also be easily integrated to further improve the performance of existing personalized FL with an average aggregation method.

4 Methodology

In this section, to tackle the optimization problem in (2), we first give a general method without considering privacy preservation. Then, we implement the process of the general method by proposing a new personalized FL framework, federated model components self-attention (FedMCSA), which adaptively conducts personalized FL from the perspective of the internal components of the model using parallel self-attention while protecting clients’ private data.

4.1 The General Method

Considering that f(U)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\Theta_{i}) and \phi(U)=\frac{1}{2N}\sum_{l=1}^{L}\sum_{i\neq j}^{N}d\left(\left\|\theta_{i,l}-\theta_{j,l}\right\|^{2}\right), we can rewrite the optimization problem in (2) as

\min_{U}\left\{F(U):=f(U)+\gamma\phi(U)\right\}. (3)

A natural way to tackle the optimization problem in (3) is to adopt the framework of incremental-type optimization [35, 11], which iteratively optimizes F(U) by alternately optimizing f(U) and \phi(U) until convergence. Specifically, the general method first optimizes \phi(U) by applying a gradient descent step, and then optimizes f(U) by applying a proximal point step. In the t-th iteration, the update is formulated as

V^{t}=U^{t-1}-\tau_{t}\nabla\phi(U^{t-1}), (4)
U^{t}=\arg\min_{U}\left\{f(U)+\frac{\gamma}{2\tau_{t}}\left\|U-V^{t}\right\|^{2}\right\}, (5)

where V^{t} denotes the prox-center, and \tau_{t}>0 is the step size of the gradient descent. The general method has been proven in prior work [11] to converge to the optimal solution when F(U) is convex and to a stationary point when F(U) is nonconvex.
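As a schematic sketch of the incremental scheme in (4)-(5) (not the deployed federated procedure, which is given in Section 4.2), the alternating updates can be written as the loop below; grad_phi and prox_f stand in for the gradient of \phi and the proximal solver of f and are assumed to be supplied by the caller.

```python
def incremental_optimize(U0, grad_phi, prox_f, gamma, step_sizes):
    """Alternate a gradient step on phi (Eq. 4) with a proximal step on f (Eq. 5).

    U0             -- initial stacked personalized models (e.g., a NumPy array)
    grad_phi(U)    -- returns the gradient of phi at U
    prox_f(V, lam) -- returns argmin_U { f(U) + (lam / 2) * ||U - V||^2 }
    gamma          -- regularization parameter
    step_sizes     -- iterable of step sizes tau_t
    """
    U = U0
    for tau in step_sizes:
        V = U - tau * grad_phi(U)      # prox-center, Eq. (4)
        U = prox_f(V, gamma / tau)     # proximal point step, Eq. (5)
    return U
```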

4.2 Federated Model Components Self-Attention

The general method would be easy to deploy if all clients’ data were gathered together. In this subsection, we propose FedMCSA to optimize the personalized FL problem while protecting the data privacy of each client. Specifically, FedMCSA maintains a personalized cloud model for each client on a cloud server to implement the optimization step (4) of the general method and deploys the optimization step (5) privately in each client. The workflow of FedMCSA is illustrated in Figure 2.

Figure 2: The workflow of federated model components self-attention for personalized federated learning. (At each iteration, the clients perform gradient descent based on the distributed personalized models to generate updated models, which are then sent to the server. After receiving the latest models from the clients, the server updates the personalized models via the model components self-attention mechanism and distributes the new personalized models to the corresponding clients.)

Following the optimization steps of the general method, FedMCSA first optimizes \phi(U) and implements the optimization step in (4) by computing the matrix V^{t} on the cloud server. Let V^{t}=[W_{1}^{t},\ldots,W_{N}^{t}] and W_{i}^{t}=[w_{i,1}^{t}\parallel w_{i,2}^{t}\parallel\ldots\parallel w_{i,L}^{t}], where W_{1}^{t},W_{2}^{t},\ldots,W_{N}^{t} are the columns of V^{t} and w_{i,1}^{t},w_{i,2}^{t},\ldots,w_{i,L}^{t} are the layers of W_{i}^{t}. The update of the i-th column and l-th layer w_{i,l}^{t} of V^{t} computed in (4) can be rewritten as a linear combination of the model parameters \theta_{1,l}^{t-1},\ldots,\theta_{N,l}^{t-1} as follows

w_{i,l}^{t}=\left(1-\frac{\tau_{t}}{N}\sum_{j\neq i}^{N}d^{\prime}\left(\left\|\theta_{i,l}^{t-1}-\theta_{j,l}^{t-1}\right\|^{2}\right)\right)\cdot\theta_{i,l}^{t-1}+\frac{\tau_{t}}{N}\sum_{j\neq i}^{N}d^{\prime}\left(\left\|\theta_{i,l}^{t-1}-\theta_{j,l}^{t-1}\right\|^{2}\right)\cdot\theta_{j,l}^{t-1}=\psi_{i,l,1}^{t}\theta_{1,l}^{t-1}+\cdots+\psi_{i,l,N}^{t}\theta_{N,l}^{t-1}, (6)

where d^{\prime}\left(\left\|\theta_{i,l}^{t-1}-\theta_{j,l}^{t-1}\right\|^{2}\right) is the derivative of d\left(\left\|\theta_{i,l}^{t-1}-\theta_{j,l}^{t-1}\right\|^{2}\right) and \psi_{i,l,1}^{t},\ldots,\psi_{i,l,N}^{t} are the linear combination weights of the model parameters \theta_{1,l}^{t-1},\ldots,\theta_{N,l}^{t-1}, respectively. Since \psi_{i,l,1}^{t}+\ldots+\psi_{i,l,N}^{t}=1, w_{i,l}^{t} is actually a convex combination [11] of the model parameters \theta_{1,l}^{t-1},\ldots,\theta_{N,l}^{t-1}.

We have thus argued for updating each model component as a linear combination of the corresponding model components. The main question that follows is how to compute effective weight coefficients in such a way that FedMCSA can be seamlessly incorporated into any existing FL paradigm. It is vital to determine how to calculate each model component’s weight coefficient, as it has a crucial effect on the output of each model component and on the performance of personalized FL.

A naive method is to follow HeurFedAMP: fix the weight coefficient of the model component currently being updated as a hyperparameter, and then assign the remaining weight to the other model components according to their similarity with the current model component. Setting this hyperparameter requires knowledge of the clients’ data distributions; for example, it is set to 1/(M_{i}+1), where M_{i} is the number of clients sharing the same distribution as client i. Unfortunately, such knowledge about the clients’ data distributions is often unavailable in practice. Moreover, in the actual model components update process, the interaction between different model components is dynamic; fixing the weight coefficient therefore restricts the dynamic change of the weight coefficients and artificially damages the performance of the personalized models.
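For illustration only, a sketch of such a fixed-self-weight heuristic is given below. The function name, the softmax-style mapping from similarity to weight, and the flattened-parameter representation are assumptions made for this example rather than the exact HeurFedAMP procedure.

```python
import numpy as np

def heuristic_weights(components, i, self_weight, sigma=50.0):
    """Fixed self-weight for client i's component (e.g. 1 / (M_i + 1)); the remaining
    1 - self_weight is split among the other components in proportion to
    exp(sigma * cosine similarity). `components` is an N x d array of flattened
    layer parameters (illustrative)."""
    norms = np.linalg.norm(components, axis=1) + 1e-12
    unit = components / norms[:, None]
    sims = unit @ unit[i]                      # cosine similarity to component i
    scores = np.exp(sigma * (sims - sims.max()))
    scores[i] = 0.0                            # the self weight is fixed, not computed
    weights = (1.0 - self_weight) * scores / scores.sum()
    weights[i] = self_weight                   # fixed hyperparameter
    return weights
```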

To avoid the requirement for knowledge of the clients’ data distributions and inappropriate restrictions on the weight coefficients of model components, FedMCSA introduces self-attention into the update of model components, performing L self-attention processes in parallel to produce the updated personalized model for each client. FedMCSA makes no assumptions about the underlying data distributions or client similarities, which provides greater flexibility in personalization. Specifically, for each self-attention process, the N model components [\theta_{1,l}^{t-1},\ldots,\theta_{N,l}^{t-1}] corresponding to the l-th layer from the clients are used as input, and the output is the N updated model components [w_{1,l}^{t},\ldots,w_{N,l}^{t}] for the l-th layer. The query, key, and value vectors are all given by [\theta_{1,l}^{t-1},\ldots,\theta_{N,l}^{t-1}] for the l-th layer in each model components self-attention process.

Meanwhile, FedMCSA adopts the simple but very efficient cosine similarity function sim(query,key)=\sigma\cos(query,key) with a scale hyperparameter \sigma as the metric function between different model components. Therefore, \psi_{i,l,1}^{t},\ldots,\psi_{i,l,N}^{t} in (6) can be expressed as

\psi_{i,l,k}^{t}=\frac{e^{\sigma\cos(\theta_{i,l}^{t-1},\theta_{k,l}^{t-1})}}{\sum_{h=1}^{N}e^{\sigma\cos(\theta_{k,l}^{t-1},\theta_{h,l}^{t-1})}},\ k\in[1,N]. (7)
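A minimal sketch of this per-layer weight computation is given below; it flattens each client’s l-th layer, computes scaled cosine similarities, and normalizes each row with a softmax so that the weights of every client sum to one, consistent with the convex-combination property noted after (6). The tensor shapes and function names are illustrative assumptions. Stacking L such processes in parallel, one per layer, yields the updated personalized models.

```python
import torch
import torch.nn.functional as F

def layer_attention_weights(layer_params, sigma=50.0):
    """layer_params: list of N tensors, each the l-th layer of one client's model.
    Returns an N x N matrix psi where row i holds the combination weights
    psi_{i,l,1..N} used to form w_{i,l}^t in Eq. (6)."""
    flat = torch.stack([p.reshape(-1) for p in layer_params])     # N x d_l
    unit = F.normalize(flat, dim=1)                               # unit-norm rows
    cos = unit @ unit.t()                                         # pairwise cosine similarity
    psi = F.softmax(sigma * cos, dim=1)                           # each row sums to 1
    return psi

def update_layer(layer_params, sigma=50.0):
    """Convex combination of the clients' l-th layers, as in Eq. (6)."""
    psi = layer_attention_weights(layer_params, sigma)
    stacked = torch.stack([p.reshape(-1) for p in layer_params])  # N x d_l
    mixed = psi @ stacked                                         # N x d_l
    return [mixed[i].reshape(layer_params[i].shape) for i in range(len(layer_params))]
```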

For simplicity, we denote the operation of FedMCSA on the server by the function \mathscr{F}_{MCSA}, whose input and output are U^{t-1} and V^{t}, respectively. Therefore, (4) can be reformulated as

V^{t}=\mathscr{F}_{MCSA}\left(U^{t-1}\right). (8)
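Before presenting the full algorithm, the following is a minimal sketch of how \mathscr{F}_{MCSA} could be realized on the server. It treats every parameter tensor in a PyTorch state_dict as one component (a simplifying assumption about what counts as a "layer") and reuses the scaled-cosine softmax weights of (7); the name mcsa_aggregate is illustrative and not taken from any released code.

```python
import torch
import torch.nn.functional as F

def mcsa_aggregate(client_states, sigma=50.0):
    """client_states: list of N state_dicts sharing the same keys (one per client).
    Returns N updated state_dicts, i.e., the columns W_i^t of V^t in Eq. (8)."""
    keys = list(client_states[0].keys())
    N = len(client_states)
    out = [dict() for _ in range(N)]
    for k in keys:                                            # one self-attention per component
        layers = [s[k].float() for s in client_states]
        flat = torch.stack([p.reshape(-1) for p in layers])   # N x d_l
        unit = F.normalize(flat, dim=1)
        psi = F.softmax(sigma * (unit @ unit.t()), dim=1)     # N x N attention weights
        mixed = psi @ flat                                    # convex combinations, Eq. (6)
        for i in range(N):
            out[i][k] = mixed[i].reshape(layers[i].shape)
    return out
```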
Input: N clients with private training data, T, R, S, \lambda, \eta, \sigma, U^{0}=[\Theta_{1}^{0},\ldots,\Theta_{N}^{0}]
Output: The personalized models U^{T}=[\Theta_{1}^{T},\ldots,\Theta_{N}^{T}]
1  for t=1 to T do
       // Server
2      Server uniformly samples a subset of clients \mathcal{S}^{t} with size S, and each sampled client sends its local model \Theta_{i}^{t-1}, \forall i\in\mathcal{S}^{t}, to the server;
3      Server computes V^{t}=\mathscr{F}_{MCSA}(U^{t-1}) in (8) using the model components self-attention mechanism, with U^{t-1}=\{\Theta_{i}^{t-1},\forall i\in\mathcal{S}^{t}\};
4      Server sends W_{i}^{t}, \forall i\in\mathcal{S}^{t}, to the corresponding clients;
       // Client
5      for all i=1 to N do
6          if the client receives the updated personalized model sent by the server then
7              \Theta_{i}=W_{i}^{t};
8              W_{local}^{t}=W_{i}^{t};
9          end if
10         for r=0 to R-1 do
11             Sample a fresh mini-batch \mathcal{D}_{i} with size |\mathcal{D}| to compute \Theta_{i}^{t,r}=\arg\min_{\Theta_{i}\in\mathbb{R}^{d}}\left\{f_{i}(\Theta_{i})+\frac{\lambda}{2}\left\|\Theta_{i}-W_{local}^{t}\right\|^{2}\right\} as defined in (9);
12             \Theta_{i}=\Theta_{i}^{t,r};
13         end for
14         \Theta_{i}^{t}=\Theta_{i}^{t,r};
15     end for
16 end for
Algorithm 1 FedMCSA

After optimizing \phi(U) using the model components self-attention mechanism on the server, we further implement the optimization of f(U) in the client. Let U^{t}=[\Theta_{1}^{t},\ldots,\Theta_{N}^{t}], where \Theta_{1}^{t},\ldots,\Theta_{N}^{t} are the columns of U^{t}. The update of the i-th column \Theta_{i}^{t} of U^{t} computed in (5) can be rewritten as

\Theta_{i}^{t}=\arg\min_{\Theta_{i}\in\mathbb{R}^{d}}\left\{f_{i}(\Theta_{i})+\frac{\lambda}{2}\left\|\Theta_{i}-W_{i}^{t}\right\|^{2}\right\}, (9)

where \lambda=\frac{\gamma}{\tau_{t}}. We optimize \phi(U) by (8) on the server and f(U) by (9) in the client, which together constitute the complete personalized FL framework FedMCSA. In the continuous iterative process of personalized FL, the entire process can be considered as a continuous federated model components self-attention network, as shown in Figure 2.
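On the client side, (9) is solved approximately with R rounds of mini-batch training on the proximally regularized local loss (cf. Algorithm 1). The sketch below is one plausible PyTorch rendering under that reading, where each inner subproblem is approximated by a single SGD step, a simplification; loss_fn, loader, and the default hyperparameters are placeholders.

```python
import torch

def client_proximal_update(model, w_server, loader, loss_fn, lam=5.0, lr=0.02, local_steps=20):
    """Approximately solve Eq. (9): min_Theta f_i(Theta) + (lam / 2) * ||Theta - W_i^t||^2.
    `w_server` is the personalized model W_i^t received from the server."""
    anchor = [p.detach().clone() for p in w_server.parameters()]   # frozen W_i^t
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    data_iter = iter(loader)
    for _ in range(local_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:                 # restart the loader when exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        prox = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
        (loss + 0.5 * lam * prox).backward()  # local loss plus proximal term
        opt.step()
    return model
```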

The complete algorithm description of FedMCSA is presented in Algorithm 1. FedMCSA implements a client-server personalized FL framework to boost the performance of personalized FL dramatically, which alternately optimizes \phi(U) and f(U) through the model components self-attention mechanism on the server and the proximal gradient descent method in the client until a maximum number T of iterations is reached.

5 Experimental Results and Discussion

In this section, we evaluate the performance of FedMCSA and demonstrate the effectiveness of the model components self-attention mechanism when the distributions of clients’ private data are Non-IID. First, we compare FedMCSA with existing personalized FL algorithms on various datasets with different network models. Then, we demonstrate the effectiveness of the mechanism through ablation and integration studies. Finally, the impact of the important hyperparameter \sigma on the convergence of FedMCSA is analyzed.

5.1 Experimental Settings

5.1.1 Datasets

We use four public benchmark datasets that are widely used in FL.

  • 1.

Synthetic [9]: The Synthetic dataset is applied to a 10-class classifier, where each data point is composed of 60-dimensional real-valued data. The data size of each client is in the range of [250, 25810]. The data generation and distribution procedure is adopted from previous works [9, 14] to generate Non-IID data, which uses two parameters \overline{\alpha} and \overline{\beta} to control how the local data and local models differ across clients. Specifically, a synthetic dataset with \overline{\alpha}=0.5 and \overline{\beta}=0.5 is generated and distributed according to the power law to N=100 clients [9].

  • 2.

Mnist [36]: The Mnist dataset is a handwritten digit dataset containing 70,000 instances with 10 labels. We adopt the Non-IID setup and data generation procedure of the work [9], where the complete dataset is distributed to N=20 clients owing to the limited size of Mnist. Each client’s data size differs, ranging within [1165, 3834], and covers only 2 of the 10 classes.

  • 3.

FMnist (Fashion-Mnist) [37]: The FMnist dataset includes 70,000 fashion product images across ten categories, with 7,000 images per category. We use the same generation method as for the Mnist dataset to produce the Non-IID data of FMnist.

  • 4.

Cifar10 [38]: The Cifar10 dataset contains 60,000 32×32 color images, each of which is categorized into one of ten mutually exclusive labels. We use the same generation method as for the Mnist dataset to produce the Non-IID data of Cifar10.

5.1.2 Network Models

To demonstrate the generality and effectiveness of our proposed FedMCSA, we consider two different models following the work [9] in the experiments. First, an l_{2}-regularized multinomial logistic regression model (MLR) is implemented with the softmax activation and cross-entropy loss functions. Second, we consider a two-layer deep neural network (DNN) with a hidden layer of size 20 for Synthetic and 100 for Mnist, FMnist, and Cifar10, using ReLU activation and a softmax layer at the end.
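A minimal PyTorch sketch of the two model families described above is given below, assuming MNIST-sized inputs (784 features, 10 classes, hidden size 100); the exact architectural details of the original implementation may differ.

```python
import torch.nn as nn

class MLR(nn.Module):
    """Multinomial logistic regression; the l2 regularization is assumed to be applied
    via weight decay or the proximal term during training, and the softmax is folded
    into the cross-entropy loss."""
    def __init__(self, in_dim=784, n_classes=10):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_classes)

    def forward(self, x):
        return self.linear(x.flatten(1))   # logits

class DNN(nn.Module):
    """Two-layer network: one hidden ReLU layer (size 20 for Synthetic, 100 for the image datasets)."""
    def __init__(self, in_dim=784, hidden=100, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x.flatten(1))
```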

5.1.3 Baselines

We compare FedMCSA against five methods broadly falling under three categories. (i) Only train a single global model for all clients: FedAvg [1] and Fedprox [14]. (ii) Train more than one model but based on a global model: Per-FedAvg [8] and pFedMe [9]. (iii) Train a separate model for each client without considering a single global model: HeurFedAMP [11]. The details of the baselines are presented as follows.

  • 1.

    FedAvg [1]: As the most popular FL baseline, it directly averages all local models on the server to obtain a new global model.

  • 2.

    Fedprox [14]: A proximal term is added to the objective to address the challenges of the Non-IID data.

  • 3.

    Per-FedAvg [8]: After getting a global model as initialization, each client performs one more step of gradient update to obtain the personalized model.

  • 4.

    pFedMe [9]: The Moreau envelope is used as the clients’ regularized loss function to achieve the decoupling of personalized model optimization and global model learning. For the comparison with pFedMe, we use both its personalized model pFedMe(PM) and global model pFedMe(GM).

  • 5.

    HeurFedAMP [11]: A federated attentive message passing method is designed to conduct personalized FL with preserving privacy.

5.1.4 Implementation Details

We utilize the SGD optimizer and use PyTorch to implement our method. We randomly split each dataset into 75% for training and 25% for testing. The learning rate is set to \eta = 0.02 by default for all four benchmark datasets. Meanwhile, the batch size is set to |\mathcal{D}| = 20, and the number of local training epochs is set to R = 20. The hyperparameter \sigma is set to the default value of 50, and the hyperparameter \lambda is set to the default value of 5. The subset of clients is set to S=20 for Synthetic and S=10 for Mnist, FMnist, and Cifar10. The maximum number T of iterations is set to 800 for all four benchmark datasets. The performance of all methods is evaluated by the best mean testing accuracy (BMTA) in percentages, where BMTA is the highest mean testing accuracy achieved by a method over all communication rounds of training, and the mean testing accuracy is defined as the average of the testing accuracy over all clients. All experiments are implemented in PyTorch 1.7 running on an Intel(R) Xeon(R) CPU, 64 GB memory, an NVIDIA 1080Ti GPU, and Ubuntu 16.04.
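To make the metric explicit, a small sketch of the BMTA computation follows; acc[t][i] is assumed to hold client i’s test accuracy after communication round t.

```python
def best_mean_testing_accuracy(acc):
    """acc: list over communication rounds, each entry a list of per-client test accuracies.
    BMTA = the maximum over rounds of the mean accuracy across clients."""
    return max(sum(round_acc) / len(round_acc) for round_acc in acc)

# Example: two rounds, three clients.
print(best_mean_testing_accuracy([[0.80, 0.75, 0.90], [0.85, 0.78, 0.88]]))  # ~0.8367
```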

5.2 Performance Comparison

To demonstrate the empirical superiority of FedMCSA, several comparisons between FedMCSA, HeurFedAMP, pFedMe, Per-FedAvg, Fedprox, and FedAvg are conducted. First, the same hyperparameters for all algorithms are used as a basic comparison. Considering that the algorithms behave differently when the hyperparameters change, we further perform a grid search to find the best-performing combination of fine-tuned hyperparameters for each algorithm.

5.2.1 The comparisons for the same hyperparameters.

Figure 3: Performance comparisons of FedAvg, Fedprox, Per-FedAvg, pFedMe(GM), pFedMe(PM), HeurFedAMP, and FedMCSA using the MLR model on the Synthetic and Mnist datasets.
Figure 4: Performance comparisons of FedAvg, Fedprox, Per-FedAvg, pFedMe(GM), pFedMe(PM), HeurFedAMP, and FedMCSA using the DNN model on the Synthetic and Mnist datasets.

The comparisons for the same hyperparameters are shown in Figure 3 and Figure 4, from which we can observe that FedMCSA achieves the best performance compared to the other algorithms in different settings. Specifically, for the MLR model, FedMCSA is about 16.86%, 16.66%, 9.55%, 17.57%, 10.53%, and 9.64% more accurate than FedAvg, Fedprox, Per-FedAvg, pFedMe(GM), pFedMe(PM), and HeurFedAMP on the Synthetic dataset, and 6.47%, 6.42%, 5.60%, 6.53%, 2.75%, and 0.50% more accurate than those on the Mnist dataset. For the DNN model, FedMCSA is about 11.39%, 11.69%, 7.52%, 13.42%, 9.07%, and 11.61% more accurate than FedAvg, Fedprox, Per-FedAvg, pFedMe(GM), pFedMe(PM), and HeurFedAMP on the Synthetic dataset, and 2.94%, 2.96%, 1.81%, 3.06%, 1.50%, and 1.07% more accurate than those on the Mnist dataset.

It can be observed that the performance of FedAvg, Fedprox, and pFedMe(GM) is similar and significantly lower. Per-FedAvg, pFedMe(PM), and HeurFedAMP perform similarly to one another and have advantages over FedAvg, Fedprox, and pFedMe(GM). Additionally, our proposed FedMCSA clearly surpasses Per-FedAvg, pFedMe(PM), and HeurFedAMP, demonstrating its superiority. Although the optimization method of FedMCSA in the client is generally universal and similar to that of HeurFedAMP, FedMCSA is superior to FedAvg, Fedprox, pFedMe(GM), Per-FedAvg, pFedMe(PM), and HeurFedAMP in different settings. This is because FedMCSA adopts the model components self-attention mechanism on the server side, which considers the significance of the internal components of the model from a novel perspective and further implements an adaptive update of model components through parallel model components self-attention mechanisms. FedMCSA realizes the adaptive update of the entire model through the continuous adaptive update of model components. Consequently, the performance of FedMCSA is superior to FedAvg, Fedprox, Per-FedAvg, pFedMe, and HeurFedAMP under different models and datasets when the same hyperparameters are used.

5.2.2 The comparisons for fine-tuned hyperparameters.

Table 1: Performance comparisons using fine-tuned hyperparameters. |\mathcal{D}| = 20, R = 20, and T = 800 are fixed for all experiments, and all results are averaged over 3 runs. (The best result in each setting is achieved by FedMCSA(Our).)

Algorithm | Synthetic (MLR) | Synthetic (DNN) | Mnist (MLR) | Mnist (DNN) | FMnist (MLR) | FMnist (DNN) | Cifar10 (MLR) | Cifar10 (DNN)
FedAvg [1] | 78.04±0.21 | 84.30±0.03 | 92.39±0.02 | 96.86±0.01 | 84.56±0.01 | 85.80±0.04 | 38.43±0.25 | 46.10±0.14
Fedprox [14] | 77.51±0.22 | 84.01±0.06 | 92.42±0.03 | 96.89±0.02 | 84.45±0.01 | 85.68±0.10 | 38.35±0.41 | 45.92±0.16
Per-FedAvg [8] | 85.15±0.06 | 88.50±0.07 | 94.18±0.05 | 98.15±0.01 | 98.82±0.04 | 99.31±0.01 | 57.79±0.08 | 81.64±0.09
pFedMe(GM) [9] | 76.37±0.18 | 82.70±0.28 | 92.08±0.01 | 96.81±0.03 | 83.55±0.11 | 84.30±0.12 | 36.21±0.11 | 46.89±0.39
pFedMe(PM) [9] | 84.97±0.08 | 87.01±0.08 | 97.24±0.01 | 98.93±0.01 | 99.20±0.01 | 99.30±0.01 | 66.70±0.09 | 84.05±0.13
HeurFedAMP [11] | 84.42±0.05 | 84.00±0.06 | 98.33±0.03 | 98.42±0.01 | 99.19±0.01 | 99.13±0.02 | 72.18±0.03 | 82.41±0.03
FedMCSA(Our) | 95.27±0.01 | 96.26±0.02 | 98.87±0.01 | 99.58±0.01 | 99.33±0.01 | 99.39±0.01 | 75.91±0.02 | 84.40±0.05

The highest accuracies achieved by fine-tuning the hyperparameters of all algorithms are shown in Table 1. In all settings, FedMCSA surpasses the other algorithms and achieves the highest accuracy. Specifically, FedMCSA achieves 95.27%, 98.87%, 99.33%, and 75.91% on the Synthetic, Mnist, FMnist, and Cifar10 datasets when using the MLR model, and 96.26%, 99.58%, 99.39%, and 84.40% on the Synthetic, Mnist, FMnist, and Cifar10 datasets when using the DNN model.

As for the MLR model, among all the baseline algorithms, the best performances achieved on the Synthetic, Mnist, FMnist, and Cifar10 datasets are 85.15% by Per-FedAvg, 98.33% by HeurFedAMP, 99.20% by pFedMe(PM), and 72.18% by HeurFedAMP, respectively. As for the DNN model, the best baseline performances on the Synthetic, Mnist, FMnist, and Cifar10 datasets are 88.50% by Per-FedAvg, 98.93% by pFedMe(PM), 99.31% by Per-FedAvg, and 84.05% by pFedMe(PM), respectively. It is evident that no single baseline achieves the best accuracy across different models and datasets simultaneously, whereas FedMCSA achieves the highest accuracy with both the MLR and DNN models across all four datasets.

From Figure 3, Figure 4, and Table 1, we can observe that the performance of the DNN model tends to be better than that of the MLR model under the same experimental conditions. This is because the DNN model has a hidden layer with stronger information extraction ability than the MLR model and can more fully mine the information in clients’ private data.

The key to the superiority of FedMCSA is that it abandons the setting of a complete model as the basic unit of cooperation between different models and instead treats each layer of the model as a basic unit at a finer granularity. By taking each layer’s significance into consideration, FedMCSA explores the potential of the model components to make collaboration between different clients more targeted and effective. As a result, each model component can learn as much useful knowledge as possible from other models. With the advantage of the model components self-attention mechanism, FedMCSA achieves strong performance using different models on the Synthetic, Mnist, FMnist, and Cifar10 datasets.

5.3 Ablation Studies

To further analyze the model components self-attention mechanism of FedMCSA, we conduct ablation studies to verify the effectiveness of the mechanism from two different perspectives. On the one hand, we construct a variant of FedMCSA by replacing the model components self-attention mechanism with the average aggregation method. On the other hand, we integrate this mechanism into the most basic FL algorithm, FedAvg, as well as into a representative personalized FL method with average aggregation, pFedMe. Considering that the personalized model pFedMe(PM) always performs better than the global model pFedMe(GM) in the same setting, we use pFedMe(PM) to represent pFedMe here. Note that in each group of experiments, the settings are identical except for the variables of interest.

Specifically, we construct three variants of FedMCSA, FedAvg, and pFedMe as follows.

  • 1.

FedMCSA^{-MCSA}: The variant that replaces the model components self-attention mechanism with the average aggregation method.

  • 2.

FedAvg^{+MCSA}: The variant that replaces FedAvg’s aggregation method with the model components self-attention mechanism.

  • 3.

pFedMe^{+MCSA}: The variant that replaces pFedMe(PM)’s aggregation method with the model components self-attention mechanism.

Table 2: Accuracy of the variants of FedMCSA, FedAvg, and pFedMe on the four benchmark datasets. "Δ" represents the change of each variant relative to FedMCSA, FedAvg, and pFedMe, respectively, and all results are averaged over 3 runs.

Method | Synthetic (MLR) | Synthetic (DNN) | Mnist (MLR) | Mnist (DNN) | FMnist (MLR) | FMnist (DNN) | Cifar10 (MLR) | Cifar10 (DNN)
FedMCSA | 94.06% | 95.26% | 98.82% | 99.48% | 99.32% | 99.37% | 72.18% | 84.24%
FedMCSA^{-MCSA} | 76.35% | 83.08% | 92.20% | 96.84% | 83.76% | 84.42% | 34.05% | 46.26%
Δ | 17.71%↓ | 12.18%↓ | 6.62%↓ | 2.64%↓ | 15.56%↓ | 14.95%↓ | 38.13%↓ | 37.98%↓
FedAvg | 77.26% | 83.55% | 92.39% | 96.54% | 84.48% | 83.87% | 34.81% | 45.60%
FedAvg^{+MCSA} | 94.32% | 95.52% | 98.83% | 99.49% | 99.31% | 99.36% | 72.13% | 84.37%
Δ | 17.06%↑ | 11.97%↑ | 6.44%↑ | 2.95%↑ | 14.83%↑ | 15.49%↑ | 37.32%↑ | 38.77%↑
pFedMe | 83.56% | 86.04% | 96.06% | 97.98% | 98.95% | 99.23% | 65.46% | 83.77%
pFedMe^{+MCSA} | 93.86% | 93.49% | 98.85% | 99.48% | 99.33% | 99.34% | 74.62% | 84.70%
Δ | 10.30%↑ | 7.45%↑ | 2.79%↑ | 1.50%↑ | 0.38%↑ | 0.11%↑ | 9.16%↑ | 0.93%↑

The experimental results of the ablation studies are shown in Table 2. Compared with FedMCSA, FedMCSA^{-MCSA} shows a significant drop in accuracy on the four benchmark datasets, whether using the MLR model or the DNN model. FedMCSA^{-MCSA} achieves relatively good performance on the Mnist dataset, with drops of 6.62% and 2.64% in accuracy compared to FedMCSA using the MLR and DNN models, respectively. This is because the Mnist dataset is relatively simple, and the basic average aggregation method can already mine sufficient effective knowledge. When the dataset is relatively complex, e.g., the Synthetic dataset, the impact of removing the model components self-attention mechanism is significantly enlarged. Notably, compared with FedMCSA, FedMCSA^{-MCSA} drops by 38.13% with the MLR model and 37.98% with the DNN model on the Cifar10 dataset. The consistent accuracy drop across different datasets illustrates the effectiveness of the model components self-attention mechanism of FedMCSA, and the larger accuracy impact on more complex datasets further illustrates that the role of the mechanism is more pronounced on relatively complex datasets.

We also compare the performance of FedAvg, FedAvg^{+MCSA}, pFedMe, and pFedMe^{+MCSA} on the four benchmark datasets. From Table 2, we can observe that the model components self-attention mechanism of FedMCSA achieves an overall improvement not only on FedAvg but also on pFedMe. FedAvg^{+MCSA} shows a significant accuracy improvement over FedAvg on the four benchmark datasets. Especially on the Cifar10 dataset, FedAvg^{+MCSA} improves the accuracy over FedAvg by 37.32% and 38.77% under the MLR and DNN models, respectively. Although pFedMe already achieves good performance on different datasets, pFedMe^{+MCSA} still achieves performance improvements under a variety of settings. On the Synthetic dataset, pFedMe^{+MCSA} improves the accuracy by 10.30% and 7.45% under the MLR and DNN models, respectively.

The results in Table 2 demonstrate the effectiveness of the model components self-attention mechanism of FedMCSA through the consistent accuracy decrease of FedMCSA^{-MCSA} and the consistent accuracy improvements of FedAvg^{+MCSA} and pFedMe^{+MCSA}. Furthermore, the performance improvement across different datasets, network models, and algorithms further illustrates the generality and usability of the mechanism.

Figure 5: Impact of \sigma on the convergence of FedMCSA using the MLR and DNN models on the Synthetic dataset.

5.4 Impact of The Important Hyperparameter

To investigate the impact of the important hyperparameter \sigma on the convergence of FedMCSA, we conduct diverse experiments on the Synthetic dataset using both the MLR and DNN models.

Since \sigma is the scale of the model components similarity in FedMCSA, it is treated as a hyperparameter of FedMCSA. As illustrated in Figure 5, all parameters except \sigma are fixed during the experiments. For MLR, once \sigma increases to 30, the upper limit of FedMCSA’s accuracy under the current conditions is reached, while the upper limit of accuracy for DNN is only reached when \sigma=70. We observe that FedMCSA requires a moderate value of \sigma to compute the personalized models. When the value of \sigma is too small, FedMCSA converges slowly and the accuracy is reduced. However, when the value of \sigma exceeds a certain threshold, the accuracy of FedMCSA does not improve further, while an excessive value of \sigma sharply increases the computational cost. As a result, \sigma = 50 is selected as the default for both MLR and DNN on the Synthetic dataset, and likewise for the Mnist, FMnist, and Cifar10 datasets.

6 Conclusion

In this paper, we propose a new framework, federated model components self-attention (FedMCSA), to handle Non-IID data in FL, which devises the model components self-attention mechanism from the novel perspective of component-based aggregation. Different from previous personalized FL approaches that treat the entire model as the basic unit for aggregation, FedMCSA promotes purposeful and refined collaboration among clients through the parallel model components self-attention mechanism. Our extensive experiments on benchmark datasets demonstrate the superior performance of FedMCSA compared with FedAvg, Fedprox, Per-FedAvg, pFedMe, and HeurFedAMP in different settings. Moreover, the empirical results demonstrate the effectiveness and universality of the model components self-attention mechanism of FedMCSA, which can also be easily integrated into existing personalized FL methods to further improve their performance. We hope that future research on personalized FL will not only focus on external optimization of the model but also pay more attention to exploring the potential of collaboration between model components, which could lead to even better FL.

References