
Optimizing Personalized Federated Learning through Adaptive Layer-Wise Learning

Weihang Chen 1, Jie Ren1, Zhiqiang Li1, Ling Gao2, Zheng Wang3
Abstract

Real-life deployment of Federated Learning (FL) often faces non-IID data, which leads to poor accuracy and slow convergence. Personalized FL (pFL) tackles these issues by tailoring local models to individual data sources and using weighted aggregation methods for client-specific learning. However, existing pFL methods often fail to provide each local model with global knowledge on demand while maintaining low computational overhead. Additionally, local models tend to over-personalize to their data during training, potentially discarding previously acquired global knowledge. We propose FLAYER, a novel layer-wise learning method for pFL that optimizes the personalization performance of local models. FLAYER considers the different roles and learning abilities of the neural network layers of each local model. It cost-effectively incorporates global information into each local model, as needed, to initialize the local model. It then dynamically adjusts the learning rate of each layer during local training, optimizing the personalized learning process for each local model while preserving global knowledge. Additionally, to strengthen the global representation in pFL, FLAYER selectively uploads parameters for global aggregation in a layer-wise manner. We evaluate FLAYER on four representative datasets in the computer vision and natural language processing domains. Compared to six state-of-the-art pFL methods, FLAYER improves inference accuracy by 7.21% on average (up to 14.29%).

Introduction

Federated Learning (FL) enables collaborative model training across diverse, decentralized data sources while preserving the confidentiality and integrity of each dataset. It is widely used in mobile applications like private face recognition (Niu and Deng 2022), predictive text, speech recognition, and image annotation (Song, Granqvist, and Talwar 2022). However, data from mobile devices frequently exhibits non-IID (non-independent and identically distributed) characteristics due to variations in user behavior, device types, or regional differences (Zhu et al. 2021). This data heterogeneity poses significant challenges for typical FL algorithms, as the trained global model may struggle to adapt to the specific needs of individual clients, resulting in poor inference performance and slow convergence.

Personalized Federated Learning (pFL) addresses the non-IID data challenge through client-specific learning (Tan et al. 2022), typically using weighted aggregation methods to customize model updates for individual clients. Customization can be achieved through various strategies, including model-wise (Luo and Wu 2022), layer-wise (Ma et al. 2022), or element-wise (Zhang et al. 2023b) approaches.

Model-wise aggregation methods like APPLE (Luo and Wu 2022), FedAMP (Huang et al. 2021), and Ditto (Li et al. 2021b) aggregate model parameters across multiple clients: a client downloads models from others and aggregates them locally using learned weights at the model level. This process captures comprehensive global knowledge by integrating diverse information from all participating clients. However, it cannot reflect finer-grained differences in individual clients’ data, potentially limiting the effectiveness of model personalization. It also incurs significant computational cost to determine which client models can benefit local performance, potentially limiting scalability. Layer-wise methods perform local aggregation in layer units, where a “layer unit” can be a single neural network layer or a block comprising multiple layers. Examples of layer-wise methods include FedPer (Arivazhagan et al. 2019), FedRep (Collins et al. 2021), and pFedLA (Ma et al. 2022). These methods allow for more targeted adaptation of different parts of the network. For example, FedRep uses the global model to construct the lower layers (i.e., layers toward the input, also termed base layers) of each local model, while the higher layers (i.e., layers toward the model’s output, also termed head layers) are trained solely on local data for personalization. This ensures that the local model learns global representations from all clients while using the head layers to learn local representations, balancing global and local learning. Element-wise aggregation, such as FedALA (Zhang et al. 2023b), leverages the global model to construct the base layers while learning an aggregation weight for each parameter in the head from both the local and global models. However, these weights are set before the training stage and remain largely unchanged, potentially limiting the model’s adaptability to evolving data patterns over time. Moreover, despite offering finer control, element-wise aggregation can considerably increase computational cost because individual weights must be computed and maintained for each parameter.

Table 1 compares the three aforementioned pFL methods with a standard FL method, FedAvg (McMahan et al. 2017). In this experiment, 20 clients collaboratively learn personalized models on the CIFAR-100 dataset (Krizhevsky, Hinton et al. 2009) using ResNet-18 (He et al. 2016). We simulate heterogeneous settings using the Dirichlet distribution Dir(0.1) (Lin et al. 2020). All experiments are conducted on a single NVIDIA RTX A5000 GPU. Among the methods compared, APPLE shows a moderate accuracy improvement over FedAvg but incurs the highest training cost. Conversely, FedPer introduces personalization layers (head layers) into FL, training these layers on local data without uploading them to the server for aggregation, thereby speeding up convergence and significantly improving inference accuracy over FedAvg. Building on FedPer, FedALA incorporates local and global information within each element of the head layers, delivering the highest inference accuracy. However, FedALA must learn aggregation weights on an element-wise basis, which increases computational cost. Moreover, all three pFL methods use a constant learning rate across all clients during training, without considering the differing learning needs of each client's layers. This can lead to over-personalization for each client, resulting in the loss of previously aggregated global information.

Method | #Iter. | Total time (s) | Time (s)/iter. | Acc. (%)
FedAvg | 181 | 5430 | 30 | 37.08
APPLE (model-wise) | 25 | 6000 | 240 | 57.29
FedPer (layer-wise) | 184 | 5704 | 31 | 54.26
FedALA (element-wise) | 76 | 2660 | 35 | 58.65
Table 1: The computation cost (number of training iterations until convergence, total training time, and training time per iteration) and inference accuracy (%) on CIFAR-100 using ResNet-18.
Figure 1: The local learning process of FLAYER on the $k$-th client during the $t$-th iteration. In the local initialization stage, FLAYER aggregates both local and global head layers based on the local model’s performance from the previous iteration. The initialized local model is then trained on the local data, using an adaptive learning rate for each layer. Based on the parameter changes before and after local training, FLAYER constructs a masking matrix to identify and select essential parameters, with different proportions per layer, for updating the global model.

To effectively incorporate local and global information across all network layers on each client, we introduce FLAYER, a new layer-wise optimization for pFL. FLAYER operates throughout the local learning process: local model initialization, training, and parameter uploading. Specifically, during the local model initialization stage, we aggregate local and global information in the head layers on a layer-wise basis, with aggregation weights guided by the local model’s inference accuracy, which can be easily accessed on each client with negligible cost. This allows for dynamic adjustment of the learning contributions from local and global models based on local performance. In the local model training stage, FLAYER applies a layer-wise adaptive learning rate scheme based on each layer’s position and gradient. This enables each layer to effectively learn from the local dataset after initialization and helps mitigate the issue of vanishing gradient during training. For parameter uploading, FLAYER implements a layer-wise masking strategy to select essential parameters from each client for global aggregation. This ensures that the global averaging process retains crucial base features, enhancing the overall effectiveness of global learning.

We evaluate FLAYER by applying it to both image (Krizhevsky, Hinton et al. 2009; Chrabaszcz, Loshchilov, and Hutter 2017) and text (Zhang, Zhao, and LeCun 2015) classification tasks using four widely adopted benchmarks. The results show that FLAYER outperforms six other pFL methods  (Luo and Wu 2022; Li et al. 2021b; Huang et al. 2021; Arivazhagan et al. 2019; Collins et al. 2021; Zhang et al. 2023b) in inference accuracy and computational cost. This paper makes the following contributions:

  • A new performance-guided layer-wise aggregation method allows clients to dynamically incorporate both local and global information in a cost-effective manner;

  • A new layer-specific adaptive learning rate scheme for pFL to steer the personalization and speed up convergence;

  • A new layer-wise masking technique for selectively uploading essential parameters to the central server to improve global representation.

Background and Overview

Problem Definition

pFL learns personalized models cooperatively among clients. Consider a scenario with $n$ clients, where each client holds its own private training data, denoted $D_{1}, D_{2}, \ldots, D_{n}$, with differing data classes and sizes. These datasets exhibit heterogeneity, characterized as non-IID (non-independently and identically distributed) (Zhao et al. 2018). The goal of pFL can be defined as:

\{\theta_{1},\theta_{2},\ldots,\theta_{n}\} = \arg\min_{\theta}\sum_{k=1}^{n}\frac{m_{k}}{M}L_{k}(\theta_{k}) \quad (1)

where $\theta_{k}$ denotes the model parameters of client $k$, $m_{k}$ is the data size of $D_{k}$, $M$ is the total data size across all clients, and $L_{k}(\theta_{k})$ is the loss function of client $k$.
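To make the weighting in Eq. (1) concrete, here is a minimal Python sketch (ours, not the authors’ code); the per-client loss values and sample counts are assumed to be given.

```python
def pfl_objective(client_losses, client_sizes):
    """Sketch of Eq. (1): a weighted sum of per-client losses, where client k
    contributes in proportion to its local data size m_k / M.
    client_losses: list of L_k(theta_k) values; client_sizes: list of m_k."""
    total = float(sum(client_sizes))  # M, the total data size across all clients
    return sum((m_k / total) * loss_k
               for m_k, loss_k in zip(client_sizes, client_losses))
```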

Overview of FLAYER

Figure 1 depicts the workflow of FLAYER during the training of the $k$-th client in the $t$-th iteration. Initially, the global model is downloaded to each client for local aggregation, where FLAYER implements an adaptive aggregation strategy that dynamically adjusts the aggregation weights between the global and local models based on the inference accuracy of the local model. This allows whichever model, local or global, performs better to have a greater influence on the initialization of the local head layers. During the local training phase, FLAYER leverages a layer-wise adaptive learning rate scheme that dynamically adjusts the learning rate according to each layer’s position within the network and that layer’s gradient. By optimizing the learning step for each layer, FLAYER enhances local model personalization and accelerates convergence. In the final phase of uploading the model update to the central server, FLAYER employs a layer-wise masking strategy. This selective approach uploads only the essential parameters from each local model, preventing the global averaging process from diluting the crucial information captured by the local models and thus enhancing the global representation. Algorithm 1 presents the overall FL process in FLAYER.

Methodology

Performance-guided Layer-wise Local Aggregation

Building on previous findings (Yosinski et al. 2014; Zhu, Hong, and Zhou 2021) that the base layers of a DNN capture generalized features while the head layers encode task-specific features, we introduce a differentiated update strategy for these layers. During the local model initialization stage, the base layers of each client’s model are directly updated using parameters from the global model, ensuring the consistent refinement of generalized features across all clients. For the head layers, we use a dynamic aggregation method to integrate both global and local parameters. This integration is tailored based on the performance of the $k$-th client’s model on its local dataset $D_{k}$ in the previous iteration. The process is defined as follows:

\tilde{\theta}_{k}^{t} := \left[\theta_{g}^{(1:L-s,\,t-1)},\; A_{k,l}^{t-1}\odot\theta_{k}^{(L-s+1:L,\,t-1)} + A_{k,g}^{t-1}\odot\theta_{g}^{(L-s+1:L,\,t-1)}\right], \quad \text{s.t.}\ A_{k,l}+A_{k,g}=1 \quad (2)

Here, $\theta_{k}$ is the local model parameter matrix of the $k$-th client, $\tilde{\theta}_{k}$ denotes the local model parameter matrix of the $k$-th client after initialization, $L$ is the total number of layers, and $s$ is the number of head layers used for personalization. $\theta_{g}^{(1:L-s,\,t-1)}$ represents the lower $L-s$ layers in the base part of the global model at iteration $t-1$, which are used to update the base layers of the local model; all clients share the same base layers. $\theta_{k}^{(L-s+1:L,\,t-1)}$ denotes the head layers of the $k$-th local model. We aggregate the head layers from the local and global models to initialize the head layers of the local model. The aggregation weights $A_{k,g}$ and $A_{k,l}$ control the influence of the global and local parameters, respectively. At iteration $t-1$, the local model’s inference accuracy on dataset $D_{k}$ sets the local weight $A_{k,l}^{t-1}$, with $1-A_{k,l}$ serving as the global weight. A lower local accuracy increases reliance on the global model through $A_{k,g}$, providing stability in early training phases. As the accuracy of the local model increases, the initialization of the head layers becomes more dependent on the local model, thereby incorporating more localized and personalized information.
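To make the initialization rule concrete, the sketch below shows one way Eq. (2) could be realized in PyTorch. It is not the authors’ implementation: the function name, the treatment of each parameter tensor as a “layer unit”, and the direct use of last-round accuracy as $A_{k,l}$ are our assumptions.

```python
import torch

def initialize_local_model(local_model, global_model, num_head_tensors, local_acc):
    """Sketch of Eq. (2): copy global base layers, then blend the head layers
    using the previous-round local accuracy as the local weight A_{k,l}."""
    local_params = list(local_model.parameters())
    global_params = list(global_model.parameters())

    a_local = float(local_acc)   # A_{k,l}: previous-round accuracy, assumed in [0, 1]
    a_global = 1.0 - a_local     # A_{k,g}: remaining weight goes to the global model

    head_start = len(local_params) - num_head_tensors  # first parameter tensor of the head
    with torch.no_grad():
        for idx, (p_loc, p_glob) in enumerate(zip(local_params, global_params)):
            if idx < head_start:
                p_loc.copy_(p_glob)                               # base: take the global parameters
            else:
                p_loc.mul_(a_local).add_(p_glob, alpha=a_global)  # head: A_l*local + A_g*global
    return local_model
```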

Adaptive Layer-specific Learning Rate

After the model initialization, the local model trains on its local dataset $D_{k}$:

\hat{\theta}_{k}^{t} := \tilde{\theta}_{k}^{t} - \eta\,\nabla_{\tilde{\theta}_{k}}\mathcal{L}_{k}(\tilde{\theta}_{k}^{t}, D_{k}) \quad (3)

where $\eta$ is the learning rate and $\nabla_{\tilde{\theta}_{k}}\mathcal{L}_{k}(\tilde{\theta}_{k}^{t}, D_{k})$ is the gradient of the loss function $\mathcal{L}_{k}$ with respect to the parameters $\tilde{\theta}_{k}$, evaluated on the local dataset $D_{k}$.

The existing pFL methods (Luo and Wu 2022; Li et al. 2021b; Huang et al. 2021; Arivazhagan et al. 2019; Collins et al. 2021; Zhang et al. 2023b) typically employ a fixed learning rate. However, in the context of FL with non-IID data, we observe that the learning rate is a critical hyperparameter that significantly impacts both the performance and convergence speed of local models (see Section Ablation Study). Previous work (Singh et al. 2015) pioneered layer-wise learning rate adjustments, primarily to mitigate the vanishing gradient issue in the lower layers of DNNs in a non-distributed training context, but this approach is not well-suited to pFL. Inspired by (Luo et al. 2021) and our own observations (see Section Layer Similarity), we note that the first layers of the local models, which show the highest cross-client similarity, capture universal features and thus call for a smaller learning rate with more gradual adjustments. In contrast, deeper layers, which exhibit greater divergence, handle more complex, client-specific features and benefit from larger learning steps, which aids local model personalization. Building on these insights, we implement an adaptive learning rate scheme for pFL that integrates each layer’s position with its gradient:

\eta^{(i,t)} = \eta\left(1 + \log\left(1 + \frac{1}{\|g^{(i,t)}\|_{2}}\right)\times\frac{i}{L}\right) \quad (4)

where:

  • $\eta$ is the base learning rate;

  • $g^{(i,t)}$ is the gradient vector of the $i$-th layer at iteration $t$;

  • $\|g^{(i,t)}\|_{2}$ is the L2 (Euclidean) norm of that gradient;

  • $L$ is the total number of layers.
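As a concrete illustration, the following PyTorch-style sketch computes the per-layer learning rates of Eq. (4) from the current gradients and applies one plain SGD step as in Eq. (3). It is a minimal sketch under our own assumptions (natural logarithm, each parameter tensor counted as one layer, a small epsilon guarding against zero gradients); it is not the authors’ released code.

```python
import math
import torch

def layerwise_learning_rates(model, base_lr, eps=1e-12):
    """Sketch of Eq. (4): eta^(i) = eta * (1 + log(1 + 1/||g^(i)||_2) * i/L).
    Assumes model.parameters() are ordered from input (i=1) to output (i=L)
    and that gradients were already populated by loss.backward()."""
    params = [p for p in model.parameters() if p.grad is not None]
    L = len(params)
    lrs = []
    for i, p in enumerate(params, start=1):
        grad_norm = p.grad.detach().norm(p=2).item() + eps        # ||g^(i,t)||_2
        scale = 1.0 + math.log(1.0 + 1.0 / grad_norm) * (i / L)   # gradient term * positional term
        lrs.append(base_lr * scale)
    return lrs

def sgd_step_with_layerwise_lr(model, base_lr):
    """Apply one SGD update (Eq. (3)) using the per-layer rates above."""
    lrs = layerwise_learning_rates(model, base_lr)
    with torch.no_grad():
        for lr, p in zip(lrs, (p for p in model.parameters() if p.grad is not None)):
            p.add_(p.grad, alpha=-lr)   # theta <- theta - eta^(i) * grad
```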

Layer-wise Sparse Binary Masking

In non-IID settings, averaging updated client parameters on the server side can dilute important information during aggregation. To address this, we propose a layer-wise binary masking scheme for server aggregation, aimed at preserving critical information and ensuring a high-quality global representation. The core idea is to selectively upload parameters from each layer based on their significance and layer position, distinguishing between general features in early layers and more complex, client-specific features in deeper layers (Luo et al. 2021). In detail, our strategy prioritizes uploading high-significance parameters, typically those with greater changes, in the early layers, as these are likely to capture essential patterns and features common to all client datasets, thereby enhancing the model’s generalization ability. For deeper layers, we employ a more inclusive approach, uploading a greater proportion of parameters to capture a wide range of client-specific details and the complex features essential for performance on localized tasks. The proportion of parameters uploaded from layer $i$, denoted $UP^{i}$, is calculated from the layer’s position within the network architecture using the following formula:

UP^{i} := \min\left(\max\left(\frac{i}{L}, 0.1\right), 1\right) \quad (5)

where $UP^{i}$ is constrained to be at least 0.1, ensuring that every layer contributes to the global model aggregation, and at most 1, reflecting a full update contribution.

To identify and select significant weights for sharing, we calculate the absolute weight fluctuation value of the local model within each layer after local training:

\Delta\theta_{k}^{(i,t)} := \left|\hat{\theta}_{k}^{(i,t)} - \tilde{\theta}_{k}^{(i,t)}\right| \quad (6)

Then, to focus on parameters that have undergone notable changes, we identify the top $UP^{i}$ fraction of parameters in each layer $i$ of the $k$-th client model based on their fluctuation values after the $t$-th training round:

S_{k}^{(i,t)} := \textit{top\_percent}\left(\Delta\theta_{k}^{(i,t)},\, UP^{i}\right) \quad (7)

To manage which parameters are uploaded from each layer $i$ of the local model on client $k$, we use a binary mask matrix $M_{k}^{t}$ with the same dimensions as the parameter matrix $\hat{\theta}_{k}^{t}$. Initially, all elements of this matrix are set to one. Each entry $m_{j,k}^{(i,t)}$ in $M_{k}^{(i,t)}$ is then determined by whether the corresponding parameter $\theta_{j,k}^{(i,t)}$ in $\Delta\theta_{k}^{(i,t)}$ belongs to the subset of parameters with the highest weight changes, $S_{k}^{(i,t)}$, using the following rule:

m_{j,k}^{(i,t)} = \begin{cases} 1, & \text{if } \theta_{j,k}^{(i,t)} \in S_{k}^{(i,t)} \\ 0, & \text{otherwise} \end{cases} \quad (8)

Finally, we obtain the essential parameters $\theta_{k}^{t}$ to upload by multiplying $\hat{\theta}_{k}^{t}$ element-wise with the binary mask $M_{k}^{t}$:

\theta_{k}^{t} := \hat{\theta}_{k}^{t} \odot M_{k}^{t} \quad (9)
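The sketch below strings Eqs. (5)–(9) together for one client model in PyTorch. It is an illustrative reconstruction, not the authors’ code; in particular, treating each parameter tensor as one layer and using torch.quantile to locate the top-$UP^{i}$ change threshold are our assumptions.

```python
import torch

def masked_upload(trained_model, init_model):
    """Sketch of layer-wise sparse binary masking (Eqs. (5)-(9)).
    trained_model holds theta_hat (after local training); init_model holds
    theta_tilde (after local initialization)."""
    trained = list(trained_model.parameters())
    initial = list(init_model.parameters())
    L = len(trained)                                   # each parameter tensor treated as a layer
    upload = []
    with torch.no_grad():
        for i, (p_hat, p_tilde) in enumerate(zip(trained, initial), start=1):
            up_i = min(max(i / L, 0.1), 1.0)           # Eq. (5): upload proportion for layer i
            delta = (p_hat - p_tilde).abs()            # Eq. (6): absolute weight fluctuation
            if up_i >= 1.0:
                mask = torch.ones_like(p_hat)          # the full layer is uploaded
            else:
                # Eq. (7): keep changes above the (1 - UP^i) quantile of this layer
                thresh = torch.quantile(delta.flatten().float(), 1.0 - up_i)
                mask = (delta >= thresh).to(p_hat.dtype)   # Eq. (8): binary mask
            upload.append(p_hat * mask)                # Eq. (9): masked parameters to send
    return upload
```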
Algorithm 1 FLAYER

Input: $N$ clients, $\rho$: client joining ratio, $\mathcal{L}$: loss function, $\Theta_{g}^{0}$: initial global model, $\eta$: base local learning rate, $s$: the hyperparameter of FLAYER.
Output: Well-performing local models $\tilde{\Theta}_{1},\ldots,\tilde{\Theta}_{N}$

1:  Server sends $\Theta_{g}^{0}$ to all clients to initialize local models.
2:  for iteration $t = 1,\ldots,T$ do
3:     Server samples a subset $C^{t}$ of clients according to $\rho$.
4:     Server sends $\Theta_{g}^{t-1}$ to the $|C^{t}|$ sampled clients.
5:     for client $k \in C^{t}$ in parallel do
6:        Client $k$ initializes its local model $\tilde{\Theta}_{k}^{t}$ by Equation (2).
7:        Client $k$ obtains $\hat{\Theta}_{k}^{t}$ by Equations (3)–(4).
8:           ▷ Local model training
9:        Client $k$ obtains the masked $\Theta_{k}^{t}$ by Equations (5)–(9).
10:       Client $k$ sends $\Theta_{k}^{t}$ to the server.   ▷ Uploading
11:     end for
12:     Server obtains $\Theta_{g}^{t}$ by $\Theta_{g}^{t} \leftarrow \sum_{k\in C^{t}} \frac{n_{k}}{\sum_{j\in C^{t}} n_{j}}\,\Theta_{k}^{t}$.
13:  end for
14:  return $\tilde{\Theta}_{1},\ldots,\tilde{\Theta}_{N}$

Adopting selective weight sharing, FLAYER enhances the global model representation. Our approach differs from FedMask (Li et al. 2021a), which also achieves personalization using a heterogeneous binary mask with a small overhead. However, FedMask does not consider the unique characteristics of different layers, failing to capture layer-specific information. Moreover, FedMask’s binary parameter aggregation is insufficient for complex tasks, such as CIFAR-100. In our approach, early layers, which capture universal features, are updated only with the most critical changes during server aggregation, preserving a robust foundation for all clients and preventing the dilution of essential base features. Conversely, deeper layers, which capture complex and client-specific features, receive updates from a larger proportion of parameters. This ensures the global model incorporates a diverse set of features, enhancing its generalization ability.
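On the server side (line 12 of Algorithm 1), the masked client updates are combined by data-size-weighted averaging. A minimal sketch follows, assuming each client sends its masked parameter list together with its local sample count; this is our transcription of that step, not the authors’ code.

```python
import torch

def server_aggregate(client_params, client_sizes):
    """Sketch of the server update: Theta_g <- sum_k (n_k / sum_j n_j) * Theta_k.
    client_params: list of per-client parameter lists (already masked, Eq. (9));
    client_sizes: list of local sample counts n_k."""
    total = float(sum(client_sizes))
    global_params = [torch.zeros_like(p) for p in client_params[0]]
    for params, n_k in zip(client_params, client_sizes):
        weight = n_k / total
        for g, p in zip(global_params, params):
            g.add_(p, alpha=weight)     # accumulate the weighted client contribution
    return global_params
```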

Evaluation Setup

Method | CNN: CIFAR-10 | CNN: CIFAR-100 | CNN: Tiny-ImageNet | ResNet-18: CIFAR-10 | ResNet-18: CIFAR-100 | ResNet-18: Tiny-ImageNet | fastText: AG News
FedAvg | 59.16±0.56 | 33.08±0.61 | 18.86±0.29 | 86.95±0.39 | 37.08±0.43 | 20.32±0.20 | 80.12±0.31
APPLE (model) | 89.60±0.16 | 54.45±0.24 | 39.42±0.49 | 89.78±0.19 | 57.29±0.30 | 43.26±0.55 | 95.37±0.23
Ditto (model) | 89.48±0.04 | 47.68±0.59 | 33.89±0.08 | 88.70±0.18 | 48.46±0.89 | 36.37±0.52 | 94.66±0.18
FedAMP (model) | 89.31±0.17 | 47.77±0.46 | 33.82±0.33 | 88.52±0.22 | 48.75±0.49 | 35.83±0.25 | 94.02±0.11
FedPer (layer) | 89.55±0.28 | 49.15±0.57 | 39.61±0.24 | 89.20±0.21 | 54.26±0.43 | 42.38±0.55 | 95.07±0.16
FedRep (layer) | 90.62±0.18 | 51.45±0.31 | 41.79±0.52 | 90.29±0.29 | 53.94±0.40 | 45.98±0.72 | 96.47±0.15
FedALA (element) | 90.84±0.09 | 56.98±0.18 | 45.10±0.25 | 91.30±0.35 | 58.65±0.26 | 49.09±0.89 | 96.58±0.10
FLAYER | 91.66±0.05 | 60.50±0.33 | 45.88±0.29 | 91.68±0.21 | 60.68±0.42 | 50.12±0.36 | 98.27±0.22
Table 2: The average inference accuracy (%) across all clients on CIFAR-10, CIFAR-100, Tiny-ImageNet, and AG News.

Platforms and Workloads

To evaluate the performance of FLAYER, we use a four-layer CNN (McMahan et al. 2017) and ResNet-18 (He et al. 2016) for CV tasks, training them on three benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky, Hinton et al. 2009), and Tiny-ImageNet (Chrabaszcz, Loshchilov, and Hutter 2017). For the NLP task, we train fastText (Joulin et al. 2017) on the AG News dataset (Zhang, Zhao, and LeCun 2015). We use the Dirichlet distribution Dir($\beta$) with $\beta=0.1$ (Lin et al. 2020; Wang et al. 2020) to model a high level of heterogeneity across client data. Following FedAvg, we use a batch size of 10 and a single epoch of local model training per iteration. We execute the training process five times for each task and calculate the geometric mean of training latency and inference accuracy until convergence. Our experiments consider 20 clients. The number of head layers for CNN, ResNet-18, and fastText is 1, 2, and 1, respectively. Following FedALA, we set a base learning rate of 0.1 for ResNet-18 and fastText and 0.005 for CNN during local training. All experiments were conducted on a multi-core server with a 24-core 5.7GHz Intel i9-12900K CPU and an NVIDIA RTX A5000 GPU with 24 GB of GPU memory.
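For completeness, the sketch below shows one common way to realize the Dir($\beta$) label partition described above. The paper does not provide its partitioning code, so treat this as an illustrative assumption rather than the exact setup used in the experiments.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, beta=0.1, seed=0):
    """Sketch of a non-IID split: for each class, draw client proportions from
    Dir(beta) and assign that class's samples accordingly. A smaller beta
    yields more heterogeneous client label distributions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet([beta] * num_clients)
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```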

Competitive Baselines

We compare the inference accuracy of FLAYER against six other pFL methods and FedAvg on four popular benchmark datasets: the model-wise aggregation methods APPLE, Ditto, and FedAMP; the layer-wise methods FedPer and FedRep; and the element-wise method FedALA. In addition, we evaluate FLAYER in terms of computation cost, hyperparameter sensitivity, layer similarity, data heterogeneity, scalability, and applicability.

Dataset | Method | CNN: #Iter. | CNN: Total time (s) | ResNet-18: #Iter. | ResNet-18: Total time (s)
CIFAR-10 | FedAvg | 157 | 1256 | 179 | 5191
CIFAR-10 | APPLE | 190 | 6650 | 130 | 31070
CIFAR-10 | Ditto | 51 | 1071 | 172 | 11696
CIFAR-10 | FedAMP | 47 | #517 | 191 | 7067
CIFAR-10 | FedPer | 156 | 1248 | 183 | 5307
CIFAR-10 | FedRep | 169 | 2028 | 185 | 6845
CIFAR-10 | FedALA | 152 | 1520 | 133 | 5187
CIFAR-10 | FLAYER | 78 | 858 | 53 | #2067
CIFAR-100 | FedAvg | 180 | 1620 | 181 | 5430
CIFAR-100 | APPLE | 195 | 6825 | 25 | 6000
CIFAR-100 | Ditto | 57 | 1254 | 101 | 6868
CIFAR-100 | FedAMP | 61 | 671 | 173 | 6401
CIFAR-100 | FedPer | 101 | 909 | 184 | 5704
CIFAR-100 | FedRep | 69 | 828 | 179 | 6802
CIFAR-100 | FedALA | 120 | 1200 | 76 | 2660
CIFAR-100 | FLAYER | 27 | #324 | 58 | #2378
Tiny-ImageNet | FedAvg | 48 | 2016 | 74 | 5920
Tiny-ImageNet | APPLE | 69 | 9867 | 37 | 17427
Tiny-ImageNet | Ditto | 35 | 3150 | 174 | 29754
Tiny-ImageNet | FedAMP | 28 | 1316 | 84 | 7392
Tiny-ImageNet | FedPer | 31 | 1302 | 78 | 6240
Tiny-ImageNet | FedRep | 39 | 1794 | 115 | 10350
Tiny-ImageNet | FedALA | 64 | 2944 | 48 | 4368
Tiny-ImageNet | FLAYER | 16 | #896 | 18 | #1782
Table 3: The average computation cost for CV tasks.

Experimental Results

Overall Performance

Inference accuracy. Table 2 compares the inference accuracy of FLAYER with six other SOTA pFL methods in the CV and NLP domains under Dir(0.1). APPLE gives the highest accuracy in the model-wise category, but at a high computation cost. FedPer uses a simple local aggregation strategy, utilizing global base layers and local head layers to initialize the local model, improving accuracy by an average of 18.1% over FedAvg. FedRep further enhances this by training the head and base layers separately, boosting accuracy by 19.7% over FedAvg. Building on FedPer, FedALA incorporates global information into the local head initialization, achieving a 22.7% improvement in accuracy compared to FedAvg. Previous layer-wise pFL methods recognize the different roles of base and head layers in non-IID settings and apply different strategies for integrating global and local information to initialize the local model. However, they often overlook the roles and learning capabilities of the base and head layers during the local training stage. This oversight prevents the layers from capturing local information on demand, potentially slowing convergence. FLAYER achieves the highest test accuracy among all pFL methods, with a 24.17% improvement over FedAvg, by effectively incorporating global and local information for each client in a layer-wise manner during the initialization, local training, and model updating stages.

Computation cost. Table 3 compares the computation cost of our approach with six other pFL methods and FedAvg, measured by the training time required for convergence. Except for CIFAR-10 with CNN, where FedAMP delivers the lowest training cost (but with poor inference accuracy), FLAYER gives the lowest computation cost across all other tasks, reducing total training cost by an average of 58.9% (up to 80.1%) compared to FedAvg. Specifically, model-wise methods like APPLE and Ditto involve complex calculations leading to high overhead. FedRep trains the base and head layers separately, which incurs significant training costs. FLAYER effectively incorporates both local and global information across all layers, resulting in fewer rounds needed for convergence compared to FedALA, with an average reduction of 52.7% in total training time.

Hyperparameter s | CNN: 3 | CNN: 2 | CNN: 1 | ResNet-18: 3 | ResNet-18: 2 | ResNet-18: 1
Accuracy (%) | 53.58 | 54.42 | #60.50 | 59.80 | #60.68 | 60.16
Table 4: The inference accuracy (%) of FLAYER on CIFAR-100 using CNN and ResNet-18 with various values of $s$.
Methods | Heterogeneity: Dir(0.1) | Heterogeneity: Dir(0.01) | Scalability: 20 clients | Scalability: 50 clients | Scalability: 100 clients | Applicability: Acc. | Applicability: Imps.
FedAvg | 37.08±0.43 | 43.74±0.38 | 37.08±0.43 | 34.56±0.25 | 33.08±0.41 | 60.68±0.42 | 24.77
APPLE | 57.29±0.30 | 74.52±0.19 | 57.29±0.30 | 58.09±0.24 | 48.46±0.32 | - | -
Ditto | 48.46±0.89 | 72.94±0.22 | 48.46±0.89 | 46.08±0.19 | 43.42±0.37 | 58.49±0.21 | 10.03
FedAMP | 48.75±0.49 | 73.12±0.17 | 48.75±0.49 | 46.49±0.44 | 43.74±0.20 | 60.72±0.27 | 11.97
GPFL | 51.06±0.42 | 74.59±0.21 | 51.06±0.42 | 48.30±0.29 | 44.61±0.32 | - | -
FedPer | 54.26±0.43 | 73.52±0.15 | 54.26±0.43 | 51.24±0.39 | 47.67±0.36 | 63.13±0.23 | 8.87
FedRep | 53.94±0.40 | 75.08±0.18 | 53.94±0.40 | 50.10±0.30 | 45.80±0.27 | 61.33±0.17 | 7.39
FedCP | 46.72±0.38 | 69.42±0.32 | 46.72±0.38 | 42.86±0.33 | 40.19±0.24 | - | -
FedALA | 58.65±0.26 | 75.24±0.11 | 58.65±0.26 | 59.46±0.23 | 58.80±0.41 | 63.55±0.58 | 4.90
FLAYER | #60.68±0.42 | #77.39±0.24 | #60.68±0.42 | #61.70±0.30 | #59.96±0.39 | - | -
Table 5: The inference accuracy (%) of eight FL methods across varying levels of statistical heterogeneity and scalability, and the performance improvement (%) when applying our approach to them using ResNet-18 on CIFAR-100.

Evaluation on Personalization Layers

Table 4 shows the inference accuracy of the 4-layer CNN and ResNet-18 with varying numbers of head layers (denoted $s$). For ResNet-18, the highest inference accuracy is achieved with $s$ set to 2, focusing personalization on the final two layers. For the 4-layer CNN, the best setting is $s=1$, with the remaining layers updated using the global model.

Layer Similarity

To analyze how pFL methods perform across layers on non-IID datasets, we measure the Centered Kernel Alignment (CKA) (Kornblith et al. 2019) similarity of features from the same layer of different clients’ models using identical test samples. This analysis helps to evaluate the balance between personalization and generalization of different pFL methods. Figure 2 presents the CKA similarity across 20 clients for FedAvg, APPLE, FedRep, FedALA, and FLAYER on CIFAR-10, highlighting the changes from the initial round to convergence. We observe that after training, the similarity of the base layers in both the CNN and ResNet-18 improves across all FL methods, indicating that the global model effectively captures common features shared by different clients. The deeper layers show lower similarity, with the head layers exhibiting the least, reflecting their focus on localized, client-specific data. Additionally, the simpler structure of the 4-layer CNN results in higher similarity across all layers compared to ResNet-18, suggesting it is less capable of capturing specialized features. FLAYER achieves moderate similarity levels in the base and head layers, suggesting that it balances well between integrating global patterns and adapting to local specifics, thereby enhancing overall model performance.
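For reference, linear CKA between two feature matrices can be computed as follows. This is the standard formulation of Kornblith et al. (2019), not code from this paper; the feature-extraction step that produces X and Y from a given layer is assumed.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between feature matrices X and Y,
    each of shape (num_samples, feature_dim), extracted from the same layer
    of two different clients' models on identical test samples."""
    X = X - X.mean(dim=0, keepdim=True)      # center the features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2    # ||Y^T X||_F^2
    norm_x = (X.t() @ X).norm(p="fro")       # ||X^T X||_F
    norm_y = (Y.t() @ Y).norm(p="fro")       # ||Y^T Y||_F
    return (hsic / (norm_x * norm_y)).item()
```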

(Figure 2 panels (a)–(d); panel (c): ResNet-18 at round 1.)
Figure 2: The average CKA similarities of the same layers in different local models with CIFAR-10 under Dir(0.1).

Evaluation on Data Heterogeneity

We also evaluate the impact of statistical heterogeneity on FLAYER and other SOTA pFL methods using 20 clients. Specifically, we consider two degrees of heterogeneity on CIFAR-100: $\beta=0.01$ (a smaller $\beta$ yields a more heterogeneous setting) and the baseline $\beta=0.1$. Table 5 reports the performance under these varying degrees of heterogeneity. FLAYER consistently outperforms all other SOTA pFL methods across all heterogeneous settings, delivering the highest accuracy.

Scalability

To evaluate the scalability of our approach, we vary the number of clients from 20 to 100 when applying ResNet-18 to CIFAR-100, with the heterogeneity parameter set to Dir(0.1). Table 5 compares the average inference accuracy of FLAYER and the other pFL methods. FLAYER consistently outperforms the others across all client counts. While all methods lose accuracy as the client count increases from 50 to 100, FLAYER declines by less than 2%. In contrast, APPLE shows a significant performance drop, with a 9.6% decrease in inference accuracy in the same scenario. This underlines the efficiency of FLAYER in managing larger numbers of clients.

Applicability

Our evaluation so far applied FLAYER to FedAvg. We now apply FLAYER to other FL methods to evaluate the generalization ability of our approach. Note that FLAYER does not replace the foundational architectures of an FL method. Table 5 reports the inference accuracy and improvements achieved after applying our approach to an underlying FL method. FLAYER improves the accuracy of all pFL methods, boosting the accuracy by 4.90% to 11.97%.

Figure 3: The ablation study with CIFAR-100, conducted under a Dir(0.1) distribution.

Ablation Study

Figure 3 presents the accuracy of the three strategies in FLAYER when each is used alone: Adaptive Aggregation (Agg.) Only, Adaptive Learning Rate (LR) Only, and Masking Only. The results show that Adaptive LR Only achieves the highest accuracy on both ResNet-18 and the CNN when trained on CIFAR-100. For the CNN, the full FLAYER exhibits a convergence speed similar to Adaptive LR Only, suggesting that the learning rate is the most influential factor for CNN performance. While Masking Only shows a convergence trend comparable to FLAYER on ResNet-18, it converges more slowly on the CNN than the other two strategies. Masking Only benefits deeper network structures by prioritizing critical parameters involved in residual connections, thereby preserving the integrity of these connections and enhancing performance and convergence in deeper networks like ResNet-18. Adaptive Agg. Only brings the least benefit and has the slowest convergence when used alone. However, it is essential for incorporating local and global information into the head layers before training, laying a solid foundation for the adaptive learning rate and masking strategies. During ResNet-18 training, accuracy improves significantly around the 50th iteration, aligning with the trend observed for Adaptive Agg. Only. While Adaptive LR Only is crucial for performance, particularly for CNNs, the combination of Adaptive Agg., Adaptive LR, and Masking in FLAYER offers a balanced, synergistic strategy that leverages the strengths of each scheme.

Discussion

Computation cost. FLAYER introduces additional computational tasks for FL clients, such as calculating the L2 norm of the gradient per layer for adaptive learning rates and creating a masking matrix for critical parameters. Although these tasks incur per-iteration costs, they are offset by a reduction in overall training time. We further reduce computation costs using parallel processing and PyTorch’s optimized operations. Future plans include deploying FLAYER on real-world FL testbeds and enhancing efficiency for resource-limited devices through advanced caching and hierarchical FL strategies.

Application scenarios. FLAYER supports mobile applications such as predictive text and image annotation by training personalized models directly on devices, ensuring privacy and relevance to user preferences. The system optimizes model performance through adaptive learning and reduces battery impact by conducting training during charging periods. Future work will expand its applications to other sectors and further assess its real-world effectiveness.

Related Work

Previous pFL methods for managing non-IID data issues typically fall into two categories: personalizing the global model and customizing individual models for each client.

Global Model Personalization

Global model personalization aims to adjust the global model to suit diverse client data distributions, creating a model that universally benefits all clients. This typically involves training the global model on varied data and local adaptations for each client’s specific data. Previous studies have explored strategies to mitigate data heterogeneity and improve the global model’s generalization (Pillutla et al. 2022; Zhang et al. 2023a).

Learning Personalized Models

Personalized model learning tailors individual models to each client’s data, emphasizing local adaptation. This approach often employs weighted aggregation methods for local model personalization.

Model-wise aggregation. These methods train personalized models for each client by combining clients’ models using weighted aggregation. For example, FedFomo (Zhang et al. 2020) employs a distance metric for weighted aggregation, while APPLE (Luo and Wu 2022) introduces an adaptive mechanism to balance global and local objectives. FedAMP (Huang et al. 2021) uses attention functions for client-specific models, and Ditto (Li et al. 2021b) incorporates a proximal term for personalized models. However, existing model-wise aggregation methods may overlook complex variations and unique characteristics in client data, leading to suboptimal personalization.

Layer-wise aggregation. These methods customize different layers to varying extents, as in FedPer (Arivazhagan et al. 2019) and FedRep (Collins et al. 2021). Moreover, pFedLA (Ma et al. 2022) uses hypernetworks to update layer-wise aggregation weights, at a high computational cost. All of these approaches ignore the impact of diverse local data on the base and head layers during the training process, which limits further improvements in accuracy.

Element-wise aggregation. This is the most fine-grained local aggregation approach, aggregating at the parameter level. FedALA (Zhang et al. 2023b) introduces an element-level aggregation weight matrix in the head layers, enhancing accuracy across various tasks. However, it requires extra computation for weight calculation and does not account for the distinct roles and learning abilities of different layers during training.

Conclusion

We have presented FLAYER, a new layer-wise pFL approach to optimize FL in the face of non-IID data. FLAYER adaptively adjusts the aggregation weights and learning rate and selects layer-wise masking to effectively incorporate local and global information throughout all network layers. Experimental results show that FLAYER achieves the best inference accuracy and significantly reduces computational overheads compared to existing pFL methods.

References

  • Arivazhagan et al. (2019) Arivazhagan, M. G.; Aggarwal, V.; Singh, A. K.; and Choudhary, S. 2019. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818.
  • Chrabaszcz, Loshchilov, and Hutter (2017) Chrabaszcz, P.; Loshchilov, I.; and Hutter, F. 2017. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819.
  • Collins et al. (2021) Collins, L.; Hassani, H.; Mokhtari, A.; and Shakkottai, S. 2021. Exploiting shared representations for personalized federated learning. In International conference on machine learning, 2089–2099. PMLR.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Huang et al. (2021) Huang, Y.; Chu, L.; Zhou, Z.; Wang, L.; Liu, J.; Pei, J.; and Zhang, Y. 2021. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI conference on artificial intelligence, volume 35, 7865–7873.
  • Joulin et al. (2017) Joulin, A.; Grave, É.; Bojanowski, P.; and Mikolov, T. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–431.
  • Kornblith et al. (2019) Kornblith, S.; Norouzi, M.; Lee, H.; and Hinton, G. 2019. Similarity of neural network representations revisited. In International conference on machine learning, 3519–3529. PMLR.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Li et al. (2021a) Li, A.; Sun, J.; Zeng, X.; Zhang, M.; Li, H.; and Chen, Y. 2021a. Fedmask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, 42–55.
  • Li et al. (2021b) Li, T.; Hu, S.; Beirami, A.; and Smith, V. 2021b. Ditto: Fair and robust federated learning through personalization. In International conference on machine learning, 6357–6368. PMLR.
  • Lin et al. (2020) Lin, T.; Kong, L.; Stich, S. U.; and Jaggi, M. 2020. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33: 2351–2363.
  • Luo and Wu (2022) Luo, J.; and Wu, S. 2022. Adapt to adaptation: Learning personalization for cross-silo federated learning. In IJCAI: proceedings of the conference, volume 2022, 2166. NIH Public Access.
  • Luo et al. (2021) Luo, M.; Chen, F.; Hu, D.; Zhang, Y.; Liang, J.; and Feng, J. 2021. No fear of heterogeneity: Classifier calibration for federated learning with non-iid data. Advances in Neural Information Processing Systems, 34: 5972–5984.
  • Ma et al. (2022) Ma, X.; Zhang, J.; Guo, S.; and Xu, W. 2022. Layer-wised model aggregation for personalized federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10092–10101.
  • McMahan et al. (2017) McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, 1273–1282. PMLR.
  • Niu and Deng (2022) Niu, Y.; and Deng, W. 2022. Federated learning for face recognition with gradient correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 1999–2007.
  • Pillutla et al. (2022) Pillutla, K.; Malik, K.; Mohamed, A.-R.; Rabbat, M.; Sanjabi, M.; and Xiao, L. 2022. Federated learning with partial model personalization. In International Conference on Machine Learning, 17716–17758. PMLR.
  • Shamsian et al. (2021) Shamsian, A.; Navon, A.; Fetaya, E.; and Chechik, G. 2021. Personalized federated learning using hypernetworks. In International Conference on Machine Learning, 9489–9502. PMLR.
  • Singh et al. (2015) Singh, B.; De, S.; Zhang, Y.; Goldstein, T.; and Taylor, G. 2015. Layer-specific adaptive learning rates for deep networks. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 364–368. IEEE.
  • Song, Granqvist, and Talwar (2022) Song, C.; Granqvist, F.; and Talwar, K. 2022. Flair: Federated learning annotated image repository. Advances in Neural Information Processing Systems, 35: 37792–37805.
  • Tan et al. (2022) Tan, A. Z.; Yu, H.; Cui, L.; and Yang, Q. 2022. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems.
  • Wang et al. (2020) Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; and Poor, H. V. 2020. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 7611–7623.
  • Yosinski et al. (2014) Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems, 27.
  • Zhang et al. (2023a) Zhang, J.; Hua, Y.; Wang, H.; Song, T.; Xue, Z.; Ma, R.; Cao, J.; and Guan, H. 2023a. Gpfl: Simultaneously learning global and personalized feature information for personalized federated learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5041–5051.
  • Zhang et al. (2023b) Zhang, J.; Hua, Y.; Wang, H.; Song, T.; Xue, Z.; Ma, R.; and Guan, H. 2023b. FedALA: Adaptive local aggregation for personalized federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 11237–11244.
  • Zhang et al. (2020) Zhang, M.; Sapra, K.; Fidler, S.; Yeung, S.; and Alvarez, J. M. 2020. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565.
  • Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.
  • Zhao et al. (2018) Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; and Chandra, V. 2018. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582.
  • Zhu et al. (2021) Zhu, H.; Xu, J.; Liu, S.; and Jin, Y. 2021. Federated learning on non-IID data: A survey. Neurocomputing, 465: 371–390.
  • Zhu, Hong, and Zhou (2021) Zhu, Z.; Hong, J.; and Zhou, J. 2021. Data-free knowledge distillation for heterogeneous federated learning. In International conference on machine learning, 12878–12889. PMLR.