
Towards Personalized Federated Learning via Comprehensive Knowledge Distillation

Pengju Wang, Bochao Liu, Weijia Guo, Yong Li, Shiming Ge

Pengju Wang, Bochao Liu, Weijia Guo, Yong Li, and Shiming Ge are with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China, and also with the School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China. Emails: {wangpengju, liubochao, guoweijia, liyong, geshiming}@iie.ac.cn. Shiming Ge is the corresponding author.
Abstract

Federated learning is a distributed machine learning paradigm designed to protect data privacy. However, data heterogeneity across various clients results in catastrophic forgetting, where the model rapidly forgets previous knowledge while acquiring new knowledge. To address this challenge, personalized federated learning has emerged to customize a personalized model for each client. However, the inherent limitation of this mechanism is its excessive focus on personalization, potentially hindering the generalization of those models. In this paper, we present a novel personalized federated learning method that uses global and historical models as teachers and the local model as the student to facilitate comprehensive knowledge distillation. The historical model represents the local model from the last round of client training, containing historical personalized knowledge, while the global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge. By applying knowledge distillation, we effectively transfer global generalized knowledge and historical personalized knowledge to the local model, thus mitigating catastrophic forgetting and enhancing the general performance of personalized models. Extensive experimental results demonstrate the significant advantages of our method.

I Introduction

The rapid evolution of distributed intelligent systems has brought data privacy to the forefront. Federated Learning (FL), a distributed machine learning paradigm, enables collaborative model training through parameter sharing rather than raw data exchange, effectively reducing the risk of exposing sensitive data [1]. By leveraging FL, we can efficiently utilize data from distributed clients to collectively train high-performance models. FL has demonstrated its significant value in a variety of fields, including medical health [2], financial analytics [3], and social networks [4]. In traditional FL, multiple clients collaborate to train a global model that seeks an optimal universal solution for all clients. However, the non-independent and identically distributed (non-IID) nature of client data [5], known as data heterogeneity, often causes a decline in the performance of distributed clients and can even lead to catastrophic forgetting. Fig. 1 depicts this phenomenon: the data distribution varies in each communication round, and in every round each client replaces its local model with the global model. Unfortunately, the post-update local model exhibits a significant performance decline compared to the pre-update local model, indicating that previously learned knowledge has been forgotten.

Personalized Federated Learning (PFL) is an innovative approach to tackling data heterogeneity in FL [6]. It involves the collaborative training of a global model by all clients, after which each client develops a personalized model through personalized strategies, reflecting the distinctive characteristics of its local data. However, while existing PFL methods excel in model personalization, they often neglect model generalization. For instance, pFedSD [7] employs knowledge distillation to transfer knowledge from the historical model to the local model to achieve model personalization. Nevertheless, this may hinder the local model's generalization, as the historical model is the previous local model and thus carries personalized knowledge. The primary limitation of existing PFL methods lies in their design principles, which overemphasize personalized learning and may lead to model overfitting on individual clients, thereby reducing their adaptability to varied clients [8].

Figure 1: Data heterogeneity in FL leads to catastrophic forgetting. (a) Data distribution. (b) Catastrophic forgetting.

In order to alleviate catastrophic forgetting and achieve a balance between generalization and personalization, we propose Personalized Federated Learning via Comprehensive Knowledge Distillation (FedCKD). Our method integrates multi-teacher knowledge distillation [9] into FL for comprehensive knowledge transfer. The global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge, while the historical model represents the local model from the last round of client training, containing historical personalized knowledge. Employing multi-teacher knowledge distillation, we use the global and historical models as teachers and the local model as the student. This effectively transfers both global generalized knowledge and historical personalized knowledge to the local model: global generalized knowledge enhances model performance, whereas historical personalized knowledge addresses the issue of catastrophic forgetting.

In summary, our primary contributions are as follows: (1) We propose a novel PFL method called FedCKD. Through multi-teacher knowledge distillation, our method effectively transfers global generalized knowledge and historical personalized knowledge to the local model, thereby addressing catastrophic forgetting and enhancing model performance. (2) We introduce an annealing mechanism into knowledge distillation that dynamically adjusts the weight factor in the loss function, facilitating a smooth transition of training from knowledge transfer to local learning. This mechanism enhances the model's personalization ability and improves the stability of the training process. (3) We validate the superior performance of our method through an extensive series of experiments, surpassing existing state-of-the-art methods.

II Related Work

Federated Learning allows multiple clients to collaboratively train local models without sharing their data, addressing data privacy concerns. However, FL faces the challenge of data heterogeneity: data distributions across clients can vary significantly, potentially leading to subpar performance of traditional FL methods such as FedAvg [1] and FedProx [10]. In response to this challenge, PFL has emerged. By introducing personalized strategies such as parameter decoupling, model regularization, personalized aggregation, or knowledge distillation, PFL aims to enhance the model's ability to learn from the unique data characteristics of each client while ensuring global generality. Through parameter decoupling, FedPer [11] develops a global representation on the server while retaining a unique head for each client, and LG-FedAvg [12] merges the benefits of localized representation learning with the training of a unified global model. Through model regularization, pFedMe [13] leverages Moreau envelopes as a regularization loss to streamline local model training. Through personalized aggregation, FedFomo [14] implements first-order model optimization in each client's update phase to refine personalization. Through knowledge distillation, pFedSD [7] transfers personalized knowledge from the previous round of models to improve the current round. Although existing PFL methods have made significant progress, they frequently concentrate solely on enhancing model personalization for individual clients, overlooking the vital aspect of maintaining model generalization. Future research should aim to achieve a balance between generalization and personalization, exploring novel strategies that enable models to offer personalized solutions for each client while maintaining adaptability to various clients.

III Methodology

III-A Personalized Federated Learning

In traditional FL, training involves a server and $n$ clients. Each client, denoted as $C_k$ ($k=1,2,\ldots,n$), possesses its own private data $\mathcal{D}_k=\{\bm{x}_k,\bm{y}_k\}$, where $\bm{x}_k$ denotes the samples and $\bm{y}_k$ denotes the labels. The collective data across all clients is represented as $\mathcal{D}=\bigcup_{k=1}^{n}\mathcal{D}_k$. The goal in FL is to derive a global model $\bm{w}_g$ that minimizes the overall loss function,

\min_{\bm{w}_g}\sum_{k=1}^{n}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}\mathcal{L}_k(\mathcal{D}_k;\bm{w}_g), \qquad (1)

where $\mathcal{L}_k(\mathcal{D}_k;\bm{w}_g)$ is the empirical loss for client $C_k$.

The training process comprises four critical phases. In the server broadcast phase, the server sends its global model $\bm{w}_g$ to all clients. In the client update phase, each client replaces its local model with the global model, $\bm{w}_k \leftarrow \bm{w}_g$, and then trains on its own data by applying $\bm{w}_k \leftarrow \bm{w}_k - \eta\nabla\mathcal{L}_k(\mathcal{D}_k;\bm{w}_k)$. In the client upload phase, each client uploads its updated local model $\bm{w}_k$ back to the server. In the global aggregation phase, the server aggregates the local models into a new global model $\bm{w}_g \leftarrow \sum_{k=1}^{n}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}\bm{w}_k$. These phases are iterated until the model converges or a predefined number of training rounds is reached.
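For concreteness, the sketch below shows one such communication round in PyTorch-style Python. It is a minimal illustration of the standard FedAvg-style loop, not the authors' code; the helper name fedavg_round, the state_dict averaging, and the optimizer settings are assumptions made only for illustration.

import copy
import torch
import torch.nn.functional as F

def fedavg_round(global_model, client_loaders, lr=0.01, local_epochs=5):
    # One FL round: broadcast, local SGD on each client, upload, weighted aggregation.
    local_states, sizes = [], []
    for loader in client_loaders:                      # each loader holds one client's private data D_k
        local_model = copy.deepcopy(global_model)      # client update: w_k <- w_g
        opt = torch.optim.SGD(local_model.parameters(), lr=lr, momentum=0.9)
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(local_model(x), y).backward()
                opt.step()
        local_states.append(local_model.state_dict())  # client upload
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    # Global aggregation: w_g <- sum_k (|D_k|/|D|) w_k
    new_state = {name: sum(s[name].float() * (n / total) for s, n in zip(local_states, sizes))
                 for name in local_states[0]}
    global_model.load_state_dict(new_state)
    return global_model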

Figure 2: The framework of our method. The client update phase is divided into local distillation and local store processes. During the local distillation process, comprehensive knowledge distillation transfers knowledge from the global model $\bm{w}^{t-1}_g$ and the historical model $\bm{w}^{t-1}_h$ to the local model $\bm{w}^{t-1}_k$. During the local store process, the emphasis is on preserving the local model for the next round, $\bm{w}^{t}_h \leftarrow \bm{w}^{t}_k$. Different colors differentiate knowledge types: blue represents generalized knowledge, and other colors indicate personalized knowledge.

Despite its clear advantages in overall performance, the global model $\bm{w}_g$ suffers a significant performance decline in heterogeneous data scenarios. This decline is largely due to the global model's objective in FL to be universally applicable, which often does not cater to the unique data of individual clients. To address this, PFL customizes a personalized model $\bm{w}_k$ for each client, effectively mitigating the challenges of data heterogeneity and enabling a tailored exploration of each client's unique data characteristics. The optimization problem in PFL is formulated as follows:

\min_{\bm{w}_1,\dots,\bm{w}_n}\sum_{k=1}^{n}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}\mathcal{L}_k(\mathcal{D}_k;\bm{w}_k). \qquad (2)

Should we focus on the model’s generalized performance, representing its performance across extensive data, or on its personalized performance, representing its adaptability to specific data? Striking a balance between these two aspects is essential in crafting personalized strategies that effectively capture the unique data distribution characteristics of each client while maintaining generality.

III-B Comprehensive Knowledge Distillation

Knowledge distillation (KD) transfers knowledge from a trained teacher model to an untrained student model [15]. Let $p_t$ and $p_s$ denote the outputs of the teacher and student models, respectively. To smooth the output distribution, a temperature parameter $\tau$ is used in the softmax function $p_i^{\tau}=\frac{\exp(z_i/\tau)}{\sum_{j}\exp(z_j/\tau)}$, where $z_i$ is the $i$-th output logit. KD aims to minimize the discrepancy between the teacher and student models. The loss function is defined as follows:

\mathcal{L}=\mathcal{L}_{\rm CE}(p_s,y)+\lambda\mathcal{L}_{\rm KL}(p_s^{\tau},p_t^{\tau}), \qquad (3)

where $\mathcal{L}_{\rm CE}$ is the cross-entropy loss between the student outputs $p_s$ and the hard labels $y$, $\mathcal{L}_{\rm KL}$ is the Kullback-Leibler divergence between the soft labels $p_s^{\tau}$ and $p_t^{\tau}$, and $\lambda$ is the weight factor.
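As a concrete reading of Eq. (3), the following PyTorch-style sketch computes the loss for one mini-batch. It is our own illustration under the common KD convention (student log-probabilities against temperature-softened teacher probabilities); the helper name kd_loss and the default values are assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, lam=0.5, tau=3.0):
    # Hard-label cross-entropy term L_CE(p_s, y).
    ce = F.cross_entropy(student_logits, labels)
    # Temperature-softened distributions p_s^tau and p_t^tau.
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits.detach() / tau, dim=1)
    # KL term between the softened student and teacher outputs.
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")
    return ce + lam * kl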

Recently, KD has become a critical technique in FL [16], offering innovative solutions to tackle the challenge of catastrophic forgetting, exemplified by methods such as pFedSD [7]. This method facilitates the transfer of knowledge from the historical model to the local model for each client. However, it places excessive emphasis on utilizing personalized knowledge from the historical model while overlooking generalized knowledge of the global model.

As shown in Fig. 2, we propose a comprehensive knowledge distillation method based on multi-teacher knowledge distillation. In our method, we retain the local model from the previous round of training, referred to as the historical model, to preserve personalized knowledge. The global model, in contrast, embodies generalized knowledge. We employ both the global and historical models as teachers, with the local model serving as the student, to facilitate a thorough knowledge transfer that balances generalization and personalization. The loss function is defined as follows:

\mathcal{L}_k=\mathcal{L}_{\rm CE}(p_k,y)+\lambda\mathcal{L}_{\rm KL}(p_k^{\tau},p_g^{\tau})+\lambda\mathcal{L}_{\rm KL}(p_k^{\tau},p_h^{\tau}), \qquad (4)

where $p_k^{\tau}$, $p_g^{\tau}$, and $p_h^{\tau}$ represent the soft labels of the local model $\bm{w}_k$, the global model $\bm{w}_g$, and the historical model $\bm{w}_h$, respectively.
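Extending the single-teacher sketch above, Eq. (4) could be computed per mini-batch roughly as follows. This is again a hedged sketch of our own, assuming both teacher models are frozen during local training; fedckd_loss is an illustrative name, not the authors' API.

import torch.nn.functional as F

def fedckd_loss(local_logits, global_logits, hist_logits, labels, lam=0.5, tau=3.0):
    # Hard-label cross-entropy term L_CE(p_k, y).
    ce = F.cross_entropy(local_logits, labels)
    log_p_k = F.log_softmax(local_logits / tau, dim=1)
    # Global teacher carries generalized knowledge; historical teacher carries personalized knowledge.
    p_g = F.softmax(global_logits.detach() / tau, dim=1)
    p_h = F.softmax(hist_logits.detach() / tau, dim=1)
    kl_g = F.kl_div(log_p_k, p_g, reduction="batchmean")
    kl_h = F.kl_div(log_p_k, p_h, reduction="batchmean")
    return ce + lam * kl_g + lam * kl_h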

To facilitate effective knowledge transfer, we implement an annealing mechanism. Initially, the student model primarily learns from the teachers' soft labels. As training advances, $\lambda$ is annealed, gradually shifting the emphasis toward the hard labels. We employ exponential decay for annealing, which simplifies the process by reducing the weight factor $\lambda$ at a constant decay rate $\gamma$ in each communication round.
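The text specifies only a constant per-round decay rate; one consistent reading of this schedule (the exact form is our assumption) is

\lambda^{(t)} = \gamma\,\lambda^{(t-1)} = \gamma^{t}\,\lambda^{(0)},

where $t$ indexes the communication round, $\lambda^{(0)}$ is the initial weight factor, and $\gamma\in(0,1)$ is the decay rate.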

Algorithm 1 FedCKD
Input: number of clients $n$, participation rate $r$, communication rounds $T$, local epochs $E$, learning rate $\eta$
Output: personalized models $\bm{w}_1,\bm{w}_2,\ldots,\bm{w}_n$
procedure Server
    Server initializes $\bm{w}^0_g$
    for round $t=1,2,\ldots,T$ do
        Server samples clients $K\leftarrow rn$
        Server sends $\bm{w}^{t-1}_g$ to clients $K$
        for client $k=1,2,\ldots,K$ do
            for epoch $e=1,2,\ldots,E$ do
                $\bm{w}^{t}_k \leftarrow$ Client$(\mathcal{D}_k;\bm{w}^{t-1}_k,\bm{w}^{t-1}_g,\bm{w}^{t-1}_h)$
            end for
        end for
        Server aggregates $\bm{w}^{t}_g \leftarrow \sum_{k=1}^{K}\frac{|\mathcal{D}_k|}{|\mathcal{D}|}\bm{w}^{t}_k$
    end for
    return $\bm{w}_1,\bm{w}_2,\ldots,\bm{w}_n$
end procedure
procedure Client($\mathcal{D}_k;\bm{w}^{t-1}_k,\bm{w}^{t-1}_g,\bm{w}^{t-1}_h$)
    Client updates $\bm{w}^{t-1}_k \leftarrow \bm{w}^{t-1}_g$
    for local data $\mathcal{D}_k$ do
        $\bm{w}^{t}_k \leftarrow \bm{w}^{t-1}_k - \eta\nabla\mathcal{L}_k(\mathcal{D}_k;\bm{w}^{t-1}_k,\bm{w}^{t-1}_g,\bm{w}^{t-1}_h)$
    end for
    Client stores $\bm{w}^{t}_h \leftarrow \bm{w}^{t}_k$
    return $\bm{w}^{t}_k$ to server
end procedure

Algorithm 1 shows FedCKD in detail. During each round, the server selects a subset of clients at random and sends the global model to them. These clients subsequently update their local model with the global model. Comprehensive knowledge distillation is then performed to transfer knowledge from both the global and historical models to the local model, and the trained local model is preserved as the historical model for future training rounds. Lastly, clients send their local model back to the server, where multiple local models are aggregated to create an enhanced global model.
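The client-side procedure described above could be assembled as in the sketch below, reusing the fedckd_loss helper from Section III-B. It is a minimal sketch under our own assumptions (deep copies for model snapshots, frozen teachers, plain SGD), not the released implementation.

import copy
import torch

def client_update(local_model, global_model, hist_model, loader,
                  lam=0.5, tau=3.0, epochs=5, lr=0.01):
    # Client update phase: start local training from the received global model.
    local_model.load_state_dict(global_model.state_dict())
    global_model.eval()
    hist_model.eval()  # both teachers stay frozen during local distillation
    opt = torch.optim.SGD(local_model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=1e-5)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                g_logits = global_model(x)   # generalized knowledge
                h_logits = hist_model(x)     # personalized knowledge
            loss = fedckd_loss(local_model(x), g_logits, h_logits, y, lam=lam, tau=tau)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Local store phase: this round's local model becomes next round's historical teacher.
    new_hist_model = copy.deepcopy(local_model)
    return local_model, new_hist_model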

Figure 3: Data heterogeneity among $20$ clients on the CIFAR100 dataset. (a) Practical setting $\alpha=0.10$. (b) Pathological setting $s=20$.

IV Experiments

IV-A Experimental Setup

Datasets. We simulate two heterogeneous settings using three datasets: FMNIST [17], CIFAR10 [18], and CIFAR100 [18]. In the practical setting, the data distribution reflects real-world diversity and is drawn from a Dirichlet distribution $Dir(\alpha)$, with a default $\alpha=0.10$ for all datasets. In the pathological setting, the data distribution is extremely unbalanced: each client is assigned imbalanced data from $s=2/2/20$ classes out of the $10/10/100$ classes in the FMNIST/CIFAR10/CIFAR100 datasets. Fig. 3 illustrates the distribution of the CIFAR100 dataset among $20$ clients, where circle size indicates the number of samples and circle color indicates the label.
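A common way to realize such a $Dir(\alpha)$ split is sketched below; the paper does not give its partitioning code, so the function name dirichlet_partition and the per-class sampling scheme are our own assumptions for illustration.

import numpy as np

def dirichlet_partition(labels, num_clients=20, alpha=0.10, seed=0):
    # Assign the indices of each class to clients with proportions drawn from Dir(alpha).
    # Smaller alpha yields more heterogeneous splits (each client dominated by a few classes).
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices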

Models. We utilize two commonly used models: a simple CNN for the FMNIST dataset, and a five-layer CNN for the CIFAR10 and CIFAR100 datasets, adhering to the model outlined in [7] to guarantee fairness.

Baselines. We contrast our method FedCKD with state-of-the-art methods. For traditional FL methods, we evaluate against FedAvg [1] and FedProx [10]. For PFL methods, we evaluate against FedPer [11], LG-FedAvg [12], pFedMe [13], FedFomo [14], and pFedSD [7].

Hyperparameters. We explore two scenarios: (1) $n=20$ clients with a participation rate of $r=1.00$ over $T=50$ communication rounds; (2) $n=100$ clients with a participation rate of $r=0.10$ over $T=100$ communication rounds. Local training uses $E=5$ local epochs and a batch size of $B=64$. The SGD optimizer is used with a learning rate of $0.01$, a momentum factor of $0.90$, and a weight decay of $1\mathrm{e}{-5}$. For comparability, our method sets the weight factor to $\lambda=0.50$ and the distillation temperature to $\tau=3$, in line with [7], while the decay rate of the annealing mechanism is set to $\gamma=0.99$. For the other methods, we strictly adhere to the settings detailed in [7]. We repeat each experiment three times and report the mean and standard deviation. Evaluations are based on the average test accuracy across all clients.

Figure 4: Learning curves under different experimental settings. (a) Client scale $n=20$. (b) Client scale $n=100$.
Figure 5: Accuracy difference (%) among $20$ clients on the CIFAR100 dataset. (a) Practical setting $\alpha=0.10$. (b) Pathological setting $s=20$.

IV-B Performance Comparison

Test accuracy. We assess the test accuracy on three datasets in the practical and pathological settings, as presented in Tables I and II. Our method, FedCKD, consistently outperforms the other methods across settings. A notable highlight is its performance in the practical setting on the CIFAR100 dataset with $20$ clients, where FedCKD exceeds the closest baseline, pFedSD, by a significant margin of $1.98\%$. Furthermore, FedCKD exhibits remarkable stability, as indicated by its low standard deviation in most cases. In contrast, the performance of FedAvg and FedProx is subpar because they fail to adequately account for the diverse distribution of data. FedPer and LG-FedAvg share only partial parameters of the local model, overlooking vital local knowledge embedded in the remaining parameters. pFedMe enhances the model's alignment with each client's data by incorporating regularization terms; nevertheless, an excessive emphasis on personalization can compromise the global model's generalization. FedFomo relies on client-specific weights for model aggregation, potentially diminishing the cohesive integration of global knowledge. pFedSD leverages personalized knowledge from historical models to enhance performance, yet overlooks the generalized knowledge within the global model. The learning curve illustrates how a model's accuracy evolves as the number of communication rounds increases. In the practical setting, as shown in Fig. 4, FedCKD consistently outperforms pFedSD across all datasets and communication rounds, affirming its effectiveness. Overall, FedCKD significantly improves model accuracy in heterogeneous settings by balancing personalized and generalized knowledge.

TABLE I: Test accuracy (%) on different datasets in the practical setting
Dataset | Client | FedAvg | FedProx | FedPer | LG-FedAvg | pFedMe | FedFomo | pFedSD | FedCKD
FMNIST | n=20 | 90.15±0.10 | 90.05±0.13 | 96.30±0.03 | 95.22±0.22 | 93.24±0.08 | 95.38±0.04 | 96.57±0.08 | 96.61±0.06
FMNIST | n=100 | 86.82±0.39 | 76.78±0.30 | 94.99±0.17 | 91.41±0.08 | 92.05±0.08 | 92.06±0.14 | 95.97±0.06 | 95.98±0.04
CIFAR10 | n=20 | 50.44±0.68 | 50.50±0.78 | 80.74±0.47 | 78.61±0.31 | 79.51±0.28 | 79.29±0.34 | 82.08±0.46 | 82.94±0.13
CIFAR10 | n=100 | 49.04±0.80 | 42.11±0.95 | 78.89±0.90 | 71.60±1.52 | 74.62±0.72 | 73.16±0.65 | 80.22±0.24 | 81.41±0.35
CIFAR100 | n=20 | 32.24±0.32 | 32.50±0.55 | 52.12±0.22 | 40.71±0.05 | 38.42±0.71 | 44.67±0.37 | 55.27±0.30 | 57.25±0.07
CIFAR100 | n=100 | 29.13±0.18 | 14.27±0.35 | 44.45±0.64 | 21.61±0.21 | 27.07±0.25 | 26.48±0.27 | 48.91±0.77 | 50.20±0.39
TABLE II: Test accuracy (%) on different datasets in the pathological setting
Dataset | Client | FedAvg | FedProx | FedPer | LG-FedAvg | pFedMe | FedFomo | pFedSD | FedCKD
FMNIST | n=20 | 75.71±0.27 | 75.40±1.06 | 99.42±0.01 | 99.15±0.04 | 98.76±0.03 | 99.28±0.01 | 99.45±0.01 | 99.49±0.01
FMNIST | n=100 | 84.13±0.44 | 74.38±0.90 | 97.28±0.12 | 95.70±0.08 | 94.87±0.03 | 96.14±0.41 | 97.42±0.04 | 97.49±0.02
CIFAR10 | n=20 | 45.37±0.53 | 45.86±1.12 | 91.69±0.10 | 92.21±0.36 | 89.69±0.39 | 91.66±0.17 | 92.52±0.24 | 92.83±0.16
CIFAR10 | n=100 | 43.52±1.70 | 38.24±1.39 | 86.66±0.24 | 79.25±1.63 | 80.47±1.10 | 81.62±0.54 | 86.81±0.32 | 87.24±0.20
CIFAR100 | n=20 | 31.92±0.23 | 32.16±0.32 | 55.57±0.26 | 45.48±0.63 | 38.66±1.21 | 48.84±0.11 | 59.54±0.01 | 60.70±0.02
CIFAR100 | n=100 | 27.88±0.35 | 13.32±0.37 | 47.01±0.50 | 18.96±0.30 | 26.96±0.30 | 25.21±0.63 | 50.99±0.43 | 52.48±0.15

Individual personalization. Individual personalization is critical in PFL because it allows the model to take into account each client's unique data characteristics and distribution. Fig. 5 shows the difference in individual accuracy between our method FedCKD and the baseline pFedSD for each client. In the practical setting, FedCKD demonstrates superior accuracy compared to pFedSD on $17$ clients and lower accuracy on $3$ clients, i.e., FedCKD outperforms pFedSD in $85.00\%$ of the cases. In the pathological setting, FedCKD's accuracy exceeds that of pFedSD on $14$ clients and lags behind on $6$ clients, resulting in higher accuracy for $70.00\%$ of the clients. These findings suggest that FedCKD exhibits strong personalized performance, showcasing its robust capability to adapt to varying client data distributions.

Individual fairness. The main objectives of individual fairness are to achieve high average accuracy and an even accuracy distribution among clients; the metric for individual fairness is the standard deviation of model accuracy across all clients. In our experiments in the practical setting with $100$ clients, we recorded each client's model accuracy and computed the standard deviation. As shown in Table III, our method, FedCKD, achieves higher test accuracy and lower standard deviation than the other PFL methods in most cases. Notably, on the CIFAR10 dataset, FedCKD attains the highest average accuracy of $81.02\%$ and the lowest standard deviation of $8.84\%$. Moreover, on the FMNIST dataset, FedCKD attains a standard deviation similar to pFedSD. Despite a slight increase in standard deviation on the CIFAR100 dataset, FedCKD significantly surpasses the other PFL methods in average accuracy. Overall, our method showcases improved fairness, leading to enhanced performance across all clients.
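The fairness metric reduces to two per-client statistics; a minimal sketch (our own, assuming a list of per-client test accuracies is available after evaluation) is:

import numpy as np

def fairness_stats(client_accuracies):
    # Average accuracy across clients and its standard deviation (lower std = fairer).
    acc = np.asarray(client_accuracies, dtype=float)
    return acc.mean(), acc.std()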

TABLE III: Test accuracy (%) across all clients
Method | FMNIST | CIFAR10 | CIFAR100
FedPer | 94.69±5.08 | 79.25±10.20 | 44.45±5.93
LG-FedAvg | 91.31±7.20 | 72.56±12.26 | 21.78±4.95
pFedMe | 92.14±6.82 | 75.43±11.29 | 26.82±5.07
FedFomo | 91.96±6.79 | 73.09±11.91 | 26.43±5.52
pFedSD | 95.92±3.88 | 80.17±9.23 | 48.62±6.00
FedCKD | 96.03±3.98 | 81.02±8.84 | 49.86±5.56

IV-C Ablation Study

Contribution of annealing mechanism. We investigate the contribution of the annealing mechanism employed in KD. The results on the CIFAR100 dataset among $20$ clients in the practical setting are shown in Table IV. With the annealing mechanism, the mean accuracy increases from $57.08\%$ to $57.25\%$, while the standard deviation decreases from $0.15\%$ to $0.07\%$. These results demonstrate that the annealing mechanism not only enhances the model's performance but also improves its stability.

TABLE IV: Test accuracy (%) with and without the annealing mechanism
Method | w/o Annealing | w/ Annealing
FedCKD | 57.08±0.15 | 57.25±0.07
TABLE V: Test accuracy (%) under various experimental settings (data heterogeneity α, participation rate r, model architecture)
Method | α=0.01 | α=0.10 | α=1.00 | r=0.20 | r=0.60 | r=1.00 | ResNet | MobileNet
FedAvg | 29.02±0.43 | 31.54±0.35 | 33.78±0.33 | 28.09±0.15 | 31.54±0.35 | 32.24±0.32 | 23.57±1.72 | 31.92±1.10
FedProx | 18.19±0.44 | 21.23±0.46 | 31.42±0.43 | 16.52±0.42 | 21.23±0.46 | 32.50±0.55 | 20.81±0.66 | 31.36±0.75
FedPer | 64.47±0.43 | 52.14±0.65 | 34.12±0.27 | 53.25±0.13 | 52.14±0.65 | 52.12±0.22 | 57.20±0.68 | 59.74±0.38
LG-FedAvg | 54.27±0.64 | 37.77±0.08 | 21.36±0.58 | 29.53±0.73 | 37.77±0.08 | 40.71±0.05 | 16.02±0.26 | 18.26±1.33
pFedMe | 46.97±0.67 | 38.64±0.38 | 24.95±0.35 | 36.97±0.52 | 38.64±0.38 | 38.42±0.71 | 32.96±0.98 | 42.88±0.77
FedFomo | 58.52±0.37 | 44.14±0.21 | 29.41±0.03 | 42.42±0.29 | 44.14±0.21 | 44.67±0.37 | 45.62±0.31 | 41.87±0.34
pFedSD | 67.11±0.58 | 56.61±0.34 | 37.81±0.33 | 55.24±0.55 | 56.61±0.34 | 55.27±0.30 | 59.74±0.90 | 63.28±0.28
FedCKD | 67.85±0.26 | 57.11±0.27 | 40.64±0.36 | 56.81±0.28 | 57.11±0.27 | 57.25±0.07 | 60.30±0.10 | 63.45±0.27

IV-D Sensitivity Analysis

We conduct a sensitivity analysis to investigate the robustness to data heterogeneity, participation rate, and model architecture on the CIFAR100 dataset. The default hyperparameters, unless stated otherwise, are $\alpha=0.10$, $n=20$, and $r=1.00$.

Robustness to data heterogeneity. Data heterogeneity is a fundamental factor that directly influences the model's learning efficiency and ultimate performance. In the experiments, we set $\alpha=\{0.01,0.10,1.00\}$; lower $\alpha$ values correspond to more heterogeneous data distributions. As detailed in Table V, FedCKD consistently outperforms the other methods in these settings, illustrating its robustness across different levels of heterogeneity. Remarkably, when $\alpha=1.00$, FedCKD exceeds pFedSD by $2.83\%$. In contrast, LG-FedAvg, pFedMe, and FedFomo perform worse than the traditional FL methods.

Robustness to participation rate. The participation rate plays a vital role in PFL, as it determines the proportion of clients contributing to each global model update. In the experiments, we set $r=\{0.20,0.60,1.00\}$. Table V shows that most methods exhibit reduced performance as the participation rate decreases. This decline is due to the restricted data participation in each round, which prevents the model from fully utilizing the overall data. Notably, pFedSD's performance declines when the participation rate reaches $1.00$, likely because the wide variances in client data distributions introduce extra noise into the model updates. In contrast, our method, FedCKD, effectively mitigates the impact of this noise.

Robustness to model architecture. Model architecture also plays a significant role in PFL because different architectures may exhibit varying adaptability and learning capabilities when handling specific types of data. In the experiments, we utilize two more complex models, namely ResNet-8 [19] and MobileNetV2 [20]. Table V demonstrates that our method, FedCKD, significantly outperforms the baseline methods across these model architectures. This superiority is evident in the improved mean accuracy and reduced standard deviation, suggesting its ability to maintain consistent and efficient performance across model architectures.

V Conclusion

This paper presents FedCKD, a comprehensive knowledge distillation method for PFL, designed to address catastrophic forgetting and achieve a balance between generalization and personalization. Utilizing multi-teacher knowledge distillation, we effectively transfer knowledge from global and historical models to the local model. The global model contains generalized knowledge, while the historical model holds personalized knowledge. Employing the distillation mechanism, we achieve a balance and integration of these two types of knowledge, significantly boosting the model’s performance. We also implement an annealing mechanism to further enhance performance and stability. Extensive experiments demonstrate the superiority of FedCKD over existing methods. However, it has inherent limitations as the KD process often requires additional computational resources. Future research will concentrate on implementing more efficient KD techniques in PFL to reduce computational costs.

References

  • [1] B. McMahan, E. Moore, D. Ramage et al., “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics Conference, 2017, pp. 1273–1282.
  • [2] N. Rana and H. Marwaha, “Role of federated learning in healthcare systems: A survey,” Mathematical Foundations of Computing, vol. 7, no. 4, pp. 459–484, 2024.
  • [3] A. Abadi, B. Doyle, F. Gini et al., “Starlit: Privacy-preserving federated learning to enhance financial fraud detection,” arXiv:2401.10765, 2024.
  • [4] Y. Chen, L. Liang, and W. Gao, “DFedSN: Decentralized federated learning based on heterogeneous data in social networks,” World Wide Web, vol. 26, no. 5, pp. 2545–2568, 2023.
  • [5] W. Guo, Z. Yao, Y. Liu et al., “A new federated learning model for host intrusion detection system under non-iid data,” in IEEE International Conference on Systems, Man, and Cybernetics, 2023, pp. 494–500.
  • [6] A. Z. Tan, H. Yu, L. Cui et al., “Towards personalized federated learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 9587–9603, 2021.
  • [7] H. Jin, D. Bai, D. Yao et al., “Personalized edge intelligence via federated self-knowledge distillation,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 2, pp. 567–580, 2023.
  • [8] L. Yi, H. Yu, C. Ren et al., “pFedMoE: Data-level personalization with mixture of experts for model-heterogeneous personalized federated learning,” arXiv:2402.01350, 2024.
  • [9] S. You, C. Xu, C. Xu et al., “Learning from multiple teacher networks,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1285–1294.
  • [10] T. Li, A. K. Sahu, M. Zaheer et al., “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, 2020, pp. 429–450.
  • [11] M. G. Arivazhagan, V. Aggarwal, A. K. Singh et al., “Federated learning with personalization layers,” arXiv:1912.00818, 2019.
  • [12] P. P. Liang, T. Liu, Z. Liu et al., “Think locally, act globally: Federated learning with local and global representations,” arXiv:2001.01523, 2020.
  • [13] C. T. Dinh, N. Tran, and J. Nguyen, “Personalized federated learning with moreau envelopes,” in Advances in Neural Information Processing Systems, 2020, pp. 21 394–21 405.
  • [14] M. Zhang, K. Sapra, S. Fidler et al., “Personalized federated learning with first order model optimization,” in International Conference on Learning Representations, 2020.
  • [15] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv:1503.02531, 2015.
  • [16] L. Qin, T. Zhu, W. Zhou et al., “Knowledge distillation in federated learning: A survey on long lasting challenges and new solutions,” arXiv:2406.10861, 2024.
  • [17] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017.
  • [18] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical Report, 2009.
  • [19] K. He, X. Zhang, S. Ren et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [20] A. G. Howard, M. Zhu, B. Chen et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.