Towards Personalized Federated Learning via Comprehensive Knowledge Distillation
Abstract
Federated learning is a distributed machine learning paradigm designed to protect data privacy. However, data heterogeneity across various clients results in catastrophic forgetting, where the model rapidly forgets previous knowledge while acquiring new knowledge. To address this challenge, personalized federated learning has emerged to customize a personalized model for each client. However, the inherent limitation of this mechanism is its excessive focus on personalization, potentially hindering the generalization of those models. In this paper, we present a novel personalized federated learning method that uses global and historical models as teachers and the local model as the student to facilitate comprehensive knowledge distillation. The historical model represents the local model from the last round of client training, containing historical personalized knowledge, while the global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge. By applying knowledge distillation, we effectively transfer global generalized knowledge and historical personalized knowledge to the local model, thus mitigating catastrophic forgetting and enhancing the general performance of personalized models. Extensive experimental results demonstrate the significant advantages of our method.
I Introduction
The rapid evolution of distributed intelligent systems has brought data privacy to the forefront. Federated Learning (FL), a distributed machine learning paradigm, enables collaborative model training through parameter sharing rather than raw data exchange, effectively reducing the risk of exposing sensitive data [1]. By leveraging FL, we can efficiently utilize data from distributed clients to collectively train high-performance models. FL has demonstrated its significant value in a variety of fields, including medical health [2], financial analytics [3], and social networks [4]. In traditional FL, multiple clients collaborate to train a global model that serves as an optimal universal solution for all clients. However, the non-independent and identically distributed (non-IID) nature of client data [5], known as data heterogeneity, often degrades the performance of distributed clients and can even lead to catastrophic forgetting. Fig. 1 depicts this phenomenon: the data distribution varies in each communication round, and during each round distributed clients update their local models with the global model. Unfortunately, the post-update local model exhibits a significant performance decline compared to the pre-update local model, indicating that previously learned knowledge has been forgotten.
Personalized Federated Learning (PFL) is an innovative approach to tackling data heterogeneity in FL [6]. It involves the collaborative training of a global model by all clients, after which each client develops a personalized model through personalized strategies, reflecting the distinctive characteristics of its local data. However, while existing PFL methods excel in model personalization, they often neglect model generalization. For instance, pFedSD [7] employs knowledge distillation to transfer knowledge from the historical model to the local model to achieve model personalization. Nevertheless, this may hinder the local model's generalization, as the historical model represents the previous local model and incorporates only personalized knowledge. The primary limitation of existing PFL methods lies in their design principles, which overemphasize personalized learning and may lead to model overfitting on individual clients, thereby reducing their adaptability to varied clients [8].


To alleviate catastrophic forgetting and achieve a balance between generalization and personalization, we propose Personalized Federated Learning via Comprehensive Knowledge Distillation (FedCKD). Our method integrates multi-teacher knowledge distillation [9] into FL for comprehensive knowledge transfer. The global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge, while the historical model represents the local model from the last round of client training, containing historical personalized knowledge. Employing multi-teacher knowledge distillation, we use the global and historical models as teachers and the local model as the student. This approach effectively transfers both global generalized knowledge and historical personalized knowledge to the local model: the global generalized knowledge enhances model performance, whereas the historical personalized knowledge mitigates catastrophic forgetting.
In summary, our primary contributions are as follows: (1) We propose a novel PFL method called FedCKD. Through multi-teacher knowledge distillation, our method effectively transfers global generalized knowledge and historical personalized knowledge to the local model, thereby addressing catastrophic forgetting and enhancing model performance. (2) We introduce an annealing mechanism into knowledge distillation that dynamically adjusts the weight factor in the loss function, facilitating a smooth transition from knowledge transfer to local learning. This mechanism enhances the model's personalization ability and improves the stability of training. (3) We validate the superior performance of our method through an extensive series of experiments, in which it surpasses existing state-of-the-art methods.
II Related Work
Federated Learning allows multiple clients to collaboratively train local models without sharing their data, addressing data privacy concerns. However, FL faces the challenge of data heterogeneity: data distributions across clients can vary significantly, potentially leading to subpar performance of traditional FL methods such as FedAvg [1] and FedProx [10]. In response to this challenge, PFL has emerged. By introducing personalized strategies such as parameter decoupling, model regularization, personalized aggregation, or knowledge distillation, PFL aims to enhance the model's ability to learn from the unique data characteristics of each client while ensuring global generality. Through parameter decoupling, FedPer [11] develops a global representation on the server while retaining a unique head for each client, and LG-FedAvg [12] merges the benefits of localized representation learning with the training of a unified global model. Through model regularization, pFedMe [13] leverages Moreau envelopes as a regularization loss to streamline local model training. Through personalized aggregation, FedFomo [14] implements first-order model optimization in each client's update phase to refine personalization. Through knowledge distillation, pFedSD [7] transfers personalized knowledge from the previous round of models to improve the current round. Although existing PFL methods have made significant progress, they frequently concentrate solely on enhancing model personalization for individual clients, overlooking the vital aspect of maintaining model generalization. Future research should aim to achieve a balance between generalization and personalization, exploring novel strategies that enable models to offer personalized solutions for each client while maintaining adaptability across clients.
III Methodology
III-A Personalized Federated Learning
In traditional FL, training involves a server and $N$ clients. Each client $i$ ($i = 1, \dots, N$) possesses its own private data $D_i = \{(x_j, y_j)\}_{j=1}^{|D_i|}$, where $x_j$ denotes the sample and $y_j$ denotes the label. The collective data across all clients is represented as $D = \bigcup_{i=1}^{N} D_i$. The goal in FL is to derive a global model $w$ that minimizes the overall loss function,
$$\min_{w} F(w) = \sum_{i=1}^{N} \frac{|D_i|}{|D|} F_i(w), \qquad (1)$$

where $F_i(w) = \frac{1}{|D_i|} \sum_{(x_j, y_j) \in D_i} \ell(w; x_j, y_j)$ is the empirical loss for client $i$, with $\ell(\cdot)$ denoting the per-sample loss.
The training process comprises four critical phases. In the server broadcast phase, the server sends its global model $w$ to all clients. In the client update phase, each client $i$ replaces its local model $w_i$ with the received global model and trains it on its own data $D_i$. In the client upload phase, each client uploads its updated local model back to the server. In the global aggregation phase, the server aggregates the local models to generate a new global model $w$. These phases are executed iteratively until the model converges or a predefined number of training rounds is reached.
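To make the global aggregation phase concrete, the following minimal sketch performs a FedAvg-style weighted average of client parameters in PyTorch. It illustrates the standard aggregation rule implied by Eq. (1), not code released with this paper; the function name and the `state_dict` representation are our assumptions.

```python
import torch
from typing import Dict, List

def aggregate_models(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of client parameters (FedAvg-style aggregation).

    Each client is weighted by |D_i| / |D|, matching the coefficients in Eq. (1).
    """
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        # Accumulate the weighted sum of this parameter tensor across clients.
        # Note: integer buffers (e.g., BatchNorm counters) are averaged as floats
        # in this simplified sketch.
        agg = torch.zeros_like(client_states[0][key], dtype=torch.float32)
        for state, size in zip(client_states, client_sizes):
            agg += (size / total) * state[key].float()
        global_state[key] = agg
    return global_state
```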

Despite its clear advantages in overall performance, the global model suffers a significant performance decline in heterogeneous data scenarios. This decline arises largely because the FL objective seeks a single model that is universally applicable, which often fails to cater to the unique data of individual clients. To address this, PFL customizes a personalized model for each client, effectively mitigating the challenges of data heterogeneity and enabling a tailored exploration of each client's unique data characteristics. The optimization problem in PFL is formulated as follows:
$$\min_{\{w_1, \dots, w_N\}} \sum_{i=1}^{N} \frac{|D_i|}{|D|} F_i(w_i), \qquad (2)$$

where $w_i$ denotes the personalized model of client $i$.
Should we focus on the model’s generalized performance, representing its performance across extensive data, or on its personalized performance, representing its adaptability to specific data? Striking a balance between these two aspects is essential in crafting personalized strategies that effectively capture the unique data distribution characteristics of each client while maintaining generality.
III-B Comprehensive Knowledge Distillation
Knowledge distillation (KD) transfers knowledge from a trained teacher model to an untrained student model [15]. Let $z_t$ and $z_s$ denote the output logits of the teacher and student models, respectively. To smooth the output distribution, a temperature parameter $\tau$ is used in the softmax function $\sigma(z)_k = \exp(z_k/\tau) / \sum_{j} \exp(z_j/\tau)$, where $z_k$ is the $k$-th output element. KD aims to minimize the discrepancy between the teacher and student models. The loss function is defined as follows:
$$\mathcal{L} = (1-\alpha)\, \mathcal{L}_{CE}\big(\sigma(z_s), y\big) + \alpha\, \mathcal{L}_{KL}\big(\sigma(z_t)\,\|\,\sigma(z_s)\big), \qquad (3)$$

where $\mathcal{L}_{CE}$ is the Cross-Entropy loss between the soft labels $\sigma(z_s)$ and the hard labels $y$, $\mathcal{L}_{KL}$ is the Kullback-Leibler divergence loss between the soft labels $\sigma(z_t)$ and the soft labels $\sigma(z_s)$, and $\alpha$ is the weight factor.
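For reference, a minimal PyTorch sketch of the single-teacher distillation loss in Eq. (3) could look as follows. The default `alpha` and `tau` values and the temperature-squared scaling of the KL term follow common KD practice [15] rather than settings reported in this paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            alpha: float = 0.5,
            tau: float = 3.0) -> torch.Tensor:
    """Single-teacher knowledge distillation loss, cf. Eq. (3)."""
    # Hard-label term: cross-entropy between student predictions and labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)  # temperature-squared scaling, a common KD convention
    return (1.0 - alpha) * ce + alpha * kl
```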
Recently, KD has become a critical technique in FL [16], offering innovative solutions to tackle the challenge of catastrophic forgetting, exemplified by methods such as pFedSD [7]. This method facilitates the transfer of knowledge from the historical model to the local model for each client. However, it places excessive emphasis on utilizing personalized knowledge from the historical model while overlooking generalized knowledge of the global model.
As shown in Fig. 2, we propose a comprehensive knowledge distillation method based on multi-teacher knowledge distillation. In our method, we retain the local model from the previous round of training, referred to as the historical model, to preserve personalized knowledge. The global model, in contrast, embodies generalized knowledge. We employ both the global and historical models as teachers, with the local model serving as the student, to facilitate a thorough knowledge transfer that balances generalization and personalization. The loss function is defined as follows:
$$\mathcal{L} = (1-\alpha)\, \mathcal{L}_{CE}\big(\sigma(z_s), y\big) + \frac{\alpha}{2} \Big[ \mathcal{L}_{KL}\big(\sigma(z_g)\,\|\,\sigma(z_s)\big) + \mathcal{L}_{KL}\big(\sigma(z_h)\,\|\,\sigma(z_s)\big) \Big], \qquad (4)$$

where $\sigma(z_s)$ represents the soft labels of the local model, $\sigma(z_g)$ represents the soft labels of the global model, and $\sigma(z_h)$ represents the soft labels of the historical model.
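A corresponding sketch of the comprehensive distillation loss in Eq. (4) is shown below. It treats the global and historical teachers with equal weight, which is an assumption of this illustration; the helper names are ours.

```python
import torch.nn.functional as F

def soft_kl(student_logits, teacher_logits, tau):
    """KL divergence between temperature-softened teacher and student outputs."""
    return F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)

def comprehensive_kd_loss(local_logits, global_logits, historical_logits,
                          labels, alpha=0.5, tau=3.0):
    """Two-teacher distillation loss, cf. Eq. (4): the global model supplies
    generalized knowledge, the historical model supplies personalized knowledge."""
    ce = F.cross_entropy(local_logits, labels)
    kl_global = soft_kl(local_logits, global_logits, tau)
    kl_hist = soft_kl(local_logits, historical_logits, tau)
    # Equal averaging of the two teachers is an assumption of this sketch.
    return (1.0 - alpha) * ce + alpha * 0.5 * (kl_global + kl_hist)
```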
To facilitate effective knowledge transfer, we implement an annealing mechanism. Initially, the student model primarily learns from the teachers' soft labels. As training advances, the weight factor $\alpha$ undergoes annealing, gradually shifting the emphasis toward the hard labels. We employ exponential decay for annealing, which simplifies the process by decreasing $\alpha$ at a constant decay rate in each communication round.
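The annealing schedule can be sketched as a per-round exponential decay of the weight factor; the starting value and decay rate below are placeholders rather than the paper's actual settings.

```python
def annealed_alpha(initial_alpha: float, decay_rate: float, round_idx: int) -> float:
    """Exponentially decay the distillation weight factor over communication rounds.

    Early rounds emphasize the teachers' soft labels; later rounds shift the
    loss in Eq. (4) toward the hard labels of the local data.
    """
    return initial_alpha * (decay_rate ** round_idx)

# Example: with decay_rate = 0.95, alpha shrinks smoothly round by round:
# round 0 -> a0, round 10 -> ~0.60 * a0, round 50 -> ~0.08 * a0
```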
Algorithm 1 shows FedCKD in detail. During each round, the server selects a subset of clients at random and sends the global model to them. These clients subsequently update their local model with the global model. Comprehensive knowledge distillation is then performed to transfer knowledge from both the global and historical models to the local model, and the trained local model is preserved as the historical model for future training rounds. Lastly, clients send their local model back to the server, where multiple local models are aggregated to create an enhanced global model.
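Putting the pieces together, the sketch below outlines one client update round in the spirit of Algorithm 1, reusing the hypothetical `comprehensive_kd_loss` and `annealed_alpha` helpers from the earlier snippets; the optimizer settings and loop structure are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch

def client_update(local_model, global_model, historical_model, loader,
                  round_idx, alpha0=0.5, decay=0.95, tau=3.0,
                  lr=0.01, epochs=1, device="cpu"):
    """One round of local training with comprehensive knowledge distillation."""
    # Start local training from the received global model.
    local_model.load_state_dict(global_model.state_dict())
    local_model.to(device).train()
    global_model.to(device).eval()
    historical_model.to(device).eval()

    alpha = annealed_alpha(alpha0, decay, round_idx)
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():  # teachers provide fixed soft labels
                z_g = global_model(x)
                z_h = historical_model(x)
            z_s = local_model(x)
            loss = comprehensive_kd_loss(z_s, z_g, z_h, y, alpha=alpha, tau=tau)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Preserve a copy of the trained local model as next round's historical teacher.
    new_historical = copy.deepcopy(local_model)
    return local_model.state_dict(), new_historical
```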


IV Experiments
IV-A Experimental Setup
Datasets. We simulate two heterogeneous settings using three datasets: FMNIST [17], CIFAR10 [18], and CIFAR100 [18]. In the practical setting, data distribution reflects real-world diversity and is characterized by a Dirichlet distribution, with the same default concentration parameter for all datasets. In the pathological setting, data distribution is extremely unbalanced: each client is assigned imbalanced data from only a subset of the classes in the FMNIST/CIFAR10/CIFAR100 datasets. Fig. 3 illustrates the distribution of the CIFAR100 dataset among clients, with circle size indicating the number of samples and circle color indicating the sample labels.
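For reference, a common recipe for the Dirichlet-based label heterogeneity of the practical setting is sketched below; the concentration parameter is left as an argument because the paper's default value is not reproduced here, and the function name is ours.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_clients: int,
                        concentration: float, seed: int = 0):
    """Split sample indices across clients with Dirichlet label heterogeneity.

    Smaller `concentration` values yield more skewed (heterogeneous) partitions.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportion of class-c samples assigned to each client.
        proportions = rng.dirichlet(np.full(num_clients, concentration))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```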
Models. We utilize two commonly used models: a simple CNN for the FMNIST dataset and a five-layer CNN for the CIFAR10 and CIFAR100 datasets, following the models used in [7] to ensure a fair comparison.
Baselines. We contrast our method FedCKD with state-of-the-art methods. For traditional FL methods, we evaluate against FedAvg [1] and FedProx [10]. For PFL methods, we evaluate against FedPer [11], LG-FedAvg [12], pFedMe [13], FedFomo [14], and pFedSD [7].
Hyperparameters. We explore two scenarios that differ in the number of clients, the client participation rate, and the number of communication rounds. Training is performed with a fixed number of local epochs and a fixed batch size. The SGD optimizer is used with fixed learning rate, momentum, and weight decay. For comparability, our method sets the weight factor and distillation temperature in line with [7], and the annealing mechanism uses a constant decay rate. When employing alternative methods, we strictly adhere to the settings detailed in [7]. We repeat each experiment three times and report both the mean and the standard deviation. Evaluations are conducted based on the average test accuracy across all clients.




IV-B Performance Comparison
Test accuracy. We assess test accuracy on the three datasets in the practical and pathological settings, as presented in Tables I and II. Our method, FedCKD, consistently outperforms the other methods across settings. A notable highlight is its performance in the practical setting on the CIFAR100 dataset, where FedCKD exceeds the closest baseline, pFedSD, by a clear margin. Furthermore, FedCKD exhibits remarkable stability, as indicated by its low standard deviation in most cases. In contrast, the performance of FedAvg and FedProx is subpar because they fail to adequately account for the diverse distribution of data. FedPer and LG-FedAvg share only partial parameters of the local model, overlooking vital local knowledge embedded in the remaining parameters. pFedMe enhances the model's alignment with each client's data by incorporating regularization terms; nevertheless, an excessive emphasis on personalization can compromise the global model's generalization. FedFomo relies on client-specific weights for model aggregation, potentially diminishing the cohesive integration of global knowledge. pFedSD leverages personalized knowledge from historical models to enhance performance, yet overlooks the generalized knowledge within the global model. The learning curve illustrates how a model's accuracy evolves as the number of communication rounds increases. In the practical setting, as shown in Fig. 4, FedCKD consistently outperforms pFedSD across all datasets and communication rounds, affirming its effectiveness. Overall, FedCKD significantly improves model accuracy in heterogeneous settings by balancing personalized and generalized knowledge.
Table I: Test accuracy (%) in the practical setting (mean ± standard deviation over three runs).

| Dataset | Client | FedAvg | FedProx | FedPer | LG-FedAvg | pFedMe | FedFomo | pFedSD | FedCKD |
|---|---|---|---|---|---|---|---|---|---|
| FMNIST | | 90.15±0.10 | 90.05±0.13 | 96.30±0.03 | 95.22±0.22 | 93.24±0.08 | 95.38±0.04 | 96.57±0.08 | 96.61±0.06 |
| FMNIST | | 86.82±0.39 | 76.78±0.30 | 94.99±0.17 | 91.41±0.08 | 92.05±0.08 | 92.06±0.14 | 95.97±0.06 | 95.98±0.04 |
| CIFAR10 | | 50.44±0.68 | 50.50±0.78 | 80.74±0.47 | 78.61±0.31 | 79.51±0.28 | 79.29±0.34 | 82.08±0.46 | 82.94±0.13 |
| CIFAR10 | | 49.04±0.80 | 42.11±0.95 | 78.89±0.90 | 71.60±1.52 | 74.62±0.72 | 73.16±0.65 | 80.22±0.24 | 81.41±0.35 |
| CIFAR100 | | 32.24±0.32 | 32.50±0.55 | 52.12±0.22 | 40.71±0.05 | 38.42±0.71 | 44.67±0.37 | 55.27±0.30 | 57.25±0.07 |
| CIFAR100 | | 29.13±0.18 | 14.27±0.35 | 44.45±0.64 | 21.61±0.21 | 27.07±0.25 | 26.48±0.27 | 48.91±0.77 | 50.20±0.39 |
Table II: Test accuracy (%) in the pathological setting (mean ± standard deviation over three runs).

| Dataset | Client | FedAvg | FedProx | FedPer | LG-FedAvg | pFedMe | FedFomo | pFedSD | FedCKD |
|---|---|---|---|---|---|---|---|---|---|
| FMNIST | | 75.71±0.27 | 75.40±1.06 | 99.42±0.01 | 99.15±0.04 | 98.76±0.03 | 99.28±0.01 | 99.45±0.01 | 99.49±0.01 |
| FMNIST | | 84.13±0.44 | 74.38±0.90 | 97.28±0.12 | 95.70±0.08 | 94.87±0.03 | 96.14±0.41 | 97.42±0.04 | 97.49±0.02 |
| CIFAR10 | | 45.37±0.53 | 45.86±1.12 | 91.69±0.10 | 92.21±0.36 | 89.69±0.39 | 91.66±0.17 | 92.52±0.24 | 92.83±0.16 |
| CIFAR10 | | 43.52±1.70 | 38.24±1.39 | 86.66±0.24 | 79.25±1.63 | 80.47±1.10 | 81.62±0.54 | 86.81±0.32 | 87.24±0.20 |
| CIFAR100 | | 31.92±0.23 | 32.16±0.32 | 55.57±0.26 | 45.48±0.63 | 38.66±1.21 | 48.84±0.11 | 59.54±0.01 | 60.70±0.02 |
| CIFAR100 | | 27.88±0.35 | 13.32±0.37 | 47.01±0.50 | 18.96±0.30 | 26.96±0.30 | 25.21±0.63 | 50.99±0.43 | 52.48±0.15 |
Individual personalization. Individual personalization is critical in PFL because it allows the model to account for each client's unique data characteristics and distribution. Fig. 5 shows the per-client accuracy difference between our method FedCKD and the baseline pFedSD. In the practical setting, FedCKD demonstrates higher accuracy than pFedSD on most clients and lower accuracy on only a few, so FedCKD outperforms pFedSD in the majority of cases. In the pathological setting, FedCKD's accuracy likewise exceeds that of pFedSD on most clients and lags behind on only a few. These findings suggest that FedCKD exhibits strong personalized performance, showcasing its robust capability to adapt to varying client data distributions.
Individual fairness. The main objectives of individual fairness are to achieve high average accuracy and to ensure an even accuracy distribution among clients; the metric for individual fairness is therefore the standard deviation of model accuracy across all clients. In our experiments in the practical setting, we recorded each client's model accuracy and computed the standard deviation. As shown in Table III, our method, FedCKD, achieves higher test accuracy and lower standard deviation in most cases compared to the other PFL methods. Notably, on the CIFAR10 dataset, FedCKD attains the highest average accuracy of 81.02% and the lowest standard deviation of 8.84. Moreover, on the FMNIST dataset, FedCKD attains a standard deviation similar to pFedSD. Despite a slight increase in standard deviation on the CIFAR100 dataset, FedCKD significantly surpasses the other PFL methods in terms of average accuracy. Overall, our method showcases improved fairness, leading to enhanced performance across all clients.
Table III: Average test accuracy (%) ± standard deviation across clients (individual fairness) in the practical setting.

| Method | FMNIST | CIFAR10 | CIFAR100 |
|---|---|---|---|
| FedPer | 94.69±5.08 | 79.25±10.20 | 44.45±5.93 |
| LG-FedAvg | 91.31±7.20 | 72.56±12.26 | 21.78±4.95 |
| pFedMe | 92.14±6.82 | 75.43±11.29 | 26.82±5.07 |
| FedFomo | 91.96±6.79 | 73.09±11.91 | 26.43±5.52 |
| pFedSD | 95.92±3.88 | 80.17±9.23 | 48.62±6.00 |
| FedCKD | 96.03±3.98 | 81.02±8.84 | 49.86±5.56 |
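The fairness metric reported in Table III can be computed directly from per-client results, as in the small sketch below (function name illustrative).

```python
import statistics

def fairness_summary(client_accuracies):
    """Return (mean, std) of per-client test accuracies.

    A higher mean with a lower standard deviation indicates better
    individual fairness across clients.
    """
    mean_acc = statistics.mean(client_accuracies)
    std_acc = statistics.pstdev(client_accuracies)  # population std over all clients
    return mean_acc, std_acc

# Example: fairness_summary([81.0, 79.5, 83.2, 77.8]) -> (80.375, ~1.99)
```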
IV-C Ablation Study
Contribution of annealing mechanism. We investigate the contribution of the annealing mechanism employed in KD. The results on the CIFAR100 dataset in the practical setting are shown in Table IV. With the annealing mechanism, the mean accuracy increases from 57.08 to 57.25, while the standard deviation decreases from 0.15 to 0.07. These results demonstrate that the annealing mechanism not only enhances the model's performance but also improves its stability.
Table IV: Test accuracy (%) of FedCKD on CIFAR100 in the practical setting, with and without the annealing mechanism.

| Method | w/o Annealing | w/ Annealing |
|---|---|---|
| FedCKD | 57.08±0.15 | 57.25±0.07 |
Table V: Test accuracy (%) on the CIFAR100 dataset under varying data heterogeneity, participation rates, and model architectures.

| Method | Data Heterogeneity | | | Participation Rate | | | Model Architecture | |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | ResNet | MobileNet |
| FedAvg | 29.02±0.43 | 31.54±0.35 | 33.78±0.33 | 28.09±0.15 | 31.54±0.35 | 32.24±0.32 | 23.57±1.72 | 31.92±1.10 |
| FedProx | 18.19±0.44 | 21.23±0.46 | 31.42±0.43 | 16.52±0.42 | 21.23±0.46 | 32.50±0.55 | 20.81±0.66 | 31.36±0.75 |
| FedPer | 64.47±0.43 | 52.14±0.65 | 34.12±0.27 | 53.25±0.13 | 52.14±0.65 | 52.12±0.22 | 57.20±0.68 | 59.74±0.38 |
| LG-FedAvg | 54.27±0.64 | 37.77±0.08 | 21.36±0.58 | 29.53±0.73 | 37.77±0.08 | 40.71±0.05 | 16.02±0.26 | 18.26±1.33 |
| pFedMe | 46.97±0.67 | 38.64±0.38 | 24.95±0.35 | 36.97±0.52 | 38.64±0.38 | 38.42±0.71 | 32.96±0.98 | 42.88±0.77 |
| FedFomo | 58.52±0.37 | 44.14±0.21 | 29.41±0.03 | 42.42±0.29 | 44.14±0.21 | 44.67±0.37 | 45.62±0.31 | 41.87±0.34 |
| pFedSD | 67.11±0.58 | 56.61±0.34 | 37.81±0.33 | 55.24±0.55 | 56.61±0.34 | 55.27±0.30 | 59.74±0.90 | 63.28±0.28 |
| FedCKD | 67.85±0.26 | 57.11±0.27 | 40.64±0.36 | 56.81±0.28 | 57.11±0.27 | 57.25±0.07 | 60.30±0.10 | 63.45±0.27 |
IV-D Sensitivity Analysis
We conduct a sensitivity analysis to investigate robustness to data heterogeneity, participation rate, and model architecture on the CIFAR100 dataset. Unless stated otherwise, the default hyperparameters are used.
Robustness to data heterogeneity. Data heterogeneity is a fundamental factor that directly influences the model's learning efficiency and ultimate performance. In the experiments, we vary the concentration parameter of the Dirichlet distribution; lower values increase data distribution heterogeneity. As detailed in Table V, FedCKD consistently outperforms the other methods in these settings, illustrating its robustness across different levels of heterogeneity. Remarkably, FedCKD exceeds pFedSD by up to 2.83 percentage points across the tested heterogeneity levels. However, LG-FedAvg, pFedMe, and FedFomo exhibit subpar performance compared to the traditional FL methods.
Robustness to participation rate. The participation rate plays a vital role in PFL, as it determines the proportion of clients contributing to the global model update in each round. In the experiments, we vary the client participation rate. Table V shows that most methods exhibit reduced performance as the participation rate decreases. This decline occurs because fewer data participate in each round, so the model cannot fully exploit the overall data. Notably, pFedSD's performance declines at lower participation rates, likely because wide variances in client data distributions introduce extra noise into the model updates. Conversely, our method, FedCKD, effectively mitigates the impact of this noise.
Robustness to model architecture. Model architecture also plays a significant role in PFL because different model architectures may exhibit varying adaptability and learning capabilities when handling specific types of data. In the experiments, we utilize two more complex models, namely ResNet-8 [19] and MobileNetV2 [20]. Table V demonstrates that our method, FedCKD, significantly outperforms the baseline methods across a variety of model architectures. This superiority is evident through improved mean accuracy and reduced standard deviation, suggesting its ability to maintain consistent and efficient performance in all model architectures.
V Conclusion
This paper presents FedCKD, a comprehensive knowledge distillation method for PFL, designed to address catastrophic forgetting and achieve a balance between generalization and personalization. Utilizing multi-teacher knowledge distillation, we effectively transfer knowledge from global and historical models to the local model. The global model contains generalized knowledge, while the historical model holds personalized knowledge. Employing the distillation mechanism, we achieve a balance and integration of these two types of knowledge, significantly boosting the model’s performance. We also implement an annealing mechanism to further enhance performance and stability. Extensive experiments demonstrate the superiority of FedCKD over existing methods. However, it has inherent limitations as the KD process often requires additional computational resources. Future research will concentrate on implementing more efficient KD techniques in PFL to reduce computational costs.
References
- [1] B. McMahan, E. Moore, D. Ramage et al., “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics Conference, 2017, pp. 1273–1282.
- [2] N. Rana and H. Marwaha, “Role of federated learning in healthcare systems: A survey,” Mathematical Foundations of Computing, vol. 7, no. 4, pp. 459–484, 2024.
- [3] A. Abadi, B. Doyle, F. Gini et al., “Starlit: Privacy-preserving federated learning to enhance financial fraud detection,” arXiv:2401.10765, 2024.
- [4] Y. Chen, L. Liang, and W. Gao, “DFedSN: Decentralized federated learning based on heterogeneous data in social networks,” World Wide Web, vol. 26, no. 5, pp. 2545–2568, 2023.
- [5] W. Guo, Z. Yao, Y. Liu et al., “A new federated learning model for host intrusion detection system under non-iid data,” in IEEE International Conference on Systems, Man, and Cybernetics, 2023, pp. 494–500.
- [6] A. Z. Tan, H. Yu, L. Cui et al., “Towards personalized federated learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 9587–9603, 2021.
- [7] H. Jin, D. Bai, D. Yao et al., “Personalized edge intelligence via federated self-knowledge distillation,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 2, pp. 567–580, 2023.
- [8] L. Yi, H. Yu, C. Ren et al., “pFedMoE: Data-level personalization with mixture of experts for model-heterogeneous personalized federated learning,” arXiv:2402.01350, 2024.
- [9] S. You, C. Xu, C. Xu et al., “Learning from multiple teacher networks,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1285–1294.
- [10] T. Li, A. K. Sahu, M. Zaheer et al., “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, 2020, pp. 429–450.
- [11] M. G. Arivazhagan, V. Aggarwal, A. K. Singh et al., “Federated learning with personalization layers,” arXiv:1912.00818, 2019.
- [12] P. P. Liang, T. Liu, Z. Liu et al., “Think locally, act globally: Federated learning with local and global representations,” arXiv:2001.01523, 2020.
- [13] C. T. Dinh, N. Tran, and J. Nguyen, “Personalized federated learning with moreau envelopes,” in Advances in Neural Information Processing Systems, 2020, pp. 21 394–21 405.
- [14] M. Zhang, K. Sapra, S. Fidler et al., “Personalized federated learning with first order model optimization,” in International Conference on Learning Representations, 2020.
- [15] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv:1503.02531, 2015.
- [16] L. Qin, T. Zhu, W. Zhou et al., “Knowledge distillation in federated learning: A survey on long lasting challenges and new solutions,” arXiv:2406.10861, 2024.
- [17] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017.
- [18] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical Report, 2009.
- [19] K. He, X. Zhang, S. Ren et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [20] A. G. Howard, M. Zhu, B. Chen et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.