Appendix: Knowledge Distillation in Federated Learning: a Survey on Long-Lasting Challenges and New Solutions
Abstract.
Federated Learning (FL) is a distributed and privacy-preserving machine learning paradigm that coordinates multiple clients to train a model while keeping the raw data localized. However, traditional FL still faces several challenges, including privacy risks, data heterogeneity, communication bottlenecks, and system heterogeneity. To tackle these challenges, knowledge distillation (KD), a validated and efficacious model compression and enhancement technique, has been widely applied in FL since 2020. The core concept of KD is to transfer knowledge between models by exchanging intermediate features or output logits. These properties make KD an excellent solution for the long-lasting challenges in FL. To date, few reviews have summarized and analyzed the current trends and methods for applying KD in FL efficiently. This article aims to provide a comprehensive survey of KD-based FL, focusing on how it addresses the above challenges. First, we provide an overview of KD-based FL, including its motivation, basics, taxonomy, and a comparison with traditional FL. We then analyze the critical factors in KD-based FL, including teachers, knowledge, data, methods, and where KD should be executed. We discuss how KD can address the challenges in FL, including privacy protection, data heterogeneity, communication efficiency, and personalization. Finally, we discuss the challenges facing KD-based FL algorithms and future research directions. We hope this survey can provide insights and guidance for researchers and practitioners in the FL area.
1. Critical factors in KD-based FL
1.1. KD-based FL: Teachers
The choice of teacher model in an FD algorithm often reflects the problem being targeted. In FD, various teacher roles exist, including local models, the global model, mutual teaching between models, and models teaching themselves. These distinct teacher identities often lead to different learning objectives and teaching methods.
Local models as teachers. In FD, a prevalent approach is to consider local models as teacher models and the global model as a student model (gong2022preserving; zhang2022fine; liu2022communication), which can be seen as a form of distributed multi-teacher distillation (wu2021one). Treating local models as teachers offers several advantages. Firstly, the global model can leverage the diverse and comprehensive knowledge from client models to enhance its generalization on unknown samples (zhang2022fine). Secondly, in the feature-based FD setting, clients’ knowledge is presented in a model-agnostic and consistent manner (hinton2015distilling), facilitating the decoupling of local and global models at the architectural level and freeing local models from constraints imposed by identical architectures (ozkara2021quped). Moreover, this allows for a better evaluation of each client model’s contribution, which is an important consideration in FL systems with incentive mechanisms (li2022incentive). However, utilizing local models as teachers also entails drawbacks. Firstly, while presenting the logits of personalized client models in a unified form may seem feasible, their inner logic might be incompatible, since different clients can represent the same knowledge differently during feature extraction (wu2022exploring). Directly aggregating client model knowledge could therefore degrade the performance of the global model. Additionally, clients with higher weights exert greater influence (zhang2022fine) on the global model’s performance, making it more dependent on those clients and diluting the benefits of the knowledge diversity offered by multiple clients.
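A minimal sketch of this pattern is given below, assuming a PyTorch setting in which the server holds the received client models and an unlabeled public data loader (all names and hyperparameters are illustrative): the averaged softened predictions of the client (teacher) models supervise the global (student) model.

```python
import torch
import torch.nn.functional as F

def distill_global_model(global_model, client_models, public_loader, T=2.0, lr=1e-3, device="cpu"):
    """Server-side multi-teacher distillation: client models teach the global model."""
    global_model.to(device).train()
    for m in client_models:
        m.to(device).eval()
    optimizer = torch.optim.Adam(global_model.parameters(), lr=lr)

    for x, *_ in public_loader:        # public_loader yields (inputs, ...); labels are not needed
        x = x.to(device)
        with torch.no_grad():
            # Ensemble knowledge: average of the teachers' softened predictions.
            teacher_probs = torch.stack(
                [F.softmax(m(x) / T, dim=1) for m in client_models]).mean(dim=0)
        student_log_probs = F.log_softmax(global_model(x) / T, dim=1)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return global_model
```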
Global Model as a Teacher. In FD, the global model can serve as a teacher model to guide the local models (he2022class; ozkara2021quped; lee2022preservation; yao2021local), offering several advantages. Firstly, the global model represents a fusion of diverse local models and theoretically exhibits superior generalization capabilities. Consequently, leveraging the global model as a teacher aims to enhance the generalization of local models (lee2022preservation; mo2022feddq). Secondly, FD facilitates knowledge transfer from the global model to local models through KD in a manner that respects personalization. This personalization is manifested in two aspects: 1) The global model incrementally improves the performance of local models without replacing them, thereby preserving locally personalized knowledge (le2021distilling; ozkara2021quped), and 2) The architecture of local models can differ from that of the global model, enabling personalization in terms of model architecture (ozkara2021quped). However, employing the global model as a teacher also entails certain disadvantages. Firstly, clients require additional computational resources for performing KD, which may pose challenges for devices with limited computing capabilities. Secondly, during the initial stages, when the performance of the global model might not be optimal yet, it could misguide local models.
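A minimal sketch of a client update with the global model as teacher is shown below, assuming standard PyTorch components; the weighting factor alpha and all names are illustrative, with alpha balancing locally personalized knowledge against global guidance.

```python
import torch
import torch.nn.functional as F

def local_update(local_model, global_model, private_loader, T=2.0, alpha=0.5, lr=1e-2, device="cpu"):
    """Client-side training regularized by KD from the received (frozen) global model."""
    local_model.to(device).train()
    global_model.to(device).eval()            # the teacher is frozen during local training
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)

    for x, y in private_loader:
        x, y = x.to(device), y.to(device)
        logits = local_model(x)
        with torch.no_grad():
            teacher_logits = global_model(x)
        ce = F.cross_entropy(logits, y)       # fit the private labels
        kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean") * (T * T)
        loss = (1 - alpha) * ce + alpha * kd  # alpha trades off personalization vs. global guidance
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return local_model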
Mutual teaching. Unlike traditional KD, FL typically does not have a pre-trained high-performance teacher model (wang2022knowledge; shao2023selective). Instead, FL accumulates knowledge from different models through gradual mutual learning (xing2022efficient; zhao2022semi; afonin2021towards; li2022incentive). Mutual learning between models offers several advantages. Firstly, it compensates for the absence of a high-performance teacher model by enabling models to teach and learn from each other, enhancing generalization performance. Secondly, mutual learning allows models trained on different data distributions to communicate, mitigating the non-IID issue (taya2022decentralized; ma2021adaptive). Moreover, the exchange of knowledge across diverse models and levels enhances the overall performance of the models. However, a major drawback of mutual learning is that it increases the computational burden on the entire system, and frequent KD significantly prolongs training time. Some algorithms (yang2022fedmmd) perform mutual learning on the server; however, this approach may turn the server into a system bottleneck.
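One mutual-learning step between two co-located PyTorch models can be sketched as follows (names and hyperparameters are illustrative): each model is simultaneously a student of the other's softened predictions.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(model_a, model_b, x, y, opt_a, opt_b, T=1.0):
    """One step of mutual distillation: each model learns from labels and from its peer."""
    logits_a, logits_b = model_a(x), model_b(x)

    # Each KD term treats the peer's (detached) softened prediction as the teacher signal.
    kd_a = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                    F.softmax(logits_b.detach() / T, dim=1), reduction="batchmean")
    kd_b = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                    F.softmax(logits_a.detach() / T, dim=1), reduction="batchmean")

    loss_a = F.cross_entropy(logits_a, y) + (T * T) * kd_a
    loss_b = F.cross_entropy(logits_b, y) + (T * T) * kd_b

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```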
Be my own teacher. Catastrophic forgetting (shoham2019overcoming; dong2022federated) is a significant challenge in FL, which occurs when a model forgets old knowledge after learning new knowledge, resulting in performance degradation on the original data distribution. Essentially, catastrophic forgetting can be viewed as a non-IID problem in the time dimension. Many FD algorithms (ma2022continual; yao2021local) address this issue by leveraging old models as teachers to guide new models, enabling them to learn new knowledge while retaining previous knowledge. In FL, model forgetting manifests in two main forms (ma2022continual; halbe2023hepco): 1) inter-task forgetting and 2) intra-task forgetting. Inter-task forgetting (choi2022attractive) means that a new FL task causes the model to forget the knowledge learned in an old task, while intra-task forgetting (ma2022continual) means that, within the same FL task, the model forgets the knowledge learned in the previous communication round. To mitigate inter-task forgetting, multiple historical models (yao2021local) are typically employed as teacher models for a better review of knowledge across different tasks. For intra-task forgetting, reviewing only the most recent model (ma2022continual) from the previous communication round is usually sufficient, since early models tend to have lower performance. Generally, local models do not need to review outdated global knowledge because local and global models continue communicating in subsequent updates. However, tackling model forgetting in FL presents certain challenges, such as the increased storage cost of keeping historical models, which can be particularly burdensome for clients with limited storage resources when multiple historical models are involved. Additionally, reviewing historical models through KD incurs extra computational cost and benefits from retaining old data; however, this may not always be feasible due to privacy concerns and storage cost considerations.
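The intra-task case can be sketched as follows, under the assumption that a client keeps a snapshot of its previous-round model (names and hyperparameters are illustrative): the snapshot acts as the teacher so that the current model reviews old knowledge while fitting new data.

```python
import copy
import torch
import torch.nn.functional as F

def local_update_with_self_review(local_model, private_loader, T=2.0, beta=0.3, lr=1e-2, device="cpu"):
    """'Be my own teacher': the previous-round model supervises the current local model."""
    old_model = copy.deepcopy(local_model).to(device).eval()   # snapshot from the previous round
    local_model.to(device).train()
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)

    for x, y in private_loader:
        x, y = x.to(device), y.to(device)
        logits = local_model(x)
        with torch.no_grad():
            old_logits = old_model(x)
        ce = F.cross_entropy(logits, y)                         # learn the new knowledge
        review = F.kl_div(F.log_softmax(logits / T, dim=1),     # review the old knowledge
                          F.softmax(old_logits / T, dim=1),
                          reduction="batchmean") * (T * T)
        loss = ce + beta * review
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return local_model
```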
1.2. KD-based FL: Knowledge
For the student model, it is imperative not only to identify the teacher model but also to specify which knowledge should be acquired from it. A comprehensive discussion on the topic of the types of knowledge in KD can be found in (gou2021knowledge; chen2021distilling; wang2021knowledge). In this section, we will examine the characteristics of different types of knowledge under the framework of FL.
For simplicity, we classify knowledge into two categories: simple and complex. Compared to simple knowledge, such as logits (hinton2015distilling), complex knowledge, including intermediate features (romero2014fitnets; qiao2023knowledge; le2023layer) and relationship features (yim2017gift; tang2023fedrad), offers the student model a more comprehensive understanding. This becomes particularly crucial when addressing the non-IID challenge in FL systems, since complex knowledge can provide diverse insights from different models. However, more knowledge does not necessarily guarantee better results; in the extreme case, the teacher model transfers all its knowledge to the student, which is equivalent to directly sharing model parameters. In the context of FL, different types of knowledge influence communication efficiency, computational complexity, model personalization capabilities, and privacy protection.
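To make the distinction concrete, the sketch below contrasts simple knowledge (softened logits) with one form of complex knowledge, an intermediate-feature hint in the style of (romero2014fitnets); it assumes both models expose intermediate features, and the regressor is an illustrative component that aligns feature dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    """Maps student features to the teacher's feature dimension so they can be compared."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, f_student):
        return self.proj(f_student)

def kd_losses(student_logits, teacher_logits, student_feat, teacher_feat, regressor, T=2.0):
    # Simple knowledge: softened output-layer logits.
    logit_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                          F.softmax(teacher_logits / T, dim=1),
                          reduction="batchmean") * (T * T)
    # Complex knowledge: intermediate features matched through the regressor.
    hint_loss = F.mse_loss(regressor(student_feat), teacher_feat)
    return logit_loss, hint_loss
```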
Communication cost. Communication cost poses a significant challenge in FL, and KD was initially introduced as a solution to the communication bottleneck. Compared to parameter-sharing approaches, distillation-based algorithms effectively mitigate communication costs. In (jeong2018communication), local class-averaged logits are used as knowledge, which reduces communication costs because the number of classes is typically small. However, this method performs suboptimally in non-IID settings where the same labels correspond to different features. In practice, KD often requires an additional public dataset (huang2022learn), which increases communication costs because the knowledge extracted on it must be transmitted over the network. Although this cost is lower than directly transmitting large deep model parameters, it remains non-negligible when there are numerous public data samples and the knowledge is complex. A certain number of public data samples are necessary for achieving good distillation results (liu2022communication), necessitating a trade-off between communication cost and knowledge complexity. Among the various types of knowledge (gou2021knowledge), output-layer logits incur the lowest communication cost. Nevertheless, tasks involving a large number of categories (e.g., thousands of categories and tens of thousands of public samples) may still encounter communication bottlenecks even when only logits are transmitted; compression techniques such as quantization and encoding can then be beneficial (sattler2021cfd; gong2022preserving).
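The class-averaged-logit knowledge of (jeong2018communication) can be sketched as follows, assuming a standard PyTorch classifier (names are illustrative): each client uploads one averaged logit vector per class, so the payload scales with the number of classes rather than with the model or dataset size.

```python
import torch

@torch.no_grad()
def class_averaged_logits(model, private_loader, num_classes, device="cpu"):
    """Return one averaged logit vector per class: a [num_classes, num_classes] payload."""
    model.to(device).eval()
    sums = torch.zeros(num_classes, num_classes, device=device)   # row = true label, columns = logits
    counts = torch.zeros(num_classes, device=device)
    for x, y in private_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)                                          # shape [B, num_classes]
        sums.index_add_(0, y, logits)                              # accumulate per true label
        counts.index_add_(0, y, torch.ones_like(y, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)
```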
Computational cost. Complex knowledge (romero2014fitnets; yim2017gift; gong2021ensemble; shi2021towards) generally provides more structured and systematic information from the teacher model, making it more beneficial for the student model. However, concerns about both communication and computational resource costs arise in cross-device scenarios. The transfer of complex knowledge can result in increased consumption of computational resources, necessitating the construction of specific data structures (he2022learning; huang2022learn) to facilitate the transfer, thereby further increasing the computational cost of local devices. Hence, it is imperative to balance the performance gains derived from complex knowledge and the associated computational cost.
Personalized architecture. One of the appealing aspects of FD is that it easily accommodates local model heterogeneity, where different clients can use completely distinct model architectures to satisfy their personalized requirements (wu2019distributed; ozkara2021quped). Model heterogeneity can lead to diverse internal representations of the same knowledge across models, so the exchange of complex knowledge, especially intermediate-layer features and the relationships between layers of different models, becomes a more delicate issue. Models with different architectures may differ significantly, not only in size but also in complexity. The situation becomes even more complex when many heterogeneous clients participate in the FL system, for example when various heterogeneous local models on the server guide the global model (zhang2022dense; gong2022preserving). In this case, the differences in model architecture make it challenging to keep complex knowledge from various models compatible. One potential solution is for clients to convert their personalized model architectures into a common one (sun2022fed2kd) and to design a dedicated transmission process for exchanging complex knowledge. In practice, heterogeneous models typically rely on simple knowledge for convenience.
Privacy protection. Privacy risk is another crucial concern. Generally, the more information is shared, the higher the likelihood of privacy leakage. The logits generated by a model may encode statistical information about the client (gong2022preserving), while intermediate and relationship features (gong2021ensemble) may contain structured and systematic client-specific details. Hence, privacy preservation should also be considered when selecting the type of knowledge. Techniques such as differential privacy (hoech2022fedauxfdp; sattler2021fedaux) and quantization (gong2022preserving; sattler2021cfd) are commonly employed to protect the shared knowledge.
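As an illustration, the sketch below shows a simple symmetric uniform quantization of soft labels before transmission; the actual schemes in (sattler2021cfd; gong2022preserving) differ in detail, and the bit width is an illustrative parameter. Quantization both shrinks the payload and coarsens the shared information.

```python
import torch

def quantize_logits(logits, num_bits=8):
    """Uniformly quantize a logit tensor to signed integers plus a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = logits.abs().max().clamp(min=1e-8) / qmax        # one scale per tensor
    q = torch.clamp(torch.round(logits / scale), -qmax, qmax).to(torch.int8)
    return q, scale.item()

def dequantize_logits(q, scale):
    """Recover an approximation of the original logits on the receiver side."""
    return q.float() * scale
```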
1.3. KD-based FL: Data
Just as human teaching requires textbooks, KD relies on training data. In this section, we comprehensively analyze the training data used in FD. FD imposes specific requirements on the training data used for KD, encompassing three aspects: 1) the distribution of the training data should be relevant to the local data of each client (liu2022communication), 2) there should be sufficient training data (liu2022communication; park2023towards), and 3) an appropriate data type should be selected (jeong2018communication; li2019fedmd; zhu2021data).
1.3.1. Correlation
The effectiveness of KD depends on the correlation between the data used for distillation and the local data on each client (liu2022communication). For example, using a fruit dataset to guide a student model for cat-dog classification would not yield satisfactory results. KD requires the teacher and student models to use the same dataset for distillation. However, in the FL setting, private data cannot leave the client, and the teacher and student models may not reside on the same device, making it difficult for a client to use its private data for distillation directly. Therefore, numerous FD algorithms (li2019fedmd; sun2020federated; cho2021personalized; itahara2021distillation) assume access to a correlated public dataset whose distribution need not be identical to the clients’ data, for example a cross-domain public dataset used for distillation (gong2022preserving). However, a relevant public dataset is difficult to find for many FL tasks. Therefore, data-free FL methods have been proposed (zhu2021data; zhang2022fine; zhang2022feddtg), in which clients collaborate to train a generator for synthesizing data. However, certain tasks, such as object detection (zhao2019object), pose challenges for generating synthetic samples with a generator. Moreover, the requirement of correlated training data for KD in FL might compromise client privacy (gong2022preserving); some literature has proposed alternative solutions, such as quantization (sattler2021cfd) and differential privacy (sattler2021fedaux).
1.3.2. Quantity
KD needs a substantial amount of training data to achieve significant performance improvements (liu2022communication). This poses a challenge in some FL tasks, where collecting data can be costly. Fortunately, KD can leverage unlabeled datasets (zhao2022semi; huang2022learn), reducing the collection cost to some extent. However, large amounts of data require clients to have sufficient storage, which may exclude clients with limited resources. Employing a generator (zhu2021data; peng2023fedgm) can effectively address the issue of collecting and storing training data for specific tasks. Furthermore, extracting knowledge from large datasets also raises computation and communication bottlenecks. Dataset distillation (song2023federated) offers a potential solution by attaining comparable model performance with much smaller datasets.
1.3.3. Data types
The data employed for KD in FL can be broadly categorized into three types: local private data (jeong2018communication), public data (li2019fedmd), and synthetic data (zhang2022fine).
To use client private data for distillation, (jeong2018communication) uses the average model outputs (logits) on the local dataset as a regularization term for the local model loss. In (yang2022fedzact), clients upload their local models, and the server adjusts the parameters of each local model by evaluating them against each other’s models, after which all local models are sent back to the clients. Each client then updates its local model with its private data and the predictions of other clients’ models, in a manner similar to multi-teacher distillation. In (xing2022efficient), the server pairs all local models based on the minimum mean square distance and sends the paired models to each other as their respective teacher networks. On the client side, the teacher network teaches the student network using local data, and the student network is sent back to the server for the next round of pairing. Using client-side private datasets for distillation rests on the assumption that each model captures the personalized knowledge of a specific client, which is reflected in its behavior on that client’s private data. However, this approach is not effective in non-IID scenarios, because the knowledge of different clients’ models is heterogeneous (jeong2018communication).
Therefore, many studies assume the availability of a public dataset (lyu2023prototype; DBLP:conf/icml/ParkHH23; sattler2021cfd; gong2022preserving; yang2022fedmmd; zhao2022semi). In (sattler2021cfd), clients and the server use the same public dataset for distillation. In (gong2022preserving), the logits of all client models on an unlabeled public dataset serve as knowledge to teach the global model. In (yang2022fedmmd), a labeled public dataset is used for mutual distillation among all client models. Specifically, clients carry out local training and send their heterogeneous models to the server. The server conducts mutual distillation over all local models on the labeled public dataset, selecting one model as the student and the others as teachers; the distilled knowledge includes intermediate features and logits, yielding multiple distilled variants of each local model. These variants are then averaged and sent back to their corresponding clients. In (zhao2022semi), the hard labels of public data samples are obtained by majority vote over the hard-label predictions of the client models on an unlabeled public dataset, and each client then uses these global hard labels for distillation. In practice, obtaining a public dataset may be challenging, and a domain shift may exist between the public dataset and the clients’ private datasets (huang2023federated; su2022domain).
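The majority-vote labeling of (zhao2022semi) can be sketched as follows, assuming the voting happens where all client predictions are available (names are illustrative): each public sample receives the hard label that most client models agree on.

```python
import torch

@torch.no_grad()
def majority_vote_labels(client_models, public_loader, device="cpu"):
    """Label an unlabeled public dataset by majority vote over client model predictions."""
    for m in client_models:
        m.to(device).eval()
    labels = []
    for x, *_ in public_loader:                 # public_loader yields (inputs, ...) batches
        x = x.to(device)
        preds = torch.stack([m(x).argmax(dim=1) for m in client_models])   # [K clients, B samples]
        majority, _ = torch.mode(preds, dim=0)                             # most frequent class per sample
        labels.append(majority.cpu())
    return torch.cat(labels)
```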
Therefore, some research (zhu2021data; zhang2022dense; zhang2022fedzkt; zhang2022fine; zhang2022feddtg; qi2022fedbkd; peng2023fedgm; DBLP:conf/www/YangYPLLLZ23) has explored distillation without data. In (zhang2022fine), after the server aggregates all client models, it trains a generator to produce hard examples, i.e., samples on which the predictions of the local and global models differ most. The global model then uses these hard examples for KD, with all local models acting as teachers so as to minimize the sum of their distillation losses. In (zhang2022feddtg), each client trains a generator and a discriminator locally and sends them to the server for aggregation to obtain a global generator and discriminator. Clients use the global generator to produce synthetic samples, compute soft labels on these samples with their local private models, send the soft labels to the server for averaging, and then use the averaged soft labels for distillation.
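The data-free setting can be sketched as follows, assuming a generator mapping noise to samples is already available (its training, which differs across (zhang2022fine; zhang2022feddtg), is omitted, and all names are illustrative): synthetic samples replace the public dataset, and the averaged soft labels of the local models teach the global model.

```python
import torch
import torch.nn.functional as F

def data_free_distill(global_model, client_models, generator, noise_dim,
                      steps=100, batch_size=64, T=2.0, lr=1e-3, device="cpu"):
    """Distill the client ensemble into the global model using generator samples only."""
    global_model.to(device).train()
    generator.to(device).eval()
    for m in client_models:
        m.to(device).eval()
    optimizer = torch.optim.Adam(global_model.parameters(), lr=lr)

    for _ in range(steps):
        z = torch.randn(batch_size, noise_dim, device=device)
        with torch.no_grad():
            x_syn = generator(z)                                   # synthetic "public" samples
            teacher_probs = torch.stack(
                [F.softmax(m(x_syn) / T, dim=1) for m in client_models]).mean(dim=0)
        loss = F.kl_div(F.log_softmax(global_model(x_syn) / T, dim=1),
                        teacher_probs, reduction="batchmean") * (T * T)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return global_model
```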
1.4. KD-based FL: Methods
In this section, we will summarize the main methods in KD-based FL.
Attention: In (gong2021ensemble), the logits and attention maps of local models on a public dataset are uploaded to the server for distillation. Similarly, (wen2023communication) uses attention-based KD to improve communication efficiency.
Continual Learning: In (usmanova2022federated), historical models and the server model are used as teacher models for distillation to address catastrophic forgetting (kirkpatrick2017overcoming). Similarly, in (huang2022learn), KD is used to learn from the previous-round model and a locally pre-trained model to cope with catastrophic forgetting. (dong2022federated) proposes a class-incremental learning mechanism to address the catastrophic forgetting caused by newly introduced classes.
Contrastive Learning: In (han2022fedx), contrastive learning (chen2020simple) is introduced for scenarios where local data lack labels, with a contrastive loss added during local distillation training. (zou2023efckd) applies model-contrastive learning to improve performance.
Split Learning: In (he2020group), KD is combined with split learning (vepakomma2018split). The final model is a union of the local abstract feature extractors and the server’s large model. Specifically, each client trains a small model locally as a feature extractor and sends the extractor’s output (abstract features), the classifier’s output (logits), and the labels of its local data to the server. The server uses the abstract features, logits, and true labels to distill and train the server-side large model, and then sends the soft labels it generates back to the client. The client then uses its local data and these soft labels for distillation.
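A single-process sketch of this protocol is given below; networking is omitted, and the names, weighting factor, and exact loss composition are illustrative rather than the precise formulation of (he2020group). The server trains its large model on the uploaded features with both true labels and client logits, and the client later distills from the returned server logits.

```python
import torch
import torch.nn.functional as F

def server_step(server_model, server_opt, features, client_logits, labels, T=2.0, alpha=0.5):
    """Train the server-side large model on detached features/logits uploaded by a client."""
    logits = server_model(features)
    loss = (1 - alpha) * F.cross_entropy(logits, labels) + \
           alpha * (T * T) * F.kl_div(F.log_softmax(logits / T, dim=1),
                                      F.softmax(client_logits / T, dim=1),
                                      reduction="batchmean")
    server_opt.zero_grad(); loss.backward(); server_opt.step()
    return logits.detach()                      # soft labels sent back to the client

def client_step(extractor, classifier, client_opt, x, y, server_logits, T=2.0, alpha=0.5):
    """Update the small client model with its own labels and the server's soft labels."""
    logits = classifier(extractor(x))
    loss = (1 - alpha) * F.cross_entropy(logits, y) + \
           alpha * (T * T) * F.kl_div(F.log_softmax(logits / T, dim=1),
                                      F.softmax(server_logits / T, dim=1),
                                      reduction="batchmean")
    client_opt.zero_grad(); loss.backward(); client_opt.step()
```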
Multi-Teacher: In (su2023cross), for the special scenario in which the server holds a large amount of labeled data, clients send their local personalized models to the server, and the server distills these personalized models into the global model. The global model is then sent back to the clients to fine-tune the local personalized models. (nguyen2022label) uses a hierarchical architecture that divides all clients into multiple groups; each group selects one client as the group server and obtains a small group model through parameter sharing. After a certain number of rounds, each group sends its aggregated model to the global server, which uses these group models as teachers for multi-teacher distillation on a labeled public dataset.
Mutual Distillation: In (wu2022communication), each client simultaneously trains a large model and a small model, which are mutually distilled. The gradients of the small model are then decomposed via SVD (klema1980singular) and uploaded to the server. The server reconstructs the gradients of all clients, aggregates them, and sends the result back to each client. (li2021decentralized) is based on a P2P architecture and randomly selects two equally sized groups of clients, pairing each client in one group with a client in the other. The clients in the first group train their models locally and send them to their counterparts in the second group, which then use their local data for mutual distillation between the received model and their own.
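The gradient-compression step can be sketched as follows, assuming each layer's update is reshaped into a 2-D matrix (the exact scheme in (wu2022communication) may differ in detail): only the top-k singular factors are uploaded, and the server reconstructs an approximation.

```python
import torch

def compress_update(weight_update, k):
    """Truncated SVD: keep only the top-k singular factors of a 2-D update matrix."""
    U, S, Vh = torch.linalg.svd(weight_update, full_matrices=False)
    return U[:, :k], S[:k], Vh[:k, :]           # payload: m*k + k + k*n numbers instead of m*n

def reconstruct_update(U_k, S_k, Vh_k):
    """Server-side reconstruction of the approximate update."""
    return U_k @ torch.diag(S_k) @ Vh_k

# Example: a 512x256 layer update compressed to rank 16.
update = torch.randn(512, 256)
approx = reconstruct_update(*compress_update(update, k=16))
```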