Leveraging Foundation Models for Efficient Federated Learning in Resource-Restricted Edge Networks
Abstract
Recently, pre-trained Foundation Models (FMs) have been combined with Federated Learning (FL) to improve the training of downstream tasks while preserving privacy. However, deploying FMs over edge networks with resource-constrained Internet of Things (IoT) devices is under-explored. This paper proposes a novel framework, namely Federated Distilling knowledge to Prompt (FedD2P), for leveraging the robust representation abilities of a vision-language FM without deploying it locally on edge devices. This framework distills the aggregated knowledge of IoT devices to a prompt generator to efficiently adapt the frozen FM for downstream tasks. To eliminate the dependency on a public dataset, our framework leverages per-class local knowledge from IoT devices and linguistic descriptions of classes to train the prompt generator. Our experiments on diverse image classification datasets (CIFAR10, OxfordPets, SVHN, EuroSAT, and DTD) show that FedD2P outperforms the baselines in terms of model performance.
Index Terms— Federated learning, foundation models, distilling knowledge, prompt-tuning.
1 Introduction
Traditional centralized learning over distributed Internet of Things (IoT) networks [1] fails to reach the expected performance mainly due to distribution of data over resource-constrained devices, limited communication resources, and privacy concerns. In this context, Federated Learning (FL) [2] has emerged as a fruitful alternative to the centralized approach by facilitating collaborative and privacy-preserving knowledge exchange among distributed IoT devices. This is achieved through repetitive communications with a coordinating server (fusion centre), which has higher computation power, and is responsible for aggregation of the distributed knowledge [3]. For applications that involve training of Deep Learning (DL) models from scratch, performance of conventional FL frameworks [4, 5, 6] can significantly degrade especially in resource-limited IoT networks. Such degradation is due to the fact that training on insufficient data distributed over edge devices requires several computation/communication rounds resulting in excessive overhead and latency. This becomes particularly significant for complex tasks under statistical and system heterogeneity.
Literature Review: Recently, to speed up the learning of downstream tasks, Foundation Models (FMs) [7], which are typically large DL models trained on general-purpose datasets, have been utilized as the backbone of local models in FL [7]. By leveraging few-shot capabilities of FMs in FL scenarios, clients are not required to train their models from scratch, significantly reducing the communication/computation overhead [8]. Additionally, the pre-trained knowledge incorporated in FMs can effectively mitigate the data scarcity issue [9]. To adapt FMs to downstream tasks, fine-tuning methods [10, 11, 12] have been employed. Among these methods, prompt-tuning is more adaptable to FL in edge environments, as it requires fewer computational resources and less storage space, and outperforms its alternatives when dealing with limited data [13]. In prompt-tuning, a prompt is given to a pre-trained FM to generate specific responses corresponding to the downstream task without the need for additional training or gradient updates on the FM.
Consequently, there has been a recent surge of interest [8, 14, 15, 16, 9, 17] in adapting pre-trained FMs to downstream tasks using prompt tuning in collaborative frameworks. A drawback of these prior works is that they overlook the constrained resources available at the client side, making them impractical for distributed IoT networks. There are a few recent attempts to address this shortcoming. For instance, FedHPL [18] allows clients to download resource-appropriate versions of FMs from the server. Furthermore, FedHPL handles the heterogeneity of local models by employing knowledge distillation, where logits, instead of prompts, are shared with the server. While FedHPL reduces the computation costs associated with fine-tuning of local FMs, it still assumes that clients possess sufficient storage for deploying and prompt-tuning local FMs. To further tackle resource constraints on devices, FedMKT [19] proposed to deploy a Large Language Model (LLM), such as LLaMa-2 with 7 billion parameters [20], on the server while pre-trained small language models (e.g., GPT-2 with 1.5 billion parameters [21]) are placed on the local devices. Although FedMKT liberates resource-constrained clients from performing local prompt tuning of FMs, the small pre-trained FMs are still relatively large, requiring more resources than IoT devices can typically provide. This paper aims to address this gap.
Contributions: To address the above-mentioned issue, we introduce the Federated Distilling knowledge to Prompt (FedD2P) framework, which places the FM exclusively on the server, eliminating the need for FM deployment on IoT devices. More specifically, the distributed knowledge from the lightweight local models of IoT devices is distilled into a prompt generator module, facilitating adaptation of a vision-language FM to the downstream task. Subsequently, the robust knowledge of the FM is employed to improve the generalization capabilities of IoT devices. By centralizing the FM on the server, IoT devices can deploy smaller models based on their available resources, enabling FedD2P to also effectively handle model heterogeneity. In summary, the paper makes the following key contributions:
- [C1] Design of the FedD2P framework wherein distributed, resource-constrained IoT devices leverage the robust representational knowledge of a vision-language FM located at the server.
- [C2] Introduction of a novel data-free mutual Knowledge Distillation (KD) framework based on per-class knowledge transfer between the IoT devices and the server-side FM. In this regard, a linguistic assistance prompt generator is designed, which is fine-tuned using the distributed per-class knowledge of IoT devices and the linguistic descriptions of the classes.
2 Background and System Model

In this section, we first briefly present the required background on the KD technique and prompt tuning for FL. Then, we present the system setup and formulate the problem of FL over a resource-limited IoT network.
2.1 Knowledge Distillation (KD)
Generally speaking, KD refers to the method of transferring knowledge from one or multiple teacher models to a student model [22]. Specifically, the softened output of the teacher model on a public dataset, along with the ground-truth output, is used to train the student model. In this context, a KD loss function is employed in addition to the supervised learning loss to minimize the discrepancy between the soft labels of the teacher model and the predictions made by the student model. In KD-assisted FL scenarios, depending on the method, both clients and the server can assume the roles of either the teacher or the student, i.e., mutual KD [23]. In such scenarios, knowledge from clients is shared with the server to create a global knowledge base. This knowledge is then distilled back to the clients to enhance their performance, allowing for the transfer of knowledge from other clients to each individual client. Typically, KD methods [24] utilize a public dataset shared among all entities (also referred to as the transfer set) to align the extracted knowledge of the local models with that of the server. This assumption of a publicly available dataset, however, is unrealistic in practice, as such a dataset can be accessed by third parties, raising privacy concerns. In this paper, KD is employed in a data-free fashion (without reliance on a transfer set) to establish a framework for knowledge exchange between local models and the FM of the server.
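As a rough illustration of the generic KD objective described above, the following sketch combines a supervised cross-entropy term with a Kullback-Leibler term between temperature-softened teacher and student outputs. It is a minimal sketch and not the loss used by FedD2P: the function names, the mixing weight `alpha`, and the default temperature are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a larger temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Supervised loss for a single sample with an integer class label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of the supervised loss and the distillation loss.

    alpha (hypothetical) balances the two terms; both outputs are softened
    with the same temperature before the KL term is computed.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    supervised = cross_entropy(softmax(student_logits), label)
    distill = kl_divergence(soft_teacher, soft_student)
    return (1.0 - alpha) * supervised + alpha * distill

# A student that disagrees with its teacher pays an extra distillation penalty.
agree = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], label=0)
disagree = kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], label=0)
print(agree, disagree)
```

When teacher and student logits coincide, the KL term vanishes and only the supervised term remains.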
2.2 Prompt Tuning
Fine-tuning FMs for downstream tasks has shifted the dominant paradigm of machine learning from “training from scratch” to the “pretrain-then-finetune” framework. Full fine-tuning of a large FM, however, involves updating all its parameters, which increases the risk of overfitting. This challenge has driven the development of partial fine-tuning methods [10]. With the advent of LLMs, a novel capability known as prompting [25] has been introduced for such models. This technique involves prepending learnable parameters to the embedding space of the input, whether it be the embedding space of tokens in LLMs or the embedding space of image patches in Transformer-based vision models. This approach provides pre-trained FMs with hints about downstream tasks while keeping their parameters frozen. For further details, please refer to [26].
2.3 Federated Learning over Edge
We consider an edge environment consisting of IoT devices, denoted by , which are coordinated by an edge server. Each IoT device for (), aims to perform a -class classification task with the assistance of an FM located at the edge server. Each IoT device possesses a local dataset, represented as . Local datasets are distributed heterogeneously among IoT devices and collectively form the entire dataset . Each IoT device, based on its available storage and computation resources, employs a lightweight local model, denoted by parameterized by . This enables the framework to handle model heterogeneity. To develop a FL framework that trains local models with the support of an FM at the server, we solve the following personalized FL problem
(1) |
where is the loss function of local model over local dataset .
3 Federated Distilling Knowledge To Prompt (FedD2P)
The proposed FedD2P framework operates through a repetitive workflow, where in each communication round, the local knowledge of IoT devices is shared with the server to construct the aggregated knowledge in the form of soft labels. This knowledge is then utilized to fine-tune the FM for the downstream task. Distilling the aggregated knowledge of IoT devices can more effectively instruct the FM towards the downstream task. The fine-tuned FM then generates the global knowledge, which is subsequently transmitted back to the IoT devices to assist them in local training.
To establish a data-free KD framework, we adopt the per-class knowledge sharing approach instead of per-sample knowledge transfer. This approach aligns the extracted knowledge of local models and the FM at the class level, restricting the amount of local knowledge shared by the IoT devices. To compensate for this knowledge scarcity, we utilize the information in the linguistic descriptions of classes. The rich semantic representation of linguistic content from vision-language FMs can be employed for this purpose. Specifically, we propose a Linguistic Assistance (LA) prompt generator to facilitate this process. Fig. 1 shows the flow of knowledge in the proposed FedD2P framework and the LA prompt generator. Next, we provide further details on different components of the proposed FedD2P framework.
3.1 The Flow of Knowledge
The proposed FedD2P framework includes an initialization stage followed by iterating four steps, summarized as follows:
(S0) Initialization: To initiate the FL process, firstly, each IoT device trains its local model on the local dataset, .
(S1) Knowledge Aggregation: The local knowledge of IoT device for class is computed by averaging soft labels generated by on , as
(2) |
where denotes input samples of dataset that belong to class , and represents the softmax function with the temperature parameter . These per-class local knowledge representations are then averaged at the server to form the per-class aggregated knowledge, computed as follows , where, denotes the set of all input samples from the dataset associated with class .
(S2) Fine-tuning the FM: The aggregated knowledge is subsequently utilized to fine-tune the FM for the downstream task, as described in Subsection 3.2. Following this fine-tuning, the per-class global knowledge of the FM is computed as , where denotes the FM with parameters , and represents the class-specific prompt.
(S3) Local Knowledge Distillation: Per-class global knowledge, , is then transmitted to each IoT device , for (), to perform local knowledge distillation as follows
(3) |
where , and are per-sample cross-entropy and Kullback-Leibler loss functions, respectively. Here, represents the ground-truth corresponding to the input sample .
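Steps (S1) and (S3) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the exact formulation of the framework: the function names are hypothetical, the server-side average is assumed to be weighted by per-class sample counts, and the two loss terms in (S3) are assumed to be summed with equal weight, with the KL term taken from the global knowledge to the softened local prediction.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def local_class_knowledge(sample_logits, labels, num_classes, temperature):
    """(S1) Average the softened outputs of one device over the samples of each class.

    Returns (knowledge, counts); knowledge[c] is None when the device holds
    no sample of class c.
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for logits, label in zip(sample_logits, labels):
        counts[label] += 1
        for j, p in enumerate(softmax(logits, temperature)):
            sums[label][j] += p
    knowledge = [[s / counts[c] for s in sums[c]] if counts[c] else None
                 for c in range(num_classes)]
    return knowledge, counts

def aggregate_knowledge(all_knowledge, all_counts, num_classes):
    """Server side: average the per-class local knowledge over devices.

    Assumption: each device is weighted by its number of samples of the class.
    """
    aggregated = []
    for c in range(num_classes):
        total = sum(counts[c] for counts in all_counts)
        if total == 0:
            aggregated.append(None)
            continue
        mixed = [0.0] * num_classes
        for knowledge, counts in zip(all_knowledge, all_counts):
            if counts[c]:
                weight = counts[c] / total
                mixed = [m + weight * k for m, k in zip(mixed, knowledge[c])]
        aggregated.append(mixed)
    return aggregated

def local_distillation_loss(logits, label, global_knowledge, temperature):
    """(S3) Cross-entropy on the label plus KL(global knowledge of the label's class || softened prediction)."""
    cross_entropy = -math.log(softmax(logits)[label])
    soft_prediction = softmax(logits, temperature)
    kl = sum(g * math.log(g / s)
             for g, s in zip(global_knowledge[label], soft_prediction) if g > 0)
    return cross_entropy + kl

# Toy example with two devices and two classes.
device_a = local_class_knowledge([[2.0, 0.0], [0.0, 2.0]], [0, 1], 2, temperature=2.0)
device_b = local_class_knowledge([[1.0, 0.0]], [0], 2, temperature=2.0)
aggregated = aggregate_knowledge([device_a[0], device_b[0]], [device_a[1], device_b[1]], 2)
# In the framework, the FM's per-class global knowledge would be passed here;
# the aggregated knowledge is used only as a stand-in for this toy example.
loss = local_distillation_loss([1.5, 0.2], 0, aggregated, temperature=2.0)
print(aggregated, loss)
```

Only one vector per class leaves each device in this sketch, which is what makes the exchange independent of any shared transfer set.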
3.2 Linguistic Assistance Prompt Generation
To construct the LA prompt generator and without loss of generality, we assume that the linguistic descriptions of the classes within the downstream task are available at the server and are represented by . Their corresponding semantic representation vectors are extracted by the text encoder of the vision-language FM as , where denotes the text encoder of the FM, and represents its embedding dimension. The prompt generator , then uses these semantic representations of the linguistic descriptions of the classes to generate class-specific prompts. To account for the correlation among the semantic representations of classes, we employ a multi-head self-attention mechanism to construct the LA prompt generator. The prompt for class is, therefore, calculated as
(4) |
where , and represent the query vector, and the key and value weight matrices, respectively. Here, , and denotes the parameters of the head layer. Accordingly, the class-specific prompts are defined as , which are forwarded to the vision-language FM to generate the global knowledge as follows
(5) |
where denotes the image encoder of the FM.
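To make the role of self-attention in the prompt generator more concrete, the sketch below maps a set of class embeddings to one prompt vector per class, so that each prompt depends on the embeddings of all classes. It is a sketch under assumptions, not the authors' architecture: the weights are random stand-ins, the class embeddings are random placeholders for text-encoder outputs, and the dimensions, helper names, and the final linear layer `w_out` are all hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention_head(s, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows of s (one row per class)."""
    q, k, v = matmul(s, w_q), matmul(s, w_k), matmul(s, w_v)
    scale = math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[x / scale for x in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each class attends to all classes
    return matmul(weights, v)

def generate_prompts(s, heads, w_out):
    """Concatenate the head outputs per class and project them to the prompt dimension."""
    head_outputs = [attention_head(s, *head) for head in heads]
    concat = [sum((out[c] for out in head_outputs), []) for c in range(len(s))]
    return matmul(concat, w_out)

rng = random.Random(0)
num_classes, embed_dim, head_dim, num_heads, prompt_dim = 4, 8, 4, 2, 8
# Random placeholder for the semantic representation vectors of the classes.
class_embeddings = random_matrix(num_classes, embed_dim, rng)
heads = [tuple(random_matrix(embed_dim, head_dim, rng) for _ in range(3))
         for _ in range(num_heads)]
w_out = random_matrix(num_heads * head_dim, prompt_dim, rng)
prompts = generate_prompts(class_embeddings, heads, w_out)
print(len(prompts), len(prompts[0]))
```

Because the attention weights mix the value vectors of all classes, changing the description of one class can alter the prompts of the others, which is the correlation effect the paragraph above refers to.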
To distill the per-class aggregated knowledge into the LA prompt generator, the prompt-tuning is performed as follows
(6) |
where represents the ground-truth corresponding to the class . This completes the presentation of the proposed FedD2P framework. Next, evaluation experiments are presented.
4 Simulation Results
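Table 1: Performance comparison of FedD2P with the baselines; the two columns under each dataset correspond to the two statistical distribution settings.

Baseline | CIFAR10 | | SVHN | | OxfordPets | | EuroSAT | | DTD | |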
---|---|---|---|---|---|---|---|---|---|---|
B1 | 66.16 | 67.25 | 81.57 | 88.31 | 61.37 | 64.11 | 77.52 | 81.27 | 63.47 | 70.32 |
B2 | 68.35 | 70.39 | 85.10 | 88.24 | 65.34 | 67.48 | 77.60 | 80.89 | 67.90 | 72.50 |
B3 | 74.01 | 73.21 | 90.81 | 93.66 | 74.49 | 78.10 | 83.71 | 84.26 | 70.77 | 74.29 |
FedD2P | 75.95 | 77.36 | 92.24 | 94.36 | 77.01 | 78.92 | 86.35 | 88.03 | 72.18 | 75.37 |
In this section, we present different experiments conducted to evaluate the performance of the proposed framework. We consider a distributed IoT network consisting of devices with data and model heterogeneity, collectively performing FL over communication rounds. The FL process is orchestrated by an edge server with high computation power employing a Contrastive Language-Image Pre-Training (CLIP) model as the vision-language FM. For statistical data heterogeneity, we employ the Dirichlet distribution with two different settings: for the homogeneous, and for the heterogeneous setting. We deploy Convolutional Neural Networks (CNNs) as local models, consisting of Visual Geometry Group (VGG) blocks. For model heterogeneity, we randomly select a CNN model with , , or VGG blocks for each device. The size of each local dataset is set to samples. The number of local epochs and batch size are set to and , respectively. In each round, the LA prompt generator is trained for rounds. We set the temperature parameter as and for softening local and global outputs, respectively. The number of rounds and epochs is determined empirically, whereas the selection of the temperature parameters is discussed in Section 4.2.
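One common way to simulate statistical heterogeneity with a Dirichlet distribution is to split the samples of each class across devices according to Dirichlet-distributed proportions, as sketched below. This is an illustrative sketch and not necessarily the partitioning protocol used in these experiments: the function names, the seed, and the concentration values in the example are assumptions, and the sketch does not enforce a fixed local dataset size.

```python
import random

def dirichlet_sample(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)  # for extremely small alpha the draws may underflow to zero
    return [d / total for d in draws]

def dirichlet_partition(labels, num_devices, alpha, seed=0):
    """Assign sample indices to devices, class by class.

    A small alpha concentrates each class on few devices (heterogeneous split);
    a large alpha spreads each class almost evenly (close to homogeneous).
    """
    rng = random.Random(seed)
    indices_by_class = {}
    for idx, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(idx)
    partitions = [[] for _ in range(num_devices)]
    for label in sorted(indices_by_class):
        idxs = indices_by_class[label]
        rng.shuffle(idxs)
        proportions = dirichlet_sample(alpha, num_devices, rng)
        start, cumulative = 0, 0.0
        for device, share in enumerate(proportions):
            cumulative += share
            # The last device takes the remainder so that no sample is dropped.
            if device == num_devices - 1:
                end = len(idxs)
            else:
                end = int(round(cumulative * len(idxs)))
            partitions[device].extend(idxs[start:end])
            start = end
    return partitions

labels = [i % 10 for i in range(1000)]  # toy labels: 10 classes, 100 samples each
skewed = dirichlet_partition(labels, num_devices=5, alpha=0.5)
balanced = dirichlet_partition(labels, num_devices=5, alpha=100.0)
print([len(part) for part in skewed])
print([len(part) for part in balanced])
```

Every index is assigned to exactly one device, so the union of the local datasets reproduces the full dataset.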
Datasets: We employ the CIFAR10 [27] and SVHN [28] datasets for general object classification, while the OxfordPets dataset [29] is utilized for fine-grained classification tasks. Additionally, the EuroSAT [30] and DTD [31] datasets are used for specialized tasks involving satellite imagery and texture recognition, respectively. For the CIFAR10 and OxfordPets datasets, the linguistic description of a class is “a photo of [c]”. For the EuroSAT and DTD datasets, the descriptions are “a centered satellite photo of [c]” and “[c] texture”, respectively. For the SVHN dataset, we use “a photo of digit [c]”. Here, [c] denotes the name of the class in natural language.
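The class descriptions can be produced by filling the `[c]` placeholder of a template with each class name, for example as in the short sketch below. The helper name `build_descriptions` and the abbreviated class list are assumptions made for illustration.

```python
def build_descriptions(template, class_names):
    """Replace the [c] placeholder of a template with each class name."""
    return [template.replace("[c]", name) for name in class_names]

# Template quoted above for CIFAR10 and OxfordPets, with a few CIFAR10 class names.
object_descriptions = build_descriptions("a photo of [c]", ["airplane", "bird", "truck"])
# Template quoted above for SVHN, whose classes are the ten digits.
digit_descriptions = build_descriptions("a photo of digit [c]", [str(d) for d in range(10)])

print(object_descriptions)
print(digit_descriptions[:3])
```

The resulting strings are what the text encoder of the vision-language FM would receive at the server.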
Baselines: A comparison of the proposed FedD2P framework with the baselines [19] and [18], which utilize significantly more powerful local models, is not fair within our settings. This is due to the limited computational resources of edge devices, which rule out maintaining an FM locally. Therefore, we compare FedD2P against three baselines: B1) each client trains its model solely on its local dataset without participating in the FL process; B2) the aggregated knowledge, rather than the global model, is transmitted back to clients; and B3) only the ground-truth output is used to fine-tune the LA prompt generator, while the per-class aggregated knowledge is not distilled into it.
4.1 Evaluation Under Different Statistical Heterogeneity
Table 1 presents the performance comparisons of FedD2P with the baselines across two statistical distribution settings. FedD2P achieves the best result in every column. In particular, it outperforms B2 throughout, which indicates that leveraging an FM at the server is more effective than relying solely on the aggregated knowledge from edge devices. We attribute this consistently superior performance to the robust representations provided by the CLIP model and the effective distillation of global knowledge to IoT devices. It can also be inferred from the comparison with B3 that distilling the distributed knowledge of IoT devices to the LA prompt generator fine-tunes the FM for the downstream task more effectively. The most significant performance gain over B1 and B2 is observed for the OxfordPets dataset (gains of 15.64 and 14.81 over B1 in the two settings). This is likely due to the dataset’s challenging nature for training a local CNN model from scratch, whereas, according to [12], the few-shot accuracy of the CLIP model on this dataset is higher than on other datasets.
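The gains over local-only training can be checked directly from the B1 and FedD2P rows of Table 1. The short script below copies those two rows and computes the per-dataset difference for each of the two settings; the variable and function names are chosen for this example only.

```python
# Values copied from the B1 and FedD2P rows of Table 1 (two settings per dataset).
table = {
    "B1": {
        "CIFAR10": (66.16, 67.25), "SVHN": (81.57, 88.31),
        "OxfordPets": (61.37, 64.11), "EuroSAT": (77.52, 81.27),
        "DTD": (63.47, 70.32),
    },
    "FedD2P": {
        "CIFAR10": (75.95, 77.36), "SVHN": (92.24, 94.36),
        "OxfordPets": (77.01, 78.92), "EuroSAT": (86.35, 88.03),
        "DTD": (72.18, 75.37),
    },
}

def gains(method, baseline):
    """Per-dataset difference between two rows, rounded to two decimals."""
    return {
        dataset: tuple(round(m - b, 2)
                       for m, b in zip(table[method][dataset], table[baseline][dataset]))
        for dataset in table[method]
    }

gain_over_b1 = gains("FedD2P", "B1")
largest_first = max(gain_over_b1, key=lambda d: gain_over_b1[d][0])
largest_second = max(gain_over_b1, key=lambda d: gain_over_b1[d][1])
print(gain_over_b1)
print(largest_first, largest_second)  # OxfordPets in both settings
```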
4.2 Evaluation Under Different Temperature Parameter
The effectiveness of the mutual KD framework is significantly determined by the entropy of soft labels. Low entropy for the local models and high entropy for the CLIP model necessitate adjusting the temperature parameters when computing local and global soft labels. Fig. 2(a) illustrates the performance of the FedD2P framework on the CIFAR10 dataset across different temperature parameters. As shown, the most effective performance is achieved by decreasing the entropy of global soft labels and increasing the entropy of local ones. This can be attributed to the high generalization ability of the robust CLIP model and the low generalization capability of simple local models.
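The link between the temperature parameter and the entropy of soft labels can be seen in a few lines of Python. The logits and temperature values below are arbitrary illustrative choices, not the settings used in the experiments.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 1.0, 0.5, 0.2]  # a confident prediction, as from an overfitted local model
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0, 5.0)}
for temperature, value in entropies.items():
    print(temperature, round(value, 3))
```

Raising the temperature pushes the soft labels towards the uniform distribution and therefore raises their entropy, while lowering it sharpens them; this is the mechanism by which local soft labels can be softened and global soft labels sharpened.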
4.3 Effectiveness of the LA Prompt Generator
For comparison, we employ a Multi-Layer Perceptron (MLP)-based prompt generator, wherein each semantic representation vector, , is directly mapped to . Fig. 2(b) illustrates the impact of the self-attention mechanism within the LA prompt generator module on the CIFAR10 and EuroSAT datasets. The superior performance of the multi-head self-attention mechanism is attributed to its ability to consider the relationships among different semantic representations of classes. This capability enables the generation of effective prompts that extract soft labels with sufficient generalization content, thereby assisting local models.


5 Conclusion
Our proposed FedD2P framework leverages the robust representation abilities of vision-language FMs without deploying them locally on resource-constrained edge devices. By distilling aggregated knowledge from IoT devices to a prompt generator, FedD2P enhances the performance of local models in diverse image classification tasks. Simulations on five image classification datasets show that FedD2P outperforms the considered baselines under both statistical distribution settings while requiring only lightweight local models on the IoT devices, making it a promising solution for federated learning in edge networks.
References
- [1] T. Zhang, L. Gao, C. He, M. Zhang, B. Krishnamachari, and A. S. Avestimehr, “Federated learning for the internet of things: Applications, challenges, and opportunities,” IEEE Internet of Things Magazine, vol. 5, no. 1, pp. 24–29, 2022.
- [2] B. Liu, N. Lv, Y. Guo, and Y. Li, “Recent advances on federated learning: A systematic survey,” Neurocomputing, p. 128019, 2024.
- [3] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini, “A survey on federated learning for resource-constrained iot devices,” IEEE Internet of Things Journal, vol. 9, no. 1, pp. 1–24, 2021.
- [4] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou, “Fedbn: Federated learning on non-iid features via local batch normalization,” arXiv preprint arXiv:2102.07623, 2021.
- [5] M. Ficco, A. Guerriero, E. Milite, F. Palmieri, R. Pietrantuono, and S. Russo, “Federated learning for iot devices: Enhancing tinyml with on-board training,” Information Fusion, vol. 104, p. 102189, 2024.
- [6] S. J. Seyedmohammadi, S. M. Sheikholeslami, J. Abouei, A. Mohammadi, and K. N. Plataniotis, “Mofleur: Motion-based federated learning gesture recognition,” in 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS). IEEE, 2024, pp. 1–6.
- [7] W. Zhuang, C. Chen, and L. Lyu, “When foundation model meets federated learning: Motivations, challenges, and future directions,” arXiv preprint arXiv:2306.15546, 2023.
- [8] T. Guo, S. Guo, J. Wang, X. Tang, and W. Xu, “Promptfl: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model,” IEEE Transactions on Mobile Computing, 2023.
- [9] C. Qiu, X. Li, C. K. Mummadi, M. R. Ganesh, Z. Li, L. Peng, and W.-Y. Lin, “Text-driven prompt generation for vision-language models in federated learning,” arXiv preprint arXiv:2310.06123, 2023.
- [10] Y. Xin, S. Luo, H. Zhou, J. Du, X. Liu, Y. Fan, Q. Li, and Y. Du, “Parameter-efficient fine-tuning for pre-trained vision models: A survey,” arXiv preprint arXiv:2402.02242, 2024.
- [11] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning. PMLR, 2019, pp. 2790–2799.
- [12] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
- [13] C. Han, Q. Wang, Y. Cui, W. Wang, L. Huang, S. Qi, and D. Liu, “Facing the elephant in the room: Visual prompt tuning or full finetuning?” arXiv preprint arXiv:2401.12902, 2024.
- [14] H. Zhao, W. Du, F. Li, P. Li, and G. Liu, “Fedprompt: Communication-efficient and privacy-preserving prompt tuning in federated learning,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [15] F.-E. Yang, C.-Y. Wang, and Y.-C. F. Wang, “Efficient model personalization in federated learning via client-specific prompt generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 159–19 168.
- [16] T. Guo, S. Guo, and J. Wang, “Pfedprompt: Learning personalized prompt for vision-language models in federated learning,” in Proceedings of the ACM Web Conference 2023, 2023, pp. 1364–1374.
- [17] W. Lu, X. Hu, J. Wang, and X. Xie, “Fedclip: Fast generalization and personalization for clip in federated learning,” arXiv preprint arXiv:2302.13485, 2023.
- [18] Y. Ma, L. Cheng, Y. Wang, Z. Zhong, X. Xu, and M. Wang, “Fedhpl: Efficient heterogeneous federated learning with prompt tuning and logit distillation,” arXiv preprint arXiv:2405.17267, 2024.
- [19] T. Fan, G. Ma, Y. Kang, H. Gu, L. Fan, and Q. Yang, “Fedmkt: Federated mutual knowledge transfer for large and small language models,” arXiv preprint arXiv:2406.02224, 2024.
- [20] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- [21] G. Bharathi Mohan, R. Prasanna Kumar, S. Parathasarathy, S. Aravind, K. Hanish, and G. Pavithria, “Text summarization for big data analytics: a comprehensive review of gpt 2 and bert approaches,” Data Analytics for Internet of Things Infrastructure, pp. 247–264, 2023.
- [22] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [23] K. Atapour, S. J. Seyedmohammadi, J. Abouei, A. Mohammadi, and K. N. Plataniotis, “Fedd2s: Personalized data-free federated knowledge distillation,” arXiv preprint arXiv:2402.10846, 2024.
- [24] S. J. Seyedmohammadi, S. K. Atapour, J. Abouei, and A. Mohammadi, “Knfu: Effective knowledge fusion,” arXiv preprint arXiv:2403.11892, 2024.
- [25] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
- [26] J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [27] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
- [28] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng et al., “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2. Granada, 2011, p. 4.
- [29] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3498–3505.
- [30] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
- [31] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.