Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning

Shangchao Su, Mingzhao Yang, Bin Li, Xiangyang Xue¹¹footnotemark: 1 Corresponding author

Abstract

Federated learning (FL) enables multiple clients to collaboratively train a global model without disclosing their data. Previous researches often require training the complete model parameters. However, the emergence of powerful pre-trained models makes it possible to achieve higher performance with fewer learnable parameters in FL. In this paper, we propose a federated adaptive prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification with powerful foundation models, like CLIP. Compared with direct federated prompt tuning, our core idea is to adaptively unlock specific domain knowledge for each test sample in order to provide them with personalized prompts. To implement this idea, we design an adaptive prompt tuning module, which consists of a meta prompt, an adaptive network, and some keys. The server randomly generates a set of keys and assigns a unique key to each client. Then all clients cooperatively train the global adaptive network and meta prompt with the local datasets and the frozen keys. Ultimately, the global aggregation model can assign a personalized prompt to CLIP based on the domain features of each test sample. We perform extensive experiments on two multi-domain image classification datasets across two different settings — supervised and unsupervised. The results show that FedAPT can achieve better performance with less than 10% of the number of parameters of the fully trained model, and the global model can perform well in diverse client domains simultaneously. The source code is available at https://github.com/leondada/FedAPT.

1 Introduction

As privacy protection gains increasing attention, federated learning (FL) (McMahan et al. 2017), a special machine learning paradigm, becomes more popular. FL has been applied to mobile phone album classification, automatic driving, and medical image analysis. FL enables multiple parties to cooperatively train a global model without sharing their training data. In each round of communication, the server sends the global model to the clients, and each client updates the global model with the private dataset to obtain the local model. The server collects the local models for aggregation, and obtains the global model for the next communication round. In computer vision, FL has been applied to classification (Hsu, Qi, and Brown 2020; Li, He, and Song 2021; Yang et al. 2023), detection (Liu et al. 2020; Su et al. 2023), ReID (Zhuang, Wen, and Zhang 2021; Zhuang et al. 2020), etc.

Refer to caption — Figure 1: The impact of different prompts on performance in multi-domain scenarios. The first three prompts are manually set. One can see that when domain information is included in the prompt (in red), the accuracy of the corresponding domain is improved (red arrow), but that of other domains is decreased (green arrow). The bottom three rows are the learned prompts. On one hand, the prompts learned only in one domain perform significantly worse in other domains. On the other hand, we can obtain prompts through FedAPT that are applicable across all domains.

Currently used FL techniques (McMahan et al. 2017; Li et al. 2020; Karimireddy et al. 2020) require all model parameters to be updated from the clients and shared on the server. As a result, collaborative learning incurs substantial client training costs, as well as communication costs for both clients and the server. The non-IID problem of various client data also poses significant difficulties for the federated learning algorithm. The performance of the global aggregation model will be impacted when the client datasets come from diverse data domains. Thus, training a more effective global model with fewer communication costs by adjusting only a few parameters remains an open research question.

The emergence of powerful vision-language pre-trained models, such as CLIP (Radford et al. 2021), opens up new avenues for solving this problem. CLIP formalizes object recognition into image-text matching, rather than an image-label mapping problem. Text encoder and image encoder are two of the Transformer (Vaswani et al. 2017) branches that are included in CLIP. About 400 million web-crawled image-text pairs are used for contrastive learning to establish semantic associations between images and texts. Pre-trained CLIP performs admirably in the zero-shot image classification task and demonstrates high transferability in a number of downstream applications. With the powerful representation ability of CLIP, it becomes possible to fine-tune only a few parameters for each client in the federated learning scenarios.

Inspired by the promising capabilities of CLIP, in this research, we propose a federated prompt tuning algorithm, FedAPT, for multi-domain collaborative image classification in cross-silo federated learning. Consider that there are multiple participants in diverse data domains; the ultimate objective is to develop a global classification model through federated prompt tuning that can perform well in all data domains; see Figure 1. We employ a prompt tuning based strategy to provide learnable prompts for the CLIP. To make the global prompt adapt to all domains, we propose an adaptive method to set customized prompts for various test images in order to address the cross-domain challenge.

The underlying intuition is that images contain domain-specific information. By incorporating this domain-specific information into prompts, we can better guide the pre-trained CLIP model to activate relevant knowledge to that domain, then the classification ability of the corresponding domain can be improved (see the first three rows in Figure 1). The adaptive prompt tuning module includes a meta prompt, an adaptive network, and some frozen keys. Before federated learning, the server randomly assigns each client a frozen key. In local training, each client trains the adaptive network and the prompt using local data and the frozen key. Considering that the client may not have labels in the real scene, we also design an unsupervised training method for FedAPT. After local training, all updated parameters and prompts are sent to the server for aggregation. During inference, we leverage an adaptive network to select a specific key for each test image, so as to generate a specific prompt from the meta prompt.

We conduct extensive experiments on two multi-domain image classification datasets, Office-Caltech10 and DomainNet, across two different settings: supervised and unsupervised. The results show that FedAPT can build a more powerful global model than fully-trained ResNet50 or ViT with less than 10% of the number of parameters, and demonstrates its ability to perform well in a variety of scenarios. All experiments show that our scheme has significantly improved performance compared with the competitors. In general, our contributions are as follows:

•

We utilize CLIP for the first time in federated cross-domain image classification, across two distinct scenarios — supervised and unsupervised. Our results demonstrate the significant potential of CLIP in federated learning.
•

We propose a federated adaptive prompt tuning framework, FedAPT. With the leverage of keys, FedAPT can provide a personalized prompt for each test image without adding any learnable parameters.
•

Our experiments conducted across different settings on both Office-Caltech10 and DomainNet demonstrate that FedAPT outperforms both the fully-trained models and the existing federated prompt tuning methods.

2 Related Work

2.1 Federated Learning

The idea of aggregating client data distribution information to the cloud can be traced back to SCM (Li et al. 2007). Recently, FedAvg (McMahan et al. 2017) extends the aggregation idea to neural networks and proposes the federated averaging algorithm. Despite FedAvg’s strong performance under the assumption of IID, as the non-IID degree increases (due to the diverse distributions of client datasets), the aggregation model’s performance continues to deteriorate, making the non-IID problem a key area of research. Some efforts attempt to enhance the performance of the global aggregation model by improving the optimization algorithm (Li et al. 2020; Karimireddy et al. 2020; Reddi et al. 2020; Wang et al. 2020b; Su, Li, and Xue 2023) or designing better aggregation methods (Wang et al. 2020a; Singh and Jaggi 2020), while others (Dinh, Tran, and Nguyen 2020; Fallah, Mokhtari, and Ozdaglar 2020; Hanzely et al. 2020; Li et al. 2021; Luo et al. 2022) turn to assigning different models to each client, i.e., personalized FL, such as using local batch normalization (Li et al. 2021), decoupling the model into personalized and global parts (Luo et al. 2022). The proposed FedAPT belongs to the former, i.e., to enhance the global model.

Some works (Chen et al. 2022; Nguyen et al. 2023) study the impact of pre-training on FL. Unlike the fine-tuning methods, they need to train the complete model. FedPCL (Tan et al. 2022) studies a setting where each client has $K$ fixed backbones, they fine-tune the projection head by sharing the prototype features of user data. Unlike FedPCL, we do not need to maintain multiple backbones locally, and we do not upload any prototype features. All in all, different from the existing works, we pay more attention to how to effectively utilize the knowledge in the powerful pre-trained models.

Recently, FedIns (Feng et al. 2023) shares a similar spirit with our work, both enabling Instance-adaptive Inference. The difference lies in the implementation: FedIns achieves adaptive inference for ViT through learnable Keys and SSF pool, while our approach achieves adaptive inference for CLIP through randomly initialized keys and a learnable query network coupled with prompts.

2.2 CLIP and Prompt Tuning

The contrastive vision-language pre-trained model CLIP (Radford et al. 2021) transforms image recognition into an image-text matching problem, freeing the object recognition task from human-annotated data, such that a huge amount (400M) of noisy image-text pairs can be used for training. The pre-trained model can be transferred to multiple datasets. Google releases ALIGN (Jia et al. 2021) to expand CLIP’s training data to 1.8B, which achieves better results. Thanks to its powerful representation capability, CLIP has also been successfully applied to a variety of downstream visual tasks, such as RegionCLIP (Zhong et al. 2022) and DetCLIP (Yao et al. 2022) for object detection, MedCLIP (Wang et al. 2022) for medical image analysis, CCR-CLIP (Yu et al. 2023) for text recognition.

Prompt (Liu et al. 2021a) adds additional words or sentences to the input text of the pre-trained language model, to make the pre-trained model better handle downstream tasks. The succeeding works (Li and Liang 2021; Liu et al. 2021b) treat the prompts as continuous vectors in order to learn new knowledge in the process of fine-tuning, which is called Prompt Tuning. In CLIP’s follow-up work, other researchers (Zhou et al. 2022b, a) tried to improve prompt tuning in CLIP. The work most relevant to ours is CoCoOp (Zhou et al. 2022a), which trains prompts conditioned on visual features. However, it is designed for centralized training and heavily relies on high GPU computational power, making it hard to apply to clients in federated learning. In contrast, in this paper, the training of conditions and prompts are decoupled, significantly accelerating the training process.

2.3 CLIP in Federated Learning

In the field of federated image classification, work related to CLIP is rare. PromptFL (Guo et al. 2022) extends CoOp (Zhou et al. 2022b) to the federated learning scenarios, averaging multiple prompts on the server. PromptFL shows the potential application of prompt tuning for CLIP in FL, however, it lacks optimizations tailored for multi-domain scenarios. In comparison, FedAPT proposes an adaptive prompt tuning method for multi-domain image classification. FedCLIP (Lu et al. 2023) adds an adapter consisting of two fully connected layers to the end of the image encoder. FedTPG (Qiu et al. 2023) proposes a prompt generator with better generalization performance on unknown categories. However, they do not generate personalized prompts for the images based on the domain information.

3 Preliminaries

3.1 Notations

Consider a federated learning scenario, such as autonomous driving, where multiple cars are deployed in $K$ regions (domains). Data from the same region exhibits similar styles, while data from different regions possesses distinct styles. Each region has $N_{dom}$ clients. We assume that each client has sufficient computing power to train the prompts. In supervised setting, $\mathcal{D}_{n}=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{m_{n}},n=1,\cdots,N$ denote the local datasets, $m_{n}$ is the number of images in the $n$ -th client. In unsupervised setting, $\mathcal{D}_{n}=\{\mathbf{x}_{i}\}_{i=1}^{m_{n}},n=1,\cdots,N$ . Images in diverse domains have different styles, see Figure 1. Note that there may exist domain and category divergences across different clients at the same time, the divergence of category refers to the different label distribution of different clients. Let $\mathcal{I}(\cdot)$ and $\mathcal{T}(\cdot)$ denote the image encoder and the text encoder. To reduce the calculation and communication costs, we freeze the parameters of the two encoders and perform the training task by fine-tuning the prompts.

3.2 Prompt Tuning

Let $\mathbf{t}^{c}$ be the word embedding of the $c$ th-class, the class text could have the form ‘a picture of a dog’. Then the prediction of CLIP for image classification is:

\displaystyle p(c\mid\mathbf{x})=\frac{\exp\left(\cos\left(\mathcal{T}(\mathbf{t}^{c}),\mathcal{I}(\mathbf{x})\right)/\tau\right)}{\sum_{i=1}^{C}\exp\left(\cos\left(\mathcal{T}(\mathbf{t}^{i}),\mathcal{I}(\mathbf{x})\right)/\tau\right)}

(1)

where $C$ is the number of classes, $\tau$ is a temperature parameter learned by CLIP.

To fine-tune the model, one method is to manually set prompts, such as replacing ‘a picture of a dog’ with ‘a picture of a cute dog’. However, in this way, because the change of words will directly affect the performance of the model, we need to find the most appropriate words. Another method is prompt tuning (Li and Liang 2021; Liu et al. 2021b; Zhou et al. 2022b). Let $\mathbf{p}^{c}=[\boldsymbol{v}^{c}_{1},\boldsymbol{v}^{c}_{2},\ldots,\boldsymbol{v}^{c}_{M}]$ be the prompt. The input of the $c$ -th class for the text encoder is $(\mathbf{p}^{c};\mathbf{t}^{c})$ , where $\{\boldsymbol{v}^{c}_{i}\}^{M}_{i=1}$ are learnable prompts for the $c$ -th class. $\boldsymbol{v}^{c}_{i}$ has the same dimension as word embedding $\mathbf{t}^{c}$ , $M$ is the number of learnable prompts.

3.3 Federated Learning with Prompt

The concept of FL is introduced in FedAvg (McMahan et al. 2017) to collaboratively train a global model that works for each participant. A straightforward way to apply CLIP’s prompt tuning to the FL field is that each client trains its own prompt and uploads the prompt to the server. The server averages the prompts of all clients and distributes them to the clients. Let $\mathbf{p}^{c}_{n}$ denotes the prompt for the $c$ -th class in the $n$ -th client. For the sake of brevity, we omit $c$ and use $\mathbf{p}_{n}$ to denote the prompt in the $n$ -th client. Let $\mathbf{p}_{g}$ denotes the prompt of the global model. Then the global objective is:

\displaystyle\mathbf{p}_{g}=\arg\min_{\mathbf{p}\in\mathbb{R}^{d}}\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\mathbf{x},y\sim\mathcal{D}_{n}}[\ell(\mathcal{T}(\mathbf{p});\mathcal{I}(\mathbf{x}))]

(2)

where $\ell$ is the loss of the $n$ -th client. Although this approach can handle scenarios where multiple clients are IID, it is still difficult for a single global prompt to process clients from diverse data domains.

4 Method

The overview of the proposed FedAPT is in Figure 2. We focus on the problem setting of one-model-fits-all, that is, learning a powerful global model to adapt to the data distribution across all domains. With the parameters of the CLIP frozen, we introduce an additional adaptive prompt tuning (APT) module for the global model. APT accepts image features as input and outputs a specific prompt for each sample. The learnable parts of APT are the prompt and the adaptive network, which are obtained through the federated training of multiple clients.

4.1 Adaptive Prompt Tuning

The input of the text encoder in PromptFL always uses a fixed prompt for all samples: $\mathbf{p}_{g}\in\mathbb{R}^{C\times s\times d}$ , which is the learned prompt by federated averaging, $C$ is the number of classes, $s$ is the length of prompt, $d$ is the embedding dimension. However, under the setting in this paper, each client may come from a specific domain, and there are obvious domain divergences among the images. A fixed prompt is no longer suitable for images in all domains. It is necessary to design a new form. An intuitive way is to introduce personalized prompts into the input of the text encoder for each domain:

\displaystyle\mathbf{p}_{g}+\mathbf{p}_{p_{k}}

(3)

where $\mathbf{p}_{p_{k}}$ encodes the personalized domain knowledge of the $k$ -th domain. Unfortunately, under the goal of one-model-fits-all, we need a unified global model, rather than adding personalized parameters $\mathbf{p}_{p_{k}}$ for each client. Then a compromise improvement for Eq. 3 is to keep all $\mathbf{p}_{p_{k}}$ in the global model, then filter prompts based on the image features:

\displaystyle\mathbf{p}_{g}+f([\mathbf{p}_{p_{1}},\cdots,\mathbf{p}_{p_{K}}],\mathcal{Q}(\mathcal{I}(\mathbf{x})))

(4)

where $\mathbf{p}_{p_{k}}$ is the personalized prompt for the $k$ -th domain, $f(\cdot)$ is a filter function, $\mathcal{Q}$ is an adaptive network that is used to provide auxiliary information for filtering. However, this operation needs to retain a number of $\mathbf{p}_{p_{k}}$ , thus introducing additional parameters in the global model.

Finally, we adopt an adaptive prompt tuning approach. Let $\mathcal{P}$ denotes the output of the APT module, the input for this module is the image features $\mathcal{I}(\mathbf{x})$ . We use the key vector $\mathbf{e}_{k}\in\mathbb{R}^{s\times d}$ to encode personalized domain knowledge into $\mathbf{p}_{g}$ , and decode it when necessary. $\mathcal{P}$ in the global model is:

\displaystyle\mathcal{P}(\mathcal{I}(\mathbf{x}))=\mathbf{p}_{g}+\mathbf{p}_{g}\odot\sum_{k=1}^{K}q_{k}\mathbf{e}^{\prime}_{k}

(5)

where $q_{k}=\mathcal{Q}_{g}(\mathcal{I}(\mathbf{x}))_{k}$ , $\sum_{k}q_{k}=1$ , $\odot$ represents the element-wise multiplication, $\mathbf{e}^{\prime}_{k}\in\mathbb{R}^{C\times s\times d}$ is copied from $\mathbf{e}_{k}$ for $C$ times to match the dimension of $\mathbf{p}_{g}$ , and $\mathcal{Q}_{g}$ is the adaptive network with parameters $\phi$ , which can give a one-hot or soft-membership vector of $K$ dimensions. In the $n$ -th client from the $k$ -th domain, the prompt for the text encoder in the training process is:

\displaystyle\mathbf{p}_{n}+\mathbf{p}_{n}\odot\mathbf{e}^{\prime}_{k}

(6)

where $\mathbf{p}_{n}$ is initialized by $\mathbf{p}_{g}$ , $\mathbf{e}^{\prime}_{k}$ is frozen. Through the design of Eqs.5 and 6, we do not need to introduce additional learnable parameters $\mathbf{p}_{p_{k}}$ , but use the unified $\mathbf{p}_{g}$ , which is named “meta prompt”. The objective for the meta prompt is:

\displaystyle\mathbf{p}_{g}=\arg\min_{\mathbf{p}}\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{n}}[\ell(\mathcal{T}(\mathbf{p}+\mathbf{p}\odot\mathbf{e}^{\prime}_{n});\mathcal{I}(\mathbf{x}))]

4.2 Federated Training

To enable multiple clients to train APT cooperatively, there are three steps. At the beginning of federated training, the server randomly initializes the keys and sends them to different clients (Step 1).

Step2: Local Training. To evaluate the capabilities of CLIP and FedAPT in various FL settings, we consider two different client settings:

In the supervised setting, each client updates the global $\mathbf{p}_{g}$ and $\mathcal{Q}_{g}$ with the local data respectively to obtain $\mathbf{p}_{n}$ and $\mathcal{Q}_{n}$ . The prediction of the $n$ -th local model is:

\displaystyle p(c\mid\mathbf{x})=\frac{e^{\cos\left(\mathcal{T}(\mathbf{p}^{c}_{n}+\mathbf{p}^{c}_{n}\odot\mathbf{e}_{k}),\mathcal{I}(\mathbf{x})\right)/\tau}}{\sum_{i=1}^{C}e^{\cos\left(\mathcal{T}(\mathbf{p}^{i}_{n}+\mathbf{p}^{i}_{n}\odot\mathbf{e}_{k}),\mathcal{I}(\mathbf{x})\right)/\tau}}

where $k$ is the domain index of the client. Then the cross entropy loss for image classification is:

\displaystyle\mathcal{L}_{c}=\frac{1}{\left|\mathcal{D}_{n}\right|}\sum_{\mathbf{x}_{i},y_{i}\sim\mathcal{D}_{n}}-\log p\left(y_{i}\mid\mathbf{x}_{i},\mathbf{p}_{n}\right)

(7)

In an unsupervised setting, we design a two-stage training approach. In the first stage, we augment the unlabeled data (same augment strategy as SimCLR (Chen et al. 2020)) and utilize pseudo-labels with high confidence from the original data as supervised information to perform self-training on the augmented data. The training loss is:

\displaystyle\mathcal{L}_{pseudo}=\frac{1}{\left|\mathcal{D}_{n}\right|}\sum_{\mathbf{x}_{i}\sim\mathcal{D}_{n}}-\log p\left(\hat{y}_{i}\mid\hat{\mathbf{x}}_{i},\mathbf{p}_{n}\right)

(8)

where $\hat{y}_{i}$ is the pseudo one-hot label of $\mathbf{x}_{i}$ predicted by the original CLIP, $\hat{\mathbf{x}}_{i}$ is the augmented version of $\mathbf{x}_{i}$ . After the first stage, we can get a coarse prompt.

In the second stage, use $\mathbf{z}_{i}$ to represent the prediction score vector of $\mathbf{x}_{i}$ , we hope to compare the outputs among different samples. For inter-sample comparison, motivated by AaDLoss (Yang et al. 2022) in domain adaptation, for a certain sample, the logits of $K$ -nearest neighbor samples with similar image features should be as similar as possible, otherwise, there should be differences. Then the loss is:

\displaystyle\mathcal{L}_{inter}=\frac{1}{\left|\mathcal{D}_{n}\right|}\sum_{\mathbf{x}_{i}\sim\mathcal{D}_{n}}(-\sum_{j\in\mathcal{C}_{i}}\mathbf{z}_{i}^{\top}\mathbf{z}_{j}+\lambda\sum_{m\in\mathcal{B}_{i}}\mathbf{z}_{i}^{\top}\mathbf{z}_{m})

(9)

where $\mathcal{C}_{i}$ contains $K$ -nearest neighbors, $\mathcal{B}_{i}$ contains the rest samples. A feature bank and a score bank are used to efficiently implement this loss.

As for intra-sample comparison, we need that the prediction of a sample is close to that of its augmented counterpart:

\displaystyle\mathcal{L}_{intra}=\frac{1}{\left|\mathcal{D}_{n}\right|}\sum_{\mathbf{x}_{i}\sim\mathcal{D}_{n}}KL(\hat{\mathbf{z}}_{i}\|\mathbf{z}_{i})

(10)

where $\hat{\mathbf{z}}_{i}$ is the predict score of the augment data, $\mathbf{z}_{i}$ is extracted from the score bank. Then the overall loss in the second stage is $\mathcal{L}_{c}=\mathcal{L}_{inter}+\mathcal{L}_{intra}$ .

The adaptive network $\mathcal{Q}_{n}(\cdot)$ is a fully-connection layer with parameters $\phi_{n}$ , which tries to classify domain information from the features of images. Each client in the $k$ -th domain trains its own adaptive network with the following cross-entropy loss:

\displaystyle\mathcal{L}_{q}=\frac{1}{\left|\mathcal{D}_{n}\right|}\sum_{\boldsymbol{x}\in\mathcal{D}_{n}}-\log p\left(k\mid\mathbf{x}_{i},\phi_{n}\right)

(11)

The whole loss function for local training is $\mathcal{L}_{c}+\mathcal{L}_{q}$ .

Note that training $\mathcal{Q}_{g}(\cdot)$ is a special FL problem, because each client can only access the data of its own domain. In this paper, we find that in the multi-domain scenario, we can obtain good enough $\mathcal{Q}_{g}(\cdot)$ by directly using cross-entropy for local training and federated averaging for aggregation. To further improve the effect of $\mathcal{Q}_{g}(\cdot)$ , it is necessary to use special training methods, such as FedAws (Yu et al. 2020). In addition, directly uploading the data prototype (Tan et al. 2022; Peng et al. 2019b), can also train the $\mathcal{Q}_{g}(\cdot)$ in the server, but in order to prevent privacy disclosure, we do not upload any data features or prototypes.

Step3: Aggregation. After the local training step, the $n$ -th client uploads $\mathbf{p}_{n}$ and $\phi_{n}$ to the server. Due to $\phi_{n}\in\mathbb{R}^{d_{img}\times K}$ , $d_{img}$ is the dimension of each image feature, which is 512 in CLIP, the additional traffic brought by transmitting $\phi_{n}$ is extremely small compared with prompts. The server averages the parameters to obtain the aggregation model parameters:

\displaystyle\mathbf{p}_{g}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{p}_{n},\quad\phi_{g}=\frac{1}{N}\sum_{n=1}^{N}\phi_{n}

(12)

Repeat Step 2 and Step 3 for $T$ rounds, then $\mathbf{p}_{g}$ and $\mathcal{Q}_{g}(\cdot)$ with parameters $\phi_{g}$ can be used to establish the global model.

Improve inference efficiency. Notably, during the inference process, assuming a batch size of $B$ , since each sample is associated with distinct prompts, the text encoder needs to be executed $B$ times. This issue is also encountered in other visually conditioned prompt methods (Zhou et al. 2022a). To address this concern, we propose an enhanced strategy in Figure 3. By precomputing the text encodings for each key, we eliminate the need for redundant execution of the text encoder, leading to a significant reduction in inference costs.

5 Experiments

We conduct extensive experiments to evaluate the efficacy of FedAPT. We aim to answer the following questions: 1) Can we achieve better performance with fewer learnable parameters in FL using the pre-trained model CLIP, compared to a fully-trained ViT or ResNet? 2) How much improvement in global model can FedAPT achieve compared to existing federated tuning algorithms?

We first compare FedAPT with various baselines on two large-scale multi-domain image classification datasets. We consider two scenarios, where each domain acts as a client independently and each domain is divided into five non-IID clients. Then we conduct several ablation studies to observe the impact of keys on the client model and global model. We also explore the impact of different initialized key values on the performance of the global model.

zero-shot

fully-trained

fine-tune

Datasets

& Domains

CLIP

-zeroshot

ResNet

-full

ViT

-full

ResNet

-tuning

ViT

-tuning

CLIP

-FC

Fed

-CLIP

Prompt

-FL

Fed

-APT

65.86

32.44

63.55

52.32

71.93

74.53

70.25

75.84

77.36

40.50

59.58

27.07

20.85

48.39

48.00

45.70

49.82

52.17

62.25

59.58

49.00

44.66

68.06

69.08

66.16

70.81

72.51

13.36

43.25

62.20

7.13

21.24

31.84

16.98

32.98

48.84

80.04

74.73

68.35

65.85

80.43

83.87

83.04

83.55

84.79

57.92

61.81

54.05

38.33

63.45

65.29

61.75

67.31

68.69

avg

53.32

55.23

54.04

38.19

58.92

62.10

57.31

63.39

67.39

95.69

95.31

36.97

95.31

96.87

96.35

95.31

95.83

93.24

100.00

36.20

98.27

100.00

98.27

100.00

95.23

100.00

12.90

93.54

96.77

93.54

96.77

100.00

92.54

96.44

29.77

93.77

97.33

96.00

98.22

96.44

avg

94.18

97.94

28.96

95.22

97.74

96.47

96.58

97.27

98.07

#params (M)

24.21

87.86

0.71

0.18

0.26

0.52

2.826

2.829

Table 1: The accuracy of the global model in various domains in supervised setting with domain differences. The last row is the number of parameters to be transmitted in each round of communication. Overall, these results indicate that FedAPT achieves the best performance with few learnable parameters.

	$\beta$		c	i	p	q	r	s	Average
by domain	$0.01$	CLIP-FC	67.60	44.39	65.32	22.97	79.42	59.66	56.56
		FedCLIP	68.41	45.04	64.41	16.76	82.22	60.08	56.15
		PromptFL	66.68	43.75	65.31	24.45	77.75	58.74	56.11
		FedAPT	67.21	43.69	66.43	26.70	77.88	57.85	56.63
	$0.5$	CLIP-FC	71.86	46.70	68.10	31.86	83.05	63.29	60.81
		FedCLIP	69.54	45.61	65.15	18.00	82.61	60.39	56.88
		PromptFL	73.19	47.72	69.59	37.31	82.98	64.85	62.61
		FedAPT	74.70	49.94	71.27	46.89	84.41	66.61	65.64
	$5$	CLIP-FC	73.50	47.34	68.62	34.00	83.86	64.37	61.95
		FedCLIP	69.37	45.61	65.58	18.67	82.77	60.73	57.12
		PromptFL	74.49	48.34	69.77	38.30	83.94	66.05	63.48
		FedAPT	75.64	50.63	71.59	50.55	85.02	67.63	66.84
by random	$0.01$	CLIP-FC	70.36	44.55	66.59	21.10	80.03	60.78	57.24
		FedCLIP	68.04	45.10	63.45	16.45	81.93	59.26	55.70
		PromptFL	69.68	41.59	65.19	22.41	77.87	59.40	56.02
		FedAPT	70.50	42.17	66.18	27.48	77.22	59.41	57.16
	$0.5$	CLIP-FC	73.19	46.24	68.73	34.63	82.43	64.24	61.58
		FedCLIP	69.86	45.26	65.50	17.46	82.65	61.05	56.96
		PromptFL	73.71	46.13	69.53	39.65	82.10	65.22	62.72
		FedAPT	75.56	48.41	71.64	47.28	83.32	65.06	65.21
	$5$	CLIP-FC	74.19	47.00	67.85	31.77	84.08	65.88	61.80
		FedCLIP	69.67	45.30	65.42	17.93	82.86	60.69	56.97
		PromptFL	74.44	47.70	69.06	38.68	84.66	67.51	63.68
		FedAPT	76.99	49.45	68.49	49.53	85.92	67.42	66.30

Table 2: The accuracy of the global model in various domains in supervised setting with domain and category differences. There are 30 clients in total. ‘by domain’ means each domain randomly selects one client, and a total of six clients are selected in each round. ‘by random’ means to randomly select six clients from all 30 clients. FedAPT’s global model can significantly improve performance in various domains.

5.1 Experimental Setup

Datasets. We adopt two datasets, Office-Caltech10 (Gong et al. 2012) and DomainNet (Peng et al. 2019a). Office-Caltech10 (OC) is a small dataset that has four domains (amazon, caltech, dslr, and webcam), and each domain contains 10 overlapping categories between the Office-31 (Saenko et al. 2010) dataset and the Caltech-256 (Griffin, Holub, and Perona 2007) dataset. There are at least 100 and at most 800 images in different domains. Since the number of images is small, we use each domain as a client. DomainNet (DN) is a large-scale multi-domain dataset containing six domains (clipart, infograph, painting, quickdraw, real, and sketch), and each domain contains about 33k-120k images with 345 categories. We set the number of clients $N_{dom}$ in each domain as 1 or 5. When $N_{dom}=5$ , we use the Dirichlet distribution to construct the non-IID property across clients. Both datasets contain 224 $\times$ 224 pixel images. Figure 4 depicts the divergences of different domains.

Compared Methods. To comprehensively evaluate the performance, we compare the following methods: 1) ResNet-full: Federated training with ResNet50 (He et al. 2016), all parameters participate in the training procedure. 2) ResNet-tuning, where the backbone of ResNet50 is frozen, only the last layer is trained. 3) ViT-full and 4) ViT-tuning: Replace ResNet in the above two baselines with ViT. 5) CLIP-zeroshot do not make any changes to the CLIP, and directly perform zero-shot image classification in all domains. This helps us understand how much the prompt tuning improves the CLIP. 6) CLIP-FC: Freeze the parameters of CLIP, and insert a learnable fully-connection layer at the end of the image encoder. Only the learnable layer is shared to the server. 7) FedCLIP (Lu et al. 2023): Add an adapter at the end of the visual backbone and perform federated tuning. 8) PromptFL (Guo et al. 2022): A prompt tuning method for CLIP, details can be seen in Section 3.3. 9) FedAPT: our proposed method. In the unsupervised experiment part, we utilize the local training method proposed in Section 4.2 uniformly as FedCLIP and PromptFL have not yet explored federated tuning in an unsupervised setting.

Implementation Details. We use PyTorch to implement all methods. The SGD optimizer is used for both datasets. We set Office-Caltech10 with a learning rate of 0.001 and batch size of 32, and DomainNet with a learning rate of 0.01 and batch size of 256. The global communication round $T_{g}$ is set to 50, and the local training epoch $T_{l}$ is set to $1$ . The length of prompts $s$ is 16. In supervised setting, we set class-specific prompts $\mathbf{p}_{g}\in\mathbb{R}^{C\times s\times d}$ . In unsupervised setting, due to the lack of supervision information, we use the class-shared prompts $\mathbf{p}_{g}\in\mathbb{R}^{s\times d}$ . The CLIP used in this paper takes ViT-B/32 (Dosovitskiy et al. 2020) as the image encoder. Each experiment is repeated three times with different seeds, and the mean result is reported. All experiments are completed with one GeForce RTX 3090 GPU.

5.2 Main Results

Supervised setting with domain differences. We first treat each domain as a separate client with the same class distribution but different domain characteristics.

The results on the two datasets are reported in Table 1. We highlight the following points: 1) From ResNet-full vs. ResNet-tuning and ViT-full vs. ViT-tuning, we can see that pre-trained ViT has more powerful representation ability than pre-trained ResNet. As a result, ResNet must be fully trained in order to achieve the accuracy rate obtained through ViT-tuning. 2) CLIP’s zero-shot classification capability is insufficient to meet the requirements of various domains. The performance gap with federated prompt tuning is large, which indicates that the original CLIP lacks knowledge of the client domains. 3) Inserting a fully-connection layer after the image encoder of CLIP can remarkably enhance the performance on DomainNet, as indicated by CLIP-FC, which approaches the basic version of PromptFL. 4) Considering the performance of different methods and the amount of communication parameters in each round, FedAPT achieves the highest performance at a very low cost, which shows its great potential in the application of pre-trained CLIP to FL scenarios.

Supervised setting with domain and category differences. We split each domain in DomainNet into five clients, i.e., sample $\boldsymbol{\pi}_{c}\sim\operatorname{Dir}\left(\beta\mathbf{1}_{5}\right)$ and allocate a $\boldsymbol{\pi}_{c,i}$ proportion of the instances with label $c$ to the training set of the $i$ -th client in each domain. Smaller $\beta$ leads to larger data distribution differences among clients and more imbalanced classes. Finally, we obtain 30 clients. In addition to the divergences in data domains, there are also divergences in category distribution among clients. In each communication round, we randomly select one client from each domain, meaning six clients are selected for training. Note that because the random seeds are the same, the clients selected in each round by different methods are consistent.

The results are shown in Table 2. As we can see, FedAPT still outperforms the competitors overall. Furthermore, it is worth noting that when dealing with extremely unbalanced categories, all CLIP tuning methods exhibit similar performance. This suggests that the pre-trained CLIP model needs to be augmented with additional designs to achieve better performance, as its existing capabilities may be insufficient.

		c	i	p	q	r	s	Average
domain differences
FedCLIP		68.23	45.14	64.71	18.07	82.00	60.00	56.36
PromptFL		68.37	46.61	65.67	17.34	82.17	60.49	56.78
FedAPT		69.48	47.18	66.12	17.89	83.09	61.19	57.49
domain and category differences
$\beta$ =0.01	FedCLIP	67.65	43.33	63.48	17.55	81.33	59.17	55.42
	PromptFL	68.31	44.41	64.31	16.37	81.29	60.21	55.82
	FedAPT	68.23	44.99	64.29	16.70	81.57	60.17	55.99
$\beta$ =0.5	FedCLIP	67.59	44.11	63.68	18.21	81.52	59.06	55.70
	PromptFL	68.75	46.22	65.13	17.68	82.90	60.62	56.88
	FedAPT	69.02	46.67	65.88	18.31	83.00	60.96	57.31
$\beta$ =5	FedCLIP	67.81	44.32	63.82	18.16	81.58	59.20	55.82
	PromptFL	68.64	46.27	65.60	17.52	82.50	60.38	56.82
	FedAPT	69.13	47.34	66.41	18.47	83.31	61.58	57.71

Table 3: The accuracy of the global model in various domains in unsupervised settings.

Unsupervised setting. We replicate the above experiments in an unlabeled setting, and the results are shown in Table 3. Specifically, we observe that 1) On DomainNet dataset, the performance of unsupervised federated tuning is comparable to that of supervised tuning. This indicates that the unsupervised fine-tuning method proposed in Section 3.3 is effective. 2) Under the unsupervised setting, FedAPT still outperforms the baseline methods, demonstrating its significant effectiveness. In addition, as we use the class-shared prompts for FedAPT and PromptFL in this setting, the communication parameter sizes per round are 0.011M and 0.008M, respectively. This demonstrates the advantage of federated prompt tuning in terms of communication costs.

5.3 Ablation Studies

	c	i	p	q	r	s	Average
meta prompt $\mathbf{p}_{g}$	74.62	48.04	68.82	22.18	82.71	65.98	60.39
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{0}$	78.76	44.77	65.85	20.47	80.80	63.85	59.08
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{1}$	70.96	52.70	65.83	19.26	81.03	62.40	58.70
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{2}$	71.98	46.05	74.58	16.63	81.12	61.20	58.59
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{3}$	69.66	39.99	60.27	48.84	75.78	58.67	58.87
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{4}$	73.70	47.56	68.12	19.66	85.94	64.41	59.90
$\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{5}$	72.28	44.92	65.16	20.63	81.04	70.42	59.08
Eq. 5	77.36	52.17	72.51	48.84	84.79	68.69	67.39

Table 4: The impact of combining meta prompts with keys of different domains.

	c	i	p	q	r	s
w/o key	78.30	51.48	73.99	49.05	85.35	69.72
w key	78.03	51.59	74.00	49.35	85.67	69.73

Table 5: The impact of keys in local training. We show that adding the keys does not destroy the local training.

		i	p	q	r	s	unseen domain (c )
PromptFL		49.71	71.18	34.09	83.58	67.26	70.60
FedAPT	$\tau=0.01$	52.23	72.76	51.25	84.67	69.35	68.26
FedAPT	$\tau=1$	52.27	72.14	51.05	84.18	68.77	70.50

Table 6: The impact of the adaptive network on seen and unseen domains.

Impact of keys. We do a comprehensive study of the role of ‘key’. We disassemble the global model trained with FedAPT in Table 1 and compare its performance under different situations: 1) Use the learned meta prompt $\mathbf{p}_{g}$ to test without any key. 2) Use keys from different domains for testing, such as $\mathbf{p}_{g}+\mathbf{p}_{g}\odot\mathbf{e}^{\prime}_{k}$ . 3) Use the complete APT module, i.e., Eq. 5. The results are reported in Tabe 4. The classification ability of the meta prompt $\mathbf{p}_{g}$ is shown in the first row. Because there is no suitable key, its performance is mediocre. Then the middle six rows show the performance of $\mathbf{p}_{g}$ with different keys. We can see that the $k$ -th key $\mathbf{e}^{\prime}_{k}$ can significantly improve the accuracy in the $k$ -th domain.

The affect of the adaptive network. We evaluate the performance of the global model in both client domains and unseen new domains. We introduce softmax temperature $\tau$ to the output of the adaptive network, and the performance at $\tau=0.01$ can be interpreted as directly determining “which key should be used”. The results in Table 6 show that directly selecting a single key ( $\tau=0.01$ ) improves the performance in client domains but sacrifices generalization in unseen domains. By adjusting $\tau$ , we can significantly enhance client performance while maintaining nearly unchanged generalization performance in unknown domains.

Other results. 1) One worrying point about using the key on the client side is whether the key will limit the learning ability of the local model. We verify this point with an experiment. We repeat the experiment in Table 1 and report the performance of local models from different domains in Table 5. The results show that the use of the key does not limit the learning of the local model. 2) We also compare different initialized keys. The results suggest that the keys work well for different types of random initialization, including the uniform and the standard normal distribution initialization and random orthogonal initialization. We provide the numerical results in the supplementary materials. 3) After precomputing and saving the text branch outputs of different keys, only the image branch and domain classification weights need to be computed during inference. The final inference cost is as follows: CLIP’s GFLOPs is 4.42 while that of ResNet50 is 4.14. 4) We demonstrate the impact in inference efficiency of the strategy in Figure 3. While maintaining nearly unchanged accuracy, the inference time has been reduced from thirty minutes to 30 seconds in one GPU card. Detailed numerical results are provided in the supplementary materials.

6 Conclusions

In this paper, we apply the pre-trained CLIP to the multi-domain federated learning in both supervised and unsupervised settings, and propose an adaptive prompt tuning method that uses domain-specific keys to generate specific prompts for each test sample. We have extensively validated the effectiveness of FedAPT. With less than 10% of learnable parameters, FedAPT achieves performance surpassing that of fully-trained models. Moreover, when faced with challenges such as feature and category differences in client data, FedAPT demonstrates a significant performance improvement compared with the competitors.

7 Acknowledgements

This work was supported in part by the National Key R&D Program of China (No.2021ZD0112803), the National Natural Science Foundation of China (No.62176061), STCSM project (No.22511105000), the Shanghai Research and Innovation Functional Program (No.17DZ2260900), and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.

References

Chen et al. (2022) Chen, H.-Y.; Tu, C.-H.; Li, Z.; Shen, H.-W.; and Chao, W.-L. 2022. On pre-training for federated learning. arXiv preprint arXiv:2206.11488.
Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
Dinh, Tran, and Nguyen (2020) Dinh, C. T.; Tran, N. H.; and Nguyen, T. D. 2020. Personalized Federated Learning with Moreau Envelopes. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fallah, Mokhtari, and Ozdaglar (2020) Fallah, A.; Mokhtari, A.; and Ozdaglar, A. 2020. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Advances in Neural Information Processing Systems, 33: 3557–3568.
Feng et al. (2023) Feng, C.-M.; Yu, K.; Liu, N.; Xu, X.; Khan, S.; and Zuo, W. 2023. Towards Instance-adaptive Inference for Federated Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 23287–23296.
Gong et al. (2012) Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE conference on computer vision and pattern recognition, 2066–2073. IEEE.
Griffin, Holub, and Perona (2007) Griffin, G.; Holub, A.; and Perona, P. 2007. Caltech-256 object category dataset.
Guo et al. (2022) Guo, T.; Guo, S.; Wang, J.; and Xu, W. 2022. PromptFL: Let Federated Participants Cooperatively Learn Prompts Instead of Models–Federated Learning in Age of Foundation Model. arXiv preprint arXiv:2208.11625.
Hanzely et al. (2020) Hanzely, F.; Hanzely, S.; Horváth, S.; and Richtárik, P. 2020. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems, 33: 2304–2315.
He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Hsu, Qi, and Brown (2020) Hsu, T.-M. H.; Qi, H.; and Brown, M. 2020. Federated visual classification with real-world data distribution. In European Conference on Computer Vision, 76–92. Springer.
Jia et al. (2021) Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916. PMLR.
Karimireddy et al. (2020) Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; and Suresh, A. T. 2020. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 5132–5143. PMLR.
Li et al. (2007) Li, B.; Chi, M.; Fan, J.; and Xue, X. 2007. Support cluster machine. In Proceedings of the 24th International Conference on Machine Learning, 505–512.
Li, He, and Song (2021) Li, Q.; He, B.; and Song, D. 2021. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10713–10722.
Li et al. (2020) Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020. Federated Optimization in Heterogeneous Networks. In Dhillon, I. S.; Papailiopoulos, D. S.; and Sze, V., eds., MLSys. mlsys.org.
Li et al. (2021) Li, X.; Jiang, M.; Zhang, X.; Kamp, M.; and Dou, Q. 2021. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Li and Liang (2021) Li, X. L.; and Liang, P. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.
Liu et al. (2021a) Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; and Neubig, G. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Liu et al. (2021b) Liu, X.; Ji, K.; Fu, Y.; Du, Z.; Yang, Z.; and Tang, J. 2021b. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602.
Liu et al. (2020) Liu, Y.; Huang, A.; Luo, Y.; Huang, H.; Liu, Y.; Chen, Y.; Feng, L.; Chen, T.; Yu, H.; and Yang, Q. 2020. Fedvision: An online visual object detection platform powered by federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 13172–13179.
Lu et al. (2023) Lu, W.; Hu, X.; Wang, J.; and Xie, X. 2023. FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning. arXiv preprint arXiv:2302.13485.
Luo et al. (2022) Luo, Z.; Wang, Y.; Wang, Z.; Sun, Z.; and Tan, T. 2022. Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; and Sabato, S., eds., International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, 14527–14541. PMLR.
McMahan et al. (2017) McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
Nguyen et al. (2023) Nguyen, J.; Wang, J.; Malik, K.; Sanjabi, M.; and Rabbat, M. 2023. Where to Begin? Exploring the Impact of Pre-Training and Initialization in Federated. In International Conference on Learning Representations.
Peng et al. (2019a) Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; and Wang, B. 2019a. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, 1406–1415.
Peng et al. (2019b) Peng, X.; Huang, Z.; Zhu, Y.; and Saenko, K. 2019b. Federated adversarial domain adaptation. arXiv preprint arXiv:1911.02054.
Qiu et al. (2023) Qiu, C.; Li, X.; Mummadi, C. K.; Ganesh, M. R.; Li, Z.; Peng, L.; and Lin, W.-Y. 2023. Text-driven Prompt Generation for Vision-Language Models in Federated Learning. arXiv preprint arXiv:2310.06123.
Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
Reddi et al. (2020) Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečnỳ, J.; Kumar, S.; and McMahan, H. B. 2020. Adaptive federated optimization. arXiv preprint arXiv:2003.00295.
Saenko et al. (2010) Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In European conference on computer vision, 213–226. Springer.
Singh and Jaggi (2020) Singh, S. P.; and Jaggi, M. 2020. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33.
Su, Li, and Xue (2023) Su, S.; Li, B.; and Xue, X. 2023. One-shot Federated Learning without server-side training. Neural Networks, 164: 203–215.
Su et al. (2023) Su, S.; Li, B.; Zhang, C.; Yang, M.; and Xue, X. 2023. Cross-domain Federated Object Detection. In 2023 IEEE International Conference on Multimedia and Expo (ICME), 1469–1474. IEEE.
Tan et al. (2022) Tan, Y.; Long, G.; Ma, J.; Liu, L.; Zhou, T.; and Jiang, J. 2022. Federated learning from pre-trained models: A contrastive learning approach. Advances in neural information processing systems.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang et al. (2020a) Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D. S.; and Khazaeni, Y. 2020a. Federated Learning with Matched Averaging. In International Conference on Learning Representations.
Wang et al. (2020b) Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; and Poor, H. V. 2020b. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems 33.
Wang et al. (2022) Wang, Z.; Wu, Z.; Agarwal, D.; and Sun, J. 2022. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. arXiv preprint arXiv:2210.10163.
Yang et al. (2023) Yang, M.; Su, S.; Li, B.; and Xue, X. 2023. Exploring One-shot Semi-supervised Federated Learning with A Pre-trained Diffusion Model. arXiv preprint arXiv:2305.04063.
Yang et al. (2022) Yang, S.; Wang, Y.; Wang, K.; Jui, S.; et al. 2022. Attracting and dispersing: A simple approach for source-free domain adaptation. In Advances in Neural Information Processing Systems.
Yao et al. (2022) Yao, L.; Han, J.; Wen, Y.; Liang, X.; Xu, D.; Zhang, W.; Li, Z.; Xu, C.; and Xu, H. 2022. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection. arXiv preprint arXiv:2209.09407.
Yu et al. (2020) Yu, F.; Rawat, A. S.; Menon, A.; and Kumar, S. 2020. Federated learning with only positive labels. In International Conference on Machine Learning, 10946–10956. PMLR.
Yu et al. (2023) Yu, H.; Wang, X.; Li, B.; and Xue, X. 2023. Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11943–11952.
Zhong et al. (2022) Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L. H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. 2022. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16793–16803.
Zhou et al. (2022a) Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022a. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16816–16825.
Zhou et al. (2022b) Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022b. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
Zhuang, Wen, and Zhang (2021) Zhuang, W.; Wen, Y.; and Zhang, S. 2021. Joint optimization in edge-cloud continuum for federated unsupervised person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, 433–441.
Zhuang et al. (2020) Zhuang, W.; Wen, Y.; Zhang, X.; Gan, X.; Yin, D.; Zhou, D.; Zhang, S.; and Yi, S. 2020. Performance optimization of federated person re-identification via benchmark analysis. In Proceedings of the 28th ACM International Conference on Multimedia, 955–963.

Appendix A Details of Experiments

In some of the experiments, we utilize a standard Dirichlet distribution to partition the dataset, thereby creating disparities in class distributions among clients. Figure 5 illustrates the impact of different hyperparameter values, denoted as $\beta$ , on the divergence in class distributions. As $\beta$ decreases, the disparities increase, and when $\beta$ approaches 0, there is minimal class overlap among clients.

We use the standard pre-trained CLIP. All images is reshaped into 224*224. We use the preprocess pipeline defined by CLIP, which includes Resize, CenterCrop, ToTensor and Normalize.

The details of the local datasets when each domain has one client are shown in Table 7. When each domain has five clients, the training set of each domain is divided into 5 parts.

Client0

clipart

Client1

infograph

Client2

painting

Client3

quickdraw

Client3

real

Client5

sketch

train

33k

36k

50k

120k

121k

48k

test

15k

16k

22k

52k

21k

Table 7: The local datasets when each domain has one client.

Appendix B Additional Experiments

Initialization of keys

We also compare different initialized keys. Table 9 shows the results. ‘rand_U’ means the uniform distribution U(0,1). ‘rand_N’ means the standard normal distribution. ‘rand_01’ means to randomly generate one 01 matrix. ‘rand_O’ indicates random orthogonal initialization, that is, the keys of each domain are orthogonal. We can see that the results of different initializations are stable.

The improve of inference efﬁciency

We demonstrate the impact in inference efﬁciency of the strategy in Figure 3 of the main paper. By precomputing text encodings for different keys, we avoid the need to compute text encodings for each sample in a batch repeatedly, thus saving a considerable amount of inference cost. While maintaining nearly unchanged accuracy, the inference time has been reduced by about 60 times, see Table 8.

	c	i	p	q	r	s
original (min)	19	39	31	30	60	30
Improved (sec)	19.5	26	30	65	67	32

Table 8: The improve of inference efﬁciency.

	c	i	p	q	r	s	Average
rand_U	77.36	52.17	72.51	48.84	84.79	68.69	67.39
rand_N	75.78	51.86	72.02	53.82	84.20	67.40	67.51
rand_01	76.60	52.33	72.59	52.89	84.83	68.30	67.92
rand_O	77.00	52.25	72.83	51.29	84.86	68.64	67.81

Table 9: The effect of different initialization keys. It shows the robustness of initialization methods to keys.

zero-shot		CLIP-zeroshot	65.86	40.50	62.25	13.36	80.04	57.92	53.32	0M
fully-trained	ResNet-full	71.74	32.44	59.58	43.25	74.73	61.81	57.26	24.21M
fully-trained	ViT-full	63.55	27.07	49.00	62.20	68.35	54.05	54.04	87.86M
fine-tune fc layer	ResNet-tuning	52.32	20.85	44.66	7.13	65.85	38.33	38.19	0.71M
	ViT-tuning	71.93	48.39	68.06	21.24	80.43	63.45	58.92	0.18M
	CLIP-fc	74.53	48.00	69.08	31.84	83.87	65.29	62.10	0.26 M
prompt tuning	$s_{p}=8$
	PromptFL	75.95	49.60	70.73	31.06	83.77	67.12	63.04	1.413M
	FedAPT	77.28	52.07	72.45	47.31	84.76	68.73	67.10	1.416M
	$s_{p}=16$
	PromptFL	75.84	49.82	70.81	32.98	83.55	67.31	63.39	2.836M
	FedAPT	77.36	52.17	72.51	48.84	84.79	68.69	67.39	2.829M
	$s_{p}=24$
	PromptFL	76.11	49.77	70.62	32.47	83.46	67.38	63.30	4.239M
	FedAPT	77.34	52.29	72.87	50.35	84.80	68.54	67.70	4.242M

Table 10: The performance of the global model.

s_{p}

denotes the length of prompts.

Influence of the length of prompts.

In the main paper, we set the length of prompts to $16$ for PromptFL and FedAPT. Here we show the performance of the global model obtained by using prompts of different lengths. From Table 10, we can see that when the length of prompts increases, the performance of FedAPT will also increase, and FedAPT still outperforms all comparison methods. Furthermore, the performance of PromptFL is not improved with the increase of $s_{p}$ .

Performance of $\mathcal{Q}$ .

We take the test set of all domains as a whole, label the data according to the domain index, and test the classification accuracy of the adaptive network. The result is as shown in Figure 6. The adaptive network can converge rapidly in the early communication period.

Federated Adaptive Prompt Tuning for Multi-Domain Collaborative Learning

Abstract

1 Introduction

2 Related Work

2.1 Federated Learning

2.2 CLIP and Prompt Tuning

2.3 CLIP in Federated Learning

3 Preliminaries

3.1 Notations

3.2 Prompt Tuning

3.3 Federated Learning with Prompt

4 Method

4.1 Adaptive Prompt Tuning

4.2 Federated Training

5 Experiments

5.1 Experimental Setup

5.2 Main Results

5.3 Ablation Studies

6 Conclusions

7 Acknowledgements

References

Appendix A Details of Experiments

Appendix B Additional Experiments

Initialization of keys

The improve of inference efﬁciency

Influence of the length of prompts.

Performance of 𝒬\mathcal{Q}.

Performance of $\mathcal{Q}$ .