Training-Free Unsupervised Prompt for Vision-Language Models
Abstract
Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompt (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples as the weights of the corresponding cached labels, yielding similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Based on TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset. Code and logs: https://github.com/wlb12345/TFUP.

1 Introduction
Foundational large-scale vision-language models (VLMs), such as CLIP [29] and ALIGN [14], have demonstrated impressive representation and generalization capabilities on various downstream tasks by applying a contrastive learning objective to hundreds of millions of text-image pairs. Due to a potential shift between pre-training and the specific task [23], fine-tuning is a common method to bridge this gap, which leverages labeled data on downstream tasks to fine-tune all parameters of the pre-trained model [5]. However, it requires significant amounts of labeled data and is computationally expensive, and can even lead to over-fitting [11]. To this end, prompt learning, as a parameter-efficient fine-tuning paradigm, has recently attracted increasing attention from the research community. As a representative work, CLIP [29] directly utilizes hand-crafted prompts to achieve impressive zero-shot classification performance, as shown in Fig. 1 (a). However, designing appropriate prompts for each specific task demands an intolerable amount of manual prior knowledge. Inspired by tuning studies in large language models (LLMs) [19, 18], later studies such as CoOp [46] and CLIP-Adapter [7] train learnable prompts on labeled data to alleviate this reliance on hard-prompt designs.
Though few-shot prompt methods gain significant improvements, they still require artificial prior knowledge to label downstream data and rely on manual annotation quality, which may limit the scalability of the original model. To this end, recent unsupervised prompt tuning frameworks have been introduced to eliminate the need for data annotations and enhance the efficiency of adapting VLMs to various downstream tasks [13, 35]. As shown in Fig. 1 (b), these methods tend to fine-tune models or learnable prompts directly on unlabeled data. UPL [13] selects the top-K confidence samples per class to tune the whole model using pseudo labels generated by the pre-trained vision-language model. POUF [35] treats the representations of class-specific text prompts as class prototypes and further aligns these prototypes with target image features in the latent space. However, these methods often undervalue the generalization ability of pre-trained CLIP, which is crucial for achieving robust performance across diverse downstream datasets. By focusing on optimizing CLIP's performance on a specific dataset, they may overlook the importance of learning general features that can be applied to new, unseen data [45, 39]. Besides, these methods rely heavily on pseudo labels to tune additional adaptation modules, which inevitably introduces confirmation bias [1]. Inaccurate pseudo-labels can misguide the tuning process and result in poor generalization capabilities. Consequently, our primary goal is to maximize the retention of the pre-trained VLMs' capabilities while adapting them to downstream tasks at minimal cost.
Motivated by the above limitations, we propose Training-Free Unsupervised Prompt (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. TFUP generates similarity-based prediction probabilities by customizing a Feature Cache Model (FCM) and designing a Multi-level Similarity Measure (MSM). As shown in Fig. 1 (c), we first extract image features of unlabeled training images with CLIP's visual encoder and then calculate the cosine similarity between image features and text features to obtain the predicted probability of each image. In the FCM, we select the top-K samples of each category as high-confidence samples. Nonetheless, these high-confidence images inevitably contain noisy information, such as complex backgrounds. In light of this, we further propose a prototype filter, which introduces an attention mechanism to filter out representative samples from the constructed high-confidence sample set. Consequently, we create a cache model using the features of the representative samples and the corresponding one-hot labels as key-value pairs.
Based on the constructed FCM, we calculate the distance between the test image and cached samples as the weights of the corresponding labels. Then, we combine the different sample labels into a similarity-based prediction probability according to these weights. In existing unsupervised tuning studies, the distance measure only considers the similarity of the overall image information and easily introduces background noise: for example, two pictures with the same background but different foregrounds may be judged to be of the same category. Differently, we design a new measure called Multi-level Similarity Measure (MSM), which considers both feature-level and semantic-level similarities. Specifically, we not only calculate the cosine similarity between image features as the feature-level similarity but also calculate the KL divergence between image prediction probabilities as the semantic-level similarity. Ultimately, we leverage a Hadamard product to combine the feature and semantic similarities. As shown in Fig. 2, through this non-parametric and training-free approach, our TFUP demonstrates excellent efficiency and achieves promising performance, even surpassing training-based unsupervised prompt learning methods [13, 35] on Domain-Net [27] and Office-Home [36].

On top of our effective training-free strategy, we propose a training-based approach (TFUP-T) to further boost VLMs' unsupervised adaptation. Following standard Parameter-Efficient Fine-Tuning (PEFT) methods [7, 40], we append image and text adapters to the image and text encoders, respectively. In addition, we adopt a residual connection [10] to combine pre-trained features with the fine-tuned features to better preserve CLIP's generalization ability. By leveraging TFUP to produce supervisory information on downstream unlabeled datasets, we can effectively tune the adapters to achieve higher performance. Different from existing pseudo-labeling strategies in unsupervised tuning, which mainly focus on instance-level predictions and may accumulate errors [35, 43], we further introduce a marginal distribution entropy loss to constrain the model from a global perspective. As shown in Fig. 2, our TFUP-T achieves state-of-the-art classification performance on multiple benchmarks. In particular, TFUP-T not only achieves an average accuracy improvement of 3.3% over POUF, the SOTA unsupervised method, but also improves by 1.2% over the few-shot method KgCoOp on Domain-Net [27]. Our contributions are summarized as follows:
• We propose the first training-free approach (TFUP) for unsupervised prompt tuning, which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. In particular, we generate similarity-based prediction probabilities with the Feature Cache Model and the Multi-level Similarity Measure.
• Based on TFUP, we propose a training-based approach (TFUP-T) to further boost performance. Considering the lack of labeled downstream data, we simultaneously optimize individual and global predictions on unlabeled data via a pseudo-label cross-entropy loss and a marginal distribution entropy loss.
• Through extensive empirical analysis, our TFUP outperforms the original CLIP on all classification datasets by a large margin. It achieves promising performance without any labeled data or training, even surpassing training-based methods on multiple classification datasets. In addition, the training-based TFUP-T establishes a new state of the art compared with both unsupervised and few-shot prompt learning methods.
2 Related Work
2.1 Vision-Language Models
Large-scale vision-language models have exhibited remarkable representation capabilities on various downstream vision tasks [3, 14, 6]. By employing a contrastive learning objective to train large-scale multi-modal models on 400 million image-text pairs, CLIP [29] achieves impressive results on zero-shot visual recognition tasks. The success of CLIP inspired several subsequent variants. FLIP [20] randomly masks out and removes a large portion of image patches during training to increase training speed without sacrificing accuracy. DeCLIP [21] uses more expansive and scalable supervision to learn visual representations. To reduce the effect of noise in web-crawled data, SoftCLIP [8] provides a soft cross-modal alignment by introducing a softened target, which is produced from the fine-grained intra-modal self-similarity. While CLIP and its variants achieve strong results on downstream datasets, efficiently adapting their capabilities to specific tasks remains important. In this work, we focus on parameter-efficient fine-tuning of large-scale vision-language models to adapt them to downstream tasks.
2.2 Prompt Learning in VLMs
Prompt learning has become the most popular paradigm in the natural language processing community for adapting LLMs to downstream tasks [18, 19, 12]. Motivated by these studies, CoOp [46] is the first approach to apply prompt learning to VLM adaptation in the computer vision community. To further enhance generalization, CoCoOp [45] generates image-conditional context prompts for each image and incorporates them into text prompts for prompt tuning. In addition, KgCoOp [39] reduces the forgetting of general knowledge by minimizing the difference between text embeddings generated by learned prompts and hand-crafted prompts. Unlike the approaches mentioned above, which train learnable prompts on labeled data of specific tasks, unsupervised prompt tuning frameworks fine-tune models or prompts directly on unlabeled data to efficiently adapt VLMs to downstream tasks. UPL [13] selects the top-K confidence samples per class to train itself using pseudo labels generated by a pre-trained vision-language model. POUF [35] treats the representations of class-specific text prompts as class prototypes and aligns them with target image features in the latent space. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor generalization capabilities. We propose Training-Free Unsupervised Prompt (TFUP), which maximally preserves the inherent representation capabilities and enhances them while adapting VLMs to downstream tasks in a training-free and labeling-free manner.

2.3 Pseudo Labeling
Pseudo labeling was originally proposed for semi-supervised learning and has gained popularity in other domains including NLP [24, 9], speech recognition [15, 26], image classification [33, 44], semantic segmentation [42, 41], object detection [34, 38], and domain adaptation [4], to name a few. The main idea is to select the class with the maximum predicted probability as the pseudo label and to fine-tune the model with it together with the true labels. Pseudo-Label [17] is the first pseudo-label method proposed in semi-supervised learning, which selects the category with the maximum predicted probability and converts it into a hard label. [30] uses a confidence-based strategy to further filter unlabeled samples. Recently, as a representative work in semi-supervised learning, FixMatch [33] continues to employ a confidence-based strategy, retaining pseudo-labels with high-confidence predictions. Inspired by this, our proposed TFUP generates pseudo labels for downstream data based on the original and similarity-based prediction probabilities. Benefiting from the powerful generalization of VLMs and the high-confidence feature cache model, we can effectively select more convincing pseudo-labels. However, pseudo-labeling methods that focus mainly on individual predictions inevitably introduce prediction bias [43]. Inspired by recent studies in mutual-information maximization [16, 32, 22], we further introduce a marginal distribution entropy loss to constrain the model from a global perspective.
3 Method
Fig. 3 presents an overview of our proposed TFUP. In this section, we first discuss a representative vision-language model, CLIP, which utilizes hand-crafted prompts in a zero-shot manner for downstream tasks, in Sec. 3.1. Subsequently, we introduce our proposed Training-Free Unsupervised Prompt (TFUP), which maximally preserves the representational capabilities of pre-trained VLMs while adapting them to downstream tasks in a training-free and labeling-free manner, in Sec. 3.2. To further improve performance, we then introduce an unsupervised prompt tuning method (TFUP-T), which simultaneously optimizes individual and global predictions on unlabeled data via a pseudo-label cross-entropy loss and a marginal distribution entropy loss, in Sec. 3.3.
3.1 Preliminaries of CLIP
Contrastive language-image pre-training
CLIP is trained on 400 million image-text pairs with a language-image contrastive loss. Specifically, CLIP consists of two components: a visual encoder, denoted as $E_v$, that converts input images into visual features, and a text encoder, denoted as $E_t$, that transforms input texts into text representations. The image and text features are projected into the same embedding space through joint training. After pre-training on such a large dataset, CLIP delivers excellent classification performance in zero-shot scenarios, demonstrating the vision-language model's strong understanding of open-set concepts. Consider an image classification task defined as classifying a given test image $x$ into one of $C$ categories. Since the text descriptions used in pre-training differ from the labels of downstream recognition tasks, CLIP places each category name into the "[CLASS]" token of a pre-defined textual template such as "a photo of a [CLASS]". We denote the text prompts obtained by converting the original labels as $\{t_c\}_{c=1}^{C}$. Then, we obtain the $d$-dimensional visual feature $f$ and text features $\{w_c\}_{c=1}^{C}$ by
$f = E_v(x) / \lVert E_v(x) \rVert_2$,  (1)
$w_c = E_t(t_c) / \lVert E_t(t_c) \rVert_2, \quad c = 1, \dots, C$,  (2)
where $f$ and $w_c$ are the L2-normalized visual and text features, respectively. We can then obtain the classification result by calculating the cosine similarity between the visual and textual features,
$p_c = \exp\big(\cos(f, w_c)/\tau\big) \big/ \sum_{j=1}^{C} \exp\big(\cos(f, w_j)/\tau\big)$,  (3)
where $p = [p_1, \dots, p_C]$ denotes the prediction probabilities over the $C$ categories and $\tau$ is the temperature learned by CLIP. We can easily identify the prediction as $\hat{y} = \arg\max_{c}\, p_c$.
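To make the zero-shot pipeline of Eqs. (1)–(3) concrete, a minimal PyTorch sketch is included below. It is illustrative only: it assumes OpenAI's open-source clip package and a ViT-B/16 backbone, and the class names and the fixed logit scale of 100 are placeholders rather than settings prescribed by this section.

```python
import torch
import clip  # OpenAI's CLIP repository (github.com/openai/CLIP), assumed installed

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # backbone choice is illustrative

class_names = ["dog", "cat", "car"]                        # placeholder [CLASS] names
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def zero_shot_probs(pil_image):
    """Eqs. (1)-(3): normalized features, cosine similarity, temperature-scaled softmax."""
    f = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device))
    w = model.encode_text(prompts)
    f = f / f.norm(dim=-1, keepdim=True)                   # Eq. (1)
    w = w / w.norm(dim=-1, keepdim=True)                   # Eq. (2)
    logits = 100.0 * f @ w.t()                             # cosine similarity scaled by 1/tau
    return logits.softmax(dim=-1)                          # Eq. (3): prediction probabilities p
```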
3.2 Training-Free Unsupervised Prompt
We propose a training-free and labeling-free method called TFUP, which performs comparably to or even better than training-based unsupervised methods [13, 35]. To achieve this goal, we construct a Feature Cache Model (FCM) that stores the features of representative samples and the corresponding one-hot labels as key-value pairs. Based on the constructed cache model, we calculate the similarity between the test image and the representative samples as the weights of the corresponding labels. In addition, we design a new Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarity between images.

Cache model construction.
Given the pre-trained CLIP [29] model, we aim to leverage the unlabeled training set, denoted as $\mathcal{D}_u = \{x_i\}_{i=1}^{N}$, for unsupervised classification. For each training image $x_i$, we utilize CLIP's visual encoder to extract its $d$-dimensional visual feature $f_i$. Then, we use Eq. (3) to generate the prediction probability $p_i$. Finally, we take the maximum of the CLIP output as the prediction and the corresponding index as the pseudo label $\hat{y}_i$. For all training samples, we denote their visual features and corresponding one-hot pseudo-labels as $F_{tr}$ and $L_{tr}$. To create the key-value cache, we treat the visual features as keys and the corresponding pseudo-labels as values. Therefore, the unsupervised dataset is transformed into a key-value cache database $(F_{tr}, L_{tr})$.
To create a standard Feature Cache Model (FCM), we first employ a confidence filter to select high-confidence samples and their corresponding pseudo-labels from the unsupervised dataset $\mathcal{D}_u$. Specifically, we select the top-K most confident samples of each class, instead of keeping all samples with confidence scores above a pre-defined threshold, to preserve a balanced distribution of pseudo-labeled data. We denote the high-confidence visual features and corresponding pseudo-labels as $F_{conf}$ and $L_{conf}$. Consequently, we obtain the high-confidence pseudo-label dataset $(F_{conf}, L_{conf})$.
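A minimal sketch of this confidence filter follows, assuming pre-extracted, L2-normalized CLIP features `feats` (N×d) and zero-shot probabilities `probs` (N×C); the tensor names and the top-K value are illustrative, not the exact implementation.

```python
import torch

def confidence_filter(feats, probs, top_k=16):
    """Keep the top-K most confident samples per pseudo-class (confidence filter)."""
    conf, pseudo = probs.max(dim=-1)                      # confidence and pseudo label per sample
    keep = []
    for c in range(probs.shape[1]):
        idx = (pseudo == c).nonzero(as_tuple=True)[0]     # samples pseudo-labeled as class c
        if idx.numel() == 0:
            continue
        order = conf[idx].argsort(descending=True)[:top_k]
        keep.append(idx[order])
    keep = torch.cat(keep)
    F_conf = feats[keep]                                                           # cached keys
    L_conf = torch.nn.functional.one_hot(pseudo[keep], probs.shape[1]).float()     # cached values
    return F_conf, L_conf, keep
```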
However, these high-confidence images are not necessarily representative samples. To further refine the confident pseudo-label data into representative samples, we propose a prototype filter, which calculates the cosine similarity between each confident sample and the other confident samples by
$s_i = \frac{1}{|F_{conf}|-1}\sum_{j \neq i}\cos(f_i, f_j)$,  (4)
where $s_i$ denotes the prototype score of the $i$-th image. A higher value of $s_i$ indicates a higher potential to be a category prototype. We select the highest-scoring samples of each category as prototype samples and denote the prototype visual features and corresponding labels as $F_{proto}$ and $L_{proto}$. Finally, we denote the prototype pseudo-label dataset as $(F_{proto}, L_{proto})$.
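The prototype filter of Eq. (4) can be sketched as below; restricting the pairwise comparison to samples of the same pseudo-class and the number of prototypes kept per class are our assumptions.

```python
import torch

def prototype_filter(F_conf, L_conf, n_proto=8):
    """Score each confident sample by its mean cosine similarity to the other confident
    samples of the same pseudo-class (Eq. 4) and keep the highest scorers as prototypes."""
    pseudo = L_conf.argmax(dim=-1)
    keep = []
    for c in pseudo.unique():
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        f = F_conf[idx]                         # features are assumed L2-normalized
        sim = f @ f.t()                         # pairwise cosine similarities
        score = (sim.sum(dim=-1) - 1.0) / max(len(idx) - 1, 1)  # exclude self-similarity
        keep.append(idx[score.argsort(descending=True)[:n_proto]])
    keep = torch.cat(keep)
    return F_conf[keep], L_conf[keep]           # prototype cache (keys, values)
```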
Training-free inference.
Based on the constructed feature cache model, we calculate the similarity between the test image and the prototype samples as the weights of the corresponding labels. We then combine the different sample labels into a similarity-based prediction probability according to these weights. Existing similarity measures only consider the degree of similarity at the feature level. Differently, we propose a Multi-level Similarity Measure (MSM), which considers both feature-level and semantic-level similarity between test samples and prototype samples. Specifically, for each test image $x_{test}$, we utilize CLIP's visual encoder to extract its $d$-dimensional visual feature $f_{test}$. Then, we calculate the multi-level similarity between the test image and the prototype samples as the weights of the corresponding labels. The MSM consists of a Feature Similarity Measure (FSM) and a Semantic Similarity Measure (SSM). For the FSM, we calculate the cosine similarity between the test image feature and the prototype features by
$S_f = f_{test}\, F_{proto}^{\top}$,  (5)
where $S_f$ represents the feature similarity scores between the test image and the cached prototypes. However, the feature-level similarity only considers the similarity of the overall image information and ignores the consistency of sample categories. To this end, we introduce the SSM, which calculates the KL-divergence between the test image prediction probabilities, denoted as $p_{test}$, and the prototype prediction probabilities, denoted as $P_{proto}$, by
$p_{test} = \mathrm{softmax}\big(\cos(f_{test}, W)/\tau\big)$,  (6)
$P_{proto} = \mathrm{softmax}\big(\cos(F_{proto}, W)/\tau\big)$,  (7)
$S_s = \exp\big(-D_{\mathrm{KL}}(p_{test}\,\|\,P_{proto})\big)$,  (8)
where $S_s$ represents the semantic similarity scores, $W$ denotes the textual features of all categories, and the KL-divergence is computed between $p_{test}$ and each row of $P_{proto}$, i.e., $D_{\mathrm{KL}}(p\,\|\,q)=\sum_{c} p_{c}\log\frac{p_{c}}{q_{c}}$. To emphasize the criticality of the two measures, we combine the FSM and SSM by
$S_m = S_f \odot S_s$,  (9)
where $\odot$ denotes the Hadamard product and $S_m$ represents the ultimate multi-level similarity scores. Then, we combine the prototype pseudo-labels into a similarity-based prediction probability according to the multi-level similarity scores by
$p_{cache} = S_m L_{proto}$,  (10)
where $S_m L_{proto}$ denotes a matrix product and $p_{cache}$ denotes the predicted probability based on the feature cache model. After that, we obtain the final prediction probability of the test image, denoted as $p_{final}$, by adding the similarity-based prediction probability $p_{cache}$ to the original prediction probability $p_{test}$,
$p_{final} = p_{test} + p_{cache}$,  (11)
where $p_{final}$ represents the ultimate prediction probability.
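A compact sketch of the full training-free inference (Eqs. (5)–(11)) is given below. The cosine feature similarity, the KL-based semantic term, the Hadamard combination, and the residual addition follow the text; the exponential mapping from KL divergence to a similarity weight and its sharpness `beta` are assumptions.

```python
import torch

def tfup_predict(f_test, p_test, F_proto, P_proto, L_proto, beta=5.0, eps=1e-8):
    """Training-free inference, Eqs. (5)-(11).
    f_test:  (d,)   L2-normalized test feature
    p_test:  (C,)   CLIP prediction probabilities of the test image, Eq. (6)
    F_proto: (M, d) cached prototype features
    P_proto: (M, C) CLIP prediction probabilities of the prototypes, Eq. (7)
    L_proto: (M, C) one-hot pseudo labels of the prototypes
    """
    s_feat = f_test @ F_proto.t()                                         # Eq. (5): feature-level similarity
    kl = (p_test * ((p_test + eps) / (P_proto + eps)).log()).sum(dim=-1)  # KL(p_test || each prototype)
    s_sem = torch.exp(-beta * kl)                                         # Eq. (8): distance -> similarity (assumed form)
    s_multi = s_feat * s_sem                                              # Eq. (9): Hadamard product
    p_cache = s_multi @ L_proto                                           # Eq. (10): similarity-based probabilities
    return p_test + p_cache                                               # Eq. (11): residual combination
```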
3.3 Training-Based Unsupervised Prompt Tuning
Training-based adapters
The performance of TFUP can be further boosted by parameter-efficient fine-tuning. Following standard Parameter-Efficient Fine-Tuning (PEFT) methods [7, 40], we attach image and text adapters, denoted by $A_v$ and $A_t$, to the image and text encoders, respectively. Each adapter consists of two linear layers. Referring to ReZero [2], we further employ two constants $\lambda_v$ and $\lambda_t$ as residual ratios to adjust the proportion of downstream knowledge and inherent model knowledge, which enhances the robustness of the model and prevents over-fitting. Mathematically, the new knowledge captured via fine-tuning is combined with the original features via residual connections,
$f' = \lambda_{v} A_{v}(f) + (1-\lambda_{v})\,f$,  (12)
$w'_{c} = \lambda_{t} A_{t}(w_{c}) + (1-\lambda_{t})\,w_{c}$,  (13)
where $f'$ and $w'_{c}$ denote the merged image and text features, respectively. Then, we can use Eq. (6) with the merged features to generate the prediction probabilities of the test image, denoted as $p'$.
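A minimal sketch of the adapter and the residual mixing of Eqs. (12)–(13) is shown below; the two-layer bottleneck design follows the text, while the reduction factor and module names are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer bottleneck adapter appended to a frozen CLIP encoder."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x, ratio):
        """Eqs. (12)-(13): blend adapted and original (frozen) features."""
        out = ratio * self.net(x) + (1.0 - ratio) * x
        return out / out.norm(dim=-1, keepdim=True)   # keep features on the unit sphere

# usage sketch: f, w are frozen CLIP image/text features (L2-normalized)
# f_new = image_adapter(f, ratio=0.2)   # lambda_v in Eq. (12)
# w_new = text_adapter(w, ratio=0.5)    # lambda_t in Eq. (13)
```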
Table 1: Accuracy (%) on the Domain-Net dataset.

| Category | Methods | Clipart | Infograph | Painting | Quickdraw | Real | Sketch | Avg |
|---|---|---|---|---|---|---|---|---|
| Unsupervised | CLIP (Zero-Shot) [29] | 70.9 | 48.2 | 65.9 | 14.0 | 83.6 | 63.6 | 57.7 |
| Unsupervised | Tent [37] | 71.4 | 47.8 | 66.2 | 14.2 | 83.9 | 64.1 | 57.9 |
| Unsupervised | UPL [13] | 71.7 | 47.5 | 66.3 | 14.4 | 83.8 | 64.3 | 58.0 |
| Unsupervised | POUF [35] | 72.8 | 53.1 | 68.6 | 15.9 | 84.4 | 66.2 | 60.2 |
| Unsupervised | TFUP (Training-Free) | 73.9 | 52.9 | 69.2 | 17.8 | 85.2 | 66.1 | 60.9 |
| Unsupervised | TFUP-T (Training-Based) | 76.0 | 54.7 | 72.1 | 24.6 | 85.8 | 67.9 | 63.5 |
| Few-Shot | CoCoOp [45] | 75.1 | 55.5 | 71.5 | 20.4 | 84.8 | 67.3 | 62.4 |
| Few-Shot | KgCoOp [39] | 75.3 | 55.4 | 71.3 | 19.2 | 85.6 | 66.9 | 62.3 |
Table 2: Accuracy (%) on the Office-Home dataset.

| Category | Methods | Art | Clipart | Product | Real World | Avg |
|---|---|---|---|---|---|---|
| Unsupervised | CLIP (Zero-Shot) [29] | 82.7 | 68.1 | 89.1 | 89.8 | 82.4 |
| Unsupervised | Tent [37] | 83.2 | 67.8 | 91.9 | 90.4 | 83.3 |
| Unsupervised | UPL [13] | 83.3 | 67.7 | 91.5 | 90.7 | 83.3 |
| Unsupervised | POUF [35] | 83.7 | 71.2 | 91.4 | 90.8 | 84.3 |
| Unsupervised | TFUP (Training-Free) | 83.7 | 71.5 | 92.7 | 90.6 | 84.6 |
| Unsupervised | TFUP-T (Training-Based) | 86.0 | 74.2 | 93.1 | 91.7 | 86.3 |
| Few-Shot | CoCoOp [45] | 85.1 | 73.0 | 92.9 | 90.8 | 85.5 |
| Few-Shot | KgCoOp [39] | 85.0 | 73.2 | 92.7 | 91.5 | 85.6 |
Training-based supervision
In the paradigm of unsupervised prompt tuning, it is crucial to seek appropriate supervision for training the adapters on unlabeled data. Thanks to the effectiveness of our training-free strategy, we first generate pseudo-labels on the unlabeled data via TFUP. Specifically, we utilize Eq. (11) to generate the prediction probabilities of the unlabeled instances, denoted as $\tilde{p}_{i}$. We then take the maximum output as the prediction and the corresponding index as the pseudo label $\hat{y}_{i}$. After that, we train the adapters via a standard cross-entropy loss,
$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(\max_{c}\,\tilde{p}_{i,c} \ge \tau_{p}\big)\,\log p'_{i,\hat{y}_{i}}$,  (14)
where $\mathbb{1}(\max_{c}\,\tilde{p}_{i,c} \ge \tau_{p})$ retains only the high-confidence predictions above the pre-defined threshold $\tau_{p}$. However, all such pseudo-label-based methods [17, 25] focus solely on the instance-level constraint, ignoring the significance of global prediction statistics on unlabeled data. Although high-confidence filtering can effectively select more convincing pseudo-labels, constraining only individual predictions inevitably introduces prediction bias [43], considering the different learning difficulties of distinct classes, especially with a huge category space. To this end, inspired by recent studies in mutual-information maximization [16, 32, 22], we further introduce a marginal distribution entropy loss to constrain the model from a global perspective,
$\mathcal{L}_{me} = \log C + \sum_{c=1}^{C}\bar{p}_{c}\log\bar{p}_{c}, \quad \bar{p}_{c} = \frac{1}{N}\sum_{i=1}^{N} p'_{i,c}$,  (15)
where $\log C$ is the maximum value of the distribution entropy and $\bar{p}_{c}$ denotes the marginal distribution of the prediction probabilities of the test images on class index $c$.
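The two objectives of Eqs. (14)–(15) can be sketched as follows; the equal weighting of the two terms and the small numerical epsilons are assumptions.

```python
import math
import torch

def tfup_t_loss(probs, tfup_probs, threshold=0.95):
    """Eqs. (14)-(15).
    probs:      (B, C) predictions p' of the adapted model
    tfup_probs: (B, C) TFUP predictions (Eq. 11) used as supervision
    """
    C = probs.shape[1]
    conf, pseudo = tfup_probs.max(dim=-1)                    # pseudo labels and their confidence
    mask = (conf >= threshold).float()                       # keep only confident pseudo-labels (tau_p)
    log_p = (probs + 1e-8).log()
    ce = -(log_p.gather(1, pseudo.unsqueeze(1)).squeeze(1) * mask).sum() / mask.sum().clamp(min=1)  # Eq. (14)
    p_bar = probs.mean(dim=0)                                # marginal distribution over the batch
    me = math.log(C) + (p_bar * (p_bar + 1e-8).log()).sum()  # Eq. (15): log C minus marginal entropy
    return ce + me                                           # equal weighting is an assumption
```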
4 Experiments
In this section, we evaluate the performance of the training-free TFUP and the training-based TFUP-T on four public datasets: Domain-Net [27], Office-Home [36], Office-31 [31], and VisDA-2017 [28]. We compare our methods with recent prompt learning methods including CLIP (Zero-Shot) [29], Tent [37], UPL [13], POUF [35], CoCoOp [45], and KgCoOp [39]. Additionally, we provide extensive experimental results and further ablation studies.
4.1 Datasets
1) Domain-Net [27] is the largest and most challenging domain adaptation benchmark, containing approximately 600,000 images from 6 domains: Clipart (clp), Infograph (inf), Painting (pnt), Quickdraw (qdr), Real (rel), and Sketch (skt). 2) Office-Home [36] is a challenging domain adaptation dataset that includes 15,500 images of office and home objects from 4 domains: Art (A), Clipart (C), Product (P), and Real World (R). 3) Office-31 [31] is a standard dataset in visual transfer learning that contains 4,652 images with 31 classes from three domains: Amazon (A), Webcam (W), and DSLR (D). 4) VisDA-2017 [28] is a dataset from the 2017 Visual Domain Adaptation Challenge, covering 12 categories and more than 280,000 images, including 152,397 synthetic images and 55,388 real images.
Table 3: Accuracy (%) on the Office-31 dataset.

| Category | Methods | Amazon | DSLR | Webcam | Avg |
|---|---|---|---|---|---|
| Unsupervised | CLIP (Zero-Shot) [29] | 79.0 | 77.5 | 74.7 | 77.1 |
| Unsupervised | Tent [37] | 81.5 | 80.7 | 82.8 | 81.7 |
| Unsupervised | UPL [13] | 81.4 | 82.6 | 83.6 | 82.5 |
| Unsupervised | POUF [35] | 83.6 | 89.9 | 90.6 | 88.0 |
| Unsupervised | TFUP (Training-Free) | 81.2 | 86.9 | 84.2 | 84.1 |
| Unsupervised | TFUP-T (Training-Based) | 84.8 | 90.8 | 93.2 | 89.6 |
| Few-Shot | CoCoOp [45] | 83.9 | 92.4 | 94.2 | 90.2 |
| Few-Shot | KgCoOp [39] | 84.4 | 92.6 | 94.2 | 90.4 |
Table 4: Per-class accuracy (%) on the VisDA-2017 dataset.

| Category | Methods | plane | bicycle | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unsupervised | CLIP (Zero-Shot) [29] | 99.1 | 91.7 | 93.8 | 76.6 | 98.4 | 91.5 | 95.3 | 82.7 | 86.5 | 96.0 | 94.6 | 60.2 | 88.9 |
| Unsupervised | Tent [37] | 98.7 | 91.3 | 86.9 | 89.1 | 98.1 | 94.8 | 92.9 | 85.4 | 90.1 | 92.2 | 94.1 | 48.7 | 88.5 |
| Unsupervised | UPL [13] | 99.0 | 93.0 | 91.3 | 77.8 | 98.4 | 94.7 | 93.8 | 83.3 | 87.1 | 96.1 | 94.5 | 67.5 | 89.7 |
| Unsupervised | POUF [35] | 99.0 | 91.0 | 92.0 | 80.3 | 98.7 | 94.7 | 95.4 | 81.1 | 89.5 | 96.5 | 95.2 | 64.7 | 89.8 |
| Unsupervised | TFUP (Training-Free) | 99.1 | 93.7 | 91.2 | 83.1 | 98.0 | 94.0 | 92.8 | 82.7 | 87.4 | 94.9 | 94.7 | 63.5 | 89.6 |
| Unsupervised | TFUP-T (Training-Based) | 99.2 | 92.0 | 87.3 | 82.6 | 99.1 | 96.4 | 96.4 | 83.6 | 92.8 | 94.2 | 96.2 | 66.1 | 90.5 |
| Few-Shot | CoCoOp [45] | 99.1 | 93.6 | 91.9 | 74.6 | 98.4 | 93.4 | 91.7 | 61.3 | 87.0 | 96.8 | 95.0 | 68.7 | 87.6 |
| Few-Shot | KgCoOp [39] | 99.2 | 92.3 | 93.6 | 76.6 | 98.3 | 90.3 | 94.6 | 83.6 | 85.3 | 96.1 | 94.3 | 62.6 | 88.9 |
4.2 Baselines
To verify the effectiveness of our approach, we compare it with several advanced methods: (1) Zero-shot CLIP [29], which applies hand-crafted prompts; (2) Tent [37], which uses entropy minimization to tune the model before making predictions on downstream datasets; (3) UPL [13], which selects the top-K confidence samples per class to train soft prompts using pseudo labels generated by a pre-trained vision-language model. For a fair comparison, we do not use the model ensemble as done in [13]; (4) POUF [35], which treats the representations of class-specific text prompts as class prototypes aligned with target image features in the latent space; (5) CoCoOp [45], which proposes an image-conditional prompt learning method to generate specific text prompts for each image input; (6) KgCoOp [39], which reduces the difference between text features generated by learnable prompts and hand-crafted prompts to enhance the generalization ability of learnable prompts. The reported results for the baseline models are obtained using the officially released open-source code.
4.3 Implementation Details
Our implementation is based on the open-source repository of POUF [35]. For all experiments, we follow the same unlabeled downstream dataset, visual and text encoders, data augmentation, learning rate schedule, and batch size. Differently, we set the hyper-parameters $\lambda_v$ in Eq. (12) and $\lambda_t$ in Eq. (13), which are analyzed in our experiments. Following previous pseudo-labeling strategies, we set the pre-defined threshold $\tau_p$ in Eq. (14) to 0.95 for all experiments. The final performance reported below is the average of three runs with different random seeds. All experiments are conducted on an RTX 3060 GPU.
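For reference, a hypothetical configuration collecting the hyper-parameters stated in this section; the keys and any values not given in the text (backbone, cache sizes) are placeholders.

```python
# Hypothetical configuration for TFUP-T; values marked "assumed" are not specified in the text.
config = {
    "backbone": "ViT-B/16",     # CLIP encoder, following POUF's setup (assumed)
    "lambda_v": 0.2,            # residual ratio for image features, Eq. (12)
    "lambda_t": 0.5,            # residual ratio for text features, Eq. (13)
    "threshold": 0.95,          # pseudo-label confidence threshold tau_p, Eq. (14)
    "top_k": 16,                # confidence filter: samples kept per class (assumed)
    "n_proto": 8,               # prototype filter: prototypes kept per class (assumed)
    "num_seeds": 3,             # results averaged over three random seeds
}
```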
4.4 Experimental Results
Results on Domain-Net [27]
Tab. 1 reports the results comparing TFUP and TFUP-T with unsupervised prompt tuning and few-shot prompt learning methods across 6 domains of the Domain-Net [27] dataset. Our TFUP achieves new state-of-the-art (SOTA) performance among unsupervised prompt tuning methods without any training. Specifically, TFUP outperforms the original CLIP with prompt engineering by 3.2%. With prompt tuning, TFUP-T not only achieves an average accuracy improvement of 3.3% over POUF, the current SOTA unsupervised method, but also improves by 1.2% over KgCoOp, the SOTA few-shot approach. These experiments demonstrate the efficiency and effectiveness of TFUP, as well as the superiority of the training-based TFUP-T.
Results on Office-Home [36]
Tab. 2 illustrates the results comparing TFUP and TFUP-T with unsupervised prompt tuning and few-shot prompt learning methods across 4 domains of the Office-Home [36] dataset. Our TFUP achieves new SOTA performance without any training compared with unsupervised tuning methods and outperforms the original CLIP with prompt engineering by 2.2%. With prompt tuning, TFUP-T boosts the performance of POUF by 2.0% and demonstrates clear advantages over few-shot methods.
Results on Office-31 [31]
Tab. 3 provides the results comparing TFUP and TFUP-T with unsupervised prompt tuning and few-shot prompt learning methods across 3 domains of the Office-31 [31] dataset. Our TFUP shows significant advantages over Tent and UPL without any training. Although the accuracy of the training-free TFUP is slightly lower than that of the trainable POUF, TFUP-T achieves new SOTA on all domains. Since the Office-31 dataset provides a smaller amount of unsupervised data than the other datasets, the gain achieved by our method is limited.
Results on VisDA-2017 [28]
Tab. 4 summarizes the results comparing TFUP and TFUP-T with unsupervised prompt tuning and few-shot prompt learning methods on the VisDA-2017 [28] dataset. The training-free TFUP demonstrates competitive performance compared to the other unsupervised methods. Through efficient fine-tuning, TFUP-T further improves the performance of the model, confirming the effectiveness and versatility of our methods.
4.5 Ablation analysis
Effectiveness of each component.
To evaluate the effectiveness of each component, we conduct ablation experiments on the Office-Home and Domain-Net datasets, as reported in Tab. 5. We can clearly see that each component significantly improves the performance of the model. For the training-free TFUP, FCM + MSM increases the average accuracy on the Office-Home and Domain-Net datasets by 2.2% and 3.2%, respectively. By efficient fine-tuning, $\mathcal{L}_{ce}$ further improves the performance of the model. Additionally, since the pseudo-labeling strategy mainly focuses on rectifying individual predictions, we introduce the global prediction constraint $\mathcal{L}_{me}$, which increases the average performance of CLIP from 57.7% to 63.5% on the most challenging Domain-Net. This demonstrates the effectiveness of each component and the superiority of our TFUP and TFUP-T.
Table 5: Effectiveness of each component (average accuracy, %); $\mathcal{L}_{ce}$ and $\mathcal{L}_{me}$ denote the pseudo-label cross-entropy and marginal distribution entropy losses.

| FCM | MSM | $\mathcal{L}_{ce}$ | $\mathcal{L}_{me}$ | Office-Home | Domain-Net |
|---|---|---|---|---|---|
|  |  |  |  | 82.4 | 57.7 |
| ✓ |  |  |  | 83.8 (+1.4) | 59.7 (+2.0) |
| ✓ | ✓ |  |  | 84.6 (+2.2) | 60.9 (+3.2) |
| ✓ | ✓ | ✓ |  | 85.4 (+3.0) | 62.3 (+4.6) |
| ✓ | ✓ | ✓ | ✓ | 86.3 (+3.9) | 63.5 (+5.8) |
Table 6: Comparison of sample filtering strategies on the Office-Home dataset (accuracy, %).

| Filtering Strategy | A | C | P | R | Average |
|---|---|---|---|---|---|
| Unsupervised Dataset | 82.5 | 68.3 | 90.9 | 89.8 | 82.9 |
| + Confidence Filter | 83.4 | 70.3 | 91.9 | 90.0 | 83.9 (+1.0) |
| + Prototype Filter | 83.5 | 70.9 | 92.5 | 90.5 | 84.3 (+1.4) |
| + Double Filter | 83.7 | 71.5 | 92.7 | 90.6 | 84.6 (+1.7) |
Different sample filter strategies.
As presented in Tab. 6, we compare several common sample filtering strategies to assess the effectiveness of our approach. i) Confidence Filter Strategy: select the top-K high-confidence samples by pseudo-label confidence. ii) Prototype Filter Strategy: select the top-K representative samples by prototype score. iii) Double Filter Strategy: combine the confidence filter and the prototype filter to obtain representative samples. Compared with using the overall unlabeled data, both filter strategies improve the final results by large margins. This demonstrates that the key for a filter strategy to achieve significant performance is reducing the introduction of additional noisy samples. In addition, we find that combining the confidence and prototype filters improves the results compared with using either of them alone. This indicates that selecting only high-confidence samples is not enough; prototype scores are needed to screen representative samples.
Table 7: Comparison of similarity measure strategies on the Office-Home dataset (accuracy, %).

| Similarity Measure | A | C | P | R | Average |
|---|---|---|---|---|---|
| CLIP (Zero-Shot) | 82.7 | 68.1 | 89.1 | 89.8 | 82.4 |
| + Feature Similarity | 83.3 | 69.7 | 92.0 | 90.2 | 83.8 (+1.4) |
| + Semantic Similarity | 83.6 | 70.9 | 92.3 | 90.5 | 84.3 (+1.9) |
| + Multi-level Similarity | 83.7 | 71.5 | 92.7 | 90.6 | 84.6 (+2.2) |

Different similarity measure strategies.
As summarized in Tab. 7, we provide an in-depth analysis of the similarity measure strategies. i) The Feature Similarity Measure (FSM) leverages cosine similarity to measure the similarity of the overall information of the images. ii) The Semantic Similarity Measure (SSM) leverages KL-divergence to measure the distance between the image prediction probabilities. iii) The Multi-level Similarity Measure (MSM) considers both feature-level and semantic-level similarities between images. We observe that FSM and SSM improve the average accuracy by 1.4% and 1.9%, respectively, and their combination improves the results by 2.2%. This demonstrates the effectiveness of considering both feature-level and semantic-level similarities, which not only measures the similarity of the overall image information but also ensures the semantic consistency of similar samples.
Sensitivity analysis.
As shown in Fig. 5, we evaluate the sensitivity of the hyper-parameters $\lambda_v$ and $\lambda_t$ in Eq. (12) and Eq. (13) across 4 domains of the Office-Home dataset. $\lambda_v$ and $\lambda_t$ are residual ratios that adjust the proportion of new knowledge and inherent knowledge. On most domains, the best residual ratio for image features is $\lambda_v=0.2$ and the best residual ratio for text features is $\lambda_t=0.5$. Model performance decreases when the optimal values are either increased or decreased, possibly because too much new knowledge causes the model to overfit, while too little makes it difficult for the model to adapt to the specific task. Therefore, we set $\lambda_v=0.2$ and $\lambda_t=0.5$ to obtain the best performance trade-off.
5 Conclusion
In this paper, we propose a novel approach named Training-Free Unsupervised Prompt (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. We generate similarity-based prediction probabilities with the proposed Feature Cache Model (FCM) and Multi-level Similarity Measure (MSM). In this way, TFUP outperforms the original CLIP on all classification datasets by a large margin. It achieves promising performance without any labeled data or training, even surpassing training-based prompt learning methods on multiple classification datasets. By efficient fine-tuning, TFUP-T not only achieves the SOTA compared with unsupervised prompt learning approaches but also demonstrates clear advantages over few-shot prompt learning methods. We hope that this paper, which maximally preserves the inherent representation capabilities and enhances them while adapting pre-trained VLMs to specific downstream tasks in a training-free and labeling-free manner, will provide insights for future work on unsupervised prompt tuning.
References
- [1] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks. IEEE, 2020.
- [2] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
- [3] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, pages 104–120. Springer, 2020.
- [4] Jaehoon Choi, Minki Jeong, Taekyung Kim, and Changick Kim. Pseudo-labeling curriculum for unsupervised domain adaptation. arXiv preprint arXiv:1908.00262, 2019.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [6] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and-language transformers. In CVPR, pages 18166–18176, 2022.
- [7] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- [8] Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Wei Liu, Jie Yang, Ke Li, and Xing Sun. Softclip: Softer cross-modal alignment makes clip stronger. arXiv preprint arXiv:2303.17561, 2023.
- [9] Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788, 2019.
- [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [11] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [13] Tony Huang, Jack Chu, and Fangyun Wei. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.
- [14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [15] Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020.
- [16] Andreas Krause, Pietro Perona, and Ryan Gomes. Discriminative clustering by regularized information maximization. Advances in neural information processing systems, 23, 2010.
- [17] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
- [18] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- [19] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- [20] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In CVPR, pages 23390–23400, 2023.
- [21] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
- [22] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
- [23] Zemin Liu, Xingtong Yu, Yuan Fang, and Xinming Zhang. Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. arXiv preprint arXiv:2302.08043, 2023.
- [24] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the human language technology conference of the NAACL, main conference, pages 152–159, 2006.
- [25] Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
- [26] Sree Hari Krishnan Parthasarathi and Nikko Strom. Lessons from building acoustic models with a million hours of speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6670–6674. IEEE, 2019.
- [27] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
- [28] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
- [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [30] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. Workshops on Applications of Computer Vision, 2005.
- [31] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
- [32] Yuan Shi and Fei Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. arXiv preprint arXiv:1206.6438, 2012.
- [33] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
- [34] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
- [35] Korawat Tanwisuth, Shujian Zhang, Huangjie Zheng, Pengcheng He, and Mingyuan Zhou. Pouf: Prompt-oriented unsupervised fine-tuning for large pre-trained models. In International Conference on Machine Learning, pages 33816–33832. PMLR, 2023.
- [36] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
- [37] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- [38] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In ICCV, pages 3060–3069, 2021.
- [39] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.
- [40] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- [41] Zhen Zhao, Sifan Long, Jimin Pi, Jingdong Wang, and Luping Zhou. Instance-specific and model-adaptive supervision for semi-supervised semantic segmentation. In CVPR, pages 23705–23714, 2023.
- [42] Zhen Zhao, Lihe Yang, Sifan Long, Jimin Pi, Luping Zhou, and Jingdong Wang. Augmentation matters: A simple-yet-effective approach to semi-supervised semantic segmentation. In CVPR, pages 11350–11359, 2023.
- [43] Zhen Zhao, Meng Zhao, Ye Liu, Di Yin, and Luping Zhou. Entropy-based optimization on individual and global predictions for semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8346–8355, 2023.
- [44] Zhen Zhao, Luping Zhou, Lei Wang, Yinghuan Shi, and Yang Gao. Lassl: Label-guided self-training for semi-supervised learning. In AAAI, volume 36, pages 9208–9216, 2022.
- [45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
- [46] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.