
Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Abstract

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods avoid training by leveraging an image-modality cache and retrieval, they overlook the importance of the text modality and of cross-modal cues for the efficient adaptation of parameters in vision-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval over bimodal vision-language information to gather cues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling the similarity measures of the different modalities to assess their respective contributions. Additionally, it mines hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of the sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.

Index Terms—  adapter, vision-language model, cross-modal, cache model, hard example

1 Introduction

Vision-Language Models (VLM), such as CLIP [1], ALIGN [2], and BLIP [3], have demonstrated excellent performance across multiple tasks, including recognition, generation, and classification. Traditional pre-training models achieve adaptation to downstream tasks by introducing various auxiliary losses and fine-tuning all parameters of the model. This approach generates a separate set of fine-tuned parameters for each task and proves effective on smaller-scale models. However, these methods suffer from two drawbacks: (i) As the model size increases, fine-tuning requires significantly more time and memory. (ii) For VLMs pre-trained on millions of image-text pairs, this approach may lead to overfitting in scenarios with a limited number of available samples. To address these issues, efficient transfer learning (ETL) proposes fine-tuning only a small subset of parameters. This allows task-relevant knowledge to be transferred from the VLM to downstream tasks while mitigating the challenges associated with increased model scale and resource constraints.

Common ETL methods include LoRA [4], Prompt [5], and Adapter [6]. LoRA [4] adopts a low-rank matrix decomposition approach by introducing an auxiliary matrix alongside the original matrix. The model updates are achieved by modifying this auxiliary matrix. This ETL method tends to yield moderate performance on smaller-scale models. CoOp [7] first proposes turning the fixed prompt “this is a photo of a [CLS]” in CLIP [1] into a learnable prompt, enhancing the model’s generalization capabilities. However, this type of prompt is sensitive to parameters and can be challenging to train.

Fig. 1: XMAdapter and competing methods: CLIP-Adapter [8], Tip-Adapter [9]. (a) Traditional processing method; (b) training-free adaptation method; (c) our method XMAdapter.

The adapter technique, based on parameter fine-tuning, has shown significant effectiveness in VLM. In Fig. 1(a), CLIP-Adapter [8] achieves impressive results by freezing the parameters of the visual encoder ($f_{V}$) and the textual encoder ($f_{T}$). Adapters are then added separately to $f_{V}$ and $f_{T}$ to extract information from the image and text domains. The extracted features are connected to the original features through residual connections, leading to excellent performance in image classification and recognition tasks. In Fig. 1(b), Tip-Adapter [9], during the training phase, stores the labels and corresponding features of images as a cache model. During the inference phase, the input image features undergo cosine similarity calculation with the stored image features in the cache model. This similarity computation, combined with the original CLIP [1] features, aims to enhance the model's performance. GraphAdapter [10] proposes modeling images and text as separate sub-graphs to enhance the performance of the model.

These adapters are designed for either the image or the text modality, with the two parts working independently and without merging or exchanging information. How to fully leverage the fused information between images and text has become a focal point of attention in the research community. To address this issue, we propose XMAdapter, an approach that integrates textual and image information, as illustrated in Fig. 1(c). The model establishes key-value pairs for both the image and text domains, embedding textual knowledge into the image domain, thereby creating a cross-modal cache model.

To further enhance the model’s performance, we propose a method that independently adjusts the ratio of images and text. This is achieved by setting different ratios to adjust the fusion degree of image cache and text cache, thereby decoupling the measurement methods of similarity between different modalities. This addresses the difficulty of classifying hard samples when using only images or text. We explore hard samples based on the differences in cross-modal affinities and dynamically adjust the learning intensity for these samples, thereby further enhancing the model’s performance.

During the inference stage, the model first calculates the similarity between the features of the test data and the key-value pairs in the cache model. Subsequently, it combines the model’s prediction with the original CLIP’s prediction through residual connections. Our key contributions can be summarized as follows:

  • We propose a novel cross-modal cache model that independently constructs cache models for images and text. This model effectively integrates features from both modalities, proving crucial for efficient transfer learning in VLMs.

  • The model leverages retrieval through bimodal information from visual and language modalities to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion and exploits the differences in modality affinity to mine hard samples. This is done through adaptive adjustments to the learning intensity of the samples, enhancing the model’s performance.

  • XMAdapter demonstrates strong representation learning capabilities, achieving excellent results on 11 benchmark datasets, including tasks such as image recognition and image classification. Furthermore, it showcases robust generalization performance on 4 benchmark datasets.

Fig. 2: Illustration of the proposed XMAdapter. The reddish-orange line depicts the flow of image features, while the pea-green line represents the flow of text features. The model initially establishes a cache model with key-value pairs. Subsequently, it enhances the robustness of the model through adaptive adjustment of the fusion ratio between images and text, along with a strategy for learning hard samples. Finally, the model incorporates the knowledge from the original VLM to improve the accuracy of predictions.

2 Methodology

2.1 Image Cache Model Construction

To thoroughly leverage the knowledge within the training data, we have established a key-value cache model as a feature adapter. The specific procedure is as follows: for each training image, we utilize the pre-trained CLIP visual encoder to extract a $C$-dimensional L2-normalized feature. Subsequently, we convert the actual label into an $N$-dimensional one-hot vector, denoted as $L_{N}$. For the $K$-shot $N$-class training samples $I_{K}$, where each class consists of $K$ labeled images, resulting in a total of $NK$ training samples, we use $F_{\text{train}}^{\text{image}}$ and $L_{\text{one-hot}}^{\text{label}}$ to represent visual features and label vectors, respectively, serving as the key and value for the cache model. This key-value pair memorizes the newly extracted knowledge from a small training dataset. Finally, the affinities of the image side, denoted as $A^{\text{image}}$, can be described as follows:

$F_{\text{train}}^{\text{image}} = \text{ImageEncoder}(I_{K}),$   (1)
$L_{\text{one-hot}}^{\text{label}} = \text{OneHot}(L_{N}),$
$A^{\text{image}} = \cos\left(f_{\text{test}}^{\text{image}},\, F_{\text{train}}^{\text{image}}\right),$

where “ImageEncoder” is the visual encoder of CLIP, “OneHot” is the operation that transforms $L_{N}$ into a one-hot encoding, and $f_{\text{test}}^{\text{image}}$ represents the features obtained from the testing images after passing through the “ImageEncoder”. “$\cos$” denotes the cosine similarity between the two features.
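As a concrete reference, the image-side cache construction and affinity computation of Eq. 1 can be sketched in PyTorch as follows. This is a minimal sketch, not the released implementation: the `image_encoder` callable, the tensor shapes, and the function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def build_image_cache(image_encoder, train_images, train_labels, num_classes):
    """Key-value cache of Eq. 1: L2-normalized image features as keys,
    one-hot labels as values. `image_encoder` is assumed to be the frozen
    CLIP visual encoder returning (N*K, C) features."""
    with torch.no_grad():
        keys = image_encoder(train_images)                  # (N*K, C) visual features
    keys = F.normalize(keys, dim=-1)                        # L2 normalization
    values = F.one_hot(train_labels, num_classes).float()   # (N*K, N) one-hot labels
    return keys, values

def image_affinity(image_encoder, test_images, cache_keys):
    """A^image: cosine similarity between test features and cached keys."""
    with torch.no_grad():
        f_test = F.normalize(image_encoder(test_images), dim=-1)  # (B, C)
    return f_test @ cache_keys.t()                                # (B, N*K)
```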

2.2 Cross-modal Cache Model Construction

To better utilize information across different modalities, we devise a cross-modal cache model as follows. First, the pre-trained text-side features $F_{\text{CoOp}}^{\text{text}}$ from CoOp are linearly mapped by the MetaNet network into a low-dimensional space $D$, yielding the meta2text feature. Next, using the values of the cache labels to query this feature, we generate a matrix whose dimensions are the total number of samples $N$ and $D$. Then, $f_{\text{test}}^{\text{image}}$ is mapped through the Img2TxtNet network into the same low-dimensional space $D$ to produce the image2text feature. Finally, the affinities of the textual side, denoted as $A^{\text{text}}$, can be described as follows:

$A^{\text{text}} = \cos\left(\text{MetaNet}(F_{\text{CoOp}}^{\text{text}}),\, \text{Img2TxtNet}(f_{\text{test}}^{\text{image}})\right),$   (2)

where MetaNet and Img2TxtNet are linear neural networks (MLPs).
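A minimal sketch of the text-side branch of Eq. 2 is given below, assuming both MetaNet and Img2TxtNet are small two-layer MLPs that project into a shared $D$-dimensional space. The hidden width, the normalization before the dot product, and all names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionMLP(nn.Module):
    """Small MLP used here for both MetaNet and Img2TxtNet, projecting
    features into a shared low-dimensional space D (Eq. 2)."""
    def __init__(self, in_dim, d=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, d)
        )

    def forward(self, x):
        # Normalize so that a plain dot product equals cosine similarity.
        return F.normalize(self.net(x), dim=-1)

def text_affinity(meta_net, img2txt_net, text_feats, f_test_image):
    """A^text: cosine similarity between projected text keys and projected
    test-image queries."""
    text_keys = meta_net(text_feats)        # (N*K, D) meta2text features
    img_query = img2txt_net(f_test_image)   # (B, D) image2text features
    return img_query @ text_keys.t()        # (B, N*K)
```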

Fig. 3: Performance comparison of our XMAdapter with SOTA methods on cross-label generalization, including 1-/2-/4-/8-/16-shot settings on 11 benchmark datasets.

2.3 Adaptive Scaling

To further enhance the model's performance, we calculate the similarities on the image and text sides separately. To better understand the contribution of each modality to the classification results, we propose a dynamic adjustment method for the proportions of $A^{\text{image}}$ and $A^{\text{text}}$, decoupling the similarity measures of the different modalities. The adjustment process is as follows:

$A = \gamma A^{\text{image}} + (1-\gamma)A^{\text{text}},$   (3)

where $\gamma$ represents the adaptive adjustment coefficient.
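One plausible way to realize Eq. 3 is to parameterize $\gamma$ as a learnable scalar squashed into $[0, 1]$ by a sigmoid, as in the sketch below. The paper only states that $\gamma$ is an adaptive adjustment coefficient, so this parameterization (and the initialization at 0.7, the best value in Table 3) is an assumption.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Fuse image- and text-side affinities (Eq. 3). Gamma is stored as a
    logit and passed through a sigmoid so the mixing weight stays in [0, 1]."""
    def __init__(self, init_gamma=0.7):
        super().__init__()
        logit = torch.log(torch.tensor(init_gamma / (1.0 - init_gamma)))
        self.gamma_logit = nn.Parameter(logit)

    def forward(self, a_image, a_text):
        gamma = torch.sigmoid(self.gamma_logit)
        return gamma * a_image + (1.0 - gamma) * a_text
```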

2.4 Building Logits

The model acquires knowledge from two sources. One part comes from the cache model constructed with a small number of labeled samples, obtained by multiplying the affinities matrix $A$ by $L_{\text{one-hot}}^{\text{label}}$. The other part comes from the prior knowledge of the original CLIP classifier $W_{c}$. The contributions of these two terms are balanced by the weight $\alpha$. The entire process of constructing logits can be described as follows:

$\text{logits}_{\text{cache}} = \exp\left(-\beta\left(1 - A L_{\text{one-hot}}^{\text{label}}\right)\right),$   (4)
$\text{logits} = \alpha\,\text{logits}_{\text{cache}} + f_{\text{test}}^{\text{image}} W_{c},$

where $\alpha$ controls the fusion ratio and $\beta$ controls the sharpness of the affinities matrix.
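The logit construction of Eq. 4 then reduces to a few tensor operations, sketched below. The default values $\alpha = 1.2$ and $\beta = 3.5$ are taken from the 16-shot ImageNet ablation in Table 4, and the variable names are illustrative.

```python
import torch

def build_logits(affinity, cache_values, f_test_image, clip_classifier,
                 alpha=1.2, beta=3.5):
    """Eq. 4: blend cache-model knowledge with the frozen CLIP classifier.
    alpha/beta defaults follow the 16-shot ImageNet ablation (Table 4)."""
    cache_logits = torch.exp(-beta * (1.0 - affinity @ cache_values))  # (B, num_classes)
    clip_logits = f_test_image @ clip_classifier                       # zero-shot CLIP term
    return alpha * cache_logits + clip_logits
```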

2.5 Hard Example Mining

The model already performs well with the cache model. To further improve it, we adopt an OHEM (Online Hard Example Mining) strategy: hard examples are assigned different weights to enhance the model's accuracy. The process is as follows. We obtain the affinities matrix $A^{\text{image}}$ for images and the affinities matrix $A^{\text{text}}$ for text, and identify hard samples by leveraging the affinity differences between the two modalities. The weights for learning hard examples are adaptively adjusted and can be described as follows:

$A_{b}^{\text{weight}} = \frac{1}{N}\sum_{n=1}^{N}\text{sigmoid}\left(\left|A_{bn}^{\text{image}} - A_{bn}^{\text{text}}\right|\right),$   (5)

where $|\cdot|$ denotes the absolute value, $A_{bn}^{\text{image}}$ is an entry of the $A^{\text{image}}$ affinities matrix, $N$ represents the number of samples in the cache model, and sigmoid is the threshold function with values ranging in $[0, 1]$.

During the training phase, the model first calculates the cross-entropy loss $\mathcal{L}_{b}^{ce}$ between the logits and the labels of the training samples. It then reweights this loss with $A_{b}^{\text{weight}}$ to form the final loss function. The process can be described as follows:

$\mathcal{L}_{b}^{ce} = -\frac{1}{K}\sum_{k=1}^{K} y_{k}\log(\text{logits}_{k}),$   (6)
$\mathcal{L} = \frac{1}{B}\sum_{b=1}^{B}\mathcal{L}_{b}^{ce}\cdot A_{b}^{\text{weight}},$

where $K$ is the number of classes, $y$ represents the label of the sample, $B$ is the total number of samples in a batch, and $b$ indexes a sample in the batch.
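A minimal sketch of Eqs. 5 and 6 follows; it uses the standard PyTorch cross-entropy as a stand-in for $\mathcal{L}_{b}^{ce}$ and assumes $A^{\text{image}}$ and $A^{\text{text}}$ are batch-by-cache affinity matrices.

```python
import torch
import torch.nn.functional as F

def hard_example_weights(a_image, a_text):
    """Eq. 5: per-sample weight from the cross-modal affinity gap, passed
    through a sigmoid and averaged over the cache entries."""
    gap = (a_image - a_text).abs()           # (B, N) affinity difference
    return torch.sigmoid(gap).mean(dim=1)    # (B,) weight per batch sample

def weighted_loss(logits, targets, a_image, a_text):
    """Eq. 6: per-sample cross-entropy re-weighted by the hard-example weight."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # (B,)
    w = hard_example_weights(a_image, a_text)
    return (ce * w).mean()
```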

3 Experiment

Table 1: The performance comparison regarding generalization capability on four CLIP visual backbones. The ETL methods are optimized with the ImageNet dataset on a 16-shot setting and tested on cross-domain datasets, including ImageNet-V2, ImageNet-Sketch, ImageNet-A, and ImageNet-R.
Method  Backbone  ImageNet (Source)  -V2  -Sketch  -A  -R  Average (Target)
Zero-shot CLIP [1] ResNet-50 58.18 51.34 33.32 21.65 56.00 40.58
Linear Probe CLIP [1] 55.87 45.97 19.07 12.74 28.16 26.49
CoOp [7] 62.95 55.11 32.74 22.12 54.96 41.23
TaskRes [11] 64.75 56.47 35.83 22.80 60.70 43.95
GraphAdapter [10] 65.70 56.40 34.50 21.88 58.94 42.93
XMAdapter 66.22 56.51 36.72 23.46 61.53 44.56
Zero-shot CLIP [1] ResNet-101 61.62 54.81 38.71 28.05 64.38 46.49
Linear Probe CLIP [1] 59.75 50.05 26.80 19.44 47.19 35.87
CoOp [7] 66.60 58.66 39.08 28.89 63.00 47.41
TaskRes [11] 67.70 59.50 41.70 29.87 68.07 49.79
GraphAdapter [10] 68.23 59.60 40.83 28.77 67.13 49.08
XMAdapter 68.96 59.64 41.50 30.57 68.82 50.13
Zero-shot CLIP [1] ViT-B/32 62.05 54.79 40.82 29.57 65.99 47.79
Linear Probe CLIP [1] 59.58 49.73 28.06 19.67 47.20 36.17
CoOp [7] 66.85 58.08 40.44 30.62 64.45 48.40
TaskRes [11] 68.20 59.20 42.50 31.43 69.33 50.62
GraphAdapter [10] 68.80 59.00 41.70 29.57 68.67 49.74
XMAdapter 69.56 59.12 42.91 31.95 69.57 50.89
Zero-shot CLIP [1] ViT-B/16 66.73 60.83 46.15 47.77 73.96 57.18
Linear Probe CLIP [1] 65.85 56.26 34.77 35.68 58.43 46.29
CoOp [7] 71.92 64.18 46.71 48.41 74.32 58.41
TaskRes [11] 73.07 65.30 49.13 50.37 77.70 60.63
GraphAdapter [10] 73.68 65.57 48.57 49.23 77.20 60.14
XMAdapter 74.43 65.54 49.58 50.69 77.95 60.94

3.1 Cross Label Generalization

We compared the performance of XMAdapter with Zero-shot CLIP [1], CoOp [7], TaskRes [11], CLIP-Adapter [8], Tip-Adapter [9], and GraphAdapter [10] on 11 datasets, as shown in Fig. 3. The model performs well across all 11 datasets in the 1-/2-/4-/8-/16-shot settings. In particular, at 16 shots, XMAdapter achieves an average accuracy of 76.87% across the 11 datasets, surpassing GraphAdapter's 76.22% by 0.65%. On the challenging fine-grained classification dataset FGVCAircraft, XMAdapter outperforms the six compared methods in the 2-/4-/8-/16-shot settings. This demonstrates that the cross-modal adapter design of XMAdapter, which integrates both image and text information, is better suited for downstream tasks. A detailed introduction to the comparison methods can be found in Appendix A.

3.2 Domain Generalization

We further tested the generalization capability of XMAdapter; the experimental results are shown in Table 1. The model is trained with 16-shot training samples using ImageNet [12] as the training dataset. The testing datasets include ImageNet-V2, ImageNet-Sketch, ImageNet-A, and ImageNet-R. These testing datasets share the same categories as ImageNet [12] but differ in background, texture, semantics, and other aspects. We used ResNet-50 [13], ResNet-101 [13], ViT-B/32 [14], and ViT-B/16 [14] as backbones. XMAdapter achieved average improvements of +0.61%, +0.34%, +0.27%, and +0.31% on the four backbones, respectively. The experimental results demonstrate that XMAdapter exhibits strong generalization capabilities with its cross-modal cache model. Details regarding the datasets and implementation are provided in Appendix B.

Table 2: Comparison of classification accuracy (%), time efficiency, parameters, and GFlops for different methods on 16-shot ImageNet [12], where our proposed XMAdapter achieves a superior accuracy-efficiency trade-off.

Models             Tunable Parameters (M)  GFlops   Training Time (one epoch, s)  Inference Time  GPU Memory  Performance (%)
CoOp [7]           0.008                   1943.12  40.91                         119.64 ms       18.907      62.95
CLIP-Adapter [8]   0.524                   1959.44  45.71                         275.22 ms       9.257       63.59
Tip-Adapter [9]    16.384                  5.43     12.36                         51.03 ms        4.313       65.44
TaskRes [11]       1.024                   5.42     13.64                         4.89 ms         6.227       64.75
GraphAdapter [10]  4.145                   5.42     23.29                         4.91 ms         10.75       65.70
XMAdapter          18.561                  5.39     13.41                         73.24 ms        5.148       66.22

3.3 Model Complexity

We compared our experimental results with existing efficient transfer learning methods from six perspectives: tunable parameters, GFlops, training time, inference time, GPU memory, and performance. All experiments were run with batch size 32 on a single NVIDIA GeForce RTX 3090 GPU, under the 16-shot setting on the ImageNet [12] dataset; the results are presented in Table 2. We observed that the tunable parameters of XMAdapter are slightly higher than those of Tip-Adapter [9], mainly because establishing the cross-modal cache model is more intricate than building a conventional cache model. The GFlops value is comparable to Tip-Adapter [9], TaskRes [11], and GraphAdapter [10]. The training time is lower than that of CoOp [7], TaskRes [11], and GraphAdapter [10]. In terms of inference time and GPU memory, XMAdapter ranks in the middle among the compared methods. Furthermore, its performance is 3.27% higher than CoOp [7] and 0.52% higher than GraphAdapter [10]. Therefore, XMAdapter meets the requirements of parameter-efficient transfer learning in terms of resource consumption and operational efficiency, and achieves promising experimental results.

3.4 Ablation Studies

Different Hyper-parameter $\gamma$. In Eq. 3, $\gamma$ is the parameter adjusting the fusion ratio of $A^{\text{image}}$ and $A^{\text{text}}$. To determine an appropriate value for $\gamma$, we conducted tests on the 16-shot ImageNet dataset, as shown in Table 3. When $\gamma$ is set to 0, only $A^{\text{text}}$ influences the model, making it similar to CoOp [7] and achieving an accuracy of 62.95%. When $\gamma$ is set to 1, the model is similar to the configuration of Tip-Adapter [9], reaching 65.44%. The best result of 66.22% is achieved when $\gamma$ is set to 0.7, further validating the rationality of XMAdapter's fusion of multiple features.

Table 3: Effect of $\gamma$ on 16-shot ImageNet.

$\gamma$   0      0.1    0.3    0.5    0.7    0.9    1.0
Acc (%)    62.95  64.44  65.13  65.95  66.22  65.87  65.44

Different coefficients $\alpha$ and $\beta$. $\alpha$ represents the proportion of the cross-modal cache model's prediction when combined with pre-trained CLIP, as illustrated in Eq. 4. To assess the impact of different values of $\alpha$ on model performance, we conducted tests on the 16-shot ImageNet dataset, varying $\alpha$ from 0 to 4.0; the experimental results are presented in Table 4. When $\alpha$ is set to 0, XMAdapter essentially uses only the knowledge from pre-trained CLIP, disregarding the content of the cache model, and achieves 58.18%. When $\alpha$ is set to 1.2, the model achieves its best performance of 66.22%. However, when $\alpha$ is increased to 4.0, performance decreases. These results indicate that the knowledge from the cache model and from pre-trained CLIP are of comparable importance. As indicated in Eq. 4, a larger $\beta$ means that similar samples in the cache have a greater impact on the model's predictions for a test image. As shown in Table 4, the model achieves its best performance when $\beta$ is set to 3.5. For the performance of the model on different backbones, please refer to Different BackBones in Appendix B.2.

Table 4: On 16-shot ImageNet, we examine the impact of two coefficients, the residual ratio $\alpha$ and the sharpness ratio $\beta$, on XMAdapter.

Ablation Studies on XMAdapter
Residual Ratio $\alpha$   0.0    0.5    1.0    1.2    2.0    3.0    4.0
Acc (%)                   58.18  64.57  65.73  66.22  65.42  63.83  61.37
Sharpness Ratio $\beta$   0.5    1.5    3.5    5.5    7.5    9.5    11.5
Acc (%)                   64.37  64.85  66.22  65.03  64.64  64.26  65.97

4 Conclusion

In this paper, we first analyze the drawbacks of current adapter-style methods in parameter-efficient transfer learning applications. Subsequently, we introduce XMAdapter as a solution, which creates a cache model by integrating cross-modal information to acquire knowledge. It dynamically adjusts the weights of hard examples based on the differences in affinity between modalities to enhance model performance. We validated the effectiveness and generalization of the model on 15 benchmark datasets. Adapter-style methods are based on fine-tuning VLM with a small amount of data to match downstream tasks. This tuning approach heavily relies on the performance of the pre-trained VLM.

References

  • [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [2] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning. PMLR, 2021, pp. 4904–4916.
  • [3] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
  • [4] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [5] Brian Lester, Rami Al-Rfou, and Noah Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv preprint arXiv:2104.08691, 2021.
  • [6] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly, “Parameter-efficient transfer learning for nlp,” in International Conference on Machine Learning. PMLR, 2019, pp. 2790–2799.
  • [7] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
  • [8] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, pp. 1–15, 2023.
  • [9] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in European Conference on Computer Vision. Springer, 2022, pp. 493–510.
  • [10] Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang, “Graphadapter: Tuning vision-language models with dual knowledge graph,” arXiv preprint arXiv:2309.13625, 2023.
  • [11] Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang, “Task residual for tuning vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10899–10909.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. 2021, OpenReview.net.
  • [15] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang, “Bilinear attention networks,” Advances in neural information processing systems, vol. 31, 2018.
  • [16] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian, “Deep modular co-attention networks for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6281–6290.
  • [17] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li, “Dynamic fusion with intra-and inter-modality attention flow for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6639–6648.
  • [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio, Eds. 2019, pp. 4171–4186, Association for Computational Linguistics.
  • [19] Hao Tan and Mohit Bansal, “LXMERT: learning cross-modality encoder representations from transformers,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, Eds. 2019, pp. 5099–5110, Association for Computational Linguistics.
  • [20] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
  • [21] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu, “UNITER: learning universal image-text representations,” CoRR, vol. abs/1909.11740, 2019.
  • [22] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan, “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. 2022, OpenReview.net.
  • [23] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
  • [24] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 61–68.
  • [25] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Lam Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang, “P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks,” arXiv preprint arXiv:2110.07602, 2021.
  • [26] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim, “Visual prompt tuning,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII. Springer, 2022, pp. 709–727.
  • [27] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu, “Neural prompt search,” arXiv preprint arXiv:2206.04673, 2022.
  • [28] Xianjun Yang, Wei Cheng, Xujiang Zhao, Linda Petzold, and Haifeng Chen, “Dynamic prompting: A unified framework for prompt tuning,” arXiv preprint arXiv:2303.02909, 2023.
  • [29] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy, “Unified vision and language prompt learning,” arXiv preprint arXiv:2210.07225, 2022.
  • [30] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan, “Maple: Multi-modal prompt learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19113–19122.
  • [31] Edouard Grave, Moustapha M Cisse, and Armand Joulin, “Unbounded cache model for online language modeling with open vocabulary,” Advances in neural information processing systems, vol. 30, 2017.
  • [32] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” Advances in neural information processing systems, vol. 29, 2016.
  • [33] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
  • [34] Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning. PMLR, 2017, pp. 1126–1135.
  • [35] Li Fei-Fei, Rob Fergus, and Pietro Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop. IEEE, 2004, pp. 178–178.
  • [36] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3498–3505.
  • [37] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561.
  • [38] Maria-Elena Nilsback and Andrew Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 2008, pp. 722–729.
  • [39] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool, “Food-101–mining discriminative components with random forests,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. Springer, 2014, pp. 446–461.
  • [40] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • [41] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 3485–3492.
  • [42] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
  • [43] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
  • [44] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, 2012.
  • [45] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar, “Do imagenet classifiers generalize to imagenet?,” in International Conference on Machine Learning. PMLR, 2019, pp. 5389–5400.
  • [46] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing, “Learning robust global representations by penalizing local predictive power,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [47] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15262–15271.
  • [48] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8340–8349.

Appendix

In this supplementary material, we first provide the related work in APPENDIX A and the experiment in APPENDIX B.

Appendix A Related Work

A.1 Vision-Language Model

Influenced by the tremendous success of pre-trained models in computer vision and natural language processing, the community has begun applying pre-training techniques to vision and language models. Classic Vision-Language Models typically consist of a vision encoder, a language encoder, a fusion encoder, and a loss function. In the early stages, models like BAN [15], MCAN [16], and Intra-Inter [17] dominated the scene. Influenced by the BERT [18] philosophy, models like LXMERT [19], ViLBERT [20], and UNITER [21] have achieved remarkable success in the Vision-Language Model domain. Recently, models such as CLIP [1], DeCLIP [22], BLIP [3], and ALIGN [2] have demonstrated that contrastive learning on visual and language inputs can generate transferable features. They achieve good performance on downstream tasks without the need for fine-tuning. CoOp [7] transformed the fixed prompts in CLIP into learnable prompts, further enhancing the model's performance. CoCoOp [23], by incorporating the characteristics of input images, improved the model's generalization by introducing inductive biases through image augmentation. Our XMAdapter explores the model's potential by adaptively adjusting the fusion ratio between images and text, achieving good performance in downstream tasks such as image classification and image recognition.

A.2 Parameter-Efficient Transfer Learning

Prompt learning is widely used in model tuning, especially for large language models. P-tuning [24] proposed to make the prompt into a token and use BiLSTM for learning. P-tuning v2 [25] combined the advantages of P-tuning and Prefix-tuning to make the model suitable for small models and complex natural language understanding tasks by removing repetitive parameters and enabling multi-task learning. Inspired by prompts in the NLP field, VPT [26] introduced a small number of trainable parameters as image prompts and achieved good results in downstream tasks. To improve the efficiency of prompt design, NOAH [27] used an evolutionary search technique to find the optimal design of vision prompts, adapters, and LoRA [4] as parameter-efficient tuning modules in each layer of the Vision Transformer. DP [28] proposed a method to dynamically adjust the prompt’s position and length according to the instance’s different tasks. UPT [29] and MAPLE [30] proposed adding learnable contextual markers in the language and visual branches and mapping language prompts to visual prompts through aggregation functions.

A.3 Cache Model

The cache model is a collection of key-value pairs storing training data and labels. It is established during the model's training phase; during inference, the test samples serve as queries, and information is aggregated from the cache model via similarity retrieval. This approach does not require updating model parameters and can improve the system's inference speed. It has been widely applied in Unbounded Cache [31], Matching Networks [32], Prototypical Networks [33], and MAML [34]. Unbounded Cache [31] expanded the scale of a continuous cache through approximate nearest neighbor search and quantization algorithms. Matching Networks [32] adapted to new classes by establishing a small labeled support set and mapping an unlabeled example to its label. The XMAdapter proposed in this paper fully integrates features from both images and text, decoupling the similarity measurement methods of different modalities. This enables the mutual utilization of knowledge between the two modalities, achieving good performance in downstream tasks.

Appendix B Experiment

B.1 Experimental Setups

Datasets Following previous adapter-style studies [8, 9], we validate our XMAdapter on 11 few-shot classification tasks, including ImageNet [12], Caltech101 [35], OxfordPets [36], StanfordCars [37], Flowers102 [38], Food101 [39], FGVCAircraft [40], SUN397 [41], DTD [42], EuroSAT [43], and UCF101 [44]. Among them, OxfordPets [36], StanfordCars [37], Flowers102 [38], FGVCAircraft [40], and Food101 [39] are fine-grained classification tasks, DTD [42] is a texture classification dataset, and EuroSAT [43] is for remote sensing classification. To investigate the generalization capability of our XMAdapter, we conduct experiments on ImageNet-V2 [45], ImageNet-Sketch [46], ImageNet-A [47], and ImageNet-R [48].

Implementation Details This paper employs CLIP [1] as the backbone. By default, we utilize ResNet-50 [13] as the visual encoder and a 12-layer transformer as the textual encoder. To evaluate the adaptability of our method, XMAdapter also uses other CLIP [1] visual encoders, including ResNet-101 [13], ViT-B/32 [14], and ViT-B/16 [14]. We use a batch size of 32 for all datasets. We train for 20 epochs and optimize the model with 1, 2, 4, 8, and 16 shots. During training, we use the Adam optimizer with an initial learning rate of $1\times 10^{-3}$, which decreases with cosine learning rate decay. The text prompt in CoOp [7] is randomly initialized from a Gaussian distribution with a standard deviation of 0.02, and the context length is set to 16. The data augmentation strategy includes only two policies: “random resized crop” and “random flipping”.
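For reference, the optimization recipe above (Adam, initial learning rate $1\times 10^{-3}$, cosine decay over 20 epochs) can be wired up as in the following sketch; the helper name and the assumption that only the adapter parameters require gradients are illustrative.

```python
import torch

def make_optimizer_and_scheduler(model, epochs=20, lr=1e-3):
    """Optimization recipe reported in the paper: Adam with initial lr 1e-3
    and cosine learning-rate decay over 20 epochs (batch size 32 elsewhere)."""
    params = [p for p in model.parameters() if p.requires_grad]  # only tunable parts
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```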

B.2 Different BackBones

We utilized ResNet-50 [13], ResNet-101 [13], ViT-B/32 [14], and ViT-B/16 [14] as backbones for conducting comparative experiments on the 16-shot ImageNet [12] dataset. From Table 5, it can be observed that XMAdapter, across the four backbones, outperforms GraphAdapter by +0.52%, +0.40%, +0.76%, and +0.74%, respectively. This indicates that the model exhibits strong generalizability.

Table 5: Classification accuracy (%) of different visual encoders on 16-shot ImageNet [12].

Models              ResNet-50  ResNet-101  ViT-B/32  ViT-B/16
Zero-shot CLIP [1]  58.18      61.62       62.05     66.73
CoOp [7]            62.95      66.60       66.85     71.92
CLIP-Adapter [8]    63.59      65.39       66.19     71.13
Tip-Adapter [9]     65.44      68.56       68.65     73.69
GraphAdapter [10]   65.70      68.23       68.80     73.68
XMAdapter           66.22      68.96       69.56     74.43