
Test-time Alignment-Enhanced Adapter for Vision-Language Models

1st Baoshun Tong
School of Computer Science and Engineering
Sun Yat-sen University
Guangzhou, China
[email protected]

2nd Kaiyu Song
School of Artificial Intelligence
Sun Yat-sen University
Guangzhou, China
[email protected]

3rd Hanjiang Lai
School of Computer Science and Engineering
Sun Yat-sen University
Guangzhou, China
[email protected]
Abstract

Test-time adaptation with pre-trained vision-language models (VLMs) has attracted increasing attention for tackling the issue of distribution shift during the test phase. While prior methods have shown effectiveness in addressing distribution shift by adjusting classification logits, they are not optimal because they keep the text features unchanged. To address this issue, we introduce a new approach called Test-time Alignment-Enhanced Adapter (TAEA), which trains an adapter with test samples to adjust text features during the test phase. By adapting the text features, the adapter enhances text-to-image alignment in the final prediction. Furthermore, we adopt the negative cache from TDA as an enhancement module, which further improves the performance of TAEA. Our approach outperforms the state-of-the-art TTA method for pre-trained VLMs by an average of 0.75% on the out-of-distribution benchmark and 2.5% on the cross-domain benchmark, with an acceptable training time. Code will be available at https://github.com/BaoshunWq/clip-TAEA.

Index Terms: Adapter, test-time adaptation, text-to-image alignment.

1 INTRODUCTION

Recently, test-time adaptation (TTA) [1], which adapts the model to unlabeled test data, has drawn much interest due to the distribution shift between training and testing data. Since foundational vision-language models such as CLIP (Contrastive Language-Image Pretraining) [2] have achieved excellent results in many downstream tasks [3, 4], some researchers [5, 6, 7] have turned their attention to pre-trained vision-language models for TTA. Test-time Prompt Tuning (TPT) [5] first addressed the TTA issue by learning an adaptive prompt on the fly with test samples, exploiting the zero-shot generalization of VLMs. DiffTPT [6] leveraged pre-trained diffusion models to generate diverse and informative new data to learn a better prompt. Later, the Training-free Dynamic Adapter (TDA) [7] was proposed to adapt CLIP to downstream tasks by building two dynamic caches without training. This type of test-time adaptation heavily relies on the vision-language alignment capability of CLIP.

Despite the efficiency and effectiveness of prior methods, the distribution shift between test and training data can also damage the alignment capability of pre-trained vision-language models, since hand-crafted text prompts are difficult to design for an unknown test distribution, thereby reducing classification accuracy. Although TPT and DiffTPT attempt to learn a prompt to adjust text features, their significant training overhead is not in line with the requirements of the testing phase [7]. In summary, we contemplate the following question: Can we efficiently adjust text features to further enhance text-image alignment under the unsupervised condition of TTA?

In this paper, we propose a novel Test-time Alignment-Enhanced Adapter (TAEA) to efficiently and effectively enhance text-image alignment. Our method consists of two modules. The first is the adapter module, which introduces a lightweight attention [8] block to adapt the text category embeddings to the downstream images during the test phase. We use the original text features as the query and the test image features as the key and value. We then train a gated single-head attention block to bridge the gap [9] between test-time image features and the text features of each category. With the help of test-time knowledge, this process can be regarded as a learnable module that matches the text features to the test samples. The second is the enhancement module, which helps mitigate prediction errors caused by high entropy or bias toward certain predictions due to training with pseudo labels, and further enhances the performance of our adapter.

Through experiments on the out-of-distribution (OOD) benchmark and the cross-domain benchmark [5], we observe that our TAEA outperforms existing state-of-the-art test-time adaptation methods. The contributions of our study are summarized as follows: 1) We propose the Test-time Alignment-Enhanced Adapter (TAEA), a simple yet effective test-time adaptation method for vision-language models that efficiently adapts text features to better align with test image features. 2) Under the unsupervised condition of TTA, we mitigate the poor alignment capability of pre-trained vision-language models on test datasets caused by distribution shift. 3) Experimental results show that TAEA outperforms existing state-of-the-art test-time adaptation methods for pre-trained vision-language models.

2 RELATED WORK

Adapter-based architectures have gained attention for efficiently incorporating task-specific modifications into pre-trained models. Reference [10] proposes CLIP-Adapter, which adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. Reference [11] proposes Tip-Adapter, which constructs the adapter through a key-value cache model from the few-shot training set and updates the prior knowledge encoded in CLIP through feature retrieval. Reference [12] proposes Meta-Adapter, which builds a lightweight network based on the gated multi-head attention [8] mechanism to bridge the gap between few-shot image features and text features for each category. Reference [13] proposes SAM-Adapter, which incorporates domain-specific information or visual prompts into the segmentation network using simple yet effective adapters. Reference [14] proposes learning low-cost T2I-Adapters to align internal knowledge in text-to-image (T2I) models with external control signals while freezing the original large T2I models. Reference [15] states that adapter-based tuning mitigates forgetting better than fine-tuning, since it yields representations that deviate less from those of the initial pre-trained language model.

3 METHOD

3.1 Review of CLIP and TDA

Given a test image $x_t$, a global visual feature $f_{\mathrm{test}} = E_i(x_t)$ is obtained by CLIP’s visual encoder $E_i$. Similarly, the corresponding text features $\omega$ can be encoded using CLIP’s text encoder $E_t$. The text is composed of hand-crafted templates, one typical form being “a photo of [CLASS]”, and a specific category name, such as “dog.” As a result, CLIP’s prediction $y$ for a test image is obtained by the matching score as follows:

$P_{\mathrm{clip}}(f_{\mathrm{test}}) = f_{\mathrm{test}}\,\omega^{\top}.$   (1)
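For concreteness, a minimal sketch of the matching score in (1) is given below, assuming the image and text features are already L2-normalized as in CLIP; the function name and dimensions are illustrative, not taken from any released code.

```python
import torch

def clip_prediction(f_test: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
    """Matching score of Eq. (1): P_clip(f_test) = f_test @ omega^T.

    f_test: (B, D) L2-normalized image features from the visual encoder E_i.
    omega:  (N, D) L2-normalized text features for the N class prompts.
    Returns (B, N) class logits (cosine similarities).
    """
    return f_test @ omega.t()

# Example with random placeholder features (D = 512 is illustrative).
logits = clip_prediction(torch.randn(1, 512), torch.randn(100, 512))
```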

To better adapt CLIP to TTA tasks, TDA [7] comprises two dynamic key-value caches, each storing a dynamic queue of few-shot test features as keys and their corresponding pseudo-labels as values. The first is the positive cache, which collects high-quality few-shot pseudo labels $\hat{L}_p$ as positive values and the corresponding features $Q_p$ as keys. The second is the negative cache, which gathers CLIP-generated image features into $Q_n$ and the corresponding negative pseudo labels into $\hat{L}_n$. Finally, the prediction of TDA is formulated by combining the negative cache, the positive cache, and the pre-trained CLIP model as follows:

$P_{\mathrm{TDA}}(f_{\mathrm{test}}) = P_{\mathrm{clip}}(f_{\mathrm{test}}) + P_{\mathrm{pos}}(f_{\mathrm{test}}) + P_{\mathrm{neg}}(f_{\mathrm{test}}).$   (2)
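A hedged sketch of a key-value cache lookup in the spirit of TDA's caches is shown below; the softmax-weighted aggregation is a simplification of TDA's actual retrieval, and Eq. (2) is then just the sum of the three logit terms.

```python
import torch

def cache_logits(f_test: torch.Tensor, keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Illustrative key-value cache lookup in the spirit of TDA.

    keys:   (K, D) stored test-image features (the cache keys Q_p or Q_n).
    values: (K, N) stored one-hot pseudo-labels (L_p or L_n).
    The softmax-weighted aggregation below is a simplification; TDA's exact
    affinity function and negative-label handling are not reproduced here.
    """
    affinity = f_test @ keys.t()              # (B, K) similarities to cached features
    return affinity.softmax(dim=-1) @ values  # (B, N) cache-based class logits

# Eq. (2): P_TDA = P_clip + P_pos + P_neg (the negative term is assumed to carry its own sign).
```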

Although TDA has achieved great efficiency and effectiveness, solely adjusting classification logits based on the dynamic cache while keeping text features unchanged is not optimal.

Figure 1: Overall pipeline of the proposed method. The adapter module aims to adjust hand-crafted text features and improve the capability of text-to-image alignment. The enhancement module aims to enhance the performance further.
Table 1: Results on the Cross-Domain Benchmark. Following prior methods, the Average metric is the mean accuracy across all ten datasets.
Method Aircraft Caltech101 Cars DTD EuroSAT Flower102 Food101 Pets SUN397 UCF101 Average
CLIP-ResNet-50 16.11 87.26 55.89 40.37 25.79 62.77 74.82 82.97 60.85 59.48 56.63
CoOp 15.12 86.53 55.32 37.29 26.20 61.55 75.59 87.00 58.15 59.05 56.18
CoCoOp 14.61 87.38 56.22 38.53 28.73 65.57 76.20 88.39 59.61 57.10 57.23
TPT 17.58 87.02 58.46 40.84 28.33 62.69 74.88 84.49 61.46 60.82 57.66
DiffTPT 17.60 86.89 60.71 40.72 41.04 63.53 79.21 83.40 62.72 62.67 59.85
TDA 17.61 89.70 57.78 43.74 42.11 68.74 77.75 86.18 62.53 64.18 61.03
TAEA (ours) 18.60 87.87 57.82 51.77 59.05 67.24 78.44 88.01 66.44 61.62 63.69
CLIP-ViT-B/16 23.22 93.55 66.11 45.04 50.42 66.99 82.86 86.92 65.63 65.16 64.59
CoOp 18.47 93.70 64.51 41.92 46.39 68.71 85.30 89.14 64.15 66.55 63.88
CoCoOp 22.29 93.79 64.90 45.45 39.23 70.85 83.97 90.46 66.89 68.44 64.63
TPT 24.78 94.16 66.87 47.75 42.44 68.98 84.67 87.79 65.50 68.04 65.10
DiffTPT 25.60 92.49 67.01 47.00 43.13 70.10 87.23 88.22 65.74 62.67 65.47
TDA 23.91 94.24 67.28 47.40 58.00 71.42 86.14 88.63 67.62 70.66 67.53
TAEA (ours) 27.45 93.75 66.53 52.66 68.74 71.21 86.70 91.69 71.10 70.47 70.03

3.2 Our method

This section presents a new approach called Test-time Alignment-Enhanced Adapter (TAEA) to adjust text features. We also employ an enhancement module to further improve text-image alignment under the unsupervised condition of TTA. Fig. 1 shows an overview of our proposed method.

Adapter module. The adapter module consists of two steps. The first step retrieves test knowledge from the test image with an attention mechanism, yielding image features related to the text features. The second step employs a gating network to aggregate the hand-crafted text features with the image features obtained in the first step. After these two steps, we obtain text features aligned with the image.

Specifically, given the test images, we obtain the image features $F$ and the hand-crafted text features $\omega$ with the CLIP encoders. The text features $\omega$ are composed of a hand-crafted prompt and the $N$ class names. We then match the test image features with the text features using a single attention block. Thanks to the attention [8] mechanism, we can obtain image features related to the text features by treating $\omega$ as the query and the test image features $F$ as both key and value. To reduce training overhead and keep the adapter lightweight, following previous work [12], we implement the attention projections with MLPs, which can be defined as:

𝐅^=Fσ((ωWW)(FWF)/D),\hat{\mathbf{F}}=F^{\top}\sigma((\omega W_{W}^{\top})(FW_{F}^{\top})^{\top}/\sqrt{D}), (3)

where $\hat{\mathbf{F}}$ represents the image features related to the text features, $W_F$ and $W_W$ denote the weights of the MLP layers, $\sigma$ denotes the softmax function, and $D$ is the scaling factor.

After obtaining $\hat{\mathbf{F}}$, we introduce a learnable gate block $f(\cdot)$ to match $\hat{\mathbf{F}}$ with the text features $\omega$, where $f(\cdot)$ contains only a single MLP layer. In this way, we generate a modulation scalar to filter knowledge and control the ratio between text and image features at test time [12]. Finally, we fine-tune the text features as follows:

$\hat{\omega} = \omega + f(\omega)\odot\hat{\mathbf{F}},$   (4)

where $\odot$ denotes the Hadamard product. After training the gate block, $f(\cdot)$ can adjust the matching ratio according to $\omega$ and thus address distribution shift. The adapter prediction is defined as follows:

$P_{\mathrm{adapter}}(f_{\mathrm{test}}) = \gamma\,\frac{\hat{\omega}^{\top} f_{\mathrm{test}}}{\|\hat{\omega}\|\,\|f_{\mathrm{test}}\|},$   (5)

where $f_{\mathrm{test}}$ is the test image feature obtained by the visual encoder and $\gamma$ is a hyper-parameter.
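To make the adapter concrete, here is a minimal PyTorch sketch covering the attention retrieval of (3), the gated residual update of (4), and the scaled cosine logits of (5). The per-class scalar gate with a sigmoid is our reading of the “modulation scalar” description and of Meta-Adapter [12], not necessarily the authors’ exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentAdapter(nn.Module):
    """Sketch of the TAEA adapter: gated single-head attention over test features.

    Text features omega (N, D) act as queries; stored test image features
    F (B, D) act as keys and values, following Eqs. (3)-(5).
    """

    def __init__(self, dim: int, gamma: float = 0.6):
        super().__init__()
        self.w_w = nn.Linear(dim, dim, bias=False)  # W_W in Eq. (3), projects the text queries
        self.w_f = nn.Linear(dim, dim, bias=False)  # W_F in Eq. (3), projects the image keys
        self.gate = nn.Linear(dim, 1)               # f(.) in Eq. (4): single MLP layer -> modulation scalar
        self.gamma = gamma                          # weighting gamma in Eq. (5)
        self.scale = 1.0 / math.sqrt(dim)           # 1 / sqrt(D)

    def adapt_text(self, omega: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Eq. (3): attention of text queries over image keys; values are the raw image features.
        attn = torch.softmax((self.w_w(omega) @ self.w_f(feats).t()) * self.scale, dim=-1)  # (N, B)
        f_hat = attn @ feats                                                                # (N, D)
        # Eq. (4): gated residual update; the sigmoid is our assumption for a bounded gate.
        return omega + torch.sigmoid(self.gate(omega)) * f_hat

    def forward(self, omega: torch.Tensor, feats: torch.Tensor, f_test: torch.Tensor) -> torch.Tensor:
        # Eq. (5): scaled cosine similarity between adapted text features and the test feature.
        omega_hat = F.normalize(self.adapt_text(omega, feats), dim=-1)
        return self.gamma * (F.normalize(f_test, dim=-1) @ omega_hat.t())  # (B, N) adapter logits
```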

To train the adapter module, we select samples with lower entropy to generate pseudo labels $L_a$ (one-hot encoded vectors of a categorical distribution) and store them during the testing of the first $\lambda N$ samples, where $N$ is the total number of images in the test set and $\lambda$ is a hyperparameter that determines when the adapter is trained. After the model has tested $\lambda N$ samples, we train the adapter module with these stored test samples to update the text features, using cross-entropy as the loss function.
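The collect-then-train schedule described above might look like the following sketch; the entropy threshold, the buffer handling, and the reuse of stored features as attention keys and values are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F


def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the softmax prediction, used to pick confident samples."""
    p = logits.softmax(dim=-1)
    return -(p * (p + 1e-12).log()).sum(dim=-1)


def maybe_store(buffer, f_test, clip_logits, seen, lam, n_total, threshold=0.5):
    """During the first lambda*N test samples, keep low-entropy predictions and
    their pseudo-labels L_a (the threshold value is an assumption)."""
    if seen < lam * n_total and entropy(clip_logits).item() < threshold:
        buffer.append((f_test.detach(), clip_logits.argmax(dim=-1)))


def train_adapter(adapter, omega, buffer, optimizer, epochs=3, batch_size=3):
    """After lambda*N samples, train the gated attention block with cross-entropy
    on the stored pseudo-labeled features (epochs/batch size follow Sec. 4.1)."""
    feats = torch.cat([f for f, _ in buffer])
    labels = torch.cat([y for _, y in buffer])
    for _ in range(epochs):
        for i in range(0, feats.size(0), batch_size):
            fb, yb = feats[i:i + batch_size], labels[i:i + batch_size]
            logits = adapter(omega, fb, fb)   # stored features reused as attention keys/values
            loss = F.cross_entropy(logits, yb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```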

The effectiveness of adjusting text features is theoretically supported by the analysis of non-local filters [16, 17, 18]. Intuitively, similar to non-local filters, Reference [12] states that an adapter based on gated attention can disregard outlier samples while paying more attention to samples related to the category description, resulting in robust feature representations.

Enhancement module. To mitigate the risk of prediction errors due to high entropy or bias toward certain predictions, we adopt the negative cache from TDA [7] as a plug-and-play module to further enhance the performance of our adapter. As stated in [7], the negative cache is designed for negative learning: it mitigates the negative impact of noisy pseudo-labels by introducing negative pseudo-labeling that identifies class absence rather than presence.
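The idea of negative pseudo-labeling, identifying classes that are almost certainly absent, can be sketched as a simple probability-threshold mask; the threshold and the masking rule below are illustrative and do not reproduce TDA's exact negative-cache construction.

```python
import torch

def negative_pseudo_label(clip_logits: torch.Tensor, low_prob: float = 0.01) -> torch.Tensor:
    """Illustrative negative pseudo-labeling: flag classes whose predicted
    probability is below a small threshold as 'absent' (value 1), so the
    negative cache can later penalize those classes for similar features."""
    probs = clip_logits.softmax(dim=-1)
    return (probs < low_prob).float()   # (B, N) negative pseudo-label mask
```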

In summary, with the help of the adapter and enhancement modules, our method can adjust hand-crafted text features to enhance the capability of text-image alignment. The overall formulation for computing classification results in our method is as follows:

$P_{\mathrm{TAEA}}(f_{\mathrm{test}}) = P_{\mathrm{clip}}(f_{\mathrm{test}}) + P_{\mathrm{adapter}}(f_{\mathrm{test}}) + P_{\mathrm{neg}}(f_{\mathrm{test}}).$   (6)
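Putting the pieces together, one test-time prediction step under (6) could look like the following sketch; `adapter` refers to the illustrative adapter class above, and the construction of the negative-cache logits is left abstract.

```python
import torch

def taea_step(adapter, f_test, omega, stored_feats, p_neg):
    """One illustrative test-time prediction with Eq. (6); the helper objects
    follow the earlier sketches and are assumptions, not the released code.

    f_test:       (1, D) normalized feature of the current test image.
    omega:        (N, D) normalized hand-crafted text features.
    stored_feats: (K, D) test features collected earlier, used as keys/values.
    p_neg:        (1, N) logits from the negative cache (construction omitted).
    """
    p_clip = f_test @ omega.t()                          # Eq. (1)
    p_adapter = adapter(omega, stored_feats, f_test)     # Eq. (5), gated-attention adapter
    return (p_clip + p_adapter + p_neg).argmax(dim=-1)   # Eq. (6) + final class decision
```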

4 EXPERIMENT

4.1 Experimental Details

Datasets. Consistent with prior works [6, 19], we conduct the main experiments on two benchmarks: the out-of-distribution (OOD) benchmark and the cross-domain benchmark. To assess robustness and generalization, the OOD benchmark consists of one in-distribution (ID) dataset, ImageNet [20], and four out-of-distribution datasets derived from ImageNet: ImageNet-A [21], ImageNet-V2 [22], ImageNet-R [23], and ImageNet-S [24]. To evaluate the adaptation ability during testing on datasets with different domain distributions, the cross-domain benchmark consists of 10 diverse image classification datasets, each from a distinct domain with different classes: Pets [25], EuroSAT [26], Aircraft [27], Caltech101 [28], UCF101 [29], Cars [30], DTD [31], Flower102 [32], Food101 [33], and SUN397 [34].

Baselines. To evaluate the effectiveness of our method, we compare it with zero-shot methods, train-time adaptation methods, and test-time adaptation methods, including the recent state-of-the-art method. For the zero-shot setting, we compare with the public CLIP [2] results under ResNet-50 [35] and ViT-B/16 [36]. For train-time adaptation, we compare with CoOp [37], CoCoOp [38], and Tip-Adapter [11]. For test-time adaptation, we compare with TPT [5] and its improved version, DiffTPT [6]. We also compare against the strong state-of-the-art method in this field, TDA [7]. All results of the compared methods in the tables are taken from the TDA paper [7].

Implementation Details. Following prior works [5, 6, 7], we conduct experiments on the out-of-distribution benchmark and the cross-domain benchmark separately, using the ResNet-50 and ViT-B/16 backbones. Specifically, we use a batch size of 1, i.e., test-time adaptation is performed in the single-image setting. We set all cache-related hyperparameters to be consistent with TDA, and we set $\lambda=0.25$ and $\gamma=0.6$. For training the adapter module, we optimize the gated attention block with a batch size of 3 using the AdamW [19] optimizer with a learning rate of 0.001 and a cosine scheduler for 3 epochs. All experiments are evaluated on a single NVIDIA Quadro RTX 6000 GPU.
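Under the stated settings, the adapter's optimization could be wired up as below (AdamW, learning rate 0.001, cosine schedule, 3 epochs, batch size 3, λ = 0.25, γ = 0.6); the embedding dimension and the scheduler horizon are assumptions that depend on the chosen backbone and on how many samples get stored.

```python
import torch

# Hyper-parameters stated in Sec. 4.1.
lam, gamma = 0.25, 0.6
epochs, batch_size, lr = 3, 3, 1e-3

embed_dim = 512                                          # CLIP ViT-B/16 feature size; other backbones differ
adapter = AlignmentAdapter(dim=embed_dim, gamma=gamma)   # adapter sketch from Sec. 3.2
optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)

steps_per_epoch = 200                                    # illustrative; equals ceil(len(buffer) / batch_size)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
# scheduler.step() would be called after each optimizer.step() in the training loop above.
```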

4.2 Main results

Results on the OOD benchmark. The performances on the OOD benchmark are summarized in Table 2. Our method surpasses both the method that adjusts text features in a similar way and the current state-of-the-art method on average. Specifically, compared to TPT, which also adjusts text features, our method achieves higher accuracy on both the ResNet-50 and ViT-B/16 architectures, improving average accuracy by 2.87% and 3.32%, respectively. Furthermore, compared to the current state-of-the-art method TDA, our method improves average accuracy by 0.55% and 0.75% on ResNet-50 and ViT-B/16, respectively.

As shown in Table 3, to further evaluate the efficiency and effectiveness of our method, we conduct experiments on the ImageNet validation set to measure both accuracy and testing time. Compared to the current state-of-the-art method (TDA), although our approach requires slightly more testing time (an additional 2 min) due to its minimal training, it achieves a substantial improvement in accuracy (+3.82%). Compared to methods that similarly require training to adjust text features, our approach dramatically reduces the testing time from 12 h 50 min for TPT, and even more from 34 h 45 min for DiffTPT, down to just 18 minutes.

Table 2: Results on the OOD Benchmark. To demonstrate the effectiveness of our method, we compare two versions of the CLIP backbone: ResNet-50 and ViT-B/16.
Method ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-S Average
ResNet-50 59.81 23.24 52.91 60.72 35.48 46.43
CoOp 63.33 23.06 55.40 56.60 34.67 46.61
CoCoOp 62.81 23.32 55.72 57.74 34.48 46.81
Tip-Adapter 62.03 23.13 53.97 60.35 35.74 47.04
TPT 60.74 26.67 54.70 59.11 35.09 47.26
DiffTPT 60.80 31.06 55.80 58.80 37.10 48.71
TDA 61.35 30.29 55.54 62.58 38.12 49.58
TAEA (ours) 63.63 30.72 56.15 62.46 37.70 50.13
ViT-B/16 68.34 49.89 61.88 77.65 48.24 61.20
CoOp 71.51 49.71 64.20 75.21 47.99 61.72
CoCoOp 71.02 50.63 64.07 76.18 48.75 62.13
Tip-Adapter 70.75 51.04 63.41 77.76 48.88 62.37
TPT 68.98 54.77 63.45 77.06 47.94 62.44
DiffTPT 70.30 55.68 65.10 75.00 46.80 62.28
TDA 69.51 60.11 64.67 80.24 50.54 65.01
TAEA (ours) 71.71 60.10 65.23 80.73 51.04 65.76
Table 3: Comparisons of efficiency (Testing Time) and effectiveness (Accuracy). The last column represents the accuracy gain relative to the CLIP baseline.
Method Testing Time Accuracy Gain
ResNet-50 12 min 59.81 0
TPT 12h 50min 60.74 +0.93
DiffTPT 34h 45min 60.80 +0.99
TDA 16 min 61.35 +1.54
TAEA (ours) 18 min 63.63 +3.82

Results on the cross-domain benchmark. The performances on the cross-domain benchmark are summarized in Table 1. Our method not only surpasses the methods that similarly adjust text features (i.e., TPT, DiffTPT) but also exceeds the current state-of-the-art method TDA on average, reaching a new state of the art. Specifically, compared to TPT, our method improves average accuracy by 6.03% and 4.93% on the ResNet-50 and ViT-B/16 architectures, respectively. Furthermore, compared to TDA, our method improves average accuracy by 2.66% and 2.5%, respectively.

4.3 Ablation Study

To validate the effectiveness of our proposed method, we conduct ablation experiments on ImageNet [20]. TAEA consists of the adapter module and the enhancement module derived from the negative cache [7]. We first evaluate the effectiveness of using the enhancement module or the adapter module alone. As shown in Fig. 2(a), each module outperforms the original CLIP-ResNet-50. Our adapter module exceeds CLIP and surpasses the enhancement module, demonstrating its ability to effectively adjust text features and enhance the alignment between text and image. Furthermore, combining the enhancement module with the adapter, which constitutes our TAEA method, further improves performance on ImageNet.

Furthermore, we also conduct an ablation study on $\gamma$ in (5). $\gamma$ is a parameter that controls the weighting of the adjusted text features for classification. Specifically, we experiment on the challenging OOD benchmark using $\gamma = 0, 0.2, 0.4, 0.6, 0.8, 1.0$. Note that $\gamma = 0$ means the model does not use the adapter to adjust the text, i.e., the text features remain unchanged during the testing phase. Fig. 2(b) summarizes the average results. The baseline result is taken from [7].

Figure 2: Results of the ablation study. (a) Results on ImageNet. (b) Results on the OOD benchmark.

5 CONCLUSION

To improve the alignment between text and image during the testing phase, we present a novel test-time adaptation approach, the Test-time Alignment-Enhanced Adapter (TAEA), which integrates test-distribution knowledge and adapts text features with a gated attention mechanism. In terms of performance, extensive experiments demonstrate that TAEA outperforms state-of-the-art test-time adaptation methods. In terms of efficiency, TAEA significantly reduces the testing time compared to previous methods that adjust text features by learning prompts.

References

  • [1] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in International Conference on Learning Representations, 2020.
  • [2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [3] Y. Ming, Z. Cai, J. Gu, Y. Sun, W. Li, and Y. Li, “Delving into out-of-distribution detection with vision-language representations,” Advances in Neural Information Processing Systems, vol. 35, pp. 35087–35102, 2022.
  • [4] Z. Zhou, Y. Lei, B. Zhang, L. Liu, and Y. Liu, “Zegclip: Towards adapting clip for zero-shot semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11175–11185.
  • [5] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, “Test-time prompt tuning for zero-shot generalization in vision-language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 14274–14289, 2022.
  • [6] C.-M. Feng, K. Yu, Y. Liu, S. Khan, and W. Zuo, “Diverse data augmentation with diffusions for effective test-time prompt tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2704–2714.
  • [7] A. Karmanov, D. Guan, S. Lu, A. El Saddik, and E. Xing, “Efficient test-time adaptation of vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14162–14171.
  • [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [9] H. Nam, H. Lee, J. Park, W. Yoon, and D. Yoo, “Reducing domain gap by reducing style bias,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8690–8699.
  • [10] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
  • [11] R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in European conference on computer vision.   Springer, 2022, pp. 493–510.
  • [12] L. Song, R. Xue, H. Wang, H. Sun, Y. Ge, Y. Shan et al., “Meta-adapter: An online few-shot learner for vision-language model,” Advances in Neural Information Processing Systems, vol. 36, pp. 55361–55374, 2023.
  • [13] T. Chen, L. Zhu, C. Deng, R. Cao, Y. Wang, S. Zhang, Z. Li, L. Sun, Y. Zang, and P. Mao, “Sam-adapter: Adapting segment anything in underperformed scenes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2023, pp. 3367–3375.
  • [14] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
  • [15] R. He, L. Liu, H. Ye, Q. Tan, B. Ding, L. Cheng, J. Low, L. Bing, and L. Si, “On the effectiveness of adapter-based tuning for pretrained language model adaptation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 2208–2222.
  • [16] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
  • [17] Q. Yan, L. Zhang, Y. Liu, Y. Zhu, J. Sun, Q. Shi, and Y. Zhang, “Deep hdr imaging via a non-local network,” IEEE Transactions on Image Processing, vol. 29, pp. 4308–4322, 2020.
  • [18] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “Gcnet: Non-local networks meet squeeze-excitation networks and beyond,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0.
  • [19] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2017.
  • [20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
  • [21] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15262–15271.
  • [22] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?” in International conference on machine learning.   PMLR, 2019, pp. 5389–5400.
  • [23] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 8340–8349.
  • [24] H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [25] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 3498–3505.
  • [26] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
  • [27] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • [28] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in 2004 conference on computer vision and pattern recognition workshop.   IEEE, 2004, pp. 178–178.
  • [29] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [30] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561.
  • [31] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613.
  • [32] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian conference on computer vision, graphics & image processing.   IEEE, 2008, pp. 722–729.
  • [33] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part VI 13.   Springer, 2014, pp. 446–461.
  • [34] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition.   IEEE, 2010, pp. 3485–3492.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [37] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
  • [38] ——, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16816–16825.