
Revisiting the Robust Generalization of Adversarial Prompt Tuning

Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui
Abstract.

Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP to adversarial attacks is key to ensuring their zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms generally adopt prompt learning strategies for adversarial fine-tuning to improve the adversarial robustness of the pre-trained model while keeping the efficiency of adapting to downstream tasks. Such a setup leads to the problem of over-fitting, which impedes further improvement of the model's generalization capacity on both clean and adversarial examples. In this work, we propose an adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework that utilizes multi-modal prompt learning to enhance the alignment of image and text features for adversarial examples and leverages the strong generalization of the pre-trained CLIP to guide the model, enhancing its robust generalization on adversarial examples while maintaining its accuracy on clean ones. We also design a novel adaptive consistency objective function to balance the consistency of adversarial and clean inputs between the fine-tuned model and the pre-trained model. We conduct extensive experiments across 14 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show the superiority of CAPT over other state-of-the-art adaptation methods. CAPT demonstrates excellent in-distribution performance and strong generalization under input distribution shift and across datasets.

Vision-and-Language foundation models, multi-modal, adversarial attack, robust generalization, prompt tuning

1. Introduction

Large-scale models pre-trained on vision and language data have emerged as vision-language foundation models (Radford et al., 2021; Jia et al., 2021; Li et al., 2022a), achieving great success in numerous fields such as visual question answering (Zhou et al., 2020; Lin and Byrne, 2022), image captioning (Zhu et al., 2023), and text-to-image generation (Li et al., 2023b; Zhang et al., 2023). With the rapid development of multi-modal foundation models, various vision-language models (VLMs) have been proposed, and an increasing number of studies have introduced VLMs into different downstream tasks.

Unfortunately, many recent studies (Mao et al., 2022; Schlarmann and Hein, 2023; Zhao et al., 2024) have revealed that VLMs are vulnerable to small adversarial noise (Szegedy et al., 2013): deliberately designed, imperceptible perturbations can cause the model to output completely different, erroneous results. Because VLMs become susceptible to adversarial attacks when applied to downstream tasks, it is critical to increase the model's robustness to ensure its dependable use in those tasks. To mitigate the risk posed by adversarial examples and enhance the robustness of models against adversarial attacks, numerous defense strategies have been suggested. Among them, adversarial training (Madry et al., 2018; Zhang et al., 2019; Wang et al., 2019) is one of the most common and effective approaches for adversarial defense; it can be regarded as a data augmentation technique that crafts adversarial versions of natural examples for model training. For large-scale models, applying adversarial training from scratch to improve robustness is impractical because generating adversarial examples at every training step is computationally expensive, especially for models with massive parameters. Fine-tuning (Yosinski et al., 2014) is a relatively efficient way to adapt pre-trained models to downstream tasks; however, as pre-trained models expand to tens or hundreds of billions of parameters, fine-tuning all model weights becomes exceedingly expensive. This cost escalates further if adversarial training is employed to enhance robustness against adversarial attacks.

To this end, prompt tuning (Zhou et al., 2022b, a; Lu et al., 2022; Khattak et al., 2023a; Yao et al., 2023; Jia et al., 2022a) has been applied as a more efficient alternative to full fine-tuning that enables the model to transfer to downstream tasks. Without changing the pre-trained model's weights, this method adapts the pre-trained model to downstream tasks by adding some learnable prompt vectors. When adversarial training is used to adapt large-scale pre-trained models to downstream tasks, adopting prompt learning can improve the model's robustness while improving training efficiency. Recently, several works have employed this efficient approach to enhance the adversarial robustness of pre-trained models on downstream tasks (Chen et al., 2020; Huang et al., 2023; Li et al., 2024). Huang et al. (Huang et al., 2023) and Chen et al. (Chen et al., 2023) have investigated the use of adversarial visual prompting (Bahng et al., 2022) as a test-time defense to improve the adversarial robustness of pre-trained models. Li et al. (Li et al., 2024) found that the adversarial robustness of CLIP is sensitive to the prompt used for inference, and therefore adopted text prompt tuning to adapt pre-trained models.

However, these methods suffer from a few drawbacks. All of them focus only on adversarial examples and ignore their clean counterparts. Moreover, none of them pay attention to the problem of over-fitting in adversarial prompt learning (Salman et al., 2020; Rice et al., 2020; Zhou et al., 2022a; Khattak et al., 2023b). Under these circumstances, the model is influenced by the distribution of adversarial examples on the small-scale training set, which results in a decrease of the model's robust generalization on both adversarial and clean examples.

In this work, we move one step further and propose adaptive Consistency-guided Adversarial Prompt Tuning (CAPT), which utilizes multi-modal prompt learning and the powerful generalization ability of the pre-trained CLIP to improve the adversarial robust generalization of the pre-trained model while maintaining its accuracy on clean examples. Inspired by (Khattak et al., 2023a, b), we adopt multi-modal prompt learning to improve the alignment between visual and textual features for adversarial examples and enhance the robustness of the image and text encoders during training. Whereas previous methods utilize only adversarial examples for adversarial fine-tuning, our approach, inspired by TRADES (Zhang et al., 2019), focuses on the consistency between clean and adversarial inputs and adds a KL-divergence regularization term to enforce this consistency. Moreover, we leverage the generalization ability of the frozen pre-trained CLIP to tackle the over-fitting issue of adversarial fine-tuning and improve zero-shot adversarial robustness, introducing a regularization loss that improves the model's adversarial robust generalization. To balance the consistency between clean and adversarial inputs of the fine-tuned model and the frozen pre-trained model, we design a novel adaptive consistency loss function that dynamically adjusts the weights of the different losses based on the model's reliability. This weight determines the balance between learning from the pre-trained model and relying on the fine-tuned model itself. Following APT (Li et al., 2024), we conduct extensive experiments on 14 datasets and 4 data sparsity schemes: 1-, 4-, and 16-shot learning and training with the entire training set. Our contributions are summarized as follows:

  • We investigate the over-fitting problem in adversarial prompt tuning and discuss the drawbacks of adversarial prompt tuning as well as the reasons causing the over-fitting issue.

  • We propose a novel adaptive Consistency-guided Adversarial Prompt Tuning approach, which introduces powerful multi-modal prompts and a novel adaptive consistency-guided objective function leveraging the generalization capacity of the pre-trained frozen CLIP. Our approach achieves a significant improvement in the robust generalization of the pre-trained model.

  • We conduct an extensive evaluation on different datasets, including in-distribution and out-of-distribution experiments. The results demonstrate that our proposed method significantly outperforms all other state-of-the-art baselines, indicating the superiority of our method.

To the best of our knowledge, we are the first to investigate the over-fitting problem of adversarial prompt tuning. Specifically, our method significantly outperforms the state of the art by 16.91% and 7.51% in accuracy and robustness respectively, averaged over different shots on the ImageNet dataset. (The code is available in the supplementary materials.)

2. Related Works

Vision-Language Models. Foundational vision-language models (VLMs) (Radford et al., 2021; Jia et al., 2021; Li et al., 2022a, 2023a) utilize both visual and textual data to learn rich semantic multi-modal representations. These models are pre-trained on large-scale multi-modal datasets through image-text contrastive learning, which draws the features of corresponding image-text pairs closer together while pushing apart the features of mismatched pairs. By leveraging large-scale image-text datasets, for instance 400 million pairs for CLIP (Radford et al., 2021) and 1 billion for ALIGN (Jia et al., 2021), and employing end-to-end pre-training strategies, VLMs learn rich semantic associations between images and text, enabling them to better understand multi-modal information about open-vocabulary concepts. Utilizing the powerful generalization ability of the pre-trained model, VLMs achieve state-of-the-art performance on various visual and vision-language tasks (Bangalath et al., 2022; Maaz et al., 2022; Gu et al., 2021; Zang et al., 2022; Li et al., 2022b; Rao et al., 2022).

Prompt Tuning for VLMs. Prompt tuning (Zhou et al., 2022b, a; Khattak et al., 2023a; Gao et al., 2024; Zhu et al., 2023; Yao et al., 2023) has been introduced as an effective fine-tuning strategy to adapt pre-trained VLMs to specific downstream tasks. This approach incorporates a few trainable embeddings along with the model inputs, which are optimized during training while the rest of the model is kept frozen. Since the pre-trained model remains unchanged during prompt learning, this strategy has proven especially beneficial for VLMs like CLIP, where preserving the model's inherent ability to generalize is essential. Context Optimization (CoOp) (Zhou et al., 2022b) replaces hand-crafted prompts with learnable textual prompts to improve the textual embedding. Conditional Context Optimization (CoCoOp) (Zhou et al., 2022a) focuses on the overfitting problem of CoOp and conditions prompts on visual features for improved performance on generalization tasks. In addition, Knowledge-Guided Context Optimization (KgCoOp) (Yao et al., 2023) ensures that the learnable prompts incorporate crucial general knowledge. PLOT (Chen et al., 2022) utilizes optimal transport to align the vision and text modalities, creating discriminative and visually coherent local textual prompts. Beyond textual prompt tuning, Multi-modal Prompt Learning (MaPLe) (Khattak et al., 2023a) and PromptSRC (Khattak et al., 2023b) tune prompts across both the visual and text encoders simultaneously.

Adversarial Robustness. Deep neural networks are susceptible to adversarial attacks, where barely noticeable noise is added to original images, causing models to misclassify them (Goodfellow et al., 2014; Madry et al., 2017; Dong et al., 2018). To counteract this vulnerability, various defense strategies have been devised. Notably, adversarial training (Jia et al., 2022b; Pang et al., 2022; Wang et al., 2019; Zhang et al., 2019) stands out as an effective defense method. It incorporates adversarial examples into the training dataset during the model's training phase, significantly bolstering DNNs' resistance to such attacks. As the adoption of large-scale pre-trained vision-language models increases, their susceptibility to adversarial threats has also come into focus, with a proliferation of attack algorithms specifically designed against them (Yin et al., 2023; Zhang et al., 2022; Lu et al., 2023; Zhao et al., 2024). Recently, several studies have explored enhancing the adversarial robustness of pre-trained models through adversarial fine-tuning, which adjusts the model's weights via adversarial training. Some improve robustness by fine-tuning the entire model (Mao et al., 2022; Li et al., 2023c; Wang et al., 2024b). Mao et al. (Mao et al., 2022) proposed a text-guided contrastive adversarial training loss and applied it for adversarial fine-tuning to boost the zero-shot adversarial robustness of large-scale pre-trained models. Wang et al. (Wang et al., 2024b) introduced pre-trained-model-guided adversarial fine-tuning to enhance zero-shot adversarial robustness; this approach fine-tunes the whole image encoder of the pre-trained model, which is inefficient and may destroy the generalization of the pre-trained parameters. A few other methods employ partial adversarial fine-tuning (Chen et al., 2023, 2020; Huang et al., 2023; Li et al., 2024). Huang et al. (Huang et al., 2023) and Chen et al. (Chen et al., 2023) investigated adversarial visual prompting (Bahng et al., 2022) as a test-time defense to improve the adversarial robustness of pre-trained models. Li et al. (Li et al., 2024) found that the adversarial robustness of CLIP is sensitive to the prompt used for inference, and therefore adopted text prompt tuning to adapt pre-trained models.

Although these methods can improve the robustness of pre-trained models, they ignore the over-fitting problem of adversarial fine-tuning (Dodge et al., 2020; Rice et al., 2020), especially for prompt learning (Zhou et al., 2022a; Khattak et al., 2023b). In contrast, we propose adaptive Consistency-guided Adversarial Prompt Tuning (CAPT), which utilizes multi-modal prompt learning (Khattak et al., 2023a) and the powerful generalization ability of the pre-trained CLIP to improve the adversarial robust generalization of the pre-trained model while maintaining its accuracy on clean examples.

3. Proposed Method

We first give preliminaries on CLIP and Adversarial Prompt Tuning in Sec. 3.1. In Sec. 3.2, we provide the details of our adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework, including its main components and our novel objective function.

3.1. Preliminaries

Revisiting CLIP. The CLIP model is mainly composed of two parts: an image encoder and a text encoder. We denote the image encoder and text encoder as $f_{\mathcal{I}}$ and $f_{\mathcal{T}}$ respectively, and their pre-trained parameters as $\theta_{\mathcal{I}}$ and $\theta_{\mathcal{T}}$. The deep features of images and text are extracted by the corresponding encoders. For the visual branch, the input image $\bm{x}_{i}$ is first divided into $M$ patches and projected to patch embeddings. After adding a learnable class token $\bm{e}_{cls}$, the input embedding is encoded by the image encoder to produce a latent visual representation $\bm{z}_{v}^{i}=f_{\mathcal{I}}(\bm{x}_{i},\theta_{\mathcal{I}})$, where $\bm{z}_{v}^{i}$ is a $d$-dimensional feature vector for image $i$. For the text branch, the class label $y$ is wrapped within a text template, for example 'a photo of a {class label}'. Through word embedding, we obtain the input text features $\bm{t}_{j}$ of class $j$ for the text encoder. The text encoder encodes $\bm{t}_{j}$ via stacks of transformer blocks to produce a latent textual feature $\bm{z}_{t}^{j}=f_{\mathcal{T}}(\bm{t}_{j},\theta_{\mathcal{T}})$. Given the latent image feature $\bm{z}_{v}^{i}$ and the latent textual feature $\bm{z}_{t}^{j}$, we can perform zero-shot inference by calculating the probability of class $j$ for image $i$ as

(1) $p_{i,j}=\frac{\exp(\cos(\bm{z}_{v}^{i},\bm{z}_{t}^{j})/\tau)}{\sum_{k=1}^{C}\exp(\cos(\bm{z}_{v}^{i},\bm{z}_{t}^{k})/\tau)}$
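As a concrete illustration of Eq. (1), the following minimal PyTorch-style sketch computes the zero-shot class probabilities from pre-computed image and text features; the function and variable names are ours and not part of CLIP's official API.

```python
import torch
import torch.nn.functional as F

def clip_zero_shot_probs(image_feats: torch.Tensor,
                         text_feats: torch.Tensor,
                         tau: float = 0.01) -> torch.Tensor:
    """Eq. (1): softmax over cosine similarities between image features
    z_v of shape (B, d) and the C class text features z_t of shape (C, d)."""
    image_feats = F.normalize(image_feats, dim=-1)  # cosine similarity is the
    text_feats = F.normalize(text_feats, dim=-1)    # dot product of unit vectors
    logits = image_feats @ text_feats.t() / tau     # (B, C)
    return logits.softmax(dim=-1)                   # p_{i,j} for every class j
```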

Adversarial Prompt Tuning. Adversarial attacks are commonly conducted by generating an imperceptible perturbation for a clean input that misleads the model into producing a wrong prediction. For CLIP, the attack searches for a perturbation $\bm{\delta}_{i}$ for input $\bm{x}_{i}$ that maximizes the dissimilarity between the image feature $\bm{z}_{v}^{i}$ and the text feature of the ground-truth class prompt, $\bm{z}_{t}^{y_{i}}$. Assuming $\bm{\delta}_{i}$ is bounded by an $\epsilon$-ball of the $p$-norm, this can be formulated as:

(2) $\mathop{\arg\max}\limits_{\|\bm{\delta}_{i}\|_{p}\leq\epsilon}L(\bm{x}_{i}+\bm{\delta}_{i},\bm{t}_{y_{i}},y_{i};\bm{\theta}_{\mathcal{I}},\bm{\theta}_{\mathcal{T}})$

Li et al. (Li et al., 2024) discovered that robustness changes substantially when different prompts are used for inference, which indicates that the adversarial robustness of VLMs is sensitive to the text prompt at inference time. Besides, the hand-crafted text prompt contains no additional information about the adversarial examples. To address this problem, they proposed to improve the adversarial robustness of VLMs through Adversarial Prompt Tuning (APT). Following CoOp (Zhou et al., 2022b), they replace the word embedding of the fixed prompt template 'a photo of a {class label}' with a sequence of $M$ learnable vectors and a class embedding $[C_{j}]$:

(3) $\bm{t}_{j}=[V]_{1}[V]_{2}\ldots[V]_{M}[C_{j}]$

To improve adversarial robustness, they adopt adversarial training during prompt tuning, which can be formulated as the following min-max optimization problem:

(4) $\mathop{\min}\limits_{\bm{t}}\mathop{\max}\limits_{\|\bm{\delta}_{i}\|_{p}\leq\epsilon}L(\bm{x}_{i}+\bm{\delta}_{i},\bm{t},y_{i};\bm{\theta}_{\mathcal{I}},\bm{\theta}_{\mathcal{T}})$

The inner maximization creates examples that deceive the model's prediction, while the outer minimization optimizes the learnable prompt vectors so that the model makes correct predictions on adversarial instances. By alternating the maximization and minimization steps, the adversarial robustness of the pre-trained model is steadily strengthened. Nevertheless, Adversarial Prompt Tuning disregards the clean inputs and the association between adversarial and clean inputs, and it pays little attention to the over-fitting issue in adversarial training and prompt tuning.
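The inner maximization in Eqs. (2) and (4) is typically approximated with a projected gradient (PGD) attack. Below is a minimal PyTorch-style sketch under an $\ell_\infty$ budget; `loss_fn` (the CLIP classification loss computed with the current prompts), the random start, and the clamp to the valid pixel range are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def pgd_attack(images, labels, loss_fn, eps=4/255, steps=3, alpha=None):
    """Approximate inner maximization of Eq. (4) with l_inf-bounded PGD.

    loss_fn(x, y) is assumed to return the scalar classification loss of the
    (prompted) CLIP model, e.g. cross-entropy over the Eq. (1) logits.
    """
    alpha = alpha if alpha is not None else 2 * eps / steps
    delta = torch.zeros_like(images).uniform_(-eps, eps)  # random start
    delta.requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(images + delta, labels)
        grad = torch.autograd.grad(loss, delta)[0]
        # gradient ascent step, then projection back into the eps-ball
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta = (images + delta).clamp(0, 1) - images      # keep valid pixels
        delta.requires_grad_(True)
    return (images + delta).detach()
```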


Figure 1. Overview of our adaptive Consistency-guided Adversarial Prompt Tuning framework. Our method adopts multi-modal prompt learning to improve the alignment between visual and textual features for adversarial examples and to enhance the robustness of the image and text encoders during training. We introduce a frozen pre-trained CLIP to tackle the over-fitting issue of adversarial fine-tuning and improve zero-shot adversarial robustness.

3.2. CAPT

To efficiently improve the robustness of CLIP when adapting to downstream tasks, several researchers adopt prompt tuning strategies in adversarial fine-tuning. Chen et al. (Chen et al., 2023) utilize adversarial visual prompting to improve adversarial robustness at test time. Li et al. (Li et al., 2024) found that a slight change of the prompt influences robustness, which indicates that adversarial robustness is sensitive to the choice of prompt for inference; thus, they use learnable textual prompts instead of fixed prompts to improve adversarial robustness. However, all existing adversarial prompt tuning methods follow uni-modal solutions, either in the vision or in the language branch of CLIP. They ignore that both the image and text encoders contribute to the alignment of multi-modal features, especially for adversarial examples whose feature distribution is quite different from that of clean ones. In adversarial situations, learning prompts only for the text encoder is insufficient for accurate prediction. Therefore, in order to improve the alignment of multi-modal features of adversarial examples and enhance the robustness of both the image encoder and the text encoder, we apply multi-modal prompt tuning (Khattak et al., 2023a), which employs learnable prompt vectors in both the image and text encoders of CLIP. Following Khattak et al. (Khattak et al., 2023a), we learn prompts in the deeper transformer layers to progressively model stage-wise feature representations. We introduce $b$ learnable tokens $V_{l}=\{v_{l}^{1},v_{l}^{2},\ldots,v_{l}^{b}\}$ and $T_{l}=\{T_{l}^{1},T_{l}^{2},\ldots,T_{l}^{b}\}$ as the learnable vision and text prompts in the $l$-th transformer layer. New learnable prompts are introduced in each transformer block of the image and text encoders up to depth $J$. For the visual branch, the process can be described as

(5) $[\_,E_{i}]=f_{\mathcal{I}}^{i}(V_{i-1},E_{i-1}),\quad i=1,2,\ldots,J$
(6) $[V_{i},E_{i}]=f_{\mathcal{I}}^{i}(V_{i-1},E_{i-1}),\quad i=J+1,\ldots,K$

$[\cdot,\cdot]$ denotes the concatenation operation, $f_{\mathcal{I}}^{i}$ the $i$-th transformer layer of the image encoder, and $E_{i}$ the output image feature of layer $i$.

The text branch is similar to the visual branch and can be described as

(7) $[\_,W_{j}]=f_{\mathcal{T}}^{j}(T_{j-1},W_{j-1}),\quad j=1,2,\ldots,J$
(8) $[T_{j},W_{j}]=f_{\mathcal{T}}^{j}(T_{j-1},W_{j-1}),\quad j=J+1,\ldots,K$

$f_{\mathcal{T}}^{j}$ is the $j$-th transformer layer of the text encoder and $W_{j}$ the output text feature of layer $j$. Applying multi-modal prompting in the deep layers enables learning prompts across various feature hierarchies within the transformer architecture.
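To make the deep prompting scheme of Eqs. (5)-(8) concrete, here is a schematic PyTorch sketch for one encoder branch; the module names, prompt length, initialization, and the assumption that each transformer block maps a token sequence to one of the same length are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    """Schematic version of Eqs. (5)-(8) for one branch of CLIP.

    In the first `depth_J` blocks the prompt outputs of the previous block are
    discarded ("_") and a fresh set of learnable prompts is injected
    (Eqs. (5)/(7)); from block J+1 onward the prompts produced by the previous
    block are carried forward unchanged (Eqs. (6)/(8)).
    """

    def __init__(self, layers, num_prompts=2, dim=512, depth_J=9):
        super().__init__()
        self.layers = nn.ModuleList(layers)       # the K transformer blocks
        self.depth_J = depth_J
        self.num_prompts = num_prompts
        self.prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(num_prompts, dim))
             for _ in range(depth_J)]
        )

    def forward(self, tokens):                    # tokens: (B, N, dim)
        batch = tokens.size(0)
        for i, block in enumerate(self.layers):
            if i < self.depth_J:
                if i > 0:                         # drop prompt outputs of block i-1
                    tokens = tokens[:, self.num_prompts:, :]
                prompt = self.prompts[i].unsqueeze(0).expand(batch, -1, -1)
                tokens = torch.cat([prompt, tokens], dim=1)
            tokens = block(tokens)                # produces [V_i, E_i] or [_, E_i]
        return tokens
```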

Trading off generalization between robustness and accuracy. Adversarial prompt tuning and related approaches adopt an adversarial training (Madry et al., 2018) strategy to improve the adversarial robustness of the pre-trained model when adapting to downstream tasks. However, all these methods focus only on the discrepancy between the predictions for adversarial inputs and the ground-truth labels and ignore clean examples. Although the adversarial robustness of the model improves during this process, the model's generalization capacity on clean examples potentially decreases. When adapting pre-trained models to downstream tasks, it is important to enhance adversarial robustness while ensuring accurate predictions for clean samples. Inspired by TRADES (Zhang et al., 2019) in adversarial training, we consider both the accuracy and the robustness of the pre-trained model on clean and adversarial examples. Besides, we leverage the correlation between clean and adversarial examples to improve robustness. The objective function can be described as

(9) $L_{\rm trades}={\rm CE}({\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau),\bm{y})+\lambda\,{\rm KL}({\rm sft}(\bm{z}^{I_{a}}\cdot\bm{z}^{T}/\tau),{\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau))$

where $\bm{z}^{I}$ and $\bm{z}^{I_{a}}$ represent the latent visual features of clean and adversarial inputs respectively, $\bm{z}^{T}$ is the latent text feature of the class labels, and $\bm{y}$ is the ground-truth label of the inputs. ${\rm sft}$ stands for the softmax operation, ${\rm CE}$ denotes the cross-entropy, and ${\rm KL}$ the Kullback–Leibler divergence. $\lambda$ is a weight balancing the focus on accuracy versus robustness during adversarial fine-tuning.

With such an objective function, and with the assistance of multi-modal prompts, the pre-trained model can accurately predict clean samples while adapting to downstream tasks and narrow the distance between adversarial and clean samples, thereby enhancing adversarial robustness. This achieves a balance in the model's generalization capabilities across both clean and adversarial samples.
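A minimal PyTorch-style sketch of Eq. (9) is given below, assuming the clean and adversarial image features and the class text features have already been produced by the prompted encoders; the tensor names, the temperature, and the KL argument order (which follows the common TRADES implementation) are assumptions.

```python
import torch.nn.functional as F

def trades_like_loss(z_img, z_img_adv, z_txt, labels, tau=0.01, lam=1.0):
    """Eq. (9): cross-entropy on clean predictions plus a KL term pulling the
    adversarial prediction toward the clean one.

    z_img, z_img_adv: (B, d) clean / adversarial image features
    z_txt:            (C, d) class text features
    """
    z_txt = F.normalize(z_txt, dim=-1)
    logits_clean = F.normalize(z_img, dim=-1) @ z_txt.t() / tau
    logits_adv = F.normalize(z_img_adv, dim=-1) @ z_txt.t() / tau
    ce = F.cross_entropy(logits_clean, labels)
    kl = F.kl_div(F.log_softmax(logits_adv, dim=-1),
                  F.softmax(logits_clean, dim=-1),
                  reduction="batchmean")
    return ce + lam * kl
```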

Adaptive consistency guided by the frozen CLIP. Prompt tuning is an efficient strategy for pre-trained models to adapt and learn task-specific knowledge without modifying the original pre-trained parameters. During training, learnable prompts interact with frozen CLIP tokens via self-attention within the transformer architecture. This interaction of prompt tokens with pre-trained CLIP parameters can implicitly retain the generalization capacity of the pre-trained model within the learned prompts. However, it still suffers from over-fitting (Zhou et al., 2022a; Yao et al., 2023; Khattak et al., 2023b) when fine-tuning on specific downstream tasks, which degrades the pre-trained model's robust generalization capacity. Within the adversarial training framework, the over-fitting problem is further aggravated (Rice et al., 2020; Wang et al., 2024a), as the model is influenced by the distribution of adversarial examples generated on the training set, leading to a decline of robustness on unseen data. Thus, to enhance the robust generalization of the pre-trained model on downstream tasks, and inspired by (Khattak et al., 2023b), we introduce a pre-trained frozen CLIP model with strong generalization capacity to explicitly guide the adversarial features from the prompt-tuned model to be consistent with the clean features from the frozen CLIP, which can be described as

(10) $L_{\rm cons-frz}={\rm KL}({\rm sft}(\bm{z}^{I_{a}}\cdot\bm{z}^{T}/\tau),{\rm sft}(\bm{z}_{\rm frz}^{I}\cdot\bm{z}_{\rm frz}^{T}/\tau))$

$\bm{z}_{\rm frz}^{I}$ and $\bm{z}_{\rm frz}^{T}$ are the latent image and text features of the clean examples from the pre-trained frozen CLIP.

With the assistance of the pre-trained frozen CLIP, our approach can learn more generalized features for adversarial examples, improving the robust generalization of the model during adversarial fine-tuning.

In addition, the second term of the original TRADES-like objective in Eq. (9) guides the prompts to focus on the correlation between adversarial and clean examples within the prompt-tuned model, which learns task-specific features when adapting to downstream tasks. Here, we rewrite it as

(11) $L_{\rm cons-train}={\rm KL}({\rm sft}(\bm{z}^{I_{a}}\cdot\bm{z}^{T}/\tau),{\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau))$

To guide the prompts to balance consistency with the pre-trained frozen CLIP and with the adversarially prompt-tuned CLIP itself, we design an adaptive weighting strategy instead of using a fixed weight for these two terms. Specifically, we assign different weights to the different losses based on the reliability of the models' predictions on clean examples. The adaptive consistency-guided loss is formulated as follows:

(12) $L_{\rm adv-cons}=(1-\alpha_{\rm cons})L_{\rm cons-train}+\alpha_{\rm cons}L_{\rm cons-frz}$
$\alpha_{\rm cons}=\frac{\exp({\rm CE}({\rm sft}(\bm{z}_{\rm frz}^{I}\cdot\bm{z}_{\rm frz}^{T}/\tau),\bm{y}))}{\exp({\rm CE}({\rm sft}(\bm{z}_{\rm frz}^{I}\cdot\bm{z}_{\rm frz}^{T}/\tau),\bm{y}))+\exp({\rm CE}({\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau),\bm{y}))}$

where $\alpha_{\rm cons}$ is the adaptive weight, calculated according to the reliability of the frozen CLIP. The adaptive weight is dynamically adjusted during training, guiding the prompts to balance the consistency of the adversarial features with the clean features obtained from the frozen CLIP and from the fine-tuned CLIP. Combining this with the cross-entropy (CE) loss enables the adversarial prompt tuning process to adapt to downstream tasks, maximizing model performance while ensuring that the learned features remain consistent with the generalizable features of the frozen CLIP, thereby enhancing the model's robust generalization.

Finally, we obtain the overall objective function of our adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework:

(13) $L_{\rm CAPT}={\rm CE}({\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau),\bm{y})+\lambda L_{\rm adv-cons}={\rm CE}({\rm sft}(\bm{z}^{I}\cdot\bm{z}^{T}/\tau),\bm{y})+\lambda[(1-\alpha_{\rm cons})L_{\rm cons-train}+\alpha_{\rm cons}L_{\rm cons-frz}]$

CAPT focuses on both clean and adversarial examples during training. In addition, with the assistance of adaptive weighting, CAPT balances the alignment of adversarial examples with their clean counterparts between the frozen CLIP and the fine-tuned model.
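For illustration, the sketch below combines Eqs. (10)-(13), assuming the three sets of logits (tuned model on clean inputs, tuned model on adversarial inputs, frozen CLIP on clean inputs) have already been computed and scaled by the temperature; the names and the batch-level treatment of the adaptive weight are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def capt_loss(logits_clean, logits_adv, logits_frz, labels, lam=100.0):
    """Eqs. (10)-(13): clean cross-entropy plus an adaptively weighted mix of
    the two KL consistency terms for the adversarial prediction.

    logits_clean / logits_adv: tuned model on clean / adversarial inputs (B, C)
    logits_frz:                frozen pre-trained CLIP on the clean inputs (B, C)
    """
    ce_clean = F.cross_entropy(logits_clean, labels)
    ce_frz = F.cross_entropy(logits_frz, labels)

    # Eq. (12): adaptive weight computed from the two clean cross-entropies
    alpha = torch.exp(ce_frz) / (torch.exp(ce_frz) + torch.exp(ce_clean))

    log_p_adv = F.log_softmax(logits_adv, dim=-1)
    kl_train = F.kl_div(log_p_adv, F.softmax(logits_clean, dim=-1),
                        reduction="batchmean")      # Eq. (11), L_cons-train
    kl_frz = F.kl_div(log_p_adv, F.softmax(logits_frz, dim=-1),
                      reduction="batchmean")        # Eq. (10), L_cons-frz

    l_adv_cons = (1 - alpha) * kl_train + alpha * kl_frz   # Eq. (12)
    return ce_clean + lam * l_adv_cons                     # Eq. (13)
```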

4. Experiments

In this section, we conduct an extensive evaluation on different datasets, including in-distribution and out-of-distribution experiments. The results demonstrate that our proposed method significantly outperforms all other state-of-the-art baselines, indicating the superiority of our method.
Datasets. Unless otherwise specified, the experiments in this section follow the Adversarial Prompt Tuning setup. Following Li et al. (Li et al., 2024), we use the same 11 datasets to evaluate our method for fairness: ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback and Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019) and UCF101 (Soomro et al., 2012). For each dataset, we evaluate with N shots, meaning N examples per class are randomly sampled from the entire training set for training. N is either 1, 4, 16, or "all", where "all" means the entire training set is used. One exception is ImageNet, where 100 shots are used instead of "all" because our computational resources were insufficient to run experiments on the full dataset. All methods are evaluated on the entire test set regardless of the training data scheme used.

Baseline methods. Our proposed method is a multi-modal prompting-based parameter-efficient adaptation method. We compare it against two groups of related work: text prompting and prompting-based adversarial fine-tuning methods. For text prompting, we compare against Hand-Engineered Prompts (HEP) (Radford et al., 2021), which was originally proposed in CLIP and has been widely used in other works (Zhao et al., 2024; Zhou et al., 2022b, a). For prompting-based adversarial fine-tuning methods, we adopt Adversarial Visual Prompting (AVP) (Chen et al., 2023), Partial Adversarial Fine-Tuning (PAFT) (Chen et al., 2020) and Adversarial Prompt Tuning (APT) (Li et al., 2024) for comparison. AVP utilizes adversarial visual prompting (Bahng et al., 2022) as a test-time defense to enhance the adversarial robustness of pre-trained models. PAFT can be viewed as the adversarial training variant of linear probing (Radford et al., 2021). APT combines textual prompt tuning with adversarial training to improve the robustness of the pre-trained model when adapting to downstream tasks. All compared methods share the same frozen pre-trained image and text encoders.

Implementation Details. We utilize the ViT-B/32 architecture of the CLIP model as the backbone. Following APT, the weights of the image encoder were pre-trained using the state-of-the-art zero-shot adversarial robustness method TeCoA (Mao et al., 2022). We use the SGD optimizer with a momentum of 0.9. The initial learning rate is set to 0.0025 with a cosine learning rate scheduler and a warm-up strategy during the first epoch. We use a context length of 16 and independent V-L prompting in the first 9 transformer layers of both the image and text branches. For hyper-parameters, we empirically set $\lambda$ in the objective function to 100. The batch size is 32 for all datasets. For ImageNet, the number of epochs is 20, 20, 50, and 20 for 1, 4, 16, and all shots respectively; for the other datasets it is 50, 100, 200, and 200, the same as APT. Results are reported for the last checkpoint. The PGD (Madry et al., 2018) attack is used for both training and evaluation. Two perturbation budgets, $\epsilon=1/255$ and $\epsilon=4/255$, are used following (Mao et al., 2022) and (Croce et al., 2020) respectively. We use 3 steps with a step size of $2\epsilon/3$ for training and 100 steps with a step size of $\epsilon/4$ and random start for evaluation. (More details of the experimental setup and results are in the supplementary materials.)
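For concreteness, the attack settings above map onto the illustrative `pgd_attack` sketch from Sec. 3.1 as follows; `images`, `labels`, and `loss_fn` are hypothetical placeholders from that sketch.

```python
eps = 4 / 255   # or 1/255 for the weaker budget

# Training-time attack: 3 PGD steps, step size 2*eps/3, random start
x_adv_train = pgd_attack(images, labels, loss_fn, eps=eps, steps=3,
                         alpha=2 * eps / 3)

# Evaluation-time attack: 100 PGD steps, step size eps/4, random start
x_adv_eval = pgd_attack(images, labels, loss_fn, eps=eps, steps=100,
                        alpha=eps / 4)
```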

5. Main Results

5.1. In-Distribution Performance

We evaluate the performance of our method and the baseline methods in the in-distribution scenario, where the training and test data share the same distribution. We conduct experiments with different numbers of shots (1, 4, 16, all) and different perturbation budgets ($\epsilon=1/255$ and $4/255$). A comparison of the different prompting methods across datasets, shots, and perturbations is shown in Table 1. Compared with hand-engineered prompts, our method achieves a significant improvement in both accuracy and robustness, and the improvement grows as the number of shots increases. Compared to the other prompting-based adversarial fine-tuning methods, our method also shows clear superiority in accuracy and robustness, especially for the $\epsilon=4/255$ perturbation, which indicates that our method performs better against stronger adversarial attacks. The results on the remaining datasets for the $\epsilon=4/255$ and $\epsilon=1/255$ perturbations are shown in the supplementary material.

Table 1. Performance for $\epsilon=4/255$ under different shots on a subset of the datasets. The results of HEP are repeated across shot columns for ease of comparison.
Dataset Method 1-shot 4-shots 16-shots All
Acc. Rob. Acc. Rob. Acc. Rob. Acc. Rob.
ImageNet HEP (Radford et al., 2021) 39.63 10.28 39.63 10.28 39.63 10.28 39.63 10.28
PAFT (Chen et al., 2020) 9.23 3.29 15.93 5.37 31.92 12.90 17.12 7.26
AVP (Chen et al., 2023) 39.66 10.41 39.58 10.97 39.61 11.18 39.53 11.32
APT-CSC (Li et al., 2024) 18.56 4.08 29.88 6.99 35.10 8.46 39.91 11.87
APT-UC (Li et al., 2024) 37.88 11.21 39.52 11.52 40.78 12.17 41.64 12.49
CAPT 46.81 15.26 55.90 16.69 60.98 21.73 65.67 24.48
Caltech101 HEP 77.44 42.76 77.44 42.76 77.44 42.76 77.44 42.76
PAFT 51.32 32.04 72.55 46.36 86.42 60.55 88.85 66.99
AVP 77.40 43.12 77.44 43.57 77.36 44.67 77.61 46.33
APT-CSC 61.81 33.24 78.67 46.79 86.42 57.30 89.63 65.59
APT-UC 75.90 42.66 82.46 50.72 86.37 56.52 88.75 63.51
CAPT 79.92 57.52 86.86 64.91 92.54 73.43 95.01 79.43
OxfordPets HEP 61.49 14.34 61.49 14.34 61.49 14.34 61.49 14.34
PAFT 12.33 3.07 36.16 10.17 63.47 19.45 71.46 25.60
AVP 60.89 14.83 61.02 14.94 61.00 15.24 60.89 16.63
APT-CSC 35.87 6.75 51.73 9.90 66.07 17.19 72.36 24.52
APT-UC 57.49 14.31 61.21 14.71 67.77 20.29 72.43 24.72
CAPT 63.40 17.88 67.83 27.26 79.99 40.94 87.11 49.50
StanfordCars HEP 10.33 0.92 10.33 0.92 10.33 0.92 10.33 0.92
PAFT 5.98 2.02 14.09 3.84 34.27 11.09 41.03 13.75
AVP 10.38 0.95 10.65 1.08 11.51 1.52 11.70 1.72
APT-CSC 12.48 1.73 26.23 4.68 42.29 9.85 48.83 12.83
APT-UC 12.54 1.94 25.72 4.95 32.98 7.82 37.58 9.05
CAPT 36.93 8.75 56.59 16.86 74.59 26.17 81.17 31.60
Food101 HEP 21.70 3.19 21.70 3.19 21.70 3.19 21.70 3.19
PAFT 5.98 1.16 14.13 3.08 30.45 8.23 36.87 14.60
AVP 20.28 3.12 20.57 3.27 20.69 3.50 24.50 5.98
APT-CSC 11.41 1.25 20.75 2.65 33.15 6.72 42.75 14.71
APT-UC 19.85 3.65 23.51 4.05 30.56 7.94 35.52 12.98
CAPT 30.45 9.09 42.81 13.45 62.01 21.97 80.91 36.09
FGVCAircraft HEP 7.02 0.48 7.02 0.48 7.02 0.48 7.02 0.48
PAFT 6.38 1.96 12.70 2.96 26.35 7.94 29.28 10.08
AVP 6.42 0.42 6.78 0.60 7.71 1.05 8.01 1.32
APT-CSC 9.56 0.94 16.86 2.50 28.64 6.82 31.56 8.91
APT-UC 2.81 0.75 11.21 3.27 17.68 5.73 20.89 7.02
CAPT 12.99 4.23 21.27 4.62 40.35 19.14 50.47 25.05
Table 2. Performance for $\epsilon=1/255$ under different shots on a subset of the datasets.
Dataset Method 1-shot 4-shots 16-shots All
Acc. Rob. Acc. Rob. Acc. Rob. Acc. Rob.
ImageNet HEP 55.16 38.49 55.16 38.49 55.16 38.49 55.16 38.49
PAFT 19.75 14.03 32.74 22.87 52.42 39.12 38.93 29.36
AVP 55.15 38.66 55.17 38.78 55.25 38.77 55.29 38.85
APT-CSC 31.92 20.48 45.51 29.75 50.84 33.21 57.65 40.29
APT-UC 54.55 37.93 56.42 39.79 58.02 40.83 58.56 41.53
CAPT 55.39 29.87 60.66 34.24 64.34 38.93 68.77 43.07
Caltech101 HEP 83.94 74.00 83.94 74.00 83.94 74.00 83.94 74.00
PAFT 69.86 61.14 83.16 72.98 92.49 85.35 93.87 88.15
AVP 83.81 74.00 83.85 74.40 83.81 74.73 83.85 75.01
APT-CSC 75.21 63.77 86.21 75.42 91.24 84.06 93.47 86.77
APT-UC 86.04 73.06 90.30 81.01 92.66 85.40 93.67 87.14
CAPT 88.48 73.83 90.34 79.15 94.12 84.34 95.50 87.91
OxfordPets HEP 74.87 58.63 74.87 58.63 74.87 58.63 74.87 58.63
PAFT 29.30 19.43 58.11 42.06 80.40 63.21 86.05 69.69
AVP 74.73 58.11 74.65 58.08 74.76 58.27 74.98 58.52
APT-CSC 60.81 42.85 71.79 51.35 80.21 60.67 86.29 69.47
APT-UC 79.67 61.41 81.58 62.44 83.05 65.55 86.45 69.53
CAPT 74.13 41.53 78.58 53.99 84.85 61.16 88.36 69.15
StanfordCars HEP 25.31 12.30 25.31 12.30 25.31 12.30 25.31 12.30
PAFT 15.63 9.82 33.63 21.70 62.12 43.63 69.12 50.68
AVP 25.52 12.49 27.92 15.09 30.99 18.67 31.84 19.25
APT-CSC 24.79 13.06 44.17 26.23 64.82 43.40 72.18 50.01
APT-UC 36.35 19.15 48.76 26.89 60.23 36.87 63.71 38.96
CAPT 46.04 22.94 60.72 35.51 77.02 51.37 82.76 56.83
Food101 HEP 44.99 26.25 44.99 26.25 44.99 26.25 44.99 26.25
PAFT 14.06 8.55 31.27 17.82 51.82 32.50 64.33 45.72
AVP 43.16 25.39 43.44 26.34 43.98 27.34 48.14 32.24
APT-CSC 23.92 12.49 36.35 18.63 52.03 30.57 66.56 46.24
APT-UC 45.67 26.05 45.75 25.28 55.66 33.73 63.56 42.93
CAPT 43.11 22.04 54.38 28.79 66.94 38.52 82.26 54.77
FGVCAircraft HEP 12.48 5.88 12.48 5.88 12.48 5.88 12.48 5.88
PAFT 12.75 8.37 20.01 12.15 35.43 21.90 39.51 25.92
AVP 12.33 6.00 12.93 7.14 14.61 9.48 15.12 9.93
APT-CSC 14.61 7.38 23.34 12.30 36.81 21.24 41.73 26.22
APT-UC 13.89 7.71 21.21 10.92 28.47 15.39 31.95 18.84
CAPT 14.61 10.02 23.67 10.74 42.57 27.18 52.09 35.91

5.2. Cross Dataset Generalization

This section assesses the generalization capacity of our model under distribution shifts and in cross-dataset scenarios. We use ImageNet as the source dataset for adversarial prompt tuning. Then, we evaluate the ImageNet-adapted models on target datasets with the same classes but different data distributions, as well as on target datasets with different classes. Specifically, following Li et al. (Li et al., 2024), we use three ImageNet shift datasets, ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch and ImageNet-R (Hendrycks et al., 2021), to represent different kinds of distribution shift. We use the remaining ten datasets for the cross-dataset test.

Table 3. Generalization of the prompts learned by our method on ImageNet to datasets with input distribution shifts. The results of both variants of our method are reported for the checkpoints trained with 16 shots and $\epsilon=4/255$.
  Method Source Distribution Shifts
ImageNet ImageNet-V2 ImageNet-Sketch ImageNet-R
Acc. Rob. Acc. Rob. Acc. Rob. Acc. Rob.
HEP (Radford et al., 2021) 39.86 10.28 32.74 7.49 17.40 7.21 21.46 5.80
AVP (Chen et al., 2023) 39.61 11.18 32.68 8.12 17.39 7.69 21.47 6.25
PAFT (Chen et al., 2020) 31.92 12.90 25.55 9.52 10.02 5.05 13.34 4.55
APT-CSC (Li et al., 2024) 37.18 9.49 28.93 6.65 12.72 4.83 15.06 3.57
APT-UC (Li et al., 2024) 40.80 12.33 33.20 9.04 18.35 8.04 22.66 6.97
Ours w/o CG 60.53 18.45 51.04 14.08 27.26 15.24 31.08 12.36
Ours with CG 60.98 21.73 51.05 17.04 27.01 15.67 31.45 13.38
 

In addition to comparing against the baseline methods, we study our method with and without the adaptive consistency guidance from the pre-trained frozen CLIP. Based on the results in Table 3, our method surpasses all other approaches by a large margin in both zero-shot accuracy and robustness under distribution shifts, even without the consistency guidance from the pre-trained frozen CLIP. With the support of the generalized knowledge of the pre-trained frozen CLIP, the robust generalization of the model is further enhanced without compromising accuracy on clean instances or efficiency. Taken as a whole, the results demonstrate that our approach effectively improves the robust generalization of the prompt-tuned model. The results of the cross-dataset test are provided in the supplementary material.

5.3. Ablation Study

To reveal the effect of the different terms in our objective function, we conduct experiments with different objective functions for adversarial prompt tuning. We consider five cases: ${\rm CE_{Adv}}$, ${\rm CE_{clean}}$, ${\rm CE_{clean}}+L_{\rm cons-train}$, ${\rm CE_{clean}}+L_{\rm cons-frz}$, and All. ${\rm CE_{Adv}}$ denotes the cross-entropy between the prediction on adversarial examples and the label, which is the same loss function as APT (Li et al., 2024). ${\rm CE_{clean}}$ denotes the cross-entropy between the prediction on clean examples and the label, without adversarial training. $L_{\rm cons-train}$ and $L_{\rm cons-frz}$ denote the consistency losses between clean and adversarial inputs of the fine-tuned model and of the frozen pre-trained model respectively, as defined above. All experiments in this section use 16 shots on the StanfordCars dataset. According to the results in Table 4, using ${\rm CE_{Adv}}$ alone achieves better robustness, but its accuracy on clean examples drops significantly since it disregards them. Combining ${\rm CE_{clean}}$ with either $L_{\rm cons-train}$ or $L_{\rm cons-frz}$ improves the robustness of the model to a great extent while preserving accuracy, compared to using ${\rm CE_{clean}}$ alone. Combining ${\rm CE_{clean}}$ with both $L_{\rm cons-train}$ and $L_{\rm cons-frz}$ achieves a greater improvement in robustness and even a slightly higher accuracy. The ablation results again demonstrate that our approach balances improving the robust generalization of the model with retaining accuracy on clean examples.

6. Conclusion

In this work, we focus on the over-fitting problem that occurs in adversarial prompt tuning, which limits the model's capacity to generalize to both clean and adversarial scenarios. To address this issue, we present the CAPT framework, which uses multi-modal prompt learning to enhance the alignment of image-text features for adversarial examples and enforces consistency between clean and adversarial inputs. In addition, we enhance robust generalization to adversarial scenarios by leveraging a pre-trained frozen CLIP, while preserving accuracy on clean examples. Our extensive experiments show that CAPT is superior both in in-distribution performance and in its capacity to generalize under distribution shift and across datasets. This paper provides a novel and efficient approach to addressing the over-fitting problem of adversarial prompt tuning.

Table 4. Ablation study of different objective functions (16 shots, StanfordCars)
Method Acc. Rob.
CE_Adv 44.35 36.70
CE_clean 74.97 8.47
CE_clean + L_cons-train 73.88 33.24
CE_clean + L_cons-frz 75.69 29.44
All (CE_clean + L_cons-train + L_cons-frz) 76.07 35.65

References

  • Bahng et al. (2022) Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. 2022. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022).
  • Bangalath et al. (2022) Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. 2022. Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems 35 (2022), 33781–33794.
  • Bossard et al. (2014) Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. Springer, 446–461.
  • Chen et al. (2023) Aochuan Chen, Peter Lorenz, Yuguang Yao, Pin-Yu Chen, and Sijia Liu. 2023. Visual prompting for adversarial robustness. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
  • Chen et al. (2022) Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. 2022. Plot: Prompt learning with optimal transport for vision-language models. arXiv preprint arXiv:2210.01253 (2022).
  • Chen et al. (2020) Tianlong Chen, Sijia Liu, Shiyu Chang, Yu Cheng, Lisa Amini, and Zhangyang Wang. 2020. Adversarial robustness: From self-supervised pre-training to fine-tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 699–708.
  • Cimpoi et al. (2014) Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3606–3613.
  • Croce et al. (2020) Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. 2020. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670 (2020).
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
  • Dodge et al. (2020) Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305 (2020).
  • Dong et al. (2018) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition. 9185–9193.
  • Fei-Fei et al. (2004) Li Fei-Fei, Rob Fergus, and Pietro Perona. 2004. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop. IEEE, 178–178.
  • Gao et al. (2024) Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2024. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 2 (2024), 581–595.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • Gu et al. (2021) Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021).
  • Helber et al. (2019) Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, 7 (2019), 2217–2226.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. 2021. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision. 8340–8349.
  • Huang et al. (2023) Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, and Nenghai Yu. 2023. Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1600–1610.
  • Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning. PMLR, 4904–4916.
  • Jia et al. (2022a) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022a. Visual prompt tuning. In European Conference on Computer Vision. Springer, 709–727.
  • Jia et al. (2022b) Xiaojun Jia, Yong Zhang, Baoyuan Wu, Ke Ma, Jue Wang, and Xiaochun Cao. 2022b. LAS-AT: adversarial training with learnable attack strategy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13398–13408.
  • Khattak et al. (2023a) Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023a. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19113–19122.
  • Khattak et al. (2023b) Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. 2023b. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15190–15200.
  • Krause et al. (2013) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 2013. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops. 554–561.
  • Li et al. (2022b) Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. 2022b. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022).
  • Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
  • Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022a. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning. PMLR, 12888–12900.
  • Li et al. (2024) Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. 2024. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. arXiv preprint arXiv:2403.01849 (2024).
  • Li et al. (2023c) Xiao Li, Wei Zhang, Yining Liu, Zhanhao Hu, Bo Zhang, and Xiaolin Hu. 2023c. Language-Driven Anchors for Zero-Shot Adversarial Robustness. arXiv preprint arXiv:2301.13096 (2023).
  • Li et al. (2023b) Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023b. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22511–22521.
  • Lin and Byrne (2022) Weizhe Lin and Bill Byrne. 2022. Retrieval augmented visual question answering with outside knowledge. arXiv preprint arXiv:2210.03809 (2022).
  • Lu et al. (2023) Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, and Feng Zheng. 2023. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 102–111.
  • Lu et al. (2022) Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. 2022. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5206–5215.
  • Maaz et al. (2022) Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. 2022. Class-agnostic object detection with multi-modal transformer. In European conference on computer vision. Springer, 512–531.
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations.
  • Maji et al. (2013) Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 2013. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013).
  • Mao et al. (2022) Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. 2022. Understanding zero-shot adversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016 (2022).
  • Nilsback and Zisserman (2008) Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 722–729.
  • Pang et al. (2022) Tianyu Pang, Min Lin, Xiao Yang, Jun Zhu, and Shuicheng Yan. 2022. Robustness and accuracy could be reconcilable by (proper) definition. In International Conference on Machine Learning. PMLR, 17258–17277.
  • Parkhi et al. (2012) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. 2012. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 3498–3505.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  • Rao et al. (2022) Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. 2022. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18082–18091.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do imagenet classifiers generalize to imagenet?. In International conference on machine learning. PMLR, 5389–5400.
  • Rice et al. (2020) Leslie Rice, Eric Wong, and Zico Kolter. 2020. Overfitting in adversarially robust deep learning. In International conference on machine learning. PMLR, 8093–8104.
  • Salman et al. (2020) Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. 2020. Do adversarially robust imagenet models transfer better? Advances in Neural Information Processing Systems 33 (2020), 3533–3545.
  • Schlarmann and Hein (2023) Christian Schlarmann and Matthias Hein. 2023. On the adversarial robustness of multi-modal foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3677–3685.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
  • Wang et al. (2024b) Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. 2024b. Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness. arXiv preprint arXiv:2401.04350 (2024).
  • Wang et al. (2024a) Yifei Wang, Liangchen Li, Jiansheng Yang, Zhouchen Lin, and Yisen Wang. 2024a. Balance, imbalance, and rebalance: Understanding robust overfitting from a minimax game perspective. Advances in neural information processing systems 36 (2024).
  • Wang et al. (2019) Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. 2019. Improving adversarial robustness requires revisiting misclassified examples. In International conference on learning representations.
  • Xiao et al. (2010) Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. 2010. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 3485–3492.
  • Yao et al. (2023) Hantao Yao, Rui Zhang, and Changsheng Xu. 2023. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6757–6767.
  • Yin et al. (2023) Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, and Fenglong Ma. 2023. Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. arXiv preprint arXiv:2310.04655 (2023).
  • Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems 27 (2014).
  • Zang et al. (2022) Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. 2022. Open-vocabulary detr with conditional matching. In European Conference on Computer Vision. Springer, 106–122.
  • Zhang et al. (2019) Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. 2019. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning. PMLR, 7472–7482.
  • Zhang et al. (2022) Jiaming Zhang, Qi Yi, and Jitao Sang. 2022. Towards adversarial attack on vision-language pre-training models. In Proceedings of the 30th ACM International Conference on Multimedia. 5005–5013.
  • Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847.
  • Zhao et al. (2024) Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. 2024. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems 36 (2024).
  • Zhou et al. (2022a) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022a. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16816–16825.
  • Zhou et al. (2022b) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022b. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 9 (2022), 2337–2348.
  • Zhou et al. (2020) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 13041–13049.
  • Zhu et al. (2023) Peipei Zhu, Xiao Wang, Lin Zhu, Zhenglong Sun, Wei-Shi Zheng, Yaowei Wang, and Changwen Chen. 2023. Prompt-based learning for unpaired image captioning. IEEE Transactions on Multimedia (2023).