
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie*, Yequan Bie*, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen✉, Kani Chen✉
The Hong Kong University of Science and Technology
{pxieaf, ybie}@connect.ust.hk
Abstract

Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks serving as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between the vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on multi-modal semantic updates using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models into generating targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model development.

*  Equal contributions.  ✉  Corresponding authors.
Refer to caption
Figure 1: Comparison of the proposed CoA with other attacking strategies. The CLIP score results of Unidiffuser [4] are reported. Our method shows both superior performance and efficiency.

1 Introduction

Vision-language models (VLMs) have achieved significant progress over the last few years and demonstrated promising performance in image and natural language understanding, reasoning, and generation [2, 30, 1, 45, 46]. Their powerful multi-modal capability in different tasks, including visual question answering and image captioning [34, 28, 10], has led to these models being widely deployed in real-world applications. However, as more VLMs are open-sourced or used for commercial purposes, security and safety issues raise concerns, e.g., VLMs could be attacked and exploited to generate fake and toxic content, which remains an inevitable challenge [5, 37, 48]. Moreover, compared to language models, VLMs suffer more from adversarial attacks since the vision modality is highly susceptible to visually inconspicuous adversarial perturbations due to the continuous and high-dimensional nature of images [22, 7, 38]. This vulnerability could be exploited by adversaries to mislead VLMs by circumventing safety checkers [43, 53], injecting malicious code, or gaining unauthorized API access, leading to severe security risks in practical applications of these models [27, 54].

To explore the vulnerability of vision-language models, recent research proposes evaluating their robustness to adversarial attacks [54, 17, 47, 3, 7, 13], with some work focusing on transfer-based adversarial attacks [17, 47], i.e., using adversarial examples generated via white-box surrogate models to mislead the black-box victim models [31, 55]. However, most existing transfer-based adversarial attack strategies only emphasize visual features when crafting adversarial examples, leveraging text embeddings only coarsely and neglecting the semantic correspondences between the vision and text modalities. Moreover, current evaluation methods compute the attack success rate (ASR) for response generation tasks in different ways, lacking a clear and unified ASR calculation strategy. To address the above challenges, we propose a novel transfer-based adversarial attacking approach, namely Chain of Attack (CoA), which enhances the adversarial example generation process based on multi-modal semantics using a series of intermediate attacking steps. We further establish a unified and comprehensive ASR computing method for targeted and untargeted evasion based on large language models (LLMs), holding the potential to facilitate future research and benchmarking by providing a fair and straightforward evaluation strategy for text generation tasks with human-understandable explanations. From the evaluation conducted in this study, we find that the considered VLMs are generally vulnerable to adversarial visual attacks even without knowledge of the victim models. Models with a larger number of parameters are less susceptible to targeted attacks, while they can still be fooled to some extent. Our proposed attacking strategy further improves attack performance through perturbations with richer semantics compared to existing black-box attack methods. We hope the evaluation and the attacking strategy illustrated in this paper can encourage the future development of more trustworthy VLMs and their safety evaluations.

We summarize the main contributions of this paper as follows: (i) We propose a new transfer-based targeted attacking framework, Chain of Attack. It leverages an explicit step-by-step semantic update process to enhance the generation of adversarial examples, thereby improving attack quality and success rate. (ii) We establish a unified and comprehensive automatic attack success rate computing strategy based on LLMs. (iii) Evaluations of security and robustness for various VLMs are conducted using black-box attacks, demonstrating the effectiveness of the proposed method and highlighting the vulnerabilities of existing VLMs. More discussions about image perturbations are also included.

2 Related Work

2.1 Vision-Language Models and Robustness

Pre-trained vision-language models are broadly utilized for various vision and natural language tasks, including image captioning [19, 42] and visual question answering [30, 10]. For example, ViECap [19] incorporates entity-aware hard prompts to guide an LLM’s (i.e., GPT-2 [40]) attention toward the visual entities for coherent caption generation. LLaVA [30, 29] adopts a projection layer to connect a vision encoder and an LLM (i.e., Vicuna [12]) for general-purpose visual and language understanding. To alleviate the security issues of VLMs, recent research evaluates model robustness through adversarial attacks [17, 54, 13, 47]. For image captioning tasks, much previous work [8, 49] focuses on white-box and untargeted attacks against VLMs with traditional architectures (e.g., CNN- and RNN-based) and requires human effort for robustness evaluation. Zhao et al. [54] propose using the CLIP score [41] for automatic evaluation. However, we argue that more practical scenarios and metrics are necessary for evaluating the robustness of VLMs and facilitating future work. Therefore, in this work, we assess the adversarial robustness of VLMs with advanced architectures against the more difficult targeted evasion under both embedding-based and LLM-based metrics.

2.2 Adversarial Attack

Adversarial attacks can be categorized into white-box, grey-box, and black-box attacks in terms of the attacker’s capabilities and knowledge [50]. Query-based attacks [16, 26, 35, 36] can sometimes be regarded as grey-box rather than black-box attacks, since the attacker can extract some information directly from the victim models rather than being entirely uninformed. Most query-based methods conduct gradient estimation by repeatedly querying the victim models, which is typically time-consuming [54]. In contrast, transfer-based attacks [32, 39, 15, 9, 25] are black-box attacks that generate adversarial examples using surrogate models without gaining any direct knowledge from the victim models [50]. In this paper, we focus on transfer-based image attacks under the most practical and high-stakes scenario, i.e., the black-box setting without any knowledge of the victim models. Our method achieves superior attack performance, matching or even outperforming query-based methods in some cases at a much lower computational cost, as shown in Fig. 1.

Refer to caption
Figure 2: The pipeline of the Chain of Attack (CoA) framework. (a) Our framework proposes using modality-aware embeddings to capture the semantic correspondence between images and texts. To enhance the adversarial transferability, we use a chain of attacks that explicitly updates the adversarial examples based on their previous multi-modal semantics in a step-by-step manner. A Targeted Contrastive Matching objective is further proposed to align and differentiate the semantics among clean, adversarial, and target reference examples. (b) Targeted response generation is conducted during inference, where the victim models give responses based on the adversarial examples. We further introduce a unified ASR computing strategy for automatic and comprehensive robustness evaluation of VLMs in response generation.

3 Method

3.1 Preliminaries

Problem definition. Let $M$ be the target victim vision-language model that takes an image $I$ as input and outputs a prediction. Adversarial image attacks modify the visual input with a perturbation $\delta$ to generate an adversarial example $I_{adv}$ that achieves different attack goals. The paradigm can be formulated as:

T^{*} = M(I_{adv}), \quad I_{adv} = atk(I, \delta),     (1)

where $T^{*}$ is the desired output of the attack, and $atk(\cdot)$ represents the attack function learned for effective input perturbation. Specifically, the form of $T^{*}$ depends on the task, e.g., $T^{*}$ is a label for the classification task and a textual response for multimodal tasks such as image captioning. The adversarial example generation should ensure that the learned perturbation $\delta^{*}$ is imperceptible to humans, which can be implemented using box constraints to limit the perturbation size in pixels:

||I - I_{adv}||_{\infty} = ||\delta^{*}||_{\infty} \leq \epsilon,     (2)

where $\epsilon$ is a hyperparameter representing the perturbation budget.
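For illustration, the box constraint of Eq. (2) can be enforced by clipping the perturbation before it is added to the clean image. The following is a minimal PyTorch sketch; the tensor names and the [0, 1] pixel range are assumptions for the example, not part of the released implementation.

import torch

# Clip the perturbation so that every pixel of the adversarial image stays
# within epsilon of the clean image (L_inf box constraint of Eq. (2)).
def project_linf(delta: torch.Tensor, epsilon: float) -> torch.Tensor:
    return torch.clamp(delta, min=-epsilon, max=epsilon)

clean = torch.rand(1, 3, 224, 224)          # clean image I with pixels in [0, 1]
delta = 0.1 * torch.randn_like(clean)       # candidate perturbation
delta = project_linf(delta, epsilon=8 / 255)
adv = torch.clamp(clean + delta, 0.0, 1.0)  # adversarial image I_adv
assert (adv - clean).abs().max() <= 8 / 255 + 1e-6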

Threat model. A threat model defines the conditions under which a defense is designed to be secure and the precise security guarantees provided [6]. We specify the threat model for adversarial attacks in our method, which comprises two components: (i) Attacker capabilities/knowledge refers to the extent of the adversary’s knowledge. Unlike traditional taxonomies [6, 52], Zhang et al. [50] propose a more fine-grained categorization, including white-box, grey-box, and black-box victim model access. Our method focuses on adversarial transferability, which assumes only black-box access without knowledge of the victim models. (ii) Attack goals indicate the objectives that an adversary aims to achieve. They can typically be categorized into two classes: untargeted goals that only aim to fool the victim model into generating wrong responses, and targeted goals that require the model to give responses matching the target. Our proposed attacking strategy focuses on targeted goals; specifically, given a target reference caption $T_{ref}$, the goal can be expressed as follows:

\delta^{*} = {\rm argmax}_{\delta}\, {\rm sim}(T^{*}, T_{ref}),     (3)

where ${\rm sim}(\cdot)$ denotes the similarity measure.
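One possible instantiation of sim(·), in line with the CLIP-score evaluation used later in the paper, is the cosine similarity between CLIP text embeddings of the generated response and the target caption. The sketch below assumes the open-source CLIP package; it is an illustration rather than the exact measure of the released code.

import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Cosine similarity between the CLIP text embeddings of two captions.
def text_sim(generated: str, target: str) -> float:
    tokens = clip.tokenize([generated, target], truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()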

Refer to caption
Figure 3: Illustration of the attacking chain. Given the modality-aware embeddings of clean examples and target examples, the adversarial examples including the image perturbations and the corresponding textual information are explicitly updated in a step-by-step manner with the guidance of Targeted Contrastive Matching. This Chain of Attack enhances the adversarial example generation while providing a clear and human-understandable β€œevolution” process, e.g., from β€œA bird in the park” to β€œTwo young boys playing baseball on a field”.
Refer to caption
Figure 4: Examples of the proposed LLM-based attack success rate evaluation. From left to right, the examples depict a completely successful attack case, a fooled-only case, and a failed attack case, respectively. The output score for each case is at the bottom.

3.2 Chain of Attack framework

An overview of our proposed attacking framework, Chain of Attack (CoA), is illustrated in Fig. 2. As a transfer-based attacking strategy with only black-box victim model access, a surrogate vision-language model (e.g., CLIP [41]) is adopted to help craft adversarial examples, which are then input to the victim model to obtain the attacked response. For targeted evasion, we randomly sample a target reference text $T_{ref}$ from MS-COCO captions [11] for each clean input image $I$ [54]. To craft adversarial examples that can more effectively influence the victim model, we propose leveraging the semantic correspondences between the image and text modalities, thereby enriching the semantics of the generated adversarial examples. Specifically, we first obtain the corresponding clean text $T$ and target image $I_{ref}$ for the clean image and the target text, respectively:

T = M_{I2T}(I), \quad I_{ref} = M_{T2I}(T_{ref}),     (4)

where $M_{I2T}(\cdot)$ and $M_{T2I}(\cdot)$ represent a publicly accessible pre-trained image-to-text model and text-to-image model (distinct from the victim models), respectively. We observe that the sampled target reference often contains redundant information that is not directly associated with the visual content of the corresponding image, making it hard for the model to learn the semantic relationships between images and texts. To alleviate this issue, we propose querying a large language model to extract the key visual information from the original target texts. For example, as shown in Fig. 2, the original target text $T_{ref}$ is “The little girl is taking tennis lesson to learn how to play.” We query an LLM (e.g., GPT-4) with the prompt “Extract the keywords/information from the following sentence (save verbs and objects): {text}.” and obtain the refined target reference text: “A little girl taking tennis lesson.” It is noteworthy that the generation of the clean text, target image, and refined target text can be done during data pre-processing before training.
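A possible way to implement this key-information extraction step is to send the quoted prompt to a chat-style LLM; the sketch below uses the OpenAI Python client as an assumed interface, and any comparable LLM could be substituted.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

# Query an LLM with the prompt quoted above to obtain the refined target text.
def refine_target_text(text: str, model: str = "gpt-4") -> str:
    prompt = ("Extract the keywords/information from the following sentence "
              f"(save verbs and objects): {text}")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# e.g., "The little girl is taking tennis lesson to learn how to play."
# -> "A little girl taking tennis lesson."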

We use modality fusion of embeddings to capture the semantic correspondence between images and texts. The modality fusion for the clean and target image-text pairs can be achieved by the following calculations:

F = \alpha \cdot E_{v}(I) + (1-\alpha) \cdot E_{t}(T), \quad F_{ref} = \alpha \cdot E_{v}(I_{ref}) + (1-\alpha) \cdot E_{t}(T_{ref}),     (5)

where $E_{v}(\cdot)$ and $E_{t}(\cdot)$ are the surrogate image and text encoders, $F$ and $F_{ref}$ are the modality-aware embeddings (MAE) for the clean and target image-text pairs, respectively, and $\alpha$ is a modality-balancing hyperparameter.
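A minimal sketch of the fusion in Eq. (5) is given below; the L2 normalization follows the pseudocode in the appendix, and the feature dimension is only an example.

import torch

# Fuse image and text embeddings into a modality-aware embedding (Eq. (5)).
def modality_aware_embedding(img_feat: torch.Tensor, txt_feat: torch.Tensor, alpha: float) -> torch.Tensor:
    fused = alpha * img_feat + (1.0 - alpha) * txt_feat
    return fused / fused.norm(dim=-1, keepdim=True)

# e.g., with 512-dimensional CLIP ViT-B/16 features:
F = modality_aware_embedding(torch.randn(1, 512), torch.randn(1, 512), alpha=0.3)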

Previous transfer-based attacking strategies only use a uni-modal target to guide the learning of the image perturbation and lack an explicit semantic update process [54, 47], which leads to coarse-grained semantic alignment and sub-optimal attack performance on VLMs. Therefore, we propose the chain-of-attack learning strategy to enhance adversarial example generation with explicit step-by-step updates in the semantic domain, as shown in the lower left part of Fig. 2 and in a more detailed example in Fig. 3. Specifically, we initialize the image perturbation $\delta$ such that $\delta_{0} \sim {\rm Uniform}(-\epsilon, \epsilon)$. Given the clean image, the adversarial image example $I_{adv}$ can be obtained by adding the perturbation at the pixel level. A publicly accessible pre-trained image-to-text model is utilized to generate the caption $T_{adv}$ for the current adversarial image in each step. As previously mentioned, the modality-aware embedding for the current adversarial example is given by:

F_{adv} = \alpha \cdot E_{v}(I_{adv}) + (1-\alpha) \cdot E_{t}(T_{adv}),     (6)

where $F_{adv}$ is the modality-aware embedding of the current adversarial example. It is noteworthy that at each step a new caption describing the current adversarial image is generated, hence the modality-aware embedding is updated based on both the changed image and text embeddings. We explicitly update the multi-modal semantics and generate the adversarial examples based on their previous semantics, resulting in a step-by-step attacking process (i.e., a chain of attacks). To learn the image perturbation at each step, we propose Targeted Contrastive Matching (TCM), where the cross-modal semantics of the clean samples, target samples, and the current adversarial samples are aligned/diverged in the same latent embedding space. Specifically, TCM maximizes the similarity between the current adversarial example and the target reference example, while minimizing the similarity between the current adversarial example and the original clean example across both the vision and text modalities. The TCM objective $L$ is defined as:

L = {\rm max}(||{\rm sim}(F_{ref}, F_{adv}) - \beta \cdot {\rm sim}(F, F_{adv})|| + \gamma, 0),     (7)

where $\beta$ is a hyperparameter that controls the trade-off between similarity maximization for positive pairs and minimization for negative pairs, and $\gamma$ is the margin hyperparameter that controls the desired separation of the positive pairs and the negative pairs in the learned embedding space.
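The TCM objective can be sketched as follows for a batch of modality-aware embeddings that are assumed to be L2-normalized, so the dot product acts as the cosine similarity; following the appendix pseudocode, the sketch drops the outer norm of Eq. (7) and keeps only the margin-based hinge.

import torch

# Targeted Contrastive Matching: pull the adversarial embedding toward the
# target pair and push it away from the clean pair (cf. Eq. (7)).
def tcm_loss(F_ref: torch.Tensor, F_adv: torch.Tensor, F_clean: torch.Tensor,
             beta: float, gamma: float) -> torch.Tensor:
    sim_pos = (F_ref * F_adv).sum(dim=-1)    # similarity to the target example
    sim_neg = (F_clean * F_adv).sum(dim=-1)  # similarity to the clean example
    return torch.relu(sim_pos - beta * sim_neg + gamma).mean()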

To optimize the image perturbation $\delta$ through the TCM objective $L$, projected gradient descent [33] is adopted and the optimization can be expressed as:

\delta_{t+1} = {\rm Proj}_{||\cdot||_{\infty} \leq \epsilon}(\delta_{t} + \eta \cdot \nabla_{\delta} L(\delta_{t})),     (8)

where ${\rm Proj}(\cdot)$ projects $\delta$ back into the $\epsilon$-ball, $\eta$ is the step size, and $\nabla_{\delta} L(\cdot)$ represents the gradient of the TCM loss.
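A single update of Eq. (8) can be sketched as below; the signed-gradient variant follows the appendix pseudocode, and loss_fn stands for the TCM objective as a function of the perturbation (an assumption made for the example).

import torch

# One PGD ascent step on the perturbation, projected back into the eps-ball.
def pgd_step(delta: torch.Tensor, loss_fn, eta: float, epsilon: float) -> torch.Tensor:
    delta = delta.clone().detach().requires_grad_(True)
    loss = loss_fn(delta)
    loss.backward()
    with torch.no_grad():
        delta = delta + eta * delta.grad.sign()
        delta = torch.clamp(delta, min=-epsilon, max=epsilon)
    return delta.detach()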

VLM Method CLIP Score (↑) / Text Encoder ASR (↑)
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble Target Fool
ViECap [19] Clean image 46.7 44.3 47.7 47.2 35.2 44.2 - -
AttackBard [17] 49.2 46.7 48.7 51.7 36.3 46.5 15.5 25.0
Mix.Attack [47] 49.6 47.0 48.8 52.1 36.7 46.8 11.1 17.4
MF-it [54] 78.0 76.7 78.9 79.6 71.8 77.0 69.3 79.9
MF-ii [54] 76.4 75.3 77.4 78.0 70.1 75.4 76.6 85.8
Ours 82.9 81.9 83.8 84.7 78.2 82.3 98.4 99.5
SmallCap [42] Clean image 50.7 48.6 51.1 52.7 37.5 48.1 - -
AttackBard [17] 53.2 48.4 51.5 56.6 39.2 49.8 6.6 9.2
Mix.Attack [47] 52.9 48.3 51.5 56.4 39.2 49.7 5.8 8.1
MF-it [54] 57.9 54.8 59.1 60.7 46.6 55.8 22.1 26.8
MF-ii [54] 67.3 65.0 68.5 69.8 58.6 65.8 47.1 52.2
Ours 68.6 66.1 70.0 71.1 60.4 67.2 56.8 66.5
Unidiffuser [4] Clean image 41.7 41.5 42.9 44.6 30.5 40.2 - -
AttackBard [17] 52.2 48.6 53.1 56.5 56.5 53.4 8.1 14.4
Mix.Attack [47] 45.3 44.0 47.2 49.2 35.2 44.2 5.9 10.5
MF-it [54] 65.5 63.9 67.8 69.8 61.1 65.6 80.2 95.8
MF-ii [54] 70.9 69.5 72.1 73.3 63.7 70.0 90.0 98.4
Ours 76.1 74.4 77.2 78.5 69.8 75.2 94.2 98.9
LLaVA-7B [29] Clean image 46.8 46.8 48.1 47.7 33.7 44.6 - -
AttackBard [17] 47.9 47.4 48.1 48.5 34.6 45.3 2.0 3.7
Mix.Attack [47] 46.8 47.6 47.6 48.2 34.3 44.9 1.7 3.0
MF-it [54] 46.8 46.9 48.0 47.9 33.9 44.7 3.0 5.6
MF-ii [54] 47.2 46.7 48.2 48.0 34.2 44.9 2.6 4.7
Ours 51.1 49.6 52.0 55.2 35.8 48.7 14.5 28.4
LLaVA-13B [29] Clean image 46.4 46.3 47.9 47.5 33.4 44.3 - -
AttackBard [17] 47.9 47.4 48.1 48.5 34.6 45.3 2.6 4.8
Mix.Attack [47] 46.8 47.6 47.6 48.2 34.3 44.9 0.9 1.5
MF-it [54] 46.6 46.8 48.0 47.8 33.7 44.6 2.7 5.0
MF-ii [54] 47.4 47.2 48.7 48.4 34.4 45.2 3.6 6.9
Ours 48.1 48.0 49.4 49.0 34.6 45.8 12.3 24.3
Table 1: Quantitative performance comparison of transfer-based attacks against VLMs with the state-of-the-art methods. The metrics include CLIP score and our proposed LLM-based attack success rate (ASR). The names of the corresponding image encoders are adopted for different text encoders. The Ensemble column reports the average results of different CLIP text encoders. The best results are in bold.

3.3 LLM-based ASR

Previous works tend to evaluate the robustness of models on response generation tasks with human effort [20, 21], making the evaluation labor-intensive and time-consuming. Some recent works propose using NLP metrics [8] or CLIP [41] scores [54] to measure the matching degree between the generated response and the targeted response, which we argue are neither comprehensive nor straightforward for human users to understand. For example, suppose the CLIP score between the generated text and the target text is 40% before attacking and 45% after attacking. People can only know that the attack increases the similarity by 5% without gaining any insight into whether the attack is successful or not, i.e., is the generated response genuinely closer to the targeted text from a human perspective, or does it merely diverge further from the original clean text while still remaining far from the target text?

To address the above issues, and considering the various evaluation strategies employed in different research, we propose a clear and unified attack success rate computation strategy for automatic evaluation of the robustness of VLMs on response generation tasks such as image captioning. Specifically, as illustrated in Fig. 4, we query an LLM (e.g., GPT-4) to serve as the human judge and determine whether the model is attacked successfully, i.e., whether the generated text is similar to the target reference text. In addition, to ensure a comprehensive evaluation, we also consider the scenario where the model is fooled into generating responses that are unrelated to the original clean text but still not similar to the target text. We further request the LLM to assign scores of 1, 0.5, and 0 for completely successful cases, fooled-only cases, and failed cases, respectively. Step-by-step thinking is utilized for accurate judgment and detailed explanations. For instance, the middle example of Fig. 4 shows a fooled-only case, where the LLM accurately suggests that “the generated text is unrelated to the original text but also does not closely match the target text”, assigning a score of 0.5 while offering detailed human-understandable reasons.

Method CLIP Score (↑) / Text Encoder ASR (↑)
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble Target Fool
Clean image 41.7 41.5 42.9 44.6 30.5 40.2 - -
Baseline 70.9 69.5 72.1 73.3 63.7 70.0 90.0 98.4
+ MAE 72.3 71.8 73.4 74.8 64.4 71.3 90.6 98.7
+ MAE + CoA (w/o TCM) 74.8 73.2 76.0 77.1 68.1 73.8 91.7 98.7
+ MAE + CoA (w/ TCM) 76.1 74.4 77.2 78.5 69.8 75.2 94.2 98.9
(↑5.2) (↑4.9) (↑5.1) (↑5.2) (↑6.1) (↑5.2) (↑4.2) (↑0.5)
Table 2: Ablation study of the proposed method on Unidiffuser. MAE, CoA, and TCM represent the Modality-Aware Embeddings, the Chain of Attack module, and Targeted Contrastive Matching, respectively. The improvements compared to the baseline are highlighted.

The proposed LLM-based ASR can be computed as:

ASR = \frac{1}{N} \sum {\rm JUDGE}(T, T_{adv}, T_{ref}),     (9)

where $N$ is the number of adversarial examples, and the outputs of ${\rm JUDGE}(\cdot)$ are the scores given by the LLM judgment mentioned before.
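A minimal sketch of this evaluation is given below; the judge prompt paraphrases the scoring rule described above, the OpenAI client and model name are assumptions, and the score parsing is intentionally simple.

from openai import OpenAI

client = OpenAI()

# Ask an LLM judge to score one attack: 1 (targeted success), 0.5 (fooled only), 0 (failed).
def judge(clean_text: str, adv_text: str, target_text: str, model: str = "gpt-4") -> float:
    prompt = (
        "You are judging a targeted adversarial attack on image captioning.\n"
        f"Original text: {clean_text}\nGenerated text: {adv_text}\nTarget text: {target_text}\n"
        "Think step by step, then output only the final score on the last line: "
        "1 if the generated text matches the target text, 0.5 if it is unrelated to the "
        "original text but does not match the target text, and 0 otherwise."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    last_line = reply.strip().splitlines()[-1]
    if "0.5" in last_line:
        return 0.5
    return 1.0 if "1" in last_line else 0.0

# ASR over a set of (clean, generated, target) text triples, as in Eq. (9).
def attack_success_rate(triples) -> float:
    scores = [judge(t, a, r) for t, a, r in triples]
    return sum(scores) / max(len(scores), 1)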

4 Experiments

4.1 Experimental Setups

Datasets. The clean images are from the validation images of ImageNet-1K [14]. For target reference text, we follow Zhao et al. [54] and sample a text description for each clean image. We further use GPT-4 [1] to extract the key information of the sampled text description. To simulate the real-world scenario, Stable Diffusion [44] is utilized to generate target images for each target reference text, and MiniGPT-4 [56] is adopted to generate clean descriptions for clean images. More details are in the appendix.

Implementation details. For all performance comparisons, we use consistent pre-trained checkpoints of the victim VLMs [19, 42, 4, 30]. The vision encoder (ViT-B/16) and text encoder of CLIP [41] are adopted as the surrogate model. We use ClipCap [34] as the image-to-text model during the adversarial example generation process. Following the most common setting [6, 54], we set the perturbation budget $\epsilon = 8$ unless otherwise specified to ensure the perturbations are visually imperceptible. The objective is optimized using 100-step PGD [33] with $\eta = 1$. Other hyperparameters are selected by grid search, where we conduct ablation studies with different values (see appendix). Experiments are conducted on an RTX A6000 GPU.

Refer to caption
Figure 5: (a) Visual interpretation of the adversarial examples. $AM$ represents the attention map, which is based on the image-text similarity. The clean text and the target image are generated based on the clean image and the selected target reference text, respectively. (b) The effect of $\epsilon$ on Unidiffuser. The generated captions in red (i.e., $\epsilon \geq 8$) are close to the target text.

4.2 Experimental Results

VLM robustness against adversarial attacks. Tab. 1 shows the evaluation results for VLM robustness against black-box adversarial attacks. The victim VLMs include ViECap [19], SmallCap [42], Unidiffuser [4], and LLaVA-1.5 7B and 13B [29]. All parameters of the victim VLMs are frozen, and the victim VLMs are invoked only once for response generation during inference. For models that require a custom textual instruction (e.g., LLaVA), we use “What is the content of this image?” as the query. Our method CoA consistently outperforms the baselines by a significant margin on both the CLIP score and the proposed LLM-based ASR. Specifically, the CLIP score measures the embedding similarity between the generated response of the victim models and the target text using CLIP text encoders. We evaluate the CLIP score for each victim model with various CLIP text encoders, including the ResNet-based [24] RN-50 and RN-101, and the ViT-based [18] ViT-B/16, ViT-B/32, and ViT-L/14. CoA gains 6.9%, 2.1%, 7.4%, 7.5%, and 1.1% relative performance boosts over the second-best results on the respective victim models, demonstrating that the texts generated using our method are semantically closer to the target text in the text embedding space. To make the evaluation more comprehensive, we also report the performance with the proposed LLM-based ASR. In addition to the targeted ASR introduced in Sec. 3.3, we report the results of fooled cases (the last column of Tab. 1) to evaluate the attacking methods’ ability to fool the victim models into generating unrelated responses, covering both targeted and fooled-only cases. Our method achieves a much better ASR compared to the baselines. For example, CoA achieves 98.4% and 94.2% targeted ASR on ViECap [19] and Unidiffuser [4], respectively, while the second-best results are 76.6% and 90.0%. Furthermore, our method exhibits a stronger capability to fool large VLMs (e.g., LLaVA [29]) into giving wrong responses based on the adversarial examples. From the robustness evaluation results, we find that VLMs with a larger number of parameters are less susceptible to attacks. In particular, large-scale VLMs demonstrate significantly stronger robustness against black-box attacks compared to smaller models. In addition, misleading large-scale VLMs into generating unrelated responses is much easier than eliciting targeted responses, highlighting the vulnerability of large VLMs to non-targeted attacks.

Ablation study. We conduct ablation studies to demonstrate the effectiveness of each proposed module, as shown in Tab.Β 2, where MF-ii [54] is adopted as the baseline. Specifically, all the proposed components improve the attack performance. More ablation results, e.g., hyperparameters and key information extraction, are in the appendix.

4.3 Discussion

Visual interpretation. To help better interpret and understand the adversarial examples, we obtain the attention maps for the clean images, adversarial images, and target images based on the gradients of the similarity between images and texts. Fig. 5 (a) shows a visualization example, where the clean image is a bird image and the target text is “A bunch of people celebrating around a birthday cake.” From the result, we can observe that the attention map correctly highlights the relevant region for the clean image-text pair and the target image-text pair. However, when we compute the attention between the adversarial image and the clean text, the model is misled and highlights some irrelevant regions. Furthermore, from the attention map of the adversarial image-target text pair, we see that some highlighted regions correspond to those of the target image. For example, the attention for the target image highlights the “cake” in the lower right that is mentioned in the target text, and the lower right part of the adversarial image is also highlighted based on its similarity with the target text. These observations indicate that the adversarial images mislead the models and cause them to perceive the adversarial image as the target image to some extent.
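One way to obtain such gradient-based maps is to differentiate the CLIP image-text similarity with respect to the input image and aggregate the gradient magnitude over channels; the exact saliency method used for Fig. 5 (a) is not specified, so the sketch below is only one plausible realization.

import torch

# Saliency map from the gradient of the CLIP image-text similarity w.r.t. the image.
def similarity_attention(clip_model, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    image = image.clone().requires_grad_(True)        # preprocessed image tensor
    img_feat = clip_model.encode_image(image)
    txt_feat = clip_model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = (img_feat * txt_feat).sum()                 # cosine similarity
    sim.backward()
    attn = image.grad.abs().sum(dim=1, keepdim=True)  # aggregate over channels
    attn = attn / (attn.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    return attn  # values in [0, 1] with the same spatial size as the input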

Efficiency & Comparison with query-based strategy. In addition to attack performance, the efficiency of attacks is also an important challenge [23]. Fig. 1 compares different attacking strategies based on their performance (y-axis) and training time per step (x-axis). A query-based method, MF-ii-tt [54], is included; its computational cost is high due to the repeated invocation of the victim models. This strategy is sometimes regarded as a grey-box attacking strategy [50] since it needs to query the victim model and leverage the model outputs. From the comparison results (Fig. 1), our method achieves superior performance and matches or even outperforms the query-based strategy in some cases at a much lower computational cost.

VLM Budget $\epsilon$
8/255 16/255 32/255
ViECap [19] 82.3 82.4 82.5
SmallCap [42] 67.2 67.4 68.0
Unidiffuser [4] 75.2 75.4 75.7
LLaVA-7B [29] 48.7 48.7 49.3
LLaVA-13B [29] 45.8 45.8 46.0
Table 3: Ensemble CLIP scores for different perturbation budgets $\epsilon$.

Effect of perturbation budget. We explore the effect of $\epsilon$ on the attack results and the image quality, both qualitatively and quantitatively. Specifically, as shown in Fig. 5 (b) and Tab. 3, when $\epsilon$ increases, the attack results become better while the image quality decreases (measured by the LPIPS distance [51] between the adversarial and clean images). We conclude that a proper $\epsilon$ is crucial for balancing attack performance with the magnitude of perturbations.
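The image-quality side of this trade-off can be quantified with the LPIPS package; the sketch below assumes image tensors in [0, 1] that are mapped to the [-1, 1] range expected by LPIPS.

import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")

# Perceptual distance between the clean and adversarial images.
def perceptual_distance(clean: torch.Tensor, adv: torch.Tensor) -> float:
    to_lpips = lambda x: x * 2.0 - 1.0  # map [0, 1] pixels to [-1, 1]
    with torch.no_grad():
        return loss_fn(to_lpips(clean), to_lpips(adv)).item()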

5 Conclusion

In this study, we evaluate the robustness of VLMs against black-box adversarial attacks and highlight the vulnerability of existing models. A novel transfer-based targeted attacking strategy, namely Chain of Attack, is proposed to enhance the generation of adversarial examples through a series of explicit intermediate steps based on multi-modal semantics, thereby improving attack performance. Moreover, an LLM-based ASR computation strategy is introduced for more comprehensive robustness evaluations in response generation tasks, while offering human-understandable explanations. We hope this study serves as a reference for safety considerations in the development of future vision-language models, and facilitates more trustworthy model advancements and evaluations.

References

  • Achiam etΒ al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaΒ Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etΒ al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alayrac etΒ al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etΒ al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Bailey etΒ al. [2023] Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
  • Bao etΒ al. [2023] Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023.
  • Bommasani etΒ al. [2021] Rishi Bommasani, DrewΒ A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, MichaelΒ S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, etΒ al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Carlini etΒ al. [2019] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
  • Carlini etΒ al. [2024] Nicholas Carlini, Milad Nasr, ChristopherΒ A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang WeiΒ W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024.
  • Chen etΒ al. [2017] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. arXiv preprint arXiv:1712.02051, 2017.
  • Chen etΒ al. [2023] Huanran Chen, Yichi Zhang, Yinpeng Dong, Xiao Yang, Hang Su, and Jun Zhu. Rethinking model ensemble in transfer-based adversarial attacks. arXiv preprint arXiv:2303.09105, 2023.
  • Chen etΒ al. [2022] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022.
  • Chen etΒ al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr DollΓ‘r, and CΒ Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Chiang etΒ al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephΒ E Gonzalez, etΒ al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023.
  • Cui etΒ al. [2024] Xuanming Cui, Alejandro Aparcedo, YoungΒ Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24625–24634, 2024.
  • Deng etΒ al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dong etΒ al. [2018] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018.
  • Dong etΒ al. [2021] Yinpeng Dong, Shuyu Cheng, Tianyu Pang, Hang Su, and Jun Zhu. Query-efficient black-box adversarial attacks guided by a transfer-based prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9536–9548, 2021.
  • Dong etΒ al. [2023] Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.
  • Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fei etΒ al. [2023] Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, and Feng Zheng. Transferable decoding with visual entities for zero-shot image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3136–3146, 2023.
  • Fu etΒ al. [2023] Xiaohan Fu, Zihan Wang, Shuheng Li, RajeshΒ K Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Misusing tools in large language models with visual adversarial examples. arXiv preprint arXiv:2310.03185, 2023.
  • Gong etΒ al. [2023] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. Figstep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.
  • Goodfellow [2014] IanΒ J Goodfellow. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Guo etΒ al. [2019] Chuan Guo, Jacob Gardner, Yurong You, AndrewΒ Gordon Wilson, and Kilian Weinberger. Simple black-box adversarial attacks. In International conference on machine learning, pages 2484–2493. PMLR, 2019.
  • He etΒ al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang etΒ al. [2019] Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim. Enhancing adversarial example transferability with an intermediate level attack. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4733–4742, 2019.
  • Ilyas etΒ al. [2018] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International conference on machine learning, pages 2137–2146. PMLR, 2018.
  • Jin etΒ al. [2024] Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599, 2024.
  • Li etΒ al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • Liu etΒ al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and YongΒ Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024a.
  • Liu etΒ al. [2024b] Haotian Liu, Chunyuan Li, Qingyang Wu, and YongΒ Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
  • Liu etΒ al. [2016] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
  • Liu etΒ al. [2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In Proceedings of 5th International Conference on Learning Representations, 2017.
  • Madry [2017] Aleksander Madry. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mokady etΒ al. [2021] Ron Mokady, Amir Hertz, and AmitΒ H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
  • Nguyen etΒ al. [2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
  • Papernot etΒ al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, ZΒ Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
  • Perez etΒ al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
  • Qi etΒ al. [2023] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213, 2023.
  • Qin etΒ al. [2022] Zeyu Qin, Yanbo Fan, Yi Liu, Li Shen, Yong Zhang, Jue Wang, and Baoyuan Wu. Boosting the transferability of adversarial attacks with reverse adversarial perturbation. Advances in neural information processing systems, 35:29845–29858, 2022.
  • Radford etΒ al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, etΒ al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford etΒ al. [2021] Alec Radford, JongΒ Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etΒ al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramos etΒ al. [2023] Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva. Smallcap: lightweight image captioning prompted with retrieval augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2840–2849, 2023.
  • Rando etΒ al. [2022] Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian TramΓ¨r. Red-teaming the stable diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
  • Rombach etΒ al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and BjΓΆrn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Touvron etΒ al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, TimothΓ©e Lacroix, Baptiste RoziΓ¨re, Naman Goyal, Eric Hambro, Faisal Azhar, etΒ al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron etΒ al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etΒ al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Tu etΒ al. [2025] Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this image a safety evaluation benchmark for vision llms. In Computer Vision – ECCV 2024, pages 37–55, Cham, 2025. Springer Nature Switzerland.
  • Wang etΒ al. [2023] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, etΒ al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In NeurIPS, 2023.
  • Xu etΒ al. [2019] Yan Xu, Baoyuan Wu, Fumin Shen, Yanbo Fan, Yong Zhang, HengΒ Tao Shen, and Wei Liu. Exact adversarial attack to image captioning via structured output learning with latent variables. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4135–4144, 2019.
  • Zhang etΒ al. [2024] Chiyu Zhang, Xiaogang Xu, Jiafei Wu, Zhe Liu, and Lu Zhou. Adversarial attacks of vision tasks in the past 10 years: A survey. arXiv preprint arXiv:2410.23687, 2024.
  • Zhang etΒ al. [2018] Richard Zhang, Phillip Isola, AlexeiΒ A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhang etΒ al. [2023] Yutong Zhang, Yao Li, Yin Li, and Zhichang Guo. A review of adversarial attacks in computer vision. arXiv preprint arXiv:2308.07673, 2023.
  • Zhao etΒ al. [2023] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.
  • Zhao etΒ al. [2024] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-ManΒ Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou etΒ al. [2018] Wen Zhou, Xin Hou, Yongjun Chen, Mengyun Tang, Xiangqi Huang, Xiang Gan, and Yong Yang. Transferable adversarial perturbations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
  • Zhu etΒ al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

In this supplementary material, we present more details about the data and implementation, including more data examples, the algorithmic format, and the core code of the proposed Chain of Attack. Furthermore, we report and analyze more detailed experimental, ablation, and visualization results, including the ablation studies on hyperparameters, the experiments on the VQA task, and more examples and results of the proposed CoA and LLM-based ASR.

Data and Implementation Details

More Data Examples

Fig. 6 shows some examples of the data used in this paper. Specifically, as mentioned in the main paper, the clean images and the target texts are from ImageNet-1k [14] and MS-COCO [11], respectively. To obtain the corresponding clean texts and the target images, we adopt GPT-4 [1] and Stable Diffusion [44] to generate high-quality texts and images, respectively. These clean and target image-text pairs are used to compute the modality-aware embeddings and serve as the references in Targeted Contrastive Matching to guide the learning of perturbations.

Refer to caption
Figure 6: Examples of the used clean images, clean texts, target texts, and target images.

Chain of Attack Algorithm

In addition to the method illustration in the main paper, the algorithmic format of the proposed Chain of Attack method is shown in Algorithm 1.

Input: The clean image $I$, clean text $T$, targeted reference text $T_{ref}$, generated target image $I_{ref}$, surrogate image encoder $E_{v}(\cdot)$ and text encoder $E_{t}(\cdot)$, modality-balancing hyperparameter $\alpha$, positive-negative balancing hyperparameter $\beta$, margin hyperparameter $\gamma$, the step size of PGD $\eta$.
Output: The adversarial example $I_{adv}$.
Initialization:
Adversarial image $I_{adv} \leftarrow I$, PGD step number $pgd\_step$, $\epsilon \leftarrow 8$, $\delta \sim {\rm Uniform}(-\epsilon, \epsilon)$;
# Calculation of modality-aware embeddings (MAE).
$F \leftarrow \alpha \cdot E_{v}(I) + (1-\alpha) \cdot E_{t}(T)$;
$F_{ref} \leftarrow \alpha \cdot E_{v}(I_{ref}) + (1-\alpha) \cdot E_{t}(T_{ref})$;
# Update process of Chain of Attack.
$t \leftarrow 1$;
while $t \leq pgd\_step$ do
      $I_{adv} \leftarrow I_{adv} + \delta_{t}$;
      # The current adversarial text and MAE of each step.
      $T_{adv} \leftarrow M_{I2T}(I_{adv})$;
      $F_{adv} \leftarrow \alpha \cdot E_{v}(I_{adv}) + (1-\alpha) \cdot E_{t}(T_{adv})$;
      # Objective of Targeted Contrastive Matching.
      $L \leftarrow {\rm max}(||F_{ref}^{T} F_{adv} - \beta \cdot F^{T} F_{adv}|| + \gamma, 0)$;
      # Update the perturbation.
      $\delta_{t+1} \leftarrow {\rm Proj}_{||\cdot||_{\infty} \leq \epsilon}(\delta_{t} + \eta \cdot \nabla_{\delta} L(\delta_{t}))$;
      $t \leftarrow t + 1$;
end while
Algorithm 1: Chain of Attack

Core Code

We present the core pseudocode in the section “PyTorch-like Pseudocode for the Core of an Implementation of Chain of Attack” of this supplementary material.

More Experimental Results

Detailed Ablation Results with Various Hyperparameters

To explore the effects of the values of hyperparameters for our attack strategy, we conduct extensive ablation studies.

The ablation results for the modality-balancing hyperparameter $\alpha$ are reported in Tab. 4. Note that a smaller $\alpha$ means a larger weight for the text modality. From the results, we can observe that the text modality is more effective for attacking some victim VLMs (e.g., ViECap [19]). However, in most cases the attack performance benefits from both modalities (e.g., on SmallCap [42], Unidiffuser [4], and LLaVA [29]). This observation demonstrates the effectiveness of our proposed modality-aware embeddings that capture semantics from both domains. We suggest that a proper $\alpha$ can help achieve better results by fusing the visual and textual features.

In Tab. 5, we report the results of different combinations of the hyperparameters $\beta$ and $\gamma$, where $\beta$ controls the trade-off between similarity maximization for positive pairs and minimization for negative pairs, and $\gamma$ is the margin hyperparameter that controls the desired separation of the positive pairs and the negative pairs in the learned embedding space, as mentioned in the main paper. Note that a larger $\beta$ indicates more focus on the difference between the adversarial examples and the original clean examples. Since our task is targeted attacking, we set $0 < \beta < 1$. From our experiments, we find that a larger $\gamma$ may degrade the performance, hence we suggest setting the margin hyperparameter to less than 0.5. Some combinations of hyperparameters with promising performance are reported in Tab. 5.

Detailed Results of the Effect of Perturbation Budget

In Sect. 4.3 of the main paper, we discuss the effect of the perturbation budget $\epsilon$ with only the ensemble scores. We report the complete results in Tab. 6, from which we can see that larger perturbation budgets improve the attack performance. However, as mentioned in Sect. 4.3 of the main paper (also see Fig. 5 (b) and Tab. 3 in the main paper), as the perturbation budget becomes larger, the image quality decreases. We suggest a proper $\epsilon$ value (e.g., 8) to balance the trade-off.

VLM $\alpha$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
0.9 77.6 76.4 78.6 79.3 71.6 76.7
0.7 79.8 80.4 81.2 81.5 74.4 79.0
0.5 81.2 80.4 82.2 83.0 76.2 80.6
0.3 82.7 81.7 83.6 84.4 78.1 82.1
0.1 82.9 81.9 83.8 84.7 78.2 82.3
SmallCap [42] 0.9 68.4 65.9 69.4 70.7 59.9 66.7
0.7 68.6 66.1 70.0 71.1 60.4 67.2
0.5 68.2 65.7 69.4 70.7 59.8 66.8
0.3 65.5 62.6 66.7 68.1 56.3 63.8
0.1 61.1 58.2 62.2 63.7 50.9 59.2
Unidiffuser [4] 0.9 73.6 71.9 74.7 75.8 66.7 72.5
0.7 75.1 73.3 76.1 77.2 68.5 74.0
0.5 75.8 74.3 76.9 78.1 69.4 74.9
0.3 76.1 74.4 77.2 78.5 69.8 75.2
0.1 72.1 70.5 73.5 75.1 64.8 71.2
LLaVA-7B [29] 0.9 47.7 47.3 48.9 48.5 34.3 45.4
0.7 48.2 47.8 49.1 48.7 34.7 45.7
0.5 51.1 49.6 52.0 55.2 35.8 48.7
0.3 48.8 48.3 49.6 49.4 35.1 46.2
0.1 47.6 47.4 48.9 48.5 34.5 45.4
Table 4: Ablation results of the modality-balancing hyperparameter $\alpha$ of the modality-aware embeddings for controlling the trade-off between vision and text modalities. A smaller $\alpha$ indicates a larger weight for the text modality. The best ensemble scores are in bold.

Effect of PGD Steps

Following the setting of previous methods [54], we adopt projected gradient descent (PGD) [33] with 100 steps, as mentioned in the main paper. Additionally, we report the results with fewer PGD steps in Tab. 7. The results show that fewer PGD steps may lead to underfitting, and PGD with 100 steps achieves the best attack performance.

Visual Question Answering Task

To further explore the potential applications/risks of the attacking strategy, we implement a multi-round visual question answering (VQA) task using LLaVA-7B [29], as shown in Fig. 7. Two successful targeted attack examples are displayed. Specifically, in example 1, the original clean image shows part of the body of a large marine animal. We query LLaVA with the questions “How do you think of this image?” and “Could it be a marine creature?”. LLaVA identifies it as a marine animal and gives correct answers. However, when we input the adversarial image generated by our method, the victim model gives the wrong answer and identifies it as a cat, which is the content of the target examples. Example 2 leads to the same conclusion. The results demonstrate that our attacking strategy successfully misleads the victim model into generating target responses.

More Case Studies of the Proposed ASR

In addition to the results shown in Fig. 4 of the main paper, more evaluation examples of the proposed LLM-based ASR are shown in Fig. 10 in this supplementary material.

More Results of the Attacking Chain

In addition to Fig. 2 and Fig. 3 of the main paper, we visualize more examples of the intermediate steps of CoA and the results of the victim models, as shown in Fig. 8. Specifically, the left and middle parts of Fig. 8 show the update process of the adversarial examples based on both visual and textual semantics. The right part shows the generation results of the victim models given the final adversarial examples. For example, in the third case, the semantics of the image change from “A group of chickens of various colors foraging in a grassy outdoor enclosure” to the target semantics “A close up of a vase with flowers”, and the CLIP score between the intermediate adversarial text and the target text increases along the chain. Some victim models (e.g., ViECap, Unidiffuser) generate almost the same response as the target text (e.g., with CLIP scores of 99.6% and 100%), demonstrating the effectiveness of the generated adversarial examples.

Sensitivity of Adversarial Examples to Gaussian Noise and the Degradation to the Original Clean Semantics

To explore the sensitivity of the generated adversarial examples to noise (e.g., Gaussian noise), we show the results of adding noise of different scales to the adversarial examples in Fig. 9. When the standard deviation of the noise $std_{G}$ is relatively small, the victim models still output the target responses. However, as $std_{G}$ becomes large, the victim models tend to generate responses that are more similar to the original clean text. The captions of some intermediate examples are a combination of the original clean text and the target reference text. This result interprets the process of adding perturbations to the adversarial images and shows that large noise can undermine the effectiveness of adversarial examples.
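The noise-injection step of this experiment can be sketched as follows; the function name and the [0, 1] pixel range are illustrative assumptions.

import torch

# Add zero-mean Gaussian noise with standard deviation std_g to an adversarial image.
def perturb_with_gaussian(adv_img: torch.Tensor, std_g: float) -> torch.Tensor:
    noisy = adv_img + std_g * torch.randn_like(adv_img)
    return torch.clamp(noisy, 0.0, 1.0)

# e.g., sweep std_g over increasing values and record the victim caption for
# perturb_with_gaussian(adv_img, std_g) at each noise level.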

(More figures and tables are on the following pages.)

VLM $\beta$ $\gamma$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
0.9 0.1 78.4 77.3 79.3 80.0 72.5 77.5
0.8 0.2 77.1 76.1 78.3 78.9 71.0 76.3
0.7 0.3 77.6 76.4 78.6 79.3 71.6 76.7
0.6 0.4 77.3 76.1 78.4 79.1 71.1 76.4
SmallCap [42] 0.9 0.1 68.4 66.5 69.8 71.0 60.3 67.2
0.8 0.2 67.2 65.0 68.5 69.9 58.8 65.9
0.7 0.3 68.4 65.9 69.4 70.7 59.9 66.9
0.6 0.4 67.7 65.4 69.0 70.3 59.2 66.3
Unidiffuser [4] 0.9 0.1 72.9 71.7 74.3 75.4 66.2 72.1
0.8 0.2 73.3 71.6 74.5 75.6 66.3 72.3
0.7 0.3 73.6 71.9 74.7 75.8 66.7 72.5
0.6 0.4 73.2 71.5 74.3 75.4 66.2 72.1
LLaVA-7B [29] 0.9 0.1 47.8 47.5 49.0 48.7 34.4 45.5
0.8 0.2 47.7 47.4 48.9 48.5 34.4 45.4
0.7 0.3 47.4 47.3 48.6 48.2 34.2 45.1
0.6 0.4 47.7 47.4 48.9 48.4 34.4 45.4
Table 5: Results of some different combinations of the hyperparameters $\beta$ and $\gamma$ for Targeted Contrastive Matching.
VLM $\epsilon$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
8/255 82.9 81.9 83.8 84.7 78.2 82.3
16/255 83.1 82.0 83.9 84.8 78.4 84.2
32/255 83.1 82.2 83.9 84.8 78.4 82.5
SmallCap [42] 8/255 68.6 66.1 70.0 71.1 60.4 67.2
16/255 68.9 66.3 70.2 71.3 60.5 67.4
32/255 70.2 66.8 70.4 71.8 60.9 68.0
Unidiffuser [4] 8/255 76.1 74.4 77.2 78.5 69.8 75.2
16/255 76.3 74.8 77.4 78.6 70.1 75.4
32/255 76.7 75.1 77.7 78.9 70.3 75.7
LLaVA-7B [29] 8/255 51.1 49.6 52.0 55.2 35.8 48.7
16/255 51.1 49.6 52.0 55.3 35.8 48.7
32/255 51.7 50.1 52.5 55.9 36.2 49.3
LLaVA-13B [29] 8/255 48.1 48.0 49.4 49.0 34.6 45.8
16/255 48.1 48.0 49.4 49.0 34.6 45.8
32/255 48.2 48.1 49.4 49.2 34.9 46.0
Table 6: The detailed results of the effect of perturbation budgets $\epsilon$.
Method CLIP Score (↑) / Text Encoder
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
Clean image 41.7 41.5 42.9 44.6 30.5 40.2
CoA w/ PGD-10 63.1 61.5 64.5 66.0 53.9 61.8
CoA w/ PGD-50 74.5 73.0 75.8 77.2 68.0 73.7
CoA w/ PGD-100 76.1 74.4 77.2 78.5 69.8 75.2
Table 7: The effect of the number of PGD [33] steps on Unidiffuser [4]. CoA w/ PGD-10 means our method CoA using PGD with 10 steps. The best results are highlighted in bold.
Refer to caption
Figure 7: Results of LLaVA-7B [29] on the VQA task. The left part shows the multi-round VQA for the original clean examples, while the right part shows the results of using adversarial examples generated by CoA. The sentences in the chat boxes with a smiling face are the queries of human users, while the sentences in the purple chat boxes with a robot icon are the answers of the victim model. The clean texts, target images, and target texts used are also shown at the top of each example.
Refer to caption
Figure 8: More results of the chain of attack. We visualize the adversarial images and their corresponding texts at some intermediate chain steps. The generation results of victim models given the generated adversarial examples are shown in the right part of this figure.
Refer to caption
Figure 9: Results for the sensitivity of adversarial examples to Gaussian noise and the degradation to the original clean semantics. $std_{G}$ represents the standard deviation of the Gaussian noise added to the adversarial image. The victim model used to generate the captions in these examples is Unidiffuser [4]. The clean and target image-text pairs are shown on the left part of the figure, while the adversarial images with different Gaussian noise levels are on the right. Captions in red indicate degraded captions.
Refer to caption
Figure 10: More evaluation examples and results of the proposed LLM-based ASR. From left to right, the examples depict a completely successful attack case, a fooled-only case, and a failed attack case, respectively. The output score for each case is at the bottom.

PyTorch-like Pseudocode for the Core of an Implementation of Chain of Attack

import torch
import clip

# Given:
# cle_img        - clean image tensor
# cle_img_feat   - clean image features
# tgt_txt_feat   - target text features
# cle_txt_feat   - (generated) clean text features
# tgt_img_feat   - (generated) target image features
# alpha, beta, gamma - hyperparameters (modality balance, trade-off, margin)
# epsilon, eta   - perturbation budget and PGD step size
# pgd_steps      - number of PGD steps
# surrogate model (CLIP) with preprocess, caption model, and device

# Modality-aware embeddings for the clean and target image-text pairs
cle_mae = alpha * cle_img_feat + (1 - alpha) * cle_txt_feat
cle_mae = cle_mae / cle_mae.norm(dim=1, keepdim=True)
tgt_mae = alpha * tgt_img_feat + (1 - alpha) * tgt_txt_feat
tgt_mae = tgt_mae / tgt_mae.norm(dim=1, keepdim=True)

# Adversarial example generation with Chain of Attack
delta = torch.zeros_like(cle_img, requires_grad=True)
for j in range(pgd_steps):
    adv_img = cle_img + delta

    # generate a caption for the current adversarial image
    cur_caption = caption_model(adv_img)

    # encode the current adversarial image and its caption with the surrogate CLIP
    adv_img_feat = clip_model.encode_image(preprocess(adv_img))
    adv_img_feat = adv_img_feat / adv_img_feat.norm(dim=1, keepdim=True)
    cur_adv_text = clip.tokenize(cur_caption).to(device)
    cur_txt_feat = clip_model.encode_text(cur_adv_text)
    cur_txt_feat = cur_txt_feat / cur_txt_feat.norm(dim=1, keepdim=True)

    # modality-aware embedding of the current adversarial example
    cur_adv_mae = alpha * adv_img_feat + (1 - alpha) * cur_txt_feat
    cur_adv_mae = cur_adv_mae / cur_adv_mae.norm(dim=1, keepdim=True)

    # Targeted Contrastive Matching objective (cf. Eq. (7))
    cle_sim = torch.mean(torch.sum(cur_adv_mae * cle_mae, dim=1))
    tgt_sim = torch.mean(torch.sum(cur_adv_mae * tgt_mae, dim=1))
    loss = torch.relu(tgt_sim - beta * cle_sim + gamma)
    loss.backward()

    # signed-gradient PGD step, projected back into the epsilon-ball
    grad = delta.grad.detach()
    d = torch.clamp(delta + eta * torch.sign(grad), min=-epsilon, max=epsilon)
    delta.data = d
    delta.grad.zero_()