
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie*, Yequan Bie*, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen✉, Kani Chen✉
The Hong Kong University of Science and Technology
{pxieaf, ybie}@connect.ust.hk
Abstract

Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the practical applications of vision-language models become increasingly widespread, their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Therefore, evaluating the robustness of open-source VLMs against adversarial attacks has garnered growing attention, with transfer-based attacks serving as a representative black-box attacking strategy. However, most existing transfer-based attacks neglect the importance of the semantic correlations between the vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on multi-modal semantic updates using a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. A unified attack success rate computing method is further proposed for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our attacking strategy can effectively mislead models into generating targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in our paper provides insight into the vulnerabilities of VLMs and offers a reference for the safety considerations of future model development.

*  Equal contributions.  ✉  Corresponding authors.
Refer to caption
Figure 1: Comparison of the proposed CoA with other attacking strategies. The CLIP score results of Unidiffuser [4] are reported. Our method shows both superior performance and efficiency.

1 Introduction

Vision-language models (VLMs) have achieved significant progress over the last few years and demonstrated promising performance in image and natural language understanding, reasoning, and generation [2, 30, 1, 45, 46]. Their powerful multi-modal capability in different tasks, including visual question answering and image captioning [34, 28, 10], has led to these models being widely deployed in real-world applications. However, as more VLMs are open-sourced or used for commercial purposes, security and safety issues raise concerns, e.g., VLMs could be attacked and exploited to generate fake and toxic content, which remains an inevitable challenge [5, 37, 48]. Moreover, compared to language models, VLMs suffer more from adversarial attacks since the vision modality is highly susceptible to visually inconspicuous adversarial perturbations due to the continuous and high-dimensional nature of images [22, 7, 38]. This vulnerability could be exploited by adversaries to mislead VLMs by circumventing safety checkers [43, 53], injecting malicious code, or gaining unauthorized API access, leading to severe security risks in practical applications of these models [27, 54].

To explore the vulnerability of vision-language models, recent research proposes evaluating their robustness to adversarial attacks [54, 17, 47, 3, 7, 13], with some work focusing on transfer-based adversarial attacks [17, 47], i.e., using adversarial examples generated via white-box surrogate models to mislead the black-box victim models [31, 55]. However, most existing transfer-based adversarial attack strategies only emphasize visual features when crafting adversarial examples, leveraging text embeddings only coarsely and neglecting the semantic correspondences between the vision and text modalities. Moreover, current evaluation methods compute the attack success rate (ASR) for response generation tasks in different ways, lacking a clear and unified ASR calculation strategy. To address the above challenges, we propose a novel transfer-based adversarial attacking approach, namely Chain of Attack (CoA), which enhances the adversarial example generation process based on multi-modal semantics using a series of intermediate attacking steps. We further establish a unified and comprehensive ASR computing method for targeted and untargeted evasion based on large language models (LLMs), holding the potential to facilitate future research and benchmarking by providing a fair and straightforward evaluation strategy for text generation tasks with human-understandable explanations. From the evaluation conducted in this study, we find that the considered VLMs are generally vulnerable to adversarial visual attacks even without knowledge of the victim models. Models with a larger number of parameters are less susceptible to targeted attacks, while they can still be fooled to some extent. Our proposed attacking strategy further improves attack performance through perturbations with richer semantics compared to existing black-box attack methods. We hope the evaluation and the attacking strategy illustrated in this paper can encourage the future development of more trustworthy VLMs and their safety evaluations.

We summarize the main contributions of this paper as follows: (i) We propose a new transfer-based targeted attacking framework, Chain of Attack. It leverages an explicit step-by-step semantic update process to enhance the generation of adversarial examples, thereby improving attack quality and success rate. (ii) We establish a unified and comprehensive automatic attack success rate computing strategy based on LLMs. (iii) Evaluations of security and robustness for various VLMs are conducted using black-box attacks, demonstrating the effectiveness of the proposed method and highlighting the vulnerabilities of existing VLMs. More discussions about image perturbations are also included.

2 Related Work

2.1 Vision-Language Models and Robustness

Pre-trained vision-language models are broadly utilized for various vision and natural language tasks, including image captioning [19, 42] and visual question answering [30, 10]. For example, ViECap [19] incorporates entity-aware hard prompts to guide an LLM’s (i.e., GPT-2 [40]) attention toward the visual entities for coherent caption generation. LLaVA [30, 29] adopts a projection layer to connect a vision encoder and an LLM (i.e., Vicuna [12]) for general-purpose visual and language understanding. To alleviate the security issues of VLMs, recent research evaluates model robustness through adversarial attacks [17, 54, 13, 47]. For image captioning tasks, much previous work [8, 49] focuses on white-box and untargeted attacks against VLMs with traditional architectures (e.g., CNN- and RNN-based) and requires human effort for robustness evaluation. Zhao et al. [54] propose using the CLIP score [41] for automatic evaluation. However, we argue that more practical scenarios and metrics are necessary for evaluating the robustness of VLMs and facilitating future work. Therefore, in this work, we assess the adversarial robustness of VLMs with advanced architectures against the more difficult targeted evasion under both embedding-based and LLM-based metrics.

2.2 Adversarial Attack

Adversarial attacks can be categorized into white-box, grey-box, and black-box attacks in terms of the attacker’s capabilities and knowledge [50]. Query-based attacks [16, 26, 35, 36] can sometimes be regarded as grey-box rather than black-box attacks, since the attacker can extract some information directly from the victim models rather than being entirely uninformed. Most query-based methods conduct gradient estimation by repeatedly querying the victim models, which is typically time-consuming [54]. In contrast, transfer-based attacks [32, 39, 15, 9, 25] are black-box attacks that generate adversarial examples using surrogate models without gaining any direct knowledge from the victim models [50]. In this paper, we focus on transfer-based image attacks under the most practical and high-stakes scenario, i.e., the black-box setting without any knowledge of the victim models. Our method achieves superior attack performance, matching or even outperforming query-based methods in some cases at a much lower computational cost, as shown in Fig. 1.

Refer to caption
Figure 2: The pipeline of the Chain of Attack (CoA) framework. (a) Our framework proposes using modality-aware embeddings to capture the semantic correspondence between images and texts. To enhance the adversarial transferability, we use a chain of attacks that explicitly updates the adversarial examples based on their previous multi-modal semantics in a step-by-step manner. A Targeted Contrastive Matching objective is further proposed to align and differentiate the semantics among clean, adversarial, and target reference examples. (b) Targeted response generation is conducted during inference, where the victim models give responses based on the adversarial examples. We further introduce a unified ASR computing strategy for automatic and comprehensive robustness evaluation of VLMs in response generation.

3 Method

3.1 Preliminaries

Problem definition. Let $M$ be the target victim vision-language model that takes an image $I$ as input and outputs a prediction. Adversarial image attacks modify the visual input with a perturbation $\delta$ to generate an adversarial example $I_{adv}$ that achieves different attack goals. The paradigm can be formulated as:

T^{*} = M(I_{adv}), \quad I_{adv} = atk(I, \delta),     (1)

where $T^{*}$ is the desired output of the attack, and $atk(\cdot)$ represents the attack function learned for effective input perturbation. Specifically, the form of $T^{*}$ depends on the task, e.g., $T^{*}$ is a label for the classification task and a textual response for multimodal tasks such as image captioning. The adversarial example generation should ensure that the learned perturbation $\delta^{*}$ is imperceptible to humans, which can be implemented using box constraints to limit the perturbation size in pixels:

||I - I_{adv}||_{\infty} = ||\delta^{*}||_{\infty} \leq \epsilon,     (2)

where $\epsilon$ is a hyperparameter representing the perturbation budget.
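For illustration, the box constraint of Eq. (2) can be enforced by clipping the perturbation before it is added to the clean image. The following is a minimal PyTorch sketch; the tensor names and the [0, 1] pixel range are assumptions for the example, not part of the released implementation.

import torch

# Clip the perturbation so that every pixel of the adversarial image stays
# within epsilon of the clean image (L_inf box constraint of Eq. (2)).
def project_linf(delta: torch.Tensor, epsilon: float) -> torch.Tensor:
    return torch.clamp(delta, min=-epsilon, max=epsilon)

clean = torch.rand(1, 3, 224, 224)          # clean image I with pixels in [0, 1]
delta = 0.1 * torch.randn_like(clean)       # candidate perturbation
delta = project_linf(delta, epsilon=8 / 255)
adv = torch.clamp(clean + delta, 0.0, 1.0)  # adversarial image I_adv
assert (adv - clean).abs().max() <= 8 / 255 + 1e-6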

Threat model. A threat model defines the conditions under which a defense is designed to be secure and the precise security guarantees provided [6]. We specify the threat model for adversarial attacks in our method, which comprises two components: (i) Attacker capabilities/knowledge refers to the extent of the adversary’s knowledge. Unlike traditional taxonomies [6, 52], Zhang et al. [50] propose a more fine-grained categorization, including white-box, grey-box, and black-box victim model access. Our method focuses on adversarial transferability, which assumes only black-box access without knowledge of the victim models. (ii) Attack goals indicate the objectives that an adversary aims to achieve. They can typically be categorized into two classes: untargeted goals that only aim to fool the victim model into generating wrong responses, and targeted goals that require the model to give responses matching the target. Our proposed attacking strategy focuses on targeted goals; specifically, given a target reference caption $T_{ref}$, the goal can be expressed as follows:

\delta^{*} = {\rm argmax}_{\delta}\, {\rm sim}(T^{*}, T_{ref}),     (3)

where ${\rm sim}(\cdot)$ denotes the similarity measure.
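One possible instantiation of sim(·), in line with the CLIP-score evaluation used later in the paper, is the cosine similarity between CLIP text embeddings of the generated response and the target caption. The sketch below assumes the open-source CLIP package; it is an illustration rather than the exact measure of the released code.

import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Cosine similarity between the CLIP text embeddings of two captions.
def text_sim(generated: str, target: str) -> float:
    tokens = clip.tokenize([generated, target], truncate=True).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()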

Refer to caption
Figure 3: Illustration of the attacking chain. Given the modality-aware embeddings of clean examples and target examples, the adversarial examples including the image perturbations and the corresponding textual information are explicitly updated in a step-by-step manner with the guidance of Targeted Contrastive Matching. This Chain of Attack enhances the adversarial example generation while providing a clear and human-understandable β€œevolution” process, e.g., from β€œA bird in the park” to β€œTwo young boys playing baseball on a field”.
Refer to caption
Figure 4: Examples of the proposed LLM-based attack success rate evaluation. From left to right, the examples depict a completely successful attack case, a fooled-only case, and a failed attack case, respectively. The output score for each case is at the bottom.

3.2 Chain of Attack framework

An overview of our proposed attacking framework, Chain of Attack (CoA), is illustrated in Fig. 2. As a transfer-based attacking strategy with only black-box victim model access, a surrogate vision-language model (e.g., CLIP [41]) is adopted to help craft adversarial examples, which are then input to the victim model to obtain the attacked response. For targeted evasion, we randomly sample a target reference text $T_{ref}$ from MS-COCO captions [11] for each clean input image $I$ [54]. To craft adversarial examples that can more effectively influence the victim model, we propose leveraging the semantic correspondences between the image and text modalities, thereby enriching the semantics of the generated adversarial examples. Specifically, we first obtain the corresponding clean text $T$ and target image $I_{ref}$ for the clean image and the target text, respectively:

T = M_{I2T}(I), \quad I_{ref} = M_{T2I}(T_{ref}),     (4)

where $M_{I2T}(\cdot)$ and $M_{T2I}(\cdot)$ represent a publicly accessible pre-trained image-to-text model and text-to-image model (distinct from the victim models), respectively. We observe that the sampled target reference often contains redundant information that is not directly associated with the visual content of the corresponding image, making it hard for the model to learn the semantic relationships between images and texts. To alleviate this issue, we propose querying a large language model to extract the key visual information from the original target texts. For example, as shown in Fig. 2, the original target text $T_{ref}$ is “The little girl is taking tennis lesson to learn how to play.” We query an LLM (e.g., GPT-4) with the prompt “Extract the keywords/information from the following sentence (save verbs and objects): {text}.” and obtain the refined target reference text: “A little girl taking tennis lesson.” It is noteworthy that the generation of the clean text, target image, and refined target text can be done during data pre-processing before training.
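A possible way to implement this key-information extraction step is to send the quoted prompt to a chat-style LLM; the sketch below uses the OpenAI Python client as an assumed interface, and any comparable LLM could be substituted.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

# Query an LLM with the prompt quoted above to obtain the refined target text.
def refine_target_text(text: str, model: str = "gpt-4") -> str:
    prompt = ("Extract the keywords/information from the following sentence "
              f"(save verbs and objects): {text}")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# e.g., "The little girl is taking tennis lesson to learn how to play."
# -> "A little girl taking tennis lesson."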

We use modality fusion of embeddings to capture the semantic correspondence between images and texts. The modality fusion for the clean and target image-text pairs can be achieved by the following calculations:

F = \alpha \cdot E_{v}(I) + (1-\alpha) \cdot E_{t}(T), \quad F_{ref} = \alpha \cdot E_{v}(I_{ref}) + (1-\alpha) \cdot E_{t}(T_{ref}),     (5)

where $E_{v}(\cdot)$ and $E_{t}(\cdot)$ are the surrogate image and text encoders, $F$ and $F_{ref}$ are the modality-aware embeddings (MAE) for the clean and target image-text pairs, respectively, and $\alpha$ is a modality-balancing hyperparameter.
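A minimal sketch of the fusion in Eq. (5) is given below; the L2 normalization follows the pseudocode in the appendix, and the feature dimension is only an example.

import torch

# Fuse image and text embeddings into a modality-aware embedding (Eq. (5)).
def modality_aware_embedding(img_feat: torch.Tensor, txt_feat: torch.Tensor, alpha: float) -> torch.Tensor:
    fused = alpha * img_feat + (1.0 - alpha) * txt_feat
    return fused / fused.norm(dim=-1, keepdim=True)

# e.g., with 512-dimensional CLIP ViT-B/16 features:
F = modality_aware_embedding(torch.randn(1, 512), torch.randn(1, 512), alpha=0.3)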

Previous transfer-based attacking strategies only use a uni-modal target to guide the learning of the image perturbation and lack an explicit semantic update process [54, 47], which leads to coarse-grained semantic alignment and sub-optimal attack performance on VLMs. Therefore, we propose the chain-of-attack learning strategy to enhance adversarial example generation with explicit step-by-step updates in the semantic domain, as shown in the lower left part of Fig. 2 and in a more detailed example in Fig. 3. Specifically, we initialize the image perturbation $\delta$ such that $\delta_{0} \sim {\rm Uniform}(-\epsilon, \epsilon)$. Given the clean image, the adversarial image example $I_{adv}$ can be obtained by adding the perturbation at the pixel level. A publicly accessible pre-trained image-to-text model is utilized to generate the caption $T_{adv}$ for the current adversarial image in each step. As previously mentioned, the modality-aware embedding for the current adversarial example is given by:

F_{adv} = \alpha \cdot E_{v}(I_{adv}) + (1-\alpha) \cdot E_{t}(T_{adv}),     (6)

where $F_{adv}$ is the modality-aware embedding of the current adversarial example. It is noteworthy that at each step a new caption describing the current adversarial image is generated, hence the modality-aware embedding is updated based on both the changed image and text embeddings. We explicitly update the multi-modal semantics and generate the adversarial examples based on their previous semantics, resulting in a step-by-step attacking process (i.e., a chain of attacks). To learn the image perturbation at each step, we propose Targeted Contrastive Matching (TCM), where the cross-modal semantics of the clean samples, target samples, and the current adversarial samples are aligned/diverged in the same latent embedding space. Specifically, TCM maximizes the similarity between the current adversarial example and the target reference example, while minimizing the similarity between the current adversarial example and the original clean example across both the vision and text modalities. The TCM objective $L$ is defined as:

L = {\rm max}(||{\rm sim}(F_{ref}, F_{adv}) - \beta \cdot {\rm sim}(F, F_{adv})|| + \gamma, 0),     (7)

where $\beta$ is a hyperparameter that controls the trade-off between similarity maximization for positive pairs and minimization for negative pairs, and $\gamma$ is the margin hyperparameter that controls the desired separation of the positive pairs and the negative pairs in the learned embedding space.
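The TCM objective can be sketched as follows for a batch of modality-aware embeddings that are assumed to be L2-normalized, so the dot product acts as the cosine similarity; following the appendix pseudocode, the sketch drops the outer norm of Eq. (7) and keeps only the margin-based hinge.

import torch

# Targeted Contrastive Matching: pull the adversarial embedding toward the
# target pair and push it away from the clean pair (cf. Eq. (7)).
def tcm_loss(F_ref: torch.Tensor, F_adv: torch.Tensor, F_clean: torch.Tensor,
             beta: float, gamma: float) -> torch.Tensor:
    sim_pos = (F_ref * F_adv).sum(dim=-1)    # similarity to the target example
    sim_neg = (F_clean * F_adv).sum(dim=-1)  # similarity to the clean example
    return torch.relu(sim_pos - beta * sim_neg + gamma).mean()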

To optimize the image perturbation $\delta$ through the TCM objective $L$, projected gradient descent [33] is adopted and the optimization can be expressed as:

\delta_{t+1} = {\rm Proj}_{||\cdot||_{\infty} \leq \epsilon}(\delta_{t} + \eta \cdot \nabla_{\delta} L(\delta_{t})),     (8)

where ${\rm Proj}(\cdot)$ projects $\delta$ back into the $\epsilon$-ball, $\eta$ is the step size, and $\nabla_{\delta} L(\cdot)$ represents the gradient of the TCM loss.
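A single update of Eq. (8) can be sketched as below; the signed-gradient variant follows the appendix pseudocode, and loss_fn stands for the TCM objective as a function of the perturbation (an assumption made for the example).

import torch

# One PGD ascent step on the perturbation, projected back into the eps-ball.
def pgd_step(delta: torch.Tensor, loss_fn, eta: float, epsilon: float) -> torch.Tensor:
    delta = delta.clone().detach().requires_grad_(True)
    loss = loss_fn(delta)
    loss.backward()
    with torch.no_grad():
        delta = delta + eta * delta.grad.sign()
        delta = torch.clamp(delta, min=-epsilon, max=epsilon)
    return delta.detach()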

VLM Method CLIP Score (↑) / Text Encoder ASR (↑)
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble Target Fool
ViECap [19] Clean image 46.7 44.3 47.7 47.2 35.2 44.2 - -
AttackBard [17] 49.2 46.7 48.7 51.7 36.3 46.5 15.5 25.0
Mix.Attack [47] 49.6 47.0 48.8 52.1 36.7 46.8 11.1 17.4
MF-it [54] 78.0 76.7 78.9 79.6 71.8 77.0 69.3 79.9
MF-ii [54] 76.4 75.3 77.4 78.0 70.1 75.4 76.6 85.8
Ours 82.9 81.9 83.8 84.7 78.2 82.3 98.4 99.5
SmallCap [42] Clean image 50.7 48.6 51.1 52.7 37.5 48.1 - -
AttackBard [17] 53.2 48.4 51.5 56.6 39.2 49.8 6.6 9.2
Mix.Attack [47] 52.9 48.3 51.5 56.4 39.2 49.7 5.8 8.1
MF-it [54] 57.9 54.8 59.1 60.7 46.6 55.8 22.1 26.8
MF-ii [54] 67.3 65.0 68.5 69.8 58.6 65.8 47.1 52.2
Ours 68.6 66.1 70.0 71.1 60.4 67.2 56.8 66.5
Unidiffuser [4] Clean image 41.7 41.5 42.9 44.6 30.5 40.2 - -
AttackBard [17] 52.2 48.6 53.1 56.5 56.5 53.4 8.1 14.4
Mix.Attack [47] 45.3 44.0 47.2 49.2 35.2 44.2 5.9 10.5
MF-it [54] 65.5 63.9 67.8 69.8 61.1 65.6 80.2 95.8
MF-ii [54] 70.9 69.5 72.1 73.3 63.7 70.0 90.0 98.4
Ours 76.1 74.4 77.2 78.5 69.8 75.2 94.2 98.9
LLaVA-7B [29] Clean image 46.8 46.8 48.1 47.7 33.7 44.6 - -
AttackBard [17] 47.9 47.4 48.1 48.5 34.6 45.3 2.0 3.7
Mix.Attack [47] 46.8 47.6 47.6 48.2 34.3 44.9 1.7 3.0
MF-it [54] 46.8 46.9 48.0 47.9 33.9 44.7 3.0 5.6
MF-ii [54] 47.2 46.7 48.2 48.0 34.2 44.9 2.6 4.7
Ours 51.1 49.6 52.0 55.2 35.8 48.7 14.5 28.4
LLaVA-13B [29] Clean image 46.4 46.3 47.9 47.5 33.4 44.3 - -
AttackBard [17] 47.9 47.4 48.1 48.5 34.6 45.3 2.6 4.8
Mix.Attack [47] 46.8 47.6 47.6 48.2 34.3 44.9 0.9 1.5
MF-it [54] 46.6 46.8 48.0 47.8 33.7 44.6 2.7 5.0
MF-ii [54] 47.4 47.2 48.7 48.4 34.4 45.2 3.6 6.9
Ours 48.1 48.0 49.4 49.0 34.6 45.8 12.3 24.3
Table 1: Quantitative performance comparison of transfer-based attacks against VLMs with the state-of-the-art methods. The metrics include CLIP score and our proposed LLM-based attack success rate (ASR). The names of the corresponding image encoders are adopted for different text encoders. The Ensemble column reports the average results of different CLIP text encoders. The best results are in bold.

3.3 LLM-based ASR

Previous works tend to evaluate the robustness of models on response generation tasks with human effort [20, 21], making the evaluation labor-intensive and time-consuming. Some recent works propose using NLP metrics [8] or CLIP [41] scores [54] to measure the matching degree between the generated response and the targeted response, which we argue are neither comprehensive nor straightforward for human users to understand. For example, suppose the CLIP score between the generated text and the target text is 40% before attacking and 45% after attacking. People can only know that the attack increases the similarity by 5% without gaining any insight into whether the attack is successful or not, i.e., is the generated response genuinely closer to the targeted text from a human perspective, or does it merely diverge further from the original clean text while still remaining far from the target text?

To address the above issues, and considering the various evaluation strategies employed in different research, we propose a clear and unified attack success rate computation strategy for automatic evaluation of the robustness of VLMs on response generation tasks such as image captioning. Specifically, as illustrated in Fig. 4, we query an LLM (e.g., GPT-4) to serve as the human judge and determine whether the model is attacked successfully, i.e., whether the generated text is similar to the target reference text. In addition, to ensure a comprehensive evaluation, we also consider the scenario where the model is fooled into generating responses that are unrelated to the original clean text but still not similar to the target text. We further request the LLM to assign scores of 1, 0.5, and 0 for completely successful cases, fooled-only cases, and failed cases, respectively. Step-by-step thinking is utilized for accurate judgment and detailed explanations. For instance, the middle example of Fig. 4 shows a fooled-only case, where the LLM accurately suggests that “the generated text is unrelated to the original text but also does not closely match the target text”, assigning a score of 0.5 while offering detailed human-understandable reasons.

Method CLIP Score (↑) / Text Encoder ASR (↑)
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble Target Fool
Clean image 41.7 41.5 42.9 44.6 30.5 40.2 - -
Baseline 70.9 69.5 72.1 73.3 63.7 70.0 90.0 98.4
+ MAE 72.3 71.8 73.4 74.8 64.4 71.3 90.6 98.7
+ MAE + CoA (w/o TCM) 74.8 73.2 76.0 77.1 68.1 73.8 91.7 98.7
+ MAE + CoA (w/ TCM) 76.1 74.4 77.2 78.5 69.8 75.2 94.2 98.9
(↑5.2) (↑4.9) (↑5.1) (↑5.2) (↑6.1) (↑5.2) (↑4.2) (↑0.5)
Table 2: Ablation study of the proposed method on Unidiffuser. MAE, CoA, and TCM represent the Modality-Aware Embeddings, the Chain of Attack module, and Targeted Contrastive Matching, respectively. The improvements compared to the baseline are highlighted.

The proposed LLM-based ASR can be computed as:

ASR = \frac{1}{N} \sum {\rm JUDGE}(T, T_{adv}, T_{ref}),     (9)

where $N$ is the number of adversarial examples, and the outputs of ${\rm JUDGE}(\cdot)$ are the scores given by the LLM judgment mentioned before.
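A minimal sketch of this evaluation is given below; the judge prompt paraphrases the scoring rule described above, the OpenAI client and model name are assumptions, and the score parsing is intentionally simple.

from openai import OpenAI

client = OpenAI()

# Ask an LLM judge to score one attack: 1 (targeted success), 0.5 (fooled only), 0 (failed).
def judge(clean_text: str, adv_text: str, target_text: str, model: str = "gpt-4") -> float:
    prompt = (
        "You are judging a targeted adversarial attack on image captioning.\n"
        f"Original text: {clean_text}\nGenerated text: {adv_text}\nTarget text: {target_text}\n"
        "Think step by step, then output only the final score on the last line: "
        "1 if the generated text matches the target text, 0.5 if it is unrelated to the "
        "original text but does not match the target text, and 0 otherwise."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    last_line = reply.strip().splitlines()[-1]
    if "0.5" in last_line:
        return 0.5
    return 1.0 if "1" in last_line else 0.0

# ASR over a set of (clean, generated, target) text triples, as in Eq. (9).
def attack_success_rate(triples) -> float:
    scores = [judge(t, a, r) for t, a, r in triples]
    return sum(scores) / max(len(scores), 1)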

4 Experiments

4.1 Experimental Setups

Datasets. The clean images are from the validation images of ImageNet-1K [14]. For target reference text, we follow Zhao et al. [54] and sample a text description for each clean image. We further use GPT-4 [1] to extract the key information of the sampled text description. To simulate the real-world scenario, Stable Diffusion [44] is utilized to generate target images for each target reference text, and MiniGPT-4 [56] is adopted to generate clean descriptions for clean images. More details are in the appendix.

Implementation details. For all performance comparisons, we use consistent pre-trained checkpoints of the victim VLMs [19, 42, 4, 30]. The vision encoder (ViT-B/16) and text encoder of CLIP [41] are adopted as the surrogate model. We use ClipCap [34] as the image-to-text model during the adversarial example generation process. Following the most common setting [6, 54], we set the perturbation budget $\epsilon = 8$ unless otherwise specified to ensure the perturbations are visually imperceptible. The objective is optimized using 100-step PGD [33] with $\eta = 1$. Other hyperparameters are selected by grid search, where we conduct ablation studies with different values (see appendix). Experiments are conducted on an RTX A6000 GPU.

Refer to caption
Figure 5: (a) Visual interpretation of the adversarial examples. $AM$ represents the attention map, which is based on the image-text similarity. The clean text and the target image are generated based on the clean image and the selected target reference text, respectively. (b) The effect of $\epsilon$ on Unidiffuser. The generated captions in red (i.e., $\epsilon \geq 8$) are close to the target text.

4.2 Experimental Results

VLM robustness against adversarial attacks. Tab. 1 shows the evaluation results for VLM robustness against black-box adversarial attacks. The victim VLMs include ViECap [19], SmallCap [42], Unidiffuser [4], and LLaVA-1.5 7B and 13B [29]. All parameters of the victim VLMs are frozen, and the victim VLMs are invoked only once for response generation during inference. For models that require a custom textual instruction (e.g., LLaVA), we use “What is the content of this image?” as the query. Our method CoA consistently outperforms the baselines by a significant margin on both the CLIP score and the proposed LLM-based ASR. Specifically, the CLIP score measures the embedding similarity between the generated response of the victim models and the target text using CLIP text encoders. We evaluate the CLIP score for each victim model with various CLIP text encoders, including the ResNet-based [24] RN-50 and RN-101, and the ViT-based [18] ViT-B/16, ViT-B/32, and ViT-L/14. CoA gains 6.9%, 2.1%, 7.4%, 7.5%, and 1.1% relative performance boosts over the second-best results on the respective victim models, demonstrating that the texts generated using our method are semantically closer to the target text in the text embedding space. To make the evaluation more comprehensive, we also report the performance with the proposed LLM-based ASR. In addition to the targeted ASR introduced in Sec. 3.3, we report the results of fooled cases (the last column of Tab. 1) to evaluate the attacking methods’ ability to fool the victim models into generating unrelated responses, covering both targeted and fooled-only cases. Our method achieves a much better ASR compared to the baselines. For example, CoA achieves 98.4% and 94.2% targeted ASR on ViECap [19] and Unidiffuser [4], respectively, while the second-best results are 76.6% and 90.0%. Furthermore, our method exhibits a stronger capability to fool large VLMs (e.g., LLaVA [29]) into giving wrong responses based on the adversarial examples. From the robustness evaluation results, we find that VLMs with a larger number of parameters are less susceptible to attacks. In particular, large-scale VLMs demonstrate significantly stronger robustness against black-box attacks compared to smaller models. In addition, misleading large-scale VLMs into generating unrelated responses is much easier than eliciting targeted responses, highlighting the vulnerability of large VLMs to non-targeted attacks.

Ablation study. We conduct ablation studies to demonstrate the effectiveness of each proposed module, as shown in Tab.Β 2, where MF-ii [54] is adopted as the baseline. Specifically, all the proposed components improve the attack performance. More ablation results, e.g., hyperparameters and key information extraction, are in the appendix.

4.3 Discussion

Visual interpretation. To help better interpret and understand the adversarial examples, we obtain the attention maps for the clean images, adversarial images, and target images based on the gradients of the similarity between images and texts. Fig. 5 (a) shows a visualization example, where the clean image is a bird image and the target text is “A bunch of people celebrating around a birthday cake.” From the result, we can observe that the attention map correctly highlights the relevant region for the clean image-text pair and the target image-text pair. However, when we compute the attention between the adversarial image and the clean text, the model is misled and highlights some irrelevant regions. Furthermore, from the attention map of the adversarial image-target text pair, we see that some highlighted regions correspond to those of the target image. For example, the attention for the target image highlights the “cake” in the lower right that is mentioned in the target text, and the lower right part of the adversarial image is also highlighted based on its similarity with the target text. These observations indicate that the adversarial images mislead the models and cause them to perceive the adversarial image as the target image to some extent.
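One way to obtain such gradient-based maps is to differentiate the CLIP image-text similarity with respect to the input image and aggregate the gradient magnitude over channels; the exact saliency method used for Fig. 5 (a) is not specified, so the sketch below is only one plausible realization.

import torch

# Saliency map from the gradient of the CLIP image-text similarity w.r.t. the image.
def similarity_attention(clip_model, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    image = image.clone().requires_grad_(True)        # preprocessed image tensor
    img_feat = clip_model.encode_image(image)
    txt_feat = clip_model.encode_text(text_tokens)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = (img_feat * txt_feat).sum()                 # cosine similarity
    sim.backward()
    attn = image.grad.abs().sum(dim=1, keepdim=True)  # aggregate over channels
    attn = attn / (attn.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    return attn  # values in [0, 1] with the same spatial size as the input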

Efficiency & Comparison with query-based strategy. In addition to attack performance, the efficiency of attacks is also an important challenge [23]. Fig. 1 compares different attacking strategies based on their performance (y-axis) and training time per step (x-axis). A query-based method, MF-ii-tt [54], is included; its computational cost is high due to the repeated invocation of the victim models. This strategy is sometimes regarded as a grey-box attacking strategy [50] since it needs to query the victim model and leverage the model outputs. From the comparison results (Fig. 1), our method achieves superior performance and matches or even outperforms the query-based strategy in some cases at a much lower computational cost.

VLM Budget $\epsilon$
8/255 16/255 32/255
ViECap [19] 82.3 82.4 82.5
SmallCap [42] 67.2 67.4 68.0
Unidiffuser [4] 75.2 75.4 75.7
LLaVA-7B [29] 48.7 48.7 49.3
LLaVA-13B [29] 45.8 45.8 46.0
Table 3: Ensemble CLIP scores for different perturbation budgets $\epsilon$.

Effect of perturbation budget. We explore the effect of $\epsilon$ on the attack results and the image quality, both qualitatively and quantitatively. Specifically, as shown in Fig. 5 (b) and Tab. 3, when $\epsilon$ increases, the attack results become better while the image quality decreases (measured by the LPIPS distance [51] between the adversarial and clean images). We conclude that a proper $\epsilon$ is crucial for balancing attack performance with the magnitude of perturbations.
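The image-quality side of this trade-off can be quantified with the LPIPS package; the sketch below assumes image tensors in [0, 1] that are mapped to the [-1, 1] range expected by LPIPS.

import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")

# Perceptual distance between the clean and adversarial images.
def perceptual_distance(clean: torch.Tensor, adv: torch.Tensor) -> float:
    to_lpips = lambda x: x * 2.0 - 1.0  # map [0, 1] pixels to [-1, 1]
    with torch.no_grad():
        return loss_fn(to_lpips(clean), to_lpips(adv)).item()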

5 Conclusion

In this study, we evaluate the robustness of VLMs against black-box adversarial attacks and highlight the vulnerability of existing models. A novel transfer-based targeted attacking strategy, namely Chain of Attack, is proposed to enhance the generation of adversarial examples through a series of explicit intermediate steps based on multi-modal semantics, thereby improving attack performance. Moreover, an LLM-based ASR computation strategy is introduced for more comprehensive robustness evaluations in response generation tasks, while offering human-understandable explanations. We hope this study serves as a reference for safety considerations in the development of future vision-language models, and facilitates more trustworthy model advancements and evaluations.

References

  • Achiam etΒ al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaΒ Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etΒ al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alayrac etΒ al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etΒ al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Bailey etΒ al. [2023] Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
  • Bao etΒ al. [2023] Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023.
  • Bommasani etΒ al. [2021] Rishi Bommasani, DrewΒ A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, MichaelΒ S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, etΒ al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Carlini etΒ al. [2019] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
  • Carlini etΒ al. [2024] Nicholas Carlini, Milad Nasr, ChristopherΒ A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang WeiΒ W Koh, Daphne Ippolito, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36, 2024.
  • Chen etΒ al. [2017] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking visual language grounding with adversarial examples: A case study on neural image captioning. arXiv preprint arXiv:1712.02051, 2017.
  • Chen etΒ al. [2023] Huanran Chen, Yichi Zhang, Yinpeng Dong, Xiao Yang, Hang Su, and Jun Zhu. Rethinking model ensemble in transfer-based adversarial attacks. arXiv preprint arXiv:2303.09105, 2023.
  • Chen etΒ al. [2022] Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022.
  • Chen etΒ al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr DollΓ‘r, and CΒ Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Chiang etΒ al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephΒ E Gonzalez, etΒ al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023.
  • Cui etΒ al. [2024] Xuanming Cui, Alejandro Aparcedo, YoungΒ Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24625–24634, 2024.
  • Deng etΒ al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dong etΒ al. [2018] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9185–9193, 2018.
  • Dong etΒ al. [2021] Yinpeng Dong, Shuyu Cheng, Tianyu Pang, Hang Su, and Jun Zhu. Query-efficient black-box adversarial attacks guided by a transfer-based prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9536–9548, 2021.
  • Dong etΒ al. [2023] Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.
  • Dosovitskiy [2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fei etΒ al. [2023] Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, and Feng Zheng. Transferable decoding with visual entities for zero-shot image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3136–3146, 2023.
  • Fu etΒ al. [2023] Xiaohan Fu, Zihan Wang, Shuheng Li, RajeshΒ K Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Misusing tools in large language models with visual adversarial examples. arXiv preprint arXiv:2310.03185, 2023.
  • Gong etΒ al. [2023] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. Figstep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.
  • Goodfellow [2014] IanΒ J Goodfellow. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Guo etΒ al. [2019] Chuan Guo, Jacob Gardner, Yurong You, AndrewΒ Gordon Wilson, and Kilian Weinberger. Simple black-box adversarial attacks. In International conference on machine learning, pages 2484–2493. PMLR, 2019.
  • He etΒ al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang etΒ al. [2019] Qian Huang, Isay Katsman, Horace He, Zeqi Gu, Serge Belongie, and Ser-Nam Lim. Enhancing adversarial example transferability with an intermediate level attack. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4733–4742, 2019.
  • Ilyas etΒ al. [2018] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International conference on machine learning, pages 2137–2146. PMLR, 2018.
  • Jin etΒ al. [2024] Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599, 2024.
  • Li etΒ al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • Liu etΒ al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and YongΒ Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024a.
  • Liu etΒ al. [2024b] Haotian Liu, Chunyuan Li, Qingyang Wu, and YongΒ Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
  • Liu etΒ al. [2016] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
  • Liu etΒ al. [2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In Proceedings of 5th International Conference on Learning Representations, 2017.
  • Madry [2017] Aleksander Madry. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mokady etΒ al. [2021] Ron Mokady, Amir Hertz, and AmitΒ H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
  • Nguyen etΒ al. [2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
  • Papernot etΒ al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, ZΒ Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
  • Perez etΒ al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
  • Qi etΒ al. [2023] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213, 2023.
  • Qin etΒ al. [2022] Zeyu Qin, Yanbo Fan, Yi Liu, Li Shen, Yong Zhang, Jue Wang, and Baoyuan Wu. Boosting the transferability of adversarial attacks with reverse adversarial perturbation. Advances in neural information processing systems, 35:29845–29858, 2022.
  • Radford etΒ al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, etΒ al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford etΒ al. [2021] Alec Radford, JongΒ Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etΒ al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramos etΒ al. [2023] Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva. Smallcap: lightweight image captioning prompted with retrieval augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2840–2849, 2023.
  • Rando etΒ al. [2022] Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian TramΓ¨r. Red-teaming the stable diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
  • Rombach etΒ al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and BjΓΆrn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Touvron etΒ al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, TimothΓ©e Lacroix, Baptiste RoziΓ¨re, Naman Goyal, Eric Hambro, Faisal Azhar, etΒ al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron etΒ al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, etΒ al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Tu etΒ al. [2025] Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this image a safety evaluation benchmark for vision llms. In Computer Vision – ECCV 2024, pages 37–55, Cham, 2025. Springer Nature Switzerland.
  • Wang etΒ al. [2023] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, etΒ al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In NeurIPS, 2023.
  • Xu etΒ al. [2019] Yan Xu, Baoyuan Wu, Fumin Shen, Yanbo Fan, Yong Zhang, HengΒ Tao Shen, and Wei Liu. Exact adversarial attack to image captioning via structured output learning with latent variables. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4135–4144, 2019.
  • Zhang etΒ al. [2024] Chiyu Zhang, Xiaogang Xu, Jiafei Wu, Zhe Liu, and Lu Zhou. Adversarial attacks of vision tasks in the past 10 years: A survey. arXiv preprint arXiv:2410.23687, 2024.
  • Zhang etΒ al. [2018] Richard Zhang, Phillip Isola, AlexeiΒ A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhang etΒ al. [2023] Yutong Zhang, Yao Li, Yin Li, and Zhichang Guo. A review of adversarial attacks in computer vision. arXiv preprint arXiv:2308.07673, 2023.
  • Zhao etΒ al. [2023] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.
  • Zhao etΒ al. [2024] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-ManΒ Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou etΒ al. [2018] Wen Zhou, Xin Hou, Yongjun Chen, Mengyun Tang, Xiangqi Huang, Xiang Gan, and Yong Yang. Transferable adversarial perturbations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 452–467, 2018.
  • Zhu etΒ al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

In this supplementary material, we present more details about the data and implementation, including more data examples, the algorithmic format, and the core code of the proposed Chain of Attack. Furthermore, we report and analyze more detailed experimental, ablation, and visualization results, including the ablation studies on hyperparameters, the experiments on the VQA task, and more examples and results of the proposed CoA and LLM-based ASR.

Data and Implementation Details

More Data Examples

Fig. 6 shows some examples of the data used in this paper. Specifically, as mentioned in the main paper, the clean images and the target texts are from ImageNet-1k [14] and MS-COCO [11], respectively. To obtain the corresponding clean texts and the target images, we adopt GPT-4 [1] and Stable Diffusion [44] to generate high-quality texts and images, respectively. These clean and target image-text pairs are used to compute the modality-aware embeddings and serve as the references in Targeted Contrastive Matching to guide the learning of perturbations.

Refer to caption
Figure 6: Examples of the used clean images, clean texts, target texts, and target images.

Chain of Attack Algorithm

In addition to the method illustration in the main paper, the algorithmic format of the proposed Chain of Attack method is shown in Algorithm 1.

Input: The clean image $I$, clean text $T$, targeted reference text $T_{ref}$, generated target image $I_{ref}$, surrogate image encoder $E_{v}(\cdot)$ and text encoder $E_{t}(\cdot)$, modality-balancing hyperparameter $\alpha$, positive-negative balancing hyperparameter $\beta$, margin hyperparameter $\gamma$, the step size of PGD $\eta$.
Output: The adversarial example $I_{adv}$.
Initialization:
Adversarial image $I_{adv} \leftarrow I$, PGD step number $pgd\_step$, $\epsilon \leftarrow 8$, $\delta \sim {\rm Uniform}(-\epsilon, \epsilon)$;
# Calculation of modality-aware embeddings (MAE).
$F \leftarrow \alpha \cdot E_{v}(I) + (1-\alpha) \cdot E_{t}(T)$;
$F_{ref} \leftarrow \alpha \cdot E_{v}(I_{ref}) + (1-\alpha) \cdot E_{t}(T_{ref})$;
# Update process of Chain of Attack.
$t \leftarrow 1$;
while $t \leq pgd\_step$ do
      $I_{adv} \leftarrow I_{adv} + \delta_{t}$;
      # The current adversarial text and MAE of each step.
      $T_{adv} \leftarrow M_{I2T}(I_{adv})$;
      $F_{adv} \leftarrow \alpha \cdot E_{v}(I_{adv}) + (1-\alpha) \cdot E_{t}(T_{adv})$;
      # Objective of Targeted Contrastive Matching.
      $L \leftarrow {\rm max}(||F_{ref}^{T} F_{adv} - \beta \cdot F^{T} F_{adv}|| + \gamma, 0)$;
      # Update the perturbation.
      $\delta_{t+1} \leftarrow {\rm Proj}_{||\cdot||_{\infty} \leq \epsilon}(\delta_{t} + \eta \cdot \nabla_{\delta} L(\delta_{t}))$;
      $t \leftarrow t + 1$;
end while
Algorithm 1: Chain of Attack

Core Code

We present the core pseudocode in the section “PyTorch-like Pseudocode for the Core of an Implementation of Chain of Attack” of this supplementary material.

More Experimental Results

Detailed Ablation Results with Various Hyperparameters

To explore the effects of the values of hyperparameters for our attack strategy, we conduct extensive ablation studies.

The ablation results for the modality-balancing hyperparameter $\alpha$ are reported in Tab. 4. Note that a smaller $\alpha$ means a larger weight for the text modality. From the results, we can observe that the text modality is more effective for attacking some victim VLMs (e.g., ViECap [19]). However, in most cases the attack performance benefits from both modalities (e.g., on SmallCap [42], Unidiffuser [4], and LLaVA [29]). This observation demonstrates the effectiveness of our proposed modality-aware embeddings that capture semantics from both domains. We suggest that a proper $\alpha$ can help achieve better results by fusing the visual and textual features.

In Tab. 5, we report the results of different combinations of the hyperparameters $\beta$ and $\gamma$, where $\beta$ controls the trade-off between similarity maximization for positive pairs and minimization for negative pairs, and $\gamma$ is the margin hyperparameter that controls the desired separation of the positive pairs and the negative pairs in the learned embedding space, as mentioned in the main paper. Note that a larger $\beta$ indicates more focus on the difference between the adversarial examples and the original clean examples. Since our task is targeted attacking, we set $0 < \beta < 1$. From our experiments, we find that a larger $\gamma$ may degrade the performance, hence we suggest setting the margin hyperparameter to less than 0.5. Some combinations of hyperparameters with promising performance are reported in Tab. 5.

Detailed Results of the Effect of Perturbation Budget

In Sect. 4.3 of the main paper, we discuss the effect of the perturbation budget $\epsilon$ with only the ensemble scores. We report the complete results in Tab. 6, from which we can see that larger perturbation budgets improve the attack performance. However, as mentioned in Sect. 4.3 of the main paper (also see Fig. 5 (b) and Tab. 3 in the main paper), as the perturbation budget becomes larger, the image quality decreases. We suggest a proper $\epsilon$ value (e.g., 8) to balance the trade-off.

VLM $\alpha$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
0.9 77.6 76.4 78.6 79.3 71.6 76.7
0.7 79.8 80.4 81.2 81.5 74.4 79.0
0.5 81.2 80.4 82.2 83.0 76.2 80.6
0.3 82.7 81.7 83.6 84.4 78.1 82.1
0.1 82.9 81.9 83.8 84.7 78.2 82.3
SmallCap [42] 0.9 68.4 65.9 69.4 70.7 59.9 66.7
0.7 68.6 66.1 70.0 71.1 60.4 67.2
0.5 68.2 65.7 69.4 70.7 59.8 66.8
0.3 65.5 62.6 66.7 68.1 56.3 63.8
0.1 61.1 58.2 62.2 63.7 50.9 59.2
Unidiffuser [4] 0.9 73.6 71.9 74.7 75.8 66.7 72.5
0.7 75.1 73.3 76.1 77.2 68.5 74.0
0.5 75.8 74.3 76.9 78.1 69.4 74.9
0.3 76.1 74.4 77.2 78.5 69.8 75.2
0.1 72.1 70.5 73.5 75.1 64.8 71.2
LLaVA-7B [29] 0.9 47.7 47.3 48.9 48.5 34.3 45.4
0.7 48.2 47.8 49.1 48.7 34.7 45.7
0.5 51.1 49.6 52.0 55.2 35.8 48.7
0.3 48.8 48.3 49.6 49.4 35.1 46.2
0.1 47.6 47.4 48.9 48.5 34.5 45.4
Table 4: Ablation results of the modality-balancing hyperparameter $\alpha$ of the modality-aware embeddings for controlling the trade-off between vision and text modalities. A smaller $\alpha$ indicates a larger weight for the text modality. The best ensemble scores are in bold.

Effect of PGD Steps

Following the setting of previous methods [54], we adopt projected gradient descent (PGD) [33] with 100 steps, as mentioned in the main paper. Additionally, we report the results with fewer PGD steps in Tab. 7. The results show that fewer PGD steps may lead to underfitting, and PGD with 100 steps achieves the best attack performance.

Visual Question Answering Task

To further explore the potential applications/risks of the attacking strategy, we implement a multi-round visual question answering (VQA) task using LLaVA-7B [29], as shown in Fig. 7. Two successful targeted attack examples are displayed. Specifically, in example 1, the original clean image shows part of the body of a large marine animal. We query LLaVA with the questions “How do you think of this image?” and “Could it be a marine creature?”. LLaVA identifies it as a marine animal and gives correct answers. However, when we input the adversarial image generated by our method, the victim model gives the wrong answer and identifies it as a cat, which is the content of the target examples. Example 2 leads to the same conclusion. The results demonstrate that our attacking strategy successfully misleads the victim model into generating target responses.

More Case Studies of the Proposed ASR

In addition to the results shown in Fig. 4 of the main paper, more evaluation examples of the proposed LLM-based ASR are shown in Fig. 10 in this supplementary material.

More Results of the Attacking Chain

In addition to Fig. 2 and Fig. 3 of the main paper, we visualize more examples of the intermediate steps of CoA and the results of the victim models, as shown in Fig. 8. Specifically, the left and middle parts of Fig. 8 show the update process of the adversarial examples based on both visual and textual semantics. The right part shows the generation results of the victim models given the final adversarial examples. For example, in the third case, the semantics of the image change from “A group of chickens of various colors foraging in a grassy outdoor enclosure” to the target semantics “A close up of a vase with flowers”, and the CLIP score between the intermediate adversarial text and the target text increases along the chain. Some victim models (e.g., ViECap, Unidiffuser) generate almost the same response as the target text (e.g., with CLIP scores of 99.6% and 100%), demonstrating the effectiveness of the generated adversarial examples.

Sensitivity of Adversarial Examples to Gaussian Noise and the Degradation to the Original Clean Semantics

To explore the sensitivity of the generated adversarial examples to noise (e.g., Gaussian noise), we show the results of adding noise of different scales to the adversarial examples in Fig. 9. When the standard deviation of the noise $std_{G}$ is relatively small, the victim models still output the target responses. However, as $std_{G}$ becomes large, the victim models tend to generate responses that are more similar to the original clean text. The captions of some intermediate examples are a combination of the original clean text and the target reference text. This result interprets the process of adding perturbations to the adversarial images and shows that large noise can undermine the effectiveness of adversarial examples.
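The noise-injection step of this experiment can be sketched as follows; the function name and the [0, 1] pixel range are illustrative assumptions.

import torch

# Add zero-mean Gaussian noise with standard deviation std_g to an adversarial image.
def perturb_with_gaussian(adv_img: torch.Tensor, std_g: float) -> torch.Tensor:
    noisy = adv_img + std_g * torch.randn_like(adv_img)
    return torch.clamp(noisy, 0.0, 1.0)

# e.g., sweep std_g over increasing values and record the victim caption for
# perturb_with_gaussian(adv_img, std_g) at each noise level.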

(More figures and tables are on the following pages.)

VLM $\beta$ $\gamma$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
0.9 0.1 78.4 77.3 79.3 80.0 72.5 77.5
0.8 0.2 77.1 76.1 78.3 78.9 71.0 76.3
0.7 0.3 77.6 76.4 78.6 79.3 71.6 76.7
0.6 0.4 77.3 76.1 78.4 79.1 71.1 76.4
SmallCap [42] 0.9 0.1 68.4 66.5 69.8 71.0 60.3 67.2
0.8 0.2 67.2 65.0 68.5 69.9 58.8 65.9
0.7 0.3 68.4 65.9 69.4 70.7 59.9 66.9
0.6 0.4 67.7 65.4 69.0 70.3 59.2 66.3
Unidiffuser [4] 0.9 0.1 72.9 71.7 74.3 75.4 66.2 72.1
0.8 0.2 73.3 71.6 74.5 75.6 66.3 72.3
0.7 0.3 73.6 71.9 74.7 75.8 66.7 72.5
0.6 0.4 73.2 71.5 74.3 75.4 66.2 72.1
LLaVA-7B [29] 0.9 0.1 47.8 47.5 49.0 48.7 34.4 45.5
0.8 0.2 47.7 47.4 48.9 48.5 34.4 45.4
0.7 0.3 47.4 47.3 48.6 48.2 34.2 45.1
0.6 0.4 47.7 47.4 48.9 48.4 34.4 45.4
Table 5: Results of some different combinations of the hyperparameters $\beta$ and $\gamma$ for Targeted Contrastive Matching.
VLM $\epsilon$ CLIP Score (↑) / Text Encoder
ViECap [19] RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
8/255 82.9 81.9 83.8 84.7 78.2 82.3
16/255 83.1 82.0 83.9 84.8 78.4 84.2
32/255 83.1 82.2 83.9 84.8 78.4 82.5
SmallCap [42] 8/255 68.6 66.1 70.0 71.1 60.4 67.2
16/255 68.9 66.3 70.2 71.3 60.5 67.4
32/255 70.2 66.8 70.4 71.8 60.9 68.0
Unidiffuser [4] 8/255 76.1 74.4 77.2 78.5 69.8 75.2
16/255 76.3 74.8 77.4 78.6 70.1 75.4
32/255 76.7 75.1 77.7 78.9 70.3 75.7
LLaVA-7B [29] 8/255 51.1 49.6 52.0 55.2 35.8 48.7
16/255 51.1 49.6 52.0 55.3 35.8 48.7
32/255 51.7 50.1 52.5 55.9 36.2 49.3
LLaVA-13B [29] 8/255 48.1 48.0 49.4 49.0 34.6 45.8
16/255 48.1 48.0 49.4 49.0 34.6 45.8
32/255 48.2 48.1 49.4 49.2 34.9 46.0
Table 6: The detailed results of the effect of perturbation budgets $\epsilon$.
Method CLIP Score (↑) / Text Encoder
RN-50 RN-101 ViT-B/16 ViT-B/32 ViT-L/14 Ensemble
Clean image 41.7 41.5 42.9 44.6 30.5 40.2
CoA w/ PGD-10 63.1 61.5 64.5 66.0 53.9 61.8
CoA w/ PGD-50 74.5 73.0 75.8 77.2 68.0 73.7
CoA w/ PGD-100 76.1 74.4 77.2 78.5 69.8 75.2
Table 7: The effect of the number of PGD [33] steps on Unidiffuser [4]. CoA w/ PGD-10 means our method CoA using PGD with 10 steps. The best results are highlighted in bold.
Refer to caption
Figure 7: Results of LLaVA-7B [29] on the VQA task. The left part shows the multi-round VQA for the original clean examples, while the right part shows the results of using adversarial examples generated by CoA. The sentences in the chat boxes with a smiling face are the queries of human users, while the sentences in the purple chat boxes with a robot icon are the answers of the victim model. The clean texts, target images, and target texts used are also shown at the top of each example.
Refer to caption
Figure 8: More results of the chain of attack. We visualize the adversarial images and their corresponding texts at some intermediate chain steps. The generation results of victim models given the generated adversarial examples are shown in the right part of this figure.
Refer to caption
Figure 9: Results for the sensitivity of adversarial examples to Gaussian noise and the degradation to the original clean semantics. $std_{G}$ represents the standard deviation of the Gaussian noise added to the adversarial image. The victim model used to generate the captions in these examples is Unidiffuser [4]. The clean and target image-text pairs are shown on the left part of the figure, while the adversarial images with different Gaussian noise levels are on the right. Captions in red indicate degraded captions.
Refer to caption
Figure 10: More evaluation examples and results of the proposed LLM-based ASR. From left to right, the examples depict a completely successful attack case, a fooled-only case, and a failed attack case, respectively. The output score for each case is at the bottom.

PyTorch-like Pseudocode for the Core of an Implementation of Chain of Attack

import torch
import clip

# Given:
# cle_img        - clean image tensor
# cle_img_feat   - clean image features
# tgt_txt_feat   - target text features
# cle_txt_feat   - (generated) clean text features
# tgt_img_feat   - (generated) target image features
# alpha, beta, gamma - hyperparameters (modality balance, trade-off, margin)
# epsilon, eta   - perturbation budget and PGD step size
# pgd_steps      - number of PGD steps
# surrogate model (CLIP) with preprocess, caption model, and device

# Modality-aware embeddings for the clean and target image-text pairs
cle_mae = alpha * cle_img_feat + (1 - alpha) * cle_txt_feat
cle_mae = cle_mae / cle_mae.norm(dim=1, keepdim=True)
tgt_mae = alpha * tgt_img_feat + (1 - alpha) * tgt_txt_feat
tgt_mae = tgt_mae / tgt_mae.norm(dim=1, keepdim=True)

# Adversarial example generation with Chain of Attack
delta = torch.zeros_like(cle_img, requires_grad=True)
for j in range(pgd_steps):
    adv_img = cle_img + delta

    # generate a caption for the current adversarial image
    cur_caption = caption_model(adv_img)

    # encode the current adversarial image and its caption with the surrogate CLIP
    adv_img_feat = clip_model.encode_image(preprocess(adv_img))
    adv_img_feat = adv_img_feat / adv_img_feat.norm(dim=1, keepdim=True)
    cur_adv_text = clip.tokenize(cur_caption).to(device)
    cur_txt_feat = clip_model.encode_text(cur_adv_text)
    cur_txt_feat = cur_txt_feat / cur_txt_feat.norm(dim=1, keepdim=True)

    # modality-aware embedding of the current adversarial example
    cur_adv_mae = alpha * adv_img_feat + (1 - alpha) * cur_txt_feat
    cur_adv_mae = cur_adv_mae / cur_adv_mae.norm(dim=1, keepdim=True)

    # Targeted Contrastive Matching objective (cf. Eq. (7))
    cle_sim = torch.mean(torch.sum(cur_adv_mae * cle_mae, dim=1))
    tgt_sim = torch.mean(torch.sum(cur_adv_mae * tgt_mae, dim=1))
    loss = torch.relu(tgt_sim - beta * cle_sim + gamma)
    loss.backward()

    # signed-gradient PGD step, projected back into the epsilon-ball
    grad = delta.grad.detach()
    d = torch.clamp(delta + eta * torch.sign(grad), min=-epsilon, max=epsilon)
    delta.data = d
    delta.grad.zero_()