
Towards Transferable Attacks Against Vision-LLMs
in Autonomous Driving with Typography

Nhat Chung1,2, Sensen Gao1,3, Tuan-Anh Vu1,4, Jie Zhang5, Aishan Liu6, Yun Lin7, Jin Song Dong8, Qing Guo1,8,∗
1CFAR and IHPC, A*STAR, Singapore
2VNU-HCM, Vietnam   3Nankai University, China   4HKUST, HKSAR
5Nanyang Technological University, Singapore  6Beihang University, China
7Shanghai Jiao Tong University, China  8National University of Singapore, Singapore
Corresponding author: [email protected]
Abstract

Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems due to their advanced visual-language reasoning capabilities, targeting the perception, prediction, planning, and control mechanisms. However, Vision-LLMs have demonstrated susceptibility to various types of adversarial attacks, which can compromise their reliability and safety. To further explore the risk in AD systems and the transferability of practical threats, we propose to leverage typographic attacks against AD systems that rely on the decision-making capabilities of Vision-LLMs. Different from the few existing works developing general datasets of typographic attacks, this paper focuses on realistic traffic scenarios where these attacks can be deployed, on their potential effects on decision-making autonomy, and on the practical ways in which these attacks can be physically presented. To achieve the above goals, we first propose a dataset-agnostic framework for automatically generating false answers that can mislead Vision-LLMs’ reasoning. Then, we present a linguistic augmentation scheme that facilitates attacks at image-level and region-level reasoning, and we extend it with attack patterns against multiple reasoning tasks simultaneously. Based on these, we conduct a study on how these attacks can be realized in physical traffic scenarios. Through our empirical study, we evaluate the effectiveness, transferability, and realizability of typographic attacks in traffic scenes. Our findings demonstrate the particular harmfulness of typographic attacks against existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA, and Imp), thereby raising community awareness of vulnerabilities when incorporating such models into AD systems. We will release our source code upon acceptance.

1 Introduction

Vision-Large-Language-Models (Vision-LLMs) have seen rapid development over recent years [1, 2, 3], and their incorporation into autonomous driving (AD) systems has been seriously considered by both industry and academia [4, 5, 6, 7, 8, 9]. The integration of Vision-LLMs into AD systems showcases their ability to convey explicit reasoning steps to road users on the fly and satisfy the need for textual justifications of traffic scenarios regarding perception, prediction, planning, and control, particularly in safety-critical circumstances in the physical world. The core strength of Vision-LLMs lies in their auto-regressive capabilities through large-scale pretraining with visual-language alignment [1], enabling them even to perform zero-shot optical character recognition, grounded reasoning, visual-question answering, visual-language reasoning, etc. Nevertheless, despite their impressive capabilities, Vision-LLMs are unfortunately not impervious to adversarial attacks that can misdirect their reasoning processes [10]. Any successful attack strategy has the potential to pose critical problems when deploying Vision-LLMs in AD systems, especially attacks that can bypass the models’ black-box characteristics. As a step towards their reliable adoption in AD, studying the transferability of adversarial attacks is crucial to raising awareness of practical threats against deployed Vision-LLMs, and to efforts in building appropriate defense strategies for them.

In this work, we revisit the shared auto-regressive characteristic of different Vision-LLMs and intuitively turn that strength into a weakness by leveraging typographic forms of adversarial attacks, also known as typographic attacks. Typographic attacks were first studied in the context of the well-known Contrastive Language-Image Pre-training (CLIP) model [11, 12]. Early works in this area focused on developing a general typographic attack dataset targeting multiple-choice answering (such as object recognition, visual attribute detection, and commonsense answering) and enumeration [13]. Researchers also explored multiple-choice self-generating attacks against zero-shot classification [14], and proposed several defense mechanisms, including keyword-training [15] and prompting the model for detailed reasoning [16]. Despite these initial efforts, existing methodologies neither offer a comprehensive attack framework nor are explicitly designed to investigate the impact of typographic attacks on safety-critical systems, particularly those in AD scenarios.

Our work aims to fill this research gap by studying typographic attacks from the perspective of AD systems that incorporate Vision-LLMs. In summary, our scientific contributions are threefold:

  • Dataset-Independent Framework: we introduce a dataset-independent framework designed to automatically generate misleading answers that can disrupt the reasoning processes of Vision-Large Language Models (Vision-LLMs).

  • Linguistic Augmentation Schemes: we develop a linguistic augmentation scheme aimed at facilitating stronger typographic attacks on Vision-LLMs. This scheme targets reasoning at both the image and region levels and is expandable to multiple reasoning tasks simultaneously.

  • Empirical Study in Semi-Realistic Scenarios: we conduct a study to explore the possible implementations of these attacks in real-world traffic scenarios.

Through our empirical study of typographic attacks in traffic scenes, we hope to raise community awareness of critical typographic vulnerabilities when incorporating such models into AD systems.

2 Related Work

2.1 Vision-LLMs

Having demonstrated the proficiency of Large Language Models (LLMs) in reasoning across various natural language benchmarks, researchers have extended LLMs with visual encoders to support multimodal understanding. This integration has given rise to various forms of Vision-LLMs, capable of reasoning based on the composition of visual and language inputs.

Vision-LLMs Pre-training. The interconnection between LLMs and pre-trained vision models involves the individual pre-training of unimodal encoders on their respective domains, followed by large-scale vision-language joint training [17, 18, 19, 20, 2, 1]. Through an interleaved visual-language corpus (e.g., MMC4 [21] and M3W [22]), auto-regressive models learn to process images by converting them into visual tokens, combining these with textual tokens, and feeding them into LLMs. Visual inputs are treated as a foreign language, enhancing traditional text-only LLMs by enabling visual understanding while retaining their language capabilities. Hence, a straightforward pre-training strategy may not be designed to handle cases where input text is significantly more aligned with visual texts in an image than with the visual context of that image.

Vision-LLMs in AD Systems. Vision-LLMs have proven useful for perception, planning, reasoning, and control in autonomous driving (AD) systems [6, 7, 9, 5]. For example, existing works have quantitatively benchmarked the linguistic capabilities of Vision-LLMs in terms of their trustworthiness in explaining the decision-making processes of AD [7]. Others have explored the use of Vision-LLMs for vehicular maneuvering [8, 5], and [6] even validated an approach in controlled physical environments. Because AD systems involve safety-critical situations, comprehensive analyses of their vulnerabilities are crucial for reliable deployment and inference. However, proposed adoptions of Vision-LLMs into AD have been straightforward, which means existing issues in such models (e.g., vulnerabilities to typographic attacks) likely persist without proper countermeasures.

2.2 Transferable Adversarial Attacks

Adversarial attacks are most harmful when they can be developed in a closed setting with public frameworks yet can still be realized to attack unseen, closed-source models. The literature on such transferable attacks largely centers on gradient-based strategies. Against Vision-LLMs, our research focuses on exploring the transferability of typographic attacks.

Gradient-based Attacks. Since Szegedy et al. introduced the concept of adversarial examples, gradient-based methods have become the cornerstone of adversarial attacks [23, 24]. Goodfellow et al. proposed the Fast Gradient Sign Method (FGSM) [25] to generate adversarial examples with a single gradient step, perturbing the model’s input in the direction of the sign of the loss gradient obtained via backpropagation. Kurakin et al. later improved FGSM with an iterative optimization method, resulting in Iterative-FGSM (I-FGSM) [26]. Projected Gradient Descent (PGD) [27] further enhances I-FGSM by incorporating random noise initialization, leading to better attack performance. Gradient-based transfer attack methods typically use a known surrogate model, leveraging its parameters and gradients to generate adversarial examples, which are then used to attack a black-box model. These methods often rely on multi-step iterative optimization techniques like PGD and employ various data augmentation strategies to enhance transferability [28, 29, 30, 31, 32]. However, gradient-based methods face limitations in adversarial transferability due to the disparity between the surrogate and target models, and the tendency of adversarial examples to overfit the surrogate model [33, 34].
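As background for the gradient-based baselines we later compare against (e.g., the 16/255 CLIP and ALBEF attacks in Tab. 1), the following is a minimal PGD sketch in PyTorch; the surrogate `model`, the cross-entropy objective, and the step/budget values are illustrative assumptions rather than any specific attack from the literature.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=16/255, alpha=2/255, steps=10):
    """Minimal PGD sketch: random start inside an L-inf ball of radius eps,
    then iterated signed-gradient ascent on the surrogate's loss, projecting
    back into the ball after each step."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)           # surrogate objective
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()      # signed gradient step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project
    return x_adv.detach()
```

The adversarial image is then handed to a black-box target model, which is where the transferability gap discussed above typically appears.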

Typographic Attacks. The development of large-scale pretrained vision-language models such as CLIP [11, 12] revealed a form of typographic attack that can impair their zero-shot performance. A concurrent work [13] has also shown that such typographic attacks can extend to language reasoning tasks of Vision-LLMs, like multiple-choice question-answering and image-level open-vocabulary recognition. Similarly, another work [14] has developed a benchmark by utilizing a Vision-LLM to recommend an attack against itself given an image, a question, and its answer on classification datasets. Several defense mechanisms [15, 16] have been suggested, for instance by prompting the Vision-LLM to perform step-by-step reasoning. Our research differs from existing works in studying autonomous typographic attacks across question-answering scenarios of recognition, action reasoning, and scene understanding, particularly against Vision-LLMs in AD systems. Our work also discusses how such attacks affect reasoning capabilities at the image level, at the region level, and even across multiple reasoning tasks. Furthermore, we discuss how these attacks can be realized in the physical world, particularly against AD systems.

3 Preliminaries

3.1 Revisiting Auto-Regressive Vision-LLMs

As a simplified formulation of auto-regressive Vision-LLMs, suppose we have a visual input $\mathbf{v}$, a sequence of tokens generated up to timestep $t-1$, denoted as $x_1, x_2, \dots, x_{t-1}$, and $f(\cdot)$ as the Vision-LLM model function, whose goal is to predict the next token $x_t$. We denote its output vector of logits $\mathbf{y}_t$ at each timestep $t$, based on the previous tokens and the visual context, as:

$$\mathbf{y}_t = f(x_1, \dots, x_{t-1}, \mathbf{v}) = f(x_1, \dots, x_{t-1}, v_1, \dots, v_m), \qquad (1)$$

where $v_1, \dots, v_m$ denote the $m$ visual tokens encoded by a visual encoder on $\mathbf{v}$. The logits $\mathbf{y}_t$ are converted into a probability distribution using the softmax function. Specifically, with $y_{t,j} \in \mathbf{y}_t$ the logit for token $j$ in the vocabulary $C$ at timestep $t$:

$$P(x_t = j \mid x_1, x_2, \dots, x_{t-1}, \mathbf{v}) = \frac{\exp(y_{t,j})}{\sum_{k \in C} \exp(y_{t,k})}. \qquad (2)$$

Then, the general language modeling loss for training the model can be based on the cross-entropy loss. For a sequence of tokens $\mathbf{x} = \{x_1, \dots, x_n\}$, the loss is given by:

$$\mathcal{L}_{LM}(\mathbf{x}) = -\sum_{t=1}^{n} \log P(x_t \mid x_1, \dots, x_{t-1}, v_1, \dots, v_m) = -\sum_{k=1}^{n+m} \log P(z_k \mid z_1, \dots, z_{k-1}), \qquad (3)$$

where $z_i$ denotes either a textual token $x$ or a visual token $v$ at position $i$. Vision-LLMs possess conversational capabilities at their core, so interleaving language data ($m = 0$) and vision-language data ($m > 0$) during optimization is crucial for enabling visual understanding while retaining language reasoning [1]. Regardless of $m$, the loss objective of vision-guided language modeling is essentially the same as that of auto-regressive language modeling [35]. Consequently, as part of the alignment process, these practices imply blurred boundaries between textual and visual feature tokens during training. They may also facilitate text-to-text alignment between raw texts and within-image texts at inference.
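To make Eqs. (1)–(2) concrete, the following is a minimal sketch (not the authors’ implementation) of how an abstract model function `f` maps an interleaved token sequence to a next-token distribution; `f`, `text_tokens`, and `visual_tokens` are illustrative placeholders.

```python
import torch

def next_token_distribution(f, text_tokens, visual_tokens):
    """Sketch of Eqs. (1)-(2): the model f maps previously generated textual
    tokens plus m visual tokens (m = 0 for text-only data) to a logit vector
    y_t, which softmax turns into a distribution over the vocabulary C."""
    z = torch.cat([visual_tokens, text_tokens], dim=0)  # interleaved sequence z_1..z_{k-1}
    y_t = f(z)                                          # logits over the vocabulary at step t
    p_t = torch.softmax(y_t, dim=-1)                    # P(x_t = j | x_<t, v) for every j in C
    return p_t
```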

3.2 Typographic Attacks in Vision-LLMs-based AD Systems

The integration of Vision-LLMs into end-to-end AD systems has brought promising results thus far [9], where Vision-LLMs can enhance user trust through explicit reasoning steps about the scene. On the one hand, language reasoning in AD systems can elevate their capabilities by utilizing the learned commonsense of LLMs while proficiently communicating with users. On the other hand, exposing Vision-LLMs to public traffic scenarios not only makes them more vulnerable to typographic attacks that misdirect the reasoning process but can also prove harmful if their results are connected with decision-making, judgment, and control processes.

Table 1: Transferability and stealthiness of attacks.
Method  SSIM\uparrow Exact\downarrow Lingo-Judge\downarrow BLEURT\downarrow BERTScore\downarrow
gradient-based, CLIP (16/255) [11] 0.6425 0.3670 0.3126 0.4456 0.6766
gradient-based, ALBEF (16/255) [36] 0.6883 0.3493 0.3139 0.4438 0.6754
our typographic attack 0.9506 0.0700 0.0700 0.5563 0.7327

Unlike less transferable gradient-based attacks, typographic attacks transfer more readily across Vision-LLMs by exploiting the inherent text-to-text alignment between raw texts and within-image texts: they introduce misleading textual patterns in images that influence the reasoning of a Vision-LLM, i.e., dominating over visual-text alignment. In digital form, the attack is formulated as a function $\tau(\cdot)$ that applies transformations representing typographic attacks to obtain an adversarial image $\hat{\mathbf{v}} = \tau(\mathbf{v})$. Then, Eq. 1 can be rewritten as:

$$\mathbf{y}_t = f(x_1, \dots, x_{t-1}, \hat{\mathbf{v}}) = f(x_1, \dots, x_{t-1}, \hat{v}_1, \dots, \hat{v}_m), \qquad (4)$$

where $\hat{v}_1, \dots, \hat{v}_m$ denote the $m$ visual tokens of the influenced image $\hat{\mathbf{v}}$, whose textual content is meant to ❶ align with $\{x_1, \dots, x_{t-1}\}$, ❷ yet guide the reasoning process towards an incorrect answer. By exploiting the fundamental language-modeling properties of many Vision-LLMs to construct adversarial patterns, ❸ typographic attacks $\tau(\cdot)$ aim to be transferable across various pre-trained Vision-LLMs by directly influencing the visual information with texts. Our study is geared towards typographic attacks in AD scenarios to thoroughly understand the issues and raise awareness.

4 Methodology

Figure 1 shows an overview of our typographic attack pipeline, which goes from prompt engineering to attack annotation, particularly through Attack Auto-Generation, Attack Augmentation, and Attack Realization steps. We describe the details of each step in the following subsections.

Figure 1: Our proposed pipeline proceeds from attack generation via directives, to augmentation with commands and conjunctions, to positioning of the attacks, and finally to influencing inference.

4.1 Auto-Generation of Typographic Attack

In this subsection, to address the lack of both autonomy and diversity in typographic attacks, we propose to employ an LLM with prompt engineering, denoted by a model function $l(\cdot)$, to generate adversarial typographic patterns automatically. Let $\mathbf{q}$ and $\mathbf{a}$ respectively be the question prompt and its answer on an image $\mathbf{v}$; the adversarial text $\hat{\mathbf{a}}$ can then be naively generated as

$$\hat{\mathbf{a}} = l(\mathbf{q}, \mathbf{a}). \qquad (5)$$

In order to generate useful misdirection, the adversarial patterns must align with an existing question while guiding the Vision-LLM toward an incorrect answer. We achieve this through the concept of a directive, which refers to configuring the goal of an LLM, e.g., ChatGPT, to impose specific constraints while encouraging diverse behaviors. In our context, we direct the LLM to generate $\hat{\mathbf{a}}$ as an opposite of the given answer $\mathbf{a}$, under the constraint of the given question $\mathbf{q}$. Therefore, we initialize directives to the LLM using the prompts in Fig. 2.

Figure 2: Context directive for constraints of attack generation.

When generating attacks, we impose additional constraints depending on the question type. In our context, we focus on the tasks of ❶ scene reasoning (e.g., counting), ❷ scene object reasoning (e.g., recognition), and ❸ action reasoning (e.g., action recommendation), as shown in Fig. 3.

Figure 3: Template directive for attack generation, and an example.

The directives encourage the LLM to generate attacks that influence a Vision-LLM’s reasoning steps through text-to-text alignment and automatically produce typographic patterns as benchmark attacks. Clearly, the aforementioned typographic attack only works for single-task scenarios, i.e., a single pair of question and answer. To investigate multi-task vulnerabilities with respect to multiple pairs, we generalize the formulation to $K$ pairs of questions and answers, denoted as $\mathbf{q}_i, \mathbf{a}_i$, to obtain the adversarial texts $\hat{\mathbf{a}}_i$ for $i \in [1, K]$.
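Below is a minimal sketch of how such directive-driven generation could be scripted; the wording of `CONTEXT_DIRECTIVE` is illustrative (the actual directives are given in Figs. 2–3), and `query_llm` is a hypothetical stand-in for whichever LLM endpoint is used.

```python
CONTEXT_DIRECTIVE = (
    "You are given a question about a traffic scene and its correct answer. "
    "Produce a short, plausible but incorrect answer that still matches the question."
)  # illustrative paraphrase, not the exact directive from Fig. 2

def generate_attack_texts(qa_pairs, query_llm):
    """Sketch of Eq. (5) generalized to K question-answer pairs: for each (q_i, a_i),
    direct an LLM l(.) to produce a misleading answer a_hat_i constrained by q_i."""
    attacks = []
    for question, answer in qa_pairs:                      # K pairs (q_i, a_i)
        prompt = (
            f"{CONTEXT_DIRECTIVE}\n"
            f"Question: {question}\n"
            f"Correct answer: {answer}\n"
            f"Misleading answer:"
        )
        attacks.append(query_llm(prompt).strip())          # a_hat_i = l(q_i, a_i)
    return attacks

# Usage: attacks = generate_attack_texts([("How many pedestrians are there?", "2")], query_llm)
```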

4.2 Augmentations of Typographic Attack

Inspired by the success of instruction-prompting methodologies [37, 38] and the greedy reasoning behavior of LLMs [39], and to further exploit the ambiguity between textual and visual tokens in Vision-LLMs, we propose to augment the typographic attack prompts within images by explicitly providing instruction keywords that emphasize text-to-text alignment over visual-language alignment. Our approach realizes this concept in the form of instructional directives: ❶ command directives that emphasize a false answer and ❷ conjunction directives that additionally include attack clauses. In particular, we have developed the following.

  • Command Directive. By embedding commands with the attacks, we aim to prompt the Vision-LLMs into greedily producing erroneous answers. Our work investigates the "ANSWER:" directive as a prefix before the first attack prompt.

  • Conjunction Directive. Conjunctions and connectors (or the lack thereof) link separate attack clauses so that the overall text appears more coherent, thereby increasing the likelihood of multi-task success. In our work, we investigate these directives as "AND," "OR," "WITH," or simply empty spaces as prefixes between attack prompts.

While other forms of directives can also be useful for enhancing the attack success rate, we focus on investigating basic directives related to typographic attacks in this work.
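A minimal sketch of how the command and conjunction directives can be composed into a single typographic string; the defaults mirror the "ANSWER:" and "AND"/"OR"/"WITH" keywords investigated above, and the helper name is purely illustrative.

```python
def augment_attacks(attack_texts, command="ANSWER:", conjunction="AND"):
    """Sketch of the command and conjunction directives: prefix the first attack with a
    command keyword and join multi-task attacks with a conjunction (or a plain space)."""
    joiner = f" {conjunction} " if conjunction else " "
    return f"{command} " + joiner.join(attack_texts)

# e.g., augment_attacks(["there are 5 cars", "turn left immediately"], conjunction="WITH")
# -> "ANSWER: there are 5 cars WITH turn left immediately"
```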

4.3 Realizations of Typographic Attacks

Digitally, typographic attacks embed text within images to fool Vision-LLMs, which can be as simple as overlaying text onto an image. Physically, typographic attacks can incorporate real elements (e.g., stickers, paint, and drawings) into environments or entities observable by AI systems, with AD systems being prime examples. This includes placing text with unusual fonts or colors on streets, objects, vehicles, or clothing to mislead AD systems in reasoning, planning, and control. We investigate Vision-LLMs incorporated into AD systems, as they are likely at the greatest risk from typographic attacks. We categorize the placement locations into backgrounds and foregrounds of traffic scenes.

  • Backgrounds, which refer to elements in the environment that are static and pervasive in a traffic scene (e.g., streets, buildings, and bus stops). The background components present predefined locations for introducing deceptive typographic elements of various sizes.

  • Foregrounds, which refer to dynamic elements and directly interact with the perception of AD systems (e.g., vehicles, cyclists, and pedestrians). The foreground components present dynamic and variable locations for typographic attacks of various sizes.

In our work, foreground placements are supported by an open-vocabulary object detector [40] to flexibly extract box locations of specific targets. Let $\mathbf{A} = \hat{\mathbf{a}}_1 \| \dots \| \hat{\mathbf{a}}_K$ be the typographic concatenation of attacks and $\mathbf{A}'$ its augmented version; whether on background or foreground, the function $\tau(\cdot)$ inpaints $\mathbf{A}$ or $\mathbf{A}'$ into the image $\mathbf{v}$ within the cropped box coordinates $(x_{min}, y_{min}, x_{max}, y_{max})$.
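The following is a minimal sketch of how $\tau(\cdot)$ can be realized digitally with PIL: the (augmented) attack string is rendered into a chosen box, where the box is either a predefined background region or a detector-produced foreground region; the font path and sizing heuristic are illustrative assumptions.

```python
from PIL import Image, ImageDraw, ImageFont

def typographic_inpaint(image_path, attack_text, box, font_path="arial.ttf"):
    """Sketch of tau(.): render the concatenated attack string A (or its augmented
    version A') into the cropped box (x_min, y_min, x_max, y_max) of the image v.
    For foreground placement, the box would come from an open-vocabulary detector."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    x_min, y_min, x_max, y_max = box
    # Scale the font so the text roughly spans the box width (illustrative heuristic).
    font_size = max(10, (x_max - x_min) // max(1, len(attack_text)) * 2)
    font = ImageFont.truetype(font_path, font_size)
    draw.rectangle(box, fill="white")                     # plain backdrop for legibility
    draw.text((x_min, y_min), attack_text, fill="black", font=font)
    return img
```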

Depending on the attacked task, we observe that different text placements and observed sizes render some attacks more effective while others are negligible. Our research shows that background-placement attacks are quite effective against scene reasoning and action reasoning, but not as effective against scene object reasoning unless foreground placements are also included.

5 Experiments

Figure 4: Example attacks against Imp and GPT4 on the CVPRW’24 dataset.

5.1 Experimental Setup

We perform experiments with Vision-LLMs on VQA datasets for AD, such as LingoQA [7] and the dataset of the CVPRW’2024 Challenge (https://cvpr24-advml.github.io) built with the CARLA simulator. We used LLaVA [2] to generate the attack prompts for LingoQA and the CVPRW’2024 dataset, and crafted some cases of the latter manually. Regarding LingoQA, we tested 1000 QAs in real traffic scenarios on tasks such as scene reasoning and action reasoning. Regarding the CVPRW’2024 Challenge dataset, we tested more than 300 QAs on 100 images, each with at least three questions related to scene reasoning (e.g., target counting) and scene object reasoning over 5 classes (cars, persons, motorcycles, traffic lights, and road signals). Our evaluation metrics are based on exact matches, Lingo-Judge Accuracy [7], BLEURT [41], and BERTScore [42] against non-attacked answers, with SSIM (Structural Similarity Index) quantifying the similarity between original and attacked images. In terms of models, we qualitatively and/or quantitatively tested LLaVA [2], VILA [1], Qwen-VL [17], and Imp [18]. The models were run on an NVIDIA A40 GPU with approximately 45 GiB of memory.
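For reference, a minimal sketch of how the text- and image-level metrics could be computed with off-the-shelf libraries (exact match, BERTScore via the bert-score package, and SSIM via scikit-image); Lingo-Judge and BLEURT require dedicated checkpoints and are omitted here. Inputs are assumed to be lists of answer strings and PIL images.

```python
import numpy as np
from bert_score import score as bert_score          # pip install bert-score
from skimage.metrics import structural_similarity   # pip install scikit-image

def evaluate_attack(clean_answers, attacked_answers, clean_img, attacked_img):
    """Sketch of the evaluation: compare attacked answers against non-attacked answers
    (lower agreement = more effective attack) and measure image similarity with SSIM."""
    exact = np.mean([a.strip().lower() == b.strip().lower()
                     for a, b in zip(clean_answers, attacked_answers)])
    _, _, f1 = bert_score(attacked_answers, clean_answers, lang="en")
    ssim = structural_similarity(np.asarray(clean_img.convert("L")),
                                 np.asarray(attacked_img.convert("L")))
    return {"Exact": exact, "BERTScore": f1.mean().item(), "SSIM": ssim}
```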

Table 2: Ablation study of our automatic attack strategy effectiveness. Lower scores mean more effective attacks, with (auto) denoting automatic attacks.

                       LingoQA                                          CVPRW'24 (counting only)
Model     Type   Exact↓   Lingo-Judge↓   BLEURT↓   BERTScore↓   Exact↓   Lingo-Judge↓   BLEURT↓   BERTScore↓
Qwen-VL   auto   0.3191   0.3330         0.5460    0.6861       0.1950   0.1950         0.6267    0.7936
Imp       auto   0.5244   0.4755         0.6398    0.7790       0.1900   0.1700         0.6194    0.7983
VILA      auto   0.4744   0.5415         0.6462    0.7717       0.1700   0.1750         0.7052    0.8362
LLaVa     auto   0.5053   0.4021         0.5771    0.7435       0.3450   0.3450         0.7524    0.8781
Table 3: Ablation of attack effectiveness on the CVPRW'24 dataset's counting subtask. Lower scores mean more effective attacks, with (single) denoting a single-question attack, (composed) a multi-task attack, and (+a) augmentation with directives.

Model     Attack Type   Exact↓   Lingo-Judge↓   BLEURT↓   BERTScore↓
Qwen-VL   single        0.4000   0.3300         0.6890    0.8508
          single+a      0.3950   0.3350         0.6786    0.8354
          composed      0.0400   0.0400         0.5931    0.7998
          composed+a    0.0700   0.0700         0.5563    0.7327
Imp       single        0.4850   0.3500         0.7032    0.8490
          single+a      0.4800   0.3600         0.6870    0.8402
          composed      0.0360   0.0300         0.5733    0.7954
          composed+a    0.0850   0.0800         0.5919    0.8047
VILA      single        0.4650   0.4300         0.7642    0.8796
          single+a      0.4800   0.4600         0.7666    0.8871
          composed      0.0300   0.0300         0.6474    0.8121
          composed+a    0.0950   0.0950         0.6633    0.8221
LLaVa     single        0.3900   0.3900         0.7641    0.8893
          single+a      0.4100   0.4100         0.7714    0.8929
          composed      0.0100   0.0100         0.6303    0.8549
          composed+a    0.1400   0.1400         0.6758    0.8694
Table 4: Ablation of both image-level (counting) and patch-level (target recognition) attack strategy effectiveness on the CVPRW'24 dataset. Lower scores mean more effective attacks, with (naive patch) denoting typographic attacks directly on a specific target, (composed) denoting multi-task attacks on both the specific target and at the image level, and (+a) denoting augmentation with directives.

Model     Attack Type   Exact↓   Lingo-Judge↓   BLEURT↓   BERTScore↓
Qwen-VL   naive patch   0.2291   0.2088         0.3996    0.6442
          composed      0.1316   0.1088         0.3451    0.6247
          composed+a    0.0582   0.0303         0.2947    0.5718
Imp       naive patch   0.1607   0.0860         0.5291    0.7838
          composed      0.1620   0.1114         0.5728    0.8092
          composed+a    0.1215   0.0658         0.5014    0.7674
VILA      naive patch   0.4025   0.0810         0.5241    0.7238
          composed      0.1455   0.0506         0.5288    0.7687
          composed+a    0.0873   0.0329         0.5062    0.7498
LLaVa     naive patch   0.2443   0.1949         0.5482    0.8208
          composed      0.0708   0.0443         0.5161    0.7376
          composed+a    0.0481   0.0278         0.4928    0.8152

5.1.1 Attacks on Scene/Action Reasoning

As shown in Tab. 2, Fig. 4, and Fig. 5, our attack framework can effectively misdirect various models’ reasoning. For example, Tab. 2 showcases an ablation study on the effectiveness of automatic attack strategies across two datasets: LingoQA and CVPRW’24 (focused solely on counting). The first two metrics (i.e., Exact and Lingo-Judge) better evaluate semantic correctness, showing that short answers, such as those in the counting task, can be easily misled, whereas longer, more complex answers in LingoQA may be more difficult to change. For example, the attack against Qwen-VL scores 0.3191 under the Exact metric on LingoQA, noticeably higher (i.e., less effective) than the corresponding 0.1950 on the counting subtask. On the other hand, the latter two scores (i.e., BLEURT and BERTScore) are typically high, hinting that while our attack misleads semantic reasoning, even the wrong answers may still align decently with human-written references.

In terms of scene reasoning, we show in Tab. 3, Tab. 4, and Fig. 4 the effectiveness of our proposed attack in a number of cases. For example, in Fig. 4, a Vision-LLM can answer queries about a clean image fairly accurately, but a typographically attacked input can make it fail, e.g., to accurately count people and vehicles, and we show that an augmented typographic attack can even mislead stronger models (e.g., GPT4 [43]). In Fig. 5, we also show that scene reasoning can be misdirected to focus on irrelevant details and to hallucinate under typographic attacks. Our results also suggest that scene object reasoning / grounded object reasoning is typically more robust, as both object-level and image-level attacks may be needed to change the models’ answers.

In terms of action reasoning, we show in Fig. 5 that Vision-LLMs can be misled into recommending terrible advice that suggests unsafe driving practices. Nevertheless, we observe a promising sign: although Qwen-VL initially recommended fatal advice, it reconsidered during its reasoning process and acknowledged the potential dangers of the initial suggestion. These examples demonstrate the vulnerabilities of automated reasoning processes under deceptive or manipulated conditions, but they also suggest that defensive learning can be applied to enhance model reasoning.

5.1.2 Compositions and Augmentations of Attacks

Table 5: Ablation study of our composition keywords, their attack locations on an image, and their overall effectiveness under the metric defined in the CVPRW'24 Challenge (https://challenge.aisafety.org.cn/#/competitionDetail?id=13). Scores are for composed+a attacks evaluated across QwenVL, Imp, and GPT4.

Keyword (location)   empty (top)   AND (top)   OR (top)   OR (bottom)   WITH (top)   WITH (bottom)   combined (bottom)
Score                48.08         46.97       47.24      50.54         51.33        51.02           53.56

We showed that composing multiple QA tasks into one attack is possible for a particular scenario, thereby suggesting that typographic attacks are not limited to the single-task attacks considered by previous works. Furthermore, we found that augmentations of attacks are possible, which implies that typographic attacks leveraging the inherent language modeling process can misdirect the reasoning of Vision-LLMs, as shown especially in the case of the strong GPT-4. However, as shown in Tab. 3, it may be challenging to search for the best augmentation keywords.

5.1.3 Towards Physical Typographic Attacks

In our toy experiments with semi-realistic attacks in Fig. 5, we show that attacks involving the manipulation of text in real-world settings are potentially dangerous due to their ease of implementation, such as on signs, behind vehicles, on buildings, on billboards, or on any everyday object that an AD system might perceive and interpret to make decisions. For instance, modifying the text on a road sign from "stop" to "go faster" can have dangerous consequences for AD systems that rely on Vision-LLMs.

Figure 5: Example attacks on the LingoQA dataset against Qwen-VL-7B.

6 Conclusion

Our research has developed a comprehensive typographic attack framework designed for benchmarking Vision-LLMs in AD systems, exploring their adoption, the potential impacts on decision-making autonomy, and the methods by which these attacks can be physically implemented. Firstly, our dataset-agnostic framework is capable of automatically generating misleading responses that misdirect the reasoning of Vision-LLMs. Secondly, our linguistic formatting scheme is shown to strengthen attacks and can be extended to target multiple reasoning tasks simultaneously. Thirdly, our study on the practical implementation of these attacks in physical traffic scenarios is critical for highlighting the need for defense models. Our empirical findings on the effectiveness, transferability, and realizability of typographic attacks in traffic environments highlight their effects on existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA). This research underscores the urgent need for increased awareness within the community regarding vulnerabilities associated with integrating Vision-LLMs into AD systems.

Limitations. One of the primary limitations of our typographic attack framework lies in its dependency on environmental control and predictability. While our framework demonstrates the vulnerability of Vision-LLMs to typographic manipulations in controlled settings, the variability and unpredictability of real-world traffic scenarios can significantly diminish the consistency and reproducibility of the attacks. Additionally, our attacks assume that AD systems do not evolve to recognize and mitigate such manipulations, which may not hold true as defensive technologies advance. Another limitation is the ethical concern of testing and deploying such attacks, which could potentially endanger public safety if not managed correctly. This necessitates a careful approach to research and disclosure to ensure that knowledge of vulnerabilities does not lead to malicious exploitation.

Safeguards. To safeguard against the vulnerabilities exposed by typographic attacks, it is essential to develop robust defensive mechanisms within AD systems. While defensive techniques remain understudied in the current literature, there are ways forward to mitigate potential issues. Concurrent works investigate how better prompting can support better reasoning to defend against such attacks [16], and how keyword training of Vision-LLMs can make these systems more resilient by conditioning their answers on specific prefixes [15]. Another basic approach is to detect and remove all non-essential texts in the visual information, as sketched below. Overall, it is necessary to foster a community-wide effort toward establishing standards and best practices for the secure deployment of Vision-LLMs into AD.
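As one illustration of the text-removal idea, here is a minimal sketch (not a method from this paper or the cited defenses) that detects text regions with Tesseract OCR and blurs them before the frame reaches the Vision-LLM; a deployed system would additionally need to whitelist essential text such as road signs.

```python
import pytesseract                                   # pip install pytesseract (requires Tesseract)
from PIL import Image, ImageFilter

def scrub_scene_text(img, min_conf=60):
    """Sketch of a baseline defense: locate OCR-detected text regions and blur them
    out of the frame before passing it to the Vision-LLM."""
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) >= min_conf:
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            region = img.crop((x, y, x + w, y + h)).filter(ImageFilter.GaussianBlur(8))
            img.paste(region, (x, y))
    return img
```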

Broader Impacts. The implications of our research into typographic attacks extend beyond the technical vulnerabilities of AD systems, touching on broader societal, ethical, and regulatory concerns. As Vision-LLMs and AD technologies proliferate, the potential for such attacks underscores the need for comprehensive safety and security frameworks that anticipate and mitigate unconventional threats. This research highlights the interplay between technology and human factors, illustrating how seemingly minor alterations in a traffic environment can lead to significant misjudgments by AD systems, potentially endangering public safety.

References

  • [1] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024.
  • [2] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  • [3] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. MM-LLMs: Recent advances in multimodal large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
  • [4] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F. Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 577–593. Springer, 2018.
  • [5] Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. LMDrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024.
  • [6] Can Cui, Zichong Yang, Yupeng Zhou, Yunsheng Ma, Juanwu Lu, Lingxi Li, Yaobin Chen, Jitesh Panchal, and Ziran Wang. Personalized autonomous driving with large language models: Field experiments, 2024.
  • [7] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, and Oleg Sinavski. LingoQA: Video question answering for autonomous driving. arXiv preprint arXiv:2312.14115, 2023.
  • [8] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. arXiv preprint, 2023.
  • [9] Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. LLM4Drive: A survey of large language models for autonomous driving. CoRR, abs/2311.01043, 2023.
  • [10] Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this image? a safety evaluation benchmark for vision LLMs. arXiv preprint arXiv:2311.16101, 2023.
  • [11] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
  • [12] Gabriel Goh, Nick Cammarata †, Chelsea Voss †, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. https://distill.pub/2021/multimodal-neurons.
  • [13] Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language model. CoRR, abs/2402.19150, 2024.
  • [14] Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, and Bryan A. Plummer. Vision-LLMs can fool themselves with self-generated typographic attacks. CoRR, abs/2402.00626, 2024.
  • [15] Hiroki Azuma and Yusuke Matsui. Defense-prefix for preventing typographic attacks on CLIP. In IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Workshops, Paris, France, October 2-6, 2023, pages 3646–3655. IEEE, 2023.
  • [16] Hao Cheng, Erjia Xiao, and Renjing Xu. Typographic attacks in large multimodal models can be alleviated by more informative prompts. arXiv preprint arXiv:2402.19150, 2024.
  • [17] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
  • [18] Zhenwei Shao, Xuecheng Ouyang, Zhenbiao Gai, Zhou Yu, and Jun Yu. Imp: An empirical study of multimodal small language models, 2024.
  • [19] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied multimodal language model, 2023.
  • [20] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8B: A multimodal architecture for ai agents, 2024.
  • [21] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: an open, billion-scale corpus of images interleaved with text. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [22] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [24] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
  • [25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [26] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial intelligence safety and security, pages 99–112. Chapman and Hall/CRC, 2018.
  • [27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [28] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019.
  • [29] Xiaosen Wang, Xuanran He, Jingdong Wang, and Kun He. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16158–16167, 2021.
  • [30] Jianping Zhang, Jen-tse Huang, Wenxuan Wang, Yichen Li, Weibin Wu, Xiaosen Wang, Yuxin Su, and Michael R Lyu. Improving the transferability of adversarial samples by path-augmented method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8173–8182, 2023.
  • [31] Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E Hopcroft. Nesterov accelerated gradient and scale invariance for adversarial attacks. arXiv preprint arXiv:1908.06281, 2019.
  • [32] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4312–4321, 2019.
  • [33] Zeyu Qin, Yanbo Fan, Yi Liu, Li Shen, Yong Zhang, Jue Wang, and Baoyuan Wu. Boosting the transferability of adversarial attacks with reverse adversarial perturbation. Advances in neural information processing systems, 35:29845–29858, 2022.
  • [34] Sensen Gao, Xiaojun Jia, Xuhong Ren, Ivor Tsang, and Qing Guo. Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory. arXiv preprint arXiv:2403.12445, 2024.
  • [35] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  • [36] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.
  • [37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • [38] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024.
  • [39] Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In The Eleventh International Conference on Learning Representations, 2023.
  • [40] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • [41] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of ACL, 2020.
  • [42] Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020.
  • [43] OpenAI team. GPT-4 technical report, 2024.