AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Abstract
Vision-Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs’ robustness to real-world 3D variations, we propose AdvDreamer, the first framework that generates physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view images. AdvDreamer integrates advanced generative techniques with two key innovations and aims to characterize the worst-case distributions of 3D variations from natural images. To ensure adversarial effectiveness and method generality, we introduce an Inverse Semantic Probability objective that performs adversarial optimization in the fundamental vision-text alignment space, making it generalizable across different VLM architectures and downstream tasks. To mitigate the distribution discrepancy between generated and real-world samples while maintaining physical reproducibility, we design a Naturalness Reward Model that provides regularization feedback during adversarial optimization, preventing convergence towards hallucinated or unnatural elements. Leveraging AdvDreamer, we establish MM3DTBench, the first VQA dataset for benchmarking VLMs’ robustness to 3D variations. Extensive evaluations of representative VLMs with diverse architectures highlight that 3D variations in the real world may pose severe threats to model performance across various tasks.
1 Introduction
Recently, Vision-Language Models (VLMs) [34, 33, 7, 13] have demonstrated remarkable capabilities in bridging visual perception and natural language understanding. Through large-scale image-text pre-training followed by fine-grained instruction tuning [34, 29], these models effectively tackle a wide range of visual-centric tasks, including instruction-based image recognition [42, 58, 48], visual question answering [3, 34, 33, 29], and visual reasoning [10, 9]. Due to their strong generalizability and impressive zero-shot capability, VLMs are increasingly deployed in various safety-critical applications, particularly in autonomous driving [49, 56] and robotic systems [23, 57].

Despite extensive studies validating VLMs’ resilience against various distribution shifts (e.g., style variations [60, 2], image corruptions [2], and adversarial perturbations [16, 59, 62]), these investigations primarily focus on 2D perturbations in the digital domain, overlooking a critical challenge in real-world deployment: 3D variations. As VLMs are increasingly integrated into dynamic, physical environments, their ability to handle 3D transformations becomes crucial, raising a fundamental question: Do current VLMs truly possess sufficient robustness to handle distribution shifts arising from real-world 3D variations?
To systematically evaluate VLMs’ robustness under 3D variations, accurately characterizing real-world Adversarial 3D Transformation (Adv-3DT) examples is crucial. However, this task faces three fundamental challenges: 1) Scene-Prior Scarcity. Existing approaches heavily rely on explicit 3D representations [4, 21, 37] or multi-view observations [18, 45] to capture worst-case 3D variations through optimization-based methods. However, such rich prior information is rarely available in real-world scenarios, where typically only a few observations are accessible. This practical constraint motivates our exploration of a zero-shot approach to generating Adv-3DT samples. 2) Distribution Discrepancy. Current methods struggle to generate Adv-3DT samples that preserve visual fidelity and contextual coherence. Methods utilizing handcrafted 3D assets [4, 21] often suffer from noticeable texture discrepancies with real-world scenarios, while neural reconstruction-based approaches [18, 45] leveraging NeRF [39] or 3DGS [25] either operate in isolated object settings or struggle with complex backgrounds. These limitations yield samples that deviate from natural data distributions, undermining the reliability of robustness evaluations. 3) Generalizability Requirements. Traditional methods [18, 4] usually involve task-specific objectives with end-to-end optimization, limiting their applicability across diverse VLM architectures and tasks. Thus, a universal optimization method with task-agnostic objectives is essential.
To address these challenges, we propose AdvDreamer, a novel framework for capturing real-world Adv-3DT samples with task generalizability. In AdvDreamer, we design an innovative generative pipeline that enables controllable 3D transformations of individual natural images in a zero-shot manner. This design directly tackles the scarcity of scene priors by introducing generative 3D priors, and lays the foundation for further adversarial optimization.
To effectively identify the worst-case 3D transformations, AdvDreamer formulates rigid 3D transformations as an adversarial attack, capturing their underlying distribution through query-based adversarial optimization. We propose two key innovations to guide this process. First, to tackle the challenge of distribution discrepancy and ensure physical reproducibility [18] of Adv-3DT samples, we propose the Naturalness Reward Model (NRM). It leverages DINOv2’s [41] rich visual context as feedback to regulate the optimization trajectory, preventing the optimized distribution from converging toward unnatural or hallucinatory elements. For NRM training, we assemble a dataset of 100K samples with random 3D variations, each annotated with naturalness scores by GPT-4o to ensure alignment with human judgment. Second, we introduce the Inverse Semantic Probability (ISP) objective, designed to minimize the semantic alignment probability between generated Adv-3DT samples and their ground-truth textual descriptions. It operates solely through the foundational components of VLMs: the vision and text encoders (e.g., the OpenAI CLIP ViT-L/14 [42] used in LLaVa [34]). This design focuses the attack on the fundamental vision-text alignment space rather than model-specific layers or task-specific heads, ensuring AdvDreamer’s generalizability across different VLM architectures and downstream tasks.
Through extensive digital and physical experiments, we demonstrate that Adv-3DT samples generated by AdvDreamer can effectively deceive the most advanced VLM systems, including the GPT family. These samples exhibit reliable reproducibility in physical scenarios, as depicted in Fig. 1. In summary, this paper makes the following contributions:
(1) We propose AdvDreamer, a novel framework that generates real-world Adv-3DT samples from a single natural image. It eliminates the dependency on explicit 3D assets and multi-view samples that previous methods relied upon. Based on extensive digital and physical experiments, we systematically reveal, for the first time, the vulnerability of VLMs under extreme 3D variations.
(2) We introduce a Naturalness Reward Model to maintain sample naturalness during optimization, ensuring distribution alignment with real-world images and physical reproducibility, and an Inverse Semantic Probability objective to guarantee the transferability of Adv-3DT samples across different VLM architectures and downstream tasks.
(3) We construct the Multimodal 3D Transformation VQA Benchmark (MM3DTBench), comprising the most challenging Adv-3DT samples generated by AdvDreamer or reproduced from physical environments. We provide a comprehensive evaluation across 13 leading VLMs, detailing their performance and vulnerabilities.
2 Related Work
2.1 Vision-Language Models
Classic vision-language models (VLMs) [42, 30, 36, 28] achieve cross-modal alignment through diverse pre-training objectives on large-scale image-text pairs [31, 46]. Recent advances in large language models (LLMs) [52, 53] have inspired new VLM architectures that integrate LLM capabilities via projection layers [33, 34] or specialized modules such as the Q-Former [29]. Through techniques like instruction tuning [34], current VLMs [33, 34, 13] demonstrate enhanced performance across vision-centric tasks, including visual grounding, question answering, and image captioning. These models also enable intelligent systems for task planning and decision-making [23, 57].
Despite their impressive capabilities [42, 35, 59], concerns persist regarding their robustness and generalization in corner cases, particularly given their widespread application in safety-critical environments. Recent studies [51] highlight that VLMs still exhibit deficiencies in processing visual details such as orientation, quantity, and color, which may be attributed to their visual representations. Moreover, VLMs remain susceptible to ℓp-norm adversarial perturbations [16, 38], with some efforts aiming to enhance their robustness through adversarial fine-tuning techniques [38]. In contrast to previous works, our study specifically investigates VLM robustness against 3D variations, addressing a crucial gap in understanding their reliability for dynamic real-world applications.

2.2 Evaluation for 3D Variation Robustness
Robustness and invariance to 3D variations remain a longstanding challenge in computer vision. Evaluating this property typically involves capturing worst-case 3D transformation parameters, such as adversarial poses or viewpoints, to generate challenging samples for assessment. Prior research can be categorized into three approaches (Fig. 2). (A) Human-guided methods build benchmarks like ObjectNet [8] and OOD-CV [61] by manually collecting samples with diverse poses and viewpoints. While this approach captures real-world scenarios, it lacks systematic coverage of worst-case examples and incurs high collection costs. (B) Simulator-based methods [4, 21] optimize worst-case 3D parameters through differentiable rendering of synthetic objects. However, this approach requires manually designed 3D assets and produces samples with limited realism. (C) Multi-view-based methods like ViewFool [18] and GMFool [45] leverage NeRF-based [39] implicit representations to optimize adversarial viewpoints. While avoiding explicit 3D modeling, these methods require multiple views and struggle with background integration.
In contrast, the proposed AdvDreamer adopts a (D) generative approach that exploits the robust 3D priors of generative models. Given a single natural image, AdvDreamer synthesizes realistic adversarial 3D transformations in a zero-shot manner, offering a more elegant solution for evaluating model robustness in real-world scenarios.

3 Methodology
3.1 Problem Formulation
- Parametrization of Real-World 3D Variations. We primarily focus on rigid 3D transformations, i.e., rotation, translation, and scaling, since they reflect the most typical 3D changes observed in real-world environments. Formally, we define a 6-dimensional vector θ = (α, β, γ, t_x, t_y, s) to uniquely parameterize an arbitrary rigid transformation in 3D space, where α, β, and γ denote the Tait-Bryan angles (yaw, pitch, roll), t_x and t_y represent the translation along the x and y axes in the xy-plane, and s is the uniform scaling factor. To ensure the captured images remain recognizable by humans [18], we constrain θ within a bounded range [θ_min, θ_max].
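To make the parameterization concrete, the following NumPy sketch maps such a 6-D vector to a homogeneous transformation matrix; the symbol names, the yaw-pitch-roll composition order, and the example values are illustrative assumptions rather than the paper's exact conventions.

```python
import numpy as np

def pose_matrix(alpha, beta, gamma, tx, ty, s):
    """Map the 6-D parameters (yaw, pitch, roll, x/y translation, uniform scale)
    to a 4x4 homogeneous transformation. The z-y-x composition order is an
    illustrative assumption."""
    cz, sz = np.cos(alpha), np.sin(alpha)   # yaw about z
    cy, sy = np.cos(beta),  np.sin(beta)    # pitch about y
    cx, sx = np.cos(gamma), np.sin(gamma)   # roll about x
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    T = np.eye(4)
    T[:3, :3] = s * (Rz @ Ry @ Rx)          # scaled rotation
    T[:3, 3] = [tx, ty, 0.0]                # in-plane translation only
    return T

theta = [0.3, -0.1, 0.05, 0.2, -0.1, 1.1]   # one example vector inside a bounded range
print(pose_matrix(*theta))
```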
- Optimization Problem. The goal of AdvDreamer is to find an optimal adversarial distribution p*(θ), ensuring that Adv-3DT samples drawn from this distribution are both aggressive against VLMs and visually realistic. This can be abstractly formulated as the following optimization problem:
p*(θ) = argmax_{p(θ)} E_{θ∼p(θ)} [ L( G(x, θ) ) ],   s.t.  θ ∈ [θ_min, θ_max]    (1)
where G(x, θ) denotes the generation process that produces a novel image by applying the specified transformation θ to the original natural image x. To implement G, we design a pipeline incorporating state-of-the-art generative techniques, detailed in Sec. 3.2. L denotes the comprehensive loss function guiding the optimization, which assesses both the adversarial effectiveness and the visual realism of the generated samples. We elaborate on the design of each loss component in Sec. 3.3 and Sec. 3.4.
Optimizing over the distribution p(θ) rather than a deterministic θ offers two key advantages. 1) As highlighted in [18, 45], learning the underlying distribution enables comprehensive exploration of the 3D variation space, facilitating the identification of the model’s vulnerable regions. 2) Given the stochastic nature of the generation process, optimizing over a continuous distribution mitigates the impact of appearance inconsistency, as appearance variations induced by generation are unlikely to consistently deceive the model unless the adversarial 3D variations themselves are optimized [6].
3.2 Overview of AdvDreamer
Generating Adv-3DT samples requires accurate 3D scene understanding from 2D images, which has been a long-standing challenge. Fortunately, recent advances in generative AI provide elegant solutions to this problem. As shown in Fig. 3, we design the generation process G in a zero-shot manner. In the following, we detail the critical stages of AdvDreamer.
Stage 0. Foreground-Background Pair Preparation. We start by decomposing the input image x into a foreground-background pair (x_f, x_b). Specifically, we leverage Grounded-SAM [43] to semantically segment the major instance as the foreground x_f, followed by diffusion-based inpainting [44, 15] to obtain a complete background x_b. We denote this process as (x_f, x_b) = Dec(x). Additionally, we support directly generating synthetic pairs through Stable Diffusion [44] to enhance sample diversity.
Stage 1. Adversarial Pose Manipulation. In each optimization iteration, we employ the proposed Adversarial Pose Manipulator (AdvPM) to transform the foreground x_f into a batch of posed foregrounds under transformations θ sampled from the current distribution p(θ). AdvPM leverages the robust priors provided by TripoSR [50], a state-of-the-art Large Reconstruction Model (LRM), to construct a single-view-based 3D representation and apply the sampled transformations. This process is formulated as x_f^θ = R(x_f, θ; w_R), where R denotes TripoSR’s rendering function with weights w_R.
Stage 2. Image Re-Composition. Finally, we compose the transformed foreground x_f^θ with the background x_b through AnyDoor [12], a diffusion-based composition model that effectively handles shadows and occlusions. This process is defined as x' = C(x_f^θ, x_b; w_C), where w_C denotes AnyDoor’s weights. Combining Stages 0-2, the complete generation process can be formally represented as follows:
G(x, θ) = C( R(x_f, θ; w_R), x_b; w_C ),   with (x_f, x_b) = Dec(x)    (2)
Stage 3. Loss Calculation and Distribution Optimization. For each composed sample x', we compute its loss L(x') to guide the adversarial distribution update. By iteratively executing Stages 1-3, AdvDreamer converges to the optimal distribution p*(θ). Sampling from p*(θ) and applying the generation process yields diverse Adv-3DT samples; furthermore, the sample generated at the distribution center corresponds to the worst-case Adv-3DT sample.
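The Python sketch below summarizes the Stage 0-3 loop; decompose, render_3d, compose, and total_loss are hypothetical stand-ins for Grounded-SAM with inpainting, TripoSR, AnyDoor, and the losses of Sec. 3.3-3.4, so the snippet illustrates the control flow rather than the actual generative models.

```python
import numpy as np

# Hypothetical stand-ins for the heavy generative components of AdvDreamer:
# Grounded-SAM + inpainting (Stage 0), TripoSR (Stage 1), AnyDoor (Stage 2),
# and the ISP + naturalness losses (Stage 3).
def decompose(x):          return x, x                     # (foreground, background)
def render_3d(x_f, theta): return x_f                      # apply the 3D transform
def compose(x_f_t, x_b):   return x_f_t                    # diffusion re-composition
def total_loss(x_adv):     return float(np.random.rand())  # adversarial + naturalness

def generate(x, theta):
    """Stages 0-2: decompose, transform the foreground, re-compose."""
    x_f, x_b = decompose(x)
    x_f_t = render_3d(x_f, theta)
    return compose(x_f_t, x_b)

class ToyDistribution:
    """Toy Gaussian over the 6-D transform parameters (placeholder for Sec. 3.5)."""
    def __init__(self, dim=6):
        self.mu, self.sigma = np.zeros(dim), np.ones(dim)
    def sample(self, n):
        return self.mu + self.sigma * np.random.randn(n, len(self.mu))
    def update(self, thetas, losses):
        # Crude update: nudge the mean toward the highest-loss candidate.
        self.mu = 0.8 * self.mu + 0.2 * thetas[int(np.argmax(losses))]

def optimize(x, dist, n_iters=15, batch_size=10):
    """Stage 3 loop: query the pipeline and update the adversarial distribution."""
    for _ in range(n_iters):
        thetas = dist.sample(batch_size)
        losses = [total_loss(generate(x, th)) for th in thetas]
        dist.update(thetas, losses)
    return dist

dist = optimize(np.zeros((256, 256, 3)), ToyDistribution())
print(dist.mu)
```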
3.3 Inverse Semantic Probability Objective
The Inverse Semantic Probability (ISP) objective aims to minimize the probability that VLMs assign to the ground-truth semantic attributes of Adv-3DT samples. It reflects two key design considerations. 1) It relies solely on the visual encoder E_v and text encoder E_t, which are fundamental components shared across modern VLM architectures, ensuring architecture-agnostic applicability. 2) By operating in the vision-text alignment space rather than on model-specific layers or task-specific heads, it enables task generalization.
Given a generated sample x' and a semantic label set Y = {y_1, ..., y_K} containing the ground-truth label y_gt, we first compute a semantic similarity vector s = (s_1, ..., s_K):
s_i = cos( E_v(x'), E_t( T(y_i) ) ),   i = 1, ..., K    (3)
where cos(·, ·) computes cosine similarity and T(·) applies the prompt template "a photo of a {}" from [42]. The vector s captures the representation similarity between x' and each semantic label. We then derive the conditional semantic probability through a softmax transformation:
P(y_i | x') = exp(s_i) / Σ_{j=1}^{K} exp(s_j)    (4)
Then, the ISP loss L_ISP is defined as the negative log-likelihood of the ground-truth’s semantic probability:
L_ISP(x') = −log P(y_gt | x')    (5)
Intuitively, L_ISP measures how likely x' is to be misclassified by comparing its similarity to the ground-truth semantics versus other, irrelevant ones. Maximizing this term pushes the adversarial distribution towards regions that effectively mislead the model.
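A minimal PyTorch sketch of Eqs. (3)-(5), assuming pre-computed, L2-normalized CLIP-style embeddings for the generated sample and the prompted labels; the tensor shapes and toy inputs below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def isp_loss(image_emb: torch.Tensor, text_embs: torch.Tensor, gt_index: int) -> torch.Tensor:
    """Inverse Semantic Probability loss.

    image_emb: (D,) L2-normalized embedding of the generated sample x'.
    text_embs: (K, D) L2-normalized embeddings of "a photo of a {label}" prompts.
    gt_index:  index of the ground-truth label within the label set.
    """
    sims = text_embs @ image_emb            # (K,) cosine similarities, Eq. (3)
    probs = F.softmax(sims, dim=0)          # semantic probabilities, Eq. (4)
    return -torch.log(probs[gt_index])      # negative log-likelihood, Eq. (5)

# Toy usage with random embeddings; in practice they come from the frozen
# vision/text encoders of a surrogate VLM (e.g., OpenCLIP ViT-B/16).
D, K = 512, 80
img = F.normalize(torch.randn(D), dim=0)
txt = F.normalize(torch.randn(K, D), dim=1)
print(isp_loss(img, txt, gt_index=3).item())
```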

3.4 Naturalness Reward Model
Despite the 3D priors and realistic image effects provided by generative models, we observe that directly optimizing L_ISP often leads to unnatural artifacts, such as shape distortions and physically implausible cases (see Fig. 7). These issues stem from two factors: the generalization limitations of generative models and the inherently high semantic deviation of low-quality samples. To address this challenge, we propose a novel Naturalness Reward Model (NRM) that continuously regularizes the optimization based on the samples’ naturalness, preventing convergence to low-quality "pseudo-optimal" regions.
We utilize GPT-4o to create a large-scale training set by automatically annotating the naturalness of 100K carefully selected samples covering various 3D variations. The annotation considers two criteria: visual realism and physical plausibility, each quantified on a 5-point scale. To ensure annotation quality, we conduct manual verification and correct any obvious scoring errors. As shown in Fig. 4, we use DINOv2 [41] as the backbone network to extract robust and rich visual features, followed by a dual-stream prediction head for realism and plausibility scoring:
( ŝ_real, ŝ_phys ) = ( h_real( f_DINO(x') ), h_phys( f_DINO(x') ) )    (6)
where h_real and h_phys are implemented as two-layer MLPs trained with cross-entropy loss, and f_DINO(·) denotes the DINOv2 feature extractor. Our empirical evaluation demonstrates that NRM achieves high alignment with human assessment (see Appendix B). Based on NRM’s predictions, we define the naturalness regularization loss as:
L_nat(x') = ŝ_real + ŝ_phys    (7)
The L_nat term guides the optimization towards adversarial samples that maintain high visual and physical quality.
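Below is a sketch of such a dual-stream reward head on top of pooled DINOv2 features; the feature dimension, hidden width, and the expected-score readout over the five levels are assumptions for illustration, not the exact NRM configuration.

```python
import torch
import torch.nn as nn

class NaturalnessRewardModel(nn.Module):
    """Dual-stream head over frozen DINOv2 features: one 5-way classifier for
    visual realism and one for physical plausibility (trained with CE loss)."""
    def __init__(self, feat_dim: int = 768, hidden: int = 256, n_levels: int = 5):
        super().__init__()
        def head():
            return nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_levels))
        self.realism_head = head()
        self.plausibility_head = head()

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim) pooled DINOv2 backbone features of x'.
        real_logits = self.realism_head(feats)
        phys_logits = self.plausibility_head(feats)
        # Read out a scalar score in [1, 5] as the expectation over the levels
        # (one possible choice; an argmax readout would also be reasonable).
        levels = torch.arange(1, 6, dtype=feats.dtype, device=feats.device)
        s_real = (real_logits.softmax(-1) * levels).sum(-1)
        s_phys = (phys_logits.softmax(-1) * levels).sum(-1)
        return s_real, s_phys

nrm = NaturalnessRewardModel()
s_real, s_phys = nrm(torch.randn(4, 768))
print(s_real.shape, s_phys.shape)
```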
Target Models | #Params | ImageNet [17] Clean | ImageNet Random | ImageNet Adv. (dist.) | ImageNet Adv. (center) | Synthesis Clean | Synthesis Random | Synthesis Adv. (dist.) | Synthesis Adv. (center)
---|---|---|---|---|---|---|---|---|---
OpenCLIP ViT-B/16 [24] | 149M | 98.0 | 62.6 | 54.0 | 18.0 | 98.0 | 89.9 | 86.0 | 62.3 |
OpenCLIP ViT-L/14 [24] | 428M | 94.4 | 61.7 | 50.9 | 15.3 | 98.4 | 89.2 | 83.7 | 62.3 |
OpenCLIP ViT-G/14 [24] | 2.5B | 96.4 | 63.5 | 53.5 | 18.7 | 98.4 | 89.4 | 86.0 | 62.7 |
BLIP ViT-B/16 [28] | 583M | 83.0 | 56.0 | 51.3 | 17.3 | 92.7 | 80.4 | 78.6 | 54.7 |
BLIP-2 ViT-L/14 FlanT5XL [29] | 3.4B | 84.0 | 57.0 | 52.2 | 18.7 | 92.7 | 84.5 | 80.9 | 58.0 |
BLIP-2 ViT-G/14 FlanT5XL [29] | 4.1B | 81.0 | 55.7 | 49.1 | 15.7 | 93.4 | 85.2 | 82.2 | 61.3 |

3.5 Query-Based Black-box Optimization
To solve Problem (1), we parameterize the adversarial distribution p(θ) as a multivariate Gaussian. Specifically, we apply the following change of random variable:
θ = a ⊙ tanh(z) + b    (8)
where a = (θ_max − θ_min)/2, b = (θ_max + θ_min)/2, and z ∈ R^6 follows a Gaussian distribution with mean μ and covariance matrix Σ. Thus, we can rewrite Problem (1) in the more specific form:
max_{μ, Σ} E_{z∼N(μ, Σ)} [ L( G(x, a ⊙ tanh(z) + b) ) ]    (9)
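Before turning to the optimizer, a small NumPy sketch of this reparameterized sampling: draw z from the Gaussian and squash it element-wise into the bounded transformation range. The example bounds are illustrative, and the tanh squashing follows the common practice of [18].

```python
import numpy as np

def sample_theta(mu, Sigma, theta_min, theta_max, n=10, rng=np.random.default_rng(0)):
    """Draw n transformation vectors: z ~ N(mu, Sigma), squashed into [theta_min, theta_max]."""
    z = rng.multivariate_normal(mu, Sigma, size=n)      # (n, 6) unconstrained samples
    half_range = (theta_max - theta_min) / 2.0
    center = (theta_max + theta_min) / 2.0
    return center + half_range * np.tanh(z)             # bounded theta samples

# Illustrative bounds: +/- 30 degrees of rotation, small in-plane shifts, 0.8-1.2x scale.
mu, Sigma = np.zeros(6), 0.25 * np.eye(6)
theta_min = np.array([-np.pi/6, -np.pi/6, -np.pi/6, -0.2, -0.2, 0.8])
theta_max = np.array([ np.pi/6,  np.pi/6,  np.pi/6,  0.2,  0.2, 1.2])
print(sample_theta(mu, Sigma, theta_min, theta_max).shape)
```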
The solution to this problem typically involves computing the gradient of the loss w.r.t. the distribution parameters μ and Σ and updating them via gradient ascent. However, since the forward process involves multiple model components, including the diffusion model, gradient propagation through the pipeline is unreliable, making gradient-based optimization challenging [47]. Therefore, we adopt the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [22, 19], an efficient query-based black-box optimizer. At each iteration t, we: 1) sample candidates {z_i} from N(μ_t, Σ_t); 2) generate the corresponding samples {x'_i}; 3) update the distribution parameters using the following strategy:
μ_{t+1} = μ_t + η_μ Σ_{i=1}^{k} w_i ( z_{(i)} − μ_t )    (10)
Σ_{t+1} = (1 − η_Σ) Σ_t + η_Σ p_c p_c^T    (11)
here, z_{(i)} denotes the candidates ranked by their losses: we first select the top-10 samples according to L_nat, then choose the top-k (k = 5) with the highest L_ISP for the update. The w_i are importance weights, η_μ and η_Σ are learning rates, p_c tracks the evolution path, and the step size σ is updated via Cumulative Step-size Adaptation (CSA) [22]. Due to space constraints, complete derivations of Eqs. (10) and (11) and the hyperparameter settings are provided in Appendix A. We provide pseudocode for the optimization process in Algorithm 1.
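A sketch of the query-based loop using the off-the-shelf pycma implementation of CMA-ES; the two-stage ranking above is approximated here by a single scalar fitness that penalizes low-naturalness candidates, and generate_sample, isp_loss, and naturalness are hypothetical hooks into the pipeline rather than real APIs.

```python
import numpy as np
import cma  # pip install cma

# Hypothetical hooks into the AdvDreamer pipeline (Secs. 3.2-3.4).
def generate_sample(z): return z                              # G(x, a*tanh(z)+b) placeholder
def isp_loss(x_adv):    return float(np.linalg.norm(x_adv))   # L_ISP placeholder
def naturalness(x_adv): return 5.0                            # NRM score placeholder

es = cma.CMAEvolutionStrategy(np.zeros(6), 0.5, {"popsize": 10, "maxiter": 15})
while not es.stop():
    candidates = es.ask()                        # sample z_i from N(mu_t, Sigma_t)
    fitness = []
    for z in candidates:
        x_adv = generate_sample(np.asarray(z))
        f = -isp_loss(x_adv)                     # pycma minimizes, so negate L_ISP
        if naturalness(x_adv) < 3.0:             # crude stand-in for the naturalness
            f += 1e3                             # pre-selection step
        fitness.append(f)
    es.tell(candidates, fitness)                 # CMA-ES mean/covariance update
mu_star = es.result.xbest                        # approximate distribution center
print(mu_star)
```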
4 Experiments
Models | #Params | Sample | COCO Caption [11] CIDEr | COCO SPICE | COCO B@4 | COCO GPT-Score | NoCaps [1] CIDEr | NoCaps SPICE | NoCaps B@4 | NoCaps GPT-Score | VQA-V2 [20] GPT-Acc (%)
---|---|---|---|---|---|---|---|---|---|---|---
BLIP-2 [29] | 3.4B | Clean | 133.4 | 22.8 | 36.0 | 24.7 | 98.1 | 14.1 | 44.1 | 24.4 | 53.4
 | | Adv. (dist.) | 83.8 | 16.7 | 23.3 | 18.9 | 55.7 | 9.3 | 26.9 | 17.1 | 31.7
 | | Adv. (center) | 72.2 | 14.7 | 21.5 | 17.0 | 40.0 | 7.3 | 22.4 | 15.0 | 28.0
LLaVa-1.5 [34] | 7B | Clean | 126.1 | 25.7 | 30.4 | 25.0 | 118.7 | 16.7 | 50.6 | 24.7 | 68.6
 | | Adv. (dist.) | 85.8 | 16.7 | 22.9 | 18.9 | 60.0 | 9.7 | 29.6 | 17.4 | 54.3
 | | Adv. (center) | 73.2 | 14.4 | 20.5 | 16.5 | 45.8 | 8.2 | 25.5 | 15.6 | 53.8
LLaVa-1.6 [33] | 13B | Clean | 41.6 | 11.2 | 6.9 | 22.9 | 30.5 | 6.9 | 10.8 | 22.6 | 71.2
 | | Adv. (dist.) | 21.9 | 6.5 | 5.3 | 16.9 | 13.6 | 4.0 | 6.5 | 15.7 | 53.0
 | | Adv. (center) | 16.7 | 5.5 | 4.6 | 15.4 | 10.5 | 3.0 | 4.3 | 13.5 | 46.9
Otter [26] | 7B | Clean | 127.0 | 22.6 | 34.9 | 24.0 | 76.4 | 12.1 | 29.7 | 22.5 | 53.9
 | | Adv. (dist.) | 67.9 | 13.8 | 17.8 | 17.2 | 45.4 | 7.6 | 19.2 | 16.2 | 44.9
 | | Adv. (center) | 58.9 | 13.0 | 17.5 | 15.8 | 36.0 | 6.1 | 15.7 | 14.6 | 42.0
MiniGPT-4 [64] | 7B | Clean | 78.5 | 23.3 | 23.4 | 24.0 | 60.8 | 17.2 | 28.5 | 24.7 | 51.3
 | | Adv. (dist.) | 52.3 | 16.4 | 16.3 | 18.1 | 32.9 | 10.2 | 18.4 | 16.7 | 41.0
 | | Adv. (center) | 42.2 | 14.1 | 12.9 | 16.2 | 26.6 | 8.3 | 14.9 | 15.0 | 34.3
GPT-4o-mini [40] | - | Clean | 23.7 | 9.6 | 3.0 | 23.3 | 15.7 | 5.6 | 2.9 | 23.0 | 68.6
 | | Adv. (dist.) | 12.5 | 5.2 | 1.2 | 18.0 | 7.8 | 3.6 | 0.6 | 17.2 | 43.4
 | | Adv. (center) | 9.6 | 4.2 | 1.6 | 16.3 | 7.3 | 2.9 | 0.4 | 15.5 | 42.7
GPT-4o [40] | - | Clean | 51.2 | 14.2 | 9.7 | 25.0 | 40.7 | 10.9 | 11.7 | 24.9 | 75.4
 | | Adv. (dist.) | 26.4 | 9.0 | 4.6 | 17.7 | 19.1 | 6.5 | 6.5 | 17.0 | 50.7
 | | Adv. (center) | 20.3 | 7.3 | 4.7 | 16.1 | 15.8 | 5.5 | 5.4 | 14.6 | 50.3
Target Model | Clean | OpenCLIP [24] surrogate, Adv. (dist.) | OpenCLIP surrogate, Adv. (center) | BLIP [28] surrogate, Adv. (dist.) | BLIP surrogate, Adv. (center)
---|---|---|---|---|---
ALBEF [27] | 65.0 | 37.2 | 27.7 | 40.2 | 25.7 |
OpenAI CLIP ViT-B/16 [42] | 85.3 | 48.7 | 29.7 | 52.8 | 31.0 |
OpenCLIP ViT-B/16 [24] | 97.7 | 54.0 | 18.0 | 59.3 | 34.7 |
Meta-CLIP ViT-B/16 [55] | 91.0 | 49.3 | 30.0 | 53.8 | 29.7 |
BLIP ViT-B/16 [28] | 82.7 | 48.4 | 30.7 | 51.3 | 17.3 |
BLIP-2 ViT-L/14 [29] | 84.0 | 50.0 | 32.0 | 52.3 | 27.7 |
SigLIP ViT-B/16 [58] | 95.3 | 55.3 | 34.0 | 59.3 | 34.7 |
4.1 Experimental Setup
- Tasks & Datasets. We evaluate AdvDreamer on three representative vision-language tasks. For zero-shot image classification, we use the ImageNet test set [17] and synthetic images generated via Stable Diffusion v2 [44] based on ImageNet categories. For image captioning, we use the test splits of COCO Caption [11] and NoCaps [1], while Visual Question Answering (VQA) is evaluated on the VQAv2 [20] test set. For each dataset, we carefully select 150-300 images with clear instances to ensure high-quality Adv-3DT generation. We use the ImageNet-1K categories as the semantic label set for classification, and the COCO-80 categories for captioning and VQA.
- Victim VLMs. For classification, we target representative vision-language foundation models: OpenCLIP [14, 24], BLIP [28], and BLIP-2 [29], and consider additional models for the transfer attack. For the remaining tasks, we select mainstream open-source VLMs (the LLaVa series [34, 32], MiniGPT-4 [64], Otter [26], etc.) and the most advanced commercial VLMs (GPT-4o and GPT-4o-mini [40]).
- Methods & Baselines. In our experiments, we design two attack protocols with AdvDreamer: Adv. (dist.) reports the average result over 10 Adv-3DT samples randomly drawn from the optimal adversarial distribution, while Adv. (center) reports the result on the Adv-3DT sample corresponding to the distribution center. Additionally, we establish two baselines: the original natural images without transformation (Clean) and the average result over 10 images with random 3D transformations (Random).
- Evaluation Metrics. For classification tasks, we report Top-1 accuracy and its relative drop compared to clean samples. For captioning, beyond common metrics (CIDEr, SPICE, and BLEU), we follow recent works [34, 63, 59] and employ LLM-as-a-judge evaluation. Specifically, we use text-only GPT-4 to assess three aspects on a 0-10 scale: semantic accuracy, tone confidence, and overall coherence; their sum yields a 30-point GPT-Score that is more aligned with human judgment. For VQA, we adopt GPT-Acc, which uses the same LLM-based evaluation to assess answer correctness. All evaluation prompts are detailed in Appendix C.
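As a small illustration of the GPT-Score aggregation, the sketch below parses a judge reply and sums the three 0-10 aspect scores into the 30-point total; the prompt wording and reply format are assumptions for illustration, not the exact prompts from Appendix C.

```python
import re

JUDGE_PROMPT = (
    "Rate the candidate caption against the reference on three aspects, "
    "each from 0 to 10: semantic accuracy, tone confidence, overall coherence. "
    "Reply as 'accuracy=<n>, confidence=<n>, coherence=<n>'.\n"
    "Reference: {ref}\nCandidate: {cand}"
)

def gpt_score(judge_reply: str) -> int:
    """Sum the three 0-10 aspect scores into the 30-point GPT-Score."""
    scores = [int(m) for m in re.findall(r"=\s*(\d+)", judge_reply)]
    assert len(scores) == 3, "expected three aspect scores"
    return sum(scores)

print(gpt_score("accuracy=7, confidence=5, coherence=6"))  # -> 18
```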
4.2 Research Questions (RQs) and Results
- RQ1: Do current VLMs exhibit sufficient robustness to 3D variations? Take-away: "Emphatically not. Current VLMs demonstrate severe vulnerability to worst-case 3D variations, with significant performance degradation across various vision-language tasks." As shown in Tab. 1, in zero-shot classification, Adv-3DT samples generated by AdvDreamer significantly compromise VLMs’ performance, causing accuracy drops of up to 80%. This substantially outperforms the Random baseline, validating the effectiveness of our optimization strategy. Notably, samples at the distribution center (Adv. (center)) induce more severe performance drops than those sampled from the adversarial distribution (Adv. (dist.)), suggesting the existence of extremely vulnerable points in the transformation space.
This vulnerability extends to more complex downstream tasks. For image captioning (Tab. 3), Adv-3DT samples effectively degrade model performance across all metrics, with nearly 50% reductions in GPT-Score across different VLMs. The consistent degradation in both conventional metrics and the human-aligned GPT-Score indicates a fundamental limitation in VLMs’ 3D understanding capabilities. For VQA, even GPT-4o, which achieves the highest clean accuracy (75.4%), exhibits a substantial 25% accuracy drop under Adv-3DT attacks. Fig. 5 presents visualizations of Adv-3DT samples generated by AdvDreamer, with more examples in Appendix D.
Sample | Zero-shot Cls. (OpenCLIP-B) Acc. | Zero-shot Cls. Conf. | VQA (LLaVa-1.5-7B) Acc.1 | VQA Acc.2
---|---|---|---|---
Natural | 100.0 | 0.892 | 83.3 | 100.0
Adv-3DT (AdvDreamer) | 0.0 | 0.003 | 25.0 | 25.0
Adv-3DT (Physical) | 51.3 | 0.413 | 33.6 | 45.1

- RQ2: Do adversarial 3D variations transfer across different VLMs? Take-away: "Yes. Adv-3DT samples demonstrate strong transferability across various VLMs, revealing shared vulnerable regions in the 3D variation space." As shown in Tab. 3, we examine transferability using OpenCLIP and BLIP as surrogate models to optimize p*(θ), then evaluate six target VLMs with diverse architectures. The transfer attacks maintain strong effectiveness, with performance degradation comparable to direct attacks. Notably, Adv. (dist.) exhibits smaller transfer gaps than Adv. (center), suggesting that the learned distribution better captures generalizable vulnerable regions. The transferability also extends to captioning and VQA tasks (Tab. 2). Even when using only OpenCLIP as the surrogate model, Adv-3DT samples effectively compromise all target VLMs except BLIP-2 and LLaVa-1.5, which use their own encoders. We attribute this universal transferability to a fundamental limitation: current pretraining datasets fail to cover diverse 3D variations, resulting in shared performance deficiencies under certain transformation regions.

- RQ3: Are Adv-3DT samples captured by AdvDreamer reproducible in the physical world? Take-away: "Yes. AdvDreamer successfully identifies physically reproducible adversarial 3D transformations that significantly degrade VLM performance in real-world scenarios." We conduct physical-world experiments on 12 common objects from traffic and household environments. Our protocol consists of three steps: 1) capturing natural-state images of the objects, 2) generating Adv-3DT samples using AdvDreamer with OpenCLIP as the surrogate model, and 3) physically reproducing the adversarial 3D transformations captured by AdvDreamer through video recordings. This process yields 3,807 frames of physical Adv-3DT samples.
We evaluate effectiveness through both zero-shot classification and VQA tasks. For VQA, we design two protocols: multiple-choice categorization (Acc.1) and binary verification (Acc.2). As shown in Tab. 4, physically reproduced Adv-3DT samples significantly degrade VLM performance: zero-shot classification accuracy drops to 51.3%, while VQA performance deteriorates to 33.6% (Acc.1) and 45.1% (Acc.2). These results confirm that AdvDreamer successfully captures physically realizable adversarial 3D transformations. However, we observe a moderate performance gap between digital and physical attacks, attributable to the data distribution gap and camera fluctuations. Fig. 5 visualizes representative examples from the physical experiments, with additional results provided in Appendix D.
4.3 Ablation Studies and Additional Analysis
- Convergence Analysis. We analyze the attack success rate (ASR) of AdvDreamer against OpenCLIP (ViT-B/16) and BLIP (ViT-B/16) across optimization iterations. As shown in Fig. 6 (A), AdvDreamer’s effectiveness consistently improves with more iterations and converges after 15 steps. Based on this observation, we set the default number of iterations to 15 throughout our experiments to balance effectiveness and computational efficiency.
- The Necessity of NRM. Fig. 7 presents a qualitative comparison between Adv-3DT samples generated with and without NRM. The results demonstrate the effectiveness of NRM’s naturalness feedback in enhancing Adv-3DT sample quality and mitigating distribution discrepancy, yielding samples with higher visual quality and better alignment with natural image distributions.
- Does the optimal distribution effectively characterize the worst-case 3D variations? We conduct this analysis on ImageNet using five randomly selected natural images. For each image, we generate 100 Adv-3DT samples by scaling the variance of its optimal distribution p*(θ) with different ratios. Fig. 6 (B) shows that the ASR consistently decreases as the sampling range expands beyond the optimal distribution. This inverse relationship confirms that AdvDreamer successfully identifies the most vulnerable regions in the 3D variation space.
4.4 MM3DTBench

We present MM3DTBench for benchmarking the robustness of VLMs to 3D variations. It contains 215 challenging Adv-3DT samples generated by AdvDreamer, along with their physical reproductions. We design two evaluation protocols: a multiple-choice task (Choice) and a free-form description task (Free answer), with accuracy computed via text-only GPT-4 judgment. More details of MM3DTBench are provided in Appendix E. Our evaluation of 13 representative VLMs reveals clear robustness gaps. As shown in Fig. 8, even top-performing models (GPT-4o [40], CogVLM [54], and Intern-VL [13]) achieve only limited success, while most models struggle with accuracy below 50%. Notably, Claude-3 [5], despite its recognized excellence, achieves an accuracy of only 30.0%. This performance correlates with models’ general out-of-distribution robustness measured by MultiTrust-RO [59], suggesting a fundamental limitation of current VLMs. We encourage the adoption of MM3DTBench in future VLM development to advance security evaluation in dynamic scenarios.
5 Conclusion
This paper proposed AdvDreamer, a novel framework for generating physically reproducible and realistic adversarial 3D transformation samples from single-view images. Our comprehensive evaluation revealed critical robustness gaps in existing VLMs, highlighting the urgent need to enhance VLMs’ 3D variation perception and understanding capabilities in safety-critical applications. Through MM3DTBench, we established the first benchmark for worst-case 3D variations, aiming to facilitate the development of more robust vision-language systems for real-world deployment.
References
- Agrawal et al. [2019] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019.
- Al-Tahan et al. [2024] Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, and Mark Ibrahim. Unibench: Visual reasoning requires rethinking vision-language beyond scaling. arXiv preprint arXiv:2408.04810, 2024.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- Alcorn et al. [2019] Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4845–4854, 2019.
- Anthropic [2024] Anthropic. claude-3. https://www.anthropic.com/claude, 2024.
- Athalye et al. [2018] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293. PMLR, 2018.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Barbu et al. [2019] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019.
- Chen et al. [2024a] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024a.
- Chen et al. [2024b] Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, and Ziwei Liu. Large language models are visual reasoning coordinators. Advances in Neural Information Processing Systems, 36, 2024b.
- Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Chen et al. [2024c] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024c.
- Chen et al. [2024d] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024d.
- Cherti et al. [2023] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
- Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In ICLR 2023 (Eleventh International Conference on Learning Representations), 2023.
- Cui et al. [2024] Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24625–24634, 2024.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Dong et al. [2022] Yinpeng Dong, Shouwei Ruan, Hang Su, Caixin Kang, Xingxing Wei, and Jun Zhu. Viewfool: Evaluating the robustness of visual recognition to adversarial viewpoints. In Neural Information Processing Systems (NeurIPS), 2022.
- Golovin et al. [2017] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and David Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1487–1495, 2017.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
- Hamdi and Ghanem [2020] Abdullah Hamdi and Bernard Ghanem. Towards analyzing semantic robustness of deep neural networks. In ECCV, pages 22–38. Springer, 2020.
- Hansen [2016] Nikolaus Hansen. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
- Huang et al. [2023] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
- Ilharco et al. [2021] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. openclip, 2021.
- Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
- Li et al. [2023a] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
- Li et al. [2021] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
- Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
- Li et al. [2019] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024a.
- Liu et al. [2024b] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b.
- Liu et al. [2024c] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024c.
- Liu et al. [2024d] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024d.
- Lu et al. [2019] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
- Madan et al. [2020] Spandan Madan, Timothy Henry, Jamell Dozier, Helen Ho, Nishchal Bhandari, Tomotake Sasaki, Frédo Durand, Hanspeter Pfister, and Xavier Boix. When and how do cnns generalize to out-of-distribution category-viewpoint combinations? arXiv preprint arXiv:2007.08032, 2020.
- Mao et al. [2022] Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016, 2022.
- Mildenhall et al. [2020] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421, 2020.
- openai [2024] openai. Gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024.
- Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Ren et al. [2024] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
- Ruan et al. [2023] Shouwei Ruan, Yinpeng Dong, Hang Su, Ning Chen, and Xingxing Wei. Towards viewpoint-invariant visual recognition via adversarial training. In ICCV, pages 1–10, 2023.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Sun et al. [2023] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
- Tian et al. [2024] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
- Tochilkin et al. [2024] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
- Tong et al. [2024] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. [2023] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
- Xu et al. [2024a] Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In The Twelfth International Conference on Learning Representations, 2024a.
- Xu et al. [2024b] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024b.
- Yokoyama et al. [2024] Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024.
- Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- Zhang et al. [2024a] Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, and Jun Zhu. Benchmarking trustworthiness of multimodal large language models: A comprehensive study, 2024a.
- Zhang et al. [2024b] Yabin Zhang, Wenjie Zhu, Chenhang He, and Lei Zhang. Lapt: Label-driven automated prompt tuning for ood detection with vision-language models. Proceedings of the european conference on computer vision (ECCV), 2024b.
- Zhao et al. [2022] Bingchen Zhao, Shaozuo Yu, Wufei Ma, Mingxin Yu, Shenxiao Mei, Angtian Wang, Ju He, Alan Yuille, and Adam Kortylewski. Ood-cv: A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images. In European conference on computer vision, pages 163–180. Springer, 2022.
- Zhao et al. [2023] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.