
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

Siwen Jiao1,3*, Yangyi Fang2*, Baoyun Peng4*†, Wangqun Chen4, Bharadwaj Veeravalli1, Xulei Yang3
1National University of Singapore
2Tsinghua University
3Agency for Science, Technology and Research, Singapore
4Advanced Institute of Big Data, Beijing
Abstract

Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This results in the loss of critical details and difficulty in effectively integrating spatial and temporal information, undermining the fine-grained perception and temporal coherence essential for effective decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis to focus on motion-related features, thereby boosting computational efficiency. Our method achieves 168-fold token compression while maintaining strong performance, a significant improvement over traditional approaches. The core of LaVida Drive consists of two modules: the Query-aware Token Selection module and the Spatial-Temporal Token Recovery and Enhancement module. The former dynamically selects the most relevant visual tokens based on semantic alignment with the input query, reducing the token count from the high-resolution spatial input. The latter ensures smooth and coherent interactions between spatial and temporal information, preserving contextual continuity across frames. Extensive experiments on various autonomous driving question-answering benchmarks show that LaVida Drive significantly reduces visual tokens, enhances efficiency, and improves overall performance.

* Equal contribution.  † Corresponding author.

1 Introduction

Figure 1: Comparison between LaVida Drive and existing VLM-based autonomous driving methods. (Left): Traditional methods use a ViT projector with simple monocular input and downsampling. (Right): LaVida Drive efficiently captures high-resolution details while maintaining motion perception, replacing the projector with enhanced modules that leverage encoder-level tokens for spatial and temporal enhancement.

Recent advancements in large-scale pre-training have positioned VLMs as pivotal for VQA in autonomous driving, enabling intuitive human-vehicle interactions through natural language [9, 13, 3, 19, 24, 11]. VLMs facilitate the seamless integration of visual and linguistic information, allowing vehicles to comprehend and respond to complex queries in real-time, interpreting dynamic environments quickly and significantly improving the overall performance and reliability of the system [11, 19, 22].

Despite significant advancements, existing approaches predominantly focus on static images or videos and rely on low-resolution inputs to manage computational costs, leading to the loss of critical high-resolution details and difficulty in effectively integrating spatial and temporal information [7, 30, 20, 25]. This is particularly problematic in dynamic driving environments, where downsampling impairs fine-grained perception and temporal coherence, hindering effective decision-making [11, 22, 20, 26]. Balancing efficiency and accuracy in high-resolution, multi-frame settings for both static perception and motion detection significantly increases inference costs, posing a major challenge in VLM development [22, 24, 26, 18].

To address these challenges, we propose LaVida Drive, an innovative VQA framework designed to support fine-grained perception of high-resolution visual inputs in dynamic driving environments while integrating temporal information. Specifically, for spatial processing, the framework retains high-resolution inputs to capture rich details and uses lower resolution for temporal processing of motion-related features, thereby reducing computational load without compromising visual accuracy. However, maintaining high-resolution spatial inputs across multiple viewpoints significantly increases the number of tokens, leading to substantial inference overhead in VLMs. To handle this, we introduce a Query-aware Token Selection mechanism, which dynamically selects visual tokens highly relevant to the input query based on semantic content, enabling adaptive token filtering and significantly easing the computational burden [11, 26]. Since token selection can disrupt spatial coherence and break contextual relationships between tokens, we further introduce a Spatial-Temporal Token Enhancement module that uses cross-attention to maintain coherence across spatial and temporal contexts, enabling smooth and consistent multi-frame information transfer.

We validate LaVida Drive across multiple autonomous driving VQA benchmarks, showing significant enhancements in image-text alignment and multi-modal information processing. Our model reduces visual tokens by 50% to 84%, improving inference efficiency while maintaining performance. Key contributions include:

  • We propose a novel and efficient VQA framework that seamlessly integrates temporal data into high-resolution spatial inputs, enhancing computational efficiency and detailed visual perception.

  • We propose a novel query-aware token selection mechanism that dynamically extracts the key information needed for question answering, demonstrating its effectiveness in balancing computational cost and performance.

  • We propose a token enhancement mechanism that integrates multi-modal and multi-scale information, ensuring smooth and coherent interactions between spatial and temporal information and preserving contextual continuity across multiple frames.

2 Related Works

Recent advancements in autonomous driving have leveraged the intersection of vision and LLMs, driving improvements in both perception and decision-making capabilities. The literature can be categorized into two main areas: vision-based LLMs for autonomous driving and QA systems in autonomous driving.

2.1 Vision-based LLMs for Autonomous Driving

The integration of vision and language models has shown great promise in enhancing the perception capabilities of autonomous vehicles, enabling them to better understand and navigate complex driving environments. Early work in this domain includes CLIP-based methods [16], which pair visual representations with textual descriptions, enabling a richer understanding of the vehicle’s surroundings. Recent studies, such as those by [31] and [28], have proposed large multi-modal models incorporating visual and textual inputs to support decision-making. These models benefit from pretraining on large-scale datasets and have shown improvements in tasks such as scene interpretation and predicting vehicle behaviours in dynamic traffic scenarios.

The application of transformer-based models to vision-language fusion has also led to promising developments in autonomous driving. For instance, [5] developed a model that combines deep vision transformers with large-scale language models, which improves decision-making capabilities by enhancing the vehicle’s ability to generate complex driving plans. These models process real-time visual input while leveraging pre-trained knowledge to interpret high-level cues, such as road conditions and traffic regulations. Recently, [29] demonstrated the ability of multi-modal LLMs to engage in reasoning tasks in an end-to-end autonomous driving framework, allowing the vehicle to handle novel driving situations that require both visual and linguistic reasoning.

Furthermore, the advent of VLMs has opened new avenues for enhancing autonomous driving systems. For example, [13] introduced NuScenes-QA, a benchmark for VQA in autonomous driving scenarios, which addresses the complexities of multi-modal data and real-time acquisition. Similarly, [19] proposed DriveLM, a VLM-based approach that integrates web-scale data to boost generalization and interactivity with human users. These advancements highlight the potential of VLMs to address the nuanced challenges of autonomous driving, such as understanding dynamic environments and making informed decisions in real time.

2.2 Question Answering Systems for Autonomous Driving

QA systems have played a crucial role in improving human-vehicle interaction and facilitating autonomous decision-making. In autonomous driving, these systems help vehicles process natural language queries and provide context-aware answers based on visual inputs and pre-existing knowledge. For example, [27] developed a visual QA system that combines convolutional neural networks with language models to allow autonomous vehicles to answer questions about nearby objects and road conditions. This system allows passengers to ask real-time questions and receive accurate, context-specific responses.

Further advances in contextual question answering [4] have significantly enhanced the vehicle’s ability to interpret complex driving scenarios. By utilizing multi-modal input, these systems provide more accurate answers to questions regarding traffic flow, pedestrian movement, or vehicle proximity. In addition, dialogue-based QA systems have gained traction in recent years, enabling more dynamic interactions between drivers and vehicles. For instance, [8] introduced a conversational QA framework, where vehicles answer questions and can engage in multi-turn dialogues, adjusting their responses based on evolving traffic conditions and user preferences. This enables smoother communication between passengers and vehicles, improving the overall driving experience and safety.

More recent work by [23] explores hybrid models that combine rule-based reasoning with large-scale language models, allowing vehicles to simulate human-like reasoning during real-time decision-making in complex environments more accurately. Their work focuses on providing accurate and safe driving suggestions during ambiguous driving situations, such as when encountering unforeseen obstacles or pedestrians. Additionally, the integration of LLMs into QA systems has shown significant promise. For example, [3] proposed a unique object-level multi-modal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. This approach not only enhances the interpretability of driving actions but also demonstrates the potential of LLM-based driving action generation compared with traditional behavioural cloning methods.

3 Method

3.1 Architecture Overview

Figure 2: Framework of LaVida Drive. High-resolution images are divided into 224×224 patches and processed by the image encoder to extract semantic features. A token-level similarity matrix aligns image tokens with text tokens, enabling query-aware token selection. Selected tokens are enhanced through a token-wise attention mechanism in the enhancement module, utilizing auxiliary branches for restoration without increasing token count. The final tokens and original text embeddings are then fed into the large language model.

As illustrated in Fig. 2, the architecture of LaVida Drive comprises three core components: the Multi-modal Encoder Cluster, the Query-aware Token Selection Module, and the Spatial-temporal Token Enhancement Module. The model processes inputs from three modalities: image data from the autonomous vehicle’s multiview cameras, video data, and natural language instructions provided by the user. Different from previous approaches, we employ multiple encoders to handle various input modalities, forming a Multi-modal Encoder Cluster to better address the unique requirements of each data source. All encoders are frozen. Specifically, each encoder processes data in a predefined format:

Text Encoder: Our text encoder employs the CLIP text encoder, leveraging its powerful feature extraction capabilities obtained through large-scale text-image contrastive learning. For an input text sequence $\mathbf{X}_{\text{text}}$ containing $L_{\text{text}}$ tokens, the encoder processes each token embedding and maps the entire sequence to a semantic space. The output of the text encoder is a matrix of shape $L_{\text{text}} \times d$, represented as:

$\mathbf{E}_{\text{text}} = \text{TextEncoder}(\mathbf{X}_{\text{text}}) \in \mathbb{R}^{L_{\text{text}} \times d}$. (1)
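For concreteness, the sketch below shows this text-encoding step using the Hugging Face transformers CLIP implementation; the checkpoint name and the example query are illustrative assumptions, not values specified in the paper.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative checkpoint; the paper only states that a frozen CLIP text encoder is used.
ckpt = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(ckpt)
text_encoder = CLIPTextModel.from_pretrained(ckpt).eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)  # all encoders are frozen

query = "Is there a pedestrian crossing in front of the ego vehicle?"  # example query
tokens = tokenizer(query, return_tensors="pt", padding=True)

with torch.no_grad():
    out = text_encoder(**tokens)

E_text = out.last_hidden_state  # shape (1, L_text, d), matching Eq. (1)
print(E_text.shape)
```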

Image Encoder: The image encoder also employs the CLIP visual encoder, with a base resolution of 224×224 pixels. This encoder efficiently maps visual data into a rich semantic space and is divided into a main branch and a support branch, each optimized for different aspects of image representation.

Image Encoder Main Branch: For the main branch, an input image $\mathbf{X}_{\text{img}}$ of size $H \times W \times C$ is first divided into $N = \frac{H \times W}{P^{2}}$ patches of size $P \times P \times C$. Each patch is flattened into a vector of length $P^{2} \times C$, forming a patch sequence of dimension $N \times (P^{2} \times C)$. The main branch generates embeddings of shape $L_{\text{main}} \times d$ from the penultimate layer of the CLIP visual encoder, represented as:

$\mathbf{E}_{\text{main}} = \text{MainBranchEncoder}(\mathbf{X}_{\text{img}}) \in \mathbb{R}^{L_{\text{main}} \times d}$. (2)

Image Encoder Support Branch: To supplement the context loss due to patch segmentation in the main branch, the support branch directly processes the downsampled entire image $\mathbf{X}_{\text{img}}'$ of size $224 \times 224 \times C$. The support branch also generates embeddings of shape $L_{\text{support}} \times d$ from the penultimate layer of the CLIP visual encoder, represented as:

$\mathbf{E}_{\text{support}} = \text{SupportBranchEncoder}(\mathbf{X}_{\text{img}}') \in \mathbb{R}^{L_{\text{support}} \times d}$. (3)
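A minimal sketch of the two image branches is given below, again using the Hugging Face CLIP vision encoder. The 4×7 grid of 224×224 patches and the use of the penultimate hidden layer follow the paper, while the checkpoint name and the unnormalized random pixels are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

ckpt = "openai/clip-vit-base-patch32"  # assumed checkpoint
vision_encoder = CLIPVisionModel.from_pretrained(ckpt).eval()
for p in vision_encoder.parameters():
    p.requires_grad_(False)

def encode(pixel_values):
    """Return penultimate-layer embeddings, dropping the CLS token."""
    out = vision_encoder(pixel_values=pixel_values, output_hidden_states=True)
    return out.hidden_states[-2][:, 1:, :]  # (B, 49, d) per 224x224 crop

# Main branch: split one high-resolution camera view into 224x224 patches (4x7 grid, Sec. 4.1).
img = torch.rand(3, 4 * 224, 7 * 224)                              # normalized pixels assumed
patches = img.unfold(1, 224, 224).unfold(2, 224, 224)              # (3, 4, 7, 224, 224)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 224, 224)  # (28, 3, 224, 224)
E_main = encode(patches).reshape(1, -1, vision_encoder.config.hidden_size)  # (1, L_main, d), Eq. (2)

# Support branch: the whole view downsampled to the base resolution.
E_support = encode(F.interpolate(img[None], size=(224, 224), mode="bilinear"))  # (1, 49, d), Eq. (3)

print(E_main.shape, E_support.shape)
```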

Video Encoder: The video encoder is based on the TimeSformer model, which performs temporal modelling on frame sequences. Given an input sequence $\mathbf{X}_{\text{temporal}}$ of $T$ frames, where each frame has a spatial dimension of $H \times W \times C$, the encoder captures inter-frame dependencies to generate temporal representations. The output is an embedding sequence of size $T \times d$, represented as:

$\mathbf{E}_{\text{temporal}} = \text{VideoEncoder}(\mathbf{X}_{\text{temporal}}) \in \mathbb{R}^{T \times d}$. (4)
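A sketch of the temporal branch is shown below, assuming the Hugging Face TimeSformer implementation and an illustrative pretrained checkpoint; the paper specifies only that a frozen TimeSformer [2] processes downsampled frame sequences, and reducing the spatio-temporal token grid to the $T \times d$ sequence of Eq. (4) would require an additional per-frame pooling step not shown here.

```python
import torch
from transformers import TimesformerModel

# Assumed checkpoint; the paper only names TimeSformer as the video encoder.
video_encoder = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400").eval()
for p in video_encoder.parameters():
    p.requires_grad_(False)

# T low-resolution frames (e.g., T = 8) from one camera stream.
frames = torch.rand(1, 8, 3, 224, 224)  # (batch, T, C, H, W), normalized pixels assumed

with torch.no_grad():
    out = video_encoder(pixel_values=frames)

E_temporal = out.last_hidden_state  # (1, seq_len, 768) spatio-temporal tokens, cf. Eq. (4)
print(E_temporal.shape)
```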

Next, we employ the Query-aware Token Selection Module, which processes the tokens output by the image encoder and text encoder to generate a token-level similarity matrix $S \in \mathbb{R}^{m \times n}$, where $m$ denotes the number of image tokens and $n$ denotes the number of text tokens. By leveraging semantic similarity in this shared space, the module identifies the visual tokens most relevant to the user's query, thereby reducing the number of visual tokens while retaining high-quality tokens. Finally, the Spatial-temporal Token Enhancement Module utilizes the video encoder's output $V_{3D} \in \mathbb{R}^{S_{3D} \times 768}$ and the multi-frame auxiliary information from the image encoder $V_{2D}^{\text{multi-frame}} \in \mathbb{R}^{S_{2D} \times 768}$ to recover and enhance tokens through a cross-attention mechanism. This module's purpose is to restore context lost during token selection and aggregate temporal information without adding extra tokens, as described further in Section 3.3.

Figure 3: The architecture of the Token-wise Attention module. We leverage the outputs from the support branches of the image encoder and video encoder, applying cross-attention to recover/enhance the selected tokens while maintaining consistent output dimensions according to the rules of attention computation.

3.2 Query-aware Token Selection

The Query-aware Token Selection module selects the image tokens most relevant to the text query and further compresses them into an efficient representation. Following the notation of the Multi-modal Encoder Cluster, we denote the main-branch output of the image encoder as $\mathbf{I}$ and the output of the text encoder as $\mathbf{T}$. Given that the CLIP [15] model was pre-trained on a large image-text dataset, we assume that image and text tokens are mapped into the same semantic space. To compute token-wise similarity, we omit the 0-th token from the embeddings, as it represents the overall semantic features.

To align the image embeddings with the text embeddings in the same semantic space, we apply a multi-layer perceptron (MLP) to the image tokens. This transformation produces the aligned image representation, denoted as $\mathbf{I}'$. The aligned image embeddings $\mathbf{I}'$ can then be used for similarity computation with the text embeddings $\mathbf{T}$. Therefore, we calculate the cosine similarity to obtain a token-wise similarity matrix $\mathbf{S}$, measuring the semantic similarity between each image and text token.

$s(\mathbf{I}', \mathbf{T}) = \dfrac{\mathbf{I}' \cdot \mathbf{T}^{\top}}{\|\mathbf{I}'\|\,\|\mathbf{T}\|}$. (5)

Next, based on the similarity between each image token and text token, we calculate the normalized similarity matrix as follows:

$p_i^{\text{img}}(x) = \dfrac{\exp(s(\mathbf{I}', \mathbf{T}_i)/\tau)}{\sum_{j=1}^{N} \exp(s(\mathbf{I}', \mathbf{T}_j)/\tau)}$. (6)

This is then used to select the $k$ most relevant image tokens, where $k$ is the sampling threshold. A higher threshold retains more visual tokens, so the number of visual tokens can be controlled by adjusting $k$ (equivalently, the selection ratio). The experimental section compares performance under different thresholds, expressed as a proportion of the total token count. The steps are outlined in the following algorithm:

Algorithm 1 Query-aware Token Selection

Input: Image tokens $\mathbf{I}$, text tokens $\mathbf{T}$, temperature parameter $\tau$, learnable parameter $\alpha$, number of top-k tokens to select $k$
Output: Selected top-k image tokens $\mathbf{T}_{\text{top-k}}$

  • Align Image and Text Embeddings:
    $\mathbf{I}' \leftarrow \text{MLP}(\mathbf{I})$

  • Compute Cosine Similarity:
    $s(\mathbf{I}', \mathbf{T}) \leftarrow \dfrac{\mathbf{I}' \cdot \mathbf{T}^{\top}}{\|\mathbf{I}'\|\,\|\mathbf{T}\|}$

  • Normalize Similarity:
    $p_i^{\text{img}}(x) \leftarrow \dfrac{\exp(s(\mathbf{I}', \mathbf{T}_i)/\tau)}{\sum_{j=1}^{N} \exp(s(\mathbf{I}', \mathbf{T}_j)/\tau)}$

  • Compute Similarity Scores Matrix:
    $\mathbf{S}_{\text{sum}} \leftarrow \sum_{k=1}^{K} p_i^{\text{img}}(x)$

  • Compute Token Weights Matrix:
    $\mathbf{W} \leftarrow \sum_i \mathbf{I}'_i$

  • Compute Selection Map:
    $\mathbf{M} \leftarrow (1-\alpha) \cdot \mathbf{S}_{\text{sum}} + \alpha \cdot \mathbf{W}$

  • Select Top-K Tokens:
    $\mathbf{T}_{\text{top-k}} \leftarrow \mathbf{I}'[\text{TopK}(\mathbf{M}, k)]$

  • Return: $\mathbf{T}_{\text{top-k}}$

To further compress the tokens, we employ an MLP for information aggregation, generating a query-aware compact token representation $\mathbf{T}_{\text{select}}$. In the experimental section, we demonstrate the model's performance under various combinations of selection and MLP compression factors at a fixed overall compression ratio. The results indicate that balancing the selection factor with the MLP compression rate can achieve higher model performance while preserving a minimal number of tokens.
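The sketch below illustrates Algorithm 1 together with the final MLP compression. It is a minimal sketch under two interpretive assumptions: (i) the softmax in Eq. (6) is taken over image tokens for each text token, and (ii) the token-weight term $\mathbf{W}$ is a learned per-token scalar score; all layer sizes and the toy input shapes are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareTokenSelection(nn.Module):
    """Minimal sketch of query-aware token selection with MLP compression."""

    def __init__(self, d_img=768, d_txt=512, k=49, compress=2, tau=0.07):
        super().__init__()
        self.align = nn.Sequential(nn.Linear(d_img, d_txt), nn.GELU(), nn.Linear(d_txt, d_txt))
        self.token_weight = nn.Linear(d_txt, 1)              # assumed per-token scoring for W
        self.alpha = nn.Parameter(torch.tensor(0.5))         # learnable mixing weight
        self.compress = nn.Linear(d_txt * compress, d_txt)   # merges `compress` tokens into one
        self.k, self.tau, self.ratio = k, tau, compress

    def forward(self, I, T):
        # I: (m, d_img) image tokens, T: (n, d_txt) text tokens (summary tokens removed).
        I_aligned = self.align(I)                                        # (m, d_txt)
        sim = F.normalize(I_aligned, dim=-1) @ F.normalize(T, dim=-1).T  # cosine similarity, (m, n)
        p = F.softmax(sim / self.tau, dim=0)         # assumption: normalize over image tokens
        s_sum = p.sum(dim=1)                         # relevance score per image token, (m,)
        w = self.token_weight(I_aligned).squeeze(-1)             # (m,)
        m_score = (1 - self.alpha) * s_sum + self.alpha * w      # selection map M
        idx = m_score.topk(self.k * self.ratio).indices          # keep k*ratio tokens before compression
        selected = I_aligned[idx]                                # (k*ratio, d_txt)
        # MLP compression: aggregate groups of `ratio` selected tokens into single compact tokens.
        return self.compress(selected.reshape(self.k, -1))       # (k, d_txt)

# Toy usage: 1372 high-resolution image tokens and 20 text tokens -> 49 compact tokens.
selector = QueryAwareTokenSelection()
T_select = selector(torch.rand(1372, 768), torch.rand(20, 512))
print(T_select.shape)  # torch.Size([49, 512])
```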

3.3 Spatial-temporal Token Enhancement

The Spatial-temporal Token Enhancement module is designed to address issues of contextual disruption and high computational overhead when dealing with multi-frame data. This module includes a general-purpose Token-wise Attention Module, which is then specifically applied in two configurations: spatial and temporal enhancements.

First, we introduce the Token-wise Attention Module, which forms the foundation for both spatial and temporal enhancements. As shown in Fig. 3, this module enhances token context by enabling interactions between the tokens selected by the Query-aware Token Selection module and the context tokens from the image or video encoder. The attention mechanism is defined as follows:

$\text{Att}_{\text{token-wise}}(Q, K, V) = \text{Softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right) V$, (7)

where $Q$ (query) represents token representations from the Query-aware Token Selection module, while $K$ (key) and $V$ (value) are derived from the encoder outputs. This mechanism allows each query token to absorb relevant information from context tokens.

With the Token-wise Attention Module established, we now apply it in two configurations:

  • Spatial Token Restoration: Here, we use the output $E_{\text{spatial}}$ of the image encoder as keys and values. By interacting with each token representation $I'$, this configuration introduces spatial context, resulting in the spatially enhanced representation $I_{\text{enhanced spatial}}$:

    $I_{\text{enhanced spatial}} = \text{Att}_{\text{token-wise}}(Q = I',\; K = E_{\text{spatial}},\; V = E_{\text{spatial}})$. (8)
  • Temporal Token Enhancement: When handling multi-frame data, we use the output $E_{\text{temporal}}$ of the video encoder as another set of keys and values. By interacting in parallel with $I'$, this configuration incorporates temporal context and produces the temporally enhanced representation $I_{\text{enhanced temporal}}$:

    $I_{\text{enhanced temporal}} = \text{Att}_{\text{token-wise}}(Q = I',\; K = E_{\text{temporal}},\; V = E_{\text{temporal}})$. (9)

Finally, we combine the spatial and temporal enhanced representations through an MLP layer to obtain the final token representation:

$I_{\text{final}} = \text{MLP}(I_{\text{enhanced spatial}} + I_{\text{enhanced temporal}})$. (10)

In this way, the Spatial-temporal Token Enhancement module fully leverages both spatial and temporal information without increasing the number of tokens passed downstream. The final output tokens are then concatenated with the query embedding and fed into the large language model.
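A minimal sketch of the token-wise attention in its two configurations (Eqs. 7-10) is shown below. The use of nn.MultiheadAttention, the head count and widths, the assumption that all context tokens are already projected to a common width d, and the toy shapes are implementation assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class SpatialTemporalTokenEnhancement(nn.Module):
    """Sketch of Eqs. (7)-(10): cross-attention against spatial and temporal context tokens."""

    def __init__(self, d=512, n_heads=8):
        super().__init__()
        # Token-wise attention, Eq. (7); nn.MultiheadAttention is an implementation assumption.
        self.att_spatial = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.att_temporal = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, I_selected, E_spatial, E_temporal):
        # I_selected: (B, k, d) query tokens from token selection.
        # E_spatial:  (B, Ls, d) support-branch image tokens; E_temporal: (B, Lt, d) video tokens.
        enh_spatial, _ = self.att_spatial(I_selected, E_spatial, E_spatial)      # Eq. (8)
        enh_temporal, _ = self.att_temporal(I_selected, E_temporal, E_temporal)  # Eq. (9)
        return self.mlp(enh_spatial + enh_temporal)                              # Eq. (10)

# Toy usage: 49 selected tokens, 294 spatial context tokens, 1569 temporal tokens.
module = SpatialTemporalTokenEnhancement()
I_final = module(torch.rand(2, 49, 512), torch.rand(2, 294, 512), torch.rand(2, 1569, 512))
print(I_final.shape)  # torch.Size([2, 49, 512]) -- token count unchanged
```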

4 Experiments

Figure 4: LaVida Drive response example. The areas not covered by the black mask represent the image patches corresponding to the retained tokens. This is provided as a simplified example for ease of visualization, and the actual results may differ.
Table 1: Performance comparison on the DriveLM and NuScenes-QA datasets. Bold indicates the highest value, while an underline indicates the second-highest value.

DriveLM dataset
Method                Ref.      BLEU-4↑   METEOR↑   ROUGE-L↑   CIDEr↑
EM-VLM4AD (Base)      CVPR'24   45.4      34.5      72.0       3.20
EM-VLM4AD (Large)     CVPR'24   40.1      34.3      70.7       3.10
DriveLM-Agent         ECCV'24   53.1      36.2      66.8       2.79
LaVida Drive (Ours)   -         51.3      38.0      73.9       3.32

NuScenes-QA dataset
Method                Ref.      Exist↑    Object↑   Status↑    Comparison↑
EM-VLM4AD (Base)      CVPR'24   75.9      40.2      51.9       63.2
EM-VLM4AD (Large)     CVPR'24   70.2      38.6      45.7       60.9
NusceneQA-Agent       AAAI'24   84.8      52.3      59.8       70.0
LaVida Drive (Ours)   -         78.0      52.8      60.9       70.5
Table 2: Ablation study on select ratio and compress ratio while fixing the overall reduction ratio to 168. Bold indicates the highest value.
Select Ratio Compress Ratio BLEU-4 METEOR ROUGE-L CIDEr
- 168 48.0 34.5 72.0 3.20
2 84 51.3 38.0 73.9 3.32
3 56 50.7 37.8 73.3 3.22
6 26 49.4 36.6 73.0 3.26
84 2 46.3 34.1 70.5 3.13
168 - 42.0 32.8 63.5 2.96
Table 3: Component-wise ablation experiment results. + denotes an added module or method, and © denotes a changed module or method.
Method #Tokens BLEU-4 METEOR ROUGE-L CIDEr
Baseline 49*6 45.4 34.5 72.0 3.20
+ High-Resolution Patches 49*6*28 49.4 ↑4.0 37.0 ↑2.5 73.7 ↑1.7 3.29 ↑0.9
+ MLP+Pooling 49 48.0 ↓1.4 36.5 ↓0.5 72.3 ↓1.4 3.20 ↓0.9
© Token Selection 49 50.7 ↑2.7 37.0 ↑0.5 73.0 ↑0.7 3.26 ↑0.6
+ Token Enhancement 49 51.3 ↑0.6 38.0 ↑1.0 73.9 ↑0.9 3.32 ↑0.6
Table 4: Ablation on input types. We compare response performance across different input configurations using the same test model. Bold indicates the highest value.
Multiview Multiframe BLEU-4 METEOR ROUGE-L CIDEr
- 48.30 36.5 72.3 3.23
- 50.8 37.2 74.0 3.32
51.3 38.0 73.9 3.32

In this section, we conduct extensive experiments on LaVida Drive and analyze the results, including both quantitative and qualitative evaluations. Finally, we perform ablation studies to validate the effectiveness of each module.

4.1 Setup

Dataset: Following the EM-VLM4AD [6] protocol, we use identical training, validation, and test splits of the DriveLM [19] dataset, which covers tasks such as perception, prediction, and decision-making, to evaluate generalizability. Additionally, we test on the NuScenes-QA [14] dataset for fine-grained perception, comparing against traditional detection-based methods. The DriveLM training set includes approximately 340,184 unique multi-view QA pairs, with 18,899 pairs for testing and validation. The NuScenes-QA training and test sets contain 459,941 and 83,337 question-answer pairs, respectively.

Metrics: To ensure fairness and reproducibility, we evaluate the DriveLM dataset with the same metrics as EM-VLM4AD, assessing model performance from four perspectives: BLEU-4 [12], ROUGE-L [10], METEOR [1], and CIDEr [21]. For the NuScenes-QA dataset, we follow the metrics proposed by the dataset's authors, evaluating accuracy across four query categories: Exist, Object, Status, and Comparison.
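As an illustration of how two of these caption metrics can be computed, a short sketch using the nltk (BLEU-4) and rouge_score (ROUGE-L) packages is given below; these particular libraries and the example answer strings are assumptions, since the paper cites only the original metric definitions [12, 10, 1, 21] rather than a specific implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "there is a pedestrian crossing the road ahead"   # illustrative ground-truth answer
hypothesis = "a pedestrian is crossing the road ahead"        # illustrative model answer

# BLEU-4 with uniform 1- to 4-gram weights and smoothing for short answers.
bleu4 = sentence_bleu(
    [reference.split()], hypothesis.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}")
```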

Models: We employ the CLIP text and vision encoders as our text and image encoders, respectively, and utilize TimeSformer [2] as the video encoder for multi-frame input processing. The base language model is T5-medium [17].

Implementation Details: Each model is trained on a single NVIDIA A100 Tensor Core GPU. The image encoder, text encoder, and video encoder are frozen, while other parameters, including those in larger models, are trained with an initial learning rate of 1e-4 and a weight decay of 0.05. The batch size is set to 4. Each model is trained for 12 epochs on the training set, with each image divided into 4×7 patches of size 224×224.
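A sketch of the corresponding optimization setup (frozen encoders, stated learning rate and weight decay) is given below; the choice of AdamW is an assumption, as the paper states only the learning rate, weight decay, batch size, and epoch count.

```python
import torch

def build_optimizer(model, frozen_modules):
    """Freeze the encoder cluster and optimize only the remaining parameters."""
    for module in frozen_modules:                 # image, text, and video encoders
        for p in module.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # lr = 1e-4 and weight decay = 0.05 follow Sec. 4.1; AdamW is an assumed optimizer choice.
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)
```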

4.2 Overall Performance

Quantitative Comparison: In Tab. 1, we first compare our model with prior works on the DriveLM dataset, including EM-VLM4AD [6] and DriveLM-Agent [19]. LaVida Drive outperforms the baseline methods on overall performance metrics, although DriveLM-Agent slightly exceeds our model in BLEU-4 score. However, it is important to note that DriveLM-Agent has a significantly larger parameter count, reaching 3.96B. Next, we fine-tune the pretrained model on the NuScenes-QA training set and evaluate it on the test set. Without using 3D detector outputs or prompts, our model achieves competitive results.

Qualitative Comparison: As shown in Fig. 4, we present LaVida Drive's performance across various tasks. To analyze its dynamic perception of multi-view image inputs based on textual queries, we visualize the tokens selected by LaVida Drive in different scenarios. The analysis demonstrates that the model adaptively selects the most relevant tokens guided by keywords such as “car” and “pedestrians.” This process resembles human reasoning during driving, where relevant information is first filtered before drawing inferences. This dynamic token selection prior to model input underscores the system's reliability and interpretability.

4.3 Ablation Studies

To verify the effectiveness of each module in our model, we designed a series of ablation studies, focusing on the selection and compression ratios, model components, and input types.

Select and Compress Ratio Ablation: We fixed the overall compression ratio at 168 and tested various combinations of the selection and MLP compression ratios to assess their impact on model performance. As shown in Tab. 2, initially, we compressed tokens using MLP alone, without selection, which gave suboptimal results. Then, we gradually increased the selection ratio while decreasing the compression ratio, observing peak performance at a selection ratio of 2, after which performance declined. Finally, we reduced tokens by a factor of 168 using selection alone (without MLP compression), resulting in a significant performance drop.

These results indicate that excessive redundancy and irrelevant information hinder VLM training and inference. However, overly reducing token selection leads to the loss of valuable information. Therefore, balancing the selection factor and MLP compression ratio is crucial for achieving higher model performance with fewer tokens.

Component-wise Ablation: Tab. 3 presents the component-wise experimental results of our method. We start by using the 224×224 downsampled feature map from the image encoder as low-resolution visual embeddings, producing 49×6 visual tokens, which we use as the baseline. Next, we replace simple downsampling with high-resolution 224×224 patches. This approach improves performance by +4.0%, +2.5%, +1.7%, and +0.9%, but increases token overhead. To address this issue, we first apply an MLP layer to downsample the tokens, reducing their count to 49; however, this results in performance declines of -1.4%, -0.5%, -1.4%, and -0.9%. By replacing the MLP layer with our proposed Token Selection module, we effectively identify the most relevant tokens, leading to performance gains of +2.7%, +0.5%, +0.7%, and +0.6%. Finally, incorporating the Token Recovery and Token Enhancement methods leads to further improvements of +0.6%, +1.0%, +0.9%, and +0.6%, respectively, while maintaining the token count.
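As a concrete check, the token counts in Tab. 3 correspond to the 168-fold overall reduction as follows:

$\underbrace{49}_{\text{tokens per } 224\times224 \text{ patch}} \times \underbrace{28}_{\text{patches per view}} \times \underbrace{6}_{\text{views}} = 8232, \qquad \dfrac{8232}{168} = 49 \text{ retained tokens},$

which can be realized, for example, by a selection ratio of 2 combined with an MLP compression ratio of 84 (Tab. 2).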

Input Type Ablation: To validate the robustness of our algorithm under scenarios where input data is lost in complex or adverse conditions, we tested different input configurations: single-frame multi-view, multi-frame single-view, and single-frame single-view. By analyzing the model’s performance after removing parts of the complete multi-view multi-frame dataset, we assess the model’s robustness and the impact of multi-frame and multi-view data on its performance. As shown in Tab. 4, our model maintains relatively good performance even after some data is removed. Although there is a significant collapse in performance when using the single-frame single-view dataset, it still outperforms our baseline.

5 Conclusion

In this work, we present LaVida Drive, a novel framework for VQA in autonomous driving, which effectively integrates high-resolution spatial perception with temporal dynamics. By leveraging query-aware token selection and spatial-temporal token enhancement, our approach reduces computational overhead without sacrificing fine-grained visual detail, enabling more efficient inference through selective processing of relevant visual cues and ensuring coherent information flow across frames. LaVida Drive provides a promising framework for real-time VQA systems in autonomous driving, balancing computational efficiency with detailed perception. It effectively integrates spatial and temporal information, laying the groundwork for intelligent systems that can handle complex, dynamic driving environments.

References

  • Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • Bertasius et al. [2021] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), 2021.
  • Chen et al. [2024] Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14093–14100. IEEE, 2024.
  • Chen et al. [2023] T. Chen, X. Zhang, and Y. Wang. Contextual question answering for autonomous driving. Journal of Autonomous Systems, 11:98–110, 2023.
  • Chen et al. [2022] Y. Chen, L. Xie, and X. Wang. Deep vision-language fusion for autonomous driving planning. In IEEE Transactions on Neural Networks and Learning Systems, pages 4135–4149, 2022.
  • Gopalkrishnan et al. [2024] Akshay Gopalkrishnan, Ross Greer, and Mohan Trivedi. Multi-frame, lightweight & efficient vision-language models for question answering in autonomous driving, 2024.
  • Huang et al. [2023] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems, 36:72096–72109, 2023.
  • Li et al. [2024] Z. Li, H. Wang, and Y. Xu. Dialogue-based question answering for autonomous vehicles. In Proceedings of CVPR, 2024.
  • Liao et al. [2024] Guibiao Liao, Jiankun Li, and Xiaoqing Ye. Vlm2scene: Self-supervised image-text-lidar learning with foundation models for autonomous driving scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3351–3359, 2024.
  • Lin [2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  • Nie et al. [2025] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2025.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • Qian et al. [2024a] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024a.
  • Qian et al. [2024b] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario, 2024b.
  • Radford et al. [2021a] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021a.
  • Radford et al. [2021b] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021b.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • Sha et al. [2023] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving. arXiv preprint arXiv:2310.03026, 2023.
  • Sima et al. [2023] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
  • Tian et al. [2024] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
  • Vedantam et al. [2015] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • Wang et al. [2024a] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024a.
  • Wang et al. [2024b] Y. Wang, J. Zhang, and Y. Li. Hybrid vision-language models for safe autonomous driving. IEEE Transactions on Vehicular Technology, 73(6):3752–3764, 2024b.
  • Wen et al. [2023] Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, et al. On the road with gpt-4v (ision): Early explorations of visual-language model on autonomous driving. arXiv preprint arXiv:2311.05332, 2023.
  • Xu et al. [2024] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 2024.
  • Yang et al. [2024] Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024.
  • Yang et al. [2022] X. Yang, W. Li, and L. Zhang. Vision-language models for autonomous driving: A survey. IEEE Access, 10:67890–67905, 2022.
  • Zeng et al. [2023] Z. Zeng, L. Xie, and S. Wang. Multimodal vision-language models for autonomous driving. IEEE Transactions on Robotics, 39(2):456–470, 2023.
  • Zhang et al. [2024] J. Zhang, L. Liu, and C. Wu. End-to-end vision-language reasoning for autonomous driving. In Proceedings of CVPR, 2024.
  • Zhang et al. [2021] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579–5588, 2021.
  • Zhu et al. [2023] J. Zhu, M. Li, and W. Hu. Vision-language pretraining for autonomous driving decision making. IEEE Transactions on Intelligent Vehicles, 8(3):745–756, 2023.