
DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving

Wencheng Han SKL-IOTSC, CIS, University of Macau Dongqian Guo SKL-IOTSC, CIS, University of Macau Cheng-Zhong Xu SKL-IOTSC, CIS, University of Macau Jianbing Shen Corresponding author SKL-IOTSC, CIS, University of Macau
Abstract

In the field of autonomous driving, two important features of an autonomous driving system are the explainability of its decision logic and the accuracy of its environmental perception. This paper introduces DME-Driver, a new autonomous driving system that enhances both the performance and the reliability of autonomous driving. DME-Driver utilizes a powerful vision language model as the decision-maker and a planning-oriented perception model as the control signal generator. To ensure explainable and reliable driving decisions, the logical decision-maker is constructed based on a large vision language model. This model follows the logic employed by experienced human drivers and makes decisions in a similar manner. On the other hand, the generation of accurate control signals relies on precise and detailed environmental perception, which is where 3D scene perception models excel. Therefore, a planning-oriented perception model is employed as the signal generator. It translates the logical decisions made by the decision-maker into accurate control signals for the self-driving car. To effectively train the proposed model, we created a new dataset for autonomous driving that encompasses a diverse range of human driver behaviors and their underlying motivations. By leveraging this dataset, our model achieves high-precision planning accuracy through a logical thinking process.

1 Introduction

Figure 1: Comparative Structures in Autonomous Driving Systems: (a) depicts a planning-oriented autonomous driving system that optimizes for overall performance using planning results, but lacks interpretability; (b) shows an LLM-based autonomous driving system capable of producing reasonable control signals, yet unable to fully leverage perception tasks; (c) illustrates our DME-Driver Autonomous Driving System, which effectively combines the strengths of both planning-oriented [25] and LVLM [35] models.

Autonomous driving systems represent a significant advancement in automotive technology, combining computer vision and artificial intelligence. These systems enable vehicles to perceive their surroundings [10, 41, 43, 51, 56], make informed decisions [3, 18, 21, 29, 47], and navigate without human intervention [27, 30, 37, 48, 54]. Key to these systems are sophisticated perception mechanisms and decision-making algorithms. These components allow vehicles to accurately understand their environment and make autonomous decisions, ensuring safe and efficient navigation. However, the complexity of these systems requires a focus on interpretability. Achieving interpretability is not just a technical challenge, but also a crucial step towards wider acceptance and integration of autonomous vehicles in society.

Recently, deep learning-based methods have achieved remarkable success in the realm of autonomous driving [13, 17, 28, 34, 39]. Some works [11, 24, 20, 26] proposed planning-oriented autonomous driving systems that can be trained end-to-end. As illustrated in Fig. 1(a), such a system [11] encompasses several critical perception modules, including tracking, mapping, motion, and occupancy detection. The outputs from these modules are fed into a planner, which then generates the control signals for the vehicle. This approach leverages the full potential of perception models in autonomous driving, significantly enhancing the overall accuracy of the planning process. However, the end-to-end nature of such methods leads to a lack of comprehensibility in terms of human driving logic, resulting in a system that lacks interpretability. When confronted with scenarios that the model fails to resolve, what we might term “bad cases”, it becomes challenging to understand the flawed decision-making process of the system. This obscurity poses a significant hurdle in troubleshooting and refining the autonomous driving system, as there is little insight into why and how these errors are made.

Benefiting from the advancements in large language models (LLMs), recent approaches have attempted to enhance the interpretability of autonomous driving systems by integrating LLMs into their frameworks [12, 33, 36, 42, 49], as shown in Fig. 1(b). Leveraging their formidable logical reasoning and generalization capabilities, LLMs can effectively understand the behavioral logic of human drivers in various driving scenarios. In the face of corner cases, they can attempt to mimic human driver behavior, making reasoned judgments and decisions. Moreover, even when decision-making errors occur, the logical reasoning process of these models can be used as evidence to pinpoint the causes of erroneous judgments and to seek solutions. However, these methods place LLMs at the core of the autonomous driving system, treating the outputs of perception modules as given conditions. This approach does not fully utilize the diverse perception tasks essential for accurate environmental perception. When perception models themselves produce erroneous predictions, these mistakes can accumulate in the decision-making process. The losses incurred from decision-making errors are then unable to retroactively optimize the perception tasks, preventing the achievement of optimal overall performance.

In this paper, we propose the DME (Decision-Maker Executor)-Driver autonomous driving system, which synergizes the advantages of both LLMs and planning-oriented perception models. Our system harnesses the LLMs’ ability to understand human driver behavior logic while simultaneously leveraging the precise environmental perception capabilities of planning-oriented autonomous driving models. As depicted in Fig. 1(c), our system comprises two key roles: the Decision-Maker and the Executor. The Decision-Maker is based on a Large Vision Language Model (LVLM), trained through imitation learning on extensive real-world driving data and human driver behavior logic. This enables the Decision-Maker to thoroughly grasp the relationship between driving scenarios and the underlying behavioral logic. When the vehicle encounters a new scene, the Decision-Maker can simulate human-like logical assessments of key elements in the scene, determining whether to accelerate, brake, or change lanes. Additionally, the Decision-Maker can mimic a human driver’s ability to describe key aspects of driving scenes and gaze information, providing the perception model with a reliable set of prior information. This helps the perception model to focus on elements of particular importance in the current scenario.

The Executor, on the other hand, is responsible for converting the Decision-Maker’s instructions into precise vehicle control signals. It ensures that the high-level logic and reasoning from the Decision-Maker are effectively translated into real-world driving actions. By doing so, the Executor bridges the gap between decision-making and vehicle control, enabling the system to navigate safely and efficiently in diverse driving conditions. By fully considering these two aspects, DME-Driver aims to achieve a more detailed and human-like understanding of driving scenarios, leading to safer and more efficient autonomous driving decisions.

In summary, this paper presents four key contributions:

  • DME-Driver Autonomous Driving System: We present the DME-Driver system, which combines the strengths of LLMs in logical reasoning and interpretability with the precise environmental sensing of planning-oriented models, improving decision-making robustness and interpretability in autonomous driving.

  • Human-Driver Behavior and Decision-Making (HBD) Dataset: Leveraging both open-source datasets and newly collected data, we developed a distinctive dataset that integrates human driver behavior logic with detailed environmental perception, specifically designed for training the DME-Driver system.

  • Decision-Maker Model Design: Our Decision-Maker model, based on LVLM, is capable of imitating human driver instructions and focusing on important elements in the environment, providing human-like insights for better decision-making.

  • Executor Model Formulation: The Executor model accurately processes environmental data and translates the Decision-Maker’s instructions into accurate vehicle control signals, ensuring effective and context-aware responses in various driving situations.

Empirical evaluations demonstrate that our method achieves state-of-the-art accuracy in autonomous driving planning while significantly enhancing the system’s interpretability. Every driving decision made by the system can be traced back through logs to understand the underlying driving logic, providing a level of transparency and explainability that is unprecedented in autonomous driving systems.

2 Related Work

2.1 Autonomous Driving System

The development of autonomous driving technology has evolved from traditional rule-based systems to sophisticated learning-based approaches. Initially, autonomous vehicles were governed by algorithms that heavily relied on sensor-based data and predefined rules [52, 40]. While these methods provided reliable results in controlled environments, they struggled with the unpredictability and complexity inherent in real-world driving conditions.

The advent of deep learning revolutionized this landscape, enabling systems to learn from vast and varied datasets of real driving scenarios [1, 19]. This shift to learning-based approaches has endowed autonomous systems with the flexibility and adaptability needed to navigate complex and dynamic environments more effectively, paving the way for more robust and versatile driving systems.

A significant trend in recent years is the development of end-to-end autonomous driving systems. These systems utilize deep neural networks to process sensory inputs directly into driving actions, seeking to streamline the autonomous driving process [5]. While this approach simplifies the system architecture by eliminating modular decomposition, it raises challenges in terms of interpretability and robustness. The black-box nature of these systems often hinders their ability to explain decisions and adapt to novel situations, which is critical for ensuring safety and gaining user trust in real-world applications.

2.2 Large Language Model for Autonomous Driving

The field of Large Language Models (LLMs) has seen remarkable growth, significantly advancing capabilities in natural language understanding and generation. Early models like BERT [14] and GPT [44] laid the groundwork for more complex systems such as GPT-3 [7]. These models have not only excelled in generating coherent text but have also shown proficiency in understanding context and subtleties in language, making them invaluable for diverse applications. A noteworthy advancement in LLMs is their integration with other data modalities, particularly visual data. Models like CLIP [45] and DALL-E [46] have demonstrated the effectiveness of combining textual and visual information, enabling a holistic understanding of multimodal content. This integration has been pivotal in broadening the applicability of LLMs to fields where context and nuance across different data types are essential. However, LLMs are not without their challenges. The computational requirements for training and running these models are substantial, raising concerns about environmental sustainability and the digital divide [50]. Furthermore, ethical considerations, particularly regarding biases in model outputs and their implications, are an ongoing area of concern and active research [4].

For a long time, autonomous driving systems have been treated as black boxes with a lack of interpretability, making it difficult to understand how decisions are made. The development of LLMs promises to solve this problem. GPT-Driver [38] transforms the GPT-3.5 model into a motion planner for autonomous driving, which demonstrates the motion planning abilities of the LLMs. DriveGPT4 [55], an interpretable end-to-end autonomous driving system based on LLMs, utilizes multimodal data such as videos, texts, and historical control signals. It generates textual responses to questions and predicts control signals for vehicle operation. Also, the reasoning abilities of LLMs improve the performance in perception and understanding tasks. LLM-AD [16], a semantic anomaly detection framework utilizing LLMs’ reasoning abilities, demonstrates that the LLM-based monitor aligns with human intuition in both fully end-to-end policies and classical autonomy stacks utilizing learned perception.

3 Human Driver Behavior and Decision-Making Dataset

Figure 2: HBD Dataset Annotation Pipeline: There are four key steps in our dataset creation - Data Collection, Prompt Design, Manual Corrections, and Dialogue Generation, streamlining the process from raw data gathering to structured dialogue formation.

3.1 Dataset Introduction

In order to thoroughly investigate the relationship between the behavioral logic of human drivers and the resulting driving signals, our study focuses on four key aspects. These aspects are essential for enhancing the robustness of an autonomous driving system by understanding and mimicking human-like driving behaviors.

Human Driver Gaze in Driving. Driving on roads requires real-time responses, often based on subconscious reflexes rather than deliberate reasoning. To comprehend these reflexive actions, we consider human gaze as an informative behavioral signal. During driving, human drivers instinctively focus on the most critical parts of the scene, which are typically directly influential in the driving logic of the current scenario. These elements could include traffic signals or other entities that might interact with the vehicle shortly. Understanding this gaze behavior is crucial for our system to recognize and prioritize important aspects of the driving environment.

Human Driver’s Understanding of Driving Scenes. The way human drivers logically describe driving scenes provides a rich, purposeful understanding of the scenario. Unlike standard image captions that offer a global view, human drivers focus on elements and their interrelations that could impact driving decisions. These logical descriptions enable even other human drivers to make accurate judgments about appropriate driving actions. For instance, a description might detail a scenario at an intersection where the vehicle is in a left-turn lane with a red light for turning.

Decision-Making and Rationale of Human Drivers. The decision-making process of human drivers is logical and information-rich. It encompasses the driver’s synthesis of various factors to determine the appropriate action for a given scenario, along with the underlying thought process. By comprehending and emulating this aspect, an autonomous driving system can mimic human-like decision-making logic. This capability is particularly valuable in novel or challenging scenarios not encountered during training, enabling the system to make safe and reliable judgments.

Precise Control Signals. The ultimate output of an autonomous driving system should be structured control signals directly applicable to the vehicle. Regardless of how correct and interpretable the natural language-based driving logic is, it must be translated into concrete control commands. This translation is vital to ensure that the detailed understanding and decisions derived from human driving logic are effectively and safely executed by the autonomous vehicle.

Given the requirement for four distinct types of raw data, we faced the challenge that no existing open-source dataset comprehensively covers all these aspects. To address this gap, we integrated three different open-source datasets, each contributing one or more of the needed data types. For example, the Look-Both-Ways dataset [31] provides precise human driver gaze information collected via eye-tracking devices, BDD-X [58] offers manually annotated driving decisions and their underlying reasons, and Nuscenes [8] includes detailed control signals. Leveraging the powerful generalization capabilities of the LVLM, we can extrapolate the knowledge learned from each dataset to the others, even in the absence of a single dataset containing all types of annotations.

Besides this, to further enhance the LVLM’s capabilities, especially in multi-turn question-answering scenarios within the same scene, we collected a new, comprehensive virtual sub-dataset named Virtual HBD. This dataset was gathered using the Carla [15] simulation engine, where human drivers operated simulated driving controls such as steering wheels and pedals. This process yielded visual information and control signals for the vehicle. Additionally, drivers manually annotated their specific decisions and the reasons behind them at each moment. During the driving sessions, an eye tracker was employed to capture the drivers’ gaze, focusing on the objects they observed in real-time.

Thanks to this novel Virtual HBD sub-dataset, we can link all related information in a single dialogue, significantly bolstering the LVLM’s generalization ability. This dataset uniquely enables the system to understand and mimic the comprehensive decision-making process of human drivers, covering everything from gaze patterns and scene interpretation to the rationale behind specific driving decisions and their translation into control signals.

3.2 Data Collection and Label Generation

To efficiently process and label this data, we employ a combination of GPT-4V(ision) pre-annotation followed by manual corrections. Fig. 2 outlines the data collection pipeline for our HBD Dataset. This pipeline comprises four main steps, starting from data collection, involving Prompt Design and manual correction, to the generation of dialogues.

Step 1: Data Collection As mentioned in the previous section, our data collection integrates four sub-datasets, including three open-source datasets - Look-Both-Ways [31], BDD-X [58], and Nuscenes [8] - and one newly collected sub-dataset, Virtual HBD. This comprehensive combination provides a rich base of raw data encompassing various aspects of human driving behavior.

Step 2: Prompt Design To guide GPT-4V in analyzing driving behaviors from a human driver’s perspective, we crafted unique prompts for each data type. For example, with human gaze data, the aim is for the LVLM to understand which areas a human driver focuses on in a given scene and why. We first convert the gaze points into an axis-aligned bounding box by taking the minimum and maximum x and y coordinates of the gaze points within a 24-frame window. This bounding box is then drawn on the image. GPT-4V processes this annotated image to infer the driver’s focus and intentions.
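To make the conversion concrete, the sketch below pools gaze points from the 24-frame window into a single axis-aligned box and draws it on the frame before it is sent to GPT-4V. The helper names and the OpenCV-based drawing are our own illustrative choices, not the exact annotation code.

```python
import numpy as np
import cv2

def gaze_points_to_bbox(gaze_points):
    """Pool gaze fixations from the 24-frame window into one axis-aligned box.

    gaze_points: array-like of shape (N, 2) holding (x, y) pixel coordinates
    collected over the window. Returns (x_min, y_min, x_max, y_max).
    """
    pts = np.asarray(gaze_points, dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return int(x_min), int(y_min), int(x_max), int(y_max)

def draw_gaze_box(frame_bgr, gaze_points, color=(0, 0, 255), thickness=3):
    """Draw the gaze box on a copy of the frame before it is passed to GPT-4V."""
    x0, y0, x1, y1 = gaze_points_to_bbox(gaze_points)
    annotated = frame_bgr.copy()
    cv2.rectangle(annotated, (x0, y0), (x1, y1), color, thickness)
    return annotated
```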

Step 3: Manual Corrections After GPT-4V generates outputs, these are manually reviewed and corrected. Completely incorrect responses are deleted and regenerated with altered prompts. For correct responses, any details that do not align with typical human driver thought processes are manually adjusted.

Step 4: Dialogue Generation Once all data are transformed into detailed textual information, the next step is organizing this information into dialogues, which involves three sub-steps:

First-person Conversion: All pronouns in the dialogues are converted to the first person. This is to ensure that the subsequent LVLM Decision-Maker model can process the dialogues from the perspective of the human driver.

Combining Multi-turn Dialogues: The labeling process typically involves single-question prompts. For practical use, multi-turn dialogues offer more contextual clues for the LVLM. We thus concatenate different types of Q&A information into multi-turn dialogues, as shown in Fig. 2. The number of dialogue turns varies – one to three turns for open-source data, depending on the information available, and up to five turns for the Virtual HBD data.

Data Augmentation: To diversify the question-answer labels and enhance the model’s generalization capabilities, we utilize GPT-3.5 to rewrite the generated dialogues. This process involves changing the form of the dialogues while keeping their content consistent.
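As an illustration of Step 4, the sketch below assembles the available single-turn Q&A pairs of one scene into a multi-turn, first-person dialogue. The field names, question templates, and dictionary structure are assumptions for illustration; the actual HBD dialogues are produced by the pipeline in Fig. 2.

```python
# Hypothetical question templates; the actual HBD prompts come from the pipeline in Fig. 2.
QUESTION_TEMPLATES = {
    "gaze": "Which region of the scene am I focusing on, and why?",
    "description": "How would I describe the elements relevant to my driving decision?",
    "reasoning": "What is my reasoning about the current situation?",
    "decision": "What should I do next?",
    "control": "Which control signal does this decision correspond to?",
}

def build_dialogue(sample, turn_order=("gaze", "description", "reasoning", "decision", "control")):
    """Concatenate the available single-turn Q&A pairs of one scene into a
    multi-turn dialogue: one to three turns for the open-source sub-datasets
    and up to five turns for Virtual HBD, depending on which annotations exist."""
    dialogue = []
    for key in turn_order:
        if key in sample:  # keep only annotation types present in this sub-dataset
            dialogue.append({"from": "human", "value": QUESTION_TEMPLATES[key]})
            dialogue.append({"from": "driver", "value": sample[key]})  # first-person answer
    return dialogue

# Example: a sample with gaze and decision annotations yields a two-turn dialogue.
example = {"gaze": "I am watching the pedestrian on my right.",
           "decision": "I slow down and prepare to stop."}
print(build_dialogue(example))
```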

4 The proposed DME-Driver Autonomous Driving System

In our DME-Driver Autonomous Driving System, as illustrated in Fig. 3, the system is divided into two main components: the Decision-Maker and the Executor. The Decision-Maker acts as the central decision-maker, synthesizing vehicle status and current visual inputs to emulate a human driver’s logical judgments. Its output is expressed in natural language, providing a logical and interpretable narrative of driving decisions. This feature is particularly valuable for diagnosing and understanding ’bad cases’ in driving scenarios, as these natural language logs offer insights into the causes of incorrect decisions. However, as natural language cannot directly control a vehicle, the system incorporates the Executor network, functioning as a translator. This network converts the Decision-Maker’s linguistic outputs into precise vehicle control commands. The detailed architecture and functions of both the Decision-Maker and the Executor networks are crucial to the system’s effectiveness and will be elaborated in Sections 4.1 and 4.2 respectively.

Figure 3: The Detailed DME-Driver Autonomous Driving System: There are two main components in our system. The Decision-Maker emulates human drivers’ logic in various driving scenarios. The Executor then translates the Decision-Maker’s instructions into precise vehicle control signals, ensuring effective execution of driving decisions.

4.1 Decision-Maker: Human Driver Logic Understanding

The Decision-Maker in our DME-Driver system is a sophisticated Large Vision Language Model (LVLM) designed to simulate the decision-making process of human drivers. In our experiments, we utilize LLaVA [35] as the baseline network for the Decision-Maker. This component is engineered to process inputs from three different modalities: visual inputs from the current and previous scenes, textual inputs in the form of prompts, and current status information detailing the vehicle’s operating state.

Visual Input: To process the visual input, we utilize a pretrained CLIP [45] visual encoder $E_{CLIP}$. This encoder converts the visual information into feature tokens. To better comprehend the context of driving scenes, we enhance the input by concatenating the previous three key frames with the current frame into an image array:

\text{F}_{v} = E_{CLIP}(F_{t-3} \oplus F_{t-2} \oplus F_{t-1} \oplus F_{t}) \quad (1)
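A minimal sketch of Eq. (1) is given below, assuming an off-the-shelf HuggingFace CLIP ViT-L/14 vision tower stands in for $E_{CLIP}$ and that the image array is realized by concatenating per-frame patch tokens; the projection into the LLM embedding space (as in LLaVA) is omitted here, and the exact checkpoint and token handling in DME-Driver may differ.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def encode_key_frames(frames):
    """frames: list of four PIL images [F_{t-3}, F_{t-2}, F_{t-1}, F_t].

    Each frame is encoded separately by the CLIP vision tower and the patch
    tokens are concatenated along the sequence dimension to form F_v."""
    pixel_values = processor(images=frames, return_tensors="pt").pixel_values  # (4, 3, H, W)
    patch_tokens = encoder(pixel_values).last_hidden_state                     # (4, N, D)
    return patch_tokens.flatten(0, 1).unsqueeze(0)                             # (1, 4*N, D)
```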

Textual and Status Input: The prompt input $T_{p}$ and the current status information $T_{s}$ are handled using a methodology similar to RT-2 [6], where a text tokenizer is employed to encode these inputs uniformly. After encoding, the tokens representing both visual and textual information are concatenated and fed into the LLaMA 2 [53] model for processing:

\text{F}_{t} = \text{Tokenizer}(T_{p}) \oplus \text{Tokenizer}(T_{s}) \quad (2)
\text{t} = \text{LLaMA\_2}(\text{F}_{t} + \text{F}_{v})

This integrated approach allows the Decision-Maker to consider all aspects of the driving scenario, ensuring a comprehensive understanding and simulation of human-like decision-making processes. The final step involves a de-tokenizer, which maps the output tokens back into natural language.
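The sketch below illustrates Eq. (2) and the de-tokenization step, assuming a HuggingFace LLaMA-2 checkpoint (gated access required) and a single linear projection from CLIP features to the LLM embedding space; the actual Decision-Maker follows the LLaVA architecture and training recipe, so the details will differ.

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llm = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()
project = torch.nn.Linear(1024, llm.config.hidden_size)  # CLIP ViT-L/14 dim -> LLM dim

@torch.no_grad()
def decision_maker_step(f_v, prompt_text, status_text):
    """f_v: visual tokens of shape (1, M, 1024); returns the natural-language answer."""
    text_ids = tokenizer(prompt_text + " " + status_text, return_tensors="pt").input_ids
    f_t = llm.get_input_embeddings()(text_ids)        # F_t: embedded prompt + status tokens
    inputs = torch.cat([project(f_v), f_t], dim=1)    # concatenate visual and textual tokens
    out_ids = llm.generate(inputs_embeds=inputs, max_new_tokens=64)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)  # de-tokenizer step
```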

4.2 Executor: The Control Signal Generator

As depicted in Fig. 3, our Executor network in the DME-Driver system is designed based on the UniAD [25] planning-oriented autonomous driving framework, featuring four distinct components:

Backbone Network: The initial layer of the Executor network is a backbone network, which is responsible for extracting features from multi-view vision inputs. This network forms the foundation for subsequent feature processing and interpretation. Following the backbone network, the extracted image features are transformed into Bird’s Eye View (BEV) features through a process similar to BEVFormer.

Perception Modules: The next stage consists of four specialized perception modules:

TrackFormer is designed for detecting and tracking various elements within the driving scene.

MapFormer generates a segmented map in BEV, providing detailed spatial information about the environment.

MotionFormer predicts the motion trajectories of each element within the scene.

OccFormer is responsible for generating occupancy information, indicating the areas within the scene that are occupied and those that are free.

Planning Module: Following the perception modules, the Planning Module takes the output tokens from these modules as its input. This module’s primary function is to generate the predicted control signals for the vehicle.

Driver Logic Encoder: Distinct from the UniAD system, our Executor network incorporates additional enhancements for the OccFormer and the Planning module. The OccFormer combines textual information from scene descriptions and gaze data, while the Planning module integrates decision-related text. Specifically, we integrate a BERT-based text encoder $E_{bert}$ to process the corresponding textual inputs:

T_{occ} = E_{bert}(t_{gaze}) \oplus E_{bert}(t_{description}) \quad (3)
T_{planner} = E_{bert}(t_{decision}),

where $T_{occ}$ represents the text encoding for the OccFormer and $T_{planner}$ the text encoding for the Planning module. $t_{gaze}$, $t_{description}$, and $t_{decision}$ denote the answers generated by the Decision-Maker for the Gaze, Scene Description, and Decision Making questions, respectively. After encoding the text, we combine the BEV feature $B$ generated by the backbone network with the corresponding text encoding using a transformer fusion structure named LogicalFusioner. In this structure, we treat the BEV feature as the query and the text encoding as the key and value. After aggregation by multi-head attention, we add a shortcut connection to the original $B$ to produce the enhanced BEV feature $B^{\prime}$:

B^{\prime} = \text{LogicalFusioner}(B, T) = \text{MHA}(Q{=}B, K{=}T, V{=}T) + B \quad (4)
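A minimal PyTorch sketch of the Driver Logic Encoder and the LogicalFusioner of Eqs. (3)-(4) follows; the hidden sizes, the single attention layer, and the use of bert-base-uncased are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LogicalFusioner(nn.Module):
    """BEV features attend to the Decision-Maker's text, as in Eq. (4)."""

    def __init__(self, bev_dim=256, text_dim=768, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, bev_dim)
        self.mha = nn.MultiheadAttention(bev_dim, num_heads, batch_first=True)

    def forward(self, bev, text_tokens):
        # bev: (B, H*W, C) flattened BEV feature; text_tokens: (B, L, text_dim)
        t = self.text_proj(text_tokens)
        attended, _ = self.mha(query=bev, key=t, value=t)  # MHA(Q=B, K=T, V=T)
        return attended + bev                              # shortcut connection -> B'

# Text encoding of Eq. (3): gaze and scene-description answers are encoded by BERT
# and concatenated along the token dimension (the decision text is encoded the same
# way for the planner branch).
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_logic_text(t_gaze, t_description):
    ids = bert_tok([t_gaze, t_description], return_tensors="pt", padding=True)
    tokens = bert(**ids).last_hidden_state     # (2, L, 768)
    return tokens.flatten(0, 1).unsqueeze(0)   # T_occ, shape (1, 2*L, 768)
```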

The inclusion of the Driver Logic Encoder enables the Executor network to go beyond processing visual information alone; it allows for the integration of the decision-making information and scene understanding provided by the Decision-Maker, emulating human driver insights for more comprehensive and context-aware driving decisions.

4.3 Training

The training of the DME-Driver system is streamlined into two essential steps. Firstly, training the Decision-Maker involves using multi-type human driver decision data to understand the human driving logic. Secondly, the training focuses on the Executor, which is trained using Decision-Maker instructions, perception labels, and control signals. By utilizing this data, the Executor can learn how to accurately transform instructions into control signals.

Decision-Maker Training: The training of the Decision-Maker network in our DME-Driver system encompasses two critical stages: pretraining and fine-tuning. Initially, the model undergoes pretraining on diverse datasets, including 593K image-text pairs from CC3M [9] and 100K video-text pairs from WebVid-10M [2], focusing on general video-text alignment. This phase involves training the video tokenizer while keeping the CLIP encoder and LLM weights fixed. The fine-tuning stage then tailors the model to the specific needs of interpretable autonomous driving. Here, the LLM is trained alongside the visual tokenizer using 30K video-text pairs, from the proposed HBD Dataset, and supplemented with 80K instruction-following image-text pairs from LLaVA [35].

Executor Training: The training of the Executor component in our DME-Driver system primarily follows the setup utilized by UniAD [25]. However, we introduce specific modifications to enhance the system’s consistency. Initially, similar to UniAD, we start by jointly training the perception parts, namely the tracking and mapping modules, for six epochs. We then proceed to an end-to-end training phase, which lasts for 20 epochs and encompasses all perception, prediction, and planning modules. To ensure alignment between the output signals of the planning module and the decisions made by the Decision-Maker, we introduce an additional reinforcement learning component during the training of the planning module. This component applies a penalty whenever the control signals deviate from the Decision-Maker’s decisions. Specifically, we categorize the Decision-Maker’s decisions into eight distinct types, such as moving forward, turning left, and turning right. For each of these decision types, we have established specific rules to determine whether a given control signal corresponds to that category.
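The following sketch illustrates the kind of rule-based consistency check described above, under our own assumptions about the category names, the trajectory representation, and the thresholds; it covers only a subset of the eight categories and is not the paper’s actual rule set.

```python
import math

# Category names, trajectory representation, and thresholds are our assumptions;
# only four of the eight categories are handled by the rule below.
DECISION_CATEGORIES = ["move forward", "turn left", "turn right", "stop",
                       "slow down", "accelerate", "change lane left", "change lane right"]

def trajectory_category(traj, stop_thresh=0.5, turn_thresh=0.3):
    """traj: future (x, y) waypoints in the ego frame (x forward, y to the left)."""
    dx, dy = traj[-1][0] - traj[0][0], traj[-1][1] - traj[0][1]
    if math.hypot(dx, dy) < stop_thresh:   # almost no displacement -> stop
        return "stop"
    heading = math.atan2(dy, dx)
    if heading > turn_thresh:              # trajectory bends to the left
        return "turn left"
    if heading < -turn_thresh:             # trajectory bends to the right
        return "turn right"
    return "move forward"

def consistency_penalty(decision_text, traj, weight=1.0):
    """Penalize the planner whenever its trajectory deviates from the stated decision."""
    decided = next((c for c in DECISION_CATEGORIES if c in decision_text.lower()), None)
    executed = trajectory_category(traj)
    return weight if decided is not None and decided != executed else 0.0
```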

5 Experiment

5.1 Human Driver Logic

To assess the accuracy of the Decision-Maker in mimicking human driver decision-making and judgment in driving scenarios, we conducted an evaluation using the test set of the HBD dataset. Our primary goal was to determine whether the Decision-Maker could accurately replicate the focus areas, scene descriptions, reasoning, and decisions characteristic of human drivers.

In our evaluation process, we adopted a method similar to that used in DriveGPT4, utilizing an advanced version of ChatGPT to generate assessment metrics. Leveraging ChatGPT’s reasoning capabilities, we prompt it to assign a numerical score ranging from 0 to 1 to each prediction, where a higher score indicates better accuracy in mirroring human-like decision-making. This scoring method allows for a nuanced and comprehensive assessment of the Decision-Maker’s performance.
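For illustration, a scoring call in this protocol could be structured as follows. The prompt wording is ours, and ask_chatgpt is a placeholder for whatever client is used to query ChatGPT; only the 0-to-1 scoring convention comes from the description above.

```python
import re

# Prompt wording is illustrative; ask_chatgpt is any callable that sends a prompt
# to ChatGPT and returns its textual reply.
SCORING_PROMPT = (
    "You are evaluating an autonomous-driving assistant.\n"
    "Reference (human driver): {reference}\n"
    "Prediction (model): {prediction}\n"
    "Rate how well the prediction mirrors the human driver's {aspect} on a scale "
    "from 0 to 1, where 1 means fully consistent. Reply with only the number."
)

def score_prediction(ask_chatgpt, reference, prediction, aspect="decision"):
    """Returns a float in [0, 1], or None if no number can be parsed from the reply."""
    reply = ask_chatgpt(SCORING_PROMPT.format(reference=reference,
                                              prediction=prediction,
                                              aspect=aspect))
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return None
    return min(max(float(match.group()), 0.0), 1.0)
```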

The evaluation of the Decision-Maker’s accuracy was conducted across four key dimensions:

Gaze: Assessing the accuracy of the Decision-Maker in identifying areas of focus during the driving process.

Scene Understanding: Evaluating how precisely the Decision-Maker describes the elements present in the current driving scene.

Reasoning: Analyzing the correctness of the logic employed by the Decision-Maker in making driving decisions.

Decision: Determining the accuracy of the final driving decisions made by the Decision-Maker.

To benchmark our system’s performance, we compared the Decision-Maker’s accuracy with that of other general-purpose large models, including LLaVA and GPT-4V, in similar scenarios. The outcomes of these comparisons are detailed in Table 1.

Table 1: Comparative Analysis of Driving Logic Understanding: This table contrasts our DME-Driver system with other large language models, emphasizing the proficiency in comprehending and interpreting driving logic.
Method           Gaze ↑   Scene Understanding ↑   Logic ↑
LLaVA-7B [35]    45.3     60.1                    45.7
LLaVA-13B [35]   48.3     65.2                    48.2
GPT-4V [57]      75.3     79.2                    65.4
DME-Driver       85.2     86.5                    80.3
(All entries are ChatGPT4-based scores; higher is better.)

5.2 Planning

The aim of this experiment is to validate the accuracy of the entire DME-Driver system in making autonomous driving decisions. Following methodologies established in previous works, we focus on evaluating the final decision accuracy of the system.
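For reference, the two metrics reported in Table 2 can be approximated as below, assuming the common nuScenes open-loop convention of ego-frame waypoints at 2 Hz over a 3 s horizon and BEV occupancy grids for the other agents; the paper’s exact evaluation protocol may differ.

```python
import numpy as np

def l2_at_horizons(pred_traj, gt_traj, hz=2):
    """Open-loop planning L2 error (m) at the 1s/2s/3s horizons, assuming
    ego-frame waypoints sampled at `hz` Hz over a 3 s horizon."""
    pred, gt = np.asarray(pred_traj), np.asarray(gt_traj)   # both (T, 2)
    errors = np.linalg.norm(pred - gt, axis=1)
    return {f"{s}s": float(errors[s * hz - 1]) for s in (1, 2, 3)}

def collision_rate(pred_trajs, occupancy_maps, resolution=0.5):
    """Percentage of predicted waypoints that fall inside occupied BEV cells.

    occupancy_maps: (N, T, H, W) boolean grids of other agents with the ego
    vehicle at the grid center; `resolution` is meters per cell."""
    hits, total = 0, 0
    for traj, occ in zip(pred_trajs, occupancy_maps):
        for t, (x, y) in enumerate(traj):
            h, w = occ[t].shape
            row = int(h / 2 - x / resolution)   # assumed grid layout: forward x points up
            col = int(w / 2 + y / resolution)
            if 0 <= row < h and 0 <= col < w:
                hits += int(occ[t][row, col])
            total += 1
    return 100.0 * hits / max(total, 1)
```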

As shown in Table 2, the results of our experiments demonstrate that the DME-Driver system successfully harnesses the logic-driven prompts from the Decision-Maker, enhancing the decision-making precision of the planning module. This integration not only leads to higher decision accuracy in diverse driving situations but also maintains a detailed log of the decision-making process. The ability to trace back and understand the rationale behind each decision is a critical aspect of our system, adding a layer of interpretability and accountability that is often lacking in autonomous driving systems.

Table 2: Comparison of Planning Accuracy: This table showcases a comparative analysis between our DME-Driver system and state-of-the-art methods, highlighting the advancements in planning accuracy achieved by our approach.
Method        Input    L2 (m) ↓                        Col. Rate (%) ↓
                       1s     2s     3s     Avg.       1s     2s     3s     Avg.
NMP [59]      Lidar    -      -      2.31   -          -      -      1.92   -
SA-NMP [59]   Lidar    -      -      2.05   -          -      -      1.59   -
FF [22]       Lidar    0.55   1.20   2.54   1.43       0.06   0.17   1.07   0.43
EO [32]       Lidar    0.67   1.36   2.78   1.60       0.04   0.09   0.88   0.33
ST-P3 [23]    Vision   1.33   2.11   2.90   2.11       0.23   0.62   1.27   0.71
UniAD [25]    Vision   0.48   0.96   1.65   1.03       0.05   0.17   0.71   0.31
DME-Driver    Vision   0.45   0.91   1.58   0.98       0.05   0.15   0.68   0.29
Table 3: Ablation Study Results: This table presents the impact of various components within our DME-Driver system, illustrating how each part contributes to the overall effectiveness and decision-making accuracy.
Module                            L2 (m) ↓   Col. Rate (%) ↓
Executor                          1.03       0.31
GT + Executor                     0.94       0.28
Decision-Maker + Executor         0.96       0.28
Decision-Maker + Executor + CL    0.98       0.29
(CL denotes the consistency loss introduced in Sec. 4.3.)

5.3 Ablation Study

In our ablation study of the DME-Driver system, we methodically dissected the impact of each component on decision-making effectiveness, as shown in Table 3. We began by assessing the standalone performance of the Executor without Decision-Maker guidance, establishing a baseline. Next, we evaluated the impact of substituting the Decision-Maker’s guidance with ground truth language cues, observing potential improvements. Following this, we examined the combined performance of the Decision-Maker and Executor, gauging their collaborative efficiency. A crucial addition was the implementation of a consistency loss mechanism, slightly reducing performance metrics but significantly enhancing decision alignment between Executor and Decision-Maker.

6 Conclusion

In addressing the challenges of interpretability and the insufficient use of human driver behavior patterns in autonomous driving systems, this paper introduces the DME-Driver Autonomous Driving System, a novel framework comprising two integral components: the Decision-Maker and the Executor. The Decision-Maker serves as the central decision-making component, adeptly understanding and emulating human driver logic, thus ensuring each action taken by the system is both logical and accountable. The Executor complements this by effectively translating the nuanced decisions of the Decision-Maker into precise vehicle control signals, harnessing the strengths of perception tasks and planning algorithms. To facilitate comprehensive training and understanding of human driver behavior, we developed the HBD dataset, rich in diverse and essential driving information such as gaze, decision logic, and operational signals. Our empirical tests showcase the system’s capability to accurately replicate human driver reasoning and actions. Combined with the Decision-Maker’s guidance, the Executor successfully converts these into operational commands, elevating the overall decision-making efficacy to a state-of-the-art level. This achievement not only demonstrates the effectiveness of the DME-Driver system but also marks a significant leap forward in the field of autonomous driving technology.

References

  • Badue et al. [2021] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz, et al. Self-driving cars: A survey. Expert Systems with Applications, 165:113816, 2021.
  • Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • Barnes et al. [2017] Dan Barnes, Will Maddern, and Ingmar Posner. Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 203–210. IEEE, 2017.
  • Bender et al. [2021] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
  • Bojarski et al. [2016] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • Brohan et al. [2023] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • Changpinyo et al. [2021] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  • Chen et al. [2015] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE international conference on computer vision, pages 2722–2730, 2015.
  • Chen et al. [2023] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023.
  • Cheng et al. [2023] Wenhao Cheng, Junbo Yin, Wei Li, Ruigang Yang, and Jianbing Shen. Language-guided 3d object detection in point cloud for autonomous driving. arXiv preprint arXiv:2305.15765, 2023.
  • Chi and Mu [2017] Lu Chi and Yadong Mu. Deep steering: Learning end-to-end driving model from spatial and temporal visual cues. arXiv preprint arXiv:1708.03798, 2017.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dosovitskiy et al. [2017] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
  • Elhafsi et al. [2023] Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa AD Nesnas, and Marco Pavone. Semantic anomaly detection with large language models. Autonomous Robots, pages 1–21, 2023.
  • Feng et al. [2018] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st international conference on intelligent transportation systems (ITSC), pages 3266–3273. IEEE, 2018.
  • Fuji et al. [2014] Hiroshi Fuji, Jingyu Xiang, Yuichi Tazaki, Blaine Levedahl, and Tatsuya Suzuki. Trajectory planning for automated parking using multi-resolution state roadmap considering non-holonomic constraints. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 407–413. IEEE, 2014.
  • Grigorescu et al. [2020] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
  • Gu et al. [2023] Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5496–5506, 2023.
  • Hang et al. [2020] Peng Hang, Chen Lv, Yang Xing, Chao Huang, and Zhongxu Hu. Human-like decision making for autonomous driving: A noncooperative game theoretic approach. IEEE Transactions on Intelligent Transportation Systems, 22(4):2076–2087, 2020.
  • Hu et al. [2021] Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan. Safe local motion planning with self-supervised freespace forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12732–12741, 2021.
  • Hu et al. [2022a] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022a.
  • Hu et al. [2022b] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Goal-oriented autonomous driving. arXiv preprint arXiv:2212.10156, 2022b.
  • Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  • Huang et al. [2023] Zhiyu Huang, Haochen Liu, Jingda Wu, and Chen Lv. Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving. IEEE transactions on neural networks and learning systems, 2023.
  • Hubschneider et al. [2017] Christian Hubschneider, Andre Bauer, Michael Weber, and J Marius Zöllner. Adding navigation to the equation: Turning decisions for end-to-end vehicle control. In 2017 IEEE 20th international conference on intelligent transportation systems (ITSC), pages 1–8. IEEE, 2017.
  • Huval et al. [2015] Brody Huval, Tao Wang, Sameep Tandon, Jeff Kiske, Will Song, Joel Pazhayampallil, Mykhaylo Andriluka, Pranav Rajpurkar, Toki Migimatsu, Royce Cheng-Yue, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.
  • Isele et al. [2018] David Isele, Reza Rahimi, Akansel Cosgun, Kaushik Subramanian, and Kikuo Fujimura. Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In 2018 IEEE international conference on robotics and automation (ICRA), pages 2034–2039. IEEE, 2018.
  • Kamath et al. [2023] Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, and Zarana Parekh. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10813–10823, 2023.
  • Kasahara et al. [2022] Isaac Kasahara, Simon Stent, and Hyun Soo Park. Look both ways: Self-supervising driver gaze estimation and road scene saliency. In European Conference on Computer Vision, pages 126–142. Springer, 2022.
  • Khurana et al. [2022] Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, and Deva Ramanan. Differentiable raycasting for self-supervised occupancy forecasting. In European Conference on Computer Vision, pages 353–369. Springer, 2022.
  • Kim et al. [2020] Jinkyu Kim, Suhong Moon, Anna Rohrbach, Trevor Darrell, and John Canny. Advisable learning for self-driving vehicles by internalizing observation-to-action rules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9661–9670, 2020.
  • Liu et al. [2017] HaiLong Liu, Tadahiro Taniguchi, Yusuke Tanaka, Kazuhito Takenaka, and Takashi Bando. Visualization of driving behavior based on hidden feature extraction by using deep learning. IEEE Transactions on Intelligent Transportation Systems, 18(9):2477–2489, 2017.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a.
  • Liu et al. [2023b] Mengyin Liu, Jie Jiang, Chao Zhu, and Xu-Cheng Yin. Vlpd: Context-aware pedestrian detection via vision-language semantic self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6662–6671, 2023b.
  • Luo et al. [2019] Qian Luo, Yurui Cao, Jiajia Liu, and Abderrahim Benslimane. Localization and navigation in autonomous driving: Threats and countermeasures. IEEE Wireless Communications, 26(4):38–45, 2019.
  • Mao et al. [2023] Jiageng Mao, Yuxi Qian, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
  • Maqueda et al. [2018] Ana I Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso García, and Davide Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5419–5427, 2018.
  • Montemerlo et al. [2008] M Montemerlo, J Becker, S Bhat, and H Dahlkamp. The stanford entry in the urban challenge. Journal of Field Robotics, 7(9):468–492, 2008.
  • Nguyen et al. [2011] Thien-Nghia Nguyen, Bernd Michaelis, Ayoub Al-Hamadi, Michael Tornow, and Marc-Michael Meinecke. Stereo-camera-based urban environment perception using occupancy grid and object tracking. IEEE Transactions on Intelligent Transportation Systems, 13(1):154–165, 2011.
  • Peng et al. [2023] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–824, 2023.
  • Radecki et al. [2016] Peter Radecki, Mark Campbell, and Kevin Matzen. All weather perception: Joint data association, tracking, and classification for autonomous ground vehicles. arXiv preprint arXiv:1605.02196, 2016.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • Schwarting et al. [2018] Wilko Schwarting, Javier Alonso-Mora, and Daniela Rus. Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems, 1:187–210, 2018.
  • Shao et al. [2023] Hao Shao, Letian Wang, Ruobing Chen, Hongsheng Li, and Yu Liu. Safety-enhanced autonomous driving using interpretable sensor fusion transformer. In Conference on Robot Learning, pages 726–737. PMLR, 2023.
  • Sriram et al. [2019] NN Sriram, Tirth Maniar, Jayaganesh Kalyanasundaram, Vineet Gandhi, Brojeshwar Bhowmick, and K Madhava Krishna. Talk to the vehicle: Language conditioned autonomous navigation of self driving cars. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5284–5290. IEEE, 2019.
  • Strubell et al. [2019] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
  • Sun et al. [2020] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
  • Thrun et al. [2006] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, et al. Stanley: The robot that won the darpa grand challenge. Journal of field Robotics, 23(9):661–692, 2006.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wu et al. [2022] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. Advances in Neural Information Processing Systems, 35:6119–6132, 2022.
  • Xu et al. [2023] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth KY Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023.
  • Yakovlev and Borisov [2009] SS Yakovlev and Arkady N Borisov. A synergy of the rosenblatt perceptron and the jordan recurrence principle. Automatic Control and Computer Sciences, 43:31–39, 2009.
  • Yang et al. [2023] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1), 2023.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Zeng et al. [2019] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8660–8669, 2019.