Unified End-to-End V2X Cooperative Autonomous Driving
Abstract
V2X cooperation, which integrates sensor data from both vehicles and infrastructure, is considered a pivotal approach to advancing autonomous driving technology. Current research primarily focuses on enhancing perception accuracy and often overlooks the systematic improvement of accident prediction through end-to-end learning, leaving the safety of autonomous driving insufficiently addressed. To address this challenge, this paper introduces UniE2EV2X, a V2X-integrated end-to-end autonomous driving framework that consolidates the key driving modules within a unified network. The framework employs a deformable-attention-based data fusion strategy to enable effective cooperation between vehicles and infrastructure. Its main advantages are: 1) significantly enhancing agents’ perception and motion prediction capabilities, and thereby the accuracy of accident prediction; 2) ensuring high reliability of the data fusion process; 3) achieving superior end-to-end perception compared with modular approaches. Furthermore, we implement and evaluate UniE2EV2X on DeepAccident, a challenging simulation dataset designed for V2X cooperative driving.
Index Terms:
V2X Cooperation, End-to-End, Autonomous Driving
I Introduction
Over the past few decades, the transportation[1] and automotive sectors[2][3] have seen increasing automation and intelligence, driven by advances in deep learning[4][5][6], control theory[7][8], and technologies such as sensors[9][10] and network communications[11][12][13][14]. Autonomous driving research typically decomposes the required intelligence into multiple subtasks aligned with the stages of driving, such as perception[15][16], prediction[17], planning[18], and control[19]. These multi-stage approaches must maintain inter-module communication, which can introduce response delays and information loss[20][21]. End-to-end autonomous driving methods[22][23][24], in contrast, offer a more direct and streamlined approach by translating environmental data straight into vehicle control decisions, reducing system complexity and minimizing delays through a unified data representation. However, the perception range of a single intelligent vehicle is limited to its onboard sensors, which can compromise its perception capabilities under complex road and adverse weather conditions.

Vehicle-to-Everything (V2X) cooperation[25][26][27] enhances autonomous vehicles by enabling information exchange and collaborative operation between vehicles and road infrastructure. This provides comprehensive, accurate road and traffic-signal information, improving safety and efficiency. Moreover, V2X communication lets vehicles perceive beyond their immediate vicinity and facilitates cooperative driving among vehicles. Although current V2X research focuses on improving metrics such as detection accuracy and trajectory prediction precision, these improvements do not necessarily translate into effective planning outcomes, because multi-stage autonomous driving methods introduce information that is irrelevant to the final driving decision.

This paper proposes an end-to-end V2X autonomous driving framework oriented toward collision prediction, integrating object detection, tracking, trajectory prediction, and collision warning into a unified method. The remainder of the paper is organized as follows: Section 2 reviews related work, Section 3 introduces the end-to-end V2X neural network model, Section 4 presents experiments on a public dataset and comparisons with other models to validate the effectiveness of the proposed method, and Section 5 concludes the paper.
II Related Work
II-A End-to-End Autonomous Driving
End-to-end autonomous driving methods are architecturally simpler than modular approaches, producing driving commands directly from perception data and thus avoiding redundant intermediate-stage information. They can be implemented through two approaches: imitation learning and reinforcement learning[28]. Imitation learning, a form of supervised learning, learns a policy by imitating human driving behavior. Its advantage is high training efficiency, but it requires extensive annotated training data and cannot cover all potential traffic and driving scenarios. Reinforcement learning, in contrast, learns directly from the interaction between the model and the environment to maximize cumulative rewards. It does not require manually collected and annotated data, but model convergence is slow and the results are strongly affected by the reward design and other factors. Early end-to-end autonomous driving methods focused primarily on imitation learning and typically handled simple driving tasks. Using CNNs to infer steering angles from images captured by three cameras, [29] achieved lane keeping on roads without lane markings. Taking vehicle speed into account, [30] added temporal information on top of CNNs using an LSTM, an approach effective for simple tasks like lane keeping but limited in complex traffic scenarios and driving tasks[31]. Several studies have implemented end-to-end autonomous driving through reinforcement learning, handling more complex scenarios than imitation learning[32],[33],[34]. Integrating multimodal data into end-to-end autonomous driving models has yielded better performance than single-modal approaches[35],[36],[37]. However, the main challenge of end-to-end methods is poor interpretability, which makes it difficult to diagnose and address problems when they arise. UniAD unifies multiple Transformer networks over shared BEV features, with modules for tracking, mapping, and trajectory prediction; this improves interpretability, aids training and troubleshooting, and uses the final planning outcome to design the loss function, constructing a planning-oriented end-to-end autonomous driving model[38].
II-B Vehicle-to-Everything Cooperation
Autonomous vehicles based on single-vehicle intelligence perceive the environment centered on the vehicle itself using onboard sensors. However, real-world traffic scenes are complex and variable, and a single vehicle has limited perception, especially of vulnerable road users[39]. Thanks to advances in communication technology, Cooperative Automated Driving Vehicles (CAVs) have been proposed, which enhance a vehicle’s perception capabilities by aggregating perception data from other autonomous vehicles in the traffic environment. Cooperative perception for autonomous driving can be categorized into three types according to the data transmitted via V2X. The first type transmits raw point cloud data directly for cooperative perception, demanding high transmission bandwidth[40]. The second processes the raw perception data into a unified feature representation, such as BEV spatial features, before transmission to save bandwidth; this approach balances bandwidth requirements and detection accuracy and is the mainstream V2X transmission method[41],[42],[43],[44],[45]. The third generates prediction results on each autonomous vehicle before transmitting these outcomes via V2X, requiring low bandwidth but demanding high accuracy from each vehicle’s individual predictions[46]. High-quality datasets have propelled research on cooperative perception, with mainstream datasets including V2X-Sim[47], OPV2V[48], and DAIR-V2X[49]. However, these vehicle-road cooperation datasets primarily use perception accuracy as the evaluation metric, making them suitable for testing perception algorithms but not for evaluating end-to-end algorithms. DeepAccident[50] is a large-scale autonomous driving dataset generated with the CARLA simulator that supports end-to-end motion and accident prediction tasks. In this work, we propose an end-to-end autonomous driving framework based on vehicle-road cooperation and use the DeepAccident dataset to evaluate the performance of the related algorithms.
III Methodology
This paper introduces a vehicle-road cooperative end-to-end autonomous driving framework comprising two major components: the V2X Cooperative Feature Encoder and the End-to-End Perception, Motion, and Accident Prediction Module.
III-A V2X Cooperative Feature Encoding Based on Temporal Bird’s Eye View
Overall Structure
Our V2X framework includes both the vehicle itself and the road infrastructure. During the cooperative phase, each agent first extracts multi-view image features and converts them into BEV features. These are then encoded to align the temporal sequence of each V2X agent’s BEV perception information. Finally, by merging the BEV features of the vehicle with those of the roadside infrastructure, we obtain the cooperative perception features. The process of extracting V2X cooperative features based on temporal BEV, illustrated in Figure 1, consists of two main components: a multi-view image to BEV feature module based on spatial BEV encoding, and a temporal cascading BEV feature fusion module based on temporal BEV encoding. After spatial transformation and temporal fusion, the infrastructure BEV features are aligned to the vehicle’s coordinate system, and a deformable attention mechanism fuses the two aligned BEV features, enhancing the vehicle’s perception capabilities and yielding the final V2X cooperative BEV features.
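To make this data flow concrete, the following is a minimal PyTorch-style skeleton of the cooperative encoder described above. The class and argument names (V2XCooperativeEncoder, img_to_bev, temporal_fusion, v2x_fusion) are illustrative assumptions rather than the framework's actual interfaces; the sketch only fixes the order of operations.

```python
import torch.nn as nn


class V2XCooperativeEncoder(nn.Module):
    """Skeleton of the cooperative BEV feature flow: per-agent image-to-BEV
    encoding, per-agent temporal fusion, then V2X fusion of the two BEV maps."""

    def __init__(self, img_to_bev: nn.Module, temporal_fusion: nn.Module,
                 v2x_fusion: nn.Module):
        super().__init__()
        self.img_to_bev = img_to_bev            # spatial BEV encoder (multi-view images -> BEV)
        self.temporal_fusion = temporal_fusion  # temporal BEV encoder (previous frame as prior)
        self.v2x_fusion = v2x_fusion            # deformable-attention fusion of vehicle/infra BEV

    def forward(self, veh_imgs, infra_imgs, veh_bev_prev, infra_bev_prev):
        # Per-agent spatial encoding: multi-view images -> BEV features.
        veh_bev = self.img_to_bev(veh_imgs)
        infra_bev = self.img_to_bev(infra_imgs)
        # Per-agent temporal fusion with the (ego-motion aligned) previous BEV.
        veh_bev = self.temporal_fusion(veh_bev, veh_bev_prev)
        infra_bev = self.temporal_fusion(infra_bev, infra_bev_prev)
        # The infrastructure BEV is assumed to have been warped into the
        # vehicle coordinate frame before this call (see the alignment sketch
        # later in this section); the fusion module then merges the two maps.
        return self.v2x_fusion(veh_bev, infra_bev)
```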

Multi-View Image to BEV Feature Module Based on Spatial BEV Encoding
The original perception information obtained from both the vehicle and the road infrastructure consists of multi-view perspective images. To eliminate spatial semantic differences and merge multi-source perception data, the multi-view images from the vehicle and the infrastructure are processed through two parallel channels of feature extraction and transformation to yield unified BEV features. Following the approach in [51], we map multi-view perspective images into the BEV space. In the module for converting multi-view images to BEV features based on spatial BEV encoding, the multi-view images are first processed separately: two-dimensional convolution extracts multi-view feature maps, which are then input into the spatial BEV encoder module. The spatial BEV encoder ultimately generates high-level semantic BEV features of the images. This process is described by Equation (1), where ResNet refers to the ResNet-101 backbone network, $I_v^1, I_v^2, \ldots, I_v^6$ represent the camera images from the six viewpoints of the vehicle, $F_v^1, F_v^2, \ldots, F_v^6$ represent the feature maps from these six viewpoints, and $F_r^1, F_r^2, \ldots, F_r^6$ are the feature maps from the six viewpoints of the road infrastructure.

$F_v^k = \mathrm{ResNet}(I_v^k), \qquad F_r^k = \mathrm{ResNet}(I_r^k), \qquad k = 1, \ldots, 6$  (1)
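As a minimal sketch of Eq. (1), the per-view feature extraction could be implemented with a shared torchvision ResNet-101 applied to each camera image; the tensor shapes and the choice of truncating the backbone after the last residual stage are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101


class MultiViewBackbone(nn.Module):
    """Eq. (1): apply a shared ResNet-101 to each of the six camera views,
    producing one feature map per view; the vehicle and the infrastructure
    run the same procedure in two parallel branches."""

    def __init__(self):
        super().__init__()
        backbone = resnet101(weights=None)  # pretrained weights optional
        # Keep everything up to the last residual stage (drop avgpool / fc).
        self.stem = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, N_views=6, 3, H, W) multi-view perspective images I^k
        b, n = images.shape[:2]
        feats = self.stem(images.flatten(0, 1))   # (B*N, 2048, H/32, W/32)
        return feats.unflatten(0, (b, n))         # per-view feature maps F^k


# Example: six 224x224 views from one agent.
feats = MultiViewBackbone()(torch.randn(1, 6, 3, 224, 224))
```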
Next, the multi-view feature maps are input into a spatial BEV encoder based on a deformable spatial cross-attention mechanism to transform the two-dimensional image features into BEV spatial features. Initially, a BEV target query $Q$, a learnable parameter tensor, is created to gradually learn the BEV information of the multi-view images under the action of the spatial BEV encoder. $Q$ serves as the query for the spatial BEV encoder, with the multi-view feature maps $F_v$ or $F_r$ as the keys and values for the encoder. After six rounds of BEV feature encoding interactions, the parameters of $Q$ are continually updated to yield a complete and accurate BEV feature $B$. The specific BEV encoding process can be represented by Equations (2) and (3), where $Q$, $K$, $V$ respectively denote the BEV target query, the image BEV key, and the image BEV value; $W_Q$, $W_K$, $W_V$ are the weight matrices for $Q$, $K$, $V$; $B$ and $F$ denote the BEV features and the image features, respectively; and $d_k$ is the key dimension.

$Q = W_Q B, \qquad K = W_K F, \qquad V = W_V F$  (2)

$B = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$  (3)
However, the query in traditional Transformer architecture encoders conducts attention operations with all keys, which is neither efficient nor necessary given the vast scale and mixed signals of multi-view feature maps serving as keys. Hence, in actual BEV feature encoding, encoders based on a deformable attention mechanism are used to conserve computational resources and enhance efficiency significantly.
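The following is a single-scale, single-head sketch of such a deformable cross-attention: each query predicts a handful of sampling offsets around its reference point and aggregates only those sampled features, rather than attending to every key. The offset scaling, head count, and class name (SimpleDeformableCrossAttention) are simplifications and assumptions relative to the full multi-scale, multi-head operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDeformableCrossAttention(nn.Module):
    """Single-scale sketch of the deformable attention used by the BEV encoders:
    each BEV query attends to only `n_points` sampled locations of the value
    feature map instead of all keys, which keeps the cost manageable."""

    def __init__(self, dim: int = 256, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, 2 * n_points)  # per-query sampling offsets
        self.weight_proj = nn.Linear(dim, n_points)      # per-query attention weights
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, value_map):
        # queries:    (B, Nq, C)   BEV query features
        # ref_points: (B, Nq, 2)   reference points in normalized [0, 1] coords
        # value_map:  (B, C, H, W) value features (image or BEV features)
        b, nq, c = queries.shape
        value_map = self.value_proj(value_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

        offsets = self.offset_proj(queries).view(b, nq, self.n_points, 2)
        weights = self.weight_proj(queries).softmax(-1)               # (B, Nq, P)

        # Small offsets around each reference point, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + 0.01 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value_map, loc, align_corners=False)  # (B, C, Nq, P)

        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)  # (B, Nq, C)
        return self.out_proj(out)
```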
Temporal Cascading BEV Feature Fusion Module Based on Temporal BEV Encoding
The BEV features obtained in the previous section are treated as carriers of sequential information: each moment’s BEV feature builds on the BEV feature from the previous moment to capture temporal information. This allows the dynamic acquisition of the required temporal features, enabling the BEV features to respond more quickly and effectively to changes in a dynamic environment. In the temporal cascading BEV feature fusion module based on temporal BEV encoding, the BEV feature $B_{t-1}$ from the preceding frame serves as prior information to enhance the current frame’s BEV feature $B_t$. Since $B_{t-1}$ and $B_t$ are expressed in their respective vehicle coordinate systems, $B_{t-1}$ must first be transformed into the vehicle coordinate system of the current frame $t$ using the vehicle’s position transformation matrix. Then $B_{t-1}$ and $B_t$, as two frames of BEV features, are input, and a temporal BEV encoder based on a deformable cross-attention mechanism transforms them into cooperative perception BEV features. First, static scene alignment is performed: knowing the world coordinates of the vehicle at moments $t-1$ and $t$, and using the continuous-frame vehicle motion transformation matrix, $B_{t-1}$ is aligned to frame $t$. This alignment ensures that the same grid position in the aligned feature and in $B_t$ corresponds to the same location in the real world; the aligned BEV feature is denoted $B'_{t-1}$. Subsequently, dynamic target alignment is executed. The BEV feature at time $t$ serves as the target query $Q_t$, progressively learning the BEV features of time $t-1$ under the action of the temporal BEV encoder. $Q_t$ is used as the query for the temporal BEV encoder, with the previous moment’s aligned BEV feature $B'_{t-1}$ serving as keys and values. Through BEV feature encoding interactions, the parameters of $Q_t$ are continuously updated, ultimately yielding a complete and accurate cooperative perception BEV feature $B_t$. The specific BEV encoding process is represented by Equations (4) and (5), where $Q_t$, $K_{t-1}$, and $V_{t-1}$ respectively represent the target query for the BEV features at time $t$, the key for the BEV features at time $t-1$, and the value for the BEV features at time $t-1$; $W_Q$, $W_K$, and $W_V$ are the weight matrices for $Q_t$, $K_{t-1}$, and $V_{t-1}$; and $B_t$ and $B'_{t-1}$ denote the BEV features at time $t$ and the aligned BEV features at time $t-1$, respectively.

$Q_t = W_Q B_t, \qquad K_{t-1} = W_K B'_{t-1}, \qquad V_{t-1} = W_V B'_{t-1}$  (4)

$B_t = \operatorname{Attention}(Q_t, K_{t-1}, V_{t-1}) = \operatorname{softmax}\!\left(\dfrac{Q_t K_{t-1}^{\top}}{\sqrt{d_k}}\right) V_{t-1}$  (5)
At time $t-1$, assuming a target is present at some point in $B'_{t-1}$, it is likely that the target will appear near the corresponding point in $B_t$ at time $t$. By employing the deformable cross-attention mechanism to focus on this point and sample features around it, high-precision temporal feature extraction with low overhead can be achieved in dynamic and complex environments.
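A sketch of the static scene alignment step is given below: the previous BEV feature map is warped into the current ego frame with a 2-D rigid transform applied through affine_grid/grid_sample. The BEV extent, the sign conventions of the transform, and the function name align_prev_bev are assumptions that depend on the BEV axis definition.

```python
import math
import torch
import torch.nn.functional as F


def align_prev_bev(prev_bev: torch.Tensor, dx: float, dy: float, dyaw: float,
                   bev_range_m: float = 102.4) -> torch.Tensor:
    """Static scene alignment sketch: warp the previous frame's BEV features
    B_{t-1} into the current ego frame so that the same grid cell in B'_{t-1}
    and B_t corresponds to the same real-world location. dx, dy are the ego
    translation (metres) and dyaw the ego rotation between frames; the BEV is
    assumed to cover a square of side `bev_range_m` centred on the ego vehicle."""
    b, c, h, w = prev_bev.shape
    cos, sin = math.cos(dyaw), math.sin(dyaw)
    # 2x3 affine matrix in normalized grid coordinates (translation scaled to [-1, 1]).
    theta = torch.tensor([[cos, -sin, 2 * dx / bev_range_m],
                          [sin,  cos, 2 * dy / bev_range_m]],
                         dtype=prev_bev.dtype, device=prev_bev.device).repeat(b, 1, 1)
    grid = F.affine_grid(theta, size=(b, c, h, w), align_corners=False)
    # Cells that fall outside the previous field of view are zero-padded.
    return F.grid_sample(prev_bev, grid, align_corners=False)


# Example: ego moved 2 m forward and yawed 0.05 rad between frames.
aligned = align_prev_bev(torch.randn(1, 256, 200, 200), dx=2.0, dy=0.0, dyaw=0.05)
```

The aligned map then serves as the keys and values of the temporal deformable cross-attention in Eqs. (4) and (5).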
III-B End-to-End Autonomous Driving
We propose a unified end-to-end V2X cooperative autonomous driving model named UniE2EV2X, oriented towards accident prediction. The primary tasks of this model include object detection and tracking, motion prediction, and post-processing for accident prediction, as illustrated in Figure 2.

Detection and Tracking
The perception module is the initial component of the end-to-end autonomous driving framework presented in this paper. It consists of detection and tracking sub-modules, takes the cooperative BEV features as input, and produces tracked agent features for the downstream motion prediction module. The detection sub-module predicts target information in the cooperative BEV features of each time frame, including target locations and dimensions. The tracking sub-module associates the same targets across frames by assigning consistent IDs. In this study, detection and tracking are integrated into a unified multi-object tracking module: detection queries first identify newly appeared targets, then the current frame’s tracking queries interact with the detection queries from preceding frames to aggregate temporal information, and the tracking queries are updated for target tracking in subsequent frames. These multi-object tracking queries contain features representing target information over consecutive frames. Additionally, an ego-vehicle query is introduced to aggregate the trajectory of the self-driving car, which is later used to predict the vehicle’s future trajectory. The multi-object tracking module consists of N Transformer layers, and its output features $Q_A$ contain rich agent information that is further utilized in the motion prediction module.
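As an illustration of the query flow, the sketch below combines persistent track queries with fresh detection queries in a vanilla Transformer decoder over the cooperative BEV tokens, keeping IDs for surviving tracks and assigning new IDs to newly detected targets. The query counts, the confidence threshold, and the single-batch assumption are illustrative; the actual module additionally regresses boxes and maintains an ego-vehicle query.

```python
import torch
import torch.nn as nn


class TrackQueryUpdater(nn.Module):
    """Simplified sketch of the joint detection/tracking query flow: detection
    queries pick up newly appeared targets, existing track queries attend to
    the current frame and are carried over with persistent IDs."""

    def __init__(self, dim: int = 256, num_det_queries: int = 100,
                 num_layers: int = 6, score_thresh: float = 0.5):
        super().__init__()
        self.det_queries = nn.Parameter(torch.randn(num_det_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.score_head = nn.Linear(dim, 1)
        self.score_thresh = score_thresh

    def forward(self, bev_tokens, track_queries, track_ids, next_id):
        # bev_tokens:    (1, H*W, C) flattened cooperative BEV features (batch size 1)
        # track_queries: (1, Nt, C)  queries carried over from the previous frame
        # track_ids:     list of Nt persistent IDs matching track_queries
        queries = torch.cat([track_queries, self.det_queries.unsqueeze(0)], dim=1)
        feats = self.decoder(queries, bev_tokens)           # agent features (Q_A in the text)
        scores = self.score_head(feats).sigmoid().squeeze(-1)

        keep = scores[0] > self.score_thresh                # surviving tracks + new births
        n_old = track_queries.shape[1]
        new_ids = []
        for i in torch.nonzero(keep).flatten().tolist():
            if i < n_old:
                new_ids.append(track_ids[i])                # keep the existing ID
            else:
                new_ids.append(next_id)                     # assign an ID to a new target
                next_id += 1
        return feats[:, keep], new_ids, next_id
```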
Motion Prediction
The motion prediction module takes the multi-object tracking queries $Q_A$ and the cooperative BEV features from the perception module as inputs. Using a scene-centric approach, it outputs motion queries $Q_{mot}$ that predict the future trajectories of each agent and of the ego-vehicle over T frames with K possible paths. This allows the trajectories of multiple agents to be predicted simultaneously while fully considering the interactions between agents and between agents and target locations. The agent-interaction motion queries $Q_a$ are derived from multi-head cross-attention between the motion and tracking queries, while the target-location motion queries $Q_g$ are generated through a deformable attention mechanism over the motion queries, the target positions, and the cooperative BEV features. $Q_a$ and $Q_g$ are concatenated and passed through a multilayer perceptron (MLP) to produce the query context $Q_{ctx}$. The motion query position $Q_{pos}$ incorporates four types of positional knowledge: scene-level anchors, agent-level anchors, the agents’ current positions, and the predicted target points. $Q_{ctx}$ and $Q_{pos}$ are merged to form the motion query $Q_{mot}$, which directly predicts each agent’s motion trajectory.
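A sketch of how the motion query could be assembled from these pieces is shown below, assuming that each of the four positional cues is a 2-D BEV point and that $Q_a$ and $Q_g$ have already been produced by the respective attention blocks; the layer sizes and class name are placeholders.

```python
import torch
import torch.nn as nn


class MotionQueryBuilder(nn.Module):
    """Sketch of motion query assembly: the agent-interaction query Q_a and the
    target-location query Q_g are fused by an MLP into the query context Q_ctx,
    four kinds of positional knowledge are encoded into Q_pos, and their sum
    forms the motion query Q_mot used to regress the K future trajectories."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.ctx_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        # One small encoder per kind of positional knowledge (each a 2-D point).
        self.pos_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(4)])

    def forward(self, q_agent, q_goal, scene_anchor, agent_anchor, cur_pos, goal_pos):
        # q_agent, q_goal: (B, Na, K, C) interaction queries Q_a and Q_g
        # positional inputs: (B, Na, K, 2) points in BEV coordinates
        q_ctx = self.ctx_mlp(torch.cat([q_agent, q_goal], dim=-1))
        q_pos = sum(mlp(p) for mlp, p in zip(
            self.pos_mlps, [scene_anchor, agent_anchor, cur_pos, goal_pos]))
        return q_ctx + q_pos                      # motion query Q_mot
```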
Accident Prediction
After the cooperative BEV features are fed into the end-to-end autonomous driving framework, motion predictions for all agents and the ego-vehicle are obtained. These predictions are post-processed frame by frame to check for potential accidents. For each timestamp, the predicted footprint of each agent along its trajectory is approximated as a polygon and its nearest other targets are identified. If the minimum distance between two objects falls below a safety threshold, an accident is flagged, yielding the colliding objects’ IDs, positions, and the timestamp of the collision. To assess the accuracy of accident prediction against real accident data, the same post-processing is applied to the ground-truth motions to determine whether a future accident actually occurs. A predicted collision is counted as correct when both the prediction and the ground truth indicate an accident and the distance between the corresponding colliding objects is below the threshold.
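A minimal sketch of this post-processing is shown below, assuming the shapely geometry library for polygon distances; the box parameterization, the safety threshold, and the function names are illustrative. Applying the same routine to predicted and ground-truth trajectories and matching the flagged pairs yields the accident prediction accuracy described above.

```python
import math
from itertools import combinations

from shapely.geometry import Polygon  # assumed available; any polygon library works


def agent_polygon(x, y, yaw, length, width):
    """Footprint of an agent at one timestamp as an oriented rectangle."""
    c, s = math.cos(yaw), math.sin(yaw)
    corners = [(length / 2, width / 2), (length / 2, -width / 2),
               (-length / 2, -width / 2), (-length / 2, width / 2)]
    return Polygon([(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in corners])


def detect_collision(trajectories, safety_threshold=0.5):
    """Frame-by-frame post-processing sketch: at every timestamp, build each
    agent's predicted footprint and flag the first pair whose minimum
    separation falls below the safety threshold. `trajectories` maps
    agent_id -> list of (x, y, yaw, length, width) per future timestep."""
    horizon = len(next(iter(trajectories.values())))
    for t in range(horizon):
        polys = {aid: agent_polygon(*traj[t]) for aid, traj in trajectories.items()}
        for (id_a, pa), (id_b, pb) in combinations(polys.items(), 2):
            if pa.distance(pb) < safety_threshold:
                # Report the colliding IDs, their positions, and the timestamp.
                return {"agents": (id_a, id_b), "timestep": t,
                        "positions": (pa.centroid.coords[0], pb.centroid.coords[0])}
    return None  # no accident predicted within the horizon
```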
IV Experiments
V Conclusion
References
- [1] Haiyang Yu, Rui Jiang, Zhengbing He, Zuduo Zheng, Li Li, Runkun Liu, and Xiqun Chen. Automated vehicle-involved traffic flow studies: A survey of assumptions, models, speculations, and perspectives. Transportation research part C: emerging technologies, 127:103101, 2021.
- [2] Long Chen, Yuchen Li, Chao Huang, Bai Li, Yang Xing, Daxin Tian, Li Li, Zhongxu Hu, Xiaoxiang Na, Zixuan Li, et al. Milestones in autonomous driving and intelligent vehicles: Survey of surveys. IEEE Transactions on Intelligent Vehicles, 8(2):1046–1056, 2022.
- [3] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469, 2020.
- [4] Jing Ren, Hossam Gaber, and Sk Sami Al Jabar. Applying deep learning to autonomous vehicles: A survey. In 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD), pages 247–252, 2021.
- [5] Steffen Hagedorn, Marcel Hallgarten, Martin Stoll, and Alexandru Condurache. Rethinking integration of prediction and planning in deep learning-based automated driving systems: a review. arXiv preprint arXiv:2308.05731, 2023.
- [6] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.
- [7] Wei Liu, Min Hua, Zhiyun Deng, Zonglin Meng, Yanjun Huang, Chuan Hu, Shunhui Song, Letian Gao, Changsheng Liu, Bin Shuai, et al. A systematic survey of control techniques and applications in connected and automated vehicles. IEEE Internet of Things Journal, 2023.
- [8] Changzhu Zhang, Jinfei Hu, Jianbin Qiu, Weilin Yang, Hong Sun, and Qijun Chen. A novel fuzzy observer-based steering control approach for path tracking in autonomous vehicles. IEEE Transactions on Fuzzy Systems, 27(2):278–290, 2018.
- [9] Zhangjing Wang, Yu Wu, and Qingqing Niu. Multi-sensor fusion in automated driving: A survey. IEEE Access, 8:2847–2868, 2019.
- [10] Keli Huang, Botian Shi, Xiang Li, Xin Li, Siyuan Huang, and Yikang Li. Multi-modal sensor fusion for auto driving perception: A survey. arXiv preprint arXiv:2202.02703, 2022.
- [11] Zhong-gui Ma, Zhuo Li, and Yan-peng Liang. Overview and prospect of communication-sensing-computing integration for autonomous driving in the internet of vehicles. Chinese Journal of Engineering, 45(1):137–149, 2023.
- [12] Salvador V Balkus, Honggang Wang, Brian D Cornet, Chinmay Mahabal, Hieu Ngo, and Hua Fang. A survey of collaborative machine learning using 5g vehicular communications. IEEE Communications Surveys & Tutorials, 24(2):1280–1303, 2022.
- [13] Tejasvi Alladi, Vinay Chamola, Nishad Sahu, Vishnu Venkatesh, Adit Goyal, and Mohsen Guizani. A comprehensive survey on the applications of blockchain for securing vehicular networks. IEEE Communications Surveys & Tutorials, 24(2):1212–1239, 2022.
- [14] Faisal Hawlader, François Robinet, and Raphaël Frank. Leveraging the edge and cloud for v2x-based real-time object detection in autonomous driving. Computer Communications, 213:372–381, 2024.
- [15] Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, et al. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [16] Yushan Han, Hui Zhang, Huifang Li, Yi Jin, Congyan Lang, and Yidong Li. Collaborative perception in autonomous driving: Methods, datasets, and challenges. IEEE Intelligent Transportation Systems Magazine, 2023.
- [17] Sajjad Mozaffari, Omar Y Al-Jarrah, Mehrdad Dianati, Paul Jennings, and Alexandros Mouzakitis. Deep learning-based vehicle behavior prediction for autonomous driving applications: A review. IEEE Transactions on Intelligent Transportation Systems, 23(1):33–47, 2020.
- [18] Peng Hang, Chen Lv, Chao Huang, Jiacheng Cai, Zhongxu Hu, and Yang Xing. An integrated framework of decision making and motion planning for autonomous vehicles considering social behaviors. IEEE Transactions on Vehicular Technology, 69(12):14458–14469, 2020.
- [19] Sampo Kuutti, Richard Bowden, Yaochu Jin, Phil Barber, and Saber Fallah. A survey of deep learning applications to autonomous vehicle control. IEEE Transactions on Intelligent Transportation Systems, 22(2):712–733, 2020.
- [20] Jingyuan Zhao, Wenyi Zhao, Bo Deng, Zhenghong Wang, Feng Zhang, Wenxiang Zheng, Wanke Cao, Jinrui Nan, Yubo Lian, and Andrew F Burke. Autonomous driving system: A comprehensive survey. Expert Systems with Applications, page 122836, 2023.
- [21] Oskar Natan and Jun Miura. Fully end-to-end autonomous driving with semantic depth cloud mapping and multi-agent. arXiv preprint arXiv:2204.05513, 2022.
- [22] Pranav Singh Chib and Pravendra Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey. IEEE Transactions on Intelligent Vehicles, 2023.
- [23] Siyu Teng, Xuemin Hu, Peng Deng, Bai Li, Yuchen Li, Yunfeng Ai, Dongsheng Yang, Lingxi Li, Zhe Xuanyuan, Fenghua Zhu, et al. Motion planning for autonomous driving: The state of the art and future perspectives. IEEE Transactions on Intelligent Vehicles, 2023.
- [24] Daniel Coelho and Miguel Oliveira. A review of end-to-end autonomous driving in urban environments. IEEE Access, 10:75296–75311, 2022.
- [25] Shanzhi Chen, Jinling Hu, Yan Shi, Li Zhao, and Wen Li. A vision of c-v2x: Technologies, field testing, and challenges with chinese development. IEEE Internet of Things Journal, 7(5):3872–3881, 2020.
- [26] Sohan Gyawali, Shengjie Xu, Yi Qian, and Rose Qingyang Hu. Challenges and solutions for cellular based v2x communications. IEEE Communications Surveys & Tutorials, 23(1):222–255, 2020.
- [27] Mario H Castañeda Garcia, Alejandro Molina-Galan, Mate Boban, Javier Gozalvez, Baldomero Coll-Perales, Taylan Şahin, and Apostolos Kousaridas. A tutorial on 5g nr v2x communications. IEEE Communications Surveys & Tutorials, 23(3):1972–2026, 2021.
- [28] Ardi Tampuu, Tambet Matiisen, Maksym Semikin, Dmytro Fishman, and Naveed Muhammad. A survey of end-to-end driving: Architectures and training methods. IEEE Transactions on Neural Networks and Learning Systems, 33(4):1364–1384, 2020.
- [29] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- [30] Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, and Jiebo Luo. End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2289–2294. IEEE, 2018.
- [31] Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9329–9338, 2019.
- [32] Tanmay Agarwal, Hitesh Arora, and Jeff Schneider. Learning urban driving policies using deep reinforcement learning. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 607–614. IEEE, 2021.
- [33] Marwa Ahmed, Ahmed Abobakr, Chee Peng Lim, and Saeid Nahavandi. Policy-based reinforcement learning for training autonomous driving agents in urban areas with affordance learning. IEEE Transactions on Intelligent Transportation Systems, 23(8):12562–12571, 2021.
- [34] Zhenbo Huang, Shiliang Sun, Jing Zhao, and Liang Mao. Multi-modal policy fusion for end-to-end autonomous driving. Information Fusion, 98:101834, 2023.
- [35] Yi Xiao, Felipe Codevilla, Akhil Gurram, Onay Urfalioglu, and Antonio M López. Multimodal end-to-end autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 23(1):537–547, 2020.
- [36] Tengju Ye, Wei Jing, Chunyong Hu, Shikun Huang, Lingping Gao, Fangzhen Li, Jingke Wang, Ke Guo, Wencong Xiao, Weibo Mao, et al. Fusionad: Multi-modality fusion for prediction and planning tasks of autonomous driving. arXiv preprint arXiv:2308.01006, 2023.
- [37] Jianyu Chen, Shengbo Eben Li, and Masayoshi Tomizuka. Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 23(6):5068–5078, 2021.
- [38] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
- [39] Syed Adnan Yusuf, Arshad Khan, and Riad Souissi. Vehicle-to-everything (v2x) in the autonomous vehicles domain–a technical review of communication, sensor, and ai technologies for road user safety. Transportation Research Interdisciplinary Perspectives, 23:100980, 2024.
- [40] Qi Chen, Sihai Tang, Qing Yang, and Song Fu. Cooper: Cooperative perception for connected autonomous vehicles based on 3d point clouds. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 514–524. IEEE, 2019.
- [41] Yen-Cheng Liu, Junjiao Tian, Chih-Yao Ma, Nathan Glaser, Chia-Wen Kuo, and Zsolt Kira. Who2com: Collaborative perception via learnable handshake communication. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 6876–6883. IEEE, 2020.
- [42] Yen-Cheng Liu, Junjiao Tian, Nathaniel Glaser, and Zsolt Kira. When2com: Multi-agent perception via communication graph grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4106–4115, 2020.
- [43] Yue Hu, Shaoheng Fang, Zixing Lei, Yiqi Zhong, and Siheng Chen. Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Advances in Neural Information Processing Systems, 35:4874–4886, 2022.
- [44] Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, and Raquel Urtasun. V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 605–621. Springer, 2020.
- [45] Hongbo Yin, Daxin Tian, Chunmian Lin, Xuting Duan, Jianshan Zhou, Dezong Zhao, and Dongpu Cao. V2VFormer: Multi-modal vehicle-to-vehicle cooperative perception via global-local transformer. IEEE Transactions on Intelligent Transportation Systems, 2023.
- [46] Braden Hurl, Robin Cohen, Krzysztof Czarnecki, and Steven Waslander. Trupercept: Trust modelling for autonomous vehicle cooperative perception from synthetic data. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 341–347. IEEE, 2020.
- [47] Yiming Li, Dekun Ma, Ziyan An, Zixun Wang, Yiqi Zhong, Siheng Chen, and Chen Feng. V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters, 7(4):10914–10921, 2022.
- [48] Runsheng Xu, Hao Xiang, Xin Xia, Xu Han, Jinlong Li, and Jiaqi Ma. Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In 2022 International Conference on Robotics and Automation (ICRA), pages 2583–2589. IEEE, 2022.
- [49] Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Jirui Yuan, Ping Luo, and Zaiqing Nie. Vehicle-infrastructure cooperative 3d object detection via feature flow prediction. arXiv preprint arXiv:2303.10552, 2023.
- [50] Tianqi Wang, Sukmin Kim, Wenxuan Ji, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, and Ping Luo. DeepAccident: A motion and accident prediction benchmark for V2X autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5599–5606, 2024.
- [51] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.