Learning for Vehicle-to-Vehicle Cooperative Perception under Lossy Communication
Abstract
Deep learning has been widely used in intelligent vehicle driving perception systems, such as 3D object detection. One promising technique is Cooperative Perception, which leverages Vehicle-to-Vehicle (V2V) communication to share deep learning-based features among vehicles. However, most cooperative perception algorithms assume ideal communication and do not consider the impact of Lossy Communication (LC), which is very common in the real world, on feature sharing. In this paper, we explore the effects of LC on Cooperative Perception and propose a novel approach to mitigate these effects. Our approach includes an LC-aware Repair Network (LCRN) and a V2V Attention Module (V2VAM) with intra-vehicle attention and uncertainty-aware inter-vehicle attention. We demonstrate the effectiveness of our approach on the public OPV2V dataset (a digital-twin simulated dataset) using point cloud-based 3D object detection. Our results show that our approach improves detection performance under lossy V2V communication. Specifically, our proposed method achieves a significant improvement in Average Precision compared to the state-of-the-art cooperative perception algorithms, which proves the capability of our approach to effectively mitigate the negative impact of LC and enhance the interaction between the ego vehicle and other vehicles. The code is available at: https://github.com/jinlong17/V2VLC.
Index Terms:
deep learning, vehicle-to-vehicle cooperative perception, 3D object detection, lossy communication, digital twin
I Introduction
Perceiving surrounding objects precisely in complex real-world scenarios is critical for modern intelligent vehicle research. An accurate perception system (e.g., 3D object detection) is the foundation for the subsequent motion planning and control of intelligent vehicles and therefore has a tremendous impact on driving safety [2, 3, 4, 5, 6].
Because of the perception limitations of individual intelligent vehicles [7, 8, 9], cooperative perception for Connected Automated Vehicles (CAVs) has recently attracted much attention in this research community. Compared to individual-vehicle perception, recent studies [10, 11, 12] show that CAV cooperative perception can significantly improve perception performance by leveraging Vehicle-to-Vehicle (V2V) communication for information sharing. Information sharing through V2V communication is a key technology for CAV cooperative perception, enabling vehicles to observe a wider range and perceive more occluded objects in complex traffic environments [13, 14]. There are three ways to share information during V2V communication: (1) sharing raw sensor data (early fusion), (2) sharing intermediate features of deep learning-based detection networks (intermediate fusion), and (3) sharing detection results (late fusion). Recent state-of-the-art studies [10, 15] show that intermediate fusion offers the best trade-off between detection accuracy and bandwidth requirements. This paper likewise focuses on intermediate fusion for V2V cooperative perception.
Many intermediate fusion methods have recently been proposed for V2V cooperative perception [11, 12, 13, 10]; however, all of them assume ideal communication. The only V2V cooperative perception study that considered non-ideal communication focused solely on communication delays [15]. To date, no existing work has explored the impact of Lossy Communication (LC) on V2V cooperative perception in complex real-world driving environments. In urban traffic scenarios, V2V communication is susceptible to a range of factors that can result in lossy communication, such as multi-path effects from obstacles (e.g., buildings and vehicles) [17], Doppler shift introduced by fast-moving vehicles [18], interference generated by other communication networks [19], dynamic topology caused by routing failures [20], and various weather conditions. Incomplete or inaccurate shared intermediate features resulting from lossy communication could compromise the effectiveness and efficiency of V2V cooperative perception, as shown in Figure 1. Failure to address LC in cooperative perception could lead to degraded perception performance, increased collision risk, and reduced traffic efficiency.
This paper first studies the negative effect of lossy communication on V2V cooperative perception and then proposes a novel intermediate LC-aware feature fusion method to address the issue. Specifically, the proposed method includes an LC-aware Repair Network (LCRN) to recover the shared features damaged by lossy communication and a specially designed V2V Attention Module (V2VAM) to enhance the interaction between the ego vehicle and other vehicles. The V2VAM includes intra-vehicle attention of the ego vehicle and uncertainty-aware inter-vehicle attention. Since it is challenging to collect authentic CAV perception data with lossy communication in real-world driving, and considering the advantages of the digital twin in many applications [21, 22, 23, 24, 25, 26], this paper evaluates the proposed method on the public cooperative perception dataset OPV2V [10], built on the digital-twin CARLA simulator [16]. The contributions of this paper are summarized as follows.
• We present the first study of V2V cooperative perception (point cloud-based 3D object detection) under lossy communication and analyze the side effect of lossy communication on detection performance.
• We propose a novel intermediate LC-aware feature fusion method that relieves the side effect of lossy communication with an LC-aware Repair Network and enhances the interaction between the ego vehicle and other vehicles with a specially designed V2V Attention Module comprising intra-vehicle attention of the ego vehicle and uncertainty-aware inter-vehicle attention.
• We evaluate the proposed method on the public cooperative perception dataset OPV2V, which is built on the digital-twin CARLA simulator [16].
The rest of this paper is organized as follows. Section II briefly reviews the literature related to this work, Section III presents the details of the proposed V2V cooperative perception method under lossy communication, Section IV provides experiments and analysis under two scenarios, Ideal Communication and Lossy Communication, and the conclusion is given in Section V.
II Related Work

II-A 3D Perception for Autonomous Driving
3D object detection is critical to the success of autonomous driving perception. Based on the sensor modalities available [27], 3D detection methods can be roughly divided into three categories. (1) Camera-based detection methods, which detect 3D objects using a single RGB image or multiple RGB images [28, 29, 30]. For example, CaDDN [28] combines depth distributions with image features to generate bird's-eye-view representations for 3D object detection. ImVoxelNet [29] constructs a 3D volume in 3D space and samples multi-view features to obtain a voxel representation for 3D object detection. DETR3D [30] uses queries to index into extracted 2D multi-camera features to directly estimate 3D bounding boxes in 3D space. (2) LiDAR-based detection methods, which typically convert LiDAR points into voxels or pillars, leading to voxel-based [31, 32] or pillar-based object detection methods [33, 34, 35]. PointRCNN [36] proposes a two-stage strategy on raw point clouds, which learns a rough estimate first and then refines it with semantic attributes. Some methods [31, 32] split the space into voxels and produce features per voxel; however, 3D voxels are usually expensive to process. To address this issue, PointPillars [37] compresses all voxels along the z-axis into a single pillar and then predicts 3D boxes in the bird's-eye-view space. Moreover, some recent methods [38, 39] combine voxel-based and point-based approaches to detect 3D objects jointly. (3) Camera-LiDAR fusion detection methods, which fuse information from both images and LiDAR points and represent a trend in 3D object detection. Aligning image features with point clouds is challenging in multimodal fusion. To solve this challenge, some methods [40, 41, 42] use a two-step framework that detects objects in 2D images in the first stage and then uses the obtained information to further process point clouds for 3D detection, while other works [43, 44] develop end-to-end fusion pipelines and leverage cross-attention mechanisms to perform feature alignment. Our work focuses on cooperative point cloud-based 3D object detection to achieve fast processing and real-time performance [33, 34]; a pillar-based approach is used in our experiments.
II-B Vehicle-to-Vehicle Cooperative Perception
The performance of a 3D perception method highly depends on the accuracy of the 3D point clouds. However, LiDAR sensors suffer from refraction, occlusion, and sparsity at long range, so a single-vehicle system can become unreliable in challenging situations [10]. In recent years, Vehicle-to-Vehicle (V2V) / Vehicle-to-Everything (V2X) cooperative systems have been proposed to overcome the disadvantages of the single-vehicle system by using multiple vehicles. The collaboration among different vehicles enables the 3D perception network to fuse information from different sources.
Some earlier methods use Early fusion to share raw data among different vehicles. For example, Cooper [45] fused the point clouds from different connected autonomous vehicles and made predictions based on the aligned data, and AUTOCAST [46] exchanged sensor readings from different sensors to broaden the perceptive field of a single vehicle. Other methods use Late fusion to integrate the 3D detection results from each vehicle. Rawashdeh et al. [47] proposed a machine learning-based method that shares the dimensions and center-point location of each tracked object. Other late fusion methods [48, 49] also adopt point clouds as sensor data from both vehicles and infrastructure. While early fusion requires large bandwidth and high data transfer speed, late fusion may generate undesirable results due to biased individual predictions. To balance data load and accuracy, recent methods focus on Intermediate fusion by sharing intermediate representations. F-Cooper [12] applied voxel feature fusion and spatial feature fusion between two vehicles. V2VNet [13] employed a graph neural network to aggregate features extracted from each vehicle's LiDAR. V2X-ViT [15] proposed a vision Transformer architecture to fuse features from vehicles and infrastructure. Cui et al. [50] proposed a point-based Transformer for point cloud processing that can integrate collaborative perception with control decisions. Tu et al. [51] proposed an efficient and practical online attack network against multi-agent deep learning systems based on intermediate representations. Luo et al. [52] utilized attention modules to fuse intermediate features and enhance feature complementarity. Lei et al. [53] proposed a latency compensation module to realize intermediate feature-level synchronization. Hu et al. [54] proposed a spatial confidence-aware communication strategy that focuses on perceptually critical areas to improve performance with less communication. OPV2V [10] utilized a self-attention module to fuse the received intermediate features. CoBEVT [11] proposed local-global sparse attention that captures complex spatial interactions across views and agents to improve cooperative perception performance. However, all of these fusion methods assume ideal communication and would suffer a dramatic performance drop under lossy communication in the real world. To address this issue, we design a dedicated V2V Attention Module (V2VAM), including intra-vehicle attention of the ego vehicle and uncertainty-aware inter-vehicle attention, to enhance the V2V interaction.
II-C Communication Issue in V2V Perception
V2V and V2X communication can improve the safety and reliability of autonomous vehicles by exchanging information with surrounding vehicles. However, communication among vehicles also brings new issues to this research field [55]. Due to the nature of wireless connectivity, lossy communication is inevitable in wireless channels. Factors such as channel errors, network congestion, and delay deadline violations can cause packet losses during data transmission in wireless networks [56]. Low latency and high reliability are two common challenges for V2V communication; for example, in the pre-crash sensing scenario, the maximum latency is only 20 ms and data delivery reliability must exceed 99% [55, 57]. Several works have proposed dedicated resource allocation schemes to ensure the latency and reliability of V2V communication systems using Lagrange dual decomposition and binary search [58], greedy link selection [59], or federated learning [60]. Other studies aim to improve V2V communication security from different aspects, such as authentication, data integrity, and data protection [61].
Lossy Communication (LC) is also a critical issue in V2V communication. According to studies on single-hop broadcasting, obstacles (vehicles, buildings, etc.) between the transmitter and receiver cause signal power fluctuations and hence packet loss [55, 62, 63, 21]. The shared data can also be interfered with by other signals or modified by attackers before arriving at its destination, leading to lossy data. In this work, we aim to mitigate the effect of lossy communication by proposing an LC-aware repair network and improving the robustness of the V2V perception network.
III Methodology
In this paper, we focus on the cooperative LiDAR-based 3D object detection task for autonomous driving and consider a realistic scenario where communication loss exists during collaboration. Since we focus on the lossy communication challenge during data transmission, we assume there are no communication delays or localization errors in the V2V system. To handle the lossy communication challenge in the real world and enhance CAVs' cooperative perception capability, inspired by [10], this paper proposes a novel intermediate LC-aware feature fusion framework. The overall architecture of the proposed framework is illustrated in Fig. 2 and includes five components: 1) V2V metadata sharing, 2) LiDAR feature extraction, 3) feature sharing, 4) the LC-aware repair network and V2V attention module, and 5) classification and regression headers.
III-A Overview of architecture
V2V metadata sharing. We select one of the CAVs as the ego vehicle to construct a spatial graph around it where each node is a CAV within the communication range, and each edge represents a directional V2V communication channel between a pair of nodes. Upon receiving the relative pose and extrinsic of the ego vehicle, all the other CAVs nearby will project their own LiDAR point clouds to the ego vehicle’s coordinate frame before feature extraction, which could be simply formulated as
$P_{i \rightarrow e} = T_{i \rightarrow e} \cdot P_i, \qquad (1)$
where $P_i$ is the LiDAR point cloud of the $i$-th CAV, $x_i^t$ is the pose of the $i$-th CAV at time $t$, and $T_{i \rightarrow e}$ is the coordinate transformation matrix from CAV $i$ to the ego vehicle, computed from $x_i^t$ and the ego pose.
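To make this projection concrete, here is a minimal NumPy sketch assuming 4x4 homogeneous transforms and a yaw-only pose parameterization; the function and variable names are ours, not the paper's:

```python
import numpy as np

def pose_to_transform(pose: np.ndarray) -> np.ndarray:
    """Convert a [x, y, z, yaw] pose into a 4x4 world-frame transform (yaw-only simplification)."""
    x, y, z, yaw = pose
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = [x, y, z]
    return T

def project_to_ego(points_i: np.ndarray, cav_pose: np.ndarray, ego_pose: np.ndarray) -> np.ndarray:
    """Project CAV i's LiDAR points (N, 3) into the ego vehicle's coordinate frame."""
    # T_{i->e} = (world -> ego) @ (i -> world), i.e., the transformation matrix of Eq. (1).
    T_i_to_e = np.linalg.inv(pose_to_transform(ego_pose)) @ pose_to_transform(cav_pose)
    homo = np.hstack([points_i, np.ones((points_i.shape[0], 1))])   # (N, 4) homogeneous points
    return (homo @ T_i_to_e.T)[:, :3]                               # back to (N, 3)
```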
LiDAR feature extraction. The anchor-based PointPillar method [37] is selected as the 3D detection backbone to extract visual features from point clouds, since it can be deployed in the real world more easily than other 3D detection backbones (e.g., SECOND [31], PIXOR [64], and VoxelNet [32]) thanks to its low inference latency and optimized memory usage [10]. This method converts the raw point clouds into a stacked pillar tensor, which is then scattered to a 2D pseudo-image and fed to the PointPillar backbone to extract informative visual feature maps. Each CAV has its own LiDAR feature extraction module.
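To illustrate the pseudo-image step, the following sketch scatters per-pillar features back onto a dense 2D grid; it is a simplified stand-in for the PointPillars scatter operation, and the tensor shapes and names are our assumptions:

```python
import torch

def scatter_to_pseudo_image(pillar_feats: torch.Tensor, coords: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Scatter per-pillar features onto a dense grid.

    pillar_feats: (P, C) features produced by the pillar encoder.
    coords:       (P, 2) integer (row, col) grid indices of each non-empty pillar.
    Returns a (C, H, W) pseudo-image that the 2D CNN backbone can consume.
    """
    C = pillar_feats.shape[1]
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype)
    flat_idx = coords[:, 0].long() * W + coords[:, 1].long()
    canvas[:, flat_idx] = pillar_feats.t()
    return canvas.view(C, H, W)
```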
Feature sharing. In this component, the ego vehicle receives the neighboring CAVs' feature maps after each CAV's feature extraction, and these received intermediate features are fed into the remaining detection network of the ego vehicle. In real-world scenarios (e.g., urban buildings and unpredictable occlusions), the transmission of the feature maps usually suffers inevitable damage that leads to lossy communication. As a result, existing 3D object detectors suffer a dramatic performance drop with the lossy features collected from surrounding CAVs, as shown in Table I.
LC-aware Repair Network and V2V Attention Module. The intermediate features aggregated from surrounding CAVs are fed into the major components of our framework: the LC-aware Repair Network, which recovers intermediate feature maps damaged by lossy communication using tensor-wise filtering, and the V2V Attention Module, which performs inter-vehicle as well as intra-vehicle feature fusion using attention mechanisms. The proposed LC-aware repair network and V2V attention module are detailed in Sec. III-B and Sec. III-C, respectively.
Classification and regression headers. After receiving the final fused feature maps, two prediction headers are utilized for box regression and classification.
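For concreteness, a minimal sketch of such prediction headers follows, assuming 1x1 convolutions, a fixed per-location anchor count, and a 7-parameter box encoding; these details are our assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DetectionHeaders(nn.Module):
    """Classification and regression headers applied to the fused feature map."""
    def __init__(self, channels: int, n_anchors: int = 2):
        super().__init__()
        self.cls_header = nn.Conv2d(channels, n_anchors, kernel_size=1)        # objectness per anchor
        self.reg_header = nn.Conv2d(channels, 7 * n_anchors, kernel_size=1)    # (x, y, z, w, l, h, yaw)

    def forward(self, fused: torch.Tensor):
        return self.cls_header(fused), self.reg_header(fused)
```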
III-B LC-aware Repair Network
Image denoising is a longstanding and challenging task in computer vision. The primary sources of noise [65] are shot noise, a Poisson process whose variance equals the signal level, and read noise, an approximately Gaussian process caused by various sensor readout effects. To remove them, some deep learning-based methods [66, 67, 68] use denoising networks that generate a filter for every pixel of the desired output, which constrains the output space and thereby suppresses artifacts. Inspired by these architectures, and to handle the common V2V communication challenge of lossy communication, we design a customized LC-aware repair network to recover the intermediate features shared by other CAVs.
The framework of the LC-aware repair network is shown in Fig. 3; it is an encoder-decoder architecture with skip connections. The network generates a specific per-tensor filter kernel to jointly align and recover the damaged input feature and produce a recovered version of the output feature. Let $F' \in \mathbb{R}^{C \times H \times W}$ denote the input feature of the LC-aware repair network; a tensor-wise kernel $K$ is generated and applied to $F'$ to produce the recovered output feature $\hat{F}$. The tensor-wise filter kernel can be formulated as
$K = \Psi(F'), \qquad (2)$
and the value of each tensor in the output feature is
$\hat{F}_c = K_c \odot \mathcal{N}(F'_c), \quad c = 1, \dots, C, \qquad (3)$
where $\odot$ denotes the matrix dot product, $K$ is the tensor-wise kernel whose slice $K_c$ along the channel dimension is a per-tensor kernel applied by multiplication to the neighborhood region $\mathcal{N}(F'_c)$ of each tensor in the input feature, and $\Psi$ denotes the tensor-wise network that perceives the input feature and generates a suitable kernel for each tensor.
Because the repaired output feature $\hat{F}$ is obtained by tensor-wise filtering of the damaged input feature $F'$, the uncorrupted feature details are largely preserved. A large kernel size is therefore desirable to fully leverage the rich neighborhood information around each tensor; in our experiment, the kernel size is capped by GPU memory limitations.
The LC-aware repair loss is the tensor-wise distance between the ground-truth feature $F$, i.e., the original feature before suffering lossy communication, and the repaired feature $\hat{F}$:
$\mathcal{L}_{rep} = \big\lVert F - \hat{F} \big\rVert. \qquad (4)$
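The PyTorch sketch below illustrates our reading of the tensor-wise filtering in Eqs. (2)-(4): a small network predicts one k x k kernel per channel and applies it to that channel's neighborhood via depthwise filtering. The encoder layers, kernel-size handling, and class names are assumptions, not the paper's exact LCRN architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TensorWiseRepair(nn.Module):
    """Predict a per-channel k x k kernel (Eq. 2) and filter each channel's neighborhood with it (Eq. 3)."""
    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        self.k = k
        self.kernel_net = nn.Sequential(            # stands in for the tensor-wise network Psi
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(k),                # one k x k kernel map per channel
        )

    def forward(self, f_damaged: torch.Tensor) -> torch.Tensor:
        B, C, H, W = f_damaged.shape
        kernels = self.kernel_net(f_damaged)        # (B, C, k, k)
        repaired = []
        for b in range(B):
            w = kernels[b].unsqueeze(1)             # (C, 1, k, k): one kernel per channel
            repaired.append(F.conv2d(f_damaged[b:b + 1], w,
                                     padding=self.k // 2, groups=C))
        return torch.cat(repaired, dim=0)           # recovered feature, analogous to \hat{F}

# Training-time repair loss in the spirit of Eq. (4): distance to the clean feature, e.g.
# repair_loss = torch.norm(f_clean - repair_net(f_damaged))
```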

III-C V2V Attention Module
The self-attention mechanism [69] has emerged as a recent advance for capturing long-range interactions. Its key idea is to compute the response at a position as a weighted sum of the features at all locations, with the interaction between features determined by the features themselves rather than by their relative locations, as in convolutions. In this paper, after receiving the recovered intermediate features, we aim to leverage the intermediate deep learning features from multiple nearby CAVs to improve perception performance based on V2V communication. We design a customized intra-vehicle and inter-vehicle attention fusion method that accounts for the lossy communication situation to enhance the interaction between the ego CAV and other CAVs. Moreover, we adopt a criss-cross attention module in our proposed V2V attention method, which captures contextual information from full-feature dependencies more efficiently and effectively.
Intra-Vehicle Attention. For the ego vehicle only, the intra-vehicle attention module enables features from any position to perceive globally, thus enjoying full-image contextual information to better capture representative features. Formally, let $F_e \in \mathbb{R}^{C \times H \times W}$ be the input feature map of the ego vehicle, which is clean data generated by the ego vehicle itself without suffering any lossy communication. In the intra-vehicle attention module, the feature map is processed by three convolutional layers to produce three feature vectors $Q_e$, $K_e$, and $V_e$, respectively, all of the same size. Following the scaled dot-product attention in [69], we compute the dot products of $Q_e$ and $K_e$, divide them by a scaling factor, i.e., the dimension of the feature vectors, and apply a softmax function to obtain the weights on $V_e$. The intra-vehicle attention, as shown in Fig. 4, is defined as follows,
$H_{intra} = \mathrm{softmax}\!\left(\frac{Q_e K_e^{\top}}{\sqrt{d_k}}\right) V_e, \qquad (5)$
where $d_k$ is the dimension of $K_e$ and the standard softmax function is used as the activation function. $H_{intra}$ denotes the output feature map of the ego vehicle, which takes all spatial information of the feature map into account.
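A minimal sketch of this intra-vehicle attention, assuming 1x1 convolutions for the query, key, and value projections (the layer choices and names are ours):

```python
import torch
import torch.nn as nn

class IntraVehicleAttention(nn.Module):
    """Scaled dot-product self-attention over the spatial positions of the ego feature map (Eq. 5)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_ego: torch.Tensor) -> torch.Tensor:
        B, C, H, W = f_ego.shape
        q = self.q(f_ego).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.k(f_ego).flatten(2)                      # (B, C, HW)
        v = self.v(f_ego).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)    # (B, HW, HW) spatial attention weights
        return (attn @ v).transpose(1, 2).view(B, C, H, W)  # back to a (B, C, H, W) feature map
```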
Uncertainty-Aware Inter-Vehicle Attention. In V2V cooperative perception, the intermediate feature maps aggregated from other CAVs are shared with the ego vehicle. The shared feature maps affected by lossy communication are recovered by the LC-aware repair network, as introduced in Sec. III-B, but they remain noisy to some extent, while the ego feature map is perfect without any lossy transmission. Fusing these uncertain feature maps directly with the certain ego feature map could be risky in the cooperative perception interaction. To address this issue, we propose an uncertainty-aware inter-vehicle attention fusion method that considers the uncertainty of the recovered feature maps. In this module, each shared feature map is processed by two convolutional layers to produce two feature vectors $K_j$ and $V_j$ of identical size for the $j$-th neighboring CAV, while the query vector $Q_e$ is obtained directly from the ego vehicle rather than from other vehicles, as shown in Fig. 4. Similar to the intra-vehicle attention in Sec. III-C, the uncertainty-aware inter-vehicle attention can be defined as
$H_{inter} = \sum_{j=1}^{N} \mathrm{softmax}\!\left(\frac{Q_e K_j^{\top}}{\sqrt{d_k}}\right) V_j, \qquad (6)$
where $d_k$ is the dimension of $K_j$ and $N$ is the number of neighboring CAVs. $H_{inter}$ denotes the sum of the output feature maps, which accounts for the interaction between the ego vehicle and the other vehicles.
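A corresponding sketch of Eq. (6), where the query comes from the ego feature and the keys and values from each recovered neighbor feature before summing over CAVs; this captures only the query-from-ego structure, and the projection layers and names are assumptions:

```python
import torch
import torch.nn as nn

class InterVehicleAttention(nn.Module):
    """Ego-query attention over each recovered neighbor feature, summed over CAVs (Eq. 6)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)   # query projection, applied to the ego feature
        self.k = nn.Conv2d(channels, channels, 1)   # key projection, applied to each neighbor feature
        self.v = nn.Conv2d(channels, channels, 1)   # value projection, applied to each neighbor feature

    def forward(self, f_ego: torch.Tensor, f_neighbors: list) -> torch.Tensor:
        B, C, H, W = f_ego.shape
        q = self.q(f_ego).flatten(2).transpose(1, 2)              # (B, HW, C)
        out = torch.zeros_like(f_ego)
        for f_j in f_neighbors:                                   # recovered features from N CAVs
            k = self.k(f_j).flatten(2)                            # (B, C, HW)
            v = self.v(f_j).flatten(2).transpose(1, 2)            # (B, HW, C)
            attn = torch.softmax(q @ k / C ** 0.5, dim=-1)
            out = out + (attn @ v).transpose(1, 2).view(B, C, H, W)
        return out
```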
Efficient Implementation. Inspired by [70], we adopt two consecutive criss-cross (CC) attention modules to implement V2V attention on point cloud features rather than the scaled dot-product attention. The latter generates huge attention maps to measure the relationship of every position pair, resulting in a very high complexity of $\mathcal{O}\big((H \times W) \times (H \times W)\big)$, where $H \times W$ is the spatial size of the input feature. The CC attention module [70] aggregates contextual information in the horizontal and vertical directions and collects contextual information from all pixels by serially stacking two CC attention modules. Each position has sparse connections to the other positions in the feature map, with $H + W - 1$ connections per position. This approach greatly reduces the complexity from $\mathcal{O}\big((H \times W) \times (H \times W)\big)$ to $\mathcal{O}\big((H \times W) \times (H + W - 1)\big)$ while still effectively capturing the relevant context from all vehicles through V2V communication.
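A quick back-of-the-envelope comparison of the two attention-map sizes (the feature-map size chosen here is an arbitrary assumption):

```python
# Number of pairwise attention weights for an H x W feature map.
H, W = 100, 100                          # assumed spatial size, for illustration only
dense = (H * W) ** 2                     # full scaled dot-product attention
criss_cross = (H * W) * (H + W - 1)      # criss-cross: each position attends to its row and column
print(dense, criss_cross, dense / criss_cross)   # 100000000 1990000 -> roughly 50x fewer weights
```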
After obtaining the intra-vehicle and inter-vehicle attention outputs, both are fed separately into max pooling and average pooling layers to obtain rich spatial information; the pooled results are then concatenated as the input to a 2D convolutional layer with a ReLU activation function. Therefore, the final fused feature output of the V2V attention module is
$F_{fuse} = \Phi\big(H_{intra}, H_{inter}\big), \qquad (7)$
where $\Phi$ denotes the set of max pooling, average pooling, concatenation, and convolution layers. For 3D object detection, we use the smooth L1 loss for bounding box regression and the focal loss [71] for classification. The final loss is the combination of the detection loss and the LC-aware repair loss:
$\mathcal{L}_{total} = \alpha\,\mathcal{L}_{det} + \beta\,\mathcal{L}_{rep}, \qquad (8)$
where $\alpha$ and $\beta$ are the balance coefficients that weight the two terms.
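A sketch of the fusion step in Eq. (7) and the combined loss in Eq. (8); the pooling windows, convolution size, and coefficient values are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Pool the intra- and inter-vehicle attention outputs, concatenate, and fuse (Eq. 7)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, h_intra: torch.Tensor, h_inter: torch.Tensor) -> torch.Tensor:
        pooled = []
        for h in (h_intra, h_inter):
            pooled.append(F.max_pool2d(h, 3, stride=1, padding=1))   # spatial max statistics
            pooled.append(F.avg_pool2d(h, 3, stride=1, padding=1))   # spatial mean statistics
        return self.fuse(torch.cat(pooled, dim=1))

def total_loss(det_loss: torch.Tensor, repair_loss: torch.Tensor,
               alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Combined objective in the spirit of Eq. (8); alpha and beta are placeholder values."""
    return alpha * det_loss + beta * repair_loss
```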

IV Experiment
IV-A Dataset
Due to the difficulty of collecting real-world CAV perception data with lossy communication in realistic scenes, we use a digital-twin-based simulation dataset to validate the proposed method. The experiments are conducted on the public cooperative perception dataset OPV2V [10], a large-scale open-source simulated dataset for V2V perception that contains 73 divergent scenes with various numbers of connected vehicles, 11,464 frames, and 232,913 annotated 3D vehicle bounding boxes. The data are collected from 8 digital towns in CARLA [16] and a digital town of Culver City, Los Angeles, with the same road topology. Following the default split of OPV2V [10], we use its training and validation frames, and the frames from the CARLA Towns and from Culver City are used as two testing sets for all methods.
IV-B Experiments Setup
Evaluation metrics. We evaluate the performance of our proposed framework by the final 3D vehicle detection accuracy. Following [10, 15], we set the evaluation range around the ego vehicle so that all CAVs are included in this spatial range, and we measure accuracy with Average Precision (AP) at Intersection-over-Union (IoU) thresholds of 0.5 and 0.7.
Experiment details. In this work, we focus on LiDAR-based vehicle detection and assess models under two scenarios: 1) Ideal Communication, where all data transmissions take place under perfect communication; 2) Lossy Communication, where all intermediate features from other CAVs suffer from lossy communication while the ego vehicle feature does not. To simulate lossy communication, elements of the shared intermediate features are randomly selected with a uniformly distributed random probability and then replaced by random noise drawn from a uniform distribution over the statistical value range of the original intermediate features measured in our experiments.
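A minimal sketch of this training-time simulation as we read it; the probability value and the exact range handling are assumptions:

```python
import torch

def simulate_lossy_feature(feature: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Replace a random fraction p of the shared feature values with uniform noise
    drawn from the feature's own value range."""
    lo, hi = feature.min().item(), feature.max().item()
    mask = torch.rand_like(feature) < p                      # uniformly selected positions
    noise = torch.empty_like(feature).uniform_(lo, hi)       # noise within the original range
    return torch.where(mask, noise, feature)
```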
In the training stage, we adopt two schemes to observe the impact of different training data on V2V 3D object detection models: Scheme I uses only ideal-communication data for training, while Scheme II uses the simulated lossy-communication data described above. The training parameter settings for both schemes are identical; the only difference is the training data, which includes lossy communication in Scheme II. All trained models are evaluated on the V2V CARLA Towns and Culver City testing sets under both the Ideal Communication and Lossy Communication scenarios. Specifically, all models use PointPillar [37] as the backbone with the same voxel resolution for both height and width. We adopt the Adam optimizer [72] and steadily decay the initial learning rate every few epochs by a fixed factor, and the detection loss and the LC-aware repair loss are weighted by the balance coefficients in Eq. (8). We follow the same hyperparameters as V2X-ViT [15], and all models are trained on two RTX 3090 GPUs.
Compared methods. We consider No Fusion as the baseline, which only uses the ego vehicle's LiDAR point cloud. In addition, we evaluate five state-of-the-art methods that use Intermediate Fusion as the main fusion strategy: CoBEVT [11], F-Cooper [12], V2VNet [13], OPV2V [10], and V2X-ViT [15] (see Sec. II-B for detailed descriptions). To demonstrate the significant effect of Lossy Communication, we first train these methods under the two schemes and then test them under both the Ideal Communication and Lossy Communication scenarios. To show the effectiveness of the two critical components of our framework, namely LCRN and V2VAM, we also design a simple feature-averaging fusion method with a convolutional layer, called AveFuse, which averages all intermediate features from the ego vehicle and other vehicles and passes the averaged feature through a convolutional layer.
Table I: 3D detection performance (AP@0.5 and AP@0.7) on the V2V CARLA Towns and V2V Culver City testing sets for models trained with Scheme I (ideal-communication data) and tested under both Ideal Communication and Lossy Communication. Compared methods: No Fusion, F-Cooper [12], V2VNet [13], OPV2V [10], CoBEVT [11], V2X-ViT [15], and the proposed method.
Table II: 3D detection performance (AP@0.5 and AP@0.7) on the V2V CARLA Towns and V2V Culver City testing sets for models trained with Scheme II (simulated lossy-communication data) and tested under both communication types. Compared methods: No Fusion, F-Cooper [12], V2VNet [13], OPV2V [10], CoBEVT [11], V2X-ViT [15], and the proposed V2VAM+LCRN.
Table III: Ablation study under Lossy Communication (training Scheme II) on the V2V CARLA Towns and V2V Culver City testing sets.

| Method | CARLA Towns AP@0.5 | CARLA Towns AP@0.7 | Culver City AP@0.5 | Culver City AP@0.7 |
|---|---|---|---|---|
| No Fusion | 0.679 | 0.602 | 0.557 | 0.471 |
| AveFuse (baseline) | 0.632 | 0.325 | 0.697 | 0.374 |
| V2VAM w/o Intra | 0.613 | 0.490 | 0.637 | 0.458 |
| V2VAM w/o Inter | 0.641 | 0.494 | 0.681 | 0.504 |
| V2VAM | 0.709 | 0.583 | 0.761 | 0.541 |
| AveFuse+LCRN | 0.698 | 0.472 | 0.714 | 0.558 |
| V2VAM+LCRN | 0.841 | 0.705 | 0.846 | 0.663 |
IV-C Experimental Results
Table I shows the performance comparison of all models trained with Scheme I and then tested under the two communication types, i.e., Ideal Communication and Lossy Communication. Under Ideal Communication, all cooperative perception methods significantly surpass the No Fusion baseline. On the V2V CARLA Towns testing set, our proposed V2VAM outperforms the other five advanced fusion methods in AP@0.5/0.7 (highlighted in bold in Table I). On the V2V Culver City testing set, V2VAM again achieves the best AP@0.5/0.7, outperforming the second-best fusion method, CoBEVT [11]. These results indicate that cooperative perception methods improve perception performance over a single-vehicle perception system under Ideal Communication and that our proposed fusion method V2VAM efficiently enhances the interaction between the ego vehicle and other vehicles, achieving the best performance. Under the Lossy Communication testing scenario, however, all intermediate fusion methods suffer a drastic performance drop on both testing sets, and their accuracy is even lower than No Fusion. On the V2V CARLA Towns testing set, the cooperative perception performance of F-Cooper [12], V2VNet [13], OPV2V [10], and CoBEVT [11] decreases substantially in AP@0.7. Clearly, intermediate fusion methods that do not consider lossy communication are not practical for deployment in the real world.
Table II presents the 3D object detection results on the two OPV2V testing sets for models trained with Scheme II. Under Lossy Communication, all intermediate fusion methods perform better than in Table I because they learned from lossy intermediate features during training, but they still fail to handle lossy communication data, resulting in poor perception performance in Table II. On the V2V CARLA Towns testing set, F-Cooper [12], V2VNet [13], CoBEVT [11], and V2X-ViT [15] are even worse than the single-vehicle No Fusion baseline in AP@0.7, which indicates the highly negative impact of lossy communication. In contrast, our proposed method reaches the best AP@0.5/0.7 on both the V2V CARLA Towns and Culver City testing sets and achieves the best performance under both Ideal Communication and Lossy Communication. Clearly, our proposed LCRN module efficiently maintains the benefits of collaboration under lossy communication, and the proposed method diminishes the impact of lossy V2V communication to achieve excellent cooperative perception performance. Furthermore, we visualize some 3D object detection results on the V2V Culver City testing set under Lossy Communication in Fig. 5. Intuitively, the five comparison methods cannot handle lossy communication appropriately, leading to false negative proposals, while the proposed method significantly improves perception performance under lossy communication.
IV-D Discussion: Different Lossy Communication Types in V2V
As explained in [56, 73], several random factors, such as the occurrence of obstacles, fast and changing vehicle speeds, and the distance between vehicles, can result in lossy communication when sharing communication data. To simulate the complex lossy communication of the real world, the shared data are randomly selected with a uniformly distributed random probability and then replaced by random noise within the range of the original shared feature values. We design two ways of random selection to simulate different lossy communication types in real-world V2V communication.
Lossy Communication on the global feature (named “Lossy”): The shared feature after V2V metadata sharing is first reshaped from a 3D tensor to a 2D matrix (Fig. 6(b)). Then, as shown in Fig. 6(c1-c3), elements of the reshaped feature are randomly selected with a global random probability and replaced by random noise within the range of the original shared feature values.
Channelwise Lossy Communication (named “channelwise Lossy”): Different from the “Lossy” type, which simulates lossy communication on the reshaped global feature, the channelwise type simulates lossy communication on individual channels. As shown in Fig. 6(d1-d3), given a shared feature $F \in \mathbb{R}^{C \times H \times W}$, channels are randomly selected with a channelwise random probability and replaced by random noise within the range of the original shared feature values.
Finally, the simulated lossy feature is reshaped back to its original shape and then received by the ego vehicle. In our experiment, Scheme II uses simulated lossy communication data of the “Lossy” type for training, and we then test the models trained with Scheme II on both “Lossy” and “channelwise Lossy” simulated data. Table IV shows the performance comparison of several methods under the two lossy communication types. The proposed method achieves the best performance under both types.
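A sketch of the channelwise variant under the same assumptions as before (the function name and probability value are ours):

```python
import torch

def simulate_channelwise_lossy(feature: torch.Tensor, p_c: float = 0.3) -> torch.Tensor:
    """Replace whole channels of a (C, H, W) shared feature, each selected with
    probability p_c, by uniform noise within the feature's original value range."""
    lo, hi = feature.min().item(), feature.max().item()
    drop = torch.rand(feature.shape[0]) < p_c                # which channels become lossy
    noisy = feature.clone()
    noisy[drop] = torch.empty_like(feature[drop]).uniform_(lo, hi)
    return noisy
```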
Table IV: AP@0.5 and AP@0.7 on the V2V CARLA Towns and V2V Culver City testing sets under the two simulated lossy communication types (“Lossy” and “channelwise Lossy”) for OPV2V [10], CoBEVT [11], V2X-ViT [15], and the proposed method.
IV-E Ablation Study
The effectiveness of the two proposed components, V2VAM and LCRN, is investigated here. Based on training Scheme II, all methods are evaluated under Lossy Communication on the V2V CARLA Towns and Culver City testing sets. AveFuse, which simply averages all intermediate features, is used as the baseline fusion method. As shown in Table III, the proposed V2VAM obtains 0.709 AP@0.5 and 0.583 AP@0.7 on the V2V CARLA Towns testing set, which is 0.077 and 0.258 higher than AveFuse in AP@0.5 and AP@0.7, respectively. Removing either the intra-vehicle or the inter-vehicle attention module from V2VAM degrades performance, indicating that both modules are effective. By adding LCRN to the baseline method, AveFuse+LCRN achieves 0.698 AP@0.5 and 0.472 AP@0.7 on the V2V CARLA Towns testing set, an improvement of 0.066 in AP@0.5 and 0.147 in AP@0.7. Furthermore, our proposed V2VAM+LCRN achieves the best performance on both the V2V CARLA Towns and Culver City testing sets. Clearly, both the V2VAM and LCRN components are beneficial for improving the final 3D object detection performance in lossy communication scenarios.
V Conclusions
In this paper, the side effect of lossy communication on V2V cooperative perception is studied, and we propose the first intermediate LC-aware feature fusion method that explicitly considers lossy communication. An LC-aware Repair Network (LCRN) is proposed to relieve the side effect of lossy communication, and a specially designed V2V Attention Module (V2VAM), including intra-vehicle attention of the ego vehicle and uncertainty-aware inter-vehicle attention, enhances the interaction between the ego vehicle and other vehicles. The proposed method is verified on the public cooperative perception dataset OPV2V, built on the digital-twin CARLA simulator; it is effective for cooperative point cloud-based 3D object detection under lossy V2V communication and significantly outperforms other V2V point cloud-based 3D object detection methods.
References
- [1] R. Xu, H. Xiang, X. Han, X. Xia, Z. Meng, C.-J. Chen, C. Correa-Jullian, and J. Ma, “The opencda open-source ecosystem for cooperative driving automation research,” IEEE Transactions on Intelligent Vehicles, pp. 1–13, 2023.
- [2] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
- [3] Y. Ma, C. Sun, J. Chen, D. Cao, and L. Xiong, “Verification and validation methods for decision-making and planning of automated vehicles: A review,” IEEE Transactions on Intelligent Vehicles, 2022.
- [4] D. Cao, X. Wang, L. Li, C. Lv, X. Na, Y. Xing, X. Li, Y. Li, Y. Chen, and F.-Y. Wang, “Future directions of intelligent vehicles: Potentials, possibilities, and perspectives,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 1, pp. 7–10, 2022.
- [5] Q. Lan and Q. Tian, “Instance, scale, and teacher adaptive knowledge distillation for visual detection in autonomous driving,” IEEE Transactions on Intelligent Vehicles, pp. 1–14, 2022.
- [6] K. Strandberg, N. Nowdehi, and T. Olovsson, “A systematic literature review on automotive digital forensics: Challenges, technical solutions and data collection,” IEEE Transactions on Intelligent Vehicles, pp. 1–19, 2022.
- [7] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li et al., “Milestones in autonomous driving and intelligent vehicles: Survey of surveys,” IEEE Transactions on Intelligent Vehicles, 2022.
- [8] K. Wang, L. Pu, J. Zhang, and J. Lu, “Gated adversarial network based environmental enhancement method for driving safety under adverse weather conditions,” IEEE Transactions on Intelligent Vehicles, 2022.
- [9] Z. Ju, H. Zhang, X. Li, X. Chen, J. Han, and M. Yang, “A survey on attack detection and resilience for connected and automated vehicles: From vehicle dynamics and control perspective,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 4, pp. 815–837, 2022.
- [10] R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” in International Conference on Robotics and Automation. IEEE, 2022, pp. 2583–2589.
- [11] R. Xu, Z. Tu, H. Xiang, W. Shao, B. Zhou, and J. Ma, “Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers,” in Conference on Robot Learning, 2022.
- [12] Q. Chen, X. Ma, S. Tang, J. Guo, Q. Yang, and S. Fu, “F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds,” in ACM/IEEE Symposium on Edge Computing, 2019, pp. 88–100.
- [13] T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Urtasun, “V2vnet: Vehicle-to-vehicle communication for joint perception and prediction,” in European Conference on Computer Vision. Springer, 2020, pp. 605–621.
- [14] Y. Tian, J. Wang, Y. Wang, C. Zhao, F. Yao, and X. Wang, “Federated vehicular transformers and their federations: Privacy-preserving computing and cooperation for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 3, pp. 456–465, 2022.
- [15] R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, “V2x-vit: Vehicle-to-everything cooperative perception with vision transformer,” in European Conference on Computer Vision. Springer, 2022, pp. 107–124.
- [16] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Annual Conference on Robot Learning. PMLR, 2017, pp. 1–16.
- [17] A. Saleh and R. Valenzuela, “A statistical model for indoor multipath propagation,” IEEE Journal on Selected Areas in Communications, vol. 5, no. 2, pp. 128–137, 1987.
- [18] Y. Zhao and S.-G. Haggman, “Intercarrier interference self-cancellation scheme for ofdm mobile communication systems,” IEEE Transactions on Communications, vol. 49, no. 7, pp. 1185–1191, 2001.
- [19] R. Tavakoli, M. Nabi, T. Basten, and K. Goossens, “An experimental study of cross-technology interference in in-vehicle wireless sensor networks,” in ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, ser. MSWiM ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 195–204.
- [20] D. Patra, S. Chavhan, D. Gupta, A. Khanna, and J. J. P. C. Rodrigues, “V2x communication based dynamic topology control in vanets,” in International Conference on Distributed Computing and Networking. New York, NY, USA: Association for Computing Machinery, 2021, p. 62–68.
- [21] Z. Wang, K. Han, and P. Tiwari, “Digital twin-assisted cooperative driving at non-signalized intersections,” IEEE Transactions on Intelligent Vehicles, 2021.
- [22] Z. Wang, X. Liao, C. Wang, D. Oswald, G. Wu, K. Boriboonsomsin, M. J. Barth, K. Han, B. Kim, and P. Tiwari, “Driver behavior modeling using game engine and real vehicle: A learning-based approach,” IEEE Transactions on Intelligent Vehicles, vol. 5, no. 4, pp. 738–749, 2020.
- [23] Z. Hu, S. Lou, Y. Xing, X. Wang, D. Cao, and C. Lv, “Review and perspectives on driver digital twin and its enabling technologies for intelligent vehicles,” IEEE Transactions on Intelligent Vehicles, 2022.
- [24] A. Venon, Y. Dupuis, P. Vasseur, and P. Merriaux, “Millimeter wave fmcw radars for perception, recognition and localization in automotive applications: A survey,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 3, pp. 533–555, 2022.
- [25] J. Li, R. Xu, J. Ma, Q. Zou, J. Ma, and H. Yu, “Domain adaptive object detection for autonomous driving under foggy weather,” in IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 612–622.
- [26] J. Li, Z. Xu, L. Fu, X. Zhou, and H. Yu, “Domain adaptation from daytime to nighttime: A situation-sensitive vehicle detection and traffic flow parameter estimation framework,” Transportation Research Part C: Emerging Technologies, vol. 124, p. 102946, 2021.
- [27] K. Wang, T. Zhou, X. Li, and F. Ren, “Performance and challenges of 3d object detection methods in complex scenes for autonomous driving,” IEEE Transactions on Intelligent Vehicles, pp. 1–18, 2022.
- [28] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for monocular 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 8555–8564.
- [29] D. Rukhovich, A. Vorontsova, and A. Konushin, “Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection,” in IEEE Winter Conference on Applications of Computer Vision, 2022, pp. 2397–2406.
- [30] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning. PMLR, 2022, pp. 180–191.
- [31] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
- [32] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
- [33] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, “Embracing single stride 3d object detector with sparse transformer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 8458–8468.
- [34] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. Funkhouser, and J. Solomon, “Pillar-based object detection for autonomous driving,” in European Conference on Computer Vision. Springer, 2020, pp. 18–34.
- [35] T. Y. Chen, C. C. Hsiao, and C.-C. Huang, “Density-imbalance-eased lidar point cloud upsampling via feature consistency learning,” IEEE Transactions on Intelligent Vehicles, 2022.
- [36] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
- [37] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705.
- [38] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 529–10 538.
- [39] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in IEEE International Conference on Computer Vision, 2019, pp. 1951–1960.
- [40] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918–927.
- [41] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: Sequential fusion for 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 4604–4612.
- [42] T. Zhou, J. Chen, Y. Shi, K. Jiang, M. Yang, and D. Yang, “Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection,” IEEE Transactions on Intelligent Vehicles, 2023.
- [43] Y. Li, A. W. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, Y. Lu, D. Zhou, Q. V. Le et al., “Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 182–17 191.
- [44] J. Nie, J. Yan, H. Yin, L. Ren, and Q. Meng, “A multimodality fusion deep neural network and safety test strategy for intelligent vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 2, pp. 310–322, 2020.
- [45] Q. Chen, S. Tang, Q. Yang, and S. Fu, “Cooper: Cooperative perception for connected autonomous vehicles based on 3d point clouds,” in IEEE International Conference on Distributed Computing Systems. IEEE, 2019, pp. 514–524.
- [46] H. Qiu, P.-H. Huang, N. Asavisanu, X. Liu, K. Psounis, and R. Govindan, “Autocast: scalable infrastructure-less cooperative perception for distributed collaborative driving,” in International Conference on Mobile Systems, Applications and Services, 2022, pp. 128–141.
- [47] Z. Y. Rawashdeh and Z. Wang, “Collaborative automated driving: A machine learning-based method to enhance the accuracy of shared information,” in International Conference on Intelligent Transportation Systems. IEEE, 2018, pp. 3961–3966.
- [48] E. Arnold, M. Dianati, R. de Temple, and S. Fallah, “Cooperative perception for 3d object detection in driving scenarios using infrastructure sensors,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 3, pp. 1852–1864, 2020.
- [49] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li, X. Hu, J. Yuan et al., “Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 361–21 370.
- [50] J. Cui, H. Qiu, D. Chen, P. Stone, and Y. Zhu, “Coopernaut: end-to-end driving with cooperative perception for networked vehicles,” in IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 252–17 262.
- [51] J. Tu, T. Wang, J. Wang, S. Manivasagam, M. Ren, and R. Urtasun, “Adversarial attacks on multi-agent communication,” in IEEE International Conference on Computer Vision, 2021, pp. 7768–7777.
- [52] G. Luo, H. Zhang, Q. Yuan, and J. Li, “Complementarity-enhanced and redundancy-minimized collaboration network for multi-agent perception,” in International Conference on Multimedia, 2022, pp. 3578–3586.
- [53] Z. Lei, S. Ren, Y. Hu, W. Zhang, and S. Chen, “Latency-aware collaborative perception,” in European Conference on Computer Vision. Springer, 2022, pp. 316–332.
- [54] Y. Hu, S. Fang, Z. Lei, Y. Zhong, and S. Chen, “Where2comm: Communication-efficient collaborative perception via spatial confidence maps,” in Advances in Neural Information Processing Systems, 2022.
- [55] S. Zeadally, J. Guerrero, and J. Contreras, “A tutorial survey on vehicle-to-vehicle communications,” Telecommunication Systems, vol. 73, no. 3, pp. 469–489, 2020.
- [56] M. M. Nasralla, C. T. Hewage, and M. G. Martini, “Subjective and objective evaluation and packet loss modeling for 3d video transmission over lte networks,” in International Conference on Telecommunications and Multimedia. IEEE, 2014, pp. 254–259.
- [57] P. Watta, X. Zhang, and Y. L. Murphey, “Vehicle position and context detection using v2v communication,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 4, pp. 634–648, 2020.
- [58] J. Mei, K. Zheng, L. Zhao, Y. Teng, and X. Wang, “A latency and reliability guaranteed resource allocation scheme for lte v2v communication systems,” IEEE Transactions on Wireless Communications, vol. 17, no. 6, pp. 3850–3860, 2018.
- [59] F. Abbas, P. Fan, and Z. Khan, “A novel low-latency v2v resource allocation scheme based on cellular v2x communications,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 6, pp. 2185–2197, 2018.
- [60] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Federated learning for ultra-reliable low-latency v2v communications,” in IEEE Global Communications Conference. IEEE, 2018, pp. 1–7.
- [61] H. Hasrouny, A. E. Samhat, C. Bassil, and A. Laouiti, “Vanet security challenges and solutions: A survey,” Vehicular Communications, vol. 7, pp. 7–20, 2017.
- [62] Y. Yan, H. Du, Q.-L. Han, and W. Li, “Discrete multi-objective switching topology sliding mode control of connected autonomous vehicles with packet loss,” IEEE Transactions on Intelligent Vehicles, 2022.
- [63] X. Sun, F. R. Yu, and P. Zhang, “A survey on cyber-security of connected and autonomous vehicles (cavs),” IEEE Transactions on Intelligent Transportation Systems, 2021.
- [64] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
- [65] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll, “Burst denoising with kernel prediction networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2502–2510.
- [66] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1712–1722.
- [67] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive separable convolution,” in IEEE International Conference on Computer Vision, 2017, pp. 261–270.
- [68] Z. Xia, F. Perazzi, M. Gharbi, K. Sunkavalli, and A. Chakrabarti, “Basis prediction networks for effective burst denoising with large kernels,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 844–11 853.
- [69] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [70] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in IEEE International Conference on Computer Vision, 2019, pp. 603–612.
- [71] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
- [72] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations., 2017.
- [73] E. Belyaev, A. Vinel, A. Surak, M. Gabbouj, M. Jonsson, and K. Egiazarian, “Robust vehicle-to-infrastructure video transmission for road surveillance applications,” IEEE Transactions on Vehicular Technology, vol. 64, no. 7, pp. 2991–3003, 2014.