Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving
Abstract
Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR, while the collaboration between camera and radar remains significantly under-exploited. Combining rich semantic information from the camera with reliable 3D information from the radar can potentially yield an efficient, inexpensive, and portable solution for 3D object perception tasks. Owing to the capability of mmWave radar, such a system can also remain robust across different lighting conditions and all-weather driving scenarios. In this paper, we introduce the CRUW3D dataset, which includes 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving scenarios. Unlike other large-scale autonomous driving datasets, our radar data is provided in the format of radio frequency (RF) tensors that contain not only 3D location information but also spatio-temporal semantic information. This radar format enables machine learning models to generate more reliable object perception results by interacting with and fusing information or features between camera and radar.
1 Introduction
Reliability and robustness are critically important for autonomous driving and advanced driver assistance systems (ADAS). To achieve reliable object perception, multiple sensor modalities are usually deployed on autonomous vehicles [13, 32], e.g., camera, LiDAR, and radar. These sensors have their own strengths and weaknesses. As one of the primary sensors, the camera provides rich, human-understandable semantic information at relatively high resolution, but it is sensitive to adverse weather and varying lighting conditions. LiDAR, a popular 3D sensor for perception tasks, provides accurate 3D point clouds for spatial analysis. However, LiDAR is not reliable in adverse weather such as rain and fog, which can interfere with the laser light and potentially produce ghost obstacles in the scene. Millimeter-wave (mmWave) radar, on the other hand, is a traditional automotive sensor for distance and speed estimation, but it is often criticized for its low resolution, over-sensitivity, and the limited semantic information it carries.
Therefore, sensor fusion has been considered as a way to combine the advantages of these different sensors and compensate for their shortcomings. Recently, some researchers have focused on sensor fusion between camera and LiDAR [22, 11, 20], either by fusing camera and LiDAR features at an intermediate stage or by fusing detection results at the final decision stage. However, with LiDAR involved, the system usually becomes complex in hardware and expensive in computation. Such a solution is neither efficient nor robust to adverse driving conditions.
Radar, on the other hand, is a cost-efficient sensor that can compensate for the limitations of the camera by providing robust 3D distance and speed information, which is potentially useful for autonomous or assisted driving systems. Two kinds of data representations are usually considered for mmWave radar, i.e., radio frequency (RF) tensors and radar points. An RF tensor is a dense and informative representation containing both amplitude and phase information, but location and speed are implicit. Radar points, in contrast, are explicit representations, but they are usually sparse (fewer than 5 points on a nearby car) [2, 8] and non-descriptive. Traditionally, mmWave radar has often been used as a supplementary sensor because it is difficult to parse useful clues for semantic understanding from its data, which limits its potential for sensor fusion with other modalities. Some research works focus on semantic understanding of radar data, e.g., object classification and detection [15, 19, 17, 6, 29, 30], and joint 3D object detection and tracking [4]. These semantic understanding tasks require radar RF tensors as the input data format since more object-level information is preserved. However, few public datasets include radar RF tensors with proper annotations, as discussed in Table 1.

To fill this gap in data and annotation, we introduce a new dataset, named CRUW3D, containing 66K synchronized camera, radar, and LiDAR frames under various driving scenarios, with object 3D bounding box and trajectory annotations. Fig. 1 shows some examples of the data and annotations in CRUW3D. To improve the precision of data labeling, we include a LiDAR in our data collection system. Based on the LiDAR point clouds, we carefully label the object 3D bounding boxes in each frame and the object trajectories throughout the temporal sequences. We also provide the calibration parameters among the sensors so that data and information can be transformed across modalities or used in sensor fusion setups. We hope the CRUW3D dataset will enable more research on reliable and robust collaborative perception. The CRUW3D dataset will be publicly available soon.
Overall, our CRUW3D dataset has the following key contributions:
- It is the first public dataset with synchronized camera RGB images, raw radar analog-to-digital converter (ADC) data, radar RF tensors with phase, and LiDAR point clouds, to the best of our knowledge.
- It includes object annotations of 3D bounding boxes and 3D object trajectories, which are valuable for various object perception tasks, e.g., 3D object detection and 3D multi-object tracking.
- It covers different lighting conditions that are challenging for vision-based object perception methods, thus providing a good benchmark for sensor-fusion-based object perception algorithms.
2 Related Works
Autonomous driving datasets have attracted great attention for deep-learning-based object perception methods. The KITTI dataset [9] is the first complete autonomous driving dataset, including stereo cameras and a LiDAR with various annotations. Recently, larger-scale and more advanced datasets have become available, e.g., BDD100K [33], nuScenes [2], ApolloScape [10], and Waymo Open [25]. However, due to hardware compatibility issues and less mature radar perception techniques, most datasets do not incorporate radar signals in their sensor systems.
Among the available radar datasets, some, e.g., nuScenes [2], HiRes2019 [16], RadarRobotCar [1], and RADIATE [24], provide radar data in the format of radar points, which do not contain the Doppler and surface texture information of objects that is useful for semantic understanding. Other researchers use RF tensors as the radar data format. Specifically, some collect datasets with camera, radar, and LiDAR, and annotate objects as 3D bounding boxes based on the dense point clouds from LiDAR [15, 6]. Others consider a camera-radar solution without a LiDAR [17, 19], whose annotation format is usually at the pixel or point level. However, most of these datasets are not publicly available, as shown in Table 1.
A recent dataset, K-Radar [18], provides camera images, radar RF tensors, and LiDAR point clouds with 3D bounding box and tracking annotations in diverse weather and lighting conditions. However, their radar RF tensors contain only the magnitude response. Compared with their radar data format, ours contains not only amplitude values but also the phase response, which provides semantics useful for classification and scene understanding. Moreover, we provide radar data in its raw format, i.e., analog-to-digital converter (ADC) data, which inherently preserves more information. We foresee more powerful architectures that consume radar ADC data directly, bypassing the time-domain to frequency-domain conversion and, as a result, achieving better 3D perception capability.
Dataset | Modality¹ | Radar² | Scenario | Scale | Class | Anno | Public
---|---|---|---|---|---|---|---
nuScenes [2] | C/R/L | RP | combined | 5.5 hours | 23 | 3D Box+Trk | ✓
RADIATE [24] | C/R/L | RP | combined | 3 hours | 7 | 2D Box | ✓
HiRes2019 [16] | C/R/L | RP | normal | 546 frames | 7 | 3D Box | ✓
CARRADA [17] | C/R | RF | normal | 21.2 min | 3 | Pixel | ✓
RTCnet [19] | C/R | RF | normal | 1 hour | 3 | Point | ✗
RADDet [35] | C/R | RF | normal | 10K frames | 6 | 2D Box | ✓
CRUW [30] | C/R | RF | combined | 3.5 hours | 3 | Point | ✓
ROD2021 [27] | C/R | RF | combined | 28 min | 3 | Point | ✓
Qualcomm [15] | C/R/L | RF | normal | 3 hours | 1 | 3D Box | ✗
Xsense.ai [6] | C/R/L | RF | normal | 34.2 min | 1 | 3D Box | ✗
K-Radar [18] | C/R/L | RF | combined | 58.3 min | 5 | 3D Box+Trk | ✓
CRUW3D (Ours) | C/R/L | RF&ADC | combined | 40 min | 5 | 3D Box+Trk | ✓

¹ Modalities: "C" for camera, "R" for radar, "L" for LiDAR.
² Radar data formats: "RP" for radar points, "RF" for radio frequency (RF) tensors.
3 CRUW3D Dataset
3.1 Data Collection
We propose a dataset collection pipeline with stereo cameras, an mmWave radar, and a LiDAR, including a sensor platform, data collection software, and a sensor calibration method. With the proposed pipeline, the data collected from the three sensor modalities can be accurately synchronized in time and calibrated in space.
Sensor Platform
Our dataset collection sensor system is shown in Figure 2. There are two FLIR BFS-U3-16S2C-CS cameras, one TI AWR1843 radar board, and one Livox Horizon LiDAR. The detailed specifications are listed in Table 2.

Cameras | Value | Radar | Value | LiDAR | Value
---|---|---|---|---|---
Frame Rate | 30 FPS | Frame Rate | 30 FPS | Frame Rate | 10 FPS³
Pixels (W×H) | 1440 × 1080 | Frequency | 77 GHz | Point Rate | 240,000 pts/s
Resolution | 1.6 MP | # of Transmitters | 2 | Detection Range | 260 m
Field of View | 93.6° | # of Receivers | 4 | Range Precision | 0.3 cm
Stereo Baseline | 0.6 m | # of Chirps per Frame | 255 | Field of View | 81.7° × 25.1°
 | | Max Range | 30 m | Angular Precision | 0.05°
 | | Range Resolution | 0.23 m | |
 | | Min & Max Angle | ±90°⁴ | |
 | | Azimuth Resolution | 15° | |

³ The frame rate is after the point cloud integration that scans over the LiDAR's field of view (FOV); details are introduced in Section 3.2.
⁴ Better radar performance and resolution within ±60°.
Sensor Synchronization
Our data collection software is based on the Robot Operating System (ROS) under Ubuntu. Since the cameras and LiDAR provide open-source APIs, we integrate them into the ROS system directly. However, TI only provides software based on Windows and MATLAB, so we run a Windows virtual machine in our Ubuntu system and let the processes communicate through ROS. We set up hardware time synchronization between the cameras and LiDAR using a Transistor-Transistor Logic (TTL) signal generated by the right camera; both the camera and LiDAR sensors support TTL time synchronization through their APIs. At the software level, we use the ApproximateTime synchronization policy provided by the ROS library to align the three sensors' data into 30 FPS time slots. To synchronize the radar with the other sensors, we use a software trigger to start a data sequence collection: a service client triggers the radar data collection process and, upon receiving the response, starts the collection process of the other sensors. From our experiments, the latency of the software trigger is under a few milliseconds, which is negligible. More details of our data collection system are described in the supplementary document.
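As a concrete illustration of the software-level alignment, the snippet below is a minimal ROS (rospy) sketch using the ApproximateTime policy; the topic names and queue parameters are illustrative assumptions, not the exact configuration of our system.

```python
import rospy
import message_filters
from sensor_msgs.msg import Image, PointCloud2

def synced_callback(img_left, img_right, cloud):
    # All three messages fall into the same ~1/30 s time slot.
    rospy.loginfo("synced stamps: %s %s %s",
                  img_left.header.stamp, img_right.header.stamp, cloud.header.stamp)

def main():
    rospy.init_node("cruw3d_sync_example")
    subs = [
        message_filters.Subscriber("/camera_left/image_raw", Image),   # hypothetical topic
        message_filters.Subscriber("/camera_right/image_raw", Image),  # hypothetical topic
        message_filters.Subscriber("/livox/lidar", PointCloud2),       # hypothetical topic
    ]
    # slop of half a 30 FPS frame period keeps messages within one time slot.
    sync = message_filters.ApproximateTimeSynchronizer(subs, queue_size=30, slop=1.0 / 60)
    sync.registerCallback(synced_callback)
    rospy.spin()

if __name__ == "__main__":
    main()
```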
Sensor Calibration
First, we calibrate the stereo cameras using Zhang's method [36], which gives the intrinsic parameters, distortion coefficients, and extrinsic parameters of the two cameras. These results are later used for stereo rectification in Section 3.2. For the calibration between the cameras and LiDAR, we adopt the algorithm proposed by Dhall et al. [5], which gives two transformation matrices: one between the left camera and the LiDAR, and one between the right camera and the LiDAR. As for the radar, it is carefully mounted and aligned with the cameras and LiDAR according to their pitch angles, so its coordinate plane is parallel to the camera's bird's-eye view (BEV). The translation vectors between the sensors are also measured to form the full transformation matrices between the cameras and the radar.
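As a minimal sketch of how these calibration results can be used, the snippet below projects LiDAR points into the left camera image, assuming a 4×4 camera-from-LiDAR extrinsic matrix and a 3×3 intrinsic matrix; the variable names are illustrative, not the dataset's API.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """points_lidar: (N, 3) in LiDAR coordinates -> (M, 2) pixel coordinates."""
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])  # homogeneous (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                              # camera coordinates (N, 3)
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                                    # keep points in front of the camera
    pix = (K @ pts_cam.T).T
    return pix[:, :2] / pix[:, 2:3]                                         # perspective division
```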
3.2 Data Processing
Camera Data Processing
The image sequences captured by the stereo cameras are first undistorted and rectified based on the camera calibration. Then, for low-quality images caused by adverse lighting conditions, we conduct image enhancement to improve the quality and lighting stability of the collected videos. Specifically, we adopt a deep-learning-based method, RRDNet [37], which restores underexposed images in a zero-shot manner using a three-branch CNN. To obtain stable enhancement results over video sequences, we train the network using only the first frame of each sequence and run inference on the remaining frames.
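The undistortion and rectification step can be reproduced with standard OpenCV calls; the sketch below assumes the intrinsics, distortion coefficients, and stereo extrinsics produced by Zhang's calibration are available as K1, D1, K2, D2, R, T (illustrative names).

```python
import cv2

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    size = (img_l.shape[1], img_l.shape[0])  # (width, height)
    # Compute rectification transforms and new projection matrices.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)
    return rect_l, rect_r
```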
Radar Data Processing
Our radar data processing is similar to the pre-processing described in [28], where RF tensors in radar range-azimuth coordinates form a bird's-eye view (BEV) representation, with one axis denoting azimuth (angle) and the other denoting range (distance). From the raw radar data, we first apply a range fast Fourier transform (FFT) on the received chirp samples to estimate the range of the reflections. We then apply a second, angle FFT on the samples across the different receiver antennas to estimate the azimuth angle of the reflections. In addition, we transform the RF tensors into Cartesian coordinates for better alignment with the camera and clearer visualization. A more detailed description of our radar data processing is provided in the supplementary document.
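A simplified numpy sketch of this two-step FFT is given below; the ADC array layout (fast-time samples × chirps × receivers), the FFT sizes, and the coherent averaging over chirps are illustrative assumptions rather than our exact processing chain.

```python
import numpy as np

def adc_to_rf_ra(adc, num_range_bins=128, num_angle_bins=128):
    """adc: complex array of shape (num_samples, num_chirps, num_rx)."""
    # 1) Range FFT over the fast-time samples of each chirp.
    range_fft = np.fft.fft(adc, n=num_range_bins, axis=0)
    # 2) Angle FFT across the receiver antennas, zero-padded to the angle grid.
    angle_fft = np.fft.fft(range_fft, n=num_angle_bins, axis=2)
    angle_fft = np.fft.fftshift(angle_fft, axes=2)   # center zero azimuth
    # Average over chirps to obtain one complex range-azimuth tensor per frame.
    return angle_fft.mean(axis=1)                    # shape: (range, azimuth)
```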
LiDAR Data Processing
The Livox LiDAR uses a special laser scanning technology called non-repetitive horizontal scanning, which is significantly different from the repetitive linear scanning of most traditional LiDAR sensors. It accumulates the points captured inside the FOV to obtain denser point clouds within an integration time window. However, with this scanning pattern, the point cloud cannot cover the whole FOV within a single camera frame (i.e., 1/30 second). To ensure that every camera/radar frame has a corresponding LiDAR frame for annotation, we accumulate the point clouds captured in three consecutive frames (i.e., a 0.1-second time window) into one complete frame, which means the frame rate of our LiDAR data is 10 FPS, as mentioned in Table 2.
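A minimal sketch of this integration is shown below, assuming each Livox sweep is already an (N, 3) numpy array in the ego frame; it simply concatenates every three consecutive sweeps into one 10 FPS frame.

```python
import numpy as np

def accumulate_lidar(sweeps, window=3):
    """sweeps: list of (N_i, 3) point arrays at 30 Hz -> list of integrated 10 Hz frames."""
    frames = []
    for i in range(0, len(sweeps) - window + 1, window):
        # Concatenate three consecutive sweeps (a 0.1 s window) into one denser frame.
        frames.append(np.concatenate(sweeps[i:i + window], axis=0))
    return frames
```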
3.3 Data Annotation
In the CRUW3D dataset, we label 3D bounding boxes on the LiDAR point clouds. Unlike the 3D bounding box labels in the KITTI dataset, we use three Euler angles to represent the orientation of each bounding box, since the streets in CRUW3D are not as flat as those in KITTI. We consider the following 5 object categories during annotation: pedestrian, car, van, truck, and bus; detailed statistics are shown in Section 3.4. In addition to the 3D bounding boxes, we also annotate object track IDs for multi-object tracking (MOT) tasks. However, because the sensors have different FOVs and the point clouds of faraway objects are usually sparse, we only annotate objects within the overlapping area shown in Figure 3. After the 3D bounding boxes are labeled on the point clouds, we project them to camera and radar coordinates using the transformation matrices from sensor calibration, as sketched below. The annotations can then be used to train networks for the camera and radar, respectively.
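The sketch below gives a rough idea of this projection step for the radar side: it builds the eight box corners from the labeled center, dimensions, and rotation, and converts the box center to radar range-azimuth coordinates. The matrix names, corner ordering, and radar axis convention are illustrative assumptions rather than the dataset's API.

```python
import numpy as np

def box_corners_lidar(center, dims, R):
    """center: (3,), dims: (length, width, height), R: (3, 3) rotation from the Euler angles."""
    l, w, h = dims
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    z = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2
    corners = R @ np.vstack([x, y, z])            # rotate the local corners, (3, 8)
    return (corners + center.reshape(3, 1)).T     # translate to LiDAR frame, (8, 3)

def center_to_radar_ra(center_lidar, T_radar_lidar):
    c = T_radar_lidar @ np.append(center_lidar, 1.0)
    x, y = c[0], c[1]                             # radar BEV plane (assumed axes)
    return np.hypot(x, y), np.arctan2(y, x)       # (range in meters, azimuth in radians)
```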
Description | Value
---|---
Driving Time | 40 min
Scenarios | 70% normal, 30% adverse (city street, highway, sidewalk)

 | Overall | Train | Test
---|---|---|---
# of Frames | 66K | 56K | 10K
# of Seqs | 74 | 56 | 18
# of Labeled 3D Bboxes | 80K | 57K | 23K
# of Labeled 3D Tracks | 576 | 397 | 179

3.4 Data Statistics
Our CRUW3D dataset contains about 66K frames of synchronized camera, radar, and LiDAR data under various driving scenarios with different lighting conditions. Approximately 70% of the data are captured in normal driving scenarios with good lighting conditions; the remaining 30% are captured in adverse lighting conditions, e.g., nighttime or strong lighting. Some data statistics are shown in Table 3. Among all the data frames, we annotate 19K frames in the training set and 10K frames in the testing set. We use this setting for all the experiments in Section 4.
As for the annotations for the CRUW3D dataset, we analyze the different distributions of our labeled objects in Figure 4, including the number of 3D bounding boxes, number of 3D object trajectories, object depths, object azimuth angles, and object dimensions.
Object Class Distribution
For the 5 object categories we are interested in (i.e., pedestrian, car, van, truck, and bus), pedestrian and car are the two dominant categories, as shown in Figure 4 (a) and (b), which reasonably reflects the actual object class distribution in real driving scenarios.
Object Location Distribution
First, we analyze the depth distribution of the 3D bounding boxes in Figure 4 (c), where object depth is the distance between the LiDAR and the center of a 3D bounding box along the LiDAR's forward axis. Most annotated 3D bounding boxes lie within 0–40 meters. We also analyze the object azimuth angle distribution in Figure 4 (d): most labels fall within the range of −50° to 50°, which is the overlapping region of the three sensor modalities.
Object Size Distribution
Figure 4 (e) shows the distribution of object length for different object classes, including pedestrian, car, truck, and bus. The distributions for pedestrians and cars are relatively concentrated, while those of trucks and buses are more spread out.
Object Trajectory
We also report trajectory statistics in Table 3. Overall, there are 576 object trajectories, with 397 in the training set and 179 in the testing set. The average length of the object trajectories is 121 frames.

3.5 Comparison with Related Datasets
We compare the CRUW3D dataset with related radar-equipped datasets in Table 1, discussing them in terms of sensor modalities, radar data format, driving scenarios, dataset scale, annotated object categories, annotation format, and public availability.
As Table 1 shows, most related datasets that use RF tensors as the radar data format do not provide 3D bounding box and trajectory annotations. Although the scale of our dataset is relatively small, the baseline experiments in Section 4 show that it is sufficient for training and evaluating object perception algorithms. Nonetheless, we will continue to collect and annotate data to expand its scale.
4 Baseline Experiments
In this section, we conduct a series of baseline experiments on our CRUW3D dataset, including camera-based 3D object detection, camera-based 3D object tracking, radar-based object detection, and a camera-radar fusion baseline. In the following experiments, we only consider pedestrian and car as our perception target classes.
4.1 Camera-Based 3D Object Detection
Monocular 3D object detection is pivotal for autonomous driving applications. Neural networks for 3D object detection extract image features and detect objects in the perspective view or in BEV. We implement SMOKE [14] and DD3D [21] as baselines on our benchmark.
SMOKE is a single-stage 3D object detection method based on CenterNet [7]. Given an input image, it detects the targeted objects' 3D centers projected onto the image plane. However, this algorithm was originally designed for the KITTI dataset, where the 3D bounding box orientation includes only a yaw angle. We therefore convert our quaternion-based orientation label for each bounding box to a yaw angle by ignoring the pitch and roll, assuming those rotation angles are negligible. We use DLA-34 [34] as the backbone network for SMOKE in our implementation.
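The orientation simplification amounts to extracting only the yaw component of the rotation label. A minimal sketch is given below, assuming the label is stored as a (w, x, y, z) quaternion; the storage convention is an assumption.

```python
import numpy as np

def quaternion_to_yaw(q):
    """q: (w, x, y, z) quaternion -> yaw angle in radians (pitch and roll are discarded)."""
    w, x, y, z = q
    # Standard ZYX Euler extraction of the rotation about the vertical axis.
    return np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
```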
DD3D is built on top of the 2D object detector FCOS [26]. It pre-trains the network on the large-scale depth dataset DDAD15M to obtain better depth-aware image features, achieving state-of-the-art performance among monocular 3D object detection methods. We try two different backbone networks, DLA-34 [34] and V2-99 [12], in our implementation.
Similar to KITTI, the evaluation metrics are the average precision computed on 3D bounding boxes (AP_3D) and on BEV 2D bounding boxes (AP_BEV), using IOU thresholds of 0.5 and 0.7 for cars and 0.3 and 0.5 for pedestrians. The quantitative results are shown in Table 4. In these experiments, DD3D outperforms SMOKE in all aspects, and with the larger V2-99 backbone, DD3D obtains the best performance on both cars and pedestrians.
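As a reference for how the BEV overlap can be computed, the snippet below is a hedged sketch (not the official evaluation code) of the IoU between two yaw-rotated BEV rectangles using shapely; the box parameterization is an assumption.

```python
import numpy as np
from shapely.geometry import Polygon

def bev_corners(cx, cz, length, width, yaw):
    # Corners of a yaw-rotated rectangle on the ground plane.
    dx = np.array([ 1,  1, -1, -1]) * length / 2
    dz = np.array([ 1, -1, -1,  1]) * width / 2
    c, s = np.cos(yaw), np.sin(yaw)
    return list(zip(cx + c * dx - s * dz, cz + s * dx + c * dz))

def bev_iou(box_a, box_b):
    """Each box: (cx, cz, length, width, yaw) on the ground plane."""
    pa, pb = Polygon(bev_corners(*box_a)), Polygon(bev_corners(*box_b))
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)
```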
Method | Car (IOU=0.5) AP_3D | Car (IOU=0.5) AP_BEV | Car (IOU=0.7) AP_3D | Car (IOU=0.7) AP_BEV | Ped. (IOU=0.3) AP_3D | Ped. (IOU=0.3) AP_BEV | Ped. (IOU=0.5) AP_3D | Ped. (IOU=0.5) AP_BEV
---|---|---|---|---|---|---|---|---
SMOKE [14] | 44.52 | 48.55 | 17.63 | 25.11 | 10.17 | 10.57 | 2.58 | 3.36
DD3D (DLA-34) [21] | 57.88 | 64.86 | 24.29 | 36.09 | 16.58 | 18.62 | 6.84 | 7.44
DD3D (V2-99) [21] | 58.08 | 64.68 | 25.41 | 37.86 | 18.46 | 20.57 | 8.29 | 9.25
4.2 Camera-Based 3D Object Tracking
After the object 3D detection results are obtained, we further apply a 3D multi-object tracking (MOT) algorithm, AB3DMOT [31], to obtain object 3D bounding box trajectories. We conduct experiments by feeding the 3D object detection results of SMOKE and DD3D from Table 4 into the AB3DMOT framework. AB3DMOT tracks each object class separately and combines the results at the final stage; we therefore also evaluate the 3D MOT performance on cars and pedestrians separately, as shown in Table 5.
For the 3D MOT evaluation metrics, we adopt those proposed in [31], including scaled average multi-object tracking accuracy (sAMOTA), average multi-object tracking accuracy (AMOTA), and average multi-object tracking precision (AMOTP). From Table 5, the combination of "DD3D + AB3DMOT" achieves the best 3D MOT performance, while "SMOKE + AB3DMOT" performs very poorly on pedestrian tracking due to the poor 3D detection quality in the previous stage.
4.3 Radar-Based Object Detection
Method | AP | AR
---|---|---
RODNet (vanilla) | 28.69 | 42.37 |
RODNet (HG) | 29.97 | 43.24 |
RODNet (HGwI) | 31.30 | 44.72 |
RODNet (HGwI + TDC) | 32.72 | 47.22 |
For radar-based object detection, which detects each object as a point in the RF tensor, we use RODNet [29] as our baseline. The evaluation metrics are average precision (AP) and average recall (AR) under different object location similarity (OLS) thresholds, the same metrics as in our previous CRUW dataset [30]. The quantitative results are shown in Table 6. The overall performance is lower than that on the CRUW dataset [30], showing that CRUW3D is considerably more challenging. Consistent with [29], RODNet with the HGwI backbone and temporal deformable convolution achieves the best performance.
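For readers unfamiliar with OLS, the snippet below sketches a Gaussian-style location similarity in the spirit of [29, 30]: it decays with the BEV distance between a detection and a ground-truth point, scaled by the object's distance and a per-class tolerance constant. The constants below are placeholders, not the official values.

```python
import numpy as np

KAPPA = {"pedestrian": 0.5, "car": 1.0}  # placeholder per-class tolerance constants

def ols(det_xy, gt_xy, gt_distance, cls):
    """Location similarity in [0, 1] between a detected point and a ground-truth point (BEV)."""
    d = np.linalg.norm(np.asarray(det_xy) - np.asarray(gt_xy))  # BEV distance in meters
    s = max(gt_distance, 1e-3)                                  # object distance as the scale term
    return np.exp(-d ** 2 / (2.0 * (s * KAPPA[cls]) ** 2))
```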
5 Conclusion
In this paper, we introduced a new benchmark dataset, CRUW3D, which contains synchronized and well-calibrated camera, radar, and LiDAR data with object 3D bounding box and trajectory annotations. To the best of our knowledge, it is the first public dataset providing radar RF tensors with both magnitude and phase information for 3D object detection and multi-object tracking tasks. With the CRUW3D dataset, sensor fusion between camera and mmWave radar can be further exploited to improve the reliability and robustness of autonomous driving.
6 Limitations and Future Works
Although the CRUW3D dataset can contribute substantially to the sensor fusion community, its scale remains a limitation compared with other large-scale autonomous driving datasets. We are therefore actively collecting and annotating more data to enlarge the dataset. With both labeled and unlabeled data, research on camera- and radar-based perception under self- or semi-supervised learning settings can be further conducted.
Acknowledgments and Disclosure of Funding
This work is supported by CISCO Systems, Inc. [FA206070/A175367]. The authors would also like to thank the colleagues and students in the Information Processing Lab at UWECE for their help with the dataset collection, processing, and annotation.
References
- [1] Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman, and Ingmar Posner. The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, 2020.
- [2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
- [3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- [4] Jen-Hao Cheng, Sheng-Yao Kuan, Hugo Latapie, Gaowen Liu, and Jenq-Neng Hwang. Centerradarnet: Joint 3d object detection and tracking framework using 4d fmcw radar, 2023.
- [5] Ankit Dhall, Kunal Chelani, Vishnu Radhakrishnan, and K Madhava Krishna. Lidar-camera calibration using 3d-3d point correspondences. arXiv preprint arXiv:1705.09785, 2017.
- [6] Xu Dong, Pengluo Wang, Pengyue Zhang, and Langechuan Liu. Probabilistic oriented object detection in automotive radar. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 102–103, 2020.
- [7] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 6569–6578, 2019.
- [8] Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems, 2020.
- [9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
- [10] Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE transactions on pattern analysis and machine intelligence, 42(10):2702–2719, 2019.
- [11] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
- [12] Youngwan Lee and Jongyoul Park. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13906–13915, 2020.
- [13] Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 163–168. IEEE, 2011.
- [14] Zechen Liu, Zizhang Wu, and Roland Tóth. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 996–997, 2020.
- [15] Bence Major, Daniel Fontijne, Amin Ansari, Ravi Teja Sukhavasi, Radhika Gowaikar, Michael Hamilton, Sean Lee, Slawomir Grzechnik, and Sundar Subramanian. Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
- [16] Michael Meyer and Georg Kuschk. Automotive radar dataset for deep learning based 3d object detection. In 2019 16th European Radar Conference (EuRAD), pages 129–132. IEEE, 2019.
- [17] Arthur Ouaknine, Alasdair Newson, Julien Rebut, Florence Tupin, and Patrick Pérez. Carrada dataset: Camera and automotive radar with range-angle-doppler annotations. arXiv preprint arXiv:2005.01456, 2020.
- [18] Dong-Hee Paek, Seung-Hyun Kong, and Kevin Tirta Wijaya. K-radar: 4d radar object detection for autonomous driving in various weather conditions. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- [19] Andras Palffy, Jiaao Dong, Julian FP Kooij, and Dariu M Gavrila. Cnn based road user detection using the 3d radar cube. IEEE Robotics and Automation Letters, 5(2):1263–1270, 2020.
- [20] Su Pang, Daniel Morris, and Hayder Radha. Clocs: Camera-lidar object candidates fusion for 3d object detection. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10386–10393. IEEE, 2020.
- [21] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.
- [22] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018.
- [23] Mark A Richards. Fundamentals of radar signal processing. Tata McGraw-Hill Education, 2005.
- [24] Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, and Andrew Wallace. Radiate: A radar dataset for automotive perception. arXiv preprint arXiv:2010.09076, 2020.
- [25] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
- [26] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
- [27] Yizhou Wang, Jenq-Neng Hwang, Gaoang Wang, Hui Liu, Kwang-Ju Kim, Hung-Min Hsu, Jiarui Cai, Haotian Zhang, Zhongyu Jiang, and Renshu Gu. Rod2021 challenge: A summary for radar object detection challenge for autonomous driving applications. In Proceedings of the 2021 International Conference on Multimedia Retrieval, pages 553–559, 2021.
- [28] Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. Rodnet: Radar object detection using cross-modal supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 504–513, 2021.
- [29] Yizhou Wang, Zhongyu Jiang, Yudong Li, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. Rodnet: A real-time radar object detection network cross-supervised by camera-radar fused object 3d localization. IEEE Journal of Selected Topics in Signal Processing, 2021.
- [30] Yizhou Wang, Gaoang Wang, Hung-Min Hsu, Hui Liu, and Jenq-Neng Hwang. Rethinking of radar’s role: A camera-radar dataset and systematic annotator via coordinate alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815–2824, 2021.
- [31] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. AB3DMOT: A Baseline for 3D Multi-Object Tracking and New Evaluation Metrics. ECCVW, 2020.
- [32] Hao Yang, Chenxi Liu, Meixin Zhu, Xuegang Ban, and Yinhai Wang. How fast you will drive? predicting speed of customized paths by deep neural network. IEEE Transactions on Intelligent Transportation Systems, 2021.
- [33] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
- [34] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403–2412, 2018.
- [35] Ao Zhang, Farzan Erlik Nowruzi, and Robert Laganiere. Raddet: Range-azimuth-doppler based radar object detection for dynamic road users. In 2021 18th Conference on Robots and Vision (CRV), pages 95–102. IEEE, 2021.
- [36] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000.
- [37] Anqi Zhu, Lin Zhang, Ying Shen, Yong Ma, Shengjie Zhao, and Yicong Zhou. Zero-shot restoration of underexposed images via robust retinex decomposition. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
Appendices
Appendix A Data Collection Pipeline
The pipelines for our CRUW3D dataset collection software and the ROS-based time synchronizer are shown in Figure 5.

The main software is deployed on Ubuntu 20.04 and is responsible for communicating with the two cameras and the LiDAR. Since TI does not open-source the radar's APIs, we create a Windows 10 virtual machine on the Ubuntu system to communicate with the radar. The bridge between the two systems is built via ROS.
As for the time synchronizer, we build the pipeline based on ROS. A software trigger starts the time synchronizer and the radar node at the same time. The time synchronizer then triggers the right camera, and the left camera and LiDAR are triggered through the synchronization cable (hardware trigger). Meanwhile, the radar node triggers radar ADC capturing through the MATLAB engine of TI mmWave Studio.
Appendix B Radar Data Representations and Processing
B.1 Radar Data Representations
There are usually two kinds of radar data representations, i.e., radar points and radio frequency (RF) tensors, as shown in Figure 6. Radar points are more frequently used for obstacle ranging and speed estimation in autonomous driving, since range and speed can be directly inferred from the raw radar data through the fast Fourier transform (FFT), adaptive peak thresholding, and clustering [23]. However, radar points from an mmWave radar are usually very sparse and have relatively low angular resolution, especially compared with LiDAR [3, 8]. Thus, a large amount of useful semantic information is lost in this representation.

Radar is nonetheless feasible for semantic understanding, e.g., object classification, detection, and tracking, owing to the phase information hidden inside the radio frequencies. Typically, the radar signal amplitude is used to estimate the distance and speed of obstacles, while the phase information is usually under-utilized because of its non-intuitiveness, which makes it difficult to interpret with classical signal processing methods.
B.2 Radar Data Processing
We illustrate the details of our radar data processing, which can be divided into two parts: RF tensors in range-azimuth (RA) coordinates, used to localize and classify objects in the BEV, and RF tensors in range-azimuth-Doppler (RAD) coordinates, used to obtain relative radial speed information.

RF-RA tensors
This process is the same as the pre-processing in [28]. RF tensors in radar range-azimuth coordinates can be described as a bird's-eye view (BEV) representation, with one axis denoting azimuth (angle) and the other denoting range (distance). From the raw radar data, we first apply a range FFT on the received chirp samples to estimate the range of the reflections. We then apply a second, angle FFT on the samples across the different receiver antennas to estimate the azimuth angle of the reflections. An example RF-RA tensor is shown in Figure 7 (c). After this transformation, the radar data is represented in a complex-valued 2D format (with real and imaginary channels).
RF-RA tensors in Cartesian coordinates
To better associate the data among the different sensor modalities, we also transform RF-RA tensors into Cartesian coordinates, as shown in Figure 7 (d). We first generate a Cartesian grid for the target RF tensor and map each grid location to polar coordinates. The value at each location is obtained by bilinear interpolation from the original RF-RA tensor. Note that, because amplitude and phase vary continuously in RF-RA tensors, we interpolate the amplitude and phase parts of each complex pixel value instead of interpolating the real and imaginary parts.
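A simplified numpy/scipy sketch of this resampling is given below: the magnitude and phase of the complex RF-RA tensor are interpolated separately on the target Cartesian grid and then recombined. The grid definitions, the azimuth convention, and the phase unwrapping along range are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def ra_to_cartesian(rf_ra, ranges, azimuths, xs, zs):
    """rf_ra: complex (n_range, n_azimuth) on the (ranges, azimuths) grid;
    xs, zs define the target Cartesian BEV grid."""
    mag = np.abs(rf_ra)
    phase = np.unwrap(np.angle(rf_ra), axis=0)       # unwrap along range for smooth interpolation
    interp_mag = RegularGridInterpolator((ranges, azimuths), mag,
                                         bounds_error=False, fill_value=0.0)
    interp_phs = RegularGridInterpolator((ranges, azimuths), phase,
                                         bounds_error=False, fill_value=0.0)
    X, Z = np.meshgrid(xs, zs, indexing="xy")
    query = np.stack([np.hypot(X, Z), np.arctan2(X, Z)], axis=-1)  # (range, azimuth) per grid cell
    return interp_mag(query) * np.exp(1j * interp_phs(query))      # recombine into complex values
```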
RF-RAD tensors
Besides the RF tensors in range-azimuth coordinates, we further process the raw radar data into RF-RAD tensors to obtain speed information. First, as in the RF-RA pre-processing, a range FFT estimates the range of the reflections. Then, a Doppler FFT estimates the speed at each range bin, and an angle FFT estimates the azimuth angle, yielding a 3D tensor that represents the scene in RAD coordinates. To obtain the relative radial speed between an object and the ego-vehicle, we select the Doppler bin with the greatest amplitude along the Doppler axis. We call the resulting tensor the RF speed map; each of its elements represents the relative radial speed at a certain range and angle location. An example is shown in Fig. 7 (e).
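The sketch below summarizes this RAD processing with numpy: range, Doppler, and angle FFTs followed by a maximum over the Doppler axis to form the speed map. The ADC layout and the bin-to-velocity conversion are illustrative assumptions.

```python
import numpy as np

def adc_to_speed_map(adc, doppler_res_mps=0.1):
    """adc: complex (num_samples, num_chirps, num_rx) -> BEV speed map (range, azimuth)."""
    rad = np.fft.fft(adc, axis=0)                               # range FFT over fast-time samples
    rad = np.fft.fftshift(np.fft.fft(rad, axis=1), axes=1)      # Doppler FFT over chirps
    rad = np.fft.fftshift(np.fft.fft(rad, axis=2), axes=2)      # angle FFT over receivers
    mag = np.abs(rad)                                           # (range, Doppler, azimuth)
    best_doppler = np.argmax(mag, axis=1)                       # strongest Doppler bin per cell
    num_doppler = adc.shape[1]
    # Convert the bin index to a signed radial speed; the resolution value is an assumption.
    return (best_doppler - num_doppler // 2) * doppler_res_mps
```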
Appendix C Implementation Details
C.1 Monocular 3D Object Detection Implementation Details
SMOKE Implementation
We follow the original implementation of SMOKE with a few modifications. First, we convert the 3D bounding box annotations from LiDAR coordinates to camera 3D coordinates using the sensor calibration. We only predict the yaw angles of the objects, following the original implementation of SMOKE [14]. The canonical size statistics of cars and pedestrians and the depth statistics of objects are computed from our dataset. We use the original image resolution and pad it to a fixed input size. We train the network with a batch size of 4 on one Tesla V100 for 80,000 iterations; the learning rate drops at 50,000 and 60,000 iterations. During testing, we apply 3D bounding box non-maximum suppression (NMS) to filter out false positives.
DD3D Implementation
Our implementation of DD3D mostly follows the original implementation. We use the depth pre-trained model provided by the authors of DD3D [21] to train our 3D detectors, and the same canonical 3D bounding box size statistics as described for SMOKE. Color jitter, random flipping, and resizing are adopted for data augmentation during training. We train the network with a batch size of 4 on one Tesla V100 for 80,000 iterations; the learning rate drops at 50,000 and 60,000 iterations. We also apply NMS with a 3D IOU criterion to filter false positives.
C.2 RODNet Implementation Details
Most implementation details follow the original RODNet paper [29]. However, since the 3D bounding box annotations are labeled on the LiDAR point clouds, the object centers are not perfectly aligned with the radar reflections. Therefore, we take the intersection between the ego-to-object direction and the object surface, i.e., the nearest point of the object surface from the ego-vehicle, as the ground-truth location of each radar object, as sketched below.
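A hedged BEV sketch of this label adjustment is given below: it intersects the ray from the ego position to the labeled box center with the box's BEV rectangle and keeps the closest hit as the object's ground-truth radar location; the geometry conventions are illustrative.

```python
import numpy as np

def nearest_surface_point(center_xy, length, width, yaw):
    """Return the BEV point where the ego-to-center ray first hits the box boundary."""
    c, s = np.cos(yaw), np.sin(yaw)
    dx = np.array([ 1,  1, -1, -1]) * length / 2
    dy = np.array([ 1, -1, -1,  1]) * width / 2
    corners = np.stack([center_xy[0] + c * dx - s * dy,
                        center_xy[1] + s * dx + c * dy], axis=1)  # (4, 2) box corners
    direction = np.asarray(center_xy) / (np.linalg.norm(center_xy) + 1e-9)
    best_t = np.linalg.norm(center_xy)                # default: the box center itself
    for i in range(4):                                # test the four box edges
        p, q = corners[i], corners[(i + 1) % 4]
        # Solve origin + t * direction = p + u * (q - p) for (t, u).
        A = np.column_stack([direction, p - q])
        if abs(np.linalg.det(A)) < 1e-9:
            continue                                  # edge parallel to the ray
        t, u = np.linalg.solve(A, p)
        if 0 <= u <= 1 and 0 < t < best_t:
            best_t = t
    return best_t * direction                         # nearest surface point in BEV
```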