RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Abstract
Reliable embodied perception from an egocentric perspective is challenging yet essential for the autonomous navigation of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding has become an important research topic for egocentric perceptual tasks related to navigation in crowded and unstructured environments. Owing to complex environmental conditions and heavily truncated or occluded surrounding obstacles, perception capability in this setting remains inferior. To further enhance the intelligence of mobile robots, in this paper we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations and enables a dynamic field of view from the ego perspective, capturing either near or farther areas. Meanwhile, we construct a large-scale multimodal dataset, named RoboSense, to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K frames of synchronized data with 1.4M 3D bounding boxes and IDs annotated in the full view, forming 216K trajectories across 7.6K temporal sequences. It provides substantially more annotations of surrounding obstacles within near ranges than previous datasets collected for autonomous driving scenarios, such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate future research, and provide detailed analysis as well as benchmarks accordingly. Data desensitization measures have been conducted for privacy protection.

1 Introduction
Recent years have witnessed significant progress in the field of autonomous driving, enabling numerous intelligent vehicles to operate on highways and in urban areas. In addition to self-driving cars, social mobile robots have emerged as a new industry tailored to autonomous navigation for typical applications such as tractors, sweepers, retail and delivery. Notably, such intelligent mobile agents usually operate and navigate in crowded and unstructured environments (i.e., campuses, scenic spots, streets, parks and sidewalks, etc.), with varying and uncontrolled natural conditions such as illumination, occlusion and obstruction. To accomplish navigation tasks safely, egocentric perceptual solutions enable these robots to perceive and comprehend the surrounding context from a first-person view, so as to interact successfully with passing pedestrians and vehicles, predict their intentions, and incorporate this information into the agents’ planning and decision-reasoning process.
To evaluate and compare different egocentric perceptual methods fairly, several standardized benchmarks [4, 8, 33, 15, 27, 40] have been proposed in recent years, advancing the development of modern data-driven approaches. KITTI [8] is a pioneering dataset providing multi-modal sensor data, including front-view LiDAR point clouds as well as corresponding stereo images and GPS/IMU data. nuScenes [4] constructs a multi-sensor dataset collected in two cities at an average speed of 16 km/h, where rich collections of 3D boxes and IDs are annotated in the full view. The Waymo Open dataset [33] significantly increases the amount of annotations with a higher annotation frequency. However, the target application domain of existing benchmarks is autonomous driving: the sensor data are captured exclusively from structured roads and highways, with sensor suites installed on top of cars.
To fill the vacancy of egocentric perceptual benchmarks targeting a unique domain related to navigation tasks in crowded and unstructured environments, in this paper we present RoboSense, a novel multimodal dataset with several benchmarks associated with it. Our dataset is collected from diverse social scenarios filled with crowded obstructions, which differs from previously collected datasets used for autonomous driving (e.g., nuScenes [4]). Benefiting from the well time-synchronized multi-sensor data, we hope that RoboSense can facilitate the development of egocentric perceptual frameworks for various types of autonomous navigation agents with controllable cost, not only self-driving cars but also autonomous agents such as social mobile robots. To this end, the data collection robot is equipped with 3 main types of sensors (C: Camera, L: LiDAR, F: Fisheye), and each type consists of 4 devices installed on different sides respectively to ensure the data are captured in full view without blind spots.
Specifically, RoboSense consists of a total of 133K+ frames of synchronized data, spanning 7.6K temporal sequences of 6 main scene classes (i.e., scenic spots, parks, squares, campuses, streets and sidewalks). Moreover, 1.4M 3D bounding boxes together with track IDs are annotated based on 3 different types of sensors, where most targets tend to be close to the robot as shown in Fig. 1. We then form global trajectories for each agent separately by associating the same IDs across consecutive frames and different devices from a Bird’s-Eye View (BEV) perspective. Additionally, we formulate 6 standardized benchmarks for egocentric perceptual tasks as follows: 1. Multi-view 3D Detection; 2. LiDAR 3D Detection; 3. Multi-modal 3D Detection; 4. Multiple 3D Object Tracking (3D MOT); 5. Motion Prediction; 6. Occupancy Prediction. Meanwhile, a multi-task end-to-end training scheme is also supported in RoboSense for the evaluation of joint optimization. In sum, the main contributions of our work are threefold:
- To our best knowledge, our RoboSense is the first dataset tailored to egocentric perceptual tasks related to navigation of autonomous agents in unstructured environments.
- We annotate 1.4M 3D bounding boxes on 133K+ synchronized sensor frames, where most targets are close to the robot. Each target is associated with a unique ID, forming a total of 216K trajectories spread over 7.6K temporal sequences covering 6 main scene classes.
- We formulate 6 standardized benchmarks to facilitate the evaluation and fair comparison of different perceptual solutions related to navigation in built environments.
| Dataset | Year | Size (hr) | Ann. Scenes | Ann. Frames | With Trajectory | Multi-view Overlapping | Sensor Layouts | 3D Boxes (Total) | 3D Boxes† (within 5 m) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KITTI [8] | 2012 | 1.5 | 22 | 15K | ✗ | ✗ | 4C+1L | 80K | 638 |
| Cityscapes [6] | 2016 | - | - | 25K | ✗ | ✗ | 1C | 0 | 0 |
| ApolloScape [15] | 2016 | 2 | - | 144K | ✗ | ✗ | 1L | 70K | 4.7K |
| H3D [25] | 2019 | 0.77 | 160 | 27K | ✗ | ✓ | 3C+1L | 1.1M | - |
| Lyft L5 [16] | 2019 | 2.6 | 366 | 55K | ✓ | ✓ | 7C+3L | 1.3M | - |
| nuScenes [4] | 2019 | 5.5 | 1K | 40K | ✓ | ✓ | 6C+1L | 1.4M | 9.8K |
| Argoverse [5] | 2019 | 0.6 | 113 | 22K | ✓ | ✓ | 9C+2L | 993K | 15K |
| Waymo Open [33] | 2019 | 6.4 | 1K | 200K | ✓ | ✓ | 5C+5L | 12M | 123K |
| BDD100k [42] | 2020 | 1K | 100K | 100K | ✗ | ✗ | 1C | 0 | 0 |
| RoboSense (Ours) | 2024 | 42 | 7.6K | 133K | ✓ | ✓ | 4C+4F+4L | 1.4M | 173K |
2 Related Work
We summarize the compositions of some existing perception and prediction datasets as shown in Tab. 1.
Perception Datasets. Currently released perception datasets can be divided into image-only datasets [42, 6] and multimodal datasets [8, 4, 33, 16, 15]. BDD100k [42] and Cityscapes [6] focus on 2D perception and provide large amounts of 2D annotations (boxes, masks) for driving scene understanding under various weather and illumination conditions. KITTI [8] is the pioneering multimodal dataset which has been widely used for academic research. It records 6 hours of driving data using a LiDAR sensor and a front-facing stereo camera to provide point clouds and images with annotated 3D boxes. The H3D dataset [25] collects a total of 1.1M 3D objects over 27K frames from 160 crowded scenes in the full 360° view. nuScenes [4] and the Waymo Open Dataset [33] are two datasets with a similar structure, with the latter providing more annotations owing to a higher annotation frequency (2 Hz vs. 10 Hz). Different from previously collected datasets used for autonomous driving, the annotation frequency of our RoboSense is even lower (1 Hz) due to the low-speed (less than 1 m/s) moving status of social mobile robots navigating in crowded and unstructured environments.
Prediction Datasets. nuScenes [4] and the Waymo Open Dataset [33] can also be used for the prediction task, as they release lane graphs as well. Lyft [16] introduces traffic/speed control data, and the Waymo Open Dataset [33] adds more signals to the map, such as crosswalks, lane boundaries, stop signs and speed limits. Recently, the Shifts dataset [24] has become the largest forecasting dataset with the most scenario hours to date. Meanwhile, Argoverse [5] is also a large-scale dataset with a high data frequency (10 Hz) and high scenario quality for motion forecasting across 6 cities. Together, these datasets have enabled the exploration of multi-actor, long-range motion forecasting leveraging both static and dynamic maps.
Generally, our dataset differs in three substantial ways: 1) It targets a unique domain related to navigation tasks in crowded and unstructured environments, which is more difficult than autonomous driving scenarios in terms of the complexity of the environmental context and the diversity of surrounding obstructions. 2) In addition to 3D bounding box and trajectory annotations, our dataset also provides high-quality occupancy descriptions for each collected scene, supporting the occupancy prediction task around social robots for safe navigation. 3) Our dataset is mostly collected in crowded social scenes, where pedestrians and cars tend to be close to the robot, yielding a distance distribution with a mode at approximately 5 m, which is quite different from existing datasets for autonomous cars as shown in Fig. 2. Besides, egocentric perceptual tasks in this setting are more challenging due to frequent occlusion and truncation.

3 RoboSense Open Dataset
We commence with the sensor setup and data acquisition details, then delineate the coordinate systems and label generation process, and finally present data statistics.
3.1 Sensor Setup and Data Acquisition
Sensor setup. We use a social mobile robot (i.e., a robosweeper) as the data collection platform, which is equipped with different sensors installed on different sides of the robot to ensure data is captured over the full 360° horizontal view without blind spots, including LiDAR, Camera, Fisheye, GPS/IMU and Ultrasonic sensors. Refer to Fig. 3 for the sensor layout and Tab. 3 for detailed sensor specifications.
Data acquisition. We utilize the mobile robot to collect data around Dishui Lake in Shanghai, China, lasting 42 h in total at an average speed of less than 1 m/s through manual remote control. 22 different places are travelled, which can be categorized into 6 main kinds of outdoor or semi-closed social scenarios (i.e., scenic spots, parks, squares, campuses, streets and sidewalks). After data collection, we manually select and process 7,619 representative scenes of about 20 s duration each for further annotation, covering various natural conditions (i.e., weather and illumination) and diverse environmental backgrounds and obstructions (i.e., motion, amount, type, occlusion, truncation).
3.2 Coordinate Systems
Ego-Vehicle Coordinate. The Ego-Vehicle Coordinate System is centered at the rear axle of the vehicle. The positive directions of the X, Y, and Z axes correspond to the forward, leftward, and upward directions of the vehicle, respectively. The Ego-Vehicle Coordinate System is the one most frequently used in tasks such as perception, tracking, prediction, and planning, where dynamic and static targets as well as trajectories are transformed into this coordinate system.
Global Coordinate. To transform the dynamic and static elements from historical and future frames into the current frame coordinate system, we need to establish a global coordinate system to record the position and orientation of the ego vehicle in each frame. The origin of the Global Coordinate System is an arbitrarily defined point in Shanghai Lingang, China, and the positive directions of the X, Y, and Z axes follow the definition of the North-East-Up coordinate.
LiDAR Coordinate. The LiDAR Coordinate System is defined based on the Hesai LiDAR installed on top of the vehicle; the positive directions of the X, Y, and Z axes follow the definition of the Ego-Vehicle Coordinate System.
Camera Coordinate. The RoboSweeper is equipped with four fisheye cameras and four pinhole cameras. The origin of the Camera Coordinate System for both camera types is the optical center. However, the positive directions of the coordinate axes are defined differently in the RoboSense dataset. In the fisheye coordinate system, the X, Y, and Z axes point directly downward, to the right, and backward from the optical center, respectively. In contrast, in the pinhole coordinate system, these axes point to the right, downward, and forward from the optical center, respectively.
Pixel Coordinate. The image is presented in the form of pixels, and each pixel corresponds to a 2D pixel coordinate. The origin of the Pixel Coordinate System is the upper-left corner of the image. Points in the 3D Camera Coordinate System can be mapped to the Pixel Coordinate System through the camera projection.
3.3 Ground Truth Labels
After integrating, synchronizing and calibrating the multi-sensor raw data, we annotate keyframes (LiDAR, image) at a frequency of 1 Hz due to the robot's low-speed motion.
3D object. For the selected scenes of the collected RoboSense dataset, we annotate 3D object boxes of 3 movable classes (i.e., “Vehicle”, “Cyclist” and “Pedestrian”) for each sampled keyframe, in both the LiDAR coordinate of point clouds and the Camera coordinate of multi-view images respectively. Each annotated 3D box can be represented as $(x, y, z, w, l, h, \theta, c)$, where $(x, y, z)$ indicates the 3D position of a regular object, and $(w, l, h)$ represents the scale information including width, length and height. $\theta$ and $c$ correspond to the orientation (specifically the yaw angle) and the object class respectively. A three-stage auto-labelling pipeline is detailed in the supplementary material (see Sec. B.2).
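To make the annotation format concrete, the sketch below shows one way such a box record could be represented in code; the field names are illustrative only and do not reflect the official schema of the released annotation files.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Illustrative container for one RoboSense 3D box annotation.

    Field names are hypothetical; the released files may use a different schema.
    """
    x: float        # center position along X (m), in LiDAR or Camera coordinates
    y: float        # center position along Y (m)
    z: float        # center position along Z (m)
    w: float        # width (m)
    l: float        # length (m)
    h: float        # height (m)
    yaw: float      # orientation around the vertical axis (rad)
    category: str   # "Vehicle", "Cyclist" or "Pedestrian"
    track_id: int   # unique ID linking the same agent across keyframes

# Example usage with made-up values:
box = Box3D(x=3.2, y=-1.1, z=0.4, w=0.7, l=1.8, h=1.7,
            yaw=0.15, category="Pedestrian", track_id=42)
```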
Trajectory. To facilitate temporal tasks such as multi-object tracking and motion forecasting described in Sec. 4, we assign a unique Track ID to each agent across a temporal sequence in the Bird's-Eye View (BEV) of the Ego-Vehicle coordinate. Furthermore, agents with the same Track ID within a sequence are linked together to form object trajectories.

Occupancy label. In addition to the 3 typical classes of moving objects on roads which are annotated temporally as above, there also exists a rich collection of static obstacles with irregular shapes, especially in the complex scenarios (i.e., parks, campuses and squares, etc.) of RoboSense. To describe the environment in the surrounding camera views in detail for driving safety, we voxelize the 3D space and generate high-quality, dense occupancy labels to represent the voxel states. Similar to previous occupancy benchmarks [36, 35] built upon public datasets [4, 33], we segment dynamic objects and static scenes along the temporal dimension based on the annotated 3D boxes and trajectories. Then sparse LiDAR points inside each box are extracted from the $(t{-}N)$-th to the $(t{+}N)$-th frames respectively, where $t$ indicates the index of the current keyframe, and $N$ is set to 10 empirically. Refer to the supplementary material for more details of the occupancy label generation process (see Sec. B.3).
4 Tasks & Metrics
Both egocentric perceptual tasks and prediction tasks are supported in our RoboSense dataset and benchmark.
4.1 Perception
4.1.1 3D Object Detection
The RoboSense 3D detection task requires detecting 3D bounding boxes of three main classes (i.e., “Vehicle”, “Pedestrian” and “Cyclist”), including position, size, orientation and category. Following the conventions in [9, 4, 33], we adopt mAP (mean Average Precision), AOS (Average Orientation Similarity) and ASE (Average Scale Error) to measure the performance of different detectors.
There are several matching criteria to define a true positive for Average Precision (AP) calculation. For example, [9] adopts 3D Intersection-over-Union (IoU) to match each prediction with a ground-truth box, while [4] defines a match by thresholding the 2D center distance on the Bird's-Eye-View ground plane. For the RoboSense detection task, we also adopt a similar distance measure. Differently, we define the threshold as a relative proportion of the ground-truth closest collision-point distance from the ego-vehicle (Closest-Collision Distance Proportion, CCDP), rather than the absolute Center Distance (CD) adopted in [4]. We claim that the localization accuracy of the closest collision points of near obstacles is more important in low-speed driving scenarios. AP is then calculated as the normalized area under the precision-recall curve [7]. Finally, mAP is obtained by averaging over the set of classes $\mathbb{C}$ and matching thresholds $\mathbb{D}$:
$$\mathrm{mAP} = \frac{1}{|\mathbb{C}|\,|\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} \mathrm{AP}_{c,d} \qquad (1)$$
In addition to AP, we also measure AOS and ASE for each matched true positive, which represent the precision of predicted yaw angle and object scale respectively. AOS (Average Orientation Similarity) is formulated as:
$$\mathrm{AOS} = \frac{1}{|R|} \sum_{r \in R} s(r) \qquad (2)$$

$$s(r) = \frac{1}{|\mathrm{TP}(r)|} \sum_{i \in \mathrm{TP}(r)} \frac{1 + \cos \Delta\theta_i}{2} \qquad (3)$$

where $R$ indicates the recall range interpolated with 40 points, $\mathrm{TP}(r)$ indicates the set of matched true positives at recall $r$, and $\Delta\theta_i$ denotes the angle difference between sample $i$ and its matched ground truth. Different from [33], we only consider true positive samples under each recall level, rather than all predicted positives.
ASE is defined as $1 - \mathrm{IoU}$, which measures the scale error by calculating the 3D IoU after aligning the orientation and translation of predictions with the ground truth.
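As an illustration of the detection metric above, the sketch below approximates the CCDP criterion by using the closest BEV corner of a box as its collision point and then computes the normalized area under the precision-recall curve; the exact collision-point definition, the assignment of predictions to ground truths, and the interpolation scheme of the official toolkit may differ.

```python
import numpy as np

def ccp_distance(corners_xy):
    """Distance from the ego origin (0, 0) to the closest BEV corner of a box,
    used here as a proxy for the closest collision point."""
    return float(np.linalg.norm(np.asarray(corners_xy, dtype=float), axis=-1).min())

def is_ccdp_match(pred_ccp, gt_ccp, ratio):
    """True positive test: the error of the predicted closest collision-point
    distance must stay within a relative proportion `ratio` of the
    ground-truth CCP distance (ratio values are assumptions)."""
    return abs(pred_ccp - gt_ccp) <= ratio * gt_ccp

def average_precision(scores, is_tp, num_gt):
    """Normalized area under the precision-recall curve for one class and one
    matching threshold; mAP averages this value over classes and thresholds."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp_flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # simple trapezoidal integration; official toolkits often interpolate
    return float(np.trapz(precision, recall))
```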
4.1.2 Multi-Object Tracking
The tracking task is designed to associate all detected 3D boxes of movable object classes across input multi-view temporal sequences (i.e., videos or point cloud sequences). Each object is assigned a unique and consistent track ID from its first appearance until it completely vanishes. For performance evaluation, we refer to [4, 9, 22, 32] and mainly adopt sAMOTA (Scaled Average Multi-Object Tracking Accuracy) and AMOTP (Average Multi-Object Tracking Precision) to measure 3D tracking performance.
Formally, sAMOTA is defined as the mean value of sMOTA over all recalls:
$$\mathrm{sAMOTA} = \frac{1}{|R|} \sum_{r \in R} \mathrm{sMOTA}_r \qquad (4)$$

$$\mathrm{sMOTA}_r = \max\!\left(0,\; 1 - \frac{\mathrm{FP}_r + \mathrm{FN}_r + \mathrm{IDS}_r - (1-r)\,n_{gt}}{r\,n_{gt}}\right) \qquad (5)$$

where $\mathrm{FP}_r$, $\mathrm{FN}_r$ and $\mathrm{IDS}_r$ represent the number of false positives (wrong detections), false negatives (missed detections) and identity switches at the corresponding recall $r$, respectively, and $n_{gt}$ is the number of ground-truth objects. Similarly, AMOTP is the average of MOTP over different recalls, which can be defined as:

$$\mathrm{AMOTP} = \frac{1}{|R|} \sum_{r \in R} \frac{\sum_{i,t} d_{i,t}}{\mathrm{TP}_r} \qquad (6)$$

where $\mathrm{TP}_r$ is the number of true positives at recall $r$, and $d_{i,t}$ denotes the position error of matched track $i$ at timestamp $t$. Besides, additional metrics such as MT (Mostly Tracked) and ML (Mostly Lost) [3] are also reported.
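A minimal sketch of Eqs. (4)-(5), assuming the per-recall error counts have already been produced by a matcher; it mirrors the sMOTA formulation of [39] but is not the released evaluation code.

```python
def smota(fp, fn, ids, num_gt, recall):
    """Scaled MOTA at recall r (Eq. 5), clipped to [0, 1]."""
    value = 1.0 - (fp + fn + ids - (1.0 - recall) * num_gt) / (recall * num_gt)
    return max(0.0, min(1.0, value))

def samota(stats_per_recall, num_gt):
    """sAMOTA (Eq. 4): mean of sMOTA over the evaluated recall levels.

    stats_per_recall maps a recall value r to a tuple (FP_r, FN_r, IDS_r).
    """
    values = [smota(fp, fn, ids, num_gt, r)
              for r, (fp, fn, ids) in sorted(stats_per_recall.items())]
    return sum(values) / len(values)

# Example with hypothetical error counts at three recall levels:
print(samota({0.2: (5, 80, 1), 0.5: (20, 50, 3), 0.8: (60, 20, 6)}, num_gt=100))
```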
4.2 Prediction
4.2.1 Motion Forecasting
Based on perception results, the motion forecasting task requires predicting agents' future trajectories. Specifically, plausible trajectories over future timesteps for each agent are forecasted as offsets to the agent's current position. Following the standard protocols [20, 21, 26, 10], we adopt minADE (minimum Average Displacement Error), minFDE (minimum Final Displacement Error), MR (Miss Rate) and EPA (End-to-end Prediction Accuracy) as metrics to measure the precision of motion prediction. In order to decouple the accuracy of perception and prediction, these metrics are only calculated for matched TPs (True Positives), where the matching threshold is set to a relative proportion of the ground-truth distance of the closest collision point from the ego-vehicle. A miss threshold on minFDE is used to determine misses when calculating the MR metric.
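For illustration, the sketch below computes minADE, minFDE and MR for matched agents with K predicted modes; the 2.0 m miss threshold is an assumed placeholder rather than the value used by the benchmark.

```python
import numpy as np

def min_ade_fde(pred_trajs, gt_traj):
    """minADE / minFDE over K predicted future trajectories for one agent.

    pred_trajs: (K, T, 2) array of predicted BEV positions.
    gt_traj:    (T, 2) array of ground-truth future positions.
    """
    errors = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1)  # (K, T)
    ade = errors.mean(axis=1)   # average displacement per mode
    fde = errors[:, -1]         # final displacement per mode
    return float(ade.min()), float(fde.min())

def miss_rate(min_fdes, miss_threshold=2.0):
    """Fraction of matched agents whose best final displacement exceeds the
    miss threshold (2.0 m is an assumption, not the official value)."""
    min_fdes = np.asarray(min_fdes, dtype=float)
    return float((min_fdes > miss_threshold).mean())
```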
4.2.2 Occupancy Prediction
The goal of the occupancy prediction task is to estimate the state of each voxel in the 3D space. Formally, a sequence of historical frames with surround-view camera images serves as input. Besides, sensor intrinsic parameters together with extrinsic parameters for each frame are also provided. The ground truth labels describe the voxel states, including the occupancy state and the semantic label. Three occupancy states are considered in the RoboSense dataset: “occupied”, “free” and “unknown”. The semantic label of each voxel can be one of the 3 predefined object categories or an “unknown” class indicating general objects. Furthermore, each voxel can also be equipped with extra attributes as outputs, such as instance IDs and motion vectors, which are left as future work.
To evaluate the quality of the predicted occupancy, we measure whole-scene voxel segmentation results using the IoU metric for each class. Considering the low-speed driving scenarios, we evaluate the metric under different ranges around the ego vehicle in both 3D and BEV space. Finally, mIoU is obtained by averaging over the 4 classes. Moreover, evaluation is only performed on the voxels visible from the camera views.
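As a sketch of this range-restricted evaluation, the functions below compute a per-class IoU over camera-visible voxels whose BEV distance falls within a given range and average it into mIoU; the array layouts and the visibility-mask format are assumptions, not the released protocol.

```python
import numpy as np

def masked_iou(pred, gt, cls, mask):
    """IoU of one semantic class over the voxels selected by `mask`."""
    p = (pred == cls) & mask
    g = (gt == cls) & mask
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union > 0 else float("nan")

def range_miou(pred, gt, voxel_xy_dist, visible, classes, r_min, r_max):
    """mIoU over `classes`, restricted to camera-visible voxels whose BEV
    distance from the ego-vehicle lies in [r_min, r_max)."""
    mask = visible & (voxel_xy_dist >= r_min) & (voxel_xy_dist < r_max)
    ious = [masked_iou(pred, gt, c, mask) for c in classes]
    return float(np.nanmean(ious))
```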
5 Experiments
| Scene-ID | Day | Night | Scene Ratio | Train | Test | Val | Num of Sequences | Num of Frames | Num of 3D Boxes | Num of Trajectories |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S-1 | 56% | 44% | 20% | 50% | 40% | 10% | 1.5K | 26K | 310K | 36K |
| S-2 | 69% | 31% | 30% | | | | 2.3K | 42K | 293K | 37K |
| S-3 | 71% | 29% | 17% | | | | 1.2K | 22K | 284K | 64K |
| S-4 | 83% | 17% | 7% | | | | 0.5K | 9K | 144K | 22K |
| S-5 | 70% | 30% | 20% | | | | 1.6K | 26K | 297K | 44K |
| S-6 | 22% | 78% | 6% | 0% | 100% | 0% | 0.5K | 8K | 88K | 13K |
| Total | 65% | 35% | 100% | 46% | 44% | 10% | 7.6K | 133K | 1.4M | 216K |
5.1 Benchmark Setup
Our RoboSense dataset contains 7.6K sequences (including 133K annotated frames) of synchronized multi-sensor data, covering 6 main categories (across 22 different locations) of outdoor or semi-closed scenarios (i.e., S1-parks, S2-scenic spots, S3-squares, S4-campuses, S5-sidewalks and S6-streets). To protect data privacy, we conduct a series of data desensitization measures by masking human faces, car plates and road signs in all sensor data. The details of the RoboSense dataset composition and partitioning are listed in Tab. 2. The RoboSense dataset is collected under various illumination, traffic flow and weather conditions to ensure the diversity of static backgrounds and movable obstacles, thus meeting the demands of different realistic applications.
The RoboSense dataset is divided into three parts with a ratio of 50%, 40% and 10% for training, testing and validation respectively. As for the scene partition, one of the 6 collected scenes (i.e., S-6) is assigned to the testing set exclusively, while the remaining scenes are shared among all splits. Ground truth labels of the training and validation sets are provided for each task, together with the synchronized multi-sensor raw data. The testing set, however, only provides raw data; results on the testing set can therefore only be evaluated by submitting algorithm outputs to our online benchmark for the corresponding task.
5.2 Sensor Specifications
The detailed specifications of all devices are shown in Tab. 3. To cover areas from near to far, we select cameras with different focal lengths and fields of view (FOV). Besides, 5 LiDAR sensors are installed on our data collection robot, where the top Hesai Pandar40M serves as an auto-labeller providing initial annotations for the spliced points of the other LiDARs. 11 ultrasonic sensors are also installed for freespace detection to ensure safety. All devices are synchronized in time via the Network Time Protocol (NTP) before data collection. We use a time interval of 100 ms as the global timestamp and match the frame from each device with the nearest timestamp adjacent to the global timestamp. This process ultimately yields synchronized multi-sensor data at a frame rate of 10 FPS.
| Modality | Sensor | Details |
| --- | --- | --- |
| Camera | 4 Camera | RGB, 25 Hz, 1920×1080, FOV: - |
| | 4 Fisheye | RGB, 25 Hz, 1280×720, FOV: - |
| LiDAR | Hesai Pandar40M | 64 beams, 10 Hz, 384k pps, FOV: - |
| | 3 Zvision ML30s | 40 beams, 10 Hz, 720k pps, FOV: - |
| | Livox Horizon | 40 beams, 10 Hz, 720k pps, FOV: - |
| Ultrasonics | 3 LRU | STP-313, 1 m-10 m, 40 kHz |
| | 8 SRU | STP-318, 5 cm-200 cm, 40 kHz |
| Localization | GPS & IMU | GPS, IMU, AHRS; heading, roll/pitch; 20 mm RTK positioning; 1000 Hz update rate |
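To illustrate the nearest-timestamp matching described above, the following sketch aligns per-device frame timestamps to a 100 ms global clock; it is a simplified stand-in for the actual synchronization pipeline, and the function and variable names are illustrative.

```python
import numpy as np

def sync_to_global_clock(device_timestamps, interval=0.1):
    """Match each device's frames to a 100 ms global clock.

    device_timestamps: dict mapping sensor name -> sorted 1D array of
    timestamps in seconds. Returns, for each global tick, the index of the
    nearest frame from every sensor.
    """
    start = max(ts[0] for ts in device_timestamps.values())
    end = min(ts[-1] for ts in device_timestamps.values())
    global_ticks = np.arange(start, end, interval)

    synced = []
    for tick in global_ticks:
        frame = {}
        for name, ts in device_timestamps.items():
            frame[name] = int(np.argmin(np.abs(ts - tick)))  # nearest-timestamp match
        synced.append(frame)
    return global_ticks, synced
```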
5.3 Implementation Details
For the LiDAR detection task, we set the point range to x ∈ [-45, 45] m, y ∈ [-45, 45] m, z ∈ [-1, 4] m, with a fixed voxel size of 0.16 m and 0.05 m for pillar-based and voxel-based methods respectively. For image detection tasks, we use ResNet18 [11] as the backbone network and resize the input images to a fixed resolution. For practical usage, we report performance using our proposed Closest-Collision Distance Proportion (CCDP) as the matching criterion. Comparisons of different matching functions in terms of average precision are shown in Fig. 4. As expected, when using Center Distance (CD) or IoU, matching objects without distance differentiation cannot reflect the model's capability of locating the closest collision points of nearby obstacles, which is more challenging and essential for low-speed driving scenarios.
| Task | Method | Vehicle@5%/10%: 3D AP | AOS | ASE | Cyclist@5%/10%: 3D AP | AOS | ASE | Pedestrian@5%/10%: 3D AP | AOS | ASE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiDAR 3D Detection | PointPillar [17] | 72.5/53.0 | 73.5/61.1 | 20.6/16.1 | 44.2/32.8 | 45.4/38.3 | 64.2/54.3 | 62.7/38.2 | 45.3/34.1 | 38.3/27.2 |
| | SECOND [41] | 78.8/63.1 | 80.2/69.4 | 19.8/15.7 | 53.8/43.5 | 57.2/49.9 | 67.7/55.7 | 70.8/47.2 | 54.6/43.2 | 40.1/29.3 |
| | PVRCNN [31] | 74.6/57.4 | 77.4/67.7 | 16.4/15.4 | 53.6/41.4 | 55.7/50.1 | 62.5/61.9 | 66.4/39.1 | 50.1/37.0 | 40.4/25.5 |
| | Transfusion-L [2] | 83.6/65.1 | 84.5/73.8 | 19.7/16.0 | 59.7/47.0 | 78.0/70.8 | 82.1/72.9 | 72.3/42.8 | 60.5/48.7 | 45.1/37.4 |
| Multi-view 3D Detection | BEVDet [14] | 76.2/30.2 | 40.4/25.9 | 17.3/11.2 | 42.3/25.7 | 36.1/30.2 | 56.5/42.1 | 47.4/28.5 | 48.6/36.5 | 30.2/18.8 |
| | BEVDet4D [13] | 77.2/31.1 | 41.1/26.4 | 16.8/10.8 | 42.0/24.8 | 33.9/27.7 | 55.3/41.2 | 48.1/29.3 | 46.6/37.6 | 27.5/21.3 |
| | BEVDepth [18] | 77.8/31.3 | 40.9/26.3 | 16.7/10.7 | 43.3/27.0 | 34.9/30.2 | 52.2/46.6 | 50.1/31.3 | 46.7/37.9 | 28.0/21.4 |
| | BEVFormer [19] | 78.2/32.0 | 41.6/26.7 | 16.5/10.6 | 44.1/27.6 | 34.9/30.5 | 51.3/44.3 | 50.2/32.3 | 46.3/38.0 | 28.1/17.9 |
| Task | Detector | Layouts | Metric | [0, 5] m | [5, 10] m | [10, 30] m | sAMOTA | AMOTP | MT | ML |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multi-view 3D Perception | BEVDepth [18] | 4C | 3D AP | 54.9/16.0 | 60.1/18.3 | 53.7/33.1 | 44.03 | 29.95 | 20.23 | 54.01 |
| | | | AOS | 44.8/19.7 | 37.0/18.8 | 34.5/26.9 | | | | |
| | | 4F | 3D AP | 61.1/16.9 | 70.6/19.9 | 50.8/29.0 | 39.56 | 27.10 | 18.02 | 61.74 |
| | | | AOS | 58.7/27.5 | 41.3/23.5 | 36.1/27.4 | | | | |
| | | 4C + 4F | 3D AP | 68.9/20.5 | 75.2/22.9 | 64.2/38.6 | 51.16 | 35.68 | 25.21 | 48.07 |
| | | | AOS | 53.9/24.4 | 43.1/22.5 | 39.6/30.9 | | | | |
| LiDAR 3D Perception | PointPillar [17] | 4L | 3D AP | 59.2/19.3 | 73.1/42.0 | 71.0/65.4 | 44.77 | 33.65 | 25.04 | 54.08 |
| | | | AOS | 46.5/19.2 | 67.2/47.5 | 69.0/65.7 | | | | |
| Multi-modal 3D Perception | BEVDepth [18] + PointPillar [17] | 8V + 4L | 3D AP | 61.3/36.9 | 61.3/54.6 | 54.4/52.6 | 43.32 | 43.18 | 34.74 | 40.82 |
| | | | AOS | 64.8/49.6 | 78.7/75.0 | 79.4/78.4 | | | | |

5.4 Baselines: Perception
5.4.1 LiDAR 3D Detection
To demonstrate the performance of advanced 3D detectors on the LiDAR-only detection track of our RoboSense benchmark, we implement several popular CNN-based methods of different fashions, including PointPillar [17] (pillar-based), SECOND [41] (voxel-based), and PV-RCNN [31] (two-stage point-voxel based). Besides, a Transformer-based method, Transfusion-L [2], is also implemented for architecture comparison. PointPillar, as the most efficient method above, is adopted as our baseline for the LiDAR 3D detection task.
5.4.2 Multi-View 3D Detection
Current works on multi-view 3D detection can be divided into two mainstreams, namely LSS-based [28] and Transformer-based. To examine the effectiveness of image-only multi-view 3D detection models, we select the widely used BEVDet [14] as our LSS-based baseline on the image 3D detection track of RoboSense, and re-implement several extended versions such as BEVDet4D [13], which takes advantage of historical temporal clues, and BEVDepth [18], which adopts an additional branch for depth prediction under point supervision. Besides, BEVFormer [19], a representative Transformer-based work, is also included.
5.4.3 Multiple Object Tracking
We follow the “Tracking-by-Detection” paradigm using 3D detection results from camera or LiDAR data as input respectively, and present several baselines for the multiple 3D object tracking task. Specifically, 3D boxes detected from surround-view images by BEVDepth [18] and from spliced point clouds by PointPillar [17] are provided separately. The tracking approach AB3DMOT [39] is picked to serve as the baseline multiple-object tracker in 3D space. The same objects across different sensors are then associated with unique track IDs to form global trajectories over past frames.
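For illustration, the snippet below sketches one association step of a generic tracking-by-detection pipeline using BEV center distance and Hungarian matching; it is a simplified stand-in for the AB3DMOT-style baseline, whose exact state estimation and matching details differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, det_centers, max_dist=2.0):
    """Match existing track centers (M, 2) to new detection centers (N, 2)
    by minimizing total BEV center distance; pairs farther than `max_dist`
    (an assumed gating value) are rejected."""
    track_centers = np.asarray(track_centers, dtype=float)
    det_centers = np.asarray(det_centers, dtype=float)
    if len(track_centers) == 0 or len(det_centers) == 0:
        return [], list(range(len(track_centers))), list(range(len(det_centers)))

    cost = np.linalg.norm(track_centers[:, None] - det_centers[None], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(track_centers)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_centers)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would typically spawn new track IDs, while unmatched tracks are kept alive for a few frames before being terminated.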
5.5 Baselines: Prediction
5.5.1 Motion Prediction
Traditional motion prediction methods utilize perception ground truth (i.e., historical trajectories of agents and HD maps) as input, which lacks uncertainty modeling in practical applications. In this paper, we implement several vision-based end-to-end methods for joint perception and motion prediction on the RoboSense benchmark, including ViP3D [10] and PnPNet [20]. For comparison, we also report motion prediction results obtained by assuming that agents surrounding the ego-vehicle keep constant positions or velocities respectively, so as to reflect the diversity and difficulty of our dataset for the prediction task.
5.5.2 Occupancy Prediction
We extend a BEV 3D detection model, BEVDepth [18], to the 3D occupancy prediction task, which is then adopted as our baseline for visual occupancy prediction. Concretely, we replace the original detection decoders with occupancy reconstruction layers while keeping the BEV feature encoders. ResNet18 [11] pretrained on FCOS3D [38] is employed as the image backbone for visual feature extraction.
5.6 Results and Analysis
5.6.1 Perception Results
3D Object Detection. The 3D detection results based on multi-view images and spliced point clouds are shown in Tab. 4. For LiDAR 3D detection, Transfusion-L [2] achieves the leading performance owing to its advanced transformer architecture. In terms of multi-view 3D detection, BEVDet4D [13] and BEVDepth [18] obtain significant improvements over BEVDet [14] by involving temporal clues and adopting an additional depth branch respectively. Besides, BEVFormer [19] also achieves competitive results by introducing a query-based attention mechanism. Generally, LiDAR-based 3D detectors generate higher-quality detection results than vision-based methods. However, vision-based methods are capable of detecting various ranges of objects with more sensors (Fisheye or Camera). Note that two different matching criteria are both considered for TP calculation, namely Center-Point (CP) distance and Closest Collision-Point (CCP) distance. It can be observed that the CCP localization performance is obviously lower than the CP localization (i.e., an 18.5% 3D AP drop of Transfusion-L for the Vehicle class and a 29.5% 3D AP drop for the Pedestrian class). For navigation safety, CCP localization is more important for near-field egocentric perception in crowded social scenarios.
Performance with Different Sensor Layouts. To evaluate the performance of different sensor layouts under various ranges, we conduct extensive comparisons as shown in Tab. 5. For visual perception, the 4C layout achieves better AP than the 4F layout in farther areas (i.e., 10-30 m), while the 4F layout is better at detecting near-field targets within 10 m. By combining these two layouts, better performance can be achieved across different ranges. The LiDAR 3D detector exhibits an obvious advantage over visual detectors, especially in CCP and farther object localization, while its performance on near-field objects within 5 m is inferior (19.3% vs. 20.5%). Moreover, we implement multi-modal 3D perception (8V+4L) through a late-fusion strategy. Specifically, the 3D detection results from the multi-view 3D detector and the LiDAR 3D detector are combined in post-processing. We observe that the CCP-based 3D AP of objects within 5 m is remarkably boosted from 20.5% to 36.9%, and the AOS metric is also increased consistently.
Multiple Object Tracking. Regarding the MOT task in Tab. 5, AB3DMOT [39] is adopted as the baseline tracker in 3D space, which mitigates the impact of object occlusions present in 2D images, especially in crowded scenarios. By introducing more sensors (4C + 4F), vision-based methods can also achieve tracking performance competitive with LiDAR-based methods, and even better in the sAMOTA metric (51.16 vs. 44.77). With multi-modal input, AMOTP, MT and ML performance can be further improved as expected. However, even when equipped with multi-modal and multi-sensor data as input, the perception performance is still inferior, especially in the near field (i.e., 36.9% CCP-based 3D AP within 5 m), revealing the deficiencies of current perception methods in handling obstacles at near ranges. The main reason may be the frequent truncation and occlusion caused by the large view occupation of near obstacles, which showcases the great challenge and importance of our proposed benchmark for the development of egocentric perceptual frameworks related to navigation in crowded and unstructured environments.
| Range (m) | mIoU-3D | mIoU-BEV |
| --- | --- | --- |
| - | 24.6 | 29.7 |
| - | 39.6 | 48.2 |
| - | 30.7 | 36.7 |
| - | 16.1 | 19.7 |
5.6.2 Prediction Results
Motion forecasting of surrounding agents as well as occupancy state descriptions around the ego-vehicle are two crucial prediction tasks in the research field of autonomous driving, which have been extensively explored in urban and highway scenarios for autonomous cars.
Motion Prediction. As shown in Tab. 6, both visual end-to-end methods [10] and LiDAR-based end-to-end methods [20] are supported for validation on RoboSense. PnPNet [20], with LiDAR points as input, produces lower prediction errors and better EPA than ViP3D [10], and both remarkably outperform the two baseline settings that model agents with constant positions or velocities.
Occupancy Prediction. As shown in Tab. 7, we use the 4F sensor data as input and report the mIoU metric in both 3D and BEV space under various ranges. Note that the metric is calculated without considering the states of ground voxels, leading to lower performance in both 3D and BEV space. As expected, the performance evaluated within 2 m is better than that in farther areas.
6 Conclusion
To foster research on egocentric perceptual frameworks tailored to various types of autonomous agents navigating in crowded and unstructured environments, we collect RoboSense, a real-world multi-modal dataset captured in complex social scenarios with varying and uncontrolled environmental conditions and dynamic elements. It consists of 7.6K scenes manually selected from different locations, with 1.4M 3D boxes and 216K trajectories annotated in total on 133K synchronized frames. Besides, occupancy descriptions are also provided to facilitate surrounding context comprehension. In future work, more tasks and associated benchmarks, such as motion planning, will be added for end-to-end autonomous navigation applications, and we will explore the additional benefits that joint optimization can bring over modular training.
References
- Agro et al. [2023] Ben Agro, Quinlan Sykora, Sergio Casas, and Raquel Urtasun. Implicit occupancy flow fields for perception and prediction in self-driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1379–1388, 2023.
- Bai et al. [2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1090–1099, 2022.
- Bernardin and Stiefelhagen [2008] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
- Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- Chang et al. [2019] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8748–8757, 2019.
- Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.
- Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
- Gu et al. [2023] Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5496–5506, 2023.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
- Huang and Huang [2022] Junjie Huang and Guan Huang. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
- Huang et al. [2021] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- Huang et al. [2018] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 954–960, 2018.
- Kesten et al. [2019] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft level 5 av dataset 2019. https://level5.lyft.com/dataset/, 2019.
- Lang et al. [2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
- Li et al. [2023] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1477–1485, 2023.
- Li et al. [2022] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.
- Liang et al. [2020] Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020.
- Luo et al. [2018] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
- Luo et al. [2021] Wenhan Luo, Junliang Xing, Anton Milan, Xiaoqin Zhang, Wei Liu, and Tae-Kyun Kim. Multiple object tracking: A literature review. Artificial intelligence, 293:103448, 2021.
- Ma et al. [2024] Cong Ma, Lei Qiao, Chengkai Zhu, Kai Liu, Zelong Kong, Qing Li, Xueqi Zhou, Yuheng Kan, and Wei Wu. Holovic: Large-scale dataset and benchmark for multi-sensor holographic intersection and vehicle-infrastructure cooperative. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22129–22138, 2024.
- Malinin et al. [2021] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
- Patil et al. [2019] Abhishek Patil, Srikanth Malla, Haiming Gang, and Yi-Ting Chen. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In 2019 International Conference on Robotics and Automation (ICRA), pages 9552–9557. IEEE, 2019.
- Peri et al. [2022] Neehar Peri, Jonathon Luiten, Mengtian Li, Aljoša Ošep, Laura Leal-Taixé, and Deva Ramanan. Forecasting from lidar via future object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17202–17211, 2022.
- Pham et al. [2020] Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, and Jie Lin. A* 3d dataset: Towards autonomous driving in challenging environments. In 2020 IEEE International conference on Robotics and Automation (ICRA), pages 2267–2273. IEEE, 2020.
- Philion and Fidler [2020] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
- Scaramuzza et al. [2006] Davide Scaramuzza, Agostino Martinelli, and Roland Siegwart. A flexible technique for accurate omnidirectional camera calibration and structure from motion. In Fourth IEEE International Conference on Computer Vision Systems (ICVS’06), pages 45–45. IEEE, 2006.
- Sharp et al. [2002] Gregory C Sharp, Sang W Lee, and David K Wehe. Icp registration using invariant features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):90–102, 2002.
- Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10529–10538, 2020.
- Sun et al. [2020a] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020a.
- Sun et al. [2020b] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020b.
- Tang et al. [2024] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.
- Tian et al. [2024] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36, 2024.
- Tong et al. [2023] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.
- Wang et al. [2024] Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Occgen: Generative multi-modal 3d occupancy prediction for autonomous driving. arXiv preprint arXiv:2404.15014, 2024.
- Wang et al. [2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
- Weng et al. [2020] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics. arXiv e-prints, 2020.
- Xiao et al. [2021] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101. IEEE, 2021.
- Yan et al. [2018] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
- Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
Supplementary Material
Appendix A Coordinates Transformation
A.1 LiDAR ↔ Ego-Vehicle
LiDAR to Ego-Vehicle: Let $P_v = (x_v, y_v, z_v)^{\top}$ represent a three-dimensional coordinate point in the Ego-Vehicle Coordinate System. The transformation from the coordinates $P_l = (x_l, y_l, z_l)^{\top}$ in the LiDAR Coordinate System to $P_v$ in the Ego-Vehicle Coordinate System is calculated as follows:

$$P_v = R_{l}^{v} P_l + t_{l}^{v} \qquad (1)$$

where $R_{l}^{v}$ and $t_{l}^{v}$ represent the rotation and translation from the LiDAR Coordinate System to the Ego-Vehicle Coordinate System, respectively.
Ego-Vehicle to LiDAR: The transformation from Ego-Vehicle Coordinate System to LiDAR Coordinate System is the inverse transformation of Eq.(1).
A.2 LiDAR ↔ Camera
LiDAR to Camera: Regardless of whether it is a fisheye or a pinhole camera, the coordinate transformation from the LiDAR Coordinate System to the Camera Coordinate System is the same and is given as follows:

$$P_c = R_{l}^{c} P_l + t_{l}^{c} \qquad (2)$$

where $P_c = (x_c, y_c, z_c)^{\top}$ represents a three-dimensional coordinate point in the Camera Coordinate System, and $R_{l}^{c}$ and $t_{l}^{c}$ represent the rotation and translation from the LiDAR Coordinate System to the Camera Coordinate System, respectively.
Camera to LiDAR: The transformation from Camera Coordinate System to LiDAR Coordinate System is the inverse transformation of Eq.(2).
A.3 Camera ↔ Pixel
Camera to Pixel: The projection formulas of different camera types differ in the RoboSense dataset. The projection formula of a pinhole camera is as follows:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} K \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}, \quad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (3)$$

where $(u, v)$ is the pixel coordinate, $K$ represents the camera intrinsic parameters, $(f_x, f_y)$ represents the focal lengths of the camera, and $(c_x, c_y)$ indicates the displacement of the camera's optical center from the origin of the Pixel Coordinate System. The projection from camera coordinates to pixel coordinates of the fisheye camera is very different; the camera projection process follows the projection formula of the Omnidirectional Camera (OCam) model in [29].
Pixel to Camera: The transformation from the Pixel Coordinate System to the Camera Coordinate System in a pinhole camera model requires the inverse of Eq. (3). Since this is a 2D-to-3D transformation, it is necessary to first determine the depth $z_c$. The projection from pixel coordinates to camera coordinates of the fisheye camera again follows the Omnidirectional Camera (OCam) model in [29].
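A minimal sketch of the pinhole projection in Eq. (3) is given below, assuming the dataset's pinhole axis convention (X right, Y down, Z forward); the fisheye (OCam) projection is omitted, and the function name is illustrative.

```python
import numpy as np

def project_pinhole(points_cam, fx, fy, cx, cy):
    """Project 3D points in the pinhole Camera Coordinate System to pixel
    coordinates via Eq. (3).

    points_cam: (N, 3) array. Points behind the camera (Z <= 0) are dropped.
    Returns the (M, 2) pixel coordinates and the validity mask.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    valid = Z > 1e-6
    u = fx * X[valid] / Z[valid] + cx
    v = fy * Y[valid] / Z[valid] + cy
    return np.stack([u, v], axis=-1), valid
```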
A.4 Ego-Vehicle ↔ Global
Ego-Vehicle to Global: $R_{v}^{g}$ and $t_{v}^{g}$ represent the transformation matrices of the vehicle's orientation and position in the Global Coordinate System, respectively. The formula for converting the coordinates $P_v$ in the Ego-Vehicle Coordinate System to $P_g$ in the Global Coordinate System is as follows:

$$P_g = R_{v}^{g} P_v + t_{v}^{g} \qquad (4)$$
Global to Ego-Vehicle: The transformation from Global Coordinate System to Ego-Vehicle Coordinate System is the inverse transformation of Eq.(4).
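The sketch below chains the rigid transforms of Eqs. (1) and (4) to map LiDAR points into the Global Coordinate System; the transform naming convention (T_a_from_b maps b-frame coordinates into frame a) is an assumption made for this illustration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert(T):
    """Inverse of a rigid transform, used for the reverse directions above."""
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

def lidar_to_global(points_lidar, T_ego_from_lidar, T_global_from_ego):
    """Chain LiDAR -> Ego-Vehicle -> Global for an (N, 3) point array."""
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_global = (T_global_from_ego @ T_ego_from_lidar @ pts_h.T).T
    return pts_global[:, :3]
```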
| Global/Local | Vehicle [0-10] m | Vehicle [10-30] m | Vehicle [30, ∞) m | Cyclist [0-10] m | Cyclist [10-30] m | Cyclist [30, ∞) m | Pedestrian [0-10] m | Pedestrian [10-30] m | Pedestrian [30, ∞) m | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global (Hesai LiDAR) | 165K | 402K | 343K | 23K | 38K | 15K | 187K | 163K | 51K | 1.4M |
| (per-class total) | 910K | | | 76K | | | 401K | | | |
| (ratio) | 65.00% | | | 5.42% | | | 28.64% | | | 100% |
| Local (Livox LiDAR) | 150K | 282K | 133K | 20K | 28K | 7K | 163K | 103K | 21K | 907K |
| (per-class total) | 565K | | | 55K | | | 287K | | | |
| (ratio) | 40.36% | | | 3.93% | | | 20.50% | | | 64.79% |

Appendix B More Details of RoboSense
B.1 Annotation Statistics
We present more statistics on the annotations of RoboSense in Tab. 8. It can be observed that the RoboSense dataset contains approximately 1.4M annotated objects, with vehicles and pedestrians comprising the majority, while cyclists are fewer. The distribution of objects is relatively uniform in terms of distance. Additionally, due to the smaller coverage area of the Livox point clouds (local view) compared to the Hesai point clouds (global view), the number of annotated objects in the Livox point clouds is only 64.79% of that in the Hesai point clouds. In Fig. A1, we further compare the distribution of annotated objects between our RoboSense dataset and the nuScenes dataset. It is obvious that RoboSense contains significantly more annotated objects of the vehicle, pedestrian and cyclist classes respectively, and these objects tend to be closer to the ego robot.
B.2 3D Object Label Generation
To generate high-quality 3D object annotations, we design a three-stage 3D object label generation pipeline for different sensors covering various ranges. First, a pre-trained, high-precision LiDAR detection model (i.e., [17]) is adopted to produce 3D objects over the full view using high-quality Pandar64 points as input. Then, expert annotators refine the initial 3D boxes continuously throughout the whole sequence of each scene, based on spliced point clouds obtained by aligning the 4 vehicle-side LiDARs to the Ego-Vehicle coordinate through affine transformation. Besides, annotators need to supplement surrounding 3D boxes at near range which are not scanned by the top Hesai LiDAR or fail to be detected owing to heavy occlusion and truncation. Last but not least, invalid 3D annotations are excluded for the target LiDAR coordinate and Camera coordinate respectively, i.e., annotated objects that are not covered by the corresponding sensor data. Through multiple validation steps, highly accurate annotations can be achieved in both near and far ranges. We also release the intermediate Pandar64 points for research usage.
| Task | Method | Vehicle@IoU=0.7/0.3: 3D AP | AOS | ASE | Cyclist@IoU=0.5/0.3: 3D AP | AOS | ASE | Pedestrian@IoU=0.5/0.3: 3D AP | AOS | ASE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiDAR 3D Detection | PointPillar [17] | 43.7 | 45.5 | 13.3 | 39.5 | 39.6 | 69.2 | 52.6 | 36.6 | 34.9 |
| | SECOND [41] | 55.8 | 59.8 | 17.2 | 52.3 | 53.3 | 65.9 | 61.7 | 46.9 | 37.5 |
| | PVRCNN [31] | 53.5 | 57.9 | 16.9 | 53.0 | 50.7 | 55.9 | 58.9 | 43.4 | 38.4 |
| | Transfusion-L [2] | 65.8 | 66.3 | 17.3 | 59.3 | 71.0 | 78.5 | 67.1 | 56.0 | 42.7 |
| Multi-view 3D Detection | BEVDet [14] | 32.1 | 21.8 | 10.4 | 19.9 | 21.2 | 36.8 | 25.9 | 29.7 | 20.3 |
| | BEVDet4D [13] | 33.5 | 22.8 | 10.4 | 20.1 | 21.1 | 36.7 | 26.2 | 28.3 | 17.7 |
| | BEVDepth [18] | 33.4 | 22.8 | 10.2 | 22.6 | 22.2 | 41.6 | 27.7 | 28.1 | 17.9 |
| | BEVFormer [19] | 33.6 | 23.0 | 10.3 | 23.4 | 22.1 | 35.3 | 28.0 | 29.5 | 17.8 |

B.3 Occupancy Label Preprocess
Occupancy label generation can be primarily divided into two parts: point cloud densification and occupancy label determination. Unlike the existing counterpart [34], which only utilizes the sparse keyframe LiDAR points, we find a multi-frame aggregation operation to be indispensable for dense occupancy generation. For dynamic objects, the extracted dynamic points of neighboring frames are concatenated for each object along the corresponding trajectory, thus achieving point cloud densification. For static scenes, a coordinate transformation is performed from the ego-vehicle coordinate to the global coordinate across time using ego-pose information, and then all static points are simply aggregated in the ego-vehicle coordinate of the current keyframe through concatenation.
Notably, owing to the complex driving scenarios with uneven ground and rapid pose changes, especially when turning to avoid obstacles during data collection, pose drifts are observed in the IMU data. Therefore, the temporal aggregation results of point clouds are inferior, with a misaligned horizon and ego-motion blur as shown in Fig. A2. To alleviate these issues, ICP (Iterative Closest Point) [30] is additionally applied for static scene point registration before multi-frame aggregation. Finally, the densified point cloud for a single frame can be obtained by fusing the static scenes with the dynamic objects.
Given the dense points of a specific scene, we label all voxels within a fixed range at a fixed resolution, based on the height of the majority of points inside each voxel. If the height is larger than a threshold, the voxel state is set to “occupied”, otherwise “free”. Moreover, considering occlusion and truncation, some occupied voxels are actually not scanned by the LiDAR beams or covered by the camera views. Hence we set such voxels, which are invisible from both the LiDAR and camera views, to the “unknown” state by tracing the casting rays.
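As a simplified illustration of this labelling rule, the sketch below voxelizes densified points and marks voxels as occupied by a per-point height test (an approximation of the majority-height rule); the actual range, resolution and threshold values, as well as the ray-casting step that produces “unknown” voxels, are not reproduced here.

```python
import numpy as np

def label_voxels(points, pc_range, voxel_size, height_thr):
    """Assign occupied (1) / free (0) states to voxels from densified points.

    points:     (N, >=3) array of x, y, z coordinates.
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max), assumed values.
    voxel_size: scalar edge length (m), assumed value.
    height_thr: height threshold (m) above the range floor, assumed value.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    dims = np.floor(np.array([x_max - x_min, y_max - y_min, z_max - z_min])
                    / voxel_size).astype(int)
    occ = np.zeros(dims, dtype=np.uint8)

    idx = np.floor((points[:, :3] - np.array([x_min, y_min, z_min]))
                   / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, pts = idx[inside], points[inside]

    heights = pts[:, 2] - z_min  # height above the range floor
    for (i, j, k), h in zip(idx, heights):
        if h > height_thr:
            occ[i, j, k] = 1     # occupied; remaining voxels stay free
    return occ
```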
B.4 Metric Comparison
In addition to the evaluation of 3D detection results with the proposed matching criteria (Center-Point distance and Closest Collision-Point distance), we also provide the corresponding evaluation results using the traditional 3D IoU (Intersection-over-Union) matching criterion for comparison, as shown in Tab. 9. It is obvious that, without distance differentiation, the 3D AP results for both LiDAR-based and Camera-based methods are all at a low level, which cannot reflect the objective performance and fails to satisfy the practical application requirements of the detection model. In contrast, the proposed matching criterion is designed to measure the capability of locating the closest collision points of nearby obstacles, which is more challenging and essential for low-speed driving scenarios.
B.5 Scene Distribution

Our RoboSense dataset contains 7.6K sequences, covering 6 main categories (across 22 different locations) of outdoor or semi-closed scenarios (i.e., S1-parks, S2-scenic spots, S3-squares, S4-campuses, S5-sidewalks and S6-streets). Fig. A3 illustrates the scene distribution of the data collected for the RoboSense dataset, which surrounds Dishui Lake in Shanghai, China, with several markers drawn on Google Maps indicating the main data collection locations. Besides, illustrations of representative scenarios from the collected locations are shown in Fig. A4-A9 respectively.





