RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Abstract
Reliable embodied perception from an egocentric perspective is challenging yet essential for the autonomous navigation of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding has become an important research topic for egocentric perceptual tasks related to navigation in crowded and unstructured environments. Owing to complex environmental conditions and heavily truncated or occluded surrounding obstacles, perception capability in this setting remains inferior. To further enhance the intelligence of mobile robots, in this paper we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations and enables a dynamic field of view from the ego perspective, capturing either near or farther areas. Meanwhile, we construct a large-scale multimodal dataset, named RoboSense, to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K frames of synchronized data with 1.4M 3D bounding boxes and IDs annotated in the full view, forming 216K trajectories across 7.6K temporal sequences. It provides substantially more annotations of surrounding obstacles within near ranges than previous datasets collected for autonomous driving scenarios, such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate future research, and provide detailed analysis as well as benchmarks accordingly. Data desensitization measures have been conducted for privacy protection.

1 Introduction
Recent years have witnessed significant progress in the field of autonomous driving, enabling numerous intelligent vehicles to operate on highways and in urban areas. In addition to self-driving cars, social mobile robots have emerged as a new industry tailored to autonomous navigation for typical applications such as tractors, sweepers, retail and delivery. Notably, such intelligent mobile agents usually operate and navigate in crowded and unstructured environments (i.e., campuses, scenic spots, streets, parks and sidewalks, etc.), with varying and uncontrolled natural conditions such as illumination, occlusion and obstruction. To accomplish navigation tasks safely, egocentric perceptual solutions enable these robots to perceive and comprehend the surrounding context from a first-person view, so as to interact successfully with passing pedestrians and vehicles, predict their intentions, and incorporate this information into the agents’ planning and decision-reasoning process.
To evaluate and compare different egocentric perceptual methods fairly, several standardized benchmarks [4, 8, 33, 15, 27, 40] have been proposed in recent years, advancing the development of modern data-driven approaches. KITTI [8] is a pioneering dataset providing multi-modal sensor data, including front-view LiDAR point clouds as well as corresponding stereo images and GPS/IMU data. nuScenes [4] constructs a multi-sensor dataset collected in two cities at an average speed of 16 km/h, where rich collections of 3D boxes and IDs are annotated in the full view. The Waymo Open dataset [33] significantly increases the amount of annotations with a higher annotation frequency. However, the target application domain of existing benchmarks is autonomous driving: the sensor data are captured exclusively from structured roads and highways, with sensor suites installed on top of cars.
To fill the vacancy of egocentric perceptual benchmarks targeting a unique domain related to navigation tasks in crowded and unstructured environments, in this paper we present RoboSense, a novel multimodal dataset with several benchmarks associated with it. Our dataset is collected from diverse social scenarios filled with crowded obstructions, which differs from previously collected datasets used for autonomous driving (e.g., nuScenes [4]). Benefiting from the well time-synchronized multi-sensor data, we hope that RoboSense can facilitate the development of egocentric perceptual frameworks for various types of autonomous navigation agents with controllable cost, not only self-driving cars but also autonomous agents such as social mobile robots. To this end, the data collection robot is equipped with 3 main types of sensors (C: Camera, L: LiDAR, F: Fisheye), and each type consists of 4 devices installed on different sides respectively to ensure the data are captured in full view without blind spots.
Specifically, RoboSense consists of a total of 133K+ frames of synchronized data, spanning 7.6K temporal sequences of 6 main scene classes (i.e., scenic spots, parks, squares, campuses, streets and sidewalks). Moreover, 1.4M 3D bounding boxes together with track IDs are annotated based on 3 different types of sensors, where most targets tend to be close to the robot as shown in Fig. 1. We then form global trajectories for each agent separately by associating the same IDs across consecutive frames and different devices from a Bird’s-Eye View (BEV) perspective. Additionally, we formulate 6 standardized benchmarks for egocentric perceptual tasks as follows: 1. Multi-view 3D Detection; 2. LiDAR 3D Detection; 3. Multi-modal 3D Detection; 4. Multiple 3D Object Tracking (3D MOT); 5. Motion Prediction; 6. Occupancy Prediction. Meanwhile, a multi-task end-to-end training scheme is also supported in RoboSense for the evaluation of joint optimization. In sum, the main contributions of our work are threefold:
- To our best knowledge, our RoboSense is the first dataset tailored to egocentric perceptual tasks related to navigation of autonomous agents in unstructured environments.
- We annotate 1.4M 3D bounding boxes on 133K+ synchronized sensor frames, where most targets are close to the robot. Each target is associated with a unique ID, forming a total of 216K trajectories spread over 7.6K temporal sequences covering 6 main scene classes.
- We formulate 6 standardized benchmarks to facilitate the evaluation and fair comparison of different perceptual solutions related to navigation in built environments.
| Dataset | Year | Size (hr) | Ann. Scenes | Ann. Frames | With Trajectory | Multi-view Overlapping | Sensor Layouts | 3D Boxes (Total) | 3D Boxes† (within 5 m) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KITTI [8] | 2012 | 1.5 | 22 | 15K | ✗ | ✗ | 4C+1L | 80K | 638 |
| Cityscapes [6] | 2016 | - | - | 25K | ✗ | ✗ | 1C | 0 | 0 |
| ApolloScape [15] | 2016 | 2 | - | 144K | ✗ | ✗ | 1L | 70K | 4.7K |
| H3D [25] | 2019 | 0.77 | 160 | 27K | ✗ | ✓ | 3C+1L | 1.1M | - |
| Lyft L5 [16] | 2019 | 2.6 | 366 | 55K | ✓ | ✓ | 7C+3L | 1.3M | - |
| nuScenes [4] | 2019 | 5.5 | 1K | 40K | ✓ | ✓ | 6C+1L | 1.4M | 9.8K |
| Argoverse [5] | 2019 | 0.6 | 113 | 22K | ✓ | ✓ | 9C+2L | 993K | 15K |
| Waymo Open [33] | 2019 | 6.4 | 1K | 200K | ✓ | ✓ | 5C+5L | 12M | 123K |
| BDD100k [42] | 2020 | 1K | 100K | 100K | ✗ | ✗ | 1C | 0 | 0 |
| RoboSense (Ours) | 2024 | 42 | 7.6K | 133K | ✓ | ✓ | 4C+4F+4L | 1.4M | 173K |
2 Related Work
We summarize the compositions of some existing perception and prediction datasets as shown in Tab. 1.
Perception Datasets. Currently released perception datasets can be divided into image-only datasets [42, 6] and multimodal datasets [8, 4, 33, 16, 15]. BDD100k [42] and Cityscapes [6] focus on 2D perception and provide large amounts of 2D annotations (boxes, masks) for driving scene understanding under various weather and illumination conditions. KITTI [8] is the pioneering multimodal dataset which has been widely used for academic research. It records 6 hours of driving data using a LiDAR sensor and a front-facing stereo camera to provide point clouds and images with annotated 3D boxes. The H3D dataset [25] collects a total of 1.1M 3D objects over 27K frames from 160 crowded scenes in the full 360° view. nuScenes [4] and the Waymo Open Dataset [33] are two datasets with a similar structure, with the latter providing more annotations owing to a higher annotation frequency (2 Hz vs. 10 Hz). Different from previously collected datasets used for autonomous driving, the annotation frequency of our RoboSense is even lower (1 Hz) due to the low-speed (less than 1 m/s) moving status of social mobile robots navigating in crowded and unstructured environments.
Prediction Datasets. nuScenes [4] and the Waymo Open Dataset [33] can also be used for the prediction task, as they release lane graphs as well. Lyft [16] introduces traffic/speed control data, and the Waymo Open Dataset [33] adds more signals to the map, such as crosswalks, lane boundaries, stop signs and speed limits. Recently, the Shifts dataset [24] has become the largest forecasting dataset with the most scenario hours to date. Meanwhile, Argoverse [5] is also a large-scale dataset with a high data frequency (10 Hz) and high scenario quality for motion forecasting across 6 cities. Together, these datasets have enabled the exploration of multi-actor, long-range motion forecasting leveraging both static and dynamic maps.
Generally, our dataset differs in three substantial ways: 1) It targets a unique domain related to navigation tasks in crowded and unstructured environments, which is more difficult than autonomous driving scenarios in terms of the complexity of the environmental context and the diversity of surrounding obstructions. 2) In addition to 3D bounding box and trajectory annotations, our dataset also provides high-quality occupancy descriptions for each collected scene, supporting the occupancy prediction task around social robots for safe navigation. 3) Our dataset is mostly collected in crowded social scenes, where pedestrians and cars tend to be close to the robot, yielding a distance distribution with a mode at approximately 5 m, which is quite different from existing datasets for autonomous cars as shown in Fig. 2. Besides, egocentric perceptual tasks in this setting are more challenging due to frequent occlusion and truncation.

3 RoboSense Open Dataset
We commence with the sensor setup and data acquisition details, then delineate the coordinate systems and label generation process, and finally present data statistics.
3.1 Sensor Setup and Data Acquisition
Sensor setup. We use a social mobile robot (i.e., a robosweeper) as the data collection platform, which is equipped with different sensors installed on different sides of the robot to ensure data is captured over the full 360° horizontal view without blind spots, including LiDAR, Camera, Fisheye, GPS/IMU and Ultrasonic sensors. Refer to Fig. 3 for the sensor layout and Tab. 3 for detailed sensor specifications.
Data acquisition. We utilize the mobile robot to collect data around Dishui Lake in Shanghai, China, lasting 42 h in total at an average speed of less than 1 m/s through manual remote control. 22 different places are travelled, which can be categorized into 6 main kinds of outdoor or semi-closed social scenarios (i.e., scenic spots, parks, squares, campuses, streets and sidewalks). After data collection, we manually select and process 7,619 representative scenes of about 20 s duration each for further annotation, covering various natural conditions (i.e., weather and illumination) and diverse environmental backgrounds and obstructions (i.e., motion, amount, type, occlusion, truncation).
3.2 Coordinate Systems
Ego-Vehicle Coordinate. The Ego-Vehicle Coordinate System is centered at the rear axle of the vehicle. The positive directions of the X, Y, and Z axes correspond to the forward, leftward, and upward directions of the vehicle, respectively. The Ego-Vehicle Coordinate System is the one most frequently used in tasks such as perception, tracking, prediction, and planning, where dynamic and static targets as well as trajectories are transformed into this coordinate system.
Global Coordinate. To transform the dynamic and static elements from historical and future frames into the current frame coordinate system, we need to establish a global coordinate system to record the position and orientation of the ego vehicle in each frame. The origin of the Global Coordinate System is an arbitrarily defined point in Shanghai Lingang, China, and the positive directions of the X, Y, and Z axes follow the definition of the North-East-Up coordinate.
LiDAR Coordinate. The LiDAR Coordinate System is defined based on the Hesai LiDAR installed on top of the vehicle; the positive directions of the X, Y, and Z axes follow the definition of the Ego-Vehicle Coordinate System.
Camera Coordinate. The RoboSweeper is equipped with four fisheye cameras and four pinhole cameras. The origin of the Camera Coordinate System for both camera types is the optical center. However, the positive directions of the coordinate axes are defined differently in the RoboSense dataset. In the fisheye coordinate system, the X, Y, and Z axes point directly downward, to the right, and backward from the optical center, respectively. In contrast, in the pinhole coordinate system, these axes point to the right, downward, and forward from the optical center, respectively.
Pixel Coordinate. The image is presented in the form of pixels, and each pixel corresponds to a 2D pixel coordinate. The origin of the Pixel Coordinate System is the upper-left corner of the image. Points in the 3D Camera Coordinate System can be mapped to the Pixel Coordinate System through the camera projection.
3.3 Ground Truth Labels
After integrating, synchronizing and calibrating the multi-sensor raw data, we annotate keyframes (LiDAR, image) at a frequency of 1 Hz due to the robot's low-speed motion.
3D object. For the selected scenes of the collected RoboSense dataset, we annotate 3D object boxes of 3 movable classes (i.e., “Vehicle”, “Cyclist” and “Pedestrian”) for each sampled keyframe, in both the LiDAR coordinate of point clouds and the Camera coordinate of multi-view images respectively. Each annotated 3D box can be represented as $(x, y, z, w, l, h, \theta, c)$, where $(x, y, z)$ indicates the 3D position of a regular object, and $(w, l, h)$ represents the scale information including width, length and height. $\theta$ and $c$ correspond to the orientation (specifically the yaw angle) and the object class respectively. A three-stage auto-labelling pipeline is detailed in the supplementary material (see Sec. B.2).
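To make the annotation format concrete, the sketch below shows one way such a box record could be represented in code; the field names are illustrative only and do not reflect the official schema of the released annotation files.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Illustrative container for one RoboSense 3D box annotation.

    Field names are hypothetical; the released files may use a different schema.
    """
    x: float        # center position along X (m), in LiDAR or Camera coordinates
    y: float        # center position along Y (m)
    z: float        # center position along Z (m)
    w: float        # width (m)
    l: float        # length (m)
    h: float        # height (m)
    yaw: float      # orientation around the vertical axis (rad)
    category: str   # "Vehicle", "Cyclist" or "Pedestrian"
    track_id: int   # unique ID linking the same agent across keyframes

# Example usage with made-up values:
box = Box3D(x=3.2, y=-1.1, z=0.4, w=0.7, l=1.8, h=1.7,
            yaw=0.15, category="Pedestrian", track_id=42)
```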
Trajectory. To facilitate temporal tasks such as multi-object tracking and motion forecasting described in Sec. 4, we assign a unique Track ID to each agent across a temporal sequence in the Bird's-Eye View (BEV) of the Ego-Vehicle coordinate. Furthermore, agents with the same Track ID within a sequence are linked together to form object trajectories.

Occupancy label. In addition to the 3 typical classes of moving objects on roads which are annotated temporally as above, there also exists a rich collection of static obstacles with irregular shapes, especially in the complex scenarios (i.e., parks, campuses and squares, etc.) of RoboSense. To describe the environment in the surrounding camera views in detail for driving safety, we voxelize the 3D space and generate high-quality, dense occupancy labels to represent the voxel states. Similar to previous occupancy benchmarks [36, 35] built upon public datasets [4, 33], we segment dynamic objects and static scenes along the temporal dimension based on the annotated 3D boxes and trajectories. Then sparse LiDAR points inside each box are extracted from the $(t{-}N)$-th to the $(t{+}N)$-th frames respectively, where $t$ indicates the index of the current keyframe, and $N$ is set to 10 empirically. Refer to the supplementary material for more details of the occupancy label generation process (see Sec. B.3).
4 Tasks & Metrics
Both egocentric perceptual tasks and prediction tasks are supported in our RoboSense dataset and benchmark.
4.1 Perception
4.1.1 3D Object Detection
The RoboSense 3D detection task requires detecting 3D bounding boxes of three main classes (i.e., “Vehicle”, “Pedestrian” and “Cyclist”), including position, size, orientation and category. Following the conventions in [9, 4, 33], we adopt mAP (mean Average Precision), AOS (Average Orientation Similarity) and ASE (Average Scale Error) to measure the performance of different detectors.
There are several matching criteria to define a true positive for Average Precision (AP) calculation. For example, [9] adopts 3D Intersection-over-Union (IoU) to match each prediction with a ground-truth box, while [4] defines a match by thresholding the 2D center distance on the Bird's-Eye-View ground plane. For the RoboSense detection task, we also adopt a similar distance measure. Differently, we define the threshold as a relative proportion of the ground-truth closest collision-point distance from the ego-vehicle (Closest-Collision Distance Proportion, CCDP), rather than the absolute Center Distance (CD) adopted in [4]. We claim that the localization accuracy of the closest collision points of near obstacles is more important in low-speed driving scenarios. AP is then calculated as the normalized area under the precision-recall curve [7]. Finally, mAP is obtained by averaging over the set of classes $\mathbb{C}$ and matching thresholds $\mathbb{D}$:
$$\mathrm{mAP} = \frac{1}{|\mathbb{C}|\,|\mathbb{D}|} \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} \mathrm{AP}_{c,d} \qquad (1)$$
In addition to AP, we also measure AOS and ASE for each matched true positive, which represent the precision of predicted yaw angle and object scale respectively. AOS (Average Orientation Similarity) is formulated as:
$$\mathrm{AOS} = \frac{1}{|R|} \sum_{r \in R} s(r) \qquad (2)$$

$$s(r) = \frac{1}{|\mathrm{TP}(r)|} \sum_{i \in \mathrm{TP}(r)} \frac{1 + \cos \Delta\theta_i}{2} \qquad (3)$$

where $R$ indicates the recall range interpolated with 40 points, $\mathrm{TP}(r)$ indicates the set of matched true positives at recall $r$, and $\Delta\theta_i$ denotes the angle difference between sample $i$ and its matched ground truth. Different from [33], we only consider true positive samples under each recall level, rather than all predicted positives.
ASE is defined as $1 - \mathrm{IoU}$, which measures the scale error by calculating the 3D IoU after aligning the orientation and translation of predictions with the ground truth.
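As an illustration of the detection metric above, the sketch below approximates the CCDP criterion by using the closest BEV corner of a box as its collision point and then computes the normalized area under the precision-recall curve; the exact collision-point definition, the assignment of predictions to ground truths, and the interpolation scheme of the official toolkit may differ.

```python
import numpy as np

def ccp_distance(corners_xy):
    """Distance from the ego origin (0, 0) to the closest BEV corner of a box,
    used here as a proxy for the closest collision point."""
    return float(np.linalg.norm(np.asarray(corners_xy, dtype=float), axis=-1).min())

def is_ccdp_match(pred_ccp, gt_ccp, ratio):
    """True positive test: the error of the predicted closest collision-point
    distance must stay within a relative proportion `ratio` of the
    ground-truth CCP distance (ratio values are assumptions)."""
    return abs(pred_ccp - gt_ccp) <= ratio * gt_ccp

def average_precision(scores, is_tp, num_gt):
    """Normalized area under the precision-recall curve for one class and one
    matching threshold; mAP averages this value over classes and thresholds."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp_flags = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(~tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # simple trapezoidal integration; official toolkits often interpolate
    return float(np.trapz(precision, recall))
```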
4.1.2 Multi-Object Tracking
The tracking task is designed to associate all detected 3D boxes of movable object classes across input multi-view temporal sequences (i.e., videos or point cloud sequences). Each object is assigned a unique and consistent track ID from its first appearance until it completely vanishes. For performance evaluation, we refer to [4, 9, 22, 32] and mainly adopt sAMOTA (Scaled Average Multi-Object Tracking Accuracy) and AMOTP (Average Multi-Object Tracking Precision) to measure 3D tracking performance.
Formally, sAMOTA is defined as the mean value of sMOTA over all recalls:
$$\mathrm{sAMOTA} = \frac{1}{|R|} \sum_{r \in R} \mathrm{sMOTA}_r \qquad (4)$$

$$\mathrm{sMOTA}_r = \max\!\left(0,\; 1 - \frac{\mathrm{FP}_r + \mathrm{FN}_r + \mathrm{IDS}_r - (1-r)\,n_{gt}}{r\,n_{gt}}\right) \qquad (5)$$

where $\mathrm{FP}_r$, $\mathrm{FN}_r$ and $\mathrm{IDS}_r$ represent the number of false positives (wrong detections), false negatives (missed detections) and identity switches at the corresponding recall $r$, respectively, and $n_{gt}$ is the number of ground-truth objects. Similarly, AMOTP is the average of MOTP over different recalls, which can be defined as:

$$\mathrm{AMOTP} = \frac{1}{|R|} \sum_{r \in R} \frac{\sum_{i,t} d_{i,t}}{\mathrm{TP}_r} \qquad (6)$$

where $\mathrm{TP}_r$ is the number of true positives at recall $r$, and $d_{i,t}$ denotes the position error of matched track $i$ at timestamp $t$. Besides, additional metrics such as MT (Mostly Tracked) and ML (Mostly Lost) [3] are also reported.
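A minimal sketch of Eqs. (4)-(5), assuming the per-recall error counts have already been produced by a matcher; it mirrors the sMOTA formulation of [39] but is not the released evaluation code.

```python
def smota(fp, fn, ids, num_gt, recall):
    """Scaled MOTA at recall r (Eq. 5), clipped to [0, 1]."""
    value = 1.0 - (fp + fn + ids - (1.0 - recall) * num_gt) / (recall * num_gt)
    return max(0.0, min(1.0, value))

def samota(stats_per_recall, num_gt):
    """sAMOTA (Eq. 4): mean of sMOTA over the evaluated recall levels.

    stats_per_recall maps a recall value r to a tuple (FP_r, FN_r, IDS_r).
    """
    values = [smota(fp, fn, ids, num_gt, r)
              for r, (fp, fn, ids) in sorted(stats_per_recall.items())]
    return sum(values) / len(values)

# Example with hypothetical error counts at three recall levels:
print(samota({0.2: (5, 80, 1), 0.5: (20, 50, 3), 0.8: (60, 20, 6)}, num_gt=100))
```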
4.2 Prediction
4.2.1 Motion Forecasting
Based on perception results, the motion forecasting task requires predicting agents' future trajectories. Specifically, plausible trajectories over future timesteps for each agent are forecasted as offsets to the agent's current position. Following the standard protocols [20, 21, 26, 10], we adopt minADE (minimum Average Displacement Error), minFDE (minimum Final Displacement Error), MR (Miss Rate) and EPA (End-to-end Prediction Accuracy) as metrics to measure the precision of motion prediction. In order to decouple the accuracy of perception and prediction, these metrics are only calculated for matched TPs (True Positives), where the matching threshold is set to a relative proportion of the ground-truth distance of the closest collision point from the ego-vehicle. A miss threshold on minFDE is used to determine misses when calculating the MR metric.
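For illustration, the sketch below computes minADE, minFDE and MR for matched agents with K predicted modes; the 2.0 m miss threshold is an assumed placeholder rather than the value used by the benchmark.

```python
import numpy as np

def min_ade_fde(pred_trajs, gt_traj):
    """minADE / minFDE over K predicted future trajectories for one agent.

    pred_trajs: (K, T, 2) array of predicted BEV positions.
    gt_traj:    (T, 2) array of ground-truth future positions.
    """
    errors = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1)  # (K, T)
    ade = errors.mean(axis=1)   # average displacement per mode
    fde = errors[:, -1]         # final displacement per mode
    return float(ade.min()), float(fde.min())

def miss_rate(min_fdes, miss_threshold=2.0):
    """Fraction of matched agents whose best final displacement exceeds the
    miss threshold (2.0 m is an assumption, not the official value)."""
    min_fdes = np.asarray(min_fdes, dtype=float)
    return float((min_fdes > miss_threshold).mean())
```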
4.2.2 Occupancy Prediction
The goal of the occupancy prediction task is to estimate the state of each voxel in the 3D space. Formally, a sequence of historical frames with surround-view camera images serves as input. Besides, sensor intrinsic parameters together with extrinsic parameters for each frame are also provided. The ground truth labels describe the voxel states, including the occupancy state and the semantic label. Three occupancy states are considered in the RoboSense dataset: “occupied”, “free” and “unknown”. The semantic label of each voxel can be one of the 3 predefined object categories or an “unknown” class indicating general objects. Furthermore, each voxel can also be equipped with extra attributes as outputs, such as instance IDs and motion vectors, which are left as future work.
To evaluate the quality of the predicted occupancy, we measure whole-scene voxel segmentation results using the IoU metric for each class. Considering the low-speed driving scenarios, we evaluate the metric under different ranges around the ego vehicle in both 3D and BEV space. Finally, mIoU is obtained by averaging over the 4 classes. Moreover, evaluation is only performed on the voxels visible from the camera views.
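As a sketch of this range-restricted evaluation, the functions below compute a per-class IoU over camera-visible voxels whose BEV distance falls within a given range and average it into mIoU; the array layouts and the visibility-mask format are assumptions, not the released protocol.

```python
import numpy as np

def masked_iou(pred, gt, cls, mask):
    """IoU of one semantic class over the voxels selected by `mask`."""
    p = (pred == cls) & mask
    g = (gt == cls) & mask
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union > 0 else float("nan")

def range_miou(pred, gt, voxel_xy_dist, visible, classes, r_min, r_max):
    """mIoU over `classes`, restricted to camera-visible voxels whose BEV
    distance from the ego-vehicle lies in [r_min, r_max)."""
    mask = visible & (voxel_xy_dist >= r_min) & (voxel_xy_dist < r_max)
    ious = [masked_iou(pred, gt, c, mask) for c in classes]
    return float(np.nanmean(ious))
```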
5 Experiments
| Scene-ID | Day | Night | Scene Ratio | Train | Test | Val | Num of Sequences | Num of Frames | Num of 3D Boxes | Num of Trajectories |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S-1 | 56% | 44% | 20% | 50% | 40% | 10% | 1.5K | 26K | 310K | 36K |
| S-2 | 69% | 31% | 30% | | | | 2.3K | 42K | 293K | 37K |
| S-3 | 71% | 29% | 17% | | | | 1.2K | 22K | 284K | 64K |
| S-4 | 83% | 17% | 7% | | | | 0.5K | 9K | 144K | 22K |
| S-5 | 70% | 30% | 20% | | | | 1.6K | 26K | 297K | 44K |
| S-6 | 22% | 78% | 6% | 0% | 100% | 0% | 0.5K | 8K | 88K | 13K |
| Total | 65% | 35% | 100% | 46% | 44% | 10% | 7.6K | 133K | 1.4M | 216K |
5.1 Benchmark Setup
Our RoboSense dataset contains 7.6K sequences (including 133K annotated frames) of synchronized multi-sensor data, covering 6 main categories (across 22 different locations) of outdoor or semi-closed scenarios (i.e., S1-parks, S2-scenic spots, S3-squares, S4-campuses, S5-sidewalks and S6-streets). To protect data privacy, we conduct a series of data desensitization measures by masking human faces, car plates and road signs in all sensor data. The details of the RoboSense dataset composition and partitioning are listed in Tab. 2. The RoboSense dataset is collected under various illumination, traffic flow and weather conditions to ensure the diversity of static backgrounds and movable obstacles, thus meeting the demands of different realistic applications.
The RoboSense dataset is divided into three parts with a ratio of 50%, 40% and 10% for training, testing and validation respectively. As for the scene partition, one of the 6 collected scenes (i.e., S-6) is assigned to the testing set exclusively, while the remaining scenes are shared among all splits. Ground truth labels of the training and validation sets are provided for each task, together with the synchronized multi-sensor raw data. The testing set, however, only provides raw data; results on the testing set can therefore only be evaluated by submitting algorithm outputs to our online benchmark for the corresponding task.
5.2 Sensor Specifications
The detailed specifications of all devices are shown in Tab. 3. To cover areas from near to far, we select cameras with different focal lengths and fields of view (FOV). Besides, 5 LiDAR sensors are installed on our data collection robot, where the top Hesai Pandar40M serves as an auto-labeller providing initial annotations for the spliced points of the other LiDARs. 11 ultrasonic sensors are also installed for freespace detection to ensure safety. All devices are synchronized in time via the Network Time Protocol (NTP) before data collection. We use a time interval of 100 ms as the global timestamp and match the frame from each device with the nearest timestamp adjacent to the global timestamp. This process ultimately yields synchronized multi-sensor data at a frame rate of 10 FPS.
| Modality | Sensor | Details |
| --- | --- | --- |
| Camera | 4 Camera | RGB, 25 Hz, 1920×1080, FOV: - |
| | 4 Fisheye | RGB, 25 Hz, 1280×720, FOV: - |
| LiDAR | Hesai Pandar40M | 64 beams, 10 Hz, 384k pps, FOV: - |
| | 3 Zvision ML30s | 40 beams, 10 Hz, 720k pps, FOV: - |
| | Livox Horizon | 40 beams, 10 Hz, 720k pps, FOV: - |
| Ultrasonics | 3 LRU | STP-313, 1 m-10 m, 40 kHz |
| | 8 SRU | STP-318, 5 cm-200 cm, 40 kHz |
| Localization | GPS & IMU | GPS, IMU, AHRS; heading, roll/pitch; 20 mm RTK positioning; 1000 Hz update rate |
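To illustrate the nearest-timestamp matching described above, the following sketch aligns per-device frame timestamps to a 100 ms global clock; it is a simplified stand-in for the actual synchronization pipeline, and the function and variable names are illustrative.

```python
import numpy as np

def sync_to_global_clock(device_timestamps, interval=0.1):
    """Match each device's frames to a 100 ms global clock.

    device_timestamps: dict mapping sensor name -> sorted 1D array of
    timestamps in seconds. Returns, for each global tick, the index of the
    nearest frame from every sensor.
    """
    start = max(ts[0] for ts in device_timestamps.values())
    end = min(ts[-1] for ts in device_timestamps.values())
    global_ticks = np.arange(start, end, interval)

    synced = []
    for tick in global_ticks:
        frame = {}
        for name, ts in device_timestamps.items():
            frame[name] = int(np.argmin(np.abs(ts - tick)))  # nearest-timestamp match
        synced.append(frame)
    return global_ticks, synced
```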
5.3 Implementation Details
For the LiDAR detection task, we set the point range to x ∈ [-45, 45] m, y ∈ [-45, 45] m, z ∈ [-1, 4] m, with a fixed voxel size of 0.16 m and 0.05 m for pillar-based and voxel-based methods respectively. For image detection tasks, we use ResNet18 [11] as the backbone network and resize the input images to a fixed resolution. For practical usage, we report performance using our proposed Closest-Collision Distance Proportion (CCDP) as the matching criterion. Comparisons of different matching functions in terms of average precision are shown in Fig. 4. As expected, when using Center Distance (CD) or IoU, matching objects without distance differentiation cannot reflect the model's capability of locating the closest collision points of nearby obstacles, which is more challenging and essential for low-speed driving scenarios.
| Task | Method | Vehicle@5%/10%: 3D AP | AOS | ASE | Cyclist@5%/10%: 3D AP | AOS | ASE | Pedestrian@5%/10%: 3D AP | AOS | ASE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiDAR 3D Detection | PointPillar [17] | 72.5/53.0 | 73.5/61.1 | 20.6/16.1 | 44.2/32.8 | 45.4/38.3 | 64.2/54.3 | 62.7/38.2 | 45.3/34.1 | 38.3/27.2 |
| | SECOND [41] | 78.8/63.1 | 80.2/69.4 | 19.8/15.7 | 53.8/43.5 | 57.2/49.9 | 67.7/55.7 | 70.8/47.2 | 54.6/43.2 | 40.1/29.3 |
| | PVRCNN [31] | 74.6/57.4 | 77.4/67.7 | 16.4/15.4 | 53.6/41.4 | 55.7/50.1 | 62.5/61.9 | 66.4/39.1 | 50.1/37.0 | 40.4/25.5 |
| | Transfusion-L [2] | 83.6/65.1 | 84.5/73.8 | 19.7/16.0 | 59.7/47.0 | 78.0/70.8 | 82.1/72.9 | 72.3/42.8 | 60.5/48.7 | 45.1/37.4 |
| Multi-view 3D Detection | BEVDet [14] | 76.2/30.2 | 40.4/25.9 | 17.3/11.2 | 42.3/25.7 | 36.1/30.2 | 56.5/42.1 | 47.4/28.5 | 48.6/36.5 | 30.2/18.8 |
| | BEVDet4D [13] | 77.2/31.1 | 41.1/26.4 | 16.8/10.8 | 42.0/24.8 | 33.9/27.7 | 55.3/41.2 | 48.1/29.3 | 46.6/37.6 | 27.5/21.3 |
| | BEVDepth [18] | 77.8/31.3 | 40.9/26.3 | 16.7/10.7 | 43.3/27.0 | 34.9/30.2 | 52.2/46.6 | 50.1/31.3 | 46.7/37.9 | 28.0/21.4 |
| | BEVFormer [19] | 78.2/32.0 | 41.6/26.7 | 16.5/10.6 | 44.1/27.6 | 34.9/30.5 | 51.3/44.3 | 50.2/32.3 | 46.3/38.0 | 28.1/17.9 |
| Task | Detector | Layouts | Metric | [0, 5] m | [5, 10] m | [10, 30] m | sAMOTA | AMOTP | MT | ML |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multi-view 3D Perception | BEVDepth [18] | 4C | 3D AP | 54.9/16.0 | 60.1/18.3 | 53.7/33.1 | 44.03 | 29.95 | 20.23 | 54.01 |
| | | | AOS | 44.8/19.7 | 37.0/18.8 | 34.5/26.9 | | | | |
| | | 4F | 3D AP | 61.1/16.9 | 70.6/19.9 | 50.8/29.0 | 39.56 | 27.10 | 18.02 | 61.74 |
| | | | AOS | 58.7/27.5 | 41.3/23.5 | 36.1/27.4 | | | | |
| | | 4C + 4F | 3D AP | 68.9/20.5 | 75.2/22.9 | 64.2/38.6 | 51.16 | 35.68 | 25.21 | 48.07 |
| | | | AOS | 53.9/24.4 | 43.1/22.5 | 39.6/30.9 | | | | |
| LiDAR 3D Perception | PointPillar [17] | 4L | 3D AP | 59.2/19.3 | 73.1/42.0 | 71.0/65.4 | 44.77 | 33.65 | 25.04 | 54.08 |
| | | | AOS | 46.5/19.2 | 67.2/47.5 | 69.0/65.7 | | | | |
| Multi-modal 3D Perception | BEVDepth [18] + PointPillar [17] | 8V + 4L | 3D AP | 61.3/36.9 | 61.3/54.6 | 54.4/52.6 | 43.32 | 43.18 | 34.74 | 40.82 |
| | | | AOS | 64.8/49.6 | 78.7/75.0 | 79.4/78.4 | | | | |

5.4 Baselines: Perception
5.4.1 LiDAR 3D Detection
To demonstrate the performance of advanced 3D detectors on the LiDAR-only detection track of our RoboSense benchmark, we implement several popular CNN-based methods of different fashions, including PointPillar [17] (pillar-based), SECOND [41] (voxel-based), and PV-RCNN [31] (two-stage point-voxel based). Besides, a Transformer-based method, Transfusion-L [2], is also implemented for architecture comparison. PointPillar, as the most efficient method above, is adopted as our baseline for the LiDAR 3D detection task.
5.4.2 Multi-View 3D Detection
Current works on multi-view 3D detection can be divided into two mainstreams, namely LSS-based [28] and Transformer-based. To examine the effectiveness of image-only multi-view 3D detection models, we select the widely used BEVDet [14] as our LSS-based baseline on the image 3D detection track of RoboSense, and re-implement several extended versions such as BEVDet4D [13], which takes advantage of historical temporal clues, and BEVDepth [18], which adopts an additional branch for depth prediction under point supervision. Besides, BEVFormer [19], a representative Transformer-based work, is also included.
5.4.3 Multiple Object Tracking
We follow the “Tracking-by-Detection” paradigm using 3D detection results from camera or LiDAR data as input respectively, and present several baselines for the multiple 3D object tracking task. Specifically, 3D boxes detected from surround-view images by BEVDepth [18] and from spliced point clouds by PointPillar [17] are provided separately. The tracking approach AB3DMOT [39] is picked to serve as the baseline multiple-object tracker in 3D space. The same objects across different sensors are then associated with unique track IDs to form global trajectories over past frames.
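For illustration, the snippet below sketches one association step of a generic tracking-by-detection pipeline using BEV center distance and Hungarian matching; it is a simplified stand-in for the AB3DMOT-style baseline, whose exact state estimation and matching details differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, det_centers, max_dist=2.0):
    """Match existing track centers (M, 2) to new detection centers (N, 2)
    by minimizing total BEV center distance; pairs farther than `max_dist`
    (an assumed gating value) are rejected."""
    track_centers = np.asarray(track_centers, dtype=float)
    det_centers = np.asarray(det_centers, dtype=float)
    if len(track_centers) == 0 or len(det_centers) == 0:
        return [], list(range(len(track_centers))), list(range(len(det_centers)))

    cost = np.linalg.norm(track_centers[:, None] - det_centers[None], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(track_centers)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_centers)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would typically spawn new track IDs, while unmatched tracks are kept alive for a few frames before being terminated.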
5.5 Baselines: Prediction
5.5.1 Motion Prediction
Traditional motion prediction methods utilize perception ground truth (i.e., historical trajectories of agents and HD maps) as input, which lacks uncertainty modeling in practical applications. In this paper, we implement several vision-based end-to-end methods for joint perception and motion prediction on the RoboSense benchmark, including ViP3D [10] and PnPNet [20]. For comparison, we also report motion prediction results obtained by assuming that agents surrounding the ego-vehicle keep constant positions or velocities respectively, so as to reflect the diversity and difficulty of our dataset for the prediction task.
5.5.2 Occupancy Prediction
We extend a BEV 3D detection model, BEVDepth [18], to the 3D occupancy prediction task, which is then adopted as our baseline for visual occupancy prediction. Concretely, we replace the original detection decoders with occupancy reconstruction layers while keeping the BEV feature encoders. ResNet18 [11] pretrained on FCOS3D [38] is employed as the image backbone for visual feature extraction.
5.6 Results and Analysis
5.6.1 Perception Results
3D Object Detection. The 3D detection results based on multi-view images and spliced point clouds are shown in Tab. 4. For LiDAR 3D detection, Transfusion-L [2] achieves the leading performance owing to its advanced transformer architecture. In terms of multi-view 3D detection, BEVDet4D [13] and BEVDepth [18] obtain significant improvements over BEVDet [14] by involving temporal clues and adopting an additional depth branch respectively. Besides, BEVFormer [19] also achieves competitive results by introducing a query-based attention mechanism. Generally, LiDAR-based 3D detectors generate higher-quality detection results than vision-based methods. However, vision-based methods are capable of detecting various ranges of objects with more sensors (Fisheye or Camera). Note that two different matching criteria are both considered for TP calculation, namely Center-Point (CP) distance and Closest Collision-Point (CCP) distance. It can be observed that the CCP localization performance is obviously lower than the CP localization (i.e., an 18.5% 3D AP drop of Transfusion-L for the Vehicle class and a 29.5% 3D AP drop for the Pedestrian class). For navigation safety, CCP localization is more important for near-field egocentric perception in crowded social scenarios.
Performance with Different Sensor Layouts. To evaluate the performance of different sensor layouts under various ranges, we conduct extensive comparisons as shown in Tab. 5. For visual perception, the 4C layout achieves better AP than the 4F layout in farther areas (i.e., 10-30 m), while the 4F layout is better at detecting near-field targets within 10 m. By combining these two layouts, better performance can be achieved across different ranges. The LiDAR 3D detector exhibits an obvious advantage over visual detectors, especially in CCP and farther object localization, while its performance on near-field objects within 5 m is inferior (19.3% vs. 20.5%). Moreover, we implement multi-modal 3D perception (8V+4L) through a late-fusion strategy. Specifically, the 3D detection results from the multi-view 3D detector and the LiDAR 3D detector are combined in post-processing. We observe that the CCP-based 3D AP of objects within 5 m is remarkably boosted from 20.5% to 36.9%, and the AOS metric is also increased consistently.
Multiple Object Tracking. Regarding the MOT task in Tab. 5, AB3DMOT [39] is adopted as the baseline tracker in 3D space, which mitigates the impact of object occlusions present in 2D images, especially in crowded scenarios. By introducing more sensors (4C + 4F), vision-based methods can also achieve tracking performance competitive with LiDAR-based methods, and even better in the sAMOTA metric (51.16 vs. 44.77). With multi-modal input, AMOTP, MT and ML performance can be further improved as expected. However, even when equipped with multi-modal and multi-sensor data as input, the perception performance is still inferior, especially in the near field (i.e., 36.9% CCP-based 3D AP within 5 m), revealing the deficiencies of current perception methods in handling obstacles at near ranges. The main reason may be the frequent truncation and occlusion caused by the large view occupation of near obstacles, which showcases the great challenge and importance of our proposed benchmark for the development of egocentric perceptual frameworks related to navigation in crowded and unstructured environments.
| Range (m) | mIoU-3D | mIoU-BEV |
| --- | --- | --- |
| - | 24.6 | 29.7 |
| - | 39.6 | 48.2 |
| - | 30.7 | 36.7 |
| - | 16.1 | 19.7 |
5.6.2 Prediction Results
Motion forecasting of surrounding agents as well as occupancy state descriptions around the ego-vehicle are two crucial prediction tasks in the research field of autonomous driving, which have been extensively explored in urban and highway scenarios for autonomous cars.
Motion Prediction. As shown in Tab. 6, both visual end-to-end methods [10] and LiDAR-based end-to-end methods [20] are supported for validation on RoboSense. PnPNet [20], with LiDAR points as input, produces lower prediction errors and better EPA than ViP3D [10], and both remarkably outperform the two baseline settings that model agents with constant positions or velocities.
Occupancy Prediction. As shown in Tab. 7, we use the 4F sensor data as input and report the mIoU metric in both 3D and BEV space under various ranges. Note that the metric is calculated without considering the states of ground voxels, leading to lower performance in both 3D and BEV space. As expected, the performance evaluated within 2 m is better than that in farther areas.
6 Conclusion
To foster research on egocentric perceptual frameworks tailored to various types of autonomous agents navigating in crowded and unstructured environments, we collect RoboSense, a real-world multi-modal dataset captured in complex social scenarios with varying and uncontrolled environmental conditions and dynamic elements. It consists of 7.6K scenes manually selected from different locations, with 1.4M 3D boxes and 216K trajectories annotated in total on 133K synchronized frames. Besides, occupancy descriptions are also provided to facilitate surrounding context comprehension. In future work, more tasks and associated benchmarks, such as motion planning, will be added for end-to-end autonomous navigation applications, and we will explore the additional benefits that joint optimization can bring over modular training.
References
- Agro et al. [2023] Ben Agro, Quinlan Sykora, Sergio Casas, and Raquel Urtasun. Implicit occupancy flow fields for perception and prediction in self-driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1379–1388, 2023.
- Bai et al. [2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1090–1099, 2022.
- Bernardin and Stiefelhagen [2008] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
- Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
- Chang et al. [2019] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8748–8757, 2019.
- Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.
- Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
- Gu et al. [2023] Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5496–5506, 2023.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
- Huang and Huang [2022] Junjie Huang and Guan Huang. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
- Huang et al. [2021] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- Huang et al. [2018] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 954–960, 2018.
- Kesten et al. [2019] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft level 5 av dataset 2019. https://level5.lyft.com/dataset/, 2019.
- Lang et al. [2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
- Li et al. [2023] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1477–1485, 2023.
- Li et al. [2022] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.
- Liang et al. [2020] Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020.
- Luo et al. [2018] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
- Luo et al. [2021] Wenhan Luo, Junliang Xing, Anton Milan, Xiaoqin Zhang, Wei Liu, and Tae-Kyun Kim. Multiple object tracking: A literature review. Artificial intelligence, 293:103448, 2021.
- Ma et al. [2024] Cong Ma, Lei Qiao, Chengkai Zhu, Kai Liu, Zelong Kong, Qing Li, Xueqi Zhou, Yuheng Kan, and Wei Wu. Holovic: Large-scale dataset and benchmark for multi-sensor holographic intersection and vehicle-infrastructure cooperative. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22129–22138, 2024.
- Malinin et al. [2021] Andrey Malinin, Neil Band, German Chesnokov, Yarin Gal, Mark JF Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. arXiv preprint arXiv:2107.07455, 2021.
- Patil et al. [2019] Abhishek Patil, Srikanth Malla, Haiming Gang, and Yi-Ting Chen. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In 2019 International Conference on Robotics and Automation (ICRA), pages 9552–9557. IEEE, 2019.
- Peri et al. [2022] Neehar Peri, Jonathon Luiten, Mengtian Li, Aljoša Ošep, Laura Leal-Taixé, and Deva Ramanan. Forecasting from lidar via future object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17202–17211, 2022.
- Pham et al. [2020] Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, and Jie Lin. A* 3d dataset: Towards autonomous driving in challenging environments. In 2020 IEEE International conference on Robotics and Automation (ICRA), pages 2267–2273. IEEE, 2020.
- Philion and Fidler [2020] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
- Scaramuzza et al. [2006] Davide Scaramuzza, Agostino Martinelli, and Roland Siegwart. A flexible technique for accurate omnidirectional camera calibration and structure from motion. In Fourth IEEE International Conference on Computer Vision Systems (ICVS’06), pages 45–45. IEEE, 2006.
- Sharp et al. [2002] Gregory C Sharp, Sang W Lee, and David K Wehe. Icp registration using invariant features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):90–102, 2002.
- Shi et al. [2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10529–10538, 2020.
- Sun et al. [2020a] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020a.
- Sun et al. [2020b] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020b.
- Tang et al. [2024] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.
- Tian et al. [2024] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36, 2024.
- Tong et al. [2023] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.
- Wang et al. [2024] Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Occgen: Generative multi-modal 3d occupancy prediction for autonomous driving. arXiv preprint arXiv:2404.15014, 2024.
- Wang et al. [2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.
- Weng et al. [2020] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. Ab3dmot: A baseline for 3d multi-object tracking and new evaluation metrics. arXiv e-prints, 2020.
- Xiao et al. [2021] Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101. IEEE, 2021.
- Yan et al. [2018] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
- Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
Supplementary Material
Appendix A Coordinates Transformation
A.1 LiDAR ↔ Ego-Vehicle
LiDAR to Ego-Vehicle: Let $P_v = (x_v, y_v, z_v)^{\top}$ represent a three-dimensional coordinate point in the Ego-Vehicle Coordinate System. The transformation from the coordinates $P_l = (x_l, y_l, z_l)^{\top}$ in the LiDAR Coordinate System to $P_v$ in the Ego-Vehicle Coordinate System is calculated as follows:

$$P_v = R_{l}^{v} P_l + t_{l}^{v} \qquad (1)$$

where $R_{l}^{v}$ and $t_{l}^{v}$ represent the rotation and translation from the LiDAR Coordinate System to the Ego-Vehicle Coordinate System, respectively.
Ego-Vehicle to LiDAR: The transformation from Ego-Vehicle Coordinate System to LiDAR Coordinate System is the inverse transformation of Eq.(1).
A.2 LiDAR ↔ Camera
LiDAR to Camera: Regardless of whether it is a fisheye or a pinhole camera, the coordinate transformation from the LiDAR Coordinate System to the Camera Coordinate System is the same and is given as follows:

$$P_c = R_{l}^{c} P_l + t_{l}^{c} \qquad (2)$$

where $P_c = (x_c, y_c, z_c)^{\top}$ represents a three-dimensional coordinate point in the Camera Coordinate System, and $R_{l}^{c}$ and $t_{l}^{c}$ represent the rotation and translation from the LiDAR Coordinate System to the Camera Coordinate System, respectively.
Camera to LiDAR: The transformation from Camera Coordinate System to LiDAR Coordinate System is the inverse transformation of Eq.(2).
A.3 Camera ↔ Pixel
Camera to Pixel: The projection formulas of different camera types differ in the RoboSense dataset. The projection formula of a pinhole camera is as follows:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{z_c} K \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}, \quad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (3)$$

where $(u, v)$ is the pixel coordinate, $K$ represents the camera intrinsic parameters, $(f_x, f_y)$ represents the focal lengths of the camera, and $(c_x, c_y)$ indicates the displacement of the camera's optical center from the origin of the Pixel Coordinate System. The projection from camera coordinates to pixel coordinates of the fisheye camera is very different; the camera projection process follows the projection formula of the Omnidirectional Camera (OCam) model in [29].
Pixel to Camera: The transformation from the Pixel Coordinate System to the Camera Coordinate System in a pinhole camera model requires the inverse of Eq. (3). Since this is a 2D-to-3D transformation, it is necessary to first determine the depth $z_c$. The projection from pixel coordinates to camera coordinates of the fisheye camera again follows the Omnidirectional Camera (OCam) model in [29].
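A minimal sketch of the pinhole projection in Eq. (3) is given below, assuming the dataset's pinhole axis convention (X right, Y down, Z forward); the fisheye (OCam) projection is omitted, and the function name is illustrative.

```python
import numpy as np

def project_pinhole(points_cam, fx, fy, cx, cy):
    """Project 3D points in the pinhole Camera Coordinate System to pixel
    coordinates via Eq. (3).

    points_cam: (N, 3) array. Points behind the camera (Z <= 0) are dropped.
    Returns the (M, 2) pixel coordinates and the validity mask.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    valid = Z > 1e-6
    u = fx * X[valid] / Z[valid] + cx
    v = fy * Y[valid] / Z[valid] + cy
    return np.stack([u, v], axis=-1), valid
```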
A.4 Ego-Vehicle ↔ Global
Ego-Vehicle to Global: $R_{v}^{g}$ and $t_{v}^{g}$ represent the transformation matrices of the vehicle's orientation and position in the Global Coordinate System, respectively. The formula for converting the coordinates $P_v$ in the Ego-Vehicle Coordinate System to $P_g$ in the Global Coordinate System is as follows:

$$P_g = R_{v}^{g} P_v + t_{v}^{g} \qquad (4)$$
Global to Ego-Vehicle: The transformation from Global Coordinate System to Ego-Vehicle Coordinate System is the inverse transformation of Eq.(4).
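The sketch below chains the rigid transforms of Eqs. (1) and (4) to map LiDAR points into the Global Coordinate System; the transform naming convention (T_a_from_b maps b-frame coordinates into frame a) is an assumption made for this illustration.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert(T):
    """Inverse of a rigid transform, used for the reverse directions above."""
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

def lidar_to_global(points_lidar, T_ego_from_lidar, T_global_from_ego):
    """Chain LiDAR -> Ego-Vehicle -> Global for an (N, 3) point array."""
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_global = (T_global_from_ego @ T_ego_from_lidar @ pts_h.T).T
    return pts_global[:, :3]
```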
| Global/Local | Vehicle [0-10] m | Vehicle [10-30] m | Vehicle [30, ∞) m | Cyclist [0-10] m | Cyclist [10-30] m | Cyclist [30, ∞) m | Pedestrian [0-10] m | Pedestrian [10-30] m | Pedestrian [30, ∞) m | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Global (Hesai LiDAR) | 165K | 402K | 343K | 23K | 38K | 15K | 187K | 163K | 51K | 1.4M |
| (per-class total) | 910K | | | 76K | | | 401K | | | |
| (ratio) | 65.00% | | | 5.42% | | | 28.64% | | | 100% |
| Local (Livox LiDAR) | 150K | 282K | 133K | 20K | 28K | 7K | 163K | 103K | 21K | 907K |
| (per-class total) | 565K | | | 55K | | | 287K | | | |
| (ratio) | 40.36% | | | 3.93% | | | 20.50% | | | 64.79% |

Appendix B More Details of RoboSense
B.1 Annotation Statistics
We present more statistics on the annotations of RoboSense in Tab. 8. It can be observed that the RoboSense dataset contains approximately 1.4M annotated objects, with vehicles and pedestrians comprising the majority, while cyclists are fewer. The distribution of objects is relatively uniform in terms of distance. Additionally, due to the smaller coverage area of the Livox point clouds (local view) compared to the Hesai point clouds (global view), the number of annotated objects in the Livox point clouds is only 64.79% of that in the Hesai point clouds. In Fig. A1, we further compare the distribution of annotated objects between our RoboSense dataset and the nuScenes dataset. It is obvious that RoboSense contains significantly more annotated objects of the vehicle, pedestrian and cyclist classes respectively, and these objects tend to be closer to the ego robot.
B.2 3D Object Label Generation
To generate high-quality 3D object annotations, we design a three-stage 3D object label generation pipeline for different sensors covering various ranges. First, a pre-trained, high-precision LiDAR detection model (i.e., [17]) is adopted to produce 3D objects over the full view using high-quality Pandar64 points as input. Then, expert annotators refine the initial 3D boxes continuously throughout the whole sequence of each scene, based on spliced point clouds obtained by aligning the 4 vehicle-side LiDARs to the Ego-Vehicle coordinate through affine transformation. Besides, annotators need to supplement surrounding 3D boxes at near range which are not scanned by the top Hesai LiDAR or fail to be detected owing to heavy occlusion and truncation. Last but not least, invalid 3D annotations are excluded for the target LiDAR coordinate and Camera coordinate respectively, i.e., annotated objects that are not covered by the corresponding sensor data. Through multiple validation steps, highly accurate annotations can be achieved in both near and far ranges. We also release the intermediate Pandar64 points for research usage.
| Task | Method | Vehicle@IoU=0.7/0.3: 3D AP | AOS | ASE | Cyclist@IoU=0.5/0.3: 3D AP | AOS | ASE | Pedestrian@IoU=0.5/0.3: 3D AP | AOS | ASE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiDAR 3D Detection | PointPillar [17] | 43.7 | 45.5 | 13.3 | 39.5 | 39.6 | 69.2 | 52.6 | 36.6 | 34.9 |
| | SECOND [41] | 55.8 | 59.8 | 17.2 | 52.3 | 53.3 | 65.9 | 61.7 | 46.9 | 37.5 |
| | PVRCNN [31] | 53.5 | 57.9 | 16.9 | 53.0 | 50.7 | 55.9 | 58.9 | 43.4 | 38.4 |
| | Transfusion-L [2] | 65.8 | 66.3 | 17.3 | 59.3 | 71.0 | 78.5 | 67.1 | 56.0 | 42.7 |
| Multi-view 3D Detection | BEVDet [14] | 32.1 | 21.8 | 10.4 | 19.9 | 21.2 | 36.8 | 25.9 | 29.7 | 20.3 |
| | BEVDet4D [13] | 33.5 | 22.8 | 10.4 | 20.1 | 21.1 | 36.7 | 26.2 | 28.3 | 17.7 |
| | BEVDepth [18] | 33.4 | 22.8 | 10.2 | 22.6 | 22.2 | 41.6 | 27.7 | 28.1 | 17.9 |
| | BEVFormer [19] | 33.6 | 23.0 | 10.3 | 23.4 | 22.1 | 35.3 | 28.0 | 29.5 | 17.8 |

B.3 Occupancy Label Preprocess
Occupancy label generation can be primarily divided into two parts: point cloud densification and occupancy label determination. Unlike the existing counterpart [34], which only utilizes the sparse keyframe LiDAR points, we find a multi-frame aggregation operation to be indispensable for dense occupancy generation. For dynamic objects, the extracted dynamic points of neighboring frames are concatenated for each object along the corresponding trajectory, thus achieving point cloud densification. For static scenes, a coordinate transformation is performed from the ego-vehicle coordinate to the global coordinate across time using ego-pose information, and then all static points are simply aggregated in the ego-vehicle coordinate of the current keyframe through concatenation.
Notably, owing to the complex driving scenarios with uneven ground and rapid pose changes, especially when turning to avoid obstacles during data collection, pose drifts are observed in the IMU data. Therefore, the temporal aggregation results of point clouds are inferior, with a misaligned horizon and ego-motion blur as shown in Fig. A2. To alleviate these issues, ICP (Iterative Closest Point) [30] is additionally applied for static scene point registration before multi-frame aggregation. Finally, the densified point cloud for a single frame can be obtained by fusing the static scenes with the dynamic objects.
Given the dense points of a specific scene, we label all voxels within a fixed range at a fixed resolution, based on the height of the majority of points inside each voxel. If the height is larger than a threshold, the voxel state is set to “occupied”, otherwise “free”. Moreover, considering occlusion and truncation, some occupied voxels are actually not scanned by the LiDAR beams or covered by the camera views. Hence we set such voxels, which are invisible from both the LiDAR and camera views, to the “unknown” state by tracing the casting rays.
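As a simplified illustration of this labelling rule, the sketch below voxelizes densified points and marks voxels as occupied by a per-point height test (an approximation of the majority-height rule); the actual range, resolution and threshold values, as well as the ray-casting step that produces “unknown” voxels, are not reproduced here.

```python
import numpy as np

def label_voxels(points, pc_range, voxel_size, height_thr):
    """Assign occupied (1) / free (0) states to voxels from densified points.

    points:     (N, >=3) array of x, y, z coordinates.
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max), assumed values.
    voxel_size: scalar edge length (m), assumed value.
    height_thr: height threshold (m) above the range floor, assumed value.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    dims = np.floor(np.array([x_max - x_min, y_max - y_min, z_max - z_min])
                    / voxel_size).astype(int)
    occ = np.zeros(dims, dtype=np.uint8)

    idx = np.floor((points[:, :3] - np.array([x_min, y_min, z_min]))
                   / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, pts = idx[inside], points[inside]

    heights = pts[:, 2] - z_min  # height above the range floor
    for (i, j, k), h in zip(idx, heights):
        if h > height_thr:
            occ[i, j, k] = 1     # occupied; remaining voxels stay free
    return occ
```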
B.4 Metric Comparison
In addition to the evaluation of 3D detection results with the proposed matching criteria (Center-Point distance and Closest Collision-Point distance), we also provide the corresponding evaluation results using the traditional 3D IoU (Intersection-over-Union) matching criterion for comparison, as shown in Tab. 9. It is obvious that, without distance differentiation, the 3D AP results for both LiDAR-based and Camera-based methods are all at a low level, which cannot reflect the objective performance and fails to satisfy the practical application requirements of the detection model. In contrast, the proposed matching criterion is designed to measure the capability of locating the closest collision points of nearby obstacles, which is more challenging and essential for low-speed driving scenarios.
B.5 Scene Distribution

Our RoboSense dataset contains 7.6K sequences, covering 6 main categories (across 22 different locations) of outdoor or semi-closed scenarios (i.e., S1-parks, S2-scenic spots, S3-squares, S4-campuses, S5-sidewalks and S6-streets). Fig. A3 illustrates the scene distribution of the data collected for the RoboSense dataset, which surrounds Dishui Lake in Shanghai, China, with several markers drawn on Google Maps indicating the main data collection locations. Besides, illustrations of representative scenarios from the collected locations are shown in Fig. A4-A9 respectively.





