
SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection

Qiu Zhou, Jinming Cao1∗, Hanchao Leng2, Yifang Yin3, Yu Kun2,
Roger Zimmermann1
1National University of Singapore, Singapore  2Xiaomi Group, China  3A*STAR, Singapore
Joint first authors with equal contributions. Corresponding author.
Abstract

In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird’s Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection), which leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector perceive the scene in a more holistic view. Our SOGDet is flexible to use and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the exclusive nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that the combination of 3D object detection and 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems. The code is available at: https://github.com/zhouqiu/SOGDet.

1 Introduction

Autonomous driving has become a burgeoning field for both research and industry, with a notable focus on achieving accurate and comprehensive perception of the 3D environment. Recently, Bird’s Eye View (BEV) based methods [15; 14; 21; 19] have attracted extensive attention in 3D object detection due to their effectiveness in reducing computational costs and memory footprints. The common paradigm is to take multi-view images as input to detect objects, among which the notable BEVDet [15] serves as a strong baseline. BEVDet first extracts image features from the multi-view images using a typical backbone network such as ResNet [13]. The features are thereafter mapped to the BEV space with a view transformer [30], followed by a convolutional network and a detection head. Inspired by BEVDet, follow-up studies [19; 14; 12] have integrated additional components into this framework, such as depth supervision [19] and temporal modules [14].

Figure 1: Illustration of the 3D object detection and semantic occupancy prediction tasks. In the rightmost legend, the top 10 categories in the blue box are shared by both tasks, and the bottom 6 categories in the green box are used exclusively by semantic occupancy prediction. (a) 3D object detection usually focuses on objects on roads, such as bicycles and cars. In contrast, 3D semantic occupancy prediction (b) focuses more on the physical context (e.g., sidewalk and vegetation) in the environment. By combining the two (c), we obtain a more comprehensive perception of the traffic conditions, such as pedestrians and bicycles appearing mainly on the sidewalk, and cars and buses co-appearing on the drive surface.

Despite the significant improvement in localizing and classifying specific objects such as cars and pedestrians, most existing methods [15; 14; 21; 19] neglect the physical context in the environment. Such context, including roads, pavements, and vegetation, though not of interest for detection itself, still offers important cues for perceiving 3D scenes. For example, as shown in Figure 1, cars mostly appear on the drive surface rather than on the sidewalk or vegetation. To harness such features for object detection, we turn to a recently emerging task, 3D semantic-occupancy prediction [16; 20; 39; 37], which voxelizes the given scene and then performs semantic segmentation over the resulting voxels. This task not only predicts the occupancy status but also identifies the semantic class within each occupied voxel, thereby enabling the comprehension of the physical context. As shown in Figure 1, 3D object detection and semantic occupancy prediction focus on objects on roads and on the environmental context, respectively. Combining the two leads to the hybrid features in Figure 1(c), which provide a more comprehensive description of the scene, such as the location and orientation of cars driving on the drive surface, and the presence of pedestrians on sidewalks or crossings.

Motivated by this observation, we propose a novel approach called SOGDet, which stands for Semantic-Occupancy Guided Multi-view 3D Object Detection. To the best of our knowledge, our method is the first of its kind to employ a 3D semantic-occupancy branch (OC) to enhance 3D object detection (OD). Specifically, we leverage a BEV representation of the scene to predict not only the pose and type of 3D objects (OD branch) but also the semantic class of the physical context (OC branch). SOGDet is a plug-and-play approach that can be seamlessly integrated with existing BEV-based methods [15; 14; 19] for 3D object detection and trained end-to-end. Moreover, to better facilitate the OD task, we explore two labeling approaches for the OC branch, where one predicts only a binary occupancy label and the other involves the semantic class of each voxel. Based on these two approaches, we train two variants of SOGDet, namely SOGDet-BO and SOGDet-SE. Both variants significantly outperform the baseline methods, demonstrating the effectiveness of our proposed approach.

We conduct extensive experiments on the exclusive nuScenes [3] dataset to evaluate the effectiveness of our proposed method. In particular, we apply SOGDet to several state-of-the-art backbone networks [13; 26; 6] and compare it to various commonly used baseline methods [14; 19]. Our experimental results demonstrate that SOGDet consistently improves the performance of all tested backbone networks and baseline methods on the 3D OD task in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). In addition, our OC branch surprisingly achieves performance comparable to state-of-the-art methods [16]. This finding is a promising byproduct beyond our expectation, as we intentionally design a simple OC network and devote little effort to tuning it. Together, these results highlight the effectiveness of combining 3D OD and OC in achieving comprehensive 3D environment understanding, further enabling the development of robust autonomous driving systems.

2 Related Work

2.1 3D Object Detection (OD)

3D object detection constitutes an indispensable component in autonomous driving [1; 8]. Prior monocular methods [10; 4; 18; 31] predict 3D bounding boxes using single-view images. For example, D4LCN [10] uses an estimated depth map to enhance the image representation. Cai et al. [4] used an object-height prior to invert the 2D structured polygon into a 3D cuboid. However, due to the limitations of scarce data and single-view input, such models demonstrate difficulties in tackling more complex scenarios [15]. To overcome this problem, recent studies [15; 14; 19] have turned to large-scale benchmarks [3; 33] with multiple camera views. For example, inspired by the success of FCOS [34] in 2D detection, FCOS3D [36] treats the 3D OD problem as 2D OD. Specifically, FCOS3D conducts perception solely in the image view and exploits the strong spatial correlation between the targets’ attributes and image appearance. Based on FCOS3D, PGD [35] introduces a geometric relation graph to facilitate depth prediction of the targets. Benefiting from DETR [7], some approaches have also explored Transformer-based designs, such as DETR3D [38] and Graph-DETR3D [9].

Unlike the aforementioned methods, BEVDet [15] leverages a Lift-Splat-Shoot (LSS) based [30] detector to perform 3D OD from multiple views. The framework is explicitly designed to encode features in the BEV space, making it scalable for multi-task learning, multi-sensor fusion, and temporal fusion [14]. The framework has been extensively studied by follow-up work, such as BEVDepth [19], which enhances depth prediction by introducing a camera-aware depth network. Additionally, BEVDet4D [14] and BEVFormer [21] extend BEVDet along the temporal and spatiotemporal dimensions, respectively. Our proposed method also builds upon the BEVDet framework. Specifically, we introduce a semantic occupancy branch to guide the prediction of the object detector, a paradigm that has not been studied by existing efforts.

Figure 2: The overall network architecture. Our approach includes an image backbone that encodes the multi-view input images into vision features, a view transformer that transforms the vision features into a BEV feature, and a task stage comprising OD and OC branches that predict the OD and OC outputs simultaneously.

2.2 3D Semantic Occupancy Prediction (OC)

3D semantic occupancy prediction (OC) has emerged as a popular task in the past two years [5; 16; 20; 28; 39; 37]. It involves assigning an occupancy probability to each voxel in the 3D space. The task offers useful 3D representations for multi-shot scene reconstruction, as it ensures the consistency of multi-shot geometry and helps recover occluded parts [32].

The existing methods are relatively sparse in the literature. MonoScene [5] is the pioneering work that uses monocular images to infer dense 3D voxelized semantic scenes. However, simply fusing its per-camera results with cross-camera post-processing often leads to sub-optimal results. VoxFormer [20] devises a two-stage framework to output full 3D volumetric semantics from 2D images: the first stage selects a sparse set of depth-estimated visible and occupied voxels, followed by a densification stage that generates dense 3D voxels from the sparse ones. TPVFormer [16] performs end-to-end training using sparse LiDAR points as supervision, resulting in more accurate occupancy predictions.

It is worth noting that the methods discussed above are specifically designed to tackle the 3D OC task. In contrast, our focus in this paper is to improve 3D OD by incorporating 3D semantic occupancy as a supportive branch. As a result, we empirically resort to a simple network structure for the OC branch to validate the key idea, and we follow [16] rather than adopting standard OC labeling protocols.

3 Method

3.1 Overall Architecture and Notations

The overall architecture of our proposed method is illustrated in Figure 2. It is composed of three main components: an image backbone, a view transformer, and a task stage that predicts both OC and OD simultaneously. Specifically, the multi-view input images are first encoded by the image backbone, and then aggregated and transformed into the Bird’s Eye View (BEV) feature by the view transformer. Using the camera parameters, the view transformer conducts depth-aware multi-view fusion and 4D temporal fusion simultaneously. Thereafter, the task stage generates both OC and OD features, which interact through a modality-fusion module. We finally predict the OD and OC outputs from their respective features.

To ensure clarity and consistency throughout our presentation, we first define the following notations following the order of data flow within our pipeline.

$I$ represents an image group with the same height and width captured by $N$ cameras at the same timestamp. $\boldsymbol{F_{img}}\in\mathbb{R}^{N\times C\times H\times W}$ represents the feature map produced by the image backbone, where $H$, $W$ and $C$ denote the height, width and number of channels of the feature map, respectively. $\boldsymbol{F_{d}}\in\mathbb{R}^{N\times D\times H\times W}$ represents the depth estimation of the multi-view image group $I$. $\boldsymbol{F_{bev}}\in\mathbb{R}^{C_{bev}\times X\times Y}$ represents the BEV feature extracted by the view transformer, where the dimensions of the BEV plane are $X\times Y$ and the number of channels of the BEV feature is $C_{bev}$, set to 128 following [14]. $\boldsymbol{F_{od}}$ and $\boldsymbol{F_{oc}}$ represent the task-specific intermediate features of the OD and OC branches in the task stage.

For the camera parameters, we combine the offset vector and rotation matrix to represent the transformation $\boldsymbol{TR}\in\mathbb{R}^{4\times 4}$ from a source coordinate system to a target coordinate system. For example, $\boldsymbol{TR_{cam}^{lid}}$ denotes the transformation from the camera coordinate system to the lidar coordinate system, and $\boldsymbol{TR_{in}}$ represents the intrinsic parameters of all cameras.

For the output, the OD branch has two outputs: the bounding boxes $\boldsymbol{B}\in\mathbb{R}^{M\times(3+3+2+2+1)}$ and the heatmap $\boldsymbol{H}$, where $M$ is the total number of bounding boxes and the second dimension of $\boldsymbol{B}$ encodes location, scale, orientation, velocity and attribute. $\boldsymbol{Occ}\in\mathbb{R}^{O\times X\times Y\times Z}$ represents the OC branch output, i.e., for each cell of the voxel grid $\boldsymbol{V}\in\mathbb{R}^{X\times Y\times Z}$ there are $O$ semantic labels in total. We generate the occupancy voxel grid from the point cloud $\boldsymbol{P}\in\mathbb{R}^{K\times 3}$ of $K$ points.
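To make the data flow concrete, the following minimal sketch instantiates dummy tensors with the shapes defined above; all numeric sizes (number of cameras, depth bins, BEV resolution, etc.) are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch

# Illustrative sizes only; actual values depend on the configuration.
N, C, H, W = 6, 256, 16, 44          # cameras, channels, feature height/width
D = 59                               # number of depth bins
C_bev, X, Y, Z = 128, 128, 128, 16   # BEV channels and voxel grid extents
O, M, K = 17, 500, 30000             # semantic labels, boxes, lidar points

F_img = torch.randn(N, C, H, W)            # image features from the backbone
F_d = torch.randn(N, D, H, W)              # per-pixel depth distributions
F_bev = torch.randn(C_bev, X, Y)           # BEV feature from the view transformer
B = torch.randn(M, 3 + 3 + 2 + 2 + 1)      # location, scale, orientation, velocity, attribute
Occ = torch.randn(O, X, Y, Z)              # semantic occupancy logits
P = torch.randn(K, 3)                      # lidar point cloud
TR_cam_to_lid = torch.eye(4)               # camera-to-lidar transformation
```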

3.2 Image Backbone

The image backbone encodes the multi-view input images $I$ into the feature map $\boldsymbol{F_{img}}$. Following previous work [15; 14], we sequentially concatenate ResNet [13] and FPN [22] as our image backbone to extract the image features. Moreover, we empirically found that using ShapeConv [6] instead of traditional convolutional layers in the image backbone leads to improved accuracy on the OD task without increasing model complexity during inference. In view of this, all ResNet-50 and -101 models in our method and the baselines are replaced with ShapeConv for a fair comparison. Detailed ablation studies on the performance obtained by ShapeConv can be found in the Supp.

3.3 View Transformer

The view transformer converts the image feature $\boldsymbol{F_{img}}$ into the BEV feature $\boldsymbol{F_{bev}}$. We implement this module as a combination of BEVDepth [19] and BEVDet4D [14], namely BEVDet4D-depth, which jointly conducts depth-aware multi-view fusion and 4D temporal fusion based on BEVDepth and BEVDet4D, respectively.

3.3.1 Depth-aware Multi-view Fusion.

Following BEVDepth [19], the depth feature $\boldsymbol{F_{d}}$ is estimated by a depth network from the image feature $\boldsymbol{F_{img}}$ and the camera parameters $\boldsymbol{TR_{in}}$,

$\boldsymbol{F_{d}}=\mathrm{DepthNet}(\boldsymbol{F_{img}},\boldsymbol{TR_{in}})$.   (1)

Here, we use the notation $\mathrm{DepthNet}(\cdot,\cdot)$ to refer to the sub-network introduced in [19], which is composed of a series of convolutional layers and MLPs.

Then Lift-Splat-Shoot (LSS) [30] is applied to calculate the BEV feature $\boldsymbol{F_{bev}}$ as follows,

$\boldsymbol{F_{bev}}=\mathrm{LSS}(\boldsymbol{F_{img}},\boldsymbol{F_{d}},\boldsymbol{TR_{cam}^{lid}})$,   (2)

where $\mathrm{LSS}(\cdot,\cdot,\cdot)$ is a depth-aware transformation based on [30] following [19]: it first lifts the image feature $\boldsymbol{F_{img}}$ and its depth feature $\boldsymbol{F_{d}}$ into the 3D lidar coordinate system via $\boldsymbol{TR_{cam}^{lid}}$, and then splats the 3D features onto the 2D BEV plane to obtain $\boldsymbol{F_{bev}}$.
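To illustrate the lift-splat step in Equation 2, the sketch below implements a simplified, single-scale version under several assumptions: the intrinsics are already scaled to the feature resolution, out-of-range points are clamped to the BEV boundary instead of being dropped, and the sum-pooling splat is unoptimized. It is not BEVDepth's actual implementation.

```python
import torch

def lift_splat(F_img, F_d, TR_cam_to_lid, intrins, depth_bins,
               bev_extent=(-51.2, 51.2), bev_size=128):
    """Simplified single-scale sketch of the depth-aware lift-splat in Equation 2.

    F_img: (N, C, H, W) image features; F_d: (N, D, H, W) depth distributions;
    TR_cam_to_lid: (N, 4, 4) camera-to-lidar transforms; intrins: (N, 3, 3)
    intrinsics scaled to the feature resolution; depth_bins: (D,) depths in metres.
    """
    N, C, H, W = F_img.shape
    D = F_d.shape[1]

    # Lift: outer product of features and depth probabilities -> (N, D, C, H, W).
    frustum_feat = F_d.unsqueeze(2) * F_img.unsqueeze(1)

    # Frustum point coordinates: back-project every (depth bin, pixel) into 3D.
    u = torch.arange(W, dtype=torch.float32)
    v = torch.arange(H, dtype=torch.float32)
    vv, uu = torch.meshgrid(v, u, indexing="ij")              # (H, W)
    pix = torch.stack([uu, vv, torch.ones_like(uu)], dim=-1)  # homogeneous pixel coords
    lid_pts = []
    for n in range(N):
        rays = pix @ torch.inverse(intrins[n]).T              # unit-depth rays, camera frame
        pts = rays.unsqueeze(0) * depth_bins.view(D, 1, 1, 1) # (D, H, W, 3)
        pts_h = torch.cat([pts, torch.ones(D, H, W, 1)], dim=-1)
        lid_pts.append((pts_h @ TR_cam_to_lid[n].T)[..., :3]) # to the lidar frame
    lid_pts = torch.stack(lid_pts)                            # (N, D, H, W, 3)

    # Splat: scatter-add each frustum feature into its (x, y) BEV cell.
    lo, hi = bev_extent
    cell = (hi - lo) / bev_size
    ix = ((lid_pts[..., 0] - lo) / cell).long().clamp(0, bev_size - 1)
    iy = ((lid_pts[..., 1] - lo) / cell).long().clamp(0, bev_size - 1)
    flat_idx = (ix * bev_size + iy).reshape(-1)
    feats = frustum_feat.permute(0, 1, 3, 4, 2).reshape(-1, C)
    bev = torch.zeros(bev_size * bev_size, C).index_add_(0, flat_idx, feats)
    return bev.reshape(bev_size, bev_size, C).permute(2, 0, 1)  # (C, X, Y)

# Usage with dummy inputs (illustrative sizes and intrinsics).
N, C, H, W, D = 6, 64, 16, 44, 59
F_bev = lift_splat(torch.rand(N, C, H, W), torch.softmax(torch.rand(N, D, H, W), 1),
                   torch.eye(4).repeat(N, 1, 1),
                   torch.tensor([[260., 0., 22.], [0., 260., 8.], [0., 0., 1.]]).repeat(N, 1, 1),
                   torch.linspace(1.0, 60.0, D))
```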

3.3.2 4D Temporal Fusion.

Let $\boldsymbol{F_{bev}^{curr}}$ and $\boldsymbol{F_{bev}^{adj}}$ denote the BEV features at the current timestamp and an adjacent timestamp, respectively. We then apply a temporal fusion step following BEVDet4D [14] to aggregate $\boldsymbol{F_{bev}^{curr}}$ and $\boldsymbol{F_{bev}^{adj}}$ using Equation 3,

$\boldsymbol{F_{bev}}=\mathrm{Concat}[\boldsymbol{F_{bev}^{curr}},\boldsymbol{F_{bev}^{adj}}]$,   (3)

where $\mathrm{Concat}[\cdot,\cdot]$ denotes the concatenation of two tensors along the channel dimension.
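Equation 3 itself is a plain channel-wise concatenation; the short sketch below assumes the adjacent BEV feature has already been aligned to the current ego frame (the ego-motion alignment of BEVDet4D is omitted here).

```python
import torch

C_bev, X, Y = 128, 128, 128
F_bev_curr = torch.randn(1, C_bev, X, Y)
F_bev_adj = torch.randn(1, C_bev, X, Y)   # assumed already aligned to the current ego frame

# Equation 3: concatenate the two BEV features along the channel dimension.
F_bev = torch.cat([F_bev_curr, F_bev_adj], dim=1)   # (1, 2 * C_bev, X, Y)
```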

3.4 Task Stage

The task stage consists of two branches that take the BEV feature $\boldsymbol{F_{bev}}$ as input: the OD branch produces the bounding boxes $\boldsymbol{B}$ and the heatmap $\boldsymbol{H}$, while the OC branch produces the occupancy output $\boldsymbol{Occ}$.

On the one hand, the OD branch is our primary task branch, which performs 10-class object detection (car, truck, etc.). On the other hand, the OC branch facilitates object detection by predicting a 3D geometrical voxel grid around the ego vehicle.

To refine the BEV feature $\boldsymbol{F_{bev}}$ in both branches, we first apply a 3-layer ResNet [13] to extract the intermediate features $\boldsymbol{F_{od}}$ and $\boldsymbol{F_{oc}}$ at three different resolutions, namely 1/2, 1/4 and 1/8 of the original height and width. A feature pyramid network [22] is then employed to upsample the features back to the original size. For the OD branch, we use CenterPoint [41] to produce the final heatmap $\boldsymbol{H}$ and bounding boxes $\boldsymbol{B}$ from $\boldsymbol{F_{od}}$. For the OC branch, a simple 3D-Conv head [11] is used to generate the occupancy voxel grid $\boldsymbol{Occ}$ from $\boldsymbol{F_{oc}}$.
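For intuition only, the following is a hypothetical minimal version of such a 3D-Conv head: a 1×1 convolution lifts the 2D BEV feature into a pseudo-3D volume, which a few 3D convolutions then map to per-voxel class logits. The exact layer configuration of the head we use follows [11] and is not reproduced here.

```python
import torch
import torch.nn as nn

class SimpleOccHead(nn.Module):
    """Hypothetical minimal OC head: lifts a 2D BEV feature to voxel logits."""
    def __init__(self, c_bev=128, c_mid=32, z=16, num_classes=17):
        super().__init__()
        self.z, self.c_mid = z, c_mid
        self.lift = nn.Conv2d(c_bev, c_mid * z, kernel_size=1)   # BEV -> pseudo-3D volume
        self.head = nn.Sequential(
            nn.Conv3d(c_mid, c_mid, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, num_classes, kernel_size=1),        # per-voxel class logits
        )

    def forward(self, f_oc):                                     # f_oc: (B, C_bev, X, Y)
        b, _, x, y = f_oc.shape
        vol = self.lift(f_oc).view(b, self.c_mid, self.z, x, y)  # (B, C_mid, Z, X, Y)
        logits = self.head(vol)                                  # (B, O, Z, X, Y)
        return logits.permute(0, 1, 3, 4, 2)                     # (B, O, X, Y, Z)

occ_logits = SimpleOccHead()(torch.randn(2, 128, 128, 128))      # -> (2, 17, 128, 128, 16)
```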

3.4.1 Modality-fusion Module.

The modality-fusion module is essential in our method for performing interactions between the above two branches. We define $\mathbb{G}_{C\rightarrow D}$ to adapt features from OC to OD, and $\mathbb{G}_{D\rightarrow C}$ for the opposite direction. We employ a weighted-average operation parameterized by $\lambda$ to fuse features from the two modalities and empirically set $\lambda=0.9$,

$\left\{\begin{aligned}\boldsymbol{F_{od}}&=(1-\lambda)\cdot\mathbb{G}_{C\rightarrow D}(\boldsymbol{F_{oc}})+\lambda\cdot\boldsymbol{F_{od}},\\ \boldsymbol{F_{oc}}&=(1-\lambda)\cdot\mathbb{G}_{D\rightarrow C}(\boldsymbol{F_{od}})+\lambda\cdot\boldsymbol{F_{oc}}.\end{aligned}\right.$   (4)

Taking OC to OD as an example, Equation 4 shows that a fraction $1-\lambda$ of the OD-branch feature $\boldsymbol{F_{od}}$ is replaced by the adapted OC-branch feature $\mathbb{G}_{C\rightarrow D}(\boldsymbol{F_{oc}})$. $\mathbb{G}_{C\rightarrow D}$ serves as a filter that reduces the modality gap between OD and OC. The operation is applied each time the BEV feature is upsampled within its own branch in the pyramid network [22] mentioned above. We will demonstrate that this strategy can effectively recover information ignored by each original branch and thus fill the modality gap.
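A minimal sketch of Equation 4 is given below; the adapters $\mathbb{G}_{C\rightarrow D}$ and $\mathbb{G}_{D\rightarrow C}$ are modeled here as 1×1 convolutions, which is an assumption for illustration, since the text above only describes them as filters that reduce the modality gap.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Sketch of the modality-fusion step in Equation 4."""
    def __init__(self, channels=128, lam=0.9):
        super().__init__()
        self.lam = lam
        self.g_c2d = nn.Conv2d(channels, channels, kernel_size=1)  # OC -> OD adapter (assumed form)
        self.g_d2c = nn.Conv2d(channels, channels, kernel_size=1)  # OD -> OC adapter (assumed form)

    def forward(self, f_od, f_oc):
        # Both updates use the pre-fusion features, i.e., a symmetric exchange.
        f_od_new = (1 - self.lam) * self.g_c2d(f_oc) + self.lam * f_od
        f_oc_new = (1 - self.lam) * self.g_d2c(f_od) + self.lam * f_oc
        return f_od_new, f_oc_new

f_od, f_oc = torch.randn(2, 128, 64, 64), torch.randn(2, 128, 64, 64)
f_od, f_oc = ModalityFusion()(f_od, f_oc)
```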

Figure 3: Illustration of the two types of labels.

3.5 Occupancy Label Generation

We leverage two types of supervision signals from the occupancy prediction for the OC branch. One is the binary occupancy label (BO), whose supervision is binary, with 0 and 1 representing empty and occupied voxels, respectively. The other is the semantic label (SE), containing 16 semantic classes such as barrier, bicycle, etc. Figure 3 illustrates the two types of labels.

To generate the binary occupancy labels, we consider only the geometry of each voxel, as illustrated in Algorithm 1. This approach is cost-friendly and requires no extra manual annotations.

For the semantic labels, we observe that directly using the sparse semantic occupancy points as ground truth leads to unstable training. Therefore, we follow TPVFormer [16] to optimize the supervision voxel generation, where voxels without semantic labels are masked and ignored. We detail this labeling process in the Supp.

Data: Point cloud P; dimension bounds X_min, X_max, Y_min, Y_max, Z_min, Z_max; resolutions R_X, R_Y, R_Z
Result: Voxel grid V
/* Transform point positions into grid indices */
for p ∈ P do
       p_X, p_Y, p_Z ← p
       for axis ∈ {X, Y, Z} do
             if axis_min ≤ p_axis ≤ axis_max then
                   p_axis ← ⌊(p_axis − axis_min) / R_axis⌋
             else
                   P ← P \ {p}   /* discard out-of-bound points */
                   break
/* Calculate the size of the output voxel grid */
X ← (X_max − X_min) / R_X,  Y ← (Y_max − Y_min) / R_Y,  Z ← (Z_max − Z_min) / R_Z
build V ∈ ℝ^(X×Y×Z)
/* Fill voxels */
for v ∈ V do
       if index(v) ∈ P then v ← 1 else v ← 0
Algorithm 1: Binary occupancy label generation
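For reference, the snippet below is a straightforward NumPy rendering of Algorithm 1; the bounds and voxel resolutions in the usage example are illustrative assumptions, not the exact values used in our experiments.

```python
import numpy as np

def binary_occupancy_labels(points, bounds, resolution):
    """Sketch of Algorithm 1: voxelize a lidar point cloud into a binary grid.

    points: (K, 3) lidar points; bounds: (x_min, x_max, y_min, y_max, z_min, z_max)
    in metres; resolution: (r_x, r_y, r_z) voxel sizes in metres.
    """
    x_min, x_max, y_min, y_max, z_min, z_max = bounds
    r = np.asarray(resolution, dtype=np.float64)
    lo = np.array([x_min, y_min, z_min])
    hi = np.array([x_max, y_max, z_max])

    # Discard points that fall outside the bounding volume.
    in_bounds = np.all((points >= lo) & (points <= hi), axis=1)
    pts = points[in_bounds]

    # Transform point positions into integer grid indices.
    idx = np.floor((pts - lo) / r).astype(np.int64)
    dims = np.floor((hi - lo) / r).astype(np.int64)          # (X, Y, Z)
    idx = np.clip(idx, 0, dims - 1)                          # guard the upper boundary

    # Fill occupied voxels with 1; everything else stays 0.
    grid = np.zeros(dims, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Example: a 102.4 m x 102.4 m x 8 m volume at (0.8, 0.8, 0.5) m resolution (illustrative).
occ = binary_occupancy_labels(np.random.uniform(-60, 60, size=(30000, 3)),
                              (-51.2, 51.2, -51.2, 51.2, -5.0, 3.0),
                              (0.8, 0.8, 0.5))
print(occ.shape)   # (128, 128, 16)
```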

3.6 Training Objectives

3.6.1 Losses of OD Branch.

We adopt the CenterPoint head [41] to produce the final OD bounding-box prediction, on which a Gaussian focal loss [23] and an L1 loss are jointly computed. In the following, we elaborate these two loss functions in turn.

The Gaussian focal loss emphasizes the overall difference between the predicted and actual values across the entire plane. $\boldsymbol{H}$ denotes the heatmap output by the OD branch, which is a probability map recording the likelihood of each pixel belonging to any of the 10 classes. We then embed the ground-truth annotations into a 2D map of the same size as $\boldsymbol{H}$, forming the ground-truth heatmap $\boldsymbol{\widehat{H}}$, namely a one-hot matrix. The Gaussian focal loss is then computed as,

$L_{G}=-\lfloor\boldsymbol{\widehat{H}}\rfloor\log(\boldsymbol{H})(1-\boldsymbol{H})^{\alpha}-(1-\boldsymbol{\widehat{H}})^{\gamma}\log(1-\boldsymbol{H})\boldsymbol{H}^{\alpha}$,   (5)

where $\lfloor\cdot\rfloor$ denotes the floor operation, and $\alpha=2.0$ and $\gamma=4.0$ are intensity parameters following [23].
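A minimal PyTorch sketch of Equation 5 is shown below. The normalization by the number of positive locations is a common convention in CenterPoint-style heads that Equation 5 itself does not spell out, so it is included here only as an assumption.

```python
import torch

def gaussian_focal_loss(heatmap_pred, heatmap_gt, alpha=2.0, gamma=4.0, eps=1e-12):
    """Sketch of Equation 5: Gaussian focal loss on the class heatmaps.

    heatmap_pred: predicted probabilities H in (0, 1); heatmap_gt: ground-truth
    heatmap \hat{H} in [0, 1] whose positive locations equal exactly 1.
    """
    pos_mask = heatmap_gt.floor()                      # 1 only at ground-truth positives
    neg_weight = (1 - heatmap_gt).pow(gamma)           # vanishes at the positives

    pos_loss = -pos_mask * (1 - heatmap_pred).pow(alpha) * (heatmap_pred + eps).log()
    neg_loss = -neg_weight * heatmap_pred.pow(alpha) * (1 - heatmap_pred + eps).log()
    return (pos_loss + neg_loss).sum() / pos_mask.sum().clamp(min=1)

H_hat = torch.zeros(1, 10, 128, 128); H_hat[0, 3, 64, 64] = 1.0   # one positive location
H = torch.rand(1, 10, 128, 128).clamp(1e-4, 1 - 1e-4)
loss = gaussian_focal_loss(H, H_hat)
```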

The L1 loss is employed to optimize the bounding-box statistics, i.e., the absolute differences in location, scale, orientation, velocity and attribute, from a micro perspective. To this end, we compute the L1 distance between the predicted bounding boxes $\boldsymbol{B}$ and their ground truth $\boldsymbol{\widehat{B}}$ as,

$L_{1}=\frac{1}{M}\sum_{m}^{M}|\boldsymbol{B_{m}}-\boldsymbol{\widehat{B}_{m}}|$.   (6)

In this way, the total loss of the OD branch is given as,

$L_{OD}=L_{G}+\mu_{od}L_{1}$,   (7)

where $\mu_{od}=0.25$ is the weight coefficient of the OD branch.

3.6.2 Losses of OC Branch.

For the OC branch, we combine the class-weighted cross-entropy loss $L_{ce}$ with the lovász-softmax loss [2] $L_{lova}$ following [16] as follows,

$L_{OC}=L_{lova}+\mu_{oc}L_{ce}$,   (8)

where $\mu_{oc}$, the weight coefficient of the OC branch, is set to 1 for SOGDet-SE and 6 for SOGDet-BO. For the class weights in $L_{ce}$, we use equal weights for all classes in SOGDet-SE and a 1:2 ratio for empty versus occupied voxels in SOGDet-BO.

3.6.3 Overall Objective.

Combining the above loss functions, we define our final objective as,

$L=L_{OD}+\omega L_{OC}$,   (9)

where $\omega$ is the balancing factor between the OC and OD branches. We empirically set $\omega=10$ to maximize the effectiveness of our multi-task learning framework.
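The final objective is a simple weighted sum of the four loss terms; the sketch below combines Equations 7-9 with the coefficients stated above, assuming the individual losses have already been computed.

```python
import torch

def total_loss(l_gaussian, l_l1, l_lovasz, l_ce,
               mu_od=0.25, mu_oc=1.0, omega=10.0):
    """Sketch of Equations 7-9: combine the OD and OC branch losses.

    mu_oc is 1 for SOGDet-SE and 6 for SOGDet-BO; omega balances the two branches.
    """
    l_od = l_gaussian + mu_od * l_l1          # Equation 7
    l_oc = l_lovasz + mu_oc * l_ce            # Equation 8
    return l_od + omega * l_oc                # Equation 9

loss = total_loss(torch.tensor(1.2), torch.tensor(0.8),
                  torch.tensor(0.5), torch.tensor(0.9))
```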

Table 1: Performance comparison on the nuScenes validation set. As indicated in [26], the complexities of Swin-Tiny and -Small are similar to those of ResNet-50 and -101, respectively.
Method Venue NDS(%)↑ mAP(%)↑
PETR-Tiny ECCV22 43.1 36.1
BEVDet-Tiny arXiv22 39.2 31.2
DETR3D-R50 CoRL22 37.4 30.3
Ego3RT-R50 ECCV22 40.9 35.5
BEVDet-R50 arXiv22 37.9 29.8
BEVDet4D-R50 arXiv22 45.7 32.2
BEVDepth-R50 AAAI23 47.5 35.1
AeDet-R50 CVPR23 50.1 38.7
SOGDet-BO-R50 - 50.2 38.2
SOGDet-SE-R50 - 50.6 38.8
BEVerse-Small arXiv22 49.5 35.2
PETR-R101 ECCV22 42.1 35.7
UVTR-R101 NeurIPS2022 48.3 37.9
PolarDETR-T-R101 arXiv22 48.8 38.3
BEVFormer-R101 ECCV22 51.7 41.6
BEVDepth-R101 AAAI23 53.5 41.2
PolarFormer-R101 AAAI23 52.8 43.2
AeDet-R101 CVPR23 56.1 44.9
SOGDet-BO-R101 - 55.4 43.9
SOGDet-SE-R101 - 56.6 45.8

4 Experiments

Table 2: Performance comparison on the nuScenes test set.
Method Venue NDS(%)↑ mAP(%)↑ mATE↓ mASE↓ mAOE↓ mAVE↓ mAAE↓
FCOS3D [36] ICCV21 42.8 35.8 0.690 0.249 0.452 1.434 0.124
DD3D [29] ICCV21 47.7 41.8 0.572 0.249 0.368 1.014 0.124
PGD [35] CoRL22 44.8 38.6 0.626 0.245 0.451 1.509 0.127
BEVDet [15] arXiv22 48.2 42.2 0.529 0.236 0.395 0.979 0.152
BEVFormer [21] ECCV22 53.5 44.5 0.631 0.257 0.405 0.435 0.143
DETR3D [38] CoRL22 47.9 41.2 0.641 0.255 0.394 0.845 0.133
Ego3RT [27] ECCV22 47.3 42.5 0.549 0.264 0.433 1.014 0.145
PETR [24] ECCV22 50.4 44.1 0.593 0.249 0.383 0.808 0.132
CMT-C [40] ICCV23 48.1 42.9 0.616 0.248 0.415 0.904 0.147
PETRv2 [25] ICCV23 55.3 45.6 0.601 0.249 0.391 0.382 0.123
X3KD [17] CVPR23 56.1 45.6 0.506 0.253 0.414 0.366 0.131
SOGDet-BO - 57.8 47.1 0.482 0.248 0.390 0.329 0.125
SOGDet-SE - 58.1 47.4 0.471 0.246 0.389 0.330 0.128
Table 3: Comparison with the State-of-the-Art OC method on the nuScenes val set.
Method Venue mIoU(%)↑ category-wise IoU (%)↑ over the 16 semantic classes (per-class icons omitted)
TPVFormer CVPR23 59.3 64.9 27.0 83.0 82.8 38.3 27.4 44.9 24.0 55.4 73.6 91.7 60.7 59.8 61.1 78.2 76.5
SOGDet-SE - 58.6 57.8 30.7 74.9 74.7 43.7 42.0 44.5 32.7 62.6 63.9 85.9 54.3 54.6 58.9 76.9 80.2

4.1 Experimental Setup

4.1.1 Dataset and metrics.

We conducted extensive experiments on the nuScenes [3] and Panoptic nuScenes [43] datasets, which currently form the exclusive benchmark for both 3D object detection and occupancy prediction. Following standard practice [15; 12], we used the official splits of the dataset: 700 and 150 scenes for training and validation, respectively, and the remaining 150 for testing.

For the OD task, we reported the nuScenes Detection Score (NDS), mean Average Precision (mAP), mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE), and mean Average Attribute Error (mAAE). Among them, NDS and mAP are the most representative ones.

For the OC task, we designed two types of occupancy labeling approaches. For the binary occupancy labeling approach, we are, to the best of our knowledge, the first to employ such a labeling scheme in the literature, and we therefore performed qualitative experiments. For the semantic labeling approach, we maintained a consistent experimental protocol with the state-of-the-art method TPVFormer [16]. Accordingly, we report the mean Intersection over Union (mIoU) over all semantic categories.

4.1.2 Implementation details.

To demonstrate the effectiveness and generalization capability of SOGDet, we used several popular architectures [19; 14]. To ensure that any improvements were solely due to SOGDet, we kept most experimental settings, such as the backbone and batch size, untouched and added only the OC branch. Unless otherwise noted, our baseline model is BEVDet4D-depth, a fusion of the two recent multi-view 3D object detectors BEVDepth [19] and BEVDet4D [14], as described in Section 3. We followed the experimental protocol of AeDet [12], training on eight 80G A100 GPUs with a mini-batch size of 8 per GPU (a total batch size of 64) for 24 epochs with CBGS [42], using the AdamW optimizer with a learning rate of 2e-4.

4.2 Comparison with State-of-the-Art

We evaluated the performance of our SOGDet model against other state-of-the-art multi-view 3D object detectors on the nuScenes validation and test sets.

Table 1 reports the results on the validation set using the Swin-Tiny, Swin-Small, ResNet-50 and ResNet-101 backbones (detailed results can be found in the Supp.). As shown in the table, our method achieves highly favorable performance, with NDS scores of 50.2% and 55.4% for SOGDet-BO and 50.6% and 56.6% for SOGDet-SE on ResNet-50 and -101, respectively. These results surpass current state-of-the-art multi-view 3D object detectors by a clear margin, including BEVDepth [19] (3.1% improvement in NDS on both ResNet-50 and -101) and AeDet [12] (0.5% improvement in NDS on both ResNet-50 and -101).

In Table 2, we present the results obtained by SOGDet with the ResNet-101 backbone on the nuScenes test set, where we report the performance of state-of-the-art methods that use the same backbone network for a fair comparison. We follow the same training strategy as existing approaches [19; 12], which utilize both the training and validation sets to retrain the networks, and we apply no test-time augmentation. SOGDet shows improved performance on the multi-view 3D OD task with 58.1% NDS and 47.4% mAP, further verifying the effectiveness of our proposed approach.

Figure 4: Visualization of the OD and OC branches of SOGDet. The input consists of six multi-view images. For both the output and the GT (red box) columns, from top to bottom, we sequentially show the predictions of SOGDet-SE for OD, SOGDet-SE for OC and SOGDet-BO for OC. The hybrid feature is blended from the OD and OC branch predictions of SOGDet-SE.

4.3 Visualization

Figure 4 illustrates qualitative results of our approach on the nuScenes [3] dataset using ResNet-50 as the backbone for both the OD and OC branches; more results can be found in the Supp. Pertaining to the object detection task, we focus only on occupied voxels, and therefore locations marked as “empty” are not shown. The hybrid features reveal strong correlations between the physical structures and the locations of the detected objects, such as vehicles, bicycles, and pedestrians. For example, vehicles are typically detected on the drive surface, while bicycles and pedestrians are often detected on the sidewalk. These findings are consistent with the observations and motivations of our paper and demonstrate that the integration of the two branches can lead to a better perception and understanding of the real world.

4.4 Ablation Study

4.4.1 Comparison with the State-of-the-Art OC method

To further evaluate the effectiveness of our approach, we compared the per-category semantic occupancy performance of our method against TPVFormer [16], both taking multi-view images as input, and present the results in Table 3. The backbones of the two methods have comparable complexity.

The primary goal of our work is to enhance 3D OD by integrating 3D OC. Despite its simplicity, our SOGDet, as shown in Table 3, is comparable to TPVFormer, a state-of-the-art method specifically designed for the OC task. Moreover, our method even outperforms this baseline in certain categories such as bicycle, vegetation, and others. This result indicates that the combination of the two branches benefits the OC branch as well, serving as another byproduct.

Table 4: Performance comparison with different baselines.
Backbone Architecture Method mAP(%)↑ NDS(%)↑
Tiny BEVDet Baseline 31.2 39.2
Tiny BEVDet SOGDet-SE 32.9 41.5
Tiny BEVDet4D Baseline 33.8 47.6
Tiny BEVDet4D SOGDet-SE 34.6 48.7
R50 BEVDepth Baseline 35.1 47.5
R50 BEVDepth SOGDet-SE 37.2 48.3
R50 BEVDet4D-depth Baseline 37.0 49.0
R50 BEVDet4D-depth SOGDet-SE 38.8 50.6

4.4.2 Different baseline architecture

Our proposed SOGDet is a flexible method that can be seamlessly integrated into most BEV-based multi-view object detection architectures. To evaluate the generalization capability of our method, we tested its effectiveness on several representative baseline architectures, namely BEVDet [15], BEVDet4D [14], BEVDepth [19], and BEVDet4D-depth, using the nuScenes validation set. The results in Table 4 show that SOGDet consistently surpasses these baselines under various settings, demonstrating the ability of our method to generalize to different model architectures.

Figure 5: Parameter count (Param.) and floating-point operations (FLOPs).

4.4.3 Complexity Analysis

Efficiency is highly significant in resource-constrained environments. To study our method in this regard, we estimate metrics including floating-point operations (FLOPs) and parameter count (Param.), and show the results in Figure 5. It can be observed that, compared with the SoTA method AeDet [12], our SOGDet is more efficient, especially on the more important metric FLOPs, i.e., 252G vs. 473G. On the other hand, SOGDet outperforms AeDet by 0.5% in terms of NDS. This indicates that our method achieves a better trade-off between efficiency and model performance.

Further ablation experiments, such as the performance of the two OC labeling approaches, different backbones, and hyperparameters, can be found in the Supp.

5 Conclusion and future work

Bird’s Eye View (BEV) based methods have shown great promise in achieving accurate 3D object detection using multi-view images. However, most existing BEV-based methods ignore the physical context in the environment, which is critical to the perception of 3D scenes. In this paper, we propose SOGDet to incorporate such context through a 3D semantic-occupancy branch. In particular, SOGDet predicts not only the pose and type of each 3D object, but also the semantic classes of the physical context, enabling finer-grained detection. Extensive experimental results on the nuScenes dataset demonstrate that SOGDet consistently improves the performance of several popular backbone networks and baseline methods.

In future work, we plan to explore the application of SOGDet with more auxiliary data inputs, such as lidar and radar, to further improve 3D object detection. Additionally, we believe that integrating 3D semantic-occupancy prediction into autonomous driving tasks beyond 3D object detection, such as path planning and decision-making, is a promising avenue for future research.

References

  • [1] Eduardo Arnold, Omar Y Al-Jarrah, Mehrdad Dianati, Saber Fallah, David Oxtoby, and Alex Mouzakitis. A survey on 3d object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems, 20(10):3782–3795, 2019.
  • [2] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pages 4413–4421, 2018.
  • [3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
  • [4] Yingjie Cai, Buyu Li, Zeyu Jiao, Hongsheng Li, Xingyu Zeng, and Xiaogang Wang. Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation. In AAAI, pages 10478–10485, 2020.
  • [5] Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. In CVPR, pages 3991–4001, 2022.
  • [6] Jinming Cao, Hanchao Leng, Dani Lischinski, Daniel Cohen-Or, Changhe Tu, and Yangyan Li. Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In ICCV, pages 7088–7097, 2021.
  • [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • [8] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, pages 1907–1915, 2017.
  • [9] Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang, and Feng Zhao. Graph-detr3d: rethinking overlapping regions for multi-view 3d object detection. In ACM MM, pages 5999–6008, 2022.
  • [10] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convolutions for monocular 3d object detection. In CVPR, pages 1000–1001, 2020.
  • [11] Zhiqi Li Fang Ming. Occupancy dataset for nuscenes, 2023. https://github.com/FANG-MING/occupancy-for-nuscenes.
  • [12] Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, and Lin Ma. Aedet: Azimuth-invariant multi-view 3d object detection. arXiv preprint arXiv:2211.12501, 2022.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [14] Junjie Huang and Guan Huang. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022.
  • [15] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • [16] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2302.07817, 2023.
  • [17] Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, and Fatih Porikli. X3kd: Knowledge distillation across modalities, tasks and stages for multi-camera 3d object detection. In CVPR, pages 13343–13353, 2023.
  • [18] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. Groomed-nms: Grouped mathematically differentiable nms for monocular 3d object detection. In CVPR, pages 8973–8983, 2021.
  • [19] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv preprint arXiv:2206.10092, 2022.
  • [20] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. arXiv preprint arXiv:2302.12251, 2023.
  • [21] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022.
  • [22] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
  • [24] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In ECCV, pages 531–548. Springer, 2022.
  • [25] Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Aqi Gao, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petrv2: A unified framework for 3d perception from multi-camera images. In ICCV, 2023.
  • [26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
  • [27] Jiachen Lu, Zheyuan Zhou, Xiatian Zhu, Hang Xu, and Li Zhang. Learning ego 3d representation as ray tracing. In ECCV, pages 129–144. Springer, 2022.
  • [28] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
  • [29] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In ICCV, pages 3142–3152, 2021.
  • [30] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210. Springer, 2020.
  • [31] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. In CVPR, pages 8555–8564, 2021.
  • [32] Yining Shi, Kun Jiang, Jiusi Li, Junze Wen, Zelin Qian, Mengmeng Yang, Ke Wang, and Diange Yang. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review. arXiv preprint arXiv:2303.01212, 2023.
  • [33] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020.
  • [34] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, pages 9627–9636, 2019.
  • [35] Tai Wang, ZHU Xinge, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.
  • [36] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, pages 913–922, 2021.
  • [37] Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023.
  • [38] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  • [39] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. arXiv preprint arXiv:2303.09551, 2023.
  • [40] Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, Shuailin Li, Tiancai Wang, and Xiangyu Zhang. Cross modal transformer via coordinates encoding for 3d object detection. In ICCV, 2023.
  • [41] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In CVPR, pages 11784–11793, 2021.
  • [42] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492, 2019.
  • [43] Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, and Abhinav Valada. Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. IEEE Robotics and Automation Letters, 7(2):3795–3802, 2022.