You Only Look Bottom-Up for Monocular 3D Object Detection
Abstract
Monocular 3D Object Detection is an essential task for autonomous driving. However, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' locations in 3D space from their 2D sizes on the image plane, which usually ignores the intrinsic position clues in images and leads to unsatisfactory performance. Motivated by the fact that humans can leverage bottom-up positional clues to locate objects in 3D space from a single image, in this paper we explore position modeling from the image feature columns and propose a new method named You Only Look Bottom-Up (YOLOBU). Specifically, our YOLOBU leverages Column-based Cross Attention to determine how much a pixel contributes to the pixels above it. Next, the Row-based Reverse Cumulative Sum (RRCS) is introduced to build the connections of pixels in the bottom-up direction. Our YOLOBU fully explores the position clues for monocular 3D detection by building the relationship of pixels in a bottom-up way. Extensive experiments on the KITTI dataset demonstrate the effectiveness and superiority of our method.
Index Terms:
Deep Learning for Visual Perception; Computer Vision for Automation; Computer Vision for Transportation
I Introduction
Monocular 3D Object Detection has received increasing attention since it provides a low-cost solution for autonomous driving [1, 2, 3, 4]. This task aims to estimate objects' location, orientation, and dimensions from a single RGB image. The main challenge of this task comes from the irreversible loss of depth information in the projection process; recovering the depth information from 2D to 3D is an ill-posed problem.

Most image-based monocular 3D object detection methods [5, 6, 7] infer objects' 3D locations mainly from their 2D sizes on the image plane. However, these methods usually fail when facing the scale ambiguity problem, since only size clues are leveraged. We present a case on the right of Fig. 1(a), where two cars differ in 3D dimensions but share the same depth and appearance information. The contradiction is that the detector should predict the same depth for these two cars even though they have different 2D sizes on the image plane. In other words, such ambiguity can confuse the detector and lead to sub-optimal detection results. Several methods [8, 9] have attempted to leverage the ground plane as prior information to solve this problem. However, they heavily rely on the strong assumption that the ground plane is flat (i.e., the ground plane is smooth and parallel to the bottom of the ego car), which inevitably introduces noise when the road goes from horizontal to uphill or downhill. Naturally, mining the potential 3D prior clues from monocular images is an intuitive and effective way to avoid such problems. Indeed, this is consistent with how humans rationally locate objects from a single image.
We observe that humans can easily handle such cases and locate objects accurately with the help of position clues. As shown in Fig. 1(b), the position clues from the ground contact points of instances down to the bottom of the image plane can be treated as explicit references for locating objects, which eliminates the size ambiguity. Inspired by such rational human behavior, we argue that not only is size information essential for image-based 3D detection, but the bottom-up positional information on the image plane is also indispensable as extra information compensation.
Note that two prerequisites should be met when utilizing the above-mentioned position information: 1) pixels closer to the bottom of the image represent smaller depths; 2) the depth from the bottom of the image up to the target instances is monotonically increasing. Fortunately, these two assumptions can easily be satisfied in most autonomous driving scenes, since the camera is mounted on top of the ego car and the ego car drives on drivable areas [10].
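As an illustrative sanity check of the first prerequisite (not part of our method), consider the standard flat-ground pinhole relation. For a camera at height $h$ above a flat road, with focal length $f$ and principal-point row $c_v$ (image $v$-axis pointing downward), a ground point that projects to image row $v > c_v$ has depth

$$ z = \frac{f\,h}{v - c_v}, $$

so $z$ shrinks as $v$ moves toward the bottom of the image and grows monotonically as one moves upward toward the horizon, matching both prerequisites on flat ground.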
In this paper, we propose a novel method named You Only Look Bottom-Up (YOLOBU), which naturally and succinctly establishes the association of pixels on the image plane in the bottom-up direction. Since not all pixels, either within the same column or across different columns, contribute equally to the localization of the objects above them, we introduce the Column-based Cross Attention (CCA) to model this relationship. CCA also helps suppress noise by assigning different attention weights to each pixel when the assumption fails in some cases (e.g., a large vehicle right in front of the ego car, whose top rear edge would have a smaller depth than the road straight underneath). Different from the cross attention widely used in object detection [11, 12], our CCA calculates the attention weights based on each image feature column. Following CCA, our proposed Row-based Reverse Cumulative Sum (RRCS) builds the connections between pixels within each column in the bottom-up direction. Benefiting from CCA and RRCS, our YOLOBU is better aware of the relationships between pixels at different positions on the image plane.
Extensive experiments are carried out on the KITTI dataset to demonstrate the effectiveness of the proposed YOLOBU. In particular, our YOLOBU achieves state-of-the-art performance for the car category and highly competitive performance for the cyclist/pedestrian categories.
In summary, the main contributions of this paper are twofold: 1) We rethink position modeling for solving the size ambiguity problem in monocular 3D detection and point out that the bottom-up position clue is an overlooked design factor for monocular 3D detectors. 2) We propose YOLOBU, a plug-and-play approach that effectively leverages the position clues from the image and can be generalized to other applications.

II Related Work
II-A 3D Detection with LiDAR
There are plenty of works that conduct 3D object detection with LiDAR inputs. Many works [13, 14, 15, 16, 17] focus on pure LiDAR-based 3D object detection with full supervision. Other methods [18, 19, 20] explore semi-/weakly-supervised learning or even zero-shot learning to reduce labeling costs. To take advantage of the complementarity between images and point clouds, some methods [21, 22, 23] propose different strategies for better fusion. Despite their remarkable performance, the dependence on expensive LiDAR sensors hinders the wide use of these methods. Our method only needs a single cheap camera, making it easier to deploy in practice.
II-B Monocular 3D Detection
Monocular 3D Object Detection methods can be roughly grouped into two categories: depth-guided methods and pure image-based methods.
Depth-guided methods need extra data sources, such as point clouds or depth images, in the training phase. These methods mainly use an extra sub-network to estimate pixel-level depth under the supervision of sparse projected point clouds or even dense depth images, which introduces extra labeling effort. In particular, Pseudo-LiDAR [24] converts pixels into pseudo point clouds and then feeds them into a LiDAR-based detector [16, 17, 25]. Besides, some methods [26, 27, 28, 29] fuse depth features and image features to improve performance, while [30, 31, 32] learn 3D prior information via knowledge distillation. Some methods [33, 34] conduct view transformation with an estimated depth distribution. DD3D [35] claims that pre-training paradigms can replace the pseudo-LiDAR paradigm. These methods benefit from the precise depth information provided by point clouds, which can be considered as using another type of ground truth as extra supervision.
Pure image-based methods only need RGB images and calibration information in the whole inference pipeline. These methods mainly learn depth information from objects' apparent sizes and from geometric constraints provided by eight keypoints or the pinhole camera model [5, 36, 37, 38, 7, 39, 40]. For example, M3D-RPN [5] proposes 2D-3D anchors and depth-wise convolution to detect 3D objects. RTM3D [37] predicts nine keypoints of 3D proposals and optimizes the proposals via a re-projection cost function. MonoDLE [41] proposes three strategies to alleviate problems caused by localization errors. GUPNet [42] proposes a geometry uncertainty projection method for better localization. MVM3Det [43] proposes a feature orthogonal transformation to estimate the position. TIM [44] uses monotonic attention to translate images into bird's-eye-view maps. DID-M3D [45] decouples the object's depth into visual depth and attribute depth. These methods show great promise with their simple structures and competitive performance.
II-C Position Clues for Monocular 3D Detection
Several methods have exploited the position clues provided by the ground plane. Mono3D [1] leverages the ground plane to filter redundant proposals. GAC [8] proposes a ground-aware convolution that extracts depth priors from a ground-plane hypothesis. DeepLine [46] extracts line features via the Hough transform and encodes them into the image features. MonoGround [9] generates dense depth supervision from the ground plane of ground-truth 3D bounding boxes. These methods can alleviate the size ambiguity problem to some extent. However, they do not fully leverage the positional and contextual information of pixels, which plays an important role in monocular 3D detection. Compared with the above methods, our YOLOBU only requires the weaker assumption that the depth of the ground plane is monotonically increasing from the bottom of the image up to the target instances, which is more easily satisfied. Meanwhile, our method does not rely on explicit geometric structures in the scene, such as lines or planes. We explicitly build the relationship of pixels in the bottom-up direction, which serves as a position clue for reasoning about objects' locations in 3D space.
II-D 2D-3D Transform for Image-based Perception
The way 2D image features are transformed into features representing 3D space is crucial for image-based perception. Existing approaches can be roughly categorized into inverse perspective mapping (IPM)-based and depth estimation-based methods. IPM first assumes a flat ground plane and then maps all pixels from a given viewpoint onto this plane through a homography; many methods [47, 48, 49, 50] use IPM for obstacle detection or lane detection. Depth estimation-based methods [34, 33, 24, 51, 45] mainly estimate pixel-level depth or depth distributions to transform image features into 3D features. Since our method acts as a plug-and-play module and operates at the image feature level, it can be adopted by both types of approaches.
III METHOD
Fig. 2(a) shows the pipeline of our YOLOBU, which contains three parts: 1) a CNN-based backbone that extracts the feature map $F$ from the image; 2) the proposed Column-based Cross Attention (CCA), which conducts cross attention between the image features and column-wise queries to produce attention weights for each pixel in each column, yielding the re-weighted feature $F'$; 3) the proposed Row-based Reverse Cumulative Sum (RRCS), which effectively makes pixels perceive the scene below them and outputs the position-aware feature $F_{pa}$. In the following sections, we delve deeper into the key elements of YOLOBU, i.e., CCA and RRCS. For a clearer explanation, we set the origin of the coordinate system at the bottom-left of the image.
III-A Column-based Cross Attention (CCA)
Structured layout pixels (i.e., road pixels) with different semantics and locations contribute differently to object localization. For example, within each column, the road pixels closer to the top of the image plane cover a larger depth interval, so they may contribute more to the localization of far objects and less to near objects. This phenomenon indicates that it is infeasible to treat pixels in the same column equally. Furthermore, the depth interval of pixels varies across columns, inspiring us to pay attention to the differences among columns. To model such relationships, we propose the column-based cross attention mechanism, termed CCA, which leverages cross attention to assign different weights to pixels within each column and uses a distinct query for each column to represent the differences between columns.
Concretely, given the input feature map $F \in \mathbb{R}^{C \times H \times W}$ extracted from the backbone, we define a set of learnable embedding queries $\{Q_i\}_{i=1}^{W}$, where each query $Q_i \in \mathbb{R}^{1 \times C}$ corresponds to one column of the feature map.
We illustrate the mechanism of CCA by taking the $i$-th column as an example. As shown in Fig. 2(b), the sine-cosine position encoding $PE$ is added to the image feature $F$ to obtain the position-encoded feature $F_{pe}$:

$$ F_{pe} = F + PE. \tag{1} $$
Next, we obtain the attention weights by a matrix multiplication followed by an activation function:

$$ W_i = \sigma\!\left(Q_i \, F_{pe}^{\,i}\right), \tag{2} $$

where $Q_i$ is the $i$-th embedding query, $F_{pe}^{\,i} \in \mathbb{R}^{C \times H}$ is the $i$-th column of $F_{pe}$, $\sigma(\cdot)$ denotes the activation function, and $W_i \in \mathbb{R}^{1 \times H}$ represents the attention weights for the $i$-th column.
We stack the per-column weights $\{W_i\}_{i=1}^{W}$ and repeat them along the channel dimension to obtain the complete attention map $\mathcal{W} \in \mathbb{R}^{C \times H \times W}$, and finally we obtain the output feature $F'$ by multiplying $\mathcal{W}$ with $F$:

$$ F' = \mathcal{W} \odot F, \tag{3} $$

where $\odot$ denotes the Hadamard product.
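To make the CCA step concrete, below is a minimal PyTorch sketch of Eqs. 1-3. The module name, the choice of softmax over each column as the activation $\sigma$, and the tensor shapes are illustrative assumptions rather than the exact implementation; the key encoder follows the 1×1 conv, ReLU, 1×1 conv description in Sec. IV-B.

```python
import torch
import torch.nn as nn

class ColumnCrossAttention(nn.Module):
    """Sketch of CCA (Eqs. 1-3): one learnable query per image column
    re-weights the pixels of that column."""

    def __init__(self, channels: int, width: int):
        super().__init__()
        # One learnable embedding query per feature-map column (width = W).
        self.queries = nn.Parameter(torch.randn(width, channels))
        # Key encoder: 1x1 conv -> ReLU -> 1x1 conv, as described in Sec. IV-B.
        self.key_proj = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, feat: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
        # feat, pos_enc: (B, C, H, W). Eq. 1: add the sine-cosine position encoding.
        keys = self.key_proj(feat + pos_enc)
        # Eq. 2: per-column attention weights. For each column w, take the dot
        # product of its query (C,) with every pixel of that column (C, H).
        logits = torch.einsum('wc,bchw->bhw', self.queries, keys)
        weights = logits.softmax(dim=1)  # softmax over rows (assumed activation)
        # Eq. 3: broadcast over channels and re-weight the feature map (Hadamard product).
        return feat * weights.unsqueeze(1)
```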
III-B Row-based Reverse Cumulative Sum (RRCS)
After re-weighting the pixels within each column, we need to build the connections between pixels in the one-way (i.e., bottom-up) direction, since the bottom-up positional information is indispensable as extra information compensation and can serve as a reference for locating objects, as argued in Sec. I. For this purpose, we propose the RRCS, which models the bottom-up relationship between pixels in a row-by-row manner.
Specifically, as illustrated in Fig. 2(c), we conduct a vertical cumulative sum (VCS) on the output $F'$ of the CCA module in the bottom-up direction:

$$ G_j = \sum_{k=1}^{j} F'_k, \tag{4} $$

where $G_j$ and $F'_k$ represent the $j$-th row of the output feature map $G$ and the $k$-th row of the input $F'$, respectively, with rows indexed from the bottom of the image. The RRCS guarantees that each pixel can perceive all of the pixels below it, which can be treated as a dense unidirectional connection at the pixel level. Thus, the encoded features of structured layout pixels (i.e., road pixels) can be propagated to the target objects, acting as a position-related clue for 3D pose reasoning.
Note that the cumulative sum changes the magnitude of the feature values, which may make it hard for the network to converge. Thus, normalization is applied here to keep the values stable, as Eq. 5 shows:

$$ \hat{G} = \mathrm{Norm}(G), \tag{5} $$

where $\mathrm{Norm}(\cdot)$ denotes the normalization operation.
The normalized feature map $\hat{G}$ is then encoded by a convolution layer and added to the backbone feature $F$ to form the position-aware feature $F_{pa}$:

$$ F_{pa} = \mathrm{Conv}(\hat{G}) + F. \tag{6} $$
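A minimal sketch of the RRCS step (Eqs. 4-6) is given below. Image rows are stored top-to-bottom in standard tensor layout, so the bottom-up sum becomes a flip-cumsum-flip; the division by the number of accumulated rows is only one possible instantiation of the normalization in Eq. 5, assumed here for illustration, and the kernel size of the output convolution is likewise an assumption.

```python
import torch
import torch.nn as nn

class RowReverseCumSum(nn.Module):
    """Sketch of RRCS (Eqs. 4-6): bottom-up cumulative sum over rows,
    normalization, a convolution, and a residual connection."""

    def __init__(self, channels: int):
        super().__init__()
        # Output convolution of Eq. 6 (kernel size is an assumption).
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, weighted: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # weighted: CCA output F'; feat: backbone feature F; both (B, C, H, W).
        _, _, H, _ = weighted.shape
        # Eq. 4: rows are stored top-to-bottom, so flip, cumsum, flip back gives
        # each pixel the sum of itself and every pixel below it in its column.
        acc = weighted.flip(dims=[2]).cumsum(dim=2).flip(dims=[2])
        # Eq. 5 (assumed form): divide by the number of accumulated rows so the
        # feature magnitude stays comparable across rows.
        counts = torch.arange(H, 0, -1, device=weighted.device, dtype=weighted.dtype)
        acc = acc / counts.view(1, 1, H, 1)
        # Eq. 6: encode with a convolution and add back the input feature.
        return self.out_conv(acc) + feat
```

Removing the two `flip` calls gives the Up-Bottom variant compared in Tab. VI.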
III-C Training Objective
Our method predicts 3D proposals based on the position-aware feature map $F_{pa}$, on which the detection heads are attached. Our detection head is composed of three branches: a classification branch, a 2D regression branch and a 3D regression branch. Optimizing the 2D and 3D parameters simultaneously helps the network learn the geometric relationship between the image and the 3D world space.
We follow [41] for the training loss. Suppose that $\mathcal{G}$ and $\mathcal{P}$ denote the sets of ground truths and predictions, respectively, where each prediction includes a classification score. The training objective is summarized in Eq. 7:

$$ L = L_{cls} + L_{2D} + L_{3D}, \tag{7} $$

where $L$ is the overall loss, and $L_{cls}$, $L_{2D}$ and $L_{3D}$ represent the classification loss, the 2D detection loss and the 3D detection loss, respectively.
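For completeness, here is a rough sketch of the three-branch head and the loss composition of Eq. 7. The branch layout (3×3 conv, ReLU, 1×1 conv) follows Sec. IV-B; the output dimensionalities and the individual loss functions (which follow MonoDLE [41]) are left as placeholders.

```python
import torch
import torch.nn as nn

def make_branch(channels: int, out_channels: int) -> nn.Sequential:
    # Head branch layout from Sec. IV-B: 3x3 conv -> ReLU -> 1x1 conv.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(channels, out_channels, 1))

class DetectionHead(nn.Module):
    """Sketch of the three head branches operating on the position-aware feature F_pa."""

    def __init__(self, channels: int, num_classes: int,
                 dim_2d: int = 4, dim_3d: int = 8):  # output sizes are placeholders
        super().__init__()
        self.cls_branch = make_branch(channels, num_classes)
        self.reg2d_branch = make_branch(channels, dim_2d)
        self.reg3d_branch = make_branch(channels, dim_3d)

    def forward(self, f_pa: torch.Tensor):
        return (self.cls_branch(f_pa),
                self.reg2d_branch(f_pa),
                self.reg3d_branch(f_pa))

def total_loss(loss_cls: torch.Tensor, loss_2d: torch.Tensor,
               loss_3d: torch.Tensor) -> torch.Tensor:
    # Eq. 7: the overall objective is the sum of the three loss terms.
    return loss_cls + loss_2d + loss_3d
```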
TABLE I: Comparison with state-of-the-art methods for the Car category on the KITTI test set.

Methods | Year | Extra Data | $AP_{3D}$ Easy | $AP_{3D}$ Moderate | $AP_{3D}$ Hard | $AP_{BEV}$ Easy | $AP_{BEV}$ Moderate | $AP_{BEV}$ Hard
---|---|---|---|---|---|---|---|---
D4LCN [27] | CVPR20 | Depth | 16.65 | 11.72 | 9.51 | 22.51 | 16.02 | 12.55 |
CaDDN [34] | CVPR21 | LiDAR | 19.17 | 13.41 | 11.46 | 27.94 | 18.91 | 17.19 |
AutoShape [54] | ICCV21 | CAD Models | 22.47 | 14.17 | 11.36 | 30.66 | 20.08 | 15.95 |
DD3D [35] | ICCV21 | Depth∗ | 23.22 | 16.34 | 14.20 | 30.98 | 22.56 | 20.03 |
SGM3D [32] | RA-L22 | Stereo | 22.46 | 14.65 | 12.97 | 31.49 | 21.37 | 18.43 |
MonoDistill [31] | ICLR22 | LiDAR | 22.97 | 16.03 | 13.60 | 31.87 | 22.59 | 19.72 |
DID-M3D [45] | ECCV22 | LiDAR | 24.40 | 16.29 | 13.75 | 32.95 | 22.76 | 19.83 |
M3D-RPN [5] | ICCV19 | None | 14.76 | 9.71 | 7.42 | 21.02 | 13.67 | 10.23 |
MonoPair [55] | CVPR20 | None | 13.04 | 9.99 | 8.65 | 19.28 | 14.83 | 12.89 |
MonoDLE [41] | CVPR21 | None | 17.23 | 12.26 | 10.29 | 24.79 | 18.89 | 16.00 |
MonoEF [56] | CVPR21 | None | 21.29 | 13.87 | 11.71 | 29.03 | 19.70 | 17.26 |
MonoFlex [57] | CVPR21 | None | 19.94 | 13.89 | 12.07 | 28.23 | 19.75 | 16.89 |
MonoRCNN [58] | ICCV21 | None | 18.36 | 12.65 | 10.03 | 25.48 | 18.11 | 14.10 |
GUPNet [42] | ICCV21 | None | 20.11 | 14.20 | 11.77 | 30.29 | 21.19 | 18.20 |
GAC [8] | RA-L21 | None | 21.65 | 13.25 | 9.91 | 29.81 | 17.98 | 13.08 |
DeepLine [46] | BMVC21 | None | 24.23 | 14.33 | 10.30 | 31.09 | 19.05 | 14.13 |
MonoGround [9] | CVPR22 | None | 21.37 | 14.36 | 12.62 | 30.07 | 20.47 | 17.74 |
HomoLoss [59] | CVPR22 | None | 21.75 | 14.94 | 13.07 | 29.60 | 20.68 | 17.81 |
DCD [38] | ECCV22 | None | 23.81 | 15.90 | 13.21 | 32.55 | 21.50 | 18.25 |
Ours | - | None | 22.43 | 16.21 | 13.73 | 30.54 | 21.66 | 18.64 |
TABLE II: Comparison with state-of-the-art methods for the Pedestrian and Cyclist categories ($AP_{3D}$) on the KITTI test set.

Methods | Year | Pedestrian Easy | Pedestrian Moderate | Pedestrian Hard | Cyclist Easy | Cyclist Moderate | Cyclist Hard
---|---|---|---|---|---|---|---
M3D-RPN [5] | ICCV19 | 4.92 | 3.48 | 2.94 | 0.94 | 0.65 | 0.47 |
MonoPair [55] | CVPR20 | 10.02 | 6.68 | 5.53 | 3.79 | 2.12 | 1.83 |
MonoDLE [41] | CVPR21 | 9.64 | 6.55 | 5.44 | 4.59 | 2.66 | 2.45 |
MonoFlex [57] | CVPR21 | 9.43 | 6.31 | 5.26 | 4.17 | 2.35 | 2.04 |
MonoGround [9] | CVPR22 | 12.37 | 7.89 | 7.13 | 4.62 | 2.68 | 2.53 |
HomoLoss [59] | CVPR22 | 11.87 | 7.66 | 6.82 | 5.48 | 3.50 | 2.99 |
DCD [38] | ECCV22 | 10.37 | 6.73 | 6.28 | 4.72 | 2.74 | 2.41 |
Ours | - | 11.68 | 7.58 | 6.22 | 5.25 | 2.83 | 2.31 |
IV Experiments
IV-A Dataset and Metrics
We first introduce the dataset and evaluation metrics.
KITTI [60], containing 7,481 training samples and 7,518 testing samples (test set), is one of the most widely used datasets for autonomous driving-related tasks. We split the training samples into two parts: 3,712 samples for model training (train set) and 3,769 for validation (val set), following [41]. We test our method on the official server for fair comparisons with other methods.
For evaluation metrics, we report $AP_{3D}$ and $AP_{BEV}$ for the car, pedestrian and cyclist categories with detection IoU thresholds of 0.7, 0.5 and 0.5, respectively. For better analysis, all experiments are evaluated under all three difficulty levels (Easy, Moderate, and Hard), which are defined according to the 2D bounding box height, occlusion, and truncation degrees.
IV-B Implementation Details
We then explain the details of model hyperparameters and training settings in this section.
Model settings. Similar to [41], the input image is first resized to (384, 1280) after random crop and random flip for data augmentation. DLA-34 [61] without deformable convolutions is adopted as our backbone network. The feature maps are downsampled by a factor of 4, and the number of feature channels is 64. The number of query embeddings in the cross attention is equal to the feature map width, i.e., 96. We encode the image features together with the fixed position encoding using a 1×1 convolution, ReLU, and another 1×1 convolution to form the keys for the attention. Each head branch comprises a 3×3 convolution, ReLU, and a 1×1 convolution.
Training and Inference. We implement our method in PyTorch [62]. Our model is trained on two NVIDIA 3090 Ti GPUs in an end-to-end manner for 140 epochs with a batch size of 16. Following [41], we employ the Adam optimizer with an initial learning rate of 1.25e-4 and decay it by a factor of 10 at epochs 90 and 120. We also apply a warm-up strategy (5 epochs) to stabilize the training process. In the inference stage, NMS with an IoU threshold of 0.2 is used to filter proposals.
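The schedule above can be expressed, under our assumption of a linear warm-up shape, as a single learning-rate lambda; the numbers (1.25e-4 initial rate, 10× decay at epochs 90 and 120, 5 warm-up epochs, 140 epochs total) come from the text, while the function and variable names are illustrative.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

WARMUP_EPOCHS, DECAY_EPOCHS, TOTAL_EPOCHS = 5, (90, 120), 140

def lr_lambda(epoch: int) -> float:
    # Linear warm-up over the first 5 epochs (the warm-up shape is an assumption),
    # then decay the learning rate by 10x at epochs 90 and 120.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    return 0.1 ** sum(epoch >= m for m in DECAY_EPOCHS)

def build_training(model: torch.nn.Module):
    # Adam with the initial learning rate reported above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-4)
    scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler
```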
TABLE III: Comparison with other position-aware methods on the KITTI val set.

Methods | $AP_{3D}$ Easy | $AP_{3D}$ Mod | $AP_{3D}$ Hard | $AP_{BEV}$ Easy | $AP_{BEV}$ Mod | $AP_{BEV}$ Hard
---|---|---|---|---|---|---
Baseline | 18.3 | 14.49 | 12.12 | 26.11 | 20.75 | 17.95 |
+CoordConv [63] | 17.94 | 14.76 | 12.59 | 23.96 | 19.63 | 17.91 |
+GAC [8] | 19.64 | 15.20 | 12.69 | 27.64 | 21.45 | 18.44 |
+YOLOBU (Ours) | 20.67 | 15.81 | 14.07 | 27.53 | 22.07 | 19.30 |
IV-C Comparison with State-of-the-Arts Methods
In this part, we compare our method with well-known and effective methods on the KITTI [60] dataset.
The results for Car on the test set are shown in Tab. I. Our method outperforms all listed pure image-based methods on the $AP_{3D}$ metric at the Moderate and Hard difficulties. Specifically, it outperforms DCD [38] by 0.31% and 0.52%, and HomoLoss [59] by 1.27% and 0.66%, at the Moderate and Hard difficulties, respectively. Besides, our method outperforms MonoGround [9] by 1.06%, 1.85% and 1.11% at the Easy, Moderate and Hard settings, respectively. Although our method is inferior to DeepLine [46] under the Easy setting, it performs better under the Moderate and Hard settings, exceeding it by 1.88% and 3.43%, respectively. Even when compared with the listed methods that use extra data [64], our method is still competitive. Specifically, it outperforms the knowledge distillation methods SGM3D [32] by 1.56% and 0.76%, and MonoDistill [31] by 0.18% and 0.13%, at the Moderate and Hard settings. Furthermore, we only observe a 0.13% drop in $AP_{3D}$ at the Moderate setting when compared with DD3D [35], which uses a large amount of data for pre-training. We also achieve results competitive with DID-M3D [45], which leverages LiDAR supervision, at the Moderate and Hard difficulties. This is mainly because our proposed YOLOBU leverages position clues as supplementary information to locate objects.
The results for Pedestrian and Cyclist on the test set are shown in Tab. II. Although the performance of our method on Pedestrian and Cyclist is not as impressive as on Car, YOLOBU still ranks third and second on Pedestrian and Cyclist, lagging behind the best methods by 0.31% and 0.67% at the Moderate setting, respectively.
IV-D Comparison with Position-Aware Methods
Our method performs significantly better than other position-aware methods [63, 8]. We conduct comparative experiments among CoordConv [63], GAC [8] and our YOLOBU on the val set. For the CoordConv [63] setting, we simply concatenate the pixels' coordinates with the features and then use a 1×1 convolution to extract information. For the GAC [8] setting, we use the ground-aware convolution in the regression heads following its official implementation. The results are listed in Tab. III, which shows that our YOLOBU outperforms the other two methods by a large margin on $AP_{3D}$. Specifically, YOLOBU beats CoordConv [63] by 2.73%, 1.05% and 1.48% in $AP_{3D}$ at the Easy, Moderate and Hard difficulties, respectively, and exceeds GAC [8] by 1.03%, 0.61% and 1.38%. This is mainly because our YOLOBU models the bottom-up relationships between pixels and thus encodes more representative position information for monocular 3D detection.
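For reference, our reading of the CoordConv [63] comparison setting is sketched below: per-pixel coordinates are concatenated to the features and fused with a 1×1 convolution. The coordinate normalization and the module name are assumptions.

```python
import torch
import torch.nn as nn

class CoordConvBaseline(nn.Module):
    """Sketch of the CoordConv [63] comparison setting: concatenate per-pixel
    (x, y) coordinates to the features and fuse them with a 1x1 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(channels + 2, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B, _, H, W = feat.shape
        # Coordinate maps normalized to [0, 1] (the normalization is an assumption).
        ys = torch.linspace(0, 1, H, device=feat.device).view(1, 1, H, 1).expand(B, 1, H, W)
        xs = torch.linspace(0, 1, W, device=feat.device).view(1, 1, 1, W).expand(B, 1, H, W)
        return self.fuse(torch.cat([feat, xs, ys], dim=1))
```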
IV-E Ablation Study
We conduct additional experiments to analyze the effectiveness of each component. We use MonoDLE [41] as our baseline and keep all experimental settings the same.
TABLE IV: Ablation study of CCA and RRCS on the KITTI val set.

Methods | CCA | RRCS | $AP_{3D}$ Easy | $AP_{3D}$ Mod | $AP_{3D}$ Hard | $AP_{BEV}$ Easy | $AP_{BEV}$ Mod | $AP_{BEV}$ Hard
---|---|---|---|---|---|---|---|---
Baseline | | | 18.3 | 14.49 | 12.12 | 26.11 | 20.75 | 17.95
(a) | ✓ | | 19.07 | 14.87 | 12.67 | 26.05 | 19.92 | 17.28
(b) | | ✓ | 17.43 | 13.67 | 12.19 | 24.58 | 19.37 | 16.84
(c) | ✓ | ✓ | 20.67 | 15.81 | 14.07 | 27.53 | 22.07 | 19.30
Effectiveness of the CCA. To figure out how much the Column-based Cross Attention (CCA) contributes, we attach the CCA to the backbone of our baseline. As Tab. IV(a) shows, the CCA brings improvements of 0.77%, 0.38% and 0.55% in $AP_{3D}$ on the Easy, Moderate and Hard difficulties, respectively, while slightly reducing $AP_{BEV}$. These results indicate that the contribution of the CCA alone is limited, since the positional relation between pixels is not yet modeled.
Effectiveness of the RRCS. To validate the effectiveness of the Row-based Reverse Cumulative Sum (RRCS) alone, we insert the proposed RRCS after the backbone of our baseline. The results are listed in Tab. IV(b). We observe that adding only the RRCS is detrimental to the performance. Note that the semantic and positional information varies from pixel to pixel, so keeping all the edge weights identical introduces considerable noise into the model.
We emphasize that the CCA and RRCS form an integral whole for establishing the association of pixels: the former generates the edge values, and the latter connects nodes with these edge values. As shown in Tab. IV(c), when we use both the CCA and RRCS, we obtain a significant improvement over the baseline. Our method brings 2.37%, 1.32% and 1.95% improvements in $AP_{3D}$ on the Easy, Moderate and Hard difficulties, respectively, demonstrating the significance of positional relation modeling.
Analysis of attention types. Tab. V quantifies the influence of different attention types in the proposed CCA. For the model without column-based attention, we use a single query for all pixels and conduct global cross attention instead. We can observe that column-based attention is necessary since the semantics of structured layout pixels (e.g., road pixels) vary across columns, and one single query is not representative enough.
TABLE V: Ablation on the attention type in CCA on the KITTI val set.

Methods | $AP_{3D}$ Easy | $AP_{3D}$ Mod | $AP_{3D}$ Hard | $AP_{BEV}$ Easy | $AP_{BEV}$ Mod | $AP_{BEV}$ Hard
---|---|---|---|---|---|---
w/o column-based | 17.95 | 14.77 | 12.63 | 25.85 | 21.08 | 18.39 |
w/ column-based | 20.67 | 15.81 | 14.07 | 27.53 | 22.07 | 19.30
TABLE VI: Ablation on the direction of the cumulative sum on the KITTI val set.

Methods | $AP_{3D}$ Easy | $AP_{3D}$ Mod | $AP_{3D}$ Hard | $AP_{BEV}$ Easy | $AP_{BEV}$ Mod | $AP_{BEV}$ Hard
---|---|---|---|---|---|---
Baseline | 18.3 | 14.49 | 12.12 | 26.11 | 20.75 | 17.95 |
Up-Bottom | 19.37 | 14.79 | 13.33 | 26.34 | 20.95 | 18.40 |
Bottom-Up | 20.67 | 15.81 | 14.07 | 27.53 | 22.07 | 19.30 |
Analysis of the Cumulative Sum direction. To examine the impact of different cumulative sum directions (i.e., Bottom-Up and Up-Bottom), we conduct comparative experiments, and the results are listed in Tab. VI. Building the graph in either the Bottom-Up or Up-Bottom direction improves the performance over the baseline, which demonstrates the effectiveness of associating pixels within columns for the monocular 3D detection task. Meanwhile, building relations in the Bottom-Up direction is clearly better than Up-Bottom. This is consistent with our prior that the pixels that help locate an object mainly occur below it.

TABLE VII: Comparison with FCOS3D on the nuScenes val set.

Methods | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE
---|---|---|---|---|---|---|---
FCOS3D∗ | 0.365 | 0.287 | 0.814 | 0.269 | 0.530 | 1.319 | 0.171 |
Ours | 0.381 | 0.303 | 0.762 | 0.268 | 0.515 | 1.316 | 0.158 |
IV-F Performance on the Large-scale nuScenes Dataset
To further evaluate the effectiveness of our method, we conduct a comparison experiment on the large-scale nuScenes dataset. We train models on the official train split and evaluate them on the val split. For evaluation, we report the official metrics, including the nuScenes Detection Score (NDS), mean Average Precision (mAP), mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE) and mean Average Attribute Error (mAAE).
We use FCOS3D [51] as the baseline and insert the proposed modules (CCA and RRCS) after the neck to refine the features at each scale. Because of the significant increase in the amount of training data, we use 8 NVIDIA 3090 GPUs with a total batch size of 32 to train the models. The results are listed in Tab. VII. Our method outperforms FCOS3D by 1.6% in both NDS and mAP, showing its effectiveness, and the conclusions are consistent with those on the KITTI dataset.
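A minimal sketch of how such a plug-in can be organized: a refinement block (e.g., the CCA + RRCS pair sketched in Sec. III) is instantiated once per FPN level and applied to each scale after the neck. The wrapper below is our own illustrative scaffolding, not code from FCOS3D.

```python
from typing import Callable, List
import torch
import torch.nn as nn

class PerLevelRefine(nn.Module):
    """Sketch: refine every FPN level with an independent block (e.g., CCA + RRCS)."""

    def __init__(self, block_factory: Callable[[], nn.Module], num_levels: int):
        super().__init__()
        # One refinement block per pyramid level.
        self.blocks = nn.ModuleList(block_factory() for _ in range(num_levels))

    def forward(self, neck_feats: List[torch.Tensor]) -> List[torch.Tensor]:
        # Each level is refined independently before being passed to the detection head.
        return [blk(f) for blk, f in zip(self.blocks, neck_feats)]
```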
IV-G Visualizations and Discussions
We visualize the model predictions to analyze our method more intuitively. The results are shown in Fig. 3.
Qualitative results. As we can see, our proposed YOLOBU predicts remarkably precise 3D bounding boxes in scenario (b). Specifically, compared with our baseline MonoDLE [41] in scenario (a), our method correctly detects the car in the top-right corner of the image, while the baseline predicts an inaccurate bounding box. This demonstrates that image-based 3D detection methods benefit from the bottom-up positional information.
Generalization to other applications. Since our method effectively leverages the position clues from images and serves as a plug-and-play module, one can easily adapt it to other applications by simply plugging it into the original solution, as long as the task (e.g., stereo and multi-view 3D object detection in research, or autonomous driving, underwater robots and intelligent logistics in industry) meets the assumptions introduced by our method.
Limitations. Our method has some limitations: 1) It fails to detect truncated and large objects, as shown in Fig. 3(c); the main reason is that such samples cannot benefit from the bottom-up positional information. 2) The inference speed of the proposed method is slightly lower than that of the baseline [41] (22.56 FPS vs. 23.64 FPS); we will explore real-time solutions in the future. 3) Our method does not model the calibration parameters, and the performance may be affected when the calibration parameters change; however, we can easily modify the detection head to model them explicitly for better results.
V CONCLUSIONS
In this paper, we demonstrate that intrinsic position clues from images are important but ignored by most existing image-based methods. We propose a novel method for Monocular 3D Object Detection named YOLOBU, which performs cross attention on each column and conducts a cumulative sum in the bottom-up direction in a row-by-row manner. With these two steps, our method learns position information from the bottom up, which serves as a strong auxiliary clue for solving the size ambiguity problem. Our method achieves competitive performance on the KITTI 3D object detection benchmark using a monocular camera without additional information. We hope our neat and effective approach will serve as a strong baseline for future research in Monocular 3D Object Detection.
References
- [1] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, “Monocular 3d object detection for autonomous driving,” in CVPR, 2016, pp. 2147–2156.
- [2] Y. Liu, L. Wang, and M. Liu, “Yolostereo3d: A step back to 2d for efficient stereo 3d detection,” in ICRA. IEEE, 2021, pp. 13 018–13 024.
- [3] Y. Jung, S.-W. Seo, and S.-W. Kim, “Fast point clouds upsampling with uncertainty quantification for autonomous vehicles,” in ICRA. IEEE, 2022, pp. 7776–7782.
- [4] V. R. Kumar, S. A. Hiremath, M. Bach, S. Milz, C. Witt, C. Pinard, S. Yogamani, and P. Mäder, “Fisheyedistancenet: Self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving,” in ICRA. IEEE, 2020, pp. 574–581.
- [5] G. Brazil and X. Liu, “M3d-rpn: Monocular 3d region proposal network for object detection,” in ICCV, 2019, pp. 9287–9296.
- [6] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object detection via keypoint estimation,” in CVPR, 2020, pp. 996–997.
- [7] Q. Lian, B. Ye, R. Xu, W. Yao, and T. Zhang, “Exploring geometric consistency for monocular 3d object detection,” in CVPR, 2022, pp. 1685–1694.
- [8] Y. Liu, Y. Yixuan, and M. Liu, “Ground-aware monocular 3d object detection for autonomous driving,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 919–926, 2021.
- [9] Z. Qin and X. Li, “Monoground: Detecting monocular 3d objects from the ground,” in CVPR, 2022, pp. 3793–3802.
- [10] A. Frickenstein, M.-R. Vemparala, J. Mayr, N.-S. Nagaraja, C. Unger, F. Tombari, and W. Stechele, “Binary dad-net: Binarized driveable area detection network for autonomous driving,” in ICRA. IEEE, 2020, pp. 2295–2301.
- [11] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV. Springer, 2020, pp. 213–229.
- [12] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in ICLR, 2021.
- [13] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
- [14] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
- [15] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in CVPR, 2019, pp. 12 697–12 705.
- [16] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in CVPR, 2019, pp. 770–779.
- [17] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in AAAI, vol. 34, no. 07, 2020, pp. 11 677–11 684.
- [18] J. Li, Z. Liu, J. Hou, and D. Liang, “Dds3d: Dense pseudo-labels with dynamic threshold for semi-supervised 3d object detection,” ICRA, 2023.
- [19] D. Zhang, D. Liang, Z. Zhou, J. Li, X. Ye, Z. Liu, X. Tan, and X. Bai, “A simple vision transformer for weakly semi-supervised 3d object detection,” in ICCV, 2023.
- [20] D. Zhang, D. Liang, H. Yang, Z. Zou, X. Ye, Z. Liu, and X. Bai, “Sam3d: Zero-shot 3d object detection via segment anything model,” arXiv preprint arXiv:2306.02245, 2023.
- [21] Z. Liu, T. Huang, B. Li, X. Chen, X. Wang, and X. Bai, “Epnet++: Cascade bi-directional fusion for multi-modal 3d object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [22] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2774–2781.
- [23] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, “Bevfusion: A simple and robust lidar-camera fusion framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 421–10 434, 2022.
- [24] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in CVPR, 2019, pp. 8445–8453.
- [25] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, “Voxel r-cnn: Towards high performance voxel-based 3d object detection,” in AAAI, vol. 35, no. 2, 2021, pp. 1201–1209.
- [26] W. Bao, B. Xu, and Z. Chen, “Monofenet: Monocular 3d object detection with feature enhancement networks,” IEEE Transactions on Image Processing, vol. 29, pp. 2753–2765, 2019.
- [27] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, “Learning depth-guided convolutions for monocular 3d object detection,” in CVPRW, 2020, pp. 1000–1001.
- [28] L. Jing, R. Yu, H. Kretzschmar, K. Li, C. R. Qi, H. Zhao, A. Ayvaci, X. Chen, D. Cower, Y. Li, et al., “Depth estimation matters most: Improving per-object depth estimation for monocular 3d detection and tracking,” in ICRA, 2022.
- [29] T. Xie, K. Wang, R. Li, X. Tang, and L. Zhao, “Panet: A pixel-level attention network for 6d pose estimation with embedding vector features,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1840–1847, 2021.
- [30] X. Ye, L. Du, Y. Shi, Y. Li, X. Tan, J. Feng, E. Ding, and S. Wen, “Monocular 3d object detection via feature domain adaptation,” in ECCV. Springer, 2020, pp. 17–34.
- [31] Z. Chong, X. Ma, H. Zhang, Y. Yue, H. Li, Z. Wang, and W. Ouyang, “Monodistill: Learning spatial features for monocular 3d object detection,” in ICLR, 2021.
- [32] Z. Zhou, L. Du, X. Ye, Z. Zou, X. Tan, L. Zhang, X. Xue, and J. Feng, “Sgm3d: Stereo guided monocular 3d object detection,” IEEE Robotics and Automation Letters, 2022.
- [33] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in ECCV. Springer, 2020, pp. 194–210.
- [34] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for monocular 3d object detection,” in CVPR, 2021, pp. 8555–8564.
- [35] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in ICCV, 2021, pp. 3142–3152.
- [36] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “Gs3d: An efficient 3d object detection framework for autonomous driving,” in CVPR, 2019, pp. 1019–1028.
- [37] P. Li, H. Zhao, P. Liu, and F. Cao, “Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving,” in ECCV. Springer, 2020, pp. 644–660.
- [38] Y. Li, Y. Chen, J. He, and Z. Zhang, “Densely constrained depth estimator for monocular 3d object detection,” in ECCV, 2022.
- [39] X. Liu, N. Xue, and T. Wu, “Learning auxiliary monocular contexts helps monocular 3d object detection,” in AAAI, vol. 36, no. 2, 2022, pp. 1810–1818.
- [40] K. Xiong, S. Gong, X. Ye, X. Tan, J. Wan, E. Ding, J. Wang, and X. Bai, “Cape: Camera view position embedding for multi-view 3d object detection,” in CVPR, 2023.
- [41] X. Ma, Y. Zhang, D. Xu, D. Zhou, S. Yi, H. Li, and W. Ouyang, “Delving into localization errors for monocular 3d object detection,” in CVPR, 2021, pp. 4721–4730.
- [42] Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, J. Yan, and W. Ouyang, “Geometry uncertainty projection network for monocular 3d object detection,” in ICCV, 2021, pp. 3111–3121.
- [43] L. Haoran, D. Zicheng, M. Mingjun, C. Yaran, L. Jiaqi, and Z. Dongbin, “Mvm3det: A novel method for multi-view monocular 3d detection,” in ICRA, 2021.
- [44] A. Saha, O. Mendez, C. Russell, and R. Bowden, “Translating images into maps,” in ICRA, 2022, pp. 9200–9206.
- [45] L. Peng, X. Wu, Z. Yang, H. Liu, and D. Cai, “Did-m3d: Decoupling instance depth for monocular 3d object detection,” in ECCV, 2022.
- [46] C. Liu, S. Gu, L. Van Gool, and R. Timofte, “Deep line encoding for monocular 3d object detection and depth prediction,” in BMVC, 2021, p. 354.
- [47] Y. Kim and D. Kum, “Deep learning based vehicle position and orientation estimation via inverse perspective mapping image,” in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 317–323.
- [48] A. Palazzi, G. Borghi, D. Abati, S. Calderara, and R. Cucchiara, “Learning to map vehicles into bird’s eye view,” in Image Analysis and Processing-ICIAP 2017: 19th International Conference, Catania, Italy, September 11-15, 2017, Proceedings, Part I 19. Springer, 2017, pp. 233–243.
- [49] M. Zhu, S. Zhang, Y. Zhong, P. Lu, H. Peng, and J. Lenneman, “Monocular 3d vehicle detection using uncalibrated traffic cameras through homography,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3814–3821.
- [50] L. Reiher, B. Lampe, and L. Eckstein, “A sim2real deep learning approach for the transformation of images from multiple vehicle-mounted cameras to a semantically segmented image in bird’s eye view,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2020, pp. 1–7.
- [51] T. Wang, X. Zhu, J. Pang, and D. Lin, “Fcos3d: Fully convolutional one-stage monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.
- [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
- [53] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018, pp. 7794–7803.
- [54] Z. Liu, D. Zhou, F. Lu, J. Fang, and L. Zhang, “Autoshape: Real-time shape-aware monocular 3d object detection,” in ICCV, 2021, pp. 15 641–15 650.
- [55] Y. Chen, L. Tai, K. Sun, and M. Li, “Monopair: Monocular 3d object detection using pairwise spatial relationships,” in CVPR, 2020, pp. 12 093–12 102.
- [56] Y. Zhou, Y. He, H. Zhu, C. Wang, H. Li, and Q. Jiang, “Monocular 3d object detection: An extrinsic parameter free approach,” in CVPR, 2021, pp. 7556–7566.
- [57] Y. Zhang, J. Lu, and J. Zhou, “Objects are different: Flexible monocular 3d object detection,” in CVPR, 2021, pp. 3289–3298.
- [58] X. Shi, Q. Ye, X. Chen, C. Chen, Z. Chen, and T.-K. Kim, “Geometry-based distance decomposition for monocular 3d object detection,” in ICCV, 2021, pp. 15 172–15 181.
- [59] J. Gu, B. Wu, L. Fan, J. Huang, S. Cao, Z. Xiang, and X.-S. Hua, “Homography loss for monocular 3d object detection,” in CVPR, 2022, pp. 1080–1089.
- [60] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
- [61] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in CVPR, 2018, pp. 2403–2412.
- [62] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: an imperative style, high-performance deep learning library,” in NeurIPS, 2019, pp. 8026–8037.
- [63] R. Liu, J. Lehman, P. Molino, F. Petroski Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” in NeurIPS, 2018.
- [64] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in CVPR, 2020, pp. 2485–2494.