
Towards Better Performance and More Explainable Uncertainty for 3D Object Detection of Autonomous Vehicles

Hujie Pan1,2, Zining Wang2, Wei Zhan2, and Masayoshi Tomizuka2 1Hujie Pan is with School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China and now a visiting student researcher in University of California Berkeley, Berkeley, 94720, USA [email protected], [email protected]2Zining Wang, Wei Zhan and Masayoshi Tomizuka are with Department of Mechanical Engineering, University of California Berkeley, Berkeley, 94720, USA [email protected], [email protected], [email protected]
Abstract

In this paper, we propose a novel form of loss function to improve the performance of LiDAR-based 3D object detection and to obtain more explainable and convincing uncertainty for the predictions. The loss function is designed using corner transformation and uncertainty modeling. With the new loss function, the performance of our method on the val split of the KITTI dataset shows up to a 15% increase in Average Precision (AP) compared with a baseline using a simple L1 loss. In the study of the characteristics of the predicted uncertainties, we find that a more accurate prediction of the bounding box is generally accompanied by lower uncertainty. The distribution of corner uncertainties agrees with the distribution of the point cloud in the bounding box, meaning that the corner with denser observed points has lower uncertainty. Moreover, our method learns the constraint from the cuboid geometry of the bounding box in the uncertainty prediction. Finally, we propose an efficient Bayesian updating method to recover the uncertainty of the original bounding-box parameters, which can provide probabilistic results for the tracking and planning modules.

I INTRODUCTION

The “perception-plan-control” scheme has been widely adopted as the framework for autonomous driving solutions [1]. As a significant step in this scheme, the perception module provides an understanding of the environment, within which object detection and localization play crucial roles. With the help of onboard sensors such as cameras and LiDAR, the color and depth information of the foreground and background can be obtained. Many efforts have been made on object detection algorithms using such sensor data, among which deep learning methods have made great progress in precision and recall [2, 3, 4, 5, 6, 7, 8]. Considering the variation of the environment, the states of the objects and the differing observability, the predicted result should include a certain level of uncertainty, which is also an essential input for the planning and decision-making modules. However, most deep-learning-based detectors only produce deterministic object states and lack feedback on uncertainties [9].

To tackle the problem mentioned above, many attempts have been made to quantitatively predict the uncertainty of Deep Neural Networks (DNNs), some of which predict the uncertainties of different parameters for object detection. [10, 11] proposed methods to learn the probabilistic distributions of the parameters by minimizing the Kullback-Leibler divergence (KLD) of the predicted distributions from preset ones. The method in [10] considers the variations of labeled point data but depends heavily on how the probabilistic distribution is preset. [12] infers the uncertainty of the label from the point cloud distribution of the sample using a Bayesian method and produces a probabilistic map to represent the uncertainty of the bounding box. [4, 13, 14] directly learn the parameters of the probability distribution by maximizing the likelihood. However, they all assume that the parameters are independent of each other and adopt a diagonal covariance matrix for uncertainty modeling. [14] further analyzes the features of the predicted uncertainties but is constrained by the independence assumption, which makes the derived ensemble variance of every corner of the bounding box the same. In that case, the modeled uncertainty cannot fully reflect the distribution of the point cloud, making it less explainable and persuasive. A non-diagonal assumption could potentially address this issue but is prone to numerical instability such as gradient explosion in our preliminary tests.

Instead of assuming a non-diagonal covariance for the original parameters of the bounding box, we propose a method that first transforms the original parameters to the eight corners and models the probabilistic distribution of each corner location. This method provides enough degrees of freedom (DOFs) to represent the uncertainties and avoids the numerical instability of training a non-diagonal covariance matrix. As for the network architecture, following PointRCNN, we propose a PointNet-based two-stage method that keeps lossless point-wise features.

With the proposed approach, we considerably improve the performance of the object detector over the baseline and reach an average precision (AP) comparable with state-of-the-art algorithms. Meanwhile, the predicted uncertainty successfully represents the distribution of the points in the bounding box as well as the constraint of its cuboid geometry. Finally, we propose a Bayesian updating method to recover the uncertainty of the original parameter set of the bounding box so as to provide the uncertainty of states for tracking and planning.

II RELATED WORKS

II-A LiDAR-based 3D object detection

Unlike the organized pixel values in images, point clouds provide irregular data that prevents the direct application of classical convolutional neural networks (CNNs) such as VGG and ResNet. To tackle this issue, many researchers preprocess the point cloud data by reorganizing it into 2D or 3D grids [2, 8, 15, 16, 17]. However, voxelization results in a large input size and a high computation cost, which were only mitigated after the 3D sparse convolution method was proposed [16, 18]. Moreover, grid-based methods potentially cause information loss of the point cloud and have a limited receptive field. To make use of lossless point cloud information, PointNet-based architectures have been proposed to extract point-wise and global features from the point cloud [5, 19, 20]. There are several two-stage object detectors based on PointNet, such as PointRCNN [7] and STD [21]. In this work, we propose a two-stage method that utilizes point-wise features for region proposal and refinement following PointRCNN.

II-B Uncertainty modeling for deep learning

To represent the uncertainty of neural networks, Bayesian modeling has been proposed to estimate epistemic and aleatoric uncertainties [22, 23, 24]. Epistemic uncertainty is model-based and arises from the uncertainty of the model parameters. It reflects the limitation of the model in describing biased training data and can be reduced by enlarging the dataset [24]. There are two main ways to predict epistemic uncertainty: variational inference [25] and sampling [26]. Aleatoric uncertainty, on the other hand, is measurement-based and arises from sensor noise, data representation, label noise, etc. [27]. It can be predicted by outputting the parameters of a distribution, and this is the approach adopted in this paper. Recently, much effort has been put into estimating the uncertainty of 3D object detection. [13] and [28] utilized Monte Carlo dropout to capture the epistemic uncertainty of the bounding box, while [4] and [14] predicted the aleatoric uncertainty by learning the parameters of the probabilistic distribution of bounding boxes. [10] and [11] further pre-estimated the uncertainty of the label and learned the probabilistic distribution by minimizing the KLD of the predicted distribution from the preset one. In this work, we estimate the aleatoric uncertainties of the corners of bounding boxes by learning the parameters of the probabilistic distribution, and we analyze how they represent the distribution of the point cloud and are constrained by the cuboid geometry. Finally, we propose a Bayesian update method to recover the uncertainty of the original label parameters.

III Proposed Method

In this section, we propose a two-stage detector whose RPN stage and point cloud encoder use the PointRCNN [7] backbone and PointNet++ [20]. It is trained with the proposed output form and loss function based on corner transformation and uncertainty modeling. The architecture of the network is shown in Fig. 1.

III-A Network Architecture

As shown in Fig. 1, we take raw point cloud data as input rather than voxelizing it. In the region proposal network (RPN) stage, we generate 3D regions of interest (ROIs) using the point-wise and global features extracted by a PointNet-like architecture. After ROI pooling, we feed the pooled point features along with the original point coordinates and intensities to the point cloud encoder of PointNet++ [20] to learn the representation of the point cloud. The encoder is followed by three task-specific fully connected (FC) layers which output the classification scores, the box residuals and the uncertainties of the box corner locations, respectively. The FC layers only predict the residual of the bounding box; the final bounding box is recovered by combining the residual with the preset anchor of the ROI pooling layer.

Figure 1: The architecture of the network that predicts the uncertainty of the corners of 3D bounding boxes and the residual of bounding boxes.

The labels of the KITTI dataset describe the ground truth bounding box with 7 parameters: 3 for the center location, 3 for the dimensions and 1 for the orientation. The orientation is described by the yaw angle of the object. Accordingly, our ROI box and final refined bounding box are all parametrized by $[x, y, z, h, w, l, \psi]$, in which $[x, y, z]$ denotes the location, $[h, w, l]$ denotes the dimensions and $\psi$ is the yaw angle.

III-B Corner Transformation

A corner-based, equally weighted regression loss is proposed to enrich the representation capability of the uncertainties. Rather than directly predicting the coordinates of each corner, we keep the original expression of the bounding box and transform it to the corner coordinates in the camera coordinate system, retaining the cuboid constraints among them.

Equation (1) is the transformation from the original label parameters to the corresponding 8 corners.

\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} \sin\psi & 0 & \cos\psi \\ 0 & 1 & 0 \\ -\cos\psi & 0 & \sin\psi \end{bmatrix} \begin{bmatrix} \pm l/2 \\ \pm h/2 \\ \pm w/2 \end{bmatrix} + \begin{bmatrix} x \\ y \\ z \end{bmatrix}   (1)

in which $[x_c, y_c, z_c]^T$ is the location of the corner, $l, h, w$ are the dimensions of the bounding box, $\psi$ is the yaw angle and $[x, y, z]^T$ is the location of the box center. We transform both the label and the recovered bounding box for the loss calculation.
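For concreteness, the minimal NumPy sketch below (our own illustration, not the authors' released code; the function name and corner ordering are arbitrary) shows how (1) maps the seven box parameters to the eight corner coordinates:

```python
import numpy as np

def box_to_corners(x, y, z, h, w, l, psi):
    """Transform a KITTI-style box [x, y, z, h, w, l, psi] (camera frame)
    into its 8 corner coordinates using the rotation of Eq. (1)."""
    # Rotation about the camera y-axis (yaw), matching Eq. (1).
    rot = np.array([[ np.sin(psi), 0.0, np.cos(psi)],
                    [ 0.0,         1.0, 0.0        ],
                    [-np.cos(psi), 0.0, np.sin(psi)]])
    # All +/- combinations of the half-dimensions (l/2, h/2, w/2).
    signs = np.array([[sx, sy, sz] for sx in (1, -1)
                                   for sy in (1, -1)
                                   for sz in (1, -1)], dtype=float)
    offsets = signs * np.array([l / 2.0, h / 2.0, w / 2.0])
    corners = offsets @ rot.T + np.array([x, y, z])  # shape (8, 3)
    return corners
```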

III-C Uncertainty Modeling

Denoting the procedure from sensing to annotation as a measurement and taking the predicted bounding box as the mean value, we assume that the components of the corner coordinates are drawn from independent univariate Laplace distributions with the probability density function defined as follows:

p(x|\mu, b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)   (2)

in which $\mu$ is the predicted transformed component and $b$ is the predicted diversity of the Laplace distribution. The variance of the Laplace distribution equals $2b^2$.

Then we take the negative log likelihood as the loss function for each single component of the corner:

\mathcal{L}(x, \mu, b) = -\log p(x|\mu, b) = \ln 2b + \frac{|x-\mu|}{b}   (3)

Finally, we calculate the ensemble regression loss of the 3D bounding box by summing up the loss of all the components from all corners:

\mathcal{L}_{ens} = \sum_i \sum_j \mathcal{L}(x_{ij}, \mu_{ij}, b_{ij})   (4)

in which $i$ is the index of the corner and $j$ is the index of the component of the corner.

Since the loss takes the form of a negative log-likelihood, summing the losses is equivalent to multiplying the probability densities defined in (2). In other words, by minimizing the ensemble loss, we are maximizing the likelihood of the labeled corners under the Laplace distribution with parameters $\mu$ and $b$.
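A minimal PyTorch-style sketch of the loss in (2)-(4) follows (our own illustration; predicting $\log b$ instead of $b$ is an assumed numerical-stability choice, not stated above):

```python
import torch

def laplace_corner_nll(target_corners, pred_corners, pred_log_b):
    """Ensemble regression loss of Eq. (4): sum of per-component Laplace
    negative log-likelihoods (Eq. (3)) over 8 corners x 3 components.

    target_corners, pred_corners: (N, 8, 3) ground-truth / predicted corners.
    pred_log_b: (N, 8, 3) predicted log-diversity of the Laplace distribution.
    """
    b = torch.exp(pred_log_b)
    nll = torch.log(2.0 * b) + torch.abs(target_corners - pred_corners) / b
    # Sum over corners and components, average over the boxes in the batch.
    return nll.sum(dim=(1, 2)).mean()
```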

IV EXPERIMENTAL RESULTS

This section starts with the evaluation of our method on the 3D object detection benchmark of the ‘Car’ category of the KITTI dataset [29]. The dataset provides 7481 training samples and 7518 testing samples. Following [7], we divide the training data into a train split (3712 samples) and a val split (3769 samples). We conduct an ablation study with different cases on the val split and compare our method with state-of-the-art algorithms on the test set. After the evaluation, we further analyze the behavior of the uncertainties, how they represent the distribution of the point cloud, and how they are constrained by the cuboid geometry.

IV-A 3D Object Detection on KITTI

We first compare the performance of our method with state-of-the-art methods on the KITTI test set. We pick some representative LiDAR-based methods, listed in Tab. I. We highlight the comparison between our method and PointRCNN, which uses a bin-based loss function and serves as the base of our method.

All results are evaluated by the average precision (AP) with a 3D intersection over union (IoU) threshold of 0.7 on the easy, moderate, and hard difficulty levels respectively. The AP of our method on the test set was calculated on the official KITTI server.

As shown in Tab. I, our method achieves performance comparable with the listed methods and surpasses most of them. Compared with the base network PointRCNN, our method performs comparably at the easy difficulty level. As the difficulty increases, our method surpasses PointRCNN in AP by 1.23% at the moderate and 2.47% at the hard difficulty level, indicating not only better performance but also higher robustness to less informative samples.

To find out how corner transformation and uncertainty modeling affect the performance of the model, we perform an ablation study with different cases on the val split. As shown in Tab. II, we set a baseline whose loss function is a simple L1 loss without corner transformation or uncertainty modeling. The baseline with corner transformation computes the L1 loss on the components of the transformed corners. The baseline with uncertainty adopts only the aleatoric uncertainty modeling following [14], without any other modifications. To make a fair comparison, all cases share the same RPN result and follow the same training procedure. We also add the performance of PointRCNN on the val split from [7] for comparison.

TABLE I: Results on the KITTI test set (Average Precision 3D, %)
Method | Easy | Moderate | Hard
MV3D [2] | 74.97 | 63.63 | 54.00
SECOND [16] | 83.34 | 72.55 | 65.82
PointPillars [15] | 82.58 | 74.31 | 68.99
STD [21] | 87.95 | 79.71 | 75.09
PointRCNN [7] | 86.96 | 75.64 | 70.70
Ours | 86.55 | 76.87 (+1.23) | 73.17 (+2.47)
TABLE II: Ablation study on the val split (Average Precision 3D, %)
Method | Easy | Moderate | Hard
Simple L1 loss (baseline) | 73.03 | 66.02 | 60.86
Baseline with corner trans. | 85.74 | 77.64 | 76.05
Baseline with uncertainty | 83.85 | 77.47 | 74.57
PointRCNN (bin-based loss) [7] | 88.88 | 78.63 | 77.38
Corner trans. & uncertainty (ours) | 89.26 | 80.62 | 79.13

As shown in Tab. II, corner transformation and uncertainty modeling improve the performance of the model by 11.62% and 11.45% respectively over the baseline at the moderate difficulty level, while our method with both features improves the AP by 16.23%, 14.60% and 18.27% at the easy, moderate and hard difficulty levels respectively. The corner transformation increases the performance of the model by re-weighting the original parameters of the bounding box and transforming them into equally weighted corners. The corners contribute equally to the representation of the bounding box, e.g. in IoU calculation, while the original parameters do not. Therefore, even with a simple L1 loss, corner transformation considerably increases the performance of the model over the baseline. On the other hand, uncertainty modeling increases the noise tolerance of the detector: due to the diversity of the Laplace distribution, noisy labels have less effect on updating the network weights. Combining the two, our method increases the performance by 0.4%, 1.99% and 1.75% at the easy, moderate and hard difficulty levels respectively over PointRCNN with the bin-based loss on the val split.

IV-B Explaining the Uncertainties

In this section, we analyze the characteristics of the predicted uncertainty, starting with its general behavior. Then, we discuss the relationship between the corner uncertainty and the point cloud distribution. Finally, we introduce how the cuboid constraint affects the distribution of the corner uncertainties.

IV-B1 General Behaviors

To reveal the relationship between the distance to the ego-vehicle and the uncertainty of different components, we calculate the overall variance of each bounding box by summing the variances of a single component over its eight corners. Note that the $x$, $y$ and $z$ components are expressed in the camera coordinate system. As shown in Fig. 2, we plot the mean values of the overall variance in 5-meter bins with respect to the distance. For distances less than 10 meters, the overall variance decreases as the distance increases, which we attribute to truncated objects close to the LiDAR. Beyond 10 meters, the variances of all three components increase with the distance. Moreover, we calculate the standard deviation of the overall variances to represent its variation and plot it as error bars in Fig. 2a. As the distance increases, the variation of the uncertainty also increases, especially for the $x$ and $z$ components. However, the uncertainty in the $y$ direction is always smaller than the other two, especially at distances greater than 40 meters from the sensor, and its variation is also much smaller than those of the other two components. This can be explained by Fig. 2b, which shows the negative correlation between the total uncertainty of the bounding box and the IoU: a more accurate prediction is usually accompanied by lower uncertainty. We further calculate the average corner loss of the $y$ component and find that it is only 0.056, much lower than those of $x$ (0.187) and $z$ (0.304), as expected.

Figure 2: The general behavior of the uncertainty: a) the overall variance of each component from the corners with respect to the detection distance, b) the total uncertainty with respect to IoU of the bounding boxes. The total uncertainty of the bounding box is calculated by summing up all the corner variances

IV-B2 Point cloud distribution representation

To describe the spatial distribution of the point cloud and determine whether a corner lies on the denser or the sparser side, we calculate the average Euclidean distance from each corner to all the detected points in the bounding box using (5):

d_k = \frac{1}{N}\sum_i ||\mathbf{c}_k - \mathbf{p}_i||   (5)

in which $d_k$ is the average Euclidean distance of the $k$th corner of the bounding box, $\mathbf{c}_k$ and $\mathbf{p}_i$ are the coordinates of the $k$th corner and the $i$th detected point respectively, and $N$ is the total number of detected points within the bounding box.

As for the uncertainty, we calculate the ensemble variance $\sigma_{ens}^2$ of each corner by summing the variances of its three components.

After the normalization in (6), $d_k$ and $\sigma_{ens,k}$ can be regarded as samples of two pseudo probability distributions denoted as $D(k)$ and $U(k)$ respectively. To evaluate the similarity of the two distributions, i.e. how relevant the distances and ensemble uncertainties are within a corner set, we calculate the KLD of $U(k)$ from $D(k)$ as shown in (7). Note that $D(k)$ and $U(k)$ represent the proportional relationship of the distances and uncertainties of the eight corners respectively, and the KLD here denotes the information loss when $U(k)$ is used to approximate $D(k)$. A lower KLD means a closer proportional relationship between the uncertainties and distances in a corner set.

d_k \leftarrow d_k / \mathrm{sum}(\mathbf{d}), \quad \sigma_{ens,k} \leftarrow \sigma_{ens,k} / \mathrm{sum}(\mathbf{u})   (6)

KLD = \sum_k D(k) \log\frac{D(k)}{U(k)}   (7)
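The NumPy sketch below (our own illustration; the small eps term is an added safeguard against division by zero or log of zero) computes $d_k$, the normalization in (6), and the KLD-UD of (7) for one bounding box:

```python
import numpy as np

def kld_uncertainty_vs_distance(corners, points, corner_vars, eps=1e-8):
    """KLD-UD of Eq. (7): information loss when the normalized corner
    uncertainty distribution U(k) approximates the normalized mean
    corner-to-point distance distribution D(k) (Eqs. (5)-(6)).

    corners: (8, 3) predicted corner coordinates.
    points:  (M, 3) LiDAR points inside the box.
    corner_vars: (8,) ensemble variance of each corner (sum over x, y, z).
    """
    # Eq. (5): mean Euclidean distance from each corner to the box points.
    d = np.linalg.norm(corners[:, None, :] - points[None, :, :], axis=-1).mean(axis=1)
    # Eq. (6): normalize both quantities into pseudo-distributions.
    D = d / d.sum()
    U = corner_vars / corner_vars.sum()
    # Eq. (7): KL divergence of U(k) from D(k).
    return float(np.sum(D * np.log((D + eps) / (U + eps))))
```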

We plot the KLD with respect to the detection distance in Fig. 3, where the overall KLD lies close to 0. As the detection distance increases, the number of data points with KLD greater than 0.05 increases. After looking into the samples with high KLD, we find that most of these data points correspond to bounding boxes with low IoUs, i.e. less accurate predictions.

Figure 3: KLD of the corner variance distribution from the corner Euclidean distance distribution with respect to the detection distance. The color represents the IoU of the bounding box.

We further pick four samples and plot them with their KLD values at different difficulty levels in Fig. 4 to analyze how they behave with low KLD values. As defined in (7), a lower KLD means a higher similarity between the distributions of the corner uncertainty and the point cloud. As indicated in Fig. 4, even at different difficulty levels and with different numbers of points, the corner uncertainty shows the same trend: it is lower on the side with the denser point cloud. Fig. 4b shows that the predicted corners on the side with the denser point cloud are closer to the ground truth than those on the sparser side and have higher confidence. This matches our observation about the negative correlation between uncertainty and accuracy discussed in the previous section. This is practical and would help to predict collisions in autonomous driving, since the denser point side is most likely the side closer to the LiDAR.

Besides the low KLD cases, we also pick two samples with relatively high KLD to analyze the ‘outliers’. In general, as seen in Fig. 5, they still follow the behavior discussed in the previous paragraph. In Fig. 5a, corners on the side with more points have lower uncertainties compared with the other side, which is also found in Fig. 5b. One reason for the higher KLD is that the uncertainty distribution does not exactly fit the point cloud corner by corner. For instance, in Fig. 5a, the uncertainty of corner 1 is smaller than that of corner 3, which agrees with the point cloud distribution, while corners 2 and 4 show the opposite. Another reason for the higher KLD is that these objects are at a large distance from the sensor, which keeps the total variance of the bounding box at a high level. Moreover, the differences between the uncertainties of different corners are not as distinguishable as those in the point cloud distribution, which makes the two distributions numerically dissimilar. As seen in Fig. 5, the variances of the corners vary from 1.5 to 1.7, a change of only approximately 15%, while the cases in Fig. 4 show at least a 75% difference.

Figure 4: Sample analysis of the corner uncertainties at different difficulty levels. The ground truth boxes are drawn with black lines and corners. The predicted boxes have blue lines, and their corners are enlarged and filled with false color to represent the uncertainty from low (dark blue) to high (dark red). The detected points in the bounding box are plotted as blue points. Each subplot is labeled with its KLD of $U(k)$ from $D(k)$ (KLD-UD) and its difficulty level.
Figure 5: Samples with relatively higher KLD-UD. The visual elements are described in the caption of Fig. 4.

IV-B3 Influence of geometry constraint

Although we allow a high DOF when modeling the uncertainties, the model still learns the constraint of the bounding box from its geometry. We use the relative locations of the corners to describe the geometry because they are determined by the parameters of the bounding box. For instance, once the shape of the box is set and one point of the bounding box is taken as a static reference, the relative displacements of the corners caused by a small pose variation are constrained by the relative locations of the corners with respect to the reference point, and each small displacement is approximately proportional to the distance from the corner to the reference point. This is also how small errors propagate, which can be represented by variances.

In our case, we set the corner with the minimum variance as the reference point and its variance as the reference uncertainty. We calculate the variance difference and the Euclidean distance between each corner and the reference point as shown in (8):

\sigma_k \leftarrow \sqrt{\sigma_k^2 - \min\{\sigma_i^2\}}, \ i \in \{1, 2, \dots, 8\}, \quad d_{c,k} \leftarrow ||\mathbf{c}_k - \mathbf{c}_m||   (8)

in which $m = \operatorname{arg\,min}_i \{\sigma_i^2\}$ and $k$ is the index of the corner.

As in (6) and (7), we normalize $\sigma_k$ and $d_{c,k}$ by their sums respectively and calculate the KLD of the pseudo distribution $R_\sigma(k) = \sigma_k$ from the distribution $R_d(k) = d_{c,k}$. The KLD here represents the similarity between the distribution of relative predicted uncertainties $R_\sigma(k)$ and relative locations $R_d(k)$. A low KLD means that the predicted uncertainty tends to be constrained by the cuboid geometry of the bounding box relative to a confident corner.
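A minimal sketch of this KLD-R computation, under the same assumptions and eps safeguard as the previous snippet, is:

```python
import numpy as np

def kld_relative_uncertainty_vs_geometry(corners, corner_vars, eps=1e-8):
    """KLD-R: similarity between relative uncertainties R_sigma(k) and
    relative corner locations R_d(k), following Eq. (8) and the same
    normalization and KLD scheme as Eqs. (6)-(7).

    corners: (8, 3) predicted corner coordinates.
    corner_vars: (8,) ensemble variance of each corner.
    """
    m = int(np.argmin(corner_vars))                      # most confident corner
    sigma = np.sqrt(corner_vars - corner_vars[m])        # Eq. (8), relative std.
    d_c = np.linalg.norm(corners - corners[m], axis=1)   # Eq. (8), distance to reference
    # Normalize into pseudo-distributions; the reference corner contributes zero.
    R_sigma = sigma / (sigma.sum() + eps)
    R_d = d_c / (d_c.sum() + eps)
    # KLD of R_sigma(k) from R_d(k).
    return float(np.sum(R_d * np.log((R_d + eps) / (R_sigma + eps))))
```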

We plot the KLD of relative predicted uncertainties from relative locations (KLD-R for short) with respect to the KLD of the corner variance distribution from the corner Euclidean distance distribution defined in (7) (KLD-UD for short) in Fig. 6 to reveal the relative influence of the point cloud distribution and of the cuboid constraint on the uncertainties. We find that most of the samples lie close to the axes, which means the uncertainties agree with at least one of the two distributions $R_d(k) = d_{c,k}$ and $D(k) = d_k$.

Figure 6: KLD of relative predicted uncertainties from relative locations with respect to the KLD of the corner variance distribution from the corner Euclidean distance distribution. The color represents the IoU of the bounding box.

To better understand how the model learns the point cloud distribution and how it is affected by the cuboid constraints, we plot four representative samples in Fig. 7. As seen in Fig. 7a and 7b, the samples with higher KLD-R contain rich point cloud information, and the model predicts a confident face (formed by corners 1, 5, 8 and 4) in Fig. 7a and a confident edge (formed by corners 1 and 2) in Fig. 7b, rather than the single confident reference point assumed in our test. Transferring the concept of the reference point to a face or an edge, we find that the uncertainties of the samples in Fig. 7a and 7b are still constrained by the cuboid geometry. In Fig. 7b, denoting the edge with corners $i$ and $j$ as “edge $ij$”, if we set edge 12 as the reference, edge 78 has the highest corner uncertainty and is also the farthest from the reference, while the other two edges are closer to the reference and have lower uncertainty than edge 78. With less point information provided, the model is not able to predict a confident face or edge but only a confident point, which results in a lower KLD-R and a potentially higher KLD-UD. As seen in Fig. 7c and 7d, with limited point cloud information, the model tends to predict a relatively confident corner (corner 1 in both Fig. 7c and 7d), and the uncertainties of the other corners are affected by their relative locations to the confident one.

Figure 7: Representative samples with different values of KLD-UD and KLD-R. a) and b) have higher KLD-R and lower KLD-UD, while c) and d) are the opposite. The visual elements are described in the caption of Fig. 4.

V Uncertainty Recovery

In this section, we propose an efficient Bayesian updating method to recover the uncertainty of the original parameters of the bounding box. The basic idea is to divide the eight corners of the bounding box into 4 pairs. With a proper division, we can derive the required parameters from each pair and treat the process as an individual measurement. Then, we calculate the variance of the obtained parameter using the error propagation formula. Finally, we conduct a Bayesian update to obtain the final uncertainties of the original parameters.

V-A Uncertainty of an individual measurement

To recover the yaw angle, we pick the edge that is parallel to the orientation of the car and use its two vertices as the corner pair. Denote the coordinates of the two corners in the $x$-$z$ plane as $(x_i, z_i)$ and $(x_j, z_j)$. Then the yaw angle is:

\psi = \arctan\frac{z_i - z_j}{x_i - x_j}   (9)

Applying the error propagation formula, we obtain the variance of the yaw angle:

\sigma_\psi^2 = \left|\frac{\partial\psi}{\partial x_i}\right|^2 \sigma_{xi}^2 + \left|\frac{\partial\psi}{\partial x_j}\right|^2 \sigma_{xj}^2 + \left|\frac{\partial\psi}{\partial z_i}\right|^2 \sigma_{zi}^2 + \left|\frac{\partial\psi}{\partial z_j}\right|^2 \sigma_{zj}^2 = \frac{|z_i - z_j|^2(\sigma_{xi}^2 + \sigma_{xj}^2) + |x_i - x_j|^2(\sigma_{zi}^2 + \sigma_{zj}^2)}{\left[(x_i - x_j)^2 + (z_i - z_j)^2\right]^2}   (10)

Similarly, we can obtain the uncertainties of the dimensions of the box

\sigma_{d}^2 = \frac{\sum_k (c_{i,k} - c_{j,k})^2(\sigma_{i,k}^2 + \sigma_{j,k}^2)}{\sum_k (c_{i,k} - c_{j,k})^2}   (11)

and the uncertainty of the location of the box

\sigma_{loc,k}^2 = \frac{1}{2}(\sigma_{i,k}^2 + \sigma_{j,k}^2)   (12)

in which $c_k, k \in \{1, 2, 3\}$ is the component of the corner.
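The sketch below illustrates one such pair-wise measurement following (9)-(12) (our own illustration; arctan2 is used in place of arctan to keep the quadrant, which is an implementation choice not specified above):

```python
import numpy as np

def pair_measurement_uncertainty(ci, cj, var_i, var_j):
    """Recover yaw, dimension, and location variances from one corner pair,
    following Eqs. (9)-(12).

    ci, cj: (3,) corner coordinates [x, y, z] in the camera frame.
    var_i, var_j: (3,) per-component variances of the two corners.
    """
    dx, dz = ci[0] - cj[0], ci[2] - cj[2]
    # Eq. (9): yaw from the edge parallel to the car orientation.
    psi = np.arctan2(dz, dx)
    # Eq. (10): yaw variance by error propagation.
    denom = (dx**2 + dz**2) ** 2
    var_psi = (dz**2 * (var_i[0] + var_j[0]) + dx**2 * (var_i[2] + var_j[2])) / denom
    # Eq. (11): variance of the edge length (a box dimension).
    diff2 = (ci - cj) ** 2
    var_dim = np.sum(diff2 * (var_i + var_j)) / np.sum(diff2)
    # Eq. (12): per-component variance of the edge midpoint (box location).
    var_loc = 0.5 * (var_i + var_j)
    return psi, var_psi, var_dim, var_loc
```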

V-B Bayesian update

With the variances obtained from the measurements discussed above, we use the Bayesian update method to approximate the final variance with (13) [30].

\sigma_{bayesian}^2 = \frac{\prod_i \sigma_i^2}{\sum_j \prod_{i \neq j} \sigma_i^2}   (13)
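Equation (13) is algebraically equivalent to inverse-variance weighting, $1 / \sum_i (1/\sigma_i^2)$, which leads to a very small sketch (our own illustration):

```python
import numpy as np

def bayesian_combined_variance(variances):
    """Combine the variances of the four pair-wise measurements with Eq. (13),
    written in its equivalent inverse-variance form 1 / sum(1 / sigma_i^2)."""
    variances = np.asarray(variances, dtype=float)
    return 1.0 / np.sum(1.0 / variances)
```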

VI Discussions and Conclusions

We have presented an innovative design of the loss function to improve the performance of a 3D object detector and to learn explainable uncertainties for the predictions. By applying corner transformation and uncertainty modeling, our method re-weights the original parameters of the bounding box in the loss function and increases the adaptivity of the model to noisy and biased LiDAR data and labels. The results on the KITTI val split show that the performance of our method increases by up to 15% compared with the baseline using a simple L1 loss. On the KITTI test set, our method surpasses the original PointRCNN at the moderate and hard difficulty levels by 1.23% and 2.47% respectively, indicating better performance and higher robustness.

To study the characteristics of the uncertainty, we design KLD-based tests to explain how the predicted uncertainties of the corners represent the distribution of the point cloud in the bounding box and how they are constrained by the cuboid geometry of the bounding box. As expected, our method predicts lower uncertainties for corners on the side with a relatively denser point cloud. Moreover, the distribution of the predicted uncertainties is constrained by the cuboid geometry of the bounding box in different ways, depending on how the asymmetrically distributed point cloud is represented in the bounding box.

With the method proposed in Section V, we can estimate the uncertainty of the parameters of the bounding box from the uncertainties of the corners. Moreover, our method can be transferred to most deep-learning-based object detectors with little additional computation cost. It not only increases the performance but also predicts convincing uncertainties for tracking and planning. We will further apply our method to the RPN stage for more improvement and test it on different state-of-the-art models to confirm its consistency.

References

  • [1] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, “Perception, planning, control, and coordination for autonomous vehicles,” Machines, vol. 5, no. 1, p. 6, 2017.
  • [2] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
  • [3] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, and K. Dietmayer, “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [4] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 677–12 686.
  • [5] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918–927.
  • [6] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 529–10 538.
  • [7] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 770–779.
  • [8] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
  • [9] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, 2018.
  • [10] G. P. Meyer and N. Thakurdesai, “Learning an uncertainty-aware object detector for autonomous driving,” arXiv preprint arXiv:1910.11375, 2019.
  • [11] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, “Bounding box regression with uncertainty for accurate object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2888–2897.
  • [12] Z. Wang, D. Feng, Y. Zhou, L. Rosenbaum, F. Timm, K. Dietmayer, M. Tomizuka, and W. Zhan, “Inferring spatial uncertainty in object detection,” to appear in International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [13] D. Feng, L. Rosenbaum, and K. Dietmayer, “Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2018, pp. 3266–3273.
  • [14] D. Feng, L. Rosenbaum, F. Timm, and K. Dietmayer, “Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2019, pp. 1280–1287.
  • [15] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705.
  • [16] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
  • [17] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
  • [18] B. Graham, M. Engelcke, and L. van der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9224–9232.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in neural information processing systems, 2017, pp. 5099–5108.
  • [21] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1951–1960.
  • [22] D. J. MacKay, “A practical bayesian framework for backpropagation networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992.
  • [23] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
  • [24] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in neural information processing systems, 2017, pp. 5574–5584.
  • [25] A. Graves, “Practical variational inference for neural networks,” in Advances in neural information processing systems, 2011, pp. 2348–2356.
  • [26] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning, 2016, pp. 1050–1059.
  • [27] A. Malinin and M. Gales, “Predictive uncertainty estimation via prior networks,” in Advances in Neural Information Processing Systems, 2018, pp. 7047–7058.
  • [28] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout sampling for robust object detection in open-set conditions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–7.
  • [29] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3354–3361.
  • [30] S. M. Lynch, Introduction to applied Bayesian statistics and estimation for social scientists.   Springer Science & Business Media, 2007.