
Kuan-Chih [email protected]   Yu-Kai [email protected]   Winston H. [email protected]
National Taiwan University, Taipei, Taiwan

Multi-Stream Attention Learning for Monocular Vehicle Velocity and Inter-Vehicle Distance Estimation

Abstract

Vehicle velocity and inter-vehicle distance estimation are essential for ADAS (advanced driver-assistance systems) and autonomous vehicles. To avoid the cost of expensive ranging sensors, recent studies focus on using a low-cost monocular camera to perceive the environment around the vehicle in a data-driven fashion. Existing approaches treat each vehicle independently, which leads to inconsistent estimates across vehicles. Furthermore, important information such as context and the spatial relations available in 2D object detection is often neglected in the velocity estimation pipeline. In this paper, we explore the relationship between vehicles in the same frame with a global relative constraint (GLC) loss to encourage consistent estimation. A novel multi-stream attention network (MSANet) is proposed to extract complementary features, e.g., spatial and contextual features, for joint vehicle velocity and inter-vehicle distance estimation. Experiments show the effectiveness and robustness of the proposed approach: MSANet outperforms state-of-the-art algorithms on both the KITTI dataset and the TuSimple velocity dataset.

1 Introduction

Refer to caption
Figure 1: Left: Previous methods [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer, Song et al.(2020)Song, Lu, Zhang, and Li] treat each vehicle independently and estimate their velocity and position separately. Right: Our proposed method jointly learns each vehicle’s state with the relative constraint (Section 3.5), which helps the network to learn the global consistency of the predictions.

Self-driving cars and ADAS (advanced driver-assistance systems) have a significant impact on today's society. Inter-vehicle distance and relative velocity estimation are two fundamental requirements for ADAS and autonomous vehicles to keep the ego-vehicle from collisions. They are also crucial for path prediction, path planning, and decision making [Sadat et al.(2020)Sadat, Casas, Ren, Wu, Dhawan, and Urtasun].

Most 3D perception applications for vehicles rely on ranging sensors (e.g., LiDAR, radar), which directly perceive the surrounding environment by emitting laser pulses or radio waves. However, these sensors suffer from sparse output and high cost [Li et al.(2019)Li, Chen, and Shen]. In contrast, a monocular camera is a good alternative because of its affordable price. Additionally, a camera provides dense color information, richer texture, and a high frame rate, which is suitable for ADAS. Furthermore, considering the success of deep neural networks in vehicle environment perception tasks, including object detection [Wang et al.(2019)Wang, Chao, Garg, Hariharan, Campbell, and Weinberger, Ma et al.(2020)Ma, Liu, Xia, Zhang, Zeng, and Ouyang] and depth estimation [Eigen et al.(2014)Eigen, Puhrsch, and Fergus, Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao], a deep learning method that estimates the distance and relative velocity of surrounding vehicles from a monocular camera is highly desirable [Song et al.(2020)Song, Lu, Zhang, and Li].

Existing monocular vehicle distance estimation methods can be roughly categorized into two branches: monocular depth estimation [Eigen et al.(2014)Eigen, Puhrsch, and Fergus, Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao] and monocular 3D object detection [Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Kosecka, Li et al.(2020)Li, Zhao, Liu, and Cao, Ma et al.(2020)Ma, Liu, Xia, Zhang, Zeng, and Ouyang]. The former learns the depth of each pixel under supervision from dense depth maps; the latter regresses the pose of each vehicle in the world coordinate system. Instead of predicting all 3D bounding boxes or per-pixel depth maps, we aim at estimating the nearest distance and the velocity of nearby vehicles for ADAS applications, similar to [Song et al.(2020)Song, Lu, Zhang, and Li]. This significantly reduces annotation cost and benefits deployment in real scenarios.

Vehicle velocity estimation is widely used in traffic surveillance [Jung et al.(2017)Jung, Choi, Jung, Lee, Kwon, and Jung, Tran et al.(2018)Tran, Dinh-Duy, Truong, Ton-That, Do, Luong, Nguyen, Nguyen, and Do], where a stationary camera estimates vehicle velocities and analyzes the traffic flow. However, because the surveillance camera is fixed, the problem is less complex than estimating relative velocity from the ego-vehicle. Previous works [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer, Song et al.(2020)Song, Lu, Zhang, and Li] leverage multiple features, including depth maps, optical flow, tracking information, or geometric cues, to estimate vehicle velocity. Although these two works achieve remarkable results, they treat each vehicle's velocity estimation problem independently and estimate each vehicle state separately, which causes prediction inconsistency.

To address the aforementioned problem, we incorporate relative constraints, explore the relationship between vehicles in the same frame, and propose a global relative constraint (GLC) loss, shown in Figure 1. Instead of treating each vehicle independently, the GLC loss regularizes the difference between the estimated relative state of a vehicle pair and the corresponding ground truth, which encourages consistent and reasonable predictions (detailed in Section 3.5). Besides, observing the importance of contextual information and spatial position [Choi et al.(2020)Choi, Kim, and Choo] for enriching the vehicle representation, we leverage multiple sources of information, including context-aware features, motion clues, and spatial patterns, to jointly predict the vehicle's state.

In summary, we make the following contributions: (1) We leverage motion, contextual, and spatial clues to extract helpful information for joint relative velocity and inter-vehicle distance estimation. (2) A global relative constraint (GLC) loss $L_{rel}$ is presented to encourage the model to learn consistent features across vehicles in the same frame. (3) Experimental results show that our approach outperforms state-of-the-art algorithms on two public datasets.

2 Related Work

Monocular inter-vehicle distance estimation. A straightforward way to estimate inter-vehicle distance is monocular depth estimation. Several supervised learning-based methods have been proposed for depth prediction. Eigen et al. [Eigen et al.(2014)Eigen, Puhrsch, and Fergus] present a multi-scale network that iteratively refines the predicted depth map across network stages. DORN [Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao] introduces an ordinal regression loss for depth network learning. To reduce labeling effort, Bian et al. [Bian et al.(2019)Bian, Li, Wang, Zhan, Shen, Cheng, and Reid] present an unsupervised framework that jointly predicts depth and ego-motion from image sequences, which significantly decreases annotation cost. On the other hand, some studies regress 3D bounding boxes of vehicles to measure their positions. Based on a vehicle shape prior and geometric constraints, 3DBBox [Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Kosecka] estimates the vehicle pose from a single RGB image. Pseudo-LiDAR [Wang et al.(2019)Wang, Chao, Garg, Hariharan, Campbell, and Weinberger] converts the estimated depth map into a 3D point cloud, which existing 3D detectors can use to predict 3D bounding boxes. PatchNet [Ma et al.(2020)Ma, Liu, Xia, Zhang, Zeng, and Ouyang] shows that projecting the 2D bounding box into world coordinates helps to learn the 3D vehicle position.

Monocular vehicle velocity estimation. To the best of our knowledge, few works address monocular velocity estimation for vehicles. In [Tran et al.(2018)Tran, Dinh-Duy, Truong, Ton-That, Do, Luong, Nguyen, Nguyen, and Do], a fixed camera captures the traffic flow; based on geometric constraints and the known locations of landmarks, vehicle velocity can be computed from the camera calibration parameters, which is simpler than the onboard setting. To estimate the relative velocity from the ego-vehicle, Kampelmühler et al. [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer] directly regress vehicle velocity from monocular sequences, utilizing several cues such as motion features from FlowNet [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] and depth features from Monodepth [Godard et al.(2017)Godard, Mac Aodha, and Brostow]. Furthermore, Song et al. [Song et al.(2020)Song, Lu, Zhang, and Li] utilize geometric constraints and optical flow features to jointly predict the velocity and distance of a vehicle. Although the above works achieve good performance, they predict each vehicle's state individually (Figure 1), neglecting the relationships between neighboring vehicles, and they do not exploit additional cues such as context-aware or spatial information.

Self-attention mechanism. Our multi-stream fusion block is related to the self-attention mechanism [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin]. Self-attention has been widely leveraged in sequence modeling and brings considerable improvements in natural language processing (NLP) tasks. Compared with CNNs, RNNs, and LSTMs, self-attention can achieve better performance due to its capability to capture long-range dependencies between items [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin]. Several recent studies adopt self-attention in computer vision tasks. For example, Wang et al. [Wang et al.(2018)Wang, Girshick, Gupta, and He] utilize non-local operations to capture long-term information for video classification. Self-attention has also been applied to object detection and image classification [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko, Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, and Houlsby, Zhao et al.(2020)Zhao, Jia, and Koltun], demonstrating its effectiveness in capturing long-range dependencies. Additionally, the self-attention operation can be integrated with generative models for image generation [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena].

3 Proposed Method

3.1 Problem Definition

Given two monocular images at timestamps $t-1$ and $t$ with known camera parameters, we aim to estimate each vehicle's distance and velocity in the current frame $t$ with respect to the camera coordinate system. For each vehicle, the inter-vehicle distance $d\in\mathbb{R}^{+}$ is defined as the distance from the onboard camera optical center to the closest tangent plane on the vehicle surface, as shown in Figure 2 (top right). We take the position of the closest point on the vehicle (in meters), $p\in\mathbb{R}^{3}$, as the vehicle position in the current frame $t$. Denoting the corresponding closest point on the vehicle in the previous frame $t-1$ as $p^{\prime}\in\mathbb{R}^{3}$, the relative velocity is defined as $v=(p-p^{\prime})/\Delta{t}\in\mathbb{R}^{3}$, where $\Delta t$ is the time difference between the two frames in seconds. Our goal is to estimate the vehicle state $\zeta=(v,p)$, i.e., its velocity and position in the current frame $t$.
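As a minimal illustration of these definitions (the function name, variable names, and numbers below are purely illustrative, not from the paper), the state of a vehicle follows directly from the closest-point positions at the two frames:

```python
import numpy as np

# Minimal sketch of the state definition above. p and p_prev are the closest-point
# positions (in meters) at frames t and t-1; dt is the frame gap in seconds.
def vehicle_state(p, p_prev, dt):
    p = np.asarray(p, dtype=float)
    v = (p - np.asarray(p_prev, dtype=float)) / dt  # v = (p - p') / dt
    return v, p                                     # state zeta = (v, p) at frame t

# A closest point moving from (0.5, 1.6, 24.0) to (0.4, 1.6, 23.2) over 0.1 s
# gives a relative velocity of roughly (-1.0, 0.0, -8.0) m/s.
v, p = vehicle_state([0.4, 1.6, 23.2], [0.5, 1.6, 24.0], 0.1)
```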

3.2 Overview

The overall architecture of MSANet for joint vehicle velocity and position estimation is shown in Figure 2. Given two monocular images from two timestamps, we first detect the vehicles with an off-the-shelf object detector (e.g., Faster R-CNN [Ren et al.(2015)Ren, He, Girshick, and Sun]) and crop an expanded region around each vehicle. Next, we apply three kinds of encoders to extract different information: motion clues, context-aware features, and spatial patterns (Section 3.3). Then, a multiple-stream attention fusion (MSAF) block is proposed to fuse all features effectively (Section 3.4). Finally, the fused intermediate representation from the MSAF block is used to predict the position and velocity of each vehicle. In addition, a global relative constraint (GLC) loss is proposed to encourage consistency between vehicles (Section 3.5).

Refer to caption
Figure 2: The architecture of MSANet. We adopt the motion stream, visual stream, and spatial stream to represent different views of the vehicle (Section 3.3). A multiple streams attention fusion (MSAF) block is proposed to merge different features to benefit the vehicle’s state estimation task (Section 3.4).

3.3 Multiple Streams Feature Representation

Motion stream. Similar to prior works [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer, Song et al.(2020)Song, Lu, Zhang, and Li], to estimate the vehicle motion we use an optical flow network to learn dense motion clues of the cropped vehicle from the previous frame $I_{t^{\prime}}$ to the current frame $I_{t}$. Following common practice in object detection, the RoI Align [He et al.(2017)He, Gkioxari, Dollár, and Girshick] operation is applied to extract the features inside the vehicle region. The motion clue $f_{m}$ can be formulated as:

$f_{m}=RoI(\mathbf{F}_{m},x_{v})$   (1)

where $\mathbf{F}_{m}$ is the optical flow feature and $x_{v}$ is the area of the vehicle.
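For concreteness, the following minimal PyTorch sketch (not the authors' code; the tensor shapes, downsampling factor, and 7×7 output size are assumptions) shows how per-vehicle motion features could be pooled from a dense flow feature map with torchvision's RoI Align:

```python
import torch
from torchvision.ops import roi_align

# Assumed shapes for illustration: a dense optical-flow feature map F_m for one image
# and the detected vehicle boxes x_v in pixel coordinates (x1, y1, x2, y2).
F_m = torch.randn(1, 64, 48, 156)                    # (batch, channels, H/8, W/8)
boxes = [torch.tensor([[100., 60., 180., 120.],
                       [300., 70., 420., 160.]])]    # one tensor of boxes per image

# Eq. (1): f_m = RoI(F_m, x_v). spatial_scale maps pixel boxes onto the downsampled map.
f_m = roi_align(F_m, boxes, output_size=(7, 7), spatial_scale=1.0 / 8, aligned=True)
print(f_m.shape)                                     # (num_vehicles, 64, 7, 7)
```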

Context-aware stream. We apply a CNN backbone as an encoder to extract visual information. To further refine the features, we apply DenseASPP [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang], which adds dense connections [Huang et al.(2017)Huang, Liu, Van Der Maaten, and Weinberger] to ASPP [Chen et al.(2017)Chen, Papandreou, Kokkinos, Murphy, and Yuille] to enlarge the receptive field and strengthen the network. We also integrate a residual connection [He et al.(2016)He, Zhang, Ren, and Sun] with DenseASPP to stabilize training and improve performance. Finally, RoI Align is adopted to obtain the context-aware feature $f_{c}$:

$f_{c}=RoI(\mathcal{D}(\mathbf{F}_{c})+\mathbf{F}_{c},x_{v})$   (2)

where $\mathbf{F}_{c}$ is the appearance feature from the CNN block and $\mathcal{D}$ is the DenseASPP [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] module.
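A rough sketch of this stream is given below. The dilated-convolution refiner is only a simplified stand-in for DenseASPP (channel widths, dilation rates, and shapes are assumptions); the residual shortcut and RoI pooling mirror Eq. (2):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DilatedRefiner(nn.Module):
    """Simplified stand-in for the DenseASPP refiner D(.): dilated convolution
    branches with dense connections, projected back to the backbone width."""
    def __init__(self, channels=256, rates=(3, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList()
        in_ch = channels
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, channels // 4, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True)))
            in_ch += channels // 4   # dense connection: later branches see all earlier outputs
        self.project = nn.Conv2d(in_ch, channels, 1)

    def forward(self, F_c):
        feats = [F_c]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return self.project(torch.cat(feats, dim=1))

refiner = DilatedRefiner()
F_c = torch.randn(1, 256, 48, 156)          # backbone appearance feature (assumed shape)
boxes = [torch.tensor([[100., 60., 180., 120.]])]
# Eq. (2): residual shortcut around the refiner, then RoI Align over the vehicle box.
f_c = roi_align(refiner(F_c) + F_c, boxes, output_size=(7, 7), spatial_scale=1.0 / 8, aligned=True)
```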

Spatial stream. Intuitively, the size and location of the bounding box help predict the position and velocity of the vehicle. We first encode the 2D bounding box by its center $(b_{x},b_{y})$, width $b_{w}$, and height $b_{h}$. With the known camera focal lengths $(f_{x},f_{y})$ and principal point $(c_{x},c_{y})$, the box is transformed from pixel coordinates to the world-coordinate representation $\mathbf{p}=[p_{x},p_{y},p_{w},p_{h}]$ as follows:

$p_{x}=\frac{(b_{x}-c_{x})}{f_{x}}\hat{z},\quad p_{y}=\frac{(b_{y}-c_{y})}{f_{y}}\hat{z},\quad p_{w}=\frac{b_{w}}{f_{x}},\quad p_{h}=\frac{b_{h}}{f_{y}}$   (3)

where $\hat{z}$ is a fixed depth scalar. Next, we apply an instance encoder consisting of two fully connected layers to encode the instance box feature $f_{i}$.
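Written out, Eq. (3) is a simple intrinsics-based lifting of the box onto a plane at fixed depth $\hat{z}$; the sketch below uses placeholder intrinsics (roughly KITTI-like, not values from the paper):

```python
import torch

def box_to_camera_plane(box, fx, fy, cx, cy, z_hat=1.0):
    """Eq. (3): lift a 2D box (center bx, by, width bw, height bh, in pixels)
    onto a plane at fixed depth z_hat using the camera intrinsics."""
    bx, by, bw, bh = box
    return torch.tensor([(bx - cx) / fx * z_hat,
                         (by - cy) / fy * z_hat,
                         bw / fx,
                         bh / fy])   # fed to the two-layer FC instance encoder -> f_i

# One detected box with illustrative intrinsics.
p = box_to_camera_plane(torch.tensor([620., 180., 90., 60.]), fx=721., fy=721., cx=609., cy=172.)
```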

On the other hand, the spatial pattern of the vehicle is also an important clue: the bounding box of a farther vehicle is typically closer to the top of the image. To this end, we design an encoder to extract the spatial pattern. Given a vehicle bounding box region, we generate a one-channel binary map separating the foreground object from the background; the map is zero everywhere except at the vehicle location. We apply a spatial encoder with two convolutional layers, followed by a global average pooling operation and a linear transform, to obtain the spatial pattern feature $f_{p}$.

Finally, the spatial feature $f_{sp}$ is generated by simply concatenating the instance feature $f_{i}$ and the spatial pattern feature $f_{p}$:

$f_{sp}=f_{i}\parallel f_{p}$   (4)

where $\parallel$ denotes concatenation along the channel axis.
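A minimal sketch of this spatial-pattern branch is shown below; the layer widths, map resolution, and feature dimensions are assumptions, and only the structure (two convolutions, global average pooling, a linear layer, then the concatenation of Eq. (4)) follows the description above:

```python
import torch
import torch.nn as nn

class SpatialPatternEncoder(nn.Module):
    """Two conv layers -> global average pooling -> linear, applied to a one-channel
    binary map that is 1 inside the vehicle box and 0 elsewhere."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fc = nn.Linear(32, out_dim)

    def forward(self, mask):
        h = self.convs(mask)
        h = h.mean(dim=(2, 3))              # global average pooling
        return self.fc(h)                   # spatial pattern feature f_p

# Binary map for one vehicle box (box coordinates scaled to the map resolution).
mask = torch.zeros(1, 1, 96, 312)
mask[:, :, 15:30, 25:45] = 1.0              # vehicle region set to one, background stays zero
f_p = SpatialPatternEncoder()(mask)

f_i = torch.randn(1, 64)                    # instance box feature from the two-layer FC encoder
f_sp = torch.cat([f_i, f_p], dim=1)         # Eq. (4): f_sp = f_i || f_p
```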

3.4 Multiple Streams Attention Fusion

Refer to caption
Figure 3: Multiple streams attention fusion block. The block fuses the context feature $f_{c}$, spatial feature $f_{sp}$, and motion feature $f_{m}$ to generate the final representation feature $x$.

Inspired by [Wang et al.(2020)Wang, Chen, Lu, Zhao, Trigoni, and Markham], which integrates a self-attention module to refine a flattened input feature, we follow the same spirit and extend it to a multi-stream attention fusion (MSAF) block that fuses the different features, as shown in Figure 3. The three flattened features are first concatenated, and the hybrid feature $Q$ is generated by a linear transform $W_{Q}$:

$Q=W_{Q}(f_{c}\parallel f_{m}\parallel f_{sp})$   (5)

Unlike the standard non-local module, we compute the correlation between the hybrid feature $Q$ and the context-aware feature $f_{c}$. The context feature is simply transformed by two non-shared fully connected layers $W_{K}$ and $W_{V}$ to obtain two intermediate features $K$ and $V$, which are combined with the hybrid feature $Q$ by matrix multiplication to generate the attention map $S$ and the attentive output $F$ as follows:

$S=\mathrm{softmax}(Q^{T}K),\qquad F=SV$   (6)

Finally, the correlation feature $F$ is passed through a fully connected layer $W_{F}$ and added through a shortcut connection, and the result is concatenated with the motion feature to obtain the final fused feature $x$:

$x=(f_{sp}+W_{F}\cdot F)\parallel f_{m}$   (7)

In this way, the final representation $x$ becomes more robust and is beneficial for predicting the vehicle's position and velocity (Section 4.3).
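The PyTorch sketch below gives one possible reading of Eqs. (5)-(7) for a batch of vehicles. The feature dimensions are assumptions, and the attention is taken over the channel dimension of the flattened features, in the spirit of AtLoc [Wang et al.(2020)Wang, Chen, Lu, Zhao, Trigoni, and Markham]; it is a sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MSAFBlock(nn.Module):
    """Sketch of the multi-stream attention fusion block, following Eqs. (5)-(7)."""
    def __init__(self, dim_c, dim_m, dim_sp):
        super().__init__()
        self.W_Q = nn.Linear(dim_c + dim_m + dim_sp, dim_c)  # hybrid query from all streams
        self.W_K = nn.Linear(dim_c, dim_c)                   # non-shared transforms of f_c
        self.W_V = nn.Linear(dim_c, dim_c)
        self.W_F = nn.Linear(dim_c, dim_sp)

    def forward(self, f_c, f_m, f_sp):
        Q = self.W_Q(torch.cat([f_c, f_m, f_sp], dim=1))     # Eq. (5)
        K, V = self.W_K(f_c), self.W_V(f_c)
        # Eq. (6): channel-wise attention map S = softmax(Q^T K), attentive output F = S V.
        S = torch.softmax(Q.unsqueeze(2) * K.unsqueeze(1), dim=2)   # (B, C, C)
        att = torch.bmm(S, V.unsqueeze(2)).squeeze(2)               # (B, C)
        # Eq. (7): shortcut connection, then concatenation with the motion feature.
        return torch.cat([f_sp + self.W_F(att), f_m], dim=1)

# Example with assumed per-vehicle feature sizes.
block = MSAFBlock(dim_c=256, dim_m=128, dim_sp=128)
x = block(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 128))  # shape (4, 256)
```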

3.5 Global Relative Constraint Loss

Refer to caption
Figure 4: Vehicle regression example. (a) We regress each vehicle's velocity and position with the regression loss ($L_{reg}$) and incorporate the global relative constraint (GLC) loss ($L_{rel}$) to regularize the predictions between vehicles. (b) Two predictions $\zeta^{a}_{i}$ and $\zeta^{b}_{i}$ contribute the same regression loss for vehicle $i$ but have different relative states $d^{a}_{ij}$ and $d^{b}_{ij}$ with respect to vehicle $j$ with predicted state $\zeta_{j}$, which yields different relative losses against the true relative state $\hat{d}_{ij}$.

As shown in Figure 4(a), our model regresses each vehicle's velocity and position with a regression loss $L_{reg}$, similar to [Song et al.(2020)Song, Lu, Zhang, and Li]. Because predicting the absolute state of a vehicle from a monocular camera is difficult, we further incorporate a global relative constraint (GLC) between vehicles to enforce prediction consistency, which helps the model reduce errors by constraining the relative vehicle states.

An illustration of the global relative constraint (GLC) is shown in Figure 4(b). Consider two different predictions $\zeta^{a}_{i}$ and $\zeta^{b}_{i}$ with the same offset $\Delta_{i}$ from the true state $\hat{\zeta}_{i}$ of vehicle $i$; they contribute the same regression loss for vehicle $i$. However, with respect to another vehicle $j$ with predicted state $\zeta_{j}$, the prediction $\zeta^{a}_{i}$ is more reasonable than $\zeta^{b}_{i}$ because the state difference $d^{a}_{ij}$ is closer to the true state difference $\hat{d}_{ij}$ than $d^{b}_{ij}$ is. To this end, we propose a relative loss that exploits the state differences between vehicles, which enforces global consistency and improves velocity and position prediction. The global relative constraint (GLC) loss is formulated as:

$L_{rel}=\sum_{i,j=1,i\neq j}h(d_{ij},\hat{d}_{ij})$   (8)

where $d_{ij}=(\zeta^{i}_{v}-\zeta^{j}_{v},\zeta^{i}_{p}-\zeta^{j}_{p})$ is the relative state between vehicles $i$ and $j$, $h(\cdot)$ is a distance function, and $\hat{(\cdot)}$ denotes the ground truth of the relative state. For the distance function $h(\cdot)$, we choose the Charbonnier loss [Barron(2019)] $L_{Cha}$, a robust $L_{1}$ loss, as the objective, expressed as follows:

$h(\mathbf{s},\hat{\mathbf{s}})=\sqrt{(\mathbf{s}-\hat{\mathbf{s}})^{2}+\epsilon^{2}}$   (9)

where $\epsilon$ is a small constant (e.g., $10^{-6}$). The Charbonnier loss handles outliers efficiently, which benefits network training (see the experimental results in Section 4.3).
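A minimal sketch of Eqs. (8)-(9) for the $N$ vehicles of one frame is given below (the state tensors stack velocity and position entries per vehicle; the shapes and toy numbers are only illustrative):

```python
import torch

def charbonnier(s, s_hat, eps=1e-6):
    """Eq. (9): robust L1 distance h(s, s_hat)."""
    return torch.sqrt((s - s_hat) ** 2 + eps ** 2)

def glc_loss(zeta, zeta_hat):
    """Eq. (8): global relative constraint over all ordered vehicle pairs (i != j).
    zeta, zeta_hat: (N, D) predicted / ground-truth states, D stacking velocity and position."""
    d = zeta.unsqueeze(1) - zeta.unsqueeze(0)              # (N, N, D) predicted relative states d_ij
    d_hat = zeta_hat.unsqueeze(1) - zeta_hat.unsqueeze(0)  # ground-truth relative states
    off_diag = ~torch.eye(zeta.size(0), dtype=torch.bool)  # drop the i == j terms
    return charbonnier(d, d_hat)[off_diag].sum()

# Toy example: three vehicles, each state = (v_x, v_z, p_x, p_z).
pred = torch.tensor([[4.7, -0.3, 33.9, -1.5],
                     [1.8, -0.1, 42.6,  2.3],
                     [0.0, -0.3, 26.9,  4.1]])
gt   = torch.tensor([[5.5, -0.2, 36.9, -1.8],
                     [2.1, -0.1, 41.2,  0.7],
                     [0.1, -0.4, 27.3,  2.7]])
loss = glc_loss(pred, gt)
```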

To train MSANet, we apply a regression loss to estimate the absolute velocity and position of each vehicle:

$L_{reg}=\sum_{i=1}^{N}h(\zeta^{i}_{v},\hat{\zeta}^{i}_{v})+\lambda\sum_{i=1}^{N}h(\zeta^{i}_{p},\hat{\zeta}^{i}_{p})$   (10)

where $\lambda$ is the scaling coefficient between position and velocity, set to 0.1 by default, and $N$ is the number of target vehicles in the image. Furthermore, similar to [Godard et al.(2017)Godard, Mac Aodha, and Brostow], to encourage smoother optical flow prediction, we adopt a smoothness loss on the estimated optical flow $F$ with respect to the cropped image $I$:

$L_{smooth}=\sum_{i,j}\sum_{d\in\{x,y\}}|\partial_{d}F(i,j)|\,e^{-|\partial_{d}I(i,j)|}$   (11)

where $\partial_{d}$ denotes the partial derivative along the $x$ or $y$ direction. This edge-aware loss enforces local smoothness of the optical flow, weighted by the image gradients.
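A small sketch of Eq. (11), using simple forward differences for the gradients (tensor shapes are assumptions; the image-gradient weight is averaged over color channels):

```python
import torch

def flow_smoothness_loss(flow, image):
    """Eq. (11): edge-aware smoothness that down-weights flow gradients at image edges.
    flow: (B, 2, H, W) estimated optical flow; image: (B, 3, H, W) cropped image."""
    def grad_x(t): return t[..., :, 1:] - t[..., :, :-1]
    def grad_y(t): return t[..., 1:, :] - t[..., :-1, :]

    # exp(-|dI|): image gradient magnitude averaged over channels as the edge-aware weight.
    w_x = torch.exp(-grad_x(image).abs().mean(dim=1, keepdim=True))
    w_y = torch.exp(-grad_y(image).abs().mean(dim=1, keepdim=True))

    return (grad_x(flow).abs() * w_x).sum() + (grad_y(flow).abs() * w_y).sum()

loss = flow_smoothness_loss(torch.randn(1, 2, 96, 312), torch.rand(1, 3, 96, 312))
```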

The final loss function can be summarized as a weighted sum of the above four terms:

$L=L_{reg}+\lambda_{1}L_{smooth}+\lambda_{2}L_{rel}$   (12)

where $\lambda_{1}$ and $\lambda_{2}$ are scaling coefficients, set to $\lambda_{1}=1$ and $\lambda_{2}=0.3$ by default.

4 Experiments

4.1 Experimental Setup

Dataset. We adopt two datasets for the experiments: the KITTI raw dataset [Geiger et al.(2012)Geiger, Lenz, and Urtasun] and the TuSimple velocity dataset (https://github.com/TuSimple/tusimple-benchmark). The TuSimple velocity dataset includes 1074 driving sequences for training; each video is 2 seconds long at 20 fps, and the bounding boxes are annotated for the last frame. For the KITTI raw dataset, we follow the setting in [Song et al.(2020)Song, Lu, Zhang, and Li]. Since detailed tracklets of each vehicle in the video clips are available, we generate the distance and velocity ground truth ourselves.

Training details. MSANet uses PWC-Net [Sun et al.(2018)Sun, Yang, Liu, and Kautz] pretrained on FlyingChairs [Dosovitskiy et al.(2015)Dosovitskiy, Fischer, Ilg, Häusser, Hazırbaş, Golkov, v.d. Smagt, Cremers, and Brox] as the optical flow extractor to estimate the motion of each vehicle, and adopts ResNet-34 [He et al.(2016)He, Zhang, Ren, and Sun] as the feature extractor. During training, we use the Adam optimizer with a mini-batch size of 4 and train for 100 epochs. All experiments are implemented in PyTorch.
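A minimal training-setup sketch for the stated configuration (Adam, mini-batch 4, 100 epochs) is shown below; the learning rate, model, and data are placeholders, not values or code from the paper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 3)                              # stand-in for MSANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is an assumption
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 3)),
                    batch_size=4, shuffle=True)             # mini-batch size 4

for epoch in range(100):                                    # 100 epochs
    for inputs, targets in loader:
        loss = torch.nn.functional.l1_loss(model(inputs), targets)  # stands in for Eq. (12)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```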

Metrics. For velocity estimation, we follow the rules of the TuSimple Velocity Challenge. Vehicles are categorized into three groups according to their distance $d$ to the ego-vehicle: near range ($d<20\,m$), medium range ($20\,m<d<45\,m$), and far range ($d>45\,m$). The main metric is the mean square error (MSE) of the velocity (in $m^{2}/s^{2}$) and of the position (in $m^{2}$), following the metric used in the TuSimple dataset; the mean MSE over the three groups is the final score. For vehicle distance estimation, we use the standard metrics from depth estimation [Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao, Eigen et al.(2014)Eigen, Puhrsch, and Fergus], including absolute relative difference (AbsRel), squared relative difference (SqRel), RMSE, RMSE (log), and the $\delta$ accuracy. Note that we only consider the closest point of each vehicle, following the setting described in [Song et al.(2020)Song, Lu, Zhang, and Li].
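The sketch below illustrates the range-grouped velocity MSE (how ties at the 20 m and 45 m boundaries are assigned, and whether the per-vehicle error sums or averages the two velocity components, are assumptions here):

```python
import numpy as np

def range_grouped_mse(pred_v, gt_v, dist):
    """Velocity MSE per distance group: near (d < 20 m), medium (20-45 m), far (d > 45 m);
    the average over the groups is the final score. pred_v, gt_v: (N, 2); dist: (N,)."""
    groups = {"near": dist < 20, "medium": (dist >= 20) & (dist < 45), "far": dist >= 45}
    mse = {name: float(np.mean(np.sum((pred_v[m] - gt_v[m]) ** 2, axis=1)))
           for name, m in groups.items() if m.any()}
    mse["avg"] = float(np.mean(list(mse.values())))
    return mse

# Toy usage with 2D velocities (x- and z-components) and per-vehicle distances in meters.
scores = range_grouped_mse(np.array([[4.7, -0.3], [1.8, -0.1]]),
                           np.array([[5.5, -0.2], [2.1, -0.1]]),
                           np.array([36.0, 46.0]))
```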

4.2 Main Results

Method, Position MSE ↓, Velocity MSE (near) ↓, Velocity MSE (medium) ↓, Velocity MSE (far) ↓, Velocity MSE (avg.) ↓
Rank1 [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer]: -, 0.18, 0.66, 3.07, 1.30
Rank2: -, 0.25, 0.75, 3.50, 1.50
Rank3: -, 0.55, 2.21, 5.94, 2.90
Song et al. (org) [Song et al.(2020)Song, Lu, Zhang, and Li]: 9.72, 0.23, 0.99, 3.27, 1.50
Song et al. (full) [Song et al.(2020)Song, Lu, Zhang, and Li]: 10.23, 0.15, 0.34, 2.09, 0.86
Ours: 7.56, 0.10, 0.26, 1.58, 0.65
Table 1: Quantitative results of vehicle position and relative velocity estimation on the TuSimple dataset.
Method, AbsRel ↓, SqRel ↓, RMSE ↓, RMSE (log) ↓, $\delta<1.25$ ↑, $\delta<1.25^{2}$ ↑, $\delta<1.25^{3}$ ↑
Song et al. (org) [Song et al.(2020)Song, Lu, Zhang, and Li]: 0.037, 0.132, 2.700, 0.059, 0.989, 1.000, 1.000
Song et al. (full) [Song et al.(2020)Song, Lu, Zhang, and Li]: 0.041, 0.152, 2.894, 0.062, 0.987, 1.000, 1.000
Ours: 0.034, 0.105, 2.416, 0.050, 0.997, 1.000, 1.000
Table 2: Quantitative results of vehicle distance estimation on the TuSimple dataset.

Experimental results on the TuSimple dataset. The velocity and position in the TuSimple dataset are annotated along the x-axis and z-axis of the camera coordinate system. Following the setting in [Song et al.(2020)Song, Lu, Zhang, and Li], we set the output of the proposed network to three dimensions: one for position and two for velocity. The remaining position dimension is obtained by inverse projection with the vehicle's bounding box. In practice, we transform the bottom-center pixel of the cropped image into the world coordinate system to obtain a z-axis reference position, and the network learns a residual value to predict the vehicle's z-axis position.

To show the capability of the proposed network, we compare with the top three entries of the TuSimple velocity challenge and the joint velocity and position estimation network proposed in [Song et al.(2020)Song, Lu, Zhang, and Li]. As shown in Table 1, our model outperforms the others. The rank-one method [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer] in the challenge utilizes three separate models for different ranges, which complicates hyperparameter tuning. In [Song et al.(2020)Song, Lu, Zhang, and Li], the authors predict distances and velocities over all ranges from the cropped vehicle image, which limits performance. In contrast, we additionally leverage spatial and context-aware information and predict the vehicle states jointly with the global relative constraint (GLC) loss, which helps the model across the different ranges. The average mean square velocity error of our approach is about 0.65 $m^{2}/s^{2}$, corresponding to a 0.40 $m/s$ absolute error, which is better than the method in [Song et al.(2020)Song, Lu, Zhang, and Li] (0.86 $m^{2}/s^{2}$ MSE, corresponding to a 0.48 $m/s$ absolute error). Besides, the detailed statistics of the vehicle distance regression performance are shown in Table 2, confirming that the proposed network and the global relative constraint (GLC) are effective for distance estimation.

Experimental results on the KITTI dataset. Table 3 presents the velocity prediction performance of our model on the KITTI dataset. Our method outperforms the prior art [Song et al.(2020)Song, Lu, Zhang, and Li] across all distance ranges for relative velocity estimation. We further report the distance estimation results on the KITTI dataset with the same setting as [Song et al.(2020)Song, Lu, Zhang, and Li] for a fair comparison. As shown in Table 4, our approach achieves competitive results and outperforms the others on most metrics. Moreover, our model produces fewer outliers because the global relative constraint (GLC) encourages reasonable estimates.

Method, MSE (near) ↓, MSE (medium) ↓, MSE (far) ↓, MSE (avg.) ↓
Song et al. (full) [Song et al.(2020)Song, Lu, Zhang, and Li]: 0.29, 0.93, 1.57, 0.94
Ours: 0.23, 0.67, 0.96, 0.62
Table 3: Quantitative results of velocity estimation on the KITTI dataset.
Method, AbsRel ↓, SqRel ↓, RMSE ↓, RMSE (log) ↓, $\delta<1.25$ ↑, $\delta<1.25^{2}$ ↑, $\delta<1.25^{3}$ ↑
3DBBox [Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Kosecka]: 0.222, 1.863, 7.696, 0.228, 0.659, 0.966, 0.994
DORN [Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao]: 0.078, 0.505, 4.078, 0.179, 0.927, 0.985, 0.995
Unsfm [Bian et al.(2019)Bian, Li, Wang, Zhan, Shen, Cheng, and Reid]: 0.219, 1.924, 7.873, 0.338, 0.710, 0.886, 0.933
Song et al. [Song et al.(2020)Song, Lu, Zhang, and Li]: 0.075, 0.474, 4.639, 0.124, 0.912, 0.996, 1.000
Ours: 0.098, 0.444, 4.240, 0.127, 0.930, 0.998, 1.000
Table 4: Quantitative results of vehicle distance estimation on the KITTI dataset. We compare with the baseline results of prior works reported in [Song et al.(2020)Song, Lu, Zhang, and Li].

Qualitative results. Figure 5 visualizes the predicted position and velocity on the TuSimple dataset. The example shows that our model's predictions are close to the ground truth. Moreover, our model has lower error than [Song et al.(2020)Song, Lu, Zhang, and Li], which indicates that the proposed method performs better at jointly predicting velocity and position.

Refer to caption
Figure 5: Qualitative visualization results on the TuSimple dataset. MSE scores are shown as mean $\pm$ $\sqrt{\mathrm{variance}}$.

Vehicle id, Position (m): [Song et al.(2020)Song, Lu, Zhang, and Li] / Ours / GT, Velocity (m/s): [Song et al.(2020)Song, Lu, Zhang, and Li] / Ours / GT
1: (35.1, -1.6) / (33.9, -1.5) / (36.9, -1.8), (4.3, -0.5) / (4.7, -0.3) / (5.5, -0.2)
2: (44.2, 2.3) / (42.6, 2.3) / (41.2, 0.7), (1.3, -0.2) / (1.8, -0.1) / (2.1, -0.1)
3: (30.5, 4.6) / (26.9, 4.1) / (27.3, 2.7), (0.2, -0.5) / (-0.0, -0.3) / (0.1, -0.4)
4: (36.5, 4.5) / (34.2, 4.2) / (34.0, 2.7), (-0.2, -0.0) / (-0.7, -0.4) / (-0.5, -0.2)
5: (11.9, 5.3) / (10.6, 4.7) / (9.8, 3.3), (-1.2, -0.3) / (-1.0, -0.1) / (-0.7, -0.3)
MSE ↓: 9.23 ± 3.71 ($m^{2}$) / 4.10 ± 2.64 ($m^{2}$) / -, 0.48 ± 0.54 ($m^{2}/s^{2}$) / 0.16 ± 0.18 ($m^{2}/s^{2}$) / -

4.3 Ablation Study

Effect of the proposed streams and fusion module. We conduct an ablation study to analyze the effect of the three proposed streams and the fusion block; the results are shown in Table 5. First, with only the motion stream (M), performance is unsatisfactory, reaching 0.91 MSE for velocity estimation. Second, adding the spatial stream (SP) reduces the MSE to 0.84, showing that the extra spatial information helps. Third, further combining the context-aware stream (C) by simple concatenation improves performance by about 10%, reaching 0.75, which demonstrates that contextual information is helpful for velocity and position estimation. Finally, fusing the three streams with the proposed multiple-stream attention fusion block achieves the best performance of 0.65. These results confirm the effectiveness of the three streams and the fusion block.

Effect of the proposed losses. To investigate the effect of the proposed losses, we conduct five experiments with different loss configurations, listed in Table 6. The first three rows compare different distance functions $h(\cdot)$ in Eq. 10; the results support our claim that the Charbonnier loss outperforms the L1 and smooth L1 losses. The fourth row shows the benefit of the smoothness loss. Finally, the last row shows that the proposed relative constraint loss, which regularizes consistency between vehicles in the same frame, achieves the best performance.


Index, M, SP, C, MSE (V) ↓, MSE (P) ↓
1: ✓, ×, ×, 0.91, 10.23
2: ✓, ✓, ×, 0.84, 8.26
3: ✓, ✓, ✓, 0.75, 8.02
4 (with MSAF fusion): ✓, ✓, ✓, 0.65, 7.56
Table 5: Ablation study on the effectiveness of the motion stream (M), spatial stream (SP), context stream (C), and the proposed fusion block; row 4 fuses the three streams with the proposed fusion block.

Loss Function, MSE (V) ↓
$L_{reg}$ ($L_{1}$): 0.85
$L_{reg}$ (Smooth $L_{1}$): 0.79
$L_{reg}$ ($L_{Cha}$): 0.77
$L_{reg}$ ($L_{Cha}$) + $L_{Smooth}$: 0.73
$L_{reg}$ ($L_{Cha}$) + $L_{Smooth}$ + $L_{Rel}$: 0.65
Table 6: Analysis of different loss functions in the proposed framework. $L_{reg}(\cdot)$ denotes the distance measurement function $h$ described in Eq. 10.

5 Conclusion

This work presents a novel framework for joint vehicle velocity and inter-vehicle distance estimation. MSANet leverages multiple sources of information, including context-aware features, motion clues, and spatial positions, to learn the vehicle's state. A novel global relative constraint (GLC) loss is proposed to resolve the prediction inconsistency problem. Experiments on the KITTI and TuSimple datasets validate the effectiveness of the proposed approach. We believe our idea paves a new path for research on advanced driver-assistance systems.

Acknowledgements

This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2634-F-002-026. We are grateful to the National Center for High-performance Computing. We also thank Tsung-Han Wu for his helpful discussions on this work.

References

  • [Barron(2019)] Jonathan T Barron. A general and adaptive robust loss function. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [Bian et al.(2019)Bian, Li, Wang, Zhan, Shen, Cheng, and Reid] Jia-Wang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, and Ian Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.
  • [Chen et al.(2017)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
  • [Choi et al.(2020)Choi, Kim, and Choo] Sungha Choi, Joanne T. Kim, and Jaegul Choo. Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [Dosovitskiy et al.(2015)Dosovitskiy, Fischer, Ilg, Häusser, Hazırbaş, Golkov, v.d. Smagt, Cremers, and Brox] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, and Houlsby] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [Eigen et al.(2014)Eigen, Puhrsch, and Fergus] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  • [Fu et al.(2018)Fu, Gong, Wang, Batmanghelich, and Tao] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep Ordinal Regression Network for Monocular Depth Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [Geiger et al.(2012)Geiger, Lenz, and Urtasun] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [Godard et al.(2017)Godard, Mac Aodha, and Brostow] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [He et al.(2017)He, Gkioxari, Dollár, and Girshick] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [Huang et al.(2017)Huang, Liu, Van Der Maaten, and Weinberger] Gao Huang, Zhuang Liu, L. Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [Jung et al.(2017)Jung, Choi, Jung, Lee, Kwon, and Jung] Heechul Jung, Min-Kook Choi, Jihun Jung, Jin-Hee Lee, Soon Kwon, and Woo Young Jung. Resnet-based vehicle classification and localization in traffic surveillance systems. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
  • [Kampelmühler et al.(2018)Kampelmühler, Müller, and Feichtenhofer] Moritz Kampelmühler, Michael G. Müller, and Christoph Feichtenhofer. Camera-based vehicle velocity estimation from monocular video. In Computer Vision Winter Workshop (CVWW), 2018.
  • [Li et al.(2019)Li, Chen, and Shen] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo r-cnn based 3d object detection for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [Li et al.(2020)Li, Zhao, Liu, and Cao] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In European Conference on Computer Vision (ECCV), 2020.
  • [Ma et al.(2020)Ma, Liu, Xia, Zhang, Zeng, and Ouyang] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking pseudo-lidar representation. In European Conference on Computer Vision (ECCV), 2020.
  • [Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Kosecka] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3D Bounding Box Estimation Using Deep Learning and Geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
  • [Sadat et al.(2020)Sadat, Casas, Ren, Wu, Dhawan, and Urtasun] Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, and Raquel Urtasun. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In European Conference on Computer Vision (ECCV), 2020.
  • [Song et al.(2020)Song, Lu, Zhang, and Li] Zhenbo Song, Jianfeng Lu, Tong Zhang, and Hongdong Li. End-to-end learning for inter-vehicle distance and relative velocity estimation in adas with a monocular camera. In IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [Sun et al.(2018)Sun, Yang, Liu, and Kautz] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [Tran et al.(2018)Tran, Dinh-Duy, Truong, Ton-That, Do, Luong, Nguyen, Nguyen, and Do] Minh-Triet Tran, Tung Dinh-Duy, Thanh-Dat Truong, Vinh Ton-That, Thanh-Nhon Do, Quoc-An Luong, Thanh-An Nguyen, Vinh-Tiep Nguyen, and Minh N. Do. Traffic flow analysis with multiple adaptive vehicle detectors and velocity estimation with landmark-based scanlines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [Wang et al.(2020)Wang, Chen, Lu, Zhao, Trigoni, and Markham] Bing Wang, Changhao Chen, Chris Xiaoxuan Lu, Peijun Zhao, Niki Trigoni, and Andrew Markham. Atloc: Attention guided camera localization. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • [Wang et al.(2018)Wang, Girshick, Gupta, and He] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [Wang et al.(2019)Wang, Chao, Garg, Hariharan, Campbell, and Weinberger] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [Zhang et al.(2018)Zhang, Goodfellow, Metaxas, and Odena] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [Zhao et al.(2020)Zhao, Jia, and Koltun] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.