
Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation

Xiaobin Wei    Jianjiang Feng    Jie Zhou
Abstract

In this paper, we propose a semantics-driven unsupervised learning approach for monocular depth and ego-motion estimation from videos. Recent unsupervised methods use the photometric error between a synthesized view and the actual image as the supervision signal for training. In our method, we exploit semantic segmentation to mitigate the effects of dynamic objects and occlusions in the scene and to improve depth prediction by considering the correlation between depth and semantics. To avoid a costly labeling process, we use noisy semantic segmentation results produced by a pre-trained semantic segmentation network. In addition, we minimize the position error between corresponding points of adjacent frames to exploit 3D spatial information. Experimental results on the KITTI dataset show that our method achieves strong performance on both depth and ego-motion estimation tasks.

Keywords:
Depth prediction, ego-motion estimation, semantics-driven unsupervised learning

1 Introduction

Visual odometry (VO) [36] is the process of estimating a camera's motion from image sequences. It is a basic task in many computer vision applications such as autonomous driving, augmented reality, and navigation systems. Recovering 3D depth information from 2D images is another important problem in computer vision and plays an important role in scene understanding [53, 35], 3D reconstruction [37, 40], etc.

In the past decade, geometry-based VO approaches have been extensively studied. They generally fall into two categories. (1) Feature-based methods, such as PTAM [19] and ORB-SLAM [29, 30], estimate camera poses and generate a sparse 3D map by minimizing the re-projection error; a typical pipeline consists of feature extraction, feature matching, motion estimation, and local optimization. (2) Direct methods compute the camera motion directly from raw pixel intensities by minimizing the photometric error. However, these methods are often not robust in challenging conditions such as motion blur and lack of texture. In recent years, several works [8, 7, 23, 21, 22, 42] have adopted supervised neural networks to solve the VO and depth prediction problems. Since these methods need large amounts of labeled data with ground truth for training, and the LIDAR sensors on autonomous vehicles provide only very sparse 3D points, their generalization to new scenarios is limited.

Compared with supervised learning, unsupervised learning does not require labeled data, and an increasing number of studies focus on unsupervised methods for depth and camera motion estimation [54, 22, 52, 26, 51, 50, 38, 25, 32, 45, 13]. These methods use the photometric error as the training loss: the predicted depth and poses are used to project the source image onto the target frame to synthesize the target view, and the network is trained by minimizing the error between the synthesized view and the actual image.

For dynamic and occluded objects in the scene, however, the assumption of photometric consistency between adjacent frames does not hold, which leads to inaccurate depth prediction. We propose to use semantic segmentation to alleviate this problem: when the label of a pixel in the source image differs from that of the corresponding pixel in the target image, the pixel likely belongs to a moving object or an occluded region. We further explore another application of semantic segmentation by exploiting the correlation between depth and semantics to improve depth estimation performance. For example, of two vertically adjacent pixels in the ground region of an image, the upper pixel has a larger depth. To avoid a costly manual labeling process, we use semantic segmentation obtained from a pre-trained segmentation network.

Although the semantic segmentation network itself is trained with supervision, its adoption does not affect the unsupervised nature of the core algorithm in this paper, and it is reasonable in practice for the following reasons: (1) no semantic labels need to be provided on the training data of the depth prediction and pose estimation networks; (2) obtaining semantic segmentation labels has a labor cost but is feasible, whereas there is currently no convenient way to obtain high-resolution, accurate depth maps in dynamic scenes; (3) semantic segmentation is inherently a supervised task; (4) several large-scale labeled semantic segmentation datasets already exist; (5) existing semantic segmentation networks achieve excellent performance and generalization ability.

Figure 1: Example of depth prediction on the KITTI dataset. From top to bottom: input RGB image, input semantic segmentation estimated by [55], and the depth map output by the proposed method

The photometric error considers only 2D appearance information. We propose an additional 3D point loss that exploits 3D spatial information. For a pixel in the target frame, its 3D coordinates in the target camera coordinate system can be obtained from the depth map. The 3D coordinates of the corresponding pixel in the source camera coordinate system can likewise be obtained from the depth maps and the transformation matrix. Using the transformation matrix, the two points can be brought into the same coordinate system, and the distance between them should be as small as possible; this distance serves as the 3D point loss.

Our method is evaluated on the KITTI dataset [11], and the results show its effectiveness in monocular depth prediction and camera motion estimation. Figure 1 shows a result of our monocular depth prediction on the KITTI dataset. Our main contributions are as follows: (1) We propose a semantic loss, use semantic consistency as a mask for the photometric loss and the 3D point loss to reduce the influence of dynamic objects and occlusion in the scene, and exploit the depth characteristics of certain semantic categories to improve the accuracy of depth prediction. (2) We propose a 3D point loss that improves depth prediction by utilizing 3D information.

2 Related Work

Existing methods for depth and ego-motion estimation include geometry-based methods and learning-based methods.

Geometry-based methods. Geometry-based VO schemes can be divided into two categories: feature-based methods and direct methods. Feature-based methods extract stable feature points from each frame, match adjacent frames through the invariant descriptors [24, 2, 34] of these feature points, and then recover camera poses and map point coordinates robustly through epipolar geometry [14]; however, the extraction and matching of feature points is time-consuming, which makes classical feature-based methods run slower than direct methods. MonoSLAM [6], proposed by Davison et al. in 2007, is the first real-time monocular visual SLAM system; its front-end tracks feature points and its back-end uses extended Kalman filtering. Klein et al. proposed PTAM (Parallel Tracking and Mapping) [19], the earliest method to use non-linear optimization, which parallelizes the tracking and mapping processes. Mur-Artal et al. proposed ORB-SLAM [29] in 2015, which builds on the PTAM architecture, adds map initialization and loop closure detection, improves key-frame selection and map construction, and achieves good processing speed, tracking quality and map accuracy. Direct methods estimate the camera poses and map structure by minimizing the photometric error of raw pixel intensities, without computing key points and descriptors. LSD-SLAM (Large-Scale Direct monocular SLAM) [10], a direct monocular SLAM algorithm proposed by Engel et al. in 2014, uses direct tracking and is less sensitive to feature-poor, homogeneous regions. Engel et al. [9] combine a fully direct probabilistic model with consistent, joint optimization of all model parameters, including geometry, camera intrinsics, poses and pixel depths.

Supervised learning methods. Eigen et al. [8] propose a multi-scale deep network for depth prediction: one network stack makes a coarse global prediction from the entire image, and another refines this prediction locally. [7] extends [8] and solves three different computer vision tasks with a single multi-scale convolutional architecture: depth prediction, surface normal estimation, and semantic labeling. Liu et al. [23] cast depth estimation as a continuous CRF learning problem and learn the unary and pairwise potentials of the continuous CRF in a unified deep CNN framework. Laina et al. [21] propose a fully convolutional architecture with residual learning to model the ambiguous mapping between monocular images and depth maps. [48, 31, 18] use more than one image during the training stage for depth estimation. [20] is a semi-supervised method for learning dense monocular depth. For VO, Wang et al. [46] propose an end-to-end monocular framework, DeepVO, which learns effective feature representations for VO through a CNN and implicitly models sequential dynamics and relations with a deep RNN. Ummenhofer et al. [42] train, end-to-end, a convolutional network consisting of multiple stacked encoder-decoder networks to compute depth and camera motion from successive, unconstrained image pairs. VINet [5] is a sequence-to-sequence framework for motion estimation using visual and inertial sensors. [17, 16, 43, 3] regress the 6-DOF camera pose from a single RGB image.

Unsupervised learning methods. Most unsupervised works are supervised by view synthesis, which minimizes the difference between the synthesized view and the target image. Godard et al. [12] propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images to estimate depth. Zhou et al. [54] propose an unsupervised learning framework for monocular depth and camera motion estimation from unstructured video sequences; the network is divided into a single-view depth network and a multi-view pose network, with a loss based on warping nearby views to the target using the computed depth and poses. UnDeepVO [22] uses spatial and temporal losses between stereo image sequences for unsupervised training; stereo image pairs are used to recover the scale, while testing uses consecutive monocular images. Zhan et al. [52] add a deep feature-based warping loss to improve the accuracy and robustness of depth and motion estimation. Considering the inferred 3D geometry of the whole scene, Mahjourian et al. [26] propose an unsupervised method for monocular depth and ego-motion estimation that uses 3D geometric constraints to enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames. GeoNet [51] is a jointly unsupervised learning framework for monocular depth, optical flow and ego-motion estimation from videos, which learns rigid flow and object motion separately with a rigid structure reconstructor and a non-rigid motion localizer. Some works [33, 49, 41] learn 3D structure from 2D images based on projective geometry. Wang et al. [44] use a differentiable implementation of direct visual odometry and a novel depth normalization strategy to improve monocular video depth prediction. Yang et al. [50] introduce a "3D as-smooth-as-possible (3D-ASAP)" prior to learn edges and geometry (depth, normal) all at once. Shen et al. [38] use epipolar geometry to incorporate intermediate geometric computations such as feature matches into the tasks. To eliminate the need for a static-scene assumption, EPC++ [25] uses three parallel networks to predict the camera motion, depth map, and per-pixel optical flow between two frames. Ranjan et al. [32] introduce Competitive Collaboration to segment the scene into static and moving regions without supervision. Wang et al. [45] use recurrent neural networks to exploit temporal information. Chen et al. [4] perform unsupervised depth prediction and supervised semantic segmentation using stereo image pairs and semantic segmentation ground truth. Meng et al. [28] use semantic segmentation, instance class segmentation and instance edge maps for unsupervised 3D geometry perception. Godard et al. [13] propose a minimum reprojection loss to robustly handle occlusions and a full-resolution multi-scale sampling method to reduce visual artifacts. In contrast, we take semantic consistency as the mask of the photometric loss and the 3D point loss, and consider the correlation between depth and semantics.

Figure 2: Architecture of our method. The network consists of two parts, a depth prediction network and a pose estimation network, which are trained jointly. Given RGB video and semantic labels estimated by a state-of-the-art semantic segmentation algorithm as input, the proposed network outputs the depth map of each frame and the relative pose between adjacent frames

3 Method

An overview of our approach is shown in Figure 2. It learns depth and camera motion from unlabeled data. The network consists of two parts, a depth prediction network and a pose estimation network, which are trained jointly. The framework takes as input a sequence of consecutive monocular images and their semantic segmentation. The depth prediction network outputs the depth map of each frame, and the pose estimation network outputs the relative pose between adjacent frames. The loss function includes a photometric loss, a semantic loss and a 3D point loss.

3.1 Photometric Loss

In previous methods, an image reconstruction loss is used as a fundamental supervision signal for unsupervised tasks. For two adjacent frames $I_t$ and $I_{t'}$, if the depth map of $I_t$ and the relative pose between the two views are given, then the view $I_t$ can be reconstructed from $I_{t'}$. Taking $I_t$ as input, the depth prediction network generates a depth map for $I_t$, denoted $\hat{D}_t$. The relative camera pose between the two views is estimated by the pose estimation network and denoted $\hat{T}_{t\to t'}$. Denote by $p_t$ the homogeneous coordinates of a pixel in $I_t$, and by $p_{t'}$ the corresponding pixel in $I_{t'}$. Using epipolar geometry, the projected coordinates can be expressed as:

p_{t'} \sim K \hat{T}_{t\to t'} \hat{D}_t(p_t) K^{-1} p_t    (1)

where $K$ is the camera intrinsic matrix, $\hat{T}_{t\to t'}$ is the camera coordinate transformation matrix from frame $I_t$ to frame $I_{t'}$, $\hat{D}_t(p_t)$ is the depth value at pixel $p_t$ in frame $I_t$, and the coordinates are homogeneous.

According to this projection relationship, a new synthetic frame $\hat{I}_{t'\to t}$ can be obtained from frame $I_{t'}$ using the differentiable bilinear interpolation mechanism proposed in [15].
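
To make this warping step concrete, the following numpy sketch back-projects every pixel of $I_t$ with its predicted depth, transforms the points by the relative pose and re-projects them with $K$, as in equation (1). The function name and array layout are ours rather than the paper's code; sampling $I_{t'}$ bilinearly at the returned coordinates then gives $\hat{I}_{t'\to t}$.

```python
import numpy as np

def project_to_source(depth_t, K, T_t_to_tp):
    """Sketch of Eq. (1): map each pixel p_t of I_t to coordinates p_t' in I_t'.

    depth_t   : (H, W) predicted depth map of the target frame I_t
    K         : (3, 3) camera intrinsics
    T_t_to_tp : (4, 4) homogeneous transform from frame t to frame t'
    returns   : (H, W, 2) sub-pixel coordinates in I_t'
    """
    H, W = depth_t.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix_h = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)   # homogeneous pixels

    # Back-project: D_t(p_t) * K^{-1} * p_t -> 3D points in the camera frame of I_t
    cam = np.linalg.inv(K) @ pix_h * depth_t.reshape(1, -1)

    # Transform into the camera frame of I_t' and re-project with K
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    proj = K @ (T_t_to_tp @ cam_h)[:3]
    coords = (proj[:2] / np.maximum(proj[2:3], 1e-6)).reshape(2, H, W)
    return np.transpose(coords, (1, 2, 0))

# Bilinear sampling of I_t' at these coordinates (e.g. the spatial transformer of
# [15]) then yields the synthesized view \hat{I}_{t'->t}.
```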

Structural similarity (SSIM) [47] can be used to evaluate the quality of image prediction. A widely used image reconstruction error function is as follows:

re(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v}) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v})\bigr) + (1-\alpha)\,\|I_t^{u,v} - \hat{I}_{t'\to t}^{u,v}\|_1    (2)

where the superscript $u,v$ denotes the image pixel at coordinates $(u,v)$, and $\alpha$ is usually set to 0.85.

The image reconstruction loss can be formulated as:

L_{recon} = \sum_{u,v} re(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v})    (3)
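
The per-pixel error of equation (2) can be sketched in TensorFlow as below. Since equation (2) needs a per-pixel SSIM value, the sketch computes a local SSIM map with average pooling, which is one common approximation; the window size and the helper names are assumptions, not the paper's exact implementation.

```python
import tensorflow as tf

def _local_ssim(x, y, ksize=3):
    """Local SSIM map via average pooling (one common per-pixel approximation)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = tf.nn.avg_pool2d(x, ksize, 1, 'SAME')
    mu_y = tf.nn.avg_pool2d(y, ksize, 1, 'SAME')
    var_x = tf.nn.avg_pool2d(x * x, ksize, 1, 'SAME') - mu_x ** 2
    var_y = tf.nn.avg_pool2d(y * y, ksize, 1, 'SAME') - mu_y ** 2
    cov_xy = tf.nn.avg_pool2d(x * y, ksize, 1, 'SAME') - mu_x * mu_y
    num = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return tf.clip_by_value(num / den, 0.0, 1.0)

def reconstruction_error(I_t, I_hat, alpha=0.85):
    """Per-pixel re(., .) of Eq. (2); inputs are (B, H, W, 3) images in [0, 1]."""
    l1 = tf.reduce_mean(tf.abs(I_t - I_hat), axis=-1, keepdims=True)
    dssim = tf.reduce_mean(1.0 - _local_ssim(I_t, I_hat), axis=-1, keepdims=True)
    return alpha / 2.0 * dssim + (1.0 - alpha) * l1

# Summing reconstruction_error over all pixels gives L_recon of Eq. (3).
```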

[13] proposes a minimum reprojection loss to handle occlusions and applies a per-pixel mask $\mu$:

\mu^{u,v} = \Bigl[\min_{t'} re(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v}) < \min_{t'} re(I_t^{u,v}, I_{t'}^{u,v})\Bigr]    (4)

where $t' \in \{t-1, t+1\}$ and $[\cdot]$ is the Iverson bracket. The minimum reprojection loss is:

L_p = \sum_{u,v} \mu^{u,v} \min_{t'} re(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v})    (5)
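
A sketch of the per-pixel mask of equation (4) and the minimum reprojection loss of equation (5), reusing the `reconstruction_error` helper sketched above; tensor shapes and names are illustrative assumptions.

```python
import tensorflow as tf

def min_reprojection_loss(I_t, warped_views, source_views):
    """Sketch of Eqs. (4)-(5) with the auto-mask of [13].

    warped_views : list of synthesized views \hat{I}_{t'->t}, each (B, H, W, 3)
    source_views : list of the corresponding unwarped source frames I_t'
    """
    re_warped = tf.reduce_min(
        tf.stack([reconstruction_error(I_t, w) for w in warped_views], 0), axis=0)
    re_identity = tf.reduce_min(
        tf.stack([reconstruction_error(I_t, s) for s in source_views], 0), axis=0)
    mu = tf.cast(re_warped < re_identity, tf.float32)   # Iverson bracket of Eq. (4)
    return tf.reduce_sum(mu * re_warped)                # Eq. (5)
```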

To address the gradient-locality issue in motion estimation and to remove discontinuities in the depth learned in low-texture regions, a depth smoothness loss is used to regularize the depth estimate. We adopt the edge-aware depth smoothness loss of [12], which weights the depth gradient by the image gradient:

L_{smooth} = \sum_{u,v} |\nabla D_t^{u,v}|^T \cdot e^{-|\nabla I_t^{u,v}|}    (6)

where $\nabla$ is the vector differential operator, $T$ denotes the transpose used in the image-gradient weighting, and $|\cdot|$ denotes the elementwise absolute value.
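
Equation (6) can be written directly with finite differences; the sketch below weights horizontal and vertical depth gradients by the corresponding image gradients (tensor layout is an assumption).

```python
import tensorflow as tf

def smoothness_loss(depth, image):
    """Edge-aware smoothness of Eq. (6). depth: (B, H, W, 1); image: (B, H, W, 3)."""
    dD_x = tf.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])
    dD_y = tf.abs(depth[:, 1:, :, :] - depth[:, :-1, :, :])
    dI_x = tf.reduce_mean(tf.abs(image[:, :, 1:, :] - image[:, :, :-1, :]), -1, keepdims=True)
    dI_y = tf.reduce_mean(tf.abs(image[:, 1:, :, :] - image[:, :-1, :, :]), -1, keepdims=True)
    return tf.reduce_sum(dD_x * tf.exp(-dI_x)) + tf.reduce_sum(dD_y * tf.exp(-dI_y))
```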

3.2 Semantic Loss

Similar to photometric consistency, semantic consistency should hold between adjacent frames. The semantic segmentation of $I_t$ is denoted $S_t$. According to equation (1), the semantic segmentation $\hat{S}_{t'\to t}$ of frame $I_t$ can be synthesized from $S_{t'}$. Unlike the differentiable bilinear interpolation used in image reconstruction, nearest-neighbor interpolation is used to synthesize the semantic segmentation, because the value of the semantic segmentation represents the class of each pixel. The semantic segmentation reconstruction loss is:

L_{ss} = \sum_{u,v} \min_{t'} \bigl[S_t^{u,v} \neq \hat{S}_{t'\to t}^{u,v}\bigr]    (7)

where $[\cdot]$ is the Iverson bracket.
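
A sketch of equation (7). It assumes the warped label maps $\hat{S}_{t'\to t}$ have already been produced by nearest-neighbor sampling at the coordinates of equation (1); only the Iverson-bracket comparison and the per-pixel minimum over $t'$ are shown.

```python
import tensorflow as tf

def semantic_loss(S_t, warped_semantics):
    """Sketch of Eq. (7). S_t: (B, H, W) integer label map of I_t.
    warped_semantics: list of \hat{S}_{t'->t} label maps of the same shape."""
    mismatches = tf.stack(
        [tf.cast(tf.not_equal(S_t, S_hat), tf.float32) for S_hat in warped_semantics], 0)
    # Per-pixel minimum over t', then sum over all pixels.
    return tf.reduce_sum(tf.reduce_min(mismatches, axis=0))
```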

The projection in equation (1) assumes that the scene is static and that there is no occlusion between the two views, but real scenes obviously do not satisfy this assumption. The projection rule therefore makes mistakes at pixels of dynamic or occluded objects. Pixels that violate semantic consistency may belong to dynamic objects or occlusions, and such pixels should be excluded when computing the reconstruction loss. A mask $M_{t'}$, whose value is 1 at dynamic or occluded pixels and 0 at the remaining pixels, is used to indicate these pixels:

M_{t'}^{u,v} = \bigl[S_t^{u,v} \neq \hat{S}_{t'\to t}^{u,v}\bigr]    (8)

where $[\cdot]$ is the Iverson bracket.

The improved image reconstruction loss is:

L_{img} = \sum_{u,v} \Bigl[\min_{t'} mre(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v}) < \min_{t'} re(I_t^{u,v}, I_{t'}^{u,v})\Bigr] \min_{t'} mre(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v})    (9)

where $mre(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v}) = re(I_t^{u,v}, \hat{I}_{t'\to t}^{u,v}) + b\,M_{t'}^{u,v}$ and $b$ is a large constant greater than all possible values of $re(I_t^{u,v}, I_{t'}^{u,v})$.
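
Equations (8) and (9) combine the semantic mask with the minimum-reprojection scheme. A sketch, again reusing the `reconstruction_error` helper sketched earlier; `b` only needs to exceed any possible photometric error so that masked pixels never win the per-pixel minimum, and its value here is an assumption.

```python
import tensorflow as tf

def masked_image_loss(I_t, warped_views, source_views, masks, b=1e3):
    """Sketch of Eqs. (8)-(9).

    masks : list of M_{t'} maps of shape (B, H, W, 1), 1 where the warped label
            disagrees with S_t (Eq. (8)), 0 elsewhere.
    """
    mre = tf.reduce_min(
        tf.stack([reconstruction_error(I_t, w) + b * m
                  for w, m in zip(warped_views, masks)], 0), axis=0)
    re_identity = tf.reduce_min(
        tf.stack([reconstruction_error(I_t, s) for s in source_views], 0), axis=0)
    keep = tf.cast(mre < re_identity, tf.float32)   # Iverson bracket in Eq. (9)
    return tf.reduce_sum(keep * mre)
```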

For the depth of some objects, we can impose prior-knowledge constraints. For roads and sidewalks in autonomous driving datasets, of two vertically adjacent pixels in an image, the upper pixel corresponds to a longer distance and therefore a larger depth. We therefore introduce the following loss:

L_{road} = \sum_{u,v} R^{u,v} R^{u,v-1} \bigl[D_t^{u,v} > D_t^{u,v-1}\bigr]    (10)

where $R^{u,v}$ is 1 if pixel $(u,v)$ belongs to a road or sidewalk, and 0 otherwise.
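
A literal sketch of equation (10) follows. Note that the Iverson bracket itself provides no gradient; a soft surrogate such as a ReLU on the depth difference would be a natural differentiable stand-in, which is our assumption rather than something stated above.

```python
import tensorflow as tf

def road_loss(depth, road_mask):
    """Sketch of Eq. (10). depth: (B, H, W, 1); road_mask: (B, H, W, 1) with 1 on
    road/sidewalk pixels. Row index v grows downward, so the upper pixel is row v-1
    and should have the larger depth."""
    upper = depth[:, :-1, :, :]                         # D_t^{u, v-1}
    lower = depth[:, 1:, :, :]                          # D_t^{u, v}
    both_road = road_mask[:, :-1, :, :] * road_mask[:, 1:, :, :]
    violation = tf.cast(lower > upper, tf.float32)      # Iverson bracket of Eq. (10)
    # Differentiable stand-in (our assumption): tf.nn.relu(lower - upper)
    return tf.reduce_sum(both_road * violation)
```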

3.3 3D Point Loss

The photometric loss is mainly concerned with the 2D pixel coordinate system. The spatial structure of 3D points can serve as an additional supervisory signal to improve depth prediction, so we propose a 3D point loss to make full use of 3D information. [26] aligns 3D point clouds with ICP (Iterative Closest Point); however, ICP requires many iterations during point cloud registration, which makes network training time-consuming. Instead, we use the depth maps of two frames and the transformation matrix to compute the 3D coordinates of corresponding points directly.

The depth map of $I_t$ is predicted by the depth prediction network, from which the 3D coordinates of each pixel of $I_t$ in the camera coordinate system can be obtained. For a pixel $p_t$ with coordinates $(u,v)$ in $I_t$, denote by $P_t$ the coordinates of the 3D point corresponding to $p_t$ in the camera coordinate system of frame $I_t$. $P_t$ can be expressed as:

P_t \sim \hat{D}_t(p_t) K^{-1} p_t    (11)

The pixel coordinates of the corresponding point $p_{t'}$ in $I_{t'}$ can be obtained from equation (1).

Denote by $\hat{D}_{t'}$ the depth prediction of $I_{t'}$ by the network. The coordinates of $p_{t'}$ are not integers, so the depth value at $p_{t'}$ cannot be read directly; as in image reconstruction, we use bilinear interpolation to estimate it. Denote by $P_{t'}$ the coordinates of the 3D point corresponding to $p_{t'}$ in the camera coordinate system of frame $I_{t'}$. As in equation (11), $P_{t'}$ can be expressed as:

P_{t'} \sim \hat{D}_{t'}(p_{t'}) K^{-1} p_{t'}    (12)

Note that $P_t$ and $P_{t'}$ are expressed in different camera coordinate systems and need to be transformed into the same one.

We transform $P_{t'}$ to the camera coordinate system of frame $I_t$:

\hat{P}_{t'\to t} \sim \hat{T}_{t'\to t} P_{t'}    (13)

where $\hat{P}_{t'\to t}$ is the coordinate after the transformation, and $\hat{T}_{t'\to t}$ is the camera coordinate transformation matrix from frame $I_{t'}$ to frame $I_t$.

Figure 3: 3D point loss. $C_t$ and $C_{t'}$ are the camera centers of frames $I_t$ and $I_{t'}$, respectively. For each pixel $p_t$ in frame $I_t$, the corresponding pixel $p_{t'}$ in $I_{t'}$ can be obtained from the predicted depth map and camera pose. $P_t$ is the corresponding 3D point obtained from the depth map. By interpolating the depth map of frame $I_{t'}$, a 3D point corresponding to $p_{t'}$ is obtained; this point is transformed into the coordinate system of $P_t$ to obtain the point $\hat{P}_{t'\to t}$. $P_t$ and $\hat{P}_{t'\to t}$ are corresponding 3D points and should be as close as possible

As shown in Figure 3, $P_t$ and $\hat{P}_{t'\to t}$ are corresponding points and should be as close as possible. The 3D position error is $pe(P_t, \hat{P}_{t'\to t}) = \|P_t - \hat{P}_{t'\to t}\|_1$. Occluded or dynamic pixels should be ignored, so the 3D point loss can be expressed as:

L_{3D} = \sum_{u,v} \Bigl[\min_{t'} mpe(P_t, \hat{P}_{t'\to t}) < h\Bigr] \min_{t'} mpe(P_t, \hat{P}_{t'\to t})    (14)

where $mpe(P_t, \hat{P}_{t'\to t}) = pe(P_t, \hat{P}_{t'\to t}) + h\,M_{t'}^{u,v}$ and $h$ is a large constant greater than all possible values of $pe(P_t, \hat{P}_{t'\to t})$.

Compared with the 2D loss, which uses only one frame's depth map, the 3D loss uses two depth maps and enforces 3D point consistency, thus making better use of 3D spatial structure information.
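
The 3D point loss can be sketched for a single source frame $t'$ as follows; the full loss of equation (14) takes the minimum over $t' \in \{t-1, t+1\}$. The helper assumes the projected coordinates $p_{t'}$ and the bilinearly sampled depth of $I_{t'}$ are already available, and its names and shapes are ours.

```python
import numpy as np

def point_3d_loss(depth_t, depth_tp_sampled, coords_tp, K, T_t_to_tp, mask, h=1e3):
    """Single-source-frame sketch of Eqs. (11)-(14).

    depth_t          : (H, W) depth of I_t
    depth_tp_sampled : (H, W) depth of I_t' bilinearly sampled at coords_tp
    coords_tp        : (H, W, 2) projected coordinates p_t' from Eq. (1)
    T_t_to_tp        : (4, 4) transform from frame t to frame t'
    mask             : (H, W) semantic-inconsistency mask M_t' of Eq. (8)
    """
    H, W = depth_t.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
    K_inv = np.linalg.inv(K)

    # Eq. (11): P_t = D_t(p_t) K^{-1} p_t
    p_t = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    P_t = K_inv @ p_t * depth_t.reshape(1, -1)

    # Eq. (12): P_t' from the interpolated depth of I_t' at p_t'
    p_tp = np.concatenate([coords_tp.reshape(-1, 2).T, np.ones((1, H * W))], 0)
    P_tp = K_inv @ p_tp * depth_tp_sampled.reshape(1, -1)

    # Eq. (13): bring P_t' into the camera frame of I_t
    T_tp_to_t = np.linalg.inv(T_t_to_tp)
    P_tp_in_t = (T_tp_to_t @ np.vstack([P_tp, np.ones((1, H * W))]))[:3]

    # Eq. (14), single-t' form: L1 distance, skipping semantically inconsistent pixels
    pe = np.abs(P_t - P_tp_in_t).sum(0) + h * mask.reshape(-1)
    return pe[pe < h].sum()
```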

3.4 Network Architecture

The framework is divided into two parts: a depth prediction network and a pose estimation network. The inputs to the networks are RGB images and semantic segmentation. We adopt the pre-trained semantic segmentation network of [55], which is fine-tuned on the 200 training images of the KITTI dataset [11] and outputs 19 classes, including road, sidewalk, building, wall, fence, etc. Each image is fed into the network of [55] to obtain its semantic segmentation. Introducing the segmentation network is similar to network pre-training, and using its noisy segmentation results avoids any labeling cost.

The depth prediction network is composed of encoder and decoder networks with skip connections, similar to the DispNet architecture [27]. Two parallel encoder networks take a single image and its semantic segmentation, respectively, and extract their feature maps; the two feature maps are concatenated and fed into the decoder network. Each encoder network has 14 convolution layers with kernel size 3 in all layers except the first 4, whose kernel sizes are 5, 5, 7 and 7, respectively. The semantic segmentation, which comprises 19 classes, is fed into its encoder as 19 channels. The decoder network, consisting of 7 convolution layers and 7 deconvolution layers, uses skip connections to fuse low-level features from different stages of the encoders.
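
A minimal tf.keras sketch of the two parallel encoders feeding a shared decoder. The layer count and filter widths here are illustrative and shorter than the 14-layer encoders described above; only the first-layer kernel sizes follow the text, and the decoder is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dual_encoder(h=256, w=832, num_classes=19):
    """Sketch: the RGB image and the 19-channel semantic segmentation are encoded by
    two parallel convolutional encoders whose feature maps are concatenated before
    the (omitted) DispNet-style decoder."""
    img_in = layers.Input((h, w, 3))
    sem_in = layers.Input((h, w, num_classes))

    def encoder(x, kernels=(5, 5, 7, 7, 3, 3, 3)):
        # First four layers use the larger kernels mentioned in the text; the rest are 3x3.
        filters = 32
        for k in kernels:
            x = layers.Conv2D(filters, k, strides=2, padding='same', activation='relu')(x)
            filters = min(filters * 2, 512)
        return x

    fused = layers.Concatenate()([encoder(img_in), encoder(sem_in)])
    # A decoder with deconvolutions and skip connections would follow here.
    return tf.keras.Model([img_in, sem_in], fused)
```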

The pose estimation network takes as input a sequence of adjacent frames concatenated along the color channels, similar to the pose network in [54]. Unlike [54], which predicts the relative poses between the target view and each source view, our network predicts the relative pose between every two adjacent frames.

The total loss function is:

L = \lambda_1 L_{img} + \lambda_2 L_{ss} + \lambda_3 L_{3D} + \lambda_4 L_{road} + \lambda_5 L_{smooth}    (15)

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are weights for the different losses. Through experiments, we find that setting the weights to $\lambda_1 = 1$, $\lambda_2 = 0.1$, $\lambda_3 = 0.1$, $\lambda_4 = 0.1$ and $\lambda_5 = 0.001$ makes training more stable.
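
The weighted combination of equation (15) with the weights reported above (variable names are ours):

```python
# Total loss of Eq. (15) with the weights found to stabilize training.
WEIGHTS = {'img': 1.0, 'ss': 0.1, '3d': 0.1, 'road': 0.1, 'smooth': 0.001}

def total_loss(L_img, L_ss, L_3D, L_road, L_smooth, w=WEIGHTS):
    return (w['img'] * L_img + w['ss'] * L_ss + w['3d'] * L_3D
            + w['road'] * L_road + w['smooth'] * L_smooth)
```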

4 Experiments

In this section, we evaluate the performance of our algorithm. We compare it with prior art on both single-view depth and pose estimation on the KITTI dataset [11], and perform a detailed ablation study showing that both the semantic loss and the 3D point loss improve depth prediction and pose estimation performance.

We implement the neural networks in the TensorFlow [1] framework and train them with the Adam optimizer, using $\beta_1 = 0.9$, $\beta_2 = 0.99$, a learning rate of 0.0002 and a batch size of 4. During training, we resize the image sequences to a resolution of 256×832, the same as [25] and [32]. The network was trained for 10-20 epochs using 3-frame training sequences on an NVIDIA GeForce GTX 1080 Ti GPU; training for 200K iterations takes about 43 hours. The mean inference time of depth map prediction for a 256×832 image is 13.6 ms, and the mean inference time of pose estimation for a 3-frame sequence is 5.6 ms.

4.1 Dataset

We train and evaluate the proposed method on the widely used KITTI benchmark dataset [11], which provides a full set of input sources including raw images, 3D point clouds from LIDAR, and camera trajectories. We use monocular image sequences for training and testing; the 3D point clouds and camera trajectories are used only to evaluate the trained models. The original image size is 375×1242, and images are downsampled to 256×832 during training. For a fair comparison with other methods, we use two different splits of the KITTI dataset to evaluate depth prediction and pose estimation, respectively.

We evaluate single-view depth estimation on the test split of 697 images from 28 scenes, as in [8]. About 40,000 images from the remaining 33 scenes are used for training and validation.

The KITTI odometry benchmark [11] consists of 22 stereo sequences: 11 sequences (00-10) with ground-truth trajectories and 11 sequences (11-21) without. We follow [54] in splitting the KITTI odometry dataset: the model is trained on sequences 00-08 and the pose error is evaluated on sequences 09 and 10.

4.2 Depth Prediction Evaluation

We evaluate the performance of our monocular depth prediction. The Velodyne laser scan points are projected onto the image plane to obtain the ground truth. Since we use only monocular images for training, the absolute scale cannot be recovered; we therefore multiply the predicted depth map by a scale factor equal to the ratio of the median of the ground truth to the median of the predicted depth, as in [54]. Our depth estimation results are compared quantitatively with previous works (some of which use certain types of supervision). All methods are evaluated on the same training and test images; the dataset split is described in section 4.1 and the error metrics are those used in [8].
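
The median scaling applied before computing the error metrics (following [54]) can be sketched as below; the lower clipping bound is a common choice and an assumption here.

```python
import numpy as np

def median_scale(pred_depth, gt_depth, max_depth=80.0):
    """Scale the prediction by the ratio of the ground-truth median to the prediction
    median over valid LIDAR points, then cap at the evaluation threshold."""
    valid = (gt_depth > 0) & (gt_depth < max_depth)
    scale = np.median(gt_depth[valid]) / np.median(pred_depth[valid])
    return np.clip(pred_depth * scale, 1e-3, max_depth)  # lower bound is our assumption
```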

Table 1: Depth evaluation metrics on the KITTI dataset [11] using the split of Eigen et al. [8]. For fair comparison, we use the monocular video self-supervision result without pretraining for [13]. We mark the best results in bold
Method Supervision Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
Eigen [8] Coarse Depth 0.214 1.605 6.563 0.292 0.638 0.804 0.894
Eigen [8] Fine Depth 0.203 1.548 6.307 0.282 0.702 0.890 0.958
Liu [23] Depth 0.201 1.584 6.471 0.273 0.680 0.898 0.967
Zhou [54] No 0.208 1.768 6.856 0.283 0.678 0.885 0.957
Mahjourian [26] No 0.163 1.240 6.220 0.250 0.762 0.916 0.968
Yang [50] No 0.162 1.352 6.276 0.252 0.783 0.921 0.969
Yin [51] No 0.155 1.296 5.857 0.233 0.793 0.931 0.973
Luo [25] No 0.141 1.029 5.350 0.216 0.816 0.941 0.976
Ranjan [32] No 0.140 1.070 5.326 0.217 0.826 0.941 0.975
Godard [13] No 0.132 1.044 5.142 0.210 0.845 0.948 0.977
Ours No 0.131 0.902 4.980 0.204 0.837 0.952 0.981
Figure 4: Examples comparing Zhou et al. [54], Mahjourian et al. [26] and our method on the KITTI dataset [11]. The sparse LIDAR ground-truth depth is interpolated for visualization purposes. Our results have clearer object boundaries and better depth prediction, especially for trees and cars

Table 1 compares our method with other methods. Abs Rel, Sq Rel, RMSE and RMSE log are error metrics, for which smaller values are better; δ<1.25, δ<1.25² and δ<1.25³ are accuracy metrics, for which larger values are better. For a fair comparison with other methods, we evaluate with a maximum depth threshold of 80 meters. Our method achieves the best performance on most metrics.

Figure 4 shows some visual examples compared with other methods. As can be seen, the depth maps predicted by our method are sharper at object boundaries and better recover the depth of cars and trees.

Table 2: Absolute Trajectory Error (ATE) on KITTI odometry dataset [11] over all multi-frame snippets
Method Seq. 09 Seq. 10
ORB-SLAM (full) 0.014 ± 0.008 m 0.012 ± 0.011 m
ORB-SLAM (short) 0.064 ± 0.141 m 0.064 ± 0.130 m
Mean Odom. 0.032 ± 0.026 m 0.028 ± 0.023 m
Zhou [54] 0.021 ± 0.017 m 0.020 ± 0.015 m
Mahjourian [26] 0.013 ± 0.010 m 0.012 ± 0.011 m
Yin [51] 0.012 ± 0.007 m 0.012 ± 0.009 m
Luo [25] 0.013 ± 0.007 m 0.012 ± 0.008 m
Ranjan [32] 0.012 ± 0.007 m 0.012 ± 0.008 m
Godard [13] 0.017 ± 0.008 m 0.015 ± 0.010 m
Ours 0.010 ± 0.005 m 0.009 ± 0.008 m

4.3 Pose Estimation Evaluation

We evaluate the proposed approach on the KITTI odometry dataset and compare the results with Zhou et al. [54], Mahjourian et al. [26], Yin et al. [51], Luo et al. [25], Godard et al. [13], Ranjan et al. [32] and ORB-SLAM [29] by Mur-Artal et al. We also use the dataset mean of car motion (computed from ground-truth odometry) over 5-frame snippets as another baseline. Among these, [54, 26, 51, 25, 32, 13] are unsupervised deep learning methods, while ORB-SLAM is a traditional geometry-based method. The deep-learning-based methods use the same training data: the models are trained on KITTI odometry sequences 00-08 and relative pose estimation is evaluated on sequences 09 and 10. In our experiments, the length of the input image sequences is fixed to 3 frames, the same as [26]. We compare two versions of ORB-SLAM. "ORB-SLAM (full)" takes all frames of the whole sequence as input and involves global optimization steps such as loop closure detection and bundle adjustment (note that sequence 10 contains no loop, so loop closure detection is not used there). "ORB-SLAM (short)" takes only five consecutive frames as input. Because of the scale ambiguity of monocular VO, we optimize the scale to make the trajectory consistent with the ground truth.

For a fair comparison with other methods, like [54, 26, 51, 25, 32], we measure the Absolute Trajectory Error (ATE) [39] over 3- or 5-frame snippets as the metric for pose evaluation. As shown in Table 2, our method outperforms the other methods. The output of the pose estimation network is the relative poses within 3- or 5-frame snippets; compared with geometry-based methods, the camera trajectory predicted by this kind of method accumulates larger error over long image sequences.
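
A sketch of how a snippet-level ATE with scale alignment can be computed; the exact alignment protocol follows [39]/[54], so details such as reporting the root-mean-square residual are assumptions here.

```python
import numpy as np

def snippet_ate(pred_xyz, gt_xyz):
    """ATE for one N-frame snippet: align to the first frame, fit a single
    least-squares scale to the ground truth, report the RMS residual.

    pred_xyz, gt_xyz : (N, 3) camera positions of the snippet.
    """
    pred = pred_xyz - pred_xyz[0]
    gt = gt_xyz - gt_xyz[0]
    scale = np.sum(gt * pred) / max(np.sum(pred * pred), 1e-12)  # optimal scale factor
    return np.sqrt(np.mean(np.sum((scale * pred - gt) ** 2, axis=1)))
```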

4.4 Ablation Study

We investigate the contribution of the components proposed in our unsupervised architecture. As shown in Table 3, to demonstrate the importance of each loss component, we conduct ablation studies on depth prediction, training and evaluating several models with different combinations of losses. The experimental results show the importance of each component.

Table 3: Depth evaluation metrics on the KITTI dataset [11] using the split of Eigen et al. [8] for various versions of our model
L_img L_ss L_road L_3D Abs Rel Sq Rel RMSE RMSE log δ<1.25 δ<1.25² δ<1.25³
✓    0.144 1.089 5.423 0.214 0.815 0.945 0.979
✓ ✓    0.138 1.007 5.235 0.211 0.829 0.948 0.979
✓ ✓ ✓    0.133 0.939 5.157 0.208 0.837 0.951 0.980
✓ ✓    0.136 0.913 5.191 0.210 0.820 0.947 0.981
✓ ✓ ✓ ✓    0.131 0.902 4.980 0.204 0.837 0.952 0.981
Figure 5: Examples of depth estimation results under training with different losses. Failure regions of the depth prediction are marked in red
L_img L_ss L_road L_3D Seq. 09 Seq. 10
✓    0.013 ± 0.009 m 0.012 ± 0.009 m
✓ ✓ ✓    0.010 ± 0.006 m 0.010 ± 0.009 m
✓ ✓    0.010 ± 0.006 m 0.010 ± 0.009 m
✓ ✓ ✓ ✓    0.010 ± 0.005 m 0.009 ± 0.008 m
Table 4: Absolute Trajectory Error (ATE) on KITTI odometry dataset [11] for various versions of our model

Figure 5 shows the depth maps generated by models trained with different loss functions. Using all losses gives the best performance: object boundaries are clearer, and the depth of small or thin objects such as poles and traffic signs is predicted better.

As shown in Table 4, we compare the effects of different loss functions on pose estimation. The semantic loss and the 3D point loss improve the accuracy of pose estimation, although the improvement is less significant than for depth estimation.

5 Conclusions

We propose a semantics-driven unsupervised deep learning method for monocular depth prediction and camera ego-motion estimation. It is trained on unlabeled monocular image sequences and performs pose estimation and dense depth map estimation at test time. We introduce a semantic loss that reduces the impact of dynamic or occluded objects in the scene and improves depth estimation by exploiting semantic consistency and the correlation between depth and semantics. We also propose a new 3D point loss that improves the accuracy of depth prediction. The experimental evaluation on the KITTI dataset shows that our method achieves good performance. Compared with semantic segmentation, the boundaries between objects in the depth maps are still not sharp enough; one direction for future work is to use semantic segmentation to improve object boundaries in depth maps.

References

  • [1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: Symposium on Operating Systems Design and Implementation. pp. 265–283 (2016)
  • [2] Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: European Conference on Computer Vision. pp. 404–417 (2006)
  • [3] Brahmbhatt, S., Gu, J., Kim, K., Hays, J., Kautz, J.: Geometry-aware learning of maps for camera localization. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2616–2625 (2018)
  • [4] Chen, P.Y., Liu, A.H., Liu, Y.C., Wang, Y.C.F.: Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2019)
  • [5] Clark, R., Wang, S., Wen, H., Markham, A., Trigoni, N.: VINet: Visual-Inertial odometry as a sequence-to-sequence learning problem. In: AAAI Conference on Artificial Intelligence (2017)
  • [6] Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 1052–1067 (2007)
  • [7] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision. pp. 2650–2658 (2015)
  • [8] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems. pp. 2366–2374 (2014)
  • [9] Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(3), 611–625 (2017)
  • [10] Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision. pp. 834–849. Springer (2014)
  • [11] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361 (2012)
  • [12] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 270–279 (2017)
  • [13] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: IEEE International Conference on Computer Vision. pp. 3828–3838 (2019)
  • [14] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2003)
  • [15] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
  • [16] Kendall, A., Cipolla, R.: Geometric loss functions for camera pose regression with deep learning. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5974–5983 (2017)
  • [17] Kendall, A., Grimes, M., Cipolla, R.: PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In: IEEE International Conference on Computer Vision. pp. 2938–2946 (2015)
  • [18] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: IEEE International Conference on Computer Vision. pp. 66–75 (2017)
  • [19] Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality. pp. 1–10 (2007)
  • [20] Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6647–6655 (2017)
  • [21] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: International Conference on 3D Vision (3DV). pp. 239–248 (2016)
  • [22] Li, R., Wang, S., Long, Z., Gu, D.: UnDeepVO: Monocular visual odometry through unsupervised deep learning. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 7286–7291 (2018)
  • [23] Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 2024–2039 (2015)
  • [24] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
  • [25] Luo, C., Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R., Yuille, A.: Every pixel counts ++: Joint learning of geometry and motion with 3D holistic understanding (2018)
  • [26] Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5667–5675 (2018)
  • [27] Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4040–4048 (2016)
  • [28] Meng, Y., Lu, Y., Raj, A., Sunarjo, S., Guo, R., Javidi, T., Bansal, G., Bharadia, D.: SIGNet: Semantic instance aided unsupervised 3d geometry perception. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2019)
  • [29] Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics 31(5), 1147–1163 (2015)
  • [30] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33(5), 1255–1262 (2017)
  • [31] Ranftl, R., Vineet, V., Chen, Q., Koltun, V.: Dense monocular depth estimation in complex dynamic scenes. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4058–4066 (2016)
  • [32] Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M.J.: Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12240–12249 (2019)
  • [33] Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems. pp. 4996–5004 (2016)
  • [34] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: An efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision. pp. 2564–2571 (2011)
  • [35] Sakaridis, C., Dai, D., Hecker, S., Van Gool, L.: Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In: European Conference on Computer Vision. pp. 687–704 (2018)
  • [36] Scaramuzza, D., Fraundorfer, F.: Visual odometry [tutorial]. IEEE Robotics and Automation Magazine 18(4), 80–92 (2011)
  • [37] Schönberger, J.L., Frahm, J.M.: Structure-from-Motion revisited. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [38] Shen, T., Luo, Z., Zhou, L., Deng, H., Zhang, R., Fang, T., Quan, L.: Beyond photometric loss for self-supervised ego-motion estimation. In: International Conference on Robotics and Automation. IEEE (2019)
  • [39] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of RGB-D SLAM systems. In: International Conference on Intelligent Robots and Systems. pp. 573–580 (2012)
  • [40] Sweeney, C., Sattler, T., Hollerer, T., Turk, M., Pollefeys, M.: Optimizing the viewing graph for Structure-from-Motion. In: IEEE International Conference on Computer Vision. pp. 801–809 (2015)
  • [41] Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: European Conference on Computer Vision. pp. 322–337 (2016)
  • [42] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., Brox, T.: DeMoN: Depth and motion network for learning monocular stereo. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5038–5047 (2017)
  • [43] Walch, F., Hazirbas, C., Leal-Taixe, L., Sattler, T., Hilsenbeck, S., Cremers, D.: Image-based localization using LSTMs for structured feature correlation. In: IEEE International Conference on Computer Vision. pp. 627–637 (2017)
  • [44] Wang, C., Miguel Buenaposada, J., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2022–2030 (2018)
  • [45] Wang, R., Pizer, S.M., Frahm, J.M.: Recurrent neural network for (un-) supervised learning of monocular video visual odometry and depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5555–5564 (2019)
  • [46] Wang, S., Clark, R., Wen, H., Trigoni, N.: DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 2043–2050. IEEE (2017)
  • [47] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  • [48] Xie, J., Girshick, R., Farhadi, A.: Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In: European Conference on Computer Vision. pp. 842–857 (2016)
  • [49] Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In: Advances in Neural Information Processing Systems. pp. 1696–1704 (2016)
  • [50] Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R.: LEGO: Learning edge with geometry all at once by watching videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 225–234 (2018)
  • [51] Yin, Z., Shi, J.: Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1983–1992 (2018)
  • [52] Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 340–349 (2018)
  • [53] Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J., Jin, H., Funkhouser, T.: Physically-based rendering for indoor scene understanding using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5287–5295 (2017)
  • [54] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1851–1858 (2017)
  • [55] Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., Catanzaro, B.: Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8856–8865 (2019)