Learning a Geometric Representation for Data-Efficient
Depth Estimation via Gradient Field and Contrastive Loss
Abstract
Estimating a depth map from a single RGB image has been widely investigated for localization, mapping, and 3-dimensional object detection. Recent studies on single-view depth estimation are mostly based on deep convolutional neural networks (ConvNets), which require a large amount of training data paired with densely annotated labels. Depth annotation is both expensive and inefficient, so it is natural to leverage RGB images, which can be collected very easily, to boost the performance of ConvNets without depth labels. However, most self-supervised learning algorithms focus on capturing the semantic information of images to improve performance in classification or object detection, not in depth estimation. In this paper, we show that existing self-supervised methods do not perform well on depth estimation and propose a gradient-based self-supervised learning algorithm with a momentum contrastive loss that helps ConvNets extract geometric information from unlabeled images. As a result, the network can estimate the depth map accurately with a relatively small amount of annotated data. To show that our method is independent of the model structure, we evaluate it with two different monocular depth estimation algorithms. Our method outperforms previous state-of-the-art self-supervised learning algorithms and achieves threefold labeled-data efficiency compared to random initialization on the NYU Depth v2 dataset.
I Introduction
Depth estimation is an essential capability for mobile robots such as drones or autonomous cars to navigate paths, avoid obstacles, and understand the surrounding scene. In particular, the weight and cost of high-quality depth sensors such as LiDAR or RGB-D cameras motivate research on depth estimation using a monocular camera.

In recent years, depth estimation from a single RGB image has been influenced by the dramatic success of deep convolutional neural networks[1]. Both complex depth estimation networks for higher accuracy[2, 3, 4] and efficient networks for real-time or on-device operation[5, 6] have been achieved. However, training a depth estimation network requires a large number of images with pixel-wise depth annotations, which are expensive and labor-intensive to collect.
Due to such difficulties, there is a growing need to leverage unlabeled RGB images to boost the performance of a ConvNet trained with only a small amount of labeled data. Previous self-supervised learning algorithms such as the rotation[7] and exemplar[8] pretext tasks or contrastive learning based approaches[9, 10] mainly focus on extracting semantic features from the image to improve performance on classification, object detection, or semantic segmentation. Self-supervised learning methods have also been applied to depth estimation[11, 12, 13, 14], but they all require stereo left-right image pairs[11, 12] or consecutively collected monocular images with large overlap[13, 14].
Therefore, we propose a self-supervised learning method that lets a ConvNet learn geometric features from unlabeled, independent RGB images to improve depth estimation. We utilize the Sobel kernel and the Canny edge binary mask[15] to generate gradient fields of the image as the positive and negative pairs for momentum contrastive learning[9]. In doing so, the ConvNet learns the relationship between an RGB image and its gradient field by distinguishing the source image of the gradient field from other RGB images. The proposed method is general and model-agnostic in that it is compatible with any parametric-model-based depth estimation network regardless of the structure of the encoder and decoder. Our method also outperforms existing state-of-the-art self-supervised visual representation learning methods on monocular depth estimation and achieves better performance than random initialization[16] with 3 times fewer annotated depth labels. To the best of our knowledge, this is the first approach to pre-train a depth estimation network with independently sampled images in an unsupervised manner.

In short, our contributions can be summarized as follows:

• We show that previous self-supervised visual representation learning with independent images does not perform well on the depth estimation task.

• The proposed gradient-based momentum contrastive learning enables the ConvNet to capture the geometric representation of the image regardless of its model structure.

• Our method outperforms existing self-supervised learning based pre-training methods in both the accuracy and error metrics of monocular depth estimation and achieves three times better data-efficiency compared to random initialization.
We evaluate our method on the NYU Depth v2 dataset[17] with two different monocular depth estimation networks to show that our method is model-agnostic and applicable to a variety of networks, from high-quality complex networks to efficient real-time networks. The PyTorch implementation of the paper and its pre-trained models are available at https://github.com/dsshim0125/grmc.
II Related Work
In this section, we summarize the past research on monocular depth estimation and recap visual representation learning algorithms which are categorized into self-supervised pretext and contrastive learning.
II-A Monocular Depth Estimation
Recent studies on monocular depth estimation use deep learning algorithms with large-scale labeled datasets. Due to the properties of the ConvNet, the spatial resolution of the feature maps decreases as the image passes through multiple layers with different convolution kernel sizes and strides. Therefore, most depth estimation networks adopt an encoder-decoder structure to restore the resolution of the output depth map.
Two-scale ConvNets that generate coarse and fine depth maps, feeding additional information from the coarse network to the fine network by concatenation, were suggested in [18]. This work was later extended to three-scale ConvNets covering auxiliary estimation tasks such as normal prediction and segmentation[19]. In [5], a deep residual network, ResNet[20], is used as the encoder to extract features from the images, together with a novel up-projection algorithm for efficient and faster training. [2] formulates depth estimation as quantized ordinal regression, and [3] applies a U-Net[21] with DenseNet[22] as the encoder to generate a high-quality depth map with a complex neural network.
Self-supervised depth estimation has also been studied by estimating camera poses or disparity maps[14, 11, 12, 13], but these methods need paired stereo images[11, 12, 23] or at least sequences of monocular images[14, 13]. This means the algorithms above cannot be trained with independently sampled monocular images.
II-B Self-Supervised Visual Representation Learning
As most images are unlabeled, significant research effort has been devoted to training neural networks efficiently with a small amount of labeled data. Self-supervised learning formulates a pretext task using only unlabeled data so that the network can learn useful visual representations from the images before task-specific supervised learning such as classification or object detection.
In [24], a single image is divided into several non-overlapping patches and the network is trained to predict their relative positions. Follow-up studies[25, 26] also generate patches, but the patches are randomly shuffled and the pretext task is to recover the original image by predicting the permutation of the patches, similar to solving a jigsaw puzzle.
A rotation pretext task that rotates the input image is introduced in [7], and the network predicts the degree of the applied rotation. The rotation angle is drawn from a small finite set of specific values, so the rotation pretext task is posed as a classification problem rather than a regression.
An exemplar task has been proposed in [8], which decreases the distance between the original RGB image (seed image) and its heavily augmented versions and increases the distance between the seed and augmented versions of other RGB images. This allows the network to extract a visual representation that is invariant to a wide range of image transformations.
II-C Contrastive Learning
Recently, visual representation learning based on contrastive learning has achieved huge success in image classification. Contrastive learning[27] is an approach to learning representations by enforcing an attractive force between positive pairs and a repulsive force between negative pairs. The method proposed in [28] divides an image into several non-overlapping patches, and the task is to predict the pixel values of the next unseen patch. The target patch serves as the positive pair, and patches from other locations in the same image or patches from other images serve as negative pairs for contrastive learning. [9] and [10] impose multiple image transformations on RGB images and set the positive pair as augmented views of the same image and the negative pair as augmented views of other images. The difference is that [9] uses two encoders with the same structure whereas [10] uses only a single encoder.
III Method
In this section, we introduce the momentum-based contrastive learning algorithm with two encoders from [9], compare it with our proposed method, and present the gradient field obtained with the modified Canny edge detector[15] that serves as the positive and negative pairs for contrastive learning.
III-A Momentum Contrastive Learning
We pre-train the encoder of the depth estimation network with the contrastive loss to learn a representation of the image without any supervisory signals. We adopt the momentum contrastive learning suggested in MoCo[9], with two encoders: a query encoder and a key encoder.
The encoder maps the positive and negative pairs of the image to a feature space to extract a latent vector containing the compressed information of the input data. Although the latent vector has a low dimension, it is still vulnerable to overfitting to a specific training data domain, so a projection head module is adopted: a 2-layer fully-connected network with a ReLU non-linear activation function[10]. The head module projects the latent vector into a feature space with much lower dimensionality but a similar representation compared to the original feature space. The contrastive loss is then calculated from the similarity between the projected latent vectors of positive and negative pairs in this low-dimensional feature space.
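As a rough illustration, a minimal PyTorch sketch of such a 2-layer projection head is shown below; the input, hidden, and output dimensions are assumptions rather than values stated in the paper.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """2-layer fully-connected projection head with ReLU (dimensions are assumed)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),           # non-linear activation between the two layers
            nn.Linear(hidden_dim, out_dim),  # project to the low-dimensional contrastive space
        )

    def forward(self, x):
        return self.net(x)
```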
MoCo[9] leverages a dynamic dictionary as a queue of latent vectors of input data so that contrastive learning depends less on the batch size. The encoded key value is not discarded after a training step but is pushed into the dictionary to form a large set of negative pairs during training. The size of the dynamic dictionary is fixed, so the oldest values are removed when the number of latent vectors in the dictionary exceeds the maximum size. To avoid rapid changes of the key values from the same RGB image and keep representation learning stable, MoCo[9] uses separate encoders for the query and key data. The key encoder is not updated by the contrastive loss but by a momentum update, so that the key latent vectors stay consistent.
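The momentum update and queue mechanics described above can be sketched as follows. This is a simplified illustration, not the exact implementation; the momentum coefficient is an assumed typical value, while the dictionary size follows Section IV-B.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # The key encoder receives no gradients; it slowly tracks the query encoder.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys, ptr, dict_size=16384):
    # Push the newest key latent vectors and overwrite the oldest entries.
    # queue: (feature_dim, dict_size), keys: (batch, feature_dim); assumes dict_size % batch == 0.
    batch = keys.shape[0]
    queue[:, ptr:ptr + batch] = keys.T
    return queue, (ptr + batch) % dict_size
```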


As MoCo[9] measures the similarity between the query latent vector $q$ and the key latent vector $k$ with a dot product, the contrastive loss function can be formulated as Eq. (1), which is called InfoNCE[29]. The subscript $+$ indicates the positive pair, and $\tau$ is a temperature parameter that controls the concentration of the distribution[30].

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)} \qquad (1)$$
The difference between our method and MoCo[9] is that we choose the RGB image as the query data and the gradient field as the key data, whereas MoCo[9] uses color-distorted augmented images as both query and key. By adopting the gradient field as the key data for momentum contrastive learning, the encoder learns a geometric rather than semantic representation of the image. We deliberately assign different modalities to the query and key data, the RGB image and the gradient field respectively, because we intend to update the query encoder by back-propagation[31] only with the error signal from the RGB image, which is the input of our final objective, i.e., monocular depth estimation. After training with the momentum contrastive loss, the query encoder is used as the initialization of the depth estimation network, as shown in Fig. 2.
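A minimal sketch of one pre-training step under these choices is given below (query: RGB image, key: gradient field). Projection heads and queue updates are omitted for brevity, and the temperature follows the value reported in Section IV-B; this is an illustration of Eq. (1), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def contrastive_step(query_encoder, key_encoder, rgb, grad_field, queue, tau=0.07):
    """One InfoNCE step of Eq. (1): RGB images are queries, their gradient fields are keys."""
    q = F.normalize(query_encoder(rgb), dim=1)            # (N, C) query latent vectors
    with torch.no_grad():                                  # key encoder gets no gradient
        k = F.normalize(key_encoder(grad_field), dim=1)    # (N, C) positive key latent vectors
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # similarity to the positive pair
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # similarity to queued negatives (C, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                 # InfoNCE loss
```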
III-B Gradient Field
Previous self-supervised learning algorithms mainly focus on capturing the semantic information of the image, so most of them formulate the unlabeled pretext task by imposing a wide range of image transformations such as HSV-space color randomization[8], Inception cropping[32], or Gaussian blur[10]. In this paper, we exploit the gradient field of the image as the positive and negative pairs of contrastive learning so that the network can extract visual geometric features.
To generate the gradient field $G(I)$ of an image $I$, we use the Canny edge detector[15] to extract the edges of the image. Unlike the standard Canny algorithm, which produces a binary mask by filtering the magnitude of the gradient from the Sobel operator against neighboring pixels, we modify the Canny detector to keep the magnitude of the dominant gradient itself as well as its location, so that each pixel of the gradient field carries a different intensity according to its edge dominance. The procedure of generating a gradient field can be expressed as Eq. (2):
$$G(I) = M(I) \odot B(I) \qquad (2)$$

where $M(I)$ and $B(I)$ denote the magnitude of the gradient from the Sobel operator and the binary result of the Canny algorithm respectively, and the operator $\odot$ indicates pixel-wise multiplication. Lastly, as the range of the gradient field magnitudes differs from that of the input RGB image, we normalize the pixel values of both the image and the gradient field to the range 0 to 1 for faster and more stable learning. We visualize the magnitude of the gradient, the binary mask from the Canny detector, and our proposed gradient field in Fig. 3.
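A minimal OpenCV-based sketch of Eq. (2) is shown below; the Canny thresholds and Sobel kernel size are illustrative assumptions rather than values specified in the paper.

```python
import cv2
import numpy as np

def gradient_field(rgb):
    """Sobel gradient magnitude M(I) masked by the Canny edge map B(I), then normalized."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
    magnitude = np.hypot(gx, gy)                       # gradient magnitude M(I)
    mask = cv2.Canny(gray, 100, 200) > 0               # binary edge map B(I), thresholds assumed
    field = magnitude * mask                           # pixel-wise product, Eq. (2)
    return field / (field.max() + 1e-8)                # normalize to [0, 1]
```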
In short, we pre-train the encoder of the depth estimation network with momentum-based contrastive learning[9] and the gradient field from the modified Canny algorithm[15] to learn the geometric representation of images. Our proposed momentum contrastive learning with the gradient field is summarized in Fig. 1.
TABLE I: Depth estimation results on NYU Depth v2[17] with different encoder pre-training methods. Accuracy metrics (higher is better) are assumed to be the standard thresholds δ < 1.25, δ < 1.25², δ < 1.25³; error metrics (lower is better) are rel, rms, and log10. ∗ supervised pre-training with classification labels.

| Model | Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | rel | rms | log10 |
|---|---|---|---|---|---|---|---|
| Alhashim et al.[3] | Random[16] | 0.743 | 0.932 | 0.980 | 0.175 | 0.599 | 0.073 |
| | ImageNet∗[33] | 0.826 | 0.967 | 0.999 | 0.131 | 0.484 | 0.057 |
| | Rotation[7] | 0.693 | 0.911 | 0.974 | 0.189 | 0.666 | 0.082 |
| | Exemplar[8] | 0.516 | 0.810 | 0.938 | 0.255 | 0.923 | 0.117 |
| | MoCo[9] | 0.755 | 0.933 | 0.981 | 0.168 | 0.592 | 0.071 |
| | Ours | 0.801 | 0.952 | 0.986 | 0.147 | 0.532 | 0.062 |
| Laina et al.[5] | Random[16] | 0.672 | 0.899 | 0.969 | 0.207 | 0.710 | 0.087 |
| | ImageNet∗[33] | 0.773 | 0.950 | 0.989 | 0.157 | 0.553 | 0.067 |
| | Rotation[7] | 0.608 | 0.871 | 0.961 | 0.239 | 0.791 | 0.098 |
| | Exemplar[8] | 0.598 | 0.861 | 0.954 | 0.242 | 0.822 | 0.101 |
| | MoCo[9] | 0.692 | 0.909 | 0.974 | 0.201 | 0.668 | 0.082 |
| | Ours | 0.709 | 0.917 | 0.976 | 0.192 | 0.656 | 0.080 |
IV Experiments
To demonstrate the effectiveness of our self-supervised pre-training, we first describe the dataset and implementation details, and then present multiple experiments comparing our method with existing self-supervised algorithms.
IV-A Dataset
We both train the encoder in an unsupervised manner and fine-tune the entire depth estimation network with supervisory signals on the NYU Depth v2[17] dataset. The dataset consists of 120K indoor training samples and 654 test samples collected with a Microsoft Kinect. We train our method on a 50K subset of the training samples and evaluate on the full set of 654 test samples. The resolution of the dataset is 640×480; we do not resize the input RGB image, but the output depth map is down-sampled to half resolution (320×240) for training speed and computational efficiency. During evaluation, we trim the test input image with the pre-defined center crop of [18] for a precise evaluation.
IV-B Implementation Details
We implement our self-supervised pre-training and the monocular depth estimation networks[3, 5] on the public deep learning platform PyTorch[34]. We re-ran both the self-supervised and depth estimation experiments to eliminate any factors other than the pre-trained encoder weights that could affect the performance of the network. For training the encoder with momentum-based contrastive learning, we set the batch size to 64, the size of the dynamic dictionary to 16384, and the temperature parameter to 0.07. We adopt the SGD optimizer with a learning rate of 0.015, a momentum of 0.9, and a weight decay of 0.0001. For fine-tuning the depth estimation network, we train with the same input size of 640×480 and use the Adam optimizer with a learning rate of 0.0001. We set the batch size to 4 for training and 1 for evaluation.
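For reference, the optimizer settings above translate to the following PyTorch configuration. The networks used here are placeholders (a torchvision ResNet-50 standing in for the encoder, and a stand-in for the full depth network); they are assumptions for illustration only.

```python
import torch
import torchvision

# Placeholder networks for illustration; the paper plugs in its own encoder and decoder here.
encoder = torchvision.models.resnet50()    # stand-in for the encoder pre-trained contrastively
depth_net = torch.nn.Sequential(encoder)   # stand-in for the full encoder-decoder depth network

# Pre-training: SGD with lr 0.015, momentum 0.9, weight decay 1e-4 (Section IV-B).
pretrain_optimizer = torch.optim.SGD(
    encoder.parameters(), lr=0.015, momentum=0.9, weight_decay=1e-4)

# Fine-tuning: Adam with lr 1e-4.
finetune_optimizer = torch.optim.Adam(depth_net.parameters(), lr=1e-4)
```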
(Figure: (a) Self-supervised, (b) Fine-tune.)
IV-C Performance Evaluations
We compare the results of monocular depth estimation according to the pre-trained weights of the encoder for two networks with different structures, [3] and [5]. The work in [3] adopts DenseNet-161[22] as the encoder for high-quality, complex depth estimation, and [5] uses ResNet-50[20] as the encoder with its up-projection module as the decoder for real-time operation. TABLE I shows several meaningful results. Our method shows worse performance than supervised ImageNet[33] pre-training, owing to the difference in dataset size and the presence of labels: NYU Depth v2[17] contains 120K images whereas ImageNet contains 14M images with classification labels. However, our method outperforms the existing state-of-the-art self-supervised algorithms on all quantitative metrics. We improve the accuracy by 5% to 30% when training with [3] and by 2% to 11% with [5]. Interestingly, the Rotation[7] and Exemplar[8] pretexts degrade the performance of the depth estimation network compared to random initialization[16]. This indicates that pre-training the network to capture the semantic information of the image for classification or object detection does not always provide a better initialization for other tasks such as depth estimation. Qualitative results of our proposed method and prior self-supervised methods trained with [3] are shown in Fig. 4.
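For reference, a sketch of the standard depth metrics assumed in TABLE I (threshold accuracies δ and the rel, rms, and log10 errors) is shown below; this reflects the conventional NYU Depth v2 evaluation protocol, not code from the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Threshold accuracies and error metrics over valid pixels (pred, gt: positive arrays)."""
    ratio = np.maximum(pred / gt, gt / pred)
    acc = {i: np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}    # delta_1, delta_2, delta_3
    rel = np.mean(np.abs(pred - gt) / gt)                        # mean absolute relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                     # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))       # mean log10 error
    return acc, rel, rms, log10
```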
IV-D Impact of encoder pre-training
In Fig. 5, we visualize the kernels of the first layer of DenseNet-161[22] learned by our proposed self-supervised pre-training and after fine-tuning for depth estimation. We observe that the patterns of the kernels, which play a critical role in capturing specific information from the image, barely change. The absolute values of the kernels change slightly during task adaptation, but the general patterns and shapes of the kernels are not affected by the loss signals from depth estimation. This indicates that how we pre-train the encoder determines the dominant representation of the images that the network captures, regardless of task-specific fine-tuning.
IV-E Labeled Data-Efficiency
We evaluate our model against random initialization[16] on subsets of the dataset with annotated depth labels. We intentionally restrict the amount of labeled data from 1% to 10% of NYU Depth v2[17] for fine-tuning the depth estimation networks. Both training and evaluation are done with [3], where DenseNet-161[22] is used as the encoder. From TABLE II, our method trained with only 1% and 5% of the labels outperforms random initialization[16] trained with 3% and 10% of the labeled data respectively, in all accuracy metrics. The gain in data-efficiency grows as the amount of labeled data becomes smaller: data-efficiency is tripled with 1% of labeled data and doubled with 5%. This is because the network is trained by gradient descent, so parameter initialization becomes more important as the amount of labeled training data shrinks.
TABLE II: Accuracy metrics (δ thresholds as in TABLE I) when fine-tuning with 1% to 10% of the labeled data of NYU Depth v2[17].

| Metric | Method | 1% | 3% | 5% | 10% |
|---|---|---|---|---|---|
| δ < 1.25 | Random[16] | 0.469 | 0.530 | 0.623 | 0.685 |
| | Ours | 0.598 | 0.654 | 0.701 | 0.737 |
| δ < 1.25² | Random[16] | 0.767 | 0.813 | 0.878 | 0.909 |
| | Ours | 0.870 | 0.904 | 0.921 | 0.934 |
| δ < 1.25³ | Random[16] | 0.909 | 0.931 | 0.960 | 0.973 |
| | Ours | 0.957 | 0.974 | 0.978 | 0.984 |
| Data-Efficiency | | | | | |
IV-F Evaluation on Domain Generalization
To demonstrate the domain generalization capability, we evaluate our proposed method on an outdoor dataset even though the depth estimation network is pre-trained and fine-tuned only on an indoor dataset. We use the Make3D dataset[35] for evaluation, which consists of 534 outdoor images: 400 training and 134 test images. For the evaluation on Make3D, we use three commonly used error metrics[36, 37] and the central crop of [23]. We adopt the depth estimation network of [3] with a DenseNet-161[22] encoder trained on NYU Depth v2[17]; no further fine-tuning on images or labels from Make3D is done.
TABLE III confirms that the proposed pre-training method still shows better performance on Make3D than the random[16], rotation[7], and MoCo[9] initializations. As shown in Fig. 6, the depth estimation network pre-trained with our method generates much more plausible depth outputs and preserves more details, such as the edges of buildings and the branches of trees, compared to random initialization and even to the noisy ground-truth depth map. This is because our proposed method trains the ConvNet to extract geometric features of the image, such as vertical or horizontal structural information, which are robust across indoor and outdoor environments.
V CONCLUSIONS
In this paper, we propose a self-supervised pre-training algorithm with momentum-based contrastive learning and a gradient field generated from a modified Canny edge detector so that the network can learn the geometric representation of the image for depth estimation. Our method outperforms existing self-supervised learning algorithms in both accuracy and error metrics and yields a threefold improvement in labeled-data efficiency. We also show generalization to unseen data from a different environment, evaluating an indoor-trained model on outdoor scenes. Our proposed method can be further applied to any parametric model that requires geometric features, such as normal estimation or 3D reconstruction.
References
- [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [2] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
- [3] I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv preprint arXiv:1812.11941, 2018.
- [4] J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10 193–10 202.
- [5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV). IEEE, 2016, pp. 239–248.
- [6] D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108.
- [7] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations, 2018.
- [8] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with exemplar convolutional neural networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.
- [9] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- [10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [11] C. Zhou, H. Zhang, X. Shen, and J. Jia, “Unsupervised learning of stereo matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1567–1575.
- [12] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in European conference on computer vision. Springer, 2016, pp. 740–756.
- [13] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 3828–3838.
- [14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
- [15] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.
- [16] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
- [17] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV, 2012.
- [18] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
- [19] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650–2658.
- [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [21] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
- [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [23] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
- [24] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
- [25] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
- [26] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boosting self-supervised learning via knowledge transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9359–9367.
- [27] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.
- [28] O. Henaff, “Data-efficient image recognition with contrastive predictive coding,” in International Conference on Machine Learning. PMLR, 2020, pp. 4182–4192.
- [29] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [30] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
- [31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986.
- [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
- [34] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019, pp. 8026–8037.
- [35] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 824–840, 2008.
- [36] K. Karsch, C. Liu, and S. B. Kang, “Depth transfer: Depth extraction from video using non-parametric sampling,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 11, pp. 2144–2158, 2014.
- [37] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2024–2039, 2015.