
Learning Feature Decomposition for Domain Adaptive Monocular Depth Estimation

Shao-Yuan Lo¹, Wei Wang², Jim Thomas², Jingjing Zheng², Vishal M. Patel¹, and Cheng-Hao Kuo²

¹ Shao-Yuan Lo and Vishal M. Patel are with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA. {sylo, vpatel36}@jhu.edu
² Wei Wang, Jim Thomas, Jingjing Zheng, and Cheng-Hao Kuo are with Amazon Lab126. {wweiwan, jimthoma, zhejingj, chkuo}@amazon.com
* This work was mostly done when Shao-Yuan Lo was an intern at Amazon Lab126.
Abstract

Monocular depth estimation (MDE) has attracted intense study due to its low cost and its critical role in robotic tasks such as localization, mapping and obstacle detection. Supervised approaches have achieved great success with the advance of deep learning, but they rely on large quantities of ground-truth depth annotations that are expensive to acquire. Unsupervised domain adaptation (UDA) transfers knowledge from labeled source data to unlabeled target data, so as to relax the constraint of supervised learning. However, existing UDA approaches may not completely bridge the domain gap across datasets because of the domain shift problem. We believe better domain alignment can be achieved via well-designed feature decomposition. In this paper, we propose a novel UDA method for MDE, referred to as Learning Feature Decomposition for Adaptation (LFDA), which learns to decompose the feature space into content and style components. LFDA only attempts to align the content component, since it has a smaller domain gap. Meanwhile, it excludes the source-specific style component from training the primary task. Furthermore, LFDA uses separate feature distribution estimations to further bridge the domain gap. Extensive experiments on three domain adaptive MDE scenarios show that the proposed method achieves superior accuracy and lower computational cost compared to state-of-the-art approaches.

I INTRODUCTION

Depth information is essential to many robotic applications, e.g., localization, mapping and obstacle detection. Existing depth acquisition devices, such as Lidar and structured-light sensors, are typically bulky, heavy and power-hungry, making them unsuitable for compact robotic platforms. This motivates progress in monocular depth estimation (MDE), which predicts depth from a single image: it offers low cost, small size and high power efficiency, and does not need to be re-calibrated after long-term use.

Recent advances in deep learning have enabled supervised learning approaches to perform MDE [3, 9, 10, 25], but obtaining ground-truth depth annotations is costly and labor-intensive. Moreover, the obtained depth annotations only correspond to one specific camera. In other words, a model trained with the images and annotations of a specific camera may not generalize well to another camera with different settings, e.g., focal length and field of view. Synthetic data and their annotations are easier to acquire. However, supervised approaches trained on such synthetic datasets often suffer from severe accuracy degradation when tested on real data, as different datasets have distinct characteristics. This is known as the domain shift problem. These challenges hinder the application of MDE to compact robotic platforms. Hence, developing algorithms that can transfer the knowledge learned from one labeled dataset to another unlabeled dataset becomes increasingly important.

We approach this via unsupervised domain adaptation (UDA): given a labeled source dataset and an unlabeled target dataset, the objective is to learn an MDE model that generalizes well to the unlabeled target domain. Existing works mainly rely on synthetic-to-real image translation, or vice versa, to bridge the domain gap [1, 2, 7, 30, 36, 44, 45]. Although these works have achieved great improvements, image translation itself is not an easy task; images may not be perfectly translated to the other domain or may contain distortions after translation. Another research stream performs feature alignment through adversarial learning [7, 22, 36, 45]. Nevertheless, it is difficult to completely align the entire feature space of different domains owing to the domain shift problem.

Refer to caption
Figure 1: Example results of domain adaptive MDE on the Foggy Cityscapes dataset [38]. It is a scenario of adverse weather adaptation. “Conventional” refers to the method based on the usual domain adversarial learning [12]. The red boxes highlight regions where our method makes improvements.

To overcome these challenges, inspired by recent approaches [36, 45] and disentangled learning techniques [5, 24, 26, 27, 28], we assume that the feature space can be decomposed into content and style components. The content component consists of semantic features that are shared across different domains. For example, consider images of indoor scenes from two different datasets: objects like tables, chairs and beds are content information. Such semantic features are more domain-invariant, so it is easier to align the content components of different domains. In contrast, the style component is domain-specific. For instance, style features like texture and color are unique to the scenes captured by a particular camera, so aligning the style features may not be practical. Hence, to train an MDE model that works on the target data, we suggest discarding the source-specific style component, which hinders adaptation, so as to narrow the domain gap, while retaining the target-specific style component, which is still useful for the primary MDE task.

Based on the above intuitions, we propose a novel UDA method for the MDE task, referred to as Learning Feature Decomposition for Adaptation (LFDA): (1) Different from prior works that attempt to align the entire feature maps of source and target data [7, 22, 36, 45], LFDA only needs to align the content features, which already have a much smaller domain gap. (2) To further improve the content feature alignment, LFDA individually estimates the statistics of different feature domains via separate batch normalizations (BNs) [4, 29, 43], which bypass the domain-specific elements in the feature space. The separate BN structure also helps to properly integrate the content and style features of the target data. (3) With the proposed decomposition learning, LFDA bridges the domain gap more efficiently. In particular, it keeps a relatively compact structure at inference time, leading to lower computational complexity than recent methods that require a sophisticated image translation network during inference [36, 44]. (4) In addition, most existing approaches rely on a multi-stage training procedure that first pre-trains each sub-network separately and then fine-tunes them together [1, 2, 30, 36, 44]. Instead, LFDA is trained end-to-end in a single stage, making it more feasible to deploy in practical applications.

In terms of evaluation, the majority of existing studies only focus on synthetic-to-real adaptation [1, 2, 7, 22, 36, 44, 45]. In contrast, we apply our method to three broad domain adaptation scenarios: (1) cross-camera adaptation, (2) synthetic-to-real adaptation, and (3) adverse weather adaptation [41]. To the best of our knowledge, this paper is the first attempt to consider all three scenarios for the MDE task. In particular, this is the first time adverse weather adaptation has been explored for MDE. Fig. 1 shows examples of adverse weather adaptation results. Compared to a conventional approach, our LFDA obtains more accurate depth predictions for cars, traffic signs, sky, etc., under foggy weather conditions. More extensive experiments in Sec. IV demonstrate that LFDA achieves promising performance on all the scenarios.

II RELATED WORK

Monocular depth estimation (MDE). Deep learning has achieved high accuracy for MDE by supervised learning. Eigen et al. [9] introduced a deep learning-based MDE approach with a multi-scale network. Afterward, Laina et al. [25] presented a deeper network with a fully convolutional network and residual learning. Fu et al. [10] divided depth ranges into multiple depth bins and solved MDE in a classification manner using an ordinal regression loss. Recently, Bhat et al. [3] developed a transformer-based block to adaptively adjust the depth bins for each image. Several studies explore training MDE models via self-supervision. Notable algorithms include exploiting epipolar geometry constraints from stereo pairs [13, 15, 23] and utilizing multi-view information from monocular video sequences [31, 46].

Unsupervised domain adaptation (UDA). Because depth annotations are prohibitively expensive for supervised learning, unsupervised domain adaptation has gained a lot of interest in the research community. Recent approaches involve feature distribution alignment using adversarial learning [12, 40], image-to-image translation [17, 34], and pseudo-label generation [6, 37].

Domain adaptive MDE. Domain adaptation for MDE was introduced by Atapour et al. [2], who trained a depth estimation network on synthetic images and then translated real images to the synthetic style during inference. AdaDepth [22] employs adversarial learning in both the feature and output spaces to align the distributions of the source and target domains. T2Net [45] translates synthetic images to the real style to train an MDE network. CrDoCo [7] and GASDA [44] use bidirectional style transfer to learn the mapping between two domains, where GASDA also exploits the epipolar geometry structure of real images. SharinGAN [36] translates both synthetic and real data to a single shared domain to decrease their discrepancy. DESC [30] leverages an additional semantic segmentation network and edge detection to provide semantic and edge guidance. Akada et al. [1] adopted recent self-supervised representation learning (SSRL) techniques to learn domain-invariant representations. However, these methods either suffer from sub-optimal domain alignment or high computational complexity during inference.

Refer to caption
Figure 2: Overview of the proposed LFDA framework. $E_{con}$: shared content encoder, $E^{s}_{sty}$: source-specific style encoder, $E^{t}_{sty}$: target-specific style encoder, $D$: depth estimation task decoder, $G$: generator, and $Disc$: domain discriminator. (a) Main information flow. (b) Learning translations for feature decomposition. (c) Separate BN structure for feature alignment and integration.

III PROPOSED METHOD

III-A Framework

An overview of the proposed LFDA is shown in Fig. 2. The entire framework consists of eight sub-networks: shared content encoder $E_{con}$, source-specific style encoder $E^{s}_{sty}$, target-specific style encoder $E^{t}_{sty}$, MDE task decoder $D$, generator $G$, domain discriminator $Disc$, source-to-target translation discriminator $Disc^{s \to t}$, and target-to-source translation discriminator $Disc^{t \to s}$. $\{E_{con}, D\}$ constitutes the primary MDE task network, which is a standard encoder-decoder architecture.

Feature decomposition. As illustrated in Fig. 2a, the two individual style encoders $E^{s}_{sty}$ and $E^{t}_{sty}$ extract the domain-specific style features of the given source input $I^{s}$ and target input $I^{t}$, respectively. This is formulated as $z^{s}_{sty}=E^{s}_{sty}(I^{s})$ and $z^{t}_{sty}=E^{t}_{sty}(I^{t})$. We believe that the content of images is more domain-invariant, so a shared content encoder $E_{con}$ is used to learn the content features of both source and target images, formulated as $z^{s}_{con}=E_{con}(I^{s})$ and $z^{t}_{con}=E_{con}(I^{t})$. This decomposition is achieved by the training scheme shown in Fig. 2b, and the details are elaborated in Sec. III-B.
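To make the decomposition concrete, the following PyTorch sketch shows how the three encoders could produce the four feature components; the encoder architectures are simplified placeholders (the actual sub-networks follow T2Net, see Sec. IV-A), and all tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of the decomposition forward pass in Fig. 2a.
class StyleEncoder(nn.Module):          # plays the role of E_sty^s or E_sty^t (one per domain)
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, ch, 7, 2, 3), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class ContentEncoder(nn.Module):        # plays the role of E_con, shared by both domains
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, ch, 7, 2, 3), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

E_con, E_sty_s, E_sty_t = ContentEncoder(), StyleEncoder(), StyleEncoder()
I_s, I_t = torch.randn(2, 3, 192, 640), torch.randn(2, 3, 192, 640)
z_con_s, z_con_t = E_con(I_s), E_con(I_t)        # shared content features
z_sty_s, z_sty_t = E_sty_s(I_s), E_sty_t(I_t)    # domain-specific style features
```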

Feature alignment. Although the content features $z^{s}_{con}$ and $z^{t}_{con}$ learned by a standard encoder already have a small domain gap, they are still not completely domain-invariant, as the content of images from different domains also contains some domain-specific elements, such as scale and viewpoint. To address this, we perform feature alignment in two aspects.

First, we propose to estimate the feature distributions of $I^{s}$ and $I^{t}$ individually using a separate BN structure [4, 29, 43]. Specifically, two BN branches [19], denoted as $BN^{s}$ and $BN^{t}$, are deployed after each convolutional layer in $E_{con}$ (see Fig. 2c). Each BN branch works individually for its own domain. To elaborate, $BN^{s}$ and $BN^{t}$ learn domain-specific affine parameters $\{\gamma^{s},\beta^{s}\}$ / $\{\gamma^{t},\beta^{t}\}$ and distribution statistics $\{\mu^{s},\sigma^{s}\}$ / $\{\mu^{t},\sigma^{t}\}$ for the source and target data, respectively. Note that all the layers other than BNs are still shared (e.g., convolution and ReLU). Suppose that $\ddot{z}^{d}_{con}$ is the content feature of domain $d$, where $d\in\{s,t\}$; the separate BN structure at an arbitrary layer in $E_{con}$ is formulated as:

$$BN^{d}(\ddot{z}^{d}_{con};\,\gamma^{d},\beta^{d})=\gamma^{d}\left(\frac{\ddot{z}^{d}_{con}-\mu^{d}}{\sqrt{(\sigma^{d})^{2}+k}}\right)+\beta^{d}, \qquad (1)$$

where $k$ is a tiny constant for numerical stability. With this design, the domain gap between $z^{s}_{con}$ and $z^{t}_{con}$ is captured by the domain-specific parameters $\{\mu^{d},\sigma^{d},\gamma^{d},\beta^{d}\}$, while their domain-invariant part passes through each BN layer.
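A minimal PyTorch sketch of such a separate BN layer is given below, assuming one standard `nn.BatchNorm2d` branch per domain and a domain flag passed at forward time; the layer placement and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DomainSpecificBN2d(nn.Module):
    """One BatchNorm branch per domain; all other layers stay shared.
    Each branch keeps its own affine parameters (gamma, beta) and
    running statistics (mu, sigma), as in Eq. (1)."""
    def __init__(self, num_features, domains=("s", "t")):
        super().__init__()
        self.bns = nn.ModuleDict({d: nn.BatchNorm2d(num_features) for d in domains})

    def forward(self, x, domain):
        return self.bns[domain](x)     # route the feature to its own BN branch

# Usage inside a shared block of E_con (illustrative):
conv = nn.Conv2d(64, 64, 3, padding=1)   # shared across domains
bn = DomainSpecificBN2d(64)
relu = nn.ReLU(inplace=True)

x_s, x_t = torch.randn(2, 64, 48, 160), torch.randn(2, 64, 48, 160)
f_s = relu(bn(conv(x_s), "s"))           # source batch uses BN^s
f_t = relu(bn(conv(x_t), "t"))           # target batch uses BN^t
```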

Second, inspired by GRL [12], we employ adversarial learning [16] to align the features $z^{s}_{con}$ and $z^{t}_{con}$ (see Fig. 2a). Details are discussed in Sec. III-B.

Feature integration. Our feature decomposition extracts four preferred components, $\{z^{s}_{con},z^{s}_{sty},z^{t}_{con},z^{t}_{sty}\}$, where $z^{s}_{con}$ and $z^{t}_{con}$ are aligned by our separate BN structure and adversarial learning. To train the MDE task decoder $D$, we use $z^{s}_{con}$, $z^{t}_{con}$ and $z^{t}_{sty}$. We discard $z^{s}_{sty}$ since it is specific to the source data and thus cannot help the model adapt to the target domain. In contrast, the target-specific style component $z^{t}_{sty}$ is still useful for an MDE model that works on the target domain.

After feature decomposition, $z^{t}_{sty}$ and $z^{t}_{con}$ have different feature characteristics even though they come from the same target domain. Hence, directly fusing them in the task decoder $D$ could cause accuracy degradation. To address this issue, as shown in Fig. 2c, we also deploy separate BNs in $D$. There are three BN branches, $BN^{s}_{con}$, $BN^{t}_{con}$ and $BN^{t}_{sty}$, each of which works as in Eq. (1). $BN^{s}_{con}$ and $BN^{t}_{con}$ serve the same purpose as discussed before, and $BN^{t}_{sty}$ is responsible for characterizing the feature distribution of $z^{t}_{sty}$ exclusively. Since the content and style features have different underlying distributions, leveraging a single set of BN parameters for $z^{t}_{con}$ and $z^{t}_{sty}$ would estimate an inaccurate mixture. Therefore, the additional $BN^{t}_{sty}$ is used to disentangle this mixture, allowing proper integration of $z^{t}_{con}$ and $z^{t}_{sty}$ for decoding target features. Because the content and style components may have different importance for MDE, we employ a $1\times 1$ convolution and a residual connection to combine $z^{t}_{con}$ and $z^{t}_{sty}$ right before the output layer of $D$. This weighted fusion helps to adjust the balance between these two features of the target data. Finally, $D$ outputs the predicted depth maps $\tilde{Y}^{s}=D(z^{s}_{con})$ and $\tilde{Y}^{t}=D(z^{t}_{con},z^{t}_{sty})$, respectively.
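The following sketch illustrates one possible form of this weighted fusion, assuming a $1\times 1$ convolution over the concatenated target content and style features followed by a residual connection; the channel sizes and the exact placement inside $D$ are assumptions.

```python
import torch
import torch.nn as nn

class StyleContentFusion(nn.Module):
    """Illustrative sketch of the weighted fusion before the output layer of D:
    a 1x1 convolution re-weights the concatenated target content/style features,
    and a residual connection preserves the content branch."""
    def __init__(self, con_ch=256, sty_ch=256):
        super().__init__()
        self.fuse = nn.Conv2d(con_ch + sty_ch, con_ch, kernel_size=1)

    def forward(self, z_con_t, z_sty_t):
        mixed = self.fuse(torch.cat([z_con_t, z_sty_t], dim=1))  # learned weighting
        return z_con_t + mixed                                   # residual connection

fusion = StyleContentFusion()
z_con_t = torch.randn(2, 256, 12, 40)
z_sty_t = torch.randn(2, 256, 12, 40)
fused = fusion(z_con_t, z_sty_t)   # fed to the final output layer of D
```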

III-B Objectives

The proposed LFDA framework is trained with the following objective functions.

Feature decomposition loss. This loss is used to decompose the feature components according to our assumption for domain adaptation. It consists of a translation loss and a reconstruction loss.

Inspired by style transfer techniques [18, 20], we adopt the translation loss to separate the content and style features of an input image. Let us consider the case of source-to-target image translation in our framework. Given a source image $I^{s}$ and a target image $I^{t}$, we aim to derive a translated image $I^{s \to t}=G(z^{s}_{con},z^{t}_{sty})$ that consists of the content of $I^{s}$ and the style of $I^{t}$ (see Fig. 2b). We achieve this translation via the objective $\mathcal{L}^{s \to t}_{trans}$, which consists of two perceptual losses [20] and an adversarial loss:

$$\begin{split}
\mathcal{L}^{s \to t}_{trans} &= \sum_{j\in L} w^{trans}_{con,j}\,\big\|\phi_{j}(I^{s})-\phi_{j}(I^{s \to t})\big\|_{1} \\
&+ \sum_{j\in L} w^{trans}_{sty,j}\,\big\|\mu(\phi_{j}(I^{t}))-\mu(\phi_{j}(I^{s \to t}))\big\|_{1} \\
&+ \eta\,\big(Disc^{s \to t}(I^{s \to t})-1\big)^{2},
\end{split} \qquad (2)$$

where $\eta=0.2$, $w^{trans}_{con}$ and $w^{trans}_{sty}$ are pre-defined weights, $L$ denotes the $\{relu1\_1, relu2\_1, relu3\_1, relu4\_1, relu5\_1\}$ layers of a pre-trained VGG network [39] used to measure the perceptual losses, $\phi_{j}$ is the $j$-th layer in $L$, and $\mu(\cdot)$ returns the channel-wise mean values of a feature map. This translation loss has also been explored by [5].

To elaborate, the first perceptual loss computes the distance between the high-level content features of $I^{s}$ and $I^{s \to t}$ such that $I^{s \to t}$ contains the content of $I^{s}$. Since content information mostly resides in the higher layers of VGG, we set $w^{trans}_{con}$ to $\{0,0,0,1/4,1\}$. The second perceptual loss forces $I^{s \to t}$ to contain the style of $I^{t}$. To explicitly encode the style information of an image, we follow the AdaIN formulation [18] and measure the distance between the channel-wise means of the style features of $I^{t}$ and $I^{s \to t}$. Since style information mostly resides in the lower layers of VGG, we set $w^{trans}_{sty}$ to $\{1,1,1,0,0\}$. The third term is a standard least-squares adversarial loss [32], where we assign labels 1 and 0 to untranslated and translated images, respectively. This loss helps to improve the quality of the image translation. The case of target-to-source translation is symmetric: its objective $\mathcal{L}^{t \to s}_{trans}$ is obtained by swapping $s$ and $t$ and replacing $s \to t$ with $t \to s$ in Eq. (2).
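A sketch of the translation loss is shown below. It assumes a frozen torchvision VGG-19 as the perceptual feature extractor (the paper only specifies "a pre-trained VGG network"), uses mean-reduced L1 distances, and takes the discriminator output as a precomputed logit map; these are illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Extracts relu1_1 ... relu5_1 activations from a frozen VGG-19
    (slice indices follow torchvision's VGG-19 layer ordering)."""
    def __init__(self):
        super().__init__()
        feats = vgg19(pretrained=True).features.eval()
        for p in feats.parameters():
            p.requires_grad_(False)
        self.slices = nn.ModuleList([feats[:2], feats[2:7], feats[7:12],
                                     feats[12:21], feats[21:30]])
    def forward(self, x):
        outs = []
        for s in self.slices:
            x = s(x)
            outs.append(x)
        return outs  # [relu1_1, relu2_1, relu3_1, relu4_1, relu5_1]

def translation_loss(vgg, I_s, I_st, I_t, disc_logit, eta=0.2):
    """Sketch of Eq. (2): content term, AdaIN-style channel-mean term, LSGAN term."""
    w_con = [0, 0, 0, 0.25, 1.0]   # content lives in the higher VGG layers
    w_sty = [1.0, 1.0, 1.0, 0, 0]  # style lives in the lower VGG layers
    f_s, f_st, f_t = vgg(I_s), vgg(I_st), vgg(I_t)
    loss = 0.0
    for wc, ws, fs, fst, ft in zip(w_con, w_sty, f_s, f_st, f_t):
        loss = loss + wc * (fs - fst).abs().mean()                                    # content
        loss = loss + ws * (ft.mean(dim=(2, 3)) - fst.mean(dim=(2, 3))).abs().mean()  # style means
    return loss + eta * (disc_logit - 1).pow(2).mean()                                # adversarial term
```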

The reconstruction loss is used to guarantee that the combination of the decomposed content and style components forms a nearly complete representation of the input image [5]. Let us consider the case of source image reconstruction. Given a source image $I^{s}$, we aim to derive a reconstruction $I^{s \to s}=G(z^{s}_{con},z^{s}_{sty})$ (see Fig. 2b). This is achieved via the objective $\mathcal{L}^{s \to s}_{recon}$, which is also based on the perceptual loss:

$$\mathcal{L}^{s \to s}_{recon}=\sum_{j\in L} w^{recon}_{j}\,\big\|\phi_{j}(I^{s})-\phi_{j}(I^{s \to s})\big\|_{1}, \qquad (3)$$

where $w^{recon}=\{1/32,1/16,1/8,1/4,1\}$. Symmetrically, target image reconstruction is achieved via the objective $\mathcal{L}^{t \to t}_{recon}$, which swaps $s$ and $t$ and replaces $s \to s$ with $t \to t$ in Eq. (3).

With the above loss functions, LFDA decomposes the feature space into $\{z^{s}_{con},z^{s}_{sty},z^{t}_{con},z^{t}_{sty}\}$, each of which exclusively contains the information it is supposed to carry.

Feature alignment loss. Different from prior works that attempt to align the entire feature space [7, 22, 36, 45], LFDA only needs to align the content features, which already have a much smaller domain gap and are thus easier to align. Inspired by GRL [12], we use a domain adversarial loss $\mathcal{L}_{align}$ to align the distributions of $z^{s}_{con}$ and $z^{t}_{con}$ (see Fig. 2a). This is defined as $\mathcal{L}_{align}=\big(Disc(z^{s}_{con})\big)^{2}+\big(Disc(z^{t}_{con})-1\big)^{2}$, where labels 0 and 1 are assigned to the source and target domain, respectively. We use the least-squares adversarial loss [32] because it has been shown to be more stable during training. As a result, $\mathcal{L}_{align}$ further reduces the discrepancy between $z^{s}_{con}$ and $z^{t}_{con}$.
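The sketch below shows one way to realize this loss with a gradient reversal layer and a small patch-level discriminator on the content features; the discriminator architecture and channel size are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity on the forward pass,
    negated (and scaled) gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# A small patch-level domain discriminator on content features
# (the channel size 256 is an assumption for illustration).
disc = nn.Sequential(nn.Conv2d(256, 128, 3, 2, 1), nn.LeakyReLU(0.2),
                     nn.Conv2d(128, 1, 3, 1, 1))

def align_loss(z_con_s, z_con_t, lam=1.0):
    """Least-squares domain adversarial loss: source content -> label 0,
    target content -> label 1, applied through the GRL."""
    d_s = disc(grad_reverse(z_con_s, lam))
    d_t = disc(grad_reverse(z_con_t, lam))
    return d_s.pow(2).mean() + (d_t - 1).pow(2).mean()
```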

Depth estimation loss. This is the primary task objective for MDE. We employ the $L_{1}$ loss to make use of the source data annotations: $\mathcal{L}^{s}_{de}=\|\tilde{Y}^{s}-Y^{s}\|_{1}$, where $\tilde{Y}^{s}=D(z^{s}_{con})$ is the predicted depth map and $Y^{s}$ is the corresponding ground-truth. Following GASDA [44] and SharinGAN [36], a depth smoothness loss $\mathcal{L}_{sm}$ and a geometry consistency loss $\mathcal{L}_{geo}$ are used as self-supervision for the target data. They are defined as $\mathcal{L}_{sm}=e^{-\nabla I^{t}}\|\nabla\tilde{Y}^{t}\|_{1}$, where $\tilde{Y}^{t}=D(z^{t}_{con},z^{t}_{sty})$ is the predicted depth map, and $\mathcal{L}_{geo}=\alpha\big(1-SSIM(I^{t},\hat{I}^{t})\big)+\beta\|I^{t}-\hat{I}^{t}\|_{1}$, where $\alpha=0.425$, $\beta=0.15$, $\hat{I}^{t}$ is the inverse-warped image derived from $\tilde{Y}^{t}$ and the right stereo counterpart of $I^{t}$, and SSIM [42] is an image quality metric. Moreover, inspired by image translation-based adaptation approaches [1, 7, 30, 44, 45], we leverage $I^{s \to t}$, which is generated during feature decomposition learning, to adapt the task network to the target domain (i.e., we feed the $I^{s \to t}$ produced in Fig. 2b into the pipeline of Fig. 2a). This is defined as $z^{s \to t}_{con}=E_{con}(I^{s \to t})$, $z^{s \to t}_{sty}=E^{t}_{sty}(I^{s \to t})$, and $\tilde{Y}^{s \to t}=D(z^{s \to t}_{con},z^{s \to t}_{sty})$. Then, the $L_{1}$ loss is used to train with the translated image: $\mathcal{L}^{s \to t}_{de}=\|\tilde{Y}^{s \to t}-Y^{s}\|_{1}$.
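The depth-related terms could be sketched as follows; the finite-difference gradients and mean-reduced norms are simplifications, and `ssim_fn` stands in for an external SSIM implementation rather than a specific library call.

```python
import torch

def gradient_xy(img):
    """Simple finite-difference image gradients (illustrative)."""
    gx = img[:, :, :, :-1] - img[:, :, :, 1:]
    gy = img[:, :, :-1, :] - img[:, :, 1:, :]
    return gx, gy

def smoothness_loss(depth, image):
    """Edge-aware smoothness (sketch of L_sm): penalize depth gradients,
    down-weighted where the image itself has strong gradients."""
    dgx, dgy = gradient_xy(depth)
    igx, igy = gradient_xy(image.mean(dim=1, keepdim=True))
    return (dgx.abs() * torch.exp(-igx.abs())).mean() + \
           (dgy.abs() * torch.exp(-igy.abs())).mean()

def geometry_loss(I_t, I_t_warped, ssim_fn, alpha=0.425, beta=0.15):
    """Photometric consistency (sketch of L_geo) between the target image and the
    image inverse-warped from the predicted depth and the right stereo view."""
    return alpha * (1 - ssim_fn(I_t, I_t_warped)) + beta * (I_t - I_t_warped).abs().mean()

def depth_l1_loss(pred, gt):
    """Supervised L1 depth loss for source and translated-source images."""
    return (pred - gt).abs().mean()
```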

Full learning objective. The full objective of the proposed LFDA framework is defined as:

$$\begin{split}
\mathcal{L} &= (\mathcal{L}^{s}_{de}+\mathcal{L}^{s \to t}_{de}) + \lambda_{geo}\mathcal{L}_{geo} + \lambda_{sm}\mathcal{L}_{sm} + \lambda_{align}\mathcal{L}_{align} \\
&+ \lambda_{recon}(\mathcal{L}^{s \to s}_{recon}+\mathcal{L}^{t \to t}_{recon}) + \lambda_{trans}(\mathcal{L}^{s \to t}_{trans}+\mathcal{L}^{t \to s}_{trans}),
\end{split} \qquad (4)$$

where the $\lambda$'s are trade-off factors. We optimize this loss function end-to-end in a single stage.
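A minimal sketch of one single-stage training iteration combining the weighted terms of Eq. (4) is given below, using the λ values from Sec. IV-A; the dictionary of loss terms and the single shared optimizer are assumptions for illustration.

```python
# Illustrative single-stage training step: all loss terms are combined into
# Eq. (4) and optimized jointly (lambda values follow Sec. IV-A).
lambdas = {"geo": 1.0, "sm": 0.01, "align": 0.01, "recon": 0.5, "trans": 0.05}

def training_step(losses, optimizer):
    """`losses` holds the individual scalar terms computed in this iteration,
    e.g. {"de_s", "de_st", "geo", "sm", "align", "recon", "trans"}, where
    "recon" and "trans" already sum their two symmetric directions."""
    total = (losses["de_s"] + losses["de_st"]
             + lambdas["geo"] * losses["geo"]
             + lambdas["sm"] * losses["sm"]
             + lambdas["align"] * losses["align"]
             + lambdas["recon"] * losses["recon"]    # L_recon^{s->s} + L_recon^{t->t}
             + lambdas["trans"] * losses["trans"])   # L_trans^{s->t} + L_trans^{t->s}
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```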

III-C Inference

During inference, our goal is to predict a depth map from a given target image. This corresponds to the red path in Fig. 2a. Therefore, only $E_{con}$, $E^{t}_{sty}$ and $D$ are retained after training, where $E^{t}_{sty}$ is the only sub-network required in addition to the primary MDE task network $\{E_{con},D\}$. Compared to recent top-performing approaches, which require an entire sophisticated image translation network during inference [36, 44], LFDA allows much lower computational complexity. This is attributed to the proposed decomposition learning, which reduces the domain gap more efficiently.
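The inference path can be summarized by the following sketch; the module signatures are placeholders for the retained sub-networks.

```python
import torch

@torch.no_grad()
def infer_depth(I_t, E_con, E_sty_t, D):
    """Sketch of the inference path (red path in Fig. 2a): only the shared
    content encoder, the target style encoder, and the task decoder are kept."""
    z_con_t = E_con(I_t)          # target content features (target BN branch)
    z_sty_t = E_sty_t(I_t)        # target style features
    return D(z_con_t, z_sty_t)    # predicted depth map \tilde{Y}^t
```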

IV EXPERIMENTS

We extensively evaluate the proposed LFDA on three domain adaptation scenarios: cross-camera adaptation, synthetic-to-real adaptation, and adverse weather adaptation [41]. Moreover, we conduct an ablation study and analyze the computational complexity of the models.

IV-A Implementation Details

For a fair comparison, the architectures of the sub-networks $\{E_{con},D\}$, $E^{s}_{sty}$, $E^{t}_{sty}$, $Disc$, $Disc^{t \to s}$ and $Disc^{s \to t}$ are identical to the corresponding ones in T2Net [45]. The generator $G$ is implemented as in [5]. The models are trained with the Adam optimizer [21] using initial learning rates of $1\times 10^{-4}$ for $\{E_{con},D\}$ and $2\times 10^{-5}$ for the other sub-networks. The learning rates decrease according to a polynomial decay policy. We set $\lambda_{geo}=1$, $\lambda_{sm}=\lambda_{align}=0.01$, $\lambda_{recon}=0.5$, and $\lambda_{trans}=0.05$. The entire framework is trained end-to-end in a single stage. The experiments are implemented in PyTorch [35] and conducted on a single NVIDIA Tesla V100 GPU. We will release our source code after the paper is accepted.
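The optimizer setup could look like the sketch below, with the two learning rates from above and a polynomial decay implemented via `LambdaLR`; the decay power and the total iteration count are assumptions, as they are not specified here.

```python
import torch

def build_optimizer(task_net_params, other_params, total_iters=200_000, power=0.9):
    # Two parameter groups with the learning rates from Sec. IV-A.
    optimizer = torch.optim.Adam([
        {"params": task_net_params, "lr": 1e-4},   # {E_con, D}
        {"params": other_params,    "lr": 2e-5},   # style encoders, G, discriminators
    ])
    poly = lambda it: (1 - it / total_iters) ** power   # polynomial decay policy
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)
    return optimizer, scheduler
```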

Table I: Results of Cityscapes-to-KITTI adaptation, tested on the KITTI Eigen split [9] (cap 80m). The 1.25ⁿ columns refer to the standard δ < 1.25ⁿ accuracy metrics.
  Lower, better   Higher, better
Method abs-rel sq-rel rmse rmse-log 1.25 1.25² 1.25³
T2Net [45] 0.173 1.335 5.640 0.242 0.773 0.930 0.970
DESC [30] 0.149 0.967 5.236 0.223 0.810 0.940 0.976
LFDA (ours) 0.119 0.963 5.049 0.207 0.855 0.948 0.977

IV-B Cross-Camera Adaptation

Different cameras may have distinct intrinsic parameters or viewpoints, so the captured images differ in scale, field of view, etc. Such a domain gap can cause sub-optimal adaptation performance.

Datasets. We use Cityscapes [8] as the source dataset and KITTI [14] as the target dataset. The KITTI Eigen split [9] is used for testing. Following [45], we resize the KITTI input images from 375×1242 to 192×640 and upsample the predicted depth maps back to the original size for evaluation. For Cityscapes, we follow [30] and crop and resize the images from 1024×2048 to 192×640. The ground-truth depth is capped at 80m.
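A minimal preprocessing sketch under these settings is shown below; the exact Cityscapes crop region is an assumption, as only the input and output resolutions are specified.

```python
import torch
import torch.nn.functional as F

def preprocess_kitti(img):                 # img: (B, 3, 375, 1242)
    """Resize KITTI inputs to the 192x640 training resolution."""
    return F.interpolate(img, size=(192, 640), mode="bilinear", align_corners=False)

def postprocess_depth(pred, orig_hw=(375, 1242)):
    """Upsample predicted depth maps back to the original resolution for evaluation."""
    return F.interpolate(pred, size=orig_hw, mode="bilinear", align_corners=False)

def preprocess_cityscapes(img):            # img: (B, 3, 1024, 2048)
    """Crop and resize Cityscapes images to 192x640. The crop height here is chosen
    to match the 192:640 aspect ratio and keeps the upper part of the frame; the
    actual crop region used in [30] may differ."""
    h = int(img.shape[3] * 192 / 640)      # 2048 * 0.3 = 614 rows
    cropped = img[:, :, :h, :]
    return F.interpolate(cropped, size=(192, 640), mode="bilinear", align_corners=False)
```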

Results. Table I reports the results following the standard evaluation protocol [9]. The consistent improvements on all metrics show the superiority of our LFDA. In particular, LFDA's abs-rel error is 20% lower than that of DESC [30]. This indicates that the proposed feature decomposition learning is effective in reducing the domain gap between images captured by different cameras.
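This figure follows directly from the abs-rel values in Table I:

$$\frac{0.149-0.119}{0.149}\approx 0.20,$$

i.e., a roughly 20% relative reduction.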

Table II: Results of X-to-KITTI adaptation, tested on KITTI stereo 2015 [33]. Top-2 methods are in bold. vK: Virtual KITTI, K: KITTI, CS: Cityscapes, G: GTA5 images.
  Lower, better   Higher, better
Method Dataset abs-rel sq-rel rmse rmse-log 1.25 1.25² 1.25³
Atapour et al. [2] G → K 0.101 1.048 5.308 0.184 0.903 0.988 0.992
GASDA [44] vK → K 0.106 0.987 5.215 0.176 0.885 0.963 0.986
LFDA (ours) CS → K 0.092 1.055 5.024 0.165 0.906 0.966 0.985
LFDA (ours) vK → K 0.087 0.931 4.765 0.162 0.910 0.968 0.986
Table III: Results of vKITTI-to-KITTI adaptation, tested on KITTI Eigen split [9] (cap 80m). Top-2 methods are in bold.
  Lower, better   Higher, better
Method abs-rel sq-rel rmse rmse-log 1.25 1.25² 1.25³
AdaDepth [22] 0.214 1.932 7.157 0.295 0.665 0.882 0.950
CrDoCo [7] 0.232 2.204 6.733 0.291 0.739 0.883 0.942
T2Net [45] 0.173 1.396 6.041 0.251 0.757 0.916 0.966
Akada et al. [1] 0.168 1.228 5.498 0.235 0.771 0.921 0.973
DESC [30] 0.156 1.067 5.628 0.237 0.787 0.924 0.970
GASDA [44] 0.149 1.003 4.995 0.227 0.824 0.941 0.973
SharinGAN [36] 0.116 0.939 5.068 0.203 0.850 0.948 0.978
LFDA (ours) 0.120 0.961 5.095 0.213 0.848 0.945 0.975

IV-C Synthetic-to-Real Adaptation

The style and appearance of synthetic images are usually different from those of real images. This can negatively impact accuracy on real data.

Datasets. We use Virtual KITTI (vKITTI) [11] and KITTI as the source and the target domains, respectively. Following [45], we resize the vKITTI images to 192×640 and cap the ground-truth depth at 80m. We evaluate on both KITTI Eigen split and KITTI stereo 2015 dataset [33].

Results. Table II reports the test results on the KITTI stereo 2015 dataset. We also include our Cityscapes-to-KITTI model for comparison. As can be observed, both of our models achieve much better accuracy than existing approaches on most metrics. Note that Atapour et al. [2] use images captured from the GTA5 game as their source data, and KITTI has a smaller domain shift with GTA5 than with Cityscapes or vKITTI. Table III shows the test results on the KITTI Eigen split. LFDA significantly outperforms most existing works, while it is behind SharinGAN [36] by a slim margin. Note that SharinGAN requires a sophisticated image translation network during inference, resulting in a much higher computational cost than ours. It also relies on a complicated multi-stage training procedure. Both drawbacks make it less practical to deploy in real-world applications.

IV-D Adverse Weather Adaptation

Adverse weather such as fog and rain produces image artifacts, which can lead to accuracy degradation.

Datasets. In this experiment, Foggy Cityscapes [38] is used as the target dataset. It is constructed by simulating haze upon Cityscapes images. We crop and resize the images to 192×640, and cap the ground-truth depth at 80m.

Results. Table IV reports the test results on Foggy Cityscapes. We evaluate both our Cityscapes-to-KITTI and vKITTI-to-KITTI models. Since this is, to our knowledge, the first time adverse weather adaptation has been explored for MDE, we build our own baselines for comparison. Src-Only refers to the model trained on only the source data, and Src+Tgt+AL is trained on both source and target data with adversarial learning that aligns their entire feature distributions. Clearly, LFDA achieves considerable improvements over both baselines, indicating that it performs more stably under different weather conditions. Examples of qualitative results are shown in Fig. 1.

Table IV: Results on Foggy Cityscapes [38] (cap 80m).
  Lower, better   Higher, better
Method Dataset abs-rel sq-rel rmse rmse-log 1.25 1.25² 1.25³
Src-Only CS 0.477 8.333 18.211 0.717 0.225 0.507 0.720
Src+Tgt+AL CS & K 0.422 4.672 11.879 0.448 0.249 0.698 0.915
LFDA (ours) CS & K 0.283 3.485 11.261 0.381 0.479 0.835 0.914
Src-Only vK 0.415 9.117 17.356 0.673 0.370 0.631 0.741
Src+Tgt+AL vK & K 0.378 6.130 15.434 0.600 0.325 0.688 0.795
LFDA (ours) vK & K 0.332 4.454 13.024 0.475 0.374 0.762 0.868

IV-E Ablation Study

We conduct an ablation study using our vKITTI-to-KITTI model and evaluate on the KITTI Eigen split. The results are reported in Table V. First, we can see that +Tgt+AL yields an obvious improvement over Src-Only, showing the importance of domain adaptation. Second, +Tgt+Con+2BN refers to the model that makes use of the decomposed content features and deploys two separate BN branches for the source and target domains, respectively. +Tgt+Con+2BN further improves the abs-rel metric by 0.017, showing that our feature decomposition and separate BNs are effective in learning domain-invariant content features. Next, +Tgt+Con+2BN+Sty includes $z^{t}_{sty}$ in the pipeline but still maintains only two separate BNs. The results show that it suffers from severe performance degradation. This supports our argument that content and style features have different distributions, so passing them through the same BN hurts model performance. Finally, LFDA (i.e., +Tgt+Con+3BN+Sty), which deploys a third BN branch exclusively for the target style features, resolves this issue. LFDA performs the best on most metrics, demonstrating the effectiveness of our method.

Table V: Ablation study results for vKITTI-to-KITTI adaptation, tested on the KITTI Eigen split [9] (cap 80m).
  Lower, better   Higher, better
Method abs-rel sq-rel rmse rmse-log 1.25 1.25² 1.25³
Src-Only 0.212 2.196 7.114 0.323 0.673 0.851 0.930
+Tgt+AL 0.140 1.022 5.131 0.216 0.834 0.943 0.977
+Tgt+Con+2BN 0.123 1.039 5.220 0.215 0.847 0.944 0.974
+Tgt+Con+2BN+Sty 0.273 3.566 8.371 0.314 0.659 0.882 0.948
LFDA (ours) 0.120 0.961 5.095 0.213 0.848 0.945 0.975
Refer to caption
Figure 3: Qualitative results of the image reconstruction and translation used for feature decomposition. $I^{s}$: source input image (vKITTI), $I^{t}$: target input image (KITTI), $I^{s \to s}$: source reconstruction, $I^{t \to t}$: target reconstruction, $I^{s \to t}$: source-to-target translation, $I^{t \to s}$: target-to-source translation.

IV-F Feature Decomposition Visualization

To verify the effectiveness of our feature decomposition, Fig. 3 shows qualitative results of the image reconstruction and translation illustrated in Fig. 2b. We can observe that the reconstructed images $I^{s \to s}$ and $I^{t \to t}$ are very close to the input images $I^{s}$ and $I^{t}$, respectively. In addition, the translated images $I^{s \to t}$ and $I^{t \to s}$ accurately maintain the content information while taking on the style appearance of the other domain. Such high-quality results can be achieved only if the decomposition of content and style features is successful. This demonstrates the rationale behind the high performance of the proposed method.

IV-G Computational Complexity

In addition to accuracy, model size and computational cost are important factors when evaluating a model, as they determine its feasibility for practical applications. In Table VI, we compare LFDA to two existing top-performing approaches in terms of the number of parameters and the number of multiply-accumulate operations (MACs) at inference time. GASDA [44] includes three sub-networks during inference: a target-data MDE network, a target-to-source translation network, and a target-to-source MDE network. This design places a heavy computational burden. SharinGAN [36] also needs an image translation network in addition to an MDE network. In contrast, in LFDA, the only sub-network in addition to the primary MDE network is $E^{t}_{sty}$, which adds minimal complexity. LFDA uses 51% and 27% fewer MACs than GASDA and SharinGAN, respectively, showing that our method bridges the domain gap much more efficiently.
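For reference, the parameter counts in Table VI can be reproduced directly in PyTorch, as in the sketch below; MACs additionally require a profiling tool run on a 192×640 input, which is omitted here. The module names are placeholders.

```python
import torch

def count_parameters(*modules):
    """Count trainable parameters of the sub-networks kept at inference time
    (for LFDA: E_con, E_sty^t and D)."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Example usage (the modules stand in for the actual sub-networks):
# n_params = count_parameters(E_con, E_sty_t, D)
# print(f"{n_params / 1e6:.1f}M parameters")
#
# MACs are typically measured with a separate FLOP/MAC profiler
# on a dummy input of the evaluation resolution:
# dummy = torch.randn(1, 3, 192, 640)
```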

Table VI: Comparison of model complexity. The number of multiply-accumulate operations (MACs) is computed on the input size of 192×640.
Method Params MACs
GASDA [44] 112.3M 221.5G
SharinGAN [36] 57.7M 148.1G
LFDA (ours) 57.6M 108.1G

V CONCLUSIONS

In this paper, we propose LFDA, a novel domain adaptive MDE method. We assume that the feature space can be decomposed into an image content component and an appearance style component. LFDA learns to achieve this decomposition and thus efficiently mitigates the domain shift between source and target data. LFDA shows superior accuracy on three broad domain adaptation scenarios. Moreover, it has a relatively low computational cost and can be trained end-to-end in a single stage, making it more practical for real-world applications.

References

  • [1] H. Akada, S. F. Bhat, I. Alhashim, and P. Wonka. Self-supervised learning of domain invariant features for depth estimation. arXiv preprint arXiv:2106.02594, 2021.
  • [2] A. Atapour-Abarghouei and T. P. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [3] S. F. Bhat, I. Alhashim, and P. Wonka. Adabins: Depth estimation using adaptive bins. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [4] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han. Domain-specific batch normalization for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [5] W.-L. Chang, H.-P. Wang, W.-H. Peng, and W.-C. Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [6] C. Chen, W. Xie, W. Huang, Y. Rong, X. Ding, Y. Huang, T. Xu, and J. Huang. Progressive feature alignment for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [7] Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang. Crdoco: Pixel-level domain transfer with cross-domain consistency. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [9] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Conference on Neural Information Processing Systems, 2014.
  • [10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [11] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 2015.
  • [13] R. Garg, V. K. Bg, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, 2016.
  • [14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [15] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Conference on Neural Information Processing Systems, 2014.
  • [17] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, 2018.
  • [18] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In IEEE International Conference on Computer Vision, 2017.
  • [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
  • [20] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, 2016.
  • [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [22] J. N. Kundu, P. K. Uppala, A. Pahuja, and R. V. Babu. Adadepth: Unsupervised content congruent adaptation for depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [23] Y. Kuznietsov, J. Stuckler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [24] C.-S. Lai, Z. You, C.-C. Huang, Y.-H. Tsai, and W.-C. Chiu. Colorization of depth map via disentanglement. In European Conference on Computer Vision, 2020.
  • [25] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision, 2016.
  • [26] Y.-L. Lee, M.-Y. Tseng, Y.-C. Luo, D.-R. Yu, and W.-C. Chiu. Learning face recognition unsupervisedly by disentanglement and self-augmentation. In International Conference on Robotics and Automation, 2020.
  • [27] Y.-J. Li, C.-S. Lin, Y.-B. Lin, and Y.-C. F. Wang. Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In IEEE International Conference on Computer Vision, 2019.
  • [28] Y.-C. Liu, Y.-Y. Yeh, T.-C. Fu, S.-D. Wang, W.-C. Chiu, and Y.-C. F. Wang. Detach and adapt: Learning cross-domain disentangled deep representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [29] S.-Y. Lo and V. M. Patel. Defending against multiple and unforeseen adversarial videos. In IEEE Transactions on Image Processing, 2021.
  • [30] A. Lopez-Rodriguez and K. Mikolajczyk. Desc: Domain adaptation for depth estimation via semantic consistency. British Machine Vision Conference, 2020.
  • [31] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [32] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, 2017.
  • [33] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [34] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems, 2019.
  • [36] K. PNVR, H. Zhou, and D. Jacobs. Sharingan: Combining synthetic and real data for unsupervised geometry estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [37] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, 2017.
  • [38] C. Sakaridis, D. Dai, and L. Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 2018.
  • [39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [40] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] V. VS, V. Gupta, P. Oza, V. A. Sindagi, and V. M. Patel. Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [42] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.
  • [43] C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le. Adversarial examples improve image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [44] S. Zhao, H. Fu, M. Gong, and D. Tao. Geometry-aware symmetric domain adaptation for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [45] C. Zheng, T.-J. Cham, and J. Cai. T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In European Conference on Computer Vision, 2018.
  • [46] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.