
Unsupervised Domain Adaptive Object Detection using Forward-Backward Cyclic Adaptation

Siqi Yang, Lin Wu, Arnold Wiliem, Brian C. Lovell
The University of Queensland
Abstract

We present a novel approach to unsupervised domain adaptation for object detection through forward-backward cyclic (FBC) training. Recent adversarial training based domain adaptation methods have shown their effectiveness in minimizing domain discrepancy via marginal feature distribution alignment. However, aligning the marginal feature distributions does not guarantee the alignment of class conditional distributions. This limitation is more evident when adapting object detectors, as the domain discrepancy is larger than in the image classification task: a varying number of objects may exist in one image, and the majority of the content in an image is background. This motivates us to learn domain invariance for category-level semantics via gradient alignment. Intuitively, if the gradients of two domains point in similar directions, then the learning of one domain can improve that of the other domain. To achieve gradient alignment, we propose Forward-Backward Cyclic Adaptation, which iteratively computes adaptation from target to source via backward hopping and from source to target via forward passing. In addition, we align low-level features for adapting holistic color/texture via adversarial training. However, a detector that performs well on both domains is not necessarily ideal for the target domain. As such, in each cycle, domain diversity is enforced by maximum entropy regularization on the source domain to penalize confident source-specific learning and minimum entropy regularization on the target domain to encourage target-specific learning. Theoretical analysis of the training process is provided, and extensive experiments on challenging cross-domain object detection datasets show the superiority of our approach over the state-of-the-art.

1 Introduction

Figure 1: (a) Due to domain discrepancy, the detector trained on the source domain does not perform well on the target domain. Green boxes indicate false positives and red boxes indicate missed objects. (b) Feature visualization of the detection results on target images generated by the source-only model. Because of the many false detections on background, it is difficult to align features at instance level without category information.
Figure 2: Diagram of the proposed forward-backward cyclic adaptation for unsupervised domain adaptive object detection. In each episode, the training pursues two goals: gradient alignment across the source $\mathcal{X}_{s}$ and target $\mathcal{X}_{t}$ to achieve domain-invariant detectors, and encouraging domain diversity to boost the target detector performance.

Object detection is a fundamental problem in computer vision [42, 27, 41, 28, 24]. It can be applied in many scenarios such as face and pedestrian detection [16] and self-driving cars [4]. However, due to variations in shape and appearance, lighting conditions and backgrounds, a model trained on source data might not perform well on the target, a problem known as domain discrepancy, as shown in Figure 1. A common approach to maximize performance on the target domain is to fine-tune a pre-trained model with a large amount of target data. However, annotating bounding boxes for target objects is time-consuming and expensive. Hence, an effective model that can adapt object detectors to a new domain without labels, i.e., unsupervised domain adaptation, is highly desirable.

Unsupervised domain adaptation methods [29, 30, 11, 54, 45, 47] for image classification learn domain-invariant features by minimizing the source error and simultaneously the domain discrepancy through feature distribution alignment. Standard optimization criteria include maximum mean discrepancy [29, 30] and distribution moment matching [52, 53]. Recent adversarial training based domain adaptation methods have shown their effectiveness in learning domain-invariance by matching the marginal distributions of source and target features [11, 55, 54]. However, matching the marginal distributions does not guarantee the alignment of class conditional distributions [57, 51, 32, 21]. For example, aligning the target cat class to the source dog class can easily meet the objective of reducing the cost of source/target domain distinction, but the semantic categories are wrong. This limitation of global adversarial learning is more evident when the discrepancy between the two domains is large.

In object detection, performing domain alignment is more challenging than in the image classification task in two respects: (1) the input image may contain multiple objects, whereas there is only one centered object in the classification task; (2) images in object detection are dominated by background and non-objects. Therefore, global adversarial learning (i.e., aligning marginal feature distributions) at image level is not sufficient for such a challenging task due to the limitations discussed above. Chen et al. [5] made the first attempt to apply domain adversarial training to object detection, where the marginal feature distributions at instance level are aligned in addition to the image-level adaptation. However, due to the domain shift, the detector may not be accurate and many non-object proposals from the background are used for domain alignment (Figure 1). This amplifies the limitation of domain adversarial training, and hence only limited gains can be achieved. To tackle the limitation of global adversarial training, two state-of-the-art methods [60, 46] have been proposed to reduce the discrepancy between the aligned samples that are then used for adversarial training. More specifically, they re-weigh and select the target images or proposals according to the scores of a domain classifier, so that images/instances whose features are similar to those of source samples are emphasized. These methods show that it is challenging to align marginal feature distributions in the object detection task, especially when no target labels are provided.

In this work, we argue that explicit feature distribution alignment is not a necessary condition for learning domain-invariance. Instead, we remark that domain-invariance of category-level semantics can be achieved by gradient alignment, where the inner product between the gradients on images from different domains is maximized. Intuitively, if the inner product is positive, the gradients from different domains point in similar directions. This implies that taking a gradient step on one domain can improve the learning on the other domain and vice versa. In other words, the two learnings share similar information, which leads to domain-invariance. More importantly, the gradients of the last fully connected layer preserve class conditional information for the different domains. Therefore, gradient alignment is well suited to the challenging task of adapting object detectors.

To achieve these goals, we propose a Forward-Backward Cyclic Adaptation (FBC) approach to learn adaptive object detectors. In each cycle, Forward Passing, adaptation from source to target, and Backward Hopping, adaptation from target to source, are played sequentially. Each adaptation is a domain transfer, where training is first initialized with the model trained on the previous domain and then fine-tuned with the images of the current domain. We provide theoretical analysis to show that by computing the forward and backward adaptation sequentially via Stochastic Gradient Descent (SGD), gradient alignment can be achieved. Our proposed approach is related to the cycle consistency utilized in machine translation [14], image-to-image translation [59, 58] and unsupervised domain adaptation [50], with a similar intuition that the mappings of an example transferred from source to target and then back to the source domain should yield the same result. Different from these approaches, we do not strictly enforce a cycle consistency loss on the source domain, and our proposed cyclic adaptation shares the same network architecture for both adaptations rather than using two separate generators.

In addition to learning the domain-invariance of category level semantics, we leverage domain adversarial training to learn the domain-invariance of holistic color and textures via aligning low level features. This domain adversarial training is conducted on the adaptation from source to target domain.

However, a detector with good generalization on both domains may not be the optimal solution for the target domain. To address this, we introduce domain-diversity into the training objective to avoid overfitting on the source domain and to encourage target-specific learning on the target domain. In this work, we adopt two regularizers: (1) a maximum entropy regularizer on the source domain and (2) a minimum entropy regularizer on the target domain. An overview of our model is shown in Fig. 2.

We conduct experiments on four domain-shift scenarios, and the experimental results show the effectiveness of our proposed approach. Contributions: (1) We propose a forward-backward cyclic adaptation approach to learn unsupervised domain adaptive object detectors through effective gradient alignment; (2) To achieve good performance on the target domain, we explicitly enforce domain-diversity via entropy regularization to push the domain-invariant detector closer to the optimal solution for the target space; (3) The proposed method is simple yet effective and can be applied to various architectures.

2 Related Work

Object Detection. Existing deep object detection methods [12, 42, 41, 27, 24, 25] can be roughly grouped into two categories: two-stage and single-stage frameworks. A representative of the two-stage framework is Faster R-CNN proposed by Ren et al. [42], which consists of two sub-networks: a region proposal network that generates region proposals and an R-CNN that classifies the categories of the proposals. Single-stage detectors, e.g., SSD [27] and YOLO [41], have demonstrated high efficiency in object detection, where the networks perform object classification and localization simultaneously. Other methods like FPN [24] and RetinaNet [25] leverage a combination of features from different levels to improve the feature representations.

Unsupervised Domain Adaptation for Image Classification. Unsupervised domain adaptation approaches address domain discrepancy with labeled source data and unlabeled target data. A vast number of deep learning based works have been presented for image classification [44, 54, 55, 29, 30]. Many adaptation methods [29, 30, 52, 31, 11, 55, 54] reduce the domain divergence based on the following theory:

Theorem 1 (Ben-David et al. [1])

Let $h:\mathcal{X}\to\mathcal{Y}$ be a hypothesis in the hypothesis space $\mathcal{H}$. The expected error on the target domain $\epsilon_{T}(h)$ is bounded by

\epsilon_{T}(h)\leq\epsilon_{S}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})+\lambda,\quad\forall h\in\mathcal{H}, \qquad (1)

where $\epsilon_{S}(h)$ is the expected error on the source domain, $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})=2\sup_{h,h^{\prime}\in\mathcal{H}}\left|\Pr_{x\sim\mathcal{D}_{S}}[h(x)\neq h^{\prime}(x)]-\Pr_{x\sim\mathcal{D}_{T}}[h(x)\neq h^{\prime}(x)]\right|$ measures the domain divergence, and $\lambda$ is the expected error of the ideal joint hypothesis, $\lambda=\min_{h\in\mathcal{H}}[\epsilon_{S}(h)+\epsilon_{T}(h)]$.

To minimize the divergence, various methods have been proposed to align the distributions of features from the source and target domains, e.g., maximum mean discrepancy [29, 30], correlation alignment [52], a joint distribution discrepancy loss [31] and adversarial training that aligns marginal distributions [11, 55, 54]. Adversarial training based methods [11, 55, 54] match the marginal distributions of the source and target features by training the feature generator to confuse a domain classifier. Although these methods have achieved impressive results, recent works [57, 46, 51, 20, 2] show that aligning the marginal distributions without considering class conditional distributions does not guarantee a small $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$. To address this, Luo et al. [32] improve it via a semantic-aware discriminator, and Xie et al. [57] align the semantic prototypes for each class. Some works [57, 20, 2] minimize the joint hypothesis error $\lambda$ with pseudo labels in addition to the marginal distribution alignment. Other methods use the predictions of a classifier as pseudo labels for unlabeled target samples [50, 45, 3]. Lee et al. [22] argue that training with pseudo labels is equivalent to entropy regularization, which favors a low-density separation between classes.

Unsupervised Domain Adaptation for Object Detection. Fewer works address unsupervised domain adaptation for object detection. To our knowledge, there are only three: Domain Adaptive Faster R-CNN (DA-Faster) [5], Selective Domain Alignment (SDA) [60] and Strong-Weak Domain Alignment (SWDA) [46]. DA-Faster [5] adds two domain classifiers to Faster R-CNN to learn domain-invariant image-level and instance-level features. However, due to the limitation of domain adversarial training and inaccurate instance predictions, the improvement is limited. To address this, two state-of-the-art methods [60, 46] select target images/instances that are similar to source ones for adversarial training. Zhu et al. [60] first filter non-objects by grouping the proposals and then emphasize the target proposals that are similar to the source for adversarial domain alignment. Saito et al. [46] weakly align the image-level features from a high-level layer, where images that are globally similar have higher priority to be aligned. The weak alignment is achieved by replacing the cross-entropy loss of the domain classifier with a focal loss [25]. In contrast to selecting similar pairs for adversarial training, our proposed method learns domain-invariance of category-level semantics via gradient alignment.

Gradient-based Meta Learning. Our method is also related to recent gradient-based meta-learning methods, MAML [10] and Reptile [37], which are designed to learn a good initialization for few-shot learning and have demonstrated good within-task generalization. Reptile [37] suggests that SGD automatically maximizes the inner products between the gradients computed on different minibatches of the same task, resulting in within-task generalization. Riemer et al. [43] integrate the Reptile algorithm with an experience replay module for continual learning, where the transfer between examples is maximized via meta-learning. Inspired by these methods, we leverage the generalization ability of Reptile [37] to improve the generalization across domains for unsupervised domain adaptation via gradient alignment.

3 Forward-Backward Domain Adaptation for Object Detection

3.1 Overview

In unsupervised domain adaptation, $N_{S}$ labeled images $\{\mathcal{X}_{S},\mathcal{Y}_{S}\}=\{x^{i}_{S},y^{i}_{S}\}^{N_{S}}_{i=1}$ from the source domain with a distribution $\mathcal{D}_{S}$ are given. We have $N_{T}$ unlabeled images $\mathcal{X}_{T}=\{x^{j}_{T}\}^{N_{T}}_{j=1}$ from the target domain with a different distribution $\mathcal{D}_{T}$, but the ground truth labels $\mathcal{Y}_{T}=\{y^{j}_{T}\}^{N_{T}}_{j=1}$ are not accessible during training. Note that in object detection, each label in $\mathcal{Y}_{S}$ or $\mathcal{Y}_{T}$ is composed of a set of bounding boxes with their corresponding class labels. Our goal is to learn a neural network (parameterized by $\theta$) $f_{\theta}:\mathcal{X}_{T}\to\mathcal{Y}_{T}$ that can make accurate predictions on the target samples without the need for labeled training data.

In Theorem 1, the expected error on the target domain $\epsilon_{T}(h)$ is bounded by three terms: (1) the expected error on the source domain $\epsilon_{S}(h)$, which can be minimized easily via supervised learning; (2) the disagreement between two hypotheses on the source and target domains $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$; and (3) the expected error of the ideal joint hypothesis $\lambda$.

In this work, we argue that aligning the feature distributions is not a necessary condition for reducing $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$. Different from the above-mentioned distribution alignment based methods, we cast domain adaptation as an optimization problem that encourages domain-invariance. As the ultimate goal of domain adaptation is to achieve good performance on the target domain, we further introduce domain-diversity into training to boost the detection performance in the target space.

3.2 Gradient Alignment via Forward-Backward Cyclic Training

Figure 3: (a) Illustration of the model updates in our proposed forward-backward cyclic adaptation method. $\theta_{0}$ is the initial model, and $\theta^{*}_{S}$ and $\theta^{*}_{T}$ are the optimal solutions for the source and target domain, respectively. (b) We propose that domain-invariance occurs when the gradients of source and target samples point in similar directions. (c) Domain diversity is introduced to avoid overfitting on the source domain.

Recent gradient-based meta-learning methods [10, 40, 37], designed for few shot learning, have demonstrated their success in approximating learning algorithms and shown their ability to generalize well to new data from unseen distributions. Inspired by these methods, we propose to learn the domain-invariance via gradient alignment to achieve generalization across domains.

3.2.1 Gradient Alignment for Domain-invariance

Suppose that we have neural networks that learn the predictions for source and target samples as $f_{\theta_{S}}:\mathcal{X}_{S}\to\mathcal{Y}_{S}$ and $f_{\theta_{T}}:\mathcal{X}_{T}\to\mathcal{Y}_{T}$, respectively. The network parameters $\theta_{S}$ and $\theta_{T}$ are updated by minimizing the empirical risks $\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})=\frac{1}{N_{S}}\sum_{i=1}^{N_{S}}\ell(f_{\theta_{S}}(x^{i}_{S}),y^{i}_{S})$ and $\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{Y}_{T})=\frac{1}{N_{T}}\sum_{j=1}^{N_{T}}\ell(f_{\theta_{T}}(x^{j}_{T}),y^{j}_{T})$, where $\ell(\cdot)$ is the cross-entropy loss.

In this paper, we argue that domain-invariance occurs when:

\frac{\partial\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})}{\partial\theta_{S}}\cdot\frac{\partial\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{Y}_{T})}{\partial\theta_{T}}>0, \qquad (2)

where $\cdot$ is the inner-product operator. When two gradients point in similar directions, the learning of source samples can benefit the learning of target samples and vice versa. This indicates that the two learnings share similar information and therefore leads to domain-invariance. Moreover, this gradient alignment can encode class conditional information, as the gradients are generated from the classification losses $\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})$ and $\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{Y}_{T})$. This is different from the feature alignment by a domain classifier in adversarial training based methods [11, 55, 54, 5, 46], where class information is not explicitly considered.

Recalling Theorem 1, once $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$ is minimized, the generalization error on the target domain $\epsilon_{T}(h)$ is bounded by the shared error of the ideal joint hypothesis, $\lambda=\min_{h\in\mathcal{H}}[\epsilon_{S}(h)+\epsilon_{T}(h)]$. As suggested in [1], it is important to have a classifier that performs well on both domains. Therefore, similar to previous works [22, 57, 2], we resort to pseudo labels $\mathcal{\hat{Y}}_{T}=\{\hat{y}^{j}_{T}\}^{N_{T}}_{j=1}$ to optimize the upper bound on $\lambda$. These pseudo labels are the detections on the target images produced by the source detector $f_{\theta_{S}}$ and are updated whenever $f_{\theta_{S}}$ is updated. Therefore, our objective function for gradient alignment is

\min_{\theta_{S},\theta_{T}}\mathcal{L}_{g}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})=\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})+\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})-\alpha\frac{\partial\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})}{\partial\theta_{S}}\cdot\frac{\partial\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})}{\partial\theta_{T}}. \qquad (3)
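The gradient inner product in Eq. 3 is not evaluated explicitly during training; it emerges from the cyclic updates described in Sec. 3.2.2. It can, however, be monitored directly. Below is a minimal PyTorch-style sketch (illustrative only, not the released implementation), where `model`, `source_loss` and `target_loss` are placeholder names for a shared detector and scalar losses already computed on a labeled source batch and a pseudo-labeled target batch:

```python
import torch

def gradient_alignment(model, source_loss, target_loss):
    """Inner product between per-domain gradients of a shared model (Eq. 2).

    A positive value means a gradient step on one domain also decreases the
    loss on the other domain, i.e., the two learnings share information.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_s = torch.autograd.grad(source_loss, params, retain_graph=True, allow_unused=True)
    g_t = torch.autograd.grad(target_loss, params, retain_graph=True, allow_unused=True)
    dot = 0.0
    for gs, gt in zip(g_s, g_t):
        if gs is not None and gt is not None:
            dot = dot + (gs * gt).sum()
    return dot
```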

3.2.2 Forward-Backward Cyclic Training

To achieve the above objective, we propose an algorithm that sequentially plays Backward Hopping on the source domain and Forward Passing on the target domain, with a shared network parameterized by $\theta$ updated iteratively. We initialize the shared network $\theta$ with an ImageNet [7] pre-trained model. We denote one cycle of forward passing and backward hopping as an episode. In the backward hopping phase of episode $t$, the network parameterized by $\theta_{S}^{(t)}$ is first initialized with the model $\theta_{T}^{(t-1)}$ from the previous episode $t-1$. The model $\theta_{S}^{(t)}$ is then optimized one image at a time via stochastic gradient descent (SGD) on the $N_{S}$ labeled source images $\{\mathcal{X}_{S},\mathcal{Y}_{S}\}$. In forward passing, the model $\theta_{T}^{(t)}$ is initialized with $\theta_{S}^{(t)}$ and trained with the pseudo-labeled target samples $\{\mathcal{X}_{T},\hat{\mathcal{Y}}_{T}\}$. The training procedure is shown in Fig. 2.

Theoretical Analysis. We provide theoretical analysis to show how our proposed forward and backward training strategy achieves the objective of gradient alignment in Eq. 3. For simplicity, we only analyze the gradient computations in one episode and denote the overall gradient obtained in one episode as $g_{e}$. We then have $g_{e}=g_{S}+g_{T}$, where $g_{S}=\frac{\partial\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})}{\partial\theta_{S}}$ is the gradient obtained in backward hopping and $g_{T}=\frac{\partial\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})}{\partial\theta_{T}}$ is the gradient obtained in forward passing.

According to Taylor's theorem, the gradient of forward passing can be expanded as $g_{T}=\bar{g}_{T}+\bar{H}_{T}(\theta_{T}-\theta_{0})+O(\lVert\theta_{T}-\theta_{0}\rVert^{2})$, where $\bar{g}_{T}$ and $\bar{H}_{T}$ are the gradient and Hessian matrix at the initial point $\theta_{0}$. Then the overall gradient $g_{e}$ can be rewritten as:

g_{e}=g_{S}+g_{T}=\bar{g}_{S}+\bar{g}_{T}+\bar{H}_{T}(\theta_{T}-\theta_{0})+O(\lVert\theta_{T}-\theta_{0}\rVert^{2}). \qquad (4)

Let us denote the initial parameters in one episode as $\theta_{0}$. In our proposed forward and backward training strategy, the model parameters of backward hopping are first initialized with $\theta_{S}=\theta_{0}$ and are updated to $\theta_{0}-\alpha g_{S}$. In forward passing, the model is initialized with the updated $\theta_{S}$, and thus $\theta_{T}=\theta_{0}-\alpha g_{S}$. Substituting this into Eq. 4, we have

g_{e}=\bar{g}_{S}+\bar{g}_{T}-\alpha\bar{H}_{T}\bar{g}_{S}+O(\lVert\theta_{T}-\theta_{0}\rVert^{2}). \qquad (5)

It is noted in Reptile [37] that $\mathbb{E}[\bar{H}_{S}\bar{g}_{T}]=\mathbb{E}[\bar{H}_{T}\bar{g}_{S}]=\frac{1}{2}\mathbb{E}[\frac{\partial}{\partial\theta_{0}}(\bar{g}_{S}\cdot\bar{g}_{T})]$. Therefore, this training approximates our objective function in Eq. 3. The proposed training strategy is related to meta-learning approaches [10, 37, 23] that were initially designed for few-shot learning. Their training mainly aims at generalization within one task, while our forward and backward training aims at generalization across domains. More details are given in the supplementary material.

3.3 Local Feature Alignment via Adversarial Training

Figure 4: Network Architecture.

Domain adversarial training has demonstrated its effectiveness in reducing the domain discrepancy of low-level features, e.g., local texture and color, regardless of class conditional information [5, 46]. Therefore, we align the low-level features at image level in combination with the gradient alignment on the source domain. We utilize the gradient reversal layer (GRL) proposed by Ganin and Lempitsky [11] for domain adversarial training, where the gradients of the domain classifier are reversed for domain confusion. Following SWDA [46], we extract local features $F$ from a low-level layer as input to the domain classifier $D$, and the least-squares loss [35, 59] is used to optimize the domain classifier. The adversarial training loss is as follows:

\mathcal{L}_{adv}=\frac{1}{2}\frac{1}{N_{S}WH}\sum_{i,w,h}D(F(x_{S}^{i}))_{wh}^{2}+\frac{1}{2}\frac{1}{N_{T}WH}\sum_{j,w,h}(1-D(F(x_{T}^{j}))_{wh})^{2}, \qquad (6)

where $H$ and $W$ are the height and width of the output feature map of the domain classifier.
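As an illustration, a minimal sketch of this local alignment, assuming a fully convolutional domain classifier `D` that outputs a one-channel (N, 1, H, W) map of domain scores on the low-level feature maps `feat_s` and `feat_t`; the class and function names here are ours, not those of the released code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL) [11]: identity in the forward pass,
    negated gradient in the backward pass, so the feature extractor is
    trained to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()

def local_adv_loss(D, feat_s, feat_t):
    """Least-squares domain loss of Eq. 6: source pixels are pushed toward 0,
    target pixels toward 1 (the mean approximates the 1/(N W H) normalization)."""
    pred_s = D(GradReverse.apply(feat_s))
    pred_t = D(GradReverse.apply(feat_t))
    return 0.5 * (pred_s ** 2).mean() + 0.5 * ((1.0 - pred_t) ** 2).mean()
```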

3.4 Domain Diversity via Entropy Regularization

The ultimate goal of domain adaptation is to achieve good performance on the target domain. However, a model that only learns domain-invariance is not an optimal solution for the target domain, since

\epsilon_{T}(h)\leq\epsilon_{T}(h^{a})+\epsilon_{T}(h,h^{a}), \qquad (7)

where $h^{a}=\operatorname*{arg\,min}_{h\in\mathcal{H}}[\epsilon_{S}(h)+\epsilon_{T}(h)]$. Moreover, in the absence of ground truth labels for the target samples, the learning of domain-invariance relies largely on the source samples, which might result in overfitting on the source domain and limit the ability to generalize well on the target domain. Therefore, it is important to introduce domain-diversity into the training to place more emphasis on target-specific information.

In this work, we define domain diversity as a combination of two regularizations: (1) maximum entropy regularization on the source domain to avoid overfitting and (2) minimum entropy regularization on the unlabeled target domain to leverage target-specific information. Low entropy corresponds to high confidence. To avoid overfitting when training with source domain data, we utilize the maximum entropy regularizer [39], which penalizes confident predictions with low entropy. The maximum entropy principle proposed by Jaynes [18] has been applied in reinforcement learning [56, 36] to prevent early convergence and in supervised learning to improve generalization [39, 26, 8]:

\max_{\theta_{S}}\textrm{H}(f_{\theta_{S}}(\mathcal{X}_{S}))=-\sum_{i=1}^{N_{S}}f_{\theta_{S}}(x^{i}_{S})\log(f_{\theta_{S}}(x^{i}_{S})). \qquad (8)

On the contrary, to leverage the unlabeled target domain data, we exploit the minimum entropy regularizer. Entropy minimization has been used in unsupervised clustering [38], semi-supervised learning [13] and unsupervised domain adaptation [30, 33] to encourage a low-density separation between clusters or classes. Here, we minimize the entropy of the class conditional distribution:

\min_{\theta_{T}}\textrm{H}(f_{\theta_{T}}(\mathcal{X}_{T}))=-\sum_{j=1}^{N_{T}}f_{\theta_{T}}(x^{j}_{T})\log(f_{\theta_{T}}(x^{j}_{T})). \qquad (9)

We define the objective function of learning domain diversity as:

\mathcal{L}_{div}(\mathcal{X}_{S},\mathcal{X}_{T})=-\textrm{H}(f_{\theta_{S}}(\mathcal{X}_{S}))+\textrm{H}(f_{\theta_{T}}(\mathcal{X}_{T})). \qquad (10)
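A small sketch of the two entropy terms (Eqs. 8-10), assuming `logits_s` and `logits_t` hold the per-RoI classification logits that the detector produces on source and target images; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def prediction_entropy(logits):
    """Shannon entropy of the class conditional distribution, summed over RoIs."""
    p = F.softmax(logits, dim=-1)
    log_p = F.log_softmax(logits, dim=-1)
    return -(p * log_p).sum(dim=-1).sum()

def domain_diversity_loss(logits_s, logits_t):
    """Eq. 10: maximize source entropy (penalize overconfidence on the source)
    and minimize target entropy (encourage confident target-specific predictions)."""
    return -prediction_entropy(logits_s) + prediction_entropy(logits_t)
```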

3.5 Overall Objective

To learn domain-invariance for adapting object detectors, we perform gradient alignment for high-level semantics and domain adversarial training on local features for low-level information, e.g., local textures/colors:

\mathcal{L}_{inv}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T})=\mathcal{L}_{g}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T})+\lambda\mathcal{L}_{adv}(\mathcal{X}_{S},\mathcal{X}_{T}), \qquad (11)

where $\lambda$ balances the trade-off between the gradient alignment loss and the adversarial training loss.

Maximizing domain-diversity contradicts the intention of learning domain-invariance. As discussed above, without access to the ground truth labels of the target samples, the accuracy on the target samples relies on the domain-invariance information learnt from the source domain. Consequently, we use a hyperparameter $\gamma$ to balance the trade-off between learning domain-invariance and domain-diversity. Our overall objective function is therefore

\min_{\theta}\mathcal{L}_{inv}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T})+\gamma\mathcal{L}_{div}(\mathcal{X}_{S},\mathcal{X}_{T}). \qquad (12)

The full algorithm is outlined in Algorithm 1.

Algorithm 1 Forward-Backward Cyclic Domain Adaptation for Object Detection
0:  Source samples $\{x^{i}_{S},y^{i}_{S}\}^{N_{S}}_{i=1}$, target samples $\{x^{j}_{T}\}^{N_{T}}_{j=1}$, ImageNet pre-trained model $\theta_{0}$, hyperparameters $\alpha$, $\beta$, $\gamma$, $\lambda$, number of iterations $N_{itr}$
0:  A shared model $\theta$
1:  Initialize $\theta$ with $\theta_{0}$
2:  for $t$ in $N_{itr}$ do
3:     //Backward Hopping:
4:     $\theta_{S}^{(t)}\leftarrow\theta$
5:     for $i$, $j$ in $N_{S}$, $N_{T}$ do
6:        $\theta_{S}^{(t)}\leftarrow\theta_{S}^{(t)}-\alpha\nabla_{\theta_{S}^{(t)}}\big(\mathcal{L}_{\theta_{S}^{(t)}}(x^{i}_{S},y^{i}_{S})+\lambda\mathcal{L}_{adv}(x^{i}_{S},x^{j}_{T})-\gamma\textrm{H}(f_{\theta_{S}^{(t)}}(x^{i}_{S}))\big)$
7:     end for
8:     $\theta\leftarrow\theta+\beta(\theta_{S}^{(t)}-\theta)$
9:     Generate pseudo labels $\hat{y}^{j}_{T}=f_{\theta_{S}^{(t)}}(x^{j}_{T}),\ j=1,\dots,N_{T}$
10:     //Forward Passing:
11:     $\theta_{T}^{(t)}\leftarrow\theta$
12:     for $j$ in $N_{T}$ do
13:        $\theta_{T}^{(t)}\leftarrow\theta_{T}^{(t)}-\alpha\nabla_{\theta_{T}^{(t)}}\big(\mathcal{L}_{\theta_{T}^{(t)}}(x^{j}_{T},\hat{y}^{j}_{T})+\gamma\textrm{H}(f_{\theta_{T}^{(t)}}(x^{j}_{T}))\big)$
14:     end for
15:     $\theta\leftarrow\theta+\beta(\theta_{T}^{(t)}-\theta)$
16:  end for
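For concreteness, the sketch below condenses one episode of Algorithm 1 into PyTorch-style pseudocode. It assumes a Faster R-CNN-style detector exposing a supervised detection loss (`det_loss`), the local adversarial loss of Eq. 6 (`local_adv_loss`), the entropy of its class predictions (`pred_entropy`) and a `detect` call for pseudo labels; these helper names, the `make_detector` factory and the pseudo-label score threshold are our assumptions and are not specified by the paper.

```python
import torch

def reptile_update(theta, theta_new, beta):
    """Shared-model step theta <- theta + beta * (theta_new - theta) (lines 8 and 15)."""
    return {k: theta[k] + beta * (theta_new[k] - theta[k]) for k in theta}

def fbc_episode(theta, source_data, target_data, make_detector,
                alpha=1e-3, beta=1.0, gamma=0.1, lam=0.5, score_thr=0.5):
    """One forward-backward cycle on a shared state_dict `theta`."""
    # Backward hopping: initialize from the shared weights, fine-tune on labeled source.
    det_s = make_detector(); det_s.load_state_dict(theta); det_s.train()
    opt_s = torch.optim.SGD(det_s.parameters(), lr=alpha)
    for (img_s, boxes_s), img_t in zip(source_data, target_data):
        loss = (det_s.det_loss(img_s, boxes_s)
                + lam * det_s.local_adv_loss(img_s, img_t)    # low-level alignment (Eq. 6)
                - gamma * det_s.pred_entropy(img_s))          # maximize source entropy (Eq. 8)
        opt_s.zero_grad(); loss.backward(); opt_s.step()
    theta = reptile_update(theta, det_s.state_dict(), beta)

    # Pseudo labels on the target domain from the just-updated source detector (line 9).
    with torch.no_grad():
        pseudo = [det_s.detect(img_t, score_thr) for img_t in target_data]

    # Forward passing: initialize from the shared weights, fine-tune on pseudo-labeled target.
    det_t = make_detector(); det_t.load_state_dict(theta); det_t.train()
    opt_t = torch.optim.SGD(det_t.parameters(), lr=alpha)
    for img_t, boxes_t in zip(target_data, pseudo):
        loss = (det_t.det_loss(img_t, boxes_t)
                + gamma * det_t.pred_entropy(img_t))          # minimize target entropy (Eq. 9)
        opt_t.zero_grad(); loss.backward(); opt_t.step()
    theta = reptile_update(theta, det_t.state_dict(), beta)
    return theta
```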

4 Experiments

In this section, we evaluate the proposed forward-backward cyclic adaptation approach (FBC) on four cross-domain detection datasets.

4.1 Implementation Details

Following DA-Faster [5] and SWDA [46], we use Faster R-CNN [42] as our detection framework. All training and test images are resized so that the shorter side has 600 pixels, and the training batch size is 1. Our method is implemented in PyTorch.

Baselines. We compare our method with the following baselines: (1) Source Only: a Faster R-CNN detector fine-tuned from the ImageNet [7] pre-trained model with labeled source samples, without adaptation; (2) DA-Faster [5]; (3) SWDA [46]; (4) Zhu et al. [60].

Evaluation Metrics. We report the mean average precision (mAP) with an IoU threshold of 0.5 across all classes. As the source-only models differ slightly among the baselines, the mAP gain is also used as a metric to evaluate the effectiveness of adaptation.
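For reference, a detection counts as correct when it overlaps a ground-truth box of the same class with intersection-over-union (IoU) of at least 0.5; a minimal sketch of this test (the standard PASCAL VOC protocol, not code from the paper, and omitting the score-ranked one-to-one matching used in the full AP computation):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(det_box, gt_boxes, thr=0.5):
    """A detection matches if some ground-truth box overlaps it with IoU >= thr."""
    return any(iou(det_box, g) >= thr for g in gt_boxes)
```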

4.2 Adaptation between Dissimilar Domains

Method aero bcycle bird boat bottle bus car cat chair cow table dog hrs motor prsn plnt sheep sofa train tv mAP
Source Only [46] 35.6 52.5 24.3 23.0 20.0 43.9 32.8 10.7 30.6 11.7 13.8 6.0 36.8 45.9 48.7 41.9 16.5 7.3 22.9 32.0 27.8
DA-Faster† [5] 15.0 34.6 12.4 11.9 19.8 21.1 23.2 3.1 22.1 26.3 10.6 10.0 19.6 39.4 34.6 29.3 1.0 17.1 19.7 24.8 19.8
SWDA [46] 26.2 48.5 32.6 33.7 38.5 54.3 37.1 18.6 34.8 58.3 17.0 12.5 33.8 65.5 61.6 52.0 9.3 24.9 54.1 49.1 38.1
Source Only (ours) 24.2 47.1 24.9 17.7 26.6 47.3 30.4 11.9 36.8 26.4 10.1 11.8 25.9 74.6 42.1 24.0 3.8 27.2 37.9 29.9 29.5
FBC (ours) 43.9 64.4 28.9 26.3 39.4 58.9 36.7 14.8 46.2 39.2 11.0 11.0 31.1 77.1 48.1 36.1 17.8 35.2 52.6 50.5 38.5
FBC w/o local (ours) 32.1 57.6 24.4 23.7 34.1 59.3 32.2 9.1 40.3 41.3 27.8 11.9 30.2 72.9 48.8 38.3 6.1 33.1 46.5 48.0 36.0
Table 1: Results (%) on the adaptation from PASCAL [9] to the Clipart dataset [17]. The DA-Faster† result is the one reported in SWDA [46].

We evaluate the adaptation performance on two pairs of dissimilar domains: PASCAL [9] to Clipart [17], and PASCAL [9] to Watercolor [17]. For the two domain shifts, we use the same source-only model trained on PASCAL. Following SWDA [46], we use ResNet101 [15] as the backbone network for Faster R-CNN detector and the settings of training and test sets are the same.

Datasets. The PASCAL VOC dataset [9] is used as the source domain in these two domain-shift scenarios. This dataset consists of real images with 20 object classes, and the training set contains around 15K images. The two dissimilar target domains are the Clipart dataset [17] with comic images and the Watercolor dataset [17] with artistic images. Clipart has the same 20 object classes as PASCAL, while Watercolor has only six. Clipart contains 1K comic images, which are used for both training (without labels) and testing. There are 2K images in Watercolor: 1K for training (without labels) and 1K for testing.

Results on the Clipart Dataset [17]. The original DA-Faster paper [5] does not evaluate on the Clipart and Watercolor datasets, so we use the results of DA-Faster [5] reported in SWDA [46]. As shown in Table 1, compared to the source-only model, DA-Faster [5] degrades the detection performance significantly, with a drop of 8 percentage points in mAP. DA-Faster [5] adopts two domain classifiers on both image-level and instance-level features. However, source/target domain confusion without considering semantic information leads to wrong alignment of semantic classes across domains. The problem is more severe when the domain shift in object detection is large, i.e., PASCAL [9] to Clipart [17]. In Clipart [17], the comic images contain objects that differ greatly from those in PASCAL [9] w.r.t. shapes and appearance, such as sketches. To address this, SWDA [46] conducts a weak alignment on the image-level features by training the domain classifier with a focal loss. With the additional help of a domain classifier on lower-level features and a context regularization, SWDA [46] boosts the detection mAP from 27.8% to 38.1%, an increase of 10.3 points. Our proposed FBC achieves the highest mAP of 38.5%.

Figure 5: Feature visualization showing the evidence used by the classifiers before and after domain adaptation, using Grad-CAM [49].
Method bike bird car cat dog prsn mAP
Source Only [46] 68.8 46.8 37.2 32.7 21.3 60.7 44.6
DA-Faster† [5] 75.2 40.6 48.0 31.5 20.6 60.0 46.0
SWDA [46] 82.3 55.9 46.5 32.7 35.5 66.7 53.3
Source Only (ours) 66.7 43.5 41.0 26.0 22.9 58.9 43.2
FBC (ours) 90.9 47.7 46.0 38.7 31.8 66.7 53.6
FBC w/o local (ours) 88.7 48.2 46.6 38.7 35.6 64.1 53.6
Table 2: Results (%) on the adaptation from PASCAL [9] to Watercolor [17]. The DA-Faster† result is the one reproduced in SWDA [46].
Method AP on Car
Source Only [5] 31.2
DA-Faster [5] 39.0
Source Only [46] 34.6
DA-Faster† [5] 34.2
SWDA [46] 42.3
Source Only [60] 34.0
Zhu et al. [60] 43.0
Source Only (ours) 31.2
FBC (ours) 42.7
FBC w/o local (ours) 39.2
Table 3: Results (%) on the adaptation from Sim10k [19] to Cityscapes [6]. The DA-Faster† result is the one reproduced in SWDA [46].

Results on the Watercolor Dataset [17]. The adaptation results on the Watercolor dataset are summarized in Table 2. In Watercolor, most images contain only one or two objects, with less variation in shape and appearance compared with the Clipart dataset. As reported in SWDA [46], the source-only model already achieves quite good results with an mAP of 44.6%, and DA-Faster [5] improves it only slightly, by 1.4 points. SWDA [46] performs much better than DA-Faster [5] and obtains a high mAP of 53.3%, an adaptation gain of 8.7 points. The mAP of our proposed FBC is 53.6%, which is 0.3% higher than that of SWDA. Even without the local feature alignment via adversarial training, our proposed forward-backward cyclic adaptation method achieves state-of-the-art performance.

Feature Visualization. To visualize the adaptability of our method, we use Grad-CAM [49] to show the evidence (heatmap) for the last fully connected layer of the object detectors. High values in the heatmap indicate the evidence on which the classifier bases its decision. Figure 5 shows the differences in classification evidence before and after adaptation. As we can see, the adapted detector classifies the objects (e.g., persons) based on more semantics (e.g., faces, necks, joints). This demonstrates that the adapted detector has addressed the discrepancy in appearance between real and cartoon objects. More samples can be found in the supplementary material.
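Grad-CAM [49] weights the activations of a chosen convolutional layer by the spatially averaged gradients of the class score. A minimal sketch using forward/backward hooks, assuming a classification-style output; the model and layer arguments are placeholders:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, class_idx):
    """Grad-CAM heatmap for `class_idx`, taken at `conv_layer` of `model`."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = model(image.unsqueeze(0))[0, class_idx]   # assumes (N, num_classes) scores
        model.zero_grad()
        score.backward()
        weights = grads['g'].mean(dim=(2, 3), keepdim=True)           # GAP over spatial dims
        cam = F.relu((weights * acts['a']).sum(dim=1, keepdim=True))  # weighted activations
        cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)
        return cam.squeeze()
    finally:
        h1.remove(); h2.remove()
```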

4.3 Adaptation from Synthetic to Real Images

As adaptation from synthetic images to real images can potentially reduce the effort of collecting real data and labels, we evaluate the adaptation performance in the scenario of Sim10k [19] to Cityscapes [6].

Datasets. The source domain, Sim10k [19], contains synthetic images rendered by the computer game Grand Theft Auto (GTA). It provides 58,701 bounding box annotations for cars in 10K images. The target domain, Cityscapes [6], consists of real images captured by a car-mounted video camera in driving scenarios. It comprises 2,975 images for training and 500 images for validation. We use its training set for adaptation without labels and its validation set for evaluation. The adaptation is only evaluated on the car class, as Sim10k only provides annotations for cars.

Results. The results are shown in Table 3. The mAP gain of DA-Faster [5] in its original report (7.8 points) is significantly different from its reproduced gain (-0.4 points) in SWDA [46], which suggests that considerable effort is needed to reproduce the reported results of DA-Faster [5]. Zhu et al. [60] achieve the best performance with an AP of 43.0%. Our proposed FBC obtains a competitive AP of 42.7%, which is 0.4% higher than that of SWDA. Compared with the method of Zhu et al. [60], however, our proposed method has a much simpler network architecture and training scheme.

4.4 Adaptation between Similar Domains

Method person rider car truck bus train motor bcycle mAP
Source Only [5] 17.8 23.6 27.1 11.9 23.8 9.1 14.4 22.8 18.8
DA-Faster [5] 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
Source Only [60] 9.7 32.2 44.6 16.2 27.0 9.1 20.7 29.7 26.2
Zhu et al. [60] 33.5 38.0 48.5 26.5 39.0 23.3 28 33.6 33.8
Source Only [46] 24.1 33.1 34.3 4.1 22.3 3.0 15.3 26.5 20.3
SWDA [46] 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
Source Only (ours) 22.4 34.2 27.2 12.1 28.4 9.5 20.0 27.1 22.9
FBC (ours) 31.5 46.0 44.3 25.9 40.6 39.7 29.0 36.4 36.7
FBC w/o local (ours) 29.0 37.0 35.6 18.9 32.1 10.7 25.0 31.3 27.5
Table 4: Results (%) on the adaptation from Cityscapes [6] to the FoggyCityscapes dataset [48].

Datasets. The target dataset, FoggyCityscapes [48], is a synthetic foggy dataset whose images are rendered from Cityscapes [6]. The annotations and data splits are the same as in Cityscapes, which serves as the source domain. The adaptation performance is evaluated on the validation set of FoggyCityscapes.

Results. As shown in Table 4, both SWDA and Zhu et al. obtain better adaptation results than DA-Faster, with mAPs of 34.3% and 33.8%, respectively. However, compared with the adaptation gain of SWDA (14 points), the gain achieved by Zhu et al. is only 7.6 points. Our proposed FBC method outperforms the baseline methods, boosting the mAP to 36.7%. Without the local feature alignment, our proposed method obtains only a limited gain, because in this scenario the main difference between the two domains is the local texture.

t-SNE Visualization. We visualize the features before and after adaptation via t-SNE [34] in Figure 6. The features are the outputs of the ROI pooling layer, and 100 images are randomly selected. After adaptation, the distributions of source and target features are well aligned with regard to the object classes. More importantly, as shown in Figure 6(b), different classes are better distinguished and more target objects are detected for each class after adaptation. This demonstrates the effectiveness of our proposed adaptation method for object detection.
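The embedding is standard t-SNE on the pooled features; a minimal sketch, assuming `feats` is an (N, d) array of ROI-pooling outputs with predicted class `labels` and a source/target `domains` indicator:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, domains):
    """2-D t-SNE of RoI features: color encodes class, marker encodes domain."""
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(feats)
    labels, domains = np.asarray(labels), np.asarray(domains)
    for dom, marker in [(0, 'o'), (1, '^')]:   # 0: source, 1: target
        m = domains == dom
        plt.scatter(emb[m, 0], emb[m, 1], c=labels[m], cmap='tab20', s=8, marker=marker)
    plt.axis('off')
    plt.show()
```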

Figure 6: t-SNE visualization of features before and after domain adaptation from Cityscapes to FoggyCityscapes. Different colors represent different classes. Target features are displayed alone on the right for better visualization.

5 Conclusions

We addressed unsupervised domain adaptation for object detection, where the target domain has no labels. A forward-backward cyclic adaptation method was proposed, based on the intuition that domain invariance of category-level semantics can be learnt when the gradient directions of source and target are aligned. Theoretical analysis was presented to show that the proposed method achieves this gradient alignment goal. Local feature alignment via adversarial training was performed to learn domain-invariance of holistic color/textures. Furthermore, we proposed a domain diversity constraint to penalize confident source-specific learning and encourage target-specific learning via entropy regularization.

In the following supplementary material, we first provide detailed theoretical analysis to illustrate how our proposed Forward-Backward Cyclic Adaptation (FBC) in Algorithm 1 approximates the objective function of gradient alignment (Eq. 3 in the main submission); details are given in Section F. We then provide ablation studies in Section G and more implementation details in Section H. Section I presents more examples of feature visualization on the Watercolor dataset [17].

F Deriving the Objective Function

We detail the theoretical analysis in the main submission to show how the proposed algorithm approximates the objective function of gradient alignment. We follow the conventions of Reptile [37] and examine the gradient computations during training; Reptile [37] effectively expands the gradient over the sequence of SGD steps taken. Following [37, 43], let us first define the terms:

g_{i}=\frac{\partial\mathcal{L}_{i}(\theta_{i})}{\partial\theta_{i}}\quad(\textrm{gradient obtained during SGD}), \qquad (13)
\theta_{i+1}=\theta_{i}-\alpha g_{i}\quad(\textrm{sequence of parameter vectors}), \qquad (14)
\bar{g}_{i}=\frac{\partial\mathcal{L}_{i}(\theta_{1})}{\partial\theta_{1}}\quad(\textrm{gradient at the initial point}), \qquad (15)
g^{j}_{i}=\frac{\partial\mathcal{L}_{i}(\theta_{j})}{\partial\theta_{j}}\quad(\textrm{gradient of loss }i\textrm{ evaluated at parameters }\theta_{j}), \qquad (16)
\bar{H}_{i}=\frac{\partial^{2}\mathcal{L}_{i}(\theta_{1})}{\partial\theta^{2}_{1}}\quad(\textrm{Hessian at the initial point}), \qquad (17)
H^{j}_{i}=\frac{\partial^{2}\mathcal{L}_{i}(\theta_{j})}{\partial\theta^{2}_{j}}\quad(\textrm{Hessian of loss }i\textrm{ evaluated at parameters }\theta_{j}), \qquad (18)

where $\alpha$ is the learning rate and $\mathcal{L}_{i}$ is the loss function on the samples of the $i$-th gradient update.

According to the Taylor’s theorem, we have the SGD gradients as follows:

g_{i}=\mathcal{L}^{\prime}_{i}(\theta_{i})=\mathcal{L}^{\prime}_{i}(\theta_{1})+\mathcal{L}^{\prime\prime}_{i}(\theta_{1})(\theta_{i}-\theta_{1})+O(\lVert\theta_{i}-\theta_{1}\rVert^{2}) \qquad (19)
=\bar{g}_{i}+\bar{H}_{i}(\theta_{i}-\theta_{1})+O(\lVert\theta_{i}-\theta_{1}\rVert^{2})\quad(\textrm{using the definitions of }\bar{g}_{i},\bar{H}_{i}) \qquad (20)
=\bar{g}_{i}-\alpha\bar{H}_{i}\sum^{i-1}_{j=1}g_{j}+O(\lVert\theta_{i}-\theta_{1}\rVert^{2})\quad(\textrm{using the SGD updates }\theta_{i}-\theta_{1}=-\alpha\sum^{i-1}_{j=1}g_{j}) \qquad (21)
=\bar{g}_{i}-\alpha\bar{H}_{i}\sum^{i-1}_{j=1}\bar{g}_{j}+O(\lVert\theta_{i}-\theta_{1}\rVert^{2})\quad(\textrm{using }g_{j}=\bar{g}_{j}+O(\lVert\theta_{i}-\theta_{1}\rVert^{2})). \qquad (22)

Consider two steps of parameter updates with stochastic gradient descent (SGD), where the gradient of the first step is $g_{1}$ and that of the second step is $g_{2}$. According to Eq. 22, we have

g_{1}=\bar{g}_{1}, \qquad (23)
g_{2}=\bar{g}_{2}-\alpha\bar{H}_{2}\bar{g}_{1}+O(\lVert\theta_{2}-\theta_{1}\rVert^{2}). \qquad (24)

Then, the overall gradient of the two SGD steps is

g=g_{1}+g_{2}=\bar{g}_{1}+\bar{g}_{2}-\alpha\bar{H}_{2}\bar{g}_{1}+O(\lVert\theta_{2}-\theta_{1}\rVert^{2}). \qquad (25)
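For quadratic losses the higher-order term vanishes, so Eq. 25 holds exactly and can be checked numerically. A small self-contained check (our own illustration, not part of the paper), where the two steps use two different quadratic objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 5, 0.1
A1, A2 = [M @ M.T for M in (rng.standard_normal((d, d)), rng.standard_normal((d, d)))]
b1, b2 = rng.standard_normal(d), rng.standard_normal(d)
theta1 = rng.standard_normal(d)

grad1 = lambda th: A1 @ th + b1      # gradient of L1(th) = 0.5 th'A1 th + b1'th
grad2 = lambda th: A2 @ th + b2      # gradient of L2(th) = 0.5 th'A2 th + b2'th

g1 = grad1(theta1)                   # first SGD step: g1 = g1_bar
theta2 = theta1 - alpha * g1
g2 = grad2(theta2)                   # second SGD step

lhs = g1 + g2                                                       # overall gradient
rhs = grad1(theta1) + grad2(theta1) - alpha * (A2 @ grad1(theta1))  # Eq. 25 without O(.)
print(np.allclose(lhs, rhs))         # True
```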

In Reptile [37], they noted that

\mathbb{E}[\bar{H}_{2}\bar{g}_{1}]=\mathbb{E}[\bar{H}_{1}\bar{g}_{2}]=\frac{1}{2}\mathbb{E}[\bar{H}_{2}\bar{g}_{1}+\bar{H}_{1}\bar{g}_{2}]=\frac{1}{2}\mathbb{E}\Big[\frac{\partial}{\partial\theta_{1}}(\bar{g}_{1}\cdot\bar{g}_{2})\Big], \qquad (26)

where $\mathbb{E}$ denotes the expectation. Therefore, the overall expected gradient is

\mathbb{E}[g]=\mathbb{E}[\bar{g}_{1}]+\mathbb{E}[\bar{g}_{2}]-\frac{1}{2}\alpha\mathbb{E}\Big[\frac{\partial}{\partial\theta_{1}}(\bar{g}_{1}\cdot\bar{g}_{2})\Big]. \qquad (27)

In our work, we aim to address the domain adaptation problem for object detection. In our proposed forward and backward cyclic adaptation (Algorithm 1), we train the network with episodic training. In each episode, similar to the two-step SGD update discussed above, we first perform backward hopping on the labeled source samples $\{\mathcal{X}_{S},\mathcal{Y}_{S}\}$ to obtain the parameters $\theta_{S}$, and then we initialize forward passing with $\theta_{S}$ and train the network with the pseudo-labeled target samples $\{\mathcal{X}_{T},\mathcal{\hat{Y}}_{T}\}$, obtaining the updated parameters $\theta_{T}$. The shared model $\theta$ is updated by $\theta_{S}$ and $\theta_{T}$ sequentially. We can therefore regard the backward hopping gradient $g_{S}$ as $g_{1}$ and the forward passing gradient $g_{T}$ as $g_{2}$. Substituting $g_{S}$ and $g_{T}$ into Eq. 27 gives:

\mathbb{E}[g_{e}]=\mathbb{E}[\bar{g}_{S}]+\mathbb{E}[\bar{g}_{T}]-\frac{1}{2}\alpha\mathbb{E}\Big[\frac{\partial}{\partial\theta_{S}}(\bar{g}_{S}\cdot\bar{g}_{T})\Big], \qquad (28)

where $\mathbb{E}$ denotes the expectation. The above equation shows that the training of our proposed adaptation method (Algorithm 1) approximates the objective of gradient alignment:

\min_{\theta_{S},\theta_{T}}\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})+\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})-\alpha\frac{\partial\mathcal{L}_{\theta_{S}}(\mathcal{X}_{S},\mathcal{Y}_{S})}{\partial\theta_{S}}\cdot\frac{\partial\mathcal{L}_{\theta_{T}}(\mathcal{X}_{T},\mathcal{\hat{Y}}_{T})}{\partial\theta_{T}}. \qquad (29)

G Ablation Studies

In this section, we evaluate the effects of the different components in our proposed adaptation method. As shown in Eq. 11 and Eq. 12 of the main submission, our overall objective function is

\min_{\theta}\mathcal{L}=\mathcal{L}_{inv}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T})+\gamma\mathcal{L}_{div}(\mathcal{X}_{S},\mathcal{X}_{T})=\mathcal{L}_{g}(\mathcal{X}_{S},\mathcal{Y}_{S},\mathcal{X}_{T})+\lambda\mathcal{L}_{adv}(\mathcal{X}_{S},\mathcal{X}_{T})+\gamma\mathcal{L}_{div}(\mathcal{X}_{S},\mathcal{X}_{T}),

where $\mathcal{L}_{g}$ is the loss of gradient alignment, $\mathcal{L}_{adv}$ is the loss of local feature alignment via adversarial training, and $\mathcal{L}_{div}$ is the loss of domain-diversity. $\lambda$ and $\gamma$ are hyperparameters, and we set $\lambda=0.5$ and $\gamma=0.1$ for all experiments in this work.

In the following sections, we use G, L, and D to indicate gradient alignment, local feature alignment and domain diversity, respectively.

G.1 Effects of Gradient Alignment

To evaluate the effects of gradient alignment, we run the forward-backward cyclic method (FBC) on the four cross-domain scenarios with gradient alignment only. The results are shown in Tables 5-8. In the adaptation scenarios PASCAL [9]-to-Clipart [17] (Table 5) and PASCAL-to-Watercolor [17] (Table 6), FBC with gradient alignment achieves better adaptation results than FBC with local feature alignment only. This is because the domain discrepancy in these two adaptation scenarios is large, i.e., adapting real objects to cartoon or watercolor objects. This indicates that gradient alignment has its advantage in aligning high-level semantics.

However, in the adaptation scenarios Sim10k [19]-to-Cityscapes [6] (Table 7) and Cityscapes [6]-to-FoggyCityscapes [48] (Table 8), the domain discrepancy between the two domains lies mainly in the low-level features, e.g., textures and colors. Therefore, in these scenarios FBC with gradient alignment only achieves a limited gain in mAP compared with FBC with local feature alignment. This is most evident in Cityscapes-to-FoggyCityscapes, where the foggy images are rendered from the real images. Nevertheless, FBC with gradient alignment only is still 4.6% higher than the source-only model (Table 8). Although FBC with local feature alignment alone obtains a high mAP of 33.7%, in combination with gradient alignment and domain diversity the mAP can be boosted to 36.7%.

G.2 Effects of Local Feature Alignment

The local feature alignment is conducted via adversarial training, which aligns the marginal feature distributions between the source and target domains. As discussed in the main submission, aligning marginal feature distributions does not perform well when the domain discrepancy is large. This is also demonstrated in our experiments: in Table 5 and Table 6, FBC with local feature alignment only does not perform better than gradient alignment when the domain discrepancy is large. But when the domain discrepancy is small, i.e., mainly in low-level features, FBC with local feature alignment demonstrates its superiority, as shown in Table 7 and Table 8.

It is worth mentioning that gradient alignment and local feature alignment are complementary, as gradient alignment achieves category-level alignment of high-level semantics while local feature alignment via adversarial training has its advantages in aligning low-level features. The combination of these two alignments and domain diversity achieves state-of-the-art performance.

G.3 Effects of Domain Diversity

Here we evaluate the effects of domain-diversity. As shown in Tables 5-8, the domain diversity term consistently improves the adaptation results.

Method G L D aero bcycle bird boat bottle bus car cat chair cow table dog hrs motor prsn plnt sheep sofa train tv mAP
Source Only (ours) 24.2 47.1 24.9 17.7 26.6 47.3 30.4 11.9 36.8 26.4 10.1 11.8 25.9 74.6 42.1 24.0 3.8 27.2 37.9 29.9 29.5
FBC (ours) \checkmark 28.8 64 21.1 19.1 39.7 60.7 29.5 14.2 46.4 29.3 21.8 8.9 28.8 72.7 51.3 32.9 12.8 28.1 52.7 49.5 35.6
\checkmark \checkmark 32.1 57.6 24.4 23.7 34.1 59.3 32.2 9.1 40.3 41.3 27.8 11.9 30.2 72.9 48.8 38.3 6.1 33.1 46.5 48 35.9
\checkmark 31.8 53.0 21.3 25.0 36.1 55.9 30.4 11.6 39.3 21.0 9.4 14.5 32.4 79.0 44.9 37.8 6.2 35.6 43.0 53.5 34.1
\checkmark \checkmark \checkmark 43.9 64.4 28.9 26.3 39.4 58.9 36.7 14.8 46.2 39.2 11.0 11.0 31.1 77.1 48.1 36.1 17.8 35.2 52.6 50.5 38.5
Table 5: The results (%) on the adaptation from PASCAL [9] to the Clipart dataset [17].
Method G L D bike bird car cat dog prsn mAP
Source Only (ours) 66.7 43.5 41 26.0 22.9 58.9 43.2
FBC (ours) \checkmark 90.9 46.5 51.3 33.2 29.5 65.9 52.9
\checkmark \checkmark 88.7 48.2 46.6 38.7 35.6 64.1 53.6
\checkmark 89.0 47.2 46.1 39.9 27.7 65.0 52.5
\checkmark \checkmark \checkmark 90.9 47.7 46.0 38.7 31.8 66.7 53.6
Table 6: The results (%) on the adaptation from PASCAL [9] to the Watercolor dataset [17].
Method G L D AP on Car
Source Only (ours) 31.2
FBC (ours) \checkmark 38.2
\checkmark \checkmark 39.2
\checkmark 41.4
\checkmark \checkmark \checkmark 42.7
Table 7: The results (%) on the adaptation from Sim10k [19] to the Cityscapes dataset [6].
Method G L D person rider car truck bus train motor bcycle mAP
Source Only (ours) 22.4 34.2 27.2 12.1 28.4 9.5 20.0 27.1 22.9
FBC (ours) \checkmark 25.8 35.6 35.5 18.4 29.6 10.0 24.5 30.3 26.2
\checkmark \checkmark 29.0 37.0 35.6 18.9 32.1 10.7 25.0 31.3 27.5
\checkmark 31.6 45.1 42.6 26.4 37.8 22.1 29.4 34.6 33.7
\checkmark \checkmark \checkmark 31.5 46.0 44.3 25.9 40.6 39.7 29.0 36.4 36.7
Table 8: Results (%) on the adaptation from Cityscapes [6] to the FoggyCityscapes dataset [48].

H More Implementation Details

In this section, we provide more implementation details of our experiments.

Details of Local Feature Alignment. In this work, we utilize the Gradient Reversal Layer (GRL) proposed by Ganin and Lempitsky [11] for adversarial training. Following SWDA [46], we extract local features from a low-level layer as input to the domain classifier $D$ and use the least-squares loss [35, 59] to optimize it. The domain classifier is the same as the local domain classifier in SWDA, which consists of three convolutional layers with kernel size 1.

For the local features, the outputs of conv3-3 are extracted in the case of the VGG16 model, and the outputs of the last res3c layer are extracted for the ResNet101 model. The layer names follow the Caffe prototxt.
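A sketch of such a local domain classifier is given below (three 1x1 convolutions on the low-level feature map); the channel widths are our guesses rather than the exact values used in SWDA:

```python
import torch.nn as nn

class LocalDomainClassifier(nn.Module):
    """Fully convolutional, pixel-wise domain classifier applied to low-level
    features (e.g., conv3-3 of VGG16), three conv layers with kernel size 1."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),   # one-channel domain score map
        )

    def forward(self, x):
        return self.net(x)
```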

I Feature Visualization

Figure 7: t-SNE visualization of features before and after domain adaptation from Cityscapes to FoggyCityscapes. Different colors represent different classes. Target features are displayed alone on the right for better visualization.
Figure 8: Feature visualization showing the evidence for the improvement of the classifiers before and after domain adaptation, using Grad-CAM [49] on the Watercolor dataset [17]. The images in the middle column show the attention of the classifier before adaptation and those on the right show the attention of the classifier after adaptation. This figure demonstrates that the adapted detector utilizes more semantics to classify the objects, which indicates the effectiveness of our proposed domain adaptation method.

References

  • [1] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, May 2010.
  • [2] Chaoqi Chen, Weiping Xie, Tingyang Xu, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In CVPR, 2019.
  • [3] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In NeurIPS, 2011.
  • [4] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
  • [5] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In CVPR, 2018.
  • [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [8] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine grained classification. In NeurIPS, 2018.
  • [9] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
  • [10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • [12] Ross Girshick. Fast r-cnn. In ICCV, 2015.
  • [13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NeurIPS, 2005.
  • [14] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In NeurIPS, 2016.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] Peiyun Hu and Deva Ramanan. Finding tiny faces. In CVPR, 2017.
  • [17] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, 2018.
  • [18] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 1957.
  • [19] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv preprint arXiv:1610.01983, 2016.
  • [20] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In CVPR, 2019.
  • [21] Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerio Feris, Bill Freeman, and Gregory Wornell. Co-regularized alignment for unsupervised domain adaptation. In NeurIPS, 2018.
  • [22] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
  • [23] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.
  • [24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In CVPR, 2017.
  • [26] Hu Liu, Sheng Jin, and Changshui Zhang. Connectionist temporal classification with maximum entropy regularization. In NeurIPS, 2018.
  • [27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [28] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [29] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. ICML, 2015.
  • [30] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In NeurIPS, 2016.
  • [31] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
  • [32] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. arXiv preprint arXiv:1809.09478, 2018.
  • [33] Zelun Luo, Yuliang Zou, Judy Hoffman, and Li F Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In NeurIPS, 2017.
  • [34] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [35] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [36] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • [37] Alex Nichol and John Schulman. Reptile: A scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2018.
  • [38] Gintautas Palubinskas, Xavier Descombes, and Frithjof Kruggel. An unsupervised clustering method using the entropy minimization. In ICPR, 1998.
  • [39] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • [40] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR, 2016.
  • [41] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [43] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. ICLR, 2019.
  • [44] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. TPAMI, 41(4):801–814, 2019.
  • [45] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2017.
  • [46] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Strong-weak distribution alignment for adaptive object detection. In CVPR, 2019.
  • [47] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
  • [48] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. IJCV, pages 1–20, 2018.
  • [49] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [50] Ozan Sener, Hyun Oh Song, Ashutosh Saxena, and Silvio Savarese. Learning transferrable representations for unsupervised domain adaptation. In NeurIPS, 2016.
  • [51] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In ICLR, 2018.
  • [52] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, 2016.
  • [53] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [54] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
  • [55] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [56] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
  • [57] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In ICML, 2018.
  • [58] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
  • [59] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [60] Xinge Zhu, Jiangmiao Pang, Ceyuan Yang, Jianping Shi, and Dahua Lin. Adapting object detectors via selective cross-domain alignment. In CVPR, 2019.