
Evaluation of Deep Neural Network Domain Adaptation Techniques for Image Recognition
*Equal contribution.

Venkata Santosh Sai Ramireddy Muthireddy* Hochschule Bonn-Rhein-Sieg
Bonn, Germany
santosh.muthireddy@smail.inf.h-brs.de
   Alan Preciado-Grijalva* Hochschule Bonn-Rhein-Sieg
Bonn, Germany
alan.preciado@smail.inf.h-brs.de
Abstract

It has been well established that deep networks are efficient at extracting features from a given (source) labeled dataset. However, they do not always generalize well to other (target) datasets, which very often have a different underlying distribution. In this report, we evaluate four different domain adaptation techniques for image classification tasks: DeepCORAL, Deep Domain Confusion (DDC), CDAN and CDAN+E. These techniques are unsupervised in that the target dataset does not carry any labels during the training phase. We evaluate model performance on the Office-31 dataset. The GitHub repository for this report can be found here: https://github.com/agrija9/Deep-Unsupervised-Domain-Adaptation.

Index Terms:
domain adaptation, unsupervised learning, convolutional neural networks, transfer learning

Introduction

Methods based on deep neural networks have been producing state-of-the-art (SOTA) results for many problems in machine learning and computer vision. These methods require large amounts of training and testing data to achieve the expected results. However, even a model trained on a large dataset will sometimes fail to generalize the learned knowledge to new environments and datasets. This is because deep learning algorithms assume that training and testing data are independent and identically distributed (i.i.d.). This assumption rarely holds true in practice, as there is usually a shift in data distributions across different domains, as illustrated in Fig. 1. This domain shift between source and target datasets makes deep neural networks produce wrong predictions on the target dataset. Training a deep neural network with source and target datasets in a way that reduces the domain shift between their distributions is called domain adaptation [1].

There are different types of domain adaptation (DA) techniques, among them unsupervised DA [2][3], semi-supervised DA [4], weakly supervised DA [5], one-shot DA [6], few-shot DA [7] and zero-shot DA [8]. In this report, we focus on unsupervised DA techniques, so called because an unlabeled target dataset is used during training to reduce the domain shift.

Unsupervised DA methods can be further categorized by the technique used to reduce the domain shift between the source and target datasets:

  • Adversarial Methods [2]

  • Distance-based Methods [3] [9]

  • Incremental Methods [10]

  • Optimal Transport [11]

  • Other Methods [12]

Figure 1: Domain adaptation in the true data space: Expectation vs. Reality [13].

In this report, the evaluated techniques are selected from the adversarial and distance-based methods. From the adversarial methods, Conditional Adversarial Domain Adaptation (CDAN) and CDAN with Entropy Conditioning (CDAN+E) [2] are selected. From the distance-based methods, Deep Domain Confusion: Maximizing for Domain Invariance (DDC) [9] and Deep CORAL: Correlation Alignment for Deep Domain Adaptation [3] are selected. Domain adaptation techniques can be applied to the different types of domain shift illustrated in Fig. 2; here the dataset-to-dataset domain shift is selected for evaluation. The different domain adaptation techniques are benchmarked on the Office-31 dataset [14].

Figure 2: Applications to different types of domain shift [15].

Domain Adaptation Methods

DeepCORAL

DeepCORAL [3] is a domain adaptation method based on a previously proposed method called Correlation Alignment (CORAL) [16].

CORAL attempts to mitigate domain shift by aligning the second-order statistics of the source and target data distributions. The method consists of 1) extracting features, 2) applying a linear transformation that aligns the feature covariances, and 3) training an SVM classifier on the transformed features.

DeepCORAL extends the capabilities of CORAL by learning a non-linear mapping with a convolutional neural network and a differentiable loss function that minimizes the Frobenius norm of the difference between the feature covariance matrices of the source and target distributions. The method works in an unsupervised way: the target dataset has no labels.

Recall that the main goal is to learn representations that achieve good performance on both data distributions (for image recognition, in our case). To achieve this, DeepCORAL initializes the network parameters from a pre-trained AlexNet model while simultaneously minimizing the loss mentioned above (the CORAL loss). The architecture of DeepCORAL is shown in the figure below.

Figure 3: DeepCORAL neural network architecture.

Note that DeepCORAL takes the last fully connected layer of AlexNet (fc8) and computes a classification loss and a CORAL loss on it.

CORAL loss

This loss is defined for a single layer's feature map, in this case fc8. Although not mentioned in the paper, we believe the authors ran a grid search to identify the optimal layer at which to compute the loss; the best choice likely depends on the task at hand.

To formalize the CORAL loss, suppose that we have source-domain training examples $D_S=\{x_i\}$, $x\in\mathbb{R}^d$, with labels $L_S=\{y_i\}$, $y_i\in\{1,...,L\}$, and unlabeled target data $D_T=\{u_i\}$, $u\in\mathbb{R}^d$. Let $n_S$ and $n_T$ be the number of source and target examples available, respectively. Recall that $x$ and $u$ are the $d$-dimensional deep layer activations (fc8) $\phi(I)$ of an input image $I$. Let $C_S$ and $C_T$ be the feature covariance matrices of these activations for the source and target domains, respectively. The CORAL loss is defined as the distance between these covariances (second-order statistics):

\ell_{CORAL} = \frac{1}{4d^2}\|C_S - C_T\|_F^2    (1)

where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm. The covariance matrices $C_S$ and $C_T$ are defined as

C_S = \frac{1}{n_S-1}\Big(D_S^T D_S - \frac{1}{n_S}(\mathbf{1}^T D_S)^T(\mathbf{1}^T D_S)\Big)    (2)
C_T = \frac{1}{n_T-1}\Big(D_T^T D_T - \frac{1}{n_T}(\mathbf{1}^T D_T)^T(\mathbf{1}^T D_T)\Big)    (3)

where 1 is a column vector with all elements equal to 1.
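To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of the CORAL loss, assuming `source` and `target` are batches of d-dimensional layer activations (e.g. fc8). This is our illustrative rewrite, not the reference implementation from the original paper.

```python
import torch

def coral_loss(source, target):
    """CORAL loss between two (n, d) batches of activations, per Eqs. (1)-(3):
    squared Frobenius norm of the difference of the covariance matrices,
    scaled by 1 / (4 d^2)."""
    d = source.size(1)

    def covariance(x):
        n = x.size(0)
        ones = torch.ones(n, 1, device=x.device)
        mean_term = ones.t() @ x                      # 1^T D, shape (1, d)
        return (x.t() @ x - mean_term.t() @ mean_term / n) / (n - 1)

    c_s = covariance(source)
    c_t = covariance(target)
    return torch.sum((c_s - c_t) ** 2) / (4 * d * d)
```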

The gradient of the CORAL loss can be calculated using the chain rule. Refer to the paper for the analytical expression.

In practice, we do batch learning and thus the covariances are computed over batches. Moreover, the network parameters are shared between the two networks.

In the context of multi-class classification, the total training loss is a joint term combining the classification loss (cross-entropy) and the CORAL loss: the first optimizes classification accuracy, while the second encourages features that work well in both data domains. The total loss is

\ell = \ell_{CLASS} + \sum_{i=1}^{t}\lambda_i\,\ell_{CORAL}    (4)

where $t$ is the number of layers at which a CORAL loss is computed; in this case it is a single layer (fc8). The parameter $\lambda$ acts as a regularizer that helps both losses reach an equilibrium by the end of training.
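As an illustration of Eq. (4), a rough training step could look as follows; `model`, `lam` and the data tensors are placeholder names, and `coral_loss` is the sketch given above.

```python
import torch.nn.functional as F

# Hypothetical training step: `model` returns fc8 activations for a batch;
# only the source batch has labels, mirroring Eq. (4).
source_out = model(source_images)        # (n_s, num_classes) fc8 outputs
target_out = model(target_images)        # (n_t, num_classes), no labels used

class_loss = F.cross_entropy(source_out, source_labels)
loss = class_loss + lam * coral_loss(source_out, target_out)
loss.backward()
```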

Deep Domain Confusion (DDC)

Deep Domain Confusion (DDC) is another domain adaptation method that uses a deep neural network to learn a non-linear transformation. It is similar to DeepCORAL in the sense that it introduces a transfer loss to learn features that work for both data domains [9].

In their paper, Tzeng et al. discuss the concept of biased datasets and show that trained classifiers do not always transfer well to target domains. Their approach is to optimize a loss function that jointly minimizes classification error and maximizes domain confusion, i.e., minimizes the distance between the source and target feature distributions.

Figure 4: Deep Domain Confusion approach: maximizing domain confusion can lead to good performance in target domains.

To minimize the distance between data distributions, DDC uses a standard distribution distance metric called Maximum Mean Discrepancy (MMD). In this case, the distance is computed over an activation layer $\phi(\cdot)$ applied to source data points $x_s \in X_S$ and target data points $x_t \in X_T$. The empirical approximation to this distance is

MMD(X_S, X_T) = \left\| \frac{1}{|X_S|}\sum_{x_s \in X_S}\phi(x_s) - \frac{1}{|X_T|}\sum_{x_t \in X_T}\phi(x_t) \right\|    (5)

Training a model with the MMD loss together with a classification loss results in a strong classifier that transfers across domains. The total loss is

\ell = \ell_C(X_L, y) + \lambda\, MMD^2(X_S, X_T)    (6)

Here $\ell_C$ is the classification loss on the available labeled source data $X_L$ with ground truths $y$. The hyperparameter $\lambda$ acts as a regularizer, playing a similar role to $\lambda$ in DeepCORAL.
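A minimal sketch of the linear MMD of Eq. (5) and its use in Eq. (6) could look as follows; the names are placeholders and this is an illustration rather than the original DDC implementation.

```python
import torch

def linear_mmd(source, target):
    """Linear MMD of Eq. (5): distance between the mean activations of the
    source and target batches of a chosen layer (the fc_adapt bottleneck in DDC)."""
    return (source.mean(dim=0) - target.mean(dim=0)).norm(p=2)

# DDC-style total loss of Eq. (6), with placeholder names:
# loss = F.cross_entropy(source_logits, source_labels) + lam * linear_mmd(f_s, f_t) ** 2
```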

The architecture proposed for DDC is similar to DeepCORAL: it initializes its weights from a pre-trained AlexNet, and the weights are shared between both data domains. The difference is an extra bottleneck layer, fc_adapt, which regularizes the training of the source classifier and prevents overfitting to particular nuances of the source distribution. The MMD loss is computed after this bottleneck layer. The next figure shows the architecture of DDC.
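Below is a rough sketch of such an architecture built on torchvision's AlexNet. The 256-dimensional fc_adapt bottleneck and the exact pretrained-weights argument are assumptions for illustration (the latter depends on the torchvision version); weight sharing across domains simply means the same model is applied to both source and target batches.

```python
import torch.nn as nn
from torchvision import models

class DDCNet(nn.Module):
    """Sketch of a DDC-style network: pre-trained AlexNet with an extra
    bottleneck ("fc_adapt") layer inserted before the classifier."""

    def __init__(self, num_classes=31, bottleneck_dim=256):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep AlexNet up to fc7 (drop the original 1000-way classifier)
        self.fc = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.bottleneck = nn.Linear(4096, bottleneck_dim)     # fc_adapt
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        adapt = self.bottleneck(self.fc(x))   # the MMD loss is computed on this layer
        return adapt, self.classifier(adapt)
```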

Figure 5: Deep Domain Confusion neural network architecture. Both data domains share weights, a bottleneck layer is introduced into the AlexNet architecture, and the total loss is computed after this layer.

CDAN

Adversarial domain adaptation [17] [18] [19] combines domain adaptation with adversarial learning in a way that is very similar to Generative Adversarial Networks (GANs) [20]. In adversarial domain adaptation, a deep neural network learns image representations by minimizing a classification loss while a domain discriminator simultaneously learns to distinguish between the source and target domains. There are, however, two pitfalls of adversarial domain adaptation methods: first, they cannot handle datasets with complex multimodal distributions; second, the domain discriminator cannot reliably be conditioned on the uncertain discriminative information obtained from the source and target data.

These two pitfalls can be addressed by conditional domain adaptation techniques. Conditional domain adaptation is made possible by research on Conditional Generative Adversarial Networks (CGANs) [21], which showed that the shift between the distributions of real and synthetic images can be reduced by conditioning the networks on discriminative information.

CDAN [2] is an adversarial domain adaptation technique motivated by CGANs; it makes use of the discriminative information obtained from a deep classifier network. The key to CDAN models is a conditional domain discriminator conditioned on the cross-covariance of domain-specific feature representations and classifier predictions. Let $D_s=\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ be a source-domain dataset with $n_s$ labeled examples and $D_t=\{x_j^t\}_{j=1}^{n_t}$ a target-domain dataset with $n_t$ unlabeled examples. The source and target datasets are drawn from distributions $P(x^s, y^s)$ and $Q(x^t, y^t)$ which are not i.i.d., i.e., $P \neq Q$. The aim of CDAN is to design a network $G: x \to y$ that reduces the domain shift between the source and target datasets. $f = F(x)$ and $g = G(x)$ are, respectively, the feature representation and the classifier predictions produced by the network $G$.

With the notation above, CDAN is formalized as a minmax optimization problem with two players: $E(G)$, the source risk of the classifier $G$, which is minimized to guarantee low error on the source domain, and $E(D,G)$, the adversarial loss involving $G$ and the domain discriminator $D$ across domains. $E(D,G)$ is minimized over $D$ and maximized over $f$ and $g$:

E(G) = \mathbb{E}_{(x_i^s, y_i^s)\sim D_s}\, L(G(x_i^s), y_i^s)    (7)
E(D,G) = -\mathbb{E}_{x_i^s\sim D_s}\log[D(f_i^s, g_i^s)] - \mathbb{E}_{x_j^t\sim D_t}\log[1 - D(f_j^t, g_j^t)]    (8)

where $L(\cdot)$ is the cross-entropy loss. CDAN is then written as the minmax optimization problem

\min_G\; E(G) - \lambda E(D,G) \quad \text{and} \quad \min_D\; E(D,G)    (9)

where $\lambda$ is a hyperparameter that balances the domain adversary and the source risk.

Figure 7: Architecture of CDAN for domain adaptation [2].

In Fig. 7, $D$ is the domain discriminator conditioned via the multilinear map $f \otimes g$; multilinear mapping is explained in Section 3.2 of [2]. Thus CDAN enables domain adaptation across domains using minmax optimization.
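As an illustration, the multilinear conditioning and a simple discriminator could be sketched as follows; the hidden-layer sizes are assumptions, and the randomized multilinear variant from [2] (used when the product of dimensions is large) is omitted here.

```python
import torch
import torch.nn as nn

def multilinear_map(f, g):
    """Outer product f ⊗ g flattened per example: (n, d_f) x (n, C) -> (n, C * d_f).
    This is the conditioning input fed to CDAN's domain discriminator."""
    return torch.bmm(g.unsqueeze(2), f.unsqueeze(1)).flatten(start_dim=1)

def make_discriminator(feature_dim, num_classes):
    """A small domain discriminator D over the joint variable h = (f, g);
    the 1024-unit hidden layers are an illustrative choice."""
    return nn.Sequential(
        nn.Linear(feature_dim * num_classes, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, 1), nn.Sigmoid(),
    )
```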

CDAN+E

CDAN+E is a variant of CDAN with an additional entropy conditioning term that controls the uncertainty of the classifier predictions. In CDAN, the domain discriminator assigns equal weight to every example, including examples with uncertain predictions from $G$. The uncertainty of the classifier predictions is therefore quantified using the entropy criterion in Eq. 10.

H(g) = -\sum_{c=1}^{C} g_c \log g_c    (10)

where $C$ is the number of classes and $g_c$ is the predicted probability of class $c$.

Now training examples are prioritized by an entropy-aware weight given by

w(H(g)) = 1 + e^{-H(g)}    (11)

So CDAN+E minmax optimization is formulated as

\min_G\; \mathbb{E}_{(x_i^s, y_i^s)\sim D_s} L(G(x_i^s), y_i^s) + \lambda\Big(\mathbb{E}_{x_i^s\sim D_s} w(H(g_i^s))\log[D(T(h_i^s))] + \mathbb{E}_{x_j^t\sim D_t} w(H(g_j^t))\log[1 - D(T(h_j^t))]\Big)
\max_D\; \mathbb{E}_{x_i^s\sim D_s} w(H(g_i^s))\log[D(T(h_i^s))] + \mathbb{E}_{x_j^t\sim D_t} w(H(g_j^t))\log[1 - D(T(h_j^t))]    (12)

where $h = (f, g)$ is the joint variable of feature representation and classifier prediction, and $T$ is the (multilinear) conditioning map.

Thus CDAN+E encourages confident predictions, in line with the entropy minimization principle [22].
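Eqs. (10) and (11) translate directly into code; the following is a small sketch assuming `g` is a batch of softmax predictions of shape (n, C).

```python
import torch

def entropy(g, eps=1e-8):
    """H(g) from Eq. (10) for a batch of softmax predictions g of shape (n, C)."""
    return -(g * torch.log(g + eps)).sum(dim=1)

def entropy_weight(g):
    """Entropy-aware example weight w(H(g)) = 1 + exp(-H(g)) from Eq. (11)."""
    return 1.0 + torch.exp(-entropy(g))
```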

Experiments

In this section we describe the experiments performed with the different methods on the different domain shifts. Our implementations are written in PyTorch and are located in the repository linked above. The experiments were carried out on Google Colab with GPU support (Tesla P4).

Office-31 dataset

The Office-31 dataset was first introduced in [14]. It contains a total of 4652 images collected from three different sources: the Amazon website, a digital SLR (DSLR) camera and a webcam. The dataset has 31 classes in total (backpack, bike, bike helmet, bookcase, bottle, calculator, desk chair, desk lamp, computer, file cabinet, headphones, keyboard, laptop, letter tray, mobile phone, monitor, mouse, mug, notebook, pen, phone, printer, projector, puncher, ring binder, ruler, scissors, speaker, stapler, tape, and trash can).

  • Images from the web: collected from the Amazon website; we call this the amazon dataset (A). It has around 2800 images, with an average of 90 images per class, at a resolution of 300x300 pixels.

  • Images from a DSLR camera: captured with a DSLR camera in real environments; we call this the dslr dataset (D). It has a total of 498 images, with an average of 16 images per class, at a resolution of 1000x1000 pixels.

  • Images from a webcam: collected with a webcam, at lower resolution and with more noise; we call this the webcam dataset (W). It has around 795 images, with an average of 25 images per class, at resolutions HxW where H and W vary between 400 and 600 pixels.

Figure 8: Sample keyboard-category images for the three domains [14].
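As a rough sketch, the three domains can be loaded with torchvision's ImageFolder; the office31/<domain>/images/<class>/ directory layout assumed below depends on the particular download used.

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Standard AlexNet-style preprocessing (ImageNet normalization).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def make_loader(domain, batch_size=128, train=True):
    dataset = datasets.ImageFolder(f"office31/{domain}/images", transform=transform)
    return DataLoader(dataset, batch_size=batch_size, shuffle=train, drop_last=train)

source_loader = make_loader("amazon")   # e.g. the A -> W shift
target_loader = make_loader("webcam")
```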

DeepCORAL

The method is evaluated on the benchmark Office-31 dataset described above. We carried out 6 different experiments covering all the domain shifts available (given the three datasets contained in Office-31).

The CORAL loss is applied to the last fully connected layer (fc8). The weights of this layer are initialized from a normal distribution $\mathcal{N}(0, 0.005)$, while the weights of the other layers are initialized from the pre-trained AlexNet. We used stochastic gradient descent (SGD).

DeepCORAL was trained with batches of 128 images (from both domains), a fixed learning rate of $10^{-3}$, weight decay set to $5\times10^{-4}$ and momentum to 0.9. The $\lambda$ factor is set such that both losses are approximately equal at the end of training; in practice, for $t$ training epochs, $\lambda$ is $\frac{1}{t}$. Furthermore, during training only the source dataset is labeled, which is how we compute the classification loss.
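A minimal sketch of this training setup is shown below; `model` is a placeholder for the AlexNet-based DeepCORAL network.

```python
import torch.optim as optim

# Optimizer settings described above (SGD with momentum and weight decay).
optimizer = optim.SGD(model.parameters(), lr=1e-3,
                      momentum=0.9, weight_decay=5e-4)

num_epochs = 100          # t epoch iterations
lam = 1.0 / num_epochs    # lambda = 1/t, keeping both losses on a similar scale
```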

Table I shows the accuracies obtained by each method in all 6 experiments. These values correspond to the image classification accuracy on the target dataset. DeepCORAL obtained the best result in 4 out of 6 experiments, with accuracy values ranging between 38% and 89%. The fluctuation of accuracies between experiments is directly related to the chosen source and target distributions; note that the method does not exceed 50% accuracy in most cases. We compared our values with the original paper and, although they are lower by a small percentage, they follow the same trend as the original results.

For a better visualization of the method's performance, we plot accuracy and loss as a function of epochs for the Amazon $\to$ Webcam case. Figure 10(a) shows an increase of almost 10% in recognition accuracy when including the CORAL loss. Figure 9(a) shows the behavior of the loss functions; note that they tend to become equal due to the $\lambda$ factor.

Deep Domain Confusion

Training for this method is very similar to DeepCORAL. The main differences are that 1) it trains with the MMD loss, 2) it adds the bottleneck layer, and 3) it uses a scheduled learning rate that decreases as a function of epochs. Fig. 10(b) shows that the accuracy with MMD (i.e., DDC) does not increase very noticeably compared to training without MMD; however, it appears more stable and shows a steady increase. Due to time constraints, we were only able to run one experiment for this method, but the result is consistent with the accuracy reported in the original paper.

CDAN and CDAN+E

Experiments for CDAN and CDAN+E are conducted with the same set of hyperparameters; the only difference between the two is the transfer loss function involved. The hyperparameters used for CDAN and CDAN+E are:

  • Epochs 100

  • Source batch size 10

  • Target batch size 10

  • Learning rate 1e-3

  • Momentum 0.9

  • Weight decay 5e-4

The dataset used for all experiments is the Office-31 dataset [14]. The backbone is a pre-trained AlexNet with a small modification: a bottleneck layer is introduced before the last fully connected layer, with dimension 256 (as used in the original paper [2]), although this is a user-defined value. The optimizer is Stochastic Gradient Descent (SGD), to which the parameters of both the deep network G and the domain discriminator D are added. During the training phase, each batch from the source and target data loaders is passed through the network G, which produces the features f from the bottleneck layer and the outputs of the final fully connected layer. The final-layer outputs of the source and target batches are concatenated and passed through a softmax layer to obtain the network predictions g.

Having obtained the features f and predictions g, the classification loss for the source data is computed from g_s and the ground-truth source labels. The transfer loss depends on the selected domain adaptation method. If CDAN is selected, the joint variable h = (f, g), containing the features and predictions of both source and target data, is fed to the domain discriminator network, as explained in the Section on CDAN above. If CDAN+E is selected, the entropy of the class predictions is additionally computed, as explained in the Section on CDAN+E, and used to weight examples according to the uncertainty of their predictions. The total loss used for backpropagation is given by the equation below.

Total\ loss = classification\ loss + \lambda\cdot transfer\ loss    (13)

where $\lambda$ is the hyperparameter described in the Sections above. During the experiments we observed that the transfer loss is very small, which results in very small gradients of order $10^{-7}$. These in turn barely update the weights of the domain discriminator, so the transfer loss fluctuates with very small variance. In Table I we can observe that CDAN+E performed best in only 2 out of 6 cases. We also observed that the transfer loss decreases until roughly the 20th epoch in most cases. Training and testing accuracies for the A$\to$W shift are shown in Fig. 10(c) and 10(d), and the corresponding transfer and classification losses in Fig. 9(c) and 9(d). In our opinion, the small batch size, chosen due to the limitations of our local machine, also played a key role in the results of CDAN and CDAN+E. Although the results do not match the original paper, we observe that there is a variation in accuracy and loss when the transfer loss is included.
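For completeness, the following is a rough per-batch loss computation for CDAN/CDAN+E following the procedure described above. It reuses the multilinear_map and entropy_weight sketches from earlier sections, uses a gradient reversal layer to realize the minmax of Eq. (9), and is an illustration rather than the original repository code; all names are placeholders.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, -lambda * grad on the
    backward pass, so one backward call plays the minmax game between G and D."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def cdan_step(model, discriminator, xs, ys, xt, lam, use_entropy=True):
    """Rough CDAN / CDAN+E loss for one source batch (xs, ys) and target batch xt."""
    f_s, logits_s = model(xs)
    f_t, logits_t = model(xt)
    class_loss = F.cross_entropy(logits_s, ys)

    # Joint variable h = (f, g) for the concatenated source and target batches.
    g = torch.cat([logits_s, logits_t]).softmax(dim=1)
    f = torch.cat([f_s, f_t])
    h = GradReverse.apply(multilinear_map(f, g), lam)
    d_out = discriminator(h).squeeze(1)
    d_label = torch.cat([torch.ones(len(xs)), torch.zeros(len(xt))]).to(d_out.device)

    if use_entropy:   # CDAN+E: re-weight examples by prediction certainty, Eq. (11)
        w = entropy_weight(g).detach()
        transfer_loss = F.binary_cross_entropy(d_out, d_label,
                                               weight=w / w.sum(), reduction='sum')
    else:             # plain CDAN
        transfer_loss = F.binary_cross_entropy(d_out, d_label)

    # cf. Eq. (13); here lambda enters through the gradient reversal layer
    # rather than as an explicit multiplier on the transfer loss.
    return class_loss + transfer_loss
```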

Figure 9: Classification and transfer loss as a function of epochs for each of the evaluated methods: (a) DeepCORAL, (b) DDC, (c) CDAN, (d) CDAN+E.
Figure 10: Image recognition accuracies for the four evaluated methods on the Amazon to Webcam shift: (a) DeepCORAL, (b) DDC, (c) CDAN, (d) CDAN+E. Red and green lines correspond to the classification accuracy on the source test data; blue and orange lines correspond to the target test data.
TABLE I: Testing accuracy (%) for the different domain adaptation techniques across the six Office-31 domain shifts.

Method        | A → W    | A → D    | W → A    | W → D    | D → A    | D → W
No Adaptation | 43.1±2.5 | 49.2±3.7 | 35.6±0.6 | 94.2±3.1 | 35.4±0.7 | 90.9±2.4
DeepCORAL     | 49.5±2.7 | 40.0±3.3 | 38.3±0.4 | 74.4±4.3 | 38.5±1.5 | 89.1±4.4
DDC           | 41.7±9.1 | –        | –        | –        | –        | –
CDAN          | 44.9±3.3 | 49.5±4.6 | 34.8±2.4 | 93.3±3.4 | 32.9±2.4 | 88.3±3.8
CDAN+E        | 48.7±7.5 | 53.7±4.7 | 35.3±2.7 | 93.6±3.4 | 33.9±2.2 | 87.7±4.0

Conclusions

Based on the methods we evaluated, there are three possible scenarios for accuracy performance when a domain adaptation loss is added: it can increase, decrease or stay the same. In the graphs presented here, accuracy increases when we add a domain adaptation loss (e.g. Amazon $\to$ Webcam); however, we noticed that it can decrease or stay the same for other domain shifts (refer to the repository for the remaining plots). Moreover, recognition accuracies do not go above 70% in most experiments, which means there is a lot of room for improvement. For future work, we could train for 200 epochs (instead of 100) to better understand the behavior of the losses, for instance why they oscillate so much in some cases. Finally, we can conclude that we gained hands-on experience building end-to-end pipelines using deep neural networks for domain adaptation tasks.

References

  • [1] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct 2010.
  • [2] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 1640–1650. [Online]. Available: http://papers.nips.cc/paper/7436-conditional-adversarial-domain-adaptation.pdf
  • [3] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou, Eds.   Cham: Springer International Publishing, 2016, pp. 443–450.
  • [4] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko, “Semi-supervised domain adaptation via minimax entropy,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019, pp. 8049–8057.
  • [5] S. Tan, J. Jiao, and W. Zheng, “Weakly supervised open-set domain adaptation by dual-domain collaboration,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   Los Alamitos, CA, USA: IEEE Computer Society, jun 2019, pp. 5389–5398. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.00554
  • [6] J. Hoffman, E. Tzeng, J. Donahue, Y. Jia, K. Saenko, and T. Darrell, “One-shot adaptation of supervised deep convolutional models,” arXiv preprint arXiv:1312.6204, 2013.
  • [7] T. Wang, X. Zhang, L. Yuan, and J. Feng, “Few-shot adaptive faster r-cnn,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   Los Alamitos, CA, USA: IEEE Computer Society, jun 2019, pp. 7166–7175. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.00734
  • [8] K.-C. Peng, Z. Wu, and J. Ernst, “Zero-shot deep domain adaptation,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds.   Cham: Springer International Publishing, 2018, pp. 793–810.
  • [9] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
  • [10] M. Wulfmeier, A. Bewley, and I. Posner, “Incremental adversarial domain adaptation for continually changing environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 4489–4495.
  • [11] G. O. Jimenez, M. E. Gheche, E. Simou, H. P. Maretic, and P. Frossard, “Cdot: Continuous domain adaptation using optimal transport,” arXiv preprint arXiv:1909.11448, 2019.
  • [12] M. Binkowski, D. Hjelm, and A. Courville, “Batch weight for domain adaptation with mass shift,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [13] X. Xu, X. Zhou, R. Venkatesan, G. Swaminathan, and O. Majumder, “d-sne: Domain adaptation using stochastic neighborhood embedding,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 2492–2501.
  • [14] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Computer Vision – ECCV 2010, K. Daniilidis, P. Maragos, and N. Paragios, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 213–226.
  • [15] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in CVPR 2011, June 2011, pp. 1785–1792.
  • [16] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsupervised domain adaptation,” CoRR, vol. abs/1612.01939, 2016. [Online]. Available: http://arxiv.org/abs/1612.01939
  • [17] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37.   Lille, France: PMLR, 07–09 Jul 2015, pp. 1180–1189. [Online]. Available: http://proceedings.mlr.press/v37/ganin15.html
  • [18] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [19] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [21] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [22] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy minimization,” in Advances in neural information processing systems, 2005, pp. 529–536.