
Towards Stable and Comprehensive Domain Alignment:
Max-Margin Domain-Adversarial Training

Jianfei Yang    Han Zou    Yuxun Zhou    Lihua Xie
Abstract

Domain adaptation tackles the problem of transferring knowledge from a label-rich source domain to a label-scarce or even unlabeled target domain. Recently, domain-adversarial training (DAT) has shown a promising capacity to learn a domain-invariant feature space by reversing the gradient propagation of a domain classifier. However, DAT is still vulnerable in several aspects, including (1) training instability due to the overwhelming discriminative ability of the domain classifier in adversarial training, (2) restrictive feature-level alignment, and (3) lack of interpretability or systematic explanation of the learned feature space. In this paper, we propose a novel Max-margin Domain-Adversarial Training (MDAT) by designing an Adversarial Reconstruction Network (ARN). The proposed MDAT stabilizes the gradient reversing in ARN by replacing the domain classifier with a reconstruction network, and in this manner ARN conducts both feature-level and pixel-level domain alignment without involving extra network structures. Furthermore, ARN demonstrates strong robustness to a wide range of hyper-parameter settings, greatly alleviating the task of model selection. Extensive empirical results validate that our approach outperforms other state-of-the-art domain alignment methods. Moreover, reconstructing adapted features reveals the domain-invariant feature space, which conforms with our intuition.


1 Introduction

Deep neural networks have gained great success on a wide range of tasks such as visual recognition and machine translation (LeCun et al., 2015). They usually require a large amount of labeled data that can be prohibitively expensive to collect, and even with sufficient supervision their performance can still be poor when generalized to a new environment. The problem of discrepancy between the training and testing data distributions is commonly referred to as domain shift or covariate shift (Shimodaira, 2000). To alleviate the effect of such shift, domain adaptation sets out to train a model on a label-rich source domain that generalizes well to an unlabeled target domain (Pan et al., 2010). Domain adaptation has benefited various applications in many practical scenarios, including but not limited to object detection under challenging conditions (Chen et al., 2018) and cost-effective learning that uses only synthetic data to generalize to real-world imagery (Vazquez et al., 2013).

Prevailing methods for unsupervised domain adaptation (UDA) are mostly based on domain alignment, which aims to learn domain-invariant features by reducing the distribution discrepancy between the source and target domains using pre-defined metrics such as maximum mean discrepancy (Gretton et al., 2007, 2012). Recently, Ganin & Lempitsky (2015) proposed to achieve domain alignment by domain-adversarial training (DAT), which reverses the gradients of a domain classifier to maximize domain confusion. Having yielded remarkable performance gains, DAT has been employed in many subsequent UDA methods (Long et al., 2018; Chen et al., 2019; Liu et al., 2019).

Nevertheless, DAT with gradient reversal layers still faces three critical restrictions when applied to practical scenarios. (1) DAT cannot continuously provide effective gradients for learning domain-invariant representations: the binary domain classifier has a high capacity to discriminate the two domains and thus overwhelms adversarial training, which is usually addressed by manually adjusting the weight of the adversarial loss for each specific task (Shu et al., 2018). (2) DAT cannot deal with the pixel-level domain shift that is frequently encountered in visual tasks (Hoffman et al., 2018). (3) The domain-invariant features learned by DAT are justified only by intuition and learning theory (Ben-David et al., 2010) and are difficult to interpret, which impedes the investigation of the underlying mechanism of adversarial domain adaptation.

To overcome the aforementioned difficulties, we propose a novel adversarial approach, namely Max-margin Domain-Adversarial Training (MDAT), to realize stable and comprehensive (i.e. both feature-level and pixel-level) domain alignment. MDAT is built upon a carefully designed Adversarial Reconstruction Network (ARN). Specifically, ARN consists of a shared feature extractor, a label predictor, and a reconstruction network (i.e. decoder) that serves as the domain classifier. MDAT enables an adversarial game between the feature extractor and the decoder: the decoder focuses on reconstructing features from the source domain and pushing target features away from a margin, while the feature extractor aims to fool the decoder by generating target features that can be reconstructed. In this adversarial way, three critical issues are subtly solved: (1) the max-margin loss reduces the discriminative capacity of the domain classifier, balancing and stabilizing domain-adversarial training; (2) without involving any new network structures, MDAT achieves both pixel-level and feature-level domain alignment; (3) reconstructing adapted features to images reveals how the source and target domains are aligned by adversarial training. We evaluate ARN with MDAT on both visual and non-visual UDA benchmarks. It shows a more stable training procedure and achieves significant improvement over DAT on all tasks with pixel-level or higher-level domain shift. We also observe that it is insensitive to the choice of hyper-parameters and as such is favorable for replication in practice. In principle, our approach is generic and can be used to enhance any domain adaptation method that leverages domain alignment as an ingredient.

2 Related Work

Domain adaptation aims to transfer knowledge from one domain to another. Ben-David et al. (2010) provide an upper bound on the test error in the target domain in terms of the source error and the $\mathcal{H}\triangle\mathcal{H}$-distance. As the source error is stationary for a fixed model, the goal of most UDA methods is to minimize the $\mathcal{H}\triangle\mathcal{H}$-distance by reducing some discrepancy metric such as Maximum Mean Discrepancy (MMD) (Tzeng et al., 2014; Long et al., 2015) or CORAL (Sun & Saenko, 2016). Inspired by Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), Ganin & Lempitsky (2015) proposed to learn domain-invariant features by Domain-Adversarial Training (DAT), which has inspired many UDA methods thereafter. For example, Zhang et al. (2019) propose a new divergence for distribution comparison based on minimax optimization, and Wang et al. (2019b) discover that filtering out unrelated source samples helps avoid negative transfer in DAT. Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017) fools a domain discriminator through adversarial training, but not in an end-to-end manner. CyCADA (Hoffman et al., 2018) and PixelDA (Bousmalis et al., 2017) leverage GANs to conduct both feature-level and pixel-level domain adaptation, which yields significant improvement at the cost of high network complexity. Recent works find that DAT deteriorates feature learning, and hence propose to overcome this by generating transferable examples (Liu et al., 2019) or by adding an extra regularizer to retain discriminability (Chen et al., 2019). These approaches can also be directly applied to MDAT for further enhancement.

Another line of approaches relevant to our method uses a reconstruction network (i.e. decoder network), which enables unsupervised image-to-image translation by learning pixel-level features (Zhu et al., 2017). In UDA, Ghifary et al. (2016) employed a decoder network for pixel-level adaptation, and Domain Separation Networks (DSN) (Bousmalis et al., 2016) further leveraged multiple decoder networks to learn domain-specific features. These approaches treat the decoder network as an independent component for augmented feature learning that is irrelevant to domain alignment (Glorot et al., 2011). In this paper, we instead utilize the decoder network as the domain classifier in MDAT, which enables both feature-level and pixel-level domain alignment in a stable and straightforward fashion.

Figure 1: The proposed architecture is composed of a shared feature extractor $G_{e}$ for the two domains, a label predictor $G_{y}$ and a reconstruction network $G_{r}$. In addition to the basic supervised learning in the source domain, our adversarial reconstruction training enables the extractor $G_{e}$ to learn domain-invariant features. Specifically, the network $G_{r}$ aims to reconstruct the source samples $\textbf{x}^{s}$ and to impede the reconstruction of the target samples $\textbf{x}^{t}$, while the extractor $G_{e}$ tries to fool the reconstruction network so that the target samples $\textbf{x}^{t}$ can be reconstructed.

3 Problem Formulation

3.1 Problem Definition and Notations

In unsupervised domain adaptation, we assume that a model works with a labeled dataset $\textbf{X}_{S}$ and an unlabeled dataset $\textbf{X}_{T}$. Let $\textbf{X}_{S}=\{(\textbf{x}^{s}_{i},y^{s}_{i})\}_{i\in[N_{s}]}$ denote the labeled dataset of $N_{s}$ samples from the source domain, where each label $y^{s}_{i}$ belongs to the finite label space $Y=\{1,2,\dots,K\}$. The other dataset $\textbf{X}_{T}=\{\textbf{x}^{t}_{i}\}_{i\in[N_{t}]}$ has $N_{t}$ samples from the target domain but has no labels. We further assume that the two domains have different distributions, i.e. $\textbf{x}^{s}_{i}\sim\mathcal{D}_{S}$ and $\textbf{x}^{t}_{i}\sim\mathcal{D}_{T}$; in other words, there exists some domain shift (Ben-David et al., 2010) between $\mathcal{D}_{S}$ and $\mathcal{D}_{T}$. The ultimate goal is to learn a model that can predict the label $y^{t}_{i}$ given the target input $\textbf{x}^{t}_{i}$.

3.2 Unbalanced Minimax Game in Domain-Adversarial Training

To achieve domain alignment, Domain-Adversarial Training (DAT) sets up a minimax game between a shared feature extractor $F$ for the two domains and a domain classifier $D$. The domain classifier is trained to determine whether the input sample belongs to the source or the target domain, while the feature extractor learns to deceive the domain classifier, which is formulated as:

$$\min_{F}\max_{D}\;\mathbb{E}_{\textbf{x}\sim\mathcal{D}_{S}}[\ln D(F(\textbf{x}))]+\mathbb{E}_{\textbf{x}\sim\mathcal{D}_{T}}[\ln(1-D(F(\textbf{x})))].\qquad(1)$$
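In practice, the objective in Eq. 1 is typically implemented with a gradient reversal layer (GRL) that leaves the forward pass unchanged and flips the sign of the gradient flowing from the domain classifier back into the feature extractor. The following PyTorch snippet is a generic, minimal sketch of such a layer rather than the exact implementation used in this paper; the scaling factor `lam` is an assumed hyper-parameter.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient so the feature extractor is trained
        # to maximize the domain classification loss.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```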

We usually use a Convolutional Neural Network (CNN) as the feature extractor and fully connected layers (FC) as the domain classifier. Theoretically, DAT reduces the cross-domain discrepancy and helps learn domain-invariant representations (Ganin et al., 2016). However, the training of DAT is rather unstable: without sophisticated tuning of the hyper-parameters, DAT often fails to converge. Through empirical experiments, we observe that such instability is due to the imbalanced adversarial game between $D$ and $F$. The binary domain classifier $D$ can easily reach convergence with very high accuracy at an early training epoch, while it is much harder for the feature extractor $F$ to fool the domain classifier and simultaneously perform well on the source domain. Hence, there is a strong likelihood that the domain classifier overwhelms DAT, and the common remedy is to weaken the training of $D$ by tuning the hyper-parameters according to the task at hand. In our method, we instead restrict the capacity of the domain classifier so as to form a minimax game in a harmonious manner. Inspired by the max-margin loss (i.e. hinge loss) in Support Vector Machines (SVM) (Cristianini et al., 2000), if we push the source and target domains apart only up to a margin rather than as far as possible, the task of $F$ fooling $D$ becomes much easier. For a binary domain classifier, we define the margin loss as

$$\mathcal{L}_{mg}(y)=[m-t\cdot y]^{+},\qquad(2)$$

where $y$ is the predicted domain label, $[\cdot]^{+}$ denotes $\max(0,\cdot)$, $m$ is a positive margin, and $t$ is the ground-truth domain label (we take $t=-1$ for the source domain and $t=1$ for the target domain). We now introduce our MDAT scheme based on an innovative network architecture.
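For concreteness, the margin loss of Eq. 2 can be written in a few lines of PyTorch; this is only a sketch, and the batch-mean reduction is our choice rather than part of the formulation above.

```python
import torch.nn.functional as F

def margin_loss(y, t, m=1.0):
    """Eq. 2: max(0, m - t * y) with t in {-1, +1} (source / target) and margin m > 0."""
    return F.relu(m - t * y).mean()
```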

3.3 Max-margin Domain-Adversarial Training

Besides the training instability issue, DAT also suffers from restrictive feature-level alignment – lack of pixel-level alignment. To realize stable and comprehensive domain alignment together, we first propose an Adversarial Reconstruction Network (ARN) and then elaborate MDAT.

As depicted in Figure 1, our model consists of three parts: a shared feature extractor $G_{e}$ for both domains, a label predictor $G_{y}$ and a reconstruction network $G_{r}$. Let the feature extractor $G_{e}(\textbf{x};\theta_{e})$ be a function parameterized by $\theta_{e}$ that maps an input sample $\textbf{x}$ to a deep embedding $\textbf{z}$. Let the label predictor $G_{y}(\textbf{z};\theta_{y})$ be a task-specific function parameterized by $\theta_{y}$ that maps an embedding $\textbf{z}$ to a task-specific prediction $\hat{y}$. The reconstruction network $G_{r}(\textbf{z};\theta_{r})$ is a decoding function parameterized by $\theta_{r}$ that maps an embedding $\textbf{z}$ to its corresponding reconstruction $\hat{\textbf{x}}$.
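For illustration, a minimal PyTorch sketch of the three components is given below. The layer sizes are our own assumptions for 32x32 RGB digit images; the architectures actually used in the experiments follow Ganin & Lempitsky (2015).

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):          # G_e: x -> z
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):
        return self.net(x)

class LabelPredictor(nn.Module):            # G_y: z -> class logits
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, z):
        return self.net(z)

class Decoder(nn.Module):                   # G_r: z -> x_hat, roughly the inverse of G_e
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 5, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 5, padding=2))

    def forward(self, z):
        return self.net(z)
```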

The first learning objective for the feature extractor $G_{e}$ and the label predictor $G_{y}$ is to perform well in the source domain. For a supervised $K$-way classification problem, this is simply achieved by minimizing the negative log-likelihood of the ground-truth class for each sample:

$$\mathcal{L}_{task}=\sum_{i=1}^{N_{s}}\mathcal{L}_{y}(\textbf{x}_{i}^{s},\textbf{y}_{i}^{s})=-\sum_{i=1}^{N_{s}}\textbf{y}^{s}_{i}\cdot\log G_{y}(G_{e}(\textbf{x}_{i}^{s};\theta_{e});\theta_{y}),\qquad(3)$$

where $\textbf{y}^{s}_{i}$ is the one-hot encoding of the class label $y^{s}_{i}$ and the logarithm is applied to the softmax predictions of the model.

The second objective is to make the learned features domain-invariant. This is motivated by the covariate shift assumption (Shimodaira, 2000), which indicates that if the feature distributions $S(\textbf{z})=\{G_{e}(\textbf{x};\theta_{e})\,|\,\textbf{x}\sim\mathcal{D}_{S}\}$ and $T(\textbf{z})=\{G_{e}(\textbf{x};\theta_{e})\,|\,\textbf{x}\sim\mathcal{D}_{T}\}$ are similar, the source label predictor $G_{y}$ can achieve similar accuracy in the target domain. To this end, we design a decoder network $G_{r}$ that serves as the domain classifier. In MDAT, we train the decoder $G_{r}$ to reconstruct only the features of the source domain and to push the features of the target domain away from a margin. In this way, the decoder has the functionality of distinguishing the source domain from the target domain. The objective of training $G_{r}$ is formulated as

$$\min_{\theta_{r}}\sum_{i=1}^{N_{s}+N_{t}}\mathcal{L}_{mg}(\mathcal{L}_{r}(\textbf{x}_{i}))=\min_{\theta_{r}}\sum_{i=1}^{N_{s}}\mathcal{L}_{r}(\textbf{x}_{i}^{s})+\sum_{j=1}^{N_{t}}[m-\mathcal{L}_{r}(\textbf{x}^{t}_{j})]^{+},\qquad(4)$$

where $m$ is a positive margin and $\mathcal{L}_{r}(\cdot)$ is the mean squared error (MSE) term for the reconstruction loss, defined as

$$\mathcal{L}_{r}(\textbf{x})=\|G_{r}(G_{e}(\textbf{x};\theta_{e});\theta_{r})-\textbf{x}\|^{2}_{2},\qquad(5)$$

where $\|\cdot\|^{2}_{2}$ denotes the squared $L_{2}$-norm. Compared with a normal binary domain classifier (e.g. fully connected layers), the decoder network acts as a smoother domain discriminator: it separates the two domains only up to a specific margin rather than as far as possible.
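A hedged PyTorch sketch of the decoder objective in Eq. 4 is shown below; the mini-batch MSE reduction and the detach calls (so that only $\theta_{r}$ receives gradients from this loss) are our implementation choices.

```python
import torch.nn.functional as F

def decoder_objective(x_s, x_t, G_e, G_r, m=5.0):
    """Eq. 4: reconstruct source samples and push the reconstruction error
    of target samples beyond the margin m. Only G_r is updated with this loss."""
    z_s = G_e(x_s).detach()   # stop gradients into the feature extractor
    z_t = G_e(x_t).detach()
    loss_src = F.mse_loss(G_r(z_s), x_s)
    loss_tgt = F.relu(m - F.mse_loss(G_r(z_t), x_t))
    return loss_src + loss_tgt
```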

Conversely, to form an adversarial game, the feature extractor $G_{e}$ learns to deceive $G_{r}$ such that the learned target features are indistinguishable from the source ones, which is formulated as:

$$\min_{\theta_{e}}\sum_{j=1}^{N_{t}}\mathcal{L}_{r}(\textbf{x}_{j}^{t}).\qquad(6)$$

Then the whole learning procedure of ARN with MDAT can be formulated by:

$$\min_{\theta_{e},\theta_{y}}\sum_{i=1}^{N_{s}}\mathcal{L}_{y}(\textbf{x}_{i}^{s},\textbf{y}_{i}^{s})+\alpha\sum_{j=1}^{N_{t}}\mathcal{L}_{r}(\textbf{x}_{j}^{t}),\qquad(7)$$
$$\min_{\theta_{r}}\sum_{i=1}^{N_{s}}\mathcal{L}_{r}(\textbf{x}_{i}^{s})+\sum_{j=1}^{N_{t}}[m-\mathcal{L}_{r}(\textbf{x}^{t}_{j})]^{+},\qquad(8)$$

where $\mathcal{L}_{y}$ denotes the negative log-likelihood of the ground-truth class for the labeled sample $(\textbf{x}_{i}^{s},y_{i}^{s})$ and $\alpha$ controls the interaction of the loss terms. In the following section, we derive an optimal solution of MDAT and provide theoretical justification of how MDAT reduces the distribution discrepancy for UDA.
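Putting the pieces together, one alternating update of MDAT could be sketched as follows. This is a simplified PyTorch illustration under the assumption of two separate SGD-style optimizers (one for $\theta_{e},\theta_{y}$ and one for $\theta_{r}$); batching and reduction details are ours.

```python
import torch.nn.functional as F

def mdat_step(x_s, y_s, x_t, G_e, G_y, G_r, opt_ey, opt_r, alpha=0.02, m=5.0):
    # Eq. 7: update (theta_e, theta_y) -- source classification plus fooling G_r on target.
    loss_task = F.cross_entropy(G_y(G_e(x_s)), y_s)
    loss_fool = F.mse_loss(G_r(G_e(x_t)), x_t)     # make target features reconstructable
    opt_ey.zero_grad()
    (loss_task + alpha * loss_fool).backward()
    opt_ey.step()

    # Eq. 8: update theta_r -- reconstruct source, push target beyond the margin m.
    z_s, z_t = G_e(x_s).detach(), G_e(x_t).detach()
    loss_r = F.mse_loss(G_r(z_s), x_s) + F.relu(m - F.mse_loss(G_r(z_t), x_t))
    opt_r.zero_grad()
    loss_r.backward()
    opt_r.step()
    return loss_task.item(), loss_fool.item(), loss_r.item()
```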

3.4 Optimal Solution of MDAT

Considering the adversarial game between a reconstruction network $R$ and a feature extractor $E$ (i.e. $G_{r}$ and $G_{e}$ in our network, respectively), we prove that if the feature extractor $E$ maps both the source domain $x^{s}\sim P_{S}(x^{s})$ and the target domain $x^{t}\sim P_{T}(x^{t})$ to a common feature space $\mathcal{Z}=\{z=E(x)\,|\,z\sim P_{z}(z)\}$, the MDAT system reaches a Nash equilibrium. This theoretically explains how MDAT enables the feature extractor to learn domain-invariant features. Similar to EBGAN (Zhao et al., 2017), we assume $E$ and $R$ have infinite capacity, and with a slight abuse of notation we let $R(\cdot)$ denote the reconstruction MSE of the reconstruction network. We first define two objectives:

$$V(E,R)=\int_{x^{s},x^{t}}\mathcal{L}_{mg}(x^{s},x^{t})P_{S}(x^{s})P_{T}(x^{t})\,dx^{s}dx^{t}\qquad(9)$$
$$U(E,R)=\int_{x^{t}}\mathcal{L}_{r}(x^{t})P_{T}(x^{t})\,dx^{t}\qquad(10)$$

In MDAT, we train the reconstruction network $R$ to minimize the quantity $V(E,R)$ and train the feature extractor $E$ to minimize the quantity $U(E,R)$, in line with Eqs. 7 and 8. A Nash equilibrium of our system is a pair $(E^{*},R^{*})$ that satisfies:

$$V(E^{*},R^{*})\leq V(E^{*},R)\quad\forall R\qquad(11)$$
$$U(E^{*},R^{*})\leq U(E,R^{*})\quad\forall E\qquad(12)$$
Theorem 3.1

If a feature extractor $E$ maps both the source domain $x^{s}\sim P_{S}(x^{s})$ and the target domain $x^{t}\sim P_{T}(x^{t})$ to a common feature space $\mathcal{Z}=\{z=E(x)\,|\,z\sim P_{z}(z)\}$, the system reaches a Nash equilibrium $(E^{*},R^{*})$ with $V(E^{*},R^{*})=m$.

Proof. We first prove Eq.11:

$$V(E^{*},R)=\int_{x^{s}}P_{S}(x^{s})R(E^{*}(x^{s}))\,dx^{s}+\int_{x^{t}}P_{T}(x^{t})\big[m-R\big(E^{*}(x^{t})\big)\big]^{+}dx^{t}\qquad(13)$$
$$=\int_{z}\Big(P_{z}(z)R(z)+P_{z}(z)[m-R(z)]^{+}\Big)dz\qquad(14)$$

Since $f(x)=x+[m-x]^{+}$ is non-decreasing on $[0,+\infty)$ and attains its minimum value $m$ on $[0,m]$, $V(E^{*},R)$ reaches its minimum $m$ when, in particular, $R^{*}(z)=0$:

$$V(E^{*},R^{*})=m\int_{z}P_{z}(z)\,dz=m\qquad(15)$$

When $R^{*}(z)=0$, we expand $U(E,R^{*})$ in Eq. 12:

$$U(E,R^{*})=\int_{x^{t}}R^{*}\big(E(x^{t})\big)P_{T}(x^{t})\,dx^{t}\qquad(16)$$
$$=\int_{z}R^{*}(z)P_{z}(z)\,dz=0\qquad(17)$$

As $U(E,R^{*})\geq 0$ for any $E$, we get $U(E^{*},R^{*})\leq U(E,R^{*})$.

It can be easily observed that the optimal solution of MDAT is a Nash equilibrium when the feature extractor maps two domains into a common feature space, i.e. aligning the distributions in the feature space.

3.5 Connection to Domain Adaptation Theories

We further investigate how the proposed method connects to the learning theory of domain adaptation. The rationale behind domain alignment is motivated by the learning theory of the non-conservative domain adaptation problem of Ben-David et al. (2010):

Theorem 3.2

Let $\mathcal{H}$ be the hypothesis space with $h\in\mathcal{H}$. Let $(\mathcal{D}_{S},\epsilon_{s})$ and $(\mathcal{D}_{T},\epsilon_{t})$ be the two domains and their corresponding generalization error functions. The expected error for the target domain is upper bounded by

$$\epsilon_{t}(h)\leq\epsilon_{s}(h)+\frac{1}{2}d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})+\lambda,\quad\forall h\in\mathcal{H},\qquad(18)$$

where the ideal risk $\lambda=\min_{h\in\mathcal{H}}[\epsilon_{s}(h)+\epsilon_{t}(h)]$ and $d_{\mathcal{H}\triangle\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})=2\sup_{h_{1},h_{2}\in\mathcal{H}}|\Pr_{x\sim\mathcal{D}_{S}}[h_{1}(x)\neq h_{2}(x)]-\Pr_{x\sim\mathcal{D}_{T}}[h_{1}(x)\neq h_{2}(x)]|$.

Theoretically, when we minimize the $\mathcal{H}\triangle\mathcal{H}$-distance, the upper bound of the expected error for the target domain is reduced accordingly. As derived in DAT (Ganin & Lempitsky, 2015), assuming a family of domain classifiers $\mathcal{H}_{d}$ that is rich enough to contain the symmetric difference hypothesis set of $\mathcal{H}_{p}$, i.e. $\mathcal{H}_{p}\triangle\mathcal{H}_{p}=\{h\,|\,h=h_{1}\oplus h_{2},\ h_{1},h_{2}\in\mathcal{H}_{p}\}$ where $\oplus$ is the XOR function, the empirical $\mathcal{H}_{p}\triangle\mathcal{H}_{p}$-distance has an upper bound w.r.t. the optimal domain classifier $h$:

$$d_{\mathcal{H}_{p}\triangle\mathcal{H}_{p}}(\hat{\mathcal{D}}_{S},\hat{\mathcal{D}}_{T})\leq 2\sup_{h\in\mathcal{H}_{d}}\big|\Pr_{\textbf{z}\sim\hat{\mathcal{D}}_{S}}[h(\textbf{z})=0]+\Pr_{\textbf{z}\sim\hat{\mathcal{D}}_{T}}[h(\textbf{z})=1]-1\big|,\qquad(19)$$

where $\hat{\mathcal{D}}_{S}$ and $\hat{\mathcal{D}}_{T}$ denote the distributions of the source and target feature spaces $\mathcal{Z}_{S}$ and $\mathcal{Z}_{T}$, respectively. Note that the MSE of $G_{r}$ composed with a ceiling function is a form of domain classifier $h(\textbf{z})$, i.e. $\lceil[m-\mathcal{L}_{r}(\cdot)]^{+}-0.5\rceil$ for $m=1$: it maps source samples to one domain label and target samples to the other, which is exactly the form of the classifier in the upper bound of Eq. 19. Hence, our reconstruction network $G_{r}$ maximizes the domain discrepancy with a margin, and the feature extractor learns to minimize it adversarially.

Table 1: We compare with general, statistics-based (S), reconstruction-based (R) and adversarial-based (A) domain adaptation approaches. We repeat each experiment 3 times and report the average and standard deviation (std) of the test accuracy in the target domain.
Source MNIST USPS SVHN SYN
Target USPS MNIST MNIST SVHN
Source-Only model 78.2 63.4 54.9 86.7
Train on target 96.5 99.4 99.4 91.3
[S] MMD (Long et al., 2015) 81.1 - 71.1 88.0
[S] CORAL (Sun & Saenko, 2016) 80.7 - 63.1 85.2
[R] DRCN (Ghifary et al., 2016) 91.8 73.7 82.0 87.5
[R] DSN (Bousmalis et al., 2016) 91.3 - 82.7 91.2
[A] DANN (DAT) (Ganin et al., 2016) 85.1 73.0 74.7 90.3
[A] ADDA (Tzeng et al., 2017) 89.4 90.1 76.0 -
[A] CDAN (Long et al., 2018) 93.9 96.9 88.5 -
[A] CyCADA (Hoffman et al., 2018) 95.6 96.5 90.4 -
[A] BSP+DANN (Chen et al., 2019) 94.5 97.7 89.4 -
[A] MCD (Saito et al., 2018) 96.5 94.1 96.2 -
[A] CADA (Zou et al., 2019) 96.4 97.0 90.9 -
ARN w.o. MDAT 93.1±0.3 76.5±1.2 67.4±0.9 86.8±0.5
ARN with MDAT (proposed) 98.6±0.3 98.4±0.1 97.4±0.3 92.0±0.2

3.6 Discussions

Compared with conventional DAT-based methods, which are usually built on a binary logistic network (Ganin & Lempitsky, 2015), the proposed ARN with MDAT offers several new merits, both conceptually and theoretically:

(1) Effective gradients and balanced adversarial training. Because the decoder serves as the domain classifier and the margin loss restrains its otherwise overwhelming capacity, the adversarial game can continuously provide effective gradients for training the feature extractor, leading to better alignment and more balanced adversarial training. Moreover, through the experiments in Section 4, we find that our method shows a more stable training procedure and strong robustness to the hyper-parameters $\alpha$ and $m$, greatly alleviating parameter tuning for model selection.

(2) Richer information for comprehensive domain alignment. Unlike typical DAT, which exploits only a small amount of domain information (binary domain labels), MDAT uses the reconstruction network as the domain classifier, which captures more domain-specific and pixel-level features during unsupervised reconstruction (Bousmalis et al., 2016). Therefore, MDAT further addresses pixel-level domain shift in addition to feature-level shift, leading to comprehensive domain alignment in a straightforward manner.

(3) Feature interpretability for method validation. MDAT allows us to visualize the features by directly reconstructing target features to images by the decoder network. It is crucial to understand to what extent the features are aligned since this helps to reveal the underlying mechanism of adversarial domain adaptation. We interpret these adapted features in Section 4.3.

Table 2: Comparisons on WiFi gesture recognition.
Method Accuracy (%)
Source-only 58.4±0.7
[S] MMD (Long et al., 2015) 61.2±0.5
[R] DRCN (Ghifary et al., 2016) 69.3±0.3
[A] DANN (DAT) (Ganin & Lempitsky, 2015) 68.2±0.2
[A] ADDA (Tzeng et al., 2017) 71.5±0.3
[A] CADA (Zou et al., 2019) 88.8±0.1
ARN with MDAT 91.3±0.2

4 Experiment

We evaluate the proposed approach on several visual and non-visual UDA tasks with varying degrees of domain shift. Detailed analyses are then conducted on a toy dataset, parameter sensitivity, gradients, and feature visualization. Dataset descriptions and implementation details are provided in the supplementary materials.

4.1 Setup

Digits (Ganin et al., 2016). We utilize four digit datasets, MNIST, USPS, SVHN and Synthetic Digits (SYN), that form four transfer tasks: MNIST→USPS, USPS→MNIST, SVHN→MNIST and SYN→SVHN.

Office-Home (Venkateswara et al., 2017) is a challenging UDA dataset including 15,500 images from 65 categories. It comprises four extremely distinct domains: Artistic images (Ar), ClipArt (Cl), Product images (Pr), and Real-World images (Rw). We evaluate on all twelve transfer tasks.

WiFi Gesture Recognition (Zou et al., 2019) consists of six gestures recorded by Channel State Information (CSI) (Xie et al., 2018). Each CSI sample is a 2D matrix that depicts the gesture with the surrounding layout environment. Thus, the CSI data collected in two environments forms two domains, which formulates a spatial adaptation problem.

We compare with state-of-the-art UDA methods that perform three ways of domain alignment. Specifically, MMD regularization (Long et al., 2015) and CORAL (Sun & Saenko, 2016) are based on statistical distribution matching. DRCN (Ghifary et al., 2016) and DSN (Bousmalis et al., 2016) use the reconstruction network for UDA, while more prevailing UDA methods adopt domain-adversarial training including DANN (Ganin & Lempitsky, 2015), ADDA (Tzeng et al., 2017), CyCADA (Hoffman et al., 2018), CDAN (Long et al., 2018), MCD (Saito et al., 2018), CADA (Zou et al., 2019), TransNorm (Wang et al., 2019a) and BSP (Chen et al., 2019). The baseline results are reported from their original papers where available.

We implement our model in PyTorch. For the digit datasets, we follow the protocol of Hoffman et al. (2018) and the network architecture of Ganin & Lempitsky (2015). For Office-Home, we adopt ResNet-50 pretrained on ImageNet as our backbone. Following the standard protocol of Long et al. (2015), we employ all labeled source samples and all unlabeled target samples for training. For the WiFi Gesture Recognition data, we employ the modified LeNet and the standard protocol in Zou et al. (2019). The design of $G_{r}$ is the inverse of $G_{e}$, with pooling operations replaced by upsampling. We fix $\alpha=0.02$ and $m=5$ in all experiments; these values were obtained on SVHN→MNIST by Bayesian optimization (Malkomes et al., 2016). We adopt a mini-batch SGD optimizer with momentum 0.9 and the progressive training strategy of DANN (Ganin & Lempitsky, 2015).
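For reference, a possible optimizer setup consistent with the description above is sketched below. The learning-rate annealing constants follow the progressive schedule reported in DANN (Ganin & Lempitsky, 2015) and are assumptions rather than values taken from this paper.

```python
import itertools
import torch

def build_optimizers(G_e, G_y, G_r, lr=0.01, momentum=0.9):
    """SGD with momentum for the two parameter groups updated by Eqs. 7 and 8."""
    opt_ey = torch.optim.SGD(itertools.chain(G_e.parameters(), G_y.parameters()),
                             lr=lr, momentum=momentum)
    opt_r = torch.optim.SGD(G_r.parameters(), lr=lr, momentum=momentum)
    return opt_ey, opt_r

def progressive_lr(p, mu0=0.01, a=10.0, b=0.75):
    """DANN-style annealing, where p in [0, 1] is the training progress."""
    return mu0 / (1.0 + a * p) ** b
```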

4.2 Overall Results

The classification accuracies on Digits are shown in Table 1. Our method outperforms all other methods on the four transfer tasks. Specifically, for SVHN→MNIST, where severe pixel-level domain shift exists, our method improves DANN by 22.7%, which justifies the efficacy of ARN for addressing pixel-level shift. Our method also performs well when the target domain is quite small, achieving 98.6% accuracy on MNIST→USPS. In Table 2, our method improves the source-only model by 32.9% on the WiFi spatial adaptation problem, which indicates that MDAT is also helpful for non-visual domain adaptation. Table 3 shows the performance on the larger-scale Office-Home dataset, where MDAT again yields better performance than the other domain alignment approaches.

Table 3: Accuracy (mean) of unsupervised domain adaptation on Office-Home datasets across 3 independent runs.
Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
ResNet-50 (He et al., 2016) 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN (Long et al., 2015) 43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3
DANN (DAT) (Ganin & Lempitsky, 2015) 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
TransNorm+DANN (Wang et al., 2019a) 43.5 60.9 72.1 51.0 61.5 62.5 49.6 46.8 70.4 63.7 52.2 77.9 59.3
CDAN (Long et al., 2018) 49.0 69.3 74.5 54.4 66.0 68.4 55.6 48.3 75.9 68.4 55.4 80.5 63.8
ARN with MDAT (Proposed) 51.3 69.7 76.2 59.5 68.3 70.0 57.2 48.9 75.8 69.1 55.3 80.6 65.2
(a) Source-only Model
(b) DANN
(c) MDAT
Figure 2: The inter-twining moons toy problem. Red and green dots indicate source samples, while blue dots are target samples. Black lines indicate the changes of the decision boundaries during 10 epochs of training.
(a) Convergence
(b) Test Accuracy
Figure 3: The training procedure w.r.t. loss and test accuracy. $\mathcal{L}_{e}$ is the training loss for reconstructing target samples in Eq. 6. $\mathcal{L}_{r}$ is the training loss of the reconstruction network in Eq. 4. $\mathcal{L}_{d}$ is the domain loss in DAT (Ganin & Lempitsky, 2015). $\alpha$ is the penalty weight on $\mathcal{L}_{e}$ and $\mathcal{L}_{d}$ in MDAT and DAT, respectively.

4.3 Analyses

Ablation study. To verify the contributions of the reconstruction network $G_{r}$ and of MDAT, we discard the term involving $\mathcal{L}_{r}(\textbf{x}^{t})$ in Eq. 4 and evaluate the resulting method, denoted as ARN w.o. MDAT in Table 1. Comparing it with the source-only model, we can infer the improvement brought by reconstructing target samples: ARN w.o. MDAT improves tasks with low-level domain shift such as MNIST↔USPS, which conforms with our discussion that unsupervised reconstruction is instrumental in learning pixel-level features. Comparing ARN w.o. MDAT with the full ARN, we can infer the contribution of MDAT. Table 1 shows that MDAT achieves an impressive margin of improvement: for USPS→MNIST and SVHN→MNIST, MDAT improves ARN w.o. MDAT by around 30%. This demonstrates that MDAT, which helps learn domain-invariant features, is the main reason for the tremendous improvement.

Toy dataset. We study the behavior of MDAT on a variant of the inter-twining moons 2D problem, where the target samples are rotated $30^{\circ}$ with respect to the source samples. 300 samples are generated for each domain using scikit-learn (Pedregosa et al., 2011). The adaptation ability is investigated by comparing MDAT with DANN and the source-only model. As shown in Figure 2, we visualize the changing decision boundaries over 10 epochs of training. In Figure 2(a), the model overfits the source domain and the decision boundary does not change. In Figures 2(b) and 2(c), both DANN and MDAT adapt the boundaries to the target samples, but MDAT shows faster and better adaptation within the 10 epochs. Together with the training procedure of SVHN→MNIST in Figure 3(a), this justifies that MDAT provides more effective gradients for better adaptation performance.
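The toy data can be generated along the following lines; this is a sketch using scikit-learn, and the noise level and random seeds are our assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons

def make_rotated_moons(n=300, angle_deg=30.0, noise=0.1, seed=0):
    """Source: two inter-twining moons; target: the same distribution rotated by angle_deg."""
    x_s, y_s = make_moons(n_samples=n, noise=noise, random_state=seed)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    x_t, _ = make_moons(n_samples=n, noise=noise, random_state=seed + 1)
    return x_s, y_s, x_t @ rot.T
```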

Table 4: Accuracy (%) on SVHN→MNIST.
α 0.01 0.03 0.07 0.1 0.2 0.3 0.5 1.0
DANN 71.1 74.1 72.7 74.1 74.7 9.6 9.7 10.3
ARN (m=1) 95.7 95.9 93.3 93.2 80.1 75.3 73.1 67.5
m 0 0.1 0.3 0.5 1.0 2.0 5.0 10.0
ARN (α=0.02) 64.3 64.5 75.2 90.0 96.0 97.4 97.7 96.7

Gradients and stability analysis. We further study the training procedure of MDAT on SVHN→MNIST w.r.t. loss and target accuracy in Figures 3(a) and 3(b), respectively. In Figure 3(a), ARN has a steadily decreasing loss ($\mathcal{L}_{r}$) for all $\alpha$, whereas the domain loss in DAT ($\mathcal{L}_{d}$) becomes extremely small at the very beginning. These observations conform with our intuition: the domain classifier in DAT is too strong and impedes the adversarial training, while MDAT provides more effective gradients for training the feature extractor by restricting the capacity of the domain classifier. With effective gradients, the adversarial game is more balanced, which is validated in Figure 3(b), where the test accuracy of ARN is more stable than that of DAT across training epochs.

Table 5: Visualization of the source images, target images and reconstructed target images (R-Target Images) for the four digit adaptation tasks: MNIST→USPS, USPS→MNIST, SVHN→MNIST and SYN→SVHN.
(a) Source-only Model
(b) DANN
(c) ARN
Figure 4: T-SNE visualization of features on SVHN→MNIST with the corresponding domain labels (red: target; blue: source) and category labels (10 classes) shown in the left and right subfigures, respectively.

Parameter sensitivity. We investigate the sensitivity of $\alpha$ and $m$ on SVHN→MNIST. In Table 4, the results show that ARN achieves good performance for $\alpha\in[0.01,0.1]$, and even with larger $\alpha$ ARN still converges. In comparison, with $\alpha$ denoting the weight of the adversarial domain loss ($\mathcal{L}_{d}$), DANN cannot converge when $\alpha>0.2$ due to the imbalanced adversarial game between the overwhelming domain classifier and the feature extractor. Regarding the sensitivity of $m$, the accuracy of ARN reaches at least 96.0% for $m\geq 1$; as discussed in Section 3.5, with $m\geq 1$ the decoder serves as a domain classifier. These analyses validate that ARN is far less sensitive to the hyper-parameters than DANN, and even in the worst cases ARN still converges.

T-SNE embeddings. We analyze the performance of domain alignment for DANN (DAT) (Ganin & Lempitsky, 2015) and ARN (MDAT) by plotting the T-SNE embeddings of the features $\textbf{z}$ on the task SVHN→MNIST. In Figure 4(a), the source-only model obtains diverse embeddings for each category but the domains are not aligned. In Figure 4(b), DANN aligns the two domains but the decision boundaries of the classifier are vague. In Figure 4(c), the proposed ARN effectively aligns the two domains for all categories and the classifier boundaries are much clearer.
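The embeddings themselves can be produced with a standard t-SNE call, e.g. as in the sketch below, assuming z_s and z_t are the flattened source and target features extracted by $G_{e}$.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(z_s, z_t, perplexity=30):
    """Project the concatenated source/target features to 2D for visualization."""
    z = np.concatenate([z_s, z_t], axis=0)
    return TSNE(n_components=2, perplexity=perplexity, init='pca').fit_transform(z)
```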

Interpreting adapted features via reconstruction. One key advantage of ARN is that by visualizing the reconstructed target images we can infer how the features become domain-invariant. We reconstruct the MDAT features of the target domain and visualize them in Table 5. It is observed that the target features are reconstructed to source-like images by the decoder $G_{r}$. As discussed before, MDAT intuitively forces the target features to mimic the source feature distribution, which conforms with this visualization. Similar to image-to-image translation, this indicates that our method conducts an implicit feature-to-feature translation that transfers target features to source-like features, and hence the features are domain-invariant.

5 Conclusion

We proposed a new domain alignment approach, namely Max-margin Domain-adversarial Training (MDAT), and an MDAT-based deep neural network for unsupervised domain adaptation. The proposed method offers effective and stable gradients for feature learning via an adversarial game between the feature extractor and the reconstruction network. The theoretical analysis justifies how it minimizes the distribution discrepancy. Extensive experiments demonstrate the effectiveness of our method, and we further interpret the learned features by visualization, which conforms with our insight. Evaluation on semi-supervised learning constitutes our future work.

References

  • Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • Bousmalis et al. (2016) Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., and Erhan, D. Domain separation networks. In Advances in Neural Information Processing Systems, pp. 343–351, 2016.
  • Bousmalis et al. (2017) Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and Krishnan, D. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3722–3731, 2017.
  • Chen et al. (2019) Chen, X., Wang, S., Long, M., and Wang, J. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International Conference on Machine Learning, pp. 1081–1090, 2019.
  • Chen et al. (2018) Chen, Y., Li, W., Sakaridis, C., Dai, D., and Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3339–3348, 2018.
  • Cristianini et al. (2000) Cristianini, N., Shawe-Taylor, J., et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
  • Ganin & Lempitsky (2015) Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp.  1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • Ghifary et al. (2016) Ghifary, M., Kleijn, W. B., Zhang, M., Balduzzi, D., and Li, W. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pp.  597–613. Springer, 2016.
  • Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.  513–520, 2011.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Gretton et al. (2007) Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.
  • Gretton et al. (2012) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pp. 1205–1213, 2012.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hoffman et al. (2018) Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, pp.  1989–1998, 2018.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
  • Liu et al. (2019) Liu, H., Long, M., Wang, J., and Jordan, M. Transferable adversarial training: A general approach to adapting deep classifiers. In International Conference on Machine Learning, pp. 4013–4022, 2019.
  • Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 97–105, 2015.
  • Long et al. (2018) Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, 2018.
  • Malkomes et al. (2016) Malkomes, G., Schaff, C., and Garnett, R. Bayesian optimization for automated model selection. In Advances in Neural Information Processing Systems, pp. 2900–2908, 2016.
  • Pan et al. (2010) Pan, S. J., Yang, Q., et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • Saito et al. (2018) Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3723–3732, 2018.
  • Shimodaira (2000) Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
  • Shu et al. (2018) Shu, R., Bui, H. H., Narui, H., and Ermon, S. A dirt-t approach to unsupervised domain adaptation. In Proc. 6th International Conference on Learning Representations, 2018.
  • Sun & Saenko (2016) Sun, B. and Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp.  443–450. Springer, 2016.
  • Tzeng et al. (2014) Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014. URL http://arxiv.org/abs/1412.3474.
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, pp.  4, 2017.
  • Vazquez et al. (2013) Vazquez, D., Lopez, A. M., Marin, J., Ponsa, D., and Geronimo, D. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2013.
  • Venkateswara et al. (2017) Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proc. CVPR, pp.  5018–5027, 2017.
  • Wang et al. (2019a) Wang, X., Jin, Y., Long, M., Wang, J., and Jordan, M. I. Transferable normalization: Towards improving transferability of deep neural networks. In Advances in Neural Information Processing Systems, pp. 1951–1961, 2019a.
  • Wang et al. (2019b) Wang, Z., Dai, Z., Poczos, B., and Carbonell, J. Characterizing and avoiding negative transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019b.
  • Xie et al. (2018) Xie, Y., Li, Z., and Li, M. Precise power delay profiling with commodity wi-fi. IEEE Transactions on Mobile Computing, 18(6):1342–1355, 2018.
  • Zhang et al. (2019) Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pp. 7404–7413, 2019.
  • Zhao et al. (2017) Zhao, J., Mathieu, M., and LeCun, Y. Energy-based generative adversarial network. In Proc. 5th International Conference on Learning Representations, 2017.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp.  2223–2232, 2017.
  • Zou et al. (2019) Zou, H., Zhou, Y., Yang, J., Liu, H., Das, H. P., and Spanos, C. J. Consensus adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  5997–6004, 2019.

Appendix

Hyperparameters. For all tasks, we simply use the same hyperparameters chosen from the sensitivity analysis, $\alpha=0.02$ and $m=5.0$; we reckon that better results could be obtained by tuning the hyperparameters for specific tasks.

Sensitivity

We have presented all the results of the sensitivity study in Section 4.3, and here we show the corresponding training procedures in Figures 5(a) and 5(b). It is observed that the accuracy increases when $\alpha$ drops or when the margin $m$ increases. The reason is simple: (1) when $\alpha$ is too large, it interferes with the supervised training on the source domain; (2) when the margin $m$ is small, the divergence between the source and target domains (i.e. the $\mathcal{H}\triangle\mathcal{H}$-distance) cannot be measured well.

Visualization

Here we provide more visualizations of the reconstructed images of target samples. In Figure 6, the target samples are shown in the left column while their corresponding reconstructed samples are shown on the right. We can see that for low-level domain shift such as MNIST↔USPS, the reconstructed target samples are very source-like while preserving their original shapes and skeletons. For the larger domain shift in Figures 6(c) and 6(d), the samples are reconstructed to source-like digits of the same class while some noise is removed. Specifically, in Figure 6(d), a target sample (SVHN) may contain more than one digit, which is noise for recognition; after reconstruction, only the correct digit remains. Some target samples suffer from poor illumination conditions, yet their reconstructed digits are remarkably clear.

(a) Loss penalty $\alpha$
(b) Margin $m$
Figure 5: The training procedure of ARN with different hyper-parameters.
(a) MNIST→USPS
(b) USPS→MNIST
(c) SVHN→MNIST
(d) SYN→SVHN
Figure 6: Visualization of the target samples and their corresponding reconstructed target samples.