
ToAlign: Task-oriented Alignment for Unsupervised Domain Adaptation

Guoqiang Wei1   Cuiling Lan2   Wenjun Zeng2   Zhizheng Zhang2   Zhibo Chen1†

1 University of Science and Technology of China    2 Microsoft Research Asia
[email protected] {culan,wezeng,zhizzhang}@microsoft.com 
[email protected]
This work was done when Guoqiang Wei was an intern at MSRA. †Corresponding author.
Abstract

Unsupervised domain adaptive classification intends to improve the classification performance on an unlabeled target domain. To alleviate the adverse effect of domain shift, many approaches align the source and target domains in the feature space. However, a feature is usually taken as a whole for alignment without explicitly making domain alignment proactively serve the classification task, leading to sub-optimal solutions. In this paper, we propose an effective Task-oriented Alignment (ToAlign) for unsupervised domain adaptation (UDA). We study what features should be aligned across domains and propose to make the domain alignment proactively serve classification by performing feature decomposition and alignment under the guidance of the prior knowledge induced from the classification task itself. Particularly, we explicitly decompose a feature in the source domain into a task-related/discriminative feature that should be aligned, and a task-irrelevant feature that should be avoided/ignored, based on the classification meta-knowledge. Extensive experimental results on various benchmarks (e.g., Office-Home, VisDA-2017, and DomainNet) under different domain adaptation settings demonstrate the effectiveness of ToAlign, which helps achieve state-of-the-art performance. The code is publicly available at https://github.com/microsoft/UDA.

1 Introduction

Convolutional Neural Networks (CNNs) have made extraordinary progress in various computer vision tasks, with image classification as one of the most representative. Trained models generally perform well on testing data whose distribution is similar to that of the training data. However, in many practical scenarios, drastic performance degradation is observed when applying such trained models to new domains with domain shift [60], where the data distributions of the training and testing domains differ. Fine-tuning on labeled target data is a direct solution but is costly due to the requirement of target sample annotations. In contrast, unsupervised domain adaptation (UDA) requires only labeled source data and unlabeled target data to enhance the model's performance on the target domain, which has attracted increasing interest in both academia [3, 2, 77, 61, 26, 32] and industry [67, 28].

Figure 1: Illustration of adversarial learning based (a) Baseline and (b) our proposed ToAlign. $D$ and $C$ denote the domain discriminator and image classifier, respectively. (a) Baseline (e.g., DANN [19]) directly aligns the target feature $\mathbf{f}^{t}$ with the holistic source feature $\mathbf{f}^{s}$. Domain alignment and image classification tasks are optimized in parallel. (b) Our proposed ToAlign makes the domain alignment proactively serve the classification task, where the target feature $\mathbf{f}^{t}$ is aligned with the source task-discriminative "positive" feature $\mathbf{f}^{s}_{p}$, which is obtained under the guidance of meta-knowledge induced from the classification task. $\odot$ denotes the Hadamard product.

There has been a large spectrum of UDA methods. Supported by theoretical analysis [3], the overwhelming majority of methods tend to align the distributions of the source and target domains. A line of works [6, 71, 46, 57, 58] explicitly aligns the distributions based on domain discrepancy measurements, e.g., Maximum Mean Discrepancy (MMD) [6]. Another line of alignment-based UDA methods borrows ideas from Generative Adversarial Networks [20] and uses domain adversarial training to learn domain-aligned/invariant features; such methods dominate among the top-performing approaches. In the seminal Domain Adversarial Neural Network (DANN) [18, 19], a domain discriminator is trained to distinguish target features from source features, while a feature extractor (generator) is trained to generate domain-invariant features to fool this discriminator. Following DANN, a plethora of variants have been proposed [61, 41, 54, 12, 53, 64, 39, 43, 15, 11, 68].

It is noteworthy that the goal of alignment in UDA is to alleviate the adverse effect of domain shift and thereby improve the classification performance on unlabeled target data. Even though impressive progress has been made, there is a common intrinsic limitation: alignment is still not deliberately designed to dedicatedly/proactively serve the final image classification task. In many previous UDA methods, as shown in Figure 1 (a), the alignment task runs in parallel with the ultimate classification task. The assumption is that learning domain-invariant features (via alignment) reduces the domain gap and thus makes the image classifier trained on the source domain readily applicable to the target domain [3]. However, with alignment treated as a parallel task, there is no mechanism to make it explicitly assist classification, and the alignment may contaminate the features that are discriminative for classification [29]. Previous works (e.g., CDAN [41]) exploit class information (e.g., predicted class probability) as a condition to the discriminator. MADA [45] implements class-level domain alignment by applying one discriminator per class. Their purpose is to provide additional helpful information to the discriminator [41] or to perform class-level alignment [45], but they still fall short of explicitly making alignment assist classification.

Some works move a step forward and investigate which features the networks should align for better adaptation. [66, 33] focus on transferable local regions, selected based on the uncertainty or entropy of the domain discriminator, for alignment. However, such self-induced feature selection is still not specific to the optimization of the classification task; instead, it is based on the alignment task itself. There is no guarantee that the alignment positively serves the classification task. Hsu et al. [24] carry out object centerness-aware alignment by aligning the center parts of objects to exclude background distraction/noise for domain adaptive object detection. However, the feature at the object center position could be task-irrelevant and is thus not well suited for alignment. Moreover, regarding such a centerness feature as the alignment objective is somewhat ad-hoc and is still not designed directly from the perspective of assisting classification.

Figure 2: Conceptual comparison between (a) previous alignment and (b) our proposed task-oriented alignment. $\{\mathbf{f}^{t}\}$ and $\{\mathbf{f}^{s}\}$ denote the sets of target features and source features, respectively. (a) Previous methods take each source feature as a holistic one for alignment with target features. (b) We decompose each source feature $\mathbf{f}^{s}$ into a task-discriminative positive feature $\mathbf{f}^{s}_{p}$ and a task-irrelevant negative feature $\mathbf{f}^{s}_{n}$, and make the target features align with the positive source features $\{\mathbf{f}^{s}_{p}\}$ while avoiding alignment with the negative source features $\{\mathbf{f}^{s}_{n}\}$.

We pinpoint that the selection of the "right" features to achieve task-oriented alignment is important. For classification, the essence is to train the network to extract class-discriminative features. Similarly, for UDA classification, it is also desired to assure strong discrimination of the target domain features without class label supervision. Thus, we intend to align target features to the task-discriminative source features while ignoring the task-irrelevant ones. Note that the feature of a source sample contains both task/classification-discriminative and task-irrelevant information, because the network is in general not able to perfectly suppress non-discriminative feature responses (e.g., responses unrelated to the image class or those related to other tasks such as alignment) [55, 9]. Aligning target features with task-irrelevant source features would prevent alignment from serving classification and lead to poor adaptation. Intuitively, for example, image style, a non-causal factor for classification, can be considered task-irrelevant information, and a bias towards such a factor in alignment may hurt the classification task. We demonstrate this by conducting experiments where only the source task-irrelevant features are used for alignment with the target, i.e., the scheme Baseline+TiAlign in Figure 3. The performance of Baseline+TiAlign (in purple) on the target test set drops drastically compared to the source-only method, which does not incorporate any alignment technique. This corroborates that aligning with task-irrelevant features is even harmful to classification on the target domain.

Motivated by this, in this paper, we propose an effective UDA method named Task-oriented Alignment (ToAlign) to make the domain alignment explicitly serve classification. We achieve this by performing feature alignment guided by the meta-knowledge induced from the classification task, so that the target features align with the task-discriminative source features (i.e., "positive" features) and avoid interference from the task-irrelevant ones (i.e., "negative" features). Figure 2 conceptually compares our proposed alignment with the previous one. Particularly, as illustrated in Figure 1 (b), to obtain a suitable feature from a source sample for alignment with target samples, we leverage the classification task to guide the extraction/distillation of the task-related/discriminative feature $\mathbf{f}_{p}^{s}$ from the original feature $\mathbf{f}^{s}$. Correspondingly, for the domain alignment task, we enforce aligning target features with the source positive features via domain adversarial training to achieve task-oriented alignment. In this way, the domain alignment better assists the classification task.

We summarize our main contributions as follows:

  • We pinpoint that the selection of the "right" features to achieve task-oriented alignment is important for adaptation.

  • We propose an effective UDA approach named ToAlign which enables the alignment to explicitly serve classification. We decompose a source feature into a task-discriminative one and a task-irrelevant one under the guidance of classification meta-knowledge for performing classification-oriented alignment, which explicitly guides the network on which features should be aligned.

Extensive experimental results demonstrate the effectiveness of ToAlign. ToAlign is generic and can be applied to different adversarial learning based UDA methods to enhance their adaptation capability, helping achieve state-of-the-art performance with a negligible increase in training complexity and no increase in inference complexity.

2 Related Work

Figure 3: Classification accuracy on the target domain (Rw→Cl in Office-Home) for different methods. TiAlign denotes aligning target features with task-irrelevant source features.
Figure 4: Visualization of task-discriminative and task-irrelevant features. The positive features generally focus on the foreground objects, which provide the most discriminative information for classification, while the negative ones focus on non-discriminative background regions. The images are sampled from Office-Home.

Unsupervised Domain Adaptation aims to transfer knowledge from labeled source domain(s) to an unlabeled target domain. Ben-David et al. [3] theoretically reveal that learning domain-invariant representations helps make the image classifier trained on the source domain applicable to the target domain. Various works learn domain-invariant features by aligning the source and target distributions measured by some metric [6, 71, 46, 57, 58, 48], or by domain adversarial learning [61, 41, 54, 12, 53, 64, 39, 74, 59, 43, 15, 11, 68, 8, 30]. The latter has been overwhelmingly popular in recent years owing to its superiority in dealing with distribution-matching problems [20]. Note that our proposed method is designed to enhance the capability of the widely used domain adversarial learning based approaches.

In domain adversarial learning based approaches (e.g., DANN [18, 19]), a domain discriminator is in general trained to distinguish the source domain from the target domain, while a feature extractor is trained to learn domain-invariant features. Many variants of DANN have been proposed [41, 15, 59, 11, 13, 74, 36, 5]. CDAN [41] further conditions the discriminator on the image class information conveyed in the classifier predictions. MADA [45] implements class-wise alignment with multiple discriminators. GSDA [25] performs class-, group- and domain-wise alignments simultaneously, where the three types of alignment are enforced to be consistent in their gradients for more precise alignment. HDA [13] leverages domain-specific representations as heuristics to obtain domain-invariant representations from a heuristic search perspective. CMSS [70] exploits Curriculum Learning (CL) [4] to align target samples with dynamically selected source samples, exploiting the different transferability of the source samples.

However, in these methods, the domain alignment is designed as a task in parallel with the image classification task. It does not explicitly take serving classification as its mission, and such alignment may result in a loss of discriminative information. Jin et al. [29] remedy the loss of discriminative information caused by alignment by incorporating a restoration module. Wei et al. [68] pinpoint that alignment and classification are not well coordinated in optimization and may contradict each other. They thus propose to use meta-learning to coordinate their optimization directions.

In this paper, to make alignment explicitly serve classification, we propose a task-oriented alignment. Guided by the classification meta-knowledge, task-discriminative sub-features are selected for alignment. Different from [68], we investigate which features should be aligned to assist classification and intend to provide more interpretable alignment. We are the first to perform task-oriented alignment by decomposing a source feature into a task-discriminative and a task-irrelevant feature, and explicitly guide the network on which sub-features should be aligned. Note that Huang et al. [27] propose to decouple features into domain-invariant and domain-specific features, where the former are aligned for unsupervised person re-identification. [47, 7] exploit the VAE framework with several complex losses to perform disentanglement from the perspective of domain and semantics simultaneously, and only use domain-invariant semantics for inference, leaving domain-specific but task-related information underexplored. In contrast to focusing on the domain level, our decomposition strategy focuses on the task level, guided by the image classification task, and we further perform domain alignment on the task-discriminative features to proactively serve image classification.

3 Task-Oriented Alignment for UDA

Unsupervised domain adaptation (UDA) for classification aims to train a classification model on a labeled source domain image set $\mathbf{X}_{s}$ and an unlabeled target domain image set $\mathbf{X}_{t}$ to obtain high classification accuracy on a target domain test set.

Most popular adversarial learning based UDA methods attempt to align the features of the source and target domains to alleviate the domain gap and improve classification performance on the target domain. As mentioned before, aligning based on holistic features is sub-optimal, since such alignment does not explicitly serve classification. To address this, as illustrated in Figure 1 (b), we propose an effective task-oriented alignment to explicitly make the alignment serve classification. Particularly, we propose to decompose a source sample feature into a task-discriminative one that should be aligned, and a task-irrelevant one that should be ignored, based on the classification meta-knowledge. Then, we perform alignment between the target features and the positive source features, which is consistent with the essence of the classification task, i.e., focusing on discriminative features.

In Sec. 3.1, to be self-contained, we briefly introduce adversarial learning based UDAs. We answer the question of what feature should be aligned to better serve classification and introduce our task-oriented feature decomposition and alignment in Sec. 3.2.

3.1 Recap of Domain Adversarial UDAs

Domain adversarial learning based UDA methods typically train a domain discriminator $D$ to distinguish which domain (i.e., source or target) a sample belongs to, and adversarially train a feature extractor $G$ to fool the discriminator $D$ in order to learn domain-invariant feature representations. The network is also trained under the supervision of image classification on the labeled source samples. Particularly, $D$ is optimized to minimize the domain classification loss $\mathcal{L}_{D}$ (i.e., a binary cross-entropy loss). Meanwhile, $G$ is optimized to maximize the domain classification loss $\mathcal{L}_{D}$ and minimize the image classification loss $\mathcal{L}_{cls}$ (i.e., a cross-entropy loss):

\begin{split}&\operatorname*{argmin}_{D}\ \mathcal{L}_{D},\\ &\operatorname*{argmin}_{G}\ \mathcal{L}_{cls}-\mathcal{L}_{D}.\end{split}   (1)

To achieve adversarial training, a gradient reversal layer (GRL) [18, 19] connecting $G$ and $D$ is usually used, which multiplies the gradient from $D$ by a negative constant during back-propagation to $G$. $\mathcal{L}_{D}$ is typically defined as [19, 41, 15]:

\mathcal{L}_{D}(\mathbf{X}_{s},\mathbf{X}_{t})=-\mathbb{E}_{\mathbf{x}_{s}\sim\mathbf{X}_{s}}\left[\log(D(G(\mathbf{x}_{s})))\right]-\mathbb{E}_{\mathbf{x}_{t}\sim\mathbf{X}_{t}}\left[\log(1-D(G(\mathbf{x}_{t})))\right].   (2)
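For concreteness, below is a minimal sketch (assuming PyTorch) of the gradient reversal layer and the loss of Eq. (2); the class and function names are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambd backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # G receives the reversed gradient, so minimizing L_D for D
        # simultaneously maximizes it for G (the min-max of Eq. (1)).
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def domain_loss(D, f_s, f_t, lambd=1.0):
    """Eq. (2): D labels source features 1 and target features 0 (D outputs logits)."""
    d_s = D(grad_reverse(f_s, lambd))
    d_t = D(grad_reverse(f_t, lambd))
    return F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
           F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
```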

3.2 Task-oriented Feature Decomposition and Alignment

In adversarial learning based UDA methods, the feature ingested by $D$ as a holistic feature from a source or target sample in general contains both task/classification-discriminative information and task-irrelevant information. Intuitively, aligning the task-irrelevant features would not effectively reduce the domain gap of the task-discriminative features and thus brings no obvious benefit to the classification task. Worse, mistakenly aligning the target features with the source task-irrelevant features would hurt the discrimination power of the target features. We also confirm this experimentally in Figure 3: aligning with task-irrelevant features (TiAlign, line in purple) drastically reduces the classification accuracy on the target domain. Therefore, we propose to decompose the holistic feature of each source sample into a task-discriminative feature and a task-irrelevant feature to enable task-oriented alignment with the target features.

Particularly, we softly select/re-weight (based on Grad-CAM [55]) the feature vector $\mathbf{f}^{s}$ of a source sample to obtain the task-discriminative feature $\mathbf{f}_{p}^{s}$ that is discriminative for identifying the ground-truth class, which we refer to as the positive feature. Correspondingly, the task-irrelevant feature $\mathbf{f}_{n}^{s}$ can be obtained simultaneously, which we refer to as the negative feature.

Task-Oriented Feature Decomposition. Grad-CAM [78, 55, 9] is a widely used technique for localizing the features most important for classification in a convolutional neural network. As analyzed in [78, 55, 9, 56], the gradients of the final predicted score for the ground-truth class (w.r.t. the feature used for classification) convey task-discriminative information, identifying the features relevant to recognizing the image class correctly. It is noteworthy that such task-discriminative information is in general highly related to (but not limited to) the foreground object in the classification task. In this work, motivated by Grad-CAM, we propose to use the gradients of the predicted score for the ground-truth class as attention weights to obtain the task-discriminative features.

As illustrated in Figure 1, we obtain a feature map $F\in\mathbb{R}_{+}^{H\times W\times M}$ (i.e., a tensor of non-negative real numbers, with height $H$, width $W$, and $M$ channels) from the final convolutional block (with ReLU layer) of the feature extractor. After spatial global average pooling (GAP), we have a feature vector $\mathbf{f}=pool(F)\in\mathbb{R}^{M}$. The logits for all classes are predicted via the classifier $C(\cdot)$. Based on the response $C(\mathbf{f})$, we can derive the gradient $\mathbf{w}^{cls}\in\mathbb{R}^{M}$ of $y^{k}$ w.r.t. $\mathbf{f}$:

\mathbf{w}^{cls}=\frac{\partial y^{k}}{\partial\mathbf{f}},   (3)

where $y^{k}$ is the predicted score corresponding to the ground-truth class $k$. As analyzed in [55, 9, 56], the gradient $\mathbf{w}^{cls}$ conveys the channel-wise importance of the feature $\mathbf{f}$ for classifying the sample into its ground-truth class $k$. We draw inspiration from Grad-CAM, which uses $\mathbf{w}^{cls}$ to modulate the feature map channel-wise to find the classification-discriminative features. Similarly, by modulating with $\mathbf{w}^{cls}$, we obtain the task-discriminative (i.e., positive) feature as:

\mathbf{f}_{p}=\mathbf{w}^{cls}_{p}\odot\mathbf{f}=s\,\mathbf{w}^{cls}\odot\mathbf{f},   (4)

where $\odot$ represents the Hadamard product and the attention weight vector is $\mathbf{w}^{cls}_{p}=s\,\mathbf{w}^{cls}$, where $s\in\mathbb{R}_{+}$ is an adaptive non-negative parameter that modulates the energy $\mathcal{E}(\mathbf{f}_{p})=\|\mathbf{f}_{p}\|_{2}^{2}$ of $\mathbf{f}_{p}$ such that $\mathcal{E}(\mathbf{f}_{p})=\mathcal{E}(\mathbf{f})$:

s=\sqrt{\frac{\|\mathbf{f}\|_{2}^{2}}{\|\mathbf{w}^{cls}\odot\mathbf{f}\|_{2}^{2}}}=\sqrt{\frac{\sum_{m=1}^{M}f_{m}^{2}}{\sum_{m=1}^{M}(w^{cls}_{m}f_{m})^{2}}}.   (5)

Motivated by the counterfactual analysis in [55], the task-irrelevant (i.e., negative) feature can be represented as $\mathbf{f}_{n}=-\mathbf{w}^{cls}_{p}\odot\mathbf{f}$, where $-\mathbf{w}^{cls}_{p}$ suppresses the task-discriminative channels, since the task-discriminative channels (with larger values in $\mathbf{w}^{cls}_{p}$) correspond to smaller values in $-\mathbf{w}^{cls}_{p}$.
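The decomposition of Eqs. (3)-(5) reduces to a few lines of code. Below is a minimal sketch (assuming PyTorch); the function and tensor names, the batch layout, and the small eps term for numerical stability are our own assumptions.

```python
import torch

def decompose(feat_map, C, labels, eps=1e-6):
    """feat_map: (B, M, H, W) from the last conv block; C: the image classifier."""
    f = feat_map.mean(dim=(2, 3))                      # GAP -> f in R^M, per sample
    logits = C(f)
    y_k = logits.gather(1, labels.view(-1, 1)).sum()   # ground-truth class scores
    # Eq. (3): gradient of y^k w.r.t. f, used only as (detached) attention weights
    w_cls = torch.autograd.grad(y_k, f, retain_graph=True)[0].detach()
    wf = w_cls * f                                     # Hadamard product
    # Eq. (5): adaptive s keeps the energy of f_p equal to that of f
    s = torch.sqrt(f.pow(2).sum(1, keepdim=True)
                   / (wf.pow(2).sum(1, keepdim=True) + eps)).detach()
    f_p = s * wf    # Eq. (4): task-discriminative (positive) feature
    f_n = -s * wf   # task-irrelevant (negative) feature
    return f_p, f_n
```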

To better understand and validate the discriminativeness of the positive and negative features, we visualize the spatial maps $F$ with channels modulated by $\mathbf{w}^{cls}$ and $-\mathbf{w}^{cls}$, following [55, 78]. As shown in Figure 4, the positive information is more related to the foreground objects that provide the discriminative information for the classification task, while the negative information is more connected with the non-discriminative background regions.

Task-oriented Domain Alignment. As discussed above, we expect the domain alignment to explicitly serve the final classification task. Given the source task-discriminative features obtained based on the classification meta-knowledge, we can guide the target features to be aligned with the source task-discriminative features $\mathbf{f}_{p}$ through different domain adversarial learning based alignment methods [18, 19, 13]. The procedure is almost the same as in the UDA methods discussed in Sec. 3.1, except that the input source feature $\mathbf{f}^{s}$ to the final domain discriminator is replaced by the positive feature $\mathbf{f}^{s}_{p}$ of that source sample. Thus, the domain classification loss is defined with a small modification of Eq. (2):

\mathcal{L}_{D}(\mathbf{X}_{s},\mathbf{X}_{t})=-\mathbb{E}_{\mathbf{x}_{s}\sim\mathbf{X}_{s}}\left[\log(D(G^{p}(\mathbf{x}_{s})))\right]-\mathbb{E}_{\mathbf{x}_{t}\sim\mathbf{X}_{t}}\left[\log(1-D(G(\mathbf{x}_{t})))\right],   (6)

where $G^{p}(\mathbf{x}_{s})=\mathbf{f}_{p}^{s}$ denotes the positive feature of the source sample $\mathbf{x}_{s}$.
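Putting the pieces together, a sketch of one ToAlign training step might look as follows, assuming PyTorch and the `grad_reverse` and `decompose` helpers sketched above; the only change from the baseline of Eq. (2) is that the discriminator sees the source positive feature.

```python
import torch
import torch.nn.functional as F

def toalign_step(G, C, D, x_s, y_s, x_t, lambd=1.0):
    feat_s, feat_t = G(x_s), G(x_t)            # conv feature maps (B, M, H, W)
    f_s = feat_s.mean(dim=(2, 3))
    f_t = feat_t.mean(dim=(2, 3))

    loss_cls = F.cross_entropy(C(f_s), y_s)    # supervised source classification

    f_p_s, _ = decompose(feat_s, C, y_s)       # task-discriminative source feature
    d_s = D(grad_reverse(f_p_s, lambd))        # Eq. (6): source side uses f_p^s
    d_t = D(grad_reverse(f_t, lambd))          # target side is unchanged
    loss_d = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
             F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
    return loss_cls + loss_d                   # backpropagate and step as usual
```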

Understanding from the Meta-knowledge Perspective. To enable a better understanding of why ToAlign works well, we analyze it here from the perspective of meta-learning with meta-knowledge.

In an adversarial UDA framework, the image classification task and the domain alignment task can be considered a meta-train task $\mathcal{T}^{tr}$ and a meta-test task $\mathcal{T}^{te}$, respectively. ToAlign actually introduces knowledge communication from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$. In the meta-training stage, we can obtain the prior/meta-knowledge $\phi^{tr}$ of $\mathcal{T}^{tr}$. Without effective communication between $\mathcal{T}^{tr}$ and $\mathcal{T}^{te}$, the optimization of $\mathcal{T}^{te}$ may contradict that of $\mathcal{T}^{tr}$, considering that they have different optimization goals. To improve the knowledge communication from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$, certain meaningful prior/meta-knowledge $\phi^{tr}$ is helpful for a more effective $\mathcal{T}^{te}|_{\phi^{tr}}$. A typical implementation of passing meta-knowledge from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$ is based on gradients [40, 17, 35, 68, 34], i.e., $\nabla\mathcal{T}^{tr}$, which provides knowledge of $\mathcal{T}^{tr}$. Other mechanisms, e.g., leveraging a parameter regularizer in the manner of weight decay, have also been exploited [1, 76]. In our ToAlign, instead of encoding the meta-knowledge $\phi^{tr}$ into the gradients w.r.t. the parameters, we use $\mathcal{T}^{tr}$ to learn/derive attention weights that identify $\mathcal{T}^{tr}$-related sub-features in the feature space, and then pass this prior/meta-knowledge $\phi^{tr}$ to $\mathcal{T}^{te}$ so that the meta-test task $\mathcal{T}^{te}|_{\phi^{tr}}$ adapts its optimization based on $\phi^{tr}$.

In this work, we are essentially motivated by the reliable human prior knowledge about what should be aligned across domains to better assist the classification task in UDA (i.e., task/classification-discriminative features), while excluding interference from task-irrelevant ones. Accordingly, in our design, we obtain the prior/meta-knowledge for identifying task-discriminative features from the classification task (meta-train) and apply it to the domain alignment task (meta-test) to achieve task-oriented alignment.

4 Experiments

To evaluate the effectiveness of ToAlign, we conduct comprehensive experiments under three domain adaptation settings: single-source unsupervised domain adaptation (SUDA), multi-source unsupervised domain adaptation (MUDA), and semi-supervised domain adaptation (SSDA). For SSDA, domain adaptation is performed from a labeled source domain to a partially labeled target domain [16].

4.1 Datasets and Implementation Details

Datasets. We use two commonly used benchmark datasets (i.e., Office-Home [63] and VisDA-2017 [49]) for SUDA, and a large-scale dataset, DomainNet [46], for MUDA and SSDA.

Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
Source-Only [22] 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
MCD(CVPR’18) [53] 48.9 68.3 74.6 61.3 67.6 68.8 57.0 47.1 75.1 69.1 52.2 79.6 64.1
CDAN(NeurIPS’18) [41] 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
ALDA(AAAI’20) [11] 53.7 70.1 76.4 60.2 72.6 71.5 56.8 51.9 77.1 70.2 56.3 82.1 66.6
SymNets(CVPR’19) [74] 47.7 72.9 78.5 64.2 71.3 74.2 63.6 47.6 79.4 73.8 50.8 82.6 67.2
TADA(AAAI’19) [66] 53.1 72.3 77.2 59.1 71.2 72.1 59.7 53.1 78.4 72.4 60.0 82.9 67.6
MDD(ICML’19) [73] 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
BNM(CVPR’20) [14] 56.2 73.7 79.0 63.1 73.6 74.0 62.4 54.8 80.7 72.4 58.9 83.5 69.4
GSDA(CVPR’20) [25] 61.3 76.1 79.4 65.4 73.3 74.3 65.0 53.2 80.0 72.2 60.6 83.1 70.3
GVB(CVPR’20) [15] 57.0 74.7 79.8 64.6 74.1 74.6 65.2 55.1 81.0 74.6 59.7 84.3 70.4
E-Mix(AAAI’21) [77] 57.7 76.6 79.8 63.6 74.1 75.0 63.4 56.4 79.7 72.8 62.4 85.5 70.6
MetaAlign(CVPR’21) [68] 59.3 76.0 80.2 65.7 74.7 75.1 65.7 56.5 81.6 74.1 61.1 85.2 71.3
DANNP [68] 54.2 70.0 77.6 62.3 72.4 73.1 61.3 52.7 80.0 72.0 56.8 83.1 67.9
DANNP+ToAlign 56.8↑ 74.8↑ 79.9↑ 64.0↑ 73.9↑ 75.3↑ 63.8↑ 53.7↑ 81.1↑ 73.1↑ 58.2↑ 84.0↑ 69.9↑
HDA(NeurIPS’20) [13] 56.8 75.2 79.8 65.1 73.9 75.2 66.3 56.7 81.8 75.4 59.7 84.7 70.9
HDA+ToAlign 57.9↑ \textbf{76.9}↑ \textbf{80.8}↑ \textbf{66.7}↑ \textbf{75.6}↑ \textbf{77.0}↑ \textbf{67.8}↑ \textbf{57.0}↑ \textbf{82.5}↑ 75.1↓ 60.0↑ 84.9↑ \textbf{72.0}↑
Table 1: Accuracy (%) of different UDAs on Office-Home with ResNet-50 as backbone. Best in bold.

1) Office-Home [63] consists of images from four different domains: Art (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw).

Methods Clipart Infograph Painting Quickdraw Real Sketch Avg.
Source-Only [22] 47.6±0.52 13.0±0.41 38.1±0.45 13.3±0.39 51.9±0.85 33.7±0.54 32.9±0.54
ADDA(CVPR’17) [61] 47.5±0.76 11.4±0.67 36.7±0.53 14.7±0.50 49.1±0.82 33.5±0.49 32.2±0.63
DANN(ICML’15) [18] 45.5±0.59 13.1±0.72 37.0±0.69 13.2±0.77 48.9±0.65 31.8±0.62 32.6±0.68
DCTN(CVPR’18) [69] 48.6±0.73 23.5±0.59 48.8±0.63 7.2±0.46 53.5±0.56 47.3±0.47 38.2±0.57
MCD(CVPR’18) [53] 54.3±0.64 22.1±0.70 45.7±0.63 7.6±0.49 58.4±0.65 43.5±0.57 38.5±0.61
M3SDA(ICCV’19) [46] 57.2±0.98 24.2±1.21 51.6±0.44 5.2±0.45 61.6±0.89 49.6±0.56 41.5±0.74
M3SDA-β(ICCV’19) [46] 58.6±0.53 26.0±0.89 52.3±0.55 6.3±0.58 62.7±0.51 49.5±0.76 42.6±0.64
MDAN(NeurIPS’18) [75] 60.3±0.41 25.0±0.43 50.3±0.36 8.2±1.92 61.5±0.46 51.3±0.58 42.8±0.69
MLMSDA(Arxiv’20) [37] 61.4±0.79 26.2±0.41 51.9±0.20 19.1±0.31 57.0±1.04 50.3±0.67 44.3±0.57
GVBG(CVPR’20) [15] 61.5±0.44 23.9±0.71 54.2±0.46 16.4±0.57 67.8±0.98 52.5±0.62 46.0±0.63
CMSS(ECCV’20) [70] 64.2±0.18 28.0±0.20 53.6±0.39 16.0±0.12 63.4±0.21 53.8±0.35 46.5±0.24
HDA(NeurIPS’20) [13] 63.6±0.35 25.9±0.16 56.1±0.38 16.6±0.54 69.1±0.42 54.3±0.26 47.6±0.40
Baseline 66.4±0.24 24.7±0.16 57.3±0.10 11.5±0.17 69.2±0.21 55.2±0.13 47.3±0.19
Baseline+ToAlign 67.0±0.22 25.9±0.20 57.8±0.32 12.2±0.14 70.7±0.25 56.0±0.18 48.2±0.22
Table 2: Accuracy (%) of different MUDA methods on DomainNet with ResNet-101 as backbone. Best in bold.

Each domain contains 65 object categories found in office and home environments. Following the typical settings [15, 13, 68, 41], we evaluate methods on one-source to one-target domain adaptation, resulting in 12 adaptation cases in total. 2) VisDA-2017 [49] is a synthetic-to-real dataset for domain adaptation with over 280,000 images across 12 categories, where the source images are synthetic and the target images are real images collected from the MS COCO dataset [38]. 3) DomainNet [46] is a large-scale dataset containing about 600,000 images across 345 categories, which span 6 domains with large domain gaps: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R), and Sketch (S).

Method Acc.
DANNP 67.9
DANNP+ToAlign ($s=1$) 59.7
DANNP+ToAlign ($s=8$) 68.8
DANNP+ToAlign ($s=16$) 69.7
DANNP+ToAlign ($s=64$) 70.0
DANNP+ToAlign ($s=128$) 69.8
DANNP+ToAlign (adaptive $s$) 69.9
Table 3: Ablation study on the influence of $s$ in Eq. (5).
Method Time/ms GPU mem./MB Acc./%
DANNP 550 6,660 67.9
DANNP+MetaAlign [68] 1,000 10,004 69.5
DANNP+ToAlign 590 6,668 69.9
Table 4: Training complexity comparison (on GTX TITAN X GPU) in terms of computational time (of one iteration) and GPU memory for a mini-batch with batch size 32.

For MUDA, following the settings in [46, 70, 13, 34, 62], we evaluate methods on five-source to one-target domain adaptation, resulting in 6 MUDA cases in total. For SSDA, we adopt the typical protocol in [23, 51, 13], with 7 SSDA cases conducted on 4 sub-domains (i.e., C, R, P and S) with 126 sub-categories selected from DomainNet. All methods are evaluated under the one-shot and three-shot settings, where, besides unlabeled samples, one/three labeled sample(s) per class in the target domain are available during training.

Implementation Details. We apply our ToAlign on top of two different baseline schemes: DANNP [15, 68] and HDA [13]. DANNP is an improved variant of the classical adversarial learning based adaptation method DANN [18, 19], where the domain discriminator $D$ is conditioned on the predicted class probabilities. HDA is a state-of-the-art adversarial training based method which leverages domain-specific representations as heuristics to obtain domain-invariant representations.

We use ResNet-50 [22] pre-trained on ImageNet [31] as the backbone for SUDA, and ResNet-101 and ResNet-34 for MUDA and SSDA, respectively. Following [68, 41, 13], the image classifier $C$ is composed of one fully connected layer. The discriminator $D$ consists of three fully connected layers with inserted dropout and ReLU layers. We follow [74] and adopt an annealing strategy for the learning rate $\eta$, i.e., $\eta_{t}=\frac{\eta_{0}}{(1+\gamma p)^{\tau}}$, where $p$ indicates the progress of training that increases linearly from 0 to 1, $\gamma=10$, and $\tau=0.75$. The initial learning rate $\eta_{0}$ is set to 1e-3, 3e-4, 3e-4, and 1e-3 for SUDA on Office-Home, SUDA on VisDA-2017, MUDA on DomainNet, and SSDA on DomainNet, respectively. All reported experimental results are the average of three runs with different seeds.
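As a small illustration, the annealing schedule above can be computed as follows (a sketch; the step/total-step bookkeeping is our assumption about how $p$ is tracked):

```python
def annealed_lr(eta0, step, total_steps, gamma=10.0, tau=0.75):
    """eta_t = eta_0 / (1 + gamma * p)^tau, with p growing linearly from 0 to 1."""
    p = step / total_steps
    return eta0 / (1.0 + gamma * p) ** tau

# e.g., for SUDA on Office-Home with eta0 = 1e-3:
# annealed_lr(1e-3, 0, 10000) == 1e-3; annealed_lr(1e-3, 10000, 10000) ~= 1.65e-4
```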

4.2 Ablation Study

Effectiveness of ToAlign on Different Baselines. Our proposed ToAlign is generic and applicable to different domain adversarial training based baselines, since we focus on what features to align rather than on the alignment method itself. The last four rows in Table 1 show the ablation comparisons on Office-Home. ToAlign improves the accuracy of the baselines DANNP and HDA by 2.0% and 1.1%, respectively. As can be seen from the results in Table 1, Table 2, Table 5 and Table 6, ToAlign consistently brings significant improvements over the baseline schemes under different domain adaptation settings, i.e., SUDA, MUDA and SSDA. ToAlign enables the domain alignment task to proactively serve the classification task, resulting in more effective feature alignment for image classification.

Effectiveness of Different Ways to Obtain Positive Features. As mentioned in Sec. 3.2, we use $\mathbf{w}^{cls}_{p}=s\,\mathbf{w}^{cls}$ as the attention weight (which conveys the classification prior/meta-knowledge) to derive the positive feature $\mathbf{f}_{p}$, where $s$ is a parameter that modulates the energy of $\mathbf{f}_{p}$. We study the influence of $s$ under the Rw→Cl setting on Office-Home for our scheme DANNP+ToAlign and report the results in Table 3. As discussed around Eq. (5), we can use an adaptively calculated $s$, which achieves a 2% improvement over the baseline on the target test data. Alternatively, we can treat $s$ as a preset hyper-parameter. We find that the performance drops drastically if $s$ is too small (e.g., $s=1$), because the energy of the source positive feature then becomes too weak (e.g., the source feature $\mathbf{f}$'s average energy $\mathcal{E}(\mathbf{f})$ is about 800; with $s=1$, the source positive feature's average energy $\mathcal{E}(\mathbf{f}_{p})$ is about 2), making it ineffective to align the target with the source positive features. When $s$ is larger than 16, the performance significantly outperforms the baseline and approaches the result of the adaptive $s$. As an optional design choice, we could transform the weight $\mathbf{w}^{cls}$ with an activation function $\sigma(\cdot)$ such as Sigmoid or Softmax, followed by a best-selected scaling factor $s$, i.e., $\mathbf{w}^{cls}_{p}=s\,\sigma(\mathbf{w}^{cls})$; see the sketch after this paragraph. We find the results (i.e., 69.6/69.7 for Sigmoid/Softmax) are close to those without an activation function. We reckon that what matters more is the relative importance among the elements of $\mathbf{w}^{cls}$. For simplicity, we finally adopt the adaptive $s$ (cf. Eq. (5)) for all experiments.
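A sketch of this optional activated variant, assuming PyTorch and the `w_cls` and `f` tensors from the decomposition sketch in Sec. 3.2 (here `s` is a preset scalar hyper-parameter rather than the adaptive one of Eq. (5)):

```python
import torch

# w_p = s * sigma(w_cls); Sigmoid shown, Softmax over channels is the alternative
w_p = s * torch.sigmoid(w_cls)   # or: s * torch.softmax(w_cls, dim=1)
f_p = w_p * f                    # activated positive feature
```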

Methods R→C R→P P→C C→S S→P R→S P→R Avg.
Source-Only [22] 55.6 60.6 56.8 50.8 56.0 46.3 71.8 56.9
DANN(ICML’15) [18] 58.2 61.4 56.3 52.8 57.4 52.2 70.3 58.4
ADR(ICLR’18) [52] 57.1 61.3 57.0 51.0 56.0 49.0 72.0 57.6
CDAN(NeurIPS’18) [41] 65.0 64.9 63.7 53.1 63.4 54.5 73.2 62.5
ENT(NeurIPS’05) [21] 65.2 65.9 65.4 54.6 59.7 52.1 75.0 62.6
MME(ICCV’19) [51] 70.0 67.7 69.0 56.3 64.8 61.0 76.1 66.4
CANN(Arxiv’20) [50] 72.7 70.3 69.8 60.5 66.4 62.7 77.3 68.5
GVBG(CVPR’20) [15] 70.8 65.9 71.1 62.4 65.1 67.1 76.8 68.4
HDA(NeurIPS’20) [13] 72.4 71.0 71.0 63.6 68.8 64.2 79.9 70.0
HDA+ToAlign   73.0   72.0   71.7   63.0   69.3   64.6   80.8   70.6
Table 5: Accuracy (%) of different one-shot SSDA methods on DomainNet with ResNet-34 as backbone. Best in bold.
Methods R→C R→P P→C C→S S→P R→S P→R Avg.
Source-Only [22] 60.0 62.2 59.4 55.0 59.5 50.1 73.9 60.0
ADR(ICLR’18) [52] 60.7 61.9 60.7 54.4 59.9 51.1 74.2 60.4
CDAN(NeurIPS’18) [41] 69.0 67.3 68.4 57.8 65.3 59.0 78.5 66.5
ENT(NeurIPS’05) [21] 71.0 69.2 71.1 60.0 62.1 61.1 78.6 67.6
MME(ICCV’19) [51] 72.2 69.7 71.7 61.8 66.8 61.9 78.5 68.9
MetaMME(ECCV’20) [34] 73.5 70.3 72.8 62.8 68.0 63.8 79.2 70.1
GVBG(CVPR’20) [15] 73.3 68.7 72.9 65.3 66.6 68.5 79.2 70.6
CANN(Arxiv’20) [50] 75.4 71.5 73.2 64.1 69.4 64.2 80.8 71.2
HDA(NeurIPS’20) [13] 74.5 71.5 73.9 65.9 70.1 65.9 81.9 71.8
HDA+ToAlign   75.7   72.9   75.6   66.2   71.1   66.4   83.0   73.0
Table 6: Accuracy (%) of different three-shot SSDA methods on DomainNet with ResNet-34 as backbone. Best in bold.

4.3 Comparison with the State-of-the-arts

Single Source Unsupervised Domain Adaptation (SUDA). We incorporate ToAlign into the recent state-of-the-art UDA method HDA [13], denoted as HDA+ToAlign. Table 1 compares it with previous state-of-the-art methods on Office-Home. HDA+ToAlign outperforms all the previous methods and achieves state-of-the-art performance. It is noteworthy that HDA+ToAlign achieves the best adaptation results on almost all the one-source to one-target adaptation cases, thanks to the effective feature alignment for classification. The results on VisDA-2017 can be found in the Appendix, where HDA+ToAlign outperforms HDA by 0.9%.

Figure 5: Visualization of the feature response maps on target test images. First row: Art of Office-Home. Second row: Painting of DomainNet.

Multi-source Unsupervised Domain Adaptation (MUDA). Table 2 shows the results on DomainNet, where all the methods take ResNet-101 as the feature extractor. We build our Baseline based on HDA [13]. For simplicity, we replace the multi-class domain discriminator in the original HDA with a two-class one as in [61, 18, 70]. Note that CMSS [70] selects suitable source samples for alignment, while our ToAlign selects the task-discriminative sub-feature of each sample for task-oriented alignment. Compared with the Baseline, ToAlign brings an improvement of about 0.9% and helps achieve the best performance on this more challenging dataset.

Semi-supervised Domain Adaptation (SSDA). Table 5 and Table 6 show the results on one-shot and three-shot SSDA respectively, where all the methods use ResNet-34 as backbone. To compare with previous methods, we apply ToAlign on top of HDA. HDA+ToAlign outperforms HDA by 0.6%/1.2% for one-/three-shot settings, and surpasses all previous SSDA methods.

4.4 Complexity

In Table 4, we compare the training complexity and performance of ToAlign with the baseline DANNP and with DANNP+MetaAlign [68], which incorporates meta-learning to coordinate the optimization of domain alignment and image classification. In contrast, guided by the prior knowledge of which features should be aligned to serve the classification task, we distill this meta-knowledge from the classification task and explicitly pass it to the alignment task for classification-oriented alignment, eschewing complex optimization. Compared with the baseline, ToAlign introduces negligible additional computational cost (only 7%) and occupies almost the same GPU memory, much less than DANNP+MetaAlign, which almost doubles the computational cost due to its complex meta-optimization. Thanks to our explicit design, which makes domain alignment effectively serve the classification task, ToAlign achieves performance superior to MetaAlign.

4.5 Feature Visualization

We visualize the target feature response maps $F$ (which are pooled to form the input of the image classifier) of the Baseline (DANNP) and ToAlign in Figure 5. The Baseline sometimes focuses on background features that are useless to the image classification task, since it aligns the holistic features without considering the discriminativeness of the different channels/sub-features. Thanks to our task-oriented alignment, the features with higher responses in ToAlign are in general task-discriminative, which is more consistent with human perception. More results can be found in the Appendix.

5 Conclusion

In this paper, we study what features should be aligned across domains for more effective unsupervised domain adaptive image classification. To make the domain alignment task proactively serve the classification task, we propose an effective task-oriented alignment (ToAlign). We explicitly decompose a feature in the source domain into a task-related feature that should be aligned and a task-irrelevant one that should be ignored, under the guidance of the meta-knowledge induced from the classification task itself. Extensive experiments on various datasets demonstrate the effectiveness of our ToAlign. In our future work, we will extend ToAlign to tasks beyond image classification, e.g., object detection and segmentation.

Acknowledgments and Disclosure of Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0101400 and by NSFC under Grants U1908209, 61632001, and 62021001.

References

  • [1] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, pages 998–1008, 2018.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NeurIPS, volume 19, page 137, 2007.
  • [4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
  • [5] R. Bermúdez Chacón, M. Salzmann, and P. Fua. Domain-adaptive multibranch networks. In ICLR, 2020.
  • [6] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
  • [7] R. Cai, Z. Li, P. Wei, J. Qiao, K. Zhang, and Z. Hao. Learning disentangled semantic representation for domain adaptation. In IJCAI, volume 2019, page 2060, 2019.
  • [8] J. Cao, O. Katzir, P. Jiang, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li. Dida: Disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019, 2018.
  • [9] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, pages 839–847, 2018.
  • [10] J. Chen, X. Qiu, P. Liu, and X. Huang. Meta multi-task learning for sequence modeling. In AAAI, volume 32, 2018.
  • [11] M. Chen, S. Zhao, H. Liu, and D. Cai. Adversarial-learned loss for domain adaptation. In AAAI, pages 3521–3528, 2020.
  • [12] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty. Re-weighted adversarial adaptation network for unsupervised domain adaptation. In CVPR, pages 7976–7985, 2018.
  • [13] S. Cui, X. Jin, S. Wang, Y. He, and Q. Huang. Heuristic domain adaptation. In NeurIPS, 2020.
  • [14] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In CVPR, pages 3941–3950, 2020.
  • [15] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian. Gradually vanishing bridge for adversarial domain adaptation. In CVPR, pages 12455–12464, 2020.
  • [16] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Semi-supervised domain adaptation with instance constraints. In CVPR, pages 668–675, 2013.
  • [17] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, page 1126–1135, 2017.
  • [18] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
  • [19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
  • [21] Y. Grandvalet, Y. Bengio, et al. Semi-supervised learning by entropy minimization. In NeurIPS, pages 281–296, 2005.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [23] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
  • [24] C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang. Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In ECCV, pages 733–748, 2020.
  • [25] L. Hu, M. Kan, S. Shan, and X. Chen. Unsupervised domain adaptation with hierarchical gradient synchronization. In CVPR, pages 4043–4052, 2020.
  • [26] J. Huang, D. Guan, A. Xiao, and S. Lu. Model adaptation: Historical contrastive learning for unsupervised domain adaptation without source data. In NeurIPS, 2021.
  • [27] Y. Huang, P. Peng, Y. Jin, Y. Li, and J. Xing. Domain adaptive attention learning for unsupervised person re-identification. In AAAI, pages 11069–11076, 2020.
  • [28] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, pages 12627–12637, 2019.
  • [29] X. Jin, C. Lan, W. Zeng, and Z. Chen. Feature alignment and restoration for domain generalization and adaptation. arXiv preprint arXiv:2006.12009, 2020.
  • [30] G. Kang, L. Zheng, Y. Yan, and Y. Yang. Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. In ECCV, pages 401–416, 2018.
  • [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, page 1097–1105, 2012.
  • [32] J. N. Kundu, N. Venkat, A. Revanur, R. V. Babu, et al. Towards inheritable models for open-set domain adaptation. In CVPR, pages 12376–12385, 2020.
  • [33] V. K. Kurmi, S. Kumar, and V. P. Namboodiri. Attending to discriminative certainty for domain adaptation. In CVPR, pages 491–500, 2019.
  • [34] D. Li and T. Hospedales. Online meta-learning for multi-source and semi-supervised domain adaptation. In ECCV, pages 382–403, 2020.
  • [35] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. AAAI, 2018.
  • [36] S. Li, C. H. Liu, Q. Lin, B. Xie, Z. Ding, G. Huang, and J. Tang. Domain conditioned adaptation network. In AAAI, pages 11386–11393, 2020.
  • [37] Z. Li, Z. Zhao, Y. Guo, H. Shen, and J. Ye. Mutual learning network for multi-source domain adaptation. arXiv preprint arXiv:2003.12944, 2020.
  • [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [39] H. Liu, M. Long, J. Wang, and M. Jordan. Transferable adversarial training: A general approach to adapting deep classifiers. In ICML, pages 4013–4022, 2019.
  • [40] P. Liu and X. Huang. Meta-learning multi-task communication. arXiv preprint arXiv:1810.09988, 2018.
  • [41] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In NeurIPS, pages 1645–1655, 2018.
  • [42] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, pages 2200–2207, 2013.
  • [43] Z. Lu, Y. Yang, X. Zhu, C. Liu, Y.-Z. Song, and T. Xiang. Stochastic classifiers for unsupervised domain adaptation. In CVPR, pages 9111–9120, 2020.
  • [44] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [45] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI, volume 32, 2018.
  • [46] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
  • [47] X. Peng, Z. Huang, X. Sun, and K. Saenko. Domain agnostic learning with disentangled representations. In ICML, pages 5102–5112, 2019.
  • [48] X. Peng and K. Saenko. Synthetic to real adaptation with generative correlation alignment networks. In WACV, pages 1982–1991, 2018.
  • [49] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • [50] C. Qin, L. Wang, Q. Ma, Y. Yin, H. Wang, and Y. Fu. Opposite structure learning for semi-supervised domain adaptation. arXiv preprint arXiv:2002.02545, 2020.
  • [51] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko. Semi-supervised domain adaptation via minimax entropy. In ICCV, pages 8050–8058, 2019.
  • [52] K. Saito, Y. Ushiku, T. Harada, and K. Saenko. Adversarial dropout regularization. In ICLR, 2018.
  • [53] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018.
  • [54] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, pages 8503–8512, 2018.
  • [55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [56] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
  • [57] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
  • [58] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450, 2016.
  • [59] H. Tang and K. Jia. Discriminative adversarial domain adaptation. In AAAI, volume 34, pages 5940–5947, 2020.
  • [60] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
  • [61] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 7167–7176, 2017.
  • [62] N. Venkat, J. N. Kundu, D. K. Singh, A. Revanur, and R. V. Babu. Your classifier can secretly suffice multi-source domain adaptation. In NeurIPS, 2021.
  • [63] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
  • [64] R. Volpi, P. Morerio, S. Savarese, and V. Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, pages 5495–5504, 2018.
  • [65] Q. Wang and T. Breckon. Unsupervised domain adaptation via structured prediction based selective pseudo-labeling. In AAAI, volume 34, pages 6243–6250, 2020.
  • [66] X. Wang, L. Li, W. Ye, M. Long, and J. Wang. Transferable attention for domain adaptation. In AAAI, volume 33, pages 5345–5352, 2019.
  • [67] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W.-m. Hwu, T. S. Huang, and H. Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In CVPR, pages 12635–12644, 2020.
  • [68] G. Wei, C. Lan, W. Zeng, and Z. Chen. Metaalign: Coordinating domain alignment and classification for unsupervised domain adaptation. In CVPR, 2021.
  • [69] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, pages 3964–3973, 2018.
  • [70] L. Yang, Y. Balaji, S.-N. Lim, and A. Shrivastava. Curriculum manager for source selection in multi-source domain adaptation. In ECCV, volume 12359, pages 608–624, 2020.
  • [71] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. In ICLR, 2017.
  • [72] J. Zhang, W. Li, and P. Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In CVPR, pages 1859–1867, 2017.
  • [73] Y. Zhang, T. Liu, M. Long, and M. Jordan. Bridging theory and algorithm for domain adaptation. In ICML, pages 7404–7413, 2019.
  • [74] Y. Zhang, H. Tang, K. Jia, and M. Tan. Domain-symmetric networks for adversarial domain adaptation. In CVPR, pages 5031–5040, 2019.
  • [75] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In NeurIPS, volume 31, pages 8559–8570, 2018.
  • [76] L. Zhao, X. Peng, Y. Chen, M. Kapadia, and D. N. Metaxas. Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge. In CVPR, pages 6528–6537, 2020.
  • [77] L. Zhong, Z. Fang, F. Liu, J. Lu, B. Yuan, and G. Zhang. How does the combined risk affect the performance of unsupervised domain adaptation approaches? In AAAI, 2021.
  • [78] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.

Appendix

A. Visualization of Decomposed Features

To better understand and validate the discriminativeness of the positive and the negative features, similar to Figure 4 in our manuscript, here we show more visualization results of the spatial maps $F$ with channels modulated by $\mathbf{w}^{cls}$ (corresponding to positive features) and $-\mathbf{w}^{cls}$ (corresponding to negative features), following [55, 78]. We can observe that the positive information is more related to the foreground objects that provide the discriminative information for the classification task, while the negative one is more connected with the non-discriminative background regions.

Figure 6: Visualization of task-discriminative and task-irrelevant features. The images are sampled from different domains of Office-Home.

B. Experiments

B.1 More Implementation Details

We use two domain alignment based methods as our baselines: 1) DANNP [68] is an improved variant of DANN [18], where the domain discriminator $D$ in DANNP is conditioned on the predicted class probabilities instead of the extracted features, as illustrated in Figure 7 (a). 2) HDA [13] draws inspiration from heuristic search and incorporates domain-specific representations as heuristics to help learn domain-invariant ones. Figure 7 (b) shows its architecture.

Figure 7: Pipelines of two representative domain alignment based UDA methods. (a) DANNP [68]. (b) HDA [13]. $\mathcal{L}_{heu}$ denotes a heuristic loss, which is implemented as an $\mathcal{L}_{1}$-norm loss [13].
Figure 8: Error bars of ToAlign on top of DANNP and HDA on Office-Home.
Figure 9: Visualization of the feature response maps on target test images. The Category (Domain) information is shown on each sample.
Method Avg.
Source-Only [22] 55.3
DANN(ICML’15) [18] 57.4
CDAN(NeurIPS’18) [41] 70.0
MDD(ICML’19) [73] 74.6
GVB(CVPR’20) [15] 75.3
HDA(NeurIPS’20) [13] 74.6
HDA+ToAlign 75.5
Table 7: Classification accuracy (%) of the Synthetic→Real setting on VisDA-2017 for SUDA using ResNet-50 as backbone. Note that HDA [13] does not report results on this dataset; we obtained the result by running their released source code.

All experimental results are obtained by running three times with different seeds. To evaluate the stability of our ToAlign, we visualize the error bars of our schemes DANNP+ToAlign and HDA+ToAlign on Office-Home in Figure 8, where we also present the error bars of the two baseline schemes DANNP and HDA. The variances of our ToAlign schemes and the corresponding baselines are close (0.41 vs. 0.40 for DANNP and 0.40 vs. 0.35 for HDA), so ToAlign does not introduce much additional instability.

B.2 Experimental Results of SUDA

As mentioned in the main manuscript, the experimental results on VisDA-2017 for SUDA are presented here in the Appendix. Table 7 shows the results, where ToAlign brings a 0.9% improvement over the baseline HDA.

B.3 Experimental Results of SSDA

We present the results for the more challenging one-shot SSDA on DomainNet in Table 5. Our ToAlign improves the baseline HDA by 0.6%, and HDA+ToAlign outperforms all the previous methods.

Comparing Table 5 (one-shot SSDA) with Table 6 (three-shot SSDA), we observe that introducing two more labeled samples per class brings about a 2.4% gain, which demonstrates that access to target annotation information (even for very few samples) is helpful for domain adaptation.

B.4 Feature Visualization

We visualize more feature response maps on the target test images in Figure 9, as a supplement to Figure 5 in the main manuscript. The Baseline sometimes focuses on background features that are useless to the image classification task, since it aligns the holistic features without considering the discriminativeness of the different channels/sub-features. Thanks to our task-oriented alignment, the features with higher responses in ToAlign are in general task-discriminative, which is more consistent with human perception.

We further visualize the learned source (red) and target (blue) feature representations (i.e., $\mathbf{f}^{s}$ and $\mathbf{f}^{t}$) using t-SNE [44] for different methods in Figure 10. Figure 10 (a) shows the embedded features of the Source-Only method, where no adaptation technique is used and the samples are very scattered. In comparison, the samples for HDA [13] (cf. Figure 10 (b)) and our HDA+ToAlign (cf. Figure 10 (c)) form more compact clusters, where the clusters of ours are more compact and the target samples lie closer to the source samples than for HDA.

C. Broader Impact

Unsupervised domain adaptation aims to obtain better performance on unlabeled target data based on the knowledge from labeled source data and unlabeled target data, which is an important and practical problem in both academia and industry. Our proposed ToAlign emphasizes that the domain alignment task should assist/serve the classification task, where we perform alignment under the guidance of meta-knowledge induced from the classification task. We also provide some understanding from the meta-knowledge perspective, where we pass the meta-train task knowledge to the meta-test task in a simple and effective way. This provides some insights on how to pass meta-knowledge more effectively for meta-learning based multi-task communication [40, 68, 10].

The major societal impact of ToAlign arises from the UDA task itself, which aims to transfer knowledge from a labeled source domain to an unlabeled target domain, leading to a heavy dependency on the source domain. The major limitation of ToAlign is that it is only applicable to domain adversarial learning based UDA methods, which nonetheless dominate among the top-performing methods. How to apply the idea to other categories of methods, e.g., pseudo-label based ones [42, 72, 65], will be investigated in the future.

ToAlign could be further improved from two perspectives. First, other ways to derive the classification meta-knowledge could be explored; we currently use the gradients as guidance, drawing inspiration from Grad-CAM [55]. Second, ToAlign could be extended to more challenging tasks like semantic segmentation and object detection.

Figure 10: t-SNE visualization of different methods on Ar→Pr of Office-Home. Red: source. Blue: target.