
ToAlign: Task-oriented Alignment for Unsupervised Domain Adaptation

Guoqiang Wei1   Cuiling Lan2   Wenjun Zeng2   Zhizheng Zhang2   Zhibo Chen1†

1 University of Science and Technology of China    2 Microsoft Research Asia
[email protected] {culan,wezeng,zhizzhang}@microsoft.com 
[email protected]
This work was done when Guoqiang Wei was an intern at MSRA. †Corresponding author.
Abstract

Unsupervised domain adaptive classification intends to improve the classification performance on an unlabeled target domain. To alleviate the adverse effect of domain shift, many approaches align the source and target domains in the feature space. However, a feature is usually taken as a whole for alignment without explicitly making domain alignment proactively serve the classification task, leading to sub-optimal solutions. In this paper, we propose an effective Task-oriented Alignment (ToAlign) for unsupervised domain adaptation (UDA). We study what features should be aligned across domains and propose to make the domain alignment proactively serve classification by performing feature decomposition and alignment under the guidance of the prior knowledge induced from the classification task itself. Particularly, we explicitly decompose a feature in the source domain into a task-related/discriminative feature that should be aligned, and a task-irrelevant feature that should be avoided/ignored, based on the classification meta-knowledge. Extensive experimental results on various benchmarks (e.g., Office-Home, VisDA-2017, and DomainNet) under different domain adaptation settings demonstrate the effectiveness of ToAlign, which helps achieve state-of-the-art performance. The code is publicly available at https://github.com/microsoft/UDA.

1 Introduction

Convolutional Neural Networks (CNNs) have made extraordinary progress in various computer vision tasks, with image classification as one of the most representative. Trained models generally perform well on testing data whose distribution is similar to that of the training data. However, in many practical scenarios, drastic performance degradation is observed when applying such trained models to new domains with domain shift [60], where the data distributions of the training and testing domains differ. Fine-tuning on labeled target data is a direct solution but is costly due to the requirement of target sample annotations. In contrast, unsupervised domain adaptation (UDA) requires only labeled source data and unlabeled target data to enhance the model's performance on the target domain, which has attracted increasing interest in both academia [3, 2, 77, 61, 26, 32] and industry [67, 28].

Figure 1: Illustration of adversarial learning based (a) Baseline and (b) our proposed ToAlign. $D$ and $C$ denote the domain discriminator and image classifier, respectively. (a) Baseline (e.g., DANN [19]) directly aligns the target feature $\mathbf{f}^{t}$ with the holistic source feature $\mathbf{f}^{s}$. Domain alignment and image classification tasks are optimized in parallel. (b) Our proposed ToAlign makes the domain alignment proactively serve the classification task, where the target feature $\mathbf{f}^{t}$ is aligned with the source task-discriminative "positive" feature $\mathbf{f}^{s}_{p}$, which is obtained under the guidance of meta-knowledge induced from the classification task. $\odot$ denotes the Hadamard product.

There has been a large spectrum of UDA methods. Supported by theoretical analysis [3], the overwhelming majority of methods tend to align the distributions of the source and target domains. A line of works [6, 71, 46, 57, 58] explicitly aligns the distributions based on domain discrepancy measurements, e.g., Maximum Mean Discrepancy (MMD) [6]. Another line of alignment-based UDA methods borrows ideas from Generative Adversarial Networks [20] and uses domain adversarial training to learn domain-aligned/invariant features; such methods dominate among the top-performing approaches. In the seminal Domain Adversarial Neural Network (DANN) [18, 19], a domain discriminator is trained to distinguish target features from source features, while a feature extractor (generator) is trained to generate domain-invariant features to fool this discriminator. Following DANN, a plethora of variants have been proposed [61, 41, 54, 12, 53, 64, 39, 43, 15, 11, 68].

It is noteworthy that the goal of alignment in UDA is to alleviate the adverse effect of domain shift and thereby improve the classification performance on unlabeled target data. Even though impressive progress has been made, there is a common intrinsic limitation: alignment is still not deliberately designed to dedicatedly/proactively serve the final image classification task. In many previous UDA methods, as shown in Figure 1 (a), the alignment task runs in parallel with the ultimate classification task. The assumption is that learning domain-invariant features (via alignment) reduces the domain gap and thus makes the image classifier trained on the source domain readily applicable to the target domain [3]. However, with alignment treated as a parallel task, there is no mechanism to make it explicitly assist classification, and the alignment may contaminate the features that are discriminative for classification [29]. Previous works (e.g., CDAN [41]) exploit class information (e.g., predicted class probability) as a condition to the discriminator. MADA [45] implements class-level domain alignment by applying one discriminator per class. Their purpose is to provide additional helpful information to the discriminator [41] or to perform class-level alignment [45], but they still fall short of explicitly making alignment assist classification.

Some works move a step forward and investigate which features the networks should align for better adaptation. [66, 33] focus on transferable local regions, selected based on the uncertainty or entropy of the domain discriminator, for alignment. However, such self-induced feature selection is still not specific to the optimization of the classification task; instead, it is based on the alignment task itself. There is no guarantee that the alignment positively serves the classification task. Hsu et al. [24] carry out object centerness-aware alignment by aligning the center parts of objects to exclude background distraction/noise for domain adaptive object detection. However, the feature at the object center position could be task-irrelevant and is thus not well suited for alignment. Moreover, regarding such a centerness feature as the alignment objective is somewhat ad-hoc and is still not designed directly from the perspective of assisting classification.

Figure 2: Conceptual comparison between (a) previous alignment and (b) our proposed task-oriented alignment. $\{\mathbf{f}^{t}\}$ and $\{\mathbf{f}^{s}\}$ denote the sets of target features and source features, respectively. (a) Previous methods take each source feature as a holistic one for alignment with target features. (b) We decompose each source feature $\mathbf{f}^{s}$ into a task-discriminative positive feature $\mathbf{f}^{s}_{p}$ and a task-irrelevant negative feature $\mathbf{f}^{s}_{n}$, and make the target features align with the positive source features $\{\mathbf{f}^{s}_{p}\}$ while avoiding alignment with the negative source features $\{\mathbf{f}^{s}_{n}\}$.

We pinpoint that the selection of the "right" features to achieve task-oriented alignment is important. For classification, the essence is to train the network to extract class-discriminative features. Similarly, for UDA classification, it is also desired to assure strong discrimination of the target domain features without class label supervision. Thus, we intend to align target features to the task-discriminative source features while ignoring the task-irrelevant ones. Note that the feature of a source sample contains both task/classification-discriminative and task-irrelevant information, because the network is in general not able to perfectly suppress non-discriminative feature responses (e.g., responses unrelated to the image class or those related to other tasks such as alignment) [55, 9]. Aligning target features with task-irrelevant source features would prevent alignment from serving classification and lead to poor adaptation. Intuitively, for example, image style, a non-causal factor for classification, can be considered task-irrelevant information, and a bias towards such a factor in alignment may hurt the classification task. We demonstrate this by conducting experiments where only the source task-irrelevant features are used for alignment with the target, i.e., the scheme Baseline+TiAlign in Figure 3. The performance of Baseline+TiAlign (in purple) on the target test set drops drastically compared to the source-only method, which does not incorporate any alignment technique. This corroborates that aligning with task-irrelevant features is even harmful to classification on the target domain.

Motivated by this, in this paper, we propose an effective UDA method named Task-oriented Alignment (ToAlign) to make the domain alignment explicitly serve classification. We achieve this by performing feature alignment guided by the meta-knowledge induced from the classification task, so that the target features align with the task-discriminative source features (i.e., "positive" features) and avoid interference from the task-irrelevant ones (i.e., "negative" features). Figure 2 conceptually compares our proposed alignment with the previous one. Particularly, as illustrated in Figure 1 (b), to obtain a suitable feature from a source sample for alignment with target samples, we leverage the classification task to guide the extraction/distillation of the task-related/discriminative feature $\mathbf{f}_{p}^{s}$ from the original feature $\mathbf{f}^{s}$. Correspondingly, for the domain alignment task, we enforce aligning target features with the source positive features via domain adversarial training to achieve task-oriented alignment. In this way, the domain alignment better assists the classification task.

We summarize our main contributions as follows:

  • We pinpoint that the selection of the "right" features to achieve task-oriented alignment is important for adaptation.

  • We propose an effective UDA approach named ToAlign which enables the alignment to explicitly serve classification. We decompose a source feature into a task-discriminative one and a task-irrelevant one under the guidance of classification meta-knowledge for performing classification-oriented alignment, which explicitly guides the network on which features should be aligned.

Extensive experimental results demonstrate the effectiveness of ToAlign. ToAlign is generic and can be applied to different adversarial learning based UDA methods to enhance their adaptation capability, helping achieve state-of-the-art performance with a negligible increase in training complexity and no increase in inference complexity.

2 Related Work

Figure 3: Classification accuracy on the target domain (Rw→Cl in Office-Home) for different methods. TiAlign denotes aligning target features with task-irrelevant source features.
Figure 4: Visualization of task-discriminative and task-irrelevant features. The positive features generally focus on the foreground objects, which provide the most discriminative information for classification, while the negative ones focus on non-discriminative background regions. The images are sampled from Office-Home.

Unsupervised Domain Adaptation aims to transfer knowledge from labeled source domain(s) to an unlabeled target domain. Ben-David et al. [3] theoretically reveal that learning domain-invariant representations helps make the image classifier trained on the source domain applicable to the target domain. Various works learn domain-invariant features by aligning the source and target distributions measured by some metric [6, 71, 46, 57, 58, 48], or by domain adversarial learning [61, 41, 54, 12, 53, 64, 39, 74, 59, 43, 15, 11, 68, 8, 30]. The latter has been overwhelmingly popular in recent years owing to its superiority in dealing with distribution-matching problems [20]. Note that our proposed method is designed to enhance the capability of the widely used domain adversarial learning based approaches.

In domain adversarial learning based approaches (e.g., DANN [18, 19]), a domain discriminator is in general trained to distinguish the source domain from the target domain, while a feature extractor is trained to learn domain-invariant features. Many variants of DANN have been proposed [41, 15, 59, 11, 13, 74, 36, 5]. CDAN [41] further conditions the discriminator on the image class information conveyed in the classifier predictions. MADA [45] implements class-wise alignment with multiple discriminators. GSDA [25] performs class-, group- and domain-wise alignments simultaneously, where the three types of alignment are enforced to be consistent in their gradients for more precise alignment. HDA [13] leverages domain-specific representations as heuristics to obtain domain-invariant representations from a heuristic search perspective. CMSS [70] exploits Curriculum Learning (CL) [4] to align target samples with dynamically selected source samples, exploiting the different transferability of the source samples.

However, in these methods, the domain alignment is designed as a task in parallel with the image classification task. It does not explicitly take serving classification as its mission, and such alignment may result in a loss of discriminative information. Jin et al. [29] remedy the loss of discriminative information caused by alignment by incorporating a restoration module. Wei et al. [68] pinpoint that alignment and classification are not well coordinated in optimization and may contradict each other. They thus propose to use meta-learning to coordinate their optimization directions.

In this paper, to make alignment explicitly serve classification, we propose a task-oriented alignment. Guided by the classification meta-knowledge, task-discriminative sub-features are selected for alignment. Different from [68], we investigate which features should be aligned to assist classification and intend to provide more interpretable alignment. We are the first to perform task-oriented alignment by decomposing a source feature into a task-discriminative and a task-irrelevant feature, and explicitly guide the network on which sub-features should be aligned. Note that Huang et al. [27] propose to decouple features into domain-invariant and domain-specific features, where the former are aligned for unsupervised person re-identification. [47, 7] exploit the VAE framework with several complex losses to perform disentanglement from the perspective of domain and semantics simultaneously, and only use domain-invariant semantics for inference, leaving domain-specific but task-related information underexplored. In contrast to focusing on the domain level, our decomposition strategy focuses on the task level, guided by the image classification task, and we further perform domain alignment on the task-discriminative features to proactively serve image classification.

3 Task-Oriented Alignment for UDA

Unsupervised domain adaptation (UDA) for classification aims to train a classification model on a labeled source domain image set $\mathbf{X}_{s}$ and an unlabeled target domain image set $\mathbf{X}_{t}$ to obtain high classification accuracy on a target domain test set.

Most popular adversarial learning based UDA methods attempt to align the features of the source and target domains to alleviate the domain gap and improve classification performance on the target domain. As mentioned before, aligning based on holistic features is sub-optimal, since such alignment does not explicitly serve classification. To address this, as illustrated in Figure 1 (b), we propose an effective task-oriented alignment to explicitly make the alignment serve classification. Particularly, we propose to decompose a source sample feature into a task-discriminative one that should be aligned, and a task-irrelevant one that should be ignored, based on the classification meta-knowledge. Then, we perform alignment between the target features and the positive source features, which is consistent with the essence of the classification task, i.e., focusing on discriminative features.

In Sec. 3.1, to be self-contained, we briefly introduce adversarial learning based UDAs. We answer the question of what feature should be aligned to better serve classification and introduce our task-oriented feature decomposition and alignment in Sec. 3.2.

3.1 Recap of Domain Adversarial UDAs

Domain adversarial learning based UDA methods typically train a domain discriminator $D$ to distinguish which domain (i.e., source or target) a sample belongs to, and adversarially train a feature extractor $G$ to fool the discriminator $D$ in order to learn domain-invariant feature representations. The network is also trained under the supervision of image classification on the labeled source samples. Particularly, $D$ is optimized to minimize the domain classification loss $\mathcal{L}_{D}$ (i.e., a binary cross-entropy loss). Meanwhile, $G$ is optimized to maximize the domain classification loss $\mathcal{L}_{D}$ and minimize the image classification loss $\mathcal{L}_{cls}$ (i.e., a cross-entropy loss):

\begin{split}&\operatorname*{argmin}_{D}\ \mathcal{L}_{D},\\ &\operatorname*{argmin}_{G}\ \mathcal{L}_{cls}-\mathcal{L}_{D}.\end{split}   (1)

To achieve adversarial training, a gradient reversal layer (GRL) [18, 19] connecting $G$ and $D$ is usually used, which multiplies the gradient from $D$ by a negative constant during back-propagation to $G$. $\mathcal{L}_{D}$ is typically defined as [19, 41, 15]:

\mathcal{L}_{D}(\mathbf{X}_{s},\mathbf{X}_{t})=-\mathbb{E}_{\mathbf{x}_{s}\sim\mathbf{X}_{s}}\left[\log(D(G(\mathbf{x}_{s})))\right]-\mathbb{E}_{\mathbf{x}_{t}\sim\mathbf{X}_{t}}\left[\log(1-D(G(\mathbf{x}_{t})))\right].   (2)
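For concreteness, below is a minimal sketch (assuming PyTorch) of the gradient reversal layer and the loss of Eq. (2); the class and function names are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambd backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # G receives the reversed gradient, so minimizing L_D for D
        # simultaneously maximizes it for G (the min-max of Eq. (1)).
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def domain_loss(D, f_s, f_t, lambd=1.0):
    """Eq. (2): D labels source features 1 and target features 0 (D outputs logits)."""
    d_s = D(grad_reverse(f_s, lambd))
    d_t = D(grad_reverse(f_t, lambd))
    return F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
           F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
```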

3.2 Task-oriented Feature Decomposition and Alignment

In adversarial learning based UDA methods, the feature ingested by $D$ as a holistic feature from a source or target sample in general contains both task/classification-discriminative information and task-irrelevant information. Intuitively, aligning the task-irrelevant features would not effectively reduce the domain gap of the task-discriminative features and thus brings no obvious benefit to the classification task. Worse, mistakenly aligning the target features with the source task-irrelevant features would hurt the discrimination power of the target features. We also confirm this experimentally in Figure 3: aligning with task-irrelevant features (TiAlign, line in purple) drastically reduces the classification accuracy on the target domain. Therefore, we propose to decompose the holistic feature of each source sample into a task-discriminative feature and a task-irrelevant feature to enable task-oriented alignment with the target features.

Particularly, we softly select/re-weight (based on Grad-CAM [55]) the feature vector $\mathbf{f}^{s}$ of a source sample to obtain the task-discriminative feature $\mathbf{f}_{p}^{s}$ that is discriminative for identifying the ground-truth class, which we refer to as the positive feature. Correspondingly, the task-irrelevant feature $\mathbf{f}_{n}^{s}$ can be obtained simultaneously, which we refer to as the negative feature.

Task-Oriented Feature Decomposition. Grad-CAM [78, 55, 9] is a widely used technique for localizing the features most important for classification in a convolutional neural network. As analyzed in [78, 55, 9, 56], the gradients of the final predicted score for the ground-truth class (w.r.t. the feature used for classification) convey task-discriminative information, identifying the features relevant to recognizing the image class correctly. It is noteworthy that such task-discriminative information is in general highly related to (but not limited to) the foreground object in the classification task. In this work, motivated by Grad-CAM, we propose to use the gradients of the predicted score for the ground-truth class as attention weights to obtain the task-discriminative features.

As illustrated in Figure 1, we obtain a feature map $F\in\mathbb{R}_{+}^{H\times W\times M}$ (i.e., a tensor of non-negative real numbers, with height $H$, width $W$, and $M$ channels) from the final convolutional block (with ReLU layer) of the feature extractor. After spatial global average pooling (GAP), we have a feature vector $\mathbf{f}=pool(F)\in\mathbb{R}^{M}$. The logits for all classes are predicted via the classifier $C(\cdot)$. Based on the response $C(\mathbf{f})$, we can derive the gradient $\mathbf{w}^{cls}\in\mathbb{R}^{M}$ of $y^{k}$ w.r.t. $\mathbf{f}$:

\mathbf{w}^{cls}=\frac{\partial y^{k}}{\partial\mathbf{f}},   (3)

where $y^{k}$ is the predicted score corresponding to the ground-truth class $k$. As analyzed in [55, 9, 56], the gradient $\mathbf{w}^{cls}$ conveys the channel-wise importance of the feature $\mathbf{f}$ for classifying the sample into its ground-truth class $k$. We draw inspiration from Grad-CAM, which uses $\mathbf{w}^{cls}$ to modulate the feature map channel-wise to find the classification-discriminative features. Similarly, by modulating with $\mathbf{w}^{cls}$, we obtain the task-discriminative (i.e., positive) feature as:

\mathbf{f}_{p}=\mathbf{w}^{cls}_{p}\odot\mathbf{f}=s\,\mathbf{w}^{cls}\odot\mathbf{f},   (4)

where $\odot$ represents the Hadamard product and the attention weight vector is $\mathbf{w}^{cls}_{p}=s\,\mathbf{w}^{cls}$, where $s\in\mathbb{R}_{+}$ is an adaptive non-negative parameter that modulates the energy $\mathcal{E}(\mathbf{f}_{p})=\|\mathbf{f}_{p}\|_{2}^{2}$ of $\mathbf{f}_{p}$ such that $\mathcal{E}(\mathbf{f}_{p})=\mathcal{E}(\mathbf{f})$:

s=\sqrt{\frac{\|\mathbf{f}\|_{2}^{2}}{\|\mathbf{w}^{cls}\odot\mathbf{f}\|_{2}^{2}}}=\sqrt{\frac{\sum_{m=1}^{M}f_{m}^{2}}{\sum_{m=1}^{M}(w^{cls}_{m}f_{m})^{2}}}.   (5)

Motivated by the counterfactual analysis in [55], the task-irrelevant (i.e., negative) feature can be represented as $\mathbf{f}_{n}=-\mathbf{w}^{cls}_{p}\odot\mathbf{f}$, where $-\mathbf{w}^{cls}_{p}$ suppresses the task-discriminative channels, since the task-discriminative channels (with larger values in $\mathbf{w}^{cls}_{p}$) correspond to smaller values in $-\mathbf{w}^{cls}_{p}$.
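The decomposition of Eqs. (3)-(5) reduces to a few lines of code. Below is a minimal sketch (assuming PyTorch); the function and tensor names, the batch layout, and the small eps term for numerical stability are our own assumptions.

```python
import torch

def decompose(feat_map, C, labels, eps=1e-6):
    """feat_map: (B, M, H, W) from the last conv block; C: the image classifier."""
    f = feat_map.mean(dim=(2, 3))                      # GAP -> f in R^M, per sample
    logits = C(f)
    y_k = logits.gather(1, labels.view(-1, 1)).sum()   # ground-truth class scores
    # Eq. (3): gradient of y^k w.r.t. f, used only as (detached) attention weights
    w_cls = torch.autograd.grad(y_k, f, retain_graph=True)[0].detach()
    wf = w_cls * f                                     # Hadamard product
    # Eq. (5): adaptive s keeps the energy of f_p equal to that of f
    s = torch.sqrt(f.pow(2).sum(1, keepdim=True)
                   / (wf.pow(2).sum(1, keepdim=True) + eps)).detach()
    f_p = s * wf    # Eq. (4): task-discriminative (positive) feature
    f_n = -s * wf   # task-irrelevant (negative) feature
    return f_p, f_n
```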

To better understand and validate the discriminativeness of the positive and negative features, we visualize the spatial maps $F$ with channels modulated by $\mathbf{w}^{cls}$ and $-\mathbf{w}^{cls}$, following [55, 78]. As shown in Figure 4, the positive information is more related to the foreground objects that provide the discriminative information for the classification task, while the negative information is more connected with the non-discriminative background regions.

Task-oriented Domain Alignment. As discussed above, we expect the domain alignment to explicitly serve the final classification task. Given the source task-discriminative features obtained based on the classification meta-knowledge, we can guide the target features to be aligned with the source task-discriminative features $\mathbf{f}_{p}$ through different domain adversarial learning based alignment methods [18, 19, 13]. The procedure is almost the same as in the UDA methods discussed in Sec. 3.1, except that the input source feature $\mathbf{f}^{s}$ to the final domain discriminator is replaced by the positive feature $\mathbf{f}^{s}_{p}$ of that source sample. Thus, the domain classification loss is defined with a small modification of Eq. (2):

\mathcal{L}_{D}(\mathbf{X}_{s},\mathbf{X}_{t})=-\mathbb{E}_{\mathbf{x}_{s}\sim\mathbf{X}_{s}}\left[\log(D(G^{p}(\mathbf{x}_{s})))\right]-\mathbb{E}_{\mathbf{x}_{t}\sim\mathbf{X}_{t}}\left[\log(1-D(G(\mathbf{x}_{t})))\right],   (6)

where $G^{p}(\mathbf{x}_{s})=\mathbf{f}_{p}^{s}$ denotes the positive feature of the source sample $\mathbf{x}_{s}$.
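Putting the pieces together, a sketch of one ToAlign training step might look as follows, assuming PyTorch and the `grad_reverse` and `decompose` helpers sketched above; the only change from the baseline of Eq. (2) is that the discriminator sees the source positive feature.

```python
import torch
import torch.nn.functional as F

def toalign_step(G, C, D, x_s, y_s, x_t, lambd=1.0):
    feat_s, feat_t = G(x_s), G(x_t)            # conv feature maps (B, M, H, W)
    f_s = feat_s.mean(dim=(2, 3))
    f_t = feat_t.mean(dim=(2, 3))

    loss_cls = F.cross_entropy(C(f_s), y_s)    # supervised source classification

    f_p_s, _ = decompose(feat_s, C, y_s)       # task-discriminative source feature
    d_s = D(grad_reverse(f_p_s, lambd))        # Eq. (6): source side uses f_p^s
    d_t = D(grad_reverse(f_t, lambd))          # target side is unchanged
    loss_d = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
             F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
    return loss_cls + loss_d                   # backpropagate and step as usual
```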

Understanding from the Meta-knowledge Perspective. To enable a better understanding of why ToAlign works well, we analyze it here from the perspective of meta-learning with meta-knowledge.

In an adversarial UDA framework, the image classification task and the domain alignment task can be considered a meta-train task $\mathcal{T}^{tr}$ and a meta-test task $\mathcal{T}^{te}$, respectively. ToAlign actually introduces knowledge communication from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$. In the meta-training stage, we can obtain the prior/meta-knowledge $\phi^{tr}$ of $\mathcal{T}^{tr}$. Without effective communication between $\mathcal{T}^{tr}$ and $\mathcal{T}^{te}$, the optimization of $\mathcal{T}^{te}$ may contradict that of $\mathcal{T}^{tr}$, considering that they have different optimization goals. To improve the knowledge communication from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$, certain meaningful prior/meta-knowledge $\phi^{tr}$ is helpful for a more effective $\mathcal{T}^{te}|_{\phi^{tr}}$. A typical implementation of passing meta-knowledge from $\mathcal{T}^{tr}$ to $\mathcal{T}^{te}$ is based on gradients [40, 17, 35, 68, 34], i.e., $\nabla\mathcal{T}^{tr}$, which provides knowledge of $\mathcal{T}^{tr}$. Other mechanisms, e.g., leveraging a parameter regularizer in the manner of weight decay, have also been exploited [1, 76]. In our ToAlign, instead of encoding the meta-knowledge $\phi^{tr}$ into the gradients w.r.t. the parameters, we use $\mathcal{T}^{tr}$ to learn/derive attention weights that identify $\mathcal{T}^{tr}$-related sub-features in the feature space, and then pass this prior/meta-knowledge $\phi^{tr}$ to $\mathcal{T}^{te}$ so that the meta-test task $\mathcal{T}^{te}|_{\phi^{tr}}$ adapts its optimization based on $\phi^{tr}$.

In this work, we are essentially motivated by the reliable human prior knowledge about what should be aligned across domains to better assist the classification task in UDA (i.e., task/classification-discriminative features), while excluding interference from task-irrelevant ones. Accordingly, in our design, we obtain the prior/meta-knowledge for identifying task-discriminative features from the classification task (meta-train) and apply it to the domain alignment task (meta-test) to achieve task-oriented alignment.

4 Experiments

To evaluate the effectiveness of ToAlign, we conduct comprehensive experiments under three domain adaptation settings: single-source unsupervised domain adaptation (SUDA), multi-source unsupervised domain adaptation (MUDA), and semi-supervised domain adaptation (SSDA). For SSDA, domain adaptation is performed from a labeled source domain to a partially labeled target domain [16].

4.1 Datasets and Implementation Details

Datasets. We use two commonly used benchmark datasets (i.e., Office-Home [63] and VisDA-2017 [49]) for SUDA, and a large-scale dataset, DomainNet [46], for MUDA and SSDA.

Method Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
Source-Only [22] 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
MCD(CVPR’18) [53] 48.9 68.3 74.6 61.3 67.6 68.8 57.0 47.1 75.1 69.1 52.2 79.6 64.1
CDAN(NeurIPS’18) [41] 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
ALDA(AAAI’20) [11] 53.7 70.1 76.4 60.2 72.6 71.5 56.8 51.9 77.1 70.2 56.3 82.1 66.6
SymNets(CVPR’19) [74] 47.7 72.9 78.5 64.2 71.3 74.2 63.6 47.6 79.4 73.8 50.8 82.6 67.2
TADA(AAAI’19) [66] 53.1 72.3 77.2 59.1 71.2 72.1 59.7 53.1 78.4 72.4 60.0 82.9 67.6
MDD(ICML’19) [73] 54.9 73.7 77.8 60.0 71.4 71.8 61.2 53.6 78.1 72.5 60.2 82.3 68.1
BNM(CVPR’20) [14] 56.2 73.7 79.0 63.1 73.6 74.0 62.4 54.8 80.7 72.4 58.9 83.5 69.4
GSDA(CVPR’20) [25] 61.3 76.1 79.4 65.4 73.3 74.3 65.0 53.2 80.0 72.2 60.6 83.1 70.3
GVB(CVPR’20) [15] 57.0 74.7 79.8 64.6 74.1 74.6 65.2 55.1 81.0 74.6 59.7 84.3 70.4
E-Mix(AAAI’21) [77] 57.7 76.6 79.8 63.6 74.1 75.0 63.4 56.4 79.7 72.8 62.4 85.5 70.6
MetaAlign(CVPR’21) [68] 59.3 76.0 80.2 65.7 74.7 75.1 65.7 56.5 81.6 74.1 61.1 85.2 71.3
DANNP [68] 54.2 70.0 77.6 62.3 72.4 73.1 61.3 52.7 80.0 72.0 56.8 83.1 67.9
DANNP+ToAlign 56.8↑ 74.8↑ 79.9↑ 64.0↑ 73.9↑ 75.3↑ 63.8↑ 53.7↑ 81.1↑ 73.1↑ 58.2↑ 84.0↑ 69.9↑
HDA(NeurIPS’20) [13] 56.8 75.2 79.8 65.1 73.9 75.2 66.3 56.7 81.8 75.4 59.7 84.7 70.9
HDA+ToAlign 57.9↑ \textbf{76.9}↑ \textbf{80.8}↑ \textbf{66.7}↑ \textbf{75.6}↑ \textbf{77.0}↑ \textbf{67.8}↑ \textbf{57.0}↑ \textbf{82.5}↑ 75.1↓ 60.0↑ 84.9↑ \textbf{72.0}↑
Table 1: Accuracy (%) of different UDAs on Office-Home with ResNet-50 as backbone. Best in bold.

1) Office-Home [63] consists of images from four different domains: Art (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw).

Methods Clipart Infograph Painting Quickdraw Real Sketch Avg.
Source-Only [22] 47.6±0.52 13.0±0.41 38.1±0.45 13.3±0.39 51.9±0.85 33.7±0.54 32.9±0.54
ADDA(CVPR’17) [61] 47.5±0.76 11.4±0.67 36.7±0.53 14.7±0.50 49.1±0.82 33.5±0.49 32.2±0.63
DANN(ICML’15) [18] 45.5±0.59 13.1±0.72 37.0±0.69 13.2±0.77 48.9±0.65 31.8±0.62 32.6±0.68
DCTN(CVPR’18) [69] 48.6±0.73 23.5±0.59 48.8±0.63 7.2±0.46 53.5±0.56 47.3±0.47 38.2±0.57
MCD(CVPR’18) [53] 54.3±0.64 22.1±0.70 45.7±0.63 7.6±0.49 58.4±0.65 43.5±0.57 38.5±0.61
M3SDA(ICCV’19) [46] 57.2±0.98 24.2±1.21 51.6±0.44 5.2±0.45 61.6±0.89 49.6±0.56 41.5±0.74
M3SDA-β(ICCV’19) [46] 58.6±0.53 26.0±0.89 52.3±0.55 6.3±0.58 62.7±0.51 49.5±0.76 42.6±0.64
MDAN(NeurIPS’18) [75] 60.3±0.41 25.0±0.43 50.3±0.36 8.2±1.92 61.5±0.46 51.3±0.58 42.8±0.69
MLMSDA(Arxiv’20) [37] 61.4±0.79 26.2±0.41 51.9±0.20 19.1±0.31 57.0±1.04 50.3±0.67 44.3±0.57
GVBG(CVPR’20) [15] 61.5±0.44 23.9±0.71 54.2±0.46 16.4±0.57 67.8±0.98 52.5±0.62 46.0±0.63
CMSS(ECCV’20) [70] 64.2±0.18 28.0±0.20 53.6±0.39 16.0±0.12 63.4±0.21 53.8±0.35 46.5±0.24
HDA(NeurIPS’20) [13] 63.6±0.35 25.9±0.16 56.1±0.38 16.6±0.54 69.1±0.42 54.3±0.26 47.6±0.40
Baseline 66.4±0.24 24.7±0.16 57.3±0.10 11.5±0.17 69.2±0.21 55.2±0.13 47.3±0.19
Baseline+ToAlign 67.0±0.22 25.9±0.20 57.8±0.32 12.2±0.14 70.7±0.25 56.0±0.18 48.2±0.22
Table 2: Accuracy (%) of different MUDA methods on DomainNet with ResNet-101 as backbone. Best in bold.

Each domain contains 65 object categories found in office and home environments. Following the typical settings [15, 13, 68, 41], we evaluate methods on one-source to one-target domain adaptation, resulting in 12 adaptation cases in total. 2) VisDA-2017 [49] is a synthetic-to-real dataset for domain adaptation with over 280,000 images across 12 categories, where the source images are synthetic and the target images are real images collected from the MS COCO dataset [38]. 3) DomainNet [46] is a large-scale dataset containing about 600,000 images across 345 categories, which span 6 domains with large domain gaps: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R), and Sketch (S).

Method Acc.
DANNP 67.9
DANNP+ToAlign ($s=1$) 59.7
DANNP+ToAlign ($s=8$) 68.8
DANNP+ToAlign ($s=16$) 69.7
DANNP+ToAlign ($s=64$) 70.0
DANNP+ToAlign ($s=128$) 69.8
DANNP+ToAlign (adaptive $s$) 69.9
Table 3: Ablation study on the influence of $s$ in Eq. (5).
Method Time/ms GPU mem./MB Acc./%
DANNP 550 6,660 67.9
DANNP+MetaAlign [68] 1,000 10,004 69.5
DANNP+ToAlign 590 6,668 69.9
Table 4: Training complexity comparison (on GTX TITAN X GPU) in terms of computational time (of one iteration) and GPU memory for a mini-batch with batch size 32.

For MUDA, following the settings in [46, 70, 13, 34, 62], we evaluate methods on five-source to one-target domain adaptation, resulting in 6 MUDA cases in total. For SSDA, we adopt the typical protocol in [23, 51, 13], with 7 SSDA cases conducted on 4 sub-domains (i.e., C, R, P and S) with 126 sub-categories selected from DomainNet. All methods are evaluated under the one-shot and three-shot settings, where, besides unlabeled samples, one/three labeled sample(s) per class in the target domain are available during training.

Implementation Details. We apply our ToAlign on top of two different baseline schemes: DANNP [15, 68] and HDA [13]. DANNP is an improved variant of the classical adversarial learning based adaptation method DANN [18, 19], where the domain discriminator $D$ is conditioned on the predicted class probabilities. HDA is a state-of-the-art adversarial training based method which leverages domain-specific representations as heuristics to obtain domain-invariant representations.

We use ResNet-50 [22] pre-trained on ImageNet [31] as the backbone for SUDA, and ResNet-101 and ResNet-34 for MUDA and SSDA, respectively. Following [68, 41, 13], the image classifier $C$ is composed of one fully connected layer. The discriminator $D$ consists of three fully connected layers with inserted dropout and ReLU layers. We follow [74] and adopt an annealing strategy for the learning rate $\eta$, i.e., $\eta_{t}=\frac{\eta_{0}}{(1+\gamma p)^{\tau}}$, where $p$ indicates the progress of training that increases linearly from 0 to 1, $\gamma=10$, and $\tau=0.75$. The initial learning rate $\eta_{0}$ is set to 1e-3, 3e-4, 3e-4, and 1e-3 for SUDA on Office-Home, SUDA on VisDA-2017, MUDA on DomainNet, and SSDA on DomainNet, respectively. All reported experimental results are the average of three runs with different seeds.
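As a small illustration, the annealing schedule above can be computed as follows (a sketch; the step/total-step bookkeeping is our assumption about how $p$ is tracked):

```python
def annealed_lr(eta0, step, total_steps, gamma=10.0, tau=0.75):
    """eta_t = eta_0 / (1 + gamma * p)^tau, with p growing linearly from 0 to 1."""
    p = step / total_steps
    return eta0 / (1.0 + gamma * p) ** tau

# e.g., for SUDA on Office-Home with eta0 = 1e-3:
# annealed_lr(1e-3, 0, 10000) == 1e-3; annealed_lr(1e-3, 10000, 10000) ~= 1.65e-4
```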

4.2 Ablation Study

Effectiveness of ToAlign on Different Baselines. Our proposed ToAlign is generic and applicable to different domain adversarial training based baselines, since we focus on what features to align rather than on the alignment method itself. The last four rows in Table 1 show the ablation comparisons on Office-Home. ToAlign improves the accuracy of the baselines DANNP and HDA by 2.0% and 1.1%, respectively. As can be seen from the results in Table 1, Table 2, Table 5 and Table 6, ToAlign consistently brings significant improvements over the baseline schemes under different domain adaptation settings, i.e., SUDA, MUDA and SSDA. ToAlign enables the domain alignment task to proactively serve the classification task, resulting in more effective feature alignment for image classification.

Effectiveness of Different Ways to Obtain Positive Features. As mentioned in Sec. 3.2, we use $\mathbf{w}^{cls}_{p}=s\,\mathbf{w}^{cls}$ as the attention weight (which conveys the classification prior/meta-knowledge) to derive the positive feature $\mathbf{f}_{p}$, where $s$ is a parameter that modulates the energy of $\mathbf{f}_{p}$. We study the influence of $s$ under the Rw→Cl setting on Office-Home for our scheme DANNP+ToAlign and report the results in Table 3. As discussed around Eq. (5), we can use an adaptively calculated $s$, which achieves a 2% improvement over the baseline on the target test data. Alternatively, we can treat $s$ as a preset hyper-parameter. We find that the performance drops drastically if $s$ is too small (e.g., $s=1$), because the energy of the source positive feature then becomes too weak (e.g., the source feature $\mathbf{f}$'s average energy $\mathcal{E}(\mathbf{f})$ is about 800; with $s=1$, the source positive feature's average energy $\mathcal{E}(\mathbf{f}_{p})$ is about 2), making it ineffective to align the target with the source positive features. When $s$ is larger than 16, the performance significantly outperforms the baseline and approaches the result of the adaptive $s$. As an optional design choice, we could transform the weight $\mathbf{w}^{cls}$ with an activation function $\sigma(\cdot)$ such as Sigmoid or Softmax, followed by a best-selected scaling factor $s$, i.e., $\mathbf{w}^{cls}_{p}=s\,\sigma(\mathbf{w}^{cls})$; see the sketch after this paragraph. We find the results (i.e., 69.6/69.7 for Sigmoid/Softmax) are close to those without an activation function. We reckon that what matters more is the relative importance among the elements of $\mathbf{w}^{cls}$. For simplicity, we finally adopt the adaptive $s$ (cf. Eq. (5)) for all experiments.
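A sketch of this optional activated variant, assuming PyTorch and the `w_cls` and `f` tensors from the decomposition sketch in Sec. 3.2 (here `s` is a preset scalar hyper-parameter rather than the adaptive one of Eq. (5)):

```python
import torch

# w_p = s * sigma(w_cls); Sigmoid shown, Softmax over channels is the alternative
w_p = s * torch.sigmoid(w_cls)   # or: s * torch.softmax(w_cls, dim=1)
f_p = w_p * f                    # activated positive feature
```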

Methods R→C R→P P→C C→S S→P R→S P→R Avg.
Source-Only [22] 55.6 60.6 56.8 50.8 56.0 46.3 71.8 56.9
DANN(ICML’15) [18] 58.2 61.4 56.3 52.8 57.4 52.2 70.3 58.4
ADR(ICLR’18) [52] 57.1 61.3 57.0 51.0 56.0 49.0 72.0 57.6
CDAN(NeurIPS’18) [41] 65.0 64.9 63.7 53.1 63.4 54.5 73.2 62.5
ENT(NeurIPS’05) [21] 65.2 65.9 65.4 54.6 59.7 52.1 75.0 62.6
MME(ICCV’19) [51] 70.0 67.7 69.0 56.3 64.8 61.0 76.1 66.4
CANN(Arxiv’20) [50] 72.7 70.3 69.8 60.5 66.4 62.7 77.3 68.5
GVBG(CVPR’20) [15] 70.8 65.9 71.1 62.4 65.1 67.1 76.8 68.4
HDA(NeurIPS’20) [13] 72.4 71.0 71.0 63.6 68.8 64.2 79.9 70.0
HDA+ToAlign   73.0   72.0   71.7   63.0   69.3   64.6   80.8   70.6
Table 5: Accuracy (%) of different one-shot SSDA methods on DomainNet with ResNet-34 as backbone. Best in bold.
Methods R→C R→P P→C C→S S→P R→S P→R Avg.
Source-Only [22] 60.0 62.2 59.4 55.0 59.5 50.1 73.9 60.0
ADR(ICLR’18) [52] 60.7 61.9 60.7 54.4 59.9 51.1 74.2 60.4
CDAN(NeurIPS’18) [41] 69.0 67.3 68.4 57.8 65.3 59.0 78.5 66.5
ENT(NeurIPS’05) [21] 71.0 69.2 71.1 60.0 62.1 61.1 78.6 67.6
MME(ICCV’19) [51] 72.2 69.7 71.7 61.8 66.8 61.9 78.5 68.9
MetaMME(ECCV’20) [34] 73.5 70.3 72.8 62.8 68.0 63.8 79.2 70.1
GVBG(CVPR’20) [15] 73.3 68.7 72.9 65.3 66.6 68.5 79.2 70.6
CANN(Arxiv’20) [50] 75.4 71.5 73.2 64.1 69.4 64.2 80.8 71.2
HDA(NeurIPS’20) [13] 74.5 71.5 73.9 65.9 70.1 65.9 81.9 71.8
HDA+ToAlign   75.7   72.9   75.6   66.2   71.1   66.4   83.0   73.0
Table 6: Accuracy (%) of different three-shot SSDA methods on DomainNet with ResNet-34 as backbone. Best in bold.

4.3 Comparison with the State-of-the-arts

Single Source Unsupervised Domain Adaptation (SUDA). We incorporate ToAlign into the recent state-of-the-art UDA method HDA [13], denoted as HDA+ToAlign. Table 1 compares it with previous state-of-the-art methods on Office-Home. HDA+ToAlign outperforms all the previous methods and achieves state-of-the-art performance. It is noteworthy that HDA+ToAlign achieves the best adaptation results on almost all the one-source to one-target adaptation cases, thanks to the effective feature alignment for classification. The results on VisDA-2017 can be found in the Appendix, where HDA+ToAlign outperforms HDA by 0.9%.

Figure 5: Visualization of the feature response maps on target test images. First row: Art of Office-Home. Second row: Painting of DomainNet.

Multi-source Unsupervised Domain Adaptation (MUDA). Table 2 shows the results on DomainNet, where all the methods take ResNet-101 as the feature extractor. We build our Baseline based on HDA [13]. For simplicity, we replace the multi-class domain discriminator in the original HDA with a two-class one as in [61, 18, 70]. Note that CMSS [70] selects suitable source samples for alignment, while our ToAlign selects the task-discriminative sub-feature of each sample for task-oriented alignment. Compared with the Baseline, ToAlign brings an improvement of about 0.9% and helps achieve the best performance on this more challenging dataset.

Semi-supervised Domain Adaptation (SSDA). Table 5 and Table 6 show the results on one-shot and three-shot SSDA respectively, where all the methods use ResNet-34 as backbone. To compare with previous methods, we apply ToAlign on top of HDA. HDA+ToAlign outperforms HDA by 0.6%/1.2% for one-/three-shot settings, and surpasses all previous SSDA methods.

4.4 Complexity

In Table 4, we compare the training complexity and performance of ToAlign with the baseline DANNP and with DANNP+MetaAlign [68], which incorporates meta-learning to coordinate the optimization of domain alignment and image classification. In contrast, guided by the prior knowledge of which features should be aligned to serve the classification task, we distill this meta-knowledge from the classification task and explicitly pass it to the alignment task for classification-oriented alignment, eschewing complex optimization. Compared with the baseline, ToAlign introduces negligible additional computational cost (only 7%) and occupies almost the same GPU memory, much less than DANNP+MetaAlign, which almost doubles the computational cost due to its complex meta-optimization. Thanks to our explicit design, which makes domain alignment effectively serve the classification task, ToAlign achieves performance superior to MetaAlign.

4.5 Feature Visualization

We visualize the target feature response maps $F$ (which are pooled to form the input of the image classifier) of the Baseline (DANNP) and ToAlign in Figure 5. The Baseline sometimes focuses on background features that are useless to the image classification task, since it aligns the holistic features without considering the discriminativeness of the different channels/sub-features. Thanks to our task-oriented alignment, the features with higher responses in ToAlign are in general task-discriminative, which is more consistent with human perception. More results can be found in the Appendix.

5 Conclusion

In this paper, we study what features should be aligned across domains for more effective unsupervised domain adaptive image classification. To make the domain alignment task proactively serve the classification task, we propose an effective task-oriented alignment (ToAlign). We explicitly decompose a feature in the source domain into a task-related feature that should be aligned and a task-irrelevant one that should be ignored, under the guidance of the meta-knowledge induced from the classification task itself. Extensive experiments on various datasets demonstrate the effectiveness of our ToAlign. In our future work, we will extend ToAlign to tasks beyond image classification, e.g., object detection and segmentation.

Acknowledgments and Disclosure of Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0101400 and by NSFC under Grants U1908209, 61632001, and 62021001.

References

  • [1] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, pages 998–1008, 2018.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NeurIPS, volume 19, page 137, 2007.
  • [4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
  • [5] R. Bermúdez Chacón, M. Salzmann, and P. Fua. Domain-adaptive multibranch networks. In ICLR, 2020.
  • [6] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
  • [7] R. Cai, Z. Li, P. Wei, J. Qiao, K. Zhang, and Z. Hao. Learning disentangled semantic representation for domain adaptation. In IJCAI, volume 2019, page 2060, 2019.
  • [8] J. Cao, O. Katzir, P. Jiang, D. Lischinski, D. Cohen-Or, C. Tu, and Y. Li. Dida: Disentangled synthesis for domain adaptation. arXiv preprint arXiv:1805.08019, 2018.
  • [9] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, pages 839–847, 2018.
  • [10] J. Chen, X. Qiu, P. Liu, and X. Huang. Meta multi-task learning for sequence modeling. In AAAI, volume 32, 2018.
  • [11] M. Chen, S. Zhao, H. Liu, and D. Cai. Adversarial-learned loss for domain adaptation. In AAAI, pages 3521–3528, 2020.
  • [12] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty. Re-weighted adversarial adaptation network for unsupervised domain adaptation. In CVPR, pages 7976–7985, 2018.
  • [13] S. Cui, X. Jin, S. Wang, Y. He, and Q. Huang. Heuristic domain adaptation. In NeurIPS, 2020.
  • [14] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In CVPR, pages 3941–3950, 2020.
  • [15] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian. Gradually vanishing bridge for adversarial domain adaptation. In CVPR, pages 12455–12464, 2020.
  • [16] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Semi-supervised domain adaptation with instance constraints. In CVPR, pages 668–675, 2013.
  • [17] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, page 1126–1135, 2017.
  • [18] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
  • [19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
  • [21] Y. Grandvalet, Y. Bengio, et al. Semi-supervised learning by entropy minimization. In NeurIPS, pages 281–296, 2005.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [23] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
  • [24] C.-C. Hsu, Y.-H. Tsai, Y.-Y. Lin, and M.-H. Yang. Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In ECCV, pages 733–748, 2020.
  • [25] L. Hu, M. Kan, S. Shan, and X. Chen. Unsupervised domain adaptation with hierarchical gradient synchronization. In CVPR, pages 4043–4052, 2020.
  • [26] J. Huang, D. Guan, A. Xiao, and S. Lu. Model adaptation: Historical contrastive learning for unsupervised domain adaptation without source data. In NeurIPS, 2021.
  • [27] Y. Huang, P. Peng, Y. Jin, Y. Li, and J. Xing. Domain adaptive attention learning for unsupervised person re-identification. In AAAI, pages 11069–11076, 2020.
  • [28] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, pages 12627–12637, 2019.
  • [29] X. Jin, C. Lan, W. Zeng, and Z. Chen. Feature alignment and restoration for domain generalization and adaptation. arXiv preprint arXiv:2006.12009, 2020.
  • [30] G. Kang, L. Zheng, Y. Yan, and Y. Yang. Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. In ECCV, pages 401–416, 2018.
  • [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, page 1097–1105, 2012.
  • [32] J. N. Kundu, N. Venkat, A. Revanur, R. V. Babu, et al. Towards inheritable models for open-set domain adaptation. In CVPR, pages 12376–12385, 2020.
  • [33] V. K. Kurmi, S. Kumar, and V. P. Namboodiri. Attending to discriminative certainty for domain adaptation. In CVPR, pages 491–500, 2019.
  • [34] D. Li and T. Hospedales. Online meta-learning for multi-source and semi-supervised domain adaptation. In ECCV, pages 382–403, 2020.
  • [35] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. AAAI, 2018.
  • [36] S. Li, C. H. Liu, Q. Lin, B. Xie, Z. Ding, G. Huang, and J. Tang. Domain conditioned adaptation network. In AAAI, pages 11386–11393, 2020.
  • [37] Z. Li, Z. Zhao, Y. Guo, H. Shen, and J. Ye. Mutual learning network for multi-source domain adaptation. arXiv preprint arXiv:2003.12944, 2020.
  • [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [39] H. Liu, M. Long, J. Wang, and M. Jordan. Transferable adversarial training: A general approach to adapting deep classifiers. In ICML, pages 4013–4022, 2019.
  • [40] P. Liu and X. Huang. Meta-learning multi-task communication. arXiv preprint arXiv:1810.09988, 2018.
  • [41] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In NeurIPS, pages 1645–1655, 2018.
  • [42] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In ICCV, pages 2200–2207, 2013.
  • [43] Z. Lu, Y. Yang, X. Zhu, C. Liu, Y.-Z. Song, and T. Xiang. Stochastic classifiers for unsupervised domain adaptation. In CVPR, pages 9111–9120, 2020.
  • [44] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [45] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI, volume 32, 2018.
  • [46] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
  • [47] X. Peng, Z. Huang, X. Sun, and K. Saenko. Domain agnostic learning with disentangled representations. In ICML, pages 5102–5112, 2019.
  • [48] X. Peng and K. Saenko. Synthetic to real adaptation with generative correlation alignment networks. In WACV, pages 1982–1991, 2018.
  • [49] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • [50] C. Qin, L. Wang, Q. Ma, Y. Yin, H. Wang, and Y. Fu. Opposite structure learning for semi-supervised domain adaptation. arXiv preprint arXiv:2002.02545, 2020.
  • [51] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko. Semi-supervised domain adaptation via minimax entropy. In ICCV, pages 8050–8058, 2019.
  • [52] K. Saito, Y. Ushiku, T. Harada, and K. Saenko. Adversarial dropout regularization. In ICLR, 2018.
  • [53] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018.
  • [54] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, pages 8503–8512, 2018.
  • [55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [56] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
  • [57] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
  • [58] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV, pages 443–450, 2016.
  • [59] H. Tang and K. Jia. Discriminative adversarial domain adaptation. In AAAI, volume 34, pages 5940–5947, 2020.
  • [60] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
  • [61] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 7167–7176, 2017.
  • [62] N. Venkat, J. N. Kundu, D. K. Singh, A. Revanur, and R. V. Babu. Your classifier can secretly suffice multi-source domain adaptation. In NeurIPS, 2021.
  • [63] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
  • [64] R. Volpi, P. Morerio, S. Savarese, and V. Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, pages 5495–5504, 2018.
  • [65] Q. Wang and T. Breckon. Unsupervised domain adaptation via structured prediction based selective pseudo-labeling. In AAAI, volume 34, pages 6243–6250, 2020.
  • [66] X. Wang, L. Li, W. Ye, M. Long, and J. Wang. Transferable attention for domain adaptation. In AAAI, volume 33, pages 5345–5352, 2019.
  • [67] Z. Wang, M. Yu, Y. Wei, R. Feris, J. Xiong, W.-m. Hwu, T. S. Huang, and H. Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In CVPR, pages 12635–12644, 2020.
  • [68] G. Wei, C. Lan, W. Zeng, and Z. Chen. Metaalign: Coordinating domain alignment and classification for unsupervised domain adaptation. In CVPR, 2021.
  • [69] R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, pages 3964–3973, 2018.
  • [70] L. Yang, Y. Balaji, S.-N. Lim, and A. Shrivastava. Curriculum manager for source selection in multi-source domain adaptation. In ECCV, volume 12359, pages 608–624, 2020.
  • [71] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. In ICLR, 2017.
  • [72] J. Zhang, W. Li, and P. Ogunbona. Joint geometrical and statistical alignment for visual domain adaptation. In CVPR, pages 1859–1867, 2017.
  • [73] Y. Zhang, T. Liu, M. Long, and M. Jordan. Bridging theory and algorithm for domain adaptation. In ICML, pages 7404–7413, 2019.
  • [74] Y. Zhang, H. Tang, K. Jia, and M. Tan. Domain-symmetric networks for adversarial domain adaptation. In CVPR, pages 5031–5040, 2019.
  • [75] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In NeurIPS, volume 31, pages 8559–8570, 2018.
  • [76] L. Zhao, X. Peng, Y. Chen, M. Kapadia, and D. N. Metaxas. Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge. In CVPR, pages 6528–6537, 2020.
  • [77] L. Zhong, Z. Fang, F. Liu, J. Lu, B. Yuan, and G. Zhang. How does the combined risk affect the performance of unsupervised domain adaptation approaches? In AAAI, 2021.
  • [78] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.

Appendix

A. Visualization of Decomposed Features

To better understand and validate the discriminativeness of the positive and the negative features, similar to Figure 4 in our manuscript, here we show more visualization results of the spatial maps $F$ with channels modulated by $\mathbf{w}^{cls}$ (corresponding to positive features) and $-\mathbf{w}^{cls}$ (corresponding to negative features), following [55, 78]. We can observe that the positive information is more related to the foreground objects that provide the discriminative information for the classification task, while the negative one is more connected with the non-discriminative background regions.

Figure 6: Visualization of task-discriminative and task-irrelevant features. The images are sampled from different domains of Office-Home.

B. Experiments

B.1 More Implementation Details

We use two domain alignment based methods as our baselines: 1) DANNP [68] is an improved variant of DANN [18], where the domain discriminator $D$ in DANNP is conditioned on the predicted class probabilities instead of the extracted features, as illustrated in Figure 7 (a). 2) HDA [13] draws inspiration from heuristic search and incorporates domain-specific representations as heuristics to help learn domain-invariant ones. Figure 7 (b) shows its architecture.

Figure 7: Pipelines of two representative domain alignment based UDA methods. (a) DANNP [68]. (b) HDA [13]. $\mathcal{L}_{heu}$ denotes a heuristic loss, which is implemented as an $\mathcal{L}_{1}$-norm loss [13].
Figure 8: Error bars of ToAlign on top of DANNP and HDA on Office-Home.
Figure 9: Visualization of the feature response maps on target test images. The Category (Domain) information is shown on each sample.
Method Avg.
Source-Only [22] 55.3
DANN(ICML’15) [18] 57.4
CDAN(NeurIPS’18) [41] 70.0
MDD(ICML’19) [73] 74.6
GVB(CVPR’20) [15] 75.3
HDA(NeurIPS’20) [13] 74.6
HDA+ToAlign 75.5
Table 7: Classification accuracy (%) of the Synthetic→Real setting on VisDA-2017 for SUDA using ResNet-50 as backbone. Note that HDA [13] does not report results on this dataset; we obtained the result by running their released source code.

All experimental results are obtained by running three times with different seeds. To evaluate the stability of our ToAlign, we visualize the error bars of our schemes DANNP+ToAlign and HDA+ToAlign on Office-Home in Figure 8, where we also present the error bars of the two baseline schemes DANNP and HDA. The variances of our ToAlign schemes and the corresponding baselines are close (0.41 vs. 0.40 for DANNP and 0.40 vs. 0.35 for HDA), so ToAlign does not introduce much additional instability.

B.2 Experimental Results of SUDA

As mentioned in the main manuscript, the experimental results on VisDA-2017 for SUDA are presented here in the Appendix. Table 7 shows the results, where ToAlign brings a 0.9% improvement over the baseline HDA.

B.3 Experimental Results of SSDA

We present the results for the more challenging one-shot SSDA on DomainNet in Table 5. Our ToAlign improves the baseline HDA by 0.6%, and HDA+ToAlign outperforms all the previous methods.

Comparing Table 5 (one-shot SSDA) with Table 6 (three-shot SSDA), we observe that introducing two more labeled samples per class brings about a 2.4% gain, which demonstrates that access to target annotation information (even for very few samples) is helpful for domain adaptation.

B.4 Feature Visualization

We visualize more feature response maps on the target test images in Figure 9, as a supplement to Figure 5 in the main manuscript. The Baseline sometimes focuses on background features that are useless to the image classification task, since it aligns the holistic features without considering the discriminativeness of the different channels/sub-features. Thanks to our task-oriented alignment, the features with higher responses in ToAlign are in general task-discriminative, which is more consistent with human perception.

We further visualize the learned source (red) and target (blue) feature representations (i.e., $\mathbf{f}^{s}$ and $\mathbf{f}^{t}$) using t-SNE [44] for different methods in Figure 10. Figure 10 (a) shows the embedded features of the Source-Only method, where no adaptation technique is used and the samples are very scattered. In comparison, the samples for HDA [13] (cf. Figure 10 (b)) and our HDA+ToAlign (cf. Figure 10 (c)) form more compact clusters, where the clusters of ours are more compact and the target samples lie closer to the source samples than for HDA.

C. Broader Impact

Unsupervised domain adaptation aims to obtain better performance on unlabeled target data based on the knowledge from labeled source data and unlabeled target data, which is an important and practical problem in both academia and industry. Our proposed ToAlign emphasizes that the domain alignment task should assist/serve the classification task, where we perform alignment under the guidance of meta-knowledge induced from the classification task. We also provide some understanding from the meta-knowledge perspective, where we pass the meta-train task knowledge to the meta-test task in a simple and effective way. This provides some insights on how to pass meta-knowledge more effectively for meta-learning based multi-task communication [40, 68, 10].

The major societal impact of ToAlign arises from the UDA task itself, which aims to transfer knowledge from a labeled source domain to an unlabeled target domain, leading to a heavy dependency on the source domain. The major limitation of ToAlign is that it is only applicable to domain adversarial learning based UDA methods, which nonetheless dominate among the top-performing methods. How to apply the idea to other categories of methods, e.g., pseudo-label based ones [42, 72, 65], will be investigated in the future.

ToAlign could be further improved from two perspectives. First, other ways to derive the classification meta-knowledge could be explored; we currently use the gradients as guidance, drawing inspiration from Grad-CAM [55]. Second, ToAlign could be extended to more challenging tasks like semantic segmentation and object detection.

Figure 10: t-SNE visualization of different methods on Ar→Pr of Office-Home. Red: source. Blue: target.