Adaptively-Accumulated Knowledge Transfer
for Partial Domain Adaptation
Abstract.
Partial domain adaptation (PDA) has attracted increasing attention as it deals with the realistic and challenging problem in which the source domain label space subsumes the target domain label space. Most conventional domain adaptation (DA) efforts concentrate on learning domain-invariant features to mitigate the distribution disparity across domains. For PDA, however, it is crucial to explicitly alleviate the negative influence caused by the irrelevant source domain categories. In this work, we propose an Adaptively-Accumulated Knowledge Transfer framework (A2KT) to align the relevant categories across the two domains for effective domain adaptation. Specifically, an adaptively-accumulated mechanism is explored to gradually select the most confident target samples and their corresponding source categories, promoting positive transfer with more knowledge across the two domains. Moreover, a dual-classifier architecture consisting of a prototype classifier and a multilayer perceptron classifier is built to capture intrinsic data distribution knowledge across domains from various perspectives. By maximizing the inter-class center-wise discrepancy and minimizing the intra-class sample-wise variation, the proposed model is able to obtain more domain-invariant and task-specific discriminative representations of the shared-category data. Comprehensive experiments on several partial domain adaptation benchmarks demonstrate the effectiveness of our proposed model compared with state-of-the-art PDA methods.
1. Introduction
Deep Neural Networks (DNNs) have achieved promising performance in various multimedia applications with the help of sufficient well-labeled training data, which is not always available and can be dramatically expensive to collect and annotate (Simonyan and Zisserman, 2014; Yan et al., 2016; He et al., 2016; Ding et al., 2018). Domain Adaptation (DA) has made significant progress in the common real-world situation where massive amounts of well-labeled training data of the target domain are not accessible (Jiang et al., 2017; Li et al., 2019b; Zhuo et al., 2017; Wang et al., 2018; Yao et al., 2019; Li et al., 2019a; Xia and Ding, 2020). The philosophy of domain adaptation is to transfer the knowledge of a related well-labeled source domain to the unlabeled target domain by aligning the marginal and conditional distributions while mitigating the distribution disparity across domains. Toward this goal, plenty of domain adaptation (DA) techniques have been successfully applied to various multimedia tasks such as multimodal learning (Rasiwasia et al., 2010; Shu et al., 2015; Wang et al., [n.d.]), visual object recognition (Li et al., 2019b, 2020b), and text categorization (Blitzer et al., 2006; Dai et al., 2008).
Recent domain adaptation efforts seek to capture general domain-invariant yet task-discriminative feature representations in a shared feature space for the two domains through cross-domain distribution alignment schemes. Discrepancy loss is one of the most commonly used strategies to evaluate the cross-domain distribution difference, e.g., maximum mean discrepancy (MMD) (Borgwardt et al., 2006). A number of domain adaptation efforts design various MMD loss functions to align the source and target domain marginal and conditional distributions by incorporating pseudo labels of the target domain (Long et al., 2015, 2016). Besides, the adversarial loss is another well-explored scheme to eliminate domain shift by training one or more domain discriminators against the feature generator in an adversarial manner (Bousmalis et al., 2017; Tzeng et al., 2015, 2017; Hoffman et al., 2017; Luo et al., 2017). Moreover, the latest DA works jointly consider both domain-wise alignment and task-specific category-level alignment (Saito et al., 2018a; Lee et al., 2019a; Zhang et al., 2019), or propose various reconstruction penalties to capture target-specific structures (Zhang et al., 2018b). However, all conventional domain adaptation solutions assume that the source and target domains have identical label spaces, which is not always satisfied in real life (Saenko et al., 2010).
Partial domain adaptation (PDA) focuses on the common and challenging situation in which the source domain label space subsumes the target domain label space (Cao et al., 2018b, a; Zhang et al., 2018a). Along this line, Cao et al. propose Partial Adversarial Domain Adaptation (PADA), which simultaneously eliminates negative transfer by down-weighting the source domain outlier categories when training the classifier and domain adversary, and promotes cross-domain distribution alignment in the shared label space (Cao et al., 2018b). In addition, Cao et al. propose Selective Adversarial Networks (SAN), which incorporate an instance-level and category-level weighting mechanism with multi-discriminator domain adversarial networks, not only down-weighting the source outlier classes but also aligning each target sample to its several most relevant classes to promote positive transfer for each instance (Cao et al., 2018a). On the other hand, Zhang et al. present Importance Weighted Adversarial Nets (IWAN) to alleviate the distraction of the source domain outlier classes by assigning each source sample an importance score obtained from a two-domain-classifier strategy (Zhang et al., 2018a). Similarly, Cao et al. propose the Example Transfer Network (ETN) to quantify the transferability of source samples and evaluate each sample's contribution to both the classifier and the domain discriminator (Cao et al., 2019). Unfortunately, even though most of the aforementioned PDA efforts explore re-weighting mechanisms to reduce the negative transfer of outlier source categories, adapting the cross-domain distribution over the whole source and target data and label spaces is still vulnerable to outlier source categories and misclassified samples. Besides, most existing PDA methods explicitly match the source and target domain distributions by only considering domain-wise adaptation while ignoring the alignment of class-wise distributions.
In this paper, we propose an Adaptively-Accumulated Knowledge Transfer scheme (A2KT) to manage partial domain adaptation challenges by simultaneously promoting positive transfer in the shared label space and alleviating negative transfer caused by the outlier source categories. The general idea is to gradually select confident task-relevant target samples and their corresponding categories to optimize both domain-wise distribution adaptation and class-wise distribution alignment. To sum up, the contributions of this paper are highlighted as follows:
• First, we propose an adaptively-accumulated knowledge transfer strategy to iteratively weigh and select confident task-relevant target samples and their corresponding categories under the guidance of the source domain data for effective cross-domain alignment.
• Second, we explore two different types of task-specific classifiers to capture and transfer intrinsic distribution knowledge across domains from various perspectives.
• Third, we propose a cross-domain alignment loss that aligns class-level discrimination across domains and compacts the sample-level distribution within each class.
2. Related Work
2.1. Domain Adaptation
The cross-domain data distribution discrepancy, known as domain shift, is the main challenge of domain adaptation. In recent years, plenty of works exploit the potential of deep neural networks to capture explanatory attributes and domain-invariant features, which is conducive to mitigating domain shift while transferring underlying knowledge across domains in domain adaptation tasks (Bengio et al., 2013; Donahue et al., 2014; Yosinski et al., 2014). Compared to traditional machine-learning-based domain adaptation solutions, introducing deep architectures into domain adaptation dramatically promotes the generalization ability of these frameworks (Hoffman et al., 2014; Oquab et al., 2014). Some researchers integrate high-order statistical properties of different domains into a unified framework, such as maximum mean discrepancy (MMD), to align the data distributions across domains, which successfully eliminates domain shift and achieves promising classification performance on the target domain (Long et al., 2015, 2016). By virtue of generative adversarial techniques, some works introduce a domain discriminator that distinguishes which domain a sample belongs to, optimizing the generator and discriminator in an adversarial manner (Ganin et al., 2016; Tzeng et al., 2015; Li et al., 2019b). Moreover, the latest works rethink the domain adaptation problem from various perspectives and propose dual-classifier-based frameworks that seek to align not only domain-wise data distributions but also class-specific decision boundaries (Saito et al., 2018b; Lee et al., 2019b; Zhang et al., 2019).
2.2. Partial Domain Adaptation
Unfortunately, realistic application scenarios hardly satisfy the standard domain adaptation assumption that the source and target domains share an identical label space. A more common situation is that the source domain subsumes the target domain label space, which means the source domain includes samples from additional categories beyond the ones shared with the target domain. This novel challenge, named partial domain adaptation (PDA), attracts substantial attention in transfer learning and has brought out many inspiring works on this topic. The Selective Adversarial Network (SAN) explores multiple adversarial networks to identify source samples from outlier categories and down-weight their transfer weights (Cao et al., 2018a). Partial Adversarial Domain Adaptation (PADA) extends SAN and pays more attention to class-level transferability weighting on the source classifier (Cao et al., 2018b). Similarly, Importance Weighted Adversarial Nets (IWAN) treat the sigmoid output of an auxiliary domain classifier as an indicator of the probability that each source sample comes from the target domain (Zhang et al., 2018a). The Example Transfer Network (ETN) further interprets this discriminative information as a transferability quantification of the source domain samples, through which irrelevant examples from outlier categories are down-weighted for both the task-specific classifier and the domain discriminator (Cao et al., 2019). These pioneering efforts achieve impressive performance improvements over conventional domain adaptation approaches on PDA tasks.
However, although most existing PDA solutions seek to mitigate the negative transfer caused by outlier source classes by re-weighting samples' importance to reduce the distraction, they still train and predict over the entire source domain label space, which dilutes the contribution of the discriminative information within the shared categories across domains. Besides, some of them regard the predictions of the target samples as pseudo labels to align the cross-domain conditional distribution, which can introduce severe classification errors and mislead the optimization direction of the model, especially at the initial stage of training when the classifier cannot yet handle the differently distributed unlabeled target domain samples.
Unlike previous efforts, our proposed Adaptively-Accumulated Knowledge Transfer framework (A2KT) can simultaneously align the data distribution inter-class center-wise and intra-class sample-wise, both within and across domains. Exploiting the prototype classifier and the adaptive optimization strategy helps eliminate the distraction triggered by misclassified target domain samples.

3. The Proposed Method
3.1. Preliminaries and Motivation
We are given a labeled source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and an unlabeled target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$, where each $x$ is a $d$-dimensional source/target sample and $y_i^s$ is the known label corresponding to the source sample $x_i^s$. $\mathcal{D}_s$ and $\mathcal{D}_t$ are drawn from distributions $p$ and $q$ respectively, while $p \neq q$. Since the source domain label space $\mathcal{C}_s$ subsumes the target domain label space $\mathcal{C}_t$, i.e., $\mathcal{C}_t \subset \mathcal{C}_s$, partial domain adaptation attempts to predict the unlabeled target samples with the relevant source knowledge out of the entire well-labeled source domain.
To eliminate the influence of irrelevant source categories, existing partial domain adaptation models mainly design a weighting strategy to select the relevant source categories for effective cross-domain alignment with a discrepancy loss (Zhang et al., 2018a) or an adversarial loss (Cao et al., 2018a). To mitigate the conditional distribution mismatch across the two domains, most of them rely on the pseudo labels of target samples assigned by a source-supervised neural network classifier. Due to the cross-domain distribution gap, such pseudo labels are not reliable, which further hurts the cross-domain alignment, since the neural network classifier fits the source distribution well but not the target distribution.
To address these issues, we not only detect the irrelevant source categories to eliminate their negative influence, but also select the most confident target samples during cross-domain alignment. Thus, our proposed model can adaptively select a subset of the target domain samples that are highly affiliated with the source domain and the corresponding categories to align across domains. Moreover, the prototype classifier (Snell et al., 2017) is adopted to annotate the target samples via source prototypes, since it can capture the intrinsic structure and semantic knowledge across the source and target domains. Exploring a dual-classifier architecture consisting of two different types of classifiers, a prototype classifier and a multilayer perceptron classifier, extends the ability of the proposed model to reveal task-specific knowledge from various perspectives.
3.2. Adaptively-Accumulated Knowledge Transfer
The proposed framework, shown in Figure 1, consists of three modules: 1) a domain-invariant feature generator $G$, 2) a fully-connected multilayer perceptron classifier $F$, and 3) a prototype classifier $P$. $G$ takes the source and target data as input and maps them into a shared embedding space. The extracted features are denoted as $Z_s = G(X_s)$ and $Z_t = G(X_t)$ for the source and target domain, respectively.
3.2.1. Building Diverse Source-Supervised Classifiers
With $Z_s$ and $Z_t$ as input, $F$ and $P$ can assign labels from different perspectives, denoted as $\hat{y}^F$ and $\hat{y}^P$, respectively. $F$ is a fully-connected multilayer perceptron classifier, while the prototype classifier $P$ measures the similarity between every sample and each source domain class center, followed by a Softmax function to assign a probability prediction, that is, $P(x) = s\big(G(x), \{\mu_c^s\}_{c \in \mathcal{C}_s}\big)$, where $s(\cdot,\cdot)$ is the similarity measurement function followed by Softmax and $\mu_c^s$ is the class center (prototype) of source class $c$.
In order to maintain the performance on the source domain, we keep the supervision from the source domain and minimize the cross-entropy loss between the ground-truth labels and the labels predicted by $F$ as:
(1)  $\mathcal{L}_{ce} = -\dfrac{1}{n_s}\sum_{i=1}^{n_s}\sum_{k=1}^{|\mathcal{C}_s|} \mathbb{1}\big[y_i^s = k\big]\,\log F_k\big(G(x_i^s)\big)$
As $P$ is a parameter-free classifier, we do not need to add extra supervision over the source domain data to $P$.
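To make this dual-classifier design concrete, the following is a minimal PyTorch sketch of how such a prototype classifier can assign class probabilities from source class centers while the source cross-entropy only supervises the MLP classifier. The function names (class_centers, prototype_probs) and tensor layouts are our illustrative assumptions, not the authors' exact implementation; the cosine similarity matches the measurement function stated in Section 4.1.

```python
import torch
import torch.nn.functional as Fnn

def class_centers(feats, labels, num_classes):
    """Mean embedding per class; the source-domain centers serve as the prototypes."""
    # Assumes every class index in [0, num_classes) appears at least once in `labels`.
    return torch.stack([feats[labels == c].mean(dim=0) for c in range(num_classes)])

def prototype_probs(feats, prototypes):
    """Cosine similarity of each embedding to every prototype, turned into class
    probabilities by a Softmax (the similarity-then-Softmax rule described above)."""
    sims = Fnn.normalize(feats, dim=1) @ Fnn.normalize(prototypes, dim=1).t()
    return Fnn.softmax(sims, dim=1)

# Source supervision (Eq. (1)) is applied only to the parametric MLP classifier, e.g.:
#   loss_ce = Fnn.cross_entropy(mlp_classifier(G(x_src)), y_src)
```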
3.2.2. Adaptively Accumulating Cross-Domain Knowledge
Empirical Maximum Mean Discrepancy (MMD) has been verified as a promising technique to minimize the cross-domain marginal distribution difference (Long et al., 2017). Some recent works also adopt pseudo labels for the target domain data in order to match the conditional distribution across domains, by minimizing the distance between the source and target domain class-wise embeddings from the same category (Tzeng et al., 2017). However, aligning all the target categories with the predicted label information is not effective, since pseudo labels are not reliable, especially under the PDA setting.
To alleviate the negative impact of misclassified pseudo labels assigned to target domain samples, as well as of the outlier categories in the source domain label space, we propose the adaptively-accumulated knowledge transfer strategy to discard those target samples with low prediction confidence. That is, only samples whose predicted probability confidence satisfies
(2)  $\hat{\mathcal{D}}_t = \big\{\, x_j^t \in \mathcal{D}_t \;\big|\; p_{j,\hat{y}_j^t} \geq \tau \,\big\}$
are accepted to update the cross-domain alignment, where $\hat{y}_j^t$ is the pseudo label of $x_j^t$, $p_{j,\hat{y}_j^t}$ is the probability confidence from the prototype classifier that sample $x_j^t$ belongs to class $\hat{y}_j^t$, and $\tau$ is the threshold. It is noteworthy that we do not need to add another hyper-parameter to tune the model: since the probability confidence measures the similarity between a target sample and the source domain, we let the model set the threshold adaptively as the average of the initial probability predictions that the prototype classifier produces for source samples on their ground-truth classes, i.e., $\tau = \frac{1}{n_s}\sum_{i=1}^{n_s} p_{i, y_i^s}$, where $y_i^s$ is the ground-truth label of source sample $x_i^s$. We only include highly-confident target samples in the cross-domain alignment. In other words, the selected target samples may not cover the whole label space, which is reasonable and acceptable.
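A possible implementation of this adaptive selection rule is sketched below, assuming the threshold is the average prototype-classifier probability that source samples assign to their own ground-truth class, as described above; the function names are ours.

```python
import torch

@torch.no_grad()
def adaptive_threshold(src_probs, src_labels):
    """Average prototype-classifier probability that source samples receive on their
    own ground-truth class; used as the confidence threshold tau."""
    return src_probs[torch.arange(src_labels.size(0)), src_labels].mean()

@torch.no_grad()
def select_confident_targets(tgt_probs, tau):
    """Keep only target samples whose top prototype-classifier probability reaches tau,
    and return their pseudo labels for the class-wise alignment losses."""
    confidence, pseudo_labels = tgt_probs.max(dim=1)
    keep = confidence >= tau
    return keep, pseudo_labels[keep]
```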
3.2.3. Preserving Inter-class Discrimination
We treat the class-wise embeddings in a different way. Instead of matching the source and target domain mean embeddings from the same category, we seek to enlarge the distance between the source and target domain mean embeddings from different classes. Specifically, we adopt the following distance to measure the distribution difference between two class-wise embeddings from two classes ($c_1$, $c_2$) and two domains ($d_1$, $d_2$):
(3)  $d\big(\mu_{c_1}^{d_1}, \mu_{c_2}^{d_2}\big) = \big\|\, \mu_{c_1}^{d_1} - \mu_{c_2}^{d_2} \,\big\|_2^2, \qquad \mu_c^d = \dfrac{1}{n_c^d} \sum_{z_i \in Z,\; y_i = c,\; d_i = d} z_i$
where $Z$ denotes the embedding feature matrix composed of $Z_s$ and $\hat{Z}_t$ (the embeddings of the selected target samples $\hat{\mathcal{D}}_t$), and $\mu_c^d$ denotes the class center of the data from category $c$ and domain $d$, with $n_c^d$ the corresponding number of samples.
It is noteworthy that $d_1$ and $d_2$ could be the same, because we also seek to maximize the class-wise distance between different categories within the same domain. On the contrary, $c_1$ and $c_2$ are always different. The integrated inter-class discriminative alignment loss term includes two parts: (1) aligning within the source/target domain, and (2) aligning across domains, as shown in Eq. (4):
(4)  $\mathcal{L}_{inter} = \gamma \displaystyle\sum_{d \in \{s,t\}} \sum_{c_1 \neq c_2} d\big(\mu_{c_1}^{d}, \mu_{c_2}^{d}\big) \;+\; \sum_{c_1 \neq c_2} d\big(\mu_{c_1}^{s}, \mu_{c_2}^{t}\big)$
where $\gamma$ is a hyper-parameter to balance the contribution of the within-domain and between-domain terms in $\mathcal{L}_{inter}$. It is noteworthy that the number of categories involved equals the size of the whole source label space $|\mathcal{C}_s|$ only when we align the inter-class discriminative distribution within the source domain ($d_1 = d_2 = s$). In the other situations ($d_1 = d_2 = t$, or $d_1 \neq d_2$), it is the number of categories covered by the selected target subset $\hat{\mathcal{D}}_t$, which may be smaller than the number of categories in the whole source domain label space, due to the adaptively-accumulated knowledge transfer strategy that selects only target samples with high prediction confidence.
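The sketch below illustrates one way this inter-class term can be computed from class centers; the exact functional form (squared Euclidean distances averaged over off-diagonal class pairs) is our assumption based on the text, and both center tensors are assumed to be indexed by the same set of selected classes.

```python
import torch

def inter_class_discrepancy(src_centers, tgt_centers, gamma=1.0):
    """Illustrative inter-class term: squared distances between centers of *different*
    classes, within each domain and across domains, which the objective seeks to maximize."""
    def mean_off_diagonal(dist):
        mask = ~torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
        return dist[mask].mean()

    d_ss = torch.cdist(src_centers, src_centers) ** 2   # within the source domain
    d_tt = torch.cdist(tgt_centers, tgt_centers) ** 2   # within the (selected) target domain
    d_st = torch.cdist(src_centers, tgt_centers) ** 2   # across domains

    within = mean_off_diagonal(d_ss) + mean_off_diagonal(d_tt)
    across = mean_off_diagonal(d_st)                     # same-class cross-domain pairs excluded
    return gamma * within + across
```

In training, the negative of this quantity would be added to the total loss so that minimizing the objective maximizes the inter-class separation.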
3.2.4. Pursuing Intra-class Compactness
Besides maximizing the inter-class distribution distance within and across domains, we also seek to pursue more intra-class compactness. Specifically, we develop an effective loss term to reduce the intra-class variation by minimizing the distance between every two samples belonging to the same category, from either domain, which is shown as:
(5)  $\mathcal{L}_{intra}^{c} = \dfrac{1}{n_c^2} \displaystyle\sum_{y_i = c} \sum_{y_j = c} \big\| z_i - z_j \big\|_2^2$
where $n_c$ is the total number of samples belonging to class $c$ from the source domain and the selected target samples. Thus, we further define the total loss over all intra-class sample-wise distances as:
(6)  $\mathcal{L}_{intra} = \dfrac{\beta}{|\mathcal{C}_s|} \displaystyle\sum_{c=1}^{|\mathcal{C}_s|} \mathcal{L}_{intra}^{c}$
where $|\mathcal{C}_s|$ is the number of categories in the source domain label space. It is noteworthy that for the target domain, we still only align those samples selected with high confidence to reduce the distraction of misclassification, while samples from the source domain are always aligned over the whole label space. $\beta$ is a hyper-parameter to balance the contribution of $\mathcal{L}_{intra}$.
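A hedged sketch of this compactness term follows; the per-class normalization and averaging details are assumptions consistent with Eqs. (5)-(6), and the input is expected to stack source embeddings with the confidently selected target embeddings and their (pseudo) labels.

```python
import torch

def intra_class_compactness(feats, labels, num_classes):
    """Illustrative intra-class term: mean pairwise squared distance among all samples
    (source plus confidently selected target) sharing the same label, averaged over classes."""
    total = feats.new_tensor(0.0)
    for c in range(num_classes):
        fc = feats[labels == c]
        if fc.size(0) < 2:          # absent or singleton classes contribute nothing
            continue
        pair_d = torch.cdist(fc, fc) ** 2
        total = total + pair_d.sum() / (fc.size(0) ** 2)
    return total / num_classes
```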
Table 1. Classification accuracy (%) of partial domain adaptation tasks on Office-31 with the ResNet-50 backbone (mean±std).

Method | A31→W10 | A31→D10 | W31→A10 | W31→D10 | D31→A10 | D31→W10 | Average
---|---|---|---|---|---|---|---
Source Only | 75.59±1.09 | 83.44±1.12 | 84.97±0.86 | 98.09±0.74 | 83.92±0.95 | 96.27±0.85 | 87.05±0.94
DAN (Long et al., 2015) | 59.32±0.49 | 61.78±0.56 | 67.64±0.29 | 90.45±0.36 | 74.95±0.67 | 73.90±0.38 | 71.34±0.46
DANN (Ganin et al., 2016) | 73.56±0.15 | 81.53±0.23 | 86.12±0.15 | 98.73±0.20 | 82.78±0.18 | 96.27±0.26 | 86.50±0.20
ADDA (Tzeng et al., 2017) | 75.67±0.17 | 83.41±0.17 | 84.25±0.13 | 99.85±0.12 | 83.62±0.14 | 95.38±0.23 | 87.03±0.16
RTN (Long et al., 2016) | 78.98±0.55 | 77.07±0.49 | 89.46±0.37 | 85.35±0.47 | 89.25±0.39 | 93.22±0.52 | 85.56±0.47
IWAN (Zhang et al., 2018a) | 89.15±0.37 | 90.45±0.36 | 94.26±0.25 | 99.36±0.24 | 95.62±0.29 | 99.32±0.32 | 94.69±0.31
SAN (Cao et al., 2018a) | 90.90±0.45 | 94.27±0.28 | 88.73±0.44 | 99.36±0.12 | 94.15±0.36 | 99.32±0.52 | 94.96±0.36
PADA (Cao et al., 2018b) | 96.54±0.31 | 82.17±0.37 | 95.41±0.33 | 100.0±0.00 | 92.69±0.29 | 99.32±0.45 | 92.69±0.29
DRCN (Li et al., 2020a) | 90.80 | 94.30 | 94.80 | 100.00 | 95.20 | 100.00 | 95.90
ETN (Cao et al., 2019) | 94.52±0.20 | 95.03±0.22 | 94.64±0.24 | 100.0±0.00 | 96.21±0.27 | 100.0±0.00 | 96.73±0.16
Ours (F) | 92.18±0.12 | 92.95±0.24 | 96.14±0.23 | 100.0±0.00 | 95.92±0.32 | 100.0±0.00 | 96.20±0.15
Ours (P) | 97.28±0.33 | 96.79±0.15 | 96.14±0.21 | 100.0±0.00 | 96.13±0.17 | 100.0±0.00 | 97.72±0.14
Table 2. Classification accuracy (%) of partial domain adaptation tasks on Office-31 with the VGG backbone (mean±std).

Method | A31→W10 | A31→D10 | W31→A10 | W31→D10 | D31→A10 | D31→W10 | Average
---|---|---|---|---|---|---|---
Source Only | 60.34±0.84 | 76.43±0.48 | 79.12±0.54 | 99.36±0.36 | 72.96±0.56 | 97.97±0.63 | 81.03±0.57
DAN (Long et al., 2015) | 58.78±0.43 | 54.76±0.44 | 67.29±0.20 | 92.78±0.28 | 55.42±0.56 | 85.86±0.32 | 69.15±0.37
DANN (Ganin et al., 2016) | 50.85±0.12 | 57.96±0.20 | 62.32±0.12 | 94.27±0.16 | 51.77±0.14 | 95.23±0.24 | 68.73±0.16
ADDA (Tzeng et al., 2017) | 53.28±0.15 | 58.78±0.12 | 63.34±0.08 | 95.36±0.08 | 50.24±0.10 | 94.33±0.18 | 69.22±0.12
RTN (Long et al., 2016) | 69.35±0.42 | 75.43±0.38 | 82.98±0.36 | 99.59±0.32 | 81.45±0.32 | 98.42±0.48 | 84.54±0.38
IWAN (Zhang et al., 2018a) | 82.90±0.31 | 90.95±0.33 | 93.36±0.22 | 88.53±0.16 | 89.57±0.24 | 79.75±0.26 | 87.51±0.25
SAN (Cao et al., 2018a) | 83.39±0.36 | 90.70±0.20 | 91.85±0.35 | 100.0±0.00 | 87.16±0.23 | 99.32±0.45 | 92.07±0.27
PADA (Cao et al., 2018b) | 86.05±0.36 | 81.73±0.34 | 95.26±0.27 | 100.0±0.00 | 93.00±0.24 | 99.42±0.24 | 92.54±0.24
ETN (Cao et al., 2019) | 85.66±0.16 | 89.43±0.17 | 92.28±0.20 | 100.0±0.00 | 95.93±0.23 | 100.0±0.00 | 93.88±0.13
Ours (F) | 88.44±0.24 | 86.54±0.15 | 94.98±0.38 | 100.0±0.00 | 94.98±0.21 | 99.32±0.18 | 94.04±0.19
Ours (P) | 90.48±0.23 | 90.38±0.38 | 95.19±0.16 | 100.0±0.00 | 94.67±0.19 | 99.66±0.23 | 95.06±0.20
3.3. Overall Objective and Optimization
Entropy minimization regularization is adopted to eliminate the side effect caused by classifier uncertainty, which arises from the large domain shift and from samples that are hard to transfer. Especially during the early training stage, target domain samples are easily assigned to wrong categories and may deteriorate the optimization procedure. We therefore also explore the entropy minimization regularization as:
(7)  $\mathcal{L}_{ent} = -\dfrac{1}{n_t} \displaystyle\sum_{j=1}^{n_t} \sum_{k=1}^{|\mathcal{C}_s|} F_k\big(G(x_j^t)\big) \log F_k\big(G(x_j^t)\big)$
where $|\mathcal{C}_s|$ is the number of categories in the source domain label space and $n_t$ is the number of samples in the target domain.
To sum up, we propose our overall objective function as:
(8)  $\min_{G,\,F} \;\; \mathcal{L}_{ce} + \lambda\, \mathcal{L}_{ent} + \mathcal{L}_{intra} - \mathcal{L}_{inter}$
The whole framework consists of a feature generator $G$, a multilayer perceptron classifier $F$, and a prototype classifier $P$. As $P$ is parameter-free, only $G$ and $F$ are optimized with the objective in Eq. (8). Specifically, $\mathcal{L}_{ce}$ is calculated on the source domain data, while $\mathcal{L}_{ent}$ is based on the whole target domain. However, $\mathcal{L}_{inter}$ and $\mathcal{L}_{intra}$ are only based on the selected target data, together with the corresponding source data from the same categories as the selected target samples' pseudo labels.
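For reference, a hedged sketch of the entropy term and one possible composition of the overall objective is given below, reusing the loss sketches from earlier; the weight `lam` and the exact sign conventions are assumptions from the text rather than the authors' exact code.

```python
import torch
import torch.nn.functional as Fnn

def target_entropy(tgt_logits):
    """Mean prediction entropy of the MLP classifier over all target samples (cf. Eq. (7))."""
    p = Fnn.softmax(tgt_logits, dim=1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1).mean()

# One possible composition of the overall objective in Eq. (8):
#   total = loss_ce + lam * target_entropy(logits_tgt) \
#           + intra_class_compactness(z_sel, y_sel, num_classes) \
#           - inter_class_discrepancy(src_centers, sel_tgt_centers, gamma)
#   total.backward()   # only the generator G and the MLP classifier F have trainable parameters
```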
4. Experiments
4.1. Datasets & Implementation Details
Office-31 (Saenko et al., 2010) consists of more than 4,000 images of common office objects from 31 categories. The dataset includes 3 different domains: Amazon, Webcam, and DSLR. Following the protocol of (Cao et al., 2018a), 6 different partial domain adaptation tasks are explored. For each target domain, we select the 10 categories shared between Office-31 and the Caltech-256 (Griffin et al., 2007) dataset, denoted as A10, W10, and D10. The source domain takes the whole domain data space, denoted as A31, W31, and D31.
Office-Home (Venkateswara et al., 2017) is a much larger benchmark containing images from 65 different classes in 4 domains: Ar (Art), Cl (Clipart), Pr (Product), and Rw (RealWorld). Following existing evaluation settings (Cao et al., 2018b, 2019), we have 12 partial domain adaptation tasks. From each target domain, we only select the first 25 categories in alphabetical order, while the source domain utilizes all 65 classes.
Table 3. Classification accuracy (%) of partial domain adaptation tasks on Office-Home.

Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Source Only | 46.33 | 67.51 | 75.87 | 59.14 | 59.94 | 62.73 | 58.22 | 41.79 | 74.88 | 67.40 | 48.18 | 74.17 | 61.35
DAN (Long et al., 2015) | 43.76 | 67.90 | 77.47 | 63.73 | 58.99 | 67.59 | 56.84 | 37.07 | 76.37 | 69.15 | 44.30 | 77.48 | 61.72
DANN (Ganin et al., 2016) | 45.23 | 68.79 | 79.21 | 64.56 | 60.01 | 68.29 | 57.56 | 38.89 | 77.45 | 70.28 | 45.23 | 78.32 | 62.82
ADDA (Tzeng et al., 2017) | 45.23 | 68.79 | 79.21 | 64.56 | 60.01 | 68.29 | 57.56 | 38.89 | 77.45 | 70.28 | 45.23 | 78.32 | 62.82
RTN (Long et al., 2016) | 49.31 | 57.70 | 80.07 | 63.54 | 63.47 | 73.38 | 65.11 | 41.73 | 75.32 | 63.18 | 43.57 | 80.50 | 63.07
IWAN (Zhang et al., 2018a) | 53.94 | 54.45 | 78.12 | 61.31 | 47.95 | 63.32 | 54.17 | 52.02 | 81.28 | 76.46 | 56.75 | 82.90 | 63.56
SAN (Cao et al., 2018a) | 44.42 | 68.68 | 74.60 | 67.49 | 64.99 | 77.80 | 59.78 | 44.72 | 80.07 | 72.18 | 50.21 | 78.66 | 65.30
PADA (Cao et al., 2018b) | 51.95 | 67.00 | 78.74 | 52.16 | 53.78 | 59.03 | 52.61 | 43.22 | 78.79 | 73.73 | 56.60 | 77.09 | 62.06
DRCN (Li et al., 2020a) | 54.00 | 76.40 | 83.00 | 62.10 | 64.50 | 71.00 | 70.80 | 49.80 | 80.50 | 77.50 | 59.10 | 79.90 | 69.00
ETN (Cao et al., 2019) | 59.24 | 77.03 | 79.54 | 62.92 | 65.73 | 75.01 | 68.29 | 55.37 | 84.37 | 75.72 | 57.66 | 84.54 | 70.45
Ours (F) | 61.41 | 83.81 | 86.36 | 64.15 | 74.12 | 75.15 | 67.22 | 55.44 | 83.88 | 72.15 | 60.22 | 83.59 | 72.29
Ours (P) | 62.54 | 83.92 | 86.69 | 65.44 | 74.96 | 75.04 | 67.40 | 55.14 | 84.37 | 73.25 | 60.51 | 84.09 | 72.78
Table 4. Ablation study on Office-Home (classification accuracy, %) under different training strategies; for each variant, results are reported for both classifiers.

Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
No Adaptive (F) | 51.79 | 70.42 | 79.40 | 56.16 | 62.97 | 70.40 | 60.42 | 48.15 | 76.75 | 66.08 | 63.94 | 76.58 | 65.26
No Adaptive (P) | 51.31 | 70.31 | 79.18 | 56.16 | 63.08 | 70.04 | 60.51 | 48.03 | 75.76 | 66.08 | 53.52 | 76.64 | 64.25
F Guide (F) | 62.09 | 81.01 | 83.60 | 60.75 | 64.48 | 65.27 | 65.20 | 53.52 | 84.76 | 71.23 | 56.39 | 80.06 | 69.03
F Guide (P) | 61.95 | 80.84 | 83.32 | 60.94 | 64.71 | 65.93 | 65.56 | 53.58 | 84.76 | 71.14 | 56.39 | 79.89 | 69.08
Same F & F (F1) | 56.75 | 80.06 | 87.36 | 60.20 | 64.99 | 76.97 | 65.75 | 55.14 | 83.27 | 69.30 | 55.08 | 82.18 | 69.75
Same F & F (F2) | 56.81 | 80.00 | 87.41 | 60.29 | 64.93 | 76.97 | 65.75 | 55.08 | 83.27 | 69.30 | 55.02 | 82.18 | 69.75
Ours (F) | 61.41 | 83.81 | 86.36 | 64.15 | 74.12 | 75.15 | 67.22 | 55.44 | 83.88 | 72.15 | 60.22 | 83.59 | 72.29
Ours (P) | 62.54 | 83.92 | 86.69 | 65.44 | 74.96 | 75.04 | 67.40 | 55.14 | 84.37 | 73.25 | 60.51 | 84.09 | 72.78
Comparisons: We compare the performance of our proposed method with several domain adaptation and state-of-the-art partial DA methods: Deep Adaptation Network (DAN) (Long et al., 2015), Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017), Residual Transfer Network (RTN) (Long et al., 2016), Importance Weighted Adversarial Nets (IWAN) (Zhang et al., 2018a), Selective Adversarial Network (SAN) (Cao et al., 2018a), Partial Adversarial Domain Adaptation (PADA) (Cao et al., 2018b), Example Transfer Network (ETN) (Cao et al., 2019), and Adaptive Feature Norm (AFN) (Xu et al., 2019). Specifically, DAN applies multi-kernel MMD to match the source and target domain distributions and learn transferable features across domains. ADDA combines the adversarial training idea with untied weight sharing to generate domain-invariant features. RTN jointly adapts the feature distributions as well as the source and target classifiers via a deep residual learning framework. IWAN and SAN select or re-weight outlier categories in the source domain label space to alleviate the negative influence caused by those classes that are not in the target domain label space. PADA, ETN, and AFN are state-of-the-art partial domain adaptation models. By down-weighting source domain data from outlier categories, PADA reduces the negative transfer caused by outlier classes. ETN proposes a progressive weighting scheme to quantify the transferability of source examples. AFN proposes a parameter-free approach to progressively adapt the source and target domain feature norms to a large range of values, which results in significant transfer gains.
Implementation Details: For each source-target pair, we fine-tune the ImageNet pre-trained convolutional neural network on the source domain and remove its last fully-connected layer to form the backbone network. The backbone output of all source and target domain data is then fed into two dense layers, with a 1,024-dimensional hidden layer followed by ReLU activation and a 0.1 dropout probability, forming the feature extractor $G$. We adopt the ResNet-50 network (He et al., 2016) as the backbone on Office-Home and Office-31, and also explore the VGG network (Simonyan and Zisserman, 2014) as the backbone on the Office-31 dataset. The output dimension of the generator $G$, i.e., of the embedding features, is 512. The multilayer perceptron classifier $F$ is a two-layer fully-connected neural network whose hidden layer dimension is 512 and whose output size is the number of source domain categories. For the prototype classifier $P$, we take the cosine similarity as the measurement function $s(\cdot,\cdot)$, and we directly take the source domain class centers as the prototypes; because the feature generator is updated every epoch, the prototypes are also updated along with training. All experiments are implemented in PyTorch. We train the model for 100 epochs with the Adam optimizer at a learning rate of 0.0001 and report the last-epoch results. The adaptive threshold $\tau$ is rounded to two decimal places. The trade-off hyper-parameters are set to different values on Office-31 and Office-Home; we analyze the parameter sensitivity in Section 4.3.
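A minimal PyTorch sketch of the generator and MLP classifier under the stated sizes is given below; the exact layer ordering (e.g., where dropout is applied) and the use of an off-the-shelf pre-trained ResNet-50 without source fine-tuning are our simplifying assumptions for illustration.

```python
import torch.nn as nn
from torchvision import models

class FeatureGenerator(nn.Module):
    """ResNet-50 backbone (final FC removed) followed by the two dense layers described
    above: 2048 -> 1024 with ReLU and 0.1 dropout, then 1024 -> 512 embeddings."""
    def __init__(self, emb_dim=512):
        super().__init__()
        resnet = models.resnet50(pretrained=True)   # the paper fine-tunes this on the source domain
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.bottleneck = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(1024, emb_dim),
        )

    def forward(self, x):
        return self.bottleneck(self.backbone(x))

class MLPClassifier(nn.Module):
    """Two-layer fully-connected classifier with a 512-dimensional hidden layer."""
    def __init__(self, emb_dim=512, num_classes=31):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU(),
                                 nn.Linear(512, num_classes))

    def forward(self, z):
        return self.net(z)
```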
4.2. Comparison Results
In this section, we comprehensively evaluate our proposed model against several baselines on the Office-31 and Office-Home benchmarks in terms of the prediction accuracy of the target sample labels, to demonstrate the effectiveness of our model.
Specifically, we observe that PDA methods (IWAN, SAN, PADA, DRCN, and ETN) achieve better performance than standard DA efforts such as DAN, DANN, ADDA, and RTN. ETN achieves a much greater improvement because it introduces a method to quantify the transferability of source samples. Our proposed method still outperforms all compared baselines on most partial domain adaptation tasks and obtains the best average performance.
Table 1 reports the classification accuracy on the Office-31 dataset obtained by all baselines and our model with ResNet-50 as the backbone of the feature extractor. It is noteworthy that the prototype classifier $P$ always generates better performance than the conventional multilayer perceptron classifier $F$. From the results, the prototype classifier achieves the best performance on 5 out of 6 tasks compared to all the other baselines. To be specific, its average classification accuracy reaches the best performance of 97.72%, and it reaches 100% accuracy on W31→D10 and D31→W10.
Moreover, we also explore the VGG network as the feature extractor backbone on the Office-31 dataset and report the results in Table 2. Our proposed model achieves the best average performance compared with the other baselines. Specifically, compared to the best baseline performance on task A31→W10 (PADA), Ours (F) and Ours (P) improve the accuracy by over 2% to 88.44% and by over 4% to 90.48%, respectively. It is noteworthy that the performance improvement with the VGG backbone is more significant than with ResNet-50, because ResNet-50 is a more advanced deep convolutional neural network, which already generates more task-specific discriminative features than VGG.
Experimental results on the Office-Home dataset are reported in Table 3. Both Ours (F) and Ours (P) obtain better performance than the other baselines, with significant improvements in average classification accuracy (72.29% and 72.78%). Moreover, our proposed method achieves more than 6% accuracy increase over the state-of-the-art baseline on several tasks, e.g., Ar→Pr and Cl→Pr.


4.3. Ablation Analysis

First, we visualize the generator output features before and after the domain adaptation process on task Ar→Cl of Office-Home and A→W of Office-31 in Fig. 2 (a) and (b). From the results, we observe that our proposed method aligns the source and target domain samples with respect to categories and tightens the compactness of the embedding features around each class center.
Secondly, we evaluate the contribution of every loss term in Eq. (8) by removing each specific term while keeping the other terms as in the original framework. The results are shown in Fig. 3. It is noteworthy that both $\mathcal{L}_{inter}$ and $\mathcal{L}_{intra}$ make crucial contributions to the PDA tasks, because these two terms align the data distribution inter-class and intra-class. $\mathcal{L}_{ce}$ keeps the model performance on the source domain stable; it has limited contribution to the PDA process but cannot be ignored. $\mathcal{L}_{ent}$ helps to mitigate the negative transfer influence of the multilayer perceptron classifier $F$, especially at the beginning of the training stage.
Then, we monitor the training and optimization process of our model. Fig. 4 illustrates the adaptively-accumulated knowledge transfer process. We choose case A31→W10 of the Office-31 dataset and show how the set of high-confidence categories used to align the data distribution across domains changes. At the beginning, the high-confidence target samples spread over only 6 classes; then more and more categories are involved, and the number finally reaches 11, while the total number of target domain categories is 10. Although one incorrect outlier class is involved, the adaptive optimization strategy still significantly narrows the range of the target domain label space.



Moreover, we implement several ablation experiments on the Office-Home dataset with different training details to explore the contribution of our proposed model and optimization strategy; the results are reported in Table 4. "No Adaptive" denotes the results without the adaptively-accumulated knowledge transfer and target sample selection process. Compared to the results of our complete A2KT model, we notice how important the adaptive accumulation strategy is. "F Guide" denotes the results when we use the probabilistic prediction of $F$, instead of $P$, to select high-confidence target samples for domain alignment; the threshold is decided in the same way as when we use $P$. The results prove that the multilayer perceptron classifier and the prototype classifier have different classification philosophies, and that using the probability prediction of $P$ to accumulate knowledge boosts the performance significantly. Finally, we examine the motivation of adopting two different types of classifiers in our framework by setting both classifiers to the same multilayer perceptron structure, with all other settings and training strategies unchanged; the results are reported as "Same F & F". From the results, we observe that for some cases two identical multilayer perceptron classifiers obtain slightly better performance than our model, e.g., Ar→Rw and Cl→Rw. However, for most cases and for the average performance, our model with two different types of classifiers outperforms by a clear margin. All the results with different training strategies in Table 4 demonstrate the effectiveness and motivation of our model and optimization strategies.
We present the parameter sensitivity analysis in Fig. 5. We vary the two trade-off hyper-parameters, one from 0.0001 to 0.05 and the other from 1 to 3, on four cases of the Office-Home dataset (Ar→Pr, Ar→Rw, Pr→Ar, Rw→Cl) to analyze whether the model is sensitive to changes of the hyper-parameters. The results in Fig. 5 show that our model is stable across these cases with respect to the two parameters.
Finally, we select several representative target samples from task Pr→Rw on the Office-Home dataset and show the predictions of $F$ and $P$ in Fig. 6. We notice that some cases only $F$ or only $P$ can handle, and some neither can predict correctly, which demonstrates the motivation of combining two different types of classifiers in our proposed model. Besides, we perform an image retrieval task by giving specific labels to retrieve target samples. The 5 target images with the highest prediction confidence and the 5 with the lowest among the retrieved images are shown in Fig. 7. The different samples retrieved by $F$ and $P$ demonstrate the motivation of integrating various classifiers.
5. Conclusion
This paper presented a novel domain-invariant feature learning framework for partial domain adaptation. With the help of the adaptively-accumulated knowledge transfer optimization strategy, the target domain samples with high confidence and the task-relevant source categories are selected adaptively. By maximizing the inter-class center-wise discrepancy and minimizing the intra-class sample-wise variation, more domain-invariant and task-specific discriminative representations can be extracted. Extensive experiments on several partial domain adaptation benchmarks demonstrate the superiority of our algorithm over previous works.
References
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
- Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 conference on empirical methods in natural language processing. 120–128.
- Borgwardt et al. (2006) Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. 2006. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
- Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3722–3731.
- Cao et al. (2018a) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I Jordan. 2018a. Partial transfer learning with selective adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2724–2732.
- Cao et al. (2018b) Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. 2018b. Partial adversarial domain adaptation. In Proceedings of the European Conference on Computer Vision. 135–150.
- Cao et al. (2019) Zhangjie Cao, Kaichao You, Mingsheng Long, Jianmin Wang, and Qiang Yang. 2019. Learning to Transfer Examples for Partial Domain Adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2985–2994.
- Dai et al. (2008) Wenyuan Dai, Yuqiang Chen, Guirong Xue, Qiang Yang, and Yong Yu. 2008. Translated Learning: Transfer Learning across Different Feature Spaces. In Advances in neural information processing systems. 353–360.
- Ding et al. (2018) Zhengming Ding, Ming Shao, and Yun Fu. 2018. Robust Multi-view Representation: A Unified Perspective from Multi-view Learning to Domain Adaption.. In IJCAI. 5434–5440.
- Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning. 647–655.
- Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 1 (2016), 2096–2030.
- Griffin et al. (2007) Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 object category dataset. (2007).
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Hoffman et al. (2014) Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection through adaptation. In Advances in neural information processing systems. 3536–3544.
- Hoffman et al. (2017) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. 2017. Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213 (2017).
- Jiang et al. (2017) Shuhui Jiang, Zhengming Ding, and Yun Fu. 2017. Deep low-rank sparse collective factorization for cross-domain recommendation. In Proceedings of the 25th ACM international conference on Multimedia. 163–171.
- Lee et al. (2019a) Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. 2019a. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 10285–10295.
- Lee et al. (2019b) Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. 2019b. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 10285–10295.
- Li et al. (2019a) Jingjing Li, Erpeng Chen, Zhengming Ding, Lei Zhu, Ke Lu, and Zi Huang. 2019a. Cycle-consistent conditional adversarial transfer networks. In Proceedings of the 27th ACM International Conference on Multimedia. 747–755.
- Li et al. (2020a) Shuang Li, Chi Harold Liu, Qiuxia Lin, Qi Wen, Limin Su, Gao Huang, and Zhengming Ding. 2020a. Deep Residual Correction Network for Partial Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020), 1–1.
- Li et al. (2019b) Shuang Li, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. 2019b. Joint Adversarial Domain Adaptation. In Proceedings of the 27th ACM International Conference on Multimedia. 729–737.
- Li et al. (2020b) Shuang Li, Harold Chi Liu, Qiuxia Lin, Binhui Xie, Zhengming Ding, Gao Huang, and Jian Tang. 2020b. Domain Conditioned Adaptation Network. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International conference on machine learning. 97–105.
- Long et al. (2016) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2016. Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems. 136–144.
- Long et al. (2017) Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2017. Deep transfer learning with joint adaptation networks. In International conference on machine learning. 2208–2217.
- Luo et al. (2017) Zelun Luo, Yuliang Zou, Judy Hoffman, and Li F Fei-Fei. 2017. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in neural information processing systems. 165–177.
- Oquab et al. (2014) Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1717–1724.
- Rasiwasia et al. (2010) Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia. 251–260.
- Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In Proceedings of the European Conference on Computer Vision. Springer, 213–226.
- Saito et al. (2018a) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018a. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3723–3732.
- Saito et al. (2018b) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018b. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3723–3732.
- Shu et al. (2015) Xiangbo Shu, Guo-Jun Qi, Jinhui Tang, and Jingdong Wang. 2015. Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation. In Proceedings of the 23rd ACM international conference on Multimedia. 35–44.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S Zemel. 2017. Prototypical Networks for Few-shot Learning. In Advances in neural information processing systems. 4077–4087.
- Tzeng et al. (2015) Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. 2015. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision. 4068–4076.
- Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1. 4.
- Venkateswara et al. (2017) Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5018–5027.
- Wang et al. (2018) Jindong Wang, Wenjie Feng, Yiqiang Chen, Han Yu, Meiyu Huang, and Philip S Yu. 2018. Visual domain adaptation with manifold embedded distribution alignment. In Proceedings of the 26th ACM international conference on Multimedia. 402–410.
- Wang et al. ([n.d.]) L Wang, B Sun, J Robinson, T Jing, and Y Fu. [n.d.]. EV-Action: Electromyography-Vision Multi-Modal Action Dataset. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG). 129–136.
- Xia and Ding (2020) Haifeng Xia and Zhengming Ding. 2020. Structure Preserving Generative Cross-Domain Learning. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Xu et al. (2019) Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. 2019. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In Proceedings of the IEEE International Conference on Computer Vision. 1426–1435.
- Yan et al. (2016) Yan Yan, Feiping Nie, Wen Li, Chenqiang Gao, Yi Yang, and Dong Xu. 2016. Image classification by cross-media active learning with privileged information. IEEE Transactions on Multimedia 18, 12 (2016), 2494–2502.
- Yao et al. (2019) Yuan Yao, Yu Zhang, Xutao Li, and Yunming Ye. 2019. Heterogeneous domain adaptation via soft transfer network. In Proceedings of the 27th ACM International Conference on Multimedia. 1578–1586.
- Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In Advances in neural information processing systems. 3320–3328.
- Zhang et al. (2018a) Jing Zhang, Zewei Ding, Wanqing Li, and Philip Ogunbona. 2018a. Importance weighted adversarial nets for partial domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8156–8164.
- Zhang et al. (2018b) Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. 2018b. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3801–3809.
- Zhang et al. (2019) Yabin Zhang, Hui Tang, Kui Jia, and Mingkui Tan. 2019. Domain-symmetric networks for adversarial domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5031–5040.
- Zhuo et al. (2017) Junbao Zhuo, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2017. Deep unsupervised convolutional domain adaptation. In Proceedings of the 25th ACM international conference on Multimedia. 261–269.