Partial Domain Adaptation Using Graph Convolutional Networks

Abstract

Partial domain adaptation (PDA), in which we assume the target label space is included in the source label space, is a general version of standard domain adaptation. Since the target label space is unknown, the main challenge of PDA is to reduce the learning impact of irrelevant source samples, named outliers, which do not belong to the target label space. Although existing partial domain adaptation methods effectively down-weigh outliers’ importance, they do not consider data structure of each domain and do not directly align the feature distributions of the same class in the source and target domains, which may lead to misalignment of category-level distributions. To overcome these problems, we propose a graph partial domain adaptation (GPDA) network, which exploits Graph Convolutional Networks for jointly considering data structure and the feature distribution of each class. Specifically, we propose a label relational graph to align the distributions of the same category in two domains and introduce moving average centroid separation for learning networks from the label relational graph. We demonstrate that considering data structure and the distribution of each category is effective for PDA and our GPDA network achieves state-of-the-art performance on the Digit and Office-31 datasets.

Index Terms— Deep learning, Image classification, Partial domain adaptation, Graph neural networks

1 Introduction

Recently, deep learning-based methods have shown state-of-the-art performance in image classification beyond human perception. However, these methods require a lot of labeled data to train deep neural networks. Since it costs a lot of time and money to obtain labels for training, there is a limitation to applying them to real situations. Unsupervised domain adaptation methods [1, 2, 3, 4] have received attention as a way to reduce the labeling cost. They aim to ensure a network trained with rich labeled data from the source domain working well on unlabeled data from the target domain. As the source data and the target data are sampled from different distributions, the networks trained on the source domain without domain adaptation do not infer well on the target samples. To bridge different domains, most domain adaptation methods try to learn domain invariant feature representations by adversarial learning [1]. These methods successfully reduce the large gap between different domains, and domain adaptation methods for image classification [2, 3, 4, 5, 6, 7] perform well as much as the network trained on rich labeled data from the target domain.

Refer to caption — Fig. 1: Example of comparison with standard domain adaptation and partial domain adaptation.

However, existing domain adaptation methods generally assume that the source and target domains share the identical label space. Under this assumption, when the source and target datasets have different label spaces, the domain difference can not be correctly reduced. In real applications, it is not practical to find or generate a source domain with the identical label space as the target domain. To overcome this problem, partial domain adaptation (PDA) has been studied under the assumption that the target label space is included in the source label space. In PDA, we use a large dataset with numerous classes as a source dataset and transfer source domain knowledge to a small target domain with few categories. PDA is feasible for many applications because large datasets with many classes are open to the public so that we can use it as a source domain, and classes of the target dataset are highly likely to be subsets of the source dataset.

Partial domain adaptation is more challenging than standard domain adaptation since the target label space is unknown, and there are irrelevant source samples, named outliers, which do not belong to the target label space. Therefore, most partial domain adaptation methods try to prevent learning with outliers. Cao et al. propose the PADA [6] and SAN [8] architectures to down-weigh outliers’ importance automatically by introducing the probabilities of source samples belonging to the target label space. Similarly, Zhang et al. suggest IWAN [9] by exploiting an additional domain classifier and Cao et al. design ETN [10], which adds a new classifier and a novel domain classifier to select outliers accurately. Matsuura et al. propose TWINs [11], which estimate the ratio of the target samples in each class for weighting the classes present in the target domain. Although these novel methods [6, 8, 9, 10, 11] effectively perform partial domain adaptation by reducing the effect of outliers on training, they do not consider data structure of each domain, which is known to reflect the marginal or conditional distribution [12] and data statistic [13, 14], and they do not directly align the feature distributions of the same category in the source and target domains.

In this paper, we construct a graph for data structure to align the feature distributions of the same category for partial domain adaptation. Specifically, we propose a label relational graph exploiting the relationship between pseudo labels for the target samples and ground truth labels for the source samples. Moreover, we introduce moving average centroid separation for learning networks from the label relational graph. By using the label relational graph and moving average centroid separation, the features of the same classes in the source and target domains are incorporated together, while the features of the different classes are separated from each other. To consider these two modules jointly, we propose a graph partial domain adaptation (GPDA) network. Our network is effective for partial domain adaptation and that achieves state-of-the-art performance on the Digit [15, 16] and Office-31 [17] datasets.

2 Proposed Method

In partial domain adaptation, we are given by $n_{s}$ source samples and the corresponding class labels $\left\{(x_{i}^{s},y_{i}^{s})\right\}_{i=1}^{n_{s}}$ from the source domain $D_{s}$ with $\left|C_{s}\right|$ classes and $n_{t}$ unlabeled target samples $\left\{x_{i}^{t}\right\}_{i=1}^{n_{t}}$ from the target domain $D_{t}$ with $\left|C_{t}\right|$ classes, where $\left|C_{s}\right|$ denotes the source label space whereas $\left|C_{t}\right|$ the target label space. Like standard domain adaptation, we assume that the source and target samples are drawn from different probability distribution $p$ and $q$ , respectively. Moreover, in partial domain adaptation, $C_{t}$ is the subset of $C_{s}$ , i.e., $C_{t}\subset C_{s}$ , and $C_{t}$ is unknown. In other words, we know the target label space is included in the source label space, but do not know what kinds of classes are included in the target label space.

2.1 Graph Partial Domain Adaptation

In partial domain adaptation, adopting standard domain adaptation algorithms, which learn all source and target samples with the same weight, cause performance degeneration by outliers, i.e., $x_{i}^{s}$ corresponding to $y_{i}^{s}\notin\left|C_{t}\right|$ . Therefore, we introduce a graph partial domain adaptation (GPDA) network, which aims to reduce the learning impact of the outliers during training time. As illustrated in Fig. 2, we exploit data structure, forcing to align the distributions of the same category in a weight framework. $E$ is a feature extractor, and $F_{s}$ and $F_{t}$ are classifiers for source samples in the source label space and the common label space, respectively. Task-specific features are learned by classifiers in supervised learning as followed:

	$\displaystyle L_{S}(\theta_{{F}_{s}},\theta_{E})=\frac{1}{n_{s}}\sum_{x_{i}\in D_{s}}L_{y}(F_{s}(E(x_{i})),y_{i}),$		(1)
	$\displaystyle L_{T}(\theta_{{F}_{t}},\theta_{E})=\frac{1}{n_{s}}\sum_{x_{i}\in D_{s}}\gamma_{{y_{i}}}L_{y}(F_{t}(E(x_{i})),y_{i}),$		(2)

where $L_{y}$ is the cross entropy loss function, and $\gamma_{{y_{i}}}$ is a probability of a source label $y_{i}$ belonging to the target label space. Here, $\boldsymbol{\gamma}\in R^{\left|C_{s}\right|}$ , which indicates the contribution of each source class, and $\boldsymbol{\gamma}$ can be calculated as follows:

\boldsymbol{\gamma}=\frac{1}{n_{t}}\sum_{x_{i}\in D_{t}}F_{s}(E(x_{i})).

(3)

For domain invariant features, we combine Graph Convolutional Networks $G$ with a domain classifier $D$ . $D$ is trained to distinguish the source domain from the target domain, and simultaneously $G$ and $E$ are trained to confuse $D$ . This loss function can be expressed as follows:

\begin{split}L_{D}(\theta_{E},\theta_{G},\theta_{D})=&-\frac{1}{n_{s}}\sum_{x_{i}\in D_{s}}\gamma_{{y_{i}}}L_{bce}(D(G(E(x_{i}),A),d_{i}))\\ &-\frac{1}{n_{t}}\sum_{x_{i}\in D_{t}}L_{bce}(D(G(E(x_{i}),A),d_{i})),\end{split}

(4)

where $L_{bce}$ is the binary cross entropy loss and $d_{i}\in\{0,1\}$ indicates the domain label. Moreover, a label relational graph $A$ and moving average centroid separation $L_{CS}$ lead to aligning the distributions of same category in the source and target domains, which are described in section 2.2 and section 2.3, respectively. The total objective function is as follows:

\displaystyle L_{GPDA}=L_{S}+L_{T}+\lambda_{1}L_{D}+\lambda_{2}L_{CS},

(5)

where $\lambda_{1}$ and $\lambda_{2}$ are the trade-off parameters, and we set trade-off parameters to $\lambda_{1}=1.0$ and $\lambda_{2}=1.0$ for all experiments. Finally, the proposed GPDA network can be solved by a minimax optimization problem as follows:

\displaystyle\begin{aligned} (\hat{\theta}_{E},\hat{\theta}_{G},\hat{\theta}_{{F}_{s}},\hat{\theta}_{{F}_{t}})&=\underset{\theta_{E},\theta_{G},\theta_{{F}_{s}},\theta_{{F}_{t}}}{\operatorname{arg}\,\operatorname{min}}\;L_{GPDA},\\ \hat{\theta}_{D}&=\underset{\theta_{D}}{\operatorname{arg}\,\operatorname{max}}\;L_{GPDA}.\end{aligned}

(6)

2.2 GCN with A Label Relational Graph

Graph Convolutional Networks (GCN)[18] are motivated by a first-order approximation of localized spectral filters on graphs [19]. Each GCN layer with $N$ nodes is described as followed:

Z={\tilde{D}}^{-\frac{1}{2}}\tilde{A}{\tilde{D}}^{-\frac{1}{2}}X\Theta,

(7)

where $X\in{R}^{N\times F}$ is a $F$ -dimensional node signal matrix, and $\Theta\in{R}^{F\times F^{\prime}}$ is a learnable filter changing a node signal to $F^{\prime}$ dimension with $\tilde{A}=A+I_{N}$ and $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$ .

GCN outperforms in tasks on datasets with defined node-to-node relationships [20], and has recently been used in computer vision. Unlike datasets such as Citeseer, Cora, and Pubmed [20], which are datasets with predefined nodes and edges, it is important to determine nodes and edges appropriately for tasks in computer vision. To adopt Graph Convolutional Networks for computer vision tasks, Chen et al. [21] use the graph by setting nodes to word embedding vectors and edges to class co-occurrence patterns within the dataset for multi-label image classification. In addition, for group action recognition, Wu et al. [22] set nodes to feature maps of people and edges to appearance relation. However, since there is no graph for partial domain adaptation, we construct a label relational graph to use Graph Convolutional Networks in partial domain adaptation.

In the label relational graph, each node feature represents the feature map of a sample, then the ith node feature of the graph $X_{i}$ is obtained by:

X_{i}=E(x_{i}),

(8)

where $E$ and $x_{i}$ indicate the feature extractor and the ith input image, respectively. An adjacency matrix $A$ contains relationships of the nodes, i.e., edges, and each component of $A$ is obtained as follows:

{A}_{ij}=\sum_{c=0}^{C-1}{y_{i,c}y_{j,c}},

(9)

where $C$ is the number of class, and $y_{i,c}$ and $y_{j,c}$ are labels of the ith and jth image, respectively. In the case of the source images, there are corresponding ground truth labels, which are one-hot vectors, but in the case of the target images, labels are not given in training. We exploit pseudo-labels [23], which are well known for semi-supervised learning techniques. Finally, the label relational graph has a large value between images that are likely to have the same class and has a low value for unrelated images. Layerwise propagation of the GCN layer with the label relational graph smooths features of images with the same class, which leads to aligning the distributions of the same class.

Table 1: The classification accuracy of the Digit dataset in the partial domain adaptation setting.

	Digit
Method	MNIST $\rightarrow$ USPS	USPS $\rightarrow$ MNIST
Source only	85.2	80.0
DANN [1]	67.1	72.1
IWAN [9]	90.6	85.7
TWINs [11]	96.3	90.2
GPDA	96.9	94.6

Table 2: The classification accuracy of the Office-31 dataset in the partial domain adaptation setting.

	Office-31
Method	A $\rightarrow$ W	D $\rightarrow$ W	W $\rightarrow$ D	A $\rightarrow$ D	D $\rightarrow$ A	W $\rightarrow$ A	Average
ResNet [24]	75.59	96.27	98.09	83.44	83.92	84.97	87.05
DANN [1]	73.56	96.27	98.73	81.53	82.78	86.12	86.50
IWAN [9]	89.15	99.32	99.36	90.45	95.62	94.26	94.69
SAN [8]	93.90	99.32	99.36	94.27	94.15	88.73	94.96
PADA [6]	86.54	99.32	100	82.17	92.69	95.41	92.69
ETN [10]	94.52	100	100	95.03	96.21	94.64	96.73
Baseline	88.81	100	100	94.27	88.73	94.89	94.45
Ours w/o $L_{CS}$	94.58	100	100	92.36	94.26	94.89	96.02
Ours w/o graph	95.59	100	100	94.27	94.26	94.89	96.50
Ours (GPDA)	96.95	100	100	98.73	95.10	95.83	97.77

2.3 Moving Average Centroid Separation

Graph convolution may lead to smoothing the features of different classes because of incorrect pseudo labels. To alleviate this smoothing effect, we introduce moving average centroid separation, which follows the idea in [25]. Specifically, we use features of the labeled source samples and pseudo-labeled target samples. In [25], they design the class centroid alignment module to map features in the same class nearby, while we introduce moving average centroid separation to map features in the different classes separately. The moving average centroid separation objective function is as follows:

L_{CS}(\theta_{E},\theta_{G})=-\sum_{k=0}^{C-1}\parallel c^{s}_{k}-c^{t}_{(k+i)\,mod\,C}\parallel^{2},

(10)

where $c^{s}_{k}$ and $c^{t}_{k}$ are centroids of feature maps of class $k$ in the source and target domains, respectively, and $i$ is a random integer number within [1, $C-1$ ], updated in each iteration. Through the objective function, false signals in pseudo-labeled target samples are suppressed, and features in the different classes are explicitly separated from each other.

3 Experiments

3.1 Setup

We evaluate our architecture to compare with state-of-the-art networks for partial domain adaptation on two benchmark datasets: Digit [15, 16] and Office-31 [17].

Digit. We utilize MNIST and USPS for two domain adaptation tasks (i.e., MNIST $\rightarrow$ USPS and USPS $\rightarrow$ MNIST). MNIST and USPS consist of 10 images containing numbers from 0 to 9, but domains are different. MNIST is collected from students, whereas USPS is a dataset for US portal service. In the PDA setting, we use all images and labels as the source dataset, and adopt images corresponding to the first five classes as the target dataset as conducted in [11] (i.e., $\left|C_{s}\right|=10$ , $\left|C_{t}\right|=5$ ).

Office-31. Office-31 is a standard benchmark for domain adaptation. It consists of 4,652 images and 31 categories collected from three different domains. Amazon (A) contains images from amazon.com. DSLR (D) and Webcam (W) are taken by a DSLR camera and a web camera, respectively. We utilize the Office-31 dataset for six domain adaptation tasks. We use all images and labels as the source dataset, and adopt images corresponding to the ten classes as the target dataset as conducted in [8] (i.e., $\left|C_{s}\right|=31$ , $\left|C_{t}\right|=10$ ).

We implement our GPDA network based on Pytorch, and exploit CNN architectures for the Digit dataset, as the same protocol in DANN [1]. For the office-31 dataset, we finetune ResNet-50 [24] pre-trained on ImageNet [26]. For a fair comparison, we use the same base network for previous methods. We add two GCN layers with 256 and 1024 channels on Digit and Office-31, respectively, since increasing the number of layers or channels for GCN layers do not improve performance. New layers are trained from scratch with 10 times the learning rate of the pre-trained layer. We use SGD with the momentum of 0.9 and the learning rate decay strategy implemented in DANN [1], and the learning rates of all new layers are increased gradually from 0 to 1 as DANN [1] also.

3.2 Results and Analysis

We show that our GPDA network outperforms previous methods on the Digit dataset in Table 1. Source only and DANN are the methods without the domain adaptation algorithm and with the standard domain adaptation algorithm, respectively. In the situation where the label spaces of the source and target are different, the performance of DANN is lower than the way without using the domain adaptation method, i.e., source only. This is the result of the outliers interfering with distribution alignment. Our method achieves state-of-the-art performance compared to previous partial domain adaptation methods [9, 11].

In Table 2, the proposed method achieves state-of-the-art performance with average gain of 1% beyond ETN [10], which adds the novel domain classifier. Specifically, our network has the same algorithm for reducing the learning impact of outliers as PADA [6], but shows a performance improvement of 4.6% on average, exploiting the newly introduced a label relational graph and moving average centroid separation. It shows that using data structure and considering the distribution of each category are valid for partial domain adaptation. Moreover, we experiment with ablation studies to examine the effect of each module. Baseline indicates the network without a label relational graph and moving average centroid separation. It outperforms PADA by exploiting the additional classifier expected to learn images in the common label space. We experiment our GPDA network without a label relational graph and moving average centroid separation, respectively. Ours w/o $L_{CS}$ only leads to align the distributions of the same category, whereas ours w/o graph separate the distributions of different categories. They provide higher performances than baseline, meaning that each module is effective for PDA, and we can see that two modules work complement each other to obtain better overall results.

4 Conclusion

In this paper, we design a novel architecture, named a graph partial domain adaptation (GPDA) network, to consider data structure and the distribution of each class for partial domain adaptation. Specifically, we integrate Graph Convolutional Networks into a down-weight framework, and propose a label relational graph and moving average centroid separation for graph learning. The experimental results show that our GPDA network outperforms previous state-of-the-art methods, demonstrating the effectiveness of our method.

References

[1] Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
[2] Dongwan Kim, Seungmin Lee, Namil Kim, and Seong-Gyun Jeong, “Delegated adversarial training for unsupervised domain adaptation,” in ICIP, 2019.
[3] Debasmit Das and CS George Lee, “Unsupervised domain adaptation using regularized hyper-graph matching,” in ICIP, 2018.
[4] Jiahui Fu, Xiaofu Wu, Suofei Zhang, and Jun Yan, “Improved open set domain adaptation with backpropagation,” in ICIP, 2019.
[5] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin, “Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation,” in ICCV, 2019.
[6] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang, “Partial adversarial domain adaptation,” in ECCV, 2018.
[7] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko, “Semi-supervised domain adaptation via minimax entropy,” arXiv preprint arXiv:1904.06487, 2019.
[8] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I Jordan, “Partial transfer learning with selective adversarial networks,” in CVPR, 2018.
[9] Jing Zhang, Zewei Ding, Wanqing Li, and Philip Ogunbona, “Importance weighted adversarial nets for partial domain adaptation,” in CVPR, 2018.
[10] Zhangjie Cao, Kaichao You, Mingsheng Long, Jianmin Wang, and Qiang Yang, “Learning to transfer examples for partial domain adaptation,” in CVPR, 2019.
[11] Toshihiko Matsuura, Kuniaki Saito, and Tatsuya Harada, “Twins: Two weighted inconsistency-reduced networks for partial domain adaptation,” arXiv preprint arXiv:1812.07405, 2018.
[12] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu, “Transfer joint matching for unsupervised domain adaptation,” in CVPR, 2014.
[13] Jing Zhang, Wanqing Li, and Philip Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in CVPR, 2017.
[14] Yong Xu, Xiaozhao Fang, Jian Wu, Xuelong Li, and David Zhang, “Discriminative transfer subspace learning via low-rank and sparse representation,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 850–863, 2015.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[16] Jonathan J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on pattern analysis and machine intelligence, vol. 16, no. 5, pp. 550–554, 1994.
[17] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell, “Adapting visual category models to new domains,” in ECCV. Springer, 2010.
[18] Thomas N Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[19] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
[20] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, pp. 93–93, 2008.
[21] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo, “Multi-label image recognition with graph convolutional networks,” in CVPR, 2019.
[22] Jianchao Wu, Limin Wang, Li Wang, Jie Guo, and Gangshan Wu, “Learning actor relation graphs for group activity recognition,” in CVPR, 2019.
[23] Dong-Hyun Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[25] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen, “Learning semantic representations for unsupervised domain adaptation,” in ICML, 2018.
[26] Olga Russakovsky et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.