
Diversity-enhancing Generative Network for Few-shot Hypothesis Adaptation

Ruijiang Dong    Feng Liu    Haoang Chi    Tongliang Liu    Mingming Gong    Gang Niu    Masashi Sugiyama    Bo Han
Abstract

Generating unlabeled data has recently been shown to help address the few-shot hypothesis adaptation (FHA) problem, where we aim to train a classifier for the target domain with a few labeled target-domain data and a well-trained source-domain classifier (i.e., a source hypothesis), because highly compatible unlabeled data provide additional information. However, the data generated by existing methods are extremely similar or even identical. The strong dependency among the generated data can cause learning to fail. In this paper, we propose a diversity-enhancing generative network (DEG-Net) for the FHA problem, which can generate diverse unlabeled data with the help of a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC). Specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data. As a result, the unlabeled data generated by DEG-Net are more diverse and more effective for addressing the FHA problem. Experimental results show that DEG-Net outperforms existing FHA baselines and further verify that generating diverse data plays a vital role in addressing the FHA problem.


(a) The copy issue of generated data on the M→S task
(b) Classification accuracy with different amounts of unlabeled data drawn from different domains on the M→S task
Figure 1: The low-diversity issue of generated unlabeled data when solving the FHA problem. Subfigure (a) illustrates the labeled data (left) drawn from the target domain and the unlabeled data (right) generated by TOHAN on the MNIST→SVHN (M→S) task. It is clear that the generated data are similar to each other and seem to copy the original target data (left). Subfigure (b) illustrates the accuracy of models trained with different amounts of unlabeled data drawn from different domains on the M→S task. For the source data and target data, the accuracy of the trained model increases as the amount of data increases. For the generated data, growth in data volume only helps to improve the accuracy when the amount of data is small.

1 Introduction

Data and expert knowledge are always scarce in newly emerging fields, so it is both important and challenging to study how to leverage knowledge from other similar fields to help complete tasks in the new fields. To cope with this challenge, transfer learning methods were proposed to leverage the knowledge of source domains (e.g., data in source domains or models trained with such data) to help complete tasks in other similar domains (a.k.a. the target domains) (Zamir et al., 2018; Sun et al., 2019; Liu et al., 2019; Wang et al., 2019; Zhang et al., 2020; Teshima et al., 2020; Jing et al., 2020; Liu et al., 2021a; Zhong et al., 2021; Zhou et al., 2020; Shui et al., 2022; Guo et al., 2022; Fang et al., 2022; Chi et al., 2022; Wang et al., 2023). Among the many transfer learning methods, hypothesis transfer learning (HTL) methods have received a lot of attention since they do not require access to the data in source domains, which prevents data leakage and protects data privacy (Du et al., 2017; Yang et al., 2021a, b). Recently, the few-shot hypothesis adaptation (FHA) problem has been formulated to make HTL more realistic and suitable for many practical problems (Liu et al., 2021b, c; Snell et al., 2017; Wang et al., 2020; Yang et al., 2020). In FHA, only a well-trained source classifier (i.e., a source hypothesis) and a few labeled target data are available (Chi et al., 2021a).

Similar to HTL, FHA also aims to obtain a good target-domain classifier with the help of a source hypothesis and a few target-domain data (Chi et al., 2021a; Motiian et al., 2017). Recently, generating unlabeled data has been shown to be an effective strategy for addressing the FHA problem (Chi et al., 2021a). The target-oriented hypothesis adaptation network (TOHAN), a one-step solution to the FHA problem, constructed an intermediate domain to enrich the training data. The data in the intermediate domain are highly compatible with the data in both the source domain and the target domain (Balcan & Blum, 2010). With the generated unlabeled data in the intermediate domain, TOHAN partially overcame the problems caused by data scarcity in the target domain.

However, the existing methods ignore the diversity of the generated data (i.e., the independence among the generated data), so the generated data are extremely similar or even identical. This lack of diversity makes the data less effective for addressing the FHA problem. Taking the FHA tasks on the digit datasets as an example, we find that the data generated by TOHAN suffer from an issue where the generator tends to generate data similar to the target data and even copies the target data (Figure 1(a)). To show intuitively how diversity matters in the FHA problem, we conduct experiments on the digit datasets. We use a few labeled target-domain data and an increasing amount of unlabeled data to train the target model. The result is shown in Figure 1(b). For the source data and target data, it is clear that the accuracy of the trained model increases as the number of data increases. For the generated data, the growth of data volume only helps to improve the accuracy of the model when the data volume is small (e.g., fewer than 45 in Figure 1(b)). However, once the number of data exceeds 35, the accuracy of the model fluctuates around 33% regardless of any further increase in the unlabeled data. This result shows that the accuracy of the model trained with generated data saturates much earlier than that of models trained with source or target data, since the generated data have less diversity.

In this paper, to show how the diversity of unlabeled data (i.e., the independence among unlabeled data) affects FHA methods, we theoretically analyze the sample complexity of the FHA problem (Theorem 1). In this analysis, we adopt the log-coefficient score \alpha (Dagan et al., 2019) to measure the dependency among unlabeled data. Our results show that we can still count on the unlabeled data to help address the FHA problem as long as the unlabeled data are weakly dependent (\alpha<0.5). Nevertheless, once \alpha\geq 0.5, the guarantee in Theorem 1 may no longer hold, so learning may fail in theory. In addition, we find that high dependency among unlabeled data usually means that we need more unlabeled data to obtain a good target-domain classifier. From the above analysis and Figure 1, we argue that diversity matters in addressing the FHA problem.

To this end, we propose the diversity-enhancing generative network (DEG-Net) for the FHA problem, a weight-shared conditional generative method equipped with a kernel independence measure: the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; Ma et al., 2020; Pogodin & Latham, 2020), which has been used in various situations, e.g., clustering (Song et al., 2007; Blaschko & Gretton, 2008), independence testing (Gretton et al., 2007), self-supervised classification (Li et al., 2021), and model-inversion attacks (Peng et al., 2022). Although the log-coefficient score is used to analyze the sample complexity of the FHA problem, its calculation requires knowing the target-domain distribution, which is unknown in practice. In contrast, HSIC can be easily estimated from data samples. Thus, we adopt HSIC to measure the dependency among the generated unlabeled data.

The overview of DEG-Net is shown in Figure 2. DEG-Net has two modules: the generation module and the adaptation module. In the generation module, we train the conditional generator with a well-trained source classifier and a few target-domain data. To train the generator with both source-domain and target-domain knowledge while simultaneously improving the diversity of the generated data, the generative loss of DEG-Net consists of three parts: a classification loss, a similarity loss, and a diversity loss. More specifically, DEG-Net generates data by minimizing the HSIC value (i.e., maximizing the independence) among the semantic features of the generated data, where the semantic features are the hidden-layer outputs of the well-trained source hypothesis. To exploit the generalization knowledge in the semantic features shared by different classes (Chen et al., 2020; Chi et al., 2021b; Yao et al., 2021), the generator is a weight-shared network. The adaptation module consists of a classifier and a group discriminator (Chi et al., 2021a; Motiian et al., 2017). The classifier is trained to classify data from the target domain using the generated data and the few labeled data drawn from the target domain, adversarially against the group discriminator, which tries to distinguish data pairs from different domains.

We verify the effectiveness of DEG-Net on 8 FHA benchmark tasks (Chi et al., 2021a), including digit datasets and object datasets. Experimental results show that DEG-Net outperforms existing FHA methods and achieves state-of-the-art performance. Besides, thanks to the weight-shared generator, DEG-Net trains much faster than previous generative FHA methods. We also conduct an ablation study to verify that each component of DEG-Net is useful; it shows that diverse generated data help improve performance when addressing the FHA problem, which lights up a novel road for the FHA problem.

2 Preliminary and Related Works

Problem Setting. In this section, we formalize the FHA problem mathematically. Denote by \mathcal{X}\subset\mathbb{R}^{n} the input space and by \mathcal{Y}:=\{1,\dots,K\} the output space, where K is the number of classes. The source domain (resp. target domain) is defined as a joint probability distribution \mu^{s} (resp. \mu^{t}) on \mathcal{X}\times\mathcal{Y}. Besides, we assume that there is a well-trained model f^{s}:\mathcal{X}\rightarrow\mathcal{Y}. f^{s} is trained with data \{\bm{x}^{s}_{i},y^{s}_{i}\}_{i=1}^{n} drawn independently and identically distributed (i.i.d.) from \mu^{s} with the aim of minimizing \hat{\mathbb{E}}_{(\bm{x},y)\sim\mu^{s}}[\ell(h(\bm{x}),y)], where \ell:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}_{\geq 0} is a loss function that measures whether two elements in \mathcal{Y} are close, and h is an element of a hypothesis set \mathcal{H}:=\{h:\mathcal{X}\rightarrow\mathcal{Y}\}. Thus, f^{s} can be defined as

f^{s}:=\operatorname*{arg\,min}_{h\in\mathcal{H}}\hat{\mathbb{E}}_{(X^{s},Y^{s})}[\ell(h(X^{s}),Y^{s})]. (1)

f^{s} is also called the source hypothesis in our paper. Hence, the FHA problem is defined as follows:

Problem 1.

(FHA) Given the source hypothesis f^{s} defined in Eq. (1) and the labeled dataset S^{t}:=\{(\bm{x}^{t}_{i},y^{t}_{i})\}_{i=1}^{m_{\rm l}} (m_{\rm l}\leq 7K, following Chi et al. (2021a) and Park et al. (2019)) drawn i.i.d. from the target domain \mu^{t}, FHA methods aim to train a classifier f^{t}\in\mathcal{H} with f^{s} and S^{t} that minimizes \mathbb{E}_{(X^{t},Y^{t})}[\ell(h(X^{t}),Y^{t})] over h\in\mathcal{H}. Namely, f^{t}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\mathbb{E}_{(X^{t},Y^{t})}[\ell(h(X^{t}),Y^{t})].
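To make the inputs of Problem 1 concrete, the following minimal PyTorch-style sketch fine-tunes a given source hypothesis on the few labeled target data, which corresponds to the FT baseline compared in Section 5. All names (e.g., source_model, labeled_target) are illustrative placeholders, not the paper's implementation.

import copy
import torch
import torch.nn.functional as F

def fha_finetune(source_model, labeled_target, epochs=50, lr=1e-3):
    """Fine-tune a copy of the source hypothesis f^s on the few labeled
    target data S^t (at most 7 examples per class, i.e., m_l <= 7K)."""
    f_t = copy.deepcopy(source_model)                  # initialize f^t with f^s
    optimizer = torch.optim.Adam(f_t.parameters(), lr=lr)
    xs = torch.stack([x for x, _ in labeled_target])   # (m_l, ...) target inputs
    ys = torch.tensor([y for _, y in labeled_target])  # (m_l,) target labels
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(f_t(xs), ys)            # empirical risk on S^t
        loss.backward()
        optimizer.step()
    return f_t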

Hypothesis Transfer Learning Methods. Hypothesis transfer learning (HTL) aims to train a classifier with only a well-trained classifier and a few labeled data or abundant unlabeled data over the target domain (Kuzborskij & Orabona, 2013; Tommasi et al., 2010; Liang et al., 2020). Kuzborskij & Orabona (2013) used the leave-one-out error to find the optimal transfer parameters based on regularized least squares with biased regularization. SHOT (Liang et al., 2020) froze the source hypothesis and trained a domain-specific encoding module using abundant unlabeled data. Later, neighborhood reciprocity clustering (Yang et al., 2021a) was proposed to address HTL by encouraging reciprocal neighbors to concord in their label predictions. The HTL problem does not limit the amount of labeled data drawn from the target domain for transfer training, which distinguishes it from the FHA problem.

Target-oriented Hypothesis Adaptation Network. TOHAN (Chi et al., 2021a) is a one-step solution for the FHA problem. It performs well because it uses generated unlabeled data in the adaptation process. Motivated by the learnability theory of semi-supervised learning (SSL), TOHAN found that unlabeled data in an intermediate domain, which is compatible with both the source classifier and the target classifier, can help address the FHA problem by providing additional information during training. Guided by this principle, the key module of TOHAN generates unlabeled data drawn from the probability distribution \mu^{m}:

μm=argminμ[χ(hs,μ)+χ(ht,μ)],\displaystyle\mu^{m}=\mathop{\arg\min}\limits_{\mu}\left[\chi(h^{s},\mu)+\chi(h^{t},\mu)\right], (2)

where χ(,)\chi(\cdot,\cdot) measures how compatible hsh^{s} (resp. hth^{t}) is with data distribution μ\mu (Balcan & Blum, 2010).

3 Theoretical Analysis Regarding the Data Diversity in FHA

In previous works, researchers have shown that generated highly compatible data can help address the FHA problem. However, as discussed in Section 1, the diversity of the generated data matters in addressing FHA. Besides, the previous studies assume in their theory that the generated data are independent, which is inconsistent with their methods. In this section, we show how the dependency among the generated data affects the performance of FHA methods. Similar to Chi et al. (2021a), our theory is also based on the theory regarding SSL (Webb & Sammut, 2010).

Dependency Measure. Following Dagan et al. (2019), we use the log-coefficient to theoretically analyze data diversity. The log-coefficient measures the dependency among the components Z_{1},\dots,Z_{m} of a random vector \bm{Z}.

Definition 1 (Log-influence and log-coefficient (Dagan et al., 2019)).

Let 𝐙=(Z1,,Zm){\bm{Z}}=(Z_{1},\dots,Z_{m}) be a random variable over (𝒳×𝒴)m({\mathcal{X}}\times{\mathcal{Y}})^{m} and let μ𝐙\mu_{\bm{Z}} denote either its probability distribution if discrete or its density if continuous. Assume that μ𝐙>0\mu_{\bm{Z}}>0 on all (𝒳×𝒴)m({\mathcal{X}}\times{\mathcal{Y}})^{m}. For any ij{1,2,,m}i\neq j\in\{1,2,\dots,m\}, define the log-influence between jj and ii as

I^{\log}_{j,i}({\bm{Z}})=\frac{1}{4}\sup_{\substack{Z_{-i-j}\in({\mathcal{X}}\times{\mathcal{Y}})^{m-2}\\ Z_{i},Z_{i}^{\prime},Z_{j},Z_{j}^{\prime}\in{\mathcal{X}}\times{\mathcal{Y}}}}\log\frac{\mu_{\bm{Z}}[Z_{i}Z_{j}Z_{-i-j}]\,\mu_{\bm{Z}}[Z_{i}^{\prime}Z_{j}^{\prime}Z_{-i-j}]}{\mu_{\bm{Z}}[Z_{i}^{\prime}Z_{j}Z_{-i-j}]\,\mu_{\bm{Z}}[Z_{i}Z_{j}^{\prime}Z_{-i-j}]}. (3)

Then the log-coefficient of 𝐙{\bm{Z}} is defined as αlog(𝐙)=maxi=1,,mijIj,ilog(𝐙)\alpha_{\log}({\bm{Z}})=\max_{i=1,\dots,m}\sum_{i\neq j}I^{\log}_{j,i}({\bm{Z}}).

From Definition 1, it is clear that αlog(𝒁)\alpha_{\log}({\bm{Z}}) will be zero if ZiZ_{i} and ZjZ_{j} are independent (for any iji\neq j).
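To see this concretely, if the components of \bm{Z} are mutually independent, then \mu_{\bm{Z}} factorizes into per-component marginals \mu_{1},\dots,\mu_{m}, the factors for Z_{-i-j} cancel, and the ratio inside the logarithm of Eq. (3) equals one:

\frac{\mu_{\bm{Z}}[Z_{i}Z_{j}Z_{-i-j}]\,\mu_{\bm{Z}}[Z_{i}^{\prime}Z_{j}^{\prime}Z_{-i-j}]}{\mu_{\bm{Z}}[Z_{i}^{\prime}Z_{j}Z_{-i-j}]\,\mu_{\bm{Z}}[Z_{i}Z_{j}^{\prime}Z_{-i-j}]}=\frac{\mu_{i}[Z_{i}]\mu_{j}[Z_{j}]\cdot\mu_{i}[Z_{i}^{\prime}]\mu_{j}[Z_{j}^{\prime}]}{\mu_{i}[Z_{i}^{\prime}]\mu_{j}[Z_{j}]\cdot\mu_{i}[Z_{i}]\mu_{j}[Z_{j}^{\prime}]}=1,

so every log-influence I^{\log}_{j,i}(\bm{Z})=\frac{1}{4}\log 1=0 and hence \alpha_{\log}(\bm{Z})=0.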

Sample Complexity Analysis for FHA. Since the generated data are unlabeled, we follow the theory regarding SSL to analyze how the generated unlabeled data can help address the FHA problem. More importantly, we analyze how the dependency among the generated data affects the performance of FHA methods. For simplicity, we consider a binary SSL problem (i.e., K=2).

Let f^{*}:\mathcal{X}\to\{0,1\} be the optimal target classifier. Let err(h)=\mathbb{E}_{x\sim\mu_{X}^{t}}[h(x)\neq f^{*}(x)] be the true error rate of a hypothesis h\in\mathcal{H} over a marginal distribution \mu_{X}^{t}. In FHA, learnability mainly depends on the compatibility \chi:\mathcal{H}\times\mathcal{X}\mapsto[0,1], which measures how "compatible" h is with one unlabeled data point \bm{x}. In the following, we use \chi(h,\mu_{X}^{t})=\mathbb{E}_{x\sim\mu_{X}^{t}}[\chi(h,x)] to represent the expected compatibility of data from \mu_{X}^{t} with a classifier h, and let S_{X}^{(m_{\rm u})} be an observation of a random variable \bm{X}^{t,m_{\rm u}}=(X^{t}_{1},\dots,X^{t}_{m_{\rm u}}), where the distribution of each X^{t}_{i} is \mu_{X}^{t}, i=1,\dots,m_{\rm u}. The following theorem shows that, under some conditions, we can learn a good f^{t} even when there is dependency among the unlabeled target data.

Theorem 1.

Let χ^(h,SX(mu))=1muxSX(mu)χ(h,x)\hat{\chi}(h,S_{X}^{(m_{\rm u})})=\frac{1}{m_{\rm u}}\sum_{x\in S_{X}^{(m_{\rm u})}}\chi(h,x) be the empirical compatibility over SX(mu)S_{X}^{(m_{\rm u})} and 0={h:err^(h)=0}\mathcal{H}_{0}=\{h\in\mathcal{H}:\widehat{err}(h)=0\}. If ff^{*}\in\mathcal{H}, χ(f,μXt)=1t\chi(f^{*},\mu_{X}^{t})=1-t, and αlog(𝐗t,mu)<1/2\alpha_{\log}({\bm{X}}^{t,m_{\rm u}})<1/2, then mum_{u} unlabeled data and mlm_{l} labeled data are sufficient to learn to error ϵ\epsilon with probability 1δ1-\delta, for

mu=max(𝒪(1(1αlog(𝑿t,mu))ϵ2log2δ),\displaystyle m_{\rm u}=\max\left(\mathcal{O}\left(\frac{1}{(1-\alpha_{\log}({\bm{X}}^{t,m_{\rm u}}))\epsilon^{2}}\log\frac{2}{\delta}\right),\right. (4)
𝒪(VCdim(χ())(12αlog(𝑿t,mu))ϵ2))\displaystyle\phantom{=\;\;}\left.\mathcal{O}\left(\frac{{\rm VCdim}(\chi(\mathcal{H}))}{(1-2\alpha_{\log}({\bm{X}}^{t,m_{\rm u}}))\epsilon^{2}}\right)\right)

and

ml=2ϵ[ln(2μXt,χ(t+2ϵ)[2ml,μXt])+ln4δ],m_{\rm l}=\frac{2}{\epsilon}\left[\ln(2\mathcal{H}_{\mu_{X}^{t},\chi}(t+2\epsilon)[2m_{\rm l},\mu_{X}^{t}])+\ln\frac{4}{\delta}\right], (5)

where χ()={χh:h}\chi(\mathcal{H})=\{\chi_{h}:h\in\mathcal{H}\} is assumed to have a finite VC dimension, χh()=χ(h,)\chi_{h}(\cdot)=\chi(h,\cdot), and μXt,χ(t+2ϵ)[2ml,μXt]\mathcal{H}_{\mu_{X}^{t},\chi}(t+2\epsilon)[2m_{\rm l},\mu_{X}^{t}] is the expected number of splits of 2ml2m_{\rm l} data drawn from μXt\mu_{X}^{t} using hypotheses in \mathcal{H} of compatibility more than 1t2ϵ1-t-2\epsilon. In particular, with probability at least 1δ1-\delta, we have err(h^)ϵerr(\hat{h})\leq\epsilon, where h^=argmaxh0χ^(h,SX(mu))\hat{h}=\mathop{\arg\max}_{h\in\mathcal{H}_{0}}\hat{\chi}(h,S_{X}^{(m_{\rm u})}).

The proof of Theorem 1 is presented in Appendix A, which mainly follows the recent result in the problem of learning from dependent data (Dagan et al., 2019).

Theorem 1 shows that when the dependency among the unlabeled data is weak (i.e., \alpha_{\log}({\bm{X}}^{t,m_{\rm u}})<1/2), we obtain a result similar to the classical result in SSL theory (Balcan & Blum, 2010). Namely, if we can generate data that are highly compatible with f^{*}, then t is very small and thus \mathcal{H}_{\mu_{X}^{t},\chi}(t+2\epsilon)[2m_{\rm l},\mu_{X}^{t}] is small, so we do not need many labeled data drawn from the target domain to learn a good f^{t} (Chi et al., 2021a).

Diversity Matters in FHA. Theorem 1 also shows that the diversity of unlabeled data matters in the FHA problem, for two reasons. First, Theorem 1 might not hold if there is strong dependency among the unlabeled data (e.g., \alpha_{\log}({\bm{X}}^{t,m_{\rm u}})>1/2). This directly removes the theoretical foundation of previous work for addressing the FHA problem. Second, we need more unlabeled data to reach the same error \epsilon if the dependency among the unlabeled data increases. Specifically, if \alpha_{\log}({\bm{X}}^{t,m_{\rm u}}) is very close to 1/2, then m_{\rm u} can be very large; a short numerical illustration is given below. These reasons motivate us to take care of the dependency among the generated data. To weaken such dependency, we propose our method, the diversity-enhancing generative network (DEG-Net), for the FHA problem.
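As a rough illustration of the second point (keeping {\rm VCdim}(\chi(\mathcal{H})), \epsilon, and \delta fixed), the second term of Eq. (4) scales with the dependency as

\frac{{\rm VCdim}(\chi(\mathcal{H}))}{(1-2\alpha_{\log})\epsilon^{2}}\Big/\frac{{\rm VCdim}(\chi(\mathcal{H}))}{\epsilon^{2}}=\frac{1}{1-2\alpha_{\log}},

which equals 1 when \alpha_{\log}=0, already 10 when \alpha_{\log}=0.45, and diverges as \alpha_{\log}\to 1/2; beyond 1/2, no finite amount of unlabeled data is guaranteed to suffice by Theorem 1.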

4 Diversity-enhancing Generative Network for FHA Problem

In this section, we propose the diversity-enhancing generative network (DEG-Net) for the FHA problem. DEG-Net has two modules: (1) the generation module, which generates diverse data from the intermediate domain; and (2) the adaptation module, which trains the classifier for the target domain using the generated data and the labeled target data.

(a) Generation module
(b) Adaptation module
Figure 2: Overview of the diversity-enhancing generative network (DEG-Net). It consists of a generator G, a classifier f^{t}=h_{t}\circ g_{t} (initialized as f^{t}=f^{s}), and a group discriminator D. (a) In the generation module, we train the generator G with the classifier f^{t}=f^{s} frozen. (b) In the adaptation module, we first pair the generated data with the labeled data and use the paired data to train the discriminator D while freezing the encoder h_{t}. Then, we freeze the discriminator D to train the classifier f^{t}.

4.1 Diversity-enhancing Generation

To overcome the shortcomings of the current generative method for the FHA problem, we propose solutions for both the generative architecture and the loss function. For the generative architecture, we propose a weight-shared conditional generative network that generates data of a specified category. We also design a novel loss function to constrain the similarity and diversity of the semantic features of the generated data.

Weight-shared Conditional Generative Network. As discussed before, a weight-shared conditional generative network is promising because it uses generalized features shared among different classes to improve the quality of the generated data and to reduce the training time. Following Chi et al. (2021a), the generator aims to generate data of specific categories from Gaussian random noise. The encoder outputs the semantic feature and the class probability distribution of the generated data. To achieve the aim of the generative method, we design a two-part loss: a classification loss and a semantic-feature similarity loss.

We assume that \bm{x}_{i,n}^{\mathrm{g}}=G(z^{i},c_{n}) is the generated data with a specific category n, where the inputs of the generator G are the Gaussian random noise z^{i} and the categorical information c_{n}. Specifically, we use the one-hot encoded label as the categorical information. The generated data \bm{x}_{i,n}^{\mathrm{g}} are fed into the well-trained source-domain classifier f_{s}=h_{s}\circ g_{s}, where the output of h_{s} is the group-discriminator feature, which will be used in the adaptation module, and the output of g_{s} is the probability vector \bm{p_{i}}=(p_{i,1},\ldots,p_{i,n},\ldots,p_{i,K}), where p_{i,n} is the probability of the generated data belonging to the n^{\mathrm{th}} class. The semantic feature \bm{s}_{i,n}^{\mathrm{g}} used in the similarity loss and the diversity loss is the hidden-layer output of h_{s} (details of the hidden-layer selection can be found in Appendix C). We aim to update the parameters of the generator so that the generated data \bm{x}_{i,n}^{\mathrm{g}} with categorical information c_{n} belong to the n^{\mathrm{th}} class, i.e., making p_{i,n} close to 1. Specifically, we minimize the following loss to generate the data of a specific category n:

c=1Kn=1K1Bni=1Bnpi,n122,\displaystyle\mathcal{L}_{c}=\frac{1}{K}\sum_{n=1}^{K}\frac{1}{B_{n}}\sum_{i=1}^{B_{n}}\left\|p_{i,n}-1\right\|_{2}^{2}, (6)

where BnB_{n} is the batch size of the generator.

Input: a conditional generator G_{\theta_{G}} parameterized by \theta_{G}, a group discriminator D_{\theta_{D}} parameterized by \theta_{D}, a classifier f_{\theta_{f}} parameterized by \theta_{f}, a kernel function k, generation batch size B_{n}, learning rates \alpha_{G}, \alpha_{D} and \alpha_{f}, total number of epochs T_{max}, number of adaptation epochs T_{f}, number of discriminator pretraining epochs T_{d}.
for t=1,2,\dots,T_{max} do
       for n=0,1,\dots,K-1 do
             1: Generate random noise z and categorical information c_{n};
             2: Generate data G(z,c_{n}) and then add them to \mathcal{D}_{m};
             3: Calculate the semantic features \bm{s} and classification probabilities \bm{p};
             4: Calculate the kernel matrix S^{n} of the semantic features using the kernel function k;
       end for
      5: Update \theta_{G}\leftarrow\theta_{G}-\alpha_{G}\nabla\mathcal{L}_{G_{d}}(s,p) using Eq. (11);
       if t=T_{max}-T_{f} then
             for i=1,2,\dots,T_{d} do
                   6: Sample \mathcal{G}_{1},\mathcal{G}_{3} from \mathcal{D}_{m}\times\mathcal{D}_{m};
                   7: Sample \mathcal{G}_{2},\mathcal{G}_{4} from \mathcal{D}_{m}\times\mathcal{D}_{t};
                   8: Update \theta_{D}\leftarrow\theta_{D}-\alpha_{D}\nabla\mathcal{L}_{D}(\{\mathcal{G}_{i}\}_{i=1}^{4}) using Eq. (12);
             end for
       end if
      if t\geq T_{max}-T_{f} then
             9: Sample \mathcal{G}_{1},\mathcal{G}_{3} from \mathcal{D}_{m}\times\mathcal{D}_{m};
             10: Sample \mathcal{G}_{2},\mathcal{G}_{4} from \mathcal{D}_{m}\times\mathcal{D}_{t};
             11: Update \theta_{f}\leftarrow\theta_{f}-\alpha_{f}\nabla\mathcal{L}_{f}(\{\mathcal{G}_{i}\}_{i=1}^{4},x) using Eq. (13);
             12: Update \theta_{D}\leftarrow\theta_{D}-\alpha_{D}\nabla\mathcal{L}_{D}(\{\mathcal{G}_{i}\}_{i=1}^{4}) using Eq. (12);
       end if
end for
Output: a well-trained classifier f_{\theta_{f}}.
Algorithm 1 Diversity-enhancing Generative Network (DEG-Net)

To make the generated data closer to the data in the target domain, we need a loss function that measures the difference between data from the two domains. Motivated by Zheng & Zhou (2021), DEG-Net uses semantic features to calculate the similarities. To avoid the copy issue, we use the weighted \ell_{1} distance \left\|\bm{x}-\bm{y}\right\|_{1}=\sum_{i}\omega_{i}|x_{i}-y_{i}|, where \omega_{i}=|x_{i}-y_{i}|^{2}/\left\|\bm{x}-\bm{y}\right\|_{2}, since it encourages larger gradients for feature dimensions with higher residual errors. Compared to the \ell_{2}-norm, it better measures the similarity of the semantic features between the generated images and the target images, since the \ell_{1} distance is more robust to outliers (Oneto et al., 2020). Thus, the similarity loss is defined as follows:

s=1Kn=1K1mlMBni=1Bnj=1ml𝒔i,ng𝒔jt1,\displaystyle\mathcal{L}_{s}=\frac{1}{K}\sum_{n=1}^{K}\frac{1}{m_{\rm l}MB_{n}}\sum_{i=1}^{B_{n}}\sum_{j=1}^{m_{\rm l}}\left\|\bm{s}_{i,n}^{\mathrm{g}}-\bm{s}_{j}^{t}\right\|_{1}, (7)

where M=max𝒔1,𝒔2𝒳𝒔1𝒔21M=\max_{\bm{s}_{1},\bm{s}_{2}\in\mathcal{X}}\left\|\bm{s}_{1}-\bm{s}_{2}\right\|_{1} (𝒳\mathcal{X} is compact and 1\left\|\cdot\right\|_{1} is continuous), mlm_{\rm l} is the number of labeled data drawn from the target domain, 𝒔jt\bm{s}_{j}^{\mathrm{t}} and 𝒔i,ng\bm{s}_{i,n}^{\mathrm{g}} are the semantic features of target data and generated data, respectively. Combining Eq. (6) and Eq. (7), we obtain the loss to train the weight-shared conditional generative network:

G=c+λs,\displaystyle\mathcal{L}_{G}=\mathcal{L}_{c}+\lambda\mathcal{L}_{s}, (8)

where \lambda\geq 0 is a hyper-parameter balancing the two losses. Note that optimizing Eq. (8) corresponds to the generative principle Eq. (2) for the FHA problem, where Eq. (6) (resp. Eq. (7)) corresponds to \chi(h^{s},\mu) (resp. \chi(h^{t},\mu)). To ensure that the conditional generator can generate images with the correct class labels, we pretrain the generator for some epochs.
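For concreteness, the following PyTorch-style sketch implements Eqs. (6)-(8) under simplifying assumptions: the frozen source hypothesis is split into an encoder producing the semantic features and a head producing class probabilities, the semantic features are assumed to lie in [0, 1] so the normalizer M is set to the feature dimension, and a plain \ell_{1} distance replaces the weighted \ell_{1} distance of Eq. (7) for brevity. All names are illustrative.

import torch
import torch.nn.functional as F

def generator_loss(G, encoder, head, target_feats, K, B, z_dim, lam=1.0):
    """DEG-Net generative loss L_G = L_c + lambda * L_s (Eqs. 6-8), sketched.

    G:            weight-shared conditional generator, G(z, c) -> image
    encoder/head: frozen source hypothesis; encoder(x) gives semantic features s,
                  head(s) gives class logits (an assumed split of f^s)
    target_feats: (m_l, d) semantic features of the few labeled target data
    """
    loss_c, loss_s = 0.0, 0.0
    for n in range(K):
        z = torch.randn(B, z_dim)                                        # Gaussian noise
        c = F.one_hot(torch.full((B,), n, dtype=torch.long), K).float()  # one-hot c_n
        s = encoder(G(z, c))                                             # semantic features (B, d)
        p = head(s).softmax(dim=1)                                       # class probabilities
        loss_c = loss_c + ((p[:, n] - 1.0) ** 2).mean()                  # Eq. (6)
        M = s.shape[1]                                                   # assumed max l1 distance
        diff = (s.unsqueeze(1) - target_feats.unsqueeze(0)).abs()        # (B, m_l, d)
        loss_s = loss_s + diff.sum(-1).mean() / M                        # Eq. (7), plain l1 variant
    return (loss_c + lam * loss_s) / K                                   # Eq. (8)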

Generative Function with Diversity. As discussed above, weak dependence among the unlabeled data is an important condition for using generated unlabeled data to address the FHA problem. To ensure that the generated unlabeled data are only weakly dependent (i.e., to generate more diverse unlabeled data), it is necessary to use a diversity regularization when training the generator. Unfortunately, the log-coefficient score, the dependence measure used in our sample-complexity analysis, is hard to calculate, since its calculation requires the unknown distribution of the target-domain data. HSIC, a kernel independence measure, can also measure the dependency among the generated data. Different from the log-coefficient score, HSIC can be easily estimated (Gretton et al., 2005; Song et al., 2012):

HSIC^(X,Y)=1(N1)2Tr(KHLH),\displaystyle\widehat{\rm{HSIC}}(X,Y)=\frac{1}{(N-1)^{2}}\mathrm{Tr}(KHLH), (9)

where K=(k_{ij}) with k_{ij}=k(x_{i},x_{j}) and L=(l_{ij}) with l_{ij}=k(y_{i},y_{j}) are the kernel matrices (k(\cdot,\cdot) is the kernel function), and H=I-\frac{1}{N}\bm{1}\bm{1}^{\top} is the centering matrix. We minimize the HSIC measure of the generated data's semantic features to obtain weakly dependent data. Specifically, we use the Gaussian kernel as the kernel function, and minimize the following loss to generate more diverse data:

d\displaystyle\mathcal{L}_{d} =1Kn=1KHSIC^(𝒔ng,𝒔ng)\displaystyle=\frac{1}{K}\sum_{n=1}^{K}\sqrt{\widehat{\rm{HSIC}}(\bm{s}_{n}^{\mathrm{g}},\bm{s}_{n}^{\mathrm{g}})} (10)
=1Kn=1K1(Bn1)2Tr(SnHSnH),\displaystyle=\frac{1}{K}\sum_{n=1}^{K}\sqrt{\frac{1}{(B_{n}-1)^{2}}\mathrm{Tr}(S^{n}HS^{n}H)},

where S^{n}=(\bm{s}^{n}_{ij})=k(\bm{s}^{\mathrm{g}}_{i,n},\bm{s}^{\mathrm{g}}_{j,n}) is the kernel matrix of the semantic features of the generated data of a specific class. Hence, we obtain the total loss to train the generator with diversity enhancement:

Gd\displaystyle\mathcal{L}_{G_{d}} =G+βd,\displaystyle=\mathcal{L}_{G}+\beta\mathcal{L}_{d}, (11)

where \beta\geq 0 is a hyper-parameter balancing the generative loss and the diversity regularization. Note that the diversity regularization and the similarity loss restrict each other.
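As a minimal sketch of how Eqs. (9)-(11) can be estimated in practice (Gaussian kernel, biased HSIC estimator; the bandwidth and variable names are illustrative):

import torch

def gaussian_kernel(x, sigma=1.0):
    """Pairwise Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def hsic(features, sigma=1.0):
    """Biased estimate of HSIC for a feature batch with itself, cf. Eqs. (9)-(10):
    Tr(S H S H) / (B - 1)^2, where H = I - (1/B) 1 1^T is the centering matrix."""
    B = features.shape[0]
    S = gaussian_kernel(features, sigma)
    H = torch.eye(B) - torch.ones(B, B) / B
    return torch.trace(S @ H @ S @ H) / (B - 1) ** 2

def diversity_loss(sem_feats_per_class):
    """Eq. (10): average the square root of HSIC over the K per-class feature batches."""
    losses = [torch.sqrt(hsic(s) + 1e-12) for s in sem_feats_per_class]
    return torch.stack(losses).mean()

# Total generator objective, Eq. (11): L_Gd = L_G + beta * L_d, e.g.
# loss = generator_loss(G, encoder, head, target_feats, K, B, z_dim, lam) \
#        + beta * diversity_loss(sem_feats_per_class)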

Table 1: Classification accuracy ± standard deviation (%) on 6 digit FHA tasks. Bold values represent the highest accuracy in each column.
Tasks WA FHA Methods Number of Target Data per Class
1 2 3 4 5 6 7
M→U 69.7 FT 74.4±\pm0.7 76.7±\pm1.9 76.9±\pm2.2 77.3±\pm1.1 77.6±\pm1.4 78.3±\pm2.1 78.3±\pm1.6
SHOT 87.2±\pm0.2 87.9±\pm0.3 87.8±\pm0.4 88.0±\pm0.4 87.9±\pm0.5 88.0±\pm0.3 88.4±\pm0.3
S+FADA 83.7±\pm0.9 86.0±\pm0.4 86.1±\pm1.1 86.5±\pm0.8 86.8±\pm1.4 87.0±\pm0.6 87.2±\pm0.8
T+FADA 84.2±\pm0.1 84.2±\pm0.3 85.2±\pm0.9 85.2±\pm0.6 86.0±\pm1.5 86.8±\pm1.5 87.2±\pm0.5
TOHAN 87.7±\pm0.7 88.3±\pm0.5 88.5±\pm1.2 89.3±\pm0.9 89.4±\pm0.8 90.0±\pm1.0 90.4±\pm1.2
DEG-Net 83.1±\pm0.9 86.2±\pm0.8 86.5±\pm0.6 88.7±\pm0.9 89.6±\pm0.5 91.5±\pm0.6 92.1±\pm0.6
M→S 24.1 FT 26.7±\pm1.0 26.8±\pm2.1 26.8±\pm1.6 27.0±\pm0.7 27.3±\pm1.2 27.5±\pm0.8 28.3±\pm1.5
SHOT 25.7±\pm2.2 26.9±\pm1.2 27.9±\pm2.6 29.1±\pm0.4 29.1±\pm1.4 29.6±\pm1.7 29.8±\pm1.5
S+FADA 25.6±\pm1.3 27.7±\pm0.5 27.8±\pm0.7 28.2±\pm1.3 28.4±\pm1.4 29.0±\pm1.0 29.6±\pm1.9
T+FADA 25.3±\pm1.0 26.3±\pm0.8 28.9±\pm1.0 29.1±\pm1.3 29.2±\pm1.3 31.9±\pm0.4 32.4±\pm1.8
TOHAN 26.7±\pm0.1 28.6±\pm1.1 29.5±\pm1.4 29.6±\pm0.4 30.5±\pm1.2 32.1±\pm0.2 33.2±\pm0.8
DEG-Net 27.2±\pm0.3 28.5±\pm1.3 29.7±\pm0.9 30.7±\pm0.8 32.9±\pm1.5 33.7±\pm1.8 34.9±\pm1.6
S→U 64.3 FT 64.9±\pm1.1 66.5±\pm1.5 66.7±\pm1.7 67.3±\pm1.1 68.1±\pm2.3 68.3±\pm0.5 69.7±\pm1.4
SHOT 74.7±\pm0.3 75.5±\pm1.4 75.6±\pm1.0 75.8±\pm0.7 77.1±\pm2.1 77.8±\pm1.6 79.6±\pm0.6
S+FADA 72.2±\pm1.4 73.6±\pm1.4 74.7±\pm1.4 76.2±\pm1.3 77.2±\pm1.7 77.8±\pm3.0 79.7±\pm1.9
T+FADA 71.7±\pm0.6 74.3±\pm1.9 74.5±\pm0.8 75.9±\pm2.1 77.7±\pm1.5 76.8±\pm1.8 79.7±\pm1.9
TOHAN 75.8±\pm0.9 76.8±\pm1.2 79.4±\pm0.9 80.2±\pm0.6 80.5±\pm1.4 81.1±\pm1.1 82.6±\pm1.9
DEG-Net 75.2±\pm0.3 76.9±\pm1.5 78.2±\pm1.2 80.7±\pm1.5 81.7±\pm1.7 83.1±\pm1.7 84.3±\pm2.2
S→M 70.2 FT 70.2±\pm0.0 70.6±\pm0.3 70.7±\pm0.1 70.8±\pm0.3 70.9±\pm0.2 71.1±\pm0.3 71.1±\pm0.4
SHOT 72.6±\pm1.9 73.6±\pm2.0 74.1±\pm0.6 74.6±\pm1.2 74.9±\pm0.7 75.4±\pm0.3 76.1±\pm1.5
S+FADA 74.4±\pm1.5 83.1±\pm0.7 83.3±\pm1.1 85.9±\pm0.5 86.0±\pm1.2 87.6±\pm2.6 89.1±\pm1.0
T+FADA 74.2±\pm1.8 81.6±\pm4.0 83.4±\pm0.8 82.0±\pm2.3 86.2±\pm0.7 87.2±\pm0.8 88.2±\pm0.6
TOHAN 76.0±\pm1.9 83.3±\pm0.3 84.2±\pm0.4 86.5±\pm1.1 87.1±\pm1.3 88.0±\pm0.5 89.7±\pm0.5
DEG-Net 76.2±\pm1.3 78.2±\pm1.3 85.7±\pm0.6 85.9±\pm0.8 88.6±\pm1.6 89.5±\pm1.2 90.2±\pm0.7
U→M 82.9 FT 83.5±\pm0.4 84.3±\pm2.4 84.5±\pm0.7 85.5±\pm1.3 86.6±\pm1.0 87.2±\pm0.7 88.1±\pm2.7
SHOT 83.1±\pm0.5 85.5±\pm0.3 85.8±\pm0.6 86.0±\pm0.2 86.6±\pm0.2 86.7±\pm0.2 87.0±\pm0.1
S+FADA 83.2±\pm0.2 84.0±\pm0.3 85.0±\pm1.2 85.6±\pm0.5 85.7±\pm0.6 86.2±\pm0.6 87.2±\pm1.1
T+FADA 82.9±\pm0.7 83.9±\pm0.2 84.7±\pm0.8 85.4±\pm0.6 85.6±\pm0.7 86.3±\pm0.9 86.6±\pm0.7
TOHAN 84.0±\pm0.5 85.2±\pm0.3 85.6±\pm0.7 86.5±\pm0.5 87.3±\pm0.6 88.2±\pm0.7 89.2±\pm0.5
DEG-Net 82.2±\pm0.7 85.9±\pm0.6 86.5±\pm1.5 87.8±\pm0.9 88.9±\pm0.9 90.3±\pm0.5 91.6±\pm1.2
U→S 17.3 FT 23.4±\pm1.8 23.6±\pm2.7 23.8±\pm1.6 24.6±\pm1.4 24.6±\pm1.2 24.8±\pm0.7 25.5±\pm1.8
SHOT 30.3±\pm1.2 31.6±\pm0.4 29.8±\pm0.5 29.4±\pm0.3 29.7±\pm0.5 29.8±\pm0.8 30.1±\pm0.9
S+FADA 28.1±\pm1.2 28.7±\pm1.3 29.0±\pm1.2 30.1±\pm1.1 30.3±\pm1.3 30.7±\pm1.0 30.9±\pm1.5
T+FADA 27.5±\pm1.4 27.9±\pm0.9 28.4±\pm1.3 29.4±\pm1.8 29.5±\pm0.7 30.2±\pm1.0 30.4±\pm1.7
TOHAN 29.9±\pm1.2 30.5±\pm1.2 31.4±\pm1.1 32.8±\pm0.9 33.1±\pm1.0 34.0±\pm1.0 35.1±\pm1.8
DEG-Net 29.1±\pm1.3 30.7±\pm1.1 31.8±\pm0.7 33.0±\pm1.6 33.5±\pm1.4 35.1±\pm1.3 36.2±\pm1.2

4.2 Adaptation Module Using Generated Data

Following Chi et al. (2021a), we create paired data using the labeled data in the target domain and the generated data, and assign group labels to the pairs under the following rules: \mathcal{G}_{1} pairs two generated data with the same class label; \mathcal{G}_{2} pairs a generated data point and a target-domain data point with the same class label; \mathcal{G}_{3} pairs two generated data with different class labels; \mathcal{G}_{4} pairs a generated data point and a target-domain data point with different class labels. Using adversarial learning, we train a discriminator D that distinguishes data pairs from different domains while the classifier maintains high classification accuracy on the generated data. The discriminator D is a four-class classifier that takes the above paired group data as input. Different from classical adversarial domain adaptation (Ganin et al., 2016; Jiang et al., 2020), the group discriminator D decides which of the four groups a given data pair belongs to. By freezing the encoder, we train D with the cross-entropy loss:

D=𝔼^[i=14y𝒢ilog(D(ϕ(𝒢i)))],\mathcal{L}_{D}=-\hat{\mathbb{E}}\left[\sum_{i=1}^{4}y_{\mathcal{G}_{i}}\log(D(\phi(\mathcal{G}_{i})))\right], (12)

where \hat{\mathbb{E}}(\cdot) represents the empirical mean, y_{\mathcal{G}_{i}} is the group label of group \mathcal{G}_{i}, and \phi(\mathcal{G}_{i}):=[g(x_{1}),g(x_{2})] is the output of the encoder given the paired data as input.
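A minimal PyTorch-style sketch of the group construction and the discriminator update in Eq. (12), assuming gen_data and tgt_data map each class label to a list of examples and that encoder is the frozen feature extractor (all names are illustrative):

import random
import torch
import torch.nn.functional as F

def sample_groups(gen_data, tgt_data):
    """Build the four pair groups of Section 4.2 (group labels 0-3):
    G1: (gen, gen) same class      G2: (gen, tgt) same class
    G3: (gen, gen) different class G4: (gen, tgt) different class"""
    c1, c2 = random.sample(list(gen_data.keys()), 2)
    g1 = (random.choice(gen_data[c1]), random.choice(gen_data[c1]))
    g2 = (random.choice(gen_data[c1]), random.choice(tgt_data[c1]))
    g3 = (random.choice(gen_data[c1]), random.choice(gen_data[c2]))
    g4 = (random.choice(gen_data[c1]), random.choice(tgt_data[c2]))
    return [g1, g2, g3, g4]

def discriminator_loss(D, encoder, groups):
    """Eq. (12): 4-way cross-entropy of the group discriminator on encoded pairs,
    with the encoder frozen."""
    with torch.no_grad():                          # freeze the encoder
        feats = [torch.cat([encoder(x1.unsqueeze(0)), encoder(x2.unsqueeze(0))], dim=1)
                 for (x1, x2) in groups]
    logits = D(torch.cat(feats, dim=0))            # (4, 4) group logits
    labels = torch.arange(4)                       # group labels y_{G_1..G_4}
    return F.cross_entropy(logits, labels)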

Next, we train the classifier f^{t}=h_{t}\circ g_{t}, which is initialized with the same weights as the source classifier f^{s}=h_{s}\circ g_{s}, while freezing the group discriminator. Motivated by non-saturating games (Goodfellow, 2016), we minimize the following loss to update f^{t}:

f=\displaystyle\mathcal{L}_{f}= γ𝔼^[y𝒢1log(D(ϕ(𝒢2)))y𝒢3log(D(ϕ(𝒢4)))]\displaystyle-\gamma\hat{\mathbb{E}}\left[y_{\mathcal{G}_{1}}\log\left(D\left(\phi\left(\mathcal{G}_{2}\right)\right)\right)-y_{\mathcal{G}_{3}}\log\left(D\left(\phi\left(\mathcal{G}_{4}\right)\right)\right)\right]
+𝔼^[(ft(xt),ft(xt))],\displaystyle+\hat{\mathbb{E}}\left[\ell\left(f_{t}\left(x_{t}\right),f_{t}^{*}\left(x_{t}\right)\right)\right], (13)

where \gamma\geq 0 is a hyper-parameter, \ell is the cross-entropy loss, and f_{t}^{*} is the optimal target model. As demonstrated in Theorem 1, the generated data only need to be unlabeled to help address the FHA problem. Thus, we only use the labeled target data for the target supervised loss in Eq. (13).
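A hedged sketch of the classifier update against the frozen group discriminator, in the spirit of Eq. (13) and the non-saturating formulation (the sign convention and the pairing of group indices follow our reading of Section 4.2; encoder, head, and the G2/G4 batches are illustrative names):

import torch
import torch.nn.functional as F

def classifier_loss(encoder, head, D, pairs_g2, pairs_g4, x_t, y_t, gamma=0.1):
    """Update the target classifier f^t = head(encoder(.)) to confuse the frozen
    group discriminator D, plus a supervised loss on the few labeled target data."""
    for p in D.parameters():
        p.requires_grad_(False)                        # freeze the discriminator
    x1, x2 = pairs_g2                                  # (generated, target) pairs, same class
    x3, x4 = pairs_g4                                  # (generated, target) pairs, diff. class
    feats_g2 = torch.cat([encoder(x1), encoder(x2)], dim=1)
    feats_g4 = torch.cat([encoder(x3), encoder(x4)], dim=1)
    # Adversarial terms: G2 pairs should be mistaken for G1 (index 0),
    # G4 pairs for G3 (index 2), so that domains become indistinguishable.
    log_p2 = F.log_softmax(D(feats_g2), dim=1)[:, 0]
    log_p4 = F.log_softmax(D(feats_g4), dim=1)[:, 2]
    adv = -(log_p2 + log_p4).mean()
    sup = F.cross_entropy(head(encoder(x_t)), y_t)     # supervised loss on labeled target data
    return gamma * adv + sup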

Table 2: Classification accuracy ± standard deviation (%) on 2 object FHA tasks. Bold values represent the highest accuracy in each row.
Tasks FHA Methods
WA FT SHOT S+FADA T+FADA TOHAN DEG-Net
CF→SL 70.6 71.5±\pm1.0 71.9±\pm0.4 72.1±\pm0.4 71.3±\pm0.5 72.8±\pm0.1 74.3±\pm0.3
SL→CF 51.8 54.3±\pm0.5 53.9±\pm0.2 56.9±\pm0.5 55.8±\pm0.8 56.6±\pm0.3 57.2±\pm0.5
Table 3: Classification accuracy on the VisDA-C dataset. Bold value represents the highest accuracy in each row.
FHA Methods Number of Target Data per Class
8 9 10
SHOT 62.07 62.28 62.36
TOHAN 64.75 64.98 64.83
DEG-Net 65.36 65.55 66.36
Table 4: Ablation study. Classification accuracy ± standard deviation (%) on M→U. Bold values represent the highest accuracy in each column.
FHA Methods Number of Target Data per Class
1 2 3 4 5 6 7
TOHAN 76.0±\pm1.9 83.3±\pm0.3 84.2±\pm0.4 86.5±\pm1.1 87.1±\pm1.3 88.0±\pm0.5 89.7±\pm0.5
Separate generative DEG-Net 75.7±\pm0.7 84.7±\pm0.5 85.0±\pm1.2 85.9±\pm0.9 87.4±\pm0.8 89.1±\pm1.0 90.4±\pm1.2
DEG-Net w/o diversity 87.2±\pm1.9 89.5±\pm0.3 89.2±\pm0.4 90.2±\pm1.1 90.3±\pm1.3 91.1±\pm0.5 91.2±\pm0.5
DEG-Net 87.3±\pm0.9 89.2±\pm0.8 90.1±\pm0.6 90.8±\pm0.9 90.6±\pm0.5 91.5±\pm0.6 92.1±\pm0.6

5 Experiments

We compare DEG-Net with previous FHA methods on digit datasets (i.e., MNIST (M), USPS (U), and SVHN (S)) and object datasets (i.e., CIFAR-10 (CF), STL-10 (SL), and VisDA-C), following Chi et al. (2021a). Following the standard domain-adaptation protocols (Shu et al., 2018), we compare DEG-Net with 6 baselines: (1) without adaptation (WA); (2) fine-tuning (FT); (3) SHOT (Liang et al., 2020); (4) S+FADA (Chi et al., 2021a); (5) T+FADA (Chi et al., 2021a); and (6) TOHAN (Chi et al., 2021a). Details regarding the datasets are in Appendix B, and details regarding the baselines and implementation are in Appendix C.

Digits Datasets. Following Chi et al. (2021a) and Motiian et al. (2017), we conduct 6 adaptation tasks among the 3 digit datasets and choose the number of target data from 1 to 7 per class. The target-domain classification accuracy of our method over the 6 tasks is shown in Table 1. The results show that DEG-Net performs best on almost all the tasks. The accuracy of DEG-Net is lower than that of TOHAN when the amount of target data is very small. The diversity regularization and the similarity loss restrict each other to avoid the copy issue. However, when the amount of target data is very small, the target-domain information is limited, so the generator is less likely to generate data similar to the target domain. The diversity loss enhances this adversarial effect, causing DEG-Net to degrade toward TOHAN and SHOT. Another improvement of DEG-Net over TOHAN is the faster training of the generator: DEG-Net needs 0.93s per training epoch, while TOHAN needs 1.35s.

Objects Datasets. Following Chi et al. (2021a), we examine the performance of DEG-Net on 2 object tasks and choose the number of target data as 10 per class. The classification accuracies on the object tasks are shown in Table 2. Our method outperforms the baselines. On CF→SL, we achieve a 1.5% improvement over TOHAN. On SL→CF, we achieve an accuracy of 57.2%, a 0.3% improvement over S+FADA. The effect of DEG-Net is less obvious on the object tasks. A possible reason is that the simple structure of the generative network makes it difficult to generate diverse and correct data, given the complexity of the datasets.

To evaluate our method on larger datasets, we conduct an experiment on VisDA-C (Peng et al., 2017), a commonly used large-scale domain adaptation benchmark. We choose the synthetic data as the source domain and the real data as the target domain, and choose the number of target data from 8 to 10 per class. The target-domain classification accuracy of our method is shown in Table 3. DEG-Net achieves the highest accuracy over the baselines. We find that the accuracy of the generative methods (DEG-Net and TOHAN) is higher than that of SHOT. The reason may be that generative methods can provide a similar amount of valid generated data per class for adaptation learning. Our method performs better than TOHAN, which demonstrates that diversity matters in generative methods for addressing the FHA problem. In addition, it should be noted that TOHAN trains a different generator per class, which consumes more memory resources.

DEG-Net Generates More Diverse Data Than TOHAN. In this part, we analyze the diversity of the data generated by DEG-Net and TOHAN to see whether our generation process produces more diverse data than TOHAN's. We choose the square root of the HSIC to measure the diversity of the generated data on the M→S task, and calculate the HSIC value among the target-domain data as a reference value, which is 0.0013. The average diversity measure of DEG-Net is 0.0019, and the average diversity measure of TOHAN is 0.0027. Hence, DEG-Net generates more diverse data than TOHAN. The detailed diversity analysis can be found in Appendix D.

Ablation Study. To show the advantage of the weight-shared architecture and the diversity loss, we conduct two experiments: (1) a weight-shared architecture that is the same as DEG-Net but uses Eq. (8) to train the generator (DEG-Net without diversity); (2) a separate generative method, similar to TOHAN, which has K generators and uses the semantic features to calculate the similarity loss \mathcal{L}_{G_{s}}^{n} for training each generator:

1Bnpn122+λ1NyMBni=1Bnj=1Ny𝒔ign𝒔jt1.\frac{1}{B_{n}}\left\|p_{n}-1\right\|_{2}^{2}+\lambda\frac{1}{N_{y}MB_{n}}\sum_{i=1}^{B_{n}}\sum_{j=1}^{N_{y}}\left\|\bm{s}_{i}^{gn}-\bm{s}_{j}^{t}\right\|_{1}. (14)

As shown in Table 4, DEG-Net works better than both methods introduced above, and the weight-shared architecture works better than the separate generative method, which reveals that both the weight-shared architecture and the diversity loss can improve the quality of the generated data and thus achieve higher accuracy. Specifically, compared to the modified DEG-Net without the diversity loss, the separate generative method ignores the generalization knowledge in the semantic features that is shared across all classes. The modified DEG-Net discards the diversity loss, and thus generates less diverse data, resulting in worse performance. However, the HSIC diversity loss does not work in all situations. DEG-Net achieves accuracy similar to, or even worse than, the modified DEG-Net without diversity when the amount of labeled data is very small (i.e., m_{\rm l}\leq 2). This phenomenon may be caused by lower-quality data generated under the diversity loss. Estimating HSIC well requires enough data. Since the diversity loss counteracts the similarity loss, the generator is less likely to generate data similar to the target domain when the HSIC loss is not estimated well (i.e., when the distribution of the generated data is far from the target domain). In addition, we conduct an ablation study comparing our method to TOHAN with basic geometric data augmentation for the FHA problem on the digit tasks. The results show that the performance of the augmentation techniques is generally worse than that of our method; details of these experiments can be found in Appendix D.

6 Conclusion

In this paper, we focus on generating more diverse unlabeled data for addressing the few-shot hypothesis adaptation (FHA) problem. We show experimentally and theoretically that the diversity of the generated data (i.e., the independence among the generated data) matters in addressing the FHA problem. To address the FHA problem, we propose a diversity-enhancing generative network (DEG-Net), which consists of a generation module and an adaptation module. With a weight-shared conditional generative method equipped with a kernel independence measure, the Hilbert-Schmidt independence criterion, DEG-Net can generate more diverse unlabeled data and achieve better performance. Experiments show that the data generated by DEG-Net are more diverse. Owing to the high diversity of the generated data, DEG-Net achieves state-of-the-art performance when addressing the FHA problem, which lights up a novel and theoretically guaranteed road for the FHA problem in the future.

Acknowledgements

RJD and BH were supported by NSFC Young Scientists Fund No. 62006202 and Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652, and HKBU CSD Departmental Incentive Grant. BH was also supported by RIKEN Collaborative Research Fund. MS was supported by JST CREST Grant Number JPMJCR18A2.

References

  • Balcan & Blum (2010) Balcan, M. and Blum, A. A discriminative model for semi-supervised learning. Journal of ACM, 57(3):19:1–19:46, 2010.
  • Blaschko & Gretton (2008) Blaschko, M. and Gretton, A. Learning taxonomies by dependence maximization. In NeurIPS, 2008.
  • Chen et al. (2020) Chen, X., Wang, Z., Tang, S., and Muandet, K. Mate: plugging in model awareness to task embedding for meta learning. In NeurIPS, 2020.
  • Chi et al. (2021a) Chi, H., Liu, F., Yang, W., Lan, L., Liu, T., Han, B., Cheung, W., and Kwok, J. Tohan: A one-step approach towards few-shot hypothesis adaptation. In NeurIPS, 2021a.
  • Chi et al. (2021b) Chi, H., Liu, F., Yang, W., Lan, L., Liu, T., Han, B., Niu, G., Zhou, M., and Sugiyama, M. Demystifying assumptions in learning to discover novel classes. In ICLR, 2021b.
  • Chi et al. (2022) Chi, H., Liu, F., Yang, W., Lan, L., Liu, T., Han, B., Niu, G., Zhou, M., and Sugiyama, M. Meta discovery: Learning to discover novel classes given very limited data. In ICLR, 2022.
  • Coates et al. (2011) Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
  • Dagan et al. (2019) Dagan, Y., Daskalakis, C., Dikkala, N., and Jayanti, S. Learning from weakly dependent data under dobrushin’s condition. In COLT, 2019.
  • Du et al. (2017) Du, S. S., Koushik, J., Singh, A., and Póczos, B. Hypothesis transfer learning via transformation functions. In NeurIPS, 2017.
  • Fang et al. (2022) Fang, Z., Lu, J., Liu, F., and Zhang, G. Semi-supervised heterogeneous domain adaptation: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1087–1105, 2022.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • Goodfellow (2016) Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
  • Gretton et al. (2005) Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005.
  • Gretton et al. (2007) Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. A kernel statistical test of independence. In NeurIPS, 2007.
  • Guo et al. (2022) Guo, X., Lin, F., Yi, C., Song, J., Sun, D., Lin, L., Zhong, Z., Wu, Z., Wang, X., Zhang, Y., et al. Deep transfer learning enables lesion tracing of circulating tumor cells. Nature Communications, 13(1):7687, 2022.
  • Hull (1994) Hull, J. J. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
  • Jiang et al. (2020) Jiang, P., Wu, A., Han, Y., Shao, Y., Qi, M., and Li, B. Bidirectional adversarial training for semi-supervised domain adaptation. In IJCAI, 2020.
  • Jing et al. (2020) Jing, Y., Liu, X., Ding, Y., Wang, X., Ding, E., Song, M., and Wen, S. Dynamic instance normalization for arbitrary style transfer. In AAAI, 2020.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Kuzborskij & Orabona (2013) Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In ICML, 2013.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2021) Li, Y., Pogodin, R., Sutherland, D. J., and Gretton, A. Self-supervised learning with kernel dependence maximization. In NeurIPS, 2021.
  • Liang et al. (2020) Liang, J., Hu, D., and Feng, J. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, 2020.
  • Liu et al. (2019) Liu, F., Lu, J., Han, B., Niu, G., Zhang, G., and Sugiyama, M. Butterfly: one-step approach towards wildly unsupervised domain adaptation. In NeurIPS LTS Workshop, 2019.
  • Liu et al. (2021a) Liu, F., Xu, W., Lu, J., and Sutherland, D. J. Meta two-sample testing: Learning kernels for testing with limited data. In NeurIPS, 2021a.
  • Liu et al. (2021b) Liu, L., Hamilton, W., Long, G., Jiang, J., and Larochelle, H. A universal representation transformer layer for few-shot image classification. In ICLR, 2021b.
  • Liu et al. (2021c) Liu, L., Zhou, T., Long, G., Jiang, J., Dong, X., and Zhang, C. Isometric propagation network for generalized zero-shot learning. In ICLR, 2021c.
  • Long et al. (2018) Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In NeurIPS, 2018.
  • Ma et al. (2020) Ma, W.-D. K., Lewis, J., and Kleijn, W. B. The HSIC bottleneck: Deep learning without back-propagation. In AAAI, 2020.
  • Motiian et al. (2017) Motiian, S., Jones, Q., Iranmanesh, S., and Doretto, G. Few-shot adversarial domain adaptation. In NeurIPS, 2017.
  • Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.
  • Oneto et al. (2020) Oneto, L., Donini, M., Luise, G., Ciliberto, C., Maurer, A., and Pontil, M. Exploiting mmd and sinkhorn divergences for fair and transferable representation learning. In NeurIPS, 2020.
  • Park et al. (2019) Park, S., Mello, S. D., Molchanov, P., Iqbal, U., Hilliges, O., and Kautz, J. Few-shot adaptive gaze estimation. In ICCV, 2019.
  • Peng et al. (2017) Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • Peng et al. (2022) Peng, X., Liu, F., Zhang, J., Lan, L., Ye, J., Liu, T., and Han, B. Bilateral dependency optimization: Defending against model-inversion attacks. In ACM KDD, 2022.
  • Pogodin & Latham (2020) Pogodin, R. and Latham, P. Kernelized information bottleneck leads to biologically plausible 3-factor hebbian learning in deep networks. In NeurIPS, 2020.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Shorten & Khoshgoftaar (2019) Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 2019.
  • Shu et al. (2018) Shu, R., Bui, H. H., Narui, H., and Ermon, S. A dirt-t approach to unsupervised domain adaptation. In ICLR, 2018.
  • Shui et al. (2022) Shui, C., Wang, B., and Gagné, C. On the benefits of representation regularization in invariance based domain generalization. Machine Learning, 111(3):895–915, 2022.
  • Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NeurIPS, 2017.
  • Song et al. (2007) Song, L., Smola, A., Gretton, A., and Borgwardt, K. M. A dependence maximization view of clustering. In ICML, 2007.
  • Song et al. (2012) Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(5), 2012.
  • Sun et al. (2019) Sun, Q., Liu, Y., Chua, T.-S., and Schiele, B. Meta-transfer learning for few-shot learning. In CVPR, 2019.
  • Teshima et al. (2020) Teshima, T., Sato, I., and Sugiyama, M. Few-shot domain adaptation by causal mechanism transfer. In NeurIPS, 2020.
  • Tommasi et al. (2010) Tommasi, T., Orabona, F., and Caputo, B. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In CVPR, 2010.
  • Wang et al. (2019) Wang, B., Mendez, J., Cai, M., and Eaton, E. Transfer learning via minimizing the performance gap between domains. In NeurIPS, 2019.
  • Wang et al. (2023) Wang, B., Mendez, J. A., Shui, C., Zhou, F., Wu, D., Xu, G., Gagn, C., and Eaton, E. Gap minimization for knowledge sharing and transfer. Journal of Machine Learning Research, 24(33):1–57, 2023.
  • Wang et al. (2020) Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1–34, 2020.
  • Webb & Sammut (2010) Webb, G. I. and Sammut, C. Encyclopedia of Machine Learning. 2010.
  • Yang et al. (2020) Yang, S., Liu, L., and Xu, M. Free lunch for few-shot learning: Distribution calibration. In ICLR, 2020.
  • Yang et al. (2021a) Yang, S., van de Weijer, J., Herranz, L., Jui, S., et al. Exploiting the intrinsic neighborhood structure for source-free domain adaptation. In NeurIPS, 2021a.
  • Yang et al. (2021b) Yang, S., Wang, Y., van de Weijer, J., Herranz, L., and Jui, S. Generalized source-free domain adaptation. In ICCV, 2021b.
  • Yao et al. (2021) Yao, H., Huang, L.-K., Zhang, L., Wei, Y., Tian, L., Zou, J., Huang, J., et al. Improving generalization in meta-learning via task augmentation. In ICML, 2021.
  • Zamir et al. (2018) Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
  • Zhang et al. (2020) Zhang, Y., Liu, F., Fang, Z., Yuan, B., Zhang, G., and Lu, J. Clarinet: A one-step approach towards budget-friendly unsupervised domain adaptation. In IJCAI, 2020.
  • Zheng & Zhou (2021) Zheng, H. and Zhou, M. Exploiting chain rule and bayes’ theorem to compare probability distributions. In NeurIPS, 2021.
  • Zhong et al. (2021) Zhong, L., Fang, Z., Liu, F., Lu, J., Yuan, B., and Zhang, G. How does the combined risk affect the performance of unsupervised domain adaptation approaches? In AAAI, 2021.
  • Zhou et al. (2020) Zhou, F., Jiang, Z., Shui, C., Wang, B., and Chaib-draa, B. Domain generalization with optimal transport and metric learning. arXiv preprint arXiv: 2007.10573, 2020.

Appendix A Proof of Theorem 1

Before proving Theorem 1, we first introduce a McDiarmid-like inequality under the log-coefficient of a random vector \bm{Z}.

Lemma 1 (McDiarmid-like Inequality under the Log-coefficient of 𝒁{\bm{Z}}).

Let μ(m)\mu^{(m)} be a distribution defined over 𝒵m{\mathcal{Z}}^{m}, and 𝐙=(Z1,,Zm)μ(m){\bm{Z}}=(Z_{1},\dots,Z_{m})\sim\mu^{(m)} be a random vector, and g:𝒵mg:{\mathcal{Z}}^{m}\rightarrow\mathbb{R} with the following bounded differences property with parameters λ1,,λm>0\lambda_{1},\dots,\lambda_{m}>0:

\forall\,\bm{z},\bm{z}^{\prime}\in{\mathcal{Z}}^{m}:~{}|g(\bm{z})-g(\bm{z}^{\prime})|\leq\sum_{i=1}^{m}1_{z_{i}\neq z_{i}^{\prime}}\lambda_{i}, (15)

where 𝒵m=(𝒳×𝒴)m{\mathcal{Z}}^{m}=({\mathcal{X}}\times{\mathcal{Y}})^{m}. If αlog(𝐙)<1\alpha_{\log}({\bm{Z}})<1, then, for all t>0t>0,

Pr[|g(𝒁)𝔼[g(𝒁)]|t]2exp((1αlog(𝒁))t22i=1mλi2).\displaystyle\textnormal{Pr}[|g({\bm{Z}})-\mathbb{E}[g({\bm{Z}})]|\geq t]\leq 2\exp\left(-\frac{(1-\alpha_{\log}({\bm{Z}}))t^{2}}{2\sum_{i=1}^{m}\lambda^{2}_{i}}\right). (16)
Proof.

Based on Definition 2.2 and Lemma 5.2 in (Dagan et al., 2019), we know that μ(m)\mu^{(m)} satisfies Dobrushin’s condition with a coefficient α<1\alpha<1. Thus, based on Theorem 2.3 in (Dagan et al., 2019), we have

Pr[|g(𝒁)𝔼[g(𝒁)]|t]2exp((1α)t22i=1mλi2).\displaystyle\textnormal{Pr}[|g({\bm{Z}})-\mathbb{E}[g({\bm{Z}})]|\geq t]\leq 2\exp\left(-\frac{(1-\alpha)t^{2}}{2\sum_{i=1}^{m}\lambda^{2}_{i}}\right). (17)

Since $\alpha\leq\alpha_{\log}({\bm{Z}})$, we have $1-\alpha_{\log}({\bm{Z}})\leq 1-\alpha$, so the right-hand side of Eq. (17) is at most the right-hand side of Eq. (16), which proves the lemma. ∎

Then, we introduce a recent result on bounding the expected supremum of an empirical process using the corresponding Gaussian complexity.

Theorem 2 (Dagan et al., 2019).

Let ${\bm{Z}}$ be a random vector over some domain ${\mathcal{Z}}^{m}$ and let ${\mathcal{G}}$ be a class of functions from ${\mathcal{Z}}$ to $\mathbb{R}$. If $\alpha_{\log}({\bm{Z}})<1/2$, then

\mathop{\mathbb{E}}\limits_{S\sim{\bm{Z}}}\,\mathop{\sup}_{g\in{\mathcal{G}}}\left(\frac{1}{m}\sum_{i=1}^{m}g(s_{i})-\mathop{\mathbb{E}}\limits_{S}\left[\frac{1}{m}\sum_{i=1}^{m}g(s_{i})\right]\right)\leq\frac{C\,\mathfrak{G}_{{\bm{Z}}}({\mathcal{G}})}{\sqrt{1-2\alpha_{\log}({\bm{Z}})}}, \qquad (18)

where $C>0$ is a universal constant, and $S=(s_{1},\dots,s_{m})$ is a sample drawn from ${\bm{Z}}$.

Note that the above result is very general: it does not assume that the $m$ marginals of the distribution of ${\bm{Z}}$ are identical. Based on the above theorem and lemma, we can prove the following lemma.

Lemma 2.

Let ${\bm{Z}}$ be a random vector over some domain ${\mathcal{Z}}^{m}$ and let ${\mathcal{G}}$ be a class of functions from ${\mathcal{Z}}$ to $\mathbb{R}$. If $\alpha_{\log}({\bm{Z}})<1/2$, and there exists $L>0$ such that $|g(Z_{i})|\leq L$ for any $g\in{\mathcal{G}}$ and any $Z_{i}$, then, for any $t>0$,

\mathop{\textnormal{Pr}}\limits_{S\sim{\bm{Z}}}\left[\mathop{\sup}\limits_{g\in{\mathcal{G}}}\left|\frac{1}{m}\sum_{i=1}^{m}g(s_{i})-\mathop{\mathbb{E}}\limits_{S}\left[\frac{1}{m}\sum_{i=1}^{m}g(s_{i})\right]\right|\geq\frac{C\,\mathfrak{G}_{{\bm{Z}}}({\mathcal{G}})}{\sqrt{1-2\alpha_{\log}({\bm{Z}})}}+t\right]\leq e^{-\frac{(1-\alpha_{\log}({\bm{Z}}))mt^{2}}{C^{\prime}L^{2}}} \qquad (19)

for some universal constants $C,C^{\prime}>0$.

Proof.

We first prove the version without the absolute value. Let

M(S)=\mathop{\sup}\limits_{g\in{\mathcal{G}}}\left(\frac{1}{m}\sum_{i=1}^{m}g(s_{i})-\mathop{\mathbb{E}}\limits_{S}\left[\frac{1}{m}\sum_{i=1}^{m}g(s_{i})\right]\right). \qquad (20)

For any $S\sim{\bm{Z}}$ and $S^{\prime}\sim{\bm{Z}}$, we have $|M(S)-M(S^{\prime})|\leq\sum_{i=1}^{m}2L\,1_{s_{i}\neq s_{i}^{\prime}}/m$. According to Lemma 1, we have

\mathop{\textnormal{Pr}}\limits_{S\sim{\bm{Z}}}\left[M(S)-\mathbb{E}[M(S)]\geq t\right]\leq\exp\left(-\frac{(1-\alpha_{\log}({\bm{Z}}))mt^{2}}{C^{\prime}L^{2}}\right) \qquad (21)

for some universal constant $C^{\prime}>0$. Then, combining this with the upper bound on $\mathbb{E}[M(S)]$ given by Theorem 2, we have

\mathop{\textnormal{Pr}}\limits_{S\sim{\bm{Z}}}\left[\mathop{\sup}\limits_{g\in{\mathcal{G}}}\left(\frac{1}{m}\sum_{i=1}^{m}g(s_{i})-\mathop{\mathbb{E}}\limits_{S}\left[\frac{1}{m}\sum_{i=1}^{m}g(s_{i})\right]\right)\geq\frac{C\,\mathfrak{G}_{{\bm{Z}}}({\mathcal{G}})}{\sqrt{1-2\alpha_{\log}({\bm{Z}})}}+t\right]\leq e^{-\frac{(1-\alpha_{\log}({\bm{Z}}))mt^{2}}{C^{\prime}L^{2}}}. \qquad (22)

For the opposite inequality (the $-M(S)$ part), following (Dagan et al., 2019), we can apply the same arguments to $-{\mathcal{G}}$. Note that $\mathfrak{G}(-{\mathcal{G}})=\mathfrak{G}({\mathcal{G}})$, which concludes the bound. ∎

The above lemma is a slightly more general version of Theorem 6.7 in (Dagan et al., 2019), obtained by taking the influence of $\alpha_{\log}({\bm{Z}})$ into account. Based on Lemma 2, we can prove Theorem 1 below.

Theorem 1 (restated).

Proof.

Let $S$ be the set of $m_{\rm u}$ unlabeled data. Based on the relation between the VC dimension and the Gaussian complexity, Lemma 2 gives that, with probability at least $1-\frac{\delta}{2}$, we have

|\textnormal{Pr}_{x\sim\bar{S}}[\chi_{h}(x)=1]-\textnormal{Pr}_{x\sim\mu_{X}^{t}}[\chi_{h}(x)=1]|\leq\epsilon\quad\text{for all }\chi_{h}\in\chi(\mathcal{H}),

where $\bar{S}$ denotes the uniform distribution over $S$. Since $\chi_{h}(x)=\chi(h,x)$, this implies that

|\chi(h,D)-\hat{\chi}(h,S)|\leq\epsilon\quad\text{for all }h\in\mathcal{H}.

Therefore, any hypothesis $h$ with $\hat{\chi}(h,S)\geq 1-t-\epsilon$ satisfies $\chi(h,D)\geq 1-t-2\epsilon$, so the set of hypotheses with $\hat{\chi}(h,S)\geq 1-t-\epsilon$ is contained in $\mathcal{H}_{\mu_{X}^{t},\chi}(t+2\epsilon)$.

The bound on the number of labeled data now follows directly from known concentration results, using the expected number of partitions instead of the maximum in the standard VC-dimension bounds. This bound ensures that, with probability $1-\frac{\delta}{2}$, none of the functions $h\in\mathcal{H}_{\mu_{X}^{t},\chi}(t+2\epsilon)$ with $err(h)\geq\epsilon$ have $\widehat{err}(h)=0$.

The above two arguments together imply that, with probability $1-\delta$, all $h\in\mathcal{H}$ with $\widehat{err}(h)=0$ and $\hat{\chi}(h,S)\geq 1-t-\epsilon$ have $err(h)\leq\epsilon$, and furthermore $f^{*}$ has $\hat{\chi}(f^{*},S)\geq 1-t-\epsilon$. This in turn implies that, with probability at least $1-\delta$, we have $err(\hat{h})\leq\epsilon$, where $\hat{h}=\mathop{\arg\max}_{h\in\mathcal{H}_{0}}\hat{\chi}(h,S)$. ∎

Appendix B Datasets

Digits.

Following TOHAN (Chi et al., 2021a), we conduct 6 adaptation experiments on digits datasets: $M\to U$, $M\to S$, $S\to U$, $S\to M$, $U\to M$ and $U\to S$. MNIST ($M$) (LeCun et al., 1998) is a handwritten-digits dataset whose images have been size-normalized and centered in $28\times 28$ pixels. SVHN ($S$) (Netzer et al., 2011) is a real-world digits dataset whose images are $32\times 32$ pixels with 3 channels. USPS ($U$) (Hull, 1994) images are $16\times 16$ grayscale pixels. The SVHN and USPS images are resized to $28\times 28$ grayscale pixels in the adaptation tasks (Chi et al., 2021a).
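As a concrete illustration of this preprocessing, a minimal torchvision sketch is given below; the exact transform composition used in our implementation may differ, and the dataset loading is omitted.

from torchvision import transforms

# Bring SVHN (32x32, RGB) and USPS (16x16, grayscale) images into the MNIST format:
# 28x28 single-channel images scaled to [0, 1].
to_mnist_format = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # no-op for images that are already grayscale
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])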

Objects.

Following Sun et al. (2019), we compare DEG-Net with the baselines on CIFAR-10 and STL-10. The CIFAR-10 dataset (Krizhevsky et al., 2009) contains 60,000 $32\times 32$ color images in 10 categories, while the STL-10 dataset (Coates et al., 2011) is inspired by CIFAR-10 with some modifications. These two datasets share only nine overlapping classes, so we removed the non-overlapping classes (“frog” in CIFAR-10 and “monkey” in STL-10) (Shu et al., 2018). We also compare DEG-Net with the baselines on VisDA-C (Peng et al., 2017), a challenging large-scale dataset that mainly focuses on a 12-class synthesis-to-real classification task.
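Below is a minimal sketch of how the non-overlapping CIFAR-10 class can be filtered out with torchvision; the analogous step for STL-10 ("monkey") and the remapping of class indices between the two datasets are omitted here.

from torchvision import datasets

cifar = datasets.CIFAR10(root="./data", train=True, download=True)
frog_idx = cifar.class_to_idx["frog"]  # the CIFAR-10 class that does not appear in STL-10

# Keep only the nine classes shared with STL-10.
keep = [i for i, label in enumerate(cifar.targets) if label != frog_idx]
cifar.data = cifar.data[keep]
cifar.targets = [cifar.targets[i] for i in keep]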

Appendix C Details regarding Experiments

Baselines. We follow the standard domain-adaptation protocols (Shu et al., 2018) and compare DEG-Net with 6 baselines: (1) Without adaptation (WA): classify the target domain with the well-trained source-domain classifier. (2) Fine-tuning (FT): train the last fully connected layer of the classifier with the few accessible labeled data. (3) SHOT: an HTL method, which we modify to use both the labeled and the unlabeled target data (Liang et al., 2020). (4) S+FADA: generate unlabeled data using the loss $\mathcal{L}_{c}$ with the well-trained source classifier and feed them into DANN (Ganin et al., 2016). (5) T+FADA: generate unlabeled data using the loss $\mathcal{L}_{s}$ with the few labeled target data and feed them into DANN. (6) TOHAN: a recent FHA method, which generates unlabeled data of each category separately (Chi et al., 2021a).

Table 5: Diversity of the target data and of the data generated by different methods, measured by the HSIC value over a batch of 32 samples (a lower value indicates more diverse data). The Target Data value depends only on the target domain and is therefore shared by the two tasks with the same target.
Task Target Data TOHAN DEG-Net
$M\to S$ 0.0013 0.0027 0.0019
$U\to S$ 0.0013 0.0025 0.0021
$S\to M$ 0.0004 0.0016 0.0013
$U\to M$ 0.0004 0.0014 0.0008
$S\to U$ 0.0002 0.0012 0.0010
$M\to U$ 0.0002 0.0009 0.0005
Table 6: Classification accuracy ± standard deviation (%) on digits FHA tasks with data augmentation. Bold value represents the highest average accuracy in each column.
Method Task Number of Target Data per Class
1 2 3 4 5 6 7
TOHAN w/ aug $M\to U$ 77.1±0.4 83.5±0.6 84.0±0.7 86.7±1.1 87.5±0.6 88.1±1.4 89.4±1.1
$M\to S$ 26.7±1.0 27.8±1.6 29.7±1.3 29.4±0.7 30.3±1.2 32.4±0.8 33.5±1.5
$S\to M$ 76.4±0.5 78.6±0.3 82.7±0.1 86.5±0.3 87.9±0.2 88.2±0.3 89.6±0.4
$U\to M$ 82.1±0.7 84.9±1.3 85.3±0.6 86.7±1.5 87.4±0.8 87.9±0.7 89.8±0.4
Avg. 65.6±0.7 68.7±1.0 70.4±0.7 72.4±0.9 73.3±0.7 74.1±0.8 75.6±0.8
TOHAN $M\to U$ 76.0±1.9 83.3±0.3 84.2±0.4 86.5±1.1 87.1±1.3 88.0±0.5 89.7±0.5
$M\to S$ 26.7±0.1 28.6±1.1 29.5±1.4 29.6±0.4 30.5±1.2 32.1±0.2 33.2±0.8
$S\to M$ 76.0±1.9 83.3±0.3 84.2±0.4 86.5±1.1 87.1±1.3 88.0±0.5 89.7±0.5
$U\to M$ 84.0±0.5 85.2±0.3 85.6±0.7 86.5±0.5 87.3±0.6 88.2±0.7 89.2±0.5
Avg. 65.7±1.1 70.1±0.5 70.9±0.7 72.2±0.8 73.0±1.1 74.0±0.5 75.5±0.6
DEG-Net $M\to U$ 83.1±0.9 86.2±0.8 86.5±0.6 88.7±0.9 89.6±0.5 91.5±0.6 92.1±0.6
$M\to S$ 27.2±0.3 28.5±1.3 29.7±0.9 30.7±0.8 32.9±1.5 33.7±1.8 34.9±1.6
$S\to M$ 76.2±1.3 78.2±1.3 85.7±0.6 85.9±0.8 88.6±1.6 89.5±1.2 90.2±0.7
$U\to M$ 82.2±0.7 85.9±0.6 86.5±1.5 87.8±0.9 88.9±0.9 90.3±0.5 91.6±1.2
Avg. 67.1±0.8 69.7±1.0 72.1±0.9 73.3±0.9 75.0±1.1 76.3±1.0 77.2±1.0

Implementation Details. We implement all methods in PyTorch 1.7.1 and Python 3.7.6, and conduct all experiments on NVIDIA RTX 2080Ti GPUs. Due to the limited computing resources available, we cannot choose more complex networks as the backbone of the generator.

Our conditional generator $G$ uses the standard DCGAN architecture (Radford et al., 2015). We adopt a LeNet-5 backbone with batch normalization and dropout to extract the group-discriminator feature in the digits tasks, and a ResNet-50 backbone in the object tasks. We employ fully connected layers with the softmax function as the classifier to obtain the class probabilities. The semantic feature in the digits tasks is the output of the first fully connected layer. We adopt 3 fully connected layers with the softmax function as the group discriminator $D$. When estimating the HSIC value, we add a small $\epsilon$ to the kernel function to ensure that the kernel matrix is positive definite.
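For concreteness, a minimal sketch of a (biased) batch HSIC estimator with this epsilon-jitter is shown below. The Gaussian kernel, its bandwidth, and the exact normalization are illustrative assumptions and may differ from the kernel used in our implementation.

import torch

def gaussian_kernel(x, sigma=1.0):
    # x: (m, d) matrix of semantic features; returns the (m, m) Gram matrix.
    sq_dist = torch.cdist(x, x, p=2) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0, eps=1e-5):
    # Biased HSIC estimator between two batches of features x and y, both of shape (m, d).
    m = x.size(0)
    eye = torch.eye(m, device=x.device)
    K = gaussian_kernel(x, sigma) + eps * eye  # epsilon-jitter keeps the kernel matrix positive definite
    L = gaussian_kernel(y, sigma) + eps * eye
    H = eye - torch.ones(m, m, device=x.device) / m  # centering matrix
    return torch.trace(K @ H @ L @ H) / (m - 1) ** 2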

Hyper-parameter Settings. Following the common protocol of domain adaptation (Shu et al., 2018), we set fixed hyper-parameters for the different datasets. We pretrain the conditional generator for 300 epochs and pretrain the group discriminator for 100 epochs. The number of training steps of the classifier (i.e., the adaptation module) is set to 50. For the generator and the group discriminator, the learning rate of the Adam optimizer is set to $1\times 10^{-4}$. For the classifier, the learning rate of the Adam optimizer is set to $1\times 10^{-3}$. The tradeoff parameter $\lambda$ in Eq. (8) is set to $0.9$ and the tradeoff parameter $\beta$ in Eq. (11) is set to $0.1$. Following Long et al. (2018), the tradeoff parameter $\gamma$ in Eq. (4.2) is set to $\frac{2}{1+\exp(-10\dot{q})}-1$.
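The schedule of $\gamma$ can be computed directly from this expression. A minimal sketch is given below, assuming (as in Long et al. (2018)) that $\dot{q}$ denotes the training progress increasing from 0 to 1:

import math

def gamma_schedule(q_dot):
    # q_dot: training progress in [0, 1]; gamma ramps smoothly from 0 towards 1.
    return 2.0 / (1.0 + math.exp(-10.0 * q_dot)) - 1.0

# gamma_schedule(0.0) = 0.0, gamma_schedule(0.5) ~ 0.987, gamma_schedule(1.0) ~ 0.9999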

Table 7: Classification accuracy (%) on digits FHA tasks using the generated data. Bold value represents the highest average accuracy in each column.
Task Method Number of Generated Data per Class Number of Target Data per Class
1 2 3 4 5 6 7
$M\to U$ TOHAN 0 76.0 83.3 84.2 86.5 87.1 88.0 89.7
5 75.8 83.7 84.3 84.3 85.0 87.5 88.1
20 76.0 83.3 84.3 84.0 85.1 87.2 89.5
DEG-Net 0 83.1 86.2 86.5 88.7 89.6 91.5 92.1
5 82.6 86.0 85.9 88.2 88.7 91.9 92.3
20 81.3 84.3 86.2 88.6 90.3 92.3 93.4
$M\to S$ TOHAN 0 26.7 28.6 29.5 29.6 30.5 32.1 33.2
5 26.2 28.4 28.9 29.1 30.2 31.4 32.5
20 25.8 26.9 29.8 27.4 29.8 32.7 32.8
DEG-Net 0 27.2 28.5 29.7 30.7 32.9 33.7 34.9
5 27.3 28.2 29.6 29.4 32.8 33.8 35.4
20 26.4 27.3 30.2 28.9 33.5 35.0 36.4

Appendix D Additional Analysis

Augmentation Techniques on the FHA Problem

In this section, we compare the accuracy of the target classifier trained by TOHAN with that of TOHAN combined with basic geometric data augmentation on the digits FHA tasks. Geometric data augmentation has been widely explored to diversify image data (Shorten & Khoshgoftaar, 2019). In our experiment, we randomly choose one or more augmentation techniques (resizing, shifting, cropping, and slight rotations from 1 to 20 degrees and from -1 to -20 degrees) and apply them to the data generated by TOHAN. The classification accuracy on the target domain for the 4 tasks, together with the average accuracy, is shown in Table 6.
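A minimal sketch of such an augmentation pipeline is shown below; the application probabilities and exact parameter ranges are illustrative choices, not necessarily those used in this comparison.

from torchvision import transforms

# Randomly combine the basic geometric augmentations described above.
augment = transforms.Compose([
    transforms.RandomApply([transforms.RandomResizedCrop(28, scale=(0.8, 1.0))], p=0.5),       # resizing and cropping
    transforms.RandomApply([transforms.RandomAffine(degrees=0, translate=(0.1, 0.1))], p=0.5),  # shifting
    transforms.RandomApply([transforms.RandomRotation(degrees=20)], p=0.5),                     # slight rotations within +/-20 degrees
])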

It is clear that TOHAN with these augmentation techniques still performs worse than our method in general. This may be because the generated images are similar to, or even identical to, the few target data, so the diversity of the generated data remains low even with data augmentation. Accordingly, the accuracy of TOHAN with augmentation is basically the same as that of TOHAN. The improvement brought by the augmentation becomes more noticeable as the number of target data increases.

Diversity Analysis of DEG-Net

In this section, we compare the diversity of the data generated by DEG-Net with that of the data generated by TOHAN and of the target data. Because the log-influence is difficult to compute, we instead use the HSIC value to measure the diversity of the data, where a lower value indicates more independent and thus more diverse data. Since the batch size used for generation during training is 32, we compute the HSIC value over 32 samples. Table 5 shows the diversity of the different data. It is clear that the diversity loss in DEG-Net works well in making the generated data more diverse.
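For reference, one way to obtain such a batch-level score with the HSIC estimator sketched in Appendix C is to average the pairwise HSIC values between the semantic feature vectors of the 32 samples, treating the feature dimensions as observations. This is a plausible reading of the measurement rather than the exact computation behind Table 5.

from itertools import combinations

def batch_diversity(features, hsic_fn):
    # features: (32, d) semantic features of one batch; a lower score means more diverse data.
    score, num_pairs = 0.0, 0
    for i, j in combinations(range(features.size(0)), 2):
        # Each feature vector is reshaped to (d, 1) so that the d dimensions act as observations.
        score += hsic_fn(features[i].unsqueeze(1), features[j].unsqueeze(1)).item()
        num_pairs += 1
    return score / num_pairs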

Data Efficiency Analysis of DEG-Net

In this section, we conduct experiments on the tasks $M\to S$ and $M\to U$ to analyze the efficiency of the generated data. Following the architecture of DEG-Net, we use Eq. (11) to train the conditional generator and obtain the following loss to update the classifier $f_{t}$:

\begin{split}\mathcal{L}^{*}_{f}=\ &-\gamma\hat{\mathbb{E}}\left[y_{\mathcal{G}_{1}}\log\left(D\left(\phi\left(\mathcal{G}_{2}\right)\right)\right)-y_{\mathcal{G}_{3}}\log\left(D\left(\phi\left(\mathcal{G}_{4}\right)\right)\right)\right]\\ &+\hat{\mathbb{E}}\left[\ell\left(f_{t}\left(x_{t}\right),f_{t}^{*}\left(x_{t}\right)\right)\right]+\hat{\mathbb{E}}\left[\ell\left(f_{t}\left(x_{g}\right),f_{t}^{*}\left(x_{g}\right)\right)\right],\end{split} \qquad (23)

where $x_{g}$ is the generated data and $f_{t}^{*}(x_{g})$ is the label of the generated data. We use different numbers of data generated by TOHAN (Chi et al., 2021a) and DEG-Net to train the classifier; the classification accuracy is shown in Table 7. It is clear that the performance of using the data generated by TOHAN is almost the same as using only the labeled data. In addition, the data generated by DEG-Net cannot improve the performance of the model when the number of target data per class is small. This may be because the generated data are similar to the labeled target data, so adding nearly identical data to the training brings little improvement. However, it is worth noting that the improvement becomes large once more than 5 data per class are generated by DEG-Net. This phenomenon indicates that the data generated by DEG-Net are more independent of the existing target data and can, to some degree, be treated as new data.
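A minimal PyTorch-style sketch of the classifier update with Eq. (23) is given below. Here, group_confusion_term stands for the first expectation in Eq. (23), whose groups $\mathcal{G}_{1},\dots,\mathcal{G}_{4}$ are defined in the main text; cross-entropy plays the role of $\ell$; and y_t and y_g denote the labels of the labeled target data and of the generated data, respectively. These names are ours, and the sketch is illustrative rather than the exact implementation.

import torch.nn.functional as F

def classifier_loss(f_t, x_t, y_t, x_g, y_g, gamma, group_confusion_term):
    # Eq. (23): negative group-confusion term plus the two supervised terms.
    loss_target = F.cross_entropy(f_t(x_t), y_t)      # corresponds to l(f_t(x_t), f_t^*(x_t))
    loss_generated = F.cross_entropy(f_t(x_g), y_g)   # corresponds to l(f_t(x_g), f_t^*(x_g))
    return -gamma * group_confusion_term + loss_target + loss_generated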