
Gradual Domain Adaptation
without Indexed Intermediate Domains

Hong-You Chen
The Ohio State University, USA
[email protected]

Wei-Lun Chao
The Ohio State University, USA
[email protected]
Abstract

The effectiveness of unsupervised domain adaptation degrades when there is a large discrepancy between the source and target domains. Gradual domain adaptation (GDA) is one promising way to mitigate such an issue, by leveraging additional unlabeled data that gradually shift from the source to the target. Through sequentially adapting the model along the “indexed” intermediate domains, GDA substantially improves the overall adaptation performance. In practice, however, the extra unlabeled data may not be separated into intermediate domains and indexed properly, limiting the applicability of GDA. In this paper, we investigate how to discover the sequence of intermediate domains when it is not already available. Concretely, we propose a coarse-to-fine framework, which starts with a coarse domain discovery step via progressive domain discriminator training. This coarse domain sequence then undergoes a fine indexing step via a novel cycle-consistency loss, which encourages the next intermediate domain to preserve sufficient discriminative knowledge of the current intermediate domain. The resulting domain sequence can then be used by a GDA algorithm. On benchmark data sets of GDA, we show that our approach, which we name Intermediate DOmain Labeler (IDOL), can lead to comparable or even better adaptation performance compared to the pre-defined domain sequence, making GDA more applicable and robust to the quality of domain sequences. Codes are available at https://github.com/hongyouc/IDOL.

1 Introduction

The distributions of real-world data change dynamically due to many factors like time, locations, environments, etc. Such a fact poses a great challenge to machine-learned models, which implicitly assume that the test data distribution is covered by the training data distribution. To resolve this generalization problem, unsupervised domain adaptation (UDA), which aims to adapt a learned model to the test domain given its unlabeled data [15, 13], has been an active sub-field in machine learning.

Typically, UDA assumes that the “source” domain, in which the model is trained, and the “target” domain, in which the model is deployed, are discrepant but sufficiently related. Concretely, Zhao et al. [75], Ben-David et al. [1] show that the generalization error of UDA is bounded by the discrepancy of the marginal or conditional distributions between domains. Namely, the effectiveness of UDA may degrade along with the increase in domain discrepancy. Take one popular algorithm, self-training [38, 48, 49], for example. Self-training adapts the source model by progressively labeling the unlabeled target data (i.e., pseudo-labels) and using them to fine-tune the model [2, 31, 77, 30, 41, 28, 44, 78]. Self-training works if the pseudo-labels are accurate (as it essentially becomes supervised learning), but is vulnerable if they are not, which occurs when there exists a large domain gap [35].

To address this issue, several recent works investigate gradual domain adaptation (GDA) [25, 12, 71, 35], in which, beyond the source and target domains, the model can access additional unlabeled data from intermediate domains that shift gradually from the source to the target. By adapting the model along the sequence of intermediate domains, i.e., from the ones close to the source to the ones close to the target, the large domain gap between source and target is chipped away by multiple sub-adaptation problems (between consecutive intermediate domains) whose domain gaps are smaller. Namely, the model takes a step closer to the target each time, instead of a huge jump that can significantly degrade performance (see Figure 1 for an illustration). GDA makes sense, as real-world data change gradually more often than abruptly [66, 11, 60]. The recent work by Kumar et al. [35] further demonstrates the strength of GDA both empirically and theoretically.

One potential drawback of GDA is the need for a well-defined sequence of intermediate domains. That is, prior to adaptation, the additional unlabeled data must be grouped into multiple domains and indexed with intermediate domain labels that reflect the underlying data distribution shift from the source to the target. Such information, however, may not be available directly. Existing methods usually leverage side information like time tags [35, 25, 71] to define the sequence, which may be sub-optimal. In some applications, even the side information may not be accessible (e.g., due to privacy concerns), greatly limiting the applicability of GDA.

Figure 1: Gradual domain adaptation (GDA) without indexed intermediate domains. In this setting, one is provided with labeled source, unlabeled target, and additional unlabeled intermediate data that have not been grouped and indexed into a domain sequence. Our approach intermediate domain labeler (IDOL) can successfully discover the domain sequence, which can then be leveraged by a GDA algorithm to achieve a higher target accuracy than direct unsupervised domain adaptation (UDA). Images are from the Portraits dataset [14].

In this paper, we therefore study GDA in the extreme case — the additional unlabeled data are neither grouped nor indexed. Specifically, we investigate how to discover the “domain sequence” from data, such that it can be used to drive GDA. We propose a two-stage coarse-to-fine framework named Intermediate DOmain Labeler (IDOL). In the first stage, IDOL labels each intermediate data instance with a coarse score that reflects how close it is to the source or target. We study several methods that estimate the distance from a data instance to a domain. We find that a progressively trained domain discriminator — which starts with source vs. target data but gradually adds data close to either of them into training — performs the best, as it better captures the underlying data manifold.

In the second stage, IDOL then builds upon the coarse scores to group data into domains by further considering the discriminative (e.g., classification) knowledge the model aims to preserve from the source to the target. Since the additional unlabeled data are fully unlabeled, we take a greedy approach to identify the next intermediate domain (i.e., a group of data) that can best preserve the discriminative knowledge in the current intermediate domain. Concretely, we employ self-training [38] along the domain sequence discovered so far to provide pseudo-labels for the current domain, and propose a novel cycle-consistency loss to discover the next domain, such that the model adapted to the next domain can be "adapted back" to the current domain and predict the same pseudo-labels. The output of IDOL is a sequence of intermediate domains that can be used by any GDA algorithm [35, 25, 71].

We validated IDOL on two data sets studied in [35], including Rotated MNIST [36] and Portraits over years [14]. IDOL can successfully discover the domain sequence that leads to comparable GDA performance to using the pre-defined sequence (i.e., by side information). More importantly, IDOL is compatible with pre-defined sequences — by treating them as the coarse sequences — to further improve upon them. We also investigate IDOL in scenarios where the additional unlabeled data may contain outliers and demonstrate IDOL’s effectiveness even in such challenging cases. To our knowledge, our work is the first to tackle GDA without grouped and indexed intermediate domains. The success of IDOL opens up broader application scenarios that GDA can contribute to.

2 Background: gradual domain adaptation (GDA)

2.1 Setup of UDA and GDA

We consider an unsupervised domain adaptation (UDA) problem, in which a classifier is provided with labeled data from the source domain $\mathcal{S}=\{({\bm{x}}^{\mathcal{S}}_{i},y^{\mathcal{S}}_{i})\}_{i=1}^{|\mathcal{S}|}$ and unlabeled data from the target domain $\mathcal{T}=\{{\bm{x}}^{\mathcal{T}}_{i}\}_{i=1}^{|\mathcal{T}|}$. Both $\mathcal{S}$ and $\mathcal{T}$ are sampled IID from the underlying joint distributions of the source $P_{\mathcal{S}}$ and target $P_{\mathcal{T}}$, respectively. A source model $\bm{\theta}_{\mathcal{S}}$ is typically produced by training on $\mathcal{S}$ directly. The goal of UDA is to learn a target model $\bm{\theta}_{\mathcal{T}}$ such that it performs well on the target, measured by a loss $\mathcal{L}(\bm{\theta}_{\mathcal{T}},P_{\mathcal{T}})$. We focus on the standard UDA setup that assumes no label shifts [6].

Gradual domain adaptation (GDA), on top of UDA, assumes the existence of a sequence of domains $P_0,P_1,\ldots,P_{M-1},P_M$, where $P_0$ and $P_M$ are the source and target domains, respectively, along with $M-1$ intermediate domains $P_1,\ldots,P_{M-1}$. The data distributions of the domains are assumed to change gradually from the source to the target along the sequence. One can access unlabeled data $\mathcal{U}_m=\{{\bm{x}}^{\mathcal{U}_m}_{i}\}_{i=1}^{|\mathcal{U}_m|}$ from each intermediate domain, forming a sequence of unlabeled data sets $\mathcal{U}_1,\ldots,\mathcal{U}_{M-1}$. We denote their union by $\mathcal{U}=\{{\bm{x}}^{\mathcal{U}}_{i}\}_{i=1}^{|\mathcal{U}|}$. For brevity, we assume that each $\mathcal{U}_m$ is disjoint from the others and that the sizes $|\mathcal{U}_m|$ are the same for all $m\in\{1,\cdots,M-1\}$.

2.2 GDA with pre-defined domain sequences using self-training [35]

We summarize the theoretical analysis of GDA using self-training by Kumar et al. [35] as follows.

Self-training for UDA.

Self-training [38] takes a pre-trained (source) model $\bm{\theta}$ and the target unlabeled data $\mathcal{T}$ as input. Denoting by ${\bm{z}}=f({\bm{x}};\bm{\theta})$ the predicted logits, self-training first applies $f(\cdot;\bm{\theta})$ to the target data points ${\bm{x}}\in\mathcal{T}$, sharpens the predictions into pseudo-labels via argmax, and then updates the current model $\bm{\theta}$ by minimizing a loss $\ell$ (e.g., cross entropy) w.r.t. the pseudo-labels,

$$\bm{\theta}_{\mathcal{T}}=\texttt{ST}(\bm{\theta},\mathcal{T})=\arg\min_{\bm{\theta}'\in\bm{\Theta}}\frac{1}{|\mathcal{T}|}\sum_{i=1}^{|\mathcal{T}|}\ell\big(f({\bm{x}}_i;\bm{\theta}'),\texttt{sharpen}(f({\bm{x}}_i;\bm{\theta}))\big), \qquad (1)$$

where $\bm{\theta}_{\mathcal{T}}$ denotes the resulting target model and $\texttt{ST}(\bm{\theta},\mathcal{T})$ denotes the self-training process. A baseline use of self-training for UDA is to set $\bm{\theta}$ to the source model $\bm{\theta}_{\mathcal{S}}$. However, when the domain gap between $\mathcal{S}$ and $\mathcal{T}$ is large, $\bm{\theta}_{\mathcal{S}}$ cannot produce accurate pseudo-labels for effective self-training.
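
As a concrete illustration, one round of Equation 1 can be sketched as follows. This is a minimal NumPy sketch, assuming a toy nearest-centroid classifier; the `predict_logits`/`fit` helpers and all data values are hypothetical, not the paper's actual architecture:

```python
import numpy as np

def self_train_step(predict_logits, fit, theta, X_target):
    """One round of self-training (cf. Equation 1): pseudo-label the
    unlabeled data with the current model, then refit on those labels."""
    logits = predict_logits(theta, X_target)
    pseudo = logits.argmax(axis=1)       # sharpen(f(x; theta)) via argmax
    return fit(X_target, pseudo)         # minimize the loss w.r.t. pseudo-labels

# Toy instantiation: theta holds one centroid per class, and the "logit"
# is the negative squared distance to each centroid.
def predict_logits(theta, X):
    return -((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)

def fit(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in range(2)])

theta_src = np.array([[0.0], [4.0]])            # centroids learned on the source
X_tgt = np.array([[0.5], [1.0], [3.0], [3.5]])  # slightly shifted target data
theta_tgt = self_train_step(predict_logits, fit, theta_src, X_tgt)
```

On this toy, the pseudo-labels are all correct because the shift is small, so the refitted centroids simply track the shifted class means.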

Gradual self-training for GDA.

In the GDA setup, self-training is performed along the domain sequence, from the current model $\bm{\theta}_m$ (already updated using $\mathcal{U}_m$) to the next one $\bm{\theta}_{m+1}$ using $\mathcal{U}_{m+1}$:

$$\bm{\theta}_{m+1}=\texttt{ST}(\bm{\theta}_m,\mathcal{U}_{m+1}), \qquad (2)$$

starting from the source model $\bm{\theta}_0=\bm{\theta}_{\mathcal{S}}$. By doing so, the model gradually adapts from $\mathcal{S}$ to $\mathcal{T}$ through the sequence $\mathcal{U}_1,\ldots,\mathcal{U}_M$ to produce the final target model $\bm{\theta}_{\mathcal{T}}$:

$$\bm{\theta}_{\mathcal{T}}=\bm{\theta}_M=\texttt{ST}\big(\bm{\theta}_{\mathcal{S}},(\mathcal{U}_1,\ldots,\mathcal{U}_M)\big). \qquad (3)$$

As the domain gap between consecutive domains is smaller than that between $\mathcal{S}$ and $\mathcal{T}$, the pseudo-labels are more accurate at each step. As a result, self-training in GDA can lead to a lower target error than in UDA. We note that the domain sequences in [35] are pre-defined using side information.
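
The chained updates of Equations 2 and 3 can be sketched as below. This is a minimal NumPy toy, assuming 1-D nearest-centroid classifiers and a synthetic drift (all values hypothetical), illustrating why small per-step gaps keep pseudo-labels accurate while a direct jump does not:

```python
import numpy as np

def predict(theta, X):
    # Nearest 1-D centroid; theta[c] is the centroid of class c.
    return np.abs(X[:, None] - theta[None, :]).argmin(axis=1)

def self_train(theta, X):
    y = predict(theta, X)  # pseudo-labels from the current model
    return np.array([X[y == c].mean() if np.any(y == c) else theta[c]
                     for c in range(len(theta))])

def gradual_self_train(theta, domain_seq):
    # Equation 3: chain self-training along U_1, ..., U_M.
    for X in domain_seq:
        theta = self_train(theta, X)
    return theta

# Two classes (means 0 and 4 in the source) whose means drift by +1.5 per domain.
def domain(shift):
    return np.array([shift - 0.2, shift + 0.2, shift + 3.8, shift + 4.2])

theta_src = np.array([0.0, 4.0])
domain_seq = [domain(1.5 * m) for m in range(1, 5)]   # shifts 1.5, 3.0, 4.5, 6.0
theta_gradual = gradual_self_train(theta_src, domain_seq)
theta_direct = self_train(theta_src, domain_seq[-1])  # one big jump to the target
```

On this toy, the gradual model tracks the drift and labels the target correctly, whereas the direct jump assigns every target point to the second class, so self-training collapses.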

Theoretical analysis.

Kumar et al. [35] make three assumptions for theoretical analysis:

  • Separation: for every domain $P_m$, the data are separable; i.e., there exists an $R$-bounded (linear) classifier $\bm{\theta}_m\in\bm{\Theta}$ with $\|\bm{\theta}_m\|_2\leq R$ that achieves a low loss $\mathcal{L}(\bm{\theta}_m,P_m)$.

  • Gradual shift: for the domain sequence $P_0,\ldots,P_M$ with no label shifts, the maximum per-class Wasserstein-infinity distance between consecutive domains satisfies $\rho(P_m,P_{m+1})\leq\rho<\frac{1}{R}$.

  • Bounded data: samples $X$ from each domain $P_m$ are bounded in expectation, i.e., $E_{X\sim P_m}[\|X\|_2^2]\leq B^2$.

Under these assumptions, if the source model $\bm{\theta}_0$ has a low loss $\mathcal{L}(\bm{\theta}_0,P_0)$ on $P_0$, then with probability at least $1-\delta$, the target model $\bm{\theta}_M$ resulting from gradual self-training satisfies

$$\mathcal{L}(\bm{\theta}_M,P_M)\leq\beta^{M+1}\Big(\mathcal{L}(\bm{\theta}_0,P_0)+\frac{4BR+\sqrt{2\log(2M/\delta)}}{\sqrt{|\mathcal{U}|}}\Big),\quad\text{where }\beta=\frac{2}{1-\rho R}. \qquad (4)$$

That is, the target error is controlled by the per-step domain shift $\rho$. If we can sample infinite data (i.e., $|\mathcal{U}|\rightarrow\infty$), the source classifier is perfect (i.e., $\mathcal{L}(\bm{\theta}_0,P_0)=0$), and the distance between consecutive domains satisfies $\rho(P_m,P_{m+1})\leq\rho$, gradual self-training achieves a zero target error.
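
To see how the bound behaves, Equation 4 can be evaluated numerically. The constants below are hypothetical, chosen only to illustrate that $\beta>2$ makes the bound grow exponentially in $M$, so extra steps pay off only if they come with a smaller per-step shift $\rho$:

```python
import math

def gradual_bound(loss0, B, R, rho, M, n_unlabeled, delta=0.1):
    """Right-hand side of Equation 4 (illustrative constants only)."""
    beta = 2.0 / (1.0 - rho * R)
    slack = (4 * B * R + math.sqrt(2 * math.log(2 * M / delta))) / math.sqrt(n_unlabeled)
    return beta ** (M + 1) * (loss0 + slack)

# beta > 2 whenever rho > 0, so the bound inflates exponentially with M:
# the same number of steps with a smaller rho yields a much tighter bound.
loose = gradual_bound(loss0=0.01, B=1.0, R=1.0, rho=0.5, M=5, n_unlabeled=10**6)
tight = gradual_bound(loss0=0.01, B=1.0, R=1.0, rho=0.1, M=5, n_unlabeled=10**6)
```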

3 GDA without indexed intermediate domains

3.1 Setup and motivation

In this paper, we study the case of GDA where the additional unlabeled data $\mathcal{U}$ are not readily divided into a domain sequence $\mathcal{U}_1,\cdots,\mathcal{U}_{M-1}$. In other words, we are provided with a single, gigantic unlabeled set whose support spans from the source to the target domain. One naive way to leverage it is to adapt from $\mathcal{S}$ to $\mathcal{U}$ and then to $\mathcal{T}$,¹ which however leads to a much larger target error than applying self-training along the properly indexed intermediate domains [35]. We attribute this to the large $\rho(\mathcal{S},\mathcal{U})$ or $\rho(\mathcal{U},\mathcal{T})$, according to Equation 4. To take advantage of $\mathcal{U}$, we must separate it into intermediate domains such that $\rho(P_m,P_{m+1})$ is small enough for every $m\in\{0,\ldots,M-1\}$.

¹In [35], the authors investigated sampling $\mathcal{U}_1,\cdots,\mathcal{U}_{M-1}$ from $\mathcal{U}$ uniformly at random with replacement.

3.2 Overview of our approach: intermediate domain labeler (IDOL)

Given the source data $\mathcal{S}$, the target data $\mathcal{U}_M=\mathcal{T}$, and the unlabeled intermediate data $\mathcal{U}=\{{\bm{x}}^{\mathcal{U}}_{i}\}_{i=1}^{|\mathcal{U}|}$, we propose the intermediate domain labeler (IDOL), whose goal is to sort the intermediate data instances in $\mathcal{U}$ and chunk them into $M-1$ equally sized domains $\mathcal{U}_1,\cdots,\mathcal{U}_{M-1}$ for GDA to succeed.

Taking gradual self-training in Equation 3 as an example, IDOL aims to solve the following problem:

$$\min \mathcal{L}(\bm{\theta}_{\mathcal{T}},P_{\mathcal{T}}),\quad\text{s.t. }\bm{\theta}_{\mathcal{T}}=\texttt{ST}\big(\bm{\theta}_{\mathcal{S}},(\mathcal{U}_1,\ldots,\mathcal{U}_M)\big)\ \text{and}\ (\mathcal{U}_1,\ldots,\mathcal{U}_{M-1})=\textsc{IDOL}(\mathcal{S},\mathcal{T},\mathcal{U};M-1), \qquad (5)$$

where the target model $\bm{\theta}_{\mathcal{T}}$ is produced by applying gradual self-training to the domain sequence discovered by IDOL. IDOL accepts the number of intermediate domains $M-1$ as a hyper-parameter.

Solving Equation 5 is hard, as evaluating any output sequence requires running through the entire gradual self-training process, not to mention that we have no labeled target data to estimate $\mathcal{L}(\bm{\theta}_{\mathcal{T}},P_{\mathcal{T}})$. In this paper, we propose to solve Equation 5 approximately via a coarse-to-fine procedure.

The coarse stage.

We aim to give each ${\bm{x}}^{\mathcal{U}}_{i}\in\mathcal{U}$ a score $q_i$ that indicates its position between the source and target. Specifically, a higher/lower $q_i$ indicates that ${\bm{x}}^{\mathcal{U}}_{i}$ is closer to the source/target domain. With these scores, we can already obtain a coarse domain sequence by sorting the data in descending order of $q_i$ and dividing them into $M-1$ chunks.
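
The sort-and-chunk step can be sketched as follows (a minimal NumPy sketch; the score values are hypothetical):

```python
import numpy as np

def chunk_by_score(scores, num_domains):
    """Sort intermediate data by coarse score (descending, i.e., most
    source-like first) and split the indices into equal-sized domains."""
    order = np.argsort(-np.asarray(scores))
    return np.array_split(order, num_domains)

coarse_scores = [0.9, 0.1, 0.5, 0.8, 0.3, 0.6]   # hypothetical q_i values
chunks = chunk_by_score(coarse_scores, 3)
# chunks[0] holds the most source-like indices, chunks[-1] the most target-like.
```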

The fine stage.

Instead of creating the domain sequence $\mathcal{U}_1,\cdots,\mathcal{U}_{M-1}$ by looking at the score of each individual ${\bm{x}}^{\mathcal{U}}_{i}\in\mathcal{U}$, we further consider how the data grouped into the same domain can collectively preserve the discriminative knowledge of the previous domains; after all, the goal of UDA or GDA is to pass the discriminative knowledge from the labeled source domain to the target domain. To this end, we propose a novel cycle-consistency loss to refine the coarse scores progressively.

3.3 The coarse stage: assigning domain scores

In this stage, we assign each ${\bm{x}}^{\mathcal{U}}_{i}\in\mathcal{U}$ a score $q_i$, such that a higher/lower $q_i$ indicates that ${\bm{x}}^{\mathcal{U}}_{i}$ is closer to the source/target domain. We discuss several design choices. For brevity, we omit the superscript $\mathcal{U}$.

Confidence scores by the classifier $f({\bm{x}};\bm{\theta})$.

We first investigate scoring each intermediate data instance by the confidence score of the source model $\bm{\theta}_{\mathcal{S}}$, i.e., $q_i=\max f({\bm{x}}_i;\bm{\theta}_{\mathcal{S}})$ across classes. This is inspired by outlier detection [42] and the common practice of self-training, which selects only data with sufficiently large confidence scores to fine-tune upon [38, 35]. Here we employ an advanced version, which uses the latest $\bm{\theta}$ (i.e., the model adapted to the current intermediate domain) to select the next intermediate domain: each time, we select the highest-confidence instances from the remaining ones in $\mathcal{U}$ to adapt $\bm{\theta}$. Overall, we give data selected in the $m$-th round ($m\in\{1,\cdots,M-1\}$) a score $\frac{M-1-m}{M-2}\in[0,1]$. One drawback of this method is its blindness to the target domain $\mathcal{T}$. Specifically, we find that using confidence scores tends to select easily classified examples, which do not necessarily follow the gradual shift from the source to the target.
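
The round-based confidence scoring above can be sketched as below, assuming a generic `conf_fn(theta, x)` confidence callable (a hypothetical stand-in for $\max f({\bm{x}};\bm{\theta})$); the self-training update of `theta` between rounds is omitted for brevity:

```python
def confidence_round_scores(conf_fn, theta, U, num_rounds):
    """Round m picks the highest-confidence remaining examples and assigns
    them the score (M-1-m)/(M-2), with num_rounds = M-1."""
    scores = {}
    remaining = list(range(len(U)))
    size = len(U) // num_rounds
    for m in range(1, num_rounds + 1):
        picked = sorted(remaining, key=lambda i: -conf_fn(theta, U[i]))[:size]
        for i in picked:
            scores[i] = (num_rounds - m) / (num_rounds - 1)
        remaining = [i for i in remaining if i not in picked]
        # theta would be self-trained on the picked chunk here (omitted).
    return scores

# Toy 1-D confidence: closeness to a fixed prototype at theta = 0.
conf_fn = lambda theta, x: -abs(x - theta)
scores = confidence_round_scores(conf_fn, 0.0, [0.1, 0.9, 2.0], num_rounds=3)
```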

Manifold distance with the source features.

To model the flow from the source to the target, we argue that it is crucial to consider both ends as references. We thus extract features of $\mathcal{S}$, $\mathcal{T}$, and $\mathcal{U}$ using the source model $\bm{\theta}_{\mathcal{S}}$ and apply a manifold learning algorithm (e.g., UMAP [50]) to discover the underlying manifold. Denoting by $\gamma({\bm{x}})$ the dimension-reduced features of ${\bm{x}}$ after UMAP, we compute the ratio of the distance to the nearest point in $\mathcal{T}$ over that in $\mathcal{S}$ as the score: $q_i=\frac{\min_{{\bm{x}}^{\mathcal{T}}\in\mathcal{T}}\|\gamma({\bm{x}}_i)-\gamma({\bm{x}}^{\mathcal{T}})\|_2}{\min_{{\bm{x}}^{\mathcal{S}}\in\mathcal{S}}\|\gamma({\bm{x}}_i)-\gamma({\bm{x}}^{\mathcal{S}})\|_2}$.
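
The ratio score can be sketched as below; a NumPy sketch that, for self-containedness, uses raw 2-D features in place of UMAP-reduced ones (the feature values are hypothetical):

```python
import numpy as np

def manifold_ratio_scores(feat_u, feat_s, feat_t):
    """q_i = (nearest distance to T) / (nearest distance to S) on the
    dimension-reduced features; a higher ratio means closer to the source."""
    d_t = np.linalg.norm(feat_u[:, None, :] - feat_t[None, :, :], axis=2).min(axis=1)
    d_s = np.linalg.norm(feat_u[:, None, :] - feat_s[None, :, :], axis=2).min(axis=1)
    return d_t / d_s

feat_s = np.array([[0.0, 0.0]])                 # source features
feat_t = np.array([[10.0, 0.0]])                # target features
feat_u = np.array([[2.0, 0.0], [8.0, 0.0]])     # one source-like, one target-like
q = manifold_ratio_scores(feat_u, feat_s, feat_t)
```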

Domain discriminator.

We investigate another idea that emphasizes the roles of the source and target, which is to train a domain discriminator [13]. Concretely, we construct a binary classifier $g(\cdot;\bm{\phi})$ using a deep neural network, trained to separate the source data $\mathcal{S}$ (class: 1) from the target data $\mathcal{T}$ (class: 0). We use a binary cross-entropy loss to optimize the learnable parameters $\bm{\phi}$:

$$\mathcal{L}(\bm{\phi})=-\frac{1}{|\mathcal{S}|}\sum_{{\bm{x}}^{\mathcal{S}}\in\mathcal{S}}\log\big(\sigma(g({\bm{x}}^{\mathcal{S}};\bm{\phi}))\big)-\frac{1}{|\mathcal{T}|}\sum_{{\bm{x}}^{\mathcal{T}}\in\mathcal{T}}\log\big(1-\sigma(g({\bm{x}}^{\mathcal{T}};\bm{\phi}))\big), \qquad (6)$$

where $\sigma$ is the sigmoid function. We can then assign a score to ${\bm{x}}_i\in\mathcal{U}$ by $q_i=g({\bm{x}}_i;\bm{\phi})$.
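
A minimal sketch of this discriminator, assuming a linear $g$ trained by plain gradient descent on the loss of Equation 6 (rather than a deep network), with hypothetical 1-D data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_discriminator(X_s, X_t, lr=0.5, steps=500):
    """Fit a linear domain discriminator g(x; phi) on the BCE loss of
    Equation 6 (source = 1, target = 0) by plain gradient descent."""
    X = np.vstack([X_s, X_t])
    X = np.hstack([X, np.ones((len(X), 1))])     # append a bias feature
    y = np.concatenate([np.ones(len(X_s)), np.zeros(len(X_t))])
    phi = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ phi)
        phi -= lr * X.T @ (p - y) / len(y)       # gradient of the BCE loss
    return phi

def score(phi, X):
    X = np.hstack([X, np.ones((len(X), 1))])
    return X @ phi                               # q_i = g(x_i; phi), the logit

X_s = np.array([[0.0], [1.0]])                   # 1-D source samples
X_t = np.array([[5.0], [6.0]])                   # 1-D target samples
phi = train_discriminator(X_s, X_t)
q = score(phi, np.array([[1.5], [4.5]]))         # two intermediate examples
```

On this toy, the learned boundary sits between the two populations, so the intermediate example at 1.5 scores higher (more source-like) than the one at 4.5.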

Progressive training for the domain discriminator.

The domain discriminator $g(\cdot;\bm{\phi})$, compared to the manifold distance, better contrasts the source and target domains but does not leverage any of the data ${\bm{x}}\in\mathcal{U}$. That is, it may give a faithful score to an example ${\bm{x}}$ that is very close to $\mathcal{S}$ or $\mathcal{T}$, but for examples that are far away, and hence out-of-domain for both ends, the score by $g(\cdot;\bm{\phi})$ becomes less reliable, not to mention that neural networks are usually poorly calibrated on such examples [21, 42]. We therefore propose to progressively augment the source and target data with the examples ${\bm{x}}\in\mathcal{U}$ that have either a fairly high or low $g({\bm{x}};\bm{\phi})$, and fine-tune the domain discriminator on them. We perform this process for $K$ rounds, each time adding $\frac{|\mathcal{U}|}{2K}$ new examples to each of the source and target sides. By doing so, the domain discriminator is updated with data from $\mathcal{U}$ and thus provides more accurate scores for assigning the remaining examples to the source or target side. More specifically, letting $N=|\mathcal{U}|$, we repeat the following steps for $k$ from $1$ to $K$:

  1. Train $g(\cdot;\bm{\phi})$ using $\mathcal{S}$ and $\mathcal{T}$, based on the loss in Equation 6.

  2. Predict $\hat{q}_i=g({\bm{x}}_i;\bm{\phi})$ for all ${\bm{x}}_i\in\mathcal{U}$.

  3. Rank all $\hat{q}_i$ in descending order.

  4. The data with the $\frac{N}{2K}$ largest $\hat{q}_i$ scores form a new intermediate domain on the source side (their $q_i$ is set to $\frac{2K-k}{2K}$). We remove them from $\mathcal{U}$ and add them to $\mathcal{S}$.

  5. The data with the $\frac{N}{2K}$ smallest $\hat{q}_i$ scores form a new intermediate domain on the target side (their $q_i$ is set to $\frac{k}{2K}$). We remove them from $\mathcal{U}$ and add them to $\mathcal{T}$.

Overall, we give a data instance ${\bm{x}}\in\mathcal{U}$ selected in the $k$-th round ($k\in\{1,\cdots,K\}$) a score of $\frac{2K-k}{2K}$ if it is added to the source side, or $\frac{k}{2K}$ if it is added to the target side.
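
The five steps above can be sketched as follows; a minimal sketch that abstracts the discriminator behind `train`/`predict` callables, with a hypothetical 1-D stand-in (midpoint-based scoring) in place of a neural discriminator:

```python
import numpy as np

def progressive_scores(U, S, T, train, predict, K):
    """K rounds of (re)training the discriminator on the growing S and T,
    then peeling off the N/2K most source-like and target-like examples."""
    U, S, T = list(U), list(S), list(T)
    n_per_side = len(U) // (2 * K)
    q = {}
    for k in range(1, K + 1):
        phi = train(S, T)                            # step 1
        q_hat = [predict(phi, x) for x in U]         # step 2
        order = np.argsort(q_hat)[::-1]              # step 3: descending rank
        src_idx = set(order[:n_per_side].tolist())   # step 4
        tgt_idx = set(order[-n_per_side:].tolist())  # step 5
        for i, x in enumerate(U):
            if i in src_idx:
                q[x] = (2 * K - k) / (2 * K)
                S.append(x)
            elif i in tgt_idx:
                q[x] = k / (2 * K)
                T.append(x)
        U = [x for i, x in enumerate(U) if i not in src_idx | tgt_idx]
    return q

# Toy stand-ins: "training" returns the midpoint of the current S/T means,
# and the score is how far a point sits on the source side of that midpoint.
train = lambda S, T: (np.mean(S) + np.mean(T)) / 2.0
predict = lambda phi, x: phi - x                     # larger = more source-like

q = progressive_scores(U=[1.0, 2.0, 5.0, 6.0], S=[0.0], T=[7.0],
                       train=train, predict=predict, K=2)
```

Note that the last round assigns the score 0.5 to both sides, i.e., the examples peeled off last land in the middle of the sequence, as intended.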

3.4 The fine stage: cycle-consistency for refinement

The domain scores developed in subsection 3.3 can already give each individual example ${\bm{x}}\in\mathcal{U}$ a rough position of which intermediate domain it belongs to. However, for GDA to succeed, we must consider the intermediate data in a collective manner. That is, each intermediate domain should preserve sufficient discriminative (e.g., classification) knowledge from the previous one, such that after $M$ rounds of adaptation through GDA, the resulting $\bm{\theta}_{\mathcal{T}}$ in Equation 5 performs well on $\mathcal{T}$. This is reminiscent of some recent claims in UDA [75, 46]: matching only the marginal distributions between domains (i.e., being blind to the class labels) may lead to sub-optimal adaptation results.

But how can we measure the amount of discriminative knowledge preserved by an intermediate domain (or the target domain) if we do not have its labeled data? To seek the answer, we revisit Equation 4 and consider the case $M=1$. If the two domains $P_0$ and $P_1$ are sufficiently close and the initial model $\bm{\theta}_0$ is well-trained, the resulting model $\bm{\theta}_1$ after self-training will perform well. Reversely, we can treat such a well-performing $\bm{\theta}_1$ as the well-trained initial model and apply self-training again, this time from $P_1$ back to $P_0$. The resulting model $\bm{\theta}_0'$, by the same rationale, should perform well on $P_0$. In other words, $\bm{\theta}_0'$ and $\bm{\theta}_0$ should predict similarly on data sampled from $P_0$. The similarity between $\bm{\theta}_0'$ and $\bm{\theta}_0$ in terms of their predictions can therefore be used as a proxy for how much discriminative knowledge $\bm{\theta}_1$, after being adapted to $P_1$, preserves. It is worth noting that measuring this similarity requires no labeled data from $P_0$ or $P_1$.

Now let us consider the more general case where $M$ is not limited to $1$. Let us set $\bm{\theta}_0=\bm{\theta}_{\mathcal{S}}$, where $\bm{\theta}_{\mathcal{S}}$ has been trained on the labeled source data $\mathcal{S}$ till convergence. Based on the concept mentioned above, we propose a novel learning framework to discover the domain sequence:

$$\operatorname{arg\,min}_{(\mathcal{U}_1,\ldots,\mathcal{U}_{M-1})}\ \mathbb{E}_{{\bm{x}}\sim P_0}\left[\ell\big(f({\bm{x}};\bm{\theta}_0'),\texttt{sharpen}(f({\bm{x}};\bm{\theta}_0))\big)\right],$$
$$\text{s.t. }\bm{\theta}_0'=\texttt{ST}(\bm{\theta}_M,(\mathcal{U}_{M-1},\ldots,\mathcal{U}_0)),\quad\bm{\theta}_M=\texttt{ST}(\bm{\theta}_0,(\mathcal{U}_1,\ldots,\mathcal{U}_M)), \qquad (7)$$

where $\mathcal{U}_0$ is the source data $\mathcal{S}$ with labels removed, and $\mathcal{U}_M=\mathcal{T}$ is the target data. $\bm{\theta}_0'$ is a gradually self-trained model from $\bm{\theta}_0$ that undergoes a cycle $\mathcal{U}_1\rightarrow\cdots\rightarrow\mathcal{U}_M\rightarrow\cdots\rightarrow\mathcal{U}_0$. In other words, Equation 7 aims to find the intermediate domain sequence such that performing gradual self-training along it in a forward and then backward fashion leads to cycle-consistency [76].

Figure 2: Cycle-consistency (cf. Equation 8).

Optimization. Solving Equation 7 is still non-trivial due to its combinatorial nature. We therefore propose to solve it greedily, discovering $\mathcal{U}_1$ first, then $\mathcal{U}_2$, and so on. Concretely, we decompose Equation 7 into a series of sub-problems, as illustrated in Figure 2, each defined as

$$\operatorname{arg\,min}_{\mathcal{U}_{m+1}\subset\mathcal{U}_{\setminus m}}\ \frac{1}{|\mathcal{U}_m|}\sum_{{\bm{x}}\in\mathcal{U}_m}\ell\big(f({\bm{x}};\bm{\theta}_m'),\texttt{sharpen}(f({\bm{x}};\bm{\theta}_m))\big),$$
$$\text{s.t. }\bm{\theta}_m'=\texttt{ST}(\bm{\theta}_{m+1},\mathcal{U}_m),\quad\bm{\theta}_{m+1}=\texttt{ST}(\bm{\theta}_m,\mathcal{U}_{m+1}), \qquad (8)$$

where $\bm{\theta}_m$ is the model already adapted to the current domain $\mathcal{U}_m$, and $\mathcal{U}_{\setminus m}=\mathcal{U}\setminus\cup_{j=1}^{m}\mathcal{U}_j$ is the remaining data that have not been selected by the $m$ intermediate domains so far. That is, each sub-problem discovers only the next intermediate domain $\mathcal{U}_{m+1}$ via cycle-consistency.

Let $N=|\mathcal{U}_{\setminus m}|$ and $\mathcal{U}_{\setminus m}=\{{\bm{x}}_i\}_{i=1}^{N}$; selecting $\mathcal{U}_{m+1}\subset\mathcal{U}_{\setminus m}$ is then equivalent to selecting a binary indicator vector $\bm{q}\in\{0,1\}^N$, in which $q_i=1$ means that ${\bm{x}}_i\in\mathcal{U}_{\setminus m}$ is included in $\mathcal{U}_{m+1}$ (we reuse the notation of the coarse scores defined in subsection 3.3, since we will later initialize these values by the coarse scores). This representation turns the self-training step $\bm{\theta}_{m+1}=\texttt{ST}(\bm{\theta}_m,\mathcal{U}_{m+1})$ in Equation 8 into

$$\texttt{ST}(\bm{\theta}_m,\bm{q})=\arg\min_{\bm{\theta}\in\bm{\Theta}}\frac{1}{N}\sum_{i=1}^{N}q_i\times\ell\big(f({\bm{x}}_i;\bm{\theta}),\texttt{sharpen}(f({\bm{x}}_i;\bm{\theta}_m))\big). \qquad (9)$$

We choose to solve Equation 8 approximately by relaxing the binary vector $\bm{q}\in\{0,1\}^N$ into a real vector $\bm{q}\in\mathbb{R}^N$, which fits naturally into the meta-reweighting framework [55, 29]. Concretely, meta-reweighting treats $\bm{q}\in\mathbb{R}^N$ as learnable, differentiable parameters associated with the training examples (in our case, the data in $\mathcal{U}_{\setminus m}$). Meta-reweighting for Equation 8 can be implemented by repeating the following six steps for multiple iterations:

  1. Detach: $\bm{\theta}\leftarrow\bm{\theta}_m$.

  2. Forward: $\bm{\theta}(\bm{q})\leftarrow\bm{\theta}-\frac{\eta_{\bm{\theta}}}{|\mathcal{U}_{\setminus m}|}\times\frac{\partial\sum_{i\in\mathcal{U}_{\setminus m}}q_i\times\ell(f({\bm{x}}_i;\bm{\theta}),\texttt{sharpen}(f({\bm{x}}_i;\bm{\theta}_m)))}{\partial\bm{\theta}}$.

  3. Detach: $\bm{\theta}'\leftarrow\bm{\theta}(\bm{q})$.

  4. Backward: $\bm{\theta}(\bm{q})\leftarrow\bm{\theta}(\bm{q})-\frac{\eta_{\bm{\theta}}}{|\mathcal{U}_m|}\times\frac{\partial\sum_{j\in\mathcal{U}_m}\ell(f({\bm{x}}_j;\bm{\theta}(\bm{q})),\texttt{sharpen}(f({\bm{x}}_j;\bm{\theta}')))}{\partial\bm{\theta}(\bm{q})}$.

  5. Update: $\bm{q}\leftarrow\bm{q}-\frac{\eta_{\bm{q}}}{|\mathcal{U}_m|}\times\frac{\partial\sum_{j\in\mathcal{U}_m}\ell(f({\bm{x}}_j;\bm{\theta}(\bm{q})),\texttt{sharpen}(f({\bm{x}}_j;\bm{\theta}_m)))}{\partial\bm{q}}$.

  6. Update: $q_i\leftarrow\max\{0,q_i\}$.

The updating rule above gives ${\bm{x}}_i$ a higher value of $q_i$ if it helps preserve the discriminative knowledge. After obtaining the updated $\bm{q}\in\mathbb{R}^N$, we sort it in descending order and select the top $\frac{|\mathcal{U}|}{M-1}$ examples to form $\mathcal{U}_{m+1}$ (i.e., every intermediate domain has an equal size).
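
The six steps can be sketched in NumPy on a deliberately tiny surrogate: a scalar model with a squared loss in place of cross-entropy, and central finite differences in place of automatic differentiation for step 5 (all values below are hypothetical):

```python
import numpy as np

def forward_backward_loss(q, theta_m, U_rest, U_m, lr=0.1):
    # Forward (step 2): one q-weighted self-training step on the candidates.
    grad = np.sum(q * 2 * (theta_m - U_rest)) / len(U_rest)
    theta_next = theta_m - lr * grad
    # Backward (step 4): adapt theta(q) back toward the current domain U_m.
    theta_back = theta_next - lr * np.mean(2 * (theta_next - U_m))
    # Cycle-consistency objective (step 5): agree with theta_m on U_m,
    # where theta_m here equals the mean of U_m (our pseudo-label stand-in).
    return np.mean((theta_back - U_m) ** 2)

def update_q(q, theta_m, U_rest, U_m, eta_q=1.0, eps=1e-5):
    # Step 5's gradient w.r.t. q via finite differences, then step 6's clipping.
    g = np.zeros_like(q)
    for i in range(len(q)):
        q_hi, q_lo = q.copy(), q.copy()
        q_hi[i] += eps
        q_lo[i] -= eps
        g[i] = (forward_backward_loss(q_hi, theta_m, U_rest, U_m)
                - forward_backward_loss(q_lo, theta_m, U_rest, U_m)) / (2 * eps)
    return np.maximum(0.0, q - eta_q * g)

theta_m = 1.0                        # model adapted to the current domain
U_m = np.array([1.0])                # current-domain data
U_rest = np.array([1.2, 5.0])        # candidates: one near U_m, one far away
q = update_q(np.array([0.5, 0.5]), theta_m, U_rest, U_m)
```

On this toy, the distant candidate drags the forward step away from the current domain and inflates the cycle loss, so its weight is decreased more than that of the nearby candidate.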

Discussion. The way we relax the binary-valued vector $\bm{q}\in\{0,1\}^N$ into real values is related to the linear programming relaxation of an integer programming problem. It has been widely applied, for example, in discovering sub-domains within a dataset [16]. Theoretically, we should constrain each element of $\bm{q}$ to be within $[0,1]$. Empirically, we found that even without the clipping operation to upper-bound $q_i$ (i.e., performing only $\max\{0,q_i\}$), the values of $q_i$ do not explode and the algorithm is quite stable: we see a negligible difference in the resulting GDA accuracy with or without upper-bounding $q_i$. This is also consistent with the common practice of meta-reweighting [29, 55].

Initialization. We initialize $q_i$ by the coarse scores defined in subsection 3.3, for one important reason: Equation 8 searches for the next intermediate domain $\mathcal{U}_{m+1}$ based only on the current intermediate domain $\mathcal{U}_m$. Thus, $\mathcal{U}_{m+1}$ may include data points that help preserve the knowledge in $\mathcal{U}_m$ but deviate distributionally from the gradual transition between $\mathcal{U}_m$ and the target domain $\mathcal{T}$. The scores defined in subsection 3.3 thus serve as guidance; we can also view Equation 8 as a refinement step for the coarse scores. As will be seen in section 4, the qualities of both the initialization and the refinement steps play important roles in IDOL's success.

4 Experiment

4.1 Setup

We mainly study two benchmark datasets used in [35]. Rotated MNIST [36] is a popular synthetic dataset for studying distributional shifts. It gradually rotates the original 50,000 images in the MNIST [37] dataset by $[0,60]$ degrees uniformly, where degrees in $[0,5)$/$[5,55)$/$[55,60]$ are used for the source/intermediate/target domains, respectively. However, since the original images carry no ground-truth rotations, the rotation annotations are not expected to be perfect. Portraits [14] is a real-world gender classification dataset of American high school seniors from 1905 to 2013. Naturally, the portrait styles change over the years (e.g., lip curvature [14], fashion styles). The images are first sorted by year and then split. For both datasets, each domain contains 2,000 images, and 1,000 images are reserved for validation in each of the source and target domains.
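The degree-based split for Rotated MNIST can be sketched as below. The helper name is ours and the uniform draw is a simplifying assumption; the benchmark then rotates each image by its assigned degree, which we omit here.

```python
import numpy as np

def rotation_domain_splits(n_images=50000, seed=0):
    """Sketch of the Rotated MNIST protocol: assign each image a rotation
    degree drawn uniformly from [0, 60] and split the indexes into
    source [0, 5), intermediate [5, 55), and target [55, 60]."""
    rng = np.random.default_rng(seed)
    degrees = rng.uniform(0.0, 60.0, size=n_images)
    source = np.flatnonzero(degrees < 5)
    intermediate = np.flatnonzero((degrees >= 5) & (degrees < 55))
    target = np.flatnonzero(degrees >= 55)
    return degrees, source, intermediate, target
```

Note that in the GDA setup, only the source split is labeled; the intermediate and target splits are used without labels.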

We follow the setup in [35]: each model is a convolutional neural network trained for 20 epochs on each domain sequentially (including training on the source data), using the Adam optimizer [32] with a learning rate of 0.001, batch size 32, and weight decay 0.02. We use this optimizer as the default if not specified. Hyper-parameters of IDOL include $K=2M$ rounds for progressive training and 30 epochs of refinement per step (with mini-batch size 128), where $M=19$ for Rotated MNIST and $M=7$ for Portraits. More details are in the supplementary material.

To study how the intermediate domain sequences affect gradual self-training, we consider baselines including “Source only” and “UDA”, which directly self-trains on the target data ($\mathcal{T}$) and on all the intermediate data ($\mathcal{U}$). We compare different domain sequences, including Pre-defined indexes, e.g., rotations and years. We further consider the challenging setup in which all intermediate data are unindexed. Random sequences are formed by dividing the unlabeled data into $M-1$ domains randomly, with self-training on the target at the end. For our IDOL, the different domain scores introduced in subsection 3.3 are adopted to build the coarse domain sequences, and we optionally apply cycle-consistency refinement to make them fine-grained.
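The "Random" baseline above amounts to the following; the helper name is ours.

```python
import numpy as np

def random_domain_sequence(n_examples, n_domains, seed=0):
    """Sketch of the 'Random' baseline: shuffle the unlabeled pool and
    cut it into M-1 roughly equal-sized pseudo-domains, on which
    gradual self-training is then run in order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    return np.array_split(order, n_domains)
```

Since the resulting pseudo-domains carry no gradual shift, this baseline isolates how much GDA's gain comes from the ordering itself.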

(a) Rotated MNIST (class “1”)

(b) Portraits (class “female”)

Figure 3: Random samples (the most representative class) from pre-defined indexes or IDOL sequences.
Table 1: Gradual self-training on Rotated MNIST and Portraits. Bottom section: IDOL with domain scores.

Coarse scores | Indexed? | Adaptation | Refined? | Rotated MNIST | Portraits
None | – | Source only | – | 31.9±1.7 | 75.3±1.6
None | – | UDA ($\mathcal{T}$) | – | 33.0±2.2 | 76.9±2.1
None | – | UDA ($\mathcal{T}+\mathcal{U}$) | – | 38.0±1.6 | 78.9±3.0
Pre-defined [35] | ✓ | GDA |  | 87.9±1.2 | 83.8±0.8
Pre-defined [35] | ✓ | GDA | ✓ | 93.3±2.3 | 85.8±0.4
Random |  | GDA |  | 39.5±2.0 | 81.1±1.8
Random |  | GDA | ✓ | 57.5±2.7 | 82.5±2.2
Classifier confidence |  | GDA |  | 45.5±3.5 | 79.3±1.7
Manifold distance |  | GDA |  | 72.4±3.1 | 81.9±0.8
Domain discriminator |  | GDA |  | 82.1±2.7 | 82.3±0.9
Progressive domain discriminator |  | GDA |  | 85.7±2.7 | 83.4±0.8
Progressive domain discriminator |  | GDA | ✓ | 87.5±2.0 | 85.5±1.0

4.2 Main study: comparison with pre-defined domain sequences

The main results on Rotated MNIST and Portraits are provided in Table 1. We first verify that without any domain sequences, UDA (on target and intermediate data) and GDA (on random sequences) do not perform well, though they slightly improve over the source model. On the contrary, training with pre-defined indexes, which arguably guide the gradual adaptation, performs significantly better.

IDOL is competitive with pre-defined sequences.

IDOL can produce meaningful domain sequences (see Figure 3), given only the intermediate unlabeled data without any order. The domain discriminator with progressive training performs best and is comparable to pre-defined sequences. The confidence scores of the classifier alone are not enough to capture domain shifts.

(a) Target Acc. of gradual ST. (b) Years vs. IDOL sequence.

Figure 4: Portraits with year indexes or IDOL domain sequence.

Refinement helps and coarse initialization matters.

We observe that it is crucial to have high-quality coarse sequences as the initialization; the fine indexes produced by cycle-consistency refinement can further improve the coarse sequences, validating its effectiveness. We note that the Rotated MNIST dataset treats the original MNIST data as 0-degree rotation and artificially rotates the data given a rotation index. However, the data in the original MNIST dataset already exhibit variations in rotation. This can be seen in Figure 3: in the first row of 0-degree rotation based on the pre-defined indexes, the examples of digit 1 do have slight rotations. Interestingly, in this situation, IDOL could potentially capture the true rotation of each example to further improve GDA.

Analysis.

To understand why refinement helps, we take a closer look at the Portraits dataset, which is naturally sorted by years. Are years the best factor for constructing the domain sequence? We compare it with the refined sequence by IDOL. In Figure 4, we monitor the target accuracy at each step of adaptation and find that the accuracy fluctuates when using years. Interestingly, learning with IDOL stably and gradually improves the target accuracy and ends up with a higher accuracy, validating that refinement with cycle-consistency indeed produces better sequences. Specifically, the IDOL sequence has a reasonable 0.727 correlation with years, but they are not perfectly aligned. Several factors such as fashions, hairstyles, eyeglasses, and lip curvatures may not perfectly align with years [14]. Even within the same year, individuals can vary in these factors as well. We thus hypothesize that IDOL can discover a domain sequence that reflects a smoother transition of these factors.
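An alignment score like the 0.727 correlation above can be computed as the Pearson correlation between each example's pre-defined index (e.g., year) and its position in the IDOL sequence. The helper below is a sketch; its name, and the assumption that plain Pearson correlation on positions is the exact statistic, are ours.

```python
import numpy as np

def sequence_correlation(pre_defined_index, idol_position):
    """Pearson correlation between pre-defined indexes (e.g., years)
    and positions in the discovered IDOL sequence; 1.0 means the two
    orderings are perfectly linearly aligned."""
    return float(np.corrcoef(pre_defined_index, idol_position)[0, 1])
```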

We note that the intermediate domains constructed by IDOL (domain scores with or without refinement) may not match the target “class” distributions. We monitor the number of examples of class $c$ in each domain $\mathcal{U}_{m}$, i.e., $|\mathcal{U}_{m,c}|$, and compute $\frac{1}{M-1}\sum_{m=1}^{M-1}\frac{\max_{c}\{|\mathcal{U}_{m,c}|\}}{\min_{c}\{|\mathcal{U}_{m,c}|\}}$, which ideally should be 1.0, i.e., class-balanced. We observe that fine-grained indexes are indeed more class-balanced (coarse vs. fine-grained): 1.33 vs. 1.23 on Rotated MNIST and 1.27 vs. 1.25 on Portraits.
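The class-balance diagnostic above can be written out as follows. The helper name is ours, and the guard against classes absent from a domain (which would make the minimum zero) is an assumption for robustness.

```python
import numpy as np

def mean_class_imbalance(domain_ids, class_ids, n_domains):
    """Average over intermediate domains of max_c |U_{m,c}| / min_c |U_{m,c}|;
    1.0 means every domain is perfectly class-balanced."""
    ratios = []
    for m in range(n_domains):
        counts = np.bincount(class_ids[domain_ids == m])
        counts = counts[counts > 0]  # ignore classes absent from this domain
        ratios.append(counts.max() / counts.min())
    return float(np.mean(ratios))
```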

4.3 Case study: learning with partial or outlier information

If partial or outlier information about the intermediate domains is given, is IDOL still helpful? We examine this question with three experiments. First, we consider unsorted domains: the data are grouped into $M-1$ domains according to the pre-defined indexes, but the domains are unordered. We find that if we sort the domains by their mean domain scores $\frac{1}{|\mathcal{U}_{m}|}\sum_{{\bm{x}}_{i}\in{\mathcal{U}_{m}}}q_{i}$, we can perfectly recover the pre-defined order. Second, we investigate having fewer, coarsely separated pre-defined domains: we double the size of $|\mathcal{U}_{m}|$. We find that on Rotated MNIST (w/ pre-defined indexes), the accuracy degrades by 11%. IDOL can recover the accuracy since it outputs a sorted sequence of data points, from which we can construct more, finer-grained intermediate domains. Third, we investigate the addition of outlier domains, in which the starting/ending intermediate domains do not match the source/target domains exactly. For example, on Rotated MNIST, the intermediate domains are expanded to cover $[-30,90]$ degrees. GDA on such a domain sequence with outlier domains degrades to 77.0±2.0 accuracy. Refinement can still improve it to 81.3±1.4.
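The first experiment's recovery of the pre-defined order reduces to ranking domains by their mean score, as sketched below (the helper name is ours; ascending score is assumed to run from source-like to target-like).

```python
import numpy as np

def recover_domain_order(domain_ids, scores, n_domains):
    """Rank unordered pre-defined domains by their mean IDOL domain score
    (1/|U_m|) * sum_{x_i in U_m} q_i; returns domain indexes ordered from
    the most source-like to the most target-like."""
    means = [scores[domain_ids == m].mean() for m in range(n_domains)]
    return np.argsort(means)
```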

(a) Rotated MNIST. (b) Portraits.

Figure 5: Different numbers of noisily-indexed intermediate data.

4.4 Case study: learning with low-quality intermediate data or indexes

In practice, the intermediate unlabeled data or their pre-defined domain indexes might not be perfect. We consider several realistic cases to show that our method can make GDA more robust to the quality of the intermediate data or the domain sequences. First, we investigate fewer intermediate data: Equation 4 implies that the target error increases if the intermediate data are too sparse. We repeat the experiments in subsection 4.2 but with only 30% of the intermediate data. Second, we consider noisy indexes: building upon the 30% clean indexes, we further add more data with noisy, random indexes.

Figure 5 summarizes the results. We first notice that with only 30% of the intermediate data, even if the indexes are clean, the accuracy decreases drastically: by 14% for MNIST and by 5% for Portraits. For MNIST, adding more noisily-indexed data actually decreases the accuracy. Portraits gains some improvement from the noisily-indexed data, suggesting that the task can benefit from learning with intermediate data (even without accurate indexes). One advantage of IDOL is that we can annotate domain indexes for any unindexed data. On both datasets, IDOL stably improves as more data become available, significantly outperforming the setting without IDOL.

Table 2: CIFAR10-STL.

Method | Target Acc.
Source only | 76.6±0.4
UDA ($\mathcal{T}$) (lr $=10^{-4}$) | 69.4±0.4
UDA ($\mathcal{T}$) (lr $=10^{-5}$) | 75.1±0.3
UDA ($\mathcal{T}+\mathcal{U}$) | 61.1±0.8
GDA w/ conf. | 77.1±0.5
GDA w/ IDOL | 78.1±0.4

4.5 Case study: CIFAR10-STL UDA with additional, open-domain unlabeled data

We examine IDOL on a CIFAR10-to-STL [34, 5] UDA task, which is known to be challenging due to the small data size [4]. The STL dataset has an additional unlabeled set, whose data are sub-sampled from ImageNet [9]. We use this set as the intermediate data, which contains unknown or outlier domains and classes. Here, we do not aim to compete with the state of the art on the CIFAR10-to-STL UDA task, but to study a more general application scenario for IDOL.

To leverage these data, we use $M=3$ and filter out 80% of them using the classifier confidence. We first train a ResNet-20 [23] source model. In Table 2, we observe that direct UDA using self-training on STL or on the unlabeled data is inferior to the source model, even if the learning rate is carefully tuned. GDA with IDOL achieves the highest accuracy, demonstrating its robustness and wide applicability.
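The confidence filtering step above can be sketched as keeping only the most confident fraction of the open-domain pool; the helper name and the use of the max softmax probability as the confidence measure are our assumptions.

```python
import numpy as np

def filter_by_confidence(probs, keep_fraction=0.2):
    """Keep the most confident fraction of open-domain unlabeled data
    (e.g., keep_fraction=0.2 drops 80%), scoring each example by the
    classifier's max predicted class probability.
    probs: (N, C) array of predicted class probabilities."""
    conf = probs.max(axis=1)
    n_keep = int(round(len(conf) * keep_fraction))
    return np.argsort(-conf)[:n_keep]  # indexes of the retained examples
```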

5 Related Work

Domain adaptation.

UDA has been studied extensively [53, 1, 63], and many different approaches have been proposed, such as minimizing domain divergence [45, 61, 65, 13, 20, 62] and learning cross-domain [58, 57, 39, 61] or domain-invariant [26, 68, 54, 67, 73, 69, 18] features. Self-training has recently emerged as a simple yet effective approach [2, 31, 77, 30, 64, 41, 28]. It uses the source model to provide pseudo-labels for unlabeled target data and approaches UDA via supervised learning in the target domain [38, 48, 49]. Theoretical analyses of applying self-training to UDA are provided in [70, 3].

Gradual domain adaptation.

Many UDA works show the benefits of gradually bridging the domain gap. With features as input, several works [19, 15, 8] construct Grassmannian manifolds between the source and the target. For deep models, progressive DA [27] proposes to create a synthetic intermediate domain with an image-to-image translation network. Other works [7, 17] instead generate intermediate data with jointly-trained generators. Na et al. [51] augment the intermediate data with Mixup [74]. Unlike all the above works, which focus on building synthetic intermediate domains given only the source and target data, gradual domain adaptation proposed in [25, 12, 71, 35] studies how to leverage extra real intermediate data with pre-defined domain indexes.

Learning with cycle consistency.

The concept of cycle consistency has been successfully applied in many machine learning tasks. At a high level, cycle supervision enforces that a translation to the target can be translated back to match the source. For instance, back-translation [59, 22] is a popular technique for unsupervised neural machine translation in natural language processing. Many computer vision tasks such as image generation [56], image segmentation [40], and style transfer [76] can also be improved by matching the structural outputs with cyclic signals. We propose a novel approach that uses cycle consistency to discover intermediate domains for gradual adaptation.

Discovering feature subspaces.

We further discuss the connections of our work to existing efforts on subspace learning. One powerful technique is subspace clustering [33, 52, 10], which partitions data into many subspaces (e.g., classes) in an unsupervised manner. However, it may not apply to GDA directly since it groups data by discriminative features, which might not capture domain shifts. Specifically for UDA, some works aim to discover sub-spaces in the source/target domains [16, 47, 72, 24, 43]. The assumption is that a dataset might not be collected from one environment but generated from many different distributions, re-formulating adaptation as a multi-domain adaptation problem. We note that the sub-domains discovered by these methods are fundamentally different from the domain sequence we discover: each sub-domain is assumed to have a dedicated adaptation mapping to the target domain, whereas the domain sequences in GDA construct a path from the source to the target via intermediate data. How to link these learning techniques to GDA will be interesting future work.

6 Conclusion

Gradual domain adaptation leverages intermediate data to bridge the domain gap between the source and target domains. However, it relies on a strong premise: prior knowledge of domain indexes to sort the intermediate data into a domain sequence. We propose the IDOL algorithm, which can produce comparable or even better domain sequences without pre-defined indexes. With a simple progressively-trained domain discriminator, it matches the performance of using pre-defined indexes, and it improves further with the proposed cycle-consistency refinement. Essentially, IDOL is a useful tool to augment any GDA algorithm with high-quality domain sequences given unindexed data.

Acknowledgments and funding transparency statement

This research is partially supported by NSF IIS-2107077 and the OSU GI Development funds. We are thankful for the generous support of the computational resources by the Ohio Supercomputer Center.

References

  • Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • Chen et al. [2011] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In NeurIPS, 2011.
  • Chen et al. [2020] Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. In NeurIPS, 2020.
  • Cicek and Soatto [2019] Safa Cicek and Stefano Soatto. Unsupervised domain adaptation via regularized conditional alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1416–1425, 2019.
  • Coates et al. [2011] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Combes et al. [2020] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoff Gordon. Domain adaptation with conditional distribution matching and generalized label shift. In NeurIPS, 2020.
  • Cui et al. [2020] Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, and Qi Tian. Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12455–12464, 2020.
  • Cui et al. [2014] Zhen Cui, Wen Li, Dong Xu, Shiguang Shan, Xilin Chen, and Xuelong Li. Flowing on riemannian manifold: Domain adaptation by shifting covariance. IEEE transactions on cybernetics, 44(12):2264–2273, 2014.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Elhamifar and Vidal [2013] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
  • Farshchian et al. [2018] Ali Farshchian, Juan A Gallego, Joseph P Cohen, Yoshua Bengio, Lee E Miller, and Sara A Solla. Adversarial domain adaptation for stable brain-machine interfaces. arXiv preprint arXiv:1810.00045, 2018.
  • Gadermayr et al. [2018] Michael Gadermayr, Dennis Eschweiler, Barbara Mara Klinkhammer, Peter Boor, and Dorit Merhof. Gradual domain adaptation for segmenting whole slide images showing pathological variability. In International Conference on Image and Signal Processing, pages 461–469. Springer, 2018.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(1):2096–2030, 2016.
  • Ginosar et al. [2015] Shiry Ginosar, Kate Rakelly, Sarah Sachs, Brian Yin, and Alexei A Efros. A century of portraits: A visual historical record of american high school yearbooks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1–7, 2015.
  • Gong et al. [2012] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE conference on computer vision and pattern recognition, pages 2066–2073. IEEE, 2012.
  • Gong et al. [2013] Boqing Gong, Kristen Grauman, and Fei Sha. Reshaping visual datasets for domain adaptation. Advances in Neural Information Processing Systems, 2013.
  • Gong et al. [2019] Rui Gong, Wen Li, Yuhua Chen, and Luc Van Gool. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2477–2486, 2019.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Gopalan et al. [2011] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In 2011 international conference on computer vision, pages 999–1006. IEEE, 2011.
  • Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • He et al. [2016a] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. Advances in neural information processing systems, 29:820–828, 2016a.
  • He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016b.
  • Hoffman et al. [2012] Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. Discovering latent domains for multisource domain adaptation. In European Conference on Computer Vision, pages 702–715. Springer, 2012.
  • Hoffman et al. [2014] Judy Hoffman, Trevor Darrell, and Kate Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–874, 2014.
  • Hoffman et al. [2018] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
  • Hsu et al. [2020] Han-Kai Hsu, Chun-Han Yao, Yi-Hsuan Tsai, Wei-Chih Hung, Hung-Yu Tseng, Maneesh Singh, and Ming-Hsuan Yang. Progressive domain adaptation for object detection. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
  • Inoue et al. [2018] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, 2018.
  • Jamal et al. [2020] Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7610–7619, 2020.
  • Khodabandeh et al. [2019] Mehran Khodabandeh, Arash Vahdat, Mani Ranjbar, and William G Macready. A robust learning approach to domain adaptive object detection. In ICCV, 2019.
  • Kim et al. [2019] Seunghyeon Kim, Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In ICCV, 2019.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kriegel et al. [2009] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. Acm transactions on knowledge discovery from data (tkdd), 3(1):1–58, 2009.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kumar et al. [2020] Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pages 5468–5479. PMLR, 2020.
  • Larochelle et al. [2007] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pages 473–480, 2007.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee [2013] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, 2013.
  • Lee et al. [2019] Seungmin Lee, Dongwan Kim, Namil Kim, and Seong-Gyun Jeong. Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In ICCV, 2019.
  • Li et al. [2019] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
  • Liang et al. [2019] Jian Liang, Ran He, Zhenan Sun, and Tieniu Tan. Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation. In CVPR, 2019.
  • Liang et al. [2017] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
  • Liu et al. [2020] Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X Yu, and Boqing Gong. Open compound domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12406–12415, 2020.
  • Long et al. [2013] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision, pages 2200–2207, 2013.
  • Long et al. [2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
  • Long et al. [2017] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. arXiv preprint arXiv:1705.10667, 2017.
  • Mancini et al. [2018] Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulo, Barbara Caputo, and Elisa Ricci. Boosting domain adaptation by discovering latent domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3771–3780, 2018.
  • McClosky et al. [2006a] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In ACL, 2006a.
  • McClosky et al. [2006b] David McClosky, Eugene Charniak, and Mark Johnson. Reranking and self-training for parser adaptation. In ACL, 2006b.
  • McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • Na et al. [2020] Jaemin Na, Heechul Jung, HyungJin Chang, and Wonjun Hwang. Fixbi: Bridging domain spaces for unsupervised domain adaptation. arXiv preprint arXiv:2011.09230, 2020.
  • Ng et al. [2001] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14:849–856, 2001.
  • Pan et al. [2010] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.
  • Pei et al. [2018] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Ren et al. [2018] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4334–4343. PMLR, 2018.
  • Russo et al. [2018] Paolo Russo, Fabio M Carlucci, Tatiana Tommasi, and Barbara Caputo. From source to target and back: symmetric bi-directional adaptive gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8099–8108, 2018.
  • Saito et al. [2017] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Adversarial dropout regularization. arXiv preprint arXiv:1711.01575, 2017.
  • Saito et al. [2018] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
  • Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709, 2015.
  • Sethi and Kantardzic [2017] Tegjyot Singh Sethi and Mehmed Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications, 82:77–99, 2017.
  • Shu et al. [2018] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. In ICLR, 2018.
  • Sugiyama et al. [2007] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. JMLR, 8(May):985–1005, 2007.
  • Sun et al. [2016] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
  • Tao et al. [2018] Qingyi Tao, Hao Yang, and Jianfei Cai. Zero-annotation object detection with web knowledge transfer. In ECCV, 2018.
  • Tzeng et al. [2014] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • Vergara et al. [2012] Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A Ryan, Margie L Homer, and Ramón Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.
  • Volpi et al. [2018] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5495–5504, 2018.
  • Wang et al. [2019] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5345–5352, 2019.
  • Wang et al. [2020] Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In CVPR, 2020.
  • Wei et al. [2021] Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, 2021.
  • Wulfmeier et al. [2018] Markus Wulfmeier, Alex Bewley, and Ingmar Posner. Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International conference on robotics and automation (ICRA), pages 4489–4495. IEEE, 2018.
  • Xiong et al. [2014] Caiming Xiong, Scott McCloskey, Shao-Hang Hsieh, and Jason Corso. Latent domains modeling for visual domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
  • Xu et al. [2020] Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In AAAI, 2020.
  • Zhang et al. [2018] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
  • Zhao et al. [2019] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR, 2019.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • Zou et al. [2018] Yang Zou, Zhiding Yu, BVK Vijaya Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, 2018.
  • Zou et al. [2019] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5982–5991, 2019.

Supplementary Material

We provide details omitted in the main paper.

  • section 7: more details of IDOL (cf. section 3 of the main paper).

  • section 8: details of implementations and experimental setups, computation cost, more results for refinement in IDOL (cf. section 4 of the main paper).

  • section 9: broader impact and potential negative societal impacts of our work.

  • section 10: limitations and future work.

7 Details of IDOL

7.1 Another view for the coarse-to-fine framework of IDOL

Another way to understand the coarse-to-fine framework we propose is from the optimization perspective. In Equation 3 of the main paper, we provide the overall objective of IDOL, which is intractable due to (a) having no labeled target data to evaluate the domain sequence and (b) the combinatorial nature of searching for a domain sequence. We deal with (a) by Equation 7 of the main paper: we propose the cycle-consistency loss to measure the quality of the domain sequence, which requires no labeled target data. Since Equation 7 is still hard to solve, we relax it with a greedy approach. We find one domain at a time in sequence, starting from $\mathcal{U}_{1}$ (i.e., the one closest to the source) to $\mathcal{U}_{M-1}$ (i.e., the one closest to the target). Each sub-problem is described in Equation 8 of the main paper. The relaxation may lead to sub-optimal solutions: each sub-problem is not aware of the other already-selected intermediate domains or of the future intermediate domains to be selected. To mitigate this issue, we assign each data point a coarse domain score (i.e., subsection 3.3 of the main paper), which serves as the initialization of each sub-problem.

7.2 Algorithms

Here we summarize the IDOL algorithm. IDOL learns to sort the unindexed intermediate data into a sequence, which is then partitioned into several intermediate domains. An illustration is provided in Figure 6. As shown in algorithm 1, the overall procedure has three main steps. First, we construct the coarse domain sequence by learning to predict a domain score for each example and sorting the examples by their scores. Second, we refine the coarse indexes with cycle-consistency, as shown in algorithm 2. The refinement is decomposed into several steps, gradually discovering the next intermediate domain in sequence; each step refines the coarse indexes with meta-reweighting [55, 29] and takes the examples closest to the current domain as the next domain, as shown in algorithm 3. Finally, IDOL outputs a sequence of intermediate data points, which can then be divided into several intermediate domains for gradual domain adaptation.

Figure 6: Gradual domain adaptation (GDA) without indexed intermediate domains. Our IDOL algorithm sorts the unindexed intermediate data into a sequence, from the source domain to the target domain, which can then be further partitioned into several intermediate domains for gradual domain adaptation.
Input: Labeled source data \mathcal{S}, unlabeled target data \mathcal{T}, intermediate data \mathcal{U}, and the number of domains M-1;
1 Coarse indexing (by progressive training of the domain discriminator): learn g(\cdot;\bm{\phi}) with \mathcal{S}, \mathcal{T}, \mathcal{U} and assign a score q_i = g(\bm{x}_i^{\mathcal{U}};\bm{\phi}) to every \bm{x}_i^{\mathcal{U}} \in \mathcal{U} (cf. subsection 3.3 in the main paper);
2 Construct: the indexed sequence I_{\text{coarse}} = (\bm{x}_1, \ldots, \bm{x}_{|\mathcal{U}|}) by sorting \{q_i = g(\bm{x}_i^{\mathcal{U}};\bm{\phi}) \mid \forall \bm{x}_i^{\mathcal{U}} \in \mathcal{U}\};
3 Fine-grained indexing: learn I_{\text{fine-grained}} with refinement (algorithm 2);
4 Construct: the domain sequence by chunking I_{\text{fine-grained}} into M-1 domains;
Output: (\mathcal{U}_1, \ldots, \mathcal{U}_{M-1}).
Algorithm 1 Intermediate DOmain Labeler (IDOL)
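Steps 2 and 4 of algorithm 1 amount to sorting examples by their coarse domain scores and chunking the sorted sequence. A minimal NumPy sketch (the helper name `coarse_domain_sequence` is ours, not from the paper):

```python
import numpy as np

def coarse_domain_sequence(scores, num_domains):
    """Sort intermediate examples by their domain scores and chunk the
    sorted sequence into `num_domains` intermediate domains.

    scores: domain-discriminator outputs q_i, where a higher score means
            "closer to the target domain".
    Returns a list of index arrays, ordered from source-like to target-like.
    """
    order = np.argsort(scores)            # ascending: source-like first
    return np.array_split(order, num_domains)

# Toy usage: 10 examples with scores loosely tracking the degree of shift.
q = np.array([0.05, 0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6, 0.5])
domains = coarse_domain_sequence(q, num_domains=5)
```

Here the first chunk holds the two examples with the smallest scores (indices 0 and 5) and the last chunk the two with the largest (indices 6 and 1), so a GDA algorithm can self-train through the chunks in order.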
Input: Labeled source data \mathcal{S}, the number of domains M-1, and the index sequence I_{\text{coarse}} = (\bm{x}_1, \ldots, \bm{x}_{|I_{\text{coarse}}|}).
Initialize: Pre-train the source model \bm{\theta}_0 on \mathcal{S}; \mathcal{S}_0 \leftarrow \mathcal{S}; chunk size C = \frac{|I_{\text{coarse}}|}{M-1};
1 for m \in [0, 1, \ldots, M-2] do
2     Initialize: data parameters \bm{q} = [\frac{|I_{\text{coarse}}|}{|I_{\text{coarse}}|}, \frac{|I_{\text{coarse}}|-1}{|I_{\text{coarse}}|}, \ldots, \frac{0}{|I_{\text{coarse}}|}];
3     Obtain the next domain: I_{m+1} \leftarrow FindNextDomain(\bm{\theta}_m, I_{\text{coarse}}, \mathcal{S}_m, \bm{q}, C); // algorithm 3
4     Pseudo-label I_{m+1} to construct \mathcal{S}_{m+1} = \{(\bm{x}_i, \texttt{sharpen}(f(\bm{x}_i, \bm{\theta}_m)))\}_{\bm{x}_i \in I_{m+1}};
5     \bm{\theta}_{m+1} \leftarrow self-train \bm{\theta}_m on \mathcal{S}_{m+1};
6     Update I_{\text{coarse}} \leftarrow (\bm{x}_i \mid \bm{x}_i \in I_{\text{coarse}}, \bm{x}_i \notin I_{m+1});
Output: Concatenate I_1, \ldots, I_{M-1} as the fine-grained indexes I_{\text{fine-grained}}.
Algorithm 2 Refinement of the coarse sequence
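The pseudo-labeling step of algorithm 2 applies a `sharpen` operator to the model's predicted distribution. The paper only names the operator, so the exact form below is our assumption: one common instantiation is temperature sharpening, where T → 0 recovers a hard argmax pseudo-label.

```python
import numpy as np

def sharpen(probs, T=0.5):
    """Temperature sharpening of a probability vector: raise each class
    probability to 1/T and renormalize. This is one common instantiation
    of the `sharpen` operator used when pseudo-labeling; the exact choice
    in the paper is not specified, so treat this as an assumption."""
    p = np.asarray(probs) ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)
```

For example, sharpening `[0.7, 0.3]` with T = 0.5 squares and renormalizes the probabilities, pushing the dominant class above 0.84 while keeping the argmax unchanged, which makes the pseudo-labels more confident targets for self-training.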
Input: \bm{\theta}_m, (pseudo-)labeled data \mathcal{S}_m = \{(\bm{x}_i, y_i)\}_{i=1}^{|\mathcal{S}_m|}, intermediate index sequence I = (\bm{x}_1, \ldots, \bm{x}_{|I|}), initial data parameters \bm{q} \in \mathbb{R}^{|I|} (q_i corresponds to \bm{x}_i in I), loss function \ell, learning rates \eta_{\bm{\theta}}, \eta_{\bm{q}}, and chunk size C.
1 while not converged do
2     Detach \bm{\theta}^{(0)} \leftarrow \bm{\theta}_m;
3     for t \in [1, \ldots, T] do
4         Sample a mini-batch B_{\mathcal{U}} from I, \bm{q}; // forward adaptation
5         \bm{\theta}^{(t)} \leftarrow \bm{\theta}^{(t-1)} - \eta_{\bm{\theta}} \nabla_{\bm{\theta}^{(t-1)}} \sum_{i \in B_{\mathcal{U}}} q_i \times \ell(f(\bm{x}_i; \bm{\theta}^{(t-1)}), \texttt{sharpen}(f(\bm{x}_i; \bm{\theta}_m)));
6     Detach \bm{\theta}' \leftarrow \bm{\theta}^{(T)}(\bm{q});
7     for t \in [T, \ldots, 2T-1] do
8         Sample a mini-batch B_{\mathcal{S}_m} from \mathcal{S}_m; // backward adaptation
9         \bm{\theta}^{(t+1)} \leftarrow \bm{\theta}^{(t)} - \eta_{\bm{\theta}} \nabla_{\bm{\theta}^{(t)}} \frac{1}{|B_{\mathcal{S}_m}|} \sum_{i \in B_{\mathcal{S}_m}} \ell(f(\bm{x}_i; \bm{\theta}^{(t)}), \texttt{sharpen}(f(\bm{x}_i; \bm{\theta}')));
10    Sample a mini-batch B_{\mathcal{S}_m} from \mathcal{S}_m; // update data parameters with cycle-consistency
11    Update \bm{q} \leftarrow \bm{q} - \eta_{\bm{q}} \nabla_{\bm{q}} \frac{1}{|B_{\mathcal{S}_m}|} \sum_{i \in B_{\mathcal{S}_m}} \ell(f(\bm{x}_i; \bm{\theta}^{(2T)}(\bm{q})), y_i);
12    q_i \leftarrow \max\{0, q_i\}, \forall q_i \in \bm{q};
13 I \leftarrow sort I by \bm{q} (descending order);
Output: the next domain I[:C].
Algorithm 3 Finding the next domain with cycle-consistency (FindNextDomain)
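The bookkeeping around the meta-updates in algorithm 3 (initializing the data parameters with a linear decay, clamping them to be non-negative, and taking the top chunk) can be sketched in NumPy. The helper names are ours, and the exact endpoints of the linear decay are a detail we fix for illustration:

```python
import numpy as np

def init_data_parameters(n):
    """Linearly decaying initialization of the data parameters q: the
    example at the front of the coarse sequence (closest to the current
    domain) starts at 1, and weights decay linearly toward 0. The exact
    endpoints follow the spirit of Algorithm 2's initialization."""
    return (n - np.arange(n)) / n

def select_next_domain(indices, q, chunk_size):
    """Final steps of Algorithm 3: clamp q at zero, sort the remaining
    sequence by q in descending order, and take the top chunk as the
    next intermediate domain."""
    q = np.maximum(q, 0.0)
    order = np.argsort(-q)                # descending by data parameter
    return [indices[i] for i in order[:chunk_size]]
```

For instance, with six candidate examples and (post-meta-update) parameters `[0.2, -0.5, 0.9, 0.1, 0.7, 0.0]`, the negative entry is clamped to zero and the two examples with the largest weights are selected as the next domain.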

8 Implementation Details

Table 3: IDOL with refinement on different coarse domain scores. Each coarse scoring method is evaluated with gradual self-training (GDA), before (✗) and after (✓) refinement.

Coarse scores                    | Refined? | Rotated MNIST | Portraits
Classifier confidence            | ✗        | 45.5 ± 3.5    | 79.3 ± 1.7
                                 | ✓        | 62.5 ± 2.1    | 83.6 ± 1.6
Manifold distance                | ✗        | 72.4 ± 3.1    | 81.9 ± 0.8
                                 | ✓        | 82.4 ± 2.3    | 85.2 ± 0.9
Domain discriminator             | ✗        | 82.1 ± 2.7    | 82.3 ± 0.9
                                 | ✓        | 86.2 ± 2.2    | 85.1 ± 1.3
Progressive domain discriminator | ✗        | 85.7 ± 2.7    | 82.3 ± 0.9
                                 | ✓        | 87.5 ± 2.0    | 85.5 ± 1.0

8.1 Experimental setup

Dataset, model, and optimizer.

The Rotated MNIST and Portraits images are resized to 28×28 and 32×32, respectively, without data augmentation. In the CIFAR10-STL experiments, images are resized to 32×32, and we use the standard normalization and data augmentation (random horizontal flipping and random cropping) as in [23].

For the Rotated MNIST and Portraits datasets, we adopt the same network used in [35]. The network consists of 3 convolutional layers, each with a kernel size of 5, stride 2, and 32 channels, with ReLU activations for all hidden layers. After the convolutional layers, the network ends with a dropout layer (dropout rate 0.5), a batch-norm layer, and the Softmax classifier. We use the Adam optimizer [32] with learning rate 0.001, batch size 32, and weight decay 0.02. Unless otherwise specified, we use this optimizer and train for 20 epochs on the Rotated MNIST and Portraits datasets, for training the source model, self-training adaptation on each domain, our domain discriminator, and each progressive training step.
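As a quick sanity check on the architecture, the spatial size after the three stride-2 convolutions can be computed with the standard convolution output formula. The padding of 2 is our assumption (the paper does not state it); it keeps the 5×5 kernel "same"-style before the stride-2 downsampling.

```python
def conv_out(size, kernel=5, stride=2, padding=2):
    """Standard conv output-size formula: floor((n + 2p - k) / s) + 1.
    padding=2 is an assumption, not stated in the paper."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_size(size, num_layers=3):
    """Spatial size after the 3-layer conv stack described above."""
    for _ in range(num_layers):
        size = conv_out(size)
    return size

# Under this padding assumption: 28 -> 14 -> 7 -> 4, and 32 -> 16 -> 8 -> 4,
# i.e., a 4x4x32 feature map feeds the classifier for both input sizes.
```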

For the CIFAR10-STL experiments, we train a ResNet-20 for 200 epochs for the source model and 80 epochs for both the domain discriminator and each progressive training step, with the Adam optimizer, learning rate 0.00001, batch size 128, and weight decay 0.0001.

We use the same network for our domain discriminator but replace the classifier with a Sigmoid binary classifier.

GDA.

For GDA, we focus on the gradual self-training studied in [35]. We follow the common practice of filtering out low-confidence pseudo-labeled data for self-training on every domain: in applying Equation 1 of the main paper for self-training, we only use data with high prediction confidence. We keep the top 90% most confident examples for the Rotated MNIST and Portraits datasets and the top 20% for the CIFAR10-STL experiments, since most of the CIFAR10-STL unlabeled data are noisy and could be out-of-distribution. Each domain is trained for 20 epochs for the Rotated MNIST and Portraits datasets and for 80 epochs for the CIFAR10-STL experiments.
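The confidence filter described above can be sketched in a few lines of NumPy. The helper name `filter_by_confidence` is ours; confidence is taken as the maximum predicted class probability, as is standard for self-training.

```python
import numpy as np

def filter_by_confidence(probs, keep_ratio):
    """Keep the indices of the top `keep_ratio` most confident examples,
    where confidence is the max predicted class probability. The kept
    examples are then pseudo-labeled for self-training on this domain."""
    conf = probs.max(axis=1)                  # per-example confidence
    k = int(np.ceil(keep_ratio * len(conf)))  # number of examples to keep
    return np.argsort(-conf)[:k]              # most confident first
```

For Rotated MNIST and Portraits one would call it with `keep_ratio=0.9`, and with `keep_ratio=0.2` for CIFAR10-STL.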

IDOL.

For hyperparameters specific to IDOL, we tune on the target validation set. We did not change the GDA hyperparameters, for a fair comparison. We set K = 2M rounds for progressive training. For algorithm 3, we set T = 10 and train for 30 epochs in total for all datasets, with batch size 128. The learning rates are set as \eta_{\bm{\theta}} = \eta_{\bm{q}} = 0.001.

8.2 Computation cost

We run our experiments on one GeForce RTX 2080 Ti GPU with Intel i9-9960X CPUs. In algorithm 3, updating the data parameters \bm{q} requires back-propagating through the gradients of \bm{\theta}, which takes approximately three times the computation of a standard forward-backward pass [55].

We estimate the time consumption on the Portraits dataset experiments. For the coarse scores, the classifier confidence is obtained by simply applying the classifier to the input data. Calculating the manifold distance via [50] takes about 20 seconds on an Intel i9-9960X CPU. The domain discriminator (with or without progressive training) and the fine stage involve GPU computation. On a GeForce RTX 2080 Ti GPU, training a domain discriminator takes about 4 seconds. Progressive training takes about M times longer, where M is the number of intermediate domains (7 here), because it trains the domain discriminator for M rounds. For the refinement stage proposed in subsection 3.4 based on meta-learning, discovering one intermediate domain takes about 44 seconds, and we perform it for M rounds.

8.3 More results on refinement

We provide the full experiments (cf. Table 1 in the main paper) of IDOL refinement on different coarse domain scores in Table 3. We observe that the refinement consistently improves upon the coarse domain scores, and that the quality of the coarse domain scores is important to the final performance.

Next, we design the following experiment to investigate the variation across trials and the effect of the number of intermediate samples. We apply IDOL (with the progressive discriminator and refinement) for 10 runs with different random seeds, and record the intermediate domain index (from the source = 0 to the target = M) to which each sample is assigned. For instance, if a sample is assigned to the second intermediate domain, we record 2 for this sample. After 10 runs, each sample ends up with 10 indices, and we calculate the variance; if a sample is always assigned to the same intermediate domain, its variance is zero. We repeat the experiments with 100%/50%/25% of the intermediate data (dropped randomly), and report the variance averaged over all samples for both the Rotated MNIST dataset (M = 19) and the Portraits dataset (M = 7). As shown in Table 4, the domain sequences formed in different trials are quite consistent, since the variances are small. Rotated MNIST has a higher variance than Portraits, probably because its number of intermediate domains is larger. Besides, the amount of intermediate data does not significantly affect the consistency among trials.

Table 4: Variances of domain assignments.

Intermediate data | Rotated MNIST | Portraits
100%              | 1.12          | 0.147
50%               | 1.33          | 0.150
25%               | 1.35          | 0.136
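The statistic reported in Table 4 is simple to reproduce given the per-run assignments. A minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def assignment_variance(assignments):
    """assignments: (num_runs, num_samples) array, where entry (r, i) is
    the intermediate-domain index that sample i was assigned to in run r
    (0 = source side, M = target side). Returns the per-sample variance
    across runs, averaged over all samples, as reported in Table 4."""
    return np.var(assignments, axis=0).mean()
```

If every run assigns every sample to the same domain, the statistic is exactly zero; any disagreement across runs raises it.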

9 Broader Impact

Unsupervised domain adaptation aims to address the distribution shift from the labeled training data to the unlabeled target data. Our work focuses on leveraging extra unlabeled data to help unsupervised domain adaptation. We only assume that the unlabeled data are generated from underlying distributions that can be viewed as structured shifts from the source to the target domain. To the best of our knowledge, our algorithm would not cause negative effects on data privacy, fairness, security, or other societal impacts.

10 Limitations and Future Work

Nearly all domain adaptation algorithms need to make assumptions about the underlying data distributions. For our algorithm IDOL, we generally require the additional unlabeled data to be distributed in between the source and the target domains, which is the standard setup of gradual domain adaptation (cf. section 2 of the main paper). GDA will be less effective if the additional unlabeled data do not gradually bridge the two domains or contain outliers. If there are unlabeled data close to the source or target domains but not distributed in between them (e.g., if the source/target are digits with 60/0-degree rotation, then digits with 70-degree rotation are one such case), IDOL may still include those data in the intermediate domains, which could potentially degrade the overall performance. In subsection 4.5 of the main paper, we further show that, even if the additional unlabeled data contain outliers or do not smoothly bridge the two domains, IDOL can still effectively leverage the data to improve performance on the target domain.

The current scope of our experiments is mainly limited by the availability of benchmark datasets for gradual domain adaptation, largely because GDA is a relatively new setting. Most existing domain adaptation datasets are designed for conventional unsupervised domain adaptation, in which only the source and target domain data are provided. Given (1) the superior performance of gradual domain adaptation in leveraging additional unlabeled data and (2) the fact that real-world data change gradually more often than abruptly, we believe it will be useful to develop more datasets for gradual domain adaptation. For instance, autonomous driving datasets that involve shifts in time and geo-location could be one option. Other examples include sensor measurements that drift over time, evolving road conditions for self-driving cars, and neural signals received by brain-machine interfaces, as pointed out in the introduction of [35].