
Delving into the Continuous Domain Adaptation

Yinsong Xu, School of Artificial Intelligence, Beijing University of Posts and Telecommunications; National Institute of Health Data Science, Peking University, Beijing, China. [email protected]
Zhuqing Jiang, School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture, Beijing, China. [email protected]
Aidong Men, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China. [email protected]
Yang Liu, Wangxuan Institute of Computer Technology, Peking University, Beijing, China. [email protected]
Qingchao Chen, National Institute of Health Data Science, Peking University, Beijing, China. [email protected]
(2022)
Abstract.

Existing domain adaptation methods assume that domain discrepancies are caused by a few discrete attributes and variations, e.g., art, real, painting, quickdraw, etc. We argue that this is not realistic, as it is implausible to define real-world datasets using a few discrete attributes. Therefore, we propose to investigate a new problem, Continuous Domain Adaptation (CDA), in which infinite domains are formed by continuously varying attributes. Leveraging the knowledge of two labeled source domains and several observed unlabeled target domains, the objective of CDA is to learn a generalized model for the whole data distribution spanned by the continuous attribute. Besides formulating the new problem, we also propose a novel approach as a strong CDA baseline. Specifically, we first propose a novel alternating training strategy that reduces discrepancies among multiple domains while generalizing to unseen target domains. Secondly, we propose a continuity constraint on the cross-domain divergence measurement. Finally, to decouple the discrepancy estimate from the mini-batch size, we design a domain-specific queue that maintains a global view of the source domain, which further boosts adaptation performance. Extensive experiments show that our method achieves the state of the art on the CDA problem. The code is available at https://github.com/SPIresearch/CDA.

domain adaptation; transfer learning; out-of-distribution

1. Introduction

Figure 1. (a) Discrete domains; (b) continuous domain; (c) Pull step in our method; (d) Shrinkage step in our method. (Best viewed in color; $\theta$ is the distribution parameter and $a$ is the continuous attribute.)

Recently, machine learning models trained with large-scale data and well-curated annotations have achieved steady and significant improvements. However, as data distributions in multimedia content interpretation are complex, models are fragile and may break down on unseen or out-of-distribution (OOD) test domains, i.e., when there are distribution discrepancies between the source (train) and target (test) domains. Given an unlabeled target dataset with domain discrepancies from the source one, it is costly and challenging to annotate enough training data to train a generalized model. The domain adaptation (DA) paradigm aims to minimize the domain discrepancies and transfer a generalized model from the source to the target domain.

Existing DA methods address the discrepancies among discrete domains by learning domain-invariant features. As shown in Figure 1(a), they assume that the domain discrepancies are caused by different discrete concepts in the data collection procedure (Peng et al., 2019), e.g., art, real, painting, quickdraw, etc. As shown in Table 1, existing multi-source (Zhao et al., 2020; Lin et al., 2020; Zhao et al., 2021; He et al., 2021; Peng et al., 2019), multi-target DA (Gholami et al., 2020), and open compound DA (Liu et al., 2020b) tasks assume that the discrepancies are caused by unknown and discrete domain attributes in training. Evolving DA (Liu et al., 2020a) is a continual-learning-based method that aims to adapt to target data arriving in an online manner without forgetting. However, practical scenarios are far more complicated, and the previously mentioned assumptions are easily violated. Models are expected to adapt to continuous domains in many scenarios, which is more challenging, yet there are very few attempts in the literature.

Continuous Indexed DA (Wang et al., 2020a) addresses adaptation in continuous domains, but it requires a domain label for each sample, which is unrealistic and restricts the scalability of the approach. In this paper, we consider a more realistic setting of domain adaptation: adapting to continuous domains. Assume that a face recognition system is trained on several labeled photos (the source domains), such as ID card photos. When deployed in the real world, the system is expected to recognize targets in the wild, potentially covering long periods (the continuous target domains), such as tracking a suspect or finding a lost child. Another restriction we are confronted with is the lack of domain labels: it is challenging to recognize the shooting time of photos collected from various and uncertain sources. Thus, it is necessary and valuable to learn representations using data collected over a limited period while adapting the model to photos taken at any time (continuously) in the real world. This setup poses challenges to previous DA methods.

We formulate the above setting as a new setup called continuous DA (CDA): 1) multiple domains are sampled continuously based on an attribute variable, e.g., time or angle, as shown in Figure 1(b); 2) we have access to two arbitrarily sampled labeled source domains and several unlabeled probe target domains' data; 3) the task is to learn a domain-invariant and discriminative model that generalizes well to domains with arbitrary attribute values. CDA differs from existing DA tasks (see Figure 1 and Table 1). Firstly, in our CDA setup, infinite domains are generated based on continuous attribute values (e.g., time and angle), rather than several discrete attributes as in other setups (e.g., art/real/painting styles). Secondly, domain labels (the attribute values) are not available in CDA training, which makes it a plausible, real-world problem (although we assume knowing which source domain each source sample comes from). Thirdly, the generalization capability to open domains (target domains whose test data are not available in the training stage) is the main evaluation in CDA. Finally, we propose to train the model using only two source domains' data and annotations, as we consider two domains sufficient to model the domain attribute in the CDA problem.

Table 1. Adaptation tasks using multiple domains, including Multi-Source DA (MSDA), Multi-Target DA (MTDA), Open Compound DA (OCDA), Continuous Indexed DA (CIDA), Evolving DA (EDA), and CDA.

| DA Setup | Source Domains | Target Domains | Domain Labels | Open Domains | Domain Attribute |
|---|---|---|---|---|---|
| MSDA | Multiple | Single | Yes | No | Unknown |
| MTDA | Single | Multiple | Yes | No | Unknown |
| OCDA | Single | Multiple | No | Yes | Unknown |
| CIDA | Multiple | Multiple | Yes | Yes | Discrete |
| EDA | Single | Multiple | No | No | Discrete/Continuous |
| CDA | Two | Infinite | No | Yes | Continuous |

Through understanding and discussing the characteristics of CDA, we articulate the challenges as follows:

Unknown and complex continuity of domain attributes and discrepancies: Although the domain attribute is defined smoothly and its value varies continuously, it is computationally challenging to model such continuity in the attribute space using the high-dimensional observed data. The complex attributes and their continuities give rise to challenging and unpredictable domain discrepancies.

Infinite and non-overlapping target domains under test: Although a few target domains' data can be sampled to probe the continuous attribute variations, the discrepancy between adjacent target domains is still large and difficult to model. In other words, the probe target domain data available in the training stage are not representative of all the target domain samples drawn from the continuous attribute space. To make things worse, there are infinite domains in CDA. Therefore, it is not guaranteed that the model generalizes well to unseen adjacent target domains.

Unknown and unpredictable target domain attribute index: Predicting the target domain index (attribute) based on the finitely sampled probe target domains is always challenging in CDA. In addition, we do not assume access to the domain indexes (attribute values) in CDA.

In this work, we design a novel framework to tackle the CDA challenges.

Firstly, to capture the complex domain attributes and improve model generalization to unseen and infinite target domains, we propose a novel alternating direction training strategy composed of a Pull (P) and a Shrinkage (S) Step. As shown in Figure 1(c)(d), we model the data distribution parameterized by $\theta(a)$ ($\theta$ is shown as one-dimensional for clearer illustration, and $a$ is the attribute variable) as a trajectory. The P step pulls the probe target domains onto the trajectory path formed by the two source domains by reducing the sum of the two target-source discrepancies (keeping the two source domains fixed); the S step then shrinks the trajectory distance by reducing the discrepancy between the two source domains (keeping the target domains fixed). The intuition is that, by alternating the P and S steps progressively, the multi-domain attribute geometry formed by the source and probe target domains is preserved. In addition, the strategy promotes generalized solutions for unseen target domains (see the ablation study results in Table 3).

Secondly, it is challenging to estimate the complete domain attribute geometry and pull all unseen target domains onto it using only two source domains. To tackle this problem, we propose a novel continuity constraint when estimating the cross-domain divergence measurement, and we provide remarks on the implications of the regularized constraint. This helps pull unseen target domains toward the domain attribute trajectory.

Finally, in the CDA setup, as the proposed domain discrepancy needs to cover and pull infinite domains to the trajectory in a stochastic manner, we propose a novel implementation that maintains a global view of the source domain statistics using a queue. It improves the adaptation ability of the source model by preserving more complete source information than mini-batches.

Our contributions are summarized as follows:

  • We propose a new and realistic problem namely the Continuous Domain Adaptation.

  • We propose a novel alternating direction training strategy and a domain-continuity regularizer to reduce multi-domain discrepancies while maintaining the geometry of the domain attributes. In addition, a novel queue-based implementation is designed to estimate the global source domain statistics.

  • Our analysis provides insight and our method achieves SOTA results.

2. Related Works

Single-source domain adaptation. Existing methods explicitly measure and reduce the single source-target cross-domain discrepancy, including Maximum Mean Discrepancy (MMD) (Tolstikhin et al., 2016), $\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., 2010), Maximum Classifier Discrepancy (Saito et al., 2018), Margin Disparity Discrepancy (MDD) (Zhang et al., 2019), Optimal Transport (Nguyen et al., 2021), source-free UDA (Li et al., 2021a), contrastive learning (Chen et al., 2021), and adversarial learning methods (Li et al., 2019; Chen et al., 2018; Li et al., 2021b; Liang et al., 2021; Zhong et al., 2021; Jiang et al., 2020; Wang et al., 2020b). However, as there are infinite target domains in CDA, most of these methods show degraded performance or are difficult to adapt to the CDA setting.

Multi-domain adaptation. Several works address the multi-domain adaptation problems listed in Table 1 (Zhao et al., 2020; Liu et al., 2020b; Jin et al., 2020; He et al., 2021), but they only consider domains sampled on a discrete attribute variable. The work most similar to ours is the continuous-indexed domain adaptation method (Wang et al., 2020a); however, it assumes the domain indexes are available in training. In contrast, in CDA and our method, target domain indexes are not available, and we assume only two source domains are available in training, which is a more realistic setup.

Domain generalization (DG). DG methods (Li et al., 2018a; Matsuura and Harada, 2020; Li et al., 2018b; Zhao et al., 2021; Xu et al., 2020; Liu et al., 2021) aim to train a model with labeled data that generalizes to any unseen target domain. Similar to the DG setup, the CDA setup assumes access to some, but not all, target domains' data. Differently, our domain attribute varies in a continuous manner, and CDA uses two source domains.

Continual learning based domain adaptation. Existing domain adaptation methods based on continual learning focus on "continual" adaptation, whereas we adapt to "continuous" domains. (Bobu et al., 2018; Lao et al., 2020; Liu et al., 2020a; Su et al., 2020; Mancini et al., 2019) focus on learning target tasks online and test on seen target domains; in contrast, our CDA problem assumes infinite, continuously changing, and unseen target domains under test, with unknown domain attributes.

3. Continuous Domain Adaptation

Figure 2. Overall framework. We separate the network into three parts: a joint encoder ($E$), adapter modules ($A_1$, $A_2$), and queues ($Q_1$, $Q_2$). (a) Minimize the sum of the two target-source discrepancies in the Pull Step. (b) Minimize the discrepancy between the two source domains in the Shrinkage Step. (c) Each adapter consists of a content classifier $F$ and an auxiliary classifier $F'$.
Figure 3. Illustration of the domain distribution geometry with an overview of our method. First, the trajectory formed by the two source domains is shown in the initial distribution (left sub-figure). Then, in the Pull Step, target domains are pulled close to the trajectory; in the Shrinkage Step, the trajectory distance shrinks. Our method optimizes the two steps in an alternating manner to reduce the multi-domain discrepancy in CDA progressively.

Problem Formulation: In our CDA setup, the data distribution is given by $P_{\theta(a)}(x, y)$ ($P_{\theta(a)}$ for short), parameterized by $\theta(a)$, where $a$ is an unavailable continuous attribute in real-world scenarios (e.g., time, rotation). We are provided with two labeled source domains $\{S_i(x, y)\}_{i=1}^{2}$ drawn from $P_{\theta(a_i)}$, and a set of unlabeled probe target domains $\{T_j(x)\}_{j=1}^{N_T}$ drawn from $P_{\theta(a_j)}$. The goal of CDA is to learn a generalized model that achieves good performance on $P_{\theta(a)}$ with arbitrary $a$, leveraging the knowledge of the two source domains (data and annotations) and a limited number of target domains' data.

How fragile are DA methods in the CDA problem? CDA is analogous to a practical setting where we are given a real-world dataset with a continuously varying attribute. Such attribute variations, although sometimes tiny, can break a model developed using partially annotated data from the dataset (see Benchmark Analysis in Sec 5.1).

Why two source domains? In the CDA task, all domains are sampled based on the continuous attribute, and the domain variations are mainly caused by the attribute variations. Thus, we are given two labeled source domains and a limited number of target domains to exploit the latent attribute and its variation. Note that the test data also comprises unseen domains.

4. Method

4.1. Motivation

The main objective of CDA is to learn a generalized model tackling discrepancies among all domains sampled on a continuous attribute. To handle the statistics of unseen domains, we model each distribution parameter $\theta$ as a point in a high-dimensional space, as shown in Figure 3. Thus, connecting all values of the attribute $a$, we obtain a trajectory $\theta(a)$ in this space; in this way, the data distributions of infinite domains are represented as a trajectory. Our strategy is to study the geometry of the continuous attribute and use it as an inductive bias in the modeling. Following this principle, we design the alternating direction training strategy with two steps: Pull and Shrinkage. As shown in Figure 3, in the Pull Step, we regard the trajectory formed by the two source domains as the inductive bias and pull the target domains to the trajectory, so that unseen target domains have a higher chance of being close to the trajectory as well. In the Shrinkage Step, we progressively shrink and shorten the trajectory. By alternating the two steps in training, the proposed method is shown to generalize to unseen target domains.

As unseen target data are unavailable in training, we approximate them by perturbing the probe target data, through which the source-unseen-target discrepancy can be estimated by the source-probe-target discrepancy. As the perturbation tends to 0, a gradient penalty can be derived as a constraint on the discrepancy. The discrepancy between the unseen target and source domains is then minimized implicitly by this constraint term. In addition, to obtain a global discrepancy that perceives more source information than a mini-batch, we decouple the discrepancy from the mini-batch size by utilizing queues.

4.2. Overall Framework

The overall architecture is shown in Figure 2. Specifically, the framework comprises a joint feature encoder $E$, which extracts the feature $e = E(x)$ of input data $x$, and two adapter modules $A_1$ and $A_2$, each of which consists of two paired classifiers. The reason for adopting two adapter modules rather than one is that one adapter fails to simultaneously fit two source domains during early training; as a result, it estimates a biased cross-domain discrepancy, which leads to fragile optimization. Thus, we design two pairs of classifiers $(F_1, F_1')$ and $(F_2, F_2')$ to classify source samples and handle two discrepancy measurements $(\mathcal{D}_1, \mathcal{D}_2)$ for the two source domains, respectively.

To improve model generalization to unseen and infinite target domains, a novel alternating direction strategy is proposed to progressively reduce the discrepancies among multiple domains, as shown in Figure 3. The strategy is composed of two alternating steps. Specifically, in the Pull Step, we compute the discrepancies between a probe target domain $T_i$ and the two source domains $S_1$, $S_2$, and reduce their sum to pull all target domains close to the trajectory formed by the two source domains. In the Shrinkage Step, the distance of the trajectory is shortened by reducing the discrepancy between the two source domains. Alternating the two steps enables more stable and robust optimization.

The final objective of CDA is to improve the model's generalization over the whole data distribution. Nevertheless, the above strategy only reduces the discrepancies between the source and the seen probe target domains, without considering the unseen target ones. To overcome this, a continuity constraint is proposed to regularize the continuous geometry of the domain attributes via a gradient penalty in the Pull Step. It is useful because it implicitly ensures that pulling probe target domains toward the source domains also tends to pull the unseen target ones.

As there are infinite target domains and our method has to pull finite target domains to the source ones in a stochastic manner, it is challenging to maintain the global semantic features of the source domain using mini-batch methods. Therefore, a novel implementation is adopted to maintain the global view of the two source domains using queues $Q_1$ and $Q_2$, preserving and updating complete and up-to-date source domain information. The queue has the following advantages over the previous mini-batch form: 1) the queue size is flexible and can be much larger than a mini-batch; 2) the queue always maintains the newest source domain features.

4.3. Alternating Direction Strategy for Multi-Domain Discrepancy Reduction

Multi-Domain Discrepancy. The hypothesis-induced discrepancies (Ben-David et al., 2010; Zhang et al., 2019) require taking a supremum over the hypothesis space to measure the discrepancy between two domains. In this work, we extend them to multiple domains. Specifically, we propose two pairs of classifiers: content classifiers $F_1$ and $F_2$ classify content labels in each source domain, while auxiliary classifiers $F_1'$ and $F_2'$ estimate the discrepancies (together with the content classifiers) between each source domain and the target domains for the aforementioned alternating training strategy. The discrepancy between domains $P$ and $Q$ is formed using the supremum of the prediction differences between a pair of classifiers $F_j$, $F'_j$ ($j = 1, 2$) as follows:

(1)  $\mathcal{D}_j(e_p, e_q) = \sup_{F'_j} \big[ -\mathbb{E}_{e_p \sim P} \log\!\big[\sigma_{h_{F_j}(e_p)} \circ F'_j(e_p)\big] - \mathbb{E}_{e_q \sim Q} \log\!\big[1 - \sigma_{h_{F_j}(e_q)} \circ F'_j(e_q)\big] \big],$

where \circ denotes function composition, hFh_{F} is a labeling function: hF(e)=argmaxyp(y|x)h_{F}(e)=\arg\max_{y}p(y|x), where pp is the softmax probabilities predicted by FF, and σ\sigma is the softmax function, σj(z)=exp(zj)/iexp(zi)\sigma_{j}(z)=\exp(z_{j})/\sum_{i}\exp(z_{i}).

Advantages of using two adapters and discrepancies. Due to using two source domains, we adopt $F_i$ and $F_i'$, $i \in \{1, 2\}$, to estimate two separate discrepancies between a given target domain and each ($i$-th) source domain. It is hard for one classifier to fit two source domains at the beginning of training because they lie at different trajectory positions. Since our discrepancy measurement is based on classifiers, it is more accurate to estimate the two discrepancies separately.

Pull Step. Feeding samples from $S_1$, $S_2$, and $T_i$ to the network, we train the encoder $E$ to pull the target domains onto the trajectory path formed by the two source domains. This is achieved by reducing the sum of $\mathcal{D}_1(e_{s_1}, e_{t_i})$ and $\mathcal{D}_2(e_{s_2}, e_{t_i})$, which reaches its minimum when $T_i$ lies on the trajectory. The networks are trained with the following objective, $\mathcal{L}_p$:

(2)  $\min_E \sum_{i=1}^{N_T} \mathcal{D}_1(e_{s_1}, e_{t_i}) + \mathcal{D}_2(e_{s_2}, e_{t_i}).$

Shrinkage Step. After pulling the target domains to the trajectory in the Pull Step, source-only samples are used to optimize the encoder $E$ and the two classifiers $F_1$, $F_2$ to shrink the trajectory distance by reducing the following discrepancy between the two source domains, denoted as $\mathcal{L}_s$.

(3)  $\min_E \mathcal{D}_1(e_{s_1}, e_{s_2}) + \mathcal{D}_2(e_{s_2}, e_{s_1}).$

As the source labels are available, the cross-entropy loss $\mathcal{E}(x, y)$ is adopted to train the networks $E$, $F_1$, $F_2$ as follows, denoted as $\mathcal{L}_{ce}$:

(4)  $\min_{E, F_1, F_2} \mathcal{E}(x_{s_1}, y_{s_1}) + \mathcal{E}(x_{s_2}, y_{s_2}).$

Remarks. One may argue that the two steps should be trained together in a unified framework. Although this is practicable, it pulls both the target and source domains together, so the geometry of the trajectory changes drastically. As a result, the continuity is not constrained effectively, which is harmful for generalization to unseen target domains. The comparison results are shown in Sec 5.3.

4.4. Theoretical Insights and Explanations

As we take a supremum to define the discrepancy in Eq. (1), we first give the optimum of $F'_j$ that attains the supremum, and the resulting form of the loss functions Eq. (2) and Eq. (3) in the P Step and S Step, respectively. Denoting $P_{s_j}$ and $P_{t_i}$ as the distributions of the two domains' features $e_{s_j}$ and $e_{t_i}$ encoded by $E$, we derive the optimal values of $\sigma_{h_F(e)} \circ F'(e)$ in the P and S Steps as follows:

In Pull Step:

(5)  $\sigma_{h_{F_j}(e)} \circ F'_j(e) = \dfrac{P_{s_j}}{P_{s_j} + P_{t_i}}, \quad j = 1, 2.$

In Shrinkage Step:

(6)  $\sigma_{h_F(e)} \circ F'_1(e) = \dfrac{P_{s_1}}{P_{s_1} + P_{s_2}}, \quad \sigma_{h_F(e)} \circ F'_2(e) = \dfrac{P_{s_2}}{P_{s_2} + P_{s_1}}.$

Now we analyze the convergence of the proposed alternating learning algorithm, in which the minimization of Eq. (2) and Eq. (3) aligns the source and target domains. Substituting Eq. (5) and Eq. (6) into Eq. (2) and Eq. (3) respectively, the objectives become:

(7)  $\min_E \begin{cases} \sum_{i=1}^{N_T} \mathrm{JS}(P_{s_1} \,\|\, P_{t_i}) + \mathrm{JS}(P_{s_2} \,\|\, P_{t_i}), & \text{P Step} \\ 2\,\mathrm{JS}(P_{s_1} \,\|\, P_{s_2}). & \text{S Step} \end{cases}$

This shows that the sum of the JS-divergences between the target domains and the two source domains is minimized in the P Step, while the JS-divergence between the two source domains is minimized in the S Step.

Global optimum. The JS-divergence between two distributions is always non-negative and equals zero only when they coincide. Thus, the solution of the P Step in Eq. (7) is $P_{s_1} = P_{t_i}$ and $P_{s_2} = P_{t_i}$, while that of the S Step is $P_{s_1} = P_{s_2}$. The two steps run alternately and reach the global minimum when $P_{s_1} = P_{s_2} = P_{t_i}$, i.e., the domains are aligned perfectly and the trajectory collapses to a point, as shown in the right part of Figure 3.
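For completeness, the standard GAN-style identity behind the reduction from Eq. (1) to Eq. (7) can be sketched as follows (Eq. (7) drops the additive constant $-\log 4$): plugging the optimum of Eq. (5), $\sigma_h \circ F'^{*} = P/(P+Q)$, into the two expectation terms gives

```latex
\mathbb{E}_{e \sim P}\log\frac{P(e)}{P(e)+Q(e)}
+ \mathbb{E}_{e \sim Q}\log\frac{Q(e)}{P(e)+Q(e)}
= \mathrm{KL}\!\left(P \,\middle\|\, \tfrac{P+Q}{2}\right)
+ \mathrm{KL}\!\left(Q \,\middle\|\, \tfrac{P+Q}{2}\right) - \log 4
= 2\,\mathrm{JS}(P \,\|\, Q) - \log 4 .
```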

4.5. Continuity Constraint via Gradient Penalty

The above strategy is not guaranteed to pull the unseen target domains close to the trajectory. To address this, we add a continuity constraint to the optimization. Given a probe target domain and a neighboring unseen target domain with attributes $a$ and $a + \Delta a$ respectively, their encoded features can be represented as $e_a$ and $e_{a+\Delta a}$. Although estimating the discrepancy $\mathcal{D}_i(e_{s_i}, e_{t(a+\Delta a)})$ accurately is intractable, we can find a constant $\eta$ such that:

(8)  $\mathcal{D}_i(e_{s_i}, e_{t(a+\Delta a)}) \leq \eta \Delta a + \mathcal{D}_i(e_{s_i}, e_{t(a)}).$

As the domain variations are mainly caused by the attribute variations, when $\Delta a \to 0$ we have $e_t(a + \Delta a) = e_t(a) + \beta \Delta a$, where $\beta$ is another constant. Therefore, a first-order expansion of Eq. (8) yields:

(9)  $\left\| \nabla_{e_t} \mathcal{D}_i(e_{s_i}, e_t) \right\| \leq \eta / \left\| \beta \right\|.$

Empirically, we set $\eta / \|\beta\| = 1$ and implement the constraint as a gradient penalty in the Pull Step, denoted as $\mathcal{L}_{gp}$:

(10)  $\min_E \left( \left\| \nabla_{e_t} \mathcal{D}_i(e_{s_i}, e_t) \right\| - 1 \right)^2.$

In this way, when we minimize the discrepancy between the source and probe target domains, $\mathcal{D}_i(e_{s_i}, e_t)$, the discrepancy between the source and unseen target domains, $\mathcal{D}_i(e_{s_i}, e_{t(a+\Delta a)})$, is minimized implicitly.
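Eq. (10) has the same form as the WGAN-GP penalty, so a minimal PyTorch sketch (names are hypothetical) might look as follows:

```python
import torch

def gradient_penalty(disc_value, target_features):
    """Continuity constraint of Eq. (10): push the norm of the gradient of
    the discrepancy w.r.t. the target features toward 1.

    disc_value:      scalar discrepancy (e.g., output of discrepancy() above)
                     computed from `target_features`.
    target_features: encoder outputs e_t; must be part of the autograd graph.
    """
    grads = torch.autograd.grad(outputs=disc_value, inputs=target_features,
                                create_graph=True)[0]  # keep graph so E receives gradients
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```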

4.6. Global Domain Discrepancy Measurement using Queue Implementation

The fundamental requirement of the alternating direction strategy, and of the CDA task, is to robustly estimate domain shifts in a stochastic manner. The performance of existing discrepancy computations relies heavily on the mini-batch size, especially in the CDA task, where there are infinite target domains under test and the mini-batch discrepancy estimate is biased. To decouple the discrepancy from the mini-batch size, we introduce queues to obtain a more global and complete cross-domain discrepancy. Concretely, in the Shrinkage Step, samples from the two source domains progressively replace the entries of two queues, denoted as $Q_1$ and $Q_2$. In the Pull Step, we compute the discrepancy between the target domain and the two source-domain queues.
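A minimal sketch of one such domain-specific queue is given below; the fixed size, FIFO eviction, and CPU storage are our implementation assumptions rather than details specified above:

```python
import torch

class FeatureQueue:
    """FIFO buffer of source-domain features (Q1/Q2 in Figure 2)."""

    def __init__(self, size, feat_dim):
        self.buffer = torch.zeros(size, feat_dim)
        self.ptr = 0
        self.full = False

    @torch.no_grad()
    def push(self, feats):
        """Insert a mini-batch of features, evicting the oldest entries."""
        n = feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.buffer.size(0)
        self.buffer[idx] = feats.detach().cpu()  # store statistics only, no gradients
        self.full = self.full or (self.ptr + n) >= self.buffer.size(0)
        self.ptr = int((self.ptr + n) % self.buffer.size(0))

    def get(self):
        """Return all features collected so far (the 'global view')."""
        return self.buffer if self.full else self.buffer[: self.ptr]
```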

4.7. Overall Optimization

The overall optimization is as follows:

(11)  Pull Step: $\min_E \mathcal{L}_p + \mathcal{L}_{gp}$,
(12)  Shrinkage Step: $\min_{E, F_1, F_2} \mathcal{L}_s + \mathcal{L}_{ce}$.
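Putting the pieces together, a hypothetical skeleton of the alternating schedule is sketched below. It reuses the discrepancy(), gradient_penalty(), and FeatureQueue sketches above; the optimizer settings mirror Sec. 5.2, but the loader interfaces and all structural details are our assumptions. The supremum in Eq. (1) (adversarial training of $F'_1$, $F'_2$) and the freezing of the classifiers during the Pull Step are omitted for brevity.

```python
import torch
import torch.nn as nn

def train_cda(encoder, F1, F1p, F2, F2p, q1, q2,
              src1_iter, src2_iter, tgt_iter, num_iters, device="cpu"):
    """Alternating optimization of Eqs. (11)-(12); src*_iter yield (x, y)
    pairs, tgt_iter yields unlabeled batches from the probe target domains."""
    opt_pull = torch.optim.Adam(encoder.parameters(), lr=1e-4)    # Eq. (11): min over E
    opt_shrink = torch.optim.Adam(                                 # Eq. (12): min over E, F1, F2
        list(encoder.parameters()) + list(F1.parameters()) + list(F2.parameters()), lr=1e-4)
    ce = nn.CrossEntropyLoss()
    for _ in range(num_iters):
        # ---- Shrinkage Step (Eq. 12): shrink the trajectory, refresh the queues. ----
        (x1, y1), (x2, y2) = next(src1_iter), next(src2_iter)
        e1, e2 = encoder(x1.to(device)), encoder(x2.to(device))
        q1.push(e1); q2.push(e2)  # maintain the global view of source statistics
        loss_s = (discrepancy(F1(e1), F1p(e1), F1(e2), F1p(e2))       # D_1(e_s1, e_s2)
                  + discrepancy(F2(e2), F2p(e2), F2(e1), F2p(e1)))    # D_2(e_s2, e_s1)
        loss_ce = ce(F1(e1), y1.to(device)) + ce(F2(e2), y2.to(device))
        opt_shrink.zero_grad(); (loss_s + loss_ce).backward(); opt_shrink.step()
        # ---- Pull Step (Eq. 11): pull probe targets onto the trajectory. ----
        x_t = next(tgt_iter).to(device)
        e_t = encoder(x_t)
        s1, s2 = q1.get().to(device), q2.get().to(device)  # queued source features
        loss_p = (discrepancy(F1(s1), F1p(s1), F1(e_t), F1p(e_t))     # D_1(e_s1, e_t)
                  + discrepancy(F2(s2), F2p(s2), F2(e_t), F2p(e_t)))  # D_2(e_s2, e_t)
        loss_gp = gradient_penalty(loss_p, e_t)
        opt_pull.zero_grad(); (loss_p + loss_gp).backward(); opt_pull.step()
```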

5. Experiments

We first introduce a new CONtinuous Domain Adaptation (CONDA) benchmark to investigate the CDA problem and evaluate the proposed method. We then perform a detailed analysis of the benchmark and select different splits for efficient evaluation. Finally, we compare with other methods for extensive evaluation.

Figure 4. Adopted datasets: Rotating EMNIST, SmallNorb, and Face Age Synthesis. Three different attributes are selected in these datasets: rotation angle, elevation, and age.
Figure 5. Benchmark analysis, including target domain accuracy, average accuracy, and MMD between source and target domain features.
Figure 6. We design 13 splits for CDA according to the source domain locations in the attribute space and the segments from which probe target domains are sampled. P and S denote positions of source and probe target domains, respectively.

5.1. Continuous Domain Adaptation Benchmark

Datasets: The CONDA benchmark consists of 474,530 images in total, with full annotations, sourced from three datasets: EMNIST (Cohen et al., 2017), SmallNorb (LeCun et al., 2004), and our synthesized Face Age Synthesis dataset based on the works in (Liu et al., 2015) and (Or-El et al., 2020). Figure 4 shows multiple domains based on a continuously varying attribute. Three different attributes are selected: image rotation angle, elevation camera view, and the age of the human identities.

Rotating EMNIST (Cohen et al., 2017) is an extension of MNIST that provides six different splits; we use the balanced split with 131,600 characters of 47 classes. Images are rotated by $0^\circ$ to $180^\circ$. We adopt this dataset to study continuous domain variations caused by the rotation angles of 2D images.

SmallNorb (LeCun et al., 2004) contains 48,600 images of 50 toys from 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The toys were imaged under 9 elevations (30 to 70 degrees, every 5 degrees). We adopt this dataset to study continuous domain variations caused by elevation camera views.

Face Age Synthesis. Age is a common and continuous attribute that causes variations in human appearance. With no existing dataset to support this study, we propose to use Lifespan Age Transformation Synthesis (Or-El et al., 2020) to synthesize the full age span of 0–90 with an interval of 10. 294,330 images of 1,000 identities are selected from CelebA (Liu et al., 2015) for synthesis, based on the top 1,000 identities by FID (Xu et al., 2018). We adopt this dataset to study continuous domain variations caused by the age of human identities.

Table 2. Comparison results (%).

| Dataset | Method | P1-S1 | P2-S1 | P2-S2 | P2-S3 | P2-S4 | P3-S1 | P3-S2 | P3-S3 | P3-S4 | P4-S1 | P4-S2 | P4-S3 | P4-S4 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rotating EMNIST | SO | 26.8 | 41.7 | 41.7 | 41.7 | 41.7 | 51.9 | 51.9 | 51.9 | 51.9 | 44.7 | 44.7 | 44.7 | 44.7 | 44.6 |
| | CIDA | 31.1 | 44.7 | 45.3 | 44.7 | 44.1 | 57.2 | 57.1 | 56.2 | 56.4 | 49.0 | 48.7 | 48.4 | 48.1 | 48.5 |
| | BCDM | 33.3 | 49.7 | 46.6 | 50.7 | 30.3 | 60.0 | 60.2 | 58.4 | 47.8 | 45.1 | 45.9 | 42.2 | 48.7 | 47.6 |
| | ATDOC | 30.6 | 44.5 | 43.6 | 44.5 | 42.5 | 60.9 | 59.7 | 61.6 | 58.2 | 47.9 | 48.4 | 47.4 | 50.4 | 49.2 |
| | StableNet | 26.6 | 41.5 | 41.5 | 41.5 | 41.5 | 51.9 | 51.9 | 51.9 | 51.9 | 44.3 | 44.3 | 44.3 | 44.3 | 44.4 |
| | MCC | 29.2 | 34.8 | 34.7 | 35.4 | 36.2 | 40.2 | 39.9 | 40.9 | 43.6 | 39.9 | 38.9 | 39.0 | 40.6 | 37.9 |
| | Ours | 34.2 | 51.1 | 50.6 | 52.1 | 45.6 | 61.7 | 60.8 | 59.4 | 55.0 | 52.3 | 51.5 | 48.3 | 51.6 | 51.8 |
| SmallNorb | SO | 85.1 | 88.8 | 88.8 | 88.8 | 88.8 | 86.8 | 86.8 | 86.8 | 86.8 | 86.5 | 86.5 | 86.5 | 86.5 | 87.1 |
| | CIDA | 82.4 | 85.5 | 85.9 | 86.8 | 85.2 | 85.5 | 85.1 | 85.7 | 85.0 | 84.8 | 85.2 | 84.1 | 84.0 | 85.0 |
| | BCDM | 87.4 | 78.6 | 83.0 | 80.5 | 83.5 | 77.1 | 80.8 | 79.2 | 74.2 | 77.4 | 82.2 | 78.0 | 74.4 | 79.7 |
| | ATDOC | 80.0 | 84.4 | 87.3 | 88.1 | 87.9 | 83.7 | 86.5 | 86.7 | 86.3 | 82.6 | 86.5 | 86.7 | 85.6 | 85.6 |
| | StableNet | 80.1 | 87.0 | 87.0 | 87.0 | 87.0 | 85.7 | 85.7 | 85.7 | 85.7 | 83.5 | 83.5 | 83.5 | 83.2 | 85.0 |
| | MCC | 86.2 | 86.3 | 86.2 | 83.3 | 80.1 | 86.8 | 86.9 | 81.7 | 82.9 | 83.4 | 83.7 | 75.3 | 81.6 | 83.4 |
| | Ours | 91.5 | 90.5 | 90.3 | 89.6 | 89.8 | 87.6 | 89.5 | 87.4 | 89.7 | 87.8 | 89.4 | 87.8 | 90.1 | 89.3 |
| Face Age Synthesis | SO | 71.6 | 74.1 | 74.1 | 74.1 | 74.1 | 71.7 | 71.7 | 71.7 | 71.7 | 70.6 | 70.6 | 70.6 | 70.6 | 72.1 |
| | CIDA | 29.4 | 58.6 | 53.4 | 44.5 | 52.2 | 53.4 | 53.0 | 47.2 | 49.1 | 55.0 | 57.6 | 48.0 | 53.0 | 50.3 |
| | BCDM | 78.8 | 76.2 | 54.4 | 47.3 | 54.0 | 73.0 | 64.4 | 49.8 | 47.5 | 71.9 | 76.1 | 65.5 | 62.4 | 63.2 |
| | ATDOC | 30.8 | 49.1 | 48.9 | 27.2 | 48.5 | 24.5 | 31.3 | 25.5 | 45.4 | 31.9 | 41.5 | 34.3 | 44.6 | 37.2 |
| | StableNet | 69.6 | 78.1 | 78.1 | 78.1 | 78.1 | 76.6 | 76.6 | 76.6 | 76.6 | 74.0 | 74.0 | 74.0 | 74.0 | 75.7 |
| | MCC | 30.4 | 55.6 | 55.9 | 50.2 | 57.9 | 48.7 | 52.8 | 56.0 | 54.5 | 52.6 | 58.7 | 55.8 | 57.9 | 52.8 |
| | Ours | 79.0 | 81.8 | 79.4 | 79.1 | 78.2 | 79.9 | 78.6 | 76.0 | 76.9 | 79.6 | 79.4 | 74.4 | 77.1 | 78.4 |

Benchmark Analysis: We conduct an in-depth analysis of the cross-domain statistics of the CONDA benchmark in Figure 5. Three results are reported: 1) target domain performance of different source-only models; 2) average target domain performance of source-only models; 3) the MMD (Tolstikhin et al., 2016) between source and target domain features using pre-trained source-only models.

From Figure 5 (a)(d)(g), we make two observations. First, the performance of each source domain model (a curve in a specific color) peaks at the source attribute and successively drops as the attribute moves further away from the source one. This suggests that the domain shift grows with increasing attribute differences. Second, SmallNorb and Face Age Synthesis exhibit similarly smooth performance patterns across all target domains; however, the performance curve on Rotating EMNIST has a very steep peak at the source attribute, and performance drops very quickly for other attributes/domains. This suggests that current deep network architectures are not suited to rotation variations and that the domain gaps in this dataset appear the largest. It confirms that deep architectures alone cannot tackle the domain shift in CDA.

In Figure 5 (b)(e)(h), SmallNorb and Face Age Synthesis exhibit similar average performance patterns across all target domains, where the performance first rises and then falls. In contrast, the performance curve on EMNIST is stable across attributes, which is reasonable because deep networks cannot handle rotation variations.

Finally, it is observed in Figure 5 (c)(f)(i) that the closer a point is to the diagonal, the darker its color. This phenomenon is strongly correlated with the accuracy curves. It provides the insight that, when the annotation budget is limited, the choice of samples to annotate has an impact on performance.

Splits. To perform a comprehensive evaluation, we design 13 splits that attempt to cover all combinations of choices of source, probe target, and test target domains. The source domain locations in the attribute space are denoted P1-P4, and the four settings of the segment from which the probe targets are randomly sampled are denoted S1-S4, as shown in Figure 6.

5.2. Results and Comparisons in CDA

Implementation. We perform extensive evaluations on three datasets. For Rotating EMNIST and SmallNorb, a four-layer CNN encoder $E$ and a two-layer MLP for each classifier are adopted. For Face Age Synthesis, we use a ResNet-50 pre-trained on ImageNet as the encoder. We adopt Adam with a learning rate of $1 \times 10^{-4}$. See the appendix for more details. The average classification accuracy over all target domains is used for evaluation. The results of the 13 splits on Rotating EMNIST, SmallNorb, and Face Age Synthesis are shown in Table 2.

Comparisons. In each split, the performance of our method and of six existing methods is reported. As this is the first time the CDA problem is addressed, we reproduce existing methods from different tasks. To summarize, the comparison methods comprise the source-only (SO) model, Continuously Indexed DA (CIDA (Wang et al., 2020a)), single-source DA (BCDM (Li et al., 2021b), ATDOC (Liang et al., 2021)), multi-source/target DA (MCC (Jin et al., 2020)), and domain generalization (StableNet (Zhang et al., 2021)).

The main observations are: (1) our method outperforms all comparison methods on all splits and datasets, demonstrating that it effectively handles continuous domain shifts using two source domains; (2) it is worth noting that some methods (i.e., CIDA, ATDOC) perform worse than SO.

5.3. Ablation Study

Table 3. Analysis results (%) on the SmallNorb dataset (split P2).

| Split | V1 | V2 | V3 | V4 | V5 | V6 | V7 | Ours |
|---|---|---|---|---|---|---|---|---|
| S1 | 87.2 | 87.6 | 86.6 | 86.7 | 85.1 | 86.5 | 89.1 | 90.5 |
| S2 | 87.1 | 87.5 | 88.4 | 88.5 | 85.4 | 89.7 | 89.1 | 90.3 |
| S3 | 87.4 | 86.4 | 88.0 | 86.0 | 82.5 | 88.9 | 88.3 | 89.6 |
| S4 | 88.8 | 87.1 | 87.3 | 87.0 | 87.0 | 89.0 | 88.0 | 89.8 |

In this section, we perform ablation studies on all splits to investigate the effectiveness of the three main proposed components: the discrepancy measure, the two-stage strategy, and the feature queues. Results are shown in Table 3.

We compare the following variant methods and analyze their results. V1: We adopt three binary domain discriminators to encourage domain confusion between $(S_1, S_2)$, $(S_2, T)$, and $(S_1, T)$. V2: We merge the P and S Steps into a single step and minimize the three pairs of discrepancies at the same time, without fixing $S_1$ and $S_2$ in the P Step. V3: We merge the P and S Steps into a single step without fixing $S_1$ in the P Step. V4: We merge the P and S Steps into a single step without fixing $S_2$ in the P Step. V5: We remove the S Step but keep the other components of our method. V6: We remove the queue implementation. V7: We remove the continuity constraint.

Compared with our full method, the adaptation performance of all variants decreases, and Variant 1 achieves the worst performance. Variant 1 is equivalent to reducing three JS-divergence losses; the fact that the proposed method outperforms it demonstrates that directly minimizing pair-wise discrepancies fails to generalize to unseen target domains in the continuous domain space. In addition, disjoint domains in the continuous domain space lead to unreliable training of the encoder, as discussed in (Arjovsky and Bottou, 2017). The comparison with Variants 2-5 shows that merging the steps and unfixing the source domains in the P Step causes ambiguous optimization directions, which is harmful for adaptation; for example, if we unfix the source domains, the geodesic path is dynamic and the target domains are hard to converge. The comparison with Variant 6 shows that the designed queues further improve performance, confirming that queues help reduce the error between the expected and empirical discrepancies. The comparison with Variant 7 shows that the continuity constraint is effective.

6. Analysis

Figure 7. (a) Analysis of queue size. (b) Analysis of the number of probe target domains. (c) $\mathcal{L}_s$. (d) $l_2$ distance between the two classifiers' predictions.
Figure 8. Visualization of encoder features using t-SNE (Van der Maaten and Hinton, 2008). Orange, blue, and grey points indicate source 1, source 2, and target samples from the SmallNorb test set, respectively. Red circles denote dense target-only distributions, i.e., large source-target discrepancy.

6.1. Discussions

Why are some methods worse than SO? This demonstrates that CDA is a challenging problem and that most existing DA methods cannot be applied directly. The reason is twofold: 1) adversarial-based methods only align the observed probe target domains with the source domains and do not consider the unseen target domains, so the model overfits the source and probe target domains; 2) these methods use one classifier for all domains, which makes it hard to fit all domains with large discrepancies. Meanwhile, this shows the advantage of using two classifiers.

The choice of two source domains and its effects. We set up this novel benchmark to analyze this point. If the two source domains are similar, the adaptation performance degrades: it can be seen in Table 2 that P4 achieves the worst performance.

6.2. Hyper-parameter Analysis

Queue size. We evaluate our method with different queue sizes, and the results are visualized in Figure 7(a). It can be observed that the accuracy curve rises at the beginning and then falls. The queue enlarges the number of samples used to compute the discrepancy, which promotes adaptation; however, an overly large queue retains features extracted by the model many iterations earlier, which leads to imprecise discrepancy computation.

Number of probe target domains. We evaluate the task with different numbers of probe target domains in Figure 7(b). It can be observed that the overall accuracy increases for all methods as the number of probe target domains grows. Our method is always the best-performing method across setups.

6.3. Qualitative Analysis.

We show t-SNE plots of the two source domains' features and the target features in Figure 8. The more evenly the points of the three colors are mixed, the better the domains are aligned. We highlight some areas with red dashed circles in the sub-figures of CIDA, BCDM, and ATDOC. In these areas, grey dots (target) dominate, with very few blue (source 1) and orange (source 2) dots overlapping the grey ones. This pattern does not appear in the t-SNE plot of our method, where the dots of the three colors are mixed more evenly. Our method thus achieves the best domain alignment on unseen target domains, as the source features spread further into the target zones.

6.4. Convergence Analysis

In Figure 7(c), we plot the loss $\mathcal{L}_s$ with respect to the number of training iterations on SmallNorb. The loss reflects the distance between all target domains and the trajectory; the aim here is to verify the effectiveness of our Pull Step. It can be observed that our method achieves comparable convergence and a relatively small loss, which suggests that the target domains can be pulled close to the trajectory.

In Figure 7(d), we visualize the $l_2$ distance between the two classifiers' predictions on target samples. The aim is to show the distance between the two source domains on the trajectory, i.e., to verify the effectiveness of our Shrinkage Step. First, due to the discrepancies between the two source domains and the high impact of the cross-entropy loss at the beginning, the two classifiers learn to discriminate within their individual domains; therefore, the $l_2$ distance starts to increase. Then, as our proposed discrepancy measurement and reduction losses come to dominate training, the domains are aligned and the two classifiers are pulled close (the distances become small). This convergence implies the shrinkage of the trajectory formed by all domains.

Note that we can observe regularly alternating patterns in the loss curves of Figure 7(c-d). This is because the training procedure alternates between the two steps, which are adversarial to each other. The two convergence behaviors jointly demonstrate that our method reduces the discrepancy progressively.

7. Conclusions

In this work, we investigate a novel problem, namely continuous domain adaptation (CDA). We have proposed a novel framework for the CDA problem that outperforms other baseline techniques.

Acknowledgements.
This work is supported by the Peking University Medicine Seed Fund for Interdisciplinary Research (BMU2022MX011), the Fundamental Research Funds for the Central Universities, the PKU-OPPO Innovation Fund (BO202103), and Zhejiang Lab (No. 2022NB0AB05).

References

  • Arjovsky and Bottou (2017) Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine learning 79, 1 (2010), 151–175.
  • Bobu et al. (2018) Andreea Bobu, Eric Tzeng, Judy Hoffman, and Trevor Darrell. 2018. Adapting to continuously shifting domains. (2018).
  • Chen et al. (2018) Qingchao Chen, Yang Liu, Zhaowen Wang, Ian Wassell, and Kevin Chetty. 2018. Re-Weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation. In CVPR.
  • Chen et al. (2021) Yang Chen, Yingwei Pan, Yu Wang, Ting Yao, Xinmei Tian, and Tao Mei. 2021. Transferrable Contrastive Learning for Visual Domain Adaptation. 3399–3408.
  • Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. 2017. EMNIST: Extending MNIST to handwritten letters. IJCNN (2017). https://doi.org/10.1109/ijcnn.2017.7966217
  • Gholami et al. (2020) Behnam Gholami, Pritish Sahu, Ognjen Rudovic, Konstantinos Bousmalis, and Vladimir Pavlovic. 2020. Unsupervised Multi-Target Domain Adaptation: An Information Theoretic Approach. IEEE Transactions on Image Processing 29 (2020), 3993–4002. https://doi.org/10.1109/TIP.2019.2963389
  • He et al. (2021) Jianzhong He, Xu Jia, Shuaijun Chen, and Jianzhuang Liu. 2021. Multi-source domain adaptation with collaborative learning for semantic segmentation. In CVPR. 11008–11017.
  • Jiang et al. (2020) Junguang Jiang, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Resource Efficient Domain Adaptation. 2220–2228.
  • Jin et al. (2020) Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Minimum class confusion for versatile domain adaptation. In ECCV. Springer, 464–480.
  • Lao et al. (2020) Qicheng Lao, Xiang Jiang, Mohammad Havaei, and Yoshua Bengio. 2020. Continuous Domain Adaptation with Variational Domain-Agnostic Feature Replay. arXiv:2003.04382 [cs.LG]
  • LeCun et al. (2004) Yann LeCun, Fu Jie Huang, and Leon Bottou. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, Vol. 2. II–104.
  • Li et al. (2018b) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. 2018b. Learning to generalize: Meta-learning for domain generalization. In AAAI.
  • Li et al. (2019) Shuang Li, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. 2019. Joint Adversarial Domain Adaptation. In Proceedings of the 27th ACM International Conference on Multimedia. 729–737.
  • Li et al. (2021b) Shuang Li, Fangrui Lv, Binhui Xie, Chi Harold Liu, Jian Liang, and Chen Qin. 2021b. Bi-Classifier Determinacy Maximization for Unsupervised Domain Adaptation. In AAAI.
  • Li et al. (2021a) Xinhao Li, Jingjing Li, Lei Zhu, Guoqing Wang, and Zi Huang. 2021a. Imbalanced Source-Free Domain Adaptation. 3330–3339.
  • Li et al. (2018a) Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. 2018a. Domain generalization via conditional invariant representations. In AAAI, Vol. 32.
  • Liang et al. (2021) Jian Liang, Dapeng Hu, and Jiashi Feng. 2021. Domain Adaptation With Auxiliary Target Domain-Oriented Classifier. In CVPR. 16632–16642.
  • Lin et al. (2020) Chuang Lin, Sicheng Zhao, Lei Meng, and Tat-Seng Chua. 2020. Multi-source domain adaptation for visual sentiment classification. In AAAI, Vol. 34. 2661–2668.
  • Liu et al. (2021) Chang Liu, Lichen Wang, Kai Li, and Yun Fu. 2021. Domain Generalization via Feature Variation Decorrelation. 1683–1691.
  • Liu et al. (2020a) Hong Liu, Mingsheng Long, Jianmin Wang, and Yu Wang. 2020a. Learning to Adapt to Evolving Domains. In NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. 22338–22348.
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In ICCV.
  • Liu et al. (2020b) Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X. Yu, and Boqing Gong. 2020b. Open Compound Domain Adaptation. In CVPR.
  • Mancini et al. (2019) Massimiliano Mancini, Samuel Rota Bulo, Barbara Caputo, and Elisa Ricci. 2019. Adagraph: Unifying predictive and continuous domain adaptation through graphs. In CVPR. 6568–6577.
  • Matsuura and Harada (2020) Toshihiko Matsuura and Tatsuya Harada. 2020. Domain Generalization Using a Mixture of Multiple Latent Domains. In AAAI.
  • Nguyen et al. (2021) Tuan Nguyen, Trung Le, Nhan Dam, Quan Hung Tran, Truyen Nguyen, and Dinh Phung. 2021. TIDOT: A Teacher Imitation Learning Approach for Domain Adaptation with Optimal Transport. In IJCAI. International Joint Conferences on Artificial Intelligence Organization, 2862–2868.
  • Or-El et al. (2020) Roy Or-El, Soumyadip Sengupta, Ohad Fried, Eli Shechtman, and Ira Kemelmacher-Shlizerman. 2020. Lifespan Age Transformation Synthesis. In ECCV.
  • Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. 2019. Moment matching for multi-source domain adaptation. In ICCV. 1406–1415.
  • Saito et al. (2018) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR. 3723–3732.
  • Su et al. (2020) Peng Su, Shixiang Tang, Peng Gao, Di Qiu, Ni Zhao, and Xiaogang Wang. 2020. Gradient Regularized Contrastive Learning for Continual Domain Adaptation. arXiv preprint arXiv:2007.12942 (2020).
  • Tolstikhin et al. (2016) Ilya O Tolstikhin, Bharath K Sriperumbudur, and Bernhard Schölkopf. 2016. Minimax estimation of maximum mean discrepancy with radial kernels. NeurIPS 29 (2016), 1930–1938.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Wang et al. (2020a) Hao Wang, Hao He, and Dina Katabi. 2020a. Continuously Indexed Domain Adaptation. In ICML. PMLR, 9898–9907.
  • Wang et al. (2020b) Haotian Wang, Wenjing Yang, Ji Wang, Ruxin Wang, Long Lan, and Mingyang Geng. 2020b. Pairwise Similarity Regularization for Adversarial Domain Adaptation. 2409–2418.
  • Xu et al. (2020) Hai Xu, Hongtao Xie, Zheng-Jun Zha, Sun-ao Liu, and Yongdong Zhang. 2020. March on Data Imperfections: Domain Division and Domain Generalization for Semantic Segmentation. 3044–3053.
  • Xu et al. (2018) Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo, Yu Sun, Felix Wu, and Kilian Weinberger. 2018. An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755 (2018).
  • Zhang et al. (2021) Xingxuan Zhang, Peng Cui, Renzhe Xu, Linjun Zhou, Yue He, and Zheyan Shen. 2021. Deep Stable Learning for Out-Of-Distribution Generalization. In CVPR. 5372–5382.
  • Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. 2019. Bridging theory and algorithm for domain adaptation. In ICML. PMLR, 7404–7413.
  • Zhao et al. (2020) Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. 2020. Multi-source distilling domain adaptation. In AAAI, Vol. 34. 12975–12983.
  • Zhao et al. (2021) Yuyang Zhao, Zhun Zhong, Fengxiang Yang, Zhiming Luo, Yaojin Lin, Shaozi Li, and Nicu Sebe. 2021. Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In CVPR. 6277–6286.
  • Zhong et al. (2021) Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, and Guangquan Zhang. 2021. How does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches?. In AAAI.

In the following, we provide additional material to support our submission (omitted from the main submission due to space limits). Specifically, we first provide implementation details, including the benchmark, network architecture, training procedure, and details of the comparison methods. We then analyze the datasets used in CDA in more detail. Next, more comparisons and ablation studies are provided, including additional evaluations on existing multi-domain adaptation benchmarks; these show that our method extends to other benchmarks with robust, best-performing results. Finally, we analyze, discuss, and re-emphasize the motivations, main observations, and insights of the new Continuous Domain Adaptation (CDA) task.

Appendix A Additional Implementation Details

In this section, we provide additional information on the datasets, the evaluation metrics, and the implementation details of our method and the comparison methods in the CDA task.

A.1. Datasets details

Rotating EMNIST: (Cohen et al., 2017) is an extension of MNIST to handwritten letters. It provides six different splits, and we use EMNIST (balanced), which contains 131,600 characters of 47 classes. Images are of resolution $32 \times 32$ and rotated by $0^\circ$ to $180^\circ$.

SmallNorb: (LeCun et al., 2004) contains 48,600 images of 50 toys from 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The toys were imaged under 9 elevations (30 to 70 degrees, every 5 degrees). Each image is downsampled to $48 \times 48$ pixels and cropped to $32 \times 32$ pixels.


Age Face Synthesis: Age is a common and continuous attribute that causes variations in human appearance; however, to the best of our knowledge, no existing dataset supports this study. Therefore, we propose to use Lifespan Age Transformation Synthesis (Or-El et al., 2020) to synthesize the full age span of 0–90 with an interval of 10. Specifically, 1,000 identity images are selected from CelebA (Liu et al., 2015) for synthesis, based on the top 1,000 FID scores (Xu et al., 2018). In total, the Face Age Synthesis dataset contains 294,330 images of resolution $256 \times 256$.

A.2. Experiment Split Choices

First, we introduce four configurations based on the source domain locations in the attribute space, denoted P1-P4, as shown in Figure 6 of the main paper. In P1, where the two source domains (red dots) are at the edges of the domain space, we randomly sample 50% of the domains (purple dots) in the middle as probe targets, leaving the others as unseen test target domains. For P2-P4, the two source domains divide the continuous domain into three segments, so we design four settings S1-S4 based on which segments are selected for randomly sampling the probe targets. In each selected segment, we randomly sample 50% as probe target domains.

For better reproducibility, the 13 splits for SmallNorb and Age Face Synthesis are listed in Table 4. All numbers in the table indicate domain indexes. In SmallNorb, 0-8 refer to 9 elevations (30 to 70 degrees, every 5 degrees). In Age Face Synthesis, 0-9 refer to 10 ages (0 to 90 years, every 10 years). For Rotating EMNIST, we rotate a batch of images by different angles to generate the source and probe target domains; the angles are listed in Table 5.

Table 4. Splits in SmallNorb and Face Age Synthesis. In SmallNorb, 0-8 refer to 9 elevations (30 to 70 degrees, every 5 degrees). In Face Age Synthesis, 0-9 refer to 10 ages (0 to 90 years, every 10 years).

| SmallNorb P | S1 | S2 | S3 | S4 | Face Age Synthesis P | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|---|---|---|---|
| 0,8 | 2,4,5 | – | – | – | 0,9 | 6,1,5 | – | – | – |
| 2,6 | 1,5,8 | 1,5 | 5 | 1 | 1,7 | 0,3,8 | 3,8 | 8 | 3 |
| 3,6 | 0,5,8 | 0,5 | 5 | 0 | 2,6 | 1,5,8 | 1,5 | 5 | 1 |
| 4,6 | 0,5,8 | 0,5 | 5 | 0 | 3,6 | 0,5,8 | 0,5 | 5 | 0 |

Table 5. Splits in Rotating EMNIST.

| P | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| 0, 180 | (0, 180) | – | – | – |
| 22.5, 157.5 | (0, 180) | (0, 157.5) | (22.5, 157.5) | (0, 22.5) |
| 45, 135 | (0, 180) | (0, 135) | (45, 135) | (0, 45) |
| 67.5, 112.5 | (0, 180) | (0, 112.5) | (67.5, 112.5) | (0, 67.5) |
Figure 9. Network architecture and components. (a) Encoder; (b) Classifier.
Table 6. Network architecture of the source-only model for SmallNorb and Rotating EMNIST. The output channels of FC 2 are 5 for SmallNorb and 47 for Rotating EMNIST.

| Layer name | In channels | Out channels | Kernel size | Stride |
|---|---|---|---|---|
| Conv1 | 1 | 256 | 3 | 2 |
| Conv2 | 256 | 256 | 3 | 2 |
| Conv3 | 256 | 256 | 3 | 2 |
| Conv4 | 256 | 100 | 4 | 1 |
| FC 1 | 100 | 256 | – | – |
| FC 2 | 256 | 5/47 | – | – |

A.3. Network Architecture.

The network architectures are shown in Figure 9, and the architecture used for SmallNorb and Rotating EMNIST is detailed in Table 6. For Face Age Synthesis, we use the encoder part of a ResNet-50 pretrained on ImageNet as the main encoder and only change the classifier to output 1,000 classes.
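For reference, a PyTorch rendition of Table 6 might look as follows; padding and activation functions are not specified in the table, so padding=1 on the 3×3 convolutions (giving 32→16→8→4 feature maps and a 1×1 map after Conv4) and ReLU activations are our assumptions:

```python
import torch.nn as nn

class SourceOnlyNet(nn.Module):
    """Source-only model of Table 6: a four-layer CNN encoder E and a
    two-layer MLP classifier (5 classes for SmallNorb, 47 for EMNIST)."""

    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),    # Conv1
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # Conv2
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # Conv3
            nn.Conv2d(256, 100, kernel_size=4, stride=1), nn.ReLU(),             # Conv4
            nn.Flatten())                             # 32x32 input -> 100-d feature e = E(x)
        self.classifier = nn.Sequential(
            nn.Linear(100, 256), nn.ReLU(),           # FC 1
            nn.Linear(256, num_classes))              # FC 2

    def forward(self, x):
        return self.classifier(self.encoder(x))
```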

A.4. Comparison Method Details

  • Source-Only Model: The source-only model consists of an encoder and a classifier, as shown in Figure 9 (a) and (b), respectively. For SmallNorb and EMNIST, the encoder and classifier designs are shown in Table 6. For the Face Age Synthesis dataset, the encoder follows the ResNet-50 network, but we change the final classifier layer for our application.

  • CIDA: The CDA formulation assumes that domain indexes are unavailable; however, the CIDA method requires domain indexes to work, so we assign a pseudo domain index to each domain in the experiment. Specifically, we assign 0 and 1 to source domains 1 and 2, respectively, and all probe target domains are assigned 0.5. For fair comparison, we use the same encoder as the source-only model in all experiments.

  • BCDM, ATDOC: BCDM and ATDOC adapt from a single source domain; thus, we combine the two source domains. For fair comparison, we use the same encoder and classifier as the source-only model in all experiments.