
On Learning Domain-Invariant Representations for Transfer Learning with Multiple Sources

Trung Phung1   Trung Le2   Long Vuong1   Toan Tran1   Anh Tran1   Hung Bui1   Dinh Phung1,2
1 VinAI Research, Vietnam   2 Monash University, Australia
[email protected], [email protected],
{v.longvt8, v.toantm3, v.anhtt152, v.hungbh1, v.dinhpq2}@vinai.io
Abstract

Domain adaptation (DA) benefits from rigorous theoretical works that study its insightful characteristics and various aspects, e.g., learning domain-invariant representations and its trade-off. However, this is not the case for the multiple-source DA and domain generalization (DG) settings, which are remarkably more complicated due to the involvement of multiple source domains and the potential unavailability of the target domain during training. In this paper, we develop novel upper bounds for the target general loss which motivate us to define two kinds of domain-invariant representations. We further study the pros and cons, as well as the trade-offs, of enforcing learning each kind of domain-invariant representation. Finally, we conduct experiments to inspect the trade-off of these representations, offering practical hints on how to use them in practice and exploring other interesting properties of our developed theory.

1 Introduction

Although annotated data is precious and valuable to deep learning (DL), in many real-world applications, annotating a sufficient amount of data for training qualified DL models is prohibitively labour-expensive, time-consuming, and error-prone. Transfer learning is a vital solution to the lack of labeled data. Additionally, in many situations, we are only able to collect a limited amount of annotated data from multiple source domains. Therefore, it is desirable to train a qualified DL model primarily based on multiple source domains, possibly together with a target domain. Depending on the availability of the target domain during training, we encounter either the multiple-source domain adaptation (MSDA) or the domain generalization (DG) problem.

Domain adaptation (DA) is a specific case of MSDA in which we need to transfer from a single source domain to another target domain that is available during training. For the DA setting, the pioneering work [3] and many follow-up works [45, 55, 56, 22, 19, 57, 23] study its insightful characterizations and aspects, notably what domain-invariant (DI) representations are, how to learn this kind of representation, and the trade-off of enforcing learning DI representations. Those well-grounded studies lay the foundation for developing impressive practical DA methods [14, 50, 46, 55, 7, 54, 38].

Due to the presence of multiple source domains and the possible unavailability of the target domain, establishing a theoretical foundation and characterizing DI representations for the MSDA and especially DG settings are significantly more challenging. A number of state-of-the-art works in MSDA and DG [59, 41, 20, 11, 43, 28, 29, 35, 15, 58, 52, 37, 39] have implicitly exploited DI representations in some sense to achieve impressive performances, e.g., [29, 11] minimize representation divergence during training, [43] decomposes representations to obtain invariant features, and [58] maximizes domain prediction entropy to learn invariance. Despite these successes, the notion of DI representations in MSDA and DG is not fully understood, and rigorous studies to theoretically characterize DI representations for these settings remain crucial. In this paper, we provide theoretical answers to the following questions: (1) what are DI representations in MSDA and DG, and (2) what should one expect when learning them. Overall, our contributions in this work can be summarized as:

  • In Section 2.2, we first develop two upper bounds for the general target loss in the MSDA and DG settings, whose proofs can be found in Appendix A. We then rely on these bounds to characterize and define two kinds of DI representations: general DI representations and compressed DI representations.

  • We further develop theory to inspect the pros and cons of the two kinds of DI representations in Section 2.3, aiming to shed light on how to use those representations in practice. In particular, the two types of DI representations optimize different types of divergence on the feature space, hence serving different purposes. Appendix B contains the proofs regarding these characteristics.

  • Finally, we study in Section 2.4 the trade-off of the two kinds of DI representations, which theoretically answers the question of whether and how the target performance is hurt when enforcing learning each kind of representation. We refer readers to Appendix C for its proof.

  • We conduct experiments to investigate the trade-off of the two kinds of representations, giving practical hints on how to use those representations in practice and exploring other interesting properties of our developed theory.

It is worth noting that although MSDA has been investigated in some works [19, 57], ours is the first work that rigorously formulates and provides theoretical analysis for representation learning in MSDA and DG. Specifically, our bounds developed in Theorems 1 and 3, interweaving both input and latent spaces, are novel and benefit theoretical analyses in deep learning. Our theory is developed in a general setting of multi-class classification with a sufficiently general loss, which can be viewed as a non-trivial generalization of existing works in DA [31, 3, 56, 32, 57], each of which considers binary classification with specific loss functions. Moreover, the bound developed in Theorem 8 is the first theoretical analysis of the trade-off of learning DI representations in MSDA and DG. In particular, by specializing to the setting with a single source and a single target domain, we recover the same trade-off nature discovered in [56], but again our setting is more general than the binary classification setting in that work.

2 Our Main Theory

2.1 Notations

Let $\mathcal{X}$ be a data space on which we endow a data distribution $\mathbb{P}$ with a corresponding density function $p(x)$. We consider the multi-class classification problem with the label set $\mathcal{Y}=[C]$, where $C$ is the number of classes and $[C]:=\{1,\ldots,C\}$. Denote $\mathcal{Y}_{\Delta}:=\{\alpha\in\mathbb{R}^{C}:\|\alpha\|_{1}=1\,\wedge\,\alpha\geq\mathbf{0}\}$ as the $C$-simplex label space, and let $f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ be a probabilistic labeling function returning a $C$-tuple $f(x)=[f(x,i)]_{i=1}^{C}$, whose element $f(x,i)=p(y=i\mid x)$ is the probability of assigning a data sample $x\sim\mathbb{P}$ to class $i$ (i.e., $i\in\{1,\ldots,C\}$). Moreover, a domain is denoted compactly as a pair of data distribution and labeling function $\mathbb{D}:=(\mathbb{P},f)$. We note that given a data sample $x\sim\mathbb{P}$, its categorical label $y\in\mathcal{Y}$ is sampled as $y\sim Cat(f(x))$, a categorical distribution over $f(x)\in\mathcal{Y}_{\Delta}$.
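To make this labeling convention concrete, here is a minimal Python sketch of drawing a hard label from $Cat(f(x))$; the toy labeling function below is a hypothetical placeholder, not part of the paper's method.

```python
import numpy as np

C = 3

def f(x):
    # hypothetical probabilistic labeling function: a softmax over fixed linear scores
    scores = np.array([x.sum(), -x.sum(), 0.0])
    e = np.exp(scores - scores.max())
    return e / e.sum()                 # a point on the C-simplex, [p(y = i | x)]_{i=1..C}

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # a toy sample x ~ P
y = rng.choice(C, p=f(x))              # observed label y ~ Cat(f(x))
```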

Let $l:\mathcal{Y}_{\Delta}\times\mathcal{Y}\mapsto\mathbb{R}$ be a loss function, where $l(\hat{f}(x),y)$ with $\hat{f}(x)\in\mathcal{Y}_{\Delta}$ and $y\in\mathcal{Y}$ specifies the loss (e.g., cross-entropy, Hinge, L1, or L2 loss) of assigning a data sample $x$ to the class $y$ by the hypothesis $\hat{f}$. Moreover, given a prediction probability $\hat{f}(x)$ w.r.t. the ground-truth prediction $f(x)$, we define the loss $\ell(\hat{f}(x),f(x))=\mathbb{E}_{y\sim f(x)}[l(\hat{f}(x),y)]=\sum_{y=1}^{C}l(\hat{f}(x),y)f(x,y)$. This means $\ell(\cdot,\cdot)$ is convex w.r.t. the second argument. We further define the general loss caused by using a classifier $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ to predict $\mathbb{D}\equiv(\mathbb{P},f)$ as

\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)=\mathcal{L}\left(\hat{f},\mathbb{D}\right):=\mathbb{E}_{x\sim\mathbb{P}}\left[\ell\left(\hat{f}(x),f(x)\right)\right].
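The following numpy sketch restates these two definitions with a Monte-Carlo average standing in for the expectation over $\mathbb{P}$; the constant `f_hat` and `f` at the bottom are toy placeholders used only to exercise the functions.

```python
import numpy as np

def l(f_hat_x, y):
    # instance-level loss l(f_hat(x), y); cross-entropy is one of the admissible choices listed above
    return -np.log(f_hat_x[y] + 1e-12)

def ell(f_hat_x, f_x):
    # ell(f_hat(x), f(x)) = E_{y ~ f(x)}[ l(f_hat(x), y) ] = sum_y l(f_hat(x), y) f(x, y)
    return sum(l(f_hat_x, y) * f_x[y] for y in range(len(f_x)))

def general_loss(f_hat, f, xs):
    # Monte-Carlo estimate of L(f_hat, D) = E_{x ~ P}[ ell(f_hat(x), f(x)) ] on samples xs from P
    return np.mean([ell(f_hat(x), f(x)) for x in xs])

# toy usage with constant simplex outputs standing in for f_hat and f
f_hat = lambda x: np.array([0.7, 0.2, 0.1])
f = lambda x: np.array([0.6, 0.3, 0.1])
print(general_loss(f_hat, f, xs=[None] * 4))
```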

We inspect the multiple-source setting in which we are given multiple source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ over the common data space $\mathcal{X}$, each of which consists of a data distribution and its own labeling function: $\mathbb{D}^{S,i}:=(\mathbb{P}^{S,i},f^{S,i})$. Based on these source domains, we need to work out a learner or classifier that is evaluated on a target domain $\mathbb{D}^{T}:=(\mathbb{P}^{T},f^{T})$. Depending on whether the target domain is known or unknown during training, we face either the multiple-source DA (MSDA) [32, 30, 44, 19] or the domain generalization (DG) [25, 29, 28, 35] setting.

One typical approach in MSDA and DG is to combine the source domains together [14, 29, 28, 35, 1, 11, 48] to learn DI representations with the hope of generalizing well to a target domain. When combining the source domains, we obtain a mixture of the source distributions denoted as $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$, where the mixing coefficients $\pi=[\pi_{i}]_{i=1}^{K}$ can be conveniently set to $\pi_{i}=\frac{N_{i}}{\sum_{j=1}^{K}N_{j}}$ with $N_{i}$ being the training size of the $i$-th source domain.

In terms of representation learning, the input is mapped into a latent space $\mathcal{Z}$ by a feature map $g:\mathcal{X}\mapsto\mathcal{Z}$, and a classifier $\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}$ is then trained on the representations $g(\mathcal{X})$. Let $f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ be the original labeling function. To facilitate the theory developed for the latent space, we introduce the representation distribution as the pushed-forward distribution $\mathbb{P}_{g}:=g_{\#}\mathbb{P}$, and the labeling function $h:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}$ induced by $g$ as $h(z)=\frac{\int_{g^{-1}(z)}f(x)p(x)dx}{\int_{g^{-1}(z)}p(x)dx}$ [22]. Going back to our multiple-source setting, the source mixture becomes $\mathbb{D}_{g}^{\pi}=\sum_{i}\pi_{i}\mathbb{D}_{g}^{S,i}$, where each source domain is $\mathbb{D}_{g}^{S,i}=(\mathbb{P}_{g}^{S,i},h^{S,i})$ and the target domain is $\mathbb{D}_{g}^{T}=(\mathbb{P}_{g}^{T},h^{T})$.

Finally, in our theory development, we use the Hellinger divergence between two distributions, defined as $D_{1/2}\left(\mathbb{P}^{1},\mathbb{P}^{2}\right)=2\int\left(\sqrt{p^{1}(x)}-\sqrt{p^{2}(x)}\right)^{2}dx$, whose square root $d_{1/2}=\sqrt{D_{1/2}}$ is a proper metric.
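For discrete distributions represented as probability vectors, the Hellinger quantities above reduce to the short sketch below; this is a direct transcription of the definition, not a new construction.

```python
import numpy as np

def hellinger_sq(p1, p2):
    # D_{1/2}(P1, P2) = 2 * sum_x (sqrt(p1(x)) - sqrt(p2(x)))^2   (discrete case)
    return 2.0 * np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)

def d_half(p1, p2):
    # d_{1/2} = sqrt(D_{1/2}), the proper metric used throughout the bounds
    return np.sqrt(hellinger_sq(p1, p2))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])
print(hellinger_sq(p1, p2), d_half(p1, p2))
```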

2.2 Two Types of Domain-Invariant Representations

Hinted by the theoretical bounds developed in [3], DI representation learning, in which the feature extractor $g$ maps the source and target data distributions to a common distribution on the latent space, is well-grounded for the DA setting. However, this task becomes significantly more challenging for the MSDA and DG settings due to multiple source domains and the potential unknownness of the target domain. Similar to the case of DA, it is desirable to develop theoretical bounds that directly motivate definitions of DI representations for the MSDA and DG settings, as done in the next theorem.

Theorem 1.

Consider a mixture of source domains $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$. Let $\ell$ be any loss function upper-bounded by a positive constant $L$. For any hypothesis $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ where $\hat{f}=\hat{h}\circ g$ with $g:\mathcal{X}\rightarrow\mathcal{Z}$ and $\hat{h}:\mathcal{Z}\rightarrow\mathcal{Y}_{\Delta}$, the target loss on the input space is upper-bounded as

\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)   (1)

where $\Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C}$ is the absolute pointwise label shift on the input space between the source domain $\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$, and $[K]:=\{1,2,\ldots,K\}$.

The bound in Equation 1 implies that the target loss depends on three terms: (i) the representation discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$, (ii) the label shift $\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]$, and (iii) the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$. To minimize the target loss on the left-hand side, we need to minimize the three aforementioned terms. First, the label shift term is a natural characteristic of the domains, hence almost impossible to tackle. Second, the representation discrepancy term can be explicitly tackled in the MSDA setting, while this is almost impossible in the DG setting. Finally, the general source loss term is convenient to tackle, where its minimization yields a feature extractor $g$ and a classifier $\hat{h}$.

Contrary to previous works in DA and MSDA [3, 32, 57, 7] that consider both losses and data discrepancy on the data space, our bound connects losses on the data space to discrepancy on the representation space. Therefore, our theory provides a natural way to analyze representation learning, especially feature alignment in deep learning practice. Note that although DANN [14] explains its feature alignment method using the theory developed by Ben-David et al. [3], the argument is not rigorous: applying that theory to the representation space yields a representation discrepancy term, but the loss terms then also live on the feature space, and minimizing those losses is not the learning goal. Finally, our setting is much more general, extending to the multilabel and stochastic labeling settings and to any bounded loss function.

From the upper bound, we turn to the first type of DI representations for the MSDA and DG settings. Here, the objective of $g$ is to map samples onto the representation space in such a way that the common classifier $\hat{h}$ can effectively and correctly classify them, regardless of which domain the data comes from.

Definition 2.

Consider a class $\mathcal{G}$ of feature maps and a class $\mathcal{H}$ of hypotheses. Let $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ be a set of source domains.

i) (DG with unknown target data) A feature map $g^{*}\in\mathcal{G}$ is said to be a DG general domain-invariant (DI) feature map if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$. Moreover, the latent representations $z=g^{*}(x)$ induced by $g^{*}$ are called general DI representations for the DG setting.

ii) (MSDA with known target data) A feature map $g^{*}\in\mathcal{G}$ is said to be an MSDA general DI feature map if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{T}=\mathbb{P}_{g^{*}}^{\pi}$ (i.e., $\min_{g\in\mathcal{G}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$). Moreover, the latent representations $z=g^{*}(x)$ induced by $g^{*}$ are called general DI representations for the MSDA setting.

The definition of general DI representations for the DG setting in Definition 2 is transparent in light of Theorem 1, wherein we aim to find $g$ and $\hat{h}$ that minimize the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$ due to the unknownness of $\mathbb{P}^{T}$. Meanwhile, regarding the general DI representations for the MSDA setting, we aim to find $g^{*}$ satisfying $\mathbb{P}_{g^{*}}^{T}=\mathbb{P}_{g^{*}}^{\pi}$ and $\hat{h}$ minimizing the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$. Practically, to find general DI representations for MSDA, we solve

\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\left\{\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)+\lambda D\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\right\},

where $\lambda>0$ is a trade-off parameter and $D$ can be any divergence (e.g., Jensen-Shannon (JS) divergence [16], $f$-divergence [40], MMD distance [28], or WS distance [44]).
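A minimal PyTorch sketch of one step of this practical objective is given below, with an RBF-kernel MMD standing in for $D$ (any of the divergences listed above would do); `g`, `h_hat`, the optimizer, and the batches are assumed to be supplied by the user, and this is an illustration of the objective rather than the paper's released implementation.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    # biased estimate of squared MMD with an RBF kernel, used here as the divergence D
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def msda_step(g, h_hat, opt, source_batches, target_x, pi, lam=1.0):
    # source_batches: list of (x_i, y_i) tensors, one per source domain; pi: mixing weights
    opt.zero_grad()
    src_loss, src_feats = 0.0, []
    for (x, y), w in zip(source_batches, pi):
        z = g(x)
        src_feats.append(z)
        src_loss = src_loss + w * torch.nn.functional.cross_entropy(h_hat(z), y)
    z_pi = torch.cat(src_feats)                  # features of the pooled source batch (proxy for P_g^pi)
    z_t = g(target_x)                            # features of the unlabeled target batch (P_g^T)
    loss = src_loss + lam * rbf_mmd2(z_pi, z_t)  # weighted source loss + lambda * D(P_g^T, P_g^pi)
    loss.backward()
    opt.step()
    return loss.item()
```

Pooling the per-domain feature batches approximates the source mixture when batch sizes follow the mixing coefficients $\pi$.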

In addition, general DI representations have been exploited in some works [60, 27] from the practical perspective and previously discussed in [2] from the theoretical perspective. Despite its similarity to our definition, the general DI representation in [2] is not motivated by minimizing a bound on the target loss.

As pointed out in the next section, learning general DI representations increases the span of the latent representations $g(x)$ on the latent space, which might help to reduce $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ for a general target domain. On the other hand, many other works [28, 29, 15, 52, 11] have also explored the possibility of enhancing generalization by finding a common representation among source distributions. The following theorem motivates the latter, which leads to the second kind of DI representations.

Theorem 3.

Consider a mixture of source domains $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$. Let $\ell$ be any loss function upper-bounded by a positive constant $L$. For any hypothesis $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ where $\hat{f}=\hat{h}\circ g$ with $g:\mathcal{X}\rightarrow\mathcal{Z}$ and $\hat{h}:\mathcal{Z}\rightarrow\mathcal{Y}_{\Delta}$, the target loss on the input space is upper-bounded as

\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right).   (2)

Evidently, Theorem 3 suggests another kind of representation learning, where the source representation distributions are aligned in order to lower the bound on the target loss, as concretely defined as follows.

Definition 4.

Consider a class $\mathcal{G}$ of feature maps and a class $\mathcal{H}$ of hypotheses. Let $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ be a set of source domains.

i) (DG with unknown target data) A feature map $g^{*}\in\mathcal{G}$ is a DG compressed DI feature map for the source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}$ (i.e., the pushed-forward distributions of all source domains are identical). The latent representations $z=g^{*}(x)$ are then called compressed DI representations for the DG setting.

ii) (MSDA with known target data) A feature map $g^{*}\in\mathcal{G}$ is an MSDA compressed DI feature map for the source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}=\mathbb{P}_{g^{*}}^{T}$ (i.e., the pushed-forward distributions of all source and target domains are identical). The latent representations $z=g^{*}(x)$ are then called compressed DI representations for the MSDA setting.
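The sketch below illustrates one way the compressed-DI constraint is typically relaxed in practice: a pairwise alignment penalty between the pushed-forward source distributions is added to the weighted source loss. The divergence `div` is a parameter (e.g., the `rbf_mmd2` sketch above); all names are illustrative, not the paper's implementation.

```python
import torch

def compressed_di_loss(g, h_hat, source_batches, pi, div, lam=1.0):
    # source_batches: list of (x_i, y_i) tensors, one per source domain; pi: mixing weights
    feats, src_loss = [], 0.0
    for (x, y), w in zip(source_batches, pi):
        z = g(x)
        feats.append(z)
        src_loss = src_loss + w * torch.nn.functional.cross_entropy(h_hat(z), y)
    # relaxation of P_g^{S,1} = ... = P_g^{S,K}: sum of pairwise divergences between source features
    align = sum(div(feats[i], feats[j])
                for i in range(len(feats)) for j in range(i + 1, len(feats)))
    return src_loss + lam * align
```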

2.3 Characteristics of Domain-Invariant Representations

In the previous section, two kinds of DI representations were introduced, each originating from the minimization of different terms in the upper bound of the target loss. In what follows, we examine and discuss their benefits and drawbacks. We start with the novel development of a hypothesis-aware divergence for multiple distributions, which is necessary for the theory developed later.

2.3.1 Hypothesis-Aware Divergence for Multiple Distributions

Let us consider multiple distributions $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ on the same space, with density functions $q_{1},\ldots,q_{C}$. We are given a mixture of these distributions $\mathbb{Q}^{\alpha}=\sum_{i=1}^{C}\alpha_{i}\mathbb{Q}_{i}$ with mixing coefficients $\alpha\in\mathcal{Y}_{\Delta}$ and wish to measure a divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$. One possible solution is to employ a hypothesis from an infinite-capacity hypothesis class $\mathcal{H}$ to identify which distribution the data come from. In particular, given $z\sim\mathbb{Q}^{\alpha}$, the function $\hat{h}(z,i)$, $i\in[C]$, outputs the probability that $z$ was drawn from $\mathbb{Q}_{i}$. Note that we use the notation $\hat{h}$ in this particular subsection to denote a domain classifier, while everywhere else in the paper $\hat{h}$ denotes a label classifier. Let $l(\hat{h}(z),i)\in\mathbb{R}$ be the loss when classifying $z$ using $\hat{h}$ given the ground-truth label $i\in[C]$, and let $\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}(\hat{h})$ denote the corresponding $\alpha$-weighted classification loss. Our motivation is that if the distributions $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ are distant, it is easier to distinguish samples drawn from them, hence the minimum classification loss $\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}(\hat{h})$ is much lower than in the case of clutching $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ together. Therefore, we develop the following theorem to connect this loss minimization problem with an $f$-divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$. More discussion of the hypothesis-aware divergence can be found in Appendix B of this paper and in [13].

Theorem 5.

Assuming the hypothesis class $\mathcal{H}$ has infinite capacity, we define the hypothesis-aware divergence for multiple distributions as

D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\mathcal{C}_{l,\alpha},   (3)

where $\mathcal{C}_{l,\alpha}$ depends only on the form of the loss function $l$ and the value of $\alpha$. This divergence is a proper $f$-divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ in the sense that $D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)\geq 0$ for all $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ and $\alpha\in\mathcal{Y}_{\Delta}$, and $D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)=0$ if $\mathbb{Q}_{1}=\ldots=\mathbb{Q}_{C}$.
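A sample-based sketch of estimating $D^{\alpha}$ follows, assuming the log loss and using a small neural domain classifier as a stand-in for the infinite-capacity class $\mathcal{H}$; with cross-entropy, one natural choice of $\mathcal{C}_{l,\alpha}$ is the entropy of $\alpha$, in which case the estimate is the mutual information between a sample and its domain index. The architecture and optimizer are arbitrary illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn

def hypothesis_aware_divergence(samples, alpha, dim, steps=500):
    # samples[i]: an (n_i, dim) float tensor of draws from Q_i; alpha: list of mixing coefficients
    alpha = torch.tensor(alpha, dtype=torch.float)
    z = torch.cat(samples)
    d = torch.cat([torch.full((len(s),), i, dtype=torch.long) for i, s in enumerate(samples)])
    w = torch.cat([torch.full((len(s),), a.item() / len(s)) for s, a in zip(samples, alpha)])
    h_d = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, len(samples)))
    opt = torch.optim.Adam(h_d.parameters(), lr=1e-3)
    for _ in range(steps):  # approximate the min over H by fitting a small domain classifier
        opt.zero_grad()
        loss = (w * nn.functional.cross_entropy(h_d(z), d, reduction='none')).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        min_loss = (w * nn.functional.cross_entropy(h_d(z), d, reduction='none')).sum()
    c_alpha = -(alpha * alpha.log()).sum()  # entropy of alpha: a natural C_{l,alpha} for log loss
    return (c_alpha - min_loss).item()      # D^alpha estimate: -min loss + C_{l,alpha}
```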

2.3.2 General Domain-Invariant Representation

As previously defined, the general DI feature map $g^{*}$ (cf. Definition 2) is the one that minimizes the total source loss

g^{*}=\operatorname*{argmin}_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)=\operatorname*{argmin}_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right).   (4)

From the result of Theorem 5, we expect that the minimal loss $\min_{\hat{h}\in\mathcal{H}}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)$ should be inversely related to the divergence between the class-conditional distributions $\mathbb{P}_{g}^{S,i,c}$. To generalize this result to the multi-source setting, we further define $\mathbb{Q}_{g}^{S,c}:=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}$ as the mixture of the class-$c$ conditional distributions of the source domains on the latent space, where $\alpha_{c}=\sum_{j=1}^{K}\pi_{j}\gamma_{j,c}$ are the mixing coefficients and $\gamma_{i,c}=\mathbb{P}^{S,i}(y=c)$ are the label marginals. Then, the inner minimization in the objective of Eq. 4 can be viewed as training the optimal hypothesis $\hat{h}\in\mathcal{H}$ to distinguish samples from $\mathbb{Q}_{g}^{S,c}$, $c\in[C]$, for a given feature map $g$. Therefore, by linking to the multi-distribution divergence concept developed in Section 2.3.1, we achieve the following theorem.

Theorem 6.

Assume that $\mathcal{H}$ has infinite capacity. Then the following statements hold.

1. $D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}$, where $\alpha=[\alpha_{c}]_{c\in[C]}$ is defined as above.

2. Finding the general DI feature map $g^{*}$ via the OP in Eq. 4 is equivalent to solving

g^{*}=\operatorname*{argmax}_{g\in\mathcal{G}}\,D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right).   (5)

Theorem 6, especially its second claim, discloses that learning general DI representations maximally expands the coverage of the latent representations of the source domains by maximizing $D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right)$. Hence, the span of the source mixture $\mathbb{P}_{g}^{\pi}=\sum_{c=1}^{C}\alpha_{c}\mathbb{Q}_{g}^{S,c}$ is also increased. We believe this demonstrates one of the benefits of general DI representations, because it implicitly enhances the chance of matching a general unseen target domain in the DG setting by possibly reducing the source-target representation discrepancy term $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ (cf. Theorem 1). Additionally, in MSDA, where we wish to minimize $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ explicitly, expanding the source representations' coverage is also useful.

2.3.3 Compressed Domain-Invariant Representations

It is well known that the generalization gap between the true loss and its empirical estimate affects the generalization capability of a model [51, 47]. One way to close this gap is to increase the sampling density. However, as hinted by our analysis, enforcing general DI representation learning tends to maximize the cross-domain class divergence, hence implicitly increasing the diversity of latent representations from different source domains. This makes learning a classifier $\hat{h}$ on top of those source representations harder, because the samples scatter over the representation space, leading to a higher generalization gap between the empirical and general losses. In contrast, compressed DI representations make the task of learning $\hat{h}$ easier, with a lower generalization gap, by decreasing the diversity of latent representations from the different source domains via enforcing $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}$. We now develop rigorous theory to examine this observation.

Let the training set be $S=\{(z_{i},y_{i})\}_{i=1}^{N}$, consisting of independent pairs $(z_{i},y_{i})$ drawn from the source mixture, i.e., the sampling process starts by sampling a domain index $k\sim Cat(\boldsymbol{\pi})$, then sampling $z\sim\mathbb{P}_{g}^{S,k}$ (i.e., $x\sim\mathbb{P}^{S,k}$ and $z=g(x)$), and finally assigning a label $y\sim h^{S,k}(z)$ (i.e., $h^{S,k}(z)$ is a distribution over $[C]$). The empirical loss is defined as

\mathcal{L}\left(\hat{h},S\right)=\frac{1}{\sum_{k=1}^{K}N_{k}}\sum_{k=1}^{K}\sum_{i=1}^{N_{k}}\ell\left(\hat{h}\left(z_{i}\right),h^{S,k}\left(z_{i}\right)\right),   (6)

where $N_{k}$, $k\in[K]$, is the number of samples drawn from the $k$-th source domain and $N=\sum_{k=1}^{K}N_{k}$.

Here, $\mathcal{L}\left(\hat{h},S\right)$ is an unbiased estimate of the general loss $\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)=\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,k}\right)$. To quantify the quality of this estimate, we investigate an upper bound of the generalization gap $\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|$ at confidence level $1-\delta$.
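The short sketch below restates the sampling scheme and the empirical loss in Eq. (6); `sample_latent` and `soft_label` are hypothetical stand-ins for drawing $z\sim\mathbb{P}_{g}^{S,k}$ and evaluating $h^{S,k}(z)$.

```python
import numpy as np

def draw_training_set(rng, pi, sample_latent, soft_label, n):
    data = []
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)        # domain index k ~ Cat(pi)
        z = sample_latent(k)                 # z ~ P_g^{S,k}
        data.append((z, soft_label(k, z)))   # keep the soft label h^{S,k}(z) used in Eq. (6)
    return data

def empirical_loss(ell, h_hat, data):
    # (1 / N) * sum over pooled samples of ell(h_hat(z_i), h^{S,k}(z_i))
    return np.mean([ell(h_hat(z), q) for z, q in data])
```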

Theorem 7.

For any confidence level $1-\delta$ with $\delta\in[0,1]$, over the choice of $S$, the estimated loss is within an $\epsilon$-range of the true loss:

\Pr\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right)\geq 1-\delta,

where $\epsilon=\epsilon(\delta)=\left(\frac{A}{\delta}\right)^{1/2}$ is a function of $\delta$ in which $A$ is proportional to

\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in[K]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}(y|x)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\,d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}

Theorem 7 reveals one benefit of learning compressed DI representations: when enforcing compressed DI representation learning, we minimize $\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\,d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$, which tends to reduce the generalization gap $\epsilon=\epsilon(\delta)$ for a given confidence level $1-\delta$. Therefore, compressed DI representations allow us to minimize the population loss $\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)$ more efficiently via minimizing the empirical loss.

2.4 Trade-off of Learning Domain Invariant Representation

Similar to the theoretical finding in Zhao et al. [56] developed for DA, we theoretically find that compression does come with a cost for MSDA and DG. We investigate the representation trade-off, namely how compressed DI representations affect the classification loss. Specifically, we consider a data processing chain $\mathcal{X}\stackrel{g}{\longmapsto}\mathcal{Z}\stackrel{\hat{h}}{\longmapsto}\mathcal{Y}_{\Delta}$, where $\mathcal{X}$ is the common data space, $\mathcal{Z}$ is the latent space induced by a feature extractor $g$, and $\hat{h}$ is a hypothesis on top of the latent space. We define $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ as two distributions over $\mathcal{Y}$: to draw $y\sim\mathbb{P}_{\mathcal{Y}}^{\pi}$, we sample $k\sim Cat(\pi)$, $x\sim\mathbb{P}^{S,k}$, and $y\sim f^{S,k}(x)$, and $y\sim\mathbb{P}_{\mathcal{Y}}^{T}$ is drawn analogously from the target domain. Our theoretical bounds regarding the trade-off of learning DI representations involve $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)$.
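As a concrete illustration of this label-marginal construction, the sketch below estimates $\mathbb{P}_{\mathcal{Y}}^{\pi}$ by ancestral sampling and compares it to an estimated $\mathbb{P}_{\mathcal{Y}}^{T}$ with $d_{1/2}$; `sample_x` and `label_dist` are hypothetical stand-ins for the domains' samplers and labeling functions.

```python
import numpy as np

def estimate_label_marginal(rng, pi, sample_x, label_dist, C, n=10000):
    # ancestral sampling: k ~ Cat(pi), x ~ P^{S,k}, y ~ f^{S,k}(x); returns the empirical marginal over Y
    counts = np.zeros(C)
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)
        x = sample_x(k)
        y = rng.choice(C, p=label_dist(k, x))
        counts[y] += 1
    return counts / n

def d_half(p, q):
    # Hellinger distance between two label marginals, as in the bounds of Theorem 8
    return np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```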

Theorem 8.

Consider a feature extractor $g$ and a hypothesis $\hat{h}$. The Hellinger distance between the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ can be upper-bounded as:

1. $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h}\circ g,f^{S,k},\mathbb{P}^{S,k}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$

2. $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.$

Here we note that the general loss $\mathcal{L}$ is defined based on the Hellinger loss $\ell$, which is defined as

$\ell(\hat{f}(x),f(x))=D_{1/2}(\hat{f}(x),f(x))=2\sum_{i=1}^{C}\left(\sqrt{\hat{f}(x,i)}-\sqrt{f(x,i)}\right)^{2}$ (more discussion can be found in Appendix C).

Remark. Compared to the trade-off bound in the work of Zhao et al. [56], our context is more general, concerning the MSDA and DG problems with multiple source domains and multi-class probabilistic labeling functions, rather than single-source DA with a binary-class, deterministic setting. Moreover, the Hellinger distance is more universal in the sense that it does not depend on the choice of the classifier family $\mathcal{H}$ and loss function $\ell$, unlike the $\mathcal{H}$-divergence in [56].

We base on the first inequality of Theorem 8 to analyze the trade-off of learning general DI representations. The first term on the right-hand side is the source mixture's loss, which is controllable and tends to be small when enforcing learning general DI representations. With that in mind, if the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ are distant (i.e., $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)$ is high), the sum $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$ tends to be high. This leads to two possibilities. The first scenario is when the representation discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ is small, e.g., because it is minimized in the MSDA setting, or because it happens to be small by pure chance in the DG setting. In this case, the lower bound of the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ is high, possibly hurting the model's generalization ability. On the other hand, if the discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ is large for some reason, the lower bound of the target loss will be small, but its upper bound is higher, as indicated by Theorem 1.

Based on the second inequality of Theorem 8, we observe that if the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ are distant while we enforce learning compressed DI representations (i.e., both the source loss and the source-source feature discrepancy $\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$ are low), then the sum $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$ is high. For the MSDA setting, the discrepancy $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)$ is trained to get smaller, meaning that the lower bound of the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ is high, hurting the target performance. Similarly, for the DG setting, if the trained feature extractor $g$ occasionally reduces this discrepancy for some unseen target domain, it certainly increases the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$. In contrast, if for some target domains the discrepancy $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)$ is high for some reason, then by linking to the upper bound in Theorem 3, the target general loss has a high upper bound, hence is possibly high.

This trade-off between representation discrepancy and target loss suggests a sweet spot for just-right feature alignment. In that case, the target loss is most likely to be small.

3 Experiment

To investigate the trade-off of learning DI representations, we conduct domain generalization experiments on the colored MNIST dataset (CC BY-SA 3.0) [2, 24]. In particular, the task is to predict the binary label $Y$ of colored input images $X$ generated from a binary digit feature $Z_{d}$ and a color feature $Z_{c}$. We refer readers to Appendix D for more information about this dataset.

We construct source domains by setting the color-label correlation $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta^{S,i}$, where $\theta^{S,i}\sim Uni([0.6,1])$ for $i=1,\ldots,12$, while two target domains are created with $\theta^{T,i}\in\{0.05,0.7\}$ for $i=1,2$. Here we note that colored images in the target domain with $\theta^{T,2}=0.7$ are more similar to those in the source domains, while colored images in the target domain with $\theta^{T,1}=0.05$ are less similar.
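A simplified sketch of building one such domain is shown below. The exact recipe is given in Appendix D; in particular, deriving the binary label from whether the digit is at least 5 is an assumption made here for illustration, not a claim about the paper's construction.

```python
import numpy as np

def make_domain(images, digits, theta, rng):
    # images: (n, 28, 28) grayscale MNIST digits; digits: (n,) digit classes 0..9
    y = (digits >= 5).astype(int)             # assumed binary label Y from the digit feature Z_d
    agree = rng.random(len(y)) < theta        # with probability theta the color agrees with the label
    z_c = np.where(agree, y, 1 - y)           # color feature Z_c: P(Z_c=1|Y=1)=P(Z_c=0|Y=0)=theta
    colored = np.zeros((len(y), 2) + images.shape[1:], dtype=images.dtype)
    colored[np.arange(len(y)), z_c] = images  # write the digit into the chosen color channel
    return colored, y

# e.g., source domains with theta ~ Uni([0.6, 1]) and two target domains with theta in {0.05, 0.7}
```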

We wish to study the characteristics of, and the trade-off between, the two kinds of DI representations when predicting on various target domains. Specifically, we apply adversarial learning [16] similar to [14], in which a min-max game is played between a domain discriminator $\hat{h}^{d}$, which tries to distinguish the source domain given a representation, and the feature extractor (generator) $g$, which tries to fool the domain discriminator. Simultaneously, a classifier is used to predict the label based on the representation. Let $\mathcal{L}_{gen}$ and $\mathcal{L}_{disc}$ be the label classification and domain discrimination losses; the training objective becomes:

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda\max_{\hat{h}^{d}}\mathcal{L}_{disc}\right),

where the source compression strength $\lambda>0$ controls the compression extent of the learned representations. More specifically, general DI representations are obtained with $\lambda=0$, while a larger $\lambda$ leads to more compressed DI representations. Finally, our implementation is based on the DomainBed [17] repository, and all training details, as well as a further MSDA experiment and DG experiments on real datasets, are included in Appendix D. Our code can be found at: https://github.com/VinAIResearch/DIRep
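For concreteness, here is a minimal gradient-reversal sketch, in the spirit of DANN [14], of one training step for the objective above; `g`, `h_hat`, `h_d`, the optimizer, and the batch variables are assumed to exist, and this is an illustrative sketch rather than the released DomainBed-based code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; scales and flips the gradient by lam in the backward pass
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def dann_step(g, h_hat, h_d, opt, x, y, domain_idx, lam):
    # x, y: pooled source batch; domain_idx: source-domain index of each sample
    opt.zero_grad()
    z = g(x)
    loss_gen = nn.functional.cross_entropy(h_hat(z), y)                                  # label loss
    loss_disc = nn.functional.cross_entropy(h_d(GradReverse.apply(z, lam)), domain_idx)  # domain loss
    (loss_gen + loss_disc).backward()
    opt.step()
    return loss_gen.item(), loss_disc.item()
```

Here the discriminator fits the domain labels while the reversed gradient of strength $\lambda$ pushes $g$ in the opposite direction, so $g$ is trained to fool $\hat{h}^{d}$, matching the min-max game described above.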

3.1 Trade-off of Two Kinds of Domain-Invariant Representations

Figure 1: Phase transition of representations with increasing $\lambda$. (a) General DI representations. (b) Just-right compressed DI representations. (c) Overly-compressed DI representations.

In this experiment, we consider a target domain that is similar to the source domains. To control the trade-off between the two kinds of DI representations, we gradually increase the source compression strength $\lambda$, noting that $\lambda=0$ means focusing only on learning general DI representations. According to our theory and as shown in Figure 1, learning only general DI representations ($\lambda=0$) maximizes the cross-domain class divergence by encouraging the classes of the source domains to separate arbitrarily far apart, while increasing $\lambda$ enforces compressing the latent representations of the source domains together. Therefore, for an appropriate $\lambda$, the class separation from general DI representations and the source compression from compressed DI representations are balanced (i.e., just-right compressed DI representations), as in case (b) of Figure 1, while for an overly high $\lambda$, the source compression dominates and compromises the class separation (i.e., overly-compressed DI representations).

Figure 2a shows the source validation and target accuracies as $\lambda$ increases (i.e., as source compression is encouraged). It can be observed that both the source validation accuracy and the target accuracy follow the same pattern: they increase when an appropriate $\lambda$ is set for just-right compressed DI representations and degrade when overly high values of $\lambda$ are set for overly-compressed DI representations. Figure 2b shows in detail the evolution of the source validation accuracy for each specific $\lambda$. In practice, we should encourage learning the two kinds of DI representations simultaneously by finding an appropriate trade-off that balances them, yielding just-right compressed DI representations.

3.2 Generalization Capacity on Various Target Domains

Figure 2: (a) Source validation accuracy and target accuracy for the close target domain (see Section 3.2) over compression strength $\lambda$. (b) Validation accuracy over training steps for different values of $\lambda$.

Figure 3: Accuracy of models trained with different $\lambda$ on two test domains: (a) far target domain with $\theta=0.05$; (b) close target domain with $\theta=0.7$. For the far domain with $\theta=0.05$, the model with general DI representations performs better than those with more compact DI representations. For the close domain with $\theta=0.7$, the models with compression generalize significantly better than those with general DI representations.

In this experiment, we inspect the generalization capacity of models trained on the source domains with different compression strengths $\lambda$ when evaluated on different target domains (i.e., close and far ones). As shown in Figure 3, the target accuracy is lower for the far target domain than for the close one regardless of the source compression strength $\lambda$. This observation aligns with the upper bounds developed in Theorems 1 and 3, in which a larger representation discrepancy increases the upper bound on the general target loss, hence more likely hurting the target performance.

Another interesting observation is that for the far target domain, the target accuracy peaks at $\lambda=0$. This can be partly explained by the fact that general DI representations help increase the span of the source latent representations, giving more chance to match a far target domain (cf. Section 2.3.2). Specifically, the source-target discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ could be small and the upper bound in Theorem 1 lower. In contrast, for the close target domain, compressed DI representations really do improve the target performance, while over-compression degrades it, similar to the result on the source domains. This is because the target domain is naturally close to the source domains, so the source and target latent representations are already mixed up, and by encouraging compressed DI representations, we learn more elegant representations that improve the target performance.

4 Related Works

Domain adaptation (DA) has been intensively studied from both theoretical [3, 9, 6, 56, 34, 45] and empirical [46, 55, 14, 7, 54] perspectives. Notably, the pioneering work [3] and other theoretical works [45, 55, 56, 22, 31, 6] have investigated DA in various aspects, laying the foundation for the success of practical works [14, 50, 46, 55, 7, 54]. Multiple-source domain adaptation (MSDA) extends DA by gathering the training set from multiple source domains [8, 30, 21, 44, 45, 12, 57, 53, 42]. In contrast to DA with its abundant theoretical works, theoretical studies of MSDA are significantly limited. Notably, the works in [32, 33] rely on the assumptions that the same labeling function can be used for all source domains and that the target domain is an unknown mixture of the source domains. They then show that, given a set of source expert hypotheses (each with loss at most $\epsilon$ on its respective domain), there exists a distribution-weighted combination of these experts performing well on the target domain with loss at most $\epsilon$. Hoffman et al. [19] further develop this idea to a more general case where different labeling functions correspond to different source domains, and the target domain is arbitrary. Under this setting, there exists a distribution-weighted combination of source experts such that the loss on the target domain is bounded by a source loss term and a discrepancy term between the target domain and the source mixture. On the other hand, Zhao et al. [57] directly extend Ben-David et al.'s work with an improvement on the sample complexity term.

Domain generalization (DG) [17, 26, 4, 20] is the most challenging setting among the three due to the unavailability of target data. The studies in [4, 10] use kernel methods to find a feature map which minimizes the expected target loss. Moreover, those in [28, 29, 35, 15, 52, 11] learn compressed DI representations by minimizing different types of domain dissimilarity. Other notions of invariance have also been proposed; for example, [2] learns a latent space on which representation distributions do not need to be aligned but share a common optimal hypothesis. Another work [43] uncovers a domain-invariant hypothesis from a low-rank decomposition of domain-specific hypotheses. Moreover, there are other works which try to strike a balance between the two types of representation spaces [5].

5 Conclusion

In this paper, we derive theoretical bounds for the target loss in the MSDA and DG settings which characterize two types of representations: general DI representations, for learning an invariant classifier that works on all source domains, and compressed DI representations, motivated by reducing the inter-domain representation discrepancy. We further characterize the properties of these two representations, and develop a lower bound on the target loss which governs the trade-off between learning them. Finally, we conduct experiments on the Colored MNIST dataset and real datasets to illustrate our theoretical claims.

References

  • [1] Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H., and Mitliagkas, I. Generalizing to unseen domains via distribution matching, 2020.
  • [2] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2020.
  • [3] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Mach. Learn. 79, 1–2 (May 2010), 151–175.
  • [4] Blanchard, G., Lee, G., and Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems (2011), vol. 24.
  • [5] Chattopadhyay, P., Balaji, Y., and Hoffman, J. Learning to balance specificity and invariance for in and out of domain generalization. arXiv preprint arXiv:2008.12839 (2020).
  • [6] Cortes, C., and Mohri, M. Domain adaptation in regression. In Algorithmic Learning Theory - 22nd International Conference, ALT 2011, Proceedings (2011), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 308–323. Copyright: Copyright 2011 Elsevier B.V., All rights reserved.; 22nd International Conference on Algorithmic Learning Theory, ALT 2011 ; Conference date: 05-10-2011 Through 07-10-2011.
  • [7] Cortes, C., Mohri, M., and Medina, A. M. Adaptation based on generalized discrepancy. Journal of Machine Learning Research 20, 1 (2019), 1–30.
  • [8] Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research 9, 57 (2008), 1757–1774.
  • [9] David, S. B., Lu, T., Luu, T., and Pal, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (13–15 May 2010), vol. 9 of Proceedings of Machine Learning Research, PMLR, pp. 129–136.
  • [10] Deshmukh, A. A., Lei, Y., Sharma, S., Dogan, U., Cutler, J. W., and Scott, C. A generalization error bound for multi-class domain generalization, 2019.
  • [11] Dou, Q., de Castro, D. C., Kamnitsas, K., and Glocker, B. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (2019), pp. 6450–6461.
  • [12] Duan, L., Xu, D., and Chang, S.-F. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 1338–1345.
  • [13] Duchi, J. C., Khosravi, K., and Ruan, F. Multiclass classification, information, divergence, and surrogate risk, 2017.
  • [14] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M., and Lempitsky, V. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
  • [15] Ghifary, M., Bastiaan Kleijn, W., Zhang, M., and Balduzzi, D. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision (2015), pp. 2551–2559.
  • [16] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (2014), vol. 27, Curran Associates, Inc.
  • [17] Gulrajani, I., and Lopez-Paz, D. In search of lost domain generalization. CoRR abs/2007.01434 (2020).
  • [18] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.
  • [19] Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
  • [20] Huang, Z., Wang, H., Xing, E. P., and Huang, D. Self-challenging improves cross-domain generalization. Proceedings of the European Conference on Computer Vision (ECCV) (2020).
  • [21] Redko, I., Habrard, A., and Sebban, M. On the analysis of adaptability in multi-source domain adaptation. Mach. Learn. 108 (June 2019), 1635–1652.
  • [22] Johansson, F. D., Sontag, D., and Ranganath, R. Support and invertibility in domain-invariant representations. In Proceedings of Machine Learning Research (2019), vol. 89, pp. 527–536.
  • [23] Le, T., Nguyen, T., Ho, N., Bui, H., and Phung, D. Lamda: Label matching deep domain adaptation. In ICML (2021).
  • [24] LeCun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
  • [25] Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. Learning to generalize: Meta-learning for domain generalization.
  • [26] Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision (2017), pp. 5542–5550.
  • [27] Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.-Z., and Hospedales, T. M. Episodic training for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019).
  • [28] Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018).
  • [29] Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., and Tao, D. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) (September 2018).
  • [30] Mansour, Y., Mohri, M., Ro, J., Theertha Suresh, A., and Wu, K. A theory of multiple-source adaptation with limited target labeled data. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (13–15 Apr 2021), A. Banerjee and K. Fukumizu, Eds., vol. 130 of Proceedings of Machine Learning Research, PMLR, pp. 2332–2340.
  • [31] Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. vol. 22.
  • [32] Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems (2009), D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., vol. 21, Curran Associates, Inc.
  • [33] Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the renyi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (Arlington, Virginia, USA, 2009), UAI ’09, AUAI Press, p. 367–374.
  • [34] Mansour, Y., and Schain, M. Robust domain adaptation. Annals of Mathematics and Artificial Intelligence 71 Issue 4 (2014), 365–380.
  • [35] Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning (Atlanta, Georgia, USA, 17–19 Jun 2013), S. Dasgupta and D. McAllester, Eds., vol. 28 of Proceedings of Machine Learning Research, PMLR, pp. 10–18.
  • [36] Muller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems (2019), vol. 32, Curran Associates, Inc.
  • [37] Nguyen, T., Le, T., Zhao, H., Tran, H. Q., Nguyen, T., and Phung, D. Most: Multi-source domain adaptation via optimal transport for student-teacher learning. In UAI (2021).
  • [38] Nguyen, T., Le, T., Zhao, H., Tran, H. Q., Nguyen, T., and Phung, D. Tidot: A teacher imitation learning approach for domain adaptation with optimal transport. In IJCAI (2021).
  • [39] Nguyen, V. A., Nguyen, T., Le, T., Tran, Q. H., and Phung, D. Stem: An approach to multi-source domain adaptation with guarantees. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 9352–9363.
  • [40] Nguyen, X., Wainwright, M. J., and Jordan, M. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems (2008), J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., vol. 20, Curran Associates, Inc.
  • [41] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), IEEE Computer Society, pp. 1406–1415.
  • [42] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019).
  • [43] Piratla, V., Netrapalli, P., and Sarawagi, S. Efficient domain generalization via common-specific low-rank decomposition. In Proceedings of the 37th International Conference on Machine Learning (13–18 Jul 2020), vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 7728–7738.
  • [44] Redko, I., Courty, N., Flamary, R., and Tuia, D. Optimal transport for multi-source domain adaptation under target shift. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (16–18 Apr 2019), K. Chaudhuri and M. Sugiyama, Eds., vol. 89 of Proceedings of Machine Learning Research, PMLR, pp. 849–858.
  • [45] Redko, I., Habrard, A., and Sebban, M. Theoretical analysis of domain adaptation with optimal transport, 2017.
  • [46] Saito, K., Kim, D., Sclaroff, S., and Saenko, K. Universal domain adaptation through self supervision. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 16282–16292.
  • [47] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [48] Sicilia, A., Zhao, X., and Hwang, S. J. Domain adversarial neural networks for domain generalization: When it works and how to improve, 2021.
  • [49] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 2818–2826.
  • [50] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).
  • [51] Vapnik, V. Statistical Learning Theory. Wiley, 1998.
  • [52] Wang, H., He, Z., Lipton, Z. C., and Xing, E. P. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations (2019).
  • [53] Xu, R., Chen, Z., Zuo, W., Yan, J., and Lin, L. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018).
  • [54] Yan, S., Song, H., Li, N., Zou, L., and Ren, L. Improve unsupervised domain adaptation with mixup training, 2020.
  • [55] Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning (09–15 Jun 2019), K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 7404–7413.
  • [56] Zhao, H., Combes, R. T. D., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning (09–15 Jun 2019), K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 7523–7532.
  • [57] Zhao, H., Zhang, S., Wu, G., Moura, J. M. F., Costeira, J. P., and Gordon, G. J. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
  • [58] Zhao, S., Gong, M., Liu, T., Fu, H., and Tao, D. Domain generalization via entropy regularization. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 16096–16107.
  • [59] Zhao, S., Wang, G., Zhang, S., Gu, Y., Li, Y., Song, Z., Xu, P., Hu, R., Chai, H., and Keutzer, K. Multi-source distilling domain adaptation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12975–12983.
  • [60] Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. Learning to generate novel domains for domain generalization. In European Conference on Computer Vision (2020), Springer, pp. 561–578.

This supplementary material provides proofs for the theoretical results stated in the main paper, as well as detailed experimental settings and further experimental results. It is organized as follows:

  • Appendix A contains the proofs for the upper bounds of target loss introduced in Section 2.2 of our main paper.

  • Appendix B contains the proofs for characteristics of two representations as mentioned in Section 2.3 of our main paper.

  • Appendix C contains the proof for trade-off theorem discussed in Section 2.4 of our main paper.

  • Finally, in Appendix D, we present the generative details of the Colored MNIST dataset and the experimental settings, together with additional results for MSDA on Colored MNIST and DG on the real-world PACS dataset.

Appendix A: Target Loss’s Upper Bounds

We begin with a crucial proposition for our theory development, which allows us to connect the loss on the feature space to the loss on the data space.

Proposition 9.

Let f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}. Let c:𝒵×[C]c:\mathcal{Z}\times\left[C\right]\mapsto\mathbb{R} be a positive function.

i) For any i[C]i\in\left[C\right], we have

h(z,i)c(h^(z),i)pg(z)𝑑z=f(x,i)c(f^(x),i)p(x)𝑑x,\int h\left(z,i\right)c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz=\int f\left(x,i\right)c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx,

ii) For any i[C]i\in\left[C\right], we have

|h(z,i)h(z,i)|c(h^(z),i)pg(z)𝑑z|f(x,i)f(x,i)|c(f^(x),i)p(x)𝑑x,\int\left|h\left(z,i\right)-h^{\prime}\left(z,i\right)\right|c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz\leq\int\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx,

where p is the density of a distribution \mathbb{P} on \mathcal{X}, p_{g} is the density of the distribution \mathbb{P}_{g}=g_{\#}\mathbb{P}, f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} and f^{\prime}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} are labeling functions, and h:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta} and h^{\prime}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta} are the labeling functions on the latent space induced from f and f^{\prime}, i.e., h(z,i)=\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx} and h^{\prime}(z,i)=\frac{\int_{g^{-1}\left(z\right)}f^{\prime}\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}.

Proof.

i) Using the definition of hh, we manipulate the integral on feature space as

𝒵h(z,i)c(h^(z),i)pg(z)𝑑z\displaystyle\int_{\mathcal{Z}}h\left(z,i\right)c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz =(1)𝒵c(h^(z),i)g1(z)f(x,i)p(x)𝑑xg1(z)p(x)𝑑xg1(z)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(1\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}\int_{g^{-1}\left(z\right)}p\left(x^{\prime}\right)dx^{\prime}dz
=𝒵c(h^(z),i)g1(z)f(x,i)p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dxdz
=(2)𝒵g1(z)c(h^(g(x)),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(2\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}\int_{g^{-1}\left(z\right)}c\left(\hat{h}\left(g\left(x\right)\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=(3)𝒵𝒳𝕀xg1(z)c(h^(z),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(3\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}\int_{\mathcal{X}}\mathbb{I}_{x\in g^{-1}\left(z\right)}c\left(\hat{h}\left(z\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=(4)𝒳𝒵𝕀z=g(x)c(h^(z),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(4\right)}}{{=}}\int_{\mathcal{X}}\int_{\mathcal{\mathcal{Z}}}\mathbb{I}_{z=g\left(x\right)}c\left(\hat{h}\left(z\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=𝒳c(h^(g(x)),i)f(x,i)p(x)𝑑x\displaystyle=\int_{\mathcal{X}}c\left(\hat{h}\left(g\left(x\right)\right),i\right)f\left(x,i\right)p\left(x\right)dx
=𝒳c(f^(x),i)f(x,i)p(x)𝑑x.\displaystyle=\int_{\mathcal{X}}c\left(\hat{f}\left(x\right),i\right)f\left(x,i\right)p\left(x\right)dx.

In \left(1\right), we use the definition of the push-forward distribution, \int_{B}p_{g}\left(z\right)dz=\int_{B}dz\int_{g^{-1}\left(z\right)}p\left(x\right)dx. In \left(2\right), c\left(\hat{h}\left(z\right),i\right) can be moved inside the inner integral because z=g\left(x\right) for any x\in g^{-1}\left(z\right). In \left(3\right), the integral over the restricted region is expanded to the whole space with the help of the indicator \mathbb{I}_{x\in g^{-1}\left(z\right)}, whose value is 1 if x\in g^{-1}\left(z\right) and 0 otherwise. In \left(4\right), Fubini's theorem is invoked to swap the integral signs.

ii) Using the same technique, we have

|h(z,i)h(z,i)|c(h^(z),i)pg(z)𝑑z\displaystyle\int\left|h\left(z,i\right)-h^{\prime}\left(z,i\right)\right|c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz
=𝒵c(h^(z),i)|g1(z)f(x,i)p(x)𝑑xg1(z)f(x,i)p(x)𝑑x|g1(z)p(x)𝑑xg1(z)p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\frac{\left|\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx-\int_{g^{-1}\left(z\right)}f^{\prime}\left(x,i\right)p\left(x\right)dx\right|}{\int_{g^{-1}\left(z\right)}p\left(x^{\prime}\right)dx^{\prime}}\int_{g^{-1}\left(z\right)}p\left(x\right)dxdz
=𝒵c(h^(z),i)|g1(z)(f(x,i)f(x,i))p(x)𝑑x|𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\left|\int_{g^{-1}\left(z\right)}\left(f\left(x,i\right)-f^{\prime}\left(x,i\right)\right)p\left(x\right)dx\right|dz
𝒵c(h^(z),i)g1(z)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle\leq\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\int_{g^{-1}\left(z\right)}\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒵𝒳𝕀xg1(z)c(h^(z),i)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}\int_{\mathcal{X}}\mathbb{I}_{x\in g^{-1}\left(z\right)}c\left(\hat{h}\left(z\right),i\right)\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒳𝒵𝕀z=g(x)c(h^(g(x)),i)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{\mathcal{Z}}}\mathbb{I}_{z=g\left(x\right)}c\left(\hat{h}\left(g\left(x\right)\right),i\right)\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒳|f(x,i)f(x,i)|c(f^(x),i)p(x)𝑑x.\displaystyle=\int_{\mathcal{X}}\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx.

Proposition 9 allows us to connect expected loss on feature space and input space as in the following corollary.

Corollary 10.

Consider a domain \mathbb{D}=\left(\mathbb{P},f\right) with data distribution \mathbb{P} and ground-truth labeling function f. A hypothesis is \hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}, where \hat{f}=\hat{h}\circ g with g:\mathcal{X}\mapsto\mathcal{Z} and \hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}. Then, the labeling function f on the input space induces a ground-truth labeling function on the feature space, h(z,i)=\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}. Let \ell:\mathcal{Y}_{\Delta}\times\mathcal{Y}_{\Delta}\mapsto\mathbb{R} be the loss function; the expected loss can then be computed either w.r.t. the input space, \mathcal{L}\left(\hat{f},f,\mathbb{P}\right)=\int\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)p\left(x\right)dx, or w.r.t. the feature space, \mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right)=\int\ell\left(\hat{h}\left(z\right),h\left(z\right)\right)p_{g}\left(z\right)dz. Assume the loss \ell\left(\cdot,\cdot\right) has the form mentioned in the main paper, that is,

(u,v)=i=1Cl(u,i)vi,\ell\left(u,v\right)=\sum_{i=1}^{C}l\left(u,i\right)v_{i},

for any two simplexes u,v𝒴Δu,v\in\mathcal{Y}_{\Delta}, where u=[ui]i=1Cu=\left[u_{i}\right]_{i=1}^{C} and v=[vi]i=1Cv=\left[v_{i}\right]_{i=1}^{C}. Then the losses w.r.t. input space and feature space are the same, i.e.,

(h^,h,g#)=(f^,f,).\mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right)=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right).
Proof.

We derive as

(h^,h,g#)\displaystyle\mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right) =𝒵(h^(z),h(z))p(z)𝑑z=i=1C𝒵l(h^(z),i)h(z,i)p(z)𝑑z\displaystyle=\int_{\mathcal{Z}}\ell\left(\hat{h}\left(z\right),h\left(z\right)\right)p\left(z\right)dz=\sum_{i=1}^{C}\int_{\mathcal{Z}}l\left(\hat{h}\left(z\right),i\right)h\left(z,i\right)p\left(z\right)dz
=(1)i=1C𝒳l(f^(x),i)f(x,i)p(x)𝑑x=𝒳(f^(x),f(x))p(x)𝑑x\displaystyle\stackrel{{\scriptstyle(1)}}{{=}}\sum_{i=1}^{C}\int_{\mathcal{X}}l\left(\hat{f}\left(x\right),i\right)f\left(x,i\right)p\left(x\right)dx=\int_{\mathcal{X}}\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)p\left(x\right)dx
=(f^,f,).\displaystyle=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right).

Here we have =(1)\stackrel{{\scriptstyle(1)}}{{=}} by using (i) in Proposition 9 with c(β,i)=l(β,i)c\left(\beta,i\right)=l\left(\beta,i\right) for any β𝒴Δ\beta\in\mathcal{Y}_{\Delta} and i[C]i\in\left[C\right]. ∎
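To make the equality of Corollary 10 concrete, the following minimal numerical sketch (an illustration we add here, not part of the formal development) checks it on a finite input space with a deterministic feature map and the per-class loss l(u,i) = -log u_i; all variable names are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, m, C = 12, 4, 3
g = rng.integers(0, m, size=n)                  # deterministic feature map g: X -> Z
p = rng.dirichlet(np.ones(n))                   # data distribution P on a finite X
f = rng.dirichlet(np.ones(C), size=n)           # ground-truth labeling function f(x)
hhat = rng.dirichlet(np.ones(C), size=m)        # hypothesis hhat on the latent space
fhat = hhat[g]                                  # fhat = hhat o g
l = lambda u: -np.log(u)                        # per-class loss l(u, i) = -log u_i

# induced quantities on the feature space
pg = np.array([p[g == z].sum() for z in range(m)])
h = np.array([(f[g == z] * p[g == z][:, None]).sum(0) / max(p[g == z].sum(), 1e-300)
              for z in range(m)])

loss_input = np.sum(p[:, None] * l(fhat) * f)   # L(fhat, f, P)
loss_feat = np.sum(pg[:, None] * l(hhat) * h)   # L(hhat, h, g_# P)
print(loss_input, loss_feat)                    # the two expected losses coincide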

Typically, the target loss \mathcal{L}\left(\hat{f},f^{T},\mathbb{P}^{T}\right) is bounded by the source loss \mathcal{L}\left(\hat{f},f^{S},\mathbb{P}^{S}\right), a label shift term LS\left(f^{T},f^{S}\right), and a data shift term DS\left(\mathbb{P}^{T},\mathbb{P}^{S}\right) [3]. This kind of bound is developed using the data distribution \mathbb{P} on the input space and the labeling function f from the input to the label space, which is not convenient for understanding representation learning, since \mathbb{P}^{T} and \mathbb{P}^{S} are given by the nature of the data and therefore fixed. In order to relate the target loss to properties of the learned representations, another bound is preferable, in which \mathcal{L}\left(\hat{h},h,\mathbb{P}_{g}\right) is the loss w.r.t. the feature space and DS\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S}\right) is the data shift on the feature space. However, this naive approach presents a pitfall, since the loss \mathcal{L}\left(\hat{h},h,\mathbb{P}_{g}\right) is not identical to the loss w.r.t. the input space \mathcal{L}\left(\hat{f},f,\mathbb{P}\right), which is of ultimate interest, e.g., the to-be-bounded target loss \mathcal{L}\left(\hat{f},f^{T},\mathbb{P}^{T}\right) or the to-be-minimized source loss \mathcal{L}\left(\hat{f},f^{S},\mathbb{P}^{S}\right). Using the previous proposition and corollary, we can bridge this gap and develop a target bound connecting both the data space and the feature space.

Theorem 11.

(Theorem 1 in the main paper) Consider a mixture of source domains 𝔻π=i=1Kπi𝔻S,i\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i} and the target domain 𝔻T\mathbb{D}^{T}. Let \ell be any loss function upper-bounded by a positive constant LL. For any hypothesis f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}, the target loss on input space is upper bounded

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ),\begin{aligned} \mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\end{aligned}, (7)

where \Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C} is the absolute pointwise label shift on the input space between the source domain \mathbb{D}^{S,i} and the target domain \mathbb{D}^{T}, [K]:=\left\{1,2,...,K\right\}, and \mathbb{P}_{g}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{g}^{S,i} is the feature distribution of the source mixture.

Proof.

First, consider the hybrid domain \mathbb{D}_{g}^{hy}=\left(\mathbb{P}_{g}^{\pi},h^{T}\right), where \mathbb{P}_{g}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{g}^{S,i} is the feature distribution of the source mixture and h^{T} is the induced ground-truth labeling function of the target domain. The loss on the hybrid domain is then upper-bounded by the loss on the source mixture and a label shift term. We derive as follows:

\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)=\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)p_{g}^{\pi}\left(z\right)dz
=\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)\sum_{i=1}^{K}\pi_{i}p_{g}^{S,i}\left(z\right)dz=\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz
\leq\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz
+\sum_{i=1}^{K}\pi_{i}\int\left|\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)-\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)\right|p_{g}^{S,i}\left(z\right)dz.

Firstly, using Corollary 10, the loss terms on feature space and input space are equal

i=1Kπi(h^(z),hS,i(z))pgS,i(z)𝑑z=i=1Kπi(f^,fS,i,S,i)\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz=\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},f^{S,i},\mathbb{P}^{S,i}\right)

Secondly, the difference term can be transformed into a label shift on the input space using Proposition 9:

\sum_{i=1}^{K}\pi_{i}\int_{\mathcal{Z}}\left|\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)-\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)\right|p_{g}^{S,i}\left(z\right)dz
\leq\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{Z}}\ell\left(\hat{h}\left(z\right),j\right)\left|h^{T}\left(z,j\right)-h^{S,i}\left(z,j\right)\right|p_{g}^{S,i}\left(z\right)dz
\leq L\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{Z}}\left|h^{T}\left(z,j\right)-h^{S,i}\left(z,j\right)\right|p_{g}^{S,i}\left(z\right)dz
\stackrel{\left(1\right)}{\leq}L\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{X}}\left|f^{T}\left(x,j\right)-f^{S,i}\left(x,j\right)\right|p^{S,i}\left(x\right)dx
\leq L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}\left(y|x\right)\|_{1}\right].

Here we note that \stackrel{\left(1\right)}{\leq} results from (ii) in Proposition 9 with c\left(\beta,i\right)=1 for any \beta\in\mathcal{Y}_{\Delta} and i\in\left[C\right]. Furthermore, \Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C} is the absolute pointwise label shift on the input space between the source domain \mathbb{D}^{S,i} and the target domain \mathbb{D}^{T}.

With these two terms, we have the upper bound for hybrid domain as

(h^,𝔻ghy)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1].\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right].

Next, we relate the loss on the target \mathbb{D}_{g}^{T} to that on the hybrid domain \mathbb{D}_{g}^{hy}; the two domains differ only in their feature marginals.

\left|\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{T}\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)\right|=\left|\int\ell\left(\hat{h}(z),h^{T}(z)\right)\left(p_{g}^{T}(z)-p_{g}^{\pi}(z)\right)dz\right|
\leq\int L\left|p_{g}^{T}(z)-p_{g}^{\pi}(z)\right|dz
\leq L\int\left|\sqrt{p_{g}^{T}(z)}+\sqrt{p_{g}^{\pi}(z)}\right|\left|\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right|dz
\leq L\left[\int\left(\sqrt{p_{g}^{T}(z)}+\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}\left[\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
\leq\frac{L}{\sqrt{2}}\left[\int\left(p_{g}^{T}(z)+p_{g}^{\pi}(z)+2\sqrt{p_{g}^{T}(z)p_{g}^{\pi}(z)}\right)dz\right]^{1/2}\times\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
\leq\frac{L}{\sqrt{2}}\left[2+2\left(\int p_{g}^{T}(z)dz\int p_{g}^{\pi}(z)dz\right)^{1/2}\right]^{1/2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)
\leq L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

In the above proof, we repeatedly invoke the Cauchy–Schwarz inequality \left|\int f(z)g(z)dz\right|^{2}\leq\int\left|f(z)\right|^{2}dz\int\left|g(z)\right|^{2}dz. Moreover, for the sake of completeness, we restate the definition of the square-root Hellinger distance:

d1/2(gT,gπ)=[2(pgT(z)pgπ(z))2𝑑z]1/2.d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)=\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}.
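For reference, a small numerical sketch (added here purely as an illustration) of this distance in the discrete case, together with the L1 step used in the chain above; the helper name d12 is ours.

import numpy as np

def d12(p, q):
    # square-root Hellinger distance d_{1/2} between two discrete densities
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
# the step used in the proof: int |p - q| dz is controlled by sqrt(2) * d_{1/2}(p, q)
assert np.sum(np.abs(p - q)) <= np.sqrt(2.0) * d12(p, q) + 1e-12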

To this end, we obtain an upper bound for the target loss in terms of the loss on the source mixture, a label shift term on the input space, and a data shift term between the target domain and the source mixture on the feature space.

(h^,𝔻gT)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right). (8)

Finally, using the fact that the loss on the feature space equals the loss on the input space (Corollary 10), we have

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

That concludes our proof. ∎
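As a sanity check (not part of the proof), the bound of Theorem 11 can be verified numerically on a finite example with a single source (K = 1) and the bounded per-class loss l(u,i) = 1 - u_i, for which L = 1; the construction below is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(3)
n, m, C = 20, 5, 3
g = rng.integers(0, m, size=n)                      # shared feature map g: X -> Z
pS, pT = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))                   # source / target data distributions
fS, fT = rng.dirichlet(np.ones(C), size=n), rng.dirichlet(np.ones(C), size=n)   # labeling functions
fhat = rng.dirichlet(np.ones(C), size=m)[g]         # fhat = hhat o g

ell = lambda u, v: np.sum((1.0 - u) * v, axis=-1)   # loss with l(u, i) = 1 - u_i, so L = 1
loss = lambda p, f: np.sum(p * ell(fhat, f))
d12 = lambda p, q: np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

pgS = np.array([pS[g == z].sum() for z in range(m)])      # pushforward densities on Z
pgT = np.array([pT[g == z].sum() for z in range(m)])
label_shift = np.sum(pS * np.abs(fT - fS).sum(axis=1))    # E_{P^S}[ ||Delta p(y|x)||_1 ]

lhs = loss(pT, fT)                                        # target loss
rhs = loss(pS, fS) + label_shift + np.sqrt(2.0) * d12(pgT, pgS)
assert lhs <= rhs + 1e-12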

This bound is novel since it relates the loss on the input space to the data shift on the feature space. This allows us to further investigate how source-source compression and source-target compression affect learning. First, we prove a lemma showing that the data shift between the target domain and the source mixture, d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right), decomposes into a sum of data shifts between the target domain and the individual source domains.

Lemma 12.

Given a source mixture and a target domain, we have the following

d1/2(gT,gπ)j=1Kπjd1/2(gT,gS,j)d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)
Proof.

Firstly, we observe that

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) =[2(pgT(z)pgπ(z))2𝑑z]1/2\displaystyle=\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
=[2(pgT(z)+pgπ(z)2pgT(z)pgπ(z))𝑑z]1/2.\displaystyle=\left[2\int\left(p_{g}^{T}(z)+p_{g}^{\pi}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{\pi}(z)}\right)dz\right]^{1/2}.

Secondly, we use the Cauchy–Schwarz inequality to obtain

pgT(z)pgπ(z)\displaystyle p_{g}^{T}(z)p_{g}^{\pi}(z) =(j=1KπjpgT(z))(j=1KπjpgS,j(z))\displaystyle=\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{T}(z)\right)\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{S,j}(z)\right)
(j=1KπjpgT(z)pgS,j(z))2.\displaystyle\geq\left(\sum_{j=1}^{K}\pi_{j}\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)^{2}.

Therefore, we arrive at

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) [2(j=1KπjpgT(z)+j=1KπjpgS,j(z)2j=1KπjpgT(z)pgS,j(z))𝑑z]1/2\displaystyle\leq\left[2\int\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{T}(z)+\sum_{j=1}^{K}\pi_{j}p_{g}^{S,j}(z)-2\sum_{j=1}^{K}\pi_{j}\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
=[j=1Kπj2(pgT(z)+pgS,j(z)2pgT(z)pgS,j(z))𝑑z]1/2\displaystyle=\left[\sum_{j=1}^{K}\pi_{j}2\int\left(p_{g}^{T}(z)+p_{g}^{S,j}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
j=1K[πj2(pgT(z)+pgS,j(z)2pgT(z)pgS,j(z))𝑑z]1/2\displaystyle\leq\sum_{j=1}^{K}\left[\pi_{j}2\int\left(p_{g}^{T}(z)+p_{g}^{S,j}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
=j=1Kπjd1/2(gT,gS,j).\displaystyle=\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right).
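A quick numerical illustration of Lemma 12 on discrete feature densities (our own sketch; all names are placeholders):

import numpy as np

rng = np.random.default_rng(4)
K, m = 3, 6
pi = rng.dirichlet(np.ones(K))                  # mixture weights
PgS = rng.dirichlet(np.ones(m), size=K)         # source feature densities
pgT = rng.dirichlet(np.ones(m))                 # target feature density
pg_pi = pi @ PgS                                # mixture feature density

d12 = lambda p, q: np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
lhs = d12(pgT, pg_pi)
rhs = sum(np.sqrt(pi[j]) * d12(pgT, PgS[j]) for j in range(K))
assert lhs <= rhs + 1e-12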

Now we are ready to prove the bound which motivates the compressed DI representation.

Theorem 13.

(Theorem 3 in the main paper) Consider mixture of source domains 𝔻π=i=1Kπi𝔻S,i\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i} and target domain 𝔻T\mathbb{D}^{T}. Let \ell be any loss function upper-bounded by a positive constant LL. For any hypothesis f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}, the target loss on input space is upper bounded

(f^,𝔻T)\displaystyle\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right) i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]\displaystyle\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right] (9)
+i=1Kj=1KL2πjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)
Proof.

In the previous Theorem 11, the upper bound for target loss is

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

Using Lemma 12, we have

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) j=1Kπjd1/2(gT,gS,j)\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)

Next, we use the triangle inequality for square root Hellinger distance

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) j=1Kπjd1/2(gT,gS,j)\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)
j=1Kπj(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

Therefore, by averaging over all \left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right) pairs, i.e., over the choice of i\in[K],

d1/2(gT,gπ)=i=1K1Kd1/2(gT,gπ)i=1Kj=1KπjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)=\sum_{i=1}^{K}\frac{1}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\leq\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

We obtain the conclusion of our proof

(f^,𝔻T)\displaystyle\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right) i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]\displaystyle\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]
+i=1Kj=1KL2πjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

Appendix B: DI Representation’s Characteristics

B.1 General Domain-Invariant Representations

In the main paper, we defined general DI representations via the minimization of the source loss \min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right). We then proposed to view the optimization problem \min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right) as computing a type of divergence, i.e., a hypothesis-aware divergence. To understand the connection between the two, we first consider the classification problem where samples are drawn from a mixture z\sim\mathbb{Q}^{\alpha}=\sum_{i=1}^{C}\alpha_{i}\mathbb{Q}_{i}, with each \mathbb{Q}_{i} defined on \mathcal{Z} with density q_{i}\left(z\right), and the task is to predict which of the distributions \mathbb{Q}_{1},...,\mathbb{Q}_{C} each sample originates from, i.e., the labels are 1,\ldots,C. Here, the hypothesis class \mathcal{H} is assumed to have infinite capacity, and the objective is to minimize \min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)=\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{C}\alpha_{i}\mathcal{L}\left(\hat{h},\mathbb{Q}_{i}\right).

B.1.1 Hypothesis-Aware Divergence

Theorem 14.

(Theorem 5 in the main paper) Assuming the hypothesis class \mathcal{H} has infinite capacity, we define the hypothesis-aware divergence for multiple distributions as

Dα(1,,C)=minh^1:Cα(h^)+infβ𝒴Δ(i=1Cl(β,i)αi).D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right). (10)

This divergence is a proper divergence among 1,,C\mathbb{Q}_{1},...,\mathbb{Q}_{C} in the sense that Dα(1,,C)0D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)\geq 0 for all 1,,C\mathbb{Q}_{1},...,\mathbb{Q}_{C} and α𝒴Δ\alpha\in\mathcal{Y}_{\Delta}, and Dα(1,,C)=0D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)=0 if 1==C\mathbb{Q}_{1}=...=\mathbb{Q}_{C}.

Proof.

Data is sampled from the mixture \mathbb{Q}^{\alpha} by first sampling a domain index i\sim Cat\left(\alpha\right), then sampling data z\sim\mathbb{Q}_{i} and labeling it with i. We examine the total expected loss of any hypothesis \hat{h}\in\mathcal{H}, which is

1:Cα(h^)\displaystyle\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right) :=i=1Cαi(h^,i)\displaystyle:=\sum_{i=1}^{C}\alpha_{i}\mathcal{L}\left(\hat{h},\mathbb{Q}_{i}\right)
=i=1Cαil(h^(z),i)qi(z)𝑑z\displaystyle=\sum_{i=1}^{C}\alpha_{i}\int l\left(\hat{h}\left(z\right),i\right)q_{i}\left(z\right)dz

We would like to minimize this loss, which leads to

minh^1:Cα(h^)\displaystyle\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right) =minh^i=1Cαil(h^(z),i)qi(z)𝑑z\displaystyle=\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{C}\alpha_{i}\int l\left(\hat{h}\left(z\right),i\right)q_{i}\left(z\right)dz (11)
=(1)minh^(i=1Cαil(h^(z),i)qi(z)qα(z))qα(z)𝑑z\displaystyle\stackrel{{\scriptstyle\left(1\right)}}{{=}}\min_{\hat{h}\in\mathcal{H}}\int\left(\sum_{i=1}^{C}\alpha_{i}l\left(\hat{h}\left(z\right),i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
=(2)minh^(i=1Cαil(h^(z),i)qi(z)qα(z))qα(z)𝑑z\displaystyle\stackrel{{\scriptstyle\left(2\right)}}{{=}}\int\min_{\hat{h}\in\mathcal{H}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\hat{h}\left(z\right),i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
=minβ𝒴Δ(i=1Cαil(β,i)qi(z)qα(z))qα(z)𝑑z\displaystyle=\int\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
(3)minβ𝒴Δ(i=1Cαil(β,i)qi(z)dz)\displaystyle\stackrel{{\scriptstyle\left(3\right)}}{{\leq}}\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\int\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)q_{i}\left(z\right)dz\right)
=minβ𝒴Δ(i=1Cαil(β,i)).\displaystyle=\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\right).

For (1)\left(1\right), qα(z)=i=1Cαiqi(z)q^{\alpha}\left(z\right)=\sum_{i=1}^{C}\alpha_{i}q_{i}\left(z\right) is introduced as the density of the mixture, whereas for (2)\left(2\right), we use the fact that \mathcal{H} has infinite capacity, leading to the equality minh^f(h^(x))q(x)𝑑x=minh^f(h^(x))q(x)𝑑x\min_{\hat{h}\in\mathcal{H}}\int f\left(\hat{h}\left(x\right)\right)q\left(x\right)dx=\int\min_{\hat{h}\in\mathcal{H}}f\left(\hat{h}\left(x\right)\right)q\left(x\right)dx. Moreover, for (3)\left(3\right), the property of concave function ϕ(t)=minβ𝒴Δ(i=1Cαil(β,i)ti)\phi\left(t\right)=\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)t_{i}\right) with t𝒴Δt\in\mathcal{Y}_{\Delta} is invoked, i.e., 𝔼zQ[ϕ(t(z))]ϕ(𝔼zQ[t(z)])\mathbb{E}_{z\sim Q}\left[\phi\left(t\left(z\right)\right)\right]\leq\phi\left(\mathbb{E}_{z\sim Q}\left[t\left(z\right)\right]\right).

This suggests defining a non-negative divergence D^{\alpha} between multiple distributions \mathbb{Q}_{1},...,\mathbb{Q}_{C} as

Dα(1,,C)\displaystyle D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right) =minh^1:Cα(h^)+infβ𝒴Δ(i=1Cl(β,i)αi),\displaystyle=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right),
=ϕ([qi(z)qα(z)]i=1C)qα(z)dz+infβ𝒴Δ(i=1Cl(β,i)αi)\displaystyle=\int-\phi\left(\left[\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right]_{i=1}^{C}\right)q^{\alpha}\left(z\right)dz+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right)

which is a proper f-divergence, since -\phi\left(t\right) is a convex function and \inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right) is just a constant. Moreover, D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)\geq 0 for all \mathbb{Q}_{1},...,\mathbb{Q}_{C} and \alpha\in\mathcal{Y}_{\Delta} due to the previous inequality (11). The equality happens if there is some \beta_{0}\in\mathcal{Y}_{\Delta} such that, for all z\in\mathcal{Z},

β0=argminβ𝒴Δi=1Cαil(β,i)qi(z)qα(z).\beta_{0}=\underset{{}_{\beta\in\mathcal{Y}_{\Delta}}}{\text{argmin}\ }\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}.

This means qi(z)qα(z)=Ai,i[C]\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}=A_{i},\forall i\in\left[C\right], where AiA_{i} is a constant dependent on index ii. However, this leads to

qi(z)𝑑z\displaystyle\int q_{i}\left(z\right)dz =Aiqα(z)𝑑z\displaystyle=A_{i}\int q^{\alpha}\left(z\right)dz
1\displaystyle 1 =Ai\displaystyle=A_{i}

i.e., qi(z)=qα(z),i[C]q_{i}\left(z\right)=q^{\alpha}\left(z\right),\forall i\in\left[C\right]. In other words, the equality happens when all distributions are the same 1==C\mathbb{Q}_{1}=...=\mathbb{Q}_{C}. ∎
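To illustrate the hypothesis-aware divergence, the sketch below instantiates it with the log loss l\left(\beta,i\right)=-\log\beta_{i}, for which the inner minimization has the closed form \beta_{i}\propto\alpha_{i}q_{i}\left(z\right); in that case D^{\alpha} reduces to a generalized Jensen–Shannon divergence. This choice of loss is an illustrative assumption, not the general setting of the theorem.

import numpy as np

rng = np.random.default_rng(2)
C, m = 3, 8
alpha = np.array([0.5, 0.3, 0.2])               # mixing weights alpha
Q = rng.dirichlet(np.ones(m), size=C)           # C distributions on a finite Z (strictly positive)
q_mix = alpha @ Q                               # q^alpha(z)
post = (alpha[:, None] * Q) / q_mix             # per-z minimizer for the log loss: alpha_i q_i(z) / q^alpha(z)

H = lambda w: -np.sum(w * np.log(w), axis=0)    # Shannon entropy
min_loss = np.sum(q_mix * H(post))              # min_h sum_i alpha_i L(h, Q_i), closed form for the log loss
D_alpha = -min_loss + H(alpha)                  # hypothesis-aware divergence; the inf_beta term equals H(alpha)
assert D_alpha >= -1e-12

# the divergence vanishes when Q_1 = ... = Q_C
Q_same = np.tile(Q[0], (C, 1))
post_same = (alpha[:, None] * Q_same) / (alpha @ Q_same)
print(D_alpha, -np.sum((alpha @ Q_same) * H(post_same)) + H(alpha))   # second value is ~0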

B.1.2 General Domain-Invariant Representations

For a fixed feature map gg, the induced representation distributions of source domains are gS,i\mathbb{P}_{g}^{S,i}. We then find the optimal hypothesis h^g\hat{h}_{g}^{*} on the induced representation distributions gS,i\mathbb{P}_{g}^{S,i} by minimizing the loss

minh^i=1Kπi(h^,hS,i,gS,i)=minh^i=1Kπi(h^,𝔻gS,i).\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)=\text{min}_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right). (12)

The general domain-invariant feature map gg^{*} is defined as the one that offers the minimal optimal loss as

g=argming𝒢minh^i=1Kπi(h^,hS,i,gi)=argming𝒢minh^i=1Kπi(h^,𝔻gS,i).\begin{aligned} g^{*}=\arg\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{i}\right)=\text{argmin}_{g\in\mathcal{G}}\text{min}_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)\end{aligned}. (13)

We denote gs,i,c\mathbb{P}_{g}^{s,i,c} as the class cc conditional distribution of the source domain ii on the latent space and pgs,i,cp_{g}^{s,i,c} as its density function. The induced representation distribution gS,i\mathbb{P}_{g}^{S,i} of source domain ii is a mixture of gs,i,c\mathbb{P}_{g}^{s,i,c} as gS,i=c=1Cγi,cgs,i,c,\mathbb{P}_{g}^{S,i}=\sum_{c=1}^{C}\gamma_{i,c}\mathbb{P}_{g}^{s,i,c},where γi,c=s,i(y=c)\gamma_{i,c}=\mathbb{P}^{s,i}\left(y=c\right).

We further define gs,c:=i=1Kπiγi,cαcgs,i,c\mathbb{Q}_{g}^{s,c}:=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{s,i,c} where αc=j=1Kπjγj,c\alpha_{c}=\sum_{j=1}^{K}\pi_{j}\gamma_{j,c}. Obviously, we can interpret gs,c\mathbb{Q}_{g}^{s,c} as the mixture of the class cc conditional distributions of the source domains on the latent space. The objective function in Eq. (12) can be viewed as training the optimal hypothesis h^\hat{h}\in\mathcal{H} to distinguish the samples from gs,c,c[C]\mathbb{Q}_{g}^{s,c},c\in\left[C\right] for a given feature map gg. Therefore, by linking to the multi-divergence concept developed in Theorem 14, we achieve the following theorem.

Theorem 15.

(Theorem 6 in the main paper) Assuming that \mathcal{H} has infinite capacity, we have the following statements.

1. Dα(gs,1,,gs,C)=minh^i=1Kπi(h^,hS,i,gS,i)+constD^{\alpha}\left(\mathbb{Q}_{g}^{s,1},...,\mathbb{Q}_{g}^{s,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}, where α=[αc]c[C]\alpha=\left[\alpha_{c}\right]_{c\in\left[C\right]} is defined as above.

2. Finding the general domain-invariant feature map gg^{*} via the OP in (13) is equivalent to solving

g=argmaxg𝒢Dα(gs,1,,gs,C).g^{*}=\underset{{}_{g\in\mathcal{G}}}{\text{argmax}}\,D^{\alpha}\left(\mathbb{Q}_{g}^{s,1},...,\mathbb{Q}_{g}^{s,C}\right). (14)
Proof.

We investigate the loss on mixture

i=1Kπi(h^,hS,i,gS,i)\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right) =i=1Kπic=1Cγi,c(h^,c,gS,i,c)\displaystyle=\sum_{i=1}^{K}\pi_{i}\sum_{c=1}^{C}\gamma_{i,c}\mathcal{L}\left(\hat{h},c,\mathbb{P}_{g}^{S,i,c}\right)
=c=1Cαc(h^,c,i=1Kπiγi,cαcgS,i,c)\displaystyle=\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},c,\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}\right)
=c=1Cαc(h^,c,gS,c)\displaystyle=\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},c,\mathbb{Q}_{g}^{S,c}\right)

Therefore, the loss on the mixture is actually a loss on the joint class-conditional distributions \mathbb{Q}_{g}^{S,c}=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}. Using the result from Theorem 14, we can define a divergence between these class-conditionals:

D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},...,\mathbb{Q}_{g}^{S,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},\mathbb{Q}_{g}^{S,c}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{c=1}^{C}l\left(\beta,c\right)\alpha_{c}\right)
=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}

B.2 Compressed Domain-Invariant Representations

Theorem 16.

(Theorem 7 in the main paper) For any confidence level 1-\delta with \delta\in[0,1], over the choice of S, the estimated loss lies within an \epsilon-range of the true loss:

Pr(|(h^,S)(h^,𝔻gπ)|ϵ)1δ,\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right)\geq 1-\delta,

where ϵ=ϵ(δ)=(Aδ)1/2\epsilon=\epsilon\left(\delta\right)=\left(\frac{A}{\delta}\right)^{1/2} is a function of δ\delta for which AA is proportional to

1N(i=1Kj=1KπiK(f^,𝔻S,j)+Li=1Kπimaxk[K]𝔼S,k[Δpk,i(y|x)1]+LKi=1Kj=1K2πid1/2(gS,i,gS,j))2.\displaystyle\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}.
Proof.

Let S be a sample of N data points \left(z,y\right)\sim\mathbb{D}_{g}^{\pi} drawn from the mixture domain, i.e., i\sim Cat(\pi), z\sim\mathbb{P}_{g}^{S,i}, and the label is y\sim Cat\left(h^{S,i}\left(z\right)\right). The loss of a hypothesis \hat{h} on a sample \left(z,y\right)=\left(z,h^{S,i}(z)\right) for some domain index i is \ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right). To avoid cluttered notation, we denote this loss as \ell^{i}\left(z\right):=\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right).

Let N=\sum_{i=1}^{K}N_{i}, where each N_{i} is the number of samples drawn from domain i. The estimation of the loss on a particular domain i is

(h^,Si)=j=1Ni1Nii(zj).\mathcal{L}\left(\hat{h},S^{i}\right)=\sum_{j=1}^{N_{i}}\frac{1}{N_{i}}\ell^{i}\left(z_{j}\right).

This estimation is unbiased, i.e., \mathbb{E}_{S^{i}\sim\left(\mathbb{D}_{g}^{S,i}\right)^{N_{i}}}\left[\mathcal{L}\left(\hat{h},S^{i}\right)\right]=\mathbb{E}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]=\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right). Furthermore, the loss estimation on the source mixture is

(h,S)\displaystyle\mathcal{L}\left(h,S\right) =i=1Kj=1Ni1Ni(zj)\displaystyle=\sum_{i=1}^{K}\sum_{j=1}^{N_{i}}\frac{1}{N}\ell^{i}\left(z_{j}\right)

This estimation is also unbiased:

𝔼S(𝔻gπ)N[(h^,S)]=\displaystyle\mathbb{E}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]= 𝔼{Ni}[𝔼Si[(h^,S)]]\displaystyle\mathbb{E}_{\left\{N_{i}\right\}}\left[\mathbb{E}_{S^{i}}\left[\mathcal{L}\left(\hat{h},S\right)\right]\right]
=\displaystyle= i=1Kπi(h^,𝔻gS,i)=(h^,𝔻gπ).\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)=\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right).

Therefore, we can bound the concentration of (h^,S)\mathcal{L}\left(\hat{h},S\right) around its mean value (h^,𝔻gπ)\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right) using Chebyshev’s inequality

Pr(|(h^,S)(h^,𝔻gπ)|ϵ)\displaystyle\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right) 1VarS(𝔻gπ)N[(h^,S)]ϵ2\displaystyle\geq 1-\frac{\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]}{\epsilon^{2}}

which is equivalent to

Pr(|(h^,S)(h^,𝔻gπ)|VarS(𝔻gπ)N[(h^,S)]δ)\displaystyle\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\sqrt{\frac{\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]}{\delta}}\right) 1δ\displaystyle\geq 1-\delta

The variance of (h^,S)\mathcal{L}\left(\hat{h},S\right) is

\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]\stackrel{\left(1\right)}{=}\frac{1}{N}\text{Var}_{\left(z,y\right)\sim\mathbb{D}_{g}^{\pi}}\left[\ell\left(\hat{h}(z),y\right)\right] (15)
\stackrel{\left(2\right)}{=}\frac{1}{N}\left[\sum_{i=1}^{K}\pi_{i}\left(\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\mathbb{E}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]^{2}\right)-\left(\mathbb{E}_{\left(z,y\right)\sim\mathbb{D}_{g}^{\pi}}\left[\ell\left(\hat{h}(z),y\right)\right]\right)^{2}\right]
\leq\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\left(\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)^{2}\right)
\leq\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\frac{1}{N}\left(\sum_{i=1}^{K}\sqrt{\pi_{i}}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)\right)^{2}

Here, \stackrel{\left(1\right)}{=} holds since \mathcal{L}\left(\hat{h},S\right) is the empirical mean of N i.i.d. random variables \ell\left(\hat{h}(z),y\right) with \left(z,y\right) sampled from the same distribution \mathbb{D}_{g}^{\pi}. In \left(2\right), the variance w.r.t. a mixture distribution is related to the means and variances of the constituent distributions, i.e., \text{Var}_{\sum_{i}\pi_{i}\mathbb{P}_{i}}\left[X\right]=\sum_{i}\pi_{i}\left(\text{Var}_{\mathbb{P}_{i}}\left[X\right]+\mathbb{E}_{\mathbb{P}_{i}}\left[X\right]^{2}\right)-\mathbb{E}_{\sum_{i}\pi_{i}\mathbb{P}_{i}}\left[X\right]^{2}.

We reuse the result (8) in Theorem 11, substituting \mathbb{D}_{g}^{T}\equiv\mathbb{D}_{g}^{S,i} and \mathbb{D}_{g}^{\pi}\equiv\mathbb{D}_{g}^{S,j}, to obtain

(h^,𝔻gS,i)\displaystyle\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right) (f^,𝔻S,j)+L𝔼S,i[Δpi,j(y|x)1]+L2d1/2(gS,i,gS,j)\displaystyle\leq\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i,j}(y|x)\|_{1}\right]+L\sqrt{2}\>d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)
1Kj=1K((f^,𝔻S,j)+Lmaxk[K]𝔼S,k[Δpk,i(y|x)1]+L2d1/2(gS,i,gS,j)).\displaystyle\leq\frac{1}{K}\sum_{j=1}^{K}\left(\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+L\sqrt{2}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right).

Therefore, the right-hand side of (15) is upper-bounded by A, where A is

A\displaystyle A =1Ni=1KπiVarzgS,i[i(z)]\displaystyle=\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]
+1N(i=1Kj=1KπiK(f^,𝔻S,j)+Li=1Kπimaxk[K]𝔼S,k[Δpk,i(y|x)1]+LKi=1Kj=1K2πid1/2(gS,i,gS,j))2.\displaystyle+\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}.

This concludes our proof, where the concentration inequality is

Pr(|(h^,S)(h^,𝔻gπ)|Aδ)1δ.\begin{aligned} \text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\sqrt{\frac{A}{\delta}}\right)&\geq 1-\delta\end{aligned}.
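The Chebyshev-based concentration above can be illustrated with a small Monte Carlo sketch on a synthetic discrete mixture; all quantities below are illustrative placeholders, not the experimental setup of the paper.

import numpy as np

rng = np.random.default_rng(6)
K, m, N, delta, trials = 2, 5, 200, 0.1, 2000
pi = np.array([0.6, 0.4])                       # mixture weights
Pg = rng.dirichlet(np.ones(m), size=K)          # feature densities of the K sources
loss = rng.uniform(0.0, 1.0, size=(K, m))       # bounded per-sample losses ell^i(z)

mean = np.sum(pi[:, None] * Pg * loss)          # L(hhat, D_g^pi)
var1 = np.sum(pi[:, None] * Pg * loss ** 2) - mean ** 2   # variance of a single draw
eps = np.sqrt(var1 / (N * delta))               # Chebyshev radius at confidence 1 - delta

hits = 0
for _ in range(trials):
    i = rng.choice(K, size=N, p=pi)
    z = np.array([rng.choice(m, p=Pg[k]) for k in i])
    hits += abs(loss[i, z].mean() - mean) <= eps
print(hits / trials, ">=", 1 - delta)           # empirical coverage should exceed 1 - delta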

Appendix C: Trade-Off in Learning DI Representations

Lemma 17.

Given a labeling function f:\mathcal{X}\rightarrow\mathcal{Y}_{\Delta} and a hypothesis \hat{f}:\mathcal{X}\rightarrow\mathcal{Y}_{\Delta}, denote by \mathbb{P}_{\mathcal{Y}}^{f} and \mathbb{P}_{\mathcal{Y}}^{\hat{f}} the two label marginal distributions induced by f and \hat{f} on the data distribution \mathbb{P}. In particular, to sample y\sim\mathbb{P}_{\mathcal{Y}}^{f}, we first sample x\sim\mathbb{P} (i.e., \mathbb{P} is the data distribution with density function p) and then sample y\sim Cat\left(f\left(x\right)\right); sampling y\sim\mathbb{P}_{\mathcal{Y}}^{\hat{f}} is similar. We then have

d1/2(𝒴f,𝒴f^)(f^,f,)1/2,d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right)\leq\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)^{1/2},

where the loss \mathcal{L} is defined based on the Hellinger loss (f^(x),f(x))=D1/2(f^(x),f(x))=2i=1C[f^(x,i)f(x,i)]2\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)=2\sum_{i=1}^{C}\left[\sqrt{\hat{f}\left(x,i\right)}-\sqrt{f\left(x,i\right)}\right]^{2}.

Proof.

We have

D1/2(𝒴f,𝒴f^)\displaystyle D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right) =2i=1C(pf(y)pf^(y))2\displaystyle=2\sum_{i=1}^{C}\left(\sqrt{p^{f}\left(y\right)}-\sqrt{p^{\hat{f}}\left(y\right)}\right)^{2}
=\displaystyle= 2i=1C(pf(y=ix)p(x)𝑑xpf^(y=ix)p(x)𝑑x)2\displaystyle 2\sum_{i=1}^{C}\left(\sqrt{\int p^{f}\left(y=i\mid x\right)p(x)dx}-\sqrt{\int p^{\hat{f}}\left(y=i\mid x\right)p(x)dx}\right)^{2}
=\displaystyle= 2i=1C(f(x,i)p(x)𝑑xf^(x,i)p(x)𝑑x)2\displaystyle 2\sum_{i=1}^{C}\left(\sqrt{\int f\left(x,i\right)p(x)dx}-\sqrt{\int\hat{f}\left(x,i\right)p(x)dx}\right)^{2}
=\displaystyle= 2i=1C[f(x,i)p(x)𝑑x+f^(x,i)p(x)𝑑x2f(x,i)p(x)𝑑xf^(x,i)p(x)𝑑x]\displaystyle 2\sum_{i=1}^{C}\left[\int f\left(x,i\right)p(x)dx+\int\hat{f}\left(x,i\right)p(x)dx-2\sqrt{\int f\left(x,i\right)p(x)dx}\sqrt{\int\hat{f}\left(x,i\right)p(x)dx}\right]
(1)\displaystyle\overset{(1)}{\leq} 2i=1C[f(x,i)p(x)𝑑x+f^(x,i)p(x)𝑑x2f(x,i)f^(x,i)p(x)dx]\displaystyle 2\sum_{i=1}^{C}\left[\int f\left(x,i\right)p(x)dx+\int\hat{f}\left(x,i\right)p(x)dx-2\sqrt{\int f\left(x,i\right)\hat{f}\left(x,i\right)}p(x)dx\right]
=\displaystyle= 2i=1C[f(x,i)f^(x,i)]2p(x)𝑑x=2i=1C[f(x,i)f^(x,i)]2p(x)dx\displaystyle 2\sum_{i=1}^{C}\int\left[\sqrt{f\left(x,i\right)}-\sqrt{\hat{f}\left(x,i\right)}\right]^{2}p(x)dx=\int 2\sum_{i=1}^{C}\left[\sqrt{f\left(x,i\right)}-\sqrt{\hat{f}\left(x,i\right)}\right]^{2}p(x)dx
=\displaystyle= D1/2(f^(x),f(x))p(x)𝑑x=(f^,f,),\displaystyle\int D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)p(x)dx=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right),

where we note that in the derivation in (1)\overset{(1)}{\leq}, we use Cauchy-Schwarz inequality: f^(x,i)p(x)𝑑xf(x,i)p(x)𝑑x(f^(x,i)f(x,i)p(x)𝑑x)2\int\hat{f}\left(x,i\right)p\left(x\right)dx\int f\left(x,i\right)p\left(x\right)dx\geq\left(\int\sqrt{\hat{f}\left(x,i\right)f\left(x,i\right)}p\left(x\right)dx\right)^{2}.

Therefore, we reach the conclusion as

d1/2(𝒴f,𝒴f^)(f^,f,)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right)\leq\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)^{1/2}.
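A minimal numerical check of Lemma 17 on a finite input space, using the Hellinger loss defined above (our own illustrative sketch):

import numpy as np

rng = np.random.default_rng(1)
n, C = 50, 3
p = rng.dirichlet(np.ones(n))                   # data distribution on a finite X
f = rng.dirichlet(np.ones(C), size=n)           # ground-truth labeling function f
fhat = rng.dirichlet(np.ones(C), size=n)        # hypothesis fhat

PY_f, PY_fhat = p @ f, p @ fhat                 # induced label marginals
d12 = np.sqrt(2.0 * np.sum((np.sqrt(PY_f) - np.sqrt(PY_fhat)) ** 2))
hell = 2.0 * np.sum((np.sqrt(fhat) - np.sqrt(f)) ** 2, axis=1)   # D_{1/2}(fhat(x), f(x))
L = p @ hell                                    # L(fhat, f, P) with the Hellinger loss
assert d12 <= np.sqrt(L) + 1e-12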

Lemma 18.

Consider the hypothesis f^=h^g\hat{f}=\hat{h}\circ g. We have the following inequalities w.r.t. the source and target domains:

(i) d1/2(^𝒴T,𝒴T)(h^g,fT,T)1/2,d_{1/2}\left(\mathbb{\hat{P}}_{\mathcal{Y}}^{T},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}, where 𝒴T\mathbb{P}_{\mathcal{Y}}^{T} is the label marginal distribution induced by fTf^{T} on T\mathbb{P}^{T}, while ^𝒴T\hat{\mathbb{P}}_{\mathcal{Y}}^{T} is the label marginal distribution induced by f^\hat{f} on T\mathbb{P}^{T}.

(ii) d1/2(𝒴π,^𝒴π)[i=1Kπi(h^g,fS,i,S,i)]1/2d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}, where 𝒴π:=i=1Kπi𝒴S,i\mathbb{P}_{\mathcal{Y}}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{\mathcal{Y}}^{S,i} with 𝒴S,i\mathbb{P}_{\mathcal{Y}}^{S,i} to be induced by fS,if^{S,i} on S,i\mathbb{P}^{S,i} and ^𝒴π:=i=1Kπi^𝒴S,i\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i} with ^𝒴S,i\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}to be induced by f^\hat{f} on S,i\mathbb{P}^{S,i} (i.e., equivalently, the label marginal distribution induced by f^\hat{f} on π:=i=1KπiS,i\mathbb{P}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}^{S,i}).

Proof.

(i) The proof of this part is obvious from Lemma 17 by considering fTf^{T} as ff and T\mathbb{P}^{T} as \mathbb{P}.

(ii) By the convexity of D_{1/2}, which is a member of the f-divergence family, we have

D1/2(𝒴π,^𝒴π)\displaystyle D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right) =D1/2(i=1Kπi𝒴S,i,i=1Kπi^𝒴S,i)i=1KπiD1/2(𝒴S,i,^𝒴S,i)\displaystyle=D_{1/2}\left(\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{\mathcal{Y}}^{S,i},\sum_{i=1}^{K}\pi_{i}\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}\right)\leq\sum_{i=1}^{K}\pi_{i}D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{S,i},\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}\right)
(1)\displaystyle\overset{(1)}{\leq} i=1Kπi(h^g,fS,i,S,i),\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right),

where the derivation in (1)\overset{(1)}{\leq} is from Lemma 17. Therefore, we reach the conclusion as

d1/2(𝒴π,^𝒴π)[i=1Kπi(h^g,fS,i,S,i)]1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}.

Theorem 19.

(Theorem 8 in the main paper) Consider a feature extractor g and a hypothesis \hat{h}. The Hellinger distance between the two label marginal distributions \mathbb{P}_{\mathcal{Y}}^{\pi} and \mathbb{P}_{\mathcal{Y}}^{T} can be upper-bounded as:

(i) d1/2(𝒴π,𝒴T)[k=1Kπk(h^g,fS,k,S,k)]1/2+d1/2(gT,gπ)+(h^g,fT,T)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h}\circ g,f^{S,k},\mathbb{P}^{S,k}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

(ii) d1/2(𝒴π,𝒴T)[i=1Kπi(h^g,fS,i,S,i)]1/2+i=1Kj=1KπjKd1/2(gS,i,gS,j)+i=1Kj=1KπjKd1/2(gT,gS,i)+(h^g,fT,T)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

Here we note that the general loss \mathcal{L} is defined based on the Hellinger loss \ell\left(\hat{f}\left(x\right),f(x)\right)=D_{1/2}\left(\hat{f}\left(x\right),f(x)\right).

Proof.

(i) We define \mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi} and \mathbb{P}_{\mathcal{Y}}^{T},\hat{\mathbb{P}}_{\mathcal{Y}}^{T} as in Lemma 18. Recall that to sample y\sim\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}, we sample k\sim Cat(\pi) and x\sim\mathbb{P}^{S,k} (i.e., x\sim\mathbb{P}^{\pi}:=\sum_{k=1}^{K}\pi_{k}\mathbb{P}^{S,k}), compute z=g\left(x\right) (i.e., z\sim\mathbb{P}_{g}^{\pi}), and draw y\sim Cat\left(\hat{h}\left(z\right)\right); drawing y\sim\mathbb{\hat{P}}_{\mathcal{Y}}^{T} is similar.

Using triangle inequality for Hellinger distance, we have

d1/2(𝒴π,𝒴T)d1/2(𝒴π,^𝒴π)+d1/2(^𝒴π,^𝒴T)+d1/2(^𝒴T,𝒴T).\begin{aligned} d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)&\leq d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)+d_{1/2}\left(\mathbb{\hat{P}}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{T}\right)+d_{1/2}\left(\hat{\mathbb{P}}_{\mathcal{Y}}^{T},\mathbb{P}_{\mathcal{Y}}^{T}\right)\end{aligned}.

Referring to Lemma 18, we achieve

d1/2(𝒴π,𝒴T)\displaystyle d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right) [i=1Kπi(h^g,fS,i,S,i)]1/2+d1/2(𝒴^π,𝒴^T)+(h^g,fT,T)1/2.\displaystyle\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi},\mathbb{P}_{\hat{\mathcal{Y}}}^{T}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

By the monotonicity of the Hellinger distance, applied to $\mathbb{P}_{g}^{T}$ and $\mathbb{P}_{g}^{\pi}$ with the same transition probability $p\left(y=i\mid z\right)=\hat{h}\left(z,i\right)$ used to obtain $\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi}$ and $\mathbb{P}_{\hat{\mathcal{Y}}}^{T}$, we have

d_{1/2}\left(\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi},\mathbb{P}_{\hat{\mathcal{Y}}}^{T}\right)\leq d_{1/2}\left(\mathbb{P}_{g}^{\pi},\mathbb{P}_{g}^{T}\right).

Finally, we reach the conclusion as

d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{\pi},\mathbb{P}_{g}^{T}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

(ii) From Lemma 12, we can decompose the data-shift term and apply the triangle inequality again, arriving at

\begin{aligned}
d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right) &\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}\\
&\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)\\
&\quad+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.
\end{aligned}

This concludes our proof. ∎

The loss $\mathcal{L}$ in Theorem 19 is defined based on the Hellinger loss $\ell$ given by $\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)$, while the theory developed in the previous sections is based on a loss $\ell$ of the specific form

\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=\sum_{i=1}^{C}l\left(\hat{f}\left(x\right),i\right)f\left(x,i\right). (16)

To make the two consistent, we discuss under which condition the Hellinger loss belongs to the family defined in (16). It is evident that if the labeling function $f$ satisfies $f\left(x,i\right)>0$ for all $x\sim\mathbb{P}$ and $i\in\left[C\right]$, for example when label smoothing [49, 36] is applied to the ground-truth labels, then the Hellinger loss is in the family of interest. This follows from the derivation:

\begin{aligned}
D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right) &= 2\sum_{i=1}^{C}\left[\sqrt{\hat{f}\left(x,i\right)}-\sqrt{f\left(x,i\right)}\right]^{2}\\
&= 2\sum_{i=1}^{C}\left[\sqrt{\frac{\hat{f}\left(x,i\right)}{f\left(x,i\right)}}-1\right]^{2}f\left(x,i\right),
\end{aligned}

where we consider $l\left(\hat{f}\left(x\right),i\right)=\left[\sqrt{\frac{\hat{f}\left(x,i\right)}{f\left(x,i\right)}}-1\right]^{2}$.
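As a concrete illustration (assuming the standard uniform label-smoothing scheme with coefficient $\epsilon\in(0,1)$; this particular scheme is our choice for the example), smoothing a one-hot label of class $c$ over $C$ classes yields

f\left(x,c\right)=1-\epsilon+\frac{\epsilon}{C},\qquad f\left(x,i\right)=\frac{\epsilon}{C}\text{ for }i\neq c,

so $f\left(x,i\right)>0$ for every $i\in\left[C\right]$, and the Hellinger loss indeed takes the form (16) with the above choice of $l$.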

Appendix D: Additional Experiments

D.1 Experiment on Colored MNIST

D.1.1 Dataset

We conduct experiments on the Colored MNIST dataset [2], whose data is generated as follows. First, for any original image $X$ in the MNIST dataset [24], the digit feature is set to $Z_{d}=0$ if the image's digit is in $0\rightarrow 4$, and $Z_{d}=1$ if the digit is in $5\rightarrow 9$. Next, the ground-truth label for the image $X$ is binary and sampled from either $\mathbb{P}(Y|Z_{d}=1)$ or $\mathbb{P}(Y|Z_{d}=0)$, depending on the value of the digit feature $Z_{d}$; these distributions satisfy $\mathbb{P}(Y=1|Z_{d}=1)=\mathbb{P}(Y=0|Z_{d}=0)=0.75$. Next, the binary color feature $Z_{c}$ is assigned to each image conditioning on its label, i.e., $z_{c}\sim\mathbb{P}(Z_{c}|Y=1)$ or $z_{c}\sim\mathbb{P}(Z_{c}|Y=0)$ with $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta$, where $\theta$ depends on the domain. Finally, we color the image red if $Z_{c}=0$ or green if $Z_{c}=1$.

For both the DG and MSDA experiments, there are 7 source domains generated by setting $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta^{S,i}$ where $\theta^{S,i}\sim Uni\left(\left[0.6,1\right]\right)$ for $i=1,\ldots,7$. In our actual implementation, we take $\theta^{S,i}=0.6+\frac{0.4}{7}\left(i-1\right)$. The two target domains are created with $\theta^{T,i}\in\left\{0.05,0.7\right\}$ for $i=1,2$. After domain creation, data from each domain is split into a training set and a validation set. For the DG experiment, no data from the target domains is used in training. For MSDA, the same train-validation split is applied to the target domains; the unlabeled training splits are used for training, while the validation splits are used for testing.
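For concreteness, the following minimal sketch (with hypothetical helper names; not the released code) implements the generative process described above for a single domain with parameter $\theta$, together with the domain parameters used in our experiments:

import numpy as np

def make_colored_mnist_domain(images, digits, theta, rng):
    """Generate one Colored MNIST domain from grayscale MNIST images.
    images: (N, 28, 28) arrays; digits: (N,) original digit labels;
    theta: P(Zc=1 | Y=1) = P(Zc=0 | Y=0) for this domain."""
    n = len(images)
    z_d = (digits >= 5).astype(int)                     # digit feature: 0 for digits 0-4, 1 for 5-9
    y = np.where(rng.random(n) < 0.75, z_d, 1 - z_d)    # label agrees with z_d with probability 0.75
    z_c = np.where(rng.random(n) < theta, y, 1 - y)     # color agrees with y with probability theta
    colored = np.zeros((n, 2, 28, 28), dtype=images.dtype)
    colored[z_c == 0, 0] = images[z_c == 0]             # channel 0 kept: red image
    colored[z_c == 1, 1] = images[z_c == 1]             # channel 1 kept: green image
    return colored, y

rng = np.random.default_rng(0)
thetas_src = [0.6 + 0.4 / 7 * (i - 1) for i in range(1, 8)]   # 7 source domains
thetas_tgt = [0.05, 0.7]                                      # distant and close target domains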

Figure 4: Images in the Colored MNIST dataset are “colored” according to the color feature $Z_{c}$. If $Z_{c}=0$, channel 0 is kept while channel 1 is set to all zeros, corresponding to red images. Similarly, $Z_{c}=1$ means channel 1 is kept intact while channel 0 is zeroed out, represented by green images.

D.1.2 Model

We train a hypothesis $\hat{f}=\hat{h}\circ g$ by minimizing the classification loss over the entire source data

\mathcal{L}_{gen}:=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{E}_{\left(x,y\right)\sim\mathbb{D}^{S,i}}\left[CE\left(\hat{h}\left(g\left(x\right)\right),y\right)\right],

where $CE$ is the cross-entropy loss, $N_{i}$ is the number of samples from domain $i$, and $N_{S}$ is the total number of source samples.
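A minimal PyTorch-style sketch of this weighted classification loss (function and variable names are ours, and per-batch sample counts stand in for $N_{i}/N_{S}$):

import torch
import torch.nn.functional as F

def classification_loss(encoder, classifier, source_batches):
    """source_batches: list of (x, y) minibatches, one per source domain.
    Each domain's cross-entropy is weighted by its share of source samples."""
    n_total = sum(len(x) for x, _ in source_batches)
    loss = 0.0
    for x, y in source_batches:
        logits = classifier(encoder(x))
        loss = loss + (len(x) / n_total) * F.cross_entropy(logits, y)
    return loss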

To align the source-source representation distributions, we apply adversarial learning [16] as in [14]: a min-max game is played in which the domain discriminator $\hat{h}^{s-s}$ tries to predict domain labels from input representations, while the feature extractor (generator) $g$ tries to fool the domain discriminator, i.e., $\min_{g}\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}$. The source-source compression loss is defined as

\mathcal{L}_{disc}^{s-s}:=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{E}_{x\sim\mathbb{P}^{S,i}}\left[-CE\left(\log\hat{h}^{s-s}\left(g\left(x\right)\right),i\right)\right],

where $i$ is the domain label. It is well known [16] that if we search for $\hat{h}^{s-s}$ in a family with infinite capacity, then

\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}=JS\left(\mathbb{P}_{g}^{S,1},\ldots,\mathbb{P}_{g}^{S,7}\right).

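The sketch below illustrates one way to compute $\mathcal{L}_{disc}^{s-s}$ for the min-max game with alternating updates; it is a hedged illustration with our own names, not the released implementation:

import torch
import torch.nn.functional as F

def source_source_disc_loss(encoder, disc_ss, source_batches):
    """L_disc^{s-s}: negative cross-entropy of the domain discriminator over
    the 7 source domains, weighted by each domain's sample share.
    The discriminator ascends this quantity; the encoder descends it."""
    n_total = sum(len(x) for x, _ in source_batches)
    loss = 0.0
    for i, (x, _) in enumerate(source_batches):
        domain_logits = disc_ss(encoder(x))
        domain_labels = torch.full((len(x),), i, dtype=torch.long, device=x.device)
        loss = loss + (len(x) / n_total) * (-F.cross_entropy(domain_logits, domain_labels))
    return loss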

Similarly, alignment between the source and target feature distributions is enforced by adversarial learning between another discriminator $\hat{h}^{s-t}$ and the encoder $g$, with the objective $\min_{g}\max_{\hat{h}^{s-t}}\mathcal{L}_{disc}^{s-t}$. The loss function for source-target compression is

\mathcal{L}_{disc}^{s-t}:=\frac{N_{S}}{N_{S}+N_{T}}\mathbb{E}_{x\sim\mathbb{P}^{\pi,S}}\left[-\log\hat{h}^{s-t}\left(g\left(x\right)\right)\right]+\frac{N_{T}}{N_{S}+N_{T}}\mathbb{E}_{x\sim\mathbb{P}^{T}}\left[-\log\left(1-\hat{h}^{s-t}\left(g\left(x\right)\right)\right)\right],

where $\mathbb{P}^{\pi,S}=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{P}^{S,i}$ is the source mixture, $\mathbb{P}^{T}$ is the chosen target domain among the two, and $N_{T}$ is the number of unlabeled target samples.
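A corresponding sketch for $\mathcal{L}_{disc}^{s-t}$, assuming $\hat{h}^{s-t}$ outputs the probability of the target domain (consistent with the domain labeling described in Section D.1.3; names are ours):

import torch

def source_target_disc_loss(encoder, disc_st, x_source, x_target, eps=1e-8):
    """L_disc^{s-t} as defined above; disc_st(z) is read as P(target domain | z),
    i.e., target samples carry domain label 1 and source samples label 0."""
    n_s, n_t = len(x_source), len(x_target)
    p_s = disc_st(encoder(x_source)).squeeze(-1)   # predicted P(target) on source inputs
    p_t = disc_st(encoder(x_target)).squeeze(-1)   # predicted P(target) on target inputs
    loss_s = -torch.log(p_s + eps).mean()
    loss_t = -torch.log(1.0 - p_t + eps).mean()
    return (n_s / (n_s + n_t)) * loss_s + (n_t / (n_s + n_t)) * loss_t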

Finally, for DG we optimize the objective

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}\right), (17)

where $\lambda$ is the trade-off hyperparameter: $\lambda=0$ corresponds to DG's general DI representation, while a larger $\lambda$ corresponds to a more compressed DI representation. On the other hand, the objective for the MSDA setting is

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda^{s-s}\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}+\lambda^{s-t}\max_{\hat{h}^{s-t}}\mathcal{L}_{disc}^{s-t}\right), (18)

where $\lambda^{s-s}$ controls the source-source compression and $\lambda^{s-t}$ controls the source-target compression.
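One possible way to optimize objective (18) is with alternating updates, reusing the hypothetical loss helpers sketched above; setting $\lambda^{s-t}=0$ recovers the DG objective (17). This is an illustrative sketch, not the released training loop:

import torch

def msda_training_step(encoder, classifier, disc_ss, disc_st,
                       source_batches, x_target, opt_main, opt_disc,
                       lam_ss=0.1, lam_st=0.1):
    x_source = torch.cat([x for x, _ in source_batches])

    # (1) Discriminator step: ascend the compression losses.
    opt_disc.zero_grad()
    d_loss = -(lam_ss * source_source_disc_loss(encoder, disc_ss, source_batches)
               + lam_st * source_target_disc_loss(encoder, disc_st, x_source, x_target))
    d_loss.backward()
    opt_disc.step()

    # (2) Encoder/classifier step: descend the classification loss plus the
    #     weighted compression losses (the encoder tries to fool the discriminators).
    opt_main.zero_grad()
    g_loss = (classification_loss(encoder, classifier, source_batches)
              + lam_ss * source_source_disc_loss(encoder, disc_ss, source_batches)
              + lam_st * source_target_disc_loss(encoder, disc_st, x_source, x_target))
    g_loss.backward()
    opt_main.step()
    return g_loss.item()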

Our implementation is based largely on the DomainBed repository [17]. Specifically, the encoder $g$ is a convolutional neural network with 4 convolutional layers, each followed by a ReLU activation and batch normalization, while the classifier $\hat{h}$ and the discriminators $\hat{h}^{s-s},\hat{h}^{s-t}$ are fully connected multi-layer perceptrons with 3 layers. Our code can be found in the zip file accompanying this appendix. Moreover, our experiments were run on one Tesla V100 GPU, and one training run on Colored MNIST took around 30 minutes.
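A sketch of this architecture in PyTorch (channel and hidden sizes are our guesses for illustration, not the exact DomainBed configuration):

import torch.nn as nn

def make_encoder(in_channels=2, width=128):
    """4-layer CNN encoder: Conv -> ReLU -> BatchNorm, followed by pooling."""
    layers, in_ch = [], in_channels
    for out_ch in (64, width, width, width):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                   nn.ReLU(),
                   nn.BatchNorm2d(out_ch)]
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)

def make_mlp(in_dim, out_dim, hidden=256):
    """3-layer fully connected head used for the classifier and both discriminators."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

encoder = make_encoder()
classifier = make_mlp(128, 2)                              # binary label
disc_ss = make_mlp(128, 7)                                 # 7 source domains
disc_st = nn.Sequential(make_mlp(128, 1), nn.Sigmoid())    # P(target | z)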

D.1.3 Multiple-source Domain Adaptation

In addition to the DG experiment provided in Section 3 of the main paper, a further experiment is conducted on MSDA, in which source-target compression is applied in addition to source-source compression. Specifically, unlabeled data from a target domain is supplied for training; it carries domain label 1 while all labeled source data carries domain label 0, and the source-target discriminator is tasked with classifying them. We experiment with the 2 target domains $\theta^{T,i}\in\left\{0.05,0.7\right\}$ separately and report accuracy on the target domain for different compression strengths. The results are presented in Figure 5.

(a) Distant target domain $\theta=0.05$
(b) Close target domain $\theta=0.7$, only $\lambda^{s-t}$ changes
(c) Close target domain $\theta=0.7$, only $\lambda^{s-s}$ changes
Figure 5: Accuracy on distant and close target domains, where tuples $\left(\lambda^{s-t},\lambda^{s-s}\right)$ indicate the strength of source-target compression and source-source compression, respectively.
(a) Art Painting as target domain.
(b) Sketch as target domain.
Figure 6: Domain generalization on PACS with different compression strengths. On both domains, slight compression is beneficial and increases accuracy, in line with the possible benefit of compressed DI representation. However, larger compression deteriorates performance, which confirms our discussion using the trade-off bound.

It is evident from the figure that the more compression is applied to both the source-source and source-target representations, the lower the accuracy. This result aligns with our previous bound (Theorem 8 in the main paper), i.e.,

\begin{aligned}
&d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)-\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)-\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\\
&\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.
\end{aligned}

When source-source and source-target compression is applied, the term $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$ is minimized, which raises the lower bound on the sum of the loss terms. Consequently, as the source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)$ is minimized, the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ becomes high and hence performance is hindered.

However, the drop in accuracy under large compression on the close target domain is not as significant as on the distant target domain, as indicated in Figures 5b and 5c. In fact, the accuracy of some compressed representations is higher than that of the non-compressed representation at later iterations. It is possible that the negative effect of raising the lower bound, as in the trade-off theorem, is counteracted by the benefit of compressed DI representation (Section 2.3.3 of the main paper), i.e., the learned classifier for the compressed DI representation better approximates the ground-truth labeling function. On the other hand, the model with the general DI representation cannot approximate this ground-truth labeling function as accurately, but overfits to the training dataset, resulting in a drop in target accuracy at later iterations.

D.2 Experiment on PACS dataset

In order to verify our theoretical findings on a real dataset, we conduct a further experiment on the PACS dataset [26], which has 4 domains: Photo, Art Painting, Cartoon, and Sketch. Among the 4 domains, Photo and Cartoon are chosen as training domains, while Art Painting is chosen as the target domain close to the training ones, and Sketch as the target domain distant from the training ones. We use ResNet-18 [18] as the feature extractor, while the label classifier and domain discriminator are multi-layer perceptrons. We only investigate the DG setting on this dataset, for which the training objective is Eq. 17. The results in Figure 6 illustrate a similar pattern to the DG experiment on Colored MNIST. Specifically, the accuracies for both target domains rise until a peak is reached and then decrease, which confirms our developed theory on the benefit and trade-off of compressed DI representation.