
On Learning Domain-Invariant Representations for Transfer Learning with Multiple Sources

Trung Phung1   Trung Le2   Long Vuong1   Toan Tran1   Anh Tran1   Hung Bui1   Dinh Phung1,2
1 VinAI Research, Vietnam   2 Monash University, Australia
[email protected], [email protected],
{v.longvt8, v.toantm3, v.anhtt152, v.hungbh1, v.dinhpq2}@vinai.io
Abstract

Domain adaptation (DA) benefits from rigorous theoretical works that study its insightful characteristics and various aspects, e.g., learning domain-invariant representations and its trade-off. However, this is not the case for the multiple-source DA and domain generalization (DG) settings, which are remarkably more complicated due to the involvement of multiple source domains and the potential unavailability of the target domain during training. In this paper, we develop novel upper bounds for the target general loss which motivate us to define two kinds of domain-invariant representations. We further study the pros and cons, as well as the trade-offs, of enforcing learning each kind of domain-invariant representation. Finally, we conduct experiments to inspect the trade-off of these representations, offering practical hints on how to use them in practice and exploring other interesting properties of our developed theory.

1 Introduction

Although annotated data is precious and valuable to deep learning (DL), in many real-world applications, annotating a sufficient amount of data for training qualified DL models is prohibitively labour-expensive, time-consuming, and error-prone. Transfer learning is a vital solution to the lack of labeled data. Additionally, in many situations, we are only able to collect a limited amount of annotated data from multiple source domains. Therefore, it is desirable to train a qualified DL model primarily based on multiple source domains, possibly together with a target domain. Depending on the availability of the target domain during training, we encounter either the multiple-source domain adaptation (MSDA) or the domain generalization (DG) problem.

Domain adaptation (DA) is a specific case of MSDA in which we need to transfer from a single source domain to another target domain that is available during training. For the DA setting, the pioneering work [3] and many follow-up works [45, 55, 56, 22, 19, 57, 23] study its insightful characterizations and aspects, notably what domain-invariant (DI) representations are, how to learn this kind of representation, and the trade-off of enforcing learning DI representations. Those well-grounded studies lay the foundation for developing impressive practical DA methods [14, 50, 46, 55, 7, 54, 38].

Due to the presence of multiple source domains and the possible unavailability of the target domain, establishing a theoretical foundation and characterizing DI representations for the MSDA and especially DG settings are significantly more challenging. A number of state-of-the-art works in MSDA and DG [59, 41, 20, 11, 43, 28, 29, 35, 15, 58, 52, 37, 39] have implicitly exploited DI representations in some sense to achieve impressive performances, e.g., [29, 11] minimize representation divergence during training, [43] decomposes representations to obtain invariant features, and [58] maximizes domain prediction entropy to learn invariance. Despite these successes, the notion of DI representations in MSDA and DG is not fully understood, and rigorous studies to theoretically characterize DI representations for these settings remain crucial. In this paper, we provide theoretical answers to the following questions: (1) what are DI representations in MSDA and DG, and (2) what should one expect when learning them. Overall, our contributions in this work can be summarized as:

  • In Section 2.2, we first develop two upper bounds for the general target loss in the MSDA and DG settings, whose proofs can be found in Appendix A. We then rely on these bounds to characterize and define two kinds of DI representations: general DI representations and compressed DI representations.

  • We further develop theory to inspect the pros and cons of the two kinds of DI representations in Section 2.3, aiming to shed light on how to use those representations in practice. In particular, the two types of DI representations optimize different types of divergence on the feature space, hence serving different purposes. Appendix B contains the proofs regarding these characteristics.

  • Finally, we study in Section 2.4 the trade-off of the two kinds of DI representations, which theoretically answers the question of whether and how the target performance is hurt when enforcing learning each kind of representation. We refer readers to Appendix C for its proof.

  • We conduct experiments to investigate the trade-off of the two kinds of representations, giving practical hints on how to use those representations in practice and exploring other interesting properties of our developed theory.

It is worth noting that although MSDA has been investigated in some works [19, 57], ours is the first work that rigorously formulates and provides theoretical analysis for representation learning in MSDA and DG. Specifically, our bounds developed in Theorems 1 and 3, interweaving both input and latent spaces, are novel and benefit theoretical analyses in deep learning. Our theory is developed in a general setting of multi-class classification with a sufficiently general loss, which can be viewed as a non-trivial generalization of existing works in DA [31, 3, 56, 32, 57], each of which considers binary classification with specific loss functions. Moreover, the bound developed in Theorem 8 is the first theoretical analysis of the trade-off of learning DI representations in MSDA and DG. In particular, by specializing to the setting with a single source and a single target domain, we recover the same trade-off nature discovered in [56], but again our setting is more general than the binary classification setting in that work.

2 Our Main Theory

2.1 Notations

Let $\mathcal{X}$ be a data space on which we endow a data distribution $\mathbb{P}$ with a corresponding density function $p(x)$. We consider the multi-class classification problem with the label set $\mathcal{Y}=[C]$, where $C$ is the number of classes and $[C]:=\{1,\ldots,C\}$. Denote $\mathcal{Y}_{\Delta}:=\{\alpha\in\mathbb{R}^{C}:\|\alpha\|_{1}=1\,\wedge\,\alpha\geq\mathbf{0}\}$ as the $C$-simplex label space, and let $f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ be a probabilistic labeling function returning a $C$-tuple $f(x)=[f(x,i)]_{i=1}^{C}$, whose element $f(x,i)=p(y=i\mid x)$ is the probability of assigning a data sample $x\sim\mathbb{P}$ to class $i$ (i.e., $i\in\{1,\ldots,C\}$). Moreover, a domain is denoted compactly as a pair of data distribution and labeling function $\mathbb{D}:=(\mathbb{P},f)$. We note that given a data sample $x\sim\mathbb{P}$, its categorical label $y\in\mathcal{Y}$ is sampled as $y\sim Cat(f(x))$, a categorical distribution over $f(x)\in\mathcal{Y}_{\Delta}$.
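To make this labeling convention concrete, here is a minimal Python sketch of drawing a hard label from $Cat(f(x))$; the toy labeling function below is a hypothetical placeholder, not part of the paper's method.

```python
import numpy as np

C = 3

def f(x):
    # hypothetical probabilistic labeling function: a softmax over fixed linear scores
    scores = np.array([x.sum(), -x.sum(), 0.0])
    e = np.exp(scores - scores.max())
    return e / e.sum()                 # a point on the C-simplex, [p(y = i | x)]_{i=1..C}

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # a toy sample x ~ P
y = rng.choice(C, p=f(x))              # observed label y ~ Cat(f(x))
```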

Let $l:\mathcal{Y}_{\Delta}\times\mathcal{Y}\mapsto\mathbb{R}$ be a loss function, where $l(\hat{f}(x),y)$ with $\hat{f}(x)\in\mathcal{Y}_{\Delta}$ and $y\in\mathcal{Y}$ specifies the loss (e.g., cross-entropy, Hinge, L1, or L2 loss) of assigning a data sample $x$ to the class $y$ by the hypothesis $\hat{f}$. Moreover, given a prediction probability $\hat{f}(x)$ w.r.t. the ground-truth prediction $f(x)$, we define the loss $\ell(\hat{f}(x),f(x))=\mathbb{E}_{y\sim f(x)}[l(\hat{f}(x),y)]=\sum_{y=1}^{C}l(\hat{f}(x),y)f(x,y)$. This means $\ell(\cdot,\cdot)$ is convex w.r.t. the second argument. We further define the general loss caused by using a classifier $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ to predict $\mathbb{D}\equiv(\mathbb{P},f)$ as

\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)=\mathcal{L}\left(\hat{f},\mathbb{D}\right):=\mathbb{E}_{x\sim\mathbb{P}}\left[\ell\left(\hat{f}(x),f(x)\right)\right].
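The following numpy sketch restates these two definitions with a Monte-Carlo average standing in for the expectation over $\mathbb{P}$; the constant `f_hat` and `f` at the bottom are toy placeholders used only to exercise the functions.

```python
import numpy as np

def l(f_hat_x, y):
    # instance-level loss l(f_hat(x), y); cross-entropy is one of the admissible choices listed above
    return -np.log(f_hat_x[y] + 1e-12)

def ell(f_hat_x, f_x):
    # ell(f_hat(x), f(x)) = E_{y ~ f(x)}[ l(f_hat(x), y) ] = sum_y l(f_hat(x), y) f(x, y)
    return sum(l(f_hat_x, y) * f_x[y] for y in range(len(f_x)))

def general_loss(f_hat, f, xs):
    # Monte-Carlo estimate of L(f_hat, D) = E_{x ~ P}[ ell(f_hat(x), f(x)) ] on samples xs from P
    return np.mean([ell(f_hat(x), f(x)) for x in xs])

# toy usage with constant simplex outputs standing in for f_hat and f
f_hat = lambda x: np.array([0.7, 0.2, 0.1])
f = lambda x: np.array([0.6, 0.3, 0.1])
print(general_loss(f_hat, f, xs=[None] * 4))
```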

We inspect the multiple-source setting in which we are given multiple source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ over the common data space $\mathcal{X}$, each of which consists of a data distribution and its own labeling function: $\mathbb{D}^{S,i}:=(\mathbb{P}^{S,i},f^{S,i})$. Based on these source domains, we need to work out a learner or classifier that is evaluated on a target domain $\mathbb{D}^{T}:=(\mathbb{P}^{T},f^{T})$. Depending on whether the target domain is known or unknown during training, we face either the multiple-source DA (MSDA) [32, 30, 44, 19] or the domain generalization (DG) [25, 29, 28, 35] setting.

One typical approach in MSDA and DG is to combine the source domains together [14, 29, 28, 35, 1, 11, 48] to learn DI representations with the hope of generalizing well to a target domain. When combining the source domains, we obtain a mixture of the source distributions denoted as $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$, where the mixing coefficients $\pi=[\pi_{i}]_{i=1}^{K}$ can be conveniently set to $\pi_{i}=\frac{N_{i}}{\sum_{j=1}^{K}N_{j}}$ with $N_{i}$ being the training size of the $i$-th source domain.

In terms of representation learning, the input is mapped into a latent space $\mathcal{Z}$ by a feature map $g:\mathcal{X}\mapsto\mathcal{Z}$, and a classifier $\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}$ is then trained on the representations $g(\mathcal{X})$. Let $f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ be the original labeling function. To facilitate the theory developed for the latent space, we introduce the representation distribution as the pushed-forward distribution $\mathbb{P}_{g}:=g_{\#}\mathbb{P}$, and the labeling function $h:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}$ induced by $g$ as $h(z)=\frac{\int_{g^{-1}(z)}f(x)p(x)dx}{\int_{g^{-1}(z)}p(x)dx}$ [22]. Going back to our multiple-source setting, the source mixture becomes $\mathbb{D}_{g}^{\pi}=\sum_{i}\pi_{i}\mathbb{D}_{g}^{S,i}$, where each source domain is $\mathbb{D}_{g}^{S,i}=(\mathbb{P}_{g}^{S,i},h^{S,i})$ and the target domain is $\mathbb{D}_{g}^{T}=(\mathbb{P}_{g}^{T},h^{T})$.

Finally, in our theory development, we use the Hellinger divergence between two distributions, defined as $D_{1/2}\left(\mathbb{P}^{1},\mathbb{P}^{2}\right)=2\int\left(\sqrt{p^{1}(x)}-\sqrt{p^{2}(x)}\right)^{2}dx$, whose square root $d_{1/2}=\sqrt{D_{1/2}}$ is a proper metric.
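For discrete distributions represented as probability vectors, the Hellinger quantities above reduce to the short sketch below; this is a direct transcription of the definition, not a new construction.

```python
import numpy as np

def hellinger_sq(p1, p2):
    # D_{1/2}(P1, P2) = 2 * sum_x (sqrt(p1(x)) - sqrt(p2(x)))^2   (discrete case)
    return 2.0 * np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)

def d_half(p1, p2):
    # d_{1/2} = sqrt(D_{1/2}), the proper metric used throughout the bounds
    return np.sqrt(hellinger_sq(p1, p2))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])
print(hellinger_sq(p1, p2), d_half(p1, p2))
```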

2.2 Two Types of Domain-Invariant Representations

Hinted by the theoretical bounds developed in [3], DI representation learning, in which the feature extractor $g$ maps the source and target data distributions to a common distribution on the latent space, is well-grounded for the DA setting. However, this task becomes significantly more challenging for the MSDA and DG settings due to multiple source domains and the potential unknownness of the target domain. Similar to the case of DA, it is desirable to develop theoretical bounds that directly motivate definitions of DI representations for the MSDA and DG settings, as done in the next theorem.

Theorem 1.

Consider a mixture of source domains $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$. Let $\ell$ be any loss function upper-bounded by a positive constant $L$. For any hypothesis $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ where $\hat{f}=\hat{h}\circ g$ with $g:\mathcal{X}\rightarrow\mathcal{Z}$ and $\hat{h}:\mathcal{Z}\rightarrow\mathcal{Y}_{\Delta}$, the target loss on the input space is upper-bounded as

\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)   (1)

where $\Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C}$ is the absolute pointwise label shift on the input space between the source domain $\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$, and $[K]:=\{1,2,\ldots,K\}$.

The bound in Equation 1 implies that the target loss depends on three terms: (i) the representation discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$, (ii) the label shift $\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]$, and (iii) the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$. To minimize the target loss on the left-hand side, we need to minimize the three aforementioned terms. First, the label shift term is a natural characteristic of the domains, hence almost impossible to tackle. Second, the representation discrepancy term can be explicitly tackled in the MSDA setting, while this is almost impossible in the DG setting. Finally, the general source loss term is convenient to tackle, where its minimization yields a feature extractor $g$ and a classifier $\hat{h}$.

Contrary to previous works in DA and MSDA [3, 32, 57, 7] that consider both losses and data discrepancy on the data space, our bound connects losses on the data space to discrepancy on the representation space. Therefore, our theory provides a natural way to analyze representation learning, especially feature alignment in deep learning practice. Note that although DANN [14] explains its feature alignment method using the theory developed by Ben-David et al. [3], the argument is not rigorous: applying that theory to the representation space yields a representation discrepancy term, but the loss terms then also live on the feature space, and minimizing those losses is not the learning goal. Finally, our setting is much more general, extending to the multilabel and stochastic labeling settings and to any bounded loss function.

From the upper bound, we turn to the first type of DI representations for the MSDA and DG settings. Here, the objective of $g$ is to map samples onto the representation space in such a way that the common classifier $\hat{h}$ can effectively and correctly classify them, regardless of which domain the data comes from.

Definition 2.

Consider a class $\mathcal{G}$ of feature maps and a class $\mathcal{H}$ of hypotheses. Let $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ be a set of source domains.

i) (DG with unknown target data) A feature map $g^{*}\in\mathcal{G}$ is said to be a DG general domain-invariant (DI) feature map if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$. Moreover, the latent representations $z=g^{*}(x)$ induced by $g^{*}$ are called general DI representations for the DG setting.

ii) (MSDA with known target data) A feature map $g^{*}\in\mathcal{G}$ is said to be an MSDA general DI feature map if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{T}=\mathbb{P}_{g^{*}}^{\pi}$ (i.e., $\min_{g\in\mathcal{G}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$). Moreover, the latent representations $z=g^{*}(x)$ induced by $g^{*}$ are called general DI representations for the MSDA setting.

The definition of general DI representations for the DG setting in Definition 2 is transparent in light of Theorem 1, wherein we aim to find $g$ and $\hat{h}$ that minimize the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$ due to the unknownness of $\mathbb{P}^{T}$. Meanwhile, regarding the general DI representations for the MSDA setting, we aim to find $g^{*}$ satisfying $\mathbb{P}_{g^{*}}^{T}=\mathbb{P}_{g^{*}}^{\pi}$ and $\hat{h}$ minimizing the general source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)$. Practically, to find general DI representations for MSDA, we solve

\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\left\{\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)+\lambda D\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\right\},

where $\lambda>0$ is a trade-off parameter and $D$ can be any divergence (e.g., Jensen-Shannon (JS) divergence [16], $f$-divergence [40], MMD distance [28], or WS distance [44]).
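A minimal PyTorch sketch of one step of this practical objective is given below, with an RBF-kernel MMD standing in for $D$ (any of the divergences listed above would do); `g`, `h_hat`, the optimizer, and the batches are assumed to be supplied by the user, and this is an illustration of the objective rather than the paper's released implementation.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    # biased estimate of squared MMD with an RBF kernel, used here as the divergence D
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def msda_step(g, h_hat, opt, source_batches, target_x, pi, lam=1.0):
    # source_batches: list of (x_i, y_i) tensors, one per source domain; pi: mixing weights
    opt.zero_grad()
    src_loss, src_feats = 0.0, []
    for (x, y), w in zip(source_batches, pi):
        z = g(x)
        src_feats.append(z)
        src_loss = src_loss + w * torch.nn.functional.cross_entropy(h_hat(z), y)
    z_pi = torch.cat(src_feats)                  # features of the pooled source batch (proxy for P_g^pi)
    z_t = g(target_x)                            # features of the unlabeled target batch (P_g^T)
    loss = src_loss + lam * rbf_mmd2(z_pi, z_t)  # weighted source loss + lambda * D(P_g^T, P_g^pi)
    loss.backward()
    opt.step()
    return loss.item()
```

Pooling the per-domain feature batches approximates the source mixture when batch sizes follow the mixing coefficients $\pi$.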

In addition, general DI representations have been exploited in some works [60, 27] from the practical perspective and previously discussed in [2] from the theoretical perspective. Despite its similarity to our definition, the general DI representation in [2] is not motivated by minimizing a bound on the target loss.

As pointed out in the next section, learning general DI representations increases the span of the latent representations $g(x)$ on the latent space, which might help to reduce $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ for a general target domain. On the other hand, many other works [28, 29, 15, 52, 11] have also explored the possibility of enhancing generalization by finding a common representation among source distributions. The following theorem motivates the latter, which leads to the second kind of DI representations.

Theorem 3.

Consider a mixture of source domains $\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i}$ and the target domain $\mathbb{D}^{T}$. Let $\ell$ be any loss function upper-bounded by a positive constant $L$. For any hypothesis $\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}$ where $\hat{f}=\hat{h}\circ g$ with $g:\mathcal{X}\rightarrow\mathcal{Z}$ and $\hat{h}:\mathcal{Z}\rightarrow\mathcal{Y}_{\Delta}$, the target loss on the input space is upper-bounded as

\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right).   (2)

Evidently, Theorem 3 suggests another kind of representation learning, where the source representation distributions are aligned in order to lower the bound on the target loss, as concretely defined as follows.

Definition 4.

Consider a class $\mathcal{G}$ of feature maps and a class $\mathcal{H}$ of hypotheses. Let $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ be a set of source domains.

i) (DG with unknown target data) A feature map $g^{*}\in\mathcal{G}$ is a DG compressed DI feature map for the source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}$ (i.e., the pushed-forward distributions of all source domains are identical). The latent representations $z=g^{*}(x)$ are then called compressed DI representations for the DG setting.

ii) (MSDA with known target data) A feature map $g^{*}\in\mathcal{G}$ is an MSDA compressed DI feature map for the source domains $\{\mathbb{D}^{S,i}\}_{i=1}^{K}$ if $g^{*}$ is the solution of the optimization problem (OP): $\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)$ which satisfies $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}=\mathbb{P}_{g^{*}}^{T}$ (i.e., the pushed-forward distributions of all source and target domains are identical). The latent representations $z=g^{*}(x)$ are then called compressed DI representations for the MSDA setting.
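The sketch below illustrates one way the compressed-DI constraint is typically relaxed in practice: a pairwise alignment penalty between the pushed-forward source distributions is added to the weighted source loss. The divergence `div` is a parameter (e.g., the `rbf_mmd2` sketch above); all names are illustrative, not the paper's implementation.

```python
import torch

def compressed_di_loss(g, h_hat, source_batches, pi, div, lam=1.0):
    # source_batches: list of (x_i, y_i) tensors, one per source domain; pi: mixing weights
    feats, src_loss = [], 0.0
    for (x, y), w in zip(source_batches, pi):
        z = g(x)
        feats.append(z)
        src_loss = src_loss + w * torch.nn.functional.cross_entropy(h_hat(z), y)
    # relaxation of P_g^{S,1} = ... = P_g^{S,K}: sum of pairwise divergences between source features
    align = sum(div(feats[i], feats[j])
                for i in range(len(feats)) for j in range(i + 1, len(feats)))
    return src_loss + lam * align
```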

2.3 Characteristics of Domain-Invariant Representations

In the previous section, two kinds of DI representations were introduced, each originating from the minimization of different terms in the upper bound of the target loss. In what follows, we examine and discuss their benefits and drawbacks. We start with the novel development of a hypothesis-aware divergence for multiple distributions, which is necessary for the theory developed later.

2.3.1 Hypothesis-Aware Divergence for Multiple Distributions

Let us consider multiple distributions $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ on the same space, with density functions $q_{1},\ldots,q_{C}$. We are given a mixture of these distributions $\mathbb{Q}^{\alpha}=\sum_{i=1}^{C}\alpha_{i}\mathbb{Q}_{i}$ with mixing coefficients $\alpha\in\mathcal{Y}_{\Delta}$ and wish to measure a divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$. One possible solution is to employ a hypothesis from an infinite-capacity hypothesis class $\mathcal{H}$ to identify which distribution the data come from. In particular, given $z\sim\mathbb{Q}^{\alpha}$, the function $\hat{h}(z,i)$, $i\in[C]$, outputs the probability that $z$ was drawn from $\mathbb{Q}_{i}$. Note that we use the notation $\hat{h}$ in this particular subsection to denote a domain classifier, while everywhere else in the paper $\hat{h}$ denotes a label classifier. Let $l(\hat{h}(z),i)\in\mathbb{R}$ be the loss when classifying $z$ using $\hat{h}$ given the ground-truth label $i\in[C]$, and let $\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}(\hat{h})$ denote the corresponding $\alpha$-weighted classification loss. Our motivation is that if the distributions $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ are distant, it is easier to distinguish samples drawn from them, hence the minimum classification loss $\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}(\hat{h})$ is much lower than in the case of clutching $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ together. Therefore, we develop the following theorem to connect this loss minimization problem with an $f$-divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$. More discussion of the hypothesis-aware divergence can be found in Appendix B of this paper and in [13].

Theorem 5.

Assuming the hypothesis class $\mathcal{H}$ has infinite capacity, we define the hypothesis-aware divergence for multiple distributions as

D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\mathcal{C}_{l,\alpha},   (3)

where $\mathcal{C}_{l,\alpha}$ depends only on the form of the loss function $l$ and the value of $\alpha$. This divergence is a proper $f$-divergence among $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ in the sense that $D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)\geq 0$ for all $\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}$ and $\alpha\in\mathcal{Y}_{\Delta}$, and $D^{\alpha}\left(\mathbb{Q}_{1},\ldots,\mathbb{Q}_{C}\right)=0$ if $\mathbb{Q}_{1}=\ldots=\mathbb{Q}_{C}$.
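A sample-based sketch of estimating $D^{\alpha}$ follows, assuming the log loss and using a small neural domain classifier as a stand-in for the infinite-capacity class $\mathcal{H}$; with cross-entropy, one natural choice of $\mathcal{C}_{l,\alpha}$ is the entropy of $\alpha$, in which case the estimate is the mutual information between a sample and its domain index. The architecture and optimizer are arbitrary illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn

def hypothesis_aware_divergence(samples, alpha, dim, steps=500):
    # samples[i]: an (n_i, dim) float tensor of draws from Q_i; alpha: list of mixing coefficients
    alpha = torch.tensor(alpha, dtype=torch.float)
    z = torch.cat(samples)
    d = torch.cat([torch.full((len(s),), i, dtype=torch.long) for i, s in enumerate(samples)])
    w = torch.cat([torch.full((len(s),), a.item() / len(s)) for s, a in zip(samples, alpha)])
    h_d = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, len(samples)))
    opt = torch.optim.Adam(h_d.parameters(), lr=1e-3)
    for _ in range(steps):  # approximate the min over H by fitting a small domain classifier
        opt.zero_grad()
        loss = (w * nn.functional.cross_entropy(h_d(z), d, reduction='none')).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        min_loss = (w * nn.functional.cross_entropy(h_d(z), d, reduction='none')).sum()
    c_alpha = -(alpha * alpha.log()).sum()  # entropy of alpha: a natural C_{l,alpha} for log loss
    return (c_alpha - min_loss).item()      # D^alpha estimate: -min loss + C_{l,alpha}
```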

2.3.2 General Domain-Invariant Representation

As previously defined, the general DI feature map $g^{*}$ (cf. Definition 2) is the one that minimizes the total source loss

g^{*}=\operatorname*{argmin}_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)=\operatorname*{argmin}_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right).   (4)

From the result of Theorem 5, we expect that the minimal loss $\min_{\hat{h}\in\mathcal{H}}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)$ should be inversely related to the divergence between the class-conditional distributions $\mathbb{P}_{g}^{S,i,c}$. To generalize this result to the multi-source setting, we further define $\mathbb{Q}_{g}^{S,c}:=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}$ as the mixture of the class-$c$ conditional distributions of the source domains on the latent space, where $\alpha_{c}=\sum_{j=1}^{K}\pi_{j}\gamma_{j,c}$ are the mixing coefficients and $\gamma_{i,c}=\mathbb{P}^{S,i}(y=c)$ are the label marginals. Then, the inner minimization in the objective of Eq. 4 can be viewed as training the optimal hypothesis $\hat{h}\in\mathcal{H}$ to distinguish samples from $\mathbb{Q}_{g}^{S,c}$, $c\in[C]$, for a given feature map $g$. Therefore, by linking to the multi-distribution divergence concept developed in Section 2.3.1, we achieve the following theorem.

Theorem 6.

Assume that $\mathcal{H}$ has infinite capacity. Then the following statements hold.

1. $D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}$, where $\alpha=[\alpha_{c}]_{c\in[C]}$ is defined as above.

2. Finding the general DI feature map $g^{*}$ via the OP in Eq. 4 is equivalent to solving

g^{*}=\operatorname*{argmax}_{g\in\mathcal{G}}\,D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right).   (5)

Theorem 6, especially its second claim, discloses that learning general DI representations maximally expands the coverage of the latent representations of the source domains by maximizing $D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},\ldots,\mathbb{Q}_{g}^{S,C}\right)$. Hence, the span of the source mixture $\mathbb{P}_{g}^{\pi}=\sum_{c=1}^{C}\alpha_{c}\mathbb{Q}_{g}^{S,c}$ is also increased. We believe this demonstrates one of the benefits of general DI representations, because it implicitly enhances the chance of matching a general unseen target domain in the DG setting by possibly reducing the source-target representation discrepancy term $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ (cf. Theorem 1). Additionally, in MSDA, where we wish to minimize $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ explicitly, expanding the source representations' coverage is also useful.

2.3.3 Compressed Domain-Invariant Representations

It is well known that the generalization gap between the true loss and its empirical estimate affects the generalization capability of a model [51, 47]. One way to close this gap is to increase the sampling density. However, as hinted by our analysis, enforcing general DI representation learning tends to maximize the cross-domain class divergence, hence implicitly increasing the diversity of latent representations from different source domains. This makes learning a classifier $\hat{h}$ on top of those source representations harder, because the samples scatter over the representation space, leading to a higher generalization gap between the empirical and general losses. In contrast, compressed DI representations make the task of learning $\hat{h}$ easier, with a lower generalization gap, by decreasing the diversity of latent representations from the different source domains via enforcing $\mathbb{P}_{g^{*}}^{S,1}=\mathbb{P}_{g^{*}}^{S,2}=\ldots=\mathbb{P}_{g^{*}}^{S,K}$. We now develop rigorous theory to examine this observation.

Let the training set be $S=\{(z_{i},y_{i})\}_{i=1}^{N}$, consisting of independent pairs $(z_{i},y_{i})$ drawn from the source mixture, i.e., the sampling process starts by sampling a domain index $k\sim Cat(\boldsymbol{\pi})$, then sampling $z\sim\mathbb{P}_{g}^{S,k}$ (i.e., $x\sim\mathbb{P}^{S,k}$ and $z=g(x)$), and finally assigning a label $y\sim h^{S,k}(z)$ (i.e., $h^{S,k}(z)$ is a distribution over $[C]$). The empirical loss is defined as

\mathcal{L}\left(\hat{h},S\right)=\frac{1}{\sum_{k=1}^{K}N_{k}}\sum_{k=1}^{K}\sum_{i=1}^{N_{k}}\ell\left(\hat{h}\left(z_{i}\right),h^{S,k}\left(z_{i}\right)\right),   (6)

where $N_{k}$, $k\in[K]$, is the number of samples drawn from the $k$-th source domain and $N=\sum_{k=1}^{K}N_{k}$.

Here, $\mathcal{L}\left(\hat{h},S\right)$ is an unbiased estimate of the general loss $\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)=\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,k}\right)$. To quantify the quality of this estimate, we investigate an upper bound of the generalization gap $\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|$ at confidence level $1-\delta$.
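The short sketch below restates the sampling scheme and the empirical loss in Eq. (6); `sample_latent` and `soft_label` are hypothetical stand-ins for drawing $z\sim\mathbb{P}_{g}^{S,k}$ and evaluating $h^{S,k}(z)$.

```python
import numpy as np

def draw_training_set(rng, pi, sample_latent, soft_label, n):
    data = []
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)        # domain index k ~ Cat(pi)
        z = sample_latent(k)                 # z ~ P_g^{S,k}
        data.append((z, soft_label(k, z)))   # keep the soft label h^{S,k}(z) used in Eq. (6)
    return data

def empirical_loss(ell, h_hat, data):
    # (1 / N) * sum over pooled samples of ell(h_hat(z_i), h^{S,k}(z_i))
    return np.mean([ell(h_hat(z), q) for z, q in data])
```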

Theorem 7.

For any confidence level $1-\delta$ with $\delta\in[0,1]$, over the choice of $S$, the estimated loss is within an $\epsilon$-range of the true loss:

\Pr\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right)\geq 1-\delta,

where $\epsilon=\epsilon(\delta)=\left(\frac{A}{\delta}\right)^{1/2}$ is a function of $\delta$ in which $A$ is proportional to

\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in[K]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}(y|x)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\,d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}

Theorem 7 reveals one benefit of learning compressed DI representations: when enforcing compressed DI representation learning, we minimize $\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\,d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$, which tends to reduce the generalization gap $\epsilon=\epsilon(\delta)$ for a given confidence level $1-\delta$. Therefore, compressed DI representations allow us to minimize the population loss $\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)$ more efficiently via minimizing the empirical loss.

2.4 Trade-off of Learning Domain Invariant Representation

Similar to the theoretical finding in Zhao et al. [56] developed for DA, we theoretically find that compression does come with a cost for MSDA and DG. We investigate the representation trade-off, namely how compressed DI representations affect the classification loss. Specifically, we consider a data processing chain $\mathcal{X}\stackrel{g}{\longmapsto}\mathcal{Z}\stackrel{\hat{h}}{\longmapsto}\mathcal{Y}_{\Delta}$, where $\mathcal{X}$ is the common data space, $\mathcal{Z}$ is the latent space induced by a feature extractor $g$, and $\hat{h}$ is a hypothesis on top of the latent space. We define $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ as two distributions over $\mathcal{Y}$: to draw $y\sim\mathbb{P}_{\mathcal{Y}}^{\pi}$, we sample $k\sim Cat(\pi)$, $x\sim\mathbb{P}^{S,k}$, and $y\sim f^{S,k}(x)$, and $y\sim\mathbb{P}_{\mathcal{Y}}^{T}$ is drawn analogously from the target domain. Our theoretical bounds regarding the trade-off of learning DI representations involve $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)$.
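As a concrete illustration of this label-marginal construction, the sketch below estimates $\mathbb{P}_{\mathcal{Y}}^{\pi}$ by ancestral sampling and compares it to an estimated $\mathbb{P}_{\mathcal{Y}}^{T}$ with $d_{1/2}$; `sample_x` and `label_dist` are hypothetical stand-ins for the domains' samplers and labeling functions.

```python
import numpy as np

def estimate_label_marginal(rng, pi, sample_x, label_dist, C, n=10000):
    # ancestral sampling: k ~ Cat(pi), x ~ P^{S,k}, y ~ f^{S,k}(x); returns the empirical marginal over Y
    counts = np.zeros(C)
    for _ in range(n):
        k = rng.choice(len(pi), p=pi)
        x = sample_x(k)
        y = rng.choice(C, p=label_dist(k, x))
        counts[y] += 1
    return counts / n

def d_half(p, q):
    # Hellinger distance between two label marginals, as in the bounds of Theorem 8
    return np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```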

Theorem 8.

Consider a feature extractor $g$ and a hypothesis $\hat{h}$. The Hellinger distance between the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ can be upper-bounded as:

1. $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h}\circ g,f^{S,k},\mathbb{P}^{S,k}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$

2. $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.$

Here we note that the general loss $\mathcal{L}$ is defined based on the Hellinger loss $\ell$, which is defined as

$\ell(\hat{f}(x),f(x))=D_{1/2}(\hat{f}(x),f(x))=2\sum_{i=1}^{C}\left(\sqrt{\hat{f}(x,i)}-\sqrt{f(x,i)}\right)^{2}$ (more discussion can be found in Appendix C).

Remark. Compared to the trade-off bound in the work of Zhao et al. [56], our context is more general, concerning the MSDA and DG problems with multiple source domains and multi-class probabilistic labeling functions, rather than single-source DA with a binary-class, deterministic setting. Moreover, the Hellinger distance is more universal in the sense that it does not depend on the choice of the classifier family $\mathcal{H}$ and loss function $\ell$, unlike the $\mathcal{H}$-divergence in [56].

We base on the first inequality of Theorem 8 to analyze the trade-off of learning general DI representations. The first term on the right-hand side is the source mixture's loss, which is controllable and tends to be small when enforcing learning general DI representations. With that in mind, if the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ are distant (i.e., $d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)$ is high), the sum $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$ tends to be high. This leads to two possibilities. The first scenario is when the representation discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ is small, e.g., because it is minimized in the MSDA setting, or because it happens to be small by pure chance in the DG setting. In this case, the lower bound of the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ is high, possibly hurting the model's generalization ability. On the other hand, if the discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ is large for some reason, the lower bound of the target loss will be small, but its upper bound is higher, as indicated by Theorem 1.

Based on the second inequality of Theorem 8, we observe that if the two label marginal distributions $\mathbb{P}_{\mathcal{Y}}^{\pi}$ and $\mathbb{P}_{\mathcal{Y}}^{T}$ are distant while we enforce learning compressed DI representations (i.e., both the source loss and the source-source feature discrepancy $\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$ are low), then the sum $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}$ is high. For the MSDA setting, the discrepancy $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)$ is trained to get smaller, meaning that the lower bound of the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ is high, hurting the target performance. Similarly, for the DG setting, if the trained feature extractor $g$ occasionally reduces this discrepancy for some unseen target domain, it certainly increases the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$. In contrast, if for some target domains the discrepancy $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)$ is high for some reason, then by linking to the upper bound in Theorem 3, the target general loss has a high upper bound, hence is possibly high.

This trade-off between representation discrepancy and target loss suggests a sweet spot for just-right feature alignment. In that case, the target loss is most likely to be small.

3 Experiment

To investigate the trade-off of learning DI representations, we conduct domain generalization experiments on the colored MNIST dataset (CC BY-SA 3.0) [2, 24]. In particular, the task is to predict the binary label $Y$ of colored input images $X$ generated from a binary digit feature $Z_{d}$ and a color feature $Z_{c}$. We refer readers to Appendix D for more information about this dataset.

We construct source domains by setting the color-label correlation $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta^{S,i}$, where $\theta^{S,i}\sim Uni([0.6,1])$ for $i=1,\ldots,12$, while two target domains are created with $\theta^{T,i}\in\{0.05,0.7\}$ for $i=1,2$. Here we note that colored images in the target domain with $\theta^{T,2}=0.7$ are more similar to those in the source domains, while colored images in the target domain with $\theta^{T,1}=0.05$ are less similar.
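A simplified sketch of building one such domain is shown below. The exact recipe is given in Appendix D; in particular, deriving the binary label from whether the digit is at least 5 is an assumption made here for illustration, not a claim about the paper's construction.

```python
import numpy as np

def make_domain(images, digits, theta, rng):
    # images: (n, 28, 28) grayscale MNIST digits; digits: (n,) digit classes 0..9
    y = (digits >= 5).astype(int)             # assumed binary label Y from the digit feature Z_d
    agree = rng.random(len(y)) < theta        # with probability theta the color agrees with the label
    z_c = np.where(agree, y, 1 - y)           # color feature Z_c: P(Z_c=1|Y=1)=P(Z_c=0|Y=0)=theta
    colored = np.zeros((len(y), 2) + images.shape[1:], dtype=images.dtype)
    colored[np.arange(len(y)), z_c] = images  # write the digit into the chosen color channel
    return colored, y

# e.g., source domains with theta ~ Uni([0.6, 1]) and two target domains with theta in {0.05, 0.7}
```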

We wish to study the characteristics of, and the trade-off between, the two kinds of DI representations when predicting on various target domains. Specifically, we apply adversarial learning [16] similar to [14], in which a min-max game is played between a domain discriminator $\hat{h}^{d}$, which tries to distinguish the source domain given a representation, and the feature extractor (generator) $g$, which tries to fool the domain discriminator. Simultaneously, a classifier is used to predict the label based on the representation. Let $\mathcal{L}_{gen}$ and $\mathcal{L}_{disc}$ be the label classification and domain discrimination losses; the training objective becomes:

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda\max_{\hat{h}^{d}}\mathcal{L}_{disc}\right),

where the source compression strength $\lambda>0$ controls the compression extent of the learned representations. More specifically, general DI representations are obtained with $\lambda=0$, while a larger $\lambda$ leads to more compressed DI representations. Finally, our implementation is based on the DomainBed [17] repository, and all training details, as well as a further MSDA experiment and DG experiments on real datasets, are included in Appendix D. Our code can be found at: https://github.com/VinAIResearch/DIRep
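For concreteness, here is a minimal gradient-reversal sketch, in the spirit of DANN [14], of one training step for the objective above; `g`, `h_hat`, `h_d`, the optimizer, and the batch variables are assumed to exist, and this is an illustrative sketch rather than the released DomainBed-based code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # identity in the forward pass; scales and flips the gradient by lam in the backward pass
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def dann_step(g, h_hat, h_d, opt, x, y, domain_idx, lam):
    # x, y: pooled source batch; domain_idx: source-domain index of each sample
    opt.zero_grad()
    z = g(x)
    loss_gen = nn.functional.cross_entropy(h_hat(z), y)                                  # label loss
    loss_disc = nn.functional.cross_entropy(h_d(GradReverse.apply(z, lam)), domain_idx)  # domain loss
    (loss_gen + loss_disc).backward()
    opt.step()
    return loss_gen.item(), loss_disc.item()
```

Here the discriminator fits the domain labels while the reversed gradient of strength $\lambda$ pushes $g$ in the opposite direction, so $g$ is trained to fool $\hat{h}^{d}$, matching the min-max game described above.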

3.1 Trade-off of Two Kinds of Domain-Invariant Representations

Figure 1: Phase transition of representations with increasing $\lambda$. (a) General DI representations. (b) Just-right compressed DI representations. (c) Overly-compressed DI representations.

In this experiment, we consider a target domain that is similar to the source domains. To control the trade-off between the two kinds of DI representations, we gradually increase the source compression strength $\lambda$, noting that $\lambda=0$ means focusing only on learning general DI representations. According to our theory and as shown in Figure 1, learning only general DI representations ($\lambda=0$) maximizes the cross-domain class divergence by encouraging the classes of the source domains to separate arbitrarily far apart, while increasing $\lambda$ enforces compressing the latent representations of the source domains together. Therefore, for an appropriate $\lambda$, the class separation from general DI representations and the source compression from compressed DI representations are balanced (i.e., just-right compressed DI representations), as in case (b) of Figure 1, while for an overly high $\lambda$, the source compression dominates and compromises the class separation (i.e., overly-compressed DI representations).

Figure 2a shows the source validation and target accuracies as $\lambda$ increases (i.e., as source compression is encouraged). It can be observed that both the source validation accuracy and the target accuracy follow the same pattern: they increase when an appropriate $\lambda$ is set for just-right compressed DI representations and degrade when overly high values of $\lambda$ are set for overly-compressed DI representations. Figure 2b shows in detail the evolution of the source validation accuracy for each specific $\lambda$. In practice, we should encourage learning the two kinds of DI representations simultaneously by finding an appropriate trade-off that balances them, yielding just-right compressed DI representations.

3.2 Generalization Capacity on Various Target Domains

Figure 2: (a) Source validation accuracy and target accuracy for the close target domain (see Section 3.2) over compression strength $\lambda$. (b) Validation accuracy over training steps for different values of $\lambda$.

Figure 3: Accuracy of models trained with different $\lambda$ on two test domains: (a) far target domain with $\theta=0.05$; (b) close target domain with $\theta=0.7$. For the far domain with $\theta=0.05$, the model with general DI representations performs better than those with more compact DI representations. For the close domain with $\theta=0.7$, the models with compression generalize significantly better than those with general DI representations.

In this experiment, we inspect the generalization capacity of models trained on the source domains with different compression strengths $\lambda$ when evaluated on different target domains (i.e., close and far ones). As shown in Figure 3, the target accuracy is lower for the far target domain than for the close one regardless of the source compression strength $\lambda$. This observation aligns with the upper bounds developed in Theorems 1 and 3, in which a larger representation discrepancy increases the upper bound on the general target loss, hence more likely hurting the target performance.

Another interesting observation is that for the far target domain, the target accuracy peaks at $\lambda=0$. This can be partly explained by the fact that general DI representations help increase the span of the source latent representations, giving more chance to match a far target domain (cf. Section 2.3.2). Specifically, the source-target discrepancy $d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)$ could be small and the upper bound in Theorem 1 lower. In contrast, for the close target domain, compressed DI representations really do improve the target performance, while over-compression degrades it, similar to the result on the source domains. This is because the target domain is naturally close to the source domains, so the source and target latent representations are already mixed up, and by encouraging compressed DI representations, we learn more elegant representations that improve the target performance.

4 Related Works

Domain adaptation (DA) has been intensively studied from both theoretical [3, 9, 6, 56, 34, 45] and empirical [46, 55, 14, 7, 54] perspectives. Notably, the pioneering work [3] and other theoretical works [45, 55, 56, 22, 31, 6] have investigated DA in various aspects, laying the foundation for the success of practical works [14, 50, 46, 55, 7, 54]. Multiple-source domain adaptation (MSDA) extends DA by gathering the training set from multiple source domains [8, 30, 21, 44, 45, 12, 57, 53, 42]. In contrast to DA with its abundant theoretical works, theoretical studies of MSDA are significantly limited. Notably, the works in [32, 33] rely on the assumptions that the same labeling function can be used for all source domains and that the target domain is an unknown mixture of the source domains. They then show that, given a set of source expert hypotheses (each with loss at most $\epsilon$ on its respective domain), there exists a distribution-weighted combination of these experts performing well on the target domain with loss at most $\epsilon$. Hoffman et al. [19] further develop this idea to a more general case where different labeling functions correspond to different source domains, and the target domain is arbitrary. Under this setting, there exists a distribution-weighted combination of source experts such that the loss on the target domain is bounded by a source loss term and a discrepancy term between the target domain and the source mixture. On the other hand, Zhao et al. [57] directly extend Ben-David et al.'s work with an improvement on the sample complexity term.

Domain generalization (DG) [17, 26, 4, 20] is the most challenging setting among the three due to the unavailability of target data. The studies in [4, 10] use kernel methods to find a feature map which minimizes the expected target loss. Moreover, those in [28, 29, 35, 15, 52, 11] learn compressed DI representations by minimizing different types of domain dissimilarity. Other notions of invariance have also been proposed; for example, [2] learns a latent space on which representation distributions do not need to be aligned but share a common optimal hypothesis. Another work [43] uncovers a domain-invariant hypothesis from a low-rank decomposition of domain-specific hypotheses. Moreover, there are other works which try to strike a balance between the two types of representation spaces [5].

5 Conclusion

In this paper, we derive theoretical bounds for the target loss in the MSDA and DG settings which characterize two types of representations: general DI representations, for learning an invariant classifier that works on all source domains, and compressed DI representations, motivated by reducing the inter-domain representation discrepancy. We further characterize the properties of these two representations, and develop a lower bound on the target loss which governs the trade-off between learning them. Finally, we conduct experiments on the Colored MNIST dataset and real datasets to illustrate our theoretical claims.

References

  • [1] Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H., and Mitliagkas, I. Generalizing to unseen domains via distribution matching, 2020.
  • [2] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2020.
  • [3] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Mach. Learn. 79, 1–2 (May 2010), 151–175.
  • [4] Blanchard, G., Lee, G., and Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems (2011), vol. 24.
  • [5] Chattopadhyay, P., Balaji, Y., and Hoffman, J. Learning to balance specificity and invariance for in and out of domain generalization. arXiv preprint arXiv:2008.12839 (2020).
  • [6] Cortes, C., and Mohri, M. Domain adaptation in regression. In Algorithmic Learning Theory - 22nd International Conference, ALT 2011, Proceedings (2011), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 308–323. Copyright: Copyright 2011 Elsevier B.V., All rights reserved.; 22nd International Conference on Algorithmic Learning Theory, ALT 2011 ; Conference date: 05-10-2011 Through 07-10-2011.
  • [7] Cortes, C., Mohri, M., and Medina, A. M. Adaptation based on generalized discrepancy. Journal of Machine Learning Research 20, 1 (2019), 1–30.
  • [8] Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research 9, 57 (2008), 1757–1774.
  • [9] David, S. B., Lu, T., Luu, T., and Pal, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (13–15 May 2010), vol. 9 of Proceedings of Machine Learning Research, PMLR, pp. 129–136.
  • [10] Deshmukh, A. A., Lei, Y., Sharma, S., Dogan, U., Cutler, J. W., and Scott, C. A generalization error bound for multi-class domain generalization, 2019.
  • [11] Dou, Q., de Castro, D. C., Kamnitsas, K., and Glocker, B. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (2019), pp. 6450–6461.
  • [12] Duan, L., Xu, D., and Chang, S.-F. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 1338–1345.
  • [13] Duchi, J. C., Khosravi, K., and Ruan, F. Multiclass classification, information, divergence, and surrogate risk, 2017.
  • [14] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M., and Lempitsky, V. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
  • [15] Ghifary, M., Bastiaan Kleijn, W., Zhang, M., and Balduzzi, D. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision (2015), pp. 2551–2559.
  • [16] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (2014), vol. 27, Curran Associates, Inc.
  • [17] Gulrajani, I., and Lopez-Paz, D. In search of lost domain generalization. CoRR abs/2007.01434 (2020).
  • [18] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.
  • [19] Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
  • [20] Huang, Z., Wang, H., Xing, E. P., and Huang, D. Self-challenging improves cross-domain generalization. Proceedings of the European Conference on Computer Vision (ECCV) (2020).
  • [21] Redko, I., Habrard, A., and Sebban, M. On the analysis of adaptability in multi-source domain adaptation. Mach. Learn. 108 (June 2019), 1635–1652.
  • [22] Johansson, F. D., Sontag, D., and Ranganath, R. Support and invertibility in domain-invariant representations. In Proceedings of Machine Learning Research (2019), vol. 89, pp. 527–536.
  • [23] Le, T., Nguyen, T., Ho, N., Bui, H., and Phung, D. Lamda: Label matching deep domain adaptation. In ICML (2021).
  • [24] LeCun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
  • [25] Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. Learning to generalize: Meta-learning for domain generalization.
  • [26] Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision (2017), pp. 5542–5550.
  • [27] Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.-Z., and Hospedales, T. M. Episodic training for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019).
  • [28] Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018).
  • [29] Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., and Tao, D. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) (September 2018).
  • [30] Mansour, Y., Mohri, M., Ro, J., Theertha Suresh, A., and Wu, K. A theory of multiple-source adaptation with limited target labeled data. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (13–15 Apr 2021), A. Banerjee and K. Fukumizu, Eds., vol. 130 of Proceedings of Machine Learning Research, PMLR, pp. 2332–2340.
  • [31] Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. vol. 22.
  • [32] Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems (2009), D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., vol. 21, Curran Associates, Inc.
  • [33] Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the renyi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (Arlington, Virginia, USA, 2009), UAI ’09, AUAI Press, p. 367–374.
  • [34] Mansour, Y., and Schain, M. Robust domain adaptation. Annals of Mathematics and Artificial Intelligence 71 Issue 4 (2014), 365–380.
  • [35] Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning (Atlanta, Georgia, USA, 17–19 Jun 2013), S. Dasgupta and D. McAllester, Eds., vol. 28 of Proceedings of Machine Learning Research, PMLR, pp. 10–18.
  • [36] Muller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems (2019), vol. 32, Curran Associates, Inc.
  • [37] Nguyen, T., Le, T., Zhao, H., Tran, H. Q., Nguyen, T., and Phung, D. Most: Multi-source domain adaptation via optimal transport for student-teacher learning. In UAI (2021).
  • [38] Nguyen, T., Le, T., Zhao, H., Tran, H. Q., Nguyen, T., and Phung, D. Tidot: A teacher imitation learning approach for domain adaptation with optimal transport. In IJCAI (2021).
  • [39] Nguyen, V. A., Nguyen, T., Le, T., Tran, Q. H., and Phung, D. Stem: An approach to multi-source domain adaptation with guarantees. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 9352–9363.
  • [40] Nguyen, X., Wainwright, M. J., and Jordan, M. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems (2008), J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds., vol. 20, Curran Associates, Inc.
  • [41] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), IEEE Computer Society, pp. 1406–1415.
  • [42] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019).
  • [43] Piratla, V., Netrapalli, P., and Sarawagi, S. Efficient domain generalization via common-specific low-rank decomposition. In Proceedings of the 37th International Conference on Machine Learning (13–18 Jul 2020), vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 7728–7738.
  • [44] Redko, I., Courty, N., Flamary, R., and Tuia, D. Optimal transport for multi-source domain adaptation under target shift. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (16–18 Apr 2019), K. Chaudhuri and M. Sugiyama, Eds., vol. 89 of Proceedings of Machine Learning Research, PMLR, pp. 849–858.
  • [45] Redko, I., Habrard, A., and Sebban, M. Theoretical analysis of domain adaptation with optimal transport, 2017.
  • [46] Saito, K., Kim, D., Sclaroff, S., and Saenko, K. Universal domain adaptation through self supervision. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 16282–16292.
  • [47] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [48] Sicilia, A., Zhao, X., and Hwang, S. J. Domain adversarial neural networks for domain generalization: When it works and how to improve, 2021.
  • [49] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 2818–2826.
  • [50] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).
  • [51] Vapnik, V. Statistical Learning Theory. Wiley, 1998.
  • [52] Wang, H., He, Z., Lipton, Z. C., and Xing, E. P. Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations (2019).
  • [53] Xu, R., Chen, Z., Zuo, W., Yan, J., and Lin, L. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018).
  • [54] Yan, S., Song, H., Li, N., Zou, L., and Ren, L. Improve unsupervised domain adaptation with mixup training, 2020.
  • [55] Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning (09–15 Jun 2019), K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 7404–7413.
  • [56] Zhao, H., Combes, R. T. D., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning (09–15 Jun 2019), K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 7523–7532.
  • [57] Zhao, H., Zhang, S., Wu, G., Moura, J. M. F., Costeira, J. P., and Gordon, G. J. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
  • [58] Zhao, S., Gong, M., Liu, T., Fu, H., and Tao, D. Domain generalization via entropy regularization. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 16096–16107.
  • [59] Zhao, S., Wang, G., Zhang, S., Gu, Y., Li, Y., Song, Z., Xu, P., Hu, R., Chai, H., and Keutzer, K. Multi-source distilling domain adaptation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12975–12983.
  • [60] Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. Learning to generate novel domains for domain generalization. In European Conference on Computer Vision (2020), Springer, pp. 561–578.

This supplementary material provides proofs for the theoretical results stated in the main paper, as well as detailed experimental settings and further experimental results. It is organized as follows:

  • Appendix A contains the proofs for the upper bounds of target loss introduced in Section 2.2 of our main paper.

  • Appendix B contains the proofs for characteristics of two representations as mentioned in Section 2.3 of our main paper.

  • Appendix C contains the proof for trade-off theorem discussed in Section 2.4 of our main paper.

  • Finally, in Appendix D, we present the generative details of the Colored MNIST dataset and the experimental settings, together with additional results for MSDA on Colored MNIST and DG on the real-world PACS dataset.

Appendix A: Target Loss’s Upper Bounds

We begin with a crucial proposition for our theory development, which allows us to connect the loss on the feature space to the loss on the data space.

Proposition 9.

Let f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}. Let c:𝒵×[C]c:\mathcal{Z}\times\left[C\right]\mapsto\mathbb{R} be a positive function.

i) For any i[C]i\in\left[C\right], we have

h(z,i)c(h^(z),i)pg(z)𝑑z=f(x,i)c(f^(x),i)p(x)𝑑x,\int h\left(z,i\right)c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz=\int f\left(x,i\right)c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx,

ii) For any i[C]i\in\left[C\right], we have

|h(z,i)h(z,i)|c(h^(z),i)pg(z)𝑑z|f(x,i)f(x,i)|c(f^(x),i)p(x)𝑑x,\int\left|h\left(z,i\right)-h^{\prime}\left(z,i\right)\right|c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz\leq\int\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx,

where p is the density of a distribution \mathbb{P} on \mathcal{X}, p_{g} is the density of the distribution \mathbb{P}_{g}=g_{\#}\mathbb{P}, f:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} and f^{\prime}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} are labeling functions, and h:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta} and h^{\prime}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta} are the labeling functions on the latent space induced from f and f^{\prime}, i.e., h(z,i)=\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx} and h^{\prime}(z,i)=\frac{\int_{g^{-1}\left(z\right)}f^{\prime}\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}.

Proof.

i) Using the definition of hh, we manipulate the integral on feature space as

𝒵h(z,i)c(h^(z),i)pg(z)𝑑z\displaystyle\int_{\mathcal{Z}}h\left(z,i\right)c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz =(1)𝒵c(h^(z),i)g1(z)f(x,i)p(x)𝑑xg1(z)p(x)𝑑xg1(z)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(1\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}\int_{g^{-1}\left(z\right)}p\left(x^{\prime}\right)dx^{\prime}dz
=𝒵c(h^(z),i)g1(z)f(x,i)p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dxdz
=(2)𝒵g1(z)c(h^(g(x)),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(2\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}\int_{g^{-1}\left(z\right)}c\left(\hat{h}\left(g\left(x\right)\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=(3)𝒵𝒳𝕀xg1(z)c(h^(z),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(3\right)}}{{=}}\int_{\mathcal{\mathcal{Z}}}\int_{\mathcal{X}}\mathbb{I}_{x\in g^{-1}\left(z\right)}c\left(\hat{h}\left(z\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=(4)𝒳𝒵𝕀z=g(x)c(h^(z),i)f(x,i)p(x)𝑑x𝑑z\displaystyle\stackrel{{\scriptstyle\left(4\right)}}{{=}}\int_{\mathcal{X}}\int_{\mathcal{\mathcal{Z}}}\mathbb{I}_{z=g\left(x\right)}c\left(\hat{h}\left(z\right),i\right)f\left(x,i\right)p\left(x\right)dxdz
=𝒳c(h^(g(x)),i)f(x,i)p(x)𝑑x\displaystyle=\int_{\mathcal{X}}c\left(\hat{h}\left(g\left(x\right)\right),i\right)f\left(x,i\right)p\left(x\right)dx
=𝒳c(f^(x),i)f(x,i)p(x)𝑑x.\displaystyle=\int_{\mathcal{X}}c\left(\hat{f}\left(x\right),i\right)f\left(x,i\right)p\left(x\right)dx.

In \left(1\right), we use the definition of the push-forward distribution, \int_{B}p_{g}\left(z\right)dz=\int_{B}dz\int_{g^{-1}\left(z\right)}p\left(x\right)dx. In \left(2\right), c\left(\hat{h}\left(z\right),i\right) can be moved inside the inner integral because z=g\left(x\right) for any x\in g^{-1}\left(z\right). In \left(3\right), the integral over the restricted region is expanded to the whole space with the help of the indicator \mathbb{I}_{x\in g^{-1}\left(z\right)}, whose value is 1 if x\in g^{-1}\left(z\right) and 0 otherwise. In \left(4\right), Fubini's theorem is invoked to swap the integral signs.

ii) Using the same technique, we have

|h(z,i)h(z,i)|c(h^(z),i)pg(z)𝑑z\displaystyle\int\left|h\left(z,i\right)-h^{\prime}\left(z,i\right)\right|c\left(\hat{h}\left(z\right),i\right)p_{g}\left(z\right)dz
=𝒵c(h^(z),i)|g1(z)f(x,i)p(x)𝑑xg1(z)f(x,i)p(x)𝑑x|g1(z)p(x)𝑑xg1(z)p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\frac{\left|\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx-\int_{g^{-1}\left(z\right)}f^{\prime}\left(x,i\right)p\left(x\right)dx\right|}{\int_{g^{-1}\left(z\right)}p\left(x^{\prime}\right)dx^{\prime}}\int_{g^{-1}\left(z\right)}p\left(x\right)dxdz
=𝒵c(h^(z),i)|g1(z)(f(x,i)f(x,i))p(x)𝑑x|𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\left|\int_{g^{-1}\left(z\right)}\left(f\left(x,i\right)-f^{\prime}\left(x,i\right)\right)p\left(x\right)dx\right|dz
𝒵c(h^(z),i)g1(z)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle\leq\int_{\mathcal{\mathcal{Z}}}c\left(\hat{h}\left(z\right),i\right)\int_{g^{-1}\left(z\right)}\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒵𝒳𝕀xg1(z)c(h^(z),i)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{\mathcal{Z}}}\int_{\mathcal{X}}\mathbb{I}_{x\in g^{-1}\left(z\right)}c\left(\hat{h}\left(z\right),i\right)\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒳𝒵𝕀z=g(x)c(h^(g(x)),i)|f(x,i)f(x,i)|p(x)𝑑x𝑑z\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{\mathcal{Z}}}\mathbb{I}_{z=g\left(x\right)}c\left(\hat{h}\left(g\left(x\right)\right),i\right)\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|p\left(x\right)dxdz
=𝒳|f(x,i)f(x,i)|c(f^(x),i)p(x)𝑑x.\displaystyle=\int_{\mathcal{X}}\left|f\left(x,i\right)-f^{\prime}\left(x,i\right)\right|c\left(\hat{f}\left(x\right),i\right)p\left(x\right)dx.

Proposition 9 allows us to connect expected loss on feature space and input space as in the following corollary.

Corollary 10.

Consider a domain \mathbb{D}=\left(\mathbb{P},f\right) with data distribution \mathbb{P} and ground-truth labeling function f. A hypothesis is \hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta}, where \hat{f}=\hat{h}\circ g with g:\mathcal{X}\mapsto\mathcal{Z} and \hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}. Then, the labeling function f on the input space induces a ground-truth labeling function on the feature space, h(z,i)=\frac{\int_{g^{-1}\left(z\right)}f\left(x,i\right)p\left(x\right)dx}{\int_{g^{-1}\left(z\right)}p\left(x\right)dx}. Let \ell:\mathcal{Y}_{\Delta}\times\mathcal{Y}_{\Delta}\mapsto\mathbb{R} be the loss function; the expected loss can then be computed either w.r.t. the input space, \mathcal{L}\left(\hat{f},f,\mathbb{P}\right)=\int\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)p\left(x\right)dx, or w.r.t. the feature space, \mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right)=\int\ell\left(\hat{h}\left(z\right),h\left(z\right)\right)p_{g}\left(z\right)dz. Assume the loss \ell\left(\cdot,\cdot\right) has the form mentioned in the main paper, that is,

(u,v)=i=1Cl(u,i)vi,\ell\left(u,v\right)=\sum_{i=1}^{C}l\left(u,i\right)v_{i},

for any two simplexes u,v𝒴Δu,v\in\mathcal{Y}_{\Delta}, where u=[ui]i=1Cu=\left[u_{i}\right]_{i=1}^{C} and v=[vi]i=1Cv=\left[v_{i}\right]_{i=1}^{C}. Then the losses w.r.t. input space and feature space are the same, i.e.,

(h^,h,g#)=(f^,f,).\mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right)=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right).
Proof.

We derive as

(h^,h,g#)\displaystyle\mathcal{L}\left(\hat{h},h,g_{\#}\mathbb{P}\right) =𝒵(h^(z),h(z))p(z)𝑑z=i=1C𝒵l(h^(z),i)h(z,i)p(z)𝑑z\displaystyle=\int_{\mathcal{Z}}\ell\left(\hat{h}\left(z\right),h\left(z\right)\right)p\left(z\right)dz=\sum_{i=1}^{C}\int_{\mathcal{Z}}l\left(\hat{h}\left(z\right),i\right)h\left(z,i\right)p\left(z\right)dz
=(1)i=1C𝒳l(f^(x),i)f(x,i)p(x)𝑑x=𝒳(f^(x),f(x))p(x)𝑑x\displaystyle\stackrel{{\scriptstyle(1)}}{{=}}\sum_{i=1}^{C}\int_{\mathcal{X}}l\left(\hat{f}\left(x\right),i\right)f\left(x,i\right)p\left(x\right)dx=\int_{\mathcal{X}}\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)p\left(x\right)dx
=(f^,f,).\displaystyle=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right).

Here we have =(1)\stackrel{{\scriptstyle(1)}}{{=}} by using (i) in Proposition 9 with c(β,i)=l(β,i)c\left(\beta,i\right)=l\left(\beta,i\right) for any β𝒴Δ\beta\in\mathcal{Y}_{\Delta} and i[C]i\in\left[C\right]. ∎
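To make the equality of Corollary 10 concrete, the following minimal numerical sketch (an illustration we add here, not part of the formal development) checks it on a finite input space with a deterministic feature map and the per-class loss l(u,i) = -log u_i; all variable names are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, m, C = 12, 4, 3
g = rng.integers(0, m, size=n)                  # deterministic feature map g: X -> Z
p = rng.dirichlet(np.ones(n))                   # data distribution P on a finite X
f = rng.dirichlet(np.ones(C), size=n)           # ground-truth labeling function f(x)
hhat = rng.dirichlet(np.ones(C), size=m)        # hypothesis hhat on the latent space
fhat = hhat[g]                                  # fhat = hhat o g
l = lambda u: -np.log(u)                        # per-class loss l(u, i) = -log u_i

# induced quantities on the feature space
pg = np.array([p[g == z].sum() for z in range(m)])
h = np.array([(f[g == z] * p[g == z][:, None]).sum(0) / max(p[g == z].sum(), 1e-300)
              for z in range(m)])

loss_input = np.sum(p[:, None] * l(fhat) * f)   # L(fhat, f, P)
loss_feat = np.sum(pg[:, None] * l(hhat) * h)   # L(hhat, h, g_# P)
print(loss_input, loss_feat)                    # the two expected losses coincide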

Typically, the target loss \mathcal{L}\left(\hat{f},f^{T},\mathbb{P}^{T}\right) is bounded by the source loss \mathcal{L}\left(\hat{f},f^{S},\mathbb{P}^{S}\right), a label shift term LS\left(f^{T},f^{S}\right), and a data shift term DS\left(\mathbb{P}^{T},\mathbb{P}^{S}\right) [3]. This kind of bound is developed using the data distribution \mathbb{P} on the input space and the labeling function f from the input to the label space, which is not convenient for understanding representation learning, since \mathbb{P}^{T} and \mathbb{P}^{S} are given by the nature of the data and therefore fixed. In order to relate the target loss to properties of the learned representations, another bound is preferable, in which \mathcal{L}\left(\hat{h},h,\mathbb{P}_{g}\right) is the loss w.r.t. the feature space and DS\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S}\right) is the data shift on the feature space. However, this naive approach presents a pitfall, since the loss \mathcal{L}\left(\hat{h},h,\mathbb{P}_{g}\right) is not identical to the loss w.r.t. the input space \mathcal{L}\left(\hat{f},f,\mathbb{P}\right), which is of ultimate interest, e.g., the to-be-bounded target loss \mathcal{L}\left(\hat{f},f^{T},\mathbb{P}^{T}\right) or the to-be-minimized source loss \mathcal{L}\left(\hat{f},f^{S},\mathbb{P}^{S}\right). Using the previous proposition and corollary, we can bridge this gap and develop a target bound connecting both the data space and the feature space.

Theorem 11.

(Theorem 1 in the main paper) Consider a mixture of source domains 𝔻π=i=1Kπi𝔻S,i\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i} and the target domain 𝔻T\mathbb{D}^{T}. Let \ell be any loss function upper-bounded by a positive constant LL. For any hypothesis f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}, the target loss on input space is upper bounded

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ),\begin{aligned} \mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\end{aligned}, (7)

where \Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C} is the absolute pointwise label shift on the input space between the source domain \mathbb{D}^{S,i} and the target domain \mathbb{D}^{T}, [K]:=\left\{1,2,...,K\right\}, and \mathbb{P}_{g}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{g}^{S,i} is the feature distribution of the source mixture.

Proof.

First, consider the hybrid domain \mathbb{D}_{g}^{hy}=\left(\mathbb{P}_{g}^{\pi},h^{T}\right), where \mathbb{P}_{g}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{g}^{S,i} is the feature distribution of the source mixture and h^{T} is the induced ground-truth labeling function of the target domain. The loss on the hybrid domain is then upper-bounded by the loss on the source mixture and a label shift term. We derive as follows:

\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)=\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)p_{g}^{\pi}\left(z\right)dz
=\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)\sum_{i=1}^{K}\pi_{i}p_{g}^{S,i}\left(z\right)dz=\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz
\leq\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz
+\sum_{i=1}^{K}\pi_{i}\int\left|\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)-\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)\right|p_{g}^{S,i}\left(z\right)dz.

Firstly, using Corollary 10, the loss terms on feature space and input space are equal

i=1Kπi(h^(z),hS,i(z))pgS,i(z)𝑑z=i=1Kπi(f^,fS,i,S,i)\sum_{i=1}^{K}\pi_{i}\int\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)p_{g}^{S,i}\left(z\right)dz=\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},f^{S,i},\mathbb{P}^{S,i}\right)

Secondly, the difference term can be transformed into a label shift on the input space using Proposition 9:

\sum_{i=1}^{K}\pi_{i}\int_{\mathcal{Z}}\left|\ell\left(\hat{h}\left(z\right),h^{T}\left(z\right)\right)-\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right)\right|p_{g}^{S,i}\left(z\right)dz
\leq\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{Z}}\ell\left(\hat{h}\left(z\right),j\right)\left|h^{T}\left(z,j\right)-h^{S,i}\left(z,j\right)\right|p_{g}^{S,i}\left(z\right)dz
\leq L\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{Z}}\left|h^{T}\left(z,j\right)-h^{S,i}\left(z,j\right)\right|p_{g}^{S,i}\left(z\right)dz
\stackrel{\left(1\right)}{\leq}L\sum_{i=1}^{K}\pi_{i}\sum_{j=1}^{C}\int_{\mathcal{X}}\left|f^{T}\left(x,j\right)-f^{S,i}\left(x,j\right)\right|p^{S,i}\left(x\right)dx
\leq L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}\left(y|x\right)\|_{1}\right].

Here we note that \stackrel{\left(1\right)}{\leq} results from (ii) in Proposition 9 with c\left(\beta,i\right)=1 for any \beta\in\mathcal{Y}_{\Delta} and i\in\left[C\right]. Furthermore, \Delta p^{i}(y|x):=\left[\left|f^{T}(x,y)-f^{S,i}(x,y)\right|\right]_{y=1}^{C} is the absolute pointwise label shift on the input space between the source domain \mathbb{D}^{S,i} and the target domain \mathbb{D}^{T}.

With these two terms, we have the upper bound for hybrid domain as

(h^,𝔻ghy)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1].\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right].

Next, we relate the loss on the target \mathbb{D}_{g}^{T} to that on the hybrid domain \mathbb{D}_{g}^{hy}; the two domains differ only in their feature marginals.

\left|\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{T}\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{hy}\right)\right|=\left|\int\ell\left(\hat{h}(z),h^{T}(z)\right)\left(p_{g}^{T}(z)-p_{g}^{\pi}(z)\right)dz\right|
\leq\int L\left|p_{g}^{T}(z)-p_{g}^{\pi}(z)\right|dz
\leq L\int\left|\sqrt{p_{g}^{T}(z)}+\sqrt{p_{g}^{\pi}(z)}\right|\left|\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right|dz
\leq L\left[\int\left(\sqrt{p_{g}^{T}(z)}+\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}\left[\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
\leq\frac{L}{\sqrt{2}}\left[\int\left(p_{g}^{T}(z)+p_{g}^{\pi}(z)+2\sqrt{p_{g}^{T}(z)p_{g}^{\pi}(z)}\right)dz\right]^{1/2}\times\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
\leq\frac{L}{\sqrt{2}}\left[2+2\left(\int p_{g}^{T}(z)dz\int p_{g}^{\pi}(z)dz\right)^{1/2}\right]^{1/2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)
\leq L\sqrt{2}\,d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

In the above proof, we repeatedly invoke the Cauchy–Schwarz inequality \left|\int f(z)g(z)dz\right|^{2}\leq\int\left|f(z)\right|^{2}dz\int\left|g(z)\right|^{2}dz. Moreover, for the sake of completeness, we restate the definition of the square-root Hellinger distance:

d1/2(gT,gπ)=[2(pgT(z)pgπ(z))2𝑑z]1/2.d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)=\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}.
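For reference, a small numerical sketch (added here purely as an illustration) of this distance in the discrete case, together with the L1 step used in the chain above; the helper name d12 is ours.

import numpy as np

def d12(p, q):
    # square-root Hellinger distance d_{1/2} between two discrete densities
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
# the step used in the proof: int |p - q| dz is controlled by sqrt(2) * d_{1/2}(p, q)
assert np.sum(np.abs(p - q)) <= np.sqrt(2.0) * d12(p, q) + 1e-12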

To this end, we obtain an upper bound for the target loss in terms of the loss on the source mixture, a label shift term on the input space, and a data shift term between the target domain and the source mixture on the feature space.

(h^,𝔻gT)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right). (8)

Finally, using the fact that the loss on the feature space equals the loss on the input space (Corollary 10), we have

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

That concludes our proof. ∎
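As a sanity check (not part of the proof), the bound of Theorem 11 can be verified numerically on a finite example with a single source (K = 1) and the bounded per-class loss l(u,i) = 1 - u_i, for which L = 1; the construction below is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(3)
n, m, C = 20, 5, 3
g = rng.integers(0, m, size=n)                      # shared feature map g: X -> Z
pS, pT = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))                   # source / target data distributions
fS, fT = rng.dirichlet(np.ones(C), size=n), rng.dirichlet(np.ones(C), size=n)   # labeling functions
fhat = rng.dirichlet(np.ones(C), size=m)[g]         # fhat = hhat o g

ell = lambda u, v: np.sum((1.0 - u) * v, axis=-1)   # loss with l(u, i) = 1 - u_i, so L = 1
loss = lambda p, f: np.sum(p * ell(fhat, f))
d12 = lambda p, q: np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

pgS = np.array([pS[g == z].sum() for z in range(m)])      # pushforward densities on Z
pgT = np.array([pT[g == z].sum() for z in range(m)])
label_shift = np.sum(pS * np.abs(fT - fS).sum(axis=1))    # E_{P^S}[ ||Delta p(y|x)||_1 ]

lhs = loss(pT, fT)                                        # target loss
rhs = loss(pS, fS) + label_shift + np.sqrt(2.0) * d12(pgT, pgS)
assert lhs <= rhs + 1e-12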

This bound is novel since it relates the loss on the input space to the data shift on the feature space. This allows us to further investigate how source-source compression and source-target compression affect learning. First, we prove a lemma showing that the data shift between the target domain and the source mixture, d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right), decomposes into a sum of data shifts between the target domain and the individual source domains.

Lemma 12.

Given a source mixture and a target domain, we have the following

d1/2(gT,gπ)j=1Kπjd1/2(gT,gS,j)d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)
Proof.

Firstly, we observe that

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) =[2(pgT(z)pgπ(z))2𝑑z]1/2\displaystyle=\left[2\int\left(\sqrt{p_{g}^{T}(z)}-\sqrt{p_{g}^{\pi}(z)}\right)^{2}dz\right]^{1/2}
=[2(pgT(z)+pgπ(z)2pgT(z)pgπ(z))𝑑z]1/2.\displaystyle=\left[2\int\left(p_{g}^{T}(z)+p_{g}^{\pi}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{\pi}(z)}\right)dz\right]^{1/2}.

Secondly, we use the Cauchy–Schwarz inequality to obtain

pgT(z)pgπ(z)\displaystyle p_{g}^{T}(z)p_{g}^{\pi}(z) =(j=1KπjpgT(z))(j=1KπjpgS,j(z))\displaystyle=\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{T}(z)\right)\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{S,j}(z)\right)
(j=1KπjpgT(z)pgS,j(z))2.\displaystyle\geq\left(\sum_{j=1}^{K}\pi_{j}\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)^{2}.

Therefore, we arrive at

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) [2(j=1KπjpgT(z)+j=1KπjpgS,j(z)2j=1KπjpgT(z)pgS,j(z))𝑑z]1/2\displaystyle\leq\left[2\int\left(\sum_{j=1}^{K}\pi_{j}p_{g}^{T}(z)+\sum_{j=1}^{K}\pi_{j}p_{g}^{S,j}(z)-2\sum_{j=1}^{K}\pi_{j}\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
=[j=1Kπj2(pgT(z)+pgS,j(z)2pgT(z)pgS,j(z))𝑑z]1/2\displaystyle=\left[\sum_{j=1}^{K}\pi_{j}2\int\left(p_{g}^{T}(z)+p_{g}^{S,j}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
j=1K[πj2(pgT(z)+pgS,j(z)2pgT(z)pgS,j(z))𝑑z]1/2\displaystyle\leq\sum_{j=1}^{K}\left[\pi_{j}2\int\left(p_{g}^{T}(z)+p_{g}^{S,j}(z)-2\sqrt{p_{g}^{T}(z)p_{g}^{S,j}(z)}\right)dz\right]^{1/2}
=j=1Kπjd1/2(gT,gS,j).\displaystyle=\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right).
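A quick numerical illustration of Lemma 12 on discrete feature densities (our own sketch; all names are placeholders):

import numpy as np

rng = np.random.default_rng(4)
K, m = 3, 6
pi = rng.dirichlet(np.ones(K))                  # mixture weights
PgS = rng.dirichlet(np.ones(m), size=K)         # source feature densities
pgT = rng.dirichlet(np.ones(m))                 # target feature density
pg_pi = pi @ PgS                                # mixture feature density

d12 = lambda p, q: np.sqrt(2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
lhs = d12(pgT, pg_pi)
rhs = sum(np.sqrt(pi[j]) * d12(pgT, PgS[j]) for j in range(K))
assert lhs <= rhs + 1e-12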

Now we are ready to prove the bound which motivates the compressed DI representation.

Theorem 13.

(Theorem 3 in the main paper) Consider mixture of source domains 𝔻π=i=1Kπi𝔻S,i\mathbb{D}^{\pi}=\sum_{i=1}^{K}\pi_{i}\mathbb{D}^{S,i} and target domain 𝔻T\mathbb{D}^{T}. Let \ell be any loss function upper-bounded by a positive constant LL. For any hypothesis f^:𝒳𝒴Δ\hat{f}:\mathcal{X}\mapsto\mathcal{Y}_{\Delta} where f^=h^g\hat{f}=\hat{h}\circ g with g:𝒳𝒵g:\mathcal{X}\mapsto\mathcal{Z} and h^:𝒵𝒴Δ\hat{h}:\mathcal{Z}\mapsto\mathcal{Y}_{\Delta}, the target loss on input space is upper bounded

(f^,𝔻T)\displaystyle\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right) i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]\displaystyle\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right] (9)
+i=1Kj=1KL2πjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)
Proof.

In the previous Theorem 11, the upper bound for target loss is

(f^,𝔻T)i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]+L2d1/2(gT,gπ).\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right)\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]+L\sqrt{2}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right).

Using Lemma 12, we have

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) j=1Kπjd1/2(gT,gS,j)\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)

Next, we use the triangle inequality for square root Hellinger distance

d1/2(gT,gπ)\displaystyle d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right) j=1Kπjd1/2(gT,gS,j)\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)
j=1Kπj(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle\leq\sum_{j=1}^{K}\sqrt{\pi_{j}}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

Therefore, by averaging over all \left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right) pairs, i.e., over the choice of i\in[K],

d1/2(gT,gπ)=i=1K1Kd1/2(gT,gπ)i=1Kj=1KπjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)=\sum_{i=1}^{K}\frac{1}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)\leq\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

We obtain the conclusion of our proof

(f^,𝔻T)\displaystyle\mathcal{L}\left(\hat{f},\mathbb{D}^{T}\right) i=1Kπi(f^,𝔻S,i)+Lmaxi[K]𝔼S,i[Δpi(y|x)1]\displaystyle\leq\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,i}\right)+L\max_{i\in[K]}\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i}(y|x)\|_{1}\right]
+i=1Kj=1KL2πjK(d1/2(gT,gS,i)+d1/2(gS,i,gS,j))\displaystyle+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{L\sqrt{2\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)

Appendix B: DI Representation’s Characteristics

B.1 General Domain-Invariant Representations

In the main paper, we defined general DI representations via the minimization of the source loss \min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right). We then proposed to view the optimization problem \min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right) as computing a type of divergence, i.e., a hypothesis-aware divergence. To understand the connection between the two, we first consider the classification problem where samples are drawn from a mixture z\sim\mathbb{Q}^{\alpha}=\sum_{i=1}^{C}\alpha_{i}\mathbb{Q}_{i}, with each \mathbb{Q}_{i} defined on \mathcal{Z} with density q_{i}\left(z\right), and the task is to predict which of the distributions \mathbb{Q}_{1},...,\mathbb{Q}_{C} each sample originates from, i.e., the labels are 1,\ldots,C. Here, the hypothesis class \mathcal{H} is assumed to have infinite capacity, and the objective is to minimize \min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)=\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{C}\alpha_{i}\mathcal{L}\left(\hat{h},\mathbb{Q}_{i}\right).

B.1.1 Hypothesis-Aware Divergence

Theorem 14.

(Theorem 5 in the main paper) Assuming the hypothesis class \mathcal{H} has infinite capacity, we define the hypothesis-aware divergence for multiple distributions as

Dα(1,,C)=minh^1:Cα(h^)+infβ𝒴Δ(i=1Cl(β,i)αi).D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right). (10)

This divergence is a proper divergence among 1,,C\mathbb{Q}_{1},...,\mathbb{Q}_{C} in the sense that Dα(1,,C)0D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)\geq 0 for all 1,,C\mathbb{Q}_{1},...,\mathbb{Q}_{C} and α𝒴Δ\alpha\in\mathcal{Y}_{\Delta}, and Dα(1,,C)=0D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)=0 if 1==C\mathbb{Q}_{1}=...=\mathbb{Q}_{C}.

Proof.

Data is sampled from the mixture \mathbb{Q}^{\alpha} by first sampling a domain index i\sim Cat\left(\alpha\right), then sampling data z\sim\mathbb{Q}_{i} and labeling it with i. We examine the total expected loss of any hypothesis \hat{h}\in\mathcal{H}, which is

1:Cα(h^)\displaystyle\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right) :=i=1Cαi(h^,i)\displaystyle:=\sum_{i=1}^{C}\alpha_{i}\mathcal{L}\left(\hat{h},\mathbb{Q}_{i}\right)
=i=1Cαil(h^(z),i)qi(z)𝑑z\displaystyle=\sum_{i=1}^{C}\alpha_{i}\int l\left(\hat{h}\left(z\right),i\right)q_{i}\left(z\right)dz

We would like to minimize this loss, which leads to

minh^1:Cα(h^)\displaystyle\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right) =minh^i=1Cαil(h^(z),i)qi(z)𝑑z\displaystyle=\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{C}\alpha_{i}\int l\left(\hat{h}\left(z\right),i\right)q_{i}\left(z\right)dz (11)
=(1)minh^(i=1Cαil(h^(z),i)qi(z)qα(z))qα(z)𝑑z\displaystyle\stackrel{{\scriptstyle\left(1\right)}}{{=}}\min_{\hat{h}\in\mathcal{H}}\int\left(\sum_{i=1}^{C}\alpha_{i}l\left(\hat{h}\left(z\right),i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
=(2)minh^(i=1Cαil(h^(z),i)qi(z)qα(z))qα(z)𝑑z\displaystyle\stackrel{{\scriptstyle\left(2\right)}}{{=}}\int\min_{\hat{h}\in\mathcal{H}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\hat{h}\left(z\right),i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
=minβ𝒴Δ(i=1Cαil(β,i)qi(z)qα(z))qα(z)𝑑z\displaystyle=\int\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right)q^{\alpha}\left(z\right)dz
(3)minβ𝒴Δ(i=1Cαil(β,i)qi(z)dz)\displaystyle\stackrel{{\scriptstyle\left(3\right)}}{{\leq}}\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\int\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)q_{i}\left(z\right)dz\right)
=minβ𝒴Δ(i=1Cαil(β,i)).\displaystyle=\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\right).

For (1)\left(1\right), qα(z)=i=1Cαiqi(z)q^{\alpha}\left(z\right)=\sum_{i=1}^{C}\alpha_{i}q_{i}\left(z\right) is introduced as the density of the mixture, whereas for (2)\left(2\right), we use the fact that \mathcal{H} has infinite capacity, leading to the equality minh^f(h^(x))q(x)𝑑x=minh^f(h^(x))q(x)𝑑x\min_{\hat{h}\in\mathcal{H}}\int f\left(\hat{h}\left(x\right)\right)q\left(x\right)dx=\int\min_{\hat{h}\in\mathcal{H}}f\left(\hat{h}\left(x\right)\right)q\left(x\right)dx. Moreover, for (3)\left(3\right), the property of concave function ϕ(t)=minβ𝒴Δ(i=1Cαil(β,i)ti)\phi\left(t\right)=\min_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)t_{i}\right) with t𝒴Δt\in\mathcal{Y}_{\Delta} is invoked, i.e., 𝔼zQ[ϕ(t(z))]ϕ(𝔼zQ[t(z)])\mathbb{E}_{z\sim Q}\left[\phi\left(t\left(z\right)\right)\right]\leq\phi\left(\mathbb{E}_{z\sim Q}\left[t\left(z\right)\right]\right).

This suggests defining a non-negative divergence D^{\alpha} between multiple distributions \mathbb{Q}_{1},...,\mathbb{Q}_{C} as

Dα(1,,C)\displaystyle D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right) =minh^1:Cα(h^)+infβ𝒴Δ(i=1Cl(β,i)αi),\displaystyle=-\min_{\hat{h}\in\mathcal{H}}\mathcal{L}_{\mathbb{Q}_{1:C}}^{\alpha}\left(\hat{h}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right),
=ϕ([qi(z)qα(z)]i=1C)qα(z)dz+infβ𝒴Δ(i=1Cl(β,i)αi)\displaystyle=\int-\phi\left(\left[\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}\right]_{i=1}^{C}\right)q^{\alpha}\left(z\right)dz+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right)

which is a proper f-divergence, since -\phi\left(t\right) is a convex function and \inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{i=1}^{C}l\left(\beta,i\right)\alpha_{i}\right) is just a constant. Moreover, D^{\alpha}\left(\mathbb{Q}_{1},...,\mathbb{Q}_{C}\right)\geq 0 for all \mathbb{Q}_{1},...,\mathbb{Q}_{C} and \alpha\in\mathcal{Y}_{\Delta} due to the previous inequality (11). The equality happens if there is some \beta_{0}\in\mathcal{Y}_{\Delta} such that, for all z\in\mathcal{Z},

β0=argminβ𝒴Δi=1Cαil(β,i)qi(z)qα(z).\beta_{0}=\underset{{}_{\beta\in\mathcal{Y}_{\Delta}}}{\text{argmin}\ }\sum_{i=1}^{C}\alpha_{i}l\left(\beta,i\right)\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}.

This means qi(z)qα(z)=Ai,i[C]\frac{q_{i}\left(z\right)}{q^{\alpha}\left(z\right)}=A_{i},\forall i\in\left[C\right], where AiA_{i} is a constant dependent on index ii. However, this leads to

qi(z)𝑑z\displaystyle\int q_{i}\left(z\right)dz =Aiqα(z)𝑑z\displaystyle=A_{i}\int q^{\alpha}\left(z\right)dz
1\displaystyle 1 =Ai\displaystyle=A_{i}

i.e., qi(z)=qα(z),i[C]q_{i}\left(z\right)=q^{\alpha}\left(z\right),\forall i\in\left[C\right]. In other words, the equality happens when all distributions are the same 1==C\mathbb{Q}_{1}=...=\mathbb{Q}_{C}. ∎
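To illustrate the hypothesis-aware divergence, the sketch below instantiates it with the log loss l\left(\beta,i\right)=-\log\beta_{i}, for which the inner minimization has the closed form \beta_{i}\propto\alpha_{i}q_{i}\left(z\right); in that case D^{\alpha} reduces to a generalized Jensen–Shannon divergence. This choice of loss is an illustrative assumption, not the general setting of the theorem.

import numpy as np

rng = np.random.default_rng(2)
C, m = 3, 8
alpha = np.array([0.5, 0.3, 0.2])               # mixing weights alpha
Q = rng.dirichlet(np.ones(m), size=C)           # C distributions on a finite Z (strictly positive)
q_mix = alpha @ Q                               # q^alpha(z)
post = (alpha[:, None] * Q) / q_mix             # per-z minimizer for the log loss: alpha_i q_i(z) / q^alpha(z)

H = lambda w: -np.sum(w * np.log(w), axis=0)    # Shannon entropy
min_loss = np.sum(q_mix * H(post))              # min_h sum_i alpha_i L(h, Q_i), closed form for the log loss
D_alpha = -min_loss + H(alpha)                  # hypothesis-aware divergence; the inf_beta term equals H(alpha)
assert D_alpha >= -1e-12

# the divergence vanishes when Q_1 = ... = Q_C
Q_same = np.tile(Q[0], (C, 1))
post_same = (alpha[:, None] * Q_same) / (alpha @ Q_same)
print(D_alpha, -np.sum((alpha @ Q_same) * H(post_same)) + H(alpha))   # second value is ~0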

B.1.2 General Domain-Invariant Representations

For a fixed feature map gg, the induced representation distributions of source domains are gS,i\mathbb{P}_{g}^{S,i}. We then find the optimal hypothesis h^g\hat{h}_{g}^{*} on the induced representation distributions gS,i\mathbb{P}_{g}^{S,i} by minimizing the loss

minh^i=1Kπi(h^,hS,i,gS,i)=minh^i=1Kπi(h^,𝔻gS,i).\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)=\text{min}_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right). (12)

The general domain-invariant feature map gg^{*} is defined as the one that offers the minimal optimal loss as

g=argming𝒢minh^i=1Kπi(h^,hS,i,gi)=argming𝒢minh^i=1Kπi(h^,𝔻gS,i).\begin{aligned} g^{*}=\arg\min_{g\in\mathcal{G}}\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{i}\right)=\text{argmin}_{g\in\mathcal{G}}\text{min}_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)\end{aligned}. (13)

We denote gs,i,c\mathbb{P}_{g}^{s,i,c} as the class cc conditional distribution of the source domain ii on the latent space and pgs,i,cp_{g}^{s,i,c} as its density function. The induced representation distribution gS,i\mathbb{P}_{g}^{S,i} of source domain ii is a mixture of gs,i,c\mathbb{P}_{g}^{s,i,c} as gS,i=c=1Cγi,cgs,i,c,\mathbb{P}_{g}^{S,i}=\sum_{c=1}^{C}\gamma_{i,c}\mathbb{P}_{g}^{s,i,c},where γi,c=s,i(y=c)\gamma_{i,c}=\mathbb{P}^{s,i}\left(y=c\right).

We further define gs,c:=i=1Kπiγi,cαcgs,i,c\mathbb{Q}_{g}^{s,c}:=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{s,i,c} where αc=j=1Kπjγj,c\alpha_{c}=\sum_{j=1}^{K}\pi_{j}\gamma_{j,c}. Obviously, we can interpret gs,c\mathbb{Q}_{g}^{s,c} as the mixture of the class cc conditional distributions of the source domains on the latent space. The objective function in Eq. (12) can be viewed as training the optimal hypothesis h^\hat{h}\in\mathcal{H} to distinguish the samples from gs,c,c[C]\mathbb{Q}_{g}^{s,c},c\in\left[C\right] for a given feature map gg. Therefore, by linking to the multi-divergence concept developed in Theorem 14, we achieve the following theorem.

Theorem 15.

(Theorem 6 in the main paper) Assuming that \mathcal{H} has infinite capacity, we have the following statements.

1. Dα(gs,1,,gs,C)=minh^i=1Kπi(h^,hS,i,gS,i)+constD^{\alpha}\left(\mathbb{Q}_{g}^{s,1},...,\mathbb{Q}_{g}^{s,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}, where α=[αc]c[C]\alpha=\left[\alpha_{c}\right]_{c\in\left[C\right]} is defined as above.

2. Finding the general domain-invariant feature map gg^{*} via the OP in (13) is equivalent to solving

g=argmaxg𝒢Dα(gs,1,,gs,C).g^{*}=\underset{{}_{g\in\mathcal{G}}}{\text{argmax}}\,D^{\alpha}\left(\mathbb{Q}_{g}^{s,1},...,\mathbb{Q}_{g}^{s,C}\right). (14)
Proof.

We investigate the loss on mixture

i=1Kπi(h^,hS,i,gS,i)\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right) =i=1Kπic=1Cγi,c(h^,c,gS,i,c)\displaystyle=\sum_{i=1}^{K}\pi_{i}\sum_{c=1}^{C}\gamma_{i,c}\mathcal{L}\left(\hat{h},c,\mathbb{P}_{g}^{S,i,c}\right)
=c=1Cαc(h^,c,i=1Kπiγi,cαcgS,i,c)\displaystyle=\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},c,\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}\right)
=c=1Cαc(h^,c,gS,c)\displaystyle=\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},c,\mathbb{Q}_{g}^{S,c}\right)

Therefore, the loss on the mixture is actually a loss on the joint class-conditional distributions \mathbb{Q}_{g}^{S,c}=\sum_{i=1}^{K}\frac{\pi_{i}\gamma_{i,c}}{\alpha_{c}}\mathbb{P}_{g}^{S,i,c}. Using the result from Theorem 14, we can define a divergence between these class-conditionals:

D^{\alpha}\left(\mathbb{Q}_{g}^{S,1},...,\mathbb{Q}_{g}^{S,C}\right)=-\min_{\hat{h}\in\mathcal{H}}\sum_{c=1}^{C}\alpha_{c}\mathcal{L}\left(\hat{h},\mathbb{Q}_{g}^{S,c}\right)+\inf_{\beta\in\mathcal{Y}_{\Delta}}\left(\sum_{c=1}^{C}l\left(\beta,c\right)\alpha_{c}\right)
=-\min_{\hat{h}\in\mathcal{H}}\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},h^{S,i},\mathbb{P}_{g}^{S,i}\right)+\text{const}

B.2 Compressed Domain-Invariant Representations

Theorem 16.

(Theorem 7 in the main paper) For any confidence level 1-\delta with \delta\in[0,1], over the choice of S, the estimated loss lies within an \epsilon-range of the true loss:

Pr(|(h^,S)(h^,𝔻gπ)|ϵ)1δ,\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right)\geq 1-\delta,

where ϵ=ϵ(δ)=(Aδ)1/2\epsilon=\epsilon\left(\delta\right)=\left(\frac{A}{\delta}\right)^{1/2} is a function of δ\delta for which AA is proportional to

1N(i=1Kj=1KπiK(f^,𝔻S,j)+Li=1Kπimaxk[K]𝔼S,k[Δpk,i(y|x)1]+LKi=1Kj=1K2πid1/2(gS,i,gS,j))2.\displaystyle\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}.
Proof.

Let S be a sample of N data points \left(z,y\right)\sim\mathbb{D}_{g}^{\pi} drawn from the mixture domain, i.e., i\sim Cat(\pi), z\sim\mathbb{P}_{g}^{S,i}, and the label is y\sim Cat\left(h^{S,i}\left(z\right)\right). The loss of a hypothesis \hat{h} on a sample \left(z,y\right)=\left(z,h^{S,i}(z)\right) for some domain index i is \ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right). To avoid cluttered notation, we denote this loss as \ell^{i}\left(z\right):=\ell\left(\hat{h}\left(z\right),h^{S,i}\left(z\right)\right).

Let N=\sum_{i=1}^{K}N_{i}, where each N_{i} is the number of samples drawn from domain i. The estimation of the loss on a particular domain i is

(h^,Si)=j=1Ni1Nii(zj).\mathcal{L}\left(\hat{h},S^{i}\right)=\sum_{j=1}^{N_{i}}\frac{1}{N_{i}}\ell^{i}\left(z_{j}\right).

This estimation is unbiased, i.e., \mathbb{E}_{S^{i}\sim\left(\mathbb{D}_{g}^{S,i}\right)^{N_{i}}}\left[\mathcal{L}\left(\hat{h},S^{i}\right)\right]=\mathbb{E}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]=\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right). Furthermore, the loss estimation on the source mixture is

(h,S)\displaystyle\mathcal{L}\left(h,S\right) =i=1Kj=1Ni1Ni(zj)\displaystyle=\sum_{i=1}^{K}\sum_{j=1}^{N_{i}}\frac{1}{N}\ell^{i}\left(z_{j}\right)

This estimation is also unbiased:

𝔼S(𝔻gπ)N[(h^,S)]=\displaystyle\mathbb{E}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]= 𝔼{Ni}[𝔼Si[(h^,S)]]\displaystyle\mathbb{E}_{\left\{N_{i}\right\}}\left[\mathbb{E}_{S^{i}}\left[\mathcal{L}\left(\hat{h},S\right)\right]\right]
=\displaystyle= i=1Kπi(h^,𝔻gS,i)=(h^,𝔻gπ).\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)=\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right).

Therefore, we can bound the concentration of (h^,S)\mathcal{L}\left(\hat{h},S\right) around its mean value (h^,𝔻gπ)\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right) using Chebyshev’s inequality

Pr(|(h^,S)(h^,𝔻gπ)|ϵ)\displaystyle\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\epsilon\right) 1VarS(𝔻gπ)N[(h^,S)]ϵ2\displaystyle\geq 1-\frac{\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]}{\epsilon^{2}}

which is equivalent to

Pr(|(h^,S)(h^,𝔻gπ)|VarS(𝔻gπ)N[(h^,S)]δ)\displaystyle\text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\sqrt{\frac{\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]}{\delta}}\right) 1δ\displaystyle\geq 1-\delta

The variance of (h^,S)\mathcal{L}\left(\hat{h},S\right) is

\text{Var}_{S\sim\left(\mathbb{D}_{g}^{\pi}\right)^{N}}\left[\mathcal{L}\left(\hat{h},S\right)\right]\stackrel{\left(1\right)}{=}\frac{1}{N}\text{Var}_{\left(z,y\right)\sim\mathbb{D}_{g}^{\pi}}\left[\ell\left(\hat{h}(z),y\right)\right] (15)
\stackrel{\left(2\right)}{=}\frac{1}{N}\left[\sum_{i=1}^{K}\pi_{i}\left(\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\mathbb{E}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]^{2}\right)-\left(\mathbb{E}_{\left(z,y\right)\sim\mathbb{D}_{g}^{\pi}}\left[\ell\left(\hat{h}(z),y\right)\right]\right)^{2}\right]
\leq\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\left(\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)^{2}\right)
\leq\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]+\frac{1}{N}\left(\sum_{i=1}^{K}\sqrt{\pi_{i}}\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right)\right)^{2}

Here, \stackrel{\left(1\right)}{=} holds since \mathcal{L}\left(\hat{h},S\right) is the empirical mean of N i.i.d. random variables \ell\left(\hat{h}(z),y\right) with \left(z,y\right) sampled from the same distribution \mathbb{D}_{g}^{\pi}. In \left(2\right), the variance w.r.t. a mixture distribution is related to the means and variances of the constituent distributions, i.e., \text{Var}_{\sum_{i}\pi_{i}\mathbb{P}_{i}}\left[X\right]=\sum_{i}\pi_{i}\left(\text{Var}_{\mathbb{P}_{i}}\left[X\right]+\mathbb{E}_{\mathbb{P}_{i}}\left[X\right]^{2}\right)-\mathbb{E}_{\sum_{i}\pi_{i}\mathbb{P}_{i}}\left[X\right]^{2}.

We reuse the result (8) in Theorem 11, substituting \mathbb{D}_{g}^{T}\equiv\mathbb{D}_{g}^{S,i} and \mathbb{D}_{g}^{\pi}\equiv\mathbb{D}_{g}^{S,j}, to obtain

(h^,𝔻gS,i)\displaystyle\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{S,i}\right) (f^,𝔻S,j)+L𝔼S,i[Δpi,j(y|x)1]+L2d1/2(gS,i,gS,j)\displaystyle\leq\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\mathbb{E}_{\mathbb{P}^{S,i}}\left[\|\Delta p^{i,j}(y|x)\|_{1}\right]+L\sqrt{2}\>d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)
1Kj=1K((f^,𝔻S,j)+Lmaxk[K]𝔼S,k[Δpk,i(y|x)1]+L2d1/2(gS,i,gS,j)).\displaystyle\leq\frac{1}{K}\sum_{j=1}^{K}\left(\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+L\sqrt{2}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right).

Therefore, the right-hand side of (15) is upper-bounded by A, where A is

A\displaystyle A =1Ni=1KπiVarzgS,i[i(z)]\displaystyle=\frac{1}{N}\sum_{i=1}^{K}\pi_{i}\text{Var}_{z\sim\mathbb{P}_{g}^{S,i}}\left[\ell^{i}\left(z\right)\right]
+1N(i=1Kj=1KπiK(f^,𝔻S,j)+Li=1Kπimaxk[K]𝔼S,k[Δpk,i(y|x)1]+LKi=1Kj=1K2πid1/2(gS,i,gS,j))2.\displaystyle+\frac{1}{N}\left(\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{i}}}{K}\mathcal{L}\left(\hat{f},\mathbb{D}^{S,j}\right)+L\sum_{i=1}^{K}\sqrt{\pi_{i}}\max_{k\in\left[K\right]}\mathbb{E}_{\mathbb{P}^{S,k}}\left[\left\|\Delta p^{k,i}\left(y|x\right)\right\|_{1}\right]+\frac{L}{K}\sum_{i=1}^{K}\sum_{j=1}^{K}\sqrt{2\pi_{i}}\leavevmode\nobreak\ d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)^{2}.

This concludes our proof, where the concentration inequality is

Pr(|(h^,S)(h^,𝔻gπ)|Aδ)1δ.\begin{aligned} \text{Pr}\left(\left|\mathcal{L}\left(\hat{h},S\right)-\mathcal{L}\left(\hat{h},\mathbb{D}_{g}^{\pi}\right)\right|\leq\sqrt{\frac{A}{\delta}}\right)&\geq 1-\delta\end{aligned}.
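The Chebyshev-based concentration above can be illustrated with a small Monte Carlo sketch on a synthetic discrete mixture; all quantities below are illustrative placeholders, not the experimental setup of the paper.

import numpy as np

rng = np.random.default_rng(6)
K, m, N, delta, trials = 2, 5, 200, 0.1, 2000
pi = np.array([0.6, 0.4])                       # mixture weights
Pg = rng.dirichlet(np.ones(m), size=K)          # feature densities of the K sources
loss = rng.uniform(0.0, 1.0, size=(K, m))       # bounded per-sample losses ell^i(z)

mean = np.sum(pi[:, None] * Pg * loss)          # L(hhat, D_g^pi)
var1 = np.sum(pi[:, None] * Pg * loss ** 2) - mean ** 2   # variance of a single draw
eps = np.sqrt(var1 / (N * delta))               # Chebyshev radius at confidence 1 - delta

hits = 0
for _ in range(trials):
    i = rng.choice(K, size=N, p=pi)
    z = np.array([rng.choice(m, p=Pg[k]) for k in i])
    hits += abs(loss[i, z].mean() - mean) <= eps
print(hits / trials, ">=", 1 - delta)           # empirical coverage should exceed 1 - delta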

Appendix C: Trade-Off in Learning DI Representations

Lemma 17.

Given a labeling function f:\mathcal{X}\rightarrow\mathcal{Y}_{\Delta} and a hypothesis \hat{f}:\mathcal{X}\rightarrow\mathcal{Y}_{\Delta}, denote by \mathbb{P}_{\mathcal{Y}}^{f} and \mathbb{P}_{\mathcal{Y}}^{\hat{f}} the two label marginal distributions induced by f and \hat{f} on the data distribution \mathbb{P}. In particular, to sample y\sim\mathbb{P}_{\mathcal{Y}}^{f}, we first sample x\sim\mathbb{P} (i.e., \mathbb{P} is the data distribution with density function p) and then sample y\sim Cat\left(f\left(x\right)\right); sampling y\sim\mathbb{P}_{\mathcal{Y}}^{\hat{f}} is similar. We then have

d1/2(𝒴f,𝒴f^)(f^,f,)1/2,d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right)\leq\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)^{1/2},

where the loss \mathcal{L} is defined based on the Hellinger loss (f^(x),f(x))=D1/2(f^(x),f(x))=2i=1C[f^(x,i)f(x,i)]2\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)=2\sum_{i=1}^{C}\left[\sqrt{\hat{f}\left(x,i\right)}-\sqrt{f\left(x,i\right)}\right]^{2}.

Proof.

We have

D1/2(𝒴f,𝒴f^)\displaystyle D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right) =2i=1C(pf(y)pf^(y))2\displaystyle=2\sum_{i=1}^{C}\left(\sqrt{p^{f}\left(y\right)}-\sqrt{p^{\hat{f}}\left(y\right)}\right)^{2}
=\displaystyle= 2i=1C(pf(y=ix)p(x)𝑑xpf^(y=ix)p(x)𝑑x)2\displaystyle 2\sum_{i=1}^{C}\left(\sqrt{\int p^{f}\left(y=i\mid x\right)p(x)dx}-\sqrt{\int p^{\hat{f}}\left(y=i\mid x\right)p(x)dx}\right)^{2}
=\displaystyle= 2i=1C(f(x,i)p(x)𝑑xf^(x,i)p(x)𝑑x)2\displaystyle 2\sum_{i=1}^{C}\left(\sqrt{\int f\left(x,i\right)p(x)dx}-\sqrt{\int\hat{f}\left(x,i\right)p(x)dx}\right)^{2}
=\displaystyle= 2i=1C[f(x,i)p(x)𝑑x+f^(x,i)p(x)𝑑x2f(x,i)p(x)𝑑xf^(x,i)p(x)𝑑x]\displaystyle 2\sum_{i=1}^{C}\left[\int f\left(x,i\right)p(x)dx+\int\hat{f}\left(x,i\right)p(x)dx-2\sqrt{\int f\left(x,i\right)p(x)dx}\sqrt{\int\hat{f}\left(x,i\right)p(x)dx}\right]
(1)\displaystyle\overset{(1)}{\leq} 2i=1C[f(x,i)p(x)𝑑x+f^(x,i)p(x)𝑑x2f(x,i)f^(x,i)p(x)dx]\displaystyle 2\sum_{i=1}^{C}\left[\int f\left(x,i\right)p(x)dx+\int\hat{f}\left(x,i\right)p(x)dx-2\sqrt{\int f\left(x,i\right)\hat{f}\left(x,i\right)}p(x)dx\right]
=\displaystyle= 2i=1C[f(x,i)f^(x,i)]2p(x)𝑑x=2i=1C[f(x,i)f^(x,i)]2p(x)dx\displaystyle 2\sum_{i=1}^{C}\int\left[\sqrt{f\left(x,i\right)}-\sqrt{\hat{f}\left(x,i\right)}\right]^{2}p(x)dx=\int 2\sum_{i=1}^{C}\left[\sqrt{f\left(x,i\right)}-\sqrt{\hat{f}\left(x,i\right)}\right]^{2}p(x)dx
=\displaystyle= D1/2(f^(x),f(x))p(x)𝑑x=(f^,f,),\displaystyle\int D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)p(x)dx=\mathcal{L}\left(\hat{f},f,\mathbb{P}\right),

where we note that in the derivation in (1)\overset{(1)}{\leq}, we use Cauchy-Schwarz inequality: f^(x,i)p(x)𝑑xf(x,i)p(x)𝑑x(f^(x,i)f(x,i)p(x)𝑑x)2\int\hat{f}\left(x,i\right)p\left(x\right)dx\int f\left(x,i\right)p\left(x\right)dx\geq\left(\int\sqrt{\hat{f}\left(x,i\right)f\left(x,i\right)}p\left(x\right)dx\right)^{2}.

Therefore, we reach the conclusion as

d1/2(𝒴f,𝒴f^)(f^,f,)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{f},\mathbb{P}_{\mathcal{Y}}^{\hat{f}}\right)\leq\mathcal{L}\left(\hat{f},f,\mathbb{P}\right)^{1/2}.
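A minimal numerical check of Lemma 17 on a finite input space, using the Hellinger loss defined above (our own illustrative sketch):

import numpy as np

rng = np.random.default_rng(1)
n, C = 50, 3
p = rng.dirichlet(np.ones(n))                   # data distribution on a finite X
f = rng.dirichlet(np.ones(C), size=n)           # ground-truth labeling function f
fhat = rng.dirichlet(np.ones(C), size=n)        # hypothesis fhat

PY_f, PY_fhat = p @ f, p @ fhat                 # induced label marginals
d12 = np.sqrt(2.0 * np.sum((np.sqrt(PY_f) - np.sqrt(PY_fhat)) ** 2))
hell = 2.0 * np.sum((np.sqrt(fhat) - np.sqrt(f)) ** 2, axis=1)   # D_{1/2}(fhat(x), f(x))
L = p @ hell                                    # L(fhat, f, P) with the Hellinger loss
assert d12 <= np.sqrt(L) + 1e-12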

Lemma 18.

Consider the hypothesis f^=h^g\hat{f}=\hat{h}\circ g. We have the following inequalities w.r.t. the source and target domains:

(i) d1/2(^𝒴T,𝒴T)(h^g,fT,T)1/2,d_{1/2}\left(\mathbb{\hat{P}}_{\mathcal{Y}}^{T},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}, where 𝒴T\mathbb{P}_{\mathcal{Y}}^{T} is the label marginal distribution induced by fTf^{T} on T\mathbb{P}^{T}, while ^𝒴T\hat{\mathbb{P}}_{\mathcal{Y}}^{T} is the label marginal distribution induced by f^\hat{f} on T\mathbb{P}^{T}.

(ii) d1/2(𝒴π,^𝒴π)[i=1Kπi(h^g,fS,i,S,i)]1/2d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}, where 𝒴π:=i=1Kπi𝒴S,i\mathbb{P}_{\mathcal{Y}}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{\mathcal{Y}}^{S,i} with 𝒴S,i\mathbb{P}_{\mathcal{Y}}^{S,i} to be induced by fS,if^{S,i} on S,i\mathbb{P}^{S,i} and ^𝒴π:=i=1Kπi^𝒴S,i\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i} with ^𝒴S,i\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}to be induced by f^\hat{f} on S,i\mathbb{P}^{S,i} (i.e., equivalently, the label marginal distribution induced by f^\hat{f} on π:=i=1KπiS,i\mathbb{P}^{\pi}:=\sum_{i=1}^{K}\pi_{i}\mathbb{P}^{S,i}).

Proof.

(i) The proof of this part is obvious from Lemma 17 by considering fTf^{T} as ff and T\mathbb{P}^{T} as \mathbb{P}.

(ii) By the convexity of D_{1/2}, which is a member of the f-divergence family, we have

D1/2(𝒴π,^𝒴π)\displaystyle D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right) =D1/2(i=1Kπi𝒴S,i,i=1Kπi^𝒴S,i)i=1KπiD1/2(𝒴S,i,^𝒴S,i)\displaystyle=D_{1/2}\left(\sum_{i=1}^{K}\pi_{i}\mathbb{P}_{\mathcal{Y}}^{S,i},\sum_{i=1}^{K}\pi_{i}\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}\right)\leq\sum_{i=1}^{K}\pi_{i}D_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{S,i},\hat{\mathbb{P}}_{\mathcal{Y}}^{S,i}\right)
(1)\displaystyle\overset{(1)}{\leq} i=1Kπi(h^g,fS,i,S,i),\displaystyle\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right),

where the derivation in (1)\overset{(1)}{\leq} is from Lemma 17. Therefore, we reach the conclusion as

d1/2(𝒴π,^𝒴π)[i=1Kπi(h^g,fS,i,S,i)]1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}.

Theorem 19.

(Theorem 8 in the main paper) Consider a feature extractor g and a hypothesis \hat{h}. The Hellinger distance between the two label marginal distributions \mathbb{P}_{\mathcal{Y}}^{\pi} and \mathbb{P}_{\mathcal{Y}}^{T} can be upper-bounded as:

(i) d1/2(𝒴π,𝒴T)[k=1Kπk(h^g,fS,k,S,k)]1/2+d1/2(gT,gπ)+(h^g,fT,T)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{k=1}^{K}\pi_{k}\mathcal{L}\left(\hat{h}\circ g,f^{S,k},\mathbb{P}^{S,k}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{\pi}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

(ii) d1/2(𝒴π,𝒴T)[i=1Kπi(h^g,fS,i,S,i)]1/2+i=1Kj=1KπjKd1/2(gS,i,gS,j)+i=1Kj=1KπjKd1/2(gT,gS,i)+(h^g,fT,T)1/2.d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

Here we note that the general loss \mathcal{L} is defined based on the Hellinger loss \ell\left(\hat{f}\left(x\right),f(x)\right)=D_{1/2}\left(\hat{f}\left(x\right),f(x)\right).

Proof.

(i) We define \mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi} and \mathbb{P}_{\mathcal{Y}}^{T},\hat{\mathbb{P}}_{\mathcal{Y}}^{T} as in Lemma 18. Recall that to sample y\sim\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}, we sample k\sim Cat(\pi) and x\sim\mathbb{P}^{S,k} (i.e., x\sim\mathbb{P}^{\pi}:=\sum_{k=1}^{K}\pi_{k}\mathbb{P}^{S,k}), compute z=g\left(x\right) (i.e., z\sim\mathbb{P}_{g}^{\pi}), and draw y\sim Cat\left(\hat{h}\left(z\right)\right); drawing y\sim\mathbb{\hat{P}}_{\mathcal{Y}}^{T} is similar.

Using triangle inequality for Hellinger distance, we have

d1/2(𝒴π,𝒴T)d1/2(𝒴π,^𝒴π)+d1/2(^𝒴π,^𝒴T)+d1/2(^𝒴T,𝒴T).\begin{aligned} d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)&\leq d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{\pi}\right)+d_{1/2}\left(\mathbb{\hat{P}}_{\mathcal{Y}}^{\pi},\hat{\mathbb{P}}_{\mathcal{Y}}^{T}\right)+d_{1/2}\left(\hat{\mathbb{P}}_{\mathcal{Y}}^{T},\mathbb{P}_{\mathcal{Y}}^{T}\right)\end{aligned}.

Referring to Lemma 18, we achieve

d1/2(𝒴π,𝒴T)\displaystyle d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right) [i=1Kπi(h^g,fS,i,S,i)]1/2+d1/2(𝒴^π,𝒴^T)+(h^g,fT,T)1/2.\displaystyle\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi},\mathbb{P}_{\hat{\mathcal{Y}}}^{T}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

By the monotonicity of the Hellinger distance, applied to $\mathbb{P}_{g}^{T}$ and $\mathbb{P}_{g}^{\pi}$ with the same transition probability $p\left(y=i\mid z\right)=\hat{h}\left(z,i\right)$ used to obtain $\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi}$ and $\mathbb{P}_{\hat{\mathcal{Y}}}^{T}$, we have

d_{1/2}\left(\mathbb{P}_{\hat{\mathcal{Y}}}^{\pi},\mathbb{P}_{\hat{\mathcal{Y}}}^{T}\right)\leq d_{1/2}\left(\mathbb{P}_{g}^{\pi},\mathbb{P}_{g}^{T}\right).

Finally, we reach the conclusion as

d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+d_{1/2}\left(\mathbb{P}_{g}^{\pi},\mathbb{P}_{g}^{T}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.

(ii) From Lemma 12, we can decompose the data-shift term and apply the triangle inequality again, arriving at

\begin{aligned}
d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right) &\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{j=1}^{K}\sqrt{\pi_{j}}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,j}\right)+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}\\
&\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}\left(d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\right)\\
&\quad+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.
\end{aligned}

This concludes our proof. ∎

The loss $\mathcal{L}$ in Theorem 19 is defined based on the Hellinger loss $\ell$ given by $\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right)$, while the theory developed in the previous sections is based on a loss $\ell$ of the specific form

\ell\left(\hat{f}\left(x\right),f\left(x\right)\right)=\sum_{i=1}^{C}l\left(\hat{f}\left(x\right),i\right)f\left(x,i\right). (16)

To make the two consistent, we discuss under which condition the Hellinger loss belongs to the family defined in (16). It is evident that if the labeling function $f$ satisfies $f\left(x,i\right)>0$ for all $x\sim\mathbb{P}$ and $i\in\left[C\right]$, for example when label smoothing [49, 36] is applied to the ground-truth labels, then the Hellinger loss is in the family of interest. This follows from the derivation:

\begin{aligned}
D_{1/2}\left(\hat{f}\left(x\right),f\left(x\right)\right) &= 2\sum_{i=1}^{C}\left[\sqrt{\hat{f}\left(x,i\right)}-\sqrt{f\left(x,i\right)}\right]^{2}\\
&= 2\sum_{i=1}^{C}\left[\sqrt{\frac{\hat{f}\left(x,i\right)}{f\left(x,i\right)}}-1\right]^{2}f\left(x,i\right),
\end{aligned}

where we consider $l\left(\hat{f}\left(x\right),i\right)=\left[\sqrt{\frac{\hat{f}\left(x,i\right)}{f\left(x,i\right)}}-1\right]^{2}$.
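As a concrete illustration (assuming the standard uniform label-smoothing scheme with coefficient $\epsilon\in(0,1)$; this particular scheme is our choice for the example), smoothing a one-hot label of class $c$ over $C$ classes yields

f\left(x,c\right)=1-\epsilon+\frac{\epsilon}{C},\qquad f\left(x,i\right)=\frac{\epsilon}{C}\text{ for }i\neq c,

so $f\left(x,i\right)>0$ for every $i\in\left[C\right]$, and the Hellinger loss indeed takes the form (16) with the above choice of $l$.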

Appendix D: Additional Experiments

D.1 Experiment on Colored MNIST

D.1.1 Dataset

We conduct experiments on the Colored MNIST dataset [2], whose data is generated as follows. First, for any original image $X$ in the MNIST dataset [24], the digit feature is set to $Z_{d}=0$ if the image's digit is in $0\rightarrow 4$, and $Z_{d}=1$ if the digit is in $5\rightarrow 9$. Next, the ground-truth label for the image $X$ is binary and sampled from either $\mathbb{P}(Y|Z_{d}=1)$ or $\mathbb{P}(Y|Z_{d}=0)$, depending on the value of the digit feature $Z_{d}$; these distributions satisfy $\mathbb{P}(Y=1|Z_{d}=1)=\mathbb{P}(Y=0|Z_{d}=0)=0.75$. Next, the binary color feature $Z_{c}$ is assigned to each image conditioning on its label, i.e., $z_{c}\sim\mathbb{P}(Z_{c}|Y=1)$ or $z_{c}\sim\mathbb{P}(Z_{c}|Y=0)$ with $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta$, where $\theta$ depends on the domain. Finally, we color the image red if $Z_{c}=0$ or green if $Z_{c}=1$.

For both the DG and MSDA experiments, there are 7 source domains generated by setting $\mathbb{P}(Z_{c}=1|Y=1)=\mathbb{P}(Z_{c}=0|Y=0)=\theta^{S,i}$ where $\theta^{S,i}\sim Uni\left(\left[0.6,1\right]\right)$ for $i=1,\ldots,7$. In our actual implementation, we take $\theta^{S,i}=0.6+\frac{0.4}{7}\left(i-1\right)$. The two target domains are created with $\theta^{T,i}\in\left\{0.05,0.7\right\}$ for $i=1,2$. After domain creation, data from each domain is split into a training set and a validation set. For the DG experiment, no data from the target domains is used in training. For MSDA, the same train-validation split is applied to the target domains; the unlabeled training splits are used for training, while the validation splits are used for testing.
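For concreteness, the following minimal sketch (with hypothetical helper names; not the released code) implements the generative process described above for a single domain with parameter $\theta$, together with the domain parameters used in our experiments:

import numpy as np

def make_colored_mnist_domain(images, digits, theta, rng):
    """Generate one Colored MNIST domain from grayscale MNIST images.
    images: (N, 28, 28) arrays; digits: (N,) original digit labels;
    theta: P(Zc=1 | Y=1) = P(Zc=0 | Y=0) for this domain."""
    n = len(images)
    z_d = (digits >= 5).astype(int)                     # digit feature: 0 for digits 0-4, 1 for 5-9
    y = np.where(rng.random(n) < 0.75, z_d, 1 - z_d)    # label agrees with z_d with probability 0.75
    z_c = np.where(rng.random(n) < theta, y, 1 - y)     # color agrees with y with probability theta
    colored = np.zeros((n, 2, 28, 28), dtype=images.dtype)
    colored[z_c == 0, 0] = images[z_c == 0]             # channel 0 kept: red image
    colored[z_c == 1, 1] = images[z_c == 1]             # channel 1 kept: green image
    return colored, y

rng = np.random.default_rng(0)
thetas_src = [0.6 + 0.4 / 7 * (i - 1) for i in range(1, 8)]   # 7 source domains
thetas_tgt = [0.05, 0.7]                                      # distant and close target domains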

Figure 4: Images in the Colored MNIST dataset are “colored” according to the color feature $Z_{c}$. If $Z_{c}=0$, channel 0 is kept while channel 1 is set to all zeros, corresponding to red images. Similarly, $Z_{c}=1$ means channel 1 is kept intact while channel 0 is zeroed out, represented by green images.

D.1.2 Model

We train a hypothesis $\hat{f}=\hat{h}\circ g$ by minimizing the classification loss over the entire source data

\mathcal{L}_{gen}:=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{E}_{\left(x,y\right)\sim\mathbb{D}^{S,i}}\left[CE\left(\hat{h}\left(g\left(x\right)\right),y\right)\right],

where $CE$ is the cross-entropy loss, $N_{i}$ is the number of samples from domain $i$, and $N_{S}$ is the total number of source samples.
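A minimal PyTorch-style sketch of this weighted classification loss (function and variable names are ours, and per-batch sample counts stand in for $N_{i}/N_{S}$):

import torch
import torch.nn.functional as F

def classification_loss(encoder, classifier, source_batches):
    """source_batches: list of (x, y) minibatches, one per source domain.
    Each domain's cross-entropy is weighted by its share of source samples."""
    n_total = sum(len(x) for x, _ in source_batches)
    loss = 0.0
    for x, y in source_batches:
        logits = classifier(encoder(x))
        loss = loss + (len(x) / n_total) * F.cross_entropy(logits, y)
    return loss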

To align the source-source representation distributions, we apply adversarial learning [16] as in [14]: a min-max game is played in which the domain discriminator $\hat{h}^{s-s}$ tries to predict domain labels from input representations, while the feature extractor (generator) $g$ tries to fool the domain discriminator, i.e., $\min_{g}\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}$. The source-source compression loss is defined as

\mathcal{L}_{disc}^{s-s}:=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{E}_{x\sim\mathbb{P}^{S,i}}\left[-CE\left(\log\hat{h}^{s-s}\left(g\left(x\right)\right),i\right)\right],

where $i$ is the domain label. It is well known [16] that if we search for $\hat{h}^{s-s}$ in a family with infinite capacity, then

\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}=JS\left(\mathbb{P}_{g}^{S,1},\ldots,\mathbb{P}_{g}^{S,7}\right).

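The sketch below illustrates one way to compute $\mathcal{L}_{disc}^{s-s}$ for the min-max game with alternating updates; it is a hedged illustration with our own names, not the released implementation:

import torch
import torch.nn.functional as F

def source_source_disc_loss(encoder, disc_ss, source_batches):
    """L_disc^{s-s}: negative cross-entropy of the domain discriminator over
    the 7 source domains, weighted by each domain's sample share.
    The discriminator ascends this quantity; the encoder descends it."""
    n_total = sum(len(x) for x, _ in source_batches)
    loss = 0.0
    for i, (x, _) in enumerate(source_batches):
        domain_logits = disc_ss(encoder(x))
        domain_labels = torch.full((len(x),), i, dtype=torch.long, device=x.device)
        loss = loss + (len(x) / n_total) * (-F.cross_entropy(domain_logits, domain_labels))
    return loss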

Similarly, alignment between the source and target feature distributions is enforced by adversarial learning between another discriminator $\hat{h}^{s-t}$ and the encoder $g$, with the objective $\min_{g}\max_{\hat{h}^{s-t}}\mathcal{L}_{disc}^{s-t}$. The loss function for source-target compression is

\mathcal{L}_{disc}^{s-t}:=\frac{N_{S}}{N_{S}+N_{T}}\mathbb{E}_{x\sim\mathbb{P}^{\pi,S}}\left[-\log\hat{h}^{s-t}\left(g\left(x\right)\right)\right]+\frac{N_{T}}{N_{S}+N_{T}}\mathbb{E}_{x\sim\mathbb{P}^{T}}\left[-\log\left(1-\hat{h}^{s-t}\left(g\left(x\right)\right)\right)\right],

where $\mathbb{P}^{\pi,S}=\sum_{i=1}^{7}\frac{N_{i}}{N_{S}}\mathbb{P}^{S,i}$ is the source mixture, $\mathbb{P}^{T}$ is the chosen target domain among the two, and $N_{T}$ is the number of unlabeled target samples.
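A corresponding sketch for $\mathcal{L}_{disc}^{s-t}$, assuming $\hat{h}^{s-t}$ outputs the probability of the target domain (consistent with the domain labeling described in Section D.1.3; names are ours):

import torch

def source_target_disc_loss(encoder, disc_st, x_source, x_target, eps=1e-8):
    """L_disc^{s-t} as defined above; disc_st(z) is read as P(target domain | z),
    i.e., target samples carry domain label 1 and source samples label 0."""
    n_s, n_t = len(x_source), len(x_target)
    p_s = disc_st(encoder(x_source)).squeeze(-1)   # predicted P(target) on source inputs
    p_t = disc_st(encoder(x_target)).squeeze(-1)   # predicted P(target) on target inputs
    loss_s = -torch.log(p_s + eps).mean()
    loss_t = -torch.log(1.0 - p_t + eps).mean()
    return (n_s / (n_s + n_t)) * loss_s + (n_t / (n_s + n_t)) * loss_t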

Finally, for DG we optimize the objective

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}\right), (17)

where $\lambda$ is the trade-off hyperparameter: $\lambda=0$ corresponds to DG's general DI representation, while a larger $\lambda$ corresponds to a more compressed DI representation. On the other hand, the objective for the MSDA setting is

\min_{g}\left(\min_{\hat{h}}\mathcal{L}_{gen}+\lambda^{s-s}\max_{\hat{h}^{s-s}}\mathcal{L}_{disc}^{s-s}+\lambda^{s-t}\max_{\hat{h}^{s-t}}\mathcal{L}_{disc}^{s-t}\right), (18)

where $\lambda^{s-s}$ controls the source-source compression and $\lambda^{s-t}$ controls the source-target compression.
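One possible way to optimize objective (18) is with alternating updates, reusing the hypothetical loss helpers sketched above; setting $\lambda^{s-t}=0$ recovers the DG objective (17). This is an illustrative sketch, not the released training loop:

import torch

def msda_training_step(encoder, classifier, disc_ss, disc_st,
                       source_batches, x_target, opt_main, opt_disc,
                       lam_ss=0.1, lam_st=0.1):
    x_source = torch.cat([x for x, _ in source_batches])

    # (1) Discriminator step: ascend the compression losses.
    opt_disc.zero_grad()
    d_loss = -(lam_ss * source_source_disc_loss(encoder, disc_ss, source_batches)
               + lam_st * source_target_disc_loss(encoder, disc_st, x_source, x_target))
    d_loss.backward()
    opt_disc.step()

    # (2) Encoder/classifier step: descend the classification loss plus the
    #     weighted compression losses (the encoder tries to fool the discriminators).
    opt_main.zero_grad()
    g_loss = (classification_loss(encoder, classifier, source_batches)
              + lam_ss * source_source_disc_loss(encoder, disc_ss, source_batches)
              + lam_st * source_target_disc_loss(encoder, disc_st, x_source, x_target))
    g_loss.backward()
    opt_main.step()
    return g_loss.item()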

Our implementation is based largely on the DomainBed repository [17]. Specifically, the encoder $g$ is a convolutional neural network with 4 convolutional layers, each followed by a ReLU activation and batch normalization, while the classifier $\hat{h}$ and the discriminators $\hat{h}^{s-s},\hat{h}^{s-t}$ are fully connected multi-layer perceptrons with 3 layers. Our code can be found in the zip file accompanying this appendix. Moreover, our experiments were run on one Tesla V100 GPU, and one training run on Colored MNIST took around 30 minutes.
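A sketch of this architecture in PyTorch (channel and hidden sizes are our guesses for illustration, not the exact DomainBed configuration):

import torch.nn as nn

def make_encoder(in_channels=2, width=128):
    """4-layer CNN encoder: Conv -> ReLU -> BatchNorm, followed by pooling."""
    layers, in_ch = [], in_channels
    for out_ch in (64, width, width, width):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                   nn.ReLU(),
                   nn.BatchNorm2d(out_ch)]
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)

def make_mlp(in_dim, out_dim, hidden=256):
    """3-layer fully connected head used for the classifier and both discriminators."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

encoder = make_encoder()
classifier = make_mlp(128, 2)                              # binary label
disc_ss = make_mlp(128, 7)                                 # 7 source domains
disc_st = nn.Sequential(make_mlp(128, 1), nn.Sigmoid())    # P(target | z)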

D.1.3 Multiple-source Domain Adaptation

In addition to the DG experiment provided in Section 3 of the main paper, a further experiment is conducted on MSDA, in which source-target compression is applied in addition to source-source compression. Specifically, unlabeled data from a target domain is supplied for training; it carries domain label 1 while all labeled source data carries domain label 0, and the source-target discriminator is tasked with classifying them. We experiment with the 2 target domains $\theta^{T,i}\in\left\{0.05,0.7\right\}$ separately and report accuracy on the target domain for different compression strengths. The results are presented in Figure 5.

(a) Distant target domain $\theta=0.05$
(b) Close target domain $\theta=0.7$, only $\lambda^{s-t}$ changes
(c) Close target domain $\theta=0.7$, only $\lambda^{s-s}$ changes
Figure 5: Accuracy on distant and close target domains, where tuples $\left(\lambda^{s-t},\lambda^{s-s}\right)$ indicate the strength of source-target compression and source-source compression, respectively.
(a) Art Painting as target domain.
(b) Sketch as target domain.
Figure 6: Domain generalization on PACS with different compression strengths. On both domains, slight compression is beneficial and increases accuracy, in line with the possible benefit of compressed DI representation. However, larger compression deteriorates performance, which confirms our discussion using the trade-off bound.

It is evident from the figure that the more compression is applied to both the source-source and source-target representations, the lower the accuracy. This result aligns with our previous bound (Theorem 8 in the main paper), i.e.,

\begin{aligned}
&d_{1/2}\left(\mathbb{P}_{\mathcal{Y}}^{\pi},\mathbb{P}_{\mathcal{Y}}^{T}\right)-\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)-\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)\\
&\leq\left[\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)\right]^{1/2}+\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)^{1/2}.
\end{aligned}

When source-source and source-target compression is applied, the term $\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{T},\mathbb{P}_{g}^{S,i}\right)+\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\sqrt{\pi_{j}}}{K}d_{1/2}\left(\mathbb{P}_{g}^{S,i},\mathbb{P}_{g}^{S,j}\right)$ is minimized, which raises the lower bound on the sum of the loss terms. Consequently, as the source loss $\sum_{i=1}^{K}\pi_{i}\mathcal{L}\left(\hat{h}\circ g,f^{S,i},\mathbb{P}^{S,i}\right)$ is minimized, the target loss $\mathcal{L}\left(\hat{h}\circ g,f^{T},\mathbb{P}^{T}\right)$ becomes high and hence performance is hindered.

However, the drop in accuracy under large compression on the close target domain is not as significant as on the distant target domain, as indicated in Figures 5b and 5c. In fact, the accuracy of some compressed representations is higher than that of the non-compressed representation at later iterations. It is possible that the negative effect of raising the lower bound, as in the trade-off theorem, is counteracted by the benefit of compressed DI representation (Section 2.3.3 of the main paper), i.e., the learned classifier for the compressed DI representation better approximates the ground-truth labeling function. On the other hand, the model with the general DI representation cannot approximate this ground-truth labeling function as accurately, but overfits to the training dataset, resulting in a drop in target accuracy at later iterations.

D.2 Experiment on PACS dataset

In order to verify our theoretical findings on a real dataset, we conduct a further experiment on the PACS dataset [26], which has 4 domains: Photo, Art Painting, Cartoon, and Sketch. Among the 4 domains, Photo and Cartoon are chosen as training domains, while Art Painting is chosen as the target domain close to the training ones, and Sketch as the target domain distant from the training ones. We use ResNet-18 [18] as the feature extractor, while the label classifier and domain discriminator are multi-layer perceptrons. We only investigate the DG setting on this dataset, for which the training objective is Eq. 17. The results in Figure 6 illustrate a similar pattern to the DG experiment on Colored MNIST. Specifically, the accuracies for both target domains rise until a peak is reached and then decrease, which confirms our developed theory on the benefit and trade-off of compressed DI representation.