
Few-shot Domain Adaptation by Causal Mechanism Transfer

Takeshi Teshima,1,2 Issei Sato,1,2 Masashi Sugiyama2,1
1The University of Tokyo, 2RIKEN
[email protected], {sato, sugi}@k.u-tokyo.ac.jp
Abstract

We study few-shot supervised domain adaptation (DA) for regression problems, where only a few labeled target domain data and many labeled source domain data are available. Many of the current DA methods base their transfer assumptions on either parametrized distribution shift or apparent distribution similarities, e.g., identical conditionals or small distributional discrepancies. However, these assumptions may preclude the possibility of adaptation from intricately shifted and apparently very different distributions. To overcome this problem, we propose mechanism transfer, a meta-distributional scenario in which a data generating mechanism is invariant across domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for DA. We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. Our method can be seen as the first attempt to fully leverage the structural causal models for DA.

1 Introduction

Learning from a limited amount of data is a long-standing yet actively studied problem of machine learning. Domain adaptation (DA) Ben-David et al. (2010) tackles this problem by leveraging auxiliary data sampled from related but different domains. In particular, we consider few-shot supervised DA for regression problems, where only a few labeled target domain data and many labeled source domain data are available.

A key component of DA methods is the transfer assumption (TA) used to relate the source and the target distributions. Many of the previously explored TAs rely on certain direct distributional similarities, e.g., identical conditionals Shimodaira (2000) or small distributional discrepancies Ben-David et al. (2007). However, these TAs may preclude the possibility of adaptation from apparently very different distributions. Many others assume parametric forms of the distribution shift Zhang et al. (2013) or of the distribution family Storkey and Sugiyama (2007), which can severely limit the set of distributions considered. (We further review related work in Section 5.1.)

To alleviate the intrinsic limitation of previous TAs due to relying on apparent distribution similarities or parametric assumptions, we focus on a meta-distributional scenario where there exists a common generative mechanism behind the data distributions (Figures 1,2). Such a common mechanism may be more conceivable in applications involving structured table data such as medical records Yadav et al. (2018). For example, in medical record analysis for disease risk prediction, it can be reasonable to assume that there is a pathological mechanism that is common across regions or generations, but the data distributions may vary due to the difference in cultures or lifestyles. Such a hidden structure (pathological mechanism, in this case), once estimated, may provide portable knowledge to enable DA, allowing one to obtain accurate predictors for under-investigated regions or new generations.

Concretely, our assumption relies on the generative model of nonlinear independent component analysis (nonlinear ICA; Figure 1), where the observed labeled data are generated by first sampling latent independent components (ICs) $S$ and then transforming them by a nonlinear invertible mixing function $f$ Hyvärinen et al. (2019). Under this generative model, our TA is that $f$, representing the mechanism, is identical across domains (Figure 2). This TA allows us to formally relate the domain distributions and develop a novel DA method without assuming their apparent similarities or making parametric assumptions.


Figure 1: Nonparametric generative model of nonlinear independent component analysis. Our meta-distributional transfer assumption is built on this model, where there exists an invertible function $f$ representing the mechanism to generate the labeled data $(X,Y)$ from the independent components (ICs) $S$ sampled from $q$. As a result, each pair $(f,q)$ defines a joint distribution $p$.



Figure 2: Our assumption of common generative mechanism. By capturing the common data generation mechanism, we enable domain adaptation among seemingly very different distributions without relying on parametric assumptions.

Our contributions.

Our key contributions can be summarized in three points as follows.

  1.

    We formulate the flexible yet intuitively accessible TA of a shared generative mechanism and develop a few-shot regression DA method (Section 3). The idea is as follows. First, from the source domain data, we estimate the mixing function $f$ by nonlinear ICA Hyvärinen et al. (2019), because $f$ is the only assumed relation between the domains. Then, to transfer the knowledge, we perform data augmentation on the target domain data using the estimated $f$ and the independence of the IC distributions. Finally, the augmented data are used to fit a target predictor (Figure 3).

  2.

    We theoretically justify the augmentation procedure by invoking the theory of generalized U-statistics (Lee, 1990). The theory shows that the proposed data augmentation procedure yields the uniformly minimum variance unbiased risk estimator in an ideal case. We also provide an excess risk bound Mohri et al. (2012) to cover a more realistic case (Section 4).

  3.

    We experimentally demonstrate the effectiveness of the proposed algorithm (Section 6). The real-world data we use is taken from the field of econometrics, for which structural equation models have been applied in previous studies Greene (2012).

A salient example of the generative model we consider is the structural equations of causal modeling (Section 2). In this context, our method can be seen as the first attempt to fully leverage the structural causal models for DA (Section 5.2).

2 Problem Setup

In this section, we describe the problem setup and the notation. To summarize, our problem setup is homogeneous, multi-source, and few-shot supervised domain adapting regression. That is, respectively, all data distributions are defined on the same data space, there are multiple source domains, and a limited number of labeled data is available from the target distribution (and we do not assume the availability of unlabeled data). In this paper, we use the terms domain and distribution interchangeably.

Notation.

Let us denote the set of real (resp. natural) numbers by $\mathbb{R}$ (resp. $\mathbb{N}$). For $N \in \mathbb{N}$, we define $[N] := \{1, 2, \ldots, N\}$. Throughout the paper, we fix $D\,(\in \mathbb{N}) > 1$ and suppose that the input space $\mathcal{X}$ is a subset of $\mathbb{R}^{D-1}$ and the label space $\mathcal{Y}$ is a subset of $\mathbb{R}$. As a result, the overall data space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ is a subset of $\mathbb{R}^{D}$. We generally denote a labeled data point by $Z = (X, Y)$. We denote by $\mathcal{Q}$ the set of independent distributions on $\mathbb{R}^{D}$ with absolutely continuous marginals. For a distribution $p$, we denote its induced expectation operator by $\mathbb{E}_{p}$. Table 3 in Supplementary Material provides a summary of notation.

Basic setup: Few-shot domain adapting regression.

Let $p_{\mathrm{Tar}}$ be a distribution (the target distribution) over $\mathcal{Z}$, and let $\mathcal{G} \subset \{g : \mathbb{R}^{D-1} \to \mathbb{R}\}$ be a hypothesis class. Let $\ell : \mathcal{G} \times \mathbb{R}^{D} \to [0, B_{\ell}]$ be a loss function, where $B_{\ell} > 0$ is a constant. Our goal is to find a predictor $g \in \mathcal{G}$ which performs well for $p_{\mathrm{Tar}}$, i.e., whose target risk $R(g) := \mathbb{E}_{p_{\mathrm{Tar}}} \ell(g, Z)$ is small. We denote $g^{*} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} R(g)$. To this end, we are given an independent and identically distributed (i.i.d.) sample $\mathcal{D}_{\mathrm{Tar}} := \{Z_{i}\}_{i=1}^{n_{\mathrm{Tar}}} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{Tar}}$. In a fully supervised setting where $n_{\mathrm{Tar}}$ is large, a standard procedure is to select $g$ by empirical risk minimization (ERM), i.e., $\hat{g} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \hat{R}(g)$, where $\hat{R}(g) := \frac{1}{n_{\mathrm{Tar}}} \sum_{i=1}^{n_{\mathrm{Tar}}} \ell(g, Z_{i})$. However, when $n_{\mathrm{Tar}}$ is not sufficiently large, $\hat{R}(g)$ may not accurately estimate $R(g)$, resulting in a high generalization error of $\hat{g}$. To compensate for the scarcity of data from the target distribution, we assume that we have data from $K$ distinct source distributions $\{p_{k}\}_{k=1}^{K}$ over $\mathcal{Z}$; that is, we have independent i.i.d. samples $\mathcal{D}_{k} := \{Z^{\mathrm{Src}}_{k,i}\}_{i=1}^{n_{k}} \overset{\text{i.i.d.}}{\sim} p_{k}$ ($k \in [K]$, $n_{k} \in \mathbb{N}$), whose relations to $p_{\mathrm{Tar}}$ are described shortly. We assume $n_{\mathrm{Tar}}, n_{k} \geq D$ for simplicity.
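As a concrete illustration of why plain ERM suffers in the few-shot regime, the following numpy sketch compares the variance of the empirical risk $\hat{R}(g)$ under small and large target samples. The linear predictor, squared loss, and target distribution here are hypothetical choices we introduce purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(g, Z):
    """Loss ell(g, Z) for a linear predictor g on labeled points Z = (X, Y)."""
    X, Y = Z[:, :-1], Z[:, -1]
    return (X @ g - Y) ** 2

def empirical_risk(g, Z):
    """Ordinary empirical risk R_hat(g) = (1/n) sum_i ell(g, Z_i)."""
    return squared_loss(g, Z).mean()

# Hypothetical target distribution: Y = 2*X + small noise, D = 2.
def sample_target(n):
    X = rng.normal(size=(n, 1))
    Y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)
    return np.hstack([X, Y[:, None]])

g = np.array([2.0])  # a candidate predictor
risks_small = [empirical_risk(g, sample_target(5)) for _ in range(200)]
risks_large = [empirical_risk(g, sample_target(500)) for _ in range(200)]
# With few target samples, the risk estimate fluctuates far more.
print(np.var(risks_small) > np.var(risks_large))
```

The variance of $\hat{R}(g)$ scales as $1/n_{\mathrm{Tar}}$, which is exactly the quantity the augmentation in Section 3 aims to reduce.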

Key assumption.

In this work, the key transfer assumption is that all domains follow nonlinear ICA models with an identical mixing function (Figure 2). To be precise, we assume that there exist IC distributions $q_{\mathrm{Tar}}, q_{k} \in \mathcal{Q}$ ($k \in [K]$) and a smooth invertible function $f : \mathbb{R}^{D} \to \mathbb{R}^{D}$ (the transformation or mixing) such that $Z^{\mathrm{Src}}_{k,i} \sim p_{k}$ is generated by first sampling $S^{\mathrm{Src}}_{k,i} \sim q_{k}$ and then transforming it by

$Z^{\mathrm{Src}}_{k,i} = f(S^{\mathrm{Src}}_{k,i}),$ (1)

and similarly $Z_{i} = f(S_{i})$, $S_{i} \sim q_{\mathrm{Tar}}$ for $p_{\mathrm{Tar}}$. The above assumption allows us to formally relate $p_{k}$ and $p_{\mathrm{Tar}}$. It also allows us to estimate $f$ when sufficient identification conditions required by the theory of nonlinear ICA are met. Due to space limitations, we provide a brief review of the nonlinear ICA method used in this paper and the known theoretical conditions in Supplementary Material A. Having multiple source domains is assumed here for the identifiability of $f$: it comes from the currently known identification condition of nonlinear ICA Hyvärinen et al. (2019). Note that complex changes in $q$ are allowed; hence the assumption of an invariant $f$ can accommodate intricate shifts in the apparent distribution $p$. We discuss this further in Section 5.3 with a simple example.
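To make the assumption concrete, here is a minimal numpy sketch of the generative model: a single invertible mixing shared across domains, with domain-specific independent component distributions. The particular triangular map `f` and the Laplace/Gaussian component distributions are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(S):
    """A fixed, smooth, invertible mixing (hypothetical stand-in for the
    shared mechanism): a triangular map on R^2."""
    s1, s2 = S[:, 0], S[:, 1]
    z1 = s1
    z2 = s2 + np.tanh(s1)  # invertible given z1
    return np.stack([z1, z2], axis=1)

def f_inv(Z):
    z1, z2 = Z[:, 0], Z[:, 1]
    s1 = z1
    s2 = z2 - np.tanh(z1)
    return np.stack([s1, s2], axis=1)

# Domain-specific IC distributions q_k: independent coordinates whose shapes
# differ across domains, while f stays fixed.
def sample_domain(n, scale):
    S = np.stack([rng.laplace(size=n), scale * rng.normal(size=n)], axis=1)
    return f(S)  # observed data Z = f(S); only q differs across domains

Z_src = sample_domain(1000, scale=1.0)  # a source domain
Z_tar = sample_domain(10, scale=3.0)    # target: same f, different q

# Sanity check: f is invertible, so the ICs can be recovered exactly.
S_rec = f_inv(Z_tar)
print(np.allclose(f(S_rec), Z_tar))  # True
```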

Example: Structural equation models

A salient example of generative models expressed as Eq. (1) is structural equation models (SEMs; Pearl, 2009; Peters et al., 2017), which are used to describe the data generating mechanism involving the causality of random variables (Pearl, 2009). More precisely, the generative model of Eq. (1) corresponds to the reduced form Reiss and Wolak (2007) of a Markovian SEM Pearl (2009), i.e., a form where the structural equations determining $Z$ from $(Z, S)$ are solved so that $Z$ is expressed as a function of $S$. Such a conversion is always possible because a Markovian SEM induces an acyclic causal graph Pearl (2009); hence the structural equations can be solved by elimination of variables. This interpretation of reduced-form SEMs as Eq. (1) has been exploited in methods of causal discovery, e.g., in the linear non-Gaussian additive-noise models and their successors (Kano and Shimizu, 2003; Shimizu et al., 2006; Monti et al., 2019). In the case of SEMs, the key assumption of this paper translates into the invariance of the structural equations across domains, which enables an intuitive assessment of the assumption based on prior knowledge. For instance, if all domains have the same causal mechanism and are in the same intervention state (including an intervention-free case), the modeling choice is deemed plausible. Note that the proposed method (Section 3) does not estimate the original structural equations; it only requires estimating the reduced form, which is an easier problem than causal discovery.
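The reduction from structural equations to Eq. (1) can be sketched in a few lines. The two structural equations below are hypothetical; the point is only that acyclicity lets us substitute one equation into the next to express $Z = (X, Y)$ as a function of the independent disturbances $S$ alone:

```python
import numpy as np

# A hypothetical two-variable Markovian SEM with causal graph X -> Y:
#   X = h_X(S_1)       (structural equation for X)
#   Y = h_Y(X, S_2)    (structural equation for Y)
# Acyclicity lets us eliminate X by substitution, yielding the reduced form
# Z = (X, Y) = f(S_1, S_2).

def h_X(s1):
    return np.exp(s1)        # hypothetical structural equation

def h_Y(x, s2):
    return np.sin(x) + s2    # hypothetical structural equation

def f_reduced(s1, s2):
    """Reduced form: Z expressed as a function of S alone."""
    x = h_X(s1)
    return x, h_Y(x, s2)

s1, s2 = 0.3, -0.5
x, y = f_reduced(s1, s2)
assert x == h_X(s1) and y == h_Y(h_X(s1), s2)
```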

3 Proposed Method: Mechanism Transfer

In this section, we detail the proposed method, mechanism transfer (Algorithm 1). The method first estimates the common generative mechanism $f$ from the source domain data and then uses it to perform data augmentation of the target domain data to transfer the knowledge (Figure 3).

Algorithm 1 Proposed method: mechanism transfer
Input: Source domain data sets $\{\mathcal{D}_{k}\}_{k\in[K]}$, target domain data set $\mathcal{D}_{\mathrm{Tar}}$, nonlinear ICA algorithm $\operatorname{ICA}$, and a learning algorithm $\mathcal{A}_{\mathcal{G}}$ to fit the hypothesis class $\mathcal{G}$ of predictors.
  // Step 1. Estimate the shared transformation.
  $\hat{f} \leftarrow \operatorname{ICA}(\mathcal{D}_{1}, \ldots, \mathcal{D}_{K})$
  // Step 2. Extract and shuffle target independent components.
  $\hat{s}_{i} \leftarrow \hat{f}^{-1}(Z_{i}) \quad (i = 1, \ldots, n_{\mathrm{Tar}})$
  $\{\bar{s}_{\bm{i}}\}_{\bm{i} \in [n_{\mathrm{Tar}}]^{D}} \leftarrow \operatorname{AllCombinations}(\{\hat{s}_{i}\}_{i=1}^{n_{\mathrm{Tar}}})$
  // Step 3. Synthesize target data and fit the predictor.
  $\bar{z}_{\bm{i}} \leftarrow \hat{f}(\bar{s}_{\bm{i}})$
  $\check{g} \leftarrow \mathcal{A}_{\mathcal{G}}(\{\bar{z}_{\bm{i}}\}_{\bm{i}})$
Output: $\check{g}$: the predictor in the target domain.

(a) Labeled target data $\overset{\hat{f}^{-1}}{\to}$ (b) Find IC $\to$ (c) Shuffle $\overset{\hat{f}}{\to}$ (d) Pseudo target data

Figure 3: Schematic illustration of the proposed few-shot domain adaptation method after estimating the common mechanism $f$. With the estimated $\hat{f}$, the method augments the small target domain sample in a few steps to enhance statistical efficiency: 3(a) The algorithm is given labeled target domain data. 3(b) From the labeled target domain data, extract the ICs. 3(c) By shuffling the IC values, synthesize likely values of the ICs. 3(d) From the synthesized ICs, generate pseudo target data. The generated data are used to fit a predictor for the target domain.

3.1 Step 1: Estimate $f$ using the source domain data

The first step estimates the common transformation $f$ by nonlinear ICA, namely via generalized contrastive learning (GCL; Hyvärinen et al., 2019). GCL uses auxiliary information for training a certain binary classification function $r_{\hat{f},\bm{\psi}}$ equipped with a parametrized feature extractor $\hat{f} : \mathbb{R}^{D} \to \mathbb{R}^{D}$. The trained feature extractor $\hat{f}$ is used as an estimator of $f$. The auxiliary information we use in our problem setup is the domain indices $[K]$. The classification function to be trained in GCL is $r_{\hat{f},\bm{\psi}}(z, u) := \sum_{d=1}^{D} \psi_{d}(\hat{f}^{-1}(z)_{d}, u)$, consisting of $(\hat{f}, \{\psi_{d}\}_{d=1}^{D})$, and the classification task of GCL is logistic regression to classify $(Z^{\mathrm{Src}}_{k}, k)$ as positive and $(Z^{\mathrm{Src}}_{k}, k')$ ($k' \neq k$) as negative. This yields the following domain-contrastive learning criterion to estimate $f$:

$$\mathop{\mathrm{argmin}}_{\hat{f} \in \mathcal{F},\ \{\psi_{d}\}_{d=1}^{D} \subset \Psi}\ \sum_{k=1}^{K} \frac{1}{n_{k}} \sum_{i=1}^{n_{k}} \Bigl( \phi\bigl(r_{\hat{f},\bm{\psi}}(Z^{\mathrm{Src}}_{k,i}, k)\bigr) + \mathbb{E}_{k' \neq k}\, \phi\bigl(-r_{\hat{f},\bm{\psi}}(Z^{\mathrm{Src}}_{k,i}, k')\bigr) \Bigr),$$

where $\mathcal{F}$ and $\Psi$ are sets of parametrized functions, $\mathbb{E}_{k' \neq k}$ denotes the expectation with respect to $k' \sim \mathrm{U}([K] \setminus \{k\})$ ($\mathrm{U}$ denotes the uniform distribution), and $\phi$ is the logistic loss $\phi(m) := \log(1 + \exp(-m))$. We use the solution $\hat{f}$ as an estimator of $f$. In experiments, $\mathcal{F}$ is implemented by invertible neural networks Kingma and Dhariwal (2018), $\Psi$ by multi-layer perceptrons, and $\mathbb{E}_{k' \neq k}$ is replaced by random sampling renewed for every mini-batch.
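For concreteness, the following numpy sketch evaluates the GCL criterion on toy data. The affine feature extractor and the linear $\psi_d$ are deliberately simplistic, hypothetical stand-ins for the invertible network and MLPs used in the experiments; only the structure of the loss follows the criterion above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
data = {k: rng.normal(loc=k, size=(50, D)) for k in range(K)}  # toy source domains

# Hypothetical parametrized pieces:
a, b = np.ones(D), np.zeros(D)       # f_hat^{-1}(z) = a * z + b, elementwise affine
W = rng.normal(size=(D, K)) * 0.1    # psi_d(s, u) = W[d, u] * s, a linear toy psi

def f_hat_inv(z):
    return a * z + b

def r(z, u):
    """Classification function r(z, u) = sum_d psi_d(f_hat^{-1}(z)_d, u)."""
    s = f_hat_inv(z)
    return sum(W[d, u] * s[d] for d in range(D))

def phi(m):
    return np.log1p(np.exp(-m))  # logistic loss

def gcl_loss():
    total = 0.0
    for k in range(K):
        others = [kp for kp in range(K) if kp != k]
        per_point = []
        for z in data[k]:
            pos = phi(r(z, k))                                # (z, k) as positive
            neg = np.mean([phi(-r(z, kp)) for kp in others])  # E over k' != k
            per_point.append(pos + neg)
        total += np.mean(per_point)
    return total

loss = gcl_loss()
print(np.isfinite(loss) and loss > 0)
```

In practice this objective is minimized jointly over the parameters of $\hat{f}$ and $\{\psi_d\}$ by stochastic gradient descent; the sketch only evaluates it at a fixed parameter setting.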

3.2 Step 2: Extract and inflate the target ICs using $\hat{f}$

The second step extracts and inflates the target domain ICs using the estimated $\hat{f}$. We first extract the ICs of the target domain data by applying the inverse of $\hat{f}$ as

$$\hat{s}_{i} = \hat{f}^{-1}(Z_{i}).$$

After the extraction, we inflate the set of IC values by taking all dimension-wise combinations of the estimated ICs:

$$\bar{s}_{\bm{i}} = (\hat{s}_{i_{1}}^{(1)}, \ldots, \hat{s}_{i_{D}}^{(D)}), \quad \bm{i} = (i_{1}, \ldots, i_{D}) \in [n_{\mathrm{Tar}}]^{D},$$

to obtain new plausible IC values $\bar{s}_{\bm{i}}$. The intuitive motivation of this procedure stems from the independence of the IC distributions. Theoretical justifications are provided in Section 4. In our implementation, we use invertible neural networks Kingma and Dhariwal (2018) to model the function $\hat{f}$, enabling the computation of the inverse $\hat{f}^{-1}$.
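The inflation step ($\operatorname{AllCombinations}$ in Algorithm 1) is a simple index product: coordinate $d$ of the new point is taken from data point $i_d$. A minimal numpy sketch with hypothetical IC values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_tar, D = 4, 2
s_hat = rng.normal(size=(n_tar, D))  # estimated target ICs, one row per data point

# All dimension-wise combinations: for each index tuple i = (i_1, ..., i_D),
# take coordinate d from data point i_d.
s_bar = np.array([
    [s_hat[idx[d], d] for d in range(D)]
    for idx in product(range(n_tar), repeat=D)
])

print(s_bar.shape)  # (n_tar ** D, D) = (16, 2)
```

Note that the original points are recovered on the "diagonal" tuples $i_1 = \cdots = i_D$, so the inflated set contains the original sample.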

3.3 Step 3: Synthesize target data from the inflated ICs

The third step estimates the target risk $R$ by the empirical distribution of the augmented data:

$\check{R}(g) := \frac{1}{n_{\mathrm{Tar}}^{D}} \sum_{\bm{i} \in [n_{\mathrm{Tar}}]^{D}} \ell(g, \hat{f}(\bar{s}_{\bm{i}})),$ (2)

and performs empirical risk minimization. In experiments, we use a regularization term $\Omega(\cdot)$ to control the complexity of $\mathcal{G}$ and select

$$\check{g} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left\{ \check{R}(g) + \Omega(g) \right\}.$$

The selected hypothesis $\check{g}$ is then used to make predictions in the target domain. In our experiments, we use $\Omega(g) = \lambda \|g\|^{2}$, where $\lambda > 0$ and the norm is that of the reproducing kernel Hilbert space (RKHS) from which we take the subset $\mathcal{G}$. Note that one may subsample only a subset of the combinations in Eq. (2) to mitigate the computational cost, similarly to Clémençon et al. (2016) and Papa et al. (2015).
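Putting Step 3 together, the following sketch minimizes the augmented, regularized risk with a linear predictor under the squared loss. All modeling choices here, including the identity stand-in for $\hat{f}$ and the plain ridge penalty in place of the RKHS norm, are hypothetical simplifications:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_tar, D = 5, 2

def f_hat(s):
    return s  # hypothetical placeholder for the learned invertible network

# Inflated ICs (Step 2) pushed back through f_hat to pseudo target data.
s_hat = rng.normal(size=(n_tar, D))
s_bar = np.array([[s_hat[i[d], d] for d in range(D)]
                  for i in product(range(n_tar), repeat=D)])
Z_bar = f_hat(s_bar)                # pseudo target data (X, Y)
X, Y = Z_bar[:, :-1], Z_bar[:, -1]

# Regularized ERM on the augmented sample: min_g R_check(g) + lambda * ||g||^2,
# solved in closed form for the linear/squared-loss case.
lam = 1e-2
g_check = np.linalg.solve(X.T @ X / len(X) + lam * np.eye(D - 1),
                          X.T @ Y / len(X))

def risk_check(g):
    return np.mean((X @ g - Y) ** 2)

print(risk_check(g_check) <= risk_check(np.zeros(D - 1)))
```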

4 Theoretical Insights

In this section, we state two theorems to investigate the statistical properties of the method proposed in Section 3, providing justification beyond the intuition that we take advantage of the independence of the IC distributions.

4.1 Minimum variance property: Idealized case

The first theorem provides an insight into the statistical advantage of the proposed method: in the ideal case, the method attains the minimum variance among all possible unbiased risk estimators.

Theorem 1 (Minimum variance property of $\check{R}$).

Assume that $\hat{f} = f$. Then, for each $g \in \mathcal{G}$, the proposed risk estimator $\check{R}(g)$ is the uniformly minimum variance unbiased estimator of $R(g)$, i.e., for any unbiased estimator $\tilde{R}(g)$ of $R(g)$,

$$\forall q \in \mathcal{Q}, \quad \mathrm{Var}(\check{R}(g)) \leq \mathrm{Var}(\tilde{R}(g)),$$

and $\mathbb{E}_{p_{\mathrm{Tar}}} \check{R}(g) = R(g)$ holds.

The proof of Theorem 1 is immediate once we rewrite $R(g)$ as a $D$-variate regular statistical functional and $\check{R}(g)$ as its corresponding generalized U-statistic (Lee, 1990). Details can be found in Supplementary Material D. Theorem 1 implies that the proposed risk estimator can have superior statistical efficiency in terms of variance over the ordinary empirical risk.
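To sketch this rewriting (under the assumption $\hat{f} = f$ of Theorem 1): since $q_{\mathrm{Tar}}$ is a product distribution, the target risk is a regular statistical functional of degree $D$,
\[
R(g) = \mathbb{E}\bigl[h(S_1, \ldots, S_D)\bigr], \qquad h(s_1, \ldots, s_D) := \ell\bigl(g, f(s_1^{(1)}, \ldots, s_D^{(D)})\bigr), \qquad S_1, \ldots, S_D \overset{\text{i.i.d.}}{\sim} q_{\mathrm{Tar}},
\]
where the kernel $h$ reads off its $d$-th coordinate from its $d$-th argument. The estimator $\check{R}(g)$ of Eq. (2) averages this kernel over index combinations of the sample $\{\hat{s}_i\}$, which is the generalized U-statistic associated with $h$, so the minimum variance property is inherited from the theory in Lee (1990).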

4.2 Excess risk bound: More realistic case

In real situations, one has to estimate $f$. The following theorem characterizes the statistical gain and loss arising from the estimation error $f - \hat{f}$. The intuition is that the increased number of points suppresses the possibility of overfitting because the hypothesis has to fit the majority of the inflated data, but the estimator $\hat{f}$ has to be accurate for fitting the inflated data to be meaningful. Note that the theorem is agnostic to how $\hat{f}$ is obtained; hence it applies to a more general problem setup as long as $f$ can be estimated.

Theorem 2 (Excess risk bound).

Let $\check{g}$ be a minimizer of Eq. (2). Under appropriate assumptions (see Theorem 3 in Supplementary Material), for arbitrary $\delta, \delta' \in (0, 1)$, we have, with probability at least $1 - (\delta + \delta')$,

$$R(\check{g}) - R(g^{*}) \leq \underbrace{C \sum_{j=1}^{D} \left\| f_{j} - \hat{f}_{j} \right\|_{W^{1,1}}}_{\text{Approximation error}} + \underbrace{4D \mathfrak{R}(\mathcal{G}) + 2D B_{\ell} \sqrt{\frac{\log 2/\delta}{2n}}}_{\text{Estimation error}} + \underbrace{\kappa_{1}(\delta', n) + D B_{\ell} B_{q} \kappa_{2}(f - \hat{f})}_{\text{Higher order terms}}.$$

Here, $\left\|\cdot\right\|_{W^{1,1}}$ is the $(1,1)$-Sobolev norm, and we define the effective Rademacher complexity $\mathfrak{R}(\mathcal{G})$ by

$\mathfrak{R}(\mathcal{G}) := \frac{1}{n} \mathbb{E}_{\hat{S}} \mathbb{E}_{\sigma} \left[ \sup_{g \in \mathcal{G}} \left| \sum_{i=1}^{n} \sigma_{i} \mathbb{E}_{S'_{2}, \ldots, S'_{D}} [\tilde{\ell}(\hat{s}_{i}, S'_{2}, \ldots, S'_{D})] \right| \right],$ (3)

where $\{\sigma_{i}\}_{i=1}^{n}$ are independent sign variables, $\mathbb{E}_{\hat{S}}$ is the expectation with respect to $\{\hat{s}_{i}\}_{i=1}^{n_{\mathrm{Tar}}}$, the dummy variables $S'_{2}, \ldots, S'_{D}$ are i.i.d. copies of $\hat{s}_{1}$, and $\tilde{\ell}$ is defined by using the degree-$D$ symmetric group $\mathfrak{S}_{D}$ as

$$\tilde{\ell}(s_{1}, \ldots, s_{D}) := \frac{1}{D!} \sum_{\pi \in \mathfrak{S}_{D}} \ell(g, \hat{f}(s_{\pi(1)}^{(1)}, \ldots, s_{\pi(D)}^{(D)})),$$

and $\kappa_{1}(\delta', n)$ and $\kappa_{2}(f - \hat{f})$ are higher order terms. The constants $B_{q}$ and $B_{\ell}$ depend only on $q$ and $\ell$, respectively, while $C$ depends only on $f, q, \ell$, and $D$.

Details of the statement and the proof can be found in Supplementary Material C. The Sobolev norm Adams and Fournier (2003) emerges from evaluating the difference between the estimated IC distribution and the ground-truth IC distribution. In Theorem 2, the utility of the proposed method appears in the effective complexity measure: the complexity is defined by a set of functions that are marginalized over all but one argument, resulting in a dependence on the input dimensionality that is mitigated from exponential to linear (Supplementary Material C, Remark 3).

5 Related Work and Discussion

In this section, we review some existing TAs for DA to clarify the relative position of the paper. We also clarify the relation to the literature of causality-related transfer learning.

5.1 Existing transfer assumptions

Here, we review some of the existing work and TAs. See Table 1 for a summary.

(1) Parametric assumptions.

Some TAs assume parametric distribution families, e.g., Gaussian mixture model in covariate shift Storkey and Sugiyama (2007). Some others assume parametric distribution shift, i.e., parametric representations of the target distribution given the source distributions. Examples include location-scale transform of class conditionals Zhang et al. (2013); Gong et al. (2016), linearly dependent class conditionals Zhang et al. (2015), and low-dimensional representation of the class conditionals after kernel embedding Stojanov et al. (2019). In some applications, e.g., remote sensing, some parametric assumptions have proven useful Zhang et al. (2013).

(2) Invariant conditionals and marginals.

Some methods assume invariance of certain conditionals or marginals Qui (2009), e.g., $p(Y|X)$ in the covariate shift scenario Shimodaira (2000), $p(Y|\mathcal{T}(X))$ for an appropriate feature transformation $\mathcal{T}$ in transfer component analysis Pan et al. (2011), $p(Y|\mathcal{T}(X))$ for a feature selector $\mathcal{T}$ Rojas-Carulla et al. (2018); Magliacane et al. (2018), $p(X|Y)$ in the target shift (TarS) scenario Zhang et al. (2013); Nguyen et al. (2016), and a few components of regular-vine copulas and marginals in Lopez-paz et al. (2012). For example, the covariate shift scenario has been shown to fit well to brain computer interface data Sugiyama et al. (2007).

(3) Small discrepancy or integral probability metric.

Another line of work relies on certain distributional similarities, e.g., integral probability metrics Courty et al. (2017) or hypothesis-class dependent discrepancies Ben-David et al. (2007); Blitzer et al. (2008); Ben-David et al. (2010); Kuroki et al. (2019); Zhang et al. (2019); Cortes et al. (2019). These methods assume the existence of an ideal joint hypothesis Ben-David et al. (2010), corresponding to a relaxation of the covariate shift assumption. These TAs are suited for unsupervised or semi-supervised DA in computer vision applications Courty et al. (2017).

(4) Transferable parameter.

Some others consider parameter transfer Kumagai (2016), where the TA is the existence of a parameterized feature extractor that performs well in the target domain for linear-in-parameter hypotheses and its learnability from the source domain data. For example, such a TA has been known to be useful in natural language processing or image recognition Lee et al. (2009); Kumagai (2016).

Table 1: Comparison of TAs for DA (Parametric: parametric distribution family or distribution shift, Invariant dist.: invariant distribution components such as conditionals, marginals, or copulas. Disc. / IPM: small discrepancy or integral probability metric, Param-transfer: existence of transferable parameter, Mechanism: invariant mechanism). AD: adaptation among Apparently Different distributions is accommodated. NP: Non-Parametrically flexible. BCI: Brain computer interface. The numbers indicate the paragraphs of Section 5.1.
TA AD NP Suited app. example
(1) Parametric \checkmark - Remote sensing
(2) Invariant dist. - \checkmark BCI
(3) Disc. / IPM - \checkmark Computer vision
(4) Param-transfer \checkmark \checkmark Computer vision
(Ours) Mechanism \checkmark \checkmark Medical records

5.2 Causality for transfer learning

Our method can be seen as the first attempt to fully leverage structural causal models for DA. Most causality-inspired DA methods express their assumptions at the level of graphical causal models (GCMs), which carry much coarser information than the structural causal models (SCMs) (Peters et al., 2017, Table 1.1) exploited in this paper. Compared to previous work, our method takes one step further by assuming and exploiting the invariance of SCMs. Specifically, many studies assume the GCM $X \leftarrow Y$ (the anticausal scenario) following the seminal meta-analysis of Schölkopf et al. (2012) and use it to motivate their parametric distribution shift assumptions or the parameter estimation procedure Zhang et al. (2013, 2015); Gong et al. (2016, 2018). Although such assumptions on the GCM have the virtue of being more robust to misspecification, they tend to require parametric assumptions to obtain theoretical justifications. Our assumption, on the other hand, enjoys a theoretical guarantee without relying on parametric assumptions.

One notable work in the existing literature is Magliacane et al. (2018), which considered domain adaptation among different intervention states, a problem setup complementing ours, which considers an intervention-free (or identical-intervention) case. To model intervention states, Magliacane et al. (2018) also formulated the problem setup using SCMs, similarly to the present paper. Therefore, we clarify a few key differences between Magliacane et al. (2018) and our work here. In terms of methodology, Magliacane et al. (2018) take a variable selection approach to select a set of predictor variables with an invariant conditional distribution across different intervention states. On the other hand, our method estimates the SEMs (in the reduced form) and applies a data augmentation procedure to transfer the knowledge. To the best of our knowledge, the present paper is the first to propose a way to directly use the estimated SEMs for domain adaptation, and this fine-grained use of the estimated SEMs enables us to derive an excess risk bound. In terms of plausible applications, their problem setup may be more suitable for fields with interventional experiments such as genomics, whereas ours may be better suited for fields where observational studies are more common, such as health record analysis or economics. In Appendix E, we provide a more detailed comparison.

5.3 Plausibility of the assumptions

Checking the validity of the assumption.

As is often the case in DA, the scarcity of data precludes data-driven testing of the TAs, and we need domain knowledge to judge their validity. For our TA, the intuitive interpretation as the invariance of causal models (Section 2) can be used.

Invariant causal mechanisms.

The invariance of causal mechanisms has been exploited in recent work on causal discovery such as Xu et al. (2014) and Monti et al. (2019), and under the name of the multi-environment setting in Ghassami et al. (2017). Moreover, SEMs are normally assumed to remain invariant unless explicitly intervened on Hünermund and Bareinboim (2019). However, the invariance assumption presumes that the intervention states do not vary across domains (allowing for the intervention-free case), which can be limiting for applications where different interventions are likely to be present, e.g., different treatment policies put in place in different hospitals. Nevertheless, the present work can already be of practical interest when combined with an effort to find suitable data or situations. For instance, one may find medical records in group hospitals where the same treatment criteria are in place, or local surveys in the same district enforcing identical regulations. Relaxing this requirement to facilitate the data-gathering process is an important direction for future work. For such future extensions, the present theoretical analyses can also serve as a landmark establishing what can be guaranteed in the basic case without mechanism alterations.

Fully observed variables.

As the first algorithm in the approach of fully exploiting SCMs for DA, we consider the case where all variables are observable. Although it is often assumed in causal inference problems that there are unobserved confounding variables, we leave the extension to such a case for future work.

Required number of source domains.

A potential drawback of the proposed method is that it requires a certain number of source domains in order to satisfy the identification condition of the nonlinear ICA method, namely GCL in this paper (Supplementary Material A). The requirement arises solely from the identification condition of the ICA method and may therefore be made less stringent by future developments in nonlinear ICA. Moreover, if one can accept other identification conditions, single-sample ICA methods (e.g., linear ICA) can also be used in the proposed approach in a straightforward manner, and our theoretical analyses hold regardless of the method chosen.

Flexibility of the model.

The relation between X and Y can drastically change while f is invariant. For example, even in a simple additive noise model (X, Y) = f(S_1, S_2) = (S_1, S_1 + S_2), the conditional p(Y|X) can shift drastically if the distribution of the independent noise S_2 changes in a complex manner, e.g., from unimodal to multimodal.
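This flexibility can be illustrated with a small numpy simulation (a hedged sketch; the sample size and noise parameters are arbitrary illustrations, not taken from the paper): the mechanism f stays fixed while the noise distribution of S_2 changes, and the conditional p(Y|X) shifts from unimodal to bimodal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
s1 = rng.normal(size=n)

def f(s1, s2):
    # The fixed mechanism shared by both domains: X = S1, Y = S1 + S2.
    return s1, s1 + s2

# Source domain: unimodal independent noise S2.
x_src, y_src = f(s1, rng.normal(scale=0.5, size=n))
# Target domain: the same f, but S2 becomes a bimodal mixture.
s2_tar = rng.normal(loc=rng.choice([-2.0, 2.0], size=n), scale=0.3)
x_tar, y_tar = f(s1, s2_tar)

# Near X = 0 the conditional of Y is essentially the S2 distribution, so
# its variance reveals the drastic shift of p(Y|X) despite the invariant f.
print("source Var[Y | X near 0]:", np.var(y_src[np.abs(x_src) < 0.05]))
print("target Var[Y | X near 0]:", np.var(y_tar[np.abs(x_tar) < 0.05]))
```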

6 Experiment

In this section, we provide proof-of-concept experiments to demonstrate the effectiveness of the proposed approach. Note that the primary purpose of the experiments is to confirm whether the proposed method can properly perform DA in real-world data, and it is not to determine which DA method and TA are the most suited for the specific dataset.

6.1 Implementation details of the proposed method

Estimation of ff (Step 1).

We model \hat{f} by an 8-layer Glow neural network (Supplementary Material B.2). We model \psi_d by a 1-hidden-layer neural network with a varied number of hidden units, K output units, and the rectified linear unit activation LeCun et al. (2015). We use its k-th output (k \in [K]) as the value for \psi_d(\cdot, k). For training, we use the Adam optimizer Kingma and Ba (2017) with fixed parameters (\beta_1, \beta_2, \epsilon) = (0.9, 0.999, 10^{-8}), a fixed initial learning rate of 10^{-3}, and a maximum of 300 epochs. The other fixed hyperparameters of \hat{f} and its training process are described in Supplementary Material B.

Augmentation of target data (Step 3).

For each evaluation step, we take all combinations (with replacement) of the estimated ICs to synthesize target domain data. After synthesizing the data, we filter them by applying a novelty detection technique with respect to the union of source domain data: we use a one-class support vector machine Schölkopf et al. (2000) with the fixed parameter \nu = 0.1 and the radial basis function (RBF) kernel k(x,y) = \exp(-\|x-y\|^2/\gamma) with \gamma = D. This is because the estimated transform \hat{f} is not expected to be well trained outside the union of the supports of the source distributions. After the filtration, we combine the original target training data with the augmented data so that the original data points are always included.
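The augmentation-and-filtering step can be sketched as follows (a minimal sketch assuming scikit-learn; fhat, S_target, and Z_source are hypothetical stand-ins for the trained invertible map, the estimated target ICs, and the pooled source data). Note that scikit-learn parametrizes the RBF kernel as exp(-gamma * ||x - y||^2), so the paper's \gamma = D corresponds to a gamma parameter of 1/D.

```python
import numpy as np
from itertools import product
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
n, D = 6, 3                                  # 6 target points, 3 dimensions

S_target = rng.normal(size=(n, D))           # stand-in: estimated target ICs
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])              # stand-in invertible mixing
fhat = lambda s: s @ A

# Step 3a: all combinations (with replacement) of the estimated ICs, i.e.
# every way of picking one IC value per dimension across the n data points.
idx = np.array(list(product(range(n), repeat=D)))        # n**D index tuples
S_aug = np.stack([S_target[idx[:, d], d] for d in range(D)], axis=1)
Z_aug = fhat(S_aug)

# Step 3b: novelty-detection filter against the union of source data.
Z_source = fhat(rng.normal(size=(500, D)))   # stand-in pooled source data
ocsvm = OneClassSVM(nu=0.1, kernel="rbf", gamma=1.0 / D).fit(Z_source)
Z_kept = Z_aug[ocsvm.predict(Z_aug) == 1]    # +1 marks inliers

# Always keep the original target points alongside the filtered augmentation.
Z_final = np.vstack([fhat(S_target), Z_kept])
print(Z_aug.shape, Z_kept.shape, Z_final.shape)
```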

Predictor hypothesis class 𝒢\mathcal{G}.

As the predictor model, we use kernel ridge regression (KRR) with the RBF kernel. The bandwidth \gamma is chosen by the median heuristic similarly to Yamada et al. (2011) for simplicity. Note that this choice of predictor model is for comparison with the other methods tailored for KRR Cortes et al. (2019); an arbitrary predictor hypothesis class and learning algorithm can easily be combined with the proposed approach.
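For concreteness, a minimal sketch of KRR with a median-heuristic bandwidth (assuming scikit-learn and SciPy; conventions for the median heuristic vary, e.g., median of distances versus squared distances, and scikit-learn's gamma parameter is the inverse of the bandwidth \gamma used here). The toy data are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                      # toy training inputs
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=40)

# Median heuristic: set the bandwidth to the median pairwise squared distance.
gamma = np.median(pdist(X, "sqeuclidean"))

# scikit-learn's RBF kernel is exp(-gamma_param * ||x - y||^2).
krr = KernelRidge(alpha=1e-2, kernel="rbf", gamma=1.0 / gamma).fit(X, y)
print("train MSE:", np.mean((krr.predict(X) - y) ** 2))
```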

Hyperparameter selection.

We perform a grid search for hyperparameter selection. The number of hidden units for \psi_d is chosen from \{10, 20\} and the weight-decay coefficient from 10^{\{-2,-1\}}. The \ell^2 regularization coefficient \lambda of KRR is chosen from \lambda \in 2^{\{-10,\ldots,10\}} following Cortes et al. (2019). To perform hyperparameter selection as well as early stopping, we record the leave-one-out cross-validation (LOOCV) mean squared error on the target training data every 20 epochs and select its minimizer. The leave-one-out score is computed using the well-known analytic formula instead of training the predictor for each split. Note that we only use the original target domain data as the held-out set, not the synthesized data. In practice, if the target domain data are extremely few, one may well use percentile-cv Ng (1997) to mitigate overfitting in hyperparameter selection.
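The analytic LOOCV formula mentioned above is the standard shortcut for KRR: with hat matrix H = K(K + \lambda I)^{-1}, the leave-one-out residual is (y_i - \hat{y}_i)/(1 - H_{ii}), so no refitting is needed. A small numpy check on toy data (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 0.1
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# RBF Gram matrix with unit bandwidth.
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Analytic LOO residuals via the hat matrix H = K (K + lam I)^{-1}.
H = K @ np.linalg.inv(K + lam * np.eye(n))
loo_analytic = (y - H @ y) / (1.0 - np.diag(H))

# Brute-force check: retrain KRR n times, each time holding out one point.
loo_brute = np.empty(n)
for i in range(n):
    m = np.ones(n, dtype=bool)
    m[i] = False
    alpha = np.linalg.solve(K[np.ix_(m, m)] + lam * np.eye(n - 1), y[m])
    loo_brute[i] = y[i] - K[i, m] @ alpha

print(np.allclose(loo_analytic, loo_brute))  # expected: True
```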

Computation environment.

All experiments were conducted on an Intel Xeon(R) 2.60 GHz CPU with 132 GB memory. They were implemented in Python using the PyTorch library Paszke et al. (2019) or the R language R Core Team (2018).

6.2 Experiment using real-world data

Dataset.

We use the gasoline consumption data (Greene, 2012, p.284, Example 9.5), which is panel data of gasoline usage in 18 OECD countries over 19 years. We consider each country as a domain; in this proof-of-concept experiment, we disregard the time-series structure and treat the data as i.i.d. samples for each country. The dataset contains four variables, all log-transformed: motor gasoline consumption per car (the predicted variable), per-capita income, motor gasoline price, and the stock of cars per capita (the predictor variables) Baltagi and Griffin (1983). For further details of the data, see Supplementary Material B. We used this dataset because there are very few public datasets for regression DA tasks Cortes and Mohri (2014), especially for multi-source DA, and because the dataset has been used in econometric analyses involving SEMs Baltagi (2005), conforming to our approach.

Compared methods.

We compare the following DA methods, all of which apply to regression problems. Unless explicitly specified, the predictor class \mathcal{G} is chosen to be KRR with the same hyperparameter candidates as the proposed method (Section 6.1). Further details are described in Supplementary Material B.5.

  • Naive baselines (SrcOnly, TarOnly, and S&TV): SrcOnly (resp. TarOnly) trains a predictor on the source domain data (resp. target training data) without any adaptation mechanism. SrcOnly can be effective if the source domains and the target domain have highly similar distributions. The S&TV baseline trains on both source and target domain data, but the LOOCV score is computed only from the target domain data.

  • TrAdaBoost: Two-stage TrAdaBoost.R2, a boosting method tailored for few-shot regression transfer Pardoe and Stone (2010). It is an iterative method with early stopping Pardoe and Stone (2010), for which we use the leave-one-out cross-validation score on the target domain data as the criterion. As suggested in Pardoe and Stone (2010), we set the maximum number of outer-loop iterations to 30. The base predictor is a decision tree regressor with maximum depth 6 Hastie et al. (2009). Note that although TrAdaBoost does not come with an explicitly stated transfer assumption, we compare its performance for reference.

  • IW: Importance-weighted KRR using RuLSIF Yamada et al. (2011). The method directly estimates a relative joint density-ratio function \frac{p_{\mathrm{Src}}(z)}{\alpha p_{\mathrm{Src}}(z)+(1-\alpha)p_{\mathrm{Tar}}(z)} for \alpha \in [0,1), where p_{\mathrm{Src}} is a hypothetical source distribution created by pooling all source domain data. Following Yamada et al. (2011), we experiment with \alpha \in \{0, 0.5, 0.95\} and report the results separately. The regularization coefficient \lambda' is selected from \lambda' \in 2^{\{-10,\ldots,10\}} using importance-weighted cross-validation Sugiyama et al. (2007).

  • GDM: Generalized discrepancy minimization Cortes et al. (2019). This method performs instance-weighted training on the source domain data with weights that minimize the generalized discrepancy (via quadratic programming). We select the hyperparameter \lambda_r from 2^{\{-10,\ldots,10\}} as suggested by Cortes et al. (2019). The selection criterion is the performance of the trained predictor on the target training labels, as the method trains on the source domain data and the target unlabeled data.

  • Copula: Non-parametric regular-vine copula method Lopez-paz et al. (2012). This method presumes using a specific joint density estimator called regular-vine (R-vine) copulas. Adaptation is realized in two steps: the first step estimates which components of the constructed R-vine model are different by performing two-sample tests based on maximum mean discrepancy Lopez-paz et al. (2012), and the second step re-estimates the components in which a change is detected using only the target domain data.

  • LOO (reference score): A leave-one-out cross-validated error estimate, calculated for reference. It is the average error of predicting a single held-out test point with a predictor trained on all of the remaining target domain data, including the points that serve as the test set for the other algorithms.
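To make the weighted-training baselines concrete, instance-weighted KRR with given density-ratio weights solves min_g \sum_i w_i (y_i - g(x_i))^2 + \lambda \|g\|^2, whose kernelized solution is \alpha = (WK + \lambda I)^{-1} W y. A minimal numpy sketch (the weights w are simulated here for illustration; the IW baseline estimates them with RuLSIF):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
X = rng.normal(size=(n, 1))
y = np.cos(X[:, 0]) + 0.1 * rng.normal(size=n)

# Simulated importance weights standing in for an estimated density ratio.
w = np.exp(-X[:, 0] ** 2)

# Weighted KRR: minimize sum_i w_i * (y_i - g(x_i))^2 + lam * ||g||^2,
# which in kernel form gives alpha = (W K + lam I)^{-1} W y.
K = np.exp(-(X - X.T) ** 2)                # unit-bandwidth RBF Gram matrix
alpha = np.linalg.solve(np.diag(w) @ K + lam * np.eye(n), w * y)
yhat = K @ alpha
print("weighted train MSE:", np.mean(w * (yhat - y) ** 2) / np.mean(w))
```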

Evaluation procedure.

The prediction accuracy was measured by the mean squared error (MSE). For each train-test split, we randomly select one-third (6 points) of the target domain dataset as the training set and use the rest as the test set. All experiments were repeated 10 times with different train-test splits of target domain data.

Results.

The results are reported in Table 2. We report the MSE scores normalized by that of LOO to facilitate comparison, similarly to Cortes and Mohri (2014). In many of the target domain choices, the naive baselines (SrcOnly and S&TV) suffer from negative transfer, i.e., higher average MSE than TarOnly (in 12 out of 18 domains). On the other hand, the proposed method performs better than TarOnly or is more resistant to negative transfer than the other compared methods. The performances of GDM, Copula, and IW are often inferior even to the baseline S&TV. For GDM and IW, this can be attributed to the fact that these methods presume the availability of abundant (unlabeled) target domain data, which is unavailable in the current problem setup. For Copula, the performance inferior to the naive baselines is possibly due to the restriction of the predictor model to its accompanying probability model Lopez-paz et al. (2012). TrAdaBoost works reasonably well for many but not all domains. For some domains, it suffered from negative transfer similarly to the others, possibly because of the very small number of training data points. Note that the transfer assumption of TrAdaBoost has not been stated Pardoe and Stone (2010), and it is not understood when the method is reliable. The domains on which the baselines perform better than the proposed method can be explained by two cases: (1) easier domains allow naive baselines to perform well, and (2) some domains may have a deviating f. Case (1) implies that estimating f is unnecessary, so the proposed method can be suboptimal (more likely for JPN, NLD, NOR, and SWE, where SrcOnly or S&TV improve upon TarOnly). Case (2) implies that an approximation error was induced as in Theorem 2 (more likely for IRL and ITA); in this case, the other methods also perform poorly, implying the difficulty of the problem instance. In either case, in practice, one may well perform cross-validation to fall back to the baselines.

Table 2: Results of the real-world data experiments for different choices of the target domain. The evaluation score is MSE normalized by that of LOO (the lower the better). All experiments were repeated 10 times with different train-test splits of the target domain data, and the average performance is reported with standard errors in brackets. The Target column indicates abbreviated country names. Bold face indicates the best score (Prop: proposed method, TrAda: TrAdaBoost; the numbers in the brackets of IW indicate the value of \alpha). The proposed method often improves upon the baseline TarOnly or is relatively more resistant to negative transfer, with notable improvements in DEU, GBR, and USA.
Target (LOO) TarOnly Prop SrcOnly S&TV TrAda GDM Copula IW(.0) IW(.5) IW(.95)
AUT 1 5.88 (1.60) 5.39 (1.86) 9.67 (0.57) 9.84 (0.62) 5.78 (2.15) 31.56 (1.39) 27.33 (0.77) 39.72 (0.74) 39.45 (0.72) 39.18 (0.76)
BEL 1 10.70 (7.50) 7.94 (2.19) 8.19 (0.68) 9.48 (0.91) 8.10 (1.88) 89.10 (4.12) 119.86 (2.64) 105.15 (2.96) 105.28 (2.95) 104.30 (2.95)
CAN 1 5.16 (1.36) 3.84 (0.98) 157.74 (8.83) 156.65 (10.69) 51.94 (30.06) 516.90 (4.45) 406.91 (1.59) 592.21 (1.87) 591.21 (1.84) 589.87 (1.91)
DNK 1 3.26 (0.61) 3.23 (0.63) 30.79 (0.93) 28.12 (1.67) 25.60 (13.11) 16.84 (0.85) 14.46 (0.79) 22.15 (1.10) 22.11 (1.10) 21.72 (1.07)
FRA 1 2.79 (1.10) 1.92 (0.66) 4.67 (0.41) 3.05 (0.11) 52.65 (25.83) 91.69 (1.34) 156.29 (1.96) 116.32 (1.27) 116.54 (1.25) 115.29 (1.28)
DEU 1 16.99 (8.04) 6.71 (1.23) 229.65 (9.13) 210.59 (14.99) 341.03 (157.80) 739.29 (11.81) 929.03 (4.85) 817.50 (4.60) 818.13 (4.55) 812.60 (4.57)
GRC 1 3.80 (2.21) 3.55 (1.79) 5.30 (0.90) 5.75 (0.68) 11.78 (2.36) 26.90 (1.89) 23.05 (0.53) 47.07 (1.92) 45.50 (1.82) 45.72 (2.00)
IRL 1 3.05 (0.34) 4.35 (1.25) 135.57 (5.64) 12.34 (0.58) 23.40 (17.50) 3.84 (0.22) 26.60 (0.59) 6.38 (0.13) 6.31 (0.14) 6.16 (0.13)
ITA 1 13.00 (4.15) 14.05 (4.81) 35.29 (1.83) 39.27 (2.52) 87.34 (24.05) 226.95 (11.14) 343.10 (10.04) 244.25 (8.50) 244.84 (8.58) 242.60 (8.46)
JPN 1 10.55 (4.67) 12.32 (4.95) 8.10 (1.05) 8.38 (1.07) 18.81 (4.59) 95.58 (7.89) 71.02 (5.08) 135.24 (13.57) 134.89 (13.50) 134.16 (13.43)
NLD 1 3.75 (0.80) 3.87 (0.79) 0.99 (0.06) 0.99 (0.05) 9.45 (1.43) 28.35 (1.62) 29.53 (1.58) 33.28 (1.78) 33.23 (1.77) 33.14 (1.77)
NOR 1 2.70 (0.51) 2.82 (0.73) 1.86 (0.29) 1.63 (0.11) 24.25 (12.50) 23.36 (0.88) 31.37 (1.17) 27.86 (0.94) 27.86 (0.93) 27.52 (0.91)
ESP 1 5.18 (1.05) 6.09 (1.53) 5.17 (1.14) 4.29 (0.72) 14.85 (4.20) 33.16 (6.99) 152.59 (6.19) 53.53 (2.47) 52.56 (2.42) 52.06 (2.40)
SWE 1 6.44 (2.66) 5.47 (2.63) 2.48 (0.23) 2.02 (0.21) 2.18 (0.25) 15.53 (2.59) 2706.85 (17.91) 118.46 (1.64) 118.23 (1.64) 118.27 (1.64)
CHE 1 3.51 (0.46) 2.90 (0.37) 43.59 (1.77) 7.48 (0.49) 38.32 (9.03) 8.43 (0.24) 29.71 (0.53) 9.72 (0.29) 9.71 (0.29) 9.79 (0.28)
TUR 1 1.65 (0.47) 1.06 (0.15) 1.22 (0.18) 0.91 (0.09) 2.19 (0.34) 64.26 (5.71) 142.84 (2.04) 159.79 (2.63) 157.89 (2.63) 157.13 (2.69)
GBR 1 5.95 (1.86) 2.66 (0.57) 15.92 (1.02) 10.05 (1.47) 7.57 (5.10) 50.04 (1.75) 68.70 (1.25) 70.98 (1.01) 70.87 (0.99) 69.72 (1.01)
USA 1 4.98 (1.96) 1.60 (0.42) 21.53 (3.30) 12.28 (2.52) 2.06 (0.47) 308.69 (5.20) 244.90 (1.82) 462.51 (2.14) 464.75 (2.08) 465.88 (2.16)
#Best - 2 10 2 4 0 0 0 0 0 0

7 Conclusion

In this paper, we proposed a novel few-shot supervised DA method for regression problems based on the assumption of a shared generative mechanism. Through theoretical and experimental analysis, we demonstrated the effectiveness of the proposed approach. By considering the latent common structure behind the domain distributions, the proposed method successfully induces positive transfer even when a naive usage of the source domain data can suffer from negative transfer. Our future work includes an experimental comparison with extensively more datasets and methods as well as an extension to the case where the underlying mechanisms are not exactly identical but similar.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments and thorough discussions. We would also like to thank Yuko Kuroki and Taira Tsuchiya for proofreading the manuscript. This work was supported by RIKEN Junior Research Associate Program. TT was supported by Masason Foundation. IS was supported by KAKEN 17H04693. MS was supported by JST CREST Grant Number JPMJCR18A2.

References

  • Quiñonero-Candela et al. (2009) Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, editors. Dataset Shift in Machine Learning. Neural Information Processing Series. MIT Press, Cambridge, Mass, 2009.
  • Adams and Fournier (2003) Robert A. Adams and John JF Fournier. Sobolev Spaces. Academic press, 2003.
  • Arjovsky et al. (2020) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv:1907.02893 [cs, stat], March 2020.
  • Baltagi (2005) Badi Baltagi. Econometric Analysis of Panel Data. New York: John Wiley and Sons, 3rd edition, 2005.
  • Baltagi and Griffin (1983) Badi H. Baltagi and James M. Griffin. Gasoline demand in the OECD: An application of pooling and testing procedures. European Economic Review, 22(2):117–137, 1983.
  • Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • Blitzer et al. (2008) John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems 20, pages 129–136. Curran Associates, Inc., 2008.
  • Clémençon et al. (2016) Stephan Clémençon, Igor Colin, and Aurélien Bellet. Scaling-up empirical risk minimization: Optimization of incomplete U-statistics. Journal of Machine Learning Research, 17(76):1–36, 2016.
  • Cortes and Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2019) Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, 20(1):1–30, 2019.
  • Courty et al. (2017) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems 30, pages 3730–3739. Curran Associates, Inc., 2017.
  • Ghassami et al. (2017) AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In Advances in Neural Information Processing Systems 30, pages 3011–3021. Curran Associates, Inc., 2017.
  • Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, 4th edition, 2013.
  • Gong et al. (2016) Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In Proceedings of the 33rd International Conference on Machine Learning, pages 2839–2848, New York, USA, 2016. PMLR.
  • Gong et al. (2018) Mingming Gong, Kun Zhang, Biwei Huang, Clark Glymour, Dacheng Tao, and Kayhan Batmanghelich. Causal generative domain adaptation networks. arXiv:1804.04333 [cs, stat], April 2018.
  • Greene (2012) William H. Greene. Econometric Analysis. Prentice Hall, Boston, 7th edition, 2012.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
  • Hayfield and Racine (2008) Tristen Hayfield and Jeffrey S. Racine. Nonparametric econometrics: The np package. Journal of Statistical Software, 27(5), 2008.
  • Hünermund and Bareinboim (2019) Paul Hünermund and Elias Bareinboim. Causal inference and data-fusion in econometrics. arXiv:1912.09104 [econ], December 2019.
  • Hyvärinen and Pajunen (1999) A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
  • Hyvärinen and Morioka (2016) Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems 29, pages 3765–3773. Curran Associates, Inc., 2016.
  • Hyvärinen and Morioka (2017) Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 460–469, 2017.
  • Hyvärinen et al. (2019) Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868, 2019.
  • Ipsen and Rehman (2008) Ilse C. F. Ipsen and Rizwana Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM Journal on Matrix Analysis and Applications, 30(2):762–776, 2008.
  • Kano and Shimizu (2003) Yutaka Kano and Shohei Shimizu. Causal inference using nonnormality. In Proceedings of the International Symposium on the Science of Modeling, the 30th Anniversary of the Information Criterion,, pages 261–270, 2003.
  • Khemakhem et al. (2019) Ilyes Khemakhem, Diederik P. Kingma, Ricardo Pio Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. arXiv:1907.04809 [cs, stat], July 2019.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs], January 2017.
  • Kingma and Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, pages 10215–10224. Curran Associates, Inc., 2018.
  • Kumagai (2016) Wataru Kumagai. Learning bound for parameter transfer learning. In Advances in Neural Information Processing Systems 29, pages 2721–2729. Curran Associates, Inc., 2016.
  • Kuroki et al. (2019) Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4122–4129, 2019.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Lee (1990) A. J. Lee. U-Statistics: Theory and Practice. M. Dekker, New York, 1990.
  • Lee et al. (2009) Honglak Lee, Rajat Raina, Alex Teichman, and Andrew Y. Ng. Exponential family sparse coding with applications to self-taught learning. In Proceedings of the 21st International Jont Conference on Artifical Intelligence, pages 1113–1119, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
  • Lopez-paz et al. (2012) David Lopez-paz, Jose M. Hernández-lobato, and Bernhard Schölkopf. Semi-supervised domain adaptation with non-parametric copulas. In Advances in Neural Information Processing Systems 25, pages 665–673. Curran Associates, Inc., 2012.
  • Magliacane et al. (2018) Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31, pages 10846–10856. Curran Associates, Inc., 2018.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012.
  • Monti et al. (2019) Ricardo Pio Monti, Kun Zhang, and Aapo Hyvärinen. Causal discovery with general non-linear relationships using non-linear ICA. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019.
  • Ng (1997) Andrew Y. Ng. Preventing “overfitting” of cross-validation data. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 245–253, San Francisco, CA, USA, 1997.
  • Nguyen et al. (2016) Tuan Duong Nguyen, Marthinus Christoffel, and Masashi Sugiyama. Continuous Target Shift Adaptation in Supervised Learning. In Asian Conference on Machine Learning, volume 45 of Proceedings of Machine Learning Research, pages 285–300. PMLR, 2016.
  • Pan et al. (2011) Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • Papa et al. (2015) Guillaume Papa, Stéphan Clémençon, and Aurélien Bellet. SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk. In Advances in Neural Information Processing Systems 28, pages 1027–1035. Curran Associates, Inc., 2015.
  • Pardoe and Stone (2010) David Pardoe and Peter Stone. Boosting for regression transfer. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 863–870, Haifa, Israel, 2010.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Pearl (2009) Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, U.K. ; New York, 2nd edition, 2009.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, Massachuestts, 2017.
  • R Core Team (2018) R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2018.
  • Reiss and Wolak (2007) Peter C. Reiss and Frank A. Wolak. Structural econometric modeling: Rationales and examples from industrial organization. In Handbook of Econometrics, volume 6, pages 4277–4415. Elsevier, 2007.
  • Rejchel (2012) Wojciech Rejchel. On ranking and generalization bounds. Journal of Machine Learning Research, 13(May):1373–1392, 2012.
  • Rojas-Carulla et al. (2018) Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36):1–34, 2018.
  • Schölkopf et al. (2000) Bernhard Schölkopf, Robert C Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Coference on Machine Learning, pages 459–466. Omnipress, 2012.
  • Sherman (1994) Robert P. Sherman. Maximal inequalities for degenerate U-processes with applications to optimization estimators. The Annals of Statistics, 22(1):439–459, 1994.
  • Shimizu et al. (2006) Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti J Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(October):2003–2030, 2006.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Stojanov et al. (2019) Petar Stojanov, Mingming Gong, Jaime Carbonell, and Kun Zhang. Data-driven approach to multiple-source domain adaptation. In Proceedings of Machine Learning Research, volume 89, pages 3487–3496. PMLR, 2019.
  • Storkey and Sugiyama (2007) Amos J Storkey and Masashi Sugiyama. Mixture regression for covariate shift. In Advances in Neural Information Processing Systems 19, pages 1337–1344. MIT Press, 2007.
  • Sugiyama et al. (2007) Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
  • Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 1st edition, 2019.
  • Xu et al. (2014) Lele Xu, Tingting Fan, Xia Wu, KeWei Chen, Xiaojuan Guo, Jiacai Zhang, and Li Yao. A pooling-LiNGAM algorithm for effective connectivity analysis of fMRI data. Frontiers in Computational Neuroscience, 8(October):125, 2014.
  • Yadav et al. (2018) Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. Mining electronic health records (EHRs): A survey. ACM Computing Surveys, 50(6):1–40, 2018.
  • Yamada et al. (2011) Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems 24, pages 594–602. Curran Associates, Inc., 2011.
  • Zhang et al. (2015) K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3150–3157. AAAI Press, 2015.
  • Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, pages 819–827, 2013.
  • Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning, pages 7404–7413, Long Beach, California, USA, 2019. PMLR.

This is the Supplementary Material for “Few-shot Domain Adaptation by Causal Mechanism Transfer.” Table 3 summarizes the abbreviations and the symbols used in the paper.

Table 3: List of abbreviations and symbols used in the paper.
Abbreviation / Symbol Meaning
DA Domain adaptation
TA Transfer assumption
SEM Structural equation model
GCM Graphical causal model
SCM Structural causal model
IC Independent component
ICA Independent component analysis
GCL Generalized contrastive learning
i.i.d. Independent and identically distributed
[N]  \{1,2,\ldots,N\} where N\in\mathbb{N}
\|\cdot\|_{W^{k,p}}  The (k,p)-Sobolev norm
X  The predictor random vector (\mathbb{R}^{D-1}-valued)
Y  The predicted random variable (\mathbb{R}-valued)
Z=(X,Y)  The joint random variable (\mathbb{R}^{D}-valued)
S  The independent component vector (\mathbb{R}^{D}-valued)
\mathcal{X}\subset\mathbb{R}^{D-1}  The space of X
\mathcal{Y}\subset\mathbb{R}  The space of Y
\mathcal{Z}\subset\mathbb{R}^{D}  The space of Z=(X,Y)
\mathcal{G}\subset\{g:\mathbb{R}^{D-1}\to\mathbb{R}\}  Predictor hypothesis class
\ell:\mathcal{G}\times\mathcal{Z}\to[0,B_{\ell}]  Loss function
R(g)  Target domain risk \mathbb{E}_{p_{\mathrm{Tar}}}\ell(g,Z)
g^{*}\in\mathcal{G}  Minimizer of the target domain risk
\mathcal{Q}  The set of independent distributions
f  Ground-truth mixing function
p_{\mathrm{Tar}}  The target joint distribution
p_{k}  The joint distribution of source domain k
q_{\mathrm{Tar}}\in\mathcal{Q}  The target independent component (IC) distribution
q_{k}\in\mathcal{Q}  The IC distribution of source domain k
D  The dimension of \mathcal{Z}
K  The number of source domains
n_{\mathrm{Tar}}  The size of the target labeled sample
n_{k}  The size of the labeled sample from source domain k
\mathcal{D}_{\mathrm{Tar}}=\{Z_{i}\}_{i=1}^{n_{\mathrm{Tar}}}  Target labeled data set
\mathcal{D}_{k}=\{Z^{\mathrm{Src}}_{k,i}\}_{i=1}^{n_{k}}  Source labeled data set of source domain k
\hat{R}(g)  The ordinary empirical risk estimator
\check{R}(g)  The proposed risk estimator (Eq. (2))
\hat{f}  The estimator of f
\{\psi_{d}\}_{d=1}^{D}  The penultimate-layer functions composed with f during GCL

Appendix A Preliminary: Nonlinear ICA

Here, we use the same notation as the main text. Recently developed nonlinear ICA provides an algorithm to estimate the mixing function f. For nonlinear f, the impossibility of identification (i.e., consistent estimation) of f in the one-sample i.i.d. case was established more than two decades ago (Hyvärinen and Pajunen, 1999). Recently, however, various conditions have been proposed under which f can be identified with the help of auxiliary information Hyvärinen and Morioka (2016, 2017); Hyvärinen et al. (2019); Khemakhem et al. (2019).

The identification condition directly relevant to this paper is that of the generalized contrastive learning (GCL) proposed in Hyvärinen et al. (2019). GCL assumes that an auxiliary variable $u_{i}$ from some measurable set $\mathcal{U}$ is observed for each data point, yielding $\{(z_{i},u_{i})\}_{i=1}^{n}$, and that the ICs $S=(S^{(1)},\ldots,S^{(D)})$ are conditionally independent given $u$:

q(s|u)=d=1Dq(d)(s(d)|u).\begin{split}q(s|u)=\prod_{d=1}^{D}q^{(d)}(s^{(d)}|u).\end{split}

Under such conditions, GCL estimates ff by training a classification function

rf^,𝝍(z,u)=d=1Dψd(f^1(z)d,u)\begin{split}r_{\hat{f},\bm{\psi}}(z,u)=\sum_{d=1}^{D}\psi_{d}({\hat{f}}^{-1}(z)_{d},u)\end{split} (4)

parametrized by f^\hat{f} and {ψd}d=1D\{\psi_{d}\}_{d=1}^{D} with the logistic loss for classifying

(z,u) vs. (z,u~),\begin{split}(z,u)\text{ vs. }(z,\tilde{u}),\end{split}

where u~𝒰{u}\tilde{u}\in\mathcal{U}\setminus\{u\}. The key condition for the identification of ff is the following.

Assumption 1 (Assumption of variability; Hyvärinen et al., 2019, Theorem 1).

For any zz, there exist 2D+12D+1 distinct points in 𝒰\mathcal{U}, denoted by {uj}j=02D\{u_{j}\}_{j=0}^{2D}, such that the set of (2D)(2D)-dimensional vectors {w(z|uj)w(z|u0)}j=12D\{w(z|u_{j})-w(z|u_{0})\}_{j=1}^{2D} are linearly independent, where

w(z|u):=(logq(1)(z1|u)z1,,logq(D)(zD|u)zD,2logq(1)(z1|u)z12,,2logq(D)(zD|u)zD2).\begin{split}w(z|u):=\left(\frac{\partial\log q^{(1)}(z_{1}|u)}{\partial z_{1}},\ldots,\frac{\partial\log q^{(D)}(z_{D}|u)}{\partial z_{D}},\frac{\partial^{2}\log q^{(1)}(z_{1}|u)}{\partial z_{1}^{2}},\ldots,\frac{\partial^{2}\log q^{(D)}(z_{D}|u)}{\partial z_{D}^{2}}\right).\\ \end{split}

Under Assumption 1 and some regularity conditions, Theorem 1 of Hyvärinen et al. (2019) states that the transformation $\hat{f}$ in Eq. (4) trained by GCL is a consistent estimator of $f$ up to additional dimension-wise invertible transformations. Note that the assumption is intrinsically difficult to verify from data due to the unsupervised nature of the problem setting. In this paper, we use the source domain index as the auxiliary variable and employ GCL for domain adaptation. The present version of Assumption 1 requires at least $2D+1$ distinct source domains. Although this condition can be restrictive for high-dimensional data, it may be relaxed in the future because the identification condition is only known to be sufficient, not necessary; pursuing such a refinement is, however, outside the scope of this paper. Among the various methods for nonlinear ICA, we chose GCL Hyvärinen et al. (2019) because it can operate under a nonparametric assumption on the IC distributions, whereas other nonlinear ICA methods Hyvärinen and Morioka (2016, 2017); Khemakhem et al. (2019) may require parametric assumptions.
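To make the GCL objective concrete, the following is a minimal NumPy sketch of how the contrastive classification problem can be assembled: positive pairs keep each point's own domain index, negative pairs swap in a wrong one, and the logistic loss is evaluated on the score $r_{\hat{f},\bm{\psi}}(z,u)=\sum_{d}\psi_{d}(\hat{f}^{-1}(z)_{d},u)$ of Eq. (4). The toy mixing, the fixed feature maps `psi`, and the two-domain setup are hypothetical placeholders (and, for brevity, the sketch ignores the $2D+1$-domain requirement of Assumption 1); in GCL, $\hat{f}$ and $\{\psi_{d}\}$ are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 200, 2

# Hypothetical data-generating process: ICs are independent given the domain
# index u, with domain-dependent scales (cf. q(s|u) = prod_d q^(d)(s^(d)|u)).
u = rng.integers(0, 2, size=n)          # auxiliary variable: source domain index
scales = np.array([[1.0, 0.5],          # IC standard deviations, domain 0
                   [0.3, 2.0]])         # IC standard deviations, domain 1
s = rng.normal(size=(n, D)) * scales[u]
z = np.stack([s[:, 0], s[:, 1] + s[:, 0] ** 3], axis=1)  # toy nonlinear mixing f

def f_hat_inv(z):
    # Stand-in for the learned inverse f̂^{-1} (here, the exact inverse of the
    # toy mixing; in GCL this would be a trained invertible network).
    s0 = z[:, 0]
    return np.stack([s0, z[:, 1] - s0 ** 3], axis=1)

def psi(s_d, u, d):
    # Hypothetical per-dimension function psi_d(., u); GCL learns these jointly.
    return -0.5 * s_d ** 2 / scales[u, d] ** 2 - np.log(scales[u, d])

def score(z, u):
    # r_{f̂,psi}(z, u) = sum_d psi_d(f̂^{-1}(z)_d, u)  (Eq. (4))
    s_hat = f_hat_inv(z)
    return sum(psi(s_hat[:, d], u, d) for d in range(D))

# Contrastive pairs: (z, u) labeled positive vs. (z, u~) with a wrong index.
u_tilde = 1 - u                         # the other domain (two domains here)
pos, neg = score(z, u), score(z, u_tilde)
logistic_loss = np.mean(np.log1p(np.exp(-pos)) + np.log1p(np.exp(neg)))
```

Minimizing this loss over the parameters of `f_hat_inv` and `psi` is what drives $\hat{f}$ toward the true mixing, up to dimension-wise invertible transformations.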

Appendix B Experiment Details

Here, we describe more implementation details of the experiment. Our experiment code can be found at https://github.com/takeshi-teshima/few-shot-domain-adaptation-by-causal-mechanism-transfer.

B.1 Dataset details

Gasoline consumption data.

B.2 Model details: Invertible neural networks

Here, we describe the details of the Glow architecture Kingma and Dhariwal (2018) used in our experiments. Glow consists of three types of layers which are invertible by design, namely affine coupling layers, 1×11\times 1 convolution layers, and activation normalization (actnorm) layers. In our implementation, we use actnorm as the first layer, and each of the subsequent layers consists of a 1×11\times 1 convolution layer followed by an affine coupling layer.

Affine coupling layers.

The coefficients $s$ and $t$ for the affine coupling layers, in the notation of Kingma and Dhariwal (2018), are parametrized by two one-hidden-layer neural networks that have the same number of hidden units and share the first-layer parameters. The activation functions of the first layer, the second layer of $s$, and the second layer of $t$ are the rectified linear unit (ReLU) LeCun et al. (2015), the hyperbolic tangent function, and the linear activation function, respectively. A standard practice for affine coupling layers is to compose the coefficient $s$ with an exponential function $x\mapsto\exp(x)$ so as to simplify the computation of the log-determinant of the Jacobian Kingma and Dhariwal (2018). In our implementation, since we do not require the computation of the log-determinant, we omit this device and instead compose $x\mapsto x+1$. The addition of $1$ shifts the parameter space so that $(s,t)=(0,0)$ corresponds to the identity map, where $0$ denotes the constant zero function. The split of the affine coupling layers is fixed at $(\lfloor\frac{D}{2}\rfloor,D-\lfloor\frac{D}{2}\rfloor)$.
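As an illustration of this design choice, here is a minimal NumPy sketch of one such coupling layer: since the $s$-network ends in $\tanh$, the scale $\tanh(\cdot)+1$ lies in $(0,2)$ and never vanishes, so the layer stays invertible without the usual $\exp$ device. The tiny fixed-weight networks are hypothetical stand-ins for the trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d1, H = 4, 2, 8                       # input dim, split point floor(D/2), hidden units
W1 = rng.normal(size=(d1, H)) * 0.1      # shared first-layer weights
Ws = rng.normal(size=(H, D - d1)) * 0.1  # second layer of the s-network
Wt = rng.normal(size=(H, D - d1)) * 0.1  # second layer of the t-network

def coeffs(x1):
    h = np.maximum(x1 @ W1, 0.0)         # ReLU first layer (shared)
    return np.tanh(h @ Ws), h @ Wt       # tanh for s, linear for t

def forward(x):
    x1, x2 = x[:, :d1], x[:, d1:]
    s, t = coeffs(x1)
    # Scale is s + 1 in (0, 2), so the map is invertible; (s, t) = (0, 0)
    # gives the identity map.
    return np.concatenate([x1, x2 * (s + 1.0) + t], axis=1)

def inverse(y):
    y1, y2 = y[:, :d1], y[:, d1:]
    s, t = coeffs(y1)                    # y1 = x1, so the coefficients match
    return np.concatenate([y1, (y2 - t) / (s + 1.0)], axis=1)

x = rng.normal(size=(5, D))
x_rec = inverse(forward(x))              # recovers x up to floating-point error
```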

1×11\times 1 convolution layers.

We initialize the parameters of the neural networks by 𝒩(0,1m)\mathcal{N}(0,\frac{1}{m}) where mm is the number of parameters of each layer and 𝒩\mathcal{N} is the normal distribution.

B.3 Model details: Penultimate layer networks

We initialize the parameter for each layer of ψd\psi_{d} by U(1m,1m)\mathrm{U}(-\sqrt{\frac{1}{m}},\sqrt{\frac{1}{m}}), where mm is the number of input features and U\mathrm{U} is the uniform distribution.

B.4 Training details

During the training of GCL, we fix the batch size at 32.

B.5 Compared methods details

Here, we detail the methods compared in the experiment. Note that the present paper focuses on regression problems because our approach is based on ICA; hence, methods for classification domain adaptation are not included in the comparison.

TrAdaBoost.

As suggested in Pardoe and Stone (2010), we use the linear loss function and set the maximum number of internal boosting iterations at $30$.

GDM.

We fix the number of samples used to approximate the maximization in the generalization discrepancy at $200$. This method presumes hypothesis classes in a reproducing kernel Hilbert space (RKHS).

Copula.

For this model, a non-parametric R-vine copula of depth $1$ is used following Lopez-paz et al. (2012). Kernel density estimators with the RBF kernel are used for estimating the marginal distributions and the copulas. The bandwidths of the RBF kernels are determined using the rule of thumb implemented as “normal-reference” in the np package of the R language Hayfield and Racine (2008). The predictions are made by numerically aggregating the estimated conditional distribution over the interval $[\min_{i}Y_{i}-2\sigma,\max_{i}Y_{i}+2\sigma]$, where $\sigma$ denotes the square root of the unbiased variance of $\{Y_{i}\}_{i=1}^{n_{\mathrm{src}}}$. The aggregation is performed by discretizing the interval into a grid of $300$ points. The level of the two-sample test is fixed at $0.05$ for all combinations of the two-sample tests, following the experiment code of Lopez-paz et al. (2012). This method is a single-source domain adaptation method, and we pool all source domain data for adaptation.

Appendix C Details and Proofs of Theorem 2

Here, we detail the assumptions, the statement, and the proof of Theorem 2.

C.1 Notation

To make the proof self-contained, we first recall some general and problem-specific notation. Here we omit the domain identifiers, such as Tar or Src, from the distributions and the sample size because only the target domain data and their distributions appear in the proofs. The theorem holds regardless of how $\hat{f}$ is estimated, as long as $\hat{f}$ is independent of the target domain data. In the proof, we extend the maximal discrepancy bound for U-statistics, previously proved for the degree-$2$ case in Rejchel (2012), to higher degrees.

General mathematical notation.

We denote the set of natural numbers (resp. real numbers) by \mathbb{N} (resp. \mathbb{R}). For any NN\in\mathbb{N}, we define [N]:={1,2,,N}[N]:=\{1,2,\ldots,N\}. We use (ab){a\choose b} to denote the number of bb-combinations of aa elements. For a finite set AA, the notation ¯aA\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{a\in A} denotes the operator to take an average over AA, i.e., ¯aAh=1|A|aAh(a)\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{a\in A}h=\frac{1}{|A|}\sum_{a\in A}h(a). For a DD-dimensional function hh, we denote its jj-th dimension (j[D]j\in[D]) by suffixing hjh_{j}. For a vector ss, we denote its jj-th element by s(j)s^{(j)}. We denote the Jacobian determinant of a differentiable function ψ\psi at aa by Jψ(a):=detdψ(a)da\mathrm{J}\psi(a):=\det\frac{\mathrm{d}\psi(a)}{\mathrm{d}a}. We denote the identity matrix by II regardless of the size of the matrices when there is no ambiguity. For finite dimensional vectors, we denote the 22-norm by 2\left\|\cdot\right\|_{{\ell^{2}}} and the 11-norm by 1\left\|\cdot\right\|_{{\ell^{1}}}. For square matrices, we denote the operator-22 norm by op\left\|\cdot\right\|_{\mathrm{op}} and the operator-11 norm by op(1)\left\|\cdot\right\|_{\mathrm{op}(1)}. We use Wk,pW^{k,p} to denote the Sobolev space (on D\mathbb{R}^{D}) of order kk and define its associated norm by hWk,p:=(|α|kh(α)Lpp)1/p\left\|h\right\|_{W^{k,p}}:=\left(\sum_{|\alpha|\leq k}\left\|h^{(\alpha)}\right\|_{{L^{p}}}^{p}\right)^{1/p} where α\alpha is a multi-index and h(α)h^{(\alpha)} denotes the partial derivative |α|hs1α1sDαD\frac{\partial^{|\alpha|}h}{\partial s_{1}^{\alpha_{1}}\cdots\partial s_{D}^{\alpha_{D}}} (Adams and Fournier, 2003, Paragraph 3.1). 
We let 𝔖D\mathfrak{S}_{D} be the degree-DD symmetric group, 𝔊jD:={τ:[D][j]|τ is surjective}\mathfrak{G}_{j}^{D}:=\{\tau:[D]\to[j]\ |\ \tau\text{ is surjective}\} be the set of jj grouping of indices in [D][D], and jn:={ρ:[j][n]|ρ is injective}\mathfrak{I}_{j}^{n}:=\{\rho:[j]\to[n]\ |\ \rho\text{ is injective}\} be the set of all size-jj combinations (without replacement) of indices in [n][n].

Distributions and expectations.

We denote by 𝒬\mathcal{Q} the set of all factorized distributions on D\mathbb{R}^{D} with absolutely continuous marginals. For a measure PP, we denote its jj-product measure by Pj:=PPP^{j}:=P\otimes\cdots\otimes P (repeated jj times). We assume that all measures appearing in this proof are absolutely continuous with respect to the Lebesgue measure. The push-forward of a distribution pp by a function hh is denoted by hph_{\sharp}p. The expectation of a function hh with respect to measure PP is denoted by PhPh (if it exists) by abuse of notation. We also abuse the notation to use ψ(s,P,,P)\psi(s,P,\ldots,P) as the shorthand for PD1ψ(s,S2,,SD)P^{D-1}\psi(s,S_{2}^{\prime},\ldots,S_{D}^{\prime}) where {Sd}d=2Di.i.d.P\{S_{d}^{\prime}\}_{d=2}^{D}\overset{\text{i.i.d.}}{\sim}P.

C.2 Problem setup

We denote the target domain distribution by pp.

We fix a hypothesis class 𝒢({g:D1})\mathcal{G}(\subset\{g:\mathbb{R}^{D-1}\to\mathbb{R}\}), and our goal is to find a g𝒢g\in\mathcal{G} such that the risk functional

R(g):=p(z)(g,z)dz\begin{split}R(g):=\int p(z)\ell(g,z)\mathrm{d}z\end{split}

is small, where :𝒢×D0\ell:\mathcal{G}\times\mathbb{R}^{D}\to\mathbb{R}_{\geq 0} is a loss function. We denote by gg^{*} a minimizer of RR (assuming it exists). To this end, we are given the training data 𝒟:={Zi}i=1ni.i.d.p\mathcal{D}:=\{Z_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}p. Throughout, we assume nDn\geq D. To complement the smallness of nn, we assume the existence of a generative mechanism. Concretely, we assume that there exists a diffeomorphism f:DDf:\mathbb{R}^{D}\to\mathbb{R}^{D} such that q:=(f1)pq:=(f^{-1})_{\sharp}p satisfies q𝒬q\in\mathcal{Q}. With this transform, the original risk functional is also expressed as

R(g)=q(s)(g,f(s))ds.\begin{split}R(g)=\int q(s)\ell(g,f(s))\mathrm{d}s.\end{split}

As an estimator of $f$, we are given another diffeomorphism $\hat{f}:\mathbb{R}^{D}\to\mathbb{R}^{D}$ such that $\hat{f}\simeq f$. With this $\hat{f}$, the proposed method converts the dataset $\mathcal{D}$ by $S_{i}:={\hat{f}}^{-1}(Z_{i})$. We can regard $\check{\mathcal{D}}:=\{S_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}\check{q}$, where $\check{q}:=({\hat{f}}^{-1}\circ f)_{\sharp}q$. We use $Q$ (resp. $\check{Q}$) to denote the probability measure corresponding to the density $q$ (resp. $\check{q}$). This conversion results in the relation:

qˇ(s)=q(f1f^(s))|(Jf1f^)(s)|.\begin{split}\check{q}(s)=q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|.\end{split}
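This relation is the standard change-of-variables formula. As a short derivation, write $T:=\hat{f}^{-1}\circ f$, so that $\check{q}=T_{\sharp}q$ and $T^{-1}=f^{-1}\circ\hat{f}$; then for any test function $\varphi$,

```latex
\int \varphi(s)\,\check{q}(s)\,\mathrm{d}s
  = \int \varphi(T(s'))\,q(s')\,\mathrm{d}s'
  = \int \varphi(s)\,q(T^{-1}(s))\,\left|\mathrm{J}T^{-1}(s)\right|\mathrm{d}s,
```

where the second equality substitutes $s=T(s')$; hence $\check{q}(s)=q(T^{-1}(s))\,|\mathrm{J}T^{-1}(s)|$, which is the relation above.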

As a candidate hypothesis g𝒢g\in\mathcal{G}, the proposed method selects a minimizer gˇ𝒢\check{g}\in\mathcal{G} of the proposed risk estimator Rˇ\check{R} defined as

Rˇ(g):=1nD(i1,,iD)[n]D(g,f^(s^i1(1),,s^iD(D))).\begin{split}\check{R}(g):=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\ell(g,\hat{f}(\hat{s}_{i_{1}}^{(1)},\ldots,\hat{s}_{i_{D}}^{(D)})).\end{split} (5)

In the proof, we evaluate its concentration around the expectation Rˇ¯(g):=𝔼𝒟ˇRˇ(g)\bar{\check{R}}(g):={\mathbb{E}}_{\check{\mathcal{D}}}\check{R}(g). We use 𝔼𝒟ˇ{\mathbb{E}}_{\check{\mathcal{D}}} to denote the expectation with respect to 𝒟ˇ\check{\mathcal{D}}. Let gˇ¯\bar{\check{g}} denote a hypothesis which minimizes Rˇ¯(g)\bar{\check{R}}(g) (assuming it exists).
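To make Eq. (5) concrete, here is a minimal NumPy sketch of the proposed estimator on a toy problem: the data are pulled back through $\hat{f}^{-1}$, the per-dimension IC values are recombined over all $n^{D}$ index tuples, pushed forward through $\hat{f}$, and the losses are averaged. The elementwise feature extractor, the linear predictor, and the squared loss are hypothetical choices for illustration only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3                      # small n and D so the n^D sum is enumerable

# Stand-in invertible feature extractor f̂ (elementwise, with known inverse;
# in the method, f̂ is a trained invertible network).
def f_hat(s):
    return np.sinh(s)

def f_hat_inv(z):
    return np.arcsinh(z)

Z = rng.normal(size=(n, D))      # target labeled data, z_i = (x_i, y_i)
S = f_hat_inv(Z)                 # estimated ICs: ŝ_i = f̂^{-1}(z_i)

def loss(z, w):
    # Hypothetical squared loss of a linear predictor g_w: x -> w @ x.
    x, y = z[:-1], z[-1]
    return (w @ x - y) ** 2

w = np.ones(D - 1)

# Proposed estimator (Eq. (5)): average over all n^D recombinations of the
# per-dimension IC values, pushed forward through f̂ (a V-statistic).
total = 0.0
for idx in itertools.product(range(n), repeat=D):
    s_mix = np.array([S[idx[d], d] for d in range(D)])
    total += loss(f_hat(s_mix), w)
R_check = total / n ** D

# Ordinary empirical risk on the n original samples; these reappear inside
# R_check as the "diagonal" tuples idx = (i, ..., i), since f̂(f̂^{-1}(z_i)) = z_i.
R_hat = np.mean([loss(Z[i], w) for i in range(n)])
```

The point of the recombination is that $\check{R}(g)$ averages over $n^{D}$ pseudo-samples rather than $n$ real ones, which is what the concentration analysis below quantifies.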

In what follows, for notational simplicity, we define the DD-variate symmetric function ~\tilde{\ell} as

~(s1,,sD)=¯π𝔖D(g,f^(sπ(1)(1),,sπ(D)(D))),\begin{split}\tilde{\ell}(s_{1},\ldots,s_{D})=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)})),\end{split}

where ¯π𝔖D\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}} indicates an averaging operation over all permutations (without replacement) of [D][D]. We use 𝔼^n\hat{\mathbb{E}}_{n} to denote the sample average operator with respect to 𝒟\mathcal{D} or 𝒟ˇ\check{\mathcal{D}}, depending on the context.

C.3 Assumptions

Assumption 2 (The underlying density function is bounded and Lipschitz continuous).

Assume

Bq:=supsDq(s)<,Lq:=sups1s2|q(s1)q(s2)|s1s2<.\begin{split}B_{q}&:=\sup_{s\in\mathbb{R}^{D}}q(s)<\infty,\quad L_{q}:=\sup_{s_{1}\neq s_{2}}\frac{|q(s_{1})-q(s_{2})|}{\|s_{1}-s_{2}\|}<\infty.\end{split}
Assumption 3 (f1f^{-1} is Lipschitz continuous and Hölder continuous).

We assume f1C1,1f^{-1}\in C^{1,1} where C1,1C^{1,1} is the (1,1)(1,1)-Hölder space (Adams and Fournier, 2003, Paragraph 1.29) and

Lf1:=supz1z2f1(z1)f1(z2)z1z2<.\begin{split}L_{f^{-1}}&:=\sup_{z_{1}\neq z_{2}}\frac{\|f^{-1}(z_{1})-f^{-1}(z_{2})\|}{\|z_{1}-z_{2}\|}<\infty.\end{split}
Assumption 4 (Bounded derivatives of ff and f1f^{-1}).

Assume that

Bf:=supsDdfds(s)<,Bf1:=supzDdf1dz(z)<.\begin{split}B_{\partial f}^{\infty}:=\sup_{s\in\mathbb{R}^{D}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\infty}<\infty,\quad B_{\partial f^{-1}}^{\infty}:=\sup_{z\in\mathbb{R}^{D}}\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(z)\right\|_{\infty}<\infty.\end{split}

where \left\|\cdot\right\|_{\infty} denotes the maximum absolute value of the elements of a matrix.

Assumption 5 (Loss function is bounded and uniformly Lipschitz continuous in ZZ).

The considered loss function takes values in a bounded interval:

:𝒢×𝒵[0,B],\begin{split}\ell:\mathcal{G}\times\mathcal{Z}\to[0,B_{\ell}],\end{split}

where 0<B<0<B_{\ell}<\infty. Also assume

L𝒢:=supg𝒢supz1z2|(g,z1)(g,z2)|z1z2<.\begin{split}L_{\ell_{\mathcal{G}}}:=\sup_{g\in\mathcal{G}}\sup_{z_{1}\neq z_{2}}\frac{|\ell(g,z_{1})-\ell(g,z_{2})|}{\|z_{1}-z_{2}\|}<\infty.\end{split}
Assumption 6 (Estimated feature extractor).

Assume f^\hat{f} is independent of 𝒟\mathcal{D} and that fjf^jW1,df_{j}-\hat{f}_{j}\in W^{1,d} for all (j,d)[D]×[D](j,d)\in[D]\times[D].

Although $\hat{f}$ and $f$ are assumed to be diffeomorphisms in the classical sense (and are hence strongly differentiable), we introduce the Sobolev space because we need to measure their difference, and the difference of their derivatives, in an integral sense.

Assumption 7 (Entropic condition: Euclidean class (Sherman, 1994)).

The function class Φ:={~:g𝒢}\Phi:=\{\tilde{\ell}:g\in\mathcal{G}\} is Euclidean for the envelope FF and constants AA and VV (Sherman, 1994), i.e., if μ\mu is a measure for which μF2<\mu F^{2}<\infty, then

D(t,dμ,Φ)AtV,0<t1,\begin{split}D(t,d_{\mu},\Phi)\leq At^{-V},\quad 0<t\leq 1,\end{split}

where dμd_{\mu} is the pseudo metric defined by

dμ(ϕ1,ϕ2):=[μ|ϕ1ϕ2|2/μF2]1/2\begin{split}d_{\mu}(\phi_{1},\phi_{2}):=\left[\mu|\phi_{1}-\phi_{2}|^{2}/\mu F^{2}\right]^{1/2}\end{split}

for ϕ1,ϕ2Φ\phi_{1},\phi_{2}\in\Phi, and D(t,dμ,Φ)D(t,d_{\mu},\Phi) denotes the packing number of Φ\Phi with respect to the pseudometric dμd_{\mu} and radius tt. Without loss of generality, we take the envelope FF such that F()BF(\cdot)\leq B_{\ell}.

Assumption 8.

The hypothesis class 𝒢\mathcal{G} is expressive enough so that the model approximation error does not expand due to f^\hat{f}, i.e.,

infg𝒢Rˇ¯(g)infg𝒢R(g)\begin{split}\inf_{g\in\mathcal{G}}\bar{\check{R}}(g)\leq\inf_{g\in\mathcal{G}}R(g)\end{split}

The following complexity measure of 𝒢\mathcal{G}, which is a version of Rademacher complexity for our problem setting, is used to state the theorem.

Definition 1 (Effective Rademacher complexity).

Define

(𝒢):=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S2,,SD[~(Si,S2,,SD)]|]\begin{split}\mathfrak{R}(\mathcal{G}):=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{2},\ldots,S^{\prime}_{D}}[\tilde{\ell}(S_{i},S_{2}^{\prime},\ldots,S_{D}^{\prime})]\right|\right]\end{split}

where {σi}i=1n\{\sigma_{i}\}_{i=1}^{n} are independent uniform sign variables and S2,,SDi.i.d.QˇS_{2}^{\prime},\ldots,S_{D}^{\prime}\overset{\text{i.i.d.}}{\sim}\check{Q} are independent of all other random variables.

We provide the definition of the ordinary Rademacher complexity in Section C.8 and make a comparison of the two complexity measures in terms of how they depend on the input dimensionality.

C.4 Theorem statement

Our goal is to prove the following theorem. This is a detailed version of the theorem appearing in the main body of the paper.

Theorem 3 (Excess risk bound).

Assume Assumptions 2, 3, 4, 5, 6, 7, and 8.

Then for arbitrary δ,δ(0,1)\delta,\delta^{\prime}\in(0,1), we have with probability at least 1(δ+δ)1-(\delta+\delta^{\prime}),

R(gˇ)R(g)Cj=1Dfjf^jW1,1Approximation error+4D(𝒢)+2DBlog2/δ2nEstimation error+κ1(δ,n)+DBBqκ2(ff^)Higher order terms.\begin{split}&R(\check{g})-R(g^{*})\\ &\leq\underbrace{C\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}}_{\text{Approximation error}}+\underbrace{4D\mathfrak{R}(\mathcal{G})+2DB_{\ell}\sqrt{\frac{\log 2/{\delta}}{2n}}}_{\text{Estimation error}}+\underbrace{\kappa_{1}(\delta^{\prime},n)+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})}_{\text{Higher order terms}}.\end{split}

where

C:=BqL𝒢+DB(LqLf1+BqDC1),C1:=(D+1)3/2(Bf(k=1Dfk1C1,1)+Bf1),κ1(δ,n)=𝒪(n1)/δ+𝒪(n1),κ2(ff^)=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}&C:=B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}),\\ &C_{1}^{\prime}:=(D+1)^{3/2}\left(B_{\partial f}^{\infty}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)+B_{\partial f^{-1}}^{\infty}\right),\\ &\kappa_{1}(\delta^{\prime},n)=\mathcal{O}(n^{-1})/{\delta^{\prime}}+\mathcal{O}(n^{-1}),\\ &\kappa_{2}(f-\hat{f})=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

and Cd(d=1,,D)C_{d}^{\prime}(d=1,\ldots,D) are constants determined in Lemma 11.

Proof of Theorem 3.

By adding and subtracting terms, we have

R(gˇ)R(g)=(RRˇ¯)(gˇ)(A) Approximation error+Rˇ¯(gˇ)Rˇ¯(gˇ¯)(B) Pseudo estimation error+Rˇ¯(gˇ¯)R(g)(C) Additional model misspecification error.\begin{split}&R(\check{g})-R(g^{*})=\underbrace{(R-\bar{\check{R}})(\check{g})}_{\text{\text{(A) Approximation error}}}+\underbrace{\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})}_{\text{\text{(B) Pseudo estimation error}}}+\underbrace{\bar{\check{R}}(\bar{\check{g}})-R(g^{*})}_{\text{(C) Additional model misspecification error}}.\end{split}

Applying Lemma 1 to (A), Lemma 2 to (B), and Assumption 8 to (C), we obtain the assertion. ∎

As can be seen from the proof above, Theorem 3 is proved in two parts, each corresponding to one of the two lemmas below. The first lemma evaluates the approximation error, which reflects the fact that we approximate $f$ by $\hat{f}$.

Lemma 1 (Approximation error bound).

Under Assumptions 2, 3, 4, 5, and 6, we have

(RRˇ¯)(gˇ)Cj=1Dfjf^jW1,1+DBBqκ2(ff^)\begin{split}(R-\bar{\check{R}})(\check{g})\leq&C\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})\end{split}

where CC and κ2(ff^)\kappa_{2}(f-\hat{f}) are

C:=BqL𝒢+DB(LqLf1+BqDC1),κ2(ff^):=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}C&:=B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}),\\ \kappa_{2}(f-\hat{f})&:=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

and Cd(d=1,,D)C_{d}^{\prime}(d=1,\ldots,D) are constants determined in Lemma 11.

The second lemma evaluates the pseudo estimation error which reflects the fact that we rely on a finite sample to approximate the underlying distribution.

Lemma 2 (Pseudo estimation error bound).

Assume that Assumptions 2 and 7 hold, and let the effective Rademacher complexity be defined as in Definition 1. Then for any $\delta,\delta^{\prime}\in(0,1)$, we have with probability at least $1-(\delta+\delta^{\prime})$ that

Rˇ¯(gˇ)Rˇ¯(gˇ¯)4DwD(𝒢)+2DBwDlog2/δ2n+2wD(D1)j=2DCjδnj/2+4Bj=1D1wj𝒪(n1)\begin{split}\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})\leq 4Dw_{D}\mathfrak{R}(\mathcal{G})+2DB_{\ell}w_{D}\sqrt{\frac{\log 2/{\delta}}{2n}}+\underbrace{2w_{D}(D-1)\sum_{j=2}^{D}\frac{C_{j}}{\delta^{\prime}}n^{-j/2}+4B_{\ell}\sum_{j=1}^{D-1}w_{j}}_{\text{$\mathcal{O}(n^{-1})$}}\end{split}

where {wj}j=1D\{w_{j}\}_{j=1}^{D} are universal constants determined in Lemma 3, and {Cj}j=2D\{C_{j}\}_{j=2}^{D} are constants determined in Lemma 6. Note that wj=𝒪(n(Dj))w_{j}=\mathcal{O}(n^{-(D-j)}) and wD=n(n1)(nD+1)nD<1w_{D}=\frac{n(n-1)\cdots(n-D+1)}{n^{D}}<1.

In what follows, we first present some basic facts in Section C.5 and provide the proofs for the lemmas. We provide the proof of Lemma 1 in Section C.7, and that of Lemma 2 in Section C.6.

C.5 V-statistic and U-statistic

The theoretical analysis is performed by interpreting the proposed risk estimator Eq. (5) as a V-statistic (explained shortly). The proofs will be based on applying the following facts in order:

  1. A V-statistic can be represented as a weighted average of U-statistics of degrees from $1$ to $D$, and only the degree-$D$ term is the leading term.

  2. The degree-$D$ term is further decomposed into a degree-$1$ U-statistic and a set of degenerate U-statistics.

  3. The degree-$1$ U-statistic is an i.i.d. sum admitting a Rademacher complexity bound.

  4. The degenerate terms concentrate around zero following an exponential inequality under appropriate entropy conditions.

To carry out the strategy above, we now describe what V- and U-statistics are and how they relate to each other. These estimators emerge when we allow the same data point to be re-used repeatedly within a single sample to estimate a functional that takes multiple data points.

V-statistic.

For a given regular statistical functional of degree DD Lee (1990):

QˇD~:=~(s1,,sD)qˇ(s1)qˇ(sD)ds1dsD,\begin{split}\check{Q}^{D}\tilde{\ell}:=\int\tilde{\ell}(s_{1},\cdots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D},\end{split} (6)

its associated von-Mises statistic (V-statistic) is the following quantity Lee (1990):

VnD~:=1nDi1=1niD=1n~(Si1,,SiD).\begin{split}V_{n}^{D}\tilde{\ell}:=\frac{1}{n^{D}}\sum_{i_{1}=1}^{n}\cdots\sum_{i_{D}=1}^{n}\tilde{\ell}(S_{i_{1}},\ldots,S_{i_{D}}).\end{split}

Note that Eq. (6) does not coincide with the expectation of VnD~V_{n}^{D}\tilde{\ell} in general, i.e., the V-statistic is generally not an unbiased estimator. However, it is known to be a consistent estimator of Eq. (6) (Lee, 1990).

U-statistic.

Similarly, for a jj-variate symmetric and integrable function h(x1,,xj)h(x_{1},\ldots,x_{j}), its corresponding U-statistic (Lee, 1990) of degree jj is

Unjh:=¯ρjnh(sρ(1),,sρ(j)).\begin{split}U_{n}^{j}h:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}h(s_{\rho(1)},\ldots,s_{\rho(j)}).\end{split}

The V- and U-statistics generalize the sample mean, which is both the degree-$1$ U-statistic and the degree-$1$ V-statistic. The important difference at higher degrees is that the summands may no longer be independent. To deal with this dependence, the following standard decompositions have been developed Lee (1990).

Lemma 3 (Decomposition of a V-statistic (Lee, 1990)).

A V-statistic can be expressed as a sum of U-statistics of degrees from 11 to DD (Lee, 1990, Section 4.2, Theorem 1):

VnD~=j=1DwjUnj~(j)\begin{split}V_{n}^{D}\tilde{\ell}&=\sum_{j=1}^{D}w_{j}U_{n}^{j}\tilde{\ell}^{(j)}\end{split}

where the weights wjw_{j} and jj-variate functions ~(j)\tilde{\ell}^{(j)} are

wj:=1nD|𝔊jD|(nj),~(j)(s1,,sj):=¯τ𝔊jD~(sτ(1),,sτ(D)).\begin{split}w_{j}&:=\frac{1}{n^{D}}|\mathfrak{G}_{j}^{D}|{n\choose j},\quad\tilde{\ell}^{(j)}(s_{1},\ldots,s_{j}):=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\tilde{\ell}(s_{\tau(1)},\ldots,s_{\tau(D)}).\end{split}
Proof.

See (Lee, 1990, Section 4.2, Theorem 1 (p.183)). ∎

Remark 1.

The weights {wj}j=1D\{w_{j}\}_{j=1}^{D} satisfy jwj=1\sum_{j}w_{j}=1 (Lee, 1990, Section 4.2, Theorem 1 (p.183)). We can also find the order of wjw_{j} with respect to nn as:

\begin{split}w_{D}=\frac{1}{n^{D}}\underbrace{|\mathfrak{G}_{D}^{D}|}_{\text{$D!$}}{n\choose D}=\frac{n(n-1)\cdots(n-D+1)}{n^{D}}=\mathcal{O}(1),\quad w_{j}=\mathcal{O}(n^{-(D-j)}),\quad\tilde{\ell}^{(D)}=\tilde{\ell}.\end{split}
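The decomposition of Lemma 3 and the weights of Remark 1 can be checked numerically for $D=2$, where $w_{1}=1/n$, $w_{2}=n(n-1)/n^{2}$, $\tilde{\ell}^{(1)}(s)=\tilde{\ell}(s,s)$, and $\tilde{\ell}^{(2)}=\tilde{\ell}$. The symmetric kernel below is a hypothetical example, not the loss used in the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 7
s = rng.normal(size=n)

# A hypothetical symmetric kernel of degree D = 2.
h = lambda a, b: (a - b) ** 2 / 2.0

# V-statistic: average over all n^2 ordered index pairs (with replacement).
V = np.mean([h(s[i], s[j]) for i, j in itertools.product(range(n), repeat=2)])

# U-statistics: degree 2 (indices without replacement) and degree 1, whose
# kernel is the "diagonal" h^{(1)}(s) = h(s, s) (identically 0 for this h).
U2 = np.mean([h(s[i], s[j]) for i, j in itertools.permutations(range(n), 2)])
U1 = np.mean([h(s[i], s[i]) for i in range(n)])

# Lemma 3 for D = 2: V = w1 * U1 + w2 * U2.
w1, w2 = 1.0 / n, n * (n - 1) / n ** 2
```

For this kernel, $U_{n}^{2}h$ is the unbiased sample variance and $V_{n}^{2}h$ is the biased one, so the decomposition exactly accounts for the V-statistic's bias.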
Lemma 4 (Hoeffding decomposition of a U-statistic (Sherman, 1994, p.449)).

A U-statistic with a symmetric kernel ψ\psi can be decomposed as a sum of U-statistics of degrees from 11 to DD as

UnDψ𝔼𝒟ˇUnDψ=j=1DUnjψj=𝔼^nψ1+j=2DUnjψj\begin{split}U_{n}^{D}\psi-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi&=\sum_{j=1}^{D}U_{n}^{j}\psi_{j}\\ &=\hat{\mathbb{E}}_{n}\psi_{1}+\sum_{j=2}^{D}U_{n}^{j}\psi_{j}\end{split} (7)

where {ψj}j=1D\{\psi_{j}\}_{j=1}^{D} are jj-variate, symmetric and degenerate functions. Note that 𝔼𝒟ˇUnDψ=QˇDψ{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi=\check{Q}^{D}\psi. Here, a jj-variate symmetric function ψj\psi_{j} is said degenerate when

s2,,sj,ψj(Qˇ,s2,,sj)=0.\begin{split}\forall s_{2},\ldots,s_{j},\quad\psi_{j}(\check{Q},s_{2},\ldots,s_{j})=0.\end{split}

Specifically, ψ1\psi_{1} is

ψ1(s)=ψ(s,Qˇ,,Qˇ)++ψ(Qˇ,,Qˇ,s)DQˇDψ=D(ψ(s,Qˇ,,Qˇ)QˇDψ)(by symmetry).\begin{split}\psi_{1}(s)&=\psi(s,\check{Q},\ldots,\check{Q})+\cdots+\psi(\check{Q},\ldots,\check{Q},s)-D\check{Q}^{D}\psi\\ &=D\cdot(\psi(s,\check{Q},\ldots,\check{Q})-\check{Q}^{D}\psi)\quad(\text{by symmetry}).\end{split} (8)

For further details, see (Sherman, 1994, p.449). Note that in (Sherman, 1994, p.449), Eq. (7) is written using QˇDψ\check{Q}^{D}\psi in place of 𝔼𝒟ˇUnDψ{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi. This is because

𝔼𝒟ˇUnDψ=UnD𝔼𝒟ˇψ=UnDQˇDψ=QˇDψ\begin{split}{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi=U_{n}^{D}{\mathbb{E}}_{\check{\mathcal{D}}}\psi=U_{n}^{D}\check{Q}^{D}\psi=\check{Q}^{D}\psi\end{split}

holds by linearity and symmetry.

Remark 2 (Connecting the lemmas to Section C.6).

It can be easily checked by definition that the proposed risk estimator Eq. (5) takes the form of a V-statistic: Rˇ(g)=VnD~\check{R}(g)=V_{n}^{D}\tilde{\ell} for each g𝒢g\in\mathcal{G}. Let us denote ~(s):=~(s,Qˇ,,Qˇ)\tilde{\ell}^{*}(s):=\tilde{\ell}(s,\check{Q},\ldots,\check{Q}). Then 𝔼𝒟ˇ~=QˇD~{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*}=\check{Q}^{D}\tilde{\ell} holds by definition. Substituting these into Eq. (8), we have that Eq. (7) applied to ψ=~\psi=\tilde{\ell} is equivalent to

UnD~𝔼𝒟ˇUnD~=D(𝔼^n~𝔼𝒟ˇ~)+j=2DUnj~j.\begin{split}U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}=D\cdot(\hat{\mathbb{E}}_{n}\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})+\sum_{j=2}^{D}U_{n}^{j}\tilde{\ell}_{j}.\end{split}

where {~j}j=2D\{\tilde{\ell}_{j}\}_{j=2}^{D} are symmetric degenerate functions. In Section C.6, we first decompose Rˇ(g)\check{R}(g) into a sum of U-statistics. After such conversion, we take a closer look at the leading term, 𝔼^n~\hat{\mathbb{E}}_{n}\tilde{\ell}^{*}.

C.6 Proof of pseudo estimation error bound

(Proof of Lemma 2).

First, we have

Rˇ¯(gˇ)Rˇ¯(gˇ¯)=Rˇ¯(gˇ)Rˇ(gˇ)+Rˇ(gˇ)Rˇ¯(gˇ¯)Rˇ¯(gˇ)Rˇ(gˇ)+Rˇ(gˇ¯)Rˇ¯(gˇ¯)2supg𝒢|Rˇ(g)Rˇ¯(g)|.\begin{split}\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})=\bar{\check{R}}(\check{g})-\check{R}(\check{g})+\check{R}(\check{g})-\bar{\check{R}}(\bar{\check{g}})&\leq\bar{\check{R}}(\check{g})-\check{R}(\check{g})+\check{R}(\bar{\check{g}})-\bar{\check{R}}(\bar{\check{g}})\\ &\leq 2\sup_{g\in\mathcal{G}}\left|\check{R}(g)-\bar{\check{R}}(g)\right|.\end{split}

Now the right-most expression can be decomposed as

\begin{split}&\sup_{g\in\mathcal{G}}|\check{R}(g)-\bar{\check{R}}(g)|=\sup_{g\in\mathcal{G}}|V_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}V_{n}^{D}\tilde{\ell}|\\
&\leq w_{D}\sup_{g\in\mathcal{G}}\left|U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}\right|+\sum_{j=1}^{D-1}w_{j}\sup_{g\in\mathcal{G}}\left|U_{n}^{j}\tilde{\ell}^{(j)}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}\right|\quad(\because\text{Lemma 3})\\
&\leq w_{D}\sup_{g\in\mathcal{G}}\left|U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}\right|+2B_{\ell}\sum_{j=1}^{D-1}w_{j}\\
&\leq w_{D}\left(\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}\tilde{\ell}_{1}|+\sum_{j=2}^{D}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|\right)+2B_{\ell}\sum_{j=1}^{D-1}w_{j}\quad(\because\text{Lemma 4})\\
&=w_{D}\left(\underbrace{\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}D(\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})|}_{\text{Addressed in Lemma 5}}+\underbrace{\sum_{j=2}^{D}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|}_{\text{Addressed in Lemma 6}}\right)+2B_{\ell}\sum_{j=1}^{D-1}w_{j},\end{split}

where ~j\tilde{\ell}_{j} are symmetric degenerate functions and ~\tilde{\ell}^{*} is defined as in Remark 2. Applying Lemma 5 to the first term and Lemma 6 to the second term with the union bound, we obtain the assertion. ∎

In the last part of the proof we used the following lemmas. Because the leading term is an i.i.d. sum, the following Rademacher complexity bound can be proved.

Lemma 5 (U-process bound: the leading term).

Assume Assumption 2 holds. Then, we have with probability at least 1δ1-\delta,

supg𝒢|𝔼^n(~𝔼𝒟ˇ~)|2(𝒢)+Blog(2/δ)2n,\begin{split}\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}(\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})|\leq 2\mathfrak{R}(\mathcal{G})+B_{\ell}\sqrt{\frac{\log(2/\delta)}{2n}},\end{split}

where \mathfrak{R} is defined in Definition 1.

Proof.

Applying the standard one-sided Rademacher complexity bound based on McDiarmid’s inequality (Mohri et al., 2012, Theorem 3.1) twice with the union bound, we obtain the lemma. ∎

The terms other than the leading term are degenerate U-statistics; hence, the following bound holds under appropriate entropy assumptions.

Lemma 6 (U-process bound: degenerate terms (Sherman, 1994, Corollary 7)).

Assume Assumption 7 holds. Then, for each j=2,,Dj=2,\ldots,D, there exists a constant CjC_{j} such that for any δ(0,1)\delta^{\prime}\in(0,1), we have with probability at least 1δ/(D1)1-\delta^{\prime}/(D-1),

supg𝒢|Unj~j|(D1)δCjnj/2\begin{split}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|\leq\frac{(D-1)}{\delta^{\prime}}C_{j}n^{-j/2}\end{split}

where CjC_{j} depends only on AA, VV, and BB_{\ell}.

Proof.

The proof follows a similar path as that of (Sherman, 1994, Corollary 7), but we provide more explicit expressions to inspect the order with respect to nn. Let Φ𝒢,f^(j):={~j:g𝒢}\Phi_{\mathcal{G},\hat{f}}^{(j)}:=\{\tilde{\ell}_{j}:g\in\mathcal{G}\}. Then Φ𝒢,f^(j)\Phi_{\mathcal{G},\hat{f}}^{(j)} is Euclidean for an envelope FjF_{j} satisfying QˇjFj2<\check{Q}^{j}F_{j}^{2}<\infty by Lemma 6 in Sherman (1994) and Assumption 7. In addition, Φ𝒢,f^(j)\Phi_{\mathcal{G},\hat{f}}^{(j)} is a set of functions degenerate with respect to Qˇ\check{Q}. Without loss of generality, we can take FjF_{j} such that FjBF_{j}\leq B_{\ell}. Similarly to the proof of (Sherman, 1994, Corollary 4), we can apply (Sherman, 1994, Main Corollary) with p=1p=1 in their notation to obtain

𝔼𝒟ˇsupg𝒢|nj/2Unj~j|ΓA1/2mp(QˇjFj2)(ϵ+α)/2ΓA1/2mp(B)ϵ+α=:Cj\begin{split}{\mathbb{E}}_{\check{\mathcal{D}}}\sup_{g\in\mathcal{G}}|n^{j/2}U_{n}^{j}\tilde{\ell}_{j}|\leq\Gamma A^{1/2mp}(\check{Q}^{j}F_{j}^{2})^{(\epsilon+\alpha)/2}\leq\underbrace{\Gamma A^{1/2mp}(B_{\ell})^{\epsilon+\alpha}}_{\text{$=:C_{j}$}}\end{split}

where Γ\Gamma is a universal constant (Sherman, 1994, Main Corollary), ϵ(0,1)\epsilon\in(0,1) and mm are chosen to satisfy 1V/2m>1ϵ1-V/2m>1-\epsilon, and α=1V/2m\alpha=1-V/2m. By applying Markov's inequality, we have for arbitrary u>0u>0,

𝒟ˇ(supg𝒢|nj/2Unj~j|>u)Cju,\begin{split}\mathbb{P}_{\check{\mathcal{D}}}\left(\sup_{g\in\mathcal{G}}|n^{j/2}U_{n}^{j}\tilde{\ell}_{j}|>u\right)\leq\frac{C_{j}}{u},\end{split}

where 𝒟ˇ(E)\mathbb{P}_{\check{\mathcal{D}}}(E) denotes the probability of the event EE with respect to 𝒟ˇ\check{\mathcal{D}}. Equating the right-hand side with δ/(D1)\delta^{\prime}/(D-1) and solving for uu, we obtain the result. ∎
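As a quick numerical sanity check of the Markov step (an illustration, not part of the analysis), the following snippet verifies P(X > u) <= E[X]/u exactly on an arbitrary finite nonnegative distribution; the support points and probabilities are toy values.

```python
# Exact check of Markov's inequality P(X > u) <= E[X] / u for a
# nonnegative random variable on a small finite distribution.
values = [0.5, 1.0, 2.0, 8.0]   # nonnegative support points (toy values)
probs = [0.4, 0.3, 0.2, 0.1]    # probabilities summing to one

mean = sum(v * p for v, p in zip(values, probs))

def tail_prob(u):
    """P(X > u) for the finite distribution above."""
    return sum(p for v, p in zip(values, probs) if v > u)

for u in [0.25, 1.0, 3.0, 7.5]:
    assert tail_prob(u) <= mean / u + 1e-12
```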

C.7 Proof of approximation error bound

(Proof of Lemma 1).

Due to Lemma 3, we have

supg𝒢(R(g)Rˇ¯(g))=supg𝒢(R(g)𝔼𝒟ˇVnD~)=supg𝒢(j=1Dwj(R(g)𝔼𝒟ˇUnj~(j)))wDsupg𝒢(R(g)𝔼𝒟ˇUnD~(D))+2Bj=1D1wj𝒪(n(Dj))wDsupg𝒢(R(g)𝔼𝒟ˇUnD~(D))+2B𝒪(n1)\begin{split}\sup_{g\in\mathcal{G}}\left(R(g)-\bar{\check{R}}(g)\right)&=\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}V_{n}^{D}\tilde{\ell}\right)\\ &=\sup_{g\in\mathcal{G}}\left(\sum_{j=1}^{D}w_{j}(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)})\right)\\ &\leq w_{D}\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)+2B_{\ell}\sum_{j=1}^{D-1}\underbrace{w_{j}}_{\text{$\mathcal{O}(n^{-(D-j)})$}}\\ &\leq w_{D}\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)+2B_{\ell}\mathcal{O}(n^{-1})\end{split}

By applying Lemma 7 (with j=Dj=D), we obtain

supg𝒢(R(g)𝔼𝒟ˇUnD~(D))supg𝒢(g,f())(g,f^())L1(q)+DBqqˇL1.\begin{split}&\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)\leq\sup_{g\in\mathcal{G}}\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}+DB_{\ell}\left\|q-\check{q}\right\|_{{L^{1}}}.\\ \end{split}

The right-hand side can be further bounded, by applying Lemmas 8 and 9, as

BqL𝒢j=1Dfjf^jW1,1+DB((LqLf1+BqDC1)j=1Dfjf^jW1,1+Bqκ2(ff^))(BqL𝒢+DB(LqLf1+BqDC1))j=1Dfjf^jW1,1+DBBqκ2(ff^)\begin{split}&B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}\left((L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime})\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+B_{q}\kappa_{2}(f-\hat{f})\right)\\ &\leq(B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}))\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})\end{split}

which proves the assertion of the lemma. ∎

The above proof combined three approximation bounds, which are shown in the following lemmas. The first lemma reduces the difference between the risk and the expectation of the U-statistic to the differences in the loss function and the density function. Although we apply Lemma 7 only with j=Dj=D, we prove its general form for j[D]j\in[D].

Lemma 7 (Approximation bound for U-statistic of degree-jj).

Fix j[D]j\in[D]. Assume Assumption 2 holds. Then, we have for any g𝒢g\in\mathcal{G},

R(g)𝔼𝒟ˇUnj~(j)(g,f())(g,f^())L1(q)+jBqqˇL1\begin{split}R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}\leq\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}+jB_{\ell}\left\|q-\check{q}\right\|_{{L^{1}}}\end{split}
Proof.

Let us define a DD-variate function \ell^{\dagger} and a jj-variate function (j)\ell^{\dagger(j)} (similarly to ~\tilde{\ell} and ~(j)\tilde{\ell}^{(j)}, respectively) by

(s1,,sD):=¯π𝔖D(g,f(sπ(1)(1),,sπ(D)(D))),(j)(s1,,sj):=¯τ𝔊jD(sτ(1),,sτ(D)).\begin{split}\ell^{\dagger}(s_{1},\ldots,s_{D})&:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,f(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)})),\\ \ell^{\dagger(j)}(s_{1},\ldots,s_{j})&:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\ell^{\dagger}(s_{\tau(1)},\ldots,s_{\tau(D)}).\end{split}

Then, recalling Q𝒬Q\in\mathcal{Q}, we can show R(g)=Qn(Unj(j))R(g)=Q^{n}(U_{n}^{j}\ell^{\dagger(j)}) because

Qn(Unj(j))=Qn(¯ρjn(j)(Sρ(1),,Sρ(j)))=Qn(¯ρjn¯τ𝔊jD(Sρτ(1),,Sρτ(D)))=Qn(¯ρjn¯τ𝔊jD¯π𝔖D(g,f(Sρτπ(1)(1),,Sρτπ(D)(D))))=¯ρjn¯τ𝔊jD¯π𝔖DQn(g,f(Sρτπ(1)(1),,Sρτπ(D)(D)))=¯ρjn¯τ𝔊jD¯π𝔖DQ[(g,f(S(1),,S(D)))](Q𝒬)=¯ρjn¯τ𝔊jD¯π𝔖DR(g)=R(g).\begin{split}Q^{n}(U_{n}^{j}\ell^{\dagger(j)})&=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\ell^{\dagger(j)}(S_{\rho(1)},\ldots,S_{\rho(j)}))\\ &=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\ell^{\dagger}(S_{\rho\circ\tau(1)},\ldots,S_{\rho\circ\tau(D)}))\\ &=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,f(S_{\rho\circ\tau\circ\pi(1)}^{(1)},\ldots,S_{\rho\circ\tau\circ\pi(D)}^{(D)})))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q^{n}\ell(g,f(S_{\rho\circ\tau\circ\pi(1)}^{(1)},\ldots,S_{\rho\circ\tau\circ\pi(D)}^{(D)}))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q[\ell(g,f(S^{(1)},\ldots,S^{(D)}))]\quad(\because Q\in\mathcal{Q})\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}R(g)=R(g).\end{split}

Combining this expression with Lemma 3,

R(g)𝔼𝒟ˇUnj~(j)=Qn(Unj(j))Qˇn(Unj~(j))=Qn(Unj(j)Unj~(j))A+(QnQˇn)(Unj~(j))B\begin{split}R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}&=Q^{n}(U_{n}^{j}\ell^{\dagger(j)})-\check{Q}^{n}(U_{n}^{j}\tilde{\ell}^{(j)})\\ &=\underbrace{Q^{n}(U_{n}^{j}\ell^{\dagger(j)}-U_{n}^{j}\tilde{\ell}^{(j)})}_{\text{$A$}}+\underbrace{(Q^{n}-\check{Q}^{n})(U_{n}^{j}\tilde{\ell}^{(j)})}_{\text{$B$}}\\ \end{split}

Now, AA can be bounded from above as

A=Qn(Unj(j)Unj~(j))=¯ρjn¯τ𝔊jD¯π𝔖DQn((g,f(Sρτπ(1)(1),,Sρτπ(D)(D)))(g,f^(Sρτπ(1)(1),,Sρτπ(D)(D))))=¯ρjn¯τ𝔊jD¯π𝔖DQ((g,f(S(1),,S(D)))(g,f^(S(1),,S(D))))(Q𝒬)(g,f())(g,f^())L1(q)\begin{split}A&=Q^{n}(U_{n}^{j}\ell^{\dagger(j)}-U_{n}^{j}\tilde{\ell}^{(j)})\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q^{n}(\ell(g,f(S^{(1)}_{\rho\circ\tau\circ\pi(1)},\ldots,S^{(D)}_{\rho\circ\tau\circ\pi(D)}))-\ell(g,\hat{f}(S^{(1)}_{\rho\circ\tau\circ\pi(1)},\ldots,S^{(D)}_{\rho\circ\tau\circ\pi(D)})))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q(\ell(g,f(S^{(1)},\ldots,S^{(D)}))-\ell(g,\hat{f}(S^{(1)},\ldots,S^{(D)})))\quad(\because Q\in\mathcal{Q})\\ &\leq\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}\end{split}

Then recalling Assumption 2, we can bound BB from above as

B=(QnQˇn)(Unj~(j))=(QnQˇn)(¯ρjn~(j)(Sρ(1),,Sρ(j)))=¯ρjn(QnQˇn)(~(j)(Sρ(1),,Sρ(j)))=(QjQˇj)(~(j)(S1,,Sj))( symmetry)B|i=1jq(si)i=1jqˇ(si)|ds1dsj=B|i=1jq(s1)q(si1)(q(si)qˇ(si))qˇ(si+1)qˇ(sj)|ds1dsjBi=1jq(s1)q(si1)|q(si)qˇ(si)|qˇ(si+1)qˇ(sj)ds1dsj=Bi=1j|q(si)qˇ(si)|dsi=BjqqˇL1,\begin{split}B&=(Q^{n}-\check{Q}^{n})(U_{n}^{j}\tilde{\ell}^{(j)})=(Q^{n}-\check{Q}^{n})\left(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\tilde{\ell}^{(j)}(S_{\rho(1)},\ldots,S_{\rho(j)})\right)\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}(Q^{n}-\check{Q}^{n})\left(\tilde{\ell}^{(j)}(S_{\rho(1)},\ldots,S_{\rho(j)})\right)=(Q^{j}-\check{Q}^{j})(\tilde{\ell}^{(j)}(S_{1},\ldots,S_{j}))\quad(\because\text{ symmetry})\\ &\leq B_{\ell}\int\left|\prod_{i=1}^{j}q(s_{i})-\prod_{i=1}^{j}\check{q}(s_{i})\right|\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &=B_{\ell}\int\left|\sum_{i=1}^{j}q(s_{1})\cdots q(s_{i-1})\cdot(q(s_{i})-\check{q}(s_{i}))\cdot\check{q}(s_{i+1})\cdots\check{q}(s_{j})\right|\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &\leq B_{\ell}\sum_{i=1}^{j}\int q(s_{1})\cdots q(s_{i-1})\cdot|q(s_{i})-\check{q}(s_{i})|\cdot\check{q}(s_{i+1})\cdots\check{q}(s_{j})\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &=B_{\ell}\sum_{i=1}^{j}\int|q(s_{i})-\check{q}(s_{i})|\mathrm{d}s_{i}=B_{\ell}\cdot j\left\|q-\check{q}\right\|_{{L^{1}}},\\ \end{split}

which proves the assertion. ∎
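The telescoping argument above bounds the L¹ distance between jj-fold product densities by jj times the L¹ distance between the factors. As an illustration (not part of the analysis), the same bound can be verified exactly for discrete distributions; the distributions below are arbitrary toy choices.

```python
import itertools
import math
import random

random.seed(0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Two arbitrary discrete distributions on 5 points (toy choices).
q = normalize([random.random() for _ in range(5)])
qc = normalize([random.random() for _ in range(5)])

def product_l1_gap(j):
    """L1 distance between the j-fold product distributions of q and qc."""
    return sum(
        abs(math.prod(q[i] for i in idx) - math.prod(qc[i] for i in idx))
        for idx in itertools.product(range(5), repeat=j)
    )

marginal_l1 = sum(abs(a - b) for a, b in zip(q, qc))

for j in range(1, 4):
    # Telescoping bound: ||q^j - qc^j||_1 <= j * ||q - qc||_1
    assert product_l1_gap(j) <= j * marginal_l1 + 1e-12
```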

The following lemmas bound each approximation term in terms of the difference between ff and f^\hat{f}.

Lemma 8 (Loss difference approximation).

Assume Assumption 5. Then we have for any g𝒢g\in\mathcal{G},

(g,f())(g,f^())L1(q)BqL𝒢j=1Dfjf^jW1,1\begin{split}\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}\leq B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\end{split}
Proof.
(g,f())(g,f^())L1(q)=|(g,f(s))(g,f^(s))|q(s)dsBqL𝒢f(s)f^(s)2dsBqL𝒢f(s)f^(s)1dsBqL𝒢j=1Dfjf^jW1,1.\begin{split}&\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}=\int|\ell(g,f(s))-\ell(g,\hat{f}(s))|q(s)\mathrm{d}s\\ &\qquad\leq B_{q}\int L_{\ell_{\mathcal{G}}}\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}\mathrm{d}s\\ &\qquad\leq B_{q}L_{\ell_{\mathcal{G}}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}\mathrm{d}s\leq B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}.\\ \end{split}

Lemma 9 (Density difference approximation).

Assume Assumptions 2, 3, and 4. Then we have

qqˇL1(LqLf1+BqDC1)j=1Dfjf^jW1,1+Bqκ2(ff^)\begin{split}\left\|q-\check{q}\right\|_{{L^{1}}}\leq(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime})\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+B_{q}\kappa_{2}(f-\hat{f})\\ \end{split}

where C1C_{1}^{\prime} and κ2(ff^)\kappa_{2}(f-\hat{f}) are defined as in Lemma 11.

Proof.

Since qˇ(s)=q(f1f^(s))|(Jf1f^)(s)|\check{q}(s)=q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|, we have

qqˇL1=|q(s)q(f1f^(s))|(Jf1f^)(s)||ds|q(s)q(f1f^(s))|ds+q(f1f^(s))|1|(Jf1f^)(s)||ds|q(s)q(f1f^(s))|ds(A)+Bq|1(Jf1f^)(s)|ds(B)\begin{split}\left\|q-\check{q}\right\|_{{L^{1}}}&=\int\left|q(s)-q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\right|\mathrm{d}s\\ &\leq\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s+\int q(f^{-1}\circ\hat{f}(s))\left|1-\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\right|\mathrm{d}s\\ &\leq\underbrace{\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s}_{\text{(A)}}+B_{q}\underbrace{\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s}_{\text{(B)}}\\ \end{split}

where the last line follows from qBqq\leq B_{q} and the reverse triangle inequality. Applying Lemma 10 to (A) and Lemma 11 to (B) yields the assertion. ∎

Lemma 10.

Assume Assumptions 2 and 3. Then,

|q(s)q(f1f^(s))|dsLqLf1j=1Dfjf^jW1,1\begin{split}\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s\leq L_{q}L_{f^{-1}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\end{split}
Proof.

We have

|q(s)q(f1f^(s))|ds=|q(f1f(s))q(f1f^(s))|dsLqLf1f(s)f^(s)2dsLqLf1f(s)f^(s)1dsLqLf1j=1Dfjf^jW1,1\begin{split}&\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s=\int|q(f^{-1}\circ f(s))-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s\\ &\qquad\leq L_{q}L_{f^{-1}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}\mathrm{d}s\leq L_{q}L_{f^{-1}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}\mathrm{d}s\\ &\qquad\leq L_{q}L_{f^{-1}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\\ \end{split}

Lemma 11 (Jacobian difference approximation).

Assume Assumptions 2 and 4. Then,

|1(Jf1f^)(s)|dsDC1j=1Dfjf^jW1,1+κ2(ff^),\begin{split}\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s&\leq DC_{1}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+\kappa_{2}(f-\hat{f}),\end{split}

where

Cd:=(D+1)72d2((Bf)d(k=1Dfk1C1,1)d+(Bf1)d),κ2(ff^):=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}C_{d}^{\prime}&:=(D+1)^{\frac{7}{2}d-2}\left((B_{\partial f}^{\infty})^{d}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}+(B_{\partial f^{-1}}^{\infty})^{d}\right),\\ \kappa_{2}(f-\hat{f})&:=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\\ \end{split}
Proof.

Applying Lemma 12 with A:=(Jf1f)(s)=IA:=(\mathrm{J}f^{-1}\circ f)(s)=I and E:=(Jf1f^)(s)AE:=(\mathrm{J}f^{-1}\circ\hat{f})(s)-A, we obtain

|1(Jf1f^)(s)|ds=|(Jf1f)(s)(Jf1f^)(s)|dsd=1D(Dd)df1fds(s)df1f^ds(s)opdds.\begin{split}\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s&=\int\left|(\mathrm{J}f^{-1}\circ f)(s)-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s\\ &\leq\int\sum_{d=1}^{D}{D\choose d}\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s.\\ \end{split}

Now, each term in the integrand can be bounded from above as

\begin{split}&\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &=\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right)-\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}\\ &\leq\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}+\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}+\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\qquad\quad(\because\text{ submultiplicativity (Golub and Van Loan, 2013, Section~2.3.2)})\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left(D\cdot\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\infty}\right)+\left(D\cdot\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\infty}\right)\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\qquad\quad(\because\left\|\cdot\right\|_{\mathrm{op}}\leq D\left\|\cdot\right\|_{\infty}\text{ (Golub and Van Loan, 2013, Section~2.3.2)})\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\cdot\left(DB_{\partial f}^{\infty}\right)\ +\ \left(DB_{\partial f^{-1}}^{\infty}\right)\cdot\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\leq DB_{\partial f}^{\infty}\sqrt{D}\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}(1)}+DB_{\partial f^{-1}}^{\infty}\sqrt{D}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}\right\|_{\mathrm{op}(1)}\\ &\qquad\quad(\because\left\|\cdot\right\|_{\mathrm{op}}\leq\sqrt{D}\left\|\cdot\right\|_{\mathrm{op}(1)}\text{ (Golub and Van Loan, 2013, Section~2.3.1)})\\ &=D^{\frac{3}{2}}B_{\partial f}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left|\frac{\partial f^{-1}_{j}}{\partial z_{k}}(f(s))-\frac{\partial f^{-1}_{j}}{\partial z_{k}}(\hat{f}(s))\right|+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\leq D^{\frac{3}{2}}B_{\partial f}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left\|f^{-1}_{j}\right\|_{C^{1,1}}\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\sum_{k=1}^{D}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\leq D^{\frac{3}{2}}B_{\partial f}^{\infty}\left(\sum_{j=1}^{D}\left\|f^{-1}_{j}\right\|_{C^{1,1}}\right)\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\sum_{k=1}^{D}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\qquad\quad(\because\left\|\cdot\right\|_{{\ell^{2}}}\leq\left\|\cdot\right\|_{{\ell^{1}}}\text{ (Golub and Van Loan, 2013, Section~2.2.2)}).\end{split}

Raising this bound to the dd-th power, we obtain

df1fds(s)df1f^ds(s)opd(D2+D)d1[j=1D(D3/2Bf(k=1Dfk1C1,1)|fj(s)f^j(s)|)d+k=1Dj=1D(D3/2Bf1|fjsk(s)f^jsk(s)|)d]\begin{split}&\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\\ &\leq(D^{2}+D)^{d-1}\biggl{[}\sum_{j=1}^{D}\left(D^{3/2}B_{\partial f}^{\infty}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)|f_{j}(s)-\hat{f}_{j}(s)|\right)^{d}\\ &\qquad\qquad\qquad\qquad+\sum_{k=1}^{D}\sum_{j=1}^{D}\left(D^{3/2}B_{\partial f^{-1}}^{\infty}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\right)^{d}\biggr{]}\end{split}

where we used (i=1Lai)dLd1(i=1Laid)(\sum_{i=1}^{L}a_{i})^{d}\leq L^{d-1}(\sum_{i=1}^{L}a_{i}^{d}) for ai0a_{i}\geq 0, which follows from Hölder's inequality. Hence,

df1fds(s)df1f^ds(s)opddsD52d1(D+1)d1[(Bfk=1Dfk1C1,1)dj=1D|fj(s)f^j(s)|dds+(Bf1)dj=1D(k=1D|fjsk(s)f^jsk(s)|dds)](D+1)72d2((Bf)d(k=1Dfk1C1,1)dj=1Dfjf^jW1,dd+(Bf1)dj=1Dfjf^jW1,dd)Cdj=1Dfjf^jW1,dd.\begin{split}&\int\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s\\ &\leq D^{\frac{5}{2}d-1}(D+1)^{d-1}\biggl{[}\left(B_{\partial f}^{\infty}\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}\sum_{j=1}^{D}\int|f_{j}(s)-\hat{f}_{j}(s)|^{d}\mathrm{d}s\\ &\qquad\qquad\qquad\qquad\qquad\qquad+(B_{\partial f^{-1}}^{\infty})^{d}\sum_{j=1}^{D}\left(\sum_{k=1}^{D}\int\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|^{d}\mathrm{d}s\right)\biggr{]}\\ &\leq(D+1)^{\frac{7}{2}d-2}\left((B_{\partial f}^{\infty})^{d}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}+(B_{\partial f^{-1}}^{\infty})^{d}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}\right)\\ &\leq C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

Therefore,

|1(Jf1f^)(s)|dsd=1D(Dd)df1fds(s)df1f^ds(s)opddsDC1j=1Dfjf^jW1,1+d=2D(Dd)Cdj=1Dfjf^jW1,ddκ2(ff^)\begin{split}&\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s\\ &\leq\sum_{d=1}^{D}{D\choose d}\int\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s\\ &\leq DC_{1}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+\underbrace{\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}}_{\text{$\kappa_{2}(f-\hat{f})$}}\end{split}
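The elementary inequality (sum_i a_i)^d <= L^(d-1) * sum_i a_i^d for nonnegative a_i, used in the proof above, can be sanity-checked numerically; the values below are arbitrary toy choices, not part of the analysis.

```python
import random

random.seed(1)

def power_sum_bound_holds(a, d):
    """Check (sum a_i)^d <= L^(d-1) * sum(a_i^d) for nonnegative a_i, L = len(a)."""
    L = len(a)
    return sum(a) ** d <= L ** (d - 1) * sum(x ** d for x in a) + 1e-9

for _ in range(100):
    a = [random.random() for _ in range(random.randint(1, 6))]
    for d in range(1, 5):
        assert power_sum_bound_holds(a, d)
```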

Lemma 11 used the following lemma to bound the difference in Jacobian determinants.

Lemma 12 (Determinant perturbation bound (Ipsen and Rehman, 2008, Corollary 2.11)).

Let AA and EE be D×DD\times D complex matrices. Then,

|det(A)det(A+E)|d=1D(Dd)AopDdEopd.\begin{split}|\det(A)-\det(A+E)|\leq\sum_{d=1}^{D}{D\choose d}\left\|A\right\|_{\mathrm{op}}^{D-d}\left\|E\right\|_{\mathrm{op}}^{d}.\end{split}
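As a numerical sanity check of Lemma 12 (an illustration, not part of the analysis), the following sketch verifies the bound for D=2 on random 2×2 real matrices, computing the spectral norm in closed form; all constants and ranges are toy choices.

```python
import math
import random

random.seed(2)

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def opnorm2(M):
    """Spectral norm (largest singular value) of a 2x2 matrix."""
    (a, b), (c, d) = M
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d  # entries of M^T M
    disc = math.sqrt((p - r) ** 2 + 4 * q * q)
    return math.sqrt((p + r + disc) / 2)

def perturbation_bound_holds(A, E):
    """|det(A) - det(A+E)| <= C(2,1)*||A||*||E|| + C(2,2)*||E||^2 (Lemma 12, D=2)."""
    ApE = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    lhs = abs(det2(A) - det2(ApE))
    rhs = 2 * opnorm2(A) * opnorm2(E) + opnorm2(E) ** 2
    return lhs <= rhs + 1e-9

for _ in range(200):
    A = [[random.uniform(-2, 2) for _ in range(2)] for _ in range(2)]
    E = [[random.uniform(-2, 2) for _ in range(2)] for _ in range(2)]
    assert perturbation_bound_holds(A, E)
```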

C.8 Comparison of Rademacher complexities

The following consideration demonstrates how the effective complexity measure ℜ\mathfrak{R} in Theorem 3, resulting from the proposed method, may enjoy a relaxed dependence on the input dimensionality compared with ordinary empirical risk minimization. To do so, we first recall the definition of the ordinary Rademacher complexity and a standard performance guarantee based on it.

Definition 2 (Ordinary Rademacher complexity).

The ordinary empirical risk minimization finds the candidate hypothesis by

g^argming𝒢R^(g),\begin{split}\hat{g}\in\mathop{\rm argmin}\limits_{g\in\mathcal{G}}\hat{R}(g),\end{split}

where

R^(g):=1ni=1n(g,Zi)=1ni=1n(g,f^(Si(1),,Si(D)))\begin{split}\hat{R}(g):=\frac{1}{n}\sum_{i=1}^{n}\ell(g,Z_{i})=\frac{1}{n}\sum_{i=1}^{n}\ell(g,\hat{f}(S_{i}^{(1)},\ldots,S_{i}^{(D)}))\end{split}

and the corresponding ordinary Rademacher complexity ord(𝒢)\mathfrak{R}_{\mathrm{ord}}(\mathcal{G}) is

ord(𝒢):=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi(Si(1),,Si(D))|]\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{G})&:=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}\ell(S_{i}^{(1)},\ldots,S_{i}^{(D)})\right|\right]\end{split}

where {σi}i=1n\{\sigma_{i}\}_{i=1}^{n} are independent uniform sign variables and we denoted (s(1),,s(D))=(g,f^(s(1),,s(D)))\ell(s^{(1)},\ldots,s^{(D)})=\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)})) by abuse of notation. This yields the standard Rademacher complexity based bound: applying the same proof technique as in Lemma 5, we have that with probability at least 1δ1-\delta,

R(g^)R(g)2supg𝒢|R(g)R^(g)|4ord(𝒢)+2Blog(2/δ)2n.\begin{split}R(\hat{g})-R(g^{*})\leq 2\sup_{g\in\mathcal{G}}|R(g)-\hat{R}(g)|\leq 4\mathfrak{R}_{\mathrm{ord}}(\mathcal{G})+2B_{\ell}\sqrt{\frac{\log(2/\delta)}{2n}}.\end{split}

Therefore, the corresponding complexity terms are ord(𝒢)\mathfrak{R}_{\mathrm{ord}}(\mathcal{G}) and D(𝒢)D\mathfrak{R}(\mathcal{G}). In Remark 3, we compare these two complexity measures with an example. To recall, the effective Rademacher complexity can be written, in terms of the notation in this section, as

(𝒢)=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S2,,SD~(Si,S2,,SD)|]=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S1,,SD1D((Si(1),S2(2),,SD(D))++(S1(1),S2(2),,Si(D)))|]\begin{split}&\mathfrak{R}(\mathcal{G})=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{2},\ldots,S^{\prime}_{D}}\tilde{\ell}(S_{i},S_{2}^{\prime},\ldots,S_{D}^{\prime})\right|\right]\\ &=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}\frac{1}{D}\left(\ell(S_{i}^{(1)},S_{2}^{\prime(2)},\ldots,S_{D}^{\prime(D)})+\cdots+\ell(S_{1}^{\prime(1)},S_{2}^{\prime(2)},\ldots,S_{i}^{(D)})\right)\right|\right]\\ \end{split}
Remark 3 (Comparison of Rademacher complexities).

As an example, consider \mathcal{H}, the set of LL-Lipschitz functions (with respect to infinity norm) on the unit cube [0,1]d[0,1]^{d}. It is well-known that there exists a constant C>0C>0 such that the following holds (Wainwright, 2019, Example 5.10, p.129) for sufficiently small t>0t>0:

log𝒩(t,,)(C/t)d.\begin{split}\log\mathcal{N}(t,\mathcal{H},\left\|\cdot\right\|_{\infty})\asymp(C/t)^{d}.\end{split} (9)

Here, a(t)b(t)a(t)\asymp b(t) indicates that there exist k1,k2>0k_{1},k_{2}>0 such that, for sufficiently small tt, it holds that k1b(t)a(t)k2b(t)k_{1}b(t)\leq a(t)\leq k_{2}b(t). On the other hand, the well-known discretization argument implies that there exist constants cc and BB such that for any t(0,B]t\in(0,B], the following relation between the Rademacher complexity and the metric entropy holds:

ord()t+clog𝒩(t,,)n.\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})\leq t+c\sqrt{\frac{\log\mathcal{N}(t,\mathcal{H},\left\|\cdot\right\|_{\infty})}{n}}.\end{split} (10)

Substituting Eq. (9) into Eq. (10), we find that, for large enough nn, the right-hand side is minimized at t=(cCd2d2)22+dn12+dt=(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2})^{\frac{2}{2+d}}\cdot n^{-\frac{1}{2+d}}. This yields

ord()C~n12+d\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})\leq\tilde{C}\cdot n^{-\frac{1}{2+d}}\end{split} (11)

with a new constant C~=(cCd2d2)22+d+cCd2(cCd2d2)d2+d\tilde{C}=\left(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2}\right)^{\frac{2}{2+d}}+c\cdot C^{\frac{d}{2}}\left(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2}\right)^{-\frac{d}{2+d}}. Therefore, by substituting d=Dd=D in Eq. (11), the metric-entropy based bound on the ordinary Rademacher complexity exhibits exponential dependence on the input dimension as

ord()𝒪(n12+D),\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})&\leq\mathcal{O}\left(n^{-\frac{1}{2+D}}\right),\end{split}

which is a manifestation of the curse of dimensionality. On the other hand, by following a similar calculation, we can see that the effective Rademacher complexity ()\mathfrak{R}(\mathcal{H}) avoids an exponential dependence on the input dimension DD. By substituting d=1d=1 in Eq. (11), we can see

D()ord(1)++ord(D)𝒪(n13),\begin{split}D\mathfrak{R}(\mathcal{H})&\leq\mathfrak{R}_{\mathrm{ord}}(\mathcal{H}_{1})+\cdots+\mathfrak{R}_{\mathrm{ord}}(\mathcal{H}_{D})\leq\mathcal{O}\left(n^{-\frac{1}{3}}\right),\\ \end{split}

where j:={𝔼S1,,SDh(S1(1),,Sj1(j1),()(j),Sj+1(j+1),,SD(D)):h}\mathcal{H}_{j}:=\{{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}h({S_{1}^{\prime}}^{(1)},\ldots,{S_{j-1}^{\prime}}^{(j-1)},(\cdot)^{(j)},{S_{j+1}^{\prime}}^{(j+1)},\ldots,{S^{\prime}_{D}}^{(D)}):h\in\mathcal{H}\}. This is because the Lipschitz constant of the functions in ℋj\mathcal{H}_{j} is at most LL (i.e., the Lipschitz constant does not increase under marginalization), since for any hjh\in\mathcal{H}_{j},

|h(x)h(y)|=|𝔼S1,,SD[h(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))h(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))]|𝔼S1,,SD|h(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))h(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))|𝔼S1,,SDL(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))=𝔼S1,,SDL(0,,0,xy,0,,0)=L|xy|.\begin{split}&|h(x)-h(y)|\\ &=|{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}[h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})]|\\ &\leq{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}|h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})|\\ &\leq{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}L\cdot\|(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})\|\\ &={\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}L\cdot\|(0,\ldots,0,x-y,0,\ldots,0)\|\\ &=L\cdot|x-y|.\\ \end{split}
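The gap between the two rates n^(-1/(2+D)) and n^(-1/3) in Remark 3 can be illustrated numerically; the constants are ignored and the sample sizes and D=10 below are arbitrary toy choices.

```python
# Compare the metric-entropy rates n^(-1/(2+D)) (ordinary ERM over
# D-variate Lipschitz functions) and n^(-1/3) (effective complexity,
# corresponding to d = 1), ignoring multiplicative constants.
D = 10  # toy input dimension

def rate_ordinary(n, D):
    """Rate bound for the ordinary Rademacher complexity."""
    return n ** (-1.0 / (2 + D))

def rate_effective(n):
    """Rate bound for the effective Rademacher complexity (d = 1)."""
    return n ** (-1.0 / 3)

for n in [10 ** 3, 10 ** 4, 10 ** 6]:
    # The dimension-free rate decays strictly faster for D > 1.
    assert rate_effective(n) < rate_ordinary(n, D)
```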

C.9 Remark on higher order Sobolev norms

Here, we comment on how the term κ2(ff^)\kappa_{2}(f-\hat{f}) is treated as a higher-order term in ff^f-\hat{f}.

Remark 4 (Higher order Sobolev norms).

Let us assume that suppqsuppqˇ\operatorname{supp}{q}\cup\operatorname{supp}{\check{q}} is contained in a compact set 𝒮~\mathcal{\tilde{S}} for all f^\hat{f} considered. Note that for d[D]d\in[D],

𝒮~|h(s)|dds(V𝒮~)DdD(𝒮~|h(s)|Dds)d/D\begin{split}\int_{\mathcal{\tilde{S}}}|h(s)|^{d}\mathrm{d}s\leq(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}\left(\int_{\mathcal{\tilde{S}}}|h(s)|^{D}\mathrm{d}s\right)^{d/D}\end{split}

by Hölder's inequality, where we defined V𝒮~:=𝒮~1dsV_{\mathcal{\tilde{S}}}:=\int_{\mathcal{\tilde{S}}}1\mathrm{d}s; hence, we have Ld(𝒮~)(V𝒮~)1d1DLD(𝒮~)\left\|\cdot\right\|_{{L^{d}}(\mathcal{\tilde{S}})}\leq(V_{\mathcal{\tilde{S}}})^{\frac{1}{d}-\frac{1}{D}}\left\|\cdot\right\|_{{L^{D}}(\mathcal{\tilde{S}})}. By applying this relation to each term in the definition of W1,d\left\|\cdot\right\|_{W^{1,d}}, we obtain

fW1,dd(V𝒮~)DdDfW1,Dd\begin{split}\left\|f\right\|_{W^{1,d}}^{d}\leq(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}\left\|f\right\|_{W^{1,D}}^{d}\end{split}

Thus we obtain

κ2(ff^)=d=2D(Dd)Cdj=1Dfjf^jW1,ddd=2D(Dd)(V𝒮~)DdDCdj=1Dfjf^jW1,Dd𝒪(j=1Dfjf^jW1,D2)(f^f).\begin{split}\kappa_{2}(f-\hat{f})&=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}\\ &\leq\sum_{d=2}^{D}{D\choose d}(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}^{d}\\ &\leq\mathcal{O}\left(\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}^{2}\right)\quad(\hat{f}\to f).\end{split}

By also replacing $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}$ with $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}$ in Theorem 3, we can see more clearly that $\kappa_{2}(f-\hat{f})$ is a higher-order term in $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}$.
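The Hölder-type norm comparison used above can be checked numerically. The sketch below (our own illustration) verifies $\|h\|_{L^{d}(\mathcal{\tilde{S}})}\leq(V_{\mathcal{\tilde{S}}})^{\frac{1}{d}-\frac{1}{D}}\|h\|_{L^{D}(\mathcal{\tilde{S}})}$ for $d\leq D$ on $\mathcal{\tilde{S}}=[0,2]$; the test function and the exponents are arbitrary choices.

```python
import numpy as np

# Compact set S~ = [0, 2], so its volume V is 2.
xs = np.linspace(0.0, 2.0, 200_001)
dx = xs[1] - xs[0]
V = 2.0
h = np.exp(xs) * np.sin(3 * xs)  # an arbitrary smooth test function

def lp_norm(values, p):
    """Approximate the L^p norm on [0, 2] by a Riemann sum."""
    return (np.sum(np.abs(values) ** p) * dx) ** (1.0 / p)

d, D = 2, 4
lhs = lp_norm(h, d)
rhs = V ** (1.0 / d - 1.0 / D) * lp_norm(h, D)
assert lhs <= rhs  # the Hoelder-type comparison holds
```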

Appendix D Details and Proofs of Theorem 1

Here, we provide the proof of Theorem 1. We reuse the notation and terminology from Section C of this Supplementary Material. We prove the uniformly minimum variance property of the proposed risk estimator under the ideal situation $\hat{f}=f$.

Theorem 4 (Known causal mechanism case).

Assume $\hat{f}=f$. Then, for all $g\in\mathcal{G}$, $\check{R}(g)$ is the uniformly minimum variance unbiased estimator of $R(g)$. As a special case, its variance is no larger than that of the ordinary empirical risk estimator: $\forall q\in\mathcal{Q},\forall g\in\mathcal{G},\ \mathrm{Var}(\check{R}(g))\leq\mathrm{Var}(\hat{R}(g))$.

Proof.

The proof follows from the following two facts. When $\check{q}\in\mathcal{Q}$, the estimator $\check{R}(g)$ becomes the generalized U-statistic of the statistical functional Eq. (6). Furthermore, when $\hat{f}=f$, Eq. (6) coincides with $R(g)$ because the approximation error is zero. Since we assume $\hat{f}=f$, we have $\check{q}=q\in\mathcal{Q}$, and hence both of the statements above hold. Therefore, by Lemma 13, the first assertion of the theorem follows. The last assertion follows as a special case because $\hat{R}(g)$ is an unbiased estimator of $R(g)$ for $q\in\mathcal{Q}$.

From here, we confirm the above statements by calculation. We first show that $\check{R}(g)$ is the generalized U-statistic. To see this, observe that the statistical functional Eq. (6) admits the following expression when $\check{q}\in\mathcal{Q}$:

\begin{split}&\int\tilde{\ell}(s_{1},\ldots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)}))\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)}))\prod_{d}\check{q}^{(d)}(s_{1}^{(d)})\cdots\prod_{d}\check{q}^{(d)}(s_{D}^{(d)})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\ell(g,\hat{f}(s_{1}^{(1)},\ldots,s_{1}^{(D)}))\prod_{d}\check{q}^{(d)}(s_{1}^{(d)})\mathrm{d}s_{1}\\ &=\int\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\check{q}_{1}(s^{(1)})\cdots\check{q}_{D}(s^{(D)})\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)},\\ \end{split}

where the third equality holds because each permutation term integrates to the same value after relabeling the variables.

This is a regular statistical functional of degrees $(1,\ldots,1)$ with the kernel $\ell(g,\hat{f}(\cdot,\ldots,\cdot))$. On the other hand, we have

\begin{split}\check{R}(g)&=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\tilde{\ell}(S_{i_{1}},\ldots,S_{i_{D}})=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\ell(g,\hat{f}(S_{i_{1}}^{(1)},\ldots,S_{i_{D}}^{(D)}))\end{split}

because the summations run through all combinations with replacement. This, combined with the fact that $\{S_{i}^{(d)}\}_{i,d}$ are jointly independent when $\check{q}\in\mathcal{Q}$, yields that $\check{R}(g)$ is the generalized U-statistic for Eq. (6).

Now we show that Eq. (6) coincides with $R(g)$. Given $\hat{f}=f$, we have

\begin{split}R(g)&=\int q(s)\ell(g,f(s))\mathrm{d}s\\ &=\int q(s)\ell(g,\hat{f}(s))\mathrm{d}s\quad(\text{by }f=\hat{f})\\ &=\int q_{1}(s^{(1)})\cdots q_{D}(s^{(D)})\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)}\quad(\text{by }q\in\mathcal{Q})\\ &=\int\check{q}_{1}(s^{(1)})\cdots\check{q}_{D}(s^{(D)})\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)}\quad(\text{by }q=\check{q})\\ &=\int\tilde{\ell}(s_{1},\ldots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}.\quad(\because\text{ symmetry})\\ \end{split}
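The variance reduction asserted in Theorem 4 can be illustrated by simulation. The following sketch (our own, not from the paper) takes $D=2$ with a factorized $q$; the kernel `loss` stands in for $\ell(g,\hat{f}(\cdot,\cdot))$ and is an arbitrary choice. The all-pairs average $\check{R}(g)$ and the paired average $\hat{R}(g)$ are both unbiased, but the former has smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda s1, s2: (s1 + s2) ** 2  # stands in for l(g, f_hat(s1, s2))

n, reps = 10, 20_000
r_hat, r_check = [], []
for _ in range(reps):
    s1 = rng.normal(size=n)  # first independent component
    s2 = rng.normal(size=n)  # second independent component
    # R_hat: plain empirical average over the n paired samples.
    r_hat.append(np.mean(loss(s1, s2)))
    # R_check: average over all n^2 cross-combinations (with replacement).
    r_check.append(np.mean(loss(s1[:, None], s2[None, :])))

# Both are unbiased for E[(S1 + S2)^2] = 2; R_check has smaller variance.
assert np.var(r_check) <= np.var(r_hat)
```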

The following well-known lemma states that a generalized U-statistic is a uniformly minimum variance unbiased estimator.

Lemma 13 (Uniformly minimum variance property of a generalized U-statistic).

Let $\theta:\mathcal{Q}\to\mathbb{R}$ be a regular statistical functional with kernel $\psi:\mathbb{R}^{k_{1}}\times\cdots\times\mathbb{R}^{k_{L}}\to\mathbb{R}$ (Clémençon et al., 2016), i.e.,

\begin{split}\theta(q)=\int\psi((x_{1}^{(1)},\ldots,x_{k_{1}}^{(1)}),\ldots,(x_{1}^{(L)},\ldots,x_{k_{L}}^{(L)}))\prod_{j=1}^{k_{1}}q_{1}(x_{j}^{(1)})\mathrm{d}x_{j}^{(1)}\cdots\prod_{j=1}^{k_{L}}q_{L}(x_{j}^{(L)})\mathrm{d}x_{j}^{(L)}.\end{split}

Given samples $\{x_{i}^{(l)}\}_{i=1}^{n_{l}}\overset{\text{i.i.d.}}{\sim}q_{l}$ ($n_{l}\geq k_{l}$, $l=1,\ldots,L$), let $\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi$ be the corresponding generalized U-statistic:

\begin{split}\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi:=\frac{1}{\prod_{l}{n_{l}\choose k_{l}}}\sum\psi\left(\left(x_{i_{1}^{(1)}}^{(1)},\ldots,x_{i_{k_{1}}^{(1)}}^{(1)}\right),\ldots,\left(x_{i_{1}^{(L)}}^{(L)},\ldots,x_{i_{k_{L}}^{(L)}}^{(L)}\right)\right),\end{split}

where $\sum$ denotes the summation over all possible combinations (without replacement) of the indices. Then, $\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi$ is the uniformly minimum variance unbiased estimator of $\theta$ on $\mathcal{Q}$.

Proof.

The assertion can be proved in a manner parallel to the proof of (Lee, 1990, Section 1.1, Lemma B).
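To make the definition in Lemma 13 concrete, here is a brute-force sketch of the generalized U-statistic (our own illustration; the function name, kernel, and data are arbitrary). For degrees $(1,1)$ with kernel $\psi(x;y)=xy$, the statistic reduces to the product of the two sample means.

```python
import itertools
import numpy as np

def generalized_u_statistic(kernel, samples, degrees):
    """Average the kernel over all combinations (without replacement)
    of k_l indices within each of the L samples."""
    total, count = 0.0, 0
    index_sets = [itertools.combinations(range(len(x)), k)
                  for x, k in zip(samples, degrees)]
    for idx_tuple in itertools.product(*index_sets):
        args = [tuple(x[i] for i in idx)
                for x, idx in zip(samples, idx_tuple)]
        total += kernel(*args)
        count += 1
    return total / count

# Degree-(1, 1) example: psi(x; y) = x * y, whose generalized U-statistic
# equals the product of the two sample means.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0])
gu = generalized_u_statistic(lambda a, b: a[0] * b[0], [x, y], [1, 1])
assert abs(gu - x.mean() * y.mean()) < 1e-12
```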

Remark 5 (Relation to the UMVUE property of $\hat{R}(g)$).

The result in Theorem 4 does not contradict the fact that the sample average $\hat{R}(g)$ is a U-statistic of degree 1 and hence has the minimum variance among all unbiased estimators of $R(g)$ on $\mathcal{P}$, where $\mathcal{P}$ is a set of distributions containing all absolutely continuous distributions Lee (1990). Specifically, $\check{R}(g)$ is not generally an unbiased estimator of $R(g)$ on $\mathcal{P}\setminus\mathcal{Q}$, even if $\hat{f}=f$. While $\check{R}(g)$ satisfies the $D$-sample symmetry condition, $\hat{R}(g)$ does not. By restricting attention to $\mathcal{Q}$, the estimator $\check{R}(g)$ achieves a smaller variance than $\hat{R}(g)$.

Appendix E Further Comparison with Related Work

Here, we provide an additional detailed comparison with the related work to complement Section 5 of the main text.

E.1 Comparison with Magliacane et al. (2018)

Magliacane et al. (2018) considered domain adaptation among different interventional states by using SCMs. Their problem setting and ours do not strictly include each other (the two settings are somewhat complementary): their assumption may be more suitable for application fields with interventional experiments, such as genomics, while ours may be more suited for fields with observational data, such as health record analysis or economics. At the methodological level, Magliacane et al. (2018) take a variable selection approach, finding a subset of variables conditioned on which the conditional distribution is invariant, whereas our paper takes a data augmentation approach via estimation of the SEMs (in reduced form).

The essential assumptions of Magliacane et al. (2018) are the existence of a separating set (with small “incomplete information bias”) and the identifiability of such a set (following from Proposition 1, Assumption 1, and Assumption 2 (iii) in Magliacane et al. (2018)). A particularly plausible application conforming to these assumptions is, for example (but not limited to), genomics experiments, in part because Assumption 2 (ii) and (iii) are likely to hold for well-targeted experiments Magliacane et al. (2018). The following is a detailed comparison.

(1) Modeling assumption and problem setup.

The two problem settings do not strictly include one another; rather, they are complementary, with ours corresponding to the intervention-free case and Magliacane et al. (2018) to the interventional case. If we tried to express the problem setting of Magliacane et al. (2018) within our formulation, we would express the interventions as alterations to the SEMs. We assume that such alterations do not occur in our setting since our focus is on observational data; therefore, the problem formulation of Magliacane et al. (2018) is not a subset of ours. Conversely, if we tried to express our problem setting within the formulation of Magliacane et al. (2018), our setup would have only $C_{1}$ as the context variable, and $C_{1}$ would be a parent of all observed variables, e.g., $C_{1}$ switches the distribution of $S$ by switching between the quantile functions used for inverse transform sampling. This potentially allows the effect $C_{1}\to Y$, violating Assumption 2 (iii) in Magliacane et al. (2018). Moreover, even if such an edge does not exist, there may be no separating set in the extreme case where $Y$ is a parent of all the $X_{i}$'s: conditioning on any of the $X_{i}$'s would then make $C_{1}$ and $Y$ dependent. From this consideration, our problem setting is not a subset of that of Magliacane et al. (2018), either.

(2) Plausible applications.

The problem setup of Magliacane et al. (2018) is especially suitable for applications in which various experiments are conducted, such as genomics Magliacane et al. (2018), whereas our problem setting may be more suitable for fields with observational data, such as health record analysis or economics.

(3) Methodology.

Our proposed method actually estimates the SEMs (in reduced form) and exploits the estimated SEMs in the domain adaptation algorithm. Directly using the estimated SEMs to realize domain adaptation can be seen as the first attempt to fully leverage structural causal models in a DA algorithm. On the other hand, Magliacane et al. (2018) approach the problem via variable selection, finding a subset of variables conditioned on which the conditional distribution is invariant.

E.2 Comparison with Gong et al. (2018)

In the present paper, we assumed the invariance of structural equations across domains. Here, we clarify the difference from a related but distinct assumption considered by the Causal Generative Domain Adaptation Network (CG-DAN; Gong et al., 2018).

(1) Problem setup.

Gong et al. (2018) presume the anticausal scenario (i.e., $Y$ is the cause of $X$) and that $X$ given $Y$ follows a structural equation model, whereas our paper considers more general SEMs of $X$ and $Y$.

(2) Theoretical justification.

The approach of Gong et al. (2018) has no theoretical guarantee for the identifiability of $f$; that is, no theoretical condition is known under which the learned generator is applicable across different domains. On the other hand, our method enjoys the strong theoretical justification of nonlinear ICA, including the identifiability of $f$ under known theoretical conditions.

(3) Methodology.

The method of Gong et al. (2018) estimates the GCM of $X$ given $Y$ using source domain data and uses it to design a generator neural network. On the other hand, we exploit the estimated reduced-form SEM more directly in our method.

E.3 Comparison with Arjovsky et al. (2020)

Arjovsky et al. (2020) proposed invariant risk minimization (IRM) for the out-of-distribution (OOD) generalization problem. The IRM approach tries to learn a feature extractor that makes the optimal predictor invariant across domains, and its theoretical validity is argued based on SCMs. Here, we compare it with the present work in terms of the problem setup, theoretical justification, and methodology.

(1) Basic assumption and problem setup.

The OOD generalization problem tackled in Arjovsky et al. (2020) assumes no access to target domain data. In this respect, the problem is different from and intrinsically more difficult than the one considered in this paper, where a small labeled sample from the target domain is assumed to be available. To solve the OOD generalization problem, in a nutshell, Arjovsky et al. (2020) essentially assume the existence of a feature extractor that elicits an invariant predictor, i.e., one that makes the optimal predictors of the different domains identical after the feature transformation. This can be seen as a variant of the representation learning approach to domain adaptation, where one assumes there exists $\mathcal{T}$ such that $p(Y|\mathcal{T}(X))$ is invariant across domains. Indeed, when the loss function is the cross-entropy, for example, the condition corresponds to the invariance of $P(Y|\mathcal{T}(X))$ across domains Arjovsky et al. (2020). More technically, (Arjovsky et al., 2020, Definition 7(ii)) additionally requires the condition ${\mathbb{E}}_{1}[Y|\mathrm{Pa}(Y)]={\mathbb{E}}_{2}[Y|\mathrm{Pa}(Y)]$, which can be violated when the latent factors corresponding to $Y$ have different distributions across domains. On the other hand, our assumption can be seen as the existence of a feature extractor that can simultaneously estimate the independent components in all domains, which does not necessarily imply the existence of a common feature transformer inducing a unique optimal predictor.

(2) Theoretical justification.

Arjovsky et al. (2020) formulated a condition under which the IRM principle leads to an appropriate predictor for OOD generalization, but only under a certain linearity assumption that is essentially a relaxation of linear SEMs. Furthermore, in the theoretical guarantee, the feature extractor is restricted to be linear. In addition, Arjovsky et al. (2020) provide only a population-level analysis showing that the solution of the IRM objective formulated with the underlying distributions enjoys OOD generalization; they do not discuss conditions under which the ideal feature extractor can be properly estimated by empirical IRM. The need for the strong linearity assumption likely stems from the intrinsic difficulty of the OOD problem in Arjovsky et al. (2020), whose formulation does not assume specific types of interventions. On the other hand, our method enjoys a stronger theoretical guarantee of an excess risk bound without such parametric assumptions on the models or the data generating process, by focusing on the case where the causal mechanisms are invariant across domains.

(3) Methodology.

The methodology of IRM estimates a single predictor that generalizes well to all domains by finding a feature extractor that makes the predictor optimal in all domains. This approach shares the same spirit as representation learning approaches to domain adaptation, such as transfer component analysis Pan et al. (2011), which seek a feature extractor that induces invariant conditional distributions. On the other hand, our method estimates the SEMs (in reduced form) and exploits them to make training on the few target domain data more efficient through data augmentation.