
Few-shot Domain Adaptation by Causal Mechanism Transfer

Takeshi Teshima,1,2 Issei Sato,1,2 Masashi Sugiyama2,1
1The University of Tokyo, 2RIKEN
[email protected], {sato, sugi}@k.u-tokyo.ac.jp
Abstract

We study few-shot supervised domain adaptation (DA) for regression problems, where only a few labeled target domain data and many labeled source domain data are available. Many of the current DA methods base their transfer assumptions on either parametrized distribution shift or apparent distribution similarities, e.g., identical conditionals or small distributional discrepancies. However, these assumptions may preclude the possibility of adaptation from intricately shifted and apparently very different distributions. To overcome this problem, we propose mechanism transfer, a meta-distributional scenario in which a data generating mechanism is invariant across domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for DA. We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. Our method can be seen as the first attempt to fully leverage the structural causal models for DA.

1 Introduction

Learning from a limited amount of data is a long-standing yet actively studied problem of machine learning. Domain adaptation (DA) Ben-David et al. (2010) tackles this problem by leveraging auxiliary data sampled from related but different domains. In particular, we consider few-shot supervised DA for regression problems, where only a few labeled target domain data and many labeled source domain data are available.

A key component of DA methods is the transfer assumption (TA) used to relate the source and the target distributions. Many of the previously explored TAs rely on certain direct distributional similarities, e.g., identical conditionals Shimodaira (2000) or small distributional discrepancies Ben-David et al. (2007). However, these TAs may preclude the possibility of adaptation from apparently very different distributions. Many others assume parametric forms of the distribution shift Zhang et al. (2013) or of the distribution family Storkey and Sugiyama (2007), which can severely limit the set of distributions considered. (We further review related work in Section 5.1.)

To alleviate the intrinsic limitation of previous TAs due to relying on apparent distribution similarities or parametric assumptions, we focus on a meta-distributional scenario where there exists a common generative mechanism behind the data distributions (Figures 1,2). Such a common mechanism may be more conceivable in applications involving structured table data such as medical records Yadav et al. (2018). For example, in medical record analysis for disease risk prediction, it can be reasonable to assume that there is a pathological mechanism that is common across regions or generations, but the data distributions may vary due to the difference in cultures or lifestyles. Such a hidden structure (pathological mechanism, in this case), once estimated, may provide portable knowledge to enable DA, allowing one to obtain accurate predictors for under-investigated regions or new generations.

Concretely, our assumption relies on the generative model of nonlinear independent component analysis (nonlinear ICA; Figure 1), where the observed labeled data are generated by first sampling latent independent components (ICs) $S$ and then transforming them by a nonlinear invertible mixing function $f$ Hyvärinen et al. (2019). Under this generative model, our TA is that $f$, representing the mechanism, is identical across domains (Figure 2). This TA allows us to formally relate the domain distributions and develop a novel DA method without assuming their apparent similarities or making parametric assumptions.


Figure 1: Nonparametric generative model of nonlinear independent component analysis. Our meta-distributional transfer assumption is built on this model, where there exists an invertible function $f$ representing the mechanism to generate the labeled data $(X,Y)$ from the independent components (ICs) $S$ sampled from $q$. As a result, each pair $(f,q)$ defines a joint distribution $p$.



Figure 2: Our assumption of common generative mechanism. By capturing the common data generation mechanism, we enable domain adaptation among seemingly very different distributions without relying on parametric assumptions.

Our contributions.

Our key contributions can be summarized in three points as follows.

  1.

    We formulate the flexible yet intuitively accessible TA of a shared generative mechanism and develop a few-shot regression DA method (Section 3). The idea is as follows. First, from the source domain data, we estimate the mixing function $f$ by nonlinear ICA Hyvärinen et al. (2019), because $f$ is the only assumed relation between the domains. Then, to transfer the knowledge, we perform data augmentation on the target domain data using the estimated $f$ and the independence of the IC distributions. Finally, the augmented data are used to fit a target predictor (Figure 3).

  2.

    We theoretically justify the augmentation procedure by invoking the theory of generalized U-statistics (Lee, 1990). The theory shows that the proposed data augmentation procedure yields the uniformly minimum variance unbiased risk estimator in an ideal case. We also provide an excess risk bound Mohri et al. (2012) to cover a more realistic case (Section 4).

  3.

    We experimentally demonstrate the effectiveness of the proposed algorithm (Section 6). The real-world data we use is taken from the field of econometrics, for which structural equation models have been applied in previous studies Greene (2012).

A salient example of the generative model we consider is the structural equations of causal modeling (Section 2). In this context, our method can be seen as the first attempt to fully leverage the structural causal models for DA (Section 5.2).

2 Problem Setup

In this section, we describe the problem setup and the notation. To summarize, our problem setup is homogeneous, multi-source, and few-shot supervised domain adapting regression. That is, respectively, all data distributions are defined on the same data space, there are multiple source domains, and a limited number of labeled data is available from the target distribution (and we do not assume the availability of unlabeled data). In this paper, we use the terms domain and distribution interchangeably.

Notation.

Let us denote the set of real (resp. natural) numbers by $\mathbb{R}$ (resp. $\mathbb{N}$). For $N \in \mathbb{N}$, we define $[N] := \{1, 2, \ldots, N\}$. Throughout the paper, we fix $D\,(\in \mathbb{N}) > 1$ and suppose that the input space $\mathcal{X}$ is a subset of $\mathbb{R}^{D-1}$ and the label space $\mathcal{Y}$ is a subset of $\mathbb{R}$. As a result, the overall data space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ is a subset of $\mathbb{R}^{D}$. We generally denote a labeled data point by $Z = (X, Y)$. We denote by $\mathcal{Q}$ the set of independent distributions on $\mathbb{R}^{D}$ with absolutely continuous marginals. For a distribution $p$, we denote its induced expectation operator by $\mathbb{E}_{p}$. Table 3 in Supplementary Material provides a summary of notation.

Basic setup: Few-shot domain adapting regression.

Let $p_{\mathrm{Tar}}$ be a distribution (the target distribution) over $\mathcal{Z}$, and let $\mathcal{G} \subset \{g : \mathbb{R}^{D-1} \to \mathbb{R}\}$ be a hypothesis class. Let $\ell : \mathcal{G} \times \mathbb{R}^{D} \to [0, B_{\ell}]$ be a loss function, where $B_{\ell} > 0$ is a constant. Our goal is to find a predictor $g \in \mathcal{G}$ which performs well for $p_{\mathrm{Tar}}$, i.e., whose target risk $R(g) := \mathbb{E}_{p_{\mathrm{Tar}}} \ell(g, Z)$ is small. We denote $g^{*} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} R(g)$. To this end, we are given an independent and identically distributed (i.i.d.) sample $\mathcal{D}_{\mathrm{Tar}} := \{Z_{i}\}_{i=1}^{n_{\mathrm{Tar}}} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{Tar}}$. In a fully supervised setting where $n_{\mathrm{Tar}}$ is large, a standard procedure is to select $g$ by empirical risk minimization (ERM), i.e., $\hat{g} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \hat{R}(g)$, where $\hat{R}(g) := \frac{1}{n_{\mathrm{Tar}}} \sum_{i=1}^{n_{\mathrm{Tar}}} \ell(g, Z_{i})$. However, when $n_{\mathrm{Tar}}$ is not sufficiently large, $\hat{R}(g)$ may not accurately estimate $R(g)$, resulting in a high generalization error of $\hat{g}$. To compensate for the scarcity of data from the target distribution, we assume that we have data from $K$ distinct source distributions $\{p_{k}\}_{k=1}^{K}$ over $\mathcal{Z}$; that is, we have independent i.i.d. samples $\mathcal{D}_{k} := \{Z^{\mathrm{Src}}_{k,i}\}_{i=1}^{n_{k}} \overset{\text{i.i.d.}}{\sim} p_{k}$ ($k \in [K]$, $n_{k} \in \mathbb{N}$), whose relations to $p_{\mathrm{Tar}}$ are described shortly. We assume $n_{\mathrm{Tar}}, n_{k} \geq D$ for simplicity.
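As a concrete illustration of why plain ERM suffers in the few-shot regime, the following numpy sketch compares the variance of the empirical risk $\hat{R}(g)$ under small and large target samples. The linear predictor, squared loss, and target distribution here are hypothetical choices we introduce purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(g, Z):
    """Loss ell(g, Z) for a linear predictor g on labeled points Z = (X, Y)."""
    X, Y = Z[:, :-1], Z[:, -1]
    return (X @ g - Y) ** 2

def empirical_risk(g, Z):
    """Ordinary empirical risk R_hat(g) = (1/n) sum_i ell(g, Z_i)."""
    return squared_loss(g, Z).mean()

# Hypothetical target distribution: Y = 2*X + small noise, D = 2.
def sample_target(n):
    X = rng.normal(size=(n, 1))
    Y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)
    return np.hstack([X, Y[:, None]])

g = np.array([2.0])  # a candidate predictor
risks_small = [empirical_risk(g, sample_target(5)) for _ in range(200)]
risks_large = [empirical_risk(g, sample_target(500)) for _ in range(200)]
# With few target samples, the risk estimate fluctuates far more.
print(np.var(risks_small) > np.var(risks_large))
```

The variance of $\hat{R}(g)$ scales as $1/n_{\mathrm{Tar}}$, which is exactly the quantity the augmentation in Section 3 aims to reduce.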

Key assumption.

In this work, the key transfer assumption is that all domains follow nonlinear ICA models with an identical mixing function (Figure 2). To be precise, we assume that there exist IC distributions $q_{\mathrm{Tar}}, q_{k} \in \mathcal{Q}$ ($k \in [K]$) and a smooth invertible function $f : \mathbb{R}^{D} \to \mathbb{R}^{D}$ (the transformation or mixing) such that $Z^{\mathrm{Src}}_{k,i} \sim p_{k}$ is generated by first sampling $S^{\mathrm{Src}}_{k,i} \sim q_{k}$ and then transforming it by

$Z^{\mathrm{Src}}_{k,i} = f(S^{\mathrm{Src}}_{k,i}),$ (1)

and similarly $Z_{i} = f(S_{i})$, $S_{i} \sim q_{\mathrm{Tar}}$ for $p_{\mathrm{Tar}}$. The above assumption allows us to formally relate $p_{k}$ and $p_{\mathrm{Tar}}$. It also allows us to estimate $f$ when sufficient identification conditions required by the theory of nonlinear ICA are met. Due to space limitations, we provide a brief review of the nonlinear ICA method used in this paper and the known theoretical conditions in Supplementary Material A. Having multiple source domains is assumed here for the identifiability of $f$: it comes from the currently known identification condition of nonlinear ICA Hyvärinen et al. (2019). Note that complex changes in $q$ are allowed; hence the assumption of an invariant $f$ can accommodate intricate shifts in the apparent distribution $p$. We discuss this further in Section 5.3 with a simple example.
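To make the assumption concrete, here is a minimal numpy sketch of the generative model: a single invertible mixing shared across domains, with domain-specific independent component distributions. The particular triangular map `f` and the Laplace/Gaussian component distributions are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(S):
    """A fixed, smooth, invertible mixing (hypothetical stand-in for the
    shared mechanism): a triangular map on R^2."""
    s1, s2 = S[:, 0], S[:, 1]
    z1 = s1
    z2 = s2 + np.tanh(s1)  # invertible given z1
    return np.stack([z1, z2], axis=1)

def f_inv(Z):
    z1, z2 = Z[:, 0], Z[:, 1]
    s1 = z1
    s2 = z2 - np.tanh(z1)
    return np.stack([s1, s2], axis=1)

# Domain-specific IC distributions q_k: independent coordinates whose shapes
# differ across domains, while f stays fixed.
def sample_domain(n, scale):
    S = np.stack([rng.laplace(size=n), scale * rng.normal(size=n)], axis=1)
    return f(S)  # observed data Z = f(S); only q differs across domains

Z_src = sample_domain(1000, scale=1.0)  # a source domain
Z_tar = sample_domain(10, scale=3.0)    # target: same f, different q

# Sanity check: f is invertible, so the ICs can be recovered exactly.
S_rec = f_inv(Z_tar)
print(np.allclose(f(S_rec), Z_tar))  # True
```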

Example: Structural equation models

A salient example of generative models expressed as Eq. (1) is structural equation models (SEMs; Pearl, 2009; Peters et al., 2017), which are used to describe the data generating mechanism involving the causality of random variables (Pearl, 2009). More precisely, the generative model of Eq. (1) corresponds to the reduced form Reiss and Wolak (2007) of a Markovian SEM Pearl (2009), i.e., a form where the structural equations determining $Z$ from $(Z, S)$ are solved so that $Z$ is expressed as a function of $S$. Such a conversion is always possible because a Markovian SEM induces an acyclic causal graph Pearl (2009); hence the structural equations can be solved by elimination of variables. This interpretation of reduced-form SEMs as Eq. (1) has been exploited in methods of causal discovery, e.g., in the linear non-Gaussian additive-noise models and their successors (Kano and Shimizu, 2003; Shimizu et al., 2006; Monti et al., 2019). In the case of SEMs, the key assumption of this paper translates into the invariance of the structural equations across domains, which enables an intuitive assessment of the assumption based on prior knowledge. For instance, if all domains have the same causal mechanism and are in the same intervention state (including an intervention-free case), the modeling choice is deemed plausible. Note that the proposed method (Section 3) does not estimate the original structural equations; it only requires estimating the reduced form, which is an easier problem than causal discovery.
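The reduction from structural equations to Eq. (1) can be sketched in a few lines. The two structural equations below are hypothetical; the point is only that acyclicity lets us substitute one equation into the next to express $Z = (X, Y)$ as a function of the independent disturbances $S$ alone:

```python
import numpy as np

# A hypothetical two-variable Markovian SEM with causal graph X -> Y:
#   X = h_X(S_1)       (structural equation for X)
#   Y = h_Y(X, S_2)    (structural equation for Y)
# Acyclicity lets us eliminate X by substitution, yielding the reduced form
# Z = (X, Y) = f(S_1, S_2).

def h_X(s1):
    return np.exp(s1)        # hypothetical structural equation

def h_Y(x, s2):
    return np.sin(x) + s2    # hypothetical structural equation

def f_reduced(s1, s2):
    """Reduced form: Z expressed as a function of S alone."""
    x = h_X(s1)
    return x, h_Y(x, s2)

s1, s2 = 0.3, -0.5
x, y = f_reduced(s1, s2)
assert x == h_X(s1) and y == h_Y(h_X(s1), s2)
```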

3 Proposed Method: Mechanism Transfer

In this section, we detail the proposed method, mechanism transfer (Algorithm 1). The method first estimates the common generative mechanism $f$ from the source domain data and then uses it to perform data augmentation of the target domain data to transfer the knowledge (Figure 3).

Algorithm 1 Proposed method: mechanism transfer
Input: Source domain data sets $\{\mathcal{D}_{k}\}_{k\in[K]}$, target domain data set $\mathcal{D}_{\mathrm{Tar}}$, nonlinear ICA algorithm $\operatorname{ICA}$, and a learning algorithm $\mathcal{A}_{\mathcal{G}}$ to fit the hypothesis class $\mathcal{G}$ of predictors.
  // Step 1. Estimate the shared transformation.
  $\hat{f} \leftarrow \operatorname{ICA}(\mathcal{D}_{1}, \ldots, \mathcal{D}_{K})$
  // Step 2. Extract and shuffle target independent components.
  $\hat{s}_{i} \leftarrow \hat{f}^{-1}(Z_{i}) \quad (i = 1, \ldots, n_{\mathrm{Tar}})$
  $\{\bar{s}_{\bm{i}}\}_{\bm{i} \in [n_{\mathrm{Tar}}]^{D}} \leftarrow \operatorname{AllCombinations}(\{\hat{s}_{i}\}_{i=1}^{n_{\mathrm{Tar}}})$
  // Step 3. Synthesize target data and fit the predictor.
  $\bar{z}_{\bm{i}} \leftarrow \hat{f}(\bar{s}_{\bm{i}})$
  $\check{g} \leftarrow \mathcal{A}_{\mathcal{G}}(\{\bar{z}_{\bm{i}}\}_{\bm{i}})$
Output: $\check{g}$: the predictor in the target domain.

(a) Labeled target data $\overset{\hat{f}^{-1}}{\to}$ (b) Find IC $\to$ (c) Shuffle $\overset{\hat{f}}{\to}$ (d) Pseudo target data

Figure 3: Schematic illustration of the proposed few-shot domain adaptation method after estimating the common mechanism $f$. With the estimated $\hat{f}$, the method augments the small target domain sample in a few steps to enhance statistical efficiency: 3(a) The algorithm is given labeled target domain data. 3(b) From the labeled target domain data, extract the ICs. 3(c) By shuffling the IC values, synthesize likely values of the ICs. 3(d) From the synthesized ICs, generate pseudo target data. The generated data are used to fit a predictor for the target domain.

3.1 Step 1: Estimate $f$ using the source domain data

The first step estimates the common transformation $f$ by nonlinear ICA, namely via generalized contrastive learning (GCL; Hyvärinen et al., 2019). GCL uses auxiliary information for training a certain binary classification function $r_{\hat{f},\bm{\psi}}$ equipped with a parametrized feature extractor $\hat{f} : \mathbb{R}^{D} \to \mathbb{R}^{D}$. The trained feature extractor $\hat{f}$ is used as an estimator of $f$. The auxiliary information we use in our problem setup is the domain indices $[K]$. The classification function to be trained in GCL is $r_{\hat{f},\bm{\psi}}(z, u) := \sum_{d=1}^{D} \psi_{d}(\hat{f}^{-1}(z)_{d}, u)$, consisting of $(\hat{f}, \{\psi_{d}\}_{d=1}^{D})$, and the classification task of GCL is logistic regression to classify $(Z^{\mathrm{Src}}_{k}, k)$ as positive and $(Z^{\mathrm{Src}}_{k}, k')$ ($k' \neq k$) as negative. This yields the following domain-contrastive learning criterion to estimate $f$:

$$\mathop{\mathrm{argmin}}_{\hat{f} \in \mathcal{F},\ \{\psi_{d}\}_{d=1}^{D} \subset \Psi}\ \sum_{k=1}^{K} \frac{1}{n_{k}} \sum_{i=1}^{n_{k}} \Bigl( \phi\bigl(r_{\hat{f},\bm{\psi}}(Z^{\mathrm{Src}}_{k,i}, k)\bigr) + \mathbb{E}_{k' \neq k}\, \phi\bigl(-r_{\hat{f},\bm{\psi}}(Z^{\mathrm{Src}}_{k,i}, k')\bigr) \Bigr),$$

where $\mathcal{F}$ and $\Psi$ are sets of parametrized functions, $\mathbb{E}_{k' \neq k}$ denotes the expectation with respect to $k' \sim \mathrm{U}([K] \setminus \{k\})$ ($\mathrm{U}$ denotes the uniform distribution), and $\phi$ is the logistic loss $\phi(m) := \log(1 + \exp(-m))$. We use the solution $\hat{f}$ as an estimator of $f$. In experiments, $\mathcal{F}$ is implemented by invertible neural networks Kingma and Dhariwal (2018), $\Psi$ by multi-layer perceptrons, and $\mathbb{E}_{k' \neq k}$ is replaced by random sampling renewed for every mini-batch.
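For concreteness, the following numpy sketch evaluates the GCL criterion on toy data. The affine feature extractor and the linear $\psi_d$ are deliberately simplistic, hypothetical stand-ins for the invertible network and MLPs used in the experiments; only the structure of the loss follows the criterion above:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
data = {k: rng.normal(loc=k, size=(50, D)) for k in range(K)}  # toy source domains

# Hypothetical parametrized pieces:
a, b = np.ones(D), np.zeros(D)       # f_hat^{-1}(z) = a * z + b, elementwise affine
W = rng.normal(size=(D, K)) * 0.1    # psi_d(s, u) = W[d, u] * s, a linear toy psi

def f_hat_inv(z):
    return a * z + b

def r(z, u):
    """Classification function r(z, u) = sum_d psi_d(f_hat^{-1}(z)_d, u)."""
    s = f_hat_inv(z)
    return sum(W[d, u] * s[d] for d in range(D))

def phi(m):
    return np.log1p(np.exp(-m))  # logistic loss

def gcl_loss():
    total = 0.0
    for k in range(K):
        others = [kp for kp in range(K) if kp != k]
        per_point = []
        for z in data[k]:
            pos = phi(r(z, k))                                # (z, k) as positive
            neg = np.mean([phi(-r(z, kp)) for kp in others])  # E over k' != k
            per_point.append(pos + neg)
        total += np.mean(per_point)
    return total

loss = gcl_loss()
print(np.isfinite(loss) and loss > 0)
```

In practice this objective is minimized jointly over the parameters of $\hat{f}$ and $\{\psi_d\}$ by stochastic gradient descent; the sketch only evaluates it at a fixed parameter setting.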

3.2 Step 2: Extract and inflate the target ICs using $\hat{f}$

The second step extracts and inflates the target domain ICs using the estimated $\hat{f}$. We first extract the ICs of the target domain data by applying the inverse of $\hat{f}$ as

$$\hat{s}_{i} = \hat{f}^{-1}(Z_{i}).$$

After the extraction, we inflate the set of IC values by taking all dimension-wise combinations of the estimated ICs:

$$\bar{s}_{\bm{i}} = (\hat{s}_{i_{1}}^{(1)}, \ldots, \hat{s}_{i_{D}}^{(D)}), \quad \bm{i} = (i_{1}, \ldots, i_{D}) \in [n_{\mathrm{Tar}}]^{D},$$

to obtain new plausible IC values $\bar{s}_{\bm{i}}$. The intuitive motivation of this procedure stems from the independence of the IC distributions. Theoretical justifications are provided in Section 4. In our implementation, we use invertible neural networks Kingma and Dhariwal (2018) to model the function $\hat{f}$, enabling the computation of the inverse $\hat{f}^{-1}$.
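The inflation step ($\operatorname{AllCombinations}$ in Algorithm 1) is a simple index product: coordinate $d$ of the new point is taken from data point $i_d$. A minimal numpy sketch with hypothetical IC values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_tar, D = 4, 2
s_hat = rng.normal(size=(n_tar, D))  # estimated target ICs, one row per data point

# All dimension-wise combinations: for each index tuple i = (i_1, ..., i_D),
# take coordinate d from data point i_d.
s_bar = np.array([
    [s_hat[idx[d], d] for d in range(D)]
    for idx in product(range(n_tar), repeat=D)
])

print(s_bar.shape)  # (n_tar ** D, D) = (16, 2)
```

Note that the original points are recovered on the "diagonal" tuples $i_1 = \cdots = i_D$, so the inflated set contains the original sample.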

3.3 Step 3: Synthesize target data from the inflated ICs

The third step estimates the target risk $R$ by the empirical distribution of the augmented data:

$\check{R}(g) := \frac{1}{n_{\mathrm{Tar}}^{D}} \sum_{\bm{i} \in [n_{\mathrm{Tar}}]^{D}} \ell(g, \hat{f}(\bar{s}_{\bm{i}})),$ (2)

and performs empirical risk minimization. In experiments, we use a regularization term $\Omega(\cdot)$ to control the complexity of $\mathcal{G}$ and select

$$\check{g} \in \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left\{ \check{R}(g) + \Omega(g) \right\}.$$

The selected hypothesis $\check{g}$ is then used to make predictions in the target domain. In our experiments, we use $\Omega(g) = \lambda \|g\|^{2}$, where $\lambda > 0$ and the norm is that of the reproducing kernel Hilbert space (RKHS) from which we take the subset $\mathcal{G}$. Note that one may subsample only a subset of the combinations in Eq. (2) to mitigate the computational cost, similarly to Clémençon et al. (2016) and Papa et al. (2015).
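Putting Step 3 together, the following sketch minimizes the augmented, regularized risk with a linear predictor under the squared loss. All modeling choices here, including the identity stand-in for $\hat{f}$ and the plain ridge penalty in place of the RKHS norm, are hypothetical simplifications:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_tar, D = 5, 2

def f_hat(s):
    return s  # hypothetical placeholder for the learned invertible network

# Inflated ICs (Step 2) pushed back through f_hat to pseudo target data.
s_hat = rng.normal(size=(n_tar, D))
s_bar = np.array([[s_hat[i[d], d] for d in range(D)]
                  for i in product(range(n_tar), repeat=D)])
Z_bar = f_hat(s_bar)                # pseudo target data (X, Y)
X, Y = Z_bar[:, :-1], Z_bar[:, -1]

# Regularized ERM on the augmented sample: min_g R_check(g) + lambda * ||g||^2,
# solved in closed form for the linear/squared-loss case.
lam = 1e-2
g_check = np.linalg.solve(X.T @ X / len(X) + lam * np.eye(D - 1),
                          X.T @ Y / len(X))

def risk_check(g):
    return np.mean((X @ g - Y) ** 2)

print(risk_check(g_check) <= risk_check(np.zeros(D - 1)))
```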

4 Theoretical Insights

In this section, we state two theorems to investigate the statistical properties of the method proposed in Section 3, providing justification beyond the intuition that we take advantage of the independence of the IC distributions.

4.1 Minimum variance property: Idealized case

The first theorem provides an insight into the statistical advantage of the proposed method: in the ideal case, the method attains the minimum variance among all possible unbiased risk estimators.

Theorem 1 (Minimum variance property of $\check{R}$).

Assume that $\hat{f} = f$. Then, for each $g \in \mathcal{G}$, the proposed risk estimator $\check{R}(g)$ is the uniformly minimum variance unbiased estimator of $R(g)$, i.e., for any unbiased estimator $\tilde{R}(g)$ of $R(g)$,

$$\forall q \in \mathcal{Q}, \quad \mathrm{Var}(\check{R}(g)) \leq \mathrm{Var}(\tilde{R}(g)),$$

and $\mathbb{E}_{p_{\mathrm{Tar}}} \check{R}(g) = R(g)$ holds.

The proof of Theorem 1 is immediate once we rewrite $R(g)$ as a $D$-variate regular statistical functional and $\check{R}(g)$ as its corresponding generalized U-statistic (Lee, 1990). Details can be found in Supplementary Material D. Theorem 1 implies that the proposed risk estimator can have superior statistical efficiency in terms of variance over the ordinary empirical risk.
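To sketch this rewriting (under the assumption $\hat{f} = f$ of Theorem 1): since $q_{\mathrm{Tar}}$ is a product distribution, the target risk is a regular statistical functional of degree $D$,
\[
R(g) = \mathbb{E}\bigl[h(S_1, \ldots, S_D)\bigr], \qquad h(s_1, \ldots, s_D) := \ell\bigl(g, f(s_1^{(1)}, \ldots, s_D^{(D)})\bigr), \qquad S_1, \ldots, S_D \overset{\text{i.i.d.}}{\sim} q_{\mathrm{Tar}},
\]
where the kernel $h$ reads off its $d$-th coordinate from its $d$-th argument. The estimator $\check{R}(g)$ of Eq. (2) averages this kernel over index combinations of the sample $\{\hat{s}_i\}$, which is the generalized U-statistic associated with $h$, so the minimum variance property is inherited from the theory in Lee (1990).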

4.2 Excess risk bound: More realistic case

In real situations, one has to estimate $f$. The following theorem characterizes the statistical gain and loss arising from the estimation error $f - \hat{f}$. The intuition is that the increased number of points suppresses the possibility of overfitting because the hypothesis has to fit the majority of the inflated data, but the estimator $\hat{f}$ has to be accurate for fitting the inflated data to be meaningful. Note that the theorem is agnostic to how $\hat{f}$ is obtained; hence it applies to a more general problem setup as long as $f$ can be estimated.

Theorem 2 (Excess risk bound).

Let $\check{g}$ be a minimizer of Eq. (2). Under appropriate assumptions (see Theorem 3 in Supplementary Material), for arbitrary $\delta, \delta' \in (0, 1)$, we have, with probability at least $1 - (\delta + \delta')$,

$$R(\check{g}) - R(g^{*}) \leq \underbrace{C \sum_{j=1}^{D} \left\| f_{j} - \hat{f}_{j} \right\|_{W^{1,1}}}_{\text{Approximation error}} + \underbrace{4D \mathfrak{R}(\mathcal{G}) + 2D B_{\ell} \sqrt{\frac{\log 2/\delta}{2n}}}_{\text{Estimation error}} + \underbrace{\kappa_{1}(\delta', n) + D B_{\ell} B_{q} \kappa_{2}(f - \hat{f})}_{\text{Higher order terms}}.$$

Here, $\left\|\cdot\right\|_{W^{1,1}}$ is the $(1,1)$-Sobolev norm, and we define the effective Rademacher complexity $\mathfrak{R}(\mathcal{G})$ by

$\mathfrak{R}(\mathcal{G}) := \frac{1}{n} \mathbb{E}_{\hat{S}} \mathbb{E}_{\sigma} \left[ \sup_{g \in \mathcal{G}} \left| \sum_{i=1}^{n} \sigma_{i} \mathbb{E}_{S'_{2}, \ldots, S'_{D}} [\tilde{\ell}(\hat{s}_{i}, S'_{2}, \ldots, S'_{D})] \right| \right],$ (3)

where $\{\sigma_{i}\}_{i=1}^{n}$ are independent sign variables, $\mathbb{E}_{\hat{S}}$ is the expectation with respect to $\{\hat{s}_{i}\}_{i=1}^{n_{\mathrm{Tar}}}$, the dummy variables $S'_{2}, \ldots, S'_{D}$ are i.i.d. copies of $\hat{s}_{1}$, and $\tilde{\ell}$ is defined by using the degree-$D$ symmetric group $\mathfrak{S}_{D}$ as

$$\tilde{\ell}(s_{1}, \ldots, s_{D}) := \frac{1}{D!} \sum_{\pi \in \mathfrak{S}_{D}} \ell(g, \hat{f}(s_{\pi(1)}^{(1)}, \ldots, s_{\pi(D)}^{(D)})),$$

and $\kappa_{1}(\delta', n)$ and $\kappa_{2}(f - \hat{f})$ are higher order terms. The constants $B_{q}$ and $B_{\ell}$ depend only on $q$ and $\ell$, respectively, while $C$ depends only on $f, q, \ell$, and $D$.

Details of the statement and the proof can be found in Supplementary Material C. The Sobolev norm Adams and Fournier (2003) emerges from evaluating the difference between the estimated IC distribution and the ground-truth IC distribution. In Theorem 2, the utility of the proposed method appears in the effective complexity measure: the complexity is defined by a set of functions that are marginalized over all but one argument, resulting in a dependence on the input dimensionality that is mitigated from exponential to linear (Supplementary Material C, Remark 3).

5 Related Work and Discussion

In this section, we review some existing TAs for DA to clarify the relative position of the paper. We also clarify the relation to the literature of causality-related transfer learning.

5.1 Existing transfer assumptions

Here, we review some of the existing work and TAs. See Table 1 for a summary.

(1) Parametric assumptions.

Some TAs assume parametric distribution families, e.g., Gaussian mixture model in covariate shift Storkey and Sugiyama (2007). Some others assume parametric distribution shift, i.e., parametric representations of the target distribution given the source distributions. Examples include location-scale transform of class conditionals Zhang et al. (2013); Gong et al. (2016), linearly dependent class conditionals Zhang et al. (2015), and low-dimensional representation of the class conditionals after kernel embedding Stojanov et al. (2019). In some applications, e.g., remote sensing, some parametric assumptions have proven useful Zhang et al. (2013).

(2) Invariant conditionals and marginals.

Some methods assume invariance of certain conditionals or marginals Qui (2009), e.g., $p(Y|X)$ in the covariate shift scenario Shimodaira (2000), $p(Y|\mathcal{T}(X))$ for an appropriate feature transformation $\mathcal{T}$ in transfer component analysis Pan et al. (2011), $p(Y|\mathcal{T}(X))$ for a feature selector $\mathcal{T}$ Rojas-Carulla et al. (2018); Magliacane et al. (2018), $p(X|Y)$ in the target shift (TarS) scenario Zhang et al. (2013); Nguyen et al. (2016), and a few components of regular-vine copulas and marginals in Lopez-paz et al. (2012). For example, the covariate shift scenario has been shown to fit well to brain computer interface data Sugiyama et al. (2007).

(3) Small discrepancy or integral probability metric.

Another line of work relies on certain distributional similarities, e.g., integral probability metrics Courty et al. (2017) or hypothesis-class dependent discrepancies Ben-David et al. (2007); Blitzer et al. (2008); Ben-David et al. (2010); Kuroki et al. (2019); Zhang et al. (2019); Cortes et al. (2019). These methods assume the existence of an ideal joint hypothesis Ben-David et al. (2010), corresponding to a relaxation of the covariate shift assumption. These TAs are suited for unsupervised or semi-supervised DA in computer vision applications Courty et al. (2017).

(4) Transferable parameter.

Some others consider parameter transfer Kumagai (2016), where the TA is the existence of a parameterized feature extractor that performs well in the target domain for linear-in-parameter hypotheses and its learnability from the source domain data. For example, such a TA has been known to be useful in natural language processing or image recognition Lee et al. (2009); Kumagai (2016).

Table 1: Comparison of TAs for DA (Parametric: parametric distribution family or distribution shift, Invariant dist.: invariant distribution components such as conditionals, marginals, or copulas. Disc. / IPM: small discrepancy or integral probability metric, Param-transfer: existence of transferable parameter, Mechanism: invariant mechanism). AD: adaptation among Apparently Different distributions is accommodated. NP: Non-Parametrically flexible. BCI: Brain computer interface. The numbers indicate the paragraphs of Section 5.1.
TA AD NP Suited app. example
(1) Parametric \checkmark - Remote sensing
(2) Invariant dist. - \checkmark BCI
(3) Disc. / IPM - \checkmark Computer vision
(4) Param-transfer \checkmark \checkmark Computer vision
(Ours) Mechanism \checkmark \checkmark Medical records

5.2 Causality for transfer learning

Our method can be seen as the first attempt to fully leverage structural causal models for DA. Most causality-inspired DA methods express their assumptions at the level of graphical causal models (GCMs), which carry much coarser information than the structural causal models (SCMs) (Peters et al., 2017, Table 1.1) exploited in this paper. Compared to previous work, our method takes one step further by assuming and exploiting the invariance of SCMs. Specifically, many studies assume the GCM $X \leftarrow Y$ (the anticausal scenario) following the seminal meta-analysis of Schölkopf et al. (2012) and use it to motivate their parametric distribution shift assumptions or the parameter estimation procedure Zhang et al. (2013, 2015); Gong et al. (2016, 2018). Although such assumptions on the GCM have the virtue of being more robust to misspecification, they tend to require parametric assumptions to obtain theoretical justifications. Our assumption, on the other hand, enjoys a theoretical guarantee without relying on parametric assumptions.

One notable work in the existing literature is Magliacane et al. (2018), which considered domain adaptation among different intervention states, a problem setup complementing ours, which considers an intervention-free (or identical-intervention) case. To model intervention states, Magliacane et al. (2018) also formulated the problem setup using SCMs, similarly to the present paper. Therefore, we clarify a few key differences between Magliacane et al. (2018) and our work here. In terms of methodology, Magliacane et al. (2018) take a variable selection approach to select a set of predictor variables with an invariant conditional distribution across different intervention states. On the other hand, our method estimates the SEMs (in the reduced form) and applies a data augmentation procedure to transfer the knowledge. To the best of our knowledge, the present paper is the first to propose a way to directly use the estimated SEMs for domain adaptation, and this fine-grained use of the estimated SEMs enables us to derive an excess risk bound. In terms of plausible applications, their problem setup may be more suitable for fields with interventional experiments such as genomics, whereas ours may be better suited for fields where observational studies are more common, such as health record analysis or economics. In Appendix E, we provide a more detailed comparison.

5.3 Plausibility of the assumptions

Checking the validity of the assumption.

As is often the case in DA, the scarcity of data precludes data-driven testing of the TAs, and we need domain knowledge to judge their validity. For our TA, the intuitive interpretation as the invariance of causal models (Section 2) can be used.

Invariant causal mechanisms.

The invariance of causal mechanisms has been exploited in recent work on causal discovery such as Xu et al. (2014) and Monti et al. (2019), and under the name of the multi-environment setting in Ghassami et al. (2017). Moreover, SEMs are normally assumed to remain invariant unless explicitly intervened on Hünermund and Bareinboim (2019). However, the invariance assumption presumes that the intervention states do not vary across domains (allowing for the intervention-free case), which can be limiting for applications where different interventions are likely to be present, e.g., different treatment policies put in place in different hospitals. Nevertheless, the present work can already be of practical interest when combined with an effort to find suitable data or situations. For instance, one may find medical records in group hospitals where the same treatment criteria are in place, or local surveys in the same district enforcing identical regulations. Relaxing this requirement to facilitate the data-gathering process is an important direction for future work. For such future extensions, the present theoretical analyses can also serve as a landmark establishing what can be guaranteed in the basic case without mechanism alterations.

Fully observed variables.

As the first algorithm in the approach of fully exploiting SCMs for DA, we consider the case where all variables are observable. Although it is often assumed in causal inference problems that there are unobserved confounding variables, we leave the extension to such a case for future work.

Required number of source domains.

A potential drawback of the proposed method is that it requires a certain number of source domains in order to satisfy the identification condition of the nonlinear ICA method, namely GCL in this paper (Supplementary Material A). The requirement arises solely from the identification condition of the ICA method and may therefore be made less stringent by future developments in nonlinear ICA. Moreover, if one can accept other identification conditions, single-sample ICA methods (e.g., linear ICA) can also be used in the proposed approach in a straightforward manner, and our theoretical analyses hold regardless of the method chosen.

Flexibility of the model.

The relation between X and Y can drastically change while f is invariant. For example, even in a simple additive noise model (X, Y) = f(S_1, S_2) = (S_1, S_1 + S_2), the conditional p(Y|X) can shift drastically if the distribution of the independent noise S_2 changes in a complex manner, e.g., from unimodal to multimodal.
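This flexibility can be illustrated with a small numpy simulation (a hedged sketch; the sample size and noise parameters are arbitrary illustrations, not taken from the paper): the mechanism f stays fixed while the noise distribution of S_2 changes, and the conditional p(Y|X) shifts from unimodal to bimodal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
s1 = rng.normal(size=n)

def f(s1, s2):
    # The fixed mechanism shared by both domains: X = S1, Y = S1 + S2.
    return s1, s1 + s2

# Source domain: unimodal independent noise S2.
x_src, y_src = f(s1, rng.normal(scale=0.5, size=n))
# Target domain: the same f, but S2 becomes a bimodal mixture.
s2_tar = rng.normal(loc=rng.choice([-2.0, 2.0], size=n), scale=0.3)
x_tar, y_tar = f(s1, s2_tar)

# Near X = 0 the conditional of Y is essentially the S2 distribution, so
# its variance reveals the drastic shift of p(Y|X) despite the invariant f.
print("source Var[Y | X near 0]:", np.var(y_src[np.abs(x_src) < 0.05]))
print("target Var[Y | X near 0]:", np.var(y_tar[np.abs(x_tar) < 0.05]))
```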

6 Experiment

In this section, we provide proof-of-concept experiments to demonstrate the effectiveness of the proposed approach. Note that the primary purpose of the experiments is to confirm whether the proposed method can properly perform DA in real-world data, and it is not to determine which DA method and TA are the most suited for the specific dataset.

6.1 Implementation details of the proposed method

Estimation of ff (Step 1).

We model \hat{f} by an 8-layer Glow neural network (Supplementary Material B.2). We model \psi_d by a 1-hidden-layer neural network with a varied number of hidden units, K output units, and the rectified linear unit activation LeCun et al. (2015). We use its k-th output (k \in [K]) as the value for \psi_d(\cdot, k). For training, we use the Adam optimizer Kingma and Ba (2017) with fixed parameters (\beta_1, \beta_2, \epsilon) = (0.9, 0.999, 10^{-8}), a fixed initial learning rate of 10^{-3}, and a maximum of 300 epochs. The other fixed hyperparameters of \hat{f} and its training process are described in Supplementary Material B.

Augmentation of target data (Step 3).

For each evaluation step, we take all combinations (with replacement) of the estimated ICs to synthesize target domain data. After synthesizing the data, we filter them by applying a novelty detection technique with respect to the union of source domain data: we use a one-class support vector machine Schölkopf et al. (2000) with the fixed parameter \nu = 0.1 and the radial basis function (RBF) kernel k(x,y) = \exp(-\|x-y\|^2/\gamma) with \gamma = D. This is because the estimated transform \hat{f} is not expected to be well trained outside the union of the supports of the source distributions. After the filtration, we combine the original target training data with the augmented data so that the original data points are always included.
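The augmentation-and-filtering step can be sketched as follows (a minimal sketch assuming scikit-learn; fhat, S_target, and Z_source are hypothetical stand-ins for the trained invertible map, the estimated target ICs, and the pooled source data). Note that scikit-learn parametrizes the RBF kernel as exp(-gamma * ||x - y||^2), so the paper's \gamma = D corresponds to a gamma parameter of 1/D.

```python
import numpy as np
from itertools import product
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
n, D = 6, 3                                  # 6 target points, 3 dimensions

S_target = rng.normal(size=(n, D))           # stand-in: estimated target ICs
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])              # stand-in invertible mixing
fhat = lambda s: s @ A

# Step 3a: all combinations (with replacement) of the estimated ICs, i.e.
# every way of picking one IC value per dimension across the n data points.
idx = np.array(list(product(range(n), repeat=D)))        # n**D index tuples
S_aug = np.stack([S_target[idx[:, d], d] for d in range(D)], axis=1)
Z_aug = fhat(S_aug)

# Step 3b: novelty-detection filter against the union of source data.
Z_source = fhat(rng.normal(size=(500, D)))   # stand-in pooled source data
ocsvm = OneClassSVM(nu=0.1, kernel="rbf", gamma=1.0 / D).fit(Z_source)
Z_kept = Z_aug[ocsvm.predict(Z_aug) == 1]    # +1 marks inliers

# Always keep the original target points alongside the filtered augmentation.
Z_final = np.vstack([fhat(S_target), Z_kept])
print(Z_aug.shape, Z_kept.shape, Z_final.shape)
```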

Predictor hypothesis class 𝒢\mathcal{G}.

As the predictor model, we use kernel ridge regression (KRR) with the RBF kernel. The bandwidth \gamma is chosen by the median heuristic similarly to Yamada et al. (2011) for simplicity. Note that this choice of predictor model is for comparison with the other methods tailored for KRR Cortes et al. (2019); an arbitrary predictor hypothesis class and learning algorithm can easily be combined with the proposed approach.
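For concreteness, a minimal sketch of KRR with a median-heuristic bandwidth (assuming scikit-learn and SciPy; conventions for the median heuristic vary, e.g., median of distances versus squared distances, and scikit-learn's gamma parameter is the inverse of the bandwidth \gamma used here). The toy data are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                      # toy training inputs
y = X @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=40)

# Median heuristic: set the bandwidth to the median pairwise squared distance.
gamma = np.median(pdist(X, "sqeuclidean"))

# scikit-learn's RBF kernel is exp(-gamma_param * ||x - y||^2).
krr = KernelRidge(alpha=1e-2, kernel="rbf", gamma=1.0 / gamma).fit(X, y)
print("train MSE:", np.mean((krr.predict(X) - y) ** 2))
```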

Hyperparameter selection.

We perform a grid search for hyperparameter selection. The number of hidden units for \psi_d is chosen from \{10, 20\} and the weight-decay coefficient from 10^{\{-2,-1\}}. The \ell^2 regularization coefficient \lambda of KRR is chosen from \lambda \in 2^{\{-10,\ldots,10\}} following Cortes et al. (2019). To perform hyperparameter selection as well as early stopping, we record the leave-one-out cross-validation (LOOCV) mean squared error on the target training data every 20 epochs and select its minimizer. The leave-one-out score is computed using the well-known analytic formula instead of training the predictor for each split. Note that we only use the original target domain data as the held-out set, not the synthesized data. In practice, if the target domain data are extremely few, one may well use percentile-cv Ng (1997) to mitigate overfitting in hyperparameter selection.
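The analytic LOOCV formula mentioned above is the standard shortcut for KRR: with hat matrix H = K(K + \lambda I)^{-1}, the leave-one-out residual is (y_i - \hat{y}_i)/(1 - H_{ii}), so no refitting is needed. A small numpy check on toy data (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 0.1
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# RBF Gram matrix with unit bandwidth.
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Analytic LOO residuals via the hat matrix H = K (K + lam I)^{-1}.
H = K @ np.linalg.inv(K + lam * np.eye(n))
loo_analytic = (y - H @ y) / (1.0 - np.diag(H))

# Brute-force check: retrain KRR n times, each time holding out one point.
loo_brute = np.empty(n)
for i in range(n):
    m = np.ones(n, dtype=bool)
    m[i] = False
    alpha = np.linalg.solve(K[np.ix_(m, m)] + lam * np.eye(n - 1), y[m])
    loo_brute[i] = y[i] - K[i, m] @ alpha

print(np.allclose(loo_analytic, loo_brute))  # expected: True
```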

Computation environment.

All experiments were conducted on an Intel Xeon(R) 2.60 GHz CPU with 132 GB memory. They were implemented in Python using the PyTorch library Paszke et al. (2019) or the R language R Core Team (2018).

6.2 Experiment using real-world data

Dataset.

We use the gasoline consumption data (Greene, 2012, p.284, Example 9.5), which is panel data of gasoline usage in 18 OECD countries over 19 years. We consider each country as a domain; in this proof-of-concept experiment, we disregard the time-series structure and treat the data as i.i.d. samples for each country. The dataset contains four variables, all log-transformed: motor gasoline consumption per car (the predicted variable), per-capita income, motor gasoline price, and the stock of cars per capita (the predictor variables) Baltagi and Griffin (1983). For further details of the data, see Supplementary Material B. We used this dataset because there are very few public datasets for regression DA tasks Cortes and Mohri (2014), especially for multi-source DA, and because the dataset has been used in econometric analyses involving SEMs Baltagi (2005), conforming to our approach.

Compared methods.

We compare the following DA methods, all of which apply to regression problems. Unless explicitly specified, the predictor class \mathcal{G} is chosen to be KRR with the same hyperparameter candidates as the proposed method (Section 6.1). Further details are described in Supplementary Material B.5.

  • Naive baselines (SrcOnly, TarOnly, and S&TV): SrcOnly (resp. TarOnly) trains a predictor on the source domain data (resp. target training data) without any adaptation mechanism. SrcOnly can be effective if the source domains and the target domain have highly similar distributions. The S&TV baseline trains on both source and target domain data, but the LOOCV score is computed only from the target domain data.

  • TrAdaBoost: Two-stage TrAdaBoost.R2, a boosting method tailored for few-shot regression transfer Pardoe and Stone (2010). It is an iterative method with early stopping Pardoe and Stone (2010), for which we use the leave-one-out cross-validation score on the target domain data as the criterion. As suggested in Pardoe and Stone (2010), we set the maximum number of outer-loop iterations to 30. The base predictor is a decision tree regressor with maximum depth 6 Hastie et al. (2009). Note that although TrAdaBoost does not come with an explicitly stated transfer assumption, we compare its performance for reference.

  • IW: Importance-weighted KRR using RuLSIF Yamada et al. (2011). The method directly estimates a relative joint density-ratio function \frac{p_{\mathrm{Src}}(z)}{\alpha p_{\mathrm{Src}}(z)+(1-\alpha)p_{\mathrm{Tar}}(z)} for \alpha \in [0,1), where p_{\mathrm{Src}} is a hypothetical source distribution created by pooling all source domain data. Following Yamada et al. (2011), we experiment with \alpha \in \{0, 0.5, 0.95\} and report the results separately. The regularization coefficient \lambda' is selected from \lambda' \in 2^{\{-10,\ldots,10\}} using importance-weighted cross-validation Sugiyama et al. (2007).

  • GDM: Generalized discrepancy minimization Cortes et al. (2019). This method performs instance-weighted training on the source domain data with weights that minimize the generalized discrepancy (via quadratic programming). We select the hyperparameter \lambda_r from 2^{\{-10,\ldots,10\}} as suggested by Cortes et al. (2019). The selection criterion is the performance of the trained predictor on the target training labels, as the method trains on the source domain data and the target unlabeled data.

  • Copula: Non-parametric regular-vine copula method Lopez-paz et al. (2012). This method presumes using a specific joint density estimator called regular-vine (R-vine) copulas. Adaptation is realized in two steps: the first step estimates which components of the constructed R-vine model are different by performing two-sample tests based on maximum mean discrepancy Lopez-paz et al. (2012), and the second step re-estimates the components in which a change is detected using only the target domain data.

  • LOO (reference score): A leave-one-out cross-validated error estimate, calculated for reference. It is the average error of predicting a single held-out test point with a predictor trained on all of the remaining target domain data, including the points that serve as the test set for the other algorithms.
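To make the weighted-training baselines concrete, instance-weighted KRR with given density-ratio weights solves min_g \sum_i w_i (y_i - g(x_i))^2 + \lambda \|g\|^2, whose kernelized solution is \alpha = (WK + \lambda I)^{-1} W y. A minimal numpy sketch (the weights w are simulated here for illustration; the IW baseline estimates them with RuLSIF):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
X = rng.normal(size=(n, 1))
y = np.cos(X[:, 0]) + 0.1 * rng.normal(size=n)

# Simulated importance weights standing in for an estimated density ratio.
w = np.exp(-X[:, 0] ** 2)

# Weighted KRR: minimize sum_i w_i * (y_i - g(x_i))^2 + lam * ||g||^2,
# which in kernel form gives alpha = (W K + lam I)^{-1} W y.
K = np.exp(-(X - X.T) ** 2)                # unit-bandwidth RBF Gram matrix
alpha = np.linalg.solve(np.diag(w) @ K + lam * np.eye(n), w * y)
yhat = K @ alpha
print("weighted train MSE:", np.mean(w * (yhat - y) ** 2) / np.mean(w))
```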

Evaluation procedure.

The prediction accuracy was measured by the mean squared error (MSE). For each train-test split, we randomly select one-third (6 points) of the target domain dataset as the training set and use the rest as the test set. All experiments were repeated 10 times with different train-test splits of target domain data.

Results.

The results are reported in Table 2. We report the MSE scores normalized by that of LOO to facilitate comparison, similarly to Cortes and Mohri (2014). In many of the target domain choices, the naive baselines (SrcOnly and S&TV) suffer from negative transfer, i.e., higher average MSE than TarOnly (in 12 out of 18 domains). On the other hand, the proposed method performs better than TarOnly or is more resistant to negative transfer than the other compared methods. The performances of GDM, Copula, and IW are often inferior even to the baseline S&TV. For GDM and IW, this can be attributed to the fact that these methods presume the availability of abundant (unlabeled) target domain data, which is unavailable in the current problem setup. For Copula, the performance inferior to the naive baselines is possibly due to the restriction of the predictor model to its accompanying probability model Lopez-paz et al. (2012). TrAdaBoost works reasonably well for many but not all domains. For some domains, it suffered from negative transfer similarly to the others, possibly because of the very small number of training data points. Note that the transfer assumption of TrAdaBoost has not been stated Pardoe and Stone (2010), and it is not understood when the method is reliable. The domains on which the baselines perform better than the proposed method can be explained by two cases: (1) easier domains allow naive baselines to perform well, and (2) some domains may have a deviating f. Case (1) implies that estimating f is unnecessary, so the proposed method can be suboptimal (more likely for JPN, NLD, NOR, and SWE, where SrcOnly or S&TV improve upon TarOnly). Case (2) implies that an approximation error was induced as in Theorem 2 (more likely for IRL and ITA); in this case, the other methods also perform poorly, implying the difficulty of the problem instance. In either case, in practice, one may well perform cross-validation to fall back to the baselines.

Table 2: Results of the real-world data experiments for different choices of the target domain. The evaluation score is MSE normalized by that of LOO (the lower the better). All experiments were repeated 10 times with different train-test splits of the target domain data, and the average performance is reported with standard errors in brackets. The Target column indicates abbreviated country names. Bold face indicates the best score (Prop: proposed method, TrAda: TrAdaBoost; the numbers in the brackets of IW indicate the value of \alpha). The proposed method often improves upon the baseline TarOnly or is relatively more resistant to negative transfer, with notable improvements in DEU, GBR, and USA.
Target (LOO) TarOnly Prop SrcOnly S&TV TrAda GDM Copula IW(.0) IW(.5) IW(.95)
AUT 1 5.88 (1.60) 5.39 (1.86) 9.67 (0.57) 9.84 (0.62) 5.78 (2.15) 31.56 (1.39) 27.33 (0.77) 39.72 (0.74) 39.45 (0.72) 39.18 (0.76)
BEL 1 10.70 (7.50) 7.94 (2.19) 8.19 (0.68) 9.48 (0.91) 8.10 (1.88) 89.10 (4.12) 119.86 (2.64) 105.15 (2.96) 105.28 (2.95) 104.30 (2.95)
CAN 1 5.16 (1.36) 3.84 (0.98) 157.74 (8.83) 156.65 (10.69) 51.94 (30.06) 516.90 (4.45) 406.91 (1.59) 592.21 (1.87) 591.21 (1.84) 589.87 (1.91)
DNK 1 3.26 (0.61) 3.23 (0.63) 30.79 (0.93) 28.12 (1.67) 25.60 (13.11) 16.84 (0.85) 14.46 (0.79) 22.15 (1.10) 22.11 (1.10) 21.72 (1.07)
FRA 1 2.79 (1.10) 1.92 (0.66) 4.67 (0.41) 3.05 (0.11) 52.65 (25.83) 91.69 (1.34) 156.29 (1.96) 116.32 (1.27) 116.54 (1.25) 115.29 (1.28)
DEU 1 16.99 (8.04) 6.71 (1.23) 229.65 (9.13) 210.59 (14.99) 341.03 (157.80) 739.29 (11.81) 929.03 (4.85) 817.50 (4.60) 818.13 (4.55) 812.60 (4.57)
GRC 1 3.80 (2.21) 3.55 (1.79) 5.30 (0.90) 5.75 (0.68) 11.78 (2.36) 26.90 (1.89) 23.05 (0.53) 47.07 (1.92) 45.50 (1.82) 45.72 (2.00)
IRL 1 3.05 (0.34) 4.35 (1.25) 135.57 (5.64) 12.34 (0.58) 23.40 (17.50) 3.84 (0.22) 26.60 (0.59) 6.38 (0.13) 6.31 (0.14) 6.16 (0.13)
ITA 1 13.00 (4.15) 14.05 (4.81) 35.29 (1.83) 39.27 (2.52) 87.34 (24.05) 226.95 (11.14) 343.10 (10.04) 244.25 (8.50) 244.84 (8.58) 242.60 (8.46)
JPN 1 10.55 (4.67) 12.32 (4.95) 8.10 (1.05) 8.38 (1.07) 18.81 (4.59) 95.58 (7.89) 71.02 (5.08) 135.24 (13.57) 134.89 (13.50) 134.16 (13.43)
NLD 1 3.75 (0.80) 3.87 (0.79) 0.99 (0.06) 0.99 (0.05) 9.45 (1.43) 28.35 (1.62) 29.53 (1.58) 33.28 (1.78) 33.23 (1.77) 33.14 (1.77)
NOR 1 2.70 (0.51) 2.82 (0.73) 1.86 (0.29) 1.63 (0.11) 24.25 (12.50) 23.36 (0.88) 31.37 (1.17) 27.86 (0.94) 27.86 (0.93) 27.52 (0.91)
ESP 1 5.18 (1.05) 6.09 (1.53) 5.17 (1.14) 4.29 (0.72) 14.85 (4.20) 33.16 (6.99) 152.59 (6.19) 53.53 (2.47) 52.56 (2.42) 52.06 (2.40)
SWE 1 6.44 (2.66) 5.47 (2.63) 2.48 (0.23) 2.02 (0.21) 2.18 (0.25) 15.53 (2.59) 2706.85 (17.91) 118.46 (1.64) 118.23 (1.64) 118.27 (1.64)
CHE 1 3.51 (0.46) 2.90 (0.37) 43.59 (1.77) 7.48 (0.49) 38.32 (9.03) 8.43 (0.24) 29.71 (0.53) 9.72 (0.29) 9.71 (0.29) 9.79 (0.28)
TUR 1 1.65 (0.47) 1.06 (0.15) 1.22 (0.18) 0.91 (0.09) 2.19 (0.34) 64.26 (5.71) 142.84 (2.04) 159.79 (2.63) 157.89 (2.63) 157.13 (2.69)
GBR 1 5.95 (1.86) 2.66 (0.57) 15.92 (1.02) 10.05 (1.47) 7.57 (5.10) 50.04 (1.75) 68.70 (1.25) 70.98 (1.01) 70.87 (0.99) 69.72 (1.01)
USA 1 4.98 (1.96) 1.60 (0.42) 21.53 (3.30) 12.28 (2.52) 2.06 (0.47) 308.69 (5.20) 244.90 (1.82) 462.51 (2.14) 464.75 (2.08) 465.88 (2.16)
#Best - 2 10 2 4 0 0 0 0 0 0

7 Conclusion

In this paper, we proposed a novel few-shot supervised DA method for regression problems based on the assumption of a shared generative mechanism. Through theoretical and experimental analysis, we demonstrated the effectiveness of the proposed approach. By considering the latent common structure behind the domain distributions, the proposed method successfully induces positive transfer even when a naive usage of the source domain data can suffer from negative transfer. Our future work includes an experimental comparison with extensively more datasets and methods as well as an extension to the case where the underlying mechanisms are not exactly identical but similar.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments and thorough discussions. We would also like to thank Yuko Kuroki and Taira Tsuchiya for proofreading the manuscript. This work was supported by RIKEN Junior Research Associate Program. TT was supported by Masason Foundation. IS was supported by KAKEN 17H04693. MS was supported by JST CREST Grant Number JPMJCR18A2.

References

  • Quiñonero-Candela et al. (2009) Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, editors. Dataset Shift in Machine Learning. Neural Information Processing Series. MIT Press, Cambridge, Mass, 2009.
  • Adams and Fournier (2003) Robert A. Adams and John JF Fournier. Sobolev Spaces. Academic press, 2003.
  • Arjovsky et al. (2020) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv:1907.02893 [cs, stat], March 2020.
  • Baltagi (2005) Badi Baltagi. Econometric Analysis of Panel Data. New York: John Wiley and Sons, 3rd edition, 2005.
  • Baltagi and Griffin (1983) Badi H. Baltagi and James M. Griffin. Gasoline demand in the OECD: An application of pooling and testing procedures. European Economic Review, 22(2):117–137, 1983.
  • Ben-David et al. (2007) Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • Blitzer et al. (2008) John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems 20, pages 129–136. Curran Associates, Inc., 2008.
  • Clémençon et al. (2016) Stephan Clémençon, Igor Colin, and Aurélien Bellet. Scaling-up empirical risk minimization: Optimization of incomplete U-statistics. Journal of Machine Learning Research, 17(76):1–36, 2016.
  • Cortes and Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2019) Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, 20(1):1–30, 2019.
  • Courty et al. (2017) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems 30, pages 3730–3739. Curran Associates, Inc., 2017.
  • Ghassami et al. (2017) AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In Advances in Neural Information Processing Systems 30, pages 3011–3021. Curran Associates, Inc., 2017.
  • Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, 4th edition, 2013.
  • Gong et al. (2016) Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In Proceedings of the 33rd International Conference on Machine Learning, pages 2839–2848, New York, USA, 2016. PMLR.
  • Gong et al. (2018) Mingming Gong, Kun Zhang, Biwei Huang, Clark Glymour, Dacheng Tao, and Kayhan Batmanghelich. Causal generative domain adaptation networks. arXiv:1804.04333 [cs, stat], April 2018.
  • Greene (2012) William H. Greene. Econometric Analysis. Prentice Hall, Boston, 7th edition, 2012.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
  • Hayfield and Racine (2008) Tristen Hayfield and Jeffrey S. Racine. Nonparametric econometrics: The np package. Journal of Statistical Software, 27(5), 2008.
  • Hünermund and Bareinboim (2019) Paul Hünermund and Elias Bareinboim. Causal inference and data-fusion in econometrics. arXiv:1912.09104 [econ], December 2019.
  • Hyvärinen and Pajunen (1999) A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
  • Hyvärinen and Morioka (2016) Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems 29, pages 3765–3773. Curran Associates, Inc., 2016.
  • Hyvärinen and Morioka (2017) Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 460–469, 2017.
  • Hyvärinen et al. (2019) Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868, 2019.
  • Ipsen and Rehman (2008) Ilse C. F. Ipsen and Rizwana Rehman. Perturbation bounds for determinants and characteristic polynomials. SIAM Journal on Matrix Analysis and Applications, 30(2):762–776, 2008.
  • Kano and Shimizu (2003) Yutaka Kano and Shohei Shimizu. Causal inference using nonnormality. In Proceedings of the International Symposium on the Science of Modeling, the 30th Anniversary of the Information Criterion,, pages 261–270, 2003.
  • Khemakhem et al. (2019) Ilyes Khemakhem, Diederik P. Kingma, Ricardo Pio Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. arXiv:1907.04809 [cs, stat], July 2019.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs], January 2017.
  • Kingma and Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, pages 10215–10224. Curran Associates, Inc., 2018.
  • Kumagai (2016) Wataru Kumagai. Learning bound for parameter transfer learning. In Advances in Neural Information Processing Systems 29, pages 2721–2729. Curran Associates, Inc., 2016.
  • Kuroki et al. (2019) Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4122–4129, 2019.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Lee (1990) A. J. Lee. U-Statistics: Theory and Practice. M. Dekker, New York, 1990.
  • Lee et al. (2009) Honglak Lee, Rajat Raina, Alex Teichman, and Andrew Y. Ng. Exponential family sparse coding with applications to self-taught learning. In Proceedings of the 21st International Jont Conference on Artifical Intelligence, pages 1113–1119, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
  • Lopez-paz et al. (2012) David Lopez-paz, Jose M. Hernández-lobato, and Bernhard Schölkopf. Semi-supervised domain adaptation with non-parametric copulas. In Advances in Neural Information Processing Systems 25, pages 665–673. Curran Associates, Inc., 2012.
  • Magliacane et al. (2018) Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31, pages 10846–10856. Curran Associates, Inc., 2018.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012.
  • Monti et al. (2019) Ricardo Pio Monti, Kun Zhang, and Aapo Hyvärinen. Causal discovery with general non-linear relationships using non-linear ICA. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019.
  • Ng (1997) Andrew Y. Ng. Preventing “overfitting” of cross-validation data. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 245–253, San Francisco, CA, USA, 1997.
  • Nguyen et al. (2016) Tuan Duong Nguyen, Marthinus Christoffel, and Masashi Sugiyama. Continuous Target Shift Adaptation in Supervised Learning. In Asian Conference on Machine Learning, volume 45 of Proceedings of Machine Learning Research, pages 285–300. PMLR, 2016.
  • Pan et al. (2011) Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • Papa et al. (2015) Guillaume Papa, Stéphan Clémençon, and Aurélien Bellet. SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk. In Advances in Neural Information Processing Systems 28, pages 1027–1035. Curran Associates, Inc., 2015.
  • Pardoe and Stone (2010) David Pardoe and Peter Stone. Boosting for regression transfer. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 863–870, Haifa, Israel, 2010.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Pearl (2009) Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, U.K. ; New York, 2nd edition, 2009.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, Massachuestts, 2017.
  • R Core Team (2018) R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2018.
  • Reiss and Wolak (2007) Peter C. Reiss and Frank A. Wolak. Structural econometric modeling: Rationales and examples from industrial organization. In Handbook of Econometrics, volume 6, pages 4277–4415. Elsevier, 2007.
  • Rejchel (2012) Wojciech Rejchel. On ranking and generalization bounds. Journal of Machine Learning Research, 13(May):1373–1392, 2012.
  • Rojas-Carulla et al. (2018) Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36):1–34, 2018.
  • Schölkopf et al. (2000) Bernhard Schölkopf, Robert C Williamson, Alex J. Smola, John Shawe-Taylor, and John C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Coference on Machine Learning, pages 459–466. Omnipress, 2012.
  • Sherman (1994) Robert P. Sherman. Maximal inequalities for degenerate U-processes with applications to optimization estimators. The Annals of Statistics, 22(1):439–459, 1994.
  • Shimizu et al. (2006) Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti J Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(October):2003–2030, 2006.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Stojanov et al. (2019) Petar Stojanov, Mingming Gong, Jaime Carbonell, and Kun Zhang. Data-driven approach to multiple-source domain adaptation. In Proceedings of Machine Learning Research, volume 89, pages 3487–3496. PMLR, 2019.
  • Storkey and Sugiyama (2007) Amos J Storkey and Masashi Sugiyama. Mixture regression for covariate shift. In Advances in Neural Information Processing Systems 19, pages 1337–1344. MIT Press, 2007.
  • Sugiyama et al. (2007) Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
  • Wainwright (2019) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 1st edition, 2019.
  • Xu et al. (2014) Lele Xu, Tingting Fan, Xia Wu, KeWei Chen, Xiaojuan Guo, Jiacai Zhang, and Li Yao. A pooling-LiNGAM algorithm for effective connectivity analysis of fMRI data. Frontiers in Computational Neuroscience, 8(October):125, 2014.
  • Yadav et al. (2018) Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. Mining electronic health records (EHRs): A survey. ACM Computing Surveys, 50(6):1–40, 2018.
  • Yamada et al. (2011) Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems 24, pages 594–602. Curran Associates, Inc., 2011.
  • Zhang et al. (2015) K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3150–3157. AAAI Press, 2015.
  • Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, pages 819–827, 2013.
  • Zhang et al. (2019) Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning, pages 7404–7413, Long Beach, California, USA, 2019. PMLR.

This is the Supplementary Material for “Few-shot Domain Adaptation by Causal Mechanism Transfer.” Table 3 summarizes the abbreviations and the symbols used in the paper.

Table 3: List of abbreviations and symbols used in the paper.
Abbreviation / Symbol Meaning
DA Domain adaptation
TA Transfer assumption
SEM Structural equation model
GCM Graphical causal model
SCM Structural causal model
IC Independent component
ICA Independent component analysis
GCL Generalized contrastive learning
i.i.d. Independent and identically distributed
[N]  \{1,2,\ldots,N\} where N\in\mathbb{N}
\|\cdot\|_{W^{k,p}}  The (k,p)-Sobolev norm
X  The predictor random vector (\mathbb{R}^{D-1}-valued)
Y  The predicted random variable (\mathbb{R}-valued)
Z=(X,Y)  The joint random variable (\mathbb{R}^{D}-valued)
S  The independent component vector (\mathbb{R}^{D}-valued)
\mathcal{X}\subset\mathbb{R}^{D-1}  The space of X
\mathcal{Y}\subset\mathbb{R}  The space of Y
\mathcal{Z}\subset\mathbb{R}^{D}  The space of Z=(X,Y)
\mathcal{G}\subset\{g:\mathbb{R}^{D-1}\to\mathbb{R}\}  Predictor hypothesis class
\ell:\mathcal{G}\times\mathcal{Z}\to[0,B_{\ell}]  Loss function
R(g)  Target domain risk \mathbb{E}_{p_{\mathrm{Tar}}}\ell(g,Z)
g^{*}\in\mathcal{G}  Minimizer of the target domain risk
\mathcal{Q}  The set of independent distributions
f  Ground-truth mixing function
p_{\mathrm{Tar}}  The target joint distribution
p_{k}  The joint distribution of source domain k
q_{\mathrm{Tar}}\in\mathcal{Q}  The target independent component (IC) distribution
q_{k}\in\mathcal{Q}  The IC distribution of source domain k
D  The dimension of \mathcal{Z}
K  The number of source domains
n_{\mathrm{Tar}}  The size of the target labeled sample
n_{k}  The size of the labeled sample from source domain k
\mathcal{D}_{\mathrm{Tar}}=\{Z_{i}\}_{i=1}^{n_{\mathrm{Tar}}}  Target labeled data set
\mathcal{D}_{k}=\{Z^{\mathrm{Src}}_{k,i}\}_{i=1}^{n_{k}}  Source labeled data set of source domain k
\hat{R}(g)  The ordinary empirical risk estimator
\check{R}(g)  The proposed risk estimator (Eq. (2))
\hat{f}  The estimator of f
\{\psi_{d}\}_{d=1}^{D}  The penultimate-layer functions composed with f during GCL

Appendix A Preliminary: Nonlinear ICA

Here, we use the same notation as the main text. Recently developed nonlinear ICA provides an algorithm to estimate the mixing function f. For nonlinear f, the impossibility of identification (i.e., consistent estimation) of f in the one-sample i.i.d. case was established more than two decades ago (Hyvärinen and Pajunen, 1999). Recently, however, various conditions have been proposed under which f can be identified with the help of auxiliary information Hyvärinen and Morioka (2016, 2017); Hyvärinen et al. (2019); Khemakhem et al. (2019).

The identification condition directly relevant to this paper is that of the generalized contrastive learning (GCL) proposed in Hyvärinen et al. (2019). GCL assumes that an auxiliary variable $u_{i}$ from some measurable set $\mathcal{U}$ is observed for each data point, yielding $\{(z_{i},u_{i})\}_{i=1}^{n}$, and that the ICs $S=(S^{(1)},\ldots,S^{(D)})$ are conditionally independent given $u$:

q(s|u)=d=1Dq(d)(s(d)|u).\begin{split}q(s|u)=\prod_{d=1}^{D}q^{(d)}(s^{(d)}|u).\end{split}

Under such conditions, GCL estimates ff by training a classification function

rf^,𝝍(z,u)=d=1Dψd(f^1(z)d,u)\begin{split}r_{\hat{f},\bm{\psi}}(z,u)=\sum_{d=1}^{D}\psi_{d}({\hat{f}}^{-1}(z)_{d},u)\end{split} (4)

parametrized by f^\hat{f} and {ψd}d=1D\{\psi_{d}\}_{d=1}^{D} with the logistic loss for classifying

(z,u) vs. (z,u~),\begin{split}(z,u)\text{ vs. }(z,\tilde{u}),\end{split}

where u~𝒰{u}\tilde{u}\in\mathcal{U}\setminus\{u\}. The key condition for the identification of ff is the following.

Assumption 1 (Assumption of variability; Hyvärinen et al., 2019, Theorem 1).

For any zz, there exist 2D+12D+1 distinct points in 𝒰\mathcal{U}, denoted by {uj}j=02D\{u_{j}\}_{j=0}^{2D}, such that the set of (2D)(2D)-dimensional vectors {w(z|uj)w(z|u0)}j=12D\{w(z|u_{j})-w(z|u_{0})\}_{j=1}^{2D} are linearly independent, where

w(z|u):=(logq(1)(z1|u)z1,,logq(D)(zD|u)zD,2logq(1)(z1|u)z12,,2logq(D)(zD|u)zD2).\begin{split}w(z|u):=\left(\frac{\partial\log q^{(1)}(z_{1}|u)}{\partial z_{1}},\ldots,\frac{\partial\log q^{(D)}(z_{D}|u)}{\partial z_{D}},\frac{\partial^{2}\log q^{(1)}(z_{1}|u)}{\partial z_{1}^{2}},\ldots,\frac{\partial^{2}\log q^{(D)}(z_{D}|u)}{\partial z_{D}^{2}}\right).\\ \end{split}

Under Assumption 1 and some regularity conditions, Theorem 1 of Hyvärinen et al. (2019) states that the transformation $\hat{f}$ in Eq. (4) trained by GCL is a consistent estimator of $f$ up to additional dimension-wise invertible transformations. Note that the assumption is intrinsically difficult to verify from data due to the unsupervised nature of the problem setting. In this paper, we use the source domain index as the auxiliary variable and employ GCL for domain adaptation. The present version of Assumption 1 requires at least $2D+1$ distinct source domains. Although this condition can be restrictive for high-dimensional data, it may be relaxed in the future because the identification condition is only known to be sufficient, not necessary; pursuing such a refinement is, however, outside the scope of this paper. Among the various methods for nonlinear ICA, we chose GCL Hyvärinen et al. (2019) because it can operate under a nonparametric assumption on the IC distributions, whereas other nonlinear ICA methods Hyvärinen and Morioka (2016, 2017); Khemakhem et al. (2019) may require parametric assumptions.
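To make the GCL objective concrete, the following is a minimal NumPy sketch of how the contrastive classification problem can be assembled: positive pairs keep each point's own domain index, negative pairs swap in a wrong one, and the logistic loss is evaluated on the score $r_{\hat{f},\bm{\psi}}(z,u)=\sum_{d}\psi_{d}(\hat{f}^{-1}(z)_{d},u)$ of Eq. (4). The toy mixing, the fixed feature maps `psi`, and the two-domain setup are hypothetical placeholders (and, for brevity, the sketch ignores the $2D+1$-domain requirement of Assumption 1); in GCL, $\hat{f}$ and $\{\psi_{d}\}$ are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 200, 2

# Hypothetical data-generating process: ICs are independent given the domain
# index u, with domain-dependent scales (cf. q(s|u) = prod_d q^(d)(s^(d)|u)).
u = rng.integers(0, 2, size=n)          # auxiliary variable: source domain index
scales = np.array([[1.0, 0.5],          # IC standard deviations, domain 0
                   [0.3, 2.0]])         # IC standard deviations, domain 1
s = rng.normal(size=(n, D)) * scales[u]
z = np.stack([s[:, 0], s[:, 1] + s[:, 0] ** 3], axis=1)  # toy nonlinear mixing f

def f_hat_inv(z):
    # Stand-in for the learned inverse f̂^{-1} (here, the exact inverse of the
    # toy mixing; in GCL this would be a trained invertible network).
    s0 = z[:, 0]
    return np.stack([s0, z[:, 1] - s0 ** 3], axis=1)

def psi(s_d, u, d):
    # Hypothetical per-dimension function psi_d(., u); GCL learns these jointly.
    return -0.5 * s_d ** 2 / scales[u, d] ** 2 - np.log(scales[u, d])

def score(z, u):
    # r_{f̂,psi}(z, u) = sum_d psi_d(f̂^{-1}(z)_d, u)  (Eq. (4))
    s_hat = f_hat_inv(z)
    return sum(psi(s_hat[:, d], u, d) for d in range(D))

# Contrastive pairs: (z, u) labeled positive vs. (z, u~) with a wrong index.
u_tilde = 1 - u                         # the other domain (two domains here)
pos, neg = score(z, u), score(z, u_tilde)
logistic_loss = np.mean(np.log1p(np.exp(-pos)) + np.log1p(np.exp(neg)))
```

Minimizing this loss over the parameters of `f_hat_inv` and `psi` is what drives $\hat{f}$ toward the true mixing, up to dimension-wise invertible transformations.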

Appendix B Experiment Details

Here, we describe more implementation details of the experiment. Our experiment code can be found at https://github.com/takeshi-teshima/few-shot-domain-adaptation-by-causal-mechanism-transfer.

B.1 Dataset details

Gasoline consumption data.

B.2 Model details: Invertible neural networks

Here, we describe the details of the Glow architecture Kingma and Dhariwal (2018) used in our experiments. Glow consists of three types of layers which are invertible by design, namely affine coupling layers, 1×11\times 1 convolution layers, and activation normalization (actnorm) layers. In our implementation, we use actnorm as the first layer, and each of the subsequent layers consists of a 1×11\times 1 convolution layer followed by an affine coupling layer.

Affine coupling layers.

The coefficients $s$ and $t$ for the affine coupling layers, in the notation of Kingma and Dhariwal (2018), are parametrized by two one-hidden-layer neural networks that have the same number of hidden units and share the first-layer parameters. The activation functions of the first layer, the second layer of $s$, and the second layer of $t$ are the rectified linear unit (ReLU) LeCun et al. (2015), the hyperbolic tangent function, and the linear activation function, respectively. A standard practice for affine coupling layers is to compose the coefficient $s$ with an exponential function $x\mapsto\exp(x)$ so as to simplify the computation of the log-determinant of the Jacobian Kingma and Dhariwal (2018). In our implementation, since we do not require the computation of the log-determinant, we omit this device and instead compose $x\mapsto x+1$. The addition of $1$ shifts the parameter space so that $(s,t)=(0,0)$ corresponds to the identity map, where $0$ denotes the constant zero function. The split of the affine coupling layers is fixed at $(\lfloor\frac{D}{2}\rfloor,D-\lfloor\frac{D}{2}\rfloor)$.
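As an illustration of this design choice, here is a minimal NumPy sketch of one such coupling layer: since the $s$-network ends in $\tanh$, the scale $\tanh(\cdot)+1$ lies in $(0,2)$ and never vanishes, so the layer stays invertible without the usual $\exp$ device. The tiny fixed-weight networks are hypothetical stand-ins for the trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d1, H = 4, 2, 8                       # input dim, split point floor(D/2), hidden units
W1 = rng.normal(size=(d1, H)) * 0.1      # shared first-layer weights
Ws = rng.normal(size=(H, D - d1)) * 0.1  # second layer of the s-network
Wt = rng.normal(size=(H, D - d1)) * 0.1  # second layer of the t-network

def coeffs(x1):
    h = np.maximum(x1 @ W1, 0.0)         # ReLU first layer (shared)
    return np.tanh(h @ Ws), h @ Wt       # tanh for s, linear for t

def forward(x):
    x1, x2 = x[:, :d1], x[:, d1:]
    s, t = coeffs(x1)
    # Scale is s + 1 in (0, 2), so the map is invertible; (s, t) = (0, 0)
    # gives the identity map.
    return np.concatenate([x1, x2 * (s + 1.0) + t], axis=1)

def inverse(y):
    y1, y2 = y[:, :d1], y[:, d1:]
    s, t = coeffs(y1)                    # y1 = x1, so the coefficients match
    return np.concatenate([y1, (y2 - t) / (s + 1.0)], axis=1)

x = rng.normal(size=(5, D))
x_rec = inverse(forward(x))              # recovers x up to floating-point error
```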

1×11\times 1 convolution layers.

We initialize the parameters of the neural networks by 𝒩(0,1m)\mathcal{N}(0,\frac{1}{m}) where mm is the number of parameters of each layer and 𝒩\mathcal{N} is the normal distribution.

B.3 Model details: Penultimate layer networks

We initialize the parameter for each layer of ψd\psi_{d} by U(1m,1m)\mathrm{U}(-\sqrt{\frac{1}{m}},\sqrt{\frac{1}{m}}), where mm is the number of input features and U\mathrm{U} is the uniform distribution.

B.4 Training details

During the training of GCL, we fix the batch size at 32.

B.5 Compared methods details

Here, we detail the methods compared in the experiment. Note that the present paper focuses on regression problems because our approach is based on ICA; hence, methods for classification domain adaptation are not included in the comparison.

TrAdaBoost.

As suggested in Pardoe and Stone (2010), we use the linear loss function and set the maximum number of internal boosting iterations at $30$.

GDM.

We fix the number of samples used to approximate the maximization in the generalization discrepancy at $200$. This method presumes hypothesis classes in a reproducing kernel Hilbert space (RKHS).

Copula.

For this model, a non-parametric R-vine copula of depth $1$ is used following Lopez-paz et al. (2012). Kernel density estimators with the RBF kernel are used for estimating the marginal distributions and the copulas. The bandwidths of the RBF kernels are determined using the rule of thumb implemented as “normal-reference” in the np package of the R language Hayfield and Racine (2008). The predictions are made by numerically aggregating the estimated conditional distribution over the interval $[\min_{i}Y_{i}-2\sigma,\max_{i}Y_{i}+2\sigma]$, where $\sigma$ denotes the square root of the unbiased variance of $\{Y_{i}\}_{i=1}^{n_{\mathrm{src}}}$. The aggregation is performed by discretizing the interval into a grid of $300$ points. The level of the two-sample test is fixed at $0.05$ for all combinations of the two-sample tests, following the experiment code of Lopez-paz et al. (2012). This method is a single-source domain adaptation method, and we pool all source domain data for adaptation.

Appendix C Details and Proofs of Theorem 2

Here, we detail the assumptions, the statement, and the proof of Theorem 2.

C.1 Notation

To make the proof self-contained, we first recall some general and problem-specific notation. Here we omit the domain identifiers, such as Tar or Src, from the distributions and the sample size because only the target domain data and their distributions appear in the proofs. The theorem holds regardless of how $\hat{f}$ is estimated, as long as $\hat{f}$ is independent of the target domain data. In the proof, we extend the maximal discrepancy bound for U-statistics, previously proved for the degree-$2$ case in Rejchel (2012), to higher degrees.

General mathematical notation.

We denote the set of natural numbers (resp. real numbers) by \mathbb{N} (resp. \mathbb{R}). For any NN\in\mathbb{N}, we define [N]:={1,2,,N}[N]:=\{1,2,\ldots,N\}. We use (ab){a\choose b} to denote the number of bb-combinations of aa elements. For a finite set AA, the notation ¯aA\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{a\in A} denotes the operator to take an average over AA, i.e., ¯aAh=1|A|aAh(a)\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{a\in A}h=\frac{1}{|A|}\sum_{a\in A}h(a). For a DD-dimensional function hh, we denote its jj-th dimension (j[D]j\in[D]) by suffixing hjh_{j}. For a vector ss, we denote its jj-th element by s(j)s^{(j)}. We denote the Jacobian determinant of a differentiable function ψ\psi at aa by Jψ(a):=detdψ(a)da\mathrm{J}\psi(a):=\det\frac{\mathrm{d}\psi(a)}{\mathrm{d}a}. We denote the identity matrix by II regardless of the size of the matrices when there is no ambiguity. For finite dimensional vectors, we denote the 22-norm by 2\left\|\cdot\right\|_{{\ell^{2}}} and the 11-norm by 1\left\|\cdot\right\|_{{\ell^{1}}}. For square matrices, we denote the operator-22 norm by op\left\|\cdot\right\|_{\mathrm{op}} and the operator-11 norm by op(1)\left\|\cdot\right\|_{\mathrm{op}(1)}. We use Wk,pW^{k,p} to denote the Sobolev space (on D\mathbb{R}^{D}) of order kk and define its associated norm by hWk,p:=(|α|kh(α)Lpp)1/p\left\|h\right\|_{W^{k,p}}:=\left(\sum_{|\alpha|\leq k}\left\|h^{(\alpha)}\right\|_{{L^{p}}}^{p}\right)^{1/p} where α\alpha is a multi-index and h(α)h^{(\alpha)} denotes the partial derivative |α|hs1α1sDαD\frac{\partial^{|\alpha|}h}{\partial s_{1}^{\alpha_{1}}\cdots\partial s_{D}^{\alpha_{D}}} (Adams and Fournier, 2003, Paragraph 3.1). 
We let 𝔖D\mathfrak{S}_{D} be the degree-DD symmetric group, 𝔊jD:={τ:[D][j]|τ is surjective}\mathfrak{G}_{j}^{D}:=\{\tau:[D]\to[j]\ |\ \tau\text{ is surjective}\} be the set of jj grouping of indices in [D][D], and jn:={ρ:[j][n]|ρ is injective}\mathfrak{I}_{j}^{n}:=\{\rho:[j]\to[n]\ |\ \rho\text{ is injective}\} be the set of all size-jj combinations (without replacement) of indices in [n][n].

Distributions and expectations.

We denote by 𝒬\mathcal{Q} the set of all factorized distributions on D\mathbb{R}^{D} with absolutely continuous marginals. For a measure PP, we denote its jj-product measure by Pj:=PPP^{j}:=P\otimes\cdots\otimes P (repeated jj times). We assume that all measures appearing in this proof are absolutely continuous with respect to the Lebesgue measure. The push-forward of a distribution pp by a function hh is denoted by hph_{\sharp}p. The expectation of a function hh with respect to measure PP is denoted by PhPh (if it exists) by abuse of notation. We also abuse the notation to use ψ(s,P,,P)\psi(s,P,\ldots,P) as the shorthand for PD1ψ(s,S2,,SD)P^{D-1}\psi(s,S_{2}^{\prime},\ldots,S_{D}^{\prime}) where {Sd}d=2Di.i.d.P\{S_{d}^{\prime}\}_{d=2}^{D}\overset{\text{i.i.d.}}{\sim}P.

C.2 Problem setup

We denote the target domain distribution by pp.

We fix a hypothesis class 𝒢({g:D1})\mathcal{G}(\subset\{g:\mathbb{R}^{D-1}\to\mathbb{R}\}), and our goal is to find a g𝒢g\in\mathcal{G} such that the risk functional

R(g):=p(z)(g,z)dz\begin{split}R(g):=\int p(z)\ell(g,z)\mathrm{d}z\end{split}

is small, where :𝒢×D0\ell:\mathcal{G}\times\mathbb{R}^{D}\to\mathbb{R}_{\geq 0} is a loss function. We denote by gg^{*} a minimizer of RR (assuming it exists). To this end, we are given the training data 𝒟:={Zi}i=1ni.i.d.p\mathcal{D}:=\{Z_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}p. Throughout, we assume nDn\geq D. To complement the smallness of nn, we assume the existence of a generative mechanism. Concretely, we assume that there exists a diffeomorphism f:DDf:\mathbb{R}^{D}\to\mathbb{R}^{D} such that q:=(f1)pq:=(f^{-1})_{\sharp}p satisfies q𝒬q\in\mathcal{Q}. With this transform, the original risk functional is also expressed as

R(g)=q(s)(g,f(s))ds.\begin{split}R(g)=\int q(s)\ell(g,f(s))\mathrm{d}s.\end{split}

As an estimator of $f$, we are given another diffeomorphism $\hat{f}:\mathbb{R}^{D}\to\mathbb{R}^{D}$ such that $\hat{f}\simeq f$. With this $\hat{f}$, the proposed method converts the dataset $\mathcal{D}$ by $S_{i}:={\hat{f}}^{-1}(Z_{i})$. We can regard $\check{\mathcal{D}}:=\{S_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}\check{q}$, where $\check{q}:=({\hat{f}}^{-1}\circ f)_{\sharp}q$. We use $Q$ (resp. $\check{Q}$) to denote the probability measure corresponding to the density $q$ (resp. $\check{q}$). This conversion results in the relation:

qˇ(s)=q(f1f^(s))|(Jf1f^)(s)|.\begin{split}\check{q}(s)=q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|.\end{split}
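This relation is the standard change-of-variables formula. As a short derivation, write $T:=\hat{f}^{-1}\circ f$, so that $\check{q}=T_{\sharp}q$ and $T^{-1}=f^{-1}\circ\hat{f}$; then for any test function $\varphi$,

```latex
\int \varphi(s)\,\check{q}(s)\,\mathrm{d}s
  = \int \varphi(T(s'))\,q(s')\,\mathrm{d}s'
  = \int \varphi(s)\,q(T^{-1}(s))\,\left|\mathrm{J}T^{-1}(s)\right|\mathrm{d}s,
```

where the second equality substitutes $s=T(s')$; hence $\check{q}(s)=q(T^{-1}(s))\,|\mathrm{J}T^{-1}(s)|$, which is the relation above.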

As a candidate hypothesis g𝒢g\in\mathcal{G}, the proposed method selects a minimizer gˇ𝒢\check{g}\in\mathcal{G} of the proposed risk estimator Rˇ\check{R} defined as

Rˇ(g):=1nD(i1,,iD)[n]D(g,f^(s^i1(1),,s^iD(D))).\begin{split}\check{R}(g):=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\ell(g,\hat{f}(\hat{s}_{i_{1}}^{(1)},\ldots,\hat{s}_{i_{D}}^{(D)})).\end{split} (5)

In the proof, we evaluate its concentration around the expectation Rˇ¯(g):=𝔼𝒟ˇRˇ(g)\bar{\check{R}}(g):={\mathbb{E}}_{\check{\mathcal{D}}}\check{R}(g). We use 𝔼𝒟ˇ{\mathbb{E}}_{\check{\mathcal{D}}} to denote the expectation with respect to 𝒟ˇ\check{\mathcal{D}}. Let gˇ¯\bar{\check{g}} denote a hypothesis which minimizes Rˇ¯(g)\bar{\check{R}}(g) (assuming it exists).
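To make Eq. (5) concrete, here is a minimal NumPy sketch of the proposed estimator on a toy problem: the data are pulled back through $\hat{f}^{-1}$, the per-dimension IC values are recombined over all $n^{D}$ index tuples, pushed forward through $\hat{f}$, and the losses are averaged. The elementwise feature extractor, the linear predictor, and the squared loss are hypothetical choices for illustration only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 3                      # small n and D so the n^D sum is enumerable

# Stand-in invertible feature extractor f̂ (elementwise, with known inverse;
# in the method, f̂ is a trained invertible network).
def f_hat(s):
    return np.sinh(s)

def f_hat_inv(z):
    return np.arcsinh(z)

Z = rng.normal(size=(n, D))      # target labeled data, z_i = (x_i, y_i)
S = f_hat_inv(Z)                 # estimated ICs: ŝ_i = f̂^{-1}(z_i)

def loss(z, w):
    # Hypothetical squared loss of a linear predictor g_w: x -> w @ x.
    x, y = z[:-1], z[-1]
    return (w @ x - y) ** 2

w = np.ones(D - 1)

# Proposed estimator (Eq. (5)): average over all n^D recombinations of the
# per-dimension IC values, pushed forward through f̂ (a V-statistic).
total = 0.0
for idx in itertools.product(range(n), repeat=D):
    s_mix = np.array([S[idx[d], d] for d in range(D)])
    total += loss(f_hat(s_mix), w)
R_check = total / n ** D

# Ordinary empirical risk on the n original samples; these reappear inside
# R_check as the "diagonal" tuples idx = (i, ..., i), since f̂(f̂^{-1}(z_i)) = z_i.
R_hat = np.mean([loss(Z[i], w) for i in range(n)])
```

The point of the recombination is that $\check{R}(g)$ averages over $n^{D}$ pseudo-samples rather than $n$ real ones, which is what the concentration analysis below quantifies.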

In what follows, for notational simplicity, we define the DD-variate symmetric function ~\tilde{\ell} as

~(s1,,sD)=¯π𝔖D(g,f^(sπ(1)(1),,sπ(D)(D))),\begin{split}\tilde{\ell}(s_{1},\ldots,s_{D})=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)})),\end{split}

where ¯π𝔖D\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}} indicates an averaging operation over all permutations (without replacement) of [D][D]. We use 𝔼^n\hat{\mathbb{E}}_{n} to denote the sample average operator with respect to 𝒟\mathcal{D} or 𝒟ˇ\check{\mathcal{D}}, depending on the context.

C.3 Assumptions

Assumption 2 (The underlying density function is bounded and Lipschitz continuous).

Assume

Bq:=supsDq(s)<,Lq:=sups1s2|q(s1)q(s2)|s1s2<.\begin{split}B_{q}&:=\sup_{s\in\mathbb{R}^{D}}q(s)<\infty,\quad L_{q}:=\sup_{s_{1}\neq s_{2}}\frac{|q(s_{1})-q(s_{2})|}{\|s_{1}-s_{2}\|}<\infty.\end{split}
Assumption 3 (f1f^{-1} is Lipschitz continuous and Hölder continuous).

We assume f1C1,1f^{-1}\in C^{1,1} where C1,1C^{1,1} is the (1,1)(1,1)-Hölder space (Adams and Fournier, 2003, Paragraph 1.29) and

Lf1:=supz1z2f1(z1)f1(z2)z1z2<.\begin{split}L_{f^{-1}}&:=\sup_{z_{1}\neq z_{2}}\frac{\|f^{-1}(z_{1})-f^{-1}(z_{2})\|}{\|z_{1}-z_{2}\|}<\infty.\end{split}
Assumption 4 (Bounded derivatives of ff and f1f^{-1}).

Assume that

Bf:=supsDdfds(s)<,Bf1:=supzDdf1dz(z)<.\begin{split}B_{\partial f}^{\infty}:=\sup_{s\in\mathbb{R}^{D}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\infty}<\infty,\quad B_{\partial f^{-1}}^{\infty}:=\sup_{z\in\mathbb{R}^{D}}\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(z)\right\|_{\infty}<\infty.\end{split}

where \left\|\cdot\right\|_{\infty} denotes the maximum absolute value of the elements of a matrix.

Assumption 5 (Loss function is bounded and uniformly Lipschitz continuous in ZZ).

The considered loss function takes values in a bounded interval:

:𝒢×𝒵[0,B],\begin{split}\ell:\mathcal{G}\times\mathcal{Z}\to[0,B_{\ell}],\end{split}

where 0<B<0<B_{\ell}<\infty. Also assume

L𝒢:=supg𝒢supz1z2|(g,z1)(g,z2)|z1z2<.\begin{split}L_{\ell_{\mathcal{G}}}:=\sup_{g\in\mathcal{G}}\sup_{z_{1}\neq z_{2}}\frac{|\ell(g,z_{1})-\ell(g,z_{2})|}{\|z_{1}-z_{2}\|}<\infty.\end{split}
Assumption 6 (Estimated feature extractor).

Assume f^\hat{f} is independent of 𝒟\mathcal{D} and that fjf^jW1,df_{j}-\hat{f}_{j}\in W^{1,d} for all (j,d)[D]×[D](j,d)\in[D]\times[D].

Although $\hat{f}$ and $f$ are assumed to be diffeomorphisms in the classical sense (and are hence strongly differentiable), we introduce the Sobolev space because we need to measure their difference, and the difference of their derivatives, in an integral sense.

Assumption 7 (Entropic condition: Euclidean class (Sherman, 1994)).

The function class Φ:={~:g𝒢}\Phi:=\{\tilde{\ell}:g\in\mathcal{G}\} is Euclidean for the envelope FF and constants AA and VV (Sherman, 1994), i.e., if μ\mu is a measure for which μF2<\mu F^{2}<\infty, then

D(t,dμ,Φ)AtV,0<t1,\begin{split}D(t,d_{\mu},\Phi)\leq At^{-V},\quad 0<t\leq 1,\end{split}

where dμd_{\mu} is the pseudo metric defined by

dμ(ϕ1,ϕ2):=[μ|ϕ1ϕ2|2/μF2]1/2\begin{split}d_{\mu}(\phi_{1},\phi_{2}):=\left[\mu|\phi_{1}-\phi_{2}|^{2}/\mu F^{2}\right]^{1/2}\end{split}

for ϕ1,ϕ2Φ\phi_{1},\phi_{2}\in\Phi, and D(t,dμ,Φ)D(t,d_{\mu},\Phi) denotes the packing number of Φ\Phi with respect to the pseudometric dμd_{\mu} and radius tt. Without loss of generality, we take the envelope FF such that F()BF(\cdot)\leq B_{\ell}.

Assumption 8.

The hypothesis class 𝒢\mathcal{G} is expressive enough so that the model approximation error does not expand due to f^\hat{f}, i.e.,

infg𝒢Rˇ¯(g)infg𝒢R(g)\begin{split}\inf_{g\in\mathcal{G}}\bar{\check{R}}(g)\leq\inf_{g\in\mathcal{G}}R(g)\end{split}

The following complexity measure of 𝒢\mathcal{G}, which is a version of Rademacher complexity for our problem setting, is used to state the theorem.

Definition 1 (Effective Rademacher complexity).

Define

(𝒢):=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S2,,SD[~(Si,S2,,SD)]|]\begin{split}\mathfrak{R}(\mathcal{G}):=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{2},\ldots,S^{\prime}_{D}}[\tilde{\ell}(S_{i},S_{2}^{\prime},\ldots,S_{D}^{\prime})]\right|\right]\end{split}

where {σi}i=1n\{\sigma_{i}\}_{i=1}^{n} are independent uniform sign variables and S2,,SDi.i.d.QˇS_{2}^{\prime},\ldots,S_{D}^{\prime}\overset{\text{i.i.d.}}{\sim}\check{Q} are independent of all other random variables.

We provide the definition of the ordinary Rademacher complexity in Section C.8 and make a comparison of the two complexity measures in terms of how they depend on the input dimensionality.

C.4 Theorem statement

Our goal is to prove the following theorem. This is a detailed version of the theorem appearing in the main body of the paper.

Theorem 3 (Excess risk bound).

Assume Assumptions 2, 3, 4, 5, 6, 7, and 8.

Then for arbitrary δ,δ(0,1)\delta,\delta^{\prime}\in(0,1), we have with probability at least 1(δ+δ)1-(\delta+\delta^{\prime}),

R(gˇ)R(g)Cj=1Dfjf^jW1,1Approximation error+4D(𝒢)+2DBlog2/δ2nEstimation error+κ1(δ,n)+DBBqκ2(ff^)Higher order terms.\begin{split}&R(\check{g})-R(g^{*})\\ &\leq\underbrace{C\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}}_{\text{Approximation error}}+\underbrace{4D\mathfrak{R}(\mathcal{G})+2DB_{\ell}\sqrt{\frac{\log 2/{\delta}}{2n}}}_{\text{Estimation error}}+\underbrace{\kappa_{1}(\delta^{\prime},n)+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})}_{\text{Higher order terms}}.\end{split}

where

C:=BqL𝒢+DB(LqLf1+BqDC1),C1:=(D+1)3/2(Bf(k=1Dfk1C1,1)+Bf1),κ1(δ,n)=𝒪(n1)/δ+𝒪(n1),κ2(ff^)=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}&C:=B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}),\\ &C_{1}^{\prime}:=(D+1)^{3/2}\left(B_{\partial f}^{\infty}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)+B_{\partial f^{-1}}^{\infty}\right),\\ &\kappa_{1}(\delta^{\prime},n)=\mathcal{O}(n^{-1})/{\delta^{\prime}}+\mathcal{O}(n^{-1}),\\ &\kappa_{2}(f-\hat{f})=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

and Cd(d=1,,D)C_{d}^{\prime}(d=1,\ldots,D) are constants determined in Lemma 11.

Proof of Theorem 3.

By adding and subtracting terms, we have

R(gˇ)R(g)=(RRˇ¯)(gˇ)(A) Approximation error+Rˇ¯(gˇ)Rˇ¯(gˇ¯)(B) Pseudo estimation error+Rˇ¯(gˇ¯)R(g)(C) Additional model misspecification error.\begin{split}&R(\check{g})-R(g^{*})=\underbrace{(R-\bar{\check{R}})(\check{g})}_{\text{\text{(A) Approximation error}}}+\underbrace{\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})}_{\text{\text{(B) Pseudo estimation error}}}+\underbrace{\bar{\check{R}}(\bar{\check{g}})-R(g^{*})}_{\text{(C) Additional model misspecification error}}.\end{split}

Applying Lemma 1 to (A), Lemma 2 to (B), and Assumption 8 to (C), we obtain the assertion. ∎

As can be seen from the proof above, Theorem 3 is proved in two parts, each corresponding to one of the two lemmas below. The first lemma evaluates the approximation error, which reflects the fact that we approximate $f$ by $\hat{f}$.

Lemma 1 (Approximation error bound).

Under Assumptions 2, 3, 4, 5, and 6, we have

(RRˇ¯)(gˇ)Cj=1Dfjf^jW1,1+DBBqκ2(ff^)\begin{split}(R-\bar{\check{R}})(\check{g})\leq&C\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})\end{split}

where CC and κ2(ff^)\kappa_{2}(f-\hat{f}) are

C:=BqL𝒢+DB(LqLf1+BqDC1),κ2(ff^):=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}C&:=B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}),\\ \kappa_{2}(f-\hat{f})&:=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

and Cd(d=1,,D)C_{d}^{\prime}(d=1,\ldots,D) are constants determined in Lemma 11.

The second lemma evaluates the pseudo estimation error which reflects the fact that we rely on a finite sample to approximate the underlying distribution.

Lemma 2 (Pseudo estimation error bound).

Assume that Assumptions 2 and 7 hold, and let the effective Rademacher complexity be defined as in Definition 1. Then for any $\delta,\delta^{\prime}\in(0,1)$, we have with probability at least $1-(\delta+\delta^{\prime})$ that

Rˇ¯(gˇ)Rˇ¯(gˇ¯)4DwD(𝒢)+2DBwDlog2/δ2n+2wD(D1)j=2DCjδnj/2+4Bj=1D1wj𝒪(n1)\begin{split}\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})\leq 4Dw_{D}\mathfrak{R}(\mathcal{G})+2DB_{\ell}w_{D}\sqrt{\frac{\log 2/{\delta}}{2n}}+\underbrace{2w_{D}(D-1)\sum_{j=2}^{D}\frac{C_{j}}{\delta^{\prime}}n^{-j/2}+4B_{\ell}\sum_{j=1}^{D-1}w_{j}}_{\text{$\mathcal{O}(n^{-1})$}}\end{split}

where {wj}j=1D\{w_{j}\}_{j=1}^{D} are universal constants determined in Lemma 3, and {Cj}j=2D\{C_{j}\}_{j=2}^{D} are constants determined in Lemma 6. Note that wj=𝒪(n(Dj))w_{j}=\mathcal{O}(n^{-(D-j)}) and wD=n(n1)(nD+1)nD<1w_{D}=\frac{n(n-1)\cdots(n-D+1)}{n^{D}}<1.

In what follows, we first present some basic facts in Section C.5 and provide the proofs for the lemmas. We provide the proof of Lemma 1 in Section C.7, and that of Lemma 2 in Section C.6.

C.5 V-statistic and U-statistic

The theoretical analysis is performed by interpreting the proposed risk estimator Eq. (5) as a V-statistic (explained shortly). The proofs will be based on applying the following facts in order:

  1. A V-statistic can be represented as a weighted average of U-statistics of degrees from $1$ to $D$, and only the degree-$D$ term is the leading term.

  2. The degree-$D$ term is further decomposed into a degree-$1$ U-statistic and a set of degenerate U-statistics.

  3. The degree-$1$ U-statistic is an i.i.d. sum admitting a Rademacher complexity bound.

  4. The degenerate terms concentrate around zero following an exponential inequality under appropriate entropy conditions.

To carry out the strategy above, we now describe what V- and U-statistics are and how they relate to each other. These estimators emerge when we allow the same data point to be re-used repeatedly within a single sample to estimate a functional that takes multiple data points.

V-statistic.

For a given regular statistical functional of degree DD Lee (1990):

QˇD~:=~(s1,,sD)qˇ(s1)qˇ(sD)ds1dsD,\begin{split}\check{Q}^{D}\tilde{\ell}:=\int\tilde{\ell}(s_{1},\cdots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D},\end{split} (6)

its associated von-Mises statistic (V-statistic) is the following quantity Lee (1990):

VnD~:=1nDi1=1niD=1n~(Si1,,SiD).\begin{split}V_{n}^{D}\tilde{\ell}:=\frac{1}{n^{D}}\sum_{i_{1}=1}^{n}\cdots\sum_{i_{D}=1}^{n}\tilde{\ell}(S_{i_{1}},\ldots,S_{i_{D}}).\end{split}

Note that Eq. (6) does not coincide with the expectation of VnD~V_{n}^{D}\tilde{\ell} in general, i.e., the V-statistic is generally not an unbiased estimator. However, it is known to be a consistent estimator of Eq. (6) (Lee, 1990).

U-statistic.

Similarly, for a jj-variate symmetric and integrable function h(x1,,xj)h(x_{1},\ldots,x_{j}), its corresponding U-statistic (Lee, 1990) of degree jj is

Unjh:=¯ρjnh(sρ(1),,sρ(j)).\begin{split}U_{n}^{j}h:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}h(s_{\rho(1)},\ldots,s_{\rho(j)}).\end{split}

The V- and U-statistics generalize the sample mean, which is both the degree-$1$ U-statistic and the degree-$1$ V-statistic. The important difference at higher degrees is that the summands may no longer be independent. To deal with this dependence, the following standard decompositions have been developed Lee (1990).

Lemma 3 (Decomposition of a V-statistic (Lee, 1990)).

A V-statistic can be expressed as a sum of U-statistics of degrees from 11 to DD (Lee, 1990, Section 4.2, Theorem 1):

VnD~=j=1DwjUnj~(j)\begin{split}V_{n}^{D}\tilde{\ell}&=\sum_{j=1}^{D}w_{j}U_{n}^{j}\tilde{\ell}^{(j)}\end{split}

where the weights wjw_{j} and jj-variate functions ~(j)\tilde{\ell}^{(j)} are

wj:=1nD|𝔊jD|(nj),~(j)(s1,,sj):=¯τ𝔊jD~(sτ(1),,sτ(D)).\begin{split}w_{j}&:=\frac{1}{n^{D}}|\mathfrak{G}_{j}^{D}|{n\choose j},\quad\tilde{\ell}^{(j)}(s_{1},\ldots,s_{j}):=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\tilde{\ell}(s_{\tau(1)},\ldots,s_{\tau(D)}).\end{split}
Proof.

See (Lee, 1990, Section 4.2, Theorem 1 (p.183)). ∎

Remark 1.

The weights {wj}j=1D\{w_{j}\}_{j=1}^{D} satisfy jwj=1\sum_{j}w_{j}=1 (Lee, 1990, Section 4.2, Theorem 1 (p.183)). We can also find the order of wjw_{j} with respect to nn as:

\begin{split}w_{D}=\frac{1}{n^{D}}\underbrace{|\mathfrak{G}_{D}^{D}|}_{\text{$D!$}}{n\choose D}=\frac{n(n-1)\cdots(n-D+1)}{n^{D}}=\mathcal{O}(1),\quad w_{j}=\mathcal{O}(n^{-(D-j)}),\quad\tilde{\ell}^{(D)}=\tilde{\ell}.\end{split}
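The decomposition of Lemma 3 and the weights of Remark 1 can be checked numerically for $D=2$, where $w_{1}=1/n$, $w_{2}=n(n-1)/n^{2}$, $\tilde{\ell}^{(1)}(s)=\tilde{\ell}(s,s)$, and $\tilde{\ell}^{(2)}=\tilde{\ell}$. The symmetric kernel below is a hypothetical example, not the loss used in the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 7
s = rng.normal(size=n)

# A hypothetical symmetric kernel of degree D = 2.
h = lambda a, b: (a - b) ** 2 / 2.0

# V-statistic: average over all n^2 ordered index pairs (with replacement).
V = np.mean([h(s[i], s[j]) for i, j in itertools.product(range(n), repeat=2)])

# U-statistics: degree 2 (indices without replacement) and degree 1, whose
# kernel is the "diagonal" h^{(1)}(s) = h(s, s) (identically 0 for this h).
U2 = np.mean([h(s[i], s[j]) for i, j in itertools.permutations(range(n), 2)])
U1 = np.mean([h(s[i], s[i]) for i in range(n)])

# Lemma 3 for D = 2: V = w1 * U1 + w2 * U2.
w1, w2 = 1.0 / n, n * (n - 1) / n ** 2
```

For this kernel, $U_{n}^{2}h$ is the unbiased sample variance and $V_{n}^{2}h$ is the biased one, so the decomposition exactly accounts for the V-statistic's bias.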
Lemma 4 (Hoeffding decomposition of a U-statistic (Sherman, 1994, p.449)).

A U-statistic with a symmetric kernel ψ\psi can be decomposed as a sum of U-statistics of degrees from 11 to DD as

UnDψ𝔼𝒟ˇUnDψ=j=1DUnjψj=𝔼^nψ1+j=2DUnjψj\begin{split}U_{n}^{D}\psi-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi&=\sum_{j=1}^{D}U_{n}^{j}\psi_{j}\\ &=\hat{\mathbb{E}}_{n}\psi_{1}+\sum_{j=2}^{D}U_{n}^{j}\psi_{j}\end{split} (7)

where {ψj}j=1D\{\psi_{j}\}_{j=1}^{D} are jj-variate, symmetric and degenerate functions. Note that 𝔼𝒟ˇUnDψ=QˇDψ{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi=\check{Q}^{D}\psi. Here, a jj-variate symmetric function ψj\psi_{j} is said degenerate when

s2,,sj,ψj(Qˇ,s2,,sj)=0.\begin{split}\forall s_{2},\ldots,s_{j},\quad\psi_{j}(\check{Q},s_{2},\ldots,s_{j})=0.\end{split}

Specifically, ψ1\psi_{1} is

ψ1(s)=ψ(s,Qˇ,,Qˇ)++ψ(Qˇ,,Qˇ,s)DQˇDψ=D(ψ(s,Qˇ,,Qˇ)QˇDψ)(by symmetry).\begin{split}\psi_{1}(s)&=\psi(s,\check{Q},\ldots,\check{Q})+\cdots+\psi(\check{Q},\ldots,\check{Q},s)-D\check{Q}^{D}\psi\\ &=D\cdot(\psi(s,\check{Q},\ldots,\check{Q})-\check{Q}^{D}\psi)\quad(\text{by symmetry}).\end{split} (8)

For further details, see (Sherman, 1994, p.449). Note that in (Sherman, 1994, p.449), Eq. (7) is written using QˇDψ\check{Q}^{D}\psi in place of 𝔼𝒟ˇUnDψ{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi. This is because

𝔼𝒟ˇUnDψ=UnD𝔼𝒟ˇψ=UnDQˇDψ=QˇDψ\begin{split}{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\psi=U_{n}^{D}{\mathbb{E}}_{\check{\mathcal{D}}}\psi=U_{n}^{D}\check{Q}^{D}\psi=\check{Q}^{D}\psi\end{split}

holds by linearity and symmetry.

Remark 2 (Connecting the lemmas to Section C.6).

It can be easily checked by definition that the proposed risk estimator Eq. (5) takes the form of a V-statistic: Rˇ(g)=VnD~\check{R}(g)=V_{n}^{D}\tilde{\ell} for each g𝒢g\in\mathcal{G}. Let us denote ~(s):=~(s,Qˇ,,Qˇ)\tilde{\ell}^{*}(s):=\tilde{\ell}(s,\check{Q},\ldots,\check{Q}). Then 𝔼𝒟ˇ~=QˇD~{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*}=\check{Q}^{D}\tilde{\ell} holds by definition. Substituting these into Eq. (8), we have that Eq. (7) applied to ψ=~\psi=\tilde{\ell} is equivalent to

UnD~𝔼𝒟ˇUnD~=D(𝔼^n~𝔼𝒟ˇ~)+j=2DUnj~j.\begin{split}U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}=D\cdot(\hat{\mathbb{E}}_{n}\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})+\sum_{j=2}^{D}U_{n}^{j}\tilde{\ell}_{j}.\end{split}

where {~j}j=2D\{\tilde{\ell}_{j}\}_{j=2}^{D} are symmetric degenerate functions. In Section C.6, we first decompose Rˇ(g)\check{R}(g) into a sum of U-statistics. After such conversion, we take a closer look at the leading term, 𝔼^n~\hat{\mathbb{E}}_{n}\tilde{\ell}^{*}.

C.6 Proof of pseudo estimation error bound

(Proof of Lemma 2).

First, we have

Rˇ¯(gˇ)Rˇ¯(gˇ¯)=Rˇ¯(gˇ)Rˇ(gˇ)+Rˇ(gˇ)Rˇ¯(gˇ¯)Rˇ¯(gˇ)Rˇ(gˇ)+Rˇ(gˇ¯)Rˇ¯(gˇ¯)2supg𝒢|Rˇ(g)Rˇ¯(g)|.\begin{split}\bar{\check{R}}(\check{g})-\bar{\check{R}}(\bar{\check{g}})=\bar{\check{R}}(\check{g})-\check{R}(\check{g})+\check{R}(\check{g})-\bar{\check{R}}(\bar{\check{g}})&\leq\bar{\check{R}}(\check{g})-\check{R}(\check{g})+\check{R}(\bar{\check{g}})-\bar{\check{R}}(\bar{\check{g}})\\ &\leq 2\sup_{g\in\mathcal{G}}\left|\check{R}(g)-\bar{\check{R}}(g)\right|.\end{split}

Now the right-most expression can be decomposed as

\begin{split}&\sup_{g\in\mathcal{G}}|\check{R}(g)-\bar{\check{R}}(g)|=\sup_{g\in\mathcal{G}}|V_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}V_{n}^{D}\tilde{\ell}|\\
&\leq w_{D}\sup_{g\in\mathcal{G}}\left|U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}\right|+\sum_{j=1}^{D-1}w_{j}\sup_{g\in\mathcal{G}}\left|U_{n}^{j}\tilde{\ell}^{(j)}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}\right|\quad(\because\text{Lemma 3})\\
&\leq w_{D}\sup_{g\in\mathcal{G}}\left|U_{n}^{D}\tilde{\ell}-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}\right|+2B_{\ell}\sum_{j=1}^{D-1}w_{j}\\
&\leq w_{D}\left(\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}\tilde{\ell}_{1}|+\sum_{j=2}^{D}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|\right)+2B_{\ell}\sum_{j=1}^{D-1}w_{j}\quad(\because\text{Lemma 4})\\
&=w_{D}\left(\underbrace{\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}D(\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})|}_{\text{Addressed in Lemma 5}}+\underbrace{\sum_{j=2}^{D}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|}_{\text{Addressed in Lemma 6}}\right)+2B_{\ell}\sum_{j=1}^{D-1}w_{j},\end{split}

where ~j\tilde{\ell}_{j} are symmetric degenerate functions and ~\tilde{\ell}^{*} is defined as in Remark 2. Applying Lemma 5 to the first term and Lemma 6 to the second term with the union bound, we obtain the assertion. ∎

In the last part of the proof we used the following lemmas. Because the leading term is an i.i.d. sum, the following Rademacher complexity bound can be proved.

Lemma 5 (U-process bound: the leading term).

Assume Assumption 2 holds. Then, we have with probability at least 1δ1-\delta,

supg𝒢|𝔼^n(~𝔼𝒟ˇ~)|2(𝒢)+Blog(2/δ)2n,\begin{split}\sup_{g\in\mathcal{G}}|\hat{\mathbb{E}}_{n}(\tilde{\ell}^{*}-{\mathbb{E}}_{\check{\mathcal{D}}}\tilde{\ell}^{*})|\leq 2\mathfrak{R}(\mathcal{G})+B_{\ell}\sqrt{\frac{\log(2/\delta)}{2n}},\end{split}

where \mathfrak{R} is defined in Definition 1.

Proof.

Applying the standard one-sided Rademacher complexity bound based on McDiarmid’s inequality (Mohri et al., 2012, Theorem 3.1) twice with the union bound, we obtain the lemma. ∎

The terms other than the leading term are degenerate U-statistics; hence, the following bound holds under appropriate entropy assumptions.

Lemma 6 (U-process bound: degenerate terms (Sherman, 1994, Corollary 7)).

Assume Assumption 7 holds. Then, for each j=2,,Dj=2,\ldots,D, there exists a constant CjC_{j} such that for any δ(0,1)\delta^{\prime}\in(0,1), we have with probability at least 1δ/(D1)1-\delta^{\prime}/(D-1),

supg𝒢|Unj~j|(D1)δCjnj/2\begin{split}\sup_{g\in\mathcal{G}}|U_{n}^{j}\tilde{\ell}_{j}|\leq\frac{(D-1)}{\delta^{\prime}}C_{j}n^{-j/2}\end{split}

where CjC_{j} depends only on AA, VV, and BB_{\ell}.

Proof.

The proof follows a similar path as that of (Sherman, 1994, Corollary 7), but we provide more explicit expressions to inspect the order with respect to nn. Let Φ𝒢,f^(j):={~j:g𝒢}\Phi_{\mathcal{G},\hat{f}}^{(j)}:=\{\tilde{\ell}_{j}:g\in\mathcal{G}\}. Then Φ𝒢,f^(j)\Phi_{\mathcal{G},\hat{f}}^{(j)} is Euclidean for an envelope FjF_{j} satisfying QˇjFj2<\check{Q}^{j}F_{j}^{2}<\infty by Lemma 6 in Sherman (1994) and Assumption 7. In addition, Φ𝒢,f^(j)\Phi_{\mathcal{G},\hat{f}}^{(j)} is a set of functions degenerate with respect to Qˇ\check{Q}. Without loss of generality, we can take FjF_{j} such that FjBF_{j}\leq B_{\ell}. Similarly to the proof of (Sherman, 1994, Corollary 4), we can apply (Sherman, 1994, Main Corollary) with p=1p=1 in their notation to obtain

𝔼𝒟ˇsupg𝒢|nj/2Unj~j|ΓA1/2mp(QˇjFj2)(ϵ+α)/2ΓA1/2mp(B)ϵ+α=:Cj\begin{split}{\mathbb{E}}_{\check{\mathcal{D}}}\sup_{g\in\mathcal{G}}|n^{j/2}U_{n}^{j}\tilde{\ell}_{j}|\leq\Gamma A^{1/2mp}(\check{Q}^{j}F_{j}^{2})^{(\epsilon+\alpha)/2}\leq\underbrace{\Gamma A^{1/2mp}(B_{\ell})^{\epsilon+\alpha}}_{\text{$=:C_{j}$}}\end{split}

where Γ\Gamma is a universal constant (Sherman, 1994, Main Corollary), ϵ(0,1)\epsilon\in(0,1) and mm are chosen to satisfy 1V/2m>1ϵ1-V/2m>1-\epsilon, and α=1V/2m\alpha=1-V/2m. By applying Markov's inequality, we have for arbitrary u>0u>0,

𝒟ˇ(supg𝒢|nj/2Unj~j|>u)Cju,\begin{split}\mathbb{P}_{\check{\mathcal{D}}}\left(\sup_{g\in\mathcal{G}}|n^{j/2}U_{n}^{j}\tilde{\ell}_{j}|>u\right)\leq\frac{C_{j}}{u},\end{split}

where 𝒟ˇ(E)\mathbb{P}_{\check{\mathcal{D}}}(E) denotes the probability of the event EE with respect to 𝒟ˇ\check{\mathcal{D}}. Equating the right-hand side with δ/(D1)\delta^{\prime}/(D-1) and solving for uu, we obtain the result. ∎
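As a quick numerical sanity check of the Markov step (an illustration, not part of the analysis), the following snippet verifies P(X > u) <= E[X]/u exactly on an arbitrary finite nonnegative distribution; the support points and probabilities are toy values.

```python
# Exact check of Markov's inequality P(X > u) <= E[X] / u for a
# nonnegative random variable on a small finite distribution.
values = [0.5, 1.0, 2.0, 8.0]   # nonnegative support points (toy values)
probs = [0.4, 0.3, 0.2, 0.1]    # probabilities summing to one

mean = sum(v * p for v, p in zip(values, probs))

def tail_prob(u):
    """P(X > u) for the finite distribution above."""
    return sum(p for v, p in zip(values, probs) if v > u)

for u in [0.25, 1.0, 3.0, 7.5]:
    assert tail_prob(u) <= mean / u + 1e-12
```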

C.7 Proof of approximation error bound

(Proof of Lemma 1).

Due to Lemma 3, we have

supg𝒢(R(g)Rˇ¯(g))=supg𝒢(R(g)𝔼𝒟ˇVnD~)=supg𝒢(j=1Dwj(R(g)𝔼𝒟ˇUnj~(j)))wDsupg𝒢(R(g)𝔼𝒟ˇUnD~(D))+2Bj=1D1wj𝒪(n(Dj))wDsupg𝒢(R(g)𝔼𝒟ˇUnD~(D))+2B𝒪(n1)\begin{split}\sup_{g\in\mathcal{G}}\left(R(g)-\bar{\check{R}}(g)\right)&=\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}V_{n}^{D}\tilde{\ell}\right)\\ &=\sup_{g\in\mathcal{G}}\left(\sum_{j=1}^{D}w_{j}(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)})\right)\\ &\leq w_{D}\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)+2B_{\ell}\sum_{j=1}^{D-1}\underbrace{w_{j}}_{\text{$\mathcal{O}(n^{-(D-j)})$}}\\ &\leq w_{D}\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)+2B_{\ell}\mathcal{O}(n^{-1})\end{split}

By applying Lemma 7 (with j=Dj=D), we obtain

supg𝒢(R(g)𝔼𝒟ˇUnD~(D))supg𝒢(g,f())(g,f^())L1(q)+DBqqˇL1.\begin{split}&\sup_{g\in\mathcal{G}}\left(R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{D}\tilde{\ell}^{(D)}\right)\leq\sup_{g\in\mathcal{G}}\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}+DB_{\ell}\left\|q-\check{q}\right\|_{{L^{1}}}.\\ \end{split}

The right-hand side can be further bounded, by applying Lemmas 8 and 9, as

BqL𝒢j=1Dfjf^jW1,1+DB((LqLf1+BqDC1)j=1Dfjf^jW1,1+Bqκ2(ff^))(BqL𝒢+DB(LqLf1+BqDC1))j=1Dfjf^jW1,1+DBBqκ2(ff^)\begin{split}&B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}\left((L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime})\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+B_{q}\kappa_{2}(f-\hat{f})\right)\\ &\leq(B_{q}L_{\ell_{\mathcal{G}}}+DB_{\ell}(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime}))\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+DB_{\ell}B_{q}\kappa_{2}(f-\hat{f})\end{split}

which proves the assertion of the lemma. ∎

The above proof combined three approximation bounds, which are shown in the following lemmas. The first lemma reduces the difference between the risk and the expectation of the U-statistic to the differences in the loss function and the density function. Although we apply Lemma 7 only with j=Dj=D, we prove its general form for j[D]j\in[D].

Lemma 7 (Approximation bound for U-statistic of degree-jj).

Fix j[D]j\in[D]. Assume Assumption 2 holds. Then, we have for any g𝒢g\in\mathcal{G},

R(g)𝔼𝒟ˇUnj~(j)(g,f())(g,f^())L1(q)+jBqqˇL1\begin{split}R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}\leq\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}+jB_{\ell}\left\|q-\check{q}\right\|_{{L^{1}}}\end{split}
Proof.

Let us define a DD-variate function \ell^{\dagger} and a jj-variate function (j)\ell^{\dagger(j)} (similarly to ~\tilde{\ell} and ~(j)\tilde{\ell}^{(j)}, respectively) by

(s1,,sD):=¯π𝔖D(g,f(sπ(1)(1),,sπ(D)(D))),(j)(s1,,sj):=¯τ𝔊jD(sτ(1),,sτ(D)).\begin{split}\ell^{\dagger}(s_{1},\ldots,s_{D})&:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,f(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)})),\\ \ell^{\dagger(j)}(s_{1},\ldots,s_{j})&:=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\ell^{\dagger}(s_{\tau(1)},\ldots,s_{\tau(D)}).\end{split}

Then, recalling Q𝒬Q\in\mathcal{Q}, we can show R(g)=Qn(Unj(j))R(g)=Q^{n}(U_{n}^{j}\ell^{\dagger(j)}) because

Qn(Unj(j))=Qn(¯ρjn(j)(Sρ(1),,Sρ(j)))=Qn(¯ρjn¯τ𝔊jD(Sρτ(1),,Sρτ(D)))=Qn(¯ρjn¯τ𝔊jD¯π𝔖D(g,f(Sρτπ(1)(1),,Sρτπ(D)(D))))=¯ρjn¯τ𝔊jD¯π𝔖DQn(g,f(Sρτπ(1)(1),,Sρτπ(D)(D)))=¯ρjn¯τ𝔊jD¯π𝔖DQ[(g,f(S(1),,S(D)))](Q𝒬)=¯ρjn¯τ𝔊jD¯π𝔖DR(g)=R(g).\begin{split}Q^{n}(U_{n}^{j}\ell^{\dagger(j)})&=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\ell^{\dagger(j)}(S_{\rho(1)},\ldots,S_{\rho(j)}))\\ &=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\ell^{\dagger}(S_{\rho\circ\tau(1)},\ldots,S_{\rho\circ\tau(D)}))\\ &=Q^{n}(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,f(S_{\rho\circ\tau\circ\pi(1)}^{(1)},\ldots,S_{\rho\circ\tau\circ\pi(D)}^{(D)})))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q^{n}\ell(g,f(S_{\rho\circ\tau\circ\pi(1)}^{(1)},\ldots,S_{\rho\circ\tau\circ\pi(D)}^{(D)}))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q[\ell(g,f(S^{(1)},\ldots,S^{(D)}))]\quad(\because Q\in\mathcal{Q})\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}R(g)=R(g).\end{split}

Combining this expression with Lemma 3,

R(g)𝔼𝒟ˇUnj~(j)=Qn(Unj(j))Qˇn(Unj~(j))=Qn(Unj(j)Unj~(j))A+(QnQˇn)(Unj~(j))B\begin{split}R(g)-{\mathbb{E}}_{\check{\mathcal{D}}}U_{n}^{j}\tilde{\ell}^{(j)}&=Q^{n}(U_{n}^{j}\ell^{\dagger(j)})-\check{Q}^{n}(U_{n}^{j}\tilde{\ell}^{(j)})\\ &=\underbrace{Q^{n}(U_{n}^{j}\ell^{\dagger(j)}-U_{n}^{j}\tilde{\ell}^{(j)})}_{\text{$A$}}+\underbrace{(Q^{n}-\check{Q}^{n})(U_{n}^{j}\tilde{\ell}^{(j)})}_{\text{$B$}}\\ \end{split}

Now, AA can be bounded from above as

A=Qn(Unj(j)Unj~(j))=¯ρjn¯τ𝔊jD¯π𝔖DQn((g,f(Sρτπ(1)(1),,Sρτπ(D)(D)))(g,f^(Sρτπ(1)(1),,Sρτπ(D)(D))))=¯ρjn¯τ𝔊jD¯π𝔖DQ((g,f(S(1),,S(D)))(g,f^(S(1),,S(D))))(Q𝒬)(g,f())(g,f^())L1(q)\begin{split}A&=Q^{n}(U_{n}^{j}\ell^{\dagger(j)}-U_{n}^{j}\tilde{\ell}^{(j)})\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q^{n}(\ell(g,f(S^{(1)}_{\rho\circ\tau\circ\pi(1)},\ldots,S^{(D)}_{\rho\circ\tau\circ\pi(D)}))-\ell(g,\hat{f}(S^{(1)}_{\rho\circ\tau\circ\pi(1)},\ldots,S^{(D)}_{\rho\circ\tau\circ\pi(D)})))\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\tau\in\mathfrak{G}_{j}^{D}}\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}Q(\ell(g,f(S^{(1)},\ldots,S^{(D)}))-\ell(g,\hat{f}(S^{(1)},\ldots,S^{(D)})))\quad(\because Q\in\mathcal{Q})\\ &\leq\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}\end{split}

Then recalling Assumption 2, we can bound BB from above as

B=(QnQˇn)(Unj~(j))=(QnQˇn)(¯ρjn~(j)(Sρ(1),,Sρ(j)))=¯ρjn(QnQˇn)(~(j)(Sρ(1),,Sρ(j)))=(QjQˇj)(~(j)(S1,,Sj))( symmetry)B|i=1jq(si)i=1jqˇ(si)|ds1dsj=B|i=1jq(s1)q(si1)(q(si)qˇ(si))qˇ(si+1)qˇ(sj)|ds1dsjBi=1jq(s1)q(si1)|q(si)qˇ(si)|qˇ(si+1)qˇ(sj)ds1dsj=Bi=1j|q(si)qˇ(si)|dsi=BjqqˇL1,\begin{split}B&=(Q^{n}-\check{Q}^{n})(U_{n}^{j}\tilde{\ell}^{(j)})=(Q^{n}-\check{Q}^{n})\left(\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}\tilde{\ell}^{(j)}(S_{\rho(1)},\ldots,S_{\rho(j)})\right)\\ &=\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\rho\in\mathfrak{I}_{j}^{n}}(Q^{n}-\check{Q}^{n})\left(\tilde{\ell}^{(j)}(S_{\rho(1)},\ldots,S_{\rho(j)})\right)=(Q^{j}-\check{Q}^{j})(\tilde{\ell}^{(j)}(S_{1},\ldots,S_{j}))\quad(\because\text{ symmetry})\\ &\leq B_{\ell}\int\left|\prod_{i=1}^{j}q(s_{i})-\prod_{i=1}^{j}\check{q}(s_{i})\right|\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &=B_{\ell}\int\left|\sum_{i=1}^{j}q(s_{1})\cdots q(s_{i-1})\cdot(q(s_{i})-\check{q}(s_{i}))\cdot\check{q}(s_{i+1})\cdots\check{q}(s_{j})\right|\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &\leq B_{\ell}\sum_{i=1}^{j}\int q(s_{1})\cdots q(s_{i-1})\cdot|q(s_{i})-\check{q}(s_{i})|\cdot\check{q}(s_{i+1})\cdots\check{q}(s_{j})\mathrm{d}s_{1}\cdots\mathrm{d}s_{j}\\ &=B_{\ell}\sum_{i=1}^{j}\int|q(s_{i})-\check{q}(s_{i})|\mathrm{d}s_{i}=B_{\ell}\cdot j\left\|q-\check{q}\right\|_{{L^{1}}},\\ \end{split}

which proves the assertion. ∎
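The telescoping argument above bounds the L¹ distance between jj-fold product densities by jj times the L¹ distance between the factors. As an illustration (not part of the analysis), the same bound can be verified exactly for discrete distributions; the distributions below are arbitrary toy choices.

```python
import itertools
import math
import random

random.seed(0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Two arbitrary discrete distributions on 5 points (toy choices).
q = normalize([random.random() for _ in range(5)])
qc = normalize([random.random() for _ in range(5)])

def product_l1_gap(j):
    """L1 distance between the j-fold product distributions of q and qc."""
    return sum(
        abs(math.prod(q[i] for i in idx) - math.prod(qc[i] for i in idx))
        for idx in itertools.product(range(5), repeat=j)
    )

marginal_l1 = sum(abs(a - b) for a, b in zip(q, qc))

for j in range(1, 4):
    # Telescoping bound: ||q^j - qc^j||_1 <= j * ||q - qc||_1
    assert product_l1_gap(j) <= j * marginal_l1 + 1e-12
```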

The following lemmas bound each approximation term in terms of the difference between ff and f^\hat{f}.

Lemma 8 (Loss difference approximation).

Assume Assumption 5. Then we have for any g𝒢g\in\mathcal{G},

(g,f())(g,f^())L1(q)BqL𝒢j=1Dfjf^jW1,1\begin{split}\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}\leq B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\end{split}
Proof.
(g,f())(g,f^())L1(q)=|(g,f(s))(g,f^(s))|q(s)dsBqL𝒢f(s)f^(s)2dsBqL𝒢f(s)f^(s)1dsBqL𝒢j=1Dfjf^jW1,1.\begin{split}&\left\|\ell(g,f(\cdot))-\ell(g,\hat{f}(\cdot))\right\|_{{L^{1}}(q)}=\int|\ell(g,f(s))-\ell(g,\hat{f}(s))|q(s)\mathrm{d}s\\ &\qquad\leq B_{q}\int L_{\ell_{\mathcal{G}}}\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}\mathrm{d}s\\ &\qquad\leq B_{q}L_{\ell_{\mathcal{G}}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}\mathrm{d}s\leq B_{q}L_{\ell_{\mathcal{G}}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}.\\ \end{split}

Lemma 9 (Density difference approximation).

Assume Assumptions 2, 3, and 4. Then we have

qqˇL1(LqLf1+BqDC1)j=1Dfjf^jW1,1+Bqκ2(ff^)\begin{split}\left\|q-\check{q}\right\|_{{L^{1}}}\leq(L_{q}L_{f^{-1}}+B_{q}DC_{1}^{\prime})\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+B_{q}\kappa_{2}(f-\hat{f})\\ \end{split}

where C1C_{1}^{\prime} and κ2(ff^)\kappa_{2}(f-\hat{f}) are defined as in Lemma 11.

Proof.

Since qˇ(s)=q(f1f^(s))|(Jf1f^)(s)|\check{q}(s)=q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|, we have

qqˇL1=|q(s)q(f1f^(s))|(Jf1f^)(s)||ds|q(s)q(f1f^(s))|ds+q(f1f^(s))|1|(Jf1f^)(s)||ds|q(s)q(f1f^(s))|ds(A)+Bq|1(Jf1f^)(s)|ds(B)\begin{split}\left\|q-\check{q}\right\|_{{L^{1}}}&=\int\left|q(s)-q(f^{-1}\circ\hat{f}(s))\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\right|\mathrm{d}s\\ &\leq\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s+\int q(f^{-1}\circ\hat{f}(s))\left|1-\left|(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\right|\mathrm{d}s\\ &\leq\underbrace{\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s}_{\text{(A)}}+B_{q}\underbrace{\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s}_{\text{(B)}}\\ \end{split}

where the last line follows from qBqq\leq B_{q} and the reverse triangle inequality. Applying Lemma 10 to (A) and Lemma 11 to (B) yields the assertion. ∎

Lemma 10.

Assume Assumptions 2 and 3. Then,

|q(s)q(f1f^(s))|dsLqLf1j=1Dfjf^jW1,1\begin{split}\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s\leq L_{q}L_{f^{-1}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\end{split}
Proof.

We have

|q(s)q(f1f^(s))|ds=|q(f1f(s))q(f1f^(s))|dsLqLf1f(s)f^(s)2dsLqLf1f(s)f^(s)1dsLqLf1j=1Dfjf^jW1,1\begin{split}&\int|q(s)-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s=\int|q(f^{-1}\circ f(s))-q(f^{-1}\circ\hat{f}(s))|\mathrm{d}s\\ &\qquad\leq L_{q}L_{f^{-1}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}\mathrm{d}s\leq L_{q}L_{f^{-1}}\int\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}\mathrm{d}s\\ &\qquad\leq L_{q}L_{f^{-1}}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}\\ \end{split}

Lemma 11 (Jacobian difference approximation).

Assume Assumptions 2 and 4. Then,

|1(Jf1f^)(s)|dsDC1j=1Dfjf^jW1,1+κ2(ff^),\begin{split}\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s&\leq DC_{1}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+\kappa_{2}(f-\hat{f}),\end{split}

where

Cd:=(D+1)72d2((Bf)d(k=1Dfk1C1,1)d+(Bf1)d),κ2(ff^):=d=2D(Dd)Cdj=1Dfjf^jW1,dd.\begin{split}C_{d}^{\prime}&:=(D+1)^{\frac{7}{2}d-2}\left((B_{\partial f}^{\infty})^{d}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}+(B_{\partial f^{-1}}^{\infty})^{d}\right),\\ \kappa_{2}(f-\hat{f})&:=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\\ \end{split}
Proof.

Applying Lemma 12 with A:=(Jf1f)(s)=IA:=(\mathrm{J}f^{-1}\circ f)(s)=I and E:=(Jf1f^)(s)AE:=(\mathrm{J}f^{-1}\circ\hat{f})(s)-A, we obtain

|1(Jf1f^)(s)|ds=|(Jf1f)(s)(Jf1f^)(s)|dsd=1D(Dd)df1fds(s)df1f^ds(s)opdds.\begin{split}\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s&=\int\left|(\mathrm{J}f^{-1}\circ f)(s)-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s\\ &\leq\int\sum_{d=1}^{D}{D\choose d}\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s.\\ \end{split}

Now, each term in the integrand can be bounded from above as

\begin{split}&\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &=\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right)-\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}\\ &\leq\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}+\left\|\left(\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right)\left(\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right)\right\|_{\mathrm{op}}\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}+\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\qquad\quad(\because\text{ submultiplicativity (Golub and Van Loan, 2013, Section~2.3.2)})\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\left(D\cdot\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)\right\|_{\infty}\right)+\left(D\cdot\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\infty}\right)\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\qquad\quad(\because\left\|\cdot\right\|_{\mathrm{op}}\leq D\left\|\cdot\right\|_{\infty}\text{ (Golub and Van Loan, 2013, Section~2.3.2)})\\ &\leq\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}}\cdot\left(DB_{\partial f}^{\infty}\right)\ +\ \left(DB_{\partial f^{-1}}^{\infty}\right)\cdot\left\|\frac{\mathrm{d}f}{\mathrm{d}s}(s)-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}\\ &\leq DB_{\partial f}^{\infty}\sqrt{D}\left\|\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(f(s))-\frac{\mathrm{d}f^{-1}}{\mathrm{d}z}(\hat{f}(s))\right\|_{\mathrm{op}(1)}+DB_{\partial f^{-1}}^{\infty}\sqrt{D}\left\|\frac{\mathrm{d}f}{\mathrm{d}s}-\frac{\mathrm{d}\hat{f}}{\mathrm{d}s}\right\|_{\mathrm{op}(1)}\\ &\qquad\quad(\because\left\|\cdot\right\|_{\mathrm{op}}\leq\sqrt{D}\left\|\cdot\right\|_{\mathrm{op}(1)}\text{ (Golub and Van Loan, 2013, Section~2.3.1)})\\ &=D^{\frac{3}{2}}B_{\partial f}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left|\frac{\partial f^{-1}_{j}}{\partial z_{k}}(f(s))-\frac{\partial f^{-1}_{j}}{\partial z_{k}}(\hat{f}(s))\right|+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\leq D^{\frac{3}{2}}B_{\partial f}^{\infty}\max_{k\in[D]}\sum_{j=1}^{D}\left\|f^{-1}_{j}\right\|_{C^{1,1}}\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{2}}}+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\sum_{k=1}^{D}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\leq D^{\frac{3}{2}}B_{\partial f}^{\infty}\left(\sum_{j=1}^{D}\left\|f^{-1}_{j}\right\|_{C^{1,1}}\right)\left\|f(s)-\hat{f}(s)\right\|_{{\ell^{1}}}+D^{\frac{3}{2}}B_{\partial f^{-1}}^{\infty}\sum_{k=1}^{D}\sum_{j=1}^{D}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\\ &\qquad\quad(\because\left\|\cdot\right\|_{{\ell^{2}}}\leq\left\|\cdot\right\|_{{\ell^{1}}}\text{ (Golub and Van Loan, 2013, Section~2.2.2)}).\end{split}

Raising this bound to the dd-th power, we obtain

df1fds(s)df1f^ds(s)opd(D2+D)d1[j=1D(D3/2Bf(k=1Dfk1C1,1)|fj(s)f^j(s)|)d+k=1Dj=1D(D3/2Bf1|fjsk(s)f^jsk(s)|)d]\begin{split}&\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\\ &\leq(D^{2}+D)^{d-1}\biggl{[}\sum_{j=1}^{D}\left(D^{3/2}B_{\partial f}^{\infty}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)|f_{j}(s)-\hat{f}_{j}(s)|\right)^{d}\\ &\qquad\qquad\qquad\qquad+\sum_{k=1}^{D}\sum_{j=1}^{D}\left(D^{3/2}B_{\partial f^{-1}}^{\infty}\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|\right)^{d}\biggr{]}\end{split}

where we used (i=1Lai)dLd1(i=1Laid)(\sum_{i=1}^{L}a_{i})^{d}\leq L^{d-1}(\sum_{i=1}^{L}a_{i}^{d}) for ai0a_{i}\geq 0, which follows from Hölder's inequality. Hence,

df1fds(s)df1f^ds(s)opddsD52d1(D+1)d1[(Bfk=1Dfk1C1,1)dj=1D|fj(s)f^j(s)|dds+(Bf1)dj=1D(k=1D|fjsk(s)f^jsk(s)|dds)](D+1)72d2((Bf)d(k=1Dfk1C1,1)dj=1Dfjf^jW1,dd+(Bf1)dj=1Dfjf^jW1,dd)Cdj=1Dfjf^jW1,dd.\begin{split}&\int\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s\\ &\leq D^{\frac{5}{2}d-1}(D+1)^{d-1}\biggl{[}\left(B_{\partial f}^{\infty}\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}\sum_{j=1}^{D}\int|f_{j}(s)-\hat{f}_{j}(s)|^{d}\mathrm{d}s\\ &\qquad\qquad\qquad\qquad\qquad\qquad+(B_{\partial f^{-1}}^{\infty})^{d}\sum_{j=1}^{D}\left(\sum_{k=1}^{D}\int\left|\frac{\partial f_{j}}{\partial s_{k}}(s)-\frac{\partial\hat{f}_{j}}{\partial s_{k}}(s)\right|^{d}\mathrm{d}s\right)\biggr{]}\\ &\leq(D+1)^{\frac{7}{2}d-2}\left((B_{\partial f}^{\infty})^{d}\left(\sum_{k=1}^{D}\left\|f^{-1}_{k}\right\|_{C^{1,1}}\right)^{d}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}+(B_{\partial f^{-1}}^{\infty})^{d}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}\right)\\ &\leq C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}.\end{split}

Therefore,

|1(Jf1f^)(s)|dsd=1D(Dd)df1fds(s)df1f^ds(s)opddsDC1j=1Dfjf^jW1,1+d=2D(Dd)Cdj=1Dfjf^jW1,ddκ2(ff^)\begin{split}&\int\left|1-(\mathrm{J}f^{-1}\circ\hat{f})(s)\right|\mathrm{d}s\\ &\leq\sum_{d=1}^{D}{D\choose d}\int\left\|\frac{\mathrm{d}f^{-1}\circ f}{\mathrm{d}s}(s)-\frac{\mathrm{d}f^{-1}\circ\hat{f}}{\mathrm{d}s}(s)\right\|_{\mathrm{op}}^{d}\mathrm{d}s\\ &\leq DC_{1}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}+\underbrace{\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}}_{\text{$\kappa_{2}(f-\hat{f})$}}\end{split}
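The elementary inequality (sum_i a_i)^d <= L^(d-1) * sum_i a_i^d for nonnegative a_i, used in the proof above, can be sanity-checked numerically; the values below are arbitrary toy choices, not part of the analysis.

```python
import random

random.seed(1)

def power_sum_bound_holds(a, d):
    """Check (sum a_i)^d <= L^(d-1) * sum(a_i^d) for nonnegative a_i, L = len(a)."""
    L = len(a)
    return sum(a) ** d <= L ** (d - 1) * sum(x ** d for x in a) + 1e-9

for _ in range(100):
    a = [random.random() for _ in range(random.randint(1, 6))]
    for d in range(1, 5):
        assert power_sum_bound_holds(a, d)
```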

Lemma 11 used the following lemma to bound the difference in Jacobian determinants.

Lemma 12 (Determinant perturbation bound (Ipsen and Rehman, 2008, Corollary 2.11)).

Let AA and EE be D×DD\times D complex matrices. Then,

|det(A)det(A+E)|d=1D(Dd)AopDdEopd.\begin{split}|\det(A)-\det(A+E)|\leq\sum_{d=1}^{D}{D\choose d}\left\|A\right\|_{\mathrm{op}}^{D-d}\left\|E\right\|_{\mathrm{op}}^{d}.\end{split}
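As a numerical sanity check of Lemma 12 (an illustration, not part of the analysis), the following sketch verifies the bound for D=2 on random 2×2 real matrices, computing the spectral norm in closed form; all constants and ranges are toy choices.

```python
import math
import random

random.seed(2)

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def opnorm2(M):
    """Spectral norm (largest singular value) of a 2x2 matrix."""
    (a, b), (c, d) = M
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d  # entries of M^T M
    disc = math.sqrt((p - r) ** 2 + 4 * q * q)
    return math.sqrt((p + r + disc) / 2)

def perturbation_bound_holds(A, E):
    """|det(A) - det(A+E)| <= C(2,1)*||A||*||E|| + C(2,2)*||E||^2 (Lemma 12, D=2)."""
    ApE = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    lhs = abs(det2(A) - det2(ApE))
    rhs = 2 * opnorm2(A) * opnorm2(E) + opnorm2(E) ** 2
    return lhs <= rhs + 1e-9

for _ in range(200):
    A = [[random.uniform(-2, 2) for _ in range(2)] for _ in range(2)]
    E = [[random.uniform(-2, 2) for _ in range(2)] for _ in range(2)]
    assert perturbation_bound_holds(A, E)
```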

C.8 Comparison of Rademacher complexities

The following consideration demonstrates how the effective complexity measure ℜ\mathfrak{R} in Theorem 3, resulting from the proposed method, may enjoy a relaxed dependence on the input dimensionality compared with ordinary empirical risk minimization. To do so, we first recall the definition of the ordinary Rademacher complexity and a standard performance guarantee based on it.

Definition 2 (Ordinary Rademacher complexity).

The ordinary empirical risk minimization finds the candidate hypothesis by

g^argming𝒢R^(g),\begin{split}\hat{g}\in\mathop{\rm argmin}\limits_{g\in\mathcal{G}}\hat{R}(g),\end{split}

where

R^(g):=1ni=1n(g,Zi)=1ni=1n(g,f^(Si(1),,Si(D)))\begin{split}\hat{R}(g):=\frac{1}{n}\sum_{i=1}^{n}\ell(g,Z_{i})=\frac{1}{n}\sum_{i=1}^{n}\ell(g,\hat{f}(S_{i}^{(1)},\ldots,S_{i}^{(D)}))\end{split}

and the corresponding ordinary Rademacher complexity ord(𝒢)\mathfrak{R}_{\mathrm{ord}}(\mathcal{G}) is

ord(𝒢):=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi(Si(1),,Si(D))|]\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{G})&:=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}\ell(S_{i}^{(1)},\ldots,S_{i}^{(D)})\right|\right]\end{split}

where {σi}i=1n\{\sigma_{i}\}_{i=1}^{n} are independent uniform sign variables and we denoted (s(1),,s(D))=(g,f^(s(1),,s(D)))\ell(s^{(1)},\ldots,s^{(D)})=\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)})) by abuse of notation. This yields the standard Rademacher complexity based bound: applying the same proof technique as in Lemma 5, we have that with probability at least 1δ1-\delta,

R(g^)R(g)2supg𝒢|R(g)R^(g)|4ord(𝒢)+2Blog(2/δ)2n.\begin{split}R(\hat{g})-R(g^{*})\leq 2\sup_{g\in\mathcal{G}}|R(g)-\hat{R}(g)|\leq 4\mathfrak{R}_{\mathrm{ord}}(\mathcal{G})+2B_{\ell}\sqrt{\frac{\log(2/\delta)}{2n}}.\end{split}

Therefore, the corresponding complexity terms are ord(𝒢)\mathfrak{R}_{\mathrm{ord}}(\mathcal{G}) and D(𝒢)D\mathfrak{R}(\mathcal{G}). In Remark 3, we compare these two complexity measures with an example. To recall, the effective Rademacher complexity can be written, in terms of the notation in this section, as

(𝒢)=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S2,,SD~(Si,S2,,SD)|]=1n𝔼𝒟ˇ𝔼σ[supg𝒢|i=1nσi𝔼S1,,SD1D((Si(1),S2(2),,SD(D))++(S1(1),S2(2),,Si(D)))|]\begin{split}&\mathfrak{R}(\mathcal{G})=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{2},\ldots,S^{\prime}_{D}}\tilde{\ell}(S_{i},S_{2}^{\prime},\ldots,S_{D}^{\prime})\right|\right]\\ &=\frac{1}{n}{\mathbb{E}}_{\check{\mathcal{D}}}{\mathbb{E}}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_{i}{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}\frac{1}{D}\left(\ell(S_{i}^{(1)},S_{2}^{\prime(2)},\ldots,S_{D}^{\prime(D)})+\cdots+\ell(S_{1}^{\prime(1)},S_{2}^{\prime(2)},\ldots,S_{i}^{(D)})\right)\right|\right]\\ \end{split}
Remark 3 (Comparison of Rademacher complexities).

As an example, consider \mathcal{H}, the set of LL-Lipschitz functions (with respect to infinity norm) on the unit cube [0,1]d[0,1]^{d}. It is well-known that there exists a constant C>0C>0 such that the following holds (Wainwright, 2019, Example 5.10, p.129) for sufficiently small t>0t>0:

log𝒩(t,,)(C/t)d.\begin{split}\log\mathcal{N}(t,\mathcal{H},\left\|\cdot\right\|_{\infty})\asymp(C/t)^{d}.\end{split} (9)

Here, a(t)b(t)a(t)\asymp b(t) indicates that there exist k1,k2>0k_{1},k_{2}>0 such that, for sufficiently small tt, it holds that k1b(t)a(t)k2b(t)k_{1}b(t)\leq a(t)\leq k_{2}b(t). On the other hand, the well-known discretization argument implies that there exist constants cc and BB such that for any t(0,B]t\in(0,B], the following relation between the Rademacher complexity and the metric entropy holds:

ord()t+clog𝒩(t,,)n.\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})\leq t+c\sqrt{\frac{\log\mathcal{N}(t,\mathcal{H},\left\|\cdot\right\|_{\infty})}{n}}.\end{split} (10)

Substituting Eq. (9) into Eq. (10), we find that, for large enough nn, the right-hand side is minimized at t=(cCd2d2)22+dn12+dt=(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2})^{\frac{2}{2+d}}\cdot n^{-\frac{1}{2+d}}. This yields

ord()C~n12+d\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})\leq\tilde{C}\cdot n^{-\frac{1}{2+d}}\end{split} (11)

with a new constant C~=(cCd2d2)22+d+cCd2(cCd2d2)d2+d\tilde{C}=\left(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2}\right)^{\frac{2}{2+d}}+c\cdot C^{\frac{d}{2}}\left(c\cdot C^{\frac{d}{2}}\cdot\frac{d}{2}\right)^{-\frac{d}{2+d}}. Therefore, by substituting d=Dd=D in Eq. (11), the metric-entropy based bound on the ordinary Rademacher complexity exhibits exponential dependence on the input dimension as

ord()𝒪(n12+D),\begin{split}\mathfrak{R}_{\mathrm{ord}}(\mathcal{H})&\leq\mathcal{O}\left(n^{-\frac{1}{2+D}}\right),\end{split}

which is a manifestation of the curse of dimensionality. On the other hand, by following a similar calculation, we can see that the effective Rademacher complexity ()\mathfrak{R}(\mathcal{H}) avoids an exponential dependence on the input dimension DD. By substituting d=1d=1 in Eq. (11), we can see

D()ord(1)++ord(D)𝒪(n13),\begin{split}D\mathfrak{R}(\mathcal{H})&\leq\mathfrak{R}_{\mathrm{ord}}(\mathcal{H}_{1})+\cdots+\mathfrak{R}_{\mathrm{ord}}(\mathcal{H}_{D})\leq\mathcal{O}\left(n^{-\frac{1}{3}}\right),\\ \end{split}

where j:={𝔼S1,,SDh(S1(1),,Sj1(j1),()(j),Sj+1(j+1),,SD(D)):h}\mathcal{H}_{j}:=\{{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}h({S_{1}^{\prime}}^{(1)},\ldots,{S_{j-1}^{\prime}}^{(j-1)},(\cdot)^{(j)},{S_{j+1}^{\prime}}^{(j+1)},\ldots,{S^{\prime}_{D}}^{(D)}):h\in\mathcal{H}\}. This is because the Lipschitz constant of the functions in ℋj\mathcal{H}_{j} is at most LL (i.e., the Lipschitz constant does not increase under marginalization), since for any hjh\in\mathcal{H}_{j},

|h(x)h(y)|=|𝔼S1,,SD[h(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))h(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))]|𝔼S1,,SD|h(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))h(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))|𝔼S1,,SDL(S1(1),,Sj1(j1),x,Sj+1(j+1),,SD(D))(S1(1),,Sj1(j1),y,Sj+1(j+1),,SD(D))=𝔼S1,,SDL(0,,0,xy,0,,0)=L|xy|.\begin{split}&|h(x)-h(y)|\\ &=|{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}[h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})]|\\ &\leq{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}|h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-h(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})|\\ &\leq{\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}L\cdot\|(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},x,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})-(S_{1}^{\prime(1)},\ldots,S_{j-1}^{\prime(j-1)},y,S_{j+1}^{\prime(j+1)},\ldots,S_{D}^{\prime(D)})\|\\ &={\mathbb{E}}_{S^{\prime}_{1},\ldots,S^{\prime}_{D}}L\cdot\|(0,\ldots,0,x-y,0,\ldots,0)\|\\ &=L\cdot|x-y|.\\ \end{split}
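The gap between the two rates n^(-1/(2+D)) and n^(-1/3) in Remark 3 can be illustrated numerically; the constants are ignored and the sample sizes and D=10 below are arbitrary toy choices.

```python
# Compare the metric-entropy rates n^(-1/(2+D)) (ordinary ERM over
# D-variate Lipschitz functions) and n^(-1/3) (effective complexity,
# corresponding to d = 1), ignoring multiplicative constants.
D = 10  # toy input dimension

def rate_ordinary(n, D):
    """Rate bound for the ordinary Rademacher complexity."""
    return n ** (-1.0 / (2 + D))

def rate_effective(n):
    """Rate bound for the effective Rademacher complexity (d = 1)."""
    return n ** (-1.0 / 3)

for n in [10 ** 3, 10 ** 4, 10 ** 6]:
    # The dimension-free rate decays strictly faster for D > 1.
    assert rate_effective(n) < rate_ordinary(n, D)
```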

C.9 Remark on higher order Sobolev norms

Here, we comment on how the term κ2(ff^)\kappa_{2}(f-\hat{f}) is treated as a higher-order term in ff^f-\hat{f}.

Remark 4 (Higher order Sobolev norms).

Let us assume that suppqsuppqˇ\operatorname{supp}{q}\cup\operatorname{supp}{\check{q}} is contained in a compact set 𝒮~\mathcal{\tilde{S}} for all f^\hat{f} considered. Note that for d[D]d\in[D],

𝒮~|h(s)|dds(V𝒮~)DdD(𝒮~|h(s)|Dds)d/D\begin{split}\int_{\mathcal{\tilde{S}}}|h(s)|^{d}\mathrm{d}s\leq(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}\left(\int_{\mathcal{\tilde{S}}}|h(s)|^{D}\mathrm{d}s\right)^{d/D}\end{split}

by Hölder's inequality, where we defined V𝒮~:=𝒮~1dsV_{\mathcal{\tilde{S}}}:=\int_{\mathcal{\tilde{S}}}1\mathrm{d}s; hence, we have Ld(𝒮~)(V𝒮~)1d1DLD(𝒮~)\left\|\cdot\right\|_{{L^{d}}(\mathcal{\tilde{S}})}\leq(V_{\mathcal{\tilde{S}}})^{\frac{1}{d}-\frac{1}{D}}\left\|\cdot\right\|_{{L^{D}}(\mathcal{\tilde{S}})}. By applying this relation to each term in the definition of W1,d\left\|\cdot\right\|_{W^{1,d}}, we obtain

fW1,dd(V𝒮~)DdDfW1,Dd\begin{split}\left\|f\right\|_{W^{1,d}}^{d}\leq(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}\left\|f\right\|_{W^{1,D}}^{d}\end{split}

Thus we obtain

κ2(ff^)=d=2D(Dd)Cdj=1Dfjf^jW1,ddd=2D(Dd)(V𝒮~)DdDCdj=1Dfjf^jW1,Dd𝒪(j=1Dfjf^jW1,D2)(f^f).\begin{split}\kappa_{2}(f-\hat{f})&=\sum_{d=2}^{D}{D\choose d}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,d}}^{d}\\ &\leq\sum_{d=2}^{D}{D\choose d}(V_{\mathcal{\tilde{S}}})^{\frac{D-d}{D}}C_{d}^{\prime}\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}^{d}\\ &\leq\mathcal{O}\left(\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}^{2}\right)\quad(\hat{f}\to f).\end{split}

By also replacing $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,1}}$ with $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}$ in Theorem 3, we can see more clearly that $\kappa_{2}(f-\hat{f})$ is a higher-order term in $\sum_{j=1}^{D}\left\|f_{j}-\hat{f}_{j}\right\|_{W^{1,D}}$.
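The Hölder-type norm comparison used above can be checked numerically. The sketch below (our own illustration) verifies $\|h\|_{L^{d}(\mathcal{\tilde{S}})}\leq(V_{\mathcal{\tilde{S}}})^{\frac{1}{d}-\frac{1}{D}}\|h\|_{L^{D}(\mathcal{\tilde{S}})}$ for $d\leq D$ on $\mathcal{\tilde{S}}=[0,2]$; the test function and the exponents are arbitrary choices.

```python
import numpy as np

# Compact set S~ = [0, 2], so its volume V is 2.
xs = np.linspace(0.0, 2.0, 200_001)
dx = xs[1] - xs[0]
V = 2.0
h = np.exp(xs) * np.sin(3 * xs)  # an arbitrary smooth test function

def lp_norm(values, p):
    """Approximate the L^p norm on [0, 2] by a Riemann sum."""
    return (np.sum(np.abs(values) ** p) * dx) ** (1.0 / p)

d, D = 2, 4
lhs = lp_norm(h, d)
rhs = V ** (1.0 / d - 1.0 / D) * lp_norm(h, D)
assert lhs <= rhs  # the Hoelder-type comparison holds
```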

Appendix D Details and Proofs of Theorem 1

Here, we provide the proof of Theorem 1. We reuse the notation and terminology from Section C of this Supplementary Material. We prove the uniformly minimum variance property of the proposed risk estimator under the ideal situation $\hat{f}=f$.

Theorem 4 (Known causal mechanism case).

Assume $\hat{f}=f$. Then, for all $g\in\mathcal{G}$, $\check{R}(g)$ is the uniformly minimum variance unbiased estimator of $R(g)$. As a special case, its variance is no larger than that of the ordinary empirical risk estimator: $\forall q\in\mathcal{Q},\forall g\in\mathcal{G},\ \mathrm{Var}(\check{R}(g))\leq\mathrm{Var}(\hat{R}(g))$.

Proof.

The proof follows from the following two facts. When $\check{q}\in\mathcal{Q}$, the estimator $\check{R}(g)$ becomes the generalized U-statistic of the statistical functional Eq. (6). Furthermore, when $\hat{f}=f$, Eq. (6) coincides with $R(g)$ because the approximation error is zero. Since we assume $\hat{f}=f$, we have $\check{q}=q\in\mathcal{Q}$, and hence both of the statements above hold. Therefore, by Lemma 13, the first assertion of the theorem follows. The last assertion follows as a special case because $\hat{R}(g)$ is an unbiased estimator of $R(g)$ for $q\in\mathcal{Q}$.

From here, we confirm the above statements by calculation. We first show that $\check{R}(g)$ is the generalized U-statistic. To see this, observe that the statistical functional Eq. (6) admits the following expression when $\check{q}\in\mathcal{Q}$:

\begin{split}&\int\tilde{\ell}(s_{1},\ldots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)}))\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\mathop{\mkern 1.5mu\overline{\mkern-1.5mu\sum\mkern-1.5mu}\mkern 1.5mu}\limits_{\pi\in\mathfrak{S}_{D}}\ell(g,\hat{f}(s_{\pi(1)}^{(1)},\ldots,s_{\pi(D)}^{(D)}))\prod_{d}\check{q}^{(d)}(s_{1}^{(d)})\cdots\prod_{d}\check{q}^{(d)}(s_{D}^{(d)})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}\\ &=\int\ell(g,\hat{f}(s_{1}^{(1)},\ldots,s_{1}^{(D)}))\prod_{d}\check{q}^{(d)}(s_{1}^{(d)})\mathrm{d}s_{1}\\ &=\int\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\check{q}_{1}(s^{(1)})\cdots\check{q}_{D}(s^{(D)})\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)},\\ \end{split}

where the third equality holds because each permutation term integrates to the same value after relabeling the variables.

This is a regular statistical functional of degrees $(1,\ldots,1)$ with the kernel $\ell(g,\hat{f}(\cdot,\ldots,\cdot))$. On the other hand, we have

\begin{split}\check{R}(g)&=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\tilde{\ell}(S_{i_{1}},\ldots,S_{i_{D}})=\frac{1}{n^{D}}\sum_{(i_{1},\ldots,i_{D})\in[n]^{D}}\ell(g,\hat{f}(S_{i_{1}}^{(1)},\ldots,S_{i_{D}}^{(D)}))\end{split}

because the summations run through all combinations with replacement. This, combined with the fact that $\{S_{i}^{(d)}\}_{i,d}$ are jointly independent when $\check{q}\in\mathcal{Q}$, yields that $\check{R}(g)$ is the generalized U-statistic for Eq. (6).

Now we show that Eq. (6) coincides with $R(g)$. Given $\hat{f}=f$, we have

\begin{split}R(g)&=\int q(s)\ell(g,f(s))\mathrm{d}s\\ &=\int q(s)\ell(g,\hat{f}(s))\mathrm{d}s\quad(\text{by }f=\hat{f})\\ &=\int q_{1}(s^{(1)})\cdots q_{D}(s^{(D)})\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)}\quad(\text{by }q\in\mathcal{Q})\\ &=\int\check{q}_{1}(s^{(1)})\cdots\check{q}_{D}(s^{(D)})\ell(g,\hat{f}(s^{(1)},\ldots,s^{(D)}))\mathrm{d}s^{(1)}\cdots\mathrm{d}s^{(D)}\quad(\text{by }q=\check{q})\\ &=\int\tilde{\ell}(s_{1},\ldots,s_{D})\check{q}(s_{1})\cdots\check{q}(s_{D})\mathrm{d}s_{1}\cdots\mathrm{d}s_{D}.\quad(\because\text{ symmetry})\\ \end{split}
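The variance reduction asserted in Theorem 4 can be illustrated by simulation. The following sketch (our own, not from the paper) takes $D=2$ with a factorized $q$; the kernel `loss` stands in for $\ell(g,\hat{f}(\cdot,\cdot))$ and is an arbitrary choice. The all-pairs average $\check{R}(g)$ and the paired average $\hat{R}(g)$ are both unbiased, but the former has smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
loss = lambda s1, s2: (s1 + s2) ** 2  # stands in for l(g, f_hat(s1, s2))

n, reps = 10, 20_000
r_hat, r_check = [], []
for _ in range(reps):
    s1 = rng.normal(size=n)  # first independent component
    s2 = rng.normal(size=n)  # second independent component
    # R_hat: plain empirical average over the n paired samples.
    r_hat.append(np.mean(loss(s1, s2)))
    # R_check: average over all n^2 cross-combinations (with replacement).
    r_check.append(np.mean(loss(s1[:, None], s2[None, :])))

# Both are unbiased for E[(S1 + S2)^2] = 2; R_check has smaller variance.
assert np.var(r_check) <= np.var(r_hat)
```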

The following well-known lemma states that a generalized U-statistic is a uniformly minimum variance unbiased estimator.

Lemma 13 (Uniformly minimum variance property of a generalized U-statistic).

Let $\theta:\mathcal{Q}\to\mathbb{R}$ be a regular statistical functional with kernel $\psi:\mathbb{R}^{k_{1}}\times\cdots\times\mathbb{R}^{k_{L}}\to\mathbb{R}$ (Clémençon et al., 2016), i.e.,

\begin{split}\theta(q)=\int\psi((x_{1}^{(1)},\ldots,x_{k_{1}}^{(1)}),\ldots,(x_{1}^{(L)},\ldots,x_{k_{L}}^{(L)}))\prod_{j=1}^{k_{1}}q_{1}(x_{j}^{(1)})\mathrm{d}x_{j}^{(1)}\cdots\prod_{j=1}^{k_{L}}q_{L}(x_{j}^{(L)})\mathrm{d}x_{j}^{(L)}.\end{split}

Given samples $\{x_{i}^{(l)}\}_{i=1}^{n_{l}}\overset{\text{i.i.d.}}{\sim}q_{l}$ ($n_{l}\geq k_{l}$, $l=1,\ldots,L$), let $\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi$ be the corresponding generalized U-statistic:

\begin{split}\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi:=\frac{1}{\prod_{l}{n_{l}\choose k_{l}}}\sum\psi\left(\left(x_{i_{1}^{(1)}}^{(1)},\ldots,x_{i_{k_{1}}^{(1)}}^{(1)}\right),\ldots,\left(x_{i_{1}^{(L)}}^{(L)},\ldots,x_{i_{k_{L}}^{(L)}}^{(L)}\right)\right),\end{split}

where $\sum$ denotes the summation over all possible combinations (without replacement) of the indices. Then, $\mathrm{GU}_{(n_{1},\ldots,n_{L})}^{(k_{1},\ldots,k_{L})}\psi$ is the uniformly minimum variance unbiased estimator of $\theta$ on $\mathcal{Q}$.

Proof.

The assertion can be proved in a manner parallel to the proof of (Lee, 1990, Section 1.1, Lemma B).
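To make the definition in Lemma 13 concrete, here is a brute-force sketch of the generalized U-statistic (our own illustration; the function name, kernel, and data are arbitrary). For degrees $(1,1)$ with kernel $\psi(x;y)=xy$, the statistic reduces to the product of the two sample means.

```python
import itertools
import numpy as np

def generalized_u_statistic(kernel, samples, degrees):
    """Average the kernel over all combinations (without replacement)
    of k_l indices within each of the L samples."""
    total, count = 0.0, 0
    index_sets = [itertools.combinations(range(len(x)), k)
                  for x, k in zip(samples, degrees)]
    for idx_tuple in itertools.product(*index_sets):
        args = [tuple(x[i] for i in idx)
                for x, idx in zip(samples, idx_tuple)]
        total += kernel(*args)
        count += 1
    return total / count

# Degree-(1, 1) example: psi(x; y) = x * y, whose generalized U-statistic
# equals the product of the two sample means.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0])
gu = generalized_u_statistic(lambda a, b: a[0] * b[0], [x, y], [1, 1])
assert abs(gu - x.mean() * y.mean()) < 1e-12
```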

Remark 5 (Relation to the UMVUE property of $\hat{R}(g)$).

The result in Theorem 4 does not contradict the fact that the sample average $\hat{R}(g)$ is a U-statistic of degree 1 and hence has the minimum variance among all unbiased estimators of $R(g)$ on $\mathcal{P}$, where $\mathcal{P}$ is a set of distributions containing all absolutely continuous distributions Lee (1990). Specifically, $\check{R}(g)$ is not generally an unbiased estimator of $R(g)$ on $\mathcal{P}\setminus\mathcal{Q}$, even if $\hat{f}=f$. While $\check{R}(g)$ satisfies the $D$-sample symmetry condition, $\hat{R}(g)$ does not. By restricting attention to $\mathcal{Q}$, the estimator $\check{R}(g)$ achieves a smaller variance than $\hat{R}(g)$.

Appendix E Further Comparison with Related Work

Here, we provide an additional detailed comparison with the related work to complement Section 5 of the main text.

E.1 Comparison with Magliacane et al. (2018)

Magliacane et al. (2018) considered domain adaptation among different interventional states by using SCMs. Their problem setting and ours do not strictly include each other (the two settings are somewhat complementary): their assumption may be more suitable for application fields with interventional experiments, such as genomics, while ours may be more suited for fields with observational data, such as health record analysis or economics. At the methodological level, Magliacane et al. (2018) take a variable selection approach, finding a subset of variables conditioned on which the conditional distribution is invariant, whereas our paper takes a data augmentation approach via estimation of the SEMs (in reduced form).

The essential assumptions of Magliacane et al. (2018) are the existence of a separating set (with small “incomplete information bias”) and the identifiability of such a set (following from Proposition 1, Assumption 1, and Assumption 2 (iii) in Magliacane et al. (2018)). A particularly plausible application conforming to these assumptions is, for example (but not limited to), genomics experiments, in part because Assumption 2 (ii) and (iii) are likely to hold for well-targeted experiments Magliacane et al. (2018). The following is a detailed comparison.

(1) Modeling assumption and problem setup.

The two problem settings do not strictly include one another; rather, they are complementary, with ours corresponding to the intervention-free case and Magliacane et al. (2018) to the interventional case. If we tried to express the problem setting of Magliacane et al. (2018) within our formulation, we would express the interventions as alterations to the SEMs. We assume that such alterations do not occur in our setting since our focus is on observational data; therefore, the problem formulation of Magliacane et al. (2018) is not a subset of ours. Conversely, if we tried to express our problem setting within the formulation of Magliacane et al. (2018), our setup would have only $C_{1}$ as the context variable, and $C_{1}$ would be a parent of all observed variables, e.g., $C_{1}$ switches the distribution of $S$ by switching between the quantile functions used for inverse transform sampling. This potentially allows the effect $C_{1}\to Y$, violating Assumption 2 (iii) in Magliacane et al. (2018). Moreover, even if such an edge does not exist, there may be no separating set in the extreme case where $Y$ is a parent of all the $X_{i}$'s: conditioning on any of the $X_{i}$'s would then make $C_{1}$ and $Y$ dependent. From this consideration, our problem setting is not a subset of that of Magliacane et al. (2018), either.

(2) Plausible applications.

The problem setup of Magliacane et al. (2018) is especially suitable for applications in which various experiments are conducted, such as genomics Magliacane et al. (2018), whereas our problem setting may be more suitable for fields with observational data, such as health record analysis or economics.

(3) Methodology.

Our proposed method actually estimates the SEMs (in reduced form) and exploits the estimated SEMs in the domain adaptation algorithm. Directly using the estimated SEMs to realize domain adaptation can be seen as the first attempt to fully leverage structural causal models in a DA algorithm. On the other hand, Magliacane et al. (2018) approach the problem via variable selection, finding a subset of variables conditioned on which the conditional distribution is invariant.

E.2 Comparison with Gong et al. (2018)

In the present paper, we assumed the invariance of structural equations across domains. Here, we clarify the difference from a related but distinct assumption considered by the Causal Generative Domain Adaptation Network (CG-DAN; Gong et al., 2018).

(1) Problem setup.

Gong et al. (2018) presume the anticausal scenario (i.e., $Y$ is the cause of $X$) and that $X$ given $Y$ follows a structural equation model, whereas our paper considers more general SEMs of $X$ and $Y$.

(2) Theoretical justification.

The approach of Gong et al. (2018) has no theoretical guarantee for the identifiability of $f$; that is, no theoretical condition is known under which the learned generator is applicable across different domains. On the other hand, our method enjoys the strong theoretical justification of nonlinear ICA, including the identifiability of $f$ under known theoretical conditions.

(3) Methodology.

The method of Gong et al. (2018) estimates the GCM of $X$ given $Y$ using source domain data and uses it to design a generator neural network. On the other hand, we exploit the estimated reduced-form SEM more directly in our method.

E.3 Comparison with Arjovsky et al. (2020)

Arjovsky et al. (2020) proposed invariant risk minimization (IRM) for the out-of-distribution (OOD) generalization problem. The IRM approach tries to learn a feature extractor that makes the optimal predictor invariant across domains, and its theoretical validity is argued based on SCMs. Here, we compare it with the present work in terms of the problem setup, theoretical justification, and methodology.

(1) Basic assumption and problem setup.

The OOD generalization problem tackled in Arjovsky et al. (2020) assumes no access to target domain data. In this respect, the problem is different from and intrinsically more difficult than the one considered in this paper, where a small labeled sample from the target domain is assumed to be available. To solve the OOD generalization problem, in a nutshell, Arjovsky et al. (2020) essentially assume the existence of a feature extractor that elicits an invariant predictor, i.e., one that makes the optimal predictors of the different domains identical after the feature transformation. This can be seen as a variant of the representation learning approach to domain adaptation, where one assumes there exists $\mathcal{T}$ such that $p(Y|\mathcal{T}(X))$ is invariant across domains. Indeed, when the loss function is the cross-entropy, for example, the condition corresponds to the invariance of $P(Y|\mathcal{T}(X))$ across domains Arjovsky et al. (2020). More technically, (Arjovsky et al., 2020, Definition 7(ii)) additionally requires the condition ${\mathbb{E}}_{1}[Y|\mathrm{Pa}(Y)]={\mathbb{E}}_{2}[Y|\mathrm{Pa}(Y)]$, which can be violated when the latent factors corresponding to $Y$ have different distributions across domains. On the other hand, our assumption can be seen as the existence of a feature extractor that can simultaneously estimate the independent components in all domains, which does not necessarily imply the existence of a common feature transformer inducing a unique optimal predictor.

(2) Theoretical justification.

Arjovsky et al. (2020) formulated a condition under which the IRM principle leads to an appropriate predictor for OOD generalization, but only under a certain linearity assumption that is essentially a relaxation of linear SEMs. Furthermore, in the theoretical guarantee, the feature extractor is restricted to be linear. In addition, Arjovsky et al. (2020) provide only a population-level analysis showing that the solution of the IRM objective formulated with the underlying distributions enjoys OOD generalization; they do not discuss conditions under which the ideal feature extractor can be properly estimated by empirical IRM. The need for the strong linearity assumption likely stems from the intrinsic difficulty of the OOD problem in Arjovsky et al. (2020), whose formulation does not assume specific types of interventions. On the other hand, our method enjoys a stronger theoretical guarantee of an excess risk bound without such parametric assumptions on the models or the data generating process, by focusing on the case where the causal mechanisms are invariant across domains.

(3) Methodology.

The methodology of IRM estimates a single predictor that generalizes well to all domains by finding a feature extractor that makes the predictor optimal in all domains. This approach shares the same spirit as representation learning approaches to domain adaptation, such as transfer component analysis Pan et al. (2011), which seek a feature extractor that induces invariant conditional distributions. On the other hand, our method estimates the SEMs (in reduced form) and exploits them to make training on the few target domain data more efficient through data augmentation.