
Out-of-distribution generalization under random, dense distributional shifts

Yujin Jeong and Dominik Rothenhäusler
Abstract

Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that $p(y \mid x)$ remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we discuss empirical evidence for such random dense distributional shifts. Then, we develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we apply the framework to several real-world datasets and discuss diagnostics to evaluate the fit of the distributional uncertainty model.

1 Introduction

Distribution shift is a persistent challenge in statistics and machine learning. For example, we might be interested in understanding the determinants of positive outcomes in early childhood education. However, the relationships among learning outcomes and other variables can shift over time and across locations. This makes it challenging to gain actionable knowledge that can be used in different situations. As a result, there has been a surge of interest in robust and generalizable machine learning. Existing approaches can be classified as invariance-based approaches and adversarial approaches.

Invariance-based methods assume that the shifts only affect a sub-space of the data distribution [Pan and Yang, 2010, Baktashmotlagh et al., 2013]. For example, under covariate shift [Quinonero-Candela et al., 2009], the distribution of the covariates changes while the conditional distribution of the outcomes given the covariates stays invariant. In causal representation learning [Schölkopf et al., 2021], the goal is to disentangle the distribution into independent causal factors, based on the assumption that spurious associations can be separated from invariant associations. Both approaches assume the shift affects only a sub-space of the data. We call such shifts sparse distribution shifts.

Adversarial methods in robust machine learning consider worst-case perturbations. The idea is that some adversary can perturb the data either at the training [Huber, 1981] or deployment stage [Szegedy et al., 2014, Duchi and Namkoong, 2021]; and the goal is to learn a model that is robust to such deviations. While developing robust procedures is important, worst-case perturbations might be too conservative in many real-world applications.

Motivated by an empirical observation, in this paper we consider distributional shifts due to random and dense perturbations. As an example, consider replication studies. In replication studies, different research teams try to replicate results by following the same research protocol. Therefore, any distribution shift between the data may arise from the superposition of many unintended, subtle errors that affect every part of the distribution, which might be modeled as random. We call such shifts dense distributional shifts. The random, dense distributional shift model has recently been used to quantify distributional uncertainty in parameter estimation problems [Jeong and Rothenhäusler, 2023, Rothenhäusler and Bühlmann, 2023] and has shown success in modelling real-world temporal shifts in refugee placement [Bansak et al., 2024]. We will develop tools to measure the similarity between randomly perturbed distributions, infer parameters and predict outcomes for partially observed and shifted distributions.

1.1 Summary of contributions

In this subsection, we preview and summarize our contributions.

First, we discuss an intriguing phenomenon on a real-world dataset, in which means of arbitrary functions seem to be correlated across multiple distributions. To the best of our knowledge, existing distribution shift models do not imply this phenomenon.

To address this gap, we introduce a random distribution shift model for multiple datasets, where likelihood ratios exhibit random fluctuations. This builds on the perturbation model introduced in Jeong and Rothenhäusler [2023], which considers a single perturbed distribution. We generalize the model to handle multiple perturbed distributions that are possibly correlated. We then demonstrate that the extended random perturbation model can explain the empirical distribution shift phenomenon.

Next, we consider the following domain adaptation problem. Our goal is to predict outcomes $Y$ based on covariates $X$. Specifically, we seek to estimate $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ for some loss function $\mathcal{L}(\theta,X,Y)$, where $\mathbb{E}^{0}$ denotes the expectation under the unknown target distribution. While we have access to complete datasets $(X,Y)$ from multiple training distributions, the target dataset contains only covariates $X$ without corresponding outcomes $Y$. We adopt a weighted Empirical Risk Minimization (ERM) approach, where the parameters $\theta^{0}$ are estimated as

\hat{\theta}^{0}=\arg\min_{\theta}\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)],

where $\hat{\mathbb{E}}^{k}$ denotes the empirical mean computed on the $k$-th training dataset.

Under the random distribution shift model, we derive optimal weights for weighted ERM and expressions for the corresponding out-of-distribution error. Based on this expression, we derive prediction intervals for parameters of the target distribution.

Figure 1: Test MSE of XGBoost when the training data is initially sourced from CA and then additionally sourced from PR. The dashed vertical line indicates the point at which PR data starts to be added. The blue line corresponds to equally weighted samples, the green line to samples weighted by importance weights, and the orange line to samples receiving distribution-specific weights from our proposed method.

The effectiveness of our approach is illustrated in Figure 1, which shows the out-of-distribution error for the income prediction task (see Section 4 for details) when the target state is either Texas (TX) or Washington (WA), and the training datasets are from California (CA) and Puerto Rico (PR). We simulate a scenario where CA data is initially available, followed by additional data from PR. We optimize the weighted empirical risk under three different weighting schemes: (1) equal weighting of all samples, (2) sample-specific importance weighting under the assumption of covariate shift, and (3) distribution-specific weighting based on our approach. Compared to the equal weighting and importance weighting methods, our approach demonstrates remarkable robustness, maintaining consistently low out-of-distribution error even after the PR dataset is included, by optimally weighting the two data sources.

The remainder of the paper is organized as follows. In Section 2, we present empirical evidence for random and dense distributional shifts and introduce the random distributional perturbation model. In Section 3, we establish the foundations for domain adaptation under random and dense distributional shifts. In particular, we derive the out-of-distribution risk of weighted empirical risk minimization and conduct statistical inference for parameters of the target distribution. In Section 4 and Section 5, we apply our framework to several real-world datasets and demonstrate the robustness of our method. We conclude in Section 6.

1.2 Related work

This work addresses domain adaptation problems under a specific class of random distributional shifts. Domain adaptation methods generally fall into two main categories [Pan and Yang, 2010]. The first type involves re-weighting, where training samples are assigned weights that take the distributional change into account [Long et al., 2014, Li et al., 2016]. For example, under covariate shift, training samples can be re-weighted via importance sampling [Gretton et al., 2009, Shimodaira, 2000, Sugiyama et al., 2008]. The second type aims to learn invariant representations of features [Argyriou et al., 2007, Baktashmotlagh et al., 2013]. Under dense distributional shifts, as we will see in Section 2, instance-specific weights can be unstable or even undefined due to overlap violations. Furthermore, under dense distributional shifts there might be no invariant representation. Thus, from a classical domain adaptation perspective, dense distribution shifts are a challenging setting.

The proposed method approximates the target distribution as a weighted combination of source distributions. Therefore, the proposed method shares methodological similarities with synthetic controls, which search for a linear combination of untreated units to represent treated units [Abadie and Gardeazabal, 2003]. Synthetic controls are applied to panel data and often assume a linear factor model for a continuous outcome of interest [Doudchenko and Imbens, 2016, Abadie et al., 2007, Bai, 2009, Xu, 2017, Athey et al., 2021]. Our framework can be seen as a justification for synthetic control methods under random distributional shifts. In addition, our procedure can be applied to model distribution shifts with or without time structure for any type of data (discrete, continuous, ordinal). Furthermore, our framework allows for straightforward generalizations to empirical risk minimization, which includes many modern machine learning tools.

The approach of weighting a combination of source distributions to approximate a target distribution also appears in the distributionally robust optimization (DRO) framework [Xiong et al., 2023, Zhan et al., 2024]. While there is some similarity in the prediction procedures, the random distribution shift model enables diagnostics and inference that are markedly different from a worst-case approach. One can think about the random shift model as a potential justification for a very particular type of distribution weighting, with additional inferential consequences.

Our procedure is also loosely related to meta-analysis and hierarchical modeling. Meta-analysis is a statistical procedure for combining data from multiple studies [Higgins et al., 2009]. Traditional meta-analysis relies on the assumption that the effect size estimates from different studies are independent. However, in real-world applications, this independence assumption often does not hold true. There are different strategies to handle dependencies. These include averaging the dependent effect sizes within studies into a single effect [Wood, 2008], incorporating dependencies into models through robust variance estimation [Hedges et al., 2010], or through multilevel meta-analysis where they assume a cluster structure and model dependence within and between clusters [Cheung, 2014]. In meta-regression, the effect sizes are expressed as linear functions of study characteristics, and they are independent conditional on these study characteristics [Borenstein et al., 2009, Hartung et al., 2008]. Bayesian hierarchical models with prior on model parameters can also be used to model dependency [Higgins et al., 2009, Lunn et al., 2013]. Compared to these procedures, our approach allows for correlations between studies and can be applied even when there is no hierarchical structure and no summary study characteristics are available. Furthermore, our approach is frequentist, so it does not require choosing a prior distribution.

In this paper, we model multiple correlated distributions using the random distributional perturbation model introduced in Jeong and Rothenhäusler [2023], where dense distributional shifts emerge as the superposition of numerous small random changes. In Jeong and Rothenhäusler [2023], a single perturbed distribution is considered. In our paper, we extend this model to address multiple perturbed distributions that are possibly correlated.

2 Distributional Uncertainty

Due to a superposition of many random, small changes, a data scientist might not draw samples from the target distribution $\mathbb{P}^{0}$ but from several randomly perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. More specifically, we assume that a data scientist has $K$ datasets, with the $k$-th dataset being an i.i.d. sample from a perturbed distribution $\mathbb{P}^{k}$. Intuitively speaking, some of these distributions might be similar to each other, while others might be very different. For instance, one replication study might closely resemble another due to collaboration among investigators. Additionally, a dataset obtained from California might be more similar to one from Washington than to one from Wyoming.

We illustrate the scenario where multiple datasets exhibit distributional correlations using the GTEx data. (The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The dataset used for the analyses described in this manuscript is version 6 and can be downloaded from the GTEx Portal: www.gtexportal.org.) The GTEx V6 data offers pre-processed RNA-seq gene-expression levels collected from 450 donors across 44 tissues. Treating each tissue as an individual study, we have 44 datasets. Each dataset from tissue $k$ is considered an independent sample from a perturbed distribution $\mathbb{P}^{k}$, where $\mathbb{P}^{\bullet}$ represents a joint distribution of all gene-expression levels. Some of these datasets are likely to be correlated due to factors such as an overlap of donors and the proximity between tissues. Empirical evidence of this correlation is presented on the left side of Figure 2, where each row and column corresponds to a tissue, and each dot represents the standardized covariance between the gene expression levels of a randomly selected gene pair, computed among the observations in the tissue. We randomly sample 1000 gene pairs. We can see that some tissues exhibit a clear linear relationship. In the following, we will introduce a distribution shift model that implies this linear relationship.

Figure 2: (Left) Linear relationships between tissues: Each row and column corresponds to one tissue, with each dot representing the standardized covariance of the gene expression levels of a randomly selected gene pair. Note that certain tissues exhibit a linear relationship. For instance, a notable linear relationship is observed within the blue squared box, where a simple linear regression yielded an R-squared value of 0.708. In contrast, in the red squared box, a linear relationship is not evident, as indicated by the simple linear regression result with an R-squared value of 0.017. (Right) QQ plots of (2) for various test functions for different perturbed tissues when the target tissue is brain cortex. The red line represents the expected QQ line if the perturbed data were all drawn i.i.d. from the target distribution (brain cortex).

To model the relatedness of multiple datasets, we will assume that each $\mathbb{P}^{k}$ is a random perturbation of a common target distribution $\mathbb{P}^{0}$. Then, we adopt the random distributional perturbation model of Jeong and Rothenhäusler [2023], where dense distributional shifts emerge as the superposition of numerous small random changes, to construct the perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. In this random perturbation model, probabilities of events under a random probability measure $\mathbb{P}^{k}$ are slightly up-weighted or down-weighted compared to the target distribution $\mathbb{P}^{0}$.

Without loss of generality, we construct distributional perturbations for uniform distributions on $[0,1]$. A result from probability theory shows that any random variable $D$ on a finite or countably infinite dimensional probability space can be written as a measurable function $D\stackrel{d}{=}h(U)$, where $U$ is a uniform random variable on $[0,1]$. (For any Borel-measurable random variable $D$ on a Polish (separable and completely metrizable) space $\mathcal{D}$, there exists a Borel-measurable function $h$ such that $D\stackrel{d}{=}h(U)$, where $U$ follows the uniform distribution on $[0,1]$ [Dudley, 2018].) With the transformation $h$, the construction of distributional perturbations for uniform variables can be generalized to the general case by using

\mathbb{P}^{k}(D\in\bullet)=\mathbb{P}^{k}(h(U)\in\bullet).

Let us now construct the perturbed distribution $\mathbb{P}^{k}$ for a random variable $U$ that follows a uniform distribution under $\mathbb{P}^{0}$. Take $m$ bins $I_{j}=[(j-1)/m,\,j/m]$ for $j=1,\dots,m$. Let $W^{k}_{1},\dots,W^{k}_{m}$ be i.i.d. positive random variables with finite variance. We define the randomly perturbed distribution $\mathbb{P}^{k}$ by setting

\mathbb{P}^{k}(U\in\bullet)=\sum_{j}\mathbb{P}^{0}(U\in I_{j}\cap\bullet)\cdot\frac{W^{k}_{j}}{\sum_{j'=1}^{m}W^{k}_{j'}/m}.

Then the datasets are generated as follows:

1. Generate perturbed distributions: Start with the target distribution $\mathbb{P}^{0}$ and generate $K$ perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$ using random weights $W=\{\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$.

2. Draw samples: For each $k=1,\dots,K$, draw $n_{k}$ i.i.d. samples $\mathcal{D}_{k}=\{D_{ki}\}_{i=1}^{n_{k}}$ from the perturbed distribution $\mathbb{P}^{k}$ (i.e., $D_{ki}\stackrel{\text{i.i.d.}}{\sim}\mathbb{P}^{k}$), conditional on the weights $W=\{\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$.

Further, let $m=m(n)$ be such that $m(n)/\min(n_{1},\dots,n_{K})$ converges to 0. This is the regime where the distributional uncertainty is of higher order than the sampling uncertainty.
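To make the construction concrete, the following minimal R sketch simulates two correlated perturbed distributions for a uniform variable and draws samples from them. The bin count m, the exponential weights, and the shared component V used to induce positive correlation are illustrative assumptions, not the authors' implementation.

    # A minimal sketch (not the authors' code): simulating the random perturbation
    # model for a uniform U on [0, 1]. Sharing the component V across the two weight
    # vectors induces a positive distributional correlation between P^1 and P^2.
    set.seed(1)
    m <- 200; n <- 5000

    V  <- rexp(m)            # shared component => Cov(W^1_j, W^2_j) > 0
    W1 <- V + rexp(m)        # i.i.d. positive weights with finite variance
    W2 <- V + rexp(m)

    draw_from <- function(W, n, m) {
      # bin j receives probability W_j / sum(W); within a bin, U stays uniform
      bins <- sample.int(m, n, replace = TRUE, prob = W / sum(W))
      runif(n, (bins - 1) / m, bins / m)
    }
    U1 <- draw_from(W1, n, m)
    U2 <- draw_from(W2, n, m)

    phi <- function(u) sin(2 * pi * u)   # any square-integrable test function
    c(mean(phi(U1)), mean(phi(U2)))      # both fluctuate around E^0[phi(U)] = 0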

Notation.

Let $\mathbb{P}^{0}$ be the target distribution on $\mathcal{D}$, and let $\mathbb{P}^{k}$ be the perturbed distribution on $\mathcal{D}$ conditional on $(W_{j}^{k})_{j=1}^{m}$. We draw an i.i.d. sample $\mathcal{D}_{k}=\{D_{ki}\}_{i=1}^{n_{k}}$ from $\mathbb{P}^{k}$, conditional on $(W_{j}^{k})_{j=1}^{m}$. Let $P$ be the marginal distribution of $\{\{D_{ki}\}_{i=1}^{n_{k}},\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$. We write $E$ for the expectation under $P$, $\mathbb{E}^{0}$ for the expectation under $\mathbb{P}^{0}$, and $\mathbb{E}^{k}$ for the expectation under $\mathbb{P}^{k}$. Specifically, $\mathbb{E}^{0}[\phi(D)]$ is the mean of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{0}$, and $\mathbb{E}^{k}[\phi(D)]$ is the mean of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{k}$, conditional on $(W_{j}^{k})_{j=1}^{m}$. Moreover, we write $\hat{\mathbb{E}}^{k}$ for the sample average over $\mathcal{D}_{k}$; for example, $\hat{\mathbb{E}}^{k}[\phi(D)]=\sum_{i=1}^{n_{k}}\phi(D_{ki})/n_{k}$, where $D_{ki}\stackrel{\text{i.i.d.}}{\sim}\mathbb{P}^{k}$. We write $\mathrm{Var}_{P}$ for the variance under $P$ and $\mathrm{Var}_{\mathbb{P}^{0}}$ for the variance under $\mathbb{P}^{0}$. Specifically, $\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))$ is the variance of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{0}$.

Theorem 1 (Distributional CLT).

Under the assumptions mentioned above, for any Borel-measurable, square-integrable functions $\phi_{k}:\mathcal{D}_{k}\to\mathbb{R}$, $k=1,\dots,K$, we have

\sqrt{m}\left(\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}(D)]\\ \vdots\\ \hat{\mathbb{E}}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)\xrightarrow{d}N\left(0,\;\Sigma^{W}\odot\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))\right) \qquad (1)

where $\phi(D)^{\intercal}=(\phi_{1}(D),\dots,\phi_{K}(D))\in\mathbb{R}^{K}$, and $\Sigma^{W}\in\mathbb{R}^{K\times K}$ has

(\Sigma^{W})_{ij}=\frac{\mathrm{Cov}_{P}(W^{i},W^{j})}{E[W^{i}]\,E[W^{j}]}.

Here, \odot denotes element-wise multiplication (Hadamard product).

Remark 1 (Random shift in $Y\mid X$).

As an example, let us study how conditional expectations shift under this model. We consider the case where the data consist of both covariates $X$ and target $Y$, that is, $D=(X,Y)$. Let $A$ be some subset of the sample space of $X$ with $\mathbb{P}^{0}(X\in A)>0$. Using a Taylor approximation,

\hat{\mathbb{E}}^{1}[Y|X\in A]-\hat{\mathbb{E}}^{0}[Y|X\in A]=\hat{\mathbb{E}}^{1}[\phi]-\hat{\mathbb{E}}^{0}[\phi]+o_{P}(1/\sqrt{m})

where $\phi=\frac{1_{X\in A}\,(Y-\mathbb{E}^{0}[Y|X\in A])}{\mathbb{P}^{0}[X\in A]}$. Thus, by the CLT,

\sqrt{m}\,\big(\hat{\mathbb{E}}^{1}[Y|X\in A]-\hat{\mathbb{E}}^{0}[Y|X\in A]\big)\stackrel{d}{\rightarrow}\mathcal{N}\left(0,\;\Sigma_{11}^{W}\,\frac{\mathrm{Var}_{\mathbb{P}^{0}}(Y|X\in A)}{\mathbb{P}^{0}(X\in A)}\right).

Hence, the conditional expectation is randomly shifted based on three factors: 1) the shift is proportional to the distribution shift strength $\sqrt{\Sigma_{11}^{W}/m}$; 2) it is proportional to the conditional noise $\sqrt{\mathrm{Var}_{\mathbb{P}^{0}}(Y|X\in A)}$; and 3) it is inversely proportional to $\sqrt{\mathbb{P}^{0}(X\in A)}$. This random shift in $Y\mid X$ needs to be accounted for in both estimation and inference.

Remark 2 (Distributional covariance).

We call $\Sigma^{W}/m$ the covariance between distributional perturbations. Two perturbed distributions $\mathbb{P}^{k}$ and $\mathbb{P}^{k^{\prime}}$ are positively correlated if $\Sigma_{kk^{\prime}}^{W}>0$. This means that when one perturbation slightly increases the probability of a certain event $A$, the other distribution tends to slightly increase the probability of this event as well. This can be seen by choosing the test function $\phi_{k}=1_{D\in A}$:

\mathrm{Cov}_{P}\big(\hat{\mathbb{P}}^{k}(D\in A),\hat{\mathbb{P}}^{k^{\prime}}(D\in A)\big)=\mathrm{Cov}_{P}\big(\hat{\mathbb{E}}^{k}[1_{D\in A}],\hat{\mathbb{E}}^{k^{\prime}}[1_{D\in A}]\big)\approx\frac{\Sigma_{k,k^{\prime}}^{W}}{m}\,\mathrm{Var}_{\mathbb{P}^{0}}(1_{D\in A}).

However, if two perturbed distributions are negatively correlated, a (random) increase in the probability of an event under one distribution would often coincide with a (random) decrease in the probability of this event under the other distribution, compared to $\mathbb{P}^{0}$.
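This covariance relation can be checked numerically. Continuing the simulation sketch above (illustrative weights, not the authors' code), with weights $W = V + E$ for independent standard exponentials $V$ and $E$ we have $\mathrm{Cov}(W^{1}_{j},W^{2}_{j})=\mathrm{Var}(V_{j})=1$ and $E[W_{j}]=2$, so $(\Sigma^{W})_{12}=1/4$:

    # Monte Carlo check of the covariance approximation above: over repeated draws
    # of the weights, the covariance of the two perturbed empirical probabilities of
    # the event {U < 0.3} is approximately (Sigma^W)_{12} * Var_0(1_{U < 0.3}) / m,
    # here 0.25 * 0.21 / m. Reuses m, n, and draw_from() from the sketch above.
    reps <- replicate(2000, {
      V <- rexp(m); W1 <- V + rexp(m); W2 <- V + rexp(m)
      c(mean(draw_from(W1, n, m) < 0.3), mean(draw_from(W2, n, m) < 0.3))
    })
    cov(t(reps))[1, 2]            # compare with 0.25 * 0.3 * 0.7 / m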

Remark 3 (Estimation of variance).

The variance under the target distribution, $\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))$, can be estimated via the empirical variance of $\phi$ on the pooled data $\{(D_{ki})_{i=1}^{n_{k}}\}_{k=1}^{K}$. If $\phi$ has a finite fourth moment, we then have $\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi(D))=\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1)$. The details can be found in Appendix 8.5.

An extended version of Theorem 1, in which $\phi_{k}:\mathcal{D}_{k}\to\mathbb{R}^{L}$ for some $L\geq 1$, and the proof of Theorem 1 can be found in the Appendix, Section 8. In this paper, we consider the target distribution as fixed. In Appendix 8.4, we discuss a version of Theorem 1 that allows for a randomly shifted target distribution. In Section 9 of the Appendix, we discuss several special cases of the random perturbation model, including random shifts across time, i.i.d. shifts, and random shifts between locations.

Theorem 1 tells us that $\Sigma_{ij}^{W}/m$ is the asymptotic covariance of $\hat{\mathbb{E}}^{i}[\phi(D)]$ and $\hat{\mathbb{E}}^{j}[\phi(D)]$ for any Borel-measurable, square-integrable function $\phi$ with unit variance under $\mathbb{P}^{0}$. Remarkably, under our random distributional perturbation model, a $\big(K+\binom{K}{2}\big)$-dimensional parameter $\Sigma^{W}$ quantifies the similarity of multiple distributions and captures all possible correlations of all square-integrable functions.

The random shift model explains the linear patterns in the GTEx scatterplot.

Consider two perturbed distributions $\mathbb{P}^{1}$ and $\mathbb{P}^{2}$, and $L$ uncorrelated test functions with unit variances, $\phi_{1}(D),\dots,\phi_{L}(D)$. In the GTEx example, $\phi_{\ell}$ is the standardized product of the gene expression levels of the randomly selected $\ell$-th gene pair. Moreover, in Appendix 13, we see that these $L$ test functions are approximately uncorrelated. From Theorem 1, we have

\sqrt{m}\left(\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{\ell}(D)]\\ \hat{\mathbb{E}}^{2}[\phi_{\ell}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{\ell}(D)]\\ \mathbb{E}^{0}[\phi_{\ell}(D)]\end{pmatrix}\right)\stackrel{d}{=}Z_{\ell}+o_{p}(1),

where

Z_{\ell}\stackrel{\text{i.i.d.}}{\sim}N\left(0,\begin{pmatrix}\Sigma_{11}^{W}&\Sigma_{12}^{W}\\ \Sigma_{21}^{W}&\Sigma_{22}^{W}\end{pmatrix}\right).

Therefore, the pairs $(\hat{\mathbb{E}}^{1}[\phi_{\ell}(D)]-\mathbb{E}^{0}[\phi_{\ell}(D)],\;\hat{\mathbb{E}}^{2}[\phi_{\ell}(D)]-\mathbb{E}^{0}[\phi_{\ell}(D)])$ for $\ell=1,\dots,L$ can be viewed as independent draws from a bivariate Gaussian distribution with distributional covariance $\Sigma^{W}/m$. This explains the linear patterns observed in the left plot of Figure 2. Gaussianity can be evaluated by QQ plots of the statistic

\left(\frac{1}{n_{k}}+\frac{1}{n_{0}}\right)^{-1/2}\left(\frac{\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{\ell}(D_{ki})-\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\phi_{\ell}(D_{0i})}{\hat{\mathrm{sd}}_{\mathbb{P}^{0}}(\phi_{\ell}(D))}\right). \qquad (2)

The QQ plots of the statistic in (2) are presented on the right-hand side of Figure 2, with the target tissue set to brain cortex. These plots indicate that the statistics in (2) indeed follow a Gaussian distribution.
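As a simple illustration of this diagnostic, the following R sketch computes the statistic in (2) for a collection of test functions; the inputs phi_source and phi_target (matrices of test-function values on the $k$-th perturbed sample and on the target sample) are hypothetical placeholders, not the authors' code.

    # A minimal sketch: phi_source (n_k x L) and phi_target (n_0 x L) hold the values
    # of the L test functions on the k-th perturbed sample and on the target sample.
    qq_statistic <- function(phi_source, phi_target) {
      nk <- nrow(phi_source); n0 <- nrow(phi_target)
      sd0 <- apply(rbind(phi_source, phi_target), 2, sd)  # pooled estimate of sd under P^0
      (colMeans(phi_source) - colMeans(phi_target)) / (sd0 * sqrt(1 / nk + 1 / n0))
    }
    # stats <- qq_statistic(phi_source, phi_target)
    # qqnorm(stats); qqline(stats)   # compare against the Gaussian reference line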

3 Out-of-distribution generalization

In this section, we demonstrate how, under the random distributional perturbation model, we can share information across datasets from different times and locations, and infer parameters of a partially observed target distribution. Specifically, we show how the distributional CLT from the previous section allows us to identify the optimal convex combination of source distributions to best approximate the target distribution.

We consider the following domain adaptation setting throughout the section. We aim to predict a target $Y$ based on some covariates $X$, that is, the data is $D=(X,Y)$. We want to estimate $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ for some loss function $\mathcal{L}(\theta,X,Y)$. We consider the setting where we observe the full $D=(X,Y)$ only under the source distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. More specifically, for the $k$-th source dataset ($k\in\{1,2,\dots,K\}$), we observe $n_{k}$ i.i.d. samples $\mathcal{D}_{k}=\{D_{k1},\dots,D_{kn_{k}}\}$ drawn from the source distribution $\mathbb{P}^{k}$. Furthermore, there are $n_{0}$ i.i.d. samples from the target distribution $\mathbb{P}^{0}$, but we only observe the covariates $X$, so that $\mathcal{D}_{0}=\{X_{01},\dots,X_{0n_{0}}\}$.

For any fixed non-negative weights $\beta_{1},\ldots,\beta_{K}$ with $\sum_{k=1}^{K}\beta_{k}=1$, one can consider weighted empirical risk minimization, that is, one can set

\hat{\theta}^{\beta}=\arg\min_{\theta}\sum_{k}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

In the following, we want to study the excess risk

\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},X,Y)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},X,Y)].

This will give us insights into how to choose the weights $\beta$ in an optimal fashion, and allow us to conduct statistical inference.

Lemma 1 (Out-of-distribution error of weighted ERM).

For each $\theta$ in an open subset of $\Omega$, let $\theta\mapsto\partial_{\theta}\mathcal{L}(\theta,x,y)$ be twice continuously differentiable in $\theta$ for every $x,y$. Assume that the matrix $\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta,X,Y)]$ exists and is nonsingular. Assume that the third partial derivatives of $\theta\mapsto\mathcal{L}(\theta,D)$ are dominated by a fixed function $h(D)$ for every $\theta$ in a neighborhood of $\theta^{0}$. We assume that $\partial_{\theta}\mathcal{L}(\theta^{0},D)$, $\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)$ and $h(D)$ are square-integrable under $\mathbb{P}^{0}$. Assume that $\hat{\theta}^{\beta}-\theta^{0}=o_{P}(1)$. Then,

m\left(\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},X,Y)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},X,Y)]\right)\stackrel{d}{\rightarrow}\mathcal{E}

for some random variable $\mathcal{E}$, with

\mathbb{E}[\mathcal{E}]=\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Trace}\big(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]^{-1}\,\mathrm{Var}_{\mathbb{P}^{0}}(\partial_{\theta}\mathcal{L}(\theta^{0},X,Y))\big).

The proof of Lemma 1 is provided in Appendix 10, along with the additional regularity assumptions under which the consistency assumption stated in Lemma 1 holds.

By this lemma, the optimal $\beta$ in terms of out-of-distribution error is

\beta^{*}=\arg\min_{\beta:\,\beta^{\intercal}\mathbb{1}=1,\;\beta\geq 0}\;\beta^{\intercal}\Sigma^{W}\beta.

A priori, $\beta^{*}$ is unknown to the researcher. In the following, we will describe estimation and inference for $\beta^{*}$. Our estimation strategy is based on the following observation. Using the distributional CLT, for any function $\phi(X)$ and any $\beta$,

\sqrt{m}\Big(\hat{\mathbb{E}}^{0}[\phi(X)]-\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\phi(X)]\Big)\stackrel{d}{\rightarrow}\mathcal{N}\big(0,\;\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Var}_{\mathbb{P}^{0}}(\phi(X))\big). \qquad (3)

We can use the fact that the variance of the limiting distribution is inflated by the factor $\beta^{\intercal}\Sigma^{W}\beta$, regardless of the test function, to estimate $\beta^{*}$. Consider different test functions of $X$, that is, $\phi_{\ell}(X)$, $\ell=1,\dots,L$. For example, we may use the individual covariates as test functions, $\phi_{\ell}=X_{\ell}$. Then, we can estimate $\beta^{*}$ by solving

\hat{\beta}=\operatorname*{argmin}_{\beta:\,\beta^{\intercal}\mathbb{1}=1,\;\beta\geq 0}\;\sum_{l=1}^{L}\Big(\hat{\mathbb{E}}^{0}[\phi_{l}(X)]-\sum_{k}\beta_{k}\hat{\mathbb{E}}^{k}[\phi_{l}(X)]\Big)^{2} \qquad (4)

and set the final estimate of $\theta^{0}$ as

\hat{\theta}=\arg\min_{\theta}\sum_{k}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

We will discuss the accuracy of this estimate in the following subsection. More specifically, we will derive confidence intervals for $\beta^{*}$ and $\theta^{0}$.
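To illustrate the estimation step, the following R sketch uses hypothetical inputs (not the authors' implementation): Phi is an $L\times K$ matrix whose $(\ell,k)$ entry is $\hat{\mathbb{E}}^{k}[\phi_{\ell}(X)]$, and phi0 is the length-$L$ vector of the corresponding target means $\hat{\mathbb{E}}^{0}[\phi_{\ell}(X)]$. The sum-to-one constraint is removed via the reparametrization used in Section 3.1, so negative weights are allowed; enforcing $\beta\geq 0$ would additionally require a quadratic-programming solver.

    # A minimal sketch (hypothetical inputs): estimate beta from the least-squares
    # problem (4), handling the sum-to-one constraint by the reparametrization of
    # Section 3.1 (the non-negativity constraint is dropped here).
    estimate_beta <- function(Phi, phi0) {
      K <- ncol(Phi)
      X <- Phi[, -K, drop = FALSE] - Phi[, K]   # columns: E^k[phi] - E^K[phi], k < K
      y <- phi0 - Phi[, K]
      b <- coef(lm(y ~ X - 1))                  # no-intercept least squares
      unname(c(b, 1 - sum(b)))                  # recover beta_K from the constraint
    }
    # beta_hat <- estimate_beta(Phi, phi0)
    # Weighted ERM then assigns weight beta_hat[k] / n_k to every sample in dataset k,
    # e.g. fit <- lm(y_pooled ~ ., data = d_pooled, weights = w) when beta_hat >= 0.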

Remark 4.

In practice, one might wonder how to choose the test functions $\phi_{l}$, $l=1,\ldots,L$. From a theoretical standpoint, under the distributional uncertainty model, one can technically choose any function with finite variance. However, for actual applications, we suggest the following guideline. Our method considers a distribution $\mathbb{P}^{k}$ to be “close” to $\mathbb{P}^{0}$ if the averaged values of the test functions on $\mathcal{D}_{k}$ closely align with the corresponding averaged values on $\mathcal{D}_{0}$. Therefore, the test functions $\phi$ should be chosen to capture the dimensions of the problem in which a close match is important. For example, in the numerical experiments in Section 4, the target of interest is the logarithmic value of income, and using as test functions the means of covariates that are well known to be related to income demonstrates good performance. We also provide simple diagnostics, presented in Figure 3 as standard residual plots and normal QQ plots, that allow one to assess the fit of the distributional perturbation model and the choice of test functions.

3.1 Inference for $\beta^{*}$ and $\theta^{0}$

In the following, we discuss how to form asymptotically valid confidence intervals for $\beta^{*}$ and $\theta^{0}$ when $\beta^{*}$ and $\hat{\beta}$ are allowed to have negative weights. As discussed earlier, we choose $L$ different test functions $\phi_{1}(X),\dots,\phi_{L}(X)$. For now, we assume that the test functions are uncorrelated and have unit variances under $\mathbb{P}^{0}$; later we will discuss the case where the test functions are correlated and have different variances. In the previous section, we saw that $\hat{\beta}$ can be estimated via a least-squares problem with a linear constraint on the weights. To remove the linear constraint in equation (4), we perform the following reparametrization:

\hat{\beta}_{1:(K-1)}=\operatorname*{argmin}_{\beta}\sum_{l=1}^{L}\left((\hat{\mathbb{E}}^{0}[\phi_{l}(X)]-\hat{\mathbb{E}}^{K}[\phi_{l}(X)])-\sum_{k=1}^{K-1}\beta_{k}(\hat{\mathbb{E}}^{k}[\phi_{l}(X)]-\hat{\mathbb{E}}^{K}[\phi_{l}(X)])\right)^{2}.

We can obtain $\hat{\beta}_{K}$ via $\hat{\beta}_{K}=1-\sum_{k=1}^{K-1}\hat{\beta}_{k}$. We can re-write this as the following linear regression problem:

\hat{\beta}_{1:(K-1)}=\operatorname*{argmin}_{\beta_{1:(K-1)}\in\mathbb{R}^{K-1}}\|\tilde{\Phi}^{0}-\tilde{\Phi}\beta_{1:(K-1)}\|_{2}^{2},

where the feature matrix in the linear regression is defined as

\tilde{\Phi}=\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}]&\dots&\hat{\mathbb{E}}^{K-1}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}]\\ \vdots&&\vdots\\ \hat{\mathbb{E}}^{1}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]&\dots&\hat{\mathbb{E}}^{K-1}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]\end{pmatrix}\in\mathbb{R}^{L\times(K-1)},

and the outcome vector in the linear regression is defined as

\tilde{\Phi}^{0}=\left(\hat{\mathbb{E}}^{0}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}],\dots,\hat{\mathbb{E}}^{0}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]\right)^{\intercal}\in\mathbb{R}^{L\times 1}.

Since the test functions were assumed to be uncorrelated and to have unit variances under $\mathbb{P}^{0}$, Theorem 1 implies that the rows of the feature matrix $\tilde{\Phi}$ and of the outcome vector $\tilde{\Phi}^{0}$ are asymptotically i.i.d., with each row drawn from a centered Gaussian distribution. Therefore, the problem may be viewed as a standard multiple linear regression problem, where the variance of the coefficient $\hat{\beta}_{1:(K-1)}$ is estimated as

\widehat{\mathrm{Var}}(\hat{\beta}_{1:(K-1)})=(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2},

and the residual variance is calculated as

\hat{\sigma}^{2}=\frac{1}{L-K+1}\|\tilde{\Phi}^{0}-\tilde{\Phi}\hat{\beta}\|_{2}^{2}.

Different from standard linear regression, in our setting each column represents a data distribution and each row represents a test function, and Gaussianity does not hold exactly in finite samples. Therefore, a priori it is unclear whether standard statistical tests in linear regression (such as the $t$-test and the $F$-test) carry over to the distributional case. The following theorem tells us that the standard linear regression results still hold. Therefore, we can conduct statistical inference for the optimal weights $\beta^{*}$ in the same manner as we conduct $t$-tests and $F$-tests in a standard linear regression. The proof of Theorem 2 can be found in the Appendix, Section 11. Note that these results assume a finite number of test functions and datasets but are asymptotic in $m(n)$, as they rely on the distributional CLT.

Theorem 2.

Assume that the test functions $\phi_{1},\dots,\phi_{L}$ are uncorrelated and have unit variances under $\mathbb{P}^{0}$. Let $\tilde{\Phi}$ and $\hat{\sigma}^{2}$ be defined as above. Then, we have

\left((\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}\right)^{-\frac{1}{2}}\left(\hat{\beta}_{1:(K-1)}-\beta^{*}_{1:(K-1)}\right)\stackrel{d}{=}t_{K-1}(L-K+1)+o_{p}(1),

where $t_{p}(d)$ denotes the $p$-dimensional multivariate $t$-distribution with $d$ degrees of freedom.
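Because Theorem 2 reduces the problem to a standard no-intercept regression, the distributional t-tests and the F-test discussed in Remarks 5 and 6 below can be read off an ordinary least-squares fit. The following R sketch (continuing the hypothetical Phi, phi0, and K from the previous sketch) is one way to do this; it is an illustration, not the dlm implementation.

    # A minimal sketch (hypothetical inputs): with uncorrelated, unit-variance test
    # functions, lm() on the reparametrized problem returns the distributional
    # t-tests, with residual degrees of freedom L - K + 1.
    Phi_tilde  <- Phi[, -K, drop = FALSE] - Phi[, K]   # L x (K-1) feature matrix
    Phi_tilde0 <- phi0 - Phi[, K]                      # length-L outcome vector

    fit <- lm(Phi_tilde0 ~ Phi_tilde - 1)
    summary(fit)            # t-statistics and p-values for beta_1, ..., beta_{K-1}

    # Distributional F-test of H0: beta_1 = ... = beta_K = 1/K (Wald form of Remark 6)
    b     <- coef(fit)
    V     <- vcov(fit)      # equals (Phi_tilde' Phi_tilde)^{-1} * sigma_hat^2
    delta <- b - 1 / K
    Fstat <- drop(t(delta) %*% solve(V) %*% delta) / (K - 1)
    pf(Fstat, K - 1, df.residual(fit), lower.tail = FALSE)   # p-value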

Remark 5 (Distributional t-test).

For this test, the null hypothesis is that the dataset $\mathcal{D}_{i}$ should receive weight zero in the weighted empirical risk minimization. Mathematically, this corresponds to testing the null hypothesis $\beta_{i}^{*}=0$. Thus, we consider the following hypotheses:

H_{0}:\beta_{i}^{*}=0,\qquad H_{1}:\beta_{i}^{*}\neq 0.

The test statistic is defined as

t=\frac{\hat{\beta}_{i}}{\sqrt{\left(\left(\tilde{\Phi}^{\intercal}\tilde{\Phi}\right)^{-1}\hat{\sigma}^{2}\right)_{ii}}}.

From Theorem 2, under the null hypothesis,

t\sim t(L-K+1),

where $t(L-K+1)$ is a univariate $t$-distribution with $L-K+1$ degrees of freedom.

Remark 6 (Distributional F-test).

For this test, the null hypothesis is that each dataset is equally informative about the target distribution. Mathematically, this corresponds to testing the null hypothesis $\beta_{i}^{*}=1/K$ for $i=1,\ldots,K$. Thus, we consider the following hypotheses:

H_{0}:\;\beta_{1}^{*}=\ldots=\beta_{K}^{*}=\frac{1}{K},\qquad H_{1}:\;\beta_{i}^{*}\neq\frac{1}{K}\text{ for at least one }i.

The test statistic is defined as

F=\left(\hat{\beta}_{1:(K-1)}-\frac{1}{K}\cdot\bm{1}_{K-1}\right)^{\intercal}\left((\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}\right)^{-1}\left(\hat{\beta}_{1:(K-1)}-\frac{1}{K}\cdot\bm{1}_{K-1}\right)\Big/(K-1).

From Theorem 2, under the null hypothesis,

F\sim F_{K-1,\,L-K+1},

where $F_{K-1,\,L-K+1}$ is an $F$-distribution with $K-1$ and $L-K+1$ degrees of freedom.

Remark 7 (Distributional confidence intervals for parameters of the target distribution).

We will now discuss how to form asymptotically valid confidence intervals for $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ when $m\to\infty$ and $L\to\infty$. Here we assume that the influence function $\phi=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},X,Y)$ has a finite fourth moment under $\mathbb{P}^{0}$. Then, we can form $(1-\alpha)$-confidence intervals for $\theta^{0}$ with $\hat{\theta}=\arg\min_{\theta}\sum_{k}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)]$ as follows:

\hat{\theta}\pm z_{1-\alpha/2}\cdot\sqrt{\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi)}\cdot\sqrt{\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}}.

Here, $\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi)$ denotes the empirical variance of

\hat{\phi}(X,Y)=-\Big(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)]\Big)^{-1}\partial_{\theta}\mathcal{L}(\hat{\theta},X,Y)

on the pooled donor data $\{\{X_{ki},Y_{ki}\}_{i=1}^{n_{k}}\}_{k=1}^{K}$. The proof of this result can be found in the Appendix, Section 12.
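As a concrete special case, consider $\mathcal{L}(\theta,Y)=(Y-\theta)^{2}/2$, so that $\theta^{0}=\mathbb{E}^{0}[Y]$ and the influence function is $\phi(Y)=Y-\theta^{0}$ with variance $\mathrm{Var}_{\mathbb{P}^{0}}(Y)$. A minimal R sketch of the resulting interval, with hypothetical inputs y_list (the K source outcome vectors) and Phi, phi0, beta_hat as in the sketches above:

    # A minimal sketch for the mean-of-Y case: theta_hat is the weighted ERM solution,
    # sd_phi the pooled estimate of sd(Y) under P^0, and resid_rms the root-mean-square
    # mismatch of the weighted test-function means, as in Remark 7.
    theta_hat <- sum(beta_hat * sapply(y_list, mean))     # sum_k beta_k * mean of Y in dataset k
    sd_phi    <- sd(unlist(y_list))                       # pooled estimate of sd under P^0
    resid_rms <- sqrt(mean((phi0 - Phi %*% beta_hat)^2))  # sqrt( (1/L) * sum_l residual_l^2 )
    alpha <- 0.05
    theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * sd_phi * resid_rms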

Remark 8 (Distributional sample size and degrees of freedom).

Note that the asymptotic variance of the “distributional” regression parameters, $(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}$, has the same algebraic form as the asymptotic variance of the regression parameters in a standard Gaussian linear model. In the classical setting, the test statistics of regression parameters follow $t$-distributions with $n-p+1$ degrees of freedom, where $n$ is the sample size and $p$ is the number of covariates. In our setting, we have $L-K+1$ degrees of freedom, where $L$ corresponds to the number of test functions and $K$ corresponds to the number of donor distributions.

Remark 9 (Correlated test functions).

We assumed that the test functions $(\phi_{1}(X),\dots,\phi_{L}(X))$ are uncorrelated and have unit variances under $\mathbb{P}^{0}$. In practice, this might not be the case. In such cases, we can apply a linear transformation $T$ to the test functions in a pre-processing step to obtain uncorrelated test functions with unit variances. We define the transformation matrix $T=(\hat{\Sigma}^{\Phi})^{-1/2}$, where $\hat{\Sigma}^{\Phi}$ is an estimate of the covariance matrix $\Sigma^{\Phi}$ of $(\phi_{1}(X),\dots,\phi_{L}(X))$ on the pooled data. From Remark 3, if $(\phi_{1}(X),\dots,\phi_{L}(X))$ have finite fourth moments, we have $\hat{\Sigma}^{\Phi}=\Sigma^{\Phi}+o_{p}(1)$.
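A minimal R sketch of this pre-whitening step; the input phi_pooled (an $n\times L$ matrix of test-function values on the pooled data) is a hypothetical placeholder, and the estimated covariance matrix is assumed to be positive definite:

    # Whitening the test functions: multiply by Sigma_hat^{-1/2} so that the
    # transformed test functions are approximately uncorrelated with unit variance.
    Sigma_hat <- cov(phi_pooled)
    eig <- eigen(Sigma_hat, symmetric = TRUE)
    T_white <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)  # Sigma_hat^{-1/2}
    phi_white <- phi_pooled %*% T_white
    # The same transformation is applied to the per-dataset means before forming Phi_tilde.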

4 ACS Income Data

In this section, we demonstrate the effectiveness and robustness of our method when the training dataset is obtained from multiple sources. We focus on the ACS Income dataset [Ding et al., 2021] where the goal is to predict the logarithmic value of individual income based on tabular census data.

Similar to Shen et al. [2023], we consider a scenario where we initially have a limited training dataset from California (CA) and subsequently obtain an additional training dataset from Puerto Rico (PR). We iterate over the remaining 49 states as the target state. Note that there are substantial economic, job market, and cost-of-living disparities between CA and PR, and most states are more similar to CA than to PR. Therefore, augmenting the training dataset with data sourced from PR will lead to substantial performance degradation of the prediction algorithm on the target state.

We compare three methods in our study. In the first method, we fit XGBoost on the training dataset in the usual manner, with all samples assigned equal weights. In the second method, we employ sample-specific importance weights. The importance weights are calculated as follows. First, pool the covariates $X$ of the training dataset (CA + PR) and the target dataset. Let $A_{i}$ be the label that equals 1 if the $i$-th sample is from the target dataset and 0 otherwise. Then, the importance weight for the $i$-th sample is obtained as

\hat{w}_{IW,i}=\frac{\hat{P}(X=X_{i}|A_{i}=1)}{\hat{P}(X=X_{i}|A_{i}=0)}=\frac{\hat{P}(A_{i}=1|X=X_{i})\,\hat{P}(A_{i}=0)}{\hat{P}(A_{i}=0|X=X_{i})\,\hat{P}(A_{i}=1)},

where we estimate $\hat{P}(A_{i}|X=X_{i})$ using XGBoost on the pooled covariate data and estimate $\hat{P}(A_{i}=1)$ as the ratio of the sample size of the target data to the sample size of the pooled data.
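For concreteness, a minimal R sketch of this importance-weighting baseline, using logistic regression as a simple stand-in for the XGBoost classifier used in the paper; the data frames x_train and x_target are hypothetical placeholders with identical covariate columns:

    # Importance weights via a domain classifier:
    # w_i = P_hat(A=1|X_i) * P_hat(A=0) / (P_hat(A=0|X_i) * P_hat(A=1)).
    pooled <- rbind(x_train, x_target)
    A <- c(rep(0, nrow(x_train)), rep(1, nrow(x_target)))      # 1 = target, 0 = training

    clf <- glm(A ~ ., data = data.frame(pooled, A = A), family = binomial())
    p1  <- predict(clf, newdata = x_train, type = "response")  # P_hat(A = 1 | X) on training rows

    prior_ratio <- (1 - mean(A)) / mean(A)                     # P_hat(A = 0) / P_hat(A = 1)
    w_iw <- p1 / (1 - p1) * prior_ratio                        # importance weight per training sample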

In the third method, based on our approach, XGBoost is fitted on the training dataset, but this time samples receive different weights depending on whether they originate from CA or PR. We utilize the occupation-code feature to construct test functions. The test function $\phi_{\ell}(D_{i})$ is defined as the indicator of whether the $i$-th sample has the $\ell$-th occupation code. The occupation code is a categorical variable representing around 500 different occupation categories; the data dictionary in the ACS PUMS documentation (https://www.census.gov/programs-surveys/acs/microdata/documentation.2018.html) provides the full list of occupation codes. The total number of test functions is therefore around 500. We then estimate the distribution-specific weights $(w_{\text{CA}},w_{\text{PR}})$ by solving

\hat{w}_{\text{CA}},\hat{w}_{\text{PR}}=\operatorname*{argmin}_{w_{\text{CA}}+w_{\text{PR}}=1}\sum_{\ell=1}^{L}\big(\hat{\mathbb{E}}^{\text{Target}}[\phi_{\ell}]-w_{\text{CA}}\hat{\mathbb{E}}^{\text{CA}}[\phi_{\ell}]-w_{\text{PR}}\hat{\mathbb{E}}^{\text{PR}}[\phi_{\ell}]\big)^{2}.

Note that $\hat{\mathbb{E}}^{\text{state}}[\phi_{\ell}]$ is the proportion of samples of the state with the $\ell$-th occupation code.
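With only two sources and the sum-to-one constraint, this least-squares problem is one-dimensional and has a closed form. A minimal R sketch, where the vectors p_target, p_ca, and p_pr of occupation-code proportions are hypothetical placeholders:

    # Closed-form two-source weights: substitute w_PR = 1 - w_CA and minimize over w_CA.
    d_ca <- p_ca - p_pr
    w_ca <- sum((p_target - p_pr) * d_ca) / sum(d_ca^2)
    w_pr <- 1 - w_ca
    c(w_ca, w_pr)
    # In the weighted ERM, each CA sample then receives weight w_ca / n_ca and
    # each PR sample weight w_pr / n_pr.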

The results for the target states of Texas (TX) and Washington (WA) are given in Figure 1. Similar patterns are observed for other target states, aligning with the behavior of either TX or WA. Results for other target states can be found in Appendix 13.

For the target TX, as we include the PR data, the Mean Squared Error (MSE) initially drops for all the methods. However, as more data from PR is added, the MSE starts to increase for both methods with equal weights and importance weights. In contrast, our method demonstrates robustness, with the MSE decreasing after including PR data and staying consistently low, even when the PR source becomes dominant in the training dataset. Turning to the target WA, for both methods with equal weights and importance weights, adding data sourced from PR results in a straightforward increase in MSE. In contrast, our method assigns weights close to 0 for samples originating from PR, preventing significant performance degradation.

5 GTEx Data

In this section, we illustrate our methods using the GTEx data. We will also run model diagnostics to evaluate model fit. The GTEx V6 dataset provides RNA-seq gene-expression levels collected from 450 donors across 44 tissues.

Treating each tissue as a separate study, we have 44 different datasets. As shown in Figure 2, some tissues exhibit higher correlations than others. To demonstrate our method, we select 5 different tissues (same as in Figure 2): Adipose subcutaneous, Adipose visceral omentum, Brain cortex, Brain frontal cortex, and Brain cerebellum. Let brain cortex be our target tissue (the third tissue in Figure 2). The code and the data are available as R package dlm at https://github.com/yujinj/dlm.

For our test functions, we randomly sample 1000 gene pairs and define a test function as the standardized product of the gene-expression levels of a pair. In total, we have 1000 test functions. Most genes are known to be uncorrelated with each other, with the 90th percentile range of correlations between two different test functions being about $[-0.06,0.06]$. We employ group Lasso to estimate the sparse inverse covariance matrix. As the estimated inverse covariance matrix is close to a diagonal matrix (details provided in Appendix 13), we do not perform a whitening transformation.

The R package dlm (https://github.com/yujinj/dlm) contains a function dlm that performs distributional linear regressions:

    dlm(formula, test.function, data, whitening)

Here, the formula parameter specifies which model we want to fit. The test.function parameter specifies the test functions to be used. The data parameter passes a list of datasets. The whitening parameter is a boolean indicating whether one wants to perform a whitening transformation. Additional details and descriptions of the function can be found on the GitHub page.

Below is the summary output of the dlm function with the model formula “Brain cortex ~ Adipose subcutaneous + Adipose visceral omentum + Brain frontal cortex + Brain cerebellum” and 1000 test functions defined as standardized products of randomly selected gene pairs. Note that the summary output of the dlm function closely resembles that of the commonly used lm function.

    Call:
    dlm(formula = Brain_Cortex ~ Adipose_Subcutaneous + Adipose_Visceral_Omentum +
            Brain_Frontal_Cortex_BA9 + Brain_Cerebellum,
        test.function = phi, data = GTEx_data, whitening = FALSE)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.56838 -0.09436 -0.00604  0.08739  0.40917

    Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
    Adipose_Subcutaneous     -0.0001295  0.0304036  -0.004   0.9966
    Adipose_Visceral_Omentum  0.0462641  0.0250888   1.844   0.0655 .
    Brain_Frontal_Cortex_BA9  0.7846112  0.0184325  42.567  < 2e-16 ***
    Brain_Cerebellum          0.1692543  0.0217128   7.795 1.61e-14 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.1377 on 997 degrees of freedom
    Multiple R-squared:  0.5088,    Adjusted R-squared:  0.5073
    F-statistic: 344.3 on 3 and 997 DF,  p-value: < 2.2e-16

The coefficients represent the estimated weights for the perturbed datasets, and they sum to one. The estimated weight for the brain-frontal-cortex dataset is close to 1: 0.7846112. On the other hand, the estimated weights for the adipose-subcutaneous and adipose-visceral-omentum datasets are close to 0: -0.0001295 and 0.0462641. This supports the observations made in Figure 2, where the brain-cortex dataset (the target) exhibits a higher correlation with the brain-frontal-cortex dataset than with the others.

Let us discuss the summary output in more detail. All tests have in common that the probability statements are with respect to distributional uncertainty, that is, the uncertainty induced by the random distributional perturbations.

  1. (Distributional t-test). The summary output provides a t-statistic and a p-value for each estimated weight. Based on the output, the estimated weight for the brain-frontal-cortex dataset is highly significant (with a t-statistic of 42.567 and a p-value less than $2\cdot 10^{-16}$). We have enough evidence to believe that the true optimal weight for the brain-frontal-cortex dataset is nonzero, and thus we conclude that the brain-frontal-cortex dataset is important for predicting the target dataset. On the other hand, the estimated weights for the adipose-subcutaneous and adipose-visceral-omentum datasets are not statistically significant (with t-statistics of -0.004 and 1.844 and p-values of 0.9966 and 0.0655). We conclude that those datasets are not as important as the brain-frontal-cortex dataset for predicting the target dataset.

  2. (Distributional F-test). The summary output provides an F-statistic and a corresponding p-value for the null hypothesis that the optimal weights are uniform ($\beta^{*}=(0.25,0.25,0.25,0.25)$ in our example), which means that the different datasets are equally informative in explaining the target dataset. The F-statistic is 344.3, which follows the F-distribution with 3 and 997 degrees of freedom under the null hypothesis. The resulting p-value is less than $2.2\cdot 10^{-16}$. Therefore, we reject the null hypothesis. From Figure 2 and the estimated weights, we can see that the brain-frontal-cortex and brain-cerebellum datasets are more informative in explaining the target dataset than the adipose-subcutaneous and adipose-visceral-omentum datasets.

  3. (R-squared). Usually, the coefficient of determination (R-squared) measures how much of the variation of a target variable is explained by the explanatory variables in a regression model. In our output, R-squared measures how much of the variance of the target dataset that is left unexplained under uniform weights is explained by our proposed weights. Specifically, it is calculated as

     R^{2}=1-\frac{\text{RSS with estimated weights}}{\text{RSS with uniform weights}}.

     R-squared is close to 0 if the estimated weights are close to uniform weights.

Finally, we should consider diagnostic plots to evaluate the appropriateness of the fitted distribution shift model. This can be done using a distributional Tukey-Anscombe plot and a distributional QQ plot. In contrast to conventional residual plots and QQ plots, each point in these plots corresponds to the mean of a test function instead of a single observation. Figure 3 shows that neither the QQ plot nor the Tukey-Anscombe plot indicates a deviation from the modeling assumptions.

Figure 3: The distributional residual plot and distributional QQ plot for the GTEx data. In contrast to conventional residual plots and QQ plots, each point corresponds to the mean of a test function instead of a single observation.

6 Discussion

In many practical settings, there is a distribution shift between the training and the target distribution. Existing distribution shift models fall into two categories. One category, which we term “sparse distribution shift”, assumes that the shift affects specific parts of the data generation process while leaving other parts invariant. Existing methods attempt to re-weight training samples to match the target distribution or to learn invariant representations of features. The second category considers worst-case distributional shifts. More specifically, these approaches consider the worst-case behavior of a statistical functional over a fixed neighborhood of the model, defined in terms of a distributional distance. This often requires knowledge of the strength and shape of the perturbations.

In contrast, we consider random dense distributional shifts, where the shift arises as the superposition of many small random changes. We have found that the random perturbation model can be useful in modeling the empirical phenomena observed in Figure 2. Moreover, we see that even when the overlap assumptions seem to be violated (so that the reweighting approach fails), or when the distribution shift appears large in Kullback-Leibler divergence, the random perturbation model may still be appropriate and useful. Under the random distributional perturbation model, we establish foundations for transfer learning and generalization.

Our method shares methodological similarities with synthetic controls. While synthetic controls are typically applied to panel data under the assumption of a linear factor model, our procedure can be applied to model distribution shifts of any type of data (discrete, continuous, ordinal) with or without time structure. Furthermore, our method justifies the linearity in synthetic control methods under random distributional shifts, rather than assuming a linear factor model. The random distributional shift assumption can be evaluated based on plots in Figure 3. Additionally, we propose a generalization of our procedure to empirical risk minimization.

In practice, perturbations may involve a combination of sparse and dense shifts. In such cases, hybrid approaches (sparse + dense) may be appropriate. We view hybrid models as an important generalization and leave it for future work.

A companion R package, dlm, is available at https://github.com/yujinj/dlm. Our package performs distributional generalization under the random perturbation model. Users can replicate the results in Section 5 with the data and code provided in the package.

7 Acknowledgments

The authors are grateful for insightful discussions with Naoki Egami, Kevin Guo, Ying Jin, Hongseok Namkoong, and Elizabeth Tipton. This work was supported by Stanford University’s Human-Centered Artificial Intelligence (HAI) Hoffman-Yee Grant, and by the Dieter Schwarz Foundation.

References

  • Abadie and Gardeazabal [2003] A. Abadie and J. Gardeazabal. The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132, 2003.
  • Abadie et al. [2007] A. Abadie, A. Diamond, and J. Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105:493–505, 2007.
  • Argyriou et al. [2007] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19 (NIPS), pages 41–48, 2007.
  • Athey et al. [2021] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi. Matrix completion methods for causal panel data models. Journal of the American Statistical Association, pages 1–41, 2021.
  • Bai [2009] J. Bai. Panel data models with interactive fixed effects. Econometrica, 77(4):1229–1279, 2009.
  • Baktashmotlagh et al. [2013] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
  • Bansak et al. [2024] K. Bansak, E. Paulson, and D. Rothenhäusler. Learning under random distributional shifts. International Conference on Artificial Intelligence and Statistics, 2024.
  • Bodnar and Okhrin [2008] T. Bodnar and Y. Okhrin. Properties of the singular, inverse and generalized inverse partitioned wishart distributions. Journal of Multivariate Analysis, 99(10):2389–2405, 2008.
  • Borenstein et al. [2009] M. Borenstein, L. V. Hedges, J. P. Higgins, and H. R. Rothstein. Introduction to Meta-Analysis. John Wiley & Sons, New York, NY, 2009.
  • Cheung [2014] M. W. Cheung. Modeling dependent effect sizes with three-level meta-analyses: a structural equation modeling approach. Psychological Methods, 19(2):211–229, 2014.
  • Ding et al. [2021] F. Ding, M. Hardt, J. Miller, and L. Schmidt. Retiring adult: New datasets for fair machine learning. In Advances in Neural Information Processing Systems, volume 34, pages 6478–6490, 2021.
  • Doudchenko and Imbens [2016] N. Doudchenko and G. W. Imbens. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. Technical report, National Bureau of Economic Research, 2016.
  • Duchi and Namkoong [2021] J. C. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Dudley [2018] R. M. Dudley. Real Analysis and Probability. CRC Press, 2018.
  • Gretton et al. [2009] A. Gretton, J. Smola, M. Huang, K. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3:131–160, 2009.
  • Hartung et al. [2008] J. Hartung, G. Knapp, and B. K. Sinha. Statistical Meta-Analysis With Applications. John Wiley & Sons, New York, NY, 2008.
  • Hedges et al. [2010] L. V. Hedges, E. Tipton, and M. C. Johnson. Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1):39–65, 2010.
  • Higgins et al. [2009] J. Higgins, S. G. Thompson, and D. J. Spiegelhalter. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1):137–159, 2009.
  • Huber [1981] P. Huber. Robust Statistics. Wiley, New York, 1981.
  • Jeong and Rothenhäusler [2023] Y. Jeong and D. Rothenhäusler. Calibrated inference: statistical inference that accounts for both sampling uncertainty and distributional uncertainty. arXiv preprint arXiv:2202.11886, 2023.
  • Li et al. [2016] S. Li, S. Song, and G. Huang. Prediction reweighting for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 28(7):1682–1695, 2016.
  • Long et al. [2014] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.
  • Lunn et al. [2013] D. Lunn, J. Barrett, M. Sweeting, and S. Thompson. Fully bayesian hierarchical modelling in two stages, with application to meta-analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 62(4):551–572, 2013.
  • Muirhead [1982] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1982.
  • Pan and Yang [2010] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
  • Quinonero-Candela et al. [2009] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
  • Rothenhäusler and Bühlmann [2023] D. Rothenhäusler and P. Bühlmann. Distributionally robust and generalizable inference. Statistical Science, 38(4):527–542, 2023.
  • Schölkopf et al. [2021] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
  • Shen et al. [2023] J. H. Shen, I. D. Raji, and I. Y. Chen. Mo’data mo’problems: How data composition compromises scaling properties. 2023.
  • Shimodaira [2000] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Sugiyama et al. [2008] M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 21 (NIPS), pages 1433–1440, 2008.
  • Szegedy et al. [2014] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
  • Van der Vaart [2000] A. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  • Wood [2008] A. Wood. Methodology for dealing with duplicate study effects in a meta-analysis. Organizational Research Methods, 11:79–95, 2008.
  • Xiong et al. [2023] X. Xiong, Z. Guo, and T. Cai. Distributionally robust transfer learning. arXiv preprint arXiv:2309.06534, 2023.
  • Xu [2017] Y. Xu. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1):57–76, 2017.
  • Zhan et al. [2024] K. Zhan, X. Xiong, Z. Guo, T. Cai, and M. Liu. Transfer learning targeting mixed population: A distributional robust perspective. arXiv preprint arXiv:2407.20073, 2024.

Appendix

8 Proof of Theorem 1

8.1 Extended Theorem 1

In the following, we present the extended version of Theorem 1. Note that we get Theorem 1 in the main text by defining ϕ(D)=(ϕ1(D),,ϕK(D))\phi(D)=(\phi_{1}(D),\dots,\phi_{K}(D))^{\intercal}.

Extended Theorem 1.

Under the assumptions of Theorem 1, for any Borel measurable square-integrable function ϕ:𝒟L\phi:\mathcal{D}\xrightarrow[]{}\mathbb{R}^{L}, we have

m((1n1i=1n1ϕ(D1i)1nKi=1nKϕ(DKi))(𝔼0[ϕ(D)]𝔼0[ϕ(D)]))𝑑N(0,ΣWVar0(ϕ(D))),\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi(D)]\end{pmatrix}\right)\xrightarrow[]{d}N\left(0,\Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\right), (5)

where ΣWK×K\Sigma^{W}\in\mathbb{R}^{K\times K} is

(ΣW)ij=Cov(Wi,Wj)E[Wi]E[Wj].(\Sigma^{W})_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]}.

Here, \otimes denotes the Kronecker product. Note that \Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D)) can be written as

\left[\begin{array}{cccc}\Sigma^{W}_{11}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{12}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{1K}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\\ \Sigma^{W}_{21}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{22}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{2K}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\\ \vdots&\vdots&&\vdots\\ \Sigma^{W}_{K1}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{K2}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{KK}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\end{array}\right].

In the following, we prove Theorem 1. From the proof, we can get Extended Theorem 1 by considering

(𝔼^1[ϕ1(D)],,𝔼^1[ϕL(D)],𝔼^2[ϕ1(D)],,𝔼^2[ϕL(D)],,𝔼^K[ϕ1(D)],,𝔼^K[ϕL(D)]),(\hat{\mathbb{E}}^{1}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{1}[\phi_{L}(D)],\hat{\mathbb{E}}^{2}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{2}[\phi_{L}(D)],\dots\dots,\hat{\mathbb{E}}^{K}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{K}[\phi_{L}(D)]),

where ϕ(D)=(ϕ1(D),,ϕL(D))L\phi(D)=(\phi_{1}(D),\dots,\phi_{L}(D))^{\intercal}\in\mathbb{R}^{L}.
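Before turning to the proof, we note that the covariance structure in (5) can be checked by simulation. The following is a minimal sketch (not part of the formal argument): it assumes a uniform base measure on [0,1], takes h to be the identity, and draws i.i.d. standard exponential bin weights, so that \Sigma^{W} is the identity matrix; the empirical covariance of the scaled dataset means should then be close to \Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}(\phi(D)).

```python
import numpy as np

# Minimal simulation sketch of the random perturbation model (assumptions:
# uniform base measure on [0,1], h equal to the identity, i.i.d. standard
# exponential bin weights, so that Sigma^W is the K x K identity matrix).
rng = np.random.default_rng(0)
m, K, n, reps = 100, 3, 20_000, 500              # bins, datasets, samples per dataset, repetitions
phi = lambda u: np.column_stack([u, u ** 2])     # L = 2 test functions

means = np.zeros((reps, K, 2))
for r in range(reps):
    W = rng.exponential(size=(K, m))             # weights W_j^k, i.i.d. across j and k
    for k in range(K):
        p = W[k] / W[k].sum()                    # perturbed bin probabilities
        bins = rng.choice(m, size=n, p=p)        # draw a bin for each observation
        u = (bins + rng.random(n)) / m           # draw uniformly within the chosen bin
        means[r, k] = phi(u).mean(axis=0)        # dataset mean of phi

# Covariance of sqrt(m) * (stacked dataset means) across repetitions; this
# should be close to Sigma^W (here the identity) Kronecker Var_P0(phi(D)).
flat = np.sqrt(m) * means.reshape(reps, -1)
print(np.round(np.cov(flat, rowvar=False), 2))
print(np.round(np.cov(phi(rng.random(10 ** 5)), rowvar=False), 2))  # Monte Carlo Var_P0(phi(D))
```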

8.2 Auxiliary Lemma for Theorem 1

Let us first state an auxiliary lemma that will turn out helpful for proving Theorem 1.

Lemma 2.

Let the assumptions of Theorem 1 hold. For any bounded measurable ϕ1,,ϕK:𝒟\phi_{1},\dots,\phi_{K}:\mathcal{D}\xrightarrow[]{}\mathbb{R}, we have that

\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)\xrightarrow{d}N\left(0,\Sigma^{W}\odot\text{Var}(\overrightarrow{\phi}(D))\right)

where (ΣW)ij=Cov(Wi,Wj)E[Wi]E[Wj](\Sigma^{W})_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]} and ϕ(D)=(ϕ1(D),,ϕK(D))\overrightarrow{\phi}(D)=(\phi_{1}(D),\dots,\phi_{K}(D))^{\intercal}.

Proof.

Let ψk=ϕkh\psi_{k}=\phi_{k}\circ h. Without loss of generality, assume that 𝔼0[ψk(U)]=0\mathbb{E}^{0}[\psi_{k}(U)]=0 for k=1,,Kk=1,\dots,K. Note that

m(𝔼k[ψk(U)]𝔼0[ψk(U)])=mj=1mxIjψk(x)𝑑x(WjkE[Wk])j=1mWjk/m.\sqrt{m}(\mathbb{E}^{k}[\psi_{k}(U)]-\mathbb{E}^{0}[\psi_{k}(U)])=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W^{k}_{j}-E[W^{k}])}{\sum_{j=1}^{m}W^{k}_{j}/m}.

Let

Ym,j:=m(xIjψ1(x)𝑑x(Wj1E[W1])xIjψK(x)𝑑x(WjKE[WK])).Y_{m,j}:=\sqrt{m}\begin{pmatrix}\int_{x\in I_{j}}\psi_{1}(x)dx\cdot(W^{1}_{j}-E[W^{1}])\\ \vdots\\ \int_{x\in I_{j}}\psi_{K}(x)dx\cdot(W^{K}_{j}-E[W^{K}])\end{pmatrix}.

First, note that

E[Ym,j]=0E[Y_{m,j}]=0 (6)

for all jj. As the second step, we want to show that

j=1mCov(Ym,j)Var(W)Var0(ψ(U)),\sum_{j=1}^{m}\text{Cov}(Y_{m,j})\xrightarrow[]{}\text{Var}(\overrightarrow{W})\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\psi}(U)), (7)

where W=(W1,,WK)\overrightarrow{W}=(W^{1},\dots,W^{K})^{\intercal}, ψ(U)=(ψ1(U),,ψK(U))\overrightarrow{\psi}(U)=(\psi_{1}(U),\dots,\psi_{K}(U))^{\intercal}, and \odot is an element-wise multiplication. For any fL2([0,1])f\in L^{2}([0,1]), define Πm(f)\Pi_{m}(f) as

Πm(f)(x)=j=1m(mxIjf(x)𝑑x)I(xIj).\Pi_{m}(f)(x)=\sum_{j=1}^{m}\left(m\int_{x\in I_{j}}f(x)dx\right)\cdot I(x\in I_{j}).

Then, we have

|(j=1mCov(Ym,j)Var(W)Var0(ψ(U)))kk|\displaystyle\Bigg{|}\left(\sum_{j=1}^{m}\text{Cov}(Y_{m,j})-\text{Var}(\overrightarrow{W})\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\psi}(U))\right)_{kk^{\prime}}\Bigg{|}
=|(mj=1mxIjψk(x)𝑑xxIjψk(x)𝑑xj=1mxIjψk(x)ψk(x)𝑑x)Cov(Wk,Wk)|\displaystyle\quad\quad=\Bigg{|}\left(m\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\int_{x\in I_{j}}\psi_{k^{\prime}}(x)dx-\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)\psi_{k^{\prime}}(x)dx\right)\cdot\text{Cov}(W^{k},W^{k^{\prime}})\Bigg{|}
ψk2ψkΠm(ψk)2|Cov(Wk,Wk)|0\displaystyle\quad\quad\leq||\psi_{k}||_{2}||\psi_{k^{\prime}}-\Pi_{m}(\psi_{k^{\prime}})||_{2}\cdot|\text{Cov}(W^{k},W^{k^{\prime}})|\xrightarrow[]{}0

for 1k,kK1\leq k,k^{\prime}\leq K as mm goes to infinity. This is because any bounded function can be approximated by a sequence of step functions of the form j=1mbjI(xIj)\sum_{j=1}^{m}b_{j}I(x\in I_{j}). Next we will show that for any ϵ>0\epsilon>0,

gm(ϵ)=j=1mE[Ym,j22;Ym,j2ϵ]0.g_{m}(\epsilon)=\sum_{j=1}^{m}E[||Y_{m,j}||_{2}^{2};||Y_{m,j}||_{2}\geq\epsilon]\xrightarrow[]{}0. (8)

Let ψ1,,ψKB||\psi_{1}||_{\infty},\dots,||\psi_{K}||_{\infty}\leq B. Then this is implied by the dominated convergence theorem as

j=1mE[Ym,j22;Ym,j2ϵ]\displaystyle\sum_{j=1}^{m}E[||Y_{m,j}||_{2}^{2};||Y_{m,j}||_{2}\geq\epsilon] B2E[WE[W]22I(BWE[W]2/mϵ)]0.\displaystyle\leq B^{2}E[||\overrightarrow{W}-E[\overrightarrow{W}]||_{2}^{2}I(B||\overrightarrow{W}-E[\overrightarrow{W}]||_{2}/\sqrt{m}\geq\epsilon)]\xrightarrow[]{}0. (9)

Combining equations (6), (7), and (8), we can apply the multivariate Lindeberg central limit theorem. Together with Slutsky’s theorem, we have

\displaystyle\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)=\sum_{j=1}^{m}Y_{m,j}\odot\begin{pmatrix}1/(\sum_{j=1}^{m}W_{j}^{1}/m)\\ \vdots\\ 1/(\sum_{j=1}^{m}W_{j}^{K}/m)\end{pmatrix}
𝑑N(0,ΣWVar0(ϕ(D)))\displaystyle\xrightarrow[]{d}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)

where ΣijW=Cov(Wi,Wj)E[Wi]E[Wj]\Sigma^{W}_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]}. This completes the proof. ∎

8.3 Proof of Theorem 1

Proof.

For any \phi_{k}\in L^{2}(\mathbb{P}^{0}) and any given \epsilon>0, there exists a bounded function \phi_{k}^{B} such that \mathbb{E}^{0}[\phi_{k}(D)]=\mathbb{E}^{0}[\phi_{k}^{B}(D)] and ||\phi_{k}-\phi_{k}^{B}||_{L^{2}(\mathbb{P}^{0})}<\epsilon for k=1,\dots,K. Then,

\displaystyle\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi_{1}(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi_{K}(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad=\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi_{1}(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi_{K}(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}\right) (a)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}^{B}(D)]\end{pmatrix}\right) (b)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}^{B}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}^{B}(D)]\end{pmatrix}\right). (c)

Note that for any ϵ>0\epsilon^{\prime}>0 and for k=1,,Kk=1,\dots,K,

P(m|1nki=1nkϕk(Dki)𝔼k[ϕk(D)]|>ϵ)\displaystyle P\left(\sqrt{m}\Bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{k}(D_{ki})-\mathbb{E}^{k}[\phi_{k}(D)]\Bigg{|}>\epsilon^{\prime}\right)
=E(P(|1nki=1nkϕk(Dki)𝔼k[ϕk(D)]|>ϵm | Wk))\displaystyle=E\left(P\left(\Bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{k}(D_{ki})-\mathbb{E}^{k}[\phi_{k}(D)]\Bigg{|}>\frac{\epsilon^{\prime}}{\sqrt{m}}\text{ }\Bigg{|}\text{ }W_{k}\right)\right)
E(𝔼k[ϕk2(D)]ϵ2)mnk=𝔼0[ϕk2(D)]ϵ2mnk0,\displaystyle\leq E\left(\frac{\mathbb{E}^{k}[\phi_{k}^{2}(D)]}{\epsilon^{\prime 2}}\right)\cdot\frac{m}{n_{k}}=\frac{\mathbb{E}^{0}[\phi_{k}^{2}(D)]}{\epsilon^{\prime 2}}\cdot\frac{m}{n_{k}}\xrightarrow[]{}0,

as the distributional uncertainty is of higher order than the sampling uncertainty. Therefore, (a) =op(1)=o_{p}(1). Next let us investigate (b). Recall that with ψk=(ϕkϕkB)h\psi_{k}=(\phi_{k}-\phi_{k}^{B})\circ h,

m(𝔼k[ϕk(D)]𝔼k[ϕkB(D)])\displaystyle\sqrt{m}(\mathbb{E}^{k}[\phi_{k}(D)]-\mathbb{E}^{k}[\phi_{k}^{B}(D)]) =mj=1mxIjψk(x)𝑑x(WjkE[Wk])j=1mWjk/m.\displaystyle=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W_{j}^{k}-E[W^{k}])}{\sum_{j=1}^{m}W_{j}^{k}/m}.

The variance of the numerator is bounded as

Var(Wk)j=1mm(xIjψk(x)𝑑x)2\displaystyle\text{Var}(W^{k})\sum_{j=1}^{m}m\left(\int_{x\in I_{j}}\psi_{k}(x)dx\right)^{2} Var(Wk)j=1mxIjψk2(x)𝑑x\displaystyle\leq\text{Var}(W^{k})\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}^{2}(x)dx
=Var(Wk)𝔼0[ψk2(U)]\displaystyle=\text{Var}(W^{k})\mathbb{E}^{0}[\psi_{k}^{2}(U)]
<Var(Wk)ϵ2\displaystyle<\text{Var}(W^{k})\cdot\epsilon^{2}

where the first inequality holds by Jensen’s inequality with mxIj𝑑x=1m\int_{x\in I_{j}}dx=1. The denominator converges in probability to E[Wk]E[W^{k}]. Therefore, we can write

m(𝔼k[ϕk(D)]𝔼k[ϕkB(D)])=mj=1mxIjψk(x)𝑑x(WjkE[Wk])E[Wk]+sn,k\sqrt{m}(\mathbb{E}^{k}[\phi_{k}(D)]-\mathbb{E}^{k}[\phi_{k}^{B}(D)])=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W_{j}^{k}-E[W^{k}])}{E[W^{k}]}+s_{n,k}

where sn,ks_{n,k} is op(1)o_{p}(1). Combining results, for sn=(sn,1,,sn,K)s_{n}=(s_{n,1},\dots,s_{n,K})^{\intercal}, we have that E[(b)sn]=0E[\eqref{eq:b}-s_{n}]=0 and

VarP((b)sn)Var(W)minkE2[Wk]ϵ2.||\text{Var}_{P}(\eqref{eq:b}-s_{n})||_{\infty}\leq\frac{||\text{Var}(\overrightarrow{W})||_{\infty}}{\min_{k}E^{2}[W^{k}]}\cdot\epsilon^{2}.

From Lemma 2, we know that

(c)𝑑N(0,ΣWVar0(ϕB(D))).\eqref{eq:c}\xrightarrow[]{d}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}\left(\overrightarrow{\phi^{B}}(D)\right)\right).

Let ϕ1B2,,ϕKB2,ϕ12,,ϕK2M||\phi_{1}^{B}||_{2},\dots,||\phi_{K}^{B}||_{2},||\phi_{1}||_{2},\dots,||\phi_{K}||_{2}\leq M. Note that for any 1k,kK1\leq k,k^{\prime}\leq K,

2|𝔼0[ϕkϕk]𝔼0[ϕkBϕkB]|\displaystyle 2|\mathbb{E}^{0}[\phi_{k}\phi_{k^{\prime}}]-\mathbb{E}^{0}[\phi_{k}^{B}\phi_{k^{\prime}}^{B}]| |𝔼0[(ϕkϕkB)(ϕk+ϕkB)]|+|𝔼0[(ϕk+ϕkB)(ϕkϕkB)]|\displaystyle\leq|\mathbb{E}^{0}[(\phi_{k}-\phi_{k}^{B})(\phi_{k^{\prime}}+\phi_{k^{\prime}}^{B})]|+|\mathbb{E}^{0}[(\phi_{k}+\phi_{k}^{B})(\phi_{k^{\prime}}-\phi_{k^{\prime}}^{B})]|
2M(ϕkϕkB2+ϕkϕkB2)4Mϵ.\displaystyle\leq 2M(||\phi_{k}-\phi_{k}^{B}||_{2}+||\phi_{k^{\prime}}-\phi_{k^{\prime}}^{B}||_{2})\leq 4M\cdot\epsilon.

Therefore,

Var0(ϕ(D))Var0(ϕB(D))2Mϵ.\Big{|}\Big{|}\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))-\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi^{B}}(D))\Big{|}\Big{|}_{\infty}\leq 2M\cdot\epsilon.

Combining results, we have

(a)+(b)+(c)\displaystyle\eqref{eq:a}+\eqref{eq:b}+\eqref{eq:c} =dN(0,ΣWVar0(ϕ(D)))+op(1)\displaystyle\stackrel{{\scriptstyle d}}{{=}}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)+o_{p}(1)
+((b)sn)+N(0,ΣW(Var0(ϕ(D))Var0(ϕB(D))))\displaystyle+(\eqref{eq:b}-s_{n})+N\left(0,\Sigma^{W}\odot\left(\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))-\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi^{B}}(D))\right)\right)
=N(0,ΣWVar0(ϕ(D)))+Op(ϵ)+op(1).\displaystyle=N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)+O_{p}(\epsilon)+o_{p}(1).

Note that all the results hold for arbitrary ϵ>0\epsilon>0. This completes the proof. ∎

8.4 Random target distribution

From Extended Theorem 1 with K+1 datasets and \Sigma^{W}\in\mathbb{R}^{(K+1)\times(K+1)}, and using that the sampling uncertainty is of lower order than the distributional uncertainty, for any Borel measurable square-integrable function \phi:\mathcal{D}\xrightarrow[]{}\mathbb{R}^{L}, we have

m((1n1i=1n1ϕ(D1i)1nKi=1nKϕ(DKi))(𝔼K+1[ϕ(D)]𝔼K+1[ϕ(D)]))𝑑N(0,(AΣWA)Var0(ϕ(D))),\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{K+1}[\phi(D)]\\ \vdots\\ \mathbb{E}^{K+1}[\phi(D)]\end{pmatrix}\right)\xrightarrow[]{d}N\left(0,(A\Sigma^{W}A^{\intercal})\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\right), (10)

where

A=(100101010011)K×(K+1).A=\begin{pmatrix}1&0&\dots&0&-1\\ 0&1&\dots&0&-1\\ \vdots&\vdots&&\vdots&\vdots\\ 0&0&\dots&1&-1\end{pmatrix}\in\mathbb{R}^{K\times(K+1)}.

Therefore, with K+1\mathbb{P}^{K+1} as a new target distribution, we still have Extended Theorem 1 but with a different distributional covariance matrix AΣWAA\Sigma^{W}A^{\intercal}.

8.5 Proof of Remark 3

Note that Var^0(ϕ(D))\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) is an empirical variance of ϕ\phi on the pooled donor data {(Dki)i=1nk}k=1K\{(D_{ki})_{i=1}^{n_{k}}\}_{k=1}^{K} and can be written as

Var^0(ϕ(D))\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) =k=1Knkn1nki=1nkϕ(Dki)ϕ(Dki)\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\phi(D_{ki})^{\intercal}
(k=1Knkn1nki=1nkϕ(Dki))(k=1Knkn1nki=1nkϕ(Dki)).\displaystyle-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\right)^{\intercal}.

where n=n1++nKn=n_{1}+\dots+n_{K}. As ϕ\phi has a finite fourth moment, by Theorem 1, we have that

Var^0(ϕ(D))\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) =k=1Knkn(𝔼0[ϕ(D)ϕ(D)]+op(1))\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)\phi(D)^{\intercal}]+o_{p}(1)\right)
(k=1Knkn(𝔼0[ϕ(D)]+op(1)))(k=1Knkn(𝔼0[ϕ(D)]+op(1)))\displaystyle-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)]+o_{p}(1)\right)\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)]+o_{p}(1)\right)\right)^{\intercal}
=Var0(ϕ(D))+op(1).\displaystyle=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1).
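For concreteness, the pooled variance estimator used above can be computed as in the following sketch (the function name and inputs are hypothetical; phi_values[k] is the n_k x L matrix of \phi evaluated on the k-th donor dataset).

```python
import numpy as np

def pooled_variance(phi_values):
    # phi_values: list of n_k x L arrays, one per donor dataset (hypothetical input).
    pooled = np.vstack(phi_values)                   # pooling reproduces sum_k (n_k / n) * (1 / n_k) sum_i
    second_moment = pooled.T @ pooled / len(pooled)  # pooled mean of phi(D) phi(D)^T
    mean = pooled.mean(axis=0, keepdims=True)        # pooled mean of phi(D)
    return second_moment - mean.T @ mean             # empirical Var_P0(phi(D)) as in Remark 3
```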

9 Special Cases of the Random Perturbation Model

We present special cases of the random perturbation model to help develop some intuition about potential applications.

Example 1 (Random shift across time).

Consider the case where W_{j}^{k}=W_{j}^{k-1}+\epsilon_{k}, where the \epsilon_{k} have mean zero and are uncorrelated for k=1,\ldots,K. This can be used to model random distributional shifts across time. For example, at some time point, we might see a (randomly) higher or lower percentage of older patients at a hospital. Similarly, one could also posit an auto-regressive model W_{j}^{k}=\frac{1}{2}W_{j}^{k-1}+\frac{1}{2}W_{j}^{k-2}+\epsilon_{k}. Note that these auto-regressive models are defined on the weights of the distribution, not on the data itself. This allows us to model random distributional shifts for different data modalities, such as discrete, ordinal, or continuous sample spaces, within a single framework.
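To make the induced similarity structure concrete, assume additionally (an assumption beyond the example as stated) that the \epsilon_{k} are uncorrelated with the initial weights and have common variance \sigma_{\epsilon}^{2}. Writing W_{j}^{k}=W_{j}^{1}+\sum_{l=2}^{k}\epsilon_{l}, we get E[W^{k}]=E[W^{1}] and \text{Cov}(W^{i},W^{j})=\text{Var}(W^{1})+(\min(i,j)-1)\sigma_{\epsilon}^{2}, so that

(\Sigma^{W})_{ij}=\frac{\text{Var}(W^{1})+(\min(i,j)-1)\,\sigma_{\epsilon}^{2}}{E[W^{1}]^{2}}.

In particular, later time periods are perturbed more strongly relative to \mathbb{P}^{0}, and the temporal dependence between datasets is captured entirely by \Sigma^{W}.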

Example 2 (Independent shifts and meta analysis).

If the distributional perturbations WjkW_{j}^{k} are i.i.d. across k=1,,Kk=1,\ldots,K, then ΣW\Sigma^{W} is a diagonal matrix. Thus, for any ϕL2(0)\phi\in L^{2}(\mathbb{P}^{0}), θ^k:=1nki=1nkϕ(Dki)\hat{\theta}_{k}:=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki}) are asymptotically Gaussian and uncorrelated across k=1,,Kk=1,\ldots,K. This implies a meta-analytic model

θ^ki.i.d.N(θ0,τ2),\hat{\theta}_{k}\stackrel{{\scriptstyle\text{i.i.d.}}}{{\approx}}N(\theta^{0},\tau^{2}), (11)

where \tau^{2}=\frac{\Sigma_{kk}^{W}}{m}\text{Var}_{\mathbb{P}^{0}}(\phi) and \theta^{0}:=\mathbb{E}^{0}[\phi(D)]. Thus, in this special case, the random perturbation model justifies a random-effects meta-analysis.
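Under (11), a natural pooled estimate of \theta^{0} and an accompanying interval can be formed directly from the K dataset-level estimates. The following is a minimal sketch of this pooling step (the function name and inputs are hypothetical), assuming equal distributional variances across datasets as in this example.

```python
import numpy as np
from scipy import stats

def pool_estimates(theta_hat, alpha=0.05):
    # theta_hat: length-K array of dataset-level estimates, treated as
    # approximately i.i.d. N(theta^0, tau^2) as in (11).
    K = len(theta_hat)
    center = np.mean(theta_hat)                    # estimate of theta^0
    tau2 = np.var(theta_hat, ddof=1)               # estimate of tau^2
    half = stats.t.ppf(1 - alpha / 2, df=K - 1) * np.sqrt(tau2 / K)
    return center, (center - half, center + half)  # point estimate and interval
```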

Example 3 (Random shift between locations).

In general, the meta-analytic model might not hold since the distributional perturbations might not be independent or identically distributed across k=1,,Kk=1,\ldots,K. As an example, data from California might be similar to data from Washington, but very different from data from Pennsylvania. In the synthetic control literature, it is popular to model data distributions from different locations as (approximate) mixtures of each other. For example, data from California might be well approximated by a mixture of Washington and Nevada. In our framework, this can be expressed as

WCalifornia=.7WWashington+.3WNevada+ϵ,W^{\text{California}}=.7\cdot W^{\text{Washington}}+.3\cdot W^{\text{Nevada}}+\epsilon, (12)

where ϵ\epsilon is mean-zero noise.
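Heuristically, the leading-order deviation of \mathbb{E}^{k}[\phi(D)] from \mathbb{E}^{0}[\phi(D)] is linear in the weights (see the proof of Theorem 1). Assuming, for simplicity, that the weights for the three states have a common mean and that \epsilon has mean zero, relation (12) therefore suggests that, up to lower-order terms,

\mathbb{E}^{\text{California}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\approx.7\cdot\left(\mathbb{E}^{\text{Washington}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+.3\cdot\left(\mathbb{E}^{\text{Nevada}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+\text{noise},

where the noise term has mean zero. This informal calculation is exactly the linear structure across shifted distributions that the weighting estimator \hat{\beta} exploits; we state it as motivation rather than as a formal consequence of (12).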

10 Proof of Lemma 1

Proof.

The first part of the proof follows Van der Vaart [2000], Theorem 5.41, with

Ψn(θ)\displaystyle\Psi_{n}(\theta) =k=1Kβk𝔼^k[θ(θ,D)]=k=1Kβk1nki=1nkθ(θ,Dki),\displaystyle=\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\theta,D)]=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}\mathcal{L}(\theta,D_{ki}),
Ψ˙n(θ)\displaystyle\dot{\Psi}_{n}(\theta) =k=1Kβk1nki=1nkθ2(θ,Dki).\displaystyle=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{2}\mathcal{L}(\theta,D_{ki}).

By Taylor’s theorem there exists a (random vector) θ~\tilde{\theta} on the line segment between θ0\theta^{0} and θ^β\hat{\theta}^{\beta} such that

0=\Psi_{n}(\hat{\theta}^{\beta})=\Psi_{n}(\theta^{0})+\dot{\Psi}_{n}(\theta^{0})(\hat{\theta}^{\beta}-\theta^{0})+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\ddot{\Psi}_{n}(\tilde{\theta})(\hat{\theta}^{\beta}-\theta^{0}).

By Extended Theorem 1 with ϕ(D)=θ2(θ0,D)\phi(D)=\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D), we have

Ψ˙n(θ0)\displaystyle\dot{\Psi}_{n}(\theta^{0}) =k=1Kβk1nki=1nkθ2(θ0,Dki)\displaystyle=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D_{ki})
=k=1Kβk𝔼0[θ2(θ0,D)]+oP(1)=𝔼0[θ2(θ0,D)]+oP(1).\displaystyle=\sum_{k=1}^{K}\beta_{k}\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)=\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1). (13)

By assumption, there exists a ball BB around θ0\theta^{0} such that θθ3(θ,D)\theta\xrightarrow[]{}\partial_{\theta}^{3}\mathcal{L}(\theta,D) is dominated by a fixed function h()h(\cdot) for every θB\theta\in B. The probability of the event {θ^βB}\{\hat{\theta}^{\beta}\in B\} tends to 1. On this event,

Ψ¨n(θ~)=k=1Kβk1nki=1nkθ3(θ~,Dki)k=1K|βk|1nki=1nkh(Dki).\|\ddot{\Psi}_{n}(\tilde{\theta})\|=\Big{|}\Big{|}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{3}\mathcal{L}(\tilde{\theta},D_{ki})\Big{|}\Big{|}\leq\sum_{k=1}^{K}|\beta_{k}|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki}).

Using Extended Theorem 1 with ϕ(D)=h(D)\phi(D)=h(D), we get

Ψ¨n(θ~)k=1K|βk|1nki=1nkh(Dki)=OP(1).\|\ddot{\Psi}_{n}(\tilde{\theta})\|\leq\sum_{k=1}^{K}|\beta_{k}|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki})=O_{P}(1). (14)

Combining (13) and (14) gives us

Ψn(θ0)\displaystyle-\Psi_{n}(\theta^{0}) =(𝔼0[θ2(θ0,D)]+oP(1)+12(θ^βθ0) OP(1))(θ^βθ0)\displaystyle=\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})\text{ }O_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0})
=(𝔼0[θ2(θ0,D)]+oP(1))(θ^βθ0).\displaystyle=\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0}).

The probability that the matrix \mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1) is invertible tends to 1. Multiplying the preceding equation by \sqrt{m} and applying (\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1))^{-1} to both sides gives us that

m(θ^βθ0)=m(k=1Kβk𝔼^k[ϕ(D)]𝔼0[ϕ(D)])+oP(1),\sqrt{m}\left(\hat{\theta}^{\beta}-\theta^{0}\right)=\sqrt{m}\left(\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+o_{P}(1),

where ϕ(D)=𝔼0[θ2(θ0,D)]1θ(θ0,D)\phi(D)=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},D). By Extended Theorem 1, we have

m(θ^βθ0)𝑑N(0,βΣWβ Var0(ϕ(D)))\sqrt{m}\left(\hat{\theta}^{\beta}-\theta^{0}\right)\xrightarrow[]{d}N(0,\beta^{\intercal}\Sigma^{W}\beta\text{ }\cdot\text{Var}_{\mathbb{P}^{0}}(\phi(D))) (15)

Now let Φ(θ)=𝔼0[θ(θ,D)],Φ˙(θ)=𝔼0[θ2(θ,D)],Φ¨(θ)=𝔼0[θ3(θ,D)]\Phi(\theta)=\mathbb{E}^{0}[\partial_{\theta}\mathcal{L}(\theta,D)],\dot{\Phi}(\theta)=\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta,D)],\ddot{\Phi}(\theta)=\mathbb{E}^{0}[\partial_{\theta}^{3}\mathcal{L}(\theta,D)]. By Taylor’s theorem, there exist θ~,θ~~\tilde{\theta},\tilde{\tilde{\theta}}, on the line segment between θ0\theta^{0} and θ^β\hat{\theta}^{\beta} such that

𝔼0[(θ^β,D)]𝔼0[(θ0,D)]\displaystyle\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},D)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},D)] =(θ^βθ0)Φ(θ0)+12(θ^βθ0)Φ˙(θ~)(θ^βθ0)\displaystyle=(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}{\Phi}(\theta^{0})+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\dot{\Phi}(\tilde{\theta})(\hat{\theta}^{\beta}-\theta^{0})
=12(θ^βθ0)(Φ˙(θ0)+Φ¨(θ~~)(θ~θ0))(θ^βθ0)\displaystyle=\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\left(\dot{\Phi}(\theta^{0})+\ddot{\Phi}(\tilde{\tilde{\theta}})(\tilde{\theta}-\theta^{0})\right)(\hat{\theta}^{\beta}-\theta^{0})
=12(θ^βθ0)(Φ˙(θ0)+oP(1))(θ^βθ0),\displaystyle=\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\left(\dot{\Phi}(\theta^{0})+o_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0}),

as Φ(θ0)=0{\Phi}(\theta^{0})=0 and Φ¨(θ~~)𝔼0[h(D)]\ddot{\Phi}(\tilde{\tilde{\theta}})\leq\mathbb{E}^{0}[h(D)] with probability approaching 1. Then the rescaled excess risk can be written as

m(𝔼0[(θ^β,D)]𝔼0[(θ0,D)])=m(θ^βθ0)𝔼0[θ2(θ0,D)](θ^βθ0)+oP(1).m\cdot\left(\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},D)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},D)]\right)=m\cdot(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)](\hat{\theta}^{\beta}-\theta^{0})+o_{P}(1). (16)

Using (15), (16) asymptotically follows a distribution with mean

μβ=βΣWβTrace(𝔼0[θ2(θ0,D)]1Var0(θ(θ0,D)))\mu^{\beta}=\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Trace}(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\mathrm{Var}_{\mathbb{P}^{0}}(\partial_{\theta}\mathcal{L}(\theta^{0},D)))

This completes the proof.

In the following, we state regularity conditions that ensure the consistency of the M-estimator \hat{\theta}^{\beta}.

Lemma 3 (Consistency of M-estimators).

Consider the M-estimator

θ^β=argminθΩk=1Kβk𝔼^k[(θ,D)]\hat{\theta}^{\beta}=\arg\min_{\theta\in\Omega}\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,D)]

and the target θ0=argminθΩ𝔼0[(θ,D)]\theta^{0}=\arg\min_{\theta\in\Omega}\mathbb{E}^{0}[\mathcal{L}(\theta,D)], where Ω\Omega is a compact subset of d\mathbb{R}^{d}. Assume that θ(θ,D)\theta\xrightarrow[]{}\mathcal{L}(\theta,D) is continuous and that infθθ2δ(θ,D)\inf_{||\theta-\theta^{\prime}||_{2}\leq\delta}\mathcal{L}(\theta,D) is square-integrable under 0\mathbb{P}^{0} for every δ\delta and θ\theta^{\prime} and that infθΩ(θ,D)\inf_{\theta\in\Omega}\mathcal{L}(\theta,D) is square-integrable. We assume that 𝔼0[(θ,D)]\mathbb{E}^{0}[\mathcal{L}(\theta,D)] has a unique minimum. Then, θ^βθ0=oP(1)\hat{\theta}^{\beta}-\theta^{0}=o_{P}(1).

Proof.

The proof follows Van der Vaart [2000], Theorem 5.14 with mθ(D)=(θ,D)m_{\theta}(D)=-\mathcal{L}(\theta,D).

Fix some \theta and let U_{\ell}\downarrow\theta be a decreasing sequence of open balls around \theta with diameter converging to zero. Write m_{U}(D) for \sup_{\theta\in U}m_{\theta}(D). The sequence m_{U_{\ell}} is decreasing and greater than m_{\theta} for every \ell. As \theta\mapsto m_{\theta}(D) is continuous, by the monotone convergence theorem, we have \mathbb{E}^{0}[m_{U_{\ell}}(D)]\downarrow\mathbb{E}^{0}[m_{\theta}(D)].

For \theta\neq\theta^{0}, we have \mathbb{E}^{0}[m_{\theta}(D)]<\mathbb{E}^{0}[m_{\theta^{0}}(D)]. Combining this with the preceding paragraph, for every \theta\neq\theta^{0} there exists an open ball U_{\theta} around \theta with \mathbb{E}^{0}[m_{U_{\theta}}(D)]<\mathbb{E}^{0}[m_{\theta^{0}}(D)]. For any given \epsilon>0, define the set B=\{\theta\in\Omega:||\theta-\theta^{0}||_{2}\geq\epsilon\}. The set B is compact and is covered by the balls \{U_{\theta}:\theta\in B\}. Let U_{\theta_{1}},\dots,U_{\theta_{p}} be a finite sub-cover. Then,

supθBk=1Kβk1nki=1nkmθ(Dki)\displaystyle\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki}) supj=1,,pk=1Kβk1nki=1nkmUθj(Dki)\displaystyle\leq\sup_{j=1,\dots,p}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{U_{\theta_{j}}}(D_{ki})
=supj=1,,p𝔼0[mUθj(D)]+oP(1)<𝔼0[mθ0(D)]+oP(1),\displaystyle=\sup_{j=1,\dots,p}\mathbb{E}^{0}[m_{U_{\theta_{j}}}(D)]+o_{P}(1)<\mathbb{E}^{0}[m_{\theta^{0}}(D)]+o_{P}(1), (17)

where for the equality we apply Theorem 1 with ϕ(D)=mUθj(D)\phi(D)=m_{U_{\theta_{j}}}(D) for all j=1,,pj=1,\ldots,p.

If θ^βB\hat{\theta}^{\beta}\in B, then

supθBk=1Kβk1nki=1nkmθ(Dki)k=1Kβk1nki=1nkmθ^β(Dki)k=1Kβk1nki=1nkmθ0(Dki)oP(1),\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki})\geq\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\hat{\theta}^{\beta}}(D_{ki})\geq\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta^{0}}(D_{ki})-o_{P}(1),

where the last inequality comes from the definition of θ^β\hat{\theta}^{\beta}. Using Theorem 1 with ϕ(D)=mθ0(D)\phi(D)=m_{\theta^{0}}(D), we have

k=1Kβk1nki=1nkmθ0(Dki)oP(1)=𝔼0[mθ0(D)]oP(1).\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta^{0}}(D_{ki})-o_{P}(1)=\mathbb{E}^{0}[m_{\theta^{0}}(D)]-o_{P}(1).

Therefore,

{θ^βB}{supθBk=1Kβk1nki=1nkmθ(Dki)𝔼0[mθ0(D)]oP(1)}.\{\hat{\theta}^{\beta}\in B\}\subset\Bigg{\{}\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki})\geq\mathbb{E}^{0}[m_{\theta^{0}}(D)]-o_{P}(1)\Bigg{\}}.

By equation (17), the probability of the event on the right-hand side converges to zero as n\xrightarrow[]{}\infty. This completes the proof. ∎

11 Proof of Theorem 2

Define the feature matrix before reparametrization as

Φ=(𝔼^1[ϕ1]𝔼^0[ϕ1]𝔼^K[ϕ1]𝔼^0[ϕ1]𝔼^1[ϕL]𝔼^0[ϕL]𝔼^K[ϕL]𝔼^0[ϕL])L×K,\Phi=\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}]-\hat{\mathbb{E}}^{0}[\phi_{1}]&\dots&\hat{\mathbb{E}}^{K}[\phi_{1}]-\hat{\mathbb{E}}^{0}[\phi_{1}]\\ \vdots&&\vdots\\ \hat{\mathbb{E}}^{1}[\phi_{L}]-\hat{\mathbb{E}}^{0}[\phi_{L}]&\dots&\hat{\mathbb{E}}^{K}[\phi_{L}]-\hat{\mathbb{E}}^{0}[\phi_{L}]\end{pmatrix}\in\mathbb{R}^{L\times K},

where the \ell-th row is defined as Φ\Phi_{\ell}.

By Theorem 1, and using that the sampling uncertainty is of lower order than the distributional uncertainty, we have

mΦ:=m(𝔼^1[ϕ]𝔼^0[ϕ]𝔼^K[ϕ]𝔼^0[ϕ])=d𝒁+op(1)\sqrt{m}\Phi_{\ell}:=\sqrt{m}\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}]\\ \vdots\\ \hat{\mathbb{E}}^{K}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}]\end{pmatrix}\stackrel{{\scriptstyle d}}{{=}}\bm{Z}_{\ell}+o_{p}(1)

where \bm{Z}_{1},\dots,\bm{Z}_{L}\in\mathbb{R}^{K} are i.i.d. Gaussian random vectors following N(0,\Sigma^{W}). Let \bm{Z}\in\mathbb{R}^{L\times K} be the matrix whose \ell-th row is \bm{Z}_{\ell}.

Note that

β^=argminβ:β1=1β(=1LΦΦ)β=argminβ:β1=1β𝚽𝚽β.\hat{\beta}=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\left(\sum_{\ell=1}^{L}\Phi_{\ell}\Phi_{\ell}^{\intercal}\right)\beta=\arg\min_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\bm{\Phi}^{\intercal}\bm{\Phi}\beta.

The closed form of \hat{\beta} is given by

β^=(𝚽𝚽)1𝟏𝟏(𝚽𝚽)1𝟏=d(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1𝟏+op(1).\hat{\beta}=\frac{(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\stackrel{{\scriptstyle d}}{{=}}\frac{(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}+o_{p}(1).

We write β^Z=(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1𝟏\hat{\beta}_{Z}=\frac{(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}. Moreover, the closed form of β\beta^{*} is

β=argminβ:β1=1βΣWβ=(ΣW)1𝟏𝟏(ΣW)1𝟏.\beta^{*}=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\Sigma^{W}\beta=\frac{(\Sigma^{W})^{-1}\bm{1}}{\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}}.
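As an aside, the closed-form weights above are straightforward to compute. The following is a minimal sketch (hypothetical input; Phi is the L x K matrix of differences \hat{\mathbb{E}}^{k}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}], and L\geq K is assumed so that \bm{\Phi}^{\intercal}\bm{\Phi} is invertible).

```python
import numpy as np

def synthetic_weights(Phi):
    # Phi: L x K matrix with entries E_hat^k[phi_l] - E_hat^0[phi_l] (hypothetical input).
    G = Phi.T @ Phi                # K x K Gram matrix Phi^T Phi
    ones = np.ones(G.shape[0])
    w = np.linalg.solve(G, ones)   # (Phi^T Phi)^{-1} 1
    return w / (ones @ w)          # beta_hat, normalized so the weights sum to one
```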

Let QQ be a p×(K1)p\times(K-1) non-random matrix with rank(Q)=pK1\text{rank}(Q)=p\leq K-1. Let Q~=(Q,𝟏)(p+1)×K\tilde{Q}=(Q^{\prime\intercal},\bm{1})^{\intercal}\in\mathbb{R}^{(p+1)\times K} where

Q=Q(𝑰(K1)×(K1)𝟎(K1)×1)p×K.Q^{\prime}=Q\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}\in\mathbb{R}^{p\times K}.

Note that the rank(Q~)=p+1K\text{rank}(\tilde{Q})=p+1\leq K. Define S~:=Q~(𝒁𝒁)1Q~\tilde{S}:=\tilde{Q}(\bm{Z}^{\intercal}\bm{Z})^{-1}\tilde{Q}^{\intercal} where

S~=(S~11S~12S~21S~22)=(Q(𝒁𝒁)1QQ(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1Q𝟏(𝒁𝒁)1𝟏.)\tilde{S}=\begin{pmatrix}\tilde{S}_{11}&\tilde{S}_{12}\\ \tilde{S}_{21}&\tilde{S}_{22}\end{pmatrix}=\begin{pmatrix}Q^{\prime}(\bm{Z}^{\intercal}\bm{Z})^{-1}Q^{\prime\intercal}&Q^{\prime}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}\\ \bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}Q^{\prime\intercal}&\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}.\end{pmatrix}

Similarly, let P~:=Q~(ΣW)1Q~\tilde{P}:=\tilde{Q}(\Sigma^{W})^{-1}\tilde{Q}^{\intercal} where P~11=Q(ΣW)1Q,P~12=Q(ΣW)1𝟏,\tilde{P}_{11}=Q^{\prime}(\Sigma^{W})^{-1}Q^{\prime\intercal},\tilde{P}_{12}=Q^{\prime}(\Sigma^{W})^{-1}\bm{1}, and P~22=𝟏(ΣW)1𝟏\tilde{P}_{22}=\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}.

Results from Multivariate Statistical Theory.

Note that \bm{Z}^{\intercal}\bm{Z}\sim\text{Wish}_{K}(L;\Sigma^{W}), the Wishart distribution with L degrees of freedom and scale matrix \Sigma^{W}. By Theorems 3.2.11 and 3.6 in Muirhead [1982], we have

S~Wp+11(LK+2p+3;P~).\tilde{S}\sim W_{p+1}^{-1}(L-K+2p+3;\tilde{P}).

which is the inverse-Wishart distribution with L-K+2p+3 degrees of freedom and scale matrix \tilde{P}. By Theorem 3 (b) and (d) of Bodnar and Okhrin [2008], we have that

S~12S~221|S~11.2N(P~12P~221,S~11.2P~221).\tilde{S}_{12}\tilde{S}_{22}^{-1}|\tilde{S}_{11.2}\sim N(\tilde{P}_{12}\tilde{P}_{22}^{-1},\tilde{S}_{11.2}\otimes\tilde{P}_{22}^{-1}).

Therefore, we get

P~221/2S~11.21/2(S~12S~221P~12P~221)|S~11.2N(0,𝑰).\tilde{P}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(\tilde{S}_{12}\tilde{S}_{22}^{-1}-\tilde{P}_{12}\tilde{P}_{22}^{-1}\right)|\tilde{S}_{11.2}\sim N(0,\bm{I}).

The right hand side does not depend on S~11.2\tilde{S}_{11.2} anymore. Therefore, we have

P~221/2S~11.21/2(Q(β^Z)1:(K1)Qβ1:(K1))N(𝟎,𝑰).\tilde{P}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(Q(\hat{\beta}_{Z})_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right)\sim N(\bm{0},\bm{I}). (18)

By Theorem 3.2.12 in Muirhead [1982], we have

P~22S~22=𝟏(ΣW)1𝟏𝟏(ZZ)1𝟏χ2(LK+1).\frac{\tilde{P}_{22}}{\tilde{S}_{22}}=\frac{\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}}{\bm{1}^{\intercal}(Z^{\intercal}Z)^{-1}\bm{1}}\sim\chi^{2}(L-K+1).

By Theorem 3 (e) of Bodnar and Okhrin [2008], S~22\tilde{S}_{22} is independent of S~12S~221\tilde{S}_{12}\tilde{S}_{22}^{-1} and S~11.2\tilde{S}_{11.2}. Hence, we get

LK+1S~221/2S~11.21/2(Q(β^Z)1:(K1)Qβ1:(K1))tp(LK+1;𝟎,𝑰),\sqrt{L-K+1}\cdot\tilde{S}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(Q(\hat{\beta}_{Z})_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right)\sim t_{p}(L-K+1;\bm{0},\bm{I}),

where t_{p}(L-K+1;\bm{0},\bm{I}) is the p-dimensional (centered) t-distribution with L-K+1 degrees of freedom. Now, using \Phi=\bm{Z}/\sqrt{m}+o_{p}(1/\sqrt{m}) from Theorem 1, we get

LK+1𝟏(𝚽𝚽)1𝟏[QRQ]1/2(Qβ^1:(K1)Qβ1:(K1))\displaystyle\sqrt{L-K+1}\sqrt{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\cdot\left[QRQ^{\intercal}\right]^{-1/2}\left(Q\hat{\beta}_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right) (19)
=dtp(LK+1;𝟎,𝑰)+op(1).\displaystyle\stackrel{{\scriptstyle d}}{{=}}t_{p}(L-K+1;\bm{0},\bm{I})+o_{p}(1). (20)

where

R=(𝑰(K1)×(K1)𝟎(K1)×1)((𝚽𝚽)1(𝚽𝚽)1𝟏𝟏(𝚽𝚽)1𝟏(𝚽𝚽)1𝟏)(𝑰(K1)×(K1)𝟎(K1)×1).R=\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}\left((\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}-\frac{(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}}{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\right)\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}^{\intercal}.

Connection with Reparametrized Linear Regression.

In this part, we will show that

(𝚽~𝚽~)1=R.(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1}=R. (21)

Using the closed form of β^\hat{\beta}, we have that

(LK+1)σ^2=β^ΦΦβ^=1𝟏(ΦΦ)1𝟏.(L-K+1)\cdot\hat{\sigma}^{2}=\hat{\beta}^{\intercal}\Phi^{\intercal}\Phi\hat{\beta}=\frac{1}{\bm{1}^{\intercal}(\Phi^{\intercal}\Phi)^{-1}\bm{1}}. (22)

Then combining (20), (21), and (22), we have

(Q(Φ~Φ~)1σ^2Q)12(Qβ^1:(K1)Qβ1:(K1))=dtp(LK+1;𝟎,𝑰)+op(1).\left(Q(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}Q^{\intercal}\right)^{-\frac{1}{2}}\left(Q\hat{\beta}_{1:(K-1)}-Q\beta^{*}_{1:(K-1)}\right)\stackrel{{\scriptstyle d}}{{=}}t_{p}(L-K+1;\bm{0},\bm{I})+o_{p}(1).

Note that this is a more general version of Theorem 2: by letting Q be the (K-1)\times(K-1) identity matrix, we recover Theorem 2. For simplicity of notation, let V=\bm{\Phi}^{\intercal}\bm{\Phi}. Note that

𝚽~𝚽~\displaystyle\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}} =(𝑰(K1)×(K1)𝟏)(V11V12V12V22)(𝑰(K1)×(K1)𝟏)\displaystyle=\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&-\bm{1}\end{pmatrix}\begin{pmatrix}V_{11}&V_{12}\\ V_{12}^{\intercal}&V_{22}\end{pmatrix}\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&-\bm{1}\end{pmatrix}^{\intercal}
=V11𝟏V12V12𝟏+𝟏V22𝟏\displaystyle=V_{11}-\bm{1}V_{12}^{\intercal}-V_{12}\bm{1}^{\intercal}+\bm{1}V_{22}\bm{1}^{\intercal}

where V_{22}\in\mathbb{R} and \bm{1}\in\mathbb{R}^{(K-1)\times 1}. Then, by applying the Sherman-Morrison formula, we have

(𝚽~𝚽~)1\displaystyle(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1} =(V11.2+V12V221V12𝟏V12V12𝟏+𝟏V22𝟏)1\displaystyle=\left(V_{11.2}+V_{12}V_{22}^{-1}V_{12}^{\intercal}-\bm{1}V_{12}^{\intercal}-V_{12}\bm{1}^{\intercal}+\bm{1}V_{22}\bm{1}^{\intercal}\right)^{-1}
=(V11.2+(V12V221/2V221/2𝟏)(V12V221/2V221/2𝟏))1\displaystyle=\left(V_{11.2}+\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}\right)^{-1}
=V11.21V11.21(V12V221/2V221/2𝟏)(V12V221/2V221/2𝟏)V11.211+(V12V221/2V221/2𝟏)V11.21(V12V221/2V221/2𝟏).\displaystyle=V_{11.2}^{-1}-\frac{V_{11.2}^{-1}\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}V_{11.2}^{-1}}{1+\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}V_{11.2}^{-1}\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)}.

Using the block-matrix inverse

V1=(V11.21V11.21V12V221V221V12V11.21V221+V221V12V11.21V12V221),V^{-1}=\begin{pmatrix}V_{11.2}^{-1}&-V_{11.2}^{-1}V_{12}V_{22}^{-1}\\ -V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}&V_{22}^{-1}+V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}\end{pmatrix},

we also get

R=V11.21V11.21(V12V221𝟏)(V12V221𝟏)V11.21𝟏V11.21𝟏V221V12V11.21𝟏𝟏V11.21V12V221+V221+V221V12V11.21V12V221.\displaystyle R=V_{11.2}^{-1}-\frac{V_{11.2}^{-1}\left(V_{12}V_{22}^{-1}-\bm{1}\right)\left(V_{12}V_{22}^{-1}-\bm{1}\right)^{\intercal}V_{11.2}^{-1}}{\bm{1}^{\intercal}V_{11.2}^{-1}\bm{1}-V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}\bm{1}-\bm{1}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}+V_{22}^{-1}+V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}}.

Then, we can easily check that (𝚽~𝚽~)1=R(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1}=R. This completes the proof.

12 Confidence Intervals for Target Parameters

First, we discuss how to form asymptotically valid confidence intervals for \mathbb{E}^{0}[\phi_{0}(D)]. Assume that the test functions \phi_{1},\dots,\phi_{L} are uncorrelated and have unit variance under \mathbb{P}^{0}.

Suppose that \hat{\beta} in (4) is consistent:

\hat{\beta}\xrightarrow[]{p}\beta^{*}:=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\Sigma^{W}\beta.

Then, we have for 𝔼^[ϕ0]=(𝔼^1[ϕ0],,𝔼^K[ϕ0])\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}=(\hat{\mathbb{E}}^{1}[\phi_{0}],\dots,\hat{\mathbb{E}}^{K}[\phi_{0}])^{\intercal},

m(k=1Kβ^k𝔼^k[ϕ0]𝔼0[ϕ0])\displaystyle\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]-\mathbb{E}^{0}[\phi_{0}]\right) =m(β^β)(𝔼^[ϕ0]𝟙𝔼0[ϕ0])\displaystyle=\sqrt{m}\left(\hat{\beta}-\beta^{*}\right)^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)
+m β(𝔼^[ϕ0]𝟙𝔼0[ϕ0])\displaystyle+\sqrt{m}\text{ }{\beta^{*}}^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)

since β^𝟙=1\hat{\beta}^{\intercal}\mathbb{1}=1 and β𝟙=1\beta^{*}{}^{\intercal}\mathbb{1}=1. By Theorem 1,

m(𝔼^[ϕ0]𝟙𝔼0[ϕ0])𝑑N(0,Var0(ϕ0)ΣW).\sqrt{m}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)\xrightarrow[]{d}N(0,\text{Var}_{\mathbb{P}^{0}}(\phi_{0})\cdot\Sigma^{W}).

Therefore,

m(k=1Kβ^k𝔼^k[ϕ0]𝔼0[ϕ0])\displaystyle\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]-\mathbb{E}^{0}[\phi_{0}]\right) =m β(𝔼^[ϕ0]𝟙𝔼0[ϕ0])+op(1)\displaystyle=\sqrt{m}\text{ }{\beta^{*}}^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)+o_{p}(1)
𝑑N(0,Var0(ϕ0)βΣWβ).\displaystyle\xrightarrow[]{d}N(0,\text{Var}_{\mathbb{P}^{0}}(\phi_{0})\cdot{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}). (23)

Then, we can construct a confidence interval for \mathbb{E}^{0}[\phi_{0}] as

k=1Kβ^k𝔼^k[ϕ0]±z1α/2Var0(ϕ0)βΣWβm.\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]\pm z_{1-\alpha/2}\cdot\sqrt{\text{Var}_{\mathbb{P}^{0}}(\phi_{0})}\cdot\sqrt{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}.

How can we obtain a consistent estimator for βΣWβ/m{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}/m? By Theorem 1,

𝔼^0[ϕ]k=1Kβk𝔼^k[ϕ] =d βΣWβmZ+op(1/m)\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\beta_{k}^{*}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\text{ }\stackrel{{\scriptstyle d}}{{=}}\text{ }\sqrt{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot Z_{\ell}+o_{p}(1/\sqrt{m})

where ZZ_{\ell} for =1,,L\ell=1,\dots,L are independent standard Gaussian variables. Then,

(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2\displaystyle\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2} =((𝔼^0[ϕ]k=1Kβk𝔼^k[ϕ])+(k=1K(β^kβk)(𝔼^0[ϕ]𝔼^k[ϕ])))2\displaystyle=\left(\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\beta^{*}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)+\left(\sum_{k=1}^{K}(\hat{\beta}_{k}-\beta^{*}_{k})(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\hat{\mathbb{E}}^{k}[\phi_{\ell}])\right)\right)^{2}
=dβΣWβmZ2+op(1/m).\displaystyle\stackrel{{\scriptstyle d}}{{=}}{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot Z_{\ell}^{2}+o_{p}(1/m).

Therefore, we may estimate βΣWβ/m{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}/m as

1L=1L(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2=dβΣWβmχ2(L)L+op(1/m),\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}\stackrel{{\scriptstyle d}}{{=}}{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot\frac{\chi^{2}(L)}{L}+o_{p}(1/m), (24)

where \chi^{2}(L) is a chi-squared random variable with L degrees of freedom.

Finally, as L\xrightarrow[]{}\infty, we have the consistency of \hat{\beta} and \chi^{2}(L)/L\xrightarrow[]{p}1. Suppose we have \widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi_{0}(D))=\text{Var}_{\mathbb{P}^{0}}(\phi_{0}(D))+o_{p}(1) as in Remark 3. Combining all the results, we obtain an asymptotically valid (1-\alpha) confidence interval as m\xrightarrow[]{}\infty and L\xrightarrow[]{}\infty:

k=1Kβ^k𝔼^k[ϕ0]±z1α/2Var^0(ϕ0)1L=1L(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2.\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]\pm z_{1-\alpha/2}\cdot\sqrt{\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi_{0})}\cdot\sqrt{\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}}.
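A minimal sketch of this interval construction is given below (inputs are hypothetical; it assumes, as above, that the test functions \phi_{1},\dots,\phi_{L} are uncorrelated with unit variance under \mathbb{P}^{0} and that \hat{\beta} is consistent).

```python
import numpy as np
from scipy import stats

def distributional_ci(beta_hat, donor_phi0, resid, var0_phi0, alpha=0.05):
    # beta_hat   : length-K vector of estimated weights from (4)
    # donor_phi0 : length-K vector of E_hat^k[phi_0]
    # resid      : length-L vector of E_hat^0[phi_l] - sum_k beta_hat_k E_hat^k[phi_l]
    # var0_phi0  : pooled-donor estimate of Var_P0(phi_0), as in Remark 3
    point = beta_hat @ donor_phi0                         # sum_k beta_hat_k E_hat^k[phi_0]
    width = np.sqrt(var0_phi0) * np.sqrt(np.mean(resid ** 2))
    z = stats.norm.ppf(1 - alpha / 2)
    return point - z * width, point + z * width           # (1 - alpha) confidence interval
```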

Now consider the case in Remark 7 where

θ0=argminθ𝔼0[(θ,X,Y)],θ^=argminθk=1Kβ^k𝔼^k[(θ,X,Y)].\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)],\quad\hat{\theta}=\arg\min_{\theta}\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

Assume the regularity conditions of Lemma 1 and the consistency of β^\hat{\beta} and θ^\hat{\theta} (β^𝑝β\hat{\beta}\xrightarrow[]{p}\beta^{*}, θ^𝑝θ0\hat{\theta}\xrightarrow[]{p}\theta^{0}). Then, from the proof of Lemma 1, we have

m(θ^θ0)=m(k=1Kβ^k𝔼^k[ϕ(D)]𝔼0[ϕ(D)])+oP(1),\sqrt{m}\left(\hat{\theta}-\theta^{0}\right)=\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+o_{P}(1),

where ϕ(D)=𝔼0[θ2(θ0,D)]1θ(θ0,D)\phi(D)=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},D). Then, from the results above, we only need to show that Var^0(ϕ(D))=Var0(ϕ(D))+op(1)\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D))=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1). Note that

k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]\displaystyle\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)] 𝔼0[θ2(θ0,X,Y)]\displaystyle-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]
=k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]\displaystyle=\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]
+k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]k=1Kβk𝔼0[θ2(θ0,X,Y)].\displaystyle+\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]-\sum_{k=1}^{K}\beta_{k}^{*}{\mathbb{E}}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)].

As in the proof of Lemma 1, we have that

||k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]\displaystyle\Big{|}\Big{|}\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)] k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]||\displaystyle-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]\Big{|}\Big{|}
(k=1K(|βk|+oP(1))1nki=1nkh(Dki))θ^θ0=oP(1).\displaystyle\leq\left(\sum_{k=1}^{K}(|\beta_{k}^{*}|+o_{P}(1))\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki})\right)||\hat{\theta}-\theta^{0}||=o_{P}(1).

Moreover, with the consistency of β^\hat{\beta} and Theorem 1, we have

k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]k=1Kβk𝔼0[θ2(θ0,X,Y)]=oP(1)\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]-\sum_{k=1}^{K}\beta_{k}^{*}{\mathbb{E}}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]=o_{P}(1)

Therefore,

ϕ^(D)=(𝔼0[θ2(θ0,D)]+oP(1))1θ(θ^,D).\hat{\phi}(D)=-\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)\right)^{-1}\partial_{\theta}\mathcal{L}(\hat{\theta},D).

Note that

𝔼^k[θ(θ^,D)θ(θ^,D)]\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}] =𝔼^k[(θ(θ0,D)+δ(θ^,D))(θ(θ0,D)+δ(θ^,D))]\displaystyle=\hat{\mathbb{E}}^{k}[(\partial_{\theta}\mathcal{L}({\theta}^{0},D)+\delta(\hat{\theta},D))(\partial_{\theta}\mathcal{L}(\theta^{0},D)+\delta(\hat{\theta},D))^{\intercal}]
𝔼^k[θ(θ0,D)θ(θ0,D)]+𝔼^k[δ(θ^,D)δ(θ^,D)]\displaystyle\leq\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]
+2𝔼^k[θ(θ0,D)θ(θ0,D)]𝔼^k[δ(θ^,D)δ(θ^,D)]\displaystyle+2\sqrt{\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]\cdot\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]}

where \delta(\hat{\theta},D)=\partial_{\theta}\mathcal{L}(\hat{\theta},D)-\partial_{\theta}\mathcal{L}({\theta}^{0},D). The inequality follows from the Cauchy-Schwarz inequality. As in the proof of Lemma 1, using a Taylor expansion, we have

𝔼^k[δ(θ^,D)δ(θ^,D)]=oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]=o_{P}(1).

and we have \hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]={\mathbb{E}}^{0}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1) from Theorem 1. Then, we get

𝔼^k[θ(θ^,D)θ(θ^,D)]𝔼^k[θ(θ0,D)θ(θ0,D)]+oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}]\leq\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1).

By using the Cauchy-Schwarz inequality in the other direction as well, we get

𝔼^k[θ(θ^,D)θ(θ^,D)]=𝔼^k[θ(θ0,D)θ(θ0,D)]+oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}]=\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1).

Similarly, we can derive

𝔼^k[θ(θ^,D)]\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)] =𝔼^k[θ(θ0,D)]+oP(1).\displaystyle=\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)]+o_{P}(1).

Then, the variance estimate on the pooled donor data can be written as

Var^0(ϕ)\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi) =k=1Knkn𝔼^k[ϕ^(D)ϕ(D)^](k=1Knkn𝔼^k[ϕ^(D)])(k=1Knkn𝔼^k[ϕ^(D)])\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)\hat{\phi(D)}^{\intercal}]-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)]\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)]\right)^{\intercal}
=k=1Knkn𝔼^k[ϕ(D)ϕ(D)](k=1Knkn𝔼^k[ϕ(D)])(k=1Knkn𝔼^k[ϕ(D)])+oP(1).\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D){\phi(D)}^{\intercal}]-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D)]\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D)]\right)^{\intercal}+o_{P}(1).

Then, by the argument in Appendix Section 8.5, we have \widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D))=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1). Finally, as L\xrightarrow[]{}\infty, \hat{\beta} is consistent, and under the regularity conditions of Lemma 3, \hat{\theta} is consistent by following the proof of Lemma 3. This completes the proof.

13 Additional Details in Data Analysis

13.1 Covariance of Test Functions in GTEx Data Analysis

We estimate the sparse inverse covariance matrix of the test functions (\phi_{\ell})_{\ell=1,\dots,L=1000} defined in Section 5 using the graphical lasso, as implemented in the Python class sklearn.covariance.GraphicalLassoCV. The heatmap of the fitted precision matrix is given in Figure 4. As the estimated inverse covariance matrix was close to a diagonal matrix, we did not perform a whitening transformation for the 1000 test functions.
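For reference, the precision-matrix estimation step can be reproduced along the following lines; the data matrix below is a synthetic stand-in, whereas the actual analysis used the matrix of 1000 test-function values on the GTEx data.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Synthetic stand-in for the n x L matrix of test-function values
# (the actual analysis in Section 5 uses L = 1000 test functions).
X = np.random.default_rng(0).normal(size=(500, 20))

model = GraphicalLassoCV().fit(X)
precision = model.precision_      # sparse inverse covariance matrix (as in Figure 4)
covariance = model.covariance_    # corresponding covariance estimate
```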

Figure 4: Fitted sparse precision matrix of the test functions defined in Section 5 on the GTEx data.

13.2 Additional Results for ACS Income Data Analysis

In this section, we provide additional results for the ACS income data analysis in Section 4, for six randomly selected target states. The results are presented in Figure 5.

Figure 5: Test MSE of XGBoost when the training data is initially sourced from CA and then from PR, shown for six randomly selected target states. The dashed vertical line indicates the point when PR data started to be added. The blue line corresponds to equally weighted samples, the green line to samples weighted by importance weights, and the orange line to distribution-specific weights obtained with our proposed method.
Figure 5: Test MSE results of the XGBoost when the training data is initially sourced from CA and then sourced from PR. The dashed vertical line indicates the point when PR data started to be added. The blue line is when samples are equally weighted, and the green line is when samples are weighted based on importance weights. The orange line is when samples receive distribution-specific weights using our proposed method.