
Out-of-distribution generalization under random, dense distributional shifts

Yujin Jeong and Dominik Rothenhäusler
Abstract

Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that $p(y \mid x)$ remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we discuss empirical evidence for such random dense distributional shifts. Then, we develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we apply the framework to several real-world datasets and discuss diagnostics to evaluate the fit of the distributional uncertainty model.

1 Introduction

Distribution shift is a persistent challenge in statistics and machine learning. For example, we might be interested in understanding the determinants of positive outcomes in early childhood education. However, the relationships among learning outcomes and other variables can shift over time and across locations. This makes it challenging to gain actionable knowledge that can be used in different situations. As a result, there has been a surge of interest in robust and generalizable machine learning. Existing approaches can be classified as invariance-based approaches and adversarial approaches.

Invariance-based methods assume that the shifts only affect a sub-space of the data distribution [Pan and Yang, 2010, Baktashmotlagh et al., 2013]. For example, under covariate shift [Quinonero-Candela et al., 2009], the distribution of the covariates changes while the conditional distribution of the outcomes given the covariates stays invariant. In causal representation learning [Schölkopf et al., 2021], the goal is to disentangle the distribution into independent causal factors, based on the assumption that spurious associations can be separated from invariant associations. Both approaches assume the shift affects only a sub-space of the data. We call such shifts sparse distribution shifts.

Adversarial methods in robust machine learning consider worst-case perturbations. The idea is that some adversary can perturb the data either at the training [Huber, 1981] or deployment stage [Szegedy et al., 2014, Duchi and Namkoong, 2021]; and the goal is to learn a model that is robust to such deviations. While developing robust procedures is important, worst-case perturbations might be too conservative in many real-world applications.

Motivated by an empirical observation, in this paper we consider distributional shifts due to random and dense perturbations. As an example, consider replication studies. In replication studies, different research teams try to replicate results by following the same research protocol. Therefore, any distribution shift between the data may arise from the superposition of many unintended, subtle errors that affect every part of the distribution, which might be modeled as random. We call such shifts dense distributional shifts. The random, dense distributional shift model has recently been used to quantify distributional uncertainty in parameter estimation problems [Jeong and Rothenhäusler, 2023, Rothenhäusler and Bühlmann, 2023] and has shown success in modelling real-world temporal shifts in refugee placement [Bansak et al., 2024]. We will develop tools to measure the similarity between randomly perturbed distributions, infer parameters and predict outcomes for partially observed and shifted distributions.

1.1 Summary of contributions

In this subsection, we preview and summarize our contributions.

First, we discuss an intriguing phenomenon on a real-world dataset, in which means of arbitrary functions seem to be correlated across multiple distributions. To the best of our knowledge, existing distribution shift models do not imply this phenomenon.

To address this gap, we introduce a random distribution shift model for multiple datasets, where likelihood ratios exhibit random fluctuations. This builds on the perturbation model introduced in Jeong and Rothenhäusler [2023], which considers a single perturbed distribution. We generalize the model to handle multiple perturbed distributions that are possibly correlated. We then demonstrate that the extended random perturbation model can explain the empirical distribution shift phenomenon.

Next, we consider the following domain adaptation problem. Our goal is to predict outcomes $Y$ based on covariates $X$. Specifically, we seek to estimate $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ for some loss function $\mathcal{L}(\theta,X,Y)$, where $\mathbb{E}^{0}$ denotes the expectation under the unknown target distribution. While we have access to complete datasets $(X,Y)$ from multiple training distributions, the target dataset contains only covariates $X$ without corresponding outcomes $Y$. We adopt a weighted Empirical Risk Minimization (ERM) approach, where the parameters $\theta^{0}$ are estimated as

\hat{\theta}^{0}=\arg\min_{\theta}\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)],

where $\hat{\mathbb{E}}^{k}$ denotes the empirical mean computed on the $k$-th training dataset.

Under the random distribution shift model, we derive optimal weights for weighted ERM and expressions for the corresponding out-of-distribution error. Based on this expression, we derive prediction intervals for parameters of the target distribution.

Figure 1: Test MSE of XGBoost when the training data is initially sourced from CA and then additionally sourced from PR. The dashed vertical line indicates the point at which PR data starts to be added. The blue line corresponds to equally weighted samples, the green line to samples weighted by importance weights, and the orange line to samples receiving distribution-specific weights from our proposed method.

The effectiveness of our approach is illustrated in Figure 1, which shows the out-of-distribution error for the income prediction task (see Section 4 for details) when the target state is either Texas (TX) or Washington (WA), and the training datasets are from California (CA) and Puerto Rico (PR). We simulate a scenario where CA data is initially available, followed by additional data from PR. We optimize the weighted empirical risk under three different weighting schemes: (1) equal weighting of all samples, (2) sample-specific importance weighting under the assumption of covariate shift, and (3) distribution-specific weighting based on our approach. Compared to the equal weighting and importance weighting methods, our approach demonstrates remarkable robustness, maintaining consistently low out-of-distribution error even after the PR dataset is included, by optimally weighting the two data sources.

The remainder of the paper is organized as follows. In Section 2, we present empirical evidence for random and dense distributional shifts and introduce the random distributional perturbation model. In Section 3, we establish the foundations for domain adaptation under random and dense distributional shifts. In particular, we derive the out-of-distribution risk of weighted empirical risk minimization and conduct statistical inference for parameters of the target distribution. In Section 4 and Section 5, we apply our framework to several real-world datasets and demonstrate the robustness of our method. We conclude in Section 6.

1.2 Related work

This work addresses domain adaptation problems under a specific class of random distributional shifts. Domain adaptation methods generally fall into two main categories [Pan and Yang, 2010]. The first type involves re-weighting, where training samples are assigned weights that take the distributional change into account [Long et al., 2014, Li et al., 2016]. For example, under covariate shift, training samples can be re-weighted via importance sampling [Gretton et al., 2009, Shimodaira, 2000, Sugiyama et al., 2008]. The second type aims to learn invariant representations of features [Argyriou et al., 2007, Baktashmotlagh et al., 2013]. Under dense distributional shifts, as we will see in Section 2, instance-specific weights can be unstable or even undefined due to overlap violations. Furthermore, under dense distributional shifts there might be no invariant representation. Thus, from a classical domain adaptation perspective, dense distribution shifts are a challenging setting.

The proposed method approximates the target distribution as a weighted combination of source distributions. Therefore, the proposed method shares methodological similarities with synthetic controls, which search for a linear combination of untreated units to represent treated units [Abadie and Gardeazabal, 2003]. Synthetic controls are applied to panel data and often assume a linear factor model for a continuous outcome of interest [Doudchenko and Imbens, 2016, Abadie et al., 2007, Bai, 2009, Xu, 2017, Athey et al., 2021]. Our framework can be seen as a justification for synthetic control methods under random distributional shifts. In addition, our procedure can be applied to model distribution shifts with or without time structure for any type of data (discrete, continuous, ordinal). Furthermore, our framework allows for straightforward generalizations to empirical risk minimization, which includes many modern machine learning tools.

The approach of weighting a combination of source distributions to approximate a target distribution also appears in the distributionally robust optimization (DRO) framework [Xiong et al., 2023, Zhan et al., 2024]. While there is some similarity in the prediction procedures, the random distribution shift model enables diagnostics and inference that are markedly different from a worst-case approach. One can think about the random shift model as a potential justification for a very particular type of distribution weighting, with additional inferential consequences.

Our procedure is also loosely related to meta-analysis and hierarchical modeling. Meta-analysis is a statistical procedure for combining data from multiple studies [Higgins et al., 2009]. Traditional meta-analysis relies on the assumption that the effect size estimates from different studies are independent. However, in real-world applications, this independence assumption often does not hold true. There are different strategies to handle dependencies. These include averaging the dependent effect sizes within studies into a single effect [Wood, 2008], incorporating dependencies into models through robust variance estimation [Hedges et al., 2010], or through multilevel meta-analysis where they assume a cluster structure and model dependence within and between clusters [Cheung, 2014]. In meta-regression, the effect sizes are expressed as linear functions of study characteristics, and they are independent conditional on these study characteristics [Borenstein et al., 2009, Hartung et al., 2008]. Bayesian hierarchical models with prior on model parameters can also be used to model dependency [Higgins et al., 2009, Lunn et al., 2013]. Compared to these procedures, our approach allows for correlations between studies and can be applied even when there is no hierarchical structure and no summary study characteristics are available. Furthermore, our approach is frequentist, so it does not require choosing a prior distribution.

In this paper, we model multiple correlated distributions using the random distributional perturbation model introduced in Jeong and Rothenhäusler [2023], where dense distributional shifts emerge as the superposition of numerous small random changes. In Jeong and Rothenhäusler [2023], a single perturbed distribution is considered. In our paper, we extend this model to address multiple perturbed distributions that are possibly correlated.

2 Distributional Uncertainty

Due to a superposition of many random, small changes, a data scientist might not draw samples from the target distribution $\mathbb{P}^{0}$ but from several randomly perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. More specifically, we assume that a data scientist has $K$ datasets, with the $k$-th dataset being an i.i.d. sample from a perturbed distribution $\mathbb{P}^{k}$. Intuitively speaking, some of these distributions might be similar to each other, while others might be very different. For instance, one replication study might closely resemble another due to collaboration among investigators. Additionally, a dataset obtained from California might be more similar to one from Washington than to one from Wyoming.

We illustrate the scenario where multiple datasets exhibit distributional correlations using the GTEx data. (The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The dataset used for the analyses described in this manuscript is version 6 and can be downloaded from the GTEx Portal: www.gtexportal.org.) The GTEx V6 data offers pre-processed RNA-seq gene-expression levels collected from 450 donors across 44 tissues. Treating each tissue as an individual study, we have 44 datasets. Each dataset from tissue $k$ is considered an independent sample from a perturbed distribution $\mathbb{P}^{k}$, where $\mathbb{P}^{\bullet}$ represents a joint distribution of all gene-expression levels. Some of these datasets are likely to be correlated due to factors such as an overlap of donors and the proximity between tissues. Empirical evidence of this correlation is presented on the left side of Figure 2, where each row and column corresponds to a tissue, and each dot represents the standardized covariance between the gene expression levels of a randomly selected gene pair, computed among the observations in the tissue. We randomly sample 1000 gene pairs. We can see that some tissues exhibit a clear linear relationship. In the following, we will introduce a distribution shift model that implies this linear relationship.

Figure 2: (Left) Linear relationships between tissues: Each row and column corresponds to one tissue, with each dot representing the standardized covariance of the gene expression levels of a randomly selected gene pair. Note that certain tissues exhibit a linear relationship. For instance, a notable linear relationship is observed within the blue squared box, where a simple linear regression yielded an R-squared value of 0.708. In contrast, in the red squared box, a linear relationship is not evident, as indicated by the simple linear regression result with an R-squared value of 0.017. (Right) QQ plots of (2) for various test functions for different perturbed tissues when the target tissue is brain cortex. The red line represents the expected QQ line if the perturbed data were all drawn i.i.d. from the target distribution (brain cortex).

To model the relatedness of multiple datasets, we will assume that each $\mathbb{P}^{k}$ is a random perturbation of a common target distribution $\mathbb{P}^{0}$. Then, we adopt the random distributional perturbation model of Jeong and Rothenhäusler [2023], where dense distributional shifts emerge as the superposition of numerous small random changes, to construct the perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. In this random perturbation model, probabilities of events under a random probability measure $\mathbb{P}^{k}$ are slightly up-weighted or down-weighted compared to the target distribution $\mathbb{P}^{0}$.

Without loss of generality, we construct distributional perturbations for uniform distributions on $[0,1]$. A result from probability theory shows that any random variable $D$ on a finite or countably infinite dimensional probability space can be written as a measurable function $D\stackrel{d}{=}h(U)$, where $U$ is a uniform random variable on $[0,1]$. (For any Borel-measurable random variable $D$ on a Polish (separable and completely metrizable) space $\mathcal{D}$, there exists a Borel-measurable function $h$ such that $D\stackrel{d}{=}h(U)$, where $U$ follows the uniform distribution on $[0,1]$ [Dudley, 2018].) With the transformation $h$, the construction of distributional perturbations for uniform variables can be generalized to the general case by using

\mathbb{P}^{k}(D\in\bullet)=\mathbb{P}^{k}(h(U)\in\bullet).

Let us now construct the perturbed distribution $\mathbb{P}^{k}$ for a random variable $U$ that follows a uniform distribution under $\mathbb{P}^{0}$. Take $m$ bins $I_{j}=[(j-1)/m,\,j/m]$ for $j=1,\dots,m$. Let $W^{k}_{1},\dots,W^{k}_{m}$ be i.i.d. positive random variables with finite variance. We define the randomly perturbed distribution $\mathbb{P}^{k}$ by setting

\mathbb{P}^{k}(U\in\bullet)=\sum_{j}\mathbb{P}^{0}(U\in I_{j}\cap\bullet)\cdot\frac{W^{k}_{j}}{\sum_{j'=1}^{m}W^{k}_{j'}/m}.

Then the datasets are generated as follows:

1. Generate perturbed distributions: Start with the target distribution $\mathbb{P}^{0}$ and generate $K$ perturbed distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$ using random weights $W=\{\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$.

2. Draw samples: For each $k=1,\dots,K$, draw $n_{k}$ i.i.d. samples $\mathcal{D}_{k}=\{D_{ki}\}_{i=1}^{n_{k}}$ from the perturbed distribution $\mathbb{P}^{k}$ (i.e., $D_{ki}\stackrel{\text{i.i.d.}}{\sim}\mathbb{P}^{k}$), conditional on the weights $W=\{\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$.

Further, let $m=m(n)$ be such that $m(n)/\min(n_{1},\dots,n_{K})$ converges to 0. This is the regime where the distributional uncertainty is of higher order than the sampling uncertainty.
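To make the construction concrete, the following minimal R sketch simulates two correlated perturbed distributions for a uniform variable and draws samples from them. The bin count m, the exponential weights, and the shared component V used to induce positive correlation are illustrative assumptions, not the authors' implementation.

    # A minimal sketch (not the authors' code): simulating the random perturbation
    # model for a uniform U on [0, 1]. Sharing the component V across the two weight
    # vectors induces a positive distributional correlation between P^1 and P^2.
    set.seed(1)
    m <- 200; n <- 5000

    V  <- rexp(m)            # shared component => Cov(W^1_j, W^2_j) > 0
    W1 <- V + rexp(m)        # i.i.d. positive weights with finite variance
    W2 <- V + rexp(m)

    draw_from <- function(W, n, m) {
      # bin j receives probability W_j / sum(W); within a bin, U stays uniform
      bins <- sample.int(m, n, replace = TRUE, prob = W / sum(W))
      runif(n, (bins - 1) / m, bins / m)
    }
    U1 <- draw_from(W1, n, m)
    U2 <- draw_from(W2, n, m)

    phi <- function(u) sin(2 * pi * u)   # any square-integrable test function
    c(mean(phi(U1)), mean(phi(U2)))      # both fluctuate around E^0[phi(U)] = 0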

Notation.

Let $\mathbb{P}^{0}$ be the target distribution on $\mathcal{D}$, and let $\mathbb{P}^{k}$ be the perturbed distribution on $\mathcal{D}$ conditional on $(W_{j}^{k})_{j=1}^{m}$. We draw an i.i.d. sample $\mathcal{D}_{k}=\{D_{ki}\}_{i=1}^{n_{k}}$ from $\mathbb{P}^{k}$, conditional on $(W_{j}^{k})_{j=1}^{m}$. Let $P$ be the marginal distribution of $\{\{D_{ki}\}_{i=1}^{n_{k}},\{W_{j}^{k}\}_{j=1}^{m}\}_{k=1}^{K}$. We write $E$ for the expectation under $P$, $\mathbb{E}^{0}$ for the expectation under $\mathbb{P}^{0}$, and $\mathbb{E}^{k}$ for the expectation under $\mathbb{P}^{k}$. Specifically, $\mathbb{E}^{0}[\phi(D)]$ is the mean of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{0}$, and $\mathbb{E}^{k}[\phi(D)]$ is the mean of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{k}$, conditional on $(W_{j}^{k})_{j=1}^{m}$. Moreover, we write $\hat{\mathbb{E}}^{k}$ for the sample average over $\mathcal{D}_{k}$; for example, $\hat{\mathbb{E}}^{k}[\phi(D)]=\sum_{i=1}^{n_{k}}\phi(D_{ki})/n_{k}$, where $D_{ki}\stackrel{\text{i.i.d.}}{\sim}\mathbb{P}^{k}$. We write $\mathrm{Var}_{P}$ for the variance under $P$ and $\mathrm{Var}_{\mathbb{P}^{0}}$ for the variance under $\mathbb{P}^{0}$. Specifically, $\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))$ is the variance of $\phi(D)$ for a random variable $D\in\mathcal{D}$, $D\sim\mathbb{P}^{0}$.

Theorem 1 (Distributional CLT).

Under the assumptions mentioned above, for any Borel-measurable, square-integrable functions $\phi_{k}:\mathcal{D}_{k}\to\mathbb{R}$, $k=1,\dots,K$, we have

\sqrt{m}\left(\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}(D)]\\ \vdots\\ \hat{\mathbb{E}}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)\xrightarrow{d}N\left(0,\;\Sigma^{W}\odot\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))\right) \qquad (1)

where $\phi(D)^{\intercal}=(\phi_{1}(D),\dots,\phi_{K}(D))\in\mathbb{R}^{K}$, and $\Sigma^{W}\in\mathbb{R}^{K\times K}$ has

(\Sigma^{W})_{ij}=\frac{\mathrm{Cov}_{P}(W^{i},W^{j})}{E[W^{i}]\,E[W^{j}]}.

Here, \odot denotes element-wise multiplication (Hadamard product).

Remark 1 (Random shift in $Y\mid X$).

As an example, let us study how conditional expectations shift under this model. We consider the case where the data consist of both covariates $X$ and target $Y$, that is, $D=(X,Y)$. Let $A$ be some subset of the sample space of $X$ with $\mathbb{P}^{0}(X\in A)>0$. Using a Taylor approximation,

\hat{\mathbb{E}}^{1}[Y|X\in A]-\hat{\mathbb{E}}^{0}[Y|X\in A]=\hat{\mathbb{E}}^{1}[\phi]-\hat{\mathbb{E}}^{0}[\phi]+o_{P}(1/\sqrt{m})

where $\phi=\frac{1_{X\in A}\,(Y-\mathbb{E}^{0}[Y|X\in A])}{\mathbb{P}^{0}[X\in A]}$. Thus, by the CLT,

\sqrt{m}\,\big(\hat{\mathbb{E}}^{1}[Y|X\in A]-\hat{\mathbb{E}}^{0}[Y|X\in A]\big)\stackrel{d}{\rightarrow}\mathcal{N}\left(0,\;\Sigma_{11}^{W}\,\frac{\mathrm{Var}_{\mathbb{P}^{0}}(Y|X\in A)}{\mathbb{P}^{0}(X\in A)}\right).

Hence, the conditional expectation is randomly shifted based on three factors: 1) the shift is proportional to the distribution shift strength $\sqrt{\Sigma_{11}^{W}/m}$; 2) it is proportional to the conditional noise $\sqrt{\mathrm{Var}_{\mathbb{P}^{0}}(Y|X\in A)}$; and 3) it is inversely proportional to $\sqrt{\mathbb{P}^{0}(X\in A)}$. This random shift in $Y\mid X$ needs to be accounted for in both estimation and inference.

Remark 2 (Distributional covariance).

We call $\Sigma^{W}/m$ the covariance between distributional perturbations. Two perturbed distributions $\mathbb{P}^{k}$ and $\mathbb{P}^{k^{\prime}}$ are positively correlated if $\Sigma_{kk^{\prime}}^{W}>0$. This means that when one perturbation slightly increases the probability of a certain event $A$, the other distribution tends to slightly increase the probability of this event as well. This can be seen by choosing the test function $\phi_{k}=1_{D\in A}$:

\mathrm{Cov}_{P}\big(\hat{\mathbb{P}}^{k}(D\in A),\hat{\mathbb{P}}^{k^{\prime}}(D\in A)\big)=\mathrm{Cov}_{P}\big(\hat{\mathbb{E}}^{k}[1_{D\in A}],\hat{\mathbb{E}}^{k^{\prime}}[1_{D\in A}]\big)\approx\frac{\Sigma_{k,k^{\prime}}^{W}}{m}\,\mathrm{Var}_{\mathbb{P}^{0}}(1_{D\in A}).

However, if two perturbed distributions are negatively correlated, a (random) increase in the probability of an event under one distribution would often coincide with a (random) decrease in the probability of this event under the other distribution, compared to $\mathbb{P}^{0}$.
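This covariance relation can be checked numerically. Continuing the simulation sketch above (illustrative weights, not the authors' code), with weights $W = V + E$ for independent standard exponentials $V$ and $E$ we have $\mathrm{Cov}(W^{1}_{j},W^{2}_{j})=\mathrm{Var}(V_{j})=1$ and $E[W_{j}]=2$, so $(\Sigma^{W})_{12}=1/4$:

    # Monte Carlo check of the covariance approximation above: over repeated draws
    # of the weights, the covariance of the two perturbed empirical probabilities of
    # the event {U < 0.3} is approximately (Sigma^W)_{12} * Var_0(1_{U < 0.3}) / m,
    # here 0.25 * 0.21 / m. Reuses m, n, and draw_from() from the sketch above.
    reps <- replicate(2000, {
      V <- rexp(m); W1 <- V + rexp(m); W2 <- V + rexp(m)
      c(mean(draw_from(W1, n, m) < 0.3), mean(draw_from(W2, n, m) < 0.3))
    })
    cov(t(reps))[1, 2]            # compare with 0.25 * 0.3 * 0.7 / m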

Remark 3 (Estimation of variance).

The variance under the target distribution, $\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))$, can be estimated via the empirical variance of $\phi$ on the pooled data $\{(D_{ki})_{i=1}^{n_{k}}\}_{k=1}^{K}$. If $\phi$ has a finite fourth moment, we then have $\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi(D))=\mathrm{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1)$. The details can be found in Appendix 8.5.

An extended version of Theorem 1, in which $\phi_{k}:\mathcal{D}_{k}\to\mathbb{R}^{L}$ for some $L\geq 1$, and the proof of Theorem 1 can be found in the Appendix, Section 8. In this paper, we consider the target distribution as fixed. In Appendix 8.4, we discuss a version of Theorem 1 that allows for a randomly shifted target distribution. In Section 9 of the Appendix, we discuss several special cases of the random perturbation model, including random shifts across time, i.i.d. shifts, and random shifts between locations.

Theorem 1 tells us that $\Sigma_{ij}^{W}/m$ is the asymptotic covariance of $\hat{\mathbb{E}}^{i}[\phi(D)]$ and $\hat{\mathbb{E}}^{j}[\phi(D)]$ for any Borel-measurable, square-integrable function $\phi$ with unit variance under $\mathbb{P}^{0}$. Remarkably, under our random distributional perturbation model, a $\big(K+\binom{K}{2}\big)$-dimensional parameter $\Sigma^{W}$ quantifies the similarity of multiple distributions and captures all possible correlations of all square-integrable functions.

The random shift model explains the linear patterns in the GTEx scatterplot.

Consider two perturbed distributions $\mathbb{P}^{1}$ and $\mathbb{P}^{2}$, and $L$ uncorrelated test functions with unit variances, $\phi_{1}(D),\dots,\phi_{L}(D)$. In the GTEx example, $\phi_{\ell}$ is the standardized product of the gene expression levels of the randomly selected $\ell$-th gene pair. Moreover, in Appendix 13, we see that these $L$ test functions are approximately uncorrelated. From Theorem 1, we have

\sqrt{m}\left(\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{\ell}(D)]\\ \hat{\mathbb{E}}^{2}[\phi_{\ell}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{\ell}(D)]\\ \mathbb{E}^{0}[\phi_{\ell}(D)]\end{pmatrix}\right)\stackrel{d}{=}Z_{\ell}+o_{p}(1),

where

Z_{\ell}\stackrel{\text{i.i.d.}}{\sim}N\left(0,\begin{pmatrix}\Sigma_{11}^{W}&\Sigma_{12}^{W}\\ \Sigma_{21}^{W}&\Sigma_{22}^{W}\end{pmatrix}\right).

Therefore, the pairs $(\hat{\mathbb{E}}^{1}[\phi_{\ell}(D)]-\mathbb{E}^{0}[\phi_{\ell}(D)],\;\hat{\mathbb{E}}^{2}[\phi_{\ell}(D)]-\mathbb{E}^{0}[\phi_{\ell}(D)])$ for $\ell=1,\dots,L$ can be viewed as independent draws from a bivariate Gaussian distribution with distributional covariance $\Sigma^{W}/m$. This explains the linear patterns observed in the left plot of Figure 2. Gaussianity can be evaluated by QQ plots of the statistic

\left(\frac{1}{n_{k}}+\frac{1}{n_{0}}\right)^{-1/2}\left(\frac{\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{\ell}(D_{ki})-\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\phi_{\ell}(D_{0i})}{\hat{\mathrm{sd}}_{\mathbb{P}^{0}}(\phi_{\ell}(D))}\right). \qquad (2)

The QQ plots of the statistic in (2) are presented on the right-hand side of Figure 2, with the target tissue set to brain cortex. These plots indicate that the statistics in (2) indeed follow a Gaussian distribution.
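As a simple illustration of this diagnostic, the following R sketch computes the statistic in (2) for a collection of test functions; the inputs phi_source and phi_target (matrices of test-function values on the $k$-th perturbed sample and on the target sample) are hypothetical placeholders, not the authors' code.

    # A minimal sketch: phi_source (n_k x L) and phi_target (n_0 x L) hold the values
    # of the L test functions on the k-th perturbed sample and on the target sample.
    qq_statistic <- function(phi_source, phi_target) {
      nk <- nrow(phi_source); n0 <- nrow(phi_target)
      sd0 <- apply(rbind(phi_source, phi_target), 2, sd)  # pooled estimate of sd under P^0
      (colMeans(phi_source) - colMeans(phi_target)) / (sd0 * sqrt(1 / nk + 1 / n0))
    }
    # stats <- qq_statistic(phi_source, phi_target)
    # qqnorm(stats); qqline(stats)   # compare against the Gaussian reference line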

3 Out-of-distribution generalization

In this section, we demonstrate how, under the random distributional perturbation model, we can share information across datasets from different times and locations, and infer parameters of a partially observed target distribution. Specifically, we show how the distributional CLT from the previous section allows us to identify the optimal convex combination of source distributions to best approximate the target distribution.

We consider the following domain adaptation setting throughout the section. We aim to predict a target $Y$ based on some covariates $X$, that is, the data is $D=(X,Y)$. We want to estimate $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ for some loss function $\mathcal{L}(\theta,X,Y)$. We consider the setting where we observe the full $D=(X,Y)$ only under the source distributions $\mathbb{P}^{1},\dots,\mathbb{P}^{K}$. More specifically, for the $k$-th source dataset ($k\in\{1,2,\dots,K\}$), we observe $n_{k}$ i.i.d. samples $\mathcal{D}_{k}=\{D_{k1},\dots,D_{kn_{k}}\}$ drawn from the source distribution $\mathbb{P}^{k}$. Furthermore, there are $n_{0}$ i.i.d. samples from the target distribution $\mathbb{P}^{0}$, but we only observe the covariates $X$, so that $\mathcal{D}_{0}=\{X_{01},\dots,X_{0n_{0}}\}$.

For any fixed non-negative weights $\beta_{1},\ldots,\beta_{K}$ with $\sum_{k=1}^{K}\beta_{k}=1$, one can consider weighted empirical risk minimization, that is, one can set

\hat{\theta}^{\beta}=\arg\min_{\theta}\sum_{k}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

In the following, we want to study the excess risk

\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},X,Y)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},X,Y)].

This will give us insights into how to choose the weights $\beta$ in an optimal fashion, and allow us to conduct statistical inference.

Lemma 1 (Out-of-distribution error of weighted ERM).

For each $\theta$ in an open subset of $\Omega$, let $\theta\mapsto\partial_{\theta}\mathcal{L}(\theta,x,y)$ be twice continuously differentiable in $\theta$ for every $x,y$. Assume that the matrix $\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta,X,Y)]$ exists and is nonsingular. Assume that the third partial derivatives of $\theta\mapsto\mathcal{L}(\theta,D)$ are dominated by a fixed function $h(D)$ for every $\theta$ in a neighborhood of $\theta^{0}$. We assume that $\partial_{\theta}\mathcal{L}(\theta^{0},D)$, $\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)$ and $h(D)$ are square-integrable under $\mathbb{P}^{0}$. Assume that $\hat{\theta}^{\beta}-\theta^{0}=o_{P}(1)$. Then,

m\left(\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},X,Y)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},X,Y)]\right)\stackrel{d}{\rightarrow}\mathcal{E}

for some random variable $\mathcal{E}$, with

\mathbb{E}[\mathcal{E}]=\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Trace}\big(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]^{-1}\,\mathrm{Var}_{\mathbb{P}^{0}}(\partial_{\theta}\mathcal{L}(\theta^{0},X,Y))\big).

The proof of Lemma 1 is provided in Appendix 10, along with the additional regularity assumptions under which the consistency assumption stated in Lemma 1 holds.

By this lemma, the optimal $\beta$ in terms of out-of-distribution error is

\beta^{*}=\arg\min_{\beta:\,\beta^{\intercal}\mathbb{1}=1,\;\beta\geq 0}\;\beta^{\intercal}\Sigma^{W}\beta.

A priori, $\beta^{*}$ is unknown to the researcher. In the following, we will describe estimation and inference for $\beta^{*}$. Our estimation strategy is based on the following observation. Using the distributional CLT, for any function $\phi(X)$ and any $\beta$,

\sqrt{m}\Big(\hat{\mathbb{E}}^{0}[\phi(X)]-\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\phi(X)]\Big)\stackrel{d}{\rightarrow}\mathcal{N}\big(0,\;\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Var}_{\mathbb{P}^{0}}(\phi(X))\big). \qquad (3)

We can use the fact that the variance of the limiting distribution is inflated by the factor $\beta^{\intercal}\Sigma^{W}\beta$, regardless of the test function, to estimate $\beta^{*}$. Consider different test functions of $X$, that is, $\phi_{\ell}(X)$, $\ell=1,\dots,L$. For example, we may use the individual covariates as test functions, $\phi_{\ell}=X_{\ell}$. Then, we can estimate $\beta^{*}$ by solving

\hat{\beta}=\operatorname*{argmin}_{\beta:\,\beta^{\intercal}\mathbb{1}=1,\;\beta\geq 0}\;\sum_{l=1}^{L}\Big(\hat{\mathbb{E}}^{0}[\phi_{l}(X)]-\sum_{k}\beta_{k}\hat{\mathbb{E}}^{k}[\phi_{l}(X)]\Big)^{2} \qquad (4)

and set the final estimate of $\theta^{0}$ as

\hat{\theta}=\arg\min_{\theta}\sum_{k}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

We will discuss the accuracy of this estimate in the following subsection. More specifically, we will derive confidence intervals for $\beta^{*}$ and $\theta^{0}$.
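To illustrate the estimation step, the following R sketch uses hypothetical inputs (not the authors' implementation): Phi is an $L\times K$ matrix whose $(\ell,k)$ entry is $\hat{\mathbb{E}}^{k}[\phi_{\ell}(X)]$, and phi0 is the length-$L$ vector of the corresponding target means $\hat{\mathbb{E}}^{0}[\phi_{\ell}(X)]$. The sum-to-one constraint is removed via the reparametrization used in Section 3.1, so negative weights are allowed; enforcing $\beta\geq 0$ would additionally require a quadratic-programming solver.

    # A minimal sketch (hypothetical inputs): estimate beta from the least-squares
    # problem (4), handling the sum-to-one constraint by the reparametrization of
    # Section 3.1 (the non-negativity constraint is dropped here).
    estimate_beta <- function(Phi, phi0) {
      K <- ncol(Phi)
      X <- Phi[, -K, drop = FALSE] - Phi[, K]   # columns: E^k[phi] - E^K[phi], k < K
      y <- phi0 - Phi[, K]
      b <- coef(lm(y ~ X - 1))                  # no-intercept least squares
      unname(c(b, 1 - sum(b)))                  # recover beta_K from the constraint
    }
    # beta_hat <- estimate_beta(Phi, phi0)
    # Weighted ERM then assigns weight beta_hat[k] / n_k to every sample in dataset k,
    # e.g. fit <- lm(y_pooled ~ ., data = d_pooled, weights = w) when beta_hat >= 0.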

Remark 4.

In practice, one might wonder how to choose the test functions $\phi_{l}$, $l=1,\ldots,L$. From a theoretical standpoint, under the distributional uncertainty model, one can technically choose any function with finite variance. However, for actual applications, we suggest the following guideline. Our method considers a distribution $\mathbb{P}^{k}$ to be “close” to $\mathbb{P}^{0}$ if the averaged values of the test functions on $\mathcal{D}_{k}$ closely align with the corresponding averaged values on $\mathcal{D}_{0}$. Therefore, the test functions $\phi$ should be chosen to capture the dimensions of the problem in which a close match is important. For example, in the numerical experiments in Section 4, the target of interest is the logarithmic value of income, and using as test functions the means of covariates that are well known to be related to income demonstrates good performance. We also provide simple diagnostics, presented in Figure 3 as standard residual plots and normal QQ plots, that allow one to assess the fit of the distributional perturbation model and the choice of test functions.

3.1 Inference for $\beta^{*}$ and $\theta^{0}$

In the following, we discuss how to form asymptotically valid confidence intervals for $\beta^{*}$ and $\theta^{0}$ when $\beta^{*}$ and $\hat{\beta}$ are allowed to have negative weights. As discussed earlier, we choose $L$ different test functions $\phi_{1}(X),\dots,\phi_{L}(X)$. For now, we assume that the test functions are uncorrelated and have unit variances under $\mathbb{P}^{0}$; later we will discuss the case where the test functions are correlated and have different variances. In the previous section, we saw that $\hat{\beta}$ can be estimated via a least-squares problem with a linear constraint on the weights. To remove the linear constraint in equation (4), we perform the following reparametrization:

\hat{\beta}_{1:(K-1)}=\operatorname*{argmin}_{\beta}\sum_{l=1}^{L}\left((\hat{\mathbb{E}}^{0}[\phi_{l}(X)]-\hat{\mathbb{E}}^{K}[\phi_{l}(X)])-\sum_{k=1}^{K-1}\beta_{k}(\hat{\mathbb{E}}^{k}[\phi_{l}(X)]-\hat{\mathbb{E}}^{K}[\phi_{l}(X)])\right)^{2}.

We can obtain $\hat{\beta}_{K}$ via $\hat{\beta}_{K}=1-\sum_{k=1}^{K-1}\hat{\beta}_{k}$. We can re-write this as the following linear regression problem:

\hat{\beta}_{1:(K-1)}=\operatorname*{argmin}_{\beta_{1:(K-1)}\in\mathbb{R}^{K-1}}\|\tilde{\Phi}^{0}-\tilde{\Phi}\beta_{1:(K-1)}\|_{2}^{2},

where the feature matrix in the linear regression is defined as

\tilde{\Phi}=\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}]&\dots&\hat{\mathbb{E}}^{K-1}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}]\\ \vdots&&\vdots\\ \hat{\mathbb{E}}^{1}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]&\dots&\hat{\mathbb{E}}^{K-1}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]\end{pmatrix}\in\mathbb{R}^{L\times(K-1)},

and the outcome vector in the linear regression is defined as

\tilde{\Phi}^{0}=\left(\hat{\mathbb{E}}^{0}[\phi_{1}]-\hat{\mathbb{E}}^{K}[\phi_{1}],\dots,\hat{\mathbb{E}}^{0}[\phi_{L}]-\hat{\mathbb{E}}^{K}[\phi_{L}]\right)^{\intercal}\in\mathbb{R}^{L\times 1}.

Since the test functions were assumed to be uncorrelated and to have unit variances under $\mathbb{P}^{0}$, Theorem 1 implies that the rows of the feature matrix $\tilde{\Phi}$ and of the outcome vector $\tilde{\Phi}^{0}$ are asymptotically i.i.d., with each row drawn from a centered Gaussian distribution. Therefore, the problem may be viewed as a standard multiple linear regression problem, where the variance of the coefficient $\hat{\beta}_{1:(K-1)}$ is estimated as

\widehat{\mathrm{Var}}(\hat{\beta}_{1:(K-1)})=(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2},

and the residual variance is calculated as

\hat{\sigma}^{2}=\frac{1}{L-K+1}\|\tilde{\Phi}^{0}-\tilde{\Phi}\hat{\beta}\|_{2}^{2}.

Different from standard linear regression, in our setting each column represents a data distribution and each row represents a test function, and Gaussianity does not hold exactly in finite samples. Therefore, a priori it is unclear whether standard statistical tests in linear regression (such as the $t$-test and the $F$-test) carry over to the distributional case. The following theorem tells us that the standard linear regression results still hold. Therefore, we can conduct statistical inference for the optimal weights $\beta^{*}$ in the same manner as we conduct $t$-tests and $F$-tests in a standard linear regression. The proof of Theorem 2 can be found in the Appendix, Section 11. Note that these results assume a finite number of test functions and datasets but are asymptotic in $m(n)$, as they rely on the distributional CLT.

Theorem 2.

Assume that the test functions $\phi_{1},\dots,\phi_{L}$ are uncorrelated and have unit variances under $\mathbb{P}^{0}$. Let $\tilde{\Phi}$ and $\hat{\sigma}^{2}$ be defined as above. Then, we have

\left((\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}\right)^{-\frac{1}{2}}\left(\hat{\beta}_{1:(K-1)}-\beta^{*}_{1:(K-1)}\right)\stackrel{d}{=}t_{K-1}(L-K+1)+o_{p}(1),

where $t_{p}(d)$ denotes the $p$-dimensional multivariate $t$-distribution with $d$ degrees of freedom.
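Because Theorem 2 reduces the problem to a standard no-intercept regression, the distributional t-tests and the F-test discussed in Remarks 5 and 6 below can be read off an ordinary least-squares fit. The following R sketch (continuing the hypothetical Phi, phi0, and K from the previous sketch) is one way to do this; it is an illustration, not the dlm implementation.

    # A minimal sketch (hypothetical inputs): with uncorrelated, unit-variance test
    # functions, lm() on the reparametrized problem returns the distributional
    # t-tests, with residual degrees of freedom L - K + 1.
    Phi_tilde  <- Phi[, -K, drop = FALSE] - Phi[, K]   # L x (K-1) feature matrix
    Phi_tilde0 <- phi0 - Phi[, K]                      # length-L outcome vector

    fit <- lm(Phi_tilde0 ~ Phi_tilde - 1)
    summary(fit)            # t-statistics and p-values for beta_1, ..., beta_{K-1}

    # Distributional F-test of H0: beta_1 = ... = beta_K = 1/K (Wald form of Remark 6)
    b     <- coef(fit)
    V     <- vcov(fit)      # equals (Phi_tilde' Phi_tilde)^{-1} * sigma_hat^2
    delta <- b - 1 / K
    Fstat <- drop(t(delta) %*% solve(V) %*% delta) / (K - 1)
    pf(Fstat, K - 1, df.residual(fit), lower.tail = FALSE)   # p-value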

Remark 5 (Distributional t-test).

For this test, the null hypothesis is that the dataset $\mathcal{D}_{i}$ should receive weight zero in the weighted empirical risk minimization. Mathematically, this corresponds to testing the null hypothesis $\beta_{i}^{*}=0$. Thus, we consider the following hypotheses:

H_{0}:\beta_{i}^{*}=0,\qquad H_{1}:\beta_{i}^{*}\neq 0.

The test statistic is defined as

t=\frac{\hat{\beta}_{i}}{\sqrt{\left(\left(\tilde{\Phi}^{\intercal}\tilde{\Phi}\right)^{-1}\hat{\sigma}^{2}\right)_{ii}}}.

From Theorem 2, under the null hypothesis,

t\sim t(L-K+1),

where $t(L-K+1)$ is a univariate $t$-distribution with $L-K+1$ degrees of freedom.

Remark 6 (Distributional F-test).

For this test, the null hypothesis is that each dataset is equally informative about the target distribution. Mathematically, this corresponds to testing the null hypothesis $\beta_{i}^{*}=1/K$ for $i=1,\ldots,K$. Thus, we consider the following hypotheses:

H_{0}:\;\beta_{1}^{*}=\ldots=\beta_{K}^{*}=\frac{1}{K},\qquad H_{1}:\;\beta_{i}^{*}\neq\frac{1}{K}\text{ for at least one }i.

The test statistic is defined as

F=\left(\hat{\beta}_{1:(K-1)}-\frac{1}{K}\cdot\bm{1}_{K-1}\right)^{\intercal}\left((\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}\right)^{-1}\left(\hat{\beta}_{1:(K-1)}-\frac{1}{K}\cdot\bm{1}_{K-1}\right)\Big/(K-1).

From Theorem 2, under the null hypothesis,

F\sim F_{K-1,\,L-K+1},

where $F_{K-1,\,L-K+1}$ is an $F$-distribution with $K-1$ and $L-K+1$ degrees of freedom.

Remark 7 (Distributional confidence intervals for parameters of the target distribution).

We will now discuss how to form asymptotically valid confidence intervals for $\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)]$ when $m\to\infty$ and $L\to\infty$. Here we assume that the influence function $\phi=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},X,Y)$ has a finite fourth moment under $\mathbb{P}^{0}$. Then, we can form $(1-\alpha)$-confidence intervals for $\theta^{0}$ with $\hat{\theta}=\arg\min_{\theta}\sum_{k}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)]$ as follows:

\hat{\theta}\pm z_{1-\alpha/2}\cdot\sqrt{\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi)}\cdot\sqrt{\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}}.

Here, $\widehat{\mathrm{Var}}_{\mathbb{P}^{0}}(\phi)$ denotes the empirical variance of

\hat{\phi}(X,Y)=-\Big(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)]\Big)^{-1}\partial_{\theta}\mathcal{L}(\hat{\theta},X,Y)

on the pooled donor data $\{\{X_{ki},Y_{ki}\}_{i=1}^{n_{k}}\}_{k=1}^{K}$. The proof of this result can be found in the Appendix, Section 12.
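As a concrete special case, consider $\mathcal{L}(\theta,Y)=(Y-\theta)^{2}/2$, so that $\theta^{0}=\mathbb{E}^{0}[Y]$ and the influence function is $\phi(Y)=Y-\theta^{0}$ with variance $\mathrm{Var}_{\mathbb{P}^{0}}(Y)$. A minimal R sketch of the resulting interval, with hypothetical inputs y_list (the K source outcome vectors) and Phi, phi0, beta_hat as in the sketches above:

    # A minimal sketch for the mean-of-Y case: theta_hat is the weighted ERM solution,
    # sd_phi the pooled estimate of sd(Y) under P^0, and resid_rms the root-mean-square
    # mismatch of the weighted test-function means, as in Remark 7.
    theta_hat <- sum(beta_hat * sapply(y_list, mean))     # sum_k beta_k * mean of Y in dataset k
    sd_phi    <- sd(unlist(y_list))                       # pooled estimate of sd under P^0
    resid_rms <- sqrt(mean((phi0 - Phi %*% beta_hat)^2))  # sqrt( (1/L) * sum_l residual_l^2 )
    alpha <- 0.05
    theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * sd_phi * resid_rms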

Remark 8 (Distributional sample size and degrees of freedom).

Note that the asymptotic variance of the “distributional” regression parameters, $(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}$, has the same algebraic form as the asymptotic variance of the regression parameters in a standard Gaussian linear model. In the classical setting, the test statistics of regression parameters follow $t$-distributions with $n-p+1$ degrees of freedom, where $n$ is the sample size and $p$ is the number of covariates. In our setting, we have $L-K+1$ degrees of freedom, where $L$ corresponds to the number of test functions and $K$ corresponds to the number of donor distributions.

Remark 9 (Correlated test functions).

We assumed that the test functions $(\phi_{1}(X),\dots,\phi_{L}(X))$ are uncorrelated and have unit variances under $\mathbb{P}^{0}$. In practice, this might not be the case. In such cases, we can apply a linear transformation $T$ to the test functions in a pre-processing step to obtain uncorrelated test functions with unit variances. We define the transformation matrix $T=(\hat{\Sigma}^{\Phi})^{-1/2}$, where $\hat{\Sigma}^{\Phi}$ is an estimate of the covariance matrix $\Sigma^{\Phi}$ of $(\phi_{1}(X),\dots,\phi_{L}(X))$ on the pooled data. From Remark 3, if $(\phi_{1}(X),\dots,\phi_{L}(X))$ have finite fourth moments, we have $\hat{\Sigma}^{\Phi}=\Sigma^{\Phi}+o_{p}(1)$.
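A minimal R sketch of this pre-whitening step; the input phi_pooled (an $n\times L$ matrix of test-function values on the pooled data) is a hypothetical placeholder, and the estimated covariance matrix is assumed to be positive definite:

    # Whitening the test functions: multiply by Sigma_hat^{-1/2} so that the
    # transformed test functions are approximately uncorrelated with unit variance.
    Sigma_hat <- cov(phi_pooled)
    eig <- eigen(Sigma_hat, symmetric = TRUE)
    T_white <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)  # Sigma_hat^{-1/2}
    phi_white <- phi_pooled %*% T_white
    # The same transformation is applied to the per-dataset means before forming Phi_tilde.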

4 ACS Income Data

In this section, we demonstrate the effectiveness and robustness of our method when the training dataset is obtained from multiple sources. We focus on the ACS Income dataset [Ding et al., 2021] where the goal is to predict the logarithmic value of individual income based on tabular census data.

Similar to Shen et al. [2023], we consider a scenario where we initially have a limited training dataset from California (CA) and subsequently obtain an additional training dataset from Puerto Rico (PR). We iterate over the remaining 49 states as the target state. Note that there are substantial economic, job market, and cost-of-living disparities between CA and PR, and most states are more similar to CA than to PR. Therefore, augmenting the training dataset with data sourced from PR will lead to substantial performance degradation of the prediction algorithm on the target state.

We compare three methods in our study. In the first method, we fit XGBoost on the training dataset in the usual manner, with all samples assigned equal weights. In the second method, we employ sample-specific importance weights. The importance weights are calculated as follows. First, pool the covariates $X$ of the training dataset (CA + PR) and the target dataset. Let $A_{i}$ be the label that equals 1 if the $i$-th sample is from the target dataset and 0 otherwise. Then, the importance weight for the $i$-th sample is obtained as

\hat{w}_{IW,i}=\frac{\hat{P}(X=X_{i}|A_{i}=1)}{\hat{P}(X=X_{i}|A_{i}=0)}=\frac{\hat{P}(A_{i}=1|X=X_{i})\,\hat{P}(A_{i}=0)}{\hat{P}(A_{i}=0|X=X_{i})\,\hat{P}(A_{i}=1)},

where we estimate $\hat{P}(A_{i}|X=X_{i})$ using XGBoost on the pooled covariate data and estimate $\hat{P}(A_{i}=1)$ as the ratio of the sample size of the target data to the sample size of the pooled data.
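For concreteness, a minimal R sketch of this importance-weighting baseline, using logistic regression as a simple stand-in for the XGBoost classifier used in the paper; the data frames x_train and x_target are hypothetical placeholders with identical covariate columns:

    # Importance weights via a domain classifier:
    # w_i = P_hat(A=1|X_i) * P_hat(A=0) / (P_hat(A=0|X_i) * P_hat(A=1)).
    pooled <- rbind(x_train, x_target)
    A <- c(rep(0, nrow(x_train)), rep(1, nrow(x_target)))      # 1 = target, 0 = training

    clf <- glm(A ~ ., data = data.frame(pooled, A = A), family = binomial())
    p1  <- predict(clf, newdata = x_train, type = "response")  # P_hat(A = 1 | X) on training rows

    prior_ratio <- (1 - mean(A)) / mean(A)                     # P_hat(A = 0) / P_hat(A = 1)
    w_iw <- p1 / (1 - p1) * prior_ratio                        # importance weight per training sample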

In the third method, based on our approach, XGBoost is fitted on the training dataset, but this time samples receive different weights depending on whether they originate from CA or PR. We utilize the occupation-code feature to construct test functions. The test function $\phi_{\ell}(D_{i})$ is defined as the indicator of whether the $i$-th sample has the $\ell$-th occupation code. The occupation code is a categorical variable representing around 500 different occupation categories; the data dictionary in the ACS PUMS documentation (https://www.census.gov/programs-surveys/acs/microdata/documentation.2018.html) provides the full list of occupation codes. The total number of test functions is therefore around 500. We then estimate the distribution-specific weights $(w_{\text{CA}},w_{\text{PR}})$ by solving

\hat{w}_{\text{CA}},\hat{w}_{\text{PR}}=\operatorname*{argmin}_{w_{\text{CA}}+w_{\text{PR}}=1}\sum_{\ell=1}^{L}\big(\hat{\mathbb{E}}^{\text{Target}}[\phi_{\ell}]-w_{\text{CA}}\hat{\mathbb{E}}^{\text{CA}}[\phi_{\ell}]-w_{\text{PR}}\hat{\mathbb{E}}^{\text{PR}}[\phi_{\ell}]\big)^{2}.

Note that $\hat{\mathbb{E}}^{\text{state}}[\phi_{\ell}]$ is the proportion of samples of the state with the $\ell$-th occupation code.
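With only two sources and the sum-to-one constraint, this least-squares problem is one-dimensional and has a closed form. A minimal R sketch, where the vectors p_target, p_ca, and p_pr of occupation-code proportions are hypothetical placeholders:

    # Closed-form two-source weights: substitute w_PR = 1 - w_CA and minimize over w_CA.
    d_ca <- p_ca - p_pr
    w_ca <- sum((p_target - p_pr) * d_ca) / sum(d_ca^2)
    w_pr <- 1 - w_ca
    c(w_ca, w_pr)
    # In the weighted ERM, each CA sample then receives weight w_ca / n_ca and
    # each PR sample weight w_pr / n_pr.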

The results for the target states of Texas (TX) and Washington (WA) are given in Figure 1. Similar patterns are observed for other target states, aligning with the behavior of either TX or WA. Results for other target states can be found in Appendix 13.

For the target TX, as we include the PR data, the Mean Squared Error (MSE) initially drops for all the methods. However, as more data from PR is added, the MSE starts to increase for both methods with equal weights and importance weights. In contrast, our method demonstrates robustness, with the MSE decreasing after including PR data and staying consistently low, even when the PR source becomes dominant in the training dataset. Turning to the target WA, for both methods with equal weights and importance weights, adding data sourced from PR results in a straightforward increase in MSE. In contrast, our method assigns weights close to 0 for samples originating from PR, preventing significant performance degradation.

5 GTEx Data

In this section, we illustrate our methods using the GTEx data. We will also run model diagnostics to evaluate model fit. The GTEx V6 dataset provides RNA-seq gene-expression levels collected from 450 donors across 44 tissues.

Treating each tissue as a separate study, we have 44 different datasets. As shown in Figure 2, some tissues exhibit higher correlations than others. To demonstrate our method, we select 5 different tissues (same as in Figure 2): Adipose subcutaneous, Adipose visceral omentum, Brain cortex, Brain frontal cortex, and Brain cerebellum. Let brain cortex be our target tissue (the third tissue in Figure 2). The code and the data are available as R package dlm at https://github.com/yujinj/dlm.

For our test functions, we randomly sample 1000 gene pairs and define a test function as the standardized product of the gene-expression levels of a pair. In total, we have 1000 test functions. Most genes are known to be uncorrelated with each other, with the 90th percentile range of correlations between two different test functions being about $[-0.06,0.06]$. We employ group Lasso to estimate the sparse inverse covariance matrix. As the estimated inverse covariance matrix is close to a diagonal matrix (details provided in Appendix 13), we do not perform a whitening transformation.

The R package dlm (https://github.com/yujinj/dlm) contains a function dlm that performs distributional linear regressions:

    dlm(formula, test.function, data, whitening)

Here, the formula parameter specifies which model we want to fit. The test.function parameter specifies the test functions to be used. The data parameter passes a list of datasets. The whitening parameter is a boolean indicating whether one wants to perform a whitening transformation. Additional details and descriptions of the function can be found on the GitHub page.

Below is the summary output of the dlm function with the model formula “Brain cortex ~ Adipose subcutaneous + Adipose visceral omentum + Brain frontal cortex + Brain cerebellum” and 1000 test functions defined as standardized products of randomly selected gene pairs. Note that the summary output of the dlm function closely resembles that of the commonly used lm function.

    Call:
    dlm(formula = Brain_Cortex ~ Adipose_Subcutaneous + Adipose_Visceral_Omentum +
            Brain_Frontal_Cortex_BA9 + Brain_Cerebellum,
        test.function = phi, data = GTEx_data, whitening = FALSE)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.56838 -0.09436 -0.00604  0.08739  0.40917

    Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
    Adipose_Subcutaneous     -0.0001295  0.0304036  -0.004   0.9966
    Adipose_Visceral_Omentum  0.0462641  0.0250888   1.844   0.0655 .
    Brain_Frontal_Cortex_BA9  0.7846112  0.0184325  42.567  < 2e-16 ***
    Brain_Cerebellum          0.1692543  0.0217128   7.795 1.61e-14 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.1377 on 997 degrees of freedom
    Multiple R-squared:  0.5088,    Adjusted R-squared:  0.5073
    F-statistic: 344.3 on 3 and 997 DF,  p-value: < 2.2e-16

The coefficients represent the estimated weights for the perturbed datasets, and they sum to one. The estimated weight for the brain-frontal-cortex dataset is close to 1: 0.7846112. On the other hand, the estimated weights for the adipose-subcutaneous and adipose-visceral-omentum datasets are close to 0: -0.0001295 and 0.0462641. This supports the observations made in Figure 2, where the brain-cortex dataset (the target) exhibits a higher correlation with the brain-frontal-cortex dataset than with the others.

Let us discuss the summary output in more detail. All tests have in common that the probability statements are with respect to distributional uncertainty, that is, the uncertainty induced by the random distributional perturbations.

  1. (Distributional t-test). The summary output provides a t-statistic and a p-value for each estimated weight. Based on the output, the estimated weight for the brain-frontal-cortex dataset is highly significant (with a t-statistic of 42.567 and a p-value less than $2\cdot 10^{-16}$). We have enough evidence to believe that the true optimal weight for the brain-frontal-cortex dataset is nonzero, and thus we conclude that the brain-frontal-cortex dataset is important for predicting the target dataset. On the other hand, the estimated weights for the adipose-subcutaneous and adipose-visceral-omentum datasets are not statistically significant (with t-statistics of -0.004 and 1.844 and p-values of 0.9966 and 0.0655). We conclude that those datasets are not as important as the brain-frontal-cortex dataset for predicting the target dataset.

  2. (Distributional F-test). The summary output provides an F-statistic and a corresponding p-value for the null hypothesis that the optimal weights are uniform ($\beta^{*}=(0.25,0.25,0.25,0.25)$ in our example), which means that the different datasets are equally informative in explaining the target dataset. The F-statistic is 344.3, which follows the F-distribution with 3 and 997 degrees of freedom under the null hypothesis. The resulting p-value is less than $2.2\cdot 10^{-16}$. Therefore, we reject the null hypothesis. From Figure 2 and the estimated weights, we can see that the brain-frontal-cortex and brain-cerebellum datasets are more informative in explaining the target dataset than the adipose-subcutaneous and adipose-visceral-omentum datasets.

  3. (R-squared). Usually, the coefficient of determination (R-squared) measures how much of the variation of a target variable is explained by the explanatory variables in a regression model. In our output, R-squared measures how much of the variance of the target dataset that is left unexplained under uniform weights is explained by our proposed weights. Specifically, it is calculated as

     R^{2}=1-\frac{\text{RSS with estimated weights}}{\text{RSS with uniform weights}}.

     R-squared is close to 0 if the estimated weights are close to uniform weights.

Finally, we should consider diagnostic plots to evaluate the appropriateness of the fitted distribution shift model. This can be done using a distributional Tukey-Anscombe plot and a distributional QQ plot. In contrast to conventional residual plots and QQ plots, each point in these plots corresponds to the mean of a test function instead of a single observation. Figure 3 shows that neither the QQ plot nor the Tukey-Anscombe plot indicates a deviation from the modeling assumptions.

Figure 3: The distributional residual plot and distributional QQ plot for the GTEx data. In contrast to conventional residual plots and QQ plots, each point corresponds to the mean of a test function instead of a single observation.

6 Discussion

In many practical settings, there is a distribution shift between the training and the target distribution. Existing distribution shift models fall into two categories. One category, which we term “sparse distribution shift”, assumes that the shift affects specific parts of the data generation process while leaving other parts invariant. Existing methods attempt to re-weight training samples to match the target distribution or to learn invariant representations of features. The second category considers worst-case distributional shifts. More specifically, these approaches consider the worst-case behavior of a statistical functional over a fixed neighborhood of the model, defined in terms of a distributional distance. This often requires knowledge of the strength and shape of the perturbations.

In contrast, we consider random dense distributional shifts, where the shift arises as the superposition of many small random changes. We have found that the random perturbation model can be useful in modeling the empirical phenomena observed in Figure 2. Moreover, we see that even when the overlap assumptions seem to be violated (so that the reweighting approach fails), or when the distribution shift appears large in Kullback-Leibler divergence, the random perturbation model may still be appropriate and useful. Under the random distributional perturbation model, we establish foundations for transfer learning and generalization.

Our method shares methodological similarities with synthetic controls. While synthetic controls are typically applied to panel data under the assumption of a linear factor model, our procedure can be applied to model distribution shifts of any type of data (discrete, continuous, ordinal) with or without time structure. Furthermore, our method justifies the linearity in synthetic control methods under random distributional shifts, rather than assuming a linear factor model. The random distributional shift assumption can be evaluated based on plots in Figure 3. Additionally, we propose a generalization of our procedure to empirical risk minimization.

In practice, perturbations may involve a combination of sparse and dense shifts. In such cases, hybrid approaches (sparse + dense) may be appropriate. We view hybrid models as an important generalization and leave it for future work.

A companion R package, dlm, is available at https://github.com/yujinj/dlm. Our package performs distributional generalization under the random perturbation model. Users can replicate the results in Section 5 with the data and code provided in the package.

7 Acknowledgments

The authors are grateful for insightful discussions with Naoki Egami, Kevin Guo, Ying Jin, Hongseok Namkoong, and Elizabeth Tipton. This work was supported by Stanford University’s Human-Centered Artificial Intelligence (HAI) Hoffman-Yee Grant, and by the Dieter Schwarz Foundation.

References

  • Abadie and Gardeazabal [2003] A. Abadie and J. Gardeazabal. The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132, 2003.
  • Abadie et al. [2007] A. Abadie, A. Diamond, and J. Hainmueller. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105:493–505, 2007.
  • Argyriou et al. [2007] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19 (NIPS), pages 41–48, 2007.
  • Athey et al. [2021] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi. Matrix completion methods for causal panel data models. Journal of the American Statistical Association, pages 1–41, 2021.
  • Bai [2009] J. Bai. Panel data models with interactive fixed effects. Econometrica, 77(4):1229–1279, 2009.
  • Baktashmotlagh et al. [2013] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
  • Bansak et al. [2024] K. Bansak, E. Paulson, and D. Rothenhäusler. Learning under random distributional shifts. International Conference on Artificial Intelligence and Statistics, 2024.
  • Bodnar and Okhrin [2008] T. Bodnar and Y. Okhrin. Properties of the singular, inverse and generalized inverse partitioned wishart distributions. Journal of Multivariate Analysis, 99(10):2389–2405, 2008.
  • Borenstein et al. [2009] M. Borenstein, L. V. Hedges, J. P. Higgins, and H. R. Rothstein. Introduction to Meta-Analysis. John Wiley & Sons, New York, NY, 2009.
  • Cheung [2014] M. W. Cheung. Modeling dependent effect sizes with three-level meta-analyses: a structural equation modeling approach. Psychological Methods, 19(2):211–229, 2014.
  • Ding et al. [2021] F. Ding, M. Hardt, J. Miller, and L. Schmidt. Retiring adult: New datasets for fair machine learning. In Advances in Neural Information Processing Systems, volume 34, pages 6478–6490, 2021.
  • Doudchenko and Imbens [2016] N. Doudchenko and G. W. Imbens. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. Technical report, National Bureau of Economic Research, 2016.
  • Duchi and Namkoong [2021] J. C. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • Dudley [2018] R. M. Dudley. Real Analysis and Probability. CRC Press, 2018.
  • Gretton et al. [2009] A. Gretton, J. Smola, M. Huang, K. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3:131–160, 2009.
  • Hartung et al. [2008] J. Hartung, G. Knapp, and B. K. Sinha. Statistical Meta-Analysis With Applications. John Wiley & Sons, New York, NY, 2008.
  • Hedges et al. [2010] L. V. Hedges, E. Tipton, and M. C. Johnson. Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1):39–65, 2010.
  • Higgins et al. [2009] J. Higgins, S. G. Thompson, and D. J. Spiegelhalter. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1):137–159, 2009.
  • Huber [1981] P. Huber. Robust Statistics. Wiley, New York, 1981.
  • Jeong and Rothenhäusler [2023] Y. Jeong and D. Rothenhäusler. Calibrated inference: statistical inference that accounts for both sampling uncertainty and distributional uncertainty. arXiv preprint arXiv:2202.11886, 2023.
  • Li et al. [2016] S. Li, S. Song, and G. Huang. Prediction reweighting for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 28(7):1682–1695, 2016.
  • Long et al. [2014] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer joint matching for unsupervised domain adaptation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.
  • Lunn et al. [2013] D. Lunn, J. Barrett, M. Sweeting, and S. Thompson. Fully bayesian hierarchical modelling in two stages, with application to meta-analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 62(4):551–572, 2013.
  • Muirhead [1982] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1982.
  • Pan and Yang [2010] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
  • Quinonero-Candela et al. [2009] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
  • Rothenhäusler and Bühlmann [2023] D. Rothenhäusler and P. Bühlmann. Distributionally robust and generalizable inference. Statistical Science, 38(4):527–542, 2023.
  • Schölkopf et al. [2021] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
  • Shen et al. [2023] J. H. Shen, I. D. Raji, and I. Y. Chen. Mo’data mo’problems: How data composition compromises scaling properties. 2023.
  • Shimodaira [2000] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Sugiyama et al. [2008] M. Sugiyama, S. Nakajima, H. Kashima, P. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 21 (NIPS), pages 1433–1440, 2008.
  • Szegedy et al. [2014] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
  • Van der Vaart [2000] A. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  • Wood [2008] A. Wood. Methodology for dealing with duplicate study effects in a meta-analysis. Organizational Research Methods, 11:79–95, 2008.
  • Xiong et al. [2023] X. Xiong, Z. Guo, and T. Cai. Distributionally robust transfer learning. arXiv preprint arXiv:2309.06534, 2023.
  • Xu [2017] Y. Xu. Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1):57–76, 2017.
  • Zhan et al. [2024] K. Zhan, X. Xiong, Z. Guo, T. Cai, and M. Liu. Transfer learning targeting mixed population: A distributional robust perspective. arXiv preprint arXiv:2407.20073, 2024.

Appendix

8 Proof of Theorem 1

8.1 Extended Theorem 1

In the following, we present the extended version of Theorem 1. Note that we get Theorem 1 in the main text by defining ϕ(D)=(ϕ1(D),,ϕK(D))\phi(D)=(\phi_{1}(D),\dots,\phi_{K}(D))^{\intercal}.

Extended Theorem 1.

Under the assumptions of Theorem 1, for any Borel measurable square-integrable function ϕ:𝒟L\phi:\mathcal{D}\xrightarrow[]{}\mathbb{R}^{L}, we have

m((1n1i=1n1ϕ(D1i)1nKi=1nKϕ(DKi))(𝔼0[ϕ(D)]𝔼0[ϕ(D)]))𝑑N(0,ΣWVar0(ϕ(D))),\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi(D)]\end{pmatrix}\right)\xrightarrow[]{d}N\left(0,\Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\right), (5)

where ΣWK×K\Sigma^{W}\in\mathbb{R}^{K\times K} is

(ΣW)ij=Cov(Wi,Wj)E[Wi]E[Wj].(\Sigma^{W})_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]}.

Here, \otimes denotes the Kronecker product. Note that \Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D)) can be written as

\left[\begin{array}{cccc}\Sigma^{W}_{11}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{12}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{1K}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\\ \Sigma^{W}_{21}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{22}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{2K}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\\ \vdots&\vdots&&\vdots\\ \Sigma^{W}_{K1}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\Sigma^{W}_{K2}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))&\cdots&\Sigma^{W}_{KK}\cdot\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\end{array}\right].

In the following, we prove Theorem 1. From the proof, we can get Extended Theorem 1 by considering

(𝔼^1[ϕ1(D)],,𝔼^1[ϕL(D)],𝔼^2[ϕ1(D)],,𝔼^2[ϕL(D)],,𝔼^K[ϕ1(D)],,𝔼^K[ϕL(D)]),(\hat{\mathbb{E}}^{1}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{1}[\phi_{L}(D)],\hat{\mathbb{E}}^{2}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{2}[\phi_{L}(D)],\dots\dots,\hat{\mathbb{E}}^{K}[\phi_{1}(D)],\dots,\hat{\mathbb{E}}^{K}[\phi_{L}(D)]),

where ϕ(D)=(ϕ1(D),,ϕL(D))L\phi(D)=(\phi_{1}(D),\dots,\phi_{L}(D))^{\intercal}\in\mathbb{R}^{L}.
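Before turning to the proof, we note that the covariance structure in (5) can be checked by simulation. The following is a minimal sketch (not part of the formal argument): it assumes a uniform base measure on [0,1], takes h to be the identity, and draws i.i.d. standard exponential bin weights, so that \Sigma^{W} is the identity matrix; the empirical covariance of the scaled dataset means should then be close to \Sigma^{W}\otimes\text{Var}_{\mathbb{P}^{0}}(\phi(D)).

```python
import numpy as np

# Minimal simulation sketch of the random perturbation model (assumptions:
# uniform base measure on [0,1], h equal to the identity, i.i.d. standard
# exponential bin weights, so that Sigma^W is the K x K identity matrix).
rng = np.random.default_rng(0)
m, K, n, reps = 100, 3, 20_000, 500              # bins, datasets, samples per dataset, repetitions
phi = lambda u: np.column_stack([u, u ** 2])     # L = 2 test functions

means = np.zeros((reps, K, 2))
for r in range(reps):
    W = rng.exponential(size=(K, m))             # weights W_j^k, i.i.d. across j and k
    for k in range(K):
        p = W[k] / W[k].sum()                    # perturbed bin probabilities
        bins = rng.choice(m, size=n, p=p)        # draw a bin for each observation
        u = (bins + rng.random(n)) / m           # draw uniformly within the chosen bin
        means[r, k] = phi(u).mean(axis=0)        # dataset mean of phi

# Covariance of sqrt(m) * (stacked dataset means) across repetitions; this
# should be close to Sigma^W (here the identity) Kronecker Var_P0(phi(D)).
flat = np.sqrt(m) * means.reshape(reps, -1)
print(np.round(np.cov(flat, rowvar=False), 2))
print(np.round(np.cov(phi(rng.random(10 ** 5)), rowvar=False), 2))  # Monte Carlo Var_P0(phi(D))
```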

8.2 Auxiliary Lemma for Theorem 1

Let us first state an auxiliary lemma that will turn out helpful for proving Theorem 1.

Lemma 2.

Let the assumptions of Theorem 1 hold. For any bounded measurable ϕ1,,ϕK:𝒟\phi_{1},\dots,\phi_{K}:\mathcal{D}\xrightarrow[]{}\mathbb{R}, we have that

\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)\xrightarrow{d}N\left(0,\Sigma^{W}\odot\text{Var}(\overrightarrow{\phi}(D))\right)

where (ΣW)ij=Cov(Wi,Wj)E[Wi]E[Wj](\Sigma^{W})_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]} and ϕ(D)=(ϕ1(D),,ϕK(D))\overrightarrow{\phi}(D)=(\phi_{1}(D),\dots,\phi_{K}(D))^{\intercal}.

Proof.

Let ψk=ϕkh\psi_{k}=\phi_{k}\circ h. Without loss of generality, assume that 𝔼0[ψk(U)]=0\mathbb{E}^{0}[\psi_{k}(U)]=0 for k=1,,Kk=1,\dots,K. Note that

m(𝔼k[ψk(U)]𝔼0[ψk(U)])=mj=1mxIjψk(x)𝑑x(WjkE[Wk])j=1mWjk/m.\sqrt{m}(\mathbb{E}^{k}[\psi_{k}(U)]-\mathbb{E}^{0}[\psi_{k}(U)])=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W^{k}_{j}-E[W^{k}])}{\sum_{j=1}^{m}W^{k}_{j}/m}.

Let

Ym,j:=m(xIjψ1(x)𝑑x(Wj1E[W1])xIjψK(x)𝑑x(WjKE[WK])).Y_{m,j}:=\sqrt{m}\begin{pmatrix}\int_{x\in I_{j}}\psi_{1}(x)dx\cdot(W^{1}_{j}-E[W^{1}])\\ \vdots\\ \int_{x\in I_{j}}\psi_{K}(x)dx\cdot(W^{K}_{j}-E[W^{K}])\end{pmatrix}.

First, note that

E[Ym,j]=0E[Y_{m,j}]=0 (6)

for all jj. As the second step, we want to show that

j=1mCov(Ym,j)Var(W)Var0(ψ(U)),\sum_{j=1}^{m}\text{Cov}(Y_{m,j})\xrightarrow[]{}\text{Var}(\overrightarrow{W})\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\psi}(U)), (7)

where W=(W1,,WK)\overrightarrow{W}=(W^{1},\dots,W^{K})^{\intercal}, ψ(U)=(ψ1(U),,ψK(U))\overrightarrow{\psi}(U)=(\psi_{1}(U),\dots,\psi_{K}(U))^{\intercal}, and \odot is an element-wise multiplication. For any fL2([0,1])f\in L^{2}([0,1]), define Πm(f)\Pi_{m}(f) as

Πm(f)(x)=j=1m(mxIjf(x)𝑑x)I(xIj).\Pi_{m}(f)(x)=\sum_{j=1}^{m}\left(m\int_{x\in I_{j}}f(x)dx\right)\cdot I(x\in I_{j}).

Then, we have

|(j=1mCov(Ym,j)Var(W)Var0(ψ(U)))kk|\displaystyle\Bigg{|}\left(\sum_{j=1}^{m}\text{Cov}(Y_{m,j})-\text{Var}(\overrightarrow{W})\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\psi}(U))\right)_{kk^{\prime}}\Bigg{|}
=|(mj=1mxIjψk(x)𝑑xxIjψk(x)𝑑xj=1mxIjψk(x)ψk(x)𝑑x)Cov(Wk,Wk)|\displaystyle\quad\quad=\Bigg{|}\left(m\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\int_{x\in I_{j}}\psi_{k^{\prime}}(x)dx-\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)\psi_{k^{\prime}}(x)dx\right)\cdot\text{Cov}(W^{k},W^{k^{\prime}})\Bigg{|}
ψk2ψkΠm(ψk)2|Cov(Wk,Wk)|0\displaystyle\quad\quad\leq||\psi_{k}||_{2}||\psi_{k^{\prime}}-\Pi_{m}(\psi_{k^{\prime}})||_{2}\cdot|\text{Cov}(W^{k},W^{k^{\prime}})|\xrightarrow[]{}0

for 1k,kK1\leq k,k^{\prime}\leq K as mm goes to infinity. This is because any bounded function can be approximated by a sequence of step functions of the form j=1mbjI(xIj)\sum_{j=1}^{m}b_{j}I(x\in I_{j}). Next we will show that for any ϵ>0\epsilon>0,

gm(ϵ)=j=1mE[Ym,j22;Ym,j2ϵ]0.g_{m}(\epsilon)=\sum_{j=1}^{m}E[||Y_{m,j}||_{2}^{2};||Y_{m,j}||_{2}\geq\epsilon]\xrightarrow[]{}0. (8)

Let ψ1,,ψKB||\psi_{1}||_{\infty},\dots,||\psi_{K}||_{\infty}\leq B. Then this is implied by the dominated convergence theorem as

j=1mE[Ym,j22;Ym,j2ϵ]\displaystyle\sum_{j=1}^{m}E[||Y_{m,j}||_{2}^{2};||Y_{m,j}||_{2}\geq\epsilon] B2E[WE[W]22I(BWE[W]2/mϵ)]0.\displaystyle\leq B^{2}E[||\overrightarrow{W}-E[\overrightarrow{W}]||_{2}^{2}I(B||\overrightarrow{W}-E[\overrightarrow{W}]||_{2}/\sqrt{m}\geq\epsilon)]\xrightarrow[]{}0. (9)

Combining equations (6), (7), and (8), we can apply the multivariate Lindeberg central limit theorem. Together with Slutsky’s theorem, we have

\displaystyle\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)=\sum_{j=1}^{m}Y_{m,j}\odot\begin{pmatrix}1/(\sum_{j=1}^{m}W_{j}^{1}/m)\\ \vdots\\ 1/(\sum_{j=1}^{m}W_{j}^{K}/m)\end{pmatrix}
𝑑N(0,ΣWVar0(ϕ(D)))\displaystyle\xrightarrow[]{d}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)

where ΣijW=Cov(Wi,Wj)E[Wi]E[Wj]\Sigma^{W}_{ij}=\frac{Cov(W^{i},W^{j})}{E[W^{i}]E[W^{j}]}. This completes the proof. ∎

8.3 Proof of Theorem 1

Proof.

For any \phi_{k}\in L^{2}(\mathbb{P}^{0}) and any given \epsilon>0, there exists a bounded function \phi_{k}^{B} such that \mathbb{E}^{0}[\phi_{k}(D)]=\mathbb{E}^{0}[\phi_{k}^{B}(D)] and ||\phi_{k}-\phi_{k}^{B}||_{L^{2}(\mathbb{P}^{0})}<\epsilon for k=1,\dots,K. Then,

\displaystyle\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi_{1}(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi_{K}(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}(D)]\end{pmatrix}\right)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad=\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi_{1}(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi_{K}(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}\right) (a)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}^{B}(D)]\end{pmatrix}\right) (b)
\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\sqrt{m}\left(\begin{pmatrix}\mathbb{E}^{1}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{K}[\phi_{K}^{B}(D)]\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{0}[\phi_{1}^{B}(D)]\\ \vdots\\ \mathbb{E}^{0}[\phi_{K}^{B}(D)]\end{pmatrix}\right). (c)

Note that for any ϵ>0\epsilon^{\prime}>0 and for k=1,,Kk=1,\dots,K,

P(m|1nki=1nkϕk(Dki)𝔼k[ϕk(D)]|>ϵ)\displaystyle P\left(\sqrt{m}\Bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{k}(D_{ki})-\mathbb{E}^{k}[\phi_{k}(D)]\Bigg{|}>\epsilon^{\prime}\right)
=E(P(|1nki=1nkϕk(Dki)𝔼k[ϕk(D)]|>ϵm | Wk))\displaystyle=E\left(P\left(\Bigg{|}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi_{k}(D_{ki})-\mathbb{E}^{k}[\phi_{k}(D)]\Bigg{|}>\frac{\epsilon^{\prime}}{\sqrt{m}}\text{ }\Bigg{|}\text{ }W_{k}\right)\right)
E(𝔼k[ϕk2(D)]ϵ2)mnk=𝔼0[ϕk2(D)]ϵ2mnk0,\displaystyle\leq E\left(\frac{\mathbb{E}^{k}[\phi_{k}^{2}(D)]}{\epsilon^{\prime 2}}\right)\cdot\frac{m}{n_{k}}=\frac{\mathbb{E}^{0}[\phi_{k}^{2}(D)]}{\epsilon^{\prime 2}}\cdot\frac{m}{n_{k}}\xrightarrow[]{}0,

as the distributional uncertainty is of higher order than the sampling uncertainty. Therefore, (a) =op(1)=o_{p}(1). Next let us investigate (b). Recall that with ψk=(ϕkϕkB)h\psi_{k}=(\phi_{k}-\phi_{k}^{B})\circ h,

m(𝔼k[ϕk(D)]𝔼k[ϕkB(D)])\displaystyle\sqrt{m}(\mathbb{E}^{k}[\phi_{k}(D)]-\mathbb{E}^{k}[\phi_{k}^{B}(D)]) =mj=1mxIjψk(x)𝑑x(WjkE[Wk])j=1mWjk/m.\displaystyle=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W_{j}^{k}-E[W^{k}])}{\sum_{j=1}^{m}W_{j}^{k}/m}.

The variance of the numerator is bounded as

Var(Wk)j=1mm(xIjψk(x)𝑑x)2\displaystyle\text{Var}(W^{k})\sum_{j=1}^{m}m\left(\int_{x\in I_{j}}\psi_{k}(x)dx\right)^{2} Var(Wk)j=1mxIjψk2(x)𝑑x\displaystyle\leq\text{Var}(W^{k})\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}^{2}(x)dx
=Var(Wk)𝔼0[ψk2(U)]\displaystyle=\text{Var}(W^{k})\mathbb{E}^{0}[\psi_{k}^{2}(U)]
<Var(Wk)ϵ2\displaystyle<\text{Var}(W^{k})\cdot\epsilon^{2}

where the first inequality holds by Jensen’s inequality with mxIj𝑑x=1m\int_{x\in I_{j}}dx=1. The denominator converges in probability to E[Wk]E[W^{k}]. Therefore, we can write

m(𝔼k[ϕk(D)]𝔼k[ϕkB(D)])=mj=1mxIjψk(x)𝑑x(WjkE[Wk])E[Wk]+sn,k\sqrt{m}(\mathbb{E}^{k}[\phi_{k}(D)]-\mathbb{E}^{k}[\phi_{k}^{B}(D)])=\frac{\sqrt{m}\sum_{j=1}^{m}\int_{x\in I_{j}}\psi_{k}(x)dx\cdot(W_{j}^{k}-E[W^{k}])}{E[W^{k}]}+s_{n,k}

where sn,ks_{n,k} is op(1)o_{p}(1). Combining results, for sn=(sn,1,,sn,K)s_{n}=(s_{n,1},\dots,s_{n,K})^{\intercal}, we have that E[(b)sn]=0E[\eqref{eq:b}-s_{n}]=0 and

VarP((b)sn)Var(W)minkE2[Wk]ϵ2.||\text{Var}_{P}(\eqref{eq:b}-s_{n})||_{\infty}\leq\frac{||\text{Var}(\overrightarrow{W})||_{\infty}}{\min_{k}E^{2}[W^{k}]}\cdot\epsilon^{2}.

From Lemma 2, we know that

(c)𝑑N(0,ΣWVar0(ϕB(D))).\eqref{eq:c}\xrightarrow[]{d}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}\left(\overrightarrow{\phi^{B}}(D)\right)\right).

Let ϕ1B2,,ϕKB2,ϕ12,,ϕK2M||\phi_{1}^{B}||_{2},\dots,||\phi_{K}^{B}||_{2},||\phi_{1}||_{2},\dots,||\phi_{K}||_{2}\leq M. Note that for any 1k,kK1\leq k,k^{\prime}\leq K,

2|𝔼0[ϕkϕk]𝔼0[ϕkBϕkB]|\displaystyle 2|\mathbb{E}^{0}[\phi_{k}\phi_{k^{\prime}}]-\mathbb{E}^{0}[\phi_{k}^{B}\phi_{k^{\prime}}^{B}]| |𝔼0[(ϕkϕkB)(ϕk+ϕkB)]|+|𝔼0[(ϕk+ϕkB)(ϕkϕkB)]|\displaystyle\leq|\mathbb{E}^{0}[(\phi_{k}-\phi_{k}^{B})(\phi_{k^{\prime}}+\phi_{k^{\prime}}^{B})]|+|\mathbb{E}^{0}[(\phi_{k}+\phi_{k}^{B})(\phi_{k^{\prime}}-\phi_{k^{\prime}}^{B})]|
2M(ϕkϕkB2+ϕkϕkB2)4Mϵ.\displaystyle\leq 2M(||\phi_{k}-\phi_{k}^{B}||_{2}+||\phi_{k^{\prime}}-\phi_{k^{\prime}}^{B}||_{2})\leq 4M\cdot\epsilon.

Therefore,

Var0(ϕ(D))Var0(ϕB(D))2Mϵ.\Big{|}\Big{|}\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))-\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi^{B}}(D))\Big{|}\Big{|}_{\infty}\leq 2M\cdot\epsilon.

Combining results, we have

(a)+(b)+(c)\displaystyle\eqref{eq:a}+\eqref{eq:b}+\eqref{eq:c} =dN(0,ΣWVar0(ϕ(D)))+op(1)\displaystyle\stackrel{{\scriptstyle d}}{{=}}N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)+o_{p}(1)
+((b)sn)+N(0,ΣW(Var0(ϕ(D))Var0(ϕB(D))))\displaystyle+(\eqref{eq:b}-s_{n})+N\left(0,\Sigma^{W}\odot\left(\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))-\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi^{B}}(D))\right)\right)
=N(0,ΣWVar0(ϕ(D)))+Op(ϵ)+op(1).\displaystyle=N\left(0,\Sigma^{W}\odot\text{Var}_{\mathbb{P}^{0}}(\overrightarrow{\phi}(D))\right)+O_{p}(\epsilon)+o_{p}(1).

Note that all the results hold for arbitrary ϵ>0\epsilon>0. This completes the proof. ∎

8.4 Random target distribution

From Extended Theorem 1 with K+1 datasets and \Sigma^{W}\in\mathbb{R}^{(K+1)\times(K+1)}, and using that the sampling uncertainty is of lower order than the distributional uncertainty, for any Borel measurable square-integrable function \phi:\mathcal{D}\xrightarrow[]{}\mathbb{R}^{L}, we have

m((1n1i=1n1ϕ(D1i)1nKi=1nKϕ(DKi))(𝔼K+1[ϕ(D)]𝔼K+1[ϕ(D)]))𝑑N(0,(AΣWA)Var0(ϕ(D))),\sqrt{m}\left(\begin{pmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi(D_{1i})\\ \vdots\\ \frac{1}{n_{K}}\sum_{i=1}^{n_{K}}\phi(D_{Ki})\end{pmatrix}-\begin{pmatrix}\mathbb{E}^{K+1}[\phi(D)]\\ \vdots\\ \mathbb{E}^{K+1}[\phi(D)]\end{pmatrix}\right)\xrightarrow[]{d}N\left(0,(A\Sigma^{W}A^{\intercal})\otimes\text{Var}_{\mathbb{P}^{0}}({\phi}(D))\right), (10)

where

A=(100101010011)K×(K+1).A=\begin{pmatrix}1&0&\dots&0&-1\\ 0&1&\dots&0&-1\\ \vdots&\vdots&&\vdots&\vdots\\ 0&0&\dots&1&-1\end{pmatrix}\in\mathbb{R}^{K\times(K+1)}.

Therefore, with K+1\mathbb{P}^{K+1} as a new target distribution, we still have Extended Theorem 1 but with a different distributional covariance matrix AΣWAA\Sigma^{W}A^{\intercal}.

8.5 Proof of Remark 3

Note that Var^0(ϕ(D))\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) is an empirical variance of ϕ\phi on the pooled donor data {(Dki)i=1nk}k=1K\{(D_{ki})_{i=1}^{n_{k}}\}_{k=1}^{K} and can be written as

Var^0(ϕ(D))\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) =k=1Knkn1nki=1nkϕ(Dki)ϕ(Dki)\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\phi(D_{ki})^{\intercal}
(k=1Knkn1nki=1nkϕ(Dki))(k=1Knkn1nki=1nkϕ(Dki)).\displaystyle-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki})\right)^{\intercal}.

where n=n1++nKn=n_{1}+\dots+n_{K}. As ϕ\phi has a finite fourth moment, by Theorem 1, we have that

Var^0(ϕ(D))\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D)) =k=1Knkn(𝔼0[ϕ(D)ϕ(D)]+op(1))\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)\phi(D)^{\intercal}]+o_{p}(1)\right)
(k=1Knkn(𝔼0[ϕ(D)]+op(1)))(k=1Knkn(𝔼0[ϕ(D)]+op(1)))\displaystyle-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)]+o_{p}(1)\right)\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\left(\mathbb{E}^{0}[\phi(D)]+o_{p}(1)\right)\right)^{\intercal}
=Var0(ϕ(D))+op(1).\displaystyle=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1).
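For concreteness, the pooled variance estimator used above can be computed as in the following sketch (the function name and inputs are hypothetical; phi_values[k] is the n_k x L matrix of \phi evaluated on the k-th donor dataset).

```python
import numpy as np

def pooled_variance(phi_values):
    # phi_values: list of n_k x L arrays, one per donor dataset (hypothetical input).
    pooled = np.vstack(phi_values)                   # pooling reproduces sum_k (n_k / n) * (1 / n_k) sum_i
    second_moment = pooled.T @ pooled / len(pooled)  # pooled mean of phi(D) phi(D)^T
    mean = pooled.mean(axis=0, keepdims=True)        # pooled mean of phi(D)
    return second_moment - mean.T @ mean             # empirical Var_P0(phi(D)) as in Remark 3
```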

9 Special Cases of the Random Perturbation Model

We present special cases of the random perturbation model to help develop some intuition about potential applications.

Example 1 (Random shift across time).

Consider the case where W_{j}^{k}=W_{j}^{k-1}+\epsilon_{k}, where the \epsilon_{k} have mean zero and are uncorrelated for k=1,\ldots,K. This can be used to model random distributional shifts across time. For example, at some time point, we might see a (randomly) higher or lower percentage of older patients at a hospital. Similarly, one could also posit an auto-regressive model W_{j}^{k}=\frac{1}{2}W_{j}^{k-1}+\frac{1}{2}W_{j}^{k-2}+\epsilon_{k}. Note that these auto-regressive models are defined on the weights of the distribution, not on the data itself. This allows us to model random distributional shifts for different data modalities, such as discrete, ordinal, or continuous sample spaces, within a single framework.
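To make the induced similarity structure concrete, assume additionally (an assumption beyond the example as stated) that the \epsilon_{k} are uncorrelated with the initial weights and have common variance \sigma_{\epsilon}^{2}. Writing W_{j}^{k}=W_{j}^{1}+\sum_{l=2}^{k}\epsilon_{l}, we get E[W^{k}]=E[W^{1}] and \text{Cov}(W^{i},W^{j})=\text{Var}(W^{1})+(\min(i,j)-1)\sigma_{\epsilon}^{2}, so that

(\Sigma^{W})_{ij}=\frac{\text{Var}(W^{1})+(\min(i,j)-1)\,\sigma_{\epsilon}^{2}}{E[W^{1}]^{2}}.

In particular, later time periods are perturbed more strongly relative to \mathbb{P}^{0}, and the temporal dependence between datasets is captured entirely by \Sigma^{W}.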

Example 2 (Independent shifts and meta analysis).

If the distributional perturbations WjkW_{j}^{k} are i.i.d. across k=1,,Kk=1,\ldots,K, then ΣW\Sigma^{W} is a diagonal matrix. Thus, for any ϕL2(0)\phi\in L^{2}(\mathbb{P}^{0}), θ^k:=1nki=1nkϕ(Dki)\hat{\theta}_{k}:=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\phi(D_{ki}) are asymptotically Gaussian and uncorrelated across k=1,,Kk=1,\ldots,K. This implies a meta-analytic model

θ^ki.i.d.N(θ0,τ2),\hat{\theta}_{k}\stackrel{{\scriptstyle\text{i.i.d.}}}{{\approx}}N(\theta^{0},\tau^{2}), (11)

where \tau^{2}=\frac{\Sigma_{kk}^{W}}{m}\text{Var}_{\mathbb{P}^{0}}(\phi) and \theta^{0}:=\mathbb{E}^{0}[\phi(D)]. Thus, in this special case, the random perturbation model justifies a random-effects meta-analysis.
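Under (11), a natural pooled estimate of \theta^{0} and an accompanying interval can be formed directly from the K dataset-level estimates. The following is a minimal sketch of this pooling step (the function name and inputs are hypothetical), assuming equal distributional variances across datasets as in this example.

```python
import numpy as np
from scipy import stats

def pool_estimates(theta_hat, alpha=0.05):
    # theta_hat: length-K array of dataset-level estimates, treated as
    # approximately i.i.d. N(theta^0, tau^2) as in (11).
    K = len(theta_hat)
    center = np.mean(theta_hat)                    # estimate of theta^0
    tau2 = np.var(theta_hat, ddof=1)               # estimate of tau^2
    half = stats.t.ppf(1 - alpha / 2, df=K - 1) * np.sqrt(tau2 / K)
    return center, (center - half, center + half)  # point estimate and interval
```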

Example 3 (Random shift between locations).

In general, the meta-analytic model might not hold since the distributional perturbations might not be independent or identically distributed across k=1,,Kk=1,\ldots,K. As an example, data from California might be similar to data from Washington, but very different from data from Pennsylvania. In the synthetic control literature, it is popular to model data distributions from different locations as (approximate) mixtures of each other. For example, data from California might be well approximated by a mixture of Washington and Nevada. In our framework, this can be expressed as

WCalifornia=.7WWashington+.3WNevada+ϵ,W^{\text{California}}=.7\cdot W^{\text{Washington}}+.3\cdot W^{\text{Nevada}}+\epsilon, (12)

where ϵ\epsilon is mean-zero noise.
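Heuristically, the leading-order deviation of \mathbb{E}^{k}[\phi(D)] from \mathbb{E}^{0}[\phi(D)] is linear in the weights (see the proof of Theorem 1). Assuming, for simplicity, that the weights for the three states have a common mean and that \epsilon has mean zero, relation (12) therefore suggests that, up to lower-order terms,

\mathbb{E}^{\text{California}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\approx.7\cdot\left(\mathbb{E}^{\text{Washington}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+.3\cdot\left(\mathbb{E}^{\text{Nevada}}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+\text{noise},

where the noise term has mean zero. This informal calculation is exactly the linear structure across shifted distributions that the weighting estimator \hat{\beta} exploits; we state it as motivation rather than as a formal consequence of (12).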

10 Proof of Lemma 1

Proof.

The first part of the proof follows Van der Vaart [2000], Theorem 5.41, with

Ψn(θ)\displaystyle\Psi_{n}(\theta) =k=1Kβk𝔼^k[θ(θ,D)]=k=1Kβk1nki=1nkθ(θ,Dki),\displaystyle=\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\theta,D)]=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}\mathcal{L}(\theta,D_{ki}),
Ψ˙n(θ)\displaystyle\dot{\Psi}_{n}(\theta) =k=1Kβk1nki=1nkθ2(θ,Dki).\displaystyle=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{2}\mathcal{L}(\theta,D_{ki}).

By Taylor’s theorem there exists a (random vector) θ~\tilde{\theta} on the line segment between θ0\theta^{0} and θ^β\hat{\theta}^{\beta} such that

0=\Psi_{n}(\hat{\theta}^{\beta})=\Psi_{n}(\theta^{0})+\dot{\Psi}_{n}(\theta^{0})(\hat{\theta}^{\beta}-\theta^{0})+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\ddot{\Psi}_{n}(\tilde{\theta})(\hat{\theta}^{\beta}-\theta^{0}).

By Extended Theorem 1 with ϕ(D)=θ2(θ0,D)\phi(D)=\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D), we have

Ψ˙n(θ0)\displaystyle\dot{\Psi}_{n}(\theta^{0}) =k=1Kβk1nki=1nkθ2(θ0,Dki)\displaystyle=\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D_{ki})
=k=1Kβk𝔼0[θ2(θ0,D)]+oP(1)=𝔼0[θ2(θ0,D)]+oP(1).\displaystyle=\sum_{k=1}^{K}\beta_{k}\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)=\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1). (13)

By assumption, there exists a ball BB around θ0\theta^{0} such that θθ3(θ,D)\theta\xrightarrow[]{}\partial_{\theta}^{3}\mathcal{L}(\theta,D) is dominated by a fixed function h()h(\cdot) for every θB\theta\in B. The probability of the event {θ^βB}\{\hat{\theta}^{\beta}\in B\} tends to 1. On this event,

Ψ¨n(θ~)=k=1Kβk1nki=1nkθ3(θ~,Dki)k=1K|βk|1nki=1nkh(Dki).\|\ddot{\Psi}_{n}(\tilde{\theta})\|=\Big{|}\Big{|}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\partial_{\theta}^{3}\mathcal{L}(\tilde{\theta},D_{ki})\Big{|}\Big{|}\leq\sum_{k=1}^{K}|\beta_{k}|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki}).

Using Extended Theorem 1 with ϕ(D)=h(D)\phi(D)=h(D), we get

Ψ¨n(θ~)k=1K|βk|1nki=1nkh(Dki)=OP(1).\|\ddot{\Psi}_{n}(\tilde{\theta})\|\leq\sum_{k=1}^{K}|\beta_{k}|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki})=O_{P}(1). (14)

Combining (13) and (14) gives us

Ψn(θ0)\displaystyle-\Psi_{n}(\theta^{0}) =(𝔼0[θ2(θ0,D)]+oP(1)+12(θ^βθ0) OP(1))(θ^βθ0)\displaystyle=\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})\text{ }O_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0})
=(𝔼0[θ2(θ0,D)]+oP(1))(θ^βθ0).\displaystyle=\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0}).

The probability that the matrix \mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1) is invertible tends to 1. Multiplying the preceding equation by \sqrt{m} and applying (\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1))^{-1} to both sides gives us that

m(θ^βθ0)=m(k=1Kβk𝔼^k[ϕ(D)]𝔼0[ϕ(D)])+oP(1),\sqrt{m}\left(\hat{\theta}^{\beta}-\theta^{0}\right)=\sqrt{m}\left(\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+o_{P}(1),

where ϕ(D)=𝔼0[θ2(θ0,D)]1θ(θ0,D)\phi(D)=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},D). By Extended Theorem 1, we have

m(θ^βθ0)𝑑N(0,βΣWβ Var0(ϕ(D)))\sqrt{m}\left(\hat{\theta}^{\beta}-\theta^{0}\right)\xrightarrow[]{d}N(0,\beta^{\intercal}\Sigma^{W}\beta\text{ }\cdot\text{Var}_{\mathbb{P}^{0}}(\phi(D))) (15)

Now let Φ(θ)=𝔼0[θ(θ,D)],Φ˙(θ)=𝔼0[θ2(θ,D)],Φ¨(θ)=𝔼0[θ3(θ,D)]\Phi(\theta)=\mathbb{E}^{0}[\partial_{\theta}\mathcal{L}(\theta,D)],\dot{\Phi}(\theta)=\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta,D)],\ddot{\Phi}(\theta)=\mathbb{E}^{0}[\partial_{\theta}^{3}\mathcal{L}(\theta,D)]. By Taylor’s theorem, there exist θ~,θ~~\tilde{\theta},\tilde{\tilde{\theta}}, on the line segment between θ0\theta^{0} and θ^β\hat{\theta}^{\beta} such that

𝔼0[(θ^β,D)]𝔼0[(θ0,D)]\displaystyle\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},D)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},D)] =(θ^βθ0)Φ(θ0)+12(θ^βθ0)Φ˙(θ~)(θ^βθ0)\displaystyle=(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}{\Phi}(\theta^{0})+\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\dot{\Phi}(\tilde{\theta})(\hat{\theta}^{\beta}-\theta^{0})
=12(θ^βθ0)(Φ˙(θ0)+Φ¨(θ~~)(θ~θ0))(θ^βθ0)\displaystyle=\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\left(\dot{\Phi}(\theta^{0})+\ddot{\Phi}(\tilde{\tilde{\theta}})(\tilde{\theta}-\theta^{0})\right)(\hat{\theta}^{\beta}-\theta^{0})
=12(θ^βθ0)(Φ˙(θ0)+oP(1))(θ^βθ0),\displaystyle=\frac{1}{2}(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\left(\dot{\Phi}(\theta^{0})+o_{P}(1)\right)(\hat{\theta}^{\beta}-\theta^{0}),

as Φ(θ0)=0{\Phi}(\theta^{0})=0 and Φ¨(θ~~)𝔼0[h(D)]\ddot{\Phi}(\tilde{\tilde{\theta}})\leq\mathbb{E}^{0}[h(D)] with probability approaching 1. Then the rescaled excess risk can be written as

m(𝔼0[(θ^β,D)]𝔼0[(θ0,D)])=m(θ^βθ0)𝔼0[θ2(θ0,D)](θ^βθ0)+oP(1).m\cdot\left(\mathbb{E}^{0}[\mathcal{L}(\hat{\theta}^{\beta},D)]-\mathbb{E}^{0}[\mathcal{L}(\theta^{0},D)]\right)=m\cdot(\hat{\theta}^{\beta}-\theta^{0})^{\intercal}\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)](\hat{\theta}^{\beta}-\theta^{0})+o_{P}(1). (16)

Using (15), (16) asymptotically follows a distribution with mean

μβ=βΣWβTrace(𝔼0[θ2(θ0,D)]1Var0(θ(θ0,D)))\mu^{\beta}=\beta^{\intercal}\Sigma^{W}\beta\cdot\mathrm{Trace}(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\mathrm{Var}_{\mathbb{P}^{0}}(\partial_{\theta}\mathcal{L}(\theta^{0},D)))

This completes the proof.

In the following, we state regularity conditions that ensure the consistency of the M-estimator \hat{\theta}^{\beta}.

Lemma 3 (Consistency of M-estimators).

Consider the M-estimator

θ^β=argminθΩk=1Kβk𝔼^k[(θ,D)]\hat{\theta}^{\beta}=\arg\min_{\theta\in\Omega}\sum_{k=1}^{K}\beta_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,D)]

and the target θ0=argminθΩ𝔼0[(θ,D)]\theta^{0}=\arg\min_{\theta\in\Omega}\mathbb{E}^{0}[\mathcal{L}(\theta,D)], where Ω\Omega is a compact subset of d\mathbb{R}^{d}. Assume that θ(θ,D)\theta\xrightarrow[]{}\mathcal{L}(\theta,D) is continuous and that infθθ2δ(θ,D)\inf_{||\theta-\theta^{\prime}||_{2}\leq\delta}\mathcal{L}(\theta,D) is square-integrable under 0\mathbb{P}^{0} for every δ\delta and θ\theta^{\prime} and that infθΩ(θ,D)\inf_{\theta\in\Omega}\mathcal{L}(\theta,D) is square-integrable. We assume that 𝔼0[(θ,D)]\mathbb{E}^{0}[\mathcal{L}(\theta,D)] has a unique minimum. Then, θ^βθ0=oP(1)\hat{\theta}^{\beta}-\theta^{0}=o_{P}(1).

Proof.

The proof follows Van der Vaart [2000], Theorem 5.14 with mθ(D)=(θ,D)m_{\theta}(D)=-\mathcal{L}(\theta,D).

Fix some \theta and let U_{\ell}\downarrow\theta be a decreasing sequence of open balls around \theta with diameter converging to zero. Write m_{U}(D) for \sup_{\theta\in U}m_{\theta}(D). The sequence m_{U_{\ell}} is decreasing and greater than m_{\theta} for every \ell. As \theta\mapsto m_{\theta}(D) is continuous, by the monotone convergence theorem, we have \mathbb{E}^{0}[m_{U_{\ell}}(D)]\downarrow\mathbb{E}^{0}[m_{\theta}(D)].

For \theta\neq\theta^{0}, we have \mathbb{E}^{0}[m_{\theta}(D)]<\mathbb{E}^{0}[m_{\theta^{0}}(D)]. Combining this with the preceding paragraph, for every \theta\neq\theta^{0} there exists an open ball U_{\theta} around \theta with \mathbb{E}^{0}[m_{U_{\theta}}(D)]<\mathbb{E}^{0}[m_{\theta^{0}}(D)]. For any given \epsilon>0, define the set B=\{\theta\in\Omega:||\theta-\theta^{0}||_{2}\geq\epsilon\}. The set B is compact and is covered by the balls \{U_{\theta}:\theta\in B\}. Let U_{\theta_{1}},\dots,U_{\theta_{p}} be a finite sub-cover. Then,

supθBk=1Kβk1nki=1nkmθ(Dki)\displaystyle\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki}) supj=1,,pk=1Kβk1nki=1nkmUθj(Dki)\displaystyle\leq\sup_{j=1,\dots,p}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{U_{\theta_{j}}}(D_{ki})
=supj=1,,p𝔼0[mUθj(D)]+oP(1)<𝔼0[mθ0(D)]+oP(1),\displaystyle=\sup_{j=1,\dots,p}\mathbb{E}^{0}[m_{U_{\theta_{j}}}(D)]+o_{P}(1)<\mathbb{E}^{0}[m_{\theta^{0}}(D)]+o_{P}(1), (17)

where for the equality we apply Theorem 1 with ϕ(D)=mUθj(D)\phi(D)=m_{U_{\theta_{j}}}(D) for all j=1,,pj=1,\ldots,p.

If θ^βB\hat{\theta}^{\beta}\in B, then

supθBk=1Kβk1nki=1nkmθ(Dki)k=1Kβk1nki=1nkmθ^β(Dki)k=1Kβk1nki=1nkmθ0(Dki)oP(1),\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki})\geq\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\hat{\theta}^{\beta}}(D_{ki})\geq\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta^{0}}(D_{ki})-o_{P}(1),

where the last inequality comes from the definition of θ^β\hat{\theta}^{\beta}. Using Theorem 1 with ϕ(D)=mθ0(D)\phi(D)=m_{\theta^{0}}(D), we have

k=1Kβk1nki=1nkmθ0(Dki)oP(1)=𝔼0[mθ0(D)]oP(1).\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta^{0}}(D_{ki})-o_{P}(1)=\mathbb{E}^{0}[m_{\theta^{0}}(D)]-o_{P}(1).

Therefore,

{θ^βB}{supθBk=1Kβk1nki=1nkmθ(Dki)𝔼0[mθ0(D)]oP(1)}.\{\hat{\theta}^{\beta}\in B\}\subset\Bigg{\{}\sup_{\theta\in B}\sum_{k=1}^{K}\beta_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}m_{\theta}(D_{ki})\geq\mathbb{E}^{0}[m_{\theta^{0}}(D)]-o_{P}(1)\Bigg{\}}.

By equation (17), the probability of the event on the right-hand side converges to zero as n\xrightarrow[]{}\infty. This completes the proof. ∎

11 Proof of Theorem 2

Define the feature matrix before reparametrization as

Φ=(𝔼^1[ϕ1]𝔼^0[ϕ1]𝔼^K[ϕ1]𝔼^0[ϕ1]𝔼^1[ϕL]𝔼^0[ϕL]𝔼^K[ϕL]𝔼^0[ϕL])L×K,\Phi=\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{1}]-\hat{\mathbb{E}}^{0}[\phi_{1}]&\dots&\hat{\mathbb{E}}^{K}[\phi_{1}]-\hat{\mathbb{E}}^{0}[\phi_{1}]\\ \vdots&&\vdots\\ \hat{\mathbb{E}}^{1}[\phi_{L}]-\hat{\mathbb{E}}^{0}[\phi_{L}]&\dots&\hat{\mathbb{E}}^{K}[\phi_{L}]-\hat{\mathbb{E}}^{0}[\phi_{L}]\end{pmatrix}\in\mathbb{R}^{L\times K},

where the \ell-th row is defined as Φ\Phi_{\ell}.

By Theorem 1, and using that the sampling uncertainty is of lower order than the distributional uncertainty, we have

mΦ:=m(𝔼^1[ϕ]𝔼^0[ϕ]𝔼^K[ϕ]𝔼^0[ϕ])=d𝒁+op(1)\sqrt{m}\Phi_{\ell}:=\sqrt{m}\begin{pmatrix}\hat{\mathbb{E}}^{1}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}]\\ \vdots\\ \hat{\mathbb{E}}^{K}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}]\end{pmatrix}\stackrel{{\scriptstyle d}}{{=}}\bm{Z}_{\ell}+o_{p}(1)

where \bm{Z}_{1},\dots,\bm{Z}_{L}\in\mathbb{R}^{K} are i.i.d. Gaussian random vectors following N(0,\Sigma^{W}). Let \bm{Z}\in\mathbb{R}^{L\times K} be the matrix whose \ell-th row is \bm{Z}_{\ell}.

Note that

β^=argminβ:β1=1β(=1LΦΦ)β=argminβ:β1=1β𝚽𝚽β.\hat{\beta}=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\left(\sum_{\ell=1}^{L}\Phi_{\ell}\Phi_{\ell}^{\intercal}\right)\beta=\arg\min_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\bm{\Phi}^{\intercal}\bm{\Phi}\beta.

The closed form of \hat{\beta} is given by

β^=(𝚽𝚽)1𝟏𝟏(𝚽𝚽)1𝟏=d(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1𝟏+op(1).\hat{\beta}=\frac{(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\stackrel{{\scriptstyle d}}{{=}}\frac{(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}+o_{p}(1).

We write β^Z=(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1𝟏\hat{\beta}_{Z}=\frac{(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}{\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}}. Moreover, the closed form of β\beta^{*} is

β=argminβ:β1=1βΣWβ=(ΣW)1𝟏𝟏(ΣW)1𝟏.\beta^{*}=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\Sigma^{W}\beta=\frac{(\Sigma^{W})^{-1}\bm{1}}{\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}}.
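As an aside, the closed-form weights above are straightforward to compute. The following is a minimal sketch (hypothetical input; Phi is the L x K matrix of differences \hat{\mathbb{E}}^{k}[\phi_{\ell}]-\hat{\mathbb{E}}^{0}[\phi_{\ell}], and L\geq K is assumed so that \bm{\Phi}^{\intercal}\bm{\Phi} is invertible).

```python
import numpy as np

def synthetic_weights(Phi):
    # Phi: L x K matrix with entries E_hat^k[phi_l] - E_hat^0[phi_l] (hypothetical input).
    G = Phi.T @ Phi                # K x K Gram matrix Phi^T Phi
    ones = np.ones(G.shape[0])
    w = np.linalg.solve(G, ones)   # (Phi^T Phi)^{-1} 1
    return w / (ones @ w)          # beta_hat, normalized so the weights sum to one
```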

Let QQ be a p×(K1)p\times(K-1) non-random matrix with rank(Q)=pK1\text{rank}(Q)=p\leq K-1. Let Q~=(Q,𝟏)(p+1)×K\tilde{Q}=(Q^{\prime\intercal},\bm{1})^{\intercal}\in\mathbb{R}^{(p+1)\times K} where

Q=Q(𝑰(K1)×(K1)𝟎(K1)×1)p×K.Q^{\prime}=Q\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}\in\mathbb{R}^{p\times K}.

Note that the rank(Q~)=p+1K\text{rank}(\tilde{Q})=p+1\leq K. Define S~:=Q~(𝒁𝒁)1Q~\tilde{S}:=\tilde{Q}(\bm{Z}^{\intercal}\bm{Z})^{-1}\tilde{Q}^{\intercal} where

S~=(S~11S~12S~21S~22)=(Q(𝒁𝒁)1QQ(𝒁𝒁)1𝟏𝟏(𝒁𝒁)1Q𝟏(𝒁𝒁)1𝟏.)\tilde{S}=\begin{pmatrix}\tilde{S}_{11}&\tilde{S}_{12}\\ \tilde{S}_{21}&\tilde{S}_{22}\end{pmatrix}=\begin{pmatrix}Q^{\prime}(\bm{Z}^{\intercal}\bm{Z})^{-1}Q^{\prime\intercal}&Q^{\prime}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}\\ \bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}Q^{\prime\intercal}&\bm{1}^{\intercal}(\bm{Z}^{\intercal}\bm{Z})^{-1}\bm{1}.\end{pmatrix}

Similarly, let P~:=Q~(ΣW)1Q~\tilde{P}:=\tilde{Q}(\Sigma^{W})^{-1}\tilde{Q}^{\intercal} where P~11=Q(ΣW)1Q,P~12=Q(ΣW)1𝟏,\tilde{P}_{11}=Q^{\prime}(\Sigma^{W})^{-1}Q^{\prime\intercal},\tilde{P}_{12}=Q^{\prime}(\Sigma^{W})^{-1}\bm{1}, and P~22=𝟏(ΣW)1𝟏\tilde{P}_{22}=\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}.

Results from Multivariate Statistical Theory.

Note that \bm{Z}^{\intercal}\bm{Z}\sim\text{Wish}_{K}(L;\Sigma^{W}), the Wishart distribution with L degrees of freedom and scale matrix \Sigma^{W}. By Theorems 3.2.11 and 3.6 in Muirhead [1982], we have

S~Wp+11(LK+2p+3;P~).\tilde{S}\sim W_{p+1}^{-1}(L-K+2p+3;\tilde{P}).

which is the inverse-Wishart distribution with L-K+2p+3 degrees of freedom and scale matrix \tilde{P}. By Theorem 3 (b) and (d) of Bodnar and Okhrin [2008], we have that

S~12S~221|S~11.2N(P~12P~221,S~11.2P~221).\tilde{S}_{12}\tilde{S}_{22}^{-1}|\tilde{S}_{11.2}\sim N(\tilde{P}_{12}\tilde{P}_{22}^{-1},\tilde{S}_{11.2}\otimes\tilde{P}_{22}^{-1}).

Therefore, we get

P~221/2S~11.21/2(S~12S~221P~12P~221)|S~11.2N(0,𝑰).\tilde{P}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(\tilde{S}_{12}\tilde{S}_{22}^{-1}-\tilde{P}_{12}\tilde{P}_{22}^{-1}\right)|\tilde{S}_{11.2}\sim N(0,\bm{I}).

The right hand side does not depend on S~11.2\tilde{S}_{11.2} anymore. Therefore, we have

P~221/2S~11.21/2(Q(β^Z)1:(K1)Qβ1:(K1))N(𝟎,𝑰).\tilde{P}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(Q(\hat{\beta}_{Z})_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right)\sim N(\bm{0},\bm{I}). (18)

By Theorem 3.2.12 in Muirhead [1982], we have

P~22S~22=𝟏(ΣW)1𝟏𝟏(ZZ)1𝟏χ2(LK+1).\frac{\tilde{P}_{22}}{\tilde{S}_{22}}=\frac{\bm{1}^{\intercal}(\Sigma^{W})^{-1}\bm{1}}{\bm{1}^{\intercal}(Z^{\intercal}Z)^{-1}\bm{1}}\sim\chi^{2}(L-K+1).

By Theorem 3 (e) of Bodnar and Okhrin [2008], S~22\tilde{S}_{22} is independent of S~12S~221\tilde{S}_{12}\tilde{S}_{22}^{-1} and S~11.2\tilde{S}_{11.2}. Hence, we get

LK+1S~221/2S~11.21/2(Q(β^Z)1:(K1)Qβ1:(K1))tp(LK+1;𝟎,𝑰),\sqrt{L-K+1}\cdot\tilde{S}_{22}^{1/2}\tilde{S}_{11.2}^{-1/2}\left(Q(\hat{\beta}_{Z})_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right)\sim t_{p}(L-K+1;\bm{0},\bm{I}),

where t_{p}(L-K+1;\bm{0},\bm{I}) is the p-dimensional (centered) t-distribution with L-K+1 degrees of freedom. Now, using \Phi=\bm{Z}/\sqrt{m}+o_{p}(1/\sqrt{m}) from Theorem 1, we get

LK+1𝟏(𝚽𝚽)1𝟏[QRQ]1/2(Qβ^1:(K1)Qβ1:(K1))\displaystyle\sqrt{L-K+1}\sqrt{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\cdot\left[QRQ^{\intercal}\right]^{-1/2}\left(Q\hat{\beta}_{1:(K-1)}-Q{\beta}^{*}_{1:(K-1)}\right) (19)
=dtp(LK+1;𝟎,𝑰)+op(1).\displaystyle\stackrel{{\scriptstyle d}}{{=}}t_{p}(L-K+1;\bm{0},\bm{I})+o_{p}(1). (20)

where

R=(𝑰(K1)×(K1)𝟎(K1)×1)((𝚽𝚽)1(𝚽𝚽)1𝟏𝟏(𝚽𝚽)1𝟏(𝚽𝚽)1𝟏)(𝑰(K1)×(K1)𝟎(K1)×1).R=\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}\left((\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}-\frac{(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}}{\bm{1}^{\intercal}(\bm{\Phi}^{\intercal}\bm{\Phi})^{-1}\bm{1}}\right)\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&\bm{0}_{(K-1)\times 1}\end{pmatrix}^{\intercal}.

Connection with Reparametrized Linear Regression.

In this part, we will show that

(𝚽~𝚽~)1=R.(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1}=R. (21)

Using the closed form of β^\hat{\beta}, we have that

(LK+1)σ^2=β^ΦΦβ^=1𝟏(ΦΦ)1𝟏.(L-K+1)\cdot\hat{\sigma}^{2}=\hat{\beta}^{\intercal}\Phi^{\intercal}\Phi\hat{\beta}=\frac{1}{\bm{1}^{\intercal}(\Phi^{\intercal}\Phi)^{-1}\bm{1}}. (22)

Then combining (20), (21), and (22), we have

(Q(Φ~Φ~)1σ^2Q)12(Qβ^1:(K1)Qβ1:(K1))=dtp(LK+1;𝟎,𝑰)+op(1).\left(Q(\tilde{\Phi}^{\intercal}\tilde{\Phi})^{-1}\hat{\sigma}^{2}Q^{\intercal}\right)^{-\frac{1}{2}}\left(Q\hat{\beta}_{1:(K-1)}-Q\beta^{*}_{1:(K-1)}\right)\stackrel{{\scriptstyle d}}{{=}}t_{p}(L-K+1;\bm{0},\bm{I})+o_{p}(1).

Note that this is a more general version of Theorem 2: by letting Q be the (K-1)\times(K-1) identity matrix, we recover Theorem 2. For simplicity of notation, let V=\bm{\Phi}^{\intercal}\bm{\Phi}. Note that

𝚽~𝚽~\displaystyle\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}} =(𝑰(K1)×(K1)𝟏)(V11V12V12V22)(𝑰(K1)×(K1)𝟏)\displaystyle=\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&-\bm{1}\end{pmatrix}\begin{pmatrix}V_{11}&V_{12}\\ V_{12}^{\intercal}&V_{22}\end{pmatrix}\begin{pmatrix}\bm{I}_{(K-1)\times(K-1)}&-\bm{1}\end{pmatrix}^{\intercal}
=V11𝟏V12V12𝟏+𝟏V22𝟏\displaystyle=V_{11}-\bm{1}V_{12}^{\intercal}-V_{12}\bm{1}^{\intercal}+\bm{1}V_{22}\bm{1}^{\intercal}

where V_{22}\in\mathbb{R} and \bm{1}\in\mathbb{R}^{(K-1)\times 1}. Then, by applying the Sherman-Morrison formula, we have

(𝚽~𝚽~)1\displaystyle(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1} =(V11.2+V12V221V12𝟏V12V12𝟏+𝟏V22𝟏)1\displaystyle=\left(V_{11.2}+V_{12}V_{22}^{-1}V_{12}^{\intercal}-\bm{1}V_{12}^{\intercal}-V_{12}\bm{1}^{\intercal}+\bm{1}V_{22}\bm{1}^{\intercal}\right)^{-1}
=(V11.2+(V12V221/2V221/2𝟏)(V12V221/2V221/2𝟏))1\displaystyle=\left(V_{11.2}+\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}\right)^{-1}
=V11.21V11.21(V12V221/2V221/2𝟏)(V12V221/2V221/2𝟏)V11.211+(V12V221/2V221/2𝟏)V11.21(V12V221/2V221/2𝟏).\displaystyle=V_{11.2}^{-1}-\frac{V_{11.2}^{-1}\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}V_{11.2}^{-1}}{1+\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)^{\intercal}V_{11.2}^{-1}\left(V_{12}V_{22}^{-1/2}-V_{22}^{1/2}\bm{1}\right)}.

Using the block-matrix inverse

V1=(V11.21V11.21V12V221V221V12V11.21V221+V221V12V11.21V12V221),V^{-1}=\begin{pmatrix}V_{11.2}^{-1}&-V_{11.2}^{-1}V_{12}V_{22}^{-1}\\ -V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}&V_{22}^{-1}+V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}\end{pmatrix},

we also get

R=V11.21V11.21(V12V221𝟏)(V12V221𝟏)V11.21𝟏V11.21𝟏V221V12V11.21𝟏𝟏V11.21V12V221+V221+V221V12V11.21V12V221.\displaystyle R=V_{11.2}^{-1}-\frac{V_{11.2}^{-1}\left(V_{12}V_{22}^{-1}-\bm{1}\right)\left(V_{12}V_{22}^{-1}-\bm{1}\right)^{\intercal}V_{11.2}^{-1}}{\bm{1}^{\intercal}V_{11.2}^{-1}\bm{1}-V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}\bm{1}-\bm{1}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}+V_{22}^{-1}+V_{22}^{-1}V_{12}^{\intercal}V_{11.2}^{-1}V_{12}V_{22}^{-1}}.

Then, we can easily check that (𝚽~𝚽~)1=R(\tilde{\bm{\Phi}}^{\intercal}\tilde{\bm{\Phi}})^{-1}=R. This completes the proof.

12 Confidence Intervals for Target Parameters

First, we discuss how to form asymptotically valid confidence intervals for \mathbb{E}^{0}[\phi_{0}(D)]. Assume that the test functions \phi_{1},\dots,\phi_{L} are uncorrelated and have unit variance under \mathbb{P}^{0}.

Suppose that \hat{\beta} in (4) is consistent:

\hat{\beta}\xrightarrow[]{p}\beta^{*}:=\operatorname*{argmin}_{\beta:\beta^{\intercal}1=1}\beta^{\intercal}\Sigma^{W}\beta.

Then, we have for 𝔼^[ϕ0]=(𝔼^1[ϕ0],,𝔼^K[ϕ0])\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}=(\hat{\mathbb{E}}^{1}[\phi_{0}],\dots,\hat{\mathbb{E}}^{K}[\phi_{0}])^{\intercal},

m(k=1Kβ^k𝔼^k[ϕ0]𝔼0[ϕ0])\displaystyle\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]-\mathbb{E}^{0}[\phi_{0}]\right) =m(β^β)(𝔼^[ϕ0]𝟙𝔼0[ϕ0])\displaystyle=\sqrt{m}\left(\hat{\beta}-\beta^{*}\right)^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)
+m β(𝔼^[ϕ0]𝟙𝔼0[ϕ0])\displaystyle+\sqrt{m}\text{ }{\beta^{*}}^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)

since β^𝟙=1\hat{\beta}^{\intercal}\mathbb{1}=1 and β𝟙=1\beta^{*}{}^{\intercal}\mathbb{1}=1. By Theorem 1,

m(𝔼^[ϕ0]𝟙𝔼0[ϕ0])𝑑N(0,Var0(ϕ0)ΣW).\sqrt{m}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)\xrightarrow[]{d}N(0,\text{Var}_{\mathbb{P}^{0}}(\phi_{0})\cdot\Sigma^{W}).

Therefore,

m(k=1Kβ^k𝔼^k[ϕ0]𝔼0[ϕ0])\displaystyle\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]-\mathbb{E}^{0}[\phi_{0}]\right) =m β(𝔼^[ϕ0]𝟙𝔼0[ϕ0])+op(1)\displaystyle=\sqrt{m}\text{ }{\beta^{*}}^{\intercal}\left(\overrightarrow{\hat{\mathbb{E}}[\phi_{0}]}-\mathbb{1}\cdot\mathbb{E}^{0}[\phi_{0}]\right)+o_{p}(1)
𝑑N(0,Var0(ϕ0)βΣWβ).\displaystyle\xrightarrow[]{d}N(0,\text{Var}_{\mathbb{P}^{0}}(\phi_{0})\cdot{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}). (23)

Then, we can construct a confidence interval for \mathbb{E}^{0}[\phi_{0}] as

k=1Kβ^k𝔼^k[ϕ0]±z1α/2Var0(ϕ0)βΣWβm.\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]\pm z_{1-\alpha/2}\cdot\sqrt{\text{Var}_{\mathbb{P}^{0}}(\phi_{0})}\cdot\sqrt{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}.

How can we obtain a consistent estimator for βΣWβ/m{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}/m? By Theorem 1,

𝔼^0[ϕ]k=1Kβk𝔼^k[ϕ] =d βΣWβmZ+op(1/m)\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\beta_{k}^{*}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\text{ }\stackrel{{\scriptstyle d}}{{=}}\text{ }\sqrt{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot Z_{\ell}+o_{p}(1/\sqrt{m})

where ZZ_{\ell} for =1,,L\ell=1,\dots,L are independent standard Gaussian variables. Then,

(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2\displaystyle\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2} =((𝔼^0[ϕ]k=1Kβk𝔼^k[ϕ])+(k=1K(β^kβk)(𝔼^0[ϕ]𝔼^k[ϕ])))2\displaystyle=\left(\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\beta^{*}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)+\left(\sum_{k=1}^{K}(\hat{\beta}_{k}-\beta^{*}_{k})(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\hat{\mathbb{E}}^{k}[\phi_{\ell}])\right)\right)^{2}
=dβΣWβmZ2+op(1/m).\displaystyle\stackrel{{\scriptstyle d}}{{=}}{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot Z_{\ell}^{2}+o_{p}(1/m).

Therefore, we may estimate βΣWβ/m{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}/m as

1L=1L(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2=dβΣWβmχ2(L)L+op(1/m),\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}\stackrel{{\scriptstyle d}}{{=}}{\frac{{\beta^{*}}^{\intercal}\Sigma^{W}\beta^{*}}{m}}\cdot\frac{\chi^{2}(L)}{L}+o_{p}(1/m), (24)

where \chi^{2}(L) is a chi-squared random variable with L degrees of freedom.

Finally, as L\xrightarrow[]{}\infty, we have the consistency of \hat{\beta} and \chi^{2}(L)/L\xrightarrow[]{p}1. Suppose we have \widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi_{0}(D))=\text{Var}_{\mathbb{P}^{0}}(\phi_{0}(D))+o_{p}(1) as in Remark 3. Combining all the results, we obtain an asymptotically valid (1-\alpha) confidence interval as m\xrightarrow[]{}\infty and L\xrightarrow[]{}\infty:

k=1Kβ^k𝔼^k[ϕ0]±z1α/2Var^0(ϕ0)1L=1L(𝔼^0[ϕ]k=1Kβ^k𝔼^k[ϕ])2.\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{0}]\pm z_{1-\alpha/2}\cdot\sqrt{\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi_{0})}\cdot\sqrt{\frac{1}{L}\sum_{\ell=1}^{L}\left(\hat{\mathbb{E}}^{0}[\phi_{\ell}]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi_{\ell}]\right)^{2}}.
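A minimal sketch of this interval construction is given below (inputs are hypothetical; it assumes, as above, that the test functions \phi_{1},\dots,\phi_{L} are uncorrelated with unit variance under \mathbb{P}^{0} and that \hat{\beta} is consistent).

```python
import numpy as np
from scipy import stats

def distributional_ci(beta_hat, donor_phi0, resid, var0_phi0, alpha=0.05):
    # beta_hat   : length-K vector of estimated weights from (4)
    # donor_phi0 : length-K vector of E_hat^k[phi_0]
    # resid      : length-L vector of E_hat^0[phi_l] - sum_k beta_hat_k E_hat^k[phi_l]
    # var0_phi0  : pooled-donor estimate of Var_P0(phi_0), as in Remark 3
    point = beta_hat @ donor_phi0                         # sum_k beta_hat_k E_hat^k[phi_0]
    width = np.sqrt(var0_phi0) * np.sqrt(np.mean(resid ** 2))
    z = stats.norm.ppf(1 - alpha / 2)
    return point - z * width, point + z * width           # (1 - alpha) confidence interval
```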

Now consider the case in Remark 7 where

θ0=argminθ𝔼0[(θ,X,Y)],θ^=argminθk=1Kβ^k𝔼^k[(θ,X,Y)].\theta^{0}=\arg\min_{\theta}\mathbb{E}^{0}[\mathcal{L}(\theta,X,Y)],\quad\hat{\theta}=\arg\min_{\theta}\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\mathcal{L}(\theta,X,Y)].

Assume the regularity conditions of Lemma 1 and the consistency of β^\hat{\beta} and θ^\hat{\theta} (β^𝑝β\hat{\beta}\xrightarrow[]{p}\beta^{*}, θ^𝑝θ0\hat{\theta}\xrightarrow[]{p}\theta^{0}). Then, from the proof of Lemma 1, we have

m(θ^θ0)=m(k=1Kβ^k𝔼^k[ϕ(D)]𝔼0[ϕ(D)])+oP(1),\sqrt{m}\left(\hat{\theta}-\theta^{0}\right)=\sqrt{m}\left(\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\phi(D)]-\mathbb{E}^{0}[\phi(D)]\right)+o_{P}(1),

where ϕ(D)=𝔼0[θ2(θ0,D)]1θ(θ0,D)\phi(D)=-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]^{-1}\partial_{\theta}\mathcal{L}(\theta^{0},D). Then, from the results above, we only need to show that Var^0(ϕ(D))=Var0(ϕ(D))+op(1)\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D))=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1). Note that

k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]\displaystyle\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)] 𝔼0[θ2(θ0,X,Y)]\displaystyle-\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]
=k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]\displaystyle=\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)]-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]
+k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]k=1Kβk𝔼0[θ2(θ0,X,Y)].\displaystyle+\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]-\sum_{k=1}^{K}\beta_{k}^{*}{\mathbb{E}}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)].

As in the proof of Lemma 1, we have that

||k=1Kβ^k𝔼^k[θ2(θ^,X,Y)]\displaystyle\Big{|}\Big{|}\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\hat{\theta},X,Y)] k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]||\displaystyle-\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]\Big{|}\Big{|}
(k=1K(|βk|+oP(1))1nki=1nkh(Dki))θ^θ0=oP(1).\displaystyle\leq\left(\sum_{k=1}^{K}(|\beta_{k}^{*}|+o_{P}(1))\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}h(D_{ki})\right)||\hat{\theta}-\theta^{0}||=o_{P}(1).

Moreover, with the consistency of β^\hat{\beta} and Theorem 1, we have

k=1Kβ^k𝔼^k[θ2(θ0,X,Y)]k=1Kβk𝔼0[θ2(θ0,X,Y)]=oP(1)\sum_{k=1}^{K}\hat{\beta}_{k}\hat{\mathbb{E}}^{k}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]-\sum_{k=1}^{K}\beta_{k}^{*}{\mathbb{E}}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},X,Y)]=o_{P}(1)

Therefore,

ϕ^(D)=(𝔼0[θ2(θ0,D)]+oP(1))1θ(θ^,D).\hat{\phi}(D)=-\left(\mathbb{E}^{0}[\partial_{\theta}^{2}\mathcal{L}(\theta^{0},D)]+o_{P}(1)\right)^{-1}\partial_{\theta}\mathcal{L}(\hat{\theta},D).

Note that

𝔼^k[θ(θ^,D)θ(θ^,D)]\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}] =𝔼^k[(θ(θ0,D)+δ(θ^,D))(θ(θ0,D)+δ(θ^,D))]\displaystyle=\hat{\mathbb{E}}^{k}[(\partial_{\theta}\mathcal{L}({\theta}^{0},D)+\delta(\hat{\theta},D))(\partial_{\theta}\mathcal{L}(\theta^{0},D)+\delta(\hat{\theta},D))^{\intercal}]
𝔼^k[θ(θ0,D)θ(θ0,D)]+𝔼^k[δ(θ^,D)δ(θ^,D)]\displaystyle\leq\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]
+2𝔼^k[θ(θ0,D)θ(θ0,D)]𝔼^k[δ(θ^,D)δ(θ^,D)]\displaystyle+2\sqrt{\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]\cdot\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]}

where \delta(\hat{\theta},D)=\partial_{\theta}\mathcal{L}(\hat{\theta},D)-\partial_{\theta}\mathcal{L}({\theta}^{0},D). The inequality follows from the Cauchy-Schwarz inequality. As in the proof of Lemma 1, using a Taylor expansion, we have

𝔼^k[δ(θ^,D)δ(θ^,D)]=oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\delta(\hat{\theta},D)\delta(\hat{\theta},D)^{\intercal}]=o_{P}(1).

and we have \hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]={\mathbb{E}}^{0}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1) from Theorem 1. Then, we get

𝔼^k[θ(θ^,D)θ(θ^,D)]𝔼^k[θ(θ0,D)θ(θ0,D)]+oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}]\leq\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1).

By using the Cauchy-Schwarz inequality in the other direction as well, we get

𝔼^k[θ(θ^,D)θ(θ^,D)]=𝔼^k[θ(θ0,D)θ(θ0,D)]+oP(1).\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)\partial_{\theta}\mathcal{L}(\hat{\theta},D)^{\intercal}]=\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)\partial_{\theta}\mathcal{L}({\theta}^{0},D)^{\intercal}]+o_{P}(1).

Similarly, we can derive

𝔼^k[θ(θ^,D)]\displaystyle\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}(\hat{\theta},D)] =𝔼^k[θ(θ0,D)]+oP(1).\displaystyle=\hat{\mathbb{E}}^{k}[\partial_{\theta}\mathcal{L}({\theta}^{0},D)]+o_{P}(1).

Then, the variance estimate on the pooled donor data can be written as

Var^0(ϕ)\displaystyle\widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi) =k=1Knkn𝔼^k[ϕ^(D)ϕ(D)^](k=1Knkn𝔼^k[ϕ^(D)])(k=1Knkn𝔼^k[ϕ^(D)])\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)\hat{\phi(D)}^{\intercal}]-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)]\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[\hat{\phi}(D)]\right)^{\intercal}
=k=1Knkn𝔼^k[ϕ(D)ϕ(D)](k=1Knkn𝔼^k[ϕ(D)])(k=1Knkn𝔼^k[ϕ(D)])+oP(1).\displaystyle=\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D){\phi(D)}^{\intercal}]-\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D)]\right)\left(\sum_{k=1}^{K}\frac{n_{k}}{n}\hat{\mathbb{E}}^{k}[{\phi}(D)]\right)^{\intercal}+o_{P}(1).

Then, by the argument in Appendix Section 8.5, we have \widehat{\text{Var}}_{\mathbb{P}^{0}}(\phi(D))=\text{Var}_{\mathbb{P}^{0}}(\phi(D))+o_{p}(1). Finally, as L\xrightarrow[]{}\infty, \hat{\beta} is consistent, and under the regularity conditions of Lemma 3, \hat{\theta} is consistent by following the proof of Lemma 3. This completes the proof.

13 Additional Details in Data Analysis

13.1 Covariance of Test Functions in GTEx Data Analysis

We estimate the sparse inverse covariance matrix of the test functions (\phi_{\ell})_{\ell=1,\dots,L=1000} defined in Section 5 using the graphical lasso, as implemented in the Python class sklearn.covariance.GraphicalLassoCV. The heatmap of the fitted precision matrix is given in Figure 4. As the estimated inverse covariance matrix was close to a diagonal matrix, we did not perform a whitening transformation for the 1000 test functions.
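For reference, the precision-matrix estimation step can be reproduced along the following lines; the data matrix below is a synthetic stand-in, whereas the actual analysis used the matrix of 1000 test-function values on the GTEx data.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Synthetic stand-in for the n x L matrix of test-function values
# (the actual analysis in Section 5 uses L = 1000 test functions).
X = np.random.default_rng(0).normal(size=(500, 20))

model = GraphicalLassoCV().fit(X)
precision = model.precision_      # sparse inverse covariance matrix (as in Figure 4)
covariance = model.covariance_    # corresponding covariance estimate
```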

Figure 4: Fitted sparse precision matrix of the test functions defined in Section 5 on the GTEx data.

13.2 Additional Results for ACS Income Data Analysis

In this section, we provide additional results for the ACS income data analysis in Section 4, for six randomly selected target states. The results are presented in Figure 5.

Figure 5: Test MSE of XGBoost when the training data is initially sourced from CA and then from PR, shown for six randomly selected target states. The dashed vertical line indicates the point when PR data started to be added. The blue line corresponds to equally weighted samples, the green line to samples weighted by importance weights, and the orange line to distribution-specific weights obtained with our proposed method.
Figure 5: Test MSE results of the XGBoost when the training data is initially sourced from CA and then sourced from PR. The dashed vertical line indicates the point when PR data started to be added. The blue line is when samples are equally weighted, and the green line is when samples are weighted based on importance weights. The orange line is when samples receive distribution-specific weights using our proposed method.