Model selection for estimation of causal parameters

Dominik Rothenhäusler

Abstract

A popular technique for selecting and tuning machine learning estimators is cross-validation. Cross-validation evaluates overall model fit, usually in terms of predictive accuracy. In causal inference, the optimal choice of estimator depends not only on the fitted models, but also on assumptions the statistician is willing to make. In this case, the performance of different (potentially biased) estimators cannot be evaluated by checking overall model fit.

We propose a model selection procedure that estimates the squared $\ell_{2}$ -deviation of a finite-dimensional estimator from its target. The procedure relies on knowing an asymptotically unbiased ”benchmark estimator” of the parameter of interest. Under regularity conditions, we investigate bias and variance of the proposed criterion compared to competing procedures and derive a finite-sample bound for the excess risk compared to an oracle procedure. The resulting estimator is discontinuous and does not have a Gaussian limit distribution. Thus, standard asymptotic expansions do not apply. We derive asymptotically valid confidence intervals that take into account the model selection step.

The performance of the approach for estimation and inference for average treatment effects is evaluated on simulated data sets, including experimental data, instrumental variables settings and observational data with selection on observables.

1 Introduction

Model selection is a fundamental task in statistical practice. Usually, the aim is to find a model that optimizes overall model fit. If the loss function is quadratic, the total error can be decomposed into the error due to variance and the error due to bias. A popular technique to balance the bias-variance trade-off in the context of prediction is cross-validation. However, the performance of common estimators in causal inference does not only depend on prediction performance, but, as we will discuss below, also on assumptions the statistician is willing to make. Thus, methods cannot be reliably compared by checking overall model fit. In the literature, there exist some solutions for parameter-specific model selection, but we currently lack a reliable general-purpose tool. In the context of causal inference, a reliable parameter-specific model selection tool could enable the following applications.

Moving the goalpost.

Estimating average treatment effects from observational data can be unreliable, if conditional treatment assignment probabilities are close to zero or one. To address this issue, some researchers move the goalpost by switching to causal contrasts that can be estimated more reliably, see LaLonde (1986), Heckman et al. (1998), Crump et al. (2006) and references therein. For example, instead of estimating the average treatment effect (ATE), a practitioner might switch to estimating the average treatment effect on the treated (ATT), or other causal contrasts such as the overlap effect. This is often done in an ad-hoc fashion. Using a model selection tool in this context could allow to trade off bias and variance when switching from estimating the ATE to estimators of other causal contrasts.

Data fusion.

Combining evidence across data sets in the context of causal inference has recently attracted increasing interest (Peters et al., 2016; Bareinboim and Pearl, 2016; Athey et al., 2020). However, data quality often varies from data set to data set. In this case, using all data sets can lead to untrustworthy, biased estimators. A reliable model selection tool could make it possible to distinguish which data sets and estimators are useful for solving the estimation problem at hand, and which are not.

Choosing between estimators that are optimal in different models.

If all confounders are observed, semi-parametrically efficient estimators such as augmented inverse probability weighting are attractive as they do not rely on parametric specifications (Robins et al., 1994). On the other hand, if the model is well-specified, parametric estimators such as ordinary least-squares can achieve higher efficiency and the fitted model may be easier to interpret. A reliable model selection procedure would allow practitioners to automatically select interpretable models if the parametric model is a good approximation, and semi-parametric estimators if the parametric model is far from the underlying data-generating process.

Leveraging conditional independencies.

Researchers are often uncomfortable with using statistical procedures that only work under strong assumptions. Using such methods over other procedures may introduce some bias if the assumptions are violated, but has the potential to reduce variance. For example, conditional independence assumptions can be leveraged to improve precision of treatment effect estimators (Athey et al., 2019; Guo and Perković, 2020). A model selection tool as described above would allow to systematically trade off bias and variance when switching between estimation procedures that are optimal under different graphical structures. Doing so can potentially improve precision in scenarios in which researchers are not comfortable with making strong assumptions.

1.1 Related work

Model selection has a long history in statistics and machine learning. For optimizing loss-based estimators, the most commonly used methods include cross-validation, the Akaike information criterion, and the Bayesian information criterion (Akaike, 1974; Schwarz, 1978; Friedman et al., 2001; Arlot and Celisse, 2010).

The focused information criterion is a model selection criterion which, for a given focus parameter, estimates the mean-squared error of submodels (Claeskens and Hjort, 2003, 2008). It relies on knowing an asymptotically unbiased estimator of the parameter of interest. Its theoretical justification is given in a local misspecification framework.

More recently, Cui and Tchetgen-Tchetgen (2019) introduce a model selection tool for finite-dimensional functionals in a semiparametric model if a doubly robust estimation function is available. It is based on a pseudo-risk criterion that has a robustness property if one of the estimators is biased.

For the task of model selection when estimating heterogeneous treatment effects, several methods have been developed (Kapelner et al., 2014; Rolling and Yang, 2014; Athey and Imbens, 2016; Nie and Wager, 2017; Zhao et al., 2017; Powers et al., 2018). Most of the methodologies are specific to the considered model class. A comparison of this line of work for individual treatment effects can be found in Schuler et al. (2018).

Van der Laan and Robins (2003) propose a loss-based approach for parameter-specific model selection. In this work, the authors recommend minimizing an empirical estimate of the overall risk $R(\hat{\theta}^{(g)},\hat{\eta})$ , where $\hat{\theta}^{(g)}$ , $g=1,\ldots,G$ are candidate estimators and $\hat{\eta}$ is an efficient estimator of the nuisance parameter, computed on the training data set. Our approach is more generic in the sense that we do not assume that parameter of interest minimizes a known loss function.

Closest to our work is the sample-splitting criterion developed by Brookhart and Van Der Laan (2006). Roughly speaking, the data is split into a training and a test data set. Then, estimators are computed on the training and the test data set, and the squared deviation of estimators is aggregated across several splits. The criterion developed by Brookhart and Van Der Laan (2006) can be seen as a form of Monte Carlo cross-validation. In the following, we discuss a variant of this approach that splits the data into $k$ folds and thus mimics $k$ -fold cross-validation procedures, which are popular in practice. The data $D=(D_{1},\ldots,D_{n})$ is randomly split into $K$ disjoint roughly equally-sized folds $D^{0,1},\ldots,D^{0,K}$ . Define $D^{1,k}=D\setminus D^{0,k}$ . Assuming that the data are i.i.d., $D^{1,k}$ and $D^{0,k}$ are independent for each $k$ . Let $\hat{\theta}^{(0)}$ be an unbiased estimator of the parameter of interest $\theta^{(0)}\in\mathbb{R}^{d}$ . If several unbiased estimators are available, aggregation procedures such as inverse variance weighting can be used in a pre-processing step to obtain $\hat{\theta}^{(0)}$ . Let $\hat{\theta}^{(g)}$ be candidate estimators, $g=0,\ldots,G$ . Then, we can compute the risk criterion

\tilde{R}(g)=\frac{1}{K}\sum_{k=1}^{K}\|\hat{\theta}^{(g)}(D^{1,k})-\hat{\theta}^{(0)}(D^{0,k})\|_{2}^{2}.

(1)

Using independence of $D^{1,k}$ and $D^{0,k}$ ,

\mathbb{E}[\tilde{R}(g)]=\mathbb{E}[\|\hat{\theta}^{(g)}(D^{1,1})-\theta^{(0)}\|_{2}^{2}]+\sum_{j=1}^{d}\text{Var}(\hat{\theta}_{j}^{(0)}(D^{0,1})).

As $\text{Var}(\hat{\theta}^{(0)}(D^{0,1}))$ is constant in $g$ , the criterion in equation (1) can be used to select an estimator $\hat{\theta}^{(g)}$ with low mean-squared error for estimating $\theta^{(0)}$ among $\hat{\theta}^{(g)},g=0,\ldots,G$ . The criterion in equation (1) is attractive as it is simple and widely applicable. We will compare the proposed model selection criterion to the criterion in equation (1), both from a theoretical perspective and in simulations.

1.2 Our contribution

In this paper, we work towards making parameter-specific model selection more reliable. We derive a model selection criterion that estimates the squared $\ell_{2}$ -deviation of an estimator from its target. We show that the selected model has equal or lower variance than the baseline estimator asymptotically. Compared to the model selection procedure (1), we show that the proposed criterion has equal or lower variance asymptotically. The proposed criterion is flexible in the sense that it can be used for any low-dimensional estimator in both parametric and semi-parametric settings. Theoretical justification of the method is given under the assumption of asymptotic linearity.

Even if the candidate estimators are asymptotically linear, the model selection procedure is discontinuous and will not result in a regular estimator, even for $n\rightarrow\infty$ . Thus, the final estimator does not have a Gaussian limit distribution. We derive asymptotically valid confidence intervals for the resulting estimator that takes into account the model selection step.

In previous work, the goal of model selection for estimation is usually to select a nuisance parameter model from a set of candidate models, but the identification strategy is held fixed across models (Van der Laan and Robins, 2003; Brookhart and Van Der Laan, 2006; Cui and Tchetgen-Tchetgen, 2019). In this paper, the goal is to select among different estimands, or identification strategies. Mathematically, this correspond to selecting among different functionals of the underlying data generating distribution. In some situations, this results in dramatic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed $n$ , model selection can lead to increased risk in parts of the parameter space. We provide a finite-sample bound that reveals that the excess risk due to model selection becomes negligible as the dimension of the target parameter grows.

In simulations, the proposed criterion exhibits reliable performance in a variety of scenarios, including experiments, instrumental variables settings and data with selection on observables. The code can be found at github.com/rothenhaeusler/tms.

1.2.1 Outline

In Section 2, we introduce a method for parameter-specific model selection and discuss an example. Theory for the method is discussed in Section 3. We evaluate the performance of the proposed procedure on simulated data in Section 4.

2 Targeted model selection

This section consists of two parts. We briefly discuss the setting in Section 2.1. Then, we introduce the method in Section 2.2 and discuss basic properties.

2.1 Setting and notation

We observe data $D=(D_{i}$ , $i=1,\ldots,n)$ , where the $D_{i}$ are independently drawn from some unknown distribution $\mathbb{P}$ . Suppose we have access to estimators $\hat{\theta}^{(g)}(D)$ , $g=0,\ldots,G$ , of some unknown parameter $\theta^{(0)}$ . In the following, to simplify notation, we will write $\hat{\theta}^{(g)}$ instead of $\hat{\theta}^{(g)}(D)$ . We assume that the baseline estimator $\hat{\theta}^{(0)}$ is asymptotically unbiased for $\theta^{(0)}$ , i.e. that $\mathbb{E}[\hat{\theta}^{(0)}]=\theta^{(0)}+o(n^{-1/2})$ . In practice, the data scientist may know several estimators that are asymptotically unbiased for the parameter of interest. In this case, one can use aggregation procedures such as inverse variance weighting to construct an optimally weighted aggregated estimator $\hat{\theta}^{(0)}$ .

In addition, the data scientist may have access to estimators $\hat{\theta}^{(g)}$ for which the data scientist is not sure whether they are approximately unbiased for the effect of interest. The goal is to select among the set of estimators, minimizing the mean-squared error with respect to the target of interest $\theta^{(0)}$ . We assume that $\mathbb{E}[\hat{\theta}^{(g)}]=\theta^{(g)}+o(n^{-1/2})$ for some unknown $\theta^{(g)}$ and that $\sqrt{n}(\hat{\theta}^{(g)}-\theta^{(g)})$ converges to a non-degenerate random variable. We write $\sigma_{j}^{(g)}$ for the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}_{j}-\theta^{(g)}_{j})$ . Similarly, we assume that $\sqrt{n}(\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-(\theta^{(g)}-\theta^{(0)}))$ converges to a non-degenerate random variable for $g\neq 0$ and write $\tau_{i}^{(g)}$ for the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}_{j}-\theta^{(g)}_{j}-\hat{\theta}^{(0)}_{j}+\theta^{(0)}_{j})$ .

2.2 The method

We aim to find an estimator $g$ that minimizes

R(g)=\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}].

(2)

Here and in the following, we suppress the dependence of $R(g)$ and $\hat{\theta}^{(g)}$ on $n$ . As bias and variance of $\hat{\theta}^{(g)}$ are unknown, the function $R(g)$ is unknown and one has to minimize a proxy of the risk $R(g)$ instead. We propose to estimate $R(g)$ in equation (2) via

\hat{R}(g)=\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}+\sum_{j=1}^{d}\frac{(\hat{\sigma}_{j}^{(g)})^{2}}{n}-\frac{(\hat{\tau}_{j}^{(g)})^{2}}{n},

(3)

where $\hat{\sigma}_{j}^{(g)}$ is an estimator of the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}_{j}-\theta^{(g)}_{j})$ and $\hat{\tau}_{j}^{(g)}$ is an estimator of the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}_{j}-\theta^{(g)}_{j}-\hat{\theta}^{(0)}_{j}+\theta^{(0)}_{j})$ . If the estimators are asymptotically linear (i.e. in some semi-parametric or low-dimensional parametric settings), consistent estimators $\hat{\tau}^{(g)}$ and ${\hat{\sigma}^{(g)}}$ are usually available via plug-in estimators of the variance of the influence function (Van der Vaart, 2000; Tsiatis, 2007). An example will be discussed below. We propose to choose a final estimate $\hat{\theta}^{(\bar{g})}$ by solving

\bar{g}=\arg\min_{g}\hat{R}(g).

(4)

Let us consider a linear regression example. This example was mainly chosen for expository simplicity; the main motivating examples for this method are drawn from causal inference. The causal inference examples need more discussion and will be explained in detail in Section 4.

Example 1 (Model selection for parameter estimation).

Usually, when conducting model selection in the context of prediction, the goal is to find a model that can be estimated well and is a good approximation of some complex model of interest. However, if the purpose is parameter estimation, fitting complex models can reduce variance while potentially introducing bias. Such settings appear in causal inference and will be further discussed in Section 4. Here, we consider the task where the goal is to fit a regression with just one covariate; but there are additional covariates at our disposal that can be used to reduce variance, while potentially introducing some bias for the parameter of interest. Let $Y_{i}=X_{i}\theta^{(0)}+\epsilon_{i}$ , where $D_{i}=(Y_{i},X_{i})$ are i.i.d. and the $\epsilon_{i}$ are noise terms that are uncorrelated of the $X_{i}$ . Furthermore, for simplicity we assume that $Y_{i}$ , $X_{i}$ and $\epsilon_{i}$ are centered. We are interested in the parameter $\theta^{(0)}=\arg\min\mathbb{E}[(Y-X\theta)^{2}]$ and consider the baseline estimator

\hat{\theta}^{(0)}=\arg\min_{\theta}\sum_{i=1}^{n}(Y_{i}-X_{i}\theta)^{2}.

Let us assume that we have access to observations $Z_{1},\ldots,Z_{n}$ from some additional covariate. One may consider the estimator

\hat{\theta}^{(1)}=\arg\min_{\theta}\min_{\eta}\sum_{i=1}^{n}(Y_{i}-X_{i}\theta-Z_{i}\eta)^{2},

Let $(X,Y,Z,\epsilon)$ denote a generic $(X_{i},Y_{i},Z_{i},\epsilon_{i})$ . If $Z$ is correlated with $Y$ and only weakly correlated with $X$ , this estimator may reduce asymptotic variance compared to $\hat{\theta}^{(0)}$ since intuitively speaking, adjusting for $Z$ reduces unexplained variation in the residuals. On the other hand if $Z$ is strongly correlated with $X$ , $\hat{\theta}^{(1)}$ may converge to a different parameter than $\hat{\theta}^{(0)}$ . Under regularity conditions (Van der Vaart, 2000, Section 5), $\hat{\theta}^{(0)}$ is asymptotically linear and unbiased for $\theta^{(0)}:=\arg\min_{\theta}\mathbb{E}[(Y-X\theta)^{2}]$ , i.e.

\sqrt{n}(\hat{\theta}^{(0)}-\theta^{(0)})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbb{E}[X^{2}]^{-1}X_{i}\epsilon_{i}+o_{P}(1).

Similarly, under regularity conditions,

\sqrt{n}(\hat{\theta}^{(1)}-\theta^{(1)})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}e_{1}^{\intercal}\mathbb{E}[(X,Z)^{\intercal}(X,Z)]^{-1}(X_{i},Z_{i})^{\intercal}(Y_{i}-X_{i}\theta^{(1)}-Z_{i}\eta^{(1)})+o_{P}(1),

where $(\theta^{(1)},\eta^{(1)})=\arg\min_{(\theta,\eta)}\mathbb{E}[(Y-X\theta-Z\eta)^{2}]$ and where $e_{j}$ denotes the $j$ -th unit vector. Thus,

	$\displaystyle(\sigma^{(0)})^{2}$	$\displaystyle=\text{Var}(\mathbb{E}[X^{2}]^{-1}X\epsilon),$
	$\displaystyle(\sigma^{(1)})^{2}$	$\displaystyle=\text{Var}(e_{1}^{\intercal}\mathbb{E}[(X,Z)^{\intercal}(X,Z)]^{-1}(X,Z)^{\intercal}(Y-X\theta^{(1)}-Z\eta^{(1)}),$
	$\displaystyle(\tau^{(0)})^{2}$	$\displaystyle=0,$
	$\displaystyle(\tau^{(1)})^{2}$	$\displaystyle=\text{Var}\left(e_{1}^{\intercal}\mathbb{E}[(X,Z)^{\intercal}(X,Z)]^{-1}(X,Z)^{\intercal}(Y-X\theta^{(1)}-Z\eta^{(1)})-\mathbb{E}[X^{2}]^{-1}X\epsilon\right).$

These quantities can be consistently estimated via plug-in estimators in standard settings. For example, $(\sigma^{(1)})^{2}$ and $(\sigma^{(0)})^{2}$ can be consistently estimated via the sandwich estimator under regularity assumptions (Huber, 1967).

We will compare the risk proxy in equation (3) to sample-splitting based criteria in Section 3. The method is evaluated on simulated data sets in Section 4.

2.2.1 Improving precision

The risk criterion can be decomposed into several parts, i.e. $\hat{R}(g)=\sum_{j=1}^{d}\hat{R}_{\text{bias},j}(g)+\hat{R}_{\text{var},j}(g)$ , where

\hat{R}_{\text{bias},j}(g)=(\hat{\theta}_{j}^{(g)}-\hat{\theta}_{j}^{(0)})^{2}-\frac{\hat{\tau}_{j}^{(g)}}{n},

and

\hat{R}_{\text{var},j}(g)=\frac{\hat{\sigma}_{j}^{(g)}}{n}.

As the naming indicates, the first term can be interpreted as an estimate of the squared bias $(\theta^{(g)}-\theta^{(0)})^{2}$ , whereas the second term is an estimate of the variance of $\hat{\theta}^{(g)}$ . Since we know that squared bias terms are non-negative this motivates defining the following modified risk criterion:

\hat{R}^{\text{mod}}(g)=\left(\sum_{j=1}^{d}(\hat{\theta}^{(g)}_{j}-\hat{\theta}^{(0)}_{j})^{2}-\frac{(\hat{\tau}_{j}^{(g)})^{2}}{n}\right)_{+}+\sum_{j=1}^{d}\frac{(\hat{\sigma}_{j}^{(g)})^{2}}{n}.

(5)

Then the final estimator $\hat{\theta}^{(\bar{g})}$ is chosen such that $\bar{g}$ minimizes equation (5). We take the positive part of the sum (instead of the sum of positive parts) as this allows random errors to cancel out for large $d$ . This will be important for the theory developed in Section 3.6. If there are ties, we select $\bar{g}$ as the one that minimizes $\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}$ among the $g$ that satisfy $\hat{R}^{\text{mod}}(g)=\min_{g^{\prime}}\hat{R}^{\text{mod}}(g^{\prime})$ . The criterion $\hat{R}^{\text{mod}}(g)$ is not asymptotially unbiased for $R(g)$ , but has some favorable statistical properties that we will discuss in the following section.

3 Theory

In this section we discuss the theoretical underpinnings of the method introduced in Section 2. First, we show that the criterion $\hat{R}(g)$ is asymptotically unbiased for estimating the mean-squared error $R(g)=\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}]$ . Secondly, we compare the criterion $\hat{R}(g)$ to cross-validation in terms of asymptotic bias and variance. Then, we discuss the asymptotic risk of the resulting estimator. We derive asymptotically valid confidence intervals for the parameter of interest that takes into account the model selection step. Finally, we present a finite-sample bound that shows that if the dimension of the target parameter is large, the excess risk due to model selection becomes negligible.

3.1 Assumptions

We make two major assumptions, in addition to the assumptions outlined in Section 2.1. The first major assumption is a slightly stronger version of asymptotic linearity. Asymptotic linearity is an assumption that is commonly made to justify asymptotic normality of an estimator (Van der Vaart, 2000; Tsiatis, 2007). As our goal is to estimate the mean-squared error of an estimator, we use a slightly stronger version that guarantees convergence of second moments. The second major assumption is that the variance estimates are consistent.

Assumption 1.

We make two major assumptions.

1.

Let $\hat{\theta}^{(g)}$ , $g=0,\ldots,G$ be estimators such that

$\hat{\theta}^{(g)}-\theta^{(g)}=\frac{1}{n}\sum_{i=1}^{n}\psi^{(g)}(D_{i})+e_{g}(n),$

where $\psi^{(g)}(D_{i})$ are centered and have finite nonzero second moments, and $\mathbb{E}[\|e_{g}(n)\|_{2}^{2}]=o(1/n)$ . To avoid trivial special cases, in addition we assume that the covariance matrix of $(\psi^{(0)},\ldots,\psi^{(G)})$ is positive definite.

The estimators of variance are consistent, that means

	$\displaystyle(\hat{\tau}^{(g)})^{2}$	$\displaystyle=(\tau^{(g)})^{2}+o_{P}(1),$
	$\displaystyle(\hat{\sigma}^{(g)})^{2}$	$\displaystyle=(\sigma^{(g)})^{2}+o_{P}(1).$

Let us compare the first part of the assumption to asymptotic linearity. Asymptotic linearity assumes that $\|e_{g}(n)^{2}\|_{2}^{2}=o_{P}(1/n)$ while we assume that $\mathbb{E}[\|e_{g}(n)^{2}\|_{2}^{2}]=o(1/n)$ . Thus, our assumption is stronger than asymptotic linearity. Let us now turn to our theoretical results.

3.2 Asymptotic unbiasedness

Our first result shows that the proposed criterion is asymptotically unbiased for the mean-squared error of $\hat{\theta}^{(g)}$ . The convergence rate depends on whether $\hat{\theta}^{(g)}$ is asymptotically biased for estimating $\theta^{(0)}$ . If the estimator $\hat{\theta}^{(g)}$ is unbiased, the convergence rate is faster. The proof of the following result can be found in the supplement.

Theorem 1 (Asymptotic unbiasedness of $\hat{R}(g)$ ).

Let Assumption 1 hold.

1.

If $\theta^{(g)}=\theta^{(0)}$ ,

$n(\hat{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])$

converges weakly to a random variable with mean zero.
2.

If $\theta^{(g)}\neq\theta^{(0)}$ ,

$\sqrt{n}(\hat{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])$

converges weakly to a random variable with mean zero.

This result shows that the estimator is asymptotically unbiased, which is important for the theoretical justification of the approach. The major strength of the proposed approach will become apparant in the next section, where we compare asymptotic bias and variance to the cross-validation criterion (1).

3.3 Comparison to cross-validation

In this section, we will show that the proposed criterion has asymptotically lower variance than the cross-validation criterion (1) if $\theta^{(g)}=\theta^{(0)}$ . Furthermore, we show that cross-validation is generally biased for estimating the mean-squared error if $\theta^{(g)}=\theta^{(0)}$ . Let us first discuss asymptotic bias of the naive approach. The short proof of this result can be found in the supplement.

Theorem 2 (Asymptotic biasedness of $\tilde{R}(g)$ ).

Let Assumption 1 hold and fix $K$ .

If $\theta^{(g)}=\theta^{(0)}$ ,

n\left(\tilde{R}(g)-\frac{K}{K-1}\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}]-\frac{K}{n}\text{Var}(\psi^{(0)}(D_{1}))\right)

converges weakly to a random variable with mean zero.

2.

If $\theta^{(g)}\neq\theta^{(0)}$ ,

$\sqrt{n}(\tilde{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])$

converges weakly to a random variable with mean zero.

Comparing this result with Theorem 1, we see a major difference how the methods behave when $\theta^{(g)}=\theta^{(0)}$ . The proposed criterion is asymptotically unbiased for the mean-squared error, while cross-validation is not (even modulo additive constants). The cross-validation approach computes the estimator on randomly chosen subsets of the data. Thus, intuitively, this approach overestimates the asymptotic variance of unbiased estimators compared to the mean-squared error of biased estimators. In practice, this means that the approach tends to choose biased estimators with low variance over unbiased estimators with slighly higher variance. This effect can also be observed in Section 4.

Now, let us turn to the asymptotic variance of the model selection criteria. The proof of the following result can be found in the supplement.

Theorem 3 (Asymptotic variance of model selection criteria).

Let Assumption 1 hold and fix $K$ .

1.

If $\theta^{(g)}=\theta^{(0)}$ , the asymptotic variance of $n(\hat{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])^{2}$ is strictly lower than the asymptotic variance of $n(\tilde{R}(g)-\frac{K}{K-1}\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}]-\frac{K}{n}\text{Var}(\psi^{(0)}(D_{1})))$ ,
2.

If $\theta^{(g)}\neq\theta^{(0)}$ , the asymptotic variance of $\sqrt{n}(\hat{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])$ is equal to the asymptotic variance of $\sqrt{n}(\tilde{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])$ .

Roughly speaking, this theorem shows that the proposed criterion $\hat{R}(g)$ has equal or lower asymptotic variance than the cross-validation criterion $\tilde{R}(g)$ . This means that risk estimates based on the cross-validation criterion are harder to interpret than the risk estimates of the proposed criterion, since their variation will be larger. The difference in variance is particularly large if the splits are imbalanced, i.e. if the test data sets have much smaller or much larger sample size than the training data sets. Intuitively, if the validation data sets have small sample size, validation becomes unstable. However, if the validation data set is large, the estimator on the training data sets becomes unstable. This tradeoff can be avoided by estimating bias and variance separately, as done in the proposed approach.

Note that for understanding the excess risk of the resulting estimator it is not sufficient to study the asymptotic variance of the criterion itself. Thus, we study the asymptotic risk in Section 3.4 and give a finite-sample bound for the excess risk in Section 3.6.

3.4 Asymptotic risk

Here and in the following, we focus on the case where the number of models $G$ is small and fixed and $n\rightarrow\infty$ . First, we investigate the asymptotic behaviour of the proposed procedure in the case where the number of models is fixed and $n\rightarrow\infty$ . The proof of the following result can be found in the supplement.

Corollary 1 (Asymptotic risk of selected model).

Let Assumption 1 hold. Consider a finite and fixed number of estimators $g=0,\ldots,G$ . Let

\bar{g}=\arg\min\hat{R}^{\text{mod}}(g).

For $n\rightarrow\infty$ ,

\mathbb{P}[\theta^{(\bar{g})}=\theta^{(0)}]\rightarrow 1,

and

\mathbb{P}[R(\bar{g})\leq R(0)]\rightarrow 1.

In words, for $n\rightarrow\infty$ , the proposed method selects models with lower or equal risk than the baseline estimator $\hat{\theta}^{(0)}$ . Interestingly, an analogous result does not hold for the cross-validation procedure (1). In Section 4 we will see that even for relatively large $n$ , the selected model by cross-validation may have risk that is significantly larger than the risk of the baseline procedure.

3.5 Confidence intervals

Deriving confidence intervals that are valid in conjuction with a model selection step is a challenging topic and has attracted substantial interest in recent years, see for example Berk et al. (2013); Taylor and Tibshirani (2015). Generally speaking, statistical inference after a model selection step can be unreliable if the uncertainty induced by the model selection step is ignored. In this section, we describe how to construct confidence intervals that take into account the uncertainty induced by model selection. Intuitively speaking, the challenge is that the final estimator is a discontinuous function of the data. To be more precise, the final estimator $\hat{\theta}^{(\bar{g})}$ is not regular and not asymptotically normal.

The goal in this section is to find $I_{1}$ and $I_{2}$ as a function of the data $D_{1},\ldots,D_{n}$ such that

\mathbb{P}[\hat{\theta}_{j}^{(\bar{g})}-I_{1}\leq\theta_{j}^{(0)}\leq\hat{\theta}_{j}^{(\bar{g})}+I_{2}]\rightarrow 1-\alpha,

for some pre-determined $\alpha>0$ and $j$ and where

\bar{g}=\arg\min_{g}\hat{R}^{\text{mod}}(g).

The following theorem shows how to construct confidence intervals in low-dimensional settings; i.e. in settings where the number of models $G$ is fixed and the sample size goes to infinity.

Theorem 4.

Define $\hat{\theta}=(\hat{\theta}^{(0)},\hat{\theta}^{(1)},\ldots,\hat{\theta}^{(G)})\in\mathbb{R}^{(G+1)d}$ and $\theta=(\theta^{(0)},\theta^{(1)},\ldots,\theta^{(G)})$ . Assume that

\sqrt{n}(\hat{\theta}-\theta)\rightarrow\mathcal{N}(0,\Sigma),

for some positive definite $\Sigma$ . Let $\hat{\Sigma}(D)=\hat{\Sigma}_{n}(D_{1},\ldots,D_{n})$ be a consistent estimator of $\Sigma\in\mathbb{R}^{(G+1)d\times(G+1)d}$ and ${\hat{\sigma}^{(g)}}={\hat{\sigma}^{(g)}}(D_{1},\ldots,D_{n})$ be a consistent estimator of the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}-\theta^{(g)})$ and $\hat{\tau}^{(g)}=\hat{\tau}^{(g)}(D_{1},\ldots,D_{n})$ be a consistent estimator of the asymptotic standard deviation of $\sqrt{n}(\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-\theta^{(g)}+\theta^{(0)})$ . With some abuse of notation, conditionally on the data $D=(D_{1},\ldots,D_{n})$ draw $(Z_{0},\ldots,Z_{G})\sim\mathcal{N}(\sqrt{n}\hat{\theta}(D),\hat{\Sigma}(D))$ with $Z_{g}\in\mathbb{R}^{d}$ . We define the event

A_{g}^{est}=\{\max(\|Z_{g}-Z_{0}\|_{2}^{2}-\|\hat{\tau}^{(g)}\|_{2}^{2},0)+\|\hat{\sigma}^{(g)}\|_{2}^{2}<\min_{g^{\prime}\neq g}\max(\|Z_{g^{\prime}}-Z_{0}\|_{2}^{2}-\|\hat{\tau}^{(g^{\prime})}\|_{2}^{2},0)+\|\hat{\sigma}^{(g^{\prime})}\|_{2}^{2}\}.

Now for some fixed $j$ define

b_{D_{1},\ldots,D_{n}}(\beta)=\sum_{g}\mathbb{P}[\{Z_{g,j}-\sqrt{n}\hat{\theta}_{j}^{(0)}\leq\beta\}\cap A_{g}^{est}|D_{1},\ldots,D_{n}].

(6)

Then:

1.

For $n\rightarrow\infty$ , with probability converging to one, the inverse $b^{-1}_{D_{1},\ldots,D_{n}}:(0,1)\rightarrow\mathbb{R}$ is well-defined.

For all $\alpha>0$ ,

\mathbb{P}\left[\hat{\theta}_{j}^{(\bar{g})}-\frac{b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)}{\sqrt{n}}\leq\theta_{j}^{(0)}\leq\hat{\theta}_{j}^{(\bar{g})}-\frac{b^{-1}_{D_{1},\ldots,D_{n}}(\alpha/2)}{\sqrt{n}}\right]\rightarrow 1-\alpha.

Note that by definition the conditional distribution of $Z$ given $D_{1},\ldots,D_{n}$ is known to the researcher. Also, the researcher often can construct estimators of the variances in parametric and semi-parametric settings via plug-in estimators of the influence function, see e.g. (Van der Vaart, 2000; Tsiatis, 2007). In these cases, $b_{i,\alpha}(D_{1},\ldots,D_{n})$ can be computed by the researcher, for example by Monte-Carlo simulation. Thus, Theorem 4 allows us to construct asymptotically valid confidence intervals for the final estimator $\hat{\theta}^{(\bar{g})}$ in many parametric and semi-parametric settings.

This result is different from standard asymptotic arguments in the sense that $\hat{\theta}^{(\bar{g})}-\theta^{(0)}$ is not asymptotically Gaussian. However, as the result shows, it is possible to recover the exact asymptotic distribution of $\hat{\theta}^{(\bar{g})}$ in low-dimensional scenarios and use this information to conduct asymptotically valid statistical inference. We will evaluate the empirical performance of confidence intervals constructed via Theorem 4 in Section 4.

3.6 Finite-sample bound

As discussed in Section 3.4, the proposed method selects a model that is asymptotically no worse than the baseline estimator. However, there is no free lunch. In transitional regimes, for fixed $n$ , the estimator can perform worse than the baseline estimator $\hat{\theta}^{(0)}$ . This is to be expected from statistical theory, see for example the discussion of the Hodges-Le Cam estimator on page 110 in Van der Vaart (2000). This makes it important to understand in which cases we can expect reliable performance of the proposed model selection procedure. In the case $d=1$ , a Bayesian bound (Gill and Levit, 1995) reveals that improving over the Cramér-Rao bound in some parts of the parameter space must lead to deteriorating performance in other parts of the parameter space. In the following we will provide a finite-sample bound that will show that for large $d$ the excess risk becomes negligible, uniformly over a set of distributions.

Theorem 5.

Let $\hat{\theta}^{(g)}=\theta^{(g)}+\epsilon^{(g)}=\theta^{(0)}+\delta^{(g)}+\epsilon^{(g)}$ where $\delta^{(g)}\in\mathbb{R}^{d}$ is a constant vector and $G>1$ . We assume that the $\epsilon_{j}^{(g)}$ are centered, independent and sub-Gaussian random variables with variance proxy $\eta_{j}^{(g)}/\sqrt{n}$ , i.e. that $\mathbb{E}[\operatorname{exp}\left(s\epsilon_{j}^{(g)}\right)]\leq\operatorname{exp}\left(\frac{(\eta_{j}^{(g)})^{2}s^{2}}{2n}\right)$ for all $g=1,\ldots,G$ and $j=1,\ldots,d$ and $s\in\mathbb{R}$ . Define $\tau_{j}^{(g)}$ as the standard deviation of $\sqrt{n}(\epsilon_{j}^{(g)}-\epsilon_{j}^{(0)})$ and $\sigma_{j}^{(g)}$ as the standard deviation of $\sqrt{n}\epsilon_{j}^{(g)}$ . We assume that $|\delta_{j}^{(g)}|\leq c_{0}/\sqrt{n}$ for some constant $c_{0}>0$ and $\delta^{(0)}=0$ . Define $\iota_{n,d}:=\max(\sup_{g,j}|(\hat{\sigma}_{j}^{(g)})^{2}-(\sigma_{j}^{(g)})^{2}|,\sup_{g,j}|(\hat{\tau}_{j}^{(g)})^{2}-(\tau_{j}^{(g)})^{2}|)$ . Furthermore, assume that $\log G/d\leq c_{1}$ for some constant $c_{1}>0$ . Then, for every $\kappa>0$ there exists a constant $C$ that may depend on $c_{0}$ , $c_{1}$ , $\eta_{j}^{(g)}$ and $\kappa$ (but not on $\delta$ ) such that with probability exceeding $1-\kappa$ ,

\frac{n}{d}\sum_{j=1}^{d}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})^{2}\leq\min_{g}\frac{n}{d}\sum_{j=1}^{d}(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})^{2}+C\sqrt{\frac{\log G}{d}}+6\iota_{n,d}.

In particular, the excess risk due to model selection goes to zero as $\frac{d}{\log G}\rightarrow\infty$ and if $\sup_{g,j}|(\hat{\sigma}_{j}^{(g)})^{2}-\sigma_{j}^{(g)}|\rightarrow 0$ and $\sup_{g,j}|\hat{\tau}_{j}^{(g)}-\tau_{j}^{(g)}|\rightarrow 0$ . In Section 4 we will see in an example that the excess risk declines for growing dimension of the estimand.

4 Applications

In this section, we discuss applications of the proposed method. Compared to existing literature on model selection for causal effects, instead of selecting among nuisance parameter models, we consider shrinking between different functionals of the data generating distribution. As we will see, doing so can lead to drastic improvements in the mean-squared error. However, there is no free lunch. Compared to the baseline procedure, for fixed $n$ , model selection can lead to increased risk in parts of the parameter space. Thus, in this section, we study the excess risk of the procedures across the parameter space.

In the following we will use potential outcomes to define causal effects (Rubin, 1974; Splawa-Neyman et al., 1990). We are interested in the causal effect of a treatment $T\in\{0,1\}$ on an outcome $Y$ . Let $Y(1)$ denote the potential outcome under treatment $T=1$ and $Y(0)$ the potential outcome under treatment $T=0$ . We assume a superpopulation model, i.e. $Y(1)$ and $Y(0)$ are random variables. In the following, the goal is to estimate the average treatment effect within several subgroups,

\theta_{s}^{(0)}=\mathbb{E}[Y(1)-Y(0)|S=s].

(7)

Many methods have been designed to estimate (7) and these methods operate under a variety of assumptions. We present several applications that are based on different sets of assumptions for identifying (7). In each of the cases, we compare the proposed method (5, termed “targeted selection”) with the cross-validation procedure (1) and with a baseline estimator. The code can be found at github.com/rothenhaeusler/tms.

4.1 Observational studies

In observational studies, it is common practice to estimate causal effects under the assumption of unconfoundedness and under the overlap assumption. Roughly speaking, the overlap assumption states that treatment assignment probabilities are bounded away from zero and one, conditional on covariates $X$ . If these assumptions are met, it is possible to identify the average treatment effect via matching, inverse probability weighting, regression adjustment, or doubly robust methods (Hernan and Robins, 2020; Imbens and Wooldridge, 2009). However, if the overlap is limited, estimating the average treatment effect can be unreliable.

To deal with the issue of limited overlap, researchers sometimes switch to different estimands such as the average effect on the treated (ATT) or the overlap weighted effect (Crump et al., 2006). In the following, we will focus on the overlap-weighted effect as it is the causal contrast that can be estimated with the lowest asymptotic variance in certain scenarios (Crump et al., 2006). The overlap-weighted effect is defined as

\theta^{(1)}=\frac{\mathbb{E}[p(T=1|X)(1-p(T=1|X))\tau(X)]}{\mathbb{E}[p(T=1|X)(1-p(T=1|X))]},

where $\tau(x)=\mathbb{E}[Y(1)-Y(0)|X=x]$ . Note that if the treatment effect is homogeneous $\tau(x)\equiv\text{const.}$ , then the overlap-weighted effect and the average treatment effect coincide, that means $\theta^{(1)}=\theta^{(0)}$ . Thus, shrinkage towards an efficient estimator of the overlap effect is potentially beneficial under treatment effect homogeneity.

We investigate shrinking between estimators of the average treatment effect and the overlap-weighted effect in a data-driven way. The proposed model selection tool will be used to trade off bias and variance.

4.1.1 The data set

We observe $1000$ independent and identically distributed draws $(Y_{i}(T_{i}),T_{i},X_{i},S_{i})$ of a distribution $\mathbb{P}$ , where the $X_{i}$ are covariates. The data generating process was chosen such that there is limited overlap, i.e. $\mathbb{P}[T=1|X=0]\approx 0$ and that the unconfoundedness assumptions, that means $(Y(0),Y(1))\perp T|X$ (Rosenbaum and Rubin, 1983). As discussed above, the causal effect can be estimated via doubly robust methods such as augmented inverse probability weighting, among others (Hernan and Robins, 2020). The data are generated according to the following equations:

\displaystyle\begin{split}S\text{ }&\text{drawn from }\{1,2,3\}\text{ uniformly at random }\\ \epsilon_{Y}&\sim\mathcal{N}(0,1)\\ X&\sim\text{Ber}(.5)\\ T&\sim\begin{cases}\text{Ber}(.7)&\text{ if }X=1,\\ \text{Ber}(.05)&\text{ if }X=0,\end{cases}\\ Y(t)&=\frac{X}{2}+t+3t\gamma^{2}X+.1t\cdot 1_{S=1}+.2t\cdot 1_{S=2}-.1t\cdot 1_{S=3}+\epsilon_{Y},\end{split}

(8)

where $\gamma\in[0,1]$ . For $\gamma=0$ , the treatment effect is homogeneous across $X$ . Thus, for $\gamma=0$ , the overlap-weighted effect coincides with the average treatment effect.

4.1.2 The estimators

In the following, we compute the estimators for each group $S=s$ separately on the data set $\{i:S_{i}=s\}$ . For reasons of readability, notationally we suppress the dependence of the conditional probabilities and conditional expectations on $s$ . We can estimate that average treatment effect via augmented inverse probability weighting (Robins et al., 1994),

\hat{\theta}_{s}^{(0)}=\hat{\mu}_{1}-\hat{\mu}_{0},

where

\hat{\mu}_{a}=\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}1_{T_{i}=a}}{\hat{p}(T_{i}=a|X_{i})}-\frac{1_{T_{i}=a}-\hat{p}(T_{i}=a|X_{i})}{\hat{p}(T_{i}=a|X_{i})}\hat{Q}(X_{i},a),

and where $\hat{Q}(x,t)$ is the empirical mean of $Y$ given $X=x$ and $T=t$ and $\hat{p}(\cdot|\cdot)$ are empirical probabilities. Similarly as above, we can estimate the overlap effect by

\hat{\theta}_{s}^{(1)}=\frac{\hat{\eta}_{1}-\hat{\eta}_{0}}{\frac{1}{n}\sum_{i}\hat{p}(T_{i}=1|X_{i})(1-\hat{p}(T_{i}=1|X_{i}))},

where

	$\displaystyle\hat{\eta}_{a}$	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}Y_{i}1_{T_{i}=a}(1-\hat{p}(T_{i}=a\|X_{i}))$
		$\displaystyle-(1_{T_{i}=a}-\hat{p}(T_{i}=a\|X_{i}))(1-\hat{p}(T_{i}=a\|X_{i}))\hat{Q}(X_{i},a).$

For $w\in\{1/10,\ldots,9/10\}$ we define

\hat{\theta}(w)=(1-w)\hat{\theta}^{(0)}+w\hat{\theta}^{(1)}.

(9)

For $\gamma\approx 0$ , due to treatment effect homogeneity we expect $\mathbb{E}[(\hat{\theta}(1)-\theta^{(0)})^{2}]<\mathbb{E}[(\hat{\theta}^{(0)}-\theta^{(0)})^{2}]$ . For $\gamma\approx 1$ , we expect $\mathbb{E}[(\hat{\theta}(1)-\theta^{(0)})^{2}]>\mathbb{E}[(\hat{\theta}(0)-\theta^{(0)})^{2}]$ . In the first case, the optimal estimator is $\hat{\theta}(w)$ with $w\approx 0$ . In the second case the optimal estimator is $\hat{\theta}(w)$ with $w\approx 1$ .

4.1.3 Results

The mean-squared error of the estimator selected by targeted selection and 10-fold cross-validation is depicted in Figure 1. To further improve the stability of cross-validation, we repeat the data splitting 10 times on randomly shuffled data. To study the influence of dimension $d$ on the performance of the model selection procedure, we show how the method performs on the full data set (left-hand side) and how the method performs if it only has access to the subset of observations $i$ for which $S_{i}=1$ (right-hand side). Results are averaged across $100$ simulation runs. The method performs better for $d=3$ (left plot) than for $d=1$ (right plot), which is consistent with Theorem 5. For $d=3$ , targeted selection performs similarly or better than the baseline estimator for all $\gamma$ . For $d=1$ , the proposed approach performs worse than the baseline estimator for $.4<\gamma<.9$ .

We also evaluated the realized coverage of confidence intervals with nominal coverage $95\%$ as described in Section 3.5. Equation 6 is estimated using the non-parametric bootstrap. Across $\gamma\in[0,1]$ , the minimal realized coverage is $93\%$ . The maximal realized coverage is $97\%$ . Averaged across all $\gamma\in[0,1]$ , the overall coverage is $94.9\%$ .

Refer to caption — Figure 1: Mean-squared error $R(\hat{w})$ where $\hat{w}$ is selected via the cross-validation criterion or targeted selection. The data are drawn according to equation (8). Cross-validation and targeted selection are used to shrink between the AIPW ATE estimator and the AIPW overlap estimator. On the left-hand side, we show the error $\|\hat{\theta}_{\bullet}^{(\overline{g})}-\theta_{\bullet}^{(0)}\|_{2}^{2}/3$ . On the right-hand side the method is run on the subset of observations $i$ for which $S_{i}=1$ . Consistent with the theory presented in Section 3.6, the maximum excess risk is smaller for $d=3$ (left) than for $d=1$ (right). The proposed method performs equal or better than cross-validation across most $\gamma\in[0,1]$ .

4.2 Instrumental variables and data fusion

The instrumental variables approach is a widely-used method to estimate causal effect of a treatment $T$ on a target outcome $Y$ in the presence of confounding (Wright, 1928; Bowden and Turkington, 1990; Angrist et al., 1996). Roughly speaking, the method relies on a predictor $I$ (called the instrument) of the treatment $T$ that is not associated with the error term of the outcome $Y$ . We will not discuss the assumptions behind instrumental variables in detail, but refer the interested reader to Hernan and Robins (2020). We will focus on the case, where $I$ , $T$ and $Y$ are one-dimensional. Under IV assumptions and linearity, the target quantity can be re-written as

\theta^{(0)}=\mathbb{E}[Y(1)-Y(0)]=\frac{\text{Cov}(I,Y)}{\text{Cov}(I,T)}.

Estimating this quantity can be challenging if the instrument is weak, i.e. if $\text{Cov}(I,T)\approx 0$ . In this case, the approach can benefit from shrinkage towards the ordinary least-squares solution (Nagar, 1959; Theil, 1961; Rothenhäusler et al., 2020; Jakobsen and Peters, 2020). Doing so may decrease the variance but generally introduces bias. We will focus on the case where we have some additional observational data, where we observe $T$ and $Y$ , but where the instrument $I$ is unobserved.

4.2.1 The data set

We draw $500$ i.i.d. observations according to the following equations:

\displaystyle\begin{split}S\text{ }&\text{drawn from }\{1,2,3\}\text{ uniformly at random }\\ I,H,\epsilon_{T},\epsilon_{Y}&\sim\mathcal{N}(0,1)\\ T&=\frac{I}{2}+H+\epsilon_{T}\\ Y(t)&=t-\gamma^{2}H+.1t\cdot 1_{S=1}+.2t\cdot 1_{S=2}-.1t\cdot 1_{S=3}+\epsilon_{Y}\end{split}

(10)

We vary $\gamma\in[0,2]$ , which corresponds to the strength of confounding between $T$ and $Y$ . We observe $(T_{i},Y_{i}(T_{i}),I_{i})$ for $i=1,\ldots,500$ . We also assume that we have access to a larger data set $i=501,\ldots,1000$ with incomplete observations. To be more precise, on this data set we only observe $X$ and $Y$ , but not the instrument $I$ . Formally, for $i=501,\ldots,1000$ we observe $(T_{i},Y_{i}(T_{i}))$ drawn according to equation (10).

4.2.2 The estimators

In the linear case, for each subset $S=s$ , the instrumental variables estimator can be written as

\displaystyle\begin{split}(\hat{b}_{IV})_{s}&=\frac{\hat{\text{Cov}}(I,Y|S=s)}{\hat{\text{Cov}}(I,T|S=s)},\end{split}

where $\hat{\text{Cov}}$ denotes the empirical covariance over the observations $i=1,\ldots,500$ . To deal with the weak instrument, we will consider shrinking the instrumental variables estimator torwards ordinary least-squares,

(\hat{b}_{\text{OLS}})_{s}=\mathop{\mathrm{missing}}{argmin}_{b}\min_{c}\hat{\mathbb{E}}[(Y-Tb-c)^{2}|S=s],

where $\hat{\mathbb{E}}$ denotes the empirical expectation over the observations $i=1,\ldots,1000$ . Shrinking towards the ordinary least-squares solution will introduce some bias if $\gamma\neq 0$ , but potentially decreases variance. As candidate estimators, for any $w\in\{0/10,1/10,\ldots,10/10\}$ we consider convex combinations of OLS and IV,

\hat{\theta}(w)=w\hat{b}_{\text{OLS}}+(1-w)\hat{b}_{\text{IV}}.

4.2.3 Results

The mean-squared error of the estimator selected by targeted selection and cross-validation is depicted in Figure 2. We use 10-fold cross-validation. To further improve the stability of cross-validation, we repeat the data splitting 10 times on randomly shuffled data. Results are averaged across $100$ simulation runs. Over the entire range $\gamma\in[0,2]$ , the proposed approach outperforms cross-validation. For $\gamma\approx 0$ and $\gamma>1$ , targeted selection outperforms the IV approach. For $.5<\gamma<1$ , the proposed approach performs worse than the IV approach. We evaluate the realized coverage of confidence intervals with nominal coverage $95\%$ as described in Section 3.5. Equation 6 is estimated using the non-parametric bootstrap. Across $\gamma\in[0,2]$ , the minimal realized coverage was $89\%$ . The maximal realized coverage was $96\%$ . Averaged across all $\gamma\in[0,2]$ , the overall realized coverage was $93\%$ .

4.3 Experiment with proxy outcome

One of the most popular estimators for causal effects in experimental settings is difference-in-means. To improve variance, it is possible to adjust for pre-treatment covariates, see for example Lin (2013). This raises the question whether post-treatment covariates can be used to improve the precision of causal effect estimates. This is indeed the case under additional assumptions. For example, in some cases, the treatment effect can be written as the product

\theta^{(0)}=\mathbb{E}[Y|T=1]-\mathbb{E}[Y|T=0]=\theta_{T\rightarrow P}\cdot\theta_{P\rightarrow Y},

(11)

where $\theta_{T\rightarrow P}=\mathbb{E}[P|T=1]-\mathbb{E}[P|T=0]$ is the effect of the treatment on some surrogate or proxy outcome $P\in\{0,1\}$ ; and $\theta_{P\rightarrow Y}=\mathbb{E}[Y|P=1]-\mathbb{E}[Y|P=0]$ is the effect of the proxy on the outcome. It is well-known that estimators that make use of such decompositions can outperform the standard difference-in-means estimator in terms of asymptotic variance (Tsiatis, 2007; Athey et al., 2019; Guo and Perković, 2020). However, doing so can introduce bias if equation (11) does not hold. We will use the proposed model selection procedure to shrink between difference-in-means and an estimator that is unbiased if the treatment effect decomposition in equation (11) holds.

4.3.1 The data set

We consider a simple experimental setting with a post-treatment variable $P$ . For simplicity, let us consider an experiment with binary treatment $T\in\{0,1\}$ , a binary proxy outcome $P\in\{0,1\}$ and outcome $Y$ . We draw $200$ i.i.d. observations according to the following equations:

\displaystyle\begin{split}S\text{ }&\text{drawn from }\{1,2,3\}\text{ uniformly at random }\\ T&\sim\text{Ber}(.5)\\ \epsilon_{P},\epsilon_{Y}&\sim\mathcal{N}(0,1)\\ P(t)&=1_{\epsilon_{p}\leq t}\\ Y(t)&=P(t)+\gamma^{2}\left(.1t\cdot 1_{S=1}+.2t\cdot 1_{S=2}-.1t\cdot 1_{S=3}\right)+\epsilon_{Y}\end{split}

(12)

For $\gamma=0$ , the outcome $Y(T)$ is conditionally independent of the treatment, given the proxy $P(T)$ . In this case, the average treatment effect can be written in product form, $\theta^{(0)}=\theta_{T\rightarrow P}\cdot\theta_{P\rightarrow Y}$ , and this decomposition can be leveraged for estimation. For $\gamma\neq 0$ , this decomposition does not hold.

4.3.2 The estimators

The standard estimator to estimate causal effects from experiments is difference-in-means,

\hat{\theta}^{(0)}=\frac{1}{\sum T_{i}}\sum_{i:T_{i}=1}Y_{i}-\frac{1}{\sum(1-T_{i})}\sum_{i:T_{i}=0}Y_{i}.

(13)

If the proxy outcome is a valid surrogate, i.e. if

Y\perp T|P,

we can rewrite $\theta^{(0)}$ as

	$\displaystyle\theta^{(0)}$	$\displaystyle=\mathbb{E}[Y\|T=1]-\mathbb{E}[Y\|T=0]$
		$\displaystyle=\mathbb{E}[\mathbb{E}[Y\|P=1]P+\mathbb{E}[Y\|P=0](1-P)\|T=1]$
		$\displaystyle\qquad-\mathbb{E}[\mathbb{E}[Y\|P=1]P+\mathbb{E}[Y\|P=0](1-P)\|T=0]$
		$\displaystyle=\left(\mathbb{E}[Y\|P=1]-\mathbb{E}[Y\|P=0]\right)\left(\mathbb{E}[P\|T=1]-\mathbb{E}[P\|T=0]\right)$
		$\displaystyle=\theta_{T\rightarrow P}\cdot\theta_{P\rightarrow Y}.$

Thus, in this case, we can also consider the product estimator

\displaystyle\begin{split}\hat{\theta}^{(1)}&=\left(\frac{1}{\sum T_{i}}\sum_{i:T_{i}=1}P_{i}-\frac{1}{\sum(1-T_{i})}\sum_{i:T_{i}=0}P_{i}\right)\\ &\qquad\cdot\left(\frac{1}{\sum P_{i}}\sum_{i:P_{i}=1}Y_{i}-\frac{1}{\sum(1-P_{i})}\sum_{i:P_{i}=0}Y_{i}\right)\end{split}

(14)

On each subset $\{i:S_{i}=s\}$ we compute (13) and (14), yielding $\hat{\theta}_{s}^{(1)}$ and $\hat{\theta}_{s}^{(0)}$ for $s=1,2,3$ . We shrink between these two vectors, i.e. for $w\in\{1/10,\ldots,9/10\}$ we define

\hat{\theta}(w)=(1-w)\hat{\theta}^{(0)}+w\hat{\theta}^{(1)}.

4.3.3 Results

The mean-squared error of the estimator selected by targeted selection and cross-validation is depicted in Figure 3. We use 10-fold cross-validation. To improve the stability of cross-validation, we repeat the data splitting 10 times on randomly shuffled data. Results are averaged across $100$ simulation runs. Similarly as above, targeted selection performs similar or better than cross-validation. Targeted selection performs better than the difference-in-means for $\gamma<1$ . For $\gamma\geq 1$ , targeted selection approaches the performance of difference-in-means. We evaluated the realized coverage of confidence intervals with nominal coverage $95\%$ as described in Section 3.5. Equation 6 is estimated using the non-parametric bootstrap. Across $\gamma\in[0,1.2]$ , the minimal realized coverage was $93\%$ . The maximal realized coverage was $96\%$ . Averaged across all $\gamma\in[0,1]$ , the overall coverage was $94.6\%$ .

5 Conclusion

We have introduced a method that allows to conduct targeted parameter selection by estimating the bias and variance of candidate estimators. The theoretical justification of the method relies on a linear expansion of the estimator. The method is very general and can be used in both parametric and semi-parametric settings. Under regularity conditions, we showed that the proposed criterion has asymptotically equal or lower variance than competing procedures based on sample splitting. In addition, we showed that for $n\rightarrow\infty$ , the modified risk criterion selects models with lower or equal risk than the baseline estimator $\hat{\theta}^{(0)}$ . Furthermore, we derived asymptotically valid confidence intervals in low-dimensional settings.

In simulations, we showed that the method selects reasonable models and outperforms the cross-validation procedure in most scenarios. The proposed method can decrease variance if the competing estimators are approximately unbiased. However, there is no free lunch. In transitional regimes, for fixed $n$ , the estimator can perform worse than the baseline estimator $\hat{\theta}^{(0)}$ . This is to be expected from statistical theory, see for example Van der Vaart (2000, page 110). However, as the finite-sample bound and our simulations show, the excess risk goes to zero as the dimension of the target parameter grows.

The theoretical justification of the proposed method relies on a linear approximation of the estimator in a neighborhood of the parameter values $\theta^{(g)}$ . Thus, it would be important to understand the performance of the method in scenarios where parameter estimates of some of the estimators are far from the parameter values. In Section 4.2, we have seen some preliminary evidence that the proposed methodology may be used to combine knowledge across data sets. The proposed method is not tailored to this special case. Thus, we believe that it would be exciting to investigate whether the model selection can be further improved for data fusion tasks.

References

Akaike [1974] H. Akaike. A new look at the statistical model identification. In Selected Papers of Hirotugu Akaike, pages 215–222. Springer, 1974.
Angrist et al. [1996] J. Angrist, G. Imbens, and D. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
Arlot and Celisse [2010] S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics surveys, 2010.
Athey and Imbens [2016] S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 2016.
Athey et al. [2020] S. Athey, R. Chetty, and G. Imbens. Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676, 2020.
Athey et al. [2019] Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, 2019.
Bareinboim and Pearl [2016] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
Berk et al. [2013] R. Berk, L. Brown, A. Buja, K. Zhang, and L. Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
Bowden and Turkington [1990] R. Bowden and D. Turkington. Instrumental variables, volume 8. Cambridge university press, 1990.
Brookhart and Van Der Laan [2006] M. Brookhart and M. Van Der Laan. A semiparametric model selection criterion with applications to the marginal structural model. Computational statistics & data analysis, 50(2):475–498, 2006.
Claeskens and Hjort [2003] G. Claeskens and N. Hjort. The focused information criterion. Journal of the American Statistical Association, 2003.
Claeskens and Hjort [2008] G. Claeskens and N. Hjort. Model selection and model averaging. Cambridge University Press, 2008.
Crump et al. [2006] R. Crump, V. Hotz, G. Imbens, and O. Mitnik. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Technical report, National Bureau of Economic Research, 2006.
Cui and Tchetgen-Tchetgen [2019] Y. Cui and E. Tchetgen-Tchetgen. Bias-aware model selection for machine learning of doubly robust functionals. arXiv preprint arXiv:1911.02029, 2019.
Friedman et al. [2001] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer, 2001.
Gill and Levit [1995] Richard D Gill and Boris Y Levit. Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli, 1(1-2):59–79, 1995.
Guo and Perković [2020] F. Guo and E. Perković. Efficient least squares for estimating total effects under linearity and causal sufficiency. arXiv preprint arXiv:2008.03481, 2020.
Heckman et al. [1998] J. Heckman, H. Ichimura, and P. Todd. Matching as an econometric evaluation estimator. The review of economic studies, 65(2):261–294, 1998.
Hernan and Robins [2020] M. Hernan and J. Robins. Causal inference: What If. Chapman & Hall, 2020.
Huber [1967] P. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, 1967.
Imbens and Wooldridge [2009] G. Imbens and J. Wooldridge. Recent developments in the econometrics of program evaluation. Journal of economic literature, 47(1):5–86, 2009.
Jakobsen and Peters [2020] M. Jakobsen and J. Peters. Distributional robustness of k-class estimators and the PULSE. arXiv preprint arXiv:2005.03353, 2020.
Kapelner et al. [2014] A. Kapelner, J. Bleich, A. Levine, Zachary D. C., R. DeRubeis, and R. Berk. Inference for the effectiveness of personalized medicine with software. arXiv preprint arXiv:1404.7844, 2014.
LaLonde [1986] R. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620, 1986.
Lin [2013] W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1):295–318, 2013.
Nagar [1959] A. Nagar. The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica, pages 575–595, 1959.
Nie and Wager [2017] X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
Peters et al. [2016] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 78(5), 2016.
Powers et al. [2018] S. Powers, J. Qian, K. Jung, A. Schuler, N. Shah, T. Hastie, and R. Tibshirani. Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in medicine, 2018.
Robins et al. [1994] J. Robins, A. Rotnitzky, and L. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 1994.
Rolling and Yang [2014] C. Rolling and Y. Yang. Model selection for estimating treatment effects. Journal of the Royal Statistical Society: Series B, 2014.
Rosenbaum and Rubin [1983] P. Rosenbaum and D. Rubin. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B, 45(2):212–218, 1983.
Rothenhäusler et al. [2020] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229, 2020.
Rubin [1974] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974.
Schuler et al. [2018] A. Schuler, M. Baiocchi, R. Tibshirani, and N. Shah. A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146, 2018.
Schwarz [1978] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2), 1978.
Splawa-Neyman et al. [1990] J. Splawa-Neyman, D. Dabrowska, and T. Speed. On the application of probability theory to agricultural experiments. Statistical Science, pages 465–472, 1990.
Taylor and Tibshirani [2015] J. Taylor and R. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
Theil [1961] H. Theil. Economic forecasts and policy. North-Holland Pub. Co., 1961.
Tsiatis [2007] A. Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007.
Van der Laan and Robins [2003] M. Van der Laan and J. Robins. Unified methods for censored longitudinal data and causality. Springer, 2003.
Van der Vaart [2000] A. Van der Vaart. Asymptotic statistics. Cambridge university press, 2000.
Wainwright [2019] M. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
Wright [1928] P. Wright. Tariff on animal and vegetable oils. Macmillan Company, New York, 1928.
Zhao et al. [2017] Y. Zhao, X. Fang, and D. Simchi-Levi. Uplift modeling with multiple treatments and general response types. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.

Appendix A SUPPLEMENTARY MATERIAL

This supplementary material contains proofs for the theoretical results in the main paper.

A.1 Proof of Theorem 1

Proof.

First, note that

n(\hat{R}(g)-\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}])=\sum_{j=1}^{d}n(\hat{\theta}_{j}^{(g)}-\hat{\theta}_{j}^{(0)})^{2}-\hat{\tau}_{j}^{(g)}+\hat{\sigma}_{j}^{(g)}-n\mathbb{E}[(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})^{2}].

Thus, it is sufficient to show that for each $j$ with $\theta_{j}^{(0)}=\theta_{j}^{(g)}$

n(\hat{\theta}_{j}^{(g)}-\hat{\theta}_{j}^{(0)})^{2}-\hat{\tau}_{j}^{(g)}+\hat{\sigma}_{j}^{(g)}-n\mathbb{E}[(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})^{2}],

converges in distribution to a centered random variable and for each $j$ with $\theta_{j}^{(0)}\neq\theta_{j}^{(g)}$ ,

\sqrt{n}(\hat{\theta}_{j}^{(g)}-\hat{\theta}_{j}^{(0)})^{2}-\sqrt{n}\mathbb{E}[(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})^{2}],

converges in distribution to a centered random variable. Thus, without loss of generality, in the following we will assume that $d=1$ .
Case 1: $\theta^{(0)}=\theta^{(g)}$
As $\mathbb{E}[e_{g}(n)^{2}]=o(1/n)$ and as $\mathbb{E}[\hat{\theta}^{(g)}]-\theta^{(g)}=o(1/\sqrt{n})$ ,

\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]=\frac{1}{n}\text{Var}(\psi^{(g)})+o\left(\frac{1}{n}\right).

By assumption,

\hat{\tau}^{(g)}-\tau^{(g)}=o_{P}(1).

Similarly,

\hat{\sigma}^{(g)}-\sigma^{(g)}=o_{P}(1).

Combining these equations with the definition of $\hat{R}(g)$ ,

\displaystyle\hat{R}(g)

\displaystyle=(\frac{1}{n}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i}))^{2}-\frac{1}{n}\text{Var}(\psi^{(g)}-\psi^{(0)})+\frac{1}{n}\text{Var}(\psi^{(g)})+o_{P}(1/n).

Using the Central Limit Theorem, $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})$ converges to a Gaussian random variable with mean zero and variance $\text{Var}(\psi^{(g)}-\psi^{(0)})$ . Thus,

	$\displaystyle\qquad n\left(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]\right)$
	$\displaystyle=n\left(\frac{1}{n}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})\right)^{2}-\frac{1}{n}\text{Var}(\psi^{(g)}-\psi^{(0)})\right)+o_{P}(1)$
	$\displaystyle=\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})\right)^{2}-\text{Var}(\psi^{(g)}-\psi^{(0)})+o_{P}(1)$

converges in distribution to a random variable with mean zero. This concludes the proof of the case $\theta^{(g)}=\theta^{(0)}$ .
Case 2: $\theta^{(g)}\neq\theta^{(0)}$
Similarly as above, we can show that

\displaystyle\hat{R}(g)

\displaystyle=(\theta^{(g)}-\theta^{(0)})^{2}+2(\theta^{(g)}-\theta^{(0)})\frac{1}{n}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})+o_{P}(1/\sqrt{n}),

and

\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]=(\theta^{(g)}-\theta^{(0)})^{2}+o(1/\sqrt{n}).

Thus,

\sqrt{n}(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}])=2(\theta^{(g)}-\theta^{(0)})\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})+o_{P}(1)

(15)

Using the CLT, $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})$ converges in distribution to a centered Gaussian random variable with variance $\text{Var}(\psi^{(g)}(D_{1})-\psi^{(0)}(D_{1}))$ . Using this fact in equation (15) concludes the proof.

∎

A.2 Proof of Theorem 2

Let $S_{k}=\{i:D_{i}\in D^{1,k}\}$ . First, note that

\displaystyle\tilde{R}(g)=\sum_{j=1}^{d}\xi_{j}+r_{n},

where

\xi_{j}=\frac{1}{K}\sum_{k}\left(\theta_{j}^{(g)}-\theta_{j}^{(0)}+\frac{1}{|S_{k}|}\sum_{i\in S_{k}}\psi_{j}^{(g)}(D_{i})-\frac{1}{n-|S_{k}|}\sum_{i\not\in S_{k}}\psi_{j}^{(0)}(D_{i})\right)^{2}.

and where $r_{n}=o_{P}(1/n)$ if $\theta^{(g)}=\theta^{(0)}$ and $r_{n}=o_{P}(1/\sqrt{n})$ if $\theta^{(g)}\neq\theta^{(0)}$ . Furthermore, using Assumption 1,

\mathbb{E}[\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}]=\sum_{j=1}^{d}(\theta_{j}^{(g)}-\theta_{j}^{(0)})^{2}+\frac{1}{n}\text{Var}(\psi_{j}^{(g)})+o(1/n)

Thus, it is sufficient to show the following: For each $j$ with $\theta_{j}^{(0)}=\theta_{j}^{(g)}$

n\left(\xi_{j}-\frac{K}{n(K-1)}\text{Var}(\psi_{j}^{(g)})-\frac{K}{n}\text{Var}(\psi_{j}^{(0)})\right)

converges in distribution to a centered random variable and for each $j$ with $\theta_{j}^{(0)}\neq\theta_{j}^{(g)}$ ,

\sqrt{n}\left(\xi_{j}-(\theta_{j}^{(g)}-\theta_{j}^{(0)})^{2}-\frac{1}{n}\text{Var}(\psi_{j}^{(g)})\right)

converges in distribution to a centered random variable. In the following, without loss of generality we will assume that $d=1$ .

Proof.

Case 1: $\theta^{(g)}=\theta^{(0)}$ . Recall that $|S_{k}|\sim n(K-1)/K$ . Using the CLT, for every $k=1,\ldots,K$ ,

\sqrt{n}\left(\frac{1}{|S_{k}|}\sum_{i\in S_{k}}\psi^{(g)}(D_{i})-\frac{1}{n-|S_{k}|}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})\right)

converges to a centered Gaussian random variable with variance

\frac{1}{1-\alpha}\text{Var}(\psi^{(g)}(D_{1}))+\frac{1}{\alpha}\text{Var}(\psi^{(0)}(D_{1})),

where $\alpha=\frac{1}{K}$ . Hence,

n\xi=\frac{n}{K}\sum_{k}\left(\frac{1}{|S_{k}|}\sum_{i\in S_{k}}\psi^{(g)}(D_{i})-\frac{1}{n-|S_{k}|}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})\right)^{2}

converges to a random variable with asymptotic mean

	$\displaystyle\frac{1}{(1-\alpha)}\text{Var}(\psi^{(g)}(D_{1}))+\frac{1}{\alpha}\text{Var}(\psi^{(0)}(D_{1}))$
	$\displaystyle=\frac{K}{(K-1)}\text{Var}(\psi^{(g)}(D_{1}))+K\text{Var}(\psi^{(0)}(D_{1}))$

This concludes the proof of case 1.
Case 2: $\theta^{(g)}\neq\theta^{(0)}$
Similarly as above, we can show that

\displaystyle\xi

\displaystyle=(\theta^{(g)}-\theta^{(0)})^{2}+2(\theta^{(g)}-\theta^{(0)})\frac{1}{n}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})+o_{P}(1/\sqrt{n}).

Thus,

\sqrt{n}\left(\xi-(\theta^{(g)}-\theta^{(0)})^{2}-\frac{1}{n}\text{Var}(\psi^{(g)})\right)=2(\theta^{(g)}-\theta^{(0)})\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})+o_{P}(1).

(16)

Using the CLT, $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})$ converges to a centered Gaussian random variable with variance $\text{Var}(\psi^{(g)}(D_{1})-\psi^{(0)}(D_{1}))$ . Using this fact in equation (16) concludes the proof. ∎

A.3 Proof of Theorem 3

Proof.

First, we show that it is sufficient to show the statement in the one-dimensional case, i.e. for $d=1$ . Define $\hat{\eta}^{(g)}=O\hat{\theta}^{(g)}$ and $\eta^{(g)}=O\theta^{(g)}$ , where the orthonormal matrix $O\in\mathbb{R}^{d\times d}$ is chosen such that the asymptotic covariance of $\sqrt{n}(\hat{\eta}^{(g)}-\hat{\eta}^{(0)}-\eta^{(g)}+\eta^{(0)})$ is diagonal. By Assumption 1, $\sqrt{n}(\hat{\eta}^{(g)}-\eta^{(g)})$ converges to a Gaussian random variable with a covariance matrix that is positive definite. Let $\Sigma^{(g)}$ denote the asymptotic covariance of $\sqrt{n}(\hat{\theta}^{(g)}-\theta^{(g)})$ . Define $\tau_{j}^{(g,\eta)}$ as the asymptotic standard deviation of $\sqrt{n}(\hat{\eta}_{j}^{(g)}-\hat{\eta}_{j}^{(0)}-\eta_{j}^{(g)}+\eta_{j}^{(0)})$ and $\sigma_{j}^{(g,\eta)}$ as the asymptotic standard deviation of $\sqrt{n}(\hat{\eta}_{j}^{(g)}-\eta_{j}^{(g)})$ . Similarly let $\Sigma^{(g,\eta)}$ denote the asymptotic covariance of $\sqrt{n}(\hat{\eta}^{(g)}-\eta^{(g)})$ . Now note that

	$\displaystyle\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}$
	$\displaystyle=\text{Trace}(\Sigma^{(g)})$
	$\displaystyle=\text{Trace}(\Sigma^{(g)}O^{\intercal}O)$
	$\displaystyle=\text{Trace}(O\Sigma^{(g)}O^{\intercal})$
	$\displaystyle=\text{Trace}(\Sigma^{(g,\eta)})$
	$\displaystyle=\sum_{j=1}^{d}(\sigma_{j}^{(g,\eta)})^{2}.$

Analogously,

\sum_{j=1}^{d}(\tau_{j}^{(g)})^{2}=\sum_{j=1}^{d}(\tau_{j}^{(g,\eta)})^{2}.

Thus,

	$\displaystyle\hat{R}(g)$	$\displaystyle=\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\hat{\tau}^{(g)}\\|_{2}^{2}+\frac{1}{n}\\|\hat{\sigma}^{(g)}\\|_{2}^{2}$
		$\displaystyle=\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\tau^{(g)}\\|_{2}^{2}+\frac{1}{n}\\|\sigma^{(g)}\\|_{2}^{2}+o_{P}(1/n)$
		$\displaystyle=\\|\hat{\eta}^{(g)}-\hat{\eta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\tau^{(g,\eta)}\\|_{2}^{2}+\frac{1}{n}\\|\sigma^{(g,\eta)}\\|_{2}^{2}+o_{P}(1/n).$

Now, by construction $\sqrt{n}(\hat{\eta}^{(g)}-\hat{\eta}^{(0)}-\eta^{(g)}+\eta^{(0)})$ converges to a Gaussian with diagonal covariance matrix. Thus, the components of $\sqrt{n}(\hat{\eta}^{(g)}-\hat{\eta}^{(0)}-\eta^{(g)}+\eta^{(0)})$ are asymptotically independent. Thus, the asymptotic mean and variance of $\hat{R}(g)$ is equal to the sums of asymptotic means and asymptotic variances of

\hat{R}_{j}(g):=(\hat{\eta}_{j}^{(g)}-\hat{\eta}_{j}^{(0)})^{2}-\frac{1}{n}(\tau_{j}^{(g,\eta)})^{2}+\frac{1}{n}(\sigma_{j}^{(g,\eta)})^{2}.

Similarly, note that

	$\displaystyle\tilde{R}(g)$	$\displaystyle=\frac{1}{K}\sum_{k=1}^{K}\\|\hat{\theta}^{(g)}(D^{1,k})-\hat{\theta}^{(0)}(D^{0,k})\\|_{2}^{2}$
		$\displaystyle=\frac{1}{K}\sum_{k=1}^{K}\\|\hat{\eta}^{(g)}(D^{1,k})-\hat{\eta}^{(0)}(D^{0,k})\\|_{2}^{2}.$

As the components of $\sqrt{n}(\hat{\eta}^{(g)}-\hat{\eta}^{(0)}-\eta^{(g)}+\eta^{(0)})$ are asymptotically independent, the asymptotic mean and variance of $\tilde{R}(g)$ is equal to the sums of asymptotic means and asymptotic variances of

\tilde{R}_{j}:=\frac{1}{K}\sum_{k=1}^{K}(\hat{\eta}_{j}^{(g)}(D^{1,k})-\hat{\eta}_{j}^{(0)}(D^{0,k}))^{2}.

Thus, in the following without loss of generality we will focus on the case $d=1$ and $\hat{\eta}^{(g)}=\hat{\theta}^{(g)}$ , $\eta^{(g)}=\theta^{(g)}$ . We will now consider two cases.
Case 1: $\theta^{(g)}=\theta^{(0)}$
Let $S_{k}=\{i:D_{i}\in D^{1,k}\}$ . Inspecting the proof of Theorem 1 we obtain that

	$\displaystyle\qquad n\left(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]\right)$
	$\displaystyle=\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})\right)^{2}-\text{Var}(\psi^{(g)}-\psi^{(0)})+o_{P}(1).$

Multiplying with $\alpha=1/K$ , we can rewrite this as

	$\displaystyle\qquad\frac{n}{K}\left(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]\right)$
	$\displaystyle=\left(\frac{1}{\sqrt{K}\sqrt{n}}\sum_{i=1}^{n}\psi^{(g)}(D_{i})-\psi^{(0)}(D_{i})\right)^{2}-\frac{1}{K}\text{Var}(\psi^{(g)}-\psi^{(0)})+o_{P}(1).$

Writing $Z_{k}=\frac{1}{\sqrt{\alpha n}}\sum_{i\not\in S_{k}}\psi^{(g)}(D_{i})$ and $X_{k}=\frac{1}{\sqrt{\alpha n}}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})$ and using $\alpha=\frac{1}{K}$ we obtain

	$\displaystyle\qquad\frac{n}{K}\left(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]\right)$
	$\displaystyle=\left(\frac{1}{K}\sum_{k=1}^{K}Z_{k}-X_{k}\right)^{2}-\frac{1}{K}\text{Var}(\psi^{(g)}-\psi^{(0)})+o_{P}(1).$

Similarly, by inspecting the proof of Theorem 2,

	$\displaystyle n(\tilde{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}])=$
	$\displaystyle n\left(\frac{1}{K}\sum_{k}\left(\frac{1}{\|S_{k}\|}\sum_{i\in S_{k}}\psi^{(g)}(D_{i})-\frac{1}{n-\|S_{k}\|}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})\right)^{2}-\frac{1}{n}\text{Var}(\psi^{(g)})\right)+o_{P}(1).$

We have $(n-|S_{k}|)\sim\alpha n$ , and $|S_{k}|\sim(K-1)\alpha n$ . Thus, multiplying with $\alpha=1/K$ we can rewrite this as

	$\displaystyle\frac{n}{K}(\tilde{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}])=$
	$\displaystyle\alpha n\Bigg{(}\frac{1}{K}\sum_{k}\left(\frac{1}{K-1}\frac{1}{\alpha n}\sum_{i\in S_{k}}\psi^{(g)}(D_{i})-\frac{1}{\alpha n}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})\right)^{2}$
	$\displaystyle-\frac{1}{n}\text{Var}(\psi^{(g)})\Bigg{)}+o_{P}(1).$
	$\displaystyle=\frac{1}{K}\sum_{k}\left(\frac{1}{K-1}\frac{1}{\sqrt{\alpha n}}\sum_{i\in S_{k}}\psi^{(g)}(D_{i})-\frac{1}{\sqrt{\alpha n}}\sum_{i\not\in S_{k}}\psi^{(0)}(D_{i})\right)^{2}$
	$\displaystyle-\frac{1}{K}\text{Var}(\psi^{(g)})+o_{P}(1)$
	$\displaystyle=\frac{1}{K}\sum_{k}\left(\frac{1}{K-1}\sum_{j\neq k}Z_{j}-X_{k}\right)^{2}-\frac{1}{K}\text{Var}(\psi^{(g)})+o_{P}(1).$

Thus, to complete the proof, we have to show that the asymptotic variance of the first term is larger than the asymptotic variance of the second term:

	$\displaystyle\frac{n}{K}(\tilde{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}])$
	$\displaystyle=\frac{1}{K}\sum_{k}\left(\frac{1}{K-1}\sum_{j\neq k}Z_{j}-X_{k}\right)^{2}-\frac{1}{K}\text{Var}(\psi^{(g)})+o_{P}(1)$
	$\displaystyle\frac{n}{K}\left(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]\right)$
	$\displaystyle=\left(\frac{1}{K}\sum_{k=1}^{K}Z_{k}-X_{k}\right)^{2}-\frac{1}{K}\text{Var}(\psi^{(g)}-\psi^{(0)})+o_{P}(1).$

Using the CLT, for $n\rightarrow\infty$ and $K$ fixed, $(X_{1},\ldots,X_{K},Z_{1},\ldots,Z_{K})$ converge to a centered multivariate Gaussian vector with non-degenerate variance. Hence, without loss of generality in the following we can assume that the vector

(X_{1},\ldots,X_{K},Z_{1},\ldots,Z_{K})

is multivariate Gaussian. Recall that from Assumption 1 it follows that $|\text{Cor}(\psi^{(g)},\psi^{(0)})|\neq 1$ . Thus we also have $|\text{Cor}(Z_{j},X_{j})|\neq 1$ . Using Lemma 1 completes the proof of case 1.

Case 2: $\theta^{(g)}\neq\theta^{(0)}$

Inspecting the proofs of Theorem 1 and Theorem 2, we obtain that in the case $\theta^{(g)}=\theta^{(0)}$ , the asymptotic variance of $\sqrt{n}(\hat{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}]$ is

4(\theta^{(g)}-\theta^{(0)})^{2}\text{Var}(\psi^{(g)}(D_{1})-\psi^{(0)}(D_{1})),

(17)

which is the same as the asymptotic variance of $\sqrt{n}(\tilde{R}(g)-\mathbb{E}[(\hat{\theta}^{(g)}-\theta^{(0)})^{2}])$ . This concludes the proof. ∎

A.4 Proof of Corollary 1

Proof.

By assumption, we have $\hat{\theta}^{(g)}-\hat{\theta}^{(0)}=\theta^{(g)}-\theta^{(0)}+O_{P}(1/\sqrt{n})$ . If $\theta^{(g)}-\theta^{(0)}\not\equiv(0,\ldots,0)$ , then

\hat{R}^{\text{mod}}(g)=\|\theta^{(g)}-\theta^{(0)}\|_{2}^{2}+O_{P}(1/\sqrt{n}),

with $\|\theta^{(g)}-\theta^{(0)}\|_{2}^{2}>0$ . On the other hand, if $\theta^{(g)}=\theta^{(0)}$ ,

\hat{R}^{\text{mod}}(g)=O_{P}(1/n).

Thus,

\mathbb{P}[\theta^{(\bar{g})}=\theta^{(0)}]\rightarrow 1.

Now consider any $g$ with $\theta^{(g)}=\theta^{(0)}$ and $\sum_{j}\text{Var}(\psi_{j}^{(g)})>\sum_{j}\text{Var}(\psi_{j}^{(0)})$ . Then, using Assumption 1,

\hat{R}^{\text{mod}}(g)\geq\sum_{j}\frac{1}{n}\text{Var}(\psi_{j}^{(g)})+o_{P}(1/n),

and

\hat{R}^{\text{mod}}(0)=\sum_{j}\frac{1}{n}\text{Var}(\psi_{j}^{(0)}))+o_{P}(1/n).

Recall that by assumption $\sum_{j}\text{Var}(\psi_{j}^{(g)})>\sum_{j}\text{Var}(\psi_{j}^{(0)})$ . Thus, $\mathbb{P}[\hat{R}^{\text{mod}}(0)<\hat{R}^{\text{mod}}(g)]\rightarrow 1$ for $n\rightarrow\infty$ . As this holds for all $g$ with $\sum_{j}\text{Var}(\psi_{j}^{(g)})>\sum_{j}\text{Var}(\psi_{j}^{(0)})$ for $n\rightarrow\infty$ , this concludes the proof. ∎

A.5 Proof of Theorem 4

Proof.

We will first prove the first statement. Note that if $\hat{\Sigma}$ is positive definite, $\beta\mapsto b_{D_{1},\ldots,D_{n}}(\beta)$ is continuous and strictly increasing and $\lim_{\beta\rightarrow\infty}b_{D_{1},\ldots,D_{n}}(\beta)=1$ , $\lim_{\beta\rightarrow-\infty}b_{D_{1},\ldots,D_{n}}(\beta)=0$ . As $\hat{\Sigma}$ converges in probability to a positive definite matrix, the inverse $b^{-1}_{D_{1},\ldots,D_{n}}:(0,1)\rightarrow\mathbb{R}$ is well-defined, except on an event with vanishing probability as $n\rightarrow\infty$ .

Let us now turn to the second statement. We use the decomposition

\displaystyle\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq\beta]=\sum_{g}\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})\leq\beta;\bar{g}=g].

(18)

Define the event

	$\displaystyle A_{g}=\{\bar{g}=g\}$
	$\displaystyle=\{\max(n\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\\|\hat{\tau}^{(g)}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}<\min_{g^{\prime}\neq g}\max(n\\|\hat{\theta}^{(g^{\prime})}-\hat{\theta}^{(0)}\\|_{2}^{2}-\\|\hat{\tau}^{(g^{\prime})}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g^{\prime})}\\|_{2}^{2}\}.$

It is crucial to understand how this event behaves in the limit. $\sqrt{n}(\hat{\theta}-\theta)$ converges to a centered Gaussian distribution with covariance matrix $\Sigma$ . Thus, define

A_{g}^{lim}=\max(\|Z_{g}^{0}-Z_{0}^{0}\|_{2}^{2}-\|\tau^{(g)}\|_{2}^{2},0)+\|\sigma^{(g)}\|_{2}^{2}<\min_{g^{\prime}\neq g;\theta^{(g^{\prime})}=\theta^{(0)}}\max(\|Z_{g^{\prime}}^{0}-Z_{0}^{0}\|_{2}^{2}-\|\tau^{(g^{\prime})}\|_{2}^{2},0)+\|\sigma^{(g)^{\prime}}\|_{2}^{2},

where $(Z_{0}^{0},\ldots,Z_{G}^{0})\sim\mathcal{N}(0,\Sigma)$ , independent of the data $\{D_{j}\}_{j\in\mathbb{N}}$ . Before we investigate the large-sample behaviour of equation (18), let us fix some notation.

	$\displaystyle f_{g}(\beta)$	$\displaystyle:=\mathbb{P}[\{\sqrt{n}(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})\leq\beta\}\cap A_{g}]$
	$\displaystyle f_{g}^{lim}(\beta)$	$\displaystyle:=\begin{cases}\mathbb{P}[\{(Z_{g}^{0})_{j}\leq\beta\}\cap A_{g}^{lim}]&\text{ if }\theta^{(g)}=\theta^{(0)},\\ 0&\text{ else}.\end{cases}$
	$\displaystyle f_{g}^{comp}(\beta,D_{1},\ldots,D_{n})$	$\displaystyle:=\mathbb{P}[Z_{g,j}-\sqrt{n}\hat{\theta}_{j}^{(0)}\leq\beta;\max(\\|Z_{g}-Z_{0}\\|_{2}^{2}-\\|\hat{\tau}^{(g)}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}$
		$\displaystyle\qquad<\min_{g^{\prime}\neq g}\max(\\|Z_{g}-Z_{0}\\|_{2}^{2}-\\|\hat{\tau}^{(g)}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}\|D_{1},\ldots,D_{n}],$

where $(Z_{0},\ldots,Z_{G})\sim\mathcal{N}(\sqrt{n}\hat{\theta}(D),\hat{\Sigma}(D))$ , conditionally on the data $D$ . In the first step, let us consider $g$ for which $\theta^{(g)}=\theta^{(0)}$ . Recall that $\hat{\sigma}^{(g)}\rightarrow\sigma^{(g)}$ , $\hat{\tau}^{(g)}\rightarrow\tau^{(g)}$ , and $\hat{\Sigma}\rightarrow\Sigma$ . Also, $\sqrt{n}(\hat{\theta}-\theta)$ converges weakly to a distribution that is equal to the distribution of $Z^{0}$ . By weak convergence, for all $\beta$ ,

\lim_{n}f_{g}(\beta)=f_{g}^{lim}(\beta)=\lim_{n}f_{g}^{comp}(\beta,D_{1},\ldots,D_{n}),

(19)

where the limit on the right-hand side is in probability. Now, let us focus on $g$ for which $\theta^{(g)}\neq\theta^{(0)}$ . Note that for all $g$ with $\theta^{(g)}\neq\theta^{(0)}$ , $n\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}\rightarrow\infty$ . Thus, for all $g$ with $\theta^{(g)}\neq\theta^{(0)}$ ,

	$\displaystyle\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})\leq\beta;\bar{g}=g]$
	$\displaystyle\leq\mathbb{P}[\max(n\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\\|\hat{\tau}^{(g)}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}<\min_{g^{\prime}\neq g}\max(n\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\\|\hat{\tau}^{g}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}]$
	$\displaystyle\leq\mathbb{P}[\max(n\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\\|\hat{\tau}^{(g)}\\|_{2}^{2},0)+\\|\hat{\sigma}^{(g)}\\|_{2}^{2}<\\|\hat{\sigma}^{(0)}\\|_{2}^{2}].$

Since $\hat{\sigma}^{(0)}\rightarrow\sigma^{(0)}$ and ${\hat{\sigma}^{(g)}}\rightarrow\sigma^{(g)}$ and $\hat{\tau}^{(g)}\rightarrow\tau^{(g)}$ and $n\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}\rightarrow\infty$ , for all $g$ with $\theta^{(g)}\neq\theta^{(0)}$ we have $\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(g)}-\theta_{j}^{(0)})\leq\beta;\bar{g}=g]\rightarrow 0$ . Thus for all $g$ with $\theta^{(g)}\neq\theta^{(0)}$ , $f_{g}(\beta)\rightarrow 0$ . Analogously it can be shown that $f_{g}^{comp}(\beta,D)\rightarrow 0$ for all $g$ with $\theta^{(g)}\neq\theta^{(0)}$ . Thus, for $g$ with $\theta^{(g)}\neq\theta^{(0)}$ ,

\lim_{n}f_{g}(\beta)=f_{g}^{lim}(\beta)=\lim_{n}f_{g}^{comp}(\beta),

(20)

where the limit on the right-hand side is in probability. By equation (19) and equation (20), for all $g$ ,

\lim_{n}f_{g}(\beta)=f_{g}^{lim}(\beta)=\lim_{n}f_{g}^{comp}(\beta,D_{1},\ldots,D_{n}),

Now note that as $\Sigma$ is positive definite, $(Z_{0}^{0},\ldots,Z_{G}^{0})$ has positive density on $\mathbb{R}^{dG}$ and thus

\beta\mapsto f_{0}^{lim}(\beta)

is strictly increasing. By definition, for all $g=1,\ldots,G$ , $\beta\mapsto f_{g}^{lim}(\beta)$ is non-decreasing. Thus, the function

\beta\mapsto\sum_{g}f_{g}^{lim}(\beta)

is strictly increasing. Note that by construction $\{A_{g}^{lim}:\theta^{(g)}=\theta^{(0)}\}$ form a disjoint partition of the sample space. Thus, using the definition of $f_{g}^{lim}$ ,

\lim_{\beta\rightarrow\infty}\sum_{g}f_{g}^{lim}(\beta)=1\text{ and }\lim_{\beta\rightarrow-\infty}\sum_{g}f_{g}^{lim}(\beta)=0.

Similarly, by definition, $\beta\mapsto\sum_{g}f_{g}(\beta)$ and $\beta\mapsto\sum_{g}f_{g}^{comp}(\beta)$ are increasing with $0\leq\sum_{g}f_{g}(\beta)\leq 1$ and $0\leq\sum_{g}f_{g}^{comp}(\beta)\leq 1$ . Invoking Polya’s theorem with equation (20), $\sup_{\beta}|f_{g}(\beta)-f_{g}^{lim}(\beta)|\rightarrow 0$ in probability. Analogously, $\sup_{\beta}|f_{g}^{comp}(\beta,D_{1},\ldots,D_{n})-f_{g}^{lim}(\beta)|$ converges to zero in probability.

Consider a sequence $n\mapsto c_{n}:=b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)$ such that

\sum_{g}f_{g}^{comp}(c_{n},D_{1},\ldots,D_{n})=1-\alpha/2.

Since $f_{g}^{comp}(\beta)$ converges to $f_{g}^{lim}(\beta)$ uniformly, we have that $\sum_{g}f_{g}^{lim}(c_{n})\rightarrow 1-\alpha/2$ in probability. Since $\beta\mapsto\sum_{g}f_{g}^{lim}(\beta)$ is strictly increasing and continuous, $c_{n}$ converges in probability to the unique $c^{0}\in\mathbb{R}$ with $\sum_{g}f_{g}^{lim}(c^{0})=1-\alpha/2$ . Thus, for all $\epsilon>0$ ,

	$\displaystyle\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]$
	$\displaystyle=\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq c_{n}]$
	$\displaystyle\leq\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq c^{0}+\epsilon]+\mathbb{P}[\|c_{n}-c^{0}\|\geq\epsilon]$
	$\displaystyle=\sum_{g}f_{g}(c^{0}+\epsilon)+o(1).$

Now, letting $\epsilon\rightarrow 0$ and using that $\beta\mapsto f_{g}^{lim}(\beta)$ is continuous and $\sup_{\beta}|f_{g}(\beta)-f_{g}^{lim}(\beta)|\rightarrow 0$ ,

\lim\sup_{n\rightarrow\infty}\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]\leq\sum_{g}f_{g}^{lim}(c^{0})=1-\alpha/2.

(21)

Analogously,

	$\displaystyle\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]$
	$\displaystyle=\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq c_{n}]$
	$\displaystyle\geq\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq c^{0}-\epsilon]+\mathbb{P}[\|c_{n}-c^{0}\|\geq\epsilon]$
	$\displaystyle=\sum_{g}f_{g}(c^{0}-\epsilon)+o(1).$

Thus,

\displaystyle\lim\inf_{n\rightarrow\infty}\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]\geq\sum_{g}f_{g}^{lim}(c^{0})=1-\alpha/2.

(22)

Combining equation (21) and equation (22),

\lim_{n\rightarrow\infty}\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]=1-\alpha/2.

Analogously, it can be shown that

\lim_{n\rightarrow\infty}\mathbb{P}[\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\geq b^{-1}_{D_{1},\ldots,D_{n}}(\alpha/2)]=1-\alpha/2.

Thus,

\lim_{n\rightarrow\infty}\mathbb{P}[b^{-1}_{D_{1},\ldots,D_{n}}(\alpha/2)\leq\sqrt{n}(\hat{\theta}_{j}^{(\bar{g})}-\theta_{j}^{(0)})\leq b^{-1}_{D_{1},\ldots,D_{n}}(1-\alpha/2)]=1-\alpha.

This completes the proof.

∎

A.6 Proof of Theorem 5

Proof.

First, with probability exceeding $1-\operatorname{exp}\left(-t\right)$ ,

	$\displaystyle\,\,\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{n}{d}\\|\delta^{(g)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}\right\|$
	$\displaystyle=\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-\delta^{(g)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}+\frac{2n}{d}(\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-\delta^{(g)})\cdot\delta^{(g)}\right\|$
	$\displaystyle=\left\|\frac{n}{d}\\|\epsilon^{(g)}-\epsilon^{(0)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}+\frac{2n}{d}(\epsilon^{(g)}-\epsilon^{(0)})\cdot\delta^{(g)}\right\|$
	$\displaystyle\leq C_{1}(c_{0},\eta)\frac{t}{d}+C_{2}(c_{0},\eta)\sqrt{\frac{t}{d}},$

for some constants $C_{1}(c_{0},\eta)$ and $C_{2}(c_{0},\eta)$ . Here, we used sub-Gaussian and subexponential tail bounds, see for example Chapter 2 in Wainwright [2019]. More precisely, we used that $\frac{n}{d}(\epsilon^{(g)}-\epsilon^{(0)})\cdot\delta^{(g)}$ is sub-Gaussian with variance proxy $\frac{4}{d}c_{0}^{2}\max_{g,j}(\eta_{j}^{(g)})^{2}$ and that $n\|\epsilon^{(g)}-\epsilon^{(0)}\|_{2}^{2}-(\tau^{(g)})^{2}$ is subexponential with parameter $64d\max_{g,j}(\eta_{j}^{(g)})^{2}$ . Using a union bound, for every $\kappa>0$ there exists a constant $C_{3}(c_{0},\eta,\kappa)$ , such that with probability exceeding $1-\kappa$ ,

\sup_{g}\left|\frac{n}{d}\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}-\frac{n}{d}\|\delta^{(g)}\|_{2}^{2}-\frac{1}{d}\|\tau^{(g)}\|_{2}^{2}\right|\leq C_{3}\left(\sqrt{\frac{\log G}{d}}+\frac{\log G}{d}\right)

Since $\log(G)/d\leq c_{1}$ , for all $\kappa>0$ there exists a constant $C_{4}(c_{0},c_{1},\eta,\kappa)$ such that with probability exceeding $1-\kappa/2$ ,

\sup_{g}\left|\frac{n}{d}\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\|_{2}^{2}-\frac{n}{d}\|\delta^{(g)}\|_{2}^{2}-\frac{1}{d}\|\tau^{(g)}\|_{2}^{2}\right|\leq C_{4}\sqrt{\frac{\log G}{d}}

(23)

Analogously, it can be shown that there exists a constant $C_{5}$ such that with probability exceeding $1-\kappa/2$ ,

\displaystyle\begin{split}&\,\,\sup_{g}\left|\frac{n}{d}\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}-\frac{n}{d}\|\delta^{(g)}\|_{2}^{2}-\frac{1}{d}\|\sigma^{(g)}\|_{2}^{2}\right|\\ &=\sup_{g}\left|\frac{n}{d}\|\epsilon^{(g)}\|_{2}^{2}-\frac{1}{d}\|\sigma^{(g)}\|_{2}^{2}+\frac{2n}{d}\epsilon^{(g)}\cdot\delta^{(g)}\right|\\ &\leq C_{5}\sqrt{\frac{\log G}{d}}.\end{split}

(24)

Combining equation (23) and equation (24), with probability $1-\kappa$ ,

	$\displaystyle\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\frac{n}{d}\hat{R}^{\text{mod}}(g)\right\|$
	$\displaystyle\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\max\left(\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{d}\sum_{j=1}^{d}(\hat{\tau}_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\hat{\sigma}_{j}^{(g)})^{2}\right\|$
	$\displaystyle\leq 2\iota_{n,d}+\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\max\left(\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{d}\sum_{j=1}^{d}(\tau_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}\right\|$
	$\displaystyle\leq 2\iota_{n,d}+C_{6}\sqrt{\frac{\log G}{d}}+\sup_{g}\bigg{\|}\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}+\frac{n}{d}\\|\theta^{(g)}-\theta^{(0)}\\|_{2}^{2}$
	$\displaystyle-\max\left(\frac{n}{d}\\|\theta^{(g)}-\theta^{(0)}\\|_{2}^{2}+\frac{1}{d}\sum_{j=1}^{d}(\tau_{j}^{(g)})^{2}-(\tau_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}\bigg{\|}$
	$\displaystyle\leq 2\iota_{n,d}+C_{6}\sqrt{\frac{\log G}{d}},$

for some constant $C_{6}$ that depends on $c_{0}$ , $c_{1}$ , $\eta$ , and $\kappa$ . Thus, on that event,

\left|\min_{g}\frac{n}{d}\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}-\frac{n}{d}\min_{g}\hat{R}^{\text{mod}}(g)\right|\leq 2C_{6}\sqrt{\frac{\log G}{d}}+4\iota_{n,d}.

Furthermore, on that event,

\left|\frac{n}{d}\hat{R}^{\text{mod}}(\bar{g})-\frac{n}{d}\|\hat{\theta}^{(\bar{g})}-\theta^{(0)}\|_{2}^{2}\right|\leq C_{6}\sqrt{\frac{\log G}{d}}+2\iota_{n,d}.

Thus,

\frac{n}{d}\|\hat{\theta}^{(\bar{g})}-\theta^{(0)}\|_{2}^{2}\leq\min_{g}\frac{n}{d}\|\hat{\theta}^{(g)}-\theta^{(0)}\|_{2}^{2}+3C_{6}\sqrt{\frac{\log G}{d}}+6\iota_{n,d}.

Here, we used that $\min_{g}\hat{R}^{\text{mod}}(g)=\hat{R}^{\text{mod}}(\bar{g})$ . ∎

A.7 Auxiliary Lemmas

Lemma 1.

Let $(Z_{i},X_{i})$ be i.i.d. Gaussian with mean zero and nonzero variance and $|\text{Cor}(Z_{i},X_{i})|\neq 1$ . Let $K\geq 2$ . Then,

\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}Z_{i}-X_{i}\right)^{2}\right)<\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-X_{i}\right)^{2}\right).

(25)

Proof.

As $(Z_{i},X_{i})$ are multivariate Gaussian and $|\text{Cor}(Z_{i},X_{i})|\neq 1$ , $X_{i}=\alpha Z_{i}+\epsilon_{i}$ , for some $\alpha\in\mathbb{R}$ , where $(\epsilon_{i})_{i}$ is centered Gaussian and independent of $(Z_{i})_{i}$ with nonzero variance. Thus, it suffices to show that

\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}Z_{i}-\alpha Z_{i}-\epsilon_{i}\right)^{2}\right)<\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}-\epsilon_{i}\right)^{2}\right).

(26)

On the left-hand side of equation (26), we have

\displaystyle\begin{split}&\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}Z_{i}-\alpha Z_{i}\right)^{2}-2\frac{1}{K}\sum_{i=1}^{K}(Z_{i}-\alpha Z_{i})\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}+\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)^{2}\right)\\ &=\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}Z_{i}-\alpha Z_{i}\right)^{2}\right)+4\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}(Z_{i}-\alpha Z_{i})\right)\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)\\ &+\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)^{2}\right)\end{split}

On the right-hand side of equation (26) we have

\displaystyle\begin{split}&\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)^{2}-\frac{2}{K}\sum_{i=1}^{K}\epsilon_{i}\frac{1}{(K-1)}\sum_{j\neq i}Z_{j}-\alpha Z_{i}+\frac{1}{K}\sum_{i=1}^{K}\left(\epsilon_{i}\right)^{2}\right)\\ &=\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)^{2}\right)+4\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\frac{1}{(K-1)}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)\\ &\qquad+\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\epsilon_{i}\right)^{2}\right).\end{split}

Thus, it suffices to show that the following three inequalities hold:

\displaystyle\begin{split}\text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}Z_{i}-\alpha Z_{i}\right)^{2}\right)&\leq\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)^{2}\right)\\ \text{Var}\left(\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)^{2}\right)&<\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\epsilon_{i}\right)^{2}\right)\\ 4\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}(Z_{i}-\alpha Z_{i})\right)\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)&\leq 4\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\frac{1}{(K-1)}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)\end{split}

The first two inequalities follow from Lemma 2 and Lemma 3. Let us now deal with the last term.

	$\displaystyle\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\frac{1}{(K-1)}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)-\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}(Z_{i}-\alpha Z_{i})\right)\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\epsilon_{i}\right)$
	$\displaystyle=\sum_{i\neq j}\text{Var}\left(\frac{1}{K}\epsilon_{i}\frac{1}{(K-1)}Z_{j}\right)+\sum_{i}\text{Var}\left(\frac{1}{K}\alpha\epsilon_{i}Z_{i}\right)-(1-\alpha)^{2}\frac{1}{K^{2}}\text{Var}(Z_{1})\text{Var}(\epsilon_{1})$
	$\displaystyle=\text{Var}(\epsilon_{1})\text{Var}(Z_{1})\left(\frac{1}{K(K-1)}+\frac{\alpha^{2}}{K}-(1-\alpha)^{2}\frac{1}{K^{2}}\right)$

Thus, it suffices to show that

\frac{1}{K(K-1)}-\frac{1}{K^{2}}+\frac{\alpha^{2}}{K}-\frac{\alpha^{2}}{K^{2}}+2\alpha\frac{1}{K^{2}}\geq 0

Or, equivalently,

\frac{1}{K^{2}(K-1)}+\frac{\alpha^{2}(K-1)}{K^{2}}+2\alpha\frac{1}{K^{2}}\geq 0.

(27)

Dividing by $(K-1)/K^{2}$ yields

\frac{1}{(K-1)^{2}}+\alpha^{2}+2\frac{\alpha}{(K-1)}\geq 0.

The left term can be written as

\left(\frac{1}{(K-1)}+\alpha\right)^{2}.

This shows the inequality in equation (27), which completes the proof.

∎

Lemma 2.

Let $\epsilon_{i}$ be i.i.d. centered Gaussian random variables with nonzero variance and $K\geq 2$ . Then,

\text{Var}\left(\left(\frac{1}{K}\sum_{i}\epsilon_{i}\right)^{2}\right)<\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\epsilon_{i}\right)^{2}\right)

(28)

Proof.

On the left-hand side of equation (28) we have

\frac{1}{K^{2}}\text{Var}(\epsilon_{1}^{2}).

On the right-hand side of equation (28) we have

\frac{1}{K}\text{Var}(\epsilon_{1}^{2}).

This completes the proof. ∎

Lemma 3.

Let $Z_{i}$ be i.i.d. centered Gaussian random variables and $K\geq 2$ . Then,

\displaystyle\text{Var}\left(\left(\frac{1}{K}\sum_{i}Z_{i}-\alpha Z_{i}\right)^{2}\right)\leq\text{Var}\left(\frac{1}{K}\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)^{2}\right).

Proof.

It suffices to show that

\displaystyle\text{Var}\left(\left(\frac{1}{\sqrt{K}}\sum_{i}Z_{i}-\alpha Z_{i}\right)^{2}\right)\leq\text{Var}\left(\sum_{i=1}^{K}\left(\frac{1}{K-1}\sum_{j\neq i}Z_{j}-\alpha Z_{i}\right)^{2}\right).

(29)

On the left-hand side we have

(1-\alpha)^{4}\text{Var}(Z_{1}^{2}).

On the right-hand side of equation (29) we have

	$\displaystyle\text{Var}\left(\frac{1}{(K-1)}\sum_{i}Z_{i}^{2}+\frac{(K-2)}{(K-1)^{2}}\sum_{i>j}2Z_{i}Z_{j}-\sum_{i>j}\frac{\alpha}{(K-1)}4Z_{i}Z_{j}+\alpha^{2}\sum_{i=1}^{K}Z_{i}^{2}\right)$
	$\displaystyle=K(\frac{1}{(K-1)}+\alpha^{2})^{2}\text{Var}(Z_{1}^{2})+(\frac{(K-2)}{(K-1)^{2}}-\frac{2\alpha}{(K-1)})^{2}\frac{K(K-1)}{2}2\text{Var}(Z_{1}^{2})$
	$\displaystyle=\text{Var}(Z_{1}^{2})(K\alpha^{4}-\alpha\frac{(K-2)4K}{(K-1)^{2}}+\alpha^{2}(2\frac{K}{(K-1)}+4\frac{K}{(K-1)})$
	$\displaystyle+\frac{K(K-1)+K(K-2)^{2}}{(K-1)^{3}})$

Using the two inequalities above, it suffices to show that for all $\alpha$ and all $K\geq 2$ ,

		$\displaystyle K\alpha^{4}+6\alpha^{2}\frac{K}{(K-1)}-\alpha\frac{(K-2)4K}{(K-1)^{2}}+\frac{K(K-1)+K(K-2)^{2}}{(K-1)^{3}}$
	$\displaystyle\geq$	$\displaystyle\alpha^{4}-4\alpha^{3}+6\alpha^{2}-4\alpha+1$

Rearranging, it suffices to show that

	$\displaystyle(K-1)\alpha^{4}+4\alpha^{3}+6\alpha^{2}\frac{1}{(K-1)}+\alpha\frac{4}{(K-1)^{2}}$
	$\displaystyle+\frac{K(K-1)+K(K-2)^{2}-(K-1)^{3}}{(K-1)^{3}}\geq 0.$

Multiplying with $(K-1)$ , this is equivalent to

	$\displaystyle(K-1)^{2}\alpha^{4}+4\alpha^{3}(K-1)+6\alpha^{2}$
	$\displaystyle+\alpha\frac{4}{(K-1)}+\frac{K(K-1)+K(K-2)^{2}-(K-1)^{3}}{(K-1)^{2}}\geq 0,$

which is equivalent to

	$\displaystyle(K-1)^{2}\alpha^{4}+4\alpha^{3}(K-1)+6\alpha^{2}+\alpha\frac{4}{(K-1)}$
	$\displaystyle+\frac{K^{2}-K+K^{3}-4K^{2}+4K-K^{3}+3K^{2}-3K+1}{(K-1)^{2}}\geq 0.$

This inequality is equivalent to

(K-1)^{2}\alpha^{4}+4\alpha^{3}(K-1)+6\alpha^{2}+\alpha\frac{4}{(K-1)}+\frac{1}{(K-1)^{2}}\geq 0.

(30)

Rearranging the left-hand side, we obtain

(K-1)^{2}\left(\alpha+\frac{1}{(K-1)}\right)^{4}.

This proves the inequality of equation (30), which completes the proof.

∎

	$\displaystyle\theta^{(0)}$	$\displaystyle=\mathbb{E}[Y\|T=1]-\mathbb{E}[Y\|T=0]$
		$\displaystyle=\mathbb{E}[\mathbb{E}[Y\|P=1]P+\mathbb{E}[Y\|P=0](1-P)\|T=1]$
		$\displaystyle\qquad-\mathbb{E}[\mathbb{E}[Y\|P=1]P+\mathbb{E}[Y\|P=0](1-P)\|T=0]$
		$\displaystyle=\left(\mathbb{E}[Y\|P=1]-\mathbb{E}[Y\|P=0]\right)\left(\mathbb{E}[P\|T=1]-\mathbb{E}[P\|T=0]\right)$
		$\displaystyle=\theta_{T\rightarrow P}\cdot\theta_{P\rightarrow Y}.$

	$\displaystyle\hat{R}(g)$	$\displaystyle=\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\hat{\tau}^{(g)}\\|_{2}^{2}+\frac{1}{n}\\|\hat{\sigma}^{(g)}\\|_{2}^{2}$
		$\displaystyle=\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\tau^{(g)}\\|_{2}^{2}+\frac{1}{n}\\|\sigma^{(g)}\\|_{2}^{2}+o_{P}(1/n)$
		$\displaystyle=\\|\hat{\eta}^{(g)}-\hat{\eta}^{(0)}\\|_{2}^{2}-\frac{1}{n}\\|\tau^{(g,\eta)}\\|_{2}^{2}+\frac{1}{n}\\|\sigma^{(g,\eta)}\\|_{2}^{2}+o_{P}(1/n).$

	$\displaystyle\,\,\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{n}{d}\\|\delta^{(g)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}\right\|$
	$\displaystyle=\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-\delta^{(g)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}+\frac{2n}{d}(\hat{\theta}^{(g)}-\hat{\theta}^{(0)}-\delta^{(g)})\cdot\delta^{(g)}\right\|$
	$\displaystyle=\left\|\frac{n}{d}\\|\epsilon^{(g)}-\epsilon^{(0)}\\|_{2}^{2}-\frac{1}{d}\\|\tau^{(g)}\\|_{2}^{2}+\frac{2n}{d}(\epsilon^{(g)}-\epsilon^{(0)})\cdot\delta^{(g)}\right\|$
	$\displaystyle\leq C_{1}(c_{0},\eta)\frac{t}{d}+C_{2}(c_{0},\eta)\sqrt{\frac{t}{d}},$

	$\displaystyle\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\frac{n}{d}\hat{R}^{\text{mod}}(g)\right\|$
	$\displaystyle\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\max\left(\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{d}\sum_{j=1}^{d}(\hat{\tau}_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\hat{\sigma}_{j}^{(g)})^{2}\right\|$
	$\displaystyle\leq 2\iota_{n,d}+\sup_{g}\left\|\frac{n}{d}\\|\hat{\theta}^{(g)}-\theta^{(0)}\\|_{2}^{2}-\max\left(\frac{n}{d}\\|\hat{\theta}^{(g)}-\hat{\theta}^{(0)}\\|_{2}^{2}-\frac{1}{d}\sum_{j=1}^{d}(\tau_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}\right\|$
	$\displaystyle\leq 2\iota_{n,d}+C_{6}\sqrt{\frac{\log G}{d}}+\sup_{g}\bigg{\|}\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}+\frac{n}{d}\\|\theta^{(g)}-\theta^{(0)}\\|_{2}^{2}$
	$\displaystyle-\max\left(\frac{n}{d}\\|\theta^{(g)}-\theta^{(0)}\\|_{2}^{2}+\frac{1}{d}\sum_{j=1}^{d}(\tau_{j}^{(g)})^{2}-(\tau_{j}^{(g)})^{2},0\right)-\frac{1}{d}\sum_{j=1}^{d}(\sigma_{j}^{(g)})^{2}\bigg{\|}$
	$\displaystyle\leq 2\iota_{n,d}+C_{6}\sqrt{\frac{\log G}{d}},$

Model selection for estimation of causal parameters

Abstract

1 Introduction

Moving the goalpost.

Data fusion.

Choosing between estimators that are optimal in different models.

Leveraging conditional independencies.

1.1 Related work

1.2 Our contribution

1.2.1 Outline

2 Targeted model selection

2.1 Setting and notation

2.2 The method

Example 1 (Model selection for parameter estimation).

2.2.1 Improving precision

3 Theory

3.1 Assumptions

Assumption 1.

3.2 Asymptotic unbiasedness

Theorem 1 (Asymptotic unbiasedness of R^​(g)\hat{R}(g)).

3.3 Comparison to cross-validation

Theorem 2 (Asymptotic biasedness of R~​(g)\tilde{R}(g)).

Theorem 3 (Asymptotic variance of model selection criteria).

3.4 Asymptotic risk

Corollary 1 (Asymptotic risk of selected model).

3.5 Confidence intervals

Theorem 4.

3.6 Finite-sample bound

Theorem 5.

4 Applications

4.1 Observational studies

4.1.1 The data set

4.1.2 The estimators

4.1.3 Results

4.2 Instrumental variables and data fusion

4.2.1 The data set

4.2.2 The estimators

4.2.3 Results

4.3 Experiment with proxy outcome

4.3.1 The data set

4.3.2 The estimators

4.3.3 Results

5 Conclusion

References

Appendix A SUPPLEMENTARY MATERIAL

A.1 Proof of Theorem 1

Proof.

A.2 Proof of Theorem 2

Proof.

A.3 Proof of Theorem 3

Proof.

A.4 Proof of Corollary 1

Proof.

A.5 Proof of Theorem 4

Proof.

A.6 Proof of Theorem 5

Proof.

A.7 Auxiliary Lemmas

Lemma 1.

Proof.

Lemma 2.

Proof.

Lemma 3.

Proof.

Theorem 1 (Asymptotic unbiasedness of $\hat{R}(g)$ ).

Theorem 2 (Asymptotic biasedness of $\tilde{R}(g)$ ).