
Semi-supervised learning and the question of true versus estimated propensity scores

Andrew Herren and P. Richard Hahn, Arizona State University
Abstract

A straightforward application of semi-supervised machine learning to the problem of treatment effect estimation would be to consider data as “unlabeled” if treatment assignment and covariates are observed but outcomes are unobserved. According to this formulation, large unlabeled data sets could be used to estimate a high dimensional propensity function, and causal inference using a much smaller labeled data set could proceed via weighted estimators using the learned propensity scores. In the limiting case of infinite unlabeled data, one may estimate the high dimensional propensity function exactly. However, longstanding advice in the causal inference community suggests that estimated propensity scores (from labeled data alone) are actually preferable to true propensity scores, implying that the unlabeled data is actually useless in this context. In this paper we examine this paradox and propose a simple procedure that reconciles the strong intuition that a known propensity function should be useful for estimating treatment effects with the previous literature suggesting otherwise. Further, simulation studies suggest that direct regression may be preferable to inverse-propensity weighted estimators in many circumstances.

Keywords: Causal inference, Semi-supervised learning, Unlabeled data, Propensity score

1 Introduction

In their seminal paper, Rosenbaum and Rubin (1983) show that the propensity score, the probability of receiving the treatment conditional on control covariates, is a sufficient statistic for estimating treatment effects: rather than needing to balance on a potentially high dimensional vector of control variables, one need only balance on the one-dimensional propensity score. In practice, however, researchers do not know the propensity function (the mapping from the covariates to the treatment probabilities) in advance; it must be estimated from data. Unfortunately, learning the propensity function can be quite difficult when there are many controls and its parametric form is unknown. Linear logistic regression is often the uncritical method of first resort, and even in that familiar setting estimation can be challenging with limited data.

Notably, propensity score approaches only require learning a function of the treatment assignment and control variables; the response variable is not utilized at this preliminary stage. This raises the possibility of incorporating side data for which responses aren’t measured, so-called “unlabeled” data, in the parlance of semi-supervised learning (Zhu and Goldberg (2009)) to assist in estimating the propensity function. Provided that such data is available, the propensity function can be inferred more accurately than would be possible with the labeled data alone. In the limit, one can imagine learning the propensity function arbitrarily accurately, thereby realizing the holy grail conveyed by Rosenbaum and Rubin (1983) of being able to do causal inference with a one-dimensional control variable.

However, it was famously shown by Hirano, Imbens and Ridder (2003) that true propensity scores are actually worse – in the sense of yielding asymptotically higher variance estimates of the treatment effect – than ones estimated from the labeled data alone! This observation would seem to dash any hope of leveraging unlabeled data as described above. Similar results were demonstrated by Lunceford and Davidian (2004) in the context of stratification and weighting estimators with parametric propensity models.

This paper unpacks this apparent paradox in an effort to provide practical guidance to applied researchers interested in estimating treatment effects from observational data. We find that reports of the uselessness of the true propensity function have been greatly exaggerated, reviving the potential for unlabeled data to assist in propensity function estimation, i.e. semi-supervised treatment effect estimation.

2 Problem Statement and Notation

2.1 Notation

Throughout the paper we use the following conventions:

  • Random variable: italicized, uppercase letter (e.g. $X$, $A$, …)

  • Scalar realization of random variable: italicized, lowercase letter (e.g. $x$, $a$, …)

  • Vector realization of random variable: lowercase roman letter (e.g. $\mathrm{x}$, $\mathrm{a}$, …)

  • Matrix realization of random variable: bold, uppercase letter (e.g. $\mathbf{X}$, $\mathbf{A}$, …)

  • Metric space of random variable’s support: math calligraphy letter (e.g. $\mathcal{X}$, $\mathcal{A}$, …)

We first draw a distinction between observational and experimental studies for treatment effect estimation. In experimental settings, the treatment of interest is deliberately randomized according to a pre-specified design, typically with the goal of enabling straightforward identification of the treatment effect. In observational settings, data on treatment, outcome and covariates are observed without any direct manipulation. Researchers wishing to infer treatment effects with observational data must rely on a series of assumptions, which we now introduce.

Let $Y$ refer to the outcome of interest, $Z$ denote a binary treatment assignment, and $X$ be a vector of covariates drawn from covariate space $\mathcal{X}$. We use the potential outcomes notation $Y^1$ and $Y^0$ to refer to the counterfactual values of $Y$ when $Z=1$ and $Z=0$ (Hernan and Robins (2020)). Our estimand of interest is the average treatment effect: $\mathrm{E}[Y^1 - Y^0]$.

Since the potential outcomes $(Y^1, Y^0)$ are never observed simultaneously, we rely on a number of identifying assumptions common to the causal inference literature (Rubin (1980), Rosenbaum and Rubin (1983), Hernan and Robins (2020)):

  1. Stable unit treatment value assumption (SUTVA): for any sample of size $n$ with $Y \in \mathcal{Y}$ and $Z \in \mathcal{Z}$, $(Y_i^1, Y_i^0) \perp Z_j$ for all $i, j \in \{1, \ldots, n\}$ with $j \neq i$

  2. Positivity: $0 < \mathrm{P}(Z = 1 \mid X = \mathbf{X}) < 1$ for all $\mathbf{X} \in \mathcal{X}$

  3. Conditional exchangeability: $(Y^1, Y^0) \perp Z \mid X$

Taken together, these assumptions enable identification of average treatment effects using observational data after adjusting for the effect of $X$ on $Z$ and $Y$. Thus, we can rewrite our estimand as $\textrm{ATE} = \mathrm{E}[Y^1 - Y^0] = \mathrm{E}_X[\mathrm{E}[Y \mid X, Z=1] - \mathrm{E}[Y \mid X, Z=0]]$. There are many ways to adjust for $X$ in estimating the ATE, but this paper focuses on methods that use the propensity score. Rosenbaum and Rubin (1983) define the propensity score as $p(X) = \mathrm{P}(Z = 1 \mid X)$ and show that $p(X)$ has several desirable properties:

  1. $X \perp Z \mid p(X)$

  2. $(Y^1, Y^0) \perp Z \mid p(X)$

In short, the propensity score can be used to ensure covariate balance across treatment groups, and the resulting covariate balance deconfounds the treatment effect estimate of $Z$ on $Y$. The deconfounding property of $p(X)$ implies that the treatment effect estimand can be rewritten as $\textrm{ATE} = \mathrm{E}[Y^1 - Y^0] = \mathrm{E}_{p(X)}[\mathrm{E}[Y \mid p(X), Z=1] - \mathrm{E}[Y \mid p(X), Z=0]]$.

2.2 Confounders, Instruments and Pure Prognostic / Moderator Variables

Above, we let $X$ refer to the set of covariates of $Y$ and $Z$, but for the purposes of this discussion, we introduce a taxonomy of covariate types. First, observe that, using a similar framing as Hahn et al. (2020), the relationship between $Y$, $Z$ and $X$ can be expressed as:

\begin{align*}
\mathrm{E}\left[Y \mid X_{\mu}, X_{\tau}, X_{\pi}\right] &= \mu\left(X_{\mu}\right) + \tau\left(X_{\tau}\right) Z \\
\mathrm{P}\left(Z = 1 \mid X_{\pi}\right) &= \pi\left(X_{\pi}\right)
\end{align*}

where $X_{\mu}$ refers to the set of prognostic variables, $X_{\tau}$ refers to the set of moderator variables, and $X_{\pi}$ refers to the set of propensity variables. We consider that any variable in $X$ can be classified as follows:

  1. Confounder: Any variable which appears in $X_{\pi}$ and either of $X_{\mu}$ or $X_{\tau}$

  2. Pure prognostic / moderator variable: Any variable which appears in either $X_{\mu}$, $X_{\tau}$, or both, but not in $X_{\pi}$

  3. Instrument: Any variable which appears in $X_{\pi}$ and not in $X_{\mu}$ nor $X_{\tau}$ (this is a “pure propensity” variable)

  4. Extraneous variable: Any variable in $X$ which appears in none of $X_{\tau}$, $X_{\mu}$, or $X_{\pi}$

2.3 Estimated vs true propensities

Hirano, Imbens and Ridder (2003) discuss the inverse-propensity weighted (IPW) estimator of the average treatment effect, which estimates the ATE as follows:

\textrm{ATE}_{\textrm{IPW}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i Z_i}{p(X_i)} - \frac{Y_i(1-Z_i)}{1-p(X_i)}\right)

They show that even when the true propensities are known, using them directly in the IPW estimator is asymptotically inefficient. We consider the finite sample variance properties of this estimator on a simple data generating process, which we define below:

  • Covariates: $X_1, X_2 \sim \textrm{Bernoulli}(1/2)$, letting $X = (X_1, X_2)$

  • Propensity function: $p(X) = 1/\left(1 + \exp\{-(\alpha_1 + \alpha_2 X_1)\}\right)$

  • Treatment: $Z \sim \textrm{Bernoulli}(p(X))$

  • Outcome: $Y \sim \mathcal{N}(\gamma_1 X_1 + \gamma_2 X_2 + \tau Z, 1)$
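The data generating process above, together with the true-propensity IPW estimator, can be sketched as follows. This is a minimal illustration, not the paper's code: the coefficient values (propensities $\{0.3, 0.7\}$, $\gamma_1 = \gamma_2 = 1$, $\tau = 2$) are arbitrary choices for the sketch.

```python
import numpy as np

# Minimal sketch of the two-covariate DGP above with true propensities
# {0.3, 0.7}; gamma_1 = gamma_2 = 1 and tau = 2 are illustrative choices.
rng = np.random.default_rng(0)

ALPHA1 = np.log(0.3 / 0.7)   # logit(0.3), so p(X) = 0.3 when X1 = 0
ALPHA2 = -2 * ALPHA1         # so p(X) = 0.7 when X1 = 1

def simulate(n, tau=2.0):
    x1 = rng.binomial(1, 0.5, n)
    x2 = rng.binomial(1, 0.5, n)
    p = 1.0 / (1.0 + np.exp(-(ALPHA1 + ALPHA2 * x1)))  # propensity function
    z = rng.binomial(1, p)
    y = x1 + x2 + tau * z + rng.normal(size=n)
    return y, z, p

def ipw(y, z, p):
    # ATE_IPW = mean( Y*Z/p(X) - Y*(1-Z)/(1-p(X)) )
    return np.mean(y * z / p - y * (1 - z) / (1 - p))

y, z, p = simulate(200_000)
ate = ipw(y, z, p)  # approaches tau = 2.0 as n grows
```

In a single large sample the estimate lands close to the true effect; the finite sample behavior at small $N$ is the subject of the derivation that follows.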

We let the random variable $N$ denote the overall sample size and define subset-specific sample sizes as follows:

  • $N_x$: the number of observations with $X = x$

  • $N_{x,z}$: the number of observations with $X = x$ and $Z = z$

First, observe in this case that $p(X)$ takes on two distinct values: $1/\left(1+\exp\{-\alpha_1\}\right)$ if $X_1 = 0$, and $1/\left(1+\exp\{-\alpha_1-\alpha_2\}\right)$ if $X_1 = 1$. We can split the sum into eight parts, stratifying on the unique values of ($X_1$, $X_2$, $Z$). For a given stratum, the weights in the denominator, as well as the $Z_i$ and $1 - Z_i$ indicators, are all the same and can be factored out. We refer to the “true propensities” estimator as $\mathrm{IPW}_{p(X)}$ to indicate that the true propensity function, $p(X)$, is the basis for the weights in the denominator. Let $p_x = \mathrm{P}(Z = 1 \mid X = x) = \mathrm{P}(Z = 1 \mid X_1 = x_1)$. The portion of the “true propensities” estimator represented by a given $\{X = x\}$ (i.e. $\{X_1 = x_1, X_2 = x_2\}$) stratum is given by

\begin{align*}
\mathrm{IPW}_{p(X),x} &= \frac{1}{N}\sum_{j:X_j=x}\left(\frac{Y_j Z_j}{p_x} - \frac{Y_j(1-Z_j)}{1-p_x}\right) \\
&= \frac{1}{N}\frac{N_x}{N_x}\frac{N_{x,z=1}}{N_{x,z=1}}\frac{1}{p_x}\sum_{j:X_j=x,Z_j=1}Y_j - \frac{1}{N}\frac{N_x}{N_x}\frac{N_{x,z=0}}{N_{x,z=0}}\frac{1}{1-p_x}\sum_{j:X_j=x,Z_j=0}Y_j \\
&= \left(\frac{N_x}{N}\right)\left(\frac{\hat{p}_x}{p_x}\bar{Y}_{x,Z=1} - \frac{1-\hat{p}_x}{1-p_x}\bar{Y}_{x,Z=0}\right)
\end{align*}

where $\mathrm{IPW}_{p(X)} = \mathrm{IPW}_{p(X),X=(0,0)} + \mathrm{IPW}_{p(X),X=(1,0)} + \mathrm{IPW}_{p(X),X=(0,1)} + \mathrm{IPW}_{p(X),X=(1,1)}$ and $\hat{p}_x = N_{x,z=1}/N_x$. We see that replacing the true propensity weights with these empirical weights results in the following estimator:

\begin{align*}
\mathrm{IPW}_{\hat{p}(X),x} &= \frac{1}{N}\sum_{j:X_j=x}\left(\frac{Y_j Z_j}{\hat{p}_x} - \frac{Y_j(1-Z_j)}{1-\hat{p}_x}\right) \\
&= \left(\frac{N_x}{N}\right)\left(\bar{Y}_{x,Z=1} - \bar{Y}_{x,Z=0}\right)
\end{align*}

We can compare the variances of these two estimators by first observing that the data are i.i.d., so that conditional on the number of observations with $Z = 1$ and $X = x$, the covariances of the strata means are 0. We derive the variances of the two estimators in Appendix A, but the key to understanding why $\mathrm{V}\left[\mathrm{IPW}_{\hat{p}(X),x}\right] \leq \mathrm{V}\left[\mathrm{IPW}_{p(X),x}\right]$ is to notice that the ratios $\hat{p}_x / p_x$ and $(1-\hat{p}_x)/(1-p_x)$ are random variables which converge to $1$ as $n \rightarrow \infty$ but are noisy in finite samples. These finite sample imbalances are an ancillary statistic for $\tau$, and the $\hat{p}(X)$ estimator conditions on this ancillary statistic for more efficient inference. For a detailed discussion of the role of ancillary statistics in causal inference, readers are referred to Chapter 10 of Hernan and Robins (2020) and the references therein.
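The collapse of the empirical-weight IPW estimator into a weighted difference of stratum means can be checked numerically. This is a quick sketch with arbitrary coefficients, not the paper's code: plugging the stratum-level empirical treatment frequencies $\hat{p}_x$ into the IPW formula reproduces $\sum_x (N_x/N)(\bar{Y}_{x,Z=1} - \bar{Y}_{x,Z=0})$ exactly.

```python
import numpy as np

# Verify numerically that IPW with stratum-wise empirical propensities
# equals sum_x (N_x/N) * (Ybar_{x,1} - Ybar_{x,0}); coefficients arbitrary.
rng = np.random.default_rng(1)
n = 400
x1 = rng.binomial(1, 0.5, n)
x2 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, np.where(x1 == 1, 0.7, 0.3))
y = x1 + x2 + 2.0 * z + rng.normal(size=n)

strata = 2 * x1 + x2            # four (x1, x2) cells
p_hat = np.empty(n)
for s in range(4):
    m = strata == s
    p_hat[m] = z[m].mean()      # p_hat_x = N_{x,z=1} / N_x

ipw_hat = np.mean(y * z / p_hat - y * (1 - z) / (1 - p_hat))

strat_diff = sum(
    (strata == s).mean()
    * (y[(strata == s) & (z == 1)].mean() - y[(strata == s) & (z == 0)].mean())
    for s in range(4)
)
# ipw_hat and strat_diff agree to floating point precision
```

The agreement is algebraic, not approximate, so it holds for any sample in which every stratum contains both treated and control units.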

This result seems to imply a paradox: even if a propensity function is known exactly, is the analyst better off ignoring that information and estimating the propensity scores? While it is true that directly weighting by the true propensities yields higher variances, we show that the true propensity scores can still be used to achieve efficiency. Consider in this case that $p(X) = 1/\left(1+\exp\{-\alpha_1-\alpha_2 X_1\}\right)$, so $Z$ can be modeled via logistic regression on $X_1$, $X_2$, and an intercept term. Fitting this model on a sample $\mathbf{X} = (\mathrm{x}_1, \mathrm{x}_2)$ and $\mathrm{z}$ would yield coefficient estimates $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$ such that $\hat{p}(\mathbf{X}) = \frac{1}{1+\exp\{-\hat{\beta}_1 - \hat{\beta}_2\mathrm{x}_1 - \hat{\beta}_3\mathrm{x}_2\}}$, where the true values of the parameters are $\beta_1 = \alpha_1$, $\beta_2 = \alpha_2$, and $\beta_3 = 0$. Thus, if we know the true value of $p(\mathbf{X})$, we can calculate the true $\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2$ by logit transformation

\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2 = -\log\left(\frac{1}{p(\mathbf{X})} - 1\right)

Since both $\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2$ and $\hat{\beta}_1 + \hat{\beta}_2\mathrm{x}_1 + \hat{\beta}_3\mathrm{x}_2$ are one-dimensional linear functions of $\mathbf{X}$, we could express $\hat{\beta}_1 + \hat{\beta}_2\mathrm{x}_1 + \hat{\beta}_3\mathrm{x}_2 = \hat{\eta}_0 - \log\left(\frac{1}{p(\mathbf{X})} - 1\right)\hat{\eta}_1$ for some $\hat{\eta}_0, \hat{\eta}_1 \in \mathbb{R}$ and observe that this is equivalent to a logistic regression of $Z$ on the 1-dimensional variable $-\log\left(\frac{1}{p(\mathbf{X})} - 1\right)$ with an intercept term.

This 1-dimensional regression adjustment is similar to the sample correction discussed in the context of the $\hat{p}(X)$ estimator, with one key difference: the differences between true and empirical treatment probabilities are adjusted across the support of $p(X)$ rather than $X$. In our two-covariate example, $p(X)$ is only a function of $X_1$ and so has two unique values, rather than the four distinct values of $X$.

We introduce the following shorthand to refer to the three estimators:

  1. $p(X)$: IPW estimator using the true propensity function, $p(X)$, as weights

  2. $\hat{p}(X)$: IPW estimator using the covariate-estimated propensity function, $\hat{p}(X)$, as weights

  3. $\hat{p}(p(X))$: IPW estimator using the true propensity function, $p(X)$, as a 1-dimensional variable for estimating sample propensity weights

Hirano, Imbens and Ridder (2003) and the example above demonstrate that the $\hat{p}(X)$ estimator has lower variance than the $p(X)$ estimator. However, we see in Appendix B that the $\hat{p}(p(X))$ estimator has lower variance than the $\hat{p}(X)$ estimator as long as the following two conditions are satisfied:

  • $X_2$ is not a pure prognostic / moderator variable: $\mathrm{E}\left[\bar{Y}_{X_1=x_1,X_2=1,Z=z}\right] = \mathrm{E}\left[\bar{Y}_{X_1=x_1,X_2=0,Z=z}\right]$ for all $x_1, z$. In our simple data generating process, this is true if $\gamma_2 = 0$.

  • The outcome variance does not differ along the support of $X_2$: $\mathrm{V}\left[\bar{Y}_{X_1=x_1,X_2=1,Z=z}\right] = \mathrm{V}\left[\bar{Y}_{X_1=x_1,X_2=0,Z=z}\right]$ for all $x_1, z$. In our simple data generating process, this is true as the variance of $Y$ is constant with respect to $X$.

The most interesting takeaway from this result is that the cases in which the $\hat{p}(X)$ estimator outperforms the $\hat{p}(p(X))$ estimator do not pertain to the intended use of the propensity score, which is to deconfound treatment assignment and enable identification of the ATE. The $\hat{p}(X)$ estimator can attain a lower variance, but it does so by stratifying on a pure prognostic / moderator variable, $X_2$. This points to an unintended use of weighting / stratification estimators, rather than a fundamental problem with using true propensities. Adjusting for pure prognostic / moderator variables can be achieved by direct regression adjustment, and Hahn et al. (2020) provide a nonparametric Bayesian approach that shows promising results on a number of difficult data generating processes.

To demonstrate the difference in variances of the three estimators in finite samples, we conduct a simulation study. We use the same data generating process discussed in the finite sample variance calculations above, in which $Y$ is normal, $X$ and $Z$ are Bernoulli, and $X$ has $p = 2$ dimensions. In the simulations presented below, we generate 10,000 datasets for each value of $N$ and estimate the average treatment effect using the IPW estimator, comparing three possible weighting options:

  a) $p(X)$: True propensities ($\{0.3, 0.7\}$)

  b) $\hat{p}(X)$: Estimated propensities (logistic regression of $Z$ on $X$)

  c) $\hat{p}(p(X))$: Sample adjustment using true propensities (logistic regression of $Z$ on $\mathrm{logit}(p(X))$)

  $N$     $p(X)$   $\hat{p}(X)$   $\hat{p}(p(X))$
  15      1.80     0.91           0.81
  25      1.41     0.69           0.52
  50      1.00     0.38           0.32
  100     0.70     0.24           0.22
  250     0.44     0.14           0.14

Table 1: Standard deviation of IPW estimates using different weighting approaches (10,000 simulations per $N$; $p = 2$, and the value of $N$ at which the variances converge depends on $p$)

Our simulations confirm the points made in Hirano, Imbens and Ridder (2003) and Rosenbaum (1987) that using the true propensities directly leads to higher variances in finite samples. However, we also see that the true propensities can be used in a one-dimensional model of $Z$ to attain even lower variances than estimators using the fitted propensities.
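A small Monte Carlo in the spirit of Table 1 can be sketched as follows. Two caveats: the outcome coefficients ($\gamma_1 = 1$, $\gamma_2 = 0$, $\tau = 2$) are illustrative, and stratum-level empirical treatment frequencies stand in for the logistic fits (here $\hat{p}(p(X))$ reduces to the empirical treatment frequency within each stratum of $p(X)$, i.e. of $X_1$).

```python
import numpy as np

# Monte Carlo sketch comparing the three weighting options; true
# propensities are {0.3, 0.7} and gamma_2 = 0, so X2 is extraneous.
rng = np.random.default_rng(2)

def one_draw(n=200):
    x1 = rng.binomial(1, 0.5, n)
    x2 = rng.binomial(1, 0.5, n)
    p = np.where(x1 == 1, 0.7, 0.3)          # true propensities
    z = rng.binomial(1, p)
    y = x1 + 2.0 * z + rng.normal(size=n)

    def ipw(w):
        return np.mean(y * z / w - y * (1 - z) / (1 - w))

    # (b) empirical treatment frequency within each (x1, x2) stratum
    w_hat = np.empty(n)
    for s in range(4):
        m = (2 * x1 + x2) == s
        w_hat[m] = z[m].mean()

    # (c) empirical treatment frequency within each stratum of p(X),
    #     i.e. of x1 alone
    w_pp = np.empty(n)
    for v in (0, 1):
        m = x1 == v
        w_pp[m] = z[m].mean()

    return ipw(p), ipw(w_hat), ipw(w_pp)

draws = np.array([one_draw() for _ in range(2000)])
sd_true, sd_hat, sd_pp = draws.std(axis=0)
```

The qualitative finding of Table 1, that the true-propensity weights produce the largest spread of estimates, should reproduce under essentially any seed at these sample sizes.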

2.4 Semi-supervised treatment effect estimation

In practice, knowledge of the exact true propensity function is unlikely outside of randomized experiments. However, it is certainly possible to imagine investigators having access to auxiliary data that enable more accurate estimates of the true propensities. The received wisdom of Hirano, Imbens and Ridder (2003), Lunceford and Davidian (2004), and Rosenbaum (1987) would seem to imply that such accuracy gains impede efficient estimation of the ATE. The primary challenge in using propensity scores estimated from a larger auxiliary dataset stems from the utility of conditioning on finite sample treatment probabilities in weighting estimators. However, we have shown in Section 2.3 that the same variance reduction can be achieved by conditioning on finite sample probabilities formed by $p(X)$ rather than $X$, and this result extends to an estimate of $p(X)$ derived from arbitrarily large datasets.

We can formalize the notion of using a large auxiliary dataset in estimating the ATE by framing it as a semi-supervised learning problem. Semi-supervised learning refers to a broad class of techniques that combine labeled and unlabeled data in statistical learning problems (Zhu and Goldberg (2009), Belkin, Niyogi and Sindhwani (2006), Liang, Mukherjee and West (2007)). Consider a dataset with $N$ observations. As above, we let $Y$, $Z$, and $X$ denote outcome, treatment, and covariates, respectively. We refer to the data points with observed outcomes as “labeled data,” and let $N_o$ refer to the number of such labeled observations and $N_m = N - N_o$ refer to the number of data points with missing outcomes. We index an outcome’s status as missing or observed by $M$.

Following the notation of Liang, Mukherjee and West (2007), we denote “unlabeled,” or marginal, data as

  • $X^m = \{X_i;\ i = N_o+1, \ldots, N_o+N_m\}$, and

  • $Z^m = \{Z_i;\ i = N_o+1, \ldots, N_o+N_m\}$

Similarly, we denote “labeled” data as

  • $Y^o = \{Y_i;\ i = 1, \ldots, N_o\}$,

  • $X^o = \{X_i;\ i = 1, \ldots, N_o\}$, and

  • $Z^o = \{Z_i;\ i = 1, \ldots, N_o\}$

We observe that $M = 1$ where $X, Z \in (X^m, Z^m)$ and $M = 0$ elsewhere. The discussion in Section 2.3 shows that the set of marginal $(X^m, Z^m)$ data can be put to good use in estimating the average treatment effect as follows:

  1. Obtain an estimate, $p_e(X)$, of the true propensities using the labeled and unlabeled data $(X, Z)$

  2. Adjust the estimated propensities to the labeled sample by modeling $Z^o \sim \textrm{Bernoulli}\left(p_e(X^o)\right)$, and denote these adjusted estimates by $p^*(X^o)$

  3. Estimate the ATE using only the labeled data ($Y^o$, $Z^o$, and $p^*(X^o)$)

As with other constructions in the semi-supervised learning literature, this framing allows us to profitably use information about the statistical patterns in the unlabeled data (in this case, $\mathrm{P}(Z = 1 \mid X)$) to better estimate an effect among the labeled data ($\mathrm{E}[Y^1 - Y^0]$).
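The three-step procedure above can be sketched end-to-end on the simple two-covariate DGP. Stratum-level empirical frequencies again stand in for the logistic fits of steps 1 and 2, and the sample sizes and coefficients are illustrative choices, not from the paper.

```python
import numpy as np

# End-to-end sketch of the three-step semi-supervised procedure above
# on the simple two-covariate DGP (sizes and coefficients illustrative).
rng = np.random.default_rng(3)

n, n_o = 10_000, 200                       # total and labeled sample sizes
x1 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, np.where(x1 == 1, 0.7, 0.3))
y = x1 + 2.0 * z + rng.normal(size=n)      # outcomes; observed only if labeled
labeled = np.arange(n) < n_o               # MCAR labeling by construction

# Step 1: estimate p_e(X) from ALL (X, Z) pairs, labeled and unlabeled
p_e = np.where(x1 == 1, z[x1 == 1].mean(), z[x1 == 0].mean())

# Step 2: adjust to the labeled sample -- empirical treatment frequency
# within each stratum of distinct p_e(X) values among the labeled rows
zl, pl = z[labeled], p_e[labeled]
p_star = np.empty(n_o)
for v in np.unique(pl):
    m = pl == v
    p_star[m] = zl[m].mean()

# Step 3: IPW on the labeled data only
yl = y[labeled]
ate = np.mean(yl * zl / p_star - yl * (1 - zl) / (1 - p_star))
```

Note that the outcomes of the unlabeled rows are never touched; only their $(X, Z)$ pairs inform the propensity estimate, which is then re-calibrated to the labeled sample before weighting.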

3 Simulations

We conduct several simulation studies to understand the phenomena discussed above in more depth. While logistic regression is often a natural choice for estimating propensity scores, Lee, Lessler and Stuart (2010) point out that the assumptions required for logistic regression are not always warranted and they examine propensity score estimation using a number of machine learning methods, including CART (Breiman et al. (1984)) and Random Forest (Breiman (2001)).

This paper proceeds similarly, but instead relies on Bayesian Additive Regression Trees (BART) (Chipman et al. (2010)); code can be found at https://github.com/andrewherren/semi-supervised-propensity. BART is a nonparametric Bayesian tree ensemble method that estimates complex functions via a sum of weak regression or classification trees. In the case of our IPW estimator, we also consider a logistic propensity model in order to investigate how inferences can be harmed by model mis-specification.

3.1 Estimators

We employ three estimators which use propensity scores in our simulation study:

  • Inverse Propensity Weighted (IPW) estimator (Hirano, Imbens and Ridder (2003)), introduced in Section 2.3 and defined as

    \textrm{ATE}_{\textrm{IPW}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i Z_i}{p(X_i)} - \frac{Y_i(1-Z_i)}{1-p(X_i)}\right)

  • Targeted Maximum Likelihood Estimator (TMLE) (van der Laan (2010a), van der Laan (2010b), Gruber and van der Laan (2009))

    \textrm{ATE}_{\textrm{TMLE}} = \frac{1}{N}\sum_{i=1}^{N}\left(Q^{*}\left(Z_i=1, X_i\right) - Q^{*}\left(Z_i=0, X_i\right)\right)

    where $Q^{*}\left(Z_i, X_i\right)$ represents a semi-parametric model of the outcome $Y$ which incorporates the propensity score.

  • Bayesian Causal Forests (BCF) (Hahn et al. (2020))

    \textrm{ATE}_{\textrm{BCF}} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{f}(X_i, Z_i=1) - \bar{f}(X_i, Z_i=0)\right)

    where $\bar{f}(X_i, Z_i)$ is an average across posterior simulations of the fit $f(X_i, Z_i) = \mu(X_i, p(X_i)) + \tau(X_i) Z_i$, composed of two BART models.

3.2 Simulation approaches

Given the two step nature of each of the above estimators, in which a propensity model is first constructed and its estimates then used to calculate an average treatment effect, we consider three approaches to treatment effect estimation:

  • Complete case analysis: We discard all unlabeled data samples, estimate $\hat{p}(\mathbf{X}^o)$, and then compute average treatment effects on only the labeled observations.

  • Semi-supervised: We use the labeled and unlabeled $\mathbf{X}$ and $\mathrm{z}$ values to estimate $\hat{p}(\mathbf{X})$ and then use the $\hat{p}(\mathbf{X}^o)$ predictions to compute average treatment effects on the labeled observations.

  • True propensities: Simulation studies provide us with actual probabilities of receiving treatment, so we can use these true $p(\mathbf{X}^o)$ to compute average treatment effects on the labeled observations. Of course, in most real world applications, the exact true propensities will not be available, and we consider this scenario as a limiting case in which an arbitrarily large amount of unlabeled data were available for modeling $\hat{p}(\mathbf{X})$.

3.3 Interval construction

We construct confidence intervals for our simulations as follows:

  • IPW (logit) and IPW (BART): We compute the asymptotic standard error assuming a logistic propensity model as in Cerulli (2014)

  • TMLE: We use the 95% confidence interval produced by the tmle R package

  • BCF: We construct a 95% credible interval using posterior samples of the treatment effect

3.4 Evaluation metrics

We evaluate the results of our simulations using the following metrics.

  • RMSE: root mean squared error of estimated treatment effects

  • Bias: difference between true (simulated) ATE and average estimated ATE

  • Coverage: fraction of 95% confidence / credible intervals that contain the true ATE

3.5 Data Generating Process Simulations

For simulated covariates, outcomes, and treatment effects, we use a subset of the data generating processes tested in Hahn et al. (2020).

\begin{align*}
Y &= \mu(X) + \tau Z + \varepsilon \\
\mu(X) &= \begin{cases} 3 + X_1 X_3, & X_5 = 1 \\ X_1 X_3, & X_5 = 2 \\ -3 + X_1 X_3, & X_5 = 3 \end{cases} \\
X_1, X_2, X_3 &\sim \mathcal{N}(0, 1) \\
X_4 &\sim \textrm{Bernoulli}(0.5) \\
X_5 &\sim \textrm{Categorical}(0.25, 0.5, 0.25) \\
\tau &= 3 \\
\varepsilon &\sim \mathcal{N}(0, 1)
\end{align*}

There are $p = 5$ covariates, the first three of which are independent standard normal variables, the fourth of which is binary, and the fifth of which is an unordered categorical variable with values $1, 2, 3$. Treatment effects are homogeneous ($\tau = 3$), and the outcome is determined by a combination of treatment effects and a piecewise interaction function of three of the covariates ($\mu(X) = 1 + g(X_5) + X_1 X_3$, where $g(X) = 2$ if $X = 1$, $g(X) = -1$ if $X = 2$, and $g(X) = -4$ if $X = 3$). $\mu(X)$ is often referred to as a prognostic effect and can be conceptualized as the expected value of the outcome for individuals who do not receive the treatment.
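The covariates and prognostic function above can be sketched directly; note that the piecewise offsets $3, 0, -3$ are exactly $1 + g(X_5)$ for the three levels of $X_5$.

```python
import numpy as np

# Sketch of the DGP above: five covariates, piecewise prognostic
# function, homogeneous treatment effect tau = 3.
rng = np.random.default_rng(4)

def draw_covariates(n):
    x1, x2, x3 = rng.normal(size=(3, n))            # independent N(0, 1)
    x4 = rng.binomial(1, 0.5, n)                    # Bernoulli(0.5)
    x5 = rng.choice([1, 2, 3], size=n, p=[0.25, 0.5, 0.25])
    return x1, x2, x3, x4, x5

def mu(x1, x3, x5):
    # prognostic function: offset 3 / 0 / -3 by level of x5, plus x1 * x3
    offset = np.where(x5 == 1, 3.0, np.where(x5 == 2, 0.0, -3.0))
    return offset + x1 * x3

def draw_outcome(x1, x3, x5, z, tau=3.0):
    return mu(x1, x3, x5) + tau * z + rng.normal(size=len(x5))
```

Only $X_1$, $X_3$, and $X_5$ enter the outcome; $X_2$ and $X_4$ are extraneous covariates of the kind discussed in Section 2.2.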

For treatment assignment, we consider two cases:

  1. $\mathrm{P}(Z = 1 \mid X) = 0.5$

  2. $\mathrm{P}(Z = 1 \mid X) = 0.8\,\Phi(3\mu(X)/s - 0.5 X_1) + 0.05 + u/10$, where

    • $\Phi(\cdot)$ is the standard normal CDF

    • $s$ is the sample standard deviation of simulated observations of $\mu(X)$

    • $u \sim \mathrm{Uniform}(0, 1)$

The first treatment assignment mechanism mirrors a simple balanced randomized experiment. This presents a relatively straightforward inference problem, as there is no confounding, but it allows us to simulate the phenomenon of including non-confounders in the propensity model. The second treatment assignment mechanism is described as “Targeted Selection” in Hahn et al. (2020), and refers to the practice of assigning treatment based on the expected value of the outcome for those who don’t receive treatment.
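The targeted selection mechanism (case 2 above) can be sketched as follows; the prognostic values fed in here are placeholder standard normal draws rather than the full $\mu(X)$, so the sketch shows only the assignment function itself.

```python
import numpy as np
from math import erf, sqrt

# Sketch of the "targeted selection" assignment mechanism above:
# treatment probability increases with the expected untreated outcome.
rng = np.random.default_rng(5)

def phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def propensity_targeted(mu_vals, x1):
    s = mu_vals.std(ddof=1)                # sample sd of mu(X)
    u = rng.uniform(size=len(mu_vals))     # Uniform(0, 1) jitter
    base = np.array([phi(t) for t in 3.0 * mu_vals / s - 0.5 * x1])
    return 0.8 * base + 0.05 + u / 10.0

mu_vals = rng.normal(size=500)             # placeholder prognostic draws
x1 = rng.normal(size=500)
p = propensity_targeted(mu_vals, x1)       # strictly between 0 and 1
```

The additive $0.05 + u/10$ term keeps the probabilities bounded away from 0 and 1, so positivity holds by construction despite the strong dependence on $\mu(X)$.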

In order to simulate the process by which outcomes are unlabeled, we assume a fixed overall sample size of 5,000 and randomly label 50, 100, or 500 observations so that $M$ is not correlated with $Y$, $X$, or $Z$. This parallels the notion of “missing completely at random” (MCAR) in the missing data literature (Little and Rubin (2002)), though our problem is slightly different. We treat our unlabeled $X^m$ and $Z^m$ as auxiliary data that may be helpful in estimating the ATE rather than treating our unobserved $Y^m$ as missing data to be imputed or otherwise estimated. Before proceeding to a more detailed discussion of the simulation results, we briefly summarize the key takeaways:

  • Using unlabeled data will harm inferences in the case of a mis-specified propensity model, as confidence intervals narrow around a biased estimate

  • BCF and IPW/BART work best in the case of targeted selection

  • While randomized treatment assignment makes ATE estimation much easier, unlabeled data can still be useful in this case for variance reduction purposes

3.6 Simulation results under targeted selection

We first examine the targeted selection case, in which treatment assignment is correlated with the prognostic effect.

3.6.1 IPW (logit)

The table below shows the results for 500 simulations using an IPW estimator and propensities estimated by logistic regression. The targeted selection function is nonlinear, and thus a logistic propensity model is misspecified. We see that the results are badly biased, except those which use the true propensities, which are not impacted by the model misspecification. Furthermore, we observe that using unlabeled data and increasing sample size is harmful, as the variance is reduced around a biased estimate, leading to worse interval coverage.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.49 0.89 0.49 0.67
Complete Case 5,000 100 3.48 0.68 0.48 0.77
Complete Case 5,000 500 3.50 0.51 0.50 0.59
Semi-supervised 5,000 50 3.48 0.97 0.48 0.67
Semi-supervised 5,000 100 3.56 0.73 0.56 0.72
Semi-supervised 5,000 500 3.50 0.52 0.50 0.55
True propensities 5,000 500 2.99 0.25 -0.01 0.89
Table 2: 500 simulations with outcomes MCAR

3.6.2 IPW (BART)

The table below shows the results for 500 simulations using an IPW estimator and propensities estimated by BART. In this case, the propensity model is not mis-specified, and we observe that the use of unlabeled data reduces bias and increases interval coverage.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.90 0.90 0.90 0.47
Complete Case 5,000 100 3.79 0.79 0.79 0.39
Complete Case 5,000 500 3.50 0.50 0.50 0.36
Semi-supervised 5,000 50 3.00 0.85 -0.00 0.73
Semi-supervised 5,000 100 3.00 0.58 -0.00 0.86
Semi-supervised 5,000 500 3.00 0.26 -0.00 0.90
True propensities 5,000 500 2.99 0.25 -0.01 0.89
Table 3: 500 simulations with outcomes MCAR

3.6.3 TMLE

The table below shows the results for 500 simulations using a TMLE estimator and propensities estimated by BART. Although the propensity model is not misspecified in this case, the TMLE estimator achieves poor interval coverage, even at larger sample sizes where the average bias is lower.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.58 0.62 0.58 0.35
Complete Case 5,000 100 3.44 0.47 0.44 0.33
Complete Case 5,000 500 3.12 0.15 0.12 0.63
Semi-supervised 5,000 50 3.33 0.51 0.33 0.60
Semi-supervised 5,000 100 3.24 0.35 0.24 0.64
Semi-supervised 5,000 500 3.07 0.13 0.07 0.80
True propensities 5,000 500 3.05 0.12 0.05 0.84
Table 4: 500 simulations with outcomes MCAR
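Schematically, the targeting step that distinguishes TMLE from a plain outcome regression looks like the following. This is a simplified sketch with a linear (least-squares) fluctuation and a linear initial outcome model, both our own assumptions for illustration; the implementations cited in this paper (e.g. Gruber and van der Laan (2009)) use a logistic fluctuation on a bounded outcome.

```python
import numpy as np

def tmle_ate_linear(y, z, X, p_hat):
    n = len(y)
    # Step 1: initial outcome model Q(z, x), here a linear regression.
    D = np.column_stack([np.ones(n), z, X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    Q  = D @ beta
    Q1 = np.column_stack([np.ones(n), np.ones(n), X]) @ beta
    Q0 = np.column_stack([np.ones(n), np.zeros(n), X]) @ beta
    # Step 2: targeting -- regress residuals on the "clever covariate"
    # H(z, x) = z / p_hat(x) - (1 - z) / (1 - p_hat(x)).
    H = z / p_hat - (1 - z) / (1 - p_hat)
    eps = (H @ (y - Q)) / (H @ H)
    # Step 3: plug in the fluctuated predictions Q*(z, x) = Q(z, x) + eps * H(z, x).
    return np.mean((Q1 + eps / p_hat) - (Q0 - eps / (1 - p_hat)))

# Toy check with a correct outcome model and known propensities.
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-X[:, 0]))
z = rng.binomial(1, p)
y = X[:, 0] + 3.0 * z + rng.normal(size=n)   # true ATE = 3

print(tmle_ate_linear(y, z, X, p))
```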

3.6.4 BCF

The table below shows the results for 500 simulations using a BCF estimator and propensities estimated by BART. BCF was designed in part to handle the case of targeted selection, and we see that it produces unbiased estimates with high coverage as more unlabeled data is incorporated.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.54 0.99 0.54 0.70
Complete Case 5,000 100 3.43 0.80 0.43 0.80
Complete Case 5,000 500 3.06 0.37 0.06 0.96
Semi-supervised 5,000 50 3.26 0.80 0.26 0.77
Semi-supervised 5,000 100 3.15 0.58 0.15 0.86
Semi-supervised 5,000 500 3.03 0.37 0.03 0.98
True propensities 5,000 500 3.01 0.32 0.01 0.98
Table 5: 500 simulations with outcomes MCAR

3.7 Simulation results under randomized assignment

We now turn to the randomized treatment assignment scenario.

3.7.1 Complete case

The table below shows the results for 500 simulations using the complete case approach across all four estimators. We see in this case that all methods produce unbiased estimates of the ATE as sample size increases, and with the exception of the TMLE estimator, the approaches all achieve high coverage of the true treatment effect.

Estimator N # labeled ATE RMSE Bias Coverage
IPW (logistic) 5,000 50 3.03 0.39 0.03 0.99
IPW (logistic) 5,000 100 2.97 0.23 -0.03 1.00
IPW (logistic) 5,000 500 3.00 0.10 0.00 1.00
IPW (BART) 5,000 50 2.67 0.46 -0.33 0.98
IPW (BART) 5,000 100 2.75 0.31 -0.25 0.99
IPW (BART) 5,000 500 2.90 0.13 -0.10 1.00
TMLE 5,000 50 2.96 0.33 -0.04 0.72
TMLE 5,000 100 2.96 0.20 -0.04 0.79
TMLE 5,000 500 3.00 0.08 0.00 0.86
BCF 5,000 50 2.90 0.69 -0.10 0.79
BCF 5,000 100 2.93 0.47 -0.07 0.87
BCF 5,000 500 2.99 0.22 -0.01 0.96
Table 6: 500 simulations with outcomes MCAR

3.7.2 Semi-supervised

Below are the same simulations using the semi-supervised approach. For both of the weighting estimators, we observe that the semi-supervised approach yields a higher variance. This is an example of the phenomenon discussed in Section 2.3: the sample imbalances among the labeled data are an ancillary statistic for τ, and conditioning on that statistic leads to lower variance. We note that this problem could be overcome using a sample adjustment of Z on the unlabeled propensity score estimates. We also observe in this case that the bias of the IPW estimators decreases with the use of the unlabeled data, which suggests using the marginal data for two reasons:

  • Using only labeled data can lead to biased treatment effect estimates in finite samples

  • With an appropriate adjustment, the variance of estimators that use marginal propensities can be reduced to match that of estimators using only labeled data

Estimator N # labeled ATE RMSE Bias Coverage
IPW (logistic) 5,000 50 3.00 0.65 -0.00 0.90
IPW (logistic) 5,000 100 3.00 0.47 -0.00 0.93
IPW (logistic) 5,000 500 3.00 0.20 0.00 0.97
IPW (BART) 5,000 50 2.97 0.65 -0.03 0.91
IPW (BART) 5,000 100 2.97 0.47 -0.03 0.93
IPW (BART) 5,000 500 2.98 0.20 -0.02 0.96
TMLE 5,000 50 2.95 0.33 -0.05 0.76
TMLE 5,000 100 2.96 0.20 -0.04 0.81
TMLE 5,000 500 3.00 0.08 0.00 0.87
BCF 5,000 50 2.91 0.70 -0.09 0.82
BCF 5,000 100 2.94 0.46 -0.06 0.90
BCF 5,000 500 2.99 0.22 -0.01 0.97
Table 7: 500 simulations with outcomes MCAR
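In code, the complete-case and semi-supervised approaches differ only in which rows the propensity model sees. The sketch below uses a logistic propensity model and an MCAR labeling mechanism; the data generating process is our own toy example, not the one used in the tables above.

```python
import numpy as np

def logit_fit_predict(X, z, iters=25):
    # Newton-Raphson logistic regression; returns in-sample P(Z = 1 | X).
    Xd = np.column_stack([np.ones(len(z)), X])
    b = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-np.clip(Xd @ b, -30, 30)))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        b += np.linalg.solve(H, Xd.T @ (z - p))
    return 1 / (1 + np.exp(-np.clip(Xd @ b, -30, 30)))

def hajek_ipw(y, z, p):
    w1, w0 = z / p, (1 - z) / (1 - p)
    return y @ w1 / w1.sum() - y @ w0 / w0.sum()

rng = np.random.default_rng(2)
N, n_lab = 5000, 500
X = rng.normal(size=(N, 2))
p_true = 1 / (1 + np.exp(-0.7 * X[:, 0]))
z = rng.binomial(1, p_true)
y = X[:, 0] + 3.0 * z + rng.normal(size=N)    # true ATE = 3
lab = np.zeros(N, dtype=bool)
lab[rng.choice(N, n_lab, replace=False)] = True   # outcomes MCAR

# Complete case: fit the propensity model on the labeled subset only.
ate_cc = hajek_ipw(y[lab], z[lab], logit_fit_predict(X[lab], z[lab]))
# Semi-supervised: fit on all N units (X and Z are always observed),
# then evaluate the fitted scores on the labeled subset.
ate_ss = hajek_ipw(y[lab], z[lab], logit_fit_predict(X, z)[lab])
print(ate_cc, ate_ss)
```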

4 Discussion

4.1 Resolving the estimated propensity paradox

Section 2.3 reviews the paradox that conditioning on estimated propensities reduces the variance of weighting estimators, even when true propensities are known (see Hirano, Imbens and Ridder (2003), Rosenbaum (1987), and Lunceford and Davidian (2004)). We believe this issue deserves a detailed discussion, as it contains a number of subtleties, and its central claim (that estimated propensities are "better" than true propensities from a variance perspective) persists in the causal inference community to this day.

4.1.1 What the paradox actually implies

The mechanism by which conditioning on estimated propensities can reduce variances has two components:

  • (a) Adjusting for finite sample differences between true propensities and empirical treatment probabilities

  • (b) (For weighting estimators) effecting a variant of direct regression adjustment by stratifying on pure prognostic / moderator variables

We noted in Section 2.3 that the variance implications of (a) can be mitigated by modeling Z directly as a function of the true propensity score, if it is available. This leaves (b) as a concern, which is no longer a question of which propensity scores to use, but rather of which estimator to use. In particular, it is not a concern for direct regression estimators, which allow adjustment for prognostic / moderator variables directly. Indeed, our simulations using BCF demonstrate this point.

4.1.2 Bias-variance tradeoff

Absent from most of the literature on estimated versus true propensities is the possibility of model misspecification in estimating propensities. Points (a) and (b) above thus reflect a perhaps optimistic view, namely that the propensity function can be estimated arbitrarily well in finite samples; under that view, concerns about variance outweigh concerns about bias. However, our simulations show that in plausible real-world scenarios, propensity-based ATE estimators that rely only on labeled data can be severely biased. Indeed, while Section 2.3 shows that the variance of the "true propensities" estimator can be higher than that of either adjusted estimator, in our simulations the true-propensities estimator typically achieves the lowest RMSE, implying that bias overwhelms variance in many cases.

4.2 Propensity model specification

Though there may be benefits to using estimated propensity scores, in practice estimating propensity scores introduces a new set of risks. The simulations in Section 3 show that a misspecified propensity model can bias estimates, sometimes drastically. It may seem natural, then, to treat a propensity model as any other supervised learning problem, with variable selection, model validation, and diagnostics to ensure a proper fit. Indeed, McCaffrey, Ridgeway and Morral (2004) use boosting to estimate propensities, in part for the automatic variable selection and flexible model specification that tree ensembles provide. Their results show that using boosting for a propensity model leads to better covariate balance and lower standard errors than a logistic propensity model.

However, Hernan and Robins (2020) caution that traditional model selection techniques may be of dubious utility in constructing propensity models. They note that the purpose of the propensity score is to control for confounding variables, not to predict Z from X arbitrarily well. They are careful to point out that a purely data-driven approach to model selection runs the risk of including variables, such as colliders, mediators, or post-treatment variables, that jeopardize identification of treatment effects. They also advise that, even when treatment effects are identified, including instruments in a propensity model can increase the variance of treatment effect estimates by pushing estimated propensities closer to 0 and 1.

At first glance, this point would seem to argue against our recommendation to use semi-supervised machine learning in ATE estimation. Indeed, naively overfitting a propensity model is a surefire way to get noisy (and perhaps even biased) treatment effect estimates. However, we believe that a number of practical steps can be taken to mitigate these concerns.

First, estimated propensity scores should always be visualized or otherwise inspected to ensure that they are not numerically close to 0 and 1. Second, regularization methods (such as an ℓ2 penalty or certain types of ensembles) can be used instead of plain logistic regression to reduce the risk of estimating propensity scores close to 0 and 1. Finally, subject matter expertise should always play a role in building propensity models. In many applied problems, especially those with a high-dimensional covariate space, the challenge of constructing an accurate causal graph is highly non-trivial. Insofar as subject matter expertise can identify variables such as colliders, post-treatment variables, or non-confounders, such variables should of course be excluded from any propensity model.
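The first two recommendations can be partially automated. The thresholds below (0.01 and 0.99) are arbitrary illustrations rather than values taken from the literature.

```python
import numpy as np

def check_overlap(p_hat, lo=0.01, hi=0.99):
    # Report the range of estimated propensities and the share that fall
    # outside [lo, hi]; extreme scores destabilize weighting estimators.
    frac = float(np.mean((p_hat < lo) | (p_hat > hi)))
    print(f"range [{p_hat.min():.3f}, {p_hat.max():.3f}], "
          f"{100 * frac:.1f}% outside [{lo}, {hi}]")
    return frac

def trim(p_hat, lo=0.01, hi=0.99):
    # A crude but common remedy: clip the scores before forming weights.
    return np.clip(p_hat, lo, hi)

scores = np.array([0.002, 0.30, 0.55, 0.997])
check_overlap(scores)    # flags the two extreme scores
print(trim(scores))      # all scores now clipped into [0.01, 0.99]
```

Trimming trades a little bias for stability; it is not a substitute for asking why the model produced extreme scores in the first place.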

Researchers who are concerned that an underlying causal graph invalidates their propensity model can also turn to the causal discovery literature (e.g. Peters, Janzing and Schölkopf (2017)). Specifically, Entner, Hoyer and Spirtes (2013) present an approach to the problem of identifying confounders for adjustment in a causal model. In practice, the challenge of specifying (or estimating from the data) a causal graph presents a number of pitfalls beyond the scope of this paper. Suffice it to say that there is no easy replacement for domain expertise in this process, and that this point applies to propensity models fit on fully labeled data as well as to those that use unlabeled data.

In order to better understand the pitfalls of including non-confounders in a propensity model, we test several possibilities in a simulation study. Letting X_j be a variable in a set of p covariates, we consider four cases:

  • Case 1: X_j is a confounder and is included in the propensity model

  • Case 2: X_j is not a confounder and is included in the propensity model

  • Case 3: X_j is a confounder and is not included in the propensity model

  • Case 4: X_j is not a confounder and is not included in the propensity model

Cases 1 and 4 are ideal, while Cases 2 and 3 both fail to properly account for the true underlying causal model. The question we are concerned with is whether erring on the side of including non-confounders in a propensity model (Case 2) is as harmful to inference as ignoring true confounders (Case 3). To answer this, we simulate 100 datasets of n = 1,000 from the following data generating process:

\begin{align*}
Y &= X\beta + \tau Z + \varepsilon\\
X, \varepsilon &\sim \mathcal{N}(0, 1)\\
\tau &= 3\\
p &= \frac{1}{1 + \exp(-X\gamma)}\\
Z &\sim \mathrm{Bernoulli}(p)
\end{align*}

For all variables except X_j, the β and γ coefficients for the outcome and propensity models are drawn uniformly at random from the ranges (-0.5, 0.5) and (-0.3, 0.3), respectively. For the scenarios in which X_j is a confounder (that is, X_j impacts both Y and Z), we set β_{X_j} = 0.5. Each of our simulations assumes some relationship between X_j and Z, and we vary γ_{X_j} over {0, 1, 10} to test the extent to which conditioning on X_j in a propensity model separates the treatment and control groups.
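This data generating process translates directly to code. The sketch below draws one replicate; the random seed, the column index of X_j, and the covariate dimension are our own illustrative choices.

```python
import numpy as np

def simulate(n=1000, p_dim=10, j=0, confounder=True, gamma_j=1.0, seed=0):
    # One replicate of the DGP above: Y = X beta + tau Z + eps,
    # Z ~ Bernoulli(logistic(X gamma)), tau = 3.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p_dim))
    beta = rng.uniform(-0.5, 0.5, size=p_dim)
    gamma = rng.uniform(-0.3, 0.3, size=p_dim)
    gamma[j] = gamma_j                      # X_j always affects Z
    beta[j] = 0.5 if confounder else 0.0    # X_j affects Y only when a confounder
    prob = 1 / (1 + np.exp(-X @ gamma))
    Z = rng.binomial(1, prob)
    Y = X @ beta + 3.0 * Z + rng.normal(size=n)
    return X, Z, Y, prob

X, Z, Y, prob = simulate(gamma_j=10.0)
print(prob.min(), prob.max())   # near 0 and 1: strong separation when gamma_j = 10
```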

Below we report the simulated bias, RMSE, and 95% interval coverage for each of the four cases, using BCF with a BART propensity model and γ_{X_j} = 1.

Case X_j confounder? X_j in p̂? Bias RMSE Coverage
1 Yes Yes 0.004 0.098 89%
2 No Yes -0.008 0.096 88%
3 Yes No 0.068 0.109 84%
4 No No -0.007 0.087 87%
Table 8: ATE estimation using BCF, γ_{X_j} = 1

We see little difference in the results across the four cases: with γ_{X_j} = 1, the influence of X_j on treatment assignment is modest, so the confounding it induces is weak. Below is the same set of outputs for γ_{X_j} = 10.

Case X_j confounder? X_j in p̂? Bias RMSE Coverage
1 Yes Yes 0.033 0.253 91%
2 No Yes -0.09 0.262 87%
3 Yes No 0.24 0.252 80%
4 No No -0.045 0.126 98%
Table 9: ATE estimation using BCF, γ_{X_j} = 10

We see that Cases 1 and 4 exhibit the best performance (since they properly incorporate the true causal graph), but even when X_j and Z are strongly related, including X_j in the propensity model when it is not a confounder is less harmful than excluding it when it is a confounder. To understand why, we briefly review the discussion of "regularization-induced confounding" (RIC) in Hahn et al. (2020).

As we have seen in our simulations, classic parametric models of treatment assignment are vulnerable to misspecification. While this shortcoming suggests using flexible machine learning methods to estimate propensity scores, such methods run the risk of overfitting the data and estimating propensities close to 0 or 1. Beyond the numerical instability this causes for weighting estimators, such overfitting can also effectively violate the positivity or conditional exchangeability assumptions. Researchers who want flexible propensity models without overfitting can use regularized machine learning methods, such as BART. However, Hahn et al. (2020), building on Hahn et al. (2018), show that a naively regularized model can bias treatment effect estimates. Their proposed solution, BCF, controls this bias by incorporating the estimated propensities as a feature in a regression model for E(Y | Z = 0, p̂(X)).
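A linear caricature of RIC, our own construction rather than the setup in Hahn et al. (2020), illustrates the mechanism: under targeted selection, shrinking the prognostic term all the way to zero (the limiting case of heavy regularization) transfers its signal to the treatment coefficient, while adding the propensity as a regression feature, in the spirit of BCF, absorbs that signal.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-3 * x))        # targeted selection: treatment tracks prognosis
z = rng.binomial(1, p)
y = 2 * x + 3.0 * z + rng.normal(size=n)   # true ATE = 3

def tau_hat(*features):
    # OLS of y on [1, z, features...]; the coefficient on z estimates the ATE.
    D = np.column_stack([np.ones(n), z, *features])
    return np.linalg.lstsq(D, y, rcond=None)[0][1]

print(tau_hat())     # prognostic term shrunk away entirely: badly biased upward
print(tau_hat(p))    # propensity included as a feature: bias essentially removed
```

The second regression works because, within levels of the propensity, treatment assignment is unrelated to the prognostic variable.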

Since the simulation studies above use BCF with a BART propensity model, we avoid some of the common problems of overfit propensity models while also recovering unbiased estimates of the average treatment effect. This explains why including a non-confounder X_j that is strongly predictive of Z in the propensity model does not harm inference for the ATE nearly as much as excluding X_j when it is a confounder.

Acknowledgements

This work was partially supported by NSF Grant DMS-1502640.

References

  • Belkin, M., Niyogi, P. and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434.
  • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  • Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
  • Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees.
  • Cerulli, G. (2014). treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score. The Stata Journal 14, 541–561.
  • Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4, 266–298.
  • Entner, D., Hoyer, P. and Spirtes, P. (2013). Data-driven covariate selection for nonparametric estimation of causal effects. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (C. M. Carvalho and P. Ravikumar, eds.). Proceedings of Machine Learning Research 31, 256–264. PMLR.
  • Gruber, S. and van der Laan, M. J. (2009). Targeted maximum likelihood estimation: A gentle introduction.
  • Hahn, P. R., Carvalho, C. M., Puelz, D. and He, J. (2018). Regularization and confounding in linear regression for treatment effect estimation. Bayesian Analysis 13, 163–182.
  • Hahn, P. R., Murray, J. S. and Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis.
  • Harville, D. A. (1998). Matrix Algebra From a Statistician's Perspective.
  • Hernan, M. A. and Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
  • Hirano, K., Imbens, G. W. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
  • Lee, B. K., Lessler, J. and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine 29, 337–346.
  • Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statistical Science, 189–205.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons.
  • Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23, 2937–2960.
  • McCaffrey, D. F., Ridgeway, G. and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9, 403.
  • Peters, J., Janzing, D. and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
  • Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical Association 82, 387–394.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
  • Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association 75, 591–593.
  • van der Laan, M. J. (2010a). Targeted maximum likelihood based causal inference: Part I. The International Journal of Biostatistics 6.
  • van der Laan, M. J. (2010b). Targeted maximum likelihood based causal inference: Part II. The International Journal of Biostatistics 6.
  • Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, 1–130.

Appendix A Variance derivations for IPW_{p(x)} vs IPW_{p̂(x)}

Consider the data generating process of Section 2.3. We condition on a sample of size n and observed values of X, such that n_x is also observed. We derive the variance of the IPW estimators for an arbitrary {X = x} stratum and then conclude that the result holds across the support of X. We randomize over the sample sizes of treated units, which we model as N_{x,z=1} ~ Binomial(n_x, p_x):

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{p(x),x}\right] &= \mbox{V}\left[\left(\frac{n_{x}}{n}\right)\left(\frac{\hat{p}_{x}}{p_{x}}\bar{Y}_{x,z=1}-\frac{1-\hat{p}_{x}}{1-p_{x}}\bar{Y}_{x,z=0}\right)\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\frac{\hat{p}_{x}}{p_{x}}\bar{Y}_{x,z=1}-\frac{1-\hat{p}_{x}}{1-p_{x}}\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\left(\frac{N_{x,z=1}}{n_{x}p_{x}}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=1}\mid N_{x,z=1}\right]+\left(\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\left(\frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}p_{x}^{2}}{n_{x}^{2}p_{x}^{2}}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}\mid N_{x,z=1}\right]\right]\right)\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\left(\frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}(1-p_{x})^{2}}{n_{x}^{2}(1-p_{x})^{2}}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\right)\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]
\end{align*}

while

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x}\right] &= \mbox{V}\left[\left(\frac{n_{x}}{n}\right)\left(\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\right)\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\tau\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]
\end{align*}

The final step uses the fact that E[Ȳ_{x,z=1} − Ȳ_{x,z=0} | N_{x,z=1}] = τ, a constant, so its variance is zero.

Since variances are nonnegative and both \frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}p_{x}^{2}}{n_{x}^{2}p_{x}^{2}}>1 and \frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}(1-p_{x})^{2}}{n_{x}^{2}(1-p_{x})^{2}}>1 for n_x > 0 and 0 < p_x < 1, we see that \mbox{V}\left[\mbox{IPW}_{p(x),x}\right]>\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x}\right]. Since the stratum X = x was arbitrary, this result applies to all {X = x} strata and thus holds in expectation across the support of X.
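The inequality is easy to verify numerically for a single stratum. The outcome distribution and stratum parameters below are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, p_x, tau, reps = 40, 0.3, 3.0, 20000
est_true, est_hat = [], []
for _ in range(reps):
    z = rng.binomial(1, p_x, size=n_x)
    if z.sum() in (0, n_x):       # need both treatment arms observed
        continue
    y = tau * z + rng.normal(size=n_x)
    ybar1, ybar0 = y[z == 1].mean(), y[z == 0].mean()
    p_hat = z.mean()
    # IPW with the true stratum propensity p_x:
    est_true.append((p_hat / p_x) * ybar1 - ((1 - p_hat) / (1 - p_x)) * ybar0)
    # IPW with the estimated (empirical) propensity p_hat -- the weights cancel:
    est_hat.append(ybar1 - ybar0)

print(np.var(est_true), np.var(est_hat))  # true-propensity variant has larger variance
```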

Appendix B Variance derivations for IPW_{p̂(x)} vs IPW_{p̂(p(x))}

Now, we compare the variance of the p̂(p(X)) estimator with that of the p̂(X) estimator. These two estimators differ only in their use of X_2. Since p(X) = f(X_1), the p̂(p(X)) estimator is equivalent to a stratification estimator using only X_1, while the p̂(X) estimator stratifies on both X_1 and X_2. In this derivation, we condition on a sample of size n and observed values of x_1 and z, so that n_{x_1,z} is observed. We randomize over X_2, deriving the variance of the IPW estimators for an arbitrary {X_1 = x_1, Z = 1} stratum and concluding that the result holds across the support of X and Z. To represent the randomization over X_2 at the level of an {X_1 = x_1} stratum, we introduce two binomial random variables:

  • N_{x_{1},X_{2}=1,z=1}\sim\mbox{Binomial}(n_{x_{1},z=1},1/2)

  • N_{x_{1},X_{2}=1,z=0}\sim\mbox{Binomial}(n_{x_{1},z=0},1/2)

We can decompose a given {X_1 = x_1} stratum of the p̂(p(X)) estimator as follows

\begin{align*}
\mbox{IPW}_{\hat{p}(p(x)),x_{1}} &= \frac{1}{n}\sum_{i:X_{1,i}=x_{1}}\left(\frac{Y_{i}Z_{i}}{\hat{p}(p_{x_{1}})}-\frac{Y_{i}(1-Z_{i})}{1-\hat{p}(p_{x_{1}})}\right)\\
&= \frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}+\frac{n_{x_{1},z=1}-N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right)\\
&\quad-\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}+\frac{n_{x_{1},z=0}-N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right)
\end{align*}

We compute the variance of this stratum-specific estimator via the law of total variance

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\right] &= \mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]\\
&\quad+\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]
\end{align*}

where

\begin{align*}
&\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]=\\
&\quad\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+\frac{n_{x_{1},z=1}-N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]\right)\\
&\quad\quad-\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]+\frac{n_{x_{1},z=0}-N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]\right)
\end{align*}

If X_2 has no pure prognostic / moderating impact on Y, then it is true that

  • \mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]=\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right] and

  • \mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]=\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right],

and thus \mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]=0.

So when X_2 is not a pure prognostic / moderator variable, the variance of \mbox{IPW}_{\hat{p}(p(x)),x_{1}} is equal to \mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right].

Now, observe that a given {X_1 = x_1} stratum of the p̂(X) estimator can be written as

\displaystyle\mbox{IPW}_{\hat{p}(x),x_{1}}=\frac{1}{n}\sum_{i:X_{1,i}=x_{1}}\left(\frac{Y_{i}Z_{i}}{\hat{p}(X_{i})}-\frac{Y_{i}(1-Z_{i})}{1-\hat{p}(X_{i})}\right)
\displaystyle=\frac{N_{x_{1},X_{2}=1}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}+\frac{N_{x_{1},X_{2}=0}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}
\displaystyle\;\;\;\;-\frac{N_{x_{1},X_{2}=1}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}-\frac{N_{x_{1},X_{2}=0}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}
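This collapse of the cell-wise IPW sum into a weighted difference of cell means can be verified numerically. The sketch below (simulated data of our own choosing: a single $X_{1}=x_{1}$ stratum with binary $X_{2}$, treatment randomized within cells, and a hypothetical treatment effect of 2) computes both forms and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# One X1 = x1 stratum with binary X2; treatment Z randomized within cells.
n = 200
x2 = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
y = rng.normal(size=n) + 2.0 * z  # hypothetical treatment effect tau = 2

# Cell-wise propensity estimate: p_hat(x2) = #{Z=1, X2=x2} / #{X2=x2}.
p_hat = np.array([z[x2 == v].mean() for v in (0, 1)])[x2]

# Direct IPW form within the stratum.
ipw_direct = np.mean(y * z / p_hat - y * (1 - z) / (1 - p_hat))

# Weighted difference of cell means.
ipw_cells = sum(
    (np.sum(x2 == v) / n)
    * (y[(x2 == v) & (z == 1)].mean() - y[(x2 == v) & (z == 0)].mean())
    for v in (0, 1)
)

assert np.isclose(ipw_direct, ipw_cells)
```

The identity is exact whenever no $(X_{2},Z)$ cell is empty, since each observation's weight is the inverse of its cell's empirical treatment frequency.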

To obtain the variance of this estimator, we again apply the law of total variance

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\right]=\mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]
\displaystyle\;\;\;\;+\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]

and

\displaystyle\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]
\displaystyle=\frac{N_{x_{1},X_{2}=1}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+\frac{N_{x_{1},X_{2}=0}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]
\displaystyle\;\;\;\;-\frac{N_{x_{1},X_{2}=1}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]-\frac{N_{x_{1},X_{2}=0}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]
\displaystyle=\frac{N_{x_{1},X_{2}=1,z=0}+N_{x_{1},X_{2}=1,z=1}}{n}\left(\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]-\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]\right)
\displaystyle\;\;\;\;+\frac{N_{x_{1},X_{2}=0,z=0}+N_{x_{1},X_{2}=0,z=1}}{n}\left(\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]-\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]\right)

Recall that $\mbox{E}\left[\bar{Y}_{x_{1},x_{2},Z=1}\right]-\mbox{E}\left[\bar{Y}_{x_{1},x_{2},Z=0}\right]=\tau$ for $x_{2}\in\{0,1\}$, so $\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]=\tau n_{x_{1}}/n$ and thus $\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]=0$, regardless of whether or not $X_{2}$ is a pure prognostic / moderator variable.
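The law of total variance underlying these calculations can be illustrated with a toy Monte Carlo check (an example of our own choosing, unrelated to the estimator itself): for $N\sim\mbox{Binomial}(10,0.3)$ and $W\mid N\sim\mbox{N}(N,4)$, the total variance should match $\mbox{E}\left[\mbox{V}\left[W\mid N\right]\right]+\mbox{V}\left[\mbox{E}\left[W\mid N\right]\right]=4+\mbox{V}\left[N\right]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# N ~ Binomial(10, 0.3); W | N ~ Normal(mean=N, sd=2).
reps = 2_000_000
n_draws = rng.binomial(10, 0.3, size=reps)
w = rng.normal(loc=n_draws, scale=2.0)

total_var = w.var()
# E[V[W|N]] = 4 (constant), and V[E[W|N]] = V[N], estimated from the same draws.
decomposed = 4.0 + n_draws.var()

# The two sides agree up to Monte Carlo error.
assert abs(total_var - decomposed) / decomposed < 0.02
```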

We can therefore compare the variances of the $\hat{p}(p(x))$ and $\hat{p}(x)$ estimators, without evaluating the expectation over $X_{2}$ for $\hat{p}(x)$, by comparing the conditional variance terms of the two expressions.

Focusing on the first two terms of the conditional variance expressions (corresponding to $X_{1}=x_{1}$ and $Z=1$), we see that

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1},z=1}\mid N_{x_{1},X_{2}=1,z=1}\right]
\displaystyle=\left(\frac{N_{x_{1},X_{2}=1}}{n}\right)^{2}\frac{\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]}{N_{x_{1},X_{2}=1,z=1}}+\left(\frac{N_{x_{1},X_{2}=0}}{n}\right)^{2}\frac{\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]}{N_{x_{1},X_{2}=0,z=1}}
\displaystyle=\frac{1}{n^{2}}\frac{\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}\right)^{2}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]}{N_{x_{1},X_{2}=1,z=1}}
\displaystyle\;\;\;\;\;\;+\frac{1}{n^{2}}\frac{\left(N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}\right)^{2}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]}{N_{x_{1},X_{2}=0,z=1}}

and also that

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1},z=1}\mid N_{x_{1},X_{2}=1,z=1}\right]
\displaystyle=\frac{1}{n^{2}}\left[\frac{\left(n_{x_{1}}\right)^{2}\left(N_{x_{1},X_{2}=1,z=1}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+N_{x_{1},X_{2}=0,z=1}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]\right)}{\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}\right)^{2}}\right]

where $n_{x_{1}}=\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}\right)+\left(N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}\right)$.

If $\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]=\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]$, then we can set $v$ equal to this common value and factor it out of both expressions. After this adjustment, the comparison reduces to determining the conditions under which

\displaystyle\frac{n_{x_{1}}^{2}}{N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}}\leq\frac{N_{x_{1},X_{2}=1}^{2}}{N_{x_{1},X_{2}=1,z=1}}+\frac{N_{x_{1},X_{2}=0}^{2}}{N_{x_{1},X_{2}=0,z=1}}

where $N_{x_{1},X_{2}=1}=N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}$ and $N_{x_{1},X_{2}=0}=N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}$.
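Before establishing this inequality via convexity, it can be checked by brute force over small positive cell counts. In the sketch below, $a,b,c,d$ stand for $N_{x_{1},X_{2}=1,z=1}$, $N_{x_{1},X_{2}=1,z=0}$, $N_{x_{1},X_{2}=0,z=1}$, and $N_{x_{1},X_{2}=0,z=0}$:

```python
from itertools import product

def lhs(a, b, c, d):
    """n_{x1}^2 / (N_{X2=1,z=1} + N_{X2=0,z=1})."""
    n = a + b + c + d
    return n**2 / (a + c)

def rhs(a, b, c, d):
    """N_{X2=1}^2 / N_{X2=1,z=1} + N_{X2=0}^2 / N_{X2=0,z=1}."""
    return (a + b) ** 2 / a + (c + d) ** 2 / c

# Exhaustively check all nonempty cell configurations with counts 1..8.
violations = [cells for cells in product(range(1, 9), repeat=4)
              if lhs(*cells) > rhs(*cells) + 1e-9]
assert violations == []
```

Note that equality is attained, e.g. at $a=b=c=d=1$, where both sides equal $8$.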

Multiplying through by $1/2$ puts this in the form of a midpoint convexity condition: the inequality holds if the function $f(x,y)=\frac{(x+y)^{2}}{y}$ is convex, with $f$ evaluated at the two points $\left(N_{x_{1},X_{2}=1,z=0},N_{x_{1},X_{2}=1,z=1}\right)$ and $\left(N_{x_{1},X_{2}=0,z=0},N_{x_{1},X_{2}=0,z=1}\right)$ and at their midpoint. Let the domain be the strictly positive orthant $\mathbb{R}^{2}_{++}$ and observe that a sufficient condition for the convexity of $f$ is positive semidefiniteness of its Hessian matrix (Boyd and Vandenberghe (2004)).

\displaystyle H=\begin{pmatrix}\frac{\partial^{2}f}{\partial x^{2}}&\frac{\partial^{2}f}{\partial y\partial x}\\ \frac{\partial^{2}f}{\partial y\partial x}&\frac{\partial^{2}f}{\partial y^{2}}\end{pmatrix}=\begin{pmatrix}\frac{2}{y}&\frac{-2x}{y^{2}}\\ \frac{-2x}{y^{2}}&\frac{2x^{2}}{y^{3}}\end{pmatrix}

The Hessian is positive semidefinite, which can be confirmed by observing that its three principal minors are nonnegative (Harville (1998)):

\frac{2}{y}\geq 0;\;\;\;\;\frac{2x^{2}}{y^{3}}\geq 0;\;\;\;\;\frac{4x^{2}}{y^{4}}-\frac{4x^{2}}{y^{4}}=0
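The closed-form Hessian entries can also be checked numerically (a sanity check of our own, not part of the original argument): on a grid of strictly positive $(x,y)$, both eigenvalues are nonnegative and the determinant vanishes identically, confirming positive semidefiniteness.

```python
import numpy as np

def hessian(x, y):
    """Closed-form Hessian of f(x, y) = (x + y)**2 / y."""
    return np.array([[2 / y, -2 * x / y**2],
                     [-2 * x / y**2, 2 * x**2 / y**3]])

for x in np.linspace(0.5, 5.0, 10):
    for y in np.linspace(0.5, 5.0, 10):
        H = hessian(x, y)
        # Positive semidefinite: eigenvalues >= 0 and determinant identically 0.
        assert np.linalg.eigvalsh(H).min() > -1e-6
        assert abs(np.linalg.det(H)) < 1e-6
```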

Thus, it follows that

\displaystyle\frac{n_{x_{1}}^{2}}{N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}}\leq\frac{N_{x_{1},X_{2}=1}^{2}}{N_{x_{1},X_{2}=1,z=1}}+\frac{N_{x_{1},X_{2}=0}^{2}}{N_{x_{1},X_{2}=0,z=1}}

The positivity constraint on the domain of $f$ translates into an assumption of “no empty strata” for our weighting estimator. Since this result holds for an arbitrary $n_{x_{1}}$, it holds across the distribution of sample sizes for which no stratum cells are empty.
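The practical force of the “no empty strata” assumption grows with the fineness of the stratification. A small simulation (a hypothetical design of our own: strata assigned uniformly at random and treatment a fair coin flip) estimates how much more often an empty (stratum, $Z$) cell occurs when stratifying on $(X_{1},X_{2})$ rather than on $X_{1}$ alone:

```python
import numpy as np

rng = np.random.default_rng(7)

def prob_empty_cell(n, n_strata, reps=20_000):
    """Monte Carlo probability that some (stratum, Z) cell is empty."""
    strata = rng.integers(0, n_strata, size=(reps, n))
    z = rng.integers(0, 2, size=(reps, n))
    cell = strata * 2 + z  # combined (stratum, treatment) cell index
    counts = np.apply_along_axis(np.bincount, 1, cell, minlength=2 * n_strata)
    return (counts == 0).any(axis=1).mean()

p_coarse = prob_empty_cell(n=40, n_strata=2)  # stratify on X1 only
p_fine = prob_empty_cell(n=40, n_strata=4)    # stratify on (X1, X2)

assert p_fine >= p_coarse
```

With the finer stratification the empty-cell probability rises sharply, which is precisely the regime in which the fully stratified $\hat{p}(x)$ estimator becomes undefined.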