
Semi-supervised learning and the question of true versus estimated propensity scores

Andrew Herren and P. Richard Hahn, Arizona State University
Abstract

A straightforward application of semi-supervised machine learning to the problem of treatment effect estimation would be to consider data as “unlabeled” if treatment assignment and covariates are observed but outcomes are unobserved. According to this formulation, large unlabeled data sets could be used to estimate a high dimensional propensity function, and causal inference using a much smaller labeled data set could proceed via weighted estimators using the learned propensity scores. In the limiting case of infinite unlabeled data, one may estimate the high dimensional propensity function exactly. However, longstanding advice in the causal inference community suggests that estimated propensity scores (from labeled data alone) are actually preferable to true propensity scores, implying that the unlabeled data is actually useless in this context. In this paper we examine this paradox and propose a simple procedure that reconciles the strong intuition that a known propensity function should be useful for estimating treatment effects with the previous literature suggesting otherwise. Further, simulation studies suggest that direct regression may be preferable to inverse-propensity weighted estimators in many circumstances.

Keywords: Causal inference, Semi-supervised learning, Unlabeled data, Propensity score

1 Introduction

In their seminal paper, Rosenbaum and Rubin (1983) show that the propensity score, the probability of receiving the treatment conditional on control covariates, is a sufficient statistic for estimating treatment effects: rather than needing to balance on a potentially high dimensional vector of control variables, one need only balance on the one-dimensional propensity score. In practice, however, researchers do not know the propensity function (the mapping from the covariates to the treatment probabilities) in advance; it must be estimated from data. Unfortunately, learning the propensity function can be quite difficult when there are many controls and its parametric form is unknown. Linear logistic regression is often the uncritical method of first resort, and even in that familiar setting estimation can be challenging with limited data.

Notably, propensity score approaches only require learning a function of the treatment assignment and control variables; the response variable is not utilized at this preliminary stage. This raises the possibility of incorporating side data for which responses aren’t measured, so-called “unlabeled” data, in the parlance of semi-supervised learning (Zhu and Goldberg (2009)) to assist in estimating the propensity function. Provided that such data is available, the propensity function can be inferred more accurately than would be possible with the labeled data alone. In the limit, one can imagine learning the propensity function arbitrarily accurately, thereby realizing the holy grail conveyed by Rosenbaum and Rubin (1983) of being able to do causal inference with a one-dimensional control variable.

However, it was famously shown by Hirano, Imbens and Ridder (2003) that true propensity scores are actually worse – in the sense of yielding asymptotically higher variance estimates of the treatment effect – than ones estimated from the labeled data alone! This observation would seem to dash any hope of leveraging unlabeled data as described above. Similar results were demonstrated by Lunceford and Davidian (2004) in the context of stratification and weighting estimators with parametric propensity models.

This paper unpacks this apparent paradox in an effort to provide practical guidance to applied researchers interested in estimating treatment effects from observational data. We find that reports of the uselessness of the true propensity function have been greatly exaggerated, reviving the potential for unlabeled data to assist in propensity function estimation, i.e. semi-supervised treatment effect estimation.

2 Problem Statement and Notation

2.1 Notation

Throughout the paper we use the following conventions:

  • Random variable: italicized, uppercase letter (e.g. $X$, $A$, …)

  • Scalar realization of random variable: italicized, lowercase letter (e.g. $x$, $a$, …)

  • Vector realization of random variable: lowercase roman letter (e.g. $\mathrm{x}$, $\mathrm{a}$, …)

  • Matrix realization of random variable: bold, uppercase letter (e.g. $\mathbf{X}$, $\mathbf{A}$, …)

  • Metric space of random variable’s support: math calligraphy letter (e.g. $\mathcal{X}$, $\mathcal{A}$, …)

We first draw a distinction between observational and experimental studies for treatment effect estimation. In experimental settings, the treatment of interest is deliberately randomized according to a pre-specified design, typically with the goal of enabling straightforward identification of the treatment effect. In observational settings, data on treatment, outcome and covariates are observed without any direct manipulation. Researchers wishing to infer treatment effects with observational data must rely on a series of assumptions, which we now introduce.

Let $Y$ refer to the outcome of interest, $Z$ denote a binary treatment assignment, and $X$ be a vector of covariates drawn from covariate space $\mathcal{X}$. We use the potential outcomes notation $Y^1$ and $Y^0$ to refer to the counterfactual values of $Y$ when $Z=1$ and $Z=0$ (Hernan and Robins (2020)). Our estimand of interest is the average treatment effect: $\mathrm{E}[Y^1 - Y^0]$.

Since the potential outcomes $(Y^1, Y^0)$ are never observed simultaneously, we rely on a number of identifying assumptions common to the causal inference literature (Rubin (1980), Rosenbaum and Rubin (1983), Hernan and Robins (2020)):

  1. Stable unit treatment value assumption (SUTVA): for any sample of size $n$ with $Y \in \mathcal{Y}$ and $Z \in \mathcal{Z}$, $(Y_i^1, Y_i^0) \perp Z_j$ for all $i, j \in \{1, \ldots, n\}$ with $j \neq i$

  2. Positivity: $0 < \mathrm{P}(Z = 1 \mid X = \mathbf{X}) < 1$ for all $\mathbf{X} \in \mathcal{X}$

  3. Conditional exchangeability: $(Y^1, Y^0) \perp Z \mid X$

Taken together, these assumptions enable identification of average treatment effects using observational data after adjusting for the effect of $X$ on $Z$ and $Y$. Thus, we can rewrite our estimand as $\textrm{ATE} = \mathrm{E}[Y^1 - Y^0] = \mathrm{E}_X[\mathrm{E}[Y \mid X, Z=1] - \mathrm{E}[Y \mid X, Z=0]]$. There are many ways to adjust for $X$ in estimating the ATE, but this paper focuses on methods that use the propensity score. Rosenbaum and Rubin (1983) define the propensity score as $p(X) = \mathrm{P}(Z = 1 \mid X)$ and show that $p(X)$ has several desirable properties:

  1. $X \perp Z \mid p(X)$

  2. $(Y^1, Y^0) \perp Z \mid p(X)$

In short, the propensity score can be used to ensure covariate balance across treatment groups, and the resulting covariate balance deconfounds the treatment effect estimate of $Z$ on $Y$. The deconfounding property of $p(X)$ implies that the treatment effect estimand can be rewritten as $\textrm{ATE} = \mathrm{E}[Y^1 - Y^0] = \mathrm{E}_{p(X)}[\mathrm{E}[Y \mid p(X), Z=1] - \mathrm{E}[Y \mid p(X), Z=0]]$.

2.2 Confounders, Instruments and Pure Prognostic / Moderator Variables

Above, we let $X$ refer to the set of covariates of $Y$ and $Z$, but for the purposes of this discussion, we introduce a taxonomy of covariate types. First, observe that, using a similar framing as Hahn et al. (2020), the relationship between $Y$, $Z$ and $X$ can be expressed as:

\begin{align*}
\mathrm{E}\left[Y \mid X_{\mu}, X_{\tau}, X_{\pi}\right] &= \mu\left(X_{\mu}\right) + \tau\left(X_{\tau}\right) Z \\
\mathrm{P}\left(Z = 1 \mid X_{\pi}\right) &= \pi\left(X_{\pi}\right)
\end{align*}

where $X_{\mu}$ refers to the set of prognostic variables, $X_{\tau}$ refers to the set of moderator variables, and $X_{\pi}$ refers to the set of propensity variables. We consider that any variable in $X$ can be classified as follows:

  1. Confounder: Any variable which appears in $X_{\pi}$ and either of $X_{\mu}$ or $X_{\tau}$

  2. Pure prognostic / moderator variable: Any variable which appears in either $X_{\mu}$, $X_{\tau}$, or both, but not in $X_{\pi}$

  3. Instrument: Any variable which appears in $X_{\pi}$ and not in $X_{\mu}$ nor $X_{\tau}$ (this is a “pure propensity” variable)

  4. Extraneous variable: Any variable in $X$ which appears in none of $X_{\tau}$, $X_{\mu}$, or $X_{\pi}$

2.3 Estimated vs true propensities

Hirano, Imbens and Ridder (2003) discuss the inverse-propensity weighted (IPW) estimator of the average treatment effect, which estimates the ATE as follows:

\textrm{ATE}_{\textrm{IPW}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i Z_i}{p(X_i)} - \frac{Y_i(1-Z_i)}{1-p(X_i)}\right)

They show that even when the true propensities are known, using them directly in the IPW estimator is asymptotically inefficient. We consider the finite sample variance properties of this estimator on a simple data generating process, which we define below:

  • Covariates: $X_1, X_2 \sim \textrm{Bernoulli}(1/2)$, letting $X = (X_1, X_2)$

  • Propensity function: $p(X) = 1/\left(1 + \exp\{-(\alpha_1 + \alpha_2 X_1)\}\right)$

  • Treatment: $Z \sim \textrm{Bernoulli}(p(X))$

  • Outcome: $Y \sim \mathcal{N}(\gamma_1 X_1 + \gamma_2 X_2 + \tau Z, 1)$
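The data generating process above, together with the true-propensity IPW estimator, can be sketched as follows. This is a minimal illustration, not the paper's code: the coefficient values (propensities $\{0.3, 0.7\}$, $\gamma_1 = \gamma_2 = 1$, $\tau = 2$) are arbitrary choices for the sketch.

```python
import numpy as np

# Minimal sketch of the two-covariate DGP above with true propensities
# {0.3, 0.7}; gamma_1 = gamma_2 = 1 and tau = 2 are illustrative choices.
rng = np.random.default_rng(0)

ALPHA1 = np.log(0.3 / 0.7)   # logit(0.3), so p(X) = 0.3 when X1 = 0
ALPHA2 = -2 * ALPHA1         # so p(X) = 0.7 when X1 = 1

def simulate(n, tau=2.0):
    x1 = rng.binomial(1, 0.5, n)
    x2 = rng.binomial(1, 0.5, n)
    p = 1.0 / (1.0 + np.exp(-(ALPHA1 + ALPHA2 * x1)))  # propensity function
    z = rng.binomial(1, p)
    y = x1 + x2 + tau * z + rng.normal(size=n)
    return y, z, p

def ipw(y, z, p):
    # ATE_IPW = mean( Y*Z/p(X) - Y*(1-Z)/(1-p(X)) )
    return np.mean(y * z / p - y * (1 - z) / (1 - p))

y, z, p = simulate(200_000)
ate = ipw(y, z, p)  # approaches tau = 2.0 as n grows
```

In a single large sample the estimate lands close to the true effect; the finite sample behavior at small $N$ is the subject of the derivation that follows.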

We let the random variable $N$ denote the overall sample size and define subset-specific sample sizes as follows:

  • $N_x$: the number of observations with $X = x$

  • $N_{x,z}$: the number of observations with $X = x$ and $Z = z$

First, observe in this case that $p(X)$ takes on two distinct values: $1/\left(1+\exp\{-\alpha_1\}\right)$ if $X_1 = 0$, and $1/\left(1+\exp\{-\alpha_1-\alpha_2\}\right)$ if $X_1 = 1$. We can split the sum into eight parts, stratifying on the unique values of ($X_1$, $X_2$, $Z$). For a given stratum, the weights in the denominator, as well as the $Z_i$ and $1 - Z_i$ indicators, are all the same and can be factored out. We refer to the “true propensities” estimator as $\mathrm{IPW}_{p(X)}$ to indicate that the true propensity function, $p(X)$, is the basis for the weights in the denominator. Let $p_x = \mathrm{P}(Z = 1 \mid X = x) = \mathrm{P}(Z = 1 \mid X_1 = x_1)$. The portion of the “true propensities” estimator represented by a given $\{X = x\}$ (i.e. $\{X_1 = x_1, X_2 = x_2\}$) stratum is given by

\begin{align*}
\mathrm{IPW}_{p(X),x} &= \frac{1}{N}\sum_{j:X_j=x}\left(\frac{Y_j Z_j}{p_x} - \frac{Y_j(1-Z_j)}{1-p_x}\right) \\
&= \frac{1}{N}\frac{N_x}{N_x}\frac{N_{x,z=1}}{N_{x,z=1}}\frac{1}{p_x}\sum_{j:X_j=x,Z_j=1}Y_j - \frac{1}{N}\frac{N_x}{N_x}\frac{N_{x,z=0}}{N_{x,z=0}}\frac{1}{1-p_x}\sum_{j:X_j=x,Z_j=0}Y_j \\
&= \left(\frac{N_x}{N}\right)\left(\frac{\hat{p}_x}{p_x}\bar{Y}_{x,Z=1} - \frac{1-\hat{p}_x}{1-p_x}\bar{Y}_{x,Z=0}\right)
\end{align*}

where $\mathrm{IPW}_{p(X)} = \mathrm{IPW}_{p(X),X=(0,0)} + \mathrm{IPW}_{p(X),X=(1,0)} + \mathrm{IPW}_{p(X),X=(0,1)} + \mathrm{IPW}_{p(X),X=(1,1)}$ and $\hat{p}_x = N_{x,z=1}/N_x$. We see that replacing the true propensity weights with these empirical weights results in the following estimator:

\begin{align*}
\mathrm{IPW}_{\hat{p}(X),x} &= \frac{1}{N}\sum_{j:X_j=x}\left(\frac{Y_j Z_j}{\hat{p}_x} - \frac{Y_j(1-Z_j)}{1-\hat{p}_x}\right) \\
&= \left(\frac{N_x}{N}\right)\left(\bar{Y}_{x,Z=1} - \bar{Y}_{x,Z=0}\right)
\end{align*}

We can compare the variances of these two estimators by first observing that the data are i.i.d., so that conditional on the number of observations with $Z = 1$ and $X = x$, the covariances of the strata means are 0. We derive the variances of the two estimators in Appendix A, but the key to understanding why $\mathrm{V}\left[\mathrm{IPW}_{\hat{p}(X),x}\right] \leq \mathrm{V}\left[\mathrm{IPW}_{p(X),x}\right]$ is to notice that the ratios $\hat{p}_x / p_x$ and $(1-\hat{p}_x)/(1-p_x)$ are random variables which converge to $1$ as $n \rightarrow \infty$ but are noisy in finite samples. These finite sample imbalances are an ancillary statistic for $\tau$, and the $\hat{p}(X)$ estimator conditions on this ancillary statistic for more efficient inference. For a detailed discussion of the role of ancillary statistics in causal inference, readers are referred to Chapter 10 of Hernan and Robins (2020) and the references therein.
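The collapse of the empirical-weight IPW estimator into a weighted difference of stratum means can be checked numerically. This is a quick sketch with arbitrary coefficients, not the paper's code: plugging the stratum-level empirical treatment frequencies $\hat{p}_x$ into the IPW formula reproduces $\sum_x (N_x/N)(\bar{Y}_{x,Z=1} - \bar{Y}_{x,Z=0})$ exactly.

```python
import numpy as np

# Verify numerically that IPW with stratum-wise empirical propensities
# equals sum_x (N_x/N) * (Ybar_{x,1} - Ybar_{x,0}); coefficients arbitrary.
rng = np.random.default_rng(1)
n = 400
x1 = rng.binomial(1, 0.5, n)
x2 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, np.where(x1 == 1, 0.7, 0.3))
y = x1 + x2 + 2.0 * z + rng.normal(size=n)

strata = 2 * x1 + x2            # four (x1, x2) cells
p_hat = np.empty(n)
for s in range(4):
    m = strata == s
    p_hat[m] = z[m].mean()      # p_hat_x = N_{x,z=1} / N_x

ipw_hat = np.mean(y * z / p_hat - y * (1 - z) / (1 - p_hat))

strat_diff = sum(
    (strata == s).mean()
    * (y[(strata == s) & (z == 1)].mean() - y[(strata == s) & (z == 0)].mean())
    for s in range(4)
)
# ipw_hat and strat_diff agree to floating point precision
```

The agreement is algebraic, not approximate, so it holds for any sample in which every stratum contains both treated and control units.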

This result seems to imply a paradox: even if a propensity function is known exactly, is the analyst better off ignoring that information and estimating the propensity scores? While it is true that directly weighting by the true propensities yields higher variances, we show that the true propensity scores can still be used to achieve efficiency. Consider in this case that $p(X) = 1/\left(1+\exp\{-\alpha_1-\alpha_2 X_1\}\right)$, so $Z$ can be modeled via logistic regression on $X_1$, $X_2$, and an intercept term. Fitting this model on a sample $\mathbf{X} = (\mathrm{x}_1, \mathrm{x}_2)$ and $\mathrm{z}$ would yield coefficient estimates $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$ such that $\hat{p}(\mathbf{X}) = \frac{1}{1+\exp\{-\hat{\beta}_1 - \hat{\beta}_2\mathrm{x}_1 - \hat{\beta}_3\mathrm{x}_2\}}$, where the true values of the parameters are $\beta_1 = \alpha_1$, $\beta_2 = \alpha_2$, and $\beta_3 = 0$. Thus, if we know the true value of $p(\mathbf{X})$, we can calculate the true $\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2$ by logit transformation

\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2 = -\log\left(\frac{1}{p(\mathbf{X})} - 1\right)

Since both $\beta_1 + \beta_2\mathrm{x}_1 + \beta_3\mathrm{x}_2$ and $\hat{\beta}_1 + \hat{\beta}_2\mathrm{x}_1 + \hat{\beta}_3\mathrm{x}_2$ are one-dimensional linear functions of $\mathbf{X}$, we could express $\hat{\beta}_1 + \hat{\beta}_2\mathrm{x}_1 + \hat{\beta}_3\mathrm{x}_2 = \hat{\eta}_0 - \log\left(\frac{1}{p(\mathbf{X})} - 1\right)\hat{\eta}_1$ for some $\hat{\eta}_0, \hat{\eta}_1 \in \mathbb{R}$ and observe that this is equivalent to a logistic regression of $Z$ on the 1-dimensional variable $-\log\left(\frac{1}{p(\mathbf{X})} - 1\right)$ with an intercept term.

This 1-dimensional regression adjustment is similar to the sample correction discussed in the context of the $\hat{p}(X)$ estimator, with one key difference: the differences between true and empirical treatment probabilities are adjusted across the support of $p(X)$ rather than $X$. In our two-covariate example, $p(X)$ is only a function of $X_1$ and so has two unique values, rather than the four distinct values of $X$.

We introduce the following shorthand to refer to the three estimators:

  1. $p(X)$: IPW estimator using the true propensity function, $p(X)$, as weights

  2. $\hat{p}(X)$: IPW estimator using the covariate-estimated propensity function, $\hat{p}(X)$, as weights

  3. $\hat{p}(p(X))$: IPW estimator using the true propensity function, $p(X)$, as a 1-dimensional variable for estimating sample propensity weights

Hirano, Imbens and Ridder (2003) and the example above demonstrate that the $\hat{p}(X)$ estimator has lower variance than the $p(X)$ estimator. However, we see in Appendix B that the $\hat{p}(p(X))$ estimator has lower variance than the $\hat{p}(X)$ estimator as long as the following two conditions are satisfied:

  • $X_2$ is not a pure prognostic / moderator variable: $\mathrm{E}\left[\bar{Y}_{X_1=x_1,X_2=1,Z=z}\right] = \mathrm{E}\left[\bar{Y}_{X_1=x_1,X_2=0,Z=z}\right]$ for all $x_1, z$. In our simple data generating process, this is true if $\gamma_2 = 0$.

  • The outcome variance does not differ along the support of $X_2$: $\mathrm{V}\left[\bar{Y}_{X_1=x_1,X_2=1,Z=z}\right] = \mathrm{V}\left[\bar{Y}_{X_1=x_1,X_2=0,Z=z}\right]$ for all $x_1, z$. In our simple data generating process, this is true as the variance of $Y$ is constant with respect to $X$.

The most interesting takeaway from this result is that the cases in which the $\hat{p}(X)$ estimator outperforms the $\hat{p}(p(X))$ estimator do not pertain to the intended use of the propensity score, which is to deconfound treatment assignment and enable identification of the ATE. The $\hat{p}(X)$ estimator can attain a lower variance, but it does so by stratifying on a pure prognostic / moderator variable, $X_2$. This points to an unintended use of weighting / stratification estimators, rather than a fundamental problem with using true propensities. Adjusting for pure prognostic / moderator variables can be achieved by direct regression adjustment, and Hahn et al. (2020) provide a nonparametric Bayesian approach that shows promising results on a number of difficult data generating processes.

To demonstrate the difference in variances of the three estimators in finite samples, we conduct a simulation study. We use the same data generating process discussed in the finite sample variance calculations above, in which $Y$ is normal, $X$ and $Z$ are Bernoulli, and $X$ has $p = 2$ dimensions. In the simulations presented below, we generate 10,000 datasets for each value of $N$ and estimate the average treatment effect using the IPW estimator, comparing three possible weighting options:

  a) $p(X)$: True propensities ($\{0.3, 0.7\}$)

  b) $\hat{p}(X)$: Estimated propensities (logistic regression of $Z$ on $X$)

  c) $\hat{p}(p(X))$: Sample adjustment using true propensities (logistic regression of $Z$ on $\mathrm{logit}(p(X))$)

  $N$     $p(X)$   $\hat{p}(X)$   $\hat{p}(p(X))$
  15      1.80     0.91           0.81
  25      1.41     0.69           0.52
  50      1.00     0.38           0.32
  100     0.70     0.24           0.22
  250     0.44     0.14           0.14

Table 1: Standard deviation of IPW estimates using different weighting approaches (10,000 simulations per $N$; $p = 2$, and the value of $N$ at which the variances converge depends on $p$)

Our simulations confirm the points made in Hirano, Imbens and Ridder (2003) and Rosenbaum (1987) that using the true propensities directly leads to higher variances in finite samples. However, we also see that the true propensities can be used in a one-dimensional model of $Z$ to attain even lower variances than estimators using the fitted propensities.
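A small Monte Carlo in the spirit of Table 1 can be sketched as follows. Two caveats: the outcome coefficients ($\gamma_1 = 1$, $\gamma_2 = 0$, $\tau = 2$) are illustrative, and stratum-level empirical treatment frequencies stand in for the logistic fits (here $\hat{p}(p(X))$ reduces to the empirical treatment frequency within each stratum of $p(X)$, i.e. of $X_1$).

```python
import numpy as np

# Monte Carlo sketch comparing the three weighting options; true
# propensities are {0.3, 0.7} and gamma_2 = 0, so X2 is extraneous.
rng = np.random.default_rng(2)

def one_draw(n=200):
    x1 = rng.binomial(1, 0.5, n)
    x2 = rng.binomial(1, 0.5, n)
    p = np.where(x1 == 1, 0.7, 0.3)          # true propensities
    z = rng.binomial(1, p)
    y = x1 + 2.0 * z + rng.normal(size=n)

    def ipw(w):
        return np.mean(y * z / w - y * (1 - z) / (1 - w))

    # (b) empirical treatment frequency within each (x1, x2) stratum
    w_hat = np.empty(n)
    for s in range(4):
        m = (2 * x1 + x2) == s
        w_hat[m] = z[m].mean()

    # (c) empirical treatment frequency within each stratum of p(X),
    #     i.e. of x1 alone
    w_pp = np.empty(n)
    for v in (0, 1):
        m = x1 == v
        w_pp[m] = z[m].mean()

    return ipw(p), ipw(w_hat), ipw(w_pp)

draws = np.array([one_draw() for _ in range(2000)])
sd_true, sd_hat, sd_pp = draws.std(axis=0)
```

The qualitative finding of Table 1, that the true-propensity weights produce the largest spread of estimates, should reproduce under essentially any seed at these sample sizes.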

2.4 Semi-supervised treatment effect estimation

In practice, knowledge of the exact true propensity function is unlikely outside of randomized experiments. However, it is certainly possible to imagine investigators having access to auxiliary data that enable more accurate estimates of the true propensities. The received wisdom of Hirano, Imbens and Ridder (2003), Lunceford and Davidian (2004), and Rosenbaum (1987) would seem to imply that such accuracy gains impede efficient estimation of the ATE. The primary challenge in using propensity scores estimated from a larger auxiliary dataset stems from the utility of conditioning on finite sample treatment probabilities in weighting estimators. However, we have shown in Section 2.3 that the same variance reduction can be achieved by conditioning on finite sample probabilities formed by $p(X)$ rather than $X$, and this result extends to an estimate of $p(X)$ derived from arbitrarily large datasets.

We can formalize the notion of using a large auxiliary dataset in estimating the ATE by framing it as a semi-supervised learning problem. Semi-supervised learning refers to a broad class of techniques that combine labeled and unlabeled data in statistical learning problems (Zhu and Goldberg (2009), Belkin, Niyogi and Sindhwani (2006), Liang, Mukherjee and West (2007)). Consider a dataset with $N$ observations. As above, we let $Y$, $Z$, and $X$ denote outcome, treatment, and covariates, respectively. We refer to the data points with observed outcomes as “labeled data,” and let $N_o$ refer to the number of such labeled observations and $N_m = N - N_o$ refer to the number of data points with missing outcomes. We index an outcome’s status as missing or observed by $M$.

Following the notation of Liang, Mukherjee and West (2007), we denote “unlabeled,” or marginal, data as

  • $X^m = \{X_i;\ i = N_o+1, \ldots, N_o+N_m\}$, and

  • $Z^m = \{Z_i;\ i = N_o+1, \ldots, N_o+N_m\}$

Similarly, we denote “labeled” data as

  • $Y^o = \{Y_i;\ i = 1, \ldots, N_o\}$,

  • $X^o = \{X_i;\ i = 1, \ldots, N_o\}$, and

  • $Z^o = \{Z_i;\ i = 1, \ldots, N_o\}$

We observe that $M = 1$ where $X, Z \in (X^m, Z^m)$ and $M = 0$ elsewhere. The discussion in Section 2.3 shows that the set of marginal $(X^m, Z^m)$ data can be put to good use in estimating the average treatment effect as follows:

  1. Obtain an estimate, $p_e(X)$, of the true propensities using the labeled and unlabeled data $(X, Z)$

  2. Adjust the estimated propensities to the labeled sample by modeling $Z^o \sim \textrm{Bernoulli}\left(p_e(X^o)\right)$, and denote these adjusted estimates by $p^*(X^o)$

  3. Estimate the ATE using only the labeled data ($Y^o$, $Z^o$, and $p^*(X^o)$)

As with other constructions in the semi-supervised learning literature, this framing allows us to profitably use information about the statistical patterns in the unlabeled data (in this case, $\mathrm{P}(Z = 1 \mid X)$) to better estimate an effect among the labeled data ($\mathrm{E}[Y^1 - Y^0]$).
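The three-step procedure above can be sketched end-to-end on the simple two-covariate DGP. Stratum-level empirical frequencies again stand in for the logistic fits of steps 1 and 2, and the sample sizes and coefficients are illustrative choices, not from the paper.

```python
import numpy as np

# End-to-end sketch of the three-step semi-supervised procedure above
# on the simple two-covariate DGP (sizes and coefficients illustrative).
rng = np.random.default_rng(3)

n, n_o = 10_000, 200                       # total and labeled sample sizes
x1 = rng.binomial(1, 0.5, n)
z = rng.binomial(1, np.where(x1 == 1, 0.7, 0.3))
y = x1 + 2.0 * z + rng.normal(size=n)      # outcomes; observed only if labeled
labeled = np.arange(n) < n_o               # MCAR labeling by construction

# Step 1: estimate p_e(X) from ALL (X, Z) pairs, labeled and unlabeled
p_e = np.where(x1 == 1, z[x1 == 1].mean(), z[x1 == 0].mean())

# Step 2: adjust to the labeled sample -- empirical treatment frequency
# within each stratum of distinct p_e(X) values among the labeled rows
zl, pl = z[labeled], p_e[labeled]
p_star = np.empty(n_o)
for v in np.unique(pl):
    m = pl == v
    p_star[m] = zl[m].mean()

# Step 3: IPW on the labeled data only
yl = y[labeled]
ate = np.mean(yl * zl / p_star - yl * (1 - zl) / (1 - p_star))
```

Note that the outcomes of the unlabeled rows are never touched; only their $(X, Z)$ pairs inform the propensity estimate, which is then re-calibrated to the labeled sample before weighting.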

3 Simulations

We conduct several simulation studies to understand the phenomena discussed above in more depth. While logistic regression is often a natural choice for estimating propensity scores, Lee, Lessler and Stuart (2010) point out that the assumptions required for logistic regression are not always warranted and they examine propensity score estimation using a number of machine learning methods, including CART (Breiman et al. (1984)) and Random Forest (Breiman (2001)).

This paper proceeds similarly, but instead relies on Bayesian Additive Regression Trees (BART) (Chipman et al. (2010)); code can be found at https://github.com/andrewherren/semi-supervised-propensity. BART is a nonparametric Bayesian tree ensemble method that estimates complex functions via a sum of weak regression or classification trees. In the case of our IPW estimator, we also consider a logistic propensity model in order to investigate how inferences can be harmed by model mis-specification.

3.1 Estimators

We employ three estimators which use propensity scores in our simulation study:

  • Inverse Propensity Weighted (IPW) estimator (Hirano, Imbens and Ridder (2003)), introduced in Section 2.3 and defined as

    \textrm{ATE}_{\textrm{IPW}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i Z_i}{p(X_i)} - \frac{Y_i(1-Z_i)}{1-p(X_i)}\right)

  • Targeted Maximum Likelihood Estimator (TMLE) (van der Laan (2010a), van der Laan (2010b), Gruber and van der Laan (2009))

    \textrm{ATE}_{\textrm{TMLE}} = \frac{1}{N}\sum_{i=1}^{N}\left(Q^{*}\left(Z_i=1, X_i\right) - Q^{*}\left(Z_i=0, X_i\right)\right)

    where $Q^{*}\left(Z_i, X_i\right)$ represents a semi-parametric model of the outcome $Y$ which incorporates the propensity score.

  • Bayesian Causal Forests (BCF) (Hahn et al. (2020))

    \textrm{ATE}_{\textrm{BCF}} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{f}(X_i, Z_i=1) - \bar{f}(X_i, Z_i=0)\right)

    where $\bar{f}(X_i, Z_i)$ is an average across posterior simulations of the fit $f(X_i, Z_i) = \mu(X_i, p(X_i)) + \tau(X_i) Z_i$, composed of two BART models.

3.2 Simulation approaches

Given the two step nature of each of the above estimators, in which a propensity model is first constructed and its estimates then used to calculate an average treatment effect, we consider three approaches to treatment effect estimation:

  • Complete case analysis: We discard all unlabeled data samples, estimate $\hat{p}(\mathbf{X}^o)$, and then compute average treatment effects on only the labeled observations.

  • Semi-supervised: We use the labeled and unlabeled $\mathbf{X}$ and $\mathrm{z}$ values to estimate $\hat{p}(\mathbf{X})$ and then use the $\hat{p}(\mathbf{X}^o)$ predictions to compute average treatment effects on the labeled observations.

  • True propensities: Simulation studies provide us with actual probabilities of receiving treatment, so we can use these true $p(\mathbf{X}^o)$ to compute average treatment effects on the labeled observations. Of course, in most real world applications, the exact true propensities will not be available, and we consider this scenario as a limiting case in which an arbitrarily large amount of unlabeled data were available for modeling $\hat{p}(\mathbf{X})$.

3.3 Interval construction

We construct confidence intervals for our simulations as follows:

  • IPW (logit) and IPW (BART): We compute the asymptotic standard error assuming a logistic propensity model as in Cerulli (2014)

  • TMLE: We use the 95% confidence interval produced by the tmle R package

  • BCF: We construct a 95% credible interval using posterior samples of the treatment effect

3.4 Evaluation metrics

We evaluate the results of our simulations using the following metrics.

  • RMSE: root mean squared error of estimated treatment effects

  • Bias: difference between true (simulated) ATE and average estimated ATE

  • Coverage: fraction of 95% confidence / credible intervals that contain the true ATE

3.5 Data Generating Process Simulations

For simulated covariates, outcomes, and treatment effects, we use a subset of the data generating processes tested in Hahn et al. (2020).

\begin{align*}
Y &= \mu(X) + \tau Z + \varepsilon \\
\mu(X) &= \begin{cases} 3 + X_1 X_3, & X_5 = 1 \\ X_1 X_3, & X_5 = 2 \\ -3 + X_1 X_3, & X_5 = 3 \end{cases} \\
X_1, X_2, X_3 &\sim \mathcal{N}(0, 1) \\
X_4 &\sim \textrm{Bernoulli}(0.5) \\
X_5 &\sim \textrm{Categorical}(0.25, 0.5, 0.25) \\
\tau &= 3 \\
\varepsilon &\sim \mathcal{N}(0, 1)
\end{align*}

There are $p = 5$ covariates, the first three of which are independent standard normal variables, the fourth of which is binary, and the fifth of which is an unordered categorical variable with values $1, 2, 3$. Treatment effects are homogeneous ($\tau = 3$), and the outcome is determined by a combination of treatment effects and a piecewise interaction function of three of the covariates ($\mu(X) = 1 + g(X_5) + X_1 X_3$, where $g(X) = 2$ if $X = 1$, $g(X) = -1$ if $X = 2$, and $g(X) = -4$ if $X = 3$). $\mu(X)$ is often referred to as a prognostic effect and can be conceptualized as the expected value of the outcome for individuals who do not receive the treatment.
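The covariates and prognostic function above can be sketched directly; note that the piecewise offsets $3, 0, -3$ are exactly $1 + g(X_5)$ for the three levels of $X_5$.

```python
import numpy as np

# Sketch of the DGP above: five covariates, piecewise prognostic
# function, homogeneous treatment effect tau = 3.
rng = np.random.default_rng(4)

def draw_covariates(n):
    x1, x2, x3 = rng.normal(size=(3, n))            # independent N(0, 1)
    x4 = rng.binomial(1, 0.5, n)                    # Bernoulli(0.5)
    x5 = rng.choice([1, 2, 3], size=n, p=[0.25, 0.5, 0.25])
    return x1, x2, x3, x4, x5

def mu(x1, x3, x5):
    # prognostic function: offset 3 / 0 / -3 by level of x5, plus x1 * x3
    offset = np.where(x5 == 1, 3.0, np.where(x5 == 2, 0.0, -3.0))
    return offset + x1 * x3

def draw_outcome(x1, x3, x5, z, tau=3.0):
    return mu(x1, x3, x5) + tau * z + rng.normal(size=len(x5))
```

Only $X_1$, $X_3$, and $X_5$ enter the outcome; $X_2$ and $X_4$ are extraneous covariates of the kind discussed in Section 2.2.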

For treatment assignment, we consider two cases:

  1. $\mathrm{P}(Z = 1 \mid X) = 0.5$

  2. $\mathrm{P}(Z = 1 \mid X) = 0.8\,\Phi(3\mu(X)/s - 0.5 X_1) + 0.05 + u/10$, where

    • $\Phi(\cdot)$ is the standard normal CDF

    • $s$ is the sample standard deviation of simulated observations of $\mu(X)$

    • $u \sim \mathrm{Uniform}(0, 1)$

The first treatment assignment mechanism mirrors a simple balanced randomized experiment. This presents a relatively straightforward inference problem, as there is no confounding, but it allows us to simulate the phenomenon of including non-confounders in the propensity model. The second treatment assignment mechanism is described as “Targeted Selection” in Hahn et al. (2020), and refers to the practice of assigning treatment based on the expected value of the outcome for those who don’t receive treatment.
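The targeted selection mechanism (case 2 above) can be sketched as follows; the prognostic values fed in here are placeholder standard normal draws rather than the full $\mu(X)$, so the sketch shows only the assignment function itself.

```python
import numpy as np
from math import erf, sqrt

# Sketch of the "targeted selection" assignment mechanism above:
# treatment probability increases with the expected untreated outcome.
rng = np.random.default_rng(5)

def phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def propensity_targeted(mu_vals, x1):
    s = mu_vals.std(ddof=1)                # sample sd of mu(X)
    u = rng.uniform(size=len(mu_vals))     # Uniform(0, 1) jitter
    base = np.array([phi(t) for t in 3.0 * mu_vals / s - 0.5 * x1])
    return 0.8 * base + 0.05 + u / 10.0

mu_vals = rng.normal(size=500)             # placeholder prognostic draws
x1 = rng.normal(size=500)
p = propensity_targeted(mu_vals, x1)       # strictly between 0 and 1
```

The additive $0.05 + u/10$ term keeps the probabilities bounded away from 0 and 1, so positivity holds by construction despite the strong dependence on $\mu(X)$.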

In order to simulate the process by which outcomes are unlabeled, we assume a fixed overall sample size of 5,000 and randomly label 50, 100, or 500 observations so that $M$ is not correlated with $Y$, $X$, or $Z$. This parallels the notion of “missing completely at random” (MCAR) in the missing data literature (Little and Rubin (2002)), though our problem is slightly different. We treat our unlabeled $X^m$ and $Z^m$ as auxiliary data that may be helpful in estimating the ATE rather than treating our unobserved $Y^m$ as missing data to be imputed or otherwise estimated. Before proceeding to a more detailed discussion of the simulation results, we briefly summarize the key takeaways:

  • Using unlabeled data will harm inferences in the case of a mis-specified propensity model, as confidence intervals narrow around a biased estimate

  • BCF and IPW/BART work best in the case of targeted selection

  • While randomized treatment assignment makes ATE estimation much easier, unlabeled data can still be useful in this case for variance reduction purposes

3.6 Simulation results under targeted selection

We first examine the targeted selection case, in which treatment assignment is correlated with the prognostic effect.

3.6.1 IPW (logit)

The table below shows the results for 500 simulations using an IPW estimator and propensities estimated by logistic regression. The targeted selection function is nonlinear, and thus a logistic propensity model is misspecified. We see that the results are badly biased, except those which use the true propensities, which are not impacted by the model misspecification. Furthermore, we observe that using unlabeled data and increasing sample size is harmful, as the variance is reduced around a biased estimate, leading to worse interval coverage.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.49 0.89 0.49 0.67
Complete Case 5,000 100 3.48 0.68 0.48 0.77
Complete Case 5,000 500 3.50 0.51 0.50 0.59
Semi-supervised 5,000 50 3.48 0.97 0.48 0.67
Semi-supervised 5,000 100 3.56 0.73 0.56 0.72
Semi-supervised 5,000 500 3.50 0.52 0.50 0.55
True propensities 5,000 500 2.99 0.25 -0.01 0.89
Table 2: 500 simulations with outcomes MCAR

3.6.2 IPW (BART)

The table below shows the results for 500 simulations using an IPW estimator and propensities estimated by BART. In this case, the propensity model is not mis-specified, and we observe that the use of unlabeled data reduces bias and increases interval coverage.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.90 0.90 0.90 0.47
Complete Case 5,000 100 3.79 0.79 0.79 0.39
Complete Case 5,000 500 3.50 0.50 0.50 0.36
Semi-supervised 5,000 50 3.00 0.85 -0.00 0.73
Semi-supervised 5,000 100 3.00 0.58 -0.00 0.86
Semi-supervised 5,000 500 3.00 0.26 -0.00 0.90
True propensities 5,000 500 2.99 0.25 -0.01 0.89
Table 3: 500 simulations with outcomes MCAR

3.6.3 TMLE

The table below shows the results for 500 simulations using a TMLE estimator and propensities estimated by BART. Although the propensity model is not misspecified in this case, the TMLE estimator achieves poor interval coverage, even at larger sample sizes where the average bias is lower.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.58 0.62 0.58 0.35
Complete Case 5,000 100 3.44 0.47 0.44 0.33
Complete Case 5,000 500 3.12 0.15 0.12 0.63
Semi-supervised 5,000 50 3.33 0.51 0.33 0.60
Semi-supervised 5,000 100 3.24 0.35 0.24 0.64
Semi-supervised 5,000 500 3.07 0.13 0.07 0.80
True propensities 5,000 500 3.05 0.12 0.05 0.84
Table 4: 500 simulations with outcomes MCAR
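Schematically, the targeting step that distinguishes TMLE from a plain outcome regression looks like the following. This is a simplified sketch with a linear (least-squares) fluctuation and a linear initial outcome model, both our own assumptions for illustration; the implementations cited in this paper (e.g. Gruber and van der Laan (2009)) use a logistic fluctuation on a bounded outcome.

```python
import numpy as np

def tmle_ate_linear(y, z, X, p_hat):
    n = len(y)
    # Step 1: initial outcome model Q(z, x), here a linear regression.
    D = np.column_stack([np.ones(n), z, X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    Q  = D @ beta
    Q1 = np.column_stack([np.ones(n), np.ones(n), X]) @ beta
    Q0 = np.column_stack([np.ones(n), np.zeros(n), X]) @ beta
    # Step 2: targeting -- regress residuals on the "clever covariate"
    # H(z, x) = z / p_hat(x) - (1 - z) / (1 - p_hat(x)).
    H = z / p_hat - (1 - z) / (1 - p_hat)
    eps = (H @ (y - Q)) / (H @ H)
    # Step 3: plug in the fluctuated predictions Q*(z, x) = Q(z, x) + eps * H(z, x).
    return np.mean((Q1 + eps / p_hat) - (Q0 - eps / (1 - p_hat)))

# Toy check with a correct outcome model and known propensities.
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-X[:, 0]))
z = rng.binomial(1, p)
y = X[:, 0] + 3.0 * z + rng.normal(size=n)   # true ATE = 3

print(tmle_ate_linear(y, z, X, p))
```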

3.6.4 BCF

The table below shows the results for 500 simulations using a BCF estimator and propensities estimated by BART. BCF was designed in part to handle the case of targeted selection, and we see that it produces unbiased estimates with high coverage as more unlabeled data is incorporated.

Approach N # labeled ATE RMSE Bias Coverage
Complete Case 5,000 50 3.54 0.99 0.54 0.70
Complete Case 5,000 100 3.43 0.80 0.43 0.80
Complete Case 5,000 500 3.06 0.37 0.06 0.96
Semi-supervised 5,000 50 3.26 0.80 0.26 0.77
Semi-supervised 5,000 100 3.15 0.58 0.15 0.86
Semi-supervised 5,000 500 3.03 0.37 0.03 0.98
True propensities 5,000 500 3.01 0.32 0.01 0.98
Table 5: 500 simulations with outcomes MCAR

3.7 Simulation results under randomized assignment

We now turn to the randomized treatment assignment scenario.

3.7.1 Complete case

The table below shows the results for 500 simulations using the complete case approach across all four estimators. We see in this case that all methods produce unbiased estimates of the ATE as sample size increases, and with the exception of the TMLE estimator, the approaches all achieve high coverage of the true treatment effect.

Estimator N # labeled ATE RMSE Bias Coverage
IPW (logistic) 5,000 50 3.03 0.39 0.03 0.99
IPW (logistic) 5,000 100 2.97 0.23 -0.03 1.00
IPW (logistic) 5,000 500 3.00 0.10 0.00 1.00
IPW (BART) 5,000 50 2.67 0.46 -0.33 0.98
IPW (BART) 5,000 100 2.75 0.31 -0.25 0.99
IPW (BART) 5,000 500 2.90 0.13 -0.10 1.00
TMLE 5,000 50 2.96 0.33 -0.04 0.72
TMLE 5,000 100 2.96 0.20 -0.04 0.79
TMLE 5,000 500 3.00 0.08 0.00 0.86
BCF 5,000 50 2.90 0.69 -0.10 0.79
BCF 5,000 100 2.93 0.47 -0.07 0.87
BCF 5,000 500 2.99 0.22 -0.01 0.96
Table 6: 500 simulations with outcomes MCAR

3.7.2 Semi-supervised

Below are the same simulations using the semi-supervised approach. For both of the weighting estimators, we observe that the semi-supervised approach yields a higher variance. This is an example of the phenomenon discussed in Section 2.3: the sample imbalances among the labeled data are an ancillary statistic for τ, and conditioning on that statistic leads to lower variance. We note that this problem could be overcome using a sample adjustment of Z on the unlabeled propensity score estimates. We also observe in this case that the bias of the IPW estimators decreases with the use of the unlabeled data, which suggests using the marginal data for two reasons:

  • Using only labeled data can lead to biased treatment effect estimates in finite samples

  • With an appropriate adjustment, the variance of estimators that use marginal propensities can be reduced to match that of estimators using only labeled data

Estimator N # labeled ATE RMSE Bias Coverage
IPW (logistic) 5,000 50 3.00 0.65 -0.00 0.90
IPW (logistic) 5,000 100 3.00 0.47 -0.00 0.93
IPW (logistic) 5,000 500 3.00 0.20 0.00 0.97
IPW (BART) 5,000 50 2.97 0.65 -0.03 0.91
IPW (BART) 5,000 100 2.97 0.47 -0.03 0.93
IPW (BART) 5,000 500 2.98 0.20 -0.02 0.96
TMLE 5,000 50 2.95 0.33 -0.05 0.76
TMLE 5,000 100 2.96 0.20 -0.04 0.81
TMLE 5,000 500 3.00 0.08 0.00 0.87
BCF 5,000 50 2.91 0.70 -0.09 0.82
BCF 5,000 100 2.94 0.46 -0.06 0.90
BCF 5,000 500 2.99 0.22 -0.01 0.97
Table 7: 500 simulations with outcomes MCAR
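In code, the complete-case and semi-supervised approaches differ only in which rows the propensity model sees. The sketch below uses a logistic propensity model and an MCAR labeling mechanism; the data generating process is our own toy example, not the one used in the tables above.

```python
import numpy as np

def logit_fit_predict(X, z, iters=25):
    # Newton-Raphson logistic regression; returns in-sample P(Z = 1 | X).
    Xd = np.column_stack([np.ones(len(z)), X])
    b = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-np.clip(Xd @ b, -30, 30)))
        H = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        b += np.linalg.solve(H, Xd.T @ (z - p))
    return 1 / (1 + np.exp(-np.clip(Xd @ b, -30, 30)))

def hajek_ipw(y, z, p):
    w1, w0 = z / p, (1 - z) / (1 - p)
    return y @ w1 / w1.sum() - y @ w0 / w0.sum()

rng = np.random.default_rng(2)
N, n_lab = 5000, 500
X = rng.normal(size=(N, 2))
p_true = 1 / (1 + np.exp(-0.7 * X[:, 0]))
z = rng.binomial(1, p_true)
y = X[:, 0] + 3.0 * z + rng.normal(size=N)    # true ATE = 3
lab = np.zeros(N, dtype=bool)
lab[rng.choice(N, n_lab, replace=False)] = True   # outcomes MCAR

# Complete case: fit the propensity model on the labeled subset only.
ate_cc = hajek_ipw(y[lab], z[lab], logit_fit_predict(X[lab], z[lab]))
# Semi-supervised: fit on all N units (X and Z are always observed),
# then evaluate the fitted scores on the labeled subset.
ate_ss = hajek_ipw(y[lab], z[lab], logit_fit_predict(X, z)[lab])
print(ate_cc, ate_ss)
```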

4 Discussion

4.1 Resolving the estimated propensity paradox

Section 2.3 reviews the paradox that conditioning on estimated propensities reduces the variance of weighting estimators, even when true propensities are known (see Hirano, Imbens and Ridder (2003), Rosenbaum (1987), and Lunceford and Davidian (2004)). We believe this issue deserves a detailed discussion, as it contains a number of subtleties, and its central claim (that estimated propensities are "better" than true propensities from a variance perspective) persists in the causal inference community to this day.

4.1.1 What the paradox actually implies

The mechanism by which conditioning on estimated propensities can reduce variances has two components:

  • (a) Adjusting for finite sample differences between true propensities and empirical treatment probabilities

  • (b) (For weighting estimators) effecting a variant of direct regression adjustment by stratifying on pure prognostic / moderator variables

We noted in Section 2.3 that the variance implications of (a) can be mitigated by modeling Z directly as a function of the true propensity score, if it is available. This leaves (b) as a concern, which is no longer a question of which propensity scores to use, but rather of which estimator to use. In particular, it is not a concern for direct regression estimators, which allow adjustment for prognostic / moderator variables directly. Indeed, our simulations using BCF demonstrate this point.

4.1.2 Bias-variance tradeoff

Absent from most of the literature on estimated versus true propensities is the possibility of model misspecification in estimating propensities. Points (a) and (b) above thus reflect a perhaps optimistic view, namely that the propensity function can be estimated arbitrarily well in finite samples; under that view, concerns about variance outweigh concerns about bias. However, our simulations show that in plausible real-world scenarios, propensity-based ATE estimators that rely only on labeled data can be severely biased. Indeed, while Section 2.3 shows that the variance of the "true propensities" estimator can be higher than that of either adjusted estimator, in our simulations the true-propensities estimator typically achieves the lowest RMSE, implying that bias overwhelms variance in many cases.

4.2 Propensity model specification

Though there may be benefits to using estimated propensity scores, in practice estimating propensity scores introduces a new set of risks. The simulations in Section 3 show that a misspecified propensity model can bias estimates, sometimes drastically. It may seem natural, then, to treat a propensity model as any other supervised learning problem, with variable selection, model validation, and diagnostics to ensure a proper fit. Indeed, McCaffrey, Ridgeway and Morral (2004) use boosting to estimate propensities, in part for the automatic variable selection and flexible model specification that tree ensembles provide. Their results show that using boosting for a propensity model leads to better covariate balance and lower standard errors than a logistic propensity model.

However, Hernan and Robins (2020) caution that traditional model selection techniques may be of dubious utility in constructing propensity models. They note that the purpose of the propensity score is to control for confounding variables, not to predict Z from X arbitrarily well. They are careful to point out that a purely data-driven approach to model selection runs the risk of including variables, such as colliders, mediators, or post-treatment variables, that jeopardize identification of treatment effects. They also advise that, even when treatment effects are identified, including instruments in a propensity model can increase the variance of treatment effect estimates by pushing estimated propensities closer to 0 and 1.

At first glance, this point would seem to argue against our recommendation to use semi-supervised machine learning in ATE estimation. Indeed, naively overfitting a propensity model is a surefire way to get noisy (and perhaps even biased) treatment effect estimates. However, we believe that a number of practical steps can be taken to mitigate these concerns.

First, estimated propensity scores should always be visualized or otherwise inspected to ensure that they are not numerically close to 0 and 1. Second, regularization methods (such as an ℓ2 penalty or certain types of ensembles) can be used instead of plain logistic regression to reduce the risk of estimating propensity scores close to 0 and 1. Finally, subject matter expertise should always play a role in building propensity models. In many applied problems, especially those with a high-dimensional covariate space, the challenge of constructing an accurate causal graph is highly non-trivial. Insofar as subject matter expertise can identify variables such as colliders, post-treatment variables, or non-confounders, such variables should of course be excluded from any propensity model.
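The first two recommendations can be partially automated. The thresholds below (0.01 and 0.99) are arbitrary illustrations rather than values taken from the literature.

```python
import numpy as np

def check_overlap(p_hat, lo=0.01, hi=0.99):
    # Report the range of estimated propensities and the share that fall
    # outside [lo, hi]; extreme scores destabilize weighting estimators.
    frac = float(np.mean((p_hat < lo) | (p_hat > hi)))
    print(f"range [{p_hat.min():.3f}, {p_hat.max():.3f}], "
          f"{100 * frac:.1f}% outside [{lo}, {hi}]")
    return frac

def trim(p_hat, lo=0.01, hi=0.99):
    # A crude but common remedy: clip the scores before forming weights.
    return np.clip(p_hat, lo, hi)

scores = np.array([0.002, 0.30, 0.55, 0.997])
check_overlap(scores)    # flags the two extreme scores
print(trim(scores))      # all scores now clipped into [0.01, 0.99]
```

Trimming trades a little bias for stability; it is not a substitute for asking why the model produced extreme scores in the first place.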

Researchers who are concerned that an underlying causal graph invalidates their propensity model can also turn to the causal discovery literature (e.g. Peters, Janzing and Schölkopf (2017)). Specifically, Entner, Hoyer and Spirtes (2013) present an approach to the problem of identifying confounders for adjustment in a causal model. In practice, the challenge of specifying (or estimating from the data) a causal graph presents a number of pitfalls beyond the scope of this paper. Suffice it to say that there is no easy replacement for domain expertise in this process, and that this point applies to propensity models fit on fully labeled data as well as to those that use unlabeled data.

In order to better understand the pitfalls of including non-confounders in a propensity model, we test several possibilities in a simulation study. Letting X_j be a variable in a set of p covariates, we consider four cases:

  • Case 1: X_j is a confounder and is included in the propensity model

  • Case 2: X_j is not a confounder and is included in the propensity model

  • Case 3: X_j is a confounder and is not included in the propensity model

  • Case 4: X_j is not a confounder and is not included in the propensity model

Cases 1 and 4 are ideal, while Cases 2 and 3 both fail to properly account for the true underlying causal model. The question we are concerned with is whether erring on the side of including non-confounders in a propensity model (Case 2) is as harmful to inference as ignoring true confounders (Case 3). To answer this, we simulate 100 datasets of n = 1,000 from the following data generating process:

\begin{align*}
Y &= X\beta + \tau Z + \varepsilon\\
X, \varepsilon &\sim \mathcal{N}(0, 1)\\
\tau &= 3\\
p &= \frac{1}{1 + \exp(-X\gamma)}\\
Z &\sim \mathrm{Bernoulli}(p)
\end{align*}

For all variables except X_j, the β and γ coefficients for the outcome and propensity models are drawn uniformly at random from the ranges (-0.5, 0.5) and (-0.3, 0.3), respectively. For the scenarios in which X_j is a confounder (that is, X_j impacts both Y and Z), we set β_{X_j} = 0.5. Each of our simulations assumes some relationship between X_j and Z, and we vary γ_{X_j} over {0, 1, 10} to test the extent to which conditioning on X_j in a propensity model separates the treatment and control groups.
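This data generating process translates directly to code. The sketch below draws one replicate; the random seed, the column index of X_j, and the covariate dimension are our own illustrative choices.

```python
import numpy as np

def simulate(n=1000, p_dim=10, j=0, confounder=True, gamma_j=1.0, seed=0):
    # One replicate of the DGP above: Y = X beta + tau Z + eps,
    # Z ~ Bernoulli(logistic(X gamma)), tau = 3.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p_dim))
    beta = rng.uniform(-0.5, 0.5, size=p_dim)
    gamma = rng.uniform(-0.3, 0.3, size=p_dim)
    gamma[j] = gamma_j                      # X_j always affects Z
    beta[j] = 0.5 if confounder else 0.0    # X_j affects Y only when a confounder
    prob = 1 / (1 + np.exp(-X @ gamma))
    Z = rng.binomial(1, prob)
    Y = X @ beta + 3.0 * Z + rng.normal(size=n)
    return X, Z, Y, prob

X, Z, Y, prob = simulate(gamma_j=10.0)
print(prob.min(), prob.max())   # near 0 and 1: strong separation when gamma_j = 10
```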

Below we report the simulated bias, RMSE, and 95% interval coverage for each of the four cases, using BCF with a BART propensity model and γ_{X_j} = 1.

Case X_j confounder? X_j in p̂? Bias RMSE Coverage
1 Yes Yes 0.004 0.098 89%
2 No Yes -0.008 0.096 88%
3 Yes No 0.068 0.109 84%
4 No No -0.007 0.087 87%
Table 8: ATE estimation using BCF, γ_{X_j} = 1

We see little difference in the results across the four cases: with γ_{X_j} = 1, the influence of X_j on treatment assignment is modest, so the confounding it induces is weak. Below is the same set of outputs for γ_{X_j} = 10.

Case X_j confounder? X_j in p̂? Bias RMSE Coverage
1 Yes Yes 0.033 0.253 91%
2 No Yes -0.09 0.262 87%
3 Yes No 0.24 0.252 80%
4 No No -0.045 0.126 98%
Table 9: ATE estimation using BCF, γ_{X_j} = 10

We see that Cases 1 and 4 exhibit the best performance (since they properly incorporate the true causal graph), but even when X_j and Z are strongly related, including X_j in the propensity model when it is not a confounder is less harmful than excluding it when it is a confounder. To understand why, we briefly review the discussion of "regularization-induced confounding" (RIC) in Hahn et al. (2020).

As we have seen in our simulations, classic parametric models of treatment assignment are vulnerable to misspecification. While this shortcoming suggests using flexible machine learning methods to estimate propensity scores, such methods run the risk of overfitting the data and estimating propensities close to 0 or 1. Beyond the numerical instability this causes for weighting estimators, such overfitting can also effectively violate the positivity or conditional exchangeability assumptions. Researchers who want flexible propensity models without overfitting can use regularized machine learning methods, such as BART. However, Hahn et al. (2020), building on Hahn et al. (2018), show that a naively regularized model can bias treatment effect estimates. Their proposed solution, BCF, controls this bias by incorporating the estimated propensities as a feature in a regression model for E(Y | Z = 0, p̂(X)).
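A linear caricature of RIC, our own construction rather than the setup in Hahn et al. (2020), illustrates the mechanism: under targeted selection, shrinking the prognostic term all the way to zero (the limiting case of heavy regularization) transfers its signal to the treatment coefficient, while adding the propensity as a regression feature, in the spirit of BCF, absorbs that signal.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-3 * x))        # targeted selection: treatment tracks prognosis
z = rng.binomial(1, p)
y = 2 * x + 3.0 * z + rng.normal(size=n)   # true ATE = 3

def tau_hat(*features):
    # OLS of y on [1, z, features...]; the coefficient on z estimates the ATE.
    D = np.column_stack([np.ones(n), z, *features])
    return np.linalg.lstsq(D, y, rcond=None)[0][1]

print(tau_hat())     # prognostic term shrunk away entirely: badly biased upward
print(tau_hat(p))    # propensity included as a feature: bias essentially removed
```

The second regression works because, within levels of the propensity, treatment assignment is unrelated to the prognostic variable.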

Since the simulation studies above use BCF with a BART propensity model, we avoid some of the common problems of overfit propensity models while also recovering unbiased estimates of the average treatment effect. This explains why including a non-confounder X_j that is strongly predictive of Z in the propensity model does not harm inference for the ATE nearly as much as excluding X_j when it is a confounder.

Acknowledgements

This work was partially supported by NSF Grant DMS-1502640.

References

  • Belkin, M., Niyogi, P. and Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434.
  • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  • Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
  • Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees.
  • Cerulli, G. (2014). treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score. The Stata Journal 14, 541–561.
  • Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4, 266–298.
  • Entner, D., Hoyer, P. and Spirtes, P. (2013). Data-driven covariate selection for nonparametric estimation of causal effects. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (C. M. Carvalho and P. Ravikumar, eds.). Proceedings of Machine Learning Research 31, 256–264. PMLR.
  • Gruber, S. and van der Laan, M. J. (2009). Targeted maximum likelihood estimation: A gentle introduction.
  • Hahn, P. R., Carvalho, C. M., Puelz, D. and He, J. (2018). Regularization and confounding in linear regression for treatment effect estimation. Bayesian Analysis 13, 163–182.
  • Hahn, P. R., Murray, J. S. and Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis.
  • Harville, D. A. (1998). Matrix Algebra From a Statistician's Perspective.
  • Hernan, M. A. and Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
  • Hirano, K., Imbens, G. W. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
  • Lee, B. K., Lessler, J. and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine 29, 337–346.
  • Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling. Statistical Science, 189–205.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons.
  • Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23, 2937–2960.
  • McCaffrey, D. F., Ridgeway, G. and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9, 403.
  • Peters, J., Janzing, D. and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
  • Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical Association 82, 387–394.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
  • Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association 75, 591–593.
  • van der Laan, M. J. (2010a). Targeted maximum likelihood based causal inference: Part I. The International Journal of Biostatistics 6.
  • van der Laan, M. J. (2010b). Targeted maximum likelihood based causal inference: Part II. The International Journal of Biostatistics 6.
  • Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, 1–130.

Appendix A Variance derivations for IPW_{p(x)} vs IPW_{p̂(x)}

Consider the data generating process of Section 2.3. We condition on a sample of size n and observed values of X, such that n_x is also observed. We derive the variance of the IPW estimators for an arbitrary {X = x} stratum and then conclude that the result holds across the support of X. We randomize over the sample sizes of treated units, which we model as N_{x,z=1} ~ Binomial(n_x, p_x):

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{p(x),x}\right] &= \mbox{V}\left[\left(\frac{n_{x}}{n}\right)\left(\frac{\hat{p}_{x}}{p_{x}}\bar{Y}_{x,z=1}-\frac{1-\hat{p}_{x}}{1-p_{x}}\bar{Y}_{x,z=0}\right)\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\frac{\hat{p}_{x}}{p_{x}}\bar{Y}_{x,z=1}-\frac{1-\hat{p}_{x}}{1-p_{x}}\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\left(\frac{N_{x,z=1}}{n_{x}p_{x}}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=1}\mid N_{x,z=1}\right]+\left(\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\left(\frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}p_{x}^{2}}{n_{x}^{2}p_{x}^{2}}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}\mid N_{x,z=1}\right]\right]\right)\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\left(\frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}(1-p_{x})^{2}}{n_{x}^{2}(1-p_{x})^{2}}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\right)\\
&\quad+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\frac{N_{x,z=1}}{n_{x}p_{x}}\bar{Y}_{x,z=1}-\frac{n_{x}-N_{x,z=1}}{n_{x}(1-p_{x})}\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]
\end{align*}

while

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x}\right] &= \mbox{V}\left[\left(\frac{n_{x}}{n}\right)\left(\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\right)\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\mbox{E}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]+\left(\frac{n_{x}}{n}\right)^{2}\mbox{V}\left[\tau\right]\\
&= \left(\frac{n_{x}}{n}\right)^{2}\mbox{E}\left[\mbox{V}\left[\bar{Y}_{x,z=1}-\bar{Y}_{x,z=0}\mid N_{x,z=1}\right]\right]
\end{align*}

The final step uses the fact that E[Ȳ_{x,z=1} − Ȳ_{x,z=0} | N_{x,z=1}] = τ, a constant, so its variance is zero.

Since variances are nonnegative and both \frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}p_{x}^{2}}{n_{x}^{2}p_{x}^{2}}>1 and \frac{n_{x}p_{x}(1-p_{x})+n_{x}^{2}(1-p_{x})^{2}}{n_{x}^{2}(1-p_{x})^{2}}>1 for n_x > 0 and 0 < p_x < 1, we see that \mbox{V}\left[\mbox{IPW}_{p(x),x}\right]>\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x}\right]. Since the stratum X = x was arbitrary, this result applies to all {X = x} strata and thus holds in expectation across the support of X.
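The inequality is easy to verify numerically for a single stratum. The outcome distribution and stratum parameters below are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, p_x, tau, reps = 40, 0.3, 3.0, 20000
est_true, est_hat = [], []
for _ in range(reps):
    z = rng.binomial(1, p_x, size=n_x)
    if z.sum() in (0, n_x):       # need both treatment arms observed
        continue
    y = tau * z + rng.normal(size=n_x)
    ybar1, ybar0 = y[z == 1].mean(), y[z == 0].mean()
    p_hat = z.mean()
    # IPW with the true stratum propensity p_x:
    est_true.append((p_hat / p_x) * ybar1 - ((1 - p_hat) / (1 - p_x)) * ybar0)
    # IPW with the estimated (empirical) propensity p_hat -- the weights cancel:
    est_hat.append(ybar1 - ybar0)

print(np.var(est_true), np.var(est_hat))  # true-propensity variant has larger variance
```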

Appendix B Variance derivations for IPW_{p̂(x)} vs IPW_{p̂(p(x))}

Now, we compare the variance of the p̂(p(X)) estimator with that of the p̂(X) estimator. These two estimators differ only in their use of X_2. Since p(X) = f(X_1), the p̂(p(X)) estimator is equivalent to a stratification estimator using only X_1, while the p̂(X) estimator stratifies on both X_1 and X_2. In this derivation, we condition on a sample of size n and observed values of x_1 and z, so that n_{x_1,z} is observed. We randomize over X_2, deriving the variance of the IPW estimators for an arbitrary {X_1 = x_1, Z = 1} stratum and concluding that the result holds across the support of X and Z. To represent the randomization over X_2 at the level of an {X_1 = x_1} stratum, we introduce two binomial random variables:

  • N_{x_{1},X_{2}=1,z=1}\sim\mbox{Binomial}(n_{x_{1},z=1},1/2)

  • N_{x_{1},X_{2}=1,z=0}\sim\mbox{Binomial}(n_{x_{1},z=0},1/2)

We can decompose a given {X_1 = x_1} stratum of the p̂(p(X)) estimator as follows

\begin{align*}
\mbox{IPW}_{\hat{p}(p(x)),x_{1}} &= \frac{1}{n}\sum_{i:X_{1,i}=x_{1}}\left(\frac{Y_{i}Z_{i}}{\hat{p}(p_{x_{1}})}-\frac{Y_{i}(1-Z_{i})}{1-\hat{p}(p_{x_{1}})}\right)\\
&= \frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}+\frac{n_{x_{1},z=1}-N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right)\\
&\quad-\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}+\frac{n_{x_{1},z=0}-N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right)
\end{align*}

We compute the variance of this stratum-specific estimator via the law of total variance

\begin{align*}
\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\right] &= \mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]\\
&\quad+\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]
\end{align*}

where

\begin{align*}
&\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]=\\
&\quad\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+\frac{n_{x_{1},z=1}-N_{x_{1},X_{2}=1,z=1}}{n_{x_{1},z=1}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]\right)\\
&\quad\quad-\frac{n_{x_{1}}}{n}\left(\frac{N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]+\frac{n_{x_{1},z=0}-N_{x_{1},X_{2}=1,z=0}}{n_{x_{1},z=0}}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]\right)
\end{align*}

If X_2 has no pure prognostic / moderating impact on Y, then it is true that

  • \mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]=\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right] and

  • \mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]=\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right],

and thus \mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]=0.

So when X_2 is not a pure prognostic / moderator variable, the variance of \mbox{IPW}_{\hat{p}(p(x)),x_{1}} is equal to \mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right].

Now, observe that a given {X_1 = x_1} stratum of the p̂(X) estimator can be written as

\displaystyle\mbox{IPW}_{\hat{p}(x),x_{1}}=\frac{1}{n}\sum_{i:X_{1,i}=x_{1}}\left(\frac{Y_{i}Z_{i}}{\hat{p}(X_{i})}-\frac{Y_{i}(1-Z_{i})}{1-\hat{p}(X_{i})}\right)
\displaystyle=\frac{N_{x_{1},X_{2}=1}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}+\frac{N_{x_{1},X_{2}=0}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}
\displaystyle\;\;\;\;-\frac{N_{x_{1},X_{2}=1}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}-\frac{N_{x_{1},X_{2}=0}}{n}\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}
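This collapse of the cell-wise IPW sum into a weighted difference of cell means can be verified numerically. The sketch below (simulated data of our own choosing: a single $X_{1}=x_{1}$ stratum with binary $X_{2}$, treatment randomized within cells, and a hypothetical treatment effect of 2) computes both forms and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# One X1 = x1 stratum with binary X2; treatment Z randomized within cells.
n = 200
x2 = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
y = rng.normal(size=n) + 2.0 * z  # hypothetical treatment effect tau = 2

# Cell-wise propensity estimate: p_hat(x2) = #{Z=1, X2=x2} / #{X2=x2}.
p_hat = np.array([z[x2 == v].mean() for v in (0, 1)])[x2]

# Direct IPW form within the stratum.
ipw_direct = np.mean(y * z / p_hat - y * (1 - z) / (1 - p_hat))

# Weighted difference of cell means.
ipw_cells = sum(
    (np.sum(x2 == v) / n)
    * (y[(x2 == v) & (z == 1)].mean() - y[(x2 == v) & (z == 0)].mean())
    for v in (0, 1)
)

assert np.isclose(ipw_direct, ipw_cells)
```

The identity is exact whenever no $(X_{2},Z)$ cell is empty, since each observation's weight is the inverse of its cell's empirical treatment frequency.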

To obtain the variance of this estimator, we again apply the law of total variance

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\right]=\mbox{E}\left[\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]
\displaystyle\;\;\;\;+\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]

and

\displaystyle\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]
\displaystyle=\frac{N_{x_{1},X_{2}=1}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+\frac{N_{x_{1},X_{2}=0}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]
\displaystyle\;\;\;\;-\frac{N_{x_{1},X_{2}=1}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]-\frac{N_{x_{1},X_{2}=0}}{n}\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]
\displaystyle=\frac{N_{x_{1},X_{2}=1,z=0}+N_{x_{1},X_{2}=1,z=1}}{n}\left(\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=1}\right]-\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=1,Z=0}\right]\right)
\displaystyle\;\;\;\;+\frac{N_{x_{1},X_{2}=0,z=0}+N_{x_{1},X_{2}=0,z=1}}{n}\left(\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=1}\right]-\mbox{E}\left[\bar{Y}_{X_{1}=x_{1},X_{2}=0,Z=0}\right]\right)

Recall that $\mbox{E}\left[\bar{Y}_{x_{1},x_{2},Z=1}\right]-\mbox{E}\left[\bar{Y}_{x_{1},x_{2},Z=0}\right]=\tau$ for $x_{2}\in\{0,1\}$, so $\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]=\tau n_{x_{1}}/n$ and thus $\mbox{V}\left[\mbox{E}\left[\mbox{IPW}_{\hat{p}(x),x_{1}}\mid N_{x_{1},X_{2}=1,z=1},N_{x_{1},X_{2}=1,z=0}\right]\right]=0$, regardless of whether or not $X_{2}$ is a pure prognostic / moderator variable.
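The law of total variance underlying these calculations can be illustrated with a toy Monte Carlo check (an example of our own choosing, unrelated to the estimator itself): for $N\sim\mbox{Binomial}(10,0.3)$ and $W\mid N\sim\mbox{N}(N,4)$, the total variance should match $\mbox{E}\left[\mbox{V}\left[W\mid N\right]\right]+\mbox{V}\left[\mbox{E}\left[W\mid N\right]\right]=4+\mbox{V}\left[N\right]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# N ~ Binomial(10, 0.3); W | N ~ Normal(mean=N, sd=2).
reps = 2_000_000
n_draws = rng.binomial(10, 0.3, size=reps)
w = rng.normal(loc=n_draws, scale=2.0)

total_var = w.var()
# E[V[W|N]] = 4 (constant), and V[E[W|N]] = V[N], estimated from the same draws.
decomposed = 4.0 + n_draws.var()

# The two sides agree up to Monte Carlo error.
assert abs(total_var - decomposed) / decomposed < 0.02
```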

We can therefore compare the variances of the $\hat{p}(p(x))$ and $\hat{p}(x)$ estimators, without evaluating the expectation over $X_{2}$ for $\hat{p}(x)$, by comparing the conditional variance terms of the two expressions.

Focusing on the first two terms of the conditional variance expressions (corresponding to $X_{1}=x_{1}$ and $Z=1$), we see that

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(x),x_{1},z=1}\mid N_{x_{1},X_{2}=1,z=1}\right]
\displaystyle=\left(\frac{N_{x_{1},X_{2}=1}}{n}\right)^{2}\frac{\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]}{N_{x_{1},X_{2}=1,z=1}}+\left(\frac{N_{x_{1},X_{2}=0}}{n}\right)^{2}\frac{\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]}{N_{x_{1},X_{2}=0,z=1}}
\displaystyle=\frac{1}{n^{2}}\frac{\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}\right)^{2}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]}{N_{x_{1},X_{2}=1,z=1}}
\displaystyle\;\;\;\;\;\;+\frac{1}{n^{2}}\frac{\left(N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}\right)^{2}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]}{N_{x_{1},X_{2}=0,z=1}}

and also that

\displaystyle\mbox{V}\left[\mbox{IPW}_{\hat{p}(p(x)),x_{1},z=1}\mid N_{x_{1},X_{2}=1,z=1}\right]
\displaystyle=\frac{1}{n^{2}}\left[\frac{\left(n_{x_{1}}\right)^{2}\left(N_{x_{1},X_{2}=1,z=1}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]+N_{x_{1},X_{2}=0,z=1}\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]\right)}{\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}\right)^{2}}\right]

where $n_{x_{1}}=\left(N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}\right)+\left(N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}\right)$.

If $\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=1,Z=1}\right]=\mbox{V}\left[Y_{X_{1}=x_{1},X_{2}=0,Z=1}\right]$, then we can set $v$ equal to this common value and factor it out of both expressions. After this adjustment, the comparison reduces to determining the conditions under which

\displaystyle\frac{n_{x_{1}}^{2}}{N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}}\leq\frac{N_{x_{1},X_{2}=1}^{2}}{N_{x_{1},X_{2}=1,z=1}}+\frac{N_{x_{1},X_{2}=0}^{2}}{N_{x_{1},X_{2}=0,z=1}}

where $N_{x_{1},X_{2}=1}=N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=1,z=0}$ and $N_{x_{1},X_{2}=0}=N_{x_{1},X_{2}=0,z=1}+N_{x_{1},X_{2}=0,z=0}$.
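Before establishing this inequality via convexity, it can be checked by brute force over small positive cell counts. In the sketch below, $a,b,c,d$ stand for $N_{x_{1},X_{2}=1,z=1}$, $N_{x_{1},X_{2}=1,z=0}$, $N_{x_{1},X_{2}=0,z=1}$, and $N_{x_{1},X_{2}=0,z=0}$:

```python
from itertools import product

def lhs(a, b, c, d):
    """n_{x1}^2 / (N_{X2=1,z=1} + N_{X2=0,z=1})."""
    n = a + b + c + d
    return n**2 / (a + c)

def rhs(a, b, c, d):
    """N_{X2=1}^2 / N_{X2=1,z=1} + N_{X2=0}^2 / N_{X2=0,z=1}."""
    return (a + b) ** 2 / a + (c + d) ** 2 / c

# Exhaustively check all nonempty cell configurations with counts 1..8.
violations = [cells for cells in product(range(1, 9), repeat=4)
              if lhs(*cells) > rhs(*cells) + 1e-9]
assert violations == []
```

Note that equality is attained, e.g. at $a=b=c=d=1$, where both sides equal $8$.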

Multiplying through by $1/2$ puts this in the form of a midpoint convexity condition: the inequality holds if the function $f(x,y)=\frac{(x+y)^{2}}{y}$ is convex, with $f$ evaluated at the two points $\left(N_{x_{1},X_{2}=1,z=0},N_{x_{1},X_{2}=1,z=1}\right)$ and $\left(N_{x_{1},X_{2}=0,z=0},N_{x_{1},X_{2}=0,z=1}\right)$ and at their midpoint. Let the domain be the strictly positive orthant $\mathbb{R}^{2}_{++}$ and observe that a sufficient condition for the convexity of $f$ is positive semidefiniteness of its Hessian matrix (Boyd and Vandenberghe (2004)).

\displaystyle H=\begin{pmatrix}\frac{\partial^{2}f}{\partial x^{2}}&\frac{\partial^{2}f}{\partial y\partial x}\\ \frac{\partial^{2}f}{\partial y\partial x}&\frac{\partial^{2}f}{\partial y^{2}}\end{pmatrix}=\begin{pmatrix}\frac{2}{y}&\frac{-2x}{y^{2}}\\ \frac{-2x}{y^{2}}&\frac{2x^{2}}{y^{3}}\end{pmatrix}

The Hessian is positive semidefinite, which can be confirmed by observing that its three principal minors are nonnegative (Harville (1998)):

\frac{2}{y}\geq 0;\;\;\;\;\frac{2x^{2}}{y^{3}}\geq 0;\;\;\;\;\frac{4x^{2}}{y^{4}}-\frac{4x^{2}}{y^{4}}=0
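The closed-form Hessian entries can also be checked numerically (a sanity check of our own, not part of the original argument): on a grid of strictly positive $(x,y)$, both eigenvalues are nonnegative and the determinant vanishes identically, confirming positive semidefiniteness.

```python
import numpy as np

def hessian(x, y):
    """Closed-form Hessian of f(x, y) = (x + y)**2 / y."""
    return np.array([[2 / y, -2 * x / y**2],
                     [-2 * x / y**2, 2 * x**2 / y**3]])

for x in np.linspace(0.5, 5.0, 10):
    for y in np.linspace(0.5, 5.0, 10):
        H = hessian(x, y)
        # Positive semidefinite: eigenvalues >= 0 and determinant identically 0.
        assert np.linalg.eigvalsh(H).min() > -1e-6
        assert abs(np.linalg.det(H)) < 1e-6
```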

Thus, it follows that

\displaystyle\frac{n_{x_{1}}^{2}}{N_{x_{1},X_{2}=1,z=1}+N_{x_{1},X_{2}=0,z=1}}\leq\frac{N_{x_{1},X_{2}=1}^{2}}{N_{x_{1},X_{2}=1,z=1}}+\frac{N_{x_{1},X_{2}=0}^{2}}{N_{x_{1},X_{2}=0,z=1}}

The positivity constraint on the domain of $f$ translates into an assumption of “no empty strata” for our weighting estimator. Since this result holds for an arbitrary $n_{x_{1}}$, it holds across the distribution of sample sizes for which no stratum cells are empty.
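The practical force of the “no empty strata” assumption grows with the fineness of the stratification. A small simulation (a hypothetical design of our own: strata assigned uniformly at random and treatment a fair coin flip) estimates how much more often an empty (stratum, $Z$) cell occurs when stratifying on $(X_{1},X_{2})$ rather than on $X_{1}$ alone:

```python
import numpy as np

rng = np.random.default_rng(7)

def prob_empty_cell(n, n_strata, reps=20_000):
    """Monte Carlo probability that some (stratum, Z) cell is empty."""
    strata = rng.integers(0, n_strata, size=(reps, n))
    z = rng.integers(0, 2, size=(reps, n))
    cell = strata * 2 + z  # combined (stratum, treatment) cell index
    counts = np.apply_along_axis(np.bincount, 1, cell, minlength=2 * n_strata)
    return (counts == 0).any(axis=1).mean()

p_coarse = prob_empty_cell(n=40, n_strata=2)  # stratify on X1 only
p_fine = prob_empty_cell(n=40, n_strata=4)    # stratify on (X1, X2)

assert p_fine >= p_coarse
```

With the finer stratification the empty-cell probability rises sharply, which is precisely the regime in which the fully stratified $\hat{p}(x)$ estimator becomes undefined.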