
Estimation of Partially Conditional Average Treatment Effect by Hybrid Kernel-covariate Balancing

Jiayi Wang (Department of Statistics, Texas A&M University), Raymond K. W. Wong (Department of Statistics, Texas A&M University), Shu Yang (Department of Statistics, North Carolina State University), Kwun Chuen Gary Chan (Department of Biostatistics, University of Washington)
Abstract

We study nonparametric estimation of the partially conditional average treatment effect, defined as the treatment effect function over a subset of confounders of interest. We propose a hybrid kernel weighting estimator whose weights aim to control the balancing error, after kernel smoothing over the subset of variables of interest, of any function of the confounders from a reproducing kernel Hilbert space. In addition, we present an augmented version of our estimator that can incorporate estimates of the outcome mean functions. Based on the representer theorem, gradient-based algorithms can be applied to solve the corresponding infinite-dimensional optimization problem. Asymptotic properties are studied without any smoothness assumption on the propensity score function and without data splitting, relaxing certain stringent existing assumptions. The numerical performance of the proposed estimator is demonstrated by a simulation study and an application to the effect of a mother's smoking on her baby's birth weight, conditioned on the mother's age.


Keywords: Augmented weighting estimator; Causal inference; Reproducing kernel Hilbert space; Treatment effect heterogeneity

1 Introduction

Causal inference often concerns not only the average effect of a treatment on an outcome but also the conditional average treatment effect (CATE) given a set of individual characteristics, when treatment effect heterogeneity is expected or of interest. Specifically, let $T\in\{0,1\}$ be the treatment assignment, with 0 for control and 1 for active treatment, $X\in\mathcal{X}\subset\mathbb{R}^{d}$ a vector of all pre-treatment confounders, and $Y$ the outcome of interest. Following the potential outcomes framework, let $Y(t)$ be the potential outcome, possibly contrary to fact, had the unit received treatment $t\in\{0,1\}$. Then the individual treatment effect is $Y(1)-Y(0)$, and the (fully) CATE can be characterized through $\gamma(x)=\mathbb{E}\{Y(1)-Y(0)\mid X=x\}$, $x\in\mathcal{X}$. Due to the fundamental problem of causal inference, namely that the potential outcomes are not jointly observable, identification and estimation of the CATE in observational studies require further assumptions. A common assumption is no unmeasured confounding (NUC), which requires $X$ to capture all confounding variables that affect the treatment assignment and the outcome. This often results in a multidimensional $X$. Under the NUC assumption, many methods have been proposed to estimate $\gamma(x)$ (Nie and Wager, 2017; Wager and Athey, 2018; Kennedy, 2020). However, in clinical settings, researchers may only be concerned with the variation of the treatment effect over a small subset of covariates $V\in\mathcal{V}\subseteq\mathcal{X}$, not necessarily the full set $X$. For example, researchers may be interested in estimating the CATE of smoking (treatment) on birth weight (outcome) given the mother's age but not the mother's educational attainment, even though the latter can be a confounder.
In this article, we focus on estimating $\tau(v)=\mathbb{E}\{\gamma(X)\mid V=v\}$ for $v\in\mathcal{V}$, which we refer to as the partially conditional average treatment effect (PCATE). When $V$ is taken to be $X$, $\tau(v)$ becomes the fully conditional average treatment effect (FCATE) $\gamma(x)$. Despite our major focus on cases where $\mathcal{V}$ is a proper subset of $\mathcal{X}$, the proposed method does not exclude the setting with $\mathcal{V}=\mathcal{X}$, which yields the FCATE.

When $V$ contains discrete covariates, one can divide the whole sample into groups, restricting the discrete covariates in $V$ to take the same values within each group. Then, as long as each stratum contains enough samples, $\tau(v)$ can be obtained by estimating the PCATE over the remaining continuous covariates in $V$ separately for every stratum. Therefore, for simplicity, we focus on setups with continuous $V$ (Abrevaya et al., 2015; Lee et al., 2017; Fan et al., 2020; Zimmert and Lechner, 2019; Semenova and Chernozhukov, 2020), keeping in mind that the proposed method can handle $V$ consisting of both continuous and discrete variables. The typical estimation strategy involves two steps. The first step estimates nuisance parameters, including the propensity score function and the outcome mean functions, to construct adjusted responses (through weighting and augmentation) that are (asymptotically) unbiased for $\gamma(x)$ given $X=x$. The nuisance parameters can be estimated by parametric, nonparametric, or even machine learning models. This step adjusts for confounding biases. In the second step, existing methods typically perform nonparametric regression over $V$ using the adjusted responses from the first step. However, these methods suffer from several drawbacks. First, all parametric methods are potentially sensitive to model misspecification, especially when the CATE is complex. On the other hand, although nonparametric and machine learning methods are flexible, the first-step estimation of $\gamma(X)$ with a high-dimensional $X$ requires stringent assumptions in order for the possibly low-dimensional PCATE estimator to achieve the optimal convergence rate. For example, Abrevaya et al. (2015), Zimmert and Lechner (2019), Fan et al. (2020), and Semenova and Chernozhukov (2020) impose restrictive requirements on the convergence rates of the estimators of the nuisance parameters. Detailed discussions are provided in Remarks 3 and 6.

Instead of separating confounding adjustment and kernel smoothing into two steps, we propose a new framework that unifies them in the weighting step. In particular, we generalize the idea of covariate balancing weighting from the average treatment effect (ATE) estimation literature (Qin and Zhang, 2007; Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Wong and Chan, 2018). This generalization, however, is non-trivial, because we require covariate balance in terms of flexible outcome models between the two treatment groups given all possible values of $v$. We assume that the outcome models lie in a reproducing kernel Hilbert space (RKHS, Wahba, 1990), a fairly flexible class of functions of $X$. We then propose covariate function balancing (CFB) weights that are capable of controlling the balancing error, with respect to the $L_{2}$-norm after kernel smoothing, of any function with a bounded norm in the RKHS. The construction of the proposed weights involves two kernels, the reproducing kernel of the RKHS and the kernel of the kernel smoothing, and the goal of these weights can be understood as balancing covariate functions generated by the hybrid of these two kernels. Our method does not require any smoothness assumption on the propensity score model, in sharp contrast to existing methods, and only requires mild smoothness assumptions on the outcome models. Invoking the well-known representer theorem, a finite-dimensional representation of the optimization objective can be derived and solved by a gradient-based algorithm. Asymptotic properties of the proposed estimator are derived under the complex dependency structure between the weights and the kernel smoothing. In addition, our weighting estimator can be slightly modified to incorporate estimates of the outcome mean functions, similar to the augmented inverse probability weighting (AIPW) estimator. We show that augmentation with the outcome models relaxes the theoretical requirements on the selection of tuning parameters.

The rest of the paper is organized as follows. Section 2 provides the basic setup for CATE estimation. Section 3 introduces our proposed CFB weighting estimator, together with computation techniques. Section 4 introduces an augmented version of our proposed estimator. In Section 5, the asymptotic properties of the proposed estimators are developed. A simulation study and a real data application are presented in Sections 6 and 7, respectively.

2 Basic setup

Suppose $\{(T_{i},Y_{i}(1),Y_{i}(0),X_{i}):i=1,\ldots,N\}$ are $N$ independent and identically distributed copies of $\{T,Y(1),Y(0),X\}$. We assume that the observed outcome is $Y_{i}=T_{i}Y_{i}(1)+(1-T_{i})Y_{i}(0)$ for $i=1,\ldots,N$. Thus, the observed data $\{(T_{i},Y_{i},X_{i}):i=1,\dots,N\}$ are also independent and identically distributed. For simplicity, we drop the subscript $i$ when no confusion arises.

We focus on the setting satisfying treatment ignorability in observational studies (Rosenbaum and Rubin, 1983).

Assumption 1 (No unmeasured confounding).

\{Y(1),Y(0)\}\perp\!\!\!\perp T\mid X.

Assumption 1 rules out latent confounding between the treatment and the outcome. In observational studies, its plausibility relies on whether the observed covariates $X$ include all confounders that affect both the treatment and the outcome.

Most existing works (Nie and Wager, 2017; Wager and Athey, 2018; Kennedy, 2020; Semenova and Chernozhukov, 2020) focus on estimating the CATE given the full set $X$, i.e., $\gamma(x):=\mathbb{E}\{Y(1)-Y(0)\mid X=x\}$, $x\in\mathcal{X}$, which we refer to as the FCATE. However, to ensure Assumption 1 holds, $X$ is often multidimensional, leading to a multidimensional CATE function $\gamma(x)$ that is challenging to estimate. Indeed, it is common that some covariates in $X$ are simply confounders rather than treatment effect modifiers of interest. Therefore, a more sensible approach is to allow the conditioning variables to be a subset of the confounders (Abrevaya et al., 2015; Zimmert and Lechner, 2019; Fan et al., 2020). Instead of $\gamma(x)$, we focus on estimating the PCATE

\tau(v)=\mathbb{E}\left\{Y(1)-Y(0)\mid V=v\right\},\quad v\in\mathcal{V}\subseteq\mathcal{X},

where $V$ is a subvector of $X$. It is worth noting that $V=X$ is also allowed, and therefore $\gamma(x)$ can be estimated within our framework. For simplicity, we assume $V$ is a continuous random vector for the rest of the paper. When $V$ contains discrete random variables, one can divide the sample into strata whose units share the same levels of the discrete covariates, and then estimate the PCATE within every stratum.

In addition to Assumption 1, we require sufficient overlap between the treatment groups. Let $\pi(x)=\mathbb{P}(T=1\mid X=x)$ be the propensity score. Throughout this paper, we assume that the propensity score is strictly bounded above zero and below one to ensure overlap.

Assumption 2.

The propensity score $\pi(\cdot)$ is uniformly bounded away from zero and one. That is, there exists a constant $C_{1}>0$ such that $1/C_{1}\leq\pi(x)\leq 1-1/C_{1}$ for all $x\in\mathcal{X}$.

Under Assumptions 1 and 2, $\tau(v)$ is identifiable via the following formula:

\tau(v)=\mathbb{E}\left\{Y(1)-Y(0)\mid V=v\right\}=\mathbb{E}\left\{\left.\frac{TY}{\pi(X)}-\frac{(1-T)Y}{1-\pi(X)}\,\right|\,V=v\right\}.
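This identification formula can be checked by simulation when $\pi(\cdot)$ is known. The sketch below uses a data-generating design of our own choosing (logistic propensity, linear outcome models, which are not from the paper) and verifies that the mean of the adjusted response recovers the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
X = rng.uniform(-1, 1, size=N)            # a single confounder (illustrative)
pi = 1 / (1 + np.exp(-X))                 # known propensity score pi(X)
T = rng.binomial(1, pi)
Y1 = 1.0 + X + rng.normal(size=N)         # assumed outcome models:
Y0 = X + rng.normal(size=N)               # Y(1) - Y(0) = 1 for every unit
Y = T * Y1 + (1 - T) * Y0                 # observed outcome

# Adjusted response Z_i = T_i Y_i / pi(X_i) - (1 - T_i) Y_i / {1 - pi(X_i)}
Z = T * Y / pi - (1 - T) * Y / (1 - pi)

# Unconditionally, E(Z) recovers E{Y(1) - Y(0)} = 1; conditioning on V = v
# (e.g., by kernel smoothing over V) recovers tau(v) in the same way.
print(Z.mean())   # close to 1
```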

First, suppose $\pi(X_{i})$, $i=1,\dots,N$, are known. Common procedures construct adjusted responses $Z_{i}=T_{i}Y_{i}/\pi(X_{i})-(1-T_{i})Y_{i}/\{1-\pi(X_{i})\}$ and apply a kernel smoother to the data $\{(V_{i},Z_{i}):i=1,\dots,N\}$. Specifically, let $K(v)$ be a kernel function and $h>0$ a bandwidth parameter (with technical conditions specified in Section 5.1). The above strategy leads to the following estimator of $\tau(v)$:

\frac{1/(Nh^{d_{1}})\sum_{i=1}^{N}K\left\{(V_{i}-v)/h\right\}Z_{i}}{1/(Nh^{d_{1}})\sum_{j=1}^{N}K\left\{(V_{j}-v)/h\right\}}=\frac{1}{N}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},v)Z_{i},  (1)

where

\tilde{K}_{h}(v_{1},v_{2})=\frac{h^{-d_{1}}K\{(v_{1}-v_{2})/h\}}{N^{-1}\sum_{j=1}^{N}h^{-d_{1}}K\{(V_{j}-v_{2})/h\}}.
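A minimal sketch of estimator (1) for $d_{1}=1$ with a Gaussian kernel: the bandwidth factors $h^{-d_1}$ cancel, so $\tilde{K}_h(V_i,v)$ reduces to the kernel weights divided by their sample mean, and (1) becomes the familiar Nadaraya–Watson smoother. The function name and toy data below are our own.

```python
import numpy as np

def nw_smooth(V, Z, v, h):
    """Estimator (1) with a Gaussian kernel: average the adjusted
    responses Z_i with the mean-one weights K_tilde_h(V_i, v)."""
    K = np.exp(-0.5 * ((V - v) / h) ** 2)   # K{(V_i - v)/h}, unnormalized
    Ktilde = K / K.mean()                    # K_tilde_h(V_i, v), mean one
    return np.mean(Ktilde * Z)               # = sum(K*Z) / sum(K)

# Toy check: the smoother recovers a smooth trend of Z over V.
rng = np.random.default_rng(0)
V = rng.uniform(0, 1, 2000)
Z = np.sin(2 * np.pi * V) + rng.normal(scale=0.1, size=2000)
est = nw_smooth(V, Z, v=0.25, h=0.05)
print(est)   # close to sin(pi/2) = 1
```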

In observational studies, the propensity scores $\pi(X_{i})$, $i=1,\dots,N$, are often unknown. Abrevaya et al. (2015) propose to estimate them using another kernel smoother and to construct the adjusted responses from the estimated propensity scores. This approach has two drawbacks. First, it is well known that inverting estimated propensity scores can cause instability, especially when some of them are close to zero or one. Second, the procedure relies on the propensity score model being correctly specified or smooth enough to approximate well.

To overcome these issues, instead of obtaining the weights by inverting estimated propensity scores, we estimate the proper weights directly. In the next section, we adopt the idea of covariate balancing weighting, which has recently been studied in the context of average treatment effect (ATE) estimation (e.g., Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Wong and Chan, 2018; Zhao et al., 2019; Kallus, 2020; Wang and Zubizarreta, 2020).

3 Covariate function balancing weighting for PCATE estimation

3.1 Motivation

To motivate the proposed estimator, suppose we are given covariate balancing weights $\{\hat{w}_{i}:i=1,\dots,N\}$. We express the adjusted response as

Z_{i}=\hat{w}_{i}T_{i}Y_{i}-\hat{w}_{i}(1-T_{i})Y_{i},\quad i=1,\dots,N.  (2)

Combining (1) and (2), the estimator of $\tau(v)$ is

\hat{\tau}(v)=\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)Y_{i}-\frac{1}{N}\sum_{i=1}^{N}(1-T_{i})\hat{w}_{i}\tilde{K}_{h}(V_{i},v)Y_{i}.  (3)

The estimator (3) is a difference of two terms, which estimate $\mu_{1}(v)=\mathbb{E}\{Y(1)\mid V=v\}$ and $\mu_{0}(v)=\mathbb{E}\{Y(0)\mid V=v\}$, respectively. For simplicity, we focus on the first term and discuss the estimation of the corresponding weights $\{w_{i}:T_{i}=1\}$ in the treated group; the same discussion applies to the second term and the weights in the control group.
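Given any weights, (3) is a pair of weighted kernel smooths. The sketch below (Gaussian kernel, $d_{1}=1$, our own naming) evaluates (3) with oracle inverse-propensity weights on a toy design of our choosing, which recovers the PCATE; the CFB weights of Section 3.2 would replace the oracle weights in practice.

```python
import numpy as np

def tau_hat(v, V, T, Y, w, h):
    """Estimator (3): difference of the weighted kernel smooths for
    the treated and the control groups (Gaussian kernel, d1 = 1)."""
    K = np.exp(-0.5 * ((V - v) / h) ** 2) / h   # (1/h) K{(V_i - v)/h}
    Ktilde = K / K.mean()                        # K_tilde_h(V_i, v)
    return np.mean(T * w * Ktilde * Y) - np.mean((1 - T) * w * Ktilde * Y)

# Toy check with oracle inverse-propensity weights (illustrative design):
rng = np.random.default_rng(3)
N = 100_000
X = rng.uniform(-1, 1, N); V = X
pi = 1 / (1 + np.exp(-X))
T = rng.binomial(1, pi)
Y = np.where(T == 1, 1 + X, X) + rng.normal(size=N)   # tau(v) = 1 for all v
w = np.where(T == 1, 1 / pi, 1 / (1 - pi))
print(tau_hat(0.0, V, T, Y, w, h=0.1))   # close to tau(0) = 1
```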

We assume $Y_{i}(1)=m_{1}(X_{i})+\varepsilon_{i}$, where the $\varepsilon_{i}$'s are independent random errors with $\mathbb{E}(\varepsilon_{i}\mid X_{i})=0$ and $\mathbb{E}(\varepsilon_{i}^{2}\mid X_{i})\leq\sigma_{0}^{2}<\infty$. Focusing on the first term of (3), we obtain the following decomposition:

\begin{split}
\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)Y_{i}&=\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)m_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\varepsilon_{i}\\
&=\frac{1}{N}\sum_{i=1}^{N}(T_{i}\hat{w}_{i}-1)\tilde{K}_{h}(V_{i},v)m_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\varepsilon_{i}\\
&\quad+\left[\frac{1}{N}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},v)m_{1}(X_{i})-\mu_{1}(v)\right]+\mu_{1}(v).
\end{split}  (4)

In the last equality, only the first two terms depend on the weights. The second term, $N^{-1}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\varepsilon_{i}$, will be handled by controlling the variability of the weights. The challenge lies in controlling the first term, which requires controlling the (empirical) balance of a kernel-weighted function class, because $m_{1}(X_{i})$, $i=1,\dots,N$, are unknown. This requirement makes achieving covariate balance significantly more challenging than in ATE estimation, i.e., when $V$ is deterministic (e.g., Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Chan et al., 2016; Wong and Chan, 2018; Zhao et al., 2019; Kallus, 2020; Wang and Zubizarreta, 2020), for multiple reasons: (i) covariate balance is required for all $v$ in a continuum, and (ii) the bandwidth $h$ in kernel smoothing must diminish with the sample size $N$.

3.2 Balancing via empirical residual moment operator

Suppose $m_{1}\in\mathcal{H}$, where $\mathcal{H}$ is an RKHS with reproducing kernel $\kappa$ and norm $\|\cdot\|_{\mathcal{H}}$. Also, let the squared empirical norm be $\|u\|_{N}^{2}=(1/N)\sum_{i=1}^{N}\{u(X_{i})\}^{2}$ for any $u\in\mathcal{H}$. Intuitively, from the first term of (4), we aim to find weights $w=\{w_{i}:T_{i}=1\}$ that ensure the following function balancing criterion:

\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}u(X_{i})\tilde{K}_{h}(V_{i},v)\approx\frac{1}{N}\sum_{i=1}^{N}u(X_{i})\tilde{K}_{h}(V_{i},v),

for all $u\in\mathcal{H}$, where the left- and right-hand sides are regarded as functions of $v$. To quantify this approximation, we define the operator $\mathcal{M}_{N,h,w}$, mapping an element of $\mathcal{H}$ to a function on $\mathcal{V}$, by

\mathcal{M}_{N,h,w}(u,\cdot)=\frac{1}{N}\sum_{i=1}^{N}(T_{i}w_{i}-1)u(X_{i})\tilde{K}_{h}(V_{i},\cdot),

which we call the empirical residual moment operator with respect to the weights $w$. The approximation error, and hence the balancing error, can be measured by

\|\mathcal{M}_{N,h,w}(u,\cdot)\|^{2},  (5)

where $\|f\|$ is a generic metric applied to a function $f$ defined on $\mathcal{V}$. Typical examples are the $L_{\infty}$-norm ($\|\cdot\|_{\infty}$), the $L_{2}$-norm ($\|\cdot\|_{2}$) and the empirical norm ($\|\cdot\|_{N}$). If one has non-uniform preferences over $\mathcal{V}$, weighted $L_{2}$- and weighted empirical norms are also applicable. In the following, we focus on the balancing error based on the $L_{2}$-norm:

S_{N,h}(w,u)=\|\mathcal{M}_{N,h,w}(u,\cdot)\|_{2}^{2}.  (6)
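To make (5) and (6) concrete, the balancing error for a given test function $u$ can be approximated on a grid of $v$ values. The sketch below (Gaussian kernel, $d_{1}=1$, Riemann-sum quadrature, with naming and toy design of our own) also illustrates that oracle inverse-propensity weights yield a much smaller balancing error than unit weights.

```python
import numpy as np

def S_Nh(w, T, uX, V, vgrid, h):
    """Balancing error (6): squared L2-norm over v of
    M_{N,h,w}(u, v) = (1/N) sum_i (T_i w_i - 1) u(X_i) K_tilde_h(V_i, v),
    approximated by a Riemann sum on vgrid."""
    K = np.exp(-0.5 * ((V[:, None] - vgrid[None, :]) / h) ** 2) / h
    Ktilde = K / K.mean(axis=0, keepdims=True)    # K_tilde_h(V_i, v) on grid
    Mv = ((T * w - 1.0) * uX) @ Ktilde / len(w)   # M_{N,h,w}(u, v)
    return np.sum(Mv**2) * (vgrid[1] - vgrid[0])

rng = np.random.default_rng(4)
N = 20_000
X = rng.uniform(-1, 1, N); V = X
pi = 1 / (1 + np.exp(-2 * X))                     # strong confounding
T = rng.binomial(1, pi)
vgrid = np.linspace(-0.9, 0.9, 200)
u = X                                             # one test function, u(x) = x
print(S_Nh(1 / pi, T, u, V, vgrid, h=0.1),        # oracle weights: small error
      S_Nh(np.ones(N), T, u, V, vgrid, h=0.1))    # unit weights: large error
```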

We return to other norms in Section 5. Ideally, our target is to minimize $\sup_{u\in\mathcal{H}}S_{N,h}(w,u)$ uniformly over a sufficiently complex space $\mathcal{H}$. However, $S_{N,h}(w,tu)=t^{2}S_{N,h}(w,u)$ for any $t\geq 0$, which indicates a scaling issue in $u$. Therefore, we standardize the magnitude of $u$ and restrict the space to $\mathcal{H}_{N}=\{u\in\mathcal{H}:\|u\|_{N}^{2}=1\}$, as in Wong and Chan (2018). Also, to avoid overfitting, we add a penalty on $u$ with respect to $\|\cdot\|_{\mathcal{H}}$, focusing the control of the balancing error on smoother functions. Inspired by the discussion of (4), we introduce another penalty term,

R_{N,h}(w)=\frac{1}{N}\sum_{i=1}^{N}\|T_{i}w_{i}\tilde{K}_{h}(V_{i},\cdot)\|_{2}^{2},  (7)

to control the variability of the weights.

In summary, given any $h>0$, our CFB weights $\hat{w}$ are constructed as follows:

\hat{w}=\operatorname*{argmin}_{w}\left[\sup_{u\in\mathcal{H}_{N}}\left\{S_{N,h}(w,u)-\lambda_{1}\|u\|_{\mathcal{H}}^{2}\right\}+\lambda_{2}R_{N,h}(w)\right],  (8)

where $\lambda_{1}>0$ and $\lambda_{2}>0$ are tuning parameters. Note that (8) does not depend on the weights $\{w_{i}:T_{i}=0\}$ of the control group, and the optimization is performed only with respect to $\{w_{i}:T_{i}=1\}$.

Remark 1.

By the standard representer theorem, we can show that the solution $\tilde{u}=\hat{u}/\|\hat{u}\|_{N}$ of the inner optimization satisfies $\hat{u}\in\mathcal{K}_{N}=\mathrm{span}\{\kappa(X_{i},\cdot):i=1,\dots,N\}$; see also the reparameterization section in the supplementary material. Therefore, by the definition of $\mathcal{M}_{N,h,w}$, the weights are determined by balancing the covariate functions generated by the hybrid of the reproducing kernel $\kappa$ and the smoothing kernel $K$.

Remark 2.

Wong and Chan (2018) adopt a similar optimization form to (8) to obtain weights. The key difference between their estimator and ours is the choice of balancing error, which is tailored to the target quantity. In Wong and Chan (2018), the balancing error is $\{\sum_{i=1}^{N}(T_{i}w_{i}-1)u(X_{i})/N\}^{2}$, which is designed for estimating the scalar ATE; heuristically, it can be regarded as the limit of $S_{N,h}$ as $h\rightarrow\infty$. For finite $h$, two fundamental difficulties emerge that do not exist in Wong and Chan (2018). First, $\mathcal{M}_{N,h,w}(u,v)$ changes with $v$, so the choice of $S_{N,h}$ involves a metric for a function of $v$ in (6). This reflects the fact that our target is a function (the PCATE) rather than a scalar (the ATE): for reasonable metrics, the resulting balancing errors measure imbalances over all (possibly infinitely many) values of $v$, which is significantly more demanding than the imbalance control required for the ATE. Second, for each $v$, the kernel function in $\mathcal{M}_{N,h,w}$ implies that the effective sample size used in the corresponding balancing is much smaller than $\sum_{i=1}^{N}T_{i}$. Hence there is no theoretical guarantee that the weights of Wong and Chan (2018), which are designed to balance toward a scalar rather than a function, ensure enough balance for the PCATE. We show that the proposed CFB weighting estimator achieves desirable properties both theoretically (Section 5) and empirically (Section 6).

3.3 Computation

Applying the standard representer theorem, (8) can be reformulated as

\hat{w}=\operatorname*{argmin}_{w\geq 1}\left[\sigma_{1}\left\{\frac{1}{N}P^{\intercal}\mathrm{diag}(T\circ w-J)G_{h}\mathrm{diag}(T\circ w-J)P-N\lambda_{1}D^{-1}\right\}+\lambda_{2}R_{N,h}(w)\right],  (9)

where $\circ$ is the element-wise product of two vectors, $J=(1,1,\dots,1)^{\intercal}$, $\sigma_{1}(A)$ denotes the maximum eigenvalue of a symmetric matrix $A$, $P\in\mathbb{R}^{N\times r}$ consists of the singular vectors of the Gram matrix $M:=[\kappa(X_{i},X_{j})]_{i,j=1}^{N}\in\mathbb{R}^{N\times N}$ of rank $r$, $D\in\mathbb{R}^{r\times r}$ is the diagonal matrix such that $M=PDP^{\intercal}$, and

Gh=[𝒱K~h(V1,v)K~h(V1,v)dv𝒱K~h(V1,v)K~h(VN,v)dv𝒱K~h(VN,v)K~h(V1,v)dv𝒱K~h(VN,v)K~h(VN,v)dv]N×N.G_{h}=\left[\begin{array}[]{ccc}\int_{\mathcal{V}}\tilde{K}_{h}(V_{1},v)\tilde{K}_{h}(V_{1},v)\mathrm{d}v&\cdots&\int_{\mathcal{V}}\tilde{K}_{h}(V_{1},v)\tilde{K}_{h}(V_{N},v)\mathrm{d}v\\ \vdots&\ddots&\vdots\\ \int_{\mathcal{V}}\tilde{K}_{h}(V_{N},v)\tilde{K}_{h}(V_{1},v)\mathrm{d}v&\cdots&\int_{\mathcal{V}}\tilde{K}_{h}(V_{N},v)\tilde{K}_{h}(V_{N},v)\mathrm{d}v\end{array}\right]\in\mathbb{R}^{N\times N}.

The detailed derivation can be found in the supplementary material.
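The pieces of (9) can be assembled numerically as follows. This is a self-contained sketch under assumptions of our own (Gaussian reproducing and smoothing kernels, $d_{1}=1$, grid quadrature for the entries of $G_{h}$, and all function names are ours); the inner supremum is evaluated in closed form as a maximum eigenvalue, and $R_{N,h}(w)=N^{-1}\sum_{i}T_{i}w_{i}^{2}(G_{h})_{ii}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam1, lam2, h = 60, 1e-3, 1e-3, 0.15
X = rng.uniform(0, 1, N); V = X
T = rng.binomial(1, 0.5, N)

# Smoothing-kernel part: K_tilde_h(V_i, v) on a grid, then the entries
# (G_h)_{ij} = int K_tilde_h(V_i, v) K_tilde_h(V_j, v) dv by a Riemann sum.
vgrid = np.linspace(0, 1, 400); dv = vgrid[1] - vgrid[0]
K = np.exp(-0.5 * ((V[:, None] - vgrid[None, :]) / h) ** 2) / h
Ktilde = K / K.mean(axis=0, keepdims=True)
Gh = (Ktilde * dv) @ Ktilde.T

# Reproducing-kernel part: eigendecomposition M = P D P^T of the Gram matrix.
M = np.exp(-(X[:, None] - X[None, :]) ** 2)
vals, vecs = np.linalg.eigh(M)
keep = vals > 1e-10 * vals.max()
P, D = vecs[:, keep], vals[keep]

def objective(w):
    """Objective (9) for a candidate weight vector w."""
    r = T * w - 1.0                                  # entries of T o w - J
    A = P.T @ (r[:, None] * Gh * r[None, :]) @ P / N - N * lam1 * np.diag(1.0 / D)
    sup_term = np.linalg.eigvalsh(A).max()           # closed-form inner sup
    R = np.mean(T * w**2 * np.diag(Gh))              # penalty (7)
    return sup_term + lam2 * R

print(objective(np.ones(N)))
```

In practice this objective would be passed to a gradient-based convex solver over the treated weights, as described below.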

As for computation, Lemma 1, whose proof can be found in the supplementary material, indicates that the underlying optimization is convex.

Lemma 1.

The optimization (8) is convex.

Therefore, generic convex optimization algorithms are applicable. We note that the corresponding gradient has a closed-form expression when the maximum eigenvalue in the objective function has multiplicity 1. Thus, gradient-based algorithms can be applied to solve this problem efficiently.

Next, we discuss several practical strategies to speed up the optimization. When optimizing (9), we need to compute the dominant eigenpair of an $r\times r$ matrix (for computing the gradient). Since common choices of the reproducing kernel $\kappa$ are smooth, the corresponding Gram matrix $M$ can be well approximated by a low-rank matrix. When $N$ is large, one can therefore take $r$ much smaller than $N$ such that the eigendecomposition $M\approx PDP^{\intercal}$ holds approximately, which significantly reduces the cost of computing the dominant eigenpair of the $r\times r$ matrix. Although the form of $G_{h}$ may seem complicated, it does not change with $w$; therefore, for each $h$, we can precompute $G_{h}$ once at the beginning of an algorithm for (9). However, when the integral $g_{h}(v_{1},v_{2})=\int_{\mathcal{V}}\tilde{K}_{h}(v_{1},v)\tilde{K}_{h}(v_{2},v)\,\mathrm{d}v$ has no closed-form expression, one generally has to perform a large number of numerical integrations to compute $G_{h}$ when $N$ is large. For smooth choices of $K$, however, $g_{h}$ is also a smooth function, so we can evaluate $g_{h}(V_{i},V_{j})$, $i\in S_{1},j\in S_{2}$, on smaller subsets $S_{1}$ and $S_{2}$ and then apply standard interpolation methods (Harder and Desmarais, 1972) to approximate the unevaluated integrals in $G_{h}$, easing the computational burden.

4 Augmented estimator

Inspired by augmented inverse propensity weighting (AIPW) estimators in the ATE literature, we also propose an augmented estimator that directly adjusts for the outcome models $m_{1}(\cdot)$ and $m_{0}(\cdot)$.

Since the outcome regression functions $m_{1}(\cdot)$ and $m_{0}(\cdot)$ are assumed to lie in an RKHS $\mathcal{H}$, kernel-based estimators $\hat{m}_{1}(\cdot)$ and $\hat{m}_{0}(\cdot)$ can be employed. We then perform augmentation and obtain the adjusted response $Z_{i}$ in (2) as

Z_{i}=w_{i}T_{i}\{Y_{i}-\hat{m}_{1}(X_{i})\}+\hat{m}_{1}(X_{i})-\left[w_{i}(1-T_{i})\left\{Y_{i}-\hat{m}_{0}(X_{i})\right\}+\hat{m}_{0}(X_{i})\right].  (10)
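The augmented adjusted response (10) weights only the outcome residuals and adds back the fitted means. A minimal sketch, using oracle weights and exact outcome means on a toy design of our own for illustration:

```python
import numpy as np

def augmented_Z(Y, T, w, m1hat, m0hat):
    """Augmented adjusted response (10): weight the residuals Y - m_hat,
    then add back the fitted outcome means m1hat(X_i), m0hat(X_i)."""
    return (w * T * (Y - m1hat) + m1hat) - (w * (1 - T) * (Y - m0hat) + m0hat)

# Toy check with oracle quantities (illustrative design, not from the paper):
rng = np.random.default_rng(2)
N = 100_000
X = rng.uniform(-1, 1, N)
pi = 1 / (1 + np.exp(-X))
T = rng.binomial(1, pi)
m1, m0 = 1 + X, X                                 # true outcome means
Y = np.where(T == 1, m1, m0) + rng.normal(size=N)
w = np.where(T == 1, 1 / pi, 1 / (1 - pi))
Z = augmented_Z(Y, T, w, m1, m0)
print(Z.mean())   # close to the average treatment effect, 1
```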

Correspondingly, the decomposition in (4) becomes

\begin{split}
&\frac{1}{N}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},v)\hat{m}_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\{Y_{i}-\hat{m}_{1}(X_{i})\}\\
&=\frac{1}{N}\sum_{i=1}^{N}(1-T_{i}\hat{w}_{i})\tilde{K}_{h}(V_{i},v)\hat{m}_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)m_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\varepsilon_{i}\\
&=\frac{1}{N}\sum_{i=1}^{N}(T_{i}\hat{w}_{i}-1)\tilde{K}_{h}(V_{i},v)\{m_{1}(X_{i})-\hat{m}_{1}(X_{i})\}+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\varepsilon_{i}\\
&\quad+\left\{\frac{1}{N}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},v)m_{1}(X_{i})-\mu_{1}(v)\right\}+\mu_{1}(v).
\end{split}

Now, our goal is to control the difference between $N^{-1}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},v)\{m_{1}(X_{i})-\hat{m}_{1}(X_{i})\}$ and $N^{-1}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},v)\{m_{1}(X_{i})-\hat{m}_{1}(X_{i})\}$. The weight estimators in Section 3.2 can be adopted similarly to control this difference. It can be shown that the term $S_{N,h}(\hat{w},m_{1}-\hat{m}_{1})=\|N^{-1}\sum_{i=1}^{N}(T_{i}\hat{w}_{i}-1)\tilde{K}_{h}(V_{i},\cdot)\{m_{1}(X_{i})-\hat{m}_{1}(X_{i})\}\|_{2}^{2}$ achieves a faster rate of convergence than $S_{N,h}(\hat{w},m_{1})$ with the same estimated weights $\hat{w}$, as long as $\hat{m}_{1}$ is a consistent estimator. However, this property does not improve the final convergence rate of the PCATE estimation, because the term $\|N^{-1}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},\cdot)m_{1}(X_{i})-\mu_{1}(\cdot)\|_{2}^{2}$ dominates the other terms, and thus the final rate can never be faster than the optimal nonparametric rate; see Remark 4 for details. Our theoretical results reveal that the benefit of the augmentation lies in a relaxed order requirement on the tuning parameters for achieving the optimal convergence rate. Therefore, the performance of the augmented estimator is expected to be more robust to tuning parameter selection.

Unlike other AIPW-type estimators (Lee et al., 2017; Fan et al., 2020; Zimmert and Lechner, 2019; Semenova and Chernozhukov, 2020), which often rely on data splitting in estimating the propensity score and outcome mean functions to relax technical conditions, our estimator does not require data splitting to facilitate convergence with augmentation; see also Remark 7. We defer the theoretical comparison with existing AIPW-type estimators to Remark 6 in Section 5.

Last, we note that there is existing work using weights to balance residuals (e.g., Athey et al., 2016; Wong and Chan, 2018), similar to what we consider here for the proposed augmented estimator. However, those estimators are designed for ATE estimation, and their balancing weights cannot be directly adopted here with theoretical guarantees.

5 Asymptotic properties

In this section, we conduct an asymptotic analysis of the proposed estimator. For simplicity, we assume $\mathcal{X}=[0,1]^{d}$. To facilitate the theoretical discussion in terms of smoothness, we assume the RKHS $\mathcal{H}$ is contained in a Sobolev space (see Assumption 3). Our results can be extended to other choices of $\mathcal{H}$ if the corresponding entropy result and boundedness condition for the unit ball $\{u\in\mathcal{H}:\|u\|_{\mathcal{H}}\leq 1\}$ are available. Recall that we focus on $\mathbb{E}\{Y(1)\mid V=v\}$; similar analyses apply to $\mathbb{E}\{Y(0)\mid V=v\}$ and finally the PCATE.

5.1 Regularity conditions

Let $\ell$ be a positive integer. For any function $u$ defined on $\mathcal{X}$, the Sobolev norm is $\|u\|_{\mathcal{W}^{\ell}}=\sqrt{\sum_{|\beta|\leq\ell}\|D^{\beta}u\|_{2}^{2}}$, where $D^{\beta}u(x_{1},\dots,x_{d})=\frac{\partial^{|\beta|}u}{\partial x_{1}^{\beta_{1}}\dots\partial x_{d}^{\beta_{d}}}$ for a multi-index $\beta=(\beta_{1},\dots,\beta_{d})$. The Sobolev space $\mathcal{W}^{\ell}$ consists of functions with finite Sobolev norm. For $\epsilon>0$, we denote by $\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|)$ the $\epsilon$-covering number of a set $\mathcal{F}$ with respect to a norm $\|\cdot\|$. Next, we list the assumptions underlying our asymptotic results.

Assumption 3.

The unit ball of $\mathcal{H}$ is a subset of a ball in the Sobolev space $\mathcal{W}^{\ell}$, with the ratio $\alpha:=d/\ell$ less than 2.

Assumption 4.

The regression function $m(\cdot)\in\mathcal{H}$.

Assumption 5.

(a) $K$ is symmetric, $\int K(s)\,\mathrm{d}s=1$, and there exists a constant $C_{2}$ such that $K(s)\leq C_{2}$ for all $s$. Moreover, $\int s^{2}K(s)\,\mathrm{d}s<\infty$ and $\int K^{2}(s)\,\mathrm{d}s<\infty$. (b) Take $\mathcal{K}=\{K\{(v-\cdot)/h\}:h>0,v\in[0,1]^{d_{1}}\}$. There exist constants $A_{1}>0$ and $\nu_{1}>0$ such that $\mathcal{N}(\varepsilon,\mathcal{K},\|\cdot\|_{\infty})\leq A_{1}\varepsilon^{-\nu_{1}}$.

Assumption 6.

The density function $g(\cdot)$ of the random variable $V\in[0,1]^{d_{1}}$ is continuous, differentiable, and bounded away from zero and infinity; i.e., there exist constants $C_{3}>0$ and $C_{4}>0$ such that $C_{3}\leq g(v)\leq C_{4}$ for all $v$.

Assumption 7.

h0h\rightarrow 0 and N22+αhd1N^{\frac{2}{2+\alpha}}h^{d_{1}}\rightarrow\infty, as NN\rightarrow\infty.

Assumption 8.

The joint density of {m(X),V}\{m(X),V\} and the conditional expectation 𝔼{m(X)V=v}\mathbb{E}\{m(X)\mid V=v\} are continuous.

Assumption 9.

The errors {εi,i=1,,N}\{\varepsilon_{i},i=1,\dots,N\} are uncorrelated, with 𝔼(εi)=0\mathbb{E}(\varepsilon_{i})=0 and Var(εi)σ02\mathrm{Var}(\varepsilon_{i})\leq\sigma_{0}^{2} for all i=1,,Ni=1,\dots,N. Furthermore, {εi,i=1,,N}\{\varepsilon_{i},i=1,\dots,N\} are independent of {Ti,i=1,,N}\{T_{i},i=1,\dots,N\} and {Xi,i=1,,N}\{X_{i},i=1,\dots,N\}.

Assumption 3 is a common condition in the literature on smoothing spline regression. Assumptions 5–8 comprise standard conditions for kernel smoothers (e.g., Mack and Silverman, 1982; Einmahl et al., 2005; Wasserman, 2006), except that we require $N^{\frac{2}{2+\alpha}}h^{d_{1}}\rightarrow\infty$ instead of $Nh^{d_{1}}\rightarrow\infty$ to ensure that the difference between $\|u\|_{N}$ and $\|u\|_{2}$ is asymptotically negligible. Assumption 5(b) is satisfied whenever $K(\cdot)=\psi\{p(\cdot)\}$ with $p(\cdot)$ a polynomial in $d_{1}$ variables and $\psi$ a real-valued function of bounded variation (Van der Vaart, 2000).

5.2 L2L_{2}-norm balancing

Given two sequences of positive real numbers $(A_{1},A_{2},\dots)$ and $(B_{1},B_{2},\dots)$, $A_{N}=O(B_{N})$ means that there exists a positive constant $M$ such that $A_{N}\leq MB_{N}$ as $N\rightarrow\infty$; $A_{N}=o(B_{N})$ means that $A_{N}/B_{N}\rightarrow 0$ as $N\rightarrow\infty$; and $A_{N}\asymp B_{N}$ means that both $A_{N}=O(B_{N})$ and $B_{N}=O(A_{N})$ hold.

Theorem 1.

Suppose Assumptions 1–7 hold. If $\lambda_{1}^{-1}=o(Nh^{d_{1}})$, we have $S_{N,h}(\hat{w},m)=O_{\mathrm{p}}(\lambda_{1}\|m\|_{N}^{2}+\lambda_{1}\|m\|_{\mathcal{H}}^{2}+\lambda_{2}h^{-d_{1}}\|m\|_{N}^{2})$. If we further assume $\lambda_{2}^{-1}=O(\lambda_{1}^{-1}h^{-d_{1}})$, then $R_{N,h}(\hat{w})=O_{\mathrm{p}}(h^{-d_{1}})$.

Theorem 1 quantifies the control of the balancing error and of the weight variability; these bounds are used to derive the convergence rate of the proposed estimator in the following theorem.

Theorem 2.

Suppose Assumptions 1–9 hold. If $\lambda_{1}^{-1}=o(Nh^{d_{1}})$, $\lambda_{2}^{-1}=O(\lambda_{1}^{-1}h^{-d_{1}})$, and $h^{2}=o\{(N^{-1}h^{-d_{1}})^{1/2}\}$, then

$\left\|\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}Y_{i}K_{h}(V_{i},\cdot)-\mathbb{E}\{Y(1)\mid V=\cdot\}\right\|_{2}=O_{\mathrm{p}}\{N^{-1/2}h^{-d_{1}/2}+\lambda_{1}^{1/2}\|m_{1}\|_{\mathcal{H}}+\lambda_{2}^{1/2}h^{-d_{1}/2}\|m_{1}\|_{2}\}.$

The proofs are given in the supplementary material. Since we require $\lambda_{1}^{-1}=o(Nh^{d_{1}})$, the best convergence rate achievable in Theorem 2 is arbitrarily close to, but does not attain, the optimal rate $N^{-1/2}h^{-d_{1}/2}$. It is unclear whether this arbitrarily small gap is an artifact of our proof technique. However, in Theorem 4 below, we show that the gap can be closed by the proposed augmented estimator.

Remark 3.

Abrevaya et al. (2015) adopt an inverse probability weighting (IPW) method to estimate the PCATE, where the propensity score is approximated either parametrically or by kernel smoothing. They provide pointwise convergence results for their estimators, as opposed to the $L_{2}$ convergence in our theorem. For their nonparametric propensity score estimator, the result rests on a strong smoothness assumption on the propensity score: high-order kernels (of order no less than $d$) are required both in estimating the propensity score and in the subsequent PCATE smoothing step in order to achieve the optimal convergence rate. In contrast, our proposed estimator requires neither such a strong smoothness assumption nor a parametric specification of the propensity score.

5.3 LL_{\infty}-norm balancing

In Section 3.2, we mention several choices of the metric in the balancing error (6). In this subsection, we provide a theoretical investigation of an important case: the $L_{\infty}$-norm. We note that efficient computation of the corresponding weights is challenging and thus is not pursued in the current paper. Nonetheless, it is theoretically interesting to derive the convergence result for the proposed estimator with the $L_{\infty}$-norm. More specifically, the estimator of interest in this subsection is defined by replacing the $L_{2}$-norm in $S_{N,h}(w,u)$ and $R_{N,h}(w)$ with the $L_{\infty}$-norm. Instead of an $L_{2}$ convergence rate (Theorem 2), we obtain the uniform convergence rate of this estimator in the following theorem.

Theorem 3.

Suppose Assumptions 1–9 hold. Let $\tilde{w}$ be the solution to (8) but with $S_{N,h}(w,u)=\|\mathcal{M}_{N,h,w}(u,\cdot)\|_{\infty}$ and $R_{N,h}(w)=\|\frac{1}{N}\sum_{i=1}^{N}T_{i}w_{i}\tilde{K}_{h}(V_{i},\cdot)\|_{\infty}$. If $\lambda_{1}^{-1}\asymp Nh^{d_{1}}\log(1/h)$, $\lambda_{2}\asymp N^{-1}$, $\log(1/h)/(\log\log N)\rightarrow\infty$ as $N\rightarrow\infty$, and $h^{2}=o[\{N^{-1}h^{-d_{1}}\log(1/h)\}^{1/2}]$, then

$\left\|\frac{1}{N}\sum_{i=1}^{N}T_{i}\tilde{w}_{i}Y_{i}K_{h}(V_{i},\cdot)-\mathbb{E}\{Y(1)\mid V=\cdot\}\right\|_{\infty}=O_{\mathrm{p}}\{N^{-1/2}h^{-d_{1}/2}\log^{1/2}(1/h)\}.$

We provide an outline of the proof in the supplementary material.

In contrast to Theorem 2, the uniform convergence rate here is optimal. Roughly speaking, this is because, compared with the optimal $L_{2}$ convergence rate, the optimal uniform convergence rate carries an extra logarithmic factor, which dominates the arbitrarily small gap mentioned in Section 5.2.

5.4 Augmented estimator

We also derive the asymptotic property of the augmented estimator.

Theorem 4.

Suppose Assumptions 1–9 hold. Let $e=m_{1}-\hat{m}_{1}\in\mathcal{H}$ be such that $\|e\|_{\mathcal{H}}=o_{\mathrm{p}}(1)$ and $\|e\|_{2}=o_{\mathrm{p}}(1)$. If $\lambda_{1}^{-1}=o(Nh^{d_{1}})$, $\lambda_{2}^{-1}=O(\lambda_{1}^{-1}h^{-d_{1}})$, and $h^{2}=o\{(N^{-1}h^{-d_{1}})^{1/2}\}$, then

$\left\|\frac{1}{N}\sum_{i=1}^{N}\tilde{K}_{h}(V_{i},\cdot)\hat{m}_{1}(X_{i})+\frac{1}{N}\sum_{i=1}^{N}T_{i}\hat{w}_{i}\tilde{K}_{h}(V_{i},\cdot)\{Y_{i}-\hat{m}_{1}(X_{i})\}-\mathbb{E}\{Y(1)\mid V=\cdot\}\right\|_{2}=O_{\mathrm{p}}(N^{-1/2}h^{-d_{1}/2}+\lambda_{1}^{1/2}\|e\|_{\mathcal{H}}+\lambda_{2}^{1/2}h^{-d_{1}/2}\|e\|_{2}).$
Remark 4.

In Theorem 2, to obtain the best convergence rate, which is arbitrarily close to $N^{-1/2}h^{-d_{1}/2}$, we require $\lambda_{1}$ and $\lambda_{2}$ to be arbitrarily close to $N^{-1}h^{-d_{1}}$ and $N^{-1}$, respectively. In Theorem 4, by contrast, as long as $\lambda_{1}=O(N^{-1}h^{-d_{1}}\log(1/h)\|e\|^{-2}_{\mathcal{H}})$ and $\lambda_{2}=O(N^{-1}\log(1/h)\|e\|^{-2}_{N})$, the optimal convergence rate $N^{-1/2}h^{-d_{1}/2}$ is achievable. Therefore, with the help of augmentation, we can relax the order requirements on the tuning parameters for achieving the optimal rate. As a result, it is “easier” to tune $\lambda_{1}$ and $\lambda_{2}$ with augmentation.
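For concreteness, the augmented estimate appearing in Theorem 4 can be evaluated on a grid of $v$ as in the following sketch (illustrative Python, not part of the formal development; the Gaussian kernel, the per-point normalization used for $\tilde{K}_h$, and all function names are our own assumptions, and the weights $\hat{w}_i$ and fitted outcome means $\hat{m}_1(X_i)$ are taken as given):

```python
import numpy as np

def augmented_estimate(v_grid, V, T, Y, w, m1_hat, h):
    """Evaluate (1/N) sum_i Ktilde_h(V_i, v) m1_hat_i
       + (1/N) sum_i T_i w_i Ktilde_h(V_i, v) {Y_i - m1_hat_i}
    over a grid of v, with Ktilde_h a Gaussian kernel normalized
    to average one at each evaluation point (an assumed convention)."""
    V, T, Y, w, m1 = map(lambda a: np.asarray(a, float), (V, T, Y, w, m1_hat))
    out = np.empty(len(v_grid))
    for j, v in enumerate(v_grid):
        K = np.exp(-0.5 * ((V - v) / h) ** 2)  # Gaussian kernel weights
        Kt = K / K.mean()                      # normalized kernel weights
        # outcome-model term plus weighted residual correction
        out[j] = np.mean(Kt * m1) + np.mean(T * w * Kt * (Y - m1))
    return out
```

When the outcome model is exact ($\hat{m}_1 = m_1$), the residual term vanishes and the estimate reduces to a kernel smooth of $\hat{m}_1(X_i)$ against $V_i$, which illustrates why augmentation reduces the burden placed on the weights.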

Remark 5.

Several existing works focus on estimating the FCATE $\gamma(\cdot)$ given the full set of covariates (Kennedy, 2020; Nie and Wager, 2017). While one could partially marginalize their estimate $\hat{\gamma}(\cdot)$ of $\gamma(\cdot)$ to obtain an estimate $\check{\tau}(\cdot)$ of $\tau(\cdot)$, it is not entirely clear whether the convergence rate of $\check{\tau}(\cdot)$ is optimal, even when $\hat{\gamma}(\cdot)$ is rate-optimal nonparametrically. The main reason is that the estimation errors $\hat{\gamma}(x)-\gamma(x)$ are dependent across different values of $x$. Note that $\gamma(\cdot)$ is a $d$-dimensional function, whose optimal rate is slower than the optimal rate we achieve for $\tau(\cdot)$, a $d_{1}$-dimensional function, whenever $d_{1}<d$. Hence, the partial marginalization step would need to be shown to speed up the convergence significantly in order to be comparable to our rate result.

Remark 6.

To directly estimate the PCATE $\tau(\cdot)$, a common approach, which we also follow, is to apply smoothing methods to adjusted responses with respect to $V$ instead of $X$. The essential difficulty discussed in Remark 5 remains, and hence the analyses are more challenging than those for the FCATE $\gamma(\cdot)$ if the optimal rate is sought. Existing work that adopts augmentation (Lee et al., 2017; Semenova and Chernozhukov, 2017; Zimmert and Lechner, 2019; Fan et al., 2020) requires estimation of both the propensity score and the outcome mean functions, referred to as nuisance parameters below. Lee et al. (2017) adopt parametric models for both nuisance parameters and achieve double robustness; i.e., only one nuisance parameter needs to be consistently estimated to achieve the optimal rate for $\tau(\cdot)$. However, parametric modeling is a strong assumption and may be restrictive. Semenova and Chernozhukov (2017), Zimmert and Lechner (2019) and Fan et al. (2020) adopt nonparametric nuisance modeling. Importantly, to achieve the optimal rate for $\tau(\cdot)$, these works require consistency of both nuisance parameter estimators; in other words, both nuisance models must be correctly specified. Fan et al. (2020) require both nuisance parameters to be estimated consistently with respect to the $L_{\infty}$-norm, while Semenova and Chernozhukov (2017) and Zimmert and Lechner (2019) implicitly require the product of the convergence rates of the two nuisance estimators to be faster than $N^{-1/2}$ to achieve the optimal rate of PCATE estimation; thus, if one nuisance estimator is not consistent, the other has to converge faster than $N^{-1/2}$. Unlike these existing estimators, ours relies on neither restrictive parametric modeling nor consistency of both nuisance parameter estimators.

Remark 7.

Moreover, most existing works (discussed in Remark 6) require data splitting or cross-fitting to remove the dependence between the nuisance parameter estimation and the smoothing step for estimating $\tau(\cdot)$, which is crucial in their theoretical analyses. Zheng and van der Laan (2011) first propose cross-fitting in the context of the targeted maximum likelihood estimator, and Chernozhukov et al. (2017) subsequently apply it to estimating equations. This technique can be used to relax the Donsker conditions required for the class of nuisance functions. Kennedy (2020) applies cross-fitting to FCATE estimation for similar purposes. While data splitting and cross-fitting are beneficial in theoretical development, they are not generally a favorable modification, owing to the increased computation and the reduced data available for estimating the different components (nuisance parameter estimation and smoothing). In contrast, our estimators do not require data splitting in either theory or practice. Our asymptotic analyses are nonstandard and differ significantly from these existing works since, without data splitting, the estimated weights are intimately related to each other, and an additional layer of smoothing further complicates the dependence structure.

6 Simulation

We evaluate the finite-sample properties of various estimators with sample size $N=100$. The covariate $X\in\mathbb{R}^{4}$ is generated by $X_{1}=Z_{1}$, $X_{2}=Z_{1}^{2}+Z_{2}$, $X_{3}=\exp(Z_{3}/2)+Z_{2}$ and $X_{4}=\sin(2Z_{1})+Z_{4}$ with $Z_{j}\sim\mathrm{Uniform}[-2,2]$ for $j=1,\dots,4$. The conditioning variable of interest is set to be $V=X_{1}$. The treatment is generated by $T\mid X\sim\mathrm{Bernoulli}\{\pi(X)\}$, and the outcome is generated by $Y\mid(T=t,X)\sim\mathrm{N}\{m_{t}(X),1\}$. To assess the estimators, we consider two different choices for each of $\pi(X)$ and $m_{t}(X)$, summarized in Table 1. In Settings 1 and 2, the outcome mean functions $m_{t}$ are relatively easy to estimate, as they are linear in the covariates $X$. In Settings 3 and 4, by contrast, the outcome mean functions are nonlinear and more complex. The propensity score function $\pi(X)$ is linear in $X$ on the logit scale in Settings 1 and 3, and nonlinear in Settings 2 and 4. The corresponding PCATEs are nonlinear and shown in Figure 1.

Table 1: Models for simulation with two specifications for each of $\pi(X)$ and $m_{t}(X)$ ($t=0,1$)
Setting  $\pi(X)$  $m_{t}(X)$ ($t=0,1$)  $\tau(v)$
1  $1/\{1+\exp(X_{1}+X_{3})\}$  $10+X_{1}+(2t-1)(X_{2}+X_{4})$  $2v^{2}+2\sin(2v)$
2  $1/\{1+\exp(Z_{1}+Z_{2}+Z_{3})\}$  $10+X_{1}+(2t-1)(X_{2}+X_{4})$  $2v^{2}+2\sin(2v)$
3  $1/\{1+\exp(X_{1}+X_{3})\}$  $10+(2t-1)\{Z_{1}^{2}+2Z_{1}\sin(2Z_{1})\}+Z_{2}^{2}+\sin(2Z_{3})Z_{4}^{2}$  $2v^{2}+4v\sin(2v)$
4  $1/\{1+\exp(Z_{1}+Z_{2}+Z_{3})\}$  $10+(2t-1)\{Z_{1}^{2}+2Z_{1}\sin(2Z_{1})\}+Z_{2}^{2}+\sin(2Z_{3})Z_{4}^{2}$  $2v^{2}+4v\sin(2v)$
Refer to caption
Figure 1: The target PCATEs in the simulation study: the left panel plots the PCATE in Settings 1 and 2; the right panel plots the PCATE in Settings 3 and 4.
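For reference, the data-generating process described above can be reproduced with the following sketch (illustrative Python; function names are ours):

```python
import numpy as np

def generate_data(N=100, setting=1, rng=None):
    """Simulate one dataset under the designs in Table 1 (a sketch)."""
    rng = np.random.default_rng(rng)
    Z = rng.uniform(-2.0, 2.0, size=(N, 4))
    X = np.column_stack([
        Z[:, 0],                                # X1 = Z1 (the conditioning variable V)
        Z[:, 0] ** 2 + Z[:, 1],                 # X2 = Z1^2 + Z2
        np.exp(Z[:, 2] / 2) + Z[:, 1],          # X3 = exp(Z3/2) + Z2
        np.sin(2 * Z[:, 0]) + Z[:, 3],          # X4 = sin(2 Z1) + Z4
    ])
    # Propensity score: logistic-linear in X (Settings 1, 3) or in Z (Settings 2, 4).
    lin = X[:, 0] + X[:, 2] if setting in (1, 3) else Z[:, 0] + Z[:, 1] + Z[:, 2]
    pi = 1.0 / (1.0 + np.exp(lin))
    T = rng.binomial(1, pi)
    # Outcome mean: linear (Settings 1, 2) or nonlinear (Settings 3, 4).
    s = 2 * T - 1
    if setting in (1, 2):
        m = 10 + X[:, 0] + s * (X[:, 1] + X[:, 3])
    else:
        m = (10 + s * (Z[:, 0] ** 2 + 2 * Z[:, 0] * np.sin(2 * Z[:, 0]))
             + Z[:, 1] ** 2 + np.sin(2 * Z[:, 2]) * Z[:, 3] ** 2)
    Y = m + rng.standard_normal(N)              # N{m_t(X), 1} outcome
    return X, T, Y

def tau_true(v, setting=1):
    """True PCATE tau(v) from Table 1."""
    if setting in (1, 2):
        return 2 * v ** 2 + 2 * np.sin(2 * v)
    return 2 * v ** 2 + 4 * v * np.sin(2 * v)
```

The closed forms for $\tau(v)$ follow from $\mathbb{E}(Z_{2})=\mathbb{E}(Z_{4})=0$ after conditioning on $Z_{1}=v$.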

In our study, we compare the following estimators for τ()\tau(\cdot):

  a) Proposed: the proposed estimator using the tensor product of second-order Sobolev kernels as the reproducing kernel $\kappa$.

  b) ATERKHS: the weighted estimator described in Remark 2, whose weights are estimated based on the covariate balancing criterion in Wong and Chan (2018).

  c) IPW: the inverse probability weighting estimator from Abrevaya et al. (2015) with a logistic regression model for the propensity score. In Settings 1 and 3, the propensity score model is correctly specified.

  d) Augmented estimators obtained by augmenting the estimators in a)–c) with outcome models. We consider two outcome models: linear regression (LM) and kernel ridge regression (KRR).

  e) REG: the estimator that uses only outcome regressions. It directly smooths $\{(X_{i},\hat{m}_{1}(X_{i})-\hat{m}_{0}(X_{i})):i=1,\dots,N\}$ to estimate the PCATE, where $\hat{m}_{1}(X_{i})$ and $\hat{m}_{0}(X_{i})$ are estimated with the outcome models considered in d).

For all estimators, a kernel smoother with a Gaussian kernel is applied to the adjusted responses. For IPW, the bandwidth is set as $\tilde{h}=\hat{h}\times N^{1/5}\times N^{-2/7}$, where $\hat{h}$ is a commonly used optimal bandwidth in the literature, such as that from the direct plug-in method (Ruppert, Sheather, and Wand, 1995; Wand and Jones, 1994; Calonico, Cattaneo, and Farrell, 2019). Throughout our analysis, $\hat{h}$ is computed via the R package “nprobust”. The same bandwidth formula $\tilde{h}$ is also considered by Lee et al. (2017) and Fan et al. (2020) to estimate the CATE. For the proposed estimator, a bandwidth must be specified before estimating the weights. We first compute the adjusted responses using the weights from Wong and Chan (2018), and then obtain the bandwidth $\tilde{h}$ as the input to our proposed estimator.
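The bandwidth adjustment and the final smoothing step can be sketched as follows (illustrative Python under our own naming; a Nadaraya–Watson form with a Gaussian kernel is one way to smooth the adjusted responses, and the weights are taken as given):

```python
import numpy as np

def adjust_bandwidth(h_hat, N):
    # h_tilde = h_hat * N^{1/5} * N^{-2/7}: rescale a plug-in bandwidth
    # (tuned for the N^{-1/5} rate) to the undersmoothing rate N^{-2/7}.
    return h_hat * N ** (1 / 5) * N ** (-2 / 7)

def kernel_smooth(v_grid, V, R, h):
    """Nadaraya-Watson smoother with a Gaussian kernel: smooth the
    adjusted responses R (e.g., R_i = T_i * w_i * Y_i) against V."""
    V = np.asarray(V, float)
    R = np.asarray(R, float)
    out = np.empty(len(v_grid))
    for j, v in enumerate(np.asarray(v_grid, float)):
        K = np.exp(-0.5 * ((V - v) / h) ** 2)  # Gaussian kernel weights
        out[j] = np.sum(K * R) / np.sum(K)     # locally weighted average
    return out
```

Since $1/5 - 2/7 < 0$, the adjusted bandwidth $\tilde{h}$ is smaller than $\hat{h}$, i.e., the plug-in bandwidth is deliberately undersmoothed.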

Setting 1 Setting 2 Setting 3 Setting 4
Augmentation Method AISE MeISE AISE MeISE AISE MeISE AISE MeISE
No IPW 80.212 (16.19) 25.706 40.898 (6.75) 18.838 105.509 (31.66) 31.146 49.04 (9.34) 20.989
ATERKHS 16.136 (0.77) 9.633 9.653 (0.46) 6.139 18.458 (0.93) 10.264 11.367 (0.59) 7.257
Proposed 4.223 (0.22) 2.725 2.232 (0.06) 1.997 4.769 (0.26) 3.229 3.214 (0.08) 3.006
LM IPW 1.167 (0.04) 0.958 1.066 (0.03) 0.893 5.431 (1.37) 2.405 3.74 (0.78) 2.001
ATERKHS 1.156 (0.03) 1.011 1.112 (0.03) 1.003 3.471 (0.19) 2.237 2.276 (0.06) 1.924
Proposed 1.095 (0.03) 0.947 0.977 (0.02) 0.868 2.966 (0.17) 2.014 1.856 (0.05) 1.596
REG 0.843 (0.03) 0.716 0.767 (0.02) 0.67 5.431 (0.18) 4.368 4.254 (0.05) 4.107
KRR IPW 1.25 (0.04) 1.048 1.039 (0.03) 0.905 3.203 (0.19) 2.096 2.313 (0.14) 1.645
ATERKHS 1.289 (0.04) 1.092 1.152 (0.03) 0.993 3.07 (0.13) 2.213 2.125 (0.06) 1.827
Proposed 1.203 (0.04) 1.023 1.012 (0.03) 0.856 2.658 (0.12) 1.911 1.843 (0.05) 1.53
REG 1.137 (0.03) 0.953 0.905 (0.02) 0.797 3.778 (0.12) 3.213 2.796 (0.06) 2.634
Table 2: Simulation results for the four settings, where the average integrated squared errors (AISE) with standard errors (SE) in parentheses and median integrated squared error (MeISE) are provided.

Table 2 shows the average integrated squared error (AISE) and the median integrated squared error (MeISE) of the above estimators over 500 simulated datasets. Without augmentation, Proposed has significantly smaller AISE and MeISE than the other methods in all four settings. All methods are improved by augmentation. In Settings 1 and 2, REG has the best performance: in these two settings, the outcome models are linear and thus can be estimated well by both LM and KRR. However, the differences between REG and Proposed are relatively small. In Settings 3 and 4, where the outcome mean functions are more complex, Proposed achieves the best performance and shows a significant improvement over REG, especially when the outcome models are misspecified (see Settings 3 and 4 with the LM augmentation). As ATERKHS is designed only for marginal covariate balancing, its performance is worse than that of Proposed across all scenarios.
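The two error criteria can be computed per simulated dataset as in the following sketch (our own naming; the trapezoidal integration over a grid of $v$ is an assumption):

```python
import numpy as np

def integrated_squared_error(tau_hat, tau_ref, v_grid):
    """Approximate ISE = integral of {tau_hat(v) - tau(v)}^2 dv
    by the trapezoidal rule on a grid of v values."""
    err2 = (np.asarray(tau_hat, float) - np.asarray(tau_ref, float)) ** 2
    dv = np.diff(np.asarray(v_grid, float))
    return float(np.sum(0.5 * (err2[:-1] + err2[1:]) * dv))

def summarize(ise_values):
    """AISE (with its Monte Carlo standard error) and MeISE over replications."""
    ise = np.asarray(ise_values, float)
    aise = ise.mean()
    se = ise.std(ddof=1) / np.sqrt(len(ise))   # standard error of the mean
    meise = np.median(ise)
    return aise, se, meise
```

Repeating this over 500 replications yields the AISE (SE) and MeISE entries reported in Table 2.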

7 Application

We apply the estimators in Section 6 to estimate the effect of maternal smoking on birth weight as a function of mother’s age, by re-analyzing a dataset of mothers in Pennsylvania, USA (http://www.stata-press.com/data/r13/cattaneo2.dta). Following Lee et al. (2017), we focus on white and non-Hispanic mothers, resulting in a sample size of $N=3754$. The outcome $Y$ is the infant birth weight measured in grams, and the treatment indicator $T$ is whether the mother is a smoker. To make treatment ignorability plausible, we include the following covariates: mother’s age, an indicator of alcohol consumption during pregnancy, an indicator for the first baby, mother’s educational attainment, an indicator for the first prenatal visit in the first trimester, the number of prenatal care visits, and an indicator of whether there was a previous birth where the newborn died. Due to the boundary effect of the kernel smoother, we focus on $\tau(v)$ for $v\in[18,36]$, which ranges from the 0.05 quantile to the 0.95 quantile of mothers’ ages in the sample.

We compute the various estimators of the PCATE described in Section 6. For all IPW-related estimators below, logistic regression is adopted to estimate the propensity scores. Following Abrevaya et al. (2015), we include IPW: the IPW estimator with no augmentation. Following Lee et al. (2017), we include IPW(LM): the IPW estimator with LM augmentation. We include Proposed: the proposed estimator with KRR augmentation, as it performs the best in the simulation study and aligns with our assumption on the outcome mean functions. For completeness, we also include IPW(KRR): the IPW estimator with KRR augmentation; REG(KRR): the REG estimator where the outcome mean functions are estimated by KRR; and REG(LM): the REG estimator where the outcome mean functions are estimated by LM. For both the KRR augmentation and the weight estimation in Proposed, we consider a tensor product RKHS, with the second-order Sobolev kernel for continuous covariates and the identity kernel for binary covariates.

Refer to caption
Figure 2: The estimated PCATEs of maternal smoking on birth weight as a function of mother’s age: the left panel includes all estimators and the right panel excludes the IPW estimator.

Figure 2 shows the estimated PCATEs from the different methods. From the left panel of Figure 2, IPW exhibits large variation compared with the other estimators. Its significantly positive estimates before age 20 conflict with the results of various established research works indicating that smoking has an adverse effect on birth weight (Kramer, 1987; Almond et al., 2005; Abrevaya, 2006; Abrevaya and Dahl, 2008). From the right panel of Figure 2, the remaining estimators show a similar pattern: the effect becomes more severe as mother’s age increases, which aligns with the existing literature (Fox et al., 1994; Walker et al., 2007). The REG(LM) estimator shows a linearly decreasing pattern, while the REG(KRR) estimator stops decreasing after age 30. For the three weighted estimators, the effects are stable around ages 27 to 32 but tend to decrease quickly after age 32. Compared with IPW(LM) and IPW(KRR), Proposed does not show an increasing tendency before age 30, and its decrease after age 32 is relatively smoother.

8 Discussions

The PCATE characterizes subgroup treatment effects and provides insight into how the treatment effect varies across the characteristics of interest. We develop a novel nonparametric estimator of the PCATE under treatment ignorability. The proposed hybrid kernel weighting is a non-trivial extension of covariate balancing weighting in the ATE estimation literature, in that it aims to achieve approximate covariate balance for all flexible outcome mean functions and for all subgroups defined by continuous variables. In contrast to existing estimators, we do not require any smoothness assumption on the propensity score, and thus our weighting approach is particularly useful in studies where the treatment assignment mechanism is complex.

We conclude with several interesting and important extensions of the current estimator as future research directions. First, an improved data-adaptive bandwidth selection procedure is worth investigating, as the bandwidth plays an important role in smoothing. Second, instead of local constant regression, other alternatives such as local linear or spline smoothers can be considered. Third, given the appealing theoretical properties, we will investigate efficient computation of the proposed weighting estimator with the $L_{\infty}$-norm. Furthermore, the asymptotic distribution of the proposed estimator is worth studying so that inference procedures can be developed.

Acknowledgement

The work of Raymond K. W. Wong is partially supported by the National Science Foundation (DMS-1711952 and CCF-1934904). The work of Shu Yang is partially supported by the National Institute on Aging (1R01AG066883) and the National Science Foundation (DMS-1811245). The work of Kwun Chuen Gary Chan is partially supported by the National Heart, Lung, and Blood Institute (R01HL122212) and the National Science Foundation (DMS-1711952).

References

  • Abrevaya (2006) Abrevaya, J. (2006). Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21(4), 489–519.
  • Abrevaya and Dahl (2008) Abrevaya, J. and C. M. Dahl (2008). The effects of birth inputs on birthweight: evidence from quantile estimation on panel data. Journal of Business & Economic Statistics 26(4), 379–397.
  • Abrevaya et al. (2015) Abrevaya, J., Y.-C. Hsu, and R. P. Lieli (2015). Estimating conditional average treatment effects. Journal of Business & Economic Statistics 33(4), 485–505.
  • Almond et al. (2005) Almond, D., K. Y. Chay, and D. S. Lee (2005). The costs of low birth weight. The Quarterly Journal of Economics 120(3), 1031–1083.
  • Athey et al. (2016) Athey, S., G. W. Imbens, and S. Wager (2016). Approximate residual balancing: De-biased inference of average treatment effects in high dimensions. arXiv preprint arXiv:1604.07125.
  • Calonico et al. (2019) Calonico, S., M. D. Cattaneo, and M. H. Farrell (2019). nprobust: Nonparametric kernel-based estimation and robust bias-corrected inference. arXiv preprint arXiv:1906.00198.
  • Chan et al. (2016) Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(3), 673.
  • Chernozhukov et al. (2017) Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, et al. (2017). Double/debiased machine learning for treatment and causal parameters. Technical report.
  • Einmahl et al. (2005) Einmahl, U., D. M. Mason, et al. (2005). Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics 33(3), 1380–1403.
  • Fan et al. (2020) Fan, Q., Y.-C. Hsu, R. P. Lieli, and Y. Zhang (2020). Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics (just-accepted), 1–39.
  • Fox et al. (1994) Fox, S. H., T. D. Koepsell, and J. R. Daling (1994). Birth weight and smoking during pregnancy-effect modification by maternal age. American Journal of Epidemiology 139(10), 1008–1015.
  • Hainmueller (2012) Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 25–46.
  • Harder and Desmarais (1972) Harder, R. L. and R. N. Desmarais (1972). Interpolation using surface splines. Journal of Aircraft 9(2), 189–191.
  • Imai and Ratkovic (2014) Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 243–263.
  • Kallus (2020) Kallus, N. (2020). Generalized optimal matching methods for causal inference. Journal of Machine Learning Research 21(62), 1–54.
  • Kennedy (2020) Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497.
  • Kramer (1987) Kramer, M. S. (1987). Intrauterine growth and gestational duration determinants. Pediatrics 80(4), 502–511.
  • Lee et al. (2017) Lee, S., R. Okui, and Y.-J. Whang (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32(7), 1207–1225.
  • Mack and Silverman (1982) Mack, Y.-p. and B. W. Silverman (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 61(3), 405–415.
  • Nie and Wager (2017) Nie, X. and S. Wager (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.
  • Qin and Zhang (2007) Qin, J. and B. Zhang (2007). Empirical-likelihood-based inference in missing response problems and its application in observational studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(1), 101–122.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55.
  • Ruppert et al. (1995) Ruppert, D., S. J. Sheather, and M. P. Wand (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90(432), 1257–1270.
  • Semenova and Chernozhukov (2017) Semenova, V. and V. Chernozhukov (2017). Estimation and inference about conditional average treatment effect and other structural functions. arXiv preprint.
  • Semenova and Chernozhukov (2020) Semenova, V. and V. Chernozhukov (2020). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • Wager and Athey (2018) Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113(523), 1228–1242.
  • Wahba (1990) Wahba, G. (1990). Spline Models for Observational Data. SIAM.
  • Walker et al. (2007) Walker, M., E. Tekin, and S. Wallace (2007). Teen smoking and birth outcomes. Technical report, National Bureau of Economic Research.
  • Wand and Jones (1994) Wand, M. P. and M. C. Jones (1994). Kernel Smoothing. CRC Press.
  • Wang and Zubizarreta (2020) Wang, Y. and J. R. Zubizarreta (2020). Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations. Biometrika 107(1), 93–105.
  • Wasserman (2006) Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.
  • Wong and Chan (2018) Wong, R. K. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
  • Zhao et al. (2019) Zhao, Q. et al. (2019). Covariate balancing propensity score by tailored loss functions. The Annals of Statistics 47(2), 965–993.
  • Zheng and van der Laan (2011) Zheng, W. and M. J. van der Laan (2011). Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pp.  459–474. Springer.
  • Zimmert and Lechner (2019) Zimmert, M. and M. Lechner (2019). Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779.
  • Zubizarreta (2015) Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110(511), 910–922.