
Estimation of Treatment Effects based on Kernel Matching

Chong Ding (C. Ding: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Zheng Li (Z. Li: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Hon Keung Tony Ng (H. K. T. Ng: [email protected]), Department of Mathematical Sciences, Bentley University, Waltham, Massachusetts 02452, U.S.A.
Wei Gao (Corresponding author; W. Gao: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Abstract

A treatment effect represents the average causal impact or outcome difference between treatment and control groups. Treatment effects can be estimated through social experiments, regression models, matching estimators, and instrumental variables. In this paper, we introduce a novel kernel-matching estimator for treatment effect estimation. This method is particularly beneficial in observational studies where randomized control trials are not feasible, as it uses the full sample to increase the efficiency and robustness of treatment effect estimates. We demonstrate that the proposed estimator is consistent and asymptotically efficient under certain conditions. Through Monte Carlo simulations, we show that the estimator performs favorably against other estimators in the literature. Finally, we apply our method to data from the National Supported Work Demonstration to illustrate its practical application.

Keywords: Bootstrap confidence interval; Causal inference; Observational study; Propensity score; Missing data.

1 Introduction

Matching is a crucial method for estimating causal effects, widely applied in various research fields, including policy evaluation (Sekhon, 2004; Herron and Wand, 2007), healthcare research (Rubin, 1997; Christakis and Iwashyna, 2003), and economics (Dehejia and Wahba, 1999; Galiani et al., 2005; Abadie and Imbens, 2006). The process of estimating causal effects fundamentally depends on imputing a potential outcome for an individual, calculated as a weighted average of the observed outcomes of their nearest neighbors from the relevant counterfactual set (Rosenbaum and Rubin, 1983). Thus, causal effects are typically expressed as the difference between the imputed outcome and the actual observed outcome, similar to what would be expected from a randomized experiment. Various methods have been developed to assign matched subjects based on covariates, such as nearest available matching using propensity scores, Mahalanobis metric matching (Rosenbaum and Rubin, 1985), full matching (Rosenbaum, 1991; Hansen, 2004), genetic matching (Diamond and Sekhon, 2013), and matching using sufficient dimension reduction (Luo and Zhu, 2020).

Among various matching methods, covariate matching (Abadie and Imbens, 2006) can be challenging in high-dimensional settings, as it requires exact or near-exact matches across all covariates. This strict requirement often reduces the effective sample size, limiting the method’s reliability and suitability for high-dimensional data or complex distributions. Alternatively, propensity score matching (Abadie and Imbens, 2016) depends on the correct specification of the propensity score model; misspecification can introduce bias and result in extreme weights. Beyond these matching methods, several alternative approaches have been proposed for causal effect estimation, including inverse probability weighting (IPW) (Hirano et al., 2003), subclassification (Rosenbaum and Rubin, 1983), and doubly robust (DR) estimation (Bang and Robins, 2005). The accuracy of the IPW estimator is highly sensitive to the correct specification of the propensity score model, with even minor misspecifications potentially leading to substantial errors due to the high variance induced by extreme weights (Imai and Ratkovic, 2014). DR estimators, which combine propensity score and outcome regression models, can yield unbiased estimates if at least one model is correctly specified. However, DR estimators are sensitive to the accuracy of both models and may be complex to implement effectively.

In this paper, we address the above challenges by introducing a novel kernel-matching estimator and providing a detailed methodological framework, including discussions on the consistency and asymptotic normality of the estimator. Kernel matching, as outlined by Heckman et al. (1997, 1998b), is a non-parametric approach that applies kernel functions to weight control cases based on their similarity to treatment cases in the covariate space, facilitating the estimation of treatment effects (Handouyahia et al., 2013). The kernel matching method allows for increased flexibility and stability through adjustable bandwidth parameters. By balancing bias and variance through local smoothing, kernel matching produces more robust estimates than traditional one-to-one propensity score matching, especially in situations with limited control cases available for creating a matching cohort. It mitigates extreme bias or variance and reduces dependence on stringent model assumptions (Berg, 2011).

Although more complex than basic matching techniques, kernel matching extends interval and nearest-neighbor matching by assigning weights to all control cases associated with each treatment case, giving greater emphasis to those with closer covariate values. Smith and Todd (2005) offered an intuitive explanation of kernel matching and its generalizations, such as local linear matching (Heckman et al., 1997, 1998a, 1998b) and local quadratic matching (Ham et al., 2011). By smoothing over the covariate space, kernel matching reduces the risks associated with model misspecification, resulting in more reliable and unbiased treatment effect estimates. This approach is particularly advantageous in settings with high overlap between treatment and control cases, ensuring that comparisons are made between genuinely comparable groups.

When choosing a suitable matching method, selecting the appropriate function of the covariates is crucial, as the function may significantly affect the asymptotic behaviors of the resulting estimators (Heckman et al., 1997; Abadie and Imbens, 2006). Heuristically, Rosenbaum and Rubin (1983) suggested that using the original covariates is preferred for its simplicity and for ensuring unbiased matching at the population level, provided that the ignorability assumption (i.e., the potential outcomes are independent of treatment after controlling for the specific set of covariates) and the overlap assumption (i.e., the range of data is similar across treatment groups) are satisfied. However, Abadie and Imbens (2006) pointed out that the matching approach can be vulnerable to the “curse of dimensionality”, rendering matching ineffective when the dimension of covariates is high. To address this issue, a natural choice is using the propensity score, which is the probability of a subject being a treatment case given its covariates, proposed by Rosenbaum and Rubin (1985). This reduction in dimensionality makes it easier to find similar treatment and control cases. Similar to the original covariates, the propensity score also ensures the unbiasedness of matching under the ignorability assumption and the overlap assumption. Mathematically, suppose X is the vector of covariates; the propensity score p(X) can be defined as

p(X) = F(X^{\top}\beta),

where \beta is the parameter vector and F is a link function that maps X^{\top}\beta into a probability in (0,1) (e.g., a logit or probit function). Nevertheless, the estimation of the propensity score commonly relies on parametric models, including the link function F, which are susceptible to model misspecification. In high-dimensional settings, kernel matching enhances propensity score matching by smoothing across similar observations. The kernel function averages treatment effects among units with similar propensity scores, stabilizing estimates and reducing variance. High-dimensional data often suffer from sparsity, making exact matches between treatment and control cases difficult to find. Kernel matching addresses this issue by weighting nearby units based on their propensity scores, allowing for effective comparisons even when exact matches are rare. The kernel function introduces a smoothing effect, which acts as a form of regularization. This regularization helps control overfitting to noise, a common issue in high-dimensional data, making the treatment effect estimation more robust. In high-dimensional spaces, irrelevant covariates can introduce noise into the matching process. By focusing on propensity scores, which encapsulate the relevant information from all covariates, kernel matching reduces the impact of irrelevant covariates, making the method less sensitive to noise. Therefore, kernel matching based on propensity scores is a powerful tool for addressing the challenges of high-dimensional settings, as it effectively manages the complexities associated with having many covariates.

In this paper, we propose estimating treatment effects using kernel matching based on F(X^{\top}\hat{\beta}) with observational data. Through a Monte Carlo simulation study with different settings, we demonstrate the practical application of kernel matching and highlight its advantages over traditional estimation methods. This study underscores the importance of adopting robust matching techniques in causal inference and showcases the potential of kernel matching to enhance the validity and reliability of treatment effect estimates. Compared to traditional nearest-neighbor matching, kernel matching does not require pre-specifying the number of matched individuals (e.g., one-to-one matching). Instead, kernel matching adaptively selects suitable matches through the kernel function. By using more control cases through weighting, kernel matching can reduce the variance of the estimate. We show that the kernel matching method is a powerful tool in the arsenal of methods for causal inference in observational studies.

The rest of the paper is organized as follows. In Section 2, after introducing the mathematical notation and the concepts of treatment effect estimation, the proposed kernel matching estimator is developed. In Section 3, the asymptotic properties of the kernel-matching estimator are established. A Monte Carlo simulation study is used to evaluate the performance of the proposed method in Section 4. Section 5 provides a practical data analysis to illustrate the proposed methodology. Finally, some concluding remarks are provided in Section 6.

2 Basic Notation and Concepts of Kernel Matching

Following the framework of Rubin (1974), the treatment effect on the potential outcomes can be defined as follows. For unit i (i = 1, \ldots, N), let the variable W_i \in \{0, 1\} indicate whether an active treatment was received (W_i = 1) or not received (W_i = 0). The observed outcome is Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0), where Y_i(1) and Y_i(0) denote the potential outcomes had the individual received the active treatment or not, respectively. In other words, for unit i, the observed outcome Y_i for treatment W_i can be expressed as

Y_i = \begin{cases} Y_i(0) & \text{ if } W_i = 0, \\ Y_i(1) & \text{ if } W_i = 1. \end{cases}

In this paper, we are interested in the average treatment effect for the whole population (ATE)

\tau = E[Y(1) - Y(0)],

and the average treatment effect for the treated (ATT)

\tau_t = E[Y(1) - Y(0) \mid W = 1].

The assumptions listed below are essential for establishing limit properties and will remain part of the underlying assumptions throughout this article. These assumptions are commonly found in the existing literature on causal inference (see, for example, Rosenbaum and Rubin, 1983; Heckman et al., 1998b; Abadie and Imbens, 2016) and are considered standard assumptions.

Assumption 1.

Let X be a k-dimensional random vector of covariates. For x \in \mathbb{X}, where \mathbb{X} is the support of X, the propensity score is defined as p(x) = \Pr(W = 1 \mid X = x). We consider

  • (i)

    (Unconfoundedness) W \perp \{Y(1), Y(0)\} \mid X, where \perp denotes independence (Dawid, 1979);

  • (ii)

    (Overlap) \underline{\eta} \leq p(X) \leq \overline{\eta} almost surely, for some \underline{\eta} > 0 and \overline{\eta} < 1.

Note that Assumption 1 is also known as “strong ignorability” (Rosenbaum and Rubin, 1983), which implies that X is sufficient for eliminating the impact of confounding factors (i.e., “unconfoundedness”). Hahn (1998) established bounds for the asymptotic variance and explored asymptotically efficient estimation methods under Assumption 1(i).

Here, we define the proposed kernel-matching estimators for the ATE and ATT. Following the notation defined above, for N \in \mathbb{N}, suppose \{Y_i, W_i, X_i\}_{i=1}^{N} are independent draws from the population with the distribution of \{Y, W, X\}. Let N_1 and N_0 be the numbers of treated and control units, respectively. The individual treatment effect is \tau_i = Y_i(1) - Y_i(0). For every unit in the study, we can only observe one of the potential outcomes, Y_i(0) and Y_i(1); the other is unobservable or missing.

Let K(\cdot) be a kernel function, and let h be the bandwidth parameter, which converges to zero as N goes to infinity. The kernel-matching estimators based on the propensity score p(X) for the missing potential outcomes can be expressed as

\widehat{Y}_i(0) = \begin{cases} Y_i & \text{ if } W_i = 0, \\ \dfrac{\sum_{j: W_j = 0}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 0}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} & \text{ if } W_i = 1, \end{cases}

and

\widehat{Y}_i(1) = \begin{cases} \dfrac{\sum_{j: W_j = 1}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 1}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} & \text{ if } W_i = 0, \\ Y_i & \text{ if } W_i = 1. \end{cases}

The proposed kernel-matching estimator for the ATE is defined as

\begin{split} \hat{\tau}_N &= \frac{1}{N} \sum_{i=1}^{N} \left( \widehat{Y}_i(1) - \widehat{Y}_i(0) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} (2W_i - 1) \left( Y_i - \frac{\sum_{j: W_j = 1 - W_i}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 1 - W_i}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} \right), \end{split}

and the proposed kernel-matching estimator for the ATT is defined as

\hat{\tau}_t = \frac{1}{N_1} \sum_{i=1}^{N} W_i \left( Y_i - \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j=1}^{N} (1 - W_j) K\left(\frac{p(X_j) - p(X_i)}{h}\right)} \right),

where N_1 = \sum_{i=1}^{N} W_i is the number of treatment cases in the sample.
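To make the construction above concrete, the following is a minimal sketch of \hat{\tau}_N and \hat{\tau}_t in Python, assuming the propensity scores p(X_i) are given; the Gaussian kernel and all function names here are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel; any symmetric kernel could be substituted.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_match(Y, W, ps, h=0.01):
    """Kernel-matching estimates of the ATE and ATT given propensity scores ps."""
    Y, W, ps = np.asarray(Y, float), np.asarray(W, int), np.asarray(ps, float)
    treated, control = W == 1, W == 0
    # Kmat[i, j] = K((p(X_j) - p(X_i)) / h)
    Kmat = gaussian_kernel((ps[None, :] - ps[:, None]) / h)
    # Impute the missing potential outcome as a kernel-weighted average
    # over the opposite treatment group, as in the displayed estimators.
    y0_hat = np.where(control, Y,
                      Kmat[:, control] @ Y[control] / Kmat[:, control].sum(axis=1))
    y1_hat = np.where(treated, Y,
                      Kmat[:, treated] @ Y[treated] / Kmat[:, treated].sum(axis=1))
    tau_ate = np.mean(y1_hat - y0_hat)           # estimator of the ATE
    tau_att = np.mean((y1_hat - y0_hat)[treated])  # estimator of the ATT
    return tau_ate, tau_att
```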

In observational studies, to estimate the propensity scores, following Rosenbaum and Rubin (1983), a generalized linear model p(X) = F(X^{\top}\beta) can be used. The logit and probit functions are commonly used parametric link functions F(\cdot).

Let \beta^* be the true value of the propensity score model parameter vector, such that p(X) = F(X^{\top}\beta^*). The estimator based on matching on the true propensity score is given by \widehat{\tau}_N(\beta^*), and \widehat{\tau}_N(\hat{\beta}) is a kernel matching estimator based on the estimated propensity score F(X^{\top}\hat{\beta}), where \hat{\beta} is the maximum likelihood estimator of \beta based on the observed data \{Y_i, W_i, X_i\}_{i=1}^{N}, i.e.,

\hat{\beta} = \arg\max_{\beta} \ln L(\beta \mid W_1, X_1, \ldots, W_N, X_N)

with the log-likelihood function

\ln L(\beta \mid W_1, X_1, \ldots, W_N, X_N) = \sum_{i=1}^{N} \left[ W_i \ln F(X_i^{\top}\beta) + (1 - W_i) \ln\left(1 - F(X_i^{\top}\beta)\right) \right].

Next, we define a type of matching estimator where the unknown parameter \beta is estimated using the maximum likelihood method. The kernel-matching estimator based on F(X^{\top}\hat{\beta}) for the ATE can be expressed as

\begin{split} \hat{\tau}_N(\hat{\beta}) &= \frac{1}{N} \sum_{i=1}^{N} \left( \widehat{Y}_i(1) - \widehat{Y}_i(0) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} (2W_i - 1) \left( Y_i - \frac{\sum_{j: W_j = 1 - W_i} Y_j K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)}{\sum_{j: W_j = 1 - W_i} K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)} \right), \end{split}

and for ATT, the estimator can be expressed as

\hat{\tau}_{t,N}(\hat{\beta}) = \frac{1}{N_1} \sum_{i=1}^{N} W_i \left( Y_i - \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)}{\sum_{j=1}^{N} (1 - W_j) K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)} \right).
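In practice, \hat{\beta} can be computed with any routine that maximizes the log-likelihood above. As a hedged illustration (not the authors' code), the sketch below uses a logit link fitted by scikit-learn, with a negligible penalty to approximate the unpenalized MLE, and then plugs the estimated scores into the kernel_match function sketched in the previous section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimated_ps_kernel_match(X, Y, W, h=0.01):
    # Logit link F: fit beta-hat by (approximately unpenalized) maximum
    # likelihood, then match on the estimated scores F(X^T beta-hat).
    ps_hat = LogisticRegression(C=1e10, max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    # Returns (tau_hat_N(beta_hat), tau_hat_{t,N}(beta_hat)).
    return kernel_match(Y, W, ps_hat, h=h)
```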

3 Asymptotic Properties of the Proposed Estimators

In this section, we establish that the proposed kernel-matching estimators for the ATE and ATT are consistent and asymptotically normally distributed, with a convergence rate of N^{1/2}.

Assumption 2.

The kernel function K(\cdot) is symmetric and satisfies the following properties:

  • (1)

    \int K(u)\,du = 1;

  • (2)

    \int u^2 K(u)\,du < \infty;

  • (3)

    \int K^2(u)\,du < \infty;

  • (4)

    \int |u|^3 K(u)\,du < \infty.
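For instance, the Gaussian kernel used in the simulations of Section 4 satisfies conditions (1)-(4); the following sketch (an illustrative check we add here, not part of the paper) evaluates the four integrals numerically.

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel

# Conditions (1)-(4) of Assumption 2, evaluated numerically:
for label, f in [("int K(u) du", K),
                 ("int u^2 K(u) du", lambda u: u ** 2 * K(u)),
                 ("int K^2(u) du", lambda u: K(u) ** 2),
                 ("int |u|^3 K(u) du", lambda u: abs(u) ** 3 * K(u))]:
    value, _ = quad(f, -np.inf, np.inf)
    print(f"{label:20s} = {value:.4f}")
# Prints 1.0000, 1.0000, 0.2821 (= 1/(2 sqrt(pi))), and 1.5958: all finite.
```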

Let g_w(p) = \mathbb{E}(Y(w) \mid p(X) = p). Under Assumptions 1 and 2, the following theorem shows that the estimators \hat{\tau} and \hat{\tau}_t are consistent estimators of \tau and \tau_t, respectively, under mild conditions.

Theorem 1.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed with density function f(p) that has three-times continuous derivatives. Then, under Assumptions 1 and 2, as N \rightarrow \infty and h \rightarrow 0, we have

  • (i)
    \hat{\tau} - \tau \xrightarrow{p} 0,
  • (ii)
    \hat{\tau}_t - \tau_t \xrightarrow{p} 0.

The proof of Theorem 1 is given in the Appendix.

Next, we present the formal result for the asymptotic normality of the proposed estimators.

Theorem 2.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed with probability density function f(p) that has three-times continuous derivatives. Let

b_w(p) = \frac{\int t^2 K(t)\,dt}{2} \left( \frac{2 g_w'(p) f'(p \mid w)}{f(p \mid w)} + g_w''(p) \right),

and suppose Assumptions 1 and 2 hold. As N \rightarrow \infty and Nh^4 = O(1), we have

  • (i)
    \sqrt{N}\left(\hat{\tau} - \tau - Bh^2\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2),

    where B = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) + \mathbb{E}[b_1(P) \mid Z = 0]\Pr(Z = 0), and the formula for \sigma is provided in the Appendix;

  • (ii)
    \sqrt{N}\left(\hat{\tau}_t - \tau_t - B_t h^2\right) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2),

    where B_t = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) and the formula for \sigma_t is provided in the Appendix.

The proof of Theorem 2 is given in the Appendix.

Corollary 1.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed, with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and Nh^4 \rightarrow 0, the bias term converges to zero. Thus,

  • (i)
    \sqrt{N}(\hat{\tau} - \tau) \xrightarrow{d} \mathcal{N}(0, \sigma^2);
  • (ii)
    \sqrt{N}(\hat{\tau}_t - \tau_t) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2),

    where the formulas for \sigma and \sigma_t are provided in the Appendix.
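As a simple illustration of an admissible bandwidth rate (our added remark, not part of the original statement): taking h = cN^{-1/3} for a constant c > 0 gives

Nh^4 = c^4 N^{-1/3} \rightarrow 0 \quad \text{and} \quad Nh = cN^{2/3} \rightarrow \infty,

so the smoothing bias Bh^2 is of order N^{-2/3} = o(N^{-1/2}) and vanishes after \sqrt{N}-scaling, while each imputation still averages over a growing number of neighboring units.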

In the following theorems, we provide the large-sample properties of \hat{\tau}_N(\hat{\beta}) and \hat{\tau}_{t,N}(\hat{\beta}), including consistency in Theorem 3 and asymptotic normality in Theorem 4.

Theorem 3.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is distributed with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and h \rightarrow 0, we have

  • (i)
    \widehat{\tau}_N(\hat{\beta}) - \tau \xrightarrow{p} 0,
  • (ii)
    \widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t \xrightarrow{p} 0.
Theorem 4.

Suppose that g_w(p) is three times continuously differentiable and bounded, and the propensity score p(X) is continuously distributed with a continuous density f(p) that has a continuous third derivative. Under Assumptions 1 and 2, as N \rightarrow \infty, h \rightarrow 0, and Nh^4 = O(1), we have

  • (i)
    \sqrt{N}\left\{\widehat{\tau}_N(\hat{\beta}) - \tau - Bh^2\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}^2),

    where B = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) + \mathbb{E}[b_1(P) \mid Z = 0]\Pr(Z = 0) and the formula for \widetilde{\sigma}^2 is provided in the Appendix;

  • (ii)
    \sqrt{N}\left\{\widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t - B_t h^2\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}_t^2),

    where B_t = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) and the formula for \widetilde{\sigma}_t^2 is provided in the Appendix.

The proofs of Theorem 3 and Theorem 4 are presented in the Appendix.

Corollary 2.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is distributed with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and Nh^4 \rightarrow 0, the bias term converges to zero. Thus, we have

  • (i)
    \sqrt{N}\left\{\widehat{\tau}_N(\hat{\beta}) - \tau\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}^2),
  • (ii)
    \sqrt{N}\left\{\widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}_t^2),

    where the formulas for \widetilde{\sigma}^2 and \widetilde{\sigma}_t^2 are provided in the Appendix.

4 Monte Carlo Simulation Study

In this section, a Monte Carlo simulation study is used to evaluate the performance of the proposed estimators and to compare it with that of existing estimators. The simulation results confirm the effectiveness of the kernel matching estimators based on the estimated propensity score in estimating the ATE and ATT. For comparative purposes, the following existing ATE and ATT estimation methods are considered:

  • Abadie-Imbens matching estimators based on the original covariates (Abadie and Imbens, 2006) (denoted as “Covariate”).

  • Abadie-Imbens matching estimators based on the true propensity score (Abadie and Imbens, 2006) (denoted as “True PS”).

  • Abadie-Imbens matching estimators based on the estimated propensity score (Abadie and Imbens, 2006, 2016) (denoted as “Estimated PS”).

  • Normalized inverse probability weighting (IPW) estimators (Hirano et al., 2003; Imbens, 2004) for ATE and ATT given by

    \hat{\tau}_{\mathrm{IPW}}^{\mathrm{ATE}} = \sum_{i=1}^{N} \frac{W_i Y_i}{\hat{p}(X_i)} \left( \sum_{i=1}^{N} \frac{W_i}{\hat{p}(X_i)} \right)^{-1} - \sum_{i=1}^{N} \frac{(1 - W_i) Y_i}{1 - \hat{p}(X_i)} \left( \sum_{i=1}^{N} \frac{1 - W_i}{1 - \hat{p}(X_i)} \right)^{-1}

    and

    \hat{\tau}_{\mathrm{IPW}}^{\mathrm{ATT}} = \frac{1}{N_1} \sum_{i: W_i = 1} Y_i - \sum_{i: W_i = 0} Y_i \cdot \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} \left( \sum_{i: W_i = 0} \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} \right)^{-1},

    respectively (denoted as “IPW”; a code sketch of these estimators is given after this list).

  • Doubly robust estimators (Robins et al., 1994; Bang and Robins, 2005) for ATE and ATT given by

    \hat{\tau}_{\mathrm{DR}}^{\mathrm{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \left\{ \left[ \frac{W_i Y_i}{\hat{p}(X_i)} - \frac{W_i - \hat{p}(X_i)}{\hat{p}(X_i)} \hat{\mu}(1, X_i) \right] - \left[ \frac{(1 - W_i) Y_i}{1 - \hat{p}(X_i)} + \frac{W_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)} \hat{\mu}(0, X_i) \right] \right\}

    and

    \hat{\tau}_{\mathrm{DR}}^{\mathrm{ATT}} = \frac{1}{N_1} \sum_{i=1}^{N} \left[ W_i Y_i - \frac{(1 - W_i) Y_i \hat{p}(X_i) + \hat{\mu}(0, X_i) \left( W_i - \hat{p}(X_i) \right)}{1 - \hat{p}(X_i)} \right],

    respectively, where \hat{\mu}(w, X_i) denotes a fitted estimate of the outcome regression E[Y \mid W = w, X = X_i] (denoted as “DR”; see the code sketch after this list).

  • The kernel-matching estimators for ATE and ATT presented in Section 2 using the Gaussian kernel function with a bandwidth h of 0.01 (denoted as “Proposed”). In preliminary simulations (results not shown), we also tested the Epanechnikov kernel function and various bandwidth values, finding that the Gaussian kernel with h = 0.01 generally provided better performance for the kernel-matching estimator.
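For reference, a compact sketch of the IPW and DR estimators displayed above (our illustration under the paper's notation; ps stands for \hat{p}(X_i), and mu1, mu0 stand for fitted outcome regressions \hat{\mu}(1, X_i) and \hat{\mu}(0, X_i), which would be estimated separately):

```python
import numpy as np

def ipw_estimates(Y, W, ps):
    # Normalized IPW estimators of the ATE and ATT.
    w1, w0 = W / ps, (1 - W) / (1 - ps)
    tau_ate = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
    c = W == 0
    odds = ps[c] / (1 - ps[c])
    tau_att = Y[W == 1].mean() - np.sum(Y[c] * odds) / np.sum(odds)
    return tau_ate, tau_att

def dr_estimates(Y, W, ps, mu1, mu0):
    # Doubly robust (augmented IPW) estimators of the ATE and ATT.
    tau_ate = np.mean(W * Y / ps - (W - ps) / ps * mu1
                      - ((1 - W) * Y / (1 - ps) + (W - ps) / (1 - ps) * mu0))
    tau_att = np.sum(W * Y - ((1 - W) * Y * ps + mu0 * (W - ps)) / (1 - ps)) / np.sum(W)
    return tau_ate, tau_att
```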

To simulate the control and treatment cases with covariates, the following five settings are considered:

  • Setting I:

    • Two independent covariates, X_1 and X_2, follow the uniform distribution on the interval [-1/2, 1/2].

    • The potential outcomes are generated based on the following models

      Y(0) = 3X_1 - 3X_2 + \epsilon_0

      and

      Y(1) = 5 + 5X_1 + X_2 + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of (W, X_1, X_2).

    • The treatment variable, W, is related to (X_1, X_2) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2) = \frac{\exp(x_1 + 2x_2)}{1 + \exp(x_1 + 2x_2)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting II:

    • Two independent covariates, X_1 and X_2, follow the uniform distribution on the interval [-1/2, 1/2].

    • The potential outcomes are generated based on the following models

      Y(0) = 10X_1 + \epsilon_0

      and

      Y(1) = 5 - 10X_1 + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of (W, X_1, X_2).

    • The treatment variable, W, is related to X_2 through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2) = \frac{\exp(2x_2)}{1 + \exp(2x_2)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting III:

    • Four covariates: the vector X = (X_1, X_2, X_3, X_4) follows a four-dimensional normal distribution \mathcal{N}(0, \Sigma), where the variance-covariance matrix \Sigma is given by

      \Sigma = \frac{1}{3} \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix},

      i.e., X_1 and X_2 are correlated, X_3 and X_4 are correlated, and (X_1, X_2) and (X_3, X_4) are independent.

    • The potential outcomes are generated based on the following model

      Y = 0.4 + 0.25\sin(8(X^{\top}\theta) - 5) + 0.4\exp(-16(4(X^{\top}\theta) - 2.5)^2) + W + \epsilon,

      where \theta = (1, 1, 1, 1)^{\top} and \epsilon is a random error that follows the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3, X_4) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = x_4) = \frac{\exp(x_1 - 3x_2 + 2x_3 + x_4)}{1 + \exp(x_1 - 3x_2 + 2x_3 + x_4)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting IV:

    • Four covariates: the vector X = (X_1, X_2, X_3, X_4) follows a four-dimensional normal distribution \mathcal{N}(0, \Sigma), where the variance-covariance matrix \Sigma is given by

      \Sigma = \frac{1}{3} \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix}.

    • The potential outcomes are generated based on the following model

      Y = 0.15 + 0.7\left[\frac{\exp(\sqrt{2}(X^{\top}\theta))}{1 + \exp(\sqrt{2}(X^{\top}\theta))}\right] + W + \epsilon,

      where \epsilon is a random error that follows the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3, X_4) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = x_4) = \frac{\exp(x_1 + x_2 + x_3 + x_4)}{1 + \exp(x_1 + x_2 + x_3 + x_4)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting V:

    • Three independent covariates: X_1 \sim \mathcal{N}(0, 1), X_2 \sim \text{Bernoulli}(0.5), and X_3 \sim \mathcal{N}(0, 1).

    • The potential outcomes are generated based on the following models

      Y(0) = \sin(1 + X_1 + 0.5X_2) + \epsilon_0

      and

      Y(1) = \sin(2 + X_1 + 0.5X_2) + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{\exp(2x_1 + x_2 - 3x_3 + 3)}{1 + \exp(2x_1 + x_2 - 3x_3 + 3)}.

    • The ATE and ATT estimators use a single match (M = 1).

We generated datasets with sample sizes of N = 200, 500, and 1000, and conducted 1000 Monte Carlo simulations for each scenario to estimate causal effects. The data generation processes for Settings I and II follow the approach of Abadie and Imbens (2016), while those for Settings III and IV are based on the method of Frölich (2004) with slight modifications. Similarly, the process for Setting V is adapted from the approach of Luo and Zhu (2020), also with slight modifications.
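As an illustration of the data-generating process, the following is a sketch for Setting I (our reconstruction from the description above, with an arbitrary seed):

```python
import numpy as np

def generate_setting_I(N, seed=2024):
    # Setting I: uniform covariates, linear potential outcomes, logistic treatment.
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-0.5, 0.5, N)
    X2 = rng.uniform(-0.5, 0.5, N)
    ps = np.exp(X1 + 2 * X2) / (1 + np.exp(X1 + 2 * X2))  # true propensity score
    W = rng.binomial(1, ps)
    Y0 = 3 * X1 - 3 * X2 + rng.standard_normal(N)
    Y1 = 5 + 5 * X1 + X2 + rng.standard_normal(N)
    Y = np.where(W == 1, Y1, Y0)
    # The true ATE is 5, since E[Y(1) - Y(0)] = 5 + E[2*X1 + 4*X2] = 5.
    return np.column_stack([X1, X2]), W, Y, ps
```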

We compare the point estimates of ATE and ATT from the six estimators (Covariate, True PS, Estimated PS, Proposed, IPW, and DR) in terms of the simulated biases (Bias), standard deviations (SD), and root mean square errors (RMSE). For comparative purposes, we also compute the 95% confidence intervals based on the different estimators using bootstrap methods. Specifically, following Abadie and Imbens (2006, 2016), we obtain the 95% confidence intervals based on the “Covariate”, “True PS”, and “Estimated PS” estimators using a normal approximation with standard errors obtained from 400 bootstrap samples. For the “Proposed”, “IPW”, and “DR” estimators, the 95% confidence intervals are obtained using the percentile bootstrap method (Efron and Tibshirani, 1993) with 400 bootstrap samples. We compare the interval estimates of ATE and ATT from the six estimators in terms of the simulated average widths (AW) and coverage probabilities (CP). Simulation results are reported for the six estimators under Settings I–V in Tables 1–5 for ATE estimation and in Tables 6–10 for ATT estimation.
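A hedged sketch of the percentile bootstrap interval used for the “Proposed”, “IPW”, and “DR” estimators (our illustration; estimator is any function mapping a resampled (X, Y, W) to a point estimate, re-estimating the propensity score on each resample):

```python
import numpy as np

def percentile_bootstrap_ci(X, Y, W, estimator, B=400, alpha=0.05, seed=0):
    # Percentile bootstrap (Efron and Tibshirani, 1993): resample units with
    # replacement, recompute the estimator, and take empirical quantiles.
    rng = np.random.default_rng(seed)
    N = len(Y)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)
        stats.append(estimator(X[idx], Y[idx], W[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```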

From Tables 1–10, we observe that the biases and RMSEs of the proposed estimator decrease as the sample size increases, which is consistent with the theoretical asymptotic results discussed in Section 3. For point estimation of both ATE and ATT, the proposed kernel matching estimator demonstrates comparable bias and RMSE performance across most settings, though it does not achieve the smallest bias and RMSE in every scenario. From Tables 1–2, we observe that the proposed estimator for ATE outperforms the matching estimators based on covariates, the true propensity score, and the estimated propensity score in Settings I and II in terms of RMSE. From Tables 3–4, we observe that the proposed estimator for ATE shows a lower bias than the covariate-based matching estimator in Settings III and IV. This consistent performance highlights the effectiveness of the proposed estimator across varied settings, even when not uniformly outperforming all competitors.

In terms of interval estimation, the proposed estimator achieves coverage probabilities close to the nominal 95% level, suggesting strong reliability of the proposed method. While the IPW estimator offers good point estimation, it struggles with interval estimation accuracy, failing to maintain 95% coverage in Settings IV and V. Similarly, the DR estimator performs well for point estimation in Settings I–IV but falters in Setting V. The proposed estimator’s interval estimation remains stable, often surpassing IPW and DR in coverage accuracy, particularly in settings with extreme true propensity scores. Based on these simulation results, the DR, IPW, and proposed estimators are generally recommended; however, the proposed estimator is especially favorable for both point and interval estimation in complex scenarios such as Settings IV and V.

Table 1: Simulation Results for ATE in Setting I
Sample Size Method Bias SD RMSE AW CP
Covariates -0.0042 0.1963 0.1963 0.901 0.978
True PS -0.0005 0.2839 0.2839 1.210 0.960
N=200 Estimated PS 0.0024 0.2315 0.2315 1.200 0.993
Proposed -0.0040 0.1986 0.1986 0.815 0.967
IPW -0.0087 0.1816 0.1818 0.943 0.913
Doubly Robust -0.0084 0.1832 0.1834 0.888 0.958
Covariates 0.0061 0.1230 0.1232 0.572 0.990
True PS -0.0023 0.1749 0.1749 0.760 0.975
N=500 Estimated PS -0.0005 0.1434 0.1434 0.759 0.994
Proposed 0.0044 0.1151 0.1152 0.474 0.949
IPW 0.0011 0.1143 0.1143 0.591 0.833
Doubly Robust 0.0010 0.1129 0.1129 0.514 0.961
Covariates 0.0037 0.0835 0.0836 0.407 0.980
True PS 0.0034 0.1194 0.1194 0.536 0.968
N=1000 Estimated PS 0.0010 0.0965 0.0965 0.535 0.994
Proposed 0.0038 0.0769 0.0770 0.323 0.951
IPW 0.0003 0.0768 0.0768 0.418 0.686
Doubly Robust 0.0012 0.0769 0.0769 0.354 0.949
Table 2: Simulation Results for ATE in Setting II
Sample size Method Bias SD RMSE AW CP
Covariates -0.0103 0.4438 0.4439 2.770 0.997
True PS -0.0102 0.5329 0.5329 2.060 0.937
N=200 Estimated PS -0.0118 0.5189 0.5190 2.160 0.957
Proposed -0.0068 0.4711 0.4712 1.860 0.956
IPW -0.0073 0.4579 0.4580 1.715 0.938
Doubly Robust -0.0100 0.4609 0.4610 1.850 0.942
Covariates -0.0028 0.2854 0.2854 1.760 0.998
True PS -0.0018 0.3340 0.3340 1.300 0.943
N=500 Estimated PS -0.0166 0.3320 0.3324 1.340 0.963
Proposed -0.0040 0.2958 0.2959 1.140 0.946
IPW -0.0028 0.2950 0.2950 1.082 0.932
Doubly Robust -0.0018 0.2948 0.2948 1.147 0.939
Covariates 0.0032 0.1891 0.1891 1.250 0.996
True PS 0.0040 0.2245 0.2245 0.911 0.949
N=1000 Estimated PS 0.0104 0.2232 0.2235 0.931 0.959
Proposed 0.0053 0.1954 0.1954 0.802 0.957
IPW 0.0048 0.1930 0.1931 0.764 0.947
Doubly Robust 0.0044 0.1943 0.1944 0.806 0.953
Table 3: Simulation Results for ATE in Setting III
Sample size Method Bias SD RMSE AW CP
Covariates -0.0092 0.1722 0.1724 0.661 0.945
True PS -0.0103 0.1845 0.1848 0.718 0.947
N=200 Estimated PS -0.0089 0.1871 0.1873 0.724 0.950
Proposed -0.0106 0.1726 0.1729 0.686 0.959
IPW -0.0115 0.1601 0.1605 0.575 0.962
Doubly Robust -0.0105 0.1631 0.1635 0.731 0.955
Covariates 0.0028 0.1088 0.1089 0.422 0.947
True PS 0.0017 0.1166 0.1166 0.455 0.950
N=500 Estimated PS 0.0019 0.1147 0.1147 0.457 0.949
Proposed 0.0011 0.1026 0.1026 0.414 0.953
IPW 0.0014 0.0983 0.0983 0.360 0.953
Doubly Robust 0.0011 0.0987 0.0987 0.430 0.956
Covariates 0.0032 0.0730 0.0731 0.299 0.956
True PS 0.0004 0.0781 0.0781 0.320 0.949
N=1000 Estimated PS 0.0011 0.0770 0.0770 0.322 0.952
Proposed 0.0016 0.0680 0.0680 0.286 0.950
IPW 0.0010 0.0662 0.0662 0.254 0.957
Doubly Robust 0.0010 0.0669 0.0669 0.295 0.942
Table 4: Simulation Results for ATE in Setting IV
Sample size Method Bias SD RMSE AW CP
Covariates 0.0328 0.1691 0.1722 0.653 0.940
True PS -0.0082 0.1830 0.1832 0.707 0.949
N=200 Estimated PS -0.0066 0.1840 0.1842 0.714 0.946
Proposed -0.0084 0.1708 0.1710 0.674 0.953
IPW -0.0101 0.1589 0.1593 0.572 0.885
Doubly Robust -0.0098 0.1655 0.1658 0.775 0.958
Covariates 0.0326 0.1067 0.1116 0.415 0.933
True PS 0.0024 0.1141 0.1141 0.448 0.953
N=500 Estimated PS 0.0029 0.1126 0.1127 0.450 0.950
Proposed 0.0024 0.1005 0.1006 0.406 0.944
IPW 0.0018 0.0967 0.0967 0.359 0.769
Doubly Robust 0.0014 0.0994 0.0994 0.449 0.960
Covariates 0.0262 0.0718 0.0765 0.295 0.941
True PS 0.0008 0.0767 0.0767 0.315 0.948
N=1000 Estimated PS 0.0014 0.0757 0.0757 0.316 0.949
Proposed 0.0022 0.0671 0.0671 0.281 0.962
IPW 0.0012 0.0652 0.0652 0.253 0.560
Doubly Robust 0.0013 0.0678 0.0678 0.304 0.945
Table 5: Simulation Results for ATE in Setting V
Sample size Method Bias SD RMSE AW CP
Covariates -0.0009 0.3426 0.3426 1.490 0.980
True PS -0.0072 0.6568 0.6568 2.380 0.965
N=200 Estimated PS -0.0178 0.6552 0.6555 2.400 0.965
Proposed -0.0059 0.5001 0.5001 1.530 0.959
IPW 0.0218 0.4854 0.4859 0.850 0.899
Doubly Robust 0.2714 2.4132 2.4284 3.837 0.756
Covariates -0.0113 0.2713 0.2715 1.190 0.975
True PS -0.0410 0.5193 0.5209 2.050 0.971
N=500 Estimated PS -0.0469 0.5095 0.5116 2.060 0.973
Proposed 0.0007 0.4088 0.4088 1.240 0.956
IPW 0.0101 0.3903 0.3904 0.526 0.733
Doubly Robust 0.2610 3.7630 3.7720 2.268 0.676
Covariates -0.0196 0.2296 0.2305 1.010 0.979
True PS 0.0098 0.4852 0.4853 1.810 0.971
N=1000 Estimated PS 0.0069 0.4826 0.4827 1.810 0.969
Proposed 0.0042 0.3545 0.3545 1.010 0.953
IPW 0.0119 0.3653 0.3655 0.367 0.432
Doubly Robust 0.0403 3.1647 3.1650 2.213 0.697
Table 6: Simulation Results for ATT in Setting I
Sample size Method Bias SD RMSE AW CP
Covariates -0.0185 0.2352 0.2360 1.010 0.986
True PS 0.0002 0.3396 0.3396 1.380 0.971
N=200 Estimated PS 0.0056 0.2836 0.2836 1.370 0.992
Proposed -0.0049 0.2321 0.2321 1.010 0.996
IPW -0.0079 0.2100 0.2101 0.939 0.858
Doubly Robust -0.0092 0.2045 0.2047 1.101 0.973
Covariates 0.0032 0.1471 0.1471 0.642 0.984
True PS -0.0017 0.2086 0.2086 0.869 0.977
N=500 Estimated PS 0.0008 0.1789 0.1789 0.867 0.994
Proposed 0.0043 0.1357 0.1357 0.570 0.995
IPW 0.0013 0.1292 0.1292 0.592 0.671
Doubly Robust 0.0009 0.1259 0.1259 0.627 0.985
Covariates 0.0015 0.1006 0.1007 0.457 0.992
True PS 0.0062 0.1398 0.1399 0.612 0.977
N=1000 Estimated PS 0.0013 0.1247 0.1247 0.610 0.994
Proposed 0.0023 0.0903 0.0903 0.385 0.990
IPW 0.0010 0.0873 0.0873 0.420 0.355
Doubly Robust 0.0006 0.0852 0.0852 0.426 0.985
Table 7: Simulation Results for ATT in Setting II
Sample size Method Bias SD RMSE AW CP
Covariates -0.0145 0.6170 0.6172 3.150 1.000
True PS -0.0167 0.6115 0.6118 2.360 0.926
N=200 Estimated PS -0.0278 0.7186 0.7192 2.470 0.997
Proposed -0.0114 0.6413 0.6414 2.620 1.000
IPW -0.0145 0.6285 0.6287 1.724 0.939
Doubly Robust -0.0125 0.6133 0.6135 2.471 1.000
Covariates -0.0037 0.3813 0.3813 2.010 1.000
True PS -0.0139 0.3779 0.3781 1.490 0.942
N=500 Estimated PS -0.0167 0.4513 0.4516 1.540 0.997
Proposed -0.0100 0.3857 0.3858 1.570 1.000
IPW -0.0046 0.3853 0.3853 1.084 0.942
Doubly Robust -0.0052 0.3753 0.3753 1.544 1.000
Covariates 0.0043 0.2652 0.2652 1.420 1.000
True PS -0.0054 0.2579 0.2580 1.040 0.945
N=1000 Estimated PS 0.0082 0.3081 0.3082 1.060 0.996
Proposed 0.0039 0.2614 0.2614 1.090 1.000
IPW 0.0048 0.2658 0.2659 0.765 0.960
Doubly Robust 0.0029 0.2601 0.2601 1.089 1.000
Table 8: Simulation Results for ATT in Setting III
Sample size Method Bias SD RMSE AW CP
Covariates -0.0060 0.1951 0.1952 0.747 0.939
True PS -0.0041 0.2183 0.2183 0.840 0.948
N=200 Estimated PS -0.0045 0.2194 0.2194 0.853 0.957
Proposed -0.0096 0.1943 0.1945 0.828 0.978
IPW -0.0135 0.1714 0.1719 0.582 0.964
Doubly Robust -0.0142 0.1718 0.1724 0.640 0.955
Covariates 0.0033 0.1196 0.1197 0.478 0.945
True PS 0.0027 0.1351 0.1352 0.534 0.943
N=500 Estimated PS 0.0028 0.1366 0.1366 0.537 0.951
Proposed 0.0031 0.1122 0.1122 0.472 0.974
IPW 0.0019 0.1044 0.1045 0.362 0.955
Doubly Robust 0.0018 0.1049 0.1049 0.394 0.951
Covariates 0.0038 0.0853 0.0853 0.340 0.959
True PS 0.0004 0.0939 0.0939 0.376 0.957
N=1000 Estimated PS 0.0017 0.0910 0.0910 0.378 0.957
Proposed 0.0020 0.0755 0.0755 0.322 0.958
IPW 0.0009 0.0734 0.0734 0.255 0.957
Doubly Robust 0.0010 0.0734 0.0734 0.277 0.961
Table 9: Simulation Results for ATT in Setting IV
Sample size Method Bias SD RMSE AW CP
Covariates 0.0360 0.1922 0.1955 0.737 0.946
True PS -0.0019 0.2177 0.2177 0.826 0.940
N=200 Estimated PS -0.0021 0.2154 0.2154 0.841 0.959
Proposed -0.0075 0.1913 0.1915 0.801 0.978
IPW -0.0120 0.1709 0.1713 0.579 0.886
Doubly Robust -0.0140 0.1717 0.1722 0.635 0.945
Covariates 0.0334 0.1185 0.1231 0.471 0.944
True PS 0.0035 0.1326 0.1326 0.525 0.947
N=500 Estimated PS 0.0036 0.1339 0.1340 0.529 0.953
Proposed 0.0043 0.1099 0.1100 0.470 0.970
IPW 0.0024 0.1030 0.1030 0.360 0.774
Doubly Robust 0.0022 0.1035 0.1035 0.391 0.916
Covariates 0.0270 0.0839 0.0881 0.335 0.936
True PS 0.0006 0.0927 0.0927 0.370 0.950
N=1000 Estimated PS 0.0019 0.0899 0.0899 0.372 0.954
Proposed 0.0025 0.0747 0.0748 0.314 0.967
IPW 0.0011 0.0728 0.0728 0.254 0.561
Doubly Robust 0.0008 0.0723 0.0723 0.275 0.865
Table 10: Simulation Results for ATT in Setting V
Sample size Method Bias SD RMSE AW CP
Covariates -0.0144 0.3964 0.3966 1.700 0.969
True PS -0.0166 0.7851 0.7853 2.700 0.945
N=200 Estimated PS -0.0278 0.7820 0.7825 2.720 0.943
Proposed -0.0123 0.5971 0.5973 1.810 0.942
IPW -0.0005 0.5759 0.5759 0.861 0.744
Doubly Robust -0.1718 1.7912 1.7994 1.075 0.944
Covariates -0.0274 0.3227 0.3238 1.400 0.971
True PS -0.0558 0.6297 0.6322 2.390 0.964
N=500 Estimated PS -0.0635 0.6184 0.6216 2.400 0.962
Proposed -0.0014 0.4961 0.4961 1.480 0.931
IPW -0.0145 0.4563 0.4566 0.528 0.407
Doubly Robust -0.0886 3.1315 3.1328 0.663 0.936
Covariates -0.0338 0.2735 0.2756 1.200 0.974
True PS 0.0052 0.5912 0.5912 2.130 0.965
N=1000 Estimated PS 0.0015 0.5881 0.5882 2.140 0.968
Proposed 0.0027 0.4321 0.4321 1.210 0.920
IPW -0.0049 0.4259 0.4260 0.369 0.122
Doubly Robust -0.0964 1.7687 1.7713 0.466 0.931

5 Practical Data Analysis

In this section, we apply the estimation procedures to analyze data from the National Supported Work (NSW) Demonstration, a program funded by both federal and private sources in the mid-1970s. This initiative provided 6–18 months of work experience for individuals facing economic and social challenges before enrollment. Our analysis demonstrates the impact of the NSW Demonstration labor training program on the earnings of these individuals. To evaluate the effectiveness of the proposed estimator, we estimate the average treatment effect on the treated (ATT) using two control groups. We use two datasets: one drawn from the experimental sample in LaLonde (1986) and the other from Westat’s Matched Current Population Survey–Social Security Administration File (CPS-3) (Dehejia and Wahba, 1999). The data can be obtained from https://users.nber.org/~rdehejia/nswdata2.html.

In this dataset, the outcome variable (Re78) represents each individual’s real earnings in 1978, and the treatment variable (Treat) indicates whether the individual enrolled in the labor training program. The ten potential confounding variables include age (Age), years of schooling (Educ), race indicators for Black (Black) and Hispanic (Hispanic), marital status (Married), lack of a high school diploma (Nodegr), real earnings in 1974 (Re74) and 1975 (Re75), and indicators for zero earnings in 1974 (U74) and 1975 (U75). This subset comprises 185 individuals in the treated group, 260 individuals in the control group reported by LaLonde (1986), and 429 individuals from the CPS-3 control group. Table 11 presents the summary statistics for the NSW dataset and the p-values of the t-tests for comparing the means between the treated group and the two control groups. Table 11 shows that the treated and experimental control groups are well-balanced in terms of mean values for most variables, with the exception of the variable Nodegr. For the nine remaining covariates, we cannot reject the null hypothesis that the means are equal, indicating similar averages across the treated and control groups. However, when comparing treated units from the experimental data with control units from the non-experimental CPS-3 data, significant mean differences appear in eight out of the ten covariates. The only covariates for which the null hypothesis of equal means cannot be rejected are Hispanic and Educ.

Table 11: Summary statistics of the variables in the NSW dataset.
Variable | Experimental Treated (185), Mean (SD) | Experimental Control (260), Mean (SD) | Nonexperimental CPS-3 Control (429), Mean (SD) | p-value of two-sample t-test: Treated vs. Experimental Control | p-value: Treated vs. CPS-3 Control
Age 25.82 (7.16) 25.1 (7.10) 28.03 (10.79) 0.27 0.00
Educ 10.35 (2.01) 10.1 (1.60) 10.24 (2.86) 0.15 0.58
Black 0.84 (0.36) 0.83 (0.38) 0.20 (0.40) 0.65 0.00
Hispanic 0.06 (0.24) 0.11 (0.31) 0.14 (0.35) 0.06 0.13
Married 0.19 (0.39) 0.15 (0.36) 0.51 (0.50) 0.33 0.00
Nodegr 0.71 (0.46) 0.83 (0.37) 0.60 (0.49) 0.00 0.01
Re74 2.10 (4.89) 2.11 (5.69) 5.62 (6.79) 0.98 0.00
Re75 1.53 (3.22) 1.27 (3.10) 2.47 (3.29) 0.39 0.00
U74 0.71 (0.46) 0.75 (0.43) 0.26 (0.44) 0.33 0.00
U75 0.60 (0.49) 0.68 (0.47) 0.31 (0.46) 0.07 0.00

Earnings are expressed in thousands of 1978 dollars.

Table 12: Estimates of ATT, and their standard errors and corresponding 95% confidence intervals based on different methods for the NSW dataset.
Method | Experimental Group: Estimate (SE), 95% C.I. | CPS-3: Estimate (SE), 95% C.I.
Proposed 1807.7 (677.5) (479.8, 3135.6) 952.6 (731.4) (-480.9, 2386.1)
Covariates 1686.1 (866.4) (-12.0, 3384.2) -243.5 (1319.3) (-2829.3, 2342.3)
Estimated PS 2138.6 (797.8) (575.0, 3702.3) -944.9 (1383.7) (-3656.9, 1767.1)
IPW 1754.6 (695.5) (391.4, 3117.8) 1003.4 (698.6) (-365.8, 2372.6)
DR 1718.9 (751.1) (246.8, 3191.0) 836.0 (965.3) (-1056.0, 2728.0)

In Table 12, we present the estimates of ATT using the proposed kernel matching estimator (“Proposed”), the matching estimator based on covariates (“Covariates”), the matching estimator based on estimated propensity scores with M = 1 (“Estimated PS”), the inverse probability weighting (IPW) estimator, and the doubly robust (DR) estimator, together with their standard errors and corresponding 95% bootstrap confidence intervals for both control groups, each obtained from 400 bootstrap samples. For the experimental group, the ATT estimate based on the proposed kernel matching estimator is US$1807.7, the closest to the benchmark value of US$1794 among the estimation methods considered here. Moreover, the ATT estimate based on the proposed kernel matching estimator has the smallest standard error among the estimation methods considered here. The proposed kernel matching estimator, the IPW estimator, and the DR estimator give ATT estimates and confidence intervals within a similar range. For the non-experimental control group (CPS-3), the matching estimators based on covariates and estimated propensity scores produce negative estimates with large standard errors, which do not agree with the results obtained by Dehejia and Wahba (1999). In contrast, the proposed kernel matching estimator, along with the IPW and DR methods, yields ATT estimates within a similar range (836.0–1003.4). These findings suggest that the labor training program significantly increased individual earnings in 1978 and are consistent with the conclusions drawn from the experimental study conducted by Dehejia and Wahba (1999).

6 Concluding Remarks

In this paper, we introduced a novel kernel matching estimator for estimating treatment effects, focusing on addressing challenges in observational data analysis. Through rigorous theoretical analysis, we established the consistency and asymptotic normality of the proposed estimator and demonstrated its robustness under various simulation settings. The results from our Monte Carlo simulations revealed that the proposed estimator performs comparably to or better than established methods such as covariate-based matching, propensity score-based matching, inverse probability weighting (IPW), and doubly robust (DR) estimators, particularly in high-dimensional and complex scenarios.

The proposed kernel matching estimator stands out for its reliable interval estimation, consistently achieving coverage probabilities near the nominal level of 95%, even in settings where IPW and DR methods falter. This reliability, combined with competitive bias and RMSE performance, highlights the potential of the proposed estimator for both point and interval estimation in a wide range of applications. Our findings suggest that this method is a robust and flexible tool for causal inference, particularly in situations involving high-dimensional data or complex treatment assignment mechanisms. Future work may explore extensions to dynamic treatments and further refinements in bandwidth selection for enhanced performance.

For the proposed kernel matching estimator, the selection of bandwidth in kernel matching remains a critical challenge, as it requires a delicate balance between bias and variance. Achieving the theoretically optimal bandwidth in practical applications is often difficult, highlighting a key limitation of this approach. Furthermore, kernel matching involves calculating weighted distances for each sample based on the kernel function, which can significantly increase computational complexity, especially when dealing with large datasets. Future research could explore more efficient methods for bandwidth selection and computational optimization to enhance the scalability and practicality of kernel matching in real-world applications.

Appendix

Preliminary:

The influence function (Hampel, 1974) is crucial in analyzing the robustness of statistical estimators. Let T be a real-valued functional defined on a subset of the set of all probability measures on \mathbb{R}, and let F denote a probability measure on \mathbb{R} for which T is defined. Then, the influence function for the functional T(\cdot) is defined as

\Omega(y) = \left. \frac{d}{d\epsilon} T\left[(1 - \epsilon)F + \epsilon\,\delta_y(x)\right] \right|_{\epsilon = 0},

where 0 < \epsilon < 1 and

\delta_y(x) = \begin{cases} 0 & x < y, \\ 1 & x \geqslant y. \end{cases}
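As a one-line illustration of this definition (an added example, not part of the original derivation), consider the mean functional T(F) = \int y \, dF(y). Then

T\left[(1 - \epsilon)F + \epsilon\,\delta_{y_0}\right] = (1 - \epsilon) E_F(Y) + \epsilon y_0, \qquad \Omega(y_0) = y_0 - E_F(Y),

which has exactly the form of the influence function \Omega_1(w, y) = wy - E(WY) obtained below for T_1(F) = E(WY).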
Lemma 1 (Huber, 1981; Bickel et al., 1993).

Let X_1, \ldots, X_n be independent and identically distributed random variables from a distribution F, and let T(F) be a functional of F. Suppose T(F) has a Gateaux derivative, denoted by \Omega(x), and that E(\Omega^2(X_i)) < \infty. Then we have

\sqrt{n}\left(T(F_n) - T(F)\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Omega(X_i) + o_p(1).

Hence, \sqrt{n}\left(T(F_n) - T(F)\right) is asymptotically normal with mean 0 and variance E(\Omega^2(X_i)).

Proof of Theorem 1 (i):

We express \hat{\tau} as

\begin{split} \hat{\tau} &= \frac{1}{N} \sum_{i=1}^{N} W_i Y_i - \frac{1}{N} \sum_{i=1}^{N} W_i \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} (1 - W_j) K_h(X_j - X_i)} \\ &\quad + \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) \frac{\sum_{j=1}^{N} W_j Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} W_j K_h(X_j - X_i)} - \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) Y_i \\ &= T_1(F_n) - T_2(F_n) + T_3(F_n) - T_4(F_n), \end{split}

where

\begin{split} T_1(F_n) &= \frac{1}{N} \sum_{i=1}^{N} W_i Y_i = \int yw \, dF_n(y, w), \\ T_2(F_n) &= \frac{1}{N} \sum_{i=1}^{N} W_i \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} (1 - W_j) K_h(X_j - X_i)} = \int z \frac{\int (1 - w) y K_h(x - u) \, dF_n(y, x, w)}{\int (1 - w) K_h(x - u) \, dF_n(y, x, w)} \, dF_n(u, z), \\ T_3(F_n) &= \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) \frac{\sum_{j=1}^{N} W_j Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} W_j K_h(X_j - X_i)} = \int (1 - z) \frac{\int w y K_h(x - u) \, dF_n(y, x, w)}{\int w K_h(x - u) \, dF_n(y, x, w)} \, dF_n(u, z), \\ T_4(F_n) &= \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) Y_i = \int (1 - w) y \, dF_n(y, w). \end{split}

Let T(F) = T_1(F) - T_2(F) + T_3(F) - T_4(F), where the terms T_i(F) for i = 1, 2, 3, 4 are given by

\begin{split} T_1(F) &= E(WY), \\ T_2(F) &= E\left[Z \frac{E\left[Y(1 - W) K\left(\frac{X - U}{h}\right)\right]}{E\left[(1 - W) K\left(\frac{X - U}{h}\right)\right]}\right], \\ T_3(F) &= E\left[(1 - Z) \frac{E\left[W Y K\left(\frac{X - U}{h}\right)\right]}{E\left[W K\left(\frac{X - U}{h}\right)\right]}\right], \\ T_4(F) &= E[(1 - W)Y]. \end{split}

Here, (S, U, Z) has the same distribution as (Y, X, W), and (Y, X, W) \sim F_{Y,X,W}.
Next, we provide the formula for each term of T(F). Letting g_w(x) = E(Y(w) \mid X = x), we can obtain

\begin{split} \frac{E\left[Y(1 - W) K\left(\frac{X - u}{h}\right)\right]}{E\left[(1 - W) K\left(\frac{X - u}{h}\right)\right]} &= \frac{\int g_0(x) K\left(\frac{x - u}{h}\right) f(x \mid W = 0) \, dx}{\int K\left(\frac{x - u}{h}\right) f(x \mid W = 0) \, dx} \\ &= \frac{\int g_0(th + u) K(t) f(th + u \mid W = 0) \, dt}{\int K(t) f(th + u \mid W = 0) \, dt} \\ &= \frac{\int K(t) \left[g_0(u) + g_0'(u)th + \frac{g_0''(u)t^2h^2}{2} + \frac{g_0'''(\xi_1)t^3h^3}{3!}\right] \left[f(u \mid W = 0) + f'(u \mid W = 0)th + \frac{f''(u \mid W = 0)t^2h^2}{2} + \frac{f'''(\xi_2 \mid W = 0)t^3h^3}{3!}\right] dt}{\int K(t) \left[f(u \mid W = 0) + f'(u \mid W = 0)th + \frac{f''(u \mid W = 0)t^2h^2}{2} + \frac{f'''(\xi_2 \mid W = 0)t^3h^3}{3!}\right] dt} \\ &= g_0(u) + \frac{\int t^2 K(t) \, dt}{2} \left(\frac{2 g_0'(u) f'(u \mid W = 0)}{f(u \mid W = 0)} + g_0''(u)\right) h^2 + o(h^2). \end{split}

Then,

T_{2}(F) = E\left[Z\frac{E\left[Y(1-W)K\left(\frac{X-U}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-U}{h}\right)\right]}\right]
= E\left\{Z\left[g_{0}(U)+\frac{\int t^{2}K(t)\,dt}{2}\left(\frac{2g_{0}^{\prime}(U)f^{\prime}(U|W=0)}{f(U|W=0)}+g_{0}^{\prime\prime}(U)\right)h^{2}\right]\right\}+o(h^{2})
= E[Y(0)|Z=1]\Pr(Z=1)+E[b_{0}(U)|Z=1]\Pr(Z=1)h^{2}+o(h^{2}),

where b_{0}(U)=\frac{\int t^{2}K(t)\,dt}{2}\left[\frac{2g_{0}^{\prime}(U)f^{\prime}(U|W=0)}{f(U|W=0)}+g_{0}^{\prime\prime}(U)\right]. Similarly, it follows that

T_{3}(F) = E\left\{(1-Z)\left[g_{1}(U)+\frac{\int t^{2}K(t)\,dt}{2}\left(\frac{2g_{1}^{\prime}(U)f^{\prime}(U|W=1)}{f(U|W=1)}+g_{1}^{\prime\prime}(U)\right)h^{2}\right]\right\}+o(h^{2})
= E[Y(1)|Z=0]\Pr(Z=0)+E[b_{1}(U)|Z=0]\Pr(Z=0)h^{2}+o(h^{2}),

where b_{1}(U)=\frac{\int t^{2}K(t)\,dt}{2}\left[\frac{2g_{1}^{\prime}(U)f^{\prime}(U|W=1)}{f(U|W=1)}+g_{1}^{\prime\prime}(U)\right].
Thus,

T(F)=T1(F)T2(F)+T3(F)T4(F)=τ+Bh2+o(h2),T(F)=T_{1}(F)-T_{2}(F)+T_{3}(F)-T_{4}(F)=\tau+Bh^{2}+o(h^{2}),

where B=E[b_{0}(U)|Z=1]\Pr(Z=1)+E[b_{1}(U)|Z=0]\Pr(Z=0).
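The h^{2} order of the bias term Bh^{2} can be checked by simulation: halving h should roughly quarter the bias whenever the curvature terms b_{0} and b_{1} are nonzero. The following rough sketch is our construction, reusing kernel_matching_ate from the snippet above and choosing quadratic outcome curves so that g_{w}^{\prime\prime}\neq 0.

import numpy as np

rng = np.random.default_rng(2)

def mc_bias(h, reps=200, N=4000):
    errs = []
    for _ in range(reps):
        x = rng.normal(size=N)
        w = rng.binomial(1, 1 / (1 + np.exp(-x)))
        # g_1(x) = 1 + x + x^2 and g_0(x) = x^2, so the true ATE is 1
        y = w * (1 + x + x**2) + (1 - w) * x**2 + rng.normal(scale=0.1, size=N)
        errs.append(kernel_matching_ate(y, w, x, h) - 1.0)
    return np.mean(errs)

for h in (0.8, 0.4, 0.2):
    print(h, mc_bias(h))   # the bias should shrink roughly fourfold as h halves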

To complete the proof, we derive the influence function of T(F) by the methods presented in Hampel, (1974), Hettmansperger, (1984), and Hines et al., (2022). The influence functions of T_{i}(F), for i=1,2,3,4, are, respectively,

Ω1(w,y)\displaystyle\Omega_{1}(w,y) =\displaystyle= wyE(WY),\displaystyle wy-E(WY),
Ω2(y,x,w)\displaystyle\Omega_{2}(y,x,w) =\displaystyle= wE[Y(1W)K(Xxh)]E[(1W)K(Xxh)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle\,w\frac{E[Y(1-W)K\left(\frac{X-x}{h}\right)]}{E[(1-W)K\left(\frac{X-x}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]
+(1w)yE[ZK(xUh)E[(1W)K(XUh)]]\displaystyle+(1-w)yE\left[Z\frac{K\left(\frac{x-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]
(1w)E[ZE[Y(1W)K(XUh)K(xUh)](E[(1W)K(XUh)])2],\displaystyle-(1-w)E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)K\left(\frac{x-U}{h}\right)]}{\left(E[(1-W)K\left(\frac{X-U}{h}\right)]\right)^{2}}\right],
Ω3(x,y,w)\displaystyle\Omega_{3}(x,y,w) =\displaystyle= (1w)E[WYK(Xxh)]E[WK(Xxh)]E[(1Z)E[WYK(XUh)]E[WK(XUh)]]\displaystyle\,(1-w)\frac{E[WYK\left(\frac{X-x}{h}\right)]}{E[WK\left(\frac{X-x}{h}\right)]}-E\left[(1-Z)\frac{E[WYK\left(\frac{X-U}{h}\right)]}{E[WK\left(\frac{X-U}{h}\right)]}\right]
+wyE[(1Z)K(xUh)E[WK(XUh)]]\displaystyle+wyE\left[(1-Z)\frac{K\left(\frac{x-U}{h}\right)}{E[WK\left(\frac{X-U}{h}\right)]}\right]
wE[(1Z)E[YWK(XUh)K(xUh)](E[WK(XUh)])2],\displaystyle-wE\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)K\left(\frac{x-U}{h}\right)]}{\left(E[WK\left(\frac{X-U}{h}\right)]\right)^{2}}\right],
Ω4(w,y)\displaystyle\Omega_{4}(w,y) =\displaystyle= (1w)yE[(1W)Y].\displaystyle(1-w)y-E[(1-W)Y].

Therefore, by Lemma 1 and Markov’s inequality, as h\to 0 we can obtain

\hat{\tau}-\tau = T\left(F_{n}\right)-T\left(F\right)+Bh^{2}+o(h^{2})
= \frac{1}{N}\sum_{i=1}^{N}\Omega\left(X_{i},W_{i},Y_{i}\right)+Bh^{2}+o(h^{2})+o_{p}(N^{-1/2})\xrightarrow{p}0,

where Ω(Xi,Wi,Yi)=Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)+Ω3(Xi,Wi,Yi)Ω4(Wi,Yi)\Omega\left(X_{i},W_{i},Y_{i}\right)=\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+\Omega_{3}\left(X_{i},W_{i},Y_{i}\right)-\Omega_{4}\left(W_{i},Y_{i}\right).

Proof of Theorem 1 (ii):

Based on the kernel matching approach for estimating the treatment effect, we express the estimator for the average treatment effect on the treated (ATT) as follows:

τ^t=\displaystyle\hat{\tau}_{t}= 1N1i=1NWi(Yij=1N(1Wj)YjK(XjXih)j=1N(1Wj)K(XjXih))\displaystyle\frac{1}{N_{1}}\sum_{i=1}^{N}W_{i}\left(Y_{i}-\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{X_{j}-X_{i}}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{X_{j}-X_{i}}{h}\right)}\right)
=\displaystyle= NN11Ni=1NWi(Yij=1N(1Wj)YjK(XjXih)j=1N(1Wj)K(XjXih)).\displaystyle\frac{N}{N_{1}}\frac{1}{N}\sum_{i=1}^{N}W_{i}\left(Y_{i}-\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{X_{j}-X_{i}}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{X_{j}-X_{i}}{h}\right)}\right).
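A minimal implementation of \hat{\tau}_{t} mirrors the ATE sketch above, except that only treated units are averaged and only the control-arm imputation \hat{g}_{0} is needed; as before, the names and the Gaussian kernel are our assumptions.

import numpy as np

def kernel_matching_att(y, w, x, h):
    """ATT: average over treated units of Y_i minus its imputed Y_i(0)."""
    y, w, x = map(np.asarray, (y, w, x))
    kw = np.exp(-0.5 * ((x[None, :] - x[:, None]) / h) ** 2)   # K_h(X_j - X_i)
    m0 = (kw * ((1 - w) * y)).sum(1) / (kw * (1 - w)).sum(1)   # imputed g_0(X_i)
    return np.sum(w * (y - m0)) / w.sum()                      # divide by N_1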

Let Bt=𝔼[b0(U)|Z=1]Pr(Z=1)B_{t}=\mathbb{E}[b_{0}(U)|Z=1]\Pr(Z=1). From Theorem 1 (i), we know that

\hat{\tau}_{t}-\tau_{t} = \frac{N}{N_{1}}\left\{\left[T_{1}\left(F_{n}\right)-T_{1}(F)\right]-\left[T_{2}\left(F_{n}\right)-T_{2}(F)\right]+B_{t}h^{2}\right\}+o(h^{2})
= \frac{N}{N_{1}}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]+B_{t}h^{2}\right\}+o(h^{2}).

By Lemma 1 and Markov’s inequality, as h\to 0 and N/N_{1}\xrightarrow{p}1/E[p(X)], it follows that

τ^tτt𝑝0.\hat{\tau}_{t}-\tau_{t}\xrightarrow{p}0.

Proof of Theorem 2 (i):

According to Lemma 1 and Theorem 1, we have

N(τ^Bh2τ)=1Ni=1NΩ(Xi,Wi,Yi)+op(1).\sqrt{N}(\hat{\tau}-Bh^{2}-\tau)=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\Omega\left(X_{i},W_{i},Y_{i}\right)+o_{p}(1).

Since the terms \Omega\left(X_{i},W_{i},Y_{i}\right) are independent and identically distributed across i, applying the Central Limit Theorem, we obtain

\sqrt{N}(\hat{\tau}-Bh^{2}-\tau)\xrightarrow{d}\mathcal{N}(0,\sigma^{2}),

where σ2=E(Ω2(Xi,Wi,Yi))\sigma^{2}=E\left(\Omega^{2}\left(X_{i},W_{i},Y_{i}\right)\right), which can be expressed as

σ2=σ12+σ22+σ32+σ422c1+2c22c32c4+2c52c6,\sigma^{2}=\sigma^{2}_{1}+\sigma^{2}_{2}+\sigma^{2}_{3}+\sigma^{2}_{4}-2c_{1}+2c_{2}-2c_{3}-2c_{4}+2c_{5}-2c_{6},

with

σ12\displaystyle\sigma^{2}_{1} =E[Ω12(Wi,Yi)],\displaystyle=E[\Omega_{1}^{2}(W_{i},Y_{i})], σ22\displaystyle\sigma^{2}_{2} =E[Ω22(Xi,Wi,Yi)],\displaystyle=E[\Omega_{2}^{2}(X_{i},W_{i},Y_{i})],
σ32\displaystyle\sigma^{2}_{3} =E[Ω32(Xi,Wi,Yi)],\displaystyle=E[\Omega_{3}^{2}(X_{i},W_{i},Y_{i})], σ42\displaystyle\sigma^{2}_{4} =E[Ω42(Wi,Yi)],\displaystyle=E[\Omega_{4}^{2}(W_{i},Y_{i})],
c1\displaystyle c_{1} =E[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{2}(X_{i},W_{i},Y_{i})], c2\displaystyle c_{2} =E[Ω1(Wi,Yi)Ω3(Xi,Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{3}(X_{i},W_{i},Y_{i})],
c3\displaystyle c_{3} =E[Ω1(Wi,Yi)Ω4(Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})], c4\displaystyle c_{4} =E[Ω2(Xi,Wi,Yi)Ω3(Xi,Wi,Yi)],\displaystyle=E[\Omega_{2}(X_{i},W_{i},Y_{i})\Omega_{3}(X_{i},W_{i},Y_{i})],
c5\displaystyle c_{5} =E[Ω2(Xi,Wi,Yi)Ω4(Wi,Yi)],\displaystyle=E[\Omega_{2}(X_{i},W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})], c6\displaystyle c_{6} =E[Ω3(Xi,Wi,Yi)Ω4(Wi,Yi)].\displaystyle=E[\Omega_{3}(X_{i},W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})].

The terms in σ2\sigma^{2} are provided as follows. First, the terms σi2\sigma^{2}_{i}, for i=1,2,3,4i=1,2,3,4, can be expressed as

σ12\displaystyle\sigma_{1}^{2} =\displaystyle= E(Y2(1)|W=1)Pr(W=1)E2(Y(1)|W=1)[Pr(W=1)]2,\displaystyle E(Y^{2}(1)|W=1)\Pr(W=1)-E^{2}(Y(1)|W=1)[{\Pr}(W=1)]^{2},
σ22\displaystyle\sigma_{2}^{2} =\displaystyle= E{WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle E\left\{W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.
+Yi(1Wi)E[ZK(XiUh)E[(1W)K(XUh)]]\displaystyle\left.+Y_{i}\left(1-W_{i}\right)E\left[Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.
(1Wi)E[ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2]}2,\displaystyle\left.-\left(1-W_{i}\right)E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right]\right\}^{2},
σ32\displaystyle\sigma_{3}^{2} =\displaystyle= E{(1Wi)E[WYK(XXih)]E[WK(XXih)]E[(1Z)E[WYK(XUh)]E[WK(XUh)]]\displaystyle E\left\{(1-W_{i})\frac{E\left[WYK\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[WK\left(\frac{X-X_{i}}{h}\right)\right]}-E\left[(1-Z)\frac{E\left[WYK\left(\frac{X-U}{h}\right)\right]}{E\left[WK\left(\frac{X-U}{h}\right)\right]}\right]\right.
+YiWiE((1Z)K(XiUh)E[WK(XUh)])\displaystyle\left.+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{E\left[WK\left(\frac{X-U}{h}\right)\right]}\right)\right.
WiE[(1Z)E[YWK(XUh)]K(XiUh)[E[WK(XUh)]]2]}2,\displaystyle\left.-W_{i}E\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[WK\left(\frac{X-U}{h}\right)]\right]^{2}}\right]\right\}^{2},
\sigma_{4}^{2} = E(Y^{2}(0)|W=0)\Pr(W=0)-E^{2}(Y(0)|W=0)[\Pr(W=0)]^{2}.

The terms c_{i}, for i=1,2,\ldots,6, are, respectively,

c1\displaystyle c_{1} =\displaystyle= E(WiYi)E[Y(1W)K(XXih)]E[(1W)K(XXih)]E(Wi)E(WY)E[Y(1W)K(XXih)]E[(1W)K(XXih)]\displaystyle E(W_{i}Y_{i})\frac{E\left[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}-E(W_{i})E(WY)\frac{E\left[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}
E(1Wi)YiE(WY)E(ZK(XiUh)E(1W)K(XUh))\displaystyle-E(1-W_{i})Y_{i}E(WY)E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E(1-W)K\left(\frac{X-U}{h}\right)}\right)
+E{(1Wi)E(WY)E(ZEY(1W)K(XUh)K(XiUh)[E(1W)K(XUh)]2)},\displaystyle+E\left\{(1-W_{i})E(WY)E\left(Z\frac{EY(1-W)K\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[E(1-W)K\left(\frac{X-U}{h}\right)\right]^{2}}\right)\right\},
c2\displaystyle c_{2} =\displaystyle= EYiWiE((1Z)K(XiUh)EWK(XUh))EYiWiE((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2)\displaystyle EY_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)-EY_{i}W_{i}E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right)
E(1Wi)E(WY)EYWK(XXih)EWK(XXih)EYiWiE(WY)E((1Z)K(XiUh)EWK(XUh))\displaystyle-E(1-W_{i})E(WY)\frac{EYWK\left(\frac{X-X_{i}}{h}\right)}{EWK\left(\frac{X-X_{i}}{h}\right)}-EY_{i}W_{i}E(WY)E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)
+EWiE(WY)E((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2),\displaystyle+EW_{i}E(WY)E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right),
c3\displaystyle c_{3} =\displaystyle= E(Y(1)|W=1)E(Y(0)|W=0)Pr(W=1)Pr(W=0),\displaystyle-E(Y(1)|W=1)E(Y(0)|W=0)\Pr(W=1)\Pr(W=0),
c4\displaystyle c_{4} =\displaystyle= E{[WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle E\left\{\left[W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.\right.
+Yi(1Wi)E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle\left.+Y_{i}(1-W_{i})E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)\right.
(1Wi)E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2)]\displaystyle\left.-(1-W_{i})E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right)\right]
[(1Wi)E[YWK(XXih)]E[WK(XXih)]E[(1Z)E[YWK(XUh)]E[WK(XUh)]]\displaystyle\left.\left[(1-W_{i})\frac{E[YWK\left(\frac{X-X_{i}}{h}\right)]}{E[WK\left(\frac{X-X_{i}}{h}\right)]}-E\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]}{E[WK\left(\frac{X-U}{h}\right)]}\right]\right.\right.
+YiWiE((1Z)K(XiUh)E[WK(XUh)])\displaystyle\left.+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[WK\left(\frac{X-U}{h}\right)]}\right)\right.
WiE((1Z)E[YWK(XUh)]K(XiUh)[E[WK(XUh)]]2)},\displaystyle\left.-W_{i}E\left((1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[WK\left(\frac{X-U}{h}\right)]\right]^{2}}\right)\right\},
c5\displaystyle c_{5} =\displaystyle= E[(1Wi)Yi2]E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle E[(1-W_{i})Y^{2}_{i}]E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)
E[(1Wi)Yi]E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2)\displaystyle-E[(1-W_{i})Y_{i}]E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right)
E[(1W)Y]E[WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]]\displaystyle-E[(1-W)Y]E\left[W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}\right]
E[(1W)Y]E[(1Wi)Yi]E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle-E[(1-W)Y]E\left[(1-W_{i})Y_{i}\right]E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)
+E[(1W)Y]E[(1Wi)]E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2),\displaystyle+E[(1-W)Y]E[(1-W_{i})]E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right),
c_{6} = E\left\{\left[(1-W_{i})Y_{i}-E[(1-W)Y]\right]\left[(1-W_{i})\frac{E\left[YWK\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[WK\left(\frac{X-X_{i}}{h}\right)\right]}\right.\right.
E[(1Z)EYWK(XUh)EWK(XUh)]+YiWiE((1Z)K(XiUh)EWK(XUh))\displaystyle-E\left[(1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right]+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)
WiE((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2)]}.\displaystyle\left.\left.-W_{i}E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right)\right]\right\}.

Therefore,

N{τ^τBh2}𝑑𝒩(0,σ2),\sqrt{N}\left\{\hat{\tau}-\tau-Bh^{2}\right\}\xrightarrow{d}\mathcal{N}(0,\sigma^{2}),

where \sigma^{2}=E[\Omega^{2}(X_{i},W_{i},Y_{i})] and B=E[b_{0}(U)|Z=1]\Pr(Z=1)+E[b_{1}(U)|Z=0]\Pr(Z=0), as derived in the proof of Theorem 1 (i).
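Because the plug-in form of \sigma^{2} involves many kernel-weighted moments, interval estimates are often easier to obtain by resampling. The following percentile-bootstrap sketch is our construction (reusing kernel_matching_ate from the earlier snippet), not a procedure prescribed by the theorem.

import numpy as np

def bootstrap_ci(y, w, x, h, B=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for tau_hat, resampling (Y, W, X) triples."""
    y, w, x = map(np.asarray, (y, w, x))
    rng = np.random.default_rng(seed)
    n = len(y)
    taus = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample units with replacement
        taus[b] = kernel_matching_ate(y[idx], w[idx], x[idx], h)
    return tuple(np.quantile(taus, [alpha / 2, 1 - alpha / 2]))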

Proof of Theorem 2 (ii):

According to Lemma 1 and Theorem 1, we can obtain

N(τ^tBth2τt)\displaystyle\sqrt{N}(\hat{\tau}_{t}-B_{t}h^{2}-\tau_{t}) =NN1{1Ni=1N[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]}+op(1)\displaystyle=\frac{N}{N_{1}}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]\right\}+o_{p}(1)
1E(p(X)){1Ni=1N[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]}+op(1).\displaystyle\to\frac{1}{E(p(X))}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]\right\}+o_{p}(1).

Applying the Central Limit Theorem, we have

N(τ^tBth2τt)𝑑N(0,σt2),\sqrt{N}(\hat{\tau}_{t}-B_{t}h^{2}-\tau_{t})\xrightarrow{d}N(0,\sigma_{t}^{2}),

where σt2=1E2(p(X))E[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]2\sigma_{t}^{2}=\frac{1}{E^{2}(p(X))}E[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)]^{2}.

Proof of Theorem 3 (i):

Based on data {(Wi,Xi)}i=1N\left\{\left(W_{i},X_{i}\right)\right\}_{i=1}^{N}, let the score function and the Fisher information matrix of β\beta be

S(β)=1Ni=1NXiWiF(XiTβ)F(XiTβ){1F(XiTβ)}f(XiTβ),S(\beta)=\frac{1}{N}\sum_{i=1}^{N}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta\right)}{F\left(X_{i}^{\mathrm{T}}\beta\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta\right),
(β)=E[f(XTβ)2F(XTβ){1F(XTβ)}XXT],\mathcal{I}(\beta)=E\left[\frac{f\left(X^{\mathrm{T}}\beta\right)^{2}}{F\left(X^{\mathrm{T}}\beta\right)\left\{1-F\left(X^{\mathrm{T}}\beta\right)\right\}}XX^{\mathrm{T}}\right],

where f(t)=\mathrm{d}F(t)/\mathrm{d}t. Because \hat{\beta} is the solution to the score equation S(\beta)=0, under certain regularity conditions, \hat{\beta}-\beta^{*}=\mathcal{I}^{-1}_{\beta^{*}}S\left(\beta^{*}\right)+o_{p}\left(N^{-1/2}\right).
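For the logistic choice F(t)=1/(1+e^{-t}) one has f=F(1-F), so S(\beta) reduces to \frac{1}{N}\sum_{i}X_{i}\{W_{i}-F(X_{i}^{\mathrm{T}}\beta)\} and the score equation can be solved by Newton’s method. The sketch below is our construction for this special case.

import numpy as np

def fit_propensity(X, w, iters=25):
    """Solve S(beta) = 0 by Newton's method for logistic F.
    X is the N x p design (include an intercept column); w is 0/1 treatment."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))                  # F(X_i' beta)
        score = X.T @ (w - p) / len(w)                       # S(beta)
        info = (X * (p * (1 - p))[:, None]).T @ X / len(w)   # Fisher information
        beta += np.linalg.solve(info, score)                 # step: beta + I^{-1} S
    return beta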
By a Taylor series expansion, we can write

\widehat{\tau}_{N}(\hat{\beta}) = \widehat{\tau}_{N}(\beta^{*})+\frac{\partial\widehat{\tau}_{N}(\beta^{*})}{\partial\beta^{\top}}(\hat{\beta}-\beta^{*})+o_{p}(\|\hat{\beta}-\beta^{*}\|).

Let

τ^N(β)β=T1~(Fn)+T2~(Fn),\displaystyle\frac{\partial\widehat{\tau}_{N}(\beta^{*})}{\partial\beta^{\top}}=\widetilde{T_{1}}(F_{n})+\widetilde{T_{2}}(F_{n}),

where

T1~(Fn)\displaystyle\widetilde{T_{1}}(F_{n})
=\displaystyle= 1Ni=1NWi{j=1N(1Wj)YjK(F(Xjβ)F(Xiβ)h)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle\frac{1}{N}\sum_{i=1}^{N}W_{i}\left\{\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle\left.\times\frac{\sum_{j=1}^{N}(1-W_{j})K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
j=1N(1Wj)YjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle-\left.\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)},\displaystyle\times\left.\frac{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right\},
T2~(Fn)\displaystyle\widetilde{T_{2}}(F_{n})
=\displaystyle= 1Ni=1N(1Wi){j=1NWjK(F(Xjβ)F(Xiβ)h)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle\frac{1}{N}\sum_{i=1}^{N}(1-W_{i})\left\{\frac{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1NWjYjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle\left.\times\frac{\sum_{j=1}^{N}W_{j}Y_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
j=1NWjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle-\left.\frac{\sum_{j=1}^{N}W_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
\times\left.\frac{\sum_{j=1}^{N}W_{j}Y_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right\}.

Let T~(F)=T1~(F)+T2~(F),\widetilde{T}(F)=\widetilde{T_{1}}(F)+\widetilde{T_{2}}(F), where

T1~(F)\displaystyle\widetilde{T_{1}}(F)
=\displaystyle= E(Z){E((1W)YK(F(Xβ)F(Uβ)h))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle E(Z)\left\{\frac{E\left((1-W)YK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E((1W)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle\left.\times\frac{E\left((1-W)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
E((1W)YK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle-\left.\frac{E\left((1-W)YK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E((1W)K(F(Xβ)F(Uβ)h))E((1W)K(F(Xβ)F(Uβ)h))}\displaystyle\times\left.\frac{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right\}
=\displaystyle= E(Z){E(Y(0)|F(Uβ))\displaystyle E(Z)\Biggl{\{}E(Y(0)|F\left(U^{\top}\beta^{*}\right))
×E[K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=0]f(F(Uβ)|W=0)\displaystyle\left.\times\frac{E\left[K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=0\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=0\right)}\right.
E[E(Y(0)|X)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=0]f(F(Uβ)|W=0)},\displaystyle-\frac{E\left[E(Y(0)|X)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=0\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=0\right)}\Biggr{\}},
T2~(F)\displaystyle\widetilde{T_{2}}(F)
= E(1-Z)\left\{\frac{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E(WYK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E(WK(F(Xβ)F(Uβ)h))\displaystyle\qquad\left.\times\frac{E\left(WYK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
E(WK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E(WK(F(Xβ)F(Uβ)h))\displaystyle\qquad-\left.\frac{E\left(WK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E(WYK(F(Xβ)F(Uβ)h))E(WK(F(Xβ)F(Uβ)h))}\displaystyle\qquad\times\left.\frac{E\left(WYK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right\}
=\displaystyle= E(1Z){E(Y(1)|F(Uβ))\displaystyle E(1-Z)\Biggl{\{}E(Y(1)|F\left(U^{\top}\beta^{*}\right))
×E[K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=1]f(F(Uβ)|W=1)\displaystyle\left.\times\frac{E\left[K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=1\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=1\right)}\right.
E[E(Y(1)|X)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=1]f(F(Uβ)|W=1)}.\displaystyle-\frac{E\left[E(Y(1)|X)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=1\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=1\right)}\Biggr{\}}.

Thus, by Lemma 1 and Markov’s inequality, as h\to 0, we have

\widehat{\tau}_{N}(\hat{\beta})-\tau = \widehat{\tau}_{N}(\beta^{*})-\tau+C(\hat{\beta}-\beta^{*})+o_{p}(1)
= \frac{1}{N}\sum^{N}_{i=1}\left\{\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}+Bh^{2}+o(h^{2})+o_{p}(1)\xrightarrow{p}0,

where C=T~(F)β1C=\widetilde{T}(F)\mathcal{I}_{\beta^{*}}^{-1}.
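In practice, the estimator \widehat{\tau}_{N}(\hat{\beta}) analyzed here matches on the fitted propensity score F(X_{i}^{\mathrm{T}}\hat{\beta}) rather than on X itself, which keeps the matching variable one-dimensional. A minimal sketch, our construction reusing fit_propensity and kernel_matching_ate from the earlier snippets:

import numpy as np

def ps_kernel_matching_ate(y, w, X, h):
    """Kernel matching on the estimated propensity score F(X_i' beta_hat)."""
    beta_hat = fit_propensity(X, w)
    ps = 1.0 / (1.0 + np.exp(-np.asarray(X, dtype=float) @ beta_hat))
    return kernel_matching_ate(y, w, ps, h)   # one-dimensional matching variable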

Proof of Theorem 3 (ii):

From Theorem 3 (i) and N/N1𝑝1/E[p(X)]N/N_{1}\xrightarrow{p}1/E[p(X)], we have

\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t} = \frac{N}{N_{1}}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1)
\to\frac{1}{E[p(X)]}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1),

where C_{t}=\widetilde{T_{1}}(F)\mathcal{I}_{\beta^{*}}^{-1}. By Lemma 1 and Markov’s inequality, as h\to 0, it follows that

τ^t,N(β^)τt𝑝0.\widehat{\tau}_{t,N}(\hat{\beta})-\tau_{t}\xrightarrow{p}0.

Proof of Theorem 4 (i):

Based on Theorem 3(i) and Lemma 1, we can obtain that

N(τ^N(β^)Bh2τ)\displaystyle\sqrt{N}(\widehat{\tau}_{N}(\hat{\beta})-Bh^{2}-\tau)
=\frac{1}{\sqrt{N}}\sum^{N}_{i=1}\left\{\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}+o_{p}(1).

Since the summands \Omega\left(X_{i},W_{i},Y_{i}\right)+CS_{\beta^{*}}(X_{i},W_{i}) are independent and identically distributed across i, where S_{\beta^{*}}(X_{i},W_{i})=X_{i}\{W_{i}-F(X_{i}^{\mathrm{T}}\beta^{*})\}f(X_{i}^{\mathrm{T}}\beta^{*})/[F(X_{i}^{\mathrm{T}}\beta^{*})\{1-F(X_{i}^{\mathrm{T}}\beta^{*})\}] denotes the score contribution of unit i, applying the Central Limit Theorem, we have

N(τ^N(β^)Bh2τ)𝑑N(0,σ~2),\sqrt{N}(\widehat{\tau}_{N}(\hat{\beta})-Bh^{2}-\tau)\xrightarrow{d}N(0,\widetilde{\sigma}^{2}),

where \widetilde{\sigma}^{2}=E\left[\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]^{2}.

Proof of Theorem 4 (ii):

To prove part (ii) of Theorem 4, from Lemma 1 and Theorem 3, we obtain

\sqrt{N}(\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t}) = \frac{N}{N_{1}}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1)
\to\frac{1}{E(p(X))}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1).

Because the summands \Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}S_{\beta^{*}}(X_{i},W_{i}) are independent and identically distributed across i, applying the Central Limit Theorem,

\sqrt{N}(\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t})\xrightarrow{d}N(0,\widetilde{\sigma}_{t}^{2}),

where

\widetilde{\sigma}_{t}^{2} = \frac{1}{E^{2}(p(X))}E\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]^{2}.

References

  • Abadie and Imbens, (2006) Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267.
  • Abadie and Imbens, (2016) Abadie, A. and Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84(2):781–807.
  • Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Berg, (2011) Berg, G. D. (2011). An application of kernel-based versus one-to-one propensity score matching for a nonexperimental causal study: example from a disease management program evaluation. Applied Economics Letters, 18(5):439–447.
  • Bickel et al., (1993) Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models, volume 4. Springer.
  • Christakis and Iwashyna, (2003) Christakis, N. A. and Iwashyna, T. J. (2003). The health impact of health care on families: a matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science and Medicine, 57(3):465–475.
  • Dawid, (1979) Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(1):1–15.
  • Dehejia and Wahba, (1999) Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.
  • Diamond and Sekhon, (2013) Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945.
  • Efron and Tibshirani, (1993) Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC.
  • Frölich, (2004) Frölich, M. (2004). Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators. The Review of Economics and Statistics, 86(1):77–90.
  • Galiani et al., (2005) Galiani, S., Gertler, P., and Schargrodsky, E. (2005). Water for life: The impact of the privatization of water services on child mortality. Journal of Political Economy, 113:83–120.
  • Hahn, (1998) Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–332.
  • Ham et al., (2011) Ham, J. C., Li, X., and Reagan, P. B. (2011). Matching and semi-parametric iv estimation, a distance-based measure of migration, and the wages of young men. Journal of Econometrics, 161(2):208–227.
  • Hampel, (1974) Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
  • Handouyahia et al., (2013) Handouyahia, A., Haddad, T., and Eaton, F. (2013). Kernel matching versus inverse probability weighting: a comparative study. International Journal of Mathematical and Computational Sciences, 7(8):1218–1233.
  • Hansen, (2004) Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618.
  • Heckman et al., (1998a) Heckman, J., Ichimura, H., Smith, J., and Todd, P. (1998a). Characterizing selection bias using experimental data. Econometrica, 66(5):1017–1098.
  • Heckman et al., (1998b) Heckman, J. J., Ichimura, H., and Todd, P. (1998b). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65(2):261–294.
  • Heckman et al., (1997) Heckman, J. J., Ichimura, H., and Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. The Review of Economic Studies, 64(4):605–654.
  • Herron and Wand, (2007) Herron, M. C. and Wand, J. (2007). Assessing partisan bias in voting technology: The case of the 2004 New Hampshire recount. Electoral Studies, 26:247–261.
  • Hettmansperger, (1984) Hettmansperger, T. (1984). Statistical Inference Based on Ranks. Wiley Series in Probability and Statistics. Wiley.
  • Hines et al., (2022) Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3):292–304.
  • Hirano et al., (2003) Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
  • Huber, (1981) Huber, P. (1981). Robust Statistics. Wiley Series in Probability and Statistics. Wiley.
  • Imai and Ratkovic, (2014) Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):243–263.
  • Imbens, (2004) Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29.
  • LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4):604–620.
  • Luo and Zhu, (2020) Luo, W. and Zhu, Y. (2020). Matching using sufficient dimension reduction for causal inference. Journal of Business & Economic Statistics, 38(4):888–900.
  • Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
  • Rosenbaum, (1991) Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):597–610.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rosenbaum and Rubin, (1985) Rosenbaum, P. R. and Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38.
  • Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
  • Rubin, (1997) Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127:757–763.
  • Sekhon, (2004) Sekhon, J. S. (2004). Quality meets quantity: Case studies, conditional probability, and counterfactuals. Perspectives on Politics, 2(2):281–293.
  • Smith and Todd, (2005) Smith, J. A. and Todd, P. E. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125(1-2):305–353.