
Estimation of Treatment Effects based on Kernel Matching

Chong Ding (C. Ding: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Zheng Li (Z. Li: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Hon Keung Tony Ng (H. K. T. Ng: [email protected]), Department of Mathematical Sciences, Bentley University, Waltham, Massachusetts 02452, U.S.A.
Wei Gao (Corresponding author; W. Gao: [email protected]), Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.
Abstract

A treatment effect represents the average causal impact or outcome difference between treatment and control groups. Treatment effects can be estimated through social experiments, regression models, matching estimators, and instrumental variables. In this paper, we introduce a novel kernel-matching estimator for treatment effect estimation. This method is particularly beneficial in observational studies where randomized control trials are not feasible, as it uses the full sample to increase the efficiency and robustness of treatment effect estimates. We demonstrate that the proposed estimator is consistent and asymptotically efficient under certain conditions. Through Monte Carlo simulations, we show that the estimator performs favorably against other estimators in the literature. Finally, we apply our method to data from the National Supported Work Demonstration to illustrate its practical application.

Keywords: Bootstrap confidence interval; Causal inference; Observational study; Propensity score; Missing data.

1 Introduction

Matching is a crucial method for estimating causal effects, widely applied in various research fields, including policy evaluation (Sekhon, 2004; Herron and Wand, 2007), healthcare research (Rubin, 1997; Christakis and Iwashyna, 2003), and economics (Dehejia and Wahba, 1999; Galiani et al., 2005; Abadie and Imbens, 2006). The process of estimating causal effects fundamentally depends on imputing a potential outcome for an individual, calculated as a weighted average of the observed outcomes of their nearest neighbors from the relevant counterfactual set (Rosenbaum and Rubin, 1983). Thus, causal effects are typically expressed as the difference between the imputed outcome and the actual observed outcome, similar to what would be expected from a randomized experiment. Various methods have been developed to assign matched subjects based on covariates, such as nearest available matching using propensity scores, Mahalanobis metric matching (Rosenbaum and Rubin, 1985), full matching (Rosenbaum, 1991; Hansen, 2004), genetic matching (Diamond and Sekhon, 2013), and matching using sufficient dimension reduction (Luo and Zhu, 2020).

Among various matching methods, covariate matching (Abadie and Imbens, 2006) can be challenging in high-dimensional settings, as it requires exact or near-exact matches across all covariates. This strict requirement often reduces the effective sample size, limiting the method’s reliability and suitability for high-dimensional data or complex distributions. Alternatively, propensity score matching (Abadie and Imbens, 2016) depends on the correct specification of the propensity score model; misspecification can introduce bias and result in extreme weights. Beyond these matching methods, several alternative approaches have been proposed for causal effect estimation, including inverse probability weighting (IPW) (Hirano et al., 2003), subclassification (Rosenbaum and Rubin, 1983), and doubly robust (DR) estimation (Bang and Robins, 2005). The accuracy of the IPW estimator is highly sensitive to the correct specification of the propensity score model, with even minor misspecifications potentially leading to substantial errors due to the high variance induced by extreme weights (Imai and Ratkovic, 2014). DR estimators, which combine propensity score and outcome regression models, can yield unbiased estimates if at least one model is correctly specified. However, DR estimators are sensitive to the accuracy of both models and may be complex to implement effectively.

In this paper, we address the above challenges by introducing a novel kernel-matching estimator and providing a detailed methodological framework, including discussions on the consistency and asymptotic normality of the estimator. Kernel matching, as outlined by Heckman et al. (1997, 1998b), is a non-parametric approach that applies kernel functions to weight control cases based on their similarity to treatment cases in the covariate space, facilitating the estimation of treatment effects (Handouyahia et al., 2013). The kernel matching method allows for increased flexibility and stability through adjustable bandwidth parameters. By balancing bias and variance through local smoothing, kernel matching produces more robust estimates than traditional one-to-one propensity score matching, especially in situations with limited control cases available for creating a matching cohort. It mitigates extreme bias or variance and reduces dependence on stringent model assumptions (Berg, 2011).

Although more complex than basic matching techniques, kernel matching extends interval and nearest-neighbor matching by assigning weights to all control cases associated with each treatment case, giving greater emphasis to those with closer covariate values. Smith and Todd (2005) offered an intuitive explanation of kernel matching and its generalizations, such as local linear matching (Heckman et al., 1997, 1998a, 1998b) and local quadratic matching (Ham et al., 2011). By smoothing over the covariate space, kernel matching reduces the risks associated with model misspecification, resulting in more reliable and unbiased treatment effect estimates. This approach is particularly advantageous in settings with high overlap between treatment and control cases, ensuring that comparisons are made between genuinely comparable groups.

When choosing a suitable matching method, selecting the appropriate function of the covariates is crucial, as the function may significantly affect the asymptotic behaviors of the resulting estimators (Heckman et al., 1997; Abadie and Imbens, 2006). Heuristically, Rosenbaum and Rubin (1983) suggested that using the original covariates is preferred for its simplicity and for ensuring unbiased matching at the population level, provided that the ignorability assumption (i.e., the potential outcomes are independent of treatment after controlling for the specific set of covariates) and the overlap assumption (i.e., the range of data is similar across treatment groups) are satisfied. However, Abadie and Imbens (2006) pointed out that the matching approach can be vulnerable to the “curse of dimensionality”, rendering matching ineffective when the dimension of covariates is high. To address this issue, a natural choice is using the propensity score, which is the probability of a subject being a treatment case given its covariates, proposed by Rosenbaum and Rubin (1985). This reduction in dimensionality makes it easier to find similar treatment and control cases. Similar to the original covariates, the propensity score also ensures the unbiasedness of matching under the ignorability assumption and the overlap assumption. Mathematically, suppose X is the vector of covariates; the propensity score p(X) can be defined as

p(X) = F(X^{\top}\beta),

where \beta is the parameter vector and F is a link function that maps X^{\top}\beta into a probability in (0,1) (e.g., a logit or probit function). Nevertheless, the estimation of the propensity score commonly relies on parametric models, including the link function F, which are susceptible to model misspecification. In high-dimensional settings, kernel matching enhances propensity score matching by smoothing across similar observations. The kernel function averages treatment effects among units with similar propensity scores, stabilizing estimates and reducing variance. High-dimensional data often suffer from sparsity, making exact matches between treatment and control cases difficult to find. Kernel matching addresses this issue by weighting nearby units based on their propensity scores, allowing for effective comparisons even when exact matches are rare. The kernel function introduces a smoothing effect, which acts as a form of regularization. This regularization helps control overfitting to noise, a common issue in high-dimensional data, making the treatment effect estimation more robust. In high-dimensional spaces, irrelevant covariates can introduce noise into the matching process. By focusing on propensity scores, which encapsulate the relevant information from all covariates, kernel matching reduces the impact of irrelevant covariates, making the method less sensitive to noise. Therefore, kernel matching based on propensity scores is a powerful tool for addressing the challenges of high-dimensional settings, as it effectively manages the complexities associated with having many covariates.

In this paper, we propose estimating treatment effects using kernel matching based on F(X^{\top}\hat{\beta}) with observational data. Through a Monte Carlo simulation study with different settings, we demonstrate the practical application of kernel matching and highlight its advantages over traditional estimation methods. This study underscores the importance of adopting robust matching techniques in causal inference and showcases the potential of kernel matching to enhance the validity and reliability of treatment effect estimates. Compared to traditional nearest-neighbor matching, kernel matching does not require pre-specifying the number of matched individuals (e.g., one-to-one matching). Instead, kernel matching adaptively selects suitable matches through the kernel function. By using more control cases through weighting, kernel matching can reduce the variance of the estimate. We show that the kernel matching method is a powerful tool in the arsenal of methods for causal inference in observational studies.

The rest of the paper is organized as follows. In Section 2, after introducing the mathematical notation and the concepts of treatment effect estimation, the proposed kernel matching estimator is developed. In Section 3, the asymptotic properties of the kernel-matching estimator are established. A Monte Carlo simulation study is used to evaluate the performance of the proposed method in Section 4. Section 5 provides a practical data analysis to illustrate the proposed methodology. Finally, some concluding remarks are provided in Section 6.

2 Basic Notation and Concepts of Kernel Matching

Following the framework of Rubin (1974), the treatment effect on the potential outcomes can be defined as follows. For unit i (i = 1, \ldots, N), let the variable W_i \in \{0, 1\} indicate whether an active treatment was received (W_i = 1) or not received (W_i = 0). The observed outcome is Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0), where Y_i(1) and Y_i(0) denote the potential outcomes had the individual received the active treatment or not, respectively. In other words, for unit i, the observed outcome Y_i for treatment W_i can be expressed as

Y_i = \begin{cases} Y_i(0) & \text{ if } W_i = 0, \\ Y_i(1) & \text{ if } W_i = 1. \end{cases}

In this paper, we are interested in the average treatment effect for the whole population (ATE)

\tau = E[Y(1) - Y(0)],

and the average treatment effect for the treated (ATT)

\tau_t = E[Y(1) - Y(0) \mid W = 1].

The assumptions listed below are essential for establishing limit properties and will remain part of the underlying assumptions throughout this article. These assumptions are commonly found in the existing literature on causal inference (see, for example, Rosenbaum and Rubin, 1983; Heckman et al., 1998b; Abadie and Imbens, 2016) and are considered standard assumptions.

Assumption 1.

Let X be a k-dimensional random vector of covariates. For x \in \mathbb{X}, where \mathbb{X} is the support of X, the propensity score is defined as p(x) = \Pr(W = 1 \mid X = x). We consider

  • (i)

    (Unconfoundedness) W \perp \{Y(1), Y(0)\} \mid X, where \perp denotes independence (Dawid, 1979);

  • (ii)

    (Overlap) \underline{\eta} \leq p(X) \leq \overline{\eta} almost surely, for some \underline{\eta} > 0 and \overline{\eta} < 1.

Note that Assumption 1 is also known as “strong ignorability” (Rosenbaum and Rubin, 1983), which implies that X is sufficient for eliminating the impact of confounding factors (i.e., “unconfoundedness”). Hahn (1998) established bounds for the asymptotic variance and explored asymptotically efficient estimation methods under Assumption 1(i).

Here, we define the proposed kernel-matching estimators for the ATE and ATT. Following the notation defined above, for N \in \mathbb{N}, suppose \{Y_i, W_i, X_i\}_{i=1}^{N} are independent draws from the population with the distribution of \{Y, W, X\}. Let N_1 and N_0 be the numbers of treated and control units, respectively. The individual treatment effect is \tau_i = Y_i(1) - Y_i(0). For every unit in the study, we can only observe one of the potential outcomes, Y_i(0) and Y_i(1); the other is unobservable or missing.

Let K(\cdot) be a kernel function, and let h be the bandwidth parameter, which converges to zero as N goes to infinity. The kernel-matching estimators based on the propensity score p(X) for the missing potential outcomes can be expressed as

\widehat{Y}_i(0) = \begin{cases} Y_i & \text{ if } W_i = 0, \\ \dfrac{\sum_{j: W_j = 0}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 0}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} & \text{ if } W_i = 1, \end{cases}

and

\widehat{Y}_i(1) = \begin{cases} \dfrac{\sum_{j: W_j = 1}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 1}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} & \text{ if } W_i = 0, \\ Y_i & \text{ if } W_i = 1. \end{cases}

The proposed kernel-matching estimator for the ATE is defined as

\begin{split} \hat{\tau}_N &= \frac{1}{N} \sum_{i=1}^{N} \left( \widehat{Y}_i(1) - \widehat{Y}_i(0) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} (2W_i - 1) \left( Y_i - \frac{\sum_{j: W_j = 1 - W_i}^{N} Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j: W_j = 1 - W_i}^{N} K\left(\frac{p(X_j) - p(X_i)}{h}\right)} \right), \end{split}

and the proposed kernel-matching estimator for the ATT is defined as

\hat{\tau}_t = \frac{1}{N_1} \sum_{i=1}^{N} W_i \left( Y_i - \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K\left(\frac{p(X_j) - p(X_i)}{h}\right)}{\sum_{j=1}^{N} (1 - W_j) K\left(\frac{p(X_j) - p(X_i)}{h}\right)} \right),

where N_1 = \sum_{i=1}^{N} W_i is the number of treatment cases in the sample.
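To make the construction above concrete, the following is a minimal sketch of \hat{\tau}_N and \hat{\tau}_t in Python, assuming the propensity scores p(X_i) are given; the Gaussian kernel and all function names here are illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel; any symmetric kernel could be substituted.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_match(Y, W, ps, h=0.01):
    """Kernel-matching estimates of the ATE and ATT given propensity scores ps."""
    Y, W, ps = np.asarray(Y, float), np.asarray(W, int), np.asarray(ps, float)
    treated, control = W == 1, W == 0
    # Kmat[i, j] = K((p(X_j) - p(X_i)) / h)
    Kmat = gaussian_kernel((ps[None, :] - ps[:, None]) / h)
    # Impute the missing potential outcome as a kernel-weighted average
    # over the opposite treatment group, as in the displayed estimators.
    y0_hat = np.where(control, Y,
                      Kmat[:, control] @ Y[control] / Kmat[:, control].sum(axis=1))
    y1_hat = np.where(treated, Y,
                      Kmat[:, treated] @ Y[treated] / Kmat[:, treated].sum(axis=1))
    tau_ate = np.mean(y1_hat - y0_hat)           # estimator of the ATE
    tau_att = np.mean((y1_hat - y0_hat)[treated])  # estimator of the ATT
    return tau_ate, tau_att
```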

In observational studies, to estimate the propensity scores, following Rosenbaum and Rubin (1983), a generalized linear model p(X) = F(X^{\top}\beta) can be used. The logit and probit functions are commonly used parametric link functions F(\cdot).

Let \beta^* be the true value of the propensity score model parameter vector, such that p(X) = F(X^{\top}\beta^*). The estimator based on matching on the true propensity score is given by \widehat{\tau}_N(\beta^*), and \widehat{\tau}_N(\hat{\beta}) is a kernel matching estimator based on the estimated propensity score F(X^{\top}\hat{\beta}), where \hat{\beta} is the maximum likelihood estimator of \beta based on the observed data \{Y_i, W_i, X_i\}_{i=1}^{N}, i.e.,

\hat{\beta} = \arg\max_{\beta} \ln L(\beta \mid W_1, X_1, \ldots, W_N, X_N)

with the log-likelihood function

\ln L(\beta \mid W_1, X_1, \ldots, W_N, X_N) = \sum_{i=1}^{N} \left[ W_i \ln F(X_i^{\top}\beta) + (1 - W_i) \ln\left(1 - F(X_i^{\top}\beta)\right) \right].

Next, we define a type of matching estimator where the unknown parameter \beta is estimated using the maximum likelihood method. The kernel-matching estimator based on F(X^{\top}\hat{\beta}) for the ATE can be expressed as

\begin{split} \hat{\tau}_N(\hat{\beta}) &= \frac{1}{N} \sum_{i=1}^{N} \left( \widehat{Y}_i(1) - \widehat{Y}_i(0) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} (2W_i - 1) \left( Y_i - \frac{\sum_{j: W_j = 1 - W_i} Y_j K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)}{\sum_{j: W_j = 1 - W_i} K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)} \right), \end{split}

and for ATT, the estimator can be expressed as

\hat{\tau}_{t,N}(\hat{\beta}) = \frac{1}{N_1} \sum_{i=1}^{N} W_i \left( Y_i - \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)}{\sum_{j=1}^{N} (1 - W_j) K\left(\frac{F(X_j^{\top}\hat{\beta}) - F(X_i^{\top}\hat{\beta})}{h}\right)} \right).
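In practice, \hat{\beta} can be computed with any routine that maximizes the log-likelihood above. As a hedged illustration (not the authors' code), the sketch below uses a logit link fitted by scikit-learn, with a negligible penalty to approximate the unpenalized MLE, and then plugs the estimated scores into the kernel_match function sketched in the previous section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimated_ps_kernel_match(X, Y, W, h=0.01):
    # Logit link F: fit beta-hat by (approximately unpenalized) maximum
    # likelihood, then match on the estimated scores F(X^T beta-hat).
    ps_hat = LogisticRegression(C=1e10, max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
    # Returns (tau_hat_N(beta_hat), tau_hat_{t,N}(beta_hat)).
    return kernel_match(Y, W, ps_hat, h=h)
```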

3 Asymptotic Properties of the Proposed Estimators

In this section, we establish that the proposed kernel-matching estimators for the ATE and ATT are consistent and asymptotically normally distributed, with a convergence rate of N^{1/2}.

Assumption 2.

The kernel function K(\cdot) is symmetric and satisfies the following properties:

  • (1)

    \int K(u)\,du = 1;

  • (2)

    \int u^2 K(u)\,du < \infty;

  • (3)

    \int K^2(u)\,du < \infty;

  • (4)

    \int |u|^3 K(u)\,du < \infty.
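For instance, the Gaussian kernel used in the simulations of Section 4 satisfies conditions (1)-(4); the following sketch (an illustrative check we add here, not part of the paper) evaluates the four integrals numerically.

```python
import numpy as np
from scipy.integrate import quad

K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel

# Conditions (1)-(4) of Assumption 2, evaluated numerically:
for label, f in [("int K(u) du", K),
                 ("int u^2 K(u) du", lambda u: u ** 2 * K(u)),
                 ("int K^2(u) du", lambda u: K(u) ** 2),
                 ("int |u|^3 K(u) du", lambda u: abs(u) ** 3 * K(u))]:
    value, _ = quad(f, -np.inf, np.inf)
    print(f"{label:20s} = {value:.4f}")
# Prints 1.0000, 1.0000, 0.2821 (= 1/(2 sqrt(pi))), and 1.5958: all finite.
```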

Let g_w(p) = \mathbb{E}(Y(w) \mid p(X) = p). Under Assumptions 1 and 2, the following theorem shows that the estimators \hat{\tau} and \hat{\tau}_t are consistent estimators of \tau and \tau_t, respectively, under mild conditions.

Theorem 1.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed with density function f(p) that has three-times continuous derivatives. Then, under Assumptions 1 and 2, as N \rightarrow \infty and h \rightarrow 0, we have

  • (i)
    \hat{\tau} - \tau \xrightarrow{p} 0,
  • (ii)
    \hat{\tau}_t - \tau_t \xrightarrow{p} 0.

The proof of Theorem 1 is given in the Appendix.

Next, we present the formal result for the asymptotic normality of the proposed estimators.

Theorem 2.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed with probability density function f(p) that has three-times continuous derivatives. Let

b_w(p) = \frac{\int t^2 K(t)\,dt}{2} \left( \frac{2 g_w'(p) f'(p \mid w)}{f(p \mid w)} + g_w''(p) \right),

and suppose Assumptions 1 and 2 hold. As N \rightarrow \infty and Nh^4 = O(1), we have

  • (i)
    \sqrt{N}\left(\hat{\tau} - \tau - Bh^2\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2),

    where B = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) + \mathbb{E}[b_1(P) \mid Z = 0]\Pr(Z = 0), and the formula for \sigma is provided in the Appendix;

  • (ii)
    \sqrt{N}\left(\hat{\tau}_t - \tau_t - B_t h^2\right) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2),

    where B_t = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) and the formula for \sigma_t is provided in the Appendix.

The proof of Theorem 2 is given in the Appendix.

Corollary 1.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is continuously distributed, with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and Nh^4 \rightarrow 0, the bias term converges to zero. Thus,

  • (i)
    \sqrt{N}(\hat{\tau} - \tau) \xrightarrow{d} \mathcal{N}(0, \sigma^2);
  • (ii)
    \sqrt{N}(\hat{\tau}_t - \tau_t) \xrightarrow{d} \mathcal{N}(0, \sigma_t^2),

    where the formulas for \sigma and \sigma_t are provided in the Appendix.
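As a simple illustration of an admissible bandwidth rate (our added remark, not part of the original statement): taking h = cN^{-1/3} for a constant c > 0 gives

Nh^4 = c^4 N^{-1/3} \rightarrow 0 \quad \text{and} \quad Nh = cN^{2/3} \rightarrow \infty,

so the smoothing bias Bh^2 is of order N^{-2/3} = o(N^{-1/2}) and vanishes after \sqrt{N}-scaling, while each imputation still averages over a growing number of neighboring units.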

In the following theorems, we provide the large-sample properties of \hat{\tau}_N(\hat{\beta}) and \hat{\tau}_{t,N}(\hat{\beta}), including consistency in Theorem 3 and asymptotic normality in Theorem 4.

Theorem 3.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is distributed with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and h \rightarrow 0, we have

  • (i)
    \widehat{\tau}_N(\hat{\beta}) - \tau \xrightarrow{p} 0,
  • (ii)
    \widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t \xrightarrow{p} 0.
Theorem 4.

Suppose that g_w(p) is three times continuously differentiable and bounded, and the propensity score p(X) is continuously distributed with a continuous density f(p) that has a continuous third derivative. Under Assumptions 1 and 2, as N \rightarrow \infty, h \rightarrow 0, and Nh^4 = O(1), we have

  • (i)
    \sqrt{N}\left\{\widehat{\tau}_N(\hat{\beta}) - \tau - Bh^2\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}^2),

    where B = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) + \mathbb{E}[b_1(P) \mid Z = 0]\Pr(Z = 0) and the formula for \widetilde{\sigma}^2 is provided in the Appendix;

  • (ii)
    \sqrt{N}\left\{\widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t - B_t h^2\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}_t^2),

    where B_t = \mathbb{E}[b_0(P) \mid Z = 1]\Pr(Z = 1) and the formula for \widetilde{\sigma}_t^2 is provided in the Appendix.

The proofs of Theorem 3 and Theorem 4 are presented in the Appendix.

Corollary 2.

Suppose that g_w(p) has three-times bounded continuous derivatives, and the propensity score p(X) is distributed with continuous probability density function f(p) having three-times continuous derivatives. Under Assumptions 1 and 2, as N \rightarrow \infty and Nh^4 \rightarrow 0, the bias term converges to zero. Thus, we have

  • (i)
    \sqrt{N}\left\{\widehat{\tau}_N(\hat{\beta}) - \tau\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}^2),
  • (ii)
    \sqrt{N}\left\{\widehat{\tau}_{t,N}(\hat{\beta}) - \tau_t\right\} \xrightarrow{d} \mathcal{N}(0, \widetilde{\sigma}_t^2),

    where the formulas for \widetilde{\sigma}^2 and \widetilde{\sigma}_t^2 are provided in the Appendix.

4 Monte Carlo Simulation Study

In this section, a Monte Carlo simulation study is used to evaluate the performance of the proposed estimators and to compare it with that of existing estimators. The simulation results confirm the effectiveness of the kernel matching estimators based on the estimated propensity score in estimating the ATE and ATT. For comparative purposes, the following existing ATE and ATT estimation methods are considered:

  • Abadie-Imbens matching estimators based on the original covariates (Abadie and Imbens, 2006) (denoted as “Covariate”).

  • Abadie-Imbens matching estimators based on the true propensity score (Abadie and Imbens, 2006) (denoted as “True PS”).

  • Abadie-Imbens matching estimators based on the estimated propensity score (Abadie and Imbens, 2006, 2016) (denoted as “Estimated PS”).

  • Normalized inverse probability weighting (IPW) estimators (Hirano et al., 2003; Imbens, 2004) for ATE and ATT given by

    \hat{\tau}_{\mathrm{IPW}}^{\mathrm{ATE}} = \sum_{i=1}^{N} \frac{W_i Y_i}{\hat{p}(X_i)} \left( \sum_{i=1}^{N} \frac{W_i}{\hat{p}(X_i)} \right)^{-1} - \sum_{i=1}^{N} \frac{(1 - W_i) Y_i}{1 - \hat{p}(X_i)} \left( \sum_{i=1}^{N} \frac{1 - W_i}{1 - \hat{p}(X_i)} \right)^{-1}

    and

    \hat{\tau}_{\mathrm{IPW}}^{\mathrm{ATT}} = \frac{1}{N_1} \sum_{i: W_i = 1} Y_i - \sum_{i: W_i = 0} Y_i \cdot \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} \left( \sum_{i: W_i = 0} \frac{\hat{p}(X_i)}{1 - \hat{p}(X_i)} \right)^{-1},

    respectively (denoted as “IPW”; a code sketch of these estimators is given after this list).

  • Doubly robust estimators (Robins et al., 1994; Bang and Robins, 2005) for ATE and ATT given by

    \hat{\tau}_{\mathrm{DR}}^{\mathrm{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \left\{ \left[ \frac{W_i Y_i}{\hat{p}(X_i)} - \frac{W_i - \hat{p}(X_i)}{\hat{p}(X_i)} \hat{\mu}(1, X_i) \right] - \left[ \frac{(1 - W_i) Y_i}{1 - \hat{p}(X_i)} + \frac{W_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)} \hat{\mu}(0, X_i) \right] \right\}

    and

    \hat{\tau}_{\mathrm{DR}}^{\mathrm{ATT}} = \frac{1}{N_1} \sum_{i=1}^{N} \left[ W_i Y_i - \frac{(1 - W_i) Y_i \hat{p}(X_i) + \hat{\mu}(0, X_i) \left( W_i - \hat{p}(X_i) \right)}{1 - \hat{p}(X_i)} \right],

    respectively, where \hat{\mu}(w, X_i) denotes a fitted estimate of the outcome regression E[Y \mid W = w, X = X_i] (denoted as “DR”; see the code sketch after this list).

  • The kernel-matching estimators for ATE and ATT presented in Section 2 using the Gaussian kernel function with a bandwidth h of 0.01 (denoted as “Proposed”). In preliminary simulations (results not shown), we also tested the Epanechnikov kernel function and various bandwidth values, finding that the Gaussian kernel with h = 0.01 generally provided better performance for the kernel-matching estimator.
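For reference, a compact sketch of the IPW and DR estimators displayed above (our illustration under the paper's notation; ps stands for \hat{p}(X_i), and mu1, mu0 stand for fitted outcome regressions \hat{\mu}(1, X_i) and \hat{\mu}(0, X_i), which would be estimated separately):

```python
import numpy as np

def ipw_estimates(Y, W, ps):
    # Normalized IPW estimators of the ATE and ATT.
    w1, w0 = W / ps, (1 - W) / (1 - ps)
    tau_ate = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
    c = W == 0
    odds = ps[c] / (1 - ps[c])
    tau_att = Y[W == 1].mean() - np.sum(Y[c] * odds) / np.sum(odds)
    return tau_ate, tau_att

def dr_estimates(Y, W, ps, mu1, mu0):
    # Doubly robust (augmented IPW) estimators of the ATE and ATT.
    tau_ate = np.mean(W * Y / ps - (W - ps) / ps * mu1
                      - ((1 - W) * Y / (1 - ps) + (W - ps) / (1 - ps) * mu0))
    tau_att = np.sum(W * Y - ((1 - W) * Y * ps + mu0 * (W - ps)) / (1 - ps)) / np.sum(W)
    return tau_ate, tau_att
```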

To simulate the control and treatment cases with covariates, the following five settings are considered:

  • Setting I:

    • Two independent covariates, X_1 and X_2, follow the uniform distribution on the interval [-1/2, 1/2].

    • The potential outcomes are generated based on the following models

      Y(0) = 3X_1 - 3X_2 + \epsilon_0

      and

      Y(1) = 5 + 5X_1 + X_2 + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of (W, X_1, X_2).

    • The treatment variable, W, is related to (X_1, X_2) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2) = \frac{\exp(x_1 + 2x_2)}{1 + \exp(x_1 + 2x_2)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting II:

    • Two independent covariates, X_1 and X_2, follow the uniform distribution on the interval [-1/2, 1/2].

    • The potential outcomes are generated based on the following models

      Y(0) = 10X_1 + \epsilon_0

      and

      Y(1) = 5 - 10X_1 + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of (W, X_1, X_2).

    • The treatment variable, W, is related to X_2 through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2) = \frac{\exp(2x_2)}{1 + \exp(2x_2)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting III:

    • Four covariates: the vector X = (X_1, X_2, X_3, X_4) follows a four-dimensional normal distribution \mathcal{N}(0, \Sigma), where the variance-covariance matrix \Sigma is given by

      \Sigma = \frac{1}{3} \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix},

      i.e., X_1 and X_2 are correlated, X_3 and X_4 are correlated, and (X_1, X_2) and (X_3, X_4) are independent.

    • The potential outcomes are generated based on the following model

      Y = 0.4 + 0.25\sin(8(X^{\top}\theta) - 5) + 0.4\exp(-16(4(X^{\top}\theta) - 2.5)^2) + W + \epsilon,

      where \theta = (1, 1, 1, 1)^{\top} and \epsilon is a random error that follows the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3, X_4) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = x_4) = \frac{\exp(x_1 - 3x_2 + 2x_3 + x_4)}{1 + \exp(x_1 - 3x_2 + 2x_3 + x_4)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting IV:

    • Four covariates: the vector X = (X_1, X_2, X_3, X_4) follows a four-dimensional normal distribution \mathcal{N}(0, \Sigma), where the variance-covariance matrix \Sigma is given by

      \Sigma = \frac{1}{3} \begin{pmatrix} 1 & -1 & 0 & 0 \\ -1 & 2 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix}.

    • The potential outcomes are generated based on the following model

      Y = 0.15 + 0.7\left[\frac{\exp(\sqrt{2}(X^{\top}\theta))}{1 + \exp(\sqrt{2}(X^{\top}\theta))}\right] + W + \epsilon,

      where \epsilon is a random error that follows the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3, X_4) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3, X_4 = x_4) = \frac{\exp(x_1 + x_2 + x_3 + x_4)}{1 + \exp(x_1 + x_2 + x_3 + x_4)}.

    • The ATE and ATT estimators use a single match (M = 1).

  • Setting V:

    • Three independent covariates: X_1 \sim \mathcal{N}(0, 1), X_2 \sim \text{Bernoulli}(0.5), and X_3 \sim \mathcal{N}(0, 1).

    • The potential outcomes are generated based on the following models

      Y(0) = \sin(1 + X_1 + 0.5X_2) + \epsilon_0

      and

      Y(1) = \sin(2 + X_1 + 0.5X_2) + \epsilon_1,

      where \epsilon_0 and \epsilon_1 are random errors that follow the standard normal distribution independently of X.

    • The treatment variable, W, is related to X = (X_1, X_2, X_3) through the propensity score with a logistic function

      \Pr(W = 1 \mid X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{\exp(2x_1 + x_2 - 3x_3 + 3)}{1 + \exp(2x_1 + x_2 - 3x_3 + 3)}.

    • The ATE and ATT estimators use a single match (M = 1).

We generated datasets with sample sizes of N = 200, 500, and 1000, and conducted 1000 Monte Carlo simulations for each scenario to estimate causal effects. The data generation processes for Settings I and II follow the approach of Abadie and Imbens (2016), while those for Settings III and IV are based on the method of Frölich (2004) with slight modifications. Similarly, the process for Setting V is adapted from the approach of Luo and Zhu (2020), also with slight modifications.
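As an illustration of the data-generating process, the following is a sketch for Setting I (our reconstruction from the description above, with an arbitrary seed):

```python
import numpy as np

def generate_setting_I(N, seed=2024):
    # Setting I: uniform covariates, linear potential outcomes, logistic treatment.
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-0.5, 0.5, N)
    X2 = rng.uniform(-0.5, 0.5, N)
    ps = np.exp(X1 + 2 * X2) / (1 + np.exp(X1 + 2 * X2))  # true propensity score
    W = rng.binomial(1, ps)
    Y0 = 3 * X1 - 3 * X2 + rng.standard_normal(N)
    Y1 = 5 + 5 * X1 + X2 + rng.standard_normal(N)
    Y = np.where(W == 1, Y1, Y0)
    # The true ATE is 5, since E[Y(1) - Y(0)] = 5 + E[2*X1 + 4*X2] = 5.
    return np.column_stack([X1, X2]), W, Y, ps
```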

We compare the point estimates of ATE and ATT from the six estimators (Covariate, True PS, Estimated PS, Proposed, IPW, and DR) in terms of the simulated biases (Bias), standard deviations (SD), and root mean square errors (RMSE). For comparative purposes, we also compute the 95% confidence intervals based on the different estimators using bootstrap methods. Specifically, following Abadie and Imbens (2006, 2016), we obtain the 95% confidence intervals based on the “Covariate”, “True PS”, and “Estimated PS” estimators using a normal approximation with standard errors obtained from 400 bootstrap samples. For the “Proposed”, “IPW”, and “DR” estimators, the 95% confidence intervals are obtained using the percentile bootstrap method (Efron and Tibshirani, 1993) with 400 bootstrap samples. We compare the interval estimates of ATE and ATT from the six estimators in terms of the simulated average widths (AW) and coverage probabilities (CP). Simulation results are reported for the six estimators under Settings I–V in Tables 1–5 for ATE estimation and in Tables 6–10 for ATT estimation.
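A hedged sketch of the percentile bootstrap interval used for the “Proposed”, “IPW”, and “DR” estimators (our illustration; estimator is any function mapping a resampled (X, Y, W) to a point estimate, re-estimating the propensity score on each resample):

```python
import numpy as np

def percentile_bootstrap_ci(X, Y, W, estimator, B=400, alpha=0.05, seed=0):
    # Percentile bootstrap (Efron and Tibshirani, 1993): resample units with
    # replacement, recompute the estimator, and take empirical quantiles.
    rng = np.random.default_rng(seed)
    N = len(Y)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)
        stats.append(estimator(X[idx], Y[idx], W[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```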

From Tables 1–10, we observe that the biases and RMSEs of the proposed estimator decrease as the sample size increases, which is consistent with the theoretical asymptotic results discussed in Section 3. For point estimation of both ATE and ATT, the proposed kernel matching estimator demonstrates comparable bias and RMSE performance across most settings, though it does not achieve the smallest bias and RMSE in every scenario. From Tables 1–2, we observe that the proposed estimator for ATE outperforms the matching estimators based on covariates, the true propensity score, and the estimated propensity score in Settings I and II in terms of RMSE. From Tables 3–4, we observe that the proposed estimator for ATE shows a lower bias than the covariate-based matching estimator in Settings III and IV. This consistent performance highlights the effectiveness of the proposed estimator across varied settings, even when not uniformly outperforming all competitors.

In terms of interval estimation, the proposed estimator achieves coverage probabilities close to the nominal 95% level, suggesting strong reliability of the proposed method. While the IPW estimator offers good point estimation, it struggles with interval estimation accuracy, failing to maintain 95% coverage in Settings IV and V. Similarly, the DR estimator performs well for point estimation in Settings I–IV but falters in Setting V. The proposed estimator’s interval estimation remains stable, often surpassing IPW and DR in coverage accuracy, particularly in settings with extreme true propensity scores. Based on these simulation results, the DR, IPW, and proposed estimators are generally recommended; however, the proposed estimator is especially favorable for both point and interval estimation in complex scenarios such as Settings IV and V.

Table 1: Simulation Results for ATE in Setting I
Sample Size Method Bias SD RMSE AW CP
Covariates -0.0042 0.1963 0.1963 0.901 0.978
True PS -0.0005 0.2839 0.2839 1.210 0.960
N=200 Estimated PS 0.0024 0.2315 0.2315 1.200 0.993
Proposed -0.0040 0.1986 0.1986 0.815 0.967
IPW -0.0087 0.1816 0.1818 0.943 0.913
Doubly Robust -0.0084 0.1832 0.1834 0.888 0.958
Covariates 0.0061 0.1230 0.1232 0.572 0.990
True PS -0.0023 0.1749 0.1749 0.760 0.975
N=500 Estimated PS -0.0005 0.1434 0.1434 0.759 0.994
Proposed 0.0044 0.1151 0.1152 0.474 0.949
IPW 0.0011 0.1143 0.1143 0.591 0.833
Doubly Robust 0.0010 0.1129 0.1129 0.514 0.961
Covariates 0.0037 0.0835 0.0836 0.407 0.980
True PS 0.0034 0.1194 0.1194 0.536 0.968
N=1000 Estimated PS 0.0010 0.0965 0.0965 0.535 0.994
Proposed 0.0038 0.0769 0.0770 0.323 0.951
IPW 0.0003 0.0768 0.0768 0.418 0.686
Doubly Robust 0.0012 0.0769 0.0769 0.354 0.949
Table 2: Simulation Results for ATE in Setting II
Sample size Method Bias SD RMSE AW CP
Covariates -0.0103 0.4438 0.4439 2.770 0.997
True PS -0.0102 0.5329 0.5329 2.060 0.937
N=200 Estimated PS -0.0118 0.5189 0.5190 2.160 0.957
Proposed -0.0068 0.4711 0.4712 1.860 0.956
IPW -0.0073 0.4579 0.4580 1.715 0.938
Doubly Robust -0.0100 0.4609 0.4610 1.850 0.942
Covariates -0.0028 0.2854 0.2854 1.760 0.998
True PS -0.0018 0.3340 0.3340 1.300 0.943
N=500 Estimated PS -0.0166 0.3320 0.3324 1.340 0.963
Proposed -0.0040 0.2958 0.2959 1.140 0.946
IPW -0.0028 0.2950 0.2950 1.082 0.932
Doubly Robust -0.0018 0.2948 0.2948 1.147 0.939
Covariates 0.0032 0.1891 0.1891 1.250 0.996
True PS 0.0040 0.2245 0.2245 0.911 0.949
N=1000 Estimated PS 0.0104 0.2232 0.2235 0.931 0.959
Proposed 0.0053 0.1954 0.1954 0.802 0.957
IPW 0.0048 0.1930 0.1931 0.764 0.947
Doubly Robust 0.0044 0.1943 0.1944 0.806 0.953
Table 3: Simulation Results for ATE in Setting III
Sample size Method Bias SD RMSE AW CP
Covariates -0.0092 0.1722 0.1724 0.661 0.945
True PS -0.0103 0.1845 0.1848 0.718 0.947
N=200 Estimated PS -0.0089 0.1871 0.1873 0.724 0.950
Proposed -0.0106 0.1726 0.1729 0.686 0.959
IPW -0.0115 0.1601 0.1605 0.575 0.962
Doubly Robust -0.0105 0.1631 0.1635 0.731 0.955
Covariates 0.0028 0.1088 0.1089 0.422 0.947
True PS 0.0017 0.1166 0.1166 0.455 0.950
N=500 Estimated PS 0.0019 0.1147 0.1147 0.457 0.949
Proposed 0.0011 0.1026 0.1026 0.414 0.953
IPW 0.0014 0.0983 0.0983 0.360 0.953
Doubly Robust 0.0011 0.0987 0.0987 0.430 0.956
Covariates 0.0032 0.0730 0.0731 0.299 0.956
True PS 0.0004 0.0781 0.0781 0.320 0.949
N=1000 Estimated PS 0.0011 0.0770 0.0770 0.322 0.952
Proposed 0.0016 0.0680 0.0680 0.286 0.950
IPW 0.0010 0.0662 0.0662 0.254 0.957
Doubly Robust 0.0010 0.0669 0.0669 0.295 0.942
Table 4: Simulation Results for ATE in Setting IV
Sample size Method Bias SD RMSE AW CP
Covariates 0.0328 0.1691 0.1722 0.653 0.940
True PS -0.0082 0.1830 0.1832 0.707 0.949
N=200 Estimated PS -0.0066 0.1840 0.1842 0.714 0.946
Proposed -0.0084 0.1708 0.1710 0.674 0.953
IPW -0.0101 0.1589 0.1593 0.572 0.885
Doubly Robust -0.0098 0.1655 0.1658 0.775 0.958
Covariates 0.0326 0.1067 0.1116 0.415 0.933
True PS 0.0024 0.1141 0.1141 0.448 0.953
N=500 Estimated PS 0.0029 0.1126 0.1127 0.450 0.950
Proposed 0.0024 0.1005 0.1006 0.406 0.944
IPW 0.0018 0.0967 0.0967 0.359 0.769
Doubly Robust 0.0014 0.0994 0.0994 0.449 0.960
Covariates 0.0262 0.0718 0.0765 0.295 0.941
True PS 0.0008 0.0767 0.0767 0.315 0.948
N=1000 Estimated PS 0.0014 0.0757 0.0757 0.316 0.949
Proposed 0.0022 0.0671 0.0671 0.281 0.962
IPW 0.0012 0.0652 0.0652 0.253 0.560
Doubly Robust 0.0013 0.0678 0.0678 0.304 0.945
Table 5: Simulation Results for ATE in Setting V
Sample size Method Bias SD RMSE AW CP
Covariates -0.0009 0.3426 0.3426 1.490 0.980
True PS -0.0072 0.6568 0.6568 2.380 0.965
N=200 Estimated PS -0.0178 0.6552 0.6555 2.400 0.965
Proposed -0.0059 0.5001 0.5001 1.530 0.959
IPW 0.0218 0.4854 0.4859 0.850 0.899
Doubly Robust 0.2714 2.4132 2.4284 3.837 0.756
Covariates -0.0113 0.2713 0.2715 1.190 0.975
True PS -0.0410 0.5193 0.5209 2.050 0.971
N=500 Estimated PS -0.0469 0.5095 0.5116 2.060 0.973
Proposed 0.0007 0.4088 0.4088 1.240 0.956
IPW 0.0101 0.3903 0.3904 0.526 0.733
Doubly Robust 0.2610 3.7630 3.7720 2.268 0.676
Covariates -0.0196 0.2296 0.2305 1.010 0.979
True PS 0.0098 0.4852 0.4853 1.810 0.971
N=1000 Estimated PS 0.0069 0.4826 0.4827 1.810 0.969
Proposed 0.0042 0.3545 0.3545 1.010 0.953
IPW 0.0119 0.3653 0.3655 0.367 0.432
Doubly Robust 0.0403 3.1647 3.1650 2.213 0.697
Table 6: Simulation Results for ATT in Setting I
Sample size Method Bias SD RMSE AW CP
Covariates -0.0185 0.2352 0.2360 1.010 0.986
True PS 0.0002 0.3396 0.3396 1.380 0.971
N=200 Estimated PS 0.0056 0.2836 0.2836 1.370 0.992
Proposed -0.0049 0.2321 0.2321 1.010 0.996
IPW -0.0079 0.2100 0.2101 0.939 0.858
Doubly Robust -0.0092 0.2045 0.2047 1.101 0.973
Covariates 0.0032 0.1471 0.1471 0.642 0.984
True PS -0.0017 0.2086 0.2086 0.869 0.977
N=500 Estimated PS 0.0008 0.1789 0.1789 0.867 0.994
Proposed 0.0043 0.1357 0.1357 0.570 0.995
IPW 0.0013 0.1292 0.1292 0.592 0.671
Doubly Robust 0.0009 0.1259 0.1259 0.627 0.985
Covariates 0.0015 0.1006 0.1007 0.457 0.992
True PS 0.0062 0.1398 0.1399 0.612 0.977
N=1000 Estimated PS 0.0013 0.1247 0.1247 0.610 0.994
Proposed 0.0023 0.0903 0.0903 0.385 0.990
IPW 0.0010 0.0873 0.0873 0.420 0.355
Doubly Robust 0.0006 0.0852 0.0852 0.426 0.985
Table 7: Simulation Results for ATT in Setting II
Sample size Method Bias SD RMSE AW CP
Covariates -0.0145 0.6170 0.6172 3.150 1.000
True PS -0.0167 0.6115 0.6118 2.360 0.926
N=200 Estimated PS -0.0278 0.7186 0.7192 2.470 0.997
Proposed -0.0114 0.6413 0.6414 2.620 1.000
IPW -0.0145 0.6285 0.6287 1.724 0.939
Doubly Robust -0.0125 0.6133 0.6135 2.471 1.000
Covariates -0.0037 0.3813 0.3813 2.010 1.000
True PS -0.0139 0.3779 0.3781 1.490 0.942
N=500 Estimated PS -0.0167 0.4513 0.4516 1.540 0.997
Proposed -0.0100 0.3857 0.3858 1.570 1.000
IPW -0.0046 0.3853 0.3853 1.084 0.942
Doubly Robust -0.0052 0.3753 0.3753 1.544 1.000
Covariates 0.0043 0.2652 0.2652 1.420 1.000
True PS -0.0054 0.2579 0.2580 1.040 0.945
N=1000 Estimated PS 0.0082 0.3081 0.3082 1.060 0.996
Proposed 0.0039 0.2614 0.2614 1.090 1.000
IPW 0.0048 0.2658 0.2659 0.765 0.960
Doubly Robust 0.0029 0.2601 0.2601 1.089 1.000
Table 8: Simulation Results for ATT in Setting III
Sample size Method Bias SD RMSE AW CP
Covariates -0.0060 0.1951 0.1952 0.747 0.939
True PS -0.0041 0.2183 0.2183 0.840 0.948
N=200 Estimated PS -0.0045 0.2194 0.2194 0.853 0.957
Proposed -0.0096 0.1943 0.1945 0.828 0.978
IPW -0.0135 0.1714 0.1719 0.582 0.964
Doubly Robust -0.0142 0.1718 0.1724 0.640 0.955
Covariates 0.0033 0.1196 0.1197 0.478 0.945
True PS 0.0027 0.1351 0.1352 0.534 0.943
N=500 Estimated PS 0.0028 0.1366 0.1366 0.537 0.951
Proposed 0.0031 0.1122 0.1122 0.472 0.974
IPW 0.0019 0.1044 0.1045 0.362 0.955
Doubly Robust 0.0018 0.1049 0.1049 0.394 0.951
Covariates 0.0038 0.0853 0.0853 0.340 0.959
True PS 0.0004 0.0939 0.0939 0.376 0.957
N=1000 Estimated PS 0.0017 0.0910 0.0910 0.378 0.957
Proposed 0.0020 0.0755 0.0755 0.322 0.958
IPW 0.0009 0.0734 0.0734 0.255 0.957
Doubly Robust 0.0010 0.0734 0.0734 0.277 0.961
Table 9: Simulation Results for ATT in Setting IV
Sample size Method Bias SD RMSE AW CP
Covariates 0.0360 0.1922 0.1955 0.737 0.946
True PS -0.0019 0.2177 0.2177 0.826 0.940
N=200 Estimated PS -0.0021 0.2154 0.2154 0.841 0.959
Proposed -0.0075 0.1913 0.1915 0.801 0.978
IPW -0.0120 0.1709 0.1713 0.579 0.886
Doubly Robust -0.0140 0.1717 0.1722 0.635 0.945
Covariates 0.0334 0.1185 0.1231 0.471 0.944
True PS 0.0035 0.1326 0.1326 0.525 0.947
N=500 Estimated PS 0.0036 0.1339 0.1340 0.529 0.953
Proposed 0.0043 0.1099 0.1100 0.470 0.970
IPW 0.0024 0.1030 0.1030 0.360 0.774
Doubly Robust 0.0022 0.1035 0.1035 0.391 0.916
Covariates 0.0270 0.0839 0.0881 0.335 0.936
True PS 0.0006 0.0927 0.0927 0.370 0.950
N=1000 Estimated PS 0.0019 0.0899 0.0899 0.372 0.954
Proposed 0.0025 0.0747 0.0748 0.314 0.967
IPW 0.0011 0.0728 0.0728 0.254 0.561
Doubly Robust 0.0008 0.0723 0.0723 0.275 0.865
Table 10: Simulation Results for ATT in Setting V
Sample size Method Bias SD RMSE AW CP
Covariates -0.0144 0.3964 0.3966 1.700 0.969
True PS -0.0166 0.7851 0.7853 2.700 0.945
N=200 Estimated PS -0.0278 0.7820 0.7825 2.720 0.943
Proposed -0.0123 0.5971 0.5973 1.810 0.942
IPW -0.0005 0.5759 0.5759 0.861 0.744
Doubly Robust -0.1718 1.7912 1.7994 1.075 0.944
Covariates -0.0274 0.3227 0.3238 1.400 0.971
True PS -0.0558 0.6297 0.6322 2.390 0.964
N=500 Estimated PS -0.0635 0.6184 0.6216 2.400 0.962
Proposed -0.0014 0.4961 0.4961 1.480 0.931
IPW -0.0145 0.4563 0.4566 0.528 0.407
Doubly Robust -0.0886 3.1315 3.1328 0.663 0.936
Covariates -0.0338 0.2735 0.2756 1.200 0.974
True PS 0.0052 0.5912 0.5912 2.130 0.965
N=1000 Estimated PS 0.0015 0.5881 0.5882 2.140 0.968
Proposed 0.0027 0.4321 0.4321 1.210 0.920
IPW -0.0049 0.4259 0.4260 0.369 0.122
Doubly Robust -0.0964 1.7687 1.7713 0.466 0.931

5 Practical Data Analysis

In this section, we apply the estimation procedures to analyze data from the National Supported Work (NSW) Demonstration, a program funded by both federal and private sources in the mid-1970s. This initiative provided 6–18 months of work experience for individuals facing economic and social challenges before enrollment. Our analysis demonstrates the impact of the NSW Demonstration labor training program on the earnings of these individuals. To evaluate the effectiveness of the proposed estimator, we estimate the average treatment effect on the treated (ATT) using two control groups. We use two datasets: one drawn from the experimental sample in LaLonde (1986) and the other from Westat’s Matched Current Population Survey–Social Security Administration File (CPS-3) (Dehejia and Wahba, 1999). The data can be obtained from https://users.nber.org/~rdehejia/nswdata2.html.

In this dataset, the outcome variable (Re78) represents each individual’s real earnings in 1978, and the treatment variable (Treat) indicates whether the individual enrolled in the labor training program. The ten potential confounding variables include age (Age), years of schooling (Educ), race indicators for Black (Black) and Hispanic (Hispanic), marital status (Married), lack of a high school diploma (Nodegr), real earnings in 1974 (Re74) and 1975 (Re75), and indicators for zero earnings in 1974 (U74) and 1975 (U75). This subset comprises 185 individuals in the treated group, 260 individuals in the control group reported by LaLonde (1986), and 429 individuals from the CPS-3 control group. Table 11 presents the summary statistics for the NSW dataset and the p-values of the t-tests for comparing the means between the treated group and the two control groups. Table 11 shows that the treated and experimental control groups are well-balanced in terms of mean values for most variables, with the exception of the variable Nodegr. For the nine remaining covariates, we cannot reject the null hypothesis that the means are equal, indicating similar averages across the treated and control groups. However, when comparing treated units from the experimental data with control units from the non-experimental CPS-3 data, significant mean differences appear in eight out of the ten covariates. The only covariates for which the null hypothesis of equal means cannot be rejected are Hispanic and Educ.

Table 11: Summary statistics of the variables in the NSW dataset.
Variable | Experimental Treated (185), Mean (SD) | Experimental Control (260), Mean (SD) | Nonexperimental CPS-3 Control (429), Mean (SD) | p-value of two-sample t-test: Treated vs. Experimental Control | p-value: Treated vs. CPS-3 Control
Age 25.82 (7.16) 25.1 (7.10) 28.03 (10.79) 0.27 0.00
Educ 10.35 (2.01) 10.1 (1.60) 10.24 (2.86) 0.15 0.58
Black 0.84 (0.36) 0.83 (0.38) 0.20 (0.40) 0.65 0.00
Hispanic 0.06 (0.24) 0.11 (0.31) 0.14 (0.35) 0.06 0.13
Married 0.19 (0.39) 0.15 (0.36) 0.51 (0.50) 0.33 0.00
Nodegr 0.71 (0.46) 0.83 (0.37) 0.60 (0.49) 0.00 0.01
Re74 2.10 (4.89) 2.11 (5.69) 5.62 (6.79) 0.98 0.00
Re75 1.53 (3.22) 1.27 (3.10) 2.47 (3.29) 0.39 0.00
U74 0.71 (0.46) 0.75 (0.43) 0.26 (0.44) 0.33 0.00
U75 0.60 (0.49) 0.68 (0.47) 0.31 (0.46) 0.07 0.00

Earnings are expressed in thousands of 1978 dollars.

Table 12: Estimates of ATT, and their standard errors and corresponding 95% confidence intervals based on different methods for the NSW dataset.
Method | Experimental Group: Estimate (SE), 95% C.I. | CPS-3: Estimate (SE), 95% C.I.
Proposed 1807.7 (677.5) (479.8, 3135.6) 952.6 (731.4) (-480.9, 2386.1)
Covariates 1686.1 (866.4) (-12.0, 3384.2) -243.5 (1319.3) (-2829.3, 2342.3)
Estimated PS 2138.6 (797.8) (575.0, 3702.3) -944.9 (1383.7) (-3656.9, 1767.1)
IPW 1754.6 (695.5) (391.4, 3117.8) 1003.4 (698.6) (-365.8, 2372.6)
DR 1718.9 (751.1) (246.8, 3191.0) 836.0 (965.3) (-1056.0, 2728.0)

In Table 12, we present the estimates of ATT using the proposed kernel matching estimator (“Proposed”), the matching estimator based on covariates (“Covariates”), the matching estimator based on estimated propensity scores with M = 1 (“Estimated PS”), the inverse probability weighting (IPW) estimator, and the doubly robust (DR) estimator, together with their standard errors and corresponding 95% bootstrap confidence intervals for both control groups, each obtained from 400 bootstrap samples. For the experimental group, the ATT estimate based on the proposed kernel matching estimator is US$1807.7, the closest to the benchmark value of US$1794 among the estimation methods considered here. Moreover, the ATT estimate based on the proposed kernel matching estimator has the smallest standard error among the estimation methods considered here. The proposed kernel matching estimator, the IPW estimator, and the DR estimator give ATT estimates and confidence intervals within a similar range. For the non-experimental control group (CPS-3), the matching estimators based on covariates and estimated propensity scores produce negative estimates with large standard errors, which do not agree with the results obtained by Dehejia and Wahba (1999). In contrast, the proposed kernel matching estimator, along with the IPW and DR methods, yields ATT estimates within a similar range (836.0–1003.4). These findings suggest that the labor training program significantly increased individual earnings in 1978 and are consistent with the conclusions drawn from the experimental study conducted by Dehejia and Wahba (1999).

6 Concluding Remarks

In this paper, we introduced a novel kernel matching estimator for estimating treatment effects, focusing on addressing challenges in observational data analysis. Through rigorous theoretical analysis, we established the consistency and asymptotic normality of the proposed estimator and demonstrated its robustness under various simulation settings. The results from our Monte Carlo simulations revealed that the proposed estimator performs comparably to or better than established methods such as covariate-based matching, propensity score-based matching, inverse probability weighting (IPW), and doubly robust (DR) estimators, particularly in high-dimensional and complex scenarios.

The proposed kernel matching estimator stands out for its reliable interval estimation, consistently achieving coverage probabilities near the nominal level of 95%, even in settings where IPW and DR methods falter. This reliability, combined with competitive bias and RMSE performance, highlights the potential of the proposed estimator for both point and interval estimation in a wide range of applications. Our findings suggest that this method is a robust and flexible tool for causal inference, particularly in situations involving high-dimensional data or complex treatment assignment mechanisms. Future work may explore extensions to dynamic treatments and further refinements in bandwidth selection for enhanced performance.

For the proposed kernel matching estimator, the selection of bandwidth in kernel matching remains a critical challenge, as it requires a delicate balance between bias and variance. Achieving the theoretically optimal bandwidth in practical applications is often difficult, highlighting a key limitation of this approach. Furthermore, kernel matching involves calculating weighted distances for each sample based on the kernel function, which can significantly increase computational complexity, especially when dealing with large datasets. Future research could explore more efficient methods for bandwidth selection and computational optimization to enhance the scalability and practicality of kernel matching in real-world applications.

Appendix

Preliminary:

The influence function (Hampel, 1974) is crucial in analyzing the robustness of statistical estimators. Let T be a real-valued functional defined on a subset of the set of all probability measures on \mathbb{R}, and let F denote a probability measure on \mathbb{R} for which T is defined. Then, the influence function for the functional T(\cdot) is defined as

\Omega(y) = \left. \frac{d}{d\epsilon} T\left[(1 - \epsilon)F + \epsilon\,\delta_y(x)\right] \right|_{\epsilon = 0},

where 0 < \epsilon < 1 and

\delta_y(x) = \begin{cases} 0 & x < y, \\ 1 & x \geqslant y. \end{cases}
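As a one-line illustration of this definition (an added example, not part of the original derivation), consider the mean functional T(F) = \int y \, dF(y). Then

T\left[(1 - \epsilon)F + \epsilon\,\delta_{y_0}\right] = (1 - \epsilon) E_F(Y) + \epsilon y_0, \qquad \Omega(y_0) = y_0 - E_F(Y),

which has exactly the form of the influence function \Omega_1(w, y) = wy - E(WY) obtained below for T_1(F) = E(WY).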
Lemma 1 (Huber, 1981; Bickel et al., 1993).

Let X_1, \ldots, X_n be independent and identically distributed random variables from a distribution F, and let T(F) be a functional of F. Suppose T(F) has a Gateaux derivative, denoted by \Omega(x), and that E(\Omega^2(X_i)) < \infty. Then we have

\sqrt{n}\left(T(F_n) - T(F)\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Omega(X_i) + o_p(1).

Hence, \sqrt{n}\left(T(F_n) - T(F)\right) is asymptotically normal with mean 0 and variance E(\Omega^2(X_i)).

Proof of Theorem 1 (i):

We express \hat{\tau} as

\begin{split} \hat{\tau} &= \frac{1}{N} \sum_{i=1}^{N} W_i Y_i - \frac{1}{N} \sum_{i=1}^{N} W_i \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} (1 - W_j) K_h(X_j - X_i)} \\ &\quad + \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) \frac{\sum_{j=1}^{N} W_j Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} W_j K_h(X_j - X_i)} - \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) Y_i \\ &= T_1(F_n) - T_2(F_n) + T_3(F_n) - T_4(F_n), \end{split}

where

\begin{split} T_1(F_n) &= \frac{1}{N} \sum_{i=1}^{N} W_i Y_i = \int yw \, dF_n(y, w), \\ T_2(F_n) &= \frac{1}{N} \sum_{i=1}^{N} W_i \frac{\sum_{j=1}^{N} (1 - W_j) Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} (1 - W_j) K_h(X_j - X_i)} = \int z \frac{\int (1 - w) y K_h(x - u) \, dF_n(y, x, w)}{\int (1 - w) K_h(x - u) \, dF_n(y, x, w)} \, dF_n(u, z), \\ T_3(F_n) &= \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) \frac{\sum_{j=1}^{N} W_j Y_j K_h(X_j - X_i)}{\sum_{j=1}^{N} W_j K_h(X_j - X_i)} = \int (1 - z) \frac{\int w y K_h(x - u) \, dF_n(y, x, w)}{\int w K_h(x - u) \, dF_n(y, x, w)} \, dF_n(u, z), \\ T_4(F_n) &= \frac{1}{N} \sum_{i=1}^{N} (1 - W_i) Y_i = \int (1 - w) y \, dF_n(y, w). \end{split}

Let T(F) = T_1(F) - T_2(F) + T_3(F) - T_4(F), where the terms T_i(F) for i = 1, 2, 3, 4 are given by

\begin{split} T_1(F) &= E(WY), \\ T_2(F) &= E\left[Z \frac{E\left[Y(1 - W) K\left(\frac{X - U}{h}\right)\right]}{E\left[(1 - W) K\left(\frac{X - U}{h}\right)\right]}\right], \\ T_3(F) &= E\left[(1 - Z) \frac{E\left[W Y K\left(\frac{X - U}{h}\right)\right]}{E\left[W K\left(\frac{X - U}{h}\right)\right]}\right], \\ T_4(F) &= E[(1 - W)Y]. \end{split}

Here, (S, U, Z) has the same distribution as (Y, X, W), and (Y, X, W) \sim F_{Y,X,W}.
Next, we provide the formula for each term of T(F). Letting g_w(x) = E(Y(w) \mid X = x), we can obtain

\begin{split} \frac{E\left[Y(1 - W) K\left(\frac{X - u}{h}\right)\right]}{E\left[(1 - W) K\left(\frac{X - u}{h}\right)\right]} &= \frac{\int g_0(x) K\left(\frac{x - u}{h}\right) f(x \mid W = 0) \, dx}{\int K\left(\frac{x - u}{h}\right) f(x \mid W = 0) \, dx} \\ &= \frac{\int g_0(th + u) K(t) f(th + u \mid W = 0) \, dt}{\int K(t) f(th + u \mid W = 0) \, dt} \\ &= \frac{\int K(t) \left[g_0(u) + g_0'(u)th + \frac{g_0''(u)t^2h^2}{2} + \frac{g_0'''(\xi_1)t^3h^3}{3!}\right] \left[f(u \mid W = 0) + f'(u \mid W = 0)th + \frac{f''(u \mid W = 0)t^2h^2}{2} + \frac{f'''(\xi_2 \mid W = 0)t^3h^3}{3!}\right] dt}{\int K(t) \left[f(u \mid W = 0) + f'(u \mid W = 0)th + \frac{f''(u \mid W = 0)t^2h^2}{2} + \frac{f'''(\xi_2 \mid W = 0)t^3h^3}{3!}\right] dt} \\ &= g_0(u) + \frac{\int t^2 K(t) \, dt}{2} \left(\frac{2 g_0'(u) f'(u \mid W = 0)}{f(u \mid W = 0)} + g_0''(u)\right) h^2 + o(h^2). \end{split}

Then,

T_{2}(F) = E\left[Z\frac{E\left[Y(1-W)K\left(\frac{X-U}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-U}{h}\right)\right]}\right]
= E\left\{Z\left[g_{0}(U)+\frac{\int t^{2}K(t)\,dt}{2}\left(\frac{2g_{0}^{\prime}(U)f^{\prime}(U|W=0)}{f(U|W=0)}+g_{0}^{\prime\prime}(U)\right)h^{2}\right]\right\}+o(h^{2})
= E[Y(0)|Z=1]\Pr(Z=1)+E[b_{0}(U)|Z=1]\Pr(Z=1)h^{2}+o(h^{2}),

where b_{0}(U)=\frac{\int t^{2}K(t)\,dt}{2}\left[\frac{2g_{0}^{\prime}(U)f^{\prime}(U|W=0)}{f(U|W=0)}+g_{0}^{\prime\prime}(U)\right]. Similarly, it follows that

T_{3}(F) = E\left\{(1-Z)\left[g_{1}(U)+\frac{\int t^{2}K(t)\,dt}{2}\left(\frac{2g_{1}^{\prime}(U)f^{\prime}(U|W=1)}{f(U|W=1)}+g_{1}^{\prime\prime}(U)\right)h^{2}\right]\right\}+o(h^{2})
= E[Y(1)|Z=0]\Pr(Z=0)+E[b_{1}(U)|Z=0]\Pr(Z=0)h^{2}+o(h^{2}),

where b_{1}(U)=\frac{\int t^{2}K(t)\,dt}{2}\left[\frac{2g_{1}^{\prime}(U)f^{\prime}(U|W=1)}{f(U|W=1)}+g_{1}^{\prime\prime}(U)\right].
Thus,

T(F)=T1(F)T2(F)+T3(F)T4(F)=τ+Bh2+o(h2),T(F)=T_{1}(F)-T_{2}(F)+T_{3}(F)-T_{4}(F)=\tau+Bh^{2}+o(h^{2}),

where B=E[b_{0}(U)|Z=1]\Pr(Z=1)+E[b_{1}(U)|Z=0]\Pr(Z=0).
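The h^{2} order of the bias term Bh^{2} can be checked by simulation: halving h should roughly quarter the bias whenever the curvature terms b_{0} and b_{1} are nonzero. The following rough sketch is our construction, reusing kernel_matching_ate from the snippet above and choosing quadratic outcome curves so that g_{w}^{\prime\prime}\neq 0.

import numpy as np

rng = np.random.default_rng(2)

def mc_bias(h, reps=200, N=4000):
    errs = []
    for _ in range(reps):
        x = rng.normal(size=N)
        w = rng.binomial(1, 1 / (1 + np.exp(-x)))
        # g_1(x) = 1 + x + x^2 and g_0(x) = x^2, so the true ATE is 1
        y = w * (1 + x + x**2) + (1 - w) * x**2 + rng.normal(scale=0.1, size=N)
        errs.append(kernel_matching_ate(y, w, x, h) - 1.0)
    return np.mean(errs)

for h in (0.8, 0.4, 0.2):
    print(h, mc_bias(h))   # the bias should shrink roughly fourfold as h halves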

To complete the proof, we derive the influence function of T(F) by the methods presented in Hampel, (1974), Hettmansperger, (1984), and Hines et al., (2022). The influence functions of T_{i}(F), for i=1,2,3,4, are, respectively,

Ω1(w,y)\displaystyle\Omega_{1}(w,y) =\displaystyle= wyE(WY),\displaystyle wy-E(WY),
Ω2(y,x,w)\displaystyle\Omega_{2}(y,x,w) =\displaystyle= wE[Y(1W)K(Xxh)]E[(1W)K(Xxh)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle\,w\frac{E[Y(1-W)K\left(\frac{X-x}{h}\right)]}{E[(1-W)K\left(\frac{X-x}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]
+(1w)yE[ZK(xUh)E[(1W)K(XUh)]]\displaystyle+(1-w)yE\left[Z\frac{K\left(\frac{x-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]
(1w)E[ZE[Y(1W)K(XUh)K(xUh)](E[(1W)K(XUh)])2],\displaystyle-(1-w)E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)K\left(\frac{x-U}{h}\right)]}{\left(E[(1-W)K\left(\frac{X-U}{h}\right)]\right)^{2}}\right],
Ω3(x,y,w)\displaystyle\Omega_{3}(x,y,w) =\displaystyle= (1w)E[WYK(Xxh)]E[WK(Xxh)]E[(1Z)E[WYK(XUh)]E[WK(XUh)]]\displaystyle\,(1-w)\frac{E[WYK\left(\frac{X-x}{h}\right)]}{E[WK\left(\frac{X-x}{h}\right)]}-E\left[(1-Z)\frac{E[WYK\left(\frac{X-U}{h}\right)]}{E[WK\left(\frac{X-U}{h}\right)]}\right]
+wyE[(1Z)K(xUh)E[WK(XUh)]]\displaystyle+wyE\left[(1-Z)\frac{K\left(\frac{x-U}{h}\right)}{E[WK\left(\frac{X-U}{h}\right)]}\right]
wE[(1Z)E[YWK(XUh)K(xUh)](E[WK(XUh)])2],\displaystyle-wE\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)K\left(\frac{x-U}{h}\right)]}{\left(E[WK\left(\frac{X-U}{h}\right)]\right)^{2}}\right],
Ω4(w,y)\displaystyle\Omega_{4}(w,y) =\displaystyle= (1w)yE[(1W)Y].\displaystyle(1-w)y-E[(1-W)Y].

Therefore, by Lemma 1 and Markov’s inequality, as h\to 0 we can obtain

\hat{\tau}-\tau = T\left(F_{n}\right)-T\left(F\right)+Bh^{2}+o(h^{2})
= \frac{1}{N}\sum_{i=1}^{N}\Omega\left(X_{i},W_{i},Y_{i}\right)+Bh^{2}+o(h^{2})+o_{p}(N^{-1/2})\xrightarrow{p}0,

where Ω(Xi,Wi,Yi)=Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)+Ω3(Xi,Wi,Yi)Ω4(Wi,Yi)\Omega\left(X_{i},W_{i},Y_{i}\right)=\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+\Omega_{3}\left(X_{i},W_{i},Y_{i}\right)-\Omega_{4}\left(W_{i},Y_{i}\right).

Proof of Theorem 1 (ii):

Based on the kernel matching approach for estimating the treatment effect, we express the estimator for the average treatment effect on the treated (ATT) as follows:

τ^t=\displaystyle\hat{\tau}_{t}= 1N1i=1NWi(Yij=1N(1Wj)YjK(XjXih)j=1N(1Wj)K(XjXih))\displaystyle\frac{1}{N_{1}}\sum_{i=1}^{N}W_{i}\left(Y_{i}-\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{X_{j}-X_{i}}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{X_{j}-X_{i}}{h}\right)}\right)
=\displaystyle= NN11Ni=1NWi(Yij=1N(1Wj)YjK(XjXih)j=1N(1Wj)K(XjXih)).\displaystyle\frac{N}{N_{1}}\frac{1}{N}\sum_{i=1}^{N}W_{i}\left(Y_{i}-\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{X_{j}-X_{i}}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{X_{j}-X_{i}}{h}\right)}\right).
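A minimal implementation of \hat{\tau}_{t} mirrors the ATE sketch above, except that only treated units are averaged and only the control-arm imputation \hat{g}_{0} is needed; as before, the names and the Gaussian kernel are our assumptions.

import numpy as np

def kernel_matching_att(y, w, x, h):
    """ATT: average over treated units of Y_i minus its imputed Y_i(0)."""
    y, w, x = map(np.asarray, (y, w, x))
    kw = np.exp(-0.5 * ((x[None, :] - x[:, None]) / h) ** 2)   # K_h(X_j - X_i)
    m0 = (kw * ((1 - w) * y)).sum(1) / (kw * (1 - w)).sum(1)   # imputed g_0(X_i)
    return np.sum(w * (y - m0)) / w.sum()                      # divide by N_1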

Let Bt=𝔼[b0(U)|Z=1]Pr(Z=1)B_{t}=\mathbb{E}[b_{0}(U)|Z=1]\Pr(Z=1). From Theorem 1 (i), we know that

\hat{\tau}_{t}-\tau_{t} = \frac{N}{N_{1}}\left\{\left[T_{1}\left(F_{n}\right)-T_{1}(F)\right]-\left[T_{2}\left(F_{n}\right)-T_{2}(F)\right]+B_{t}h^{2}\right\}+o(h^{2})
= \frac{N}{N_{1}}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]+B_{t}h^{2}\right\}+o(h^{2}).

By Lemma 1 and Markov’s inequality, as h\to 0 and N/N_{1}\xrightarrow{p}1/E[p(X)], it follows that

τ^tτt𝑝0.\hat{\tau}_{t}-\tau_{t}\xrightarrow{p}0.

Proof of Theorem 2 (i):

According to Lemma 1 and Theorem 1, we have

N(τ^Bh2τ)=1Ni=1NΩ(Xi,Wi,Yi)+op(1).\sqrt{N}(\hat{\tau}-Bh^{2}-\tau)=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\Omega\left(X_{i},W_{i},Y_{i}\right)+o_{p}(1).

Since the terms \Omega\left(X_{i},W_{i},Y_{i}\right) are independent and identically distributed across i, applying the Central Limit Theorem, we obtain

\sqrt{N}(\hat{\tau}-Bh^{2}-\tau)\xrightarrow{d}\mathcal{N}(0,\sigma^{2}),

where σ2=E(Ω2(Xi,Wi,Yi))\sigma^{2}=E\left(\Omega^{2}\left(X_{i},W_{i},Y_{i}\right)\right), which can be expressed as

σ2=σ12+σ22+σ32+σ422c1+2c22c32c4+2c52c6,\sigma^{2}=\sigma^{2}_{1}+\sigma^{2}_{2}+\sigma^{2}_{3}+\sigma^{2}_{4}-2c_{1}+2c_{2}-2c_{3}-2c_{4}+2c_{5}-2c_{6},

with

σ12\displaystyle\sigma^{2}_{1} =E[Ω12(Wi,Yi)],\displaystyle=E[\Omega_{1}^{2}(W_{i},Y_{i})], σ22\displaystyle\sigma^{2}_{2} =E[Ω22(Xi,Wi,Yi)],\displaystyle=E[\Omega_{2}^{2}(X_{i},W_{i},Y_{i})],
σ32\displaystyle\sigma^{2}_{3} =E[Ω32(Xi,Wi,Yi)],\displaystyle=E[\Omega_{3}^{2}(X_{i},W_{i},Y_{i})], σ42\displaystyle\sigma^{2}_{4} =E[Ω42(Wi,Yi)],\displaystyle=E[\Omega_{4}^{2}(W_{i},Y_{i})],
c1\displaystyle c_{1} =E[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{2}(X_{i},W_{i},Y_{i})], c2\displaystyle c_{2} =E[Ω1(Wi,Yi)Ω3(Xi,Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{3}(X_{i},W_{i},Y_{i})],
c3\displaystyle c_{3} =E[Ω1(Wi,Yi)Ω4(Wi,Yi)],\displaystyle=E[\Omega_{1}(W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})], c4\displaystyle c_{4} =E[Ω2(Xi,Wi,Yi)Ω3(Xi,Wi,Yi)],\displaystyle=E[\Omega_{2}(X_{i},W_{i},Y_{i})\Omega_{3}(X_{i},W_{i},Y_{i})],
c5\displaystyle c_{5} =E[Ω2(Xi,Wi,Yi)Ω4(Wi,Yi)],\displaystyle=E[\Omega_{2}(X_{i},W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})], c6\displaystyle c_{6} =E[Ω3(Xi,Wi,Yi)Ω4(Wi,Yi)].\displaystyle=E[\Omega_{3}(X_{i},W_{i},Y_{i})\Omega_{4}(W_{i},Y_{i})].

The terms in σ2\sigma^{2} are provided as follows. First, the terms σi2\sigma^{2}_{i}, for i=1,2,3,4i=1,2,3,4, can be expressed as

σ12\displaystyle\sigma_{1}^{2} =\displaystyle= E(Y2(1)|W=1)Pr(W=1)E2(Y(1)|W=1)[Pr(W=1)]2,\displaystyle E(Y^{2}(1)|W=1)\Pr(W=1)-E^{2}(Y(1)|W=1)[{\Pr}(W=1)]^{2},
σ22\displaystyle\sigma_{2}^{2} =\displaystyle= E{WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle E\left\{W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.
+Yi(1Wi)E[ZK(XiUh)E[(1W)K(XUh)]]\displaystyle\left.+Y_{i}\left(1-W_{i}\right)E\left[Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.
(1Wi)E[ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2]}2,\displaystyle\left.-\left(1-W_{i}\right)E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right]\right\}^{2},
σ32\displaystyle\sigma_{3}^{2} =\displaystyle= E{(1Wi)E[WYK(XXih)]E[WK(XXih)]E[(1Z)E[WYK(XUh)]E[WK(XUh)]]\displaystyle E\left\{(1-W_{i})\frac{E\left[WYK\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[WK\left(\frac{X-X_{i}}{h}\right)\right]}-E\left[(1-Z)\frac{E\left[WYK\left(\frac{X-U}{h}\right)\right]}{E\left[WK\left(\frac{X-U}{h}\right)\right]}\right]\right.
+YiWiE((1Z)K(XiUh)E[WK(XUh)])\displaystyle\left.+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{E\left[WK\left(\frac{X-U}{h}\right)\right]}\right)\right.
WiE[(1Z)E[YWK(XUh)]K(XiUh)[E[WK(XUh)]]2]}2,\displaystyle\left.-W_{i}E\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[WK\left(\frac{X-U}{h}\right)]\right]^{2}}\right]\right\}^{2},
\sigma_{4}^{2} = E(Y^{2}(0)|W=0)\Pr(W=0)-E^{2}(Y(0)|W=0)[\Pr(W=0)]^{2}.

The terms c_{i}, for i=1,2,\ldots,6, are, respectively,

c1\displaystyle c_{1} =\displaystyle= E(WiYi)E[Y(1W)K(XXih)]E[(1W)K(XXih)]E(Wi)E(WY)E[Y(1W)K(XXih)]E[(1W)K(XXih)]\displaystyle E(W_{i}Y_{i})\frac{E\left[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}-E(W_{i})E(WY)\frac{E\left[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[(1-W)K\left(\frac{X-X_{i}}{h}\right)\right]}
E(1Wi)YiE(WY)E(ZK(XiUh)E(1W)K(XUh))\displaystyle-E(1-W_{i})Y_{i}E(WY)E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E(1-W)K\left(\frac{X-U}{h}\right)}\right)
+E{(1Wi)E(WY)E(ZEY(1W)K(XUh)K(XiUh)[E(1W)K(XUh)]2)},\displaystyle+E\left\{(1-W_{i})E(WY)E\left(Z\frac{EY(1-W)K\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[E(1-W)K\left(\frac{X-U}{h}\right)\right]^{2}}\right)\right\},
c2\displaystyle c_{2} =\displaystyle= EYiWiE((1Z)K(XiUh)EWK(XUh))EYiWiE((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2)\displaystyle EY_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)-EY_{i}W_{i}E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right)
E(1Wi)E(WY)EYWK(XXih)EWK(XXih)EYiWiE(WY)E((1Z)K(XiUh)EWK(XUh))\displaystyle-E(1-W_{i})E(WY)\frac{EYWK\left(\frac{X-X_{i}}{h}\right)}{EWK\left(\frac{X-X_{i}}{h}\right)}-EY_{i}W_{i}E(WY)E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)
+EWiE(WY)E((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2),\displaystyle+EW_{i}E(WY)E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right),
c3\displaystyle c_{3} =\displaystyle= E(Y(1)|W=1)E(Y(0)|W=0)Pr(W=1)Pr(W=0),\displaystyle-E(Y(1)|W=1)E(Y(0)|W=0)\Pr(W=1)\Pr(W=0),
c4\displaystyle c_{4} =\displaystyle= E{[WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]E[ZE[Y(1W)K(XUh)]E[(1W)K(XUh)]]\displaystyle E\left\{\left[W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}-E\left[Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right]\right.\right.
+Yi(1Wi)E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle\left.+Y_{i}(1-W_{i})E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)\right.
(1Wi)E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2)]\displaystyle\left.-(1-W_{i})E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right)\right]
[(1Wi)E[YWK(XXih)]E[WK(XXih)]E[(1Z)E[YWK(XUh)]E[WK(XUh)]]\displaystyle\left.\left[(1-W_{i})\frac{E[YWK\left(\frac{X-X_{i}}{h}\right)]}{E[WK\left(\frac{X-X_{i}}{h}\right)]}-E\left[(1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]}{E[WK\left(\frac{X-U}{h}\right)]}\right]\right.\right.
+YiWiE((1Z)K(XiUh)E[WK(XUh)])\displaystyle\left.+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[WK\left(\frac{X-U}{h}\right)]}\right)\right.
WiE((1Z)E[YWK(XUh)]K(XiUh)[E[WK(XUh)]]2)},\displaystyle\left.-W_{i}E\left((1-Z)\frac{E[YWK\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[WK\left(\frac{X-U}{h}\right)]\right]^{2}}\right)\right\},
c5\displaystyle c_{5} =\displaystyle= E[(1Wi)Yi2]E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle E[(1-W_{i})Y^{2}_{i}]E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)
E[(1Wi)Yi]E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2)\displaystyle-E[(1-W_{i})Y_{i}]E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right)
E[(1W)Y]E[WiE[Y(1W)K(XXih)]E[(1W)K(XXih)]]\displaystyle-E[(1-W)Y]E\left[W_{i}\frac{E[Y(1-W)K\left(\frac{X-X_{i}}{h}\right)]}{E[(1-W)K\left(\frac{X-X_{i}}{h}\right)]}\right]
E[(1W)Y]E[(1Wi)Yi]E(ZK(XiUh)E[(1W)K(XUh)])\displaystyle-E[(1-W)Y]E\left[(1-W_{i})Y_{i}\right]E\left(Z\frac{K\left(\frac{X_{i}-U}{h}\right)}{E[(1-W)K\left(\frac{X-U}{h}\right)]}\right)
+E[(1W)Y]E[(1Wi)]E(ZE[Y(1W)K(XUh)]K(XiUh)[E[(1W)K(XUh)]]2),\displaystyle+E[(1-W)Y]E[(1-W_{i})]E\left(Z\frac{E[Y(1-W)K\left(\frac{X-U}{h}\right)]K\left(\frac{X_{i}-U}{h}\right)}{\left[E[(1-W)K\left(\frac{X-U}{h}\right)]\right]^{2}}\right),
c_{6} = E\left\{\left[(1-W_{i})Y_{i}-E[(1-W)Y]\right]\left[(1-W_{i})\frac{E\left[YWK\left(\frac{X-X_{i}}{h}\right)\right]}{E\left[WK\left(\frac{X-X_{i}}{h}\right)\right]}\right.\right.
E[(1Z)EYWK(XUh)EWK(XUh)]+YiWiE((1Z)K(XiUh)EWK(XUh))\displaystyle-E\left[(1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right]+Y_{i}W_{i}E\left((1-Z)\frac{K\left(\frac{X_{i}-U}{h}\right)}{EWK\left(\frac{X-U}{h}\right)}\right)
WiE((1Z)EYWK(XUh)K(XiUh)[EWK(XUh)]2)]}.\displaystyle\left.\left.-W_{i}E\left((1-Z)\frac{EYWK\left(\frac{X-U}{h}\right)K\left(\frac{X_{i}-U}{h}\right)}{\left[EWK\left(\frac{X-U}{h}\right)\right]^{2}}\right)\right]\right\}.

Therefore,

N{τ^τBh2}𝑑𝒩(0,σ2),\sqrt{N}\left\{\hat{\tau}-\tau-Bh^{2}\right\}\xrightarrow{d}\mathcal{N}(0,\sigma^{2}),

where \sigma^{2}=E[\Omega^{2}(X_{i},W_{i},Y_{i})] and B=E[b_{0}(U)|Z=1]\Pr(Z=1)+E[b_{1}(U)|Z=0]\Pr(Z=0), as derived in the proof of Theorem 1 (i).
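Because the plug-in form of \sigma^{2} involves many kernel-weighted moments, interval estimates are often easier to obtain by resampling. The following percentile-bootstrap sketch is our construction (reusing kernel_matching_ate from the earlier snippet), not a procedure prescribed by the theorem.

import numpy as np

def bootstrap_ci(y, w, x, h, B=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for tau_hat, resampling (Y, W, X) triples."""
    y, w, x = map(np.asarray, (y, w, x))
    rng = np.random.default_rng(seed)
    n = len(y)
    taus = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample units with replacement
        taus[b] = kernel_matching_ate(y[idx], w[idx], x[idx], h)
    return tuple(np.quantile(taus, [alpha / 2, 1 - alpha / 2]))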

Proof of Theorem 2 (ii):

According to Lemma 1 and Theorem 1, we can obtain

N(τ^tBth2τt)\displaystyle\sqrt{N}(\hat{\tau}_{t}-B_{t}h^{2}-\tau_{t}) =NN1{1Ni=1N[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]}+op(1)\displaystyle=\frac{N}{N_{1}}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]\right\}+o_{p}(1)
1E(p(X)){1Ni=1N[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]}+op(1).\displaystyle\to\frac{1}{E(p(X))}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)\right]\right\}+o_{p}(1).

Applying the Central Limit Theorem, we have

N(τ^tBth2τt)𝑑N(0,σt2),\sqrt{N}(\hat{\tau}_{t}-B_{t}h^{2}-\tau_{t})\xrightarrow{d}N(0,\sigma_{t}^{2}),

where σt2=1E2(p(X))E[Ω1(Wi,Yi)Ω2(Xi,Wi,Yi)]2\sigma_{t}^{2}=\frac{1}{E^{2}(p(X))}E[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)]^{2}.

Proof of Theorem 3 (i):

Based on data {(Wi,Xi)}i=1N\left\{\left(W_{i},X_{i}\right)\right\}_{i=1}^{N}, let the score function and the Fisher information matrix of β\beta be

S(β)=1Ni=1NXiWiF(XiTβ)F(XiTβ){1F(XiTβ)}f(XiTβ),S(\beta)=\frac{1}{N}\sum_{i=1}^{N}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta\right)}{F\left(X_{i}^{\mathrm{T}}\beta\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta\right),
(β)=E[f(XTβ)2F(XTβ){1F(XTβ)}XXT],\mathcal{I}(\beta)=E\left[\frac{f\left(X^{\mathrm{T}}\beta\right)^{2}}{F\left(X^{\mathrm{T}}\beta\right)\left\{1-F\left(X^{\mathrm{T}}\beta\right)\right\}}XX^{\mathrm{T}}\right],

where f(t)=\mathrm{d}F(t)/\mathrm{d}t. Because \hat{\beta} is the solution to the score equation S(\beta)=0, under certain regularity conditions, \hat{\beta}-\beta^{*}=\mathcal{I}^{-1}_{\beta^{*}}S\left(\beta^{*}\right)+o_{p}\left(N^{-1/2}\right).
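For the logistic choice F(t)=1/(1+e^{-t}) one has f=F(1-F), so S(\beta) reduces to \frac{1}{N}\sum_{i}X_{i}\{W_{i}-F(X_{i}^{\mathrm{T}}\beta)\} and the score equation can be solved by Newton’s method. The sketch below is our construction for this special case.

import numpy as np

def fit_propensity(X, w, iters=25):
    """Solve S(beta) = 0 by Newton's method for logistic F.
    X is the N x p design (include an intercept column); w is 0/1 treatment."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))                  # F(X_i' beta)
        score = X.T @ (w - p) / len(w)                       # S(beta)
        info = (X * (p * (1 - p))[:, None]).T @ X / len(w)   # Fisher information
        beta += np.linalg.solve(info, score)                 # step: beta + I^{-1} S
    return beta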
By a Taylor series expansion, we can write

\widehat{\tau}_{N}(\hat{\beta}) = \widehat{\tau}_{N}(\beta^{*})+\frac{\partial\widehat{\tau}_{N}(\beta^{*})}{\partial\beta^{\top}}(\hat{\beta}-\beta^{*})+o_{p}(\|\hat{\beta}-\beta^{*}\|).

Let

τ^N(β)β=T1~(Fn)+T2~(Fn),\displaystyle\frac{\partial\widehat{\tau}_{N}(\beta^{*})}{\partial\beta^{\top}}=\widetilde{T_{1}}(F_{n})+\widetilde{T_{2}}(F_{n}),

where

T1~(Fn)\displaystyle\widetilde{T_{1}}(F_{n})
=\displaystyle= 1Ni=1NWi{j=1N(1Wj)YjK(F(Xjβ)F(Xiβ)h)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle\frac{1}{N}\sum_{i=1}^{N}W_{i}\left\{\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle\left.\times\frac{\sum_{j=1}^{N}(1-W_{j})K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
j=1N(1Wj)YjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)\displaystyle-\left.\frac{\sum_{j=1}^{N}(1-W_{j})Y_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)j=1N(1Wj)K(F(Xjβ)F(Xiβ)h)},\displaystyle\times\left.\frac{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}(1-W_{j})K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right\},
T2~(Fn)\displaystyle\widetilde{T_{2}}(F_{n})
=\displaystyle= 1Ni=1N(1Wi){j=1NWjK(F(Xjβ)F(Xiβ)h)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle\frac{1}{N}\sum_{i=1}^{N}(1-W_{i})\left\{\frac{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
×j=1NWjYjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle\left.\times\frac{\sum_{j=1}^{N}W_{j}Y_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
j=1NWjK(F(Xjβ)F(Xiβ)h)f(Xjβ)f(Xiβ)h(XjXi)j=1NWjK(F(Xjβ)F(Xiβ)h)\displaystyle-\left.\frac{\sum_{j=1}^{N}W_{j}K^{\prime}\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X_{j}^{\top}\beta^{*}\right)-f\left(X_{i}^{\top}\beta^{*}\right)}{h}(X_{j}-X_{i})^{\top}}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right.
\times\left.\frac{\sum_{j=1}^{N}W_{j}Y_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}{\sum_{j=1}^{N}W_{j}K\left(\frac{F\left(X_{j}^{\top}\beta^{*}\right)-F\left(X_{i}^{\top}\beta^{*}\right)}{h}\right)}\right\}.

Let T~(F)=T1~(F)+T2~(F),\widetilde{T}(F)=\widetilde{T_{1}}(F)+\widetilde{T_{2}}(F), where

T1~(F)\displaystyle\widetilde{T_{1}}(F)
=\displaystyle= E(Z){E((1W)YK(F(Xβ)F(Uβ)h))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle E(Z)\left\{\frac{E\left((1-W)YK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E((1W)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle\left.\times\frac{E\left((1-W)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
E((1W)YK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E((1W)K(F(Xβ)F(Uβ)h))\displaystyle-\left.\frac{E\left((1-W)YK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E((1W)K(F(Xβ)F(Uβ)h))E((1W)K(F(Xβ)F(Uβ)h))}\displaystyle\times\left.\frac{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left((1-W)K\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right\}
=\displaystyle= E(Z){E(Y(0)|F(Uβ))\displaystyle E(Z)\Biggl{\{}E(Y(0)|F\left(U^{\top}\beta^{*}\right))
×E[K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=0]f(F(Uβ)|W=0)\displaystyle\left.\times\frac{E\left[K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=0\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=0\right)}\right.
E[E(Y(0)|X)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=0]f(F(Uβ)|W=0)},\displaystyle-\frac{E\left[E(Y(0)|X)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=0\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=0\right)}\Biggr{\}},
T2~(F)\displaystyle\widetilde{T_{2}}(F)
= E(1-Z)\left\{\frac{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E(WYK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E(WK(F(Xβ)F(Uβ)h))\displaystyle\qquad\left.\times\frac{E\left(WYK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
E(WK(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU))E(WK(F(Xβ)F(Uβ)h))\displaystyle\qquad-\left.\frac{E\left(WK^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f\left(X^{\top}\beta^{*}\right)-f\left(U^{\top}\beta^{*}\right)}{h}(X-U)^{\top}\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right.
×E(WYK(F(Xβ)F(Uβ)h))E(WK(F(Xβ)F(Uβ)h))}\displaystyle\qquad\times\left.\frac{E\left(WYK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}{E\left(WK\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\right)}\right\}
=\displaystyle= E(1Z){E(Y(1)|F(Uβ))\displaystyle E(1-Z)\Biggl{\{}E(Y(1)|F\left(U^{\top}\beta^{*}\right))
×E[K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=1]f(F(Uβ)|W=1)\displaystyle\left.\times\frac{E\left[K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=1\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=1\right)}\right.
E[E(Y(1)|X)K(F(Xβ)F(Uβ)h)f(Xβ)f(Uβ)h(XU)|W=1]f(F(Uβ)|W=1)}.\displaystyle-\frac{E\left[E(Y(1)|X)K^{\prime}\left(\frac{F\left(X^{\top}\beta^{*}\right)-F\left(U^{\top}\beta^{*}\right)}{h}\right)\frac{f(X^{\top}\beta^{*})-f(U^{\top}\beta^{*})}{h}(X-U)^{\top}|W=1\right]}{f\left(F\left(U^{\top}\beta^{*}\right)|W=1\right)}\Biggr{\}}.

Thus, by Lemma 1 and Markov’s inequality, as h\to 0, we have

\widehat{\tau}_{N}(\hat{\beta})-\tau = \widehat{\tau}_{N}(\beta^{*})-\tau+C(\hat{\beta}-\beta^{*})+o_{p}(1)
= \frac{1}{N}\sum^{N}_{i=1}\left\{\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}+Bh^{2}+o(h^{2})+o_{p}(1)\xrightarrow{p}0,

where C=T~(F)β1C=\widetilde{T}(F)\mathcal{I}_{\beta^{*}}^{-1}.
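In practice, the estimator \widehat{\tau}_{N}(\hat{\beta}) analyzed here matches on the fitted propensity score F(X_{i}^{\mathrm{T}}\hat{\beta}) rather than on X itself, which keeps the matching variable one-dimensional. A minimal sketch, our construction reusing fit_propensity and kernel_matching_ate from the earlier snippets:

import numpy as np

def ps_kernel_matching_ate(y, w, X, h):
    """Kernel matching on the estimated propensity score F(X_i' beta_hat)."""
    beta_hat = fit_propensity(X, w)
    ps = 1.0 / (1.0 + np.exp(-np.asarray(X, dtype=float) @ beta_hat))
    return kernel_matching_ate(y, w, ps, h)   # one-dimensional matching variable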

Proof of Theorem 3 (ii):

From Theorem 3 (i) and N/N1𝑝1/E[p(X)]N/N_{1}\xrightarrow{p}1/E[p(X)], we have

\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t} = \frac{N}{N_{1}}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1)
\to\frac{1}{E[p(X)]}\left\{\frac{1}{N}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1),

where C_{t}=\widetilde{T_{1}}(F)\mathcal{I}_{\beta^{*}}^{-1}. By Lemma 1 and Markov’s inequality, as h\to 0, it follows that

τ^t,N(β^)τt𝑝0.\widehat{\tau}_{t,N}(\hat{\beta})-\tau_{t}\xrightarrow{p}0.

Proof of Theorem 4 (i):

Based on Theorem 3(i) and Lemma 1, we can obtain that

N(τ^N(β^)Bh2τ)\displaystyle\sqrt{N}(\widehat{\tau}_{N}(\hat{\beta})-Bh^{2}-\tau)
=\frac{1}{\sqrt{N}}\sum^{N}_{i=1}\left\{\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}+o_{p}(1).

Since the summands \Omega\left(X_{i},W_{i},Y_{i}\right)+CS_{\beta^{*}}(X_{i},W_{i}) are independent and identically distributed across i, where S_{\beta^{*}}(X_{i},W_{i})=X_{i}\{W_{i}-F(X_{i}^{\mathrm{T}}\beta^{*})\}f(X_{i}^{\mathrm{T}}\beta^{*})/[F(X_{i}^{\mathrm{T}}\beta^{*})\{1-F(X_{i}^{\mathrm{T}}\beta^{*})\}] denotes the score contribution of unit i, applying the Central Limit Theorem, we have

N(τ^N(β^)Bh2τ)𝑑N(0,σ~2),\sqrt{N}(\widehat{\tau}_{N}(\hat{\beta})-Bh^{2}-\tau)\xrightarrow{d}N(0,\widetilde{\sigma}^{2}),

where \widetilde{\sigma}^{2}=E\left[\Omega\left(X_{i},W_{i},Y_{i}\right)+CX_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]^{2}.

Proof of Theorem 4 (ii):

To prove part (ii) of Theorem 4, from Lemma 1 and Theorem 3, we obtain

\sqrt{N}(\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t}) = \frac{N}{N_{1}}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1)
\to\frac{1}{E(p(X))}\left\{\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left(1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right)}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]\right\}+o_{p}(1).

Because the summands \Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}S_{\beta^{*}}(X_{i},W_{i}) are independent and identically distributed across i, applying the Central Limit Theorem,

\sqrt{N}(\widehat{\tau}_{t,N}(\hat{\beta})-B_{t}h^{2}-\tau_{t})\xrightarrow{d}N(0,\widetilde{\sigma}_{t}^{2}),

where

\widetilde{\sigma}_{t}^{2} = \frac{1}{E^{2}(p(X))}E\left[\Omega_{1}\left(W_{i},Y_{i}\right)-\Omega_{2}\left(X_{i},W_{i},Y_{i}\right)+C_{t}X_{i}\frac{W_{i}-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)}{F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\left\{1-F\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right\}}f\left(X_{i}^{\mathrm{T}}\beta^{*}\right)\right]^{2}.

References

  • Abadie and Imbens, (2006) Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267.
  • Abadie and Imbens, (2016) Abadie, A. and Imbens, G. W. (2016). Matching on the estimated propensity score. Econometrica, 84(2):781–807.
  • Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Berg, (2011) Berg, G. D. (2011). An application of kernel-based versus one-to-one propensity score matching for a nonexperimental causal study: example from a disease management program evaluation. Applied Economics Letters, 18(5):439–447.
  • Bickel et al., (1993) Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models, volume 4. Springer.
  • Christakis and Iwashyna, (2003) Christakis, N. A. and Iwashyna, T. J. (2003). The health impact of health care on families: a matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science and Medicine, 57(3):465–475.
  • Dawid, (1979) Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(1):1–15.
  • Dehejia and Wahba, (1999) Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.
  • Diamond and Sekhon, (2013) Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945.
  • Efron and Tibshirani, (1993) Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC.
  • Frölich, (2004) Frölich, M. (2004). Finite-Sample Properties of Propensity-Score Matching and Weighting Estimators. The Review of Economics and Statistics, 86(1):77–90.
  • Galiani et al., (2005) Galiani, S., Gertler, P., and Schargrodsky, E. (2005). Water for life: The impact of the privatization of water services on child mortality. Journal of Political Economy, 113:83–120.
  • Hahn, (1998) Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–332.
  • Ham et al., (2011) Ham, J. C., Li, X., and Reagan, P. B. (2011). Matching and semi-parametric iv estimation, a distance-based measure of migration, and the wages of young men. Journal of Econometrics, 161(2):208–227.
  • Hampel, (1974) Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
  • Handouyahia et al., (2013) Handouyahia, A., Haddad, T., and Eaton, F. (2013). Kernel matching versus inverse probability weighting: a comparative study. International Journal of Mathematical and Computational Sciences, 7(8):1218–1233.
  • Hansen, (2004) Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618.
  • Heckman et al., (1998a) Heckman, J., Ichimura, H., Smith, J., and Todd, P. (1998a). Characterizing selection bias using experimental data. Econometrica, 66(5):1017–1098.
  • Heckman et al., (1998b) Heckman, J. J., Ichimura, H., and Todd, P. (1998b). Matching as an econometric evaluation estimator. The Review of Economic Studies, 65(2):261–294.
  • Heckman et al., (1997) Heckman, J. J., Ichimura, H., and Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. The Review of Economic Studies, 64(4):605–654.
  • Herron and Wand, (2007) Herron, M. C. and Wand, J. (2007). Assessing partisan bias in voting technology: The case of the 2004 New Hampshire recount. Electoral Studies, 26:247–261.
  • Hettmansperger, (1984) Hettmansperger, T. (1984). Statistical Inference Based on Ranks. Wiley Series in Probability and Statistics. Wiley.
  • Hines et al., (2022) Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3):292–304.
  • Hirano et al., (2003) Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
  • Huber, (1981) Huber, P. (1981). Robust Statistics. Wiley Series in Probability and Statistics. Wiley.
  • Imai and Ratkovic, (2014) Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):243–263.
  • Imbens, (2004) Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1):4–29.
  • LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4):604–620.
  • Luo and Zhu, (2020) Luo, W. and Zhu, Y. (2020). Matching using sufficient dimension reduction for causal inference. Journal of Business & Economic Statistics, 38(4):888–900.
  • Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
  • Rosenbaum, (1991) Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):597–610.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rosenbaum and Rubin, (1985) Rosenbaum, P. R. and Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38.
  • Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
  • Rubin, (1997) Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127:757–763.
  • Sekhon, (2004) Sekhon, J. S. (2004). Quality meets quantity: Case studies, conditional probability, and counterfactuals. Perspectives on Politics, 2(2):281–293.
  • Smith and Todd, (2005) Smith, J. A. and Todd, P. E. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125(1-2):305–353.