
Kernel-Distance-Based Covariate Balancing

Xialing Wen, Ying Yan (corresponding author; email: [email protected]), Wenliang Pan, Xianyang Zhang
School of Mathematics, Sun Yat-sen University
No. 135 Xingang West Road, Guangzhou, P.R. China
Abstract

A common concern in observational studies is the proper evaluation of the causal effect, usually the average treatment effect or the average treatment effect on the treated. In this paper, we propose a data preprocessing method, kernel-distance-based covariate balancing, for observational studies with binary treatments. The proposed method yields a set of unit weights for the treatment and control groups, respectively, such that the reweighted covariate distributions satisfy a set of pre-specified balance conditions. This preprocessing can effectively reduce the confounding bias of subsequent causal effect estimation. We demonstrate the implementation and performance of kernel-distance-based covariate balancing with Monte Carlo simulation experiments and a real data analysis.


Key words: Observational Studies; Causal Effect; Average Treatment Effect; Covariate Balance; Kernel Distance

1 Introduction

In causal inference, there are two crucial concepts. (i) The propensity score (Rosenbaum and Rubin, 1983): the probability of assignment to the treatment group given the observed covariates. (ii) The potential outcomes (Neyman, 1923; Rubin, 1974): the possible outcomes under different treatment decisions or interventions. The potential outcomes framework, also known as the Neyman-Rubin potential outcomes model, was first proposed for completely randomized experiments by Neyman (1923) and was extended to both experimental and observational studies by Rubin (2005). An overview of the implementation of the potential outcomes framework can be found in Imbens and Rubin (2015). Both concepts are central to evaluating the causal effect, a core topic in causal inference. Depending on the purpose, researchers may prefer estimating the average treatment effect (ATE), the average treatment effect on the treated (ATT), the average treatment effect on the control (ATC), or all of them.

In observational studies, however, it is not valid to evaluate the causal effect by merely computing the mean-difference in response between different experimental groups due to the confounding variables and the non-random assignment mechanism. One method proposed by Rubin (2007) to solve this problem is to balance the distributions between the treated and the control groups. Two popular preprocessing methods, matching and propensity score methods, have been widely used in observational studies. In causal inference, weighting methods that adjust for observed covariates enjoy substantial popularity. In this paper, we focus on weighting methods.

The weights, derived by adjusting the empirical distributions of the observed covariates, are applied to the outcomes so that the reweighted data appear randomized; this yields stable estimates of the parameters of interest when there are no unmeasured confounders. Generally speaking, inverse probability weighting is less robust than, and rarely outperforms, moment-balancing methods. We therefore propose a weighting framework, the kernel-distance-based (KDB) covariate balancing, and use the first-moment balance of all covariates as a constraint in the optimization problem. The KDB method is a preprocessing technique for achieving covariate balance in observational studies with a binary treatment. We give estimators of the ATE and ATT based on the KDB framework, with ATE estimation as our primary focus. Within the same framework, the optimization problem can be flexibly modified to estimate the ATT or ATC.

Subsequently, we explore the asymptotic properties of these estimators and compare our proposed method with other preprocessing methods, such as inverse probability weighting (IPW) (Horvitz and Thompson, 1952; Hirano et al., 2003), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), and non-parametric calibration weighting (CAL) (Chan et al., 2016). The proposed framework has several attractive features. Most importantly, the KDB method can automatically balance the second and possibly higher moments of the covariate distributions, as well as interactions among covariates. Besides, the KDB estimator of the ATE shows greater stability in nonlinear simulation scenarios than existing weighting estimators.

The remainder of this article is organized as follows. Section 2 introduces the definitions, notation, and assumptions used throughout the article. Section 3 reviews the main results of four weighting methods, namely the IPW, CBPS, EB, and CAL. Section 4 introduces the weighted kernel distance in detail and describes the theoretical framework of the KDB that we proposed. Section 5 illustrates the proposed method and compares its performance with other weighting methods in two Monte Carlo simulations and a real data analysis. Section 6 concludes this article and discusses further explorations.

2 Notation and Model Setup

2.1 Notation

We use the symbol '$\stackrel{\triangle}{=}$' to distinguish definitions from equalities. Suppose the observed dataset consists of a random sample or finite population of $N$ units, each assigned to one of two groups, treatment or control, for which covariate-balanced comparisons are of interest. For each unit $i~(i=1,\cdots,N)$, we observe $\{(Y_i,T_i,\bm{X}_i):i=1,\cdots,N\}$, realizations of $\{(Y,T,\bm{X})\}$, where $Y$ is an outcome variable, $T$ is a treatment indicator, and $\bm{X}=(X_1,\cdots,X_D)$, with expectation $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$, is a set of $D$ covariates. Each $T_i$ takes the value 1 or 0, indicating that unit $i$ is assigned to the treatment or the control group, respectively. We denote the sample sizes in the treatment and control groups by $n_1$ and $n_0$, respectively. The observed data in the treatment and control groups are denoted as $\{(Y_{1i},T_{1i},\bm{X}_{1i}):i=1,\cdots,n_1\}$ and $\{(Y_{0j},T_{0j},\bm{X}_{0j}):j=1,\cdots,n_0\}$, respectively, where

$$\bm{X}_{1i}=(X_{1i,1},\cdots,X_{1i,D}),\qquad\bm{X}_{0j}=(X_{0j,1},\cdots,X_{0j,D}).$$

The probability of assignment to the treatment group given the covariates is the so-called propensity score (Rosenbaum and Rubin, 1983)

$$e(\bm{X})=\Pr(T=1\,|\,\bm{X}),$$

which plays a central role in causal inference.

Under the potential outcomes framework (Neyman, 1923; Rubin, 1974), we define a pair of potential outcomes $\{Y_i(0),Y_i(1)\}$ for every unit $i$ under treatment 0 or 1, respectively. The observed outcome can be expressed through the potential outcomes under the consistency assumption:

$$Y_i=(1-T_i)Y_i(0)+T_iY_i(1).$$

In this paper, we focus on two causal estimands: the average treatment effect (ATE), that is,

$$\tau:=E[Y(1)-Y(0)]=\mu_1-\mu_0,$$

where $\mu_t:=E[Y(t)]$, and the average treatment effect on the treated (ATT), that is,

$$\tau_1:=E[Y(1)-Y(0)\,|\,T=1]=\mu_{1|1}-\mu_{0|1},$$

where $\mu_{t|1}\equiv E[Y(t)\,|\,T=1]$ for $t=0,1$. The expectations of the potential outcomes given the covariates are

$$\mu_0(\bm{X})\equiv E[Y(0)\,|\,\bm{X}]\quad\text{and}\quad\mu_1(\bm{X})\equiv E[Y(1)\,|\,\bm{X}].$$

For causal comparisons, we make the following assumptions (Rosenbaum and Rubin, 1983) throughout this article so that unbiased estimators of ATE and ATT are available in observational studies.

2.2 Assumptions

Assumption 1 (Strong Ignorability):
$$\{Y(0),Y(1)\}\perp\!\!\!\perp T~|~\bm{X},$$

where $\perp\!\!\!\perp$ denotes independence.

We assume the potential outcomes $\{Y(0),Y(1)\}$ are independent of the treatment indicator $T$ given the covariates $\bm{X}$, which implies that there are no unmeasured confounders that may cause selection bias. Formally, as shown by Rosenbaum and Rubin (1983), Assumption 1 also implies

$$\{Y(0),Y(1)\}\perp\!\!\!\perp T~|~e(\bm{X})$$

and

$$E(Y\,|\,T=0,\bm{X})=E(Y\,|\,T=1,\bm{X})=E(Y\,|\,\bm{X}).$$

Under Assumption 1, we can investigate the ATE through

$$\begin{split}\tau(\bm{X})&\equiv\mu_1(\bm{X})-\mu_0(\bm{X})\\&=E\left[Y(1)\,|\,T=1,\bm{X}\right]-E\left[Y(0)\,|\,T=0,\bm{X}\right]\\&=E\left[Y\,|\,T=1,\bm{X}\right]-E\left[Y\,|\,T=0,\bm{X}\right],\end{split}$$

where the second line uses $\{Y(0),Y(1)\}\perp\!\!\!\perp T~|~\bm{X}$.

Assumption 2 (Overlap):
$$0<P(T=1\,|\,\bm{X})<1\quad\text{for all}~\bm{X}.$$

To ensure that there are useful observations for estimating the causal effect, we assume the probability that a subject is assigned to the treatment group lies strictly between 0 and 1. Assumption 2 also ensures that bias-correction information is available over the entire domain of $\bm{X}$ (Zhao and Percival, 2016); equivalently, it requires sufficient overlap between the covariate distributions of the treatment and control groups.

Recall that under Assumption 1, the ATE can be expressed as

$$\begin{aligned}\tau&=E\left[Y_i(1)-Y_i(0)\right]\\&=E\{E\left[Y(1)-Y(0)\,|\,\bm{X}\right]\}\\&=E\left[\mu_1(x)-\mu_0(x)\right]\\&=\int\mu_1(x)\,dF(x)-\int\mu_0(x)\,dF(x),\end{aligned}\tag{2.1}$$

where $F(x)$ denotes the true distribution of the covariates. Within a weighting framework, we denote the weighted sample average treatment effect (SATE) with weights $\bm{w}=\{w_1,\cdots,w_N\}=\{p_1,\cdots,p_{n_1},q_1,\cdots,q_{n_0}\}$, $n_1+n_0=N$, as

$$\hat{\tau}_N^w=\sum_{i=1}^N T_iw_iY_i-\sum_{i=1}^N(1-T_i)w_iY_i=\sum_{i=1}^N p_iT_iY_i-\sum_{j=1}^N q_j(1-T_j)Y_j,$$

and the weighted empirical distributions of the treatment and control groups as

$$\hat{F}_{N,1}^w(x)=\sum_{i=1}^{n_1}p_iI(X_{1i}\leq x),\qquad\hat{F}_{N,0}^w(x)=\sum_{j=1}^{n_0}q_jI(X_{0j}\leq x),$$

where $I(\cdot)$ is an indicator function. Therefore, we have

$$\begin{aligned}\hat{\tau}^w_N-\tau=&\sum_{i=1}^N T_iw_iY_i-\sum_{i=1}^N(1-T_i)w_iY_i-\tau\\=&\sum_{i=1}^N T_iw_i(Y_i-\mu_1(X_i))-\sum_{i=1}^N(1-T_i)w_i(Y_i-\mu_0(X_i))+\sum_{i=1}^N T_iw_i\mu_1(X_i)\\&-\sum_{i=1}^N(1-T_i)w_i\mu_0(X_i)-\tau\\=&\left\{\sum_{i=1}^N T_iw_i(\mu_1(X_i)-\mu_0(X_i))-\tau\right\}+\left\{\sum_{i=1}^N T_iw_i\mu_0(X_i)-\sum_{i=1}^N(1-T_i)w_i\mu_0(X_i)\right\}\\&+\left\{\sum_{i=1}^N T_iw_i(Y_i-\mu_1(X_i))-\sum_{i=1}^N(1-T_i)w_i(Y_i-\mu_0(X_i))\right\}.\end{aligned}\tag{2.2}$$

According to Equation (2.2), the bias of the weighted SATE $\hat{\tau}_N^w$ depends entirely on the three terms

$$\sum_{i=1}^N T_iw_i(\mu_1(X_i)-\mu_0(X_i))-\tau,$$
$$\sum_{i=1}^N T_iw_i\mu_0(X_i)-\sum_{i=1}^N(1-T_i)w_i\mu_0(X_i),$$
$$\sum_{i=1}^N T_iw_i(Y_i-\mu_1(X_i))-\sum_{i=1}^N(1-T_i)w_i(Y_i-\mu_0(X_i)).$$

Here, the second term directly shows the importance of balancing the covariate distributions between the treatment and control groups, which is the goal of our proposed weighting framework. To further ensure $\lim_{N\rightarrow\infty}(\hat{\tau}_N^w-\tau)=0$, we need the following Assumption 3.

Assumption 3 (Constant Conditional ATE):
$$\mu_1(\bm{X})-\mu_0(\bm{X})=\tau\quad\text{for all}~\bm{X}.$$

With Assumption 3, we assume a constant causal effect, so the ATE is numerically the same as the ATT.

3 Existing Weighting Methods for Covariate Balancing

We next give a brief overview of some existing approaches.

3.1 IPW: Inverse Probability Weighting

According to Ma and Wang (2010), IPW assigns the weight

$$\widehat{p}_i^{IPW}=\frac{\frac{T_{1i}}{Ne(X_{1i})}}{\sum_{i=1}^{n_1}\frac{T_{1i}}{Ne(X_{1i})}}$$

to $Y_{1i}$ (the outcomes in the treatment group), $i=1,2,\cdots,n_1$, and the weight

$$\widehat{q}_j^{IPW}=\frac{\frac{1-T_{0j}}{N(1-e(X_{0j}))}}{\sum_{j=1}^{n_0}\frac{1-T_{0j}}{N(1-e(X_{0j}))}}$$

to $Y_{0j}$ (the outcomes in the control group), $j=1,2,\cdots,n_0$, thereby deriving a natural estimator of the ATE, that is,

$$\begin{split}\widehat{\tau}_{IPW}&=\frac{1}{N}\sum_{i=1}^N\left[\frac{T_iY_i}{e(X_i)}-\frac{(1-T_i)Y_i}{1-e(X_i)}\right]\\&=\frac{1}{N}\sum_{i=1}^N\frac{(2T_i-1)Y_i}{1-T_i+(2T_i-1)e(X_i)}\\&=\sum_{i=1}^{n_1}\frac{T_{1i}Y_{1i}}{Ne(X_{1i})}-\sum_{j=1}^{n_0}\frac{(1-T_{0j})Y_{0j}}{N(1-e(X_{0j}))}.\end{split}$$

Besides, a natural estimator of the ATT using inverse probability weighting is

$$\begin{split}\widehat{\tau}_{1,IPW}&=\frac{1}{n_1}\sum_{i=1}^N\left[T_iY_i-(1-T_i)\frac{\hat{e}(X_i)}{1-\hat{e}(X_i)}Y_i\right]\\&=\frac{1}{n_1}\sum_{i=1}^N T_iY_i-\frac{1}{n_1}\sum_{i=1}^N(1-T_i)\frac{\hat{e}(X_i)}{1-\hat{e}(X_i)}Y_i\\&=\sum_{i=1}^{n_1}\frac{Y_{1i}}{n_1}-\sum_{j=1}^{n_0}\frac{1}{n_1}\cdot\frac{\hat{e}(X_{0j})}{1-\hat{e}(X_{0j})}Y_{0j}.\end{split}$$

Therefore, the weight assigned to $Y_{1i}$, $i=1,2,\cdots,n_1$, for estimating the ATT is

$$\widehat{p}_{i,trt}^{IPW}=\frac{1}{n_1},$$

and the weight assigned to $Y_{0j}$, $j=1,2,\cdots,n_0$, is

$$\widehat{q}_{j,trt}^{IPW}=\frac{\hat{e}(X_{0j})}{n_1\left(1-\hat{e}(X_{0j})\right)}.$$
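As a concrete illustration, here is a minimal R sketch of these IPW weights and the resulting ATE estimate, assuming a logistic propensity model and a hypothetical data frame `dat` with columns `T`, `Y`, and `X1`-`X4`:

```r
# Estimate the propensity score with a logistic regression (an assumption;
# any consistent estimator of e(X) could be substituted).
ps_fit <- glm(T ~ X1 + X2 + X3 + X4, data = dat, family = binomial)
e_hat  <- fitted(ps_fit)                       # estimated e(X_i)
N      <- nrow(dat)

p_raw <- (1 / (N * e_hat))[dat$T == 1]         # unnormalized treated weights
q_raw <- (1 / (N * (1 - e_hat)))[dat$T == 0]   # unnormalized control weights
p_ipw <- p_raw / sum(p_raw)                    # normalized as in the displays above
q_ipw <- q_raw / sum(q_raw)

tau_ipw <- sum(p_ipw * dat$Y[dat$T == 1]) - sum(q_ipw * dat$Y[dat$T == 0])
```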

3.2 EB: Entropy Balancing (Hainmueller, 2012)

The entropy balancing (EB) scheme of Hainmueller (2012) solves:

$$\min_{w_i}~H(w)=\sum_{i:T_i=0}h(w_i)$$
$$s.t.\left\{\begin{array}{ll}\sum_{i:T_i=0}w_ic_{ri}(X_i)=m_r,&r=1,2,\cdots,R\\\sum_{i:T_i=0}w_i=1,&w_i\geq 0~\text{for all}~i~\text{with}~T_i=0,\end{array}\right.$$

where $h(\cdot)$ is a distance metric and the $R$ constraints $\sum_{i:T_i=0}w_ic_{ri}(X_i)=m_r$ impose moment conditions on the reweighted control group. The weights $\widehat{p}_i^{EB}=\frac{1}{n_1}~(i=1,\cdots,n_1)$ and $\widehat{q}_j^{EB}=w_j~(j=1,\cdots,n_0)$ of the EB method can be calculated directly with the R package "ebal".
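A brief usage sketch, assuming the `ebalance()` interface of the "ebal" package and the same hypothetical `dat`:

```r
library(ebal)
# Entropy balancing reweights the control group to match the treated
# covariate moments; treated units keep uniform weights 1/n1.
eb_out <- ebalance(Treatment = dat$T,
                   X = as.matrix(dat[, c("X1", "X2", "X3", "X4")]))
q_eb <- eb_out$w / sum(eb_out$w)   # normalized control-group weights
```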

3.3 CBPS: Covariate Balance Propensity Score (Imai and Ratkovic, 2014)

Imai and Ratkovic (2014) proposed a relatively more stable method, the covariate balancing propensity score (CBPS), which estimates the propensity score so that covariate balance is optimized. CBPS can significantly improve the often poor empirical performance of propensity score matching and weighting methods. The key idea is to use propensity score weighting to characterize covariate balance:

$$\mathbb{E}\left(\frac{T_i\tilde{\bm{X}}_i}{\pi_\beta(\bm{X}_i)}-\frac{(1-T_i)\tilde{\bm{X}}_i}{1-\pi_\beta(\bm{X}_i)}\right)=0,$$

where $\tilde{\bm{X}}_i=f(\bm{X}_i)$ is a vector-valued measurable function of $\bm{X}_i$ specified by the researcher. This balance condition holds for any such function of the covariates. Selecting a specific $f(\cdot)$, for example $\tilde{X}_i=(X_i^T,(X_i^2)^T)^T$, balances both the first and second moments of all covariates.

The CBPS method can be implemented through the open-source R package “CBPS”.
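A brief usage sketch, assuming the `CBPS()` interface (the `ATT` argument selecting the estimand is our reading of the package documentation and should be checked against it):

```r
library(CBPS)
# Fit the covariate balancing propensity score; ATT = 0 targets the ATE.
cbps_fit <- CBPS(T ~ X1 + X2 + X3 + X4, data = dat, ATT = 0)
e_cbps   <- fitted(cbps_fit)   # balancing propensity scores, weighted as in IPW
```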

3.4 CAL: Calibration Estimator (Chan et al., 2016)

A general class of calibration (CAL) estimators of the ATE is built from a wide class of calibration weights, such as exponential tilting. The weights are constructed to achieve exact moment balance of the observed covariates among the treated group, the control group, and the combined group, and global semiparametric efficiency has been established for the CAL estimators.

For fixed $g\in\mathbb{R}$, let $D(f,g)$ denote a distance measure that is continuously differentiable, nonnegative, and strictly convex in $f\in\mathbb{R}$, with $D(f,g)=0$ if and only if $f=g$. The general idea of calibration, as in Deville and Särndal (1992), is to minimize the aggregate distance between the final weights $w=(w_1,\cdots,w_N)$ and a given vector of design weights $d=(d_1,\cdots,d_N)$ subject to moment constraints. To avoid estimating the design weights, CAL starts from a vector of (misspecified) uniform design weights $d^*=(1,\cdots,1)$ and constructs the final weights $w$ by solving the following constrained optimization problem:

$$\text{Minimize}~\sum_{i=1}^N D(w_i,1),$$
$$s.t.\begin{cases}\frac{1}{N}\sum_{i=1}^N T_iw_iu(X_i)=\frac{1}{N}\sum_{i=1}^N u(X_i)\\\frac{1}{N}\sum_{i=1}^N(1-T_i)w_iu(X_i)=\frac{1}{N}\sum_{i=1}^N u(X_i)\end{cases}\tag{3.1}$$

The CAL method can be implemented through the open-source R package "ATE", which estimates both the ATE and the ATT.
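A brief usage sketch, assuming the `ATE()` interface of the package:

```r
library(ATE)
# Calibration weights and the weighted ATE estimate in one call.
cal_fit <- ATE(Y = dat$Y, Ti = dat$T,
               X = as.matrix(dat[, c("X1", "X2", "X3", "X4")]))
summary(cal_fit)   # reports the point estimate and its standard error
```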

4 The Weighted Kernel Covariate Balancing

4.1 Kernel Distance

Let $X_1,\cdots,X_n$ and $Y_1,\cdots,Y_m$ be random samples from two unknown probability measures $\mathbb{P}$ and $\mathbb{Q}$. Researchers are often interested in detecting the difference, and estimating the discrepancy, between them. In this paper, we use a probability metric with $\zeta$-structure (Zolotarev, 1983),

$$\gamma_{\mathscr{F}}(P,Q)=\sup_{f\in\mathscr{F}}\left|\int f\,dP-\int f\,dQ\right|,$$

where $\mathscr{F}$ is a class of functions, to measure the distance between the treatment and control covariate distributions.

To implement $\gamma_{\mathscr{F}}(P,Q)$, we can let $\|\cdot\|_{\mathscr{K}}$ be the norm of a reproducing kernel Hilbert space (RKHS) $\mathscr{K}$ and set $\mathscr{F}=\{f:\|f\|_{\mathscr{K}}\leq 1\}$, which yields a "kernelized" version of the total variation distance. Restricting $k$ to be a strictly positive definite kernel function corresponding to the RKHS $\mathscr{K}$, we obtain a closed-form solution for $\gamma_{\mathscr{F}}(P,Q)$ (Sriperumbudur et al., 2012), that is,

$$\gamma_k(P_{n_1},Q_{n_0})=\Big\|\sum_{i=1}^N w_ik(\cdot,\bm{X}_i)\Big\|_{\mathscr{K}}=\sqrt{\sum_{i,j=1}^N w_iw_jk(\bm{X}_i,\bm{X}_j)},$$

where $P_{n_1}$ and $Q_{n_0}$ denote the empirical measures of $\bm{X}\,|\,T=1$ and $\bm{X}\,|\,T=0$, respectively, and $w_i$ is the signed sample weight for unit $i\in\{1,\cdots,N\}$. Setting $w_i=1/n_1$ if $T_i=1$ and $w_i=-1/n_0$ if $T_i=0$ gives a simple closed-form empirical estimate of the probability metric under the assumption that the function class is the unit ball of an RKHS. The computation of $\gamma_k(P_{n_1},Q_{n_0})$ costs $O(N^2)$ operations but is independent of the dimension of the confounders.

We can use $\gamma_k(P_{n_1},Q_{n_0})$ as a diagnostic for covariate balance, with smaller values indicating better balance.
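Computing this diagnostic is straightforward; below is a minimal R sketch (function names are ours), where `X` stacks all covariate rows, `T` is the treatment indicator, and `p` and `q` are the group weights in row order:

```r
gaussian_kernel <- function(X, sigma2) {
  D2 <- as.matrix(dist(X))^2        # pairwise squared Euclidean distances
  exp(-D2 / sigma2)
}

kernel_distance <- function(X, T, p, q, sigma2) {
  K <- gaussian_kernel(X, sigma2)
  w <- numeric(nrow(X))             # signed weights: +p_i treated, -q_j control
  w[T == 1] <- p
  w[T == 0] <- -q
  sqrt(drop(t(w) %*% K %*% w))      # gamma_k = sqrt(w' K w)
}
```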

4.2 The Integral Probability Metric (IPM)

Given two probability measures $\mathbb{P}$ and $\mathbb{Q}$ defined on a measurable space $S$, the integral probability metric (IPM) (Sriperumbudur et al., 2012), also called a probability metric with $\zeta$-structure, between $\mathbb{P}$ and $\mathbb{Q}$ is defined as

$$\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})=\sup\left\{\left\lvert\int_S f\,d\mathbb{P}-\int_S f\,d\mathbb{Q}\right\rvert:f\in\mathcal{F}\right\},\tag{4.1}$$

where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. The choice of $\mathcal{F}$ is the crucial distinction between different IPMs, and various popular distance measures, for example the Kantorovich metric and the total variation distance, can be obtained by choosing an appropriate $\mathcal{F}$ in (4.1). When $\mathcal{F}=\{f:\|f\|_{\mathcal{H}}\leq 1\}$, the corresponding $\gamma_{\mathcal{F}}$ is called the kernel distance or maximum mean discrepancy, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) with $k$ as its reproducing kernel (r.k.). In this situation, we can write the function space $\mathcal{F}$ as $(\mathcal{H},k)$ and the IPM $\gamma_{\mathcal{F}}$ as $\gamma_{\mathcal{F}_k}$. The kernel distance has been widely used in statistical applications, including homogeneity testing, independence testing, conditional independence testing, and mixture density estimation.

4.3 Estimators of IPM

Let $\{X_{P,i}:i=1,2,\cdots,n\}$ and $\{X_{Q,j}:j=1,2,\cdots,m\}$ be i.i.d. samples randomly drawn from $\mathbb{P}$ and $\mathbb{Q}$, respectively. The empirical estimator of $\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})$ is given by

$$\gamma_{\mathcal{F}}(\mathbb{P}_n,\mathbb{Q}_m)=\sup_{f\in\mathcal{F}}\left\lvert\sum_{i=1}^N w_i^Ef(X_i)\right\rvert,\tag{4.2}$$

where $\mathbb{P}_n(x):=\sum_{i=1}^n\frac{1}{n}I(X_{P,i}\leq x)$ and $\mathbb{Q}_m(x):=\sum_{j=1}^m\frac{1}{m}I(X_{Q,j}\leq x)$ represent the empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $I(\cdot)$ is an indicator function, $N=n+m$, $w_i^E=\frac{1}{n}$ when $X_i=X_{P,i}$ for $i=1,\cdots,n$, and $w_j^E=-\frac{1}{m}$ when $X_j=X_{Q,j}$ for $j=1,\cdots,m$.

Correspondingly, we define the weighted estimator of $\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})$ as

$$\gamma_{\mathcal{F}}(\mathbb{P}_n^w,\mathbb{Q}_m^w)=\sup_{f\in\mathcal{F}}\left\lvert\sum_{i=1}^N w_if(X_i)\right\rvert,\tag{4.3}$$

where $\mathbb{P}_n^w(x):=\sum_{i=1}^n p_iI(X_{P,i}\leq x)$ and $\mathbb{Q}_m^w(x):=\sum_{j=1}^m q_jI(X_{Q,j}\leq x)$ represent the weighted empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$, respectively, $w_i=p_i$ when $X_i=X_{P,i}$ for $i=1,\cdots,n$, and $w_j=-q_j$ when $X_j=X_{Q,j}$ for $j=1,\cdots,m$. Here $\{p_i:i=1,\cdots,n\}$ and $\{q_j:j=1,\cdots,m\}$ are the balancing weights derived from some weighting method.

4.4 Weighted Estimator of the Kernel Distance

Theorem 4.1.

Let $k$ be a strictly positive definite kernel, i.e., for all $N\in\mathbb{N}$, all $\{\alpha_i\}_{i=1}^N\subset\mathbb{R}\backslash\{0\}$, and all mutually distinct $\{\theta_i\}_{i=1}^N\subset S$, $\sum_{i,j=1}^N\alpha_i\alpha_jk(\theta_i,\theta_j)>0$. Then, for $\mathcal{F}=\mathcal{F}_k:=\{f:\lVert f\rVert_{\mathcal{H}}\leq 1\}$, the following function $f$ is the unique solution to (4.3):

$$f=\frac{1}{\left\lVert\sum_{i=1}^N w_ik(\cdot,X_i)\right\rVert_{\mathcal{H}}}\sum_{i=1}^N w_ik(\cdot,X_i),\tag{4.4}$$

and the corresponding weighted estimator of the kernel distance is

$$\gamma_{k,N}(\mathbb{P}_n,\mathbb{Q}_m)=\left\lVert\sum_{i=1}^N w_ik(\cdot,X_i)\right\rVert_{\mathcal{H}}=\sqrt{\sum_{i,j=1}^N w_iw_jk(X_i,X_j)}.\tag{4.5}$$

Proof. Consider $\gamma_{k,N}(\mathbb{P}_n,\mathbb{Q}_m):=\sup\left\{\sum_{i=1}^N w_if(X_i):\lVert f\rVert_{\mathcal{H}}\leq 1\right\}$, which can be written as

$$\gamma_{k,N}(\mathbb{P}_n,\mathbb{Q}_m)=\sup\left\{\left\langle f,\sum_{i=1}^N w_ik(\cdot,X_i)\right\rangle_{\mathcal{H}}:\lVert f\rVert_{\mathcal{H}}\leq 1\right\},$$

where we have used the reproducing property of $\mathcal{H}$, i.e., $\forall f\in\mathcal{H}$, $\forall x\in S$, $f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}}$.

By the Cauchy-Schwarz inequality, which states that for all vectors $u$ and $v$ of an inner product space,

$$|\langle u,v\rangle|\leq\lVert u\rVert\cdot\lVert v\rVert,\tag{4.6}$$

where $\langle\cdot,\cdot\rangle$ is the inner product, we have

$$\gamma_{k,N}(\mathbb{P}_n,\mathbb{Q}_m)\leq\sup\left\{\lVert f\rVert_{\mathcal{H}}\cdot\left\lVert\sum_{i=1}^N w_ik(\cdot,X_i)\right\rVert_{\mathcal{H}}:\lVert f\rVert_{\mathcal{H}}\leq 1\right\}\leq\left\lVert\sum_{i=1}^N w_ik(\cdot,X_i)\right\rVert_{\mathcal{H}}.$$

This bound is attained by setting

$$f=\frac{1}{\left\lVert\sum_{i=1}^N w_ik(\cdot,X_i)\right\rVert_{\mathcal{H}}}\sum_{i=1}^N w_ik(\cdot,X_i).$$

Since $k$ is strictly positive definite, $\gamma_{k,N}(\mathbb{P}_n,\mathbb{Q}_m)=0$ if and only if $\mathbb{P}_n=\mathbb{Q}_m$, which ensures that (4.4) is well-defined whenever $\mathbb{P}_n\neq\mathbb{Q}_m$.

4.5 The Gaussian Kernel and the Information Matrix

According to Rasmussen (2003), one choice of the kernel $k$ for covariate balance is the Gaussian kernel,

$$K(\bm{x},\bm{x}';\sigma^2)=\exp\left(-\frac{\|\bm{x}-\bm{x}'\|^2}{\sigma^2}\right),$$

where the parameter $\sigma$ determines the width of the Gaussian kernel and can either be fixed or estimated from the data. In this article, we estimate $\sigma$ with the median of all positive pairwise squared Euclidean distances between subjects, taken over treated-treated, treated-control, and control-control pairs:

$$\sigma=\mathrm{median}\left\{\|\bm{X}_i-\bm{X}_j\|^2:\|\bm{X}_i-\bm{X}_j\|^2>0,~1\leq i<j\leq N\right\}.$$

The information matrix $\bm{K}_G$ corresponding to the Gaussian kernel $K(\bm{x},\bm{x}';\sigma^2)$ is

$$\bm{K}_G=\begin{pmatrix}K_1&-K_{10}\\-K_{01}&K_0\end{pmatrix}=\begin{pmatrix}\left(K(X_{1i},X_{1j})\right)_{n_1\times n_1}&-\left(K(X_{1i},X_{0j})\right)_{n_1\times n_0}\\-\left(K(X_{1i},X_{0j})\right)_{n_1\times n_0}^T&\left(K(X_{0i},X_{0j})\right)_{n_0\times n_0}\end{pmatrix}.\tag{4.7}$$
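A short R sketch of the median heuristic and of $\bm{K}_G$ in (4.7), assuming the rows of `X` are ordered with the $n_1$ treated units first (helper names are ours):

```r
median_sigma <- function(X) {
  d2 <- as.matrix(dist(X))^2
  median(d2[d2 > 0])                # median of all positive squared distances
}

build_KG <- function(X, n1) {
  K <- exp(-as.matrix(dist(X))^2 / median_sigma(X))
  s <- c(rep(1, n1), rep(-1, nrow(X) - n1))   # +1 treated, -1 control
  K * tcrossprod(s)                 # flips the sign of the off-diagonal blocks
}
```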

4.6 The Kernel-distance-based (KDB) Covariate Balance

In this article, we propose a preprocessing procedure, the kernel-distance-based covariate balancing method, to create balanced samples for evaluating treatment effects. In this procedure, we create two groups of scalar weights, for the treatment and control groups, so that the reweighted treatment and control groups match precisely on the specified moments.

4.6.1 ATE Estimation

In this section, we introduce the estimation of the ATE,

$$\tau\equiv E[Y(1)-Y(0)],$$

with the weighted mean difference between the treatment and control outcomes, that is,

$$\hat{\tau}=\sum_{i=1}^{n_1}p_iY_{1i}-\sum_{j=1}^{n_0}q_jY_{0j}.$$

The weights $\{p_i:i=1,\cdots,n_1\}$ and $\{q_j:j=1,\cdots,n_0\}$ derive from the following KDB reweighting scheme:

$$\begin{aligned}\min_{\vec{\bm{p}},\vec{\bm{q}}}~r^W(\vec{\bm{p}},\vec{\bm{q}})=&\sum_{i=1}^N\sum_{j=1}^N p_ip_jK(\bm{X}_{1i},\bm{X}_{1j})\cdot I(i,j\in\textbf{trt})\\&+\sum_{i=1}^N\sum_{j=1}^N q_iq_jK(\bm{X}_{0i},\bm{X}_{0j})\cdot I(i,j\in\textbf{col})\\&-2\sum_{i=1}^N\sum_{j=1}^N p_iq_jK(\bm{X}_{1i},\bm{X}_{0j})\cdot I(i\in\textbf{trt},j\in\textbf{col})\end{aligned}\tag{4.8}$$
$$s.t.\left\{\begin{array}{ll}\sum_{i=1}^{n_1}p_i=1,&p_i\geq 0\\\sum_{j=1}^{n_0}q_j=1,&q_j\geq 0\\\sum_{i=1}^{n_1}p_i\bm{g}(\bm{X}_1)=\sum_{j=1}^{n_0}q_j\bm{g}(\bm{X}_0),&\end{array}\right.$$

where $I(\cdot)$ is an indicator function and $\bm{g}(\cdot)$ denotes a set of constraint functions imposed on the covariate moments. Here, for convenience, we set $\bm{g}(\cdot)=\{g_1(\cdot),\cdots,g_D(\cdot)\}$ to be the sample mean function of each covariate, so that the reweighted treatment and control groups match on the first moment, that is,

$$g_d(\bm{X}_t):=\frac{\sum_{i=1}^{n_t}X_{ti,d}}{n_t},$$

where $t\in\{0,1\}$, $i\in\{1,\cdots,n_t\}$, $d\in\{1,\cdots,D\}$. For conciseness, we define $\vec{\bm{w}}=(p_1,\cdots,p_{n_1},q_1,\cdots,q_{n_0})^T$, $\vec{\bm{b}}_{0,(2+D+N)\times 1}=(1,1,0,0,\cdots,0)^T$, and

$$\bm{A}^T=\begin{pmatrix}\vec{\bm{1}}_{1\times n_1}&\vec{\bm{0}}_{1\times n_0}\\\vec{\bm{0}}_{1\times n_1}&\vec{\bm{1}}_{1\times n_0}\\\vec{\bm{X}}_{1,trt}&-\vec{\bm{X}}_{1,col}\\\vdots&\vdots\\\vec{\bm{X}}_{D,trt}&-\vec{\bm{X}}_{D,col}\\\bm{I}_{N\times N}&\end{pmatrix}_{(2+D+N)\times N},$$

where the first two rows encode the sum-to-one constraints, the next $D$ rows encode the first-moment balance, the final block is the $N\times N$ identity matrix encoding the nonnegativity constraints, $D$ is the total number of covariates, and $\vec{\bm{X}}_{d,trt}$ and $\vec{\bm{X}}_{d,col}$ denote the $d^{th}$ covariate in the treatment and control groups, respectively. The optimization problem for estimating the ATE can then be expressed as

$$\min_{\vec{\bm{w}}}~r^W(\vec{\bm{w}})=\vec{\bm{w}}^T\bm{K}_G\vec{\bm{w}}\qquad s.t.~\bm{A}^T\vec{\bm{w}}\geq\vec{\bm{b}}_0.\tag{4.9}$$

Here, the first $2+D$ constraints are treated as equality constraints, and all further constraints as inequality constraints.
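Problem (4.9) is a standard quadratic program, so it can be sketched with the quadprog package: `solve.QP()` minimizes $\frac{1}{2}\vec{\bm{w}}^TD\vec{\bm{w}}-d^T\vec{\bm{w}}$ subject to $\bm{A}^T\vec{\bm{w}}\geq\vec{\bm{b}}_0$, with the first `meq` constraints treated as equalities, which matches the structure above. The function below is our illustrative sketch, not the authors' implementation; a small ridge keeps the Gram matrix numerically positive definite:

```r
library(quadprog)

kdb_ate_weights <- function(X, T, ridge = 1e-8) {
  ord <- order(-T)                      # put the treated units first
  X <- X[ord, , drop = FALSE]; T <- T[ord]
  n1 <- sum(T); N <- length(T); D <- ncol(X)
  KG <- build_KG(X, n1)                 # from the sketch in Section 4.5

  A <- rbind(as.numeric(T == 1),        # sum of p_i equals 1
             as.numeric(T == 0),        # sum of q_j equals 1
             t(X * ifelse(T == 1, 1, -1)),  # first-moment balance rows
             diag(N))                   # nonnegativity of all weights
  b0 <- c(1, 1, rep(0, D), rep(0, N))

  sol <- solve.QP(Dmat = KG + ridge * diag(N), dvec = rep(0, N),
                  Amat = t(A), bvec = b0, meq = 2 + D)
  sol$solution                          # (p_1, ..., p_n1, q_1, ..., q_n0)
}
```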

4.6.2 Stable Weight Estimation of ATE

To construct stable weights for estimating the ATE, we can make a small modification to the optimization problem (4.9). Compared with the reweighting scheme in Section 4.6.1, we only add a tuning parameter $\lambda$ to the diagonal of $\bm{K}_G$ to control the variance of $\hat{\tau}$, keeping all other settings and parameters the same. Defining $\vec{\bm{w}}_0=\vec{\bm{w}}\,|_{p_i=\frac{1}{n_1},q_j=\frac{1}{n_0}}=(\frac{1}{n_1},\cdots,\frac{1}{n_1},\frac{1}{n_0},\cdots,\frac{1}{n_0})^T$, the reweighting scheme for stable ATE estimation is:

$$\min_{\vec{\bm{w}}}~r^W(\vec{\bm{w}})=\vec{\bm{w}}^T\bm{K}_G\vec{\bm{w}}+\lambda\left(\vec{\bm{w}}-\vec{\bm{w}}_0\right)^T\left(\vec{\bm{w}}-\vec{\bm{w}}_0\right)\qquad s.t.~\bm{A}^T\vec{\bm{w}}\geq\vec{\bm{b}}_0.\tag{4.10}$$

It is easy to show that, subject to the constraints above, the optimization problem (4.10) is equivalent to

$$\min_{\vec{\bm{w}}}~r^W(\vec{\bm{w}})=\vec{\bm{w}}^T(\bm{K}_G+\lambda\bm{I}_{N\times N})\vec{\bm{w}}\qquad s.t.~\bm{A}^T\vec{\bm{w}}\geq\vec{\bm{b}}_0,\tag{4.11}$$

since on the feasible set (where $\sum_{i=1}^{n_1}p_i=\sum_{j=1}^{n_0}q_j=1$) the cross term $-2\lambda\vec{\bm{w}}_0^T\vec{\bm{w}}$ is constant and $\lambda\vec{\bm{w}}_0^T\vec{\bm{w}}_0$ does not depend on $\vec{\bm{w}}$.

When $\lambda=0$, the optimization problem (4.11) degenerates to problem (4.9), which controls the bias, but not the variance, of the ATE estimate. As $\lambda$ increases, the problem may increase the bias while reducing the variance of the ATE estimate.
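In code, the stable variant changes only the quadratic term: under the sketch in Section 4.6.1, $\bm{K}_G+\lambda\bm{I}$ replaces $\bm{K}_G$, so the ridge argument doubles as the tuning parameter:

```r
# Stable KDB weights for the ATE; lambda = 0 recovers problem (4.9).
kdb_ate_weights_stable <- function(X, T, lambda) {
  kdb_ate_weights(X, T, ridge = lambda)
}
```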

4.6.3 ATT Estimation

In this section, we introduce the KDB reweighting scheme for estimating the ATT,

$$\tau_1:=E[Y(1)-Y(0)\,|\,T=1]=\mu_{1|1}-\mu_{0|1}.$$

We estimate $\tau_1$ with the mean difference between the treatment outcomes and the weighted control outcomes, that is,

$$\hat{\tau}_1=\sum_{i=1}^{n_1}\frac{1}{n_1}Y_{1i}-\sum_{j=1}^{n_0}q_jY_{0j},$$

where $\{q_j:j=1,\cdots,n_0\}$ derive from the following reweighting scheme:

$$\begin{aligned}\min_{\vec{\bm{q}}}~r^W(\vec{\bm{q}})=&\sum_{i=1}^N\sum_{j=1}^N\frac{1}{n_1}\cdot\frac{1}{n_1}K(\bm{X}_{1i},\bm{X}_{1j})\cdot I(i,j\in\textbf{trt})\\&+\sum_{i=1}^N\sum_{j=1}^N q_iq_jK(\bm{X}_{0i},\bm{X}_{0j})\cdot I(i,j\in\textbf{col})\\&-2\sum_{i=1}^N\sum_{j=1}^N\frac{1}{n_1}q_jK(\bm{X}_{1i},\bm{X}_{0j})\cdot I(i\in\textbf{trt},j\in\textbf{col})\end{aligned}\tag{4.12}$$
$$s.t.\left\{\begin{array}{ll}\sum_{j=1}^{n_0}q_j=1,&q_j\geq 0\\\sum_{i=1}^{n_1}\frac{1}{n_1}\bm{g}(\bm{X}_1)=\sum_{j=1}^{n_0}q_j\bm{g}(\bm{X}_0).&\end{array}\right.$$

As in the ATE estimation, we define $\vec{\bm{q}}_{n_0\times 1}=(q_1,q_2,\cdots,q_{n_0})^T$ and reset $\vec{\bm{w}}$, $\vec{\bm{b}}_{0,(1+D+n_0)\times 1}$, and $\bm{A}^T$, respectively, to be

$$\vec{\bm{w}}=(p_1,\cdots,p_{n_1},q_1,\cdots,q_{n_0})^T,$$
$$\vec{\bm{b}}_{0,(1+D+n_0)\times 1}=\begin{pmatrix}1&\overline{X}_{1,1}&\overline{X}_{2,1}&\cdots&\overline{X}_{D,1}&0&\cdots&0\end{pmatrix}^T,$$
$$\bm{A}^T=\begin{pmatrix}\vec{\bm{1}}_{1\times n_0}\\\bm{X}_{1,0}\\\vdots\\\bm{X}_{D,0}\\\bm{I}_{n_0\times n_0}\end{pmatrix}_{(1+D+n_0)\times n_0}=\begin{pmatrix}1&1&\cdots&1\\X_{1,01}&X_{1,02}&\cdots&X_{1,0n_0}\\\vdots&\vdots&\ddots&\vdots\\X_{D,01}&X_{D,02}&\cdots&X_{D,0n_0}\\1&0&\cdots&0\\0&1&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{pmatrix}_{(1+D+n_0)\times n_0},$$

where $p_i=\frac{1}{n_1}$, $i\in\{1,\cdots,n_1\}$, and $\overline{X}_{d,1}$, $d\in\{1,\cdots,D\}$, refers to the sample mean of the $d^{th}$ covariate in the treatment group. We can therefore write the ATT optimization problem in matrix form:

$$\min_{\vec{\bm{q}}}~r^W(\vec{\bm{q}})=\vec{\bm{w}}^T\bm{K}_G\vec{\bm{w}}\qquad s.t.~\bm{A}^T\vec{\bm{q}}\geq\vec{\bm{b}}_0.\tag{4.13}$$

Here, the first $1+D$ constraints are treated as equality constraints, and all further constraints as inequality constraints.
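A matching sketch for the ATT problem, in which only the control weights are free and the cross term with the fixed treated weights $1/n_1$ enters through the linear part of the quadratic program (reusing `median_sigma()` from the Section 4.5 sketch; function name is ours):

```r
kdb_att_weights <- function(X1, X0, lambda = 1e-8) {
  n1 <- nrow(X1); n0 <- nrow(X0); D <- ncol(X0)
  X   <- rbind(X1, X0)
  K   <- exp(-as.matrix(dist(X))^2 / median_sigma(X))
  K0  <- K[(n1 + 1):(n1 + n0), (n1 + 1):(n1 + n0)]
  K10 <- K[1:n1, (n1 + 1):(n1 + n0)]

  # Objective: q' K0 q - (2/n1) 1' K10 q, up to a constant treated block.
  dvec <- drop(crossprod(K10, rep(1 / n1, n1)))
  A  <- rbind(rep(1, n0), t(X0), diag(n0))   # sum-to-one, balance, q >= 0
  b0 <- c(1, colMeans(X1), rep(0, n0))

  sol <- quadprog::solve.QP(Dmat = 2 * (K0 + lambda * diag(n0)),
                            dvec = 2 * dvec, Amat = t(A), bvec = b0,
                            meq = 1 + D)
  sol$solution                               # (q_1, ..., q_n0)
}
```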

4.7 Properties of the KDB Covariate Balance

Proposition 1.

The Gaussian information matrix $\bm{K}_G$ is positive semi-definite.

Proof. Based on the kernel property $K(x_1,x_2)=\langle\varphi(x_1),\varphi(x_2)\rangle$ and the sign structure of $\bm{K}_G$, for any $v\in\mathbb{R}^N$ we have

$$\begin{aligned}v^T\bm{K}_Gv&=\sum_{i,j=1}^N v_iv_j(K_G)_{ij}\\&=\sum_{i,j=1}^{n_1}v_iv_j\langle\varphi(X_i),\varphi(X_j)\rangle+\sum_{i,j=n_1+1}^N v_iv_j\langle\varphi(X_i),\varphi(X_j)\rangle-2\sum_{i=1}^{n_1}\sum_{j=n_1+1}^N v_iv_j\langle\varphi(X_i),\varphi(X_j)\rangle\\&=\left\langle\sum_{i=1}^{n_1}v_i\varphi(X_i)-\sum_{j=n_1+1}^N v_j\varphi(X_j),~\sum_{i=1}^{n_1}v_i\varphi(X_i)-\sum_{j=n_1+1}^N v_j\varphi(X_j)\right\rangle\\&=\Big\|\sum_{i=1}^{n_1}v_i\varphi(X_i)-\sum_{j=n_1+1}^N v_j\varphi(X_j)\Big\|^2\geq 0,\end{aligned}$$

as required.

Optimization Problem:
We aim to solve the following minimization problem:

$$\min_{x\in\mathbb{R}^N}~f_0(x)=\frac{1}{2}x^T\mathbf{K}x,\qquad\text{s.t.}~\mathbf{A}x=b,\quad\mathbf{B}x\leq\mathbf{0}_N,\tag{4.14}$$

where $\mathbf{A}=\begin{pmatrix}\mathbf{1}^T_{n_1}&\mathbf{0}^T_{n_0}\\\mathbf{0}^T_{n_1}&\mathbf{1}^T_{n_0}\end{pmatrix}$, $\mathbf{B}=-\mathbf{I}$, and $b=(1,1)^T$.

Proposition 2.

(4.14) is a standard convex optimization problem.

The Dual of the Above Convex Problem:
The Lagrange function is given by

$$\mathcal{L}(x,\lambda,\nu)=f_0(x)+\lambda^T\mathbf{B}x+\nu^T(\mathbf{A}x-b)$$

and the dual function is given by $g(\lambda,\nu)=\inf_{x\in\mathbb{R}^N}\mathcal{L}(x,\lambda,\nu)$. Note that

$$\mathcal{L}(x,\lambda,\nu)=\frac{1}{2}x^T\mathbf{K}x+\lambda^T\mathbf{B}x+\nu^T(\mathbf{A}x-b)=\frac{1}{2}x^T\mathbf{K}x+(\lambda^T\mathbf{B}+\nu^T\mathbf{A})x-\nu^Tb.\tag{4.15}$$

The dual problem of (4.14) is $d^*=\sup_{\lambda\geq\mathbf{0}_N,\nu}g(\lambda,\nu)$. Letting $p^*$ be the optimal value of (4.14), we have $d^*\leq p^*$. Since the constraints are all affine, the weak Slater condition gives $d^*=p^*$. From the KKT conditions we then have

$$\mathbf{K}x+\mathbf{B}^T\lambda+\mathbf{A}^T\nu=0\tag{4.16}$$

and

$$\mathbf{A}x=b.$$

Assume $\mathbf{K}$ is positive definite. Solving (4.16), we have $x=-\mathbf{K}^{-1}(\mathbf{B}^T\lambda+\mathbf{A}^T\nu)$. Therefore,

$$g(\lambda,\nu)=-\frac{1}{2}(\mathbf{B}^T\lambda+\mathbf{A}^T\nu)^T\mathbf{K}^{-1}(\mathbf{B}^T\lambda+\mathbf{A}^T\nu)-\nu^Tb.$$

The dual problem is

$$\max_{\lambda,\nu}~-\frac{1}{2}(\mathbf{B}^T\lambda+\mathbf{A}^T\nu)^T\mathbf{K}^{-1}(\mathbf{B}^T\lambda+\mathbf{A}^T\nu)-\nu^Tb\qquad\text{s.t.}~\lambda_i\geq 0,~i=1,\cdots,N.\tag{4.17}$$

Solving for ν\nu we have

$$0=\triangledown_\nu g(\lambda,\nu)=-\mathbf{A}\mathbf{K}^{-1}\mathbf{A}^T\nu-\mathbf{A}\mathbf{K}^{-1}\mathbf{B}^T\lambda-b\quad\Rightarrow\quad\nu=-(\mathbf{A}\mathbf{K}^{-1}\mathbf{A}^T)^{-1}\left(\mathbf{A}\mathbf{K}^{-1}\mathbf{B}^T\lambda+b\right).\tag{4.18}$$

Substituting $\nu$ according to (4.18), (4.17) becomes

$$\max_{\lambda_i\geq 0}~\frac{1}{2}\lambda^T\left(\mathbf{B}\mathbf{K}^{-1}\mathbf{A}^T(\mathbf{A}\mathbf{K}^{-1}\mathbf{A}^T)^{-1}\mathbf{A}\mathbf{K}^{-1}\mathbf{B}^T-\mathbf{B}\mathbf{K}^{-1}\mathbf{B}^T\right)\lambda+b^T(\mathbf{A}\mathbf{K}^{-1}\mathbf{A}^T)^{-1}\mathbf{A}\mathbf{K}^{-1}\mathbf{B}^T\lambda.\tag{4.19}$$

Suppose $\mathbf{K}^{-1}=\begin{pmatrix}\mathbf{M}_1&-\mathbf{M}_{10}\\-\mathbf{M}_{01}&\mathbf{M}_0\end{pmatrix}$, so that $\mathbf{A}\mathbf{K}^{-1}\mathbf{A}^T=\mathbf{1}^T\mathbf{M}_1\mathbf{1}+\mathbf{1}^T\mathbf{M}_0\mathbf{1}$. For simplicity, substituting $\mathbf{B}$ with $-\mathbf{I}$ and writing $\mathbf{Q}=\mathbf{1}^T\mathbf{M}_1\mathbf{1}+\mathbf{1}^T\mathbf{M}_0\mathbf{1}$, (4.19) is converted to

$$\max_{\lambda_i\geq 0}~\frac{1}{2}\lambda^T\left(\mathbf{K}^{-1}\mathbf{A}^T\mathbf{Q}^{-1}\mathbf{A}\mathbf{K}^{-1}-\mathbf{K}^{-1}\right)\lambda+b^T\mathbf{Q}^{-1}\mathbf{A}\mathbf{K}^{-1}\lambda.\tag{4.20}$$

5 Simulation Studies

In this section, we perform simulation studies to evaluate the performance of different weighting methods, including inverse probability weighting (IPW), the covariate balancing propensity score (CBPS) (Imai and Ratkovic, 2014), entropy balancing (EB) (Hainmueller, 2012), the calibration estimator (CAL) (Chan et al., 2016), and our proposed KDB covariate balancing. We use the oracle and unadjusted (UnAD) estimators as benchmarks. In all the following simulation experiments, we assume a constant causal effect, so the ATE is numerically the same as the ATT. For each simulation experiment, we construct $N_{sim}$ Monte Carlo datasets.

In addition to the simulation studies, we also conduct a real data analysis with the National Supported Work (NSW) dataset.

5.1 Evaluation Metrics and Balancing Statistics

To evaluate the performance of the ATE/ATT estimation, we use the metrics Bias, % Bias, empirical standard deviation (SD), and RMSE.

Bias is the difference between the average estimate and the true value of the ATE/ATT. % Bias is the bias as a percentage of the estimate's standard deviation; a useful rule of thumb (Kang et al., 2007) is that the performance of interval estimates and test statistics begins to deteriorate when the bias of a point estimate exceeds about 40% of its standard deviation. The empirical SD of an estimator is the square root of the average squared deviations of the estimates from their mean; here, we report the empirical SD of the average ATE/ATT estimate, i.e., the empirical standard deviation of all estimated values divided by the square root of the number of replications $N_{sim}$. RMSE is the root-mean-square error measuring the differences between the values produced by a weighting method and the true values.
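As a small illustration, the four metrics can be computed from a vector of replicate estimates (hypothetical objects `est` and `truth`):

```r
# est: vector of Nsim Monte Carlo estimates; truth: the true ATE/ATT.
bias     <- mean(est) - truth
pct_bias <- 100 * bias / sd(est)          # bias as % of the estimate's SD
sd_avg   <- sd(est) / sqrt(length(est))   # empirical SD of the average estimate
rmse     <- sqrt(mean((est - truth)^2))
```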

To further compare the power of balancing covariates, we use the balancing statistics $r^W$, the kernel distance (KD), the maximum, average, and median absolute standardized mean difference (maxASMD, meanASMD, medASMD), the average Kolmogorov-Smirnov statistic (meanKS), and the average t statistic (meanT). We introduce each balancing statistic in detail below.

For every set of sample weights $\{\widehat{p}_i\}~(i=1,2,\cdots,n_1)$ and $\{\widehat{q}_j\}~(j=1,2,\cdots,n_0)$ obtained by a balancing method on a Monte Carlo dataset, we calculate $r^W$ as

$$\begin{aligned}\widehat{r}^W(\vec{p},\vec{q})=&\sum_{i=1}^N\sum_{j=1}^N\widehat{p}_i\widehat{p}_jK(X_{1i},X_{1j})\cdot I(i,j\in\textbf{trt})\\&+\sum_{i=1}^N\sum_{j=1}^N\widehat{q}_i\widehat{q}_jK(X_{0i},X_{0j})\cdot I(i,j\in\textbf{col})\\&-2\sum_{i=1}^N\sum_{j=1}^N\widehat{p}_i\widehat{q}_jK(X_{1i},X_{0j})\cdot I(i\in\textbf{trt},j\in\textbf{col}).\end{aligned}\tag{5.1}$$

For a simulation experiment with a total of $N_{sim}$ iterations, we calculate and compare the average value of the $\widehat{r}^W(\vec{p},\vec{q})$ statistics for each balancing method. KD, the square root of $r^W$, has been widely used in statistical applications (Sriperumbudur et al., 2012) and can be regarded as a reasonable measure of covariate balance (Xie, 2018).

In estimating the ATE, the ASMD of covariate $X_d~(d=1,\cdots,D)$ with balancing weights $\widehat{p}_i~(i=1,\cdots,n_1)$ and $\widehat{q}_j~(j=1,\cdots,n_0)$ is defined as (Zhang et al., 2019):

$$\bm{ASMD}_{d,ATE}=\frac{|\bar{X}_{d,1}^p-\bar{X}_{d,0}^q|}{\sqrt{(S_{d,1}^2+S_{d,0}^2)/2}},$$

where $\bar{X}_{d,1}^p$ and $\bar{X}_{d,0}^q$ refer to the weighted sample means of covariate $X_d$ in the treatment and control groups, respectively,

$$\bar{X}_{d,1}^p=\sum_{i=1}^{n_1}\widehat{p}_iX_{i,d,1},\qquad\bar{X}_{d,0}^q=\sum_{j=1}^{n_0}\widehat{q}_jX_{j,d,0},$$

and $S_{d,1}^2$ and $S_{d,0}^2$ denote the sample variances of covariate $X_d$ in the treatment and control groups without any adjustment,

$$S_{d,1}^2=\frac{\sum_{i=1}^{n_1}(X_{i,d,1}-\bar{X}_{d,1})^2}{n_1-1},\qquad S_{d,0}^2=\frac{\sum_{j=1}^{n_0}(X_{j,d,0}-\bar{X}_{d,0})^2}{n_0-1}.$$

In estimating the ATT, the ASMD of covariate $X_d~(d=1,\cdots,D)$ with balancing weights $\widehat{p}_i=\frac{1}{n_1}~(i=1,\cdots,n_1)$ and $\widehat{q}_j~(j=1,\cdots,n_0)$ is defined as (Xie et al., 2019):

$$\bm{ASMD}_{d,ATT}=\left|\frac{\bar{X}_{d,1}^p}{sd(X_{d,1})}-\frac{\bar{X}_{d,0}^q}{sd(X_{d,1})}\right|,$$

where $sd(X_{d,1})$ denotes the sample standard deviation of covariate $X_d$ in the treatment group without any adjustment,

$$sd(X_{d,1})=\sqrt{\frac{\sum_{i=1}^{n_1}(X_{i,d,1}-\bar{X}_{d,1})^2}{n_1-1}}.$$

For each Monte Carlo dataset, we calculate the ASMD values over all $D$ covariates and record the maximum, mean, and median values.
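A one-function R sketch of the ATE-version ASMD for a single covariate, using weighted means and unadjusted group variances as defined above (function name is ours):

```r
# x1, x0: covariate values in the treated and control groups;
# p, q: balancing weights summing to one within each group.
asmd_ate <- function(x1, x0, p, q) {
  abs(sum(p * x1) - sum(q * x0)) / sqrt((var(x1) + var(x0)) / 2)
}
```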

A Kolmogorov-Smirnov (KS) test can be applied to test whether two underlying one-dimensional probability distributions differ significantly, and the KS statistic quantifies the distance between the empirical distribution functions of two samples. We calculate the KS statistic between the weighted treatment and control samples for each of the $D$ covariates and take the average as one balancing statistic:

$$\bm{meanKS}=\frac{\sum_{d=1}^D KS_d}{D},$$

where $KS_d=\sup_x|F_{1,n_1}(X_{d,1}^p)-F_{0,n_0}(X_{d,0}^q)|$; $F_{1,n_1}(X_{d,1}^p)$ and $F_{0,n_0}(X_{d,0}^q)$ are the empirical distribution functions of the weighted samples $\{\widehat{p}_i\cdot X_{d,1i}\}~(i=1,2,\cdots,n_1)$ and $\{\widehat{q}_j\cdot X_{d,0j}\}~(j=1,2,\cdots,n_0)$, respectively, and $\sup$ is the supremum.

A t-test is commonly applied to determine whether the means of two sets of data differ significantly. We conduct a Welch's t-test between the weighted means of the treatment and control groups for each of the $D$ covariates and take the average as one balancing statistic:

$$\bm{meanT}=\frac{\sum_{d=1}^D T_d}{D},$$

where $T_d=\frac{\bar{X}_{d,1}^p-\bar{X}_{d,0}^q}{S_{\bar{\triangle},d}}$ and $S_{\bar{\triangle},d}=\sqrt{\frac{S^2_{X_{d,1}^p}}{n_1}+\frac{S^2_{X_{d,0}^q}}{n_0}}$; $S^2_{X_{d,1}^p}$ and $S^2_{X_{d,0}^q}$ are the unbiased estimators of the variances of the two weighted samples $\{\widehat{p}_i\cdot X_{d,1i}\}~(i=1,2,\cdots,n_1)$ and $\{\widehat{q}_j\cdot X_{d,0j}\}~(j=1,2,\cdots,n_0)$, respectively.

5.2 Benchmark Estimators

In the different simulation scenarios, the sample oracle estimators, the sample unadjusted estimators, and their biases serve as benchmarks for comparing the weighting methods.

5.2.1 Oracle Estimator

The sample oracle estimators of the ATE and the ATT are respectively defined as

$$\widehat{\tau}_{Oracle}=\frac{1}{N}\sum_{i=1}^N Y_i(1)-\frac{1}{N}\sum_{i=1}^N Y_i(0),$$
$$\widehat{\tau}_{1,Oracle}=\frac{1}{n_1}\sum_{i=1}^{n_1}\left[Y_i(1)\,|\,T_i=1\right]-\frac{1}{n_1}\sum_{i=1}^{n_1}\left[Y_i(0)\,|\,T_i=1\right].$$

5.2.2 UnAD: Unadjusted Estimator

Without any adjustment, we assign the weight $\widehat{p}_i^{UnAD}=\frac{1}{n_1}~(i=1,2,\cdots,n_1)$ to each unit in the treatment group and $\widehat{q}_j^{UnAD}=\frac{1}{n_0}~(j=1,2,\cdots,n_0)$ to each unit in the control group. From this, we obtain the sample unadjusted estimator of the ATE, that is,

$$\widehat{\tau}_{UnAD}=\sum_{i=1}^{n_1}\widehat{p}_i^{UnAD}Y_{1i}-\sum_{j=1}^{n_0}\widehat{q}_j^{UnAD}Y_{0j}.$$

5.3 Simulation Studies

For convenience, we denote

  • KDBC: the proposed KDB covariate balancing with only the constraints

    $$\sum_{i=1}^{n_1}p_i=1,\quad p_i\geq 0;\qquad\sum_{j=1}^{n_0}q_j=1,\quad q_j\geq 0;$$

  • KDM1: the proposed KDB covariate balancing with the constraints

    $$\sum_{i=1}^{n_1}p_i=1,\quad p_i\geq 0;\qquad\sum_{j=1}^{n_0}q_j=1,\quad q_j\geq 0;$$
    $$\sum_{i=1}^{n_1}p_iX_{d,1i}=\sum_{j=1}^{n_0}q_jX_{d,0j},\quad d\in\{1,\cdots,D\}.$$

5.3.1 Simulation 1: Kang and Schafer (Kang et al., 2007; Josey et al., 2020)

In this section, we follow the simulation example in Josey et al. (2020), an extensive set of experimental scenarios adapted from Kang et al. (2007), to compare the different weighting methods.

We first generate the observed covariates $\bm{X}=(X_1,X_2,X_3,X_4)^T$, where $X_d\sim\mathcal{N}(0,1)$, $d=1,\cdots,4$, and then apply the following transformations to build the hidden variables $\bm{U}=(U_1,U_2,U_3,U_4)^T$:

$$U_1=\exp\left(\frac{X_1}{2}\right),\quad U_2=\frac{X_2}{1+\exp(X_1)}+10,\quad U_3=\left(\frac{X_1X_3}{25}+0.6\right)^3,\quad U_4=(X_2+X_4+20)^2.$$

In subsequent applications, we standardize $\bm{U}=(U_1,U_2,U_3,U_4)^T$ so that it has mean zero and marginal variances of one. The propensity score model is defined as

$$\pi_i^{(\delta_T)}=\frac{\exp[\eta_i^{(\delta_T)}]}{1+\exp[\eta_i^{(\delta_T)}]}\quad\Rightarrow\quad T_i^{(\delta_T)}\sim Bin(1,\pi_i^{(\delta_T)}),$$

where $\delta_T\in\{X,U\}$ refers to the generative process that determines the treatment assignment. Here, $\delta_T=X$ indicates that the log odds of the propensity score are linear, with $\eta_i^{(X)}=-X_{i1}+0.5X_{i2}-0.25X_{i3}-0.1X_{i4}$, while $\delta_T=U$ implies a nonlinear relationship, with $\eta_i^{(U)}=-U_{i1}+0.5U_{i2}-0.25U_{i3}-0.1U_{i4}$. As for the outcome process, we have $Y=Y(1)\cdot T+Y(0)\cdot(1-T)$ and construct the pair of potential outcomes $\{Y_i(1),Y_i(0)\}$ for unit $i$, $i=1,\cdots,N$, with a bivariate model

$$\begin{pmatrix}Y_i(0)\\Y_i(1)\end{pmatrix}^{(\delta_O)}\sim N\begin{pmatrix}\begin{bmatrix}\mu_i^{(\delta_O)}\\\mu_i^{(\delta_O)}+\gamma\end{bmatrix},\begin{bmatrix}\sigma^2&\rho\sigma\\\rho\sigma&\sigma^2\end{bmatrix}\end{pmatrix},$$

where $\sigma^2$ is the error variance, $\rho$ is the correlation between the potential outcomes, $\gamma=20$ represents the treatment effect, and $\delta_O\in\{X,U\}$ indicates which covariates, $\bm{X}$ or $\bm{U}$, enter the outcome process. For example, if $\delta_O=X$, we have $\mu_i^{(X)}=210+27.4X_{i1}+13.7X_{i2}+13.7X_{i3}+13.7X_{i4}$. It should be emphasized that although we may use $\bm{U}$ to construct $T$ or $Y$, $\bm{U}$ cannot be observed, and the covariates to be balanced are always $\bm{X}$. In other words, the observed simulated data consist of $\{X_i,T_i,Y_i\}$, $i=1,\cdots,N$.
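A sketch of this generative mechanism for the linear scenario $\{\delta_T=X,\delta_O=X\}$, assuming the off-diagonal covariance entries are $\rho\sigma^2$ (i.e., correlation $\rho$):

```r
gen_kang_schafer <- function(N, sigma2 = 10, rho = 0, gamma = 20) {
  X   <- matrix(rnorm(N * 4), N, 4)
  eta <- -X[, 1] + 0.5 * X[, 2] - 0.25 * X[, 3] - 0.1 * X[, 4]
  T   <- rbinom(N, 1, plogis(eta))                 # logistic propensity model
  mu  <- 210 + 27.4 * X[, 1] + 13.7 * (X[, 2] + X[, 3] + X[, 4])
  Sig <- sigma2 * matrix(c(1, rho, rho, 1), 2, 2)  # Var = sigma2, Cor = rho
  eps <- MASS::mvrnorm(N, mu = c(0, 0), Sigma = Sig)
  Y0  <- mu + eps[, 1]
  Y1  <- mu + gamma + eps[, 2]
  data.frame(X, T = T, Y = ifelse(T == 1, Y1, Y0))
}
```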

Based on this data generation mechanism, we know that

  • ATE:

    $$\tau=E\left[Y_i(1)-Y_i(0)\right]=E\left[Y_i(1)\right]-E\left[Y_i(0)\right]=\left(\mu_i^{(\delta_O)}+\gamma\right)-\mu_i^{(\delta_O)}=\gamma=20;$$

  • ATT:

    $$\tau_{trt}=E\left[Y_i(1)-Y_i(0)\,|\,T=1\right]=E\left[\left(\mu_i^{(\delta_O)}+\gamma\right)-\mu_i^{(\delta_O)}\,\Big|\,T=1\right]=\gamma=20.$$

To better understand the role of the different parameters in estimating the ATE, we choose the sample size $N$ in $\{200,1000\}$, the error variance $\sigma^2$ in $\{2,5,10\}$, the correlation between the potential outcomes $\rho$ in $\{-0.3,0,0.5\}$, the treatment assignment process $\delta_T$ in $\{U,X\}$ ($T$ generated with covariates $\bm{U}$ or $\bm{X}$), and the outcome process $\delta_O$ in $\{U,X\}$ ($Y$ generated with covariates $\bm{U}$ or $\bm{X}$), for a total of 72 different simulation scenarios. We change only one parameter between simulation scenarios and run 500 Monte Carlo datasets for each setting.

Fixing $\delta_T$ and $\delta_O$ at each of $\{\delta_T=X,\delta_O=X\}$, $\{\delta_T=X,\delta_O=U\}$, $\{\delta_T=U,\delta_O=X\}$, and $\{\delta_T=U,\delta_O=U\}$, we run 1000 Monte Carlo datasets to compare the performance of four weighting methods (IPW, CBPS, EB, KDBC) for estimating the ATE under different combinations of $N$, $\sigma^2$, and $\rho$. The results are briefly shown in Figures 1-4. Each figure corresponds to a different combination of $\delta_T$ and $\delta_O$; the values of $\sigma^2$ in the rows from top to bottom are $\{2,5,10\}$, and the values of $\rho$ in the columns from left to right are $\{-0.3,0,0.5\}$. We find that the sample size $N$ and the error variance $\sigma^2$ mainly influence the standard errors of the ATE estimators, while the correlation between the potential outcomes $Y_i(0)$ and $Y_i(1)$ has almost no effect. This observation is consistent with Josey et al. (2020), and we therefore focus on the simulation scenario with $N=200$, $\sigma^2=10$, and $\rho=0$.

We run 500 Monte Carlo experiments for estimating both the ATE and the ATT. The results of the ATE estimation are shown in Figure 5 and Tables 1-2, and the results of the ATT estimation are shown in Tables 3-4. Tables 1 and 3 together show that the KDBC estimators, for both the ATE and the ATT, perform as well as the other weighting estimators when all the true predictors are included in the observed dataset, while in the scenarios $\{\delta_T=X,\delta_O=U\}$ and $\{\delta_T=U,\delta_O=U\}$, where key information about the true covariates is missing, the KDBC estimators perform much better than the other weighting estimators. This reflects the high stability of our proposed KDB covariate balancing in estimating causal effects, even when the observed dataset misses key information about the true generating mechanism.

To better observe the adjustment effect of our proposed method, we draw the cumulative distribution functions (CDFs) of all covariates without any adjustment and those of the weighted covariates under KDBC. As seen from Figure 6, regardless of the values of $\delta_T$ and $\delta_O$, under the proposed KDB covariate balancing with only the regularization constraints (KDBC), the CDFs of the weighted covariates in the treatment and control groups match each other well. This indicates that the proposed method can effectively achieve covariate balance between the treatment and control groups.

Figure 1: T Scenario = "X", Y Scenario = "X".
Figure 2: T Scenario = "X", Y Scenario = "U".
Figure 3: T Scenario = "U", Y Scenario = "X".
Figure 4: T Scenario = "U", Y Scenario = "U".
Figure 5: ATE estimates using five balancing methods (IPW, CBPS, EB, CAL, and KDBC) with $N=200$, $\sigma^2=10$, $\rho=0$. Both the oracle estimate and the unadjusted (UnAD) estimate are used as benchmarks for comparison.
Treatment Assignment Scenario Outcome Scenario Metric Benchmark Balancing Methods
Oracle UnAD IPW CBPS EB CAL KDBC KDM1
X X ATE 20.02181 -0.22254 19.42157 19.99264 19.97454 20.00381 19.99452 20.01701
Bias 0.02181 -20.22254 -0.57843 -0.00736 -0.02546 0.00381 -0.00548 0.01701
sd 0.01358 0.22983 0.16986 0.02390 0.02298 0.02302 0.03207 0.03171
RMSE 0.01360 0.93307 0.17165 0.02388 0.02299 0.02299 0.03204 0.03168
X U ATE 20.01750 7.76609 19.28762 19.67198 19.74617 19.76634 19.95198 19.97879
Bias 0.01750 -12.23391 -0.71238 -0.32802 -0.25383 -0.23366 -0.04802 -0.02121
sd 0.01428 0.23171 0.25003 0.17427 0.12919 0.12925 0.03289 0.03279
RMSE 0.01428 0.59407 0.25181 0.17471 0.12956 0.12954 0.03293 0.03277
U X ATE 19.99365 4.14942 20.37174 19.99844 19.95353 19.99821 19.98244 19.99608
Bias -0.00635 -15.85058 0.37174 -0.00156 -0.04647 -0.00179 -0.01756 -0.00392
sd 0.01431 0.21525 0.14953 0.02376 0.02326 0.02298 0.03060 0.03047
RMSE 0.01429 0.74076 0.15030 0.02374 0.02333 0.02295 0.03058 0.03044
U U ATE 19.99174 5.30838 15.30943 15.19822 15.83007 15.86088 19.89995 19.91026
Bias -0.00826 -14.69162 -4.69057 -4.80178 -4.16993 -4.13912 -0.10005 -0.08974
sd 0.01474 0.22360 0.17365 0.14242 0.11789 0.11796 0.03044 0.03037
RMSE 0.01473 0.69396 0.27221 0.25760 0.22056 0.21943 0.03074 0.03061
Table 1: Details on ATE estimate with 500 Monte Carlo datasets.
Treatment Assignment Scenario Outcome Scenario Balancing Metric Benchmark Balancing Methods
Oracle UnAD IPW CBPS EB CAL KDBC KDM1
X X KD 0.00100 0.17749 0.03689 0.02457 0.02138 0.02140 0.00036 0.00035
maxASMD 0.00000 0.00879 0.00129 0.00015 0.00012 0.00012 0.00017 0.00017
meanASMD 0.00000 0.00409 0.00068 0.00008 0.00007 0.00007 0.00009 0.00009
medASMD 0.00000 0.00327 0.00061 0.00007 0.00006 0.00006 0.00008 0.00008
meanKS 0.00000 0.22341 0.21086 0.21025 0.20455 0.20455 0.20812 0.20635
meanT 0.00000 -1.30843 -0.05517 0.00350 0.00279 0.00439 -0.00104 -0.00060
X U KD 0.00100 0.17657 0.03709 0.02459 0.02145 0.02147 0.00036 0.00036
maxASMD 0.00000 0.00874 0.00129 0.00016 0.00013 0.00013 0.00018 0.00018
meanASMD 0.00000 0.00404 0.00069 0.00009 0.00007 0.00007 0.00010 0.00010
medASMD 0.00000 0.00324 0.00062 0.00008 0.00006 0.00006 0.00009 0.00009
meanKS 0.00000 0.22388 0.21115 0.21048 0.20520 0.20520 0.20888 0.20717
meanT 0.00000 -1.28633 -0.03970 0.00588 0.00577 0.00748 0.00066 0.00144
U X KD 0.00100 0.15692 0.03542 0.02824 0.02555 0.02557 0.00036 0.00035
maxASMD 0.00000 0.00837 0.00101 0.00014 0.00012 0.00011 0.00016 0.00016
meanASMD 0.00000 0.00359 0.00053 0.00008 0.00006 0.00006 0.00009 0.00009
medASMD 0.00000 0.00265 0.00047 0.00007 0.00006 0.00005 0.00008 0.00008
meanKS 0.00000 0.20399 0.19083 0.19077 0.18849 0.18850 0.19127 0.19017
meanT 0.00000 -0.85087 0.02321 0.00628 -0.00105 0.00097 0.00171 0.00224
U U KD 0.00100 0.15523 0.03651 0.02824 0.02557 0.02558 0.00035 0.00036
maxASMD 0.00000 0.00834 0.00109 0.00015 0.00012 0.00012 0.00017 0.00017
meanASMD 0.00000 0.00354 0.00057 0.00008 0.00007 0.00006 0.00009 0.00009
medASMD 0.00000 0.00258 0.00050 0.00007 0.00006 0.00006 0.00009 0.00009
meanKS 0.00000 0.20276 0.18890 0.18875 0.18557 0.18556 0.18917 0.18809
meanT 0.00000 -0.85721 0.03375 0.00702 -0.00030 0.00165 0.00152 0.00210
Table 2: ATE - Details on Balancing Metrics with 500 Monte Carlo datasets.
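For concreteness, the ASMD and Kolmogorov-Smirnov metrics reported in Tables 2 and 4 can be computed along the following lines. This is a schematic sketch under our own naming conventions, not the exact code behind the tables; the max/mean/med variants then aggregate the per-covariate ASMD values.

import numpy as np

def weighted_mean_var(x, w):
    """Weighted mean and variance of one covariate."""
    w = w / w.sum()
    m = np.sum(w * x)
    return m, np.sum(w * (x - m) ** 2)

def asmd(x_trt, w_trt, x_ctl, w_ctl):
    """Absolute standardized mean difference for one covariate."""
    m1, v1 = weighted_mean_var(x_trt, w_trt)
    m0, v0 = weighted_mean_var(x_ctl, w_ctl)
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2.0)

def ks_stat(x_trt, w_trt, x_ctl, w_ctl):
    """Weighted Kolmogorov-Smirnov statistic sup_t |F_1(t) - F_0(t)|."""
    grid = np.sort(np.concatenate([x_trt, x_ctl]))
    F1 = np.array([(w_trt * (x_trt <= t)).sum() for t in grid]) / w_trt.sum()
    F0 = np.array([(w_ctl * (x_ctl <= t)).sum() for t in grid]) / w_ctl.sum()
    return np.max(np.abs(F1 - F0))

# maxASMD / meanASMD / medASMD aggregate asmd() over covariate columns, e.g.
# np.max([asmd(Xt[:, j], wt, Xc[:, j], wc) for j in range(Xt.shape[1])]).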
Treatment Assignment Scenario Outcome Scenario Metric Benchmark Balancing Methods
Oracle IPW CBPS EB CAL KDM1 KDBC
X X ATT 20.02181 18.94827 20.01480 19.94777 20.01839 20.01063 19.48173
Bias 0.02181 -1.05173 0.01480 -0.05223 0.01839 0.01063 -0.51827
sd 0.01358 0.22240 0.02916 0.02931 0.02924 0.04480 0.04704
RMSE 0.01360 0.22710 0.02914 0.02938 0.02922 0.04476 0.05240
X U ATT 20.01750 19.69548 20.77572 20.75489 20.77844 20.41813 19.82211
Bias 0.01750 -0.30452 0.77572 0.75489 0.77844 0.41813 -0.17789
sd 0.01428 0.24458 0.15395 0.15362 0.15389 0.05750 0.05277
RMSE 0.01428 0.24471 0.15766 0.15714 0.15763 0.06041 0.05332
U X ATT 19.99365 25.08391 19.99283 19.94235 19.99311 19.97513 19.77988
Bias -0.00635 5.08391 -0.00717 -0.05765 -0.00689 -0.02487 -0.22012
sd 0.01431 0.25533 0.02654 0.02661 0.02653 0.04041 0.04144
RMSE 0.01429 0.34170 0.02652 0.02671 0.02650 0.04038 0.04255
U U ATT 19.99174 19.07117 16.39781 16.37561 16.39804 19.96664 19.74746
Bias -0.00826 -0.92883 -3.60219 -3.62439 -3.60196 -0.03336 -0.25254
sd 0.01474 0.25013 0.15951 0.15934 0.15953 0.04586 0.04628
RMSE 0.01473 0.25331 0.22659 0.22718 0.22660 0.04584 0.04759
Table 3: Details on ATT estimate with 500 Monte Carlo datasets.
Treatment Assignment Scenario Outcome Scenario Balancing Metric Benchmark Balancing Methods
Oracle IPW CBPS EB CAL KDM1 KDBC
X X KD 0.00142 0.05126 0.02618 0.02612 0.02619 0.00725 0.00568
maxASMD 0.00000 0.00198 0.00050 0.00050 0.00050 0.00050 0.00054
meanASMD 0.00000 0.00107 0.00024 0.00024 0.00024 0.00024 0.00027
medASMD 0.00000 0.00098 0.00020 0.00020 0.00020 0.00020 0.00023
meanKS 0.00000 0.25172 0.26461 0.26408 0.26467 0.45139 0.43743
meanT 0.00000 -0.09005 -0.00207 -0.00467 -0.00190 -0.00225 -0.01696
X U KD 0.00142 0.05214 0.02675 0.02669 0.02676 0.00713 0.00556
maxASMD 0.00000 0.00205 0.00051 0.00051 0.00051 0.00051 0.00055
meanASMD 0.00000 0.00110 0.00024 0.00024 0.00024 0.00024 0.00027
medASMD 0.00000 0.00100 0.00020 0.00020 0.00020 0.00020 0.00022
meanKS 0.00000 0.25165 0.26293 0.26249 0.26299 0.44922 0.43451
meanT 0.00000 -0.05648 -0.00233 -0.00499 -0.00215 -0.00118 -0.01506
U X KD 0.00140 0.05612 0.02820 0.02814 0.02820 0.00501 0.00412
maxASMD 0.00000 0.00232 0.00041 0.00040 0.00041 0.00041 0.00041
meanASMD 0.00000 0.00123 0.00018 0.00018 0.00018 0.00018 0.00019
medASMD 0.00000 0.00110 0.00014 0.00014 0.00014 0.00014 0.00016
meanKS 0.00000 0.22078 0.21968 0.21934 0.21968 0.41261 0.39862
meanT 0.00000 0.30059 0.00518 0.00423 0.00518 0.00598 0.00305
U U KD 0.00141 0.05644 0.02832 0.02827 0.02832 0.00489 0.00403
maxASMD 0.00000 0.00237 0.00044 0.00044 0.00044 0.00044 0.00044
meanASMD 0.00000 0.00123 0.00020 0.00020 0.00020 0.00020 0.00021
medASMD 0.00000 0.00108 0.00015 0.00015 0.00015 0.00015 0.00017
meanKS 0.00000 0.21859 0.21740 0.21706 0.21741 0.41061 0.39707
meanT 0.00000 0.31149 0.00658 0.00557 0.00660 0.00625 0.00418
Table 4: ATT - Details on Balancing Metrics with 500 Monte Carlo datasets.
Figure 6: The CDFs of the unweighted covariates are arranged in odd rows, and those of the weighted covariates under KDBC are arranged in even rows. All combinations of “T scenario” and “Y scenario” are included. The dashed line indicates the treatment group, and the solid line indicates the control group.

5.3.2 Simulation 2

In this section, we perform a simulation experiment with a parameter setting different from that of Simulation 1 (Section 5.3.1). We wish to explore the mechanism by which the KDB method mines covariate information, as well as to check the performance of the KDB stable ATE estimation.

We first construct the treatment indicator $T_{i},~i=1,2,\cdots,N$ from a binomial distribution $Bin(1,p)$. Here, we set $p$, the probability of unit $i$ being assigned to the treatment group, to be 0.5. Next, we use two different multivariate normal distributions to generate the covariates $X_{1}$ and $X_{2}$ for the treatment group and the control group, respectively. We have

\begin{pmatrix}X_{1}&X_{2}\end{pmatrix}|T=1\sim N\left(\begin{pmatrix}1\\ 2\end{pmatrix},\begin{pmatrix}1&0.5\\ 0.5&1\end{pmatrix}\right)

and

\begin{pmatrix}X_{1}&X_{2}\end{pmatrix}|T=0\sim N\left(\begin{pmatrix}1\\ 2\end{pmatrix}+\alpha_{1},\begin{pmatrix}1&0.5\\ 0.5&1\end{pmatrix}+\begin{pmatrix}0&\alpha_{2}\\ \alpha_{2}&0\end{pmatrix}\right),

where $\alpha_{1}=0.8,~\alpha_{2}=0.2$. Letting $X_{3}$ and $X_{4}$ be distributed as $\mathcal{N}(0,1)$, we continue to construct the covariates $X_{5}=X_{1}\times X_{2}$ and $X_{6}=X_{3}^{2}$. The vector $(X_{1},X_{2},X_{3},X_{4},X_{5},X_{6})^{T}$ is subsequently standardized to have a mean of zero and marginal variances of one. The observed covariates are $\{X_{1},X_{2},X_{3},X_{4}\}$. For the outcome variable, we use the bivariate model

\begin{pmatrix}Y_{i}(0)\\ Y_{i}(1)\end{pmatrix}\sim N\left(\begin{bmatrix}\mu_{i}\\ \mu_{i}+\gamma\end{bmatrix},\begin{bmatrix}\sigma^{2}&0\\ 0&\sigma^{2}\end{bmatrix}\right)

where $\gamma=10$, $\sigma^{2}=10$, and $\mu_{i}=20X_{i1}+10X_{i2}+5X_{i3}+5X_{i4}+\alpha_{3}X_{i5}+\alpha_{4}X_{i6}$ with $\alpha_{3}=1,\alpha_{4}=2$. To explore whether the KDB stable ATE estimation can effectively reduce the variance of the ATE estimator, we add a tuning parameter $\lambda$ to the diagonal of $\bm{K}_{G}$ and vary its value to see how the estimation results change.

\bm{K}_{G}=\begin{pmatrix}K_{1}&-K_{10}\\ -K_{01}&K_{0}\end{pmatrix}+\lambda\bm{I}_{N}

Setting the sample size $N=200$, we run 500 Monte Carlo datasets for each value of $\lambda$ in $\{0,1,2,5,10,100\}$. The results of ATE estimation are shown in Table 5. In each sub-table for a specific value of $\lambda$, the rows are arranged from top to bottom according to the absolute value of the bias.
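For concreteness, the data-generating process above and the construction of $\bm{K}_{G}$ with the ridge term can be sketched in a few lines of Python. The kernel bandwidth, the random seed, and all function and variable names below are illustrative assumptions, not the exact implementation behind Table 5.

import numpy as np

rng = np.random.default_rng(0)
N, p, gamma, sigma2 = 200, 0.5, 10.0, 10.0
alpha1, alpha2, alpha3, alpha4 = 0.8, 0.2, 1.0, 2.0

# Treatment indicator T_i ~ Bin(1, p).
T = rng.binomial(1, p, N)

# (X1, X2) drawn from group-specific bivariate normals.
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
X12 = np.empty((N, 2))
X12[T == 1] = rng.multivariate_normal(mean, cov, size=(T == 1).sum())
X12[T == 0] = rng.multivariate_normal(
    mean + alpha1,
    cov + np.array([[0.0, alpha2], [alpha2, 0.0]]),
    size=(T == 0).sum())
X3, X4 = rng.standard_normal(N), rng.standard_normal(N)
X5, X6 = X12[:, 0] * X12[:, 1], X3 ** 2          # interaction and square
X = np.column_stack([X12[:, 0], X12[:, 1], X3, X4, X5, X6])
X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize all six columns

# Potential outcomes with independent errors (the covariance is diagonal here).
mu = X @ np.array([20.0, 10.0, 5.0, 5.0, alpha3, alpha4])
Y0 = mu + rng.normal(0.0, np.sqrt(sigma2), N)
Y1 = mu + gamma + rng.normal(0.0, np.sqrt(sigma2), N)
X_obs = X[:, :4]                                 # X5 and X6 stay unobserved

def gaussian_gram(A, B, h=1.0):
    """Gaussian kernel Gram matrix; the bandwidth h is an assumption."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h ** 2))

# K_G as displayed above, with the ridge term lambda * I_N on the diagonal.
lam = 1.0
Xt, Xc = X_obs[T == 1], X_obs[T == 0]
K_G = np.block([[gaussian_gram(Xt, Xt), -gaussian_gram(Xt, Xc)],
                [-gaussian_gram(Xc, Xt), gaussian_gram(Xc, Xc)]]) + lam * np.eye(N)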

Comparing the results across all the sub-tables, we find that the UnAD estimator always performs the worst and is far from the true value. This shows the importance of covariate balance, especially when the real generating mechanism of the data involves nonlinear information. As the value of $\lambda$ increases, the performance of KDBC gradually deteriorates, but the performance of KDM1 remains at the forefront, which indicates the necessity of adding the first-moment balance to the balance constraints of the KDB method. Furthermore, adding a non-zero $\lambda$ does help reduce the variance of the KDM1 estimator of the ATE compared to the results when $\lambda=0$; the mechanism behind this requires further exploration.

When $\lambda$ changes from 0 to 1, the performance of KDM1 is not as good as that of CBPS and CAL. However, when we increase $\lambda$ from 2 to 100, KDM1 again becomes the best weighting method with the smallest absolute bias. This probably implies that there is a threshold for $\lambda$ beyond which it induces better performance, and the threshold may be related to the information matrix (Section 4.5) of the observed dataset.

Considering the ASMD values of $X_{5}$ and $X_{6}$ under the different weighting methods in Table 5, we see that the KDB methods, including KDBC and KDM1, always perform the best. Reviewing the performance of the KDB covariate balance in Simulation 1 (Section 5.3.1) with \{$\delta_{T}$ = X, $\delta_{O}$ = U\} and \{$\delta_{T}$ = U, $\delta_{O}$ = U\}, we conclude that the KDB covariate balance may automatically discover and incorporate some latent variables, such as interaction terms between covariates or higher-order terms of covariates, when estimating the ATE.

λ = 0
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 9.98058 0.01942 0.01427 0.01428 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
KDM1 9.97253 0.02747 0.02993 0.02993 0.00000 0.00034 0.00179 0.00039 0.00000 0.00068 0.00167 0.23125 -0.01172
KDBC 9.95925 0.04075 0.02982 0.02985 0.00000 0.00034 0.00203 0.00078 0.00061 0.00128 0.00166 0.23229 -0.01277
CBPS 9.93864 0.06136 0.02546 0.02559 0.00087 0.02829 0.15146 0.03478 0.00001 0.08006 0.12859 0.23308 -0.00203
CAL 9.92957 0.07043 0.02471 0.02488 0.00063 0.02436 0.13794 0.03100 0.00000 0.06410 0.12190 0.23103 -0.00062
IPW 9.87181 0.12819 0.13501 0.13500 0.00155 0.03605 0.18029 0.07931 0.06737 0.11296 0.13136 0.23359 -0.05327
EB 9.86523 0.13477 0.02452 0.02522 0.00063 0.02435 0.13861 0.03212 0.00218 0.06553 0.12184 0.23103 -0.00725
UnAD -13.16629 23.16629 0.15097 1.04695 0.04264 0.20455 0.89249 0.46910 0.46869 0.87363 0.11004 0.24557 -2.83703
λ = 1
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 9.98058 0.01942 0.01427 0.01428 0.00000 NA 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
CBPS 9.93864 0.06136 0.02546 0.02559 0.00087 0.02829 0.15146 0.03478 0.00001 0.08006 0.12859 0.23308 -0.00203
CAL 9.92957 0.07043 0.02471 0.02488 0.00063 0.02436 0.13794 0.03100 0.00000 0.06410 0.12190 0.23103 -0.00062
KDM1 9.92614 0.07386 0.02230 0.02252 0.00027 0.01609 0.09301 0.02067 0.00000 0.04032 0.08370 0.22262 -0.02399
IPW 9.87181 0.12819 0.13501 0.13500 0.00155 0.03605 0.18029 0.07931 0.06737 0.11296 0.13136 0.23359 -0.05327
EB 9.86523 0.13477 0.02452 0.02522 0.00063 0.02435 0.13861 0.03212 0.00218 0.06553 0.12184 0.23103 -0.00725
KDBC 7.70684 2.29316 0.03258 0.10759 0.00068 0.02577 0.13520 0.06471 0.06194 0.11691 0.08285 0.22935 -0.29548
UnAD -13.16629 23.16629 0.15097 1.04695 0.04264 0.20455 0.89249 0.46910 0.46869 0.87363 0.11004 0.24557 -2.83703
λ = 2
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 9.99897 0.00103 0.01403 0.01402 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
CAL 9.97598 0.02402 0.02712 0.02711 0.00061 0.02399 0.14136 0.03183 0.00000 0.06461 0.12634 0.23060 0.00053
KDM1 9.97495 0.02505 0.02545 0.02544 0.00034 0.01796 0.11197 0.02460 0.00000 0.04486 0.10274 0.22157 -0.02585
CBPS 9.96688 0.03312 0.02809 0.02810 0.00085 0.02795 0.15344 0.03519 0.00001 0.08018 0.13097 0.23332 -0.00194
EB 9.91031 0.08969 0.02739 0.02766 0.00061 0.02397 0.14194 0.03297 0.00220 0.06612 0.12629 0.23060 -0.00639
IPW 9.86231 0.13769 0.14389 0.14388 0.00165 0.03577 0.18388 0.07980 0.06765 0.11501 0.13453 0.23366 -0.06557
KDBC 5.90692 4.09308 0.04408 0.18827 0.00172 0.04099 0.20144 0.10423 0.10464 0.18551 0.09985 0.23222 -0.52996
UnAD -13.21379 23.21379 0.15533 1.04968 0.04286 0.20487 0.89554 0.47282 0.47129 0.87486 0.11403 0.24615 -2.86241
λ = 5
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 9.99204 0.00796 0.01455 0.01454 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
KDM1 9.97729 0.02271 0.02752 0.02751 0.00042 0.02001 0.12573 0.02743 0.00000 0.04826 0.11631 0.22127 -0.02422
CAL 9.97347 0.02653 0.02839 0.02839 0.00063 0.02423 0.14342 0.03196 0.00000 0.06393 0.12784 0.23043 -0.00077
CBPS 9.96713 0.03287 0.02915 0.02916 0.00085 0.02797 0.15523 0.03541 0.00001 0.07816 0.13429 0.23273 -0.00133
EB 9.90745 0.09255 0.02870 0.02897 0.00063 0.02421 0.14408 0.03316 0.00223 0.06564 0.12778 0.23045 -0.00754
IPW 9.71187 0.28813 0.12863 0.12914 0.00154 0.03547 0.18570 0.07895 0.06512 0.11247 0.13634 0.23305 -0.06307
KDBC 1.99376 8.00624 0.07158 0.36512 0.00566 0.07427 0.34591 0.18240 0.18200 0.33050 0.10952 0.23495 -1.02101
UnAD -13.08489 23.08489 0.15413 1.04381 0.04235 0.20387 0.89282 0.47006 0.46973 0.87439 0.11324 0.24486 -2.82364
λ = 10
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 10.00065 0.00065 0.01403 0.01402 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
KDM1 9.98450 0.01550 0.02582 0.02581 0.00044 0.02045 0.12810 0.02795 0.00000 0.04786 0.11985 0.22178 -0.02839
CAL 9.97525 0.02475 0.02652 0.02652 0.00060 0.02373 0.13972 0.03153 0.00000 0.06203 0.12714 0.23110 -0.00074
CBPS 9.97272 0.02728 0.02745 0.02745 0.00081 0.02729 0.14996 0.03443 0.00001 0.07602 0.13054 0.23393 -0.00384
EB 9.91211 0.08789 0.02633 0.02660 0.00060 0.02370 0.14022 0.03266 0.00214 0.06360 0.12710 0.23110 -0.00735
IPW 9.57129 0.42871 0.11484 0.11632 0.00140 0.03458 0.17781 0.07669 0.06409 0.11294 0.13090 0.23390 -0.07975
KDBC -1.93687 11.93687 0.09537 0.54227 0.01188 0.10779 0.48651 0.25751 0.25467 0.47040 0.11033 0.23875 -1.53728
UnAD -13.28347 23.28347 0.15303 1.05243 0.04300 0.20539 0.89664 0.47257 0.47019 0.87847 0.11167 0.24624 -2.87076
λ = 100
ATE abs(Bias) sd RMSE rw KD maxASMD meanASMD medASMD X5ASMD X6ASMD meanKS meanT
Oracle 9.99204 0.00796 0.01455 0.01454 0.00000 NA 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
KDM1 9.97934 0.02066 0.02821 0.02820 0.00050 0.02168 0.13688 0.02977 0.00000 0.05193 0.12669 0.21941 -0.02443
CAL 9.97347 0.02653 0.02839 0.02839 0.00063 0.02423 0.14342 0.03196 0.00000 0.06393 0.12784 0.23043 -0.00077
CBPS 9.96713 0.03287 0.02915 0.02916 0.00085 0.02797 0.15523 0.03541 0.00001 0.07816 0.13429 0.23273 -0.00133
EB 9.90745 0.09255 0.02870 0.02897 0.00063 0.02421 0.14408 0.03316 0.00223 0.06564 0.12778 0.23045 -0.00754
IPW 9.71187 0.28813 0.12863 0.12914 0.00154 0.03547 0.18570 0.07895 0.06512 0.11247 0.13634 0.23305 -0.06307
KDBC -11.07394 21.07394 0.14527 0.95356 0.03548 0.18653 0.82059 0.43223 0.43112 0.80211 0.11307 0.24343 -2.61005
UnAD -13.08489 23.08489 0.15413 1.04381 0.04235 0.20387 0.89282 0.47006 0.46973 0.87439 0.11324 0.24486 -2.82364
Table 5: ATE estimates and balancing metrics for $\lambda\in\{0,1,2,5,10,100\}$, with 500 Monte Carlo datasets for each value.

5.4 Real Data Analysis

To further compare the different weighting methods, we investigate the causal effect in a real data set: the labour training program data previously analyzed in LaLonde (1986) and Dehejia and Wahba (1999), among many others.

The dataset used here is a subset of the National Supported Work (NSW) Demonstration data containing information on earnings in 1974 (RE74), as used in LaLonde (1986). The NSW Demonstration was a transitional, subsidized work experience program for four target groups of people with longstanding employment problems: ex-offenders, former drug addicts, women who were long-term recipients of welfare benefits, and school dropouts, many with criminal records. It was designed to test whether and to what extent 12 to 18 months of employment in a supportive but performance-oriented environment would equip hard-to-employ people to get and hold regular, unsubsidized jobs.

Metric UnAD IPW CBPS EB CAL KDBC KDps KDM1
ATE 1804.78064 1641.77770 1637.56504 1612.77550 1612.44289 1774.39211 2311.99732 1672.08084
sd 28.78274 29.73068 29.73643 30.01748 30.01800 29.48966 56.44856 29.32907
Balancing Metric UnAD IPW CBPS EB CAL KDBC KDps KDM1
KD 0.00051 0.00012 0.00004 0.00004 0.00004 0.00002 0.00002 0.00001
maxASMD 0.00862 0.00802 0.00801 0.00800 0.00800 0.00864 0.00928 0.00803
meanASMD 0.00480 0.00445 0.00445 0.00445 0.00445 0.00482 0.00476 0.00445
medASMD 0.00498 0.00460 0.00460 0.00460 0.00460 0.00499 0.00469 0.00464
meanKS 0.57085 0.42960 0.42476 0.40371 0.40311 0.54584 0.22214 0.43027
meanT 9.81470 6.94619 6.86728 6.79432 6.78668 8.98585 2.07824 7.34122
Table 6: Results of ATE estimation for the NSW data (Dehejia and Wahba, 1999). The ATE is an average of 500 bootstrap estimates, and $sd=\frac{sd(ATE)}{\sqrt{500}}$ is the standard error of the average ATE estimate.

Variables included in the dataset are: treatment indicator (1 if treated, 0 if not treated), age, education, Black (1 if black, 0 otherwise), Hispanic (1 if Hispanic, 0 otherwise), married (1 if married, 0 otherwise), nodegree (1 if no degree, 0 otherwise), RE74 (earnings in 1974), RE75 (earnings in 1975), and RE78 (earnings in 1978). We compare the weighted estimates of earnings in 1978 between the treatment and control groups, which serve as estimates of the treatment effect. We compare the proposed KDB covariate balance with IPW, CBPS, EB, and CAL, and take the UnAD estimator as the benchmark. We set the balance restrictions in the KDB method as follows: regularization constraint only, regularization constraints plus propensity score balance, and regularization constraints plus the first-moment balance; the three scenarios are denoted by “KDBC”, “KDps”, and “KDM1”, respectively. We obtain 500 bootstrap datasets by resampling with replacement, with which we estimate the ATE. The results are shown in Table 6, and all numerical results are rounded to five decimal places. As can be seen from Table 6, the estimation results of all weighting methods differ markedly from the UnAD estimate. Except for the KDB methods (KDBC, KDps, KDM1), the estimation results of the other weighting methods are very similar. Both KDBC and KDM1 have among the smallest standard deviations (sd), so they show outstanding stability in estimating the ATE. Considering the specific estimates of the ATE, however, adding only regularization constraints may not be enough to achieve a good balance of covariates between the treatment and control groups. In addition, the KDps estimates are completely different from those of the other methods, including UnAD, and their sd is very large. Therefore, it seems inappropriate to use regularization constraints plus propensity score balance as the balance restrictions in the KDB method. This is a reminder to set balance restrictions carefully, which remains an important focus for exploration.
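The bootstrap procedure behind Table 6 can be outlined as follows. Here `fit_weights` is a hypothetical placeholder for whichever balancing routine (IPW, EB, KDBC, and so on) is under evaluation, so this is an illustrative sketch rather than the exact implementation.

import numpy as np

def weighted_ate(y, t, w):
    """Hajek-type weighted difference in means between the two groups."""
    return (np.sum(w[t == 1] * y[t == 1]) / np.sum(w[t == 1])
            - np.sum(w[t == 0] * y[t == 0]) / np.sum(w[t == 0]))

def bootstrap_ate(y, t, X, fit_weights, B=500, seed=1):
    """Resample units with replacement, refit the balancing weights, and
    average the ATE estimates; the reported sd is sd(ATE) / sqrt(B)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, len(y), len(y))
        w = fit_weights(X[idx], t[idx])   # placeholder balancing solver
        estimates.append(weighted_ate(y[idx], t[idx], w))
    est = np.asarray(estimates)
    return est.mean(), est.std(ddof=1) / np.sqrt(B)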

We plot the probability density functions (PDFs) of the covariates “age” (denoted by $X_{1}$) and “education” (denoted by $X_{2}$), as well as the PDFs of $X_{1}$ and $X_{2}$ weighted by the different methods (UnAD, IPW, CBPS, EB, CAL, KDBC), to compare performance. The first panels of Figures 7 and 8 indicate that $X_{1}$ and $X_{2}$ have almost the same distributions in the treatment and control groups, which is consistent with the randomization setting in the NSW data. In Figures 7 and 8, the relative relationship between the PDFs of $X_{1}$ and $X_{2}$ weighted by the IPW, CBPS, EB, and CAL methods in the treatment and control groups is very similar, which may explain why the estimation results of these weighting methods in Table 6 are so alike. Although all methods widen the gap between the probability density functions of the treatment group and the control group, the KDBC method seems to better retain the characteristics of the original PDFs, such as bimodality.

Figure 7: Density of $X_{1}$ = age under different balancing methods.
Figure 8: Density of $X_{2}$ = education under different balancing methods.

6 Conclusions and Further Discussions

6.1 Conclusions

With the goal of creating balanced samples, we have proposed a kernel-distance-based covariate balancing method with the Gaussian kernel in this article. The proposed method calculates sample weights for the treatment and control groups in order to yield proper estimators of the causal effect. It appears to perform well in discovering latent variables, such as interactions between predictors, and yields stable estimates of the ATE.

6.2 Further Discussions

There are two main directions in which our proposed method can be developed. (i) In addition to setting the kernel distance as the loss function, we can choose other commonly used covariate balance metrics as the loss function and use the kernel distance as an optimization constraint, making it possible to control the balance performance along different dimensions at the same time. (ii) We look forward to exploring further applications of kernel distance in high-dimensional spaces.

Acknowledgments

References

  • Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016): “Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 673.
  • Dehejia, R. H. and S. Wahba (1999): “Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs,” Journal of the American Statistical Association, 94, 1053–1062.
  • Deville, J.-C. and C.-E. Särndal (1992): “Calibration estimators in survey sampling,” Journal of the American Statistical Association, 87, 376–382.
  • Erlander, S. (1977): “Entropy in linear programs: An approach to planning,” Report LiTH-MAT-R-77-3, Department of Mathematics, Linköping University, Linköping, Sweden.
  • Hainmueller, J. (2012): “Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies,” Political Analysis, 20, 25–46.
  • Hirano, K., G. W. Imbens, and G. Ridder (2003): “Efficient estimation of average treatment effects using the estimated propensity score,” Econometrica, 71, 1161–1189.
  • Horvitz, D. G. and D. J. Thompson (1952): “A generalization of sampling without replacement from a finite universe,” Journal of the American Statistical Association, 47, 663–685.
  • Imai, K. and M. Ratkovic (2014): “Covariate balancing propensity score,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 243–263.
  • Imbens, G. W. and D. B. Rubin (2015): Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge University Press.
  • Josey, K. P., E. Juarez-Colunga, F. Yang, and D. Ghosh (2020): “A framework for covariate balance using Bregman distances,” Scandinavian Journal of Statistics.
  • Kang, J. D. Y. and J. L. Schafer (2007): “Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data,” Statistical Science, 22, 523–539.
  • LaLonde, R. J. (1986): “Evaluating the econometric evaluations of training programs with experimental data,” The American Economic Review, 604–620.
  • Ma, X. and J. Wang (2010): “Robust inference using inverse probability weighting,” Journal of the American Statistical Association, 1–10.
  • Newey, W. K. and R. J. Smith (2004): “Higher order properties of GMM and generalized empirical likelihood estimators,” Econometrica, 72, 219–255.
  • Neyman, J. S. (1923): “On the application of probability theory to agricultural experiments. Essay on principles. Section 9 (translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480),” Annals of Agricultural Sciences, 10, 1–51.
  • Rasmussen, C. E. (2003): “Gaussian processes in machine learning,” in Summer School on Machine Learning, Springer, 63–71.
  • Rosenbaum, P. R. and D. B. Rubin (1983): “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70, 41–55.
  • Rubin, D. B. (1974): “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, 66, 688.
  • Rubin, D. B. (1978): “Bayesian inference for causal effects: The role of randomization,” The Annals of Statistics, 34–58.
  • Rubin, D. B. (2005): “Causal inference using potential outcomes: Design, modeling, decisions,” Journal of the American Statistical Association, 100, 322–331.
  • Rubin, D. B. (2007): “The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials,” Statistics in Medicine, 26, 20–36.
  • Sriperumbudur, B. K., K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet (2012): “On the empirical estimation of integral probability metrics,” Electronic Journal of Statistics, 6, 1550–1599.
  • Wong, R. K. and K. C. G. Chan (2018): “Kernel-based covariate functional balancing for observational studies,” Biometrika, 105, 199–213.
  • Xie, Y. (2018): “Causal Inference with Covariate Balance Optimization,” Ph.D. thesis, University of Waterloo.
  • Xie, Y., Y. Zhu, C. A. Cotton, and P. Wu (2019): “A model averaging approach for estimating propensity scores by optimizing balance,” Statistical Methods in Medical Research, 28, 84–101.
  • Zhang, Z., H. J. Kim, G. Lonjon, and Y. Zhu (2019): “Balance diagnostics after propensity score matching,” Annals of Translational Medicine, 7.
  • Zhao, Q. and D. Percival (2016): “Entropy balancing is doubly robust,” Journal of Causal Inference, 5.
  • Zhu, Y., J. S. Savage, and D. Ghosh (2018): “A kernel-based metric for balance assessment,” Journal of Causal Inference, 6.
  • Zolotarev, V. M. (1983): “Probability metrics,” Teoriya Veroyatnostei i ee Primeneniya, 28, 264–287.