
Estimation of Local Average Treatment Effect by Data Combination

Kazuhiko Shinoda
Graduate School of Economics, Keio University
Takahiro Hoshino
Faculty of Economics, Keio University
RIKEN AIP
Abstract

It is important to estimate the local average treatment effect (LATE) when compliance with a treatment assignment is incomplete. The previously proposed methods for LATE estimation require all relevant variables to be jointly observed in a single dataset; however, in many real-world problems, it is difficult or even impossible to collect such data for technical or privacy reasons. We consider a novel problem setting in which LATE, as a function of covariates, is nonparametrically identified from the combination of separately observed datasets. For estimation, we show that the direct least squares method, which was originally developed for estimating the average treatment effect under complete compliance, is applicable to our setting. However, model selection and hyperparameter tuning for the direct least squares estimator can be unstable in practice since it is defined as a solution to a minimax problem. We then propose a weighted least squares estimator that enables simpler model selection by avoiding the minimax objective formulation. Unlike the inverse probability weighted (IPW) estimator, the proposed estimator directly uses the pre-estimated weight without inversion, avoiding the problems caused by inverse weighting. We demonstrate the effectiveness of our method through experiments using synthetic and real-world datasets.

1 Introduction

Estimating the causal effect of a treatment on an outcome of interest is central to optimal decision making in many real-world problems, such as policymaking, epidemiology, education, and marketing [48, 55, 40, 52]. However, the identification and estimation of treatment effects usually rely on an untestable assumption referred to as unconfoundedness, namely, independence between the treatment status and the potential outcomes [25]. Violations of unconfoundedness may occur not only in observational studies but also in randomized controlled trials (RCTs) when compliance with the assigned treatment is incomplete. For example, even if a coupon is distributed randomly to measure its effect on sales, the probability of using the coupon is likely to depend on unobserved characteristics of the individuals. Moreover, noncompliance can also occur regardless of the individual's intentions. In online advertisement placement, the probability of watching an ad depends on the bidding strategies of other companies, because the ads that are actually displayed are determined through real-time bidding even if one tries to place the ad randomly.

In such cases, it is well known that the local average treatment effect (LATE) can be identified and estimated using the treatment assignment as an instrumental variable (IV) and conditions milder than unconfoundedness [24, 4, 18]. LATE is the treatment effect measured for the subpopulation of compliers, individuals who always follow the given assignment.

We suppose that we cannot observe all relevant variables in a single dataset for technical or privacy reasons. For example, in online-to-offline (O2O) marketing, where treatments are implemented online and outcomes are observed offline, it is often difficult to match the records of the same individuals observed separately online and offline. Additionally, with the global anti-tracking movement gaining momentum, it may become more difficult to combine multiple pieces of information online as well. Although causal inference by data combination has been actively studied [42, 7, 32], LATE estimation using multiple datasets has not received much attention despite its practical importance. We extend the problem setting considered in [58], where two different treatment regimes are available, to allow for the existence of noncompliance.

For the estimation, we show that the direct estimation method originally developed for the average treatment effect (ATE) under complete compliance [58] can be applied to LATE estimation in our setting. However, their method has a practical issue in that model selection and hyperparameter tuning can be unstable owing to its minimax objective formulation. We then propose a weighted least squares estimator that avoids the minimax objective and improves stability in practice. Unlike the inverse probability weighted (IPW) estimator [56, 57, 46], which is often employed to estimate treatment effects, the proposed estimator directly uses the estimated propensity-score-difference (PSD) as a weight without inversion. Therefore, our method can also avoid a common issue in the IPW methods, namely, high variance at points where the propensity score is extremely close to zero.

The contributions of this study are threefold. First, we show that LATE is identified even when the outcome and treatment status cannot be observed simultaneously in a single dataset and the treatment assignment is completely missing. Second, we find that the positivity assumption, which is necessary in the standard setting with one regime, can be omitted in our setting; we show that this relaxation further facilitates data collection. Third, we develop a novel estimator that enables simpler model selection while maintaining the essence of direct estimation as much as possible.

2 Problem Setting

Unlike standard causal inference studies, we consider a setting where there are two different assignment regimes [58], which we term the two-regime design (2RD). The concept is quite general: it only requires two observational studies, two RCTs, or a combination of the two. We only have to ensure that the treatment assignment is carried out under different regimes, i.e., with different assignment probabilities (see Assumption 1.2).

We define our problem using the potential outcome framework [44, 25]. Let $K\in\{0,1\}$ be a regime indicator, and we use a superscript $k=0,1$ to specify the regime from which the variables come. Let $Y^{(k)}\in\mathcal{Y}\subset\mathbb{R}$ be an outcome of interest, $D^{(k)}\in\{0,1\}$ be a binary treatment indicator, $Z^{(k)}\in\{0,1\}$ be an assignment indicator, and $\bm{X}^{(k)}\in\mathcal{X}\subset\mathbb{R}^{q_{x}}$ be a $q_{x}$-dimensional vector of covariates. $D_{z}^{(k)}$ is the potential treatment status realized only when $Z^{(k)}=z$, and $Y_{d}^{(k)}$ is the potential outcome realized only when $D^{(k)}=d$, where $z,d\in\{0,1\}$. With this notation, we implicitly assume that $Z^{(k)}$ has no direct effect on $Y^{(k)}$ but affects $Y^{(k)}$ only indirectly through $D^{(k)}$. This condition, often referred to as the exclusion restriction, is necessary for $Z^{(k)}$ to be a valid IV. Let $Y_{1}$, $Y_{0}$, $D_{1}$, $D_{0}$, and $\bm{X}$ be the potential variables and covariates in the population of interest, and suppose that iid samples from $P(\bm{X})$ can be obtained as test samples.

Figure 1: Overview of the 2RD.

2.1 Basic Setting

First, we make the following assumption on the relation between the two regimes.

Assumption 1.

  1. $P(Y_{1}^{(1)},Y_{0}^{(1)},D_{1}^{(1)},D_{0}^{(1)},\bm{X}^{(1)})=P(Y_{1}^{(0)},Y_{0}^{(0)},D_{1}^{(0)},D_{0}^{(0)},\bm{X}^{(0)})=P(Y_{1},Y_{0},D_{1},D_{0},\bm{X})$.

  2. $P(Z^{(1)}=1|\bm{X}^{(1)}=\bm{x})\neq P(Z^{(0)}=1|\bm{X}^{(0)}=\bm{x})$ for any $\bm{x}\in\mathcal{X}$.

By Assumption 1.1, we suppose that the joint distribution of the potential variables and covariates in each assignment regime is the same across regimes and equal to the joint distribution in the population of interest. This means that the participants in each regime are random draws from the population of interest. We can still identify LATE even if we weaken Assumption 1.1 to its conditional version, but direct estimation is then no longer possible; see Appendix A for more discussion. Assumption 1.2 indicates that the assignment regimes differ from each other at every level of the covariates.

Additionally, we make the following assumption required for the identification of LATE. Here, $A \perp\!\!\!\perp B \,|\, C$ means that $A$ and $B$ are conditionally independent given $C$.

Assumption 2.

For $k=0,1$,

  1. $Y_{1}^{(k)},Y_{0}^{(k)},D_{1}^{(k)},D_{0}^{(k)} \perp\!\!\!\perp Z^{(k)}\,|\,\bm{X}^{(k)}$.

  2. $D^{(k)}=Z^{(k)}D_{1}^{(k)}+(1-Z^{(k)})D_{0}^{(k)}$ and $Y^{(k)}=D^{(k)}Y_{1}^{(k)}+(1-D^{(k)})Y_{0}^{(k)}$.

  3. $P(D_{1}^{(k)}\geq D_{0}^{(k)}|\bm{X}^{(k)})=1$.

  4. $P(D_{1}^{(k)}=1|\bm{X}^{(k)})\neq P(D_{0}^{(k)}=1|\bm{X}^{(k)})$.

Assumption 2.1 states that $Z^{(k)}$ is randomly assigned within a subpopulation sharing the same level of the covariates. Mean independence between the potential variables and the treatment assignment conditional on covariates would suffice for identifying LATE, but we maintain Assumption 2.1 since full conditional independence is typically assumed in the causal inference literature. Assumption 2.2 relates the potential variables to their realized counterparts. Assumption 2.3 excludes defiers, those who never follow the given treatment assignment, from our analysis. Assumption 2.4 is necessary for $Z^{(k)}$ to be a valid IV for $D^{(k)}$. Figure 1 illustrates the data-generating process under the 2RD.

Assumption 2 is the direct extension of the standard assumptions used for LATE estimation [1, 18, 51, 38] to the 2RD. However, we omit the condition referred to as positivity, $0<P(Z^{(k)}=1|\bm{X}^{(k)})<1$, as it is unnecessary in the 2RD. Therefore, it is possible to set $P(Z^{(0)}=1|\bm{X}^{(0)})=0$ as long as Assumption 1.2 is satisfied. We term the design with $P(Z^{(1)}=1|\bm{X}^{(1)})>0$ for any $\bm{X}^{(1)}$ and $P(Z^{(0)}=1|\bm{X}^{(0)})=0$ for any $\bm{X}^{(0)}$ the one-experiment 2RD (1E2RD), since an experiment is conducted only for those with $k=1$, and a natural state without any intervention is observed for those with $k=0$.

Although [58] does not mention this point, it is of great practical importance: the 1E2RD is much easier to implement than the general 2RD and broadens the applicability of our setting. In the 1E2RD, we only have to conduct one experiment or collect one set of observational data, just as in the standard setting for causal inference. All we need in addition is a dataset collected from those without any intervention. Such data may be available at no additional cost, for example, when appropriate open data are published by the government.

Hereafter, we abuse the notation by omitting the superscript on the potential variables and covariates unless this causes confusion.

2.2 Identification

Our parameter of interest is LATE as a function of covariates $\bm{X}$, defined as

$$\mu(\bm{X}):=E[Y_{1}-Y_{0}\,|\,D_{1}>D_{0},\bm{X}].$$

It measures how covariates $\bm{X}$ affect the average treatment effect within the subpopulation of compliers. The following theorem shows that $\mu(\bm{X})$ is nonparametrically identified in an unusual form in our setting.

Theorem 1.

Under Assumptions 1 and 2,

$$\mu(\bm{X})=\frac{E[Y^{(1)}|\bm{X}]-E[Y^{(0)}|\bm{X}]}{E[D^{(1)}|\bm{X}]-E[D^{(0)}|\bm{X}]}. \quad (1)$$

The proof can be found in Appendix B. Notably, this form coincides with the identification result for ATE in [58], which means that we can also estimate $\mu(\bm{X})$ from the same combination of datasets proposed in [58] under appropriate assumptions. This is a rather powerful result in practice since $Z^{(k)}$ is not required despite the presence of noncompliance. Moreover, $p_{d}^{(k)}:=P(D^{(k)}=1)$ and $P(\bm{X}|D^{(k)}=1)$ are sufficient to identify the denominator of (1), since $E[D^{(k)}|\bm{X}]$ can be identified as $\frac{P(\bm{X}|D^{(k)}=1)\,p_{d}^{(k)}}{P(\bm{X})}$ by Bayes' theorem.
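To make Theorem 1 concrete, the following Monte Carlo sketch (our illustration; the data-generating process is hypothetical, not from the paper) checks equation (1) with a binary covariate: the ratio of regime-wise differences in the means of $Y$ and $D$ recovers the complier effect.

```python
import numpy as np

# Illustrative Monte Carlo check of Theorem 1 (ours; the DGP is hypothetical).
rng = np.random.default_rng(0)
n = 1_000_000

x = rng.integers(0, 2, n)                     # binary covariate
complier = rng.random(n) < 0.4 + 0.3 * x      # compliers: D1=1, D0=0; others never-takers
tau = 1.0 + 2.0 * x                           # true complier effect mu(X) = 1 + 2X
y0 = x + rng.normal(size=n)

data = {}
for k, e in ((1, lambda v: 0.8 - 0.2 * v), (0, lambda v: 0.1 + 0.1 * v)):
    z = rng.random(n) < e(x)                  # regime-specific assignment probability
    d = z & complier                          # D = Z*D1 + (1-Z)*D0, with D0 = 0 here
    data[k] = (d.astype(float), y0 + d * tau)

for xv in (0, 1):
    m = x == xv
    num = data[1][1][m].mean() - data[0][1][m].mean()  # E[Y^(1)|x] - E[Y^(0)|x]
    den = data[1][0][m].mean() - data[0][0][m].mean()  # E[D^(1)|x] - E[D^(0)|x]
    print(f"x={xv}: identified LATE = {num / den:.3f}, true mu(x) = {1 + 2 * xv}")
```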

The point is that, in our setting, we can identify $\mu(\bm{X})$ as a combination of functions that depend only on covariates $\bm{X}$. This property allows us to use a direct estimation method for $\mu(\bm{X})$. Under the standard assumptions [1, 18, 51, 38], LATE is identified as $\mu(\bm{X})=\frac{E[Y|Z=1,\bm{X}]-E[Y|Z=0,\bm{X}]}{E[D|Z=1,\bm{X}]-E[D|Z=0,\bm{X}]}$, but this expression cannot be rewritten as a combination of functions depending only on covariates (see Appendix A).

Theorem 1 also suggests the usefulness of the 1E2RD. We can simplify (1) when we implement the 1E2RD under one-sided noncompliance, where individuals assigned to the control group never receive treatment, but they have a choice if assigned to the treatment group.

Corollary 1.

Assume $P(Z^{(0)}=1|\bm{X})=0$ (1E2RD) and $P(D_{0}=1|\bm{X})=0$ (one-sided noncompliance) in addition to Assumptions 1 and 2. Then, $E[Y^{(0)}|\bm{X}]=E[Y_{0}|\bm{X}]$ and

$$\mu(\bm{X})=\frac{E[Y^{(1)}|\bm{X}]-E[Y^{(0)}|\bm{X}]}{E[D^{(1)}|\bm{X}]}.$$
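For completeness, here is a short derivation of the corollary (our sketch; it follows directly from Assumption 2.2). Under the 1E2RD, $Z^{(0)}=0$ almost surely, so

$$D^{(0)}=Z^{(0)}D_{1}^{(0)}+(1-Z^{(0)})D_{0}^{(0)}=D_{0}^{(0)}=0,\qquad Y^{(0)}=D^{(0)}Y_{1}^{(0)}+(1-D^{(0)})Y_{0}^{(0)}=Y_{0}^{(0)},$$

where the first chain also uses one-sided noncompliance, $P(D_{0}=1|\bm{X})=0$. Hence $E[D^{(0)}|\bm{X}]=0$ and $E[Y^{(0)}|\bm{X}]=E[Y_{0}|\bm{X}]$, and substituting these into (1) yields the displayed expression.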

This corollary shows that we can reduce the number of necessary datasets in the 1E2RD under one-sided noncompliance. This fact not only facilitates data collection but also benefits estimation, since the denominator is now just a propensity score and can thus be estimated accurately by, for example, Positive-Unlabeled (PU) learning [15, 13, 14, 28] with the logistic loss.

Although one-sided noncompliance is often associated with RCTs, some observational studies also fit within this framework [19, 27]. In the case of one-sided noncompliance, we can also view our problem as estimating the average treatment effect on the treated (ATT), since LATE equals ATT under one-sided noncompliance and the other standard assumptions [19, 12].

2.3 Data Collection Scheme

We assume that joint samples of $(Y^{(k)},D^{(k)},Z^{(k)},\bm{X})$ are not available. By Theorem 1, the following separate datasets and an estimate of $p_{d}^{(k)}$ are sufficient for estimating $\mu(\bm{X})$:

$$\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}\overset{\mathrm{iid}}{\sim}P(\bm{X}|D^{(k)}=1),\qquad \{(y_{i}^{(k)},\bm{x}_{i}^{(k)})\}_{i=1}^{n^{(k)}}\overset{\mathrm{iid}}{\sim}P(Y^{(k)},\bm{X}),$$

for $k=0,1$. This setting is much easier to apply to real-world situations than the standard setting, where joint samples are required for every individual.

We can interpret this data collection scheme in two ways. First, it can be regarded as a version of the repeated cross-sectional (RCS) design [36, 6, 20, 42, 30], with $k=0,1$ representing the time points before and after the assignment regime switches, respectively. The 1E2RD is also possible in this case by setting $P(Z^{(0)}=1|\bm{X}^{(0)})=0$, which means collecting data for $k=0$ at some point before an experiment is conducted. Although panel data are generally advantageous for statistical analyses because they follow the same individuals over multiple time periods, RCS data have some advantages over panel data: they are easier and cheaper to collect in many cases, and consequently they are often more representative of the population of interest and larger in sample size than panel data.

However, there is a concern regarding the validity of Assumption 1.1 when we use RCS data since the potential variables may change over time. We need some sort of side information to be confident about the use of RCS data in our setting as we cannot directly test whether the potential variables remain unchanged.

The second interpretation is to collect the data for $k=0,1$ at the same time by randomly splitting the population of interest. If an experimenter can reliably perform random splitting, this approach is favorable since we do not have to worry about the validity of Assumption 1.1. The implementation of random splitting is easy in the case of RCTs.

One specific example of our setting is O2O marketing. To measure the effect of an online video ad on sales in physical stores, we usually cannot link the data of those who watched the ad with the data of those who shopped at the stores. However, it is relatively easy to collect data separately from ad viewers and from purchasers. In this case, data collection can be performed by either the RCS design or random splitting.

Another example is estimating the effect of a treatment that takes a certain period of time to take effect. For instance, it is desirable to use panel data to estimate the effect of job training on future earnings. However, individuals may gradually drop out of the survey, and the probability of attrition can depend on unobserved variables [21, 37]. Therefore, it is easier to collect RCS data than to construct balanced panel data. Because we do not need to observe the outcome and treatment status simultaneously, we are likely to be able to collect even larger and more representative data than with a normal RCS design.

3 Related Work

There have been many proposals for the estimation of $\mu(\bm{X})$ from joint samples [33, 22, 1, 51, 39, 38, 53]. These include estimation via a parametric specification of the local average response function $E[Y_{d}|D_{1}>D_{0},\bm{X}]$ [1], doubly robust estimators [39, 38], and estimation with a binary outcome [53], to name a few. However, the estimation of $\mu(\bm{X})$ by data combination has rarely been considered in the literature.

The 2RD has a structure similar to that of the instrumented difference-in-differences (IDID) [59]. The main differences are that the conditional independence assumption in IDID is strictly milder than our Assumption 2.1, but IDID requires $Z$ to be observed and does not admit direct estimation. Which setting is more plausible and practical depends on the actual situation.

Our problem setting is also closely related to two-sample IV (TSIV) estimation [5, 26, 41, 11, 9], where the outcome, IVs, and covariates are not jointly observed in a single dataset. Furthermore, the idea of using moments separately estimated from different datasets dates back to at least [29]. Although estimands in TSIV estimation are not limited to causal parameters, there have been some studies on causal inference in settings related to TSIV, including ATE estimation from samples of $(Y,Z,\bm{X})$ and $(D,Z,\bm{X})$ in the presence of unmeasured confounders [49] and causal inference using samples from heterogeneous populations [60, 47]. Our setting differs from theirs in that we need to observe $D$ only when $D=1$, and we do not need $Z$ at all. This is of great practical benefit, since samples of those who do not receive treatment are often rather difficult to observe. Although our setting requires two regimes, it opens up the possibility of LATE estimation in the real world since the 1E2RD is possible.

Another approach to causal inference from separately observed samples is partial identification [34, 50, 35], that is, deriving bounds for the treatment effects rather than imposing strong assumptions sufficient for point identification. In [16, 17], the moment inequality proposed in [10] was applied to derive sharp bounds for ATE when only samples of $(Y,D)$ and $(D,\bm{X})$ are available. The difficulty in applying their approach to the estimation of a treatment effect conditional on covariates is that it requires conditional distribution functions or conditional quantile functions, which are usually difficult to estimate accurately. Moreover, it is sometimes difficult to derive informative bounds for treatment effects without imposing strong assumptions, depending on the data-generating process [35].

4 Existing Methods

Although LATE estimation in our specific setting has not been studied before, some existing methods can be applied. We discuss the advantages and disadvantages of these methods, especially in terms of accuracy and model selection. Since the true value of a treatment effect is not observable owing to the fundamental problem of causal inference [23], model selection and hyperparameter tuning are substantial issues in practice [43, 3, 45].

4.1 Separate Estimation

A naive estimation method for $\mu(\bm{X})$ is to separately estimate the components and combine them as

$$\widehat{\mu}_{\mathrm{sep}}(\bm{x})=\frac{\widehat{E}[Y^{(1)}|\bm{X}=\bm{x}]-\widehat{E}[Y^{(0)}|\bm{X}=\bm{x}]}{\widehat{E}[D^{(1)}|\bm{X}=\bm{x}]-\widehat{E}[D^{(0)}|\bm{X}=\bm{x}]}, \quad (2)$$

where a hat denotes an estimator. $E[D^{(k)}|\bm{X}]$ can be estimated by PU learning [15, 13, 14, 28] with the logistic loss, using $\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}$ as positive data and the covariates in $\{(y_{i}^{(1)},\bm{x}_{i}^{(1)})\}_{i=1}^{n^{(1)}}$ and $\{(y_{i}^{(0)},\bm{x}_{i}^{(0)})\}_{i=1}^{n^{(0)}}$ as unlabeled data. Separate estimation is easy to implement but usually does not provide a good estimate, since it has four possible sources of error.
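As a concrete illustration of this step, the following sketch (ours; the function name is hypothetical, and the class prior $p_{d}^{(k)}$ is assumed to be pre-estimated) recovers $\widehat{E}[D^{(k)}|\bm{X}]$ from positive and unlabeled covariates by training a probabilistic classifier and applying Bayes' theorem, in the spirit of [15].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_e_d_given_x(x_pos, x_unl, p_d):
    """PU-style sketch of E[D|X]: a classifier separating positives (X | D=1)
    from unlabeled covariates (X) gives the density ratio p(x|D=1)/p(x) up to
    the known sampling proportions; multiplying by the pre-estimated prior
    p_d yields P(D=1|x) via Bayes' theorem."""
    n_p, n_u = len(x_pos), len(x_unl)
    s = np.r_[np.ones(n_p), np.zeros(n_u)]      # s=1: drawn from P(X|D=1)
    clf = LogisticRegression().fit(np.r_[x_pos, x_unl], s)

    def e_d(x_new):
        g = clf.predict_proba(x_new)[:, 1]
        ratio = (g / (1.0 - g)) * (n_u / n_p)   # estimate of p(x|D=1)/p(x)
        return np.clip(p_d * ratio, 0.0, 1.0)   # P(D=1|x) = p_d * ratio
    return e_d
```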

One advantage of the separate estimation is that model selection can also be performed naively by choosing the best model for each component. We can easily calculate model selection criteria, such as the mean squared error (MSE), for each component from the separately observed datasets. However, the resulting $\widehat{\mu}_{\mathrm{sep}}$ may perform poorly because it does not necessarily minimize the model selection criterion in terms of the true $\mu$ [43].

We can alleviate the drawback of the separate estimation by directly estimating the numerator and denominator in (2). Let $T:=KD^{(1)}-(1-K)D^{(0)}$ and $U:=KY^{(1)}-(1-K)Y^{(0)}$ be auxiliary variables for notational and computational convenience, and assume $P(K=1|\bm{X})=0.5$. Then, we can rewrite the expression in Theorem 1 as $\mu(\bm{X})=\nu(\bm{X})/\pi(\bm{X})$, where $\nu(\bm{X}):=E[U|\bm{X}]$ and $\pi(\bm{X}):=E[T|\bm{X}]$. To estimate $\nu(\bm{X})$ and $\pi(\bm{X})$, we can construct the combined datasets

$$\{(t_{i},\bm{x}_{ti},r_{ti})\}_{i=1}^{n_{t}} := \left\{\left((-1)^{1-k},\,\bm{x}_{di}^{(k)},\,\frac{\widehat{p}_{d}^{(k)}(n_{d}^{(1)}+n_{d}^{(0)})}{2n_{d}^{(k)}}\right)\right\}_{i=1,k=0}^{n_{d}^{(k)},1},$$
$$\{(u_{i},\bm{x}_{ui},r_{ui})\}_{i=1}^{n_{u}} := \left\{\left((-1)^{1-k}y_{i}^{(k)},\,\bm{x}_{i}^{(k)},\,\frac{n^{(1)}+n^{(0)}}{2n^{(k)}}\right)\right\}_{i=1,k=0}^{n^{(k)},1}$$

from $\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}$ and $\{(y_{i}^{(k)},\bm{x}_{i}^{(k)})\}_{i=1}^{n^{(k)}}$, respectively, where $n_{t}:=n_{d}^{(1)}+n_{d}^{(0)}$ and $n_{u}:=n^{(1)}+n^{(0)}$.
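In code, this construction is a simple concatenation with signs and weights; a minimal numpy sketch (ours; argument names mirror the notation above, and the estimates of $p_{d}^{(k)}$ are assumed given):

```python
import numpy as np

def build_combined(x_d1, x_d0, y1, x1, y0, x0, p_d1_hat, p_d0_hat):
    """Build {(t_i, x_ti, r_ti)} and {(u_i, x_ui, r_ui)} from the four
    separately observed datasets, following the definitions in Section 4.1."""
    nd1, nd0 = len(x_d1), len(x_d0)
    n1, n0 = len(x1), len(x0)
    # T-dataset: sign (-1)^{1-k}, covariates from P(X|D^(k)=1), weight r_t
    t = np.r_[np.ones(nd1), -np.ones(nd0)]
    x_t = np.vstack([x_d1, x_d0])
    r_t = np.r_[np.full(nd1, p_d1_hat * (nd1 + nd0) / (2 * nd1)),
                np.full(nd0, p_d0_hat * (nd1 + nd0) / (2 * nd0))]
    # U-dataset: signed outcomes, covariates from P(Y^(k), X), weight r_u
    u = np.r_[y1, -y0]
    x_u = np.vstack([x1, x0])
    r_u = np.r_[np.full(n1, (n1 + n0) / (2 * n1)),
                np.full(n0, (n1 + n0) / (2 * n0))]
    return (t, x_t, r_t), (u, x_u, r_u)
```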

We can approximate the expectation of the product of any function of $\bm{X}$ with $T$ or $U$ by simple sample averages over the above datasets, since we have

$$\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}f(\bm{x}_{ti}) = \frac{\widehat{p}_{d}^{(1)}}{2n_{d}^{(1)}}\sum_{i=1}^{n_{d}^{(1)}}f(\bm{x}_{di}^{(1)})-\frac{\widehat{p}_{d}^{(0)}}{2n_{d}^{(0)}}\sum_{i=1}^{n_{d}^{(0)}}f(\bm{x}_{di}^{(0)}) \approx \frac{1}{2}E[D^{(1)}f(\bm{X})]-\frac{1}{2}E[D^{(0)}f(\bm{X})],$$
$$\frac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}u_{i}f(\bm{x}_{ui}) = \frac{1}{2n^{(1)}}\sum_{i=1}^{n^{(1)}}y_{i}^{(1)}f(\bm{x}_{i}^{(1)})-\frac{1}{2n^{(0)}}\sum_{i=1}^{n^{(0)}}y_{i}^{(0)}f(\bm{x}_{i}^{(0)}) \approx \frac{1}{2}E[Y^{(1)}f(\bm{X})]-\frac{1}{2}E[Y^{(0)}f(\bm{X})].$$

An estimator of $\nu(\bm{X})$ can be obtained by any regression method using the above combined dataset, while we need a little twist to obtain an estimator of the PSD $\pi(\bm{X})$; see Appendix C for direct estimation methods for the PSD.

4.2 Direct Least Squares Estimation

We can apply the direct least squares (DLS) method [58], originally proposed for ATE estimation under complete compliance. Motivated by the drawbacks of the separate estimation, the DLS directly estimates $\mu(\bm{X})$, which is advantageous not only in performance but also in computational efficiency.

The following theorem corresponds to Theorem 1 in [58].

Theorem 2.

Assume $\nu\in L^{2}$, where $L^{2}:=\left\{f:\mathcal{X}\mapsto\mathbb{R}\,\middle|\,E[f(\bm{X})^{2}]<\infty\right\}$, in addition to Assumptions 1 and 2. Furthermore, define $H_{f}(\bm{X}):=\pi(\bm{X})f(\bm{X})-\nu(\bm{X})$. Then, $\mu$ is equal to the solution of the following least squares problem:

$$\min_{f\in L^{2}}E\left[H_{f}(\bm{X})^{2}\right]. \quad (3)$$

Theorem 2 immediately follows from Theorem 1. In what follows, we show that the minimizer of problem (3) can be estimated without going through the estimation of the conditional mean functions $\nu$ and $\pi$. Since $\left(H_{f}(\bm{X})-g(\bm{X})\right)^{2}\geq 0$ for any square-integrable function $g\in L^{2}$, we have $H_{f}(\bm{X})^{2}\geq 2H_{f}(\bm{X})g(\bm{X})-g(\bm{X})^{2}$ by expanding the square. The equality holds at $g(\bm{X})=H_{f}(\bm{X})$ for any $f$, which maximizes $2H_{f}(\bm{X})g(\bm{X})-g(\bm{X})^{2}$ with respect to $g$. Hence,

$$\mu(\bm{X})=\underset{f\in L^{2}}{\mathrm{argmin}}\,\max_{g\in L^{2}}J(f,g), \quad (4)$$

where $J(f,g):=E\left[2H_{f}(\bm{X})g(\bm{X})-g(\bm{X})^{2}\right]$. Rewriting $J(f,g)$ yields

$$J(f,g)=2E[Tf(\bm{X})g(\bm{X})]-2E[Ug(\bm{X})]-E[g(\bm{X})^{2}]. \quad (5)$$

While the objective functional in (3) requires the conditional mean estimators, this form can be estimated based on the sample averages as

$$\widehat{J}(f,g)=\frac{2}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}f(\bm{x}_{ti})g(\bm{x}_{ti})-\frac{2}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}u_{i}g(\bm{x}_{ui})-\frac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}g(\bm{x}_{ui})^{2}. \quad (6)$$

We can implement the DLS estimation of $\mu$ with an arbitrary regularization term $\Omega$ in practice:

$$\widehat{\mu}_{\mathrm{dls}}(\bm{x})=\underset{f\in F}{\mathrm{argmin}}\,\max_{g\in G}\,\widehat{J}(f,g)+\Omega(f,g),$$

where $F$ and $G$ are model classes for $f$ and $g$, respectively.

Although any model can be trained by optimizing its parameters to minimize the above objective functional, a practically useful choice is a linear-in-parameter model. Set $F=\{f_{\bm{\alpha}}:\bm{x}\mapsto\bm{\alpha}^{\top}\bm{\phi}(\bm{x})\,|\,\bm{\alpha}\in\mathbb{R}^{q_{f}}\}$ and $G=\{g_{\bm{\beta}}:\bm{x}\mapsto\bm{\beta}^{\top}\bm{\psi}(\bm{x})\,|\,\bm{\beta}\in\mathbb{R}^{q_{g}}\}$, where $\bm{\phi}$ and $\bm{\psi}$ are $q_{f}$- and $q_{g}$-dimensional basis functions, respectively. Using the $\ell_{2}$-regularizer, we have

$$\widehat{J}(f,g)+\Omega(f,g)=2\bm{\alpha}^{\top}\bm{A}\bm{\beta}-2\bm{\beta}^{\top}\bm{b}-\bm{\beta}^{\top}\bm{C}\bm{\beta}+\lambda_{f}\bm{\alpha}^{\top}\bm{\alpha}+\lambda_{g}\bm{\beta}^{\top}\bm{\beta},$$

where

$$\bm{A}:=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\bm{\phi}(\bm{x}_{ti})\bm{\psi}(\bm{x}_{ti})^{\top},\qquad \bm{b}:=\frac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}u_{i}\bm{\psi}(\bm{x}_{ui}),\qquad \bm{C}:=\frac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\psi}(\bm{x}_{ui})\bm{\psi}(\bm{x}_{ui})^{\top},$$

and $\lambda_{f}$ and $\lambda_{g}$ are positive constants. The advantage of this formulation with linear-in-parameter models and the $\ell_{2}$-regularizer is that we have an analytical solution. The solution to the inner maximization is given by

$$\widehat{\bm{\beta}}=(\bm{C}+\lambda_{g}\bm{I}_{q_{g}})^{-1}(\bm{A}^{\top}\bm{\alpha}-\bm{b}),$$

where $\bm{I}_{q_{g}}$ denotes the $q_{g}\times q_{g}$ identity matrix. We can then obtain the DLS estimator of $\bm{\alpha}$ by substituting $\widehat{\bm{\beta}}$ and solving the outer minimization:

$$\widehat{\bm{\alpha}}_{\mathrm{dls}}=\left\{\bm{A}(\bm{C}+\lambda_{g}\bm{I}_{q_{g}})^{-1}\bm{A}^{\top}+\lambda_{f}\bm{I}_{q_{f}}\right\}^{-1}\bm{A}(\bm{C}+\lambda_{g}\bm{I}_{q_{g}})^{-1}\bm{b}.$$
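In code, the closed-form DLS solution amounts to a few lines of linear algebra on the combined datasets; a minimal sketch (ours), assuming phi and psi are callables returning the basis-expanded design matrices:

```python
import numpy as np

def dls_fit(t, x_t, r_t, u, x_u, r_u, phi, psi, lam_f, lam_g):
    """Closed-form DLS estimator: A, b, C are the weighted sample moments
    from (6), and alpha solves the outer minimization after the analytic
    inner maximization over beta."""
    Pt, St = phi(x_t), psi(x_t)              # bases evaluated on the T-dataset
    Su = psi(x_u)                            # psi evaluated on the U-dataset
    A = ((r_t * t)[:, None] * Pt).T @ St / len(t)   # q_f x q_g
    b = Su.T @ (r_u * u) / len(u)                   # q_g
    C = (Su * r_u[:, None]).T @ Su / len(u)         # q_g x q_g
    Cinv = np.linalg.inv(C + lam_g * np.eye(C.shape[0]))
    M = A @ Cinv
    alpha = np.linalg.solve(M @ A.T + lam_f * np.eye(A.shape[0]), M @ b)
    return lambda x_new: phi(x_new) @ alpha  # mu_hat_dls(x)
```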

For model selection, we can choose the model that minimizes $\widehat{J}(\widehat{f},\widehat{g})$ evaluated on a validation set. However, as pointed out in [58], one cannot tell whether $\widehat{J}(\widehat{f},\widehat{g})$ is small because $\widehat{f}$ is a good solution to the outer minimization or because $\widehat{g}$ is a poor solution to the inner maximization. For this reason, $\widehat{\mu}_{\mathrm{dls}}(\bm{x}):=\widehat{\bm{\alpha}}_{\mathrm{dls}}^{\top}\bm{\phi}(\bm{x})$ can be unstable, and one cannot be confident about which model is best. The increased dimensionality of the search space, caused by the need to simultaneously select models for $f$ and $g$, also makes model selection based on $\widehat{J}(\widehat{f},\widehat{g})$ challenging.

5 Proposed Method

We propose a novel estimator that enables simpler model selection than the DLS estimation while maintaining the essence of direct estimation as much as possible. We avoid the minimax formulation of the objective functional in (4) by estimating $\mu$ as the solution to a weighted least squares problem derived from the original problem (3). Like the DLS estimator, it can be constructed from sample averages over the separately observed samples, except that we need to pre-estimate $\pi(\bm{X})$ as a weight. We term the proposed estimator the directly weighted least squares (DWLS) estimator, since the pre-estimated PSD appears directly, without inversion, in the objective, unlike in the IPW estimators [56, 57, 46].

Consider the following weighted least squares problem:

$$\min_{f\in L^{2}}E\left[\frac{w(\bm{X})}{\pi(\bm{X})}H_{f}(\bm{X})^{2}\right]=:Q_{0}(f|w), \quad (7)$$

where $w(\bm{X})$ is some weight depending on $\bm{X}$. Rewriting the objective yields

$$Q_{0}(f|w)=E[Tw(\bm{X})f(\bm{X})^{2}]-2E\left[Uw(\bm{X})f(\bm{X})\right]+E[S], \quad (8)$$

where $S:=\frac{w(\bm{X})}{\pi(\bm{X})}\nu(\bm{X})^{2}$ does not depend on $f$ and can therefore be safely ignored in the optimization. Choosing $w=\pi$ clearly reduces problem (7) to (3). Therefore, we can plug in the pre-estimated $\pi(\bm{X})$ as a weight in practice to find a minimizer of problem (3). Note that $\pi(\bm{X})$ may not be a proper weight in the usual sense, since it takes a negative value when $P(Z^{(1)}=1|\bm{X})<P(Z^{(0)}=1|\bm{X})$. We can estimate $Q(f|\widehat{\pi}):=Q_{0}(f|\widehat{\pi})-E[S]$ based on sample averages using the separately observed samples as

$$\widehat{Q}(f|\widehat{\pi})=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\widehat{\pi}(\bm{x}_{ti})f(\bm{x}_{ti})^{2}-\frac{2}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}u_{i}\widehat{\pi}(\bm{x}_{ui})f(\bm{x}_{ui}).$$

Using the linear-in-parameter model for $f$ and the $\ell_{2}$-regularizer as in the previous section, we have

$$\widehat{Q}(f|\widehat{\pi})+\Omega(f)=\bm{\alpha}^{\top}\widetilde{\bm{A}}\bm{\alpha}-2\bm{\alpha}^{\top}\widetilde{\bm{b}}+\lambda_{f}\bm{\alpha}^{\top}\bm{\alpha},$$

where

$$\widetilde{\bm{A}}:=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\widehat{\pi}(\bm{x}_{ti})\bm{\phi}(\bm{x}_{ti})\bm{\phi}(\bm{x}_{ti})^{\top},\qquad \widetilde{\bm{b}}:=\frac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}u_{i}\widehat{\pi}(\bm{x}_{ui})\bm{\phi}(\bm{x}_{ui}).$$

The analytical solution to $\min_{f\in F}\widehat{Q}(f|\widehat{\pi})+\Omega(f)$ can be obtained as $\widehat{\mu}_{\mathrm{dwls}}(\bm{x})=\widehat{\bm{\alpha}}_{\mathrm{dwls}}^{\top}\bm{\phi}(\bm{x})$, where

$$\widehat{\bm{\alpha}}_{\mathrm{dwls}}=\left(\widetilde{\bm{A}}+\lambda_{f}\bm{I}_{q_{f}}\right)^{-1}\widetilde{\bm{b}}.$$
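The DWLS fit is therefore a single ridge-type solve; a minimal sketch (ours), where pi_hat is the pre-estimated PSD (e.g., from the method in Appendix C) and phi returns the basis-expanded design matrix:

```python
import numpy as np

def dwls_fit(t, x_t, r_t, u, x_u, r_u, phi, pi_hat, lam_f):
    """Closed-form DWLS estimator: the pre-estimated PSD enters the weighted
    moments A_tilde and b_tilde directly, without inversion."""
    Pt, Pu = phi(x_t), phi(x_u)
    wt = r_t * t * pi_hat(x_t)               # r_ti * t_i * pi_hat(x_ti)
    A_tilde = (Pt * wt[:, None]).T @ Pt / len(t)
    b_tilde = Pu.T @ (r_u * u * pi_hat(x_u)) / len(u)
    alpha = np.linalg.solve(A_tilde + lam_f * np.eye(A_tilde.shape[0]), b_tilde)
    return lambda x_new: phi(x_new) @ alpha  # mu_hat_dwls(x)
```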

Although the DWLS objective is no longer the MSE, $\widehat{Q}(\widehat{f}|\widehat{\pi})$ is still a sufficient measure for model selection, because the $\widehat{\mu}_{\mathrm{dwls}}$ minimizing this objective is also a good estimator in terms of the true MSE as long as $\widehat{\pi}$ is sufficiently accurate. Since the DWLS involves only a single minimization, model selection is easier than in the DLS estimation.

Remark on the objective formulation

We can consider the following least squares problem instead of (7):

$$\min_{f\in L^{2}}E\left[\frac{\pi(\bm{X})}{w(\bm{X})}\left(f(\bm{X})-\mu(\bm{X})\right)^{2}\right]=:Q^{\prime}_{0}(f|w). \quad (9)$$

This objective can also be evaluated without the conditional mean function $\nu$ through a transformation similar to (8):

$$Q^{\prime}_{0}(f|w)=E\left[\frac{Tf(\bm{X})^{2}}{w(\bm{X})}\right]-2E\left[\frac{Uf(\bm{X})}{w(\bm{X})}\right]+E[S^{\prime}],$$

where $S^{\prime}:=\frac{\pi(\bm{X})}{w(\bm{X})}\mu(\bm{X})^{2}$.

We refer to the estimator based on this objective as the inverse weighted least squares (IWLS) estimator. This estimator tends to be imprecise when the PSD is extremely close to zero, which is a common issue among the IPW estimators [56, 57, 46]. Although the performance of such estimators can be improved by trimming small probabilities, determining the optimal threshold is nontrivial [31].

Selecting the threshold is even more complicated in the case of the IWLS because the PSD can be negative. At points where the PSD is close to zero, the absolute value of the weight increases while sign errors become more likely, so a small estimation error in the PSD tends to have a large impact on the IWLS. In contrast, the impact of the estimation error of the PSD on the DWLS estimator is limited, since the absolute value of the weight is small precisely where a sign error can occur in the PSD estimation. The weight in the proposed method is confined to $[-0.5,0.5]$, whereas the weight in the IWLS is not bounded at all.

Although unweighted subsampling methods have recently been studied to circumvent weighting samples by inverse probabilities [54], they cannot be directly applied in this case because they require the denominator to be a sampling probability.

6 Performance Evaluation

We test the performance of the proposed method with synthetic and real-world datasets. The details of the datasets and other setups can be found in Appendix D. We denote the separate estimation by SEP.

6.1 Synthetic Experiment

We generated synthetic data varying the shape of $\mu$, the covariate dimension $q_{x}$, and the training sample size. The mean and standard deviation of the MSE over 100 trials are summarized in Table 1. We conducted the Wilcoxon signed-rank test since the IWLS was highly unstable, which made the $t$-test unsuitable.

Although we did not trim the PSD for the DWLS estimator, its performance was sufficiently stable to outperform the others in almost all cases. The DWLS had larger errors than the DLS only when $\mu$ was constant and $q_{x}=1$. This may be because the benefit of direct estimation outweighed the difficulty of hyperparameter tuning of the DLS, since a one-dimensional constant $\mu$ is the easiest to learn. These results support the effectiveness of our approach: avoiding the minimax formulation without resorting to the inverse PSD.

On the other hand, the IWLS estimator often performed very poorly even with weight trimming. The IWLS is likely affected by a small PSD most severely, since it uses the inverse PSD both in the estimation and in the hyperparameter tuning. The performance of the DLS was not as poor, but it had a larger MSE than the SEP in many cases, indicating that its hyperparameter tuning did not work as well as that of the DWLS.

Figure 2 shows the relation between the squared error of each estimator and the PSD for the linear $\mu(\bm{X})$ when $q_{x}=5$ and $n=10000$. Areas with dense plots are colored lightly. The squared error of the DWLS shown in Figure 2a is kept very small, except that it is slightly larger when the PSD is close to zero. This result demonstrates the robustness of our method against a near-zero PSD, whereas the performance of the other estimators is affected more severely by a small PSD. We can observe the instability of the IWLS again in Figure 2b, which shows sporadic high errors over the entire range of the PSD. The squared error of the SEP displays a similar pattern. In Figure 2d, the light-colored area sits high, indicating that the hyperparameter tuning of the DLS does not work well.

Shape    n    q_x   DWLS          IWLS          SEP           DLS
h_con    10K  1     0.65 ± 1.54   5.04 ± 16.5   1.80 ± 2.38   0.15 ± 0.20
h_con    10K  5     0.40 ± 0.91   1.30 ± 3.21   3.46 ± 2.71   0.93 ± 1.17
h_con    10K  10    0.65 ± 1.00   4.96 ± 14.4   4.18 ± 3.81   1.81 ± 2.82
h_con    50K  1     0.22 ± 0.46   53.5 ± 372    1.04 ± 0.93   0.08 ± 0.13
h_con    50K  5     0.06 ± 0.07   0.35 ± 0.88   2.13 ± 2.00   0.41 ± 0.67
h_con    50K  10    0.10 ± 0.11   1.22 ± 6.07   3.52 ± 1.92   0.41 ± 0.52
h_lin    10K  1     0.09 ± 0.15   0.51 ± 1.25   0.25 ± 0.30   0.39 ± 0.30
h_lin    10K  5     0.14 ± 0.08   1.32 ± 6.72   0.58 ± 0.41   0.88 ± 0.52
h_lin    10K  10    0.61 ± 0.27   4.73 ± 22.4   1.39 ± 0.65   1.53 ± 0.75
h_lin    50K  1     0.02 ± 0.03   6.79 ± 53.1   0.11 ± 0.11   0.36 ± 0.33
h_lin    50K  5     0.04 ± 0.03   0.12 ± 0.16   0.36 ± 0.24   1.04 ± 0.53
h_lin    50K  10    0.32 ± 0.10   11.9 ± 112    0.76 ± 0.26   1.61 ± 0.89
h_log    10K  1     0.19 ± 0.26   0.72 ± 1.82   0.30 ± 0.31   0.32 ± 0.29
h_log    10K  5     0.29 ± 0.23   12.1 ± 113    0.58 ± 0.40   0.97 ± 0.56
h_log    10K  10    0.72 ± 0.31   2.76 ± 7.65   1.11 ± 0.49   1.40 ± 0.72
h_log    50K  1     0.06 ± 0.08   0.30 ± 0.88   0.22 ± 0.16   0.32 ± 0.30
h_log    50K  5     0.10 ± 0.04   0.18 ± 0.13   0.56 ± 0.31   0.75 ± 0.62
h_log    50K  10    0.30 ± 0.10   55.3 ± 550    0.86 ± 0.28   1.20 ± 0.99
Table 1: The mean and standard deviation of the MSE over 100 trials with $\gamma=0$. The results are multiplied by 100 ($h_{con}$, constant), 10 ($h_{lin}$, linear), and 10 ($h_{log}$, logistic), respectively. Boldface denotes the best and comparable results according to the two-sided Wilcoxon signed-rank test at the 5% significance level.
Figure 2: Relation between the squared error (y-axis) and the PSD (x-axis) when $\mu(\bm{X})$ is linear, $q_{x}=5$, and $n=10000$. Each point is colored according to the spatial density.

6.2 Real-Data Analysis

We used the dataset of the National Job Training Partnership Act (JTPA) study (https://www.upjohn.org/data-tools/employment-research-data-center/national-jtpa-study). This is one of the largest RCT datasets for job training evaluation in the US, with approximately 20000 participants, and it has been used in several previous studies on causal inference [8, 2, 12]. Since the true treatment effect is unknown for a real-world dataset, we used the doubly robust (DR) estimator [38] trained with all joint samples as the pseudo-true LATE.

The results are summarized in Table 2. We report the mean, max, and min of $\widehat{\mu}(\bm{x})$ in addition to the root mean squared error (RMSE) to examine the behavior of each estimator. The IWLS had the smallest RMSE, but the difference from the DWLS was not statistically significant. The mean, max, and min of the DWLS and IWLS also indicate that they perform similarly. The stability of the IWLS, in contrast to the synthetic experiment, is because there were no samples with a small PSD; the relatively good performance of the SEP is likely due to the same reason. The performance of the DLS was highly unstable, reflecting the difficulty of model selection based on the DLS objective.

7 Conclusion

We proposed a novel problem setting for LATE estimation by data combination. Our setting is plausible enough to apply to many real-world problems; leading examples include O2O marketing and panel data with non-random attrition. We developed a practically useful method for estimating LATE from separately observed datasets. Our method overcomes the model selection issue of the existing method by avoiding the minimax formulation of the objective. Furthermore, it avoids the common problem of the IPW methods, since the PSD appears directly in our estimator without inversion. Our method displayed promising performance in the experiments on synthetic and real-world datasets. However, there are sometimes concerns about the homogeneity of the populations from which the separate samples come. Extending the proposed method to account for heterogeneous samples and time-varying potential variables is an important future direction to further increase its usefulness in practice.

       DR     DWLS        IWLS        SEP         DLS
RMSE   -      252 ± 141   222 ± 145   280 ± 119   946 ± 663
Mean   1778   1528        1559        1538        937
Max    1837   1838        1924        2200        7966
Min    1648   1013        977         919         -1814
Table 2: The results of the performance evaluation using the JTPA dataset. Boldface denotes the best and comparable results according to the two-sided Wilcoxon signed-rank test at the 5% significance level.

References

  • [1] A. Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263, 2003.
  • [2] A. Abadie, J. Angrist, and G. Imbens. Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117, 2002.
  • [3] A. Alaa and M. van der Schaar. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 129–138. PMLR, 2018.
  • [4] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
  • [5] J. D. Angrist and A. B. Krueger. The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87(418):328–336, 1992.
  • [6] S. Athey and G. W. Imbens. Identification and inference in nonlinear difference-in-differences models. Econometrica, 74(2):431–497, 2006.
  • [7] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7345–7352, 2016.
  • [8] H. S. Bloom, L. L. Orr, S. H. Bell, G. Cave, F. Doolittle, W. Lin, and J. M. Bos. The benefits and costs of JTPA Title II-A programs: Key findings from the national job training partnership act study. The Journal of Human Resources, 32(3):549–576, 1997.
  • [9] M. Buchinsky, F. Li, and Z. Liao. Estimation and inference of semiparametric models using data from several sources. Journal of Econometrics, 2021.
  • [10] S. Cambanis, G. Simons, and W. Stout. Inequalities for Ek(X,Y) when the marginals are fixed. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 36(4):285–294, 1976.
  • [11] J. Choi, J. Gu, and S. Shen. Weak-instrument robust inference for two-sample instrumental variables regression. Journal of Applied Econometrics, 33(1):109–125, 2018.
  • [12] S. G. Donald, Y.-C. Hsu, and R. P. Lieli. Testing the unconfoundedness assumption via inverse probability weighted estimators of (L)ATT. Journal of Business & Economic Statistics, 32(3):395–415, 2014.
  • [13] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [14] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1386–1394. PMLR, 2015.
  • [15] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 213–220. Association for Computing Machinery, 2008.
  • [16] Y. Fan, R. Sherman, and M. Shum. Identifying treatment effects under data combination. Econometrica, 82(2):811–822, 2014.
  • [17] Y. Fan, R. Sherman, and M. Shum. Estimation and inference in an ecological inference model. Journal of Econometric Methods, 5(1):547, 2016.
  • [18] M. Frölich. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35–75, 2007.
  • [19] M. Frölich and B. Melly. Identification of treatment effects on the treated with One-Sided Non-Compliance. Econometric Reviews, 32(3):384–414, 2013.
  • [20] M. Güell and L. Hu. Estimating the probability of leaving unemployment using uncompleted spells from repeated cross-section data. Journal of Econometrics, 133(1):307–341, 2006.
  • [21] K. Hirano, G. W. Imbens, G. Ridder, and D. B. Rubin. Combining panel data sets with attrition and refreshment samples. Econometrica, 69(6):1645–1659, 2001.
  • [22] K. Hirano, G. W. Imbens, D. B. Rubin, and X. H. Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88, 2000.
  • [23] P. W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
  • [24] G. W. Imbens and J. D. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.
  • [25] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • [26] A. Inoue and G. Solon. Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3):557–561, 2010.
  • [27] E. H. Kennedy. Efficient nonparametric causal inference with missing exposure information. The International Journal of Biostatistics, 16(1), 2020.
  • [28] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [29] A. Klevmarken. Missing variables and two-stage least-squares estimation from more than one data set. Technical report, IUI Working Paper, 1982.
  • [30] M. J. Lebo and C. Weber. An effective approach to the repeated cross-sectional design. American Journal of Political Science, 59(1):242–258, 2015.
  • [31] B. K. Lee, J. Lessler, and E. A. Stuart. Weight trimming and propensity score weighting. PloS one, 6(3):e18174, 2011.
  • [32] S. Lee, J. D. Correa, and E. Bareinboim. Identifiability from a combination of observations and experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13677–13680. AAAI Press, Apr. 2020.
  • [33] R. J. Little and L. H. Y. Yau. Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin’s causal model. Psychological Methods, 3(2):147–159, 1998.
  • [34] C. F. Manski. Partial Identification of Probability Distributions. Springer Science & Business Media, 2003.
  • [35] F. Molinari. Microeconometrics with partial identification. Handbook of econometrics, 7:355–486, 2020.
  • [36] E. Moretti. Estimating the social return to higher education: Evidence from longitudinal and repeated cross-sectional data. Journal of Econometrics, 121(1):175–212, 2004.
  • [37] A. Nevo. Using weights to adjust for sample selection when auxiliary information is available. Journal of Business & Economic Statistics, 21(1):43–52, 2003.
  • [38] E. L. Ogburn, A. Rotnitzky, and J. M. Robins. Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society. Series B, Statistical methodology, 77(2):373, 2015.
  • [39] R. Okui, D. S. Small, Z. Tan, and J. M. Robins. Doubly robust instrumental variable regression. Statistica Sinica, 22(1), 2012.
  • [40] P. Oreopoulos. Estimating average and local average treatment effects of education when compulsory schooling laws really matter. American Economic Review, 96(1):152–175, 2006.
  • [41] D. Pacini and F. Windmeijer. Robust inference for the two-sample 2SLS estimator. Economics letters, 146:50–54, 2016.
  • [42] G. Ridder and R. Moffitt. The econometrics of data combination. In J. J. Heckman and E. E. Leamer, editors, Handbook of Econometrics, volume 6, pages 5469–5547. Elsevier, 2007.
  • [43] C. A. Rolling and Y. Yang. Model selection for estimating treatment effects. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 749–769, 2014.
  • [44] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
  • [45] Y. Saito and S. Yasui. Counterfactual cross-validation: Stable model selection procedure for causal inference models. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 8398–8407. PMLR, 2020.
  • [46] S. R. Seaman and I. R. White. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3):278–295, 2013.
  • [47] H. Shu and Z. Tan. Improved methods for moment restriction models with data combination and an application to two-sample instrumental variable estimation. Canadian Journal of Statistics, 48(2):259–284, 2020.
  • [48] C. Skovron and R. Titiunik. A practical guide to regression discontinuity designs in political science. American Journal of Political Science, 2015:1–36, 2015.
  • [49] B. Sun and W. Miao. On semiparametric instrumental variable estimation of average treatment effects through data fusion. Statistica Sinica, 2022. Forthcoming.
  • [50] E. Tamer. Partial identification in econometrics. Annual Review of Economics, 2(1):167–195, 2010.
  • [51] Z. Tan. Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476):1607–1618, 2006.
  • [52] H. R. Varian. Causal inference in economics and marketing. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7310–7315, 2016.
  • [53] L. Wang, Y. Zhang, T. S. Richardson, and J. M. Robins. Estimation of local treatment effects under the binary instrumental variable model. Biometrika, 2021.
  • [54] Z. Wang, H. Zhu, Z. Dong, X. He, and S.-L. Huang. Less is better: Unweighted data subsampling via influence function. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6340–6347. AAAI Press, 2020.
  • [55] L. Wood, M. Egger, L. L. Gluud, K. F. Schulz, P. Jüni, D. G. Altman, C. Gluud, R. M. Martin, A. J. Wood, and J. A. Sterne. Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: Meta-epidemiological study. BMJ, 336(7644):601–605, 2008.
  • [56] J. M. Wooldridge. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Portuguese Economic Journal, 1(2):117–139, 2002.
  • [57] J. M. Wooldridge. Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301, 2007.
  • [58] I. Yamane, F. Yger, J. Atif, and M. Sugiyama. Uplift modeling from separate labels. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [59] T. Ye, A. Ertefaie, J. Flory, S. Hennessy, and D. S. Small. Instrumented difference-in-differences, 2021.
  • [60] Q. Zhao, J. Wang, W. Spiller, J. Bowden, and D. S. Small. Two-Sample instrumental variable analyses using heterogeneous samples. Statistical Science, 34(2):317–333, 2019.

Appendix A Estimation of $\mu(\bm{X})$ when $P(\bm{X}^{(1)})\neq P(\bm{X}^{(0)})\neq P(\bm{X})$

Consider weakening Assumption 1.1 to its conditional version:

$$P(Y_{1}^{(1)},Y_{0}^{(1)},D_{1}^{(1)},D_{0}^{(1)}|\bm{X}^{(1)})=P(Y_{1}^{(0)},Y_{0}^{(0)},D_{1}^{(0)},D_{0}^{(0)}|\bm{X}^{(0)})=P(Y_{1},Y_{0},D_{1},D_{0}|\bm{X}). \quad (10)$$

Under this assumption, $\mu(\bm{X})$ is identified as follows:

$$\mu(\bm{x})=\frac{E[Y^{(1)}|\bm{X}^{(1)}=\bm{x}]-E[Y^{(0)}|\bm{X}^{(0)}=\bm{x}]}{E[D^{(1)}|\bm{X}^{(1)}=\bm{x}]-E[D^{(0)}|\bm{X}^{(0)}=\bm{x}]}.$$

In this case, directly applying the DLS and DWLS may result in a large bias since $P(\bm{X}^{(1)})\neq P(\bm{X}^{(0)})\neq P(\bm{X})$ in general. To apply the DLS and DWLS, we need density ratios to adjust the distribution over which the expectation is taken. Unfortunately, we cannot estimate the density ratio $\frac{P(\bm{X}|D^{(k)}=1)}{P(\bm{X}^{(k)}|D^{(k)}=1)}$ when we only have $\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}$ and $\widehat{p}_{d}^{(k)}$ as in the data collection scheme introduced in Section 2.3, because we then have no information on $P(\bm{X}|D^{(k)}=1)$.

If we can collect $\{(d_{i}^{(k)},\bm{x}_{di}^{(k)})\}_{i=1}^{n_{d}^{(k)}}\overset{\mathrm{iid}}{\sim}P(D^{(k)},\bm{X}^{(k)})$ instead of the pair $\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}$ and $\widehat{p}_{d}^{(k)}$, we can modify the combined datasets explained in Section 4.1 to

$$\{(t_{i},\bm{x}_{ti},r_{ti})\}_{i=1}^{n_{t}} := \left\{\left((-1)^{1-k}d_{i}^{(k)},\,\bm{x}_{di}^{(k)},\,\frac{n_{d}^{(1)}+n_{d}^{(0)}}{2n_{d}^{(k)}}\widehat{\rho}_{k}(\bm{x}_{di}^{(k)})\right)\right\}_{i=1,k=0}^{n_{d}^{(k)},1},$$
$$\{(u_{i},\bm{x}_{ui},r_{ui})\}_{i=1}^{n_{u}} := \left\{\left((-1)^{1-k}y_{i}^{(k)},\,\bm{x}_{i}^{(k)},\,\frac{n^{(1)}+n^{(0)}}{2n^{(k)}}\widehat{\rho}_{k}(\bm{x}_{i}^{(k)})\right)\right\}_{i=1,k=0}^{n^{(k)},1},$$

where $\rho_{k}(\bm{X}):=\frac{P(\bm{X})}{P(\bm{X}^{(k)})}$ for $k=0,1$, and estimate $\mu(\bm{X})$ with the DLS and DWLS using the above datasets. Various methods can be employed to estimate the density ratios (Nguyen, Wainwright, and Jordan 2010; Sugiyama, Suzuki, and Kanamori 2012; Liu et al. 2017; Kato and Teshima 2021).

However, it is no longer direct estimation in the original sense because we need to estimate the density ratios. Although we have to maintain Assumption 1.1 to make direct estimation possible, this limitation does not detract significantly from the value of our work since our setting is plausible enough to apply to many real-world problems even with Assumption 1.1 as explained in the main text. Note that the concern for the time-varying potential variables when using RCS data does not vanish even if we adopt the condition (10) since the conditional distribution may also change over time.

Relation to the Standard Setting

Consider the estimation of $\mu(\bm{X})$ using separate samples drawn from $P(Y,Z,\bm{X})$ and $P(D,Z,\bm{X})$ under the standard assumptions (Abadie 2003; Frölich 2007; Tan 2006; Ogburn, Rotnitzky, and Robins 2015). Recalling that $\mu(\bm{X})$ can be identified as

$\mu(\bm{X})=\dfrac{E[Y|Z=1,\bm{X}]-E[Y|Z=0,\bm{X}]}{E[D|Z=1,\bm{X}]-E[D|Z=0,\bm{X}]},$

we can see that it is essentially the same problem as our setting under condition (10). In this case, we need the density ratios $\frac{P(\bm{X})}{P(\bm{X}|Z=z)}$ for $z=0,1$.

Appendix B Proof of Theorem 1

Proof.

By Assumptions 1.1, 2.1 and 2.2, we have

$E[D^{(1)}-D^{(0)}|\bm{X}] = E[Z^{(1)}D_{1}+(1-Z^{(1)})D_{0}|\bm{X}]-E[Z^{(0)}D_{1}+(1-Z^{(0)})D_{0}|\bm{X}]$
$= E[(Z^{(1)}-Z^{(0)})(D_{1}-D_{0})|\bm{X}]$
$= E[Z^{(1)}-Z^{(0)}|\bm{X}]\,E[D_{1}-D_{0}|\bm{X}].$ (11)

Similarly, we can show

$E[Y^{(1)}-Y^{(0)}|\bm{X}] = E[D^{(1)}Y_{1}+(1-D^{(1)})Y_{0}|\bm{X}]-E[D^{(0)}Y_{1}+(1-D^{(0)})Y_{0}|\bm{X}]$
$= E[(D^{(1)}-D^{(0)})(Y_{1}-Y_{0})|\bm{X}]$
$= E[(Z^{(1)}-Z^{(0)})(D_{1}-D_{0})(Y_{1}-Y_{0})|\bm{X}]$
$= E[Z^{(1)}-Z^{(0)}|\bm{X}]\,E[(D_{1}-D_{0})(Y_{1}-Y_{0})|\bm{X}]$
$= E[Z^{(1)}-Z^{(0)}|\bm{X}]\,P(D_{1}-D_{0}=1|\bm{X})\,E[Y_{1}-Y_{0}|D_{1}-D_{0}=1,\bm{X}].$ (12)

It holds that $E[D_{1}-D_{0}|\bm{X}]=P(D_{1}-D_{0}=1|\bm{X})$ since $D_{1}-D_{0}\in\{0,1\}$ by Assumption 2.3. Moreover, $D_{1}-D_{0}=1$ implies $D_{1}>D_{0}$ since $D_{1}$ and $D_{0}$ are binary. Note that $E[Z^{(1)}-Z^{(0)}|\bm{X}]\neq 0$ and $E[D_{1}-D_{0}|\bm{X}]\neq 0$ by Assumptions 1.2 and 2.4, respectively. Combining these facts with (11) and (12) yields

$\dfrac{E[Y^{(1)}-Y^{(0)}|\bm{X}]}{E[D^{(1)}-D^{(0)}|\bm{X}]} = \dfrac{E[Z^{(1)}-Z^{(0)}|\bm{X}]\,E[D_{1}-D_{0}|\bm{X}]\,E[Y_{1}-Y_{0}|D_{1}-D_{0}=1,\bm{X}]}{E[Z^{(1)}-Z^{(0)}|\bm{X}]\,E[D_{1}-D_{0}|\bm{X}]}$
$= E[Y_{1}-Y_{0}|D_{1}-D_{0}=1,\bm{X}]$
$= E[Y_{1}-Y_{0}|D_{1}>D_{0},\bm{X}]$
$= \mu(\bm{X}).$

Appendix C Direct Estimation of PSD

We cannot employ ordinary regression methods on $\{(t_{i},\bm{x}_{ti},r_{ti})\}_{i=1}^{n_{t}}$ to obtain an estimator of $\pi(\bm{X})$, since this dataset does not follow $P(T|\bm{X})$ even when $n_{d}^{(1)}=n_{d}^{(0)}$. We have to develop an estimator that incorporates the covariates of $\{(u_{i},\bm{x}_{ui},r_{ui})\}_{i=1}^{n_{u}}$. Moreover, an estimator of $\pi(\bm{X})$ should satisfy the constraint $\pi(\bm{X})\in[-0.5,0.5]$. In the following, we develop a direct estimation method that is essentially an application of the least squares method for density ratio estimation (Kanamori, Hido, and Sugiyama 2009; Sugiyama, Suzuki, and Kanamori 2012).

In order to develop an estimator satisfying the constraint, consider the least squares estimation of $\pi(\bm{X})+0.5\in[0,1]$ by minimizing the following squared error:

$E[(f(\bm{X})-\pi(\bm{X})-0.5)^{2}] = E[f(\bm{X})^{2}]-2E[Tf(\bm{X})]-E[f(\bm{X})]+E[(\pi(\bm{X})+0.5)^{2}].$

The cross term is rewritten using $E[Tf(\bm{X})]=E[\pi(\bm{X})f(\bm{X})]$, and the last term can be safely ignored in optimization since it does not depend on $f$. We can approximate the first three terms as

$\dfrac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}f(\bm{x}_{ui})^{2}-\dfrac{2}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}f(\bm{x}_{ti})-\dfrac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}f(\bm{x}_{ui}).$

Using a linear-in-parameters model and an $\ell_{2}$-regularizer, we obtain the analytical solution $\widetilde{\bm{\alpha}}_{0.5+}^{\top}\bm{\phi}(\bm{x})$, where

$\widetilde{\bm{\alpha}}_{0.5+}=\left(\dfrac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\phi}(\bm{x}_{ui})\bm{\phi}(\bm{x}_{ui})^{\top}+\lambda_{f}\bm{I}_{q_{f}}\right)^{-1}\left(\dfrac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\bm{\phi}(\bm{x}_{ti})+\dfrac{1}{2n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\phi}(\bm{x}_{ui})\right).$

When we use non-negative basis functions such as the Gaussian kernel, we can modify $\widetilde{\bm{\alpha}}_{0.5+}$ as $\widehat{\bm{\alpha}}_{0.5+}:=\max\{\widetilde{\bm{\alpha}}_{0.5+},\bm{0}_{q_{f}}\}$ to force $\widehat{\bm{\alpha}}_{0.5+}^{\top}\bm{\phi}(\bm{x})$ to be non-negative as well; here, the $\max$ operator is applied element-wise. Similarly, we can construct an estimator of $-\pi(\bm{X})+0.5\in[0,1]$ as $\widehat{\bm{\alpha}}_{0.5-}^{\top}\bm{\phi}(\bm{x})$, where $\widehat{\bm{\alpha}}_{0.5-}:=\max\{\widetilde{\bm{\alpha}}_{0.5-},\bm{0}_{q_{f}}\}$ and

$\widetilde{\bm{\alpha}}_{0.5-}=\left(\dfrac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\phi}(\bm{x}_{ui})\bm{\phi}(\bm{x}_{ui})^{\top}+\lambda_{f}\bm{I}_{q_{f}}\right)^{-1}\left(-\dfrac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\bm{\phi}(\bm{x}_{ti})+\dfrac{1}{2n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\phi}(\bm{x}_{ui})\right).$

Finally, we normalize the estimate of $\pi(\bm{X})+0.5$ at a given point $\bm{x}$ as

$\widehat{(\pi+0.5)}(\bm{x})=\dfrac{\widehat{\bm{\alpha}}_{0.5+}^{\top}\bm{\phi}(\bm{x})}{(\widehat{\bm{\alpha}}_{0.5+}+\widehat{\bm{\alpha}}_{0.5-})^{\top}\bm{\phi}(\bm{x})},$

since $(\pi(\bm{X})+0.5)+(-\pi(\bm{X})+0.5)=1$. This normalization ensures $\widehat{(\pi+0.5)}(\bm{x})\leq 1$ for any $\bm{x}$. Therefore, we can obtain an estimator that respects the constraint by setting $\widehat{\pi}(\bm{x}):=\widehat{(\pi+0.5)}(\bm{x})-0.5$.
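Computationally, the whole procedure amounts to two regularized least squares solves followed by element-wise clipping and normalization. A minimal sketch, assuming precomputed basis matrices Phi_t $=(\bm{\phi}(\bm{x}_{ti}))_{i}$ and Phi_u $=(\bm{\phi}(\bm{x}_{ui}))_{i}$ together with the arrays t, r_t, and r_u from the combined datasets:

```python
import numpy as np

def ridge_solve(Phi_u, r_u, rhs, lam):
    # Solve ((1/n_u) sum_i r_ui phi phi^T + lam I) alpha = rhs.
    A = (Phi_u * r_u[:, None]).T @ Phi_u / len(Phi_u)
    return np.linalg.solve(A + lam * np.eye(Phi_u.shape[1]), rhs)

def estimate_psd(Phi_t, t, r_t, Phi_u, r_u, lam):
    b_t = (Phi_t * (r_t * t)[:, None]).mean(axis=0)  # (1/n_t) sum_i r_ti t_i phi
    b_u = 0.5 * (Phi_u * r_u[:, None]).mean(axis=0)  # (1/(2 n_u)) sum_i r_ui phi
    a_plus = np.maximum(ridge_solve(Phi_u, r_u, b_t + b_u, lam), 0.0)
    a_minus = np.maximum(ridge_solve(Phi_u, r_u, -b_t + b_u, lam), 0.0)
    def pi_hat(Phi_x):  # Phi_x: basis matrix at the query points
        denom = np.maximum(Phi_x @ (a_plus + a_minus), 1e-12)
        return (Phi_x @ a_plus) / denom - 0.5
    return pi_hat
```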

Estimation in 1E2RD

In the 1E2RD, we can further tighten the restriction to $\pi(\bm{X})\in[0,0.5]$, since the PSD is non-negative by Assumption 2.3 when $P(Z^{(0)}=1|\bm{X})=0$. In this case, we can obtain a non-negative estimator of $\pi(\bm{X})$ as $\widehat{\bm{\alpha}}_{\pi}^{\top}\bm{\phi}(\bm{x})$, where $\widehat{\bm{\alpha}}_{\pi}:=\max\{\widetilde{\bm{\alpha}}_{\pi},\bm{0}_{q_{f}}\}$ and

$\widetilde{\bm{\alpha}}_{\pi}=\left(\dfrac{1}{n_{u}}\sum_{i=1}^{n_{u}}r_{ui}\bm{\phi}(\bm{x}_{ui})\bm{\phi}(\bm{x}_{ui})^{\top}+\lambda_{f}\bm{I}_{q_{f}}\right)^{-1}\dfrac{1}{n_{t}}\sum_{i=1}^{n_{t}}r_{ti}t_{i}\bm{\phi}(\bm{x}_{ti}).$

Then, we can normalize the estimator so that it takes values no larger than 0.5 as

$\widehat{\pi}(\bm{x})=\dfrac{\widehat{\bm{\alpha}}_{\pi}^{\top}\bm{\phi}(\bm{x})}{2(\widehat{\bm{\alpha}}_{\pi}+\widehat{\bm{\alpha}}_{0.5-})^{\top}\bm{\phi}(\bm{x})},$

since $\pi(\bm{X})+(-\pi(\bm{X})+0.5)=0.5$.
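The one-sided variant reuses the same solver; only the right-hand side and the normalization change. A minimal sketch continuing from the previous block:

```python
def estimate_psd_1e2rd(Phi_t, t, r_t, Phi_u, r_u, lam):
    # One-sided noncompliance: pi(X) in [0, 0.5].
    b_t = (Phi_t * (r_t * t)[:, None]).mean(axis=0)
    b_u = 0.5 * (Phi_u * r_u[:, None]).mean(axis=0)
    a_pi = np.maximum(ridge_solve(Phi_u, r_u, b_t, lam), 0.0)
    a_minus = np.maximum(ridge_solve(Phi_u, r_u, -b_t + b_u, lam), 0.0)
    def pi_hat(Phi_x):
        denom = np.maximum(2.0 * (Phi_x @ (a_pi + a_minus)), 1e-12)
        return (Phi_x @ a_pi) / denom
    return pi_hat
```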

Appendix D Experimental Setups

D.1 Synthetic Experiment

The data generating process used for the experiments is as follows:

$Y^{\prime}_{0}=\varsigma(\bm{1}_{q_{x}}^{\top}\bm{X})+(0.2D_{1}+0.1D_{0})\times\bm{1}_{q_{x}}^{\top}\bm{X},\qquad Y^{\prime}_{1}=Y^{\prime}_{0}+h(\bm{X},D_{1},D_{0}),$
$Y_{0}=Y^{\prime}_{0}+\varepsilon_{0},\qquad Y_{1}=Y^{\prime}_{1}+\varepsilon_{1},\qquad \bm{X}\sim N(\bm{0}_{q_{x}},\bm{\Sigma}_{q_{x}}),$
$P(Z^{(1)}=1|\bm{X})=\varsigma(1+0.2\times\bm{1}_{q_{x}}^{\top}\bm{X}),\qquad P(Z^{(0)}=1|\bm{X})=0,$
$P(D_{1}=1|\bm{X})=\varsigma(\gamma+4+\bm{1}_{q_{x}}^{\top}\bm{X}),\qquad P(D_{0}=1|\bm{X})=\varsigma(\gamma+\bm{1}_{q_{x}}^{\top}\bm{X}),$

where $\varsigma(v)=(1+e^{-v})^{-1}$, $\bm{1}_{q_{x}}$ is the $q_{x}\times 1$ vector of ones, $\bm{0}_{q_{x}}$ is the $q_{x}\times 1$ zero vector, and $\bm{\Sigma}_{q_{x}}$ is a $q_{x}\times q_{x}$ positive semidefinite matrix with diagonal elements equal to 1 and off-diagonal elements whose absolute values are less than 0.5. $\varepsilon_{0}$ and $\varepsilon_{1}$ are mean-zero random noises with $\mathrm{var}(\varepsilon_{d})=0.5$ for $d=0,1$ and $\mathrm{cov}(\varepsilon_{0},\varepsilon_{1})=0.2$. $\gamma$ is an experimental parameter controlling the proportion of extreme PSD values; we set $\gamma=0$ for the experiment in the main text. The experiments with $\gamma=-1,1$ yielded similar findings, and their results can be found in Appendix E.

We implemented the experiments with three different shapes of the treatment effect function $h$:

$h_{\mathrm{con}}(\bm{X},D_{1},D_{0})=0.2+0.3D_{1}+0.1D_{0},$
$h_{\mathrm{lin}}(\bm{X},D_{1},D_{0})=(0.1+0.15D_{1}+0.05D_{0})\times\bm{1}_{q_{x}}^{\top}\bm{X},$
$h_{\mathrm{log}}(\bm{X},D_{1},D_{0})=\varsigma((1+0.2D_{1}+0.1D_{0})\times\bm{1}_{q_{x}}^{\top}\bm{X}).$

Note that $\mu(\bm{X})=h(\bm{X},1,0)$. We separately generated four sets of training samples $\{\bm{x}_{di}^{(k)}\}_{i=1}^{n_{d}^{(k)}}$ and $\{(y_{i}^{(k)},\bm{x}_{i}^{(k)})\}_{i=1}^{n^{(k)}}$ with $n_{d}^{(k)}=n^{(k)}=10000$ or $50000$ for $k=0,1$, validation sets of the same sample size as the training samples, and a test set of 10000 samples. A sketch of this generating process is given below.
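A minimal sketch of the data generating process, shown for the constant-shape effect $h_{\mathrm{con}}$; the covariance matrix and the noise correlation are filled in under the constraints stated above, and the off-diagonal value 0.2 is an illustrative choice:

```python
import numpy as np

def generate(n, qx, gamma=0.0, seed=0):
    rng = np.random.default_rng(seed)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    Sigma = np.full((qx, qx), 0.2)                  # |off-diagonal| < 0.5 as required
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(qx), Sigma, size=n)
    s = X.sum(axis=1)                                # 1^T X
    D1 = rng.binomial(1, sigmoid(gamma + 4.0 + s))
    D0 = rng.binomial(1, sigmoid(gamma + s))
    Z = rng.binomial(1, sigmoid(1.0 + 0.2 * s))      # Z^{(1)}; Z^{(0)} is always 0
    # Correlated noises: var 0.5 each, cov 0.2.
    eps = rng.multivariate_normal([0.0, 0.0], [[0.5, 0.2], [0.2, 0.5]], size=n)
    h_con = 0.2 + 0.3 * D1 + 0.1 * D0
    Y0p = sigmoid(s) + (0.2 * D1 + 0.1 * D0) * s     # Y'_0
    Y0, Y1 = Y0p + eps[:, 0], Y0p + h_con + eps[:, 1]
    D = Z * D1 + (1 - Z) * D0                        # observed treatment, regime k=1
    Y = D * Y1 + (1 - D) * Y0                        # observed outcome
    return X, Z, D, Y
```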

We compared the proposed method (DWLS) with the IWLS, the separate estimation (SEP), and the DLS. Gaussian kernels centered at 100 randomly chosen training samples were used for the basis functions $\bm{\phi}$ and $\bm{\psi}$. We also used the same kernel for the kernel ridge regression to estimate $\nu$. The estimation of $\pi$ was performed using the direct estimation method explained in Appendix C. The value of $\widehat{\pi}$ was trimmed at 0.15 to prevent the weights from becoming too large for the SEP and IWLS; this trimming was not performed for the DWLS. The bandwidth $h$ of the Gaussian kernel and the regularization parameter $\lambda$ were selected based on the model selection criterion corresponding to each estimator on the validation sets. We used Optuna (Akiba et al. 2019) for hyperparameter tuning with 100 searches, setting the search spaces for $h$ and $\lambda$ to $[1,10]$ and $[10^{-5},10^{5}]$, respectively.
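The tuning loop itself is standard. A minimal Optuna sketch with the stated search spaces, where validation_criterion is a placeholder for the model selection criterion of the estimator being tuned and $\lambda$ is searched on a log scale:

```python
import optuna

def objective(trial):
    h = trial.suggest_float("h", 1.0, 10.0)                # kernel bandwidth
    lam = trial.suggest_float("lam", 1e-5, 1e5, log=True)  # regularization
    return validation_criterion(h, lam)  # placeholder: estimator-specific criterion

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
best_h, best_lam = study.best_params["h"], study.best_params["lam"]
```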

D.2 Real Data Analysis

We treat the JTPA dataset as if it were collected from the 1E2RD under one-sided noncompliance, since the noncompliance structure was almost one-sided in the JTPA study: only 2% of the participants in the control group received the job training treatment, whereas approximately 40% in the treatment group did not receive it. We excluded the participants with $Z=0$ and $D=1$ from our analysis to focus on the one-sided noncompliance structure. We used the samples with $Z=0,1$ as $K=1$ and the samples with $Z=0$ as $K=0$; note that this sampling procedure is therefore with replacement, since the $Z=0$ samples appear in both groups. Assumption 1.1 is satisfied under this sampling since $Z$ was assigned completely at random in the JTPA study.

The outcome of interest here is the total earnings over the 30 months after random assignment. The covariates used in our analysis included gender, age, earnings in the previous year, a dummy for Black participants, a dummy for Hispanic participants, a dummy for high school graduates, a dummy for being married, and a dummy for whether a participant worked for at least 12 weeks in the 12 months preceding random assignment. We excluded the samples whose earnings in the previous year exceeded the 90th percentile to stabilize the estimation. We applied the Yeo-Johnson transformation (Yeo and Johnson 2000) to age and earnings in the previous year since they are non-negative and thus have highly skewed distributions. All variables were normalized to have zero mean and unit standard deviation. We evaluated the performance of each estimator based on the root mean squared error computed using 100-fold cross-validation. For hyperparameter tuning, we calculated the model selection criterion corresponding to each estimator by 10-fold cross-validation within the training sets. Weight trimming was not performed since $\widehat{\pi}(\bm{X})$ was far from zero for all $\bm{X}$.
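This preprocessing can be reproduced with standard tooling. A minimal sketch, assuming a pandas DataFrame df of numeric variables with hypothetical column names age and prev_earnings for the two skewed covariates:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer, StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop samples above the 90th percentile of previous-year earnings.
    df = df[df["prev_earnings"] <= df["prev_earnings"].quantile(0.90)].copy()
    # Yeo-Johnson transform for the skewed non-negative variables.
    pt = PowerTransformer(method="yeo-johnson", standardize=False)
    df[["age", "prev_earnings"]] = pt.fit_transform(df[["age", "prev_earnings"]])
    # Normalize every variable to zero mean and unit standard deviation.
    df[df.columns] = StandardScaler().fit_transform(df[df.columns])
    return df
```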

Appendix E Additional Experimental Results

E.1 Relation between the Squared Error and the PSD

We present the graphs of the relation between the squared error and the PSD for all the cases (Figures 3–20). Similar to Figure 2, these figures show that the performance deterioration of the DWLS at points with small PSD values is limited compared with that of the other estimators.

E.2 Sensitivity Analysis

In addition to the synthetic experiment in the main text, we conducted experiments with $\gamma=-1,1$ to investigate how the performance of each estimator changes with the proportion of small PSD values. Figure 21 shows the histograms of the PSD for different values of $\gamma$; the proportion of small PSD values clearly increases as $\gamma$ increases.

Tables 3 and 4 present the results with $\gamma=1$ and $\gamma=-1$, respectively. In both cases, we obtained results similar to the case with $\gamma=0$: the DWLS had the smallest MSE in all cases except the constant $\mu$ with $q_{x}=1$. When $\gamma=1$, the relative advantage of the DWLS over the other estimators decreased in all cases. Since the DWLS directly uses the PSD, its efficiency declines as the proportion of small PSD values increases; in this case, the performance of the DWLS could be improved by weight trimming so that samples with near-zero PSD values retain at least some impact on the estimation. When $\gamma=-1$, the MSE decreased overall. The variance of the IWLS was smaller than with the other values of $\gamma$, but the IWLS was still unstable compared with the other methods.

[Figures 3–11: squared error versus the PSD with $n=10000$; Figures 12–20: the same with $n=50000$. Within each block of nine figures, for each $q_{x}\in\{1,5,10\}$, the effect shape cycles through constant, linear, and logistic.]
[Figure 21: Histograms of the PSD for (a) $\gamma=-1$, (b) $\gamma=0$, and (c) $\gamma=1$.]
Shape   n    q_x   DWLS          IWLS          SEP           DLS
h_con   10K  1     0.11 ± 0.32   404 ± 3482    0.55 ± 0.23   0.10 ± 0.46
h_con   10K  5     0.10 ± 0.23   0.79 ± 5.09   0.70 ± 0.32   0.23 ± 0.30
h_con   10K  10    0.09 ± 0.19   30.5 ± 225    0.71 ± 0.36   0.28 ± 0.30
h_con   50K  1     0.05 ± 0.14   85.4 ± 623    0.56 ± 0.11   0.02 ± 0.04
h_con   50K  5     0.01 ± 0.02   0.07 ± 0.17   0.63 ± 0.13   0.05 ± 0.10
h_con   50K  10    0.02 ± 0.02   158 ± 1551    0.66 ± 0.19   0.11 ± 0.27
h_lin   10K  1     0.02 ± 0.08   578 ± 5723    0.03 ± 0.02   0.04 ± 0.03
h_lin   10K  5     0.05 ± 0.05   0.24 ± 0.90   0.08 ± 0.04   0.14 ± 0.06
h_lin   10K  10    0.15 ± 0.29   3.23 ± 19.8   0.15 ± 0.05   0.20 ± 0.09
h_lin   50K  1     0.01 ± 0.02   10.3 ± 50.2   0.02 ± 0.01   0.05 ± 0.04
h_lin   50K  5     0.01 ± 0.01   0.03 ± 0.04   0.06 ± 0.02   0.13 ± 0.06
h_lin   50K  10    0.05 ± 0.03   0.60 ± 2.94   0.10 ± 0.02   0.18 ± 0.10
h_log   10K  1     0.04 ± 0.08   68.4 ± 573    0.13 ± 0.03   0.05 ± 0.04
h_log   10K  5     0.10 ± 0.08   0.52 ± 3.58   0.18 ± 0.04   0.20 ± 0.07
h_log   10K  10    0.17 ± 0.18   9.18 ± 63.1   0.23 ± 0.04   0.25 ± 0.07
h_log   50K  1     0.02 ± 0.03   17.2 ± 94.5   0.13 ± 0.02   0.05 ± 0.04
h_log   50K  5     0.03 ± 0.02   0.05 ± 0.07   0.17 ± 0.02   0.12 ± 0.07
h_log   50K  10    0.08 ± 0.03   0.32 ± 2.12   0.21 ± 0.03   0.17 ± 0.08
Table 3: The mean and standard deviation of the MSE over 100 trials with $\gamma=1$. The results are multiplied by 10 (constant), 1 (linear) and 1 (logistic), respectively. Boldface (in the original table) denotes the best result and those comparable to it according to the two-sided Wilcoxon signed-rank test at the 5% significance level.
Shape   n    q_x   DWLS          IWLS           SEP           DLS
h_con   10K  1     0.30 ± 0.59   1.19 ± 3.72    0.76 ± 0.88   0.14 ± 0.30
h_con   10K  5     0.24 ± 0.29   0.56 ± 1.02    1.97 ± 1.89   0.66 ± 1.12
h_con   10K  10    0.43 ± 0.52   2280 ± 22666   2.18 ± 1.92   0.88 ± 1.08
h_con   50K  1     0.11 ± 0.27   0.46 ± 1.33    0.40 ± 0.34   0.06 ± 0.11
h_con   50K  5     0.05 ± 0.07   0.16 ± 0.28    1.74 ± 0.96   0.18 ± 0.26
h_con   50K  10    0.09 ± 0.21   0.39 ± 1.13    2.06 ± 0.96   0.36 ± 0.46
h_lin   10K  1     0.06 ± 0.17   0.19 ± 0.44    0.10 ± 0.11   0.40 ± 0.29
h_lin   10K  5     0.11 ± 0.09   0.26 ± 0.20    0.29 ± 0.17   0.75 ± 0.50
h_lin   10K  10    0.50 ± 0.17   18.8 ± 178     1.02 ± 0.37   1.19 ± 0.70
h_lin   50K  1     0.01 ± 0.02   0.05 ± 0.15    0.05 ± 0.05   0.35 ± 0.31
h_lin   50K  5     0.03 ± 0.02   0.06 ± 0.06    0.18 ± 0.12   0.90 ± 0.52
h_lin   50K  10    0.25 ± 0.06   0.32 ± 0.24    0.63 ± 0.18   1.36 ± 0.86
h_log   10K  1     0.07 ± 0.08   0.26 ± 0.60    0.15 ± 0.13   0.28 ± 0.29
h_log   10K  5     0.17 ± 0.12   0.35 ± 0.26    0.50 ± 0.28   0.56 ± 0.33
h_log   10K  10    0.47 ± 0.18   321 ± 3187     0.87 ± 0.33   0.85 ± 0.60
h_log   50K  1     0.03 ± 0.04   0.07 ± 0.16    0.09 ± 0.06   0.32 ± 0.40
h_log   50K  5     0.06 ± 0.01   0.11 ± 0.06    0.35 ± 0.15   0.70 ± 0.45
h_log   50K  10    0.21 ± 0.05   0.32 ± 0.14    0.57 ± 0.16   0.78 ± 0.77
Table 4: The mean and standard deviation of the MSE over 100 trials with $\gamma=-1$. The results are multiplied by 100 (constant), 10 (linear) and 10 (logistic), respectively. Boldface (in the original table) denotes the best result and those comparable to it according to the two-sided Wilcoxon signed-rank test at the 5% significance level.