Efficient and Robust Estimation of the Generalized LATE Model
Abstract
This paper studies the estimation of causal parameters in the generalized local average treatment effect (GLATE) model, a generalization of the classical LATE model encompassing multi-valued treatments and instruments. We derive the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for two types of parameters: the local average structural function (LASF) and the local average structural function for the treated (LASF-T). The moment condition generated by the EIF satisfies two robustness properties: double robustness and Neyman orthogonality. Based on the robust moment condition, we propose double/debiased machine learning (DML) estimators for the LASF and LASF-T. The DML estimator is semiparametrically efficient and suitable for high-dimensional settings. We also propose null-restricted inference methods that are robust against weak identification issues. As an empirical application, we study how the effects of health insurance vary across different sources of coverage by applying the developed methods to the Oregon Health Insurance Experiment.
Keywords: Causal Inference, Double Robustness, Efficient Influence Function, Multi-valued Treatment, Neyman Orthogonality, Oregon Health Insurance Experiment, Unordered Monotonicity, Weak Identification.
1 Introduction
Since the seminal works of Imbens and Angrist (1994) and Angrist et al. (1996), the local average treatment effect (LATE) model has become popular for causal inference in economics. Instead of imposing homogeneity of the treatment effects as in the classical instrumental variable (IV) regression model, the LATE framework allows the treatment effect to vary across individuals. Under the monotonicity condition, the average treatment effect can be identified for a subgroup of individuals whose treatment choice complies with the change in instrument levels.
The standard LATE model, however, accommodates only binary treatment variables. This restriction is inconvenient in many economic settings where the treatment is multi-leveled in nature. For example, parents select different preschool programs for their children, schools assign students to different classroom sizes, families relocate to various neighborhoods in housing experiments, and people choose different sources of health insurance. To apply the LATE model to these settings, researchers often need to redefine the treatment so that there are only two treatment levels. However, merging the treatment levels can complicate the task of program evaluation and dampen the causal interpretation of the estimates. As pointed out by Kline and Walters (2016), if the original treatment levels are substitutes, then there is ambiguity regarding which causal parameters are of interest. After merging the treatment levels, the heterogeneity in the treatment effect across different treatment levels is lost.
This paper addresses the above issues by generalizing the LATE framework to incorporate the potential multiplicity in treatment levels directly. We call the new framework the generalized LATE (GLATE) model. The main assumption of the GLATE model is the unordered monotonicity assumption proposed by Heckman and Pinto (2018a), which is a generalization of the monotonicity assumption in the binary LATE model.¹ To distinguish it from the GLATE model, we sometimes use the terminology “binary LATE model” to refer to the LATE model studied by Imbens and Angrist (1994) and Abadie (2003).
We generalize the identification results in Heckman and Pinto (2018a) to explicitly account for the presence of conditioning covariates, which is often important in practical settings. Recently, Blandhol et al. (2022) point out that linear TSLS, the common way to control for covariates in empirical studies, does not bear the LATE interpretation. The only specifications that have LATE interpretations are the ones that control for covariates nonparametrically. Therefore, it is essential from the causal analysis perspective to incorporate the covariates into the GLATE framework in a nonparametric way.
The causal parameters identifiable in the GLATE model include local average structural function (LASF) and local average structural function for the treated (LASF-T). LASF is the mean potential outcome for specific subpopulations. These subpopulations are defined by their treatment choice behaviors and are generalizations of the concepts always takers, compliers, and never takers in the binary LATE model. The parameter LASF-T further restricts the subpopulation to exclude individuals who do not take up the treatment.
This paper is concerned with the econometric aspects of the GLATE model. The analysis begins by deriving the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for the identified parameters. The calculation is based on the method outlined in Chapter 3 of Bickel et al. (1993) and in Newey (1990). We then verify that the conditional expectation projection (CEP) estimator (e.g., Chen et al., 2008), constructed directly from the identification result, achieves the SPEB and hence is semiparametrically efficient. Using these results, we may efficiently estimate other important parameters of interest by the plug-in method, since a standard delta-method argument preserves semiparametric efficiency.
The EIF not only facilitates the efficiency calculation but can also serve as the moment condition for estimation. This is because the EIF is mean zero by construction and equals the original identification result plus an adjustment term due to the presence of infinite-dimensional parameters. We show that the moment condition constructed from the EIF satisfies two related robustness properties: double robustness and Neyman orthogonality. Double robustness guarantees that the moment condition is correctly specified in a parametric setting even when some of the working models for the nuisance parameters are misspecified.
The Neyman orthogonality condition means that the moment condition is insensitive to the nuisance parameters. This condition is particularly useful when the conditioning covariates are of high dimension. To further utilize this condition, we study the double/debiased machine learning (DML) estimator (Chernozhukov et al., 2018) in the GLATE setting. Under certain conditions regarding the convergence rate of the first-step nonparametric estimators, the DML estimator is asymptotically normal uniformly over a large class of data generating processes (DGPs).
The weak identification issue is a practical concern in the GLATE model. This is because both the treatment and the instrument are multi-valued, and hence the subpopulation on which the LASF and LASF-T are defined can be small. To deal with this issue, we propose null-restricted test statistics for one-sided and two-sided testing problems. This procedure is a generalization of the well-known Anderson-Rubin (AR) test. We show that the proposed tests are consistent and control size uniformly across a large class of DGPs, in which the size of the subpopulation mentioned above can be arbitrarily close to zero.
The paper is organized as follows. The remaining part of this section discusses the literature. Section 2 introduces the GLATE model and the nonparametric identification results. Section 3 calculates the EIF and SPEB. Section 4 discusses the robustness properties of the moment condition generated by the EIF. Section 5 proposes inference procedures under weak identification issues. Section 6 presents the empirical application. Section 7 concludes. The proofs for theoretical results in the main text are collected in Appendix A.
1.1 Literature Review
The GLATE model provides a way to conduct causal inference under endogeneity when the treatment is multi-valued and unordered. As mentioned above, the identification result (conditional on the covariates) was first established in Heckman and Pinto (2018a) using the unordered monotonicity condition. Lee and Salanié (2018) propose another method of identification in a similar model of multi-valued treatment. Their method is concerned with continuous instruments, while the GLATE model is framed in terms of discrete-valued instruments. When the treatment levels are ordered, Angrist and Imbens (1995) derive identification and estimation results for the causal parameter, which is a weighted average of LATEs across different treatment levels.
The literature on semiparametric efficiency in program evaluation starts with the seminal work of Hahn (1998), which studies the benchmark case of estimating the average treatment effect (ATE) under unconfoundedness. For multi-level treatments, Cattaneo (2010) studies the efficient estimation of causal parameters implicitly defined through over-identified non-smooth moment conditions. In the case where unconfoundedness fails and instruments are present, Frölich (2007) calculates the SPEB for the LATE parameter, and Hong and Nekipelov (2010a) extend the analysis to parameters implicitly defined by moment restrictions. In a more general framework encompassing missing data, Chen et al. (2004) and Chen et al. (2008) study semiparametric efficiency bounds and efficient estimation of parameters defined through overidentifying moment restrictions. However, there is currently no theoretical research on semiparametrically efficient estimation in models that encompass endogeneity and unordered multiple treatment levels.
Several ways are available for calculating the EIF for semiparametric estimators, as illustrated by Newey (1990) and Ichimura and Newey (2022). Semiparametric efficiency calculations can be used to construct robust (Neyman orthogonal) moment conditions. This method is illustrated in Newey (1994) and Chernozhukov et al. (2016). Based on the Neyman orthogonality condition, Chernozhukov et al. (2018) introduces the DML method that suits high dimensional settings. This is because Donsker properties and stochastic equicontinuity conditions are no longer required in deriving the asymptotic distribution of the semiparametric estimator.
For testing the GLATE model, Sun (2021) proposes a bootstrap test that generalizes and improves upon the test studied by Kitagawa (2015) in the binary LATE model.
The GLATE model has received attention in the recent empirical literature due to its ability to model multi-valued treatment. Kline and Walters (2016) evaluate the cost-effectiveness of Head Start, classifying Head Start and other preschool programs as different treatment levels against the control group of no preschool. Galindo (2020) assesses the impact of different childcare choices in Colombia on children’s development. Pinto (2021) studies the neighborhood effects and voucher effects in housing allocations using data from the Moving to Opportunity experiment. Our theoretical analysis of the GLATE model presents important tools for estimation and inference that can be applied to those empirical settings.
2 Identification in the GLATE Model
This section describes the generalized local average treatment effect (GLATE) model, discusses identification of the local average structural function (LASF) and other parameters, and introduces the notation.
2.1 The model
We assume a finite collection of instrument values $\mathcal{Z} = \{z_1, \dots, z_{N_Z}\}$ and a finite collection of treatment values $\mathcal{T} = \{t_1, \dots, t_{N_T}\}$, where $N_Z$ and $N_T$ are respectively the total number of instrument and treatment levels. The sets $\mathcal{Z}$ and $\mathcal{T}$ are categorical and unordered. The instrumental variable $Z$ denotes which of the instrument levels is realized. The random variables $T(z)$, $z \in \mathcal{Z}$, each taking values in $\mathcal{T}$, denote the collection of potential treatments under each instrument status. Thus, the observed treatment level is the random variable $T = T(Z)$. For each given treatment level $t \in \mathcal{T}$, there is a potential outcome $Y(t)$. The observed outcome is denoted by $Y = Y(T)$. The random vector $X$ contains the set of covariates. The observed data is a random sample $\{(Y_i, T_i, Z_i, X_i)\}_{i=1}^n$.
The description above establishes a random sampling model where the researcher only observes one potential outcome, the one associated with the observed treatment. This implies that the outcome $Y$, observed from an individual with treatment $T = t$, comes from the conditional distribution of $Y(t)$ given $T = t$ rather than from the marginal distribution of $Y(t)$. In general, this fact leads to identification issues and presents challenges for causal inference. To overcome these problems, we impose further structure on the model.
Assumption 1 (Conditional Independence).
$Z \perp \big( \{Y(t)\}_{t \in \mathcal{T}},\, \{T(z)\}_{z \in \mathcal{Z}} \big) \,\big|\, X$.
Assumption 2 (Unordered Monotonicity).
For any $t \in \mathcal{T}$ and $z, z' \in \mathcal{Z}$, either

$$\mathbb{1}\{T(z) = t\} \;\ge\; \mathbb{1}\{T(z') = t\} \quad \text{almost surely, conditional on } X,$$

or

$$\mathbb{1}\{T(z) = t\} \;\le\; \mathbb{1}\{T(z') = t\} \quad \text{almost surely, conditional on } X.$$
Assumptions 1 and 2 provide the multi-valued analog of Assumption 2.1 in Abadie (2003). Assumption 1 requires that the instrument be independent of the potential treatments and outcomes once we condition on $X$. Assumption 2 is the conditional version of the unordered monotonicity condition proposed by Heckman and Pinto (2018a). It means that when we focus on a particular treatment level and a pair of instrument values, the binary environment should satisfy the usual monotonicity constraint in the LATE model. Specifically, the unordered monotonicity condition requires that a shift in the instrument moves all agents uniformly toward or against each possible treatment value.² As pointed out by Vytlacil (2002), the LATE monotonicity condition is a restriction across individuals on the relationship between different hypothetical treatment choices defined in terms of an instrument.
We define the type of an individual as the vector of the potential treatments, that is, $S = (T(z_1), \dots, T(z_{N_Z}))'$.
By construction, $S$ is not observed. Assumption 2, the unordered monotonicity condition, is essentially a restriction on $\operatorname{supp}(S)$, the support of $S$. Denote the elements in $\operatorname{supp}(S)$ by $s_1, \dots, s_{N_S}$, where $N_S$ is the cardinality of $\operatorname{supp}(S)$. A convenient way to characterize $\operatorname{supp}(S)$ is by using the matrix $R = [s_1, \dots, s_{N_S}]$, whose columns are the possible types. The matrix $R$ is referred to as the response matrix since it describes how each type of individual's treatment choice responds to the instrument.
The role of $S$ is to assist the identification of the counterfactual outcomes by dividing the population into a finite number of groups, where identification can be achieved within specific groups. Those groups are defined as follows. For $t \in \mathcal{T}$ and $k = 0, 1, \dots, N_Z$, let $\Sigma_t^k$ be the set of types in which the treatment level $t$ appears exactly $k$ times. That is,

$$\Sigma_t^k \;=\; \Big\{ s \in \operatorname{supp}(S) \,:\, \sum_{j=1}^{N_Z} \mathbb{1}\{ s[j] = t \} = k \Big\},$$

where $s[j]$ denotes the $j$th element of the vector $s$. In particular, the collection $\{\Sigma_t^k\}_{k=0}^{N_Z}$ forms a partition of $\operatorname{supp}(S)$.
For individuals with types in the same type set $\Sigma_t^k$, their treatment response in terms of $t$ is, in a way, homogeneous. Thus, it is intuitively easier to identify the marginal distribution of the potential outcome within each $\Sigma_t^k$. More specifically, for a type set $\Sigma$ and treatment level $t$, we define the local average structural function (LASF) $\beta_t(\Sigma) = \operatorname{E}[Y(t) \mid S \in \Sigma]$ and the local average structural function for the treated (LASF-T) $\beta_t^{\mathrm{T}}(\Sigma) = \operatorname{E}[Y(t) \mid S \in \Sigma, T = t]$.
Before presenting the identification results for the above two classes of parameters, we illustrate the GLATE model in the following two examples.
Example 1 (Binary LATE model).
In the binary LATE model of Imbens and Angrist (1994), there are two treatment levels $\mathcal{T} = \{0, 1\}$ and two instrument levels $\mathcal{Z} = \{0, 1\}$. There are three types: , which are referred to in the literature as defiers, compliers, and always-takers, respectively. The type set contains the defiers, the compliers, and the always-takers. The response matrix is the following binary matrix
The local average treatment effect is the treatment effect for the compliers, which can be written as the difference between two LASFs: $\mathrm{LATE} = \beta_1(\Sigma_c) - \beta_0(\Sigma_c)$, where $\Sigma_c$ denotes the type set of compliers.
Example 2 (Three treatment levels and two instrument levels).
The simplest GLATE model (excluding the binary case in Example 1) has three treatment levels $\mathcal{T} = \{t_1, t_2, t_3\}$ and two instrument levels $\mathcal{Z} = \{z_0, z_1\}$. There are five types specified as the columns in the following response matrix

$$R \;=\; \begin{pmatrix} t_1 & t_2 & t_3 & t_2 & t_3 \\ t_1 & t_2 & t_3 & t_1 & t_1 \end{pmatrix},$$

where the two rows correspond to the instrument values $z_0$ and $z_1$.
In this example, a shift from $z_0$ to $z_1$ moves all agents uniformly toward the treatment level $t_1$. The type set $\{(t_1, t_1)'\}$ contains the type that always chooses the treatment $t_1$ and thus can be referred to as the $t_1$-always taker. The same applies to $t_2$ and $t_3$. The type $(t_2, t_1)'$ switches from $t_2$ to $t_1$ and hence can be considered a $t_2$-switcher (or $t_2$-complier). Similarly, we can refer to $(t_3, t_1)'$ as the $t_3$-switcher. This model is used in Kline and Walters (2016) to study the causal effect of the Head Start preschool program. The instrument indicates whether the household receives a Head Start offer, and the treatment levels are Head Start, other preschool programs, and no preschool. The unordered monotonicity condition means that anyone who changes behavior as a result of the Head Start offer does so to attend Head Start.
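As a quick illustration, unordered monotonicity can be checked mechanically on a candidate response matrix. The following Python sketch is our own illustration (the function name and the integer coding of the treatment levels are hypothetical): for every treatment level and every pair of instrument rows, the indicator vectors across types must be componentwise ordered.

```python
import numpy as np

def satisfies_unordered_monotonicity(R):
    """Check Assumption 2 on a response matrix R (rows: instrument values,
    columns: types). For every treatment level t and every pair of rows
    (z, z'), the indicators 1{R[z] == t} and 1{R[z'] == t} must be ordered
    componentwise: one row weakly dominates the other across all types."""
    for t in np.unique(R):
        B = (R == t).astype(int)                # the binary matrix B_t
        for i in range(B.shape[0]):
            for j in range(B.shape[0]):
                d = B[i] - B[j]
                if (d > 0).any() and (d < 0).any():
                    return False                # rows not comparable
    return True

# Example 2's response matrix, coding (t1, t2, t3) as (1, 2, 3):
R = np.array([[1, 2, 3, 2, 3],
              [1, 2, 3, 1, 1]])
print(satisfies_unordered_monotonicity(R))      # True
```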
2.2 Identification Results
We introduce some matrix notation related to the type $S$. For each treatment level $t$, let $B_t$ be a binary matrix of the same dimension as the response matrix $R$, with each element of $B_t$ signifying whether the corresponding element in the response matrix is $t$. That is, $B_t[i, j]$, the $(i, j)$th element of $B_t$, indicates whether $T(z_i)$ equals $t$ for the subpopulation $s_j$. Let $B_t^+$ denote the Moore–Penrose inverse of $B_t$.
For convenience, we also need some notations regarding conditional expectations. Let
be the vector of functions that describes the conditional distribution of the instrument . For each treatment level , let
be the vector that describes the conditional treatment probabilities given each level of the instrument. Denote
as the vector that contains the conditional outcomes for each treatment level . Notice that the functions , , and are all identified.
Theorem 2.1 (Identification of LASF).
Theorem 2.1 identifies the size of the subpopulation $\Sigma_t^k$ and the local average structural function for that subpopulation. The only exception, where identification fails, is the type set $\Sigma_t^0$, in which case the individuals never choose the treatment $t$. This identification result is a modification of Theorem T-6 in Heckman and Pinto (2018a) that explicitly accounts for the presence of covariates $X$. Bayes' rule is applied to convert the conditional result into the unconditional one. The following theorem presents the identification result for the LASF-T.
Let be the set of instrument values that induce the treatment level in the type set . That is, , where denotes the th element of the vector . Then define as the total probability of those instrument values.
Theorem 2.2 (Identification of LASF-T).
The identification results are illustrated using the two examples.
Example 3 (Example 1, continued).
Since the treatment is binary, the matrix $B_1$ is equal to the response matrix $R$. The matrix and its generalized inverse are respectively
The matrix and its generalized inverse are respectively
The vectors and are respectively
Theorem 2.1 implies that
The two denominators in the above expressions are both equal to the type probability of compliers. Then the usual identification of the LATE parameter (e.g., Frölich, 2007) follows:

$$\mathrm{LATE} \;=\; \frac{\int \big( \operatorname{E}[Y \mid X = x, Z = 1] - \operatorname{E}[Y \mid X = x, Z = 0] \big)\, f(x)\, dx}{\int \big( \operatorname{E}[T \mid X = x, Z = 1] - \operatorname{E}[T \mid X = x, Z = 0] \big)\, f(x)\, dx},$$

where $f$ denotes the marginal density function of $X$.
3 Semiparametric Efficiency
In this section, we calculate the semiparametric efficiency bound (SPEB) and propose estimators that achieve such bounds. We focus on the parameters LASF and LASF-T. In Appendix B, we study general parameters implicitly defined through moment restrictions.
3.1 LASF and LASF-T
For the rest of the paper, we assume that the relevant outcome variables have finite second moments; this is necessary since we are studying efficiency. Let $\mathbf{1}$ denote the column vector of ones and $\operatorname{diag}(a)$ the diagonal matrix with diagonal elements given by the vector $a$. The following theorem gives the efficient influence function (EIF) and the SPEB for the parameters identified in the preceding section.
Theorem 3.1 (SPEB for LASF and LASF-T).
Let Assumptions 1 and 2 hold. Let and . Assume that .
(i) The semiparametric efficiency bound for is given by the variance of the efficient influence function (2).

(ii) The semiparametric efficiency bound for is given by the variance of the efficient influence function

(iii) The semiparametric efficiency bound for is given by the variance of the efficient influence function

(iv) The semiparametric efficiency bound for is given by the variance of the efficient influence function
The EIF in Theorem 3.1 can be interpreted as the moment condition from the identification results modified by an adjustment term due to the presence of unknown infinite-dimensional parameters. Taking the EIF in (2) as an example, the terms
and
are respectively the adjustment terms due to the presence of and .
From the expression of , we can see that the SPEB would be large when is small. This is because measures the size of the subpopulation on which the LASF is estimated. When is small, we run into the weak identification issue. In Section 5, we study inference procedures that are robust against weak identification issues.
One benefit of the EIFs is that we can easily calculate the covariance matrix of different estimators. Consider an example where we are interested in two LASFs $\beta_1$ and $\beta_2$, whose EIFs are given by $\psi_1$ and $\psi_2$, respectively. If the two estimators $\hat\beta_1$ and $\hat\beta_2$ are both semiparametrically efficient, then their asymptotic covariance matrix equals $\operatorname{E}[\psi \psi']$, where $\psi = (\psi_1, \psi_2)'$.
Example 5 (Example 1, continued).
The derived SPEB helps determine whether an estimation procedure is efficient. In this section, we focus on the conditional expectation projection (CEP) estimator.⁴ The terminology “conditional expectation projection” is adopted from Chen et al. (2008) and Hong and Nekipelov (2010a), whereas Hahn (1998) refers to these estimators as “nonparametric imputation based estimators.” Define
The CEP procedure first estimates $P$, $P_t$, and $Q_t$ by using nonparametric estimators $\hat P$, $\hat P_t$, and $\hat Q_t$, respectively. These estimators can be constructed based on series or local polynomial estimation. Then and are estimated using and . The vectors of estimators and are stacked in an obvious way. Let . The CEP estimators for the structural parameters are defined by
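To make the construction concrete, here is a minimal Python sketch of the CEP plug-in estimator in the binary special case of Example 1, where the target is the complier mean of $Y(1)$; the learner choice and all function names are our own, not the paper's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_cond_mean(V, X, Z, z):
    """Nonparametric regression of V on X within the subsample {Z = z}.
    X is a 2-D array of covariates; the learner choice is illustrative."""
    model = GradientBoostingRegressor(max_depth=2, n_estimators=200)
    model.fit(X[Z == z], V[Z == z])
    return model.predict

def cep_lasf_complier_y1(Y, D, Z, X):
    """CEP (plug-in) estimate of the complier mean of Y(1), binary case:
        beta = E[ m1(X) - m0(X) ] / E[ q1(X) - q0(X) ],
    where m_z(x) = E[Y*D | X=x, Z=z] and q_z(x) = Pr[D=1 | X=x, Z=z] play
    the roles of the identified functions Q_t and P_t in the text."""
    m1 = fit_cond_mean(Y * D, X, Z, 1)
    m0 = fit_cond_mean(Y * D, X, Z, 0)
    q1 = fit_cond_mean(D.astype(float), X, Z, 1)
    q0 = fit_cond_mean(D.astype(float), X, Z, 0)
    num = np.mean(m1(X) - m0(X))   # sample analog of the numerator
    den = np.mean(q1(X) - q0(X))   # estimated complier probability
    return num / den
```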
The next proposition shows that the CEP estimators are semiparametrically efficient. The result is similar in style to Hahn's (1998) Proposition 4 in that the low-level regularity conditions are omitted. Instead, the proposition assumes the high-level condition that the CEP estimators are asymptotically linear, meaning they are asymptotically equivalent to sample averages. More formally, an estimator of is asymptotically linear if it admits an influence function; that is, there exists an iid sequence with zero mean and finite variance such that
Since each element of the conditional expectations , , and can be considered as coming from a binary LATE model, the regularity conditions in Hong and Nekipelov (2010b) should work with little modification.
Proposition 3.2.
Suppose the CEP estimators are asymptotically linear, then they achieve the semiparametric efficiency bound.
The reason that this type of estimator is efficient is well explained in Ackerberg et al. (2014). The estimation problem here falls into their general semiparametric model, where the finite-dimensional parameter of interest is defined by unconditional moment restrictions. They show that the semiparametric two-step optimally weighted GMM estimators, the CEP estimators in this case, achieve the efficiency bound since the parameters of interest are exactly identified. Discussions related to this phenomenon can also be found in Chen and Santos (2018).
We next examine the efficient estimation of other policy-relevant parameters that can be derived from the parameters . As an example, consider the type set , which is referred to as -switchers. This subpopulation contains individuals who switch between and other treatments when given different levels of the instrument. It is a generalization of the concept of compliers in the binary LATE framework.⁵ Recall that switchers are also illustrated in Example 2. The LASF for the subpopulation is given by
Similarly, one can also define
(3)
which represents the LASF-T for the subpopulation of -treated -switchers.
For some subpopulations, a treatment effect can be identified. This point has already been illustrated in the binary case (Example 3), where the usual LATE parameter is identified as the difference of two LASFs. We further illustrate this point with Example 2.
Example 6 (Example 2, continued).
The quantity
represents the local average treatment effect of against other treatments within the subpopulation of -switchers. Analogously, the parameter
is the local average treatment effect of against other treatments within the subpopulation of -treated -switchers.
To summarize the above examples using a general expression, let $\phi = f(\theta)$ be a finite-dimensional parameter, where $f$ is a known continuously differentiable function and $\theta$ is the vector containing all identifiable LASF and LASF-T parameters. Let $\hat\theta$ denote the corresponding vector of CEP estimators, so that a natural plug-in estimator is $\hat\phi = f(\hat\theta)$. The delta method yields the efficiency bound of $\phi$ and the efficiency of $\hat\phi$. In fact, by Theorem 25.47 of van der Vaart (1998), we immediately have the following corollary, which shows that plug-in estimators are efficient.
Corollary 3.3.
The semiparametric efficiency bound of is given by the variance of efficient influence function
$$\psi_\phi \;=\; \nabla f(\theta)' \, \psi_\theta, \tag{4}$$

where the partial derivatives are evaluated at the true parameter value. Moreover, the plug-in estimator $\hat\phi = f(\hat\theta)$, based on the CEP estimators $\hat\theta$, achieves the efficiency bound.
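As a small illustration of the corollary, delta-method inference for a smooth transformation can be computed from efficient estimates of the LASF vector together with their estimated influence-function values. This is a sketch with hypothetical names; `Psi` stacks the estimated EIF values for each observation.

```python
import numpy as np

def plug_in_inference(f, grad_f, theta_hat, Psi):
    """Delta-method inference for phi = f(theta), per Corollary 3.3.
    theta_hat : (d,) efficient (e.g., CEP) estimates of the LASF vector
    Psi       : (n, d) estimated influence-function values for theta_hat
    grad_f    : callable returning the (d,) gradient of f at a point."""
    phi_hat = f(theta_hat)
    psi_phi = Psi @ grad_f(theta_hat)   # estimated EIF of the plug-in, cf. (4)
    se = psi_phi.std() / np.sqrt(Psi.shape[0])
    return phi_hat, se

# Example usage: a LATE-type contrast phi = theta[0] - theta[1].
# phi_hat, se = plug_in_inference(lambda th: th[0] - th[1],
#                                 lambda th: np.array([1.0, -1.0]),
#                                 theta_hat, Psi)
```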
4 Robustness
In the previous section, the EIF was used as a tool for computing the SPEB. In this section, we directly use the EIF as the moment condition for estimation. These moment conditions are appealing because they satisfy double robustness and Neyman orthogonality (local robustness), the two topics of this section.
A word on notation: in the rest of the paper, we use a superscript to signify the true value whenever necessary. For example, when both and appear, the former means the true probability while the latter denotes a generic function.
4.1 Double Robustness
We focus on the LASF . The same analysis can be applied to the other parameters. To avoid notational burden in the main text, we drop the subscript in , , and , and the subscript in and .⁶ The full subscripts are kept in the Appendices. It is straightforward to verify that the EIF has zero mean. However, we do not want to use itself as the estimating equation since it contains as a factor. To deal with this problem, we simply multiply by and define
The corresponding moment condition is
(5)
This moment condition is doubly robust, as demonstrated in the following proposition.
Proposition 4.1 (Double Robustness).
Let be an arbitrary vector of functions and the true vector of conditional expectations. Then
and
The above proposition divides the nonparametric nuisance parameters into two groups, and . The doubly robust moment condition is valid if either of these two groups of nuisance parameters is true. On the other hand, if the researcher uses parametric models for these nuisance parameters, then the structural parameter can be recovered provided that at least one of the working nuisance models is correctly specified. Therefore, the doubly robust moment condition is “less demanding” on the researcher’s ability to devise a correctly specified model for the nuisance parameters. The double robustness result in Proposition 4.1 can be seen as the GLATE extension of the existing results in the binary LATE literature (e.g., Tan, 2006; Okui et al., 2012).
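To illustrate Proposition 4.1 numerically, the following simulation sketch specializes the doubly robust moment to the binary case (an AIPW-type construction; the data-generating process and all names are our own illustration, not the paper's exact GLATE moment). Deliberately misspecifying either the instrument propensity or the conditional-mean functions, but not both, still recovers the true complier mean of $1.5 = 1 + \operatorname{E}[X]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# DGP: X ~ U(0,1); instrument propensity p0(x) = 0.3 + 0.4x;
# types: complier 50%, always-taker 25%, never-taker 25%; Y(1) = 1 + X + eps.
X = rng.uniform(size=n)
Z = rng.uniform(size=n) < 0.3 + 0.4 * X
u = rng.uniform(size=n)
D = np.where(u < 0.5, Z, u < 0.75)       # compliers take D = Z
Y = (1 + X + rng.normal(size=n)) * D     # only Y(1)*D enters the moments
YD = Y * D

# True nuisance functions implied by this DGP:
m1 = lambda x: 0.75 * (1 + x)            # E[YD | X=x, Z=1]
m0 = lambda x: 0.25 * (1 + x)            # E[YD | X=x, Z=0]
q1 = lambda x: 0.75 + 0 * x              # Pr[D=1 | X=x, Z=1]
q0 = lambda x: 0.25 + 0 * x              # Pr[D=1 | X=x, Z=0]

def dr_estimate(p, m1, m0, q1, q0):
    """Doubly robust estimate of E[Y(1) | complier] (true value 1.5 here)."""
    w1, w0 = Z / p(X), (1 - Z) / (1 - p(X))
    num = m1(X) - m0(X) + w1 * (YD - m1(X)) - w0 * (YD - m0(X))
    den = q1(X) - q0(X) + w1 * (D - q1(X)) - w0 * (D - q0(X))
    return num.mean() / den.mean()

true_p, wrong_p = lambda x: 0.3 + 0.4 * x, lambda x: 0.5 + 0 * x
wrong_m = lambda x: 0 * x                # badly misspecified outcome models

print(dr_estimate(true_p, m1, m0, q1, q0))               # ~1.5 (all correct)
print(dr_estimate(wrong_p, m1, m0, q1, q0))              # ~1.5 (true m, q only)
print(dr_estimate(true_p, wrong_m, wrong_m, q1, q0))     # ~1.5 (true p only)
```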
4.2 Neyman Orthogonality
The second robustness property is Neyman orthogonality. Moment conditions with this property have reduced sensitivity with respect to the nuisance parameters. Formally, Neyman orthogonality means that the moment condition has zero Gateaux derivative with respect to the nuisance parameters. The result is presented in the following proposition.
Proposition 4.2 (Neyman Orthogonality).
Let be an arbitrary set of functions. For , define and . Suppose that is integrable, then
where does not need to be the true parameter value.
In many econometric models, double robustness and Neyman orthogonality come in pairs. Discussions of their general relationship can be found in Chernozhukov et al. (2016). In practice, double robustness is often used for parametric estimation, as previously explained, whereas Neyman orthogonality is used in estimation with possibly high-dimensional nuisance parameters.
Next, we apply the double/debiased machine learning (DML) method developed by Chernozhukov et al. (2018) to the moment condition (5). This estimation method works even when the nuisance parameter space is complex enough that the traditional assumptions, e.g., Donsker properties, are no longer valid.⁷ In two-step semiparametric estimation, Donsker properties are usually required so that a suitable stochastic equicontinuity condition is satisfied; see, for example, Assumption 2.5 in Chen et al. (2003). The implementation details are explained below.
The nuisance parameters , , and are estimated using a cross-fitting method: take a $K$-fold random partition of the data such that each fold is of size $n/K$. For $k = 1, \dots, K$, let $I_k$ denote the set of observation indices in the $k$th fold and $I_k^c$ the set of observation indices not in the $k$th fold. Define , , and to be the estimates constructed by using data from $I_k^c$. The DML estimator of is constructed following the moment condition (5):⁸ This is the DML2 estimator defined in Chernozhukov et al. (2018). Another estimator, the DML1 estimator, is proposed in the same paper. We do not study the DML1 estimator since it is asymptotically equivalent to DML2, and the authors generally recommend DML2.
(6)
To conduct inference, we also need an estimate of the asymptotic variance of , which we denote by . The asymptotic variance equals the expectation of the squared efficient influence function: . We first estimate by the cross-fitting method; the estimator is essentially given by the denominator of (6):
(7)
Then the asymptotic variance can be estimated by
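For concreteness, here is a cross-fitted (DML2-style) sketch of the estimator (6) and the variance construction, again specialized to the binary case with the doubly robust moment from Section 4.1; the fold logic follows the description above, while the learners and names are our own.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def dml_lasf(Y, D, Z, X, K=5, seed=0,
             make_reg=lambda: GradientBoostingRegressor(max_depth=2),
             make_clf=lambda: GradientBoostingClassifier(max_depth=2)):
    """Cross-fitted estimate of the complier mean of Y(1) (binary case),
    following the DML2 construction in (6); Z must be coded 0/1."""
    n = len(Y)
    num, den = np.zeros(n), np.zeros(n)
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # First stage on the complement fold I_k^c.
        p = make_clf().fit(X[train], Z[train])
        def cmean(V, z):  # regression of V on X within {Z = z} of the fold
            idx = train[Z[train] == z]
            return make_reg().fit(X[idx], V[idx]).predict(X[test])
        m1, m0 = cmean(Y * D, 1), cmean(Y * D, 0)
        q1, q0 = cmean(D.astype(float), 1), cmean(D.astype(float), 0)
        ph = np.clip(p.predict_proba(X[test])[:, 1], 0.01, 0.99)
        w1, w0 = Z[test] / ph, (1 - Z[test]) / (1 - ph)
        yd = Y[test] * D[test]
        num[test] = m1 - m0 + w1 * (yd - m1) - w0 * (yd - m0)
        den[test] = q1 - q0 + w1 * (D[test] - q1) - w0 * (D[test] - q0)
    beta = num.sum() / den.sum()            # DML2 estimator, cf. (6)
    psi = (num - beta * den) / den.mean()   # plug-in influence-function values
    return beta, psi.std() / np.sqrt(n), num, den
```

The returned per-observation moments `num` and `den` are reused in the weak-identification tests of Section 5 below.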
We want to establish the convergence results for the DML estimator uniformly over a class of data generating processes (DGPs) defined as follows. For any two constants , let be the set of joint distributions of such that
(i) ,

(ii) , and .
The first condition excludes the case where is weakly identified (when can be arbitrarily close to zero). Inference under weak identification is studied in the next section. The following theorem establishes the asymptotic properties of the DML estimation procedure. In particular, the estimator achieves the SPEB.
Theorem 4.3.
Let Assumptions 1 and 2 hold. Assume the following conditions on the nuisance parameter estimators :
(i) For , is bounded, and , and is bounded away from zero.

(ii) .
Then the estimator obeys that
uniformly over the DGPs in . Moreover, the above convergence result continues to hold when is replaced by the estimator .
The proof verifies the conditions of Theorem 3.1 in Chernozhukov et al. (2018). The essential restriction is on the uniform convergence rate of the estimators of the nuisance parameters. In low-dimensional settings, one can consider local polynomial regression for estimating the conditional expectations. Under suitable conditions (Hansen, 2008; Masry, 1996), the uniform convergence rate of the local polynomial estimators is , which is if . In high-dimensional settings, as pointed out by Chernozhukov et al. (2018), the rate is often available for common machine learning methods under structured assumptions on the nuisance parameters.⁹ This includes the LASSO method under sparsity of the nuisance space; see, for example, Bühlmann and van de Geer (2011), Belloni and Chernozhukov (2011), and Belloni and Chernozhukov (2013). However, Chernozhukov et al. (2018) also indicate that proving that machine learning methods achieve the rate eventually requires related entropy conditions, under which the asymptotic normality of the DML estimator continues to hold.
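Under such sparsity assumptions, one could, for example, swap the first-stage learners in the cross-fitting sketch above for l1-penalized regressions (hypothetical usage, with data arrays `Y, D, Z, X` as before):

```python
from sklearn.linear_model import LassoCV, LogisticRegressionCV

# l1-penalized first stages for high-dimensional X, plugged into the
# dml_lasf sketch from Section 4.2 above.
beta, se, num, den = dml_lasf(
    Y, D, Z, X, K=5,
    make_reg=lambda: LassoCV(cv=5),
    make_clf=lambda: LogisticRegressionCV(cv=5, penalty="l1", solver="saga"),
)
```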
Theorem 4.3 can be directly used to conduct inference on . Confidence regions can be constructed by inverting the usual -tests. These confidence regions are uniformly valid since the convergence results in the above theorem hold uniformly over . In the next section, we explain why uniform validity is crucial when dealing with weak identification issues.
5 Weak Identification
The convergence result established in Theorem 4.3 is uniform over the set of DGPs with type probability bounded away from zero. However, the identification of would be weak in the case where can be arbitrarily close to zero. This leads to distortion of the uniform size of the test and poor asymptotic approximation in finite-sample settings. This section studies this weak identification issue and proposes an inference procedure that is robust against such a problem.
We begin with a heuristic illustration of the weak identification problem. To ease notation, define and
After a simple calculation, we can write
In the above expression, we can interpret the estimation errors and as the noises, while the signal is the term . Under the usual asymptotics where is fixed, the noise terms are bounded in probability, whereas the signal term . Hence, the signal dominates the noise, and the estimator is consistent. However, under asymptotics with a drifting sequence and converging to a finite constant, the signal and the noise are of the same magnitude, which results in the inconsistency of . This problem is the weak identification issue. In the weak IV literature, a common measure of identification strength is the so-called concentration parameter. In our case, the concentration parameter is given by where corresponds to strong identification, and identification is weak when the limit of is finite.
While weak identification is a finite-sample issue, it is formalized using the asymptotic framework. However, the illustration above using asymptotics under drifting sequences is not meant to model DGPs that vary with the sample size . Instead, it is a tool used to detect the lack of uniform convergence. In fact, controlling the uniform size of the test is the key to solving weak identification problems.¹⁰ See, for example, Imbens and Manski (2004), Mikusheva (2007), and Andrews et al. (2020). Formally, the uniform size of a test is the large-sample limit of the supremum of the rejection probability under the null hypothesis, where the supremum is taken over the nuisance parameter space. When testing a null hypothesis on in the GLATE model, the supremum mentioned above is taken over all values of . That is, a desirable test should have rejection probability under the null converging to the nominal size uniformly over . From the previous discussion, we can see that the uniform size cannot be controlled using the usual -statistic . This failure of uniform convergence, however, does not conflict with Theorem 4.3, where the uniform convergence of is established only after restricting to be bounded away from zero.
Inference procedures that are robust against weak identification can be obtained by directly imposing the null hypothesis in the construction of the test statistic. One such example is the well-known Anderson-Rubin (AR) statistic in the weak IV literature. Its idea can be generalized to the GLATE model. We first consider testing the two-sided hypothesis versus . To control the uniform size of the test, we need the test statistic to converge uniformly on the parameter space where (1) , and (2) is allowed to be arbitrarily close to zero. A null-restricted -statistic can be obtained as follows. Notice that when , is equivalent to
(8)

Its estimate can be written as

(9)
Under the null hypothesis , the above estimate does not depend on the concentration parameter and consists only of the noise terms and , whose uniform convergence can be established directly.
For implementation, this test statistic can be obtained as a straightforward application of the DML procedure described in the previous section to the moment condition (8). As a consequence of Proposition 4.2, the above moment condition satisfies the Neyman orthogonality condition regardless of the true value of . More specifically, the null-restricted -statistic is defined to be
where
The corresponding test of against rejects for large values of .
The same methodology can be applied to testing the one-sided hypothesis versus . Under the null hypothesis, is non-positive, suggesting that the test should reject for large values of . Notice that this relies on knowing the sign of , which follows from the GLATE model structure. This restriction on the sign of is similar to knowing the first-stage sign in the linear IV model, which is studied by Andrews and Armstrong (2017) in the context of unbiased estimation.
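Both tests can be assembled from the cross-fitted per-observation moments (the `num` and `den` arrays in the DML sketch above). A minimal sketch, with the one-sided rejection direction following the sign discussion above (our own simplification):

```python
import numpy as np
from scipy.stats import norm

def null_restricted_test(num, den, beta0, alpha=0.05, two_sided=True):
    """AR-type test of H0: beta = beta0 using the null-restricted moment
    g_i = num_i - beta0 * den_i, which has mean zero under H0 even when
    the type probability (the mean of den) is arbitrarily close to zero."""
    g = num - beta0 * den
    t = np.sqrt(len(g)) * g.mean() / g.std()
    if two_sided:                        # H1: beta != beta0
        return abs(t) > norm.ppf(1 - alpha / 2), t
    return t > norm.ppf(1 - alpha), t    # H1: beta > beta0 (one-sided)
```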
We now define the set of DGPs that allows to be arbitrarily close to zero. For any two constants , let be the set of joint distributions of such that
(i) ,

(ii) , and .
For any , let be the subset of in which the true value of the parameter is . In particular, denotes the subset where the null hypothesis is true. The superscript “WI” denotes weak identification. The difference between and is that allows the type probability to be arbitrarily small, whereas the type probabilities in are uniformly bounded away from zero. Denote as the th quantile of the standard normal distribution. The following theorem establishes that the above testing procedures have uniformly correct sizes and are consistent.
Theorem 5.1.
Suppose the conditions on the nuisance parameter estimates in Theorem 4.3 hold. Let be the nominal size of the tests.
(i) The test that rejects in favor of when has (asymptotically) uniformly correct size and is consistent. That is,

and

(ii) The test that rejects in favor of when has (asymptotically) uniformly correct size and is consistent. That is,

and
6 Empirical Application
In this section, we apply the theoretical results to data from the Oregon Health Insurance Experiment (Finkelstein et al., 2012) and examine the effects of different sources of health insurance on health. The experiment was conducted by the state of Oregon between March and September 2008. A series of lottery draws was administered to award participants the option of enrolling in the Oregon Health Plan Standard, a Medicaid expansion program available to low-income Oregon adult residents. Follow-up surveys were sent out in several waves to record, among many variables, the participants' insurance plans and health status. Finkelstein et al. (2012) obtain the effects of insurance coverage by using a LATE model. We apply the GLATE model to study the effect heterogeneity across different sources of insurance.
According to the data, many lottery winners did not choose to participate in the Medicaid program. Instead, they went with other insurance plans or chose not to have any health insurance. Based on this observation, we can set up the GLATE model. The instrument is the binary lottery that determines whether an individual is selected. The covariates include the number of household members and survey waves. Given , is randomly assigned (Finkelstein et al., 2012, p. 1071).¹¹ Though the covariates are discrete, the methods developed in this paper are still different from the linear regressions in Finkelstein et al. (2012). The treatment is the insurance plan, which contains three categories: Medicaid (), non-Medicaid insurance plans (), and no health insurance (). The second category includes Medicare, private plans, employer plans, and other plans. The counterfactual health plan choices under different lottery results are the variables and . The unordered monotonicity condition requires that any participant who changes insurance plan due to winning the lottery does so to enroll in the Medicaid program.
The above setup is the same as Example 2, with types. We follow the terminology in Kline and Walters (2016) and define the following six type sets by their counterfactual insurance plan choices:
1. -never takers: , ;

2. -never takers: , ;

3. always takers: , ;

4. -compliers: , , ;

5. -compliers: , , ;

6. compliers: , , .
The two groups of never takers choose not to join Medicaid regardless of the offer. Always takers manage to enroll in Medicaid even without an offer. The - and - compliers switch to Medicaid from no insurance plan and other plans, respectively, upon winning the lottery. Combining these two groups gives the larger set of compliers.
Table 1 shows the estimated probabilities of the six types.¹² We use the data from the 12-month survey. After taking care of the missing values, we are left with observations. For cross-fitting, we choose . We can see that half of the population are -never takers, who are never covered by any insurance plan. The compliers make up around one-fifth of the population. There are effectively no -compliers, meaning that the experiment does not crowd out other insurance plan choices. These findings are consistent with Finkelstein et al. (2012).
| Type | Probability | Estimate (se) |
| --- | --- | --- |
| -never takers | | .492 (.046) |
| -never takers | | .208 (.018) |
| always takers | | .116 (.018) |
| -compliers | | .197 (.059) |
| -compliers | | .010 (.024) |
| compliers | | .208 (.060) |
The outcome of interest is health status, which is (inversely) measured by the number of days (out of the past 30) when poor health impaired regular activities.¹³ Other types of outcomes are also studied by Finkelstein et al. (2012), including health care utilization and financial strain. Here we only focus on health status for simplicity. The potential outcomes are denoted by , , and . By Theorem 2.1, we can identify the distribution of for -never takers and -compliers, the distribution of for -never takers and -compliers, and the distribution of for always takers and compliers. Table 2 reports the estimated LASFs.¹⁴ The LASF is excluded because there are few -compliers, as reported in Table 1. We can clearly see a pattern of self-selection into the treatment. For example, when there is no insurance coverage, the potential health status of -compliers is worse than that of -never takers, and they therefore choose to enroll in Medicaid.
| Type | Treatment | LASF | Estimate (se) |
| --- | --- | --- | --- |
| -never takers | | | 6.78 (1.19) |
| -never takers | | | 7.74 (1.05) |
| always takers | | | 9.96 (1.75) |
| -compliers | | | 11.50 (2.92) |
| compliers | | | 0.48 (3.42) |
7 Concluding Remarks
In this paper, we considered the estimation of the causal parameters, LASF and LASF-T, in the GLATE model by using the EIF. The proposed DML estimator achieves the SPEB and can be applied in situations, such as high-dimensional settings, where Donsker properties fail. For inference, we proposed generalized AR tests robust against weak identification issues. Currently, empirical researchers use TSLS and control for the covariates linearly in models with multi-valued treatments and instruments. This linear specification does not have a LATE interpretation, as pointed out by Blandhol et al. (2022). Therefore, we advocate using the semiparametric methods studied in this paper in those cases.
SUPPLEMENTARY MATERIAL
Appendix A Technical Proofs
In this section, we prove the theorems and propositions stated in the main text. We assume that Assumptions 1 and 2 hold throughout this section.
A.1 Proof of the Identification Results
Lemma A.1.
and , .
Proof of Lemma A.1.
The first statement follows from the definition of and the fact that is independent of the vector conditional on . For the second statement, is entirely determined by . Hence, given and , is independent of since is independent of conditional on . ∎
Lemma A.2.
For each and , the following identification results hold.
(i) a.s.

(ii) a.s.
Proof of Lemma A.2.
This is Theorem T-6 in Heckman and Pinto (2018a), with the conditioning on covariates presented explicitly. ∎
Proof of Theorem 2.1.
Proof of Theorem 2.2.
By Lemma L-16 of Heckman and Pinto (2018b), we know that under the unordered monotonicity assumption, for all . Thus, the set always exists. For the first statement, we have
where the second equality follows from the law of iterated expectations and the third equality follows from the fact that (Lemma A.1). For the second statement, notice that
By Lemma A.1, we know that
Therefore, we can apply Bayes' rule and obtain that
∎
A.2 Semiparametric Efficiency Calculations
We follow the method developed by Newey (1990). The likelihood of the GLATE model can be specified as
where denotes the conditional density of given and . In a regular parametric submodel, where the true underlying probability measure is indexed by , we use the following notation to represent the score functions:
The score in a regular parametric submodel is
Hence, the tangent space of the model is
where is a subspace of that contains the mean zero functions.
Proof of Theorem 3.1.
We only prove statements (i) and (ii) since (iii) and (iv) are easier cases that can be proved along the way. We start with the first statement. The path-wise differentiability of the parameter can be verified in the following way: in any parametric submodel, we have
where and are random vectors whose typical element can be represented by
and
respectively, for . The EIF is characterized by the condition that
The expression of given in Equation (2) meets the above requirements. In particular, the correspondence between terms in the EIF and path-wise derivative appears exactly as in Lemma 1 of Hong and Nekipelov (2010b).
For the second statement, the path-wise derivative of can be computed similarly.
where and are random vectors whose typical element can be represented by
and
respectively, for . The main difference appears when dealing with the last terms in the above two expressions, which can be matched with terms in the efficient influence function of the following two forms
Take the latter one as an example. Notice that
and
By the law of iterated expectations, we have
∎
Proof of Proposition 3.2.
This proof is based on Section 4 in Newey (1994). We focus on the case of . The other cases are similar. To ease notation, let . The estimator is defined by the moment condition
where
We then compute the derivatives of with respect to the parameters:
where denotes the th element of the vector . Define
We have
Then Newey's (1994) Proposition 4 suggests that the influence function of the estimator is , which is equal to the EIF .
∎
A.3 Proof of Robustness Results
Proof of Proposition 4.1.
We prove the case for ; the other cases can be dealt with analogously. First assume ; then

which implies that is almost surely equal to the identity matrix . By the law of total expectation, we have
which implies that . Therefore,
Now suppose that . Then by the law of total expectation, we have
This implies that . Hence,
This proves the proposition. ∎
Proof of Proposition 4.2.
Since is a finite vector, it suffices to verify the Neyman orthogonality condition for , which is defined by
We want to show that
where and . In fact,
which equals zero because of the following three identities:
∎
Proof of Theorem 4.3.
The asserted claims follow from Theorem 3.1, Theorem 3.2, and Corollary 3.2 of Chernozhukov et al. (2018) (henceforth referred to as the DML paper). We want to verify their Assumptions 3.1 and 3.2. Adopting the notation from the DML paper, we let
and
so that the linearity of the moment condition (with respect to ) is verified by the fact that . Define the following quantities, where for simplicity we drop the superscript in the nonparametric estimators:
By assumption on the convergence rates of the nonparametric estimators, we have . Define , where and are positive constants that depend only on and and are specified later in the proof. Let be a sequence of positive constants approaching zero that satisfies . Such a construction is possible since . We set the nuisance realization set (denoted by in the DML paper) to be the set of all vector functions consisting of square-integrable functions and such that for all :
Consider Assumption 3.1 in the DML paper. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2, where the validity of the differentiation under the integral operation is verified later in the proof. Assumption 3.1(e), the identification condition, is verified by the condition that . The remaining conditions of Assumption 3.1 in the DML paper are trivially verified.
Next, we consider Assumption 3.2 in the DML paper. Note that Assumption 3.2(a) holds by the construction of and and our assumptions on the nuisance estimates. Assumption 3.2(d) is verified by our assumption that the semiparametric efficiency bound of is above . The remaining task is to verify Assumption 3.2(b) and 3.2(c) in the DML paper. To do that, we choose sufficiently large and let be an arbitrary element of the nuisance realization set . We keep the above notations throughout the remaining part of the proof. Define
and
Since is a linear combination of and is a linear combination of , we only need and to be uniformly bounded (i.e., the bounds do not depend on ) for in order to verify Assumption 3.2(b) in the DML paper. In fact,
where we have used the assumption that , , and . Similarly, we have
where we have used the assumption that and . Thus, Assumption 3.2(b) in the DML paper is verified.
To verify Assumption 3.2(c) in the DML paper, we again only need to verify the corresponding conditions for and , respectively. For , we have
where the second to last inequality follows from the fact that . For , we have
where the last inequality follows from our assumption that and the fact that . Combining the above two inequality results, we can verify the first two conditions of Assumption 3.2(c) in the DML paper.
For the last condition of Assumption 3.2(c) in the DML paper, which bounds the second-order Gateaux derivative, we again consider and separately. For , recall that and . Clearly, . With differentiation under the integral, we have
Using the fact that and , we can bound the above derivative by
By bounding the first and second derivative uniformly with respect to , we know that the differentiation under the integral operation is valid. So the Neyman orthogonality condition is verified. Analogously, we can show that
Under the assumption , we have
for all and large enough. Then we can bound the above derivative by
Therefore, we have verified the last condition of Assumption 3.2(c) in the DML paper.
Lastly, we need to verify the condition on in Theorems 3.1 and 3.2 of the DML paper, that is, . This directly follows from the construction of . ∎
A.4 Proof of Weak IV Inference Results
Proof of Theorem 5.1.
We first prove part (i). Consider applying the DML method to the moment condition (8) to estimate the parameter and obtain the standard error. We want to show the convergence in distribution of
(A.1)
to the standard normal distribution uniformly over the DGPs in . To do that, we need to verify Assumptions 3.1 and 3.2 in the DML paper regarding the above moment condition. Assumptions 3.1(a)-(c) hold trivially. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2. That is, the Gateaux derivatives with respect to the nuisance parameters are zero regardless of the value of . Assumption 3.1(e), the identification condition, is verified since the Jacobian of the parameter in the moment condition is . Assumption 3.2 in the DML paper can be verified in the same way as in the proof of Theorem 4.3. For brevity, we do not repeat the verification here.
For DGPs in , (A.1) is equal to . Therefore, the uniform convergence in distribution of is established in the null space, and the size of the test is uniformly controlled accordingly. For DGPs in , where , we have
The first term on the RHS of the last equality converges in distribution to . In contrast, the second term diverges to infinity since converges in probability to by Theorem 3.2 in the DML paper. Therefore, the probability of exceeding any finite number converges to 1. The case where is essentially the same.
To prove part (ii) of the theorem, notice that for any DGP in the null space , which implies that . Therefore,
where the supremum is taken over . Consistency can be derived in the same way as part (i). ∎
Appendix B Implicitly Defined Parameters
This section studies general parameters defined implicitly through moment conditions. We allow the moment conditions to be non-smooth, which is the case when the parameter of interest is a quantile. We also allow the moment conditions to be overidentifying, which could be the result of imposing the underlying economic theory on multiple levels of treatment and instrument.
To facilitate the exposition, we define a random variable such that the marginal distribution of is equal to the conditional distribution of given . The joint distribution of the ’s is irrelevant and hence left unspecified. For convenience, we use a single index rather than for labeling. That is, we collect the ’s into the vector . Let be the treatment level associated with . The quantities and are analogously defined.¹⁶ We can further extend the vector to include variables whose marginal distributions are the same as the conditional distributions of given . Efficient estimation in this more general case is similar and hence omitted for brevity.
Let the parameter of interest be , which lies in the parameter space , . The true value of the parameter satisfies the moment condition
where is a vector of functions:
Since the vector appears in each , restrictions are allowed both within and across different subpopulations. Another interesting feature of this specification is that the moment conditions are defined for random variables that are not observed; their marginal distributions, however, can be identified similarly to Theorem 2.1.
Let , where
and
The functions are identified from the data. Similar to Theorem 2.1, we can show that the parameter is identified by the moment conditions:
The following theorem gives the SPEB for the estimation of .
Theorem B.1.
Assume the following conditions hold.
(i) .

(ii) For each and , is continuously differentiable in its second argument. Let be the matrix whose th row is , and assume has full column rank.
Then for the estimation of , the EIF is
(B.1)
where
and is a random vector whose th element is
(B.2)
In particular, the semiparametric efficiency bound is .
Proof of Theorem B.1.
The proof is based on the approach described in Section 3.6 of Hong and Nekipelov (2010a) and the proof of Theorem 1 in Cattaneo (2010). We use a constant matrix to transform the overidentified vector of moments into an exactly identified system of equations , find the -dependent EIF for the exactly identified parameter, and choose the optimal . In a parametric submodel, the implicit function theorem gives that
where is a random vector whose typical element can be represented by
for . So the EIF for this exactly-identified parameter is
where is defined by Equation (B.2). It is straightforward to verify that satisfies . The optimal is chosen by minimizing the sandwich matrix . Thus, the EIF for the over-identified parameter is obtained when . Plugging this expression into , we obtain Equation (B.1). ∎
Note that if, for example, , then , and the efficiency bound shown above reduces to the one computed in Theorem 3.1. If , that is, if the treatment is unconfounded, then Theorem B.1 reduces to Theorem 1 in Cattaneo (2010).
For estimation, we use the EIFs to generate moment conditions and propose a three-step semiparametric GMM procedure. The criterion function is
(B.3)
Its probability limit is denoted as
(B.4)
where the expectation is taken with respect to the true parameters . The implementation procedure is as follows. Assume that we have nonparametric estimators and that consistently estimate and , respectively. We first find a consistent GMM estimator using the identity matrix as the weighting matrix, that is,
(B.5)
Next, we use this estimate to form a consistent estimator of the covariance matrix , where
Then we let be the optimally weighted GMM estimator:
To conduct inference, we estimate using the estimator whose elements are defined as
where we have implicitly assumed that the estimator is differentiable in its second argument.
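A schematic Python implementation of this three-step procedure follows; here `moments` is assumed to return the n-by-d matrix of per-observation EIF-based moment values with the nuisance estimates already plugged in, and a derivative-free optimizer is used since the moments may be non-smooth.

```python
import numpy as np
from scipy.optimize import minimize

def three_step_gmm(moments, beta_init):
    """Three-step semiparametric GMM sketch, per (B.3)-(B.5).
    moments(beta) -> (n, d) array of per-observation moment values."""
    def qn(beta, W):                     # GMM criterion, cf. (B.3)
        gbar = moments(beta).mean(axis=0)
        return gbar @ W @ gbar
    d = moments(np.asarray(beta_init)).shape[1]
    # Step 1: identity-weighted, consistent first-step estimator (B.5).
    step1 = minimize(qn, beta_init, args=(np.eye(d),), method="Nelder-Mead")
    # Step 2: estimate the optimal weight W = Omega^{-1}.
    Omega = np.atleast_2d(np.cov(moments(step1.x), rowvar=False))
    # Step 3: optimally weighted GMM.
    step2 = minimize(qn, step1.x, args=(np.linalg.pinv(Omega),),
                     method="Nelder-Mead")
    return step2.x
```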
In the following theorem, we derive the asymptotic properties of the GMM estimators. The main theoretical difficulty is that the random criterion function could potentially be discontinuous because we allow to be discontinuous. We use the theory developed in Chen et al. (2003) to overcome this problem.¹⁷ Cattaneo (2010) instead uses the theory from Pakes and Pollard (1989). However, the general theory of Chen et al. (2003) is more straightforward to apply in this case since they explicitly assume the presence of infinite-dimensional nuisance parameters, which can depend on the parameters to be estimated. Let be the function class that contains . Let be the function class that contains .
Theorem B.2.
Let the assumptions in Theorem B.1 hold. Assume the following conditions hold.
(i) The parameter space is compact. The true parameter is in the interior of .

(ii) For any and , there exists such that for sufficiently small,

(iii) Donsker properties:

where denotes the covering number of the space .

(iv) Convergence rates of the nonparametric estimators:

(v) The function is integrable. The estimator is consistent uniformly in its second argument, that is,
Then , , , and
where denotes a vector of zeros.
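Schematically (in generic notation, not the model-specific display), the conclusions are consistency of both GMM estimators and asymptotic normality of the optimally weighted one at the efficiency bound of Theorem B.1:
\[
  \sqrt{n}\,\big(\hat\beta^{*} - \beta_0\big) \;\xrightarrow{d}\; N\big(\mathbf{0},\, (\Gamma^{\top}V^{-1}\Gamma)^{-1}\big).
\]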
The following lemma is helpful for proving Theorem B.2.
Lemma B.3.
Under the assumptions of Theorem B.1, the class
is Donsker with a finite integrable envelope. The following stochastic equicontinuity condition holds: for any positive sequence ,
where the supremum is taken over , , and .
Proof of Lemma B.3.
We first verify that the moment condition satisfies Condition (3.2) of Theorem 3 in Chen et al. (2003) (hereafter CLK). In fact, when , the triangle inequality gives that
where we use const to denote a generic constant that may take different values at each appearance. The last inequality follows from assumption (ii). Similarly, we can verify that the remaining terms in also satisfy this condition. Therefore, is locally uniformly -continuous, that is,
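Schematically, this is the generic form of Condition (3.2) in CLK (the exact display is model-specific; here $m$ denotes the moment function, $\mathcal{H}$ the nuisance space, and $s \in (0,2]$ some exponent):
\[
  \mathbb{E}\Big[\sup_{\substack{\|\theta'-\theta\|<\delta \\ \|h'-h\|_{\mathcal H}<\delta}} \big\|m(W;\theta',h') - m(W;\theta,h)\big\|^{2}\Big] \;\le\; \mathrm{const}\times \delta^{s}.
\]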
Following the same steps as in the proof of Theorem 3 in CLK (p. 1607), we can show that the bracketing number of is bounded by
Therefore, the bracketing entropy of class is bounded by
Under the assumption that is compact and
we have that
This implies that is Donsker with a finite integrable envelope. Lastly, as stated in Lemma 1 of CLK, the asserted stochastic equicontinuity condition is implied by the fact that is Donsker and is -continuous. ∎
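The step from the bracketing-number bound to the Donsker conclusion uses the standard entropy-integral criterion, stated here generically (see, e.g., van der Vaart and Wellner (1996)): for a class $\mathcal{F}$ with a finite integrable envelope,
\[
  \int_0^1 \sqrt{\log N_{[\,]}\big(\varepsilon,\,\mathcal{F},\,L_2(P)\big)}\;d\varepsilon \;<\; \infty
  \quad\Longrightarrow\quad \mathcal{F}\ \text{is $P$-Donsker}.
\]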
Proof of Theorem B.2.
We follow the large sample theory in CLK and set , , , and .
We first use Theorem 1 in CLK to show the consistency of . Condition (1.2) in CLK is satisfied because is compact and because has a unique zero and is continuous by condition (ii) of Theorem B.1. As for Condition (1.3) of CLK, we can see from the expression of that it is continuous with respect to and (since is bounded away from zero), and the uniformity in follows from the fact that is bounded as a function of . Condition (1.4) of CLK is satisfied by the assumptions of Theorem B.2. The uniform stochastic equicontinuity condition (1.5) of CLK is implied by Lemma B.3. Therefore, .
We use Corollary 1 (which is based on Theorem 2) in CLK to show the consistency of and the asymptotic normality of . Condition (2.2) in CLK is verified by the assumptions of Theorem B.1. As in the proof of Proposition 4.2, we can show that the moment condition , based on the EIF, satisfies the Neyman orthogonality condition with respect to the nuisance parameters and . In fact, for any and , we let and . Then we have
where we have applied the law of iterated expectations and used the fact that
Thus, the path-wise derivative of with respect to is zero in any direction. Hence, Condition (2.3) of CLK is verified. Condition (2.4) in CLK follows directly from the assumptions of Theorem B.2. The stochastic equicontinuity condition (Condition (2.5) in CLK) follows from Lemma B.3. Lastly, Condition (2.6) in CLK is verified using the central limit theorem, since the path-wise derivative is zero. Due to the presence of , we also need the uniform convergence condition in Corollary 1 of CLK, which can be verified using Lemma B.3 and an application of Theorem 2.10.14 of van der Vaart and Wellner (1996).
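In generic notation (a schematic statement of the abstract condition, with $h_0$ the true nuisance functions and $h$ any admissible alternative; this is not the model-specific calculation), the orthogonality property just verified reads
\[
  \frac{\partial}{\partial r}\,
  \mathbb{E}\Big[\psi\big(W;\,\beta_0,\; h_0 + r\,(h - h_0)\big)\Big]\Big|_{r=0} \;=\; 0
  \qquad \text{for every admissible } h,
\]
that is, the moment condition is locally insensitive to first-stage estimation error.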
Lastly, to show the consistency of , we only need to show that
where the inequality follows from differentiation under the integral sign, which is valid under the last assumption of the theorem. The convergence in probability follows from the uniform convergence of and the consistency of . Therefore, the desired convergence results follow. ∎
References
- Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113(2), 231–263.
- Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014). Asymptotic efficiency of semiparametric two-step GMM. Review of Economic Studies 81(3), 919–943.
- Andrews, D. W., X. Cheng, and P. Guggenberger (2020). Generic results for establishing the asymptotic size of confidence sets and tests. Journal of Econometrics 218(2), 496–531.
- Andrews, I. and T. B. Armstrong (2017). Unbiased instrumental variables estimation under known first-stage sign. Quantitative Economics 8(2), 479–503.
- Angrist, J. D. and G. W. Imbens (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90(430), 431–442.
- Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91(434), 444–455.
- Belloni, A. and V. Chernozhukov (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39(1), 82–130.
- Belloni, A. and V. Chernozhukov (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547.
- Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and Adaptive Estimation for Semiparametric Models, Volume 4. Springer, New York.
- Blandhol, C., J. Bonney, M. Mogstad, and A. Torgovitsky (2022). When is TSLS actually LATE? University of Chicago, Becker Friedman Institute for Economics Working Paper (2022-16).
- Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
- Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155(2), 138–154.
- Chen, X., H. Hong, and A. Tarozzi (2004). Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects.
- Chen, X., H. Hong, and A. Tarozzi (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics 36(2), 808–843.
- Chen, X., O. Linton, and I. Van Keilegom (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica 71(5), 1591–1608.
- Chen, X. and A. Santos (2018). Overidentification in regular models. Econometrica 86(5), 1771–1817.
- Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1), C1–C68.
- Chernozhukov, V., J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
- Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and the Oregon Health Study Group (2012). The Oregon health insurance experiment: Evidence from the first year. The Quarterly Journal of Economics 127(3), 1057–1106.
- Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics 139(1), 35–75.
- Galindo, C. (2020). Empirical challenges of multivalued treatment effects. Technical report, Job market paper.
- Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66(2), 315–331.
- Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24(3), 726–748.
- Heckman, J. J. and R. Pinto (2018a). Unordered monotonicity. Econometrica 86(1), 1–35.
- Heckman, J. J. and R. Pinto (2018b). Web appendix for "Unordered monotonicity". Econometrica 86(1), 1–35.
- Hong, H. and D. Nekipelov (2010a). Semiparametric efficiency in nonlinear LATE models. Quantitative Economics 1(2), 279–304.
- Hong, H. and D. Nekipelov (2010b). Supplement to "Semiparametric efficiency in nonlinear LATE models". Quantitative Economics 1(2), 279–304.
- Ichimura, H. and W. K. Newey (2022). The influence function of semiparametric estimators. Quantitative Economics 13(1), 29–61.
- Imbens, G. W. and J. D. Angrist (1994). Identification and estimation of local average treatment effects. Econometrica 62(2), 467–475.
- Imbens, G. W. and C. F. Manski (2004). Confidence intervals for partially identified parameters. Econometrica 72(6), 1845–1857.
- Kitagawa, T. (2015). A test for instrument validity. Econometrica 83(5), 2043–2063.
- Kline, P. and C. R. Walters (2016). Evaluating public programs with close substitutes: The case of Head Start. The Quarterly Journal of Economics 131(4), 1795–1848.
- Lee, S. and B. Salanié (2018). Identifying effects of multivalued treatments. Econometrica 86(6), 1939–1963.
- Masry, E. (1996). Multivariate local polynomial regression for time series: Uniform strong consistency and rates. Journal of Time Series Analysis 17(6), 571–599.
- Mikusheva, A. (2007). Uniform inference in autoregressive models. Econometrica 75(5), 1411–1452.
- Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics 5(2), 99–135.
- Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica 62(6), 1349–1382.
- Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012). Doubly robust instrumental variable regression. Statistica Sinica, 173–205.
- Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57(5), 1027–1057.
- Pinto, R. (2021). Beyond intention to treat: Using the incentives in Moving to Opportunity to identify neighborhood effects. UCLA Working paper.
- Sun, Z. (2021). Instrument validity for heterogeneous causal effects. arXiv preprint arXiv:2009.01995.
- Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association 101(476), 1607–1618.
- van der Vaart, A. W. (1998). Asymptotic Statistics, Volume 3. Cambridge University Press.
- van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.
- Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica 70(1), 331–341.