
Efficient and Robust Estimation of the Generalized LATE Model

Haitian Xie  
Department of Economics, University of California San Diego
Email: [email protected]. The author is grateful to his advisors Graham Elliott and Yixiao Sun, who were gracious with their advice, support and feedback. The author also thanks Wei-Lin Chen, Yu-Chang Chen, and Kaspar Wüthrich for helpful suggestions and constructive comments. This paper was previously circulated under the title “Generalized Local IV with Unordered Multiple Treatment Levels: Identification, Efficient Estimation, and Testable Implication.” All remaining errors are my own.
Abstract

This paper studies the estimation of causal parameters in the generalized local average treatment effect (GLATE) model, a generalization of the classical LATE model that encompasses multi-valued treatments and instruments. We derive the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for two types of parameters: the local average structural function (LASF) and the local average structural function for the treated (LASF-T). The moment condition generated by the EIF satisfies two robustness properties: double robustness and Neyman orthogonality. Based on the robust moment condition, we propose double/debiased machine learning (DML) estimators for the LASF and LASF-T. The DML estimator is semiparametrically efficient and suitable for high-dimensional settings. We also propose null-restricted inference methods that are robust against weak identification. As an empirical application, we study the effects of different sources of health insurance by applying the developed methods to the Oregon Health Insurance Experiment.


Keywords: Causal Inference, Double Robustness, Efficient Influence Function, Multi-valued Treatment, Neyman Orthogonality, Oregon Health Insurance Experiment, Unordered Monotonicity, Weak Identification.

1 Introduction

Since the seminal works of Imbens and Angrist (1994) and Angrist et al. (1996), the local average treatment effect (LATE) model has become popular for causal inference in economics. Instead of imposing homogeneity of the treatment effects as in the classical instrumental variable (IV) regression model, the LATE framework allows the treatment effect to vary across individuals. Under the monotonicity condition, the average treatment effect can be identified for a subgroup of individuals whose treatment choice complies with the change in instrument levels.

The current form of the LATE model only accepts binary treatment variables. This restriction is inconvenient in many economic settings where the treatment is multi-leveled in nature. For example, parents select different preschool programs for their kids, schools assign students to different classroom sizes, families relocate to various neighborhoods in housing experiments, and people choose different sources of health insurance. To apply the LATE model to these settings, researchers often need to redefine the treatment so that there are only two treatment levels. However, merging the treatment levels can complicate the task of program evaluation and dampen the causal interpretation of the estimates. As pointed out by Kline and Walters (2016), if the original treatment levels are substitutes, then there is ambiguity regarding which causal parameters are of interest. After merging the treatment levels, the heterogeneity in the treatment effect across different treatment levels is lost.

This paper addresses the above issues by generalizing the LATE framework to directly incorporate the potential multiplicity of treatment levels. We call the new framework the generalized LATE (GLATE) model. The main assumption of the GLATE model is the unordered monotonicity assumption proposed by Heckman and Pinto (2018a), which generalizes the monotonicity assumption in the binary LATE model. (To distinguish it from the GLATE model, we sometimes use the terminology "binary LATE model" to refer to the LATE model studied by Imbens and Angrist (1994) and Abadie (2003).)

We generalize the identification results in Heckman and Pinto (2018a) to explicitly account for the presence of conditioning covariates, which is often important in practical settings. Recently, Blandhol et al. (2022) point out that linear TSLS, the common way to control for covariates in empirical studies, does not bear the LATE interpretation. The only specifications that have LATE interpretations are the ones that control for covariates nonparametrically. Therefore, it is essential from the causal analysis perspective to incorporate the covariates into the GLATE framework in a nonparametric way.

The causal parameters identifiable in the GLATE model include the local average structural function (LASF) and the local average structural function for the treated (LASF-T). The LASF is the mean potential outcome for specific subpopulations. These subpopulations are defined by their treatment choice behavior and generalize the concepts of always-takers, compliers, and never-takers in the binary LATE model. The LASF-T further restricts the subpopulation to exclude individuals who do not take up the treatment.

The paper is concerned with the econometric aspects of the GLATE model. The analysis begins by deriving the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for the identified parameters. The calculation is based on the method outlined in Chapter 3 of Bickel et al. (1993) and Newey (1990). We then verify that the conditional expectation projection (CEP) estimator (e.g., Chen et al., 2008), constructed directly from the identification result, achieves the SPEB and hence is semiparametrically efficient. Using these results, we can efficiently estimate other important parameters of interest by the plug-in method, since a standard delta-method argument preserves semiparametric efficiency.

The EIF not only facilitates the efficiency calculation but can also serve as the moment condition for estimation, because the EIF is mean zero by construction and equals the original identification result plus an adjustment term due to the presence of infinite-dimensional nuisance parameters. We show that the moment condition constructed from the EIF satisfies two related robustness properties: double robustness and Neyman orthogonality. Double robustness guarantees that the moment condition remains correctly specified in a parametric setting even when some of the nuisance parameters are misspecified.

The Neyman orthogonality condition means that the moment condition is locally insensitive to perturbations of the nuisance parameters. This condition is particularly useful when the conditioning covariates are high-dimensional. To further utilize this condition, we study the double/debiased machine learning (DML) estimator (Chernozhukov et al., 2018) in the GLATE setting. Under certain conditions on the convergence rates of the first-step nonparametric estimators, the DML estimator is asymptotically normal uniformly over a large class of data generating processes (DGPs).

The weak identification issue is a practical concern in the GLATE model, because both the treatment and the instrument are multi-valued, and hence the subpopulations on which the LASF and LASF-T are defined can be small. To deal with this issue, we propose null-restricted test statistics for one-sided and two-sided testing problems. This procedure generalizes the well-known Anderson-Rubin (AR) test. We show that the proposed tests are consistent and uniformly control size across a large class of DGPs, in which the size of the subpopulation mentioned above can be arbitrarily close to zero.

The paper is organized as follows. The remaining part of this section discusses the literature. Section 2 introduces the GLATE model and the nonparametric identification results. Section 3 calculates the EIF and SPEB. Section 4 discusses the robustness properties of the moment condition generated by the EIF. Section 5 proposes inference procedures under weak identification issues. Section 6 presents the empirical application. Section 7 concludes. The proofs for theoretical results in the main text are collected in Appendix A.

1.1 Literature Review

The GLATE model provides a way to conduct causal inference under endogeneity when the treatment is multi-valued and unordered. As mentioned above, the identification result (conditional on the covariates) is first established in Heckman and Pinto (2018a) by using the unordered monotonicity condition. Lee and Salanié (2018) propose another method of identification in a similar model of multi-valued treatment. Their method relies on continuous instruments, while the GLATE model is framed in terms of discrete instruments. When the treatment levels are ordered, Angrist and Imbens (1995) derive identification and estimation results for a causal parameter that is a weighted average of LATEs across different treatment levels.

The literature on semiparametric efficiency in program evaluation starts with the seminal work of Hahn (1998), which studies the benchmark case of estimating the average treatment effect (ATE) under unconfoundedness. For multi-level treatment, Cattaneo (2010) studies the efficient estimation of causal parameters implicitly defined through over-identified non-smooth moment conditions. In the case where unconfoundedness fails and instruments are present, Frölich (2007) calculates the SPEB for the LATE parameter, and Hong and Nekipelov (2010a) extend the analysis to parameters implicitly defined by moment restrictions. In a more general framework encompassing missing data, Chen et al. (2004) and Chen et al. (2008) study semiparametric efficiency bounds and efficient estimation of parameters defined through overidentifying moment restrictions. However, there is currently no theoretical research on semiparametrically efficient estimation in models that encompass endogeneity and unordered multiple treatment levels.

Several ways are available for calculating the EIF for semiparametric estimators, as illustrated by Newey (1990) and Ichimura and Newey (2022). Semiparametric efficiency calculations can be used to construct robust (Neyman orthogonal) moment conditions. This method is illustrated in Newey (1994) and Chernozhukov et al. (2016). Based on the Neyman orthogonality condition, Chernozhukov et al. (2018) introduces the DML method that suits high dimensional settings. This is because Donsker properties and stochastic equicontinuity conditions are no longer required in deriving the asymptotic distribution of the semiparametric estimator.

For testing the GLATE model, Sun (2021) proposes a bootstrap test which is the generalization and improvement of the test studied by Kitagawa (2015) in the binary LATE model.

The GLATE model has received attention in the recent empirical literature due to its ability to model multi-valued treatment. Kline and Walters (2016) evaluate the cost-effectiveness of Head Start, classifying Head Start and other preschool programs as different treatment levels against the control group of no preschool. Galindo (2020) assesses the impact of different childcare choices in Colombia on children's development. Pinto (2021) studies the neighborhood effects and voucher effects in housing allocations using data from the Moving to Opportunity experiment. Our theoretical analysis of the GLATE model presents important tools for estimation and inference that can be applied to those empirical settings.

2 Identification in the GLATE Model

This section describes the generalized local average treatment effect (GLATE) model, discusses identification of the local average structural function (LASF) and other parameters, and introduces the notation.

2.1 The model

We assume a finite collection of instrument values $\mathcal{Z}=\{z_1,\cdots,z_{N_Z}\}$ and a finite collection of treatment values $\mathcal{T}=\{t_1,\cdots,t_{N_T}\}$, where $N_Z$ and $N_T$ are respectively the total numbers of instrument and treatment levels. The sets $\mathcal{T}$ and $\mathcal{Z}$ are categorical and unordered. The instrumental variable $Z$ denotes which of the $N_Z$ instrument levels is realized. The random variables $T_{z_1},\cdots,T_{z_{N_Z}}$, each taking values in $\mathcal{T}$, denote the collection of potential treatments under each instrument status. Thus, the observed treatment level is the random variable $T=T_Z=\sum_{z\in\mathcal{Z}}\mathbf{1}\{Z=z\}T_z$. For each given treatment level $t\in\mathcal{T}$, there is a potential outcome $Y_t\in\mathcal{Y}\subset\mathbb{R}$. The observed outcome is denoted by $Y=Y_T=\sum_{t\in\mathcal{T}}\mathbf{1}\{T=t\}Y_t$. The random vector $X\in\mathcal{X}\subset\mathbb{R}^{d_X}$ contains the covariates. The observed data is a random sample $(Y_i,T_i,Z_i,X_i)$, $1\leq i\leq n$.

The description above establishes a random sampling model in which the researcher observes only one potential outcome, the one associated with the observed treatment. This implies that a sample of $Y$, observed from an individual with treatment $T=t$, comes from the conditional distribution of $Y_t$ given $T=t$ rather than from the marginal distribution of $Y_t$. In general, this fact leads to identification issues and presents challenges for causal inference. To overcome these problems, we impose further structure on the model.

Assumption 1 (Conditional Independence).

$(\{Y_t\colon t\in\mathcal{T}\},\{T_z\colon z\in\mathcal{Z}\})\perp Z\mid X$.

Assumption 2 (Unordered Monotonicity).

For any $t\in\mathcal{T}$ and $z,z^{\prime}\in\mathcal{Z}$, either

$\mathbb{P}(\mathbf{1}\{T_z=t\}\geq\mathbf{1}\{T_{z^{\prime}}=t\}\mid X)=1$

or

$\mathbb{P}(\mathbf{1}\{T_z=t\}\leq\mathbf{1}\{T_{z^{\prime}}=t\}\mid X)=1.$

Assumptions 1 and 2 provide the multi-valued analog of Assumption 2.1 in Abadie (2003). Assumption 1 requires that the instrument $Z$ is independent of the potential treatments and outcomes once we condition on $X$. Assumption 2 is the conditional version of the unordered monotonicity condition proposed by Heckman and Pinto (2018a). It means that when we focus on a particular treatment level $t$ and a pair $(z,z^{\prime})$ of instrument values, the implied binary environment satisfies the usual monotonicity constraint of the LATE model. Specifically, the unordered monotonicity condition requires that a shift in the instrument moves all agents uniformly toward or against each possible treatment value. (As pointed out by Vytlacil (2002), the LATE monotonicity condition is a restriction across individuals on the relationship between different hypothetical treatment choices defined in terms of an instrument.)

We define the type $S$ of an individual as the vector of potential treatments, that is,

$S=(T_{z_1},\cdots,T_{z_{N_Z}})^{\prime}.$

By construction, $S$ is not observed. Assumption 2, the unordered monotonicity condition, is essentially a restriction on $\mathcal{S}\equiv\mathrm{supp}(S)$, the support of $S$. Denote the elements of $\mathcal{S}$ by $s_1,\cdots,s_{N_S}$, where $N_S$ is the cardinality of $\mathcal{S}$. A convenient way to characterize $\mathcal{S}$ is the $N_Z\times N_S$ matrix $R\equiv(s_1,\cdots,s_{N_S})$. The matrix $R$ is referred to as the response matrix since it describes how each type's treatment choice responds to the instrument.

The role of $S$ is to assist the identification of counterfactual outcomes by dividing the population into a finite number of groups, within which identification can be achieved. Those groups are defined as follows. For $k=0,\cdots,N_Z$, let $\Sigma_{t,k}$ be the set of types in which the treatment level $t$ appears exactly $k$ times. That is,

$\Sigma_{t,k}\equiv\{s\in\mathcal{S}\colon\textstyle\sum_{i=1}^{N_Z}\mathbf{1}\{s[i]=t\}=k\},$

where $s[i]$ denotes the $i$th element of the vector $s$. In particular, the collection $\Sigma_{t,k}$, $k=0,\cdots,N_Z$, forms a partition of $\mathcal{S}$.
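The partition can be computed mechanically from any response matrix. The following Python sketch is illustrative (the response matrix, with three treatment levels coded as strings and two instrument rows, is a hypothetical example):

```python
import numpy as np

# Illustrative sketch: computing the partition Sigma_{t,k} from a response
# matrix R (rows = instrument values, columns = types). The matrix below is
# a hypothetical example with three treatment levels and two instruments.
R = np.array([["t1", "t2", "t3", "t1", "t2"],
              ["t1", "t2", "t3", "t3", "t3"]])

def type_sets(R, t):
    """Return {k: column indices j with s_j in Sigma_{t,k}}."""
    counts = (R == t).sum(axis=0)  # number of times level t appears per type
    return {k: [j for j in range(R.shape[1]) if counts[j] == k]
            for k in range(R.shape[0] + 1)}

# For t = "t1": Sigma_{t1,2} = {s_1}, Sigma_{t1,1} = {s_4},
# and Sigma_{t1,0} = {s_2, s_3, s_5} (0-based column indices in the output).
```

Each type then belongs to exactly one $\Sigma_{t,k}$ for each $t$, which is the partition property stated above.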

For individuals whose type $S$ lies in the same type set $\Sigma_{t,k}$, the treatment response in terms of $T=t$ is homogeneous. Intuitively, this makes it easier to identify the marginal distribution of the potential outcome $Y_t$ within each $\Sigma_{t,k}$. More specifically, we define the local average structural function (LASF) and the local average structural function for the treated (LASF-T) as follows:

LASF: $\beta_{t,k}\equiv\mathbb{E}[Y_t\mid S\in\Sigma_{t,k}]$,
LASF-T: $\gamma_{t,k}\equiv\mathbb{E}[Y_t\mid S\in\Sigma_{t,k},T=t]$.

Before presenting the identification results for the above two classes of parameters, we illustrate the GLATE model in the following two examples.

Example 1 (Binary LATE model).

In the binary LATE model of Imbens and Angrist (1994), there are two treatment levels $\mathcal{T}=\{0,1\}$ and two instrument levels $\mathcal{Z}=\{0,1\}$. There are three types, $\mathcal{S}=\{s_1=(0,0)^{\prime},s_2=(0,1)^{\prime},s_3=(1,1)^{\prime}\}$, which are referred to in the literature as never-takers, compliers, and always-takers, respectively. The type set $\Sigma_{1,0}=\{s_1\}$ contains the never-takers, $\Sigma_{1,1}=\{s_2\}$ the compliers, and $\Sigma_{1,2}=\{s_3\}$ the always-takers. The response matrix is the binary matrix

$R=(s_1,s_2,s_3)=\begin{pmatrix}0&0&1\\ 0&1&1\end{pmatrix}.$

The local average treatment effect is the treatment effect for the compliers, which can be written as the difference between two LASFs:

$\mathbb{E}[Y_1-Y_0\mid S=\text{compliers}]=\mathbb{E}[Y_1-Y_0\mid T_1>T_0]=\beta_{1,1}-\beta_{0,1}.$
Example 2 (Three treatment levels and two instrument levels).

The simplest GLATE model (excluding the binary case in Example 1) has three treatment levels $\mathcal{T}=\{t_1,t_2,t_3\}$ and two instrument levels $\mathcal{Z}=\{z_1,z_2\}$. There are five types, specified as the columns of the following response matrix:

$R=(s_1,s_2,s_3,s_4,s_5)=\begin{pmatrix}t_1&t_2&t_3&t_1&t_2\\ t_1&t_2&t_3&t_3&t_3\end{pmatrix}.$

In this example, a shift from $z_1$ to $z_2$ moves all agents uniformly toward the treatment level $t_3$. The type set $\Sigma_{t_1,2}=\{s_1\}$ contains the type that always chooses treatment $t_1$ and thus can be referred to as the $t_1$-always-taker. The same applies to $\Sigma_{t_2,2}=\{s_2\}$ and $\Sigma_{t_3,2}=\{s_3\}$. The type set $\Sigma_{t_1,1}=\{s_4\}$ switches from $t_1$ to $t_3$ and hence can be considered the $t_1$-switcher (or $t_1$-complier). Similarly, we can refer to $\Sigma_{t_2,1}=\{s_5\}$ as the $t_2$-switcher and $\Sigma_{t_3,1}=\{s_4,s_5\}$ as the $t_3$-switchers. This model is used in Kline and Walters (2016) to study the causal effect of the Head Start preschool program. The instrument indicates whether the household receives a Head Start offer, and the treatment levels are $t_1=$ Head Start, $t_2=$ other preschool programs, and $t_3=$ no preschool. The unordered monotonicity condition means that anyone who changes behavior as a result of the Head Start offer does so to attend Head Start.

2.2 Identification Results

We introduce some matrix notation related to the type $S$. For each treatment level $t\in\mathcal{T}$, let $B_t$ be a binary matrix of the same dimension as the response matrix $R$, with each element of $B_t$ indicating whether the corresponding element of the response matrix equals $t$. That is, $B_t[i,j]$, the $(i,j)$th element of $B_t$, indicates whether $T_{z_i}$ equals $t$ for the subpopulation $S=s_j$. Define $b_{t,k}\equiv\left(\mathbf{1}\{s_1\in\Sigma_{t,k}\},\cdots,\mathbf{1}\{s_{N_S}\in\Sigma_{t,k}\}\right)B_t^{+}$, where $B_t^{+}$ is the Moore-Penrose inverse of $B_t$.

For convenience, we also need some notation regarding conditional expectations. Let

$\pi(X)\equiv(\pi_{z_1}(X),\cdots,\pi_{z_{N_Z}}(X))^{\prime}$ with $\pi_z(X)\equiv P(Z=z\mid X)$

be the vector of functions that describes the conditional distribution of the instrument $Z$. For each treatment level $t\in\mathcal{T}$, let

$P_t(X)\equiv(P_{t,z_1}(X),\cdots,P_{t,z_{N_Z}}(X))^{\prime}$ with $P_{t,z}(X)\equiv P(T=t\mid Z=z,X)$

be the vector that describes the conditional treatment probabilities given each level of the instrument. Denote

$Q_t(X)\equiv(Q_{t,z_1}(X),\cdots,Q_{t,z_{N_Z}}(X))^{\prime}$ with $Q_{t,z}(X)\equiv\mathbb{E}[Y\mathbf{1}\{T=t\}\mid Z=z,X]$

as the vector that contains the conditional outcomes for each treatment level $t$. Notice that the functions $\pi$, $P_t$, and $Q_t$ are all identified.

Theorem 2.1 (Identification of LASF).

Let Assumptions 1 and 2 hold. Let $t\in\mathcal{T}$ and $k\in\{1,\cdots,N_Z\}$.

  (i) The type set probability is identified by

  $p_{t,k}\equiv\mathbb{P}(S\in\Sigma_{t,k})=b_{t,k}\mathbb{E}\left[P_t(X)\right].$

  (ii) If $p_{t,k}>0$, the LASF is identified by

  $\beta_{t,k}=b_{t,k}\mathbb{E}\left[Q_t(X)\right]/p_{t,k}.$

Theorem 2.1 identifies $p_{t,k}$, the size of the subpopulation $\Sigma_{t,k}$, and the local average structural function for that subpopulation. Identification fails only for the type set $\Sigma_{t,0}$, in which individuals never choose the treatment $t$. This identification result is a modification of Theorem T-6 in Heckman and Pinto (2018a) that explicitly accounts for the presence of covariates $X$; Bayes' rule is applied to convert the conditional result into the unconditional one. The following theorem presents the identification result for the LASF-T.
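The sample analog of Theorem 2.1 is straightforward once the nuisance functions are fitted. The sketch below is illustrative: `P_hat` and `Q_hat` stand for user-supplied fitted values of $P_t(X_i)$ and $Q_t(X_i)$ (obtained by any first-step method), and `b_tk` is the known weight vector computed from the response matrix.

```python
import numpy as np

# Illustrative sample analog of Theorem 2.1. P_hat and Q_hat are (n, N_Z)
# arrays of fitted values \hat{P}_{t,z}(X_i) and \hat{Q}_{t,z}(X_i); b_tk is
# the known length-N_Z weight vector derived from the response matrix.
def lasf_plugin(b_tk, P_hat, Q_hat):
    """Plug-in estimates of p_{t,k} and beta_{t,k}."""
    p_tk = float(b_tk @ P_hat.mean(axis=0))            # b_{t,k} E[P_t(X)]
    beta_tk = float(b_tk @ Q_hat.mean(axis=0)) / p_tk  # b_{t,k} E[Q_t(X)] / p_{t,k}
    return p_tk, beta_tk
```

The analog of Theorem 2.2 follows the same pattern with the additional weighting by $\pi_{t,k}(X_i)$.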

Let $\mathcal{Z}_{t,k}\subset\mathcal{Z}$ be the set of instrument values that induce the treatment level $t$ in the type set $\Sigma_{t,k}$. That is, $\mathcal{Z}_{t,k}\equiv\left\{z_i\in\mathcal{Z}\colon s[i]=t\text{ for all }s\in\Sigma_{t,k}\right\}$, where $s[i]$ denotes the $i$th element of the vector $s$. Then define $\pi_{t,k}\equiv\sum_{z\in\mathcal{Z}_{t,k}}\pi_z$ as the total conditional probability of those instrument values.

Theorem 2.2 (Identification of LASF-T).

Let Assumptions 1 and 2 hold. Let $t\in\mathcal{T}$ and $k\in\{1,\cdots,N_Z\}$. Then $\mathcal{Z}_{t,k}$ is nonempty.

  (i) The treatment probability within the type set is identified by

  $q_{t,k}\equiv P\left(T=t,S\in\Sigma_{t,k}\right)=b_{t,k}\mathbb{E}\left[P_t(X)\pi_{t,k}(X)\right].$

  (ii) If $q_{t,k}>0$, then the LASF-T is identified by

  $\gamma_{t,k}=b_{t,k}\mathbb{E}\left[Q_t(X)\pi_{t,k}(X)\right]/q_{t,k}.$

The identification results are illustrated using the two examples.

Example 3 (Example 1, continued).

Since the treatment is binary, the matrix $B_1$ is equal to the response matrix $R$. The matrix $B_1$ and its Moore-Penrose inverse $B_1^{+}$ are respectively

$B_1=\begin{pmatrix}0&0&1\\ 0&1&1\end{pmatrix},\text{ and }(B_1^{+})^{\prime}=\begin{pmatrix}0&-1&1\\ 0&1&0\end{pmatrix}.$

The matrix $B_0$ and its Moore-Penrose inverse $B_0^{+}$ are respectively

$B_0=\begin{pmatrix}1&1&0\\ 1&0&0\end{pmatrix},\text{ and }(B_0^{+})^{\prime}=\begin{pmatrix}0&1&0\\ 1&-1&0\end{pmatrix}.$

The vectors $b_{1,1}$ and $b_{0,1}$ are respectively

$b_{1,1}=(-1,1),\text{ and }b_{0,1}=(1,-1).$
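These matrices can be checked numerically; a minimal sketch using numpy's Moore-Penrose inverse (the Moore-Penrose inverse is unique, so `np.linalg.pinv` must reproduce the displayed matrices up to floating-point error):

```python
import numpy as np

# Numerical check of the displayed matrices in the binary LATE example.
B1 = np.array([[0, 0, 1],
               [0, 1, 1]])
B0 = np.array([[1, 1, 0],
               [1, 0, 0]])

B1_pinv = np.linalg.pinv(B1)  # Moore-Penrose inverse, a 3 x 2 matrix
B0_pinv = np.linalg.pinv(B0)

# b_{t,k} = (indicator vector of Sigma_{t,k}) times B_t^+;
# Sigma_{1,1} = {s_2} and Sigma_{0,1} = {s_2} both select the compliers.
b_11 = np.array([0, 1, 0]) @ B1_pinv  # equals (-1, 1)
b_01 = np.array([0, 1, 0]) @ B0_pinv  # equals (1, -1)
```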

Theorem 2.1 implies that

$\beta_{1,1}=\frac{\mathbb{E}[Q_{1,1}(X)]-\mathbb{E}[Q_{1,0}(X)]}{\mathbb{E}[P_{1,1}(X)]-\mathbb{E}[P_{1,0}(X)]},\text{ and }\beta_{0,1}=\frac{\mathbb{E}[Q_{0,0}(X)]-\mathbb{E}[Q_{0,1}(X)]}{\mathbb{E}[P_{0,0}(X)]-\mathbb{E}[P_{0,1}(X)]}.$

The two denominators in the above expressions both equal the type probability of the compliers. The usual identification of the LATE parameter (e.g., Frölich, 2007) then follows:

$\mathbb{E}[Y_1-Y_0\mid T_1>T_0]=\frac{\int(\mathbb{E}[Y\mid Z=1,X=x]-\mathbb{E}[Y\mid Z=0,X=x])f_X(x)\,dx}{\int(\mathbb{E}[T\mid Z=1,X=x]-\mathbb{E}[T\mid Z=0,X=x])f_X(x)\,dx},$

where $f_X$ denotes the marginal density function of $X$.

Example 4 (Example 2, continued).

Recall that $\Sigma_{t_1,1}=\{s_4\}$ contains the $t_1$-switcher. By Theorem 2.1, the type probability and the LASF for the treatment level $t_1$ and the subpopulation $S=s_4$ are identified by (the calculation of $b_{t,k}$ is omitted for brevity, but it can be done in the same way as in Example 1)

$p_{t_1,1}=\mathbb{E}[P_{t_1,z_1}(X)]-\mathbb{E}[P_{t_1,z_2}(X)],$
$\beta_{t_1,1}=\frac{\mathbb{E}[Q_{t_1,z_1}(X)]-\mathbb{E}[Q_{t_1,z_2}(X)]}{\mathbb{E}[P_{t_1,z_1}(X)]-\mathbb{E}[P_{t_1,z_2}(X)]}.$

Notice that $\mathcal{Z}_{t_1,1}=\{z_1\}$. Then by Theorem 2.2 we have

$q_{t_1,1}=\mathbb{E}[(P_{t_1,z_1}(X)-P_{t_1,z_2}(X))\pi_{z_1}(X)],$
$\gamma_{t_1,1}=\frac{\mathbb{E}[(Q_{t_1,z_1}(X)-Q_{t_1,z_2}(X))\pi_{z_1}(X)]}{\mathbb{E}[(P_{t_1,z_1}(X)-P_{t_1,z_2}(X))\pi_{z_1}(X)]}.$

3 Semiparametric Efficiency

In this section, we calculate the semiparametric efficiency bound (SPEB) and propose estimators that achieve such bounds. We focus on the parameters LASF and LASF-T. In Appendix B, we study general parameters implicitly defined through moment restrictions.

3.1 LASF and LASF-T

For the rest of the paper, we assume that $Y_t$, $t\in\mathcal{T}$, have finite second moments; this is necessary since we are studying efficiency. Let $\iota$ denote the column vector of ones and $\zeta(Z,X,\pi)$ the diagonal matrix with diagonal elements $\mathbf{1}\{Z=z\}/\pi_z(X)$, $z\in\mathcal{Z}$. The following theorem gives the efficient influence function (EIF) and the SPEB for the parameters identified in the preceding section.

Theorem 3.1 (SPEB for LASF and LASF-T).

Let Assumptions 1 and 2 hold. Let $t\in\mathcal{T}$ and $k\in\{1,\cdots,N_Z\}$. Assume that $p_{t,k},q_{t,k}>0$.

  (i) The semiparametric efficiency bound for $\beta_{t,k}$ is given by the variance of the efficient influence function

  $\psi^{\beta_{t,k}}(Y,T,Z,X,\beta_{t,k},p_{t,k},Q_t,P_t,\pi)
  =\frac{1}{p_{t,k}}b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota(Y\mathbf{1}\{T=t\})-Q_t(X)\right)+Q_t(X)\right)
  -\frac{\beta_{t,k}}{p_{t,k}}b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_t(X)\right)+P_t(X)\right).$ (2)

  (ii) The semiparametric efficiency bound for $\gamma_{t,k}$ is given by the variance of the efficient influence function

  $\psi^{\gamma_{t,k}}(Y,T,Z,X,\gamma_{t,k},q_{t,k},Q_t,P_t,\pi)
  =\frac{1}{q_{t,k}}b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota(Y\mathbf{1}\{T=t\})-Q_t(X)\right)\pi_{t,k}(X)+Q_t(X)\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}\right)
  -\frac{\gamma_{t,k}}{q_{t,k}}b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_t(X)\right)\pi_{t,k}(X)+P_t(X)\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}\right).$

  (iii) The semiparametric efficiency bound for $p_{t,k}$ is given by the variance of the efficient influence function

  $\psi^{p_{t,k}}(T,Z,X,p_{t,k},P_t,\pi)=b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_t(X)\right)+P_t(X)\right)-p_{t,k}.$

  (iv) The semiparametric efficiency bound for $q_{t,k}$ is given by the variance of the efficient influence function

  $\psi^{q_{t,k}}(T,Z,X,q_{t,k},P_t,\pi)=b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_t(X)\right)\pi_{t,k}(X)+P_t(X)\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}\right)-q_{t,k}.$

The EIF in Theorem 3.1 can be interpreted as the moment condition from the identification results, modified by an adjustment term due to the presence of unknown infinite-dimensional parameters. Taking $\psi^{\beta_{t,k}}$ as an example, the terms

$b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota(Y\mathbf{1}\{T=t\})-Q_t(X)\right)\right)/p_{t,k}$

and

$\beta_{t,k}b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_t(X)\right)\right)/p_{t,k}$

are respectively the adjustment terms due to the presence of $Q_t$ and $P_t$.
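To make the structure concrete, the following sketch evaluates $\psi^{\beta_{t,k}}$ at a single observation from fitted nuisance values. The function and argument names are illustrative, not from the paper, and the nuisance fits are taken as given:

```python
import numpy as np

# Illustrative evaluation of the EIF psi^{beta_{t,k}} at one observation.
# Q_x, P_x, pi_x are length-N_Z arrays of fitted Q_t(X), P_t(X), pi(X);
# z_idx is the index of the observed instrument value in the support of Z.
def eif_beta(y, treat_is_t, z_idx, b_tk, beta, p_tk, Q_x, P_x, pi_x):
    n_z = len(pi_x)
    zeta = np.zeros((n_z, n_z))
    zeta[z_idx, z_idx] = 1.0 / pi_x[z_idx]   # diagonal matrix 1{Z=z}/pi_z(X)
    d = float(treat_is_t)                    # 1{T = t}
    iota = np.ones(n_z)
    term_q = b_tk @ (zeta @ (iota * (y * d) - Q_x) + Q_x)  # Q-part of the EIF
    term_p = b_tk @ (zeta @ (iota * d - P_x) + P_x)        # P-part of the EIF
    return float(term_q - beta * term_p) / p_tk
```

Averaging this function over the sample (with cross-fitted nuisance estimates) gives the orthogonal moment used by the DML estimator in Section 4.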

From the expression of $\psi^{\beta_{t,k}}$, we can see that the SPEB is large when $p_{t,k}$ is small. This is because $p_{t,k}$ measures the size of the subpopulation $S\in\Sigma_{t,k}$ on which the LASF is defined. When $p_{t,k}$ is small, we run into the weak identification issue. In Section 5, we study inference procedures that are robust against weak identification.

One benefit of the EIFs is that we can easily calculate the covariance of different estimators. Consider an example in which we are interested in two LASFs $\beta_1$ and $\beta_2$, whose EIFs are $\psi_1$ and $\psi_2$, respectively. If the two estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ are both semiparametrically efficient, then their asymptotic covariance equals $\mathbb{E}[\psi_1\psi_2^{\prime}]$.

Example 5 (Example 1, continued).

In the binary LATE model, the first two parts of Theorem 3.1 reduce to Theorem 2 of Hong and Nekipelov (2010a). If we further assume unconfoundedness by setting $T=Z$, then the result reduces to Theorem 1 of Hahn (1998).

The derived SPEB helps determine whether an estimation procedure is efficient. In this section, we focus on the conditional expectation projection (CEP) estimator. (The terminology "conditional expectation projection" is adopted from Chen et al. (2008) and Hong and Nekipelov (2010a), whereas Hahn (1998) refers to these estimators as "nonparametric imputation based estimators.") Define

$h_{Y,t,z}(X)=\mathbb{E}\left[\mathbf{1}\{Z=z\}Y\mathbf{1}\{T=t\}\mid X\right]\text{ and }h_{t,z}(X)=\mathbb{E}\left[\mathbf{1}\{Z=z\}\mathbf{1}\{T=t\}\mid X\right].$

The CEP procedure first estimates πz\pi_{z}, hY,t,zh_{Y,t,z}, and ht,zh_{t,z} by using nonparametric estimators π^z\hat{\pi}_{z}, h^Y,t,z\hat{h}_{Y,t,z}, and h^t,z\hat{h}_{t,z} respectively. These estimators can be constructed based on series or local polynomial estimation. Then Qt,zQ_{t,z} and Pt,zP_{t,z} are estimated using Q^t,z=h^Y,t,z/π^z\hat{Q}_{t,z}=\hat{h}_{Y,t,z}/\hat{\pi}_{z} and P^t,z=h^t,z/π^z\hat{P}_{t,z}=\hat{h}_{t,z}/\hat{\pi}_{z}. The vectors of estimators Q^t\hat{Q}_{t}, P^t\hat{P}_{t}, and π^\hat{\pi} are stacked in the obvious way. Let π^t,k=z𝒵t,kπ^z\hat{\pi}_{{t,k}}=\sum_{z\in\mathcal{Z}_{t,k}}\hat{\pi}_{z}. The CEP estimators for the structural parameters are defined by

p^t,k\displaystyle\hat{p}_{t,k} =1ni=1nbt,kP^t(Xi),\displaystyle=\frac{1}{n}\sum_{i=1}^{n}b_{t,k}\hat{P}_{t}(X_{i}), q^t,k=1ni=1nbt,kP^t(Xi)π^t,k(Xi),\displaystyle\hat{q}_{t,k}=\frac{1}{n}\sum_{i=1}^{n}b_{t,k}\hat{P}_{t}(X_{i})\hat{\pi}_{{t,k}}(X_{i}),
β^t,k\displaystyle\hat{\beta}_{t,k} =1p^t,k1ni=1nbt,kQ^t(Xi),\displaystyle=\frac{1}{\hat{p}_{t,k}}\frac{1}{n}\sum_{i=1}^{n}b_{t,k}\hat{Q}_{t}(X_{i}), γ^t,k=1q^t,k1ni=1nbt,kQ^t(Xi)π^t,k(Xi).\displaystyle\hat{\gamma}_{t,k}=\frac{1}{\hat{q}_{t,k}}\frac{1}{n}\sum_{i=1}^{n}b_{t,k}\hat{Q}_{t}(X_{i})\hat{\pi}_{{t,k}}(X_{i}).
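To make the second (averaging) step of the CEP procedure concrete, here is a minimal numerical sketch. It assumes the first-step nonparametric estimates bt,kP^t(Xi)b_{t,k}\hat{P}_{t}(X_{i}), bt,kQ^t(Xi)b_{t,k}\hat{Q}_{t}(X_{i}), and π^t,k(Xi)\hat{\pi}_{t,k}(X_{i}) have already been computed and stored as arrays; all function and argument names are illustrative, not part of the paper.

```python
import numpy as np

def cep_estimates(bP, bQ, pi_tk):
    """Second-step CEP averages, given plugged-in nuisance estimates.

    bP[i]    : b_{t,k} * P_hat_t(X_i)  (illustrative placeholder)
    bQ[i]    : b_{t,k} * Q_hat_t(X_i)
    pi_tk[i] : pi_hat_{t,k}(X_i)
    """
    p_hat = bP.mean()                        # p_hat_{t,k}
    q_hat = (bP * pi_tk).mean()              # q_hat_{t,k}
    beta_hat = bQ.mean() / p_hat             # beta_hat_{t,k} (LASF)
    gamma_hat = (bQ * pi_tk).mean() / q_hat  # gamma_hat_{t,k} (LASF-T)
    return p_hat, q_hat, beta_hat, gamma_hat
```

The first step (estimating the conditional expectations) is where the series, local polynomial, or machine learning method enters; the sketch above only reflects the sample-average structure of the display.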

The next proposition shows that the CEP estimators are semiparametrically efficient. The result is similar in style to Hahn’s (1998) Proposition 4 in that the low-level regularity conditions are omitted. Instead, the proposition assumes the high-level condition that the CEP estimators are asymptotically linear, which means they are asymptotically equivalent to sample averages. More formally, an estimator β^\hat{\beta} of β\beta is asymptotically linear if it admits an influence function. That is, there exists an iid sequence ψi\psi_{i} with zero mean and finite variance such that

n(β^β)=1ni=1nψi+op(1).\displaystyle\sqrt{n}(\hat{\beta}-\beta)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_{i}+o_{p}(1).

Since each element of the conditional expectations hY,t,zh_{Y,t,z}, ht,zh_{t,z}, and πz\pi_{z} can be considered as coming from a binary LATE model, the regularity conditions in Hong and Nekipelov (2010b) should work with little modification.

Proposition 3.2.

If the CEP estimators are asymptotically linear, then they achieve the semiparametric efficiency bound.

The reason that this type of estimator is efficient is well explained in Ackerberg et al. (2014). The estimation problem here falls into their general semiparametric model, where the finite-dimensional parameter of interest is defined by unconditional moment restrictions. They show that the semiparametric two-step optimally weighted GMM estimators, the CEP estimators in this case, achieve the efficiency bound since the parameters of interest are exactly identified. Discussions related to this phenomenon can also be found in Chen and Santos (2018).

We next examine the efficient estimation of other policy-relevant parameters that can be derived from the parameters (βt,k,γt,k,pt,k,qt,k)\left(\beta_{t,k},\gamma_{t,k},p_{t,k},q_{t,k}\right). As an example, consider the type set Σtk=1NZ1Σt,k\Sigma_{t}\equiv\cup_{k=1}^{N_{Z}-1}\Sigma_{t,k}, which is referred to as tt-switchers. This subpopulation contains individuals who switch between tt and other treatments when given different levels of the instrument. It is a generalization of the concept of compliers in the binary LATE framework.555Recall that switchers are also illustrated in Example 2. The LASF for the subpopulation Σt\Sigma_{t} is given by

βt𝔼[YtSΣt]=k=1NZ1βt,kpt,kk=1NZ1pt,k.\displaystyle\beta_{t}\equiv\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t}\right]=\frac{\sum_{k=1}^{N_{Z}-1}\beta_{t,k}p_{t,k}}{\sum_{k=1}^{N_{Z}-1}p_{t,k}}.

Similarly, one can also define

γt=𝔼[YtT=t,SΣt]=k=1NZ1γt,kpt,kk=1NZ1pt,k,\gamma_{t}=\mathbb{E}\left[Y_{t}\mid T=t,S\in\Sigma_{t}\right]=\frac{\sum_{k=1}^{N_{Z}-1}\gamma_{t,k}p_{t,k}}{\sum_{k=1}^{N_{Z}-1}p_{t,k}}, (3)

which represents the LASF-T for the subpopulation of tt-treated tt-switchers.
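Both βt\beta_{t} and γt\gamma_{t} are probability-weighted averages of the group-level parameters. A minimal sketch of this aggregation step, with illustrative names (the inputs stand in for the estimated βt,k\beta_{t,k}'s and pt,kp_{t,k}'s):

```python
import numpy as np

def aggregate_lasf(betas, ps):
    """p-weighted average of group-level LASFs, as in the display
    defining beta_t (and analogously gamma_t in (3))."""
    betas, ps = np.asarray(betas), np.asarray(ps)
    return np.sum(betas * ps) / np.sum(ps)
```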

For some subpopulations, a treatment effect can be identified. This point was already illustrated with Example 2 in the discussion of the identification of the usual LATE parameter; we develop it further in the continuation of that example below.

Example 6 (continues = eg:3t2z).

The quantity

βt3,1βt1,1pt1,1+βt2,1pt2,1pt1,1+pt2,1\displaystyle\beta_{t_{3},1}-\frac{\beta_{t_{1},1}p_{t_{1},1}+\beta_{t_{2},1}p_{t_{2},1}}{p_{t_{1},1}+p_{t_{2},1}}

represents the local average treatment effect of t3t_{3} against other treatments within the subpopulation of t3t_{3}-switchers. Analogously, the parameter

γt3,1γt3,t1,1qt3,t1,1+γt3,t2,1qt3,t2,1qt3,t1,1+qt3,t2,1\displaystyle\gamma_{t_{3},1}-\frac{\gamma_{t_{3},t_{1},1}q_{t_{3},t_{1},1}+\gamma_{t_{3},t_{2},1}q_{t_{3},t_{2},1}}{q_{t_{3},t_{1},1}+q_{t_{3},t_{2},1}}

is the local average treatment effect of t3t_{3} against other treatments within the subpopulation of t3t_{3}-treated t3t_{3}-switchers.

To summarize the above examples using a general expression, let ϕ=ϕ(p¯,q¯,β¯,γ¯)\phi=\phi(\underline{p},\underline{q},\underline{\beta},\underline{\gamma}) be a finite-dimensional parameter, where ϕ()\phi(\cdot) is a known continuously differentiable function, and p¯\underline{p} is the vector containing all identifiable pt,kp_{t,k}’s, that is, p¯{pt,k:t𝒯,1kNZ}\underline{p}\equiv\{p_{t,k}\mathrel{\mathop{\ordinarycolon}}t\in\mathcal{T},1\leq k\leq N_{Z}\}. Let q¯,β¯\underline{q},\underline{\beta}, and γ¯\underline{\gamma} be defined analogously. A natural estimator can be defined through the CEP estimates, ϕ(p¯^,q¯^,β¯^,γ¯^)\phi(\hat{\underline{p}},\hat{\underline{q}},\hat{\underline{\beta}},\hat{\underline{\gamma}}). The delta method can help calculate the efficiency bound of ϕ\phi and show the efficiency of ϕ(p¯^,q¯^,β¯^,γ¯^)\phi(\hat{\underline{p}},\hat{\underline{q}},\hat{\underline{\beta}},\hat{\underline{\gamma}}). In fact, by Theorem 25.47 of van der Vaart (1998), we immediately have the following corollary, which shows that plug-in estimators are efficient.

Corollary 3.3.

The semiparametric efficiency bound of ϕ\phi is given by the variance of efficient influence function

ψϕ=pp¯ϕpψp+qq¯ϕqψq+ββ¯ϕβψβ+γγ¯ϕγψγ\psi^{\phi}=\sum_{p\in\underline{p}}\frac{\partial\phi}{\partial p}\psi^{p}+\sum_{q\in\underline{q}}\frac{\partial\phi}{\partial q}\psi^{q}+\sum_{\beta\in\underline{\beta}}\frac{\partial\phi}{\partial\beta}\psi^{\beta}+\sum_{\gamma\in\underline{\gamma}}\frac{\partial\phi}{\partial\gamma}\psi^{\gamma} (4)

where the partial derivatives are evaluated at the true parameter value. Moreover, the plug-in estimator ϕ(p¯^,q¯^,β¯^,γ¯^)\phi(\hat{\underline{p}},\hat{\underline{q}},\hat{\underline{\beta}},\hat{\underline{\gamma}}), based on the CEP estimators p¯^,q¯^,β¯^,γ¯^\hat{\underline{p}},\hat{\underline{q}},\hat{\underline{\beta}},\hat{\underline{\gamma}}, achieves the efficiency bound.
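In practice, the variance in Corollary 3.3 is estimated by plugging estimated gradients and influence-function draws into (4). A sketch of this delta-method computation, with hypothetical inputs: `psi_matrix` stacks the (estimated) EIFs of the underlying parameters across observations, and `grad` is the gradient of ϕ\phi evaluated at the estimates.

```python
import numpy as np

def delta_method_var(grad, psi_matrix):
    """Estimated SPEB of phi: the second moment of grad' psi.

    psi_matrix : n x d array; column j holds the EIF draws of the j-th
                 underlying parameter (p, q, beta, or gamma entries).
    grad       : length-d gradient of phi at the (estimated) truth.
    """
    scores = psi_matrix @ grad     # EIF of phi, observation by observation
    return np.mean(scores ** 2)    # sample analogue of Var(psi^phi)
```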

4 Robustness

In the previous section, the EIF is used as a tool for computing the SPEB. In this section, we directly use the EIF as the moment condition for estimation. These moment conditions are appealing because they satisfy double robustness and Neyman orthogonality — the two topics of this section.

A word on notation: in the rest of the paper, we use a superscript oo to signify the true value whenever necessary. For example, when both πo\pi^{o} and π\pi appear, the former means the true probability while the latter denotes a generic function.

4.1 Double Robustness

We focus on the LASF βt,k\beta_{t,k}. The same analysis can be applied to the other parameters. To avoid notational burden in the main text, we drop the subscript (t,k)(t,k) in βt,k\beta_{t,k}, pt,kp_{t,k}, and bt,kb_{t,k}, and the subscript tt in PtP_{t} and QtQ_{t}.666The full subscripts are kept in the Appendices. It is straightforward to verify that the EIF ψβ\psi^{\beta} has zero mean. However, we do not want to use ψβ\psi^{\beta} itself as the estimating equation since it contains 1/p1/p as a factor. To deal with this problem, we simply multiply ψβ\psi^{\beta} by pp and define

ψ(Y,T,Z,X,β,Q,P,π)\displaystyle\psi(Y,T,Z,X,\beta,Q,P,\pi) =pψβ(Y,T,Z,X,β,p,Q,P,π)\displaystyle=p\psi^{\beta}(Y,T,Z,X,\beta,p,Q,P,\pi)
=b(ζ(Z,X,π)(ι(Y𝟏{T=t})Q(X))+Q(X))\displaystyle=b\left(\zeta(Z,X,\pi)\left(\iota(Y\mathbf{1}\{T=t\})-Q(X)\right)+Q(X)\right)
βb(ζ(Z,X,π)(ι𝟏{T=t}P(X))+P(X)).\displaystyle\quad-\beta b\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P(X)\right)+P(X)\right).

The corresponding moment condition is

𝔼[ψ(Y,T,Z,X,βo,Qo,Po,πo)]=0.\displaystyle\mathbb{E}\left[\psi(Y,T,Z,X,\beta^{o},Q^{o},P^{o},\pi^{o})\right]=0. (5)

This moment condition is doubly robust, as demonstrated in the following proposition.

Proposition 4.1 (Double Robustness).

Let (Q,P,π)\left(Q,P,\pi\right) be an arbitrary vector of functions and (Qo,Po,πo)(Q^{o},P^{o},\pi^{o}) the true vector of conditional expectations. Then

𝔼[ψ(Y,T,Z,X,βo,Qo,Po,π)]=0\displaystyle\mathbb{E}\left[\psi(Y,T,Z,X,\beta^{o},Q^{o},P^{o},\pi)\right]=0

and

𝔼[ψ(Y,T,Z,X,βo,Q,P,πo)]=0.\displaystyle\mathbb{E}\left[\psi(Y,T,Z,X,\beta^{o},Q,P,\pi^{o})\right]=0.

The above proposition divides the nonparametric nuisance parameters into two groups, π\pi and (Q,P)(Q,P). The doubly robust moment condition is valid if either of these two groups of nuisance parameters is true. On the other hand, if the researcher uses parametric models for these nuisance parameters, then the structural parameter β\beta can be recovered provided that at least one of the working nuisance models is correctly specified. Therefore, the doubly robust moment condition is “less demanding” on the researcher’s ability to devise a correctly specified model for the nuisance parameters. The double robustness result in Proposition 4.1 can be seen as the GLATE extension of the existing results in the binary LATE literature (e.g., Tan, 2006; Okui et al., 2012).

4.2 Neyman Orthogonality

The second robustness property is Neyman orthogonality. Moment conditions with this property have reduced sensitivity with respect to the nuisance parameters. Formally, Neyman orthogonality means that the moment condition has zero Gateaux derivative with respect to the nuisance parameters. The result is presented in the following proposition.

Proposition 4.2 (Neyman Orthogonality).

Let (Q,P,π)\left(Q,P,\pi\right) be an arbitrary set of functions. For r[0,1]r\in[0,1], define Qr=Qo+r(QQo),Q^{r}=Q^{o}+r(Q-Q^{o}), Pr=Po+r(PPo),P^{r}=P^{o}+r(P-P^{o}), and πr=πo+r(ππo)\pi^{r}=\pi^{o}+r(\pi-\pi^{o}). If supr[0,1]|rψ(Y,T,Z,X,β,Qr,Pr,πr)|\sup_{r\in[0,1]}\big{|}\frac{\partial}{\partial r}\psi(Y,T,Z,X,\beta,Q^{r},P^{r},\pi^{r})\big{|} is integrable, then

r𝔼[ψ(Y,T,Z,X,β,Qr,Pr,πr)]|r=0=0,\displaystyle\frac{\partial}{\partial r}\mathbb{E}\left[\psi(Y,T,Z,X,\beta,Q^{r},P^{r},\pi^{r})\right]\Big{|}_{r=0}=0,

where β\beta does not need to be the true parameter value.

In many econometric models, double robustness and Neyman orthogonality come in pairs. Discussions of their general relationship can be found in Chernozhukov et al. (2016). In practice, double robustness is often used for parametric estimation, as previously explained, whereas Neyman orthogonality is used in estimation in the presence of possibly high-dimensional nuisance parameters.

Next, we apply the double/debiased machine learning (DML) method developed by Chernozhukov et al. (2018) to the moment condition (5). This estimation method works even when the nuisance parameter space is complex enough that the traditional assumptions, e.g., Donsker properties, are no longer valid.777In two-step semiparametric estimations, Donsker properties are usually required so that a suitable stochastic equicontinuity condition is satisfied. See, for example, Assumption 2.5 in Chen et al. (2003). The implementation details are explained below.

The nuisance parameters QQ, PP, and π\pi are estimated using a cross-fitting method: Take an LL-fold random partition of the data such that the size of each fold is n/Ln/L. For l=1,,Ll=1,\cdots,L, let IlI_{l} denote the set of observation indices in the llth fold and Ilc=llIlI^{c}_{l}=\bigcup_{l^{\prime}\neq l}I_{l^{\prime}} the set of observation indices not in the llth fold. Define Qˇl\check{Q}^{l}, Pˇl\check{P}^{l}, and πˇl\check{\pi}^{l} to be the estimates constructed by using data from IlcI_{l}^{c}. The DML estimator of β\beta is constructed following the moment condition (5):888This is the DML2 estimator defined in Chernozhukov et al. (2018). Another estimator, the DML1 estimator, is proposed in the same paper. We do not study the DML1 estimator since it is asymptotically equivalent to DML2, and the authors generally recommend DML2.

βˇ=l=1LiIlb(ζ(Zi,Xi,πˇl)(ι(Yi𝟏{Ti=t})Qˇl(Xi))+Qˇl(Xi))l=1LiIlb(ζ(Zi,Xi,πˇl)(ι𝟏{Ti=t}Pˇl(Xi))+Pˇl(Xi)).\displaystyle\check{\beta}=\frac{\sum_{l=1}^{L}\sum_{i\in I_{l}}b\big{(}\zeta(Z_{i},X_{i},\check{\pi}^{l})\big{(}\iota(Y_{i}\mathbf{1}\{T_{i}=t\})-\check{Q}^{l}(X_{i})\big{)}+\check{Q}^{l}(X_{i})\big{)}}{\sum_{l=1}^{L}\sum_{i\in I_{l}}b\big{(}\zeta(Z_{i},X_{i},\check{\pi}^{l})\big{(}\iota\mathbf{1}\{T_{i}=t\}-\check{P}^{l}(X_{i})\big{)}+\check{P}^{l}(X_{i})\big{)}}. (6)
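The cross-fitting scheme behind (6) can be sketched as follows. The callables `fit_nuisance`, `num_score`, and `den_score` are placeholders (not part of the paper) for the user's estimator of (Q,P,π)(Q,P,\pi) and for the bracketed terms in the numerator and denominator of (6), respectively.

```python
import numpy as np

def dml_beta(n, fit_nuisance, num_score, den_score, L=5, seed=0):
    """Cross-fitted DML estimator of beta, sketching eq. (6).

    fit_nuisance(train_idx) -> eta : fits (Q, P, pi) using only the
        observations outside the current fold (any ML method).
    num_score(eta, i) : numerator term of (6) for observation i.
    den_score(eta, i) : denominator term of (6) for observation i.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), L)  # L-fold random partition
    num = den = 0.0
    for fold in folds:
        train_idx = np.setdiff1d(np.arange(n), fold)
        eta = fit_nuisance(train_idx)  # nuisances fitted off-fold
        num += sum(num_score(eta, i) for i in fold)
        den += sum(den_score(eta, i) for i in fold)
    return num / den
```

The key design point is that the scores for fold ll are evaluated at nuisance estimates trained on IlcI_{l}^{c}, which is what removes the own-observation bias without Donsker conditions.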

To conduct inference, we also need an estimate for the asymptotic variance of βˇ\check{\beta}, which we denote by σ2\sigma^{2}. The asymptotic variance equals the expectation of the squared efficient influence function: σ2=𝔼[(ψβ)2]=𝔼[ψ2]/p2\sigma^{2}=\mathbb{E}[(\psi^{\beta})^{2}]=\mathbb{E}[\psi^{2}]/p^{2}. We first estimate pp by the cross-fitting method; the estimator is essentially the denominator of (6) scaled by 1/n1/n:

pˇ=1nl=1LiIlb(ζ(Zi,Xi,πˇl)(ι𝟏{Ti=t}Pˇl(Xi))+Pˇl(Xi)).\displaystyle\check{p}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_{l}}b\big{(}\zeta(Z_{i},X_{i},\check{\pi}^{l})\big{(}\iota\mathbf{1}\{T_{i}=t\}-\check{P}^{l}(X_{i})\big{)}+\check{P}^{l}(X_{i})\big{)}. (7)

Then the asymptotic variance can be estimated by

σˇ2\displaystyle\check{\sigma}^{2} =1nl=1LiIl(ψβ(Yi,Ti,Zi,Xi,βˇ,pˇ,Qˇl,Pˇl,πˇl))2\displaystyle=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_{l}}\big{(}\psi^{\beta}\big{(}Y_{i},T_{i},Z_{i},X_{i},\check{\beta},\check{p},\check{Q}^{l},\check{P}^{l},\check{\pi}^{l}\big{)}\big{)}^{2}
=1nl=1LiIl(ψ(Yi,Ti,Zi,Xi,βˇ,Qˇl,Pˇl,πˇl)/pˇ)2.\displaystyle=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_{l}}\big{(}\psi\big{(}Y_{i},T_{i},Z_{i},X_{i},\check{\beta},\check{Q}^{l},\check{P}^{l},\check{\pi}^{l}\big{)}/\check{p}\big{)}^{2}.
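Once the cross-fitted scores ψi\psi_{i} and pˇ\check{p} are available as arrays, the variance estimate σˇ2\check{\sigma}^{2} and the resulting standard error for βˇ\check{\beta} reduce to a few lines; the names below are illustrative.

```python
import numpy as np

def dml_se(psi_vals, p_check):
    """Standard error of beta_check from the cross-fitted scores.

    psi_vals[i] : psi evaluated at (beta_check, Q^l, P^l, pi^l) for obs i
    p_check     : cross-fitted estimate of p, as in (7)
    Implements sigma_check^2 = mean(psi^2) / p_check^2, then
    se = sigma_check / sqrt(n).
    """
    n = len(psi_vals)
    sigma2 = np.mean(np.square(psi_vals)) / p_check ** 2
    return np.sqrt(sigma2 / n)
```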

We want to establish the convergence results for the DML estimator uniformly over a class of data generating processes (DGPs) defined as follows. For any two constants c1>c0>0c_{1}>c_{0}>0, let 𝒫(c0,c1)\mathcal{P}(c_{0},c_{1}) be the set of joint distributions of (Y,T,Z,X)(Y,T,Z,X) such that

  1. (i)

    p[c0,1]p\in[c_{0},1],

  2. (ii)

    𝔼[ψ2],πzo(X)c0,z𝒵\mathbb{E}[\psi^{2}],\pi_{z}^{o}(X)\geq c_{0},z\in\mathcal{Z}, and |Y𝟏{T=t}|,|Y𝟏{T=t}Qto(X)|c1|Y\mathbf{1}\{T=t\}|,|Y\mathbf{1}\{T=t\}-Q_{t}^{o}(X)|\leq c_{1}.

The first condition excludes the case where β\beta is weakly identified (when pp can be arbitrarily close to zero). Inference under weak identification is studied in the next section. The following theorem establishes the asymptotic properties of the DML estimation procedure. In particular, the estimator achieves the SPEB.

Theorem 4.3.

Let Assumptions 1 and 2 hold. Assume the following conditions on the nuisance parameter estimators (Qˇl,Pˇl,πˇl)(\check{Q}^{l},\check{P}^{l},\check{\pi}^{l}):

  1. (i)

    For z𝒵z\in\mathcal{Z}, |Qˇl||\check{Q}^{l}| is bounded, Pˇzl\check{P}^{l}_{z} and πˇzl[0,1]\check{\pi}^{l}_{z}\in[0,1], and πˇzl\check{\pi}^{l}_{z} is bounded away from zero.

  2. (ii)

    maxz𝒵(QˇlQo2PˇlPo2πˇlπo2)=op(n1/4)\max_{z\in\mathcal{Z}}\big{(}\lVert\check{Q}^{l}-Q^{o}\rVert_{2}\vee\lVert\check{P}^{l}-P^{o}\rVert_{2}\vee\lVert\check{\pi}^{l}-\pi^{o}\rVert_{2}\big{)}=o_{p}\big{(}n^{-1/4}\big{)}.

Then the estimator βˇ\check{\beta} satisfies

σ1n(βˇβ)N(0,1),\sigma^{-1}\sqrt{n}\big{(}\check{\beta}-\beta\big{)}\Rightarrow N(0,1),

uniformly over the DGPs in 𝒫(c0,c1)\mathcal{P}(c_{0},c_{1}). Moreover, the above convergence result continues to hold when σ\sigma is replaced by the estimator σˇ\check{\sigma}.

The proof verifies the conditions of Theorem 3.1 in Chernozhukov et al. (2018). The essential restriction is on the uniform convergence rate for the estimators of the nuisance parameters. In low-dimensional settings, one can consider local polynomial regression for estimation of the conditional expectations. Under suitable conditions (Hansen, 2008; Masry, 1996), the uniform convergence rate of the local polynomial estimators is (logn/n)2/(dX+4)(\log n/n)^{2/(d_{X}+4)}, which is o(n1/4)o(n^{-1/4}) if dX3d_{X}\leq 3. In high-dimensional settings, as pointed out by Chernozhukov et al. (2018), the rate o(n1/4)o(n^{-1/4}) is often available for common machine learning methods under structured assumptions on the nuisance parameters.999This includes the LASSO method under sparsity of the nuisance space. See, for example, Bühlmann and Van De Geer (2011), Belloni and Chernozhukov (2011), and Belloni and Chernozhukov (2013). However, Chernozhukov et al. (2018) also indicate that to prove that machine learning methods achieve the o(n1/4)o(n^{-1/4}) rate, one will eventually have to use related entropy conditions. These conditions are nevertheless much weaker than Donsker-type restrictions, so the asymptotic normality of the DML estimator continues to hold.

Theorem 4.3 can be directly used to conduct inference on β\beta. Confidence regions can be constructed by inverting the usual tt-tests. These confidence regions are uniformly valid since the convergence results in the above theorem hold uniformly over 𝒫\mathcal{P}. In the next section, we explain why uniform validity is crucial when dealing with weak identification issues.

5 Weak Identification

The convergence result established in Theorem 4.3 is uniform over the set of DGPs with type probability pp bounded away from zero. However, the identification of β\beta would be weak in the case where pp can be arbitrarily close to zero. This leads to distortion of the uniform size of the test and poor asymptotic approximation in finite-sample settings. This section studies this weak identification issue and proposes an inference procedure that is robust against such a problem.

We begin with a heuristic illustration of the weak identification problem. To ease notation, define υ=βp\upsilon=\beta p and

υˇ=βˇpˇ=1nl=1LiIlb(ζ(Zi,Xi,πˇl)(ι(Yi𝟏{Ti=t})Qˇl(Xi))+Qˇl(Xi)).\displaystyle\check{\upsilon}=\check{\beta}\check{p}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_{l}}b\big{(}\zeta(Z_{i},X_{i},\check{\pi}^{l})\big{(}\iota(Y_{i}\mathbf{1}\{T_{i}=t\})-\check{Q}^{l}(X_{i})\big{)}+\check{Q}^{l}(X_{i})\big{)}.

After a simple calculation, we can write

βˇβ=n(υˇυ)βn(pˇp)n(pˇp)+np.\displaystyle\check{\beta}-\beta=\frac{\sqrt{n}(\check{\upsilon}-\upsilon)-\beta\sqrt{n}(\check{p}-p)}{\sqrt{n}(\check{p}-p)+\sqrt{n}p}.

In the above expression, we can interpret the estimation errors n(υˇυ)\sqrt{n}(\check{\upsilon}-\upsilon) and n(pˇp)\sqrt{n}(\check{p}-p) as the noises, while the signal is the term np\sqrt{n}p. Under the usual asymptotics where p>0p>0 is fixed, the noise terms are bounded in probability, whereas the signal term np\sqrt{n}p\rightarrow\infty. Hence, the signal dominates the noise, and the estimator βˇ\check{\beta} is consistent. However, under asymptotics with a drifting sequence p=pn0p=p_{n}\rightarrow 0 and np\sqrt{n}p converging to a finite constant, the signal and the noise are of the same magnitude, which results in the inconsistency of βˇ\check{\beta}. This problem is the weak identification issue. In the weak IV literature, a common measure of identification strength is the so-called concentration parameter. In our case, the concentration parameter is given by np\sqrt{n}p where np\sqrt{n}p\rightarrow\infty corresponds to strong identification, and identification is weak when the limit of np\sqrt{n}p is finite.
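The signal-to-noise heuristic above can be checked in a small simulation: with pp fixed, the ratio estimator concentrates around β\beta, while with pn1/np_{n}\propto 1/\sqrt{n} it does not. The data-generating code below is purely illustrative, with Gaussian noise standing in for the estimation errors υˇυ\check{\upsilon}-\upsilon and pˇp\check{p}-p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, reps = 10_000, 2.0, 2_000

def ratio_errors(p):
    """Simulate beta_check - beta when upsilon_check and p_check equal
    the truth plus O(1/sqrt(n)) noise, as in the decomposition above."""
    e1 = rng.standard_normal(reps) / np.sqrt(n)  # upsilon_check - upsilon
    e2 = rng.standard_normal(reps) / np.sqrt(n)  # p_check - p
    return (beta * p + e1) / (p + e2) - beta

strong = ratio_errors(0.5)              # sqrt(n) * p = 50: strong signal
weak = ratio_errors(2.0 / np.sqrt(n))   # sqrt(n) * p = 2: weak signal
# Estimation errors are far more dispersed in the weak-signal design.
```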

While weak identification is a finite-sample issue, it is formalized using the asymptotic framework. However, the illustration above using asymptotics under drifting sequences is not meant to model DGPs that vary with the sample size nn. Instead, it is a tool used to detect the lack of uniform convergence. In fact, controlling the uniform size of the test is the key to solving weak identification problems.101010See, for example, Imbens and Manski (2004), Mikusheva (2007), and Andrews et al. (2020). Formally, the uniform size of a test is the large-sample limit of the supremum of the rejection probability under the null hypothesis, where the supremum is taken over the nuisance parameter space. When testing a null hypothesis on β\beta in the GLATE model, the supremum mentioned above is taken over all values of p>0p>0. That is, a desirable test should have rejection probability under the null converge to the nominal size uniformly over p(0,1]p\in(0,1]. From the previous discussion, we can see that the uniform size cannot be controlled using the usual tt-statistic n(βˇβ)/σˇ\sqrt{n}(\check{\beta}-\beta)/\check{\sigma}. This failure of uniform convergence, however, does not conflict with Theorem 4.3, where the uniform convergence of βˇ\check{\beta} is established only after restricting pp to be bounded away from zero.

Inference procedures that are robust against weak identification can be obtained by directly imposing the null hypothesis in the construction of the test statistic. One such example is the well-known Anderson-Rubin (AR) statistic in the weak IV literature. Its idea can be generalized to the GLATE model. We first consider testing the two-sided hypothesis H0:β=β0H_{0}\mathrel{\mathop{\ordinarycolon}}\beta=\beta_{0} versus H1:ββ0H_{1}\mathrel{\mathop{\ordinarycolon}}\beta\neq\beta_{0}. To control the uniform size of the test, we need the test statistic to converge uniformly on the parameter space where (1) β=β0\beta=\beta_{0}, and (2) pp is allowed to be arbitrarily close to zero. A null-restricted tt-statistic can be obtained as follows. Notice that when p>0p>0, β=β0\beta=\beta_{0} is equivalent to

0=υβ0p=𝔼[ψ(Y,T,Z,X,β0,Qo,Po,πo)].\displaystyle 0=\upsilon-\beta_{0}p=\mathbb{E}\left[\psi(Y,T,Z,X,\beta_{0},Q^{o},P^{o},\pi^{o})\right]. (8)

Its estimate can be written as

υˇβ0pˇ=(υˇυ)β(pˇp)+(ββ0)p.\displaystyle\check{\upsilon}-\beta_{0}\check{p}=(\check{\upsilon}-\upsilon)-\beta(\check{p}-p)+(\beta-\beta_{0})p. (9)

Under the null hypothesis β=β0\beta=\beta_{0}, the above estimate does not depend on the concentration parameter np\sqrt{n}p and consists only of the noise terms υˇυ\check{\upsilon}-\upsilon and pˇp\check{p}-p, whose uniform convergence can be established directly.

For implementation, this test statistic can be obtained as a straightforward application of the DML procedure described in the previous section to the moment condition (8). As a consequence of Proposition 4.2, the above moment condition satisfies the Neyman orthogonality condition regardless of the true value of β\beta. More specifically, the null-restricted tt-statistic is defined to be

ρˇ=n(υˇβ0pˇ)/σˇψ,\displaystyle\check{\rho}=\sqrt{n}(\check{\upsilon}-\beta_{0}\check{p})/\check{\sigma}_{\psi},

where

σˇψ2=1nl=1LiIlψ(Yi,Ti,Zi,Xi,β0,Qˇl,Pˇl,πˇl)2.\displaystyle\check{\sigma}_{\psi}^{2}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_{l}}\psi(Y_{i},T_{i},Z_{i},X_{i},\beta_{0},\check{Q}^{l},\check{P}^{l},\check{\pi}^{l})^{2}.

The corresponding test of H0:β=β0H_{0}\mathrel{\mathop{\ordinarycolon}}\beta=\beta_{0} against H1:ββ0H_{1}\mathrel{\mathop{\ordinarycolon}}\beta\neq\beta_{0} rejects for large values of |ρˇ||\check{\rho}|.
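The null-restricted statistic has a particularly simple form once the cross-fitted scores evaluated at β0\beta_{0} are stacked into an array; since υˇβ0pˇ\check{\upsilon}-\beta_{0}\check{p} is the sample mean of those scores, ρˇ\check{\rho} is a self-normalized mean. Names below are illustrative.

```python
import numpy as np

def ar_test_stat(psi0_vals):
    """Null-restricted t-statistic rho_check for H0: beta = beta_0.

    psi0_vals[i] : cross-fitted score psi(Y_i, T_i, Z_i, X_i, beta_0,
                   Q^l, P^l, pi^l), evaluated at the hypothesized beta_0.
    Returns sqrt(n) * mean(psi0) / sqrt(mean(psi0^2)).
    """
    n = len(psi0_vals)
    m = np.mean(psi0_vals)
    s = np.sqrt(np.mean(np.square(psi0_vals)))  # sigma_check_psi
    return np.sqrt(n) * m / s
```

The two-sided test rejects when the absolute value of the returned statistic exceeds the standard normal critical value.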

The same methodology can be applied to testing the one-sided hypothesis H0:ββ0H_{0}\mathrel{\mathop{\ordinarycolon}}\beta\leq\beta_{0} versus H1:β>β0H_{1}\mathrel{\mathop{\ordinarycolon}}\beta>\beta_{0}. Under the null hypothesis, (ββ0)p(\beta-\beta_{0})p is non-positive, suggesting that the test should reject for large values of ρˇ\check{\rho}. Notice that this relies on knowing the sign of pp due to the GLATE model structure. This restriction on the sign of pp is similar to knowing the first-stage sign in the linear IV model, which is studied by Andrews and Armstrong (2017) in the context of unbiased estimation.

We now define the set of DGPs that allows pp to be arbitrarily close to zero. For any two constants c1>c0>0c_{1}>c_{0}>0, let 𝒫WI(c0,c1)\mathcal{P}^{\text{WI}}(c_{0},c_{1}) be the set of joint distributions of (Y,T,Z,X)(Y,T,Z,X) such that

  1. (i)

    p(0,1]p\in(0,1],

  2. (ii)

    𝔼[ψ2],πzo(X)c0,z𝒵\mathbb{E}[\psi^{2}],\pi_{z}^{o}(X)\geq c_{0},z\in\mathcal{Z}, and |Y𝟏{T=t}|,|Y𝟏{T=t}Qto(X)|c1\mathinner{\!\left\lvert Y\mathbf{1}\{T=t\}\right\rvert},|Y\mathbf{1}\{T=t\}-Q_{t}^{o}(X)|\leq c_{1}.

For any β\beta^{\prime}\in\mathbb{R}, let 𝒫βWI(c0,c1)\mathcal{P}^{\text{WI}}_{\beta^{\prime}}(c_{0},c_{1}) be the subset of 𝒫WI(c0,c1)\mathcal{P}^{\text{WI}}(c_{0},c_{1}) in which the true value of the parameter β\beta is β\beta^{\prime}. In particular, 𝒫β0WI(c0,c1)\mathcal{P}^{\text{WI}}_{\beta_{0}}(c_{0},c_{1}) denotes the subset where the null hypothesis is true. The superscript “WI” denotes weak identification. The difference between 𝒫(c0,c1)\mathcal{P}(c_{0},c_{1}) and 𝒫WI(c0,c1)\mathcal{P}^{\text{WI}}(c_{0},c_{1}) is that 𝒫WI(c0,c1)\mathcal{P}^{\text{WI}}(c_{0},c_{1}) allows the type probability pp to be arbitrarily small, whereas the type probabilities in 𝒫(c0,c1)\mathcal{P}(c_{0},c_{1}) are uniformly bounded away from zero. Denote 𝒩ν\mathcal{N}_{\nu} as the ν\nuth quantile of the standard normal distribution. The following theorem establishes that the above testing procedures have uniformly correct sizes and are consistent.

Theorem 5.1.

Suppose the conditions on the nuisance parameter estimates in Theorem 4.3 hold. Let α(0,1)\alpha\in(0,1) be the nominal size of the tests.

  1. (i)

    The test that rejects H0:β=β0H_{0}\mathrel{\mathop{\ordinarycolon}}\beta=\beta_{0} in favor of H1:ββ0H_{1}\mathrel{\mathop{\ordinarycolon}}\beta\neq\beta_{0} when |ρˇ|>𝒩1α2|\check{\rho}|>\mathcal{N}_{1-\frac{\alpha}{2}} has (asymptotically) uniformly correct size and is consistent. That is,

    sup{P(|ρˇ|>𝒩1α2):P𝒫β0WI(c0,c1)}α\displaystyle\sup\big{\{}\mathbb{P}_{P}\big{(}|\check{\rho}|>\mathcal{N}_{1-\frac{\alpha}{2}}\big{)}\mathrel{\mathop{\ordinarycolon}}P\in\mathcal{P}^{\text{WI}}_{\beta_{0}}(c_{0},c_{1})\big{\}}\rightarrow\alpha

    and

    P(|ρˇ|>𝒩1α2)1,P𝒫βWI(c0,c1),ββ0.\displaystyle\mathbb{P}_{P}\big{(}|\check{\rho}|>\mathcal{N}_{1-\frac{\alpha}{2}}\big{)}\rightarrow 1,P\in\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1}),\beta\neq\beta_{0}.
  2. (ii)

    The test that rejects H0:ββ0H_{0}\mathrel{\mathop{\ordinarycolon}}\beta\leq\beta_{0} in favor of H1:β>β0H_{1}\mathrel{\mathop{\ordinarycolon}}\beta>\beta_{0} when ρˇ>𝒩1α\check{\rho}>\mathcal{N}_{1-\alpha} has (asymptotically) uniformly correct size and is consistent. That is,

    sup{P(ρˇ>𝒩1α):P𝒫βWI(c0,c1),ββ0}α\displaystyle\sup\big{\{}\mathbb{P}_{P}\big{(}\check{\rho}>\mathcal{N}_{1-\alpha}\big{)}\mathrel{\mathop{\ordinarycolon}}P\in\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1}),\beta\leq\beta_{0}\big{\}}\rightarrow\alpha

    and

    P(ρˇ>𝒩1α)1,P𝒫βWI(c0,c1),β>β0.\displaystyle\mathbb{P}_{P}\big{(}\check{\rho}>\mathcal{N}_{1-\alpha}\big{)}\rightarrow 1,P\in\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1}),\beta>\beta_{0}.

6 Empirical Application

In this section, we apply the theoretical results to data from the Oregon Health Insurance Experiment (Finkelstein et al., 2012) and examine the effects of different sources of health insurance on health. The experiment was conducted by the state of Oregon between March and September 2008. A series of lottery draws were administered to award the participants the option of enrolling in the Oregon Health Plan Standard, a Medicaid expansion program available to adult Oregon residents with limited income. Follow-up surveys were sent out in several waves to record, among many variables, the participants’ insurance plan and health status. Finkelstein et al. (2012) obtain the effects of insurance coverage by using a LATE model. We apply the GLATE model to study the effect heterogeneity across different sources of insurance.

According to the data, many lottery winners did not choose to participate in the Medicaid program. Instead, they went with other insurance plans or chose not to have any health insurance. Based on this observation, we can set up the GLATE model. The instrument ZZ is the binary lottery that determines whether an individual is selected. The covariates XX include the number of household members and survey waves. Given XX, ZZ is randomly assigned (Finkelstein et al., 2012, p. 1071).111111Though the covariates are discrete, the methods developed in this paper are still different from the linear regressions in Finkelstein et al. (2012). The treatment TT is the insurance plan, which contains three categories: Medicaid (mm), non-Medicaid insurance plans (nmnm), and no health insurance (nono). The second category includes Medicare, private plans, employer plans, and other plans. The counterfactual health plan choices under different lottery results are the variables T0T_{0} and T1T_{1}. The unordered monotonicity condition requires that any participant who changes insurance plan due to winning the lottery does so to enroll in the Medicaid program.

The above setup is the same as in Example 2. We follow the terminology in Kline and Walters (2016) and define the following six type sets by their counterfactual insurance plan choices:

  1. 1.

    nono-never takers: SΣno,2={s1}S\in\Sigma_{no,2}=\{s_{1}\}, T0=T1=noT_{0}=T_{1}=no;

  2. 2.

    nmnm-never takers: SΣnm,2={s2}S\in\Sigma_{nm,2}=\{s_{2}\}, T0=T1=nmT_{0}=T_{1}=nm;

  3. 3.

    always takers: SΣm,2={s3}S\in\Sigma_{m,2}=\{s_{3}\}, T0=T1=mT_{0}=T_{1}=m;

  4. 4.

    nono-compliers: SΣno,1={s4}S\in\Sigma_{no,1}=\{s_{4}\}, T0=noT_{0}=no, T1=mT_{1}=m;

  5. 5.

    nmnm-compliers: SΣnm,1={s5}S\in\Sigma_{nm,1}=\{s_{5}\}, T0=nmT_{0}=nm, T1=mT_{1}=m;

  6. 6.

    compliers: SΣm,1={s4,s5}S\in\Sigma_{m,1}=\{s_{4},s_{5}\}, T0mT_{0}\neq m, T1=mT_{1}=m.

The two groups of never takers choose not to join Medicaid regardless of the offer. Always takers manage to enroll in Medicaid even without an offer. The nono- and nmnm- compliers switch to Medicaid from no insurance plan and other plans, respectively, upon winning the lottery. Combining these two groups gives the larger set of compliers.

Table 1 shows the estimated probabilities of the six types.121212We use the data from the 12-month survey. After removing observations with missing values, we are left with 23,290 observations. For cross-fitting, we choose L=10L=10. We can see that half of the population are nono-never takers, who are never covered by any insurance plan. The compliers make up around one-fifth of the population. There are effectively no nmnm-compliers, meaning that the experiment does not crowd out other insurance plan choices. These findings are consistent with Finkelstein et al. (2012).

Type Probability Estimate (se)
nono-never takers pno,2p_{no,2} .492 (.046)
nmnm-never takers pnm,2p_{nm,2} .208 (.018)
always takers pm,2p_{m,2} .116 (.018)
nono-compliers pno,1p_{no,1} .197 (.059)
nmnm-compliers pnm,1p_{nm,1} .010 (.024)
compliers pm,1p_{m,1} .208 (.060)
Table 1: Estimated probability of different types.

The outcome of interest YY is health status, which is (inversely) measured by the number of days (out of the past 30) when poor health impaired regular activities.131313Other types of outcomes are also studied by Finkelstein et al. (2012), including health care utilization and financial strain. Here we only focus on health status for simplicity. The potential outcomes are denoted by YnoY_{no}, YnmY_{nm}, and YmY_{m}. By Theorem 2.1, we can identify the distribution of YnoY_{no} for nono-never takers and nono-compliers, the distribution of YnmY_{nm} for nmnm-never takers and nmnm-compliers, and the distribution of YmY_{m} for always takers and compliers. Table 2 reports the estimated LASFs.141414The LASF βnm,1\beta_{nm,1} is excluded because there are few nmnm-compliers, as reported in Table 1. We can clearly see a pattern of self-selection into the treatment. For example, when there is no insurance coverage, the potential health status of nono-compliers is worse than that of nono-never takers; nono-compliers therefore choose to enroll in Medicaid.

Type  Treatment  LASF  Estimate (se)
no-never takers  no  $\beta_{no,2}$  6.78 (1.19)
nm-never takers  nm  $\beta_{nm,2}$  7.74 (1.05)
always takers  m  $\beta_{m,2}$  9.96 (1.75)
no-compliers  no  $\beta_{no,1}$  11.50 (2.92)
compliers  m  $\beta_{m,1}$  0.48 (3.42)
Table 2: Estimated LASFs.

7 Concluding Remarks

In this paper, we considered the estimation of the causal parameters LASF and LASF-T in the GLATE model using the EIF. The proposed DML estimator attains the SPEB and can be applied in situations, such as high-dimensional settings, where Donsker properties fail. For inference, we proposed generalized AR tests that are robust against weak identification issues. Currently, empirical researchers use TSLS and control for the covariates linearly in models with multi-valued treatments and instruments. This linear specification does not have a LATE interpretation, as pointed out by Blandhol et al. (2022). We therefore advocate using the semiparametric methods studied in this paper in those cases.


SUPPLEMENTARY MATERIAL

Appendix A Technical Proofs

In this section, we prove the theorems and propositions stated in the main text. We assume that Assumptions 1 and 2 hold throughout this section.

A.1 Proof of the Identification Results

Lemma A.1.

$S\perp Z\mid X$ and, for each $t\in\mathcal{T}$, $Y_{t}\perp T\mid S,X$.

Proof of Lemma A.1.

The first statement follows from the definition of $S$ and the fact that $Z$ is independent of the vector $(T_{z_{1}},\cdots,T_{z_{N_{Z}}})$ conditional on $X$. For the second statement, $T$ is entirely determined by $(S,Z,X)$. Hence, given $S$ and $X$, $T$ is independent of $Y_{t}$ since $Z$ is independent of $(Y_{t_{1}},\cdots,Y_{t_{N_{T}}})$ conditional on $X$. ∎

Lemma A.2.

For each $t\in\mathcal{T}$ and $k=1,\cdots,N_{Z}$, the following identification results hold.

  (i) $\mathbb{P}(S\in\Sigma_{t,k}\mid X)=b_{t,k}P_{t}(X)$ a.s.

  (ii) $\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k},X\right]=(b_{t,k}Q_{t}(X))/(b_{t,k}P_{t}(X))$ a.s.

Proof of Lemma A.2.

This is Theorem T-6 in Heckman and Pinto (2018a), with the conditioning on $X$ made explicit. ∎

Proof of Theorem 2.1.

The first statement follows from applying the law of iterated expectations to Lemma A.2(i). For the second statement, we can apply Bayes' rule to Lemma A.2 and obtain that

\displaystyle\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k}\right] =\int\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k},X=x\right]f_{X\mid S\in\Sigma_{t,k}}(x)dx
\displaystyle=\int\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k},X=x\right]\frac{\mathbb{P}(S\in\Sigma_{t,k}\mid X=x)}{\mathbb{P}(S\in\Sigma_{t,k})}f_{X}(x)dx
\displaystyle=\mathbb{E}\left[b_{t,k}Q_{t}(X)\right]/p_{t,k},

where $f_{X\mid S\in\Sigma_{t,k}}$ denotes the conditional density function of $X$ given type $S\in\Sigma_{t,k}$. ∎

Proof of Theorem 2.2.

By Lemma L-16 of Heckman and Pinto (2018b), we know that under the unordered monotonicity assumption, $B_{t}[\cdot,i]=B_{t}[\cdot,i^{\prime}]$ for all $s_{i},s_{i^{\prime}}\in\Sigma_{t,k}$. Thus, the set $\mathcal{Z}_{t,k}$ always exists. For the first statement, we have

\displaystyle\mathbb{P}\left(T=t,S\in\Sigma_{t,k}\right) =\mathbb{P}\left(Z\in\mathcal{Z}_{t,k},S\in\Sigma_{t,k}\right)
\displaystyle=\mathbb{E}\left[\mathbb{P}\left(Z\in\mathcal{Z}_{t,k},S\in\Sigma_{t,k}\mid X\right)\right]
\displaystyle=\mathbb{E}\left[\mathbb{P}\left(Z\in\mathcal{Z}_{t,k}\mid X\right)\mathbb{P}\left(S\in\Sigma_{t,k}\mid X\right)\right]
\displaystyle=\mathbb{E}\left[b_{t,k}P_{t}(X)\pi_{t,k}(X)\right],

where the second equality follows from the law of iterated expectations and the third equality follows from the fact that $Z\perp S\mid X$ (Lemma A.1). For the second statement, notice that

\displaystyle\mathbb{P}(T=t,S\in\Sigma_{t,k}\mid X=x) =\mathbb{P}(T=t\mid S\in\Sigma_{t,k},X=x)\mathbb{P}(S\in\Sigma_{t,k}\mid X=x)
\displaystyle=\mathbb{P}(Z\in\mathcal{Z}_{t,k}\mid X=x)\mathbb{P}(S\in\Sigma_{t,k}\mid X=x)
\displaystyle=\pi_{t,k}(x)b_{t,k}P_{t}(x).

By Lemma A.1, we know that

\displaystyle\mathbb{E}\left[Y_{t}\mid T=t,S\in\Sigma_{t,k},X=x\right]=\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k},X=x\right].

Therefore, we can apply Bayes' rule and obtain that

\displaystyle\mathbb{E}\left[Y_{t}\mid T=t,S\in\Sigma_{t,k}\right]
\displaystyle= \int\mathbb{E}\left[Y_{t}\mid T=t,S\in\Sigma_{t,k},X=x\right]f_{X\mid T=t,S\in\Sigma_{t,k}}(x)dx
\displaystyle= \int\mathbb{E}\left[Y_{t}\mid S\in\Sigma_{t,k},X=x\right]\frac{\mathbb{P}(T=t,S\in\Sigma_{t,k}\mid X=x)}{\mathbb{P}(T=t,S\in\Sigma_{t,k})}f_{X}(x)dx
\displaystyle= \int\frac{b_{t,k}Q_{t}(x)}{b_{t,k}P_{t}(x)}\times\frac{\pi_{t,k}(x)b_{t,k}P_{t}(x)}{q_{t,k}}f_{X}(x)dx
\displaystyle= \mathbb{E}\left[b_{t,k}Q_{t}(X)\pi_{t,k}(X)\right]/q_{t,k}. ∎
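The reweighting step used in both proofs is the elementary identity $\mathbb{E}[g(X)\mid A]=\mathbb{E}[g(X)\mathbb{P}(A\mid X)]/\mathbb{P}(A)$. A minimal numerical check of this identity, with a hypothetical discrete covariate and illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
x = rng.integers(0, 3, n)                  # discrete covariate
p_a = np.array([0.2, 0.5, 0.8])[x]         # P(A | X), e.g. P(S in Sigma_{t,k} | X)
a = rng.random(n) < p_a                    # realized event A
g = np.array([1.0, 2.0, -0.5])[x]          # g(X), e.g. E[Y_t | S in Sigma_{t,k}, X]

lhs = g[a].mean()                          # direct estimate of E[g(X) | A]
rhs = (g * p_a).mean() / p_a.mean()        # reweighted form E[g(X) P(A|X)] / P(A)
```

Both sides estimate the same population quantity, which is exactly how the conditional-on-covariates objects $Q_{t}$, $P_{t}$, and $\pi_{t,k}$ are aggregated into the unconditional parameters.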

A.2 Semiparametric Efficiency Calculations

We follow the method developed by Newey (1990). The likelihood of the GLATE model can be specified as

\mathcal{L}\left(Y,T,Z,X\right)=f_{X}(X)\prod_{z\in\mathcal{Z}}\Big(f_{z}(Y,T\mid X)\pi_{z}(X)\Big)^{\mathbf{1}\{Z=z\}},

where $f_{z}(\cdot,\cdot\mid X)$ denotes the conditional density of $(Y,T)$ given $Z=z$ and $X$. In a regular parametric submodel, where the true underlying probability measure $P$ is indexed by $\theta^{o}$, we use the following notation for the score functions:

\displaystyle s_{z}(Y,T\mid X;\theta)=\frac{\partial}{\partial\theta}\log\left(f_{z}(Y,T\mid X;\theta)\right),
\displaystyle s_{\pi}(Z\mid X;\theta)=\sum_{z\in\mathcal{Z}}\mathbf{1}\{Z=z\}\frac{\partial}{\partial\theta}\log\left(\pi_{z}(X;\theta)\right),
\displaystyle s_{X}(X;\theta)=\frac{\partial}{\partial\theta}\log\left(f_{X}(X;\theta)\right).

The score in a regular parametric submodel is

\displaystyle s_{\theta^{o}}(Y,T,Z,X)=\sum_{z\in\mathcal{Z}}\mathbf{1}\{Z=z\}s_{z}\left(Y,T\mid X;\theta^{o}\right)+s_{\pi}(Z\mid X;\theta^{o})+s_{X}(X;\theta^{o}).

Hence, the tangent space of the model is

\displaystyle\mathscr{S} =\big\{s\in L^{2}_{0}: s(Y,T,Z,X)=\sum_{z\in\mathcal{Z}}\mathbf{1}\{Z=z\}s_{z}\left(Y,T\mid X\right)+s_{\pi}(Z\mid X)+s_{X}(X)
\displaystyle\quad\quad\text{ for some }s_{z},s_{\pi},s_{X}\text{ such that }\int s_{z}(y,t\mid X)f_{z}(y,t\mid X)dydt\equiv 0,\forall z;
\displaystyle\quad\quad\sum_{z\in\mathcal{Z}}s_{\pi}(z\mid X)\pi_{z}(X)\equiv 0\text{, and }\int s_{X}(x)f_{X}(x)dx=0\big\},

where $L^{2}_{0}$ is the subspace of $L^{2}$ that contains the mean-zero functions.

Proof of Theorem 3.1.

We only prove statements (i) and (ii) since (iii) and (iv) are easier cases that can be proved along the way. We start with the first statement. The path-wise differentiability of the parameter $\beta_{t,k}$ can be verified in the following way: in any parametric submodel, we have

\displaystyle\frac{\partial}{\partial\theta}\beta_{t,k}(\theta)\Big|_{\theta=\theta^{o}} =\frac{\partial}{\partial\theta}(b_{t,k}\mathbb{E}_{\theta}\left[Q_{t}(X)\right]/p_{t,k})\big|_{\theta=\theta^{o}}
\displaystyle=\frac{1}{p_{t,k}}\left((\partial b_{t,k}\mathbb{E}_{\theta}\left[Q_{t}(X)\right]/\partial\theta)|_{\theta=\theta^{o}}-(b_{t,k}\mathbb{E}_{\theta}\left[Q_{t}(X)\right]/p_{t,k})(\partial p_{t,k}/\partial\theta)|_{\theta=\theta^{o}}\right)
\displaystyle=\frac{1}{p_{t,k}}b_{t,k}\left(\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[Q_{t}(X)\right]\big|_{\theta=\theta^{o}}-\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[P_{t}(X)\right]\big|_{\theta=\theta^{o}}\beta_{t,k}\right),

where $\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[Q_{t}(X)\right]|_{\theta=\theta^{o}}$ and $\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[P_{t}(X)\right]|_{\theta=\theta^{o}}$ are $N_{Z}\times 1$ vectors whose typical elements can be represented by

\displaystyle\int y\mathbf{1}\{\tau=t\}s_{z}(y,\tau\mid x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int y\mathbf{1}\{\tau=t\}s_{X}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx

and

\displaystyle\int\mathbf{1}\{\tau=t\}s_{z}(y,\tau\mid x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int\mathbf{1}\{\tau=t\}s_{X}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx,

respectively, for $z\in\mathcal{Z}$. The EIF is characterized by the conditions that

\displaystyle\frac{\partial}{\partial\theta}\beta_{t,k}(\theta)\Big|_{\theta=\theta^{o}}=\mathbb{E}\left[\psi_{\beta_{t,k}}s_{\theta^{o}}\right]\text{, and }\psi_{\beta_{t,k}}\in\mathscr{S}.

The expression of $\psi_{\beta_{t,k}}$ given in Equation (2) meets the above requirements. In particular, the correspondence between terms in the EIF and the path-wise derivative appears exactly as in Lemma 1 of Hong and Nekipelov (2010b).

For the second statement, the path-wise derivative of $\gamma_{t,k}$ can be computed similarly:

\displaystyle\frac{\partial}{\partial\theta}\gamma_{t,k}(\theta)\Big|_{\theta=\theta^{o}} =\frac{1}{q_{t,k}}b_{t,k}\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[Q_{t}(X)\pi_{t,k}(X)\right]\Big|_{\theta=\theta^{o}}
\displaystyle\quad-\frac{\gamma_{t,k}}{q_{t,k}}b_{t,k}\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[P_{t}(X)\pi_{t,k}(X)\right]\Big|_{\theta=\theta^{o}},

where $\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}[Q_{t}(X)\pi_{t,k}(X)]|_{\theta=\theta^{o}}$ and $\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}[P_{t}(X)\pi_{t,k}(X)]|_{\theta=\theta^{o}}$ are $N_{Z}\times 1$ vectors whose typical elements can be represented by

\displaystyle\int y\mathbf{1}\{\tau=t\}s_{z}(y,\tau\mid x;\theta^{o})\pi_{t,k}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int y\mathbf{1}\{\tau=t\}s_{X}(x;\theta^{o})\pi_{t,k}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int y\mathbf{1}\{\tau=t\}\left(\frac{\partial}{\partial\theta}\pi_{t,k}(x;\theta)\big|_{\theta=\theta^{o}}\right)f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx,

and

\displaystyle\int\mathbf{1}\{\tau=t\}s_{z}(y,\tau\mid x;\theta^{o})\pi_{t,k}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int\mathbf{1}\{\tau=t\}s_{X}(x;\theta^{o})\pi_{t,k}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx
\displaystyle+ \int\mathbf{1}\{\tau=t\}\left(\frac{\partial}{\partial\theta}\pi_{t,k}(x;\theta)\big|_{\theta=\theta^{o}}\right)f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx,

respectively, for $z\in\mathcal{Z}$. The main difference appears when dealing with the last terms in the above two expressions, which can be matched with terms in the efficient influence function of the following two forms:

\displaystyle\mathbb{E}\left[Y\mathbf{1}\{T=t\}\mid Z=z,X\right]\left(\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}-\pi_{t,k}(X)\right),\text{ and }
\displaystyle\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]\left(\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}-\pi_{t,k}(X)\right).

Take the latter as an example. Notice that

\displaystyle\mathbf{1}\{Z\in\mathcal{Z}_{t,k}\}-\pi_{t,k}(X)=\sum_{z\in\mathcal{Z}_{t,k}}\left(\mathbf{1}\{Z=z\}-\pi_{z}(X)\right),

and

\displaystyle\left(\mathbf{1}\{Z=z\}-\pi_{z}(X)\right)s_{\pi}(Z\mid X;\theta^{o})=\frac{\mathbf{1}\{Z=z\}}{\pi_{z}(X)}\frac{\partial}{\partial\theta}\pi_{z}(X;\theta)\big|_{\theta=\theta^{o}}-\pi_{z}(X)s_{\pi}(Z\mid X;\theta^{o}).

By the law of iterated expectations, we have

\displaystyle\mathbb{E}\left[\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]\left(\mathbf{1}\{Z=z\}-\pi_{z}(X)\right)s_{\pi}(Z\mid X;\theta^{o})\right]
\displaystyle= \mathbb{E}\left[\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]\mathbb{E}\left[\mathbf{1}\{Z=z\}/\pi_{z}(X)\mid X\right]\frac{\partial}{\partial\theta}\pi_{z}(X;\theta)\big|_{\theta=\theta^{o}}\right]
\displaystyle\quad- \mathbb{E}\left[\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]\pi_{z}(X)\mathbb{E}\left[s_{\pi}(Z\mid X;\theta^{o})\mid X\right]\right]
\displaystyle= \mathbb{E}\left[\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]\frac{\partial}{\partial\theta}\pi_{z}(X;\theta)\big|_{\theta=\theta^{o}}\right]
\displaystyle= \int\mathbf{1}\{\tau=t\}\left(\frac{\partial}{\partial\theta}\pi_{z}(x;\theta)\big|_{\theta=\theta^{o}}\right)f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})dyd\tau dx. ∎

Proof of Proposition 3.2.

This proof is based on Section 4 of Newey (1994). We focus on the case of $\beta_{t,k}$; the other cases are similar. To ease notation, let $h_{t}=\left(h_{Y,t,Z},h_{t,Z},\pi\right)^{\prime}$. The estimator $\hat{\beta}_{t,k}$ is defined by the moment condition

\displaystyle\mathbb{E}[M\left(X,\beta_{t,k},h_{t}\right)]=0,

where

\displaystyle M\left(X,\beta_{t,k},h_{t}\right)\equiv b_{t,k}\left(\frac{h_{Y,t,z_{1}}(X)}{\pi_{z_{1}}(X)},\cdots,\frac{h_{Y,t,z_{N_{Z}}}(X)}{\pi_{z_{N_{Z}}}(X)}\right)^{\prime}-\beta_{t,k}b_{t,k}\left(\frac{h_{t,z_{1}}(X)}{\pi_{z_{1}}(X)},\cdots,\frac{h_{t,z_{N_{Z}}}(X)}{\pi_{z_{N_{Z}}}(X)}\right)^{\prime}.

We then compute the derivatives of $M$ with respect to the parameters:

\displaystyle\mathbb{E}\left[\partial M/\partial\beta_{t,k}\right] =-b_{t,k}\mathbb{E}\left[P_{t}(X)\right]=-p^{o}_{t,k}
\displaystyle\partial M/\partial h_{Y,t,z_{i}}|_{h_{t}=h_{t}^{o}} =b_{t,k}[i]/\pi^{o}_{z_{i}}(X)\equiv\delta_{Y,t,z_{i}}(X)
\displaystyle\partial M/\partial h_{t,z_{i}}|_{h_{t}=h_{t}^{o}} =-(\beta_{t,k}b_{t,k}[i])/\pi^{o}_{z_{i}}(X)\equiv\delta_{t,z_{i}}(X)
\displaystyle\partial M/\partial\pi_{z_{i}}|_{h_{t}=h_{t}^{o}} =-(b_{t,k}[i]Q^{o}_{t,z_{i}}(X))/\pi^{o}_{z_{i}}(X)+(\beta_{t,k}b_{t,k}[i]P^{o}_{t,z_{i}}(X))/\pi^{o}_{z_{i}}(X)\equiv\delta_{\pi,z_{i}}(X),

where $b_{t,k}[i]$ denotes the $i$th element of the vector $b_{t,k}$. Define

\displaystyle\alpha\left(Y,T,Z,X\right) \equiv\sum_{z\in\mathcal{Z}}\delta_{Y,t,z}(X)\left(\mathbf{1}\{Z=z\}Y\mathbf{1}\{T=t\}-h^{o}_{Y,t,z}(X)\right)
\displaystyle\quad+\sum_{z\in\mathcal{Z}}\delta_{t,z}(X)\left(\mathbf{1}\{Z=z\}\mathbf{1}\{T=t\}-h^{o}_{t,z}(X)\right)
\displaystyle\quad+\sum_{z\in\mathcal{Z}}\delta_{\pi,z}(X)\left(\mathbf{1}\{Z=z\}-\pi^{o}_{z}(X)\right).

We have

\displaystyle\alpha\left(Y,T,Z,X\right) =b_{t,k}\zeta(Z,X,\pi^{o})\left(\iota(Y\mathbf{1}\{T=t\})-Q^{o}_{t}(X)\right)
\displaystyle\quad-\beta_{t,k}^{o}b_{t,k}\zeta(Z,X,\pi^{o})\left(\iota\mathbf{1}\{T=t\}-P^{o}_{t}(X)\right).

Then Proposition 4 of Newey (1994) suggests that the influence function of the estimator $\hat{\beta}_{t,k}$ is $(M+\alpha)/p_{t,k}$, which is equal to the EIF $\psi^{\beta_{t,k}}$. ∎

A.3 Proof of Robustness Results

Proof of Proposition 4.1.

We prove the case of $\psi^{p_{t,k}}$; the other cases can be dealt with analogously. First assume $\pi=\pi^{o}$. Then

\displaystyle\mathbb{E}\left[\mathbf{1}\{Z=z\}/\pi_{z}^{o}(X)\mid X\right]=1,

which implies that $\mathbb{E}\left[\zeta(Z,X,\pi^{o})\mid X\right]$ is almost surely equal to the identity matrix $\mathbf{I}$. By the law of total expectation, we have

\displaystyle\mathbb{E}\left[\mathbf{1}\{T=t\}\mathbf{1}\{Z=z\}/\pi_{z}^{o}(X)\mid X\right]=\mathbb{E}\left[\mathbf{1}\{T=t\}\mid Z=z,X\right]=P_{t,z}^{o}(X),

which implies that $\mathbb{E}\left[\zeta(Z,X,\pi^{o})\iota\mathbf{1}\{T=t\}\right]=\mathbb{E}\left[P_{t}^{o}(X)\right]$. Therefore,

\displaystyle b_{t,k}\mathbb{E}[\zeta(Z,X,\pi^{o})\left(\iota\mathbf{1}\{T=t\}-P_{t}(X)\right)+P_{t}(X)]
\displaystyle= b_{t,k}\mathbb{E}\left[\zeta(Z,X,\pi^{o})\iota\mathbf{1}\{T=t\}\right]+b_{t,k}\mathbb{E}\left[(\mathbf{I}-\zeta(Z,X,\pi^{o}))P_{t}(X)\right]=b_{t,k}\mathbb{E}\left[P_{t}^{o}(X)\right]=p_{t,k}^{o}.

Now suppose that $P_{t}=P_{t}^{o}$. Then by the law of total expectation, we have

\displaystyle\mathbb{E}[\mathbf{1}\{Z=z\}(\mathbf{1}\{T=t\}-P_{t,z}^{o}(X))\mid X]
\displaystyle= \pi_{z}^{o}(X)\big(\mathbb{E}[\mathbf{1}\{T=t\}\mid Z=z,X]-P_{t,z}^{o}(X)\big)=0.

This implies that $\mathbb{E}[\zeta(Z,X,\pi)(\iota\mathbf{1}\{T=t\}-P_{t}^{o}(X))]=0$. Hence,

\displaystyle b_{t,k}\mathbb{E}\left[\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_{t}^{o}(X)\right)+P_{t}^{o}(X)\right]=b_{t,k}\mathbb{E}\left[P_{t}^{o}(X)\right]=p_{t,k}^{o}.

This proves the proposition. ∎
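The double robustness just proved can be illustrated numerically. The sketch below specializes to a binary instrument, so that $b_{t,k}\zeta(Z,X,\pi)$ reduces to the single weight $\mathbf{1}\{Z=1\}/\pi(X)$ and the target is $p^{o}=\mathbb{E}[P^{o}(X)]$; the data-generating process and the misspecified nuisances are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000
x = rng.random(n)
pi_o = 0.3 + 0.4 * x                 # true propensity P(Z=1 | X)
z = (rng.random(n) < pi_o).astype(int)
p_o = 0.2 + 0.6 * x                  # true P(T=1 | Z=1, X)
t = z * (rng.random(n) < p_o)        # treatment taken only when Z=1
target = p_o.mean()                  # p^o = E[P^o(X)]

def moment(pi, p):
    """Doubly robust moment (1{Z=1}/pi(X))(T - P(X)) + P(X), averaged."""
    return np.mean(z / pi * (t - p) + p)

pi_bad = np.full(n, 0.5)             # misspecified propensity
p_bad = np.full(n, 0.9)              # misspecified treatment probability

est_pi_ok = moment(pi_o, p_bad)      # correct pi, wrong P:  consistent
est_p_ok = moment(pi_bad, p_o)       # wrong pi, correct P:  consistent
est_both_bad = moment(pi_bad, p_bad) # both wrong:           biased
```

The moment recovers the target when either nuisance is correctly specified, but not when both are misspecified, mirroring the two halves of the proof above.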

Proof of Proposition 4.2.

Since $b_{t,k}$ is a finite vector, it suffices to verify the Neyman orthogonality condition for $\psi_{z}$, which is defined by

\displaystyle\psi_{z}(Y,T,Z,X,\beta_{t,k},Q_{t},P_{t},\pi_{z})
\displaystyle\equiv \big((\mathbf{1}\{Z=z\}/\pi_{z}(X))\left(\mathbf{1}\{T=t\}-P_{t,z}(X)\right)+P_{t,z}(X)\big)\beta_{t,k}
\displaystyle\quad-(\mathbf{1}\{Z=z\}/\pi_{z}(X))\left(Y\mathbf{1}\{T=t\}-Q_{t,z}(X)\right)-Q_{t,z}(X).

We want to show that

\displaystyle\frac{d}{dr}\mathbb{E}\left[\psi_{z}(Y,T,Z,X,\beta_{t,k},Q_{t}^{r},P_{t}^{r},\pi_{z}^{r})\right]\Big|_{r=0}=0,

where $Q_{t}^{r}=Q_{t}^{o}+r(Q_{t}-Q_{t}^{o})$, $P_{t}^{r}=P_{t}^{o}+r(P_{t}-P_{t}^{o})$, and $\pi_{z}^{r}=\pi_{z}^{o}+r(\pi_{z}-\pi_{z}^{o})$. In fact,

\displaystyle\frac{d}{dr}\mathbb{E}\left[\psi_{z}(Y,T,Z,X,\beta_{t,k},Q_{t}^{r},P_{t}^{r},\pi_{z}^{r})\right]\big|_{r=0}
\displaystyle= \mathbb{E}\Big[\frac{-\mathbf{1}\{Z=z\}}{(\pi^{r}_{z}(X))^{2}}\left(\mathbf{1}\{T=t\}-P^{r}_{t,z}(X)\right)\left(\pi_{z}(X)-\pi^{o}_{z}(X)\right)\beta_{t,k}
\displaystyle\quad+\left(P_{t,z}(X)-P^{o}_{t,z}(X)-\frac{\mathbf{1}\{Z=z\}}{\pi^{r}_{z}(X)}\left(P_{t,z}(X)-P^{o}_{t,z}(X)\right)\right)\beta_{t,k}
\displaystyle\quad+\frac{\mathbf{1}\{Z=z\}}{(\pi^{r}_{z}(X))^{2}}\left(Y\mathbf{1}\{T=t\}-Q^{r}_{t,z}(X)\right)\left(\pi_{z}(X)-\pi^{o}_{z}(X)\right)
\displaystyle\quad-(Q_{t,z}(X)-Q^{o}_{t,z}(X))+\frac{\mathbf{1}\{Z=z\}}{\pi^{r}_{z}(X)}\left(Q_{t,z}(X)-Q^{o}_{t,z}(X)\right)\Big]\Big|_{r=0}
\displaystyle= \mathbb{E}\Big[\frac{-\mathbf{1}\{Z=z\}}{(\pi^{o}_{z}(X))^{2}}\left(\mathbf{1}\{T=t\}-P^{o}_{t,z}(X)\right)\left(\pi_{z}(X)-\pi^{o}_{z}(X)\right)\beta_{t,k}
\displaystyle\quad+\left(P_{t,z}(X)-P^{o}_{t,z}(X)-\frac{\mathbf{1}\{Z=z\}}{\pi^{o}_{z}(X)}\left(P_{t,z}(X)-P^{o}_{t,z}(X)\right)\right)\beta_{t,k}
\displaystyle\quad+\frac{\mathbf{1}\{Z=z\}}{(\pi^{o}_{z}(X))^{2}}\left(Y\mathbf{1}\{T=t\}-Q^{o}_{t,z}(X)\right)\left(\pi_{z}(X)-\pi^{o}_{z}(X)\right)
\displaystyle\quad-(Q_{t,z}(X)-Q^{o}_{t,z}(X))+\frac{\mathbf{1}\{Z=z\}}{\pi^{o}_{z}(X)}\left(Q_{t,z}(X)-Q^{o}_{t,z}(X)\right)\Big],

which equals zero because of the following three identities:

\displaystyle\mathbb{E}[\mathbf{1}\{Z=z\}/\pi^{o}_{z}(X)\mid X]=1,
\displaystyle\mathbb{E}[(\mathbf{1}\{Z=z\}/\pi^{o}_{z}(X))(\mathbf{1}\{T=t\}-P^{o}_{t,z}(X))\mid X]=0,
\displaystyle\mathbb{E}[(\mathbf{1}\{Z=z\}/\pi^{o}_{z}(X))(Y\mathbf{1}\{T=t\}-Q^{o}_{t,z}(X))\mid X]=0. ∎
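The orthogonality statement can also be checked by finite differences: the map $r\mapsto\mathbb{E}[\psi_{z}(\cdot;Q_{t}^{r},P_{t}^{r},\pi_{z}^{r})]$ should have (numerically) zero slope at $r=0$, whereas a plug-in moment without the correction terms is first-order sensitive to the nuisances. The following is a sketch under a hypothetical binary-instrument design with $t=z=1$; the perturbation directions and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
x = rng.random(n)
pi_o = 0.3 + 0.4 * x                       # true P(Z=1 | X)
z = (rng.random(n) < pi_o).astype(int)
p_o = 0.2 + 0.6 * x                        # true P(T=1 | Z=1, X)
t = z * (rng.random(n) < p_o)
y = x + t + rng.standard_normal(n)
q_o = p_o * (x + 1.0)                      # Q^o(X) = E[Y 1{T=1} | Z=1, X] under this design
beta = 0.5                                 # beta_{t,k} held fixed

def m(r):
    """Empirical mean of psi_z along the path (Q^r, P^r, pi^r)."""
    pi_r = pi_o + r * 0.1                  # arbitrary perturbation directions
    p_r = p_o + r * 0.1
    q_r = q_o + r * 0.2
    return np.mean((z / pi_r * (t - p_r) + p_r) * beta
                   - z / pi_r * (y * t - q_r) - q_r)

def m_plug(r):
    """Plug-in moment without the correction terms."""
    return np.mean((p_o + r * 0.1) * beta - (q_o + r * 0.2))

h = 1e-3
d_orth = (m(h) - m(-h)) / (2 * h)          # close to 0: Neyman orthogonal
d_plug = (m_plug(h) - m_plug(-h)) / (2 * h)
```

Here $d_{orth}$ is of order $n^{-1/2}$, while $d_{plug}$ equals the deterministic value $0.1\beta-0.2=-0.15$, illustrating why the corrected moment tolerates slow nuisance convergence.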

Proof of Theorem 4.3.

The asserted claims follow from Theorem 3.1, Theorem 3.2, and Corollary 3.2 of Chernozhukov et al. (2018) (henceforth referred to as the DML paper). We want to verify their Assumptions 3.1 and 3.2. Adopting the notation from the DML paper, we let

\displaystyle\psi^{a}(T,Z,X,P_{t},\pi)=-b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota\mathbf{1}\{T=t\}-P_{t}(X)\right)+P_{t}(X)\right)

and

\displaystyle\psi^{b}(Y,T,Z,X,Q_{t},\pi)=b_{t,k}\left(\zeta(Z,X,\pi)\left(\iota(Y\mathbf{1}\{T=t\})-Q_{t}(X)\right)+Q_{t}(X)\right)

so that the linearity of the moment condition (with respect to $\beta_{t,k}$) is verified by the fact that $\psi=\psi^{a}\beta_{t,k}+\psi^{b}$. Define (for simplicity, we drop the superscript $l$ in the nonparametric estimators)

\displaystyle\epsilon_{n}=\max_{z\in\mathcal{Z}}\big(\lVert\hat{Q}_{t,z}-Q_{t,z}^{o}\rVert_{2}\vee\lVert\hat{P}_{t,z}-P_{t,z}^{o}\rVert_{2}\vee\lVert\hat{\pi}_{z}-\pi_{z}^{o}\rVert_{2}\big).

By assumption on the convergence rates of the nonparametric estimators, we have $\epsilon_{n}=o(n^{-1/4})$. Define $C_{\epsilon}=C_{\epsilon,1}\vee C_{\epsilon,2}\vee C_{\epsilon,3}\vee C_{\epsilon,4}$, where $C_{\epsilon,1},C_{\epsilon,2},C_{\epsilon,3},$ and $C_{\epsilon,4}$ are positive constants that depend only on $C$ and $\epsilon$ and are specified later in the proof. Let $\delta_{n}$ be a sequence of positive constants approaching zero and satisfying $\delta_{n}\geq C_{\epsilon}\big(\epsilon_{n}^{2}\sqrt{n}\vee n^{-1/4}\vee n^{-(1-2/q)}\big)$. Such a construction is possible since $\sqrt{n}\epsilon_{n}^{2}=o(1)$. We set the nuisance realization set $N_{n}$ (denoted by $\mathcal{T}_{N}$ in the DML paper) to be the set of all vector functions $(Q_{t},P_{t},\pi_{z}:z\in\mathcal{Z})$ consisting of square-integrable functions $Q_{t,z},P_{t,z},$ and $\pi_{z}$ such that for all $z\in\mathcal{Z}$:

\displaystyle\lVert Q_{t,z}\rVert_{q}\leq C,\quad P_{t,z}\in[0,1],\quad\pi_{z}\in[\epsilon,1],
\displaystyle\lVert Q_{t,z}-Q_{t,z}^{o}\rVert_{q}\vee\lVert P_{t,z}-P_{t,z}^{o}\rVert_{q}\vee\lVert\pi_{z}-\pi_{z}^{o}\rVert_{q}\leq\epsilon_{n},
\displaystyle\lVert\pi_{z}-\pi_{z}^{o}\rVert_{2}\times\big(\lVert Q_{t,z}-Q_{t,z}^{o}\rVert_{2}+\lVert P_{t,z}-P_{t,z}^{o}\rVert_{2}\big)\leq\epsilon_{n}^{2}.

Consider Assumption 3.1 in the DML paper. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2, where the validity of differentiation under the integral sign is verified later in the proof. Assumption 3.1(e), the identification condition, is verified by the condition that $p_{t,k}^{o}\in[\epsilon,1]$. The remaining conditions of Assumption 3.1 in the DML paper are trivially verified.

Next, we consider Assumption 3.2 in the DML paper. Note that Assumption 3.2(a) holds by the construction of $N_{n}$ and $\epsilon_{n}$ and our assumptions on the nuisance estimates. Assumption 3.2(d) is verified by our assumption that the semiparametric efficiency bound of $\beta_{t,k}$ is above $\epsilon$. The remaining task is to verify Assumptions 3.2(b) and 3.2(c) in the DML paper. To do that, we choose $n$ sufficiently large and let $(Q_{t,z},P_{t,z},\pi_{z}:z\in\mathcal{Z})$ be an arbitrary element of the nuisance realization set $N_{n}$. We keep the above notations throughout the remaining part of the proof. Define

\displaystyle\psi^{a}_{z}(T,Z,X,P_{t},\pi_{z})=\frac{\mathbf{1}\{Z=z\}}{\pi_{z}(X)}(\mathbf{1}\{T=t\}-P_{t,z}(X))+P_{t,z}(X)

and

\displaystyle\psi^{b}_{z}(Y,T,Z,X,Q_{t},\pi_{z})=\frac{\mathbf{1}\{Z=z\}}{\pi_{z}(X)}(Y\mathbf{1}\{T=t\}-Q_{t,z}(X))+Q_{t,z}(X).

Since $\psi^{a}$ is a linear combination of $\psi^{a}_{z}$, $z\in\mathcal{Z}$, and $\psi^{b}$ is a linear combination of $\psi^{b}_{z}$, $z\in\mathcal{Z}$, we only need $\lVert\psi^{a}_{z}(T,Z,X,P_{t},\pi_{z})\rVert_{q}$ and $\lVert\psi^{b}_{z}(Y,T,Z,X,Q_{t},\pi_{z})\rVert_{q}$ to be uniformly bounded (i.e., the bounds do not depend on $n$) for $z\in\mathcal{Z}$ in order to verify Assumption 3.2(b) in the DML paper. In fact,

\displaystyle\lVert\psi^{b}_{z}(Y,T,Z,X,Q_{t},\pi_{z})\rVert_{q} \leq\lVert(\mathbf{1}\{Z=z\}/\pi_{z}(X))\lvert Y\mathbf{1}\{T=t\}-Q_{t,z}(X)\rvert\rVert_{q}+\lVert Q_{t,z}(X)\rVert_{q}
\displaystyle\leq \frac{1}{\epsilon}\left(\lVert Y\mathbf{1}\{T=t\}\rVert_{q}+\lVert Q_{t,z}(X)\rVert_{q}\right)+\lVert Q_{t,z}(X)\rVert_{q}\leq 2C/\epsilon+C,

where we have used the assumptions that $\pi_{z}\geq\epsilon$, $\lVert Y\mathbf{1}\{T=t\}\rVert_{q}\leq C$, and $\lVert Q_{t,z}(X)\rVert_{q}\leq C$. Similarly, we have

\displaystyle\lVert\psi^{a}_{z}(T,Z,X,P_{t},\pi_{z})\rVert_{q} \leq\lVert(\mathbf{1}\{Z=z\}/\pi_{z}(X))\lvert\mathbf{1}\{T=t\}-P_{t,z}(X)\rvert\rVert_{q}+\lVert P_{t,z}(X)\rVert_{q}
\displaystyle\leq \frac{1}{\epsilon}\big(1+\lVert P_{t,z}(X)\rVert_{q}\big)+\lVert P_{t,z}(X)\rVert_{q}\leq 2/\epsilon+1,

where we have used the assumptions that $\pi_{z}\geq\epsilon$ and $P_{t,z}\in[0,1]$. Thus, Assumption 3.2(b) in the DML paper is verified.

To verify Assumption 3.2(c) in the DML paper, we again only need to verify the corresponding conditions for $\psi^{a}_{z}$ and $\psi^{b}_{z}$, respectively. For $\psi^{a}_{z}$, we have

\displaystyle\lVert\psi^{a}_{z}(T,Z,X,P_{t},\pi_{z})-\psi^{a}_{z}(T,Z,X,P_{t}^{o},\pi_{z}^{o})\rVert_{2}
\displaystyle\leq \left\lVert\frac{\pi_{z}(X)-\pi_{z}^{o}(X)}{\pi_{z}(X)\pi_{z}^{o}(X)}\right\rVert_{2}+\left\lVert\frac{P_{t,z}(X)}{\pi_{z}(X)}-\frac{P_{t,z}^{o}(X)}{\pi_{z}^{o}(X)}\right\rVert_{2}+\lVert P_{t,z}(X)-P_{t,z}^{o}(X)\rVert_{2}
\displaystyle\leq \frac{1}{\epsilon^{2}}\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\rVert_{2}+\frac{1}{\epsilon^{2}}\lVert(P_{t,z}(X)-P_{t,z}^{o}(X))\pi_{z}^{o}(X)+P_{t,z}^{o}(X)(\pi_{z}^{o}(X)-\pi_{z}(X))\rVert_{2}
\displaystyle\quad+\lVert P_{t,z}(X)-P_{t,z}^{o}(X)\rVert_{2}
\displaystyle\leq \frac{2}{\epsilon^{2}}\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\rVert_{2}+\left(\frac{1}{\epsilon^{2}}+1\right)\lVert P_{t,z}(X)-P_{t,z}^{o}(X)\rVert_{2}\leq C_{\epsilon,1}\epsilon_{n}\leq\delta_{n},

where the second-to-last inequality follows from the fact that $P_{t,z}^{o},\pi_{z}^{o}\in[0,1]$. For $\psi^{b}_{z}$, we have

\displaystyle\lVert\psi^{b}_{z}(Y,T,Z,X,Q_{t},\pi_{z})-\psi^{b}_{z}(Y,T,Z,X,Q_{t}^{o},\pi_{z}^{o})\rVert_{2}
\displaystyle\leq \frac{1}{\epsilon^{2}}\lVert\pi_{z}^{o}(X)(Y\mathbf{1}\{T=t\}-Q_{t,z}(X))-\pi_{z}(X)(Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X))\rVert_{2}+\lVert Q_{t,z}(X)-Q_{t,z}^{o}(X)\rVert_{2}
\displaystyle= \frac{1}{\epsilon^{2}}\lVert(Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X))(\pi^{o}_{z}(X)-\pi_{z}(X))+\pi_{z}^{o}(X)(Q_{t,z}^{o}(X)-Q_{t,z}(X))\rVert_{2}+\lVert Q_{t,z}(X)-Q_{t,z}^{o}(X)\rVert_{2}
\displaystyle\leq \frac{1}{\epsilon^{2}}\lVert(Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X))(\pi^{o}_{z}(X)-\pi_{z}(X))\rVert_{2}+\frac{1}{\epsilon^{2}}\lVert\pi_{z}^{o}(X)(Q_{t,z}^{o}(X)-Q_{t,z}(X))\rVert_{2}+\lVert Q_{t,z}(X)-Q_{t,z}^{o}(X)\rVert_{2}
\displaystyle\leq \frac{C}{\epsilon^{2}}\lVert\pi^{o}_{z}(X)-\pi_{z}(X)\rVert_{2}+\left(\frac{1}{\epsilon^{2}}+1\right)\lVert Q_{t,z}^{o}(X)-Q_{t,z}(X)\rVert_{2}\leq C_{\epsilon,2}\epsilon_{n}\leq\delta_{n},

where the last inequality follows from our assumption that $|Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X)|\leq C$ and the fact that $\pi_{z}^{o}\in[\epsilon,1]$. Combining the two inequalities above, we have verified the first two conditions of Assumption 3.2(c) in the DML paper.

For the last condition of Assumption 3.2(c) in the DML paper, which bounds the second-order Gateaux derivative, we again consider $\psi^{a}_{z}$ and $\psi^{b}_{z}$ separately. For $r\in[0,1)$, recall that $Q_{t,z}^{r}=Q_{t,z}^{o}+r(Q_{t,z}-Q_{t,z}^{o})$, $P_{t,z}^{r}=P_{t,z}^{o}+r(P_{t,z}-P_{t,z}^{o})$, and $\pi^{r}_{z}=\pi^{o}_{z}+r(\pi_{z}-\pi^{o}_{z})$. Clearly, $P_{t,z}^{r},\pi_{z}^{r}\in[0,1]$. Differentiating under the integral, we have

\frac{\partial^{2}}{\partial r^{2}}\mathbb{E}\left[\psi^{a}_{z}(T,Z,X,P_{t}^{r},\pi_{z}^{r})\right]
=\frac{\partial}{\partial r}\mathbb{E}\Big[\frac{-\mathbf{1}\{Z=z\}}{(\pi^{r}_{z}(X))^{2}}\left(\mathbf{1}\{T=t\}-P^{r}_{t,z}(X)\right)\left(\pi_{z}(X)-\pi^{o}_{z}(X)\right)
\quad+P_{t,z}(X)-P^{o}_{t,z}(X)-\frac{\mathbf{1}\{Z=z\}}{\pi^{r}_{z}(X)}\left(P_{t,z}(X)-P^{o}_{t,z}(X)\right)\Big]
=\mathbb{E}\Big[\frac{2\times\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{3}}(\pi_{z}(X)-\pi_{z}^{o}(X))^{2}(\mathbf{1}\{T=t\}-P_{t,z}^{r}(X))\Big]
\quad+\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{2}}(\pi_{z}(X)-\pi_{z}^{o}(X))(P_{t,z}(X)-P_{t,z}^{o}(X))\Big]
\quad+\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{2}}(\pi_{z}(X)-\pi_{z}^{o}(X))(\mathbf{1}\{T=t\}-P_{t,z}^{r}(X))(P_{t,z}(X)-P_{t,z}^{o}(X))\Big]
\quad-\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{\pi_{z}^{r}(X)}(\mathbf{1}\{T=t\}-P_{t,z}^{r}(X))(P_{t,z}(X)-P_{t,z}^{o}(X))^{2}\Big].

Using the facts that $|\mathbf{1}\{T=t\}-P_{t,z}^{r}(X)|\leq 1$ and $\pi_{z}^{r}\geq\epsilon$, we can bound the above derivative by

\Big|\frac{\partial^{2}}{\partial r^{2}}\mathbb{E}\left[\psi^{a}_{z}(T,Z,X,P_{t}^{r},\pi_{z}^{r})\right]\Big|\leq C_{\epsilon}\left(\left\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\right\rVert_{2}^{2}+\left\lVert P_{t,z}(X)-P_{t,z}^{o}(X)\right\rVert_{2}^{2}\right)
\quad+C_{\epsilon}\left\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\right\rVert_{2}\times\left\lVert P_{t,z}(X)-P_{t,z}^{o}(X)\right\rVert_{2}
\leq C_{\epsilon,3}\varepsilon_{n}^{2}\leq\delta_{n}/\sqrt{n}.

Bounding the first and second derivatives uniformly in $r$ justifies the differentiation under the integral, so the Neyman orthogonality condition is verified. Analogously, we can show that

\frac{\partial^{2}}{\partial r^{2}}\mathbb{E}\left[\psi^{b}_{z}(Y,T,Z,X,Q_{t}^{r},\pi_{z}^{r})\right]
=\mathbb{E}\Big[\frac{2\times\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{3}}(\pi_{z}(X)-\pi_{z}^{o}(X))^{2}(Y\mathbf{1}\{T=t\}-Q_{t,z}^{r}(X))\Big]
\quad+\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{2}}(\pi_{z}(X)-\pi_{z}^{o}(X))(Q_{t,z}(X)-Q_{t,z}^{o}(X))\Big]
\quad-\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{(\pi_{z}^{r}(X))^{2}}(\pi_{z}(X)-\pi_{z}^{o}(X))(Y\mathbf{1}\{T=t\}-Q_{t,z}^{r}(X))(Q_{t,z}(X)-Q_{t,z}^{o}(X))\Big]
\quad-\mathbb{E}\Big[\frac{\mathbf{1}\{Z=z\}}{\pi_{z}^{r}(X)}(Y\mathbf{1}\{T=t\}-Q_{t,z}^{r}(X))(Q_{t,z}(X)-Q_{t,z}^{o}(X))^{2}\Big].

Under the assumption $|Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X)|\leq C$, we have

|Y\mathbf{1}\{T=t\}-Q_{t,z}^{r}(X)|\leq|Y\mathbf{1}\{T=t\}-Q_{t,z}^{o}(X)|+r|Q_{t,z}(X)-Q_{t,z}^{o}(X)|\leq C+1,

for all $r\in[0,1]$ and $n$ large enough. We can then bound the above derivative by

\Big|\frac{\partial^{2}}{\partial r^{2}}\mathbb{E}\left[\psi^{b}_{z}(Y,T,Z,X,Q_{t}^{r},\pi_{z}^{r})\right]\Big|\leq C_{\epsilon}\left(\left\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\right\rVert_{2}^{2}+\left\lVert Q_{t,z}(X)-Q_{t,z}^{o}(X)\right\rVert_{2}^{2}\right)
\quad+C_{\epsilon}\left\lVert\pi_{z}(X)-\pi_{z}^{o}(X)\right\rVert_{2}\times\left\lVert Q_{t,z}(X)-Q_{t,z}^{o}(X)\right\rVert_{2}
\leq C_{\epsilon,4}\varepsilon_{n}^{2}\leq\delta_{n}/\sqrt{n}.

Therefore, we have verified the last condition of Assumption 3.2(c) in the DML paper.
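The two properties verified above, a vanishing first Gateaux derivative and a bounded second derivative, can be illustrated numerically. The following sketch uses a hypothetical three-point design for $X$ with made-up values for the true and perturbed nuisance functions (all numbers are illustrative, not from the paper); it evaluates $r\mapsto\mathbb{E}[\psi^{a}_{z}(T,Z,X,P^{r}_{t},\pi^{r}_{z})]$ exactly on the discrete design and checks that its first difference at $r=0$ is of order $r$ while the second difference stays bounded:

```python
import numpy as np

# Toy discrete design (all values hypothetical, for illustration only).
px  = np.array([0.2, 0.5, 0.3])   # distribution of X on three support points
pi0 = np.array([0.4, 0.6, 0.5])   # true propensity pi_z^o(x) = P(Z=z | X=x)
P0  = np.array([0.3, 0.7, 0.5])   # true P_{t,z}^o(x) = P(T=t | Z=z, X=x)
# Perturbed nuisance parameters pi_z and P_{t,z}.
pi1 = pi0 + np.array([0.05, -0.04, 0.03])
P1  = P0 + np.array([-0.06, 0.02, 0.04])

def f(r):
    """E[psi^a_z(T, Z, X, P^r, pi^r)], computed exactly on the discrete design."""
    pir = pi0 + r * (pi1 - pi0)
    Pr = P0 + r * (P1 - P0)
    # E[ 1{Z=z}/pi^r (1{T=t} - P^r) + P^r | X=x ] = pi0/pir * (P0 - Pr) + Pr
    return np.sum(px * (pi0 / pir * (P0 - Pr) + Pr))

h = 1e-3
first_diff = (f(h) - f(0.0)) / h                    # ~ 0: Neyman orthogonality
second_diff = (f(2 * h) - 2 * f(h) + f(0.0)) / h**2  # bounded: second-order term
print(first_diff, second_diff)
```

Because the perturbation enters only through products of the nuisance errors, the first difference is of the order of the product $\Delta\pi\,\Delta P$ times $h$, consistent with the second-order bound derived above.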

Lastly, we need to verify the condition on $\delta_{n}$ in Theorems 3.1 and 3.2 of the DML paper, namely $\delta_{n}\geq n^{-[(1-2/q)\wedge(1/2)]}$. This follows directly from the construction of $\delta_{n}$. ∎

A.4 Proof of Weak IV Inference Results

Proof of Theorem 5.1.

We first prove part (i). Consider applying the DML method to the moment condition (8) to estimate the parameter $\upsilon-\beta_{0}p$ and obtain the standard error. We want to show the convergence in distribution of

\check{\sigma}_{\psi}^{-1}\sqrt{n}\left[(\check{\upsilon}-\beta_{0}\check{p})-(\upsilon-\beta_{0}p)\right]=\check{\rho}-\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi} (A.1)

to the standard normal distribution uniformly over the DGPs in $\mathcal{P}^{\text{WI}}(c_{0},c_{1})$. To do so, we need to verify Assumptions 3.1 and 3.2 in the DML paper for the above moment condition. Assumptions 3.1(a)-(c) hold trivially. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2; that is, the Gateaux derivatives with respect to the nuisance parameters are zero regardless of the value of $\beta$. Assumption 3.1(e), the identification condition, holds because the Jacobian of the moment condition with respect to the parameter is $1$. Assumption 3.2 in the DML paper can be verified in the same way as in the proof of Theorem 4.3; for brevity, we do not repeat the verification here.

For DGPs in $\mathcal{P}^{\text{WI}}_{\beta_{0}}(c_{0},c_{1})$, (A.1) is equal to $\check{\rho}$. Therefore, the uniform convergence in distribution of $|\check{\rho}|$ is established in the null space, and the size of the test is uniformly controlled accordingly. For DGPs in $\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1})$ with $\beta>\beta_{0}$, we have

\check{\rho}=\left(\check{\rho}-\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi}\right)+\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi}
=\left(\check{\rho}-\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi}\right)+\sqrt{n}(\beta-\beta_{0})p/\check{\sigma}_{\psi}.

The first term on the RHS of the last equality converges in distribution to $N(0,1)$. In contrast, the second term diverges to infinity since $\check{\sigma}_{\psi}$ converges in probability to $\sigma_{\psi}\geq\sqrt{c_{0}}$ by Theorem 3.2 in the DML paper. Therefore, the probability of $|\check{\rho}|$ exceeding any finite number converges to one. The case $\beta<\beta_{0}$ is essentially the same.

To prove part (ii) of the theorem, notice that $(\beta-\beta_{0})p\leq 0$ for any DGP in the null space $\bigcup_{\beta\leq\beta_{0}}\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1})$, which implies that $\check{\rho}\leq\check{\rho}-\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi}$. Therefore,

\sup_{P}\mathbb{P}_{P}\big(\check{\rho}>\mathcal{N}_{1-\alpha}\big)\leq\sup_{P}\mathbb{P}_{P}\left(\check{\rho}-\sqrt{n}(\upsilon-\beta_{0}p)/\check{\sigma}_{\psi}>\mathcal{N}_{1-\alpha}\right)\rightarrow\alpha,

where the supremum is taken over $P\in\bigcup_{\beta\leq\beta_{0}}\mathcal{P}^{\text{WI}}_{\beta}(c_{0},c_{1})$. Consistency can be derived in the same way as in part (i). ∎
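In practice, the two-sided test of part (i) amounts to comparing the null-restricted statistic with a standard normal quantile. A minimal sketch of this step (the inputs $\check{\upsilon}$, $\check{p}$, and $\check{\sigma}_{\psi}$ would come from the DML estimation step; the numbers below are placeholders, not estimates from the paper):

```python
from math import sqrt
from statistics import NormalDist

def null_restricted_test(upsilon_check, p_check, sigma_check, n, beta0, alpha=0.05):
    """Two-sided null-restricted test of H0: beta = beta0 (sketch).

    The statistic is rho_check = sqrt(n)(upsilon_check - beta0 * p_check)/sigma_check;
    under H0 it is asymptotically N(0, 1) uniformly over the DGPs considered.
    """
    rho_check = sqrt(n) * (upsilon_check - beta0 * p_check) / sigma_check
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal critical value
    return rho_check, abs(rho_check) > crit

# Illustrative placeholder inputs; in practice they come from the DML step.
rho, reject = null_restricted_test(upsilon_check=0.12, p_check=0.5,
                                   sigma_check=1.0, n=400, beta0=0.2)
print(rho, reject)
```

Because the critical value comes from the null-restricted statistic rather than a Wald statistic, no first-stage strength condition is needed for size control.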

Appendix B Implicitly Defined Parameters

This section studies general parameters defined implicitly through moment conditions. We allow the moment conditions to be non-smooth, as is the case when the parameter of interest is a quantile. We also allow the moment conditions to be overidentifying, which can result from imposing the underlying economic theory on multiple levels of treatment and instrument.

To facilitate the exposition, we define a random variable $Y^{*}_{t,k}$ such that the marginal distribution of $Y^{*}_{t,k}$ is equal to the conditional distribution of $Y_{t}$ given $S\in\Sigma_{t,k}$. The joint distribution of the $Y^{*}_{t,k}$'s is irrelevant and hence left unspecified. For convenience, we use a single index $j\in J$ rather than $(t,k)$ for labeling. That is, we collect the $Y^{*}_{t,k}$'s into the vector $Y^{*}\equiv(Y^{*}_{1},\cdots,Y^{*}_{J})$. Let $t_{j}$ be the treatment level associated with $Y^{*}_{j}$. The quantities $p_{j}$ and $b_{j}$ are defined analogously.[16]

[16] We can further extend the vector $Y^{*}$ to include variables whose marginal distributions are the same as the conditional distributions of $Y_{t}$ given $T=t,S\in\Sigma_{t,k}$. Efficient estimation in this more general case is similar and hence omitted for brevity.

Let the parameter of interest be $\eta$, which lies in the parameter space $\Lambda\subset\mathbb{R}^{d_{\eta}}$, $d_{\eta}\leq J$. The true value $\eta^{o}$ satisfies the moment condition

\mathbb{E}\left[m(Y^{*},\eta^{o})\right]=0,

where $m\colon\mathcal{Y}^{J}\times\mathbb{R}^{d_{\eta}}\rightarrow\mathbb{R}^{J}$ is a vector of functions:

m(Y^{*},\eta)\equiv\left(m_{1}(Y_{1}^{*},\eta),\cdots,m_{J}(Y_{J}^{*},\eta)\right)^{\prime}.

Since the vector $\eta$ appears in each $m_{j}$, restrictions are allowed both within and across different subpopulations. Another interesting feature of this specification is that the moment conditions are defined for random variables that are not observed; their marginal distributions, however, can be identified similarly to Theorem 2.1.

Let $\bar{m}\equiv(\bar{m}_{1}^{\prime},\cdots,\bar{m}_{J}^{\prime})^{\prime}$, where

\bar{m}_{j}(X,\eta)=\left(\bar{m}_{j,z_{1}}(X,\eta),\cdots,\bar{m}_{j,z_{N_{Z}}}(X,\eta)\right)^{\prime}

and

\bar{m}_{j,z}(X,\eta)=\mathbb{E}\left[m_{j}(Y,\eta)\mathbf{1}\{T=t_{j}\}\mid Z=z,X\right].

The functions $\bar{m}_{j,z}$ are identified from the data. Similarly to Theorem 2.1, we can show that the parameter $\eta$ is identified by the moment conditions:

b_{j}\mathbb{E}\left[\bar{m}_{j}(X,\eta)\right]=0,\ 1\leq j\leq J\iff\eta=\eta^{o}.

The following theorem gives the SPEB for the estimation of $\eta$.

Theorem B.1.

Assume the following conditions hold.

  1. (i)

$\mathbb{E}\left[m(Y^{*},\eta)^{2}\right]<\infty$, $\eta\in\Lambda$.

  2. (ii)

For each $j$ and $z$, $\bar{m}_{j,z}$ is continuously differentiable in its second argument. Let $\Gamma$ be the $J\times d_{\eta}$ matrix whose $j$th row is $b_{j}\frac{d}{d\eta}\mathbb{E}\left[\bar{m}_{j}(X,\eta)\right]\big|_{\eta=\eta^{o}}^{\prime}$, and assume that $\Gamma$ has full column rank.

Then for the estimation of $\eta$, the EIF is

-\left(\Gamma^{\prime}V^{-1}\Gamma\right)^{-1}\Gamma^{\prime}V^{-1}\psi^{\eta}(Y,T,Z,X,\eta^{o},\pi^{o},\bar{m}^{o}), (B.1)

where

V=\mathbb{E}\left[\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})^{\prime}\right]

and $\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})$ is a $J\times 1$ random vector whose $j$th element is

b_{j}\left(\zeta(Z,X,\pi)\left(\iota(m_{j}(Y,\eta)\mathbf{1}\{T=t_{j}\})-\bar{m}_{j}(X,\eta)\right)+\bar{m}_{j}(X,\eta)\right). (B.2)

In particular, the semiparametric efficiency bound is $\left(\Gamma^{\prime}V^{-1}\Gamma\right)^{-1}$.

Proof of Theorem B.1.

The proof is based on the approach described in Section 3.6 of Hong and Nekipelov (2010a) and the proof of Theorem 1 in Cattaneo (2010). We use a constant $d_{\eta}\times d_{m}$ matrix $A$ to transform the overidentified vector of moments into an exactly identified system of equations $A\left(b_{j}\mathbb{E}\left[\bar{m}_{j}(X,\eta)\right]\right)_{j=1}^{J}=0$, find the $A$-dependent EIF for the exactly identified parameter, and then choose the optimal $A$. In a parametric submodel, the implicit function theorem gives

\frac{\partial}{\partial\theta}\eta\big|_{\theta=\theta^{o}}=-\left(A\Gamma\right)^{-1}A\frac{\partial}{\partial\theta}\left(b_{j}\mathbb{E}_{\theta}\left[\bar{m}_{j}(X,\eta^{o})\right]\right)_{j=1}^{J}\big|_{\theta=\theta^{o}},

where $\frac{\partial}{\partial\theta}\mathbb{E}_{\theta}\left[\bar{m}_{j}(X,\eta^{o})\right]\big|_{\theta=\theta^{o}}$ is an $N_{Z}\times 1$ vector whose typical element can be represented as

\int m_{j}(y,\eta^{o})\mathbf{1}\{\tau=t_{j}\}s_{z}(y,\tau\mid x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})\,dy\,d\tau\,dx
+\int m_{j}(y,\eta^{o})\mathbf{1}\{\tau=t_{j}\}s_{X}(x;\theta^{o})f_{z}(y,\tau\mid x;\theta^{o})f_{X}(x;\theta^{o})\,dy\,d\tau\,dx,

for $z\in\mathcal{Z}$. So the EIF for this exactly identified parameter is

\psi^{A}(Y,T,Z,X,\eta^{o},\pi^{o},\bar{m}^{o})=-\left(A\Gamma\right)^{-1}A\psi^{\eta}(Y,T,Z,X,\eta^{o},\pi^{o},\bar{m}^{o}),

where $\psi^{\eta}$ is defined by Equation (B.2). It is straightforward to verify that $\psi^{A}$ satisfies $\frac{\partial}{\partial\theta}\eta\big|_{\theta=\theta^{o}}=\mathbb{E}\left[\psi^{A}s_{\theta^{o}}^{\prime}\right]$ and $\psi^{A}\in\mathscr{S}$. The optimal $A$ is chosen by minimizing the sandwich matrix $\mathbb{E}\left[\psi^{A}(\psi^{A})^{\prime}\right]=\left(A\Gamma\right)^{-1}A\mathbb{E}\left[\psi^{\eta}(\psi^{\eta})^{\prime}\right]A^{\prime}\left(\Gamma^{\prime}A^{\prime}\right)^{-1}$. Thus, the EIF for the overidentified parameter is obtained at $A=\Gamma^{\prime}V^{-1}$. Plugging this expression into $\psi^{A}$, we obtain Equation (B.1). ∎
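The optimality of $A=\Gamma^{\prime}V^{-1}$ can be illustrated numerically: this choice attains the bound $(\Gamma^{\prime}V^{-1}\Gamma)^{-1}$ exactly, while any other admissible $A$ yields a sandwich matrix that is weakly larger in the positive semidefinite order. A minimal sketch with randomly generated $\Gamma$ and $V$ (dimensions and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
J, d = 4, 2                        # J moments, d parameters (hypothetical sizes)
G = rng.normal(size=(J, d))        # Gamma, full column rank (almost surely)
B = rng.normal(size=(J, J))
V = B @ B.T + J * np.eye(J)        # a positive definite covariance matrix V

def sandwich(A):
    """(A G)^{-1} A V A' (G' A')^{-1}: asymptotic variance for weighting matrix A."""
    AG_inv = np.linalg.inv(A @ G)
    return AG_inv @ A @ V @ A.T @ AG_inv.T

bound = np.linalg.inv(G.T @ np.linalg.inv(V) @ G)   # (Gamma' V^{-1} Gamma)^{-1}
opt = sandwich(G.T @ np.linalg.inv(V))              # optimal A = Gamma' V^{-1}
arbitrary = sandwich(rng.normal(size=(d, J)))       # any other admissible A

# The optimal A attains the bound; any other A gives a weakly larger variance.
print(np.allclose(opt, bound))
print(np.linalg.eigvalsh(arbitrary - bound).min() >= -1e-8)
```

The positive semidefinite ordering here is the matrix analogue of the scalar Gauss-Markov comparison: the difference of the two variance matrices has nonnegative eigenvalues up to numerical error.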

Note that if, for example, $m_{j}(Y^{*}_{j},\eta)=Y^{*}_{j}-\eta$, then $\eta=\beta_{j}$, and the efficiency bound shown above reduces to the one computed in Theorem 3.1. If $T=Z$, that is, the treatment is unconfounded, then Theorem B.1 reduces to Theorem 1 in Cattaneo (2010).

For estimation, we use the EIFs to generate moment conditions and propose a three-step semiparametric GMM procedure. The criterion function is

\Psi^{\eta}_{n}(\eta,\pi,\bar{m})=\frac{1}{n}\sum_{i=1}^{n}\psi^{\eta}(Y_{i},T_{i},Z_{i},X_{i},\eta,\pi,\bar{m}). (B.3)

Its probability limit is denoted as

\Psi^{\eta}(\eta,\pi,\bar{m})=\mathbb{E}\left[\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})\right], (B.4)

where the expectation is taken with respect to the distribution indexed by the true parameters $(\pi^{o},\bar{m}^{o})$. The implementation procedure is as follows. Assume that we have nonparametric estimators $\hat{\pi}$ and $\hat{m}$ that consistently estimate $\pi^{o}$ and $\bar{m}^{o}$, respectively. We first find a consistent GMM estimator $\tilde{\eta}$ using the identity matrix as the weighting matrix, that is,

\left\lVert\Psi^{\eta}_{n}(\tilde{\eta},\hat{\pi},\hat{m})\right\rVert_{2}\leq\inf_{\eta\in\Lambda}\left\lVert\Psi^{\eta}_{n}(\eta,\hat{\pi},\hat{m})\right\rVert_{2}+o_{p}(1). (B.5)

Next, we use this estimate to form a consistent estimator $\hat{V}$ of the covariance matrix $V$, where

\hat{V}=\frac{1}{n}\sum_{i=1}^{n}\psi^{\eta}(Y_{i},T_{i},Z_{i},X_{i},\tilde{\eta},\hat{\pi},\hat{m})\psi^{\eta}(Y_{i},T_{i},Z_{i},X_{i},\tilde{\eta},\hat{\pi},\hat{m})^{\prime}.

Then we let $\hat{\eta}$ be the optimally weighted GMM estimator:

\Psi^{\eta}_{n}(\hat{\eta},\hat{\pi},\hat{m})^{\prime}\hat{V}^{-1}\Psi^{\eta}_{n}(\hat{\eta},\hat{\pi},\hat{m})
\leq\inf_{\eta\in\Lambda}\Psi^{\eta}_{n}(\eta,\hat{\pi},\hat{m})^{\prime}\hat{V}^{-1}\Psi^{\eta}_{n}(\eta,\hat{\pi},\hat{m})+o_{p}\big(n^{-1/2}\big).

To conduct inference, we estimate $\Gamma$ using the estimator $\hat{\Gamma}$ whose elements are defined as

\hat{\Gamma}_{jl}=\frac{1}{n}\sum_{i=1}^{n}b_{j}\frac{\partial}{\partial\eta_{l}}\hat{m}_{j}(X_{i},\eta)\Big|_{\eta=\hat{\eta}},

where we have implicitly assumed that the estimator $\hat{m}_{j}$ is differentiable in its second argument.
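The three steps above can be sketched in a stylized scalar example, assuming the nuisance estimates have already been plugged in so that the moment contributions are directly computable. Here $J=2$ overidentifying mean restrictions of the form $m_{j}(Y^{*}_{j},\eta)=Y^{*}_{j}-\eta$ are simulated with $\eta^{o}=1$ (all data and dimensions are illustrative), and the linear structure makes each GMM step available in closed form:

```python
import numpy as np

# Simulated data: two noisy "subpopulation" measurements of the same eta^o = 1,
# the second twice as noisy as the first.
rng = np.random.default_rng(1)
n = 5000
Y = 1.0 + np.column_stack([rng.normal(size=n), 2.0 * rng.normal(size=n)])

def psi(eta):
    return Y - eta                 # n x J matrix of moment contributions

ones = np.ones(2)
ybar = Y.mean(axis=0)
# Step 1: identity weighting; for these linear moments the minimizer is closed form.
eta_tilde = ones @ ybar / (ones @ ones)
# Step 2: covariance matrix of the moments at the preliminary estimate.
V_hat = psi(eta_tilde).T @ psi(eta_tilde) / n
W = np.linalg.inv(V_hat)
# Step 3: optimal weighting; the noisier second moment receives less weight.
eta_hat = (ones @ W @ ybar) / (ones @ W @ ones)
print(round(eta_tilde, 3), round(eta_hat, 3))
```

Both estimators are consistent, but the optimally weighted $\hat{\eta}$ downweights the noisier moment and attains the smaller asymptotic variance, mirroring the role of $\hat{V}^{-1}$ in the general procedure.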

In the following theorem, we derive the asymptotic properties of the GMM estimators. The main theoretical difficulty is that the random criterion function $\Psi_{n}(\cdot,\hat{\pi},\hat{m})$ can be discontinuous because we allow $m(Y^{*},\cdot)$ to be discontinuous. We use the theory developed in Chen et al. (2003) to overcome this problem.[17] Let $\Pi_{z}$ be the function class that contains $\pi_{z}^{o}$, and let $\mathcal{M}_{j,z}$ be the function class that contains $\bar{m}_{j,z}^{o}$.

[17] Cattaneo (2010) instead uses the theory from Pakes and Pollard (1989). However, the general theory of Chen et al. (2003) is more straightforward to apply in this case since they explicitly assume the presence of infinite-dimensional nuisance parameters, which can depend on the parameters to be estimated.

Theorem B.2.

Let the assumptions in Theorem B.1 hold. Assume the following conditions hold.

  1. (i)

The parameter space $\Lambda$ is compact. The true parameter $\eta^{o}$ is in the interior of $\Lambda$.

  2. (ii)

For any $j,z$ and $\bar{m}_{j,z}\in\mathcal{M}_{j,z}$, there exists $C>0$ such that for $\delta>0$ sufficiently small,

\sup_{\left\lvert\eta^{\prime}-\eta\right\rvert\leq\delta}\mathbb{E}\left\lvert\bar{m}_{j,z}(X,\eta^{\prime})-\bar{m}_{j,z}(X,\eta)\right\rvert^{2}\leq C\delta^{2}.
  3. (iii)

    Donsker properties:

\int_{0}^{\infty}\log N(\varepsilon,\Pi_{z},\left\lVert\cdot\right\rVert_{\infty})d\varepsilon,\ \int_{0}^{\infty}\log N(\varepsilon,\mathcal{M}_{j,z},\left\lVert\cdot\right\rVert_{\infty})d\varepsilon<\infty,

where $N(\varepsilon,\mathcal{F},\left\lVert\cdot\right\rVert)$ denotes the covering number of the space $(\mathcal{F},\left\lVert\cdot\right\rVert)$.

  4. (iv)

    Convergence rates of the nonparametric estimators:

\left\lVert\hat{\pi}_{z}-\pi^{o}_{z}\right\rVert_{\infty},\ \lVert\hat{m}_{j,z}-\bar{m}_{j,z}^{o}\rVert_{\infty}=o_{p}(n^{-1/4}).
  5. (v)

The function $\sup_{\eta\in\Lambda}\left\lvert\frac{\partial}{\partial\eta}\bar{m}^{o}_{j}(\cdot,\eta)\right\rvert$ is integrable. The estimator $\frac{\partial}{\partial\eta}\hat{m}_{j}$ is consistent uniformly in its second argument, that is,

\left\lVert\frac{\partial}{\partial\eta}\hat{m}_{j}(x,\eta)-\frac{\partial}{\partial\eta}\bar{m}^{o}_{j}(x,\eta)\right\rVert_{\infty}=o_{p}(1),\ \forall x.

Then $\tilde{\eta}=\eta^{o}+o_{p}(1)$, $\hat{V}=V+o_{p}(1)$, $\hat{\Gamma}=\Gamma+o_{p}(1)$, and

\sqrt{n}\left(\hat{\eta}-\eta^{o}\right)\implies N\left(\bm{0},(\Gamma^{\prime}V^{-1}\Gamma)^{-1}\right),

where $\bm{0}$ denotes a vector of zeros.

The following lemma is helpful for proving Theorem B.2.

Lemma B.3.

Under the assumptions of Theorem B.1, the class

\mathcal{F}\equiv\left\{\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})\colon\eta\in\Lambda,\ \pi_{z}\in\Pi_{z},\ \bar{m}_{j,z}\in\mathcal{M}_{j,z},\ 1\leq j\leq J,\ z\in\mathcal{Z}\right\}

is Donsker with a finite integrable envelope. Moreover, the following stochastic equicontinuity condition holds: for any positive sequence $\delta_{n}=o(1)$,

\sup\big\{\Psi^{\eta}_{n}(\eta,\pi,\bar{m})-\Psi^{\eta}(\eta,\pi,\bar{m})-\Psi^{\eta}_{n}(\eta^{o},\pi^{o},\bar{m}^{o})\colon
\left\lVert\eta-\eta^{o}\right\rVert_{2}\vee\left\lVert\pi-\pi^{o}\right\rVert_{\infty}\vee\left\lVert\bar{m}-\bar{m}^{o}\right\rVert_{\infty}\leq\delta_{n}\big\}=o_{p}\big(n^{-1/2}\big),

where the supremum is taken over $\eta\in\Lambda$, $\pi_{z}\in\Pi_{z}$, and $\bar{m}_{j,z}\in\mathcal{M}_{j,z}$.

Proof of Lemma B.3.

We first verify that the moment function $\psi^{\eta}$ satisfies Condition (3.2) of Theorem 3 in Chen et al. (2003) (hereafter CLK). In fact, when $\lVert\bar{m}^{\prime}_{j,z}-\bar{m}_{j,z}\rVert_{\infty}\vee\lVert\eta^{\prime}-\eta\rVert_{\infty}\leq\delta$, the triangle inequality gives

\mathbb{E}\left\lvert\bar{m}^{\prime}_{j,z}(X,\eta^{\prime})-\bar{m}_{j,z}(X,\eta)\right\rvert^{2}
\leq 2\mathbb{E}\left\lvert\bar{m}^{\prime}_{j,z}(X,\eta^{\prime})-\bar{m}^{\prime}_{j,z}(X,\eta)\right\rvert^{2}+2\mathbb{E}\left\lvert\bar{m}^{\prime}_{j,z}(X,\eta)-\bar{m}_{j,z}(X,\eta)\right\rvert^{2}
\leq const\times\delta^{2},

where $const$ denotes a generic constant that may take different values at each appearance. The last inequality follows from assumption (ii). Similarly, we can verify that the remaining terms in $\psi^{\eta}$ satisfy the same condition. Therefore, $\psi^{\eta}$ is locally uniformly $L_{2}$-continuous, that is,

\mathbb{E}\big[\sup\big\{\left\lvert\psi^{\eta}(Y,T,Z,X,\eta^{\prime},\pi^{\prime},\bar{m}^{\prime})-\psi^{\eta}(Y,T,Z,X,\eta,\pi,\bar{m})\right\rvert\colon
\left\lVert\eta^{\prime}-\eta\right\rVert\vee\left\lVert\pi^{\prime}-\pi\right\rVert_{\infty}\vee\left\lVert\bar{m}^{\prime}-\bar{m}\right\rVert_{\infty}\leq\delta\big\}\big]\leq const\times\delta^{2}.

Following the same steps as in the proof of Theorem 3 in CLK (p. 1607), we can show that the bracketing number of $\mathcal{F}$ is bounded by

N_{[]}\big(\varepsilon,\mathcal{F},\left\lVert\cdot\right\rVert_{L_{2}}\big)
\leq N(\varepsilon/const,\Lambda,\left\lVert\cdot\right\rVert)\times\prod_{z}N(\varepsilon/const,\Pi_{z},\left\lVert\cdot\right\rVert)\times\prod_{j,z}N(\varepsilon/const,\mathcal{M}_{j,z},\left\lVert\cdot\right\rVert).

Therefore, the bracketing entropy of the class $\mathcal{F}$ is bounded by

\log N_{[]}\big(\varepsilon,\mathcal{F},\left\lVert\cdot\right\rVert_{L_{2}}\big)
\leq const\times\Big(\log N(\varepsilon/const,\Lambda,\left\lVert\cdot\right\rVert)\vee\max_{z}\log N(\varepsilon/const,\Pi_{z},\left\lVert\cdot\right\rVert)
\quad\vee\max_{j,z}\log N(\varepsilon/const,\mathcal{M}_{j,z},\left\lVert\cdot\right\rVert)\Big).

Under the assumption that $\Lambda$ is compact and

\int_{0}^{\infty}\log N(\varepsilon,\Pi_{z},\left\lVert\cdot\right\rVert)d\varepsilon,\ \int_{0}^{\infty}\log N(\varepsilon,\mathcal{M}_{j,z},\left\lVert\cdot\right\rVert)d\varepsilon<\infty,\ \forall j,z,

we have that

\int_{0}^{\infty}\log N_{[]}\big(\varepsilon,\mathcal{F},\left\lVert\cdot\right\rVert_{L_{2}}\big)d\varepsilon<\infty.

This implies that $\mathcal{F}$ is Donsker with a finite integrable envelope. Lastly, as stated in Lemma 1 of CLK, the asserted stochastic equicontinuity condition is implied by the facts that $\mathcal{F}$ is Donsker and $\psi^{\eta}$ is $L_{2}$-continuous. ∎

Proof of Theorem B.2.

We follow the large-sample theory in CLK and set $\theta=\eta$, $h=(\pi,\bar{m})$, $M(\theta,h)=\Psi^{\eta}(\eta,\pi,\bar{m})$, and $M_{n}(\theta,h)=\Psi^{\eta}_{n}(\eta,\pi,\bar{m})$.

We first use Theorem 1 in CLK to show the consistency of $\tilde{\eta}$. Condition (1.2) in CLK is satisfied because $\Lambda$ is compact and $\Psi^{\eta}(\eta,\pi^{o},\bar{m}^{o})$ has a unique zero and is continuous, by condition (ii) of Theorem B.1. As for Condition (1.3) of CLK, the expression of $\Psi^{\eta}$ shows that it is continuous with respect to $\bar{m}_{j,z}$ and $\pi_{z}$ (since $\pi_{z}$ is bounded away from zero), and the uniformity in $\eta$ follows from the fact that $\mathbb{E}\left[m(Y^{*},\eta)\right]$ is bounded as a function of $\eta$. Condition (1.4) of CLK is satisfied by the assumptions of Theorem B.2. The uniform stochastic equicontinuity condition (1.5) of CLK is implied by Lemma B.3. Therefore, $\tilde{\eta}=\eta^{o}+o_{p}(1)$.

We use Corollary 1 (which is based on Theorem 2) in CLK to show the consistency of $\hat{V}$ and the asymptotic normality of $\hat{\eta}$. Condition (2.2) in CLK is verified by the assumptions of Theorem B.1. Similarly to the proof of Proposition 4.2, we can show that the moment condition $\Psi^{\eta}$, based on the EIF, satisfies the Neyman orthogonality condition for the nuisance parameters $\pi$ and $\bar{m}$. In fact, for any $j$ and $z$, we let $\pi_{z}^{r}=\pi^{o}_{z}(X)+r(\pi_{z}(X)-\pi^{o}_{z}(X))$ and $\bar{m}^{r}_{j,z}(X,\eta)=\bar{m}^{o}_{j,z}(X,\eta)+r\big(\bar{m}_{j,z}(X,\eta)-\bar{m}^{o}_{j,z}(X,\eta)\big)$. Then we have

\frac{d}{dr}\mathbb{E}\left[\frac{\mathbf{1}\{Z=z\}}{\pi^{r}_{z}(X)}\left(m_{j}(Y,\eta)\mathbf{1}\{T=t_{j}\}-\bar{m}^{r}_{j,z}(X,\eta)\right)+\bar{m}^{r}_{j,z}(X,\eta)\right]\Bigg|_{r=0}
=\mathbb{E}\Bigg[-\frac{\mathbf{1}\{Z=z\}}{\left(\pi^{o}_{z}(X)\right)^{2}}\left(\pi_{z}(X)-\pi_{z}^{o}(X)\right)\left(m_{j}(Y,\eta)\mathbf{1}\{T=t_{j}\}-\bar{m}^{o}_{j,z}(X,\eta)\right)
\quad+\left(\bar{m}^{o}_{j,z}(X,\eta)-\bar{m}_{j,z}(X,\eta)\right)\left(\frac{\mathbf{1}\{Z=z\}}{\pi^{o}_{z}(X)}-1\right)\Bigg]=0,

where we have applied the law of iterated expectations and used the fact that

\mathbb{E}\left[\frac{\mathbf{1}\{Z=z\}}{\pi^{o}_{z}(X)}\left(m_{j}(Y,\eta)\mathbf{1}\{T=t_{j}\}-\bar{m}^{o}_{j,z}(X,\eta)\right)\Big|X\right]=0.

Thus, the pathwise derivative of $\Psi^{\eta}$ with respect to $h=(\pi,\bar{m})$ is zero in any direction. Hence, Condition (2.3) of CLK is verified. Condition (2.4) in CLK follows directly from the assumptions of Theorem B.2. The stochastic equicontinuity condition (Condition (2.5) in CLK) follows from Lemma B.3, and Condition (2.6) in CLK is verified using the central limit theorem since the pathwise derivative is zero. Due to the presence of $\hat{V}$, we also need the uniform convergence condition in Corollary 1 of CLK, which can be verified using Lemma B.3 and an application of Theorem 2.10.14 of van der Vaart and Wellner (1996).

Lastly, to show the consistency of $\hat{\Gamma}$, we only need to show that

\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\eta}\hat{m}_{j,t_{j},z}(X_{i},\hat{\eta})\overset{p}{\rightarrow}\mathbb{E}\left[\frac{\partial}{\partial\eta}\bar{m}_{j,z}^{o}(X,\eta^{o})\right]=\frac{\partial}{\partial\eta}\mathbb{E}\left[\bar{m}_{j,z}^{o}(X,\eta^{o})\right],

where the equality follows from differentiating under the integral, which is valid under the last assumption of the theorem. The convergence in probability follows from the uniform convergence of $\frac{\partial}{\partial\eta}\hat{m}_{j,z}$ and the consistency of $\hat{\eta}$. Therefore, the desired convergence results follow. ∎

References

  • Abadie (2003) Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of econometrics 113(2), 231–263.
  • Ackerberg et al. (2014) Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014). Asymptotic efficiency of semiparametric two-step gmm. Review of Economic Studies 81(3), 919–943.
  • Andrews et al. (2020) Andrews, D. W., X. Cheng, and P. Guggenberger (2020). Generic results for establishing the asymptotic size of confidence sets and tests. Journal of Econometrics 218(2), 496–531.
  • Andrews and Armstrong (2017) Andrews, I. and T. B. Armstrong (2017). Unbiased instrumental variables estimation under known first-stage sign. Quantitative Economics 8(2), 479–503.
  • Angrist and Imbens (1995) Angrist, J. D. and G. W. Imbens (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American statistical Association 90(430), 431–442.
  • Angrist et al. (1996) Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association 91(434), 444–455.
  • Belloni and Chernozhukov (2011) Belloni, A. and V. Chernozhukov (2011). 1\ell_{1}-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39(1), 82–130.
  • Belloni and Chernozhukov (2013) Belloni, A. and V. Chernozhukov (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547.
  • Bickel et al. (1993) Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and adaptive estimation for semiparametric models, Volume 4. Springer, New York.
  • Blandhol et al. (2022) Blandhol, C., J. Bonney, M. Mogstad, and A. Torgovitsky (2022). When is TSLS actually LATE? University of Chicago, Becker Friedman Institute for Economics Working Paper (2022-16).
  • Bühlmann and van de Geer (2011) Bühlmann, P. and S. van de Geer (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
  • Cattaneo (2010) Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155(2), 138–154.
  • Chen et al. (2004) Chen, X., H. Hong, and A. Tarozzi (2004). Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects.
  • Chen et al. (2008) Chen, X., H. Hong, and A. Tarozzi (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics 36(2), 808–843.
  • Chen et al. (2003) Chen, X., O. Linton, and I. Van Keilegom (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica 71(5), 1591–1608.
  • Chen and Santos (2018) Chen, X. and A. Santos (2018). Overidentification in regular models. Econometrica 86(5), 1771–1817.
  • Chernozhukov et al. (2018) Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1), C1–C68.
  • Chernozhukov et al. (2016) Chernozhukov, V., J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
  • Finkelstein et al. (2012) Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and O. H. S. Group (2012). The Oregon Health Insurance Experiment: evidence from the first year. The Quarterly Journal of Economics 127(3), 1057–1106.
  • Frölich (2007) Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics 139(1), 35–75.
  • Galindo (2020) Galindo, C. (2020). Empirical challenges of multivalued treatment effects. Technical report, Job market paper.
  • Hahn (1998) Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 315–331.
  • Hansen (2008) Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24(3), 726–748.
  • Heckman and Pinto (2018a) Heckman, J. J. and R. Pinto (2018a). Unordered monotonicity. Econometrica 86(1), 1–35.
  • Heckman and Pinto (2018b) Heckman, J. J. and R. Pinto (2018b). Web appendix for unordered monotonicity. Econometrica 86(1), 1–35.
  • Hong and Nekipelov (2010a) Hong, H. and D. Nekipelov (2010a). Semiparametric efficiency in nonlinear LATE models. Quantitative Economics 1(2), 279–304.
  • Hong and Nekipelov (2010b) Hong, H. and D. Nekipelov (2010b). Supplement to “Semiparametric efficiency in nonlinear LATE models”. Quantitative Economics 1(2), 279–304.
  • Ichimura and Newey (2022) Ichimura, H. and W. K. Newey (2022). The influence function of semiparametric estimators. Quantitative Economics 13(1), 29–61.
  • Imbens and Angrist (1994) Imbens, G. W. and J. D. Angrist (1994). Identification and estimation of local average treatment effects. Econometrica 62(2), 467–475.
  • Imbens and Manski (2004) Imbens, G. W. and C. F. Manski (2004). Confidence intervals for partially identified parameters. Econometrica 72(6), 1845–1857.
  • Kitagawa (2015) Kitagawa, T. (2015). A test for instrument validity. Econometrica 83(5), 2043–2063.
  • Kline and Walters (2016) Kline, P. and C. R. Walters (2016). Evaluating public programs with close substitutes: The case of head start. The Quarterly Journal of Economics 131(4), 1795–1848.
  • Lee and Salanié (2018) Lee, S. and B. Salanié (2018). Identifying effects of multivalued treatments. Econometrica 86(6), 1939–1963.
  • Masry (1996) Masry, E. (1996). Multivariate local polynomial regression for time series: uniform strong consistency and rates. Journal of Time Series Analysis 17(6), 571–599.
  • Mikusheva (2007) Mikusheva, A. (2007). Uniform inference in autoregressive models. Econometrica 75(5), 1411–1452.
  • Newey (1990) Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics 5(2), 99–135.
  • Newey (1994) Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, 1349–1382.
  • Okui et al. (2012) Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012). Doubly robust instrumental variable regression. Statistica Sinica, 173–205.
  • Pakes and Pollard (1989) Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica: Journal of the Econometric Society, 1027–1057.
  • Pinto (2021) Pinto, R. (2021). Beyond intention to treat: Using the incentives in moving to opportunity to identify neighborhood effects. UCLA Working paper.
  • Sun (2021) Sun, Z. (2021). Instrument validity for heterogeneous causal effects. arXiv preprint arXiv:2009.01995.
  • Tan (2006) Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association 101(476), 1607–1618.
  • van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics, Volume 3. Cambridge University Press.
  • van der Vaart and Wellner (1996) van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.
  • Vytlacil (2002) Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica 70(1), 331–341.