Efficient and Robust Estimation of the Generalized LATE Model
Abstract
This paper studies the estimation of causal parameters in the generalized local average treatment effect (GLATE) model, a generalization of the classical LATE model encompassing multi-valued treatments and instruments. We derive the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for two types of parameters: the local average structural function (LASF) and the local average structural function for the treated (LASF-T). The moment condition generated by the EIF satisfies two robustness properties: double robustness and Neyman orthogonality. Based on the robust moment condition, we propose double/debiased machine learning (DML) estimators for the LASF and LASF-T. The DML estimator is semiparametrically efficient and suitable for high-dimensional settings. We also propose null-restricted inference methods that are robust against weak identification issues. As an empirical application, we study how the effects of health insurance vary across different sources of coverage by applying the developed methods to the Oregon Health Insurance Experiment.
Keywords: Causal Inference, Double Robustness, Efficient Influence Function, Multi-valued Treatment, Neyman Orthogonality, Oregon Health Insurance Experiment, Unordered Monotonicity, Weak Identification.
1 Introduction
Since the seminal works of Imbens and Angrist (1994) and Angrist et al. (1996), the local average treatment effect (LATE) model has become popular for causal inference in economics. Instead of imposing homogeneity of the treatment effects as in the classical instrumental variable (IV) regression model, the LATE framework allows the treatment effect to vary across individuals. Under the monotonicity condition, the average treatment effect can be identified for a subgroup of individuals whose treatment choice complies with the change in instrument levels.
The standard LATE model, however, accommodates only binary treatment variables. This restriction is inconvenient in many economic settings where the treatment is multi-leveled in nature. For example, parents select different preschool programs for their children, schools assign students to different classroom sizes, families relocate to various neighborhoods in housing experiments, and people choose different sources of health insurance. To apply the LATE model to these settings, researchers often need to redefine the treatment so that there are only two treatment levels. However, merging the treatment levels can complicate the task of program evaluation and dampen the causal interpretation of the estimates. As pointed out by Kline and Walters (2016), if the original treatment levels are substitutes, then there is ambiguity regarding which causal parameters are of interest. After merging the treatment levels, the heterogeneity in the treatment effect across different treatment levels is lost.
This paper addresses the above issues by generalizing the LATE framework to incorporate the potential multiplicity in treatment levels directly. We call the new framework the generalized LATE (GLATE) model. The main assumption of the GLATE model is the unordered monotonicity assumption proposed by Heckman and Pinto (2018a), which is a generalization of the monotonicity assumption in the binary LATE model.¹ To distinguish it from the GLATE model, we sometimes use the terminology “binary LATE model” to refer to the LATE model studied by Imbens and Angrist (1994) and Abadie (2003).
We generalize the identification results in Heckman and Pinto (2018a) to explicitly account for the presence of conditioning covariates, which is often important in practical settings. Recently, Blandhol et al. (2022) point out that linear TSLS, the common way to control for covariates in empirical studies, does not bear the LATE interpretation. The only specifications that have LATE interpretations are the ones that control for covariates nonparametrically. Therefore, it is essential from the causal analysis perspective to incorporate the covariates into the GLATE framework in a nonparametric way.
The causal parameters identifiable in the GLATE model include local average structural function (LASF) and local average structural function for the treated (LASF-T). LASF is the mean potential outcome for specific subpopulations. These subpopulations are defined by their treatment choice behaviors and are generalizations of the concepts always takers, compliers, and never takers in the binary LATE model. The parameter LASF-T further restricts the subpopulation to exclude individuals who do not take up the treatment.
This paper is concerned with the econometric aspects of the GLATE model. The analysis begins by deriving the efficient influence function (EIF) and the semiparametric efficiency bound (SPEB) for the identified parameters. The calculation is based on the method outlined in Chapter 3 of Bickel et al. (1993) and in Newey (1990). We then verify that the conditional expectation projection (CEP) estimator (e.g., Chen et al., 2008), constructed directly from the identification result, achieves the SPEB and hence is semiparametrically efficient. Using these results, we may efficiently estimate other important parameters of interest by the plug-in method, since a standard delta-method argument preserves semiparametric efficiency.
The EIF not only facilitates the efficiency calculation but can also serve as the moment condition for estimation. This is because the EIF is mean zero by construction and equals the original identification result plus an adjustment term due to the presence of infinite-dimensional parameters. We show that the moment condition constructed from the EIF satisfies two related robustness properties: double robustness and Neyman orthogonality. Double robustness guarantees that the moment condition is correctly specified in a parametric setting even when some of the working models for the nuisance parameters are misspecified.
The Neyman orthogonality condition means that the moment condition is insensitive to the nuisance parameters. This condition is particularly useful when the conditioning covariates are of high dimension. To further utilize this condition, we study the double/debiased machine learning (DML) estimator (Chernozhukov et al., 2018) in the GLATE setting. Under certain conditions regarding the convergence rate of the first-step nonparametric estimators, the DML estimator is asymptotically normal uniformly over a large class of data generating processes (DGPs).
The weak identification issue is a practical concern in the GLATE model. This is because both the treatment and the instrument are multi-valued, and hence the subpopulation on which the LASF and LASF-T are defined can be small. To deal with this issue, we propose null-restricted test statistics for one-sided and two-sided testing problems. This procedure is a generalization of the well-known Anderson-Rubin (AR) test. We show that the proposed tests are consistent and control size uniformly across a large class of DGPs, in which the size of the subpopulation mentioned above can be arbitrarily close to zero.
The paper is organized as follows. The remaining part of this section discusses the literature. Section 2 introduces the GLATE model and the nonparametric identification results. Section 3 calculates the EIF and SPEB. Section 4 discusses the robustness properties of the moment condition generated by the EIF. Section 5 proposes inference procedures under weak identification issues. Section 6 presents the empirical application. Section 7 concludes. The proofs for theoretical results in the main text are collected in Appendix A.
1.1 Literature Review
The GLATE model provides a way to conduct causal inference under endogeneity when the treatment is multi-valued and unordered. As mentioned above, the identification result (conditional on the covariates) was first established in Heckman and Pinto (2018a) using the unordered monotonicity condition. Lee and Salanié (2018) propose another method of identification in a similar model of multi-valued treatment. Their method is concerned with continuous instruments, while the GLATE model is framed in terms of discrete-valued instruments. When the treatment levels are ordered, Angrist and Imbens (1995) derive identification and estimation results for the causal parameter, which is a weighted average of LATEs across different treatment levels.
The literature on semiparametric efficiency in program evaluation starts with the seminal work of Hahn (1998), which studies the benchmark case of estimating the average treatment effect (ATE) under unconfoundedness. For multi-level treatments, Cattaneo (2010) studies the efficient estimation of causal parameters implicitly defined through over-identified non-smooth moment conditions. In the case where unconfoundedness fails and instruments are present, Frölich (2007) calculates the SPEB for the LATE parameter, and Hong and Nekipelov (2010a) extend the analysis to parameters implicitly defined by moment restrictions. In a more general framework encompassing missing data, Chen et al. (2004) and Chen et al. (2008) study semiparametric efficiency bounds and efficient estimation of parameters defined through overidentifying moment restrictions. However, there is currently no theoretical research on semiparametrically efficient estimation in models that encompass endogeneity and unordered multiple treatment levels.
Several ways are available for calculating the EIF for semiparametric estimators, as illustrated by Newey (1990) and Ichimura and Newey (2022). Semiparametric efficiency calculations can be used to construct robust (Neyman orthogonal) moment conditions. This method is illustrated in Newey (1994) and Chernozhukov et al. (2016). Based on the Neyman orthogonality condition, Chernozhukov et al. (2018) introduces the DML method that suits high dimensional settings. This is because Donsker properties and stochastic equicontinuity conditions are no longer required in deriving the asymptotic distribution of the semiparametric estimator.
For testing the GLATE model, Sun (2021) proposes a bootstrap test that generalizes and improves upon the test studied by Kitagawa (2015) in the binary LATE model.
The GLATE model has received attention in the recent empirical literature due to its ability to model multi-valued treatment. Kline and Walters (2016) evaluate the cost-effectiveness of Head Start, classifying Head Start and other preschool programs as different treatment levels against the control group of no preschool. Galindo (2020) assesses the impact of different childcare choices in Colombia on children’s development. Pinto (2021) studies the neighborhood effects and voucher effects in housing allocations using data from the Moving to Opportunity experiment. Our theoretical analysis of the GLATE model presents important tools for estimation and inference that can be applied to those empirical settings.
2 Identification in the GLATE Model
This section describes the generalized local average treatment effect (GLATE) model, discusses identification of the local average structural function (LASF) and other parameters, and introduces the notation.
2.1 The model
We assume a finite collection of instrument values $\mathcal{Z} = \{z_1, \dots, z_{N_Z}\}$ and a finite collection of treatment values $\mathcal{T} = \{t_1, \dots, t_{N_T}\}$, where $N_Z$ and $N_T$ are respectively the total number of instrument and treatment levels. The sets $\mathcal{Z}$ and $\mathcal{T}$ are categorical and unordered. The instrumental variable $Z$ denotes which of the instrument levels is realized. The random variables $T(z)$, $z \in \mathcal{Z}$, each taking values in $\mathcal{T}$, denote the collection of potential treatments under each instrument status. Thus, the observed treatment level is the random variable $T = T(Z)$. For each given treatment level $t \in \mathcal{T}$, there is a potential outcome $Y(t)$. The observed outcome is denoted by $Y = Y(T)$. The random vector $X$ contains the set of covariates. The observed data is a random sample $\{(Y_i, T_i, Z_i, X_i)\}_{i=1}^n$.
The description above establishes a random sampling model where the researcher only observes one potential outcome, the one associated with the observed treatment. This implies that the outcome $Y$, observed from an individual with treatment $T = t$, comes from the conditional distribution of $Y(t)$ given $T = t$ rather than from the marginal distribution of $Y(t)$. In general, this fact leads to identification issues and presents challenges for causal inference. To overcome these problems, we impose further structure on the model.
Assumption 1 (Conditional Independence).
$Z \perp \big( \{Y(t)\}_{t \in \mathcal{T}},\, \{T(z)\}_{z \in \mathcal{Z}} \big) \,\big|\, X$.
Assumption 2 (Unordered Monotonicity).
For any $t \in \mathcal{T}$ and $z, z' \in \mathcal{Z}$, either

$$\mathbb{1}\{T(z) = t\} \;\ge\; \mathbb{1}\{T(z') = t\} \quad \text{almost surely, conditional on } X,$$

or

$$\mathbb{1}\{T(z) = t\} \;\le\; \mathbb{1}\{T(z') = t\} \quad \text{almost surely, conditional on } X.$$
Assumptions 1 and 2 provide the multi-valued analog of Assumption 2.1 in Abadie (2003). Assumption 1 requires that the instrument be independent of the potential treatments and outcomes once we condition on $X$. Assumption 2 is the conditional version of the unordered monotonicity condition proposed by Heckman and Pinto (2018a). It means that when we focus on a particular treatment level and a pair of instrument values, the binary environment should satisfy the usual monotonicity constraint in the LATE model. Specifically, the unordered monotonicity condition requires that a shift in the instrument moves all agents uniformly toward or against each possible treatment value.² As pointed out by Vytlacil (2002), the LATE monotonicity condition is a restriction across individuals on the relationship between different hypothetical treatment choices defined in terms of an instrument.
We define the type of an individual as the vector of the potential treatments, that is, $S = (T(z_1), \dots, T(z_{N_Z}))'$.
By construction, $S$ is not observed. Assumption 2, the unordered monotonicity condition, is essentially a restriction on $\operatorname{supp}(S)$, the support of $S$. Denote the elements in $\operatorname{supp}(S)$ by $s_1, \dots, s_{N_S}$, where $N_S$ is the cardinality of $\operatorname{supp}(S)$. A convenient way to characterize $\operatorname{supp}(S)$ is by using the matrix $R = [s_1, \dots, s_{N_S}]$, whose columns are the possible types. The matrix $R$ is referred to as the response matrix since it describes how each type of individual's treatment choice responds to the instrument.
The role of $S$ is to assist the identification of the counterfactual outcomes by dividing the population into a finite number of groups, where identification can be achieved within specific groups. Those groups are defined as follows. For $t \in \mathcal{T}$ and $k = 0, 1, \dots, N_Z$, let $\Sigma_t^k$ be the set of types in which the treatment level $t$ appears exactly $k$ times. That is,

$$\Sigma_t^k \;=\; \Big\{ s \in \operatorname{supp}(S) \,:\, \sum_{j=1}^{N_Z} \mathbb{1}\{ s[j] = t \} = k \Big\},$$

where $s[j]$ denotes the $j$th element of the vector $s$. In particular, the collection $\{\Sigma_t^k\}_{k=0}^{N_Z}$ forms a partition of $\operatorname{supp}(S)$.
For individuals with types in the same type set $\Sigma_t^k$, their treatment response in terms of $t$ is, in a way, homogeneous. Thus, it is intuitively easier to identify the marginal distribution of the potential outcome within each $\Sigma_t^k$. More specifically, for a type set $\Sigma$ and treatment level $t$, we define the local average structural function (LASF) $\beta_t(\Sigma) = \operatorname{E}[Y(t) \mid S \in \Sigma]$ and the local average structural function for the treated (LASF-T) $\beta_t^{\mathrm{T}}(\Sigma) = \operatorname{E}[Y(t) \mid S \in \Sigma, T = t]$.
Before presenting the identification results for the above two classes of parameters, we illustrate the GLATE model in the following two examples.
Example 1 (Binary LATE model).
In the binary LATE model of Imbens and Angrist (1994), there are two treatment levels $\mathcal{T} = \{0, 1\}$ and two instrument levels $\mathcal{Z} = \{0, 1\}$. There are three types: , which are referred to in the literature as defiers, compliers, and always-takers, respectively. The type set contains the defiers, the compliers, and the always-takers. The response matrix is the following binary matrix
The local average treatment effect is the treatment effect for the compliers, which can be written as the difference between two LASFs: $\mathrm{LATE} = \beta_1(\Sigma_c) - \beta_0(\Sigma_c)$, where $\Sigma_c$ denotes the type set of compliers.
Example 2 (Three treatment levels and two instrument levels).
The simplest GLATE model (excluding the binary case in Example 1) has three treatment levels $\mathcal{T} = \{t_1, t_2, t_3\}$ and two instrument levels $\mathcal{Z} = \{z_0, z_1\}$. There are five types specified as the columns in the following response matrix

$$R \;=\; \begin{pmatrix} t_1 & t_2 & t_3 & t_2 & t_3 \\ t_1 & t_2 & t_3 & t_1 & t_1 \end{pmatrix},$$

where the two rows correspond to the instrument values $z_0$ and $z_1$.
In this example, a shift from $z_0$ to $z_1$ moves all agents uniformly toward the treatment level $t_1$. The type set $\{(t_1, t_1)'\}$ contains the type that always chooses the treatment $t_1$ and thus can be referred to as the $t_1$-always taker. The same applies to $t_2$ and $t_3$. The type $(t_2, t_1)'$ switches from $t_2$ to $t_1$ and hence can be considered a $t_2$-switcher (or $t_2$-complier). Similarly, we can refer to $(t_3, t_1)'$ as the $t_3$-switcher. This model is used in Kline and Walters (2016) to study the causal effect of the Head Start preschool program. The instrument indicates whether the household receives a Head Start offer, and the treatment levels are Head Start, other preschool programs, and no preschool. The unordered monotonicity condition means that anyone who changes behavior as a result of the Head Start offer does so to attend Head Start.
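As a quick illustration, unordered monotonicity can be checked mechanically on a candidate response matrix. The following Python sketch is our own illustration (the function name and the integer coding of the treatment levels are hypothetical): for every treatment level and every pair of instrument rows, the indicator vectors across types must be componentwise ordered.

```python
import numpy as np

def satisfies_unordered_monotonicity(R):
    """Check Assumption 2 on a response matrix R (rows: instrument values,
    columns: types). For every treatment level t and every pair of rows
    (z, z'), the indicators 1{R[z] == t} and 1{R[z'] == t} must be ordered
    componentwise: one row weakly dominates the other across all types."""
    for t in np.unique(R):
        B = (R == t).astype(int)                # the binary matrix B_t
        for i in range(B.shape[0]):
            for j in range(B.shape[0]):
                d = B[i] - B[j]
                if (d > 0).any() and (d < 0).any():
                    return False                # rows not comparable
    return True

# Example 2's response matrix, coding (t1, t2, t3) as (1, 2, 3):
R = np.array([[1, 2, 3, 2, 3],
              [1, 2, 3, 1, 1]])
print(satisfies_unordered_monotonicity(R))      # True
```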
2.2 Identification Results
We introduce some matrix notation related to the type $S$. For each treatment level $t$, let $B_t$ be a binary matrix of the same dimension as the response matrix $R$, with each element of $B_t$ signifying whether the corresponding element in the response matrix is $t$. That is, $B_t[i, j]$, the $(i, j)$th element of $B_t$, indicates whether $T(z_i)$ equals $t$ for the subpopulation $s_j$. Let $B_t^+$ denote the Moore–Penrose inverse of $B_t$.
For convenience, we also need some notations regarding conditional expectations. Let
be the vector of functions that describes the conditional distribution of the instrument . For each treatment level , let
be the vector that describes the conditional treatment probabilities given each level of the instrument. Denote
as the vector that contains the conditional outcomes for each treatment level . Notice that the functions , , and are all identified.
Theorem 2.1 (Identification of LASF).
Theorem 2.1 identifies the size of the subpopulation $\Sigma_t^k$ and the local average structural function for that subpopulation. The only exception, where identification fails, is the type set $\Sigma_t^0$, in which case the individuals never choose the treatment $t$. This identification result is a modification of Theorem T-6 in Heckman and Pinto (2018a) that explicitly accounts for the presence of covariates $X$. Bayes' rule is applied to convert the conditional result into the unconditional one. The following theorem presents the identification result for the LASF-T.
Let be the set of instrument values that induce the treatment level in the type set . That is, , where denotes the th element of the vector . Then define as the total probability of those instrument values.
Theorem 2.2 (Identification of LASF-T).
The identification results are illustrated using the two examples.
Example 3 (Example 1, continued).
Since the treatment is binary, the matrix $B_1$ is equal to the response matrix $R$. The matrix and its generalized inverse are respectively
The matrix and its generalized inverse are respectively
The vectors and are respectively
Theorem 2.1 implies that
The two denominators in the above expressions are both equal to the type probability of compliers. Then the usual identification of the LATE parameter (e.g., Frölich, 2007) follows:

$$\mathrm{LATE} \;=\; \frac{\int \big( \operatorname{E}[Y \mid X = x, Z = 1] - \operatorname{E}[Y \mid X = x, Z = 0] \big)\, f(x)\, dx}{\int \big( \operatorname{E}[T \mid X = x, Z = 1] - \operatorname{E}[T \mid X = x, Z = 0] \big)\, f(x)\, dx},$$

where $f$ denotes the marginal density function of $X$.
3 Semiparametric Efficiency
In this section, we calculate the semiparametric efficiency bound (SPEB) and propose estimators that achieve such bounds. We focus on the parameters LASF and LASF-T. In Appendix B, we study general parameters implicitly defined through moment restrictions.
3.1 LASF and LASF-T
For the rest of the paper, we assume that the relevant outcome variables have finite second moments; this is necessary since we are studying efficiency. Let $\mathbf{1}$ denote the column vector of ones and $\operatorname{diag}(a)$ the diagonal matrix with diagonal elements given by the vector $a$. The following theorem gives the efficient influence function (EIF) and the SPEB for the parameters identified in the preceding section.
Theorem 3.1 (SPEB for LASF and LASF-T).
Let Assumptions 1 and 2 hold. Let and . Assume that .
(i) The semiparametric efficiency bound for is given by the variance of the efficient influence function (2).

(ii) The semiparametric efficiency bound for is given by the variance of the efficient influence function

(iii) The semiparametric efficiency bound for is given by the variance of the efficient influence function

(iv) The semiparametric efficiency bound for is given by the variance of the efficient influence function
The EIF in Theorem 3.1 can be interpreted as the moment condition from the identification results modified by an adjustment term due to the presence of unknown infinite-dimensional parameters. Taking the EIF in (2) as an example, the terms
and
are respectively the adjustment terms due to the presence of and .
From the expression of , we can see that the SPEB would be large when is small. This is because measures the size of the subpopulation on which the LASF is estimated. When is small, we run into the weak identification issue. In Section 5, we study inference procedures that are robust against weak identification issues.
One benefit of the EIFs is that we can easily calculate the covariance matrix of different estimators. Consider an example where we are interested in two LASFs $\beta_1$ and $\beta_2$, whose EIFs are given by $\psi_1$ and $\psi_2$, respectively. If the two estimators $\hat\beta_1$ and $\hat\beta_2$ are both semiparametrically efficient, then their asymptotic covariance matrix equals $\operatorname{E}[\psi \psi']$, where $\psi = (\psi_1, \psi_2)'$.
Example 5 (Example 1, continued).
The derived SPEB helps determine whether an estimation procedure is efficient. In this section, we focus on the conditional expectation projection (CEP) estimator.⁴ The terminology “conditional expectation projection” is adopted from Chen et al. (2008) and Hong and Nekipelov (2010a), whereas Hahn (1998) refers to these estimators as “nonparametric imputation based estimators.” Define
The CEP procedure first estimates $P$, $P_t$, and $Q_t$ by using nonparametric estimators $\hat P$, $\hat P_t$, and $\hat Q_t$, respectively. These estimators can be constructed based on series or local polynomial estimation. Then and are estimated using and . The vectors of estimators and are stacked in an obvious way. Let . The CEP estimators for the structural parameters are defined by
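To make the construction concrete, here is a minimal Python sketch of the CEP plug-in estimator in the binary special case of Example 1, where the target is the complier mean of $Y(1)$; the learner choice and all function names are our own, not the paper's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_cond_mean(V, X, Z, z):
    """Nonparametric regression of V on X within the subsample {Z = z}.
    X is a 2-D array of covariates; the learner choice is illustrative."""
    model = GradientBoostingRegressor(max_depth=2, n_estimators=200)
    model.fit(X[Z == z], V[Z == z])
    return model.predict

def cep_lasf_complier_y1(Y, D, Z, X):
    """CEP (plug-in) estimate of the complier mean of Y(1), binary case:
        beta = E[ m1(X) - m0(X) ] / E[ q1(X) - q0(X) ],
    where m_z(x) = E[Y*D | X=x, Z=z] and q_z(x) = Pr[D=1 | X=x, Z=z] play
    the roles of the identified functions Q_t and P_t in the text."""
    m1 = fit_cond_mean(Y * D, X, Z, 1)
    m0 = fit_cond_mean(Y * D, X, Z, 0)
    q1 = fit_cond_mean(D.astype(float), X, Z, 1)
    q0 = fit_cond_mean(D.astype(float), X, Z, 0)
    num = np.mean(m1(X) - m0(X))   # sample analog of the numerator
    den = np.mean(q1(X) - q0(X))   # estimated complier probability
    return num / den
```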
The next proposition shows that the CEP estimators are semiparametrically efficient. The result is similar in style to Hahn's (1998) Proposition 4 in that the low-level regularity conditions are omitted. Instead, the proposition assumes the high-level condition that the CEP estimators are asymptotically linear, meaning they are asymptotically equivalent to sample averages. More formally, an estimator of is asymptotically linear if it admits an influence function; that is, there exists an iid sequence with zero mean and finite variance such that
Since each element of the conditional expectations , , and can be considered as coming from a binary LATE model, the regularity conditions in Hong and Nekipelov (2010b) should work with little modification.
Proposition 3.2.
Suppose the CEP estimators are asymptotically linear, then they achieve the semiparametric efficiency bound.
The reason that this type of estimator is efficient is well explained in Ackerberg et al. (2014). The estimation problem here falls into their general semiparametric model, where the finite-dimensional parameter of interest is defined by unconditional moment restrictions. They show that the semiparametric two-step optimally weighted GMM estimators, the CEP estimators in this case, achieve the efficiency bound since the parameters of interest are exactly identified. Discussions related to this phenomenon can also be found in Chen and Santos (2018).
We next examine the efficient estimation of other policy-relevant parameters that can be derived from the parameters . As an example, consider the type set , which is referred to as -switchers. This subpopulation contains individuals who switch between and other treatments when given different levels of the instrument. It is a generalization of the concept of compliers in the binary LATE framework.⁵ Recall that switchers are also illustrated in Example 2. The LASF for the subpopulation is given by
Similarly, one can also define
(3)
which represents the LASF-T for the subpopulation of -treated -switchers.
For some subpopulations, a treatment effect can be identified. This point has already been illustrated in the binary case (Example 3), where the usual LATE parameter is identified as the difference of two LASFs. We further illustrate this point with Example 2.
Example 6 (Example 2, continued).
The quantity
represents the local average treatment effect of against other treatments within the subpopulation of -switchers. Analogously, the parameter
is the local average treatment effect of against other treatments within the subpopulation of -treated -switchers.
To summarize the above examples using a general expression, let $\phi = f(\theta)$ be a finite-dimensional parameter, where $f$ is a known continuously differentiable function and $\theta$ is the vector containing all identifiable LASF and LASF-T parameters. Let $\hat\theta$ denote the corresponding vector of CEP estimators, so that a natural plug-in estimator is $\hat\phi = f(\hat\theta)$. The delta method yields the efficiency bound of $\phi$ and the efficiency of $\hat\phi$. In fact, by Theorem 25.47 of van der Vaart (1998), we immediately have the following corollary, which shows that plug-in estimators are efficient.
Corollary 3.3.
The semiparametric efficiency bound of is given by the variance of efficient influence function
$$\psi_\phi \;=\; \nabla f(\theta)' \, \psi_\theta, \tag{4}$$

where the partial derivatives are evaluated at the true parameter value. Moreover, the plug-in estimator $\hat\phi = f(\hat\theta)$, based on the CEP estimators $\hat\theta$, achieves the efficiency bound.
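As a small illustration of the corollary, delta-method inference for a smooth transformation can be computed from efficient estimates of the LASF vector together with their estimated influence-function values. This is a sketch with hypothetical names; `Psi` stacks the estimated EIF values for each observation.

```python
import numpy as np

def plug_in_inference(f, grad_f, theta_hat, Psi):
    """Delta-method inference for phi = f(theta), per Corollary 3.3.
    theta_hat : (d,) efficient (e.g., CEP) estimates of the LASF vector
    Psi       : (n, d) estimated influence-function values for theta_hat
    grad_f    : callable returning the (d,) gradient of f at a point."""
    phi_hat = f(theta_hat)
    psi_phi = Psi @ grad_f(theta_hat)   # estimated EIF of the plug-in, cf. (4)
    se = psi_phi.std() / np.sqrt(Psi.shape[0])
    return phi_hat, se

# Example usage: a LATE-type contrast phi = theta[0] - theta[1].
# phi_hat, se = plug_in_inference(lambda th: th[0] - th[1],
#                                 lambda th: np.array([1.0, -1.0]),
#                                 theta_hat, Psi)
```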
4 Robustness
In the previous section, the EIF was used as a tool for computing the SPEB. In this section, we directly use the EIF as the moment condition for estimation. These moment conditions are appealing because they satisfy double robustness and Neyman orthogonality (local robustness), the two topics of this section.
A word on notation: in the rest of the paper, we use a superscript to signify the true value whenever necessary. For example, when both and appear, the former means the true probability while the latter denotes a generic function.
4.1 Double Robustness
We focus on the LASF . The same analysis can be applied to the other parameters. To avoid notational burden in the main text, we drop the subscript in , , and , and the subscript in and .⁶ The full subscripts are kept in the Appendices. It is straightforward to verify that the EIF has zero mean. However, we do not want to use itself as the estimating equation since it contains as a factor. To deal with this problem, we simply multiply by and define
The corresponding moment condition is
(5)
This moment condition is doubly robust, as demonstrated in the following proposition.
Proposition 4.1 (Double Robustness).
Let be an arbitrary vector of functions and the true vector of conditional expectations. Then
and
The above proposition divides the nonparametric nuisance parameters into two groups, and . The doubly robust moment condition is valid if either of these two groups of nuisance parameters is true. On the other hand, if the researcher uses parametric models for these nuisance parameters, then the structural parameter can be recovered provided that at least one of the working nuisance models is correctly specified. Therefore, the doubly robust moment condition is “less demanding” on the researcher’s ability to devise a correctly specified model for the nuisance parameters. The double robustness result in Proposition 4.1 can be seen as the GLATE extension of the existing results in the binary LATE literature (e.g., Tan, 2006; Okui et al., 2012).
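To illustrate Proposition 4.1 numerically, the following simulation sketch specializes the doubly robust moment to the binary case (an AIPW-type construction; the data-generating process and all names are our own illustration, not the paper's exact GLATE moment). Deliberately misspecifying either the instrument propensity or the conditional-mean functions, but not both, still recovers the true complier mean of $1.5 = 1 + \operatorname{E}[X]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# DGP: X ~ U(0,1); instrument propensity p0(x) = 0.3 + 0.4x;
# types: complier 50%, always-taker 25%, never-taker 25%; Y(1) = 1 + X + eps.
X = rng.uniform(size=n)
Z = rng.uniform(size=n) < 0.3 + 0.4 * X
u = rng.uniform(size=n)
D = np.where(u < 0.5, Z, u < 0.75)       # compliers take D = Z
Y = (1 + X + rng.normal(size=n)) * D     # only Y(1)*D enters the moments
YD = Y * D

# True nuisance functions implied by this DGP:
m1 = lambda x: 0.75 * (1 + x)            # E[YD | X=x, Z=1]
m0 = lambda x: 0.25 * (1 + x)            # E[YD | X=x, Z=0]
q1 = lambda x: 0.75 + 0 * x              # Pr[D=1 | X=x, Z=1]
q0 = lambda x: 0.25 + 0 * x              # Pr[D=1 | X=x, Z=0]

def dr_estimate(p, m1, m0, q1, q0):
    """Doubly robust estimate of E[Y(1) | complier] (true value 1.5 here)."""
    w1, w0 = Z / p(X), (1 - Z) / (1 - p(X))
    num = m1(X) - m0(X) + w1 * (YD - m1(X)) - w0 * (YD - m0(X))
    den = q1(X) - q0(X) + w1 * (D - q1(X)) - w0 * (D - q0(X))
    return num.mean() / den.mean()

true_p, wrong_p = lambda x: 0.3 + 0.4 * x, lambda x: 0.5 + 0 * x
wrong_m = lambda x: 0 * x                # badly misspecified outcome models

print(dr_estimate(true_p, m1, m0, q1, q0))               # ~1.5 (all correct)
print(dr_estimate(wrong_p, m1, m0, q1, q0))              # ~1.5 (true m, q only)
print(dr_estimate(true_p, wrong_m, wrong_m, q1, q0))     # ~1.5 (true p only)
```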
4.2 Neyman Orthogonality
The second robustness property is Neyman orthogonality. Moment conditions with this property have reduced sensitivity with respect to the nuisance parameters. Formally, Neyman orthogonality means that the moment condition has zero Gateaux derivative with respect to the nuisance parameters. The result is presented in the following proposition.
Proposition 4.2 (Neyman Orthogonality).
Let be an arbitrary set of functions. For , define and . Suppose that is integrable, then
where does not need to be the true parameter value.
In many econometric models, double robustness and Neyman orthogonality come in pairs. Discussions of their general relationship can be found in Chernozhukov et al. (2016). In practice, double robustness is often used for parametric estimation, as previously explained, whereas Neyman orthogonality is used in estimation with possibly high-dimensional nuisance parameters.
Next, we apply the double/debiased machine learning (DML) method developed by Chernozhukov et al. (2018) to the moment condition (5). This estimation method works even when the nuisance parameter space is complex enough that the traditional assumptions, e.g., Donsker properties, are no longer valid.⁷ In two-step semiparametric estimation, Donsker properties are usually required so that a suitable stochastic equicontinuity condition is satisfied; see, for example, Assumption 2.5 in Chen et al. (2003). The implementation details are explained below.
The nuisance parameters , , and are estimated using a cross-fitting method: take a $K$-fold random partition of the data such that each fold is of size $n/K$. For $k = 1, \dots, K$, let $I_k$ denote the set of observation indices in the $k$th fold and $I_k^c$ the set of observation indices not in the $k$th fold. Define , , and to be the estimates constructed by using data from $I_k^c$. The DML estimator of is constructed following the moment condition (5):⁸ This is the DML2 estimator defined in Chernozhukov et al. (2018). Another estimator, the DML1 estimator, is proposed in the same paper. We do not study the DML1 estimator since it is asymptotically equivalent to DML2, and the authors generally recommend DML2.
(6)
To conduct inference, we also need an estimate of the asymptotic variance of , which we denote by . The asymptotic variance equals the expectation of the squared efficient influence function: . We first estimate by the cross-fitting method; the estimator is essentially given by the denominator of (6):
(7)
Then the asymptotic variance can be estimated by
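For concreteness, here is a cross-fitted (DML2-style) sketch of the estimator (6) and the variance construction, again specialized to the binary case with the doubly robust moment from Section 4.1; the fold logic follows the description above, while the learners and names are our own.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def dml_lasf(Y, D, Z, X, K=5, seed=0,
             make_reg=lambda: GradientBoostingRegressor(max_depth=2),
             make_clf=lambda: GradientBoostingClassifier(max_depth=2)):
    """Cross-fitted estimate of the complier mean of Y(1) (binary case),
    following the DML2 construction in (6); Z must be coded 0/1."""
    n = len(Y)
    num, den = np.zeros(n), np.zeros(n)
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        # First stage on the complement fold I_k^c.
        p = make_clf().fit(X[train], Z[train])
        def cmean(V, z):  # regression of V on X within {Z = z} of the fold
            idx = train[Z[train] == z]
            return make_reg().fit(X[idx], V[idx]).predict(X[test])
        m1, m0 = cmean(Y * D, 1), cmean(Y * D, 0)
        q1, q0 = cmean(D.astype(float), 1), cmean(D.astype(float), 0)
        ph = np.clip(p.predict_proba(X[test])[:, 1], 0.01, 0.99)
        w1, w0 = Z[test] / ph, (1 - Z[test]) / (1 - ph)
        yd = Y[test] * D[test]
        num[test] = m1 - m0 + w1 * (yd - m1) - w0 * (yd - m0)
        den[test] = q1 - q0 + w1 * (D[test] - q1) - w0 * (D[test] - q0)
    beta = num.sum() / den.sum()            # DML2 estimator, cf. (6)
    psi = (num - beta * den) / den.mean()   # plug-in influence-function values
    return beta, psi.std() / np.sqrt(n), num, den
```

The returned per-observation moments `num` and `den` are reused in the weak-identification tests of Section 5 below.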
We want to establish the convergence results for the DML estimator uniformly over a class of data generating processes (DGPs) defined as follows. For any two constants , let be the set of joint distributions of such that
(i) ,

(ii) , and .
The first condition excludes the case where is weakly identified (when can be arbitrarily close to zero). Inference under weak identification is studied in the next section. The following theorem establishes the asymptotic properties of the DML estimation procedure. In particular, the estimator achieves the SPEB.
Theorem 4.3.
Let Assumptions 1 and 2 hold. Assume the following conditions on the nuisance parameter estimators :
(i) For , is bounded, and , and is bounded away from zero.

(ii) .
Then the estimator obeys that
uniformly over the DGPs in . Moreover, the above convergence result continues to hold when is replaced by the estimator .
The proof verifies the conditions of Theorem 3.1 in Chernozhukov et al. (2018). The essential restriction is on the uniform convergence rate of the estimators of the nuisance parameters. In low-dimensional settings, one can consider local polynomial regression for estimating the conditional expectations. Under suitable conditions (Hansen, 2008; Masry, 1996), the uniform convergence rate of the local polynomial estimators is , which is if . In high-dimensional settings, as pointed out by Chernozhukov et al. (2018), the rate is often available for common machine learning methods under structured assumptions on the nuisance parameters.⁹ This includes the LASSO method under sparsity of the nuisance space; see, for example, Bühlmann and van de Geer (2011), Belloni and Chernozhukov (2011), and Belloni and Chernozhukov (2013). However, Chernozhukov et al. (2018) also indicate that proving that machine learning methods achieve the rate eventually requires related entropy conditions, under which the asymptotic normality of the DML estimator continues to hold.
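Under such sparsity assumptions, one could, for example, swap the first-stage learners in the cross-fitting sketch above for l1-penalized regressions (hypothetical usage, with data arrays `Y, D, Z, X` as before):

```python
from sklearn.linear_model import LassoCV, LogisticRegressionCV

# l1-penalized first stages for high-dimensional X, plugged into the
# dml_lasf sketch from Section 4.2 above.
beta, se, num, den = dml_lasf(
    Y, D, Z, X, K=5,
    make_reg=lambda: LassoCV(cv=5),
    make_clf=lambda: LogisticRegressionCV(cv=5, penalty="l1", solver="saga"),
)
```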
Theorem 4.3 can be directly used to conduct inference on . Confidence regions can be constructed by inverting the usual -tests. These confidence regions are uniformly valid since the convergence results in the above theorem hold uniformly over . In the next section, we explain why uniform validity is crucial when dealing with weak identification issues.
5 Weak Identification
The convergence result established in Theorem 4.3 is uniform over the set of DGPs with type probability bounded away from zero. However, the identification of would be weak in the case where can be arbitrarily close to zero. This leads to distortion of the uniform size of the test and poor asymptotic approximation in finite-sample settings. This section studies this weak identification issue and proposes an inference procedure that is robust against such a problem.
We begin with a heuristic illustration of the weak identification problem. To ease notation, define and
After a simple calculation, we can write
In the above expression, we can interpret the estimation errors and as the noises, while the signal is the term . Under the usual asymptotics where is fixed, the noise terms are bounded in probability, whereas the signal term . Hence, the signal dominates the noise, and the estimator is consistent. However, under asymptotics with a drifting sequence and converging to a finite constant, the signal and the noise are of the same magnitude, which results in the inconsistency of . This problem is the weak identification issue. In the weak IV literature, a common measure of identification strength is the so-called concentration parameter. In our case, the concentration parameter is given by where corresponds to strong identification, and identification is weak when the limit of is finite.
While weak identification is a finite-sample issue, it is formalized using the asymptotic framework. However, the illustration above using asymptotics under drifting sequences is not meant to model DGPs that vary with the sample size . Instead, it is a tool used to detect the lack of uniform convergence. In fact, controlling the uniform size of the test is the key to solving weak identification problems.¹⁰ See, for example, Imbens and Manski (2004), Mikusheva (2007), and Andrews et al. (2020). Formally, the uniform size of a test is the large-sample limit of the supremum of the rejection probability under the null hypothesis, where the supremum is taken over the nuisance parameter space. When testing a null hypothesis on in the GLATE model, the supremum mentioned above is taken over all values of . That is, a desirable test should have rejection probability under the null converging to the nominal size uniformly over . From the previous discussion, we can see that the uniform size cannot be controlled using the usual -statistic . This failure of uniform convergence, however, does not conflict with Theorem 4.3, where the uniform convergence of is established only after restricting to be bounded away from zero.
Inference procedures that are robust against weak identification can be obtained by directly imposing the null hypothesis in the construction of the test statistic. One such example is the well-known Anderson-Rubin (AR) statistic in the weak IV literature. Its idea can be generalized to the GLATE model. We first consider testing the two-sided hypothesis versus . To control the uniform size of the test, we need the test statistic to converge uniformly on the parameter space where (1) , and (2) is allowed to be arbitrarily close to zero. A null-restricted -statistic can be obtained as follows. Notice that when , is equivalent to
(8)

Its estimate can be written as

(9)
Under the null hypothesis , the above estimate does not depend on the concentration parameter and consists only of the noise terms and , whose uniform convergence can be established directly.
For implementation, this test statistic can be obtained as a straightforward application of the DML procedure described in the previous section to the moment condition (8). As a consequence of Proposition 4.2, the above moment condition satisfies the Neyman orthogonality condition regardless of the true value of . More specifically, the null-restricted -statistic is defined to be
where
The corresponding test of against rejects for large values of .
The same methodology can be applied to testing the one-sided hypothesis versus . Under the null hypothesis, is non-positive, suggesting that the test should reject for large values of . Notice that this relies on knowing the sign of , which follows from the GLATE model structure. This restriction on the sign of is similar to knowing the first-stage sign in the linear IV model, which is studied by Andrews and Armstrong (2017) in the context of unbiased estimation.
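Both tests can be assembled from the cross-fitted per-observation moments (the `num` and `den` arrays in the DML sketch above). A minimal sketch, with the one-sided rejection direction following the sign discussion above (our own simplification):

```python
import numpy as np
from scipy.stats import norm

def null_restricted_test(num, den, beta0, alpha=0.05, two_sided=True):
    """AR-type test of H0: beta = beta0 using the null-restricted moment
    g_i = num_i - beta0 * den_i, which has mean zero under H0 even when
    the type probability (the mean of den) is arbitrarily close to zero."""
    g = num - beta0 * den
    t = np.sqrt(len(g)) * g.mean() / g.std()
    if two_sided:                        # H1: beta != beta0
        return abs(t) > norm.ppf(1 - alpha / 2), t
    return t > norm.ppf(1 - alpha), t    # H1: beta > beta0 (one-sided)
```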
We now define the set of DGPs that allows to be arbitrarily close to zero. For any two constants , let be the set of joint distributions of such that
(i) ,

(ii) , and .
For any , let be the subset of in which the true value of the parameter is . In particular, denotes the subset where the null hypothesis is true. The superscript “WI” denotes weak identification. The difference between and is that allows the type probability to be arbitrarily small, whereas the type probabilities in are uniformly bounded away from zero. Denote as the th quantile of the standard normal distribution. The following theorem establishes that the above testing procedures have uniformly correct sizes and are consistent.
Theorem 5.1.
Suppose the conditions on the nuisance parameter estimates in Theorem 4.3 hold. Let be the nominal size of the tests.
(i) The test that rejects in favor of when has (asymptotically) uniformly correct size and is consistent. That is,

and

(ii) The test that rejects in favor of when has (asymptotically) uniformly correct size and is consistent. That is,

and
6 Empirical Application
In this section, we apply the theoretical results to data from the Oregon Health Insurance Experiment (Finkelstein et al., 2012) and examine the effects of different sources of health insurance on health. The experiment was conducted by the state of Oregon between March and September 2008. A series of lottery draws was administered to award participants the option of enrolling in the Oregon Health Plan Standard, a Medicaid expansion program available to low-income Oregon adult residents. Follow-up surveys were sent out in several waves to record, among many variables, the participants' insurance plans and health status. Finkelstein et al. (2012) obtain the effects of insurance coverage by using a LATE model. We apply the GLATE model to study the effect heterogeneity across different sources of insurance.
According to the data, many lottery winners did not choose to participate in the Medicaid program. Instead, they went with other insurance plans or chose not to have any health insurance. Based on this observation, we can set up the GLATE model. The instrument is the binary lottery that determines whether an individual is selected. The covariates include the number of household members and survey waves. Given , is randomly assigned (Finkelstein et al., 2012, p. 1071).¹¹ Though the covariates are discrete, the methods developed in this paper are still different from the linear regressions in Finkelstein et al. (2012). The treatment is the insurance plan, which contains three categories: Medicaid (), non-Medicaid insurance plans (), and no health insurance (). The second category includes Medicare, private plans, employer plans, and other plans. The counterfactual health plan choices under different lottery results are the variables and . The unordered monotonicity condition requires that any participant who changes insurance plan due to winning the lottery does so to enroll in the Medicaid program.
The above setup is the same as Example 2, with types. We follow the terminology in Kline and Walters (2016) and define the following six type sets by their counterfactual insurance plan choices:
1. -never takers: , ;

2. -never takers: , ;

3. always takers: , ;

4. -compliers: , , ;

5. -compliers: , , ;

6. compliers: , , .
The two groups of never takers choose not to join Medicaid regardless of the offer. Always takers manage to enroll in Medicaid even without an offer. The - and - compliers switch to Medicaid from no insurance plan and other plans, respectively, upon winning the lottery. Combining these two groups gives the larger set of compliers.
Table 1 shows the estimated probabilities of the six types.¹² We use the data from the 12-month survey. After taking care of the missing values, we are left with observations. For cross-fitting, we choose . We can see that half of the population are -never takers, who are never covered by any insurance plan. The compliers make up around one-fifth of the population. There are effectively no -compliers, meaning that the experiment does not crowd out other insurance plan choices. These findings are consistent with Finkelstein et al. (2012).
| Type | Probability | Estimate (se) |
| --- | --- | --- |
| -never takers | | .492 (.046) |
| -never takers | | .208 (.018) |
| always takers | | .116 (.018) |
| -compliers | | .197 (.059) |
| -compliers | | .010 (.024) |
| compliers | | .208 (.060) |
The outcome of interest is health status, which is (inversely) measured by the number of days (out of the past 30) when poor health impaired regular activities.¹³ Other types of outcomes are also studied by Finkelstein et al. (2012), including health care utilization and financial strain. Here we only focus on health status for simplicity. The potential outcomes are denoted by , , and . By Theorem 2.1, we can identify the distribution of for -never takers and -compliers, the distribution of for -never takers and -compliers, and the distribution of for always takers and compliers. Table 2 reports the estimated LASFs.¹⁴ The LASF is excluded because there are few -compliers, as reported in Table 1. We can clearly see a pattern of self-selection into the treatment. For example, when there is no insurance coverage, the potential health status of -compliers is worse than that of -never takers, and they therefore choose to enroll in Medicaid.
| Type | Treatment | LASF | Estimate (se) |
| --- | --- | --- | --- |
| -never takers | | | 6.78 (1.19) |
| -never takers | | | 7.74 (1.05) |
| always takers | | | 9.96 (1.75) |
| -compliers | | | 11.50 (2.92) |
| compliers | | | 0.48 (3.42) |
7 Concluding Remarks
In this paper, we considered the estimation of the causal parameters, LASF and LASF-T, in the GLATE model by using the EIF. The proposed DML estimator achieves the SPEB and can be applied in situations, such as high-dimensional settings, where Donsker properties fail. For inference, we proposed generalized AR tests robust against weak identification issues. Currently, empirical researchers use TSLS and control for the covariates linearly in models with multi-valued treatments and instruments. This linear specification does not have a LATE interpretation, as pointed out by Blandhol et al. (2022). Therefore, we advocate using the semiparametric methods studied in this paper in those cases.
SUPPLEMENTARY MATERIAL
Appendix A Technical Proofs
In this section, we prove the theorems and propositions stated in the main text. We assume that Assumptions 1 and 2 hold throughout this section.
A.1 Proof of the Identification Results
Lemma A.1.
and , .
Proof of Lemma A.1.
The first statement follows from the definition of and the fact that is independent of the vector conditional on . For the second statement, is entirely determined by . Hence, given and , is independent of since is independent of conditional on . ∎
Lemma A.2.
For each and , the following identification results hold.
(i) a.s.

(ii) a.s.
Proof of Lemma A.2.
This is Theorem T-6 in Heckman and Pinto (2018a), with the conditioning on covariates presented explicitly. ∎
Proof of Theorem 2.1.
Proof of Theorem 2.2.
By Lemma L-16 of Heckman and Pinto (2018b), we know that under the unordered monotonicity assumption, for all . Thus, the set always exists. For the first statement, we have
where the second equality follows from the law of iterated expectations and the third equality follows from the fact that (Lemma A.1). For the second statement, notice that
By Lemma A.1, we know that
Therefore, we can apply Bayes' rule and obtain that
∎
A.2 Semiparametric Efficiency Calculations
We follow the method developed by Newey (1990). The likelihood of the GLATE model can be specified as
where denotes the conditional density of given and . In a regular parametric submodel, where the true underlying probability measure is indexed by , we use the following notation to represent the score functions:
The score in a regular parametric submodel is
Hence, the tangent space of the model is
where is a subspace of that contains the mean zero functions.
Proof of Theorem 3.1.
We only prove statements (i) and (ii) since (iii) and (iv) are easier cases that can be proved along the way. We start with the first statement. The path-wise differentiability of the parameter can be verified in the following way: in any parametric submodel, we have
where and are random vectors whose typical element can be represented by
and
respectively, for . The EIF is characterized by the condition that
The expression of given in Equation (2) meets the above requirements. In particular, the correspondence between terms in the EIF and path-wise derivative appears exactly as in Lemma 1 of Hong and Nekipelov (2010b).
For the second statement, the path-wise derivative of can be computed similarly.
where and are random vectors whose typical element can be represented by
and
respectively, for . The main difference appears when dealing with the last terms in the above two expressions, which can be matched with terms in the efficient influence function of the following two forms
Take the latter one as an example. Notice that
and
By the law of iterated expectations, we have
∎
Proof of Proposition 3.2.
This proof is based on Section 4 in Newey (1994). We focus on the case of . The other cases are similar. To ease notation, let . The estimator is defined by the moment condition
where
We then compute the derivatives of with respect to the parameters:
where denotes the th element of the vector . Define
We have
Then Newey's (1994) Proposition 4 suggests that the influence function of the estimator is , which is equal to the EIF .
∎
A.3 Proof of Robustness Results
Proof of Proposition 4.1.
We prove the case for ; the other cases can be dealt with analogously. First assume ; then

which implies that is almost surely equal to the identity matrix . By the law of total expectation, we have
which implies that . Therefore,
Now suppose that . Then by the law of total expectation, we have
This implies that . Hence,
This proves the proposition. ∎
Proof of Proposition 4.2.
Since is a finite vector, it suffices to verify the Neyman orthogonality condition for , which is defined by
We want to show that
where and . In fact,
which equals zero because of the following three identities:
∎
Proof of Theorem 4.3.
The asserted claims follow from Theorem 3.1, Theorem 3.2, and Corollary 3.2 of Chernozhukov et al. (2018) (henceforth referred to as the DML paper). We want to verify their Assumptions 3.1 and 3.2. Adopting the notation from the DML paper, we let
and
so that the linearity of the moment condition (with respect to ) is verified by the fact that . Define the following quantities, where for simplicity we drop the superscript in the nonparametric estimators:
By assumption on the convergence rates of the nonparametric estimators, we have . Define , where and are positive constants that depend only on and and are specified later in the proof. Let be a sequence of positive constants approaching zero that satisfies . Such a construction is possible since . We set the nuisance realization set (denoted by in the DML paper) to be the set of all vector functions consisting of square-integrable functions and such that for all :
Consider Assumption 3.1 in the DML paper. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2, where the validity of the differentiation under the integral operation is verified later in the proof. Assumption 3.1(e), the identification condition, is verified by the condition that . The remaining conditions of Assumption 3.1 in the DML paper are trivially verified.
Next, we consider Assumption 3.2 in the DML paper. Note that Assumption 3.2(a) holds by the construction of and and our assumptions on the nuisance estimates. Assumption 3.2(d) is verified by our assumption that the semiparametric efficiency bound of is above . The remaining task is to verify Assumption 3.2(b) and 3.2(c) in the DML paper. To do that, we choose sufficiently large and let be an arbitrary element of the nuisance realization set . We keep the above notations throughout the remaining part of the proof. Define
and
Since is a linear combination of and is a linear combination of , we only need and to be uniformly bounded (i.e., the bounds do not depend on ) for in order to verify Assumption 3.2(b) in the DML paper. In fact,
where we have used the assumption that , , and . Similarly, we have
where we have used the assumption that and . Thus, Assumption 3.2(b) in the DML paper is verified.
To verify Assumption 3.2(c) in the DML paper, we again only need to verify the corresponding conditions for and , respectively. For , we have
where the second to last inequality follows from the fact that . For , we have
where the last inequality follows from our assumption that and the fact that . Combining the above two inequality results, we can verify the first two conditions of Assumption 3.2(c) in the DML paper.
For the last condition of Assumption 3.2(c) in the DML paper, which bounds the second-order Gateaux derivative, we again consider and separately. For , recall that and . Clearly, . With differentiation under the integral, we have
Using the fact that and , we can bound the above derivative by
By bounding the first and second derivative uniformly with respect to , we know that the differentiation under the integral operation is valid. So the Neyman orthogonality condition is verified. Analogously, we can show that
Under the assumption , we have
for all and large enough. Then we can bound the above derivative by
Therefore, we have verified the last condition of Assumption 3.2(c) in the DML paper.
Lastly, we need to verify the condition on in Theorems 3.1 and 3.2 of the DML paper, that is, . This directly follows from the construction of . ∎
A.4 Proof of Weak IV Inference Results
Proof of Theorem 5.1.
We first prove part (i). Consider applying the DML method to the moment condition (8) to estimate the parameter and obtain the standard error. We want to show the convergence in distribution of
(A.1)
to the standard normal distribution uniformly over the DGPs in . To do that, we need to verify Assumptions 3.1 and 3.2 in the DML paper regarding the above moment condition. Assumptions 3.1(a)-(c) hold trivially. Assumption 3.1(d), the Neyman orthogonality condition, is verified by Proposition 4.2. That is, the Gateaux derivatives with respect to the nuisance parameters are zero regardless of the value of . Assumption 3.1(e), the identification condition, is verified since the Jacobian of the parameter in the moment condition is . Assumption 3.2 in the DML paper can be verified in the same way as in the proof of Theorem 4.3. For brevity, we do not repeat the verification here.
For DGPs in , (A.1) is equal to . Therefore, the uniform convergence in distribution of is established in the null space, and the size of the test is uniformly controlled accordingly. For DGPs in , where , we have
The first term on the RHS of the last equality converges in distribution to . In contrast, the second term diverges to infinity since converges in probability to by Theorem 3.2 in the DML paper. Therefore, the probability of exceeding any finite number converges to 1. The case where is essentially the same.
To prove part (ii) of the theorem, notice that for any DGP in the null space , which implies that . Therefore,
where the supremum is taken over . Consistency can be derived in the same way as part (i). ∎
Appendix B Implicitly Defined Parameters
This section studies general parameters defined implicitly through moment conditions. We allow the moment conditions to be non-smooth, which is the case when the parameter of interest is a quantile. We also allow the moment conditions to be overidentifying, which could be the result of imposing the underlying economic theory on multiple levels of treatment and instrument.
To facilitate the exposition, we define a random variable such that the marginal distribution of is equal to the conditional distribution of given . The joint distribution of the ’s is irrelevant and hence left unspecified. For convenience, we use a single index rather than for labeling. That is, we collect the ’s into the vector . Let be the treatment level associated with . The quantities and are analogously defined.¹⁶ We can further extend the vector to include variables whose marginal distributions are the same as the conditional distributions of given . Efficient estimation in this more general case is similar and hence omitted for brevity.
Let the parameter of interest be , which lies in the parameter space , . The true value of the parameter satisfies the moment condition
where is a vector of functions:
Since the vector appears in each , restrictions are allowed both within and across different subpopulations. Another interesting feature of this specification is that the moment conditions are defined for random variables that are not observed; their marginal distributions, however, can be identified similarly to Theorem 2.1.
Let , where
and
The functions are identified from the data. Similar to Theorem 2.1, we can show that the parameter is identified by the moment conditions:
The following theorem gives the SPEB for the estimation of .
Theorem B.1.
Assume the following conditions hold.
(i) .

(ii) For each and , is continuously differentiable in its second argument. Let be the matrix whose th row is , and assume has full column rank.
Then for the estimation of , the EIF is
(B.1)
where
and is a random vector whose th element is
(B.2)
In particular, the semiparametric efficiency bound is .
Proof of Theorem B.1.
The proof is based on the approach described in Section 3.6 of Hong and Nekipelov (2010a) and the proof of Theorem 1 in Cattaneo (2010). We use a constant matrix to transform the overidentified vector of moments into an exactly identified system of equations , find the -dependent EIF for the exactly identified parameter, and choose the optimal . In a parametric submodel, the implicit function theorem gives that
where is a random vector whose typical element can be represented by
for . So the EIF for this exactly-identified parameter is
where is defined by Equation (B.2). It is straightforward to verify that satisfies . The optimal is chosen by minimizing the sandwich matrix . Thus, the EIF for the over-identified parameter is obtained when . Plugging this expression into , we obtain Equation (B.1). ∎
Note that if, for example, , then , and the efficiency bound shown above reduces to the one computed in Theorem 3.1. If , that is, if the treatment is unconfounded, then Theorem B.1 reduces to Theorem 1 in Cattaneo (2010).
For estimation, we use the EIFs to generate moment conditions and propose a three-step semiparametric GMM procedure. The criterion function is
(B.3)
Its probability limit is denoted as
(B.4)
where the expectation is taken with respect to the true parameters . The implementation procedure is as follows. Assume that we have nonparametric estimators and that consistently estimate and , respectively. We first find a consistent GMM estimator using the identity matrix as the weighting matrix, that is,
(B.5)
Next, we use this estimate to form a consistent estimator of the covariance matrix , where
Then we let be the optimally weighted GMM estimator:
To conduct inference, we estimate using the estimator whose elements are defined as
where we have implicitly assumed that the estimator is differentiable in its second argument.
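A schematic Python implementation of this three-step procedure follows; here `moments` is assumed to return the n-by-d matrix of per-observation EIF-based moment values with the nuisance estimates already plugged in, and a derivative-free optimizer is used since the moments may be non-smooth.

```python
import numpy as np
from scipy.optimize import minimize

def three_step_gmm(moments, beta_init):
    """Three-step semiparametric GMM sketch, per (B.3)-(B.5).
    moments(beta) -> (n, d) array of per-observation moment values."""
    def qn(beta, W):                     # GMM criterion, cf. (B.3)
        gbar = moments(beta).mean(axis=0)
        return gbar @ W @ gbar
    d = moments(np.asarray(beta_init)).shape[1]
    # Step 1: identity-weighted, consistent first-step estimator (B.5).
    step1 = minimize(qn, beta_init, args=(np.eye(d),), method="Nelder-Mead")
    # Step 2: estimate the optimal weight W = Omega^{-1}.
    Omega = np.atleast_2d(np.cov(moments(step1.x), rowvar=False))
    # Step 3: optimally weighted GMM.
    step2 = minimize(qn, step1.x, args=(np.linalg.pinv(Omega),),
                     method="Nelder-Mead")
    return step2.x
```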
In the following theorem, we derive the asymptotic properties of the GMM estimators. The main theoretical difficulty is that the random criterion function could potentially be discontinuous because we allow to be discontinuous. We use the theory developed in Chen et al. (2003) to overcome this problem.¹⁷ Cattaneo (2010) instead uses the theory from Pakes and Pollard (1989). However, the general theory of Chen et al. (2003) is more straightforward to apply in this case since they explicitly assume the presence of infinite-dimensional nuisance parameters, which can depend on the parameters to be estimated. Let be the function class that contains . Let be the function class that contains .
Theorem B.2.
Let the assumptions in Theorem B.1 hold. Assume the following conditions hold.
(i) The parameter space is compact. The true parameter is in the interior of .

(ii) For any and , there exists such that for sufficiently small,

(iii) Donsker properties:

where denotes the covering number of the space .

(iv) Convergence rates of the nonparametric estimators:

(v) The function is integrable. The estimator is consistent uniformly in its second argument, that is,
Then , , , and
where denotes a vector of zeros.
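Schematically (in generic notation, not the model-specific display), the conclusions are consistency of both GMM estimators and asymptotic normality of the optimally weighted one at the efficiency bound of Theorem B.1:
\[
  \sqrt{n}\,\big(\hat\beta^{*} - \beta_0\big) \;\xrightarrow{d}\; N\big(\mathbf{0},\, (\Gamma^{\top}V^{-1}\Gamma)^{-1}\big).
\]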
The following lemma is helpful for proving Theorem B.2.
Lemma B.3.
Under the assumptions of Theorem B.1, the class
is Donsker with a finite integrable envelope. The following stochastic equicontinuity condition holds: for any positive sequence ,
where the supremum is taken over , , and .
Proof of Lemma B.3.
We first verify that the moment condition satisfies Condition (3.2) of Theorem 3 in Chen et al. (2003) (hereafter CLK). In fact, when , the triangle inequality gives that
where we use const to denote a generic constant that may take different values at each appearance. The last inequality follows from assumption (ii). Similarly, we can verify that the remaining terms in also satisfy this condition. Therefore, is locally uniformly -continuous, that is,
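Schematically, this is the generic form of Condition (3.2) in CLK (the exact display is model-specific; here $m$ denotes the moment function, $\mathcal{H}$ the nuisance space, and $s \in (0,2]$ some exponent):
\[
  \mathbb{E}\Big[\sup_{\substack{\|\theta'-\theta\|<\delta \\ \|h'-h\|_{\mathcal H}<\delta}} \big\|m(W;\theta',h') - m(W;\theta,h)\big\|^{2}\Big] \;\le\; \mathrm{const}\times \delta^{s}.
\]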
Following the same steps as in the proof of Theorem 3 in CLK (p. 1607), we can show that the bracketing number of is bounded by
Therefore, the bracketing entropy of class is bounded by
Under the assumption that is compact and
we have that
This implies that is Donsker with a finite integrable envelope. Lastly, as stated in Lemma 1 of CLK, the asserted stochastic equicontinuity condition is implied by the fact that is Donsker and is -continuous. ∎
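The step from the bracketing-number bound to the Donsker conclusion uses the standard entropy-integral criterion, stated here generically (see, e.g., van der Vaart and Wellner (1996)): for a class $\mathcal{F}$ with a finite integrable envelope,
\[
  \int_0^1 \sqrt{\log N_{[\,]}\big(\varepsilon,\,\mathcal{F},\,L_2(P)\big)}\;d\varepsilon \;<\; \infty
  \quad\Longrightarrow\quad \mathcal{F}\ \text{is $P$-Donsker}.
\]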
Proof of Theorem B.2.
We follow the large sample theory in CLK and set , , , and .
We first use Theorem 1 in CLK to show the consistency of . Condition (1.2) in CLK is satisfied because is compact and because has a unique zero and is continuous by condition (ii) of Theorem B.1. As for Condition (1.3) of CLK, we can see from the expression of that it is continuous with respect to and (since is bounded away from zero), and the uniformity in follows from the fact that is bounded as a function of . Condition (1.4) of CLK is satisfied by the assumptions of Theorem B.2. The uniform stochastic equicontinuity condition (1.5) of CLK is implied by Lemma B.3. Therefore, .
We use Corollary 1 (which is based on Theorem 2) in CLK to show the consistency of and the asymptotic normality of . Condition (2.2) in CLK is verified by the assumptions of Theorem B.1. As in the proof of Proposition 4.2, we can show that the moment condition , based on the EIF, satisfies the Neyman orthogonality condition with respect to the nuisance parameters and . In fact, for any and , we let and . Then we have
where we have applied the law of iterated expectations and used the fact that
Thus, the path-wise derivative of with respect to is zero in any direction. Hence, Condition (2.3) of CLK is verified. Condition (2.4) in CLK follows directly from the assumptions of Theorem B.2. The stochastic equicontinuity condition (Condition (2.5) in CLK) follows from Lemma B.3. Lastly, Condition (2.6) in CLK is verified using the central limit theorem, since the path-wise derivative is zero. Due to the presence of , we also need the uniform convergence condition in Corollary 1 of CLK, which can be verified using Lemma B.3 and an application of Theorem 2.10.14 of van der Vaart and Wellner (1996).
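In generic notation (a schematic statement of the abstract condition, with $h_0$ the true nuisance functions and $h$ any admissible alternative; this is not the model-specific calculation), the orthogonality property just verified reads
\[
  \frac{\partial}{\partial r}\,
  \mathbb{E}\Big[\psi\big(W;\,\beta_0,\; h_0 + r\,(h - h_0)\big)\Big]\Big|_{r=0} \;=\; 0
  \qquad \text{for every admissible } h,
\]
that is, the moment condition is locally insensitive to first-stage estimation error.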
Lastly, to show the consistency of , we only need to show that
where the inequality follows from differentiation under the integral sign, which is valid under the last assumption of the theorem. The convergence in probability follows from the uniform convergence of and the consistency of . Therefore, the desired convergence results follow. ∎
References
- Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113(2), 231–263.
- Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014). Asymptotic efficiency of semiparametric two-step GMM. Review of Economic Studies 81(3), 919–943.
- Andrews, D. W., X. Cheng, and P. Guggenberger (2020). Generic results for establishing the asymptotic size of confidence sets and tests. Journal of Econometrics 218(2), 496–531.
- Andrews, I. and T. B. Armstrong (2017). Unbiased instrumental variables estimation under known first-stage sign. Quantitative Economics 8(2), 479–503.
- Angrist, J. D. and G. W. Imbens (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90(430), 431–442.
- Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91(434), 444–455.
- Belloni, A. and V. Chernozhukov (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39(1), 82–130.
- Belloni, A. and V. Chernozhukov (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19(2), 521–547.
- Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and Adaptive Estimation for Semiparametric Models, Volume 4. Springer, New York.
- Blandhol, C., J. Bonney, M. Mogstad, and A. Torgovitsky (2022). When is TSLS actually LATE? University of Chicago, Becker Friedman Institute for Economics Working Paper (2022-16).
- Bühlmann, P. and S. Van De Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
- Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155(2), 138–154.
- Chen, X., H. Hong, and A. Tarozzi (2004). Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects.
- Chen, X., H. Hong, and A. Tarozzi (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics 36(2), 808–843.
- Chen, X., O. Linton, and I. Van Keilegom (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica 71(5), 1591–1608.
- Chen, X. and A. Santos (2018). Overidentification in regular models. Econometrica 86(5), 1771–1817.
- Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1), C1–C68.
- Chernozhukov, V., J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
- Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and the Oregon Health Study Group (2012). The Oregon health insurance experiment: Evidence from the first year. The Quarterly Journal of Economics 127(3), 1057–1106.
- Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics 139(1), 35–75.
- Galindo, C. (2020). Empirical challenges of multivalued treatment effects. Technical report, Job market paper.
- Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66(2), 315–331.
- Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24(3), 726–748.
- Heckman, J. J. and R. Pinto (2018a). Unordered monotonicity. Econometrica 86(1), 1–35.
- Heckman, J. J. and R. Pinto (2018b). Web appendix for "Unordered monotonicity". Econometrica 86(1), 1–35.
- Hong, H. and D. Nekipelov (2010a). Semiparametric efficiency in nonlinear LATE models. Quantitative Economics 1(2), 279–304.
- Hong, H. and D. Nekipelov (2010b). Supplement to "Semiparametric efficiency in nonlinear LATE models". Quantitative Economics 1(2), 279–304.
- Ichimura, H. and W. K. Newey (2022). The influence function of semiparametric estimators. Quantitative Economics 13(1), 29–61.
- Imbens, G. W. and J. D. Angrist (1994). Identification and estimation of local average treatment effects. Econometrica 62(2), 467–475.
- Imbens, G. W. and C. F. Manski (2004). Confidence intervals for partially identified parameters. Econometrica 72(6), 1845–1857.
- Kitagawa, T. (2015). A test for instrument validity. Econometrica 83(5), 2043–2063.
- Kline, P. and C. R. Walters (2016). Evaluating public programs with close substitutes: The case of Head Start. The Quarterly Journal of Economics 131(4), 1795–1848.
- Lee, S. and B. Salanié (2018). Identifying effects of multivalued treatments. Econometrica 86(6), 1939–1963.
- Masry, E. (1996). Multivariate local polynomial regression for time series: Uniform strong consistency and rates. Journal of Time Series Analysis 17(6), 571–599.
- Mikusheva, A. (2007). Uniform inference in autoregressive models. Econometrica 75(5), 1411–1452.
- Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics 5(2), 99–135.
- Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica 62(6), 1349–1382.
- Okui, R., D. S. Small, Z. Tan, and J. M. Robins (2012). Doubly robust instrumental variable regression. Statistica Sinica, 173–205.
- Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57(5), 1027–1057.
- Pinto, R. (2021). Beyond intention to treat: Using the incentives in Moving to Opportunity to identify neighborhood effects. UCLA Working paper.
- Sun, Z. (2021). Instrument validity for heterogeneous causal effects. arXiv preprint arXiv:2009.01995.
- Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association 101(476), 1607–1618.
- van der Vaart, A. W. (1998). Asymptotic Statistics, Volume 3. Cambridge University Press.
- van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag.
- Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica 70(1), 331–341.