This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Single Proxy Control

Chan Parka, David B. Richardsonb, Eric J. Tchetgen Tchetgena
aa: Department of Statistics and Data Science, The Wharton School, University of Pennsylvania
bb: Department of Environmental & Occupational Health, University of California-Irvine
Abstract

Negative control variables are sometimes used in non-experimental studies to detect the presence of confounding by hidden factors. A negative control outcome (NCO) is an outcome that is influenced by unobserved confounders of the exposure effects on the outcome in view, but is not causally impacted by the exposure. Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration Approach (COCA) as a formal NCO counterfactual method to detect and correct for residual confounding bias. For identification, COCA treats the NCO as an error-prone proxy of the treatment-free counterfactual outcome of interest, and involves regressing the NCO on the treatment-free counterfactual, together with a rank-preserving structural model which assumes a constant individual-level causal effect. In this work, we establish nonparametric COCA identification for the average causal effect for the treated, without requiring rank-preservation, therefore accommodating unrestricted effect heterogeneity across units. This nonparametric identification result has important practical implications, as it provides single proxy confounding control, in contrast to recently proposed proximal causal inference, which relies for identification on a pair of confounding proxies. For COCA estimation we propose three separate strategies: (i) an extended propensity score approach, (ii) an outcome bridge function approach, and (iii) a doubly-robust approach. Finally, we illustrate the proposed methods in an application evaluating the causal impact of a Zika virus outbreak on birth rate in Brazil.

Keywords: Confounding Proxy; Doubly Robust; Extended Propensity Score; Negative Controls; Unmeasured Confounding.

1 Introduction

Unmeasured confounding is a well-known threat to valid causal inference from observational data. An approach that is sometimes used in practice to assess residual confounding bias, is to check whether known null effects can be recovered free of bias, by evaluating whether the exposure or treatment of interest is found to be associated with a so-called negative control outcome (NCO), upon adjusting for measured confounders (Rosenbaum, 1989; Lipsitch et al., 2010; Shi et al., 2020). An observed variable is said to be a valid negative control outcome or more broadly, an outcome confounding proxy, to the extent that it is associated with hidden factors confounding the exposure-outcome relationship in view, although not directly impacted by the exposure. Therefore, an NCO which is empirically associated with the exposure, might suggest the presence of residual confounding. In the event such an association is present, a natural question is whether the negative control outcome can be used for bias correction.

The most well-established NCO approach for debiasing observational causal effect estimates is the difference-in-differences approach (DiD) (Card and Krueger, 1994; Lechner, 2011; Caniglia and Murray, 2020). In fact, DiD may be viewed as directly leveraging the pre-treatment outcome as an NCO since it cannot logically be causally impacted by the treatment. Identification then follows from an additive equi-confounding assumption that the unmeasured confounder association with the post-treatment outcome of interest matches that with the pre-treatment outcome on the additive scale (Sofer et al., 2016). The baseline outcome in DiD is thus implicitly assumed to be a valid NCO, and equi-confounding is equivalent to the so-called parallel trends assumption, that the average trends in treatment-free potential outcomes for treatment and untreated units are parallel. In practice, equi-confounding or equivalently parallel trends may not be reasonable for a number of reasons, including if the outcome trend is also impacted by an unmeasured common cause with the treatment. Furthermore, additive equi-confounding may not be realistic as a broader debiasing method in non-DiD settings where the NCO is not necessarily a pre-treatment measurement of the outcome of interest, but is instead a post-treatment measurement of a different type of outcome (and might therefore have support on a different scale than the outcome of interest has).

To address these potential limitations of additive equi-confounding, Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration Approach (COCA) as a simple yet formal counterfactual NCO approach to debias causal effect estimates in observational analyses. At its core, COCA essentially treats the NCO variable as a proxy measurement for the treatment-free potential outcome, which therefore is associated with the latter, and which becomes independent of the treatment assignment mechanism, upon conditioning on the treatment-free counterfactual outcome. As the treatment-free potential outcome can be viewed as an ultimate source of unmeasured confounding, this assumption formalizes the idea that, as a relevant proxy for the source of residual confounding, the NCO would be made irrelevant for the treatment assignment mechanism if one were to hypothetically condition on the underlying potential outcome.

For identification and inference for a continuous outcome, the original COCA approach of Tchetgen Tchetgen (2013) involves the correct specification of a regression model for the NCO, conditional on the treatment-free potential outcome and measured confounders, together with a rank-preserving structural model which effectively assumes a constant individual-level treatment effect. In this paper, we develop a nonparametric COCA identification framework for the average causal effect for the treated, which equally applies irrespective of the nature of the primary outcome, whether binary, continuous, or polytomous. Importantly, as we show, the proposed COCA identification framework completely obviates the need for rank preservation, therefore accommodating an arbitrary degree of effect heterogeneity across units. Relatedly, an alternative counterfactual approach named proximal causal inference has recently developed in causal inference literature (Miao et al., 2016, 2018; Tchetgen Tchetgen et al., 2024), which leverages a pair of negative treatment and outcome control variables, or more broadly treatment and outcome confounding proxies, to nonparametrically identify treatment causal effects subject to residual confounding without invoking a rank-preservation assumption. Importantly, while proximal causal inference relies on two proxies for causal identification, in contrast, COCA is a single-proxy control approach which therefore may present practical advantages. For estimation and inference, we introduce three strategies to implement COCA which improve on prior methods: (i) an extended propensity score approach, (ii) a so-called outcome calibration bridge function approach, and (iii) a doubly robust approach which carefully combines approaches (i) and (ii) and remains unbiased, provided that either approach is also unbiased, without necessarily knowing which method might not be unbiased. Finally, we illustrate the methods with an application evaluating the causal effect of a Zika outbreak on birth rate in Brazil, and we conclude with possible extensions to our methods and a brief discussion.

2 Notation and Brief Review of COCA

Consider an observational study where, as represented in Figure 1, one has observed an outcome variable YY, a binary treatment AA whose causal effect on YY is of interest, and measured pre-treatment covariates XX. We are concerned that as displayed succinctly in the figure with the bow arc, the association between AA and YY is confounded by hidden factors.

AAYYXX
Figure 1: A Graphical Illustration of A Simple Causal Model

Throughout, YaY^{a} denotes the potential outcome or counterfactual, had possibly contrary to fact, the exposure been set to aa by an external hypothetical intervention. Furthermore, throughout, we also make the consistency assumption:

Assumption 1

Y=YAY=Y^{A} almost surely.

Hereafter, we aim to make inferences about the causal effect of treatment on the treated (ETT), denoted by ψ=E(Ya=1Ya=0|A=1)\psi^{*}=E\big{(}Y^{a=1}-Y^{a=0}\,\big{|}\,A=1\big{)}. Under consistency, ψ1=E(Ya=1|A=1)\psi_{1}^{*}=E\big{(}Y^{a=1}\,\big{|}\,A=1\big{)} is identified by E(Y|A=1)E\big{(}Y\,\big{|}\,A=1\big{)}; to identify the counterfactual mean ψ0=E(Ya=0|A=1)\psi_{0}^{*}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)} requires additional assumptions. Standard methods often resort to the no unmeasured confounding assumption, i.e., Ya=0A|XY^{a=0}\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,X, a strong assumption we do not make. Instead, we suppose that one has measured a valid NCO WW, possibly multi-dimensional, which is known a priori to satisfy the following conditions:

Assumption 2

Condition (i): Wa=WW^{a}=W almost surely for a=0,1a=0,1, where WaW^{a} is the potential NCO under an external intervention that sets A=aA=a; Condition (ii): WYa=0|XW\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,Y^{a=0}\,\big{|}\,X; Condition (iii): WA|(Ya=0,X)W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(Y^{a=0},X).

Assumption 2-(i) encodes the key assumption of a known null causal effect of the treatment on the NCO in potential outcome notation. Assumption 2-(ii) encodes that WW is relevant for predicting the treatment-free potential outcome of interest. Assumption 2-(iii) states that WW is independent of AA conditional on treatment-free potential outcome and covariates. These conditions formally encode the assumption that WW is a valid proxy for the treatment-free potential outcome, a source of residual confounding bias; WW is only associated with the treatment mechanism to the extent that it is associated with the confounding mechanism captured by the potential outcome. We illustrate these NCO assumptions with the causal graph displayed in Figure 2. The thick arrows in the graph indicate the deterministic relationship defining the observed outcome in terms of potential outcomes and treatment variables by the consistency assumption. The missing arrows on the graph formally encode the core conditional independence conditions implied by Assumption 2.

A\,A\,Ya=0Y^{a=0}XXYa=1Y^{a=1}W\,W\,Y\,Y\,
Figure 2: A Graphical Illustration of the Assumptions for Control Outcome Calibration Approach. Thick arrows depict the deterministic relationship between YY and (Ya=1,Ya=0,A)(Y^{a=1},Y^{a=0},A), as established by the consistency assumption (Assumption 1).

It is enlightening to consider a data generating mechanism that is compatible with Assumption 2. As an example, we consider the following latent variable model for a continuous outcome:

Ya=0=hy(U0,X),\displaystyle Y^{a=0}=h_{y}(U_{0},X)\ , (1a)
WU0|X and WA|(U0,X),\displaystyle W\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,U_{0}\,\big{|}\,X\text{ and }W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(U_{0},X)\ , (1b)

where UU is a continuously distributed unobserved variable and hy(u,x)h_{y}(u,x) is a function that is strictly monotone in uu for all xx, but otherwise completely unrestricted. In the Supplementary Material A.1, we establish that expressions (1a)-(1b) imply Assumption 2. Expression (1a) means that the treatment-free outcome is a monotonic transformation of an unobserved variable U0U_{0}, a specific instance of a so-called changes-in-changes model (Athey and Imbens, 2006). Although the latter would also assume under the causal graph in Figure 2 that W=hw(U0,X)W=h_{w}(U_{0},X) where hw(u,x)h_{w}(u,x) is strictly monotone in uu for all xx, an assumption we do not make. Expression (1b) corresponds to Assumption 2-(ii) and (iii). In fact, our formulation accommodates an additional measurement error in WW, say ϵw\epsilon_{w}, so that W=hw(U0,X,ϵw)W=h_{w}(U_{0},X,\epsilon_{w}) where hw(u,x,ϵw)h_{w}(u,x,\epsilon_{w}) varies in uu (potentially non-monotonic) and ϵwA|(U0,X)\epsilon_{w}\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(U_{0},X), (1b) is satisfied. Figure 3 provides a graphical representation compatible with, although not necessarily with, expressions (1a)-(1b).

A\,A\,Ya=0Y^{a=0}U1U_{1}U0U_{0}Ya=1Y^{a=1}W\,W\,Y\,Y\,
Figure 3: A Graphical Illustration of a Structural Model Compatible with (1). Measured covariates XX are suppressed for simplicity. The thick arrows depict the deterministic relationships Ya=0=hy(U0)Y^{a=0}=h_{y}(U_{0}) and Y=YAY=Y^{A}.

For identification and estimation in the case of a continuous outcome, Tchetgen Tchetgen (2013) further assumed the rank-preserving structural model:

Y=Ya=0+ψA,\displaystyle Y=Y^{a=0}+\psi^{*}A, (2)

which, by consistency, implies a constant individual-level causal effect ψ=Ya=1Ya=0\psi^{*}=Y^{a=1}-Y^{a=0}. Under this model, he noted that upon defining Y(ψ)=YψAY(\psi)=Y-\psi A, then one can deduce from Assumption 2 that WA|(Y(ψ),X)W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(Y(\psi),X) if and only if ψ=ψ\psi=\psi^{*}, in which case, given (2), Ya=0=Y(ψ)=YψAY^{a=0}=Y(\psi^{*})=Y-\psi^{*}A which motivates a regression-based implementation of COCA, that entails searching for the parameter value of ψ=ψ\psi=\psi^{*} such that E{W|A,Y(ψ),X}=E{W|Y(ψ),X}E\big{\{}W\,\big{|}\,A,Y(\psi),X\big{\}}=E\big{\{}W\,\big{|}\,Y(\psi),X\big{\}}. A straightforward implementation of the approach uses linear models whereby for each value of ψ\psi on a sufficiently fine grid, one obtains an estimate of the regression model E{W|Y(ψ),X}=β1+β2A+β3Y(ψ)+β4XE\big{\{}W\,\big{|}\,Y(\psi),X\big{\}}=\beta_{1}+\beta_{2}A+\beta_{3}Y(\psi)+\beta_{4}X using ordinary least squares (OLS), with estimated coefficients (β^1(ψ),β^2(ψ),β^3(ψ),β^4(ψ))\big{(}\widehat{\beta}_{1}(\psi),\widehat{\beta}_{2}(\psi),\widehat{\beta}_{3}(\psi),\widehat{\beta}_{4}(\psi)\big{)}. Then a 95% confidence interval for ψ\psi^{*} consists of all values of ψ\psi for which a valid test of the null hypothesis β2(ψ)=0\beta_{2}(\psi)=0 fails to reject at the 0.05 type 1 error level. Such hypothesis test might be performed by verifying whether the interval β^2(ψ)±1.96SE^(β^2(ψ))\widehat{\beta}_{2}(\psi)\pm 1.96\widehat{\mathrm{SE}}\big{(}\widehat{\beta}_{2}(\psi)\big{)} covers 0,0, with SE^(β^2(ψ))\widehat{\mathrm{SE}}\big{(}\widehat{\beta}_{2}(\psi)\big{)} the OLS estimate of the standard error of β^2(ψ)\widehat{\beta}_{2}(\psi). Tchetgen Tchetgen (2013) also describes a potentially simpler one-shot approach which fits a single regression E(W|A,Y,X)=β1+β2A+β3Y+β4XE\big{(}W\,\big{|}\,A,Y,X\big{)}=\beta_{1}+\beta_{2}^{*}A+\beta_{3}Y+\beta_{4}X via OLS where β2=β3/ψ0,\beta_{2}^{*}=-\beta_{3}/\psi_{0}, in which case ψ^=β^2/β^3\widehat{\psi}=-\widehat{\beta}_{2}^{*}/\widehat{\beta}_{3} where (β^2,β^3)\big{(}\widehat{\beta}_{2}^{*},\widehat{\beta}_{3}\big{)} are OLS estimates; a corresponding standard error estimator of ψ^\widehat{\psi} is given in the Supplementary Material A.2 for convenience. Though practically convenient, validity of either approach relies on both correct specification of the linear model for WW given (Ya=0,X)(Y^{a=0},X), and on the rank-preserving structural model, which may be biologically implausible. In the following, we describe alternative methods aimed at addressing these limitations.

3 Identification of the Effect of Treatment on the Treated

3.1 Identification via Extended Propensity Score Weighting

In order to establish identification, consider the extended propensity score function (EPS):

π(y,x)=Pr(A=1|Ya=0=y,X=x)\displaystyle\pi^{*}\big{(}y,x\big{)}=\Pr\big{(}A=1\,\big{|}\,Y^{a=0}=y,X=x\big{)}

which makes explicit the fact that, in the presence of unmeasured confounding, the treatment mechanism will generally depend on the treatment-free potential outcome even after conditioning for all observed confounders. For notational brevity, we denote the treatment odds by ω(y,x)=π(y,x)/{1π(y,x)}\omega^{*}(y,x)=\pi^{*}(y,x)/\{1-\pi^{*}(y,x)\}. Since ω\omega^{*} and π\pi^{*} have a one-to-one relationship, we use ω\omega^{*} throughout to model the exposure mechanism. We assume that positivity holds, i.e.,

Assumption 3

𝒮(1)𝒮(0)\mathcal{S}(1)\subseteq\mathcal{S}(0) where 𝒮(a)\mathcal{S}(a) is the support of (Ya=0,X)|(A=a)(Y^{a=0},X)\,\big{|}\,(A=a) for a=0,1a=0,1.

Next, we note that were ω\omega^{*} known, the average treatment-free potential outcome in the treated would then be empirically identified by the expression

ψ0=E{(1A)Yω(Y,X)}E{(1A)ω(Y,X)}.\displaystyle\psi_{0}^{*}=\frac{E\big{\{}\big{(}1-A\big{)}Y\omega^{*}\big{(}Y,X\big{)}\big{\}}}{E\big{\{}\big{(}1-A\big{)}\omega^{*}\big{(}Y,X\big{)}\big{\}}}\ . (3)

See the Supplementary Material B.1 for the details. Therefore, the ETT would be identified by

ψ=E(Y|A=1)E{(1A)Yω(Y,X)}E{(1A)ω(Y,X)}.\displaystyle\psi^{*}=E\big{(}Y\,\big{|}\,A=1\big{)}-\frac{E\big{\{}\big{(}1-A\big{)}Y\omega^{*}\big{(}Y,X\big{)}\big{\}}}{E\big{\{}\big{(}1-A\big{)}\omega^{*}\big{(}Y,X\big{)}\big{\}}}\ .

As ω\omega^{*} is unknown, we next demonstrate how the NCO assumption can be leveraged to identify the latter quantity. Let p(w,x)=Pr(A=1|W=w,X=x)p^{*}\big{(}w,x\big{)}=\Pr\big{(}A=1\,\big{|}\,W=w,X=x\big{)}. The proposed approach to identify ω\omega^{*} is based on the following equality, which we prove in the Supplementary Material B.2:

Result 1

Under Assumptions 1-3, the following result holds almost surely:

p(W,X)1p(W,X)\displaystyle\frac{p^{*}\big{(}W,X\big{)}}{1-p^{*}\big{(}W,X\big{)}} =E{ω(Y,X)|W,X,A=0}=yπ(y,X)1π(y,X)f(y|W,X,A=0)\displaystyle=E\left\{\omega^{*}(Y,X)\,\big{|}\,W,X,A=0\right\}=\sum_{y}\frac{\pi^{*}\big{(}y,X\big{)}}{1-\pi^{*}\big{(}y,X\big{)}}f^{*}\big{(}y\,\big{|}\,W,X,A=0\big{)} (4)

where f(y|W,X,A=0)f^{*}\big{(}y\,\big{|}\,W,X,A=0\big{)} is the conditional law of YY given (W,X,A)(W,X,A) evaluated at Y=yY=y.

Result 1 provides an expression relating the exposure mechanism of interest to the observed data distribution, as of the three quantities involved in the expression pp^{*}, ff^{*}, and ω\omega^{*}; two are uniquely determined by the observed data, mainly pp^{*} and ff^{*}, which are then related to the unknown function of interest ω\omega^{*} in Result 1; see the Supplementary Material A.3 for a graphical illustration. Equation (4) is known as an integral equation, more precisely a Fredholm integral equation of the first kind. In slight abuse of notation, the sum may be interpreted as an integral if YY is Lebesgue measurable. We then have the following identification result.

Result 2

If Assumptions 1-3 hold, and the integral equation (4) in Result 1 admits a unique solution, then ω\omega^{*} and π\pi^{*} are nonparametrically identified from the observed data by solving (4), and ψ0\psi_{0}^{*} is nonparametrically identified by (3).

Sufficient conditions for the existence and uniqueness of a solution to such an equation are well-studied; see Section A.5 for details. Such conditions were recently discussed in the context of proximal causal inference (Miao et al., 2016, 2018; Tchetgen Tchetgen et al., 2024). A detailed comparison of COCA with proximal causal inference is relegated to Section 6. Intuitively, the condition that the integral equation admits a unique solution essentially requires that WW is sufficiently relevant for Ya=0Y^{a=0} in the sense that for any variation in the latter, there is corresponding variation in the former. This assumption is akin to the assumption of relevance in the context of instrumental variable methodology, which states that variation in the instrument should induce variation in the treatment. Importantly, Result 2 applies whether YY is binary, continuous or polytomous, provided that WW is sufficiently relevant for Ya=0Y^{a=0}, to ensure that equation (3) admits a solution. Also, the result obviates the need for a rank-preserving structural model, and delivers fully nonparametric identification of the causal effect of treatment on the treated.

3.2 Identification via COCA Confounding Bridge Function

In this Section, we introduce an alternative nonparametric identification and estimation approach which does not rely on modeling the EPS, but instead relies on the existence of a so-called confounding bridge function formalized below.

Assumption 4

For all (y,x)(y,x), there exists a function (possibly nonlinear) b(w,x)b^{*}(w,x) that satisfies the following equation

y=E{b(W,X)|Y=y,X=x,A=0}.\displaystyle y=E\left\{b^{*}\big{(}W,X\big{)}\,\big{|}\,Y=y,X=x,A=0\right\}. (5)

Intuitively, upon noting that under Assumptions 1-2, Assumption 4 can equivalently be stated in terms of potential outcomes

Ya=0=E{b(W,X)|Ya=0,X}\displaystyle Y^{a=0}=E\left\{b^{*}\big{(}W,X\big{)}\,\big{|}\,Y^{a=0},X\right\} (6)

which essentially formalizes the idea that WW is a sufficiently relevant proxy for the potential outcome Ya=0Y^{a=0} if there exist a (potentially nonlinear) transformation of (W,X)(W,X) whose conditional expectation given (Ya=0,X)(Y^{a=0},X) recovers Ya=0Y^{a=0}. As b(W,X)b^{*}(W,X) provides a bridge between the observed data equation (5), and its potential outcome counterpart (6), we aptly refer to b(W,X)b^{*}\big{(}W,X\big{)} as a COCA confounding bridge function. Note that classical measurement error is a special case of the equation in the display above in which case bb^{*} is the identity map and W=Ya=0+eW=Y^{a=0}+e where ee is an independent mean zero error. The condition can therefore be viewed as a nonparametric generalization of classical measurement error which allows WW and YY to be of arbitrary nature and does not assume the error to be unbiased on the additive scale. We further illustrate the assumption in the case of binary WW and YY. As shown in the Supplementary Material B.4, in this case with suppressing covariates, the following b(W)b^{*}(W) satisfies (5):

b(W)=(1W)Pr(W=1|Y=0,A=0)+WPr(W=0|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0);\displaystyle b^{*}(W)=\frac{-(1-W)\cdot\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}+W\cdot\Pr\big{(}W=0\,\big{|}\,Y=0,A=0\big{)}}{\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}};

provided that Pr(W=1|Ya=0=1)Pr(W=1|Ya=0=0)\Pr\big{(}W=1\,\big{|}\,Y^{a=0}=1\big{)}\not=\Pr\big{(}W=1\,\big{|}\,Y^{a=0}=0\big{)}, encoding the requirement that WW cannot be independent of Ya=0Y^{a=0}, i.e., Assumption 2. Beyond the binary case, for more general outcome types, Assumption 4 likewise formally defines a Fredholm integral equation of the first kind, for which sufficient conditions for existence of a solution are well characterized in functional analysis textbooks; we again refer the reader to Miao et al. (2016). We are now ready to state our result, which we prove in the Supplementary Material B.5:

Result 3

Suppose that Assumptions 1-4 hold, and bb^{*} satisfies (5). Then,

ψ0=E{b(W,X)|A=1} and ψ=E{Yb(W,X)|A=1}.\displaystyle\psi_{0}^{*}=E\big{\{}b^{*}\big{(}W,X\big{)}\,\big{|}\,A=1\big{\}}\ \text{ and }\ \psi^{*}=E\big{\{}Y-b^{*}(W,X)\,\big{|}\,A=1\big{\}}\ . (7)

In the binary example discussed above where b(w)b^{*}(w) was uniquely identified, we have that

ψ0\displaystyle\psi_{0}^{*} =E{b(W)|A=1}=Pr(W=1|A=1)Pr(W=1|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0).\displaystyle=E\left\{b^{*}\big{(}W\big{)}\,\big{|}\,A=1\right\}=\frac{\Pr\big{(}W=1\,\big{|}\,A=1\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}}{\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}}\ .

We briefly highlight a key feature of the above result reflected in its proof, which is that b(W,X)b^{*}\big{(}W,X\big{)} need not be uniquely identified by equation (5), and that any such solution leads to a unique value for E(Ya=0|A=1).E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}. Interestingly, the identifying formula in the display above was also obtained by Tchetgen Tchetgen (2013) in the binary case, although he did not emphasize the key role of the bridge function as a general framework for identification beyond the binary case.

3.3 Semiparametric Efficiency Theory

Let \mathcal{M} denote a semiparametric model defined as a collection of observed data laws that admit a solution to (5), i.e.,

={P|P(O) is regular and Assumption 4 holds, i.e., there exists b solving (5)}.\displaystyle\mathcal{M}=\Big{\{}P\,\Big{|}\,\text{$P(O)$ is regular and Assumption \ref{condition-1} holds, i.e., there exists $b^{*}$ solving \eqref{Outcome COCA}}\Big{\}}\ .

We further consider the following surjectivity condition:

  • (Surjectivity): Let T:2(W,X)2(Y,A=0,X)T:\mathcal{L}_{2}(W,X)\rightarrow\mathcal{L}_{2}(Y,A=0,X) denote the operator given by T(g)=E{g(W,X)|Y,A=0,X}T(g)=E\big{\{}g(W,X)\,\big{|}\,Y,A=0,X\big{\}}. At the true data law, TT is surjective.

The surjectivity condition states that the Hilbert space 2(W,X)\mathcal{L}_{2}(W,X) is sufficiently rich so that any element in 2(Y,A=0,X)\mathcal{L}_{2}(Y,A=0,X) can be recovered from an element in 2(W,X)\mathcal{L}_{2}(W,X) via the conditional expectation mapping; see Cui et al. (2023), Dukes et al. (2023), and Ying et al. (2023) for related discussions. In addition, we consider a submodel sub\mathcal{M}_{\text{sub}}:

sub={P|ω and b are unique solutions to (4) and (5), respectively, and (Surjectivity) is satisfied}\displaystyle\mathcal{M}_{\text{sub}}=\Bigg{\{}P\in\mathcal{M}\,\Bigg{|}\,\begin{array}[]{l}\text{$\omega^{*}$ and $b^{*}$ are unique solutions to \eqref{EPS equation} and \eqref{Outcome COCA}, respectively, }\\ \text{and \hyperlink{(Surjectivity)}{(Surjectivity)} is satisfied}\end{array}\Bigg{\}}

We then establish the semiparametric local efficiency bound for ψ\psi^{*} under \mathcal{M} at the submodel sub\mathcal{M}_{\text{sub}}.

Result 4

Suppose that Assumptions 1-4 hold. Then, the following results hold.

  • (i)

    The following function IF(O;ω,b)\texttt{IF}(O;\omega^{*},b^{*}) is an influence function for ψ\psi^{*} under \mathcal{M}.

    IF(O;ω,b)=A{Yb(W,X)ψ}(1A)ω(Y,X){Yb(W,X)}Pr(A=1).\displaystyle\texttt{IF}(O;\omega^{*},b^{*})=\frac{A\big{\{}Y-b^{*}\big{(}W,X\big{)}-\psi^{*}\big{\}}-(1-A)\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}}{\Pr\big{(}A=1\big{)}}\ . (8)
  • (ii)

    The influence function IF(O;ω,b)\texttt{IF}(O;\omega^{*},b^{*}) is the efficient influence function for ψ\psi^{*} under \mathcal{M} at the submodel sub\mathcal{M}_{\text{sub}}. Therefore, the corresponding semiparametric local efficiency bound for ψ\psi^{*} is Var{IF(O;ω,b)}Var\big{\{}\texttt{IF}(O;\omega^{*},b^{*})\big{\}}.

The influence function IF shares similarity with an influence function for ψ\psi^{*} in the proximal causal inference framework; see Section G of Cui et al. (2023) for details. Interestingly, the influence function has the following doubly robust property (Scharfstein et al., 1999; Lunceford and Davidian, 2004; Bang and Robins, 2005); see the Supplementary Material B.6 for the proof:

Result 5

Suppose that Assumptions 1-4 are satisfied. In addition, suppose that either (i) E{b(W,X)|A=0,Y=y,X}=yE\big{\{}b^{\dagger}\big{(}W,X\big{)}\,\big{|}\,A=0,Y=y,X\big{\}}=y or (ii) ω(y,x)=ω(y,x)\omega^{\dagger}\big{(}y,x\big{)}=\omega^{*}\big{(}y,x\big{)}, but not necessarily both, is satisfied. Then, we have that E{IF(O;ω,b)}=0E\big{\{}\texttt{IF}(O;\omega^{\dagger},b^{\dagger})\big{\}}=0.

In words, if either the COCA confounding bridge function or the EPS, but not necessarily both, is correctly specified, the influence function is an unbiased estimating function of ψ\psi^{*}.

Using expressions (3), (7), and (8), one can construct parametric estimators of ψ\psi^{*}. Specifically, the first estimator using (3) entails a priori specifying a parametric model for the EPS, say a logistic regression model. The second estimator based on (7) entails a priori specifying a parametric model for the COCA bridge function, say a linear model. Lastly, the third estimator based on (8) entails parametric models for both EPS and COCA bridge functions. The first two estimators rely on the correct exposure and COCA bridge function specifications, respectively. Thus, misspecification of either model will likely result in biased inferences about the ETT. On the other hand, the last estimator has a doubly-robust property (Scharfstein et al., 1999; Lunceford and Davidian, 2004; Bang and Robins, 2005) in that it can be used for unbiased inference about the ETT if either EPS or COCA bridge function is correct, without a priori knowledge of which model, if any, is incorrect. In Section A.4, we provide details on constructing these three parametric estimators and their large sample behavior.

A significant limitation of the three parametric estimators is their dependence on specific parametric specifications of nuisance components, which can lead to biased inference if the model specifications are incorrect. To address this concern, a potential solution is to develop an estimator where nuisance components are estimated using nonparametric methods, drawing on advancements in recent learning theory. In the following Section, we construct such an estimator and study its statistical properties.

4 A Semiparametric Locally Efficient Estimator

Our estimator is derived from the influence function IF in Result 5 and adopts the cross-fitting approach (Schick, 1986; Chernozhukov et al., 2018), which is implemented as follows. We randomly split NN study units, denoted by ={1,,N}\mathcal{I}=\{1,\ldots,N\}, into KK non-overlapping folds, denoted by {1,,K}\{\mathcal{I}_{1},\ldots,\mathcal{I}_{K}\}. For each k=1,,Kk=1,\ldots,K, we estimate the EPS and COCA confounding bridge functions using observations in kc=k\mathcal{I}_{k}^{c}=\mathcal{I}\setminus\mathcal{I}_{k}, and then evaluate the estimated nuisance functions using observations in k\mathcal{I}_{k} to obtain an estimator of ψ\psi^{*}. We refer to kc\mathcal{I}_{k}^{c} and k\mathcal{I}_{k} as the estimation and evaluation folds, respectively. To use the entire sample, we take the simple average of the KK estimators.

We introduce the following additional notation in order to facilitate the discussion. Let (V)\mathcal{H}(V) be the Reproducing Kernel Hilbert Space (RKHS) of VV endowed with a universal kernel function 𝒦\mathcal{K}, such as the Gaussian kernel 𝒦\mathcal{K}, i.e., 𝒦(v,v)=exp{vv22/κ}\mathcal{K}(v,v^{\prime})=\exp\big{\{}-\big{\|}v-v^{\prime}\|_{2}^{2}/\kappa\big{\}} where κ(0,)\kappa\in(0,\infty) is a bandwidth parameter; see Chapter 4 of Steinwart and Christmann (2008) for the definition and examples of the universal kernal function. For each k=1,,Kk=1,\ldots,K, let (k)(V)=|kc|1ikcVi\mathbbm{P}^{(-k)}(V)=|\mathcal{I}_{k}^{c}|^{-1}\sum_{i\in\mathcal{I}_{k}^{c}}V_{i} and (k)(V)=|k|1ikVi\mathbbm{P}^{(k)}(V)=|\mathcal{I}_{k}|^{-1}\sum_{i\in\mathcal{I}_{k}}V_{i}. For a function g(O)g(O), let gP,2=[E{g2(O)}]1/2\big{\|}g\big{\|}_{P,2}=\big{[}E\big{\{}g^{2}(O)\big{\}}\big{]}^{1/2} be the L2(P)L_{2}(P)-norm of gg.

We estimate the EPS and COCA bridge functions by adopting a recently developed minimax estimation approach (Ghassami et al., 2022). We remark that other approaches (e.g., Mastouri et al. (2021)) can also be adopted with minor modification. Note that ω(Y,X)\omega^{*}(Y,X) and b(W,X)b^{*}(W,X) satisfy

E[{(1A)ω(Y,X)A}p(W,X)]=0,p2(W,X),\displaystyle E\left[\big{\{}(1-A)\omega^{*}(Y,X)-A\big{\}}p\big{(}W,X\big{)}\right]=0\ ,\ \forall p\in\mathcal{L}_{2}(W,X)\ ,
E[(1A){Yb(W,X)}q(Y,X)]=0,q2(Y,X).\displaystyle E\left[(1-A)\big{\{}Y-b^{*}(W,X)\big{\}}q(Y,X)\right]=0\ ,\ \forall q\in\mathcal{L}_{2}(Y,X)\ .

Therefore, following Ghassami et al. (2022), minimax estimators of ω\omega^{*} and bb^{*} are given by

ω^(k)()\displaystyle\widehat{\omega}^{(-k)}(\cdot)
=argminω(Y,X)[maxp(W,X)[(k)[p(W,X){(1A)ω(Y,X)A}p2(W,X)]λpp2]+λωω2]\displaystyle=\operatorname*{arg\,min}_{\omega\in\mathcal{H}(Y,X)}\bigg{[}\displaystyle{\max_{p\in\mathcal{H}(W,X)}}\bigg{[}\mathbbm{P}^{(-k)}\bigg{[}\begin{array}[]{l}p(W,X)\big{\{}(1-A)\omega(Y,X)-A\big{\}}\\ -p^{2}(W,X)\end{array}\bigg{]}-\lambda_{p}\big{\|}p\big{\|}_{\mathcal{H}}^{2}\bigg{]}+\lambda_{\omega}\big{\|}\omega\big{\|}_{\mathcal{H}}^{2}\bigg{]}
b^(k)()\displaystyle\widehat{b}^{(-k)}(\cdot)
=argminh(W,X)[maxq(Y,X)[(k)[q(Y,X)(1A){Yb(W,X)}q2(Y,X)]λqq2]+λbb2]\displaystyle=\operatorname*{arg\,min}_{h\in\mathcal{H}(W,X)}\bigg{[}\displaystyle{\max_{q\in\mathcal{H}(Y,X)}}\bigg{[}\mathbbm{P}^{(-k)}\bigg{[}\begin{array}[]{l}q(Y,X)(1-A)\big{\{}Y-b(W,X)\big{\}}\\ -q^{2}(Y,X)\end{array}\bigg{]}-\lambda_{q}\big{\|}q\big{\|}_{\mathcal{H}}^{2}\bigg{]}+\lambda_{b}\big{\|}b\big{\|}_{\mathcal{H}}^{2}\bigg{]}

where \big{\|}\cdot\big{\|}_{\mathcal{H}} is an RKHS norm and λp\lambda_{p}, λω\lambda_{\omega}, λq\lambda_{q}, and λb\lambda_{b} are positive regularization parameters.

We make a few remarks about the minimax estimation approach, of which details are relegated to the Supplementary Material A.6. First, despite the complicated formulas, closed-form representations of ω^(k)\widehat{\omega}^{(-k)} and b^(k)\widehat{b}^{(-k)} are available from the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001). Second, the bandwidth and regularization parameters can be selected via cross-validation. Lastly, ω\omega^{*} may vary widely because it is a ratio of two probabilities. In such cases, the proposed minimax estimator may result in significantly small or negative estimates. To mitigate this issue, one may consider a practical approach to regularize the minimax estimator when it appears to be ill-behaved.

Using the minimax estimators of the nuisance functions, a semiparametric estimator ψ^\widehat{\psi} of ψ\psi^{*} is then obtained as follows:

ψ^=1Kk=1Kψ^(k),\displaystyle\widehat{\psi}=\frac{1}{K}\sum_{k=1}^{K}\widehat{\psi}^{(k)}\ ,
ψ^(k)=1Ni=1NAiYi(k)[(1A)ω^(k)(Y,X){Yb^(k)(W,X)}+Ab^(k)(W,X)]1Ni=1NAi.\displaystyle\widehat{\psi}^{(k)}=\frac{\frac{1}{N}\sum_{i=1}^{N}A_{i}Y_{i}-\mathbbm{P}^{(k)}\big{[}(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}+A\widehat{b}^{(-k)}(W,X)\big{]}}{\frac{1}{N}\sum_{i=1}^{N}A_{i}}\ .

Under regularity conditions, the semiparametric estimator ψ^\widehat{\psi} is consistent and asymptotically normal for ψ\psi^{*}.

Assumption 5

Suppose that the following conditions hold for all k=1,,Kk=1,\ldots,K:

  • (i)

    (Boundedness) There exists a finite constant C>0C>0 such that

    |ω(y,x)|[C1,C],|ω^(k)(y,x)|[C1,C] for all (y,x),\displaystyle\big{|}\omega^{*}(y,x)\big{|}\in[C^{-1},C]\ ,\quad\big{|}\widehat{\omega}^{(-k)}(y,x)\big{|}\in[C^{-1},C]\quad\text{ for all $(y,x)$,}
    |b(w,x)|C,|b^(k)(w,x)|C for all (w,x),E(Y4)C.\displaystyle\big{|}b^{*}(w,x)\big{|}\leq C\ ,\quad\big{|}\widehat{b}^{(-k)}(w,x)\big{|}\leq C\quad\text{ for all $(w,x)$,}\quad E(Y^{4})\leq C\ .
  • (ii)

    (Consistency) As NN\rightarrow\infty, we have ω^(k)ωP,2=oP(1)\big{\|}\widehat{\omega}^{(-k)}-\omega^{*}\big{\|}_{P,2}=o_{P}(1) and b^(k)bP,2=oP(1)\big{\|}\widehat{b}^{(-k)}-b^{*}\big{\|}_{P,2}=o_{P}(1).

  • (iii)

    (Cross-product Rates) As NN\rightarrow\infty, we have

    min[ω^(k)ωP,2E(k)[(1A){b(W,X)b^(k)(W,X)}|Y,X]P,2,b^(k)bP,2E(k)[(1A){ω(Y,X)ω^(k)(Y,X)}|W,X]P,2]=oP(N1/2).\displaystyle\min\Bigg{[}\begin{array}[]{l}\big{\|}\widehat{\omega}^{(-k)}-\omega^{*}\big{\|}_{P,2}\big{\|}E^{(-k)}\big{[}(1-A)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}\,\big{|}\,Y,X\big{]}\big{\|}_{P,2},\\ \big{\|}\widehat{b}^{(-k)}-b^{*}\big{\|}_{P,2}\big{\|}E^{(-k)}\big{[}(1-A)\big{\{}\omega^{*}(Y,X)-\widehat{\omega}^{(-k)}(Y,X)\big{\}}\,\big{|}\,W,X\big{]}\big{\|}_{P,2}\end{array}\Bigg{]}=o_{P}(N^{-1/2})\ .

Assumption 5-(i) states that nuisance functions and the corresponding estimators are uniformly bounded. Assumption 5-(ii) states that the estimated nuisance functions are consistent for the true nuisance functions in the L2(P)L_{2}(P) norm sense. Assumption 5-(iii) states that the cross-product rate of nuisance function estimators are oP(N1/2)o_{P}(N^{-1/2}). Assumption 5-(iii) is satisfied if bb^{*} and ω\omega^{*} are sufficiently smooth, the conditional expectation operators f(W,X)E{f(W,X)|Y,A=0,X}f(W,X)\mapsto E\{f(W,X)\,\big{|}\,Y,A=0,X\} and f(Y,X)E{f(Y,X)|W,A=0,X}f(Y,X)\mapsto E\{f(Y,X)\,\big{|}\,W,A=0,X\} are sufficiently smooth, and ω^(k)\widehat{\omega}^{(-k)} and b^(k)\widehat{b}^{(-k)} are estimated over an RKHS with fast enough eigendecay; see Section 5 of Ghassami et al. (2022) for details. Importantly, if one nuisance function is estimated at sufficiently fast rates, the other nuisance function is allowed to converge at a substantially slower rate provided that the cross-products remain oP(N1/2)o_{P}(N^{-1/2}). This is an instance of the mixed-bias property described by Rotnitzky et al. (2020) and Ghassami et al. (2022). It is also worth highlighting the structure of the mixed bias in the current context which is the minimum of two product biases, each containing a bias term for a nuisance function and a projected bias term for the other nuisance function; this property was first reported in Ghassami et al. (2022) for a large class of functionals including ours.

Result 6 establishes that ψ^\widehat{\psi} is consistent and asymptotically normal (CAN) for ψ\psi^{*}.

Result 6

Suppose that Assumptions 1-5 hold. Then, we have N(ψ^ψ)DN(0,σ2)\sqrt{N}\big{(}\widehat{\psi}-\psi^{*}\big{)}\stackrel{{\scriptstyle D}}{{\rightarrow}}N(0,\sigma^{2}) where σ2=Var{IF(O;ω,b)}\sigma^{2}=Var\big{\{}\texttt{IF}(O;\omega^{*},b^{*})\big{\}}, and a consistent estimator of σ2\sigma^{2} is σ^2=k=1Kσ^2,(k)/K\widehat{\sigma}^{2}=\sum_{k=1}^{K}\widehat{\sigma}^{2,(k)}/K where

σ^2,(k)=(k)[[A{Yb^(k)(W,X)ψ^(k)}(1A)ω^(k)(Y,X){Yb^(k)(W,X)}]2](i=1NAi/N)2.\displaystyle\widehat{\sigma}^{2,(k)}=\frac{\mathbbm{P}^{(k)}\big{[}\big{[}A\big{\{}Y-\widehat{b}^{(-k)}\big{(}W,X\big{)}-\widehat{\psi}^{(k)}\big{\}}-\big{(}1-A\big{)}\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}\big{(}W,X\big{)}\big{\}}\big{]}^{2}\big{]}}{\big{(}\sum_{i=1}^{N}A_{i}/N\big{)}^{2}}\ .

Using the variance estimator σ^2\widehat{\sigma}^{2}, valid 100(1α)100(1-\alpha)% confidence intervals for the ETT are given by (ψ^+zα/2σ^/N,ψ^+z1α/2σ^/N)\big{(}\widehat{\psi}+z_{\alpha/2}\widehat{\sigma}/\sqrt{N},\widehat{\psi}+z_{1-\alpha/2}\widehat{\sigma}/\sqrt{N}\big{)} where zαz_{\alpha} is the 100α100\alphath percentile of the standard normal distribution. Alternatively, one may construct confidence intervals using the multiplier bootstrap (van der Vaart and Wellner, 1996, Chapter 2.9); see Section A.6 for details.

Lastly, the cross-fitting estimator depends on a specific sample split, and thus, may produce outlying estimates if some split samples do not represent the entire data. To mitigate this issue, Chernozhukov et al. (2018) proposes to use median adjustment from multiple cross-fitting estimates; the detail can be found in the Supplementary Material A.6.

5 Data Application: Zika Virus Outbreak in Brazil

The Zika virus, which can be transmitted from a pregnant woman to her fetus, can cause serious brain abnormalities, including microcephaly (i.e., an abnormally small head) (Rasmussen et al., 2016). Brazil is one of the countries hardest hit by the Zika virus. In particular, the outbreak in 2015 resulted in over 200,000 cases in Brazil by 2016 (Lowe et al., 2018). As a result, many prior works (Castro et al., 2018; Diaz-Quijano et al., 2018; Taddeo et al., 2022; Tchetgen Tchetgen et al., 2024) asked whether the Zika virus outbreak caused a drop in birth rates.

We re-analyzed the dataset analyzed in Taddeo et al. (2022) and Tchetgen Tchetgen et al. (2024). In the dataset, we focused on 673 municipalities in two states of Brazil, Pernambuco and Rio Grande do Sul, which are northeastern and southernmost states. Out of the 1248 cases of microcephaly that occurred in Brazil by November 28, 2015, 51.8% (646 cases) were reported in Pernambuco (PE), less than 10 cases of Zika-related microcephaly were reported in Rio Grande do Sul (RS) (Gregianini et al., 2017), which shows that PE was severely impacted by the Zika virus outbreak, while RS was minimally affected. Based on their epidemiologic histories, we defined 185 and 488 municipalities in PE and RS as treated and control groups, respectively.

For each municipality, we included the following variables in the analysis. As pre-treatment covariates, we included municipality-level population size, population density, and proportion of females measured in 2014. We used the post-epidemic municipality-level birth rate in 2016 as the outcome YY, where the birth rate is defined as the total number of live human births per 1,000 persons. We used the pre-epidemic municipality-level birth rates in 2013 and 2014 as the outcome proxies (i.e., NCO), denoted by W1W_{1} and W2W_{2}, respectively. To be valid proxies, the birth rates in 2013 and 2014 must satisfy Assumption 2: (i) birth rates in 2013 and 2014 cannot be causally impacted by the Zika virus epidemic which occurred in 2015, (ii) birth rates in 2013 and 2014 are correlated with what the birth rate in 2016 would have been had there not been a Zika virus epidemic, and (iii) birth rates in 2013 and 2014 are independent of a municipality’s Zika epidemic status, upon conditioning on its Zika virus epidemic-free potential birth rate in 2016. The first two conditions are uncontroversial, while the third condition largely relies on the extent to which pre-epidemic birth rates can accurately be viewed as a proxy for the counterfactual birth rate had the pandemic not occurred, and as such would not further be predictive of whether the municipality experienced a high rate of Zika virus incidence, conditional on the region’s epidemic-free counterfactual birth rate in 2016. Although one might consider this last assumption reasonable, ultimately, it is empirically untestable without making an alternative assumption. Nevertheless, in the Supplementary Material A.8, we describe a straightforward sensitivity analysis to evaluate the extent to which violation of the assumption might impact inference.

Using the dataset, we estimated ψ=E(Ya=1Ya=0|A=1)\psi^{*}=E\big{(}Y^{a=1}-Y^{a=0}\,\big{|}\,A=1\big{)}, i.e., the difference between the observed average birth rate of Pernambuco and a forecast of what it would have been had the Zika outbreak been prevented. Therefore, the ETT quantifies the average treatment effect of the Zika outbreak on the birth rate within the Pernambuco region. Of note, the crude estimand E(Y|A=1)E(Y|A=0)E(Y\,\big{|}\,A=1)-E(Y\,\big{|}\,A=0) was estimated to be equal to 3.3843.384, suggesting that municipalities in the PE region (with higher incidence of Zika virus) experienced a higher birth rate than RS regions in 2016 during the Zika virus outbreak. An immediate concern is that this crude association between AA and YY might be subject to significant confounding bias, leading us to conduct two separate analyses geared at addressing residual confounding bias; the proposed COCA methods, which we compared with a standard difference-in-differences analysis. Thus, we estimate the ETT using the approach outlined in Section 4 where the NCOs are specified as either (i) W1W_{1}, birth rate in 2013, or (ii) W2W_{2}, birth rate in 2014, or (iii) (W1,W2)(W_{1},W_{2}). For comparison, we also obtained doubly-robust parametric estimators of the ETT using the three NCO specifications; see the Supplementary Material A.7 for details on how these estimators were constructed.

Table 1 summarizes corresponding results. We find that the six COCA estimates vary between 1.833-1.833 and 2.410-2.410, meaning between 1.8331.833 and 2.4102.410 birth per 1,000 persons were reduced in PE due to the Zika virus outbreak, an empirical finding better aligned with the scientific hypothesis that Zika may likely adversely impact the birth rates of exposed populations. Compared to the crude estimate of 3.384, the negative effect estimates indeed provide compelling evidence of potential confounding. We also obtain an estimate using the difference-in-difference estimator under a standard parallel trends assumption (e.g., Card and Krueger (1994); Angrist and Pischke (2009)), which yields a considerably smaller effect estimate varying between 1.156-1.156 and 1.041-1.041; noting that the DiD estimator requires the assumption of equi-confounding of the AYA-Y association in the pre- and post-periods, while our proposed estimator does not (but instead requires conditions (i)-(iii) outlined above). Regardless of the estimator, all estimates appear to be consistent with the anticipated adverse causal impact of the Zika Virus epidemic. Consequently, we conclude that based on inferences aimed at accounting for confounding (DiD and COCA), the Zika virus outbreak likely led to a decline in the birthrate of affected regions in Brazil, which agrees with similar findings in the literature (Castro et al., 2018; Diaz-Quijano et al., 2018; Taddeo et al., 2022; Tchetgen Tchetgen et al., 2024).

Estimator Statistic NCO
W1W_{1} W2W_{2} (W1,W2)(W_{1},W_{2})
Semiparametric COCA Estimate 2.410-2.410 2.182-2.182 2.180-2.180
SE 0.3560.356 0.5030.503 0.3420.342
95% CI (3.107-3.107,1.713-1.713) (3.168-3.168,1.196-1.196) (2.850-2.850,1.510-1.510)
Doubly-robust parametric COCA Estimate 2.235-2.235 1.833-1.833 2.182-2.182
SE 0.5020.502 0.5190.519 0.4150.415
95% CI (3.220-3.220,1.250-1.250) (2.850-2.850,0.816-0.816) (2.996-2.996,1.368-1.368)
Standard DiD under parallel trends Estimate 1.156-1.156 1.041-1.041 1.041-1.041
SE 0.1990.199 0.1950.195 0.1950.195
95% CI (1.546-1.546,0.767-0.767) (1.424-1.424,0.658-0.658) (1.424-1.424,0.658-0.658)
Table 1: Summary of Data Analysis. Values in “Estimate” row represent the estimates of the ETT. Values in “SE” and “95% CI” rows represent the standard errors (SEs) associated with the estimates and the corresponding 95% confidence intervals (CIs), respectively. The reported values are expressed as births per 1,000 persons.

6 Discussion and Possible Extensions

We have described a COCA nonparametric identification framework, therefore extending previous results of Tchetgen Tchetgen (2013) to a more general setting accommodating outcomes of arbitrary nature and obviating the need for an assumption of constant treatment effects, i.e. rank preservation. We have proposed three estimation strategies including a doubly robust method which has appealing robustness properties. Interestingly, the COCA central identifying assumption, that conditioning on the treatment-free counterfactual would in principle shield the treatment assignment from any association with the NCO is isomorphic to an analogous assumption in the missing data literature where an outcome might be missing not at random; however, a fully observed so-called shadow variable (the missing data analog of an NCO) reasonably assumed to be conditionally independent of the missing data process given the value of the potentially missing outcome. For example, Zahner et al. (1992) considered a study of the children’s mental health evaluated through their teachers’ assessments in Connecticut. However, the data for the teachers’ assessments are subject to nonignorable missingness. As a proxy of the teacher’s assessment, a separate parent report is available for all children in this study. The parent report is likely to be correlated with the teacher’s assessment, but is unlikely to be related to the teacher’s response rate given the teacher’s assessment and fully observed covariates. Hence, the parental assessment is regarded as a shadow variable for the teacher’s assessment in this study. The literature on shadow variables is fast-growing (d’Haultfoeuille, 2010; Kott, 2014; Wang et al., 2014; Miao and Tchetgen Tchetgen, 2016; Li et al., 2023; Miao et al., 2023), the methods developed in this paper have close parallels to shadow variable counterparts in this literature. This connection to shadow variables is particularly salient when the COCA confounding bridge function is not uniquely defined, which can easily occur for instance when the shadow variable (or analogously the NCO) is multivariate, therefore significantly complicating inference. Fortunately, the methods developed by Li et al. (2023) for the analogous shadow variable setting directly apply to the corresponding COCA setting and thus provide a complete solution for identification and inference for the average treatment effect for the treated without relying on completeness conditions nor on unique identification of either the EPS or the COCA confounding bridge function. We refer the interested reader to this latter work for further details. It is worth noting that the doubly robust estimator proposed in this paper appears to be completely new, and different from those of Miao and Tchetgen Tchetgen (2016), Li et al. (2023), and Miao et al. (2023) and therefore may also be of use in shadow variable applications. Likewise, the doubly robust estimators proposed in the latter works can equally be applied to the current COCA setting as an alternative inferential approach.

Additionally, as mentioned in the previous Section, the key assumption that conditioning on the treatment-free potential outcome, would in principle make the NCO or outcome proxy irrelevant to treatment mechanism is ultimately untestable, and may in certain settings not hold exactly. In fact, this would be the case if the NCO were in fact explicitly used in assigning the treatment in which case the assumption might be violated. In order to address such eventuality, the analyst might consider several candidate proxies/NCOs when available, and may even perform an over-identification test, by inspecting the extent to which the estimated causal effect depends on the choice of proxy. Alternatively, a sensitivity analysis might also be performed to evaluate the potential impact of a hypothesized departure from the assumption. In the context of the Zika virus application, an over-identification test and a sensitivity analysis were carried out as illustrative examples, with the corresponding results and discussion provided in the Supplementary Material A.8 and A.9.

Finally, as previously mentioned in the Introduction, COCA offers an alternative approach to proximal causal inference for debiasing observational estimates of the ETT by leveraging negative control or valid confounding proxies. A key difference highlighted earlier between these two frameworks is that COCA relies on a single valid NCO which directly proxies the treatment-free potential outcome, while proximal causal inference requires both valid NCO and negative control treatment variables that proxy an underlying unmeasured confounder. Importantly, COCA takes advantage of the fact that the treatment-free potential outcome is observed in the untreated, while in proximal causal inference, the unmeasured confounders for which proxies are available, are themselves never observed, arguably a more challenging identification task. Despite the practical advantage of needing one rather than two proxies, it is important to note that though COCA identifies the ETT, it fails to nonparametrically identify the population average treatment effect (ATE), without an additional assumption. In contrast, proximal causal inference provides nonparametric identification of both causal parameters and thus can be interpreted as providing richer identification opportunities. A key reason for this difference in the scope of identification is the fact that in the current paper, we have emphasized an interpretation of the NCO as a proxy for the treatment-free potential outcome but not for the potential outcome under treatment, in the sense that under our conditions, it must be that the treatment-free potential outcome does not only shield the treatment from the NCO, but also shields the potential outcome under treatment from the latter. Two potential strategies to recover COCA identification of the population ATE, might be either (i) evoke a rank-preservation assumption which if appropriate would imply that the ETT and the ATE are equal (this is the assumption made in Tchetgen Tchetgen (2013)); or (ii) identify a second proxy NCO which is a valid proxy for the counterfactual outcome under treatment. The second condition would be needed if Ya=1Y^{a=1} can also be viewed as a hidden confounder. By a symmetry argument, one can show that (ii) in fact would provide identification of the average counterfactual outcome under treatment for the untreated. A weighted average of both counterfactual means would then provide identification of the ATE. Details are not provided, but can easily be deduced from the presentation.

Acknowledgment

The authors would like to thank James Robins, Thomas Richardson, and Ilya Shpitser for helpful discussions.

Data Availability

The data and the analysis R code are accessible on the GitHub repository located at http://github.com/qkrcks0218/SingleProxyControl.

Supplementary Material

This document provides details of “Single Proxy Control.” In Section A, we provide details of the main paper. In Section B, we provide proofs of the results in the main paper.

Appendix A Details of the Main Paper

A.1 Details of the Latent Variable Model (1)

Recall that model (1) is given by

Ya=0=hy(U0,X),hy(u,x):strictly increasing in u for all x\displaystyle Y^{a=0}=h_{y}(U_{0},X)\ ,\ h_{y}(u,x):\text{strictly increasing in $u$ for all $x$} (1a)
WU0|X and WA|(U0,X)\displaystyle W\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,U_{0}\,\big{|}\,X\text{ and }W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(U_{0},X) (1b)

Under these assumptions, we establish that

WU0|XWhy(U0,X)|XWYa=0|X\displaystyle W\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,U_{0}\,\big{|}\,X\quad\Rightarrow\quad W\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,h_{y}(U_{0},X)\,\big{|}\,X\quad\Rightarrow\quad W\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,Y^{a=0}\,\big{|}\,X

justifying Assumption 2-(ii): WYa=0|XW\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,Y^{a=0}\,\big{|}\,X, and

W|(U0=u,A=1,X)=DW|(U0=u,A=0,X)\displaystyle W\,\big{|}\,(U_{0}=u,A=1,X)\stackrel{{\scriptstyle D}}{{=}}W\,\big{|}\,(U_{0}=u,A=0,X)
\displaystyle\Rightarrow\quad W|(hy1(Ya=0,X)=u,A=1,X)=DW|(hy1(Ya=0,X)=u,A=0,X)\displaystyle W\,\big{|}\,(h_{y}^{-1}(Y^{a=0},X)=u,A=1,X)\stackrel{{\scriptstyle D}}{{=}}W\,\big{|}\,(h_{y}^{-1}(Y^{a=0},X)=u,A=0,X) (h is monotonic)\displaystyle(\because\ \text{$h$ is monotonic})
\displaystyle\Rightarrow\quad W|(Ya=0=hy(u,X),A=1,X)=DW|(Ya=0=hy(u,X),A=0,X)\displaystyle W\,\big{|}\,(Y^{a=0}=h_{y}(u,X),A=1,X)\stackrel{{\scriptstyle D}}{{=}}W\,\big{|}\,(Y^{a=0}=h_{y}(u,X),A=0,X) (Definition of Ya=0)\displaystyle(\because\ \text{Definition of $Y^{a=0}$})
\displaystyle\Rightarrow\quad W|(Ya=0=y,A=1,X)=DW|(Ya=0=y,A=0,X)\displaystyle W\,\big{|}\,(Y^{a=0}=y,A=1,X)\stackrel{{\scriptstyle D}}{{=}}W\,\big{|}\,(Y^{a=0}=y,A=0,X)

justifying Assumption 2-(iii): WA|(Ya=0,X)W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(Y^{a=0},X).

It is instructive to consider a data generating mechanism that can be formulated in a manner analogous to the changes-in-changes model (Athey and Imbens, 2006). To this end, we now consider a special case of model (1):

Ya=0=hy(Uy0,X),hy(u,x):strictly increasing in u for all x\displaystyle Y^{a=0}=h_{y}(U_{y0},X)\ ,\ h_{y}(u,x):\text{strictly increasing in $u$ for all $x$} (SPC-a)
W=hw(Uw,X)\displaystyle W=h_{w}(U_{w},X) (SPC-b)
Uy0Uw|X and UwA|Uy0,X\displaystyle U_{y0}\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,U_{w}\,\big{|}\,X\quad\text{ and }\quad U_{w}\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,U_{y0},X (SPC-c)

Figure 4 provides a graphical representation compatible with expressions (SPC-a)-(SPC-c).

A\,A\,Ya=0Y^{a=0}Uy1U_{y1}Uy0U_{y0}UwU_{w}Ya=1Y^{a=1}W\,W\,Y\,Y\,
Figure 4: A Graphical Illustration of a Structural Model Compatible with (1). Measured covariates XX are suppressed for simplicity. The thick arrows depict the deterministic relationships Ya=0=hy(Uy0)Y^{a=0}=h_{y}(U_{y0}), W=hw(Uw)W=h_{w}(U_{w}), and Y=YAY=Y^{A}.

It is straight forward to show that (SPC-b) and (SPC-c) imply (1b) as follows:

Under these assumptions, we establish that

Uy0Uw|XUy0hw(Uw,X)|XUy0W|X\displaystyle U_{y0}\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,U_{w}\,\big{|}\,X\quad\Rightarrow\quad U_{y0}\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,h_{w}(U_{w},X)\,\big{|}\,X\quad\Rightarrow\quad U_{y0}\not\!\!\!\,\rotatebox[origin={c}]{90.0}{$\models$}\,W\,\big{|}\,X

and

Uw|(Uy0=u,A=1,X)=DUw|(Uy0=u,A=0,X)\displaystyle U_{w}\,\big{|}\,(U_{y0}=u,A=1,X)\stackrel{{\scriptstyle D}}{{=}}U_{w}\,\big{|}\,(U_{y0}=u,A=0,X)
\displaystyle\Rightarrow\quad h(Uw,X)|(Uy0=u,A=1,X)=Dh(Uw,X)|(Uy0=u,A=0,X)\displaystyle h(U_{w},X)\,\big{|}\,(U_{y0}=u,A=1,X)\stackrel{{\scriptstyle D}}{{=}}h(U_{w},X)\,\big{|}\,(U_{y0}=u,A=0,X)
\displaystyle\Rightarrow\quad W|(Uy0=u,A=1,X)=DW|(Uy0=u,A=0,X)\displaystyle W\,\big{|}\,(U_{y0}=u,A=1,X)\stackrel{{\scriptstyle D}}{{=}}W\,\big{|}\,(U_{y0}=u,A=0,X)

Models (SPC-a)-(SPC-c) share similarity with a changes-in-changes model in panel data setting (Athey and Imbens, 2006). Specifically, when a pre-treatment outcome is viewed as WW, the changes-in-changes model in panel data setting is represented as follows:

Ya=0=hy(Uy0,X),\displaystyle Y^{a=0}=h_{y}(U_{y0},X)\ , hy(u,x):strictly increasing in u for all x\displaystyle h_{y}(u,x):\text{strictly increasing in $u$ for all $x$} (CiC-a)
W=hw(Uw,X),\displaystyle W=h_{w}(U_{w},X)\ , hw(u,x):strictly increasing in u for all x\displaystyle h_{w}(u,x):\text{strictly increasing in $u$ for all $x$} (CiC-b)
Uy0|A,X=DUw|A,X\displaystyle U_{y0}\,\big{|}\,A,X\stackrel{{\scriptstyle D}}{{=}}U_{w}\,\big{|}\,A,X (CiC-c)

Expression (SPC-a) means that the treatment-free outcome is a monotonic transformation of an unobserved variable Uy0U_{y0}. We remark that the same model for Ya=0Y^{a=0} is used in the changes-in-changes model (CiC-a). Expression (SPC-b) means that the negative control outcome is a transformation of an unobserved variable UwU_{w}. It is important to highlight that in model (SPC-b), the function hwh_{w} is not necessarily monotonic in UwU_{w}, unlike (CiC-b). Consequently, (SPC-b) is strictly weaker than (CiC-b). Lastly, expressions (SPC-c) and (CiC-c) share a common goal of restricting the distributions of the unmeasured confounders, but have significant differences. Specifically, (CiC-c) states that Uy0U_{y0} and UwU_{w} have the identical distribution given (A,X)(A,X) and do not place any restriction on the correlation between Uy0U_{y0} and UwU_{w}. On the other hand, (SPC-c) states that (i) Uy0U_{y0} and UwU_{w} must be correlated given XX and (ii) UwU_{w} is conditionally independent of AA given (Uy0,X)(U_{y0},X), but it does not necessitate that Uy0U_{y0} and UwU_{w} have the identical distribution given (A,X)(A,X). Consequently, Uy0U_{y0} and UwU_{w} may have different supports under expression (SPC-c), a condition that does not apply to expression (CiC-c). Lastly, we note that both (SPC-c) and (CiC-c) hold if Uy0=UwU_{y0}=U_{w} almost surely, yet perfect correlation between the two unmeasured confounders may not be reasonable.

A.2 Standard Error of the COCA Estimator in Tchetgen Tchetgen (2013)

We provide details on the standard error of the COCA estimator obtained from the ordinary least squares estimators of the regression model E(W|A,Y,X)=β1+β2A+β3Y+β4XE(W|A,Y,X)=\beta_{1}+\beta_{2}^{*}A+\beta_{3}Y+\beta_{4}X. It is well known that the variance estimator of the OLS estimator (β^1,β^2,β^3,β^4)(\widehat{\beta}_{1},\widehat{\beta}_{2}^{*},\widehat{\beta}_{3},\widehat{\beta}_{4}) is given as

V^:=σ^2[i=1N(1AiYiXiAiAi2AiYiAiXiYiAiYiYi2YiXiXiAiXiYiXiXi2)]1,σ^2:=i=1N(Wiβ^1β^2Aiβ^3Yiβ^4Xi)2N4\displaystyle\widehat{V}:=\widehat{\sigma}^{2}\left[\sum_{i=1}^{N}\begin{pmatrix}1&A_{i}&Y_{i}&X_{i}\\ A_{i}&A_{i}^{2}&A_{i}Y_{i}&A_{i}X_{i}\\ Y_{i}&A_{i}Y_{i}&Y_{i}^{2}&Y_{i}X_{i}\\ X_{i}&A_{i}X_{i}&Y_{i}X_{i}&X_{i}^{2}\end{pmatrix}\right]^{-1}\ ,\ \widehat{\sigma}^{2}:=\frac{\sum_{i=1}^{N}\big{(}W_{i}-\widehat{\beta}_{1}-\widehat{\beta}_{2}^{*}A_{i}-\widehat{\beta}_{3}Y_{i}-\widehat{\beta}_{4}X_{i}\big{)}^{2}}{N-4}

where NN is the number of observed units, indexed by subscript i=1,Ni=1,\ldots N. Let f(a,b)=a/bf(a,b)=a/b. Then, the first-order Taylor expansion around (a0,b0)(a_{0},b_{0}) is

ab=f(a,b)f(a0,b0)+f1(a0,b0)(aa0)+f2(a0,b0)(bb0)=a0b0+aa0b0a0(bb0)b02\displaystyle\frac{a}{b}=f(a,b)\simeq f(a_{0},b_{0})+f_{1}(a_{0},b_{0})(a-a_{0})+f_{2}(a_{0},b_{0})(b-b_{0})=\frac{a_{0}}{b_{0}}+\frac{a-a_{0}}{b_{0}}-\frac{a_{0}(b-b_{0})}{b_{0}^{2}}

where f1(a,b)=f(a,b)/af_{1}(a,b)=\partial f(a,b)/\partial a and f2(a,b)=f(a,b)/bf_{2}(a,b)=\partial f(a,b)/\partial b. Therefore, we find

E{(ψ^0ψ^0)2}\displaystyle E\big{\{}(\widehat{\psi}_{0}-\widehat{\psi}_{0})^{2}\big{\}} =E{(β^2β^3β2β3)2}\displaystyle=E\bigg{\{}\bigg{(}\frac{\widehat{\beta}_{2}^{*}}{\widehat{\beta}_{3}}-\frac{{\beta}_{2}^{*}}{{\beta}_{3}}\bigg{)}^{2}\bigg{\}}
E{(β^2β2β3β2(β^3β3)β32)2}\displaystyle\simeq E\bigg{\{}\bigg{(}\frac{\widehat{\beta}_{2}^{*}-\beta_{2}^{*}}{{\beta}_{3}}-\frac{\beta_{2}^{*}(\widehat{\beta}_{3}-\beta_{3})}{{\beta}_{3}^{2}}\bigg{)}^{2}\bigg{\}}
=Var(β^2)β322β2Cov(β^2,β^3)β33+(β2)2Var(β^3)β34\displaystyle=\frac{Var(\widehat{\beta}_{2}^{*})}{\beta_{3}^{2}}-2\frac{\beta_{2}^{*}Cov(\widehat{\beta}_{2}^{*},\widehat{\beta}_{3})}{\beta_{3}^{3}}+\frac{(\beta_{2}^{*})^{2}Var(\widehat{\beta}_{3})}{\beta_{3}^{4}}

Consequently, the standard error estimator of ψ^\widehat{\psi} is given as SEψ0=β^31(V^222ψ^0V^23+ψ^02V^33)1/2SE_{\psi_{0}}=\widehat{\beta}_{3}^{-1}\Big{(}\widehat{V}_{22}-2\widehat{\psi}_{0}\widehat{V}_{23}+\widehat{\psi}_{0}^{2}\widehat{V}_{33}\Big{)}^{1/2} where V^ij\widehat{V}_{ij} is the (i,j)(i,j)th element of V^\widehat{V}.

A.3 A Graphical Illustration of Result 1

Recall that Result 1 is given by

p(W,X)1p(W,X)\displaystyle\frac{p^{*}\big{(}W,X\big{)}}{1-p^{*}\big{(}W,X\big{)}} =E{ω(Y,X)|W,X,A=0}=yπ(y,X)1π(y,X)f(y|W,X,A=0)\displaystyle=E\left\{\omega^{*}(Y,X)\,\big{|}\,W,X,A=0\right\}=\sum_{y}\frac{\pi^{*}\big{(}y,X\big{)}}{1-\pi^{*}\big{(}y,X\big{)}}f^{*}\big{(}y|W,X,A=0\big{)} (4)

where f(y|W,X,A=0)f^{*}\big{(}y\,\big{|}\,W,X,A=0\big{)} is the conditional law of YY given (W,X,A)(W,X,A) evaluated at Y=yY=y. To illustrate, we consider the following data generating process:

  • ABer(0.5)A\sim\text{Ber}(0.5)

  • Ya=0N(0.25A,1)Y^{a=0}\sim N(0.25A,1)

  • W=Ya=0+ϵW=Y^{a=0}+\epsilon where ϵN(0,1)\epsilon\sim N(0,1) with ϵ(Ya=0,A)\epsilon\,\rotatebox[origin={c}]{90.0}{$\models$}\,(Y^{a=0},A)

We find the odds function relating WW and AA and that relating Ya=0Y^{a=0} and AA are given by

Odds(W=w)=p(w)1p(w)=Pr(A=1|W=w)Pr(A=0|W=w)=ϕ(w;0.25,2)ϕ(w;0,2)\displaystyle\text{Odds}(W=w)=\frac{p^{*}(w)}{1-p^{*}(w)}=\frac{\Pr(A=1\,\big{|}\,W=w)}{\Pr(A=0\,\big{|}\,W=w)}=\frac{\phi(w;0.25,2)}{\phi(w;0,2)}
Odds(Ya=0=y)=π(y)1π(y)=Pr(A=1|Ya=0=y)Pr(A=0|Ya=0=y)=ϕ(y;0.25,1)ϕ(y;0,1)\displaystyle\text{Odds}(Y^{a=0}=y)=\frac{\pi^{*}(y)}{1-\pi^{*}(y)}=\frac{\Pr(A=1\,\big{|}\,Y^{a=0}=y)}{\Pr(A=0\,\big{|}\,Y^{a=0}=y)}=\frac{\phi(y;0.25,1)}{\phi(y;0,1)}

where ϕ(v;μ,σ2)\phi(v;\mu,\sigma^{2}) is the density of N(μ,σ2)N(\mu,\sigma^{2}). In addition, Ya=0|(W,A=0)N(W/2,0.5)Y^{a=0}\,\big{|}\,(W,A=0)\sim N(W/2,0.5). From straightforward algebra, we find E{Odds(Ya=0)|W,A=0}=Odds(W)E\big{\{}\text{Odds}(Y^{a=0})\,\big{|}\,W,A=0\big{\}}=\text{Odds}(W). We then generate N=1000N=1000 observations from the data generating process, and we draw a scatterplot of Odds(Wi)\text{Odds}(W_{i}) and Odds(Yia=0)\text{Odds}(Y_{i}^{a=0}) in Figure 5.

Refer to caption
Figure 5: A scatterplot of Odds(Wi)\text{Odds}(W_{i}) and Odds(Yia=0)\text{Odds}(Y_{i}^{a=0}) for i=1,,Ni=1,\ldots,N. The red solid line shows the deterministic relationship Odds(W=w)=E{Odds(Ya=0)|W=w,A=0}\text{Odds}(W=w)=E\{\text{Odds}(Y^{a=0})\,\big{|}\,W=w,A=0\}.

A.4 Details of the Construction of Parametric Estimators via GMM

In this Section, we provide details of the generalized method of moments (GMM, Hansen (1982)), which is implemented in many publicly available software programs, such as gmm package in R (Chaussé, 2010). Before we present details, we introduce additional notations. Let NN be the number of observed units, indexed by subscript i=1,Ni=1,\ldots N. Let Oi=(Yi,Ai,Wi)O_{i}=(Y_{i},A_{i},W_{i}) be the observed data for the iith unit. Let (h)\mathbbm{P}(h) denote the average of function h(O)h(O) across NN units, i.e., (g)=N1i=1Ng(Oi)\mathbbm{P}(g)=N^{-1}\sum_{i=1}^{N}g(O_{i}). For random variables VV and ZZ, let VDZV\stackrel{{\scriptstyle D}}{{\rightarrow}}Z denote VV converges to ZZ in distribution.

Let ψ1=E(Ya=1|A=1)\psi_{1}^{*}=E(Y^{a=1}\,\big{|}\,A=1) and ψ0=E(Ya=0|A=1)\psi_{0}^{*}=E(Y^{a=0}\,\big{|}\,A=1). The additive and multiplicative additive average causal effects of treatment on the treated are represented as ψ:=ψ1ψ0\psi^{*}:=\psi_{1}^{*}-\psi_{0}^{*}. Let θ\theta be a vector of parameters of interest and gg be a vector-valued function that is restricted by the mean-zero moment restriction, i.e. E{g(O;θ)}=0E\big{\{}g(O;\theta^{\dagger})\big{\}}=0 where θ\theta^{\dagger} is the unique parameter that achieves the mean-zero moment restriction.

A.4.1 A Moment Restriction based on the EPS

The first approach entails a priori specifying a parametric logistic regression model for the EPS, say:

logω(y,x;α)=logπ(y,x;α)1π(y,x;α)=αSy(y,x),\displaystyle\log\omega(y,x;\alpha)=\log\frac{\pi\big{(}y,x;\alpha\big{)}}{1-\pi\big{(}y,x;\mathbf{\alpha}\big{)}}=\mathbf{\alpha}^{\intercal}S_{y}(y,x)\ ,

where Sy(y,x)S_{y}\big{(}y,x\big{)} is a user-specified sufficient statistic for the log-odds ratio ratio association between Ya=0Y^{a=0} and AA evaluated at Ya=0=yY^{a=0}=y given X=xX=x. Options for specifying SyS_{y} includes both more parsimonious specifications, e.g., Sy(y,x)=(1,y,x)S_{y}\big{(}y,x\big{)}=(1,y,x^{\intercal})^{\intercal}, as well as more flexible specifications, say Sy(y,x)=(1,y,y2,,yv,x,yx,,yvx)S_{y}\big{(}y,x\big{)}=\big{(}1,y,y^{2},\ldots,y^{v},x^{\intercal},yx^{\intercal},\ldots,y^{v}x^{\intercal}\big{)}^{\intercal} or

Sy(y,x)={1,𝟙(yy1),𝟙(y1<yy2),,𝟙(yK<y),𝟙(yy1)x,,𝟙(yK<y)x}\displaystyle S_{y}\big{(}y,x\big{)}=\big{\{}1,\mathbbm{1}(y\leq y_{1}),\mathbbm{1}(y_{1}<y\leq y_{2}),\ldots,\mathbbm{1}(y_{K}<y),\mathbbm{1}(y\leq y_{1})x^{\intercal},\ldots,\mathbbm{1}(y_{K}<y)x^{\intercal}\big{\}}^{\intercal}

where yky_{k} is the kthk^{th} user-specified percentile of YY. Alternative nonparametric smoothing techniques, e.g. nearest neighbor, kernel smoothing, splines or wavelets might also be considered. We assume that ω(Y,X)=ω(Y,X;α)\omega^{*}(Y,X)=\omega(Y,X;\alpha^{*}) for an unknown unique value α\alpha^{*}.

While such logistic regression model specifications for the EPS might look familiar, estimation and inference about its unknown parameter α\mathbf{\alpha}^{*} via maximum likelihood is complicated by the fact that Ya=0Y^{a=0}   is only observed among untreated units with A=0.A=0. Instead, to find an empirical estimate of α\mathbf{\alpha}, we observe the following property of ω\omega^{*} and π\pi^{*}:

For any function p(W,X), we have\displaystyle\text{For any function }p(W,X),\text{ we have }
E[{(1A)ω(Y,X)A}p(W,X)]=E[{1A1π(Y,X)1}p(W,X)]=0.\displaystyle E\left[\left\{(1-A)\omega^{*}(Y,X)-A\right\}p\big{(}W,X\big{)}\right]=E\left[\left\{\frac{1-A}{1-\pi^{*}(Y,X)}-1\right\}p\big{(}W,X\big{)}\right]=0\ .

The identity is trivial from Result 1, the law of iterated expectations, and some straightforward algebra:

E[{(1A)ω(Y,X)}p(W,X)]\displaystyle E\big{[}\big{\{}(1-A)\omega^{*}(Y,X)\big{\}}p(W,X)\big{]} =E[Pr(A=0|W,X)E{ω(Y,X)|W,X,A=0}p(W,X)]\displaystyle=E\big{[}\Pr(A=0\,\big{|}\,W,X)E\big{\{}\omega^{*}(Y,X)\,\big{|}\,W,X,A=0\big{\}}p(W,X)\big{]}
=E[Pr(A=0|W,X)Pr(A=1|W,X)Pr(A=0|W,X)p(W,X)]\displaystyle=E\big{[}\Pr(A=0\,\big{|}\,W,X)\frac{\Pr(A=1\,\big{|}\,W,X)}{\Pr(A=0\,\big{|}\,W,X)}p(W,X)\big{]}
=E[Pr(A=1|W,X)p(W,X)]\displaystyle=E\big{[}\Pr(A=1\,\big{|}\,W,X)p(W,X)\big{]}
=E[Ap(W,X)],\displaystyle=E\big{[}Ap(W,X)\big{]}\ ,

and

(1A)π(Y,X)1π(Y,X)A\displaystyle\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}-A =π(Y,X)Aπ(Y,X)A+Aπ(Y,X)1π(Y,X)=1A1π(Y,X)1.\displaystyle=\frac{\pi^{*}(Y,X)-A\pi^{*}(Y,X)-A+A\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}=\frac{1-A}{1-\pi^{*}(Y,X)}-1\ .

Consequently, one can use the following moment restriction to estimate α\alpha^{*}:

E[{1A1π(Y,X;α)1}rw(W,X)]=0 for a user-specified function rw(W,X)α=α.\displaystyle E\left[\left\{\frac{1-A}{1-\pi(Y,X;\alpha)}-1\right\}r_{w}\big{(}W,X\big{)}\right]=0\text{ for a user-specified function }r_{w}(W,X)\quad\Leftrightarrow\quad\alpha=\alpha^{*}\ . (11)

We need to choose rw(w,x)r_{w}(w,x) so that dim(rw)dim(α)\dim(r_{w})\geq\dim(\alpha), which is required to admit a unique α\alpha^{*}.

A.4.2 A Moment Restriction based on the COCA Confounding Bridge Function

We consider a parametric model of b(W,X;η)=ηSw(W,X)b\big{(}W,X;\eta\big{)}=\eta^{\intercal}S_{w}\big{(}W,X\big{)} for user-specified function Sw(W,X)S_{w}\big{(}W,X\big{)}, assuming that b(W,X)=ηSw(W,X)b^{*}\big{(}W,X\big{)}=\eta^{*\intercal}S_{w}\big{(}W,X\big{)} for an unknown unique value η\eta^{*}. Similar to Sy(y,x)S_{y}\big{(}y,x\big{)} in the previous Section, Sw(W,X)S_{w}\big{(}W,X\big{)} can likewise account for nonlinearities by specifying specific basis functions, such as polynomials, splines, or wavelets. We then note that a standard least-squares regression YY on (W,X)(W,X) would generally fail to recover η\eta^{*} even if the model is correctly specified. In other words, under our assumptions it will generally be the case that b(W,X)E(Y|A=0,W,X)b^{*}\big{(}W,X\big{)}\neq E\big{(}Y|A=0,W,X\big{)}, as this would in fact imply that E{E(Y|A=0,W,X)|A=0,Y=y,X}=yE\left\{E\big{(}Y\,\big{|}\,A=0,W,X\big{)}\,\big{|}\,A=0,Y=y,X\right\}=y which does not follow from our assumptions. A correct approach in fact is based on the following consequence of our assumptions:

E[(1A){ηSw(W,X)Y}ry(Y,X)]=0 for a user-specified function ryη=η.\displaystyle E\big{[}\big{(}1-A\big{)}\big{\{}\eta^{\intercal}S_{w}\big{(}W,X\big{)}-Y\big{\}}r_{y}\big{(}Y,X\big{)}\big{]}=0\text{ for a user-specified function }r_{y}\quad\Leftrightarrow\quad\eta=\eta^{*}\ . (12)

Again, we need to choose ry(Y,X)r_{y}(Y,X) so that dim(ry)dim(η)\dim(r_{y})\geq\dim(\eta) to admit a unique η\eta^{*}. If only one NCO is available and ryr_{y} is chosen as ry(Y,X)=Sw(Y,X)r_{y}\big{(}Y,X\big{)}=S_{w}\big{(}Y,X\big{)}, η\eta^{*} is represented as

η=E{Sw(Y,X)Sw(W,X)|A=0}1E{YSw(Y,X)|A=0}\displaystyle\eta^{*}=E\big{\{}S_{w}\big{(}Y,X\big{)}S_{w}^{\intercal}\big{(}W,X\big{)}\,\big{|}\,A=0\big{\}}^{-1}E\big{\{}YS_{w}\big{(}Y,X\big{)}\,\big{|}\,A=0\big{\}}

assuming E{Sw(Y,X)Sw(W,X)|A=0}E\big{\{}S_{w}\big{(}Y,X\big{)}S_{w}^{\intercal}\big{(}W,X\big{)}\,\big{|}\,A=0\big{\}} is invertible.

A.4.3 Construction of the GMM Estimators

Based on the moment restrictions (11) and (12), we consider the following three moment functions gg and the corresponding parameters θ\theta:

  • (Extended Propensity Score)

    θ=(ψ1ψ0α)dim(α)+2,gPS(O;θ)=(A(Yψ1)(1A)π(Y,X;α)1π(Y,X;α)(Yψ0){1A1π(Y,X;α)1}rw(W,X))dim(rw)+2\displaystyle\theta=\begin{pmatrix}\psi_{1}\\ \psi_{0}\\ \alpha\end{pmatrix}\in\mathbbm{R}^{\dim(\alpha)+2}\ ,\ g_{{\text{PS}}}(O;\theta)=\begin{pmatrix}A\big{(}Y-\psi_{1}\big{)}\\ (1-A)\frac{\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}(Y-\psi_{0})\\ \big{\{}\frac{1-A}{1-\pi(Y,X;\alpha)}-1\big{\}}r_{w}(W,X)\end{pmatrix}\in\mathbbm{R}^{\dim(r_{w})+2} (13)

    where dim(θ)dim(gPS)\dim(\theta)\leq\dim(g_{{\text{PS}}}).

  • (COCA Confounding Bridge Function)

    θ=(ψ1ψ0η)dim(η)+2,gOR(O;θ)=(A(Yψ1)A{b(W,X;η)ψ0}(1A){b(W,X;η)Y}ry(Y,X))dim(ry)+2\displaystyle\theta=\begin{pmatrix}\psi_{1}\\ \psi_{0}\\ \eta\end{pmatrix}\in\mathbbm{R}^{\dim(\eta)+2}\ ,\ g_{{\text{OR}}}(O;\theta)=\begin{pmatrix}A\big{(}Y-\psi_{1}\big{)}\\ A\big{\{}b(W,X;\eta)-\psi_{0}\big{\}}\\ (1-A)\big{\{}b(W,X;\eta)-Y\big{\}}r_{y}(Y,X)\end{pmatrix}\in\mathbbm{R}^{\dim(r_{y})+2} (14)

    where dim(θ)dim(gOR)\dim(\theta)\leq\dim(g_{{\text{OR}}}).

  • (Doubly Robust)

    θ=(ψ1ψ0αη)dim(α)+dim(η)+2,\displaystyle\theta=\begin{pmatrix}\psi_{1}\\ \psi_{0}\\ \alpha\\ \eta\end{pmatrix}\in\mathbbm{R}^{\dim(\alpha)+\dim(\eta)+2}\ ,
    gDR(O;θ)=(A(Yψ1)(1A)π(Y,X;α)1π(Y,X;α){Yb(W,X;η)}+A{b(W,X;η)ψ0}{1A1π(Y,X;α)1}rw(W,X)(1A){b(W,X;η)Y}ry(Y,X))dim(ry)+dim(ry)+2\displaystyle g_{{\text{DR}}}(O;\theta)=\begin{pmatrix}A\big{(}Y-\psi_{1}\big{)}\\ \frac{(1-A)\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}\{Y-b(W,X;\eta)\}+A\{b(W,X;\eta)-\psi_{0}\}\\ \big{\{}\frac{1-A}{1-\pi(Y,X;\alpha)}-1\big{\}}r_{w}(W,X)\\ (1-A)\big{\{}b(W,X;\eta)-Y\big{\}}r_{y}(Y,X)\end{pmatrix}\in\mathbbm{R}^{\dim(r_{y})+\dim(r_{y})+2} (15)

    where dim(θ)dim(gDR)\dim(\theta)\leq\dim(g_{{\text{DR}}}).

The unique parameter θ\theta^{\dagger} can be represented as the minimizer of the weighted squared norm of E{g(O;θ)}E\{g(O;\theta)\}. Specifically, with a chosen weighting matrix Ω\Omega (which can depend on the observed data), θ\theta^{\dagger} is represented as

θ=argminθ[E{g(O;θ)}]Ω[E{g(O;θ)}]\displaystyle\theta^{\dagger}=\operatorname*{arg\,min}_{\theta}\big{[}E\big{\{}g(O;\theta)\big{\}}\big{]}^{\intercal}\Omega\big{[}E\big{\{}g(O;\theta)\big{\}}\big{]} (16)

From the law of large numbers, we find ψ1=ψ1\psi_{1}^{\dagger}=\psi_{1}^{*}. In section B.1 of the Appendix, we show that ψ0=ψ0\psi_{0}^{\dagger}=\psi_{0}^{*} for each estimation strategy if

  • (Extended Propensity Score) the EPS is correctly specified, i.e.

    π(y,x;α)=Pr(A=1|Ya=0=y,X=x)\displaystyle\pi(y,x;\alpha^{\dagger})=\Pr(A=1\,\big{|}\,Y^{a=0}=y,X=x) (17)
  • (COCA Confounding Bridge Function) the COCA confounding bridge function is correctly specified, i.e.

    y=E{b(W,X;η)|Y=y,X=x,A=0}\displaystyle y=E\big{\{}b(W,X;\eta^{\dagger})\,\big{|}\,Y=y,X=x,A=0\big{\}} (18)
  • (Doubly Robust) either the EPS or the COCA confounding bridge function is correctly specified; i.e. either (17) or (18) is satisfied.

For efficiency, Ω\Omega is chosen as Σ1\Sigma^{-1}, the inverse of the variance matrix of g(O;θ)g(O;\theta^{\dagger}), i.e. Ω=Σ1=[E{g(O;θ)g(O;θ)}]1\Omega=\Sigma^{-1}=\big{[}E\big{\{}g(O;\theta^{\dagger})g(O;\theta^{\dagger})^{\intercal}\big{\}}\big{]}^{-1}.

The GMM estimator is obtained from the empirical analog of the minimization task in (16). Specifically, we get

θ^GMM=argminθ[{g(O;θ)}]Ω^[{g(O;θ)}]\displaystyle\widehat{\theta}_{\text{GMM}}=\operatorname*{arg\,min}_{\theta}\big{[}\mathbbm{P}\big{\{}g(O;\theta)\big{\}}\big{]}^{\intercal}\widehat{\Omega}\big{[}\mathbbm{P}\big{\{}g(O;\theta)\big{\}}\big{]} (19)

where Ω^\widehat{\Omega} is often chosen as a matrix that is consistent for Σ1\Sigma^{-1}; see equation (20) below for the exact form.

Under regularity conditions (see Newey and McFadden (1994) for details), the GMM estimator is consistent and asymptotically normal (CAN) as follows:

N(θ^GMMθ)DN(0,ΣGMM)\displaystyle\sqrt{N}\big{(}\widehat{\theta}_{\text{GMM}}-\theta^{\dagger}\big{)}\stackrel{{\scriptstyle D}}{{\rightarrow}}N\big{(}0,\Sigma_{\text{GMM}}\big{)}
ΣGMM=(GΩG)1(GΩΣΩG)(GΩG)1,G=E{g(O;θ)θ|θ=θ}dim(g)×dim(θ)\displaystyle\Sigma_{\text{GMM}}=(G^{\intercal}\Omega G)^{-1}(G^{\intercal}\Omega\Sigma\Omega^{\intercal}G)(G^{\intercal}\Omega G)^{-1}\ ,\ G=E\bigg{\{}\frac{\partial g(O;\theta)}{\partial\theta^{\intercal}}\bigg{|}_{\theta=\theta^{\dagger}}\bigg{\}}\in\mathbbm{R}^{\dim(g)\times\dim(\theta)}

In particular, ΣGMM\Sigma_{\text{GMM}} reduces to ΣGMM=(GΣ1G)1\Sigma_{\text{GMM}}=(G^{\intercal}\Sigma^{-1}G)^{-1} if Ω=Σ1\Omega=\Sigma^{-1}, which is the most efficient in the class of all GMM estimators. Therefore, we obtain the consistency and asymptotic normality of the GMM-based estimator of ψ=ψ1ψ0\psi^{*}=\psi_{1}^{*}-\psi_{0}^{*} as follows:

N(ψ^GMMψ)DN(0,vΣGMMv),\displaystyle\sqrt{N}\big{(}\widehat{\psi}_{\text{GMM}}-\psi^{*}\big{)}\stackrel{{\scriptstyle D}}{{\rightarrow}}N\big{(}0,v^{\intercal}\Sigma_{\text{GMM}}v\big{)}\ ,
v=(1,1,0,,0),ψ^GMM=vθ^GMM,ψ=vθ\displaystyle v=\big{(}-1,1,0,\cdots,0\big{)}^{\intercal}\ ,\ \widehat{\psi}_{\text{GMM}}=v^{\intercal}\widehat{\theta}_{\text{GMM}}\ ,\ \psi^{*}=v^{\intercal}\theta^{*}

The variance estimator of ψ^GMM\widehat{\psi}_{\text{GMM}} can be obtained from the empirical analog, i.e. Var^(ψ^GMM)=vΣ^GMMv/N\widehat{Var}(\widehat{\psi}_{\text{GMM}})=v^{\intercal}\widehat{\Sigma}_{\text{GMM}}v/N where

Σ^GMM=(G^Ω^G^)1(G^Ω^Σ^Ω^G^)(G^Ω^G^)1,\displaystyle\widehat{\Sigma}_{\text{GMM}}=(\widehat{G}^{\intercal}\widehat{\Omega}\widehat{G})^{-1}(\widehat{G}^{\intercal}\widehat{\Omega}\widehat{\Sigma}\widehat{\Omega}^{\intercal}\widehat{G})(\widehat{G}^{\intercal}\widehat{\Omega}\widehat{G})^{-1}\ ,
G^={g(O;θ)θ|θ=θ^GMM},Σ^={g(O;θ^GMM)g(O;θ^GMM)}\displaystyle\widehat{G}=\mathbbm{P}\bigg{\{}\frac{\partial g(O;\theta)}{\partial\theta}\bigg{|}_{\theta=\widehat{\theta}_{\text{GMM}}}\bigg{\}}\ ,\quad\quad\widehat{\Sigma}=\mathbbm{P}\big{\{}g(O;\widehat{\theta}_{\text{GMM}})g(O;\widehat{\theta}_{\text{GMM}})^{\intercal}\big{\}} (20)

If we choose Ω^=Σ^1\widehat{\Omega}=\widehat{\Sigma}^{-1}, Σ^GMM\widehat{\Sigma}_{\text{GMM}} reduces to Σ^GMM=(G^Ω^G^)1\widehat{\Sigma}_{\text{GMM}}=(\widehat{G}^{\intercal}\widehat{\Omega}\widehat{G})^{-1}.

Since Ω^\widehat{\Omega} depends on the parameter θ\theta, it may be difficult to obtain the GMM estimator by directly working on (19). Therefore, a popular procedure is the following two-step approach in that:

  • (1)

    Compute the preliminary GMM estimator θ^GMM(pre)\widehat{\theta}_{\text{GMM}}^{(pre)} from (19) where the weight matrix Ω^\widehat{\Omega} is chosen as the identity matrix or other positive definite fixed matrix.

  • (2)

    Compute the GMM estimator θ^GMM\widehat{\theta}_{\text{GMM}} from (19) where the weight matrix Ω^(pre)\widehat{\Omega}^{(pre)} is chosen as Ω^(pre)=[{g(O;θ^GMM(pre))g(O;θ^GMM(pre))}]1\widehat{\Omega}^{(pre)}=\big{[}\mathbbm{P}\big{\{}g(O;\widehat{\theta}_{\text{GMM}}^{(pre)})g(O;\widehat{\theta}_{\text{GMM}}^{(pre)})^{\intercal}\big{\}}\big{]}^{-1}.

In the second step, Ω^(pre)\widehat{\Omega}^{(pre)} is consistent for Σ1\Sigma^{-1} because θ^GMM(pre)\widehat{\theta}_{\text{GMM}}^{(pre)} is consistent (even though it is inefficient). One can repeat the iterative procedure multiple times until convergence criteria are satisfied.

A.4.4 Local Efficiency of the Doubly-robust GMM Estimator

Recall that \mathcal{M} is defined as

={P|P(O) is regular and Assumption 4 holds,i.e., there exists b(W,X) solving integral equation (5)}.\displaystyle\mathcal{M}=\Bigg{\{}P\,\Bigg{|}\,\begin{array}[]{l}\text{$P(O)$ is regular and Assumption \ref{condition-1} holds,}\\ \text{i.e., there exists $b^{*}(W,X)$ solving integral equation \eqref{Outcome COCA}}\end{array}\Bigg{\}}\ .

We show that the doubly robust GMM estimator is semiparametric locally efficient in \mathcal{M} at the intersection model πb\mathcal{M}_{\pi}\cap\mathcal{M}_{b} where

π:\displaystyle\mathcal{M}_{\pi}: π(Y,X;α)\pi(Y,X;\alpha) is correctly specified, i.e., π(Y,X;α)=π(Y,X)\pi(Y,X;\alpha^{*})=\pi^{*}(Y,X)
b:\displaystyle\mathcal{M}_{b}: b(W,X;η) is correctly specified, i.e., b(Y,X;η)=b(W,X).\displaystyle\quad\text{$b(W,X;\eta)$ is correctly specified, i.e., $b(Y,X;\eta^{*})=b^{*}(W,X)$}\ .

Moreover, we choose ra(W,X)r_{a}(W,X) and rw(Y,X)r_{w}(Y,X) as

ra(W,X)=E[(1A)απ(Y,X;α){1π(Y,X;α)}2|W,X]dim(α),\displaystyle r_{a}(W,X)=E\bigg{[}\frac{(1-A)\nabla_{\alpha}\pi(Y,X;\alpha)}{\{1-\pi(Y,X;\alpha)\}^{2}}\,\bigg{|}\,W,X\bigg{]}\in\mathbbm{R}^{\dim(\alpha)}\ ,
rw(Y,X)=E{(1A)ηb(W,X;η)|Y,X}dim(η).\displaystyle r_{w}(Y,X)=E\big{\{}(1-A)\nabla_{\eta}b(W,X;\eta)\,\big{|}\,Y,X\big{\}}\in\mathbbm{R}^{\dim(\eta)}\ .

Then, the moment equation gDRg_{\text{DR}} in (A.4.3) is just identified, i.e., dim(gDR)=θ\dim(g_{\text{DR}})=\theta. Therefore, under regularity conditions, (see Newey and McFadden (1994) for details), we obtain

1N(θ^GMMθ)=1Ni=1N{G1gDR(Oi;θ)}+oP(1),\displaystyle\frac{1}{\sqrt{N}}\big{(}\widehat{\theta}_{\text{GMM}}-\theta^{*}\big{)}=\frac{1}{N}\sum_{i=1}^{N}\Big{\{}-G^{-1}g_{\text{DR}}(O_{i};\theta^{*})\Big{\}}+o_{P}(1)\ ,

where GG is

G=E{gDR(O;θ)θ|θ=θ}=E(A0000A(g1)(g2)000000)\displaystyle G=E\bigg{\{}\frac{\partial g_{\text{DR}}(O;\theta)}{\partial\theta^{\intercal}}\bigg{|}_{\theta=\theta^{*}}\bigg{\}}=E\left(\begin{array}[]{cccc}-A&0&0&0\\ 0&-A&\hypertarget{(g1)}{(g1)}&\hypertarget{(g2)}{(g2)}\\ 0&0&*&0\\ 0&0&0&*\end{array}\right)

Here, (g1) and (g2) are zero as follows:

(g1) =E[(1A)α{π(Y,X;α)1π(Y,X;α)}{Yb(W,X;η)}]\displaystyle=E\Bigg{[}(1-A)\nabla_{\alpha}\bigg{\{}\frac{\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}\bigg{\}}\big{\{}Y-b(W,X;\eta)\big{\}}\Bigg{]}
=E[(1A)α{π(Y,X;α)1π(Y,X;α)}E{Yb(W,X;η)|Y,A=0,X}=0]=0\displaystyle=E\Bigg{[}(1-A)\nabla_{\alpha}\bigg{\{}\frac{\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}\bigg{\}}\underbrace{E\big{\{}Y-b(W,X;\eta)\,\big{|}\,Y,A=0,X\big{\}}}_{=0}\Bigg{]}=0

and

(g2) =E[(1A)π(Y,X;α)1π(Y,X;α)ηb(W,X;η)+Aηb(W,X;η)]\displaystyle=E\Bigg{[}-(1-A)\frac{\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}\nabla_{\eta}b(W,X;\eta)+A\nabla_{\eta}b(W,X;\eta)\Bigg{]}
=E[π(Ya=0,X;α)E{ηb(W,X;η)|Ya=0,X}+Aηb(W,X;η)]\displaystyle=E\Big{[}-\pi(Y^{a=0},X;\alpha)E\big{\{}\nabla_{\eta}b(W,X;\eta)\,\big{|}\,Y^{a=0},X\big{\}}+A\nabla_{\eta}b(W,X;\eta)\Big{]}
=E[Aηb(W,X;η)+Aηb(W,X;η)]=0.\displaystyle=E\Big{[}-A\nabla_{\eta}b(W,X;\eta)+A\nabla_{\eta}b(W,X;\eta)\Big{]}=0\ .

Consequently, the influence function of θ^GMM\widehat{\theta}_{\text{GMM}} reduces to

G1gDR(Oi;θ)\displaystyle-G^{-1}g_{\text{DR}}(O_{i};\theta^{*})
=(1Pr(A=1)00001Pr(A=1)00000000)(A(Yψ1)(1A)π(Y,X;α)1π(Y,X;α){Yb(W,X;η)}+A{b(W,X;η)ψ0}{1A1π(Y,X;α)1}ra(W,X)(1A){b(W,X;η)Y}rw(Y,X))=(A(Yψ1)Pr(A=1)IF(O;π,η)).\displaystyle=\left(\begin{array}[]{cccc}\frac{1}{\Pr(A=1)}&0&0&0\\ 0&\frac{1}{\Pr(A=1)}&0&0\\ 0&0&*&0\\ 0&0&0&*\end{array}\right)\begin{pmatrix}A\big{(}Y-\psi_{1}^{*}\big{)}\\ \frac{(1-A)\pi(Y,X;\alpha)}{1-\pi(Y,X;\alpha)}\{Y-b(W,X;\eta)\}+A\{b(W,X;\eta)-\psi_{0}^{*}\}\\ \big{\{}\frac{1-A}{1-\pi(Y,X;\alpha)}-1\big{\}}r_{a}(W,X)\\ (1-A)\big{\{}b(W,X;\eta)-Y\big{\}}r_{w}(Y,X)\end{pmatrix}=\begin{pmatrix}\frac{A(Y-\psi_{1}^{*})}{\Pr(A=1)}\\ \texttt{IF}(O;\pi^{*},\eta^{*})\\ *\\ *\end{pmatrix}\ .

Therefore, the estimator ψ^1ψ^0\widehat{\psi}_{1}-\widehat{\psi}_{0} achieves the semiparmetric local efficiency bound for the ETT ψ\psi^{*} under \mathcal{M} at the submodel sub\mathcal{M}_{\text{sub}} so long as the posited parametric models belong to the intersection model πb\mathcal{M}_{\pi}\cap\mathcal{M}_{b}.

A.4.5 Regularized GMM Estimators

Finding the optimal solution in (19) can be quite challenging when dealing with finite samples, especially if the objective function is not convex. To mitigate the issue, one may regularize (19) and solve a penalized version of GMM, e.g.,

θ^GMM(λ)=argminθ[{g(O;θ)}]Ω^[{g(O;θ)}]+λ(θ).\displaystyle\widehat{\theta}_{\text{GMM}}(\lambda)=\operatorname*{arg\,min}_{\theta}\big{[}\mathbbm{P}\big{\{}g(O;\theta)\big{\}}\big{]}^{\intercal}\widehat{\Omega}\big{[}\mathbbm{P}\big{\{}g(O;\theta)\big{\}}\big{]}+\lambda\mathcal{R}(\theta)\ . (21)

where (θ)\mathcal{R}(\theta) is a regularization term and λ(0,)\lambda\in(0,\infty) is a regularization parameter, which is chosen from cross-validation; see Algorithm 1 below for details.

0:  Candidates of Regularization Parameters {λ1,,λL}\{\lambda_{1},\ldots,\lambda_{L}\}
1:  for =1,,L\ell=1,\ldots,L do
2:     for i=1,,Ni=1,\ldots,N do
3:        Obtain θ^(i)\widehat{\theta}_{\ell}^{(-i)} from (21) using a regularization parameter λ\lambda_{\ell} and N1N-1 observations {1,,i1,i+1,,N}\{1,\ldots,i-1,i+1,\ldots,N\}.
4:        Evaluate g(Oi;θ^(i))g(O_{i};\widehat{\theta}_{\ell}^{(-i)})
5:     end for
6:     Obtain the average moment g¯=N1i=1Ng(Oi;θ^(i))22\overline{g}_{\ell}=\big{\|}N^{-1}\sum_{i=1}^{N}g(O_{i};\widehat{\theta}_{\ell}^{(-i)})\big{\|}_{2}^{2}
7:  end for
8:  Choose \ell that minimizes g¯\overline{g}_{\ell}, i.e., =argming¯\ell^{*}=\operatorname*{arg\,min}_{\ell}\overline{g}_{\ell}
9:  return  Regularization parameter λ=λ\lambda=\lambda_{\ell^{*}}
Algorithm 1 Cross-validation for Choosing λ\lambda

A.5 Sufficient Conditions for the Existence and Uniqueness of the COCA Confounding Bridge Function

In this Section, we discuss sufficient conditions for the existence and uniqueness of the solutions to the integral equations in equations (4) and (5):

y=E{b(W,X)|Y=y,X=x,A=0}\displaystyle y=E\left\{b^{*}\big{(}W,X\big{)}\,\big{|}\,Y=y,X=x,A=0\right\}
p(w,x)1p(w,x)=E{ω(Y,x)|W=w,X=x,A=0}\displaystyle\frac{p^{*}\big{(}w,x\big{)}}{1-p^{*}\big{(}w,x\big{)}}=E\left\{\omega^{*}(Y,x)\,\big{|}\,W=w,X=x,A=0\right\}

Since both equations share a similar structure, our focus here is primarily on the case of the COCA bridge function bb^{*}.

A.5.1 Conditions for Existence

We begin by providing sufficient conditions for the existence of the COCA confounding bridge function bb^{*}. In brief, we follow the approach in Miao et al. (2018). The proof relies on Theorem 15.18 of Kress (2014), which is stated below for completeness.
Theorem 15.18. (Kress, 2014) Let A:XYA:X\rightarrow Y be a compact operator with singular system {μn,ϕn,gn}n=1,2,\big{\{}\mu_{n},\phi_{n},g_{n}\big{\}}_{n=1,2,\ldots}. The integral equation of the first kind Aϕ=fA\phi=f is solvable if and only if

1.f𝒩(Aadjoint)={f|Aadjoint(f)=0},\displaystyle 1.\quad\text{$f\in\mathcal{N}(A^{\text{adjoint}})^{\perp}=\big{\{}f\,\big{|}\,A^{\text{adjoint}}(f)=0\big{\}}^{\perp}$}\ , 2.n=1μn2|f,gn|2<\displaystyle 2.\quad\text{$\sum_{n=1}^{\infty}\mu_{n}^{-2}\big{|}\langle f,g_{n}\rangle|^{2}<\infty$}

To apply the Theorem, we introduce some additional notations. For a fixed X=xX=x, let Wx\mathcal{L}_{Wx} and Yx\mathcal{L}_{Yx} be the spaces of square-integrable functions of WW and YY, respectively, which are equipped with the inner products

h1,h2Wx=h1x(w)h2x(w)fW|XA(w|X=x,A=0)𝑑w=E{h1x(W)h2x(W)|X=x,A=0}\displaystyle\langle h_{1},h_{2}\rangle_{Wx}=\int h_{1x}(w)h_{2x}(w)\,f_{W|XA}(w\,\big{|}\,X=x,A=0)\,dw=E\big{\{}h_{1x}(W)h_{2x}(W)\,\big{|}\,X=x,A=0\big{\}}
g1x,g2xYx=g1x(y)g2x(y)fY|XA(y|X=x,A=0)𝑑y=E{g1x(Y)g2x(Y)|X=x,A=0}.\displaystyle\langle g_{1x},g_{2x}\rangle_{Yx}=\int g_{1x}(y)g_{2x}(y)\,f_{Y|XA}(y\,\big{|}\,X=x,A=0)\,dy=E\big{\{}g_{1x}(Y)g_{2x}(Y)\,\big{|}\,X=x,A=0\big{\}}\ .

Let 𝒦x:WxYx\mathcal{K}_{x}:\mathcal{L}_{Wx}\rightarrow\mathcal{L}_{Yx} be the conditional expectation of h(W)Wxh(W)\in\mathcal{L}_{Wx} given (Y,X=x,A=0)(Y,X=x,A=0), i.e.,

𝒦x(h)Yx satisfying (𝒦x(h))(y)=E{h(W)|Y=y,X=x,A=0} for hWx.\displaystyle\mathcal{K}_{x}(h)\in\mathcal{L}_{Yx}\text{ satisfying }\big{(}\mathcal{K}_{x}(h)\big{)}(y)=E\big{\{}h(W)\,\big{|}\,Y=y,X=x,A=0\big{\}}\text{ for }h\in\mathcal{L}_{Wx}\ .

Then, the COCA confounding bridge function evaluated at X=xX=x, i.e., bxWb_{x}^{*}\in\mathcal{L}_{W}, solves 𝒦x(bx)=[identity map:YY]Yx\mathcal{K}_{x}(b_{x}^{*})=[\text{identity map}:Y\rightarrow Y]\in\mathcal{L}_{Yx}, i.e.,

E{bx(W)|Y=y,X=x,A=0}=bx(w)fW|YXA(w|y,x,0)𝑑w=y,y.\displaystyle E\left\{b_{x}^{*}\big{(}W\big{)}\,\big{|}\,Y=y,X=x,A=0\right\}=\int b_{x}^{*}(w)f_{W|YXA}(w\,\big{|}\,y,x,0)\,dw=y,\ \forall y\ .

Now, we assume the following conditions for each xx:

  • (Bridge-1) fW|YXA(w|y,x,0)fY|WXA(y|w,x,0)𝑑w𝑑y<\iint f_{W|YXA}(w\,\big{|}\,y,x,0)f_{Y|WXA}(y\,\big{|}\,w,x,0)\,dw\,dy<\infty;

  • (Bridge-2) For gxYxg_{x}\in\mathcal{L}_{Yx}, E{gx(Y)|W,X=x,A=0}=0E\big{\{}g_{x}(Y)\,\big{|}\,W,X=x,A=0\big{\}}=0 implies gx(Y)=0g_{x}(Y)=0 almost surely;

  • (Bridge-3) E(Y2|X=x,A=0)<E\big{(}Y^{2}\,\big{|}\,X=x,A=0\big{)}<\infty;

  • (Bridge-4) Let the singular system of 𝒦x\mathcal{K}_{x} be {μn,x,ϕn,x,gn,x}n=1,2,\big{\{}\mu_{n,x},\phi_{n,x},g_{n,x}\big{\}}_{n=1,2,\ldots}. Then, we have

    n=1μn,x2|Y,gn,xYx|2< almost surely.\displaystyle\sum_{n=1}^{\infty}\mu_{n,x}^{-2}\big{|}\langle Y,g_{n,x}\rangle_{Yx}|^{2}<\infty\text{ almost surely. }

First, we show that 𝒦x\mathcal{K}_{x} is a compact operator under Condition (Bridge-1) for each xx. Let 𝒦xadjoint:YxWx\mathcal{K}_{x}^{\text{adjoint}}:\mathcal{L}_{Yx}\rightarrow\mathcal{L}_{Wx} be the conditional expectation of gx(Y)Yxg_{x}(Y)\in\mathcal{L}_{Yx} given (W,X,A=0)(W,X,A=0), i.e.,

𝒦xadjoint(g)Wx satisfying (𝒦x(g))(w)=E{g(Y)|W=w,X=x,A=0} for gYx.\displaystyle\mathcal{K}_{x}^{\text{adjoint}}(g)\in\mathcal{L}_{Wx}\text{ satisfying }\big{(}\mathcal{K}_{x}(g)\big{)}(w)=E\big{\{}g(Y)\,\big{|}\,W=w,X=x,A=0\big{\}}\text{ for }g\in\mathcal{L}_{Yx}\ .

Then, 𝒦x\mathcal{K}_{x} and 𝒦xadjoint\mathcal{K}_{x}^{\text{adjoint}} are the adjoint operator of each other as follows:

𝒦x(h),gYx\displaystyle\langle\mathcal{K}_{x}(h),g\rangle_{Yx} =E[E{h(W)|Y,X,A=0}g(Y)]\displaystyle=E\big{[}E\big{\{}h(W)\,\big{|}\,Y,X,A=0\big{\}}g(Y)\big{]}
=E{h(W)g(Y)|X,A=0}\displaystyle=E\big{\{}h(W)g(Y)\,\big{|}\,X,A=0\big{\}}
=E[h(W)E{g(Y)|W,X,A=0}]=h,𝒦xadjoint(g)Wx.\displaystyle=E\big{[}h(W)E\big{\{}g(Y)\,\big{|}\,W,X,A=0\big{\}}\big{]}=\langle h,\mathcal{K}_{x}^{\text{adjoint}}(g)\rangle_{Wx}\ .

Additionally, as shown in page 5659 of Carrasco et al. (2007), 𝒦x\mathcal{K}_{x} and 𝒦xadjoint\mathcal{K}_{x}^{\text{adjoint}} are compact operators under Condition (Bridge-1). Moreover, by Theorem 15.16 of Kress (2014), there exists a singular value decomposition of 𝒦x\mathcal{K}_{x} as {μn,x,ϕn,x,gn,x}n=1,2,\big{\{}\mu_{n,x},\phi_{n,x},g_{n,x}\big{\}}_{n=1,2,\ldots}.

Second, we show that 𝒩(𝒦xadjoint)=Yx\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}})^{\perp}=\mathcal{L}_{Yx}, which suffices to show 𝒩(𝒦xadjoint)={0}Yx\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}})=\big{\{}0\big{\}}\subseteq\mathcal{L}_{Yx}. Under Condition (Bridge-2), we have

g𝒩(𝒦xadjoint)E{g(Y)|W=w,X,A=0}=0,wg(Y)=0 almost surely.\displaystyle g\in\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}})\quad\Rightarrow\quad E\big{\{}g(Y)\,\big{|}\,W=w,X,A=0\big{\}}=0,\ \forall w\quad\Rightarrow\quad g(Y)=0\text{ almost surely.}

where the first arrow is from the definition of the null space 𝒩\mathcal{N}, and the second arrow is from Condition (Bridge-2). Therefore, any g𝒩(𝒦xadjoint)g\in\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}}) must satisfy g(y)=0g(y)=0 almost surely, i.e., 𝒩(𝒦xadjoint)={0}Yx\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}})=\big{\{}0\big{\}}\subseteq\mathcal{L}_{Yx} almost surely.

Third, from the definition of Wx\mathcal{L}_{Wx}, g(Y)=YYx=𝒩(𝒦xadjoint)g(Y)=Y\in\mathcal{L}_{Yx}=\mathcal{N}(\mathcal{K}_{x}^{\text{adjoint}})^{\perp} under Condition (Bridge-3).

Combining the three results, we establish that YY satisfies the first condition of Theorem 15.18 of Kress (2014). The second condition of the Theorem is exactly the same as Condition (Bridge-4). Therefore, we establish that the Fredholm integral equation of the first kind 𝒦x(h)=[identity map:YY]\mathcal{K}_{x}(h)=[\text{identity map}:Y\rightarrow Y] is solvable under Conditions (Bridge-1)-(Bridge-4). Consequently, for each xx, there exists a function bx(W)b_{x}^{*}(W) satisfying

E{bx(W)|Y=y,X=x,A=0}=y,y.\displaystyle E\big{\{}b_{x}^{*}(W)\,\big{|}\,Y=y,X=x,A=0\big{\}}=y\ ,\ \forall y\ .

We define b(W,X)b^{*}(W,X) as a function satisfying b(W=w,X=x)=bx(w)b^{*}(W=w,X=x)=b_{x}^{*}(w), which then solves

E{b(W,X)|Y=y,X=x,A=0}=y,y.\displaystyle E\big{\{}b^{*}(W,X)\,\big{|}\,Y=y,X=x,A=0\big{\}}=y\ ,\ \forall y\ .

A.5.2 Conditions for Uniqueness

The COCA confounding bridge function is unique under the following completeness condition:

  • (Completeness): Suppose that E{g(W)|Y=y,X=x,A=0}=0E\big{\{}g(W)\,\big{|}\,Y=y,X=x,A=0\big{\}}=0 for any (y,x)(y,x). Then, g(W)=0g(W)=0 almost surely.

The proof is given below. Suppose that b1(W,X)b_{1}^{*}(W,X) and b2(W,X)b_{2}^{*}(W,X) satisfy the bridge function condition (5). Then, we find

0=YY=E{b1(W,x)b2(W,x)|Y=y,X=x,A=0}.\displaystyle 0=Y-Y=E\big{\{}b_{1}^{*}(W,x)-b_{2}^{*}(W,x)\,\big{|}\,Y=y,X=x,A=0\big{\}}\ .

From (Completeness), Δx(W)=b1(W,x)b2(W,x)=0\Delta_{x}(W)=b_{1}^{*}(W,x)-b_{2}^{*}(W,x)=0 almost surely for any xx. This implies b1(W,X)=b2(W,X)b_{1}^{*}(W,X)=b_{2}^{*}(W,X) almost surely. Therefore, the COCA confounding bridge function is unique.

A.6 Details of the Minimax Estimation

A.6.1 Closed-form Representations of the Minimax Estimators

For notational brevity, let Mk=|kc|M_{k}=|\mathcal{I}_{k}^{c}|. Recall ω\omega^{*} and bb^{*} satisfy

E[{(1A)ω(Y,X)A}p(W,X)]=0,p,\displaystyle E\left[\left\{(1-A)\omega^{*}(Y,X)-A\right\}p\big{(}W,X\big{)}\right]=0\quad,\quad\forall p\ ,
E[(1A){Yb(W,X)}q(Y,X)]=0,q.\displaystyle E\big{[}\big{(}1-A\big{)}\big{\{}Y-b^{*}(W,X)\big{\}}q\big{(}Y,X\big{)}\big{]}=0\quad,\quad\forall q\ .

Following Ghassami et al. (2022), minimax estimators of ω\omega^{*} and bb^{*} are given by

ω^(k)()=argminω(Y,X)[maxp(W,X)[(k)[p(W,X){(1A)ω(Y,X)A}p2(W,X)]λpp2]+λωω2],\displaystyle\widehat{\omega}^{(-k)}(\cdot)=\operatorname*{arg\,min}_{\omega\in\mathcal{H}(Y,X)}\Bigg{[}\displaystyle{\max_{p\in\mathcal{H}(W,X)}}\Bigg{[}\mathbbm{P}^{(-k)}\bigg{[}\begin{array}[]{l}p(W,X)\big{\{}(1-A)\omega(Y,X)-A\big{\}}\\ -p^{2}(W,X)\end{array}\bigg{]}-\lambda_{p}\big{\|}p\big{\|}_{\mathcal{H}}^{2}\Bigg{]}+\lambda_{\omega}\big{\|}\omega\big{\|}_{\mathcal{H}}^{2}\Bigg{]}\ ,
b^(k)()=argminh(W,X)[maxq(Y,X)[(k)[q(Y,X)(1A){Yb(W,X)}q2(Y,X)]λqq2]+λbb2].\displaystyle\widehat{b}^{(-k)}(\cdot)=\operatorname*{arg\,min}_{h\in\mathcal{H}(W,X)}\Bigg{[}\displaystyle{\max_{q\in\mathcal{H}(Y,X)}}\Bigg{[}\mathbbm{P}^{(-k)}\bigg{[}\begin{array}[]{l}q(Y,X)(1-A)\big{\{}Y-b(W,X)\big{\}}\\ -q^{2}(Y,X)\end{array}\bigg{]}-\lambda_{q}\big{\|}q\big{\|}_{\mathcal{H}}^{2}\Bigg{]}+\lambda_{b}\big{\|}b\big{\|}_{\mathcal{H}}^{2}\Bigg{]}\ .

Note that all true and estimated functions are defined in terms of the following representations:

f(vf) satisfies E[{SDf(Vf)}g(Vg)]=0 for any g,\displaystyle f^{*}(v_{f})\text{ satisfies }E\big{[}\big{\{}S-D\cdot f^{*}(V_{f})\big{\}}g(V_{g})\big{]}=0\text{ for any }g\ ,
f^(k)(vf)\displaystyle\widehat{f}^{(-k)}(v_{f})
=argminf(Vf)[maxg(Vg)[(k)[[g(Vg){SDf(Vf))}g2(Vg)]λggg2]+λfff2],\displaystyle=\operatorname*{arg\,min}_{f\in\mathcal{H}(V_{f})}\Big{[}\max_{g\in\mathcal{H}(V_{g})}\Big{[}\mathbbm{P}^{(-k)}\big{[}\big{[}g(V_{g})\cdot\big{\{}S-D\cdot f(V_{f})\big{)}\big{\}}-g^{2}(V_{g})\big{]}-\lambda_{g}\big{\|}g\big{\|}_{\mathcal{H}_{g}}^{2}\Big{]}+\lambda_{f}\big{\|}f\big{\|}_{\mathcal{H}_{f}}^{2}\Big{]}\ , (22)

where S,D,Vf,VgS,D,V_{f},V_{g} are appropriately chosen. Speicifcally, for the weight function ω\omega^{*}, we choose Vf=(Y,X)V_{f}=(Y,X), Vg=(W,X)V_{g}=(W,X), S=AS=A, D=1AD=1-A; for the COCA confounding bridge function bb^{*}, we choose Vf=(W,X)V_{f}=(W,X), Vg=(Y,X)V_{g}=(Y,X), S=(1A)YS=(1-A)Y, D=1AD=1-A.

A closed-form representation of the solution to (A.6.1) is available from the represented theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001). Specifically, we have f^(k)(vf)=ikcγ^i𝒦(Vf,i,vf)\widehat{f}^{(-k)}(v_{f})=\sum_{i\in\mathcal{I}_{k}^{c}}\widehat{\gamma}_{i}\mathcal{K}(V_{f,i},v_{f}) where γ^=(γ^i)ikc\widehat{\gamma}=\big{(}\widehat{\gamma}_{i}\big{)}_{i\in\mathcal{I}_{k}^{c}} is equal to

γ^=(𝒟TΓ𝒟T+Mk2λf)𝒟TΓ𝒮,\displaystyle\widehat{\gamma}=\big{(}\mathcal{F}\mathcal{D}_{T}\Gamma\mathcal{D}_{T}\mathcal{F}+M_{k}^{2}\lambda_{f}\mathcal{F}\big{)}^{\dagger}\mathcal{F}\mathcal{D}_{T}\Gamma\mathcal{S}\ ,

where

Γ=0.25𝒢{Mk1𝒢+λgIMk×Mk}1Mk×Mk\displaystyle\Gamma=0.25\mathcal{G}\big{\{}M_{k}^{-1}\mathcal{G}+\lambda_{g}I_{M_{k}\times M_{k}}\big{\}}^{-1}\in\mathbbm{R}^{M_{k}\times M_{k}}
=[𝒦(Vf,i,Vf,j)]i,jkcMk×Mk,\displaystyle\mathcal{F}=\Big{[}\mathcal{K}\big{(}V_{f,i},V_{f,j}\big{)}\Big{]}_{i,j\in\mathcal{I}_{k}^{c}}\in\mathbbm{R}^{M_{k}\times M_{k}}\quad, 𝒢=[𝒦(Vg,i,Vg,j)]i,jkcMk×Mk\displaystyle\mathcal{G}=\Big{[}\mathcal{K}\big{(}V_{g,i},V_{g,j}\big{)}\Big{]}_{i,j\in\mathcal{I}_{k}^{c}}\in\mathbbm{R}^{M_{k}\times M_{k}}
𝒟=diag[Di]ikcMk×Mk,\displaystyle\mathcal{D}=\text{diag}\Big{[}D_{i}\Big{]}_{i\in\mathcal{I}_{k}^{c}}\in\mathbbm{R}^{M_{k}\times M_{k}}\quad, 𝒮=[Si]ikcMk.\displaystyle\mathcal{S}=\Big{[}S_{i}\Big{]}_{i\in\mathcal{I}_{k}^{c}}\in\mathbbm{R}^{M_{k}}\ .

Therefore, minimax estimators of the nuisance functions are readily available by appropriately choosing (Vf,Vg,S,D)(V_{f},V_{g},S,D).

A.6.2 Cross-Validation

We use cross-validation to select the hyperparameters 𝔥=(κf,κg,λf,λg)\mathfrak{h}=(\kappa_{f},\kappa_{g},\lambda_{f},\lambda_{g}) that minimize an empirical risk evaluated over a validation set 𝒱\mathcal{V}, denoted by (𝒱)(𝔥)\mathcal{R}_{(}\mathcal{V})(\mathfrak{h}). For the empirical risk, we can either use (i) the projected risk (Dikkala et al., 2020; Ghassami et al., 2022), or (ii) the V-statistic (Mastouri et al., 2021). To motivate (i), we remark that Dikkala et al. (2020); Ghassami et al. (2022) defined the population-level projected risk of (A.6.1) as E[E[D{f^(k)(Vf)f(Vf)}|Vg]22]E\big{[}\big{\|}E\big{[}D\big{\{}\widehat{f}^{(-k)}(V_{f})-f^{*}(V_{f})\big{\}}\,\big{|}\,V_{g}\big{]}\big{\|}_{2}^{2}\big{]}. In addition, they showed that its empirical counterpart evaluated over a validation set 𝒱\mathcal{V} is given by R^𝒱(𝔥)={ϵ^𝒱(k)}Γ𝒱{ϵ^𝒱(k)}\widehat{R}_{\mathcal{V}}(\mathfrak{h})=\{\widehat{\epsilon}_{\mathcal{V}}^{(-k)}\}^{\intercal}\Gamma_{\mathcal{V}}\{\widehat{\epsilon}_{\mathcal{V}}^{(-k)}\} where

ϵ^𝒱(k)=[SjDjf^(k)(Vf,j)]j𝒱|𝒱|,\displaystyle\widehat{\epsilon}_{\mathcal{V}}^{(-k)}=\big{[}S_{j}-D_{j}\cdot\widehat{f}^{(-k)}(V_{f,j})\big{]}_{j\in\mathcal{V}}\in\mathbbm{R}^{|\mathcal{V}|}\ ,
Γ𝒱=0.25𝒢𝒱{|𝒱|1𝒢𝒱+λgI|𝒱|×|𝒱|}1|𝒱|×|𝒱|,\displaystyle\Gamma_{\mathcal{V}}=0.25\mathcal{G}_{\mathcal{V}}\big{\{}|\mathcal{V}|^{-1}\mathcal{G}_{\mathcal{V}}+\lambda_{g}I_{|\mathcal{V}|\times|\mathcal{V}|}\big{\}}^{-1}\in\mathbbm{R}^{|\mathcal{V}|\times|\mathcal{V}|}\ , 𝒢𝒱=[𝒦(Vg,i,Vg,j)]i,j𝒱|𝒱|×|𝒱|.\displaystyle\mathcal{G}_{\mathcal{V}}=\big{[}\mathcal{K}(V_{g,i},V_{g,j})\big{]}_{i,j\in\mathcal{V}}\in\mathbbm{R}^{|\mathcal{V}|\times|\mathcal{V}|}\ .

To motivate the V-statistic, we begin by introducing Lemma 2 of Mastouri et al. (2021):
Lemma 2 (Mastouri et al., 2021): Suppose that

E[{SDf(Vf)}2𝒦(Vg,Vg)]<.\displaystyle E\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}^{2}\mathcal{K}(V_{g},V_{g})\big{]}<\infty\ . (23)

Then, we have

maxg(Vg)g:g1[E[g(Vg){SDf(Vf)}]]2\displaystyle\max_{\begin{subarray}{c}g\in\mathcal{H}(V_{g})\\ g:\|g\|\leq 1\end{subarray}}\Big{[}E\big{[}g(V_{g})\big{\{}S-D\cdot f(V_{f})\big{\}}\big{]}\Big{]}^{2} =E[{SDf(Vf)}{SDf(Vf)}×𝒦(Vg,Vg)],\displaystyle=E\left[\big{\{}S-D\cdot f(V_{f})\big{\}}\big{\{}S^{\prime}-D^{\prime}\cdot f(V_{f}^{\prime})\big{\}}\times\mathcal{K}(V_{g},V_{g}^{\prime})\right]\ ,

where (S,D,Vf,Vg)(S^{\prime},D^{\prime},V_{f}^{\prime},V_{g}^{\prime}) are independent copies of (S,D,Vf,Vg)(S,D,V_{f},V_{g}). The result is also reported in other works, e.g., Theorem 3.3 of Muandet et al. (2020) and Lemma 1 of Zhang et al. (2023). The condition (23) implies that E[{SDf(Vf)}𝒦(Vg,)]E\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]} is Bochner integrable (Steinwart and Christmann, 2008, Definition A.5.20). One important property of the Bochner integrability is that an integration and a linear operator can be interchanged. Therefore, we find

maxg(Vg)g:g1[E[g(Vg){SDf(Vf)}]]2\displaystyle\max_{\begin{subarray}{c}g\in\mathcal{H}(V_{g})\\ g:\|g\|\leq 1\end{subarray}}\Big{[}{E}\big{[}g(V_{g})\big{\{}S-D\cdot f(V_{f})\big{\}}\big{]}\Big{]}^{2} =maxg(Vg)g:g1[E[{SDf(Vf)}g,𝒦(Vg,)]]2\displaystyle=\max_{\begin{subarray}{c}g\in\mathcal{H}(V_{g})\\ g:\|g\|\leq 1\end{subarray}}\Big{[}{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\langle g,\mathcal{K}(V_{g},\cdot)\rangle\big{]}\Big{]}^{2}
=maxg(Vg)g:g1[g,E[{SDf(Vf)}𝒦(Vg,)]]2\displaystyle=\max_{\begin{subarray}{c}g\in\mathcal{H}(V_{g})\\ g:\|g\|\leq 1\end{subarray}}\Big{[}\Big{\langle}g,{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]}\Big{\rangle}\Big{]}^{2}
=E[{SDf(Vf)}𝒦(Vg,)](Vg)2\displaystyle=\Big{\|}{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]}\Big{\|}_{\mathcal{H}(V_{g})}^{2}
=E[{SDf(Vf)}𝒦(Vg,)],E[{SDf(Vf)}𝒦(Vg,)]\displaystyle=\Big{\langle}{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]},{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]}\Big{\rangle}
=E[{SDf(Vf)}𝒦(Vg,),E[{SDf(Vf)}𝒦(Vg,)]]\displaystyle={E}\left[\Big{\langle}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot),{E}\big{[}\big{\{}S^{\prime}-D\cdot f(V_{f}^{\prime})\big{\}}\mathcal{K}(V_{g}^{\prime},\cdot)\big{]}\Big{\rangle}\right]
=E[{SDf(Vf)}𝒦(Vg,),{SDf(Vf)}𝒦(Vg,)]\displaystyle={E}\left[\Big{\langle}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot),\big{\{}S^{\prime}-D\cdot f(V_{f}^{\prime})\big{\}}\mathcal{K}(V_{g}^{\prime},\cdot)\Big{\rangle}\right]
=E[{SDf(Vf)}{SDf(Vf)}×𝒦(Vg,Vg)].\displaystyle=E\left[\big{\{}S-D\cdot f(V_{f})\big{\}}\big{\{}S^{\prime}-D^{\prime}\cdot f(V_{f}^{\prime})\big{\}}\times\mathcal{K}(V_{g},V_{g}^{\prime})\right]\ .

The first line holds from g(Vg)g\in\mathcal{H}(V_{g}), implying that g(Vg)=g,𝒦(Vg,)g(V_{g})=\langle g,\mathcal{K}(V_{g},\cdot)\rangle. The second line holds from the Bochner integrability. The third line holds from the fact that (Vg)\mathcal{H}(V_{g}) is a vector space, and E[{SDf(Vf)}𝒦(Vg,)](Vg){E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]}\in\mathcal{H}(V_{g}) from the Bochner integrability. Therefore, by choosing gE[{SDf(Vf)}𝒦(Vg,)]g\propto{E}\big{[}\big{\{}S-D\cdot f(V_{f})\big{\}}\mathcal{K}(V_{g},\cdot)\big{]}, we obtain the result. The fourth line is trivial from the definition of the norm (Vg)\|\cdot\|_{\mathcal{H}(V_{g})}. The fifth and sixth lines are from the Bochner integrability. The last line is trivial. Using this risk function, one may use the following empirical risk evaluated over a validation set 𝒱\mathcal{V} to find hyperparameters:

𝒱(𝔥)=j,j𝒱{SjDjf^(k)(Vf,j)}{SjDjf^(k)(Vf,j)}𝒦(Vf,j,Vf,j).\displaystyle\mathcal{R}_{\mathcal{V}}(\mathfrak{h})=\sum_{j,j^{\prime}\in\mathcal{V}}\big{\{}S_{j}-D_{j}\cdot\widehat{f}^{(-k)}(V_{f,j})\big{\}}\big{\{}S_{j^{\prime}}-D_{j^{\prime}}\cdot\widehat{f}^{(-k)}(V_{f,j^{\prime}})\big{\}}\mathcal{K}(V_{f,j},V_{f,j^{\prime}})\ . (24)

Using these empirical risks, we may choose the hyperparameters by following Algorithm 2 below:

0:  Candidates of Hyperparameters {𝔥1,,𝔥L}\{\mathfrak{h}_{1},\ldots,\mathfrak{h}_{L}\}
1:  Split kc\mathcal{I}_{k}^{c} into non-overlapping KK folds {𝒱1,,𝒱K}\{\mathcal{V}_{1},\ldots,\mathcal{V}_{K}\}
2:  for =1,,L\ell=1,\ldots,L do
3:     for c=1,,Cc=1,\ldots,C do
4:        Estimate the nuisance functions ω\omega and bb using hyperparameters 𝔥\mathfrak{h}_{\ell} and observations in {𝒱1,,𝒱c1,𝒱c+1,,𝒱C}\big{\{}\mathcal{V}_{1},\ldots,\mathcal{V}_{c-1},\mathcal{V}_{c+1},\ldots,\mathcal{V}_{C}\}
5:        Evaluate the empirical risk 𝒱c(𝔥)\mathcal{R}_{\mathcal{V}_{c}}(\mathfrak{h}_{\ell}) over the remaining set 𝒱c\mathcal{V}_{c}
6:     end for
7:     Evaluate the average empirical risk ¯(𝔥)=C1c=1C𝒱c(𝔥)\overline{\mathcal{R}}(\mathfrak{h}_{\ell})=C^{-1}\sum_{c=1}^{C}\mathcal{R}_{\mathcal{V}_{c}}(\mathfrak{h}_{\ell})
8:  end for
9:  Choose \ell that minimizes ¯\overline{\mathcal{R}}, i.e., =argmin¯(𝔥)\ell^{*}=\operatorname*{arg\,min}_{\ell}\overline{\mathcal{R}}(\mathfrak{h}_{\ell})
10:  return  Hyperparameter 𝔥=𝔥\mathfrak{h}=\mathfrak{h}_{\ell^{*}}
Algorithm 2 Cross-validation for Choosing Hyperparameters

A.6.3 A Remedial Strategy to Address Widely Varying Nuisance Functions

We present a remedial strategy to address widely varying ω\omega^{*}. First, following Section A.4.1, we may obtain a GMM estimator of ω\omega^{*}, denoted by ω~(k)(y,x)\widetilde{\omega}^{(-k)}(y,x), using observations in kc\mathcal{I}_{k}^{c}. If the range of ω~(k)\widetilde{\omega}^{(-k)} is excessively wide (e.g., the maximum is larger than ten times the minimum), we recommend estimating r(x,x)=ω(y,x)/ω~(k)(y,x)r^{*}(x,x)=\omega^{*}(y,x)/\widetilde{\omega}^{(-k)}(y,x) instead of ω\omega^{*}; note that rr stands for the ratio. If ω~(k)(y,x)\widetilde{\omega}^{(-k)}(y,x) is a good estimator, r(y,x)r^{*}(y,x) would be close to 1, and thus, the minimax estimator of rr^{*} would vary less than the minimax estimator of ω\omega^{*}. Motivated by this phenomenon, we obtain a minimax estimator of ω\omega^{*} as ω^(k)(y,x)=ω~(k)(y,x)r^(y,x)\widehat{\omega}^{(-k)}(y,x)=\widetilde{\omega}^{(-k)}(y,x)\cdot\widehat{r}(y,x) where r^(k)(y,x)\widehat{r}^{(-k)}(y,x) is estimated from the following minimax estimation strategy:

r^(k)(y,x)\displaystyle\widehat{r}^{(-k)}(y,x)
=argminr(Y,X)[maxp(W,X)[(k)[p(W,X){(1A)r(Y,X)ω~(k)(Y,X)A}p2(W,X)]λpp(W,X)2]+λrr(Y,X)2].\displaystyle=\operatorname*{arg\,min}_{r\in\mathcal{H}(Y,X)}\left[\begin{array}[]{l}\displaystyle{\max_{p\in\mathcal{H}(W,X)}}\Big{[}\mathbbm{P}^{(-k)}\big{[}p(W,X)\big{\{}(1-A)r(Y,X)\widetilde{\omega}^{(-k)}(Y,X)-A\big{\}}-p^{2}(W,X)\big{]}-\lambda_{p}\big{\|}p\big{\|}_{\mathcal{H}(W,X)}^{2}\Big{]}\\ +\lambda_{r}\big{\|}r\big{\|}_{\mathcal{H}(Y,X)}^{2}\end{array}\right]\ .

A.6.4 Multiplier Bootstrap

One can obtain a standard error of the minimax estimator ψ^\widehat{\psi} and confidence intervals for the ETT ψ\psi^{*} based on the multiplier bootstrap procedure; see Algorithm 3 for details:

0:  Number of bootstrap estimates BB
1:  For i=1,,Ni=1,\ldots,N, obtain the estimated influence functions
Ψ^(Oi)\displaystyle\widehat{\Psi}(O_{i}) =Ai{Yib^(k)(Wi,Xi)ψ^(k)}(1Ai)ω^(k)(Yi,Xi){Yib^(k)(Wi,Xi)}i=1NAi/N\displaystyle=\frac{A_{i}\big{\{}Y_{i}-\widehat{b}^{(-k)}\big{(}W_{i},X_{i}\big{)}-\widehat{\psi}^{(-k)}\big{\}}-\big{(}1-A_{i}\big{)}\widehat{\omega}^{(-k)}(Y_{i},X_{i})\big{\{}Y_{i}-\widehat{b}^{(-k)}\big{(}W_{i},X_{i}\big{)}\big{\}}}{\sum_{i=1}^{N}A_{i}/N}
where kk is the index that satisfies iki\in\mathcal{I}_{k}
2:  for b=1,,Bb=1,\ldots,B do
3:     Generate i.i.d. random variables ϵi(b)N(0,1)\epsilon_{i}^{(b)}\sim N(0,1) for i=1,,Ni=1,\ldots,N
4:     Calculate e^(b)=N1i=1Nϵi(b)Ψ^(Oi)\widehat{e}^{(b)}=N^{-1}\sum_{i=1}^{N}\epsilon_{i}^{(b)}\widehat{\Psi}(O_{i})
5:  end for
6:  Let σ^boot2\widehat{\sigma}_{\text{boot}}^{2} be the empirical variance of {e^(b)|b=1,,B}\big{\{}\widehat{e}^{(b)}\,\big{|}\,b=1,\ldots,B\big{\}}
7:  Let q^boot,α\widehat{q}_{\text{boot},\alpha} be the 100α100\alpha-th percentile of {e^(b)|b=1,,B}\big{\{}\widehat{e}^{(b)}\,\big{|}\,b=1,\ldots,B\big{\}}
8:  return  Variance estimate σ^boot2\widehat{\sigma}_{\text{boot}}^{2}; 100(1α)100(1-\alpha)% confidence interval [q^boot,α/2,q^boot,1α/2][\widehat{q}_{\text{boot},\alpha/2},\widehat{q}_{\text{boot},1-\alpha/2}]
Algorithm 3 Multiplier Bootstrap Procedure

A.6.5 Median Adjustment

Lastly, the cross-fitting estimator depends on a specific sample split, and thus, may produce outlying estimates if some split samples do not represent the entire data. To mitigate this issue, Chernozhukov et al. (2018) proposes to use so-called median adjustment from multiple cross-fitting estimates, which is implemented as follows. First, let ψ^s\widehat{\psi}_{s} (s=1,,S)(s=1,\ldots,S) be the ssth cross-fitting estimate with a variance estimate σ^s2\widehat{\sigma}_{s}^{2}. Then, the median-adjusted cross-fitting estimate and its variance estimate are defined as follows:

ψ^median:=medians=1,,Sψ^s,σ^median2:=medians=1,,S{σs2+(ψ^sψ^median)2}.\displaystyle\widehat{\psi}_{\operatorname*{median}}:=\operatorname*{median}_{s=1,\ldots,S}\widehat{\psi}_{s}\ ,\quad\widehat{\sigma}_{\operatorname*{median}}^{2}:=\operatorname*{median}_{s=1,\ldots,S}\big{\{}\sigma_{s}^{2}+(\widehat{\psi}_{s}-\widehat{\psi}_{\operatorname*{median}})^{2}\big{\}}\ . (25)

These estimates are more robust to the particular realization of sample partition.

A.7 Details of the Specifications for the Zika Virus Application

In this Appendix, we present a detailed explanation of the Zika virus application. We remark that the replication code is available at https://github.com/qkrcks0218/SingleProxyControl, and the code implements the estimation process outlined below.

Recall that the variables are defined as follows:

  • N=617N=617: Number of municipalities

  • Ai=1A_{i}=1: municipality ii belongs to Pernambuco (treated)

  • Ai=0A_{i}=0: municipality ii belongs to Rio Grande do Sul (control)

  • YiY_{i}: Birth rate of municipality ii in 2016

  • W1iW_{1i}: Birth rate of municipality ii in 2014

  • W2iW_{2i}: Birth rate of municipality ii in 2013

  • Xi=(X1i,X2i,X3i)X_{i}=(X_{1i},X_{2i},X_{3i})^{\intercal}: A vector of municipality ii’s log population, log population density, proportion of females in 2014

A.7.1 GMM Estimators

We first provide details of GMM estimators. We consider three specifications according to how NCOs are used:

  • (NCO 1) We use W1W_{1}, birth rate in 2013, as an NCO.

  • (NCO 2) We use W2W_{2}, birth rate in 2014, as an NCO.

  • (NCO 3) We use (W1,W2)(W_{1},W_{2}), birth rates in 2013 and 2014, as NCOs.

According to the specifications, we use penalized GMM in (21) to estimate the ETT, where the nuisance functions and basis functions in the moment function gg ((13),(14),(A.4.3)) are specified is specified as follows:

  • (NCO 1)

    • π(Y,X;α)=expit{(1,Y,X)α}\pi(Y,X;\alpha)=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha\big{\}} \Leftrightarrow ω(Y,X;α)=exp{(1,Y,X)α}\omega(Y,X;\alpha)=\exp\big{\{}(1,Y,X^{\intercal})\alpha\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W1,X;η)=(1,W1,X)ηb(W_{1},X;\eta)=(1,W_{1},X^{\intercal})\eta, η=(η0,η1,,η4)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{4})\in\mathbbm{R}^{5}

    • rw(W1,X)=(1,W1,X,W1X)8r_{w}(W_{1},X)=(1,W_{1},X^{\intercal},W_{1}X^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

  • (NCO 2)

    • π(Y,X;α)=expit{(1,Y,X)α}\pi(Y,X;\alpha)=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha\big{\}} \Leftrightarrow ω(Y,X;α)=exp{(1,Y,X)α}\omega(Y,X;\alpha)=\exp\big{\{}(1,Y,X^{\intercal})\alpha\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W2,X;η)=(1,W2,X)ηb(W_{2},X;\eta)=(1,W_{2},X^{\intercal})\eta, η=(η0,η1,,η4)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{4})\in\mathbbm{R}^{5}

    • rw(W2,X)=(1,W2,X,W1X)8r_{w}(W_{2},X)=(1,W_{2},X^{\intercal},W_{1}X^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

  • (NCO 3)

    • π(Y,X;α)=expit{(1,Y,X)α}\pi(Y,X;\alpha)=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha\big{\}} \Leftrightarrow ω(Y,X;α)=exp{(1,Y,X)α}\omega(Y,X;\alpha)=\exp\big{\{}(1,Y,X^{\intercal})\alpha\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W1,W2,X;η)=(1,W1,W2,X)ηb(W_{1},W_{2},X;\eta)=(1,W_{1},W_{2},X^{\intercal})\eta, η=(η0,η1,,η6)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{6})\in\mathbbm{R}^{5}

    • rw(W2,X)=(1,W1,W2,X,W1X,W2X)12r_{w}(W_{2},X)=(1,W_{1},W_{2},X^{\intercal},W_{1}X^{\intercal},W_{2}X^{\intercal})^{\intercal}\in\mathbbm{R}^{12}

The penalization term (θ)\mathcal{R}(\theta) in equation (21) is determined based on the specific moment function used:

  • (gPSg_{{\text{PS}}}) (θ)=α022\mathcal{R}(\theta)=\|\alpha_{-0}\|_{2}^{2} where α0=(α1,,α4)\alpha_{-0}=(\alpha_{1},\ldots,\alpha_{4})

  • (gORg_{{\text{OR}}}) (θ)=η022\mathcal{R}(\theta)=\|\eta_{-0}\|_{2}^{2} where η0=(η1,,ηdim(W)+3)\eta_{-0}=(\eta_{1},\ldots,\eta_{\dim(W)+3})

  • (gDRg_{{\text{DR}}}) (θ)=α022+η022\mathcal{R}(\theta)=\|\alpha_{-0}\|_{2}^{2}+\|\eta_{-0}\|_{2}^{2}

The regularization parameter λ\lambda is chosen from cross-validation in Algorithm 1. Table summarizes the choice of λ\lambda:

Moment Function NCO
W1W_{1} W2W_{2} (W1,W2)(W_{1},W_{2})
gPSg_{{\text{PS}}} 103.510^{-3.5} 10410^{-4} 10410^{-4}
gORg_{{\text{OR}}} 10410^{-4} 10410^{-4} 103.510^{-3.5}
gDRg_{{\text{DR}}} 103.510^{-3.5} 103.510^{-3.5} 104.510^{-4.5}
Table 2: Choice of λ\lambda Obtained from Cross-validation in Algorithm 1

A.7.2 Minimax Estimators

We next provide details of minimax estimators. First, we employ the cross-fitting procedure with K=2K=2 split samples. Second, the hyperparameters are chosen based on 5-fold cross-validation in Algorithm 2 using the V-statistic in (24) as a criterion. Third, we employ the remedial strategy in Section A.6.3 where ω~\widetilde{\omega} is estimated from GMM using gPSg_{{\text{PS}}} in Section A.7.1 where the regularization parameter λ\lambda is set to 10310^{-3}. Lastly, we implement median adjustment in Section A.6.5 based on 500 cross-fitting estimates.

A.7.3 Summary of the Result

Table 3 summarizes corresponding results. We find that the twelve COCA estimates vary between 1.833-1.833 and 3.599-3.599, meaning between 1.8331.833 and 3.5993.599 birth per 1,000 persons were reduced in PE due to the Zika virus outbreak, an empirical finding better aligned with the scientific hypothesis that Zika may likely adversely impact the birth rates of exposed populations. We remark that parametric estimators utilizing the COCA bridge function yield larger effect sizes compared to the other three COCA estimators. However, the estimates show minimal variability across different NCO specifications. Regardless of the estimator used, all the estimates support the conclusion that the Zika virus outbreak likely led to a decline in the birthrate in the affected regions of Brazil.

Estimator Statistic NCO
W1W_{1} W2W_{2} (W1,W2)(W_{1},W_{2})
Semiparametric Estimator 2.410-2.410 2.182-2.182 2.180-2.180
SE 0.3560.356 0.5030.503 0.3420.342
95% CI (3.107,1.713)(-3.107,-1.713) (3.168,1.196)(-3.168,-1.196) (2.850,1.510)(-2.850,-1.510)
Parametric EPS Estimator 2.334-2.334 2.298-2.298 2.462-2.462
SE 0.3650.365 0.4920.492 0.4570.457
95% CI (3.050,1.618)(-3.050,-1.618) (3.261,1.334)(-3.261,-1.334) (3.357,1.567)(-3.357,-1.567)
COCA bridge function Estimator 3.560-3.560 3.446-3.446 3.599-3.599
SE 0.5200.520 0.5450.545 0.7030.703
95% CI (4.579,2.542)(-4.579,-2.542) (4.516,2.377)(-4.516,-2.377) (4.976,2.222)(-4.976,-2.222)
Doubly-robust Estimator 2.235-2.235 1.833-1.833 2.182-2.182
SE 0.5020.502 0.5190.519 0.4150.415
95% CI (3.220,1.250)(-3.220,-1.250) (2.850,0.816)(-2.850,-0.816) (2.996,1.368)(-2.996,-1.368)
DiD under parallel trends Estimator 1.156-1.156 1.041-1.041 1.041-1.041
SE 0.1990.199 0.1950.195 0.1950.195
95% CI (1.546,0.767)(-1.546,-0.767) (1.424,0.658)(-1.424,-0.658) (1.424,0.658)(-1.424,-0.658)
Table 3: Summary of Data Analysis. Values in “Estimate” row represent the estimates of the ETT ψ\psi^{*}. Values in “SE” and “95% CI” rows represent the standard errors (SEs) associated with the estimates and the corresponding 95% confidence intervals (CIs), respectively. The reported values are expressed as births per 1,000 persons.

A.8 Sensitivity Analysis

We provide details of the sensitivity analysis described in Section 6 of the main paper. To perform such sensitivity analysis, one might consider a parametric approach in Section A.4 where the EPS model is slightly modified by including the NCO into the EPS to encode a departure from the assumption. For example, we may consider:

logPr(A=1|Ya=0=y,X=x,W=w)Pr(A=0|Ya=0=y,X=x,W=w)=αSy(y,x)+αwSw(w);\displaystyle\log\frac{\Pr\big{(}A=1|Y^{a=0}=y,X=x,W=w\big{)}}{\Pr\big{(}A=0|Y^{a=0}=y,X=x,W=w\big{)}}=\alpha^{*\intercal}S_{y}(y,x)+\alpha_{w}S_{w}\big{(}w\big{)};

where SwS_{w} is specified by the user and αw\mathbf{\alpha}_{w} is a pre-specified sensitivity parameter encoding a hypothetical violation of a valid NCO assumption in the direction of SwS_{w}. For instance, one might let Sw(W)=WS_{w}(W)=W and vary αw\mathbf{\alpha}_{w} over a grid of values in the neighborhood of zero (which corresponds to the valid NCO assumption). For each hypothetical value of αw\mathbf{\alpha}_{w}, one would then re-estimate the EPS using methods described in the paper with the sensitivity function specified as an offset.

We implement a sensitivity analysis in the context of the Zika virus application. We focus on the GMM estimator based on the EPS model using the following specifications:

  • (NCO 1)

    • πSA(Y,W,X;α,αw)=expit{(1,Y,X)α+αwW1}\pi_{SA}(Y,W,X;\alpha,\alpha_{w})=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}W_{1}\big{\}}
      \Leftrightarrow ωSA(Y,X;α,αw)=exp{(1,Y,X)α+αwW1}\omega_{SA}(Y,X;\alpha,\alpha_{w})=\exp\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}W_{1}\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W1,X;η)=(1,W1,X)ηb(W_{1},X;\eta)=(1,W_{1},X^{\intercal})\eta, η=(η0,η1,,η4)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{4})\in\mathbbm{R}^{5}

    • rw(W1,X)=(1,W1,X,W1X)8r_{w}(W_{1},X)=(1,W_{1},X^{\intercal},W_{1}X^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

  • (NCO 2)

    • πSA(Y,W,X;α,αw)=expit{(1,Y,X)α+αwW2}\pi_{SA}(Y,W,X;\alpha,\alpha_{w})=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}W_{2}\big{\}}
      \Leftrightarrow ωSA(Y,X;α,αw)=exp{(1,Y,X)α+αwW2}\omega_{SA}(Y,X;\alpha,\alpha_{w})=\exp\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}W_{2}\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W2,X;η)=(1,W2,X)ηb(W_{2},X;\eta)=(1,W_{2},X^{\intercal})\eta, η=(η0,η1,,η4)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{4})\in\mathbbm{R}^{5}

    • rw(W2,X)=(1,W2,X,W1X)8r_{w}(W_{2},X)=(1,W_{2},X^{\intercal},W_{1}X^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

  • (NCO 3)

    • πSA(Y,W,X;α,αw)=expit{(1,Y,X)α+αw(W1+W2)/2}\pi_{SA}(Y,W,X;\alpha,\alpha_{w})=\text{expit}\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}(W_{1}+W_{2})/2\big{\}}
      \Leftrightarrow ωSA(Y,X;α,αw)=exp{(1,Y,X)α+αw(W1+W2)/2}\omega_{SA}(Y,X;\alpha,\alpha_{w})=\exp\big{\{}(1,Y,X^{\intercal})\alpha+\alpha_{w}(W_{1}+W_{2})/2\big{\}}, α=(α0,α1,,α4)5\alpha=(\alpha_{0},\alpha_{1},\ldots,\alpha_{4})^{\intercal}\in\mathbbm{R}^{5}

    • ry(Y,X)=(1,Y,X,YX)8r_{y}(Y,X)=(1,Y,X^{\intercal},YX^{\intercal})^{\intercal}\in\mathbbm{R}^{8}

    • b(W1,W2,X;η)=(1,W1,W2,X)ηb(W_{1},W_{2},X;\eta)=(1,W_{1},W_{2},X^{\intercal})\eta, η=(η0,η1,,η6)5\eta=(\eta_{0},\eta_{1},\ldots,\eta_{6})\in\mathbbm{R}^{5}

    • rw(W2,X)=(1,W1,W2,X,W1X,W2X)12r_{w}(W_{2},X)=(1,W_{1},W_{2},X^{\intercal},W_{1}X^{\intercal},W_{2}X^{\intercal})^{\intercal}\in\mathbbm{R}^{12}

Therefore, using the EPS πSA(Y,W,X;α,αw)\pi_{SA}(Y,W,X;\alpha,\alpha_{w}) (here, subscript SASA emphasizes that the EPS is for the sensitivity analysis) with a specified sensitivity parameter value αw\alpha_{w}\in\mathbbm{R}, we redefine the moment equation used in the GMM procedure as follows:

gPS(O;θ)\displaystyle g_{{\text{PS}}}(O;\theta) =(A(Yψ1)(1A)πSA(Y,W,X;α,αw)1πSA(Y,W,X;α,αw)(Yψ0){1A1πSA(Y,W,X;α,αw)1}rw(W,X)).\displaystyle=\begin{pmatrix}A\big{(}Y-\psi_{1}\big{)}\\ (1-A)\frac{\pi_{SA}(Y,W,X;\alpha,\alpha_{w})}{1-\pi_{SA}(Y,W,X;\alpha,\alpha_{w})}(Y-\psi_{0})\\ \big{\{}\frac{1-A}{1-\pi_{SA}(Y,W,X;\alpha,\alpha_{w})}-1\big{\}}r_{w}(W,X)\end{pmatrix}\ .

Using this gg function, we use penalized GMM in (21) to estimate the ETT, where the penalization term (θ)\mathcal{R}(\theta) is chosen as (θ)=α022\mathcal{R}(\theta)=\|\alpha_{-0}\|_{2}^{2} and λ\lambda are chosen as the same value in Table 2.

Table 4 summarizes the sensitivity analysis results. We focus the results associated with W1W_{1} because the other two cases can be interpreted in a similar manner. First, at αW=0\alpha_{W}=0, we have the same result as the EPS COCA estimates in Table 3. Second, the 95% pointwise confidence intervals for the effect estimate include the null effect once the sensitivity parameter αw\alpha_{w} is equal to 0.74. Third, the effect estimate is positive once the sensitivity parameter αw\alpha_{w} is greater than 0.99. Lastly, the lower bound of the 95% confidence interval for the crude estimate (2.953,3.816)(2.953,3.816) overlaps with the 95% pointwise confidence intervals for the effect estimate once αw\alpha_{w} is greater than 4.19.

NCO Statistic COCA COCA (Null) COCA (Positive) COCA (Crude)
W1W_{1} αw\alpha_{w} 0.000 0.690 1.490 2.600
αy\alpha_{y} 2.677 1.750 0.325 -0.484
Estimate -2.334 -0.718 0.024 2.330
95% CI (-3.050,-1.618) (-1.444,0.007) (-0.379,0.427) (1.672,2.989)
W2W_{2} αw\alpha_{w} 0.000 0.690 0.980 4.370
αy\alpha_{y} 3.060 1.047 0.273 -4.243
Estimate -2.298 -0.646 0.130 2.288
95% CI (-3.261,-1.334) (-1.303,0.011) (-0.444,0.704) (1.473,3.103)
(W1,W2)(W_{1},W_{2}) αw\alpha_{w} 0.000 1.040 1.320 7.030
αy\alpha_{y} 2.629 0.829 0.165 -4.611
Estimate -2.462 -0.541 0.006 2.630
95% CI (-3.357,-1.567) (-1.083,0.001) (-0.457,0.469) (2.292,2.969)
Table 4: Summary of Sensitivity Analysis. COCA column shows the estimation result at αw=0\alpha_{w}=0. COCA (Null) column shows the estimation result at the sensitivity parameter where the 95% confidence interval starts to contain zero. COCA (Positive) column shows the estimation result at the sensitivity parameter where the effect estimate becomes positive. COCA (Crude) column shows the estimation result at the sensitivity parameter where the 95% confidence interval starts to overlap with the 95% confidence interval for the crude estimate (2.953,3.816)(2.953,3.816). αy\alpha_{y} rows show the coefficient of YY in πSA\pi_{SA}. All results are obtained based on the penalized GMM in equation (21) with the regularization parameters are chosen as Table 2.

To interpret these results, consider the scenario where (W1,W2)(W_{1},W_{2}) are used as NCOs. Note that, at the sensitivity parameter value αw=4.006\alpha_{w}=4.006, the EPS COCA inference would become empirically consistent with the crude estimate. At this specific value of αw\alpha_{w}, the estimated coefficient for YY stands at αy=3.250\alpha_{y}=-3.250. Compared to αy=2.629\alpha_{y}=2.629 at αw=0\alpha_{w}=0, this suggests that it would take a substantial violation of our primary assumption for the crude estimate to have a causal interpretation; a possibility we do not believe credible. Likewise, our sensitivity analysis suggests that it would also take incorporating a relatively large violation of our primary assumption for the COCA estimator to become consistent with the sharp null hypothesis of no causal effect Ya=1=Ya=0Y^{a=1}=Y^{a=0}, therefore indicating the presence of a strong common cause of exposure and baseline outcome, leading to a near doubling of the odds of exposure to Zika virus but which does not also confound the outcome of interest; which we believe to be unlikely.

A.9 Over-identification Test

We formalize an over-identification test discussed in Section 6 of the paper. For simplicity, suppose that there are two potential NCOs, denoted by W=(W1,W2)W=(W_{1},W_{2}), and that W1W_{1} is known a priori to be a valid NCO. One can obtain two semiparametric estimators of ψ\psi^{*}, denoted by ψ^[1]\widehat{\psi}_{[1]} and ψ^[1,2]\widehat{\psi}_{[1,2]}, which are constructed by either using W1W_{1} or both (W1,W2)(W_{1},W_{2}) as NCOs; see Section 4 of the main paper or the Appendix A.4 for details on how these estimators are constructed. From Result 6 and (20), ψ^[1]\widehat{\psi}_{[1]} is CAN for ψ\psi^{*}. Likewise, if W2W_{2} is also a valid NCO, ψ^[1,2]\widehat{\psi}_{[1,2]} is also CAN for the ETT from Result 6. However, if W2W_{2} is an invalid NCO, say it violates Assumption 2-(iii), ψ^[1,2]\widehat{\psi}_{[1,2]} may be no longer CAN for ψ\psi^{*}. This implies that, we have the following result under a null hypothesis H0:W2 is a valid NCOH_{0}:\text{$W_{2}$ is a valid NCO}:

N{ψ^[1]ψ^[1,2]}DN(0,ς2).\displaystyle\sqrt{N}\Big{\{}\widehat{\psi}_{[1]}-\widehat{\psi}_{[1,2]}\Big{\}}\stackrel{{\scriptstyle D}}{{\rightarrow}}N\big{(}0,\varsigma^{2}\big{)}\ .

A consistent estimator for ς2\varsigma^{2}, say ς^2\widehat{\varsigma}^{2} can be obtained in two ways. First, if parametric estimators are used, ς^2\widehat{\varsigma}^{2} can be constructed from a consistent GMM variance estimator. Second, if semiparametric estimators are used, ς^2\widehat{\varsigma}^{2} can be chosen as the empirical variance of the difference between the two efficient influence functions where one is associated with only W1W_{1} and the other is associated with both W1W_{1} and W2W_{2}. Therefore, a statistical test for evaluating the null hypothesis H0H_{0} is given by

Reject H0W2 is a valid NCO at level α(0,1) if T=|ψ^[1,2]ψ^[1]ς^/N|z1α/2\displaystyle\text{Reject $H_{0}$: $W_{2}$ is a valid NCO at level $\alpha\in(0,1)$ if }T=\bigg{|}\frac{\widehat{\psi}_{[1,2]}-\widehat{\psi}_{[1]}}{\widehat{\varsigma}/\sqrt{N}}\bigg{|}\geq z_{1-\alpha/2}

where zaz_{a} is the 100a%100a\% percentile of N(0,1)N(0,1).

In the context of the Zika virus application, we conducted the over-identification test, and the results are summarized in Table 5. The findings from all four estimators indicate that there is insufficient statistical evidence to conclude that either W1W_{1} or W2W_{2} is an invalid NCO, as long as the other is a valid NCO.

Estimator Statistic
H0H_{0}: W2W_{2} is a valid NCO
(W1W_{1} is known a priori as a valid NCO)
H0H_{0}: W1W_{1} is a valid NCO
(W2W_{2} is known a priori as a valid NCO)
Semiparametric Estimator 0.230-0.230 0.002-0.002
SE 0.3160.316 0.4870.487
Test statistic 0.7290.729 0.0040.004
Parametric EPS Estimator 0.1280.128 0.1640.164
SE 0.1500.150 0.2260.226
Test statistic 0.8530.853 0.7280.728
COCA bridge function Estimator 0.0390.039 0.1520.152
SE 0.1860.186 0.4820.482
Test statistic 0.2070.207 0.3170.317
Doubly-robust Estimator 0.053-0.053 0.3490.349
SE 0.1930.193 0.2460.246
Test statistic 0.2750.275 1.4181.418
Table 5: Over-identification Test Results. The critical value at level α=0.05\alpha=0.05 is z0.975=1.960z_{0.975}=1.960.

Appendix B Proof

B.1 Proof of (3)

Suppose that the EPS is correctly specified, i.e. π(y,x;α)=Pr(A=1|Ya=0=y,X=x)\pi(y,x;\alpha^{\dagger})=\Pr(A=1\,\big{|}\,Y^{a=0}=y,X=x). Then,

E{(1A)Yπ(Y,X;α)1π(Y,X;α)}\displaystyle E\bigg{\{}(1-A)Y\frac{\pi(Y,X;\alpha^{\dagger})}{1-\pi(Y,X;\alpha^{\dagger})}\bigg{\}}
=E[E{(1A)Yπ(Y,X;α)1π(Y,X;α)|X}]\displaystyle=E\bigg{[}E\bigg{\{}(1-A)Y\frac{\pi(Y,X;\alpha^{\dagger})}{1-\pi(Y,X;\alpha^{\dagger})}\,\bigg{|}\,X\bigg{\}}\bigg{]}
=E[E{(1A)Ya=0π(Ya=0,X;α)1π(Ya=0,X;α)|X}]\displaystyle=E\bigg{[}E\bigg{\{}(1-A)Y^{a=0}\frac{\pi(Y^{a=0},X;\alpha^{\dagger})}{1-\pi(Y^{a=0},X;\alpha^{\dagger})}\,\bigg{|}\,X\bigg{\}}\bigg{]}
=E[E{(1A)Ya=0Pr(A=1|Ya=0,X)Pr(A=0|Ya=0,X)|X}]\displaystyle=E\bigg{[}E\bigg{\{}(1-A)Y^{a=0}\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X)}{\Pr(A=0\,\big{|}\,Y^{a=0},X)}\,\bigg{|}\,X\bigg{\}}\bigg{]}
=E{Pr(A=0|Ya=0,X)E(Ya=0|X)Pr(A=1|Ya=0,X)Pr(A=0|Ya=0,X)}\displaystyle=E\bigg{\{}\Pr(A=0\,\big{|}\,Y^{a=0},X)E\big{(}Y^{a=0}\,\big{|}\,X\big{)}\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X)}{\Pr(A=0\,\big{|}\,Y^{a=0},X)}\bigg{\}}
=E{E(A|Ya=0,X)E(Ya=0|X)}\displaystyle=E\big{\{}E(A\,\big{|}\,Y^{a=0},X)E\big{(}Y^{a=0}\,\big{|}\,X\big{)}\big{\}}
=E{E(AYa=0|X)}\displaystyle=E\big{\{}E\big{(}AY^{a=0}\,\big{|}\,X\big{)}\big{\}}
=E(AYa=0)\displaystyle=E\big{(}AY^{a=0}\big{)}

The first equality is from the law of iterated expectations. The second equality is from the consistency assumption, i.e. Ya=0=YY^{a=0}=Y if A=0A=0. The third equality holds when the EPS is correctly specified. The rest identities hold from the law of iterated expectations. Similarly, we obtain

E{(1A)π(Y,X;α)1π(Y,X;α)}=E(A)\displaystyle E\bigg{\{}(1-A)\frac{\pi(Y,X;\alpha^{\dagger})}{1-\pi(Y,X;\alpha^{\dagger})}\bigg{\}}=E(A)

Consequently, we obtain

E{(1A)Yπ(Y,X;α)1π(Y,X;α)}E{(1A)π(Y,X;α)1π(Y,X;α)}=E(AYa=0)E(A)=E(Ya=0|A=1)\displaystyle\frac{E\big{\{}(1-A)Y\frac{\pi(Y,X;\alpha^{\dagger})}{1-\pi(Y,X;\alpha^{\dagger})}\big{\}}}{E\big{\{}(1-A)\frac{\pi(Y,X;\alpha^{\dagger})}{1-\pi(Y,X;\alpha^{\dagger})}\big{\}}}=\frac{E\big{(}AY^{a=0}\big{)}}{E(A)}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}

Therefore, the GMM estimator using gPSg_{{\text{PS}}} leads to ψ0=E(Ya=0|A=1)\psi_{0}^{\dagger}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)} if the EPS is correctly specified.

B.2 Proof of Result 1 and (4)

The proof is straightforward from the following algebra:

E{Pr(A=1|Y,X=x)Pr(A=0|Y,X=x)|W=w,X=x,A=0}\displaystyle E\bigg{\{}\frac{\Pr(A=1\,\big{|}\,Y,X=x)}{\Pr(A=0\,\big{|}\,Y,X=x)}\,\bigg{|}\,W=w,X=x,A=0\bigg{\}}
=(A)E{Pr(A=1|Ya=0,X=x)Pr(A=0|Ya=0,X=x)|W=w,X=x,A=0}\displaystyle\stackrel{{\scriptstyle(A)}}{{=}}E\bigg{\{}\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X=x)}{\Pr(A=0\,\big{|}\,Y^{a=0},X=x)}\,\bigg{|}\,W=w,X=x,A=0\bigg{\}}
=Pr(A=1|Ya=0,X=x)Pr(A=0|Ya=0,X=x)f(Ya=0=y|W=w,X=x,A=0)dy\displaystyle=\int\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X=x)}{\Pr(A=0\,\big{|}\,Y^{a=0},X=x)}f(Y^{a=0}=y\,\big{|}\,W=w,X=x,A=0)\,dy
=Pr(A=1|Ya=0,X=x)Pr(A=0|Ya=0,X=x)f(Ya=0=y,W=w,X=x,A=0)f(W=w,X=x,A=0)𝑑y\displaystyle=\int\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X=x)}{\Pr(A=0\,\big{|}\,Y^{a=0},X=x)}\frac{f(Y^{a=0}=y,W=w,X=x,A=0)}{f(W=w,X=x,A=0)}\,dy
=Pr(A=1|Ya=0,X=x)Pr(A=0|Ya=0,X=x)f(W=w,A=0|Ya=0=y,X=x)f(Ya=0=y,X=x)f(W=w,X=x,A=0)𝑑y\displaystyle=\int\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X=x)}{\Pr(A=0\,\big{|}\,Y^{a=0},X=x)}\frac{f(W=w,A=0\,\big{|}\,Y^{a=0}=y,X=x)f(Y^{a=0}=y,X=x)}{f(W=w,X=x,A=0)}\,dy
=(B)Pr(A=1|Ya=0,X=x)Pr(A=0|Ya=0,X=x){f(W=w|Ya=0=y,X=x)×Pr(A=0|Ya=0=y,X=x)f(Ya=0=y,X=x)}f(W=w,X=x,A=0)𝑑y\displaystyle\stackrel{{\scriptstyle(B)}}{{=}}\int\frac{\Pr(A=1\,\big{|}\,Y^{a=0},X=x)}{\Pr(A=0\,\big{|}\,Y^{a=0},X=x)}\frac{\bigg{\{}\begin{array}[]{l}f(W=w\,\big{|}\,Y^{a=0}=y,X=x)\\[-7.11317pt] \times\Pr(A=0\,\big{|}\,Y^{a=0}=y,X=x)f(Y^{a=0}=y,X=x)\end{array}\bigg{\}}}{f(W=w,X=x,A=0)}\,dy
=f(W=w|Ya=0=y,X=x)Pr(A=1|Ya=0=y,X=x)f(Ya=0=y,X=x)f(W=w,X=x,A=0)𝑑y\displaystyle=\int\frac{f(W=w\,\big{|}\,Y^{a=0}=y,X=x)\Pr(A=1\,\big{|}\,Y^{a=0}=y,X=x)f(Y^{a=0}=y,X=x)}{f(W=w,X=x,A=0)}\,dy
=(B)f(W=w,A=1|Ya=0=y,X=x)f(Ya=0=y,X=x)f(W=w,X=x,A=0)𝑑y\displaystyle\stackrel{{\scriptstyle(B)}}{{=}}\int\frac{f(W=w,A=1\,\big{|}\,Y^{a=0}=y,X=x)f(Y^{a=0}=y,X=x)}{f(W=w,X=x,A=0)}\,dy
=f(W=w,A=1,Ya=0=y,X=x)f(W=w,A=0,X=x)𝑑y\displaystyle=\int\frac{f(W=w,A=1,Y^{a=0}=y,X=x)}{f(W=w,A=0,X=x)}\,dy
=f(W=w,A=1,X=x)f(W=w,A=0,X=x)\displaystyle=\frac{f(W=w,A=1,X=x)}{f(W=w,A=0,X=x)}
=Pr(A=1|W=w,X=x)Pr(A=0|W=w,X=x)\displaystyle=\frac{\Pr(A=1\,\big{|}\,W=w,X=x)}{\Pr(A=0\,\big{|}\,W=w,X=x)}

Equality (A) holds from the consistency assumption, i.e. Ya=0=YY^{a=0}=Y if A=0A=0. Equalities (B) holds from Assumption 2. The rest identities are trivial.

B.3 Proof of Result 2

We have

E{g(Y)(1A)ω(Y,X)}\displaystyle E\big{\{}g(Y)(1-A)\omega^{*}(Y,X)\big{\}} =E{g(Ya=0)(1A)ω(Ya=0,X)}\displaystyle=E\big{\{}g(Y^{a=0})(1-A)\omega^{*}(Y^{a=0},X)\big{\}}
=E{g(Ya=0)Pr(A=0|Ya=0,X)ω(Ya=0,X)}\displaystyle=E\big{\{}g(Y^{a=0})\Pr(A=0\,\big{|}\,Y^{a=0},X)\omega^{*}(Y^{a=0},X)\big{\}}
=E{g(Ya=0)Pr(A=1|Ya=0,X)}\displaystyle=E\big{\{}g(Y^{a=0})\Pr(A=1\,\big{|}\,Y^{a=0},X)\big{\}}
=E{g(Ya=0)A}.\displaystyle=E\big{\{}g(Y^{a=0})A\big{\}}\ .

Therefore, by taking g(y)=yg(y)=y and g(y)=1g(y)=1, we obtain

E{g(Y)(1A)ω(Y,X)}E{(1A)ω(Y,X)}=E(AYa=0)E(A)=E(Ya=0|A=1).\displaystyle\frac{E\big{\{}g(Y)(1-A)\omega^{*}(Y,X)\big{\}}}{E\big{\{}(1-A)\omega^{*}(Y,X)\big{\}}}=\frac{E\big{(}AY^{a=0}\big{)}}{E\big{(}A\big{)}}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}\ .

B.4 Condition (5) Under Binary YY and WW

If WW is a binary variable, any arbitrary function of WW is represented as b(W)=b0+b1Wb(W)=b_{0}+b_{1}\cdot W where b0b_{0} and b1b_{1} are finite numbers. Therefore, Condition (5) is written as

1=E{b(W)|Y=1,A=0}=b0+b1Pr(W=1|Y=1,A=0),\displaystyle 1=E\big{\{}b(W)\,\big{|}\,Y=1,A=0\big{\}}=b_{0}+b_{1}\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}\ , (26)
0=E{b(W)|Y=0,A=0}=b0+b1Pr(W=1|Y=0,A=0).\displaystyle 0=E\big{\{}b(W)\,\big{|}\,Y=0,A=0\big{\}}=b_{0}+b_{1}\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}\ . (27)

Therefore, the system of equations satisfy

(26)(27)\displaystyle\eqref{eq:supp:binary1}-\eqref{eq:supp:binary0} \displaystyle\Rightarrow 1=b1{Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)}\displaystyle 1=b_{1}\big{\{}\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}\big{\}}
\displaystyle\Rightarrow b1=1Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)\displaystyle b_{1}=\frac{1}{\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}}

and

Pr(W=1|Y=0,A=0)×(26)Pr(W=1|Y=1,A=0)×(27)\displaystyle\Pr(W=1\,\big{|}\,Y=0,A=0)\times\eqref{eq:supp:binary1}-\Pr(W=1\,\big{|}\,Y=1,A=0)\times\eqref{eq:supp:binary0}
Pr(W=1|Y=0,A=0)={Pr(W=1|Y=0,A=0)Pr(W=1|Y=1,A=0)}b0\displaystyle\Rightarrow\quad\Pr(W=1\,\big{|}\,Y=0,A=0)=\big{\{}\Pr(W=1\,\big{|}\,Y=0,A=0)-\Pr(W=1\,\big{|}\,Y=1,A=0)\big{\}}b_{0}
b0=Pr(W=1|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)\displaystyle\Rightarrow\quad b_{0}=-\frac{\Pr(W=1\,\big{|}\,Y=0,A=0)}{\Pr(W=1\,\big{|}\,Y=1,A=0)-\Pr(W=1\,\big{|}\,Y=0,A=0)}

provided that Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)0\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}\neq 0. Therefore,

b(W=0)=b0\displaystyle b(W=0)=b_{0} =Pr(W=1|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)\displaystyle=-\frac{\Pr(W=1\,\big{|}\,Y=0,A=0)}{\Pr(W=1\,\big{|}\,Y=1,A=0)-\Pr(W=1\,\big{|}\,Y=0,A=0)}
b(W=1)=b0+b1\displaystyle b(W=1)=b_{0}+b_{1} =Pr(W=1|Y=0,A=0)+1Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)\displaystyle=\frac{-\Pr(W=1\,\big{|}\,Y=0,A=0)+1}{\Pr(W=1\,\big{|}\,Y=1,A=0)-\Pr(W=1\,\big{|}\,Y=0,A=0)}
=Pr(W=0|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)\displaystyle=\frac{\Pr(W=0\,\big{|}\,Y=0,A=0)}{\Pr(W=1\,\big{|}\,Y=1,A=0)-\Pr(W=1\,\big{|}\,Y=0,A=0)}

which is equivalent to

b(W)=(1W)Pr(W=1|Y=0,A=0)+WPr(W=0|Y=0,A=0)Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0).\displaystyle b(W)=\frac{-(1-W)\Pr\big{(}W=1|Y=0,A=0\big{)}+W\Pr\big{(}W=0|Y=0,A=0\big{)}}{\Pr\big{(}W=1|Y=1,A=0\big{)}-\Pr\big{(}W=1|Y=0,A=0\big{)}}\ .

If Pr(W=1|Y=1,A=0)Pr(W=1|Y=0,A=0)=0\Pr\big{(}W=1\,\big{|}\,Y=1,A=0\big{)}-\Pr\big{(}W=1\,\big{|}\,Y=0,A=0\big{)}=0, it implies that the right hand sides of (27) and (26) are the same whereas the left hand sides of those are different as 1 and 0, respectively. In this case, there is no function b(W)b(W) satisfying (5).

B.5 Proof of Result 3 and (7)

Suppose that bb^{*} satisfies (5) and Assumption 2. Then,

E{b(W,X)|A=1}\displaystyle E\left\{b^{*}(W,X)\,\big{|}\,A=1\right\}
=E[E{b(W,X)|A=1,Ya=0,X}|A=1]\displaystyle=E\left[E\left\{b^{*}(W,X)\,\big{|}\,A=1,Y^{a=0},X\right\}\,\big{|}\,A=1\right]
=(A)E[E{b(W,X)|A=0,Ya=0,X}|A=1]\displaystyle\stackrel{{\scriptstyle(A)}}{{=}}E\left[E\left\{b^{*}(W,X)\,\big{|}\,A=0,Y^{a=0},X\right\}\,\big{|}\,A=1\right]
=E{b(W,X)|A=0,Ya=0=y,X=x}f(Ya=0=y,X=x|A=1)d(y,x)\displaystyle=\iint E\left\{b^{*}(W,X)\,\big{|}\,A=0,Y^{a=0}=y,X=x\right\}f(Y^{a=0}=y,X=x\,\big{|}\,A=1)\,d(y,x)
=(B)E{b(W,X)|A=0,Y=y,X=x}f(Ya=0=y,X=x|A=1)d(y,x)\displaystyle\stackrel{{\scriptstyle(B)}}{{=}}\iint E\left\{b^{*}(W,X)\,\big{|}\,A=0,Y=y,X=x\right\}f(Y^{a=0}=y,X=x\,\big{|}\,A=1)\,d(y,x)
=(C)yf(Ya=0=y,X=x|A=1)d(y,x)\displaystyle\stackrel{{\scriptstyle(C)}}{{=}}\iint yf(Y^{a=0}=y,X=x\,\big{|}\,A=1)\,d(y,x)
=E(Ya=0|A=1).\displaystyle=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}\ .

Equalities (A), (B), and (C) hold from Assumption 2, consistency assumption, and (5), respectively.

Suppose that the COCA confounding bridge function is correctly specified, i.e. y=E{b(W,X;η)|Y=y,X=x,A=0}y=E\big{\{}b(W,X;\eta^{\dagger})\,\big{|}\,Y=y,X=x,A=0\big{\}}. Then,

E[A{b(W,X;η)E(Ya=0|A=1)}]\displaystyle E\big{[}A\big{\{}b(W,X;\eta^{\dagger})-E(Y^{a=0}\,\big{|}\,A=1)\big{\}}\big{]} =Pr(A=1)[E{b(W,X;η)|A=1}E(Ya=0|A=1)]\displaystyle=\Pr(A=1)\big{[}E\big{\{}b(W,X;\eta^{\dagger})\,\big{|}\,A=1\big{\}}-E(Y^{a=0}\,\big{|}\,A=1)\big{]}
=0.\displaystyle=0\ .

Therefore, the GMM estimator using gORg_{{\text{OR}}} leads to ψ0=E(Ya=0|A=1)\psi_{0}^{\dagger}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)} if the COCA confounding bridge function is correctly specified.

B.6 Proof of Result 5

Suppose E{b(W,X;η)|A=0,Y=y}=yE\left\{b\big{(}W,X;\eta^{\dagger}\big{)}\,\big{|}\,A=0,Y=y\right\}=y then

E[Ab(W,X;η)+(1A)π(Y,X;α)1π(Y,X;α){Yb(W,X;η)}]\displaystyle E\left[Ab\big{(}W,X;\eta^{\dagger}\big{)}+\frac{\big{(}1-A\big{)}\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}{1-\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}\left\{Y-b\big{(}W,X;\eta\big{)}\right\}\right]
=E[AE{b(W,X;η)|A=1,Ya=0,X}+(1A)π(Y,X;α)1π(Y,X;α)[YE{b(W,X;η)|A=0,Ya=0,X}]]\displaystyle=E\left[AE\left\{b\big{(}W,X;\eta^{\dagger}\big{)}\,\big{|}\,A=1,Y^{a=0},X\right\}+\frac{\big{(}1-A\big{)}\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}{1-\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}\left[Y-E\left\{b\big{(}W,X;\eta\big{)}\,\big{|}\,A=0,Y^{a=0},X\right\}\right]\right]
=E[AE{b(W,X;η)|A=0,X,Ya=0}+(1A)π(Y,X;α)1π(Y,X;α)[Ya=0E{b(W,X;η)|A=0,Ya=0,X}]]\displaystyle=E\left[AE\left\{b\big{(}W,X;\eta^{\dagger}\big{)}\,\big{|}\,A=0,X,Y^{a=0}\right\}+\frac{\big{(}1-A\big{)}\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}{1-\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}\left[Y^{a=0}-E\left\{b\big{(}W,X;\eta\big{)}\,\big{|}\,A=0,Y^{a=0},X\right\}\right]\right]
=E{AYa=0+(1A)π(Y,X;α)1π(Y,X;α)(Ya=0Ya=0)=0}\displaystyle=E\bigg{\{}AY^{a=0}+\frac{\big{(}1-A\big{)}\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}{1-\pi\big{(}Y,X;\mathbf{\alpha}\big{)}}\underset{=0}{\underbrace{\big{(}Y^{a=0}-Y^{a=0}\big{)}}}\bigg{\}}
=P(A=1)×E(Ya=0|A=1).\displaystyle=P\big{(}A=1\big{)}\times E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}\ .

The first equality holds from the law of iterated expectations. The second equality holds from Assumption 2 and the consistency assumption. The third equality holds from E{b(W,X;η)|A=0,Y=y}=yE\left\{b\big{(}W,X;\eta^{\dagger}\big{)}\,\big{|}\,A=0,Y=y\right\}=y. The last equality is trivial.

Next suppose that π(y,x;α)=Pr(A=1|Ya=0=y,X=x)\pi\big{(}y,x;\mathbf{\alpha}^{\dagger}\big{)}=\Pr\big{(}A=1\,\big{|}\,Y^{a=0}=y,X=x\big{)} then

E[Ab(W,X;η)+(1A)π(Y,X;α)1π(Y,X;α){Yb(W,X;η)}]\displaystyle E\left[Ab\big{(}W,X;\eta\big{)}+\frac{\big{(}1-A\big{)}\pi\big{(}Y,X;\mathbf{\alpha}^{\dagger}\big{)}}{1-\pi\big{(}Y,X;\mathbf{\alpha}^{\dagger}\big{)}}\left\{Y-b\big{(}W,X;\eta\big{)}\right\}\right]
=E[Ab(W,X;η)+(1A)Pr(A=1|Ya=0,X)1Pr(A=1|Ya=0,X){Ya=0b(W,X;η)}]\displaystyle=E\left[Ab\big{(}W,X;\eta\big{)}+\frac{\big{(}1-A\big{)}\Pr\big{(}A=1\,\big{|}\,Y^{a=0},X\big{)}}{1-\Pr\big{(}A=1\,\big{|}\,Y^{a=0},X\big{)}}\left\{Y^{a=0}-b\big{(}W,X;\eta\big{)}\right\}\right]
=E[Ab(W,X;η)+Pr(A=0|Ya=0,X)Pr(A=1|Ya=0,X)1Pr(A=1|Ya=0,X)E{Ya=0b(W,X;η)|A=0,Ya=0,X}]\displaystyle=E\left[Ab\big{(}W,X;\eta\big{)}+\frac{\Pr\big{(}A=0\,\big{|}\,Y^{a=0},X\big{)}\Pr\big{(}A=1\,\big{|}\,Y^{a=0},X\big{)}}{1-\Pr\big{(}A=1\,\big{|}\,Y^{a=0},X\big{)}}E\big{\{}Y^{a=0}-b\big{(}W,X;\eta\big{)}\,\big{|}\,A=0,Y^{a=0},X\big{\}}\right]
=E[Ab(W,X;η)+Pr(A=1|Ya=0,X)E{Ya=0b(W,X;η)|A=1,Ya=0,X}]\displaystyle=E\left[Ab\big{(}W,X;\eta\big{)}+\Pr\big{(}A=1\,\big{|}\,Y^{a=0},X\big{)}E\big{\{}Y^{a=0}-b\big{(}W,X;\eta\big{)}\,\big{|}\,A=1,Y^{a=0},X\big{\}}\right]
=E[Ab(W,X;η)+A{Ya=0b(W,X;η)}]\displaystyle=E\left[Ab\big{(}W,X;\eta\big{)}+A\left\{Y^{a=0}-b\big{(}W,X;\eta\big{)}\right\}\right]
=P(A=1)×E(Ya=0|A=1).\displaystyle=P\big{(}A=1\big{)}\times E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)}\ .

The first equality holds from π(y,x;α)=Pr(A=1|Ya=0=y,X=x)\pi\big{(}y,x;\mathbf{\alpha}^{\dagger}\big{)}=\Pr\big{(}A=1\,\big{|}\,Y^{a=0}=y,X=x\big{)} and the consistency assumption. The second equality holds from the law of iterated expectations. The third equality holds from Assumption 2. The remaining equalities are trivial. Therefore, the GMM estimator using gDRg_{{\text{DR}}} leads to ψ0=E(Ya=0|A=1)\psi_{0}^{\dagger}=E\big{(}Y^{a=0}\,\big{|}\,A=1\big{)} if the EPS or the COCA confounding bridge function is correctly specified.

B.7 Proof of Result 4

It is straightforward to verify that the efficient influence function for ψ1=E(Y|A=1)\psi_{1}^{*}=E(Y\,\big{|}\,A=1) is A(Yψ1)/Pr(A=1)A(Y-\psi_{1}^{*})/\Pr(A=1). Therefore, it suffices to find an (efficient) influence function of ψ0\psi_{0}^{*}. In the remainder of the proof, we establish that (i) the following IF0\texttt{IF}_{0} is an influence function of the counterfactual mean ψ0=E(Ya=0|A=1)\psi_{0}^{*}=E(Y^{a=0}\,\big{|}\,A=1) under \mathcal{M} and (ii) it is the efficient influence function for ψ0\psi_{0}^{*} at the submodel sub\mathcal{M}_{\text{sub}}:

IF0(O;π,b)=1Pr(A=1)[(1A)π(Y,X)1π(Y,X){Yb(W,X)}+A{b(W,X)ψ0}].\displaystyle\texttt{IF}_{0}(O;\pi^{*},b^{*})=\frac{1}{\Pr\big{(}A=1\big{)}}\bigg{[}\frac{\big{(}1-A\big{)}\pi^{*}\big{(}Y,X\big{)}}{1-\pi^{*}\big{(}Y,X\big{)}}\left\{Y-b^{*}\big{(}W,X\big{)}\right\}+A\big{\{}b^{*}\big{(}W,X\big{)}-\psi_{0}^{*}\big{\}}\bigg{]}\ .

B.7.1 Proof of (i)

Let (t)\mathcal{M}(t) be the one-dimensional parametric submodel of \mathcal{M} and f(O;t)(t)f(O;t)\in\mathcal{M}(t) be the density at tt. We suppose that the true density is recovered at tt^{*}, i.e., f(O):=f(O;t)f^{*}(O):=f(O;t^{*}). Let s(|;t):=logf(|;t)/ts(\cdot\,\big{|}\,\cdot;t):=\partial\log f(\cdot\,\big{|}\,\cdot;t)/\partial t be the score function evaluated at tt, and let s(|)s^{*}(\cdot\,\big{|}\,\cdot) be the score function evaluated at tt^{*}. Let E(t){g(O)}E^{(t)}\big{\{}g(O)\big{\}} be the expectation operator at f(O;t)f(O;t).

Let b(W,X;t)b(W,X;t) be the COCA confounding bridge function satisfying

0=E(t){Yb(W,X;t)|Y,A=0,X}.\displaystyle 0=E^{(t)}\big{\{}Y-b(W,X;t)\,\big{|}\,Y,A=0,X\big{\}}\ .

Let ψ0(t)=E(t){b(W,X;t)|A=1}=E(t){Ab(W,X;t)}/E(t)(A)\psi_{0}(t)=E^{(t)}\big{\{}b\big{(}W,X;t\big{)}\,\big{|}\,A=1\big{\}}=E^{(t)}\big{\{}Ab\big{(}W,X;t\big{)}\big{\}}/E^{(t)}\big{(}A\big{)} be the counterfactual mean evaluated at tt. Let ψN(t)=E(t){Ab(W,X;t)}\psi_{N}(t)=E^{(t)}\big{\{}Ab\big{(}W,X;t\big{)}\big{\}} and ψD(t)=E(t)(A)\psi_{D}(t)=E^{(t)}\big{(}A\big{)}. Note that ψ0=E{b(W,X)|A=1}\psi_{0}^{*}=E\big{\{}b^{*}\big{(}W,X\big{)}\,\big{|}\,A=1\big{\}} and ψN=E{Ab(W,X)}\psi_{N}^{*}=E\big{\{}Ab^{*}\big{(}W,X\big{)}\big{\}} and ψD=E(A)\psi_{D}^{*}=E(A). Taking the derivative of the above restrictions with respect to tt and evaluating at tt^{*}, the score functions ss^{*} satisfies:

E[{Yb(W,X)}s(W|Y,A=0,X)tb(W,X;t)|Y,A=0,X]=0.\displaystyle E\Big{[}\big{\{}Y-b^{*}(W,X)\big{\}}s^{*}(W\,\big{|}\,Y,A=0,X)-\nabla_{t}b(W,X;t^{*})\,\Big{|}\,Y,A=0,X\Big{]}=0\ . (28)

The pathwise derivative of ψ0(t)\psi_{0}(t) is

ψ0(t)t\displaystyle\frac{\partial\psi_{0}(t)}{\partial t} =E(t){Atb(W,X;t)+Ab(W,X;t)s(W,A,X;t)}E(t)(A)ψ0(t)ψD(t)E(t){As(A;t)}\displaystyle=\frac{E^{(t)}\big{\{}A\nabla_{t}b(W,X;t)+Ab(W,X;t)s(W,A,X;t)\big{\}}}{E^{(t)}(A)}-\frac{\psi_{0}(t)}{\psi_{D}(t)}E^{(t)}\big{\{}A\cdot s(A;t)\big{\}}
=E(t){Atb(W,X;t)+Ab(W,X;t)s(Y,W,A,X;t)}ψD(t)ψ0(t)ψD(t)E(t){As(Y,W,A,X;t)}\displaystyle=\frac{E^{(t)}\big{\{}A\nabla_{t}b(W,X;t)+Ab(W,X;t)s(Y,W,A,X;t)\big{\}}}{\psi_{D}(t)}-\frac{\psi_{0}(t)}{\psi_{D}(t)}E^{(t)}\big{\{}A\cdot s(Y,W,A,X;t)\big{\}}
=E(t)[Atb(W,X;t)+A{b(W,X;t)ψ0(t)}s(Y,W,A,X;t)]ψD(t),\displaystyle=\frac{E^{(t)}\big{[}A\nabla_{t}b(W,X;t)+A\big{\{}b(W,X;t)-\psi_{0}(t)\big{\}}s(Y,W,A,X;t)\big{]}}{\psi_{D}(t)}\ ,

which is evaluated at tt^{*} as follows:

ψ0(t)t|t=t\displaystyle\frac{\partial\psi_{0}(t)}{\partial t}\Bigg{|}_{t=t^{*}} =E[Atb(W,X;t)+A{b(W,X)ψ0}s(O)]Pr(A=1),\displaystyle=\frac{E\big{[}A\nabla_{t}b(W,X;t^{*})+A\big{\{}b^{*}(W,X)-\psi_{0}^{*}\big{\}}s^{*}(O)\big{]}}{\Pr(A=1)}\ ,

We find that

E{Atb(W,X;t)}\displaystyle E\big{\{}A\nabla_{t}b(W,X;t^{*})\big{\}}
=E[E{Atb(W,X;t)|Ya=0,X}]\displaystyle=E\Big{[}E\big{\{}A\nabla_{t}b(W,X;t^{*})\,\big{|}\,Y^{a=0},X\big{\}}\Big{]}
=()E[E(A|Ya=0,X)E{tb(W,X;t)|Ya=0,X}]\displaystyle\stackrel{{\scriptstyle(\dagger)}}{{=}}E\Big{[}E\big{(}A\,\big{|}\,Y^{a=0},X\big{)}E\big{\{}\nabla_{t}b(W,X;t^{*})\,\big{|}\,Y^{a=0},X\big{\}}\Big{]}
=E[Pr(A=0|Ya=0,X)π(Ya=0,X)1π(Ya=0,X)E{tb(W,X;t)|Ya=0,X}]\displaystyle=E\bigg{[}\Pr\big{(}A=0\,\big{|}\,Y^{a=0},X\big{)}\frac{\pi^{*}(Y^{a=0},X)}{1-\pi^{*}(Y^{a=0},X)}E\big{\{}\nabla_{t}b(W,X;t^{*})\,\big{|}\,Y^{a=0},X\big{\}}\bigg{]}
=E[E{(1A)π(Ya=0,X)1π(Ya=0,X)tb(W,X;t)|Ya=0,X}]\displaystyle=E\bigg{[}E\bigg{\{}\frac{(1-A)\pi^{*}(Y^{a=0},X)}{1-\pi^{*}(Y^{a=0},X)}\nabla_{t}b(W,X;t^{*})\,\bigg{|}\,Y^{a=0},X\bigg{\}}\bigg{]}
=E{(1A)π(Y,X)1π(Y,X)tb(W,X;t)}\displaystyle=E\bigg{\{}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\nabla_{t}b(W,X;t^{*})\bigg{\}}
=E[(1A)π(Y,X)1π(Y,X)E{tb(W,X;t)|Y,A=0,X}]\displaystyle=E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}E\big{\{}\nabla_{t}b(W,X;t^{*})\,\big{|}\,Y,A=0,X\big{\}}\bigg{]}
=()E[(1A)π(Y,X)1π(Y,X)E[{Yb(W,X)}s(W|Y,A=0,X)|Y,A=0,X]]\displaystyle\stackrel{{\scriptstyle(*)}}{{=}}E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}E\big{[}\big{\{}Y-b^{*}(W,X)\big{\}}s^{*}(W\,\big{|}\,Y,A=0,X)\,\big{|}\,Y,A=0,X\big{]}\bigg{]}
=E[(1A)π(Y,X)1π(Y,X){Yb(W,X)}s(W|Y,A=0,X)]\displaystyle=E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\big{\{}Y-b^{*}(W,X)\big{\}}s^{*}(W\,\big{|}\,Y,A=0,X)\bigg{]}
=($)E[(1A)π(Y,X)1π(Y,X){Yb(W,X)}[E{s(O)|Y,W,A=0,X}E{s(O)|Y,A=0,X}]]\displaystyle\stackrel{{\scriptstyle(\$)}}{{=}}E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\big{\{}Y-b^{*}(W,X)\big{\}}\Big{[}E\big{\{}s^{*}(O)\,\big{|}\,Y,W,A=0,X\big{\}}-E\big{\{}s^{*}(O)\,\big{|}\,Y,A=0,X\big{\}}\Big{]}\bigg{]}
=E[(1A)π(Y,X)1π(Y,X){Yb(W,X)}s(O)]\displaystyle=E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\big{\{}Y-b^{*}(W,X)\big{\}}s^{*}(O)\bigg{]}
E[(1A)π(Y,X)1π(Y,X)[YE{b(W,X)|Y,A=0,X}]E{s(O)|Y,A=0,X}]\displaystyle-E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\big{[}Y-E\big{\{}b^{*}(W,X)\,\big{|}\,Y,A=0,X\big{\}}\big{]}E\big{\{}s^{*}(O)\,\big{|}\,Y,A=0,X\big{\}}\bigg{]}
=(#)E[(1A)π(Y,X)1π(Y,X){Yb(W,X)}s(O)].\displaystyle\stackrel{{\scriptstyle(\#)}}{{=}}E\bigg{[}\frac{(1-A)\pi^{*}(Y,X)}{1-\pi^{*}(Y,X)}\big{\{}Y-b^{*}(W,X)\big{\}}s^{*}(O)\bigg{]}\ .

Equalities with no marks are straightforward to show from the law of iterated expectations. Equalities with ()(\dagger) are established from Assumption 2: WA|(Ya=0,X)W\,\rotatebox[origin={c}]{90.0}{$\models$}\,A\,\big{|}\,(Y^{a=0},X). The equality with ()(*) is established from (28). The equality with ($)(\$) holds from the property of the score function. The equality with (#)(\#) holds from the definition of the COCA confounding bridge function. As a consequence, we find

ψ0(t)t|t=t\displaystyle\frac{\partial\psi_{0}(t)}{\partial t}\Bigg{|}_{t=t^{*}} =E[Atb(W,X;t)+A{b(W,X)ψ0}s(O)]Pr(A=1)\displaystyle=\frac{E\big{[}A\nabla_{t}b(W,X;t^{*})+A\big{\{}b^{*}(W,X)-\psi_{0}^{*}\big{\}}s^{*}(O)\big{]}}{\Pr(A=1)}
=E{IF0(O;π,b)s(O)},\displaystyle=E\big{\{}\texttt{IF}_{0}(O;\pi^{*},b^{*})\cdot s^{*}(O)\big{\}}\ ,

implying that ψ0\psi_{0}^{*} is a pathwise differentiable parameter (Newey, 1990) with the corresponding influence function IF0(O;π,b)\texttt{IF}_{0}(O;\pi^{*},b^{*}).

B.7.2 Proof of (ii)

It suffices to show that the influence function IF0(O;π,b)\texttt{IF}_{0}(O;\pi^{*},b^{*}) belongs to the tangent space of \mathcal{M} at the submodel sub\mathcal{M}_{\text{sub}}. Note that the model imposes a restriction in (28) on the score function. The restriction implies that the tangent space of \mathcal{M} consists of the functions S(O)2,0(O)S(O)\in\mathcal{L}_{2,0}(O) satisfying

E[{Yb(W,X)}S(W|Y,A=0,X)|Y,A=0,X]Range(T1).\displaystyle E\Big{[}\big{\{}Y-b^{*}(W,X)\big{\}}S(W\,\big{|}\,Y,A=0,X)\,\Big{|}\,Y,A=0,X\Big{]}\in\text{Range}(T_{1})\ . (29)

Under (Surjectivity), we have Range(T1)=2(Y,A=0,X)\text{Range}(T_{1})=\mathcal{L}_{2}(Y,A=0,X). Therefore, any S(O)2,0(O)S(O)\in\mathcal{L}_{2,0}(O) satisfies the restriction (29) at the submodel sub\mathcal{M}_{\text{sub}}. This implies that IF0(O;π,b)\texttt{IF}_{0}(O;\pi^{*},b^{*}) belongs to the tangent space of \mathcal{M} at the submodel sub\mathcal{M}_{\text{sub}}.

B.8 Proof of Result 6

In what follows, we simply denote IF(O)=IF(O;ω,b)\texttt{IF}^{*}(O)=\texttt{IF}(O;\omega^{*},b^{*}) and (V)=i=1NVi/N\mathbbm{P}(V)=\sum_{i=1}^{N}V_{i}/N. If we show that ψ^(k)\widehat{\psi}^{(k)} has the asymptotic representation as

|k|1/2{ψ^(k)ψ}=1|k|1/2ikIF(Oi)+oP(1),\displaystyle\big{|}\mathcal{I}_{k}\big{|}^{1/2}\Big{\{}\widehat{\psi}^{(k)}-\psi^{*}\Big{\}}=\frac{1}{\big{|}\mathcal{I}_{k}\big{|}^{1/2}}\sum_{i\in\mathcal{I}_{k}}\texttt{IF}^{*}(O_{i})+o_{P}(1)\ , (30)

then we establish ψ^=K1k=1Kψ^(k)\widehat{\psi}=K^{-1}\sum_{k=1}^{K}\widehat{\psi}^{(k)} has the asymptotic representation as

N(ψ^ψ)=1Ni=1NIF(Oi)+oP(1).\displaystyle\sqrt{N}\Big{(}\widehat{\psi}-\psi^{*}\Big{)}=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\texttt{IF}^{*}(O_{i})+o_{P}(1)\ . (31)

Therefore, the asymptotic normality result holds from the central limit theorem. Thus, we focus on showing that (30) holds.

Recall that ψ^(k)\widehat{\psi}^{(k)} is

ψ^(k)\displaystyle\widehat{\psi}^{(k)} ={(A)}1[(k)[A{Yb^(k)(W,X)}(1A)ω^(k)(Y,X){Yb^(k)(W,X)}]=:ζ^(k)]\displaystyle=\Big{\{}\mathbbm{P}\big{(}A\big{)}\Big{\}}^{-1}\Big{[}\mathbbm{P}^{(k)}\underbrace{\Big{[}A\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}-(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}\Big{]}}_{=:\widehat{\zeta}^{(-k)}}\Big{]} (32)

Therefore, we find the left hand side of (30) is

|k|1/2{ψ^(k)ψ}\displaystyle\big{|}\mathcal{I}_{k}\big{|}^{1/2}\Big{\{}\widehat{\psi}^{(k)}-\psi^{*}\Big{\}} =1|k|1/2ik{ζ^i(k)(A)Aiψ(k)(A)}\displaystyle=\frac{1}{|\mathcal{I}_{k}|^{1/2}}\sum_{i\in\mathcal{I}_{k}}\bigg{\{}\frac{\widehat{\zeta}_{i}^{(-k)}}{\mathbbm{P}\big{(}A\big{)}}-\frac{A_{i}\psi^{*}}{\mathbbm{P}^{(k)}\big{(}A\big{)}}\bigg{\}}
=12[Pr(A=1)(A)Pr(A=1)(k)(A)]oP(1)1|k|1/2ikζ^i(k)+AiψPr(A=1)OP(1)\displaystyle=\underbrace{\frac{1}{2}\bigg{[}\frac{\Pr(A=1)}{\mathbbm{P}(A)}-\frac{\Pr(A=1)}{\mathbbm{P}^{(k)}(A)}\bigg{]}}_{o_{P}(1)}\underbrace{\frac{1}{\big{|}\mathcal{I}_{k}\big{|}^{1/2}}\sum_{i\in\mathcal{I}_{k}}\frac{\widehat{\zeta}_{i}^{(-k)}+A_{i}\psi^{*}}{\Pr(A=1)}}_{O_{P}(1)}
+12[Pr(A=1)(A)+Pr(A=1)(k)(A)]=1+oP(1)1|k|1/2ikζ^i(k)AiψPr(A=1)(AN)OP(1)\displaystyle\quad+\underbrace{\frac{1}{2}\bigg{[}\frac{\Pr(A=1)}{\mathbbm{P}(A)}+\frac{\Pr(A=1)}{\mathbbm{P}^{(k)}(A)}\bigg{]}}_{=1+o_{P}(1)}\underbrace{\frac{1}{\big{|}\mathcal{I}_{k}\big{|}^{1/2}}\sum_{i\in\mathcal{I}_{k}}\frac{\widehat{\zeta}_{i}^{(-k)}-A_{i}\psi^{*}}{\Pr(A=1)}}_{\text{\hypertarget{(AN)}{(AN)}: }O_{P}(1)}
=1|k|1/2ikζ^i(k)AiψPr(A=1)+oP(1).\displaystyle=\frac{1}{\big{|}\mathcal{I}_{k}\big{|}^{1/2}}\sum_{i\in\mathcal{I}_{k}}\frac{\widehat{\zeta}_{i}^{(-k)}-A_{i}\psi^{*}}{\Pr(A=1)}+o_{P}(1)\ .

The second line holds because (A)=Pr(A)+oP(1)\mathbbm{P}(A)=\Pr(A)+o_{P}(1) and (k)(A)=Pr(A=1)+oP(1)\mathbbm{P}^{(k)}(A)=\Pr(A=1)+o_{P}(1) from the law of large numbers, and ζ^i(k)\widehat{\zeta}_{i}^{(-k)} is bounded. The third row holds because term (AN) is asymptotically normal, which will be shown later.

Let ζ\zeta^{*} be the numerator of the uncentered influence function, i.e.,

ζ=A{Yb(W,X)}(1A)ω(Y,X){Yb(W,X)}.\displaystyle\zeta^{*}=A\big{\{}Y-b^{*}\big{(}W,X\big{)}\big{\}}-\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}\ .

Note that we have IF(O)=(ζAψ)/Pr(A=1)\texttt{IF}^{*}(O)=(\zeta^{*}-A\psi^{*})/\Pr(A=1).

Let 𝔾k(V)=|k|1/2ik{ViE(Vi)}\mathbbm{G}_{\mathcal{I}_{k}}(V)=|\mathcal{I}_{k}|^{-1/2}\sum_{i\in\mathcal{I}_{k}}\big{\{}V_{i}-E(V_{i})\big{\}} be the empirical process of ViV_{i} centered by E(Vi)E(V_{i}). Similarly, let 𝔾k(k)(V^(k))=|k|1/2ik{V^i(k)E(k)(V^(k))}\mathbbm{G}_{\mathcal{I}_{k}}^{(-k)}\big{(}\widehat{V}^{(-k)}\big{)}=|\mathcal{I}_{k}|^{-1/2}\sum_{i\in\mathcal{I}_{k}}\big{\{}\widehat{V}_{i}^{(-k)}-E^{(-k)}(\widehat{V}^{(-k)})\big{\}} be the empirical process of V^(k)\widehat{V}^{(-k)} centered by E(k)(V^(k))E^{(-k)}(\widehat{V}^{(-k)}) where E(k)()E^{(-k)}(\cdot) is the expectation after considering random functions obtained from kc\mathcal{I}_{k}^{c} as fixed functions. The empirical process of ζ^(k)Aψ\widehat{\zeta}^{(-k)}-A\psi^{*} is

|k|1/2ik{ζ^(k)Aψ}=\displaystyle\big{|}\mathcal{I}_{k}\big{|}^{-1/2}\sum_{i\in\mathcal{I}_{k}}\big{\{}\widehat{\zeta}^{(-k)}-A\psi^{*}\big{\}}= 𝔾k(ζAψ)\displaystyle\ \mathbbm{G}_{\mathcal{I}_{k}}\big{(}\zeta^{*}-A\psi^{*}\big{)} (33)
+|k|1/2E(k){ζ^(k)ζ}\displaystyle+\big{|}\mathcal{I}_{k}\big{|}^{1/2}\cdot E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}} (34)
+𝔾k(k)(ζ^(k)ζ),\displaystyle+\mathbbm{G}_{\mathcal{I}_{k}}^{(-k)}\big{(}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{)}\ , (35)

where

𝔾k(ζAψ)\displaystyle\mathbbm{G}_{\mathcal{I}_{k}}\big{(}\zeta^{*}-A\psi^{*}\big{)} =𝔾k[A{Yb(W,X)ψ}(1A)ω(Y,X){Yb(W,X)}]\displaystyle=\mathbbm{G}_{\mathcal{I}_{k}}\Big{[}A\big{\{}Y-b^{*}\big{(}W,X\big{)}-\psi^{*}\big{\}}-\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}\Big{]}
=|k|1/2ik{ζiPr(A=1)ψ}\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{-1/2}\sum_{i\in\mathcal{I}_{k}}\Big{\{}\zeta_{i}^{*}-\Pr(A=1)\psi^{*}\Big{\}}
𝔾k(k)(ζ^(k)ζ)\displaystyle\mathbbm{G}_{\mathcal{I}_{k}}^{(-k)}\big{(}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{)} =𝔾k(k)[A{Yb^(k)(W,X)}(1A)ω^(k)(Y,X){Yb^(k)(W,X)}A{Yb(W,X)}+(1A)ω(Y,X){Yb(W,X)}].\displaystyle=\mathbbm{G}_{\mathcal{I}_{k}}^{(-k)}\left[\begin{array}[]{l}A\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}-(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}\\ -A\big{\{}Y-b^{*}\big{(}W,X\big{)}\big{\}}+\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}\end{array}\right]\ .

From the derivation below, we find that (34) and (35) are oP(1)o_{P}(1), indicating that (33) is asymptotically normal, and consequently, (AN) is OP(1)O_{P}(1). Moreover, we establish (30), concluding the proof as follows:

|k|1/2{ψ^(k)ψ}\displaystyle\big{|}\mathcal{I}_{k}\big{|}^{1/2}\Big{\{}\widehat{\psi}^{(k)}-\psi^{*}\Big{\}} =|k|1/2ikζ^i(k)Aiψ0Pr(A=1)+oP(1)\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{-1/2}\sum_{i\in\mathcal{I}_{k}}\frac{\widehat{\zeta}_{i}^{(-k)}-A_{i}\psi_{0}^{*}}{\Pr(A=1)}+o_{P}(1)
=|k|1/2ikζiAiψPr(A=1)+oP(1)=|k|1/2ikIF(Oi)+oP(1).\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{-1/2}\sum_{i\in\mathcal{I}_{k}}\frac{\zeta_{i}^{*}-A_{i}\psi^{*}}{\Pr(A=1)}+o_{P}(1)=\big{|}\mathcal{I}_{k}\big{|}^{-1/2}\sum_{i\in\mathcal{I}_{k}}\texttt{IF}^{*}(O_{i})+o_{P}(1)\ .

Showing that (34) is oP(1)o_{P}(1)

First, we introduce the following equality:

E{Ag(W,X)}\displaystyle E\big{\{}Ag(W,X)\big{\}}
=E[Pr(A=1|Ya=0,X)E{g(W,X)|Ya=0,A=1,X}]\displaystyle=E\big{[}\Pr(A=1\,\big{|}\,Y^{a=0},X)E\big{\{}g(W,X)\,\big{|}\,Y^{a=0},A=1,X\big{\}}\big{]}
=E[E(1A|Ya=0,X)ω(Ya=0,X)E{g(W,X)|Ya=0,A=0,X}]\displaystyle=E\big{[}E(1-A\,\big{|}\,Y^{a=0},X)\omega^{*}(Y^{a=0},X)E\big{\{}g(W,X)\,\big{|}\,Y^{a=0},A=0,X\big{\}}\big{]}
=E{(1A)ω(Ya=0,X)g(W,X)}\displaystyle=E\big{\{}(1-A)\omega^{*}(Y^{a=0},X)g(W,X)\big{\}}
=E{(1A)ω(Y,X)g(W,X)}.\displaystyle=E\big{\{}(1-A)\omega^{*}(Y,X)g(W,X)\big{\}}\ . (36)

We find

|k|1/2E(k){ζ^(k)ζ}\displaystyle\Big{\|}\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}\Big{\|}
=|k|1/2E(k)[(1A)ω^(k)(Y,X){Yb^(k)(W,X)}+Ab^(k)(W,X)(1A)ω(Y,X){Yb(W,X)}Ab(W,X)]\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{1/2}\left\|E^{(-k)}\left[\begin{array}[]{l}(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}+A\widehat{b}^{(-k)}(W,X)\\ -\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}-Ab^{*}\big{(}W,X\big{)}\end{array}\right]\right\|
=|k|1/2E(k)[(1A)ω^(k)(Y,X){b(W,X)b^(k)(W,X)}+A{b^(k)(W,X)b(W,X)}]\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{1/2}\bigg{\|}E^{(-k)}\bigg{[}(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}+A\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}\big{(}W,X\big{)}\big{\}}\bigg{]}\bigg{\|}
=()|k|1/2E(k)[(1A)ω^(k)(Y,X){b(W,X)b^(k)(W,X)}+(1A)ω(Y,X){b^(k)(W,X)b(W,X)}]\displaystyle\stackrel{{\scriptstyle(\dagger)}}{{=}}\big{|}\mathcal{I}_{k}\big{|}^{1/2}\left\|E^{(-k)}\left[\begin{array}[]{l}(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}\\ +(1-A)\omega^{*}(Y,X)\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}\big{(}W,X\big{)}\big{\}}\end{array}\right]\right\|
=|k|1/2E(k)[(1A){ω^(k)(Y,X)ω(Y,X)}{b(W,X)b^(k)(W,X)}]\displaystyle=\big{|}\mathcal{I}_{k}\big{|}^{1/2}\bigg{\|}E^{(-k)}\bigg{[}(1-A)\Big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\Big{\}}\Big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\Big{\}}\bigg{]}\bigg{\|}

Equality ()(\dagger) holds from (36) by choosing g=bb^(k)g=b^{*}-\widehat{b}^{(-k)}. Therefore, |k|1/2E(k){ζ^(k)ζ}\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}} is upper bounded as

|k|1/2E(k){ζ^(k)ζ}\displaystyle\Big{\|}\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}\Big{\|}
|k|1/2E(k)[ω^(k)(Y,X)ω(Y,X)E(k)[(1A){b(W,X)b^(k)(W,X)}|Y,X]]\displaystyle\leq\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\bigg{[}\Big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\Big{\|}\cdot\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}\,\big{|}\,Y,X\big{]}\Big{\|}\bigg{]}
|k|1/2ω^(k)(Y,X)ω(Y,X)P,2E(k)[(1A){b(W,X)b^(k)(W,X)}|Y,X]P,2\displaystyle\leq\big{|}\mathcal{I}_{k}\big{|}^{1/2}\Big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\Big{\|}_{P,2}\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}\,\big{|}\,Y,X\big{]}\Big{\|}_{P,2}

and

|k|1/2E(k){ζ^(k)ζ}\displaystyle\Big{\|}\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}\Big{\|}
|k|1/2E(k)[b^(k)(W,X)b(W,X)E(k)[(1A){ω(Y,X)ω^(k)(Y,X)}|W,X]]\displaystyle\leq\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\bigg{[}\Big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\Big{\|}\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}\omega^{*}(Y,X)-\widehat{\omega}^{(-k)}(Y,X)\big{\}}\,\big{|}\,W,X\big{]}\Big{\|}\bigg{]}
|k|1/2b^(k)(W,X)b(W,X)P,2E(k)[(1A){ω(Y,X)ω^(k)(Y,X)}|W,X]P,2.\displaystyle\leq\big{|}\mathcal{I}_{k}\big{|}^{1/2}\Big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\Big{\|}_{P,2}\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}\omega^{*}(Y,X)-\widehat{\omega}^{(-k)}(Y,X)\big{\}}\,\big{|}\,W,X\big{]}\Big{\|}_{P,2}\ .

This concludes that

|k|1/2E(k){ζ^(k)ζ}\displaystyle\Big{\|}\big{|}\mathcal{I}_{k}\big{|}^{1/2}E^{(-k)}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}\Big{\|}
|k|1/2min[ω^(k)ωP,2E(k)[(1A){b(W,X)b^(k)(W,X)}|Y,X]P,2,b^(k)bP,2E(k)[(1A){ω(Y,X)ω^(k)(Y,X)}|W,X]P,2]\displaystyle\leq\big{|}\mathcal{I}_{k}\big{|}^{1/2}\cdot\min\left[\begin{array}[]{l}\Big{\|}\widehat{\omega}^{(-k)}-\omega^{*}\Big{\|}_{P,2}\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}b^{*}(W,X)-\widehat{b}^{(-k)}(W,X)\big{\}}\,\big{|}\,Y,X\big{]}\Big{\|}_{P,2},\\ \Big{\|}\widehat{b}^{(-k)}-b^{*}\Big{\|}_{P,2}\Big{\|}E^{(-k)}\big{[}(1-A)\big{\{}\omega^{*}(Y,X)-\widehat{\omega}^{(-k)}(Y,X)\big{\}}\,\big{|}\,W,X\big{]}\Big{\|}_{P,2}\end{array}\right]
=oP(1).\displaystyle=o_{P}(1)\ .

where the second inequality is from Assumption 5-(iii).


Showing that (35) is oP(1)o_{P}(1)

Next, we establish that Term (35) is oP(1)o_{P}(1). Suppose g(O)g(O) is a mean-zero function. Then, E(k){𝔾k(g)}=0E^{(-k)}\big{\{}\mathbbm{G}_{\mathcal{I}_{k}}(g)\big{\}}=0. Therefore, it suffices to show that Var(k){𝔾k(g)}=oP(1)\text{Var}^{(-k)}\big{\{}\mathbbm{G}_{\mathcal{I}_{k}}(g)\big{\}}=o_{P}(1), i.e.,

Var(k){𝔾k(g)}=Var(k){1|k|ikg(Oi)}=Var(k){g(O)}=E(k){g(O)2}=oP(1).\displaystyle\text{Var}^{(-k)}\big{\{}\mathbbm{G}_{\mathcal{I}_{k}}(g)\big{\}}=\text{Var}^{(-k)}\bigg{\{}\frac{1}{\sqrt{|\mathcal{I}_{k}|}}\sum_{i\in\mathcal{I}_{k}}g(O_{i})\bigg{\}}=\text{Var}^{(-k)}\big{\{}g(O)\big{\}}=E^{(-k)}\big{\{}g(O)^{2}\big{\}}=o_{P}(1)\ .

The variance of (35) is

Var(k){(35)}\displaystyle\text{Var}^{(-k)}\big{\{}\eqref{Term2}\big{\}}
=Var(k)[𝔾k[A{Yb^(k)(W,X)}(1A)ω^(k)(Y,X){Yb^(k)(W,X)}A{Yb(W,X)}+(1A)ω(Y,X){Yb(W,X)}]]\displaystyle=\text{Var}^{(-k)}\left[\mathbbm{G}_{\mathcal{I}_{k}}\left[\begin{array}[]{l}A\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}-(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}\\ -A\big{\{}Y-b^{*}\big{(}W,X\big{)}\big{\}}+\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}\end{array}\right]\right] (39)
=E(k)[[(1A)ω^(k)(Y,X){Yb^(k)(W,X)}+Ab^(k)(W,X)(1A)ω(Y,X){Yb(W,X)}Ab(W,X)]2]\displaystyle=E^{(-k)}\left[\left[\begin{array}[]{l}(1-A)\widehat{\omega}^{(-k)}(Y,X)\big{\{}Y-\widehat{b}^{(-k)}(W,X)\big{\}}+A\widehat{b}^{(-k)}(W,X)\\ -\big{(}1-A\big{)}\omega^{*}(Y,X)\left\{Y-b^{*}\big{(}W,X\big{)}\right\}-Ab^{*}\big{(}W,X\big{)}\end{array}\right]^{2}\right] (42)
()E(k)[[(1A){ω^(k)(Y,X)ω(Y,X)}Y]2+[(1A)ω^(k)(Y,X)b^(k)(W,X)(1A)ω(Y,X)b(W,X)]2+[A{b^(k)(W,X)b(W,X)}]2]\displaystyle\stackrel{{\scriptstyle(*)}}{{\precsim}}E^{(-k)}\left[\begin{array}[]{l}\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}Y\big{]}^{2}\\ +\big{[}(1-A)\widehat{\omega}^{(-k)}(Y,X)\widehat{b}^{(-k)}(W,X)-(1-A)\omega^{*}(Y,X)b^{*}(W,X)\big{]}^{2}\\ +\big{[}A\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\}}\big{]}^{2}\end{array}\right] (46)
()E(k)[[(1A){ω^(k)(Y,X)ω(Y,X)}Y]2+[(1A){ω^(k)(Y,X)ω(Y,X)}b^(k)(W,X)]2+[(1A)ω(Y,X){b^(k)(W,X)b(W,X)}]2+[A{b^(k)(W,X)b(W,X)}]2](Term-Q1)(Term-Q2)(Term-Q3)(Term-Q4)\displaystyle\stackrel{{\scriptstyle(*)}}{{\precsim}}E^{(-k)}\left[\begin{array}[]{l}\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}Y\big{]}^{2}\\ +\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}\widehat{b}^{(-k)}(W,X)\big{]}^{2}\\ +\big{[}(1-A)\omega^{*}(Y,X)\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\}}\big{]}^{2}\\ +\big{[}A\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\}}\big{]}^{2}\end{array}\right]\quad\begin{array}[]{l}\text{\hypertarget{(Term-Q1)}{(Term-Q1)}}\\ \text{\hypertarget{(Term-Q2)}{(Term-Q2)}}\\ \text{\hypertarget{(Term-Q3)}{(Term-Q3)}}\\ \text{\hypertarget{(Term-Q4)}{(Term-Q4)}}\end{array} (55)
()ω^(k)(Y,X)ω(Y,X)P,22+b^(k)(Y,X)b(Y,X)P,22\displaystyle\stackrel{{\scriptstyle(\dagger)}}{{\precsim}}\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}_{P,2}^{2}+\big{\|}\widehat{b}^{(-k)}(Y,X)-b^{*}(Y,X)\big{\|}_{P,2}^{2}
=oP(1).\displaystyle=o_{P}(1)\ . (56)

Inequality ()(*) holds from the Cauchy-Schwartz inequality (j=1Maj)2M(j=1Maj2)(\sum_{j=1}^{M}a_{j})^{2}\leq M(\sum_{j=1}^{M}a_{j}^{2}). Inequality ()(\dagger) is established in items (a) and (b) below. The last convergence rate is obtained from Assumption 5.

  • (a)

    (Term-Q1)

    (Term-Q1) =E(k)[[(1A){ω^(k)(Y,X)ω(Y,X)}Y]2]\displaystyle=E^{(-k)}\big{[}\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}Y\big{]}^{2}\big{]}
    =E(k)[(1A){ω^(k)(Y,X)ω(Y,X)}2Y2]\displaystyle=E^{(-k)}\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}^{2}Y^{2}\big{]}
    =E(k)[Pr(A=0|Y,X){ω^(k)(Y,X)ω(Y,X)}2Y2]\displaystyle=E^{(-k)}\big{[}\Pr(A=0\,\big{|}\,Y,X)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}^{2}Y^{2}\big{]}
    ()E(k){ω^(k)(Y,X)ω(Y,X)Y2}\displaystyle\stackrel{{\scriptstyle(*)}}{{\precsim}}E^{(-k)}\big{\{}\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}\cdot Y^{2}\big{\}}
    ()E(k){ω^(k)(Y,X)ω(Y,X)2}E(k)(Y4)\displaystyle\stackrel{{\scriptstyle(\dagger)}}{{\precsim}}E^{(-k)}\big{\{}\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}^{2}\big{\}}\cdot E^{(-k)}\big{(}Y^{4}\big{)}
    ω^(k)(Y,X)ω(Y,X)P,22.\displaystyle\precsim\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}_{P,2}^{2}\ .

    Inequality ()(*) holds from Assumption 5, which implies Pr(A=0|Y,X)[c,1c]\Pr(A=0\,\big{|}\,Y,X)\in[c,1-c] for c>0c>0 and ω^(k)(Y,X)ω(Y,X)2C\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}_{\infty}\leq 2C. Inequality ()(\dagger) holds from the Cauchy-Schwartz inequality.

  • (b)

    (Term-Q2)-(Term-Q4)

    (Term-Q2) =E(k)[[(1A){ω^(k)(Y,X)ω(Y,X)}b^(k)(W,X)]2]\displaystyle=E^{(-k)}\big{[}\big{[}(1-A)\big{\{}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\}}\widehat{b}^{(-k)}(W,X)\big{]}^{2}\big{]}
    E(k){ω^(k)(Y,X)ω(Y,X)2}\displaystyle\precsim E^{(-k)}\big{\{}\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}^{2}\big{\}}
    ω^(k)(Y,X)ω(Y,X)P,22.\displaystyle\precsim\big{\|}\widehat{\omega}^{(-k)}(Y,X)-\omega^{*}(Y,X)\big{\|}_{P,2}^{2}\ .
    (Term-Q3) =E(k)[[(1A)ω(Y,X){b^(k)(W,X)b(W,X)}]2]\displaystyle=E^{(-k)}\big{[}\big{[}(1-A)\omega^{*}(Y,X)\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\}}\big{]}^{2}\big{]}
    E(k){b^(k)(W,X)b(W,X)2}\displaystyle\precsim E^{(-k)}\big{\{}\big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\|}^{2}\big{\}}
    b^(k)(W,X)b(W,X)P,22.\displaystyle\precsim\big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\|}_{P,2}^{2}\ .
    (Term-Q4) =E(k)[[A{b^(k)(W,X)b(W,X)}]2]\displaystyle=E^{(-k)}\big{[}\big{[}A\big{\{}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\}}\big{]}^{2}\big{]}
    E(k){b^(k)(W,X)b(W,X)2}\displaystyle\precsim E^{(-k)}\big{\{}\big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\|}^{2}\big{\}}
    b^(k)(W,X)b(W,X)P,22.\displaystyle\precsim\big{\|}\widehat{b}^{(-k)}(W,X)-b^{*}(W,X)\big{\|}_{P,2}^{2}\ .

Consistency of the Variance Estimator

The proposed variance estimator is

σ^2=1Kk=1Kσ^2,(k),\displaystyle\widehat{\sigma}^{2}=\frac{1}{K}\sum_{k=1}^{K}\widehat{\sigma}^{2,(k)}\ ,\ σ^2,(k)=1|k|ik{ζ^i(k)Aiψ^(k)(A)}2,\displaystyle\widehat{\sigma}^{2,(k)}=\frac{1}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\Bigg{\{}\frac{\widehat{\zeta}_{i}^{(-k)}-A_{i}\widehat{\psi}^{(-k)}}{\mathbbm{P}(A)}\Bigg{\}}^{2}\ ,

where ζ^(k)\widehat{\zeta}^{(-k)} is defined in (32). Therefore, it suffices to show that σ^2,(k)=σ2+oP(1)\widehat{\sigma}^{2,(k)}=\sigma^{2}+o_{P}(1). Note that σ^2,(k)σ2\widehat{\sigma}^{2,(k)}-\sigma^{2} is

σ^2,(k)σ2\displaystyle\widehat{\sigma}^{2,(k)}-\sigma^{2} ={(A)}2(k)[{ζ^(k)Aψ^}2]σ2\displaystyle=\big{\{}\mathbbm{P}(A)\big{\}}^{-2}\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\zeta}^{(-k)}-A\widehat{\psi}\big{\}}^{2}\Big{]}-\sigma^{2}
={Pr(A=1)}2(k)[{ζ^(k)Aψ^}2]σ2+oP(1)\displaystyle=\big{\{}\Pr(A=1)\big{\}}^{-2}\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\zeta}^{(-k)}-A\widehat{\psi}\big{\}}^{2}\Big{]}-\sigma^{2}+o_{P}(1)
={Pr(A=1)}2[(k)[{ζ^(k)Aψ^}2](k){(ζAψ)2}]\displaystyle=\big{\{}\Pr(A=1)\big{\}}^{-2}\Big{[}\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\zeta}^{(-k)}-A\widehat{\psi}\big{\}}^{2}\Big{]}-\mathbbm{P}^{(k)}\Big{\{}\big{(}\zeta^{*}-A\psi^{*}\big{)}^{2}\Big{\}}\Big{]}
+[{Pr(A=1)}2(k){(ζAψ)2}σ2]+oP(1)\displaystyle\hskip 56.9055pt+\Big{[}\big{\{}\Pr(A=1)\big{\}}^{-2}\mathbbm{P}^{(k)}\Big{\{}\big{(}\zeta^{*}-A\psi^{*}\big{)}^{2}\Big{\}}-\sigma^{2}\Big{]}+o_{P}(1)
={Pr(A=1)}2[(k)[{ζ^(k)Aψ^}2](k){(ζAψ)2}]+oP(1).\displaystyle=\big{\{}\Pr(A=1)\big{\}}^{-2}\Big{[}\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\zeta}^{(-k)}-A\widehat{\psi}\big{\}}^{2}\Big{]}-\mathbbm{P}^{(k)}\Big{\{}\big{(}\zeta^{*}-A\psi^{*}\big{)}^{2}\Big{\}}\Big{]}+o_{P}(1)\ . (57)

The third and fifth lines hold from the law of large numbers. Therefore, it is sufficient to show that (57) is also oP(1)o_{P}(1). From some algebra, we find the term in (57) is

1|k|1ik[{ζ^i(k)Aψ^}2{ζiAψ}2]\displaystyle\frac{1}{|\mathcal{I}_{k}|^{-1}}\sum_{i\in\mathcal{I}_{k}}\Big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}^{2}-\big{\{}\zeta_{i}^{*}-A\psi^{*}\big{\}}^{2}\Big{]}
=1|k|ik[{ζ^i(k)Aψ^}{ζiAψ}][{ζ^i(k)Aψ^}+{ζiAψ}]\displaystyle=\frac{1}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\big{]}\big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}+\big{\{}\zeta_{i}^{*}-A\psi^{*}\big{\}}\big{]}
=1|k|ik[{ζ^i(k)Aψ^}{ζiAψ}][{ζ^i(k)Aψ^}{ζiAψ}+2{ζiAψ}]\displaystyle=\frac{1}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\big{]}\big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}+2\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\big{]}
=1|k|ik[{ζ^i(k)Aψ^}{ζiAψ}]2+2|k|ik[[{ζ^i(k)Aψ^}{ζiAψ}]{ζiAψ}].\displaystyle=\frac{1}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\Big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\Big{]}^{2}+\frac{2}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\Big{[}\big{[}\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\big{]}\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}\Big{]}\ .

Let δ^i(k)={ζ^i(k)Aψ^}{ζiAψ}\widehat{\delta}_{i}^{(-k)}=\big{\{}\widehat{\zeta}_{i}^{(-k)}-A\widehat{\psi}\big{\}}-\big{\{}{\zeta}_{i}^{*}-A\psi^{*}\big{\}}. From the Hölder’s inequality, we find the absolute value of (57) is upper bounded by

(57)\displaystyle\big{\|}\eqref{eq-variance1}\big{\|} (k)[{δ^(k)}2]+2(k)[{δ^(k)}2](k){(ζAψ)2}.\displaystyle\precsim\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\delta}^{(-k)}\big{\}}^{2}\Big{]}+2\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\delta}^{(-k)}\big{\}}^{2}\Big{]}\cdot\mathbbm{P}^{(k)}\Big{\{}\big{(}\zeta^{*}-A\psi^{*}\big{)}^{2}\Big{\}}\ .

Since (k){(ζAψ)2}=Pr(A=1)2σ2+oP(1)=OP(1)\mathbbm{P}^{(k)}\big{\{}\big{(}\zeta^{*}-A\psi^{*}\big{)}^{2}\big{\}}=\Pr(A=1)^{2}\sigma^{2}+o_{P}(1)=O_{P}(1), (57) is oP(1)o_{P}(1) if (k)[{δ^(k)}2]=oP(1)\mathbbm{P}^{(k)}\big{[}\big{\{}\widehat{\delta}^{(-k)}\big{\}}^{2}\big{]}=o_{P}(1). From some algebra, we find

(k)[{δ^(k)}2]\displaystyle\mathbbm{P}^{(k)}\Big{[}\big{\{}\widehat{\delta}^{(-k)}\big{\}}^{2}\Big{]} 2|k|ik{ζ^i(k)ζi}2+2(ψ^ψ)2\displaystyle\leq\frac{2}{|\mathcal{I}_{k}|}\sum_{i\in\mathcal{I}_{k}}\big{\{}\widehat{\zeta}_{i}^{(-k)}-\zeta_{i}^{*}\big{\}}^{2}+2\big{(}\widehat{\psi}-\psi^{*}\big{)}^{2}
=E(k)[{ζ^(k)ζ}2]+oP(1)+2(ψ^ψ)2=oP(1).\displaystyle=E^{(-k)}\Big{[}\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}^{2}\Big{]}+o_{P}(1)+2\big{(}\widehat{\psi}-\psi^{*}\big{)}^{2}=o_{P}(1)\ .

The first line holds from (V1+AV2)22V12+2V22(V_{1}+AV_{2})^{2}\leq 2V_{1}^{2}+2V_{2}^{2}. The second line holds from the law of large numbers applied to {ζ^(k)ζ}2\big{\{}\widehat{\zeta}^{(-k)}-\zeta^{*}\big{\}}^{2}. The last line holds from (56) and ψ^=ψ+oP(1)\widehat{\psi}=\psi^{*}+o_{P}(1), which is from the asymptotic normality of the estimator. This concludes the proof.

References

  • Angrist and Pischke (2009) Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An empiricist’s companion. Princeton university press.
  • Athey and Imbens (2006) Athey, S. and Imbens, G. W. (2006). Identification and inference in nonlinear difference-in-differences models. Econometrica, 74(2):431–497.
  • Bang and Robins (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Caniglia and Murray (2020) Caniglia, E. C. and Murray, E. J. (2020). Difference-in-difference in the time of cholera: a gentle introduction for epidemiologists. Current Epidemiology Reports, 7(4):203–211.
  • Card and Krueger (1994) Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in new jersey and pennsylvania. The American Economic Review, 84(4):772–793.
  • Carrasco et al. (2007) Carrasco, M., Florens, J.-P., and Renault, E. (2007). Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. In Heckman, J. J. and Leamer, E. E., editors, Handbook of Econometrics, volume 6, pages 5633–5751. Elsevier.
  • Castro et al. (2018) Castro, M. C., Han, Q. C., Carvalho, L. R., Victora, C. G., and França, G. V. A. (2018). Implications of Zika virus and congenital Zika syndrome for the number of live births in Brazil. Proceedings of the National Academy of Sciences, 115(24):6177–6182.
  • Chaussé (2010) Chaussé, P. (2010). Computing generalized method of moments and generalized empirical likelihood with R. Journal of Statistical Software, 34(11):1–35.
  • Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Cui et al. (2023) Cui, Y., Pu, H., Shi, X., Miao, W., and Tchetgen Tchetgen, E. (2023). Semiparametric proximal causal inference. Journal of the American Statistical Association, 0(0):1–12.
  • Diaz-Quijano et al. (2018) Diaz-Quijano, F. A., Pelissari, D. M., and Chiavegatto Filho, A. D. P. (2018). Zika-associated microcephaly epidemic and birth rate reduction in Brazilian cities. American Journal of Public Health, 108(4):514–516.
  • Dikkala et al. (2020) Dikkala, N., Lewis, G., Mackey, L., and Syrgkanis, V. (2020). Minimax estimation of conditional moment models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
  • Dukes et al. (2023) Dukes, O., Shpitser, I., and Tchetgen Tchetgen, E. J. (2023). Proximal mediation analysis. Biometrika. asad015.
  • d’Haultfoeuille (2010) d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1):1–15.
  • Ghassami et al. (2022) Ghassami, A., Ying, A., Shpitser, I., and Tchetgen Tchetgen, E. (2022). Minimax kernel machine learning for a class of doubly robust functionals with application to proximal causal inference. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I., editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 7210–7239. PMLR.
  • Gregianini et al. (2017) Gregianini, T. S., Ranieri, T., Favreto, C., Nunes, Z. M. A., Tumioto Giannini, G. L., Sanberg, N. D., da Rosa, M. T. M., and da Veiga, A. B. G. (2017). Emerging arboviruses in Rio Grande do Sul, Brazil: Chikungunya and Zika outbreaks, 2014-2016. Reviews in Medical Virology, 27(6):e1943.
  • Hansen (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, pages 1029–1054.
  • Kimeldorf and Wahba (1970) Kimeldorf, G. S. and Wahba, G. (1970). A correspondence between bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41(2):495–502.
  • Kott (2014) Kott, P. S. (2014). Calibration weighting when model and calibration variables can differ. Contributions to sampling statistics, pages 1–18.
  • Kress (2014) Kress, R. (2014). Linear Integral Equations. Springer, 3 edition.
  • Lechner (2011) Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends® in Econometrics, 4(3):165–224.
  • Li et al. (2023) Li, W., Miao, W., and Tchetgen Tchetgen, E. (2023). Non-parametric inference about mean functionals of non-ignorable non-response data without identifying the joint distribution. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):913–935.
  • Lipsitch et al. (2010) Lipsitch, M., Tchetgen Tchetgen, E., and Cohen, T. (2010). Negative controls: A tool for detecting confounding and bias in observational studies. Epidemiology, 21(3):383.
  • Lowe et al. (2018) Lowe, R., Barcellos, C., Brasil, P., Cruz, O. G., Honório, N. A., Kuper, H., and Carvalho, M. S. (2018). The Zika virus epidemic in brazil: From discovery to future implications. International journal of environmental research and public health, 15(1):96.
  • Lunceford and Davidian (2004) Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960.
  • Mastouri et al. (2021) Mastouri, A., Zhu, Y., Gultchin, L., Korba, A., Silva, R., Kusner, M., Gretton, A., and Muandet, K. (2021). Proximal causal learning with kernels: Two-stage estimation and moment restriction. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7512–7523. PMLR.
  • Miao et al. (2016) Miao, W., Geng, Z., and Tchetgen Tchetgen, E. (2016). Identifying causal effects with proxy variables of an unmeasured confounder. Preprint arXiv:1609.08816.
  • Miao et al. (2018) Miao, W., Geng, Z., and Tchetgen Tchetgen, E. J. (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993.
  • Miao et al. (2023) Miao, W., Liu, L., Li, Y., Tchetgen Tchetgen, E. J., and Geng, Z. (2023). Identification and semiparametric efficiency theory of nonignorable missing data with a shadow variable. ACM/IMS Journal of Data Science. In press.
  • Miao and Tchetgen Tchetgen (2016) Miao, W. and Tchetgen Tchetgen, E. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2):475–482.
  • Muandet et al. (2020) Muandet, K., Jitkrittum, W., and Kübler, J. (2020). Kernel conditional moment test via maximum moment restriction. In Peters, J. and Sontag, D., editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 41–50. PMLR.
  • Newey (1990) Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135.
  • Newey and McFadden (1994) Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Engle, R. F. and McFadden, D. L., editors, Handbook of Econometrics, volume 4, pages 2111 – 2245. Elsevier, New York.
  • Rasmussen et al. (2016) Rasmussen, S. A., Jamieson, D. J., Honein, M. A., and Petersen, L. R. (2016). Zika virus and birth defects–Reviewing the evidence for causality. New England Journal of Medicine, 374(20):1981–1987.
  • Rosenbaum (1989) Rosenbaum, P. R. (1989). The role of known effects in observational studies. Biometrics, pages 557–569.
  • Rotnitzky et al. (2020) Rotnitzky, A., Smucler, E., and Robins, J. M. (2020). Characterization of parameters with a mixed bias property. Biometrika, 108(1):231–238.
  • Scharfstein et al. (1999) Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120.
  • Schick (1986) Schick, A. (1986). On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, 14(3):1139–1151.
  • Schölkopf et al. (2001) Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In Helmbold, D. and Williamson, B., editors, Computational Learning Theory, pages 416–426. Springer, Berlin, Heidelberg.
  • Shi et al. (2020) Shi, X., Miao, W., and Tchetgen Tchetgen, E. (2020). A selective review of negative control methods in epidemiology. Current epidemiology reports, 7:190–202.
  • Sofer et al. (2016) Sofer, T., Richardson, D. B., Colicino, E., Schwartz, J., and Tchetgen Tchetgen, E. J. (2016). On negative outcome control of unobserved confounding as a generalization of difference-in-differences. Statistical Science, 31(3):348 – 361.
  • Steinwart and Christmann (2008) Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer-Verlag, New York.
  • Taddeo et al. (2022) Taddeo, M. M., Amorim, L. D., and Aquino, R. (2022). Causal measures using generalized difference-in-difference approach with nonlinear models. Statistics and Its Interface, 15(4):399–413.
  • Tchetgen Tchetgen (2013) Tchetgen Tchetgen, E. (2013). The Control Outcome Calibration Approach for Causal Inference With Unobserved Confounding. American Journal of Epidemiology, 179(5):633–640.
  • Tchetgen Tchetgen et al. (2024) Tchetgen Tchetgen, E. J., Park, C., and Richardson, D. B. (2024). Universal difference-in-differences for causal inference in epidemiology. Epidemiology, 35(1):16–22.
  • Tchetgen Tchetgen et al. (2024) Tchetgen Tchetgen, E. J., Ying, A., Cui, Y., Shi, X., and Miao, W. (2024). An introduction to proximal causal inference. Statistical Science. to appear.
  • van der Vaart and Wellner (1996) van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
  • Wang et al. (2014) Wang, S., Shao, J., and Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3):1097–1116.
  • Ying et al. (2023) Ying, A., Miao, W., Shi, X., and Tchetgen Tchetgen, E. J. (2023). Proximal causal inference for complex longitudinal studies. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):684–704.
  • Zahner et al. (1992) Zahner, G. E., Pawelkiewicz, W., DeFrancesco, J. J., and Adnopoz, J. (1992). Children’s mental health service needs and utilization patterns in an urban community: An epidemiological assessment. Journal of the American Academy of Child & Adolescent Psychiatry, 31(5):951–960.
  • Zhang et al. (2023) Zhang, R., Imaizumi, M., Schölkopf, B., and Muandet, K. (2023). Instrumental variable regression via kernel maximum moment loss. Journal of Causal Inference, 11(1):20220073.