
footnotetext: Email: [email protected]

Causal Inference with Unmeasured Confounding from Nonignorable Missing Outcomes

Renzhong Zheng
School of Statistics, Beijing Normal University, Beijing, China
(May, 2023)
Abstract

Observational studies are the primary source of data for causal inference, but causal inference from observational data is challenging in the presence of unmeasured confounding. Missing data problems are also common in observational studies, and obtaining causal effects from nonignorable missing data with unmeasured confounding is particularly difficult. In this paper, we consider how to obtain the complier average causal effect with unmeasured confounding when the outcomes are nonignorably missing. We propose an auxiliary variable that plays two roles simultaneously: it serves as a shadow variable for identification and as an instrumental variable for inference. We also illustrate the differences between the missing outcome mechanisms used in previous work and the shadow variable assumption, and give a causal diagram to illustrate this setting. Under such a setting, we present a general condition for nonparametric identification of the full data law from the nonignorable missing outcomes with this auxiliary variable. For inference, we first recover the mean value of the outcome based on the generalized method of moments, and then propose an estimator that adjusts for the unmeasured confounding to obtain the complier average causal effect. We also establish asymptotic results for the estimated parameters. We evaluate the performance of the method via simulations and apply it to a real-life dataset from a political analysis.


Keywords: Complier average causal effect; Generalized method of moments; Instrumental variable; Missing not at random; Principal Stratification; Shadow variable.

1 Introduction

Causal inference plays an important role in many fields, such as epidemiology, economics and social sciences. Observational studies are the primary source of data for causal inference. In observational studies, if all the confounders are measured, one can use standard methods to adjust for the confounding bias, such as stratification and matching (Rubin,, 1973; Stuart,, 2010), the propensity score (Rosenbaum and Rubin,, 1983), inverse probability weighting (Horvitz and Thompson,, 1952), outcome regression-based estimation (Rubin,, 1979), and doubly robust estimation (Bang and Robins,, 2005). However, unmeasured confounding is present in most observational studies. In this case, one cannot identify the causal effects from the observational data without additional assumptions. Auxiliary variables are often used to identify causal effects while adjusting for unmeasured confounding. The instrumental variable (IV) is one such auxiliary variable. There are two frameworks used in IV analysis: one is the structural equation model (Wright,, 1928; Goldberger,, 1972); the other relies on the monotonicity assumption, which requires a monotone effect of the IV on the treatment. Under the monotonicity assumption, an IV can be used to identify and estimate the complier average causal effect (Imbens and Angrist,, 1994; Angrist and Imbens,, 1995; Angrist et al.,, 1996). More details about the two frameworks of the instrumental variable can be found in Wang and Tchetgen Tchetgen, (2018).

Missing data is also a common problem in observational studies. According to Rubin, (1976), the missing mechanism is called missing at random (MAR) or ignorable if it is independent of the missing values conditional on the observed data, which means the missing mechanism depends only on the observed data. Otherwise, it is called missing not at random (MNAR) or nonignorable (Little and Rubin,, 2002). Compared to MAR, MNAR is much more challenging. Identification is generally not available under MNAR without additional assumptions (Robins and Ritov,, 1997). In previous work, authors made sufficiently strong parametric assumptions about the full data law to ensure the validity of the identification (Wu and Carroll,, 1988; Little and Rubin,, 2002; Roy,, 2003). However, as noted by Miao et al., (2016) and Wang et al., (2014), the identification of fully parametric models can even fail under MNAR.

A new framework for identification and semiparametric inference was recently proposed by Miao and Tchetgen Tchetgen, (2016), Miao and Tchetgen, (2018), and Miao et al., (2019), building on earlier work by d’Haultfoeuille, (2010), Kott, (2014), Wang et al., (2014), and Zhao and Shao, (2015), which studied identification of several parametric and semiparametric models. A shadow variable is associated with the outcome that is prone to missingness, but is independent of the missing mechanism conditional on the treatment and the possibly unobserved outcome. In the context of missing covariate data, Miao and Tchetgen, (2018) studied identification of generalized linear models and some semiparametric models, and then proposed an inverse probability weighted estimator that incorporates the shadow variable to guarantee unbiased estimation. In the context of missing outcome data, Miao and Tchetgen Tchetgen, (2016) considered the identification of a location-scale model and described a doubly robust estimator. If a shadow variable is fully observed, it can be used, together with the fully observed covariates, to recover the distribution of the unobserved outcome that is missing not at random. Miao et al., (2019) used a valid shadow variable to establish nonparametric identification of the full data distribution under nonignorable missingness, and developed the semiparametric theory for some semiparametric estimators with a shadow variable under MNAR.

It is highly necessary to combine causal inference with missing data research (Ding and Li,, 2018). In the case of covariates missing not at random, Ding and Geng, (2014) showed the identification of the causal effects for four interpretable missing data mechanisms and proposed upper and lower bounds for the causal effects. When the unmeasured confounders are missing not at random, Yang et al., (2019) generalized the results in Ding and Geng, (2014) to establish a general condition for identification of the causal effects, and further developed parametric and nonparametric inference for the causal effects. With nonignorable missing outcomes, the identification and estimation of the causal effects is more challenging, and different types of missing mechanisms have been imposed to guarantee identification and estimation under the principal stratification framework. Frangakis and Rubin, (1999) proposed the latent ignorable (LI) missing data mechanism to guarantee identification and estimated the complier average causal effect (CACE). Under the LI missing data mechanism, O’Malley and Normand, (2005) proposed two methods based on moments and likelihood to estimate CACE for normally distributed outcomes. Zhou and Li, (2006) proposed both moment and maximum likelihood estimators of CACE for binary outcomes. Other authors proposed a different missing data mechanism, called the outcome-dependent nonignorable (ODN) missing data mechanism. Chen et al., (2009) and Imai, (2009) established the identification of CACE for discrete outcomes based on the likelihood method under the outcome-dependent nonignorable missing data mechanism. Chen et al., (2015) proposed an exponential family assumption about the conditional density of the outcome variable to establish semiparametric inference of CACE for continuous outcomes under the ODN missing data mechanism. Li and Zhou, (2017) established identification with a shadow variable and estimation of causal mediation effects when the outcomes are MNAR.

In this paper, we obtain CACE with unmeasured confounding from nonignorable missing outcomes. Different from the previous work, we impose no assumption on the missing outcome mechanism. Instead, we impose a shadow variable assumption on the instrumental variable to guarantee identification and estimate CACE. As shown in Figure 3, we allow an arrow from the treatment to the missing indicator, which is more reasonable in some applications. In Section 2, we illustrate some basic assumptions in causal inference. We also illustrate the difference between some missing data mechanisms and the shadow variable assumption, and then demonstrate the framework of this paper with an example. In Section 3, we use a shadow variable to identify the full data distribution nonparametrically under a certain completeness condition. In Section 4, we apply the generalized method of moments to estimate the missing mechanism. We also establish some asymptotic results for the estimated parameters in the missing mechanism. Then we recover the mean value of the outcome and propose an estimator of CACE. In Section 5, we conduct simulation studies to evaluate the performance of the parameter estimators in the missing mechanism and of the estimator of CACE. In Section 6, we use a real dataset from a political analysis to illustrate our approach. In Section 7, we conclude with some discussions. All proofs are provided in the Supplementary Materials.

2 Notation and Assumptions

2.1 Potential outcomes, causal effects, basic assumptions and instrumental variable

In this paper, we consider the situation where {(yi,ri,ai,zi):i=1,,n}\left\{\left(y_{i},r_{i},a_{i},z_{i}\right):i=1,\ldots,n\right\} is an independent and identically distributed sample from (Y,R,A,Z)\left(Y,R,A,Z\right). Vectors are assumed to be column vectors, unless explicitly transposed.

YY is the outcome of interest subject to missingness. We consider the outcome YY to be a binary variable. We let RR denote the missing indicator of YY: R=1R=1 if YY is observed and R=0R=0 otherwise. The observed data include (A,Z)\left(A,Z\right) for all samples and YY only for those with R=1R=1. We use lower-case letters for realized values of the corresponding variables, for example, yy for a value of the outcome variable YY. We use ff to denote a probability density or mass function. Suppose the observed data consist of nn independent and identically distributed samples.

We use the potential outcome model to define causal effects (Neyman,, 1923; Rubin,, 1974). For each unit ii, we consider ZZ to be a binary instrumental variable, where Zi=1Z_{i}=1 indicates that unit ii is assigned to the treatment group and Zi=0Z_{i}=0 indicates that unit ii is assigned to the control group. Let Ai(z)A_{i}(z) denote the potential treatment received by unit ii had the instrumental variable ZZ been set to level zz. Ai(z)=1A_{i}(z)=1 indicates that unit ii would receive the treatment if assigned zz, and Ai(z)=0A_{i}(z)=0 indicates that unit ii would receive the control if assigned zz. Similar to the definition of Ai(z)A_{i}(z), we let Yi(z,Ai(z))Y_{i}(z,A_{i}(z)) denote the potential outcome for unit ii if exposed to treatment Ai(z)A_{i}(z) after ZZ was set to level zz. For simplicity, we let Yi(z)Y_{i}(z) and Ri(z)R_{i}(z) denote the potential outcome and the potential missing indicator after ZZ was set to level zz, respectively. Let ZiZ_{i}, AiA_{i}, YiY_{i}, RiR_{i} denote their observed values for i=1,,ni=1,\dots,n.

The following assumption is standard in causal inference with observational studies (Rubin,, 1980; Angrist et al.,, 1996).

Assumption 1.

Stable Unit Treatment Value Assumption (SUTVA): There is no interference between units, meaning that a unit’s potential outcomes cannot be affected by the treatment status of other units, and there is only one version of the potential outcome for a given treatment.

Under the SUTVA assumption, the potential outcomes are well defined. When ZiAiZ_{i}\neq A_{i}, noncompliance occurs. Under the principal stratification framework (Angrist et al.,, 1996; Frangakis and Rubin,, 2002), we define UiU_{i} as the compliance status variable of unit ii:

Ui={at, if Ai(1)=1 and Ai(0)=1;cp, if Ai(1)=1 and Ai(0)=0;df, if Ai(1)=0 and Ai(0)=1;nt, if Ai(1)=0 and Ai(0)=0.U_{i}=\begin{cases}at,&\text{ if }A_{i}(1)=1\text{ and }A_{i}(0)=1;\\ cp,&\text{ if }A_{i}(1)=1\text{ and }A_{i}(0)=0;\\ df,&\text{ if }A_{i}(1)=0\text{ and }A_{i}(0)=1;\\ nt,&\text{ if }A_{i}(1)=0\text{ and }A_{i}(0)=0.\end{cases}

Because Ai(1)A_{i}(1) and Ai(0)A_{i}(0) can each take two values, the compliance status variable UiU_{i} has four different values: ntnt for never takers, atat for always takers, cpcp for compliers, and dfdf for defiers. Because we cannot observe Ai(1)A_{i}(1) and Ai(0)A_{i}(0) jointly, the compliance behavior of a unit is unknown, so UiU_{i} is an unobserved variable and it can be viewed as an unmeasured confounder.

Definition 1.

The Complier Average causal effect (CACE) is defined as CACE=E[Yi(1)Yi(0)Ui=cp]CACE=E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp\right].

Because Zi=AiZ_{i}=A_{i} for the compliers, CACE is a subgroup causal effect for the compliers, whose compliance status is incompletely observed.

Assumption 2.

Randomization: (Y(0),Y(1),A(0),A(1))(Y(0),Y(1),A(0),A(1)) is independent of ZZ.

Randomization means that Z(Y(0),Y(1),A(0),A(1))Z\perp\!\!\!\perp(Y(0),Y(1),A(0),A(1)).

Assumption 3.

Monotonicity: Ai(1)Ai(0)A_{i}(1)\geq A_{i}(0) for each unit ii.

The monotonicity assumption means that there are no defiers in the population. In some studies, the monotonicity assumption is plausible when the treatment assignment has a nonnegative effect on the treatment received for each unit.

Assumption 4.

Nonzero average causal effect of ZZ on AA: The average causal effect of ZZ on AA, E[Ai(1)Ai(0)]E\left[A_{i}(1)-A_{i}(0)\right], is not equal to zero.

Assumption 5.

Exclusion restrictions among never takers and always takers: Yi(1)=Yi(0)Y_{i}(1)=Y_{i}(0) if Ui=ntU_{i}=nt and Yi(1)=Yi(0)Y_{i}(1)=Y_{i}(0) if Ui=atU_{i}=at.

Assumption 5 means that the instrumental variable ZZ affects the outcome only through the treatment and has no direct effect on the outcome.

If the outcome YY is fully observed, CACE can be identified and estimated following Angrist et al., (1996). With nonignorable missing outcomes, some assumptions about the missing data mechanism are needed to estimate CACE. We compare several such assumptions in the next subsection.

2.2 Some Missing data mechanisms of the outcomes

Before illustrating the missing data mechanisms, note that Assumption 2 and Assumption 5 are replaced by Assumption 6 and Assumption 7, respectively, in some previous work. Assumption 6 and Assumption 7 are usually combined with the missing outcome mechanisms in the previous work. However, we only need Assumption 2 and Assumption 5 to estimate CACE in this paper, which are weaker versions of Assumption 6 and Assumption 7.

Assumption 6.

Complete Randomization: The treatment assignment ZZ is completely randomized.

Complete Randomization means that Z{A(1),A(0),Y(1),Y(0),R(1),R(0)}Z\perp\!\!\!\perp\{A(1),A(0),Y(1),Y(0),R(1),R(0)\}.

Assumption 7.

Compound exclusion restrictions: For never takers and always takers, we assume that f{Y(1),R(1)U=nt}=f{Y(0),R(0)U=nt}f\{Y(1),R(1)\mid U=nt\}=f\{Y(0),R(0)\mid U=nt\}, and f{Y(1),R(1)U=at}=f{Y(0),R(0)U=at}f\{Y(1),R(1)\mid U=at\}=f\{Y(0),R(0)\mid U=at\}.

Different from traditional Assumption 5, Frangakis and Rubin, (1999) extended it to the compound exclusion restrictions. Assumption 7 is stronger than Assumption 5. Under Assumption 6, Assumption 7 is equivalent to f(Y,RZ=1,U=nt)=f(Y,RZ=0,U=nt)f(Y,R\mid Z=1,U=nt)=f(Y,R\mid Z=0,U=nt) and f(Y,RZ=1,U=at)=f(Y,RZ=0,U=at)f(Y,R\mid Z=1,U=at)=f(Y,R\mid Z=0,U=at).

In previous work, some authors proposed the Latent ignorability (LI) assumption (Frangakis and Rubin,, 1999; Osius,, 2004; Zhou and Li,, 2006):

Assumption 8.

Latent ignorability assumption: f{R(z)Y(z),A(z),U}=f{R(z)U}f\{R(z)\mid Y(z),A(z),U\}=f\{R(z)\mid U\}

Latent ignorability implies that within each principal stratum, the potential outcomes are independent of the missing indicator, which means the missing data mechanism does not depend on the missing outcome. We give a graphical model to illustrate the LI missing data mechanism.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 1: A graph model for the LI missing data mechanism under Assumption 6 and 7

Chen et al., (2009) and Chen et al., (2015) proposed an Outcome-dependent nonignorable (ODN) missing data mechanism:

Assumption 9.

Outcome-dependent nonignorable assumption: For all y;z=0,1;a=0,1y;z=0,1;a=0,1; and u{at,cp,nt}u\in\{at,cp,nt\}, assuming

P{R(z)=1Y(z)=y,A(z)=a,U=u}\displaystyle P\{R(z)=1\mid Y(z)=y,A(z)=a,U=u\} =P{R(z)=1Y(z)=y}\displaystyle=P\{R(z)=1\mid Y(z)=y\}
P{R(1)=1Y(1)=y}\displaystyle P\{R(1)=1\mid Y(1)=y\} =P{R(0)=1Y(0)=y}.\displaystyle=P\{R(0)=1\mid Y(0)=y\}.

Under the complete randomization Assumption 6, Assumption 9 becomes P(R=1Y=y,A=a,U=u,Z=z)=P(R=1Y=y,Z=z)P(R=1\mid Y=y,A=a,U=u,Z=z)=P(R=1\mid Y=y,Z=z) and P(R=1Y=y,Z=1)=P(R=1Y=y,Z=0)P(R=1\mid Y=y,Z=1)=P(R=1\mid Y=y,Z=0). The outcome-dependent nonignorable assumption means that RR depends on YY, but is independent of (Z,A,U)(Z,A,U) given YY. Under the ODN missing data mechanism, the missing data indicator depends on the possibly missing outcome YY, which may be more reasonable than the LI missing data assumption in some applications. We give a graphical model to illustrate the ODN missing data mechanism.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 2: A graph model for the ODN missing data mechanism under Assumption 6 and 7

More examples and details about LI assumption and ODN assumption can be found in Chen et al., (2009) and Chen et al., (2015).

2.3 The framework of this paper

In this paper, we are interested in CACE with unmeasured confounding UU from the nonignorable missing outcomes YY.

In fact, the LI assumption and the ODN assumption usually require Assumption 6 and Assumption 7, which are stronger than Assumption 2 and Assumption 5. In this paper, we only need the weaker randomization Assumption 2 and exclusion restriction Assumption 5 to estimate CACE. Different from the above two missing outcome mechanisms, we adopt the shadow variable assumption to guarantee identification under the nonignorable missing outcomes.

We suppose that a fully observed auxiliary variable ZZ is called a shadow variable if it satisfies the following assumption (Miao and Tchetgen Tchetgen,, 2016; Miao et al.,, 2019).

Assumption 10.

(a): ZR(Y,A)Z\perp\!\!\!\perp R\mid(Y,A); (b): Z⟂̸Y(R=1,A)Z\not\perp\!\!\!\perp Y\mid(R=1,A).

Assumption 10 implies that the shadow variable is associated with the outcome when the outcome is observed, but it is independent of the missing mechanism conditional on the fully observed variables and the possibly unobserved outcome (d’Haultfoeuille,, 2010; Kott,, 2014; Wang et al.,, 2014; Zhao and Shao,, 2015). Therefore, Assumption 10 allows the data to be missing not at random.

In this paper, we suppose an auxiliary variable ZZ satisfies both Assumptions 1-5 and Assumption 10, so that it simultaneously plays the roles of the instrumental variable and the shadow variable. Different from the ODN missing data mechanism in Figure 2, we allow an arrow from AA to RR in Figure 3, which means the treatment can affect the missing indicator. This is more reasonable in some applications. Figure 3 illustrates the framework of this paper, and we present an example to help readers understand it.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 3: A directed acyclic graph model for the auxiliary variable ZZ
Example 1.

We consider a real data analysis in Esterling et al., (2011). As noted in Esterling et al., (2011), the authors aimed to assess whether participating in an online chat session produced higher levels of citizen efficacy among the constituents. In this paper, we choose the instrumental variable ZZ indicating political knowledge, which equals 11 for the constituents who have high political knowledge and 0 for those with low political knowledge. The treatment variable AA equals 11 for the constituents who participated in the online chat session and 0 otherwise. A week after each session, a survey company administered a follow-up survey to the constituents. The outcome variable YY is Officials care, denoting the attitude of a constituent toward a question in the follow-up survey, with responses including “agreement”, “disagreement” and so on. However, some subjects may not respond to the follow-up survey, so the outcomes have missing values. At the same time, it is reasonable that whether or not a constituent participated in the session would affect the missing mechanism of the outcomes. In Section 6, we illustrate this example in more detail.

In the next section, we show that the key to the identification of f(y,ra,z)f(y,r\mid a,z) is Assumption 10. If Assumption 10 is violated, which means a shadow variable is not available, identification of f(y,ra,z)f(y,r\mid a,z) is not guaranteed; even parametric missingness mechanisms may not be identifiable (Miao et al.,, 2016; Wang et al.,, 2014). We also illustrate the extra conditions needed to guarantee identification.

3 Identification

We aim to identify the joint distribution f(a,y,r,z)f(a,y,r,z). We say the joint distribution f(a,y,r,z)f(a,y,r,z) is identifiable if and only if it can be uniquely determined by the observed data distribution f(y,r=1a,z),f(r=0a,z)f(y,r=1\mid a,z),f(r=0\mid a,z) and f(a,z)f(a,z).

The joint distribution f(a,y,z,r)f(a,y,z,r) can be factorized as

f(a,y,z,r)=f(y,ra,z)f(a,z),f(a,y,z,r)=f(y,r\mid a,z)f(a,z), (3.1)

Because the variables (A,Z)\left(A,Z\right) are fully observed, f(a,z)f(a,z) is identifiable without additional assumptions, so we focus on the identification of f(y,ra,z)f(y,r\mid a,z). The observed data law is captured by f(y,r=1a,z)f(y,r=1\mid a,z), f(r=0a,z)f(r=0\mid a,z) and f(a,z)f(a,z), which are functionals of the joint law f(a,y,z,r)f(a,y,z,r). Then we factorize f(y,ra,z)f(y,r\mid a,z) as

f(y,ra,z)=f(yr,a,z)f(ra,z).f(y,r\mid a,z)=f(y\mid r,a,z)f(r\mid a,z). (3.2)

According to equation (3.2), f(yr,a,z)f(y\mid r,a,z) represents the outcome distribution for different data patterns: R=1R=1 for the observed data and R=0R=0 for the missing data. Although f(yr=1,a,z)f(y\mid r=1,a,z) can be obtained completely from the observed data, the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) cannot be obtained directly from the observed data under MNAR. The fundamental identification challenge in missing data problems is how to recover the full data distribution f(a,y)f(a,y) and the missingness process (or propensity score) f(ra,z)f(r\mid a,z) from the observed data distribution.

We use the odds ratio function which encodes the deviation between the observed data and missing data distributions to measure the missingness process (Little,, 1993, 1994).

OR(a,y,z)=f(yr=0,a,z)f(y=1r=1,a,z)f(yr=1,a,z)f(y=1r=0,a,z).\operatorname{OR}(a,y,z)=\frac{f(y\mid r=0,a,z)f(y=1\mid r=1,a,z)}{f(y\mid r=1,a,z)f(y=1\mid r=0,a,z)}. (3.3)

Here, we use y=1y=1 as a reference value; the analyst can use any other value within the support of YY. The odds ratio function can be used to impose a known relationship between the missingness and the outcome YY (Little,, 1993, 1994). In Proposition 1, the odds ratio function plays a central role under data MNAR, and it can be identified with a shadow variable.

In the following, we assume that OR(a,y,z)\operatorname{OR}(a,y,z) and E[OR(a,y,z)r=1,a,z]E\left[\operatorname{OR}(a,y,z)\mid r=1,a,z\right] are finite. Following previous work, we factorize the conditional density function f(y,ra,z)f(y,r\mid a,z) into the odds ratio function and two baseline distributions (Osius,, 2004; Chen,, 2003, 2004, 2007; Kim and Yu,, 2011; Miao and Tchetgen Tchetgen,, 2016; Miao et al.,, 2019), and we establish some results in Proposition 1 based on a valid shadow variable.

Proposition 1.

Under Assumption 10, we have that for all (A,Y,Z)(A,Y,Z)

OR(a,y,z)=OR(a,y)f(r=0a,y)f(r=1a,y=1)f(r=1a,y)f(r=0a,y=1),\operatorname{OR}(a,y,z)=\operatorname{OR}(a,y)\equiv\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}, (3.4)
f(y,ra,z)=c(a,z)f(ra,y=1)f(yr=1,a,z){OR(a,y)}1r,c(a,z)=f(r=1a)f(r=1a,y=1)f(zr=1,a)f(za),\begin{gathered}f(y,r\mid a,z)=c(a,z)f(r\mid a,y=1)f(y\mid r=1,a,z)\{\operatorname{OR}(a,y)\}^{1-r},\\[5.69054pt] c(a,z)=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\frac{f(z\mid r=1,a)}{f(z\mid a)},\end{gathered} (3.5)
f(r=1a,y=1)=E[OR(a,y)r=1,a]f(r=0a)/f(r=1a)+E[OR(a,y)r=1,a].f(r=1\mid a,y=1)=\dfrac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}. (3.6)

The proof of Proposition 1 can be found in the Supplementary Materials. From identity (3.4), the odds ratio function OR(a,y)\operatorname{OR}(a,y) is a function of the variables (A,Y)(A,Y) only under Assumption 10, which means the odds ratio function does not depend on the variable ZZ. Therefore, we denote the odds ratio function OR(a,y,z)\operatorname{OR}(a,y,z) by OR(a,y)\operatorname{OR}(a,y). According to equation (3.4), the odds ratio function is built from the propensity score f(r=1a,y)f(r=1\mid a,y), which depends on the outcome itself. The odds ratio function can also be used to measure whether selection bias exists; for instance, the value of OR(a,y)\operatorname{OR}(a,y) represents the deviation of the missingness mechanism from MAR (Miao et al.,, 2019).

According to Miao and Tchetgen Tchetgen, (2016), we suppose throughout that OR(a,y)\operatorname{OR}(a,y) is correctly specified, which can be achieved by specifying a relatively flexible model, or by following the approach suggested by Higgins et al., (2008) if information on the reasons for missingness is available.

Equation (3.5) gives the key factorization of f(y,ra,z)f(y,r\mid a,z) into the propensity score f(ra,y=1)f(r\mid a,y=1) evaluated at the reference level Y=1Y=1, the outcome distribution f(yr=1,a,z)f(y\mid r=1,a,z) among the complete cases, and the odds ratio function OR(a,y)\operatorname{OR}(a,y). The former two are referred to as the baseline propensity score and the baseline outcome distribution, respectively. As illustrated above, f(yr=1,a,z)f(y\mid r=1,a,z) is uniquely determined from the complete cases; according to equations (3.5) and (3.6), we aim to identify f(y,ra,z)f(y,r\mid a,z) through the odds ratio function OR(a,y)\operatorname{OR}(a,y), which means the identification of the odds ratio function OR(a,y)\operatorname{OR}(a,y) is the fundamental problem in the whole framework. Proposition 2 gives some further results derived from identities (3.5) and (3.6) (Miao et al.,, 2019).

Proposition 2.

Under Assumption 10, we have that

f(r=1a,y)\displaystyle f(r=1\mid a,y) =f(r=1a,y,z),\displaystyle=f(r=1\mid a,y,z), (3.7)
=f(r=1a,y=1)f(r=1a,y=1)+OR(a,y)f(r=0a,y=1),\displaystyle=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)},
f(yr=0,a,z)=OR(a,y)f(yr=1,a,z)E[OR(a,y)r=1,a,z],f(y\mid r=0,a,z)=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]}, (3.8)
E[OR~(a,y)r=1,a,z]=f(zr=0,a)f(zr=1,a),whereOR~(a,y)=OR(a,y)E[OR(a,y)r=1,a],\begin{gathered}E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\\[8.53581pt] \text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]},\end{gathered} (3.9)
OR(a,y)=OR~(a,y)OR~(a,y=1).\operatorname{OR}(a,y)=\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}. (3.10)

The proof of Proposition 2 can be found in the Supplementary Materials. Equations (3.7) - (3.10) indicate the key role of the odds ratio function in the identification framework of this paper. Equation (3.7) reveals that the propensity score f(r=1a,y)f(r=1\mid a,y) is a function of the odds ratio function OR(a,y)\operatorname{OR}(a,y). Equation (3.8) shows that under the shadow variable Assumption 10, the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) can be recovered by combining the odds ratio function OR(a,y)\operatorname{OR}(a,y) with the complete-case distribution, and therefore the full data distribution is available. Equation (3.9) is a Fredholm integral equation of the first kind in which OR~(a,y)\widetilde{\operatorname{OR}}(a,y) is to be solved for, because f(zr=0,a)f(z\mid r=0,a), f(zr=1,a)f(z\mid r=1,a) and f(yr=1,a,z)f(y\mid r=1,a,z) can be obtained from the observed data. Equation (3.10) relates OR(a,y)\operatorname{OR}(a,y) to OR~(a,y)\widetilde{\operatorname{OR}}(a,y); therefore, the identification of OR(a,y)\operatorname{OR}(a,y) is equivalent to the uniqueness of the solution of equation (3.9), which is guaranteed by a completeness condition on f(yr=1,a,z)f(y\mid r=1,a,z) (Miao et al.,, 2019).

Condition 1.

(The completeness of f(yr=1,a,z)f(y\mid r=1,a,z))   The function f(yr=1,a,z)f(y\mid r=1,a,z) is complete if and only if, for any square-integrable function h(a,y)h(a,y), E[h(a,y)r=1,a,z]=0E\left[h(a,y)\mid r=1,a,z\right]=0 almost surely implies h(a,y)=0h(a,y)=0 almost surely.

The completeness condition is also an essential condition in other identification analyses (Newey and Powell,, 2003), such as nonparametric instrumental variable regression (Darolles et al.,, 2011), outcomes missing not at random (Miao et al.,, 2019) and confounders missing not at random (Yang et al.,, 2019). In contrast, some of the conditions imposed on the shadow variable by previous authors cannot be justified empirically. For instance, a completeness condition is required for the full data distribution (d’Haultfoeuille,, 2010); a monotone likelihood ratio is required for the full data distribution (Wang et al.,, 2014); a generalized linear model is considered for the full data distribution (Zhao and Shao,, 2015). As noted by Miao et al., (2019), the completeness condition of this paper only involves the observed data distribution f(yr=1,a,z)f(y\mid r=1,a,z), which means that it can be justified empirically and does not require additional model assumptions on the missing data distribution.

In fact, Condition 1 implicitly requires that ZZ has a support at least as large as that of YY. However, as noted by Miao and Tchetgen Tchetgen, (2016), who gave a counterexample, a binary shadow variable cannot guarantee the identification of the full data distribution for a continuous outcome. To identify the distribution of a continuous outcome, a continuous shadow variable and extra conditions need to be imposed. More details about the completeness condition can be found in Miao and Tchetgen Tchetgen, (2016), where some commonly used parametric and semiparametric models such as exponential families and location-scale families are applied to the analysis.

Under Assumption 10, Condition 1 is sufficient to ensure the unique solution from equation (3.9), and thus according to equation (3.10), the odds ratio function OR(a,y)\operatorname{OR}(a,y) is identifiable. We state the result in the following theorem.

Theorem 1.

Under Assumption 10 and Condition 1, the joint distribution f(a,y,z,r)f(a,y,z,r) is identifiable.

The proof of Theorem 1 can be found in the Supplementary Materials. Theorem 1 indicates that nonparametric identification of the full data law f(a,y,z,r)f(a,y,z,r) under MNAR is achieved with a valid shadow variable. The odds ratio function is the basis of our identification analysis. Because f(yr=1,a,z)f(y\mid r=1,a,z) and f(zr=1,a)f(z\mid r=1,a) are identifiable from the observed data, we turn the identification of the odds ratio function OR(a,y)\operatorname{OR}(a,y) into the problem of solving for OR~(a,y)\widetilde{\operatorname{OR}}(a,y) from (3.9). Condition 1 guarantees the unique solution of (3.9). According to equation (3.8), the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) can be recovered once the odds ratio function is identified, and then, according to equations (3.1) and (3.2), f(y,ra,z)f(y,r\mid a,z) and its functionals can be identified. The shadow variable plays the key role in the identification of the odds ratio function: with a valid shadow variable, nonparametric identification of the full data distribution f(a,y,z,r)f(a,y,z,r) is achieved via the pattern-mixture factorization, and it is the shadow variable that makes equation (3.9) available. More details and examples about the shadow variable can be found in Miao and Tchetgen Tchetgen, (2016) and Miao et al., (2019).
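To make the identification recipe concrete, consider the setting of this paper in which both YY and ZZ are binary. For each treatment level aa, equation (3.9) then reduces to a system of two linear equations in the two unknown values of OR~(a,y)\widetilde{\operatorname{OR}}(a,y), and Condition 1 amounts to the coefficient matrix being nonsingular. The following sketch (written in Python, with hypothetical input probabilities; the variable names are ours and not part of the paper) solves this system, recovers OR(a,y)\operatorname{OR}(a,y) via (3.10), and then recovers the missing data distribution via (3.8).

```python
import numpy as np

# Illustration of (3.9)-(3.10) for a binary outcome and a binary shadow variable,
# at a fixed treatment level a.  The input probabilities below are hypothetical
# and would in practice be estimated from the observed data.
y_vals = np.array([1, 2])                     # support of Y; y = 1 is the reference value

# f(y | r = 1, a, z), rows indexed by z = 0, 1 and columns by y
f_y_given_r1_z = np.array([[0.7, 0.3],
                           [0.5, 0.5]])
# f(z | r = 0, a) / f(z | r = 1, a) for z = 0, 1
density_ratio = np.array([1.1, 0.9])

# Equation (3.9): sum_y ORtilde(a, y) f(y | r = 1, a, z) equals the density ratio.
# With two values of z this is a 2 x 2 linear system; completeness (Condition 1)
# is exactly nonsingularity of the coefficient matrix.
or_tilde = np.linalg.solve(f_y_given_r1_z, density_ratio)

# Equation (3.10): normalize by the reference value y = 1 (index 0).
odds_ratio = or_tilde / or_tilde[0]

# Equation (3.8): recover the missing-data distribution f(y | r = 0, a, z).
numer = odds_ratio * f_y_given_r1_z           # broadcasts over the z rows
f_y_given_r0_z = numer / numer.sum(axis=1, keepdims=True)
print(odds_ratio, f_y_given_r0_z)
```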

In the next section, we apply the generalized method of moments (GMM) to obtain the mean value of the outcome YY. Some results of the consistency and the asymptotic properties of the estimators are established. After recovering the mean value of the outcome, we propose an estimator to adjust for the unmeasured confounding to obtain the CACE.

4 Estimation and Inference

4.1 Estimating the mean value of the nonignorable missing outcomes

We consider the situation where ZZ is an auxiliary variable taking values of l=0,1l=0,1. The first step of the estimation is to recover the mean value of the outcome which is missing not at random. We want to estimate the population mean μ=E[Y]\mu=E\left[Y\right] from the observed data. After the identification is guaranteed, the joint distribution of YY and RR given (A,Z)\left(A,Z\right) is determined by f(ya,z)f(y\mid a,z) and the missing mechanism f(ry,a,z)f(r\mid y,a,z).

f(y,ra,z)=f(ya,z)f(ry,a,z).f(y,r\mid a,z)=f(y\mid a,z)f(r\mid y,a,z). (4.1)

The missing mechanism π(y,a,z)=f(r=1y,a,z)\pi(y,a,z)=f(r=1\mid y,a,z) is the key quantity in the estimation process. The conditional probability π(y,a,z)\pi(y,a,z) is also called the nonresponse mechanism or the propensity of missing data in the literature (Wang et al.,, 2014; Shao and Wang,, 2016; Zhang et al.,, 2018). For the outcome YY missing not at random, the propensity depends on both the observed data and the missing data. Some authors imposed parametric assumptions on both the propensity and the outcome model to establish likelihood methods (Greenlees et al.,, 1982; Baker and Laird,, 1988), but the resulting estimators are sensitive to misspecification of the fully parametric models. Other authors proposed semiparametric approaches. Qin et al., (2002) imposed a parametric model on the propensity, which is difficult to verify under nonignorable missingness. Some authors considered weaker assumptions on the model. Kim and Yu, (2011) proposed a semiparametric logistic regression model for the propensity. Tang et al., (2003) proposed a pseudo-likelihood method, imposing a parametric model on the outcome distribution but leaving the propensity unspecified. Zhao and Shao, (2015) considered a generalized linear model for the estimation while allowing for a nonparametric missing mechanism. Some authors proposed empirical likelihood methods to estimate the parameters in the missing mechanism with nonignorable missing data (Zhao et al.,, 2013; Tang et al.,, 2014; Niu et al.,, 2014).

Wang et al., (2014) applied the generalized method of moments to estimate the parameters of the missing mechanism. Shao and Wang, (2016) imposed an exponential tilting model on the propensity and estimated the tilting parameter and the population mean in two steps. Zhang et al., (2018) proposed an approach in which the parameters of interest and the tilting parameter are estimated simultaneously with the generalized method of moments and kernel regression.

In this paper, we apply the generalized method of moments (GMM) to estimate the parameters of the missing mechanism (Hansen,, 1982; Hall,, 2005). For estimation, we consider a parametric model for the propensity π(y,a,z)\pi(y,a,z) and we do not impose any parametric assumption on the outcome model f(ya,z)f(y\mid a,z), which means f(ya,z)f(y\mid a,z) is left nonparametric.

Here we assume the propensity π(y,a,z)=f(r=1y,a,z)\pi(y,a,z)=f(r=1\mid y,a,z) satisfies the following conditions:

Condition 2.
π(y,a,z)\displaystyle\pi(y,a,z) =π(y,a,z,𝜽)=f(r=1y,a,z,𝜽)\displaystyle=\pi(y,a,z,\bm{\theta})=f(r=1\mid y,a,z,\bm{\theta}) (4.2)
=f(r=1y,a,𝜽)=ψ(α+βy+γa),\displaystyle=f(r=1\mid y,a,\bm{\theta})=\psi\left(\alpha+\beta y+\gamma a\right),

where 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}} is a pp-dimensional unknown parameter. ψ()\psi(\cdot) is a known, strictly monotone, and twice differentiable function from \mathcal{R} to (0,1](0,1].

Similar to the discussion in Wang et al., (2014), we specify a parametric propensity in which the nonresponse instrument ZZ does not appear. However, once identification is guaranteed as in Section 3, the identifiability condition (C2) in Wang et al., (2014) and Zhang et al., (2018) is not essential. The next step is to estimate these parameters using the observed data.

For applying the GMM, we need to construct a set of qq estimating equations:

G(y,r,a,z,𝜽)=(g1(y,r,a,z,𝜽),,gq(y,r,a,z,𝜽))T,𝜽Θ,G(y,r,a,z,\bm{\theta})=\left(g_{1}(y,r,a,z,\bm{\theta}),\cdots,g_{q}(y,r,a,z,\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta, (4.3)

where Θ\Theta is the parameter space containing the true parameter value 𝜽0=(α0,β0,γ0)T\bm{\theta}_{0}=\left(\alpha_{0},\beta_{0},\gamma_{0}\right)^{\mathrm{T}}, and 𝜽\bm{\theta} is a pp-dimensional vector of the parameters that we want to estimate. The estimating equations need to satisfy E[gm(y,r,a,z,𝜽0)]=0E\left[g_{m}\left(y,r,a,z,\bm{\theta}_{0}\right)\right]=0,  m=1,,qm=1,\cdots,q, and qpq\geq p. Let

G^n(𝜽)=(1ni=1ng1(yi,ri,ai,zi,𝜽),,1ni=1ngq(yi,ri,ai,zi,𝜽))T,𝜽Θ.\widehat{G}_{n}(\bm{\theta})=\left(\frac{1}{n}\sum\limits_{i=1}^{n}g_{1}(y_{i},r_{i},a_{i},z_{i},\bm{\theta}),\cdots,\frac{1}{n}\sum\limits_{i=1}^{n}g_{q}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta. (4.4)

G^n(𝜽)\widehat{G}_{n}(\bm{\theta}) is the sample version of the estimating equations (4.3). The number of estimating equations qq is often greater than pp, which implies that there is no solution to

g¯m(𝜽)1ni=1ngm(yi,ri,ai,zi,𝜽)=0,m=1,,q.\bar{g}_{m}\left(\bm{\theta}\right)\equiv\frac{1}{n}\sum\limits_{i=1}^{n}g_{m}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right)=0,\quad m=1,\cdots,q. (4.5)

The best we can do is to make it as close as possible to zero by minimizing the quadratic function

Qn(𝜽)=[G^n(𝜽)]TW[G^n(𝜽)],{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}W[\widehat{G}_{n}(\bm{\theta})], (4.6)

where WW is a positive semi-definite and symmetric q×qq\times q matrix of weights. We use two-step GMM to estimate the parameters 𝜽\bm{\theta}.

The first step: Let W=Iq×qW=I_{q\times q} in (4.6), Iq×qI_{q\times q} is a q×qq\times q identity matrix, then we obtain 𝜽~\tilde{\bm{\theta}} by minimizing Qn(𝜽)=[G^n(𝜽)]T[G^n(𝜽)]{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}[\widehat{G}_{n}(\bm{\theta})], which means 𝜽~=argmin𝜽ΘQn(𝜽)\tilde{\bm{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\Theta}Q_{n}(\bm{\theta}).

The second step: Then we obtain W^=W(𝜽~)\widehat{W}=W(\tilde{\bm{\theta}}), plugging in W^\widehat{W} in (4.6). Finally, we obtain the GMM estimator 𝜽^\widehat{\bm{\theta}} by minimizing Q^n(𝜽)=[G^n(𝜽)]TW^[G^n(𝜽)]\widehat{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}\widehat{W}[\widehat{G}_{n}(\bm{\theta})] over 𝜽Θ\bm{\theta}\in\Theta, which means 𝜽^=argmin𝜽ΘQ^n(𝜽)\widehat{\bm{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\Theta}\widehat{Q}_{n}(\bm{\theta}).

The next step is to construct the estimating equations to estimate 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}. ZZ is a discrete variable taking values of l=0,1l=0,1. We let 𝒌\bm{k} be a LL-dimensional column vector whose ll-th component is I(z=l)I(z=l) (L=2)(L=2), where I()I(\cdot) is the indicator function. Similar to 𝒌\bm{k}, we let 𝒋\bm{j} also be a LL-dimensional column vector whose ll-th component is I(a=l)I(a=l) .

We now consider the general situation. The estimating equations in (4.3) can be constructed by the following qq functions

G(y,r,a,z,𝜽)=(𝒌[rπ(y,a,z,𝜽)1]𝒋[rπ(y,a,z,𝜽)1]),G(y,r,a,z,\bm{\theta})=\left(\begin{array}[]{l}\bm{k}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\\[8.53581pt] \bm{j}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\end{array}\right), (4.7)

where π(y,a,z,𝜽)=f(r=1y,a,z)=f(r=1y,a)=ψ(α+βy+γa)\pi(y,a,z,\bm{\theta})=f(r=1\mid y,a,z)=f(r=1\mid y,a)=\psi\left(\alpha+\beta y+\gamma a\right) in (4.2).

For example, if zz is a constant and there are no other auxiliary variables, then (4.7) is insufficient for estimating the unknown 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}. To use the GMM, we need qpq\geq p. If there is only a discrete variable ZZ, the requirement L2L\geq 2 is satisfied as long as ZZ is not a constant.

The estimating functions G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) are motivated by the following equations, if 𝜽0=(α0,β0,γ0)T\bm{\theta}_{0}=(\alpha_{0},\beta_{0},\gamma_{0})^{\mathrm{T}} is the true parameter value,

E[G(y,r,a,z,𝜽)]\displaystyle E\left[G(y,r,a,z,\bm{\theta})\right] =E{𝜼[rπ(y,a,z,𝜽)1]}\displaystyle=E\left\{\bm{\eta}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\right\} (4.8)
=E(E{𝜼[rπ(y,a,z,𝜽)1]|y,a,z})\displaystyle=E\left(E\left\{\bm{\eta}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\bigg{|}\ y,a,z\right\}\right)
=E{𝜼[E(ry,a,z)f(r=1y,a,z)1]}\displaystyle=E\left\{\bm{\eta}\left[\dfrac{E(r\mid y,a,z)}{f(r=1\mid y,a,z)}-1\right]\right\}
=𝟎,\displaystyle=\bm{0},

where 𝜼=(𝒌T,𝒋T)T\bm{\eta}=\left(\bm{k}^{\mathrm{T}},\bm{j}^{\mathrm{T}}\right)^{\mathrm{T}} is a qq-dimensional vector. Let G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) in (4.7) be the estimating equations in (4.3); then G^n(𝜽)\widehat{G}_{n}(\bm{\theta}) in (4.4) is the corresponding sample version, and W^\widehat{W} is given by the two-step method described above. We then obtain 𝜽^=(α^,β^,γ^)T\widehat{\bm{\theta}}=\left(\hat{\alpha},\hat{\beta},\hat{\gamma}\right)^{\mathrm{T}} as the two-step GMM estimator of 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}.
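To illustrate how the estimation procedure can be carried out in practice, the following sketch (in Python; the function names are ours and not part of the paper) builds the moment functions in (4.7) with a logistic choice of ψ\psi, as in the simulation study of Section 5, and performs the two-step GMM minimization. Note that missing outcomes may be stored with an arbitrary filler value, because r/π(y,a,z,𝜽)1r/\pi(y,a,z,\bm{\theta})-1 equals 1-1 whenever r=0r=0, regardless of yy.

```python
import numpy as np
from scipy.optimize import minimize

def psi(x):
    """Logistic link, one choice of the strictly monotone function in Condition 2."""
    return 1.0 / (1.0 + np.exp(-x))

def moments(theta, y, r, a, z):
    """Per-observation moment functions from (4.7): eta * (r / pi - 1),
    with eta = (I(z=0), I(z=1), I(a=0), I(a=1)), so q = 2L = 4."""
    alpha, beta, gamma = theta
    pi = psi(alpha + beta * y + gamma * a)
    resid = r / pi - 1.0                      # equals -1 when r = 0, so missing y can be any filler
    eta = np.column_stack([z == 0, z == 1, a == 0, a == 1]).astype(float)
    return eta * resid[:, None]               # n x q matrix

def gmm_objective(theta, W, y, r, a, z):
    gbar = moments(theta, y, r, a, z).mean(axis=0)   # sample moments (4.4)
    return gbar @ W @ gbar                           # quadratic form (4.6)

def two_step_gmm(y, r, a, z, theta_init=(0.0, 0.0, 0.0)):
    """Two-step GMM estimator of theta = (alpha, beta, gamma)."""
    q = 4
    # Step 1: identity weight matrix.
    step1 = minimize(gmm_objective, theta_init, args=(np.eye(q), y, r, a, z), method="Nelder-Mead")
    # Step 2: estimated optimal weight W_hat = Omega(theta_tilde)^{-1}.
    G1 = moments(step1.x, y, r, a, z)
    W_hat = np.linalg.inv(G1.T @ G1 / len(y))
    step2 = minimize(gmm_objective, step1.x, args=(W_hat, y, r, a, z), method="Nelder-Mead")
    return step2.x, W_hat
```

A derivative-free optimizer is used here only for simplicity; gradient-based minimization of (4.6) would work equally well, since ψ\psi is twice differentiable.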

Theorem 2.

Suppose that E[G(y,r,a,z,𝜽)]=𝟎E\left[G(y,r,a,z,\bm{\theta})\right]=\bm{0} only if 𝜽=𝜽0\bm{\theta}=\bm{\theta}_{0}, 𝜽0Θ\bm{\theta}_{0}\in\Theta, which is compact and that E[sup𝜽ΘG(y,r,a,z,𝜽)]<E\left[\sup_{\bm{\theta}\in\Theta}\left\|G(y,r,a,z,\bm{\theta})\right\|\right]<\infty. Then, as nn\rightarrow\infty, 𝜽^P𝜽0\widehat{\bm{\theta}}\stackrel{{\scriptstyle P}}{{\longrightarrow}}\bm{\theta}_{0} and P\stackrel{{\scriptstyle P}}{{\longrightarrow}} denotes convergence in probability.

The proof of Theorem 2 and more details about GMM can be found in the Supplementary Materials. Let 𝜽()\nabla_{\bm{\theta}}(\cdot) and 𝜽𝜽()\nabla_{{\bm{\theta}}{\bm{\theta}}}(\cdot) denote the first- and second-order derivatives with respect to 𝜽\bm{\theta}. The asymptotic normality of 𝜽^\widehat{\bm{\theta}} is established in Theorem 3 under the following additional Condition 3.

Condition 3.

(i) 𝜽0\bm{\theta}_{0}\in interior of Θ\Theta; (ii) G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) is continuously differentiable in a neighborhood 𝒩\mathcal{N} of 𝜽0\bm{\theta}_{0}, with probability approaching one; (iii) E[G(y,r,a,z,𝜽0)2]E\left[\|G(y,r,a,z,\bm{\theta}_{0})\|^{2}\right] is finite and E[sup𝜽𝒩θG(y,r,a,z,𝜽)]<E\left[\sup_{\bm{\theta}\in\mathcal{N}}\left\|\nabla_{\theta}G(y,r,a,z,\bm{\theta})\right\|\right]<\infty; (iv) H=E[θG(y,r,a,z,𝜽0)]H=E\left[\nabla_{\theta}G(y,r,a,z,\bm{\theta}_{0})\right] exists and is of full rank.

Theorem 3.

Suppose that the assumptions of Theorem 2 are satisfied and Condition 3 holds, and let Ω=E[G(y,r,a,z,𝜽0)G(y,r,a,z,𝜽0)T]\Omega=E[G(y,r,a,z,\bm{\theta}_{0})G(y,r,a,z,\bm{\theta}_{0})^{\mathrm{T}}]. Then, as nn\rightarrow\infty, n(𝜽^𝜽0)N(𝟎,Δ)\sqrt{n}(\widehat{\bm{\theta}}-\bm{\theta}_{0})\stackrel{{\scriptstyle\mathscr{L}}}{{\longrightarrow}}N(\bm{0},\Delta), where Δ=(HTWH)1HTWΩWH(HTWH)1\Delta=\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W\Omega WH\left(H^{\mathrm{T}}WH\right)^{-1} and \stackrel{{\scriptstyle\mathscr{L}}}{{\longrightarrow}} denotes convergence in distribution.

The proof of Theorem 3 can be found in the Supplementary Materials. The asymptotic covariance matrix Δ\Delta can be estimated by Δ^\widehat{\Delta}, which is obtained by substituting estimators for each of HH, WW and Ω\Omega. To estimate Ω\Omega, we replace the population moment by a sample average and the true parameter 𝜽0\bm{\theta}_{0} by the estimator 𝜽^\widehat{\bm{\theta}}; H^\widehat{H} is estimated similarly. That is, we form

Ω^=1ni=1nGn(yi,ri,ai,zi,𝜽^)Gn(yi,ri,ai,zi,𝜽^)T,H^=1ni=1nGn(yi,ri,ai,zi,𝜽)𝜽|𝜽=𝜽^,\widehat{\Omega}=\frac{1}{n}\sum\limits_{i=1}^{n}G_{n}(y_{i},r_{i},a_{i},z_{i},\widehat{\bm{\theta}})G_{n}(y_{i},r_{i},a_{i},z_{i},\widehat{\bm{\theta}})^{\mathrm{T}},\quad\widehat{H}=\left.\frac{1}{n}\sum_{i=1}^{n}\frac{\partial G_{n}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})}{\partial\bm{\theta}}\right|_{\bm{\theta}=\widehat{\bm{\theta}}},

where Gn(yi,ri,ai,zi,𝜽)=(g1(yi,ri,ai,zi,𝜽),,gq(yi,ri,ai,zi,𝜽))T,i=1,,n,𝜽ΘG_{n}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})=\left(g_{1}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right),\cdots,g_{q}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad i=1,\cdots,n,\quad\bm{\theta}\in\Theta. So that we have the Theorem 4.

Theorem 4.

If the hypotheses of Theorem 3 are satisfied, then Δ^PΔ\widehat{\Delta}\stackrel{{\scriptstyle P}}{{\longrightarrow}}\Delta, where Δ^=(H^TW^H^)1H^TW^Ω^W^H^(H^TW^H^)1\widehat{\Delta}=\left(\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{H}\right)^{-1}\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{\Omega}\widehat{W}\widehat{H}\left(\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{H}\right)^{-1}.

The proof of Theorem 4 can be found in the Supplementary Materials. The optimal weight matrix is W1=Ω1W_{1}=\Omega^{-1}; with this choice of W1W_{1}, W^\widehat{W} can be constructed as follows,

W^={1ni=1nGn(yi,ri,ai,zi,𝜽~)Gn(yi,ri,ai,zi,𝜽~)T}1,\widehat{W}=\left\{\frac{1}{n}\sum\limits_{i=1}^{n}G_{n}(y_{i},r_{i},a_{i},z_{i},\tilde{\bm{\theta}})G_{n}(y_{i},r_{i},a_{i},z_{i},\tilde{\bm{\theta}})^{\mathrm{T}}\right\}^{-1},

and the asymptotic covariance matrix Δ\Delta then reduces to (HTΩ1H)1\left(H^{\mathrm{T}}\Omega^{-1}H\right)^{-1}, with the corresponding estimator Δ^\widehat{\Delta} reducing to (H^TΩ^1H^)1\left(\widehat{H}^{\mathrm{T}}\widehat{\Omega}^{-1}\widehat{H}\right)^{-1}.
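A plug-in computation of this covariance estimator is sketched below, reusing the moments function from the previous sketch. The analytic derivative H^\widehat{H} is replaced here by a numerical Jacobian, which is our own simplification, and the optimal weight is assumed so that Δ^=(H^TΩ^1H^)1\widehat{\Delta}=(\widehat{H}^{\mathrm{T}}\widehat{\Omega}^{-1}\widehat{H})^{-1}.

```python
import numpy as np

def gmm_covariance(theta_hat, y, r, a, z, eps=1e-6):
    """Plug-in estimate of the asymptotic covariance under the optimal weight:
    Delta_hat = (H_hat^T Omega_hat^{-1} H_hat)^{-1}.  The sandwich form of
    Theorem 3 applies when a non-optimal weight matrix is used instead.
    Reuses moments() from the two-step GMM sketch above."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    n, p = len(y), len(theta_hat)
    G = moments(theta_hat, y, r, a, z)
    omega_hat = G.T @ G / n                              # Omega_hat from sample outer products
    gbar = lambda t: moments(t, y, r, a, z).mean(axis=0)
    # H_hat: central-difference Jacobian of the sample moments with respect to theta
    H_hat = np.column_stack([
        (gbar(theta_hat + eps * np.eye(p)[j]) - gbar(theta_hat - eps * np.eye(p)[j])) / (2 * eps)
        for j in range(p)
    ])
    delta_hat = np.linalg.inv(H_hat.T @ np.linalg.inv(omega_hat) @ H_hat)
    return delta_hat                                     # covariance of theta_hat itself is delta_hat / n
```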

Once 𝜽^\widehat{\bm{\theta}} is obtained, the mean value μ=E[Y]\mu=E\left[Y\right] can be estimated by inverse probability weighting with the estimated propensity as the weight function,

μ^=1ni=1nriyiπ(yi,ai,zi,𝜽^).\widehat{\mu}=\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{r_{i}y_{i}}{\pi(y_{i},a_{i},z_{i},\widehat{\bm{\theta}})}. (4.9)
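A minimal sketch of the estimator (4.9), assuming the logistic propensity used in the simulation study and a fitted 𝜽^\widehat{\bm{\theta}} (the function name is ours):

```python
import numpy as np

def ipw_mean(theta_hat, y, r, a):
    """IPW estimator (4.9) of mu = E[Y].  Entries of y with r = 0 may hold any
    filler value, since they are multiplied by r = 0."""
    alpha, beta, gamma = theta_hat
    pi_hat = 1.0 / (1.0 + np.exp(-(alpha + beta * y + gamma * a)))   # logistic psi
    return np.mean(r * y / pi_hat)
```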

In fact, if the number of estimating equations qq is larger than the number of parameters pp, an additional estimating equation gq+1=μry/π(y,a,z,𝜽)g_{q+1}=\mu-ry/\pi(y,a,z,\bm{\theta}) can be added to the estimating equations G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) in (4.7). Then G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) becomes a set of q+1=2L+1q+1=2L+1 estimating equations, and 𝜽=(μ,α,β,γ)T\bm{\theta}=\left(\mu,\alpha,\beta,\gamma\right)^{\mathrm{T}} becomes a set of p+1p+1 unknown parameters. If μ~\tilde{\mu} is the two-step GMM estimator based on these q+1=2L+1q+1=2L+1 estimating equations, then, as noted by Wang et al., (2014), the difference between the estimators μ^\widehat{\mu} and μ~\tilde{\mu} lies in the weighting matrix W^\widehat{W}: unless the weighting matrix used for μ^\widehat{\mu} is optimal, μ~\tilde{\mu} is asymptotically more efficient. In this paper, however, we use μ^\widehat{\mu} to construct the estimator of CACE. The functional delta method can be used to establish the asymptotic results for μ^\widehat{\mu}; for simplicity, we omit this part.

4.2 Obtaining CACE from the nonignorable missing outcomes

Under Assumptions 1-5, we can estimate CACE for the subpopulation of compliers characterized by Ui=cpU_{i}=cp, by taking the ratio of the average difference in YiY_{i} by instrument and the average difference in AiA_{i} by instrument (Angrist et al.,, 1996):

E[Yi(1)Yi(0)Ui=cp]=E[YiZi=1]E[YiZi=0]E[AiZi=1]E[AiZi=0].E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp\right]=\frac{E\left[Y_{i}\mid Z_{i}=1\right]-E\left[Y_{i}\mid Z_{i}=0\right]}{E\left[A_{i}\mid Z_{i}=1\right]-E\left[A_{i}\mid Z_{i}=0\right]}. (4.10)

If the outcome YY is fully observed, CACE can be estimated by equation (4.10). So, the missing data problem is equivalent to the problem of estimating the mean difference E(YiZi=1)E(Y_{i}\mid Z_{i}=1) and E(YiZi=0)E(Y_{i}\mid Z_{i}=0) from incomplete outcome data.

In equation (4.10), the conditional expectations of the nonignorable missing outcome YY, E[YiZi=l],l=0,1E\left[Y_{i}\mid Z_{i}=l\right],l=0,1, can be estimated by applying equation (4.9) within each stratum of the instrument ZZ; E[AiZi=l],l=0,1E\left[A_{i}\mid Z_{i}=l\right],l=0,1, can be estimated from the fully observed data. Then the estimator CACE^\widehat{CACE} can be obtained from equations (4.9) and (4.10).
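Combining (4.9) and (4.10), one way to form CACE^\widehat{CACE} is to apply the IPW mean within each stratum of ZZ and take the Wald ratio. The sketch below reflects our own reading of the stratification step, assuming a single 𝜽^\widehat{\bm{\theta}} fitted on the full sample and reusing ipw_mean from the previous sketch.

```python
def cace_estimate(theta_hat, y, r, a, z):
    """Wald-type estimator (4.10): IPW means of Y within each Z stratum in the
    numerator, sample means of A within each Z stratum in the denominator."""
    num = ipw_mean(theta_hat, y[z == 1], r[z == 1], a[z == 1]) \
        - ipw_mean(theta_hat, y[z == 0], r[z == 0], a[z == 0])
    den = a[z == 1].mean() - a[z == 0].mean()
    return num / den
```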

5 Simulations

In this section, we conduct simulation studies to evaluate the approach proposed in this paper. All the variables are binary. We generate the instrumental variable, the compliance status variable, and the potential outcomes from the following process:

ZBernoulli(0.5)Z\sim Bernoulli(0.5)
Table 1: True parameters for the compliance status variable UU
UU atat ntnt cpcp
PP 0.20.2 0.250.25 0.550.55
Table 2: True parameters for the potential outcomes YY
Y(z)U=atY(z)\mid U=at 22 44
PP 0.30.3 0.70.7
Y(z)U=ntY(z)\mid U=nt 22 44
PP 0.70.7 0.30.3
Y(1)U=cpY(1)\mid U=cp 22 44
PP 0.40.4 0.60.6
Y(0)U=cpY(0)\mid U=cp 22 44
PP 0.60.6 0.40.4

Since the outcome YY may be missing not at random, we consider the following logistic missing mechanism of the form (4.2):

π(y,a,z,𝜽)=P(R=1Z=z,Y=y,A=a)=ψ(α+βy+γa)\pi(y,a,z,\bm{\theta})=P(R=1\mid Z=z,Y=y,A=a)=\psi(\alpha+\beta y+\gamma a) (5.1)

where ψ(α+βy+γa)=[1+exp((α+βy+γa))]1\psi\left(\alpha+\beta y+\gamma a\right)=[1+\exp(-(\alpha+\beta y+\gamma a))]^{-1} and the true parameters are 𝜽0=(α0,β0,γ0)T=(1,0.1,0.1)T\bm{\theta}_{0}=(\alpha_{0},\beta_{0},\gamma_{0})^{\mathrm{T}}=(1,-0.1,-0.1)^{\mathrm{T}}, so the missingness indicator RR depends on the possibly missing outcome YY itself. The key assumption is Assumption 10, which means RR is independent of ZZ conditional on the fully observed treatment and the possibly unobserved outcome YY. The missing data proportion is approximately between 35%35\% and 40%40\%.
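For concreteness, a sketch of this data-generating process (our own implementation of Tables 1-2 and the missing mechanism (5.1); the function name and the filler value for missing outcomes are ours) is given below. Its output can be passed directly to the GMM and CACE routines sketched in Section 4, for instance two_step_gmm(*simulate(2000)) for one replicate with n=2000n=2000.

```python
import numpy as np

def simulate(n, seed=0):
    """Generate (y, r, a, z) following Tables 1-2 and the logistic missing mechanism (5.1)."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, n)                                     # Z ~ Bernoulli(0.5)
    u = rng.choice(["at", "nt", "cp"], size=n, p=[0.2, 0.25, 0.55]) # compliance status, Table 1
    a = np.where(u == "at", 1, np.where(u == "nt", 0, z))           # compliers take A = Z
    # Observed outcome Y = Y(Z) on {2, 4}; exclusion restriction holds for at and nt (Table 2).
    p4 = np.select([u == "at", u == "nt", (u == "cp") & (z == 1)], [0.7, 0.3, 0.6], default=0.4)
    y = np.where(rng.uniform(size=n) < p4, 4, 2)
    # Nonignorable missingness: R depends on the possibly missing Y and on A.
    pi = 1.0 / (1.0 + np.exp(-(1.0 - 0.1 * y - 0.1 * a)))
    r = rng.binomial(1, pi)
    y_obs = np.where(r == 1, y, 0)                                  # filler value where Y is missing
    return y_obs.astype(float), r.astype(float), a.astype(float), z.astype(float)
```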

From Table 1 and Table 2, the true value of CACECACE =0.4=0.4. We use the approach proposed in Section 4 for our estimation. In Table 3, the columns with the labels: Bias, SD, and 95%95\% CI represent the average bias for (α^α0,β^β0,γ^γ0,CACE^CACE)(\hat{\alpha}-\alpha_{0},\hat{\beta}-\beta_{0},\hat{\gamma}-\gamma_{0},\widehat{CACE}-CACE), average standard deviation, and average 95%95\% confidence interval, respectively. We set 1000 replicates under sample sizes 100, 500, 1000, and 2000, respectively.

Table 3: Results of simulation studies
true value n Bias SD 95%95\% CI
α=1.0\alpha=1.0 100 0.0256 0.0020 [1.0216, 1.0295]
500 0.0323 0.0048 [1.0229, 1.0417]
1000 0.0395 0.0035 [1.0326, 1.0464]
2000 0.0373 0.0045 [1.0283, 1.0462]
β=0.1\beta=-0.1 100 0.0649 0.0021 [-0.0392, -0.0308]
500 0.1715 0.0075 [0.0567, 0.0863]
1000 0.1599 0.0066 [0.0469, 0.0730]
2000 0.1561 0.0064 [0.0434, 0.0688]
γ=0.1\gamma=-0.1 100 0.0153 0.0021 [-0.0887, -0.0804]
500 -0.0188 0.0052 [-0.1291, -0.1084]
1000 -0.0259 0.0052 [-0.1361, -0.1156]
2000 -0.0093 0.0045 [-0.1183, -0.1004]
CACE=0.4CACE=0.4 100 -0.0097 0.0274 [0.3364, 0.4441]
500 0.0026 0.0110 [0.3809, 0.4243]
1000 -0.0045 0.0107 [0.3743, 0.4165]
2000 0.0013 0.0112 [0.3791, 0.4234]

From Table 3, the proposed estimator of CACECACE has negligible bias even under the sample size 500500. As the sample size increases, the standard deviation of CACE^\widehat{CACE} becomes much smaller, and the bias of CACE^\widehat{CACE} stabilizes once the sample size nn reaches 500500. The components of the estimator 𝜽^=(α^,β^,γ^)T\widehat{\bm{\theta}}=(\hat{\alpha},\hat{\beta},\hat{\gamma})^{\mathrm{T}} all have small biases and standard deviations even under the sample size 100100. In fact, the proposed estimators 𝜽^\widehat{\bm{\theta}} and CACE^\widehat{CACE} in Section 4 all have small biases and standard deviations even under the small sample size 100100. All of the confidence intervals of 𝜽^\widehat{\bm{\theta}} and CACE^\widehat{CACE} have empirical coverage proportions very close to their nominal values.

6 A real data analysis

In this section, we apply the proposed estimators to a real dataset from the political analysis in Esterling et al., (2011). As noted in Esterling et al., (2011), the authors conducted a series of online deliberative field experiments, in which current members of the U.S. House of Representatives interacted via a Web-based interface with random samples of their constituents. Some members of Congress conducted the sessions that their constituents participated in, and the constituents interacted with their members in an online chat room.

An online survey research firm called Knowledge Networks (KN) was responsible for recruiting the constituents from each congressional district and administering the surveys. Each constituent was randomly assigned to one of three conditions: a deliberative condition that received background reading materials and was asked to complete a survey regarding the background materials (the background materials survey) and to participate in the sessions; an information-only group that only received the background materials and was asked to take the background materials survey; and a true control group. A week after each session, KN administered a follow-up survey to subjects in each of the groups.

There are two questions in the follow-up survey, all subjects were asked “Please tell us how much you agree or disagree with the following statements”:

1. I don’t think public officials care much what people like me think.

2. I have ideas about politics and policy that people in government should listen to.

The first question is a measure of external efficacy, and the second question is a measure of internal efficacy. In this study, we only consider the first question as the outcome variable: the outcome (Officials care) variable YY equals 22 for subjects who “somewhat disagree” or “strongly disagree” with the first question, and 11 for subjects who “somewhat agree”, “strongly agree” or “Neither agree nor disagree”. In the follow-up survey, however, some subjects chose not to respond to the question, so the outcome variable YY has missing values.

Esterling et al., (2011) restricted the sample to the 670670 subjects who completed the baseline and background materials surveys and initially indicated a willingness to participate in the deliberative sessions. Thus, the treatment effect compares those who read the background materials and participated in the discussions to those who only read the background materials. Our goal in this analysis is to evaluate the causal effect of participating in the online chat session on Officials care. More details about this deliberative session can be found in Esterling et al., (2011).

The treatment variable AA equals 11 for subjects who participated in the session and 0 for subjects who did not participate in the session. There are 12 pre-treatment (exogenous) variables in Esterling et al., (2011), and we choose one of them as the instrumental variable ZZ. The instrumental variable ZZ equals 11 for each subject who was able to answer at least four of the “Delli Carpini and Keeter five” items correctly on the baseline survey, indicating high political knowledge, and 0 otherwise. We consider that a subject with high political knowledge might be more likely to participate in the session because they were more likely to show a willingness to participate, so Assumption 3 might be reasonable. In fact, we expect the political knowledge to be independent of the missing mechanism conditional on the participation status and the possibly missing outcome, which implies Assumption 10. The observed data are reported in Table 4.

Table 4: A real data about the political analysis
Z=1Z=1, A=1A=1 Z=1Z=1, A=0A=0 Z=0Z=0, A=1A=1 Z=0Z=0, A=0A=0
R=1R=1, Y=1Y=1 130 139 62 82
R=1R=1, Y=2Y=2 67 24 12 11
R=0R=0, Y=.Y=. 21 72 5 45

The sample size nn equals 670670, with 143143 missing values. The overall missing proportion is about 21.34%21.34\% (143/6700.2134143/670\approx 0.2134). In the treatment group (A=1A=1), consisting of those who participated in the session, the missing proportion is about 8.75%8.75\% (26/2970.087526/297\approx 0.0875), while in the control group (A=0A=0), consisting of those who did not participate in the session, the missing proportion is about 31.36%31.36\% (117/3730.3136117/373\approx 0.3136). Those who participated in the session were more likely to respond to the follow-up survey than those who did not. We therefore consider that the missing mechanism was influenced by whether the subject participated in the session. The missing indicator RR may also depend on the possibly missing outcome YY because of the nature of the questions in the follow-up survey. Hence it is possible that the outcomes are missing not at random.

The estimates of \bm{\theta} and CACE are shown in Table 5. The columns of Table 5 correspond to the parameter, point estimate, standard deviation (SD), and 95% confidence interval, respectively. The assumed missing mechanism is similar to equation (5.1) in Section 5.

Table 5: Estimates of the parameters and CACE

Parameter          Estimate      SD        95% Confidence Interval
\hat{\alpha}        1.6204     0.0349      [1.5519, 1.6888]
\hat{\beta}        -0.2225     0.0234      [-0.2684, -0.1766]
\hat{\gamma}        0.1249     0.0249      [0.0759, 0.1739]
\widehat{CACE}      1.3234     0.0190      [1.2861, 1.3608]

From Table 5, we can see that, under the assumed missing mechanism in equation (5.1), the sign of \hat{\beta} is negative and the sign of \hat{\gamma} is positive. This means that the more a subject disagrees with the first question in the follow-up survey, the more likely the outcome Y is to be missing, and that subjects who did not participate in the session were more likely to have missing outcomes. The estimate of CACE is 1.3234, which indicates that participating in the session has a significantly positive effect on Officials care: a subject who participated in the session might shift their opinion on the first question from agreement toward disagreement. Our results are similar to those of Esterling et al., (2011), who also reported a significant effect.

7 Discussions

How to obtain average causal effects from nonignorable missing data is a challenging problem. In fact, under the principal stratification framework, the average causal effect for the whole population cannot be estimated; we can only estimate the complier average causal effect. Moreover, in practice we cannot observe the compliance status of a unit. With nonignorable missing outcomes Y, the identification of CACE cannot be established without additional assumptions. Several different missing data mechanisms have been discussed in previous work.

In this paper, we compare the LI and ODN missing data mechanisms with the shadow variable assumption, and we then establish nonparametric identification of the full data distribution from nonignorable missing outcomes with a shadow variable. However, it might not be easy to find such a variable in a real analysis, and some prior knowledge is usually needed in practice. For inference, we impose a parametric assumption on the missing data mechanism and no assumption on the outcome model. With the GMM method, we estimate the parameters in the missingness propensity, and we establish consistency and asymptotic results for the estimated parameters. We then recover the mean value of the outcome Y and estimate CACE by equation (4.10). Below, we discuss a few possible future generalizations of the proposed approach.

7.1 Adjustment for measured confounding with a covariate V

When a discrete covariate V is fully observed, we can adjust for the measured confounding with a stratification method. The assumptions are then imposed conditional on V.

Randomization Assumption 2 is replaced by: for given V, (Y(0),Y(1),A(0),A(1)) is independent of Z, that is, Z\perp\!\!\!\perp(Y(0),Y(1),A(0),A(1))\mid V. Shadow variable Assumption 10 is replaced by: (a'): Z\perp\!\!\!\perp R\mid(Y,V,A); (b'): Z\not\perp\!\!\!\perp Y\mid(R=1,V,A). Under these assumptions conditional on V and the other essential assumptions, Definition 1 of CACE is replaced by CACE=E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp,V_{i}=v_{i}\right].

We present a figure to help readers understand the framework of this paper with a covariate V.

Figure 4: A directed acyclic graph model for this paper with a covariate V (nodes V, U, Z, A, Y, and R)

And then CACE can be estimated by the following equation:

E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp,V_{i}=v_{i}\right]=\frac{E\left[Y_{i}\mid Z_{i}=1,V_{i}=v_{i}\right]-E\left[Y_{i}\mid Z_{i}=0,V_{i}=v_{i}\right]}{E\left[A_{i}\mid Z_{i}=1,V_{i}=v_{i}\right]-E\left[A_{i}\mid Z_{i}=0,V_{i}=v_{i}\right]}.
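As a rough illustration of this stratified Wald-type ratio, the following Python sketch computes it from fully observed binary data; the function name and the synthetic inputs are ours and purely illustrative, and the sketch ignores the nonignorable-missingness adjustment developed in Section 4.

import numpy as np

def stratified_wald(y, a, z, v):
    """Within each level v of V, estimate E[Y(1)-Y(0) | U=cp, V=v] by the Wald ratio."""
    est = {}
    for level in np.unique(v):
        m = (v == level)
        num = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()
        den = a[m & (z == 1)].mean() - a[m & (z == 0)].mean()
        est[level] = num / den
    return est

# Toy example with a binary covariate V (synthetic data, for illustration only).
rng = np.random.default_rng(0)
n = 2000
v = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
a = rng.binomial(1, 0.3 + 0.4 * z)             # instrument encourages treatment
y = rng.binomial(1, 0.2 + 0.3 * a + 0.1 * v)   # treatment raises the outcome
print(stratified_wald(y, a, z, v))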

7.2 How to obtain the average causal effects for the whole population

Wang and Tchetgen Tchetgen, (2018) discussed two frameworks for the instrumental variable. Under linear structural equation models (SEMs), which imply effect homogeneity, the average causal effects can be identified and estimated. However, structural equation models are sensitive to the model assumptions. In this paper, we impose no assumption on the outcome model for either identification or estimation. Imposing some assumptions on the outcome model might make it possible to estimate the average causal effects from nonignorable missing outcomes under some new missing data mechanisms.

References

  • Angrist and Imbens, (1995) Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American statistical Association, 90(430):431–442.
  • Angrist et al., (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455.
  • Baker and Laird, (1988) Baker, S. G. and Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical association, 83(401):62–69.
  • Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Chen et al., (2015) Chen, H., Ding, P., Geng, Z., and Zhou, X.-H. (2015). Semiparametric inference of the complier average causal effect with nonignorable missing outcomes. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):1–15.
  • Chen et al., (2009) Chen, H., Geng, Z., and Zhou, X.-H. (2009). Identifiability and estimation of causal effects in randomized trials with noncompliance and completely nonignorable missing data. Biometrics, 65(3):675–682.
  • Chen, (2003) Chen, H. Y. (2003). A note on the prospective analysis of outcome-dependent samples. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):575–584.
  • Chen, (2004) Chen, H. Y. (2004). Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association, 99(468):1176–1189.
  • Chen, (2007) Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413–421.
  • Darolles et al., (2011) Darolles, S., Fan, Y., Florens, J.-P., and Renault, E. (2011). Nonparametric instrumental regression. Econometrica, 79(5):1541–1565.
  • Ding and Geng, (2014) Ding, P. and Geng, Z. (2014). Identifiability of subgroup causal effects in randomized experiments with nonignorable missing covariates. Statistics in Medicine, 33(7):1121–1133.
  • Ding and Li, (2018) Ding, P. and Li, F. (2018). Causal inference: A missing data perspective. Statistical Science, 33(2):214–237.
  • d’Haultfoeuille, (2010) d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1):1–15.
  • Esterling et al., (2011) Esterling, K. M., Neblo, M. A., and Lazer, D. M. (2011). Estimating treatment effects in the presence of noncompliance and nonresponse: The generalized endogenous treatment model. Political Analysis, 19(2):205–226.
  • Frangakis and Rubin, (1999) Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86(2):365–379.
  • Frangakis and Rubin, (2002) Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1):21–29.
  • Goldberger, (1972) Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica: Journal of the Econometric Society, 40:979–1001.
  • Greenlees et al., (1982) Greenlees, J. S., Reece, W. S., and Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77(378):251–261.
  • Hall, (2005) Hall, A. R. (2005). Generalized method of moments. Oxford: Oxford University Press.
  • Hansen, (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society, pages 1029–1054.
  • Higgins et al., (2008) Higgins, J. P., White, I. R., and Wood, A. M. (2008). Imputation methods for missing outcome data in meta-analysis of clinical trials. Clinical trials, 5(3):225–239.
  • Horvitz and Thompson, (1952) Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685.
  • Imai, (2009) Imai, K. (2009). Statistical analysis of randomized experiments with non-ignorable missing binary outcomes: an application to a voting experiment. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(1):83–104.
  • Imbens and Angrist, (1994) Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
  • Kim and Yu, (2011) Kim, J. K. and Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106(493):157–165.
  • Kott, (2014) Kott, P. S. (2014). Calibration weighting when model and calibration variables can differ. In Contributions to sampling statistics, pages 1–18. Springer.
  • Li and Zhou, (2017) Li, W. and Zhou, X.-H. (2017). Identifiability and estimation of causal mediation effects with missing data. Statistics in Medicine, 36(25):3948–3965.
  • Little, (1993) Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421):125–134.
  • Little, (1994) Little, R. J. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika, 81(3):471–483.
  • Little and Rubin, (2002) Little, R. J. and Rubin, D. B. (2002). Statistical analysis with missing data. Wiley: New York.
  • Miao et al., (2016) Miao, W., Ding, P., and Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516):1673–1683.
  • Miao et al., (2019) Miao, W., Liu, L., Tchetgen, E. T., and Geng, Z. (2019). Identification, doubly robust estimation, and semiparametric efficiency theory of nonignorable missing data with a shadow variable. arXiv preprint arXiv:1509.02556v3.
  • Miao and Tchetgen, (2018) Miao, W. and Tchetgen, E. T. (2018). Identification and inference with nonignorable missing covariate data. Statistica Sinica, 28(4):2049–2067.
  • Miao and Tchetgen Tchetgen, (2016) Miao, W. and Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2):475–482.
  • Newey and McFadden, (1994) Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111–2245.
  • Newey and Powell, (2003) Newey, W. K. and Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578.
  • Neyman, (1923) Neyman, J. S. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480). Annals of Agricultural Sciences, 10:1–51.
  • Niu et al., (2014) Niu, C., Guo, X., Xu, W., and Zhu, L. (2014). Empirical likelihood inference in linear regression with nonignorable missing response. Computational Statistics & Data Analysis, 79:91–112.
  • O’Malley and Normand, (2005) O’Malley, A. J. and Normand, S.-L. T. (2005). Likelihood methods for treatment noncompliance and subsequent nonresponse in randomized trials. Biometrics, 61(2):325–334.
  • Osius, (2004) Osius, G. (2004). The association between two random elements: A complete characterization and odds ratio models. Metrika, 60(3):261–277.
  • Qin et al., (2002) Qin, J., Leung, D., and Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457):193–200.
  • Robins and Ritov, (1997) Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in medicine, 16(3):285–319.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Roy, (2003) Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Biometrics, 59(4):829–836.
  • Rubin, (1973) Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 29:159–183.
  • Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688–701.
  • Rubin, (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
  • Rubin, (1979) Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74(366a):318–328.
  • Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American statistical association, 75(371):591–593.
  • Shao and Wang, (2016) Shao, J. and Wang, L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103(1):175–187.
  • Stuart, (2010) Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1–21.
  • Tang et al., (2003) Tang, G., Little, R. J., and Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4):747–764.
  • Tang et al., (2014) Tang, N., Zhao, P., and Zhu, H. (2014). Empirical likelihood for estimating equations with nonignorably missing data. Statistica Sinica, 24:723–747.
  • Tauchen, (1985) Tauchen, G. (1985). Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1-2):415–443.
  • Wang and Tchetgen Tchetgen, (2018) Wang, L. and Tchetgen Tchetgen, E. (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):531–550.
  • Wang et al., (2014) Wang, S., Shao, J., and Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24:1097–1116.
  • Wright, (1928) Wright, P. G. (1928). Tariff on animal and vegetable oils. Macmillan Company, New York.
  • Wu and Carroll, (1988) Wu, M. C. and Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics, 44:175–188.
  • Yang et al., (2019) Yang, S., Wang, L., and Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4):875–888.
  • Zhang et al., (2018) Zhang, L., Lin, C., and Zhou, Y. (2018). Generalized method of moments for nonignorable missing data. Statistica Sinica, 28(4):2107–2124.
  • Zhao et al., (2013) Zhao, H., Zhao, P. Y., and Tang, N. S. (2013). Empirical likelihood inference for mean functionals with nonignorably missing response data. Computational Statistics & Data Analysis, 66:101–116.
  • Zhao and Shao, (2015) Zhao, J. and Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512):1577–1590.
  • Zhou and Li, (2006) Zhou, X.-H. and Li, S. M. (2006). ITT analysis of randomized encouragement design studies with missing data. Statistics in Medicine, 25(16):2737–2761.

Supplementary Materials


Appendix A: Proofs of the propositions

Throughout the paper, \mathcal{P}(\cdot) denotes the counting measure for a discrete variable.

Proof of Proposition 1

Under Assumption 10, we have that, for all (A,Y,Z),

\operatorname{OR}(a,y,z)=\operatorname{OR}(a,y)\equiv\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)},  (4)
\begin{gathered}f(y,r\mid a,z)=c(a,z)f(r\mid a,y=1)f(y\mid r=1,a,z)\{\operatorname{OR}(a,y)\}^{1-r},\\ c(a,z)=\frac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\frac{f(z\mid r=1,a)}{f(z\mid a)},\end{gathered}  (5)
f(r=1\mid a,y=1)=\frac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}.  (6)
Proof of equation (3.4)

Given Assumption 10 and according to equation (3.3), we have the following results

\begin{aligned}
\operatorname{OR}(a,y,z)&=\frac{f(y\mid r=0,a,z)f(y=1\mid r=1,a,z)}{f(y\mid r=1,a,z)f(y=1\mid r=0,a,z)}\\
&=\dfrac{\dfrac{f(y,r=0,a,z)}{f(r=0,a,z)}\dfrac{f(y=1,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(y=1,r=0,a,z)}{f(r=0,a,z)}}\\
&=\frac{f(y,r=0,a,z)f(y=1,r=1,a,z)}{f(y,r=1,a,z)f(y=1,r=0,a,z)}\\
&=\dfrac{\dfrac{f(y,r=0,a,z)}{f(a,y)}\dfrac{f(y=1,r=1,a,z)}{f(a,y=1)}}{\dfrac{f(y,r=1,a,z)}{f(a,y)}\dfrac{f(y=1,r=0,a,z)}{f(a,y=1)}}\\
&=\frac{f(r=0,z\mid a,y)f(r=1,z\mid a,y=1)}{f(r=1,z\mid a,y)f(r=0,z\mid a,y=1)}\\
&=\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\equiv\operatorname{OR}(a,y).
\end{aligned}
Proof of equation (3.5)

To prove equation (3.5), we consider the cases R=1 and R=0 separately. When R=1, we have that

\begin{aligned}
\text{right side of equation (3.5)}&=c(a,z)f(r=1\mid a,y=1)f(y\mid r=1,a,z)\\
&=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(r=1\mid a,y=1)f(y\mid r=1,a,z)\\
&=f(r=1\mid a)\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(y\mid r=1,a,z)\\
&=\dfrac{f(r=1,a)}{f(a)}\dfrac{\dfrac{f(z,r=1,a)}{f(r=1,a)}}{\dfrac{f(z,a)}{f(a)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\\
&=\dfrac{f(y,r=1,a,z)}{f(z,a)}=f(y,r=1\mid a,z)=\text{left side of equation (3.5)}.
\end{aligned}

When R=0, from equation (3.4) and Assumption 10, we have that

\begin{aligned}
\text{right side of equation (3.5)}&=c(a,z)f(r=0\mid a,y=1)f(y\mid r=1,a,z)\operatorname{OR}(a,y)\\
&=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(r=0\mid a,y=1)f(y\mid r=1,a,z)\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\\
&=f(r=1\mid a)\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(y\mid r=1,a,z)\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}\\
&=\dfrac{f(r=1,a)}{f(a)}\dfrac{\dfrac{f(z,r=1,a)}{f(r=1,a)}}{\dfrac{f(z,a)}{f(a)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(r=0,z\mid a,y)}{f(r=1,z\mid a,y)}\\
&=\dfrac{f(z,r=1,a)}{f(z,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(r=0,z,a,y)}{f(r=1,z,a,y)}\\
&=\dfrac{f(r=0,y,z,a)}{f(z,a)}=f(y,r=0\mid a,z)=\text{left side of equation (3.5)}.
\end{aligned}

In conclusion, equation (3.5) holds for both R=1 and R=0.

Proof of equation (3.6)

From equation (3.4), we notice that the odds ratio function \operatorname{OR}(a,y) is a function of the variables (A,Y) only. By the property of conditional expectation, we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(y\mid r=1,a)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y)}{f(r=1,a,y)}\dfrac{f(y,r=1,a)}{f(r=1,a)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y)}{f(r=1,a)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{1}{f(r=1,a)}\int f(r=0,a,y)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}.
\end{aligned}

Therefore,

\begin{aligned}
\text{right side of equation (3.6)}&=\dfrac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}{\dfrac{f(r=0\mid a)}{f(r=1\mid a)}+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}{\dfrac{f(r=0,a)}{f(r=1,a)}+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}{1+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}{\dfrac{f(r=0\mid a,y=1)+f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}\\
&=f(r=1\mid a,y=1)=\text{left side of equation (3.6)}.
\end{aligned}

Proof of Proposition 2

Under Assumption 10, we have that

f(r=1\mid a,y)=f(r=1\mid a,y,z)=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)},  (7)
f(y\mid r=0,a,z)=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]},  (8)
\begin{gathered}E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\dfrac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\\ \text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]},\end{gathered}  (9)
\operatorname{OR}(a,y)=\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}.  (10)
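To see what equation (9) amounts to in a setting like our application, suppose Y and Z are both binary; the following display is our own illustration under that assumption, not a restatement from the paper. The conditional expectation on the left side of (9) is then a finite sum,

\sum_{y\in\{1,2\}}\widetilde{\operatorname{OR}}(a,y)\,f(y\mid r=1,a,z)=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\qquad z\in\{0,1\},

which, for each treatment arm a, is a system of two linear equations in the two unknowns \widetilde{\operatorname{OR}}(a,1) and \widetilde{\operatorname{OR}}(a,2). The system has a unique solution whenever the 2\times 2 matrix \left[f(y\mid r=1,a,z)\right]_{z,y} is nonsingular, that is, whenever f(y\mid r=1,a,z) genuinely varies with z; this is the role played by Condition 1 and by part (b) of shadow variable Assumption 10 in this binary case.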
Proof of equation (3.7)

From Assumption 10 and equation (3.4), we have that

\begin{aligned}
\text{right side of equation (3.7)}&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)}\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(r=0\mid a,y=1)}\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)}}\\
&=\dfrac{1}{1+\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}}\\
&=\dfrac{f(r=1\mid a,y)}{f(r=1\mid a,y)+f(r=0\mid a,y)}\\
&=f(r=1\mid a,y)=\text{left side of equation (3.7)},
\end{aligned}

and the first equality in equation (3.7), f(r=1\mid a,y)=f(r=1\mid a,y,z), holds directly by the shadow variable Assumption 10.
Proof of equation (3.8)

From Assumption 10 and equation (3.4), by the property of conditional mathematical expectation, we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y,z)}{f(r=1\mid a,y,z)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{\dfrac{f(r=0,a,y,z)}{f(a,y,z)}}{\dfrac{f(r=1,a,y,z)}{f(a,y,z)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y,z)}{f(r=1,a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{1}{f(r=1,a,z)}\int f(r=0,a,y,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)}.
\end{aligned}

Therefore,

\begin{aligned}
\text{right side of equation (3.8)}&=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\,f(y\mid r=1,a,z)}{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a,z)}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y,z)}{f(r=1\mid a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0,a,y,z)}{f(r=1,a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{f(r=0,a,y,z)}{f(r=0,a,z)}\\
&=f(y\mid r=0,a,z)=\text{left side of equation (3.8)}.
\end{aligned}
Proof of equation (3.9)

From equation (3.6) and equation (3.8), we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a\right]&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)},\\
E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)},
\end{aligned}

and

\begin{aligned}
\widetilde{\operatorname{OR}}(a,y)&=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}}{\dfrac{f(r=1\mid a,y=1)f(r=0,a)}{f(r=0\mid a,y=1)f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}}{\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{f(r=0\mid a,y)f(r=1,a)}{f(r=1\mid a,y)f(r=0,a)}\\
&=\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}.
\end{aligned}

Therefore,

\begin{aligned}
E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0\mid a,y,z)f(r=1,a)}{f(r=1\mid a,y,z)f(r=0,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0,a,y,z)f(r=1,a)}{f(r=1,a,y,z)f(r=0,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0,a,y,z)f(r=1,a)}{f(r=0,a)f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1,a)}{f(r=0,a)f(r=1,a,z)}\int f(r=0,a,y,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=0,a,z)f(r=1,a)}{f(r=0,a)f(r=1,a,z)}\\
&=\dfrac{\dfrac{f(r=0,a,z)}{f(r=0,a)}}{\dfrac{f(r=1,a,z)}{f(r=1,a)}}\\
&=\dfrac{f(z\mid r=0,a)}{f(z\mid r=1,a)}.
\end{aligned}
Proof of equation (3.10)

From equation (3.9), we have that

\begin{aligned}
\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}}{\dfrac{f(r=0\mid a,y=1)f(r=1\mid a)}{f(r=1\mid a,y=1)f(r=0\mid a)}}\\
&=\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\\
&=\operatorname{OR}(a,y).
\end{aligned}

Appendix B: Proofs of the Theorems

Proof of Theorem 1

From Proposition 2, we have that

E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},  (A.1)
\text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}.

Under shadow variable Assumption 10, we can identify the odds ratio function \operatorname{OR}(a,y). Because f(y\mid r=1,a,z) and f(z\mid r=1,a) are identifiable from the observed data, for any candidate \operatorname{OR}(a,y) we only need the observed data to evaluate the integral equation (A.1). Suppose that \widetilde{\operatorname{OR}}^{(1)}(a,y) and \widetilde{\operatorname{OR}}^{(2)}(a,y) are two solutions to equation (3.9):

E\left[\widetilde{\operatorname{OR}}^{(k)}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)}\qquad(k=1,2),

We have that

E\left[\widetilde{\operatorname{OR}}^{(1)}(a,y)-\widetilde{\operatorname{OR}}^{(2)}(a,y)\mid r=1,a,z\right]=0.

Condition 1 implies that \widetilde{\operatorname{OR}}^{(1)}(a,y)-\widetilde{\operatorname{OR}}^{(2)}(a,y)=0 almost surely, that is, \widetilde{\operatorname{OR}}^{(1)}(a,y)=\widetilde{\operatorname{OR}}^{(2)}(a,y) almost surely. Therefore, equation (A.1) has a unique solution, which means \widetilde{\operatorname{OR}}(a,y) is identifiable. Based on equation (3.10), which relates \widetilde{\operatorname{OR}}(a,y) to \operatorname{OR}(a,y), we can finally identify the odds ratio function as \operatorname{OR}(a,y)=\widetilde{\operatorname{OR}}(a,y)/\widetilde{\operatorname{OR}}(a,y=1).

Proof of Theorem 2

We first introduce some notation. For simplicity, let \bm{s}=\left(y,r,a,z\right)^{\mathrm{T}} be the random vector collecting all the random variables. Let g_{m}\left(\bm{s},\bm{\theta}\right)=g_{m}\left(y,r,a,z,\bm{\theta}\right), \bm{\theta}\in\Theta, m=1,\cdots,q, denote the moment functions in (4.3). Let

\begin{aligned}
G(\bm{s},\bm{\theta})&=\left(g_{1}(\bm{s},\bm{\theta}),g_{2}(\bm{s},\bm{\theta}),\cdots,g_{q}(\bm{s},\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
G_{0}(\bm{\theta})&=\left(E\left[g_{1}(\bm{s},\bm{\theta})\right],E\left[g_{2}(\bm{s},\bm{\theta})\right],\cdots,E\left[g_{q}(\bm{s},\bm{\theta})\right]\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
G_{n}(\bm{s}_{i},\bm{\theta})&=\left(g_{1}\left(\bm{s}_{i},\bm{\theta}\right),g_{2}\left(\bm{s}_{i},\bm{\theta}\right),\cdots,g_{q}\left(\bm{s}_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad i=1,\cdots,n,\quad\bm{\theta}\in\Theta,\\
\widehat{G}_{n}(\bm{\theta})&=\left(\frac{1}{n}\sum_{i=1}^{n}g_{1}\left(\bm{s}_{i},\bm{\theta}\right),\frac{1}{n}\sum_{i=1}^{n}g_{2}\left(\bm{s}_{i},\bm{\theta}\right),\cdots,\frac{1}{n}\sum_{i=1}^{n}g_{q}\left(\bm{s}_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
Q_{0}(\bm{\theta})&=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right],\\
Q_{n}(\bm{\theta})&=\left[\widehat{G}_{n}(\bm{\theta})\right]^{\mathrm{T}}W\left[\widehat{G}_{n}(\bm{\theta})\right],\\
\widehat{Q}_{n}(\bm{\theta})&=\left[\widehat{G}_{n}(\bm{\theta})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{G}_{n}(\bm{\theta})\right],
\end{aligned}

where W is a positive semi-definite and symmetric q\times q matrix of weights, and \widehat{W}=W(\tilde{\bm{\theta}}). Here \bm{\theta} is a p-dimensional parameter, \tilde{\bm{\theta}} is the estimator of \bm{\theta} from the first step of the two-step GMM, that is, \tilde{\bm{\theta}}=\mathop{\arg\min}_{\bm{\theta}\in\Theta}Q_{n}(\bm{\theta}), and \bm{\theta}_{0} is the true value of the parameter \bm{\theta}. For a matrix B=\left[b_{jk}\right], let \|B\|=\left(\sum_{j,k}b_{jk}^{2}\right)^{1/2} be the Euclidean norm. Let \nabla_{\bm{\theta}}(\cdot) and \nabla_{\bm{\theta}\bm{\theta}}(\cdot) denote the first- and second-order derivatives with respect to \bm{\theta}. Let \widehat{\bm{\theta}}=\mathop{\arg\min}_{\bm{\theta}\in\Theta}\widehat{Q}_{n}(\bm{\theta}).
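To fix ideas, the following Python sketch implements the generic two-step GMM recipe that this notation describes: an identity-weight first stage gives \tilde{\bm{\theta}}, a weight matrix is then re-estimated at \tilde{\bm{\theta}}, and the second stage gives \widehat{\bm{\theta}}. The moment function used in the toy example is a placeholder of our own; it is not the moment function g_{m} of equation (4.3), and the choice \widehat{W} as an inverse sample second-moment matrix is one standard option rather than the paper's specification.

import numpy as np
from scipy.optimize import minimize

def two_step_gmm(G, theta_init):
    """G(theta) returns the n x q matrix of moment contributions g_m(s_i, theta)."""
    def objective(theta, W):
        g_bar = G(theta).mean(axis=0)   # \hat{G}_n(theta)
        return g_bar @ W @ g_bar        # \hat{Q}_n(theta)

    q = G(theta_init).shape[1]
    # Step 1: minimize with the identity weight matrix to get theta_tilde.
    theta_tilde = minimize(objective, theta_init, args=(np.eye(q),)).x
    # Step 2: re-estimate the weight matrix at theta_tilde and minimize again.
    Gi = G(theta_tilde)
    W_hat = np.linalg.inv(Gi.T @ Gi / Gi.shape[0])
    theta_hat = minimize(objective, theta_tilde, args=(W_hat,)).x
    return theta_hat, W_hat

# Toy illustration: moments for the mean and variance of a normal sample (not the paper's moments).
rng = np.random.default_rng(1)
s = rng.normal(2.0, 1.5, size=500)
G = lambda th: np.column_stack([s - th[0], (s - th[0]) ** 2 - th[1]])
print(two_step_gmm(G, np.array([0.0, 1.0]))[0])   # roughly [2.0, 2.25]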

Uniform convergence in probability: \widehat{Q}_{n}(\bm{\theta}) converges uniformly in probability to Q_{0}(\bm{\theta}) means that, as n\rightarrow\infty,

\sup_{\bm{\theta}\in\Theta}\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|\stackrel{P}{\longrightarrow}0.

To prove Theorem 2, we first need some lemmas. More details can be found in Newey and McFadden, (1994).

Lemma 1.

(GMM identification) If W is a positive semi-definite matrix, G_{0}(\bm{\theta}_{0})=\bm{0}, and WG_{0}(\bm{\theta})\neq\bm{0} for \bm{\theta}\neq\bm{\theta}_{0}, then Q_{0}(\bm{\theta})=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right] has a unique minimum at \bm{\theta}_{0}.

Proof of Lemma 1. Let D be such that D^{\mathrm{T}}D=W. If \bm{\theta}\neq\bm{\theta}_{0}, then \bm{0}\neq WG_{0}(\bm{\theta})=D^{\mathrm{T}}DG_{0}(\bm{\theta}) implies DG_{0}(\bm{\theta})\neq\bm{0}, and hence for \bm{\theta}\neq\bm{\theta}_{0}, Q_{0}(\bm{\theta})=\left[DG_{0}(\bm{\theta})\right]^{\mathrm{T}}\left[DG_{0}(\bm{\theta})\right]>Q_{0}(\bm{\theta}_{0}).

Lemma 2.

If the data are i.i.d., \Theta is compact, \bm{e}\left(\bm{s},\bm{\theta}\right) is continuous at each \bm{\theta}\in\Theta with probability one, and there is b(\bm{s}) with \|\bm{e}\left(\bm{s},\bm{\theta}\right)\|\leqslant b(\bm{s}) for all \bm{\theta}\in\Theta and E[b(\bm{s})]<\infty, then E[\bm{e}\left(\bm{s},\bm{\theta}\right)] is continuous and \sup_{\bm{\theta}\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{e}\left(\bm{s}_{i},\bm{\theta}\right)-E[\bm{e}\left(\bm{s},\bm{\theta}\right)]\right\|\stackrel{P}{\longrightarrow}0.

Proof of Lemma 2. It is implied by Lemma 1 of Tauchen, (1985).

Lemma 3.

If there is a function Q_{0}(\bm{\theta}) such that (i) Q_{0}(\bm{\theta}) is uniquely minimized at \bm{\theta}_{0}; (ii) \Theta is compact; (iii) Q_{0}(\bm{\theta}) is continuous; (iv) \widehat{Q}_{n}(\bm{\theta}) converges uniformly in probability to Q_{0}(\bm{\theta}), then \widehat{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0}.

Proof of Lemma 3. By the definition of the minimum and the hypotheses of Lemma 3, for any \varepsilon>0 we have, with probability approaching one,

\widehat{Q}_{n}(\widehat{\bm{\theta}})<\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)+\varepsilon/3,\quad Q_{0}(\widehat{\bm{\theta}})<\widehat{Q}_{n}(\widehat{\bm{\theta}})+\varepsilon/3,\quad\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon/3.

Therefore, with probability approaching one

Q_{0}(\widehat{\bm{\theta}})<\widehat{Q}_{n}(\widehat{\bm{\theta}})+\varepsilon/3<\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)+2\varepsilon/3<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon.

Thus, for any \varepsilon>0, with probability approaching one

Q_{0}(\widehat{\bm{\theta}})<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon.

Let \mathscr{N} be any open subset of \Theta containing \bm{\theta}_{0}. Since \Theta\cap\mathscr{N}^{c} is compact, by (i) and (iii),

Q_{0}\left(\bm{\theta}^{*}\right)\equiv\inf_{\bm{\theta}\in\Theta\cap\mathscr{N}^{c}}Q_{0}(\bm{\theta})>Q_{0}\left(\bm{\theta}_{0}\right),

for some \bm{\theta}^{*}\in\Theta\cap\mathscr{N}^{c}. Thus, choosing \varepsilon=Q_{0}\left(\bm{\theta}^{*}\right)-Q_{0}\left(\bm{\theta}_{0}\right), it follows that, with probability approaching one,

Q_{0}(\widehat{\bm{\theta}})<Q_{0}\left(\bm{\theta}^{*}\right),

and hence \widehat{\bm{\theta}}\in\mathscr{N} with probability approaching one. By the arbitrariness of \mathscr{N},

\widehat{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0}.

Proof of Theorem 2.

We proceed by verifying the hypotheses of Lemma 3. Condition (i) in Lemma 3 follows from the conditions assumed in Theorem 2 and Lemma 1. Condition (ii) in Lemma 3 holds by the conditions of Theorem 2. By Lemma 2 applied to \bm{e}\left(\bm{s},\bm{\theta}\right)=G(\bm{s},\bm{\theta}), we have \sup_{\bm{\theta}\in\Theta}\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|\stackrel{P}{\longrightarrow}0 and G_{0}(\bm{\theta}) is continuous. Thus, (iii) in Lemma 3 holds because Q_{0}(\bm{\theta})=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right] is continuous. Since \Theta is compact, G_{0}(\bm{\theta}) is bounded on \Theta.

Suppose that \widehat{W} converges to W in probability. By the triangle and Cauchy-Schwarz inequalities,

\begin{aligned}
\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|&\leqslant\left|\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]\right|+\left|\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}\left(\widehat{W}+\widehat{W}^{\mathrm{T}}\right)\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]\right|\\
&\qquad+\left|\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}\left(\widehat{W}-W\right)G_{0}(\bm{\theta})\right|\\
&\leqslant\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|^{2}\|\widehat{W}\|+2\left\|G_{0}(\bm{\theta})\right\|\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|\|\widehat{W}\|\\
&\qquad+\left\|G_{0}(\bm{\theta})\right\|^{2}\|\widehat{W}-W\|,
\end{aligned}

so that

\sup_{\bm{\theta}\in\Theta}\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|\stackrel{P}{\longrightarrow}0,

and (iv) in Lemma 3 holds.

With W being the identity matrix I_{q\times q}, the same argument shows that

\tilde{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0},

which implies \widehat{W}\stackrel{P}{\longrightarrow}W.

Proof of Theorem 3

Under the assumptions in Theorem 2, \nabla_{\bm{\theta}}\widehat{Q}_{n}(\widehat{\bm{\theta}})=0 with probability approaching one. By Taylor's expansion around \bm{\theta}_{0},

\nabla_{\bm{\theta}}\widehat{Q}_{n}(\widehat{\bm{\theta}})-\nabla_{\bm{\theta}}\widehat{Q}_{n}(\bm{\theta}_{0})=\nabla_{\bm{\theta}\bm{\theta}}\widehat{Q}_{n}(\bm{\theta}^{*})(\widehat{\bm{\theta}}-\bm{\theta}_{0}),

where \bm{\theta}^{*} lies between \widehat{\bm{\theta}} and \bm{\theta}_{0}. Let \widehat{H}(\widehat{\bm{\theta}})=\nabla_{\bm{\theta}}\widehat{G}_{n}(\widehat{\bm{\theta}}). Multiplying through by \sqrt{n} and solving gives

\sqrt{n}\left(\widehat{\bm{\theta}}-\bm{\theta}_{0}\right)=-\left\{\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{H}(\bm{\theta}^{*})\right]\right\}^{-1}\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\sqrt{n}\,\widehat{G}_{n}(\bm{\theta}_{0})\right].

By (iii) in Condition 3, \widehat{H}(\widehat{\bm{\theta}})\stackrel{P}{\longrightarrow}H and \widehat{H}(\bm{\theta}^{*})\stackrel{P}{\longrightarrow}H, so that by (iv) and the continuity of matrix inversion,

-\left\{\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{H}(\bm{\theta}^{*})\right]\right\}^{-1}\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\stackrel{P}{\longrightarrow}-\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W.

Based on the Central Limit Theorem,

\sqrt{n}\,\widehat{G}_{n}(\bm{\theta}_{0})\stackrel{\mathscr{L}}{\longrightarrow}N(\bm{0},\Omega),

and the conclusion then follows by Slutsky's theorem:

\sqrt{n}(\widehat{\bm{\theta}}-\bm{\theta}_{0})\stackrel{\mathscr{L}}{\longrightarrow}N\left(\bm{0},\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W\Omega WH\left(H^{\mathrm{T}}WH\right)^{-1}\right).
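The sandwich form of this asymptotic covariance is straightforward to compute once plug-in estimates of H, W, and \Omega are available. The short Python sketch below does this for generic matrices; the inputs H_hat, W_hat, and Omega_hat are placeholders of our own for the corresponding plug-in estimates and are not taken from the paper.

import numpy as np

def sandwich_cov(H_hat, W_hat, Omega_hat, n):
    """Estimated covariance of theta_hat: (H'WH)^{-1} H'W Omega W H (H'WH)^{-1} / n."""
    bread = np.linalg.inv(H_hat.T @ W_hat @ H_hat)
    meat = H_hat.T @ W_hat @ Omega_hat @ W_hat @ H_hat
    return bread @ meat @ bread / n

# Toy plug-in matrices with q = 3 moments and p = 2 parameters (illustrative only).
H_hat = np.array([[1.0, 0.2], [0.3, 1.1], [0.5, 0.4]])
Omega_hat = np.array([[1.0, 0.1, 0.0], [0.1, 1.2, 0.2], [0.0, 0.2, 0.9]])
W_hat = np.linalg.inv(Omega_hat)   # with this optimal weight, the variance simplifies to (H'Omega^{-1}H)^{-1}/n
print(sandwich_cov(H_hat, W_hat, Omega_hat, n=670))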

Proof of Theorem 4

Under the assumptions of Theorem 2 and Theorem 3, we proceed as follows.

By the consistency of \widehat{\bm{\theta}}, there is \lambda_{n}\rightarrow 0 such that \|\widehat{\bm{\theta}}-\bm{\theta}_{0}\|\leqslant\lambda_{n} with probability approaching one.

Let \Lambda_{n}(\bm{s})=\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|\leqslant\lambda_{n}}\left\|G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}}-G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right\|.

By the continuity of G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}} at \bm{\theta}_{0}, \Lambda_{n}(\bm{s})\rightarrow 0 with probability one, while by the dominance condition, for n large enough, \Lambda_{n}(\bm{s})\leqslant 2\sup_{\bm{\theta}\in\mathcal{N}}\left\|G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}}\right\|.

Then, by the dominated convergence theorem, E\left[\Lambda_{n}(\bm{s})\right]\rightarrow 0, so by Markov's inequality,

P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\right|>\varepsilon\right)\leqslant\frac{E\left[\Lambda_{n}(\bm{s})\right]}{\varepsilon}\rightarrow 0,

for all \varepsilon>0, giving \frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\stackrel{P}{\longrightarrow}0. By Khinchin's law of large numbers,

\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\stackrel{P}{\longrightarrow}E\left[G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right].

Also, with probability approaching one,

\begin{aligned}
\left\|\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}-\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\right\|&\leqslant\frac{1}{n}\sum_{i=1}^{n}\left\|G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}-G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\right\|\\
&\leqslant\frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\stackrel{P}{\longrightarrow}0.
\end{aligned}

So by the triangle inequality, we have that

\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}\stackrel{P}{\longrightarrow}E\left[G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right],

which means \widehat{\Omega}\stackrel{P}{\longrightarrow}\Omega. Theorem 2 and Theorem 3 imply that \widehat{H}\stackrel{P}{\longrightarrow}H and \widehat{W}\stackrel{P}{\longrightarrow}W. The conclusion follows by (iv) of Condition 3 and the continuity of matrix inversion and multiplication.