
footnotetext: Email: [email protected]

Causal Inference with Unmeasured Confounding from Nonignorable Missing Outcomes

Renzhong Zheng
School of Statistics, Beijing Normal University, Beijing, China
(May, 2023)
Abstract

Observational studies are the primary source of data for causal inference, but causal inference from observational data is challenging in the presence of unmeasured confounding. Missing data problems are also common in observational studies, and obtaining causal effects from nonignorable missing data with unmeasured confounding is particularly difficult. In this paper, we consider how to obtain the complier average causal effect with unmeasured confounding when the outcomes are nonignorably missing. We propose an auxiliary variable that plays two roles simultaneously: it serves as a shadow variable for identification and as an instrumental variable for inference. We also illustrate the differences between the missing outcome mechanisms used in previous work and the shadow variable assumption, and give a causal diagram to illustrate this setting. Under such a setting, we present a general condition for nonparametric identification of the full data law from the nonignorable missing outcomes with this auxiliary variable. For inference, we first recover the mean value of the outcome based on the generalized method of moments, and then propose an estimator that adjusts for the unmeasured confounding to obtain the complier average causal effect. We also establish asymptotic results for the estimated parameters. We evaluate the performance of the method via simulations and apply it to a real-life dataset from a political analysis.


Keywords: Complier average causal effect; Generalized method of moments; Instrumental variable; Missing not at random; Principal Stratification; Shadow variable.

1 Introduction

Causal inference plays an important role in many fields, such as epidemiology, economics and social sciences. Observational studies are the primary source of data for causal inference. In observational studies, if all the confounders are measured, one can use standard methods to adjust for the confounding bias, such as stratification and matching (Rubin,, 1973; Stuart,, 2010), the propensity score (Rosenbaum and Rubin,, 1983), inverse probability weighting (Horvitz and Thompson,, 1952), outcome regression-based estimation (Rubin,, 1979), and doubly robust estimation (Bang and Robins,, 2005). However, unmeasured confounding is present in most observational studies. In this case, one cannot identify the causal effects from the observational data without additional assumptions. Auxiliary variables are often used to identify causal effects while adjusting for unmeasured confounding. The instrumental variable (IV) is one such auxiliary variable. There are two frameworks used in IV analysis: one is the structural equation model (Wright,, 1928; Goldberger,, 1972); the other relies on the monotonicity assumption, which requires a monotone effect of the IV on the treatment. Under the monotonicity assumption, an IV can be used to identify and estimate the complier average causal effect (Imbens and Angrist,, 1994; Angrist and Imbens,, 1995; Angrist et al.,, 1996). More details about the two frameworks of the instrumental variable can be found in Wang and Tchetgen Tchetgen, (2018).

Missing data is also a common problem in observational studies. According to Rubin, (1976), the missing mechanism is called missing at random (MAR) or ignorable if it is independent of the missing values conditional on the observed data, which means the missing mechanism depends only on the observed data. Otherwise, it is called missing not at random (MNAR) or nonignorable (Little and Rubin,, 2002). Compared to MAR, MNAR is much more challenging. Identification is generally not available under MNAR without additional assumptions (Robins and Ritov,, 1997). In previous work, authors made sufficiently strong parametric assumptions about the full data law to ensure the validity of the identification (Wu and Carroll,, 1988; Little and Rubin,, 2002; Roy,, 2003). However, as noted by Miao et al., (2016) and Wang et al., (2014), the identification of fully parametric models can even fail under MNAR.

A new framework for identification and semiparametric inference was recently proposed by Miao and Tchetgen Tchetgen, (2016), Miao and Tchetgen, (2018), and Miao et al., (2019), building on earlier work by d’Haultfoeuille, (2010), Kott, (2014), Wang et al., (2014), and Zhao and Shao, (2015), which studied identification of several parametric and semiparametric models. A shadow variable is associated with the outcome that is prone to missingness, but is independent of the missing mechanism conditional on the treatment and the possibly unobserved outcome. In the context of missing covariate data, Miao and Tchetgen, (2018) studied identification of generalized linear models and some semiparametric models, and then proposed an inverse probability weighted estimator that incorporates the shadow variable to guarantee unbiased estimation. In the context of missing outcome data, Miao and Tchetgen Tchetgen, (2016) considered the identification of a location-scale model and described a doubly robust estimator. If a shadow variable is fully observed, it can be used, together with the fully observed covariates, to recover the distribution of the unobserved outcome that is missing not at random. Miao et al., (2019) used a valid shadow variable to establish nonparametric identification of the full data distribution under nonignorable missingness, and developed the semiparametric theory for some semiparametric estimators with a shadow variable under MNAR.

It is highly necessary to combine causal inference with missing data research (Ding and Li,, 2018). In the case of covariates missing not at random, Ding and Geng, (2014) showed the identification of the causal effects for four interpretable missing data mechanisms and proposed upper and lower bounds for the causal effects. When the unmeasured confounders are missing not at random, Yang et al., (2019) generalized the results in Ding and Geng, (2014) to establish a general condition for identification of the causal effects, and further developed parametric and nonparametric inference for the causal effects. With nonignorable missing outcomes, the identification and estimation of the causal effects is more challenging, and different types of missing mechanisms have been imposed to guarantee identification and estimation under the principal stratification framework. Frangakis and Rubin, (1999) proposed the latent ignorable (LI) missing data mechanism to guarantee identification and estimated the complier average causal effect (CACE). Under the LI missing data mechanism, O’Malley and Normand, (2005) proposed two methods based on moments and likelihood to estimate CACE for normally distributed outcomes. Zhou and Li, (2006) proposed both moment and maximum likelihood estimators of CACE for binary outcomes. Other authors proposed a different missing data mechanism, called the outcome-dependent nonignorable (ODN) missing data mechanism. Chen et al., (2009) and Imai, (2009) established the identification of CACE for discrete outcomes based on the likelihood method under the outcome-dependent nonignorable missing data mechanism. Chen et al., (2015) proposed an exponential family assumption about the conditional density of the outcome variable to establish semiparametric inference of CACE for continuous outcomes under the ODN missing data mechanism. Li and Zhou, (2017) established identification with a shadow variable and estimation of causal mediation effects when the outcomes are MNAR.

In this paper, we obtain CACE with unmeasured confounding from nonignorable missing outcomes. Different from the previous work, we impose no assumption on the missing outcome mechanism. Instead, we impose a shadow variable assumption on the instrumental variable to guarantee identification and estimate CACE. As shown in Figure 3, we allow an arrow from the treatment to the missing indicator, which is more reasonable in some applications. In Section 2, we illustrate some basic assumptions in causal inference. We also illustrate the difference between some missing data mechanisms and the shadow variable assumption, and then demonstrate the framework of this paper with an example. In Section 3, we use a shadow variable to identify the full data distribution nonparametrically under a certain completeness condition. In Section 4, we apply the generalized method of moments to estimate the missing mechanism. We also establish some asymptotic results for the estimated parameters in the missing mechanism. Then we recover the mean value of the outcome and propose an estimator of CACE. In Section 5, we conduct simulation studies to evaluate the performance of the parameter estimators in the missing mechanism and of the estimator of CACE. In Section 6, we use a real dataset from a political analysis to illustrate our approach. In Section 7, we conclude with some discussions. All proofs are provided in the Supplementary Materials.

2 Notation and Assumptions

2.1 Potential outcomes, causal effects, basic assumptions and instrumental variable

In this paper, we consider the situation where {(yi,ri,ai,zi):i=1,,n}\left\{\left(y_{i},r_{i},a_{i},z_{i}\right):i=1,\ldots,n\right\} is an independent and identically distributed sample from (Y,R,A,Z)\left(Y,R,A,Z\right). Vectors are assumed to be column vectors, unless explicitly transposed.

YY is the outcome of interest subject to missingness. We consider the outcome YY to be a binary variable. We let RR denote the missing indicator of YY: R=1R=1 if YY is observed and R=0R=0 otherwise. The observed data include (A,Z)\left(A,Z\right) for all samples and YY only for those with R=1R=1. We use lower-case letters for realized values of the corresponding variables, for example, yy for a value of the outcome variable YY. We use ff to denote a probability density or mass function. Suppose the observed data consist of nn independent and identically distributed samples.

We use the potential outcome model to define causal effects (Neyman,, 1923; Rubin,, 1974). For each unit ii, we consider ZZ to be a binary instrumental variable, where Zi=1Z_{i}=1 indicates that unit ii is assigned to the treatment group and Zi=0Z_{i}=0 indicates that unit ii is assigned to the control group. Let Ai(z)A_{i}(z) denote the potential treatment received by unit ii had the instrumental variable ZZ been set to level zz. Ai(z)=1A_{i}(z)=1 indicates that unit ii would receive the treatment if assigned zz, and Ai(z)=0A_{i}(z)=0 indicates that unit ii would receive the control if assigned zz. Similar to the definition of Ai(z)A_{i}(z), we let Yi(z,Ai(z))Y_{i}(z,A_{i}(z)) denote the potential outcome for unit ii if exposed to treatment Ai(z)A_{i}(z) after ZZ was set to level zz. For simplicity, we let Yi(z)Y_{i}(z) and Ri(z)R_{i}(z) denote the potential outcome and the potential missing indicator after ZZ was set to level zz, respectively. Let ZiZ_{i}, AiA_{i}, YiY_{i}, RiR_{i} denote their observed values for i=1,,ni=1,\dots,n.

The following assumption is standard in causal inference with observational studies (Rubin,, 1980; Angrist et al.,, 1996).

Assumption 1.

Stable Unit Treatment Value Assumption (SUTVA): There is no interference between units, meaning that a unit’s potential outcomes cannot be affected by the treatment status of other units, and there is only one version of the potential outcome for a given treatment.

Under the SUTVA assumption, the potential outcomes are well defined. When ZiAiZ_{i}\neq A_{i}, noncompliance occurs. Under the principal stratification framework (Angrist et al.,, 1996; Frangakis and Rubin,, 2002), we define UiU_{i} as the compliance status variable of unit ii:

Ui={at, if Ai(1)=1 and Ai(0)=1;cp, if Ai(1)=1 and Ai(0)=0;df, if Ai(1)=0 and Ai(0)=1;nt, if Ai(1)=0 and Ai(0)=0.U_{i}=\begin{cases}at,&\text{ if }A_{i}(1)=1\text{ and }A_{i}(0)=1;\\ cp,&\text{ if }A_{i}(1)=1\text{ and }A_{i}(0)=0;\\ df,&\text{ if }A_{i}(1)=0\text{ and }A_{i}(0)=1;\\ nt,&\text{ if }A_{i}(1)=0\text{ and }A_{i}(0)=0.\end{cases}

Because Ai(1)A_{i}(1) and Ai(0)A_{i}(0) can each take two values, the compliance status variable UiU_{i} has four different values: ntnt for never takers, atat for always takers, cpcp for compliers, and dfdf for defiers. Because we cannot observe Ai(1)A_{i}(1) and Ai(0)A_{i}(0) jointly, the compliance behavior of a unit is unknown, so UiU_{i} is an unobserved variable and it can be viewed as an unmeasured confounder.

Definition 1.

The Complier Average causal effect (CACE) is defined as CACE=E[Yi(1)Yi(0)Ui=cp]CACE=E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp\right].

Because Zi=AiZ_{i}=A_{i} for the compliers, CACE is a subgroup causal effect for the compliers, whose compliance status is incompletely observed.

Assumption 2.

Randomization: (Y(0),Y(1),A(0),A(1))(Y(0),Y(1),A(0),A(1)) is independent of ZZ.

Randomization means that Z(Y(0),Y(1),A(0),A(1))Z\perp\!\!\!\perp(Y(0),Y(1),A(0),A(1)).

Assumption 3.

Monotonicity: Ai(1)Ai(0)A_{i}(1)\geq A_{i}(0) for each unit ii.

The monotonicity assumption means that there are no defiers in the population. In some studies, the monotonicity assumption is plausible when the treatment assignment has a nonnegative effect on the treatment received for each unit.

Assumption 4.

Nonzero average causal effect of ZZ on AA: The average causal effect of ZZ on AA, E[Ai(1)Ai(0)]E\left[A_{i}(1)-A_{i}(0)\right], is not equal to zero.

Assumption 5.

Exclusion restrictions among never takers and always takers: Yi(1)=Yi(0)Y_{i}(1)=Y_{i}(0) if Ui=ntU_{i}=nt and Yi(1)=Yi(0)Y_{i}(1)=Y_{i}(0) if Ui=atU_{i}=at.

Assumption 5 means that the instrumental variable ZZ affects the outcome only through the treatment and has no direct effect on the outcome.

If the outcome YY is fully observed, CACE can be identified and estimated following Angrist et al., (1996). With nonignorable missing outcomes, some assumptions about the missing data mechanism are needed to estimate CACE. We compare several such assumptions in the next subsection.

2.2 Some Missing data mechanisms of the outcomes

Before illustrating the missing data mechanisms, note that Assumption 2 and Assumption 5 are replaced by Assumption 6 and Assumption 7, respectively, in some previous work. Assumption 6 and Assumption 7 are usually combined with the missing outcome mechanisms in the previous work. However, we only need Assumption 2 and Assumption 5 to estimate CACE in this paper, which are weaker versions of Assumption 6 and Assumption 7.

Assumption 6.

Complete Randomization: The treatment assignment ZZ is completely randomized.

Complete Randomization means that Z{A(1),A(0),Y(1),Y(0),R(1),R(0)}Z\perp\!\!\!\perp\{A(1),A(0),Y(1),Y(0),R(1),R(0)\}.

Assumption 7.

Compound exclusion restrictions: For never takers and always takers, we assume that f{Y(1),R(1)U=nt}=f{Y(0),R(0)U=nt}f\{Y(1),R(1)\mid U=nt\}=f\{Y(0),R(0)\mid U=nt\}, and f{Y(1),R(1)U=at}=f{Y(0),R(0)U=at}f\{Y(1),R(1)\mid U=at\}=f\{Y(0),R(0)\mid U=at\}.

Different from traditional Assumption 5, Frangakis and Rubin, (1999) extended it to the compound exclusion restrictions. Assumption 7 is stronger than Assumption 5. Under Assumption 6, Assumption 7 is equivalent to f(Y,RZ=1,U=nt)=f(Y,RZ=0,U=nt)f(Y,R\mid Z=1,U=nt)=f(Y,R\mid Z=0,U=nt) and f(Y,RZ=1,U=at)=f(Y,RZ=0,U=at)f(Y,R\mid Z=1,U=at)=f(Y,R\mid Z=0,U=at).

In previous work, some authors proposed the Latent ignorability (LI) assumption (Frangakis and Rubin,, 1999; Osius,, 2004; Zhou and Li,, 2006):

Assumption 8.

Latent ignorability assumption: f{R(z)Y(z),A(z),U}=f{R(z)U}f\{R(z)\mid Y(z),A(z),U\}=f\{R(z)\mid U\}

Latent ignorability implies that within each principal stratum, the potential outcomes are independent of the missing indicator, which means the missing data mechanism does not depend on the missing outcome. We give a graphical model to illustrate the LI missing data mechanism.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 1: A graph model for the LI missing data mechanism under Assumption 6 and 7

Chen et al., (2009) and Chen et al., (2015) proposed an Outcome-dependent nonignorable (ODN) missing data mechanism:

Assumption 9.

Outcome-dependent nonignorable assumption: For all y;z=0,1;a=0,1y;z=0,1;a=0,1; and u{at,cp,nt}u\in\{at,cp,nt\}, assuming

P{R(z)=1Y(z)=y,A(z)=a,U=u}\displaystyle P\{R(z)=1\mid Y(z)=y,A(z)=a,U=u\} =P{R(z)=1Y(z)=y}\displaystyle=P\{R(z)=1\mid Y(z)=y\}
P{R(1)=1Y(1)=y}\displaystyle P\{R(1)=1\mid Y(1)=y\} =P{R(0)=1Y(0)=y}.\displaystyle=P\{R(0)=1\mid Y(0)=y\}.

Under the complete randomization Assumption 6, Assumption 9 becomes P(R=1Y=y,A=a,U=u,Z=z)=P(R=1Y=y,Z=z)P(R=1\mid Y=y,A=a,U=u,Z=z)=P(R=1\mid Y=y,Z=z) and P(R=1Y=y,Z=1)=P(R=1Y=y,Z=0)P(R=1\mid Y=y,Z=1)=P(R=1\mid Y=y,Z=0). The outcome-dependent nonignorable assumption means that RR depends on YY, but is independent of (Z,A,U)(Z,A,U) given YY. Under the ODN missing data mechanism, the missing data indicator depends on the possibly missing outcome YY, which may be more reasonable than the LI missing data assumption in some applications. We give a graphical model to illustrate the ODN missing data mechanism.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 2: A graph model for the ODN missing data mechanism under Assumption 6 and 7

More examples and details about LI assumption and ODN assumption can be found in Chen et al., (2009) and Chen et al., (2015).

2.3 The framework of this paper

In this paper, we are interested in CACE with unmeasured confounding UU from the nonignorable missing outcomes YY.

In fact, the LI assumption and the ODN assumption usually require Assumption 6 and Assumption 7, which are stronger than Assumption 2 and Assumption 5. In this paper, we only need the weaker randomization Assumption 2 and exclusion restriction Assumption 5 to estimate CACE. Different from the above two missing outcome mechanisms, we adopt the shadow variable assumption to guarantee identification under the nonignorable missing outcomes.

We suppose that a fully observed auxiliary variable ZZ is called a shadow variable if it satisfies the following assumption (Miao and Tchetgen Tchetgen,, 2016; Miao et al.,, 2019).

Assumption 10.

(a): ZR(Y,A)Z\perp\!\!\!\perp R\mid(Y,A); (b): Z⟂̸Y(R=1,A)Z\not\perp\!\!\!\perp Y\mid(R=1,A).

Assumption 10 implies that the shadow variable is associated with the outcome when the outcome is observed, but it is independent of the missing mechanism conditional on the fully observed variables and the possibly unobserved outcome (d’Haultfoeuille,, 2010; Kott,, 2014; Wang et al.,, 2014; Zhao and Shao,, 2015). Therefore, Assumption 10 allows the data to be missing not at random.

In this paper, we suppose an auxiliary variable ZZ satisfies both Assumptions 1-5 and Assumption 10, so that it simultaneously plays the roles of the instrumental variable and the shadow variable. Different from the ODN missing data mechanism in Figure 2, we allow an arrow from AA to RR in Figure 3, which means the treatment can affect the missing indicator. This is more reasonable in some applications. Figure 3 illustrates the framework of this paper, and we present an example to help readers understand it.

[Directed acyclic graph over the variables U, Z, A, Y, R]
Figure 3: A directed acyclic graph model for the auxiliary variable ZZ
Example 1.

We consider a real data analysis in Esterling et al., (2011). As noted in Esterling et al., (2011), the authors aimed to assess whether participating in an online chat session produced higher levels of citizen efficacy among the constituents. In this paper, we choose the instrumental variable ZZ indicating political knowledge, which equals 11 for the constituents who have high political knowledge and 0 for those with low political knowledge. The treatment variable AA equals 11 for the constituents who participated in the online chat session and 0 otherwise. A week after each session, a survey company administered a follow-up survey to the constituents. The outcome variable YY is Officials care, denoting the attitude of a constituent toward a question in the follow-up survey, with responses including “agreement”, “disagreement” and so on. However, some subjects may not respond to the follow-up survey, so the outcomes have missing values. At the same time, it is reasonable that whether or not a constituent participated in the session would affect the missing mechanism of the outcomes. In Section 6, we illustrate this example in more detail.

In the next section, we show that the key to the identification of f(y,ra,z)f(y,r\mid a,z) is Assumption 10. If Assumption 10 is violated, which means a shadow variable is not available, identification of f(y,ra,z)f(y,r\mid a,z) is not guaranteed; even parametric missingness mechanisms may not be identifiable (Miao et al.,, 2016; Wang et al.,, 2014). We also illustrate the extra conditions needed to guarantee identification.

3 Identification

We aim to identify the joint distribution f(a,y,r,z)f(a,y,r,z). We say the joint distribution f(a,y,r,z)f(a,y,r,z) is identifiable if and only if it can be uniquely determined by the observed data distribution f(y,r=1a,z),f(r=0a,z)f(y,r=1\mid a,z),f(r=0\mid a,z) and f(a,z)f(a,z).

The joint distribution f(a,y,z,r)f(a,y,z,r) can be factorized as

f(a,y,z,r)=f(y,ra,z)f(a,z),f(a,y,z,r)=f(y,r\mid a,z)f(a,z), (3.1)

Because the variables (A,Z)\left(A,Z\right) are fully observed, f(a,z)f(a,z) is identifiable without additional assumptions, so we focus on the identification of f(y,ra,z)f(y,r\mid a,z). The observed data law is captured by f(y,r=1a,z)f(y,r=1\mid a,z), f(r=0a,z)f(r=0\mid a,z) and f(a,z)f(a,z), which are functionals of the joint law f(a,y,z,r)f(a,y,z,r). Then we factorize f(y,ra,z)f(y,r\mid a,z) as

f(y,ra,z)=f(yr,a,z)f(ra,z).f(y,r\mid a,z)=f(y\mid r,a,z)f(r\mid a,z). (3.2)

According to equation (3.2), f(yr,a,z)f(y\mid r,a,z) represents the outcome distribution for different data patterns: R=1R=1 for the observed data and R=0R=0 for the missing data. Although f(yr=1,a,z)f(y\mid r=1,a,z) can be obtained completely from the observed data, the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) cannot be obtained directly from the observed data under MNAR. The fundamental identification challenge in missing data problems is how to recover the full data distribution f(a,y)f(a,y) and the missingness process (or propensity score) f(ra,z)f(r\mid a,z) from the observed data distribution.

We use the odds ratio function which encodes the deviation between the observed data and missing data distributions to measure the missingness process (Little,, 1993, 1994).

OR(a,y,z)=f(yr=0,a,z)f(y=1r=1,a,z)f(yr=1,a,z)f(y=1r=0,a,z).\operatorname{OR}(a,y,z)=\frac{f(y\mid r=0,a,z)f(y=1\mid r=1,a,z)}{f(y\mid r=1,a,z)f(y=1\mid r=0,a,z)}. (3.3)

Here, we use y=1y=1 as a reference value; the analyst can use any other value within the support of YY. The odds ratio function can be used to impose a known relationship between the missingness and the outcome YY (Little,, 1993, 1994). In Proposition 1, the odds ratio function plays a central role under data MNAR, and it can be identified with a shadow variable.

In the following, we assume that OR(a,y,z)\operatorname{OR}(a,y,z) and E[OR(a,y,z)r=1,a,z]E\left[\operatorname{OR}(a,y,z)\mid r=1,a,z\right] are finite. Following previous work, we factorize the conditional density function f(y,ra,z)f(y,r\mid a,z) into the odds ratio function and two baseline distributions (Osius,, 2004; Chen,, 2003, 2004, 2007; Kim and Yu,, 2011; Miao and Tchetgen Tchetgen,, 2016; Miao et al.,, 2019), and we establish some results in Proposition 1 based on a valid shadow variable.

Proposition 1.

Under Assumption 10, we have that for all (A,Y,Z)(A,Y,Z)

OR(a,y,z)=OR(a,y)f(r=0a,y)f(r=1a,y=1)f(r=1a,y)f(r=0a,y=1),\operatorname{OR}(a,y,z)=\operatorname{OR}(a,y)\equiv\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}, (3.4)
f(y,ra,z)=c(a,z)f(ra,y=1)f(yr=1,a,z){OR(a,y)}1r,c(a,z)=f(r=1a)f(r=1a,y=1)f(zr=1,a)f(za),\begin{gathered}f(y,r\mid a,z)=c(a,z)f(r\mid a,y=1)f(y\mid r=1,a,z)\{\operatorname{OR}(a,y)\}^{1-r},\\[5.69054pt] c(a,z)=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\frac{f(z\mid r=1,a)}{f(z\mid a)},\end{gathered} (3.5)
f(r=1a,y=1)=E[OR(a,y)r=1,a]f(r=0a)/f(r=1a)+E[OR(a,y)r=1,a].f(r=1\mid a,y=1)=\dfrac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}. (3.6)

The proof of Proposition 1 can be found in the Supplementary Materials. From identity (3.4), the odds ratio function OR(a,y)\operatorname{OR}(a,y) is a function of the variables (A,Y)(A,Y) only under Assumption 10, which means the odds ratio function does not depend on the variable ZZ. Therefore, we denote the odds ratio function OR(a,y,z)\operatorname{OR}(a,y,z) by OR(a,y)\operatorname{OR}(a,y). According to equation (3.4), the odds ratio function is built from the propensity score f(r=1a,y)f(r=1\mid a,y), which depends on the outcome itself. The odds ratio function can also be used to measure whether selection bias exists; for instance, the value of OR(a,y)\operatorname{OR}(a,y) represents the deviation of the missingness mechanism from MAR (Miao et al.,, 2019).

According to Miao and Tchetgen Tchetgen, (2016), we suppose throughout that OR(a,y)\operatorname{OR}(a,y) is correctly specified, which can be achieved by specifying a relatively flexible model, or by following the approach suggested by Higgins et al., (2008) if information on the reasons for missingness is available.

Equation (3.5) gives the key factorization of f(y,ra,z)f(y,r\mid a,z) into the propensity score f(ra,y=1)f(r\mid a,y=1) evaluated at the reference level Y=1Y=1, the outcome distribution f(yr=1,a,z)f(y\mid r=1,a,z) among the complete cases, and the odds ratio function OR(a,y)\operatorname{OR}(a,y). The former two are referred to as the baseline propensity score and the baseline outcome distribution, respectively. As illustrated above, f(yr=1,a,z)f(y\mid r=1,a,z) is uniquely determined from the complete cases; according to equations (3.5) and (3.6), we aim to identify f(y,ra,z)f(y,r\mid a,z) through the odds ratio function OR(a,y)\operatorname{OR}(a,y), which means the identification of the odds ratio function OR(a,y)\operatorname{OR}(a,y) is the fundamental problem in the whole framework. Proposition 2 gives some further results derived from identities (3.5) and (3.6) (Miao et al.,, 2019).

Proposition 2.

Under Assumption 10, we have that

f(r=1a,y)\displaystyle f(r=1\mid a,y) =f(r=1a,y,z),\displaystyle=f(r=1\mid a,y,z), (3.7)
=f(r=1a,y=1)f(r=1a,y=1)+OR(a,y)f(r=0a,y=1),\displaystyle=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)},
f(yr=0,a,z)=OR(a,y)f(yr=1,a,z)E[OR(a,y)r=1,a,z],f(y\mid r=0,a,z)=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]}, (3.8)
E[OR~(a,y)r=1,a,z]=f(zr=0,a)f(zr=1,a),whereOR~(a,y)=OR(a,y)E[OR(a,y)r=1,a],\begin{gathered}E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\\[8.53581pt] \text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]},\end{gathered} (3.9)
OR(a,y)=OR~(a,y)OR~(a,y=1).\operatorname{OR}(a,y)=\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}. (3.10)

The proof of Proposition 2 can be found in the Supplementary Materials. Equations (3.7) - (3.10) indicate the key role of the odds ratio function in the identification framework of this paper. Equation (3.7) reveals that the propensity score f(r=1a,y)f(r=1\mid a,y) is a function of the odds ratio function OR(a,y)\operatorname{OR}(a,y). Equation (3.8) shows that under the shadow variable Assumption 10, the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) can be recovered by combining the odds ratio function OR(a,y)\operatorname{OR}(a,y) with the complete-case distribution, and therefore the full data distribution is available. Equation (3.9) is a Fredholm integral equation of the first kind in which OR~(a,y)\widetilde{\operatorname{OR}}(a,y) is to be solved for, because f(zr=0,a)f(z\mid r=0,a), f(zr=1,a)f(z\mid r=1,a) and f(yr=1,a,z)f(y\mid r=1,a,z) can be obtained from the observed data. Equation (3.10) relates OR(a,y)\operatorname{OR}(a,y) to OR~(a,y)\widetilde{\operatorname{OR}}(a,y); therefore, the identification of OR(a,y)\operatorname{OR}(a,y) is equivalent to the uniqueness of the solution of equation (3.9), which is guaranteed by a completeness condition on f(yr=1,a,z)f(y\mid r=1,a,z) (Miao et al.,, 2019).

Condition 1.

(The completeness of f(yr=1,a,z)f(y\mid r=1,a,z))   The function f(yr=1,a,z)f(y\mid r=1,a,z) is complete if and only if, for any square-integrable function h(a,y)h(a,y), E[h(a,y)r=1,a,z]=0E\left[h(a,y)\mid r=1,a,z\right]=0 almost surely implies h(a,y)=0h(a,y)=0 almost surely.

The completeness condition is also an essential condition in other identification analyses (Newey and Powell,, 2003), such as nonparametric instrumental variable regression (Darolles et al.,, 2011), outcomes missing not at random (Miao et al.,, 2019) and confounders missing not at random (Yang et al.,, 2019). In contrast, some of the conditions imposed on the shadow variable by previous authors cannot be justified empirically. For instance, a completeness condition is required for the full data distribution (d’Haultfoeuille,, 2010); a monotone likelihood ratio is required for the full data distribution (Wang et al.,, 2014); a generalized linear model is considered for the full data distribution (Zhao and Shao,, 2015). As noted by Miao et al., (2019), the completeness condition of this paper only involves the observed data distribution f(yr=1,a,z)f(y\mid r=1,a,z), which means that it can be justified empirically and does not require additional model assumptions on the missing data distribution.

In fact, Condition 1 implicitly requires that ZZ has a support at least as large as that of YY. However, as noted by Miao and Tchetgen Tchetgen, (2016), who gave a counterexample, a binary shadow variable cannot guarantee the identification of the full data distribution for a continuous outcome. To identify the distribution of a continuous outcome, a continuous shadow variable and extra conditions need to be imposed. More details about the completeness condition can be found in Miao and Tchetgen Tchetgen, (2016), where some commonly used parametric and semiparametric models such as exponential families and location-scale families are applied to the analysis.

Under Assumption 10, Condition 1 is sufficient to ensure the unique solution from equation (3.9), and thus according to equation (3.10), the odds ratio function OR(a,y)\operatorname{OR}(a,y) is identifiable. We state the result in the following theorem.

Theorem 1.

Under Assumption 10 and Condition 1, the joint distribution f(a,y,z,r)f(a,y,z,r) is identifiable.

The proof of Theorem 1 can be found in the Supplementary Materials. Theorem 1 indicates that nonparametric identification of the full data law f(a,y,z,r)f(a,y,z,r) under MNAR is achieved with a valid shadow variable. The odds ratio function is the basis of our identification analysis. Because f(yr=1,a,z)f(y\mid r=1,a,z) and f(zr=1,a)f(z\mid r=1,a) are identifiable from the observed data, we turn the identification of the odds ratio function OR(a,y)\operatorname{OR}(a,y) into the problem of solving for OR~(a,y)\widetilde{\operatorname{OR}}(a,y) from (3.9). Condition 1 guarantees the unique solution of (3.9). According to equation (3.8), the missing data distribution f(yr=0,a,z)f(y\mid r=0,a,z) can be recovered once the odds ratio function is identified, and then, according to equations (3.1) and (3.2), f(y,ra,z)f(y,r\mid a,z) and its functionals can be identified. The shadow variable plays the key role in the identification of the odds ratio function: with a valid shadow variable, nonparametric identification of the full data distribution f(a,y,z,r)f(a,y,z,r) is achieved via the pattern-mixture factorization, and it is the shadow variable that makes equation (3.9) available. More details and examples about the shadow variable can be found in Miao and Tchetgen Tchetgen, (2016) and Miao et al., (2019).
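To make the identification recipe concrete, consider the setting of this paper in which both YY and ZZ are binary. For each treatment level aa, equation (3.9) then reduces to a system of two linear equations in the two unknown values of OR~(a,y)\widetilde{\operatorname{OR}}(a,y), and Condition 1 amounts to the coefficient matrix being nonsingular. The following sketch (written in Python, with hypothetical input probabilities; the variable names are ours and not part of the paper) solves this system, recovers OR(a,y)\operatorname{OR}(a,y) via (3.10), and then recovers the missing data distribution via (3.8).

```python
import numpy as np

# Illustration of (3.9)-(3.10) for a binary outcome and a binary shadow variable,
# at a fixed treatment level a.  The input probabilities below are hypothetical
# and would in practice be estimated from the observed data.
y_vals = np.array([1, 2])                     # support of Y; y = 1 is the reference value

# f(y | r = 1, a, z), rows indexed by z = 0, 1 and columns by y
f_y_given_r1_z = np.array([[0.7, 0.3],
                           [0.5, 0.5]])
# f(z | r = 0, a) / f(z | r = 1, a) for z = 0, 1
density_ratio = np.array([1.1, 0.9])

# Equation (3.9): sum_y ORtilde(a, y) f(y | r = 1, a, z) equals the density ratio.
# With two values of z this is a 2 x 2 linear system; completeness (Condition 1)
# is exactly nonsingularity of the coefficient matrix.
or_tilde = np.linalg.solve(f_y_given_r1_z, density_ratio)

# Equation (3.10): normalize by the reference value y = 1 (index 0).
odds_ratio = or_tilde / or_tilde[0]

# Equation (3.8): recover the missing-data distribution f(y | r = 0, a, z).
numer = odds_ratio * f_y_given_r1_z           # broadcasts over the z rows
f_y_given_r0_z = numer / numer.sum(axis=1, keepdims=True)
print(odds_ratio, f_y_given_r0_z)
```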

In the next section, we apply the generalized method of moments (GMM) to obtain the mean value of the outcome YY. Some results of the consistency and the asymptotic properties of the estimators are established. After recovering the mean value of the outcome, we propose an estimator to adjust for the unmeasured confounding to obtain the CACE.

4 Estimation and Inference

4.1 Estimating the mean value of the nonignorable missing outcomes

We consider the situation where ZZ is an auxiliary variable taking values of l=0,1l=0,1. The first step of the estimation is to recover the mean value of the outcome which is missing not at random. We want to estimate the population mean μ=E[Y]\mu=E\left[Y\right] from the observed data. After the identification is guaranteed, the joint distribution of YY and RR given (A,Z)\left(A,Z\right) is determined by f(ya,z)f(y\mid a,z) and the missing mechanism f(ry,a,z)f(r\mid y,a,z).

f(y,ra,z)=f(ya,z)f(ry,a,z).f(y,r\mid a,z)=f(y\mid a,z)f(r\mid y,a,z). (4.1)

The missing mechanism π(y,a,z)=f(r=1y,a,z)\pi(y,a,z)=f(r=1\mid y,a,z) is the key quantity in the estimation process. The conditional probability π(y,a,z)\pi(y,a,z) is also called the nonresponse mechanism or the propensity of missing data in the literature (Wang et al.,, 2014; Shao and Wang,, 2016; Zhang et al.,, 2018). For the outcome YY missing not at random, the propensity depends on both the observed data and the missing data. Some authors imposed parametric assumptions on both the propensity and the outcome model to establish likelihood methods (Greenlees et al.,, 1982; Baker and Laird,, 1988), but the resulting estimators are sensitive to misspecification of the fully parametric models. Other authors proposed semiparametric approaches. Qin et al., (2002) imposed a parametric model on the propensity, which is difficult to verify under nonignorable missingness. Some authors considered weaker assumptions on the model. Kim and Yu, (2011) proposed a semiparametric logistic regression model for the propensity. Tang et al., (2003) proposed a pseudo-likelihood method, imposing a parametric model on the outcome distribution but leaving the propensity unspecified. Zhao and Shao, (2015) considered a generalized linear model for the estimation while allowing for a nonparametric missing mechanism. Some authors proposed empirical likelihood methods to estimate the parameters in the missing mechanism with nonignorable missing data (Zhao et al.,, 2013; Tang et al.,, 2014; Niu et al.,, 2014).

Wang et al., (2014) applied the generalized method of moments to estimate the parameters of the missing mechanism. Shao and Wang, (2016) imposed an exponential tilting model on the propensity and estimated the tilting parameter and the population mean in two steps. Zhang et al., (2018) proposed an approach in which the parameters of interest and the tilting parameter are estimated simultaneously with the generalized method of moments and kernel regression.

In this paper, we apply the generalized method of moments (GMM) to estimate the parameters of the missing mechanism (Hansen,, 1982; Hall,, 2005). For estimation, we consider a parametric model for the propensity π(y,a,z)\pi(y,a,z) and we do not impose any parametric assumption on the outcome model f(ya,z)f(y\mid a,z), which means f(ya,z)f(y\mid a,z) is left nonparametric.

Here we assume the propensity π(y,a,z)=f(r=1y,a,z)\pi(y,a,z)=f(r=1\mid y,a,z) satisfies the following conditions:

Condition 2.
π(y,a,z)\displaystyle\pi(y,a,z) =π(y,a,z,𝜽)=f(r=1y,a,z,𝜽)\displaystyle=\pi(y,a,z,\bm{\theta})=f(r=1\mid y,a,z,\bm{\theta}) (4.2)
=f(r=1y,a,𝜽)=ψ(α+βy+γa),\displaystyle=f(r=1\mid y,a,\bm{\theta})=\psi\left(\alpha+\beta y+\gamma a\right),

where 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}} is a pp-dimensional unknown parameter. ψ()\psi(\cdot) is a known, strictly monotone, and twice differentiable function from \mathcal{R} to (0,1](0,1].

Similar to the discussion in Wang et al., (2014), we specify a parametric propensity in which the nonresponse instrument ZZ does not appear. However, once identification is guaranteed as in Section 3, the identifiability condition (C2) in Wang et al., (2014) and Zhang et al., (2018) is not essential. The next step is to estimate these parameters using the observed data.

For applying the GMM, we need to construct a set of qq estimating equations:

G(y,r,a,z,𝜽)=(g1(y,r,a,z,𝜽),,gq(y,r,a,z,𝜽))T,𝜽Θ,G(y,r,a,z,\bm{\theta})=\left(g_{1}(y,r,a,z,\bm{\theta}),\cdots,g_{q}(y,r,a,z,\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta, (4.3)

where Θ\Theta is the parameter space containing the true parameter value 𝜽0=(α0,β0,γ0)T\bm{\theta}_{0}=\left(\alpha_{0},\beta_{0},\gamma_{0}\right)^{\mathrm{T}}, and 𝜽\bm{\theta} is a pp-dimensional vector of the parameters that we want to estimate. The estimating equations need to satisfy E[gm(y,r,a,z,𝜽0)]=0E\left[g_{m}\left(y,r,a,z,\bm{\theta}_{0}\right)\right]=0,  m=1,,qm=1,\cdots,q, and qpq\geq p. Let

G^n(𝜽)=(1ni=1ng1(yi,ri,ai,zi,𝜽),,1ni=1ngq(yi,ri,ai,zi,𝜽))T,𝜽Θ.\widehat{G}_{n}(\bm{\theta})=\left(\frac{1}{n}\sum\limits_{i=1}^{n}g_{1}(y_{i},r_{i},a_{i},z_{i},\bm{\theta}),\cdots,\frac{1}{n}\sum\limits_{i=1}^{n}g_{q}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta. (4.4)

G^n(𝜽)\widehat{G}_{n}(\bm{\theta}) is the sample version of the estimating equations (4.3). The number of estimating equations qq is often greater than pp, which implies that there is no solution to

g¯m(𝜽)1ni=1ngm(yi,ri,ai,zi,𝜽)=0,m=1,,q.\bar{g}_{m}\left(\bm{\theta}\right)\equiv\frac{1}{n}\sum\limits_{i=1}^{n}g_{m}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right)=0,\quad m=1,\cdots,q. (4.5)

The best we can do is to make it as close as possible to zero by minimizing the quadratic function

Qn(𝜽)=[G^n(𝜽)]TW[G^n(𝜽)],{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}W[\widehat{G}_{n}(\bm{\theta})], (4.6)

where WW is a positive semi-definite and symmetric q×qq\times q matrix of weights. We use two-step GMM to estimate the parameters 𝜽\bm{\theta}.

The first step: Let W=Iq×qW=I_{q\times q} in (4.6), Iq×qI_{q\times q} is a q×qq\times q identity matrix, then we obtain 𝜽~\tilde{\bm{\theta}} by minimizing Qn(𝜽)=[G^n(𝜽)]T[G^n(𝜽)]{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}[\widehat{G}_{n}(\bm{\theta})], which means 𝜽~=argmin𝜽ΘQn(𝜽)\tilde{\bm{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\Theta}Q_{n}(\bm{\theta}).

The second step: Then we obtain W^=W(𝜽~)\widehat{W}=W(\tilde{\bm{\theta}}), plugging in W^\widehat{W} in (4.6). Finally, we obtain the GMM estimator 𝜽^\widehat{\bm{\theta}} by minimizing Q^n(𝜽)=[G^n(𝜽)]TW^[G^n(𝜽)]\widehat{Q}_{n}(\bm{\theta})=[\widehat{G}_{n}(\bm{\theta})]^{\mathrm{T}}\widehat{W}[\widehat{G}_{n}(\bm{\theta})] over 𝜽Θ\bm{\theta}\in\Theta, which means 𝜽^=argmin𝜽ΘQ^n(𝜽)\widehat{\bm{\theta}}=\mathop{\arg\min}\limits_{\bm{\theta}\in\Theta}\widehat{Q}_{n}(\bm{\theta}).

The next step is to construct the estimating equations to estimate 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}. ZZ is a discrete variable taking values of l=0,1l=0,1. We let 𝒌\bm{k} be a LL-dimensional column vector whose ll-th component is I(z=l)I(z=l) (L=2)(L=2), where I()I(\cdot) is the indicator function. Similar to 𝒌\bm{k}, we let 𝒋\bm{j} also be a LL-dimensional column vector whose ll-th component is I(a=l)I(a=l) .

We now consider the general situation. The estimating equations in (4.3) can be constructed by the following qq functions

G(y,r,a,z,𝜽)=(𝒌[rπ(y,a,z,𝜽)1]𝒋[rπ(y,a,z,𝜽)1]),G(y,r,a,z,\bm{\theta})=\left(\begin{array}[]{l}\bm{k}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\\[8.53581pt] \bm{j}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\end{array}\right), (4.7)

where π(y,a,z,𝜽)=f(r=1y,a,z)=f(r=1y,a)=ψ(α+βy+γa)\pi(y,a,z,\bm{\theta})=f(r=1\mid y,a,z)=f(r=1\mid y,a)=\psi\left(\alpha+\beta y+\gamma a\right) in (4.2).

For example, if zz is a constant and there are no other auxiliary variables, then (4.7) is insufficient for estimating the unknown 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}. To use the GMM, we need qpq\geq p. If there is only a discrete variable ZZ, the requirement L2L\geq 2 is satisfied as long as ZZ is not a constant.

The estimating functions G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) are motivated by the following equations, if 𝜽0=(α0,β0,γ0)T\bm{\theta}_{0}=(\alpha_{0},\beta_{0},\gamma_{0})^{\mathrm{T}} is the true parameter value,

E[G(y,r,a,z,𝜽)]\displaystyle E\left[G(y,r,a,z,\bm{\theta})\right] =E{𝜼[rπ(y,a,z,𝜽)1]}\displaystyle=E\left\{\bm{\eta}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\right\} (4.8)
=E(E{𝜼[rπ(y,a,z,𝜽)1]|y,a,z})\displaystyle=E\left(E\left\{\bm{\eta}\left[\dfrac{r}{\pi(y,a,z,\bm{\theta})}-1\right]\bigg{|}\ y,a,z\right\}\right)
=E{𝜼[E(ry,a,z)f(r=1y,a,z)1]}\displaystyle=E\left\{\bm{\eta}\left[\dfrac{E(r\mid y,a,z)}{f(r=1\mid y,a,z)}-1\right]\right\}
=𝟎,\displaystyle=\bm{0},

where 𝜼=(𝒌T,𝒋T)T\bm{\eta}=\left(\bm{k}^{\mathrm{T}},\bm{j}^{\mathrm{T}}\right)^{\mathrm{T}} is a qq-dimensional vector. Let G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) in (4.7) be the estimating equations in (4.3); then G^n(𝜽)\widehat{G}_{n}(\bm{\theta}) in (4.4) is the corresponding sample version, and W^\widehat{W} is given by the two-step method described above. We then obtain 𝜽^=(α^,β^,γ^)T\widehat{\bm{\theta}}=\left(\hat{\alpha},\hat{\beta},\hat{\gamma}\right)^{\mathrm{T}} as the two-step GMM estimator of 𝜽=(α,β,γ)T\bm{\theta}=\left(\alpha,\beta,\gamma\right)^{\mathrm{T}}.
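To illustrate how the estimation procedure can be carried out in practice, the following sketch (in Python; the function names are ours and not part of the paper) builds the moment functions in (4.7) with a logistic choice of ψ\psi, as in the simulation study of Section 5, and performs the two-step GMM minimization. Note that missing outcomes may be stored with an arbitrary filler value, because r/π(y,a,z,𝜽)1r/\pi(y,a,z,\bm{\theta})-1 equals 1-1 whenever r=0r=0, regardless of yy.

```python
import numpy as np
from scipy.optimize import minimize

def psi(x):
    """Logistic link, one choice of the strictly monotone function in Condition 2."""
    return 1.0 / (1.0 + np.exp(-x))

def moments(theta, y, r, a, z):
    """Per-observation moment functions from (4.7): eta * (r / pi - 1),
    with eta = (I(z=0), I(z=1), I(a=0), I(a=1)), so q = 2L = 4."""
    alpha, beta, gamma = theta
    pi = psi(alpha + beta * y + gamma * a)
    resid = r / pi - 1.0                      # equals -1 when r = 0, so missing y can be any filler
    eta = np.column_stack([z == 0, z == 1, a == 0, a == 1]).astype(float)
    return eta * resid[:, None]               # n x q matrix

def gmm_objective(theta, W, y, r, a, z):
    gbar = moments(theta, y, r, a, z).mean(axis=0)   # sample moments (4.4)
    return gbar @ W @ gbar                           # quadratic form (4.6)

def two_step_gmm(y, r, a, z, theta_init=(0.0, 0.0, 0.0)):
    """Two-step GMM estimator of theta = (alpha, beta, gamma)."""
    q = 4
    # Step 1: identity weight matrix.
    step1 = minimize(gmm_objective, theta_init, args=(np.eye(q), y, r, a, z), method="Nelder-Mead")
    # Step 2: estimated optimal weight W_hat = Omega(theta_tilde)^{-1}.
    G1 = moments(step1.x, y, r, a, z)
    W_hat = np.linalg.inv(G1.T @ G1 / len(y))
    step2 = minimize(gmm_objective, step1.x, args=(W_hat, y, r, a, z), method="Nelder-Mead")
    return step2.x, W_hat
```

A derivative-free optimizer is used here only for simplicity; gradient-based minimization of (4.6) would work equally well, since ψ\psi is twice differentiable.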

Theorem 2.

Suppose that E[G(y,r,a,z,𝜽)]=𝟎E\left[G(y,r,a,z,\bm{\theta})\right]=\bm{0} only if 𝜽=𝜽0\bm{\theta}=\bm{\theta}_{0}, 𝜽0Θ\bm{\theta}_{0}\in\Theta, which is compact and that E[sup𝜽ΘG(y,r,a,z,𝜽)]<E\left[\sup_{\bm{\theta}\in\Theta}\left\|G(y,r,a,z,\bm{\theta})\right\|\right]<\infty. Then, as nn\rightarrow\infty, 𝜽^P𝜽0\widehat{\bm{\theta}}\stackrel{{\scriptstyle P}}{{\longrightarrow}}\bm{\theta}_{0} and P\stackrel{{\scriptstyle P}}{{\longrightarrow}} denotes convergence in probability.

The proof of Theorem 2 and more details about GMM can be found in the Supplementary Materials. Let 𝜽()\nabla_{\bm{\theta}}(\cdot) and 𝜽𝜽()\nabla_{{\bm{\theta}}{\bm{\theta}}}(\cdot) denote the first- and second-order derivatives with respect to 𝜽\bm{\theta}. The asymptotic normality of 𝜽^\widehat{\bm{\theta}} is established in Theorem 3 under the following additional Condition 3.

Condition 3.

(i) 𝜽0\bm{\theta}_{0}\in interior of Θ\Theta; (ii) G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) is continuously differentiable in a neighborhood 𝒩\mathcal{N} of 𝜽0\bm{\theta}_{0}, with probability approaching one; (iii) E[G(y,r,a,z,𝜽0)2]E\left[\|G(y,r,a,z,\bm{\theta}_{0})\|^{2}\right] is finite and E[sup𝜽𝒩θG(y,r,a,z,𝜽)]<E\left[\sup_{\bm{\theta}\in\mathcal{N}}\left\|\nabla_{\theta}G(y,r,a,z,\bm{\theta})\right\|\right]<\infty; (iv) H=E[θG(y,r,a,z,𝜽0)]H=E\left[\nabla_{\theta}G(y,r,a,z,\bm{\theta}_{0})\right] exists and is of full rank.

Theorem 3.

Suppose that the assumptions of Theorem 2 are satisfied and Condition 3 holds, and let Ω=E[G(y,r,a,z,𝜽0)G(y,r,a,z,𝜽0)T]\Omega=E[G(y,r,a,z,\bm{\theta}_{0})G(y,r,a,z,\bm{\theta}_{0})^{\mathrm{T}}]. Then, as nn\rightarrow\infty, n(𝜽^𝜽0)N(𝟎,Δ)\sqrt{n}(\widehat{\bm{\theta}}-\bm{\theta}_{0})\stackrel{{\scriptstyle\mathscr{L}}}{{\longrightarrow}}N(\bm{0},\Delta), where Δ=(HTWH)1HTWΩWH(HTWH)1\Delta=\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W\Omega WH\left(H^{\mathrm{T}}WH\right)^{-1} and \stackrel{{\scriptstyle\mathscr{L}}}{{\longrightarrow}} denotes convergence in distribution.

The proof of Theorem 3 can be found in the Supplementary Materials. The asymptotic covariance matrix Δ\Delta can be estimated by Δ^\widehat{\Delta}, which is obtained by substituting estimators for each of HH, WW and Ω\Omega. To estimate Ω\Omega, we replace the population moment by a sample average and the true parameter 𝜽0\bm{\theta}_{0} by the estimator 𝜽^\widehat{\bm{\theta}}; H^\widehat{H} is estimated similarly. That is, we form

Ω^=1ni=1nGn(yi,ri,ai,zi,𝜽^)Gn(yi,ri,ai,zi,𝜽^)T,H^=1ni=1nGn(yi,ri,ai,zi,𝜽)𝜽|𝜽=𝜽^,\widehat{\Omega}=\frac{1}{n}\sum\limits_{i=1}^{n}G_{n}(y_{i},r_{i},a_{i},z_{i},\widehat{\bm{\theta}})G_{n}(y_{i},r_{i},a_{i},z_{i},\widehat{\bm{\theta}})^{\mathrm{T}},\quad\widehat{H}=\left.\frac{1}{n}\sum_{i=1}^{n}\frac{\partial G_{n}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})}{\partial\bm{\theta}}\right|_{\bm{\theta}=\widehat{\bm{\theta}}},

where Gn(yi,ri,ai,zi,𝜽)=(g1(yi,ri,ai,zi,𝜽),,gq(yi,ri,ai,zi,𝜽))T,i=1,,n,𝜽ΘG_{n}(y_{i},r_{i},a_{i},z_{i},\bm{\theta})=\left(g_{1}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right),\cdots,g_{q}\left(y_{i},r_{i},a_{i},z_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad i=1,\cdots,n,\quad\bm{\theta}\in\Theta. So that we have the Theorem 4.

Theorem 4.

If the hypotheses of Theorem 3 are satisfied, then Δ^PΔ\widehat{\Delta}\stackrel{{\scriptstyle P}}{{\longrightarrow}}\Delta, where Δ^=(H^TW^H^)1H^TW^Ω^W^H^(H^TW^H^)1\widehat{\Delta}=\left(\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{H}\right)^{-1}\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{\Omega}\widehat{W}\widehat{H}\left(\widehat{H}^{\mathrm{T}}\widehat{W}\widehat{H}\right)^{-1}.

The proof of Theorem 4 can be found in the Supplementary Materials. The optimal weight matrix is W1=Ω1W_{1}=\Omega^{-1}; with this choice of W1W_{1}, W^\widehat{W} can be constructed as follows,

W^={1ni=1nGn(yi,ri,ai,zi,𝜽~)Gn(yi,ri,ai,zi,𝜽~)T}1,\widehat{W}=\left\{\frac{1}{n}\sum\limits_{i=1}^{n}G_{n}(y_{i},r_{i},a_{i},z_{i},\tilde{\bm{\theta}})G_{n}(y_{i},r_{i},a_{i},z_{i},\tilde{\bm{\theta}})^{\mathrm{T}}\right\}^{-1},

and the asymptotic covariance matrix Δ\Delta then reduces to (HTΩ1H)1\left(H^{\mathrm{T}}\Omega^{-1}H\right)^{-1}, with the corresponding estimator Δ^\widehat{\Delta} reducing to (H^TΩ^1H^)1\left(\widehat{H}^{\mathrm{T}}\widehat{\Omega}^{-1}\widehat{H}\right)^{-1}.
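A plug-in computation of this covariance estimator is sketched below, reusing the moments function from the previous sketch. The analytic derivative H^\widehat{H} is replaced here by a numerical Jacobian, which is our own simplification, and the optimal weight is assumed so that Δ^=(H^TΩ^1H^)1\widehat{\Delta}=(\widehat{H}^{\mathrm{T}}\widehat{\Omega}^{-1}\widehat{H})^{-1}.

```python
import numpy as np

def gmm_covariance(theta_hat, y, r, a, z, eps=1e-6):
    """Plug-in estimate of the asymptotic covariance under the optimal weight:
    Delta_hat = (H_hat^T Omega_hat^{-1} H_hat)^{-1}.  The sandwich form of
    Theorem 3 applies when a non-optimal weight matrix is used instead.
    Reuses moments() from the two-step GMM sketch above."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    n, p = len(y), len(theta_hat)
    G = moments(theta_hat, y, r, a, z)
    omega_hat = G.T @ G / n                              # Omega_hat from sample outer products
    gbar = lambda t: moments(t, y, r, a, z).mean(axis=0)
    # H_hat: central-difference Jacobian of the sample moments with respect to theta
    H_hat = np.column_stack([
        (gbar(theta_hat + eps * np.eye(p)[j]) - gbar(theta_hat - eps * np.eye(p)[j])) / (2 * eps)
        for j in range(p)
    ])
    delta_hat = np.linalg.inv(H_hat.T @ np.linalg.inv(omega_hat) @ H_hat)
    return delta_hat                                     # covariance of theta_hat itself is delta_hat / n
```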

Once 𝜽^\widehat{\bm{\theta}} is obtained, the mean value μ=E[Y]\mu=E\left[Y\right] can be estimated by inverse probability weighting with the estimated propensity as the weight function,

μ^=1ni=1nriyiπ(yi,ai,zi,𝜽^).\widehat{\mu}=\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{r_{i}y_{i}}{\pi(y_{i},a_{i},z_{i},\widehat{\bm{\theta}})}. (4.9)
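A minimal sketch of the estimator (4.9), assuming the logistic propensity used in the simulation study and a fitted 𝜽^\widehat{\bm{\theta}} (the function name is ours):

```python
import numpy as np

def ipw_mean(theta_hat, y, r, a):
    """IPW estimator (4.9) of mu = E[Y].  Entries of y with r = 0 may hold any
    filler value, since they are multiplied by r = 0."""
    alpha, beta, gamma = theta_hat
    pi_hat = 1.0 / (1.0 + np.exp(-(alpha + beta * y + gamma * a)))   # logistic psi
    return np.mean(r * y / pi_hat)
```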

In fact, if the number of estimating equations qq is larger than the number of parameters pp, an additional estimating equation gq+1=μry/π(y,a,z,𝜽)g_{q+1}=\mu-ry/\pi(y,a,z,\bm{\theta}) can be added to the estimating equations G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) in (4.7). Then G(y,r,a,z,𝜽)G(y,r,a,z,\bm{\theta}) becomes a set of q+1=2L+1q+1=2L+1 estimating equations, and 𝜽=(μ,α,β,γ)T\bm{\theta}=\left(\mu,\alpha,\beta,\gamma\right)^{\mathrm{T}} becomes a set of p+1p+1 unknown parameters. If μ~\tilde{\mu} is the two-step GMM estimator based on these q+1=2L+1q+1=2L+1 estimating equations, then, as noted by Wang et al., (2014), the difference between the estimators μ^\widehat{\mu} and μ~\tilde{\mu} lies in the weighting matrix W^\widehat{W}: unless the weighting matrix used for μ^\widehat{\mu} is optimal, μ~\tilde{\mu} is asymptotically more efficient. In this paper, however, we use μ^\widehat{\mu} to construct the estimator of CACE. The functional delta method can be used to establish the asymptotic results for μ^\widehat{\mu}; for simplicity, we omit this part.

4.2 Obtaining CACE from the nonignorable missing outcomes

Under Assumptions 1-5, we can estimate CACE for the subpopulation of compliers characterized by Ui=cpU_{i}=cp, by taking the ratio of the average difference in YiY_{i} by instrument and the average difference in AiA_{i} by instrument (Angrist et al.,, 1996):

E[Yi(1)Yi(0)Ui=cp]=E[YiZi=1]E[YiZi=0]E[AiZi=1]E[AiZi=0].E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp\right]=\frac{E\left[Y_{i}\mid Z_{i}=1\right]-E\left[Y_{i}\mid Z_{i}=0\right]}{E\left[A_{i}\mid Z_{i}=1\right]-E\left[A_{i}\mid Z_{i}=0\right]}. (4.10)

If the outcome YY is fully observed, CACE can be estimated by equation (4.10). So, the missing data problem is equivalent to the problem of estimating the mean difference E(YiZi=1)E(Y_{i}\mid Z_{i}=1) and E(YiZi=0)E(Y_{i}\mid Z_{i}=0) from incomplete outcome data.

In equation (4.10), the conditional expectations of the nonignorable missing outcome YY, E[YiZi=l],l=0,1E\left[Y_{i}\mid Z_{i}=l\right],l=0,1, can be estimated by applying equation (4.9) within each stratum of the instrument ZZ; E[AiZi=l],l=0,1E\left[A_{i}\mid Z_{i}=l\right],l=0,1, can be estimated from the fully observed data. Then the estimator CACE^\widehat{CACE} can be obtained from equations (4.9) and (4.10).
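Combining (4.9) and (4.10), one way to form CACE^\widehat{CACE} is to apply the IPW mean within each stratum of ZZ and take the Wald ratio. The sketch below reflects our own reading of the stratification step, assuming a single 𝜽^\widehat{\bm{\theta}} fitted on the full sample and reusing ipw_mean from the previous sketch.

```python
def cace_estimate(theta_hat, y, r, a, z):
    """Wald-type estimator (4.10): IPW means of Y within each Z stratum in the
    numerator, sample means of A within each Z stratum in the denominator."""
    num = ipw_mean(theta_hat, y[z == 1], r[z == 1], a[z == 1]) \
        - ipw_mean(theta_hat, y[z == 0], r[z == 0], a[z == 0])
    den = a[z == 1].mean() - a[z == 0].mean()
    return num / den
```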

5 Simulations

In this section, we conduct simulation studies to evaluate the approach proposed in this paper. All the variables are binary. We generate the instrumental variable, the compliance status variable, and the potential outcomes from the following process:

ZBernoulli(0.5)Z\sim Bernoulli(0.5)
Table 1: True parameters for the compliance status variable UU
UU atat ntnt cpcp
PP 0.20.2 0.250.25 0.550.55
Table 2: True parameters for the potential outcomes YY
Y(z)U=atY(z)\mid U=at 22 44
PP 0.30.3 0.70.7
Y(z)U=ntY(z)\mid U=nt 22 44
PP 0.70.7 0.30.3
Y(1)U=cpY(1)\mid U=cp 22 44
PP 0.40.4 0.60.6
Y(0)U=cpY(0)\mid U=cp 22 44
PP 0.60.6 0.40.4

Since the outcome YY may be missing not at random, we consider the following logistic missing mechanism of the form (4.2):

π(y,a,z,𝜽)=P(R=1Z=z,Y=y,A=a)=ψ(α+βy+γa)\pi(y,a,z,\bm{\theta})=P(R=1\mid Z=z,Y=y,A=a)=\psi(\alpha+\beta y+\gamma a) (5.1)

where ψ(α+βy+γa)=[1+exp((α+βy+γa))]1\psi\left(\alpha+\beta y+\gamma a\right)=[1+\exp(-(\alpha+\beta y+\gamma a))]^{-1} and the true parameters are 𝜽0=(α0,β0,γ0)T=(1,0.1,0.1)T\bm{\theta}_{0}=(\alpha_{0},\beta_{0},\gamma_{0})^{\mathrm{T}}=(1,-0.1,-0.1)^{\mathrm{T}}, so the missingness indicator RR depends on the possibly missing outcome YY itself. The key assumption is Assumption 10, which means RR is independent of ZZ conditional on the fully observed treatment and the possibly unobserved outcome YY. The missing data proportion is approximately between 35%35\% and 40%40\%.
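For concreteness, a sketch of this data-generating process (our own implementation of Tables 1-2 and the missing mechanism (5.1); the function name and the filler value for missing outcomes are ours) is given below. Its output can be passed directly to the GMM and CACE routines sketched in Section 4, for instance two_step_gmm(*simulate(2000)) for one replicate with n=2000n=2000.

```python
import numpy as np

def simulate(n, seed=0):
    """Generate (y, r, a, z) following Tables 1-2 and the logistic missing mechanism (5.1)."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, n)                                     # Z ~ Bernoulli(0.5)
    u = rng.choice(["at", "nt", "cp"], size=n, p=[0.2, 0.25, 0.55]) # compliance status, Table 1
    a = np.where(u == "at", 1, np.where(u == "nt", 0, z))           # compliers take A = Z
    # Observed outcome Y = Y(Z) on {2, 4}; exclusion restriction holds for at and nt (Table 2).
    p4 = np.select([u == "at", u == "nt", (u == "cp") & (z == 1)], [0.7, 0.3, 0.6], default=0.4)
    y = np.where(rng.uniform(size=n) < p4, 4, 2)
    # Nonignorable missingness: R depends on the possibly missing Y and on A.
    pi = 1.0 / (1.0 + np.exp(-(1.0 - 0.1 * y - 0.1 * a)))
    r = rng.binomial(1, pi)
    y_obs = np.where(r == 1, y, 0)                                  # filler value where Y is missing
    return y_obs.astype(float), r.astype(float), a.astype(float), z.astype(float)
```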

From Table 1 and Table 2, the true value of CACECACE =0.4=0.4. We use the approach proposed in Section 4 for our estimation. In Table 3, the columns with the labels: Bias, SD, and 95%95\% CI represent the average bias for (α^α0,β^β0,γ^γ0,CACE^CACE)(\hat{\alpha}-\alpha_{0},\hat{\beta}-\beta_{0},\hat{\gamma}-\gamma_{0},\widehat{CACE}-CACE), average standard deviation, and average 95%95\% confidence interval, respectively. We set 1000 replicates under sample sizes 100, 500, 1000, and 2000, respectively.

Table 3: Results of simulation studies
true value n Bias SD 95%95\% CI
α=1.0\alpha=1.0 100 0.0256 0.0020 [1.0216, 1.0295]
500 0.0323 0.0048 [1.0229, 1.0417]
1000 0.0395 0.0035 [1.0326, 1.0464]
2000 0.0373 0.0045 [1.0283, 1.0462]
β=0.1\beta=-0.1 100 0.0649 0.0021 [-0.0392, -0.0308]
500 0.1715 0.0075 [0.0567, 0.0863]
1000 0.1599 0.0066 [0.0469, 0.0730]
2000 0.1561 0.0064 [0.0434, 0.0688]
γ=0.1\gamma=-0.1 100 0.0153 0.0021 [-0.0887, -0.0804]
500 -0.0188 0.0052 [-0.1291, -0.1084]
1000 -0.0259 0.0052 [-0.1361, -0.1156]
2000 -0.0093 0.0045 [-0.1183, -0.1004]
CACE=0.4CACE=0.4 100 -0.0097 0.0274 [0.3364, 0.4441]
500 0.0026 0.0110 [0.3809, 0.4243]
1000 -0.0045 0.0107 [0.3743, 0.4165]
2000 0.0013 0.0112 [0.3791, 0.4234]

From Table 3, the proposed estimator of CACECACE has negligible bias even under the sample size 500500. As the sample size increases, the standard deviation of CACE^\widehat{CACE} becomes much smaller, and the bias of CACE^\widehat{CACE} stabilizes once the sample size nn reaches 500500. The components of the estimator 𝜽^=(α^,β^,γ^)T\widehat{\bm{\theta}}=(\hat{\alpha},\hat{\beta},\hat{\gamma})^{\mathrm{T}} all have small biases and standard deviations even under the sample size 100100. In fact, the proposed estimators 𝜽^\widehat{\bm{\theta}} and CACE^\widehat{CACE} in Section 4 all have small biases and standard deviations even under the small sample size 100100. All of the confidence intervals of 𝜽^\widehat{\bm{\theta}} and CACE^\widehat{CACE} have empirical coverage proportions very close to their nominal values.

6 A real data analysis

In this section, we apply the proposed estimators to a real dataset from the political analysis in Esterling et al., (2011). As noted in Esterling et al., (2011), the authors conducted a series of online deliberative field experiments, in which current members of the U.S. House of Representatives interacted via a Web-based interface with random samples of their constituents. Some members of Congress conducted the sessions that their constituents participated in, and the constituents interacted with their members in an online chat room.

An online survey research firm called Knowledge Networks (KN) was responsible for recruiting the constituents from each congressional district and administering the surveys. Each constituent was randomly assigned to one of three conditions: a deliberative condition that received background reading materials and was asked to complete a survey regarding the background materials (the background materials survey) and to participate in the sessions; an information-only group that only received the background materials and was asked to take the background materials survey; and a true control group. A week after each session, KN administered a follow-up survey to subjects in each of the groups.

There are two questions in the follow-up survey, all subjects were asked “Please tell us how much you agree or disagree with the following statements”:

1. I don’t think public officials care much what people like me think.

2. I have ideas about politics and policy that people in government should listen to.

The first question is a measure of external efficacy, and the second question is a measure of internal efficacy. In this study, we only consider the first question as the outcome variable: the outcome (Officials care) variable YY equals 22 for subjects who “somewhat disagree” or “strongly disagree” with the first question, and 11 for subjects who “somewhat agree”, “strongly agree” or “Neither agree nor disagree”. In the follow-up survey, however, some subjects chose not to respond to the question, so the outcome variable YY has missing values.

Esterling et al., (2011) restricted the sample to the 670670 subjects who completed the baseline and background materials surveys and initially indicated a willingness to participate in the deliberative sessions. Thus, the treatment effect compares those who read the background materials and participated in the discussions to those who only read the background materials. Our goal in this analysis is to evaluate the causal effect of participating in the online chat session on Officials care. More details about this deliberative session can be found in Esterling et al., (2011).

The treatment variable AA equals 11 for subjects who participated in the session and 0 for subjects who did not participate in the session. There are 12 pre-treatment (exogenous) variables in Esterling et al., (2011), and we choose one of them as the instrumental variable ZZ. The instrumental variable ZZ equals 11 for each subject who was able to answer at least four of the “Delli Carpini and Keeter five” items correctly on the baseline survey, indicating high political knowledge, and 0 otherwise. We consider that a subject with high political knowledge might be more likely to participate in the session because they were more likely to show a willingness to participate, so Assumption 3 might be reasonable. In fact, we expect the political knowledge to be independent of the missing mechanism conditional on the participation status and the possibly missing outcome, which implies Assumption 10. The observed data are reported in Table 4.

Table 4: A real data about the political analysis
Z=1Z=1, A=1A=1 Z=1Z=1, A=0A=0 Z=0Z=0, A=1A=1 Z=0Z=0, A=0A=0
R=1R=1, Y=1Y=1 130 139 62 82
R=1R=1, Y=2Y=2 67 24 12 11
R=0R=0, Y=.Y=. 21 72 5 45

The sample size nn equals 670670, with 143143 missing values. The overall missing proportion is about 21.34%21.34\% (143/6700.2134143/670\approx 0.2134). In the treatment group (A=1A=1), consisting of those who participated in the session, the missing proportion is about 8.75%8.75\% (26/2970.087526/297\approx 0.0875), while in the control group (A=0A=0), consisting of those who did not participate in the session, the missing proportion is about 31.36%31.36\% (117/3730.3136117/373\approx 0.3136). Those who participated in the session were more likely to respond to the follow-up survey than those who did not. We therefore consider that the missing mechanism was influenced by whether the subject participated in the session. The missing indicator RR may also depend on the possibly missing outcome YY because of the nature of the questions in the follow-up survey. Hence it is possible that the outcomes are missing not at random.

The estimates of \bm{\theta} and CACE are shown in Table 5. The columns of Table 5 correspond to the parameter, point estimate, standard deviation (SD), and 95% confidence interval, respectively. The assumed missing mechanism is similar to equation (5.1) in Section 5.

Table 5: Estimates of the parameters and CACE

Parameter          Estimate      SD        95% Confidence Interval
\hat{\alpha}        1.6204     0.0349      [1.5519, 1.6888]
\hat{\beta}        -0.2225     0.0234      [-0.2684, -0.1766]
\hat{\gamma}        0.1249     0.0249      [0.0759, 0.1739]
\widehat{CACE}      1.3234     0.0190      [1.2861, 1.3608]

From Table 5, we can see that, under the assumed missing mechanism in equation (5.1), the sign of \hat{\beta} is negative and the sign of \hat{\gamma} is positive. This means that the more a subject disagrees with the first question in the follow-up survey, the more likely the outcome Y is to be missing, and that subjects who did not participate in the session were more likely to have missing outcomes. The estimate of CACE is 1.3234, which indicates that participating in the session has a significantly positive effect on Officials care: a subject who participated in the session might shift their opinion on the first question from agreement toward disagreement. Our results are similar to those of Esterling et al., (2011), who also reported a significant effect.

7 Discussions

How to obtain average causal effects from nonignorable missing data is a challenging problem. In fact, under the principal stratification framework, the average causal effect for the whole population cannot be estimated; we can only estimate the complier average causal effect. Moreover, in practice we cannot observe the compliance status of a unit. With nonignorable missing outcomes Y, the identification of CACE cannot be established without additional assumptions. Several different missing data mechanisms have been discussed in previous work.

In this paper, we compare the LI and ODN missing data mechanisms with the shadow variable assumption, and we then establish nonparametric identification of the full data distribution from nonignorable missing outcomes with a shadow variable. However, it might not be easy to find such a variable in a real analysis, and some prior knowledge is usually needed in practice. For inference, we impose a parametric assumption on the missing data mechanism and no assumption on the outcome model. With the GMM method, we estimate the parameters in the missingness propensity, and we establish consistency and asymptotic results for the estimated parameters. We then recover the mean value of the outcome Y and estimate CACE by equation (4.10). Below, we discuss a few possible future generalizations of the proposed approach.

7.1 Adjustment for measured confounding with a covariate V

When a discrete covariate V is fully observed, we can adjust for the measured confounding with a stratification method. The assumptions are then imposed conditional on V.

Randomization Assumption 2 is replaced by: for given V, (Y(0),Y(1),A(0),A(1)) is independent of Z, that is, Z\perp\!\!\!\perp(Y(0),Y(1),A(0),A(1))\mid V. Shadow variable Assumption 10 is replaced by: (a'): Z\perp\!\!\!\perp R\mid(Y,V,A); (b'): Z\not\perp\!\!\!\perp Y\mid(R=1,V,A). Under these assumptions conditional on V and the other essential assumptions, Definition 1 of CACE is replaced by CACE=E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp,V_{i}=v_{i}\right].

We present a figure to help readers understand the framework of this paper with a covariate V.

Figure 4: A directed acyclic graph model for this paper with a covariate V (nodes V, U, Z, A, Y, and R)

And then CACE can be estimated by the following equation:

E\left[Y_{i}(1)-Y_{i}(0)\mid U_{i}=cp,V_{i}=v_{i}\right]=\frac{E\left[Y_{i}\mid Z_{i}=1,V_{i}=v_{i}\right]-E\left[Y_{i}\mid Z_{i}=0,V_{i}=v_{i}\right]}{E\left[A_{i}\mid Z_{i}=1,V_{i}=v_{i}\right]-E\left[A_{i}\mid Z_{i}=0,V_{i}=v_{i}\right]}.
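As a rough illustration of this stratified Wald-type ratio, the following Python sketch computes it from fully observed binary data; the function name and the synthetic inputs are ours and purely illustrative, and the sketch ignores the nonignorable-missingness adjustment developed in Section 4.

import numpy as np

def stratified_wald(y, a, z, v):
    """Within each level v of V, estimate E[Y(1)-Y(0) | U=cp, V=v] by the Wald ratio."""
    est = {}
    for level in np.unique(v):
        m = (v == level)
        num = y[m & (z == 1)].mean() - y[m & (z == 0)].mean()
        den = a[m & (z == 1)].mean() - a[m & (z == 0)].mean()
        est[level] = num / den
    return est

# Toy example with a binary covariate V (synthetic data, for illustration only).
rng = np.random.default_rng(0)
n = 2000
v = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
a = rng.binomial(1, 0.3 + 0.4 * z)             # instrument encourages treatment
y = rng.binomial(1, 0.2 + 0.3 * a + 0.1 * v)   # treatment raises the outcome
print(stratified_wald(y, a, z, v))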

7.2 How to obtain the average causal effects for the whole population

Wang and Tchetgen Tchetgen, (2018) discussed two frameworks for the instrumental variable. Under linear structural equation models (SEMs), which imply effect homogeneity, the average causal effects can be identified and estimated. However, structural equation models are sensitive to the model assumptions. In this paper, we impose no assumption on the outcome model for either identification or estimation. Imposing some assumptions on the outcome model might make it possible to estimate the average causal effects from nonignorable missing outcomes under some new missing data mechanisms.

References

  • Angrist and Imbens, (1995) Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American statistical Association, 90(430):431–442.
  • Angrist et al., (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455.
  • Baker and Laird, (1988) Baker, S. G. and Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical association, 83(401):62–69.
  • Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
  • Chen et al., (2015) Chen, H., Ding, P., Geng, Z., and Zhou, X.-H. (2015). Semiparametric inference of the complier average causal effect with nonignorable missing outcomes. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):1–15.
  • Chen et al., (2009) Chen, H., Geng, Z., and Zhou, X.-H. (2009). Identifiability and estimation of causal effects in randomized trials with noncompliance and completely nonignorable missing data. Biometrics, 65(3):675–682.
  • Chen, (2003) Chen, H. Y. (2003). A note on the prospective analysis of outcome-dependent samples. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):575–584.
  • Chen, (2004) Chen, H. Y. (2004). Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association, 99(468):1176–1189.
  • Chen, (2007) Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413–421.
  • Darolles et al., (2011) Darolles, S., Fan, Y., Florens, J.-P., and Renault, E. (2011). Nonparametric instrumental regression. Econometrica, 79(5):1541–1565.
  • Ding and Geng, (2014) Ding, P. and Geng, Z. (2014). Identifiability of subgroup causal effects in randomized experiments with nonignorable missing covariates. Statistics in Medicine, 33(7):1121–1133.
  • Ding and Li, (2018) Ding, P. and Li, F. (2018). Causal inference: A missing data perspective. Statistical Science, 33(2):214–237.
  • d’Haultfoeuille, (2010) d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1):1–15.
  • Esterling et al., (2011) Esterling, K. M., Neblo, M. A., and Lazer, D. M. (2011). Estimating treatment effects in the presence of noncompliance and nonresponse: The generalized endogenous treatment model. Political Analysis, 19(2):205–226.
  • Frangakis and Rubin, (1999) Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika, 86(2):365–379.
  • Frangakis and Rubin, (2002) Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1):21–29.
  • Goldberger, (1972) Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica: Journal of the Econometric Society, 40:979–1001.
  • Greenlees et al., (1982) Greenlees, J. S., Reece, W. S., and Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77(378):251–261.
  • Hall, (2005) Hall, A. R. (2005). Generalized method of moments. Oxford: Oxford University Press.
  • Hansen, (1982) Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society, pages 1029–1054.
  • Higgins et al., (2008) Higgins, J. P., White, I. R., and Wood, A. M. (2008). Imputation methods for missing outcome data in meta-analysis of clinical trials. Clinical trials, 5(3):225–239.
  • Horvitz and Thompson, (1952) Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685.
  • Imai, (2009) Imai, K. (2009). Statistical analysis of randomized experiments with non-ignorable missing binary outcomes: an application to a voting experiment. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(1):83–104.
  • Imbens and Angrist, (1994) Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.
  • Kim and Yu, (2011) Kim, J. K. and Yu, C. L. (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106(493):157–165.
  • Kott, (2014) Kott, P. S. (2014). Calibration weighting when model and calibration variables can differ. In Contributions to sampling statistics, pages 1–18. Springer.
  • Li and Zhou, (2017) Li, W. and Zhou, X.-H. (2017). Identifiability and estimation of causal mediation effects with missing data. Statistics in Medicine, 36(25):3948–3965.
  • Little, (1993) Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421):125–134.
  • Little, (1994) Little, R. J. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika, 81(3):471–483.
  • Little and Rubin, (2002) Little, R. J. and Rubin, D. B. (2002). Statistical analysis with missing data. Wiley: New York.
  • Miao et al., (2016) Miao, W., Ding, P., and Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516):1673–1683.
  • Miao et al., (2019) Miao, W., Liu, L., Tchetgen, E. T., and Geng, Z. (2019). Identification, doubly robust estimation, and semiparametric efficiency theory of nonignorable missing data with a shadow variable. arXiv preprint arXiv:1509.02556v3.
  • Miao and Tchetgen, (2018) Miao, W. and Tchetgen, E. T. (2018). Identification and inference with nonignorable missing covariate data. Statistica Sinica, 28(4):2049–2067.
  • Miao and Tchetgen Tchetgen, (2016) Miao, W. and Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2):475–482.
  • Newey and McFadden, (1994) Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111–2245.
  • Newey and Powell, (2003) Newey, W. K. and Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578.
  • Neyman, (1923) Neyman, J. S. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480). Annals of Agricultural Sciences, 10:1–51.
  • Niu et al., (2014) Niu, C., Guo, X., Xu, W., and Zhu, L. (2014). Empirical likelihood inference in linear regression with nonignorable missing response. Computational Statistics & Data Analysis, 79:91–112.
  • O’Malley and Normand, (2005) O’Malley, A. J. and Normand, S.-L. T. (2005). Likelihood methods for treatment noncompliance and subsequent nonresponse in randomized trials. Biometrics, 61(2):325–334.
  • Osius, (2004) Osius, G. (2004). The association between two random elements: A complete characterization and odds ratio models. Metrika, 60(3):261–277.
  • Qin et al., (2002) Qin, J., Leung, D., and Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457):193–200.
  • Robins and Ritov, (1997) Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in medicine, 16(3):285–319.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Roy, (2003) Roy, J. (2003). Modeling longitudinal data with nonignorable dropouts using a latent dropout class model. Biometrics, 59(4):829–836.
  • Rubin, (1973) Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 29:159–183.
  • Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688–701.
  • Rubin, (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
  • Rubin, (1979) Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74(366a):318–328.
  • Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American statistical association, 75(371):591–593.
  • Shao and Wang, (2016) Shao, J. and Wang, L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103(1):175–187.
  • Stuart, (2010) Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1–21.
  • Tang et al., (2003) Tang, G., Little, R. J., and Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4):747–764.
  • Tang et al., (2014) Tang, N., Zhao, P., and Zhu, H. (2014). Empirical likelihood for estimating equations with nonignorably missing data. Statistica Sinica, 24:723–747.
  • Tauchen, (1985) Tauchen, G. (1985). Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1-2):415–443.
  • Wang and Tchetgen Tchetgen, (2018) Wang, L. and Tchetgen Tchetgen, E. (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):531–550.
  • Wang et al., (2014) Wang, S., Shao, J., and Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24:1097–1116.
  • Wright, (1928) Wright, P. G. (1928). Tariff on animal and vegetable oils. Macmillan Company, New York.
  • Wu and Carroll, (1988) Wu, M. C. and Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics, 44:175–188.
  • Yang et al., (2019) Yang, S., Wang, L., and Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4):875–888.
  • Zhang et al., (2018) Zhang, L., Lin, C., and Zhou, Y. (2018). Generalized method of moments for nonignorable missing data. Statistica Sinica, 28(4):2107–2124.
  • Zhao et al., (2013) Zhao, H., Zhao, P. Y., and Tang, N. S. (2013). Empirical likelihood inference for mean functionals with nonignorably missing response data. Computational Statistics & Data Analysis, 66:101–116.
  • Zhao and Shao, (2015) Zhao, J. and Shao, J. (2015). Semiparametric pseudo-likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110(512):1577–1590.
  • Zhou and Li, (2006) Zhou, X.-H. and Li, S. M. (2006). ITT analysis of randomized encouragement design studies with missing data. Statistics in Medicine, 25(16):2737–2761.

Supplementary Materials


Appendix A: Proofs of the propositions

Throughout the paper, \mathcal{P}(\cdot) denotes the counting measure for a discrete variable.

Proof of Proposition 1

Under Assumption 10, we have that, for all (A,Y,Z),

\operatorname{OR}(a,y,z)=\operatorname{OR}(a,y)\equiv\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)},  (4)
\begin{gathered}f(y,r\mid a,z)=c(a,z)f(r\mid a,y=1)f(y\mid r=1,a,z)\{\operatorname{OR}(a,y)\}^{1-r},\\ c(a,z)=\frac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\frac{f(z\mid r=1,a)}{f(z\mid a)},\end{gathered}  (5)
f(r=1\mid a,y=1)=\frac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}.  (6)
Proof of equation (3.4)

Given Assumption 10 and according to equation (3.3), we have the following results

\begin{aligned}
\operatorname{OR}(a,y,z)&=\frac{f(y\mid r=0,a,z)f(y=1\mid r=1,a,z)}{f(y\mid r=1,a,z)f(y=1\mid r=0,a,z)}\\
&=\dfrac{\dfrac{f(y,r=0,a,z)}{f(r=0,a,z)}\dfrac{f(y=1,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(y=1,r=0,a,z)}{f(r=0,a,z)}}\\
&=\frac{f(y,r=0,a,z)f(y=1,r=1,a,z)}{f(y,r=1,a,z)f(y=1,r=0,a,z)}\\
&=\dfrac{\dfrac{f(y,r=0,a,z)}{f(a,y)}\dfrac{f(y=1,r=1,a,z)}{f(a,y=1)}}{\dfrac{f(y,r=1,a,z)}{f(a,y)}\dfrac{f(y=1,r=0,a,z)}{f(a,y=1)}}\\
&=\frac{f(r=0,z\mid a,y)f(r=1,z\mid a,y=1)}{f(r=1,z\mid a,y)f(r=0,z\mid a,y=1)}\\
&=\frac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\equiv\operatorname{OR}(a,y).
\end{aligned}
Proof of equation (3.5)

To prove equation (3.5), we consider the cases R=1 and R=0 separately. When R=1, we have that

\begin{aligned}
\text{right side of equation (3.5)}&=c(a,z)f(r=1\mid a,y=1)f(y\mid r=1,a,z)\\
&=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(r=1\mid a,y=1)f(y\mid r=1,a,z)\\
&=f(r=1\mid a)\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(y\mid r=1,a,z)\\
&=\dfrac{f(r=1,a)}{f(a)}\dfrac{\dfrac{f(z,r=1,a)}{f(r=1,a)}}{\dfrac{f(z,a)}{f(a)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\\
&=\dfrac{f(y,r=1,a,z)}{f(z,a)}=f(y,r=1\mid a,z)=\text{left side of equation (3.5)}.
\end{aligned}

When R=0, from equation (3.4) and Assumption 10, we have that

\begin{aligned}
\text{right side of equation (3.5)}&=c(a,z)f(r=0\mid a,y=1)f(y\mid r=1,a,z)\operatorname{OR}(a,y)\\
&=\dfrac{f(r=1\mid a)}{f(r=1\mid a,y=1)}\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(r=0\mid a,y=1)f(y\mid r=1,a,z)\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\\
&=f(r=1\mid a)\dfrac{f(z\mid r=1,a)}{f(z\mid a)}f(y\mid r=1,a,z)\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}\\
&=\dfrac{f(r=1,a)}{f(a)}\dfrac{\dfrac{f(z,r=1,a)}{f(r=1,a)}}{\dfrac{f(z,a)}{f(a)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(r=0,z\mid a,y)}{f(r=1,z\mid a,y)}\\
&=\dfrac{f(z,r=1,a)}{f(z,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}\dfrac{f(r=0,z,a,y)}{f(r=1,z,a,y)}\\
&=\dfrac{f(r=0,y,z,a)}{f(z,a)}=f(y,r=0\mid a,z)=\text{left side of equation (3.5)}.
\end{aligned}

In conclusion, equation (3.5) holds for both R=1 and R=0.

Proof of equation (3.6)

From equation (3.4), we notice that the odds ratio function \operatorname{OR}(a,y) is a function of the variables (A,Y) only. By the property of conditional expectation, we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(y\mid r=1,a)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y)}{f(r=1,a,y)}\dfrac{f(y,r=1,a)}{f(r=1,a)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y)}{f(r=1,a)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{1}{f(r=1,a)}\int f(r=0,a,y)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}.
\end{aligned}

Therefore,

\begin{aligned}
\text{right side of equation (3.6)}&=\dfrac{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}{f(r=0\mid a)/f(r=1\mid a)+E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}{\dfrac{f(r=0\mid a)}{f(r=1\mid a)}+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}{\dfrac{f(r=0,a)}{f(r=1,a)}+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}{1+\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}\\
&=\dfrac{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}{\dfrac{f(r=0\mid a,y=1)+f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}}\\
&=f(r=1\mid a,y=1)=\text{left side of equation (3.6)}.
\end{aligned}

Proof of Proposition 2

Under Assumption 10, we have that

f(r=1\mid a,y)=f(r=1\mid a,y,z)=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)},  (7)
f(y\mid r=0,a,z)=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]},  (8)
\begin{gathered}E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\dfrac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\\ \text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]},\end{gathered}  (9)
\operatorname{OR}(a,y)=\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}.  (10)
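To see what equation (9) amounts to in a setting like our application, suppose Y and Z are both binary; the following display is our own illustration under that assumption, not a restatement from the paper. The conditional expectation on the left side of (9) is then a finite sum,

\sum_{y\in\{1,2\}}\widetilde{\operatorname{OR}}(a,y)\,f(y\mid r=1,a,z)=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},\qquad z\in\{0,1\},

which, for each treatment arm a, is a system of two linear equations in the two unknowns \widetilde{\operatorname{OR}}(a,1) and \widetilde{\operatorname{OR}}(a,2). The system has a unique solution whenever the 2\times 2 matrix \left[f(y\mid r=1,a,z)\right]_{z,y} is nonsingular, that is, whenever f(y\mid r=1,a,z) genuinely varies with z; this is the role played by Condition 1 and by part (b) of shadow variable Assumption 10 in this binary case.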
Proof of equation (3.7)

From Assumption 10 and equation (3.4), we have that

\begin{aligned}
\text{right side of equation (3.7)}&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\operatorname{OR}(a,y)f(r=0\mid a,y=1)}\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(r=0\mid a,y=1)}\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=1\mid a,y=1)+\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)}}\\
&=\dfrac{1}{1+\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}}\\
&=\dfrac{f(r=1\mid a,y)}{f(r=1\mid a,y)+f(r=0\mid a,y)}\\
&=f(r=1\mid a,y)=\text{left side of equation (3.7)},
\end{aligned}

and the first equality in equation (3.7), f(r=1\mid a,y)=f(r=1\mid a,y,z), holds directly by the shadow variable Assumption 10.
Proof of equation (3.8)

From Assumption 10 and equation (3.4), by the property of conditional mathematical expectation, we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0\mid a,y,z)}{f(r=1\mid a,y,z)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{\dfrac{f(r=0,a,y,z)}{f(a,y,z)}}{\dfrac{f(r=1,a,y,z)}{f(a,y,z)}}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\int\dfrac{f(r=0,a,y,z)}{f(r=1,a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{1}{f(r=1,a,z)}\int f(r=0,a,y,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)}.
\end{aligned}

Therefore,

\begin{aligned}
\text{right side of equation (3.8)}&=\dfrac{\operatorname{OR}(a,y)f(y\mid r=1,a,z)}{E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\,f(y\mid r=1,a,z)}{\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}f(y\mid r=1,a,z)}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y,z)}{f(r=1\mid a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{\dfrac{f(r=0,a,y,z)}{f(r=1,a,y,z)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}}{\dfrac{f(r=0,a,z)}{f(r=1,a,z)}}\\
&=\dfrac{f(r=0,a,y,z)}{f(r=0,a,z)}\\
&=f(y\mid r=0,a,z)=\text{left side of equation (3.8)}.
\end{aligned}
Proof of equation (3.9)

From equation (3.6) and equation (3.8), we have that

\begin{aligned}
E\left[\operatorname{OR}(a,y)\mid r=1,a\right]&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a)}{f(r=1,a)},\\
E\left[\operatorname{OR}(a,y)\mid r=1,a,z\right]&=\dfrac{f(r=1\mid a,y=1)}{f(r=0\mid a,y=1)}\dfrac{f(r=0,a,z)}{f(r=1,a,z)},
\end{aligned}

and

\begin{aligned}
\widetilde{\operatorname{OR}}(a,y)&=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}}{\dfrac{f(r=1\mid a,y=1)f(r=0,a)}{f(r=0\mid a,y=1)f(r=1,a)}}\\
&=\dfrac{\dfrac{f(r=0\mid a,y)}{f(r=1\mid a,y)}}{\dfrac{f(r=0,a)}{f(r=1,a)}}\\
&=\dfrac{f(r=0\mid a,y)f(r=1,a)}{f(r=1\mid a,y)f(r=0,a)}\\
&=\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}.
\end{aligned}

Therefore,

\begin{aligned}
E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]&=\int\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}f(y\mid r=1,a,z)d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0\mid a,y,z)f(r=1,a)}{f(r=1\mid a,y,z)f(r=0,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0,a,y,z)f(r=1,a)}{f(r=1,a,y,z)f(r=0,a)}\dfrac{f(y,r=1,a,z)}{f(r=1,a,z)}d\mathcal{P}(y)\\
&=\int\dfrac{f(r=0,a,y,z)f(r=1,a)}{f(r=0,a)f(r=1,a,z)}d\mathcal{P}(y)\\
&=\dfrac{f(r=1,a)}{f(r=0,a)f(r=1,a,z)}\int f(r=0,a,y,z)d\mathcal{P}(y)\\
&=\dfrac{f(r=0,a,z)f(r=1,a)}{f(r=0,a)f(r=1,a,z)}\\
&=\dfrac{\dfrac{f(r=0,a,z)}{f(r=0,a)}}{\dfrac{f(r=1,a,z)}{f(r=1,a)}}\\
&=\dfrac{f(z\mid r=0,a)}{f(z\mid r=1,a)}.
\end{aligned}
Proof of equation (3.10)

From equation (3.9), we have that

\begin{aligned}
\dfrac{\widetilde{\operatorname{OR}}(a,y)}{\widetilde{\operatorname{OR}}(a,y=1)}&=\dfrac{\dfrac{f(r=0\mid a,y)f(r=1\mid a)}{f(r=1\mid a,y)f(r=0\mid a)}}{\dfrac{f(r=0\mid a,y=1)f(r=1\mid a)}{f(r=1\mid a,y=1)f(r=0\mid a)}}\\
&=\dfrac{f(r=0\mid a,y)f(r=1\mid a,y=1)}{f(r=1\mid a,y)f(r=0\mid a,y=1)}\\
&=\operatorname{OR}(a,y).
\end{aligned}

Appendix B: Proofs of the Theorems

Proof of Theorem 1

From Proposition 2, we have that

E\left[\widetilde{\operatorname{OR}}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)},  (A.1)
\text{where}\quad\widetilde{\operatorname{OR}}(a,y)=\dfrac{\operatorname{OR}(a,y)}{E\left[\operatorname{OR}(a,y)\mid r=1,a\right]}.

Under shadow variable Assumption 10, we can identify the odds ratio function \operatorname{OR}(a,y). Because f(y\mid r=1,a,z) and f(z\mid r=1,a) are identifiable from the observed data, for any candidate \operatorname{OR}(a,y) we only need the observed data to evaluate the integral equation (A.1). Suppose that \widetilde{\operatorname{OR}}^{(1)}(a,y) and \widetilde{\operatorname{OR}}^{(2)}(a,y) are two solutions to equation (3.9):

E\left[\widetilde{\operatorname{OR}}^{(k)}(a,y)\mid r=1,a,z\right]=\frac{f(z\mid r=0,a)}{f(z\mid r=1,a)}\qquad(k=1,2),

We have that

E\left[\widetilde{\operatorname{OR}}^{(1)}(a,y)-\widetilde{\operatorname{OR}}^{(2)}(a,y)\mid r=1,a,z\right]=0.

Condition 1 implies that \widetilde{\operatorname{OR}}^{(1)}(a,y)-\widetilde{\operatorname{OR}}^{(2)}(a,y)=0 almost surely, that is, \widetilde{\operatorname{OR}}^{(1)}(a,y)=\widetilde{\operatorname{OR}}^{(2)}(a,y) almost surely. Therefore, equation (A.1) has a unique solution, which means \widetilde{\operatorname{OR}}(a,y) is identifiable. Based on equation (3.10), which relates \widetilde{\operatorname{OR}}(a,y) to \operatorname{OR}(a,y), we can finally identify the odds ratio function as \operatorname{OR}(a,y)=\widetilde{\operatorname{OR}}(a,y)/\widetilde{\operatorname{OR}}(a,y=1).

Proof of Theorem 2

We first introduce some notation. For simplicity, let \bm{s}=\left(y,r,a,z\right)^{\mathrm{T}} be the random vector collecting all the random variables. Let g_{m}\left(\bm{s},\bm{\theta}\right)=g_{m}\left(y,r,a,z,\bm{\theta}\right), \bm{\theta}\in\Theta, m=1,\cdots,q, denote the moment functions in (4.3). Let

\begin{aligned}
G(\bm{s},\bm{\theta})&=\left(g_{1}(\bm{s},\bm{\theta}),g_{2}(\bm{s},\bm{\theta}),\cdots,g_{q}(\bm{s},\bm{\theta})\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
G_{0}(\bm{\theta})&=\left(E\left[g_{1}(\bm{s},\bm{\theta})\right],E\left[g_{2}(\bm{s},\bm{\theta})\right],\cdots,E\left[g_{q}(\bm{s},\bm{\theta})\right]\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
G_{n}(\bm{s}_{i},\bm{\theta})&=\left(g_{1}\left(\bm{s}_{i},\bm{\theta}\right),g_{2}\left(\bm{s}_{i},\bm{\theta}\right),\cdots,g_{q}\left(\bm{s}_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad i=1,\cdots,n,\quad\bm{\theta}\in\Theta,\\
\widehat{G}_{n}(\bm{\theta})&=\left(\frac{1}{n}\sum_{i=1}^{n}g_{1}\left(\bm{s}_{i},\bm{\theta}\right),\frac{1}{n}\sum_{i=1}^{n}g_{2}\left(\bm{s}_{i},\bm{\theta}\right),\cdots,\frac{1}{n}\sum_{i=1}^{n}g_{q}\left(\bm{s}_{i},\bm{\theta}\right)\right)^{\mathrm{T}},\quad\bm{\theta}\in\Theta,\\
Q_{0}(\bm{\theta})&=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right],\\
Q_{n}(\bm{\theta})&=\left[\widehat{G}_{n}(\bm{\theta})\right]^{\mathrm{T}}W\left[\widehat{G}_{n}(\bm{\theta})\right],\\
\widehat{Q}_{n}(\bm{\theta})&=\left[\widehat{G}_{n}(\bm{\theta})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{G}_{n}(\bm{\theta})\right],
\end{aligned}

where W is a positive semi-definite and symmetric q\times q matrix of weights, and \widehat{W}=W(\tilde{\bm{\theta}}). Here \bm{\theta} is a p-dimensional parameter, \tilde{\bm{\theta}} is the estimator of \bm{\theta} from the first step of the two-step GMM, that is, \tilde{\bm{\theta}}=\mathop{\arg\min}_{\bm{\theta}\in\Theta}Q_{n}(\bm{\theta}), and \bm{\theta}_{0} is the true value of the parameter \bm{\theta}. For a matrix B=\left[b_{jk}\right], let \|B\|=\left(\sum_{j,k}b_{jk}^{2}\right)^{1/2} be the Euclidean norm. Let \nabla_{\bm{\theta}}(\cdot) and \nabla_{\bm{\theta}\bm{\theta}}(\cdot) denote the first- and second-order derivatives with respect to \bm{\theta}. Let \widehat{\bm{\theta}}=\mathop{\arg\min}_{\bm{\theta}\in\Theta}\widehat{Q}_{n}(\bm{\theta}).
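To fix ideas, the following Python sketch implements the generic two-step GMM recipe that this notation describes: an identity-weight first stage gives \tilde{\bm{\theta}}, a weight matrix is then re-estimated at \tilde{\bm{\theta}}, and the second stage gives \widehat{\bm{\theta}}. The moment function used in the toy example is a placeholder of our own; it is not the moment function g_{m} of equation (4.3), and the choice \widehat{W} as an inverse sample second-moment matrix is one standard option rather than the paper's specification.

import numpy as np
from scipy.optimize import minimize

def two_step_gmm(G, theta_init):
    """G(theta) returns the n x q matrix of moment contributions g_m(s_i, theta)."""
    def objective(theta, W):
        g_bar = G(theta).mean(axis=0)   # \hat{G}_n(theta)
        return g_bar @ W @ g_bar        # \hat{Q}_n(theta)

    q = G(theta_init).shape[1]
    # Step 1: minimize with the identity weight matrix to get theta_tilde.
    theta_tilde = minimize(objective, theta_init, args=(np.eye(q),)).x
    # Step 2: re-estimate the weight matrix at theta_tilde and minimize again.
    Gi = G(theta_tilde)
    W_hat = np.linalg.inv(Gi.T @ Gi / Gi.shape[0])
    theta_hat = minimize(objective, theta_tilde, args=(W_hat,)).x
    return theta_hat, W_hat

# Toy illustration: moments for the mean and variance of a normal sample (not the paper's moments).
rng = np.random.default_rng(1)
s = rng.normal(2.0, 1.5, size=500)
G = lambda th: np.column_stack([s - th[0], (s - th[0]) ** 2 - th[1]])
print(two_step_gmm(G, np.array([0.0, 1.0]))[0])   # roughly [2.0, 2.25]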

Uniform convergence in probability: \widehat{Q}_{n}(\bm{\theta}) converges uniformly in probability to Q_{0}(\bm{\theta}) means that, as n\rightarrow\infty,

\sup_{\bm{\theta}\in\Theta}\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|\stackrel{P}{\longrightarrow}0.

To prove Theorem 2, we first need some lemmas. More details can be found in Newey and McFadden, (1994).

Lemma 1.

(GMM identification) If W is a positive semi-definite matrix, G_{0}(\bm{\theta}_{0})=\bm{0}, and WG_{0}(\bm{\theta})\neq\bm{0} for \bm{\theta}\neq\bm{\theta}_{0}, then Q_{0}(\bm{\theta})=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right] has a unique minimum at \bm{\theta}_{0}.

Proof of Lemma 1. Let D be such that D^{\mathrm{T}}D=W. If \bm{\theta}\neq\bm{\theta}_{0}, then \bm{0}\neq WG_{0}(\bm{\theta})=D^{\mathrm{T}}DG_{0}(\bm{\theta}) implies DG_{0}(\bm{\theta})\neq\bm{0}, and hence for \bm{\theta}\neq\bm{\theta}_{0}, Q_{0}(\bm{\theta})=\left[DG_{0}(\bm{\theta})\right]^{\mathrm{T}}\left[DG_{0}(\bm{\theta})\right]>Q_{0}(\bm{\theta}_{0}).

Lemma 2.

If the data are i.i.d., \Theta is compact, \bm{e}\left(\bm{s},\bm{\theta}\right) is continuous at each \bm{\theta}\in\Theta with probability one, and there is b(\bm{s}) with \|\bm{e}\left(\bm{s},\bm{\theta}\right)\|\leqslant b(\bm{s}) for all \bm{\theta}\in\Theta and E[b(\bm{s})]<\infty, then E[\bm{e}\left(\bm{s},\bm{\theta}\right)] is continuous and \sup_{\bm{\theta}\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^{n}\bm{e}\left(\bm{s}_{i},\bm{\theta}\right)-E[\bm{e}\left(\bm{s},\bm{\theta}\right)]\right\|\stackrel{P}{\longrightarrow}0.

Proof of Lemma 2. It is implied by Lemma 1 of Tauchen, (1985).

Lemma 3.

If there is a function Q_{0}(\bm{\theta}) such that (i) Q_{0}(\bm{\theta}) is uniquely minimized at \bm{\theta}_{0}; (ii) \Theta is compact; (iii) Q_{0}(\bm{\theta}) is continuous; (iv) \widehat{Q}_{n}(\bm{\theta}) converges uniformly in probability to Q_{0}(\bm{\theta}), then \widehat{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0}.

Proof of Lemma 3. By the definition of the minimum and the hypotheses of Lemma 3, for any \varepsilon>0 we have, with probability approaching one,

\widehat{Q}_{n}(\widehat{\bm{\theta}})<\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)+\varepsilon/3,\quad Q_{0}(\widehat{\bm{\theta}})<\widehat{Q}_{n}(\widehat{\bm{\theta}})+\varepsilon/3,\quad\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon/3.

Therefore, with probability approaching one

Q_{0}(\widehat{\bm{\theta}})<\widehat{Q}_{n}(\widehat{\bm{\theta}})+\varepsilon/3<\widehat{Q}_{n}\left(\bm{\theta}_{0}\right)+2\varepsilon/3<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon.

Thus, for any \varepsilon>0, with probability approaching one

Q_{0}(\widehat{\bm{\theta}})<Q_{0}\left(\bm{\theta}_{0}\right)+\varepsilon.

Let \mathscr{N} be any open subset of \Theta containing \bm{\theta}_{0}. Since \Theta\cap\mathscr{N}^{c} is compact, by (i) and (iii),

Q_{0}\left(\bm{\theta}^{*}\right)\equiv\inf_{\bm{\theta}\in\Theta\cap\mathscr{N}^{c}}Q_{0}(\bm{\theta})>Q_{0}\left(\bm{\theta}_{0}\right),

for some \bm{\theta}^{*}\in\Theta\cap\mathscr{N}^{c}. Thus, choosing \varepsilon=Q_{0}\left(\bm{\theta}^{*}\right)-Q_{0}\left(\bm{\theta}_{0}\right), it follows that, with probability approaching one,

Q_{0}(\widehat{\bm{\theta}})<Q_{0}\left(\bm{\theta}^{*}\right),

and hence \widehat{\bm{\theta}}\in\mathscr{N} with probability approaching one. By the arbitrariness of \mathscr{N},

\widehat{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0}.

Proof of Theorem 2.

We proceed by verifying the hypotheses of Lemma 3. Condition (i) in Lemma 3 follows from the conditions assumed in Theorem 2 and Lemma 1. Condition (ii) in Lemma 3 holds by the conditions of Theorem 2. By Lemma 2 applied to \bm{e}\left(\bm{s},\bm{\theta}\right)=G(\bm{s},\bm{\theta}), we have \sup_{\bm{\theta}\in\Theta}\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|\stackrel{P}{\longrightarrow}0 and G_{0}(\bm{\theta}) is continuous. Thus, (iii) in Lemma 3 holds because Q_{0}(\bm{\theta})=\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}W\left[G_{0}(\bm{\theta})\right] is continuous. Since \Theta is compact, G_{0}(\bm{\theta}) is bounded on \Theta.

Suppose that \widehat{W} converges to W in probability. By the triangle and Cauchy-Schwarz inequalities,

\begin{aligned}
\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|&\leqslant\left|\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]\right|+\left|\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}\left(\widehat{W}+\widehat{W}^{\mathrm{T}}\right)\left[\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right]\right|\\
&\qquad+\left|\left[G_{0}(\bm{\theta})\right]^{\mathrm{T}}\left(\widehat{W}-W\right)G_{0}(\bm{\theta})\right|\\
&\leqslant\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|^{2}\|\widehat{W}\|+2\left\|G_{0}(\bm{\theta})\right\|\left\|\widehat{G}_{n}(\bm{\theta})-G_{0}(\bm{\theta})\right\|\|\widehat{W}\|\\
&\qquad+\left\|G_{0}(\bm{\theta})\right\|^{2}\|\widehat{W}-W\|,
\end{aligned}

so that

\sup_{\bm{\theta}\in\Theta}\left|\widehat{Q}_{n}(\bm{\theta})-Q_{0}(\bm{\theta})\right|\stackrel{P}{\longrightarrow}0,

and (iv) in Lemma 3 holds.

With W being the identity matrix I_{q\times q}, the same argument shows that

\tilde{\bm{\theta}}\stackrel{P}{\longrightarrow}\bm{\theta}_{0},

which implies \widehat{W}\stackrel{P}{\longrightarrow}W.

Proof of Theorem 3

Under the assumptions in Theorem 2, \nabla_{\bm{\theta}}\widehat{Q}_{n}(\widehat{\bm{\theta}})=0 with probability approaching one. By Taylor's expansion around \bm{\theta}_{0},

\nabla_{\bm{\theta}}\widehat{Q}_{n}(\widehat{\bm{\theta}})-\nabla_{\bm{\theta}}\widehat{Q}_{n}(\bm{\theta}_{0})=\nabla_{\bm{\theta}\bm{\theta}}\widehat{Q}_{n}(\bm{\theta}^{*})(\widehat{\bm{\theta}}-\bm{\theta}_{0}),

where \bm{\theta}^{*} lies between \widehat{\bm{\theta}} and \bm{\theta}_{0}. Let \widehat{H}(\widehat{\bm{\theta}})=\nabla_{\bm{\theta}}\widehat{G}_{n}(\widehat{\bm{\theta}}). Multiplying through by \sqrt{n} and solving gives

\sqrt{n}\left(\widehat{\bm{\theta}}-\bm{\theta}_{0}\right)=-\left\{\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{H}(\bm{\theta}^{*})\right]\right\}^{-1}\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\sqrt{n}\,\widehat{G}_{n}(\bm{\theta}_{0})\right].

By (iii) in Condition 3, \widehat{H}(\widehat{\bm{\theta}})\stackrel{P}{\longrightarrow}H and \widehat{H}(\bm{\theta}^{*})\stackrel{P}{\longrightarrow}H, so that by (iv) and the continuity of matrix inversion,

-\left\{\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\left[\widehat{H}(\bm{\theta}^{*})\right]\right\}^{-1}\left[\widehat{H}(\widehat{\bm{\theta}})\right]^{\mathrm{T}}\widehat{W}\stackrel{P}{\longrightarrow}-\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W.

Based on the Central Limit Theorem,

\sqrt{n}\,\widehat{G}_{n}(\bm{\theta}_{0})\stackrel{\mathscr{L}}{\longrightarrow}N(\bm{0},\Omega),

and the conclusion then follows by Slutsky's theorem:

\sqrt{n}(\widehat{\bm{\theta}}-\bm{\theta}_{0})\stackrel{\mathscr{L}}{\longrightarrow}N\left(\bm{0},\left(H^{\mathrm{T}}WH\right)^{-1}H^{\mathrm{T}}W\Omega WH\left(H^{\mathrm{T}}WH\right)^{-1}\right).
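The sandwich form of this asymptotic covariance is straightforward to compute once plug-in estimates of H, W, and \Omega are available. The short Python sketch below does this for generic matrices; the inputs H_hat, W_hat, and Omega_hat are placeholders of our own for the corresponding plug-in estimates and are not taken from the paper.

import numpy as np

def sandwich_cov(H_hat, W_hat, Omega_hat, n):
    """Estimated covariance of theta_hat: (H'WH)^{-1} H'W Omega W H (H'WH)^{-1} / n."""
    bread = np.linalg.inv(H_hat.T @ W_hat @ H_hat)
    meat = H_hat.T @ W_hat @ Omega_hat @ W_hat @ H_hat
    return bread @ meat @ bread / n

# Toy plug-in matrices with q = 3 moments and p = 2 parameters (illustrative only).
H_hat = np.array([[1.0, 0.2], [0.3, 1.1], [0.5, 0.4]])
Omega_hat = np.array([[1.0, 0.1, 0.0], [0.1, 1.2, 0.2], [0.0, 0.2, 0.9]])
W_hat = np.linalg.inv(Omega_hat)   # with this optimal weight, the variance simplifies to (H'Omega^{-1}H)^{-1}/n
print(sandwich_cov(H_hat, W_hat, Omega_hat, n=670))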

Proof of Theorem 4

Under the assumptions of Theorem 2 and Theorem 3, we proceed as follows.

By the consistency of \widehat{\bm{\theta}}, there is \lambda_{n}\rightarrow 0 such that \|\widehat{\bm{\theta}}-\bm{\theta}_{0}\|\leqslant\lambda_{n} with probability approaching one.

Let \Lambda_{n}(\bm{s})=\sup_{\|\bm{\theta}-\bm{\theta}_{0}\|\leqslant\lambda_{n}}\left\|G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}}-G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right\|.

By the continuity of G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}} at \bm{\theta}_{0}, \Lambda_{n}(\bm{s})\rightarrow 0 with probability one, while by the dominance condition, for n large enough, \Lambda_{n}(\bm{s})\leqslant 2\sup_{\bm{\theta}\in\mathcal{N}}\left\|G(\bm{s},\bm{\theta})G(\bm{s},\bm{\theta})^{\mathrm{T}}\right\|.

Then, by the dominated convergence theorem, E\left[\Lambda_{n}(\bm{s})\right]\rightarrow 0, so by Markov's inequality,

P\left(\left|\frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\right|>\varepsilon\right)\leqslant\frac{E\left[\Lambda_{n}(\bm{s})\right]}{\varepsilon}\rightarrow 0,

for all \varepsilon>0, giving \frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\stackrel{P}{\longrightarrow}0. By Khinchin's law of large numbers,

\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\stackrel{P}{\longrightarrow}E\left[G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right].

Also, with probability approaching one,

\begin{aligned}
\left\|\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}-\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\right\|&\leqslant\frac{1}{n}\sum_{i=1}^{n}\left\|G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}-G(\bm{s}_{i},\bm{\theta}_{0})G(\bm{s}_{i},\bm{\theta}_{0})^{\mathrm{T}}\right\|\\
&\leqslant\frac{1}{n}\sum_{i=1}^{n}\Lambda_{n}(\bm{s}_{i})\stackrel{P}{\longrightarrow}0.
\end{aligned}

So by the triangle inequality, we have that

\frac{1}{n}\sum_{i=1}^{n}G(\bm{s}_{i},\widehat{\bm{\theta}})G(\bm{s}_{i},\widehat{\bm{\theta}})^{\mathrm{T}}\stackrel{P}{\longrightarrow}E\left[G(\bm{s},\bm{\theta}_{0})G(\bm{s},\bm{\theta}_{0})^{\mathrm{T}}\right],

which means \widehat{\Omega}\stackrel{P}{\longrightarrow}\Omega. Theorem 2 and Theorem 3 imply that \widehat{H}\stackrel{P}{\longrightarrow}H and \widehat{W}\stackrel{P}{\longrightarrow}W. The conclusion follows by (iv) of Condition 3 and the continuity of matrix inversion and multiplication.