
Full Law Identification in Graphical Models of Missing Data:
Completeness Results

Razieh Nabi    Rohit Bhattacharya    Ilya Shpitser
Abstract

Missing data has the potential to affect analyses conducted in all fields of scientific study, including healthcare, economics, and the social sciences. Several approaches to unbiased inference in the presence of non-ignorable missingness rely on specifying the target distribution and its missingness process as a probability distribution that factorizes with respect to a directed acyclic graph. In this paper, we address the longstanding question of characterizing the models that are identifiable within this class of missing data distributions. We provide the first completeness result in this field of study: necessary and sufficient graphical conditions under which the full data distribution can be recovered from the observed data distribution. We then simultaneously address issues that may arise due to the presence of both missing data and unmeasured confounding, by extending these graphical conditions and proofs of completeness to settings where some variables are not just missing, but completely unobserved.

Missing Data, Identification, Missing Not At Random, Causality, Graphical Models, Selection Bias

1 Introduction

Missing data has the potential to affect analyses conducted in all fields of scientific study, including healthcare, economics, and the social sciences. Strategies to cope with missingness that depends only on the observed data, known as the missing at random (MAR) mechanism, are well-studied (Dempster et al., 1977; Cheng, 1994; Robins et al., 1994; Tsiatis, 2006). However, the setting where missingness depends on covariates that may themselves be missing, known as the missing not at random (MNAR) mechanism, is substantially more difficult and under-studied (Fielding et al., 2008; Marston et al., 2010). MNAR mechanisms are expected to occur quite often in practice, for example, in longitudinal studies with complex patterns of dropout and re-enrollment, or in studies where social stigma may prompt non-response to questions pertaining to drug use or sexual activity and orientation, in a way that depends on other imperfectly collected or censored covariates (Robins & Gill, 1997; Vansteelandt et al., 2007; Marra et al., 2017).

Previous work on MNAR models has proceeded by imposing a set of restrictions on the full data distribution (the target distribution and its missingness mechanism) that are sufficient to yield identification of the parameter of interest. While there exist MNAR models whose restrictions cannot be represented graphically (Tchetgen Tchetgen et al., 2018), the restrictions posed in several popular MNAR models such as the permutation model (Robins & Gill, 1997), the block-sequential MAR model (Zhou et al., 2010), the itemwise conditionally independent nonresponse (ICIN) model (Shpitser, 2016; Sadinle & Reiter, 2017), and those in (Daniel et al., 2012; Thoemmes & Rose, 2013; Martel García, 2013; Mohan et al., 2013; Mohan & Pearl, 2014; Saadati & Tian, 2019) are either explicitly graphical or can be interpreted as such.

Despite the popularity of graphical modeling approaches for missing data problems, characterization of the class of missing data distributions identified as functionals of the observed data distribution has remained an open question (Bhattacharya et al., 2019). Several algorithms for the identification of the target distribution have been proposed (Mohan & Pearl, 2014; Shpitser et al., 2015; Tian, 2017; Bhattacharya et al., 2019). We show that even the most general algorithm currently published (Bhattacharya et al., 2019) still retains a significant gap: there exist identified target distributions that the algorithm fails to identify. We then present what is, to our knowledge, the first completeness result for missing data models representable as directed acyclic graphs (DAGs): a necessary and sufficient graphical condition under which the full data distribution is identified as a function of the observed data distribution. For any given field of study, such a characterization is one of the most powerful results that identification theory can offer, as it comes with the guarantee that if these conditions do not hold, the model is provably not identified.

We further generalize these graphical conditions to settings where some variables are not just missing, but completely unobserved. Such distributions are typically summarized using acyclic directed mixed graphs (ADMGs) (Richardson et al., 2017). We prove, once again, that our graphical criteria are sound and complete for the identification of full laws that are Markov relative to a hidden variable DAG and the resulting summary ADMG. This new result allows us to address two of the most critical issues in practical data analyses simultaneously, those of missingness and unmeasured confounding.

Finally, in the course of proving our results on completeness, we show that the proposed graphical conditions also imply that all missing data models of directed acyclic graphs or acyclic directed mixed graphs that meet these conditions are in fact submodels of the MNAR models in (Shpitser, 2016; Sadinle & Reiter, 2017). This simple yet powerful result implies that the joint density of these models may be identified using an odds ratio parameterization that also ensures congenial specification of various components of the likelihood (Chen, 2007; Malinsky et al., 2019). Our results serve as an important precondition for the development of score-based model selection methods for graphical models of missing data, as an alternative to the constraint-based approaches proposed in (Strobl et al., 2018; Gain & Shpitser, 2018; Tu et al., 2019), and directly yield semi-parametric estimators using results in (Malinsky et al., 2019).

2 Preliminaries

A directed acyclic graph (DAG) $\mathcal{G}(V)$ consists of a set of nodes $V$ connected through directed edges such that there are no directed cycles. We will abbreviate $\mathcal{G}(V)$ as simply $\mathcal{G}$ when the vertex set is clear from the given context. Statistical models of a DAG $\mathcal{G}$ are sets of distributions that factorize as $p(V) = \prod_{V_i \in V} p(V_i \mid \operatorname{pa}_{\mathcal{G}}(V_i))$, where $\operatorname{pa}_{\mathcal{G}}(V_i)$ are the parents of $V_i$ in $\mathcal{G}$. The absence of edges between variables in $\mathcal{G}$, relative to a complete DAG, entails conditional independence facts in $p(V)$. These can be directly read off from the DAG $\mathcal{G}$ by the well-known d-separation criterion (Pearl, 2009). That is, for disjoint sets $X, Y, Z$, the following global Markov property holds: $(X \perp\!\!\!\perp_{\text{d-sep}} Y \mid Z)_{\mathcal{G}} \implies (X \perp\!\!\!\perp Y \mid Z)_{p(V)}$. When the context is clear, we will simply write $X \perp\!\!\!\perp Y \mid Z$ to denote the conditional independence between $X$ and $Y$ given $Z$.
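To make the global Markov property concrete, here is a minimal sketch of a d-separation check via the standard moralization criterion: $X$ is d-separated from $Y$ given $Z$ iff $Z$ separates $X$ and $Y$ in the moralized graph of the subgraph induced by the ancestors of $X \cup Y \cup Z$. The dictionary encoding of the DAG and the example chain are our own illustrative assumptions, not notation from the paper.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return nodes together with all of their ancestors in the DAG."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in result:
                result.add(p); stack.append(p)
    return result

def d_separated(parents, X, Y, Z):
    """Check (X d-sep Y | Z) in a DAG given as a {node: [parents]} dict."""
    keep = ancestors(parents, set(X) | set(Y) | set(Z))
    # Moralize: link each node to its parents, and "marry" co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        pa = [p for p in parents.get(v, []) if p in keep]
        for p in pa:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(pa, 2):
            adj[p].add(q); adj[q].add(p)
    # d-separated iff no path in the moral graph connects X to Y avoiding Z.
    frontier = list(set(X) - set(Z))
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        if v in Y:
            return False
        for u in adj[v] - seen - set(Z):
            seen.add(u); frontier.append(u)
    return True

# In the chain A -> B -> C: A is d-separated from C given B, but not marginally.
dag = {"A": [], "B": ["A"], "C": ["B"]}
assert d_separated(dag, {"A"}, {"C"}, {"B"})
assert not d_separated(dag, {"A"}, {"C"}, set())
```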

In practice, some variables on the DAG may be unmeasured or hidden. In such cases, the distribution $p(V \cup U)$ is Markov relative to a hidden variable DAG $\mathcal{G}(V \cup U)$, where variables in $U$ are unobserved. There may be infinitely many hidden variable DAGs that imply the same set of conditional independences on the observed margin. Hence, it is typical to utilize a single acyclic directed mixed graph (ADMG), consisting of directed and bidirected edges, that entails the same set of equality constraints as this infinite class (Evans, 2018). Such an ADMG $\mathcal{G}(V)$ is obtained from a hidden variable DAG $\mathcal{G}(V \cup U)$ via the latent projection operator (Verma & Pearl, 1990) as follows. An edge $A \rightarrow B$ exists in $\mathcal{G}(V)$ if there exists a directed path from $A$ to $B$ in $\mathcal{G}(V \cup U)$ with all intermediate vertices in $U$. An edge $A \leftrightarrow B$ exists in $\mathcal{G}(V)$ if there exists a collider-free path (i.e., one with no consecutive edges of the form $\rightarrow \circ \leftarrow$) from $A$ to $B$ in $\mathcal{G}(V \cup U)$ with all intermediate vertices in $U$, such that the first edge on the path is an incoming edge into $A$ and the final edge is an incoming edge into $B$.

Given a distribution $p(V \cup U)$ that is Markov relative to a hidden variable DAG $\mathcal{G}(V \cup U)$, conditional independence facts pertaining to the observed margin $p(V)$ can be read off from the ADMG $\mathcal{G}(V)$ by a simple analogue of the d-separation criterion, known as m-separation (Richardson, 2003), which generalizes the notion of a collider to include mixed edges of the form $\rightarrow \circ \leftrightarrow$, $\leftrightarrow \circ \leftarrow$, and $\leftrightarrow \circ \leftrightarrow$.

3 Missing Data Models

A missing data model is a set of distributions defined over a set of random variables $\{O, X^{(1)}, R, X\}$, where $O$ denotes the set of variables that are always observed, $X^{(1)}$ denotes the set of variables that are potentially missing, $R$ denotes the set of missingness indicators of the variables in $X^{(1)}$, and $X$ denotes the set of observed proxies of the variables in $X^{(1)}$. By definition, missingness indicators are binary random variables; the state spaces of variables in $X^{(1)}$ and $O$, however, are unrestricted. Given $X_i^{(1)} \in X^{(1)}$ and its corresponding missingness indicator $R_i \in R$, the observed proxy $X_i$ is defined as $X_i \equiv X_i^{(1)}$ if $R_i = 1$, and $X_i = ?$ if $R_i = 0$. Hence, $p(X \mid R, X^{(1)})$ is deterministically defined. We call the non-deterministic part of a missing data distribution, i.e., $p(O, X^{(1)}, R)$, the full law, and partition it into two pieces: the target law $p(O, X^{(1)})$ and the missingness mechanism $p(R \mid X^{(1)}, O)$. The censored version of the full law, $p(O, R, X)$, which the analyst actually has access to, is known as the observed data distribution.
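The following minimal sketch illustrates how the observed proxies arise from the full data: $X_i$ coincides with $X_i^{(1)}$ when $R_i = 1$ and is censored when $R_i = 0$ (with np.nan standing in for "?"). The simulated distributions are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

x1_full = rng.normal(size=n)          # X_1^(1): potentially missing variable
r1 = rng.binomial(1, 0.7, size=n)     # R_1: its missingness indicator

# Observed proxy: a deterministic function of (X_1^(1), R_1).
x1_proxy = np.where(r1 == 1, x1_full, np.nan)   # np.nan stands in for "?"

print(np.column_stack([x1_full, r1, x1_proxy]))
```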

Following the convention in (Mohan et al., 2013), let $\mathcal{G}(V)$ be a missing data DAG, where $V = O \cup X^{(1)} \cup R \cup X$. In addition to acyclicity, the edges of a missing data DAG are subject to other restrictions: outgoing edges from variables in $R$ cannot point to variables in $\{X^{(1)}, O\}$; each $X_i \in X$ has only two parents in $\mathcal{G}$, namely $R_i$ and $X_i^{(1)}$ (these edges represent the deterministic function above that defines $X_i$, and are shown in gray in all figures below); and there are no outgoing edges from $X_i$ (i.e., the proxy $X_i$ does not cause any variable on the DAG, though the corresponding full data variable $X_i^{(1)}$ may cause other variables). A missing data model associated with a missing data DAG $\mathcal{G}$ is the set of distributions $p(O, X^{(1)}, R, X)$ that factorize as

$$\prod_{V_i \in O \cup X^{(1)} \cup R} p(V_i \mid \operatorname{pa}_{\mathcal{G}}(V_i)) \times \prod_{X_i \in X} p(X_i \mid X_i^{(1)}, R_i).$$

By standard results on DAG models, conditional independences in $p(X^{(1)}, O, R)$ can still be read off from $\mathcal{G}$ by the d-separation criterion (Pearl, 2009). For convenience, we will drop the deterministic terms of the form $p(X_i \mid X_i^{(1)}, R_i)$ from the identification analyses in the following sections, since these terms are always identified by construction.

As an extension, we also consider a hidden variable DAG $\mathcal{G}(V \cup U)$, where $V = \{O, X^{(1)}, R, X\}$ and variables in $U$ are unobserved, to encode missing data models in the presence of unmeasured confounders. In such cases, the full law obeys the nested Markov factorization (Richardson et al., 2017) with respect to a missing data ADMG $\mathcal{G}(V)$, obtained by applying the latent projection operator (Verma & Pearl, 1990) to the hidden variable DAG $\mathcal{G}(V \cup U)$. As a result of marginalizing the latents $U$, there may exist bidirected edges (encoding hidden common causes) between variables in $V$ (bidirected edges are shown in red in all figures below). It is straightforward to see that a missing data ADMG obtained via projection of a hidden variable missing data DAG follows the exact same restrictions as stated in the previous paragraph (i.e., no directed cycles, $\operatorname{pa}_{\mathcal{G}}(X_i) = \{X_i^{(1)}, R_i\}$, every $X_i \in X$ is childless, and there are no outgoing edges from $R_i$ to any variables in $\{X^{(1)}, O\}$).
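To make these structural restrictions concrete, the sketch below validates them for an edge-list encoding of the Scenario 1 DAG of Fig. 1(a) (directed edges only; the bidirected case is analogous). The encoding, the function name, and the positional pairing of proxies with their full data variables and indicators are our own illustrative assumptions.

```python
from collections import defaultdict

def check_missing_data_dag(di_edges, O, X_full, R, X):
    """Check the structural restrictions on a missing data DAG."""
    ch, pa = defaultdict(set), defaultdict(set)
    for u, v in di_edges:
        ch[u].add(v); pa[v].add(u)
    # (1) No outgoing edges from R into X^(1) or O.
    ok = all(ch[r].isdisjoint(set(X_full) | set(O)) for r in R)
    # (2) Each proxy X_i has parents exactly {X_i^(1), R_i} and no children.
    ok = ok and all(pa[x] == {xf, r} and not ch[x]
                    for x, xf, r in zip(X, X_full, R))
    # (3) Acyclicity, via Kahn's algorithm (repeatedly remove sources).
    nodes = set(O) | set(X_full) | set(R) | set(X)
    indeg = {v: len(pa[v]) for v in nodes}
    queue = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop(); removed += 1
        for u in ch[v]:
            indeg[u] -= 1
            if indeg[u] == 0:
                queue.append(u)
    return ok and removed == len(nodes)

# Scenario 1 graph of Fig. 1(a), with deterministic proxy edges included;
# "Xif" encodes X_i^(1) and "Xi" encodes its proxy.
di = [("X1f", "X2f"), ("X2f", "X3f"),                  # target law
      ("X3f", "R1"), ("X1f", "R2"), ("X1f", "R3"),     # missingness process
      ("R3", "R2"), ("R2", "R1"),
      ("X1f", "X1"), ("R1", "X1"), ("X2f", "X2"),      # deterministic edges
      ("R2", "X2"), ("X3f", "X3"), ("R3", "X3")]
assert check_missing_data_dag(di, O=[],
                              X_full=["X1f", "X2f", "X3f"],
                              R=["R1", "R2", "R3"], X=["X1", "X2", "X3"])
```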

3.1 Identification in Missing Data Models

The goal of non-parametric identification in missing data models is twofold: identification of the target law $p(O, X^{(1)})$, or functions of it $f(p(O, X^{(1)}))$, and identification of the full law $p(O, X^{(1)}, R)$, in terms of the observed data distribution $p(O, R, X)$.

A compelling reason to study identification of the full law in its own right is that many popular methods for model selection or causal discovery rely on the specification of a well-defined and congenial joint distribution (Chickering, 2002; Ramsey, 2015; Ogarrio et al., 2016). A complete theory characterizing which missing data full laws are identified opens up the possibility of adapting such methods to settings involving non-ignorable missingness, in order to learn not only substantive relationships between variables of interest in the target distribution, but also the processes that drive their missingness. This is in contrast to previous approaches to model selection under missing data that are restricted to submodels of a single fixed identified model (Strobl et al., 2018; Gain & Shpitser, 2018; Tu et al., 2019). Such an assumption may be impractical in complex healthcare settings, for example, where discovering the factors that lead to missingness or study dropout may be just as important as discovering substantive relations in the underlying data.

Though the focus of this paper is on identification of the full law of missing data models that can be represented by a DAG (or a hidden variable DAG), some of our results naturally extend to identification of the target law (and functionals therein), due to the fact that the target law can be derived from the full law as $\sum_R p(O, X^{(1)}, R)$.

Remark 1.

By the chain rule of probability, the target law $p(O, X^{(1)})$ is identified if and only if $p(R = 1 \mid O, X^{(1)})$ is identified. The identifying functional is given by

$$p(O, X^{(1)}) = \frac{p(O, X^{(1)}, R = 1)}{p(R = 1 \mid O, X^{(1)})}.$$

(The numerator is a function of the observed data: when $R = 1$, we have $X^{(1)} = X$, which is observed.)
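A minimal numeric illustration of Remark 1, under our own simulated toy distribution: missingness here depends only on the always-observed $O$, so the propensity is estimable from observed data, and inverse weighting the $R = 1$ cases by $p(R = 1 \mid O, X^{(1)})$ recovers the target law. All distributions and values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

o = rng.binomial(1, 0.5, size=n)                  # always-observed O
x1 = rng.binomial(1, np.where(o == 1, 0.7, 0.3))  # target X^(1) given O
r = rng.binomial(1, np.where(o == 1, 0.9, 0.6))   # R depends on O alone

# Here p(R = 1 | O, X^(1)) = p(R = 1 | O), estimable from (R, O).
prop = np.where(o == 1, r[o == 1].mean(), r[o == 0].mean())

# Recover p(O = 1, X^(1) = 1) by inverse weighting the R = 1 cases; the
# indicator touches x1 only where R = 1, i.e., where the proxy X equals it,
# so the estimator is a function of the observed data alone.
est = np.mean(((o == 1) & (x1 == 1) & (r == 1)) / prop)
print(est)   # approx 0.5 * 0.7 = 0.35
```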

Remark 2.

The full law $p(O, X^{(1)}, R)$ is identified if and only if $p(R \mid O, X^{(1)})$ is identified. Combining this with Remark 1, the identifying functional is given by

$$p(O, X^{(1)}, R) = \frac{p(O, X^{(1)}, R = 1)}{p(R = 1 \mid O, X^{(1)})} \times p(R \mid O, X^{(1)}).$$

The rest of the paper is organized as follows. In Section 4, we explain, through examples, why none of the identification algorithms put forward in the existing literature are complete: there exist missing data DAGs whose full law and target law are identified, yet these algorithms fail to derive an identifying functional for them. In Section 5, we provide a complete algorithm for full law identification. In Section 6, we further extend our identification results to models where unmeasured confounders are present. We defer all proofs to the Appendix.

4 Incompleteness of Current Methods

In this section, we show that even the most general methods proposed for identification in missing data DAG models remain incomplete. In other words, we show that there exist identified MNAR models representable by DAGs for which all existing algorithms fail to identify both the full and target laws. For brevity, we use the procedure proposed in (Bhattacharya et al., 2019) as an exemplar; since it is the most general procedure in the current literature, failure to identify via this procedure implies failure by all other existing ones. For each example, we also provide alternate arguments for identification that eventually lead to the general theory in Sections 5 and 6.

The algorithm proposed by (Bhattacharya et al., 2019) proceeds as follows. For each missingness indicator $R_i$, the algorithm tries to identify the distribution $p(R_i \mid \operatorname{pa}_{\mathcal{G}}(R_i))\big|_{R=1}$, sometimes referred to as the propensity score of $R_i$. It does so by checking whether $R_i$ is conditionally independent (given its parents) of the missingness indicators corresponding to its potentially missing parents. If this is the case, the propensity score is identified by a simple conditional independence (d-separation) argument. Otherwise, the algorithm checks whether this condition holds in post-fixing distributions obtained through recursive application of the fixing operator, which roughly corresponds to inverse weighting the current distribution by the propensity score of the variable being fixed (Richardson et al., 2017) (a more formal definition is provided in the Appendix). If the algorithm succeeds in identifying the propensity score for each missingness indicator in this manner, then it succeeds in identifying the target law, as Remark 1 suggests, since $p(R = 1 \mid O, X^{(1)}) = \prod_{R_i \in R} p(R_i \mid \operatorname{pa}_{\mathcal{G}}(R_i))\big|_{R=1}$. Additionally, if in the course of execution the propensity score $p(R_i \mid \operatorname{pa}_{\mathcal{G}}(R_i))$ for each missingness indicator is identified at all levels of its parents, then the algorithm also succeeds in identifying the full law (due to Remark 2).

In order to ground our theory in reality, we now describe a series of hypotheses that may arise during the course of a data analysis that seeks to study the link between the effects of smoking on bronchitis, through the deposition of tar or other particulate matter in the lungs. For each hypothesis, we ask whether the investigator is able to evaluate the goodness of fit of the proposed model, typically expressed as a function of the full data likelihood, using just the observed data. In other words, we ask if the full law is identified as a function of the observed data distribution. If it is, the analyst can compare and contrast different hypotheses and select the one that fits the data best.

Setup. To start, the investigator consults a large observational database containing the smoking habits, measurements of particulate matter in the lungs, and results of diagnostic tests for bronchitis of individuals across a city. She notices, however, that several entries in the database are missing. This leads her to propose a model like the one shown in Fig. 1(a), where $X_1^{(1)}, X_2^{(1)},$ and $X_3^{(1)}$ correspond to smoking, particulate matter, and bronchitis respectively, and $R_1, R_2,$ and $R_3$ are the corresponding missingness indicators.

For the target distribution $p(X^{(1)})$, she proposes a simple mechanism whereby smoking leads to increased deposits of tar in the lungs, which in turn lead to bronchitis ($X_1^{(1)} \rightarrow X_2^{(1)} \rightarrow X_3^{(1)}$). For the missingness process, she proposes that a suspected diagnosis of bronchitis is likely to lead to an inquiry about the smoking status of the patient ($X_3^{(1)} \rightarrow R_1$), smokers are more likely to get tested for tar and bronchitis ($X_1^{(1)} \rightarrow R_2, X_1^{(1)} \rightarrow R_3$), and ordering a diagnostic test for bronchitis increases the likelihood of ordering a test for tar, which in turn increases the likelihood of an inquiry about smoking status ($R_1 \leftarrow R_2 \leftarrow R_3$).

We now show that for this preliminary hypothesis, if the investigator were to utilize the procedure described in (Bhattacharya et al., 2019) she may conclude that it is not possible to identify the full law. We go on to show that such a conclusion would be incorrect, as the full law is, in fact, identified, and provide an alternative means of identification.

[Figure 1: panel (a) shows $\mathcal{G}_a$, a missing data DAG over $X_1^{(1)}, X_2^{(1)}, X_3^{(1)}, R_1, R_2, R_3$ and the proxies $X_1, X_2, X_3$; panel (b) shows $\mathcal{G}_b \coloneqq \mathcal{G}_a(V \setminus R_1)$, in which $R_1 = 1$ and $R_3 = 1$ appear as fixed.]
Figure 1: (a) The missing data DAG used in Scenario 1 (without the dashed edge $X_2^{(1)} \rightarrow R_3$) and Scenario 2 (with the dashed edge $X_2^{(1)} \rightarrow R_3$). (b) The conditional DAG corresponding to the missing data DAG in (a) after fixing $R_1$, i.e., inverse weighting by the propensity score of $R_1$.

Scenario 1. Consider the missing data DAG model in Fig. 1(a), excluding the edge $X_2^{(1)} \rightarrow R_3$, corresponding to the first hypothesis put forth by the investigator. The propensity score for $R_1$ can be obtained by simple conditioning, noting that $R_1 \perp\!\!\!\perp R_3 \mid X_3^{(1)}, R_2$ by d-separation. Hence, $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1)) = p(R_1 \mid X_3^{(1)}, R_2) = p(R_1 \mid X_3, R_2, R_3 = 1)$.

Conditioning is not sufficient to identify the propensity score for $R_2$, as $R_2 \not\perp\!\!\!\perp R_1 \mid X_1^{(1)}, R_3$. However, it can be shown that in the distribution $q(V \setminus R_1 \mid R_1 = 1) \equiv \frac{p(V)}{p(R_1 = 1 \mid \operatorname{pa}_{\mathcal{G}}(R_1))}$, we have $R_2 \perp\!\!\!\perp R_1 \mid X_1, R_3 = 1$, since this distribution is Markov relative to the graph in Fig. 1(b) (see the Appendix for details). We use the notation $q(\cdot \mid \cdot)$ to indicate that while $q$ acts in most respects as a conditional distribution, it was not obtained from $p(V)$ by a conditioning operation. This implies that the propensity score for $R_2$ (evaluated at $R = 1$) is identified as $q(R_2 \mid X_1, R_3 = 1, R_1 = 1)$.

Finally, we show that the algorithm in (Bhattacharya et al., 2019) is unable to identify the propensity score for $R_3$. We first note that $R_3 \not\perp\!\!\!\perp R_1 \mid X_1^{(1)}$ in the original problem. Furthermore, as shown in Fig. 1(b), fixing $R_1$ leads to a distribution where $R_3$ is necessarily selected on, as the propensity score $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1))$ is identified by restricting the data to cases where $R_3 = 1$. It is thus impossible to identify the propensity score for $R_3$ in this post-fixing distribution. The same holds if we try to fix $R_2$, since identification of the propensity score for $R_2$ required us to first fix $R_1$, which, as we have seen, introduces selection bias on $R_3$.

Hence, the procedure in (Bhattacharya et al., 2019) fails to identify both the target law and the full law for the problem posed in Fig. 1(a). However, both these distributions are, in fact, identified as we now demonstrate.

A key observation is that even though the identification of $p(R_3 \mid X_1^{(1)})$ may not be straightforward, $p(R_3 \mid X_1^{(1)}, R_2)$ is indeed identified: by d-separation, $R_3 \perp\!\!\!\perp R_1 \mid X_1^{(1)}, R_2$, and therefore $p(R_3 \mid X_1^{(1)}, R_2) = p(R_3 \mid X_1, R_2, R_1 = 1)$. Given that $p(R_3 \mid X_1^{(1)}, R_2)$ and $p(R_2 \mid X_1^{(1)}, R_3 = 1)$ are both identified (the latter obtained through fixing, as described earlier), we consider exploiting an odds ratio parameterization of the joint density $p(R_2, R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_2, R_3)) = p(R_2, R_3 \mid X_1^{(1)})$. As we show below, such a parameterization immediately implies the identifiability of this density and, consequently, of the individual propensity scores for $R_2$ and $R_3$.

Given disjoint sets of variables $A, B, C$ and reference values $A = a_0, B = b_0$, the odds ratio parameterization of $p(A, B \mid C)$, given in (Chen, 2007), is as follows:

$$\frac{1}{Z} \times p(A \mid b_0, C) \times p(B \mid a_0, C) \times \text{OR}(A, B \mid C), \qquad (1)$$

where

$$\text{OR}(A = a, B = b \mid C) = \frac{p(A = a \mid B = b, C)}{p(A = a_0 \mid B = b, C)} \times \frac{p(A = a_0 \mid B = b_0, C)}{p(A = a \mid B = b_0, C)},$$

and $Z$ is the normalizing term, equal to

$$\sum_{A, B} p(A \mid B = b_0, C) \times p(B \mid A = a_0, C) \times \text{OR}(A, B \mid C).$$

Note that $\text{OR}(A, B \mid C) = \text{OR}(B, A \mid C)$, i.e., the odds ratio is symmetric; see (Chen, 2007).
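As a sanity check on Eq. (1), the sketch below draws a random joint over binary $A, B$ (suppressing $C$ for brevity) and verifies numerically that the normalized product of the two conditionals and the odds ratio recovers the joint exactly; all numbers and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((2, 2))
p /= p.sum()                      # joint p(A = a, B = b), a, b in {0, 1}
a0, b0 = 1, 1                     # reference values

p_a_given_b = p / p.sum(axis=0)           # p(A | B), columns indexed by b
p_b_given_a = (p.T / p.sum(axis=1)).T     # p(B | A), rows indexed by a

def odds_ratio(a, b):
    """OR(A = a, B = b) relative to the reference values (a0, b0)."""
    return (p_a_given_b[a, b] / p_a_given_b[a0, b]) * \
           (p_a_given_b[a0, b0] / p_a_given_b[a, b0])

unnorm = np.array([[p_a_given_b[a, b0] * p_b_given_a[a0, b] * odds_ratio(a, b)
                    for b in range(2)] for a in range(2)])
Z = unnorm.sum()                  # normalizing term from Eq. (1)

assert np.allclose(unnorm / Z, p)   # the parameterization reproduces the joint
```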

A convenient choice of reference value for the odds ratio in missing data problems is the value $R_i = 1$. Given this reference level and the parameterization of the joint in Eq. (1), we have $p(R_2, R_3 \mid X_1^{(1)}) = \frac{1}{Z} \times p(R_2 \mid R_3 = 1, X_1^{(1)}) \times p(R_3 \mid R_2 = 1, X_1^{(1)}) \times \text{OR}(R_2, R_3 \mid X_1^{(1)})$, where $Z$ is the normalizing term, and

$$\text{OR}(R_2 = r_2, R_3 = r_3 \mid X_1^{(1)}) = \frac{p(R_3 = r_3 \mid R_2 = r_2, X_1^{(1)})}{p(R_3 = 1 \mid R_2 = r_2, X_1^{(1)})} \times \frac{p(R_3 = 1 \mid R_2 = 1, X_1^{(1)})}{p(R_3 = r_3 \mid R_2 = 1, X_1^{(1)})}.$$

The conditional pieces $p(R_2 \mid R_3 = 1, X_1^{(1)})$ and $p(R_3 \mid R_2 = 1, X_1^{(1)})$ have already been shown to be functions of the observed data. To see that the odds ratio is also a function of observables, recall that $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}$. This means that $R_1 = 1$ can be introduced into each individual piece of the odds ratio functional above, so that the entire functional depends only on observed quantities. Since all pieces of the odds ratio parameterization are identified as functions of the observed data, we conclude that $p(R_2, R_3 \mid X_1^{(1)})$ is identified, as the normalizing term is always identified once all the conditional pieces and the odds ratio are. This result, together with the fact that $p(R_1 \mid R_2, X_3^{(1)})$ is identified as before, leads us to the identification of both the target law and the full law, since the missingness process $p(R \mid X^{(1)})$ is identified.

Scenario 2. Suppose the investigator is interested in testing an alternate hypothesis: whether detecting high levels of particulate matter in the lungs also serves as an indicator to physicians that a diagnostic test for bronchitis should be ordered. This corresponds to the missing data DAG model in Fig. 1(a) including the edge $X_2^{(1)} \rightarrow R_3$. Since this is a strict super model of the previous example, the procedure in (Bhattacharya et al., 2019) still fails to identify the target and full laws, in a similar manner as before.

However, it is still the case that both the target and full laws are identified. The justification for why the odds ratio parameterization of the joint density $p(R_2, R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_2, R_3)) = p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is identified in this scenario is more subtle. We have

$$p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}) = \frac{1}{Z} \times p(R_2 \mid R_3 = 1, X_1^{(1)}, X_2^{(1)}) \times p(R_3 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)}) \times \text{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}).$$

Note that $R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$ and $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$. Therefore, $p(R_2 \mid R_3 = 1, X_1^{(1)}, X_2^{(1)}) = p(R_2 \mid R_3 = 1, X_1^{(1)})$ is identified the same way as described in Scenario 1, and $p(R_3 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)}) = p(R_3 \mid R_1 = 1, R_2 = 1, X_1, X_2)$ is a function of the observed data and hence is identified. The identification of the joint density $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ thus boils down to the identifiability of the odds ratio term. By symmetry, we can express the odds ratio $\text{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ in two different ways:

$$\text{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}) = \frac{p(R_2 \mid R_3, X_1^{(1)})}{p(R_2 = 1 \mid R_3, X_1^{(1)})} \times \frac{p(R_2 = 1 \mid R_3 = 1, X_1^{(1)})}{p(R_2 \mid R_3 = 1, X_1^{(1)})} = \frac{p(R_3 \mid R_2, X_1^{(1)}, X_2^{(1)})}{p(R_3 = 1 \mid R_2, X_1^{(1)}, X_2^{(1)})} \times \frac{p(R_3 = 1 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)})}{p(R_3 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)})}.$$

The first equality holds by d-separation ($R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$), which implies that $\text{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is not a function of $X_2^{(1)}$; denote this functional by $f_1(R_2, R_3, X_1^{(1)})$. On the other hand, we can plug $R_1 = 1$ into the pieces of the second equality, since $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$ (by d-separation). This implies that $\text{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is a function of $X_1^{(1)}$ only through its observed values (i.e., $X_1$); denote this functional by $f_2(R_2, R_3, X_1, X_2^{(1)}, R_1 = 1)$. Since the odds ratio is symmetric (by definition), it must be the case that $f_1(R_2, R_3, X_1^{(1)}) = f_2(R_2, R_3, X_1, X_2^{(1)}, R_1 = 1)$; we conclude that $f_2$ cannot be a function of $X_2^{(1)}$, as the left hand side of the equation does not depend on $X_2^{(1)}$. This renders $f_2$ a function of only observed quantities, i.e., $f_2 = f_2(R_2, R_3, X_1, R_1 = 1)$. It follows that $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is identified, and consequently the missingness process $p(R \mid X^{(1)})$ in Fig. 1(a) is identified. According to Remarks 1 and 2, both the target and full laws are identified.

Adding any directed edge permitted by missing data DAGs to Fig. 1(a) (including the dashed edge) results in either a self-censoring edge ($X_i^{(1)} \rightarrow R_i$) or a special kind of collider structure called the colluder ($X_j^{(1)} \rightarrow R_i \leftarrow R_j$) in (Bhattacharya et al., 2019). We discuss in detail the link between identification of missing data models of a DAG and the absence of these structures in Section 5.

Scenario 3. So far, the investigator has conducted preliminary analyses of the problem while ignoring the issue of unmeasured confounding. In order to address this issue, she first posits an unmeasured confounder $U_1$, corresponding to genotypic traits that may predispose certain individuals to both smoke and develop bronchitis. She posits another unmeasured confounder $U_2$, corresponding to the occupation of an individual, which may affect both the deposits of tar found in their lungs (e.g., construction workers may accumulate more tar than accountants due to occupational hazards) and an individual's access to proper healthcare, leading to the absence of a diagnostic test for bronchitis.

The missing data DAG with unmeasured confounders corresponding to the aforementioned hypothesis is shown in Fig. 2(a) (excluding the dashed edges). The corresponding missing data ADMG, obtained by latent projection, is shown in Fig. 2(b) (excluding the dashed bidirected edge). A procedure to identify the full law of such an MNAR model, which is nested Markov with respect to a missing data ADMG, is absent from the current literature. The question that arises is whether it is possible to adapt the odds ratio parameterization from the previous scenarios to this setting.

We first note that by application of the chain rule of probability and Markov restrictions, the missingness mechanism still factorizes in the same way as in Scenario 2, i.e., $p(R \mid X^{(1)}) = p(R_1 \mid R_2, X_3^{(1)}) \times p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ (Tian & Pearl, 2002). Despite the addition of the bidirected edges $X_1^{(1)} \leftrightarrow X_3^{(1)}$ and $X_2^{(1)} \leftrightarrow R_3$, corresponding to unmeasured confounding, it is easy to see that the propensity score for $R_1$ is still identified via simple conditioning. That is, $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1)) = p(R_1 \mid X_3, R_2, R_3 = 1)$, as $R_1 \perp\!\!\!\perp R_3 \mid X_3^{(1)}, R_2$ by m-separation. Furthermore, it can also be shown that the two key conditional independences that were exploited in the odds ratio parameterization of $p(R_2, R_3 \mid X^{(1)})$ still hold in the presence of these additional edges. In particular, $R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$ and $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$, by m-separation. Thus, the same odds ratio parameterization used for identification of the full law in Scenario 2 is also valid for Scenario 3. The full odds ratio parameterization of the MNAR models in Scenarios 2 and 3 is provided in Appendix B.

[Figure 2: panel (a) shows $\mathcal{G}(V \cup U)$, the missing data DAG of Fig. 1(a) augmented with hidden variables $U_1, U_2, U_3$; panel (b) shows the corresponding latent projection ADMG $\mathcal{G}(V)$.]
Figure 2: (a) The missing data DAG with unobserved confounders used in Scenario 3 (without the dashed edges) and Scenario 4 (with the dashed edges). (b) The corresponding missing data ADMG obtained by applying the latent projection rules to the hidden variable DAG in (a).

Scenario 4. Finally, the investigator notices that a disproportionate number of missing entries for smoking status and diagnosis of bronchitis correspond to individuals from certain neighborhoods in the city. She posits that such missingness may be explained by systematic biases in the healthcare system, where certain ethnic minorities may not be treated with the same level of care. This corresponds to adding a third unmeasured confounder $U_3$, which affects the ordering of a diagnostic test for bronchitis as well as inquiry about smoking habits, as shown in Fig. 2(a) (including the dashed edges). The corresponding missing data ADMG is shown in Fig. 2(b) (including the bidirected dashed edge). Once again, we investigate whether the full law is identified in the presence of the additional unmeasured confounder $U_3$ and the corresponding bidirected edge $R_1 \leftrightarrow R_3$.

The missingness mechanism $p(R \mid X^{(1)})$ in Fig. 2(b) (including the dashed edge) no longer follows the factorization described in Scenarios 2 and 3, due to the presence of a direct connection between $R_1$ and $R_3$. According to (Tian & Pearl, 2002), this factorization is given as $p(R \mid X^{(1)}) = p(R_1 \mid R_2, R_3, X_1^{(1)}, X_2^{(1)}, X_3^{(1)}) \times p(R_2 \mid R_3, X_1^{(1)}) \times p(R_3 \mid X_1^{(1)}, X_2^{(1)})$. Unlike the previous scenarios, the propensity score of $R_1$, $p(R_1 \mid R_2, R_3, X_1^{(1)}, X_2^{(1)}, X_3^{(1)})$, includes $X_1^{(1)}, X_2^{(1)},$ and $R_3$ past the conditioning bar. The propensity score of $R_1$ thus appears not to be identified, since there is no clear way of breaking down the dependence between $R_1$ and $X_1^{(1)}$. The problematic structure is the path $X_1^{(1)} \rightarrow R_3 \leftrightarrow R_1$, which contains a collider at $R_3$ that opens up when we condition on $R_3$ in the propensity score of $R_1$.

In light of the discussion in previous scenarios, another possibility for identifying $p(R \mid X^{(1)})$ is through analysis of the odds ratio parameterization of the entire missingness mechanism. In Section 5, we provide a description of the general odds ratio parameterization for an arbitrary number of missingness indicators. For brevity, we avoid re-writing the formula here. We simply point out that the first step in identifying the missingness mechanism via the odds ratio parameterization is arguing whether conditional densities of the form $p(R_i \mid R \setminus R_i = 1, X^{(1)})$ are identified, which is true if $R_i \perp\!\!\!\perp X_i^{(1)} \mid R \setminus R_i, X^{(1)} \setminus X_i^{(1)}$.

Such independencies do not hold in Fig. 2(b) (including the dashed edge) for any of the $R_i$s, since there exist collider paths between every pair $(X_i^{(1)}, R_i)$ that render the two variables dependent when we condition on everything outside $\{X_i^{(1)}, R_i\}$ (by m-separation). Examples of such paths are $X_1^{(1)} \rightarrow R_3 \leftrightarrow R_1$, $X_2^{(1)} \leftrightarrow R_3 \leftrightarrow R_1 \leftarrow R_2$, and $X_3^{(1)} \rightarrow R_1 \leftrightarrow R_3$.

In Section 6, we show that the structures arising in the missing data ADMG presented in Fig. 2(b) (including the dashed edge) give rise to MNAR models that are provably not identified without further assumptions.

5 Full Law Identification in DAGs

(Bhattacharya et al., 2019) proved that two graphical structures, namely the self-censoring edge ($X_i^{(1)} \rightarrow R_i$) and the colluder ($X_j^{(1)} \rightarrow R_i \leftarrow R_j$), prevent the identification of full laws in missing data models of a DAG. In this section, we exploit an odds ratio parameterization of the missing data process to prove that these two structures are, in fact, the only structures that prevent identification, thus yielding a complete characterization of identification for the full law in missing data DAG models.

We formally introduce the odds ratio parameterization of the missing data process given in (Chen, 2007), as a more general version of the simpler form in Eq. (1). Assuming we have $K$ missingness indicators, $p(R \mid X^{(1)}, O)$ can be expressed as follows:

$$p(R \mid X^{(1)}, O) = \frac{1}{Z} \times \prod_{k=1}^{K} p(R_k \mid R_{-k} = 1, X^{(1)}, O) \times \prod_{k=2}^{K} \text{OR}(R_k, R_{\prec k} \mid R_{\succ k} = 1, X^{(1)}, O), \qquad (2)$$

where $R_{-k} = R \setminus R_k$, $R_{\prec k} = \{R_1, \ldots, R_{k-1}\}$, $R_{\succ k} = \{R_{k+1}, \ldots, R_K\}$, and

$$\text{OR}(R_k, R_{\prec k} \mid R_{\succ k} = 1, X^{(1)}, O) = \frac{p(R_k \mid R_{\succ k} = 1, R_{\prec k}, X^{(1)}, O)}{p(R_k = 1 \mid R_{\succ k} = 1, R_{\prec k}, X^{(1)}, O)} \times \frac{p(R_k = 1 \mid R_{-k} = 1, X^{(1)}, O)}{p(R_k \mid R_{-k} = 1, X^{(1)}, O)}.$$

$Z$ in Eq. (2) is the normalizing term, equal to $\sum_r \big\{ \prod_{k=1}^{K} p(r_k \mid R_{-k} = 1, X^{(1)}, O) \times \prod_{k=2}^{K} \text{OR}(r_k, r_{\prec k} \mid R_{\succ k} = 1, X^{(1)}, O) \big\}$.
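As with Eq. (1), the parameterization in Eq. (2) can be verified numerically. The sketch below does so for $K = 3$ binary missingness indicators, suppressing the conditioning on $X^{(1)}$ and $O$ for brevity, using a randomly drawn joint; the array encoding and helper names are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
p = rng.random((2, 2, 2))
p /= p.sum()                  # joint p(R_1, R_2, R_3) over binary indicators

def cond(k, rk, others):
    """p(R_k = rk | the remaining indicators fixed at `others`, in order)."""
    idx = list(others); idx.insert(k, rk)
    marg = list(others); marg.insert(k, slice(None))
    return p[tuple(idx)] / p[tuple(marg)].sum()

def OR_term(k, r):
    """OR(R_k, R_{<k} | R_{>k} = 1) at configuration r, reference value 1."""
    ctx = tuple(r[:k]) + (1,) * (2 - k)      # R_{<k} at r, R_{>k} at 1
    return (cond(k, r[k], ctx) / cond(k, 1, ctx)) / \
           (cond(k, r[k], (1, 1)) / cond(k, 1, (1, 1)))

unnorm = np.empty((2, 2, 2))
for r in product((0, 1), repeat=3):
    uni = np.prod([cond(k, r[k], (1, 1)) for k in range(3)])  # p(R_k | R_-k = 1)
    unnorm[r] = uni * OR_term(1, r) * OR_term(2, r)           # k = 2..K terms

assert np.allclose(unnorm / unnorm.sum(), p)   # Eq. (2) recovers the joint
```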

Using the odds ratio reparameterization given in Eq. (2), we now show that under a standard positivity assumption, namely that $p(R \mid X^{(1)}, O) > \delta > 0$ with probability one for some constant $\delta$, the full law $p(R, X^{(1)}, O)$ of a missing data DAG is identified in the absence of self-censoring edges and colluders. Moreover, if either of these conditions is violated, the full law is no longer identified. We formalize this result below.

Theorem 1.

A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data DAG $\mathcal{G}$ is identified if $\mathcal{G}$ does not contain edges of the form $X_i^{(1)} \rightarrow R_i$ (no self-censoring) and structures of the form $X_j^{(1)} \rightarrow R_i \leftarrow R_j$ (no colluders), and the stated positivity assumption holds. Moreover, the resulting identifying functional for the missingness mechanism $p(R \mid X^{(1)}, O)$ is given by the odds ratio parameterization provided in Eq. (2), and the identifying functionals for the target law and full law are given by Remarks 1 and 2.

In what follows, we show that the identification theory that we have proposed for the full law in missing data models of a DAG is sound and complete. Soundness implies that when our procedure succeeds, the model is in fact identified, and the identifying functional is correct. Completeness implies that when our procedure fails, the model is provably not identified (non-parametrically). These two properties allow us to derive a precise boundary for what is and is not identified in the space of missing data models that can be represented by a DAG.

Theorem 2.

The graphical condition of no self-censoring and no colluders, put forward in Theorem 1, is sound and complete for the identification of full laws $p(R, O, X^{(1)})$ that are Markov relative to a missing data DAG $\mathcal{G}$.

We now state an important result that draws a connection between missing data models of a DAG $\mathcal{G}$ that are devoid of self-censoring and colluders, and the itemwise conditionally independent nonresponse (ICIN) model described in (Shpitser, 2016; Sadinle & Reiter, 2017). As a substantive model, the ICIN model implies that no partially observed variable directly determines its own missingness, and is defined by the restrictions that for every pair $(X_i^{(1)}, R_i)$, it is the case that $X_i^{(1)} \perp\!\!\!\perp R_i \mid R_{-i}, X_{-i}^{(1)}, O$. We utilize this result in the course of proving Theorem 2.

Lemma 1.

A missing data model of a DAG $\mathcal{G}$ that contains no self-censoring edges and no colluders is a submodel of the ICIN model.
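As an illustration of Lemma 1, the sketch below verifies that the Scenario 1 DAG of Fig. 1(a), which has no self-censoring edges and no colluders, satisfies the three ICIN restrictions $X_i^{(1)} \perp\!\!\!\perp R_i \mid R_{-i}, X_{-i}^{(1)}$. d-separation is checked by moralization, as in the sketch in Section 2; proxies and deterministic edges are omitted, and the encoding (with "Xi" standing for $X_i^{(1)}$) is our own illustrative assumption.

```python
from itertools import combinations

dag = {"X1": [], "X2": ["X1"], "X3": ["X2"],                  # target law
       "R1": ["X3", "R2"], "R2": ["X1", "R3"], "R3": ["X1"]}  # missingness

def d_separated(parents, X, Y, Z):
    # Since Z contains all remaining vertices here, the ancestral subgraph
    # is the whole graph, so we moralize everything directly.
    adj = {v: set() for v in parents}
    for v in parents:                 # link parents, and marry co-parents
        for p in parents[v]:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(parents[v], 2):
            adj[p].add(q); adj[q].add(p)
    frontier, seen = list(X - Z), set(X - Z)
    while frontier:                   # reachability in moral graph avoiding Z
        v = frontier.pop()
        if v in Y:
            return False
        for u in adj[v] - seen - Z:
            seen.add(u); frontier.append(u)
    return True

for i in ("1", "2", "3"):
    others = {v for v in dag if v[-1] != i}   # R_{-i} and X_{-i}^(1)
    assert d_separated(dag, {"X" + i}, {"R" + i}, others)
```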

[Figure 3: four graph panels enumerating the possible colluding paths between $X_i^{(1)}$ and $R_i$, whose intermediate colliders are drawn from $\{R_j, \ldots, R_k\}$, $\{R_j, \ldots, X_k^{(1)}\}$, $\{X_j^{(1)}, \ldots, R_k\}$, and $\{X_j^{(1)}, \ldots, X_k^{(1)}\}$ respectively.]
Figure 3: All possible colluding paths between $X_i^{(1)}$ and $R_i$. Each pair of dashed edges implies that the presence of either (or both) results in the formation of a colluding path.

6 Full Law Identification in the Presence of Unmeasured Confounders

We now generalize identification theory of the full law to scenarios where some variables are not just missing, but completely unobserved, corresponding to the issues faced by the analyst in Scenarios 3 and 4 of Section 4. That is, we shift our focus to the identification of full data laws that are (nested) Markov with respect to a missing data ADMG $\mathcal{G}$.

Previously, we exploited the fact that the absence of colluders and self-censoring edges in a missing data DAG $\mathcal{G}$ implies a set of conditional independence restrictions of the form $X_i^{(1)} \perp\!\!\!\perp R_i \mid R_{-i}, X_{-i}^{(1)}, O$, for any pair $X_i^{(1)} \in X^{(1)}$ and $R_i \in R$ (see Lemma 1). We now describe necessary and sufficient graphical conditions that must hold in a missing data ADMG $\mathcal{G}$ to imply this same set of conditional independences. Going forward, we ignore (without loss of generality) the deterministic factors $p(X \mid X^{(1)}, R)$, and the corresponding deterministic edges in $\mathcal{G}$, in the process of defining this graphical criterion.

A colliding path between two vertices $A$ and $B$ is a path on which every non-endpoint node is a collider. We adopt the convention that $A \rightarrow B$ and $A \leftrightarrow B$ are trivially colliding paths. We say there exists a colluding path between the pair $(X_i^{(1)}, R_i)$ if $X_i^{(1)}$ and $R_i$ are connected through at least one non-deterministic colliding path, i.e., one that does not pass through (using deterministic edges) variables in $X$.

We enumerate all possible colluding paths between a vertex $X_i^{(1)}$ and its corresponding missingness indicator $R_i$ in Fig. 3. Note that both the self-censoring structure and the colluder structure introduced in (Bhattacharya et al., 2019) are special cases of a colluding path. Using the m-separation criterion for ADMGs, it is possible to show that a missing data model of an ADMG $\mathcal{G}$ that contains no colluding paths of the form shown in Fig. 3 is also a submodel of the ICIN model in (Shpitser, 2016; Sadinle & Reiter, 2017).
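A small sketch of how colluding path detection might be implemented, following the definition above: search for a path from $X_i^{(1)}$ to $R_i$ on which every intermediate vertex carries arrowheads on both sides. The edge-list encoding is an illustrative assumption, proxies and deterministic edges are assumed to have been dropped beforehand, and the example reproduces the problematic path $X_1^{(1)} \rightarrow R_3 \leftrightarrow R_1$ from Scenario 4.

```python
from collections import defaultdict

def has_colluding_path(di_edges, bi_edges, x, r):
    """True if a colluding path connects x = X_i^(1) and r = R_i."""
    nbrs = defaultdict(set)   # v -> {(u, head_at_v, head_at_u)}
    for a, b in di_edges:     # a -> b: arrowhead at b only
        nbrs[a].add((b, False, True)); nbrs[b].add((a, True, False))
    for a, b in bi_edges:     # a <-> b: arrowheads at both ends
        nbrs[a].add((b, True, True)); nbrs[b].add((a, True, True))
    # Search over (vertex, entered-with-arrowhead) states; an intermediate
    # vertex must be entered AND exited through arrowheads (i.e., collider).
    stack = [(u, h_u) for (u, _, h_u) in nbrs[x]]   # first hop: unrestricted
    seen = set(stack)
    while stack:
        v, entered_with_head = stack.pop()
        if v == r:
            return True
        if not entered_with_head:
            continue          # v cannot serve as a collider; dead end
        for (u, h_v, h_u) in nbrs[v]:
            if h_v and (u, h_u) not in seen:        # exit edge has head at v
                seen.add((u, h_u)); stack.append((u, h_u))
    return False

# X_1^(1) -> R_3 <-> R_1 is a colluding path (R_3 is a collider) ...
assert has_colluding_path([("X1", "R3")], [("R3", "R1")], "X1", "R1")
# ... but X_1^(1) -> R_2 -> R_1 is not (R_2 is a non-collider).
assert not has_colluding_path([("X1", "R2"), ("R2", "R1")], [], "X1", "R1")
```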

Lemma 2.

A missing data model of an ADMG $\mathcal{G}$ that contains no colluding paths is a submodel of the ICIN model.

This directly yields a sound criterion for identification of the full law of missing data models of an ADMG $\mathcal{G}$, using the odds ratio parameterization as before.

Theorem 3.

A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data ADMG $\mathcal{G}$ is identified if $\mathcal{G}$ does not contain any colluding paths and the positivity assumption stated in Section 5 holds. Moreover, the resulting identifying functional for the missingness mechanism $p(R \mid X^{(1)}, O)$ is given by the odds ratio parameterization provided in Eq. (2).

We now address the question of whether there exist missing data ADMGs that contain colluding paths but whose full laws are nevertheless identified. We show (see the Appendix for proofs) that the presence of a single colluding path of any of the forms shown in Fig. 3 results in a missing data ADMG $\mathcal{G}$ whose full law $p(X^{(1)}, R, O)$ cannot be identified as a function of the observed data distribution $p(X, R, O)$.

Lemma 3.

A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data ADMG $\mathcal{G}$ containing a colluding path between any pair $X_i^{(1)} \in X^{(1)}$ and $R_i \in R$ is not identified.

Revisiting our example in Scenario 4, we note that every pair $(R_i, X_i^{(1)})$ is connected through at least one colluding path. Therefore, according to Lemma 3, the full law in Fig. 2(a), including the dashed edge, is not identified. It is worth emphasizing that the existence of at least one colluding path between any single pair $(R_i, X_i^{(1)})$ is sufficient to conclude that the full law is not identified.

In what follows, we present a result on the soundness and completeness of our graphical condition that represents a powerful unification of non-parametric identification theory in the presence of non-ignorable missingness and unmeasured confounding. To our knowledge, such a result is the first of its kind. We present the theorem below.

Theorem 4.

The graphical condition of the absence of colluding paths, put forward in Theorem 3, is sound and complete for the identification of full laws $p(X^{(1)}, R, O)$ that are Markov relative to a missing data ADMG $\mathcal{G}$.

Throughout the paper, we have focused on identification of the full law, which, according to Remark 1, directly yields identification of the target law. However, identification of the full law is a sufficient but not necessary condition for identification of the target law. In other words, the target law may still be identified despite the presence of colluding paths. Fig. 4(a) in (Bhattacharya et al., 2019) is an example of such a case, where the full law is not identified due to the colluder structure at $R_2$; however, as the authors argue, the target law remains identified.

7 Conclusion

In this paper, we concluded an important chapter in the non-parametric identification theory of missing data models represented via directed acyclic graphs, possibly in the presence of unmeasured confounders. We provided a simple graphical condition to check if the full law, Markov relative to a (hidden variable) missing data DAG, is identified. We further proved that these criteria are sound and complete. Moreover, we provided an identifying functional for the missingness process, through an odds ratio parameterization that allows for congenial specification of components of the likelihood. Our results serve as an important precondition for the development of score-based model selection methods that consider a broader class of missing data distributions than the ones considered in prior works. An interesting avenue for future work is exploration of the estimation theory of functionals derived from the identified full data law. To conclude, we note that while identification of the full law is sufficient to identify the target law, there exist identified target laws where the corresponding full law is not identified. We leave a complete characterization of target law identification to future work.

Acknowledgements

This project is sponsored in part by the National Science Foundation grant 1939675, the Office of Naval Research grant N00014-18-1-2760, and the Defense Advanced Research Projects Agency under contract HR0011-18-C-0049. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Supplementary Materials

For clearer presentation of materials in this supplement, we switch to a single-column format. In Appendix A, we provide an overview of the nested Markov model, summarizing the concepts required to explain our proof of completeness for identification of the full law in missing data acyclic directed mixed graphs (ADMGs); these concepts draw on the binary parameterization of nested Markov models of an ADMG. In Appendix B, we provide a concrete example of the odds ratio parameterization. In Appendix C, we present proofs that were omitted from the main body of the paper for brevity.

A. Background: Fixing and Nested Markov Models of an ADMG

Given a DAG $\mathcal{G}(V \cup U)$ where $U$ contains variables that are unobserved, the latent projection operator onto the observed margin produces an acyclic directed mixed graph $\mathcal{G}(V)$ that consists of directed and bidirected edges (Verma & Pearl, 1990). The bidirected connected components of an ADMG $\mathcal{G}(V)$ partition the vertices $V$ into distinct sets known as districts. The district membership of a vertex $V_i$ in $\mathcal{G}$ is denoted $\operatorname{dis}_{\mathcal{G}}(V_i)$, and the set of all districts in $\mathcal{G}$ is denoted $\mathcal{D}(\mathcal{G})$.

(Evans, 2018) showed that the nested Markov model (Richardson et al., 2017) of an ADMG $\mathcal{G}(V)$ is a smooth super model, with fixed dimension, of the underlying latent variable model, which captures all equality constraints and avoids non-regular asymptotics arising from singularities in the parameter space (Drton, 2009; Evans, 2018). We use this fact in order to justify the use of nested Markov models of a missing data ADMG to describe full laws that are Markov relative to a missing data DAG with hidden variables. That is, the nested Markov model of a missing data ADMG $\mathcal{G}(V)$, where $V = \{O, X^{(1)}, R, X\}$, is a smooth super model of the missing data DAG model $\mathcal{G}(V \cup U)$. We also utilize nested Markov models of an ADMG $\mathcal{G}(V \setminus X^{(1)})$, corresponding to the projection of the missing data ADMG $\mathcal{G}(V)$ onto variables that are fully observable. While such a model does not capture all equality constraints in the true observed law, it is still a smooth super model of it, thus providing an upper bound on the model dimension of the observed law.

CADMGs and Kernels

The nested Markov factorization of $p(V)$ relative to an ADMG $\mathcal{G}(V)$ is defined with the use of conditional distributions known as kernels and their associated conditional ADMGs (CADMGs), which are derived from $p(V)$ and $\mathcal{G}(V)$ respectively, via repeated applications of the fixing operator (Richardson et al., 2017). A CADMG $\mathcal{G}(V, W)$ is an ADMG whose nodes can be partitioned into random variables $V$ and fixed variables $W$, with the restriction that only outgoing edges may be adjacent to variables in $W$. A kernel $q_V(V \mid W)$ is a mapping from values of $W$ to normalized densities over $V$, i.e., $\sum_{v \in V} q_V(v \mid w) = 1$ (Lauritzen, 1996). Conditioning and marginalization operations in kernels are defined in the usual way.

Fixing and Fixability

In Section 4 of the main paper, we provided an informal description of fixing as the operation of inverse-weighting by the propensity score of the variable being fixed; we now formalize this notion. A variable $A \in V$ is said to be fixable if there is no $X \in V \setminus \{A\}$ such that both a directed path $A \rightarrow \cdots \rightarrow X$ and a bidirected path $A \leftrightarrow \cdots \leftrightarrow X$ exist. Given a CADMG $\mathcal{G}(V, W)$ where $A$ is fixable, the graphical operator of fixing, denoted $\phi_A(\mathcal{G})$, yields a new CADMG $\mathcal{G}(V \setminus A, W \cup A)$ with all incoming edges into $A$ removed, and $A$ set to a fixed value $a$. Given a kernel $q_V(V \mid W)$, the corresponding probabilistic operation of fixing, denoted $\phi_A(q_V; \mathcal{G})$, yields a new kernel

$$q_{V \setminus A}(V \setminus A \mid W \cup A) \equiv \frac{q_V(V \mid W)}{q_V(A \mid \operatorname{mb}_{\mathcal{G}}(A), W)},$$

where $\operatorname{mb}_{\mathcal{G}}(A)$ is the Markov blanket of $A$, defined as the bidirected connected component (district) of $A$ (excluding $A$ itself) together with the parents of the district of $A$, i.e., $\operatorname{mb}_{\mathcal{G}}(A) \equiv \operatorname{dis}_{\mathcal{G}}(A) \cup \operatorname{pa}_{\mathcal{G}}(\operatorname{dis}_{\mathcal{G}}(A)) \setminus \{A\}$. It is easy to check that when $\mathcal{G}$ is a DAG, i.e., there are no bidirected edges, the denominator in the probabilistic fixing operation reduces to the familiar definition of a simple propensity score.
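The graphical side of fixing can be sketched in a few lines: computing districts, checking the single-vertex fixability condition (equivalently, that no vertex other than $A$ lies in both the descendants and the district of $A$), and applying $\phi_A$ by deleting all edges with an arrowhead at $A$. The encoding and the two-vertex example are our own illustrative assumptions.

```python
from collections import defaultdict

def districts(vertices, bi_edges):
    """Partition vertices into bidirected-connected components."""
    adj = defaultdict(set)
    for a, b in bi_edges:
        adj[a].add(b); adj[b].add(a)
    out, seen = [], set()
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u); stack.extend(adj[u] - comp)
        seen |= comp; out.append(comp)
    return out

def descendants(di_edges, a):
    """a together with all vertices reachable from a by directed paths."""
    ch = defaultdict(set)
    for u, v in di_edges:
        ch[u].add(v)
    out, stack = {a}, [a]
    while stack:
        for v in ch[stack.pop()] - out:
            out.add(v); stack.append(v)
    return out

def fixable(vertices, di_edges, bi_edges, a):
    """a is fixable iff descendants(a) and district(a) meet only at a."""
    dist = next(d for d in districts(vertices, bi_edges) if a in d)
    return descendants(di_edges, a) & dist == {a}

def fix(di_edges, bi_edges, a):
    """phi_a on the graph: delete all edges with an arrowhead at a."""
    return ([(u, v) for (u, v) in di_edges if v != a],
            [e for e in bi_edges if a not in e])

# With R1 -> R3 and R1 <-> R3: R1 has a descendant (R3) inside its own
# district, so R1 is not fixable, whereas R3 is.
V = {"R1", "R3"}
assert not fixable(V, [("R1", "R3")], [("R1", "R3")], "R1")
assert fixable(V, [("R1", "R3")], [("R1", "R3")], "R3")
```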

The notion of fixability can be extended to a set of variables $S \subseteq V$ as follows. A set $S$ is said to be fixable if its elements can be ordered into a sequence $\sigma_S = \langle S_1, S_2, \ldots \rangle$ such that $S_1$ is fixable in $\mathcal{G}$, $S_2$ is fixable in $\phi_{S_1}(\mathcal{G})$, and so on. This notion of fixability on sets of variables is essential to the description of the nested Markov model presented in the following section.

Nested Markov Factorization

Given a CADMG 𝒢,{\mathcal{G}}, A set SVS\subseteq V is said to be reachable if there exists a valid sequence of fixing operations on vertices VS.V\setminus S. Further, SS is said to be intrinsic if it is reachable, and forms a single bidirected connected component or district in ϕσVS(𝒢),\phi_{\sigma_{V\setminus S}}({\mathcal{G}}), i.e., the CADMG obtained upon executing all fixing operations given by a valid fixing sequence σVS.\sigma_{V\setminus S}.

A distribution p(V) is said to obey the nested Markov factorization relative to an ADMG {\mathcal{G}}(V) if, for every fixable set S and any valid fixing sequence \sigma_{S},

\displaystyle\phi_{\sigma_{S}}(p(V);{\mathcal{G}})=\prod_{D\in{\cal D}(\phi_{\sigma_{S}}({\mathcal{G}}))}q_{D}(D\mid\operatorname{pa}_{\phi_{\sigma_{S}}({\mathcal{G}})}(D)),

where all kernels appearing in the product above can be constructed by combining kernels corresponding to intrinsic sets, i.e., \{q_{I}(I\mid\operatorname{pa}_{\mathcal{G}}(I))\mid I\text{ is intrinsic in }{\mathcal{G}}\}. Such a construction is possible because all the sets D appearing in the product are districts in a reachable graph derived from {\mathcal{G}}.

Richardson et al. (2017) noted that when a distribution p(V) is nested Markov relative to an ADMG {\mathcal{G}}, all valid fixing sequences yield the same CADMG and kernel, so that recursive applications of the fixing operator to a set V\setminus S can simply be denoted by \phi_{V\setminus S}({\mathcal{G}}) and \phi_{V\setminus S}(q_{V};{\mathcal{G}}), without specifying any particular valid order. Thus, the set of kernels corresponding to intrinsic sets can be characterized as \{q_{I}(I\mid\operatorname{pa}_{\mathcal{G}}(I))\mid I\text{ is intrinsic in }{\mathcal{G}}\}=\{\phi_{V\setminus I}(p(V);{\mathcal{G}})\mid I\text{ is intrinsic in }{\mathcal{G}}\}, and the nested Markov factorization can be restated more simply as: for every fixable set S,

\displaystyle\phi_{S}(p(V);{\mathcal{G}})=\prod_{D\in{\cal D}\big{(}\phi_{S}({\mathcal{G}})\big{)}}\phi_{V\setminus D}(p(V);{\mathcal{G}}).

An important result from Richardson et al. (2017) states that if p(V\cup U) is Markov relative to a DAG {\mathcal{G}}(V\cup U), then p(V) is nested Markov relative to the ADMG {\mathcal{G}}(V) obtained by latent projection; a small sketch of this projection operation follows.
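As an illustration of the latent projection operator invoked here, the following sketch computes the projection of a DAG with hidden variables onto its observed vertices; the representation and names are our own, and the code assumes the input is a DAG given as a set of directed edges.

    def _observed_reach(src, observed, di_edges):
        # Observed vertices reachable from src by directed paths whose
        # intermediate vertices are all hidden.
        seen, stack, out = {src}, [src], set()
        while stack:
            cur = stack.pop()
            for (a, b) in di_edges:
                if a != cur or b in seen:
                    continue
                seen.add(b)
                if b in observed:
                    out.add(b)       # stop: b is observed
                else:
                    stack.append(b)  # keep walking through hidden vertices
        return out

    def latent_projection(observed, hidden, di_edges):
        # a -> b in the projection iff a directed path a -> ... -> b exists
        # with all intermediates hidden; a <-> b iff some hidden u reaches
        # both a and b through hidden-only directed paths.
        di = {(a, b) for a in observed
              for b in _observed_reach(a, observed, di_edges)}
        bi = set()
        for u in hidden:
            kids = _observed_reach(u, observed, di_edges)
            bi |= {frozenset({a, b}) for a in kids for b in kids if a != b}
        return di, bi

    # Example (Fig. 5(b) below): hidden U with children X1 and R1 projects
    # onto the single bidirected edge X1 <-> R1 of Fig. 5(a).
    print(latent_projection({"X1", "R1"}, {"U"}, {("U", "X1"), ("U", "R1")}))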

Binary Parameterization of Nested Markov Models

From the above factorization, it is clear that intrinsic sets, given their parents, form the atomic units of the nested Markov model. Using this observation, Evans & Richardson (2014) provided a smooth parameterization of discrete nested Markov models. We now give a short description of how to derive the so-called Moebius parameters of a binary nested Markov model.

For each district D\in{\cal D}({\mathcal{G}}), consider all possible subsets S\subseteq D. If S is intrinsic (that is, reachable and bidirected connected in \phi_{V\setminus S}({\mathcal{G}})), define the head H of the intrinsic set to be all vertices in S that are childless in \phi_{V\setminus S}({\mathcal{G}}), and the tail T to be all parents of the head in the CADMG \phi_{V\setminus S}({\mathcal{G}}), excluding the head itself. More formally, H\equiv\{V_{i}\in S\mid\operatorname{ch}_{\phi_{V\setminus S}({\mathcal{G}})}(V_{i})=\emptyset\} and T\equiv\operatorname{pa}_{\phi_{V\setminus S}({\mathcal{G}})}(H)\setminus H. The corresponding set of Moebius parameters for this intrinsic head and tail pair parameterizes the kernel q_{S}(H=0\mid T), i.e., the kernel where all variables outside the intrinsic set S are fixed and all elements of the head are set to zero, given the tail. Note that these parameters are, in general, variationally dependent (in contrast to the variationally independent parameters of an ordinary DAG model), as the heads and tails in these parameter sets may overlap. The joint density for any query p(V=v) can be obtained through the Moebius inversion formula; see (Lauritzen, 1996; Evans & Richardson, 2014) for details. For brevity, we denote q_{S}(H=0\mid T) simply by q(H=0\mid T), as it will be clear from context which variables are still random in the kernel corresponding to a given intrinsic set.
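As a small worked instance of this parameterization, the following sketch applies Moebius inversion by hand to the simplest nontrivial district, a single bidirected edge between two binary variables (the setting of Fig. 5(a) below); the numeric values are arbitrary illustrations chosen to satisfy the variational constraints.

    from itertools import product

    # Moebius parameters for the district {X, R} with X <-> R; the values
    # must satisfy max(0, q_x0 + q_r0 - 1) <= q_x0r0 <= min(q_x0, q_r0).
    q_x0, q_r0, q_x0r0 = 0.6, 0.5, 0.35

    def joint(x, r):
        # Moebius inversion: inclusion-exclusion over heads set to zero.
        if (x, r) == (0, 0):
            return q_x0r0
        if (x, r) == (0, 1):
            return q_x0 - q_x0r0
        if (x, r) == (1, 0):
            return q_r0 - q_x0r0
        return 1 - q_x0 - q_r0 + q_x0r0

    probs = {(x, r): joint(x, r) for x, r in product((0, 1), repeat=2)}
    assert abs(sum(probs.values()) - 1) < 1e-12 and min(probs.values()) >= 0
    print(probs)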

Binary Parameterization of Missing Data ADMGs

We use the parameterization described in the previous section to count the number of parameters required to parameterize the full law of a missing data ADMG and its corresponding observed law. We then use these counts to argue that if the number of parameters in the full law exceeds the number in the observed law, it is impossible to establish a map from the observed law to a unique full law. This in turn implies that such a full law is not identified.

The binary parameterization of the full law of a missing data ADMG {\mathcal{G}}(X^{(1)},O,R,X) is exactly the same as that of an ordinary ADMG, except that the deterministic factors p(X_{i}\mid R_{i},X_{i}^{(1)}) can be ignored, as X_{i}=X_{i}^{(1)} with probability one when R_{i}=1, and X_{i}=? with probability one when R_{i}=0.

The observed law is parameterized as follows. First, variables in X^{(1)} are treated as completely unobserved, and an observed law ADMG {\mathcal{G}}(X,O,R) is obtained by applying the latent projection operator to {\mathcal{G}}(X^{(1)},O,R,X). The Moebius parameters are then derived as before, with the additional constraint that if X_{i}\in X appears in the head of a Moebius parameter, and the corresponding missingness indicator R_{i} appears in the tail, then the kernel must be restricted to cases where R_{i}=1. This is because when R_{i}=0, the probability of the head taking on any value, aside from those where X_{i}=?, is deterministically zero.

Note that parameterizing the observed law by treating variables in X^{(1)} as fully unobserved does not quite capture all equality constraints that may be detectable in the observed law, as these variables are in fact sometimes observable, namely when their corresponding missingness indicators are equal to one. Indeed, a smooth parameterization of the observed law of missing data models that captures all constraints implied by the model is still an open problem. Nevertheless, parameterizing an observed law ADMG as above provides an upper bound on the number of parameters required to parameterize the true observed law. This suffices for our purposes: demonstrating that this upper bound on the number of parameters in the observed law is smaller than the number of parameters in the full law is sufficient to prove that the full law is not identified.

B. Example: Odds Ratio Parameterization

[Figure 4 here: panels (a) {\mathcal{G}}_{a} and (b) {\mathcal{G}}_{b}, graphs over the vertices X^{(1)}_{1}, X^{(1)}_{2}, X^{(1)}_{3}, R_{1}, R_{2}, R_{3}, X_{1}, X_{2}, X_{3}.]
Figure 4: (a) The missing data DAG model used in Scenario 2. (b) The missing data ADMG model used in Scenario 3.

To build up a more concrete intuition for Theorems 1 and 3, we provide an example of the odds ratio parameterization for the missing data models used in Scenarios 2 and 3 of the main paper, reproduced here in Figs. 4(a, b). Using the order R_{1},R_{2},R_{3} on the missingness indicators, the odds ratio parameterization of the missing data process for both models is as follows.

\displaystyle\frac{1}{Z}\times\bigg{(}\prod_{k=1}^{3}p(R_{k}\mid R_{-k}=1,X^{(1)})\bigg{)}\times\text{OR}(R_{1},R_{2}\mid R_{3}=1,X^{(1)})\times\text{OR}(R_{3},(R_{1},R_{2})\mid X^{(1)}). (3)

We now argue that each piece in Eq. 3 is identified. Note that, in the missing data DAG shown in Fig. 4(a), R_{i}\perp\!\!\!\perp X_{i}^{(1)}\mid R_{-i},X_{-i}^{(1)} by d-separation. The same is true for the missing data ADMG in Fig. 4(b) by m-separation. Thus, in both cases, the conditional distribution of each R_{i} given the remaining variables is not a function of X_{i}^{(1)}, and is therefore a function of the observed data. We now show that \text{OR}(R_{1},R_{2}\mid R_{3}=1,X^{(1)}) is not a function of X_{1}^{(1)},X_{2}^{(1)} by utilizing the symmetry property of the odds ratio.

\displaystyle\text{OR}(R_{1},R_{2}\mid R_{3}=1,X^{(1)})=\frac{p(R_{1}\mid R_{2},R_{3}=1,X_{2}^{(1)},X_{3}^{(1)})}{p(R_{1}=1\mid R_{2},R_{3}=1,X_{2}^{(1)},X_{3}^{(1)})}\times\frac{p(R_{1}=1\mid R_{2}=1,R_{3}=1,X_{2}^{(1)},X_{3}^{(1)})}{p(R_{1}\mid R_{2}=1,R_{3}=1,X_{2}^{(1)},X_{3}^{(1)})}
=\text{OR}(R_{2},R_{1}\mid R_{3}=1,X^{(1)})=\frac{p(R_{2}\mid R_{1},R_{3}=1,X_{1}^{(1)},X_{3}^{(1)})}{p(R_{2}=1\mid R_{1},R_{3}=1,X_{1}^{(1)},X_{3}^{(1)})}\times\frac{p(R_{2}=1\mid R_{1}=1,R_{3}=1,X_{1}^{(1)},X_{3}^{(1)})}{p(R_{2}\mid R_{1}=1,R_{3}=1,X_{1}^{(1)},X_{3}^{(1)})}.

Thus, from the first equality, the odds ratio is not a function of X_{1}^{(1)}, as R_{1}\perp\!\!\!\perp X_{1}^{(1)}\mid R_{-1},X_{-1}^{(1)} by d-separation in Fig. 4(a) and by m-separation in Fig. 4(b). A symmetric argument applied to the second and third equalities, using R_{2}\perp\!\!\!\perp X_{2}^{(1)}\mid R_{-2},X_{-2}^{(1)}, shows that it is not a function of X_{2}^{(1)} either. Hence, the odds ratio is only a function of X_{3}^{(1)}, which is observable here, as the function is evaluated at R_{3}=1.

We now utilize an identity from Chen et al. (2015) in order to simplify the final term in Eq. 3. That is,

\displaystyle\text{OR}(R_{3},(R_{1},R_{2})\mid X^{(1)})=\text{OR}(R_{3},R_{2}\mid R_{1}=1,X^{(1)})\ \text{OR}(R_{3},R_{1}\mid R_{2},X^{(1)})
=\text{OR}(R_{3},R_{2}\mid R_{1}=1,X^{(1)})\ \text{OR}(R_{3},R_{1}\mid R_{2}=1,X^{(1)})\ \underbrace{\frac{\text{OR}(R_{3},R_{1}\mid R_{2},X^{(1)})}{\text{OR}(R_{3},R_{1}\mid R_{2}=1,X^{(1)})}}_{f(R_{1},R_{2},R_{3}\mid X^{(1)})}.

The first two pairwise odds ratio terms are functions of the observed data by an analogous argument that draws on the symmetry property of the odds ratio and the conditional independences R_{i}\perp\!\!\!\perp X_{i}^{(1)}\mid R_{-i},X_{-i}^{(1)}, as before. The final term f(R_{1},R_{2},R_{3}\mid X^{(1)}) is a three-way interaction term on the odds ratio scale and can be expressed in three different ways as follows (Chen et al., 2015),

\displaystyle\frac{\text{OR}(R_{3},R_{1}\mid R_{2},X^{(1)})}{\text{OR}(R_{3},R_{1}\mid R_{2}=1,X^{(1)})}=\frac{\text{OR}(R_{2},R_{3}\mid R_{1},X^{(1)})}{\text{OR}(R_{2},R_{3}\mid R_{1}=1,X^{(1)})}=\frac{\text{OR}(R_{1},R_{2}\mid R_{3},X^{(1)})}{\text{OR}(R_{1},R_{2}\mid R_{3}=1,X^{(1)})}.

From the first expression, we note by symmetry of the odds ratio and conditional independence that f is not a function of X^{(1)}_{1},X^{(1)}_{3}. Similarly, from the second expression, f is not a function of X^{(1)}_{2},X^{(1)}_{3}, and from the third, f is not a function of X^{(1)}_{1},X^{(1)}_{2}. Therefore, f is not a function of X^{(1)}_{1},X^{(1)}_{2},X^{(1)}_{3} and is identified.

The normalizing function Z is a function of pieces that we have already shown to be identified, and is therefore also identified. Thus, the missingness mechanism p(R\mid X^{(1)}), and consequently the full laws corresponding to the missing data graphs shown in Figs. 4(a, b), are identified by Remark 2.
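The symmetry property of the odds ratio that was invoked repeatedly above is easy to check numerically; the sketch below does so for a pair of binary variables, using an arbitrary made-up joint table.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random((2, 2))            # p[a, b] proportional to p(A=a, B=b)
    p /= p.sum()

    pA_B = p / p.sum(axis=0, keepdims=True)   # p(A=a | B=b)
    pB_A = p / p.sum(axis=1, keepdims=True)   # p(B=b | A=a)

    # OR(A, B) and OR(B, A), both with reference level 1 and evaluated
    # at A=0, B=0; each reduces to p00*p11 / (p01*p10).
    or_ab = (pA_B[0, 0] / pA_B[1, 0]) * (pA_B[1, 1] / pA_B[0, 1])
    or_ba = (pB_A[0, 0] / pB_A[0, 1]) * (pB_A[1, 1] / pB_A[1, 0])
    assert np.isclose(or_ab, or_ba)   # symmetry: OR(A, B) = OR(B, A)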

C. Proofs

We first prove Lemmas 1 and 2, as we use them in the course of proving Theorems 1 and 3. We start with Lemma 2, since the proof of Lemma 1 reduces to a special case of it.

Lemma 2  A missing data model of an ADMG {\mathcal{G}} that contains no colluding paths is a submodel of the itemwise conditionally independent nonresponse model described in (Shpitser, 2016; Sadinle & Reiter, 2017).

Proof.

The complete Markov blanket of a vertex V_{i} in an ADMG {\mathcal{G}}, denoted \operatorname{mb}_{\mathcal{G}}^{c}(V_{i}), is the set of vertices such that V_{i}\perp\!\!\!\perp V_{-i}\setminus\operatorname{mb}_{\mathcal{G}}^{c}(V_{i})\mid\operatorname{mb}_{\mathcal{G}}^{c}(V_{i}) (Pearl, 1988; Richardson, 2003). In ADMGs, this set corresponds to the Markov blanket of V_{i}, its children, and the Markov blankets of its children. That is,

\displaystyle\operatorname{mb}_{\mathcal{G}}^{c}(V_{i})\equiv\operatorname{mb}_{\mathcal{G}}(V_{i})\cup\bigg{(}\bigcup_{V_{j}\in\operatorname{ch}_{\mathcal{G}}(V_{i})}V_{j}\cup\operatorname{mb}_{\mathcal{G}}(V_{j})\bigg{)}\setminus\{V_{i}\}.

Without loss of generality, we ignore the part of the graph involving the deterministic factors p(X\mid X^{(1)},R) and the corresponding deterministic edges in the construction of the Markov blanket and complete Markov blanket of variables in a missing data graph {\mathcal{G}}(X^{(1)},O,R). We now show that the absence of (non-deterministic) colluding paths between a pair X_{i}^{(1)} and R_{i} in {\mathcal{G}} implies that X_{i}^{(1)}\notin\operatorname{mb}_{\mathcal{G}}^{c}(R_{i}).

  • X_{i}^{(1)} is not a parent of R_{i}, as X_{i}^{(1)}\rightarrow R_{i} is trivially a colluding path.

  • X_{i}^{(1)} is not in the district of R_{i}, as X_{i}^{(1)}\leftrightarrow\cdots\leftrightarrow R_{i} is also a colluding path.

These two points together imply that X_{i}^{(1)}\notin\operatorname{mb}_{\mathcal{G}}(R_{i}). We now show that the union over the children of R_{i} and their Markov blankets also excludes X_{i}^{(1)}.

  • X_{i}^{(1)} is not a child of R_{i}, as directed edges from R_{i} to variables in X^{(1)} are ruled out by construction in missing data graphs.

  • X_{i}^{(1)} is not in the district of any child of R_{i}, as R_{i}\rightarrow\cdots\leftrightarrow X_{i}^{(1)} is a colluding path.

  • X_{i}^{(1)} is not a parent of the district of any child of R_{i}, as R_{i}\rightarrow\cdots\leftarrow X_{i}^{(1)} is a colluding path.

These three points together rule out the possibility that X_{i}^{(1)} is present in the union over the children of R_{i} and their Markov blankets. Thus, we have shown that X_{i}^{(1)}\notin\operatorname{mb}_{\mathcal{G}}^{c}(R_{i}). This implies the following,

\displaystyle R_{i}\perp\!\!\!\perp V\setminus\{R_{i},\operatorname{mb}_{\mathcal{G}}^{c}(R_{i})\}\mid\operatorname{mb}_{\mathcal{G}}^{c}(R_{i})\implies R_{i}\perp\!\!\!\perp X_{i}^{(1)}\mid\operatorname{mb}_{\mathcal{G}}^{c}(R_{i}).

By the semi-graphoid axioms (see, for example, (Lauritzen, 1996; Pearl, 2009)), this yields the conditional independence R_{i}\perp\!\!\!\perp X_{i}^{(1)}\mid R_{-i},X^{(1)}_{-i},O.

The same line of reasoning detailed above applies to every R_{i}\in R, which gives us the set of conditional independences implied by the no self-censoring model. That is,

\displaystyle R_{i}\perp\!\!\!\perp X^{(1)}_{i}\mid R_{-i},X^{(1)}_{-i},O,\quad\forall R_{i}\in R. ∎

Lemma 1  A missing data model of a DAG {\mathcal{G}} that contains no self-censoring edges and no colluders is a submodel of the itemwise conditionally independent nonresponse model described in (Shpitser, 2016; Sadinle & Reiter, 2017).

Proof.

A DAG is simply a special case of an ADMG with no bidirected edges. Consequently, the only two types of colluding paths are self-censoring edges (X_{i}^{(1)}\rightarrow R_{i}) and colluder structures (X_{i}^{(1)}\rightarrow R_{j}\leftarrow R_{i}). Thus, the absence of these two structures in a missing data DAG {\mathcal{G}} rules out all possible colluding paths. The rest of the proof carries over directly from Lemma 2. ∎

Theorem 1  A full law p(R,X^{(1)},O) that is Markov relative to a missing data DAG {\mathcal{G}} is identified if {\mathcal{G}} contains no edges of the form X_{i}^{(1)}\rightarrow R_{i} (no self-censoring) and no structures of the form X_{j}^{(1)}\rightarrow R_{i}\leftarrow R_{j} (no colluders), and the stated positivity assumption holds. Moreover, the resulting identifying functional for the missingness mechanism p(R\mid X^{(1)},O) is given by the odds ratio parameterization provided in Eq. 2 of the main draft, and the identifying functionals for the target law and full law are given by Remarks 1 and 2.

Proof.

Given Eq. (2), we know that

\displaystyle p(R\mid X^{(1)},O)=\frac{1}{Z}\times\prod_{k=1}^{K}p(R_{k}\mid R_{-k}=1,X^{(1)},O)\times\prod_{k=2}^{K}\text{OR}(R_{k},R_{\prec k}\mid R_{\succ k}=1,X^{(1)},O),

where R_{-k}=R\setminus R_{k}, R_{\prec k}=\{R_{1},\ldots,R_{k-1}\}, R_{\succ k}=\{R_{k+1},\ldots,R_{K}\}, and

\displaystyle\text{OR}(R_{k},R_{\prec k}\mid R_{\succ k}=1,X^{(1)},O)=\frac{p(R_{k}\mid R_{\succ k}=1,R_{\prec k},X^{(1)},O)}{p(R_{k}=1\mid R_{\succ k}=1,R_{\prec k},X^{(1)},O)}\times\frac{p(R_{k}=1\mid R_{-k}=1,X^{(1)},O)}{p(R_{k}\mid R_{-k}=1,X^{(1)},O)},

and Z is the normalizing term, equal to \sum_{r}\{\prod_{k=1}^{K}p(r_{k}\mid R_{-k}=1,X^{(1)},O)\times\prod_{k=2}^{K}\text{OR}(r_{k},r_{\prec k}\mid R_{\succ k}=1,X^{(1)},O)\}. If we can prove that all the pieces in this factorization are identified, then the missingness process is identified, and so is the full law. We provide the proof in two steps; a numeric check of this factorization in the simplest case is sketched below. Our proof is similar to the identification proof of the no self-censoring model given in (Malinsky et al., 2019).
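As a sanity check on this factorization, the following sketch verifies Eq. (2) numerically in the simplest case K=2 (suppressing X^{(1)} and O): renormalizing the product of the conditionals p(R_k | R_{-k}=1) and the pairwise odds ratio recovers an arbitrary joint distribution over (R_1, R_2).

    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.random((2, 2))
    p /= p.sum()                                       # p[r1, r2] = p(R1=r1, R2=r2)

    p_r1_given_r2 = p / p.sum(axis=0, keepdims=True)   # p(R1 | R2)
    p_r2_given_r1 = p / p.sum(axis=1, keepdims=True)   # p(R2 | R1)

    def OR(r1, r2):
        # OR(R1=r1, R2=r2) with reference level 1 for both indicators.
        return (p_r1_given_r2[r1, r2] * p_r1_given_r2[1, 1]) / (
            p_r1_given_r2[1, r2] * p_r1_given_r2[r1, 1])

    # Unnormalized product: p(r1 | R2=1) * p(r2 | R1=1) * OR(r1, r2).
    g = np.array([[p_r1_given_r2[r1, 1] * p_r2_given_r1[1, r2] * OR(r1, r2)
                   for r2 in (0, 1)] for r1 in (0, 1)])
    assert np.allclose(g / g.sum(), p)  # recovers the joint after normalizing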

For each k\in\{3,\dots,K\}, we can apply the following expansion to the odds ratio term; without loss of generality, we drop the fully observed random variables O for brevity,

\displaystyle\text{OR}(R_{k},R_{\prec k}\mid R_{\succ k}=1,X^{(1)})=\text{OR}(R_{k},R_{k-1}\mid R_{-(k,k-1)}=1,X^{(1)})\times\text{OR}(R_{k},R_{\prec k-1}\mid R_{\succ k}=1,R_{k-1},X^{(1)}). (4)

This expansion can be applied inductively to the second term in the above product until \text{OR}(R_{k},R_{\prec k}\mid R_{\succ k}=1,X^{(1)}) is expressed as a product of pairwise odds ratios and higher-order interaction terms. Applying the inductive expansion to each odds ratio term in \prod_{k=2}^{K}\text{OR}(R_{k},R_{\prec k}\mid R_{\succ k}=1,X^{(1)}), we can re-express the identifying functional as

\displaystyle p(R\mid X^{(1)})=\frac{1}{Z}\times\prod_{k=1}^{K}p(R_{k}\mid R_{-k}=1,X^{(1)})
\times\prod_{R_{k},R_{l}\in R}\text{OR}(R_{k},R_{l}\mid R_{-(k,l)}=1,X^{(1)})\times\prod_{R_{k},R_{l},R_{m}\in R}f(R_{k},R_{l},R_{m}\mid R_{-(k,l,m)}=1,X^{(1)})
\times\prod_{R_{k},R_{l},R_{m},R_{n}\in R}f(R_{k},R_{l},R_{m},R_{n}\mid R_{-(k,l,m,n)}=1,X^{(1)})\times\dots\times f(R_{1},\dots,R_{K}\mid X^{(1)}), (5)

where Z is the normalizing constant as before, and each f(\cdot\mid\cdot,X^{(1)}) is a 3-way, 4-way, up to K-way interaction term. These interaction terms are defined as follows.

f(R_{i},R_{j},R_{k}\mid R_{-(i,j,k)}=1,X^{(1)})=\frac{\text{OR}(R_{i},R_{j}\mid R_{k},R_{-(i,j,k)}=1,X^{(1)})}{\text{OR}(R_{i},R_{j}\mid R_{k}=1,R_{-(i,j,k)}=1,X^{(1)})},

and

f(R_{i},R_{j},R_{k},R_{l}\mid R_{-(i,j,k,l)}=1,X^{(1)})=\frac{\text{OR}(R_{i},R_{j}\mid R_{k},R_{l},R_{-(i,j,k,l)}=1,X^{(1)})}{\text{OR}(R_{i},R_{j}\mid R_{k}=1,R_{l},R_{-(i,j,k,l)}=1,X^{(1)})}\times\frac{\text{OR}(R_{i},R_{j}\mid R_{k}=1,R_{l}=1,R_{-(i,j,k,l)}=1,X^{(1)})}{\text{OR}(R_{i},R_{j}\mid R_{k},R_{l}=1,R_{-(i,j,k,l)}=1,X^{(1)})},

and so on, up to

\displaystyle f(R_{1},\dots,R_{K}\mid X^{(1)})=\text{OR}(R_{i},R_{j}\mid R_{-(i,j)},X^{(1)})
\times\frac{\displaystyle\prod_{R_{k},R_{l}\in R}\text{OR}(R_{i},R_{j}\mid R_{(k,l)}=1,R_{-(i,j,k,l)},X^{(1)})\prod_{R_{k},R_{l},R_{m},R_{n}\in R}\text{OR}(R_{i},R_{j}\mid R_{(k,l,m,n)}=1,R_{-(i,j,k,l,m,n)},X^{(1)})\times\cdots}{\displaystyle\prod_{R_{k}\in R}\text{OR}(R_{i},R_{j}\mid R_{k}=1,R_{-(i,j,k)},X^{(1)})\prod_{R_{k},R_{l},R_{m}\in R}\text{OR}(R_{i},R_{j}\mid R_{(k,l,m)}=1,R_{-(i,j,k,l,m)},X^{(1)})\times\cdots}.

Readers familiar with the clique potential factorization of Markov random fields may treat these interaction terms analogously (Malinsky et al., 2019). We now show that each term in the above factorization is identified.

Step 1.
We start by looking at the conditional pieces p(R_{k}\mid R_{-k}=1,X^{(1)},O). By Lemma 1, we know that R_{k}\perp\!\!\!\perp X^{(1)}_{k}\mid R_{-k},X^{(1)}_{-k},O. Therefore, p(R_{k}\mid R_{-k}=1,X^{(1)},O)=p(R_{k}\mid R_{-k}=1,X^{(1)}_{-k},O) is a function of the observed data, and is identified for all R_{k}\in R.

Step 2.
We now show that for any R_{k},R_{l}\in R, the pairwise odds ratio \text{OR}(R_{k},R_{l}\mid R_{-(k,l)}=1,X^{(1)}) given in Eq. (5) is identified. We know that

\text{OR}(R_{k},R_{l}\mid R_{-(k,l)}=1,X^{(1)})=\text{OR}(R_{k},R_{l}\mid R_{-(k,l)}=1,X_{-(k,l)},X^{(1)}_{k},X^{(1)}_{l}).

Consequently, if we can show that the odds ratio is a function of neither X^{(1)}_{k} nor X^{(1)}_{l}, then we can safely claim that it is only a function of the observed data and hence identified. We reach this conclusion by exploiting the symmetry property of the odds ratio.

\displaystyle\text{OR}(R_{k},R_{l}\mid R_{-(k,l)}=1,X^{(1)})=\frac{p(R_{k}\mid R_{l},R_{-(k,l)}=1,X^{(1)})}{p(R_{k}=1\mid R_{l},R_{-(k,l)}=1,X^{(1)})}\times\frac{p(R_{k}=1\mid R_{-k}=1,X^{(1)})}{p(R_{k}\mid R_{-k}=1,X^{(1)})}
=\frac{p(R_{l}\mid R_{k},R_{-(k,l)}=1,X^{(1)})}{p(R_{l}=1\mid R_{k},R_{-(k,l)}=1,X^{(1)})}\times\frac{p(R_{l}=1\mid R_{-l}=1,X^{(1)})}{p(R_{l}\mid R_{-l}=1,X^{(1)})}.

In the first equality, we can see that the odds ratio is not a function of X^{(1)}_{k}, since R_{k}\perp\!\!\!\perp X^{(1)}_{k}\mid R_{-k},X^{(1)}_{-k}. Similarly, from the second equality, we can see that it is not a function of X^{(1)}_{l}, since R_{l}\perp\!\!\!\perp X^{(1)}_{l}\mid R_{-l},X^{(1)}_{-l}. Therefore, the pairwise odds ratios are all identified.

Finally, we show that each of the higher-order interaction terms is identified. For each of these terms, we need to show that it is not a function of the missing variables whose indices correspond to the indicators appearing to the left of the conditioning bar. That is, we need to show that the 3-way interaction terms f(R_{k},R_{l},R_{m}\mid R_{-(k,l,m)}=1,X^{(1)}) are not functions of X_{(k,l,m)}^{(1)}, the 4-way interaction terms f(R_{k},R_{l},R_{m},R_{n}\mid R_{-(k,l,m,n)}=1,X^{(1)}) are not functions of X_{(k,l,m,n)}^{(1)}, and so on, until finally the K-way interaction term f(R_{1},\dots,R_{K}\mid X^{(1)}) is not a function of X^{(1)}.

Because of the way the odds ratio is defined, each f(\cdot\mid\cdot,X^{(1)}) is symmetric in the k arguments appearing to the left of the conditioning bar and can be rewritten in multiple equivalent ways. In particular, each k-way interaction term can be rewritten in \binom{k}{2} ways, one for each choice of indices i,j of the missingness indicators that appear to the left of the conditioning bar. Each such representation allows us to conclude that f(\cdot\mid\cdot,X^{(1)}) is not a function of X_{i}^{(1)},X_{j}^{(1)}. Combining all of these representations allows us to conclude that the k-way interaction term is not a function of any of the missing variables corresponding to the indicators appearing to the left of the conditioning bar.

As a concrete example, consider the 3-way interaction f(R_{1},R_{2},R_{3}\mid R_{-(1,2,3)}=1,X^{(1)}). We can write it down in three different ways as follows.

\displaystyle f(R_{1},R_{2},R_{3}\mid R_{-(1,2,3)}=1,X^{(1)})
=\frac{\text{OR}(R_{1},R_{2}\mid R_{-(1,2,3)}=1,R_{3},X^{(1)})}{\text{OR}(R_{1},R_{2}\mid R_{-(1,2,3)}=1,R_{3}=1,X^{(1)})}=\frac{\text{OR}(R_{1},R_{3}\mid R_{-(1,2,3)}=1,R_{2},X^{(1)})}{\text{OR}(R_{1},R_{3}\mid R_{-(1,2,3)}=1,R_{2}=1,X^{(1)})}=\frac{\text{OR}(R_{2},R_{3}\mid R_{-(1,2,3)}=1,R_{1},X^{(1)})}{\text{OR}(R_{2},R_{3}\mid R_{-(1,2,3)}=1,R_{1}=1,X^{(1)})}

From the first equality, we note that f is not a function of X^{(1)}_{1},X^{(1)}_{2}. From the second equality, we note that f is not a function of X^{(1)}_{1},X^{(1)}_{3}. From the third equality, we note that f is not a function of X^{(1)}_{2},X^{(1)}_{3}. Therefore, f is not a function of X^{(1)}_{1},X^{(1)}_{2},X^{(1)}_{3} and is identified. ∎

Theorem 2  The graphical condition of no self-censoring and no colluders, put forward in Theorem 1, is sound and complete for the identification of full laws p(R,O,X^{(1)}) that are Markov relative to a missing data DAG {\mathcal{G}}.

Proof.

Soundness is a direct consequence of Theorem 1. To prove completeness, it needs to be shown that in the presence of a self-censoring edge or a colluder structure, the full law is no longer (non-parametrically) identified. A proof by counterexample of both these facts was provided in (Bhattacharya et al., 2019). However, this can also be seen from the fact that self-censoring edges and colluders are special cases of the colluding paths that we prove result in non-identification of the full law in Lemma 3. ∎

Theorem 3  A full law p(R,X^{(1)},O) that is Markov relative to a missing data ADMG {\mathcal{G}} is identified if {\mathcal{G}} does not contain any colluding paths and the positivity assumption stated in Section 5 holds. Moreover, the resulting identifying functional for the missingness mechanism p(R\mid X^{(1)},O) is given by the odds ratio parameterization provided in Eq. 2 of the main draft.

Proof.

The proof strategy is nearly identical to that of Theorem 1, except that the conditional independences R_{k}\perp\!\!\!\perp X_{k}^{(1)}\mid R_{-k},X_{-k}^{(1)},O come from Lemma 2 instead of Lemma 1. ∎

Lemma 3  A full law p(R,X^{(1)},O) that is Markov relative to a missing data ADMG {\mathcal{G}} containing a colluding path between a pair X_{i}^{(1)}\in X^{(1)} and R_{i}\in R is not identified.

[Figure 5 here: panels (a)-(f), graphs over subsets of X^{(1)}_{1}, X^{(1)}_{2}, R_{1}, R_{2}, X_{1}, X_{2}, and a hidden variable U.]
Figure 5: (a, d, e) Examples of colluding paths in missing data models of ADMGs. (b) A DAG with hidden variable U that is Markov equivalent to (a). (c) Projecting out X^{(1)}_{1} from (a). (f) Projecting out X^{(1)}_{1} and X^{(1)}_{2} from (d) and (e).
Proof.

The non-identifiability of missing data models of an ADMG {\mathcal{G}} that contains a colluding path can be shown by providing two models {\cal M}_{1} and {\cal M}_{2} that disagree on the full law but agree on the observed law. Coming up with a single example of such a pair of models is sufficient to rule out non-parametric identification of the full law. Therefore, for simplicity, we restrict our attention to binary random variables. We first provide an example of such a pair of models for the simplest form of a colluding path, the bidirected edge X_{i}^{(1)}\leftrightarrow R_{i} shown in Fig. 5(a). According to Table 1, in order for the observed laws to agree, the only requirements are that the quantities in the observed law column, such as ab+(1-a)c, remain equal in both models; hence we can come up with infinitely many counterexamples of full laws that differ but map to the same observed law.

Constructing explicit counterexamples is not necessary to prove non-identification, so long as it can be shown that there exist at least two distinct full laws that map onto the exact same observed law. For instance, if the number of parameters in the full law is strictly larger than the number of parameters in the observed law, then there exist infinitely many such pairs. Consequently, we rely on a parameter counting argument to prove the completeness of our results. Since we are considering missing data models of ADMGs, we use the Moebius parameterization of binary nested Markov models of an ADMG described in Appendix A.

The nested Markov model of a missing data ADMG {\mathcal{G}}(V), where V=\{O,X^{(1)},R,X\}, is a smooth super model of the missing data DAG model {\mathcal{G}}(V\cup U), and has the same model dimension as the latent variable model (Evans, 2018). We also utilize nested Markov models of an ADMG {\mathcal{G}}(V\setminus X^{(1)}), corresponding to the projection of the missing data ADMG {\mathcal{G}}(V) onto the fully observable variables. While such a model does not capture all equality constraints in the true observed law, it is still a smooth super model of it, and thus provides an upper bound on the model dimension of the observed law. This suffices for our purposes: demonstrating that this upper bound is smaller than the number of parameters in the full law is sufficient to prove that the full law is not identified. We first walk the reader through a few examples to demonstrate this proof strategy, and then provide the general argument.

U    p(U)
0    a
1    1-a

R1   U    p(R1 | U)
0    0    b
1    0    1-b
0    1    c
1    1    1-c

X1^{(1)}   U    p(X1^{(1)} | U)
0          0    d
1          0    1-d
0          1    e
1          1    1-e

R1   X1^{(1)}   U    p(R1, X1^{(1)}, U)
0    0          0    a*b*d
0    0          1    (1-a)*c*e
0    1          0    a*b*(1-d)
0    1          1    (1-a)*c*(1-e)
1    0          0    a*(1-b)*d
1    0          1    (1-a)*(1-c)*e
1    1          0    a*(1-b)*(1-d)
1    1          1    (1-a)*(1-c)*(1-e)

R1   X1^{(1)}   p(Full Law)                          X1   p(Observed Law)
0    0          a*b*d + (1-a)*c*e                    ?    a*b + (1-a)*c
0    1          a*b*(1-d) + (1-a)*c*(1-e)
1    0          a*(1-b)*d + (1-a)*(1-c)*e            0    a*(1-b) + (1-a)*(1-c)
1    1          a*(1-b)*(1-d) + (1-a)*(1-c)*(1-e)    1

Table 1: Construction of counterexamples for non-identifiability of the full law in Fig. 5(a), using the DAG with hidden variable U in Fig. 5(b), which is Markov equivalent to (a).
Moebius Parameterization of the Full Law in Fig. 5(d)

Districts           Intrinsic Head/Tail              Moebius Parameters                  Counts
{X1^{(1)}}          {X1^{(1)}}, {}                   q(X1^{(1)}=0)                       1
{R2}                {R2}, {}                         q(R2=0)                             1
{R1, X2^{(1)}}      {R1}, {}                         q(R1=0)                             1
                    {X2^{(1)}}, {X1^{(1)}}           q(X2^{(1)}=0 | X1^{(1)})            2
                    {R1, X2^{(1)}}, {X1^{(1)}}       q(R1=0, X2^{(1)}=0 | X1^{(1)})      2
                                                                              Total      7

Moebius Parameterization of the Full Law in Fig. 5(e)

Districts                     Intrinsic Head/Tail               Moebius Parameters                   Counts
{R2}                          {R2}, {}                          q(R2=0)                              1
{R1, X1^{(1)}, X2^{(1)}}      {R1}, {}                          q(R1=0)                              1
                              {X1^{(1)}}, {}                    q(X1^{(1)}=0)                        1
                              {X2^{(1)}}, {}                    q(X2^{(1)}=0)                        1
                              {R1, X2^{(1)}}, {}                q(R1=0, X2^{(1)}=0)                  1
                              {X1^{(1)}, X2^{(1)}}, {}          q(X1^{(1)}=0, X2^{(1)}=0)            1
                              {R1, X1^{(1)}, X2^{(1)}}, {}      q(R1=0, X1^{(1)}=0, X2^{(1)}=0)      1
                                                                                          Total      7

Moebius Parameterization of the Observed Law in Fig. 5(f)

Districts        Intrinsic Head/Tail       Moebius Parameters          Counts
{R2}             {R2}, {}                  q(R2=0)                     1
{R1, X1, X2}     {R1}, {}                  q(R1=0)                     1
                 {X1}, {R1}                q(X1=0 | R1)                1
                 {X2}, {R2}                q(X2=0 | R2)                1
                 {R1, X2}, {R2}            q(R1=0, X2=0 | R2)          1
                 {X1, X2}, {R1, R2}        q(X1=0, X2=0 | R1, R2)      1
                                                             Total     6

Table 2: Moebius parameterizations of the full and observed laws of the missing data ADMGs in Figs. 5(d, e, f).

Self-censoring through unmeasured confounding:

We start by reanalyzing the colluding path given in Fig. 5(a) and the corresponding projection given in Fig. 5(c). The Moebius parameters associated with the full law are q(X^{(1)}_{1}=0), q(R_{1}=0), and q(X^{(1)}_{1}=0,R_{1}=0), for a total of 3 parameters. The Moebius parameters associated with the observed law in Fig. 5(c) are q(R_{1}=0) and q(X_{1}=0\mid R_{1}=1), for a total of only 2 parameters. Since 2<3, we can construct infinitely many mappings, as shown in Table 1.
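To make the counterexample concrete, the sketch below instantiates two sets of values for the Table 1 parameters (a, b, c, d, e); the specific numbers are our own made-up choices satisfying the agreement constraints, and they yield identical observed laws but different full laws.

    import numpy as np

    def laws(a, b, c, d, e):
        # Full law p(R1, X1^(1)) of Fig. 5(b) after marginalizing U, and
        # the observed law (p(R1=0, X1=?), p(R1=1, X1=0), p(R1=1, X1=1)).
        full = {
            (0, 0): a * b * d + (1 - a) * c * e,
            (0, 1): a * b * (1 - d) + (1 - a) * c * (1 - e),
            (1, 0): a * (1 - b) * d + (1 - a) * (1 - c) * e,
            (1, 1): a * (1 - b) * (1 - d) + (1 - a) * (1 - c) * (1 - e),
        }
        observed = (full[(0, 0)] + full[(0, 1)], full[(1, 0)], full[(1, 1)])
        return full, observed

    full1, obs1 = laws(0.5, 0.4, 0.6, 0.30, 0.70)
    full2, obs2 = laws(0.5, 0.5, 0.5, 0.30, 0.62)
    assert np.allclose(obs1, obs2)                       # same observed law
    assert not np.isclose(full1[(0, 0)], full2[(0, 0)])  # different full laws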

Simple colluding paths:

Consider the colluding paths given in Figs. 5(d, e) and the corresponding projection (which is identical in both cases) given in Fig. 5(f). The Moebius parameters associated with the full laws and the observed law are shown in Table 2. Once again, since the number of parameters in the observed law is smaller than the number in the full law (6<7), we can construct infinitely many mappings.

A general argument:

In order to generalize our argument, we first provide a more precise representation (one that does not use dashed edges) of all possible colluding paths between X_{i}^{(1)} and R_{i}, shown in Figs. 6(a-d). Without loss of generality, assume that there are K variables in X^{(1)} and that S variables lie on the colluding path between X^{(1)}_{i} and R_{i}, with S\in\{0,1,\dots,2(K-1)\}. We denote the s-th variable on the path by V_{s}, where V_{s}\in\{X^{(1)}\setminus X^{(1)}_{i}\}\cup\{R\setminus R_{i}\}. Note that V_{S} in Figs. 6(c, d) can only belong to R\setminus R_{i}, by convention. Fig. 6(e) illustrates the corresponding projections of panels (a) and (b), and Fig. 6(f) the corresponding projections of panels (c) and (d). In the projections shown in Figs. 6(e, f), V^{*}\in\{X\setminus X_{i}\}\cup\{R\setminus R_{i}\}.

[Figure 6 here: panels (a)-(d) show the possible colluding paths from X^{(1)}_{i} to R_{i} through intermediate vertices V_{1},\dots,V_{S} (with the final intermediate vertex being R_{S} in (c, d)); panels (e, f) show the corresponding projections onto X_{i}, V^{*}_{1},\dots,V^{*}_{S} (or R_{S}), and R_{i}.]
Figure 6: (a-d) Colluding paths between X^{(1)}_{i} and R_{i}. (e, f) The corresponding latent projections, obtained by projecting out X^{(1)}.

We now go over each of these colluding paths and their corresponding latent projections, as if they appear in a larger graph that is otherwise completely disconnected. We count the number of Moebius parameters as a function of S, and show that the full law always has one more parameter than the observed law. One can then imagine placing these colluding paths in a larger graph with arbitrary connectivity, and arguing that the full law is still not identified as a consequence of the parameter discrepancy arising from the colluding path alone. That is, if we show that a fully disconnected graph containing a single colluding path is not identified, then any edge super graph (super model) of it is also not identified.

In the following proof, we rely heavily on the following fact. Given a bidirected chain V_{1}\leftrightarrow\cdots\leftrightarrow V_{K} of length K, the number of Moebius parameters required to parameterize the chain is the sum of the natural numbers from 1 to K, i.e., \frac{K(K+1)}{2}. This can be seen from the fact that the corresponding Moebius parameters are given by the following series (a short counting sketch follows the list):

  • q(V_{1}=0),q(V_{1}=0,V_{2}=0),\dots,q(V_{1}=0,\dots,V_{K}=0), corresponding to K parameters.

  • q(V_{2}=0),q(V_{2}=0,V_{3}=0),\dots,q(V_{2}=0,\dots,V_{K}=0), corresponding to K-1 parameters.

  • \dots

  • q(V_{K}=0), corresponding to 1 parameter.
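A short counting sketch of this fact: the intrinsic sets of a bidirected chain are exactly its contiguous sub-chains, so the Moebius parameters are indexed by the intervals [i, j] with 1 <= i <= j <= K.

    def chain_param_count(K):
        # Enumerate the contiguous sub-chains V_i <-> ... <-> V_j.
        intervals = [(i, j) for i in range(1, K + 1) for j in range(i, K + 1)]
        assert len(intervals) == K * (K + 1) // 2
        return len(intervals)

    print([chain_param_count(K) for K in (1, 2, 3, 4)])  # [1, 3, 6, 10]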

In counting the number of parameters for a graph that is disconnected aside from the colluding path, we can also exclude the singleton (disconnected) vertices from the counting argument, since they account for the same number of parameters in both the full law and the observed law: in the full law these parameters are q(R_{s}=0) or q(X^{(1)}_{s}=0), and the corresponding parameters in the observed law are q(R_{s}=0) or q(X_{s}=0\mid R_{s}=1). The Moebius parameter counts for each of the colluding paths in Figs. 6(a-d) and their corresponding latent projections in Figs. 6(e, f) are as follows.

Figures 6(a), 6(b), and 6(e)

  1. Number of Moebius parameters in Fig. 6(a): \frac{(S+2)(S+3)}{2}

    • A single bidirected chain X_{i}^{(1)}\leftrightarrow\cdots\leftrightarrow R_{i} of length S+2, i.e., (S+2)(S+3)/2 parameters.

  2. Number of Moebius parameters in Fig. 6(b): \frac{(S+2)(S+3)}{2}

    • q(X^{(1)}_{i}=0), i.e., 1 parameter,

    • A bidirected chain V_{2}\leftrightarrow\cdots\leftrightarrow R_{i} of length S, i.e., S(S+1)/2 parameters,

    • Intrinsic sets involving V_{1}, i.e., q(V_{1}=0\mid X_{i}^{(1)}),q(V_{1}=0,V_{2}=0\mid X_{i}^{(1)}),\dots,q(V_{1}=0,\dots,R_{i}=0\mid X_{i}^{(1)}), corresponding to 2(S+1) parameters.

  3. Number of Moebius parameters in Fig. 6(e): \frac{(S+2)(S+3)}{2}-1

    • Note that even though each proxy X_{s} that may appear in the bidirected chain has a directed edge from R_{s} pointing into it, the corresponding intrinsic head-tail pairs involving both variables always have R_{s}=1. Hence, we may ignore these deterministic edges and count the parameters as if it were a bidirected chain V_{1}^{*}\leftrightarrow\cdots\leftrightarrow R_{i} of length S+1, corresponding to (S+1)(S+2)/2 parameters,

    • When enumerating the intrinsic sets involving X_{i}, we note that \{X_{i},V_{1}^{*},\dots,V_{S}^{*}\} is not intrinsic, as R_{i} is not fixable (due to the bidirected path between R_{i} and X_{i} and the edge R_{i}\rightarrow X_{i}). Thus, as there is one less intrinsic set involving X_{i}, the number of parameters required to parameterize all intrinsic sets involving X_{i} is one fewer, i.e., S+1 (instead of S+2) parameters.

Figures 6(c), 6(d), and 6(f)

  1. Number of Moebius parameters in Fig. 6(c): \frac{(S+2)(S+3)}{2}

    • q(R_{i}=0), i.e., 1 parameter,

    • A bidirected chain X_{i}^{(1)}\leftrightarrow\cdots\leftrightarrow V_{S-1} of length S, i.e., S(S+1)/2 parameters,

    • Intrinsic sets involving R_{S}, i.e., q(R_{S}=0\mid R_{i}),q(R_{S}=0,V_{S-1}=0\mid R_{i}),\dots,q(R_{S}=0,V_{S-1}=0,\dots,X_{i}^{(1)}=0\mid R_{i}), corresponding to 2(S+1) parameters.

  2. Number of Moebius parameters in Fig. 6(d): \frac{(S+2)(S+3)}{2}

    • q(X^{(1)}_{i}=0) and q(R_{i}=0), i.e., 2 parameters,

    • A bidirected chain V_{2}\leftrightarrow\cdots\leftrightarrow V_{S-1} of length S-2, i.e., (S-2)(S-1)/2 parameters,

    • Intrinsic sets involving V_{1} but not R_{S}, i.e., q(V_{1}=0\mid X_{i}^{(1)}),q(V_{1}=0,V_{2}=0\mid X_{i}^{(1)}),\dots,q(V_{1}=0,V_{2}=0,\dots,V_{S-1}=0\mid X_{i}^{(1)}), corresponding to 2(S-1) parameters,

    • Intrinsic sets involving R_{S} but not V_{1}, i.e., q(R_{S}=0\mid R_{i}),q(R_{S}=0,V_{S-1}=0\mid R_{i}),\dots,q(R_{S}=0,V_{S-1}=0,\dots,V_{2}=0\mid R_{i}), corresponding to 2(S-1) parameters,

    • The intrinsic set involving both V_{1} and R_{S}, i.e., q(V_{1}=0,V_{2}=0,\dots,R_{S}=0\mid X_{i}^{(1)},R_{i}), corresponding to 4 parameters.

  3. Number of Moebius parameters in Fig. 6(f): \frac{(S+2)(S+3)}{2}-1

    • q(R_{i}=0), i.e., 1 parameter,

    • By the same argument as before, deterministic tails can be ignored. Hence, we have a bidirected chain X_{i}\leftrightarrow\cdots\leftrightarrow V_{S-1} of length S, i.e., S(S+1)/2 parameters,

    • Intrinsic sets involving R_{S}, i.e., q(R_{S}=0\mid R_{i}),q(R_{S}=0,V_{S-1}=0\mid R_{i}),\dots,q(R_{S}=0,V_{S-1}=0,\dots,V_{1}=0\mid R_{i}), corresponding to 2S parameters, together with the special intrinsic set q(R_{S}=0,V_{S-1}=0,\dots,V_{1}=0,X_{i}=0\mid R_{i}=1), which contributes just 1 parameter instead of 2, due to the presence of the proxy X_{i} in the head and the corresponding R_{i} in the tail; this is what gives the observed law one fewer parameter.

In every case, the full law thus requires exactly one more parameter than the upper bound on the number of parameters of the observed law, so the full law is not identified. ∎

Theorem 4  The graphical condition of the absence of colluding paths, put forward in Theorem 3, is sound and complete for the identification of full laws p(R,O,X^{(1)}) that are Markov relative to a missing data ADMG {\mathcal{G}}.

Proof.

Soundness is a direct consequence of Theorem 3, and completeness is a direct consequence of Lemma 3. ∎

References

  • Bhattacharya et al. (2019) Bhattacharya, R., Nabi, R., Shpitser, I., and Robins, J. M. Identification in missing data models represented by directed acyclic graphs. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.
  • Chen (2007) Chen, H. Y. A semiparametric odds ratio model for measuring association. Biometrics, 63:413–421, 2007.
  • Chen et al. (2015) Chen, H. Y., Rader, D. E., and Li, M. Likelihood inferences on semiparametric odds ratio model. Journal of the American Statistical Association, 110(511):1125–1135, 2015.
  • Cheng (1994) Cheng, P. E. Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89(425):81–87, 1994.
  • Chickering (2002) Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
  • Daniel et al. (2012) Daniel, R. M., Kenward, M. G., Cousens, S. N., and De Stavola, B. L. Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3):243–256, 2012.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Drton (2009) Drton, M. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.
  • Evans (2018) Evans, R. J. Margins of discrete Bayesian networks. The Annals of Statistics, 46(6A):2623–2656, 2018.
  • Evans & Richardson (2014) Evans, R. J. and Richardson, T. S. Markovian acyclic directed mixed graphs for discrete data. The Annals of Statistics, pp.  1452–1482, 2014.
  • Fielding et al. (2008) Fielding, S., Fayers, P. M., McDonald, A., McPherson, G., Campbell, M. K., et al. Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health and Quality of Life Outcomes, 6(1):57, 2008.
  • Gain & Shpitser (2018) Gain, A. and Shpitser, I. Structure learning under missing data. In Proceedings of the 9th International Conference on Probabilistic Graphical Models, pp.  121–132, 2018.
  • Lauritzen (1996) Lauritzen, S. L. Graphical Models. Oxford, U.K.: Clarendon, 1996.
  • Malinsky et al. (2019) Malinsky, D., Shpitser, I., and Tchetgen Tchetgen, E. J. Semiparametric inference for non-monotone missing-not-at-random data: the no self-censoring model. arXiv preprint arXiv:1909.01848, 2019.
  • Marra et al. (2017) Marra, G., Radice, R., Bärnighausen, T., Wood, S. N., and McGovern, M. E. A simultaneous equation approach to estimating HIV prevalence with nonignorable missing responses. Journal of the American Statistical Association, 112(518):484–496, 2017.
  • Marston et al. (2010) Marston, L., Carpenter, J. R., Walters, K. R., Morris, R. W., Nazareth, I., and Petersen, I. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiology and Drug Safety, 19(6):618–626, 2010.
  • Martel García (2013) Martel García, F. Definition and diagnosis of problematic attrition in randomized controlled experiments. Working paper. Available at SSRN 2302735, 2013.
  • Mohan & Pearl (2014) Mohan, K. and Pearl, J. Graphical models for recovering probabilistic and causal queries from missing data. In Proceedings of the 28th Conference on Advances in Neural Information Processing Systems, pp.  1520–1528. 2014.
  • Mohan et al. (2013) Mohan, K., Pearl, J., and Tian, J. Graphical models for inference with missing data. In Proceedings of the 27th Conference on Advances in Neural Information Processing Systems, pp.  1277–1285. 2013.
  • Ogarrio et al. (2016) Ogarrio, J. M., Spirtes, P. L., and Ramsey, J. D. A hybrid causal search algorithm for latent variable models. In Proceedings of the 8th International Conference on Probabilistic Graphical Models, pp.  368–379, 2016.
  • Pearl (1988) Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
  • Pearl (2009) Pearl, J. Causality. Cambridge university press, 2009.
  • Ramsey (2015) Ramsey, J. D. Scaling up greedy causal search for continuous variables. arXiv preprint arXiv:1507.07749, 2015.
  • Richardson (2003) Richardson, T. S. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157, 2003.
  • Richardson et al. (2017) Richardson, T. S., Evans, R. J., Robins, J. M., and Shpitser, I. Nested Markov properties for acyclic directed mixed graphs. arXiv:1701.06686v2, 2017. Working paper.
  • Robins & Gill (1997) Robins, J. M. and Gill, R. D. Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine, 16(1):39–56, 1997.
  • Robins et al. (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846–866, 1994.
  • Saadati & Tian (2019) Saadati, M. and Tian, J. Adjustment criteria for recovering causal effects from missing data. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.
  • Sadinle & Reiter (2017) Sadinle, M. and Reiter, J. P. Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika, 104(1):207–220, 2017.
  • Shpitser (2016) Shpitser, I. Consistent estimation of functions of data missing non-monotonically and not at random. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems. 2016.
  • Shpitser et al. (2015) Shpitser, I., Mohan, K., and Pearl, J. Missing data as a causal and probabilistic problem. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pp.  802–811. AUAI Press, 2015.
  • Strobl et al. (2018) Strobl, E. V., Visweswaran, S., and Spirtes, P. L. Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, 6(1):47–62, 2018.
  • Tchetgen Tchetgen et al. (2018) Tchetgen Tchetgen, E. J., Wang, L., and Sun, B. Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica, 28(4):2069–2088, 2018.
  • Thoemmes & Rose (2013) Thoemmes, F. and Rose, N. Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal. Technical report, R-002, Cornell University, 2013.
  • Tian (2017) Tian, J. Recovering probability distributions from missing data. In Proceedings of the Ninth Asian Conference on Machine Learning, 2017.
  • Tian & Pearl (2002) Tian, J. and Pearl, J. A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence, pp.  567–573, 2002.
  • Tsiatis (2006) Tsiatis, A. Semiparametric Theory and Missing Data. Springer-Verlag New York, 1st edition edition, 2006.
  • Tu et al. (2019) Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., and Zhang, K. Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  1762–1770, 2019.
  • Vansteelandt et al. (2007) Vansteelandt, S., Rotnitzky, A., and Robins, J. M. Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. Biometrika, 94(4):841–860, 2007.
  • Verma & Pearl (1990) Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 1990.
  • Zhou et al. (2010) Zhou, Y., Little, R. J. A., and Kalbfleisch, J. D. Block-conditional missing at random models for missing data. Statistical Science, 25(4):517–532, 2010.