Full Law Identification in Graphical Models of Missing Data:
Completeness Results
Abstract
Missing data has the potential to affect analyses conducted in all fields of scientific study, including healthcare, economics, and the social sciences. Several approaches to unbiased inference in the presence of non-ignorable missingness rely on the specification of the target distribution and its missingness process as a probability distribution that factorizes with respect to a directed acyclic graph. In this paper, we address the longstanding question of the characterization of models that are identifiable within this class of missing data distributions. We provide the first completeness result in this field of study – necessary and sufficient graphical conditions under which the full data distribution can be recovered from the observed data distribution. We then simultaneously address issues that may arise due to the presence of both missing data and unmeasured confounding, by extending these graphical conditions and proofs of completeness to settings where some variables are not just missing, but completely unobserved.
1 Introduction
Missing data has the potential to affect analyses conducted in all fields of scientific study, including healthcare, economics, and the social sciences. Strategies to cope with missingness that depends only on the observed data, known as the missing at random (MAR) mechanism, are well-studied (Dempster et al., 1977; Cheng, 1994; Robins et al., 1994; Tsiatis, 2006). However, the setting where missingness depends on covariates that may themselves be missing, known as the missing not at random (MNAR) mechanism, is substantially more difficult and under-studied (Fielding et al., 2008; Marston et al., 2010). MNAR mechanisms are expected to occur quite often in practice, for example, in longitudinal studies with complex patterns of dropout and re-enrollment, or in studies where social stigma may prompt non-response to questions pertaining to drug-use, or sexual activity and orientation, in a way that depends on other imperfectly collected or censored covariates (Robins & Gill, 1997; Vansteelandt et al., 2007; Marra et al., 2017).
Previous work on MNAR models has proceeded by imposing a set of restrictions on the full data distribution (the target distribution and its missingness mechanism) that are sufficient to yield identification of the parameter of interest. While there exist MNAR models whose restrictions cannot be represented graphically (Tchetgen Tchetgen et al., 2018), the restrictions posed in several popular MNAR models such as the permutation model (Robins & Gill, 1997), the block-sequential MAR model (Zhou et al., 2010), the itemwise conditionally independent nonresponse (ICIN) model (Shpitser, 2016; Sadinle & Reiter, 2017), and those in (Daniel et al., 2012; Thoemmes & Rose, 2013; Martel García, 2013; Mohan et al., 2013; Mohan & Pearl, 2014; Saadati & Tian, 2019) are either explicitly graphical or can be interpreted as such.
Despite the popularity of graphical modeling approaches for missing data problems, characterization of the class of missing data distributions identified as functionals of the observed data distribution has remained an open question (Bhattacharya et al., 2019). Several algorithms for the identification of the target distribution have been proposed (Mohan & Pearl, 2014; Shpitser et al., 2015; Tian, 2017; Bhattacharya et al., 2019). We show that even the most general algorithm currently published (Bhattacharya et al., 2019) still retains a significant gap in that there exist target distributions that are identified which the algorithm fails to identify. We then present what is, to our knowledge, the first completeness result for missing data models representable as directed acyclic graphs (DAGs) – a necessary and sufficient graphical condition under which the full data distribution is identified as a function of the observed data distribution. For any given field of study, such a characterization is one of the most powerful results that identification theory can offer, as it comes with the guarantee that if these conditions do not hold, the model is provably not identified.
We further generalize these graphical conditions to settings where some variables are not just missing, but completely unobserved. Such distributions are typically summarized using acyclic directed mixed graphs (ADMGs) (Richardson et al., 2017). We prove, once again, that our graphical criteria are sound and complete for the identification of full laws that are Markov relative to a hidden variable DAG and the resulting summary ADMG. This new result allows us to address two of the most critical issues in practical data analyses simultaneously, those of missingness and unmeasured confounding.
Finally, in the course of proving our results on completeness, we show that the proposed graphical conditions also imply that all missing data models of directed acyclic graphs or acyclic directed mixed graphs that meet these conditions are in fact sub-models of the MNAR models in (Shpitser, 2016; Sadinle & Reiter, 2017). This simple, yet powerful result implies that the joint density of these models may be identified using an odds ratio parameterization that also ensures congenial specification of various components of the likelihood (Chen, 2007; Malinsky et al., 2019). Our results serve as an important precondition for the development of score-based model selection methods for graphical models of missing data, as an alternative to the constraint-based approaches proposed in (Strobl et al., 2018; Gain & Shpitser, 2018; Tu et al., 2019), and directly yield semi-parametric estimators using results in (Malinsky et al., 2019).
2 Preliminaries
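A directed acyclic graph (DAG) $\mathcal{G}(V)$ consists of a set of nodes $V$ connected through directed edges such that there are no directed cycles. We will abbreviate $\mathcal{G}(V)$ as simply $\mathcal{G}$ when the vertex set is clear from the given context. Statistical models of a DAG $\mathcal{G}$ are sets of distributions that factorize as $p(V) = \prod_{V_i \in V} p(V_i \mid \operatorname{pa}_{\mathcal{G}}(V_i))$, where $\operatorname{pa}_{\mathcal{G}}(V_i)$ are the parents of $V_i$ in $\mathcal{G}$. The absence of edges between variables in $\mathcal{G}$, relative to a complete DAG, entails conditional independence facts in $p(V)$. These can be directly read off from the DAG $\mathcal{G}$ by the well-known d-separation criterion (Pearl, 2009). That is, for disjoint sets $X, Y, Z$, the following global Markov property holds: $(X \perp\!\!\!\perp_{\text{d-sep}} Y \mid Z)_{\mathcal{G}} \implies (X \perp\!\!\!\perp Y \mid Z)_{p(V)}$. When the context is clear, we will simply use $X \perp\!\!\!\perp Y \mid Z$ to denote the conditional independence between $X$ and $Y$ given $Z$.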
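As an illustration only, the d-separation criterion can be checked mechanically. The sketch below uses the standard moralization test (restrict to ancestors, marry co-parents, drop directions, then look for a path avoiding the conditioning set); the function names and the dictionary encoding of a DAG are our own assumptions, not notation from the text.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated given zs (disjoint sets).

    parents: dict mapping each vertex to an iterable of its parents.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralize the ancestral subgraph: undirected parent-child links,
    # plus a link between every pair of parents sharing a child.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for a, b in combinations(pa, 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # xs and ys are d-separated iff no undirected path joins them outside zs.
    seen, stack = set(xs), list(xs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for w in nbrs[v] - zs - seen:
            seen.add(w)
            stack.append(w)
    return True

chain = {"B": {"A"}, "C": {"B"}}      # A -> B -> C
collider = {"C": {"A", "B"}}          # A -> C <- B
```

On the chain, `A` and `C` are separated only once `B` is conditioned on; on the collider, `A` and `B` are separated marginally but become connected given `C`.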
In practice, some variables on the DAG may be unmeasured or hidden. In such cases, the distribution $p(V \cup U)$ is Markov relative to a hidden variable DAG $\mathcal{G}(V \cup U)$, where variables in $U$ are unobserved. There may be infinitely many hidden variable DAGs that imply the same set of conditional independences on the observed margin. Hence, it is typical to utilize a single acyclic directed mixed graph (ADMG) $\mathcal{G}(V)$, consisting of directed and bidirected edges, that entails the same set of equality constraints as this infinite class (Evans, 2018). Such an ADMG $\mathcal{G}(V)$ is obtained from a hidden variable DAG $\mathcal{G}(V \cup U)$ via the latent projection operator (Verma & Pearl, 1990) as follows. An edge $A \to B$ exists in $\mathcal{G}(V)$ if there exists a directed path from $A$ to $B$ in $\mathcal{G}(V \cup U)$ with all intermediate vertices in $U$. An edge $A \leftrightarrow B$ exists in $\mathcal{G}(V)$ if there exists a collider-free path (i.e., there are no consecutive edges of the form $\to \circ \leftarrow$) from $A$ to $B$ in $\mathcal{G}(V \cup U)$ with all intermediate vertices in $U$, such that the first edge on the path is an incoming edge into $A$ and the final edge is an incoming edge into $B$.
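The latent projection just described can be sketched in a few lines. This is a minimal illustration under our own encoding (edge lists and a set of latent vertex names are assumptions). It uses the fact that a collider-free path with arrowheads at both endpoints and only latent intermediates must fan out from a single latent vertex.

```python
def latent_projection(edges, latent):
    """Project a hidden variable DAG onto its observed vertices.

    edges: iterable of (parent, child) pairs of the hidden variable DAG.
    latent: set of unobserved vertices.
    Returns (directed, bidirected) edge sets of the resulting ADMG.
    """
    edges = list(edges)
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)

    def observed_reach(start):
        """Observed vertices reachable from `start` along directed paths
        whose intermediate vertices are all latent."""
        found, seen, stack = set(), set(), [start]
        while stack:
            v = stack.pop()
            for c in children.get(v, ()):
                if c in latent:
                    if c not in seen:
                        seen.add(c)
                        stack.append(c)
                else:
                    found.add(c)
        return found

    vertices = {v for e in edges for v in e}
    directed = {(a, b) for a in vertices - set(latent) for b in observed_reach(a)}
    bidirected = set()
    for u in vertices & set(latent):
        reach = sorted(observed_reach(u))
        # Any two observed vertices reached from the same latent share a hidden cause.
        bidirected |= {frozenset((a, b)) for i, a in enumerate(reach) for b in reach[i + 1:]}
    return directed, bidirected

# Hypothetical example: U1 is a hidden common cause of X1 and X3.
hidden_dag = [("X1", "X2"), ("X2", "X3"), ("U1", "X1"), ("U1", "X3")]
directed, bidirected = latent_projection(hidden_dag, {"U1"})
```

Here the projection keeps the chain `X1 -> X2 -> X3` and replaces the hidden common cause with a single bidirected edge between `X1` and `X3`.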
Given a distribution $p(V \cup U)$ that is Markov relative to a hidden variable DAG $\mathcal{G}(V \cup U)$, conditional independence facts pertaining to the observed margin $p(V)$ can be read off from the ADMG $\mathcal{G}(V)$ by a simple analogue of the d-separation criterion, known as m-separation (Richardson, 2003), that generalizes the notion of a collider to include mixed edges of the form $\to \circ \leftrightarrow$, $\leftrightarrow \circ \leftarrow$, and $\leftrightarrow \circ \leftrightarrow$.
3 Missing Data Models
A missing data model is a set of distributions defined over a set of random variables $\{O, X^{(1)}, R, X\}$, where $O$ denotes the set of variables that are always observed, $X^{(1)}$ denotes the set of variables that are potentially missing, $R$ denotes the set of missingness indicators of the variables in $X^{(1)}$, and $X$ denotes the set of the observed proxies of the variables in $X^{(1)}$. By definition, missingness indicators are binary random variables; however, the state spaces of variables in $X^{(1)}$ and $O$ are unrestricted. Given $X_i^{(1)} \in X^{(1)}$ and its corresponding missingness indicator $R_i \in R$, the observed proxy $X_i$ is defined as $X_i \equiv X_i^{(1)}$ if $R_i = 1$, and $X_i = \,?$ if $R_i = 0$. Hence, $p(X \mid R, X^{(1)})$ is deterministically defined. We call the non-deterministic part of a missing data distribution, i.e., $p(O, X^{(1)}, R)$, the full law, and partition it into two pieces: the target law $p(O, X^{(1)})$ and the missingness mechanism $p(R \mid X^{(1)}, O)$. The censored version of the full law, $p(O, R, X)$, that the analyst actually has access to is known as the observed data distribution.
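The deterministic relationship between a potentially missing variable, its indicator, and its observed proxy can be written down directly. The following is a small sketch; the `MISSING` marker and the function name are illustrative choices of ours.

```python
MISSING = "?"  # placeholder value recorded when an entry is not observed

def observed_proxies(x_full, r):
    """Map full-data values and missingness indicators to observed proxies.

    x_full: list of full-data values, one per potentially missing variable.
    r: list of 0/1 indicators, where 1 means "observed".
    """
    if len(x_full) != len(r):
        raise ValueError("x_full and r must have the same length")
    return [x if r_i == 1 else MISSING for x, r_i in zip(x_full, r)]

row = observed_proxies([3.2, "smoker", 0], [1, 0, 1])
```

In this example the second entry is masked because its indicator is 0, while the other two proxies coincide with the full-data values.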
Following the convention in (Mohan et al., 2013), let $\mathcal{G}(V)$ be a missing data DAG, where $V = \{O \cup X^{(1)} \cup R \cup X\}$. In addition to acyclicity, edges of a missing data DAG are subject to other restrictions: outgoing edges from variables in $R$ cannot point to variables in $\{X^{(1)}, O\}$; each $X_i \in X$ has only two parents in $\mathcal{G}$, i.e., $R_i$ and $X_i^{(1)}$ (these edges represent the deterministic function above that defines $X_i$, and are shown in gray in all the figures below); and there are no outgoing edges from $X_i$ (i.e., the proxy $X_i$ does not cause any variable on the DAG, although the corresponding full data variable $X_i^{(1)}$ may cause other variables). A missing data model associated with a missing data DAG $\mathcal{G}$ is the set of distributions $p(O, X^{(1)}, R, X)$ that factorizes as
$$\prod_{V_i \in O \cup X^{(1)} \cup R} p(V_i \mid \operatorname{pa}_{\mathcal{G}}(V_i)) \prod_{X_i \in X} p(X_i \mid X_i^{(1)}, R_i).$$
By standard results on DAG models, conditional independences in $p(X^{(1)}, O, R)$ can still be read off from $\mathcal{G}$ by the d-separation criterion (Pearl, 2009). For convenience, we will drop the deterministic terms of the form $p(X_i \mid X_i^{(1)}, R_i)$ from the identification analyses in the following sections, since these terms are always identified by construction.
As an extension, we also consider a hidden variable DAG $\mathcal{G}(V \cup U)$, where $V = \{O, X^{(1)}, R, X\}$ and variables in $U$ are unobserved, to encode missing data models in the presence of unmeasured confounders. In such cases, the full law would obey the nested Markov factorization (Richardson et al., 2017) with respect to a missing data ADMG $\mathcal{G}(V)$, obtained by applying the latent projection operator (Verma & Pearl, 1990) to the hidden variable DAG $\mathcal{G}(V \cup U)$. As a result of marginalization of the latents $U$, there might exist bidirected edges (encoding the hidden common causes) between variables in $V$ (bidirected edges are shown in red in all the figures below). It is straightforward to see that a missing data ADMG obtained via projection of a hidden variable missing data DAG follows the exact same restrictions as stated in the previous paragraph (i.e., no directed cycles, $\operatorname{pa}_{\mathcal{G}}(X_i) = \{X_i^{(1)}, R_i\}$, every $X_i$ is childless, and there are no outgoing edges from $R_i$ to any variables in $\{X^{(1)}, O\}$).
3.1 Identification in Missing Data Models
The goal of non-parametric identification in missing data models is twofold: identification of the target law $p(O, X^{(1)})$ or functions of it $f(p(O, X^{(1)}))$, and identification of the full law $p(O, X^{(1)}, R)$, in terms of the observed data distribution $p(O, R, X)$.
A compelling reason to study the problem of identification of the full law in and of itself is that many popular methods for model selection or causal discovery rely on the specification of a well-defined and congenial joint distribution (Chickering, 2002; Ramsey, 2015; Ogarrio et al., 2016). A complete theory of the characterization of missing data full laws that are identified opens up the possibility of adapting such methods to settings involving non-ignorable missingness, in order to learn not only substantive relationships between variables of interest in the target distribution, but also the processes that drive their missingness. This is in contrast to previous approaches to model selection under missing data that are restricted to submodels of a single fixed identified model (Strobl et al., 2018; Gain & Shpitser, 2018; Tu et al., 2019). Such an assumption may be impractical in complex healthcare settings, for example, where discovering the factors that lead to missingness or study-dropout may be just as important as discovering substantive relations in the underlying data.
Though the focus of this paper is on identification of the full law of missing data models that can be represented by a DAG (or a hidden variable DAG), some of our results naturally extend to identification of the target law (and functionals therein), due to the fact that the target law can be derived from the full law as $\sum_{R} p(O, X^{(1)}, R)$.
Remark 1.
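By the chain rule of probability, the target law $p(O, X^{(1)})$ is identified if and only if $p(R = 1 \mid O, X^{(1)})$ is identified. The identifying functional is given by
$$p(O, X^{(1)}) = \frac{p(O, X^{(1)}, R = 1)}{p(R = 1 \mid O, X^{(1)})}$$
(the numerator is a function of observed data by noting that $X^{(1)} = X$, and is observed when $R = 1$).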
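To make the functional in Remark 1 concrete, the sketch below works through a toy model of our own choosing: binary $O$, $X^{(1)}$, and $R$, with missingness depending only on the always-observed $O$, so that the propensity score is trivially identified. All parameter values are arbitrary assumptions; the point is only that the target law is recovered from the observed data law alone.

```python
from itertools import product

# Hypothetical full law: O -> X^(1), O -> R (so R is independent of X^(1) given O).
p_o = {0: 0.6, 1: 0.4}            # p(O = o)
p_x_given_o = {0: 0.3, 1: 0.8}    # p(X^(1) = 1 | O = o)
p_r_given_o = {0: 0.9, 1: 0.5}    # p(R = 1 | O = o)

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable equals v."""
    return p1 if v == 1 else 1.0 - p1

full_law = {(o, x, r): p_o[o] * bern(p_x_given_o[o], x) * bern(p_r_given_o[o], r)
            for o, x, r in product((0, 1), repeat=3)}

# Observed data law: the value of X is masked (None) whenever R = 0.
observed = {}
for (o, x, r), pr in full_law.items():
    key = (o, x if r == 1 else None, r)
    observed[key] = observed.get(key, 0.0) + pr

def propensity(o):
    """p(R = 1 | O = o), computed from the observed data law only."""
    p_o_r1 = sum(pr for (oo, _, r), pr in observed.items() if oo == o and r == 1)
    p_o_any = sum(pr for (oo, _, _), pr in observed.items() if oo == o)
    return p_o_r1 / p_o_any

# Remark 1: p(O, X^(1)) = p(O, X, R = 1) / p(R = 1 | O, X^(1)).
target_law = {(o, x): observed[(o, x, 1)] / propensity(o)
              for o, x in product((0, 1), repeat=2)}
```

The recovered `target_law` matches the product of `p_o` and `p_x_given_o` used to build the full law, even though it is computed without access to the masked entries.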
Remark 2.
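The full law $p(O, X^{(1)}, R)$ is identified if and only if $p(R \mid O, X^{(1)})$ is identified. According to Remark 1, the identifying functional is given by
$$p(O, X^{(1)}, R) = \frac{p(O, X^{(1)}, R = 1)}{p(R = 1 \mid O, X^{(1)})} \times p(R \mid O, X^{(1)}).$$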
The rest of the paper is organized as follows. In Section 4, we explain, through examples, why none of the existing identification algorithms put forward in the literature are complete in the sense that there exist missing data DAGs whose full law and target law are identified but these algorithms fail to derive an identifying functional for them. In Section 5, we provide a complete algorithm for full law identification. In Section 6, we further extend our identification results to models where unmeasured confounders are present. We defer all proofs to the Appendix.
4 Incompleteness of Current Methods
In this section, we show that even the most general methods proposed for identification in missing data DAG models remain incomplete. In other words, we show that there exist identified MNAR models that are representable by DAGs, however all existing algorithms fail to identify both the full and target law for these models. For brevity, we use the procedure proposed in (Bhattacharya et al., 2019) as an exemplar. However, as it is the most general procedure in the current literature, failure to identify via this procedure would imply failure by all other existing ones. For each example, we also provide alternate arguments for identification that eventually lead to the general theory in Sections 5 and 6.
The algorithm proposed by (Bhattacharya et al., 2019) proceeds as follows. For each missingness indicator $R_i$, the algorithm tries to identify the distribution $p(R_i \mid \operatorname{pa}_{\mathcal{G}}(R_i))$, sometimes referred to as the propensity score of $R_i$. It does so by checking if $R_i$ is conditionally independent (given its parents) of the corresponding missingness indicators of its parents that are potentially missing. If this is the case, the propensity score is identified by a simple conditional independence argument (d-separation). Otherwise, the algorithm checks if this condition holds in post-fixing distributions obtained through recursive application of the fixing operator, which roughly corresponds to inverse weighting the current distribution by the propensity score of the variable being fixed (Richardson et al., 2017) (a more formal definition is provided in the Appendix). If the algorithm succeeds in identifying the propensity score for each missingness indicator in this manner, then it succeeds in identifying the target law as Remark 1 suggests, since $p(R = 1 \mid O, X^{(1)}) = \prod_{R_i \in R} p(R_i \mid \operatorname{pa}_{\mathcal{G}}(R_i))\big|_{R = 1}$. Additionally, if it is the case that in the course of execution, the propensity score for each missingness indicator is also identified at all levels of its parents, then the algorithm also succeeds in identifying the full law (due to Remark 2).
In order to ground our theory in reality, we now describe a series of hypotheses that may arise during the course of a data analysis that seeks to study the link between the effects of smoking on bronchitis, through the deposition of tar or other particulate matter in the lungs. For each hypothesis, we ask if the investigator is able to evaluate the goodness of fit of the proposed model, typically expressed as a function of the full data likelihood, as a function of just the observed data. In other words, we ask if the full law is identified as a function of the observed data distribution. If it is, this enables the analyst to compare and contrast different hypotheses and select one that fits the data the best.
Setup. To start, the investigator consults a large observational database containing the smoking habits, measurements of particulate matter in the lungs, and results of diagnostic tests for bronchitis on individuals across a city. She notices, however, that several entries in the database are missing. This leads her to propose a model like the one shown in Fig. 1(a), where $X_1^{(1)}$, $X_2^{(1)}$, and $X_3^{(1)}$ correspond to smoking, particulate matter, and bronchitis respectively, and $R_1$, $R_2$, and $R_3$ are the corresponding missingness indicators.
For the target distribution, she proposes a simple mechanism in which smoking leads to increased deposits of tar in the lungs, which in turn leads to bronchitis ($X_1^{(1)} \to X_2^{(1)} \to X_3^{(1)}$). For the missingness process, she proposes that a suspected diagnosis of bronchitis is likely to lead to an inquiry about the smoking status of the patient ($X_3^{(1)} \to R_1$), smokers are more likely to get tested for tar and bronchitis ($R_2 \leftarrow X_1^{(1)} \to R_3$), and ordering a diagnostic test for bronchitis increases the likelihood of ordering a test for tar, which in turn increases the likelihood of inquiry about smoking status ($R_3 \to R_2 \to R_1$).
We now show that for this preliminary hypothesis, if the investigator were to utilize the procedure described in (Bhattacharya et al., 2019) she may conclude that it is not possible to identify the full law. We go on to show that such a conclusion would be incorrect, as the full law is, in fact, identified, and provide an alternative means of identification.
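Scenario 1. Consider the missing data DAG model in Fig. 1(a), excluding the edge $X_2^{(1)} \to R_3$, corresponding to the first hypothesis put forth by the investigator. The propensity score for $R_1$ can be obtained by simple conditioning, noting that $R_1 \perp\!\!\!\perp R_3 \mid X_3^{(1)}, R_2$ by d-separation. Hence,
$$p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1)) = p(R_1 \mid X_3^{(1)}, R_2) = p(R_1 \mid X_3, R_2, R_3 = 1).$$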
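Conditioning is not sufficient in order to identify the propensity score for $R_2$, as $R_2 \not\perp\!\!\!\perp R_1 \mid X_1^{(1)}, R_3$. However, it can be shown that in the distribution $q(X_1^{(1)}, X_2^{(1)}, X_3^{(1)}, R_2, R_3 \mid R_1 = 1) \equiv p(X_1^{(1)}, X_2^{(1)}, X_3^{(1)}, R_2, R_3, R_1 = 1) / p(R_1 = 1 \mid \operatorname{pa}_{\mathcal{G}}(R_1))$, $R_2 \perp\!\!\!\perp R_1 \mid X_1^{(1)}, R_3$, since this distribution is Markov relative to the graph in Fig. 1(b) (see the Appendix for details). We use the notation $q(\cdot \mid \cdot)$ to indicate that while $q$ acts in most respects as a conditional distribution, it was not obtained from $p$ by a conditioning operation. This implies that the propensity score for $R_2$ (evaluated at $R_3 = 1$) is identified as
$$p(R_2 \mid X_1^{(1)}, R_3 = 1) = q(R_2 \mid X_1, R_3 = 1, R_1 = 1).$$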
Finally, we show that the algorithm in (Bhattacharya et al., 2019) is unable to identify the propensity score for $R_3$. We first note that $R_3 \not\perp\!\!\!\perp R_1 \mid X_1^{(1)}$ in the original problem. Furthermore, as shown in Fig. 1(b), fixing $R_1$ leads to a distribution where $R_3$ is necessarily selected on, as the propensity score $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1))$ is identified by restricting the data to cases where $R_3 = 1$. It is thus impossible to identify the propensity score for $R_3$ in this post-fixing distribution. The same holds if we try to fix $R_2$, as identification of the propensity score for $R_2$ required us to first fix $R_1$, which we have seen introduces selection bias on $R_3$.
Hence, the procedure in (Bhattacharya et al., 2019) fails to identify both the target law and the full law for the problem posed in Fig. 1(a). However, both these distributions are, in fact, identified as we now demonstrate.
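A key observation is that even though the identification of $p(R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_3))$ might not be so straightforward, $p(R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_3), R_2)$ is indeed identified, because by d-separation $R_3 \perp\!\!\!\perp R_1 \mid X_1^{(1)}, R_2$, and therefore $p(R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_3), R_2) = p(R_3 \mid X_1, R_2, R_1 = 1)$. Given that $p(R_3 \mid \operatorname{pa}_{\mathcal{G}}(R_3), R_2)$ and $p(R_2 \mid X_1^{(1)}, R_3 = 1)$ are both identified (the latter is obtained through $q(\cdot \mid R_1 = 1)$ as described earlier), we consider exploiting an odds ratio parameterization of the joint density $p(R_2, R_3 \mid X_1^{(1)})$. As we show below, such a parameterization immediately implies the identifiability of this density and, consequently, the individual propensity scores for $R_2$ and $R_3$.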
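Given disjoint sets of variables $A, B, C$ and reference values $A = a_0, B = b_0$, the odds ratio parameterization of $p(A, B \mid C)$, given in (Chen, 2007), is as follows:
$$p(A, B \mid C) = \frac{1}{Z} \times p(A \mid B = b_0, C) \times p(B \mid A = a_0, C) \times \operatorname{OR}(A, B \mid C), \tag{1}$$
where
$$\operatorname{OR}(A = a, B = b \mid C) = \frac{p(A = a \mid B = b, C)}{p(A = a_0 \mid B = b, C)} \times \frac{p(A = a_0 \mid B = b_0, C)}{p(A = a \mid B = b_0, C)},$$
and $Z$ is the normalizing term and is equal to
$$Z = \sum_{A, B} p(A \mid B = b_0, C) \times p(B \mid A = a_0, C) \times \operatorname{OR}(A, B \mid C).$$
Note that $\operatorname{OR}(A, B \mid C) = \operatorname{OR}(B, A \mid C)$, i.e., the odds ratio is symmetric; see (Chen, 2007).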
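A convenient choice of reference value for the odds ratio in missing data problems is the value $R_i = 1$. Given this reference level and the parameterization of the joint in Eq. (1), we know that
$$p(R_2, R_3 \mid X_1^{(1)}) = \frac{1}{Z} \times p(R_2 \mid R_3 = 1, X_1^{(1)}) \times p(R_3 \mid R_2 = 1, X_1^{(1)}) \times \operatorname{OR}(R_2, R_3 \mid X_1^{(1)}),$$
where $Z$ is the normalizing term, and
$$\operatorname{OR}(R_2 = r_2, R_3 = r_3 \mid X_1^{(1)}) = \frac{p(R_3 = r_3 \mid R_2 = r_2, X_1^{(1)})}{p(R_3 = 1 \mid R_2 = r_2, X_1^{(1)})} \times \frac{p(R_3 = 1 \mid R_2 = 1, X_1^{(1)})}{p(R_3 = r_3 \mid R_2 = 1, X_1^{(1)})}.$$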
The conditional pieces $p(R_2 \mid R_3 = 1, X_1^{(1)})$ and $p(R_3 \mid R_2 = 1, X_1^{(1)})$ are already shown to be functions of the observed data. To see that the odds ratio is also a function of observables, recall that $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}$. This means that $R_1 = 1$ can be introduced into each individual piece of the odds ratio functional above, so that the entire functional depends only on observed quantities. Since all pieces of the odds ratio parameterization are identified as functions of the observed data, we can conclude that $p(R_2, R_3 \mid X_1^{(1)})$ is identified, as the normalizing term is always identified if all the conditional pieces and the odds ratio are identified. This result, in addition to the fact that $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1))$ is identified as before, leads us to the identification of both the target law and the full law, as the missingness process $p(R \mid X^{(1)})$ is identified.
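The Scenario 1 argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure's reference implementation: all variables are taken to be binary, the parameters are drawn at random, and the edge set (smoking, tar, bronchitis as `x1`, `x2`, `x3`, with indicators `r1`, `r2`, `r3`) is our reading of the verbal description of the hypothesis. It builds the observed data law by masking, then recovers the joint law of the second and third indicators given smoking status from observed quantities only, via the odds ratio parameterization.

```python
from itertools import product
import random

random.seed(0)

def draw():
    """Arbitrary Bernoulli parameter, bounded away from 0 and 1 (positivity)."""
    return random.uniform(0.2, 0.8)

def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

# Assumed Scenario 1 DAG: X1 -> X2 -> X3, X3 -> R1, X1 -> R2, X1 -> R3, R3 -> R2 -> R1.
p_x1 = draw()
p_x2 = {x1: draw() for x1 in (0, 1)}
p_x3 = {x2: draw() for x2 in (0, 1)}
p_r3 = {x1: draw() for x1 in (0, 1)}
p_r2 = {(x1, r3): draw() for x1 in (0, 1) for r3 in (0, 1)}
p_r1 = {(x3, r2): draw() for x3 in (0, 1) for r2 in (0, 1)}

# Observed data law: X_i is masked (None) whenever R_i = 0.
NAMES = ("x1", "x2", "x3", "r1", "r2", "r3")
observed = {}
for x1, x2, x3, r1, r2, r3 in product((0, 1), repeat=6):
    pr = (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x2], x3) * bern(p_r3[x1], r3)
          * bern(p_r2[(x1, r3)], r2) * bern(p_r1[(x3, r2)], r1))
    key = (x1 if r1 else None, x2 if r2 else None, x3 if r3 else None, r1, r2, r3)
    observed[key] = observed.get(key, 0.0) + pr

def prob(**fixed):
    """Observed-data probability of the event described by keyword constraints."""
    return sum(pr for key, pr in observed.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def p_r3_given(r3, r2, x1):
    """p(R3 | R2, X1^(1)), identified by additionally conditioning on R1 = 1."""
    return prob(x1=x1, r1=1, r2=r2, r3=r3) / prob(x1=x1, r1=1, r2=r2)

def q_joint(x1, x3, r2):
    """Kernel after fixing R1: p(X1, X3, R2, R3=1, R1=1) / p(R1=1 | X3, R2, R3=1)."""
    weight = prob(x3=x3, r1=1, r2=r2, r3=1) / prob(x3=x3, r2=r2, r3=1)
    return prob(x1=x1, x3=x3, r1=1, r2=r2, r3=1) / weight

def p_r2_given(r2, x1):
    """p(R2 | R3 = 1, X1^(1)), read off the post-fixing kernel."""
    num = sum(q_joint(x1, x3, r2) for x3 in (0, 1))
    den = sum(q_joint(x1, x3, v) for x3 in (0, 1) for v in (0, 1))
    return num / den

def odds_ratio(r2, r3, x1):
    return (p_r3_given(r3, r2, x1) / p_r3_given(1, r2, x1)
            * p_r3_given(1, 1, x1) / p_r3_given(r3, 1, x1))

def identified_joint(x1):
    """p(R2, R3 | X1^(1) = x1), rebuilt from observed-data quantities only."""
    unnorm = {(r2, r3): p_r2_given(r2, x1) * p_r3_given(r3, 1, x1) * odds_ratio(r2, r3, x1)
              for r2 in (0, 1) for r3 in (0, 1)}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

max_error = max(
    abs(identified_joint(x1)[(r2, r3)] - bern(p_r3[x1], r3) * bern(p_r2[(x1, r3)], r2))
    for x1 in (0, 1) for r2 in (0, 1) for r3 in (0, 1))
```

Under these assumptions, `max_error` compares the reconstruction against the true conditional law used to generate the data and should vanish up to floating-point error.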
Scenario 2. Suppose the investigator is interested in testing an alternate hypothesis to see whether detecting high levels of particulate matter in the lungs also serves as an indicator to physicians that a diagnostic test for bronchitis should be ordered. This corresponds to the missing data DAG model in Fig. 1(a), including the edge $X_2^{(1)} \to R_3$. Since this is a strict supermodel of the previous example, the procedure in (Bhattacharya et al., 2019) still fails to identify the target and full laws in a similar manner as before.
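However, it is still the case that both the target and full laws are identified. The justification for why the odds ratio parameterization of the joint density $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is identified in this scenario is more subtle. We have
$$p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}) = \frac{1}{Z} \times p(R_2 \mid R_3 = 1, X_1^{(1)}, X_2^{(1)}) \times p(R_3 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)}) \times \operatorname{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}).$$
Note that $R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$, and $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$. Therefore, $p(R_2 \mid R_3 = 1, X_1^{(1)}, X_2^{(1)}) = p(R_2 \mid R_3 = 1, X_1^{(1)})$ is identified the same way as described in Scenario 1, and $p(R_3 \mid R_2 = 1, X_1^{(1)}, X_2^{(1)}) = p(R_3 \mid R_2 = 1, R_1 = 1, X_1, X_2)$ is a function of the observed data and hence is identified. Now the identification of the joint density $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ boils down to identifiability of the odds ratio term. By symmetry, we can express the odds ratio in two different ways:
$$\operatorname{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)}) = \frac{p(R_2 \mid R_3, X_1^{(1)})}{p(R_2 = 1 \mid R_3, X_1^{(1)})} \times \frac{p(R_2 = 1 \mid R_3 = 1, X_1^{(1)})}{p(R_2 \mid R_3 = 1, X_1^{(1)})} = \frac{p(R_3 \mid R_2, R_1 = 1, X_1, X_2^{(1)})}{p(R_3 = 1 \mid R_2, R_1 = 1, X_1, X_2^{(1)})} \times \frac{p(R_3 = 1 \mid R_2 = 1, R_1 = 1, X_1, X_2^{(1)})}{p(R_3 \mid R_2 = 1, R_1 = 1, X_1, X_2^{(1)})}.$$
The first equality holds by d-separation ($R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$). This implies that $\operatorname{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is not a function of $X_2^{(1)}$. Let us denote this functional by $f(R_2, R_3, X_1^{(1)})$. On the other hand, we can plug in $R_1 = 1$ to pieces in the second equality since $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$ (by d-separation). This implies that $\operatorname{OR}(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is a function of $X_1^{(1)}$ only through its observed values (i.e., $X_1$). Let us denote this functional by $g(R_2, R_3, X_1, X_2^{(1)})$. Since the odds ratio is symmetric (by definition), it must be the case that $f(R_2, R_3, X_1^{(1)}) = g(R_2, R_3, X_1, X_2^{(1)})$, from which we conclude that $g$ cannot be a function of $X_2^{(1)}$, as the left hand side of the equation does not depend on $X_2^{(1)}$. This renders $g$ a function of only observed quantities, i.e., $g(R_2, R_3, X_1)$. This leads to the conclusion that $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ is identified and, consequently, the missingness process $p(R \mid X^{(1)})$ in Fig. 1(a) is identified. According to Remarks 1 and 2, both the target and full laws are identified.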
Adding any directed edge to Fig. 1(a) (including the dashed edge) allowed by missing data DAGs results in either a self-censoring edge ($X_i^{(1)} \to R_i$) or a special kind of collider structure called the colluder ($X_j^{(1)} \to R_i \leftarrow R_j$) in (Bhattacharya et al., 2019). We discuss in detail the link between identification of missing data models of a DAG and the absence of these structures in Section 5.
Scenario 3. So far, the investigator has conducted preliminary analyses of the problem while ignoring the issue of unmeasured confounding. In order to address this issue, she first posits an unmeasured confounder $U_1$ corresponding to genotypic traits that may predispose certain individuals to both smoke and develop bronchitis. She posits another unmeasured confounder $U_2$ corresponding to the occupation of an individual, which may affect both the deposits of tar found in their lungs (e.g., construction workers may accumulate more tar than accountants due to occupational hazards) as well as limit an individual's access to proper healthcare, leading to the absence of a diagnostic test for bronchitis.
The missing data DAG with unmeasured confounders, corresponding to the aforementioned hypothesis is shown in Fig. 2(a) (excluding the dashed edges). The corresponding missing data ADMG, obtained by latent projection is shown in Fig. 2(b) (excluding the dashed bidirected edge). A procedure to identify the full law of such an MNAR model, that is nested Markov with respect to a missing data ADMG, is absent from the current literature. The question that arises, is whether it is possible to adapt the odds ratio parameterization from the previous scenarios, to this setting.
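We first note that by application of the chain rule of probability and Markov restrictions, the missingness mechanism still factorizes in the same way as in Scenario 2, i.e., $p(R \mid X^{(1)}) = p(R_1 \mid X_3^{(1)}, R_2) \times p(R_2 \mid X_1^{(1)}, R_3) \times p(R_3 \mid X_1^{(1)}, X_2^{(1)})$ (Tian & Pearl, 2002). Despite the addition of the bidirected edges $X_1^{(1)} \leftrightarrow X_3^{(1)}$ and $X_2^{(1)} \leftrightarrow R_3$, corresponding to unmeasured confounding, it is easy to see that the propensity score for $R_1$ is still identified via simple conditioning. That is, $p(R_1 \mid \operatorname{pa}_{\mathcal{G}}(R_1)) = p(R_1 \mid X_3, R_2, R_3 = 1)$, as $R_1 \perp\!\!\!\perp R_3 \mid X_3^{(1)}, R_2$ by m-separation. Furthermore, it can also be shown that the two key conditional independences that were exploited in the odds ratio parameterization of $p(R_2, R_3 \mid X_1^{(1)}, X_2^{(1)})$ still hold in the presence of these additional edges. In particular, $R_2 \perp\!\!\!\perp X_2^{(1)} \mid R_3, X_1^{(1)}$, and $R_3 \perp\!\!\!\perp R_1 \mid R_2, X_1^{(1)}, X_2^{(1)}$ by m-separation. Thus, the same odds ratio parameterization used for identification of the full law in Scenario 2 is also valid for Scenario 3. The full odds ratio parameterization of the MNAR models in Scenarios 2 and 3 is provided in Appendix B.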
Scenario 4. Finally, the investigator notices that a disproportionate number of missing entries for smoking status and diagnosis of bronchitis correspond to individuals from certain neighborhoods in the city. She posits that such missingness may be explained by systematic biases in the healthcare system, where certain ethnic minorities may not be treated with the same level of care. This corresponds to adding a third unmeasured confounder $U_3$, which affects the ordering of a diagnostic test for bronchitis as well as inquiry about smoking habits, as shown in Fig. 2(a) (including the dashed edges). The corresponding missing data ADMG is shown in Fig. 2(b) (including the bidirected dashed edge). Once again, we investigate if the full law is identified in the presence of an additional unmeasured confounder $U_3$ and the corresponding bidirected edge $R_1 \leftrightarrow R_3$.
The missingness mechanism $p(R \mid X^{(1)})$ in Fig. 2(b) (including the dashed edge) no longer follows the same factorization as the one described in Scenarios 2 and 3, due to the presence of a direct connection between $R_1$ and $R_3$. According to (Tian & Pearl, 2002), this factorization is given as $p(R \mid X^{(1)}) = p(R_1 \mid R_2, R_3, X^{(1)}) \times p(R_2 \mid R_3, X_1^{(1)}) \times p(R_3 \mid X_1^{(1)}, X_2^{(1)})$. Unlike the previous scenarios, the propensity score of $R_1$, $p(R_1 \mid R_2, R_3, X^{(1)})$, includes $X_1^{(1)}$ and $R_3$ past the conditioning bar. Thus, the propensity score of $R_1$ seems to be not identified, since there is no clear way of breaking down the dependency between $R_1$ and $X_1^{(1)}$. The problematic structure is the path $X_1^{(1)} \to R_3 \leftrightarrow R_1$, which contains a collider at $R_3$ that opens up when we condition on $R_3$ in the propensity score of $R_1$.
In light of the discussion in previous scenarios, another possibility for identifying $p(R \mid X^{(1)})$ is through analysis of the odds ratio parameterization of the entire missingness mechanism. In Section 5, we provide a description of the general odds ratio parameterization on an arbitrary number of missingness indicators. For brevity, we avoid re-writing the formula here. We simply point out that the first step in identifying the missingness mechanism via the odds ratio parameterization is arguing whether conditional densities of the form $p(R_i \mid R \setminus R_i = 1, X^{(1)})$ are identified, which is true if $R_i \perp\!\!\!\perp X_i^{(1)} \mid R \setminus R_i, X^{(1)} \setminus X_i^{(1)}$.
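Such independencies do not hold in Fig. 2(b) (including the dashed edge) for any of the $R_i$s, since there exist collider paths between every pair $(X_i^{(1)}, R_i)$ that render the two variables dependent when we condition on everything outside $X_i^{(1)}, R_i$ (by m-separation). Examples of such paths are $X_1^{(1)} \to R_3 \leftrightarrow R_1$, $X_2^{(1)} \to R_3 \leftrightarrow R_1 \leftarrow R_2$, and $X_3^{(1)} \to R_1 \leftrightarrow R_3$.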
5 Full Law Identification in DAGs
(Bhattacharya et al., 2019) proved that two graphical structures, namely the self-censoring edge ($X_i^{(1)} \to R_i$) and the colluder ($X_j^{(1)} \to R_i \leftarrow R_j$), prevent the identification of full laws in missing data models of a DAG. In this section, we exploit an odds ratio parameterization of the missing data process to prove that these two structures are, in fact, the only structures that prevent identification, thus yielding a complete characterization of identification for the full law in missing data DAG models.
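We formally introduce the odds ratio parameterization of the missing data process given in (Chen, 2007), as a more general version of the simpler form mentioned earlier in Eq. (1). Assuming we have $K$ missingness indicators, $p(R \mid X^{(1)}, O)$ can be expressed as follows:
$$p(R \mid X^{(1)}, O) = \frac{1}{Z} \times \prod_{k=1}^{K} p(R_k \mid R_{-k} = 1, X^{(1)}, O) \times \prod_{k=2}^{K} \operatorname{OR}(R_k, R_{\prec k} \mid R_{\succ k} = 1, X^{(1)}, O), \tag{2}$$
where $R_{-k} = R \setminus R_k$, $R_{\prec k} = \{R_1, \ldots, R_{k-1}\}$, $R_{\succ k} = \{R_{k+1}, \ldots, R_K\}$,
$$\operatorname{OR}(R_k, R_{\prec k} \mid R_{\succ k} = 1, X^{(1)}, O) = \frac{p(R_k \mid R_{\succ k} = 1, R_{\prec k}, X^{(1)}, O)}{p(R_k = 1 \mid R_{\succ k} = 1, R_{\prec k}, X^{(1)}, O)} \times \frac{p(R_k = 1 \mid R_{-k} = 1, X^{(1)}, O)}{p(R_k \mid R_{-k} = 1, X^{(1)}, O)},$$
and $Z$ is the normalizing term, equal to the sum over all values of $R$ of the product of the conditional and odds ratio terms above.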
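As a sanity check on the general parameterization, the sketch below rebuilds an arbitrary strictly positive joint law over $K$ binary indicators from its conditionals at the reference level 1 and the sequence of odds ratios. The covariates $X^{(1)}, O$ are held fixed and therefore omitted. The product form used here is the one we assume from (Chen, 2007), and the helper names are hypothetical.

```python
from itertools import product
import random

def odds_ratio_reconstruction(p, K):
    """Rebuild a positive joint law over {0,1}^K from odds ratio pieces.

    p: dict mapping each tuple r in {0,1}^K to a strictly positive probability.
    """
    def cond(k, r):
        """p(R_k = r[k] | all other indicators as in r)."""
        flipped = r[:k] + (1 - r[k],) + r[k + 1:]
        return p[r] / (p[r] + p[flipped])

    def set_to_one(r, idx):
        """Copy of r with the coordinates in idx set to the reference value 1."""
        return tuple(1 if i in idx else v for i, v in enumerate(r))

    everything = set(range(K))
    unnorm = {}
    for r in product((0, 1), repeat=K):
        val = 1.0
        for k in range(K):
            # p(R_k | R_{-k} = 1)
            val *= cond(k, set_to_one(r, everything - {k}))
        for k in range(1, K):
            later = set(range(k + 1, K))
            # OR(R_k, earlier indicators | later indicators = 1)
            val *= (cond(k, set_to_one(r, later)) / cond(k, set_to_one(r, later | {k}))
                    * cond(k, set_to_one(r, everything))
                    / cond(k, set_to_one(r, everything - {k})))
        unnorm[r] = val
    z = sum(unnorm.values())
    return {r: v / z for r, v in unnorm.items()}

random.seed(1)
weights = {r: random.uniform(0.1, 1.0) for r in product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {r: w / total for r, w in weights.items()}
rebuilt = odds_ratio_reconstruction(joint, 3)
```

For $K = 2$ the same function reduces to the two-variable form in Eq. (1) with reference values equal to 1.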
Using the odds ratio reparameterization given in Eq. (2), we now show that under a standard positivity assumption, stating that $p(R = 1 \mid X^{(1)}, O) > \delta > 0$ with probability one for some constant $\delta$, the full law of a missing data DAG is identified in the absence of self-censoring edges and colluders. Moreover, if any of these conditions are violated, the full law is no longer identified. We formalize this result below.
Theorem 1.
A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data DAG $\mathcal{G}$ is identified if $\mathcal{G}$ does not contain edges of the form $X_i^{(1)} \to R_i$ (no self-censoring) and structures of the form $X_j^{(1)} \to R_i \leftarrow R_j$ (no colluders), and the stated positivity assumption holds. Moreover, the resulting identifying functional for the missingness mechanism $p(R \mid X^{(1)}, O)$ is given by the odds ratio parameterization provided in Eq. (2), and the identifying functionals for the target law and full law are given by Remarks 1 and 2.
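The graphical condition in Theorem 1 is easy to check mechanically. The sketch below is an illustration with our own encoding (a dictionary of parent sets and a map from each potentially missing variable to its indicator, both assumptions); the example graph is our reading of the Scenario 2 hypothesis.

```python
def full_law_violations(parents, indicator_of):
    """List the structures that block full-law identification in a missing data DAG.

    parents: dict mapping each vertex to the set of its parents.
    indicator_of: dict mapping each potentially missing variable to its indicator.
    """
    violations = []
    for x_j, r_j in indicator_of.items():
        # Self-censoring: a variable points directly into its own indicator.
        if x_j in parents.get(r_j, set()):
            violations.append(("self-censoring", x_j, r_j))
        # Colluder: a variable and its indicator both point into another indicator.
        for r_i in indicator_of.values():
            pa = parents.get(r_i, set())
            if r_i != r_j and x_j in pa and r_j in pa:
                violations.append(("colluder", x_j, r_i, r_j))
    return violations

indicator_of = {"X1": "R1", "X2": "R2", "X3": "R3"}
# Assumed Scenario 2 graph (proxy variables and their deterministic edges omitted).
scenario_2 = {
    "X2": {"X1"}, "X3": {"X2"},
    "R1": {"X3", "R2"}, "R2": {"X1", "R3"}, "R3": {"X1", "X2"},
}
with_extra_edge = dict(scenario_2, R1={"X3", "R2", "X2"})  # adds X2 -> R1
```

An empty list for `scenario_2` agrees with the conclusion of Scenario 2, while the extra edge in `with_extra_edge` creates a colluder at `R1`.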
In what follows, we show that the identification theory that we have proposed for the full law in missing data models of a DAG is sound and complete. Soundness implies that when our procedure succeeds, the model is in fact identified, and the identifying functional is correct. Completeness implies that when our procedure fails, the model is provably not identified (non-parametrically). These two properties allow us to derive a precise boundary for what is and is not identified in the space of missing data models that can be represented by a DAG.
Theorem 2.
The graphical condition of no self-censoring and no colluders, put forward in Theorem 1, is sound and complete for the identification of full laws $p(R, O, X^{(1)})$ that are Markov relative to a missing data DAG $\mathcal{G}$.
We now state an important result that draws a connection between missing data models of a DAG $\mathcal{G}$ that are devoid of self-censoring and colluders, and the itemwise conditionally independent nonresponse (ICIN) model described in (Shpitser, 2016; Sadinle & Reiter, 2017). As a substantive model, the ICIN model implies that no partially observed variable directly determines its own missingness, and is defined by the restrictions that for every pair $X_i^{(1)}, R_i$ it is the case that $X_i^{(1)} \perp\!\!\!\perp R_i \mid R_{-i}, X_{-i}^{(1)}, O$. We utilize this result in the course of proving Theorem 2.
Lemma 1.
A missing data model of a DAG that contains no self-censoring edges and no colluders, is a submodel of the ICIN model.
6 Full Law Identification in the Presence of Unmeasured Confounders
We now generalize identification theory of the full law to scenarios where some variables are not just missing, but completely unobserved, corresponding to the issues faced by the analyst in Scenarios 3 and 4 of Section 4. That is, we shift our focus to the identification of full data laws that are (nested) Markov with respect to a missing data ADMG $\mathcal{G}$.
Previously, we exploited the fact that the absence of colluders and self-censoring edges in a missing data DAG implies a set of conditional independence restrictions of the form $X_i^{(1)} \perp\!\!\!\perp R_i \mid R_{-i}, X_{-i}^{(1)}, O$, for any pair $X_i^{(1)} \in X^{(1)}$ and $R_i \in R$ (see Lemma 1). We now describe necessary and sufficient graphical conditions that must hold in a missing data ADMG to imply this same set of conditional independences. Going forward, we ignore (without loss of generality) the deterministic factors $p(X \mid X^{(1)}, R)$ and the corresponding deterministic edges in $\mathcal{G}$ in the process of defining this graphical criterion.
A colliding path between two vertices $A$ and $B$ is a path on which every non-endpoint node is a collider. We adopt the convention that $A \to B$ and $A \leftrightarrow B$ are trivially collider paths. We say there exists a colluding path between the pair $(X_i^{(1)}, R_i)$ if $X_i^{(1)}$ and $R_i$ are connected through at least one non-deterministic colliding path, i.e., one which does not pass through (using deterministic edges) variables in $X$.
We enumerate all possible colluding paths between a vertex $X_i^{(1)}$ and its corresponding missingness indicator $R_i$ in Fig. 3. Note that both the self-censoring structure and the colluding structure introduced in (Bhattacharya et al., 2019) are special cases of a colluding path. Using the m-separation criterion for ADMGs, it is possible to show that a missing data model of an ADMG $\mathcal{G}$ that contains no colluding paths of the form shown in Fig. 3 is also a submodel of the ICIN model in (Shpitser, 2016; Sadinle & Reiter, 2017).
Lemma 2.
A missing data model of an ADMG that contains no colluding paths is a submodel of the ICIN model.
This directly yields a sound criterion for identification of the full law of missing data models of an ADMG using the odds ratio parameterization as before.
Theorem 3.
A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data ADMG $\mathcal{G}$ is identified if $\mathcal{G}$ does not contain any colluding paths and the stated positivity assumption in Section 5 holds. Moreover, the resulting identifying functional for the missingness mechanism is given by the odds ratio parameterization provided in Eq. 2.
We now address the question as to whether there exist missing data ADMGs which contain colluding paths but whose full laws are nevertheless identified. We show (see Appendix for proofs) that the presence of a single colluding path of any of the forms shown in Fig. 3 results in a missing data ADMG whose full law cannot be identified as a function of the observed data distribution.
Lemma 3.
A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data ADMG $\mathcal{G}$ containing a colluding path between any pair $X_i^{(1)} \in X^{(1)}$ and $R_i \in R$ is not identified.
Revisiting our example in Scenario 4, we note that every pair $(X_i^{(1)}, R_i)$ is connected through at least one colluding path. Therefore, according to Lemma 3, the full law in Fig. 2(a), including the dashed edge, is not identified. It is worth emphasizing that the existence of at least one colluding path between any pair $(X_i^{(1)}, R_i)$ is sufficient to conclude that the full law is not identified.
In what follows, we present a result on the soundness and completeness of our graphical condition that represents a powerful unification of non-parametric identification theory in the presence of non-ignorable missingness and unmeasured confounding. To our knowledge, such a result is the first of its kind. We present the theorem below.
Theorem 4.
The graphical condition of the absence of colluding paths, put forward in Theorem 3, is sound and complete for the identification of full laws that are Markov relative to a missing data ADMG $\mathcal{G}$.
Throughout the paper, we have focused on identification of the full law which, according to Remark 1, directly yields identification of the target law. However, identification of the full law is a sufficient but not necessary condition for identification of the target law. In other words, the target law may still be identified despite the presence of colluding paths. Fig. 4(a) in (Bhattacharya et al., 2019) is an example of such a case, where the full law is not identified due to a colluder structure; however, as the authors argue, the target law remains identified.
7 Conclusion
In this paper, we concluded an important chapter in the non-parametric identification theory of missing data models represented via directed acyclic graphs, possibly in the presence of unmeasured confounders. We provided a simple graphical condition to check if the full law, Markov relative to a (hidden variable) missing data DAG, is identified. We further proved that these criteria are sound and complete. Moreover, we provided an identifying functional for the missingness process, through an odds ratio parameterization that allows for congenial specification of components of the likelihood. Our results serve as an important precondition for the development of score-based model selection methods that consider a broader class of missing data distributions than the ones considered in prior works. An interesting avenue for future work is exploration of the estimation theory of functionals derived from the identified full data law. To conclude, we note that while identification of the full law is sufficient to identify the target law, there exist identified target laws where the corresponding full law is not identified. We leave a complete characterization of target law identification to future work.
Acknowledgements
This project is sponsored in part by the National Science Foundation grant 1939675, the Office of Naval Research grant N00014-18-1-2760, and the Defense Advanced Research Projects Agency under contract HR0011-18-C-0049. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
For clearer presentation of materials in this supplement, we switch to a single-column format. In Appendix A, we provide an overview of the nested Markov model. We summarize the necessary concepts required in order to explain our proof of completeness for identification of the full law in missing data acyclic directed mixed graphs (ADMGs). These concepts draw on the binary parameterization of nested Markov models of an ADMG. In Appendix B, we provide a concrete example of the odds ratio parameterization. In Appendix C, we present proofs that were omitted from the main body of the paper for brevity.
A. Background: Fixing and Nested Markov Models of an ADMG
Given a DAG $\mathcal{G}(V \cup U)$ where $U$ contains variables that are unobserved, the latent projection operator onto the observed margin produces an acyclic directed mixed graph $\mathcal{G}(V)$ that consists of directed and bidirected edges (Verma & Pearl, 1990). The bidirected connected components of an ADMG partition the vertices into distinct sets known as districts. The district membership of a vertex $V_i$ in $\mathcal{G}$ is denoted $\operatorname{dis}_{\mathcal{G}}(V_i)$, and the set of all districts in $\mathcal{G}$ is denoted $\mathcal{D}(\mathcal{G})$.
(Evans, 2018) showed that the nested Markov model (Richardson et al., 2017) of an ADMG $\mathcal{G}(V)$ is a smooth super model with fixed dimension, of the underlying latent variable model, that captures all equality constraints and avoids non-regular asymptotics arising from singularities in the parameter space (Drton, 2009; Evans, 2018). We use this fact in order to justify the use of nested Markov models of a missing data ADMG in order to describe full laws that are Markov relative to a missing data DAG with hidden variables. That is, the nested Markov model of a missing data ADMG $\mathcal{G}(V)$, where $V = \{O, X^{(1)}, R, X\}$, is a smooth super model of the missing data DAG model $\mathcal{G}(V \cup U)$. We also utilize nested Markov models of an ADMG corresponding to projection of the missing data ADMG onto variables that are fully observable. While such a model does not capture all equality constraints in the true observed law, it is still a smooth super model of it, thus providing an upper bound on the model dimension of the observed law.
CADMGs and Kernels
The nested Markov factorization of $p(V)$ relative to an ADMG $\mathcal{G}(V)$ is defined with the use of conditional distributions known as kernels and their associated conditional ADMGs (CADMGs) that are derived from $p(V)$ and $\mathcal{G}(V)$ respectively, via repeated applications of the fixing operator (Richardson et al., 2017). A CADMG $\mathcal{G}(V, W)$ is an ADMG whose nodes can be partitioned into random variables $V$ and fixed variables $W$, with the restriction that only outgoing edges may be adjacent to variables in $W$. A kernel $q_V(V \mid W)$ is a mapping from values in $W$ to normalized densities over $V$, i.e., $\sum_{v \in V} q_V(v \mid w) = 1$ (Lauritzen, 1996). Conditioning and marginalization operations in kernels are defined in the usual way.
Fixing and Fixability
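In Section 4 of the main paper, we provided an informal description of fixing as the operation of inverse-weighting by the propensity score of the variable being fixed; we now formalize this notion. A variable $A \in V$ is said to be fixable if the paths $A \to \cdots \to X$ and $A \leftrightarrow \cdots \leftrightarrow X$ do not both exist for all $X \in V \setminus \{A\}$. Given a CADMG $\mathcal{G}(V, W)$ where $A$ is fixable, the graphical operator of fixing, denoted $\phi_A(\mathcal{G})$, yields a new CADMG $\mathcal{G}(V \setminus A, W \cup A)$ with all incoming edges into $A$ being removed, and $A$ being set to a fixed value $a$. Given a kernel $q_V(V \mid W)$, the corresponding probabilistic operation of fixing, denoted $\phi_A(q_V; \mathcal{G})$, yields a new kernel
$$q_{V \setminus A}(V \setminus A \mid W \cup A) \equiv \frac{q_V(V \mid W)}{q_V(A \mid \operatorname{mb}_{\mathcal{G}}(A), W)},$$
where $\operatorname{mb}_{\mathcal{G}}(A)$ is the Markov blanket of $A$, defined as the bidirected connected component (district) of $A$ (excluding $A$ itself) and the parents of the district of $A$, i.e., $\operatorname{mb}_{\mathcal{G}}(A) \equiv \big(\operatorname{dis}_{\mathcal{G}}(A) \cup \operatorname{pa}_{\mathcal{G}}(\operatorname{dis}_{\mathcal{G}}(A))\big) \setminus \{A\}$. It is easy to check that when $\mathcal{G}$ is a DAG, i.e., there are no bidirected edges, the denominator in the probabilistic operation of fixing reduces to the familiar definition of a simple propensity score.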
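The graphical side of fixing can be illustrated with a few set operations. This is a hedged sketch for an ADMG with no fixed vertices; the helper names (`descendants`, `district`, `is_fixable`, `markov_blanket`) and the edge encoding are hypothetical choices for this example, not part of the paper.

```python
def descendants(v, directed):
    """Vertices reachable from v along directed edges (v itself excluded)."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        for (p, c) in directed:
            if p == u and c not in seen:
                seen.add(c)
                stack.append(c)
    return seen - {v}


def district(v, bidirected):
    """Bidirected connected component containing v (v itself included)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for e in bidirected:
            if u in e:
                for w in e - {u}:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
    return seen


def is_fixable(v, directed, bidirected):
    """v is fixable if no other vertex is both a descendant of v and in v's district."""
    return not (descendants(v, directed) & (district(v, bidirected) - {v}))


def markov_blanket(v, directed, bidirected):
    """District of v together with the parents of that district, excluding v."""
    dis = district(v, bidirected)
    parents = {p for (p, c) in directed if c in dis}
    return (dis | parents) - {v}


# Hypothetical graph: A -> B together with A <-> B.
directed = {("A", "B")}
bidirected = {frozenset({"A", "B"})}
a_fixable = is_fixable("A", directed, bidirected)
b_fixable = is_fixable("B", directed, bidirected)
# In a DAG (no bidirected edges) the Markov blanket reduces to the parents.
dag_blanket = markov_blanket("A", {("C", "A"), ("A", "B")}, set())
```

Here `A` is not fixable because `B` is both its child and its sibling, whereas `B` is fixable; in the DAG example the blanket of `A` is just its parent `C`, matching the propensity score reduction described above.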
The notion of fixability can be extended to a set of variables $S \subseteq V$ as follows. A set $S$ is said to be fixable if elements in $S$ can be ordered into a sequence $\sigma_S = \langle S_1, S_2, \ldots \rangle$ such that $S_1$ is fixable in $\mathcal{G}$, $S_2$ is fixable in $\phi_{S_1}(\mathcal{G})$, and so on. This notion of fixability on sets of variables is essential to the description of the nested Markov model that we present in the following section.
Nested Markov Factorization
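Given a CADMG $\mathcal{G}(V, W)$, a set $S \subseteq V$ is said to be reachable if there exists a valid sequence of fixing operations on vertices $V \setminus S$. Further, $S$ is said to be intrinsic if it is reachable, and forms a single bidirected connected component or district in $\phi_{\sigma_{V \setminus S}}(\mathcal{G})$, i.e., the CADMG obtained upon executing all fixing operations given by a valid fixing sequence $\sigma_{V \setminus S}$.
A distribution $p(V)$ is said to obey the nested Markov factorization relative to an ADMG $\mathcal{G}(V)$ if for every fixable set $S$ and any valid fixing sequence $\sigma_S$,
$$\phi_{\sigma_S}(p(V); \mathcal{G}) = \prod_{D \in \mathcal{D}(\phi_{\sigma_S}(\mathcal{G}))} q_D(D \mid \operatorname{pa}_{\mathcal{G}}(D)),$$
where all kernels appearing in the product above can be constructed by combining kernels corresponding to intrinsic sets, i.e., $\{q_I(I \mid \operatorname{pa}_{\mathcal{G}}(I)) \mid I \text{ is intrinsic in } \mathcal{G}\}$. Such a construction is made possible by the fact that all the sets $D$ quantified in the product are districts in a reachable graph derived from $\mathcal{G}$.
(Richardson et al., 2017) noted that when a distribution $p(V)$ is nested Markov relative to an ADMG $\mathcal{G}$, all valid fixing sequences yield the same CADMG and kernel, so that recursive applications of the fixing operator on a set $S$ can simply be denoted as $\phi_S(\mathcal{G})$ and $\phi_S(q_V; \mathcal{G})$ without explicitly specifying any particular valid order. Thus, the construction of the set of kernels corresponding to intrinsic sets can be characterized as $\{q_I(I \mid \operatorname{pa}_{\mathcal{G}}(I)) \mid I \text{ is intrinsic in } \mathcal{G}\} = \{\phi_{V \setminus I}(p(V); \mathcal{G}) \mid I \text{ is intrinsic in } \mathcal{G}\}$, and the nested Markov factorization can be re-stated more simply as, for every fixable set $S$ we have,
$$\phi_S(p(V); \mathcal{G}) = \prod_{D \in \mathcal{D}(\phi_S(\mathcal{G}))} \phi_{V \setminus D}(p(V); \mathcal{G}).$$
An important result from (Richardson et al., 2017) states that if $p(V \cup U)$ is Markov relative to a DAG $\mathcal{G}(V \cup U)$, then $p(V)$ is nested Markov relative to the ADMG $\mathcal{G}(V)$ obtained by latent projection.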
Binary Parameterization of Nested Markov Models
From the above factorization, it is clear that intrinsic sets given their parents form the atomic units of the nested Markov model. Using this observation, a smooth parameterization of discrete nested Markov models was provided by (Evans & Richardson, 2014). We now provide a short description of how to derive the so-called Moebius parameters of a binary nested Markov model.
For each district $D \in \mathcal{D}(\mathcal{G})$, consider all possible subsets $S \subseteq D$. If $S$ is intrinsic (that is, reachable and bidirected connected in $\phi_{V \setminus S}(\mathcal{G})$), define the head $H$ of the intrinsic set to be all vertices in $S$ that are childless in $\phi_{V \setminus S}(\mathcal{G})$, and the tail $T$ to be all parents of the head in the CADMG $\phi_{V \setminus S}(\mathcal{G})$, excluding the head itself. More formally, $H \equiv \{V_i \in S \mid \operatorname{ch}_{\phi_{V \setminus S}(\mathcal{G})}(V_i) = \emptyset\}$ and $T \equiv \operatorname{pa}_{\phi_{V \setminus S}(\mathcal{G})}(H) \setminus H$. The corresponding set of Moebius parameters for this intrinsic head and tail pair parameterizes the kernel $q_S(H = 0 \mid T)$, i.e., the kernel where all variables outside the intrinsic set $S$ are fixed, and all elements of the head are set to zero given the tail. Note that these parameters are, in general, variationally dependent (in contrast to variationally independent in the case of an ordinary DAG model) as the heads and tails in these parameter sets may overlap. The joint density for any query $p(V = v)$ can be obtained through the Moebius inversion formula; see (Lauritzen, 1996; Evans & Richardson, 2014) for details. For brevity, we will denote $q_S(H = 0 \mid T)$ as simply $q(H = 0 \mid T)$, as it will be clear from the given context what variables are still random in the kernel corresponding to a given intrinsic set.
Binary Parameterization of Missing Data ADMGs
We use the parameterization described in the previous section in order to count the number of parameters required to parameterize the full law of a missing data ADMG and its corresponding observed law. We then use this to reason that if the number of parameters in the full law exceeds those in the observed law, it is impossible to establish a map from the observed law to the full law. This in turn implies that such a full law is not identified.
The binary parameterization of the full law of a missing data ADMG $\mathcal{G}(X^{(1)}, O, R, X)$ is exactly the same as that of an ordinary ADMG, except that the deterministic factors $p(X_i \mid X_i^{(1)}, R_i)$ can be ignored, as $X_i = X_i^{(1)}$ with probability one when $R_i = 1$ and $X_i = ?$ with probability one when $R_i = 0$.
The observed law is parameterized as follows. First, variables in $X^{(1)}$ are treated as completely unobserved, and an observed law ADMG $\mathcal{G}(X, O, R)$ is obtained by applying the latent projection operator to $\mathcal{G}(X^{(1)}, O, R, X)$. The Moebius parameters are then derived in a similar manner as before, with the additional constraint that if $X_i \in X$ appears in the head of a Moebius parameter, and the corresponding missingness indicator $R_i$ appears in the tail, then the kernel must be restricted to cases where $R_i = 1$. This is because when $R_i = 0$, the probability of the head taking on any value, aside from those where $X_i = ?$, is deterministically defined to be $0$.
Note that parameterizing the observed law by treating variables in $X^{(1)}$ as fully unobserved does not quite capture all equality constraints that may be detectable in the observed law, as these variables are, in fact, sometimes observable when their corresponding missingness indicators are set to one. Indeed, a smooth parameterization of the observed law of missing data models that captures all constraints implied by the model is still an open problem. Nevertheless, parameterizing an observed law ADMG, such as the one mentioned earlier, provides an upper bound on the number of parameters required to parameterize the true observed law. This suffices for our purposes, as demonstrating that the upper bound on the number of parameters in the observed law is less than the number of parameters in the full law is sufficient to prove that the full law is not identified.
B. Example: Odds Ratio Parameterization
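To build up a more concrete intuition for Theorems 1 and 3, we provide an example of the odds ratio parameterization for the missing data models used in Scenarios 2 and 3 of the main paper, reproduced here in Figs. 4(a, b). Utilizing the order $R_1, R_2, R_3$ on the missingness indicators, the odds ratio parameterization of the missing data process for both models is as follows.
$$p(R_1, R_2, R_3 \mid X^{(1)}) = \frac{1}{Z} \times \Big(\prod_{k=1}^{3} p(R_k \mid R_{-k} = 1, X^{(1)})\Big) \times \text{OR}(R_1, R_2 \mid R_3 = 1, X^{(1)}) \times \text{OR}(R_3, (R_1, R_2) \mid X^{(1)}) \qquad (3)$$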
We now argue that each piece in Eq. 3 is identified. Note that, in the missing data DAG shown in Fig. 4(a), by d-separation. The same is true for the missing data ADMG in Fig. 4(b) by m-separation. Thus, in both cases, the product over conditional pieces of each given the remaining variables is not a function and is thus a function of observed data. We now show that is not a function of by utilizing the symmetry property of the odds ratio.
Thus, from the first equality, the odds ratio is not a function of as by d-separation in Fig. 4(a) and by m-separation in Fig. 4(b). A symmetric argument holds for and as seen in the second and third equalities. Hence, the odds ratio is only a function of which is observable, as the function is evaluated at
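The symmetry property invoked here can be checked numerically. The sketch below is illustrative only: it assumes two binary variables, reference values of 1, and the usual definition of the odds ratio as a ratio of conditional odds; the function and variable names do not appear in the paper.

```python
import math
import random


def odds_ratio(cond, a, b, a0=1, b0=1):
    """Odds ratio of (a, b) against reference values (a0, b0).

    cond[(u, v)] is read as p(U = u | V = v).
    """
    return (cond[(a, b)] / cond[(a0, b)]) * (cond[(a0, b0)] / cond[(a, b0)])


def random_joint(rng):
    """A strictly positive joint distribution over two binary variables."""
    weights = {(a, b): rng.uniform(0.1, 1.0) for a in (0, 1) for b in (0, 1)}
    total = sum(weights.values())
    return {ab: w / total for ab, w in weights.items()}


def conditionals(joint):
    """Return p(A | B) keyed by (a, b) and p(B | A) keyed by (b, a)."""
    p_a = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
    p_b = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}
    a_given_b = {(a, b): joint[(a, b)] / p_b[b] for (a, b) in joint}
    b_given_a = {(b, a): joint[(a, b)] / p_a[a] for (a, b) in joint}
    return a_given_b, b_given_a


joint = random_joint(random.Random(0))
a_given_b, b_given_a = conditionals(joint)
# The same odds ratio, computed once from p(A | B) and once from p(B | A).
or_from_a_given_b = odds_ratio(a_given_b, 0, 0)
or_from_b_given_a = odds_ratio(b_given_a, 0, 0)
```

Both quantities equal the cross-product ratio of the joint, which is why the odds ratio can be read off either conditional; the argument in the text uses whichever conditional makes the relevant independence visible.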
We now utilize an identity from (Chen et al., 2015) in order to simplify the final term in Eq. 3. That is,
The first two pairwise odds ratio terms are functions of observed data using an analogous argument that draws on the symmetry property of the odds ratio and the conditional independence as before. The final term , is a three-way interaction term on the odds ratio scale and can be expressed in three different ways as follows (Chen et al., 2015),
From the first equality, we note by symmetry of the odds ratio and conditional independence that is not a function of . Similarly, from the second equality, we note that is not a function of . Finally, from the third equality, we note that is not a function of . Therefore, is not a function of and is identified.
C. Proofs
We first prove Lemmas 1 and 2 as we use them in the course of proving Theorems 1 and 3. We start with Lemma 2, as the proof for Lemma 1 simplifies to a special case.
Lemma 2 A missing data model of an ADMG that contains no colluding paths is a submodel of the itemwise conditionally independent nonresponse model described in (Shpitser, 2016; Sadinle & Reiter, 2017).
Proof.
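The complete Markov blanket of a vertex $V_i$ in an ADMG $\mathcal{G}$, denoted $\operatorname{mb}^c_{\mathcal{G}}(V_i)$, is the set of vertices such that $V_i \perp\!\!\!\perp V \setminus \{V_i \cup \operatorname{mb}^c_{\mathcal{G}}(V_i)\} \mid \operatorname{mb}^c_{\mathcal{G}}(V_i)$ (Pearl, 1988; Richardson, 2003). In ADMGs, this set corresponds to the Markov blanket of $V_i$, its children, and the Markov blankets of its children. That is,
$$\operatorname{mb}^c_{\mathcal{G}}(V_i) \equiv \Big(\operatorname{mb}_{\mathcal{G}}(V_i) \cup \bigcup_{V_j \in \operatorname{ch}_{\mathcal{G}}(V_i)} \big(\{V_j\} \cup \operatorname{mb}_{\mathcal{G}}(V_j)\big)\Big) \setminus \{V_i\}.$$
Without loss of generality, we ignore the part of the graph involving the deterministic factors $p(X \mid X^{(1)}, R)$ and the corresponding deterministic edges, in the construction of the Markov blanket and complete Markov blanket of variables in a missing data graph $\mathcal{G}$. We now show that the absence of non-deterministic colluder paths between a pair $X_i^{(1)}$ and $R_i$ in $\mathcal{G}$ implies that $X_i^{(1)} \notin \operatorname{mb}^c_{\mathcal{G}}(R_i)$.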
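• $X_i^{(1)}$ is not a parent of $R_i$, as $X_i^{(1)} \to R_i$ is trivially a colluder path.
• $X_i^{(1)}$ is not in the district of $R_i$, as $X_i^{(1)} \leftrightarrow \cdots \leftrightarrow R_i$ is also a colluder path.
These two points together imply that $X_i^{(1)} \notin \operatorname{mb}_{\mathcal{G}}(R_i)$. We now show that the union over children of $R_i$ and their Markov blankets also excludes $X_i^{(1)}$.
• $X_i^{(1)}$ is not a child of $R_i$, as directed edges from $R_i$ to variables in $X^{(1)}$ are ruled out by construction in missing data graphs.
• $X_i^{(1)}$ is also not in the district of any children of $R_i$, as $R_i \to \circ \leftrightarrow \cdots \leftrightarrow X_i^{(1)}$ is a colluding path.
• $X_i^{(1)}$ is also not a parent of the district of any children of $R_i$, as $R_i \to \circ \leftrightarrow \cdots \leftrightarrow \circ \leftarrow X_i^{(1)}$ is a colluding path.
These three points together rule out the possibility that $X_i^{(1)}$ is present in the union over children and Markov blankets of children of $R_i$. Thus, we have shown that $X_i^{(1)} \notin \operatorname{mb}^c_{\mathcal{G}}(R_i)$. This implies the following,
$$R_i \perp\!\!\!\perp V \setminus \{R_i \cup \operatorname{mb}^c_{\mathcal{G}}(R_i)\} \mid \operatorname{mb}^c_{\mathcal{G}}(R_i), \quad \text{where } X_i^{(1)} \in V \setminus \{R_i \cup \operatorname{mb}^c_{\mathcal{G}}(R_i)\}.$$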
By semi-graphoid axioms (see for example, (Lauritzen, 1996; Pearl, 2009)) this yields the conditional independence
The same line of reasoning detailed above can be used for all which then gives us the set of conditional independences implied by the no self-censoring model. That is,
∎
Lemma 1 A missing data model of a DAG that contains no self-censoring edges and no colluders, is a submodel of the itemwise conditionally independent nonresponse model described in (Shpitser, 2016; Sadinle & Reiter, 2017).
Proof.
A DAG is simply a special case of an ADMG with no bidirected edges. Consequently, the only two types of colluding paths are self-censoring edges ($X_i^{(1)} \to R_i$) and colluder structures ($X_i^{(1)} \to R_j \leftarrow R_i$). Thus, the absence of these two structures in a missing data DAG rules out all possible colluding paths. The rest of the proof then carries over straightforwardly from Lemma 2. ∎
Theorem 1 A full law $p(R, X^{(1)}, O)$ that is Markov relative to a missing data DAG $\mathcal{G}$ is identified if $\mathcal{G}$ does not contain edges of the form $X_i^{(1)} \to R_i$ (no self-censoring) and structures of the form $X_j^{(1)} \to R_i \leftarrow R_j$ (no colluders), and the stated positivity assumption holds. Moreover, the resulting identifying functional for the missingness mechanism is given by the odds ratio parameterization provided in Eq. 2 of the main draft, and the identifying functionals for the target law and full law are given by Remarks 1 and 2.
Proof.
Given Eq. (2), we know that
where , and
and is the normalizing term and is equal to . If we can prove that all the pieces in this factorization are identified, then the missingness process is identified and so is the full law. We provide the proof in two steps. Our proof is similar to the identification proof of the no self-censoring model given in (Malinsky et al., 2019).
For each we can apply the following expansion to the odds ratio term. Without loss of generality we drop fully observed random variables for brevity,
(4) |
This expansion can be applied inductively to the second term in the above product until is expressed as a function of pairwise odds ratios and higher-order interaction terms. Applying the inductive expansion to each odds ratio term in we can re-express the identifying functional as,
(5) |
where is the normalizing constant as before, and each are 3-way, 4-way, up to -way interaction terms. These interaction terms are defined as follows.
and
and so on, up to
Readers familiar with the clique potential factorization of Markov random fields may treat these interaction terms analogously (Malinsky et al., 2019). We now show that each term in the above factorization is identified.
Step 1.
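We start off by looking at the conditional pieces $p(R_k \mid R_{-k} = 1, X^{(1)})$. Given Lemma 1, we know that $R_k \perp\!\!\!\perp X_k^{(1)} \mid R_{-k}, X_{-k}^{(1)}$. Therefore, $p(R_k \mid R_{-k} = 1, X^{(1)}) = p(R_k \mid R_{-k} = 1, X_{-k}^{(1)})$ is identified for all $R_k \in R$.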
Step 2.
We now show that for any , the pairwise odds ratio given in Eq. (5) is identified. We know that
Consequently, if we can show that the odds ratio is neither a function of nor , then we can safely claim that the odds ratio is only a function of observed data and hence is identified. We get to this conclusion by exploiting the symmetric notion in odds ratios.
In the first equality, we can see that the odds ratio is not a function of since . Similarly, from the second equality, we can see that the odds ratio is not a function of since . Therefore, the pairwise odds ratios are all identified.
Finally we show that each of the higher-order interaction terms are identified. For each of these terms we need to show that they are not a function of missing variables with indices corresponding to indicators to the left of the conditioning bar. That is, we need to show that the 3-way interaction terms are not functions of the 4-way interaction terms are not functions of , and so on until finally the -way interaction term is not a function of
Because of the way the odds ratio is defined, each is symmetric in the arguments appearing to the left of the conditioning bar and can be rewritten in multiple equivalent ways. In particular, each -way interaction term can be rewritten in ways for any choice of indices of the missingness indicators that appear to the left of the conditioning bar. Each such representation allows us to conclude that is not a function of Combining all these together allows us to conclude that the -way interaction term is not a function of the missing variables corresponding to the indicators appearing on the left of the conditioning bar.
As a concrete example, consider the 3-way interaction We can write it down in three different ways as follows.
From the first equality, we note that is not a function of . From the second equality, we note that is not a function of . From the third equality, we note that is not a function of . Therefore, is not a function of and is identified. ∎
Theorem 2 The graphical condition of no self-censoring and no colluders, put forward in Theorem 1, is sound and complete for the identification of full laws that are Markov relative to a missing data DAG
Proof.
Soundness is a direct consequence of Theorem 1. To prove completeness, it needs to be shown that in the presence of a self-censoring edge or a colluder structure, the full law is no longer (non-parametrically) identified. A proof by counterexample of both these facts was provided in (Bhattacharya et al., 2019). However, this can also be seen from the fact that self-censoring edges and colluders are special cases of colluding paths, which we prove result in non-identification of the full law in Lemma 3. ∎
Theorem 3 A full law that is Markov relative to a missing data ADMG is identified if does not contain any colluding paths and the stated positivity assumption in Section 5 holds. Moreover, the resulting identifying functional for the missingness mechanism is given by the odds ratio parametrization provided in Eq. 2 of the main draft.
Proof.
Lemma 3 A full law that is Markov relative to a missing data ADMG containing a colluding path between any pair and is not identified.
Proof.
Proving the non-identifiability of missing data models of an ADMG that contains a colluding path can be shown by providing two models and that disagree on the full law but agree on the observed law. Coming up with a single example of such a pair of models is sufficient for arguing against non-parametric identification of the full law. Therefore, for simplicity, we restrict our attention to binary random variables. We first provide an example of such a pair of models on the simplest form of a colluding path, a bidirected edge as shown in Fig. 5(a). According to Table 1, in order for the observed laws to agree, the only requirement is that the quantity remain equal in both models; hence we can come up with infinitely many counterexamples of full laws that are not the same but map to the same observed law.
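A concrete pair of such models is easy to write down for a single binary variable and its missingness indicator joined by a bidirected edge, which places no restriction on their joint distribution. The numbers below are illustrative and are not taken from Table 1; the function name `observed_law` and the dictionary encoding are assumptions of this sketch.

```python
import math


def observed_law(full):
    """Map a full law over (x1, r) to the observed law over (x, r).

    full[(x1, r)] = p(X^(1) = x1, R = r). The proxy X equals X^(1) when
    r == 1 and is missing ("?") when r == 0.
    """
    obs = {}
    for (x1, r), prob in full.items():
        x = x1 if r == 1 else "?"
        obs[(x, r)] = obs.get((x, r), 0.0) + prob
    return obs


# Two full laws that differ only in how the mass on R = 0 is split.
model_1 = {(0, 0): 0.10, (1, 0): 0.30, (0, 1): 0.20, (1, 1): 0.40}
model_2 = {(0, 0): 0.25, (1, 0): 0.15, (0, 1): 0.20, (1, 1): 0.40}

obs_1 = observed_law(model_1)
obs_2 = observed_law(model_2)
same_observed = all(math.isclose(obs_1[k], obs_2[k]) for k in obs_1)
different_full = model_1 != model_2
```

Any other split of the mass on the rows with `r == 0` that preserves their total gives yet another full law with the same observed law, which is the sense in which infinitely many counterexamples exist.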
Constructing explicit counterexamples is not necessary to prove non-identification, as long as it can be shown that there exist at least two distinct functions that map two different full laws onto the exact same observed law. For instance, if the number of parameters in the full law is strictly larger than the number of parameters in the observed law, then there would exist infinitely many such functions. Consequently, we rely on a parameter counting argument to prove the completeness of our results. Since we are considering missing data models of ADMGs, we use the Moebius parameterization of binary nested Markov models of an ADMG described in Appendix A.
The nested Markov model of a missing data ADMG where is a smooth super model of the missing data DAG model and has the same model dimension as the latent variable model (Evans, 2018). We also utilize nested Markov models of an ADMG corresponding to projection of the missing data ADMG onto variables that are fully observable. While such a model does not capture all equality constraints in the true observed law, it is still a smooth super model of it, thus providing an upper bound on the model dimension of the observed law. This suffices for our purposes, as demonstrating that the upper bound on the number of parameters in the observed law is less than the number of parameters in the full law, is sufficient to prove that the full law is not identified. We first walk the reader through a few examples to demonstrate this proof strategy, and then provide the general argument.
$R_i$ | $X_i^{(1)}$ | p(Full Law) | $X_i$ | p(Observed Law)
---|---|---|---|---
0 | 0 | | ? |
0 | 1 | | ? |
1 | 0 | | 0 |
1 | 1 | | 1 |
[Table 2: Moebius parameterization of the full laws in Figs. 5(d) and 5(e) and of the observed law in Fig. 5(f). Columns: Districts | Intrinsic Head/Tail | Moebius Parameters | Counts, with a Total row for each of the three panels.]
Self-censoring through unmeasured confounding:
We start by reanalyzing the colluding path given in Fig. 5(a) and the corresponding projection given in Fig. 5(c). The Moebius parameters associated with the full law are $q(X_i^{(1)} = 0)$, $q(R_i = 0)$, and $q(X_i^{(1)} = 0, R_i = 0)$, for a total of 3 parameters. The Moebius parameters associated with the observed law in Fig. 5(c) are $q(R_i = 0)$ and $q(X_i = 0 \mid R_i = 1)$, for a total of only 2 parameters. Since $2 < 3$, we can construct infinitely many mappings, as was shown in Table 1.
Simple colluding paths:
Consider the colluding paths given in Fig. 5(d, e) and the corresponding projection (which are identical in both cases) given in Fig. 5(f). The Moebius parameters associated with the full laws and observed law are shown in Table 2. Once again, since the number of parameters in the observed law is less than the number in the full law , we can construct infinitely many mappings.
A general argument:
In order to generalize our argument, we first provide a more precise representation (that does not use dashed edges) in Figs. 6(a-d), of all possible colluding paths between and Without loss of generality, assume that there are variables in and there are variables that lie on the collider path between and , . We denote the th variable on the collider path by ; Note that in Figs. 6(c, d) can only belong to by convention. Fig. 6(e) illustrates the corresponding projections of figures (a) and (b), and Fig. 6(f) illustrates the corresponding projections of figures (c) and (d). In the projections shown in Figs. 6(e, f),
We now go over each of these colluding paths and their corresponding latent projections, as if they appear in a larger graph that is otherwise completely disconnected. We count the number of Moebius parameters as a function of , and show that the full law always has one more parameter than the observed law. One can then imagine placing these colluding paths in a larger graph with arbitrary connectivity, and arguing that the full law is still not identified as a consequence of the parameter discrepancy arising from the colluding path alone. That is, if we show a fully disconnected graph containing a single colluding path is not identified, then it is also the case that any edge super graph (super model) is also not identified.
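In the following proof we heavily rely on the following fact. Given a bidirected chain $V_1 \leftrightarrow \cdots \leftrightarrow V_K$ of length $K$, the number of Moebius parameters required to parameterize this chain is given by the sum of natural numbers $1$ to $K$, i.e., $\frac{K(K+1)}{2}$. This can be seen from the fact that the corresponding Moebius parameters are given by the series,
• $q(V_1 = 0), q(V_2 = 0), \ldots, q(V_K = 0)$, corresponding to $K$ parameters.
• $q(V_1 = 0, V_2 = 0), q(V_2 = 0, V_3 = 0), \ldots, q(V_{K-1} = 0, V_K = 0)$, corresponding to $K - 1$ parameters.
• $\ldots$
• $q(V_1 = 0, V_2 = 0, \ldots, V_K = 0)$, corresponding to $1$ parameter.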
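The count for a bidirected chain can be verified by enumeration. The sketch below assumes a chain with no directed edges, in which case the intrinsic sets are exactly the contiguous segments of the chain and each contributes a single parameter for binary variables; the function name `chain_intrinsic_sets` is hypothetical.

```python
def chain_intrinsic_sets(k):
    """Contiguous segments of a bidirected chain V_1 <-> ... <-> V_k.

    Each segment is returned as a tuple of vertex indices. With binary
    variables and empty tails, every segment carries one Moebius parameter.
    """
    return [tuple(range(start, start + size))
            for size in range(1, k + 1)
            for start in range(1, k - size + 2)]


# Number of parameters for chains of length 1 through 6.
counts = {k: len(chain_intrinsic_sets(k)) for k in range(1, 7)}
```

For every length `k` the enumeration matches the closed form `k * (k + 1) // 2`: there are `k` singletons, `k - 1` adjacent pairs, and so on down to the single segment covering the whole chain.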
In counting the number of parameters for a disconnected graph (with the exception of the colluding path), we can also exclude the singleton (disconnected) nodes from the counting argument since they account for the same number of parameters in both the full law and observed law. In the full law they are either or and the corresponding parameters in the observed law are or . The Moebius parameter counts for each of the colluding paths in Figs. 6(a-d) and their corresponding latent projections in Figs. 6(e,f) are as follows.
Figures a, b, and e
1. Number of Moebius parameters in Fig. 6(a) is $\frac{(K+2)(K+3)}{2}$:
• A bidirected chain of length $K+2$, i.e., $\frac{(K+2)(K+3)}{2}$ parameters.
2. Number of Moebius parameters in Fig. 6(b) is $\frac{(K+2)(K+3)}{2}$, where $V_1, \ldots, V_K$ denote the variables on the collider path:
• $q(X_i^{(1)} = 0)$, i.e., $1$ parameter,
• A bidirected chain of length $K+1$, i.e., $\frac{(K+1)(K+2)}{2}$ parameters,
• Intrinsic sets involving $V_1$, i.e., $\{V_1\}, \{V_1, V_2\}, \ldots, \{V_1, \ldots, V_K, R_i\}$, corresponding to $K+1$ parameters.
3. Number of Moebius parameters in Fig. 6(e) is $\frac{(K+1)(K+2)}{2} + K + 1$:
• Note that even though each proxy that may appear in the bidirected chain has a directed edge from its missingness indicator pointing into it, the corresponding intrinsic head tail pair that involves both variables will always have the indicator set to $1$. Hence, we may ignore these deterministic edges and count the parameters as if it were a bidirected chain of length $K+1$, corresponding to $\frac{(K+1)(K+2)}{2}$ parameters,
• When enumerating intrinsic sets involving $X_i$, we note that $\{X_i, V_1, \ldots, V_K\}$ is not intrinsic as $R_i$ is not fixable (due to the bidirected path between $R_i$ and $X_i$ and the edge $R_i \to X_i$). Thus, as there is one less intrinsic set involving $X_i$, the number of parameters required to parameterize all intrinsic sets involving $X_i$ is one fewer, i.e., $K+1$ (instead of $K+2$) parameters.
Figures c, d, and f
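1. Number of Moebius parameters in Fig. 6(c) is $\frac{(K+2)(K+3)}{2}$, where $V_1, \ldots, V_K$ denote the variables on the collider path:
• $q(R_i = 0)$, i.e., $1$ parameter,
• A bidirected chain of length $K+1$, i.e., $\frac{(K+1)(K+2)}{2}$ parameters,
• Intrinsic sets involving $V_K$, i.e., $\{V_K\}, \{V_{K-1}, V_K\}, \ldots, \{X_i^{(1)}, V_1, \ldots, V_K\}$, corresponding to $K+1$ parameters.
2. Number of Moebius parameters in Fig. 6(d) is $\frac{(K+2)(K+3)}{2}$:
• $q(X_i^{(1)} = 0)$ and $q(R_i = 0)$, i.e., $2$ parameters,
• A bidirected chain of length $K$, i.e., $\frac{K(K+1)}{2}$ parameters,
• Intrinsic sets involving $V_1$ and not $V_K$, i.e., $\{V_1\}, \{V_1, V_2\}, \ldots, \{V_1, \ldots, V_{K-1}\}$, corresponding to $K-1$ parameters,
• Intrinsic sets involving $V_K$ and not $V_1$, i.e., $\{V_K\}, \{V_{K-1}, V_K\}, \ldots, \{V_2, \ldots, V_K\}$, corresponding to $K-1$ parameters,
• The intrinsic set involving both $V_1$ and $V_K$, i.e., $\{V_1, \ldots, V_K\}$, corresponding to $3$ parameters.
3. Number of Moebius parameters in Fig. 6(f) is $\frac{(K+1)(K+2)}{2} + K + 1$:
• $q(R_i = 0)$, i.e., $1$ parameter,
• By the same argument as before, deterministic tails can be ignored. Hence, we have a bidirected chain of length $K+1$, i.e., $\frac{(K+1)(K+2)}{2}$ parameters,
• Intrinsic sets involving $V_K$, i.e., $\{V_K\}, \{V_{K-1}, V_K\}, \ldots, \{V_1, \ldots, V_K\}$, corresponding to $K$ parameters, and the special intrinsic set $\{X_i, V_1, \ldots, V_K\}$, which results in the observed law having one less parameter, corresponding to just $1$ parameter instead of $2$ due to the presence of the proxy $X_i$ in the head and the corresponding $R_i$ in the tail.
∎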
∎
References
- Bhattacharya et al. (2019) Bhattacharya, R., Nabi, R., Shpitser, I., and Robins, J. M. Identification in missing data models represented by directed acyclic graphs. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.
- Chen (2007) Chen, H. Y. A semiparametric odds ratio model for measuring association. Biometrics, 63:413–421, 2007.
- Chen et al. (2015) Chen, H. Y., Rader, D. E., and Li, M. Likelihood inferences on semiparametric odds ratio model. Journal of the American Statistical Association, 110(511):1125–1135, 2015.
- Cheng (1994) Cheng, P. E. Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89(425):81–87, 1994.
- Chickering (2002) Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
- Daniel et al. (2012) Daniel, R. M., Kenward, M. G., Cousens, S. N., and De Stavola, B. L. Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3):243–256, 2012.
- Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
- Drton (2009) Drton, M. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.
- Evans (2018) Evans, R. J. Margins of discrete Bayesian networks. The Annals of Statistics, 46(6A):2623–2656, 2018.
- Evans & Richardson (2014) Evans, R. J. and Richardson, T. S. Markovian acyclic directed mixed graphs for discrete data. The Annals of Statistics, pp. 1452–1482, 2014.
- Fielding et al. (2008) Fielding, S., Fayers, P. M., McDonald, A., McPherson, G., Campbell, M. K., et al. Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health and Quality of Life Outcomes, 6(1):57, 2008.
- Gain & Shpitser (2018) Gain, A. and Shpitser, I. Structure learning under missing data. In Proceedings of the 9th International Conference on Probabilistic Graphical Models, pp. 121–132, 2018.
- Lauritzen (1996) Lauritzen, S. L. Graphical Models. Oxford, U.K.: Clarendon, 1996.
- Malinsky et al. (2019) Malinsky, D., Shpitser, I., and Tchetgen Tchetgen, E. J. Semiparametric inference for non-monotone missing-not-at-random data: the no self-censoring model. arXiv preprint arXiv:1909.01848, 2019.
- Marra et al. (2017) Marra, G., Radice, R., Bärnighausen, T., Wood, S. N., and McGovern, M. E. A simultaneous equation approach to estimating HIV prevalence with nonignorable missing responses. Journal of the American Statistical Association, 112(518):484–496, 2017.
- Marston et al. (2010) Marston, L., Carpenter, J. R., Walters, K. R., Morris, R. W., Nazareth, I., and Petersen, I. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiology and Drug Safety, 19(6):618–626, 2010.
- Martel García (2013) Martel García, F. Definition and diagnosis of problematic attrition in randomized controlled experiments. Working paper. Available at SSRN 2302735, 2013.
- Mohan & Pearl (2014) Mohan, K. and Pearl, J. Graphical models for recovering probabilistic and causal queries from missing data. In Proceedings of the 28th Conference on Advances in Neural Information Processing Systems, pp. 1520–1528, 2014.
- Mohan et al. (2013) Mohan, K., Pearl, J., and Tian, J. Graphical models for inference with missing data. In Proceedings of the 27th Conference on Advances in Neural Information Processing Systems, pp. 1277–1285, 2013.
- Ogarrio et al. (2016) Ogarrio, J. M., Spirtes, P. L., and Ramsey, J. D. A hybrid causal search algorithm for latent variable models. In Proceedings of the 8th International Conference on Probabilistic Graphical Models, pp. 368–379, 2016.
- Pearl (1988) Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
- Pearl (2009) Pearl, J. Causality. Cambridge University Press, 2009.
- Ramsey (2015) Ramsey, J. D. Scaling up greedy causal search for continuous variables. arXiv preprint arXiv:1507.07749, 2015.
- Richardson (2003) Richardson, T. S. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157, 2003.
- Richardson et al. (2017) Richardson, T. S., Evans, R. J., Robins, J. M., and Shpitser, I. Nested Markov properties for acyclic directed mixed graphs. arXiv preprint arXiv:1701.06686v2, 2017. Working paper.
- Robins & Gill (1997) Robins, J. M. and Gill, R. D. Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine, 16(1):39–56, 1997.
- Robins et al. (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846–866, 1994.
- Saadati & Tian (2019) Saadati, M. and Tian, J. Adjustment criteria for recovering causal effects from missing data. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2019.
- Sadinle & Reiter (2017) Sadinle, M. and Reiter, J. P. Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika, 104(1):207–220, 2017.
- Shpitser (2016) Shpitser, I. Consistent estimation of functions of data missing non-monotonically and not at random. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, 2016.
- Shpitser et al. (2015) Shpitser, I., Mohan, K., and Pearl, J. Missing data as a causal and probabilistic problem. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pp. 802–811. AUAI Press, 2015.
- Strobl et al. (2018) Strobl, E. V., Visweswaran, S., and Spirtes, P. L. Fast causal inference with non-random missingness by test-wise deletion. International Journal of Data Science and Analytics, 6(1):47–62, 2018.
- Tchetgen Tchetgen et al. (2018) Tchetgen Tchetgen, E. J., Wang, L., and Sun, B. Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica, 28(4):2069–2088, 2018.
- Thoemmes & Rose (2013) Thoemmes, F. and Rose, N. Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal. Technical report, R-002, Cornell University, 2013.
- Tian (2017) Tian, J. Recovering probability distributions from missing data. In Proceedings of the Ninth Asian Conference on Machine Learning, 2017.
- Tian & Pearl (2002) Tian, J. and Pearl, J. A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence, pp. 567–573, 2002.
- Tsiatis (2006) Tsiatis, A. Semiparametric Theory and Missing Data. Springer-Verlag New York, 1st edition, 2006.
- Tu et al. (2019) Tu, R., Zhang, C., Ackermann, P., Mohan, K., Kjellström, H., and Zhang, K. Causal discovery in the presence of missing data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1762–1770, 2019.
- Vansteelandt et al. (2007) Vansteelandt, S., Rotnitzky, A., and Robins, J. M. Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. Biometrika, 94(4):841–860, 2007.
- Verma & Pearl (1990) Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, 1990.
- Zhou et al. (2010) Zhou, Y., Little, R. J. A., and Kalbfleisch, J. D. Block-conditional missing at random models for missing data. Statistical Science, 25(4):517–532, 2010.