Learning Directed Acyclic Graphs From Partial Orderings
Abstract
Directed acyclic graphs (DAGs) are commonly used to model causal relationships among random variables. In general, learning the DAG structure is both computationally and statistically challenging. Moreover, without additional information, the direction of edges may not be estimable from observational data. In contrast, given a complete causal ordering of the variables, the problem can be solved efficiently, even in high dimensions. In this paper, we consider the intermediate problem of learning DAGs when a partial causal ordering of variables is available. We propose a general estimation framework for leveraging the partial ordering and present efficient estimation algorithms for low- and high-dimensional problems. The advantages of the proposed framework are illustrated via numerical studies.
1 Introduction
Directed acyclic graphs (DAGs) are widely used to capture causal relationships among components of complex systems (Spirtes et al., 2001; Pearl, 2009; Maathuis et al., 2018). They also form a foundation for causal discovery and inference (Pearl, 2009). Probabilistic graphical models defined on DAGs, known as Bayesian networks (Pearl, 2009), have thus found broad applications in various scientific disciplines, from biology (Markowetz and Spang, 2007; Zhang et al., 2013) and social sciences (Gupta and Kim, 2008), to knowledge representation and machine learning (Heckerman, 1997). However, learning the structure of DAGs from observational data is very challenging due to at least two major factors: First, it may not be possible to infer the direction of edges from observational data alone. In fact, unless the model is identifiable (see, e.g., Peters et al., 2014a), observational data only reveal the structure of the Markov equivalent class of DAGs (Maathuis et al., 2018), captured by a complete partially directed acyclic graph (CPDAG) (Andersson et al., 1997). The second reason is computational—learning DAGs from observational data is an NP-complete problem (Chickering, 1996). In fact, while a few polynomial time algorithms have been proposed for special cases, including sparse graphs (Kalisch and Bühlmann, 2007) or identifiable models (Chen et al., 2019; Ghoshal and Honorio, 2018; Peters et al., 2014b; Wang and Drton, 2020; Shimizu et al., 2006; Yu et al., 2023), existing general-purpose algorithms are not scalable to problems involving many variables.
In spite of the many challenges of learning DAGs in general settings, the problem becomes much more manageable if a valid causal ordering of the variables is known (Shojaie and Michailidis, 2010). In a valid causal ordering for a DAG, any node can appear before another node only if there is no directed path from the latter to the former. Multiple valid causal orderings may exist for a given DAG, as illustrated in the simple example of Figure 1, which admits more than one valid causal ordering.
Clearly, a known causal ordering of variables resolves any ambiguity about the directions of edges in a DAG and hence addresses the first source of difficulty in estimation of DAGs discussed above. However, this knowledge also significantly simplifies the computation: given a valid causal ordering, DAG learning reduces to a variable selection problem that can be solved efficiently even in the high-dimensional setting (Shojaie and Michailidis, 2010), when the number of variables is much larger than the sample size.
In the simplest case, the idea of Shojaie and Michailidis (2010) is to regress each variable on all preceding variables in the ordering. While simple and efficient, this idea and its extensions (e.g., Fu and Zhou, 2013; Shojaie et al., 2014; Han et al., 2016) require a complete (or full) causal ordering of variables, i.e., a permutation of the full list of variables in the DAG. However, complete causal orderings are rarely available in practice. To relax this assumption, a few recent proposals have combined regularization strategies with algorithms that search over the space of orderings (e.g., Raskutti and Uhler, 2018). These algorithms are more efficient than those searching over the super-exponentially large space of DAGs (Friedman and Koller, 2003). Nonetheless, the computation for these algorithms remains prohibitive for moderate to large size problems (Manzour et al., 2021; Küçükyavuz et al., 2023).
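To make this reduction concrete, the following sketch illustrates how DAG learning collapses to a sequence of variable selection problems when a complete causal ordering is known, assuming a linear structural equation model. The function name, the use of scikit-learn's LassoCV, and the cross-validation choice are illustrative assumptions, not the estimator of Shojaie and Michailidis (2010).

```python
import numpy as np
from sklearn.linear_model import LassoCV

def dag_from_full_ordering(data, ordering):
    """Estimate a DAG given a complete causal ordering.

    data:     (n, p) array of observations.
    ordering: a permutation of 0..p-1; ordering[k] may only have parents
              among ordering[:k].
    Returns an adjacency matrix A with A[i, j] != 0 meaning j -> i.
    """
    n, p = data.shape
    A = np.zeros((p, p))
    for k in range(1, p):
        child = ordering[k]
        preds = list(ordering[:k])
        # Regress the k-th variable in the ordering on all preceding
        # variables; nonzero coefficients are the estimated parents.
        fit = LassoCV(cv=5).fit(data[:, preds], data[:, child])
        A[child, preds] = fit.coef_
    return A
```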
In this paper, we consider the setting where a partial causal ordering of variables is known. This scenario—which is intermediate between assuming a complete causal ordering as in Shojaie and Michailidis (2010) and making no assumption on the causal ordering—occurs commonly in practice. An important example is the problem of identifying direct causal effects of multiple exposures on multiple outcomes (assuming no unmeasured confounders). Formally, let $\mathcal{X}$ and $\mathcal{Y}$ denote the sets of exposure and outcome variables, respectively. Then, we have a partial ordering among the $\mathcal{X}$ and $\mathcal{Y}$ variables, namely $\mathcal{X} \prec \mathcal{Y}$, but we do not have any knowledge of the ordering among the $\mathcal{X}$ or $\mathcal{Y}$ variables themselves. Nonetheless, we are interested in identifying direct causal effects of the exposures $\mathcal{X}$ on the outcomes $\mathcal{Y}$. This corresponds to learning the edges from $\mathcal{X}$ to $\mathcal{Y}$, which form a bipartite graph.
Estimation of DAGs from partial orderings also arises naturally in the analysis of biological systems. For instance, in gene regulatory networks, ‘transcription factors’ are often known a priori and are not expected to be affected by other ‘target genes’. Similar to the previous example, here the set of transcription factors appears before the set of target genes, and the goal is to infer gene regulatory interactions. Similar problems also occur in integrative genomics, including in eQTL mapping (Ha and Sun, 2020).
Despite its importance and many applications, the problem of learning DAGs from partial orderings has not been satisfactorily addressed. In particular, as we show in the next section, various regression-based strategies currently used in applications result in incorrect estimates. As an alternative to these methods, one can use general DAG learning algorithms, such as the PC algorithm (Spirtes et al., 2001; Kalisch and Bühlmann, 2007), to learn the structure of the CPDAG and then orient the edges between the two layers according to the known partial ordering. However, such an approach would not utilize the partial ordering in the estimation of edges and is thus inefficient. The recent proposal of Wang and Michailidis (2019) is also not computationally feasible, as it requires searching over all possible orderings of variables.
To overcome the above limitations, we present a new framework for incorporating partial ordering information into DAG learning. To this end, after formulating the problem in Section 2, we investigate the limitations of existing approaches. Motivated by these findings, we present the new framework in Section 3 and establish the correctness of its population version. Then, in Section 4, we present different estimation strategies for learning low- and high-dimensional DAGs. To simplify the presentation, the main ideas are presented for the special case of two-layer networks corresponding to linear structural equation models (SEMs); the more general version of the algorithm and its extensions are discussed in Section 5 and Appendix B. The advantages of the proposed framework are illustrated through simulation studies and an application in integrative genomics in Section 6.
2 Learning Directed Graphs from Partial Orderings
2.1 Problem Formulation
Consider a DAG $G$ with node set $V$ and edge set $E$. For the general problem, we assume that $V$ is partitioned into sets such that the nodes in any set cannot be parents of nodes in any earlier set. Such a partition defines a layering of $G$ (Manzour et al., 2021). In fact, a valid layering can be found for any DAG, though some layers may contain a single node. As such, the notion of layering is general: we make no assumption on the size of each layer, the causal ordering of variables within each set, or interactions among them, except that $G$ is a DAG.
To simplify the presentation, we primarily focus on two-layer, or bipartite, DAGs, and defer the discussion of more general cases to Section 5. Let $\mathcal{X} = \{X_1, \ldots, X_p\}$ be the nodes in the first layer and $\mathcal{Y} = \{Y_1, \ldots, Y_q\}$ be those in the second layer. In the case of causal inference for multiple outcomes discussed in Section 1, $\mathcal{X}$ represents the exposure variables, and $\mathcal{Y}$ represents the outcome variables.
Throughout the paper, we denote individual variables by regular upper case letters, e.g., $X$ and $Y$, and sets of variables by calligraphic upper case letters or index sets, e.g., $\mathcal{X}$ and $X_{\mathcal{S}}$, where $\mathcal{S}$ is a set containing more than one index. We also refer to nodes of the network by using both their indices and the corresponding random variables. We denote the parents of a node $j$ in $G$ by $\mathrm{pa}(j)$, its ancestors and children by $\mathrm{an}(j)$ and $\mathrm{ch}(j)$, respectively, and its adjacent nodes by $\mathrm{adj}(j)$, where $\mathrm{adj}(j) = \mathrm{pa}(j) \cup \mathrm{ch}(j)$. We may also represent the edges of $G$ using its adjacency matrix $A$, which satisfies $A_{kj} \neq 0$ if and only if $j \in \mathrm{pa}(k)$. For any bipartite DAG with layers $\mathcal{X}$ and $\mathcal{Y}$, we can partition $A$ into the following block matrix
$$A \;=\; \begin{pmatrix} A_{\mathcal{X}} & 0 \\ B & A_{\mathcal{Y}} \end{pmatrix}, \qquad\qquad (1)$$
where the zero constraint on the upper right block of $A$ follows from the partial ordering. Here $A_{\mathcal{X}}$ and $A_{\mathcal{Y}}$ contain the information on the edges amongst $\mathcal{X}$ and $\mathcal{Y}$, respectively. Both of these matrices can be written as lower-triangular matrices (Shojaie and Michailidis, 2010). However, there are generally no constraints on $B$. We denote by $G_{\mathcal{X}\mathcal{Y}}$ the subgraph of $G$ containing edges from $\mathcal{X}$ to $\mathcal{Y}$ only, i.e., those corresponding to entries in $B$. Our goal is to estimate $G_{\mathcal{X}\mathcal{Y}}$ using the fact that $\mathcal{X} \prec \mathcal{Y}$.
Figure 2a shows a simple example of a two-layer DAG with layers consisting of nodes. This example illustrates the difference between full causal orderings of variables in a DAG and partial causal orderings: In this case, the graph admits two full causal orderings, namely and ; either of these orderings can be used to correctly discover the structure of the graph. Here, the partial ordering of the variables defined by the layering of the graph to sets and , i.e., , can be written as . This ordering determines that s should appear before s in the causal ordering but does not restrict the ordering of and . In this example, only is consistent with the partial ordering . Moreover, is also consistent with , which is not a valid causal ordering of , indicating that a partial causal order is not sufficient for learning the full DAG.
2.2 Failure of Simple Algorithms
In this section, we discuss the challenges of estimating the graph $G_{\mathcal{X}\mathcal{Y}}$ and why this problem cannot be solved using simple approaches. Since the partial ordering of nodes, $\mathcal{X} \prec \mathcal{Y}$, provides information about the direction of causality between the two layers of the network, we focus on simple constraint-based methods (Spirtes et al., 2001), which learn the network edges based on conditional independence relationships among nodes. This requires conditional independence relations among variables to be compatible with the edges in $G$; formally, the joint probability distribution needs to be faithful to $G$ (Spirtes et al., 2001). (As discussed in Section 4, our algorithm requires a weaker notion of faithfulness; however, for simplicity, we consider the classical notion of faithfulness in this section.)
Given the partial ordering of variables—which means that $Y$s cannot be parents of $X$s—one approach for estimating the bipartite graph is to draw an edge from $X_j$ to $Y_i$ whenever $Y_i$ is dependent on $X_j$ given all other nodes in the first layer, $\mathcal{X} \setminus \{X_j\}$. Formally, denoting conditional independence of two variables by $\perp$, we define the estimate $\widehat{G}(\mathcal{X})$ via
$$(X_j \to Y_i) \in \widehat{G}(\mathcal{X}) \quad \Longleftrightarrow \quad Y_i \not\perp X_j \mid \mathcal{X} \setminus \{X_j\}, \qquad\qquad (2)$$
where the notation $\widehat{G}(\mathcal{X})$ emphasizes that the estimate is obtained without conditioning on any $Y$.
Figure 2b shows the estimated $\widehat{G}(\mathcal{X})$ in the setting where the true causal relationships are linear. In this setting, edges in $G$ represent nonzero coefficients in linear regressions of each variable on its parents in $G$. It can be seen that, in this example, the true causal effects from $X$s to $Y$s are in fact captured in $\widehat{G}(\mathcal{X})$; these are shown as solid blue lines in Figure 2b. However, this simple example suggests that $\widehat{G}(\mathcal{X})$ may include spurious edges, shown by orange dashed edges in the figure. The next lemma formalizes and generalizes this finding. The proofs of this and other results in the paper are gathered in Appendix A.
Lemma 2.1.
Assume that is faithful with respect to . Let be the directed bipartite graph defined in (2). Then, if ,
-
i)
whenever ;
-
ii)
for any path of the form such that , will include a false positive edge .
Lemma 2.1 suggests that if $G$ includes edges among the $Y$s, failing to condition on the $Y$s can result in falsely detected causal effects from $X$s to $Y$s. Thus, without knowledge of interactions among the $Y$s, one may consider an estimator that corrects for both $\mathcal{X}$ and $\mathcal{Y}$ when trying to detect the causal effects of $X_j$ on $Y_i$; in other words, we may declare an edge from $X_j$ to $Y_i$ whenever the two are dependent given all remaining variables. We denote the resulting estimate of $G_{\mathcal{X}\mathcal{Y}}$ as $\widehat{G}(\mathcal{X} \cup \mathcal{Y})$:
$$(X_j \to Y_i) \in \widehat{G}(\mathcal{X} \cup \mathcal{Y}) \quad \Longleftrightarrow \quad Y_i \not\perp X_j \mid (\mathcal{X} \cup \mathcal{Y}) \setminus \{X_j, Y_i\}. \qquad\qquad (3)$$
Unfortunately, as Figure 2c shows, estimation based on this model may also include false positive edges. As the next lemma clarifies, the false positive edges in this case are due to conditioning on common descendants of a pair $X_j$ and $Y_i$ that are not connected in $G$, which is sometimes referred to as Berkson's paradox (Pearl, 2009).
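The two naive estimators in (2) and (3) can be made concrete for linear Gaussian data, where conditional independence can be tested via partial correlations and Fisher's z-transform. The sketch below is illustrative: the helper names, the significance level, and the use of the pooled sample covariance are our assumptions, not the construction behind the paper's figures.

```python
import numpy as np
from scipy.stats import norm

def pcorr(i, j, S, C):
    """Partial correlation of variables i and j given the index set S,
    computed from the covariance matrix C."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(C[np.ix_(idx, idx)])  # precision of the sub-vector
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def ci_test(i, j, S, C, n, alpha=0.01):
    """Fisher-z test of zero partial correlation; True means 'independent'."""
    r = pcorr(i, j, S, C)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(S) - 3)
    return 2 * (1 - norm.cdf(abs(z))) > alpha

def naive_bipartite_estimates(data, x_idx, y_idx, alpha=0.01):
    """Edge sets of the two naive estimators: an X -> Y edge is kept if the
    pair is dependent given (i) only the remaining X's, as in (2), or
    (ii) all remaining variables, as in (3)."""
    n = data.shape[0]
    C = np.cov(data, rowvar=False)
    G_x, G_xy = set(), set()
    for j in x_idx:
        for i in y_idx:
            S1 = [k for k in x_idx if k != j]
            S2 = S1 + [k for k in y_idx if k != i]
            if not ci_test(j, i, S1, C, n, alpha):
                G_x.add((j, i))
            if not ci_test(j, i, S2, C, n, alpha):
                G_xy.add((j, i))
    return G_x, G_xy
```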
Lemma 2.2.
Remark 2.3.
While the false positive edges in $\widehat{G}(\mathcal{X})$ are caused by failing to condition on necessary variables, the false positive edges in $\widehat{G}(\mathcal{X} \cup \mathcal{Y})$ are caused by conditioning on extra variables that are not part of the correct causal order of variables. More generally, the partial ordering information does not lead to a simple estimator that correctly identifies the direct causal effects of the exposures, $\mathcal{X}$, on the outcomes, $\mathcal{Y}$.
Building on the above findings, in the next section we first present a modification of the PC algorithm that leverages the partial ordering information. We then present a general framework for leveraging the partial ordering of variables more effectively, especially when the goal is to learn the graph $G_{\mathcal{X}\mathcal{Y}}$.
3 Incorporating Partial Orderings into DAG Learning
3.1 A modified PC Algorithm
Given an ideal test of conditional independence, we can estimate $G_{\mathcal{X}\mathcal{Y}}$ by first learning the skeleton of $G$ using the PC algorithm (Kalisch and Bühlmann, 2007) and then orienting the edges between the two layers according to the partial ordering of nodes. However, such an algorithm uses the ordering information in a post hoc way—the information is not utilized to learn the edges in $G_{\mathcal{X}\mathcal{Y}}$.
Alternatively, we can incorporate the partial ordering directly into the PC algorithm. Recall that PC starts from a fully connected graph and iteratively removes edges by searching for conditional independence relations that correspond to graphical d-separations; to this end, it uses separating sets of increasing sizes, starting with sets of size 0 (i.e., no conditioning) and incrementally increasing the size as the algorithm progresses. To modify the PC algorithm, we use the fact that if two nodes are non-adjacent in a DAG, then any of their d-separators remains a d-separator after removing from it all nodes that lie in layers after both nodes under consideration, since such nodes cannot be ancestors of either node; see Chapter 6 of Spirtes et al. (2001). Given a partial ordering of variables, this rule allows excluding such nodes from the PC search step. Then, under the assumptions of the PC algorithm, the above modification, which we refer to as PC+, enjoys the same population and sample guarantees.
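A sketch of the resulting skeleton search is given below. It assumes the layering is supplied as a layer index per node and that `ci_test` encapsulates the chosen conditional independence test; the restriction implemented here (dropping candidate separators from layers after both endpoints) reflects our reading of the modification above, and the loop is a simplified PC-style search rather than a full PC implementation (edge orientation is omitted).

```python
from itertools import combinations

def pc_plus_skeleton(nodes, layer, ci_test, max_level=3):
    """PC-style skeleton search that exploits a partial ordering.

    layer[v] gives the layer index of node v (smaller = earlier).  When
    testing a pair (i, j), candidate separating sets are drawn only from
    current neighbors of i that are NOT in a strictly later layer than
    both i and j; by the layering, the excluded nodes cannot be ancestors
    of either endpoint, so they can be dropped from any d-separator.
    ci_test(i, j, S) returns True if i and j are judged conditionally
    independent given the set S.
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # start fully connected
    for level in range(max_level + 1):
        for i in nodes:
            for j in list(adj[i]):
                cutoff = max(layer[i], layer[j])
                cands = [v for v in adj[i] if v != j and layer[v] <= cutoff]
                for S in combinations(cands, level):
                    if ci_test(i, j, list(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
    return adj
```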
Given its construction, the partial ordering becomes more informative in PC+ as the number of layers increases: in two-layer settings, for learning the edges in $G_{\mathcal{X}\mathcal{Y}}$, PC+ is exactly the same as PC and cannot utilize the partial ordering. However, as the results in Section 6.2 show, PC+ remains ineffective even in graphs with a larger number of layers, motivating the development of the framework proposed in the next section.
3.2 A new framework
In this section, we propose a new framework for learning DAGs from partial orderings. The proposed approach is motivated by two key observations in Lemmas 2.1 and 2.2: First, the graphs $\widehat{G}(\mathcal{X})$ and $\widehat{G}(\mathcal{X} \cup \mathcal{Y})$ include all true causal relationships from nodes in the first layer to nodes in the second layer. Second, both graphs may also include additional edges: in $\widehat{G}(\mathcal{X})$, due to not conditioning on the second-layer parents of $Y_i$, and in $\widehat{G}(\mathcal{X} \cup \mathcal{Y})$, due to conditioning on common children of $X_j$ and $Y_i$.
Let $\widehat{E}(\mathcal{X})$ and $\widehat{E}(\mathcal{X} \cup \mathcal{Y})$ denote the edge sets of $\widehat{G}(\mathcal{X})$ and $\widehat{G}(\mathcal{X} \cup \mathcal{Y})$, respectively. The next lemma, which is a direct consequence of Lemmas 2.1 and 2.2, characterizes the intersection of these two sets.
Lemma 3.1.
Let $G$ be a graph admitting the partial ordering $\mathcal{X} \prec \mathcal{Y}$. Then for any , and , either , or satisfies the following:
-
•
there exists a path with ;
-
•
.
Lemma 3.1 implies that even though $\widehat{E}(\mathcal{X}) \cap \widehat{E}(\mathcal{X} \cup \mathcal{Y})$ may contain more edges than the true edges in $G_{\mathcal{X}\mathcal{Y}}$, these additional edges must be in some special configuration. Based on this insight, our approach examines the edges in $\widehat{E}(\mathcal{X}) \cap \widehat{E}(\mathcal{X} \cup \mathcal{Y})$ incident to each $Y_i$ and removes the spurious edges efficiently. Suppose, for simplicity, that conditional independencies are faithful to the graph $G$. Then, for each spurious edge, there must be some set of variables rendering the corresponding pair conditionally independent. In general, to find such a set of variables, we would need to search among subsets of all remaining variables. However, utilizing the partial ordering, we can significantly reduce the complexity of this search. Specifically, we can restrict the search to the conditional Markov blanket, introduced next for general DAGs.
Definition 3.2 (Conditional Markov Blanket).
Let $Y$ and $\mathcal{S}$ be an arbitrary node and an arbitrary subset of nodes in a DAG $G$, with $Y \notin \mathcal{S}$. The conditional Markov blanket of $Y$ given $\mathcal{S}$, denoted $\mathrm{mb}(Y \mid \mathcal{S})$, is the smallest set of nodes $\mathcal{M}$, disjoint from $\mathcal{S} \cup \{Y\}$, such that for any other set of nodes $\mathcal{W}$ disjoint from $\mathcal{M} \cup \mathcal{S} \cup \{Y\}$,
$$Y \;\perp\; \mathcal{W} \;\mid\; \mathcal{M} \cup \mathcal{S}. \qquad\qquad (4)$$
Our new framework builds on the idea that by limiting the search to the conditional Markov blankets, given the nodes in the previous layer, we can significantly reduce the search space and the size of conditioning sets. Moreover, the conditional Markov blanket can be easily inferred together with and . Coupled with the observations in Lemmas 2.1 and 2.2, especially the fact that conditioning on the nodes in the previous layer does not remove true causal edges, these reductions lead to improvements in both computational and statistical efficiency.
The new framework is summarized in Algorithm 1. The algorithm has two main steps: a screening loop, where supersets of relevant edges are identified by leveraging the partial ordering information, and a searching loop, similar to the one in the PC algorithm, but tailored to searching over the conditional Markov blankets. To this end, we will next show that in order to learn the edges between any node in $\mathcal{X}$ and any node in $\mathcal{Y}$, it suffices to search over subsets of the conditional Markov blankets of nodes in $\mathcal{Y}$ given $\mathcal{X}$.
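The sketch below outlines the two loops of Algorithm 1 for the two-layer case. The exact conditioning sets used by the algorithm are specified by Lemmas 3.1–3.7; here, `screen` and `ci_test` are abstract placeholders, and the particular choice of conditioning on the remaining first-layer nodes plus subsets of the estimated conditional Markov blanket is our reading of the framework, stated as an assumption.

```python
from itertools import combinations

def podag_two_layer(x_idx, y_idx, screen, ci_test, max_level=3):
    """Schematic two-layer PODAG: screening loop followed by searching loop.

    screen(i) returns, for outcome i, a superset of its candidate parents
    in the first layer and a superset of its conditional Markov blanket in
    the second layer (given the first layer).  ci_test(a, b, S) returns
    True if a and b are judged conditionally independent given S.
    """
    # Screening loop: supersets of relevant nodes for each outcome.
    cand_x, cmb_y = {}, {}
    for i in y_idx:
        cand_x[i], cmb_y[i] = screen(i)

    # Searching loop: remove spurious X -> Y edges by conditioning on the
    # remaining first-layer nodes plus subsets (of increasing size, as in
    # the PC algorithm) of the conditional Markov blanket.
    edges = {(j, i) for i in y_idx for j in cand_x[i]}
    for level in range(max_level + 1):
        for (j, i) in list(edges):
            base = [k for k in x_idx if k != j]
            for S in combinations(cmb_y[i], level):
                if ci_test(j, i, base + list(S)):
                    edges.discard((j, i))
                    break
    return edges
```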
3.3 Graph Identification using the Conditional Markov Blanket
The next lemma characterizes key properties of the conditional Markov blanket, which facilitate efficient learning of the graph $G_{\mathcal{X}\mathcal{Y}}$. Specifically, we show that if a distribution satisfies the intersection property of conditional independence—i.e., if $U_1 \perp U_2 \mid (U_3, U_4)$ and $U_1 \perp U_3 \mid (U_2, U_4)$, then $U_1 \perp (U_2, U_3) \mid U_4$—then all Markov blankets and conditional Markov blankets are unique.
Lemma 3.3.
Let be a set of random variables with joint distribution . Suppose the intersection property of conditional independence holds in . Then, for any variable , there exists a unique minimal Markov blanket . Moreover, for any , is also uniquely defined as .
We next define the faithfulness assumption needed for correct causal discovery from partial orderings. This assumption is trivially weaker than the general notion of strong faithfulness (Zhang and Spirtes, 2002).
Definition 3.4 (Layering-adjacency-faithfulness).
Let be a layering of random variables with joint distribution . We say is layering-adjacency-faithful to a DAG with respect to the layering if for all and , if , then and are (i) dependent conditional on ; and (ii) dependent conditional on for any .
Our main result, given below, describes how conditional Markov blankets can be used to effectively incorporate the knowledge of partial ordering into DAG learning and reduce the computational cost of learning causal effects of s on s.
Theorem 3.5.
For a probability distribution that is Markov and layering-adjacency-faithful with respect to , a pair of nodes and are non-neighbors in if and only if there exists a set such that .
It follows directly from Theorem 3.5 that the proposed framework in Algorithm 1—i.e., taking the intersection of and and then searching within conditional Markov blankets to remove additional edges—recovers the correct bipartite DAG .
Corollary 3.6.
Algorithm 1 correctly identifies the direct causal effects of the $X$s on the $Y$s.
Next, we show that instead of inferring and separately and taking their intersection, we can use to infer directly. More specifically, noting that Algorithm 1 only relies on and for each , the next lemma shows that given , we can learn and without having to learn separately. This implies that Algorithm 1 need not learn the unconditional Markov blankets, but only the conditional Markov blankets.
Lemma 3.7.
The following hold for each :
-
•
,
-
•
,
-
•
Lemma 3.7 characterizes the conditional dependence sets used in Algorithm 1. Using these characterizations, we can cast the set inference problems in Algorithm 1 as variable selection problems in general regression settings. Let
(5)
Then, and can be obtained by selecting the relevant variables when regressing onto and , respectively. We can then obtain the two sets used in Algorithm 1 from :
There are multiple benefits to directly inferring —i.e., by estimating and —instead of separately inferring and . First, the graph could be hard to estimate in high-dimensional settings, as learning requires performing tests conditioned on variables. In contrast, in (5) only requires performing tests conditioning on at most variables. Moreover, the target is often sparse (see Lemma 3.1) and can thus be efficiently learned in high-dimensional settings. In an extreme example, suppose each node in has exactly one outgoing edge into a node in , and the first nodes in have as their common child . In this case, is fully connected and hard to learn, whereas only has edges.
The second advantage of directly inferring is that, by using the conditional dependence formulation of sets as in Lemma 3.7, we can show that even if the sets and —consequently, and —are not inferred exactly, the algorithm is still correct as long as no false negative errors are made.
Lemma 3.8.
Algorithm 1 exactly recovers the true DAG if and are replaced with arbitrary supersets.
We refer to the framework proposed in Algorithm 1 for learning DAGs from partial orderings as the PODAG framework. In the next section, we discuss specific algorithms for learning the direct causal effects of $X$s on $Y$s. These algorithms utilize the fact that, given the partial ordering of nodes, the conditional Markov blanket of each node $Y_i$ can be found efficiently by testing for conditional dependence between $Y_i$ and other second-layer variables after adjusting for the effect of the nodes in the first layer.
4 Learning High-Dimensional DAGs from Partial Orderings
Coupled with a consistent test of conditional independence, the general framework of Section 3 can be used to learn DAGs from any layering-adjacency-faithful probability distribution (Definition 3.4). This involves two main tasks for each node in the second layer: (a) obtaining a consistent estimate of the set of relevant variables, and (b) obtaining a consistent estimate of the conditional Markov blanket and the conditioning set in the top layer. By Lemma 3.7, this task is equivalent to learning the graph in (5). In this section, we illustrate these steps by presenting efficient algorithms that leverage the partial ordering information to learn bipartite DAGs corresponding to linear structural equation models (SEMs). In this case, existing results on variable selection consistency of network estimation methods can be coupled with a proof similar to that of the PC algorithm (Kalisch and Bühlmann, 2007) to establish consistency of the sample version of the proposed algorithm for learning high-dimensional DAGs from partial orderings.
Suppose, without loss of generality, that the observed random vector $Z = (X_1, \ldots, X_p, Y_1, \ldots, Y_q)^\top$ is centered. In a structural equation model, each coordinate $Z_j$ then solves an equation $Z_j = f_j(Z_{\mathrm{pa}(j)}, \epsilon_j)$, where the $\epsilon_j$ are independent random variables with mean zero and the $f_j$ are unknown functions. In linear SEMs, each $f_j$ is linear:
$$Z_j = \sum_{k \in \mathrm{pa}(j)} \beta_{jk} Z_k + \epsilon_j, \qquad j = 1, \ldots, p + q. \qquad\qquad (6)$$
Specialized to bipartite graphs, Equation 6 can be written compactly as
$$X = A_{\mathcal{X}} X + \epsilon_{\mathcal{X}}, \qquad Y = B X + A_{\mathcal{Y}} Y + \epsilon_{\mathcal{Y}}, \qquad\qquad (7)$$
where, as in (1), $A_{\mathcal{X}}$, $B$, and $A_{\mathcal{Y}}$ are $p \times p$, $q \times p$, and $q \times q$ coefficient matrices, respectively. Our main objective is to estimate the matrix $B$. To this end, we propose statistically and computationally efficient procedures for estimating, for each $Y_i$, the candidate parent set and the conditional Markov blanket, which, as discussed before, are the main ingredients needed in Algorithm 1. We will also discuss how these procedures can be applied in high-dimensional settings, when the number of variables exceeds the sample size.
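For concreteness, the following snippet simulates data from a bipartite linear Gaussian SEM of the form (7); it is only meant to fix ideas and to provide inputs for the screening and searching sketches in this section. The edge probability, the weight range, and the strictly lower-triangular within-layer structure are illustrative assumptions.

```python
import numpy as np

def simulate_bipartite_sem(n, p, q, edge_prob=0.1, seed=0):
    """Simulate from a two-layer linear Gaussian SEM.

    First-layer variables X (p of them) may affect each other only in a
    fixed order; second-layer variables Y (q of them) receive edges from
    X (matrix B) and from earlier Y's (matrix A_y)."""
    rng = np.random.default_rng(seed)

    def weights(rows, cols, lower_tri=False):
        W = rng.uniform(0.5, 1.0, (rows, cols)) * rng.choice([-1, 1], (rows, cols))
        W *= rng.random((rows, cols)) < edge_prob
        if lower_tri:                      # keep acyclicity within a layer
            W = np.tril(W, k=-1)
        return W

    A_x = weights(p, p, lower_tri=True)    # edges within the first layer
    A_y = weights(q, q, lower_tri=True)    # edges within the second layer
    B = weights(q, p)                      # edges from X to Y (the target)

    X = np.zeros((n, p))
    for j in range(p):                     # generate X in causal order
        X[:, j] = X @ A_x[j] + rng.standard_normal(n)
    Y = np.zeros((n, q))
    for i in range(q):                     # generate Y in causal order
        Y[:, i] = X @ B[i] + Y @ A_y[i] + rng.standard_normal(n)
    return X, Y, B
```

Here the returned matrix B plays the role of the target of estimation.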
We next show that whenever the screening loop of Algorithm 1 is successful—that is, when it returns a supergraph of $G_{\mathcal{X}\mathcal{Y}}$—the searching loop can consistently recover the true $G_{\mathcal{X}\mathcal{Y}}$. We start by stating our assumptions, which are the same as those used to establish the consistency of the PC algorithm (Kalisch and Bühlmann, 2007). The only difference is that our ‘faithfulness’ assumption—Assumption 4.3—is weaker than the corresponding assumption for the PC algorithm. The consequences of this relaxation are examined in Section 6.2.
Assumption 4.1 (Maximum reach level).
Suppose there exists some such that and .
Assumption 4.2 (Dimensions).
The dimensions , satisfies for some and .
Assumption 4.3 (-strong layering-adjacency-faithfulness).
The next result establishes the consistency of the searching step.
Theorem 4.4 (Searching step using partial correlation).
Theorem 4.4 shows that, with high probability, the searching loop returns the correct graph as long as the screening loop makes no false negative errors. In the following sections, we discuss various options for the screening step, namely, the estimation of the sets defined in Lemma 3.7.
4.1 Screening in low dimensions
Given multivariate Gaussian observations, the most direct way to identify the sets satisfying the conditional independence relations above is to use the sample partial correlation and reject the hypothesis of conditional independence whenever it exceeds a suitable threshold in absolute value. The next result establishes the screening guarantee for such an approach in low-dimensional settings.
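In practice, this amounts to thresholding sample partial correlations, as sketched below. The two conditioning sets follow our reading of Lemma 3.7 (the remaining first-layer variables for candidate parents, and all remaining variables for the conditional Markov blanket); the threshold tau and the helper names are placeholders rather than the tuned values used in the paper.

```python
import numpy as np

def pcorr(i, j, S, C):
    """Partial correlation of variables i and j given S, from covariance C."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(C[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def screen_low_dim(data, x_idx, y_idx, tau=0.05):
    """Partial-correlation screening of the two sets used in Algorithm 1.

    x_idx, y_idx: lists of column indices for the two layers.
    For each outcome i: keep X_j if |pcorr(Y_i, X_j | X_{-j})| > tau, and
    keep Y_k in the conditional Markov blanket superset if
    |pcorr(Y_i, Y_k | all remaining variables)| > tau.
    """
    C = np.cov(data, rowvar=False)
    cand_x, cmb_y = {}, {}
    for i in y_idx:
        cand_x[i] = [j for j in x_idx
                     if abs(pcorr(i, j, [k for k in x_idx if k != j], C)) > tau]
        cmb_y[i] = [k for k in y_idx if k != i and
                    abs(pcorr(i, k,
                              [v for v in x_idx + y_idx if v not in (i, k)], C)) > tau]
    return cand_x, cmb_y
```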
4.2 Screening in high dimensions
The conditioning sets in and —namely, and —involve a large number of variables. Therefore, in high dimensions, when , it is not feasible to learn and using partial correlations. To overcome this challenge, and using the insight from Lemma 3.7, we treat the screening problem as the selection of nonzero coefficients in linear regressions. This allows us to use tools from high-dimensional estimation, including various regularization approaches, for the screening step.
The success of the screening step requires the event —defined in Theorem 4.4—to hold with high probability. That is, we need to identify supersets of and such that . A simple approach to screen relevant variables is sure independence screening (SIS, Fan and Lv, 2008), which selects the variables with largest marginal association at a given threshold . Defining the SIS estimators
we can obtain the following result following Fan and Lv (2008).
Proposition 4.6.
By using simple marginal associations, SIS offers a simple and efficient strategy for screening; however, it may result in large sets for the searching step. An alternative to SIS screening is to select the sets and using two penalized regressions. For instance, using the lasso penalty, for can be found as
Similarly, can be learned using a lasso regression on a different set of variables; specifically, given any set , we define
The next result establishes the success of the screening step when using lasso estimators in high dimensions.
Proposition 4.7 (Screening with lasso).
The above two high-dimensional screening approaches—SIS and lasso—are two extremes: SIS obtains the estimates of and without adjusting for any other variables, whereas lasso adjusts for all other variables (either from the first layer or the two layers combined). These two extremes represent examples of broader classes of methods based on marginal or joint associations. They differ with respect to computational and sample complexities of screening and searching steps: Compared with SIS, screening with lasso may result in smaller sets for the searching step but may be less efficient for screening. Nevertheless, both approaches are valid choices for screening under milder conditions than those needed for consistent variable selection. This is because the additional assumption required for the screening step is implied by the layering-adjacency-faithfulness assumption, which, as discussed before, is weaker than the strong faithfulness assumption needed for the PC algorithm (see also Section 6.2). While the above two choices seem natural, other intermediates, including conditioning on small sets (similar to Sondhi and Shojaie, 2019) can also be used for screening.
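The two high-dimensional screening options can be sketched as follows, with y the response (a second-layer variable) and Z the matrix of conditioning candidates (the first-layer variables, or all remaining variables, depending on which set is being screened). The liberal SIS p-value cutoff of 0.5 and the AIC-tuned lasso mirror the choices described in Section 6; the function names and the use of scikit-learn's LassoLarsIC are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoLarsIC

def sis_screen(y, Z, alpha=0.5):
    """Sure independence screening: keep columns of Z whose marginal
    correlation with y is significant at level alpha (a deliberately
    liberal threshold, to avoid false negatives)."""
    n = len(y)
    keep = []
    for j in range(Z.shape[1]):
        r = np.corrcoef(y, Z[:, j])[0, 1]
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 3)
        if 2 * (1 - norm.cdf(abs(z))) < alpha:
            keep.append(j)
    return keep

def lasso_screen(y, Z):
    """Lasso screening with the tuning parameter chosen by AIC, as in the
    numerical studies; the indices of nonzero coefficients are retained."""
    fit = LassoLarsIC(criterion="aic").fit(Z, y)
    return list(np.flatnonzero(fit.coef_))
```

In either case, the output is used only as a superset for the searching loop, so erring on the side of including extra variables is preferable to missing relevant ones.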
4.3 Other Approaches
Together with Propositions 4.5, 4.6, or 4.7, Theorem 4.4 establishes the consistency of Algorithm 1 for linear SEMs in low- and high-dimensional settings. While we only considered the simple case of Gaussian noise, Algorithm 1 and the results in this section can be extended to linear SEMs with sub-Gaussian errors (Harris and Drton, 2013). Moreover, many other regularization methods can be used for screening instead of the lasso; the only requirement is that the method is screening-consistent. For instance, inference-based procedures, such as the debiased lasso (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2014), can also fulfill the requirement of our screening step.
Similar ideas can also be used for learning DAGs from other distributions and/or under more general SEM structures. In Appendix B, we outline the generalization of the proposed algorithms to causal additive models (Bühlmann et al., 2014). The Gaussian copula transformation of Liu et al. (2009, 2012) and Xue and Zou (2012) can similarly be used to extend the proposed ideas to non-Gaussian distributions. We leave the detailed exploration of these generalizations to future research.
5 Extensions and Other Considerations
5.1 Learning Edges within Layers
In the previous sections, we focused on the problem of learning edges between the two layers, $\mathcal{X}$ and $\mathcal{Y}$. In particular, Algorithm 1 provides a framework that first finds a superset of the true edges and then uses a searching loop to remove the spurious edges. The same idea can be applied to learning edges within the second layer, $\mathcal{Y}$: as suggested by Lemma 3.3, for each $Y_i$, all of its adjacent nodes are contained in its conditional Markov blanket given $\mathcal{X}$. This suggests that we can learn edges among the $Y$s by simply modifying Algorithm 1 to run the searching loop on all pairs of second-layer nodes instead of only on pairs between $\mathcal{X}$ and $\mathcal{Y}$; a sketch is given below. After recovering the skeleton, edges among the $Y$s can be oriented using Meek's orientation rules (Meek, 1995). Given correct d-separators and skeleton, these rules orient as many edges as possible without making mistakes.
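A minimal sketch of this within-layer extension follows, reusing the conditional-Markov-blanket supersets from the screening loop. Conditioning on the full first layer and searching within the union of the two blankets is our reading of the extension, stated as an assumption, and edge orientation via v-structures and Meek's rules is deliberately omitted.

```python
from itertools import combinations

def within_layer_skeleton(y_idx, x_idx, cmb_y, ci_test, max_level=3):
    """Skeleton of edges within the second layer.

    For each pair of second-layer nodes, search for a separating set inside
    their (estimated) conditional Markov blankets, always conditioning on
    the first layer.  ci_test(a, b, S) returns True if a and b are judged
    conditionally independent given S.
    """
    edges = {tuple(sorted((i, k))) for i in y_idx for k in cmb_y[i] if k != i}
    for level in range(max_level + 1):
        for (i, k) in list(edges):
            pool = (set(cmb_y[i]) | set(cmb_y[k])) - {i, k}
            for S in combinations(sorted(pool), level):
                if ci_test(i, k, list(x_idx) + list(S)):
                    edges.discard((i, k))
                    break
    return edges
```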
In order to successfully recover within-layer edges from observational data, we need to assume faithfulness within these layers. Following Ramsey et al. (2006), we characterize the faithfulness condition required to learn the DAG via constraint-based algorithms in two parts: (a) adjacency faithfulness, which requires that adjacent nodes are not conditionally independent given any subset of the remaining nodes; and (b) orientation faithfulness, which requires that the parents in any v-structure are not conditionally independent given any set containing their common child. Orientation faithfulness can be defined in our context as follows.
Assumption 5.1 (Within-layer-faithfulness).
For all adjacent pairs , it holds that for any set . Also, for any unshielded triple with , if , then the variables corresponding to and are dependent given any subset of that contains ; otherwise, the variables corresponding to and are dependent given any subset of that does not contain .
Given the additional information from the partial ordering, to describe the graphical object learned by the new framework, we need to introduce a new notion of equivalence. This is because the Markov equivalence class of DAGs can be reduced by using the known partial ordering. For example, if the only conditional independence relation among three variables is , but we know , then the edge is identifiable. We define partial-ordering Markov equivalence simply as Markov equivalence restricted to DAGs consistent with the partial ordering. This equivalence class can be represented by a maximally oriented partial DAG, or maximal PDAG (Perkovic et al., 2018). With this notion, the target when learning within- and between-layer edges is the maximal PDAG of the true DAG given the background information provided by the partial ordering.
5.2 Directed Graphs with Multiple Layers
The theory and algorithm developed in Section 4 can be extended to scenarios with multiple layers. To facilitate this extension, we introduce a general representation of the problem. Suppose is a random vector following a distribution that is Markov with respect to a graph . Suppose admits a partial ordering where . Parallel to the notation in the 2-layer case, we define, for each , ,
The next lemma generalizes Lemma 3.1.
Lemma 5.3.
For any graph admitting the partial ordering , the following hold:
-
•
For each , , a node if and only if or but there exists a path with and .
-
•
For each , , a node if and only if or but .
Lemma 5.3 suggests a general framework for utilizing any layering information to facilitate DAG learning. The multi-layer version of the algorithm is presented in Algorithm 2. The faithfulness condition required for the success of this framework is given below for the set of variables .
Assumption 5.4 (Layering-faithfulness).
For a graph admitting a partial-ordering , we say a distribution is layering-faithful to with respect to if the following hold:
-
•
Adjacency faithfulness: For all non-adjacent pairs with , , , it holds that and for all .
-
•
Orientation faithfulness: For all unshielded triples such that are in the same layer and is in some previous layer, if the configuration of the path is then for all ; otherwise for all .
We note that orientation faithfulness is only needed for triplets when at least two of the three nodes are in the same layer. This is because otherwise the orientation is already implied by partial ordering.
5.3 Weaker Notions of Partial Ordering
Algorithm 2 can be successful even if ordering information is not available for some variables. This is because the idea of using a screening loop and a searching loop to reduce computational burden can be more generally applied. Suppose that for every variable there exist two sets and . Then, we slightly modify the construction of and : for each ,
It is easy to see that Lemma 5.3 holds with this generalized notion. Specifically, this means that Algorithm 2 can also handle the setting in which the node set can be partitioned into disjoint sets , such that the layering information among is known () and contains nodes with no partial ordering information. This generalization allows the algorithm to be applied in settings where, e.g., some of the layers represent confounders, exposures, and mediators, another represents the outcomes, and the remaining set corresponds to variables for which no partial ordering information is available.
6 Numerical Studies
In this section, we compare our proposed framework in Algorithm 1, PODAG, with the PC algorithm and its ordering-aware variant introduced in Section 3, PC+. Our numerical analyses comprise three aspects: a comparison of faithfulness conditions; simulation studies comparing the DAG estimation accuracy of the various methods; and an analysis of expression quantitative trait loci (eQTL) data.
6.1 Comparison of Faithfulness Conditions
To assess the faithfulness conditions of different algorithms, we examine the minimal absolute value of detectable partial correlations. Specifically, for a constraint-based structure learning algorithm that tests conditional independencies in the collection of tuples , the quantity
can be viewed as the strength of faithfulness condition required to learn the DAG. A small value of indicates a weaker separation between signal and noise for the method , making it harder to recover the correct DAG using sample partial correlations. On the other hand, a larger value of indicates superior statistical efficiency.
We randomly generated DAGs with nodes and two expected edges per node. For each DAG, we constructed a linear Gaussian SEM with parameters drawn uniformly from . We examined three algorithms, PC, PC+, and PODAG, and computed their corresponding . We also counted the number of conditional independence tests performed by each method as a measure of computational efficiency. The results are shown in Figure 3. It can be seen that the faithfulness requirement for PC is stronger than that for PC+, which is in turn stronger than that for PODAG. The results also show that PC and PC+ require more tests of conditional independence, further reducing their statistical efficiency. Notably, even though PC+ utilizes the partial ordering information, its computational and statistical efficiency are subpar compared with PODAG.

6.2 DAG Estimation Performance
We compare the performance of the proposed PODAG algorithm in learning DAGs from partial orderings with those of the PC and PC+ algorithms. To this end, we consider graphs with vertices and expected edges per node. Edges are randomly generated within and between layers, with a higher probability of connections between layers. A weight matrix for the linear SEM is generated with respect to the graph, with non-zero parameters drawn uniformly from . We generate samples using SEMs with Gaussian errors. The vertex set is split into either or sets, corresponding to equal-size layers.


Figure 4 shows the average true positive and false positive rates (TPR, FPR) of the proposed PODAG algorithm compared with PC and PC+. Three variants of PODAG are presented in the figure, based on different screening methods in the first step: sure independence screening (SIS), lasso, and partial correlation. Regardless of the choice of screening, the results show that by utilizing the partial ordering, PODAG offers significant improvement over both the standard PC and PC+. In these simulations, partial correlation screening seems to offer the best performance. Moreover, the gap between the high-dimensional screening methods (SIS and lasso) and partial correlation narrows as the dimension grows, especially for two-layer networks.
The plots in Figure 4 correspond to different numbers of variables—50, 100, 150—equally divided into either or layers. While the performances are qualitatively similar across different numbers of variables, the plots show significant improvements in the performance of PODAG as the number of layers increases. This is expected, as a larger number of layers corresponds to more external information, which is advantageous for PODAG. The results also suggest that SIS screening may perform better than lasso screening in the five-layer network. However, this is likely due to the choice of tuning parameters for these methods. In the simulations reported here, for SIS screening, marginal associations were screened by calculating p-values using Fisher's transformation of correlations; an edge was then included if the corresponding p-value was less than 0.5 (to avoid false negatives). On the other hand, the tuning parameter for lasso screening was chosen by minimizing the Akaike information criterion (AIC), which balances prediction and complexity. Other approaches to selecting tuning parameters may change the size of the sets obtained from screening, and hence the relative performance of the two methods.
6.3 Quantitative Trait Loci Mapping
We illustrate the applications of PODAG by applying it to learn the bipartite network of interactions among DNA loci and gene expression levels. This problem is at the core of expression quantitative trait loci (eQTL) mapping, which is a powerful approach for identifying sequence variants that alter gene functions (Nica and Dermitzakis, 2013). While many tailored algorithms have been developed to identify DNA loci that affect gene expression levels, the primary approach in biological applications is to focus on cis-eQTLs (i.e., local eQTLs), meaning those DNA loci near the gene of interest (e.g., within 500kb of the gene body). This focus is primarily driven by the limited power in detecting trans-eQTLs (i.e., distant eQTLs) in a genome-wide search. State-of-the-art algorithms also incorporate knowledge of DNA binding sites and/or emerging omics data (Kumasaka et al., 2016; Keele et al., 2020) to obtain more accurate eQTL maps (Sun, 2012).
Regardless of the focus (i.e., cis versus trans regulatory effects), the computational approach commonly used to infer interactions between DNA loci and gene expression levels is based on the marginal association between expression level of each gene and the variability of each DNA locus in the sample. This approach provides an estimate of the graph defined in (2). Since the activities of other DNA loci are not taken into account when inferring the effect of each DNA locus on each gene expression level, the resulting estimate is a superset of , unless an uncorrelated subset of DNA loci, for instance after LD-pruning (Fagny et al., 2017), is considered.

To illustrate the potential advantages of the proposed PODAG algorithm in this setting, we use a yeast expression data set (Brem and Kruglyak, 2005), containing eQTL data for 112 segregants, each with 585 shared markers and 5428 target genes. As an illustrative example, and to facilitate visualization, we randomly select DNA loci and gene expression levels. Our analysis focuses on the direct effect of DNA markers on gene expression levels, corresponding to the bipartite network , which can be estimated using PODAG by leveraging the partial ordering of the nodes in the two layers.
The results are presented in Figure 5. As expected, the commonly used estimate of , and the estimate of in (3)—obtained by also regressing on other gene expression levels—both contain additional edges compared with the PODAG estimate, . The intersection of the first two networks, (discussed in Section 2.2), reduces these false positives. It also provides an effective starting point for PODAG, demonstrating the efficiency of the proposed algorithm.
7 Discussion
We presented a new framework for leveraging partial orderings of variables in learning the structure of directed acyclic graphs (DAGs). The new framework reduces the number of conditional independence tests compared with methods that do not leverage this information. It also requires a weaker version of the faithfulness assumption for recovering the underlying DAG.
The proposed framework consists of two main steps: a screening step, where a superset of interactions is identified by leveraging the partial ordering, and a searching step, similar to that used in the PC algorithm (Kalisch and Bühlmann, 2007), that also leverages the partial ordering. The framework is general and can accommodate various estimation procedures in each of these steps. In low-dimensional linear structural equation models, partial correlation is the natural choice. Non-parametric approaches for testing conditional independence (Azadkia and Chatterjee, 2021; Huang et al., 2022; Shi et al., 2021; Shah and Peters, 2020) can generalize this strategy to infer DAGs from partial orderings under minimal distributional assumptions. In this paper, we considered sure independence screening (SIS) and lasso as two options for screening in high dimensions. However, many alternative algorithms can be used for estimation in high dimensions. As an illustration, the algorithm is extended in Appendix B to causal additive models (Bühlmann et al., 2014).
An important issue for the screening step in high dimensions is the choice of tuning parameter for the estimation procedure. Commonly used approaches for selecting tuning parameters either focus on prediction or model selection. Neither is ideal for our screening, where the goal is to obtain a small superset of the relevant variables, which, theoretically, requires no false negatives. To achieve a balance, we used the Akaike information criterion (AIC) to tune the lasso parameter in our numerical studies. Recent advances in high-dimensional inference (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2014) could offer a viable alternative to choosing the tuning parameter and can also lead to more reliable identification of network edges.
Acknowledgements
This work was partially supported by grants from the National Science Foundation and the National Institutes of Health. We thank Dr. Wei Sun for helpful comments on an earlier draft of the manuscript.
References
- Andersson et al. (1997) S. A. Andersson, D. Madigan, and M. D. Perlman. A Characterization of Markov Equivalence Classes for Acyclic Digraphs. The Annals of Statistics, 25(2):505–541, 1997.
- Azadkia and Chatterjee (2021) M. Azadkia and S. Chatterjee. A simple measure of conditional dependence. The Annals of Statistics, 49(6):3070–3102, 2021.
- Brem and Kruglyak (2005) R. B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences, 102(5):1572–1577, 2005.
- Bühlmann et al. (2014) P. Bühlmann, J. Peters, and J. Ernest. Cam: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
- Chakraborty and Shojaie (2022) S. Chakraborty and A. Shojaie. Nonparametric Causal Structure Learning in High Dimensions. Entropy, 24(3):351, Mar. 2022.
- Chen et al. (2019) W. Chen, M. Drton, and Y. S. Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, Dec. 2019.
- Chickering (1996) D. M. Chickering. Learning Bayesian Networks is NP-Complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, pages 121–130. Springer, New York, NY, 1996. doi: 10.1007/978-1-4612-2404-4˙12.
- Fagny et al. (2017) M. Fagny, J. N. Paulson, M. L. Kuijjer, A. R. Sonawane, C.-Y. Chen, C. M. Lopes-Ramos, K. Glass, J. Quackenbush, and J. Platig. Exploring regulation in tissues with eqtl networks. Proceedings of the National Academy of Sciences, 114(37):E7841–E7850, 2017.
- Fan and Lv (2008) J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, Nov. 2008.
- Friedman and Koller (2003) N. Friedman and D. Koller. Being bayesian about network structure. a bayesian approach to structure discovery in bayesian networks. Machine learning, 50(1):95–125, 2003.
- Fu and Zhou (2013) F. Fu and Q. Zhou. Learning sparse causal gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288–300, 2013.
- Ghoshal and Honorio (2018) A. Ghoshal and J. Honorio. Learning linear structural equation models in polynomial time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 1466–1475. PMLR, Mar. 2018.
- Gupta and Kim (2008) S. Gupta and H. W. Kim. Linking structural equation modeling to bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research, 190(3):818–833, 2008.
- Ha and Sun (2020) M. J. Ha and W. Sun. Estimation of high-dimensional directed acyclic graphs with surrogate intervention. Biostatistics, 21(4):659–675, Oct. 2020.
- Han et al. (2016) S. W. Han, G. Chen, M.-S. Cheon, and H. Zhong. Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. Journal of the American Statistical Association, 111(515):1004–1019, 2016.
- Haris et al. (2022) A. Haris, N. Simon, and A. Shojaie. Generalized sparse additive models. Journal of Machine Learning Research, 23(70):1–56, 2022.
- Harris and Drton (2013) N. Harris and M. Drton. PC Algorithm for Nonparanormal Graphical Models. Journal of Machine Learning Research, 14(69):3365–3383, 2013.
- Heckerman (1997) D. Heckerman. Bayesian networks for data mining. Data mining and knowledge discovery, 1(1):79–119, 1997.
- Huang et al. (2022) Z. Huang, N. Deb, and B. Sen. Kernel partial correlation coefficient—a measure of conditional dependence. The Journal of Machine Learning Research, 23(1):9699–9756, 2022.
- Javanmard and Montanari (2014) A. Javanmard and A. Montanari. Hypothesis Testing in High-Dimensional Regression Under the Gaussian Random Design Model: Asymptotic Theory. IEEE Transactions on Information Theory, 60(10):6522–6554, Oct. 2014. ISSN 1557-9654. doi: 10.1109/TIT.2014.2343629.
- Kalisch and Bühlmann (2007) M. Kalisch and P. Bühlmann. Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8(22):613–636, 2007.
- Keele et al. (2020) G. R. Keele, B. C. Quach, J. W. Israel, G. A. Chappell, L. Lewis, A. Safi, J. M. Simon, P. Cotney, G. E. Crawford, W. Valdar, et al. Integrative qtl analysis of gene expression and chromatin accessibility identifies multi-tissue patterns of genetic regulation. PLoS genetics, 16(1):e1008537, 2020.
- Küçükyavuz et al. (2023) S. Küçükyavuz, A. Shojaie, H. Manzour, L. Wei, and H.-H. Wu. Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks. Journal of Machine Learning Research (JMLR), 24:1–38, 2023.
- Kumasaka et al. (2016) N. Kumasaka, A. J. Knights, and D. J. Gaffney. Fine-mapping cellular qtls with rasqual and atac-seq. Nature genetics, 48(2):206–213, 2016.
- Liu et al. (2009) H. Liu, J. Lafferty, and L. Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research, 10(80):2295–2328, 2009.
- Liu et al. (2012) H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, Aug. 2012.
- Maathuis et al. (2018) M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright, editors. Handbook of Graphical Models. CRC Press, Boca Raton, Nov. 2018. ISBN 978-0-429-46397-6. doi: 10.1201/9780429463976.
- Manzour et al. (2021) H. Manzour, S. Küçükyavuz, H.-H. Wu, and A. Shojaie. Integer Programming for Learning Directed Acyclic Graphs from Continuous Data. INFORMS Journal on Optimization, 3(1):46–73, Jan. 2021.
- Markowetz and Spang (2007) F. Markowetz and R. Spang. Inferring cellular networks–a review. BMC bioinformatics, 8(6):1–17, 2007.
- Meek (1995) C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, UAI’95, pages 403–410, San Francisco, CA, USA, Aug. 1995. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-385-1.
- Nica and Dermitzakis (2013) A. C. Nica and E. T. Dermitzakis. Expression quantitative trait loci: present and future. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620):20120362, 2013.
- Pearl (2009) J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 978-0-521-89560-6.
- Perkovic et al. (2018) E. Perkovic, J. Textor, M. Kalisch, and M. H. Maathuis. Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs. Journal of Machine Learning Research, 18(220):1–62, 2018.
- Peters et al. (2014a) J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014a.
- Peters et al. (2014b) J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 15(58):2009–2053, 2014b.
- Ramsey et al. (2006) J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pages 401–408, Arlington, Virginia, USA, July 2006. AUAI Press.
- Raskutti and Uhler (2018) G. Raskutti and C. Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018.
- Shah and Peters (2020) R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3):1514–1538, 2020.
- Shi et al. (2021) H. Shi, M. Drton, and F. Han. On azadkia-chatterjee’s conditional dependence coefficient. arXiv preprint arXiv:2108.06827, 2021.
- Shimizu et al. (2006) S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7(72):2003–2030, 2006.
- Shojaie and Michailidis (2010) A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, Sept. 2010.
- Shojaie et al. (2014) A. Shojaie, A. Jauhiainen, M. Kallitsis, and G. Michailidis. Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PloS One, 9(2):e82393, 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0082393.
- Sondhi and Shojaie (2019) A. Sondhi and A. Shojaie. The Reduced PC-Algorithm: Improved Causal Structure Learning in Large Random Networks. Journal of Machine Learning Research, 20(164):1–31, 2019. ISSN 1533-7928.
- Spirtes et al. (2001) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, 2nd Edition, volume 1. The MIT Press, 1 edition, 2001.
- Sun (2012) W. Sun. A statistical framework for eqtl mapping using rna-seq data. Biometrics, 68(1):1–11, 2012.
- Tan and Zhang (2019) Z. Tan and C.-H. Zhang. Doubly penalized estimation in additive regression with high-dimensional data. The Annals of Statistics, 47(5):2567–2600, Oct. 2019.
- van de Geer et al. (2014) S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, June 2014.
- Voorman et al. (2014) A. Voorman, A. Shojaie, and D. Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, Mar. 2014.
- Wainwright (2019) M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2019. ISBN 978-1-108-49802-9. doi: 10.1017/9781108627771.
- Wang and Michailidis (2019) P.-L. Wang and G. Michailidis. Directed acyclic graph reconstruction leveraging prior partial ordering information. In International Conference on Machine Learning, Optimization, and Data Science, pages 458–471. Springer, 2019.
- Wang and Drton (2020) Y. S. Wang and M. Drton. High-dimensional causal discovery under non-Gaussianity. Biometrika, 107(1):41–59, Mar. 2020. ISSN 0006-3444.
- Xue and Zou (2012) L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571, Oct. 2012.
- Yu et al. (2023) S. Yu, M. Drton, and A. Shojaie. Directed graphical models and causal discovery for zero-inflated data. In Conference on Causal Learning and Reasoning, pages 27–67. PMLR, 2023.
- Zhang et al. (2013) B. Zhang, C. Gaiteri, L.-G. Bodea, Z. Wang, J. McElwee, A. A. Podtelezhnikov, C. Zhang, T. Xie, L. Tran, R. Dobrin, et al. Integrated systems approach identifies genetic nodes and networks in late-onset alzheimer’s disease. Cell, 153(3):707–720, 2013.
- Zhang and Zhang (2014) C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217–242, 2014.
- Zhang and Spirtes (2002) J. Zhang and P. Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, UAI’03, pages 632–639, San Francisco, CA, USA, Aug. 2002. Morgan Kaufmann Publishers Inc. ISBN 978-0-12-705664-7.
Appendix A Proofs
A.1 Proof of Results in Section 2
Proof of Lemma 2.1.
The claim in (i) follows directly from the definition of d-separation. In particular, if then the path from to will not be d-separated by , and hence .
To prove (ii), note that implies that cannot include any s. Thus, without loss of generality, suppose that
where is permitted. Now, considering that is an ancestor of , by faithfulness of with respect to , in order for any set to d-separate and , it must include at least one of the variables . Noting that the construction used in only adjusts for , and does not adjust for any s, the path from to will not be d-separated by . It follows from the definition of d-separation that and are conditionally dependent given , which means that . ∎
Proof of Lemma 2.2.
The result again follows from the definition of d-separation for DAGs and the faithfulness of the distribution. In particular, under the stated condition the conditioning set does not d-separate the two nodes, and hence the corresponding conditional dependence holds.
Next, if the three nodes form an open collider, i.e., if both outer nodes point into the middle node, then the above argument implies that each outer node is conditionally dependent on the collider node. On the other hand, since the middle node is a common descendant of the two outer nodes, d-separation implies that the outer nodes are marginally independent, and more generally independent given any set containing neither the collider nor its descendants. The claim follows. ∎
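As a small numerical illustration of the collider phenomenon used in this argument (purely illustrative and not part of the proof; all quantities below are generic), the following Python snippet simulates two independent causes of a common child and checks that they are marginally uncorrelated yet become correlated after conditioning on the child via residualization.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x and y are marginally independent; z is their common child (a collider).
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.5 * rng.normal(size=n)

# Marginal correlation of x and y is (essentially) zero.
print(np.corrcoef(x, y)[0, 1])

# Partial correlation of x and y given z: correlate the residuals after
# regressing each of x and y on z; conditioning on the collider opens the path.
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print(np.corrcoef(rx, ry)[0, 1])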
A.2 Proof of Results in Section 3
Proof of Lemma 3.3.
Suppose there exist two distinct valid minimal Markov blankets (MBs) of the same node, and consider their intersection and the two set differences. By the intersection property, the two conditional independence relations defining the blankets imply the corresponding independence given the intersection. By the elementary rules of conditional independence, this in turn implies that the intersection is itself a valid MB. But, by minimality, this is only possible when both blankets coincide with their intersection. Hence the minimal MB is unique.
Next we show the claimed equality in two steps. First, we show one inclusion. By the definitions of the conditional Markov blanket and the Markov blanket, the two corresponding conditional independence relations hold. By the intersection property, these together establish a valid Markov blanket contained in the original one. However, since the MB is minimal, this is only possible when the two sets coincide, which gives the first inclusion. Finally, we show the reverse inclusion. By the assumed independence, the candidate set is also a valid conditional Markov blanket. However, since the conditional Markov blanket is minimal, this is only possible when the two sets coincide, which gives the second inclusion. This completes the proof. ∎
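For reference, the intersection property of conditional independence invoked above can be stated, in generic notation, as follows; it holds, for example, for distributions with positive joint densities.

\[
X \perp\!\!\!\perp Y \mid (Z, W)
\quad\text{and}\quad
X \perp\!\!\!\perp W \mid (Z, Y)
\;\;\Longrightarrow\;\;
X \perp\!\!\!\perp (Y, W) \mid Z .
\]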
Proof of Theorem 3.5.
Under layering-adjacency-faithfulness, if two nodes are adjacent, then the corresponding marginal and conditional dependencies hold, so the edge is retained. Therefore the candidate edge set contains all true edges.
By the intersection property, the corresponding conditional dependence is preserved under the conditioning sets considered by the algorithm. Under layering-adjacency-faithfulness, this implies that no false negative errors can be made by the searching loop. Suppose instead that two nodes are non-adjacent; then the corresponding conditional independence must hold, which yields a d-separating set of the form used by the algorithm. This construction provides a d-separator, and hence no false positive errors can be made either. ∎
Proof of Lemma 3.8.
Since both sets are replaced by their supersets, the resulting edge set is still a superset of the true edges. Under layering-adjacency-faithfulness, for any adjacent pair and any conditioning set considered by the algorithm, the corresponding conditional dependence holds, so the searching loop in Algorithm 1 will not make any false negative errors. Therefore, we only need to show that all additional edges can be removed by the searching loop. For any nonadjacent pair, consider an arbitrary set satisfying the stated separation property; then the corresponding conditional independence holds. The existence of this separator implies that the edge between the two nodes can be removed by the searching loop. ∎
A.3 Proof of Results in Section 4
We first state an auxiliary lemma.
Lemma A.1 (Partial correlation and regression coefficient).
Let the coefficients be those of the linear regression of one variable onto a set of other variables, and consider the partial correlation of that variable with a given regressor, conditional on the remaining regressors. Then the regression coefficient of that regressor and the corresponding partial correlation differ only by a positive factor; in particular, one is zero if and only if the other is.
Proof.
Partition the covariance matrix of the variables into blocks corresponding to the regressors and the response. For the first term, the regression coefficients are given by the usual normal equations, so a coefficient vanishes exactly when the corresponding entry of the associated covariance product vanishes. For the second term, the partial correlation is determined, up to a positive factor, by the corresponding off-diagonal entry of the precision matrix, i.e., the inverse of the covariance matrix; therefore it is zero if and only if that entry is zero. Both quantities thus equal zero if and only if the same entry of the precision matrix is zero. ∎
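For concreteness, one standard version of the identity behind Lemma A.1 is the following (stated in generic notation, which may differ from the lemma's own indexing): if the variables have covariance matrix \Sigma with precision matrix \Theta = \Sigma^{-1}, and \beta denotes the coefficient vector from the linear regression of X_j on the remaining variables, then

\[
\beta_k \;=\; -\frac{\Theta_{jk}}{\Theta_{jj}},
\qquad
\rho_{jk\,\cdot\,\mathrm{rest}} \;=\; -\frac{\Theta_{jk}}{\sqrt{\Theta_{jj}\,\Theta_{kk}}},
\]

so the two quantities differ by a positive factor and, in particular, vanish together.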
Proof of Theorem 4.4.
Let the quantities be as in the statement of the theorem, and first suppose that Algorithm 1 is continued until some level bounded as in Assumption 4.1. By Lemma 3 of Kalisch and Bühlmann (2007), the probability of any type I error for a given edge and conditioning set is bounded by an exponential term, and the probability of a false negative error is likewise bounded by an exponential term, for finite constants. In our algorithm, the number of tests performed is bounded as stated, so the total error probability can be bounded by a union bound over all tests.
By this union bound, with high probability, the sample version of the searching loop makes no mistakes up to the maximal search level. Then, by the reasoning in Lemma 5 of Kalisch and Bühlmann (2007), the searching loop also terminates, with high probability, at the same level as the population version. This completes the proof. ∎
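As a rough illustration of the type of conditional independence test underlying the sample version of the searching loop, the sketch below implements a standard Fisher z-transform test of a partial correlation; the statistic, threshold, and function names are generic choices rather than the exact ones analyzed above.

import numpy as np
from scipy.stats import norm

def partial_corr(data, i, j, S):
    """Sample partial correlation of columns i and j given the columns in S."""
    cols = [i, j] + list(S)
    prec = np.linalg.inv(np.cov(data[:, cols], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def fisher_z_test(data, i, j, S, alpha=0.01):
    """Return True if X_i and X_j are judged conditionally independent given X_S."""
    n = data.shape[0]
    r = np.clip(partial_corr(data, i, j, S), -1 + 1e-12, 1 - 1e-12)
    z = 0.5 * np.log((1 + r) / (1 - r))            # Fisher z-transform
    stat = np.sqrt(n - len(S) - 3) * abs(z)        # approximately standard normal under H0
    return stat <= norm.ppf(1 - alpha / 2)

In a PC-style searching loop, such a test would be applied to each remaining edge over conditioning sets of increasing size drawn from the current candidate parent sets.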
Proof of Lemma 4.5.
Proof of Proposition 4.6.
By Theorem 1 of Fan and Lv (2008), each individual selection problem incurs an error with probability bounded by the rate given there. Applying a union bound over the 2 selection problems, the total error probability is
∎
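For illustration, a minimal sketch of sure independence screening (SIS) as invoked above, keeping the covariates with the largest absolute marginal correlations with the response; the function and argument names are placeholders.

import numpy as np

def sis_screen(X, y, d):
    """Sure independence screening: return the indices of the d covariates with
    the largest absolute marginal correlation with the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Marginal correlations between each column of X and y.
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(np.abs(corr))[::-1][:d]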
Proof of Proposition 4.7.
To complete this proof, we first show that under the Gaussian construction, all design matrices satisfy the Restricted Eigenvalue (RE) condition. We then show that this condition implies bounded error of the lasso estimate. Lastly, we show that under Assumption 4.3 the minimal non-zero coefficients are large enough to guarantee successful screening.
The RE condition is a technical condition for the lasso to have bounded error. In particular, a design matrix satisfies RE over a set with parameter if for all such that . Let be the minimum eigenvalue function and the maximum diagonal entry. It is shown in Theorem 7.16 of Wainwright (2019) that for any random design matrix , there exist universal positive constants such that
for all , with probability at least . This inequality implies the RE condition with uniformly for any subset of cardinality at most . Our required rates of and satisfy this sparsity requirement. Therefore, under our assumption on minimum eigenvalues, the matrix satisfies RE with some constant with probability at least . Consequently, the design matrices used to learn and , i.e., and , all satisfy the RE condition.
Next we apply Theorem 7.13 of Wainwright (2019). In particular, for any solution of the lasso problem with regularization level lower bounded as where is the design matrix, it holds that . In our case, this guarantees that all estimation errors are bounded; in particular,
For any Gaussian design with columns standardized to , the Gaussian tail bound guarantees .
To learn with design matrix , we set ; then the errors (in and ) are of the order of with probability at least . A union bound over all regression problems bounds all errors with probability . Similarly, to learn with design matrix , our choice of guarantees a union bound over all regression problems with probability .
Lastly, we show that Assumption 4.3 implies a beta-min condition for the regression problems. Note that
and the last terms are both bounded below by some constant, so that and for some from Assumption 4.2.
The beta-min condition ensures that the lasso successfully selects a superset of the relevant covariates with the error bound specified above. Specifically, we have shown that with probability at least , the lasso solution recovers a superset of and for all . ∎
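As an illustrative sketch of a lasso-based screening regression with a regularization level of order sqrt(log p / n), as used in the proof (the constant c and the use of scikit-learn's Lasso are expository choices rather than the tuned values analyzed above):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_screen(X, y, c=1.0):
    """Screen candidate parents of the response y by running the lasso with a
    regularization level of order sqrt(log p / n) and returning the support."""
    n, p = X.shape
    lam = c * np.sqrt(np.log(p) / n)
    # scikit-learn's `alpha` plays the role of the regularization level lambda.
    fit = Lasso(alpha=lam, fit_intercept=True, max_iter=10_000).fit(X, y)
    return np.flatnonzero(fit.coef_)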
A.4 Proof of Results in Section 5
Proof of Lemma 5.2.
The proof is identical to that of Theorem 3.5. The only additional piece needed for successful recovery of the PDAG is the orientation step. The known partial ordering is a form of background knowledge about edge orientations, which, combined with Meek's orientation rules, returns the maximal PDAG (see Meek, 1995, Problem D). ∎
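For illustration, a minimal sketch of how orientations implied by background knowledge (here, from the partial ordering) can be propagated by the first of Meek's rules; restricting to rule R1 and the simple edge-set representation below are simplifications for exposition.

def meek_rule_1(directed, undirected):
    """Repeatedly apply Meek's rule R1: if a -> b, b - c, and a, c are not
    adjacent, orient b - c as b -> c.  Directed edges are tuples; each
    undirected edge is stored once as a frozenset {u, v}."""
    changed = True
    while changed:
        changed = False
        for (a, b) in list(directed):
            for e in list(undirected):
                if b in e:
                    c = next(iter(e - {b}))
                    adjacent = ((a, c) in directed or (c, a) in directed
                                or frozenset({a, c}) in undirected)
                    if c != a and not adjacent:
                        undirected.remove(e)
                        directed.add((b, c))
                        changed = True
    return directed, undirected

# Example: 1 -> 2 known from the partial ordering, 2 - 3 undirected, and 1, 3
# non-adjacent, so R1 orients 2 -> 3.
d, u = meek_rule_1({(1, 2)}, {frozenset({2, 3})})
print(d)  # contains (1, 2) and (2, 3)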
Proof of Lemma 5.3.
Proof of Theorem 5.5.
By Lemma 5.3, the result of the screening step is a superset of the edges in . Under the adjacency-faithfulness assumption, all conditional independencies checked by the algorithm correspond to d-separations, and therefore the correct skeleton is recovered. Finally, given correct d-separation relations, the orientation rules are complete and yield the maximal PDAG (Perkovic et al., 2018). ∎
Appendix B Causal Additive Models
In this section, we outline how the proposed framework can be applied to the more flexible class of causal additive models (CAMs) (Bühlmann et al., 2014), that is, SEMs that are jointly additive:
X_j = \sum_{k \in \mathrm{pa}(j)} f_{jk}(X_k) + \epsilon_j, \qquad j = 1, \ldots, p. \qquad (9)
As with the linear SEM in the main paper, we discuss the searching and screening steps separately.
For the searching step, we can use a general test of conditional independence. Here, we adopt the framework of Chakraborty and Shojaie (2022), based on the conditional distance covariance (CdCov). For two random vectors and , their CdCov given is defined as
We define . It is easy to see that if and only if .
Let be some kernel function where is the diagonal matrix determined by bandwidth and denote . We also write and . Define and the symmetric form . We can use a plug-in estimate for :
Following the derivation of Theorem 3.3 in Chakraborty and Shojaie (2022), we can obtain the following result, which we state here without proof.
Proposition B.1 (Searching with CdCov test).
Suppose Assumptions 4.1 and 4.2 hold with . Assume there exists such that for all ,
and the kernel function used to compute is non-negative and uniformly bounded over its support. Assume in addition the faithfulness condition that there exists some such that
where with . Then, conditioned on , the searching loop in Algorithm 1 using the CdCov test returns the correct edge set with probability at least .
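As a rough sketch of a kernel-weighted plug-in estimator of this flavor (the Gaussian kernel, the weighting scheme, and the V-statistic form below follow a generic conditional distance covariance construction and are not necessarily the exact estimator of Chakraborty and Shojaie (2022)):

import numpy as np
from scipy.spatial.distance import cdist

def cdcov2(x, y, z, z0, bandwidth=1.0):
    """Kernel-weighted plug-in estimate of a (squared) conditional distance
    covariance of x and y given Z = z0, using a Gaussian kernel in z.

    x, y, z: arrays of shape (n, d_x), (n, d_y), (n, d_z); z0: (d_z,) point.
    """
    # Gaussian kernel weights on the conditioning variables, normalized to sum to one.
    w = np.exp(-0.5 * np.sum(((z - z0) / bandwidth) ** 2, axis=1))
    w = w / w.sum()

    dx = cdist(x, x)   # pairwise Euclidean distance matrices
    dy = cdist(y, y)

    # Weighted V-statistic form of the (squared) distance covariance.
    s1 = w @ (dx * dy) @ w
    s2 = (w @ dx @ w) * (w @ dy @ w)
    s3 = np.sum(w * (dx @ w) * (dy @ w))
    return s1 + s2 - 2.0 * s3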
Next we discuss the screening step for nonlinear SEMs. We consider the family of functions , where is a matrix whose columns are basis functions used to model the additive components , and is a -vector containing the associated effects. We write as the -th coefficient in . We denote as the concatenated basis functions in . Denote .
In this model, has no direct causal effect on if and only if . Under some regularity conditions, there exists a truncation parameter large enough such that for all and . In high-dimensional problems, we estimate the coefficients with norm penalization. Concretely, for a node , we can estimate and by solving the following problems
where the group lasso penalties are used to jointly penalize the smoothing bases of each variable. This estimator is similar to SPACEJAM (Voorman et al., 2014), with edges disagreeing with the layering information hard-coded as zero. The following result relies on the general error computation in Haris et al. (2022). A similar result can be found in Tan and Zhang (2019).
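As an illustrative sketch of such a group-penalized screening regression (a plain proximal-gradient group lasso on a crude polynomial basis; the basis, step size, and penalty level are placeholder choices rather than those analyzed below):

import numpy as np

def poly_basis(x, M=3):
    """Crude truncated basis for one covariate: x, x^2, ..., x^M (centered)."""
    B = np.column_stack([x ** m for m in range(1, M + 1)])
    return B - B.mean(axis=0)

def group_lasso(groups, y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for the group lasso: one coefficient block per
    candidate parent, with block soft-thresholding as the proximal step."""
    X = np.hstack(groups)
    n = X.shape[0]
    idx = np.cumsum([0] + [G.shape[1] for G in groups])
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * X.T @ (X @ beta - y) / n
        for g in range(len(groups)):              # block soft-thresholding
            b = z[idx[g]:idx[g + 1]]
            nrm = np.linalg.norm(b)
            beta[idx[g]:idx[g + 1]] = 0.0 if nrm <= step * lam else (1 - step * lam / nrm) * b
    return beta

A node would then be screened out as a candidate parent whenever its entire coefficient block is set to zero.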
We next state the smoothness assumption needed on the basis approximation.
Assumption B.2 (Truncated basis).
Suppose there exists such that the additive components are sufficiently smooth in the sense that
uniformly for all , for some .
Next we assume two standard conditions for generalized additive models (GAMs). The compatibility condition may be shown to hold for random designs, but for simplicity we simply assume the conditions.
Assumption B.3 (Compatibility).
Suppose there exists some compatibility constant such that for all and all functions of the form that satisfy , it holds that
for some norm .
Also assume that for all and all functions of the form that satisfy ,
Assumption B.4 (GAM screening).
Denote and suppose . Let . Suppose
Let be an arbitrary function such that . Suppose there exists some constant satisfying such that if and only if . Further suppose that for all . The next result follows directly from the results in Haris et al. (2022).