
Learning Directed Acyclic Graphs From Partial Orderings

Ali Shojaie∗,†, Wenyu Chen
Department of Biostatistics, University of Washington
Abstract

Directed acyclic graphs (DAGs) are commonly used to model causal relationships among random variables. In general, learning the DAG structure is both computationally and statistically challenging. Moreover, without additional information, the direction of edges may not be estimable from observational data. In contrast, given a complete causal ordering of the variables, the problem can be solved efficiently, even in high dimensions. In this paper, we consider the intermediate problem of learning DAGs when a partial causal ordering of variables is available. We propose a general estimation framework for leveraging the partial ordering and present efficient estimation algorithms for low- and high-dimensional problems. The advantages of the proposed framework are illustrated via numerical studies.

$\ast$ The authors contributed equally to this work.
$\dagger$ Corresponding email: [email protected]

1 Introduction

Directed acyclic graphs (DAGs) are widely used to capture causal relationships among components of complex systems (Spirtes et al., 2001; Pearl, 2009; Maathuis et al., 2018). They also form a foundation for causal discovery and inference (Pearl, 2009). Probabilistic graphical models defined on DAGs, known as Bayesian networks (Pearl, 2009), have thus found broad applications in various scientific disciplines, from biology (Markowetz and Spang, 2007; Zhang et al., 2013) and social sciences (Gupta and Kim, 2008), to knowledge representation and machine learning (Heckerman, 1997). However, learning the structure of DAGs from observational data is very challenging due to at least two major factors: First, it may not be possible to infer the direction of edges from observational data alone. In fact, unless the model is identifiable (see, e.g., Peters et al., 2014a), observational data only reveal the structure of the Markov equivalent class of DAGs (Maathuis et al., 2018), captured by a complete partially directed acyclic graph (CPDAG) (Andersson et al., 1997). The second reason is computational—learning DAGs from observational data is an NP-complete problem (Chickering, 1996). In fact, while a few polynomial time algorithms have been proposed for special cases, including sparse graphs (Kalisch and Bühlmann, 2007) or identifiable models (Chen et al., 2019; Ghoshal and Honorio, 2018; Peters et al., 2014b; Wang and Drton, 2020; Shimizu et al., 2006; Yu et al., 2023), existing general-purpose algorithms are not scalable to problems involving many variables.

Figure 1: A directed graph with four nodes.

In spite of the many challenges of learning DAGs in general settings, the problem becomes very manageable if a valid causal ordering among variables is known (Shojaie and Michailidis, 2010). In a valid causal ordering for a DAG $G$ with node set $V$, any node $j$ can appear before another node $k$ (denoted $j\prec k$) only if there is no directed path from $k$ to $j$. Multiple valid causal orderings may exist for a given DAG, as illustrated in the simple example of Figure 1, where both $\mathcal{O}_{1}=\{1\prec 2\prec 3\prec 4\}$ and $\mathcal{O}_{2}=\{1\prec 3\prec 2\prec 4\}$ are valid causal orderings.

Clearly, a known causal ordering of variables resolves any ambiguity about the directions of edges in a DAG and hence addresses the first source of difficulty in estimation of DAGs discussed above. However, this knowledge also significantly simplifies the computation: given a valid causal ordering, DAG learning reduces to a variable selection problem that can be solved efficiently even in the high-dimensional setting (Shojaie and Michailidis, 2010), when the number of variables is much larger than the sample size.

In the simplest case, the idea of Shojaie and Michailidis (2010) is to regress each variable $k$ on all preceding variables in the ordering, $\{j:j\prec k\}$. While simple and efficient, this idea, and its extensions (e.g., Fu and Zhou, 2013; Shojaie et al., 2014; Han et al., 2016), require a complete (or full) causal ordering of variables, i.e., a permutation of the list of variables in the DAG $G$. However, complete causal orderings are rarely available in practice. To relax this assumption, a few recent proposals have combined regularization strategies with algorithms that search over the space of orderings (e.g., Raskutti and Uhler, 2018). These algorithms are more efficient than those searching over the super-exponentially large space of DAGs (Friedman and Koller, 2003). Nonetheless, the computation for these algorithms remains prohibitive for moderate to large size problems (Manzour et al., 2021; Küçükyavuz et al., 2023).

In this paper, we consider the setting where a partial causal ordering of variables is known. This scenario—which is an intermediate between assuming a complete causal ordering as in Shojaie and Michailidis (2010) and making no assumption on causal ordering—occurs commonly in practice. An important example is the problem of identifying direct causal effects of multiple exposures on multiple outcomes (assuming no unmeasured confounders). Formally, let $\mathcal{X}=\{X_{1},\ldots,X_{p}\}$ and $\mathcal{Y}=\{Y_{1},\ldots,Y_{q}\}$ denote the sets of $p$ exposure and $q$ outcome variables, respectively. Then, we have a partial ordering among the $\mathcal{X}$ and $\mathcal{Y}$ variables, namely, $\mathcal{X}\prec\mathcal{Y}$, but we do not have any knowledge of the ordering among the $\mathcal{X}$ or $\mathcal{Y}$ variables themselves. Nonetheless, we are interested in identifying direct causal effects of exposures $\mathcal{X}$ on outcomes $\mathcal{Y}$. This corresponds to learning edges from $\mathcal{X}$ to $\mathcal{Y}$, which form a bipartite graph.

Estimation of DAGs from partial orderings also arises naturally in the analysis of biological systems. For instance, in gene regulatory networks, ‘transcription factors’ are often known a priori and are not expected to be affected by other ‘target genes’. Similar to the previous example, here the set of transcription factors, $\mathcal{X}$, appears before the set of target genes, $\mathcal{Y}$, and the goal is to infer gene regulatory interactions. Similar problems also occur in integrative genomics, including in eQTL mapping (Ha and Sun, 2020).

Despite its importance and many applications, the problem of learning DAGs from partial orderings has not been satisfactorily addressed. In particular, as we show in the next section, various regression-based strategies currently used in applications result in incorrect estimates. As an alternative to these methods, one can use general DAG learning algorithms, such as the PC algorithm (Spirtes et al., 2001; Kalisch and Bühlmann, 2007), to learn the structure of the CPDAG and then orient the edges between $\mathcal{X}$ and $\mathcal{Y}$ according to the known partial ordering. However, such an approach would not utilize the partial ordering in the estimation of edges and is thus inefficient. The recent proposal of Wang and Michailidis (2019) is also not computationally feasible, as it requires searching over all possible orderings of variables.

To overcome the above limitations, we present a new framework for incorporating partial ordering information into DAG learning. To this end, after formulating the problem in Section 2, we also investigate the limitations of existing approaches. Motivated by these findings, we present a new framework in Section 3 and establish the correctness of its population version. Then, in Section 4, we present different estimation strategies for learning low- and high-dimensional DAGs. To simplify the presentation, the main ideas are presented for the special case of two-layer networks corresponding to linear structural equation models (SEMs); the more general version of the algorithm and its extensions are discussed in Section 5 and Appendix B. The advantages of the proposed framework are illustrated through simulation studies and an application in integrative genomics in Section 6.

2 Learning Directed Graphs from Partial Orderings

2.1 Problem Formulation

Consider a DAG $G=(V,E)$ with node set $V$ and edge set $E\subset V\times V$. For the general problem, we assume that $V$ is partitioned into $L$ sets, $V_{1},\ldots,V_{L}$, such that for any $\ell\in\{1,\ldots,L\}$, the nodes in $V_{\ell}$ cannot be parents of nodes in any set $V_{\ell^{\prime}}$, $\ell^{\prime}<\ell$. Such a partition defines a layering of $G$ (Manzour et al., 2021), denoted $V_{1}\prec V_{2}\prec\cdots\prec V_{L}$. In fact, a valid layering can be found for any DAG, though some layers may contain a single node. As such, the notion of layering is general: we make no assumption on the size of each layer, the causal ordering of variables in each set $V_{\ell}$, or interactions among them, except that $G$ is a DAG.

To simplify the presentation, we primarily focus on two-layer, or bipartite, DAGs, and defer the discussion of more general cases to Section 5. Let $V_{1}=\mathcal{X}\equiv\{X_{1},\ldots,X_{p}\}$ be the nodes in the first layer and $V_{2}=\mathcal{Y}\equiv\{Y_{1},\ldots,Y_{q}\}$ be those in the second layer. In the case of causal inference for multiple outcomes discussed in Section 1, $\mathcal{X}$ represents the exposure variables, and $\mathcal{Y}$ represents the outcome variables.

Throughout the paper, we denote individual variables by regular upper case letters, e.g., $X_{k}$ and $Y_{j}$, and sets of variables by calligraphic upper case letters, e.g., $\mathcal{X}_{-k}$ and $\mathcal{X}_{S}$, where $S$ is a set containing more than one index. We also refer to nodes of the network by using both their indices and the corresponding random variables; for instance, $k\in\mathcal{X}$ and $X_{k}$. We denote the parents of a node $j$ in $G$ by $\mathrm{pa}_{j}$, its ancestors and children by $\mathrm{an}_{j}$ and $\mathrm{ch}_{j}$, respectively, and its adjacent nodes by $\mathrm{adj}_{j}$, where $\mathrm{adj}_{j}=\mathrm{pa}_{j}\cup\mathrm{ch}_{j}$. We may also represent the edges of $G$ using its adjacency matrix $\Theta$, which satisfies $\Theta_{jj}=0$ and $\Theta_{jj^{\prime}}\neq 0$ if and only if $j^{\prime}\in\mathrm{pa}_{j}$. For any bipartite DAG with layers $\mathcal{X}$ and $\mathcal{Y}$ ($\mathcal{X}\prec\mathcal{Y}$), we can partition $\Theta$ into the following block matrix

\Theta=\left[\begin{array}{cc}A&0\\ B&C\end{array}\right], \qquad (1)

where the zero constraint on the upper right block of $\Theta$ follows from the partial ordering. Here $A$ and $C$ contain the information on the edges amongst the $X_{k}$ $(k=1,\ldots,p)$ and the $Y_{j}$ $(j=1,\ldots,q)$, respectively. Both of these matrices can be written as lower-triangular matrices (Shojaie and Michailidis, 2010). However, there are generally no constraints on $B$. We denote by $H$ the subgraph of $G$ containing edges from $\mathcal{X}$ to $\mathcal{Y}$ only, i.e., those corresponding to entries in $B$. Our goal is to estimate $H$ using the fact that $\mathcal{X}\prec\mathcal{Y}$.

Figure 2: Toy example illustrating estimation of DAGs from partial orderings in a two-layer network with $\mathcal{X}\prec\mathcal{Y}$, $\mathcal{X}=\{X_{1},X_{2}\}$ and $\mathcal{Y}=\{Y_{1},Y_{2}\}$. (a) The full DAG $G$, with the edges between layers drawn in blue and edges within each layer shown in gray. Here, the true causal relations are linear and the goal is to estimate the bipartite graph $H$ defined by the edges $X_{1}\rightarrow Y_{1}$ and $X_{2}\rightarrow Y_{2}$. (b) Estimate of $H$ using $H^{(0)}$ in (2), obtained by regressing each $Y_{j}$, $j=1,2$, on $\{X_{1},X_{2}\}$ using a linear model with $n=1{,}000$ observations, and drawing an edge if the corresponding coefficient is significant at $\alpha=0.05$; the graph contains a false positive edge, shown by an orange dashed arrow. (c) Estimate of $H$ using $H^{(-j)}$ in (3), obtained by regressing each $Y_{j}$, $j=1,2$, on $\{X_{1},X_{2},Y_{-j}\}$ using a linear regression similar to (b). (d) Estimate of $H$ obtained using the proposed approach.

Figure 2a shows a simple example of a two-layer DAG $G$ with layers consisting of $p=q=2$ nodes. This example illustrates the difference between full and partial causal orderings of variables in a DAG: in this case, the graph admits two full causal orderings, namely $\mathcal{O}_{1}=\{X_{1},X_{2},Y_{1},Y_{2}\}$ and $\mathcal{O}_{2}=\{X_{1},Y_{1},X_{2},Y_{2}\}$; either of these orderings can be used to correctly discover the structure of the graph. Here, the partial ordering of the variables defined by the layering of the graph into sets $\mathcal{X}$ and $\mathcal{Y}$, i.e., $\mathcal{X}\prec\mathcal{Y}$, can be written as $\mathcal{O}^{\prime}=\{\{X_{1},X_{2}\},\{Y_{1},Y_{2}\}\}$. This ordering determines that the $X_{k}$s should appear before the $Y_{j}$s in the causal ordering but does not restrict the ordering within $\{X_{1},X_{2}\}$ and $\{Y_{1},Y_{2}\}$. In this example, only $\mathcal{O}_{1}$ is consistent with the partial ordering $\mathcal{O}^{\prime}$. Moreover, $\mathcal{O}^{\prime}$ is also consistent with $\{X_{2},X_{1},Y_{2},Y_{1}\}$, which is not a valid causal ordering of $G$, indicating that a partial causal ordering is not sufficient for learning the full DAG.

2.2 Failure of Simple Algorithms

In this section, we discuss the challenges of estimating the graph $H$ and why this problem cannot be solved using simple approaches. Since the partial ordering of nodes, $V_{1}\prec V_{2}$, provides information about the direction of causality between the two layers of the network, we focus on simple constraint-based methods (Spirtes et al., 2001), which learn the network edges based on conditional independence relationships among nodes. This requires conditional independence relations among variables to be compatible with the edges in $G$; formally, the joint probability distribution $\mathcal{P}$ needs to be faithful to $G$ (Spirtes et al., 2001). (As discussed in Section 4, our algorithm requires a weaker notion of faithfulness; however, for simplicity, we consider the classical notion of faithfulness in this section.)

Given the partial ordering of variables—which means that $Y_{j}$s cannot be parents of $X_{k}$s—one approach for estimating the bipartite graph $H$ is to draw an edge from $X_{k}$ to $Y_{j}$ whenever $Y_{j}$ is dependent on $X_{k}$ given all other nodes in the first layer, $\mathcal{X}_{-k}\equiv\{X_{k^{\prime}}:k^{\prime}\neq k\}$. Formally, denoting by $Y_{j}\perp\!\!\!\perp X_{k}$ the conditional independence of two variables $Y_{j}$ and $X_{k}$, we define

H^{(0)}\equiv\left\{(k\rightarrow j):Y_{j}\not\perp\!\!\!\perp X_{k}\mid\mathcal{X}_{-k}\right\}, \qquad (2)

to emphasize that the estimate is obtained without conditioning on any $Y_{j^{\prime}}\neq Y_{j}$.

Figure 2b shows the estimated $H^{(0)}$ in the setting where the true causal relationships are linear. In this setting, edges in $H^{(0)}$ represent nonzero coefficients in linear regressions of each $Y_{j}$ on its parents in $G$, $\mathrm{pa}_{j}$. It can be seen that, in this example, the true causal effects from $X_{k}$s to $Y_{j}$s are in fact captured in $H^{(0)}$; these are shown as solid blue lines in Figure 2b. However, this simple example suggests that $H^{(0)}$ may include spurious edges, shown by orange dashed edges in the figure. The next lemma formalizes and generalizes this finding. The proofs of this and other results in the paper are gathered in Appendix A.

Lemma 2.1.

Assume that $\mathcal{P}$ is faithful with respect to $G$. Let $H^{(0)}$ be the directed bipartite graph defined in (2). Then, if $\{X_{1},\ldots,X_{p}\}\prec\{Y_{1},\ldots,Y_{q}\}$,

  (i) $X_{k}\rightarrow Y_{j}\in H^{(0)}$ whenever $X_{k}\rightarrow Y_{j}\in G$;

  (ii) for any path of the form $X_{k_{0}}\rightarrow Y_{j_{1}}\rightarrow\cdots\rightarrow Y_{j_{0}}$ such that $X_{k_{0}}\rightarrow Y_{j_{0}}\notin G$, $H^{(0)}$ will include a false positive edge $X_{k_{0}}\rightarrow Y_{j_{0}}$.

Lemma 2.1 suggests that if $G$ includes edges among the $Y_{j}$s, failing to condition on the $Y_{j}$s can result in falsely detected causal effects from $X_{k}$s to $Y_{j}$s. Thus, without knowledge of interactions among the $Y_{j}$s, one may consider an estimator that adjusts for both $\mathcal{X}_{-k}$ and $\mathcal{Y}_{-j}$ when trying to detect the causal effect of $X_{k}$ on $Y_{j}$; in other words, we may declare an edge from $X_{k}$ to $Y_{j}$ whenever $Y_{j}\not\perp\!\!\!\perp X_{k}\mid\{\mathcal{X}_{-k},\mathcal{Y}_{-j}\}$. We denote the resulting estimate of $H$ by $H^{(-j)}$:

H^{(-j)}=\left\{(k\to j):Y_{j}\not\perp\!\!\!\perp X_{k}\mid\mathcal{X}_{-k}\cup\mathcal{Y}_{-j}\right\}. \qquad (3)

Unfortunately, as Figure 2c shows, estimation based on this model may also include false positive edges. As the next lemma clarifies, the false positive edges in this case are due to conditioning on common descendants of a pair $X_{k}$ and $Y_{j}$ that are not connected in $G$, which is sometimes referred to as Berkson’s paradox (Pearl, 2009).

Lemma 2.2.

Assume that $\mathcal{P}$ is faithful with respect to $G$, and let $H^{(-j)}$ be the directed bipartite graph defined in (3). Then, if $\{X_{1},\ldots,X_{p}\}\prec\{Y_{1},\ldots,Y_{q}\}$, for any $X_{k}\rightarrow Y_{j}\in G$, $X_{k}\rightarrow Y_{j}\in H^{(-j)}$. Moreover, for any triplet of nodes $X_{k_{0}}$, $Y_{j_{0}}$ and $Y_{j_{1}}$ that form an open collider in $G$ (Pearl, 2009), i.e.,

  - $X_{k_{0}}\rightarrow Y_{j_{1}}\leftarrow Y_{j_{0}}$,

  - $X_{k_{0}}\nrightarrow Y_{j_{0}}$,

$H^{(-j)}$ will include a false positive edge from $X_{k_{0}}$ to $Y_{j_{0}}$.

Remark 2.3.

Examining the proofs of Lemmas 2.1 and 2.2, if $C=0$ in (1), then $H^{(0)}=H^{(-j)}=H$. In other words, if $G$ does not include any edges among the $Y_{j}$s, then both $H^{(0)}$ and $H^{(-j)}$ provide valid estimates of $H$.

While the false positive edges in $H^{(0)}$ are caused by failing to condition on necessary variables, the false positive edges in $H^{(-j)}$ are caused by conditioning on extra variables that are not part of the correct causal order of variables. More generally, the partial ordering information does not lead to a simple estimator that correctly identifies direct causal effects of exposures, $X_{k}\,(k=1,\ldots,p)$, on outcomes, $Y_{j}\,(j=1,\ldots,q)$.
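Both failure modes can be reproduced numerically. The following is a minimal simulation sketch (our own illustration, not code from the paper) for a hypothetical linear version of the toy graph in Figure 2, assuming the within-layer edge $Y_{1}\to Y_{2}$ and unit coefficients; the sample size, coefficients, and use of statsmodels are arbitrary choices made for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# Hypothetical linear SEM consistent with the layering {X1, X2} < {Y1, Y2}:
# true between-layer edges X1 -> Y1 and X2 -> Y2, plus a within-layer edge Y1 -> Y2.
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y1 = X1 + rng.normal(size=n)
Y2 = X2 + Y1 + rng.normal(size=n)

# H^(0): regress each Y_j on {X1, X2} only.  The directed path X1 -> Y1 -> Y2
# makes the X1 coefficient in the Y2 regression nonzero, so the spurious edge
# X1 -> Y2 is declared (Lemma 2.1(ii)).
fit_h0 = sm.OLS(Y2, sm.add_constant(np.column_stack([X1, X2]))).fit()
print("H^(0), Y2 ~ X1 + X2, p-values:", fit_h0.pvalues[1:])

# H^(-j): regress each Y_j on {X1, X2, Y_-j}.  Conditioning on Y2, a common
# child of X2 and Y1, opens the collider X2 -> Y2 <- Y1, so the spurious edge
# X2 -> Y1 is declared (Lemma 2.2).
fit_hmj = sm.OLS(Y1, sm.add_constant(np.column_stack([X1, X2, Y2]))).fit()
print("H^(-j), Y1 ~ X1 + X2 + Y2, p-values:", fit_hmj.pvalues[1:])
```

In this sketch, the $X_{1}$ coefficient in the first regression and the $X_{2}$ coefficient in the second are both highly significant even though neither corresponding edge is in $H$, illustrating the kind of false positives depicted by the dashed edges in Figures 2b and 2c.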

Building on the above findings, in the next section we first present a modification of the PC algorithm that leverages the partial ordering information. We then present a general framework for leveraging the partial ordering of variables more effectively, especially when the goal is to learn the graph $H$.

3 Incorporating Partial Orderings into DAG Learning

3.1 A modified PC Algorithm

Given an ideal test of conditional independence, we can estimate $H$ by first learning the skeleton of $G$, using the PC Algorithm (Kalisch and Bühlmann, 2007), and then orienting the edges in $H$ according to the partial ordering of nodes. However, such an algorithm uses the ordering information in a post hoc way—the information is not utilized to learn the edges in $H$.

Alternatively, we can incorporate the partial ordering directly into the PC algorithm. Recall that PC starts from a fully-connected graph and iteratively removes edges by searching for conditional independence relations that correspond to graphical d-separations; to this end, it uses separating sets of increasing sizes, starting with sets of size 0 (i.e., no conditioning) and incrementally increasing the size as the algorithm progresses. To modify the PC algorithm, we use the fact that if two nodes $j$ and $k$ are non-adjacent in a DAG, then for any of their d-separators, say $S$, the set $S\cap\mathrm{an}(\{j,k\})$ is also a d-separator; see Chapter 6 of Spirtes et al. (2001). Given a partial ordering of variables, this rule allows excluding from the PC search step any nodes that are in lower layers than the nodes $j$ and $k$ under consideration. Then, under the assumptions of the PC algorithm, the above modification, which we refer to as PC+, enjoys the same population and sample guarantees.
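As an illustration of this pruning rule, here is a minimal sketch (a hypothetical helper of our own, not code from the paper) that restricts the candidate separating-set members for a pair $(j,k)$ using a layering: nodes whose layer comes strictly after both $j$ and $k$ cannot be ancestors of $\{j,k\}$ and can therefore be dropped from the PC search.

```python
def pcplus_candidate_separators(j, k, neighbors, layer):
    """Candidate separating-set members for the pair (j, k) in PC+.

    neighbors : dict mapping each node to its current set of adjacent nodes.
    layer     : dict mapping each node to its layer index (1 = top layer).

    Standard PC searches over the current neighbors of j and k; PC+ additionally
    drops any node in a layer strictly after both j and k, since such a node
    cannot be an ancestor of {j, k} and hence is never needed to d-separate them.
    """
    max_layer = max(layer[j], layer[k])
    candidates = (neighbors[j] | neighbors[k]) - {j, k}
    return {v for v in candidates if layer[v] <= max_layer}
```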

Given its construction, the partial ordering becomes more informative in PC+ as the number of layers increases: in two-layer settings, for learning the edges in $H$, PC+ is exactly the same as PC and cannot utilize the partial ordering. However, as the results in Section 6.2 show, PC+ remains ineffective even in graphs with a larger number of layers, motivating the development of the framework proposed in the next section.

3.2 A new framework

In this section, we propose a new framework for learning DAGs from partial orderings. The proposed approach is motivated by two key observations in Lemmas 2.1 and 2.2: First, the graphs $H^{(0)}$ and $H^{(-j)}$ include all true causal relationships from nodes $X_{k}$ $(k=1,\ldots,p)$ in the first layer to nodes $Y_{j}$ $(j=1,\ldots,q)$ in the second layer. Second, both graphs may also include additional edges: $H^{(0)}$ due to not conditioning on parents of $Y_{j}$ in $\mathcal{Y}$, and $H^{(-j)}$ due to conditioning on common children of $X_{k}$ and $Y_{j}$.

Let $S^{(0)}_{j}:=\{k:(k\to j)\in H^{(0)}\}$ and $S^{(-j)}_{j}:=\{k:(k\to j)\in H^{(-j)}\}$. The next lemma, which is a direct consequence of Lemmas 2.1 and 2.2, characterizes the intersection of these two sets.

Lemma 3.1.

Let $G$ be a graph admitting the partial ordering $\mathcal{X}\prec\mathcal{Y}$. Then for any $j\in\mathcal{Y}$ and $k\in S^{(0)}_{j}\cap S^{(-j)}_{j}$, either $k\in\mathrm{pa}_{j}$, or $k$ satisfies the following:

  - there exists a path $k\to j^{\prime}\to\cdots\to j$ with $j^{\prime}\in\mathcal{Y}$;

  - $\mathrm{ch}_{j}\cap\mathrm{ch}_{k}\neq\varnothing$.

Lemma 3.1 implies that even though $H^{(0)}\cap H^{(-j)}$ may contain more edges than the true edges in $H$, these additional edges must be in a special configuration. Based on this insight, our approach examines the edges in $S^{(0)}_{j}\cap S^{(-j)}_{j}$, for each $j\in\mathcal{Y}$, to remove the spurious edges efficiently. Let $k\in S^{(0)}_{j}\cap S^{(-j)}_{j}$ and $(k\to j)\notin H$. Suppose, for simplicity, that conditional independencies are faithful to the graph $G$. Then, there must be some set of variables $\mathcal{Z}\subset\mathcal{X}\cup\mathcal{Y}$ such that $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{Z}$. In general, to find such a set of variables, we would need to search among subsets of $\mathcal{X}\cup\mathcal{Y}$. However, utilizing the partial ordering, we can significantly reduce the complexity of this search. Specifically, we can restrict the search to the conditional Markov blanket, introduced next for general DAGs.

Definition 3.2 (Conditional Markov Blanket).

Let $v\in V$ be an arbitrary node and $U\subset V\setminus\{v\}$ an arbitrary subset of nodes in a DAG $G$. The conditional Markov blanket of $v$ given $U$, denoted $\mathrm{cmb}_{U}(v)$, is the smallest set of nodes such that for any other set of nodes $W\subseteq V$,

\mathbb{P}\left\{v\mid\mathrm{cmb}_{U}(v),U,W\right\}=\mathbb{P}\left\{v\mid\mathrm{cmb}_{U}(v),U\right\}. \qquad (4)

Our new framework builds on the idea that by limiting the search to the conditional Markov blankets, given the nodes in the previous layer, we can significantly reduce the search space and the size of conditioning sets. Moreover, the conditional Markov blanket can be easily inferred together with $H^{(0)}$ and $H^{(-j)}$. Coupled with the observations in Lemmas 2.1 and 2.2, especially the fact that conditioning on the nodes in the previous layer does not remove true causal edges, these reductions lead to improvements in both computational and statistical efficiency.

The new framework is summarized in Algorithm 1. The algorithm has two main steps: a screening loop, where supersets of relevant edges are identified by leveraging the partial ordering information, and a searching loop, similar to the one in the PC algorithm, but tailored to searching over the conditional Markov blankets. To this end, we will next show that in order to learn the edges between any node in $\mathcal{X}$ and a node in $\mathcal{Y}$, it suffices to search over subsets of the conditional Markov blankets of nodes in $\mathcal{Y}$.

Input: Observations from variables $\mathcal{X}=\{X_{1},\ldots,X_{p}\}$ and $\mathcal{Y}=\{Y_{1},\ldots,Y_{q}\}$
Output: A set of edges $\widehat{E}_{\mathcal{X}\to\mathcal{Y}}$
/* Screening loop */
for $j\in\mathcal{Y}$ do
    Infer $S^{(0)}_{j}$, $S^{(-j)}_{j}$, and $\mathrm{cmb}_{\mathcal{X}}(j)$;
$\widehat{E}\leftarrow\left\{(k,j):j\in\mathcal{Y},\ k\in S^{(0)}_{j}\cap S^{(-j)}_{j}\right\}$;
/* Searching loop */
for $\ell=0,1,\ldots$ do
    for $(k,j)\in\widehat{E}$ do
        for $T\subseteq\mathrm{cmb}_{\mathcal{X}}(j)$ with $|T|=\ell$ do
            if $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{S^{(0)}_{j}\cap S^{(-j)}_{j}\setminus\{k\}}\cup\mathcal{Y}_{T}$ then remove $(k,j)$ from $\widehat{E}$ and break;
    if no edge can be removed then break;
return $\widehat{E}$.
Algorithm 1: Learning between-layer edges from partial orderings (PODAG)
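The searching loop of Algorithm 1 can be written in a few lines once the screening quantities are available. The following is a minimal Python sketch (ours, for illustration only), assuming the screened sets $S^{(0)}_{j}\cap S^{(-j)}_{j}$ and $\mathrm{cmb}_{\mathcal{X}}(j)$ have already been estimated and that the user supplies a conditional-independence test; the function and argument names are hypothetical.

```python
from itertools import combinations

def podag_search(candidates, cmb, ci_test, max_level=None):
    """Searching loop of Algorithm 1 (PODAG).

    candidates : dict mapping each j in Y to the screened set S_j^(0) ∩ S_j^(-j) (a set of k's).
    cmb        : dict mapping each j in Y to its conditional Markov blanket cmb_X(j).
    ci_test    : callable (k, j, cond_x, cond_y) returning True if
                 X_k is judged independent of Y_j given X_{cond_x} ∪ Y_{cond_y}.
    Returns the estimated set of between-layer edges {(k, j)}.
    """
    edges = {(k, j) for j, ks in candidates.items() for k in ks}
    level = 0
    while True:
        removed_any = False
        for (k, j) in sorted(edges):
            cond_x = candidates[j] - {k}
            # Search over conditioning sets T of size `level` inside cmb_X(j).
            for T in combinations(sorted(cmb[j]), level):
                if ci_test(k, j, cond_x, set(T)):
                    edges.discard((k, j))
                    removed_any = True
                    break
        level += 1
        if not removed_any or (max_level is not None and level > max_level):
            break
    return edges
```

In the population setting, ci_test would be an oracle d-separation test; in the sample linear-Gaussian setting of Section 4, it can be a partial-correlation test with a vanishing threshold, as in Theorem 4.4.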

3.3 Graph Identification using the Conditional Markov Blanket

The next lemma characterizes key properties of the conditional Markov blanket, which facilitate efficient learning of the graph $H$. Specifically, we show that if a distribution satisfies the intersection property of conditional independence—i.e., if $X\perp\!\!\!\perp Y\mid Z\cup S$ and $X\perp\!\!\!\perp Z\mid Y\cup S$, then $X\perp\!\!\!\perp Y\cup Z\mid S$—then all Markov blankets and conditional Markov blankets are unique.

Lemma 3.3.

Let $V$ be a set of random variables with joint distribution $\mathcal{P}$. Suppose the intersection property of conditional independence holds in $\mathcal{P}$. Then, for any variable $v\in V$, there exists a unique minimal Markov blanket $\mathrm{mb}(v)$. Moreover, for any $U\subset V\setminus v$, $\mathrm{cmb}_{U}(v)$ is also uniquely defined as $\mathrm{cmb}_{U}(v)=\mathrm{mb}(v)\setminus U$.

We next define the faithfulness assumption needed for correct causal discovery from partial orderings. This assumption is trivially weaker than the general notion of strong faithfulness (Zhang and Spirtes, 2002).

Definition 3.4 (Layering-adjacency-faithfulness).

Let $\mathcal{X}\prec\mathcal{Y}$ be a layering of random variables $V$ with joint distribution $\mathcal{P}$. We say $\mathcal{P}$ is layering-adjacency-faithful to a DAG $G$ with respect to the layering $\mathcal{X}\prec\mathcal{Y}$ if for all $k\in\mathcal{X}$ and $j\in\mathcal{Y}$ with $k\to j\in G$, $X_{k}$ and $Y_{j}$ are (i) dependent conditional on $\mathcal{X}_{-k}$; and (ii) dependent conditional on $\mathcal{X}_{-k}\cup T$ for any $T\subseteq\mathcal{Y}_{-j}$.

Our main result, given below, describes how conditional Markov blankets can be used to effectively incorporate the knowledge of partial ordering into DAG learning and reduce the computational cost of learning the causal effects of the $X_{k}$s on the $Y_{j}$s.

Theorem 3.5.

For a probability distribution that is Markov and layering-adjacency-faithful with respect to $G$, a pair of nodes $k\in\mathcal{X}$ and $j\in\mathcal{Y}$ are non-adjacent in $G$ if and only if there exists a set $T\subseteq\mathrm{cmb}_{\mathcal{X}}(j)$ such that $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{(S_{j}^{(0)}\cap S^{(-j)}_{j})\setminus k}\cup\mathcal{Y}_{T}$.

It follows directly from Theorem 3.5 that the proposed framework in Algorithm 1—i.e., taking the intersection of $H^{(0)}$ and $H^{(-j)}$ and then searching within conditional Markov blankets to remove additional edges—recovers the correct bipartite DAG $H$.

Corollary 3.6.

Algorithm 1 correctly identifies the direct causal effects of the $X_{k}$s on the $Y_{j}$s.

Next, we show that instead of inferring $H^{(0)}$ and $H^{(-j)}$ separately and taking their intersection, we can use $H^{(0)}$ to infer $H^{(0)}\cap H^{(-j)}$ directly. More specifically, noting that Algorithm 1 only relies on $S^{(0)}_{j}\cap S^{(-j)}_{j}$ and $\mathrm{cmb}_{\mathcal{X}}(j)$ for each $j\in\mathcal{Y}$, the next lemma shows that given $S^{(0)}_{j}$, we can learn $S^{(0)}_{j}\cap S^{(-j)}_{j}$ and $\mathrm{cmb}_{\mathcal{X}}(j)$ without having to learn $S^{(-j)}_{j}$ separately. This implies that Algorithm 1 need not learn the unconditional Markov blankets, but only the conditional Markov blankets.

Lemma 3.7.

The following hold for each $j\in\mathcal{Y}$:

  - $S^{(0)}_{j}=\left\{k\in\mathcal{X}:X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}\right\}$;

  - $S^{(0)}_{j}\cap S^{(-j)}_{j}=\left\{k\in S^{(0)}_{j}:X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{S^{(0)}_{j}\setminus k}\cup\mathcal{Y}_{-j}\right\}$;

  - $\mathrm{cmb}_{\mathcal{X}}(j)=\left\{\ell\in\mathcal{Y}_{-j}:Y_{j}\not\perp\!\!\!\perp Y_{\ell}\mid\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-\{j,\ell\}}\right\}$.

Lemma 3.7 characterizes the conditional dependence sets used in Algorithm 1. Using these characterizations, we can cast the set inference problems in Algorithm 1 as variable selection problems in general regression settings. Let

S^{(1)}_{j}=\left\{Z\in\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}:Y_{j}\not\perp\!\!\!\perp Z\mid\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}\setminus Z\right\}. \qquad (5)

Then, $S^{(0)}_{j}$ and $S^{(1)}_{j}$ can be obtained by selecting the relevant variables when regressing $Y_{j}$ onto $\mathcal{X}$ and onto $\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}$, respectively. We can then obtain the two sets used in Algorithm 1 from $S^{(1)}_{j}$:

S^{(0)}_{j}\cap S^{(-j)}_{j}=S^{(1)}_{j}\cap\mathcal{X}\quad\text{and}\quad\mathrm{cmb}_{\mathcal{X}}(j)=S^{(1)}_{j}\cap\mathcal{Y}.

There are multiple benefits to directly inferring $H^{(0)}\cap H^{(-j)}$—i.e., by estimating $S^{(0)}_{j}$ and $S^{(1)}_{j}$—instead of separately inferring $H^{(0)}$ and $H^{(-j)}$. First, the graph $H^{(-j)}$ could be hard to estimate in high-dimensional settings, as learning $H^{(-j)}$ requires performing tests conditioned on $p+q-2$ variables. In contrast, $S_{j}^{(1)}$ in (5) only requires performing tests conditioned on at most $\max\left\{p-1,\max_{j}|S^{(0)}_{j}|+q-2\right\}$ variables. Moreover, the target $H^{(0)}\cap H^{(-j)}$ is often sparse (see Lemma 3.1) and can thus be efficiently learned in high-dimensional settings. In an extreme example, suppose each node in $\mathcal{X}$ has exactly one outgoing edge into a node in $\mathcal{Y}$, and the first $q-1$ nodes in $\mathcal{Y}$ have $Y_{q}$ as their common child. In this case, $H^{(-j)}$ is fully connected and hard to learn, whereas $H^{(0)}\cap H^{(-j)}$ only has $p$ edges.

The second advantage of directly inferring $H^{(0)}\cap H^{(-j)}$ is that, by using the conditional dependence formulation of the sets as in Lemma 3.7, we can show that even if the sets $S^{(0)}_{j}$ and $S^{(1)}_{j}$—and, consequently, $S^{(0)}_{j}\cap S^{(-j)}_{j}$ and $\mathrm{cmb}_{\mathcal{X}}(j)$—are not inferred exactly, the algorithm is still correct as long as no false negative errors are made.

Lemma 3.8.

Algorithm 1 recovers exactly the true DAG $H$ if $S^{(0)}_{j}$ and $S^{(1)}_{j}$ are replaced with arbitrary supersets.

We refer to the framework proposed in Algorithm 1 for learning DAGs from partial orderings as the PODAG framework. In the next section, we discuss specific algorithms for learning the direct causal effects of the $X_{k}$s on the $Y_{j}$s. These algorithms utilize the fact that, given the partial ordering of nodes in $G$, the conditional Markov blanket of each node $j\in\mathcal{Y}$ can be found efficiently by testing for conditional dependence of $Y_{j}$ and $Y_{j^{\prime}}$, $j^{\prime}\neq j$, after adjusting for the effect of the nodes in the first layer.

4 Learning High-Dimensional DAGs from Partial Orderings

Coupled with a consistent test of conditional independence, the general framework of Section 3 can be used to learn DAGs from any layering-adjacency-faithful probability distribution (Definition 3.4). This involves two main tasks for each $j\in\mathcal{Y}$: (a) obtaining a consistent estimate of the set of relevant variables, $S^{(0)}_{j}$, and (b) obtaining a consistent estimate of the conditional Markov blanket, $\mathrm{cmb}_{\mathcal{X}}(j)$, and the conditioning set in the top layer, $S^{(0)}_{j}\cap S^{(-j)}_{j}$. By Lemma 3.7, this task is equivalent to learning $S^{(1)}_{j}$ in (5). In this section, we illustrate these steps by presenting efficient algorithms that leverage the partial ordering information to learn bipartite DAGs corresponding to linear structural equation models (SEMs). In this case, existing results on variable selection consistency of network estimation methods can be coupled with a proof similar to that of the PC Algorithm (Kalisch and Bühlmann, 2007) to establish consistency of the sample version of the proposed algorithm for learning high-dimensional DAGs from partial orderings.

Suppose, without loss of generality, that the observed random vector $\mathcal{W}=\mathcal{X}\cup\mathcal{Y}=(W_{1},\dots,W_{p+q})$ is centered. In a structural equation model, $\mathcal{W}$ then solves an equation system $W_{j}=f_{j}(\mathcal{W}_{\mathrm{pa}_{j}},\varepsilon_{j})$ for $j=1,\ldots,p+q$, where the $\varepsilon_{j}$ are independent random variables with mean zero and the $f_{j}$ are unknown functions. In linear SEMs, each $f_{j}$ is linear:

W_{j}=\sum_{k\in\mathrm{pa}_{j}}\beta_{jk}W_{k}+\varepsilon_{j},\qquad j=1,\ldots,p+q. \qquad (6)

Specialized to bipartite graphs, Equation 6 can be written compactly as

\begin{pmatrix}\mathcal{X}\\ \mathcal{Y}\end{pmatrix}=\begin{pmatrix}A&0\\ B&C\end{pmatrix}\begin{pmatrix}\mathcal{X}\\ \mathcal{Y}\end{pmatrix}+\varepsilon, \qquad (7)

where, as in (1), $A$, $B$ and $C$ are $p\times p$, $q\times p$ and $q\times q$ coefficient matrices, respectively. Our main objective is to estimate the matrix $B$. To this end, we propose statistically and computationally efficient procedures for estimating the sets $S^{(0)}_{j}$ and $S^{(1)}_{j}$ for all $j\in\mathcal{Y}$, which, as discussed before, are the main ingredients needed in Algorithm 1. We will also discuss how these procedures can be applied in high-dimensional settings, when $p,q\gg n$.
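For concreteness, the following is a minimal sketch (ours; the dimensions, sparsity level, and coefficient ranges are arbitrary illustrative choices) that generates data from the two-layer linear SEM (7), with strictly lower-triangular $A$ and $C$ and an unconstrained $B$.

```python
import numpy as np

def simulate_two_layer_sem(n, p, q, edge_prob=0.2, seed=0):
    """Draw n samples of (X, Y) from the linear SEM (7) with Gaussian errors.

    A (p x p) and C (q x q) are strictly lower triangular, encoding a valid
    causal ordering within each layer; B (q x p) holds the between-layer edges.
    """
    rng = np.random.default_rng(seed)

    def random_coef(rows, cols):
        signs = rng.choice([-1.0, 1.0], size=(rows, cols))
        mags = rng.uniform(0.5, 1.0, size=(rows, cols))
        mask = rng.random((rows, cols)) < edge_prob
        return signs * mags * mask

    A = np.tril(random_coef(p, p), k=-1)   # within-layer edges among X
    C = np.tril(random_coef(q, q), k=-1)   # within-layer edges among Y
    B = random_coef(q, p)                  # between-layer edges X -> Y

    # X = A X + e_X  and  Y = B X + C Y + e_Y, solved row-wise for n samples.
    eX = rng.normal(size=(n, p))
    eY = rng.normal(size=(n, q))
    X = eX @ np.linalg.inv(np.eye(p) - A).T
    Y = (X @ B.T + eY) @ np.linalg.inv(np.eye(q) - C).T
    return X, Y, B

# Example: a small instance with 50 exposures and 50 outcomes.
X, Y, B_true = simulate_two_layer_sem(n=200, p=50, q=50)
```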

We next show that whenever the screening loop of Algorithm 1 is successful—that is, when it returns a supergraph of $H$—the searching loop can consistently recover the true $H$. We start by stating our assumptions, which are the same as those used to establish the consistency of the PC algorithm (Kalisch and Bühlmann, 2007). The only difference is that our ‘faithfulness’ assumption—Assumption 4.3—is weaker than the corresponding assumption for the PC algorithm. The consequences of this relaxation are examined in Section 6.2.

Assumption 4.1 (Maximum reach level).

Suppose there exists some $0<b\leq 1$ such that $h_{n}:=\max_{j\in\mathcal{Y}}|\mathrm{adj}_{j}\cap\mathcal{Y}|=O(n^{1-b})$ and $m_{n}:=\max_{j\in\mathcal{Y}}|S^{(0)}_{j}\cap S^{(-j)}_{j}|=O(n^{1-b})$.

Assumption 4.2 (Dimensions).

The dimensions $p$ and $q$ satisfy $pq^{m_{n}+1}=O(\exp(c_{0}n^{\kappa}))$ for some $0<c_{0}<\infty$ and $0\leq\kappa<1$.

Assumption 4.3 ($\lambda$-strong layering-adjacency-faithfulness).

The distribution of $(\mathcal{X},\mathcal{Y})$ is multivariate Gaussian and the partial correlations satisfy

\inf_{j\in\mathcal{Y},\,k\in\mathcal{X},\,T\subseteq\mathcal{Y}\setminus\{j\}}\Big\{\left|\rho\left(Y_{j},X_{k}\mid\mathcal{Y}_{T}\cup\mathcal{X}_{-k}\right)\right|:\rho\left(Y_{j},X_{k}\mid\mathcal{Y}_{T}\cup\mathcal{X}_{-k}\right)\neq 0\Big\}\geq c_{n},
\sup_{j\in\mathcal{Y},\,k\in\mathcal{X},\,T\subseteq\mathcal{Y}\setminus\{j\}}\Big\{\left|\rho\left(Y_{j},X_{k}\mid\mathcal{Y}_{T}\cup\mathcal{X}_{-k}\right)\right|\Big\}\leq M<1\ \text{ for some }M,

where $c_{n}^{-1}=O(n^{d})$ for some $0<d<\frac{1}{2}\min(b,1-\kappa)$, with $\kappa$ defined in Assumption 4.2 and $b$ as in Assumption 4.1.

The next result establishes the consistency of the searching step.

Theorem 4.4 (Searching step using partial correlation).

Suppose Assumptions 4.1–4.3 hold. Let the event

\mathcal{A}\left(\widehat{S}^{(0)},\widehat{S}^{(1)}\right)=\left\{\forall j\in\mathcal{Y}:\ \widehat{S}^{(0)}_{j}\supseteq S^{(0)}_{j},\ \widehat{S}^{(1)}_{j}\supseteq S^{(1)}_{j},\ |\widehat{S}^{(0)}_{j}\cap\widehat{S}^{(1)}_{j}|\leq n\right\}

denote the success of the screening step. Then, there exists some sequence of thresholds $\alpha_{n}\to 0$ as $n\to\infty$ such that the output $\widehat{H}$ of Algorithm 1, with a test of partial correlation in the searching loop, satisfies

\mathbb{P}\left\{\widehat{H}=H\mid\mathcal{A}\left(\widehat{S}^{(0)},\widehat{S}^{(1)}\right)\right\}=1-O\left(\exp\left(-Cn^{1-2d}\right)\right).

Theorem 4.4 shows that, with high probability, the searching loop returns the correct graph as long as the screening loop makes no false negative errors. In the following sections, we discuss various options for the screening step, namely, the estimation of $S^{(0)}$ and $S^{(1)}$ defined in Lemma 3.7:

S^{(0)}_{j} =\left\{k\in\mathcal{X}:X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}\right\},
S^{(1)}_{j} =\left\{Z\in\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}:Y_{j}\not\perp\!\!\!\perp Z\mid\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}\setminus Z\right\}.

4.1 Screening in low dimensions

Given multivariate Gaussian observations, the most direct way to identify the sets satisfying the conditional independence relations in $S^{(0)}_{j}$ and $S^{(1)}_{j}$ is to use the sample partial correlation $\hat{\rho}$ and reject the hypothesis of conditional independence when $|\hat{\rho}|>\xi$ for some suitable threshold $\xi$. The next result establishes the screening guarantee for such an approach in low-dimensional settings.

Proposition 4.5 (Screening with partial correlations).

Suppose Assumptions 4.1–4.3 hold and $n\gg p+q$. Let

\widehat{S}^{(0)}_{j} =\left\{k\in\mathcal{X}:|\hat{\rho}(Y_{j},X_{k}\mid\mathcal{X}_{-k})|>c_{n}/2\right\},
\widehat{S}^{(1)}_{j} =\left\{Z\in\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}:|\hat{\rho}(Y_{j},Z\mid\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}\setminus Z)|>c_{n}/2\right\}.

Then, $\mathbb{P}\left\{\mathcal{A}\left(\widehat{S}^{(0)},\widehat{S}^{(1)}\right)\right\}=1-O\left(\exp\left(-Cn^{1-2d}\right)\right)$.
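A minimal sketch of this low-dimensional screening step is given below (our illustration; the fixed threshold stands in for the unknown $c_{n}/2$, and in practice a Fisher-z test at a level $\alpha_{n}$ would be used instead). Sample partial correlations are computed from regression residuals, and $\widehat{S}^{(0)}_{j}$ and $\widehat{S}^{(1)}_{j}$ are read off by thresholding.

```python
import numpy as np

def partial_corr(a, b, C):
    """Sample partial correlation of a and b given the columns of C (via residuals)."""
    D = np.column_stack([np.ones(len(a)), C]) if C.shape[1] > 0 else np.ones((len(a), 1))
    ra = a - D @ np.linalg.lstsq(D, a, rcond=None)[0]
    rb = b - D @ np.linalg.lstsq(D, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

def screen_partial_corr(X, Y, j, thresh=0.05):
    """Estimate S_j^(0) and S_j^(1) for outcome node j by thresholding |rho_hat|."""
    n, p = X.shape
    q = Y.shape[1]
    yj = Y[:, j]

    # S_j^(0): k with |rho_hat(Y_j, X_k | X_{-k})| above the threshold.
    S0 = [k for k in range(p)
          if abs(partial_corr(yj, X[:, k], np.delete(X, k, axis=1))) > thresh]

    # S_j^(1): Z in X_{S0} ∪ Y_{-j} with |rho_hat(Y_j, Z | rest of X_{S0} ∪ Y_{-j})| above it.
    others = [l for l in range(q) if l != j]
    Z = np.column_stack([X[:, S0], Y[:, others]])
    labels = [('X', k) for k in S0] + [('Y', l) for l in others]
    S1 = [lab for idx, lab in enumerate(labels)
          if abs(partial_corr(yj, Z[:, idx], np.delete(Z, idx, axis=1))) > thresh]
    return S0, S1
```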

4.2 Screening in high dimensions

The conditioning sets in $S^{(0)}_{j}$ and $S^{(1)}_{j}$—namely, $\mathcal{X}_{-k}$ and $\mathcal{X}_{S^{(0)}_{j}}\cup\mathcal{Y}_{-j}\setminus Z$—involve a large number of variables. Therefore, in high dimensions, when $p,q\gg n$, it is not feasible to learn $S^{(0)}_{j}$ and $S^{(1)}_{j}$ using partial correlations. To overcome this challenge, and using the insight from Lemma 3.7, we treat the screening problem as the selection of non-zero coefficients in linear regressions. This allows us to use tools from high-dimensional estimation, including various regularization approaches, for the screening step.

The success of the screening step requires the event $\mathcal{A}\left(\widehat{S}^{(0)},\widehat{S}^{(1)}\right)$—defined in Theorem 4.4—to hold with high probability. That is, we need to identify supersets of $S^{(0)}_{j}$ and $S^{(1)}_{j}$ such that $\left|\widehat{S}^{(0)}_{j}\cap\widehat{S}^{(1)}_{j}\right|\leq n$. A simple approach to screening relevant variables is sure independence screening (SIS, Fan and Lv, 2008), which selects the variables with the largest marginal associations at a given threshold $t\in(0,1)$. Defining the SIS estimators

\widehat{S}^{(0)}_{j,\mathrm{SIS}}(t) =\left\{1\leq k\leq p:|X_{k}^{\top}Y_{j}|\text{ is among the first }\lceil tn\rceil\text{ largest of all }\mathcal{X}^{\top}Y_{j}\right\},
\widehat{S}^{(1)}_{j,\mathrm{SIS}}(t) =\left\{Z\in\mathcal{X}_{\widehat{S}^{(0)}}\cup\mathcal{Y}_{-j}:|Z^{\top}Y_{j}|\text{ is among the first }\lceil tn\rceil\text{ largest of all }(\mathcal{X}_{\widehat{S}^{(0)}}\cup\mathcal{Y}_{-j})^{\top}Y_{j}\right\},

we can obtain the following result by adapting the arguments of Fan and Lv (2008).

Proposition 4.6.

Suppose Assumptions 4.1–4.3 hold with $\kappa<1-2d$, and the maximum eigenvalue of the covariance matrix of $(\mathcal{X},\mathcal{Y})$ is lower bounded by $cn^{\xi}$. If $2d+\xi<1$, then there exists some $\delta<1-2d-\xi$ such that when $t\sim cn^{-\delta}$ with $c>0$, we have, for some $C>0$,

\mathbb{P}\left[\mathcal{A}\left(\widehat{S}^{(0)}_{\mathrm{SIS}}(t),\widehat{S}^{(1)}_{\mathrm{SIS}}(t)\right)\right]=1-O\left(\exp\left(-Cn^{1-2d}/\log n\right)\right).
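A minimal sketch of the SIS screening step is given below (ours; it assumes the columns of the data matrices are centered, and the keep-fraction $t$ is an arbitrary illustrative value).

```python
import numpy as np

def sis_screen(X, Y, j, t=0.5):
    """SIS screening for outcome node j: keep the ceil(t*n) largest |marginal inner products|."""
    n = X.shape[0]
    k_keep = int(np.ceil(t * n))
    yj = Y[:, j]

    # S_j^(0)_SIS: rank the columns of X by |X_k^T Y_j|.
    scores_x = np.abs(X.T @ yj)
    S0 = sorted(np.argsort(scores_x)[::-1][:k_keep].tolist())

    # S_j^(1)_SIS: rank the columns of [X_{S0}, Y_{-j}] by |Z^T Y_j|.
    others = [l for l in range(Y.shape[1]) if l != j]
    Z = np.column_stack([X[:, S0], Y[:, others]])
    labels = [('X', k) for k in S0] + [('Y', l) for l in others]
    scores_z = np.abs(Z.T @ yj)
    S1 = [labels[i] for i in np.argsort(scores_z)[::-1][:k_keep]]
    return S0, S1
```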

By using only marginal associations, SIS offers a simple and efficient strategy for screening; however, it may result in large sets for the searching step. An alternative to SIS screening is to select the sets $\widehat{S}_{j}^{(0)}$ and $\widehat{S}_{j}^{(1)}$ using two penalized regressions. For instance, using the lasso penalty, $\widehat{S}_{j}^{(0)}$ for $j\in\mathcal{Y}$ can be found as

\widehat{\gamma}_{j} :=\operatorname*{arg\,min}_{\gamma\in\mathbb{R}^{p}}\frac{1}{2n}\left\|Y_{j}-\gamma^{\top}\mathcal{X}\right\|_{2}^{2}+\lambda^{(0)}_{n}\left\|\gamma\right\|_{1},\qquad \widehat{S}^{(0)}_{j}=\left\{k:\widehat{\gamma}_{jk}\neq 0\right\}.

Similarly, $S_{j}^{(1)}$ can be learned using a lasso regression on a different set of variables; specifically, given any set $S\subseteq\mathcal{X}$, we define

\widehat{\theta}_{j}(S) :=\operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|S|+q-1}}\frac{1}{2n}\left\|Y_{j}-\theta^{\top}\left[\mathcal{X}_{S},\mathcal{Y}_{-j}\right]\right\|_{2}^{2}+\lambda^{(1)}_{n}\left\|\theta\right\|_{1},\qquad \widehat{S}^{(1)}_{j}=\left\{k:\widehat{\theta}_{jk}\left(\widehat{S}^{(0)}_{j}\right)\neq 0\right\}.

The next result establishes the success of the screening step when using lasso estimators in high dimensions.

Proposition 4.7 (Screening with lasso).

Suppose Assumptions 4.1–4.3 hold. Also assume that the minimal eigenvalue of $\mathrm{Cov}(\mathcal{X}\cup\mathcal{Y})$ is larger than some constant $\Gamma_{\min}$, and there exists some $\xi>0$ such that $\mathrm{Var}(Z\mid\mathcal{X}\cup\mathcal{Y}\setminus Z)>\xi$ for all $Z\in\mathcal{X}\cup\mathcal{Y}$. Assume $s_{n}=\max_{j}|\mathrm{mb}(j)|=O\left(n^{1-a}\right)$ and that the rate parameters in Assumptions 4.1 and 4.2 satisfy $\kappa\leq\min(a,b)$. Then, lasso estimators with penalization levels $\lambda_{n}^{(0)}\asymp\sqrt{2\log p/n}$ and $\lambda_{n}^{(1)}\asymp\sqrt{2\log(p+q)/n}$ satisfy $\mathbb{P}\left\{\mathcal{A}\left(\widehat{S}^{(0)},\widehat{S}^{(1)}\right)\right\}\to 1$.
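A minimal sketch of the lasso screening step using scikit-learn is given below (ours; the penalty levels follow the orders suggested by Proposition 4.7, but the constants and the library choice are illustrative assumptions). Note that scikit-learn's Lasso uses the same $\frac{1}{2n}$ scaling of the squared-error loss as the displays above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_screen(X, Y, j):
    """Lasso screening for outcome node j; returns S_j^(0), S_j^(1), and the sets of Algorithm 1."""
    n, p = X.shape
    q = Y.shape[1]
    yj = Y[:, j]

    # Step 1: lasso of Y_j on X with penalty of order sqrt(2 log p / n).
    lam0 = np.sqrt(2 * np.log(p) / n)
    gamma = Lasso(alpha=lam0).fit(X, yj).coef_
    S0 = [k for k in range(p) if gamma[k] != 0]

    # Step 2: lasso of Y_j on [X_{S0}, Y_{-j}] with penalty of order sqrt(2 log(p+q) / n).
    others = [l for l in range(q) if l != j]
    Z = np.column_stack([X[:, S0], Y[:, others]])
    labels = [('X', k) for k in S0] + [('Y', l) for l in others]
    lam1 = np.sqrt(2 * np.log(p + q) / n)
    theta = Lasso(alpha=lam1).fit(Z, yj).coef_
    S1 = [labels[i] for i in range(len(labels)) if theta[i] != 0]

    # The two ingredients of Algorithm 1 (Lemma 3.7):
    S0_cap_Smj = [k for (kind, k) in S1 if kind == 'X']   # S_j^(0) ∩ S_j^(-j)
    cmb = [l for (kind, l) in S1 if kind == 'Y']          # cmb_X(j)
    return S0, S1, S0_cap_Smj, cmb
```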

The above two high-dimensional screening approaches—SIS and lasso—are two extremes: SIS obtains the estimates of $S^{(0)}_{j}$ and $S^{(1)}_{j}$ without adjusting for any other variables, whereas lasso adjusts for all other variables (either from the first layer or the two layers combined). These two extremes represent examples of broader classes of methods based on marginal or joint associations. They differ with respect to the computational and sample complexities of the screening and searching steps: compared with SIS, screening with lasso may result in smaller sets for the searching step but may be less efficient for screening. Nevertheless, both approaches are valid choices for screening under milder conditions than those needed for consistent variable selection. This is because the additional assumption required for the screening step is implied by the layering-adjacency-faithfulness assumption, which, as discussed before, is weaker than the strong faithfulness assumption needed for the PC algorithm (see also Section 6.2). While the above two choices seem natural, other intermediates, including conditioning on small sets (similar to Sondhi and Shojaie, 2019), can also be used for screening.

4.3 Other Approaches

Together with Propositions 4.5, 4.6, or 4.7, Theorem 4.4 establishes the consistency of Algorithm 1 for linear SEMs in low- and high-dimensional settings. While we only considered the simple case of Gaussian noise, Algorithm 1 and the results in this section can be extended to linear SEMs with sub-Gaussian errors (Harris and Drton, 2013). Moreover, many other regularization methods can be used for screening instead of the lasso. The only requirement is that the method is screening-consistent. For instance, inference-based procedures, such as the debiased lasso (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2014), can also fulfill the requirement of our screening step.

Similar ideas can also be used for learning DAGs from other distributions and/or under more general SEM structures. In Appendix B, we outline the generalization of the proposed algorithms to causal additive models (Bühlmann et al., 2014). The Gaussian copula transformation of Liu et al. (2009, 2012) and Xue and Zou (2012) can similarly be used to extend the proposed ideas to non-Gaussian distributions. We leave the detailed exploration of these generalizations to future research.

5 Extensions and Other Considerations

5.1 Learning Edges within Layers

In the previous sections, we focused on the problem of learning edges between two layers, $\mathcal{X}$ and $\mathcal{Y}$. In particular, Algorithm 1 provides a framework that first finds a superset of the true edges, and then uses a searching loop to remove the spurious edges. The same idea can be applied to learning edges within the second layer, $\mathcal{Y}$: as suggested by Lemma 3.3, for each $j\in\mathcal{Y}$, all of its adjacent nodes are contained in $S^{(1)}_{j}$. This suggests that we can learn edges in $\mathcal{Y}$ by simply modifying Algorithm 1 to run the searching loop on $\left\{(k,j):k\in S^{(1)}_{j},j\in\mathcal{Y}\right\}$ instead of only on those between $\mathcal{X}$ and $\mathcal{Y}$. After recovering the skeleton, edges among $\mathcal{Y}$ can be oriented using Meek’s orientation rules (Meek, 1995). Given correct d-separators and skeleton, the rules can orient as many edges as possible without making mistakes.

In order to successfully recover within-layer edges from observational data, we need to assume faithfulness within these layers. Following Ramsey et al. (2006), we characterize the faithfulness condition required to learn the DAG via constraint-based algorithms in two parts: (a) adjacency faithfulness, which means that adjacent nodes are not conditionally independent given any subset of the remaining nodes; and (b) orientation faithfulness, which means that the parents in any v-structure are not conditionally independent given any set containing their common child. Orientation faithfulness can be defined in our context as follows.

Assumption 5.1 (Within-layer-faithfulness).

For all adjacent pairs $(j,j^{\prime})\in\mathcal{Y}$, it holds that $Y_{j^{\prime}}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}\cup T$ for any set $T\subseteq\mathcal{Y}\setminus\{j,j^{\prime}\}$. Also, for any unshielded triple $(i,j,k)$ with $j\in\mathcal{Y}$, if $i\to j\leftarrow k$, then the variables corresponding to $i$ and $k$ are dependent given any subset of $\mathcal{X}\cup\mathcal{Y}\setminus\{i,k\}$ that contains $j$; otherwise, the variables corresponding to $i$ and $k$ are dependent given any subset of $\mathcal{X}\cup\mathcal{Y}\setminus\{i,k\}$ that does not contain $j$.

Given the additional information from the partial ordering, we need to introduce a new notion of equivalence to describe the graphical object learned by the new framework. This is because the Markov equivalence class of DAGs can be reduced by using the known partial ordering. For example, if the only conditional independence relation among three variables $X_{i},X_{j},X_{k}$ is $X_{i}\perp\!\!\!\perp X_{k}\mid X_{j}$, but we know $X_{i}\prec\{X_{j},X_{k}\}$, then the edge $X_{j}\to X_{k}$ is identifiable. We define partial-ordering-Markov-equivalence simply as Markov equivalence restricted to the partial ordering. This equivalence class can be represented by a maximally oriented partial DAG, or maximal PDAG (Perkovic et al., 2018). With this notion, we can describe the target of learning within- and between-layer edges as learning the maximal PDAG of the true $G$ given the background information $\mathcal{X}\prec\mathcal{Y}$.

Lemma 5.2 (within-layer edges).

Suppose the conditions for Theorem 4.4 and Assumption 5.1 hold. Then, Algorithm 1 with an additional orientation step by Meek’s rules recovers the maximal PDAG of $G$ given the background information $\mathcal{X}\prec\mathcal{Y}$.

5.2 Directed Graphs with Multiple Layers

The theory and algorithm developed in Section 4 can be extended to scenarios with multiple layers. To facilitate this extension, we introduce a general representation of the problem. Suppose $V$ is a random vector following some distribution that is Markov to a graph $G=(V,E)$, and that $G$ admits a partial ordering $\mathcal{O}=\{V_{1}\prec\cdots\prec V_{L}\}$, where $V=\cup_{\ell=1}^{L}V_{\ell}$. Parallel to the notation in the two-layer case, we define, for each $j\in V_{\ell}$, $1\leq\ell\leq L$,

S^{(0)}_{j}:=\begin{cases}\varnothing&\text{ if }j\in V_{1}\\ \left\{k\in\cup_{i=1}^{\ell-1}V_{i}:V_{k}\not\perp\!\!\!\perp V_{j}\mid V_{(\cup_{i=1}^{\ell-1}V_{i})\setminus\{k\}}\right\}&\text{ otherwise},\end{cases}
S^{(1)}_{j}:=\left\{k\in\left(\cup_{i=1}^{\ell}V_{i}\right)\setminus\{j\}:V_{k}\not\perp\!\!\!\perp V_{j}\mid V_{(S^{(0)}_{j}\cup V_{\ell})\setminus\{k,j\}}\right\}.

The next lemma generalizes Lemma 3.1.

Lemma 5.3.

For any graph $G$ admitting the partial ordering $\mathcal{O}=\{V_{1}\prec\cdots\prec V_{L}\}$, the following hold:

  • For each $j\in V_{\ell}$, $2\leq\ell\leq L$, a node $k\in S^{(0)}_{j}\cap S^{(1)}_{j}$ if and only if $(k,j)\in G$, or $(k,j)\notin G$ but there exists a path $k\to j^{\prime}\to\cdots\to j$ with $j^{\prime}\in V_{\ell}$ and $\textnormal{ch}_{j}\cap\textnormal{ch}_{k}\cap V_{\ell}\neq\varnothing$.

  • For each $j\in V_{\ell}$, $1\leq\ell\leq L$, a node $k\in S^{(1)}_{j}\cap V_{\ell}$ if and only if $k\in\textnormal{adj}_{j}$, or $k\notin\textnormal{adj}_{j}$ but $\textnormal{ch}_{j}\cap\textnormal{ch}_{k}\cap V_{\ell}\neq\varnothing$.
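To make the construction of $S^{(0)}_{j}$ and $S^{(1)}_{j}$ concrete, the following Python sketch computes sample versions of these sets for a linear Gaussian SEM, using Fisher's z-transform of sample partial correlations as the conditional independence test. The function names (`partial_corr`, `dependent`, `screening_sets`), the zero-based layer index, and the significance level are our illustrative choices, not part of the paper's implementation.

```python
import numpy as np
from scipy import stats

def partial_corr(data, i, j, S):
    """Sample partial correlation of columns i and j of `data` given columns S."""
    S = list(S)
    if not S:
        return np.corrcoef(data[:, i], data[:, j])[0, 1]
    Z = np.column_stack([np.ones(len(data)), data[:, S]])
    # Residualize both variables on the conditioning set.
    res_i = data[:, i] - Z @ np.linalg.lstsq(Z, data[:, i], rcond=None)[0]
    res_j = data[:, j] - Z @ np.linalg.lstsq(Z, data[:, j], rcond=None)[0]
    return np.corrcoef(res_i, res_j)[0, 1]

def dependent(data, i, j, S, alpha=0.01):
    """Fisher-z test of rho(i, j | S) = 0; returns True if dependence is declared."""
    n = data.shape[0]
    r = np.clip(partial_corr(data, i, j, S), -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(list(S)) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z))) < alpha

def screening_sets(data, layers, ell, j):
    """S_j^(0) and S_j^(1) for node j in layer V_ell (ell is zero-based;
    `layers` is a list of lists of column indices)."""
    earlier = [k for V in layers[:ell] for k in V]
    S0 = [k for k in earlier
          if dependent(data, k, j, [m for m in earlier if m != k])]
    within = [k for k in layers[ell] if k != j]
    S1 = [k for k in earlier + within
          if dependent(data, k, j, [m for m in set(S0) | set(within) if m != k])]
    return S0, S1
```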

Input : Observations from random variables W𝒫GW\sim\mathcal{P}_{G},
Partial Ordering $\mathcal{O}=\{V_{1}\prec\cdots\prec V_{L}\}$ where $V=\cup_{\ell=1}^{L}V_{\ell}$
Output : An estimated edge set of $G$
1 for $j\in V$ do infer $S^{(0)}_{j}$ and $S^{(1)}_{j}$;
2 $\widehat{E}_{\textnormal{between}}\leftarrow\left\{(k,j):2\leq\ell\leq L,\,j\in V_{\ell},\,k\in S^{(0)}_{j}\cap S^{(1)}_{j}\right\}$;
3 $\widehat{E}_{\textnormal{within}}\leftarrow\left\{(k,j):1\leq\ell\leq L,\,j\in V_{\ell},\,k\in S^{(1)}_{j}\cap V_{\ell}\right\}$;
4 for $d=0,1,\ldots$ do
5   for $\ell=1,\ldots,L$ do
6     for $(k,j)\in\widehat{E}_{\textnormal{between}}\cup\widehat{E}_{\textnormal{within}}$ with $j\in V_{\ell}$ do
7       for $T\subseteq S^{(1)}_{j}\cap V_{\ell}$, $|T|=d$ do
8         if $V_{k}\perp\!\!\!\perp V_{j}\mid V_{\left((S^{(0)}_{j}\cap S^{(1)}_{j})\cup T\right)\setminus\{k\}}$ then remove $(k,j)$ and break;
9   if no edge can be removed then break;
10 Orient all edges in $\widehat{E}_{\textnormal{between}}$ by $\mathcal{O}$; then apply Meek’s rules to orient edges in $\widehat{E}_{\textnormal{within}}$;
return $\widehat{E}_{\textnormal{between}}\cup\widehat{E}_{\textnormal{within}}$.
Algorithm 2 DAG Learning from Partial Orderings (Multiple Layer Setting)

Lemma 5.3 suggests a general framework for utilizing any layering information to facilitate DAG learning. The multi-layer version of the algorithm is presented in Algorithm 2. The faithfulness condition required for the success of this framework is given below for the set of variables 𝒲{{\mathcal{W}}}.
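The following rough sketch assembles the pieces of the sample version of Algorithm 2, reusing the `dependent` and `screening_sets` helpers sketched above. It performs the screening step, the searching loop, and the trivial part of the orientation step (between-layer edges are oriented by $\mathcal{O}$); Meek's rules for within-layer edges are omitted. This is an illustration under the assumptions stated for those helpers, not the paper's reference implementation.

```python
from itertools import combinations

def algorithm2(data, layers, dependent, screening_sets):
    """Rough sample-version sketch of Algorithm 2 (Meek's rules omitted)."""
    node_layer = {j: ell for ell, V in enumerate(layers) for j in V}
    nodes = range(data.shape[1])

    # Screening step (line 1 of Algorithm 2).
    S0, S1 = {}, {}
    for j in nodes:
        S0[j], S1[j] = screening_sets(data, layers, node_layer[j], j)

    # Initial between- and within-layer candidate edges (lines 2-3).
    between = {(k, j) for j in nodes for k in set(S0[j]) & set(S1[j])}
    within = {(k, j) for j in nodes
              for k in set(S1[j]) & set(layers[node_layer[j]]) if k != j}
    edges = between | within

    # Searching loop (lines 4-9): grow the conditioning-set size d until
    # no further edge can be removed.
    d = 0
    while True:
        removed_any = False
        for (k, j) in sorted(edges):
            layer_nbrs = (set(S1[j]) & set(layers[node_layer[j]])) - {k, j}
            base = set(S0[j]) & set(S1[j])
            for T in combinations(sorted(layer_nbrs), d):
                cond = sorted((base | set(T)) - {k})
                if not dependent(data, k, j, cond):
                    edges.discard((k, j))
                    removed_any = True
                    break
        if not removed_any:
            break
        d += 1

    # Orientation (line 10): between-layer edges are oriented by the ordering;
    # orienting within-layer edges would additionally require Meek's rules.
    return edges & between, edges & within
```

Because the conditioning sets are restricted to $(S^{(0)}_{j}\cap S^{(1)}_{j})\cup T$ with $T\subseteq S^{(1)}_{j}\cap V_{\ell}$, the number of tests grows with the sizes of the screened sets rather than with the full neighborhood, which is the source of the computational savings discussed above.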

Assumption 5.4 (Layering-faithfulness).

For a graph $G=(V,E)$ admitting a partial ordering $\mathcal{O}=\{V_{1}\prec\cdots\prec V_{L}\}$, we say a distribution $\mathcal{P}$ is layering faithful to $G$ with respect to $\mathcal{O}$ if the following hold:

  • Adjacency faithfulness: For all adjacent pairs $j,k$ with $j\in V_{\ell}$, $k\in V_{\ell^{\prime}}$, $\ell\geq\ell^{\prime}$, it holds that $W_{j}\not\perp\!\!\!\perp W_{k}\mid\mathcal{W}_{(\cup_{i=1}^{\ell-1}V_{i})\setminus\{k\}}$ and $W_{j}\not\perp\!\!\!\perp W_{k}\mid\mathcal{W}_{((\cup_{i=1}^{\ell-1}V_{i})\cup T)\setminus\{k\}}$ for all $T\subseteq V_{\ell}\setminus\{j,k\}$.

  • Orientation faithfulness: For all unshielded triples $(i,j,k)$ such that $j,k$ are in the same layer $V_{s}$ and $i$ is in some previous layer, if the configuration of the path $(i,j,k)$ is $i\to j\leftarrow k$, then $W_{i}\not\perp\!\!\!\perp W_{k}\mid\mathcal{W}_{T\cup\{j\}}$ for all $T\subseteq\cup_{t=1}^{s}V_{t}\setminus\{i,k\}$; otherwise, $W_{i}\not\perp\!\!\!\perp W_{k}\mid\mathcal{W}_{T}$ for all $T\subseteq\cup_{t=1}^{s}V_{t}\setminus\{i,j,k\}$.

We note that orientation faithfulness is only needed for triples in which at least two of the three nodes are in the same layer; otherwise, the orientation is already implied by the partial ordering.

Theorem 5.5.

Under Assumption 5.4, the population version of Algorithm 2 recovers the maximal PDAG of $G$ given the background knowledge $\mathcal{O}$.

5.3 Weaker Notions of Partial Ordering

Algorithm 2 can be successful even if ordering information is not available for some variables, because the idea of using a screening loop and a searching loop to reduce the computational burden applies more generally. Suppose that for every variable $j\in V$ there exist two (possibly empty) sets $T^{\prec}_{j}\prec j$ and $T^{\succ}_{j}\succ j$ of nodes known to precede and succeed $j$, respectively. Then, we slightly modify the construction of $S_{j}^{(0)}$ and $S_{j}^{(1)}$: for each $j\in V$,

S^{(0)}_{j}:=\left\{k\in T^{\prec}_{j}:W_{k}\not\perp\!\!\!\perp W_{j}\mid\mathcal{W}_{T^{\prec}_{j}\setminus\{k\}}\right\},
S^{(1)}_{j}:=\left\{k\in V\setminus(T^{\prec}_{j}\cup T^{\succ}_{j}\cup\{j\}):W_{k}\not\perp\!\!\!\perp W_{j}\mid\mathcal{W}_{S^{(0)}_{j}}\cup\mathcal{W}_{V\setminus(T^{\prec}_{j}\cup T^{\succ}_{j}\cup\{j,k\})}\right\}.

It is easy to see that Lemma 5.3 holds with this generalized notion. Specifically, this means that Algorithm 2 can also handle the setting in which $V$ can be partitioned into disjoint sets $V_{1},\ldots,V_{L},V^{\prime}$, such that the layering information among $V_{1},\ldots,V_{L}$ is known ($V_{1}\prec V_{2}\prec\cdots\prec V_{L}$) and $V^{\prime}$ contains nodes with no partial ordering information. This generalization allows the algorithm to be applied in settings where, e.g., $V_{1},\ldots,V_{L-1}$ represent confounders, exposures and mediators, $V_{L}$ represents the outcomes, and $V^{\prime}$ corresponds to variables for which no partial ordering information is available.

6 Numerical Studies

In this section, we compare our proposed framework in Algorithm 1, PODAG, with the PC algorithm and its ordering-aware variant introduced in Section 3, PC+. Our numerical analyses comprise three aspects: a comparison of faithfulness conditions; simulation studies comparing the DAG estimation accuracy of the various methods; and an analysis of expression quantitative trait loci (eQTL) data.

6.1 Comparison of Faithfulness Conditions

To assess the faithfulness conditions of different algorithms, we examine the minimal absolute value of detectable partial correlations. Specifically, for a constraint-based structure learning algorithm that tests conditional independencies in the collection of tuples \mathcal{L}, the quantity

ρmin()=min{j,k,S}{|ρ(j,kS)|:ρ(j,kS)0}\rho^{*}_{\min}(\mathcal{L})=\min_{\{j,k,S\}\in\mathcal{L}}\left\{|\rho(j,k\mid S)|:\rho(j,k\mid S)\neq 0\right\}

can be viewed as the strength of the faithfulness condition required to learn the DAG. A small value of $\rho^{*}_{\min}(\mathcal{L})$ indicates a weaker separation between signal and noise for the method performing the tests in $\mathcal{L}$, making it harder to recover the correct DAG using sample partial correlations. Conversely, a larger value of $\rho^{*}_{\min}(\mathcal{L})$ indicates better statistical efficiency.
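The following minimal sketch computes $\rho^{*}_{\min}(\mathcal{L})$ from a model covariance matrix for a given collection $\mathcal{L}$ of tuples $(j,k,S)$, using the precision-matrix characterization of partial correlations (see Lemma A.1 in the appendix). The function names are ours, and a numerical tolerance stands in for exact zeros.

```python
import numpy as np

def pcorr_from_cov(Sigma, j, k, S):
    """Population partial correlation rho(j, k | S) from a covariance matrix."""
    idx = list(S) + [j, k]
    P = np.linalg.inv(Sigma[np.ix_(idx, idx)])     # precision of the marginal
    return -P[-2, -1] / np.sqrt(P[-2, -2] * P[-1, -1])

def rho_min(Sigma, tests, tol=1e-10):
    """Minimal absolute value of the nonzero partial correlations over a
    collection of tests, each given as a tuple (j, k, S)."""
    vals = [abs(pcorr_from_cov(Sigma, j, k, S)) for j, k, S in tests]
    nonzero = [v for v in vals if v > tol]
    return min(nonzero) if nonzero else np.inf
```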

We randomly generated $R=100$ DAGs with $20$ nodes and two expected edges per node. For each DAG, we constructed a linear Gaussian SEM with parameters drawn uniformly from $(-1,-0.1)\cup(0.1,1)$. We examined three algorithms, PC, PC+, and PODAG, and computed the corresponding quantities $\rho_{\min}^{*}(\mathcal{L}_{\text{PC}})$, $\rho_{\min}^{*}(\mathcal{L}_{\text{PC+}})$, and $\rho_{\min}^{*}(\mathcal{L}_{\text{PODAG}})$. We also counted the number of conditional independence tests performed by each method as a measure of computational efficiency. The results are shown in Figure 3. The faithfulness requirement for PC is stronger than that for PC+, which is in turn stronger than that for PODAG. The results also show that PC and PC+ require more conditional independence tests, further reducing their statistical efficiency. Notably, although PC+ utilizes the partial ordering information, its computational and statistical efficiency remain subpar compared with PODAG.
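A sketch of one way to generate the simulated models described above is given below; it is illustrative only, and the exact generator used for the reported results may differ (e.g., in how the expected number of edges per node is mapped to an edge probability).

```python
import numpy as np

def random_dag_sem(p=20, expected_degree=2, n=500, rng=None):
    """Generate a random DAG over a random causal ordering and draw n samples
    from the corresponding linear Gaussian SEM, with nonzero weights drawn
    uniformly from (-1, -0.1) U (0.1, 1)."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(p)
    prob = expected_degree / (p - 1)            # one common convention
    B = np.zeros((p, p))                        # B[k, j] = weight of edge k -> j
    for a in range(p):
        for b in range(a + 1, p):
            if rng.random() < prob:
                k, j = order[a], order[b]       # respect the causal ordering
                B[k, j] = rng.uniform(0.1, 1.0) * rng.choice([-1.0, 1.0])
    eps = rng.standard_normal((n, p))
    X = np.zeros((n, p))
    for b in range(p):                          # generate in causal order
        j = order[b]
        X[:, j] = X @ B[:, j] + eps[:, j]
    return B, X
```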

Figure 3: Comparison of ρmin(PC),ρmin(PC+)\rho_{\min}^{*}(\mathcal{L}_{\text{PC}}),\rho_{\min}^{*}(\mathcal{L}_{\text{PC+}}) and ρmin(PODAG)\rho_{\min}^{*}(\mathcal{L}_{\text{PODAG}}) for recovering the skeleton (left) and the entire DAG (middle). The number of conditional independence tests is shown on the right panel.

6.2 DAG Estimation Performance

We compare the performance of the proposed PODAG algorithm in learning DAGs from partial orderings with those of the PC and PC+ algorithms. To this end, we consider graphs with $50,100,150$ vertices and $3$ expected edges per node. Edges are randomly generated within and between layers, with a higher probability of connections between layers. A weight matrix for the linear SEM is generated according to the graph, with nonzero parameters drawn uniformly from $(-1,-0.1)\cup(0.1,1)$. We generate $n=500$ samples from the SEM with Gaussian errors. The vertex set is split into either $L=2$ or $L=5$ equal-sized layers.

Figure 4: Average false positive and true positive rates (FPR, TPR) for learning DAGs from partial orderings. The edges are generated from random ER graphs with $50$ (left), $100$ (middle), and $150$ (right) nodes and 3 expected edges per node. Samples are drawn from a Gaussian SEM with sample size $n=500$. Partial ordering information is supplied to the algorithms in the form of two layers (first row) or five layers (second row).

Figure 4 shows the average true positive and false positive rates (TPR, FPR) of the proposed PODAG algorithm compared with PC and PC+. Three variants of PODAG are presented in the figure, based on the screening method used in the first step: sure independence screening (SIS), lasso, or partial correlation. Regardless of the choice of screening, the results show that by utilizing the partial ordering, PODAG offers significant improvement over both the standard PC and PC+. In these simulations, partial correlation screening seems to offer the best performance. Moreover, the gap between the high-dimensional screening methods (SIS and lasso) and partial correlation screening narrows as the dimension grows, especially for two-layer networks.

The plots in Figure 4 correspond to different numbers of variables (50, 100, and 150) equally divided into either $L=2$ or $L=5$ layers. While the performances are qualitatively similar across different numbers of variables, the plots show significant improvements in the performance of PODAG as the number of layers increases. This is expected, as a larger number of layers ($L=5$) corresponds to more external information, which is advantageous for PODAG. The results also suggest that SIS screening may perform better than lasso screening in the five-layer network. However, this is likely due to the choice of tuning parameters for these methods. In the simulations reported here, for SIS screening, marginal associations were screened by calculating p-values using Fisher's transformation of correlations; an edge was then included if the corresponding p-value was less than 0.5 (to avoid false negatives). On the other hand, the tuning parameter for lasso screening was chosen by minimizing the Akaike information criterion (AIC), which balances prediction and complexity. Other approaches to selecting tuning parameters may change the size of the sets obtained from screening and, hence, the relative performance of the two methods.
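The two high-dimensional screening rules discussed above can be sketched as follows, assuming scikit-learn and SciPy are available; this illustrates the described tuning choices (a liberal p-value cutoff of 0.5 for SIS and AIC for the lasso) rather than reproducing the exact implementation used in the simulations.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import lasso_path

def sis_screen(X, y, p_cut=0.5):
    """Marginal (SIS-style) screen: keep predictors whose marginal correlation
    with y has a Fisher-z p-value below the deliberately liberal cutoff p_cut."""
    n = len(y)
    r = np.array([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 3)
    pvals = 2 * (1 - stats.norm.cdf(np.abs(z)))
    return np.flatnonzero(pvals < p_cut)

def lasso_aic_screen(X, y):
    """Lasso screen with the penalty chosen by a Gaussian-likelihood AIC."""
    n = len(y)
    Xc, yc = X - X.mean(axis=0), y - y.mean()      # lasso_path fits no intercept
    alphas, coefs, _ = lasso_path(Xc, yc)          # coefs: (n_features, n_alphas)
    best_aic, support = np.inf, np.array([], dtype=int)
    for i in range(len(alphas)):
        beta = coefs[:, i]
        rss = max(np.sum((yc - Xc @ beta) ** 2), 1e-12)
        aic = n * np.log(rss / n) + 2 * np.count_nonzero(beta)
        if aic < best_aic:
            best_aic, support = aic, np.flatnonzero(beta)
    return support
```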

6.3 Quantitative Trait Loci Mapping

We illustrate the applications of PODAG by applying it to learn the bipartite network of interactions among DNA loci and gene expression levels. This problem is at the core of expression quantitative trait loci (eQTL) mapping, which is a powerful approach for identifying sequence variants that alter gene functions (Nica and Dermitzakis, 2013). While many tailored algorithms have been developed to identify DNA loci that affect gene expression levels, the primary approach in biological applications is to focus on cis-eQTLs (i.e., local eQTLs), meaning DNA loci near the gene of interest (e.g., within 500kb of the gene body). This focus is primarily driven by the limited power of a genome-wide search for detecting trans-eQTLs (i.e., distant eQTLs). State-of-the-art algorithms also incorporate knowledge of DNA binding sites and/or emerging omics data (Kumasaka et al., 2016; Keele et al., 2020) to obtain more accurate eQTL maps (Sun, 2012).

Regardless of the focus (i.e., cis versus trans regulatory effects), the computational approach commonly used to infer interactions between DNA loci and gene expression levels is based on the marginal association between the expression level of each gene and the variability at each DNA locus in the sample. This approach provides an estimate of the $H^{0}$ graph defined in (2). Since the activities of other DNA loci are not taken into account when inferring the effect of each DNA locus on each gene expression level, the resulting estimate is a superset of $H^{0}$, unless an uncorrelated subset of DNA loci, for instance after LD-pruning (Fagny et al., 2017), is considered.
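The distinction drawn above can be sketched as follows, reusing the `dependent` helper sketched in Section 5.2: the marginal scan tests each (locus, gene) pair with an empty conditioning set, whereas the $H^{0}$-type estimate conditions each test on the remaining loci $\mathcal{X}_{-k}$ (our reading of the definition in (2), consistent with the proof of Lemma 2.1).

```python
import numpy as np

def marginal_eqtl_scan(X, Y, dependent):
    """Standard marginal eQTL scan: test each (locus k, gene j) pair marginally."""
    p, q = X.shape[1], Y.shape[1]
    data = np.column_stack([X, Y])
    return {(k, j) for k in range(p) for j in range(q)
            if dependent(data, k, p + j, [])}

def h0_estimate(X, Y, dependent):
    """Sketch of an H^0-type estimate: an edge k -> j whenever Y_j and X_k are
    dependent given all the other loci X_{-k} (no gene expressions conditioned on)."""
    p, q = X.shape[1], Y.shape[1]
    data = np.column_stack([X, Y])
    return {(k, j) for k in range(p) for j in range(q)
            if dependent(data, k, p + j, [m for m in range(p) if m != k])}
```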

Figure 5: Estimated quantitative trait mappings for yeast for p=50p=50 randomly selected DNA loci and q=50q=50 randomly selected gene expression levels. The plots show the estimated bipartite graph using the proposed PODAG algorithm, H^\widehat{H}; the H^0\widehat{H}^{0} estimate commonly obtained in eQTL analyses, by regressing the expression levels (nodes in the second layer) on the DNA loci; the H^(j)\widehat{H}^{(-j)} estimate defined in (3) and the intersection of edges in H^0\widehat{H}^{0} and H^(j)\widehat{H}^{(-j)}, which is the starting point of PODAG.

To illustrate the potential advantages of the proposed PODAG algorithm in this setting, we use a yeast expression data set (Brem and Kruglyak, 2005), containing eQTL data for 112 segregants, with 585 shared markers and 5428 target genes. As an illustrative example, and to facilitate visualization, we randomly select $p=50$ DNA loci and $q=50$ gene expression levels. Our analysis focuses on the direct effects of DNA markers on gene expression levels, corresponding to the bipartite network $H$, which can be estimated using PODAG by leveraging the partial ordering of the nodes in the two layers.

The results are presented in Figure 5. As expected, the commonly used estimate of H0H^{0}, and the estimate of H(j)H^{(-j)} in (3)—obtained by also regressing on other gene expression levels—both contain additional edges compared with the PODAG estimate, H^\widehat{H}. The intersection of the first two networks, H^0H^(j)\widehat{H}^{0}\cap\widehat{H}^{(-j)} (discussed in Section 2.2), reduces these false positives. It also provides an effective starting point for PODAG, demonstrating the efficiency of the proposed algorithm.

7 Discussion

We presented a new framework for leveraging partial orderings of variables in learning the structure of directed acyclic graphs (DAGs). The new framework reduces the number of conditional independence tests compared with methods that do not leverage this information. It also requires a weaker version of the faithfulness assumption for recovering the underlying DAG.

The proposed framework consists of two main steps: a screening step, where a superset of interactions is identified by leveraging the partial ordering; and a searching step, similar to that used in the PC algorithm (Kalisch and Bühlmann, 2007), that also leverages the partial ordering. The framework is general and can accommodate various estimation procedures in each of these steps. In low-dimensional linear structural equation models, partial correlation is the natural choice. Non-parametric approaches for testing conditional independence (Azadkia and Chatterjee, 2021; Huang et al., 2022; Shi et al., 2021; Shah and Peters, 2020) can generalize this strategy to infer DAGs from partial orderings under minimal distributional assumptions. In this paper, we considered sure independence screening (SIS) and lasso as two options for screening in high dimensions. However, many alternative algorithms can be used for estimation in high dimensions. As an illustration, the algorithm is extended in Appendix B to causal additive models (Bühlmann et al., 2014).

An important issue for the screening step in high dimensions is the choice of tuning parameter for the estimation procedure. Commonly used approaches for selecting tuning parameters either focus on prediction or model selection. Neither is ideal for our screening, where the goal is to obtain a small superset of the relevant variables, which, theoretically, requires no false negatives. To achieve a balance, we used the Akaike information criterion (AIC) to tune the lasso parameter in our numerical studies. Recent advances in high-dimensional inference (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2014) could offer a viable alternative to choosing the tuning parameter and can also lead to more reliable identification of network edges.

Acknowledgements

This work was partially supported by grants from the National Science Foundation and the National Institutes of Health. We thank Dr. Wei Sun for helpful comments on an earlier draft of the manuscript.

References

  • Andersson et al. (1997) S. A. Andersson, D. Madigan, and M. D. Perlman. A Characterization of Markov Equivalence Classes for Acyclic Digraphs. The Annals of Statistics, 25(2):505–541, 1997.
  • Azadkia and Chatterjee (2021) M. Azadkia and S. Chatterjee. A simple measure of conditional dependence. The Annals of Statistics, 49(6):3070–3102, 2021.
  • Brem and Kruglyak (2005) R. B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences, 102(5):1572–1577, 2005.
  • Bühlmann et al. (2014) P. Bühlmann, J. Peters, and J. Ernest. Cam: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
  • Chakraborty and Shojaie (2022) S. Chakraborty and A. Shojaie. Nonparametric Causal Structure Learning in High Dimensions. Entropy, 24(3):351, Mar. 2022.
  • Chen et al. (2019) W. Chen, M. Drton, and Y. S. Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, Dec. 2019.
  • Chickering (1996) D. M. Chickering. Learning Bayesian Networks is NP-Complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, pages 121–130. Springer, New York, NY, 1996. doi: 10.1007/978-1-4612-2404-4_12.
  • Fagny et al. (2017) M. Fagny, J. N. Paulson, M. L. Kuijjer, A. R. Sonawane, C.-Y. Chen, C. M. Lopes-Ramos, K. Glass, J. Quackenbush, and J. Platig. Exploring regulation in tissues with eqtl networks. Proceedings of the National Academy of Sciences, 114(37):E7841–E7850, 2017.
  • Fan and Lv (2008) J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, Nov. 2008.
  • Friedman and Koller (2003) N. Friedman and D. Koller. Being bayesian about network structure. a bayesian approach to structure discovery in bayesian networks. Machine learning, 50(1):95–125, 2003.
  • Fu and Zhou (2013) F. Fu and Q. Zhou. Learning sparse causal gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288–300, 2013.
  • Ghoshal and Honorio (2018) A. Ghoshal and J. Honorio. Learning linear structural equation models in polynomial time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 1466–1475. PMLR, Mar. 2018.
  • Gupta and Kim (2008) S. Gupta and H. W. Kim. Linking structural equation modeling to bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research, 190(3):818–833, 2008.
  • Ha and Sun (2020) M. J. Ha and W. Sun. Estimation of high-dimensional directed acyclic graphs with surrogate intervention. Biostatistics, 21(4):659–675, Oct. 2020.
  • Han et al. (2016) S. W. Han, G. Chen, M.-S. Cheon, and H. Zhong. Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. Journal of the American Statistical Association, 111(515):1004–1019, 2016.
  • Haris et al. (2022) A. Haris, N. Simon, and A. Shojaie. Generalized sparse additive models. Journal of Machine Learning Research, 23(70):1–56, 2022.
  • Harris and Drton (2013) N. Harris and M. Drton. PC Algorithm for Nonparanormal Graphical Models. Journal of Machine Learning Research, 14(69):3365–3383, 2013.
  • Heckerman (1997) D. Heckerman. Bayesian networks for data mining. Data mining and knowledge discovery, 1(1):79–119, 1997.
  • Huang et al. (2022) Z. Huang, N. Deb, and B. Sen. Kernel partial correlation coefficient—a measure of conditional dependence. The Journal of Machine Learning Research, 23(1):9699–9756, 2022.
  • Javanmard and Montanari (2014) A. Javanmard and A. Montanari. Hypothesis Testing in High-Dimensional Regression Under the Gaussian Random Design Model: Asymptotic Theory. IEEE Transactions on Information Theory, 60(10):6522–6554, Oct. 2014. ISSN 1557-9654. doi: 10.1109/TIT.2014.2343629.
  • Kalisch and Bühlmann (2007) M. Kalisch and P. Bühlmann. Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8(22):613–636, 2007.
  • Keele et al. (2020) G. R. Keele, B. C. Quach, J. W. Israel, G. A. Chappell, L. Lewis, A. Safi, J. M. Simon, P. Cotney, G. E. Crawford, W. Valdar, et al. Integrative qtl analysis of gene expression and chromatin accessibility identifies multi-tissue patterns of genetic regulation. PLoS genetics, 16(1):e1008537, 2020.
  • Küçükyavuz et al. (2023) S. Küçükyavuz, A. Shojaie, H. Manzour, L. Wei, and H.-H. Wu. Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks. Journal of Machine Learning Research (JMLR), 24:1–38, 2023.
  • Kumasaka et al. (2016) N. Kumasaka, A. J. Knights, and D. J. Gaffney. Fine-mapping cellular qtls with rasqual and atac-seq. Nature genetics, 48(2):206–213, 2016.
  • Liu et al. (2009) H. Liu, J. Lafferty, and L. Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research, 10(80):2295–2328, 2009.
  • Liu et al. (2012) H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, Aug. 2012.
  • Maathuis et al. (2018) M. Maathuis, M. Drton, S. Lauritzen, and M. Wainwright, editors. Handbook of Graphical Models. CRC Press, Boca Raton, Nov. 2018. ISBN 978-0-429-46397-6. doi: 10.1201/9780429463976.
  • Manzour et al. (2021) H. Manzour, S. Küçükyavuz, H.-H. Wu, and A. Shojaie. Integer Programming for Learning Directed Acyclic Graphs from Continuous Data. INFORMS Journal on Optimization, 3(1):46–73, Jan. 2021.
  • Markowetz and Spang (2007) F. Markowetz and R. Spang. Inferring cellular networks–a review. BMC bioinformatics, 8(6):1–17, 2007.
  • Meek (1995) C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, UAI’95, pages 403–410, San Francisco, CA, USA, Aug. 1995. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-385-1.
  • Nica and Dermitzakis (2013) A. C. Nica and E. T. Dermitzakis. Expression quantitative trait loci: present and future. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620):20120362, 2013.
  • Pearl (2009) J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 978-0-521-89560-6.
  • Perkovic et al. (2018) E. Perkovic, J. Textor, M. Kalisch, and M. H. Maathuis. Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs. Journal of Machine Learning Research, 18(220):1–62, 2018.
  • Peters et al. (2014a) J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014a.
  • Peters et al. (2014b) J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 15(58):2009–2053, 2014b.
  • Ramsey et al. (2006) J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pages 401–408, Arlington, Virginia, USA, July 2006. AUAI Press.
  • Raskutti and Uhler (2018) G. Raskutti and C. Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018.
  • Shah and Peters (2020) R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3):1514–1538, 2020.
  • Shi et al. (2021) H. Shi, M. Drton, and F. Han. On azadkia-chatterjee’s conditional dependence coefficient. arXiv preprint arXiv:2108.06827, 2021.
  • Shimizu et al. (2006) S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7(72):2003–2030, 2006.
  • Shojaie and Michailidis (2010) A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, Sept. 2010.
  • Shojaie et al. (2014) A. Shojaie, A. Jauhiainen, M. Kallitsis, and G. Michailidis. Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PloS One, 9(2):e82393, 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0082393.
  • Sondhi and Shojaie (2019) A. Sondhi and A. Shojaie. The Reduced PC-Algorithm: Improved Causal Structure Learning in Large Random Networks. Journal of Machine Learning Research, 20(164):1–31, 2019. ISSN 1533-7928.
  • Spirtes et al. (2001) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, 2nd Edition, volume 1. The MIT Press, 1 edition, 2001.
  • Sun (2012) W. Sun. A statistical framework for eqtl mapping using rna-seq data. Biometrics, 68(1):1–11, 2012.
  • Tan and Zhang (2019) Z. Tan and C.-H. Zhang. Doubly penalized estimation in additive regression with high-dimensional data. The Annals of Statistics, 47(5):2567–2600, Oct. 2019.
  • van de Geer et al. (2014) S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, June 2014.
  • Voorman et al. (2014) A. Voorman, A. Shojaie, and D. Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, Mar. 2014.
  • Wainwright (2019) M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2019. ISBN 978-1-108-49802-9. doi: 10.1017/9781108627771.
  • Wang and Michailidis (2019) P.-L. Wang and G. Michailidis. Directed acyclic graph reconstruction leveraging prior partial ordering information. In International Conference on Machine Learning, Optimization, and Data Science, pages 458–471. Springer, 2019.
  • Wang and Drton (2020) Y. S. Wang and M. Drton. High-dimensional causal discovery under non-Gaussianity. Biometrika, 107(1):41–59, Mar. 2020. ISSN 0006-3444.
  • Xue and Zou (2012) L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571, Oct. 2012.
  • Yu et al. (2023) S. Yu, M. Drton, and A. Shojaie. Directed graphical models and causal discovery for zero-inflated data. In Conference on Causal Learning and Reasoning, pages 27–67. PMLR, 2023.
  • Zhang et al. (2013) B. Zhang, C. Gaiteri, L.-G. Bodea, Z. Wang, J. McElwee, A. A. Podtelezhnikov, C. Zhang, T. Xie, L. Tran, R. Dobrin, et al. Integrated systems approach identifies genetic nodes and networks in late-onset alzheimer’s disease. Cell, 153(3):707–720, 2013.
  • Zhang and Zhang (2014) C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217–242, 2014.
  • Zhang and Spirtes (2002) J. Zhang and P. Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, UAI’03, pages 632–639, San Francisco, CA, USA, Aug. 2002. Morgan Kaufmann Publishers Inc. ISBN 978-0-12-705664-7.

Appendix A Proofs

A.1 Proof of Results in Section 2

Proof of Lemma 2.1.

The claim in (i) follows directly from the definition of $d$-separation. In particular, if $X_{k}\rightarrow Y_{j}\in G$, then the path from $X_{k}$ to $Y_{j}$ will not be $d$-separated by $\mathcal{X}_{-k}$, and hence $Y_{j}\not\perp\!\!\!\perp X_{k}\mid\mathcal{X}_{-k}$.

To prove (ii), note that {X1,,Xq}{Y1,,Yp}\{X_{1},\ldots,X_{q}\}\prec\{Y_{1},\ldots,Y_{p}\} implies that Yj1Yj0Y_{j_{1}}\rightarrow\cdots\rightarrow Y_{j_{0}} cannot include any XkX_{k}s. Thus, without loss of generality, suppose that

Yj1Yj0Yj1Yj2YjmYj0,Y_{j_{1}}\rightarrow\cdots\rightarrow Y_{j_{0}}\equiv Y_{j_{1}}\rightarrow Y_{j_{2}}\rightarrow\cdots\rightarrow Y_{j_{m}}\rightarrow Y_{j_{0}},

where $m=1$ is permitted. Now, considering that $X_{k_{0}}$ is an ancestor of $Y_{j_{0}}$, by faithfulness of $\mathcal{P}$ with respect to $G$, in order for any set $S\subset V\backslash\{k_{0},j_{0}\}$ to $d$-separate $X_{k_{0}}$ and $Y_{j_{0}}$, it must include at least one of the variables $Y_{j_{1}},Y_{j_{2}},\ldots,Y_{j_{m}}$. Noting that the construction used in $H$ only adjusts for $\mathcal{X}_{-k_{0}}$, and does not adjust for any $Y_{j}$s, the path from $X_{k_{0}}$ to $Y_{j_{0}}$ will not be $d$-separated by $\mathcal{X}_{-k_{0}}$. It follows from the definition of $d$-separation that $X_{k_{0}}$ and $Y_{j_{0}}$ are conditionally dependent given $\mathcal{X}_{-k_{0}}$, which means that $X_{k_{0}}\rightarrow Y_{j_{0}}\in H^{(0)}$. ∎

Proof of Lemma 2.2.

The result follows again from the definition of $d$-separation for DAGs and the faithfulness of $\mathcal{P}$. In particular, if $X_{k}\rightarrow Y_{j}\in G$, then $\{X_{-k},Y_{-j}\}$ does not $d$-separate $X_{k}$ from $Y_{j}$, and hence $Y_{j}\not\perp\!\!\!\perp X_{k}\mid\{X_{-k},Y_{-j}\}$.

Next, if $X_{k_{0}}$, $Y_{j_{0}}$ and $Y_{j_{1}}$ form an open collider in $G$, i.e., if $X_{k_{0}}\rightarrow Y_{j_{1}}\leftarrow Y_{j_{0}}$ and $X_{k_{0}}\nrightarrow Y_{j_{0}}$, then the above argument implies that $X_{k_{0}}\rightarrow Y_{j_{1}}\in H^{(-j)}$. On the other hand, since $Y_{j_{1}}$ is a common descendant of $X_{k_{0}}$ and $Y_{j_{0}}$, $d$-separation implies that $X_{k_{0}}\not\perp\!\!\!\perp Y_{j_{0}}\mid Y_{j_{1}}$, or more generally, $X_{k_{0}}\not\perp\!\!\!\perp Y_{j_{0}}\mid\{X_{-k_{0}},Y_{-j_{0}}\}$. Thus, $X_{k_{0}}\rightarrow Y_{j_{0}}\in H^{(-j)}$. ∎

A.2 Proof of Results in Section 3

Proof of Lemma 3.3.

Suppose $\mathrm{mb}^{\prime}(v)$ and $\mathrm{mb}^{\prime\prime}(v)$ are two distinct valid minimal Markov blankets (MBs) of $v$. Let $J=\mathrm{mb}^{\prime}(v)\cap\mathrm{mb}^{\prime\prime}(v)$, $K=\mathrm{mb}^{\prime}(v)\setminus J$, $L=\mathrm{mb}^{\prime\prime}(v)\setminus J$ and $W=V\setminus(J\cup K\cup L\cup\{v\})$. By the intersection property, $v\perp\!\!\!\perp K\mid J\cup L$ and $v\perp\!\!\!\perp L\mid J\cup K$ imply $v\perp\!\!\!\perp K\cup L\mid J$. By the weak union and contraction properties, $v\perp\!\!\!\perp K\cup L\mid J$ and $v\perp\!\!\!\perp W\cup L\mid J\cup K$ imply $v\perp\!\!\!\perp W\cup K\cup L\mid J$, and therefore $J$ is also a valid MB. But this is only possible when $J=\mathrm{mb}^{\prime}(v)=\mathrm{mb}^{\prime\prime}(v)$. Hence the minimal MB is unique.

Next we show $\mathrm{cmb}_{U}(v)=\mathrm{mb}(v)\setminus U$ in two steps. First, we show $\mathrm{mb}(v)\subseteq\left(\mathrm{cmb}_{U}(v)\cup U\right)$. We write $W=\mathrm{mb}(v)\cap\left(\mathrm{cmb}_{U}(v)\cup U\right)$, $Z=\mathrm{mb}(v)\setminus W$, and $S=(\mathrm{cmb}_{U}(v)\cup U)\setminus W$. Then, by the definitions of the conditional Markov blanket and the Markov blanket, $v\perp\!\!\!\perp Z\mid W\cup S$ and $v\perp\!\!\!\perp S\mid W\cup Z$. By the intersection property, $v\perp\!\!\!\perp Z\cup S\mid W$, which establishes $W$ as a valid Markov blanket. However, since $W\subseteq\mathrm{mb}(v)$ and the MB is minimal, this is only possible when $W=\mathrm{mb}(v)$ and $Z=\varnothing$, i.e., $\mathrm{mb}(v)\subseteq\mathrm{cmb}_{U}(v)\cup U$. Finally, we show $\mathrm{cmb}_{U}(v)\subseteq\mathrm{mb}(v)$. We write $T=\mathrm{cmb}_{U}(v)\cap\mathrm{mb}(v)$. Since $T\cup U\supseteq\mathrm{mb}(v)$, it must hold that $v\perp\!\!\!\perp V\setminus(T\cup U\cup\{v\})\mid T\cup U$, which implies that $T$ is also a valid conditional Markov blanket. However, since $T\subseteq\mathrm{cmb}_{U}(v)$ and the conditional Markov blanket is minimal, this is only possible when $T=\mathrm{cmb}_{U}(v)$, i.e., $\mathrm{cmb}_{U}(v)\subseteq\mathrm{mb}(v)$. This completes the proof. ∎

Proof of Theorem 3.5.

Under layering-adjacency-faithfulness, if $k\in\mathcal{X}$ and $j\in\mathcal{Y}$ are adjacent, then $X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}$ and $X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}\cup\mathcal{Y}_{-j}$, so $k\in S^{(0)}_{j}\cap S^{(-j)}_{j}$. Therefore $\widehat{E}$ contains all true edges.

By the intersection property, $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{\{S^{(0)}_{j}\cap S^{(-j)}_{j}\}\setminus k}\cup\mathcal{Y}_{T}$ implies $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}\cup\mathcal{Y}_{T}$. Under layering-adjacency-faithfulness, this implies that no false negative errors can be made by the searching loop. Now suppose $k\in\mathcal{X}$ and $j\in\mathcal{Y}$ are non-adjacent. Then we must have $X_{k}\perp\!\!\!\perp Y_{j}\mid\textnormal{pa}_{j}$, which implies $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{\{S^{(0)}_{j}\cap S^{(-j)}_{j}\}\setminus k}\cup\mathcal{Y}_{T}$ with the set $T=\textnormal{pa}_{j}\cap\mathcal{Y}$. This construction provides a $d$-separator, and hence no false positive errors can be made either. ∎

Proof of Lemma 3.8.

Since both sets are replaced by their supersets, $\widehat{E}$ is still a superset of the true edges. Under layering-adjacency-faithfulness, for any adjacent pair $k\in\mathcal{X}$ and $j\in\mathcal{Y}$ and any set $T\subseteq\mathcal{Y}_{-j}$, it holds that $X_{k}\not\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{-k}\cup T$, so the searching loop in Algorithm 1 will not make any false negative errors. Therefore, we only need to show that all additional edges can be removed by the searching loop. For any non-adjacent $k\in\mathcal{X}$ and $j\in\mathcal{Y}$, let $W\subseteq\mathcal{X}$ be an arbitrary set satisfying $W\supseteq\left(S^{(0)}_{j}\cap S^{(-j)}_{j}\right)$. Then, $X_{k}\perp\!\!\!\perp Y_{j}\mid\mathcal{X}_{W\setminus k}\cup\mathcal{Y}_{\textnormal{pa}_{j}\cap\mathcal{Y}}$. The existence of this separator implies that the edge between $k$ and $j$ can be removed by the searching loop. ∎

A.3 Proof of Results in Section 4

We first state an auxiliary lemma.

Lemma A.1 (Partial correlation and regression coefficient).

Let $\beta^{S}_{j}$ denote the vector of coefficients from the linear regression of $X_{j}$ on the set of variables $\mathcal{X}_{S}$, and let $\rho(i,j|S)$ denote the partial correlation of $X_{i}$ and $X_{j}$ given $\mathcal{X}_{S}$. Then,

βjiSi=0ρ(i,j|S)=0Σi,jΣS,iΣS,S1ΣS,j=0.\beta_{ji}^{S\cup i}=0\Leftrightarrow\rho(i,j|S)=0\Leftrightarrow\Sigma_{i,j}-\Sigma_{S,i}^{\top}\Sigma_{S,S}^{-1}\Sigma_{S,j}=0.
Proof.

Denote the block covariance matrix ΣS{i,j},S{i,j}\Sigma_{S\cup\{i,j\},S\cup\{i,j\}} as

(ABCBDECEF):=(ΣS,SΣS,iΣS,jΣi,SΣi,iΣi,jΣj,SΣj,iΣj,j)\begin{pmatrix}A&B&C\\ B^{\top}&D&E\\ C^{\top}&E^{\top}&F\end{pmatrix}:=\begin{pmatrix}\Sigma_{S,S}&\Sigma_{S,i}&\Sigma_{S,j}\\ \Sigma_{i,S}&\Sigma_{i,i}&\Sigma_{i,j}\\ \Sigma_{j,S}&\Sigma_{j,i}&\Sigma_{j,j}\end{pmatrix}

For the first term, we have βjSi=(ΣSi,Si)1ΣSi,j\beta_{j}^{S\cup i}=\left(\Sigma_{S\cup i,S\cup i}\right)^{-1}\Sigma_{S\cup i,j}, and hence

βjiSi=(DBA1B)1(EBA1C).\beta_{ji}^{S\cup i}=\left(D-B^{\top}A^{-1}B\right)^{-1}\left(E-B^{\top}A^{-1}C\right).

For the second term, we have ρ(i,j|S)=ΩijΩiiΩjj\rho(i,j|S)=-\frac{\Omega_{ij}}{\sqrt{\Omega_{ii}\Omega_{jj}}}, where Ω\Omega is the precision matrix: Ω=Σ1\Omega=\Sigma^{-1}. Therefore, ρ(i,j|S)=0\rho(i,j|S)=0 if and only if [(ΣS{i,j},S{i,j})1]i,j=0[(\Sigma_{S\cup\{i,j\},S\cup\{i,j\}})^{-1}]_{i,j}=0. Concretely,

[(ΣS{i,j},S{i,j})1]i,j=βjiSi(FCA1C(CA1B+E)βjiSi)1.[(\Sigma_{S\cup\{i,j\},S\cup\{i,j\}})^{-1}]_{i,j}=-\beta_{ji}^{S\cup i}\left(F-C^{\top}A^{-1}C-(-C^{\top}A^{-1}B+E^{\top})\beta_{ji}^{S\cup i}\right)^{-1}.

Both quantities equal zero if and only if Σi,jΣS,iΣS,S1ΣS,j=0\Sigma_{i,j}-\Sigma_{S,i}^{\top}\Sigma_{S,S}^{-1}\Sigma_{S,j}=0. ∎
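As a quick numerical sanity check of Lemma A.1 (illustrative only), the snippet below evaluates the three quantities for a random positive-definite covariance matrix; they vanish, or not, together.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)          # a random positive-definite covariance

i, j, S = 0, 1, [2, 3, 4]

# (a) coefficient of X_i in the regression of X_j on X_{S + [i]}
pred = S + [i]
beta = np.linalg.solve(Sigma[np.ix_(pred, pred)], Sigma[np.ix_(pred, [j])])
beta_ji = beta[-1, 0]

# (b) partial correlation rho(i, j | S) from the precision of the marginal
idx = S + [i, j]
P = np.linalg.inv(Sigma[np.ix_(idx, idx)])
rho = -P[-2, -1] / np.sqrt(P[-2, -2] * P[-1, -1])

# (c) Schur complement  Sigma_ij - Sigma_Si' Sigma_SS^{-1} Sigma_Sj
schur = (Sigma[i, j] - Sigma[np.ix_(S, [i])].T @ np.linalg.solve(
    Sigma[np.ix_(S, S)], Sigma[np.ix_(S, [j])])).item()

print(beta_ji, rho, schur)   # all three are zero (or nonzero) simultaneously
```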

Proof of Theorem 4.4.

Let αn=2(1Φ(n1/2cn/2))\alpha_{n}=2(1-\Phi(n^{1/2}c_{n}/2)) and denote hn=maxj𝒴|S^j(0)S^j(1)|h_{n}=\max_{j\in{{\mathcal{Y}}}}|\widehat{S}^{(0)}_{j}\cap\widehat{S}^{(1)}_{j}|. We first suppose Algorithm 1 is continued until some level mnmnm_{n}^{\prime}\geq m_{n} such that mn+hn=O(n1b)m_{n}^{\prime}+h_{n}=O(n^{1-b}) as in Assumption 4.1. By Lemma 3 of Kalisch and Bühlmann (2007), the probability of any type I error in edge k,jk,j with conditioning set T𝒴jT\subset{{\mathcal{Y}}}\setminus{j}, denoted as Ek,j|TIE_{k,j|T}^{\mathrm{I}}, is bounded by an exponential term:

\sup_{k\in\mathcal{X},\,j\in\mathcal{Y},\,T\subseteq\mathcal{Y}\setminus\{j\},\,|T|\leq m_{n}^{\prime}}\mathbb{P}\left\{E^{\mathrm{I}}_{k,j|T}\mid\mathcal{A}\right\}\leq O(n-m_{n}^{\prime}-h_{n})\exp\left(-C_{3}(n-m_{n}^{\prime}-h_{n})c_{n}^{2}\right),

and the probability of a type II (false negative) error is also bounded by an exponential term:

supk𝒳,j𝒴,T𝒴j,|T|mn{Ek,j|TII|𝒜}O(nmnhn)exp(C4(nmnhn)cn2),\sup_{k\in{{\mathcal{X}}},j\in{{\mathcal{Y}}},T\subset{{\mathcal{Y}}}\setminus{j},|T|\leq m_{n}^{\prime}}\mathbb{P}\left\{{E^{\mathrm{II}}_{k,j|T}|\mathcal{A}}\right\}\leq O(n-m_{n}^{\prime}-h_{n})\exp\left(-C_{4}(n-m_{n}^{\prime}-h_{n})c_{n}^{2}\right),

for finite constants $C_{3}$ and $C_{4}$; below, we write $C_{5}=\min(C_{3},C_{4})$. In our algorithm, $|\{(k,j,T):|T|\leq m_{n}^{\prime}\}|=O\left(pq^{1+m_{n}^{\prime}}\right)=O\left(\exp\left(c_{0}n^{\kappa}\right)\right)$. So the total error probability can be bounded:

{an error occurs in Algorithm 1 with p-cor testing|𝒜}\displaystyle\mathbb{P}\left\{{\text{an error occurs in Algorithm~{}\ref{alg:framework} with p-cor testing}|\mathcal{A}}\right\}
O(exp(c0nκ))O[(nmnhn)exp(C5(nmnhn)cn2)]\displaystyle\leq O\left(\exp(c_{0}n^{\kappa})\right)O\Big{[}(n-m_{n}^{\prime}-h_{n})\exp\left(-C_{5}(n-m_{n}^{\prime}-h_{n})c_{n}^{2}\right)\Big{]}
O[exp(c0nκ+log(nmnhn)C5(n12dmnn2dhnn2d))]\displaystyle\leq O\Big{[}\exp\left(c_{0}n^{\kappa}+\log(n-m_{n}^{\prime}-h_{n})-C_{5}\left(n^{1-2d}-m_{n}^{\prime}n^{-2d}-h_{n}n^{-2d}\right)\right)\Big{]}
=O(exp(Cn12d)).\displaystyle=O\left(\exp\left(-Cn^{1-2d}\right)\right). (8)

By (8), with high probability, the sample version of the searching loop makes no mistakes up to search level $m_{n}^{\prime}$. Then, by the reasoning in Lemma 5 of Kalisch and Bühlmann (2007), the searching loop also terminates, with high probability, at the same level as the population version, which is either $m_{n}-1$ or $m_{n}$. This completes the proof. ∎

Proof of Lemma 4.5.

We apply Lemma 3 of Kalisch and Bühlmann (2007), using notation for $E^{\mathrm{II}}_{j,\widehat{S}^{(0)}_{j}}$ and $E^{\mathrm{II}}_{j,\widehat{S}^{(1)}_{j}}$ analogous to that in the proof of Theorem 4.4. Then, by the same argument, it holds that

supj𝒴{Ej,S^j(0)II}O(n+1p)exp(C4(n+1p)cn2),\sup_{j\in{{\mathcal{Y}}}}\mathbb{P}\left\{{E^{\mathrm{II}}_{j,\widehat{S}^{(0)}_{j}}}\right\}\leq O(n+1-p)\exp\left(-C_{4}(n+1-p)c_{n}^{2}\right),
supj𝒴{Ej,S^j(1)II}O(n+2pq)exp(C4(n+2pq)cn2).\sup_{j\in{{\mathcal{Y}}}}\mathbb{P}\left\{{E^{\mathrm{II}}_{j,\widehat{S}^{(1)}_{j}}}\right\}\leq O(n+2-p-q)\exp\left(-C_{4}(n+2-p-q)c_{n}^{2}\right).

Since there are $pq+(p+q-1)q$ tests in total, the probability of an error can be bounded by

{¬𝒜}q(2p+q1)(n+2pq)exp(C(n+2pq)cn2)=O(exp(Cn12d)),\mathbb{P}\left\{{\neg\mathcal{A}}\right\}\leq q(2p+q-1)(n+2-p-q)\exp\left(-C^{\prime}(n+2-p-q)c_{n}^{2}\right)=O\left(\exp(-C^{\prime}n^{1-2d})\right),

using the fact that np+qn\gg p+q. ∎

Proof of Proposition 4.6.

By Theorem 1 of Fan and Lv (2008), each selection problem solved via SIS has an error probability of $O\left(\exp\left(-Cn^{1-2d}/\log n\right)\right)$. Applying a union bound over the $2p$ selection problems, the error probability is

O\left(\exp\left(n^{\kappa}-Cn^{1-2d}/\log n\right)\right) =O\left(\exp\left(-\frac{Cn^{1-2d}}{\log n}\left(1-\frac{n^{\kappa-(1-2d)}\log n}{C}\right)\right)\right)
=O\left(\exp\left(-C^{\prime}n^{1-2d}/\log n\right)\right),

where the last equality uses $n^{\kappa}=o\left(n^{1-2d}/\log n\right)$. ∎

Proof of Proposition 4.7.

To complete this proof, we first show that under the Gaussian construction, all design matrices satisfy the Restricted Eigenvalue (RE) condition. We then show that this condition implies bounded error of the lasso estimate. Lastly, we show that under Assumption 4.3 the minimal non-zero coefficients are large enough to guarantee successful screening.

The RE condition is a technical condition under which the lasso has bounded estimation error. In particular, a design matrix $\mathcal{D}\in\mathbb{R}^{n\times r}$ satisfies RE over a set $S\subseteq[r]$ with parameter $\eta$ if $\frac{1}{n}\lVert\mathcal{D}\Delta\rVert_{2}^{2}\geq\eta\lVert\Delta\rVert_{2}^{2}$ for all $\Delta\in\mathbb{R}^{r}$ such that $\lVert\Delta_{[r]\setminus S}\rVert_{1}\leq 3\lVert\Delta_{S}\rVert_{1}$. Let $\Gamma_{\min}(\cdot)$ be the minimum eigenvalue function and $\rho^{2}(\cdot)$ the maximum diagonal entry. It is shown in Theorem 7.16 of Wainwright (2019) that for any random design matrix $\mathcal{D}\sim N(0,\Sigma_{D})$, there exist universal positive constants $c_{1}<1<c_{2}$ such that

𝒟w22nc1ΣDw22c2ρ2(ΣD)logrnw12\frac{\lVert{\mathcal{D}w}\rVert_{2}^{2}}{n}\geq c_{1}\lVert{\sqrt{\Sigma_{D}}w}\rVert_{2}^{2}-c_{2}\rho^{2}(\Sigma_{D})\frac{\log r}{n}\lVert{w}\rVert_{1}^{2}

for all $w\in\mathbb{R}^{r}$, with probability at least $1-\frac{\exp(-n/32)}{1-\exp(-n/32)}$. This inequality implies the RE condition with $\eta=\frac{c_{1}}{2}\Gamma_{\min}(\Sigma_{D})$, uniformly over all subsets $S$ of cardinality at most $|S|\leq\frac{c_{1}}{32c_{2}}\frac{\Gamma_{\min}(\Sigma_{D})}{\rho^{2}(\Sigma_{D})}\frac{n}{\log r}$. Our required rates of $h_{n}=O(n^{1-b})$ and $s_{n}=O(n^{1-a})$ satisfy this sparsity requirement. Therefore, under our assumption on minimum eigenvalues, the design matrix formed by $\mathcal{X}\cup\mathcal{Y}$ satisfies the RE condition with some constant $\eta$ with probability at least $1-\frac{\exp(-n/32)}{1-\exp(-n/32)}$. Consequently, the design matrices used to learn $S^{(0)}_{j}$ and $S^{(1)}_{j}$, i.e., $\mathcal{X}$ and $\{(\mathcal{X}\cup\mathcal{Y}_{-j})\}_{j\in\mathcal{Y}}$, all satisfy the RE condition.

Next, we apply Theorem 7.13 of Wainwright (2019). In particular, for any solution of the lasso problem with regularization level lower bounded as $\lambda_{n}\geq 2\lVert\mathcal{D}^{\top}\omega/n\rVert_{\infty}$, where $\mathcal{D}$ is the design matrix, it holds that $\lVert\hat{\beta}-\beta^{*}\rVert_{2}\leq\frac{2}{\eta}\sqrt{s}\lambda_{n}$. In our case, this guarantees that all $\ell_{2}$ estimation errors are bounded; in particular,

maxj𝒴(max{γ^jγj2,θ^jθj2})2ηsnλn.\max_{j\in{{\mathcal{Y}}}}\left(\max\{\lVert{\hat{\gamma}_{j}-\gamma_{j}^{*}}\rVert_{2},\lVert{\hat{\theta}_{j}-\theta_{j}^{*}}\rVert_{2}\}\right)\leq\frac{2}{\eta}\sqrt{s_{n}}\lambda_{n}.

For any Gaussian designs 𝒟n×r\mathcal{D}\in{\mathbb{R}}^{n\times r} with columns standardized to maxj𝒟j2nC\max_{j}\frac{\lVert{\mathcal{D}_{j}}\rVert_{2}}{\sqrt{n}}\leq C, the Gaussian tail bound guarantees {𝒟ωn>Cσ(2logrn+δ)}2exp{nδ22}\mathbb{P}\left\{{\lVert{\frac{\mathcal{D}^{\top}\omega}{n}}\rVert_{\infty}>C\sigma(\sqrt{\frac{2\log r}{n}}+\delta)}\right\}\leq 2\exp\left\{-\frac{n\delta^{2}}{2}\right\}.

To learn $S^{(0)}_{j}$ with design matrix $\mathcal{X}$, we set $\lambda_{n}\asymp\sqrt{2\log p/n}$; then the errors (in $\ell_{2}$ and $\ell_{\infty}$) are of order $O\left(\sqrt{\frac{2s_{n}\log p}{n}}\right)=O\left(n^{1-b+\kappa}\right)$ with probability at least $1-2\exp\left\{-\frac{n\delta^{2}}{2}\right\}$. A union bound over all regression problems bounds all errors with probability $1-4p\exp\left\{-\frac{n\delta^{2}}{2}\right\}$. Similarly, to learn $S^{(1)}_{j}$ with design matrices $\{(\mathcal{X}\cup\mathcal{Y}_{-j})\}_{j\in\mathcal{Y}}$, our choice of $\lambda_{n}\asymp\sqrt{2\log(p+q)/n}$ guarantees a union bound over all regression problems with probability $1-4(p+q)\exp\left\{-\frac{n\delta^{2}}{2}\right\}=1-O\left(\exp\left\{c_{0}n^{\kappa}-\frac{\delta^{2}}{2}n\right\}\right)=1-O\left(\exp\left\{-\frac{\delta^{2}}{2}n\right\}\right)$.

Lastly, we show that Assumption 4.3 implies a beta-min condition for the regression problems. Note that

γkj\displaystyle\gamma_{kj} =ρ(j,k𝒳k)Var(Yj𝒳)Var(Xk|𝒳kYj),\displaystyle=\rho(j,k\mid{{\mathcal{X}}}_{-k})\sqrt{\mathrm{Var}(Y_{j}\mid{{\mathcal{X}}})\mathrm{Var}(X_{k}|{{\mathcal{X}}}_{-k}\cup Y_{j})},
θkj\displaystyle\theta_{kj} =ρ(j,k𝒳k𝒴j)Var(Yj|𝒳{j,k})Var(Xk|𝒳{j,k}),\displaystyle=\rho(j,k\mid{{\mathcal{X}}}_{-k}\cup{{\mathcal{Y}}}_{-j})\sqrt{\mathrm{Var}(Y_{j}|{{\mathcal{X}}}_{-\{j,k\}})\mathrm{Var}(X_{k}|{{\mathcal{X}}}_{-\{j,k\}})},

and the last terms are both bounded below by some constant, so that $\min_{\gamma_{kj}\neq 0}|\gamma_{kj}|>c^{\prime}_{n}$ and $\min_{\theta_{kj}\neq 0}|\theta_{kj}|>c^{\prime}_{n}$ for some ${c^{\prime}_{n}}^{-1}=O(n^{-d})$ from Assumption 4.2.

The beta-min condition ensures that the lasso successfully selects a superset of the relevant covariates, with the error bound specified above. Specifically, we have shown that with probability at least 1-O(e^{-\frac{n\delta^{2}}{2}})-\frac{e^{-n/32}}{1-e^{-n/32}}, the lasso solution recovers a superset of S^{(0)}_{j} and S^{(1)}_{j} for all j\in{{\mathcal{Y}}}. ∎
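To make the screening regressions analyzed above concrete, the following minimal sketch (not the authors' implementation) fits the lasso regressions for \widehat{S}^{(0)}_{j} and \widehat{S}^{(1)}_{j} with penalty levels \lambda_{n}\asymp\sqrt{2\log p/n} and \lambda_{n}\asymp\sqrt{2\log(p+q)/n}; the constants, the use of scikit-learn, and the variable names are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(D, y, lam):
    # scikit-learn's Lasso minimizes (1/(2n))||y - D b||^2 + alpha ||b||_1,
    # so alpha plays the role of lambda_n (up to constants).
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(D, y)
    return set(np.flatnonzero(fit.coef_))

def screen(X, Y):
    # Estimate S_j^(0) and S_j^(1) for every column j of Y (columns standardized).
    n, p = X.shape
    q = Y.shape[1]
    lam0 = np.sqrt(2 * np.log(p) / n)       # penalty for regressing Y_j on X
    lam1 = np.sqrt(2 * np.log(p + q) / n)   # penalty for regressing Y_j on (X, Y_{-j})
    S0, S1 = {}, {}
    for j in range(q):
        S0[j] = lasso_support(X, Y[:, j], lam0)
        D = np.hstack([X, np.delete(Y, j, axis=1)])   # design (X, Y_{-j})
        S1[j] = lasso_support(D, Y[:, j], lam1)
    return S0, S1

Under the RE and beta-min conditions above, the selected sets are, with high probability, supersets of the true S^{(0)}_{j} and S^{(1)}_{j}.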

A.4 Proof of Results in Section 5

Proof of Lemma 5.2.

The proof is identical to that of Theorem 3.5. The only additional piece needed for successful recovery of the PDAG is the orientation step. The known partial ordering is a form of background knowledge on edge orientations, which, combined with Meek's orientation rules, returns the maximal PDAG (see Meek, 1995, Problem D). ∎
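For concreteness, the following partial sketch applies the first two of Meek's rules (R1 and R2) to a skeleton in which some edges have already been oriented from background knowledge such as a partial ordering; the set-based graph representation is ours, and only these two rules are shown—the full set of rules would be needed to obtain the maximal PDAG referred to in the proof.

def orient_meek_r1_r2(directed, undirected):
    # directed: set of ordered pairs (a, b) meaning a -> b
    # undirected: set of frozenset({a, b}) edges not yet oriented
    def adjacent(a, b):
        return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            if edge not in undirected:
                continue
            a, b = tuple(edge)
            for x, y in ((a, b), (b, a)):
                parents_x = {u for (u, v) in directed if v == x}
                children_x = {v for (u, v) in directed if u == x}
                # R1: u -> x with u not adjacent to y  =>  orient x -> y
                r1 = any(not adjacent(u, y) for u in parents_x)
                # R2: a directed path x -> z -> y exists  =>  orient x -> y
                r2 = any((z, y) in directed for z in children_x)
                if r1 or r2:
                    undirected.discard(edge)
                    directed.add((x, y))
                    changed = True
                    break
    return directed, undirected

# Example: 1 -> 2 known from the ordering, skeleton also contains 2 - 3 with
# 1 and 3 non-adjacent; rule R1 orients 2 -> 3.
print(orient_meek_r1_r2({(1, 2)}, {frozenset((2, 3))}))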

Proof of Lemma 5.3.

The first statement follows directly from Lemma 3.1 (treating \cup_{i=1}^{\ell-1}V_{i} as {{\mathcal{X}}} and V_{\ell} as {{\mathcal{Y}}}). The second statement follows from Lemma 3.3, which gives S^{(1)}_{j}\cap V_{\ell}=\text{mb}(j)\cap V_{\ell}. ∎

Proof of Theorem 5.5.

By Lemma 5.3, the result of the screening step is a superset of the edges in G. Under the adjacency faithfulness assumption, all conditional independencies checked by the algorithm correspond to d-separations, and therefore the correct skeleton is recovered. Finally, given correct d-separation relations, the orientation rules are complete and maximal (Perkovic et al., 2018). ∎

Appendix B Causal Additive Models

In this section, we outline how the proposed framework can be applied to the more flexible class of causal additive models (CAMs) (Bühlmann et al., 2014), that is, SEMs that are jointly additive:

W_{j}=\sum_{k\in\textnormal{pa}_{j}}f_{jk}(W_{k})+\epsilon_{j}\qquad j=1,\ldots,p+q. (9)
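As a toy illustration of model (9), the snippet below simulates data from a small CAM in which the layering \{W_{1},W_{2}\}\prec\{W_{3},W_{4}\} is known; the particular additive functions, noise scales, and sample size are arbitrary choices made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 500
W1 = rng.normal(size=n)
W2 = rng.normal(size=n)
W3 = np.sin(W1) + 0.5 * W2**2 + 0.3 * rng.normal(size=n)   # pa(3) = {1, 2}
W4 = np.tanh(W3) - 0.8 * W1 + 0.3 * rng.normal(size=n)     # pa(4) = {1, 3}
X = np.column_stack([W1, W2])   # earlier layer (the X block)
Y = np.column_stack([W3, W4])   # later layer (the Y block)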

As with the linear SEM in the main paper, we discuss the searching and screening steps separately.

For the searching step, we can use a general test of conditional independence. Here, we adopt the framework of Chakraborty and Shojaie (2022), based on conditional distance covariance (CdCov). For two random vectors T\in{\mathbb{R}}^{a} and U\in{\mathbb{R}}^{b}, their CdCov given Z is defined as

\text{CdCov}(T,U|Z)=\frac{1}{c_{a}c_{b}}\int_{{\mathbb{R}}^{a+b}}\frac{|f_{T,U|Z}(t,s)-f_{T|Z}(t)f_{U|Z}(s)|^{2}}{\lVert t\rVert_{a}^{1+a}\lVert s\rVert_{b}^{1+b}}\,dt\,ds.

We define \rho^{*}(T,U|Z)=\mathbb{E}\left[\text{CdCov}^{2}(T,U|Z)\right]. It is easy to see that \rho^{*}(T,U|Z)=0 if and only if T\perp\!\!\!\perp U\mid Z.

Let K_{H}(\omega)=|H|^{-1}K(H^{-1}\omega) be a kernel function, where H is the diagonal bandwidth matrix determined by h, and denote K_{iu}=K_{H}(Z_{i}-Z_{u}). We also write d^{T}_{ij}=\lVert T_{i}-T_{j}\rVert_{a} and d^{U}_{ij}=\lVert U_{i}-U_{j}\rVert_{b}. Define d_{ijkl}:=(d^{T}_{ij}+d^{T}_{kl}-d^{T}_{ik}-d^{T}_{jl})(d^{U}_{ij}+d^{U}_{kl}-d^{U}_{ik}-d^{U}_{jl}) and its symmetrized form d^{S}_{ijkl}=d_{ijkl}+d_{ijlk}+d_{ilkj}. We can then use a plug-in estimate of \rho^{*}(T,U|Z):

\widehat{\rho}^{*}(T,U|Z):=\frac{1}{n}\sum_{u=1}^{n}\Delta_{i,j,k,l;u},\ \text{ where }\ \Delta_{i,j,k,l;u}:=\sum_{i,j,k,l}\frac{K_{iu}K_{ju}K_{ku}K_{lu}}{12(\sum_{m=1}^{n}K_{mu})^{4}}d_{ijkl}^{S}.
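The plug-in estimate above can be transcribed directly into code. The sketch below uses a Gaussian product kernel and a naive O(n^{5}) evaluation, so it is practical only for small n; the kernel, bandwidth, and vectorization are our own illustrative choices.

import numpy as np

def cdcov_sq_hat(T, U, Z, h=1.0):
    # Plug-in estimate of E[CdCov^2(T, U | Z)] from n paired samples.
    # T: (n, a), U: (n, b), Z: (n, c) arrays.
    n = T.shape[0]
    dT = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1)   # d^T_{ij}
    dU = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)   # d^U_{ij}

    def d4(dA):
        # dA[i,j] + dA[k,l] - dA[i,k] - dA[j,l], as an (n, n, n, n) array
        return (dA[:, :, None, None] + dA[None, None, :, :]
                - dA[:, None, :, None] - dA[None, :, None, :])

    d_ijkl = d4(dT) * d4(dU)
    # symmetrized version d^S_{ijkl} = d_{ijkl} + d_{ijlk} + d_{ilkj}
    dS = d_ijkl + d_ijkl.transpose(0, 1, 3, 2) + d_ijkl.transpose(0, 3, 2, 1)

    # Gaussian product kernel K_H(Z_i - Z_u); normalizing constants cancel
    # once the weights are normalized below.
    dZ2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-dZ2 / (2 * h**2))

    rho = 0.0
    for u in range(n):
        w = K[:, u] / K[:, u].sum()   # weights K_{iu} / sum_m K_{mu}
        rho += np.einsum('i,j,k,l,ijkl->', w, w, w, w, dS) / 12.0
    return rho / n

In the searching loop, \widehat{\rho}^{*} would play the role of the conditional-independence test statistic, with, e.g., a permutation-based threshold.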

Following the derivation of Theorem 3.3 in Chakraborty and Shojaie (2022), we can obtain the following result, which we state here without proof.

Proposition B.1 (Searching with CdCov test).

Suppose Assumptions 4.1 and 4.2 hold with \kappa<1/4. Assume there exists s_{0}>0 such that for all 0\leq s<s_{0},

\max_{W\in{{\mathcal{X}}}\cup{{\mathcal{Y}}}}\mathbb{E}\left[\exp(sW^{2})\right]<\infty,

and that the kernel function K(\cdot) used to compute \widehat{\rho} is non-negative and uniformly bounded over its support. Assume, in addition, the faithfulness condition that there exists some c_{n} such that

\inf_{j\in{{\mathcal{Y}}},\,k\in{{\mathcal{X}}},\,T\subseteq{{\mathcal{Y}}}_{-j}}\left\{|\rho^{*}(j,k\mid{{\mathcal{Y}}}_{T}\cup{{\mathcal{X}}}_{-k})|:\rho^{*}(j,k\mid{{\mathcal{Y}}}_{T}\cup{{\mathcal{X}}}_{-k})\neq 0\right\}>c_{n},

where c_{n}^{-1}=O(n^{d}) with d<\frac{1}{4}-\frac{1}{2}\kappa. Then, conditioned on \mathcal{A}(\widehat{S}^{(0)},S^{(1)}), the searching loop in Algorithm 1 using the CdCov test returns the correct edge set with probability at least 1-O\left(\exp(-n^{1-2\gamma-2d})\right)-O\left(\exp(-n^{\gamma})\right).

Next we discuss the screening step for nonlinear SEMs. We consider the family of functions f_{uv}^{(r)}(x_{u})=\Psi_{uv}\beta_{uv}, where \Psi_{uv} is an n\times r matrix whose columns are the basis functions used to model the additive component f_{uv}, and \beta_{uv} is an r-vector containing the associated effects. We write \psi_{uvt} for the t-th column of \Psi_{uv}, denote by \Psi_{S} the column-wise concatenation of the basis matrices \{\Psi_{uv}:u\in S\}, and let \Sigma_{S,S}=(n^{-1}\Psi_{S}^{\top}\Psi_{S}).

In this model, k has no direct causal effect on j if and only if f_{jk}\equiv 0. Under some regularity conditions, there exists a truncation parameter r large enough that f_{uv}\equiv 0\Leftrightarrow f_{uv}^{(r)}\equiv 0\Leftrightarrow\beta_{uv}=[0,\ldots,0]^{\top} for all u and v. In high-dimensional problems, we estimate the coefficients with \ell_{1}/\ell_{2}-norm penalization. Concretely, for each node j\in{{\mathcal{Y}}}, we can estimate \widehat{S}^{(0)}_{j} and \widehat{S}^{(1)}_{j} by solving the following problems:

\widehat{S}^{(0)}_{j} =\left\{k:\hat{f}^{(r)}_{jk}\not\equiv 0\right\},\quad\text{where}\quad\{\hat{f}^{(r)}_{jl}\}=\operatorname*{arg\,min}_{\{f_{jl}\}_{l\in{{\mathcal{X}}}}\in\mathcal{F}^{(r)}}\lVert Y_{j}-\sum_{l\in{{\mathcal{X}}}}f_{jl}(X_{l})\rVert_{n}^{2}+\lambda\sum_{l\in{{\mathcal{X}}}}\lVert f_{jl}(X_{l})\rVert_{n},
\widehat{S}^{(1)}_{j} =\left\{k:\hat{f}^{(r)}_{jk}\not\equiv 0\right\},\quad\text{where}\quad\{\hat{f}^{(r)}_{jl}\}=\operatorname*{arg\,min}_{\{f_{jl}\}_{l\in{{\mathcal{X}}}_{\widehat{S}^{(0)}}\cup{{\mathcal{Y}}}_{-j}}\in\mathcal{F}^{(r)}}\lVert Y_{j}-\sum_{l\in{{\mathcal{X}}}_{\widehat{S}^{(0)}}\cup{{\mathcal{Y}}}_{-j}}f_{jl}(W_{l})\rVert_{n}^{2}
+\lambda\sum_{l\in{{\mathcal{X}}}_{\widehat{S}^{(0)}}\cup{{\mathcal{Y}}}_{-j}}\lVert f_{jl}(W_{l})\rVert_{n},

where the group lasso penalties jointly penalize the r smoothing bases of each variable. This estimator is similar to SPACEJAM (Voorman et al., 2014), with edges disagreeing with the layering information hard-coded as zero. The following result relies on the general error analysis in Haris et al. (2022); a similar result can be found in Tan and Zhang (2019).
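A minimal sketch of such a basis-expansion group-lasso regression is given below. The polynomial basis, the blockwise coordinate-descent solver, and all tuning choices are illustrative assumptions, not the estimator used in the paper; each block of r basis columns is orthonormalized so that the penalty \lVert f_{jl}\rVert_{n} reduces to the Euclidean norm of the block's coefficients.

import numpy as np

def basis(x, r=4):
    # Polynomial basis for one variable, centered and rescaled so that
    # Psi^T Psi / n = I (thin QR of the centered design, times sqrt(n)).
    n = x.shape[0]
    P = np.column_stack([x**t for t in range(1, r + 1)])
    P -= P.mean(axis=0)
    Q, _ = np.linalg.qr(P)
    return Q * np.sqrt(n)

def group_lasso_additive(y, covariates, lam, r=4, n_iter=200):
    # Blockwise coordinate descent for
    #   (1/2n) ||y - sum_l Psi_l beta_l||^2 + lam * sum_l ||Psi_l beta_l||_n,
    # which, with orthonormalized blocks, has a closed-form group update.
    n = y.shape[0]
    groups = [basis(c, r) for c in covariates]
    betas = [np.zeros(r) for _ in groups]
    resid = y - y.mean()
    for _ in range(n_iter):
        for l, Psi in enumerate(groups):
            resid += Psi @ betas[l]          # add back the current fit of block l
            z = Psi.T @ resid / n
            norm_z = np.linalg.norm(z)
            if norm_z > 0:
                betas[l] = max(0.0, 1.0 - lam / norm_z) * z   # group soft-thresholding
            else:
                betas[l] = np.zeros(r)
            resid -= Psi @ betas[l]
    return {l for l, b in enumerate(betas) if np.linalg.norm(b) > 1e-8}

For \widehat{S}^{(1)}_{j}, the covariates passed in would be the columns of {{\mathcal{X}}}_{\widehat{S}^{(0)}} together with those of {{\mathcal{Y}}}_{-j}, with \lambda\asymp\sqrt{\log(p+q)/n}.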

We next state the smoothness assumption needed on the basis approximation.

Assumption B.2 (Truncated basis).

Suppose there exists r=O(1) such that the \{f_{uv}\} are sufficiently smooth, in the sense that

|f_{uv}^{(r)}(x_{u})-f_{uv}(x_{u})|=O_{p}(1/r^{t})

uniformly over all u,v\in V, for some t\in\mathbb{N}.

Next we assume two standard conditions for generalized additive models (GAMs). The compatibility condition can be shown to hold for random designs, but for simplicity we state both conditions as assumptions.

Assumption B.3 (Compatibility).

Suppose there exists a compatibility constant \phi>0 such that, for all j\in{{\mathcal{Y}}} and all functions of the form f^{(r)}_{j}(x)=\sum_{k\in{{\mathcal{X}}}}f_{jk}^{(r)}(x_{k}) satisfying \sum_{k\in{{\mathcal{X}}}\setminus S^{(0)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n}\leq 3\sum_{k\in S^{(0)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n}, it holds that

\sum_{k\in S^{(0)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n}\leq\lVert f_{j}^{(r)}\rVert\sqrt{|S^{(0)}_{j}|}/\phi,

for some norm \lVert\cdot\rVert.

Similarly, assume that for all j\in{{\mathcal{Y}}} and all functions of the form f^{(r)}_{j}(x)=\sum_{k\in{{\mathcal{X}}}\cup{{\mathcal{Y}}}_{-j}}f_{jk}^{(r)}(w_{k}) satisfying \sum_{k\in({{\mathcal{X}}}\cup{{\mathcal{Y}}}_{-j})\setminus S^{(1)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n}\leq 3\sum_{k\in S^{(1)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n},

\sum_{k\in S^{(1)}_{j}}\lVert f_{jk}^{(r)}\rVert_{n}\leq\lVert f_{j}^{(r)}\rVert\sqrt{|S^{(1)}_{j}|}/\phi.
Assumption B.4 (GAM screening).

Denote s_{\max}=\max_{j\in{{\mathcal{Y}}}}\max_{i\in\{0,1\}}|S^{(i)}_{j}| and suppose s_{\max}=o(n/\log(p+q)). Let \lambda\asymp\sqrt{\log(p+q)/n}. Suppose

\min_{j\in{{\mathcal{Y}}}}\min_{i\in\{0,1\}}\min_{k\in S^{(i)}_{j}}\lVert f_{jk}^{(r),0}\rVert=\Omega\left(s_{\max}\frac{\log(p+q)}{n}\right).

Let f^{*}_{u} be an arbitrary function such that \sum_{i=1}^{n}f^{*}_{u}(x_{u,i})=0. Suppose there exists some constant M^{*} satisfying M^{*}=O(|S^{(1)}_{j}|\lambda/\phi^{2}) such that f_{u}\in\mathcal{F}^{(r)}_{\text{local}} if and only if \lVert f_{u}-f^{*}_{u}\rVert_{n}\leq M^{*}. Further suppose that \boldsymbol{\varepsilon}(f_{u}^{*})=O(s_{\max}\lambda^{2}/\phi^{2}) for all u. The next result follows directly from the results in Haris et al. (2022).

Proposition B.5 (Screening with GAM).

Suppose Assumptions B.2–B.4 hold. Then, as n\to\infty with p+q=O(n^{\xi}) for some \xi\geq 0 and the penalty level stated in Assumption B.4, the undirected graph resulting from the screening loop of Algorithm 1 is a supergraph of H.

Propositions B.1 and B.5 establish the consistency of the proposed framework in Algorithm 1 for learning DAGs from CAMs, facilitating causal structure learning for an expressive family of distributions.