
A polynomial-time algorithm for learning nonparametric causal graphs

Ming Gao Yi Ding Bryon Aragam
Abstract

We establish finite-sample guarantees for a polynomial-time algorithm for learning a nonlinear, nonparametric directed acyclic graphical (DAG) model from data. The analysis is model-free and does not assume linearity, additivity, independent noise, or faithfulness. Instead, we impose a condition on the residual variances that is closely related to previous work on linear models with equal variances. Compared to an optimal algorithm with oracle knowledge of the variable ordering, the additional cost of the algorithm is linear in the dimension $d$ and the number of samples $n$. Finally, we compare the proposed algorithm to existing approaches in a simulation study.

Contact: [email protected], [email protected], [email protected]

1 Introduction

Modern machine learning (ML) methods are driven by complex, high-dimensional, and nonparametric models that can capture highly nonlinear phenomena. These models have proven useful in wide-ranging applications including vision, robotics, medicine, and natural language. At the same time, the complexity of these methods often obscures their decisions and in many cases can lead to wrong decisions by failing to properly account for—among other things—spurious correlations, adversarial vulnerability, and invariances (Bottou, 2015; Schölkopf, 2019; Bühlmann, 2018). This has led to a growing literature on correcting these problems in ML systems. A particular example that has received widespread attention in recent years is the closely related problem of causal inference. While substantial methodological progress has been made towards embedding complex methods such as deep neural networks and RKHS embeddings into learning causal graphical models (Huang et al., 2018; Mitrovic et al., 2018; Zheng et al., 2020; Yu et al., 2019; Lachapelle et al., 2019; Zhu and Chen, 2019; Ng et al., 2019), theoretical progress has been slower and typically reserved for particular parametric models such as linear (Chen et al., 2019; Wang and Drton, 2020; Ghoshal and Honorio, 2017, 2018; Loh and Bühlmann, 2014; van de Geer and Bühlmann, 2013; Aragam and Zhou, 2015; Aragam et al., 2015, 2019) and generalized linear models (Park and Raskutti, 2017; Park, 2018).

In this paper, we study the problem of learning directed acyclic graphs (DAGs) from data in a nonparametric setting. Unlike existing work on this problem, we do not require linearity, additivity, independent noise, or faithfulness. Our approach is model-free and nonparametric, and uses nonparametric estimators (kernel smoothers, neural networks, splines, etc.) as “plug-in” estimators. As such, it is agnostic to the choice of nonparametric estimator. Unlike existing consistency theory in the nonparametric setting (Peters et al., 2014; Hoyer et al., 2009; Bühlmann et al., 2014; Rothenhäusler et al., 2018; Huang et al., 2018; Tagasovska et al., 2018; Nowzohour and Bühlmann, 2016), we provide explicit (nonasymptotic) finite sample complexity bounds and show that the resulting method has polynomial time complexity. The method we study is closely related to existing algorithms that first construct a variable ordering (Ghoshal and Honorio, 2017; Chen et al., 2019; Ghoshal and Honorio, 2018; Park and Kim, 2020). Despite this being a well-studied problem, to the best of our knowledge our analysis is the first to provide explicit, simultaneous statistical and computational guarantees for learning nonparametric DAGs.

Figure 1: (a) Existing methods may not find a correct topological ordering in simple settings when $d=3$. (b) Example of a layer decomposition $L(\mathsf{G})$ of a DAG on $d=6$ nodes.

Contributions

Figure 1(a) illustrates a key motivation for our work: While there exist methods that obtain various statistical guarantees, they lack provably efficient algorithms, or vice versa. As a result, these methods can fail in simple settings. Our focus is on simultaneous computational and statistical guarantees that are explicit and nonasymptotic in a model-free setting. More specifically, our main contributions are as follows:

  • We show that the algorithms of Ghoshal and Honorio (2017) and Chen et al. (2019) rigorously extend to a model-free setting, and provide a method-agnostic analysis of the resulting extension (Theorem 4.1). That is, the time and sample complexity bounds depend on the choice of estimator used, and this dependence is made explicit in the bounds (Section 3.2, Section 4).

  • We prove that this algorithm runs in at most $O(nd^{5})$ time and needs at most $\Omega((d^{2}/\varepsilon)^{1+d/2})$ samples (Corollary 4.2). Moreover, the exponential dependence on $d$ can be improved by imposing additional sparsity or smoothness assumptions, and can even be made polynomial (see Section 4 for discussion). This is an expected consequence of our estimator-agnostic approach.

  • We show how existing identifiability results based on ordering variances can be unified and generalized to include model-free families (Theorem 3.1, Section 3.1).

  • We show that greedy algorithms such as those used in the CAM algorithm (Bühlmann et al., 2014) can provably fail to recover an identifiable DAG (Example 5), as shown in Figure 1(a) (Section 3.3).

  • Finally, we run a simulation study to evaluate the resulting algorithm in a variety of settings against seven state-of-the-art algorithms (Section 5).

Our simulation results can be summarized as follows: When implemented using generalized additive models (Hastie and Tibshirani, 1990), our method outperforms most state-of-the-art methods, particularly on denser graphs with hub nodes. We emphasize here, however, that our main contributions lie in the theoretical analysis, specifically providing a polynomial-time algorithm with sample complexity guarantees.

Related work

The literature on learning DAGs is vast, so we focus only on related work in the nonparametric setting. The most closely related line of work considers additive noise models (ANMs) (Peters et al., 2014; Hoyer et al., 2009; Bühlmann et al., 2014; Chicharro et al., 2019), and proves a variety of identifiability and consistency guarantees. Compared to our work, the identifiability results proved in these papers require that the structural equations are (a) nonlinear with (b) additive, independent noise. Crucially, these papers focus on (generally asymptotic) statistical guarantees without any computational or algorithmic guarantees. There is also a closely related line of work for bivariate models (Mooij et al., 2016; Monti et al., 2019; Wu and Fukumizu, 2020; Mitrovic et al., 2018) as well as the post-nonlinear model (Zhang and Hyvärinen, 2009). Huang et al. (2018) proposed a greedy search algorithm using an RKHS-based generalized score, and proved its consistency assuming faithfulness. Rothenhäusler et al. (2018) study identifiability of a general family of partially linear models and prove consistency of a score-based search procedure in finding an equivalence class of structures. There is also a recent line of work on embedding neural networks and other nonparametric estimators into causal search algorithms (Lachapelle et al., 2019; Zheng et al., 2020; Yu et al., 2019; Ng et al., 2019; Zhu and Chen, 2019) without theoretical guarantees. While this work was in preparation, we were made aware of the recent work (Park, 2020) that proposes an algorithm that is similar to ours—also based on (Ghoshal and Honorio, 2017) and (Chen et al., 2019)—and establishes its sample complexity for linear Gaussian models. In comparison to these existing lines of work, our focus is on simultaneous computational and statistical guarantees that are explicit and nonasymptotic (i.e. valid for all finite $d$ and $n$), for the fully nonlinear, nonparametric, and model-free setting.

Notation

Subscripts (e.g. $X_{j}$) will always be used to index random variables and superscripts (e.g. $X_{j}^{(i)}$) to index observations. For a matrix $W=(w_{kj})$, $w_{\cdot j}\in\mathbb{R}^{d}$ is the $j$th column of $W$. We denote the indices by $[d]=\{1,\ldots,d\}$, and frequently abuse notation by identifying the indices $[d]$ with the random vector $X=(X_{1},\ldots,X_{d})$. For example, nodes $X_{j}$ are interchangeable with their indices $j$ (and subsets thereof), so e.g. $\operatorname{var}(j\,|\,A)$ is the same as $\operatorname{var}(X_{j}\,|\,X_{A})$.

2 Background

Let $X=(X_{1},\ldots,X_{d})$ be a $d$-dimensional random vector and $\mathsf{G}=(V,E)$ a DAG where we implicitly assume $V=X$. The parent set of a node is defined as $\operatorname{pa}_{\mathsf{G}}(X_{j})=\{i:(i,j)\in E\}$, or simply $\operatorname{pa}(j)$ for short. A source node is any node $X_{j}$ such that $\operatorname{pa}(j)=\emptyset$ and an ancestral set is any set $A\subset V$ such that $X_{j}\in A\implies\operatorname{pa}(j)\subset A$. The graph $\mathsf{G}$ is called a Bayesian network (BN) for $X$ if it satisfies the Markov condition, i.e. that each variable is conditionally independent of its non-descendants given its parents. Intuitively, a BN for $X$ can be interpreted as a representation of the direct and indirect relationships between the $X_{j}$, e.g. an edge $X_{i}\to X_{j}$ indicates that $X_{j}$ depends directly on $X_{i}$, and not vice versa. Under additional assumptions such as causal minimality and no unmeasured confounding, these arrows may be interpreted causally; for more details, see the surveys (Bühlmann, 2018; Schölkopf, 2019) or the textbooks (Lauritzen, 1996; Koller and Friedman, 2009; Spirtes et al., 2000; Pearl, 2009; Peters et al., 2017).

The goal of structure learning is to learn a DAG $\mathsf{G}$ from i.i.d. observations $X^{(i)}\overset{\text{iid}}{\sim}\mathbb{P}(X)$. Throughout this paper, we shall exploit the following well-known fact: To learn $\mathsf{G}$, it suffices to learn a topological sort of $\mathsf{G}$, i.e. an ordering $\prec$ such that $X_{i}\to X_{j}\implies X_{i}\prec X_{j}$. A brief review of this material can be found in the supplement.

Equal variances

Recently, a new approach has emerged that was originally cast as a method for learning equal variance DAGs (Ghoshal and Honorio, 2017; Chen et al., 2019), although it has since been generalized beyond the equal variance case (Ghoshal and Honorio, 2018; Park and Kim, 2020; Park, 2020). An equal variance DAG is a linear structural equation model (SEM) that satisfies

$X_{j}=\langle w_{\cdot j},X\rangle+z_{j},\quad\operatorname{var}(z_{j})=\sigma^{2},\quad z_{j}\perp\!\!\!\perp\operatorname{pa}(j),\quad w_{kj}=0\iff k\notin\operatorname{pa}(j)$   (1)

for some weights $w_{kj}\in\mathbb{R}$. Under the model (1), a simple algorithm can learn the graph $\mathsf{G}$ by first learning a topological sort $\prec$. For these models, we have the following decomposition of the variance:

$\operatorname{var}(X_{j})=\operatorname{var}(\langle w_{\cdot j},X\rangle)+\operatorname{var}(z_{j}).$   (2)

Thus, as long as $\operatorname{var}(\langle w_{\cdot j},X\rangle)>0$, we have $\operatorname{var}(X_{j})>\operatorname{var}(z_{j})$. It follows that as long as $\operatorname{var}(z_{j})$ does not depend on $j$, it is possible to identify a source in $\mathsf{G}$ by simply minimizing the residual variances. This is the essential idea behind algorithms based on equal variances in the linear setting (Ghoshal and Honorio, 2017; Chen et al., 2019). Alternatively, it is possible to iteratively identify best sinks by minimizing marginal precisions. Moreover, this argument shows that the assumption of linearity is not crucial, and this idea can readily be extended to ANMs, as in (Park, 2020). Indeed, the crucial assumption in this argument is the independence of the noise $z_{j}$ and the parents $\operatorname{pa}(X_{j})$; in the next section we show how these assumptions can be removed altogether.
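
To make this concrete, the following is a minimal numerical sketch (not the implementation used in this paper) of source identification by residual-variance minimization in a linear equal-variance SEM; the three-node chain, the coefficients, and the sample size are illustrative assumptions.

# Minimal sketch: iteratively identify sources in a linear equal-variance SEM by
# minimizing empirical residual variances, in the spirit of (1)-(2). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X1 = rng.normal(size=n)                 # source, var(z_1) = 1
X2 = 0.8 * X1 + rng.normal(size=n)      # var(z_2) = 1
X3 = 0.8 * X2 + rng.normal(size=n)      # var(z_3) = 1
X = np.column_stack([X1, X2, X3])

def residual_variance(y, Z=None):
    """Empirical E[var(y | Z)] via least squares; marginal variance if Z is None."""
    if Z is None:
        return y.var()
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return (y - Z1 @ beta).var()

order, remaining = [], [0, 1, 2]
while remaining:
    Z = X[:, order] if order else None
    k = min(remaining, key=lambda j: residual_variance(X[:, j], Z))
    order.append(k)
    remaining.remove(k)
print("estimated topological order:", order)   # expected: [0, 1, 2]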

Layer decomposition of a DAG

Given a DAG $\mathsf{G}$, define a collection of sets as follows: $L_{0}:=\emptyset$, $A_{j}=\cup_{m=0}^{j}L_{m}$, and for $j>0$, $L_{j}$ is the set of all source nodes in the subgraph $\mathsf{G}[V-A_{j-1}]$ formed by removing the nodes in $A_{j-1}$. So, e.g., $L_{1}$ is the set of source nodes in $\mathsf{G}$ and $A_{1}=L_{1}$. This decomposes $\mathsf{G}$ into layers, where each layer $L_{j}$ consists of nodes that are sources in the subgraph $\mathsf{G}[V-A_{j-1}]$, and $A_{j}$ is an ancestral set for each $j$. Let $r$ denote the number of “layers” in $\mathsf{G}$ and $L(\mathsf{G}):=(L_{1},\ldots,L_{r})$ be the corresponding layers. The quantity $r$ effectively measures the depth of a DAG. See Figure 1(b) for an illustration.

Learning $\mathsf{G}$ is equivalent to learning the sets $L_{1},\ldots,L_{r}$, since any topological sort $\pi$ of $\mathsf{G}$ can be determined from $L(\mathsf{G})$, and from any sort $\pi$, the graph $\mathsf{G}$ can be recovered via variable selection. Unlike a topological sort of $\mathsf{G}$, which may not be unique, the layer decomposition $L(\mathsf{G})$ is always unique. Therefore, without loss of generality, in the sequel we consider the problem of identifying and learning $L(\mathsf{G})$.
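
For concreteness, a short sketch of the layer decomposition $L(\mathsf{G})$ follows, using a plain dictionary of parent sets to represent the DAG; the six-node example graph is a hypothetical stand-in for the graph sketched in Figure 1(b).

# Sketch of the layer decomposition L(G): repeatedly peel off the sources of the
# subgraph induced by the remaining nodes. No external graph library assumed.
def layer_decomposition(parents):
    """parents: dict mapping node -> set of parent nodes. Returns the list (L_1, ..., L_r)."""
    remaining = set(parents)
    layers = []
    while remaining:
        # Sources of the subgraph induced by the remaining nodes.
        layer = {j for j in remaining if not (parents[j] & remaining)}
        layers.append(sorted(layer))
        remaining -= layer
    return layers

# Hypothetical 6-node DAG used purely for illustration.
example = {1: set(), 2: set(), 3: {1}, 4: {1, 2}, 5: {3, 4}, 6: {4}}
print(layer_decomposition(example))   # [[1, 2], [3, 4], [5, 6]]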

3 Identifiability and algorithmic consequences

This section sets the stage for our main results on learning nonparametric DAGs: First, we show that existing identifiability results for equal variances generalize to a family of model-free, nonparametric distributions. Second, we show that this motivates an algorithm very similar to existing algorithms in the equal variance case. We emphasize that various incarnations of these ideas have appeared in previous work (Ghoshal and Honorio, 2017; Chen et al., 2019; Ghoshal and Honorio, 2018; Park and Kim, 2020; Park, 2020), and our effort in this section is to unify these ideas and show that the same ideas can be applied in more general settings without linearity or independent noise. Once this has been done, our main sample complexity result is presented in Section 4.

3.1 Nonparametric identifiability

In general, a BN for $X$ need not be unique, i.e. $\mathsf{G}$ is not necessarily identifiable from $\mathbb{P}(X)$. A common strategy in the literature to enforce identifiability is to impose structural assumptions on the conditional distributions $\mathbb{P}(X_{j}\,|\,\operatorname{pa}(j))$, for which there is a broad literature on identifiability. Our first result shows that identifiability is guaranteed as long as the residual variances $\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))$ do not depend on $j$. This is a natural generalization of the notion of equality of variances for linear models (Peters and Bühlmann, 2013; Ghoshal and Honorio, 2017; Chen et al., 2019).

Theorem 3.1.

If $\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\equiv\sigma^{2}$ does not depend on $j$, then $\mathsf{G}$ is identifiable from $\mathbb{P}(X)$.

The proof of Theorem 3.1 can be found in the supplement. This result makes no structural assumptions on the local conditional probabilities $\mathbb{P}(X_{j}\,|\,\operatorname{pa}(j))$. To illustrate, we consider some examples below.

Example 1 (Causal pairs, (Mooij et al., 2016)).

Consider a simple model on two variables: $X\to Y$ with $\mathbb{E}\operatorname{var}(Y\,|\,X)=\operatorname{var}(X)$. Then as long as $\mathbb{E}[Y\,|\,X]$ is nonconstant, Theorem 3.1 implies the causal order is identifiable. No additional assumptions on the noise or functional relationships are necessary.

Example 2 (Binomial models, (Park and Raskutti, 2017)).

Assume $X_{j}\in\{0,1\}$ and $X_{j}\sim\operatorname{Ber}(f_{j}(\operatorname{pa}(j)))$ with $f_{j}(\operatorname{pa}(j))\in[0,1]$. Then Theorem 3.1 implies that if $\mathbb{E}f_{j}(\operatorname{pa}(j))(1-f_{j}(\operatorname{pa}(j)))\equiv\sigma^{2}$ does not depend on $j$, then $\mathsf{G}$ is identifiable.

Example 3 (Generalized linear models).

The previous example can of course be generalized to arbitrary generalized linear models: Assume $\mathbb{P}[X_{j}\,|\,\operatorname{pa}(j)]\propto\exp(X_{j}\theta_{j}-K(\theta_{j}))$, where $\theta_{j}=f_{j}(\operatorname{pa}(j))$ and $K(\theta_{j})$ is the partition function. Then Theorem 3.1 implies that if $\mathbb{E}[K^{\prime\prime}(f_{j}(\operatorname{pa}(j)))]\equiv\sigma^{2}$ does not depend on $j$, then $\mathsf{G}$ is identifiable.

Example 4 (Additive noise models, (Peters et al., 2014)).

Finally, we observe that Theorem 3.1 generalizes existing results for ANMs: In an ANM, we have $X_{j}=f_{j}(\operatorname{pa}(j))+z_{j}$ with $z_{j}\perp\!\!\!\perp\operatorname{pa}(j)$. If $\operatorname{var}(z_{j})=\sigma^{2}$, then an argument similar to (2) shows that ANMs with equal variances are identifiable. Theorem 3.1 applies to more general additive noise models $X_{j}=f_{j}(\operatorname{pa}(j))+g_{j}(\operatorname{pa}(j))^{1/2}z_{j}$ with heteroskedastic, uncorrelated (i.e. not necessarily independent) noise.

Unequal variances

Early work on this problem focused on the case of equal variances (Ghoshal and Honorio, 2017; Chen et al., 2019), as we have done here. This assumption illustrates the main technical difficulties in proving identifiability, and it is well-known by now that equality of variances is not necessary, and a weaker assumption that allows for heterogeneous residual variances suffices in special cases (Ghoshal and Honorio, 2018; Park and Kim, 2020). Similarly, the extension of Theorem 3.1 to such heterogeneous models is straightforward, and omitted for brevity; see Appendix B.1 in the supplement for additional discussion and simulations. In the sequel, we focus on the case of equality for simplicity and ease of interpretation.

3.2 A polynomial-time algorithm

The basic idea behind the top-down algorithm proposed in (Chen et al., 2019) can easily be extended to the setting of Theorem 3.1, and is outlined in Algorithm 1. The only modification is to replace the error variances $\operatorname{var}(z_{j})=\sigma^{2}$ from the linear model (1) with the corresponding residual variances (i.e. $\mathbb{E}\operatorname{var}(X_{\ell}\,|\,S_{j})$), which are well-defined for any $\mathbb{P}(X)$ with finite second moments.

A natural idea to translate Algorithm 1 into an empirical algorithm is to replace the residual variances with an estimate based on the data. One might then hope to use similar arguments as in the linear setting to establish consistency and bound the sample complexity. Perhaps surprisingly, this does not work unless the topological sort of $\mathsf{G}$ is unique. When there is more than one topological sort, it becomes necessary to uniformly bound the errors of all possible residual variances—and in the worst case there are exponentially many ($d2^{d-1}$ to be precise) possible residual variances. The key issue is that the sets $S_{j}$ in Algorithm 1 are random (i.e. data-dependent), and hence unknown in advance. This highlights a key difference between our algorithm and existing work for linear models such as (Ghoshal and Honorio, 2017, 2018; Chen et al., 2019; Park, 2020): In our setting, the residual variances cannot be written as simple functions of the covariance matrix $\Sigma:=\mathbb{E}XX^{T}$, which simplifies the analysis for linear models considerably. Indeed, although the same exponential blowup arises for linear models, in that case consistent estimation of $\Sigma$ provides uniform control over all possible residual variances (e.g., see Lemma 6 in (Chen et al., 2019)). In the nonparametric setting, this reduction no longer applies.

To get around this technical issue, we modify Algorithm 1 to learn $\mathsf{G}$ one layer $L_{j}$ at a time, as outlined in Algorithm 2 (see Section 2 for details on $L_{j}$). As a result, we need only estimate $\sigma_{\ell j}^{2}:=\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j})$, which involves regression problems with at most $|A_{j}|$ nodes. We use the plug-in estimator (3) for this, although more sophisticated estimators are available (Doksum et al., 1995; Robins et al., 2008). This also necessitates the use of sample splitting in Step 3(a) of Algorithm 2, which is needed for the theoretical arguments but can be omitted in practice.

The overall computational complexity of Algorithm 2, which we call NPVAR, is $O(ndrT)$, where $T$ is the complexity of computing each nonparametric regression function $\widehat{f}_{\ell j}$. For example, if a kernel smoother is used, $T=O(d^{3})$ and thus the overall complexity is $O(nrd^{4})$. For comparison, an oracle algorithm that knows the true topological order of $\mathsf{G}$ in advance would still need to compute $d$ regression functions, and hence would have complexity $O(dT)$. Thus, the extra complexity of learning the topological order is only $O(nr)=O(nd)$, which is linear in the dimension and the number of samples. Furthermore, under additional assumptions on the sparsity and/or structure of the DAG, the time complexity can be reduced further; however, our analysis makes no such assumptions.

Algorithm 1 Population algorithm for learning nonparametric DAGs

  1. Set $S_{0}=\emptyset$ and for $j=0,1,2,\ldots$, let

     $k_{j}=\operatorname*{arg\,min}_{\ell\notin S_{j}}\mathbb{E}\operatorname{var}(X_{\ell}\,|\,S_{j}),\qquad S_{j+1}=S_{j}\cup\{k_{j}\}.$

  2. Return the DAG $\mathsf{G}$ that corresponds to the topological sort $(k_{1},\ldots,k_{d})$.

Algorithm 2 NPVAR algorithm

Input: $X^{(1)},\ldots,X^{(n)}$, $\eta>0$.

  1. Set $\widehat{L}_{0}=\emptyset$, $\widehat{\sigma}_{\ell 0}^{2}=\widehat{\operatorname{var}}(X_{\ell})$, $k_{0}=\operatorname*{arg\,min}_{\ell}\widehat{\sigma}_{\ell 0}^{2}$, $\widehat{\sigma}_{0}^{2}=\widehat{\sigma}_{k_{0}0}^{2}$.

  2. Set $\widehat{L}_{1}:=\{\ell:|\widehat{\sigma}_{\ell 0}^{2}-\widehat{\sigma}_{0}^{2}|<\eta\}$.

  3. For $j=2,3,\ldots$:

     (a) Randomly split the $n$ samples in half and let $\widehat{A}_{j}:=\cup_{m=1}^{j}\widehat{L}_{m}$.

     (b) For each $\ell\notin\widehat{A}_{j}$, use the first half of the sample to estimate $f_{\ell j}(X_{\widehat{A}_{j}})=\mathbb{E}[X_{\ell}\,|\,\widehat{A}_{j}]$ via a nonparametric estimator $\widehat{f}_{\ell j}$.

     (c) For each $\ell\notin\widehat{A}_{j}$, use the second half of the sample to estimate the residual variances via the plug-in estimator

         $\widehat{\sigma}_{\ell j}^{2}=\frac{1}{n/2}\sum_{i=1}^{n/2}(X_{\ell}^{(i)})^{2}-\frac{1}{n/2}\sum_{i=1}^{n/2}\widehat{f}_{\ell j}(X_{\widehat{A}_{j}}^{(i)})^{2}.$   (3)

     (d) Set $k_{j}=\operatorname*{arg\,min}_{\ell\notin\widehat{A}_{j}}\widehat{\sigma}_{\ell j}^{2}$ and $\widehat{L}_{j+1}=\{\ell:|\widehat{\sigma}_{\ell j}^{2}-\widehat{\sigma}_{k_{j}j}^{2}|<\eta,\ \ell\notin\widehat{A}_{j}\}$.

  4. Return $\widehat{L}=(\widehat{L}_{1},\ldots,\widehat{L}_{\widehat{r}})$.
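
For concreteness, the following is a simplified Python sketch of Algorithm 2; a k-nearest-neighbour regressor stands in for a generic nonparametric estimator $\widehat{f}_{\ell j}$, and the function name and tuning constants are illustrative assumptions rather than the GAM-based implementation used in our experiments (the latter is available at the repository linked in Section 5).

# Simplified sketch of Algorithm 2 (NPVAR). Illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def npvar_layers(X, eta, n_neighbors=20):
    """X: (n, d) data matrix. Returns estimated layers (L_1, ..., L_r) as lists of columns."""
    n, d = X.shape
    # Steps 1-2: first layer from marginal variances.
    marg = X.var(axis=0)
    layers = [list(np.where(marg - marg.min() < eta)[0])]
    identified = set(layers[0])
    while len(identified) < d:
        A = sorted(identified)
        rest = [j for j in range(d) if j not in identified]
        # Step 3(a): sample splitting.
        perm = np.random.permutation(n)
        first, second = perm[: n // 2], perm[n // 2 :]
        resvar = {}
        for j in rest:
            # Step 3(b): estimate E[X_j | X_A] on the first half.
            f_hat = KNeighborsRegressor(n_neighbors=n_neighbors)
            f_hat.fit(X[first][:, A], X[first, j])
            # Step 3(c): plug-in residual variance (3) on the second half.
            pred = f_hat.predict(X[second][:, A])
            resvar[j] = np.mean(X[second, j] ** 2) - np.mean(pred ** 2)
        # Step 3(d): next layer = near-minimizers of the estimated residual variance.
        best = min(resvar.values())
        new_layer = [j for j in rest if resvar[j] - best < eta]
        layers.append(new_layer)
        identified.update(new_layer)
    return layers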

3.3 Comparison to existing algorithms

Compared to existing algorithms based on order search and equal variances, NPVAR applies to more general models without parametric assumptions, independent noise, or additivity. It is also instructive to make comparisons with greedy score-based algorithms such as causal additive models (CAM, (Bühlmann et al., 2014)) and greedy DAG search (GDS, (Peters and Bühlmann, 2013)). We focus here on CAM since it is more recent and applies in nonparametric settings, however, similar claims apply to GDS as well.

CAM is based around greedily minimizing the log-likelihood score for additive models with Gaussian noise. In particular, it is not guaranteed to find a global minimizer, which is as expected since it is based on a nonconvex program. This is despite the global minimizer—if it can be found—having good statistical properties. The next example shows that, in fact, there are identifiable models for which CAM will find the wrong graph with high probability.

Example 5.

Consider the following three-node additive noise model with $z_{j}\sim\mathcal{N}(0,1)$:

$X_{1}=z_{1},\qquad X_{2}=g(X_{1})+z_{2},\qquad X_{3}=g(X_{1})+g(X_{2})+z_{3}.$   (4)

In the supplement (Appendix D), we show the following: There exist infinitely many nonlinear functions $g$ for which the CAM algorithm returns an incorrect order under the model (4). This is illustrated empirically in Figure 1(a) for the nonlinearities $g(u)=\operatorname{sgn}(u)|u|^{1.4}$ and $g(u)=\sin u$. In each of these examples, the model satisfies the identifiability conditions for CAM as well as the conditions required in our work.
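
A hedged sketch of the data-generating process (4), with the nonlinearity $g(u)=\operatorname{sgn}(u)|u|^{1.4}$ from Figure 1(a); the sample size and seed are illustrative choices.

import numpy as np

def simulate_example5(n, g=lambda u: np.sign(u) * np.abs(u) ** 1.4, seed=0):
    # Three-node ANM (4): X1 = z1, X2 = g(X1) + z2, X3 = g(X1) + g(X2) + z3, z_j ~ N(0,1).
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))
    X1 = z[:, 0]
    X2 = g(X1) + z[:, 1]
    X3 = g(X1) + g(X2) + z[:, 2]
    return np.column_stack([X1, X2, X3])

X = simulate_example5(5000)
# All error variances equal 1, so the equal residual variance condition holds; feeding X to
# the NPVAR sketch above with a small eta should recover the ordering X1 < X2 < X3.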

We stress that this example does not contradict the statistical results in Bühlmann et al. (2014): It only shows that the algorithm may not find a global minimizer and as a result, returns an incorrect variable ordering. Correcting this discrepancy between the algorithmic and statistical results is a key motivation behind our work. In the next section, we show that NPVAR provably learns the true ordering—and hence the true DAG—with high probability.

4 Sample complexity

Our main result analyzes the sample complexity of NPVAR (Algorithm 2). Recall the layer decomposition $L(\mathsf{G})$ from Section 2 and define $d_{j}:=|A_{j}|$. Let $f_{\ell j}(X_{A_{j}})=\mathbb{E}[X_{\ell}\,|\,A_{j}]$.

Condition 1 (Regularity).

For all $j$ and all $\ell\notin A_{j}$: (a) $X_{j}\in[0,1]$, (b) $f_{\ell j}:[0,1]^{d_{j}}\to[0,1]$, (c) $f_{\ell j}\in L^{\infty}([0,1]^{d_{j}})$, and (d) $\operatorname{var}(X_{\ell}\,|\,A_{j})\leq\zeta_{0}<\infty$.

These are the standard regularity conditions from the literature on nonparametric statistics (Györfi et al., 2006; Tsybakov, 2009), and can be weakened (e.g. if the $X_{j}$ and $f_{\ell j}$ are unbounded, see (Kohler et al., 2009)). We impose these stronger assumptions in order to simplify the statements and focus on technical details pertinent to graphical modeling and structure learning. The next assumption is justified by Theorem 3.1, and as we have noted, can also be weakened.

Condition 2 (Identifiability).

$\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\equiv\sigma^{2}$ does not depend on $j$.

Our final condition imposes some basic finiteness and consistency requirements on the chosen nonparametric estimator $\widehat{f}$, which we view as a function for estimating $\mathbb{E}[Y\,|\,Z]$ from an arbitrary distribution over the pair $(Y,Z)$.

Condition 3 (Estimator).

The nonparametric estimator $\widehat{f}$ satisfies (a) $\mathbb{E}[Y\,|\,Z]\in L^{\infty}\implies\widehat{f}\in L^{\infty}$ and (b) $\mathbb{E}_{\widehat{f}}\|\widehat{f}(Z)-\mathbb{E}[Y\,|\,Z]\|_{2}^{2}\to 0$.

This is a mild condition that is satisfied by most popular estimators including kernel smoothers, nearest neighbours, and splines, and in particular, Condition 3(a) is only used to simplify the theorem statement and can easily be relaxed.

Theorem 4.1.

Assume Conditions 1-3. Let $\Delta_{j}>0$ be such that $\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j})>\sigma^{2}+\Delta_{j}$ for all $\ell\notin A_{j}$ and define $\Delta:=\inf_{j}\Delta_{j}$. Let $\delta^{2}:=\sup_{\ell,j}\mathbb{E}_{\widehat{f}_{\ell j}}\|f_{\ell j}(X_{A_{j}})-\widehat{f}_{\ell j}(X_{A_{j}})\|_{2}^{2}$. Then for any $\delta\sqrt{d}<\eta<\Delta/2$,

$\mathbb{P}(\widehat{L}=L(\mathsf{G}))\gtrsim 1-\frac{\delta^{2}}{\eta^{2}}rd.$   (5)

Once the layer decomposition L(𝖦)L(\mathsf{G}) is known, the graph 𝖦\mathsf{G} can be learned via standard nonlinear variable selection methods (see Appendix A in the supplement).

A feature of this result is that it is agnostic to the choice of estimator $\widehat{f}$, as long as it satisfies Condition 3. The dependence on $\widehat{f}$ is quantified through $\delta^{2}$, which depends on the sample size $n$ and represents the rate of convergence of the chosen nonparametric estimator. Instead of choosing a specific estimator, Theorem 4.1 is stated so that it can be applied to general estimators. As an example, suppose each $f_{\ell j}$ is Lipschitz continuous and $\widehat{f}$ is a standard kernel smoother. Then

$\mathbb{E}_{\widehat{f}_{\ell j}}\|f_{\ell j}(X_{L_{j}})-\widehat{f}_{\ell j}(X_{L_{j}})\|_{2}^{2}\leq\delta^{2}\lesssim n^{-\frac{2}{2+d}}.$

Thus we have the following special case:

Corollary 4.2.

Assume each $f_{\ell j}$ is Lipschitz continuous. Then $\widehat{L}$ can be computed in $O(nd^{5})$ time and $\mathbb{P}(\widehat{L}=L(\mathsf{G}))\geq 1-\varepsilon$ as long as $n=\Omega((rd/(\eta^{2}\varepsilon))^{1+d/2})$.

This is the best possible rate attainable by any algorithm without imposing stronger regularity conditions (see e.g. §5 in (Györfi et al., 2006)). Furthermore, $\delta^{2}$ can be replaced with the error of an arbitrary estimator of the residual variance itself (i.e. something besides the plug-in estimator (3)); see Proposition C.4 in Appendix C for details.

To illustrate these results, consider the problem of finding the direction of a Markov chain $X_{1}\to X_{2}\to\cdots\to X_{d}$ whose transition functions $\mathbb{E}[X_{j}\,|\,X_{j-1}]$ are each Lipschitz continuous. Then $r=d$, so Corollary 4.2 implies that $n=\Omega((d^{2}/(\eta\sqrt{\varepsilon}))^{1+d/2})$ samples are sufficient to learn the order—and hence the graph as well as each transition function—with high probability. Since $r=d$ for any Markov chain, this particular example maximizes the dependence on $d$; at the opposite extreme a bipartite graph with $r=2$ would require only $n=\Omega((\sqrt{d}/(\eta\sqrt{\varepsilon}))^{1+d/2})$. In these lower bounds, it is not necessary to know the type of graph (e.g. Markov chain, bipartite) or the depth $r$.

Choice of η\eta

The lower bound $\eta>\delta\sqrt{d}$ is not strictly necessary, and is only used to simplify the lower bound in (5). In general, taking $\eta$ sufficiently small works well in practice. The main tradeoff in choosing $\eta>0$ is computational: A smaller $\eta$ may lead to “splitting” one of the layers $L_{j}$. In this case, NPVAR still recovers the structure correctly, but the splitting results in redundant estimation steps in Step 3 (i.e. instead of estimating $L_{j}$ in one iteration, it takes multiple iterations to estimate correctly). The upper bound, however, is important: If $\eta$ is too large, then we may include spurious nodes in the layer $L_{j}$, which would cause problems in subsequent iterations.

Nonparametric rates

Theorem 4.1 and Corollary 4.2 make no assumptions on the sparsity of $\mathsf{G}$ or smoothness of the mean functions $\mathbb{E}[X_{\ell}\,|\,A_{j}]$. For this reason, the best possible rate for a naïve plug-in estimator of $\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j})$ is bounded by the minimax rate for estimating $\mathbb{E}[X_{\ell}\,|\,A_{j}]$. For practical reasons, we have chosen to focus on an agnostic analysis that does not rely on any particular estimator. Under additional sparsity and smoothness assumptions, these rates can be improved, which we briefly discuss here.

For example, by using adaptive estimators such as RODEO (Lafferty and Wasserman, 2008) or GRID (Giordano et al., 2020), the sample complexity will depend only on the sparsity of $f_{\ell j}(X_{A_{j}})$, i.e. $d^{*}=\max_{j}\max_{\ell\notin A_{j}}|\{k\in A_{j}:\partial_{k}f_{\ell j}\neq 0\}|$, where $\partial_{k}$ is the $k$th partial derivative. Another approach that does not require adaptive estimation is to assume $|L_{j}|\leq w$ and define $r^{*}:=\sup\{|i-j|:e=(e_{1},e_{2})\in E,\,e_{1}\in L_{i},\,e_{2}\in L_{j}\}$. Then $\delta^{2}\asymp n^{-2/(2+wr^{*})}$, and the resulting sample complexity depends on $wr^{*}$ instead of $d$. For a Markov chain with $w=r^{*}=1$ this leads to a substantial improvement.

Instead of sparsity, we could impose stronger smoothness assumptions: Let $\beta_{*}$ denote the smallest Hölder exponent of any $f_{\ell j}$. If $\beta_{*}\geq d/2$, then one can use a one-step correction to the plug-in estimator (3) to obtain a root-$n$ consistent estimator of $\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j})$ (Robins et al., 2008; Kandasamy et al., 2015). Another approach is to use undersmoothing (Doksum et al., 1995). In this case, the exponential sample complexity improves to polynomial sample complexity. For example, in Corollary 4.2, if we replace the Lipschitz condition with the stronger condition that $\beta_{*}\geq d/2$, then the sample complexity improves to $n=\Omega(rd/(\eta^{2}\varepsilon))$.

5 Experiments

Finally, we perform a simulation study to compare the performance of NPVAR against state-of-the-art methods for learning nonparametric DAGs. The algorithms are: RESIT (Peters et al., 2014), CAM (Bühlmann et al., 2014), EqVar (Chen et al., 2019), NOTEARS (Zheng et al., 2020), GSGES (Huang et al., 2018), PC (Spirtes and Glymour, 1991), and GES (Chickering, 2003). In our implementation of NPVAR, we use generalized additive models (GAMs) for both estimating $\widehat{f}_{\ell j}$ and variable selection. One notable detail is our implementation of EqVar, which we adapted to the nonlinear setting by using GAMs instead of subset selection for variable selection (the order estimation step remains the same). Full details of the implementations used as well as additional experiments can be found in the supplement. Code implementing the NPVAR algorithm is publicly available at https://github.com/MingGao97/NPVAR.

Figure 2: Structural Hamming distance (SHD) as a function of sample size ($n$) and number of nodes ($d$). (a) SHD vs. $n$ ($d=20$). (b) SHD vs. $d$ ($n=1000$). Error bars denote $\pm$1 standard error. Some algorithms were only run for sufficiently small graphs due to high computational cost.

We conducted a series of simulations on different graphs and models, comparing the performance in both order recovery and structure learning. Due to space limitations, only the results for structure learning in the three most difficult settings are highlighted in Figure 2. These experiments correspond to non-sparse graphs with non-additive dependence given by either a Gaussian process (GP) or a generalized linear model (GLM):

  • Graph types. We sampled three families of DAGs: Markov chains (MC), Erdös-Rényi graphs (ER), and scale-free graphs (SF). For MC graphs, there are exactly $d$ edges, whereas for ER and SF graphs, we sample graphs with $kd$ edges on average. This is denoted by ER4/SF4 for $k=4$ in Figure 2. Experiments on sparser DAGs can be found in the supplement.

  • Probability models. For the Markov chain models, we used two types of transition functions: An additive sine model with $\mathbb{P}(X_{j}\,|\,X_{j-1})=\mathcal{N}(\sin(X_{j-1}),\sigma^{2})$ and a discrete model (GLM) with $X_{j}\in\{0,1\}$ and $\mathbb{P}(X_{j}\,|\,X_{j-1})\in\{p,1-p\}$. For the ER and SF graphs, we sampled $\mathbb{E}[X_{j}\,|\,\operatorname{pa}(j)]$ from both additive GPs (AGP) and non-additive GPs (NGP).

Full details as well as additional experiments on order recovery, additive models, sparse graphs, and misspecified models can be found in the supplement (Appendix E).
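
As an illustration, the following is a hedged sketch of the two Markov-chain simulators described above (additive sine model and binary GLM); the noise level, flip probability, and root-node distributions are illustrative choices, with the exact settings given in the supplement.

import numpy as np

def simulate_mc_sine(n, d, sigma=0.5, seed=0):
    # Additive sine chain: P(X_j | X_{j-1}) = N(sin(X_{j-1}), sigma^2).
    # The root-node distribution is an illustrative choice.
    rng = np.random.default_rng(seed)
    X = np.zeros((n, d))
    X[:, 0] = rng.normal(scale=sigma, size=n)
    for j in range(1, d):
        X[:, j] = np.sin(X[:, j - 1]) + rng.normal(scale=sigma, size=n)
    return X

def simulate_mc_glm(n, d, p=0.2, seed=0):
    # Binary GLM chain: P(X_j = 1 | X_{j-1}) lies in {p, 1 - p}.
    rng = np.random.default_rng(seed)
    X = np.zeros((n, d), dtype=int)
    X[:, 0] = rng.binomial(1, 0.5, size=n)
    for j in range(1, d):
        flip = rng.binomial(1, p, size=n)   # flip the previous state with probability p
        X[:, j] = np.abs(X[:, j - 1] - flip)
    return X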

Structure learning

To evaluate overall performance, we computed the structural Hamming distance (SHD) between the learned DAG and the true DAG. SHD is a standard metric used for comparison of graphical models. According to this metric, the clear leaders are NPVAR, EqVar, and CAM. Consistent with previous findings, existing methods tend to suffer as the edge density and dimension of the graphs increase; NPVAR, however, is more robust in these settings. Surprisingly, the CAM algorithm remains quite competitive for non-additive models, although both EqVar and NPVAR clearly outperform CAM. On the GLM model, which illustrates a non-additive model with non-additive noise, EqVar and NPVAR performed the best, although PC showed good performance with $n=1000$ samples. Both CAM and RESIT terminated with numerical issues on the GLM model.

These experiments serve to corroborate our theoretical results and highlight the effectiveness of the NPVAR algorithm, but of course there are tradeoffs. For example, algorithms such as CAM which exploit sparse and additive structure perform very well in settings where sparsity and additivity can be exploited, and indeed outperform NPVAR in some cases. Hopefully, these experiments can help to shed some light on when various algorithms are more or less effective.

Misspecification and sensitivity analysis

We also considered two cases of misspecification: In Appendix B.1, we consider an example where Condition 2 fails, but NPVAR still successfully recovers the true ordering. This experiment corroborates our claims that this condition can be relaxed to handle unequal residual variances. We also evaluated the performance of NPVAR on linear models as in (1), and in all cases it was able to recover the correct ordering.

6 Discussion

In this paper, we analyzed the sample complexity of a polynomial-time algorithm for estimating nonparametric causal models represented by a DAG. Notably, our analysis avoids many of the common assumptions made in the literature. Instead, we assume that the residual variances are equal, similar to assuming homoskedastic noise in a standard nonparametric regression model. Our experiments confirm that the algorithm, called NPVAR, is effective at learning identifiable causal models and outperforms many existing methods, including several recent state-of-the-art methods. Nonetheless, existing algorithms such as CAM are quite competitive and apply in settings where NPVAR does not.

We conclude by discussing some limitations and directions for future work. Although we have relaxed many of the common assumptions made in the literature, these assumptions have been replaced by an assumption on the residual variances that may not hold in practice. An interesting question is whether or not there exist provably polynomial-time algorithms for nonparametric models under less restrictive assumptions. Furthermore, although the proposed algorithm is polynomial-time, the worst-case $O(d^{5})$ dependence on the dimension is of course limiting. This can likely be reduced by developing more efficient estimators of the residual variance that do not first estimate the mean function. This idea is common in the statistics literature; however, we are not aware of such estimators specifically for the residual variance (or other nonlinear functionals of $\mathbb{P}(X)$). Furthermore, our general approach can be fruitfully applied to study various parametric models that go beyond linear models, for which both computation and sample efficiency would be expected to improve. These are interesting directions for future work.

Acknowledgements

We thank the anonymous reviewers for valuable feedback, as well as Y. Samuel Wang and Edward H. Kennedy for helpful discussions. B.A. acknowledges the support of the NSF via IIS-1956330 and the Robert H. Topel Faculty Research Fund. Y.D.’s work has been partially supported by the NSF (CCF-1439156, CCF-1823032, CNS-1764039).

Appendix A Reduction to order search

The fact that DAG learning can be reduced to learning a topological sort is well-known. For example, this fact is the basis of exact algorithms for score-based learning based on dynamic programming (Silander and Myllymaki, 2006; Ott et al., 2004; Perrier et al., 2008; Singh and Moore, 2005) as well as recent algorithms for linear models (Chen et al., 2019; Ghoshal and Honorio, 2017, 2018). See also (Shojaie and Michailidis, 2010). This fact has also been exploited in the overdispersion scoring model developed by Park and Raskutti (2017) as well as for nonlinear additive models (Bühlmann et al., 2014). In fact, more can be said: Any ordering defines a minimal I-map of $\mathbb{P}(X)$ via a simple iterative algorithm (see §3.4.1, Algorithm 3.2 in (Koller and Friedman, 2009)), and this minimal I-map is unique as long as $\mathbb{P}(X)$ satisfies the intersection property. This is guaranteed, for example, if $\mathbb{P}(X)$ has a positive density, but holds under weaker conditions (see (Peters, 2015) for necessary and sufficient conditions assuming $\mathbb{P}(X)$ has a density and (Dawid, 1980), Theorem 7.1, for the general case). This same algorithm can then be used to reconstruct the true DAG $\mathsf{G}$ from the true ordering $\prec$. As noted in Section 2, a further reduction can be obtained by considering the layer decomposition $L(\mathsf{G})$, from which all topological orders $\prec$ of $\mathsf{G}$ can be deduced.

Once the ordering is known, existing nonlinear variable selection methods (Lafferty and Wasserman, 2008; Rosasco et al., 2013; Giordano et al., 2020; Miller and Hall, 2010; Bertin and Lecué, 2008; Comminges and Dalalyan, 2011) suffice to learn the parent sets $\operatorname{pa}(j)$ and hence the graph $\mathsf{G}$. More specifically, given an order $\prec$, to identify $\operatorname{pa}(j)$, let $f(S_{j}):=\mathbb{E}[X_{j}\,|\,S_{j}]$, where $S_{j}:=\{X_{k}:X_{k}\prec X_{j}\}$. The parent set of $X_{j}$ is given by the active variables in this conditional expectation, i.e. $\operatorname{pa}(j)=\{k:\partial_{k}f\neq 0\}$, where $\partial_{k}$ is the partial derivative of $f$ with respect to the $k$th argument.

In our experiments, we use exactly this procedure to learn $\mathsf{G}$ from the order $\prec$, based on the data. Specifically, we use generalized additive models, similar to the pruning step in (Bühlmann et al., 2014). See Appendix E for more details.
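
As a rough illustration of this pruning step, the sketch below regresses each node on its predecessors in the estimated order and keeps the predecessors whose permutation importance exceeds a threshold; the random-forest regressor and the threshold tau are illustrative stand-ins for the GAM significance pruning actually used in our experiments.

# Hedged sketch: recover parent sets from an estimated ordering by nonlinear variable
# selection (permutation-importance screen as a stand-in for GAM pruning).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def parents_from_order(X, order, tau=0.01, seed=0):
    """X: (n, d) data; order: estimated topological order (list of columns)."""
    parents = {order[0]: []}
    for pos in range(1, len(order)):
        j, preds = order[pos], order[:pos]
        rf = RandomForestRegressor(n_estimators=200, random_state=seed)
        rf.fit(X[:, preds], X[:, j])
        imp = permutation_importance(rf, X[:, preds], X[:, j],
                                     n_repeats=5, random_state=seed)
        parents[j] = [k for k, m in zip(preds, imp.importances_mean) if m > tau]
    return parents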

Appendix B Proof of Theorem 3.1

The key lemma is the following, which is easy to prove for additive noise models via (2), and which we show holds more generally in non-additive models:

Lemma B.1.

Let $A\subset V$ be an ancestral set in $\mathsf{G}$. If $\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\equiv\sigma^{2}$ does not depend on $j$, then for any $j\notin A$,

$\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{A})=\sigma^{2}$ if $\operatorname{pa}(j)\subset A$,
$\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{A})>\sigma^{2}$ otherwise.
Proof of Lemma B.1.

Let $B_{j}=\operatorname{pa}(X_{j})$, $\overline{B_{j}}:=B_{j}-A$, and $\overline{\mathsf{G}}$ be the subgraph of $\mathsf{G}$ formed by removing the nodes in the ancestral set $A$. Then

$\operatorname{var}(X_{j}\,|\,X_{A})=\mathbb{E}[\operatorname{var}(X_{j}\,|\,X_{A},X_{B_{j}})\,|\,X_{A}]+\operatorname{var}[\mathbb{E}[X_{j}\,|\,X_{A},X_{B_{j}}]\,|\,X_{A}]$
$=\mathbb{E}[\operatorname{var}(X_{j}\,|\,X_{A},X_{\overline{B_{j}}})\,|\,X_{A}]+\operatorname{var}[\mathbb{E}[X_{j}\,|\,X_{A},X_{\overline{B_{j}}}]\,|\,X_{A}].$

There are two cases: (i) $\overline{B_{j}}=\emptyset$, and (ii) $\overline{B_{j}}\neq\emptyset$. In case (i), it follows that $B_{j}\subset A$ and hence $\operatorname{pa}(j)\subset A$. Since $X_{j}$ is conditionally independent of its nondescendants (e.g. ancestors) given its parents, it follows that $\operatorname{var}(X_{j}\,|\,X_{A})=\operatorname{var}(X_{j}\,|\,X_{B_{j}})$ and hence

$\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{A})=\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{B_{j}})=\sigma^{2}.$

In case (ii), it follows that

$\operatorname{var}(X_{j}\,|\,X_{A})=\mathbb{E}[\operatorname{var}(X_{j}\,|\,X_{A},X_{\overline{B_{j}}})\,|\,X_{A}]+\operatorname{var}[\mathbb{E}[X_{j}\,|\,X_{A},X_{\overline{B_{j}}}]\,|\,X_{A}]$
$=\mathbb{E}[\operatorname{var}(X_{j}\,|\,X_{B_{j}})\,|\,X_{A}]+\operatorname{var}[\mathbb{E}[X_{j}\,|\,X_{B_{j}}]\,|\,X_{A}],$

where again we used that $X_{j}$ is conditionally independent of its nondescendants (e.g. ancestors) given its parents to replace conditioning on $(X_{A},X_{\overline{B_{j}}})=X_{A\cup B_{j}}$ with conditioning on $B_{j}$.

Now suppose $X_{k}$ is in case (i) and $X_{j}$ is in case (ii). We wish to show that $\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{A})>\mathbb{E}\operatorname{var}(X_{k}\,|\,X_{A})=\sigma^{2}$. Then

$\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{A})=\mathbb{E}\big[\mathbb{E}[\operatorname{var}(X_{j}\,|\,X_{B_{j}})\,|\,X_{A}]\big]+\mathbb{E}\operatorname{var}[\mathbb{E}[X_{j}\,|\,X_{B_{j}}]\,|\,X_{A}]$
$>\mathbb{E}\big[\mathbb{E}[\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\,|\,X_{A}]\big]$
$=\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))$
$=\mathbb{E}\operatorname{var}(X_{k}\,|\,\operatorname{pa}(k))$
$=\sigma^{2},$

where we have invoked the assumption that $\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))$ does not depend on $j$ to conclude $\mathbb{E}[\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\,|\,X_{A}]=\mathbb{E}[\operatorname{var}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{A}]$. This completes the proof. ∎

Theorem 3.1 is an immediate corollary of Lemma B.1. For completeness, we include a proof below.

Proof of Theorem 3.1.

Let $S(\mathsf{G})$ denote the set of sources in $\mathsf{G}$ and note that Lemma B.1 implies that if $X_{s}\in S(\mathsf{G})$, then $\operatorname{var}(X_{s})<\operatorname{var}(X_{j})$ for any $X_{j}\notin S(\mathsf{G})$. Thus, $S(\mathsf{G})$ is identifiable. Let $\mathsf{G}_{1}$ denote the subgraph of $\mathsf{G}$ formed by removing the nodes in $S(\mathsf{G})$. Since $S(\mathsf{G})=L_{1}=A_{1}$, $S(\mathsf{G})$ is an ancestral set in $\mathsf{G}$. After conditioning on $A_{1}$, we can thus apply Lemma B.1 once again to identify the sources in $\mathsf{G}_{1}$, i.e. $S(\mathsf{G}_{1})=L_{2}$. By repeating this procedure, we can recursively identify $L_{1},\ldots,L_{r}$, and hence any topological sort of $\mathsf{G}$. ∎

B.1 Generalization to unequal variances

In this appendix, we illustrate how Theorem 3.1 can be extended to the case where the residual variances differ, i.e. $\sigma_{j}^{2}=\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))$ depends on $j$. Let $\operatorname{de}(i)$ denote the descendants of node $i$ and $[a:b]=\{a,a+1,\ldots,b-1,b\}$. Note also that for any nodes $X_{u}$ and $X_{v}$ in the same layer $L_{m}$ of the graph, if we interchange the positions of $u$ and $v$ in some true ordering $\pi$ consistent with the graph to get $\pi_{u}$ and $\pi_{v}$, both $\pi_{u}$ and $\pi_{v}$ are correct orderings.

The following result is similar to existing results on unequal variances (Ghoshal and Honorio, 2018; Park and Kim, 2020; Park, 2020), with the exception that it applies to general DAGs without linearity, additivity, or independent noise.

Theorem B.2.

Suppose there exists an ordering $\pi$ such that for all $j\in[1:d]$ and $k\in\pi_{[j+1:d]}$, the following conditions hold:

  1. If $i=\pi_{j}$ and $k$ are not in the same layer $L_{m}$, then

     $\sigma_{i}^{2}<\sigma_{k}^{2}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{\pi_{[1:j-1]}}).$   (6)

  2. If $i$ and $k$ are in the same layer $L_{m}$, then either $\sigma^{2}_{i}=\sigma^{2}_{k}$ or (6) holds.

Then the order $\pi$ is identifiable.

This condition controls not only the descendants of a node, but also the other non-descendants that have not yet been identified.

Before proving this result, we illustrate it with an example.

Example 6.

Let us consider a very simple case: A Markov chain with three nodes $X_{1}\rightarrow X_{2}\rightarrow X_{3}$ such that

$X_{1}=z_{1}\sim N(0,1),\qquad X_{2}=\tfrac{1}{2}X_{1}^{2}+z_{2},\ \ z_{2}\sim N(0,\tfrac{2}{3}),\qquad X_{3}=\tfrac{1}{3}X_{2}^{2}+z_{3},\ \ z_{3}\sim N(0,\tfrac{1}{2}).$

Here we have unequal residual variances. We now check that this model satisfies the conditions in Theorem B.2. Let $f(u)=u^{2}$, so that $\mathbb{E}(X_{2}\,|\,X_{1})=\tfrac{1}{2}f(X_{1})$ and $\mathbb{E}(X_{3}\,|\,X_{2})=\tfrac{1}{3}f(X_{2})$, and note that the true ordering is $X_{1}\prec X_{2}\prec X_{3}$. Starting from the first source node $X_{1}$, we have

$\sigma^{2}_{1}=1,$
$\sigma_{2}^{2}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{2}\,|\,\operatorname{pa}(2)))=2/3+\operatorname{var}(X_{1}^{2}/2)=2/3+1/2>1=\sigma^{2}_{1},$
$\sigma^{2}_{3}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{3}\,|\,\operatorname{pa}(3)))=1/2+\operatorname{var}(X_{2}^{2}/3)$
$\quad=1/2+\tfrac{1}{9}\big(\operatorname{var}(X_{1}^{4})/16+\operatorname{var}(z_{2}^{2})+\operatorname{var}(X_{1}^{2}z_{2})+2\operatorname{cov}(X_{1}^{4}/4,X_{1}^{2}z_{2})+2\operatorname{cov}(z_{2}^{2},X_{1}^{2}z_{2})\big)$
$\quad=1/2+8/9+8/81>1=\sigma^{2}_{1}.$

Then for the second source node $X_{2}$,

$\sigma^{2}_{2}=2/3,$
$\sigma^{2}_{3}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{3}\,|\,\operatorname{pa}(3))\,|\,X_{1})=1/2+\tfrac{1}{9}\big(\operatorname{var}(z^{2}_{2})+\mathbb{E}\operatorname{var}(X^{2}_{1}z_{2}\,|\,X_{1})\big)=1/2+1/3-1/81>2/3.$

Thus the condition is satisfied.

If instead we have $\sigma^{2}_{3}=\operatorname{var}(z_{3})=1/3$, the condition would be violated. It is easy to check that nothing changes for $X_{1}$. For the second source node $X_{2}$, things are different:

$\sigma^{2}_{3}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{3}\,|\,\operatorname{pa}(3))\,|\,X_{1})=1/3+1/3-1/81<2/3.$

Thus, the order of $X_{2}$ and $X_{3}$ would be flipped for this model.

This example is easily confirmed in practice. We let $n$ range from 50 to 1000 and check whether the estimated order is correct for the two models ($\sigma_{3}^{2}=1/2$ and $\sigma_{3}^{2}=1/3$). We repeated this 50 times and report the averages in Figure 3.
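
A hedged sketch of this check follows; the greedy residual-variance ordering below mirrors Algorithm 2 (without the layer grouping), a k-nearest-neighbour regressor stands in for a kernel smoother, and the sample size, seed, and number of neighbours are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def simulate_chain(n, var_z3, seed=0):
    # Example 6: X1 ~ N(0,1), X2 = X1^2/2 + z2 with var(z2) = 2/3, X3 = X2^2/3 + z3.
    rng = np.random.default_rng(seed)
    X1 = rng.normal(size=n)
    X2 = 0.5 * X1 ** 2 + rng.normal(scale=np.sqrt(2 / 3), size=n)
    X3 = X2 ** 2 / 3 + rng.normal(scale=np.sqrt(var_z3), size=n)
    return np.column_stack([X1, X2, X3])

def greedy_order(X, k=100):
    # Greedily pick the node with the smallest estimated residual variance given the
    # nodes identified so far, using split-sample plug-in estimates as in (3).
    n, d = X.shape
    half = n // 2
    order, rest = [], list(range(d))
    while rest:
        scores = {}
        for j in rest:
            if not order:
                scores[j] = X[:, j].var()
            else:
                f = KNeighborsRegressor(n_neighbors=k).fit(X[:half][:, order], X[:half, j])
                pred = f.predict(X[half:][:, order])
                scores[j] = np.mean(X[half:, j] ** 2) - np.mean(pred ** 2)
        j_min = min(scores, key=scores.get)
        order.append(j_min)
        rest.remove(j_min)
    return order

for var_z3 in (1 / 2, 1 / 3):
    print(var_z3, greedy_order(simulate_chain(20_000, var_z3)))
# As n grows, the estimated order should tend to [0, 1, 2] when var(z_3) = 1/2 and to
# [0, 2, 1] when var(z_3) = 1/3, matching the discussion above; for moderate n the
# comparison is noisy because the relevant variance gap is small.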

Figure 3: Experiments confirming Example 6. For the identifiable setting with $\sigma_{3}^{2}=1/2$, Algorithm 2 correctly learns the topological ordering. For the non-identifiable setting with $\sigma_{3}^{2}=1/3$, Algorithm 2 fails to learn the ordering.
Proof of Theorem B.2.

We first consider the case where every node in the same layer has a different residual variance, i.e. $X_{u},X_{v}\in L_{m}\implies\sigma^{2}_{u}\neq\sigma^{2}_{v}$. We proceed by induction on the element $j$ of the ordering $\pi$. Let $i=\pi_{j}$.

When $j=1$ and $i=\pi_{1}$, $X_{i}$ must be a source node and we have, for all $k\in\pi_{[2:d]}$,

$\operatorname{var}(X_{i})=\sigma_{i}^{2}<\sigma_{k}^{2}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k)))$
$=\mathbb{E}\operatorname{var}(X_{k}\,|\,\operatorname{pa}(k))+\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k)))$
$=\operatorname{var}(X_{k}).$

Thus the first node to be identified must be $i=\pi_{1}$, as desired.

Now suppose the first $j-1$ nodes in the ordering $\pi$ are correctly identified. Each parent of node $i=\pi_{j}$ must have been identified in $\pi_{[1:j-1]}$, or else $X_{i}$ is a source node. Then we have, for all $k\in\{\pi_{j+1},\ldots,\pi_{d}\}$,

$\mathbb{E}\operatorname{var}(X_{i}\,|\,X_{\pi_{[1:j-1]}})=\sigma_{i}^{2}<\sigma_{k}^{2}+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{\pi_{[1:j-1]}})$
$=\mathbb{E}\operatorname{var}(X_{k}\,|\,\operatorname{pa}(k))+\mathbb{E}\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{\pi_{[1:j-1]}})$
$=\mathbb{E}[\mathbb{E}(\operatorname{var}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{\pi_{[1:j-1]}})]+\mathbb{E}[\operatorname{var}(\mathbb{E}(X_{k}\,|\,\operatorname{pa}(k))\,|\,X_{\pi_{[1:j-1]}})]$
$=\mathbb{E}\operatorname{var}(X_{k}\,|\,X_{\pi_{[1:j-1]}}).$

Then the $j$th node to be identified must be $i=\pi_{j}$. The induction is completed.

Finally, if $X_{u},X_{v}\in L_{m}$ are in the same layer and $\sigma_{u}^{2}=\sigma_{v}^{2}$, then this procedure may choose either $X_{u}$ or $X_{v}$ first. For example, if $X_{u}$ is chosen first, then $X_{v}$ will be incorporated into the same layer as $X_{u}$. Since both these nodes are in the same layer, swapping $X_{u}$ and $X_{v}$ in any ordering still produces a valid ordering of the DAG. The proof is complete. ∎

Appendix C Proof of Theorem 4.1

The proof of Theorem 4.1 will be broken down into several steps. First, we derive an upper bound on the error of the plug-in estimator used in Algorithm 2 (Appendix C.1), and then we derive a uniform upper bound (Appendix C.2). Based on this upper bound, we prove Theorem 4.1 via Proposition C.4 (Appendix C.3). Appendix C.4 collects various technical lemmas that are used throughout.

C.1 A plug-in estimate

Let (X,Y)\in\mathbb{R}^{m}\times\mathbb{R} be a pair of random variables and \widehat{f} be a data-dependent estimator of the conditional expectation f(X):=\mathbb{E}[Y\,|\,X]. Assume we have split the sample into two groups, which for clarity we denote by (U^{(1)},V^{(1)}),\ldots,(U^{(n_{1})},V^{(n_{1})})\sim\mathbb{P}(X,Y) and (X^{(1)},Y^{(1)}),\ldots,(X^{(n_{2})},Y^{(n_{2})})\sim\mathbb{P}(X,Y). Given these samples, define an estimator of \sigma^{2}_{\operatorname{RV}}:=\mathbb{E}\operatorname{var}(Y\,|\,X) by

σ^RV2:=1n2i=1n2(Y(i))21n2i=1n2f^(X(i))2.\displaystyle\widehat{\sigma}_{\operatorname{RV}}^{2}\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}(Y^{(i)})^{2}-\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\widehat{f}(X^{(i)})^{2}. (7)

Note here that \widehat{f} depends on (U^{(i)},V^{(i)}) and is independent of the second sample (X^{(i)},Y^{(i)}). We wish to bound the deviation \mathbb{P}(|\widehat{\sigma}^{2}_{\operatorname{RV}}-\sigma^{2}_{\operatorname{RV}}|\geq t).
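To make the sample-splitting construction in (7) concrete, the following is a minimal Python sketch (illustrative only, not the implementation used in the paper); the choice of regressor, the toy function f(x)=\sin(x), and the sample sizes are our own assumptions:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy data: Y = f(X) + noise with f(x) = sin(x), so E var(Y | X) = 0.25.
n = 4000
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.5 * rng.standard_normal(n)

# Sample splitting: fit fhat on the first half (the (U, V) sample),
# then evaluate the estimator (7) on the held-out second half.
X1, Y1 = X[: n // 2], Y[: n // 2]
X2, Y2 = X[n // 2:], Y[n // 2:]
fhat = KNeighborsRegressor(n_neighbors=50).fit(X1, Y1)

sigma2_rv = np.mean(Y2 ** 2) - np.mean(fhat.predict(X2) ** 2)  # estimator (7)
print(sigma2_rv)  # close to E var(Y | X) = 0.25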

Define the target \theta^{*}=\mathbb{E}f^{2}(X) and its plug-in estimator

θ(g;q)=g(x)2dq(x).\displaystyle\theta(g;q)=\int g(x)^{2}\operatorname{d\!}q(x).

Letting \mathbb{P}_{X} denote the true marginal distribution of X, we have \theta^{*}=\theta(f;\mathbb{P}_{X}) and \widehat{\theta}:=\theta(\widehat{f};\widehat{\mathbb{P}}), where \widehat{\mathbb{P}}=n_{2}^{-1}\sum_{i}\delta_{X^{(i)}} is the empirical distribution. We will also make use of the more general targets \theta(g;\mathbb{P}_{X})=\mathbb{E}g^{2}(X) for general functions g.

Finally, as a matter of notation, we adopt the following convention: for a random variable Z, \|Z\|_{p}:=(\mathbb{E}_{Z}|Z|^{p})^{1/p} is the usual L^{p}-norm of Z as a random variable, and for a (nonrandom) function f, \|f\|_{p}:=(\int|f|^{p}\,\operatorname{d\!}\zeta)^{1/p}, where \zeta is a fixed base measure such as Lebesgue measure. In particular, \|f(X)\|_{p}\neq\|f\|_{p} in general; the difference lies in which measure the integrals are taken with respect to. Moreover, we shall always explicitly specify with respect to which variables probabilities and expectations are taken, e.g. to disambiguate \mathbb{E}_{\widehat{f},X}=\mathbb{E}_{\widehat{f}}\mathbb{E}_{X}, \mathbb{E}_{\widehat{f}}, and \mathbb{E}_{X}.

We first prove the following result:

Proposition C.1.

Assume \|f(X)\|_{\infty},\|\widehat{f}(X)\|_{\infty}\leq B_{\infty}. Then

(|θ^θ|2t)𝔼f^f(X)f^(X)22t2+f(X)44+𝔼f^f^(X)f(X)4n2t2.\displaystyle\mathbb{P}(|\widehat{\theta}-\theta^{*}|\geq 2t)\lesssim\frac{\mathbb{E}_{\widehat{f}}\|f(X)-\widehat{f}(X)\|_{2}^{2}}{t^{2}}+\frac{\|f(X)\|_{4}^{4}+\mathbb{E}_{\widehat{f}}\|\widehat{f}(X)-f(X)\|_{4}}{n_{2}t^{2}}. (8)
Proof.

We have

f^,X(|θ^θ|2t)f^,X(|θθ(f^;X)|t)+f^,X(|θ^θ(f^;X)|t).\displaystyle\mathbb{P}_{\widehat{f},X}(|\widehat{\theta}-\theta^{*}|\geq 2t)\leq\mathbb{P}_{\widehat{f},X}(|\theta^{*}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t)+\mathbb{P}_{\widehat{f},X}(|\widehat{\theta}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t).

The second term is easy to dispense with since

f^,X(|θ^θ(f^;X)|t)\displaystyle\mathbb{P}_{\widehat{f},X}(|\widehat{\theta}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t) 𝔼f^𝔼X(θ^θ(f^;X))2t2𝔼f^varX(f^2(X))n2t2.\displaystyle\leq\frac{\mathbb{E}_{\widehat{f}}\mathbb{E}_{X}(\widehat{\theta}-\theta(\widehat{f};\mathbb{P}_{X}))^{2}}{t^{2}}\leq\frac{\mathbb{E}_{\widehat{f}}\operatorname{var}_{X}(\widehat{f}^{2}(X))}{n_{2}t^{2}}. (9)

It follows from Lemma C.5 with p=4 that

|𝔼Xf^(X)4𝔼Xf(X)4|\displaystyle|\mathbb{E}_{X}\widehat{f}(X)^{4}-\mathbb{E}_{X}f(X)^{4}| f^(X)f(X)4k=03f^(X)4kf(X)43k\displaystyle\leq\|\widehat{f}(X)-f(X)\|_{4}\sum_{k=0}^{3}\|\widehat{f}(X)\|_{4}^{k}\|f(X)\|_{4}^{3-k}
4(2B)3f^(X)f(X)4\displaystyle\leq 4(2B_{\infty})^{3}\|\widehat{f}(X)-f(X)\|_{4}

and hence

varX(f^2(X))𝔼Xf^4(X)f(X)44+f^(X)f(X)4.\displaystyle\operatorname{var}_{X}(\widehat{f}^{2}(X))\leq\mathbb{E}_{X}\widehat{f}^{4}(X)\lesssim\|f(X)\|_{4}^{4}+\|\widehat{f}(X)-f(X)\|_{4}. (10)

Combined with (9), we finally have

(|θ^θ(f^;X)|t)f(X)44+𝔼f^f^(X)f(X)4n2t2.\displaystyle\mathbb{P}(|\widehat{\theta}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t)\lesssim\frac{\|f(X)\|_{4}^{4}+\mathbb{E}_{\widehat{f}}\|\widehat{f}(X)-f(X)\|_{4}}{n_{2}t^{2}}. (11)

For the first term, since f(X),\widehat{f}(X)\in L^{\infty}, Lemma C.7 implies

𝔼f^(θθ(f^;X))2\displaystyle\mathbb{E}_{\widehat{f}}(\theta^{*}-\theta(\widehat{f};\mathbb{P}_{X}))^{2} =𝔼f^𝔼X(f2(X)f^2(X))2\displaystyle=\mathbb{E}_{\widehat{f}}\mathbb{E}_{X}(f^{2}(X)-\widehat{f}^{2}(X))^{2}
𝔼f^f(X)f^(X)22,\displaystyle\lesssim\mathbb{E}_{\widehat{f}}\|f(X)-\widehat{f}(X)\|_{2}^{2},

and thus

f^(|θθ(f^;X)|t)\displaystyle\mathbb{P}_{\widehat{f}}(|\theta^{*}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t) 𝔼f^(θθ(f^;X))2t2𝔼f^f(X)f^(X)22t2.\displaystyle\leq\frac{\mathbb{E}_{\widehat{f}}(\theta^{*}-\theta(\widehat{f};\mathbb{P}_{X}))^{2}}{t^{2}}\lesssim\frac{\mathbb{E}_{\widehat{f}}\|f(X)-\widehat{f}(X)\|_{2}^{2}}{t^{2}}.

Therefore

f^,X(|θ^θ|2t)\displaystyle\mathbb{P}_{\widehat{f},X}(|\widehat{\theta}-\theta^{*}|\geq 2t) f^,X(|θθ(f^;X)|t)+f^,X(|θ^θ(f^;X)|t)\displaystyle\leq\mathbb{P}_{\widehat{f},X}(|\theta^{*}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t)+\mathbb{P}_{\widehat{f},X}(|\widehat{\theta}-\theta(\widehat{f};\mathbb{P}_{X})|\geq t)
𝔼f^f(X)f^(X)22t2+f(X)44+𝔼f^f^(X)f(X)4n2t2.\displaystyle\lesssim\frac{\mathbb{E}_{\widehat{f}}\|f(X)-\widehat{f}(X)\|_{2}^{2}}{t^{2}}+\frac{\|f(X)\|_{4}^{4}+\mathbb{E}_{\widehat{f}}\|\widehat{f}(X)-f(X)\|_{4}}{n_{2}t^{2}}.\qed

Finally, we conclude the following:

Corollary C.2.

If \|f(X)\|_{\infty},\|\widehat{f}(X)\|_{\infty}\leq B_{\infty}, then

(|σ^RV2σRV2|t)\displaystyle\mathbb{P}(|\widehat{\sigma}_{\operatorname{RV}}^{2}-\sigma_{\operatorname{RV}}^{2}|\geq t) 4t2(𝔼f^f(X)f^(X)22+var(Y)+f(X)44+𝔼f^f^(X)f(X)4n2).\displaystyle\lesssim\frac{4}{t^{2}}\Big{(}\mathbb{E}_{\widehat{f}}\|f(X)-\widehat{f}(X)\|_{2}^{2}+\frac{\operatorname{var}(Y)+\|f(X)\|_{4}^{4}+\mathbb{E}_{\widehat{f}}\|\widehat{f}(X)-f(X)\|_{4}}{n_{2}}\Big{)}.

C.2 A uniform bound

For any j=1,\ldots,r and \ell\notin A_{j}, define \sigma_{\ell j}^{2}:=\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j}) and let \widehat{\sigma}_{\ell j}^{2} be the corresponding plug-in estimator from (7). By Corollary C.2, we have for f_{\ell j}(X_{A_{j}}):=\mathbb{E}[X_{\ell}\,|\,X_{A_{j}}],

(|σ^j2σj2|t)4t2(𝔼f^fj(XAj)f^j(XAj)22+var(X)+fj(XAj)44+𝔼f^f^j(XAj)fj(XAj)4n2)\displaystyle\begin{aligned} \mathbb{P}(|\widehat{\sigma}_{\ell j}^{2}-\sigma_{\ell j}^{2}|\geq t)\lesssim\frac{4}{t^{2}}\Big{(}\mathbb{E}_{\widehat{f}}\|f_{\ell j}(X_{A_{j}})&-\widehat{f}_{\ell j}(X_{A_{j}})\|_{2}^{2}\\ &+\frac{\operatorname{var}(X_{\ell})+\|f_{\ell j}(X_{A_{j}})\|_{4}^{4}+\mathbb{E}_{\widehat{f}}\|\widehat{f}_{\ell j}(X_{A_{j}})-f_{\ell j}(X_{A_{j}})\|_{4}}{n_{2}}\Big{)}\end{aligned} (12)

Thus we have the following result:

Proposition C.3.

Assume for all j and \ell\notin A_{j}:

  1. \|f_{\ell j}(X_{A_{j}})\|_{\infty},\|\widehat{f}_{\ell j}(X_{A_{j}})\|_{\infty}\leq B_{\infty};

  2. \|f_{\ell j}(X_{A_{j}})\|_{4}^{4}+\operatorname{var}(X_{\ell})\leq B_{\infty};

  3. \mathbb{E}_{\widehat{f}}\|f_{\ell j}(X_{A_{j}})-\widehat{f}_{\ell j}(X_{A_{j}})\|_{2}^{2}\to 0.

Then

sup,j(|σ^j2σj2|>t)4t2(𝔼f^fj(XAj)f^j(XAj)22+1n2).\displaystyle\sup_{\ell,j}\mathbb{P}(|\widehat{\sigma}_{\ell j}^{2}-\sigma_{\ell j}^{2}|>t)\lesssim\frac{4}{t^{2}}\Big{(}\mathbb{E}_{\widehat{f}}\|f_{\ell j}(X_{A_{j}})-\widehat{f}_{\ell j}(X_{A_{j}})\|_{2}^{2}+\frac{1}{n_{2}}\Big{)}. (13)

For example, under Conditions 1–3, we have B_{\infty}:=\sup_{\ell,j}2\|f_{\ell j}(X_{A_{j}})\|_{\infty}^{4}+\zeta_{0}.

C.3 Proof of Theorem 4.1

Recall that \sigma_{\ell j}^{2}=\mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j}) and that \widehat{\sigma}_{\ell j}^{2} is the plug-in estimator defined by (7). Let \xi>0 be such that

sup,j(|σ^j2σj2|>t)ξ2t2.\displaystyle\sup_{\ell,j}\mathbb{P}(|\widehat{\sigma}_{\ell j}^{2}-\sigma_{\ell j}^{2}|>t)\leq\frac{\xi^{2}}{t^{2}}. (14)

For example, Proposition C.3 implies \xi^{2}\asymp\delta^{2}+n_{2}^{-1}. Recall also \Delta:=\inf_{j}\Delta_{j}, where \Delta_{j}>0 is the smallest number such that \mathbb{E}\operatorname{var}(X_{\ell}\,|\,A_{j})>\sigma^{2}+\Delta_{j} for all \ell\notin A_{j}.

Theorem 4.1 follows immediately from Proposition C.4 below, combined with Proposition C.3 to bound \xi by \delta.

Proposition C.4.

Define \xi>0 as in (14). Then for any threshold \xi\sqrt{d}<\eta<\Delta/2, we have

(L^=L(𝖦))\displaystyle\mathbb{P}(\widehat{L}=L(\mathsf{G})) 1ξ2η2rd.\displaystyle\geq 1-\frac{\xi^{2}}{\eta^{2}}rd.
Proof.

Define \mathcal{E}_{j-1}:=\{\widehat{L}_{1}=L_{1},\ldots,\widehat{L}_{j-1}=L_{j-1}\}. It follows that

(L^=L(𝖦))\displaystyle\mathbb{P}(\widehat{L}=L(\mathsf{G})) =(L^1=L1,,L^r=Lr)=j=1r(L^j=Lj|j1).\displaystyle=\mathbb{P}(\widehat{L}_{1}=L_{1},\ldots,\widehat{L}_{r}=L_{r})=\prod_{j=1}^{r}\mathbb{P}(\widehat{L}_{j}=L_{j}\,|\,\mathcal{E}_{j-1}).

Clearly, if \widehat{L}_{1}=L_{1},\ldots,\widehat{L}_{r}=L_{r} then \widehat{r}=r. By definition, we have \sigma_{\ell j}^{2}>\sigma^{2}+\Delta, i.e. \Delta is the smallest “gap” between any source in a subgraph \mathsf{G}[V-A_{j-1}] and the rest of the nodes.

Now let \widehat{\sigma}^{2}:=\min_{\ell}\widehat{\sigma}_{\ell 0}^{2} and consider L_{1}:

(L^1=L1)=(|σ^02σ^2|ηA1,|σ^02σ^2|>ηA1)\displaystyle\mathbb{P}(\widehat{L}_{1}=L_{1})=\mathbb{P}(|\widehat{\sigma}_{\ell 0}^{2}-\widehat{\sigma}^{2}|\leq\eta\,\forall\ell\in A_{1},\,|\widehat{\sigma}_{\ell 0}^{2}-\widehat{\sigma}^{2}|>\eta\,\forall\ell\notin A_{1})

Now for any k,\ell\in L_{j},

|σ^j2σ^kj2|\displaystyle|\widehat{\sigma}_{\ell j}^{2}-\widehat{\sigma}_{kj}^{2}| |σ^j2σ2|+|σ^kj2σ2|,\displaystyle\leq|\widehat{\sigma}_{\ell j}^{2}-\sigma^{2}|+|\widehat{\sigma}_{kj}^{2}-\sigma^{2}|,

and for any \ell\notin L_{j} and k\in L_{j},

|σ^j2σ^kj2|\displaystyle|\widehat{\sigma}_{\ell j}^{2}-\widehat{\sigma}_{kj}^{2}| >Δj|σj2σ^j2||σ^kj2σkj2|.\displaystyle>\Delta_{j}-|\sigma_{\ell j}^{2}-\widehat{\sigma}_{\ell j}^{2}|-|\widehat{\sigma}_{kj}^{2}-\sigma_{kj}^{2}|.

Thus, with probability at least 1-d\xi^{2}/t^{2}, we have

|σ^j2σ^kj2|\displaystyle|\widehat{\sigma}_{\ell j}^{2}-\widehat{\sigma}_{kj}^{2}| 2tif k,Lj, and\displaystyle\leq 2t\quad\text{if $k,\ell\in L_{j}$, and}
|σ^j2σ^kj2|\displaystyle|\widehat{\sigma}_{\ell j}^{2}-\widehat{\sigma}_{kj}^{2}| >Δj2tif Lj and kLj.\displaystyle>\Delta_{j}-2t\quad\text{if $\ell\notin L_{j}$ and $k\in L_{j}$}.

Now, as long as t<\Delta/4, we have \Delta_{j}-2t>2t; equivalently, the threshold \eta:=2t satisfies \eta<\Delta/2.

Finally, we have

(L^1L1)\displaystyle\mathbb{P}(\widehat{L}_{1}\neq L_{1}) ξ2η2d(L^1=L1)1ξ2η2d.\displaystyle\leq\frac{\xi^{2}}{\eta^{2}}d\implies\mathbb{P}(\widehat{L}_{1}=L_{1})\geq 1-\frac{\xi^{2}}{\eta^{2}}d.

Recall that d_{j}:=|L_{j}|. Then by a similar argument

(L^2=L2|L^1=L1)\displaystyle\mathbb{P}(\widehat{L}_{2}=L_{2}\,|\,\widehat{L}_{1}=L_{1}) 1ξ2η2(dd1).\displaystyle\geq 1-\frac{\xi^{2}}{\eta^{2}}(d-d_{1}).

Recalling \mathcal{E}_{j-1}:=\{\widehat{L}_{1}=L_{1},\ldots,\widehat{L}_{j-1}=L_{j-1}\}, we have just proved that \mathbb{P}(\widehat{L}_{2}=L_{2}\,|\,\mathcal{E}_{1})\geq 1-(d-d_{1})(\xi^{2}/\eta^{2}). A similar argument proves that \mathbb{P}(\widehat{L}_{j}=L_{j}\,|\,\mathcal{E}_{j-1})\geq 1-(\xi^{2}/\eta^{2})(d-d_{j-1}). Since \eta>\xi\sqrt{d}, the inequality \prod_{j}(1-x_{j})\geq 1-\sum_{j}x_{j} implies

(L^=L(𝖦))\displaystyle\mathbb{P}(\widehat{L}=L(\mathsf{G})) =j=1r(L^j=Lj|j1)\displaystyle=\prod_{j=1}^{r}\mathbb{P}(\widehat{L}_{j}=L_{j}\,|\,\mathcal{E}_{j-1})
=j=1r(1ξ2η2(ddj1))\displaystyle=\prod_{j=1}^{r}\Big{(}1-\frac{\xi^{2}}{\eta^{2}}(d-d_{j-1})\Big{)}
1j=1rξ2η2(ddj1)\displaystyle\geq 1-\sum_{j=1}^{r}\frac{\xi^{2}}{\eta^{2}}(d-d_{j-1})
1ξ2η2rd\displaystyle\geq 1-\frac{\xi^{2}}{\eta^{2}}rd

as desired. ∎

C.4 Technical lemmas

Lemma C.5.
|𝔼Xp𝔼Yp|XYpk=0p1XpkYpp1k\displaystyle|\mathbb{E}X^{p}-\mathbb{E}Y^{p}|\leq\|X-Y\|_{p}\sum_{k=0}^{p-1}\|X\|_{p}^{k}\|Y\|_{p}^{p-1-k}
Proof.

Write \mathbb{E}X^{p}-\mathbb{E}Y^{p} as a telescoping sum:

|𝔼Xp𝔼Yp|=|XppYpp|\displaystyle|\mathbb{E}X^{p}-\mathbb{E}Y^{p}|=|\|X\|_{p}^{p}-\|Y\|_{p}^{p}| =|XpYp|k=0p1XpkYppk1\displaystyle=|\|X\|_{p}-\|Y\|_{p}|\cdot\sum_{k=0}^{p-1}\|X\|_{p}^{k}\|Y\|_{p}^{p-k-1}
XYpk=0p1XpkYppk1.\displaystyle\leq\|X-Y\|_{p}\cdot\sum_{k=0}^{p-1}\|X\|_{p}^{k}\|Y\|_{p}^{p-k-1}.\qed
Lemma C.6.

Fix p>2 and \delta>0, and suppose \|f\|_{p+\delta},\|g\|_{p+\delta}\leq B_{p+\delta}. Then

fgpCp,δfg2γp,δ,Cp,δ=(2Bp+δ)(p2)(p+δ)p(p+δ2),γp,δ=2δp(p+δ2).\displaystyle\|f-g\|_{p}\leq C_{p,\delta}\cdot\|f-g\|_{2}^{\gamma_{p,\delta}},\quad C_{p,\delta}=(2B_{p+\delta})^{\tfrac{(p-2)(p+\delta)}{p(p+\delta-2)}},\quad\gamma_{p,\delta}=\frac{2\delta}{p(p+\delta-2)}.

The exponent \gamma_{p,\delta} satisfies \gamma_{p,\delta}\leq 2/p<1 and \gamma_{p,\delta}\to 2/p as \delta\to\infty, and the exponent in the constant C_{p,\delta} tends to 1-\tfrac{2}{p} as \delta\to\infty. Thus, if f-g\in L^{\infty}, then

fgpfg22/p.\displaystyle\|f-g\|_{p}\lesssim\|f-g\|_{2}^{2/p}.
Proof.

Use log-convexity of L^{p}-norms with 2=q<p<r=p+\delta. ∎
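For the reader's convenience, the log-convexity computation is the following standard interpolation argument: choosing \theta so that

\frac{1}{p}=\frac{1-\theta}{2}+\frac{\theta}{p+\delta},\qquad\text{i.e.}\qquad\theta=\frac{(p-2)(p+\delta)}{p(p+\delta-2)},\quad 1-\theta=\frac{2\delta}{p(p+\delta-2)}=\gamma_{p,\delta},

we obtain

\|f-g\|_{p}\leq\|f-g\|_{2}^{1-\theta}\,\|f-g\|_{p+\delta}^{\theta}\leq(2B_{p+\delta})^{\theta}\,\|f-g\|_{2}^{\gamma_{p,\delta}},

where the last step uses \|f-g\|_{p+\delta}\leq\|f\|_{p+\delta}+\|g\|_{p+\delta}\leq 2B_{p+\delta}. This yields exactly the constant C_{p,\delta} and exponent \gamma_{p,\delta} stated in Lemma C.6.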

Lemma C.7.

Assume \|f\|_{\infty}\leq B_{\infty}<\infty and g\in L^{4}. Then

𝔼X(f(X)2g(X)2)2f(X)g(X))44+4Bf(X)g(X)33+4B2f(X)g(X)22.\displaystyle\begin{aligned} \mathbb{E}_{X}(f(X)^{2}-g(X)^{2})^{2}\leq\|f(X)&-g(X))\|_{4}^{4}+\\ &4B_{\infty}\|f(X)-g(X)\|_{3}^{3}+4B_{\infty}^{2}\|f(X)-g(X)\|_{2}^{2}.\end{aligned} (15)

If additionally g(X)\in L^{\infty}, then

𝔼X(f(X)2g(X)2)2\displaystyle\mathbb{E}_{X}(f(X)^{2}-g(X)^{2})^{2} f(X)g(X)22.\displaystyle\lesssim\|f(X)-g(X)\|_{2}^{2}. (16)
Proof.

Note that

(f(X)2g(X)2)2\displaystyle(f(X)^{2}-g(X)^{2})^{2} =(g(X)f(X))4+4f(X)(g(X)f(X))3+4f(X)2(g(X)f(X))2\displaystyle=(g(X)-f(X))^{4}+4f(X)(g(X)-f(X))^{3}+4f(X)^{2}(g(X)-f(X))^{2}
(g(X)f(X))4+4|f(X)||g(X)f(X)|3+4f(X)2(g(X)f(X))2.\displaystyle\leq(g(X)-f(X))^{4}+4|f(X)||g(X)-f(X)|^{3}+4f(X)^{2}(g(X)-f(X))^{2}.

Thus

𝔼(f(X)2g(X)2)2\displaystyle\mathbb{E}(f(X)^{2}-g(X)^{2})^{2} 𝔼(g(X)f(X))4+\displaystyle\leq\mathbb{E}(g(X)-f(X))^{4}+
4𝔼[|f(X)||g(X)f(X)|3]+4𝔼[f(X)2(g(X)f(X))2]\displaystyle\qquad\qquad 4\mathbb{E}\big{[}|f(X)||g(X)-f(X)|^{3}\big{]}+4\mathbb{E}\big{[}f(X)^{2}(g(X)-f(X))^{2}\big{]}
𝔼(g(X)f(X))4+\displaystyle\leq\mathbb{E}(g(X)-f(X))^{4}+
4B𝔼|g(X)f(X)|3+4B2𝔼(g(X)f(X))2.\displaystyle\qquad\qquad 4B_{\infty}\mathbb{E}|g(X)-f(X)|^{3}+4B_{\infty}^{2}\mathbb{E}(g(X)-f(X))^{2}.

This proves the first inequality. The second follows by taking \delta\to\infty in Lemma C.6 and using \|f-g\|_{p}^{p}\lesssim\|f-g\|_{2}^{2} for p=3,4. ∎

Appendix D Comparison to CAM algorithm

In this section, we justify the claim in Example 5 that there exist infinitely many nonlinear functions g for which the CAM algorithm returns an incorrect graph under the model (4). To show this, we first construct a linear model on which CAM returns an incorrect ordering. Since CAM focuses on nonlinear models, we then show that this extends to any sufficiently small nonlinear perturbation of this model.

The linear model is

{X1𝒩(0,1)X2=X1+z2z2𝒩(0,1)X3=X1+X2+z3z3𝒩(0,1).\displaystyle\left\{\begin{aligned} &X_{1}\sim\mathcal{N}(0,1)\\ &X_{2}=X_{1}+z_{2}\ \ \ \ z_{2}\sim\mathcal{N}(0,1)\\ &X_{3}=X_{1}+X_{2}+z_{3}\ \ \ \ z_{3}\sim\mathcal{N}(0,1).\end{aligned}\right. (17)

The graph is

X1X2X3\displaystyle\begin{split}X_{1}\rightarrow&X_{2}\\ \searrow&\downarrow\\ &X_{3}\end{split}

which corresponds to the adjacency matrix

(011001000).\displaystyle\begin{pmatrix}0&1&1\\ 0&0&1\\ 0&0&0\\ \end{pmatrix}.

The IncEdge step of the CAM algorithm (§5.2 of Bühlmann et al., 2014) is based on the following score function:

j=1dlog(𝔼var(Xj|pa(j))).\displaystyle\sum_{j=1}^{d}\log\Big{(}\mathbb{E}\operatorname{var}(X_{j}\,|\,\operatorname{pa}(j))\Big{)}.

The algorithm starts with the empty DAG (i.e. \operatorname{pa}(j)=\emptyset for all j) and proceeds by greedily adding, in each step, the edge that decreases this score the most. For example, in the first step, CAM searches for the pair (X_{i},X_{j}) that maximizes \log\operatorname{var}(X_{j})-\log\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{i}), and adds the edge X_{i}\to X_{j} to the estimated DAG. Subsequent steps proceed similarly until the estimated order is determined. Thus, it suffices to study the log-differences \omega(j,i,S):=\log\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{S})-\log\mathbb{E}\operatorname{var}(X_{j}\,|\,X_{S\cup i}).

The following are straightforward to compute for the model (17):

\displaystyle\begin{split}&\operatorname{var}(X_{1})=1\ \ \ \ \mathbb{E}\operatorname{var}(X_{2}|X_{1})=1\ \ \ \ \mathbb{E}\operatorname{var}(X_{1}|X_{2})=\frac{1}{2}\\ &\operatorname{var}(X_{2})=2\ \ \ \ \mathbb{E}\operatorname{var}(X_{3}|X_{1})=2\ \ \ \ \mathbb{E}\operatorname{var}(X_{1}|X_{3})=\frac{1}{3}\\ &\operatorname{var}(X_{3})=6\ \ \ \ \mathbb{E}\operatorname{var}(X_{3}|X_{2})=\frac{3}{2}\ \ \ \ \mathbb{E}\operatorname{var}(X_{2}|X_{3})=\frac{1}{2}\end{split}

Then

log(var(X2)𝔼var(X2|X1))=log(21)=log2log(var(X1)𝔼var(X1|X2))=log(11/2)=log2log(var(X3)𝔼var(X3|X1))=log(62)=log3log(var(X1)𝔼var(X1|X3))=log(11/3)=log3log(var(X3)𝔼var(X3|X2))=log(63/2)=log4log(var(X2)𝔼var(X2|X3))=log(21/2)=log4\displaystyle\begin{split}\log\Big{(}\frac{\operatorname{var}(X_{2})}{\mathbb{E}\operatorname{var}(X_{2}|X_{1})}\Big{)}=\log\Big{(}\frac{2}{1}\Big{)}=\log 2\ \ \ \ &\log\Big{(}\frac{\operatorname{var}(X_{1})}{\mathbb{E}\operatorname{var}(X_{1}|X_{2})}\Big{)}=\log\Big{(}\frac{1}{1/2}\Big{)}=\log 2\\ \log\Big{(}\frac{\operatorname{var}(X_{3})}{\mathbb{E}\operatorname{var}(X_{3}|X_{1})}\Big{)}=\log\Big{(}\frac{6}{2}\Big{)}=\log 3\ \ \ \ &\log\Big{(}\frac{\operatorname{var}(X_{1})}{\mathbb{E}\operatorname{var}(X_{1}|X_{3})}\Big{)}=\log\Big{(}\frac{1}{1/3}\Big{)}=\log 3\\ \log\Big{(}\frac{\operatorname{var}(X_{3})}{\mathbb{E}\operatorname{var}(X_{3}|X_{2})}\Big{)}=\log\Big{(}\frac{6}{3/2}\Big{)}=\log 4\ \ \ \ &\log\Big{(}\frac{\operatorname{var}(X_{2})}{\mathbb{E}\operatorname{var}(X_{2}|X_{3})}\Big{)}=\log\Big{(}\frac{2}{1/2}\Big{)}=\log 4\\ \end{split}
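These population quantities are straightforward to verify numerically; the following Python sketch (illustrative only) computes them directly from the covariance matrix of (X_{1},X_{2},X_{3}) under (17):

import numpy as np

# Covariance of (X1, X2, X3) under the linear model (17):
# X1 ~ N(0,1), X2 = X1 + z2, X3 = X1 + X2 + z3, with unit-variance noise.
Sigma = np.array([[1.0, 1.0, 2.0],
                  [1.0, 2.0, 3.0],
                  [2.0, 3.0, 6.0]])

def resid_var(j, S):
    # Population E var(X_j | X_S) for a jointly Gaussian vector.
    if not S:
        return Sigma[j, j]
    Sss = Sigma[np.ix_(S, S)]
    Sjs = Sigma[j, S]
    return Sigma[j, j] - Sjs @ np.linalg.solve(Sss, Sjs)

# Scores omega(j, i, S) with S empty: log var(X_j) - log E var(X_j | X_i).
for j, i in [(1, 0), (0, 1), (2, 0), (0, 2), (2, 1), (1, 2)]:
    print(f"edge X{i+1} -> X{j+1}:", np.log(resid_var(j, []) / resid_var(j, [i])))
# Prints log 2, log 2, log 3, log 3, log 4, log 4, in the order displayed above.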

Now, if X_{3}\to X_{2} is chosen first, then the order is incorrect and we are done. So suppose CAM instead chooses X_{2}\to X_{3}; then in the next step it would update the score for X_{1}\to X_{3} to be

\displaystyle\log\Big{(}\frac{\mathbb{E}\operatorname{var}(X_{3}|X_{2})}{\mathbb{E}\operatorname{var}(X_{3}|X_{1},X_{2})}\Big{)}=\log\Big{(}\frac{3/2}{1}\Big{)}=\log\frac{3}{2}<\log\Big{(}\frac{\operatorname{var}(X_{1})}{\mathbb{E}\operatorname{var}(X_{1}|X_{3})}\Big{)}=\log 3

Therefore, for the next edge, CAM would choose X_{3}\to X_{1} (whose score \log 3 exceeds \log\tfrac{3}{2}), which also leads to the wrong order. Thus, regardless of which edge is selected first, CAM will return the wrong order.

Thus, when CAM is applied to data generated from (17), it is guaranteed to return an incorrect ordering. Although the model (17) is identifiable, it does not satisfy the identifiability condition for CAM (Lemma 1, (Bühlmann et al., 2014)), namely that the structural equation model is a nonlinear additive model. Thus, we need to extend this example to an identifiable, nonlinear additive model.

Since this argument depends only on the scores \omega(j,i,S), it suffices to construct a nonlinear model with similar scores. For this, we consider a simple nonlinear extension of (17): let g be an arbitrary bounded, nonlinear function, and define g_{\delta}(u):=u+\delta g(u). The nonlinear model is given by

{X1𝒩(0,1)X2=gδ(X1)+z2z2𝒩(0,1)X3=gδ(X1)+gδ(X2)+z3z3𝒩(0,1).\displaystyle\left\{\begin{aligned} &X_{1}\sim\mathcal{N}(0,1)\\ &X_{2}=g_{\delta}(X_{1})+z_{2}\ \ \ \ z_{2}\sim\mathcal{N}(0,1)\\ &X_{3}=g_{\delta}(X_{1})+g_{\delta}(X_{2})+z_{3}\ \ \ \ z_{3}\sim\mathcal{N}(0,1).\end{aligned}\right. (18)

This model satisfies both our identifiability condition (Condition 2) and the identifiability condition for CAM (Lemma 1, (Bühlmann et al., 2014)).

We claim that for sufficiently small \delta, the CAM algorithm will return the wrong ordering. Indeed, Proposition D.1 below shows that the scores \omega(j,i,S;\delta) corresponding to the model (18) can be made arbitrarily close to \omega(j,i,S)=\omega(j,i,S;0), which implies that CAM returns the wrong ordering for sufficiently small \delta>0.

Refer to caption
Figure 4: The CAM algorithm does not recover the correct ordering under different nonlinear functions and models; h(x) refers to model (19) and g(x) to model (4).

In Figure 4, we illustrate this empirically. In addition to the model (18), we also simulated from the following model, which shows that this phenomenon is not peculiar to the construction above:

{X1𝒩(0,1)X2=X12+z2z2𝒩(0,1)X3=4X12+h(X2)+z3z3𝒩(0,1).\displaystyle\left\{\begin{aligned} &X_{1}\sim\mathcal{N}(0,1)\\ &X_{2}=X_{1}^{2}+z_{2}\ \ \ \ z_{2}\sim\mathcal{N}(0,1)\\ &X_{3}=4X_{1}^{2}+h(X_{2})+z_{3}\ \ \ \ z_{3}\sim\mathcal{N}(0,1).\end{aligned}\right. (19)

In all eight examples, NPVAR perfectly recovers the ordering, while CAM is guaranteed to return an incorrect order for sufficiently large n (i.e. once the scores are consistently estimated).

Proposition D.1.

Let \mathbb{E}_{\delta} and \operatorname{var}_{\delta} be taken with respect to model (18). Then for all i,j\in\{1,2,3\}, as \delta\to 0,

|varδ(Xi)var0(Xi)|=o(1),|𝔼δvarδ(XiXj)𝔼0var0(XiXj)|=o(1).\displaystyle\begin{split}|\operatorname{var}_{\delta}(X_{i})-\operatorname{var}_{0}(X_{i})|&=o(1),\\ |\mathbb{E}_{\delta}\operatorname{var}_{\delta}(X_{i}|X_{j})-\mathbb{E}_{0}\operatorname{var}_{0}(X_{i}|X_{j})|&=o(1).\end{split}

and |\mathbb{E}_{\delta}\operatorname{var}_{\delta}(X_{3}|X_{1},X_{2})-\mathbb{E}_{0}\operatorname{var}_{0}(X_{3}|X_{1},X_{2})|=o(1).

Proof sketch of Proposition D.1.

The proof is a consequence of the fact that the differences are continuous functions of \delta. We sketch the proof for i=2; the remaining cases are similar. Throughout the sketch we abbreviate g_{\delta} by g and assume \|g\|_{\infty}\leq 1, which entails no loss of generality after rescaling \delta.

We have

varδ(X2)=varδ(g(X1))+varδ(ϵ)var0(X2)=var0(X1)+var0(ϵ)\begin{split}&\operatorname{var}_{\delta}(X_{2})=\operatorname{var}_{\delta}(g(X_{1}))+\operatorname{var}_{\delta}(\epsilon)\\ &\operatorname{var}_{0}(X_{2})=\operatorname{var}_{0}(X_{1})+\operatorname{var}_{0}(\epsilon)\\ \end{split}

Let \varphi(t) be the standard normal density. We only need to analyze \operatorname{var}_{0}(X_{1})-\operatorname{var}_{\delta}(g(X_{1}))=\operatorname{var}_{0}(X_{1})-\operatorname{var}_{0}(g(X_{1})), which we compare in two parts:

(X12g(X1)2)φ(X1)𝑑X1(X1φ(X1)𝑑X1)2(g(X1)φ(X1)𝑑X1)2.\begin{split}&\int(X_{1}^{2}-g(X_{1})^{2})\varphi(X_{1})dX_{1}\\ &(\int X_{1}\varphi(X_{1})dX_{1})^{2}-(\int g(X_{1})\varphi(X_{1})dX_{1})^{2}.\end{split}

Since |X_{1}-g(X_{1})|\leq\delta,

|(X12g(X1)2)φ(X1)𝑑X1|δ|X1+g(X1)|φ(X1)𝑑X1δ2|X1|φ(X1)𝑑X1+δ2=Cδ+δ2|g(X1)φ(X1)𝑑X1|=|(X1g(X1))φ(X1)𝑑X1|δφ(X1)𝑑X1=δ|(X1+g(X1)φ(X1))𝑑X1|=|𝔼g(X1)|δ.\begin{split}&|\int(X_{1}^{2}-g(X_{1})^{2})\varphi(X_{1})dX_{1}|\leq\delta\int|X_{1}+g(X_{1})|\varphi(X_{1})dX_{1}\leq\delta\int 2|X_{1}|\varphi(X_{1})dX_{1}+\delta^{2}=C\delta+\delta^{2}\\ &|\int g(X_{1})\varphi(X_{1})dX_{1}|=|\int(X_{1}-g(X_{1}))\varphi(X_{1})dX_{1}|\leq\delta\int\varphi(X_{1})dX_{1}=\delta\\ &|\int(X_{1}+g(X_{1})\varphi(X_{1}))dX_{1}|=|\mathbb{E}g(X_{1})|\leq\delta.\end{split}

Thus

|(X1φ(X1)𝑑X1)2(g(X1)φ(X1)𝑑X1)2|δ2,|(\int X_{1}\varphi(X_{1})dX_{1})^{2}-(\int g(X_{1})\varphi(X_{1})dX_{1})^{2}|\leq\delta^{2},

so that

|varδ(X2)var0(X2)|=o(1)|\operatorname{var}_{\delta}(X_{2})-\operatorname{var}_{0}(X_{2})|=o(1)

as claimed. ∎

Appendix E Experiment details

In this appendix we outline the details of our experiments, as well as additional simulations.

E.1 Experiment settings

For graphs, we used

  • Markov chain (MC). A graph with one edge from X_{i-1} to X_{i} for each node i=2,\ldots,d.

  • Erdős–Rényi (ER). Random graphs whose edges are added independently, with a specified expected number of edges.

  • Scale-free networks (SF). Networks simulated according to the Barabási–Albert model.

For models, we specify the nonlinear functions f_{j} in the SEM X_{j}=f_{j}(X_{\text{pa}(j)})+z_{j} for all j=1,2,\ldots,d, where z_{j}\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma^{2}) with variance \sigma^{2}\in\{0.2,0.5,0.8\}:

  • Additive sine model (SIN): f_{j}(X_{\text{pa}(j)})=\sum_{k\in\text{pa}(j)}f_{jk}(X_{k}), where f_{jk}(X_{k})=\sin(X_{k}).

  • Additive Gaussian process (AGP): f_{j}(X_{\text{pa}(j)})=\sum_{k\in\text{pa}(j)}f_{jk}(X_{k}), where each f_{jk} is a draw from a Gaussian process with an RBF kernel of length-scale one.

  • Non-additive Gaussian process (NGP): f_{j} is a draw from a Gaussian process with an RBF kernel of length-scale one.

  • Generalized linear model (GLM): This is a special case with a non-additive model and non-additive noise. We specify the model \mathbb{P}(X_{j}=1)=f_{j}(X_{\text{pa}(j)}) by a parameter p\in\{0.1,0.3\}. Given p, if j is a source node then \mathbb{P}(X_{j}=1)=p. For the Markov chain model we define:

    fj(Xpa(j))={pXk=11pXk=0\displaystyle f_{j}(X_{\text{pa}(j)})=\left\{\begin{aligned} p&\ \ \ \ &X_{k}=1\\ 1-p&\ \ \ \ &X_{k}=0\\ \end{aligned}\right.

    or vice versa.

We generated graphs from each of the above models with either d or 4d (expected) edges. These are denoted by the shorthand XX-YYY-k, where XX denotes the graph type, YYY denotes the model, and k indicates that the graphs have kd edges on average. For example, ER-SIN-1 indicates an ER graph with d (expected) edges under the additive sine model, and SF-NGP-4 indicates an SF graph with 4d (expected) edges under a non-additive Gaussian process. Note that there is no difference between additive and non-additive GPs for Markov chains, so the results for MC-NGP-k are omitted from the figures in Appendix E.5.

Based on these models, we generated random datasets with n samples. For each simulation run, we generated n\in\{100,200,500,750,1000\} samples for graphs with d\in\{5,10,20,40,50,60,70\} nodes.
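As an illustration of this data-generating process (not the exact scripts used in our experiments), the following Python sketch simulates an ER graph under the additive sine (SIN) model; the function name, seed, and default noise variance are our own choices:

import numpy as np

def simulate_er_sin(d, n, expected_edges, noise_var=0.5, seed=0):
    # Simulate n samples from a random ER DAG under the additive sine model:
    #   X_j = sum_{k in pa(j)} sin(X_k) + z_j,  z_j ~ N(0, noise_var).
    rng = np.random.default_rng(seed)
    p = expected_edges / (d * (d - 1) / 2)    # so that E[#edges] = expected_edges
    A = np.triu(rng.random((d, d)) < p, k=1)  # A[k, j] = True means edge k -> j
    X = np.zeros((n, d))
    for j in range(d):                        # nodes are already in topological order
        parents = np.flatnonzero(A[:, j])
        X[:, j] = np.sin(X[:, parents]).sum(axis=1) + rng.normal(0.0, np.sqrt(noise_var), n)
    return X, A

X, A = simulate_er_sin(d=10, n=500, expected_edges=10)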

E.2 Implementation and baselines

Code implementing NPVAR can be found at https://github.com/MingGao97/NPVAR. We implemented NPVAR (Algorithm 2) with generalized additive models (GAMs) as the nonparametric estimator for \widehat{f}_{\ell j}. We used the gam function in the R package mgcv with P-splines bs='ps' and the default smoothing parameter sp=0.6. In all our implementations, including our own for Algorithm 2, we used default parameters in order to avoid skewing the results in favour of any particular algorithm as a result of hyperparameter tuning.
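For readers who prefer a language-agnostic description, the following Python sketch mimics the layer-wise residual-variance ordering of Algorithm 2 with an off-the-shelf nonparametric regressor. It is a simplification, not the R implementation linked above: in particular, it uses in-sample residuals rather than the sample splitting in (7), and the regressor, function name, and default n_neighbors are our own choices. The parameter eta plays the role of the threshold \eta.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def npvar_layers(X, eta, n_neighbors=20):
    # Layer-wise ordering by estimated residual variances: at each step, regress every
    # remaining variable on the variables identified so far, and group all variables
    # whose estimated E var(X_j | ancestors) lies within eta of the minimum into a layer.
    n, d = X.shape
    remaining, ancestors, layers = set(range(d)), [], []
    while remaining:
        sigma2 = {}
        for j in remaining:
            if not ancestors:
                sigma2[j] = X[:, j].var()
            else:
                reg = KNeighborsRegressor(n_neighbors=n_neighbors)
                reg.fit(X[:, ancestors], X[:, j])
                sigma2[j] = np.mean((X[:, j] - reg.predict(X[:, ancestors])) ** 2)
        s_min = min(sigma2.values())
        layer = sorted(j for j in remaining if sigma2[j] <= s_min + eta)
        layers.append(layer)
        ancestors += layer
        remaining -= set(layer)
    return layers  # concatenating the layers gives an estimated topological ordering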

We compared our method with the following approaches as baselines:

  • Regression with subsequent independence test (RESIT) identifies and disregards a sink node at each step via independence testing (Peters et al., 2014). The implementation is available at http://people.tuebingen.mpg.de/jpeters/onlineCodeANM.zip. It uses the HSIC test for independence testing with alpha=0.05 and gam for nonparametric regression.

  • Causal additive models (CAM) estimates the topological order by greedy search over edges after a preliminary neighborhood selection (Bühlmann et al., 2014). The implementation is available at https://cran.r-project.org/src/contrib/Archive/CAM/. By default, CAM applies an extra pre-processing step called preliminary neighborhood search (PNS). It uses gam to compute the scores and for pruning, and mboost for the preliminary neighborhood search. The R implementation of CAM does not use the default parameters for gam or mboost, and instead optimizes these parameters at runtime.

  • NOTEARS uses an algebraic characterization of DAGs for score-based structure learning of nonparametric models via partial derivatives (Zheng et al., 2018; Zheng et al., 2020). The implementation is available at https://github.com/xunzheng/notears. We used neural networks for the nonlinearities with a single hidden layer of 10 neurons. The training parameters are lambda1=0.01 and lambda2=0.01, and the threshold for the adjacency matrix is w_threshold=0.3.

  • Greedy equivalence search with generalized scores (GSGES) uses generalized scores for greedy search without assuming a particular model class (Huang et al., 2018). The implementation is available at https://github.com/Biwei-Huang/Generalized-Score-Functions-for-Causal-Discovery/. We used the cross-validation parameters parameters.kfold = 10 and parameters.lambda = 0.01.

  • Equal variance (EqVar) algorithm identifies source nodes by minimizing conditional variances in a linear SEM (Chen et al., 2019). The implementation is available at https://github.com/WY-Chen/EqVarDAG. The original EqVar algorithm estimates the error variances in a linear SEM via the covariance matrix \Sigma=\mathbb{E}XX^{T}, and then uses linear regression (e.g. best subset selection) to learn the structure of the DAG. We adapted this algorithm to the nonlinear setting in the obvious way by using GAMs (instead of subset selection) for variable selection; the use of the covariance matrix to estimate the order remains the same (a minimal sketch of this ordering step is given after this list).

  • PC algorithm and greedy equivalence search (GES) are standard baselines for structure learning (Spirtes and Glymour, 1991; Chickering, 2003). The implementation is available from the R package pcalg. For the PC algorithm, we use the correlation matrix as the sufficient statistic; the independence test is implemented by gaussCItest with significance level alpha=0.01. For GES, we set the score to GaussL0penObsScore with lambda equal to \log n/2, which corresponds to the BIC score.
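The covariance-based ordering step used by EqVar can be sketched as follows (our own illustrative Python rendering, not the EqVarDAG code): at each step, the next variable in the ordering is the one with the smallest conditional variance given the variables already selected.

import numpy as np

def eqvar_order(Sigma):
    # Top-down ordering for a linear SEM with equal error variances: repeatedly pick
    # the remaining variable with the smallest conditional variance given the
    # variables selected so far, computed from the covariance matrix Sigma.
    d = Sigma.shape[0]
    order, remaining = [], list(range(d))
    while remaining:
        cond_vars = []
        for j in remaining:
            if not order:
                cond_vars.append(Sigma[j, j])
            else:
                b = np.linalg.solve(Sigma[np.ix_(order, order)], Sigma[order, j])
                cond_vars.append(Sigma[j, j] - Sigma[j, order] @ b)
        nxt = remaining[int(np.argmin(cond_vars))]
        order.append(nxt)
        remaining.remove(nxt)
    return order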

The experiments were conducted on an Intel E5-2680v4 2.4GHz CPU with 64 GB memory.

E.3 Metrics

We evaluated the performance of each algorithm with the following two metrics:

  • \mathbb{P}(\text{correct order}): The percentage of runs in which the algorithm gives a correct topological ordering, over N runs. This metric is only sensible for algorithms that first estimate an ordering or return an adjacency matrix that does not contain undirected edges, including RESIT, CAM, EqVar and NOTEARS.

  • Structural Hamming distance (SHD): A standard benchmark in the structure learning literature that counts the total number of edge additions, deletions, and reversals needed to convert the estimated graph into the true graph.
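For concreteness, one way to compute SHD from binary adjacency matrices is sketched below (a reversed edge counts once; conventions differ slightly across implementations, so this is an illustrative sketch rather than the exact routine used in our experiments):

import numpy as np

def shd(A_est, A_true):
    # Structural Hamming distance between two DAGs given as binary adjacency
    # matrices with A[i, j] = 1 iff i -> j. A reversed edge counts as one error;
    # spurious and missing edges count one each.
    A_est = np.asarray(A_est, dtype=bool)
    A_true = np.asarray(A_true, dtype=bool)
    diff = A_est != A_true                  # all mismatched entries
    rev = diff & diff.T & (A_est | A_true)  # i->j estimated but j->i true (or vice versa)
    return rev.sum() // 2 + (diff & ~rev).sum()

# Example: true chain X1 -> X2 -> X3; the estimate reverses the second edge (SHD = 1).
A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_est  = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(shd(A_est, A_true))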

Since there may be multiple topological orderings of a DAG, in our evaluations of order recovery we check whether the returned order is any of the valid orderings. PC, GES, and GSGES return a CPDAG that may contain undirected edges, in which case we evaluate them favourably by assuming correct orientation for undirected edges whenever possible. CAM, RESIT, EqVar and NPVAR each first estimate a topological ordering and then estimate a DAG; to estimate a DAG from an ordering, we apply the same pruning step to each algorithm for a fair comparison, adapted from (Bühlmann et al., 2014). Specifically, given an estimated ordering, we run a gam regression of each node on its ancestors and then determine the parents of the node by the p-values, with significance level cutoff=0.001.

E.4 Timing

For completeness, runtime comparisons are reported in Tables 1 and 2. Algorithms based on linear models such as EqVar, PC, and GES are by far the fastest; these algorithms are also the most highly optimized. The slowest algorithms are GSGES and RESIT. Timing comparisons against CAM are difficult to interpret since, by default, CAM first performs a preliminary neighbourhood search, which can easily be applied to any of the other algorithms tested. The dramatic difference with and without this pre-processing step can be seen by comparing Tables 1 and 2: for d=40, with this extra step CAM takes just over 90 seconds, whereas without it, on the same data, it takes over 8.5 hours. For comparison, NPVAR takes around two minutes (i.e. without pre-processing or neighbourhood search).

Algorithm d n Runtime (s)
EqVar 20 1000 0.0017 ± 0.0003
PC 20 1000 0.056 ± 0.016
GES 20 1000 0.060 ± 0.034
NPVAR 20 1000 10.76 ± 0.23
NOTEARS 20 1000 31.46 ± 8.79
CAM (w/ PNS) 20 1000 40.56 ± 1.29
CAM (w/o PNS) 20 1000 559.01 ± 9.49
RESIT 20 1000 652.15 ± 7.26
GSGES 20 1000 3216.00 ± 95.0
Table 1: Runtime comparisons for d=20. Timing for CAM is presented with and without preliminary neighbourhood selection (PNS), which is a pre-processing step that can be applied to any algorithm. In our experiments, only CAM used PNS.
Algorithm d n Runtime (s)
EqVar 40 1000 0.0043 ± 0.0003
GES 40 1000 0.12 ± 0.0052
PC 40 1000 0.019 ± 0.030
NOTEARS 40 1000 76.05 ± 19.16
CAM (w/ PNS) 40 1000 95.59 ± 6.33
NPVAR 40 1000 118.33 ± 2.25
CAM (w/o PNS) 40 1000 31644.56 ± 1329.31
Table 2: Runtime comparisons for d=40. Timing for CAM is presented with and without preliminary neighbourhood selection (PNS), which is a pre-processing step that can be applied to any algorithm. In our experiments, only CAM used PNS.

E.5 Additional experiments

Here we collect the results of our additional experiments. Since the settings MC-AGP-k and MC-NGP-k are equivalent (each node has only one parent), the plots for MC-NGP-k are omitted. Some algorithms are omitted in certain settings due to high computational cost or numerical issues.

  • Figure 5: SHD vs. n with d=5 fixed, across all graphs and models tested.

  • Figure 6: SHD vs. n with d=10 fixed, across all graphs and models tested.

  • Figure 7: SHD vs. n with d=20 fixed, across all graphs and models tested.

  • Figure 8: SHD vs. n with d=40 fixed, across all graphs and models tested, with GSGES and RESIT skipped (due to high computational cost).

  • Figure 9: SHD vs. n with d=50 fixed, across all graphs and models tested, with GSGES, RESIT and NOTEARS skipped (due to high computational cost).

  • Figure 10: SHD vs. n with d=60 fixed, across all graphs and models tested, with GSGES, RESIT and NOTEARS skipped (due to high computational cost).

  • Figure 11: SHD vs. n with d=70 fixed, across all graphs and models tested, with GSGES, RESIT and NOTEARS skipped (due to high computational cost).

  • Figure 12: SHD vs. n with d ranging from 5 to 70 on GLM, with CAM and RESIT skipped (due to numerical issues).

  • Figure 13: SHD vs. d with n=500 fixed, across all graphs and models tested.

  • Figure 14: SHD vs. d with n=1000 fixed, across all graphs and models tested.

  • Figure 15: Ordering recovery vs. n with d=5 fixed, across all graphs and models tested.

  • Figure 16: Ordering recovery vs. n with d=10 fixed, across all graphs and models tested.

  • Figure 17: Ordering recovery vs. n with d=20 fixed, across all graphs and models tested.

  • Figure 18: Ordering recovery vs. n with d=40 fixed, across all graphs and models tested, with RESIT skipped (due to high computational cost).

  • Figure 19: Ordering recovery vs. n with d=50 fixed, across all graphs and models tested, with RESIT and NOTEARS skipped (due to high computational cost).

  • Figure 20: Ordering recovery vs. n with d=60 fixed, across all graphs and models tested, with RESIT and NOTEARS skipped (due to high computational cost).

  • Figure 21: Ordering recovery vs. n with d=70 fixed, across all graphs and models tested, with RESIT and NOTEARS skipped (due to high computational cost).

Refer to caption
Figure 5: SHD vs n for fixed d=5.
Refer to caption
Figure 6: SHD vs n for fixed d=10.
Refer to caption
Figure 7: SHD vs n for fixed d=20.
Refer to caption
Figure 8: SHD vs n for fixed d=40.
Refer to caption
Figure 9: SHD vs n for fixed d=50.
Refer to caption
Figure 10: SHD vs n for fixed d=60.
Refer to caption
Figure 11: SHD vs n for fixed d=70.
Refer to caption
Figure 12: SHD vs n for different d ranging from 5 to 70 on GLM.
Refer to caption
Figure 13: SHD vs d for fixed n=500.
Refer to caption
Figure 14: SHD vs d for fixed n=1000.
Refer to caption
Figure 15: Ordering recovery vs n for fixed d=5.
Refer to caption
Figure 16: Ordering recovery vs n for fixed d=10.
Refer to caption
Figure 17: Ordering recovery vs n for fixed d=20.
Refer to caption
Figure 18: Ordering recovery vs n for fixed d=40.
Refer to caption
Figure 19: Ordering recovery vs n for fixed d=50.
Refer to caption
Figure 20: Ordering recovery vs n for fixed d=60.
Refer to caption
Figure 21: Ordering recovery vs n for fixed d=70.

References

  • Aragam and Zhou (2015) B. Aragam and Q. Zhou. Concave Penalized Estimation of Sparse Gaussian Bayesian Networks. Journal of Machine Learning Research, 16(69):2273–2328, 2015.
  • Aragam et al. (2015) B. Aragam, A. A. Amini, and Q. Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv:1511.08963, 2015.
  • Aragam et al. (2019) B. Aragam, A. Amini, and Q. Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. In Advances in Neural Information Processing Systems 32, pages 4450–4462. 2019.
  • Bertin and Lecué (2008) K. Bertin and G. Lecué. Selection of variables and dimension reduction in high-dimensional non-parametric regression. Electronic Journal of Statistics, 2:1224–1241, 2008.
  • Bottou (2015) L. Bottou. Two big challenges in machine learning. ICML 2015 keynote lecture, 2015.
  • Bühlmann (2018) P. Bühlmann. Invariance, causality and robustness. arXiv preprint arXiv:1812.08233, 2018.
  • Bühlmann et al. (2014) P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics, 42(6):2526–2556, 2014.
  • Chen et al. (2019) W. Chen, M. Drton, and Y. S. Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, 09 2019. ISSN 0006-3444. doi: 10.1093/biomet/asz049.
  • Chicharro et al. (2019) D. Chicharro, S. Panzeri, and I. Shpitser. Conditionally-additive-noise models for structure learning. arXiv preprint arXiv:1905.08360, 2019.
  • Chickering (2003) D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2003.
  • Comminges and Dalalyan (2011) L. Comminges and A. S. Dalalyan. Tight conditions for consistent variable selection in high dimensional nonparametric regression. In Proceedings of the 24th Annual Conference on Learning Theory, pages 187–206, 2011.
  • Dawid (1980) A. P. Dawid. Conditional independence for statistical operations. Annals of Statistics, pages 598–617, 1980.
  • Doksum et al. (1995) K. Doksum, A. Samarov, et al. Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. The Annals of Statistics, 23(5):1443–1473, 1995.
  • Ghoshal and Honorio (2017) A. Ghoshal and J. Honorio. Learning identifiable gaussian bayesian networks in polynomial time and sample complexity. In Advances in Neural Information Processing Systems 30, pages 6457–6466. 2017.
  • Ghoshal and Honorio (2018) A. Ghoshal and J. Honorio. Learning linear structural equation models in polynomial time and sample complexity. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1466–1475, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
  • Giordano et al. (2020) F. Giordano, S. N. Lahiri, and M. L. Parrella. GRID: A variable selection and structure discovery method for high dimensional nonparametric regression. Annals of Statistics, to appear, 2020.
  • Györfi et al. (2006) L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
  • Hastie and Tibshirani (1990) T. J. Hastie and R. J. Tibshirani. Generalized additive models, volume 43. CRC press, 1990.
  • Hoyer et al. (2009) P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in neural information processing systems, pages 689–696, 2009.
  • Huang et al. (2018) B. Huang, K. Zhang, Y. Lin, B. Schölkopf, and C. Glymour. Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1551–1560. ACM, 2018.
  • Kandasamy et al. (2015) K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, et al. Nonparametric von mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
  • Kohler et al. (2009) M. Kohler, A. Krzyżak, and H. Walk. Optimal global rates of convergence for nonparametric regression with unbounded data. Journal of Statistical Planning and Inference, 139(4):1286–1296, 2009.
  • Koller and Friedman (2009) D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Lachapelle et al. (2019) S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien. Gradient-based neural dag learning. arXiv preprint arXiv:1906.02226, 2019.
  • Lafferty and Wasserman (2008) J. Lafferty and L. Wasserman. Rodeo: sparse, greedy nonparametric regression. Annals of Statistics, 36(1):28–63, 2008.
  • Lauritzen (1996) S. L. Lauritzen. Graphical models. Oxford University Press, 1996.
  • Loh and Bühlmann (2014) P.-L. Loh and P. Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15:3065–3105, 2014.
  • Miller and Hall (2010) H. Miller and P. Hall. Local polynomial regression and variable selection. In Borrowing Strength: Theory Powering Applications–A Festschrift for Lawrence D. Brown, pages 216–233. Institute of Mathematical Statistics, 2010.
  • Mitrovic et al. (2018) J. Mitrovic, D. Sejdinovic, and Y. W. Teh. Causal inference via kernel deviance measures. In Advances in Neural Information Processing Systems, pages 6986–6994, 2018.
  • Monti et al. (2019) R. P. Monti, K. Zhang, and A. Hyvarinen. Causal discovery with general non-linear relationships using non-linear ica. arXiv preprint arXiv:1904.09096, 2019.
  • Mooij et al. (2016) J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.
  • Ng et al. (2019) I. Ng, Z. Fang, S. Zhu, and Z. Chen. Masked gradient-based causal structure learning. arXiv preprint arXiv:1910.08527, 2019.
  • Nowzohour and Bühlmann (2016) C. Nowzohour and P. Bühlmann. Score-based causal learning in additive noise models. Statistics, 50(3):471–485, 2016.
  • Ott et al. (2004) S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small gene networks. In Pacific symposium on biocomputing, volume 9, pages 557–567. Citeseer, 2004.
  • Park (2018) G. Park. Learning generalized hypergeometric distribution (ghd) dag models. arXiv preprint arXiv:1805.02848, 2018.
  • Park (2020) G. Park. Identifiability of additive noise models using conditional variances. Journal of Machine Learning Research, 21(75):1–34, 2020.
  • Park and Kim (2020) G. Park and Y. Kim. Identifiability of gaussian linear structural equation models with homogeneous and heterogeneous error variances. Journal of the Korean Statistical Society, pages 1–17, 2020.
  • Park and Raskutti (2017) G. Park and G. Raskutti. Learning quadratic variance function (QVF) dag models via overdispersion scoring (ODS). The Journal of Machine Learning Research, 18(1):8300–8342, 2017.
  • Pearl (2009) J. Pearl. Causality. Cambridge university press, 2009.
  • Perrier et al. (2008) E. Perrier, S. Imoto, and S. Miyano. Finding optimal bayesian network given a super-structure. Journal of Machine Learning Research, 9(Oct):2251–2286, 2008.
  • Peters (2015) J. Peters. On the intersection property of conditional independence and its application to causal discovery. Journal of Causal Inference, 3(1):97–108, 2015.
  • Peters and Bühlmann (2013) J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2013.
  • Peters et al. (2014) J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(1):2009–2053, 2014.
  • Peters et al. (2017) J. Peters, D. Janzing, and B. Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
  • Robins et al. (2008) J. Robins, L. Li, E. Tchetgen, A. van der Vaart, et al. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, pages 335–421. Institute of Mathematical Statistics, 2008.
  • Rosasco et al. (2013) L. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri. Nonparametric sparsity and regularization. The Journal of Machine Learning Research, 14(1):1665–1714, 2013.
  • Rothenhäusler et al. (2018) D. Rothenhäusler, J. Ernest, P. Bühlmann, et al. Causal inference in partially linear structural equation models. The Annals of Statistics, 46(6A):2904–2938, 2018.
  • Schölkopf (2019) B. Schölkopf. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
  • Shojaie and Michailidis (2010) A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.
  • Silander and Myllymaki (2006) T. Silander and P. Myllymaki. A simple approach for finding the globally optimal bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.
  • Singh and Moore (2005) A. P. Singh and A. W. Moore. Finding optimal bayesian networks by dynamic programming. 2005.
  • Spirtes and Glymour (1991) P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.
  • Spirtes et al. (2000) P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search, volume 81. The MIT Press, 2000.
  • Tagasovska et al. (2018) N. Tagasovska, T. Vatter, and V. Chavez-Demoulin. Nonparametric quantile-based causal discovery. arXiv preprint arXiv:1801.10579, 2018.
  • Tsybakov (2009) A. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics, New York, 2009.
  • van de Geer and Bühlmann (2013) S. van de Geer and P. Bühlmann. 0\ell_{0}-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.
  • Wang and Drton (2020) Y. S. Wang and M. Drton. High-dimensional causal discovery under non-gaussianity. Biometrika, 107(1):41–59, 2020.
  • Wu and Fukumizu (2020) P. Wu and K. Fukumizu. Causal mosaic: Cause-effect inference via nonlinear ica and ensemble method. arXiv preprint arXiv:2001.01894, 2020.
  • Yu et al. (2019) Y. Yu, J. Chen, T. Gao, and M. Yu. Dag-gnn: Dag structure learning with graph neural networks. arXiv preprint arXiv:1904.10098, 2019.
  • Zhang and Hyvärinen (2009) K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 647–655. AUAI Press, 2009.
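  • Zheng et al. (2018) X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31, 2018.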
  • Zheng et al. (2020) X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing. Learning Sparse Nonparametric DAGs. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3414–3425, 2020.
  • Zhu and Chen (2019) S. Zhu and Z. Chen. Causal discovery with reinforcement learning. arXiv preprint arXiv:1906.04477, 2019.