Supplementary Materials: Stable Differentiable Causal Discovery

Firstname1 Lastname1    Firstname2 Lastname2    Firstname3 Lastname3    Firstname4 Lastname4    Firstname5 Lastname5    Firstname6 Lastname6    Firstname7 Lastname7    Firstname8 Lastname8
Machine Learning, ICML

Appendix A Theoretical Results

A.1 Proof of Theorem 1

Before proving Theorem 1, we precisely define acyclic matrices and prove a few lemmas.

Definition A.1 (Cyclic and acyclic matrices).

Take a matrix $A \in \mathbb{R}_{\geq 0}^{d \times d}$.

We say that $A$ has a cycle of length $k$ if and only if

\[
\exists (i_0, \dots, i_k) \in \llbracket 1, d \rrbracket^{k+1} \ \text{ such that } \ i_0 = i_k \ \text{ and } \ A_{i_{\ell-1}, i_\ell} > 0 \text{ for all } \ell \in \llbracket 1, k \rrbracket. \tag{1}
\]

We say that $A$ is cyclic if it contains at least one cycle. We note that if $A$ contains a cycle of length $k$ for some $k \in \mathbb{N}^*$ (the set of strictly positive integers), then $A$ also contains a cycle of length $k'$ for some $k' \in \llbracket 1, d \rrbracket$ (this follows from the pigeonhole principle).

We say that $A$ is acyclic if it does not contain any cycle (or, equivalently, if it does not contain any cycle of length $k \leq d$).

Lemma A.2.

For any matrix $A \in \mathbb{R}_{\geq 0}^{d \times d}$,

  • $\operatorname{tr}(A^k) \geq 0$ for any $k$,

  • $A$ has a cycle of length $k$ if and only if $\operatorname{tr}(A^k) > 0$,

  • $A^d = 0$ if and only if $A$ is acyclic.

Proof.

Fix a matrix $A \in \mathbb{R}_{\geq 0}^{d \times d}$. We have

\[
\operatorname{tr}(A^k) = \sum_{\substack{(i_0, \dots, i_k) \in \llbracket 1, d \rrbracket^{k+1} \\ i_0 = i_k}} \prod_{\ell=1}^{k} A_{i_{\ell-1}, i_\ell}. \tag{2}
\]

Each addend is non-negative, so $\operatorname{tr}(A^k) \geq 0$. Furthermore, the total sum is strictly positive if and only if at least one addend is strictly positive, which happens if and only if $A$ has a cycle of length $k$, by definition.

Similarly, we have

\[
(A^d)_{i,j} = \sum_{\substack{(i_0, \dots, i_d) \in \llbracket 1, d \rrbracket^{d+1} \\ i_0 = i,\ i_d = j}} \prod_{\ell=1}^{d} A_{i_{\ell-1}, i_\ell}. \tag{3}
\]

If $(A^d)_{i,j} > 0$, then one addend is strictly positive, so there exists $(i_0, \dots, i_d) \in \llbracket 1, d \rrbracket^{d+1}$ with $i_0 = i$, $i_d = j$, and $\prod_{\ell=1}^{d} A_{i_{\ell-1}, i_\ell} > 0$. By the pigeonhole principle, two of the $i_\ell$ are identical, which provides a cycle. Reciprocally, if $(i_0, \dots, i_k)$ is a cycle of length $k$, then by repeating it, $(i_0, \dots, i_k, i_1, i_2, \dots, i_{d \bmod k})$, until we obtain a walk of length $d$, we get $(A^d)_{i_0, i_{d \bmod k}} > 0$. Hence, $A^d = 0$ if and only if $A$ is acyclic. ∎
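As a quick numerical illustration of Lemma A.2 (a sketch, not part of the proof; the matrices `dag` and `cyclic` are made-up three-node examples), the snippet below checks acyclicity both via $\operatorname{tr}(A^k)$ for $k \leq d$ and via $A^d = 0$:

```python
import numpy as np

def has_cycle_via_traces(A):
    # A has a cycle iff tr(A^k) > 0 for some k in 1..d (Lemma A.2).
    d = A.shape[0]
    P = np.eye(d)
    for _ in range(d):
        P = P @ A
        if np.trace(P) > 0:
            return True
    return False

def has_cycle_via_power(A):
    # A is acyclic iff A^d = 0 (Lemma A.2, third item).
    d = A.shape[0]
    return not np.allclose(np.linalg.matrix_power(A, d), 0.0)

# Hypothetical examples: a 3-node DAG and a 3-node cycle.
dag = np.array([[0., 1., 1.], [0., 0., 1.], [0., 0., 0.]])
cyclic = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])

for name, A in [("dag", dag), ("cyclic", cyclic)]:
    print(name, has_cycle_via_traces(A), has_cycle_via_power(A))
```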

We recall Theorem 1; its statement is given in the main text.

Proof.

Fix a matrix $A \in \mathbb{R}_{\geq 0}^{d \times d}$ and a sequence $(a_k)_{k \in \mathbb{N}^*} \in \mathbb{R}_{\geq 0}^{\mathbb{N}^*}$ such that $a_k > 0$ for any $k \in \llbracket 1, d \rrbracket$.

By definition, we have

\begin{align}
h_a(A) &= \operatorname{tr}\Big( \sum_{k=1}^{+\infty} a_k A^k \Big) \tag{4} \\
&= \sum_{k=1}^{+\infty} a_k \operatorname{tr}(A^k). \tag{5}
\end{align}
  1. By Lemma A.2, $\operatorname{tr}(A^k) \geq 0$, and so $h_a(A) \geq 0$. This proves the non-negativity property.

  2. Then, $h_a(A) = 0$ if and only if $\operatorname{tr}(A^k) = 0$ for all $k$ for which $a_k > 0$. Since $a_k > 0$ for any $k \in \llbracket 1, d \rrbracket$, we conclude that if $h_a(A) = 0$, then $A$ does not contain cycles of length $k \leq d$, so $A$ is acyclic by Definition A.1. Reciprocally, if $A$ is acyclic, it does not contain cycles of any length, so $h_a(A) = 0$.

  3. Finally, if we write $r_a$ for the radius of convergence of $f_a$, then the series defining $h_a$ converges absolutely over the set of matrices with $h_\rho(A) < r_a$, so $h_a$ is differentiable there, with gradient $\nabla h_a(A) = \sum_{k=1}^{+\infty} a_k k (A^\top)^{k-1}$.

This concludes the proof.

A.2 Proof of Theorem 2

We recall Theorem 2; its statement is given in the main text.

Proof.

Take a PST constraint $h_a$ for some sequence $(a_k)_{k \in \mathbb{N}^*} \in \mathbb{R}_{\geq 0}^{\mathbb{N}^*}$ with $a_k > 0$ for $k \in \llbracket 1, d \rrbracket$.

We will show the E-unstable and V-unstable results using a particular adjacency matrix $C$.

Define $C$ as the adjacency matrix of the cycle $1 \rightarrow 2 \rightarrow \dots \rightarrow d \rightarrow 1$ with edge weights of $1$. That is,

\[
C = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
0 & & & 0 & 1 \\
1 & 0 & \cdots & & 0
\end{bmatrix}. \tag{7}
\]

We have $C^d = I_d$ and
\[
\operatorname{tr}(C^k) = \begin{cases} d & \text{if } k \equiv 0 \pmod d, \\ 0 & \text{if } k \not\equiv 0 \pmod d. \end{cases}
\]

We obtain, for any $w \in \mathbb{R}_{\geq 0}$,

\[
h_a(wC) = d \sum_{\ell=1}^{+\infty} a_{\ell d} w^{\ell d}. \tag{8}
\]

  • In particular, for any $s \geq 0$ we have $h_a(sC) \geq d\, a_d s^d$, so $h_a(sC) = \Omega_{s \rightarrow +\infty}(s^d)$ (since $a_d > 0$). This proves the E-instability.

  • Define $u = \min(1, r_a/2)$, where $r_a$ is the radius of convergence of $f_a$.

    Then, for any $\varepsilon \in [0, u^2[$,

    \begin{align}
    h_a(\varepsilon C) &= d \sum_{\ell=1}^{+\infty} a_{\ell d} \varepsilon^{\ell d} \tag{9} \\
    &= \varepsilon^d d \sum_{\ell=1}^{+\infty} a_{\ell d} \varepsilon^{(\ell-1) d} \nonumber \\
    &\leq \varepsilon^d d \sum_{\ell=1}^{+\infty} a_{\ell d} u^{2(\ell-1) d} \tag{10} \\
    &\leq \varepsilon^d d \left( \sum_{\ell=1}^{+\infty} a_{\ell d} u^{\ell d} + a_d \right) \tag{11} \\
    &\leq \varepsilon^d d \left( f_a(u) + a_d \right) \tag{12} \\
    &= O_{\varepsilon \rightarrow 0^+}(\varepsilon^d), \tag{13}
    \end{align}

    where we obtain Equation 11 by noting that $2(\ell-1) \geq \ell$ for $\ell \geq 2$ and $u \leq 1$ (the $\ell = 1$ term contributes $a_d$). Finally, since $u < r_a$, $f_a(u)$ is finite. Hence the result: $h_a(\varepsilon C)$ vanishes at rate $\varepsilon^d$, which proves the V-instability.

    The D-instability result follows from the definition of the radius of convergence. ∎
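As a numerical illustration of this contrast (a sketch using the cycle matrix $C$ above; the dimension $d=8$ and the scales are arbitrary choices of ours), the snippet below evaluates the exponential PST constraint $h_{\exp}(A) = \operatorname{tr}(e^A) - d$ (i.e., $a_k = 1/k!$) and the spectral constraint $h_\rho$ on scaled versions of $C$:

```python
import numpy as np
from scipy.linalg import expm

d = 8
# Adjacency matrix of the d-cycle 1 -> 2 -> ... -> d -> 1 (the matrix C above).
C = np.zeros((d, d))
C[np.arange(d - 1), np.arange(1, d)] = 1.0
C[d - 1, 0] = 1.0

def h_exp(A):
    # PST constraint with a_k = 1/k!, i.e. tr(exp(A)) - d (NOTEARS-style).
    return np.trace(expm(A)) - A.shape[0]

def h_rho(A):
    # Spectral-radius constraint: largest eigenvalue modulus.
    return np.max(np.abs(np.linalg.eigvals(A)))

for scale in [1e-3, 1.0, 10.0]:
    print(f"scale={scale:>6}: h_exp={h_exp(scale * C):.3e}  h_rho={h_rho(scale * C):.3e}")
# h_exp blows up like scale^d for large scales and vanishes like scale^d for small
# ones, while h_rho scales linearly -- the instability/stability contrast above.
```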

    A.3 Proof of Theorem 3

We recall Theorem 3; its statement is given in the main text.

The two properties stated in Theorem 3 are standard results.

Proof.

  • We provide a proof of the statement $h_\rho(A) = 0 \Leftrightarrow A \textnormal{ is a DAG}$ for the sake of completeness.

    • ($\Rightarrow$) If $h_\rho(A) = 0$, then all eigenvalues $\lambda_j(A)$ are zero. Since $\operatorname{tr}(A^k) = \sum_{j=1}^{d} \lambda_j(A)^k$, we have $\operatorname{tr}(A^k) = 0$ for any $k \geq 1$, and by Lemma A.2, $A$ is acyclic.

    • ($\Leftarrow$) Assume $A$ is acyclic; then $A^d = 0$ by Lemma A.2. But then all eigenvalues are $0$ (for eigenvalue $\lambda_j(A)$ and associated eigenvector $v_j(A)$, we have $A^d v_j(A) = \lambda_j(A)^d v_j(A) = 0$).

    Hence, $h_\rho$ is a valid acyclicity constraint.

  • magnus1985differentiating shows that $h_\rho$ is differentiable at every $A$ that has mutually distinct eigenvalues, with the formula provided in Theorem 4. The set of matrices with all distinct eigenvalues is dense in the set of matrices (horn2012matrix, Theorem 2.4.7.1), which proves the result. ∎

    A.4 Proof of Theorem 4

We recall Theorem 4; its statement is given in the main text.

    Proof.

We prove each stability criterion.

  • E-stable: For any $s > 0$ and matrix $A$, $h_\rho(sA) = |s|\, h_\rho(A) = O_{s \rightarrow +\infty}(s)$.

  • V-stable: For any $\varepsilon > 0$ and matrix $A$ such that $h_\rho(A) > 0$, $h_\rho(\varepsilon A) = |\varepsilon|\, h_\rho(A) = \Omega_{\varepsilon \rightarrow 0^+}(\varepsilon)$.

  • D-stable: Every matrix has eigenvalues ($\mathbb{C}$ is algebraically closed), so $h_\rho$ is well defined everywhere. In addition, Theorem 3 showed that $h_\rho$ is differentiable almost everywhere.

Hence, $h_\rho$ is a stable constraint. ∎
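As a sanity check of the gradient formula recalled in Appendix B.2, $\nabla h_\rho(A) = v u^\top / (v^\top u)$ with $u$ and $v$ the right and left eigenvectors of the dominant eigenvalue, the following sketch (our own code, using exact eigendecompositions rather than power iteration) compares it against a finite-difference estimate on a random non-negative matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.random((d, d))  # random non-negative matrix; eigenvalues distinct a.s.

def h_rho(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

# Dominant right/left eigenvectors via eigendecompositions of A and A.T.
eigvals, right = np.linalg.eig(A)
u = np.real(right[:, np.argmax(np.abs(eigvals))])
eigvals_t, left = np.linalg.eig(A.T)
v = np.real(left[:, np.argmax(np.abs(eigvals_t))])

grad_formula = np.outer(v, u) / (v @ u)

# Finite-difference check on a single entry.
i, j = 1, 3
eps = 1e-6
E = np.zeros_like(A)
E[i, j] = eps
fd = (h_rho(A + E) - h_rho(A - E)) / (2 * eps)
print(grad_formula[i, j], fd)  # the two values should agree to ~1e-5
```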

    A.5 Proof for Stage 1 – Theorem 5

In this section, we show that if the stage 1 optimization problem is solved exactly, without relaxation and with infinite data, then stage 1 does not remove any true causal parent.

The optimization problem solved in stage 1 is given in the main text as

\[
\hat{\theta}_1 = \operatorname*{argmax}_{\substack{\theta \\ \forall j,\ A_{\theta,jj}=0}} S_{\alpha_1,\beta_1}(\theta) = \operatorname*{argmax}_{\substack{\theta \\ \forall j,\ A_{\theta,jj}=0}} \frac{1}{n}\sum_{i=1}^{n} \log p(x^i;\theta,t^i) - \alpha_1 \|A_\theta\|_1 - \beta_1 \|\theta\|_2^2.
\]

With infinite data $x^i \mid t^i \sim p^*(x^i; t^i)$, the optimization problem becomes

\[
\tilde{\theta}_1 = \operatorname*{argmax}_{\substack{\theta \\ \forall j,\ A_{\theta,jj}=0}} \sum_{k=0}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p(x;\theta,k)\right] - \alpha_1 \|A_\theta\|_1 - \beta_1 \|\theta\|_2^2, \tag{14}
\]

where $\pi_k$ is the proportion of data coming from intervention $k$.

Furthermore, in its non-relaxed form, Equation 14 above becomes

\[
\tilde{\theta}_1 = \operatorname*{argmax}_{\substack{\theta \\ \forall j,\ A_{\theta,jj}=0}} \sum_{k=0}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p(x;\theta,k)\right] - \lambda |A_\theta|, \tag{15}
\]

where the L1 and L2 regularizations are replaced by a penalty on the number of edges $|A_\theta|$ (for some $\lambda > 0$).

Since we are interested in the graph induced by $\tilde{\theta}_1$, which we write $\tilde{G}_1 = A_{\tilde{\theta}_1}$, we can rewrite Equation 15 as

\[
\tilde{G}_1 = \operatorname*{argmax}_{\substack{G \\ \text{without self-loops}}} \sup_{\substack{\theta \\ G = A_\theta}} \sum_{k=0}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p(x;\theta,k)\right] - \lambda |G|. \tag{16}
\]

Finally, since there is no constraint on $G$ other than the absence of self-loops, Equation 16 can be solved as $d$ independent optimization problems, each one determining the parents of $j$ in the graph $\tilde{G}_1$:

\[
\operatorname{pa}_j^{\tilde{G}_1} = \operatorname*{argmax}_{S \subset \llbracket 1,d \rrbracket \backslash \{j\}} \sup_{\substack{\theta \\ S = \operatorname{pa}_j^{A_\theta}}} \sum_{k=0}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_{-j};\theta,k)\right] - \lambda |S|. \tag{17}
\]

Furthermore, whenever $j \in I_k$, our model class has $p_j(x_j \mid x_{-j}, \theta, k) = p_j(x_j \mid \theta_{(j,k)}, k)$: we assume perfect interventions with known targets, so the parameters $\theta_{(j,k)}$ are not related to the coordinates of $\theta$ that define $A_\theta$. That is to say, Equation 17 is equivalent to

\[
\operatorname{pa}_j^{\tilde{G}_1} = \operatorname*{argmax}_{S \subset \llbracket 1,d \rrbracket \backslash \{j\}} \sup_{\substack{\theta \\ S = \operatorname{pa}_j^{A_\theta}}} \sum_{\substack{k=0 \\ j \notin I_k}}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_{-j};\theta,k)\right] - \lambda |S|. \tag{18}
\]

We recall Theorem 5; its statement is given in the main text.

    The assumptions are similar to the ones detailed in brouillard2020differentiable to guarantee that differentiable causal discovery can identify causal graphs.

    The assumptions are:

  • $\pi_0 > 0$ – we observe some observational data,

  • $\exists \theta$ such that $\forall k,\ p^*(\cdot\,;k) = p(\cdot\,;\theta,k)$ – the model class can express the true model $p^*$,

  • The observational distribution $p^*(x;0)$ is faithful to the graph $G^*$ (that is, any edge in $G^*$ indeed results in a nonzero cause-and-effect relation in the distribution $p^*(x;0)$; see neapolitan2004learning for more details),

  • The true distributions $p^*(x;k)$ and any distribution of the model class $p(x;\theta,k)$ have strictly positive densities, $p^*(x;k) > 0$ and $p(x;\theta,k) > 0$. This avoids technical difficulties when forming conditional distributions (e.g., $p^*(x_j \mid x_T;k)$),

  • The expectations $\mathbb{E}_{p^*(x;k)}[\log p^*(x;k)]$ are well defined (they are finite). This enables us to consider the likelihood expectations in the first place,

  • The regularization strength $\lambda$ is strictly positive and small enough (see the proof for how small).

    Proof.

Fix $j \in \llbracket 1, d \rrbracket$.

For clarity of notation, we rewrite Equation 18 as

\[
\operatorname{pa}_j^{\tilde{G}_1} = \operatorname*{argmax}_{S \subset \llbracket 1,d \rrbracket \backslash \{j\}} \sup_{\theta} \sum_{\substack{k=0 \\ j \notin I_k}}^{K} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_S;\theta,k)\right] - \lambda |S|, \tag{19}
\]

where the condition $S = \operatorname{pa}_j^{A_\theta}$ is fully captured by the notation $p_j(x_j \mid x_S;\theta,k)$.

Then, for $S \subset \llbracket 1,d \rrbracket \backslash \{j\}$, define

\[
\psi(S) = \sup_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_S;\theta,k)\right] - \lambda |S|. \tag{20}
\]

Further, define $B = \operatorname{bo}_j^{G^*}$ to be the Markov boundary of node $j$ in the true causal graph $G^*$.

We will show that $\psi(B) > \psi(T)$ for any other $T \subset \llbracket 1,d \rrbracket \backslash \{j\}$ with $T \neq B$.

We compute

\begin{align}
\psi(B) - \psi(T) &= \sup_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_B;\theta,k)\right] - \sup_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_T;\theta,k)\right] \tag{21} \\
&\quad - \lambda |B| + \lambda |T| \nonumber \\
&= -\inf_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x_{-j};k)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_{-j};k) \,\|\, p_j(x_j \mid x_B;\theta,k) \right) \right] \tag{22} \\
&\quad + \inf_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x_{-j};k)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_{-j};k) \,\|\, p_j(x_j \mid x_T;\theta,k) \right) \right] \nonumber \\
&\quad - \lambda(|B| - |T|) \nonumber \\
&= \inf_{\theta} \sum_{\substack{k \\ j \notin I_k}} \pi_k \,\mathbb{E}_{p^*(x_{-j};k)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_{-j};k) \,\|\, p_j(x_j \mid x_T;\theta,k) \right) \right] \tag{23} \\
&\quad + \lambda(|T| - |B|). \nonumber
\end{align}

Line 22 comes from $\mathbb{E}_{p^*(x;k)}\!\left[\log p_j(x_j \mid x_B;\theta,k)\right] = -\mathbb{E}_{p^*(x_{-j};k)}\!\left[D_{\mathrm{KL}}\!\left(p_j^*(x_j \mid x_{-j};k) \,\|\, p_j(x_j \mid x_B;\theta,k)\right)\right] + \mathbb{E}_{p^*(x;k)}\!\left[\log p_j^*(x_j \mid x_{-j};k)\right]$ (and the analogous identity with $T$ in place of $B$), where we added and subtracted the $\log p_j^*(x_j \mid x_{-j};k)$ term: the expectation $\mathbb{E}_{p^*(x;k)}$ is decomposed into $\mathbb{E}_{p^*(x_{-j};k)} \mathbb{E}_{p^*(x_j \mid x_{-j};k)}$, and the inner expectation is absorbed into the KL divergence. We use the assumption of strictly positive density here to define the conditional $p_j^*(x_j \mid x_{-j};k)$ without technical difficulties.

Line 23 comes from the assumption of sufficient model class capacity and the definition of the Markov boundary. Indeed, we first have $p_j^*(x_j \mid x_{-j};k) = p_j^*(x_j \mid x_B;k)$ by definition of the Markov boundary $B$, and since the model class is expressive enough, there exists $\theta$ such that $D_{\mathrm{KL}}\!\left(p_j^*(x_j \mid x_{-j};k) \,\|\, p_j(x_j \mid x_B;\theta,k)\right) = 0$.

We further have

\begin{align}
\psi(B) - \psi(T) &\geq \pi_0 \inf_{\theta} \mathbb{E}_{p^*(x_{-j};0)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_B;0) \,\|\, p_j(x_j \mid x_T;\theta,0) \right) \right] + \lambda(|T| - |B|) \tag{24} \\
&= \pi_0 \,\mathbb{E}_{p^*(x_{-j};0)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_B;0) \,\|\, p_j^*(x_j \mid x_T;0) \right) \right] + \pi_0 \inf_{\theta} \mathbb{E}_{p^*(x;0)}\!\left[ \log \frac{p_j^*(x_j \mid x_T;0)}{p_j(x_j \mid x_T;\theta,0)} \right] \tag{25} \\
&\quad + \lambda(|T| - |B|) \nonumber \\
&\geq \underbrace{\pi_0 \,\mathbb{E}_{p^*(x_{-j};0)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_B;0) \,\|\, p_j^*(x_j \mid x_T;0) \right) \right]}_{\eta(T)} + \lambda(|T| - |B|), \tag{26}
\end{align}

where line 26 follows from $\mathbb{E}_{p^*(x;0)}\!\left[ \log \frac{p_j^*(x_j \mid x_T;0)}{p_j(x_j \mid x_T;\theta,0)} \right] = \mathbb{E}_{p^*(x_T;0)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_T;0) \,\|\, p_j(x_j \mid x_T;\theta,0) \right) \right] \geq 0$.

Let us finally define

\[
u = \min\left( \left\{ \frac{\eta(T)}{|B| - |T|} \;\middle|\; T \subset \llbracket 1,d \rrbracket \backslash \{j\},\ \eta(T) > 0,\ |B| > |T| \right\} \cup \{1\} \right)
\]

and fix any $\lambda \in\, ]0, u[$.

Let us now assume that $\psi(T) \geq \psi(B)$ for some $T \subset \llbracket 1,d \rrbracket \backslash \{j\}$ with $T \neq B$, and show that we obtain a contradiction.

First, by Equation 26, we would have $\lambda(|B| - |T|) \geq \eta(T)$. In particular, we deduce that $|B| \geq |T|$ (since $\eta(T) \geq 0$ and $\lambda > 0$).

Now, there are two possibilities:

  1. If $\eta(T) > 0$, then $|B| > |T|$ and $\lambda \geq \frac{\eta(T)}{|B| - |T|} \geq u > \lambda$, which is absurd.

  2. If $\eta(T) = 0$, then $\pi_0 \,\mathbb{E}_{p^*(x_{-j};0)}\!\left[ D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_B;0) \,\|\, p_j^*(x_j \mid x_T;0) \right) \right] = 0$. Since $p^*(x_{-j};0)$ has positive density and $\pi_0 > 0$, this implies that $D_{\mathrm{KL}}\!\left( p_j^*(x_j \mid x_B;0) \,\|\, p_j^*(x_j \mid x_T;0) \right) = 0$ for ($p^*$-almost) all $x_{-j}$. Hence, the conditionals $p_j^*(x_j \mid x_B;0)$ and $p_j^*(x_j \mid x_T;0)$ are identical. Since $B$ is the Markov boundary of $x_j$, this makes $T$ also a Markov blanket of $x_j$. But then, by minimality of the Markov boundary in a faithful graph, we have $B \subset T$. Remember that we had deduced $|B| \geq |T|$, so $B = T$, which contradicts $T \neq B$.

This concludes the proof, valid for any $\lambda \in\, ]0, u[$. ∎

    A.6 Proof for Stage 2

Since stage 1 does not remove any true causal parents, Theorem 1 of brouillard2020differentiable remains valid.

    A.7 Lemma: Asymptotic Bound on number of edges returned in Stage 1

We denote the Markov boundary of $j$ in $G^*$ by $\operatorname{bo}_j^{G^*}$, and recall that $\operatorname{bo}_j^{G^*} = \left( \operatorname{pa}_j^{G^*} \cup \operatorname{ch}_j^{G^*} \cup \operatorname{pa}^{G^*}_{\operatorname{ch}_j^{G^*}} \right) \backslash \{j\}$.

The following lemma upper-bounds the theoretical number of edges returned by stage 1 when each node has at most $k$ parents.

Lemma A.3.

Assume $G^*$ is sparse, such that each node has at most $k$ parents. Then, the total size of all the Markov boundaries is upper-bounded by $dk(k+2) = O(dk^2)$.

Proof.

First, note that if each node has at most $k$ parents, then $|E| \leq dk$. Then,

\begin{align}
\sum_{j \in V} |\operatorname{bo}_j^{G^*}| &\leq \sum_{j \in V} \left( |\operatorname{pa}_j^{G^*}| + |\operatorname{ch}_j^{G^*}| + |\operatorname{pa}^{G^*}_{\operatorname{ch}_j^{G^*}}| \right) \tag{28} \\
&\leq |E| + |E| + \sum_{j \in V} \sum_{m \in \operatorname{ch}_j^{G^*}} |\operatorname{pa}_m^{G^*}| \tag{29} \\
&\leq 2kd + \sum_{m \in V} \sum_{j \in \operatorname{pa}_m^{G^*}} |\operatorname{pa}_m^{G^*}| \tag{30} \\
&\leq 2kd + dk^2. \tag{31}
\end{align}
∎

    Appendix B Methods

    B.1 Model Details

In SDCD, the conditional distributions $p_j(x_j \mid x_{-j};\theta,k)$ are modeled as Gaussian distributions whose mean and variance are learned by a neural network that takes all of the other variables $x_{-j}$ as input. The initial layer of the network applies $d$ independent linear transformations followed by a sigmoid nonlinearity to the input and outputs $d$ hidden states of size 10. Each of the $d$ hidden states corresponds to the features then used to predict each variable. Each hidden state is fed into two linear layers: one to predict the mean parameter of the conditional and one to predict the variance parameter of the conditional. For the variance, a softplus operation is applied to the output of the linear layer to constrain the variance to be strictly positive.
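A minimal PyTorch sketch of this conditional model is given below. It is an illustration of the description above rather than the exact SDCD implementation; the class and parameter names are ours. Per-variable first-layer weights of shape $d \times d \times 10$ are used so that the implicit adjacency $A_{\theta,i,j} = \|W_{i,j,:}\|_2$ described in Appendix B.2 can be read off directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGaussianModel(nn.Module):
    """d independent conditionals p_j(x_j | x_{-j}) with learned mean and variance."""

    def __init__(self, d: int, hidden: int = 10):
        super().__init__()
        self.d = d
        # W[j, i, :] are the first-layer weights from input variable i to the
        # hidden state of target variable j; its L2 norm defines A_theta[i, j].
        self.W = nn.Parameter(0.01 * torch.randn(d, d, hidden))
        self.b = nn.Parameter(torch.zeros(d, hidden))
        self.mean_head = nn.Parameter(0.01 * torch.randn(d, hidden))
        self.var_head = nn.Parameter(0.01 * torch.randn(d, hidden))
        self.mean_bias = nn.Parameter(torch.zeros(d))
        self.var_bias = nn.Parameter(torch.zeros(d))

    def forward(self, x):  # x: (batch, d)
        # Mask the diagonal so that x_j is never used to predict itself.
        mask = 1.0 - torch.eye(self.d, device=x.device)           # (d, d)
        W = self.W * mask.unsqueeze(-1)                            # (d, d, hidden)
        h = torch.sigmoid(torch.einsum("bi,jih->bjh", x, W) + self.b)
        mean = torch.einsum("bjh,jh->bj", h, self.mean_head) + self.mean_bias
        var = F.softplus(torch.einsum("bjh,jh->bj", h, self.var_head) + self.var_bias)
        return mean, var

    def adjacency(self):
        # Implicit weighted adjacency: A[i, j] = || W[j, i, :] ||_2.
        return self.W.norm(dim=-1).T

    def log_likelihood(self, x):
        mean, var = self.forward(x)
        return torch.distributions.Normal(mean, var.sqrt()).log_prob(x).sum(-1)
```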

    B.2 Algorithm Details

Spectral Acyclicity Constraint Estimation. As described in Theorem 4, the gradient of the spectral acyclicity constraint can be computed as $\nabla h_\rho(A) = v_d u_d^\top / (v_d^\top u_d)$, where $u_d$ and $v_d$ are the right and left eigenvectors of $A$, respectively, associated with its dominant eigenvalue. Using the power iteration method, which involves a fixed number of matrix-vector multiplications, $u_d, v_d$ can be estimated in $O(d^2)$. Specifically, the updates are as follows:

\[
u_d^{(i+1)} := \frac{A\, u_d^{(i)}}{\|u_d^{(i)}\|_2}, \qquad v_d^{(i+1)} := \frac{A^\top v_d^{(i)}}{\|v_d^{(i)}\|_2},
\]

where $u_d, v_d$ are initialized as $u_d^{(1)}, v_d^{(1)} := [\frac{1}{\sqrt{d}}, \dots, \frac{1}{\sqrt{d}}]$ at the very first epoch of SDCD. In our implementation, we use 15 iterations to estimate the spectral acyclicity constraint value.

Importantly, we re-use the estimates of $u_d$ and $v_d$ from one epoch to the next, as we do not expect $A$ (and its eigenvectors) to change drastically.

Hence, at each epoch, we initialize $u_d, v_d$ with their values from the previous epoch and perform 15 power iterations.
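The following numpy sketch illustrates this estimator (variable and function names are ours; warm-starting across epochs is simulated by passing the previous $u_d, v_d$ back in):

```python
import numpy as np

def spectral_constraint(A, u, v, n_iter=15):
    """Estimate h_rho(A) and its gradient v u^T / (v^T u) by power iteration.

    u, v are warm-started from the previous call (epoch) and returned updated.
    """
    for _ in range(n_iter):
        u = A @ u / np.linalg.norm(u)        # right-eigenvector direction
        v = A.T @ v / np.linalg.norm(v)      # left-eigenvector direction
    # Rayleigh-quotient estimate of the dominant eigenvalue (h_rho for A >= 0).
    h = v @ A @ u / (v @ u)
    grad = np.outer(v, u) / (v @ u)
    return h, grad, u, v

d = 6
rng = np.random.default_rng(0)
A = rng.random((d, d)) * (1 - np.eye(d))     # non-negative, no self-loops
u = v = np.full(d, 1.0 / np.sqrt(d))         # initialization at the first epoch
h, grad, u, v = spectral_constraint(A, u, v)
print(h, np.max(np.abs(np.linalg.eigvals(A))))   # estimate vs. exact spectral radius
```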

SDCD Algorithm. The SDCD algorithm follows a two-stage procedure. In the first stage, the coefficient of the spectral acyclicity constraint, $\gamma$, is fixed at zero. We use an Adam optimizer with a learning rate $\eta_1$ specific to stage 1 to perform minibatch gradient-based optimization. The coefficients of the L1 and L2 penalties, $\alpha_1$ and $\beta_1$, respectively, are fixed throughout training. The stage 1 training loss is written as:

\begin{align*}
\mathcal{L}_1(X, \theta, \alpha_1, \beta_1) &= S_{\alpha_1, \beta_1}(\theta) \\
&= \frac{1}{n} \sum_{i=1}^{n} \log p(x^i;\theta,t^i) - \alpha_1 \|A_\theta\|_1 - \beta_1 \|\theta\|_2^2.
\end{align*}

    To prevent the model from learning implicit self-loops, the weights corresponding to the predicted variable are masked out for every hidden state output by the initial neural network layer. Thus, the prediction of each variable is prevented from being a function of the same variable.

    In interventional regimes, the log-likelihood terms corresponding to the prediction of intervened variables are zeroed out. The intervened variables do not have to be modeled as we assume perfect interventions.

    Stage 1 is run for a fixed number of epochs. By default, stage 1 also has an early stopping mechanism that uses the reconstruction loss of a held-out validation set of data (sampled uniformly at random from the training set) as the early stopping metric. If the validation reconstruction loss does not achieve a new minimum after a given number of epochs, the stage 1 training loop is exited.

At the end of stage 1, the learned input-layer weights are used to compute a set of removed edges, $\hat{R}$, for stage 2. Let $W \in \mathbb{R}^{d \times d \times 10}$ represent the input-layer weights. Then, each value of the implicitly defined weighted adjacency matrix is computed as the L2 vector norm of the corresponding set of weights (i.e., $A_{\theta,i,j} := \|W_{i,j,:}\|_2$). This weighted adjacency matrix is discretized with a fixed threshold $\tau_1$, such that each edge $(i,j)$ is removed if it falls below the threshold (i.e., if $A_{\theta,i,j} < \tau_1$).
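Continuing the illustrative model sketch from Appendix B.1 (again, our own code rather than the SDCD implementation), the removed-edge set $\hat{R}$ could be computed as:

```python
import torch

def removed_edges(model, tau1: float):
    """Edges whose implicit adjacency weight falls below tau1 after stage 1."""
    A = model.adjacency()  # A[i, j] = || W[j, i, :] ||_2, see the sketch in B.1
    below = (A < tau1) & ~torch.eye(A.shape[0], dtype=torch.bool)
    return set(map(tuple, below.nonzero().tolist()))
```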

In stage 2, the spectral acyclicity constraint is introduced. As in stage 1, we use an Adam optimizer with learning rate $\eta_2$ and perform minibatch gradient-based optimization. Once again, the L1 and L2 coefficients, $\alpha_2, \beta_2$, are fixed throughout training. Rather than a fixed $\gamma$, SDCD takes an increment value $\gamma^+ \in \mathbb{R}^+$ determining the rate at which $\gamma$ increases every epoch. The training loss for stage 2 is as follows:

\begin{align*}
\mathcal{L}_2(X, \theta, \hat{R}, \alpha_2, \beta_2, \gamma) &= S_{\alpha_2, \beta_2}(\theta) - \gamma h_\rho(A_\theta) \\
&= \frac{1}{n} \sum_{i=1}^{n} \log p(x^i;\theta,t^i) - \alpha_2 \|A_\theta\|_1 - \beta_2 \|\theta\|_2^2 - \gamma h_\rho(A_\theta).
\end{align*}

    The same masking strategy as in stage 1 is used to prevent self-loops in AθA_{\theta}. However, the input layer weights corresponding to edges (i,j)R^(i,j)\in\hat{R} are also masked.

    Like before, the reconstruction loss terms corresponding to intervened variables are removed from the loss.

To reduce the sensitivity of stage 2 to the choice of $\gamma^+$ and to prevent the acyclicity constraint term from dominating the loss, the linear increment schedule is frozen once $A_\theta$, thresholded at the final threshold $\tau_2$, is a DAG. In practice, this DAG check is performed every 20 epochs. If the adjacency matrix becomes cyclic again during training, the $\gamma$ increment schedule resumes increasing from where it left off.

    The early stopping metric is computed similarly to stage 1, but in stage 2, the early stopping can only kick in when γ\gamma has been frozen. If the γ\gamma schedule is resumed due to AθA_{\theta} reintroducing a cycle, the early stopping is reset.

Lastly, once stage 2 is complete, $A_\theta$ is computed and thresholded with the fixed threshold $\tau_2$. All values exceeding the threshold (i.e., $A_{\theta,i,j} \geq \tau_2$) are considered edges in the final graph prediction.

    The thresholded adjacency matrix may contain cycles if stage 2 runs to completion without hitting early stopping. To ensure a DAG, we follow a greedy edge selection procedure detailed in Algorithm 2.

    Pseudocode for a simplified SDCD algorithm (excludes γ\gamma freezing and early stopping) is provided in Algorithm 1.

Time and Space Complexity. The time complexity of each iteration of SDCD is $O(d^2)$. The forward pass in stage 1 can be computed in $O(d^2)$ time; moreover, the $d$ prediction problems are independent, so they can be parallelized, each taking $O(d)$ time. Stage 2 also takes $O(d^2)$ time per iteration, as the spectral acyclicity constraint and the forward pass both take $O(d^2)$ time to compute. Thus, the time complexity of each iteration in both stages is $O(d^2)$.

If the sparsity pattern of the underlying causal graph is known beforehand, such that each variable has at most $k$ parents, we can further tighten the time complexity of SDCD. By Lemma A.3 (Section A.7), the number of edges remaining after stage 1 is $O(dk^2)$. Using sparse matrix multiplication, the spectral acyclicity constraint can then be computed in $O(dk^2)$ time, which is effectively linear in $d$ if $k \ll d$. However, this improvement only becomes significant when $d > 10{,}000$ (from experiments not reported in this paper).
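A sketch of this sparse variant, assuming the edges surviving stage 1 are stored in a scipy.sparse CSR matrix (the density used below is a made-up example):

```python
import numpy as np
from scipy import sparse

def sparse_spectral_constraint(A_csr, u, v, n_iter=15):
    """Power-iteration estimate of h_rho using sparse matrix-vector products.

    Each product costs O(nnz(A)), i.e. O(d k^2) when each node keeps O(k^2)
    candidate parents after stage 1.
    """
    for _ in range(n_iter):
        u = A_csr @ u / np.linalg.norm(u)
        v = A_csr.T @ v / np.linalg.norm(v)
    return (v @ (A_csr @ u)) / (v @ u)

d = 1000
# Hypothetical sparse non-negative adjacency with ~5 entries per row.
A_csr = sparse.random(d, d, density=5.0 / d, random_state=0, format="csr")
u = v = np.full(d, 1.0 / np.sqrt(d))
print(sparse_spectral_constraint(A_csr, u, v))
```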

The space complexity of the algorithm is $O(d^2)$, as the number of parameters in the input layer scales quadratically with the number of features.

Algorithm 1 SDCD
Require: $\alpha_1, \alpha_2 \in \mathbb{R}^+$, $\beta_1, \beta_2 \in \mathbb{R}^+$, $\gamma^+ \in \mathbb{R}^+$, $\tau_1, \tau_2 \in \mathbb{R}^+$, $\eta_1, \eta_2 \in \mathbb{R}^+$, $E_1, E_2 \in \mathbb{Z}^+$
  $A_\theta^{(0)} \leftarrow \vec{0}^{\,d \times d}$
  $\theta^{(0)}_{-A_\theta} \leftarrow$ RandomGaussianInit()
  $e \leftarrow 0$
  while $e < E_1$ do
     $\theta^{(e+1)} := \mathrm{AdamUpdate}(\theta^{(e)}, \nabla\mathcal{L}_1(X, \theta^{(e)}, \alpha_1, \beta_1), \eta_1)$
     $e \leftarrow e + 1$
  end while
  $\hat{R} := \mathrm{Threshold}(A_\theta^{(E_1)}, \tau_1)$
  $A_\theta^{(E_1)} \leftarrow \vec{0}^{\,d \times d}$
  $\theta^{(E_1)}_{-A_\theta} \leftarrow$ RandomGaussianInit()
  $\gamma \leftarrow 0$
  while $e < E_1 + E_2$ do
     $\theta^{(e+1)} := \mathrm{AdamUpdate}(\theta^{(e)}, \nabla\mathcal{L}_2(X, \theta^{(e)}, \hat{R}, \alpha_2, \beta_2, \gamma), \eta_2)$
     $\gamma \leftarrow \gamma + \gamma^+$
     $e \leftarrow e + 1$
  end while
  return $\mathrm{DAGTrim}(A_\theta^{(E_1+E_2)}, \tau_2)$
Algorithm 2 DAGTrim
Require: $A_\theta \in \mathbb{R}^{d \times d}$, $\tau \in \mathbb{R}^+$
  $E \leftarrow \emptyset$ {Initialize the set of final edges.}
  $C \leftarrow [(i,j) \in \llbracket 1,d \rrbracket^2 \mid A_{\theta,i,j} > \tau]$ {Candidate edges above threshold $\tau$.}
  Sort $C$ by decreasing $A_{\theta,i,j}$.
  for each $(i,j) \in C$ do
     if the graph with edges $E \cup \{(i,j)\}$ is still acyclic then
        $E \leftarrow E \cup \{(i,j)\}$ {We add the edge if it does not create a cycle.}
     end if
  end for
  return $E$
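A compact Python sketch of DAGTrim (illustrative only; it reuses the trace-based acyclicity test from Lemma A.2 instead of an incremental cycle check):

```python
import numpy as np

def is_dag(adj):
    # Acyclic iff tr(adj^k) == 0 for all k <= d (Lemma A.2), with adj binary.
    d = adj.shape[0]
    P = np.eye(d)
    for _ in range(d):
        P = P @ adj
        if np.trace(P) > 0:
            return False
    return True

def dag_trim(A, tau):
    """Greedily keep the heaviest edges above tau that do not create a cycle."""
    d = A.shape[0]
    candidates = [(i, j) for i in range(d) for j in range(d) if A[i, j] > tau]
    candidates.sort(key=lambda e: A[e], reverse=True)
    adj = np.zeros((d, d))
    edges = []
    for i, j in candidates:
        adj[i, j] = 1.0
        if is_dag(adj):
            edges.append((i, j))
        else:
            adj[i, j] = 0.0  # adding this edge would create a cycle
    return edges
```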

    Appendix C Empirical Studies Details

    C.1 Simulation Details

    To judge the performance of SDCD against existing methods over both interventional and observational data, we generated simulated data according to the following procedure:

  • Draw a random undirected graph from the Erdős-Rényi distribution.

  • Convert the undirected graph into a DAG $G^*$ by orienting each edge as $i \to j$ if $\pi(i) < \pi(j)$, where $\pi$ is a random permutation of the nodes.

  • Form $d$ possible sets of interventions that target one variable at a time: $I_j = \{j\}$, and $I_0 = \emptyset$.

  • Draw a set of random fully connected neural networks $\operatorname{MLP}^{(j)}: \mathbb{R}^{|\operatorname{pa}_j^{G^*}|} \rightarrow \mathbb{R}^{100} \rightarrow \mathbb{R}$, each with one 100-dimensional hidden layer. Each neural network parametrizes the mean of the observational conditional distribution:

    \[
    p_j^*\!\left(x_j \mid x_{\operatorname{pa}_j^{G^*}}; 0\right) = \mathcal{N}\!\left( \mu = \operatorname{MLP}^{(j)}\!\left(x_{\operatorname{pa}_j^{G^*}}\right),\ \sigma = 0.5 \right).
    \]

  • For intervention distribution $k \geq 1$, perform a hard intervention on variable $k$ and set

    \[
    p_k^*(x_k; k) = \mathcal{N}(0, 0.1).
    \]

  • Draw the data according to the model, with 10,000 observational samples and 500 extra interventional samples per target variable.

  • Standardize the data.

We consider several values of $d$ to simulate different scenarios.
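A condensed numpy re-implementation of this recipe is sketched below; the edge probability `p_edge`, the tanh nonlinearity inside the random MLPs, and the reading of $\mathcal{N}(0, 0.1)$ as a variance are our own assumptions rather than specifications from the paper:

```python
import numpy as np

def simulate(d, p_edge=0.1, n_obs=10_000, n_int=500, seed=0):
    rng = np.random.default_rng(seed)
    # Random DAG: undirected ER graph oriented by a random permutation.
    perm = rng.permutation(d)
    upper = rng.random((d, d)) < p_edge
    G = np.zeros((d, d), dtype=bool)
    for i in range(d):
        for j in range(d):
            if i != j and upper[min(i, j), max(i, j)] and perm[i] < perm[j]:
                G[i, j] = True
    # One random one-hidden-layer MLP per node, acting on its parents.
    W1 = [rng.normal(size=(G[:, j].sum(), 100)) for j in range(d)]
    W2 = [rng.normal(size=100) for j in range(d)]

    def sample(n, target=None):
        X = np.zeros((n, d))
        for j in np.argsort(perm):            # topological order
            if target == j:
                # Hard intervention, interpreting N(0, 0.1) as a variance.
                X[:, j] = rng.normal(0.0, np.sqrt(0.1), size=n)
            else:
                parents = X[:, G[:, j]]
                mean = np.tanh(parents @ W1[j]) @ W2[j] if parents.shape[1] else 0.0
                X[:, j] = rng.normal(mean, 0.5, size=n)
        return X

    data = [sample(n_obs)] + [sample(n_int, target=j) for j in range(d)]
    X = np.concatenate(data)
    return (X - X.mean(0)) / X.std(0), G      # standardized data and true graph
```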

    C.2 Choice of Hyperparameters

We fixed the hyperparameters as follows: $\alpha_1 := 1\mathrm{e}{-2}$, $\beta_1 := 2\mathrm{e}{-4}$, $\eta_1 := 2\mathrm{e}{-3}$, $\tau_1 := 0.2$, $\alpha_2 := 5\mathrm{e}{-4}$, $\beta_2 := 5\mathrm{e}{-3}$, $\eta_2 := 1\mathrm{e}{-3}$, $\gamma^+ := 0.005$, $\tau_2 := 0.1$. We found that these choices worked well empirically across multiple simulated datasets; they were used in all experiments without simulation-specific fine-tuning.

    Each stage was run for 2000 epochs with a batch size of 256, and the validation loss was computed over a held-out fraction of the training dataset (20% of the data) every 20 epochs for early stopping. In stage 2, the DAG check of the implicit adjacency matrix was performed every 20 epochs before the validation loss computation.

    C.3 Baseline Methods

Here, we provide details on the baseline methods and cite which implementations were used for the experiments. For DCDI and DCDFG, we used the implementations from https://github.com/Genentech/dcdfg, using the default parameters for optimization. For DCDFG, we used 10 modules in our benchmarks, as reported in the paper experiments. For GIES, we used the Python implementation from https://github.com/juangamella/gies, using the default parameters. For DAGMA, we used the original implementation from https://github.com/kevinsbello/dagma with the default parameters. For NOTEARS, we used the implementation from https://github.com/xunzheng/notears, and for NOBEARS, we used the implementation from https://github.com/howchihlee/BNGPU. For NOTEARS and NOBEARS, we found the default thresholds for determining the final adjacency matrix performed poorly or did not return a DAG, so for each of these baselines, we followed the same procedure described in lopez2022large: we find the threshold that returns the largest possible DAG via binary search. sortnregress (reisach2021beware) is a trivial baseline meant to check whether the causal graph can be easily inferred from the variance pattern across the variables. For this baseline, we used the implementation in https://github.com/Scriddie/Varsortability.

    C.4 Robustness Checks

    Below, we discuss three categories of issues that commonly arise when evaluating causal discovery methods and address each issue with a diagnostic metric.

Sparsity. Particularly when the true causal graph is sparse, SHD may favor sparser predictions since, in the extreme case, the empty graph achieves an SHD equal to the number of true edges. To show the relative performance of the benchmarked methods with respect to this trivial solution, we indicate the number of true edges for each simulated setting in the main-text observational and interventional benchmark figures. We find that most methods outperform this baseline. Additionally, we report the F1 score and the recall of the predictions (see Figures 3 and 6), two metrics that suffer when a method predicts many false negatives. We find that SDCD still outperforms other methods under these metrics.

    Identifiability. In settings with incomplete or no interventional data, the true causal graph may be unidentifiable, meaning multiple \mathcal{I}-Markov equivalent graphs can maximize the score (brouillard2020differentiable). Therefore, graphs in the same Markov equivalence class as the true causal graph may have positive SHD values despite being optimal with respect to the available data. As proposed in peters2014causal, we also compute an adapted version of the SHD to compare the Markov equivalence class of the methods’ results against the true Markov equivalence class instead of the graphs themselves. This metric, called SHD-CPDAG, is computed as the SHD between the completed partially directed acyclic graph (CPDAG) of the predicted graph and the CPDAG of the true graph. Unlike the regular SHD metric, this metric is zero if two graphs are in the same equivalence class. We report it alongside SHD for our experiments in Figure 3 to better represent the results in scenarios with an unidentifiable causal graph. We find very similar results.

Simulation issues. As discussed in reisach2021beware, certain simulation processes used for causal discovery benchmarking exhibit an issue where the order of the variables, when sorted by sample variance, reflects the true causal ordering of the graph. As a result, methods that exploit this phenomenon to accurately infer the causal graph may be misrepresented. To ensure that our simulation process does not suffer from this issue and that the methods are being properly evaluated, we take two complementary steps recommended in reisach2021beware: (i) we standardize the data before it is input into any of the evaluated methods, so that no artificial sample-variance information can be exploited, and (ii) we include the trivial baseline sortnregress (reisach2021beware), which is designed to exploit sample-variance artifacts from a flawed simulation and should be outperformed by an effective, scale-invariant algorithm. We find that sortnregress performs poorly, which confirms that our normalization scheme removes simulation artifacts, and we find that SDCD and its competing methods beat sortnregress by a wide margin.

    Appendix D Supplementary Figures and Tables

Figure 1: The effect of constraints on the learned graph throughout training. Training with the penalty $h_{\rho}^{+}$ (dashed purple; exactly $h_\rho$ with a hard mask on the diagonal to prevent self-loops, as implemented in SDCD) converges the fastest toward a DAG. (left) Training with $h$ as a regularization penalty. (right) Training with $h$ as an augmented Lagrangian constraint. "Threshold to DAG" is the smallest $\eta$ at which all edges with weight $>\eta$ form a DAG.
s  d   δ       Method   SDCD               DCDI-G             DCDI-DSF
1  10  22.2%   L        \textbf{0.7±1.2}   1.3±1.9            0.9±1.3
               NL-Add   \textbf{0.6±0.7}   5.2±7.5            4.2±5.6
               NL-NN    \textbf{0.7±0.7}   2.3±3.6            7.0±10.7
   20  10.5%   L        \textbf{1.4±3.4}   5.4±4.5            3.6±2.7
               NL-Add   \textbf{4.1±3.0}   21.8±30.1          4.3±1.9
               NL-NN    \textbf{3.0±2.5}   13.9±20.3          8.3±4.1
4  10  88.9%   L        5.2±3.5            \textbf{3.3±2.1}   3.7±2.3
               NL-Add   4.8±2.1            \textbf{4.3±2.4}   5.5±2.4
               NL-NN    7.3±3.0            2.4±1.6            \textbf{1.6±1.6}
   20  42.1%   L        18.8±10.5          23.7±5.6           \textbf{16.6±6.4}
               NL-Add   \textbf{18.0±7.3}  35.2±13.2          26.7±16.9
               NL-NN    14.9±1.9           16.8±8.7           \textbf{11.8±2.1}
Table 1: Means and standard deviations of SHD scores over simulations from brouillard2020differentiable. The "Method" column refers to the model used to simulate the causal relationships: "L" refers to a linear model, "NL-Add" to a nonlinear additive model, and "NL-NN" to a nonlinear, non-additive (neural network) model. We refer to brouillard2020differentiable for the simulation details. The results are reported alongside the values presented in the original paper. $s$ refers to the expected number of edges per node, $d$ denotes the number of nodes, and the edge density $\delta$ is computed as the fraction of $\frac{d(d-1)}{2}$, the maximum number of edges for a DAG. The lowest average SHD values are set in bold.
Figure 2: Training runtimes across the observational simulations from the main text. SDCD on GPU (dashed) scales to 500 variables in under 20 minutes. Error bars indicate the standard deviation over 5 random datasets for $d<50$ and 3 random datasets for $d\geq 50$.
d     SDCD          SDCD-GPU      DCDI          DCDFG            GIES          DAGMA         NOTEARS      NOBEARS       SCORE         sortnregress
10    14.7 ±5.5     NA            24.3 ±3.9     24.6 ±6.0        27.8 ±3.9     25.3 ±6.2     35.3 ±1.9    33.5 ±3.0     14.4 ±4.0     28.6 ±4.7
20    35.7 ±6.2     NA            31.7 ±6.5     108.0 ±14.3      123.4 ±12.3   62.0 ±12.0    75.4 ±4.0    74.2 ±3.0     118.6 ±8.5    81.6 ±6.3
30    53.8 ±11.9    NA            55.5 ±10.7    258.8 ±32.6      NA            89.4 ±10.9    113.1 ±4.8   113.7 ±3.3    275.9 ±21.0   134.6 ±8.4
40    64.0 ±13.7    NA            102.4 ±21.6   426.6 ±73.7      NA            115.3 ±13.4   147.9 ±6.0   151.1 ±3.4    454.3 ±52.8   172.9 ±12.2
50    69.9 ±12.3    68.3 ±13.3    NA            660.8 ±126.1     NA            NA            183.4 ±7.4   192.0 ±3.5    619.4 ±59.7   216.6 ±12.4
100   92.7 ±9.1     89.7 ±11.0    NA            1807.3 ±788.2    NA            NA            327.3 ±7.5   389.0 ±3.6    NA            421.3 ±12.0
200   225.3 ±13.7   228.0 ±18.3   NA            5657.3 ±2982.6   NA            NA            619.0 ±4.2   770.0 ±7.8    NA            824.0 ±19.0
300   350.0 ±12.5   360.0 ±nan    NA            7284.7 ±5072.3   NA            NA            NA           1149.0 ±14.0  NA            1190.7 ±26.3
400   466.3 ±62.4   471.7 ±68.0   NA            3779.7 ±507.3    NA            NA            NA            1534.7 ±3.1   NA            1585.0 ±59.3
500   621.7 ±10.7   621.0 ±10.5   NA            7252.7 ±3284.6   NA            NA            NA            1915.7 ±18.8  NA            1974.3 ±34.6
Table 2: Detailed SHD means and standard deviations for the observational benchmark in the main text. SDCD-GPU was only run for $d \geq 50$. All other NA values correspond to failed runs.
Figure 3: F1 and SHD-CPDAG metrics across the observational simulations from the main text, with increasing numbers of variables $d$. Missing data points imply the method failed to run. Error bars indicate the standard deviation over 30 random datasets.
Figure 4: SDCD on GPU (dashed) scales to 4,000 variables in under 3 hours while maintaining competitive SHD. Error bars indicate the standard deviation over 3 random datasets for $d=1000, 2000$ and 2 random datasets for $d=3000, 4000$.
d      SDCD-GPU
1000   1438.7 ±59.2
2000   3356.7 ±70.0
3000   5172.5 ±89.8
4000   7567.0 ±343.7
    Table 3: Detailed results of SHD means and standard deviations from Figure 4.

    In addition to SHD, we computed precision and recall metrics over the predicted edges with respect to the true edges for both observational and interventional scenarios. The precision is the fraction of true edges among all the predicted edges. The recall is the fraction of true edges that have been correctly predicted.
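For concreteness, these metrics (together with SHD) can be computed from binary adjacency matrices as in the following sketch (our own helper, not the evaluation code used for the experiments):

```python
import numpy as np

def edge_metrics(A_pred, A_true):
    """Precision, recall, and SHD between two binary adjacency matrices.

    Assumes the prediction is a DAG (no 2-cycles), so a reversed edge is
    counted once in the SHD.
    """
    pred, true = A_pred.astype(bool), A_true.astype(bool)
    tp = np.sum(pred & true)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(true.sum(), 1)
    diff = pred ^ true                                   # mismatched entries
    shd = np.sum(diff) - np.sum(diff & diff.T) // 2      # count a flip once
    return precision, recall, shd
```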

Figure 5: Precision across the observational simulations from the main text, with increasing numbers of variables $d$. The SDCD(-CPU) and SDCD-GPU lines overlap, indicating consistent results. Missing data points imply the method failed to run. Error bars indicate the standard deviation over 30 random datasets for $d<50$ and 5 random datasets for $d\geq 50$.
Figure 6: Recall across the observational simulations from the main text, with increasing numbers of variables $d$. The SDCD(-CPU) and SDCD-GPU lines overlap, indicating consistent results. Missing data points imply the method failed to run. Error bars indicate the standard deviation over 30 random datasets for $d<50$ and 5 random datasets for $d\geq 50$.
Figure 7: Precision and recall across the simulations from Figure 4, observational data with increasing numbers of variables $d$. Error bars indicate the standard deviation over 3 random datasets for $d=1000, 2000$ and 2 random datasets for $d=3000, 4000$.
Figure 8: SHD across simulations with an increasing proportion of variables intervened on, varying the total number of variables $d$ (columns) and average edges per variable $s$ (rows). Extended version of the main-text interventional benchmark figure, with DCDFG and GIES and $s=8$. Boxplots over 5 random datasets.
s  d  δ  Fraction of Variables Intervened on  SDCD  DCDI  DCDFG  GIES
    0.00 18.0 ±6.5{\scriptstyle\pm 6.5} 24.8 ±4.6{\scriptstyle\pm 4.6} 112.3 ±25.8{\scriptstyle\pm 25.8} 80.3 ±26.1{\scriptstyle\pm 26.1}
    0.25 13.4 ±5.7{\scriptstyle\pm 5.7} 27.4 ±8.6{\scriptstyle\pm 8.6} 99.0 ±7.8{\scriptstyle\pm 7.8} 70.0 ±36.8{\scriptstyle\pm 36.8}
    2 20 21% 0.50 9.4 ±5.4{\scriptstyle\pm 5.4} 25.4 ±8.5{\scriptstyle\pm 8.5} 109.0 ±31.4{\scriptstyle\pm 31.4} 69.7 ±41.9{\scriptstyle\pm 41.9}
    0.75 9.0 ±2.1{\scriptstyle\pm 2.1} 19.8 ±4.8{\scriptstyle\pm 4.8} 76.0 ±14.2{\scriptstyle\pm 14.2} 62.7 ±22.0{\scriptstyle\pm 22.0}
    1.00 9.0 ±2.8{\scriptstyle\pm 2.8} 18.8 ±3.3{\scriptstyle\pm 3.3} 98.0 ±19.7{\scriptstyle\pm 19.7} 65.7 ±18.5{\scriptstyle\pm 18.5}
    0.00 24.6 ±9.6{\scriptstyle\pm 9.6} 56.0 ±32.9{\scriptstyle\pm 32.9} 183.0 ±102.1{\scriptstyle\pm 102.1} NA
    0.25 26.8 ±13.1{\scriptstyle\pm 13.1} 60.4 ±40.9{\scriptstyle\pm 40.9} 204.3 ±51.4{\scriptstyle\pm 51.4} NA
    30 13% 0.50 21.8 ±9.0{\scriptstyle\pm 9.0} 56.2 ±26.6{\scriptstyle\pm 26.6} 206.7 ±84.3{\scriptstyle\pm 84.3} NA
    0.75 16.2 ±3.4{\scriptstyle\pm 3.4} 43.2 ±17.8{\scriptstyle\pm 17.8} 207.0 ±119.7{\scriptstyle\pm 119.7} NA
    1.00 18.2 ±2.2{\scriptstyle\pm 2.2} 31.7 ±6.1{\scriptstyle\pm 6.1} 283.7 ±15.2{\scriptstyle\pm 15.2} NA
    0.00 44.2 ±4.1{\scriptstyle\pm 4.1} 109.2 ±27.6{\scriptstyle\pm 27.6} 465.0 ±131.8{\scriptstyle\pm 131.8} NA
    0.25 33.4 ±3.6{\scriptstyle\pm 3.6} 123.8 ±62.1{\scriptstyle\pm 62.1} 372.7 ±84.3{\scriptstyle\pm 84.3} NA
    40 10% 0.50 33.0 ±3.5{\scriptstyle\pm 3.5} 105.6 ±38.8{\scriptstyle\pm 38.8} 479.7 ±103.5{\scriptstyle\pm 103.5} NA
    0.75 27.6 ±3.4{\scriptstyle\pm 3.4} 76.0 ±24.4{\scriptstyle\pm 24.4} 469.3 ±126.6{\scriptstyle\pm 126.6} NA
    1.00 27.0 ±7.5{\scriptstyle\pm 7.5} 63.0 ±26.9{\scriptstyle\pm 26.9} 333.7 ±135.5{\scriptstyle\pm 135.5} NA
    0.00 36.0 ±6.4{\scriptstyle\pm 6.4} 31.2 ±4.1{\scriptstyle\pm 4.1} 104.0 ±3.5{\scriptstyle\pm 3.5} 110.7 ±25.3{\scriptstyle\pm 25.3}
    0.25 32.2 ±6.6{\scriptstyle\pm 6.6} 33.0 ±3.6{\scriptstyle\pm 3.6} 105.7 ±4.2{\scriptstyle\pm 4.2} 110.7 ±20.0{\scriptstyle\pm 20.0}
    4 20 42% 0.50 34.6 ±10.6{\scriptstyle\pm 10.6} 30.4 ±7.6{\scriptstyle\pm 7.6} 107.7 ±9.5{\scriptstyle\pm 9.5} 101.3 ±22.3{\scriptstyle\pm 22.3}
    0.75 25.8 ±8.8{\scriptstyle\pm 8.8} 29.6 ±8.2{\scriptstyle\pm 8.2} 102.0 ±6.1{\scriptstyle\pm 6.1} 105.7 ±22.8{\scriptstyle\pm 22.8}
    1.00 22.4 ±7.1{\scriptstyle\pm 7.1} 25.8 ±6.4{\scriptstyle\pm 6.4} 107.0 ±10.4{\scriptstyle\pm 10.4} 105.3 ±12.2{\scriptstyle\pm 12.2}
    0.00 54.0 ±9.8{\scriptstyle\pm 9.8} 57.6 ±7.9{\scriptstyle\pm 7.9} 234.7 ±42.9{\scriptstyle\pm 42.9} NA
    0.25 43.8 ±10.3{\scriptstyle\pm 10.3} 67.0 ±4.0{\scriptstyle\pm 4.0} 269.3 ±36.7{\scriptstyle\pm 36.7} NA
    30 27% 0.50 39.2 ±8.6{\scriptstyle\pm 8.6} 72.4 ±15.5{\scriptstyle\pm 15.5} 262.0 ±15.7{\scriptstyle\pm 15.7} NA
    0.75 35.0 ±9.9{\scriptstyle\pm 9.9} 72.2 ±10.7{\scriptstyle\pm 10.7} 232.7 ±24.0{\scriptstyle\pm 24.0} NA
    1.00 29.0 ±6.5{\scriptstyle\pm 6.5} 60.3 ±5.7{\scriptstyle\pm 5.7} 236.0 ±42.8{\scriptstyle\pm 42.8} NA
    0.00 69.0 ±11.7{\scriptstyle\pm 11.7} 99.0 ±30.7{\scriptstyle\pm 30.7} 460.0 ±47.8{\scriptstyle\pm 47.8} NA
    0.25 56.8 ±15.4{\scriptstyle\pm 15.4} 107.0 ±39.2{\scriptstyle\pm 39.2} 438.3 ±73.1{\scriptstyle\pm 73.1} NA
    40 20% 0.50 50.4 ±13.0{\scriptstyle\pm 13.0} 105.6 ±41.6{\scriptstyle\pm 41.6} 457.7 ±21.2{\scriptstyle\pm 21.2} NA
    0.75 41.4 ±10.7{\scriptstyle\pm 10.7} 97.8 ±33.9{\scriptstyle\pm 33.9} 426.3 ±52.6{\scriptstyle\pm 52.6} NA
    1.00 34.4 ±11.3{\scriptstyle\pm 11.3} 93.5 ±16.3{\scriptstyle\pm 16.3} 458.7 ±55.3{\scriptstyle\pm 55.3} NA
    0.00 51.2 ±7.5{\scriptstyle\pm 7.5} 56.6 ±10.4{\scriptstyle\pm 10.4} 112.0 ±12.3{\scriptstyle\pm 12.3} 117.7 ±11.9{\scriptstyle\pm 11.9}
    0.25 44.0 ±6.5{\scriptstyle\pm 6.5} 37.8 ±10.2{\scriptstyle\pm 10.2} 92.3 ±17.0{\scriptstyle\pm 17.0} 117.3 ±11.6{\scriptstyle\pm 11.6}
    6 20 63% 0.50 42.2 ±8.6{\scriptstyle\pm 8.6} 29.0 ±10.8{\scriptstyle\pm 10.8} 97.0 ±11.5{\scriptstyle\pm 11.5} 112.7 ±7.6{\scriptstyle\pm 7.6}
    0.75 38.8 ±8.3{\scriptstyle\pm 8.3} 25.2 ±10.0{\scriptstyle\pm 10.0} 94.0 ±11.5{\scriptstyle\pm 11.5} 91.3 ±16.6{\scriptstyle\pm 16.6}
    1.00 34.0 ±5.1{\scriptstyle\pm 5.1} 23.8 ±7.7{\scriptstyle\pm 7.7} 93.3 ±1.5{\scriptstyle\pm 1.5} 86.0 ±7.9{\scriptstyle\pm 7.9}
    0.00 85.4 ±6.5{\scriptstyle\pm 6.5} 75.8 ±4.4{\scriptstyle\pm 4.4} 256.7 ±30.6{\scriptstyle\pm 30.6} NA
    0.25 79.8 ±12.5{\scriptstyle\pm 12.5} 69.2 ±15.4{\scriptstyle\pm 15.4} 260.7 ±12.4{\scriptstyle\pm 12.4} NA
    30 41% 0.50 69.4 ±14.8{\scriptstyle\pm 14.8} 80.6 ±17.2{\scriptstyle\pm 17.2} 257.3 ±18.6{\scriptstyle\pm 18.6} NA
    0.75 67.0 ±12.2{\scriptstyle\pm 12.2} 86.2 ±23.3{\scriptstyle\pm 23.3} 245.3 ±23.7{\scriptstyle\pm 23.7} NA
    1.00 57.4 ±11.3{\scriptstyle\pm 11.3} 82.0 ±5.3{\scriptstyle\pm 5.3} 259.7 ±19.5{\scriptstyle\pm 19.5} NA
    0.00 95.4 ±27.7{\scriptstyle\pm 27.7} 107.8 ±36.3{\scriptstyle\pm 36.3} 460.3 ±63.3{\scriptstyle\pm 63.3} NA
    0.25 75.6 ±23.6{\scriptstyle\pm 23.6} 130.2 ±43.3{\scriptstyle\pm 43.3} 401.0 ±112.8{\scriptstyle\pm 112.8} NA
    40 30% 0.50 63.6 ±17.0{\scriptstyle\pm 17.0} 146.0 ±51.7{\scriptstyle\pm 51.7} 370.0 ±87.1{\scriptstyle\pm 87.1} NA
    0.75 57.4 ±19.6{\scriptstyle\pm 19.6} 128.3 ±16.3{\scriptstyle\pm 16.3} 387.7 ±81.1{\scriptstyle\pm 81.1} NA
    1.00 47.2 ±12.2{\scriptstyle\pm 12.2} 114.5 ±31.8{\scriptstyle\pm 31.8} 469.0 ±28.2{\scriptstyle\pm 28.2} NA
    0.00 53.6 ±10.2{\scriptstyle\pm 10.2} 82.8 ±14.5{\scriptstyle\pm 14.5} 117.7 ±11.4{\scriptstyle\pm 11.4} 111.7 ±16.5{\scriptstyle\pm 16.5}
    0.25 51.0 ±8.1{\scriptstyle\pm 8.1} 58.2 ±12.6{\scriptstyle\pm 12.6} 108.3 ±19.7{\scriptstyle\pm 19.7} 96.7 ±15.8{\scriptstyle\pm 15.8}
    8 20 84% 0.50 47.0 ±5.3{\scriptstyle\pm 5.3} 41.4 ±13.8{\scriptstyle\pm 13.8} 100.7 ±7.6{\scriptstyle\pm 7.6} 91.0 ±23.5{\scriptstyle\pm 23.5}
    0.75 40.8 ±6.7{\scriptstyle\pm 6.7} 26.0 ±3.8{\scriptstyle\pm 3.8} 73.7 ±8.1{\scriptstyle\pm 8.1} 86.0 ±31.2{\scriptstyle\pm 31.2}
    1.00 43.0 ±9.0{\scriptstyle\pm 9.0} 19.8 ±4.9{\scriptstyle\pm 4.9} 81.7 ±22.4{\scriptstyle\pm 22.4} 79.7 ±10.8{\scriptstyle\pm 10.8}
    0.00 111.8 ±9.8{\scriptstyle\pm 9.8} 93.4 ±13.3{\scriptstyle\pm 13.3} 255.0 ±27.1{\scriptstyle\pm 27.1} NA
    0.25 90.8 ±7.5{\scriptstyle\pm 7.5} 75.6 ±8.2{\scriptstyle\pm 8.2} 242.7 ±14.6{\scriptstyle\pm 14.6} NA
    30 55% 0.50 89.6 ±13.2{\scriptstyle\pm 13.2} 78.8 ±12.2{\scriptstyle\pm 12.2} 255.0 ±22.1{\scriptstyle\pm 22.1} NA
    0.75 77.6 ±9.3{\scriptstyle\pm 9.3} 81.2 ±9.4{\scriptstyle\pm 9.4} 226.3 ±15.1{\scriptstyle\pm 15.1} NA
    1.00 71.0 ±4.6{\scriptstyle\pm 4.6} 81.6 ±12.5{\scriptstyle\pm 12.5} 222.3 ±13.0{\scriptstyle\pm 13.0} NA
    0.00 150.4 ±16.9{\scriptstyle\pm 16.9} 151.0 ±23.1{\scriptstyle\pm 23.1} 450.0 ±44.5{\scriptstyle\pm 44.5} NA
    0.25 127.0 ±12.4{\scriptstyle\pm 12.4} 188.4 ±27.4{\scriptstyle\pm 27.4} 452.0 ±7.0{\scriptstyle\pm 7.0} NA
    40 41% 0.50 113.4 ±20.2{\scriptstyle\pm 20.2} 200.0 ±33.9{\scriptstyle\pm 33.9} 424.3 ±43.0{\scriptstyle\pm 43.0} NA
    0.75 104.4 ±21.3{\scriptstyle\pm 21.3} 193.0 ±14.1{\scriptstyle\pm 14.1} 426.7 ±34.6{\scriptstyle\pm 34.6} NA
    1.00 92.0 ±17.6{\scriptstyle\pm 17.6} 190.0 ±nan{\scriptstyle\pm nan} 422.0 ±18.1{\scriptstyle\pm 18.1} NA
Table 4: Detailed SHD means and standard deviations from Figure 8. GIES failed to run for $d \geq 30$.
Figure 9: Precision across the simulations from Figure 8, with an increasing proportion of variables intervened on, varying the total number of variables $d$ (columns) and average edges per variable $s$ (rows). Boxplots over 5 random datasets.
Figure 10: Recall across the simulations from Figure 8, with an increasing proportion of variables intervened on, varying the total number of variables $d$ (columns) and average edges per variable $s$ (rows). Boxplots over 5 random datasets.
    Name d=10 d=20 d=30 d=40
    SDCD 14.7 40.3 54.3 69.0
    SDCD-warm 14.7 40.7 55.0 68.7
    SDCD-warm-nomask 19.3 69.7 156.0 272.7
    SDCD-no-s1 19.3 68.3 155.3 272.3
    SDCD-no-s1-2 16.3 56.7 95.0 135.0
    DCDI 24.0 35.7 56.7 87.0
Table 5: Ablation study for SDCD stage 1. We observe that the described version of SDCD performs the best out of all variations. SDCD-warm performs competitively but generally provides little benefit. SDCD-warm-nomask performs much worse than SDCD, demonstrating that enforcing the mask during stage 2 is important. We report mean SHD over three random seeds of observational data (no interventions) with a fixed number of edges per variable, $s=4$, for a range of numbers of variables, $d$. SDCD-warm refers to starting stage 2 of SDCD with the input layer ported over from stage 1 instead of re-learned. SDCD-warm-nomask performs the same warm start as SDCD-warm but does not enforce the mask in stage 2. SDCD-no-s1 only performs stage 2. SDCD-no-s1-2 also only performs stage 2, but sets $(\alpha_2, \beta_2)$ to the default stage 1 values $(\alpha_1, \beta_1)$. We report these values alongside DCDI. The lowest SHD values are bolded for each value of $d$.
Name      d=10   d=20   d=30   d=40   d=50
    SDCD 13.33 33.47 54.07 70.80 76.60
    SDCD-exp 11.60 44.33 69.07 85.07 89.93
    SDCD-log 11.20 52.00 87.47 117.87 116.00
    Table 6: Ablation study for SDCD stage 2 (choice of the constraint). We observe that SDCD performs the best out of the three variations. Additionally, the variations using the PST constraints do not crash for any of the runs, even for those with d=50d=50. We attribute this improved stability (as compared to DCDI and DAGMA) to stage 1 since there are fewer non-zero parameters contributing to the value of the constraint. We report mean SHD over five random seeds of observational data (no interventions) with a fixed number of edges per variable, s=4s=4, for a range of numbers of variables, dd. SDCD-exp is SDCD except using the hexph_{\text{exp}} constraint in place of the hρh_{\rho}, and SDCD-log uses the hlogh_{\text{log}} constraint.