
From graphs to DAGs: a low-complexity model and a scalable algorithm

Shuyu Dong (TAU, LISN, INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France; [email protected])    Michèle Sebag (TAU, LISN, CNRS, INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France; [email protected])
(April 10, 2022)
Abstract

Learning directed acyclic graphs (DAGs) has long been a critical challenge at the core of probabilistic and causal modeling. The NoTears approach of Zheng et al. [ZARX18], through a differentiable function involving the matrix exponential trace $\operatorname{tr}(\exp(\cdot))$, opens up a way to learn DAGs via continuous optimization, though with an $O(d^{3})$ complexity in the number $d$ of nodes. This paper presents a low-complexity model, called LoRAM for Low-Rank Additive Model, which combines low-rank matrix factorization with a sparsification mechanism for the continuous optimization of DAGs. The main contribution of the approach lies in an efficient gradient approximation method leveraging the low-rank property of the model, and its straightforward application to the computation of projections from graph matrices onto the DAG matrix space. The proposed method achieves a reduction from cubic to quadratic complexity while handling the same DAG characteristic function as NoTears, and scales easily up to thousands of nodes for the projection problem. The experiments show that LoRAM achieves efficiency gains of orders of magnitude over the state of the art, at the expense of a very moderate accuracy loss in the considered range of sparse matrices, and with a low sensitivity to the rank choice of the model's low-rank component.

1 Introduction

The learning of directed acyclic graphs (DAGs) is an important problem for probabilistic and causal inference [Pea09, PJS17], with important applications in social sciences [MW15], genome research [SB09] and machine learning itself [PBM16, ABGLP19, SG21]. Through the development of probabilistic graphical models [Pea09, BPE14], DAGs have become a most natural mathematical object to describe the causal relations among a number of variables. In many of today's application domains, the estimation of DAGs faces intractability issues as an ever growing number $d$ of variables is considered, since estimating DAGs is NP-hard [Chi96]. The difficulty lies in how to enforce the acyclicity of graphs. Shimizu et al. [SHHK06] combined independent component analysis with the combinatorial linear assignment method to optimize a linear causal model (LiNGAM), and later proposed a direct, sequential algorithm [SIS+11] guaranteeing the global optimum of the LiNGAM, with $O(d^{4})$ complexity.

Recently, Zheng et al. [ZARX18] proposed an optimization approach to learning DAGs. The breakthrough in this work, called NoTears, comes with the characterization of DAG matrices as the zero set of a real-valued differentiable function on $\mathbb{R}^{d\times d}$: a $d\times d$ matrix $A$ is the adjacency matrix of a DAG if and only if the exponential trace satisfies

$h(A) := \operatorname{tr}(\exp(A \odot A)) = d$,   (1)

and thus the learning of DAG matrices can be cast as a continuous optimization problem subject to the constraint $h(A)=d$. The NoTears approach broadens the way of learning complex causal relations and provides promising perspectives for tackling large-scale inference problems [KGG+18, YCGY19, ZDA+20, NGZ20]. However, NoTears is still not suitable for large-scale applications, as the complexity of computing the exponential trace and its gradient is $O(d^{3})$. More recently, Fang et al. [FZZ+20] proposed to represent DAGs by low-rank matrices, with both theoretical and empirical validation of the low-rank assumption for a range of graph models. However, the adaptation of the NoTears framework [ZARX18] to the low-rank model still incurs a complexity of $O(d^{3})$ due to the DAG characteristic function in (1).

The contribution of the paper is a new computational framework that tackles the scalability issues faced by the low-rank modeling of DAGs. We notice that the Hadamard product $\odot$ in characteristic functions such as (1) poses real obstacles to scaling up the optimization of NoTears [ZARX18] and NoTears-low-rank [FZZ+20]. To address these difficulties, we present a low-complexity model, named Low-Rank Additive Model (LoRAM), which composes low-rank matrix factorization with sparsification, and then propose a novel approximation method, compatible with LoRAM, to compute the gradients of the exponential trace in (1). Formally, the gradient approximation—consisting of matrix computations of the form $(A,C,B)\to(\exp(A)\odot C)B$, where $A,C\in\mathbb{R}^{d\times d}$ and $B$ is a thin low-rank matrix—is inspired by the numerical analysis of [AMH11] for the action of the matrix exponential $\exp(A)$. We apply the new method to the computation of projections from graphs to DAGs through optimization with the differentiable DAG constraint.

Empirical evidence is presented to identify the cost and the benefits of the approximation method combined with Nesterov’s accelerated gradient descent [Nes83], depending on the considered range of problem parameters (number of nodes, rank approximation, sparsity of the target graph).

The main contributions are summarized as follows:

  • The LoRAM model, combining a low-rank structure with a flexible sparsification mechanism, is proposed to represent DAG matrices, together with a DAG characteristic function generalizing the exponential trace of NoTears [ZARX18].

  • An efficient gradient approximation method, exploiting the low-rank and sparse nature of the LoRAM model, is proposed. Under the low-rank assumption ($r\leq C\ll d$), the complexity of the proposed method is quadratic ($O(d^{2})$) instead of $O(d^{3})$, as shown in Table 1. Large efficiency gains, with insignificant loss of accuracy in some cases, are demonstrated experimentally in the considered range of application.

    Table 1: Computational properties of LoRAM and algorithms in related work.
    Method | Search space | Memory req. | Cost for $\nabla h$
    NoTears [ZARX18] | $\mathbb{R}^{d\times d}$ | $O(d^{2})$ | $O(d^{3})$
    NoTears-low-rank [FZZ+20] | $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ | $O(dr)$ | $O(d^{3})$
    LoRAM (ours) | $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ | $O(dr)$ | $O(d^{2}r)$

2 Notation and formal background

A graph on $d$ nodes is a pair $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $|\mathcal{V}|=d$ and $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$. By default, a directed graph is simply referred to as a graph. The adjacency matrix of a graph $\mathcal{G}$, denoted $\mathbb{A}(\mathcal{G})$, is the matrix such that $[\mathbb{A}(\mathcal{G})]_{ij}=1$ if $(i,j)\in\mathcal{E}$ and $0$ otherwise. Let $A\in\mathbb{R}^{d\times d}$ be any weighted adjacency matrix of a graph $\mathcal{G}$ on $d$ nodes; then, by definition, the adjacency matrix $\mathbb{A}(\mathcal{G})$ indicates the nonzeros of $A$; it is also the support of $A$, denoted $\operatorname{supp}(A)$, i.e., $[\operatorname{supp}(A)]_{ij}=1$ if $A_{ij}\neq 0$ and $0$ otherwise. The number of nonzeros of $A$ is denoted $\|A\|_{0}$. The matrix $A$ is called a DAG matrix if $\operatorname{supp}(A)$ is the adjacency matrix of a directed acyclic graph (DAG). By convention, we define the set of DAG matrices as $\mathcal{D}_{d\times d}=\{A\in\mathbb{R}^{d\times d}:\operatorname{supp}(A)\text{ defines a DAG}\}$.

We recall the following theorem, which characterizes acyclic graphs via the matrix exponential trace $\operatorname{tr}(\exp(\cdot))$, where $\exp(\cdot)$ denotes the matrix exponential function. The matrix exponential will be denoted $e^{\cdot}$ and $\exp(\cdot)$ indifferently. The operator $\odot$ denotes the matrix Hadamard product, which acts on two matrices of the same size by elementwise multiplication.

Theorem 1 ([ZARX18]).

A matrix $A\in\mathbb{R}^{d\times d}$ is a DAG matrix if and only if

$\operatorname{tr}(\exp(A\odot A))=d.$
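
For illustration, this characterization can be checked numerically with a dense matrix exponential (the $O(d^{3})$ route that the rest of the paper seeks to avoid); a minimal Python sketch, assuming numpy and scipy are available:

    import numpy as np
    from scipy.linalg import expm

    def is_dag_matrix(A, tol=1e-8):
        """Test whether supp(A) is acyclic via h(A) = tr(exp(A ⊙ A)) = d (Theorem 1)."""
        d = A.shape[0]
        h = np.trace(expm(A * A))  # elementwise product A * A is the Hadamard product
        return h - d <= tol

    # a 3-node chain (acyclic) versus a 2-cycle
    A_chain = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
    A_cycle = np.array([[0., 1.], [1., 0.]])
    print(is_dag_matrix(A_chain), is_dag_matrix(A_cycle))  # True False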

The following corollary is a straightforward extension of the theorem above:

Corollary 2.

Let $\sigma:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d\times d}$ be an operator such that, for any $A\in\mathbb{R}^{d\times d}$: (i) $\sigma(A)\geq 0$ and (ii) $\operatorname{supp}(A)=\operatorname{supp}(\sigma(A))$. Then $A\in\mathbb{R}^{d\times d}$ is a DAG matrix if and only if $\operatorname{tr}(\exp(\sigma(A)))=d$.

In view of the property above, we refer to the composition of $\operatorname{tr}(\exp(\cdot))$ with the operator $\sigma$ as a DAG characteristic function. Next, we give some further properties (proof in Appendix A) of the exponential trace.

Proposition 3.

The exponential trace $\tilde{h}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}:A\to\operatorname{tr}(\exp(A))$ satisfies: (i) For all $\bar{A}\in\mathbb{R}^{d\times d}_{+}$, $\operatorname{tr}(\exp(\bar{A}))\geq d$, and $\operatorname{tr}(\exp(\bar{A}))=d$ if and only if $\bar{A}$ is a DAG matrix. (ii) $\tilde{h}$ is nonconvex on $\mathbb{R}^{d\times d}$. (iii) The Fréchet derivative of $\tilde{h}$ at $A\in\mathbb{R}^{d\times d}$ along any direction $\xi\in\mathbb{R}^{d\times d}$ is

$\mathrm{D}\tilde{h}(A)[\xi]=\operatorname{tr}(\exp(A)\xi),$

and the gradient of $\tilde{h}$ at $A$ is $\nabla\tilde{h}(A)={(\exp(A))}^{\mathrm{T}}$.
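
As a sanity check of item (iii), the gradient formula can be compared against a finite-difference quotient; a small Python sketch (assuming numpy and scipy):

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    xi = rng.standard_normal((d, d))      # arbitrary direction

    grad = expm(A).T                      # claimed gradient of tr(exp(.)) at A
    t = 1e-6
    fd = (np.trace(expm(A + t * xi)) - np.trace(expm(A - t * xi))) / (2 * t)
    print(fd, np.sum(grad * xi))          # the two values nearly coincide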

3 LoRAM: a low-complexity model

In this section, we describe a low-complexity matrix representation of the adjacency matrices of directed graphs, and then a generalized DAG characteristic function for the new matrix model.

In the spirit of best low-rank approximations (e.g., truncated singular value decompositions) and taking inspiration from [FZZ+20], the search for a full $d\times d$ (DAG) matrix $A$ is replaced by the search for a pair of thin factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ with $1\leq r<d$, the $d\times d$ candidate graph matrix being represented by the product $X{Y}^{\mathrm{T}}$. This matrix product has rank bounded by $r$, with $2dr\leq d^{2}$ parameters. However, the low-rank representation $(X,Y)\to X{Y}^{\mathrm{T}}$ generally gives a dense $d\times d$ matrix. Since in many scenarios the sought graph (or Bayesian network) is sparse, we apply a sparsification operator to $X{Y}^{\mathrm{T}}$ in order to trim superfluous entries. Accordingly, we combine the two operations and introduce the following model.

Definition 4 (LoRAM).

Let $\Omega\subset[d]\times[d]$ be a given index set. The low-rank additive model (LoRAM), noted $A_{\Omega}$, is defined from the matrix product of $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, sparsified according to $\Omega$:

$A_{\Omega}(X,Y)=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}),$   (2)

where $\mathcal{P}_{\Omega}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d\times d}$ is the mask operator such that $[\mathcal{P}_{\Omega}(A)]_{ij}=A_{ij}$ if $(i,j)\in\Omega$ and $0$ otherwise. The set $\Omega$ is referred to as the candidate set of LoRAM.

The candidate set $\Omega$ is to be fixed according to the specific problem. In the case of projection from a given graph onto the set of DAGs, $\Omega$ can be fixed as the index set of the given graph's edges.
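
A sparse construction of the LoRAM matrix (2) can be sketched in Python as follows, assuming $\Omega$ is stored as arrays of row and column indices (the helper name loram_matrix is illustrative, not taken from the authors' code); only the $|\Omega|$ needed entries of $X{Y}^{\mathrm{T}}$ are formed, at a $2|\Omega|r$ cost:

    import numpy as np
    import scipy.sparse as sp

    def loram_matrix(X, Y, rows, cols, d):
        """Return P_Omega(X @ Y.T) as a d x d sparse matrix (Definition 4)."""
        vals = np.einsum('ij,ij->i', X[rows], Y[cols])  # (X Y^T)_{ij} for (i, j) in Omega
        return sp.csr_matrix((vals, (rows, cols)), shape=(d, d))

    d, r = 6, 2
    rng = np.random.default_rng(1)
    X, Y = rng.standard_normal((d, r)), rng.standard_normal((d, r))
    rows, cols = np.array([0, 0, 2, 4]), np.array([1, 3, 5, 2])  # a small candidate set
    print(loram_matrix(X, Y, rows, cols, d).toarray())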

The DAG characteristic function on the LoRAM search space is defined as follows:

Definition 5.

Let $\tilde{h}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}:A\to\operatorname{tr}(\exp(A))$ denote the exponential trace function. We define $h:\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}\mapsto\mathbb{R}$ by

$h(X,Y)=\operatorname{tr}(\exp(\sigma(A_{\Omega}(X,Y)))),$   (3)

where $\sigma:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ is one of the following elementwise operators:

$\sigma_{2}(Z):=Z\odot Z \quad\text{and}\quad \sigma_{\mathrm{abs}}(Z):=\sum_{i,j=1}^{d}|Z_{ij}|\,e_{i}{e_{j}}^{\mathrm{T}}.$   (4)

Note that the operators $\sigma_{2}$ and $\sigma_{\mathrm{abs}}$ in (4) are two natural choices that meet conditions (i)–(ii) of Corollary 2, since they both produce a nonnegative surrogate of the $d\times d$ matrix $A_{\Omega}$ while preserving its support.

3.1 Representativity

In the construction of a LoRAM matrix (2), the low-rank component of the model—$X{Y}^{\mathrm{T}}$ with $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$—has rank smaller than or equal to $r$ (with equality when $X$ and $Y$ have full column rank), and the subsequent sparsification operator $\mathcal{P}_{\Omega}$ generally changes the rank of the final matrix model $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ (2). Indeed, the rank of $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ depends on an interplay between $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ and the discrete set $\Omega\subset[d]\times[d]$. The following examples illustrate the two extreme cases of this interplay:

  • (i)

    The first extreme case: let $\Omega$ be the index set of the edges of a sparse graph $\mathcal{G}_{\Omega}$, and let $(X,Y)\in\mathbb{R}^{d\times 1}\times\mathbb{R}^{d\times 1}$ be the pair of all-ones matrices, for $r=1$; then the LoRAM matrix $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})=\mathbb{A}(\mathcal{G}_{\Omega})$, i.e., the adjacency matrix of $\mathcal{G}_{\Omega}$. Hence $\operatorname{rank}(\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}))=\operatorname{rank}(\mathbb{A}(\mathcal{G}_{\Omega}))$, which depends solely on $\Omega$ and is generally much larger than $r=1$.

  • (ii)

    The second extreme case: let $\Omega$ be the full $[d]\times[d]$ index set; then $\mathcal{P}_{\Omega}$ reduces to the identity map, so that $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})=X{Y}^{\mathrm{T}}$ and $\operatorname{rank}(\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}))=\operatorname{rank}(X{Y}^{\mathrm{T}})\leq r$ for any $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$.

In the first extreme case above, optimizing LoRAM (2) for DAG learning boils down to choosing the most relevant edge set $\Omega$, which is an NP-hard combinatorial problem [Chi96]. In the second extreme case, the optimization of LoRAM (2) reduces to learning the most pertinent low-rank matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, which coincides with optimizing the NoTears-low-rank model [FZZ+20].

In this work, we are interested in settings between the two extreme cases above, in which both $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ and the candidate set $\Omega$ have sufficient degrees of freedom. Consequently, the representativity of LoRAM depends on both the rank parameter $r$ and $\Omega$.

Next, we present a way of quantifying the representativity of LoRAM with respect to a subset $\mathcal{D}_{d\times d}^{\star}$ of DAG matrices. The restriction to a subset $\mathcal{D}_{d\times d}^{\star}$ is motivated by the observation that certain types of DAG matrices—such as those with many hubs—can be represented by low-rank matrices [FZZ+20].

Definition 6.

Let $\mathcal{D}_{d\times d}^{\star}\subset\mathcal{D}_{d\times d}$ be a given set of nonzero DAG matrices. For $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$, let $A_{\Omega}^{*}(Z_{0})$ denote any LoRAM matrix (2) such that $\|A_{\Omega}^{*}(Z_{0})-Z_{0}\|=\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\|A_{\Omega}(X,Y)-Z_{0}\|$; then we define the relative error of LoRAM w.r.t. $\mathcal{D}_{d\times d}^{\star}$ as

$\epsilon^{\star}_{r,\Omega}=\max_{Z\in\mathcal{D}_{d\times d}^{\star}}\Big\{\frac{\|A_{\Omega}^{*}(Z)-Z\|_{\max}}{\|Z\|_{\max}}\Big\},$

where $\|Z\|_{\max}:=\max_{ij}|Z_{ij}|$ denotes the matrix max-norm. For $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$, $A_{\Omega}^{*}(Z_{0})$ is referred to as an $\epsilon^{\star}_{r,\Omega}$-quasi DAG matrix.

Note that the existence of $A_{\Omega}^{*}(Z_{0})$ for any $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$ is guaranteed by the closedness of the image set of LoRAM (2).

Based on the relative error above, the relevance of the DAG characteristic function is established by the following proposition (proof in Appendix A):

Proposition 7.

Let $\mathcal{D}_{d\times d}^{\star}\subset\mathcal{D}_{d\times d}$ be a given set of nonzero DAG matrices. For any $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$ such that $\|Z_{0}\|_{\max}\leq 1$ (without loss of generality), the minima of

$\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\|A_{\Omega}(X,Y)-Z_{0}\|$

belong to the set

$\{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}:h(X,Y)-d\leq C_{0}\epsilon^{\star}_{r,\Omega}\},$   (5)

where $\epsilon^{\star}_{r,\Omega}$ is given in Definition 6 and $C_{0}=\big(C_{1}\|Z_{0}\|_{0}+\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)\|Z_{0}\|_{\max}$ for a constant $C_{1}\geq 0$.

Remark 8.

The constant $C_{0}$ in Proposition 7 can be seen as a measure of the total capacity of passing from one node to other nodes, and therefore depends on $d$, $\|Z_{0}\|_{\max}$ (bounded by $1$), the sparsity and the average degree of the graph of $Z_{0}$. For DAG matrices with sparsity $\rho_{0}\sim 10^{-3}$ and $d\lesssim 10^{3}$, one can expect that $C_{0}\leq Cd$ for a constant $C$. $\square$

Proposition 7 establishes that, under the stated conditions, a given DAG matrix $Z_{0}$ admits low-rank approximations $A_{\Omega}(X,Y)$ satisfying $h(X,Y)-d\leq\delta_{\epsilon}$ for a small enough parameter $\delta_{\epsilon}$. In other words, the low-rank projection with a relaxed DAG constraint admits solutions.

The general case of projecting a non-acyclic graph matrix $Z_{0}$ onto a low-rank matrix under a relaxed DAG constraint is considered in the next section.

4 Scalable projection from graphs to DAGs

Given a (generally non-acyclic) graph matrix $Z_{0}\in\mathbb{R}^{d\times d}$, let us consider the projection of $Z_{0}$ onto the feasible set (5):

$\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\ \frac{1}{2}\|A_{\Omega}(X,Y)-Z_{0}\|_{\mathrm{F}}^{2}\quad\text{subject to } h(X,Y)-d\leq\delta_{\epsilon},$   (6)

where $A_{\Omega}(X,Y)$ is the LoRAM matrix (2), $h$ is given by Definition 5, and $\delta_{\epsilon}>0$ is a tolerance parameter. Based on Proposition 7 and given the objective function of (6), the solution to (6) provides a quasi DAG matrix closest to $Z_{0}$ and thus enables finding a projection of $Z_{0}$ onto the DAG matrix space $\mathcal{D}_{d\times d}$. More precisely, we tackle problem (6) using the penalty method and focus on the primal problem, for a given penalty parameter $\lambda>0$, followed by elementwise hard thresholding:

$(X^{*},Y^{*})=\operatorname*{arg\,min}_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}} h(X,Y)+\frac{1}{2\lambda}\|A_{\Omega}(X,Y)-Z_{0}\|_{\mathrm{F}}^{2},$   (7)

$A^{*}=\mathbb{T}_{\epsilon^{\star}_{r,\Omega}}(A_{\Omega}(X^{*},Y^{*})),$   (8)

where $\mathbb{T}_{\epsilon^{\star}_{r,\Omega}}(z)=z\,\delta_{|z|\geq\epsilon^{\star}_{r,\Omega}}$ is the elementwise hard thresholding operator. The choice of $\lambda$ and $\epsilon^{\star}_{r,\Omega}$ is discussed in Section C.1 in the context of problem (6).
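
A dense Python sketch of the penalized objective (7) and the thresholding step (8), assuming $\sigma=\sigma_{2}$ and a 0/1 mask encoding $\mathcal{P}_{\Omega}$ (the scalable variant of Section 4.1 replaces the exact $\nabla h$ with Algorithm 1):

    import numpy as np
    from scipy.linalg import expm

    def penalized_objective(X, Y, Z0, mask, lam):
        """F(X, Y) = tr(exp(sigma_2(A_Omega))) + ||A_Omega - Z0||_F^2 / (2 lam)."""
        A = mask * (X @ Y.T)              # A_Omega = P_Omega(X Y^T)
        return np.trace(expm(A * A)) + np.sum((A - Z0) ** 2) / (2.0 * lam)

    def hard_threshold(A, eps):
        """Elementwise operator T_eps of (8): zero out entries with |A_ij| < eps."""
        A = A.copy()
        A[np.abs(A) < eps] = 0.0
        return A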

Remark 9.

To obtain a DAG matrix closest to the (non-acyclic) $Z_{0}$, it is necessary to break the cycles in $\mathcal{G}(Z_{0})$ by suppressing certain edges. Hence, we assume that the edge set of the sought DAG is a strict subset of $\operatorname{supp}(Z_{0})$ and thus fix the candidate set $\Omega$ to be $\operatorname{supp}(Z_{0})$. $\square$

Problems (6) and (7) are nonconvex because: (i) the matrix exponential trace $A\to\operatorname{tr}(\exp(A))$ in $h$ (3) is nonconvex (as in [ZARX18]), and (ii) the matrix product $(X,Y)\to X{Y}^{\mathrm{T}}$ in $A_{\Omega}(X,Y)$ (2) is nonconvex. In view of the DAG characteristic constraint of (6), the augmented Lagrangian algorithms of NoTears [ZARX18] and NoTears-low-rank [FZZ+20] can be applied to the same objective as problem (6); it suffices for NoTears and NoTears-low-rank to replace the LoRAM matrix $A_{\Omega}(X,Y)$ in (6) by the $d\times d$ matrix variable and the dense matrix product $X{Y}^{\mathrm{T}}$, respectively.

However, the NoTears-based methods of [ZARX18, FZZ+20] have an $O(d^{3})$ complexity due to the composition of the elementwise operations (the Hadamard product $\odot$) with the matrix exponential in the $h$ function (1) and (3). We elaborate on this argument in the next subsection and then propose a new computational method for computations involving the gradient of the DAG characteristic function $h$.

4.1 Gradient of the DAG characteristic function and an efficient approximation

Lemma 10.

For any $Z\in\mathbb{R}^{d\times d}$ and $\xi\in\mathbb{R}^{d\times d}$, the differentials of $\sigma_{2}$ and $\sigma_{\mathrm{abs}}$ (4) are

$\mathrm{D}\sigma_{2}(Z)[\xi]=2Z\odot\xi \quad\text{and}\quad \hat{\mathrm{D}}\sigma_{\mathrm{abs}}(Z)[\xi]=\mathrm{sign}(Z)\odot\xi,$   (4b)

where $\mathrm{sign}(\cdot)$ is the elementwise sign function such that $[\mathrm{sign}(Z)]_{ij}=\frac{Z_{ij}}{|Z_{ij}|}$ if $Z_{ij}\neq 0$ and $0$ otherwise.

Theorem 11.

The gradient of $h$ (3) is

$\nabla h(X,Y)=(\mathcal{S}Y,\ {\mathcal{S}}^{\mathrm{T}}X)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r},$   (9)

where $\mathcal{S}\in\mathbb{R}^{d\times d}$ has the following expressions, depending on the choice of $\sigma$ in (4) (writing $A_{\Omega}:=A_{\Omega}(X,Y)$ for brevity):

$\mathcal{S}_{2}=2\,({\exp(\sigma_{2}(A_{\Omega}))}^{\mathrm{T}})\odot A_{\Omega},$   (10)

$\mathcal{S}_{\mathrm{abs}}=({\exp(\sigma_{\mathrm{abs}}(A_{\Omega}))}^{\mathrm{T}})\odot\mathrm{sign}(A_{\Omega}).$   (11)
Proof.

From Proposition 3-(iii), the Fréchet derivative of the exponential trace $\tilde{h}$ at $A\in\mathbb{R}^{d\times d}$ is $\mathrm{D}\tilde{h}(A)[\xi]=\operatorname{tr}(\exp(A)\xi)$ for any $\xi\in\mathbb{R}^{d\times d}$. By the chain rule and Lemma 10, the Fréchet derivative of $h$ (3) for $\sigma=\sigma_{2}$ reads, with $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$,

$\mathrm{D}_{X}h(X,Y)[\xi]=\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,\mathrm{D}\sigma(A_{\Omega})\big[\mathrm{D}\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})[\xi{Y}^{\mathrm{T}}]\big]\big)$
$=\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,\mathrm{D}\sigma(A_{\Omega})[\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}})]\big)$   (12)
$=2\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,(A_{\Omega}\odot\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}}))\big)$
$=2\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,(A_{\Omega}\odot(\xi{Y}^{\mathrm{T}}))\big)$   (13)
$=2\operatorname{tr}\big((\exp(\sigma(A_{\Omega}))\odot{A_{\Omega}}^{\mathrm{T}})(\xi{Y}^{\mathrm{T}})\big)$   (14)
$=2\operatorname{tr}\big({Y}^{\mathrm{T}}(\exp(\sigma(A_{\Omega}))\odot{A_{\Omega}}^{\mathrm{T}})\xi\big),$

where (13) holds, i.e., $A_{\Omega}\odot\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}})=A_{\Omega}\odot\xi{Y}^{\mathrm{T}}$, because $A\odot\mathcal{P}_{\Omega}(B)=\mathcal{P}_{\Omega}(A)\odot B$ (for any $A$ and $B$) and $\mathcal{P}_{\Omega}^{2}=\mathcal{P}_{\Omega}$, and (14) holds because $\operatorname{tr}(A(B\odot C))=\operatorname{tr}((A\odot{B}^{\mathrm{T}})C)$ for any $A,B,C$ (of compatible sizes). By identifying the last expression with $\operatorname{tr}({\nabla_{X}h(X,Y)}^{\mathrm{T}}\xi)$, we obtain $\nabla_{X}h(X,Y)=\underbrace{2\big({\exp(\sigma(A_{\Omega}))}^{\mathrm{T}}\odot A_{\Omega}\big)}_{\mathcal{S}}\,Y$, hence the expression (10) for $\mathcal{S}$. The expression of $\mathcal{S}_{\mathrm{abs}}$ for $\sigma=\sigma_{\mathrm{abs}}$ is obtained by the same calculations with (4b). ∎

The computational bottleneck in the exact computation of the gradient (9) lies in forming the $d\times d$ matrix $\mathcal{S}$ (10)–(11), and is due to the interplay between the Hadamard product and matrix multiplication; see Section B.1 for details. Nevertheless, we note that the multiplication $(\mathcal{S},X)\to\mathcal{S}X$ is similar to the action of a matrix exponential, of the form $(A,X)\to\exp(A)X$, which can be computed using only repeated multiplications of a $d\times d$ matrix with the thin matrix $X\in\mathbb{R}^{d\times r}$, based on Al-Mohy and Higham's results [AMH11].

The difficulty in adapting the method of [AMH11] also lies in the presence of the Hadamard product in $\mathcal{S}$ (10)–(11). Once the sparse $d\times d$ matrix $A:=\sigma(A_{\Omega})$ in (10)–(11) is obtained (using Algorithm 3, Section B.2), the exact computation of $(A,C,B)\to(\exp(A)\odot C)B$, using the Taylor expansion of $\exp(\cdot)$ to a certain order $m_{*}$, requires computing $\frac{1}{k!}(A^{k}\odot C)B$ at each iteration; this entails forming the $d\times d$ matrix product $A^{k}$ (as $A(A^{k-1})$) before the Hadamard product, at an $O(d^{3})$ cost.

To alleviate this obstacle, we propose to use inexact incremental multiplications (inexact because $((A\odot C)^{k+1})B\neq(A^{k+1}\odot C)B$); see Algorithm 1.

Algorithm 1 Approximation of $(A,C,B)\to(\exp(A)\odot C)B$
Input: $d\times d$ matrices $A$ and $C$, thin matrix $B\in\mathbb{R}^{d\times r}$, tolerance $\mathrm{tol}>0$
Output: $F\approx(\exp(A)\odot C)B\in\mathbb{R}^{d\times r}$
1:  Estimate the Taylor order parameter $m_{*}$ from $A$ # using [AMH11, Algorithm 3.2]
2:  Initialize: let $F=(I\odot C)B$
3:  for $k=1,\dots,m_{*}$ do
4:     $B\leftarrow\frac{1}{k+1}(A\odot C)B$
5:     $F\leftarrow F+B$
6:     $B\leftarrow F$
7:  end for
8:  Return $F$.

In line 1 of Algorithm 1, the value of $m_{*}$ is obtained from the numerical analysis results of [AMH11, Algorithm 3.2]; in practice, the value of $m_{*}$, which depends on the spectrum of $A$, is a bounded constant (independent of the matrix size $d$). Therefore, the dominant cost of Algorithm 1 is $2m_{*}|\Omega|r\lesssim d^{2}r$, since each iteration (lines 4–6) costs $(2|\Omega|r+dr)\approx 2|\Omega|r$ flops. Table 1 summarizes this computational property in comparison with existing methods.
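
A direct Python transcription of Algorithm 1 with sparse matrices may look as follows, where the Taylor order $m_{*}$ is passed in as an argument (in the paper it is estimated as in [AMH11, Algorithm 3.2]); the function name is illustrative:

    import scipy.sparse as sp

    def approx_expm_hadamard_action(A, C, B, m_star):
        """Approximate (exp(A) ⊙ C) B by inexact incremental multiplications (Algorithm 1)."""
        AC = A.multiply(C)                      # Hadamard product, stays sparse on Omega
        F = C.diagonal()[:, None] * B           # line 2: F = (I ⊙ C) B
        for k in range(1, m_star + 1):          # lines 3-7
            B = AC.dot(B) / (k + 1.0)           # line 4
            F = F + B                           # line 5
            B = F                               # line 6
        return F                                # line 8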

Reliability of Algorithm 1.

The accuracy of Algorithm 1 with respect to the exact computation of $(A,C,B)\to(\exp(A)\odot C)B$ depends notably on the scale of $A$, since the differential $\mathrm{D}\exp(A)$ at $A$ has a larger operator norm when the norm of $A$ is larger; see Proposition 13 in Appendix A.

To illustrate the remark above, we approximate $\nabla h(X,Y)$ (9) by Algorithm 1 at random points of $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ with different scales, with $\Omega$ defined from the edges of a random sparse graph; the results are shown in Figure 1.

Figure 1: Average accuracy of Algorithm 1 in approximating $\nabla h(X,Y)$ (9) at points $(X,Y)$ with different scales and sparsity levels $\rho=\frac{|\Omega|}{d^{2}}$; number of nodes $d=200$, $r=40$. Panels: (a) relative error; (b) cosine similarity.

We observe from Figure 1 that Algorithm 1 is reliable, i.e., it produces gradient approximations with cosine similarity close to $1$, when the norm of $A_{\Omega}(X,Y)$ is sufficiently bounded. More precisely, for $c_{0}=10^{-1}$, Algorithm 1 is reliable on the set

$\mathcal{D}(c_{0})=\{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}:\|A_{\Omega}(X,Y)\|_{\mathrm{F}}\leq c_{0}\}.$   (15)

The degrading accuracy of Algorithm 1 outside $\mathcal{D}(c_{0})$ (15) can nonetheless be avoided for the projection problem (6), in particular through rescaling of the input matrix $Z_{0}$. Note that the edge set of any graph is invariant to the scaling of its weighted adjacency matrix, and that any DAG $A^{*}$ solution to (6) satisfies $\|A^{*}\|_{\mathrm{F}}\lesssim\|Z_{0}\|_{\mathrm{F}}$ since $\operatorname{supp}(A^{*})\subset\operatorname{supp}(Z_{0})$ (see Remark 9). Hence it suffices to rescale $Z_{0}$ by a small enough scalar, e.g., to replace $Z_{0}$ with $Z_{0}^{\prime}=\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}Z_{0}$, without loss of generality, in (6)–(7). Indeed, this rescaling ensures that both the input matrix $Z_{0}^{\prime}$ and matrices like ${A^{*}}^{\prime}=\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}A^{*}$—a DAG matrix equivalent to $A^{*}$—stay confined in the image (through LoRAM) of $\mathcal{D}(c_{0})$ (15), on which the gradient approximations by Algorithm 1 are reliable.

4.2 Accelerated gradient descent

Given the gradient computation method (Algorithm 1) for the $h$ function, we adapt Nesterov's accelerated gradient descent [Nes83, Nes04] to solve (7). Accelerated gradient descent is used for its superior performance over vanilla gradient descent in many convex and nonconvex problems, while it also only requires first-order information about the objective function. Details of this algorithm for our LoRAM optimization are given in Algorithm 2.

Algorithm 2 Accelerated Gradient Descent of LoRAM (LoRAM-AGD)
Input: Initial point $x_{0}=(X_{0},Y_{0})\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, objective function $F=f+h$ with $h$ defined in (3), tolerance $\epsilon$.
Output: $x_{t}\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$
1:  Make a gradient descent step: $x_{1}=x_{0}-s_{0}\nabla F(x_{0})$ for an initial stepsize $s_{0}>0$
2:  Initialize: $y_{0}=x_{0}$, $y_{1}=x_{1}$, $t=1$.
3:  while $\|\nabla F(x_{t})\|>\epsilon$ do
4:     Compute $\nabla F(y_{t})=\nabla f(y_{t})+\nabla h(y_{t})$ # using Algorithm 1 for $\nabla h(y_{t})$ (9)
5:     Compute the Barzilai–Borwein stepsize: $s_{t}=\frac{\|z_{t-1}\|^{2}}{\langle z_{t-1},w_{t-1}\rangle}$, where $z_{t-1}=y_{t}-y_{t-1}$ and $w_{t-1}=\nabla F(y_{t})-\nabla F(y_{t-1})$.
6:     Update with Nesterov's acceleration:
       $x_{t+1}=y_{t}-s_{t}\nabla F(y_{t}),$   (16)
       $y_{t+1}=x_{t+1}+\frac{t}{t+3}(x_{t+1}-x_{t}).$
7:     $t=t+1$
8:  end while

Specifically, in line 5 of Algorithm 2, the Barzilai–Borwein (BB) stepsize [BB88] is used for the descent step (16). The computation of the BB stepsize $s_{t}$ requires evaluating the inner product $\langle z_{t-1},w_{t-1}\rangle$ and the norm $\|z_{t-1}\|$, where $z_{t-1},w_{t-1}\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$; we choose the Euclidean inner product as the metric on $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$:

$\langle z,w\rangle=\operatorname{tr}({z^{(1)}}^{\mathrm{T}}w^{(1)})+\operatorname{tr}({z^{(2)}}^{\mathrm{T}}w^{(2)})$

for any pair of points $z=(z^{(1)},z^{(2)})$ and $w=(w^{(1)},w^{(2)})$ in $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$. Note that one can always use backtracking line search based on the stepsize estimate (line 5). We choose to use the BB stepsize directly since it does not require any evaluation of the objective function, and thus avoids the nonnegligible cost of computing the matrix exponential trace in $h$ (3). We refer to [CDHS18] for a comprehensive view of accelerated methods for nonconvex optimization.
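
The following Python sketch outlines the main loop of Algorithm 2 over pairs $(X,Y)$, assuming a user-supplied grad_F that returns the gradient pair at a given point pair (with the $\nabla h$ part computed by Algorithm 1); it is a simplified illustration, e.g., the stopping test uses the gradient at $y_{t}$:

    import numpy as np

    def inner(z, w):
        """Euclidean metric on R^{d x r} x R^{d x r}."""
        return np.sum(z[0] * w[0]) + np.sum(z[1] * w[1])

    def loram_agd(x0, grad_F, s0=1e-3, tol=1e-6, max_iter=500):
        g = grad_F(x0)
        x = (x0[0] - s0 * g[0], x0[1] - s0 * g[1])     # line 1: initial gradient step
        y_prev, y, g_prev = x0, x, g                   # y0 = x0, y1 = x1
        for t in range(1, max_iter + 1):
            g = grad_F(y)
            if np.sqrt(inner(g, g)) < tol:
                break
            z = (y[0] - y_prev[0], y[1] - y_prev[1])
            w = (g[0] - g_prev[0], g[1] - g_prev[1])
            s = inner(z, z) / inner(z, w)              # Barzilai-Borwein stepsize (line 5)
            x_new = (y[0] - s * g[0], y[1] - s * g[1]) # descent step (16)
            y_new = (x_new[0] + t / (t + 3.0) * (x_new[0] - x[0]),
                     x_new[1] + t / (t + 3.0) * (x_new[1] - x[1]))
            x, y_prev, y, g_prev = x_new, y, y_new, g
        return x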

Due to the nonconvexity of $h$ (3) (see Proposition 3), and thus of (7), we aim at finding stationary points of (7). In particular, empirical results in Section 5.3 show that the solutions found by the proposed method, with close-to-zero or even zero SHD to the true DAGs, are close to global optima in practice.

5 Experimental validation

This section investigates the performance (computational gains and accuracy loss) of the proposed gradient approximation method (Algorithm 1), and thereafter reports on the performance of the LoRAM projection (6) compared to NoTears [ZARX18]. The sensitivity of the proposed method to the rank parameter $r$ is also investigated.

The implementation is available at https://github.com/shuyu-d/loram-exp.

5.1 Gradient computations

We compare the performance of Algorithm 1 for gradient approximation with the exact computation in the following settings: number of nodes $d\in\{100,\ 500,\ 10^{3},\ 2\cdot 10^{3},\ 3\cdot 10^{3},\ 5\cdot 10^{3}\}$, $r=40$, and sparsity ($\frac{|\Omega|}{d^{2}}$) of the index set $\Omega$ in $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2},\ 5\cdot 10^{-2}\}$. The results shown in Figure 2 are based on the computation of $\nabla h(X,Y)$ (9) at randomly generated points $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, where $X$ and $Y$ are Gaussian matrices.

Figure 2: (a) Runtime (in log scale) for computing $\nabla h(X,Y)$. (b) Cosine similarities between the approximate and the exact gradients.

From Figure 2 (a), Algorithm 1 yields a significant reduction in the runtime for computing the gradient of $h$ (3) at the expense of a very moderate loss of accuracy (Figure 2 (b)): the approximate gradients are mostly sufficiently aligned with the exact gradients in the considered range of graph sizes and sparsity levels.

5.2 Sensitivity to the rank parameter $r$

In this experiment, we generate the input matrix $Z_{0}$ of problem (6) as

$Z_{0}=A^{\star}+E,$   (17)

where $A^{\star}$ is a given $d\times d$ DAG matrix and $E$ is the matrix of additive noisy edges that break the acyclicity of the observed graph $Z_{0}$.

The ground-truth DAG matrix $A^{\star}$ is generated from the acyclic Erdős-Rényi (ER) model (in the same way as in [ZARX18]), with a sparsity rate $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$. The noise graph $E$ in (17) is defined as $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$, which consists of edges that create confusion between causes and effects, since these edges are reversed, pointing from the ground-truth effects to their respective causes. We evaluate the performance of LoRAM-AGD on the proximal mapping computation (7) for $d=500$ and different values of the rank parameter $r$. In all these evaluations, the candidate set $\Omega$ is fixed to $\operatorname{supp}(Z_{0})$; see Remark 9.
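
A Python sketch of this synthetic setting (17): an acyclic Erdős-Rényi ground truth $A^{\star}$ plus reversed, anti-causal edges $\sigma_{E}{A^{\star}}^{\mathrm{T}}$. The generator below is illustrative and not necessarily identical to the authors' script:

    import numpy as np

    def make_er_dag(d, rho, rng):
        """Random weighted DAG matrix: ER edges kept strictly below the diagonal of a random node order."""
        W = (rng.random((d, d)) < rho) * rng.uniform(0.5, 2.0, size=(d, d))
        W = np.tril(W, k=-1)                    # strictly lower triangular => acyclic
        perm = rng.permutation(d)
        return W[np.ix_(perm, perm)]            # hide the topological order

    rng = np.random.default_rng(0)
    A_star = make_er_dag(d=500, rho=5e-3, rng=rng)
    Z0 = A_star + 0.4 * A_star.T                # equation (17) with sigma_E = 0.4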

We measure the accuracy of the projection result by the false discovery rate (FDR, lower is better), false positive rate (FPR), true positive rate (TPR, higher is better), and structural Hamming distance (SHD, lower is better) of the solution compared to the DAG $\mathcal{G}(A^{\star})$.

Figure 3: Rank profiles and projection accuracy for different values of $r$ (the number of columns of $X$ and $Y$ in (2)). Number of nodes $d=500$. The sparsity of the input non-DAG matrix $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$ (17) is $\rho^{\star}\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$.

The results in Figure 3 suggest that:

  • (i)

    For each sparsity level, increasing the rank parameter $r$ generally improves the projection accuracy of the LoRAM.

  • (ii)

    While the rank parameter $r$ of (2) is at most around 5, which is only about $\frac{1}{100}$-th of the rank of the input matrix $Z_{0}$ and of the ground truth $A^{\star}$, the rank of the solution $A_{\Omega}^{*}=A_{\Omega}(X^{*},Y^{*})$ attains the same value as $\operatorname{rank}(A^{\star})$. This means that the rank representativity of the LoRAM goes beyond the value of $r$. This phenomenon is understandable in the present case, where the candidate set $\Omega=\operatorname{supp}(Z_{0})$ is fairly close to the sparse edge set $\operatorname{supp}(A^{\star})$.

  • (iii)

    The projection accuracy in TPR and FDR (and also SHD, see Figure 5 in Section C.2) of LoRAM-AGD is close to optimal on a wide interval $25\leq r\leq 50$ of the tested ranks, and is fairly stable on this interval.

5.3 Scalability

We examine two different types of noisy edges (in $E$). Case (a): Bernoulli-Gaussian, $E=E(\sigma_{E},p)$, where $E_{ij}(\sigma_{E},p)\neq 0$ with probability $p$ and all nonzeros of $E(\sigma_{E},p)$ are i.i.d. samples of $\mathcal{N}(0,\sigma_{E})$. Case (b): cause-effect confusions, $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$, as in Section 5.2.

The initial factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ are random Gaussian matrices. For the LoRAM, we set $\Omega$ to be the support of $Z_{0}$; see Remark 9. The penalty parameter $\lambda$ of (7) is varied in $\{2.0,5.0\}$ with no fine tuning.

In case (a), we test various noise levels for $d=500$ nodes. In case (b), we test various graph dimensions, with $(d,r)\in\{100,200,\dots,2000\}\times\{40,80\}$. The results are given in Table 2 and Table 3, respectively.

Table 2: Results in case (a): the noise graph is $E(\sigma_{E},p)$ for $p=5\cdot 10^{-4}$ and $d=500$. Each cell reports LoRAM (ours) / NoTears.
$\sigma_{E}$ | Runtime (sec) | TPR | FDR | SHD
0.1    1.34 / 5.78 1.0 / 1.0 9.9e-3 / 0.0e+0 25.0 / 0.0
0.2    2.65 / 11.58 1.0 / 1.0 9.5e-3 / 0.0e+0 24.0 / 0.0
0.3    1.35 / 28.93 1.0 / 1.0 9.5e-3 / 8.0e-4 24.0 / 2.0
0.4    1.35 / 18.03 1.0 / 1.0 9.9e-3 / 3.2e-3 25.0 / 9.0
0.5    1.35 / 12.52 1.0 / 1.0 9.9e-3 / 5.2e-3 25.0 / 13.0
0.6    2.57 / 16.07 1.0 / 1.0 9.5e-3 / 4.4e-3 24.0 / 11.0
0.7    1.35 / 18.72 1.0 / 1.0 9.9e-3 / 5.2e-3 25.0 / 13.0
0.8    1.35 / 32.03 1.0 / 1.0 9.9e-3 / 4.8e-3 25.0 / 15.0
Table 3: Results in case (b): the noise graph $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$ contains cause-effect confusions, for $\sigma_{E}=0.4$. Each cell reports LoRAM (ours) / NoTears.
$(\lambda,r)$ | $d$ | Runtime (sec) | TPR | FDR | SHD
(5.0, 40) 100     1.82 / 0.67 1.00 / 1.00 0.00e+0 / 0.0 0.0 / 0.0
(5.0, 40) 200     2.20 / 3.64 0.98 / 0.95 2.50e-2 / 0.0 1.0 / 2.0
(5.0, 40) 400     2.74 / 16.96 0.98 / 0.98 2.50e-2 / 0.0 4.0 / 4.0
(5.0, 40) 600     3.40 / 42.65 0.98 / 0.96 1.67e-2 / 0.0 6.0 / 16.0
(5.0, 40) 800     4.23 / 83.68 0.99 / 0.97 7.81e-3 / 0.0 5.0 / 22.0
(2.0, 80) 1000     7.63 / 136.94 1.00 / 0.96 0.00e+0 / 0.0 0.0 / 36.0
(2.0, 80) 1500     13.34 / 437.35 1.00 / 0.96 8.88e-4 / 0.0 2.0 / 94.0
(2.0, 80) 2000     20.32 / 906.94 1.00 / 0.96 7.49e-4 / 0.0 3.0 / 148.0
Figure 4: (a) An iteration history of LoRAM-AGD for (7) with $d=1000$. (b) Runtime (in log scale) comparisons for different numbers $d$ of nodes.

The results in Table 2 and Table 3 show that:

  • (i)

    In case (a), the solutions of LoRAM-AGD are close to the ground truth, despite slightly higher errors than NoTears in terms of FDR and SHD.

  • (ii)

    In case (b), the solutions of LoRAM-AGD are almost identical to the ground truth $A^{\star}$ in (17) for all performance indicators (also see Section C.2).

  • (iii)

    In terms of computation time (see Figure 4), the proposed LoRAM-AGD achieves significant speedups (around 50 times faster for $d=2000$) compared to NoTears, and also has a smaller growth rate with respect to the problem dimension $d$, showing good scalability.

6 Discussion and Perspectives

This paper tackles the projection of matrices onto DAG matrices, motivated by the identification of linear causal graphs. The line of research built upon the LiNGAM algorithms [SHHK06, SIS+11] has recently been revisited through the formal characterization of DAGness in terms of a continuously differentiable constraint by [ZARX18]. The NoTears approach of [ZARX18], however, suffers from an $O(d^{3})$ complexity in the number $d$ of variables, precluding its usage for large-scale problems.

Unfortunately, this difficulty is not related to the complexity of the model (its number of parameters): the low-rank approach investigated by NoTears-low-rank [FZZ+20] also suffers from the same $O(d^{3})$ complexity, incurred in the gradient-based optimization phase.

The present paper addresses this difficulty by combining a sparsification mechanism with the low-rank model and using a new approximate gradient computation. This approximate gradient takes inspiration from the approach of [AMH11] for computing the action of matrix exponentials based on truncated Taylor expansions. The approximation eventually yields a complexity of $O(d^{2}r)$, where the rank parameter is small ($r\leq C\ll d$). The experimental validation of the approach shows that the approximate gradient entails no significant error with respect to the exact gradient, for LoRAM matrices with a bounded norm, in the considered range of graph sizes ($d$) and sparsity levels. The proposed algorithm, combining the approximate gradient with Nesterov's acceleration method [Nes83, Nes04], yields gains of orders of magnitude in computation time compared to NoTears on the same artificial benchmark problems. The performance indicators reveal almost no loss for the projection problem in the setting of case (b) (where the matrix to be projected is perturbed with anti-causal links), while minor losses in terms of false discovery rate (FDR) are incurred in the setting of case (a) (with random additive spurious links).

Further developments aim to extend the approach and address the identification of causal DAG matrices from observational data. A longer term perspective is to extend LoRAM to the non-linear case, building upon the introduction of latent causal variables and taking inspiration from the non-linear independent component analysis and generalized contrastive losses [HST19]. Another perspective relies on the use of auto-encoders to yield a compressed representation of high-dimensional data, while constraining the structure of the encoder and decoder modules to enforce the acyclic property.

Acknowledgement

The authors warmly thank Fujitsu Laboratories Ltd., which funded the first author, and in particular Hiroyuki Higuchi and Koji Maruhashi, for many discussions.

References

  • [ABGLP19] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • [AMH11] A. H. Al-Mohy and N. J. Higham. Computing the action of the matrix exponential, with an application to exponential integrators. SIAM Journal on Scientific Computing, 33(2):488–511, 2011.
  • [BB88] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
  • [BPE14] P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
  • [CDHS18] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
  • [Chi96] D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
  • [FZZ+20] Z. Fang, S. Zhu, J. Zhang, Y. Liu, Z. Chen, and Y. He. Low rank directed acyclic graphs and causal structure learning. arXiv preprint arXiv:2006.05691, 2020.
  • [Hab18] H. E. Haber. Notes on the matrix exponential and logarithm. Santa Cruz Institute for Particle Physics, University of California: Santa Cruz, CA, USA, 2018.
  • [HST19] A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • [KGG+18] D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. arXiv preprint arXiv:1803.04929, 2018.
  • [MW15] S. L. Morgan and C. Winship. Counterfactuals and causal inference. Cambridge University Press, 2015.
  • [Nes83] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2)O(1/k^{2}). Soviet Mathematics Doklady, 27:372–376, 1983.
  • [Nes04] Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Publishing Company, Incorporated, 1 edition, 2004. doi:10.1007/978-1-4419-8853-9.
  • [NGZ20] I. Ng, A. Ghassami, and K. Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
  • [PBM16] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
  • [Pea09] J. Pearl. Causality. Cambridge University Press, 2009.
  • [PJS17] J. Peters, D. Janzing, and B. Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • [SB09] M. Stephens and D. J. Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009.
  • [SG21] A. Sauer and A. Geiger. Counterfactual generative networks. In International Conference on Learning Representations (ICLR), 2021.
  • [SHHK06] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.
  • [SIS+11] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. The Journal of Machine Learning Research, 12:1225–1248, 2011.
  • [YCGY19] Y. Yu, J. Chen, T. Gao, and M. Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019.
  • [ZARX18] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf.
  • [ZDA+20] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR, 2020.

Appendix A Proofs (Sections 2–4)

Proof of Corollary 2.

By condition (i), all entries of the matrix $B:=\sigma(A)$ are nonnegative. Hence, for all $k\geq 1$,

$\operatorname{tr}(B^{k})\geq 0.$   (18)

By combining (18) with condition (ii), namely $\operatorname{supp}(B)=\operatorname{supp}(A)$, we see that $\operatorname{tr}(B^{k})$ equals the sum of the edge-weight products along all $k$-cycles in the graph of $A$ (the graph whose adjacency matrix is $\operatorname{supp}(A)$), by classical graph theory (which can be shown by induction). Therefore, if $A$ is a DAG matrix, i.e., there are no cycles (of any length) in the graph of $A$, then $\operatorname{tr}(B^{k})=0$ for all $k\geq 1$, hence $\operatorname{tr}(\exp(B))=\operatorname{tr}(I_{d\times d})=d$; conversely, if $\operatorname{tr}(\exp(B))=d$, then $\sum_{k\geq 1}\frac{1}{k!}\operatorname{tr}(B^{k})=0$, hence $\operatorname{tr}(B^{k})=0$ for all $k\geq 1$ (due to (18)), which implies that there are no cycles of any length $k\geq 1$ in the graph of $A$. $\square$

Proof of Proposition 3.

(i) All entries of any matrix $\bar{A}\in\mathbb{R}^{d\times d}_{+}$ are nonnegative, hence $\operatorname{tr}(\bar{A}^{k})\geq 0$ for all $k\geq 1$. Therefore, $\operatorname{tr}(\exp(\bar{A}))=\operatorname{tr}(I_{d\times d})+\sum_{k\geq 1}\frac{1}{k!}\operatorname{tr}(\bar{A}^{k})\geq d$. Moreover, Corollary 2 shows that $\operatorname{tr}(\exp(\bar{A}))=d$ if and only if $\bar{A}$ is a DAG matrix.

(ii) First, we show the nonconvexity of $\operatorname{tr}(\exp(\cdot))$ on $\mathbb{R}^{2\times 2}$ (for $d=2$) using property (i) and Corollary 2: let $A=\begin{pmatrix}0&1\\ 0&0\end{pmatrix}$ and $B=\begin{pmatrix}0&0\\ 1&0\end{pmatrix}$. Notice that $A$ and $B$ are two different DAG matrices. Then, consider any matrix on the line segment between $A$ and $B$:

$M_{\alpha}=\alpha A+(1-\alpha)B=\begin{pmatrix}0&\alpha\\ 1-\alpha&0\end{pmatrix}\ \text{ for }\alpha\in(0,1).$

Then $M_{\alpha}$ is a weighted adjacency matrix of a graph with $2$-cycles. Hence $M_{\alpha}$ is nonnegative and is not a DAG matrix, and it follows from property (i) above that $\tilde{h}(M_{\alpha})>d$. Property (i) also shows that $\tilde{h}(A)=\tilde{h}(B)=d$, since $A$ and $B$ are DAG matrices. Therefore, we have

$\tilde{h}(M_{\alpha})>d=\alpha\tilde{h}(A)+(1-\alpha)\tilde{h}(B).$

Hence $\tilde{h}$ is nonconvex along the segment $\{M_{\alpha}=\alpha A+(1-\alpha)B:\alpha\in(0,1)\}\subset\mathbb{R}^{2\times 2}$. To extend this example to $\mathbb{R}^{d\times d}$, it suffices to consider a similar pair of DAG matrices $A^{\prime}$ and $B^{\prime}$ which differ only in their first $2\times 2$ submatrices, such that $A^{\prime}_{:2,:2}=A$ and $B^{\prime}_{:2,:2}=B$.

(iii) First, the following definition and property are needed for the differential calculus of the matrix exponential function.

Definition 12.

For any pair of matrices $A,B\in\mathbb{R}^{d\times d}$, the commutator of $A$ and $B$ is defined and denoted as

$\operatorname{ad}_{A}(B):=[A,B]=AB-BA.$

The operator $\operatorname{ad}_{A}$ is linear. More generally, the powers of the commutator $\operatorname{ad}_{A}(\cdot)$ are defined as follows: $(\operatorname{ad}_{A})^{0}=I$, and for any $k\geq 1$,

$(\operatorname{ad}_{A})^{k}(B)=\underbrace{[A,\dots,[A,[A,B]]\dots]}_{k\text{ nested commutators}}.$

Proposition 13 ([Hab18], Theorem 2.b).

For any $A\in\mathbb{R}^{d\times d}$ and any $\xi\in\mathbb{R}^{d\times d}$, the derivative of $t\mapsto\exp(A+t\xi)$ at $t=0$ is $\frac{\mathrm{d}}{\mathrm{d}t}\exp(A+t\xi)\big|_{t=0}=e^{A}\tilde{f}(\operatorname{ad}_{A})(\xi)$, where $\tilde{f}(z)=\frac{1-e^{-z}}{z}=\sum_{n=0}^{\infty}\frac{(-1)^{n}}{(n+1)!}z^{n}$.

Now, we calculate the directional derivative of $\tilde{h}$ along a direction $\xi\in\mathbb{R}^{d\times d}$. Note that the differential of $\operatorname{tr}(\cdot)$ is the trace itself everywhere; hence, by the chain rule and Proposition 13, for any $\xi\in\mathbb{R}^{d\times d}$, the directional derivative of $\tilde{h}$ along $\xi$ is

$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{tr}(e^{A+t\xi})\big|_{t=0}=\operatorname{tr}\Big(\frac{\mathrm{d}}{\mathrm{d}t}(e^{A+t\xi})\big|_{t=0}\Big)=\operatorname{tr}(e^{A}\tilde{f}(\operatorname{ad}_{A})(\xi))=\operatorname{tr}(e^{A}\xi)+\sum_{k=1}^{\infty}\frac{(-1)^{k}}{(k+1)!}\operatorname{tr}(e^{A}(\operatorname{ad}_{A})^{k}(\xi)).$   (19)

Next, we show that the last term of (19) is zero. For $k=1$, by cycling the three matrices inside the trace, we have

$\operatorname{tr}(e^{A}(-\operatorname{ad}_{A}(\xi)))=\operatorname{tr}(e^{A}(\xi A-A\xi))=\operatorname{tr}(Ae^{A}\xi-e^{A}A\xi)=\operatorname{tr}(\operatorname{ad}_{A}(e^{A})\xi)=0$   (20)

for any $\xi$, where (20) holds because $\operatorname{ad}_{A}(e^{A})=[A,e^{A}]=0$, since $A$ and $e^{A}$ commute (for any $A$). Furthermore, we deduce that for all $k\geq 2$:

$\operatorname{tr}\big(e^{A}(-1)^{k}(\operatorname{ad}_{A})^{k}(\xi)\big)=(-1)^{k}\operatorname{tr}\big(e^{A}\operatorname{ad}_{A}((\operatorname{ad}_{A})^{k-1}(\xi))\big)=0,$

because (20) holds for any direction, including $\tilde{\xi}:=(\operatorname{ad}_{A})^{k-1}(\xi)$. Therefore, equation (19) becomes $\frac{\mathrm{d}}{\mathrm{d}t}\tilde{h}(A+t\xi)\big|_{t=0}=\operatorname{tr}(e^{A}\xi)$, and through the identification $\frac{\mathrm{d}}{\mathrm{d}t}\tilde{h}(A+t\xi)\big|_{t=0}=\operatorname{tr}({\xi}^{\mathrm{T}}\nabla\tilde{h}(A))$, the gradient reads $\nabla\tilde{h}(A)={(\exp(A))}^{\mathrm{T}}$. $\square$

Proof of Proposition 7.

From Definition 6, the residual matrix $\xi:=A_{\Omega}^{*}-Z_{0}\in\mathbb{R}^{d\times d}$ satisfies $\|\xi\|_{\max}\leq\epsilon^{\star}_{r,\Omega}\|Z_{0}\|_{\max}$. Therefore, by the Taylor expansion and Proposition 3-(iii), there exists a constant $C_{1}\geq 0$ such that

$\operatorname{tr}(\exp(\sigma(A_{\Omega}^{*})))-d=\underbrace{\operatorname{tr}(\exp(\sigma(Z_{0})))-d}_{=0}+\operatorname{tr}(e^{\sigma(Z_{0})}\sigma(\xi))+C_{1}\|\sigma(\xi)\|_{\mathrm{F}}^{2}$
$=\operatorname{tr}(e^{\sigma(Z_{0})}\sigma(\xi))+C_{1}\|\sigma(\xi)\|_{\mathrm{F}}^{2}$   (21)
$\leq\|\xi\|_{\max}\big(\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)+C_{1}\|Z_{0}\|_{0}\|\xi\|_{\max}^{2}$
$\leq\epsilon^{\star}_{r,\Omega}\|Z_{0}\|_{\max}\big(\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)+C_{1}\|Z_{0}\|_{0}{\epsilon^{\star}_{r,\Omega}}^{2}\|Z_{0}\|_{\max}^{2},$

where (21) holds since $Z_{0}$ is a DAG matrix (in view of Theorem 1), and $\|Z_{0}\|_{0}$ is the number of nonzeros of $Z_{0}$. It follows that, with relative error $\epsilon^{\star}_{r,\Omega}\leq 1$ and $\|Z_{0}\|_{\max}\leq 1$ (without loss of generality),

$\operatorname{tr}(\exp(\sigma(A_{\Omega}^{*})))-d\leq\epsilon^{\star}_{r,\Omega}\Big(C_{1}\|Z_{0}\|_{0}+\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\Big)\|Z_{0}\|_{\max},$

which entails the result. \hfill\square

Proof of Lemma 10.

The Hadamard product $\odot$ is commutative, hence

$\mathrm{D}\sigma_{2}(Y)[\xi]=\frac{\mathrm{d}}{\mathrm{d}t}(Y+t\xi)\odot(Y+t\xi)\big|_{t=0}=2Y\odot\xi.$

$\hat{\mathrm{D}}\sigma_{\mathrm{abs}}$ (4b) is a subdifferential of $\sigma_{\mathrm{abs}}$ (4), since the sign function is a subdifferential of $z\to|z|$ and $\sigma_{\mathrm{abs}}(\cdot)$ is an elementwise matrix operator. $\square$

Appendix B Details of computational methods

B.1 Computational cost of the exact gradient

The computation of the function value and gradient of $h$ (3) mainly involves two types of operations: (i) the computation of the $d\times d$ sparse matrix $\sigma(A_{\Omega}(X,Y))$, given Definition 4 and $\sigma$, and (ii) the computation of the matrix exponential-related terms (10) or (11).

The exact computation of the gradient $\nabla h(X,Y)$ (Theorem 11) proceeds as follows (a dense reference implementation of these steps is sketched below):

(i) compute $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ (2): this step costs $2|\Omega|r$ flops; see Algorithm 3 in Section B.2. The output $A_{\Omega}$ is a $d\times d$ sparse matrix. A byproduct of this step is $\mathrm{D}\sigma(A_{\Omega})$ ($2A_{\Omega}$ or $\mathrm{sign}(A_{\Omega})$);
(ii) compute $\exp(\sigma(A_{\Omega}))$: this step costs $O(d^{3})$;
(iii) compute $M:=\exp(\sigma(A_{\Omega}))\odot\mathrm{D}\sigma(A_{\Omega})$, which costs $|\Omega|$ flops. The output $M$ is a $d\times d$ sparse matrix;
(iv) compute the sparse-dense matrix multiplication $(M,Y)\mapsto MY$: this step costs $2|\Omega|r$ flops.

Therefore, the total cost is $O(d^{3}+4|\Omega|r)$.
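
For reference, a dense Python computation of $\nabla h$ following Theorem 11 with $\sigma=\sigma_{2}$ (the expm call is the $O(d^{3})$ bottleneck that Algorithm 1 avoids); the mask is a 0/1 matrix encoding $\mathcal{P}_{\Omega}$:

    import numpy as np
    from scipy.linalg import expm

    def exact_grad_h(X, Y, mask):
        A = mask * (X @ Y.T)          # step (i): A_Omega = P_Omega(X Y^T)
        E = expm(A * A)               # step (ii): exp(sigma_2(A_Omega)), O(d^3)
        S = 2.0 * E.T * A             # S_2 = 2 exp(sigma_2(A_Omega))^T ⊙ A_Omega, cf. (10)
        return S @ Y, S.T @ X         # gradient (9): (S Y, S^T X)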

In scenarios with large and sparse graphs, i.e., $r\ll d$ and $|\Omega|=\rho d^{2}$ with $\rho\ll 1$, the complexity of computing the exact gradient of the learning criterion is thus cubic in $d$, incurred by the matrix exponential in step (ii); this step is necessary because of the Hadamard product that lies between the action of $\exp(\sigma(A_{\Omega}))$ and the thin factor matrix $X$ (respectively, $Y$) in (10) or (11): one needs to compute the $d\times d$ Hadamard product (before the matrix multiplication with $X$), which requires forming the $d\times d$ matrix exponential $\exp(\sigma(A_{\Omega}))$ explicitly.

For this reason, LoRAM, as well as NoTears-low-rank [FZZ+20], faces a cubic complexity for the exact computation of the gradient of $h$ (3), despite the significant reduction in model complexity.

B.2 Computation of the LoRAM matrix

Algorithm 3 LoRAM matrix representation
Input: Thin factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, index set $\Omega=(I,J)$
Output: $Z=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})\in\mathbb{R}^{d\times d}$ and byproducts $Z^{\mathrm{abs}}=\mathrm{sign}(Z)$, $Z^{\mathrm{sq}}=2Z$
1:  Initialize: $Z=0$
2:  for $s=1,\dots,|\Omega|$ do
3:     for $k=1,\dots,r$ do
4:        $Z_{I(s),J(s)}=Z_{I(s),J(s)}+X_{I(s),k}Y_{J(s),k}$
5:     end for
6:  end for
7:  Elementwise operations: $Z^{\mathrm{sq}}=2Z$, $Z^{\mathrm{abs}}=\mathrm{sign}(Z)$

Appendix C Experiments

C.1 Choice of the parameters $\lambda$ and $\epsilon^{\star}_{r,\Omega}$

The proposed algorithms (Algorithms 1–2) are tested within the penalty method (7) for a parameter $\lambda>0$.

In the context of the projection problem (6), the following features are notable factors that influence the choice of an optimal $\lambda$: (i) the type of graph underlying the input matrix $Z_{0}$, and (ii) the sparsity of the graph matrix $Z_{0}$. Note that the scale of $Z_{0}$ is irrelevant to the choice of $\lambda$, since we rescale the input graph matrix $Z_{0}\in\mathbb{R}^{d\times d}$ as $Z_{0}\leftarrow\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}Z_{0}$ for a constant $c_{0}=10^{-1}$, without loss of generality (see Section 4.1), so that the scale of $Z_{0}$ always stays at the level of $\frac{c_{0}}{10}$ (in Frobenius norm). Therefore, we are able to fix a parameter set $\Lambda$ for the following experimental settings: (i) the type of graph is fixed to be the acyclic ER graph set, as in [ZARX18], and (ii) the sparsity level of $Z_{0}$ is fixed around $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2},\ 5\cdot 10^{-2}\}$. Concretely, we select the penalty parameter $\lambda$, among $N_{\lambda}=5$ values in the fixed set $\Lambda=\{1.0,2.0,\dots,5.0\}$, with respect to the objective function value of the proximal mapping (7).

On the other hand, the choice of the hard threshold $\epsilon^{\star}_{r,\Omega}$ in (8) is straightforward, according to Definition 6 of the relative error of LoRAM w.r.t. a given subset $\mathcal{D}_{d\times d}^{\star}$ of DAGs. For the set of graphs in our experiments, we use a moderate value $\epsilon^{\star}_{r,\Omega}=5\cdot 10^{-2}$, which has the effect of eliminating only very weakly weighted edges.

C.2 Additional experimental details

Figure 5: Projection accuracy in TPR, FDR and SHD for different values of $r$ (the number of columns of $X$ and $Y$ in (2)). Number of nodes $d=500$. The sparsity of the input non-DAG matrix $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$ (17) is $\rho^{\star}\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$.
Table 4: Case (b): the noise graph $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$ creates cause-effect confusions, for $\sigma_{E}=0.4$. (FDR, TPR, FPR, SHD) are the evaluation scores of the solution compared to the ground-truth DAG matrix $A^{\star}$.
Algorithm | $(\lambda,r)$ | $d$ | Runtime (sec) | FDR | TPR | FPR | SHD
LoRAM (5.0, 40) 100 1.82 0.0 1.0 0.0 0.0
(5.0, 40) 200 2.20 2.5e-2 0.98 5.0e-5 1.0
(5.0, 40) 400 2.74 2.5e-2 0.98 5.0e-5 4.0
(5.0, 40) 600 3.40 1.7e-2 0.98 3.4e-5 6.0
(5.0, 40) 800 4.23 7.8e-3 0.99 1.6e-5 5.0
(2.0, 80) 1000 7.63 0.0 1.0 0.0 0.0
(2.0, 80) 1500 13.34 8.9e-4 1.0 1.8e-6 2.0
(2.0, 80) 2000 20.32 7.5e-4 1.0 1.5e-6 3.0
NoTears - 100 0.67 0.0 1.00 0.0 0.0
- 200 3.64 0.0 0.95 0.0 2.0
- 400 16.96 0.0 0.98 0.0 4.0
- 600 42.65 0.0 0.96 0.0 16.0
- 800 83.68 0.0 0.97 0.0 22.0
- 1000 136.94 0.0 0.96 0.0 36.0
- 1500 437.35 0.0 0.96 0.0 94.0
- 2000 906.94 0.0 0.96 0.0 148.0
Figure 6: First $2.5\times 10^{5}$ weighted edges of the solution (red) overlapping the edges of $A^{\star}$ (blue). Graph dimension $d=1000$, rank parameter $r=80$. The recovery scores of the solution are: TPR (higher is better) $=1.0$, FPR (lower is better) $=0$, SHD (lower is better) $=0$. The input graph matrix is $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$.