
From graphs to DAGs: a low-complexity model and a scalable algorithm

Shuyu Dong (TAU, LISN, INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France; [email protected])    Michèle Sebag (TAU, LISN, CNRS, INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France; [email protected])
(April 10, 2022)
Abstract

Learning directed acyclic graphs (DAGs) has long been a critical challenge at the core of probabilistic and causal modeling. The NoTears approach of Zheng et al. [ZARX18], through a differentiable function involving the matrix exponential trace $\operatorname{tr}(\exp(\cdot))$, opens up a way to learn DAGs via continuous optimization, though with an $O(d^{3})$ complexity in the number $d$ of nodes. This paper presents a low-complexity model, called LoRAM for Low-Rank Additive Model, which combines low-rank matrix factorization with a sparsification mechanism for the continuous optimization of DAGs. The main contribution of the approach lies in an efficient gradient approximation method leveraging the low-rank property of the model, and its straightforward application to the computation of projections from graph matrices onto the DAG matrix space. The proposed method achieves a reduction from cubic to quadratic complexity while handling the same DAG characteristic function as NoTears, and scales easily up to thousands of nodes for the projection problem. The experiments show that LoRAM achieves efficiency gains of orders of magnitude over the state of the art, at the expense of a very moderate accuracy loss in the considered range of sparse matrices, and with a low sensitivity to the rank choice of the model's low-rank component.

1 Introduction

The learning of directed acyclic graphs (DAGs) is an important problem for probabilistic and causal inference [Pea09, PJS17], with important applications in social sciences [MW15], genome research [SB09] and machine learning itself [PBM16, ABGLP19, SG21]. Through the development of probabilistic graphical models [Pea09, BPE14], DAGs have become a most natural mathematical object to describe the causal relations among a number of variables. In many of today's application domains, the estimation of DAGs faces intractability issues as an ever growing number $d$ of variables is considered, since estimating DAGs is NP-hard [Chi96]. The difficulty lies in how to enforce the acyclicity of graphs. Shimizu et al. [SHHK06] combined independent component analysis with the combinatorial linear assignment method to optimize a linear causal model (LiNGAM), and later proposed a direct, sequential algorithm [SIS+11] guaranteeing the global optimum of the LiNGAM, with $O(d^{4})$ complexity.

Recently, Zheng et al. [ZARX18] proposed an optimization approach to learning DAGs. The breakthrough in this work, called NoTears, comes with the characterization of DAG matrices as the zero set of a real-valued differentiable function on $\mathbb{R}^{d\times d}$: a $d\times d$ matrix $A$ is the adjacency matrix of a DAG if and only if the exponential trace satisfies

$h(A) := \operatorname{tr}(\exp(A \odot A)) = d$,   (1)

and thus the learning of DAG matrices can be cast as a continuous optimization problem subject to the constraint $h(A)=d$. The NoTears approach broadens the way of learning complex causal relations and provides promising perspectives for tackling large-scale inference problems [KGG+18, YCGY19, ZDA+20, NGZ20]. However, NoTears is still not suitable for large-scale applications, as the complexity of computing the exponential trace and its gradient is $O(d^{3})$. More recently, Fang et al. [FZZ+20] proposed to represent DAGs by low-rank matrices, with both theoretical and empirical validation of the low-rank assumption for a range of graph models. However, the adaptation of the NoTears framework [ZARX18] to the low-rank model still incurs a complexity of $O(d^{3})$ due to the DAG characteristic function in (1).

The contribution of the paper is a new computational framework that tackles the scalability issues faced by the low-rank modeling of DAGs. We notice that the Hadamard product $\odot$ in characteristic functions such as (1) poses real obstacles to scaling up the optimization of NoTears [ZARX18] and NoTears-low-rank [FZZ+20]. To address these difficulties, we present a low-complexity model, named Low-Rank Additive Model (LoRAM), which composes low-rank matrix factorization with sparsification, and then propose a novel approximation method, compatible with LoRAM, to compute the gradients of the exponential trace in (1). Formally, the gradient approximation—consisting of matrix computations of the form $(A,C,B)\to(\exp(A)\odot C)B$, where $A,C\in\mathbb{R}^{d\times d}$ and $B$ is a thin low-rank matrix—is inspired by the numerical analysis of [AMH11] for the action of the matrix exponential $\exp(A)$. We apply the new method to the computation of projections from graphs to DAGs through optimization with the differentiable DAG constraint.

Empirical evidence is presented to identify the cost and the benefits of the approximation method combined with Nesterov’s accelerated gradient descent [Nes83], depending on the considered range of problem parameters (number of nodes, rank approximation, sparsity of the target graph).

The main contributions are summarized as follows:

  • The LoRAM model, combining a low-rank structure with a flexible sparsification mechanism, is proposed to represent DAG matrices, together with a DAG characteristic function generalizing the exponential trace of NoTears [ZARX18].

  • An efficient gradient approximation method, exploiting the low-rank and sparse nature of the LoRAM model, is proposed. Under the low-rank assumption ($r\leq C\ll d$), the complexity of the proposed method is quadratic ($O(d^{2})$) instead of $O(d^{3})$, as shown in Table 1. Large efficiency gains, with insignificant loss of accuracy in some cases, are demonstrated experimentally in the considered range of application.

    Table 1: Computational properties of LoRAM and algorithms in related work.
    Method | Search space | Memory req. | Cost for $\nabla h$
    NoTears [ZARX18] | $\mathbb{R}^{d\times d}$ | $O(d^{2})$ | $O(d^{3})$
    NoTears-low-rank [FZZ+20] | $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ | $O(dr)$ | $O(d^{3})$
    LoRAM (ours) | $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ | $O(dr)$ | $O(d^{2}r)$

2 Notation and formal background

A graph on $d$ nodes is a pair $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $|\mathcal{V}|=d$ and $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$. By default, a directed graph is simply referred to as a graph. The adjacency matrix of a graph $\mathcal{G}$, denoted $\mathbb{A}(\mathcal{G})$, is the matrix such that $[\mathbb{A}(\mathcal{G})]_{ij}=1$ if $(i,j)\in\mathcal{E}$ and $0$ otherwise. Let $A\in\mathbb{R}^{d\times d}$ be any weighted adjacency matrix of a graph $\mathcal{G}$ on $d$ nodes; then, by definition, the adjacency matrix $\mathbb{A}(\mathcal{G})$ indicates the nonzeros of $A$; it is also the support of $A$, denoted $\operatorname{supp}(A)$, i.e., $[\operatorname{supp}(A)]_{ij}=1$ if $A_{ij}\neq 0$ and $0$ otherwise. The number of nonzeros of $A$ is denoted $\|A\|_{0}$. The matrix $A$ is called a DAG matrix if $\operatorname{supp}(A)$ is the adjacency matrix of a directed acyclic graph (DAG). By convention, we define the set of DAG matrices as $\mathcal{D}_{d\times d}=\{A\in\mathbb{R}^{d\times d}:\operatorname{supp}(A)\text{ defines a DAG}\}$.

We recall the following theorem, which characterizes acyclic graphs via the matrix exponential trace $\operatorname{tr}(\exp(\cdot))$, where $\exp(\cdot)$ denotes the matrix exponential function. The matrix exponential will be denoted $e^{\cdot}$ and $\exp(\cdot)$ indifferently. The operator $\odot$ denotes the matrix Hadamard product, which acts on two matrices of the same size by elementwise multiplication.

Theorem 1 ([ZARX18]).

A matrix $A\in\mathbb{R}^{d\times d}$ is a DAG matrix if and only if

$\operatorname{tr}(\exp(A\odot A))=d.$
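
For illustration, this characterization can be checked numerically with a dense matrix exponential (the $O(d^{3})$ route that the rest of the paper seeks to avoid); a minimal Python sketch, assuming numpy and scipy are available:

    import numpy as np
    from scipy.linalg import expm

    def is_dag_matrix(A, tol=1e-8):
        """Test whether supp(A) is acyclic via h(A) = tr(exp(A ⊙ A)) = d (Theorem 1)."""
        d = A.shape[0]
        h = np.trace(expm(A * A))  # elementwise product A * A is the Hadamard product
        return h - d <= tol

    # a 3-node chain (acyclic) versus a 2-cycle
    A_chain = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
    A_cycle = np.array([[0., 1.], [1., 0.]])
    print(is_dag_matrix(A_chain), is_dag_matrix(A_cycle))  # True False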

The following corollary is a straightforward extension of the theorem above:

Corollary 2.

Let $\sigma:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d\times d}$ be an operator such that, for any $A\in\mathbb{R}^{d\times d}$: (i) $\sigma(A)\geq 0$ and (ii) $\operatorname{supp}(A)=\operatorname{supp}(\sigma(A))$. Then $A\in\mathbb{R}^{d\times d}$ is a DAG matrix if and only if $\operatorname{tr}(\exp(\sigma(A)))=d$.

In view of the property above, we refer to the composition of $\operatorname{tr}(\exp(\cdot))$ with the operator $\sigma$ as a DAG characteristic function. Next, we give some further properties (proof in Appendix A) of the exponential trace.

Proposition 3.

The exponential trace $\tilde{h}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}:A\to\operatorname{tr}(\exp(A))$ satisfies: (i) For all $\bar{A}\in\mathbb{R}^{d\times d}_{+}$, $\operatorname{tr}(\exp(\bar{A}))\geq d$, and $\operatorname{tr}(\exp(\bar{A}))=d$ if and only if $\bar{A}$ is a DAG matrix. (ii) $\tilde{h}$ is nonconvex on $\mathbb{R}^{d\times d}$. (iii) The Fréchet derivative of $\tilde{h}$ at $A\in\mathbb{R}^{d\times d}$ along any direction $\xi\in\mathbb{R}^{d\times d}$ is

$\mathrm{D}\tilde{h}(A)[\xi]=\operatorname{tr}(\exp(A)\xi),$

and the gradient of $\tilde{h}$ at $A$ is $\nabla\tilde{h}(A)={(\exp(A))}^{\mathrm{T}}$.
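
As a sanity check of item (iii), the gradient formula can be compared against a finite-difference quotient; a small Python sketch (assuming numpy and scipy):

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    xi = rng.standard_normal((d, d))      # arbitrary direction

    grad = expm(A).T                      # claimed gradient of tr(exp(.)) at A
    t = 1e-6
    fd = (np.trace(expm(A + t * xi)) - np.trace(expm(A - t * xi))) / (2 * t)
    print(fd, np.sum(grad * xi))          # the two values nearly coincide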

3 LoRAM: a low-complexity model

In this section, we describe a low-complexity matrix representation of the adjacency matrices of directed graphs, and then a generalized DAG characteristic function for the new matrix model.

In the spirit of best low-rank approximations (e.g., truncated singular value decompositions) and taking inspiration from [FZZ+20], the search for a full $d\times d$ (DAG) matrix $A$ is replaced by the search for a pair of thin factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ with $1\leq r<d$, the $d\times d$ candidate graph matrix being represented by the product $X{Y}^{\mathrm{T}}$. This matrix product has rank bounded by $r$, with $2dr\leq d^{2}$ parameters. However, the low-rank representation $(X,Y)\to X{Y}^{\mathrm{T}}$ generally gives a dense $d\times d$ matrix. Since in many scenarios the sought graph (or Bayesian network) is sparse, we apply a sparsification operator to $X{Y}^{\mathrm{T}}$ in order to trim superfluous entries. Accordingly, we combine the two operations and introduce the following model.

Definition 4 (LoRAM).

Let $\Omega\subset[d]\times[d]$ be a given index set. The low-rank additive model (LoRAM), noted $A_{\Omega}$, is defined from the matrix product of $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, sparsified according to $\Omega$:

$A_{\Omega}(X,Y)=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}),$   (2)

where $\mathcal{P}_{\Omega}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d\times d}$ is the mask operator such that $[\mathcal{P}_{\Omega}(A)]_{ij}=A_{ij}$ if $(i,j)\in\Omega$ and $0$ otherwise. The set $\Omega$ is referred to as the candidate set of LoRAM.

The candidate set $\Omega$ is to be fixed according to the specific problem. In the case of projection from a given graph onto the set of DAGs, $\Omega$ can be fixed as the index set of the given graph's edges.
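
A sparse construction of the LoRAM matrix (2) can be sketched in Python as follows, assuming $\Omega$ is stored as arrays of row and column indices (the helper name loram_matrix is illustrative, not taken from the authors' code); only the $|\Omega|$ needed entries of $X{Y}^{\mathrm{T}}$ are formed, at a $2|\Omega|r$ cost:

    import numpy as np
    import scipy.sparse as sp

    def loram_matrix(X, Y, rows, cols, d):
        """Return P_Omega(X @ Y.T) as a d x d sparse matrix (Definition 4)."""
        vals = np.einsum('ij,ij->i', X[rows], Y[cols])  # (X Y^T)_{ij} for (i, j) in Omega
        return sp.csr_matrix((vals, (rows, cols)), shape=(d, d))

    d, r = 6, 2
    rng = np.random.default_rng(1)
    X, Y = rng.standard_normal((d, r)), rng.standard_normal((d, r))
    rows, cols = np.array([0, 0, 2, 4]), np.array([1, 3, 5, 2])  # a small candidate set
    print(loram_matrix(X, Y, rows, cols, d).toarray())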

The DAG characteristic function on the LoRAM search space is defined as follows:

Definition 5.

Let $\tilde{h}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}:A\to\operatorname{tr}(\exp(A))$ denote the exponential trace function. We define $h:\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}\mapsto\mathbb{R}$ by

$h(X,Y)=\operatorname{tr}(\exp(\sigma(A_{\Omega}(X,Y)))),$   (3)

where $\sigma:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ is one of the following elementwise operators:

$\sigma_{2}(Z):=Z\odot Z \quad\text{and}\quad \sigma_{\mathrm{abs}}(Z):=\sum_{i,j=1}^{d}|Z_{ij}|\,e_{i}{e_{j}}^{\mathrm{T}}.$   (4)

Note that the operators $\sigma_{2}$ and $\sigma_{\mathrm{abs}}$ in (4) are two natural choices that meet conditions (i)–(ii) of Corollary 2, since they both produce a nonnegative surrogate of the $d\times d$ matrix $A_{\Omega}$ while preserving its support.

3.1 Representativity

In the construction of a LoRAM matrix (2), the low-rank component of the model—$X{Y}^{\mathrm{T}}$ with $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$—has rank smaller than or equal to $r$ (with equality when $X$ and $Y$ have full column rank), and the subsequent sparsification operator $\mathcal{P}_{\Omega}$ generally changes the rank of the final matrix model $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ (2). Indeed, the rank of $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ depends on an interplay between $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ and the discrete set $\Omega\subset[d]\times[d]$. The following examples illustrate the two extreme cases of this interplay:

  • (i)

    The first extreme case: let $\Omega$ be the index set of the edges of a sparse graph $\mathcal{G}_{\Omega}$, and let $(X,Y)\in\mathbb{R}^{d\times 1}\times\mathbb{R}^{d\times 1}$ be the pair of all-ones matrices, for $r=1$; then the LoRAM matrix $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})=\mathbb{A}(\mathcal{G}_{\Omega})$, i.e., the adjacency matrix of $\mathcal{G}_{\Omega}$. Hence $\operatorname{rank}(\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}))=\operatorname{rank}(\mathbb{A}(\mathcal{G}_{\Omega}))$, which depends solely on $\Omega$ and is generally much larger than $r=1$.

  • (ii)

    The second extreme case: let $\Omega$ be the full $[d]\times[d]$ index set; then $\mathcal{P}_{\Omega}$ reduces to the identity map, so that $\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})=X{Y}^{\mathrm{T}}$ and $\operatorname{rank}(\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}}))=\operatorname{rank}(X{Y}^{\mathrm{T}})\leq r$ for any $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$.

In the first extreme case above, optimizing LoRAM (2) for DAG learning boils down to choosing the most relevant edge set $\Omega$, which is an NP-hard combinatorial problem [Chi96]. In the second extreme case, the optimization of LoRAM (2) reduces to learning the most pertinent low-rank matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, which coincides with optimizing the NoTears-low-rank model [FZZ+20].

In this work, we are interested in settings between the two extreme cases above, in which both $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ and the candidate set $\Omega$ have sufficient degrees of freedom. Consequently, the representativity of LoRAM depends on both the rank parameter $r$ and $\Omega$.

Next, we present a way of quantifying the representativity of LoRAM with respect to a subset $\mathcal{D}_{d\times d}^{\star}$ of DAG matrices. The restriction to a subset $\mathcal{D}_{d\times d}^{\star}$ is motivated by the observation that certain types of DAG matrices—such as those with many hubs—can be represented by low-rank matrices [FZZ+20].

Definition 6.

Let $\mathcal{D}_{d\times d}^{\star}\subset\mathcal{D}_{d\times d}$ be a given set of nonzero DAG matrices. For $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$, let $A_{\Omega}^{*}(Z_{0})$ denote any LoRAM matrix (2) such that $\|A_{\Omega}^{*}(Z_{0})-Z_{0}\|=\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\|A_{\Omega}(X,Y)-Z_{0}\|$; then we define the relative error of LoRAM w.r.t. $\mathcal{D}_{d\times d}^{\star}$ as

$\epsilon^{\star}_{r,\Omega}=\max_{Z\in\mathcal{D}_{d\times d}^{\star}}\Big\{\frac{\|A_{\Omega}^{*}(Z)-Z\|_{\max}}{\|Z\|_{\max}}\Big\},$

where $\|Z\|_{\max}:=\max_{ij}|Z_{ij}|$ denotes the matrix max-norm. For $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$, $A_{\Omega}^{*}(Z_{0})$ is referred to as an $\epsilon^{\star}_{r,\Omega}$-quasi DAG matrix.

Note that the existence of $A_{\Omega}^{*}(Z_{0})$ for any $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$ is guaranteed by the closedness of the image set of LoRAM (2).

Based on the relative error above, the relevance of the DAG characteristic function is established by the following proposition (proof in Appendix A):

Proposition 7.

Let $\mathcal{D}_{d\times d}^{\star}\subset\mathcal{D}_{d\times d}$ be a given set of nonzero DAG matrices. For any $Z_{0}\in\mathcal{D}_{d\times d}^{\star}$ such that $\|Z_{0}\|_{\max}\leq 1$ (without loss of generality), the minima of

$\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\|A_{\Omega}(X,Y)-Z_{0}\|$

belong to the set

$\{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}:h(X,Y)-d\leq C_{0}\epsilon^{\star}_{r,\Omega}\},$   (5)

where $\epsilon^{\star}_{r,\Omega}$ is given in Definition 6 and $C_{0}=\big(C_{1}\|Z_{0}\|_{0}+\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)\|Z_{0}\|_{\max}$ for a constant $C_{1}\geq 0$.

Remark 8.

The constant $C_{0}$ in Proposition 7 can be seen as a measure of the total capacity of passing from one node to other nodes, and therefore depends on $d$, $\|Z_{0}\|_{\max}$ (bounded by $1$), the sparsity and the average degree of the graph of $Z_{0}$. For DAG matrices with sparsity $\rho_{0}\sim 10^{-3}$ and $d\lesssim 10^{3}$, one can expect that $C_{0}\leq Cd$ for a constant $C$. $\square$

Proposition 7 establishes that, under the stated conditions, a given DAG matrix $Z_{0}$ admits low-rank approximations $A_{\Omega}(X,Y)$ satisfying $h(X,Y)-d\leq\delta_{\epsilon}$ for a small enough parameter $\delta_{\epsilon}$. In other words, the low-rank projection with a relaxed DAG constraint admits solutions.

The general case of projecting a non-acyclic graph matrix $Z_{0}$ onto a low-rank matrix under a relaxed DAG constraint is considered in the next section.

4 Scalable projection from graphs to DAGs

Given a (generally non-acyclic) graph matrix $Z_{0}\in\mathbb{R}^{d\times d}$, let us consider the projection of $Z_{0}$ onto the feasible set (5):

$\min_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}}\ \frac{1}{2}\|A_{\Omega}(X,Y)-Z_{0}\|_{\mathrm{F}}^{2}\quad\text{subject to } h(X,Y)-d\leq\delta_{\epsilon},$   (6)

where $A_{\Omega}(X,Y)$ is the LoRAM matrix (2), $h$ is given by Definition 5, and $\delta_{\epsilon}>0$ is a tolerance parameter. Based on Proposition 7 and given the objective function of (6), the solution to (6) provides a quasi DAG matrix closest to $Z_{0}$ and thus enables finding a projection of $Z_{0}$ onto the DAG matrix space $\mathcal{D}_{d\times d}$. More precisely, we tackle problem (6) using the penalty method and focus on the primal problem, for a given penalty parameter $\lambda>0$, followed by elementwise hard thresholding:

$(X^{*},Y^{*})=\operatorname*{arg\,min}_{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}} h(X,Y)+\frac{1}{2\lambda}\|A_{\Omega}(X,Y)-Z_{0}\|_{\mathrm{F}}^{2},$   (7)

$A^{*}=\mathbb{T}_{\epsilon^{\star}_{r,\Omega}}(A_{\Omega}(X^{*},Y^{*})),$   (8)

where $\mathbb{T}_{\epsilon^{\star}_{r,\Omega}}(z)=z\,\delta_{|z|\geq\epsilon^{\star}_{r,\Omega}}$ is the elementwise hard thresholding operator. The choice of $\lambda$ and $\epsilon^{\star}_{r,\Omega}$ is discussed in Section C.1 in the context of problem (6).
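
A dense Python sketch of the penalized objective (7) and the thresholding step (8), assuming $\sigma=\sigma_{2}$ and a 0/1 mask encoding $\mathcal{P}_{\Omega}$ (the scalable variant of Section 4.1 replaces the exact $\nabla h$ with Algorithm 1):

    import numpy as np
    from scipy.linalg import expm

    def penalized_objective(X, Y, Z0, mask, lam):
        """F(X, Y) = tr(exp(sigma_2(A_Omega))) + ||A_Omega - Z0||_F^2 / (2 lam)."""
        A = mask * (X @ Y.T)              # A_Omega = P_Omega(X Y^T)
        return np.trace(expm(A * A)) + np.sum((A - Z0) ** 2) / (2.0 * lam)

    def hard_threshold(A, eps):
        """Elementwise operator T_eps of (8): zero out entries with |A_ij| < eps."""
        A = A.copy()
        A[np.abs(A) < eps] = 0.0
        return A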

Remark 9.

To obtain a DAG matrix closest to the (non-acyclic) $Z_{0}$, it is necessary to break the cycles in $\mathcal{G}(Z_{0})$ by suppressing certain edges. Hence, we assume that the edge set of the sought DAG is a strict subset of $\operatorname{supp}(Z_{0})$ and thus fix the candidate set $\Omega$ to be $\operatorname{supp}(Z_{0})$. $\square$

Problems (6) and (7) are nonconvex because: (i) the matrix exponential trace $A\to\operatorname{tr}(\exp(A))$ in $h$ (3) is nonconvex (as in [ZARX18]), and (ii) the matrix product $(X,Y)\to X{Y}^{\mathrm{T}}$ in $A_{\Omega}(X,Y)$ (2) is nonconvex. In view of the DAG characteristic constraint of (6), the augmented Lagrangian algorithms of NoTears [ZARX18] and NoTears-low-rank [FZZ+20] can be applied to the same objective as problem (6); it suffices for NoTears and NoTears-low-rank to replace the LoRAM matrix $A_{\Omega}(X,Y)$ in (6) by the $d\times d$ matrix variable and the dense matrix product $X{Y}^{\mathrm{T}}$, respectively.

However, the NoTears-based methods of [ZARX18, FZZ+20] have an $O(d^{3})$ complexity due to the composition of the elementwise operations (the Hadamard product $\odot$) with the matrix exponential in the $h$ function (1) and (3). We elaborate on this argument in the next subsection and then propose a new computational method for computations involving the gradient of the DAG characteristic function $h$.

4.1 Gradient of the DAG characteristic function and an efficient approximation

Lemma 10.

For any $Z\in\mathbb{R}^{d\times d}$ and $\xi\in\mathbb{R}^{d\times d}$, the differentials of $\sigma_{2}$ and $\sigma_{\mathrm{abs}}$ (4) are

$\mathrm{D}\sigma_{2}(Z)[\xi]=2Z\odot\xi \quad\text{and}\quad \hat{\mathrm{D}}\sigma_{\mathrm{abs}}(Z)[\xi]=\mathrm{sign}(Z)\odot\xi,$   (4b)

where $\mathrm{sign}(\cdot)$ is the elementwise sign function such that $[\mathrm{sign}(Z)]_{ij}=\frac{Z_{ij}}{|Z_{ij}|}$ if $Z_{ij}\neq 0$ and $0$ otherwise.

Theorem 11.

The gradient of $h$ (3) is

$\nabla h(X,Y)=(\mathcal{S}Y,\ {\mathcal{S}}^{\mathrm{T}}X)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r},$   (9)

where $\mathcal{S}\in\mathbb{R}^{d\times d}$ has the following expressions, depending on the choice of $\sigma$ in (4) (writing $A_{\Omega}:=A_{\Omega}(X,Y)$ for brevity):

$\mathcal{S}_{2}=2\,({\exp(\sigma_{2}(A_{\Omega}))}^{\mathrm{T}})\odot A_{\Omega},$   (10)

$\mathcal{S}_{\mathrm{abs}}=({\exp(\sigma_{\mathrm{abs}}(A_{\Omega}))}^{\mathrm{T}})\odot\mathrm{sign}(A_{\Omega}).$   (11)
Proof.

From Proposition 3-(iii), the Fréchet derivative of the exponential trace $\tilde{h}$ at $A\in\mathbb{R}^{d\times d}$ is $\mathrm{D}\tilde{h}(A)[\xi]=\operatorname{tr}(\exp(A)\xi)$ for any $\xi\in\mathbb{R}^{d\times d}$. By the chain rule and Lemma 10, the Fréchet derivative of $h$ (3) for $\sigma=\sigma_{2}$ reads, with $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$,

$\mathrm{D}_{X}h(X,Y)[\xi]=\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,\mathrm{D}\sigma(A_{\Omega})\big[\mathrm{D}\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})[\xi{Y}^{\mathrm{T}}]\big]\big)$
$=\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,\mathrm{D}\sigma(A_{\Omega})[\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}})]\big)$   (12)
$=2\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,(A_{\Omega}\odot\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}}))\big)$
$=2\operatorname{tr}\big(\exp(\sigma(A_{\Omega}))\,(A_{\Omega}\odot(\xi{Y}^{\mathrm{T}}))\big)$   (13)
$=2\operatorname{tr}\big((\exp(\sigma(A_{\Omega}))\odot{A_{\Omega}}^{\mathrm{T}})(\xi{Y}^{\mathrm{T}})\big)$   (14)
$=2\operatorname{tr}\big({Y}^{\mathrm{T}}(\exp(\sigma(A_{\Omega}))\odot{A_{\Omega}}^{\mathrm{T}})\xi\big),$

where (13) holds, i.e., $A_{\Omega}\odot\mathcal{P}_{\Omega}(\xi{Y}^{\mathrm{T}})=A_{\Omega}\odot\xi{Y}^{\mathrm{T}}$, because $A\odot\mathcal{P}_{\Omega}(B)=\mathcal{P}_{\Omega}(A)\odot B$ (for any $A$ and $B$) and $\mathcal{P}_{\Omega}^{2}=\mathcal{P}_{\Omega}$, and (14) holds because $\operatorname{tr}(A(B\odot C))=\operatorname{tr}((A\odot{B}^{\mathrm{T}})C)$ for any $A,B,C$ (of compatible sizes). By identifying the last expression with $\operatorname{tr}({\nabla_{X}h(X,Y)}^{\mathrm{T}}\xi)$, we obtain $\nabla_{X}h(X,Y)=\underbrace{2\big({\exp(\sigma(A_{\Omega}))}^{\mathrm{T}}\odot A_{\Omega}\big)}_{\mathcal{S}}\,Y$, hence the expression (10) for $\mathcal{S}$. The expression of $\mathcal{S}_{\mathrm{abs}}$ for $\sigma=\sigma_{\mathrm{abs}}$ is obtained by the same calculations with (4b). ∎

The computational bottleneck in the exact computation of the gradient (9) lies in forming the $d\times d$ matrix $\mathcal{S}$ (10)–(11), and is due to the interplay between the Hadamard product and matrix multiplication; see Section B.1 for details. Nevertheless, we note that the multiplication $(\mathcal{S},X)\to\mathcal{S}X$ is similar to the action of a matrix exponential, of the form $(A,X)\to\exp(A)X$, which can be computed using only repeated multiplications of a $d\times d$ matrix with the thin matrix $X\in\mathbb{R}^{d\times r}$, based on Al-Mohy and Higham's results [AMH11].

The difficulty in adapting the method of [AMH11] also lies in the presence of the Hadamard product in $\mathcal{S}$ (10)–(11). Once the sparse $d\times d$ matrix $A:=\sigma(A_{\Omega})$ in (10)–(11) is obtained (using Algorithm 3, Section B.2), the exact computation of $(A,C,B)\to(\exp(A)\odot C)B$, using the Taylor expansion of $\exp(\cdot)$ to a certain order $m_{*}$, requires computing $\frac{1}{k!}(A^{k}\odot C)B$ at each iteration; this entails forming the $d\times d$ matrix product $A^{k}$ (as $A(A^{k-1})$) before the Hadamard product, at an $O(d^{3})$ cost.

To alleviate this obstacle, we propose to use inexact incremental multiplications (inexact because $((A\odot C)^{k+1})B\neq(A^{k+1}\odot C)B$); see Algorithm 1.

Algorithm 1 Approximation of $(A,C,B)\to(\exp(A)\odot C)B$
Input: $d\times d$ matrices $A$ and $C$, thin matrix $B\in\mathbb{R}^{d\times r}$, tolerance $\mathrm{tol}>0$
Output: $F\approx(\exp(A)\odot C)B\in\mathbb{R}^{d\times r}$
1:  Estimate the Taylor order parameter $m_{*}$ from $A$ # using [AMH11, Algorithm 3.2]
2:  Initialize: let $F=(I\odot C)B$
3:  for $k=1,\dots,m_{*}$ do
4:     $B\leftarrow\frac{1}{k+1}(A\odot C)B$
5:     $F\leftarrow F+B$
6:     $B\leftarrow F$
7:  end for
8:  Return $F$.

In line 1 of Algorithm 1, the value of $m_{*}$ is obtained from the numerical analysis results of [AMH11, Algorithm 3.2]; in practice, the value of $m_{*}$, which depends on the spectrum of $A$, is a bounded constant (independent of the matrix size $d$). Therefore, the dominant cost of Algorithm 1 is $2m_{*}|\Omega|r\lesssim d^{2}r$, since each iteration (lines 4–6) costs $(2|\Omega|r+dr)\approx 2|\Omega|r$ flops. Table 1 summarizes this computational property in comparison with existing methods.
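
A direct Python transcription of Algorithm 1 with sparse matrices may look as follows, where the Taylor order $m_{*}$ is passed in as an argument (in the paper it is estimated as in [AMH11, Algorithm 3.2]); the function name is illustrative:

    import scipy.sparse as sp

    def approx_expm_hadamard_action(A, C, B, m_star):
        """Approximate (exp(A) ⊙ C) B by inexact incremental multiplications (Algorithm 1)."""
        AC = A.multiply(C)                      # Hadamard product, stays sparse on Omega
        F = C.diagonal()[:, None] * B           # line 2: F = (I ⊙ C) B
        for k in range(1, m_star + 1):          # lines 3-7
            B = AC.dot(B) / (k + 1.0)           # line 4
            F = F + B                           # line 5
            B = F                               # line 6
        return F                                # line 8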

Reliability of Algorithm 1.

The accuracy of Algorithm 1 with respect to the exact computation of $(A,C,B)\to(\exp(A)\odot C)B$ depends notably on the scale of $A$, since the differential $\mathrm{D}\exp(A)$ at $A$ has a larger operator norm when the norm of $A$ is larger; see Proposition 13 in Appendix A.

To illustrate the remark above, we approximate $\nabla h(X,Y)$ (9) by Algorithm 1 at random points of $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ with different scales, with $\Omega$ defined from the edges of a random sparse graph; the results are shown in Figure 1.

Figure 1: Average accuracy of Algorithm 1 in approximating $\nabla h(X,Y)$ (9) at points $(X,Y)$ with different scales and sparsity levels $\rho=\frac{|\Omega|}{d^{2}}$; number of nodes $d=200$, $r=40$. Panels: (a) relative error; (b) cosine similarity.

We observe from Figure 1 that Algorithm 1 is reliable, i.e., it produces gradient approximations with cosine similarity close to $1$, when the norm of $A_{\Omega}(X,Y)$ is sufficiently bounded. More precisely, for $c_{0}=10^{-1}$, Algorithm 1 is reliable on the set

$\mathcal{D}(c_{0})=\{(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}:\|A_{\Omega}(X,Y)\|_{\mathrm{F}}\leq c_{0}\}.$   (15)

The degrading accuracy of Algorithm 1 outside $\mathcal{D}(c_{0})$ (15) can nonetheless be avoided for the projection problem (6), in particular through rescaling of the input matrix $Z_{0}$. Note that the edge set of any graph is invariant to the scaling of its weighted adjacency matrix, and that any DAG $A^{*}$ solution to (6) satisfies $\|A^{*}\|_{\mathrm{F}}\lesssim\|Z_{0}\|_{\mathrm{F}}$ since $\operatorname{supp}(A^{*})\subset\operatorname{supp}(Z_{0})$ (see Remark 9). Hence it suffices to rescale $Z_{0}$ by a small enough scalar, e.g., to replace $Z_{0}$ with $Z_{0}^{\prime}=\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}Z_{0}$, without loss of generality, in (6)–(7). Indeed, this rescaling ensures that both the input matrix $Z_{0}^{\prime}$ and matrices like ${A^{*}}^{\prime}=\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}A^{*}$—a DAG matrix equivalent to $A^{*}$—stay confined in the image (through LoRAM) of $\mathcal{D}(c_{0})$ (15), on which the gradient approximations by Algorithm 1 are reliable.

4.2 Accelerated gradient descent

Given the gradient computation method (Algorithm 1) for the $h$ function, we adapt Nesterov's accelerated gradient descent [Nes83, Nes04] to solve (7). Accelerated gradient descent is used for its superior performance over vanilla gradient descent in many convex and nonconvex problems, while it also only requires first-order information about the objective function. Details of this algorithm for our LoRAM optimization are given in Algorithm 2.

Algorithm 2 Accelerated Gradient Descent of LoRAM (LoRAM-AGD)
Input: Initial point $x_{0}=(X_{0},Y_{0})\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, objective function $F=f+h$ with $h$ defined in (3), tolerance $\epsilon$.
Output: $x_{t}\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$
1:  Make a gradient descent step: $x_{1}=x_{0}-s_{0}\nabla F(x_{0})$ for an initial stepsize $s_{0}>0$
2:  Initialize: $y_{0}=x_{0}$, $y_{1}=x_{1}$, $t=1$.
3:  while $\|\nabla F(x_{t})\|>\epsilon$ do
4:     Compute $\nabla F(y_{t})=\nabla f(y_{t})+\nabla h(y_{t})$ # using Algorithm 1 for $\nabla h(y_{t})$ (9)
5:     Compute the Barzilai–Borwein stepsize: $s_{t}=\frac{\|z_{t-1}\|^{2}}{\langle z_{t-1},w_{t-1}\rangle}$, where $z_{t-1}=y_{t}-y_{t-1}$ and $w_{t-1}=\nabla F(y_{t})-\nabla F(y_{t-1})$.
6:     Update with Nesterov's acceleration:
       $x_{t+1}=y_{t}-s_{t}\nabla F(y_{t}),$   (16)
       $y_{t+1}=x_{t+1}+\frac{t}{t+3}(x_{t+1}-x_{t}).$
7:     $t=t+1$
8:  end while

Specifically, in line 5 of Algorithm 2, the Barzilai–Borwein (BB) stepsize [BB88] is used for the descent step (16). The computation of the BB stepsize $s_{t}$ requires evaluating the inner product $\langle z_{t-1},w_{t-1}\rangle$ and the norm $\|z_{t-1}\|$, where $z_{t-1},w_{t-1}\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$; we choose the Euclidean inner product as the metric on $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$:

$\langle z,w\rangle=\operatorname{tr}({z^{(1)}}^{\mathrm{T}}w^{(1)})+\operatorname{tr}({z^{(2)}}^{\mathrm{T}}w^{(2)})$

for any pair of points $z=(z^{(1)},z^{(2)})$ and $w=(w^{(1)},w^{(2)})$ in $\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$. Note that one can always use backtracking line search based on the stepsize estimate (line 5). We choose to use the BB stepsize directly since it does not require any evaluation of the objective function, and thus avoids the nonnegligible cost of computing the matrix exponential trace in $h$ (3). We refer to [CDHS18] for a comprehensive view of accelerated methods for nonconvex optimization.
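
The following Python sketch outlines the main loop of Algorithm 2 over pairs $(X,Y)$, assuming a user-supplied grad_F that returns the gradient pair at a given point pair (with the $\nabla h$ part computed by Algorithm 1); it is a simplified illustration, e.g., the stopping test uses the gradient at $y_{t}$:

    import numpy as np

    def inner(z, w):
        """Euclidean metric on R^{d x r} x R^{d x r}."""
        return np.sum(z[0] * w[0]) + np.sum(z[1] * w[1])

    def loram_agd(x0, grad_F, s0=1e-3, tol=1e-6, max_iter=500):
        g = grad_F(x0)
        x = (x0[0] - s0 * g[0], x0[1] - s0 * g[1])     # line 1: initial gradient step
        y_prev, y, g_prev = x0, x, g                   # y0 = x0, y1 = x1
        for t in range(1, max_iter + 1):
            g = grad_F(y)
            if np.sqrt(inner(g, g)) < tol:
                break
            z = (y[0] - y_prev[0], y[1] - y_prev[1])
            w = (g[0] - g_prev[0], g[1] - g_prev[1])
            s = inner(z, z) / inner(z, w)              # Barzilai-Borwein stepsize (line 5)
            x_new = (y[0] - s * g[0], y[1] - s * g[1]) # descent step (16)
            y_new = (x_new[0] + t / (t + 3.0) * (x_new[0] - x[0]),
                     x_new[1] + t / (t + 3.0) * (x_new[1] - x[1]))
            x, y_prev, y, g_prev = x_new, y, y_new, g
        return x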

Due to the nonconvexity of $h$ (3) (see Proposition 3), and thus of (7), we aim at finding stationary points of (7). In particular, empirical results in Section 5.3 show that the solutions found by the proposed method, with close-to-zero or even zero SHD to the true DAGs, are close to global optima in practice.

5 Experimental validation

This section investigates the performance (computational gains and accuracy loss) of the proposed gradient approximation method (Algorithm 1), and thereafter reports on the performance of the LoRAM projection (6) compared to NoTears [ZARX18]. The sensitivity of the proposed method to the rank parameter $r$ is also investigated.

The implementation is available at https://github.com/shuyu-d/loram-exp.

5.1 Gradient computations

We compare the performance of Algorithm 1 for gradient approximation with the exact computation in the following settings: number of nodes $d\in\{100,\ 500,\ 10^{3},\ 2\cdot 10^{3},\ 3\cdot 10^{3},\ 5\cdot 10^{3}\}$, $r=40$, and sparsity ($\frac{|\Omega|}{d^{2}}$) of the index set $\Omega$ in $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2},\ 5\cdot 10^{-2}\}$. The results shown in Figure 2 are based on the computation of $\nabla h(X,Y)$ (9) at randomly generated points $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, where $X$ and $Y$ are Gaussian matrices.

Figure 2: (a) Runtime (in log scale) for computing $\nabla h(X,Y)$. (b) Cosine similarities between the approximate and the exact gradients.

From Figure 2 (a), Algorithm 1 yields a significant reduction in the runtime for computing the gradient of $h$ (3) at the expense of a very moderate loss of accuracy (Figure 2 (b)): the approximate gradients are mostly sufficiently aligned with the exact gradients in the considered range of graph sizes and sparsity levels.

5.2 Sensitivity to the rank parameter $r$

In this experiment, we generate the input matrix $Z_{0}$ of problem (6) as

$Z_{0}=A^{\star}+E,$   (17)

where $A^{\star}$ is a given $d\times d$ DAG matrix and $E$ is the matrix of additive noisy edges that break the acyclicity of the observed graph $Z_{0}$.

The ground-truth DAG matrix $A^{\star}$ is generated from the acyclic Erdős-Rényi (ER) model (in the same way as in [ZARX18]), with a sparsity rate $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$. The noise graph $E$ in (17) is defined as $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$, which consists of edges that create confusion between causes and effects, since these edges are reversed, pointing from the ground-truth effects to their respective causes. We evaluate the performance of LoRAM-AGD on the proximal mapping computation (7) for $d=500$ and different values of the rank parameter $r$. In all these evaluations, the candidate set $\Omega$ is fixed to $\operatorname{supp}(Z_{0})$; see Remark 9.
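
A Python sketch of this synthetic setting (17): an acyclic Erdős-Rényi ground truth $A^{\star}$ plus reversed, anti-causal edges $\sigma_{E}{A^{\star}}^{\mathrm{T}}$. The generator below is illustrative and not necessarily identical to the authors' script:

    import numpy as np

    def make_er_dag(d, rho, rng):
        """Random weighted DAG matrix: ER edges kept strictly below the diagonal of a random node order."""
        W = (rng.random((d, d)) < rho) * rng.uniform(0.5, 2.0, size=(d, d))
        W = np.tril(W, k=-1)                    # strictly lower triangular => acyclic
        perm = rng.permutation(d)
        return W[np.ix_(perm, perm)]            # hide the topological order

    rng = np.random.default_rng(0)
    A_star = make_er_dag(d=500, rho=5e-3, rng=rng)
    Z0 = A_star + 0.4 * A_star.T                # equation (17) with sigma_E = 0.4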

We measure the accuracy of the projection result by the false discovery rate (FDR, lower is better), false positive rate (FPR), true positive rate (TPR, higher is better), and structural Hamming distance (SHD, lower is better) of the solution compared to the DAG $\mathcal{G}(A^{\star})$.

Figure 3: Rank profiles and projection accuracy for different values of $r$ (the number of columns of $X$ and $Y$ in (2)). Number of nodes $d=500$. The sparsity of the input non-DAG matrix $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$ (17) is $\rho^{\star}\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$.

The results in Figure 3 suggest that:

  • (i)

    For each sparsity level, increasing the rank parameter $r$ generally improves the projection accuracy of the LoRAM.

  • (ii)

    While the rank parameter $r$ of (2) is at most around 5, which is only about $\frac{1}{100}$-th of the rank of the input matrix $Z_{0}$ and of the ground truth $A^{\star}$, the rank of the solution $A_{\Omega}^{*}=A_{\Omega}(X^{*},Y^{*})$ attains the same value as $\operatorname{rank}(A^{\star})$. This means that the rank representativity of the LoRAM goes beyond the value of $r$. This phenomenon is understandable in the present case, where the candidate set $\Omega=\operatorname{supp}(Z_{0})$ is fairly close to the sparse edge set $\operatorname{supp}(A^{\star})$.

  • (iii)

    The projection accuracy in TPR and FDR (and also SHD, see Figure 5 in Section C.2) of LoRAM-AGD is close to optimal on a wide interval $25\leq r\leq 50$ of the tested ranks, and is fairly stable on this interval.

5.3 Scalability

We examine two different types of noisy edges (in $E$). Case (a): Bernoulli-Gaussian, $E=E(\sigma_{E},p)$, where $E_{ij}(\sigma_{E},p)\neq 0$ with probability $p$ and all nonzeros of $E(\sigma_{E},p)$ are i.i.d. samples of $\mathcal{N}(0,\sigma_{E})$. Case (b): cause-effect confusions, $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$, as in Section 5.2.

The initial factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$ are random Gaussian matrices. For the LoRAM, we set $\Omega$ to be the support of $Z_{0}$; see Remark 9. The penalty parameter $\lambda$ of (7) is varied in $\{2.0,5.0\}$ with no fine tuning.

In case (a), we test various noise levels for $d=500$ nodes. In case (b), we test various graph dimensions, with $(d,r)\in\{100,200,\dots,2000\}\times\{40,80\}$. The results are given in Table 2 and Table 3, respectively.

Table 2: Results in case (a): the noise graph is $E(\sigma_{E},p)$ for $p=5\cdot 10^{-4}$ and $d=500$. Each cell reports LoRAM (ours) / NoTears.
$\sigma_{E}$ | Runtime (sec) | TPR | FDR | SHD
0.1    1.34 / 5.78 1.0 / 1.0 9.9e-3 / 0.0e+0 25.0 / 0.0
0.2    2.65 / 11.58 1.0 / 1.0 9.5e-3 / 0.0e+0 24.0 / 0.0
0.3    1.35 / 28.93 1.0 / 1.0 9.5e-3 / 8.0e-4 24.0 / 2.0
0.4    1.35 / 18.03 1.0 / 1.0 9.9e-3 / 3.2e-3 25.0 / 9.0
0.5    1.35 / 12.52 1.0 / 1.0 9.9e-3 / 5.2e-3 25.0 / 13.0
0.6    2.57 / 16.07 1.0 / 1.0 9.5e-3 / 4.4e-3 24.0 / 11.0
0.7    1.35 / 18.72 1.0 / 1.0 9.9e-3 / 5.2e-3 25.0 / 13.0
0.8    1.35 / 32.03 1.0 / 1.0 9.9e-3 / 4.8e-3 25.0 / 15.0
Table 3: Results in case (b): the noise graph $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$ contains cause-effect confusions, for $\sigma_{E}=0.4$. Each cell reports LoRAM (ours) / NoTears.
$(\lambda,r)$ | $d$ | Runtime (sec) | TPR | FDR | SHD
(5.0, 40) 100     1.82 / 0.67 1.00 / 1.00 0.00e+0 / 0.0 0.0 / 0.0
(5.0, 40) 200     2.20 / 3.64 0.98 / 0.95 2.50e-2 / 0.0 1.0 / 2.0
(5.0, 40) 400     2.74 / 16.96 0.98 / 0.98 2.50e-2 / 0.0 4.0 / 4.0
(5.0, 40) 600     3.40 / 42.65 0.98 / 0.96 1.67e-2 / 0.0 6.0 / 16.0
(5.0, 40) 800     4.23 / 83.68 0.99 / 0.97 7.81e-3 / 0.0 5.0 / 22.0
(2.0, 80) 1000     7.63 / 136.94 1.00 / 0.96 0.00e+0 / 0.0 0.0 / 36.0
(2.0, 80) 1500     13.34 / 437.35 1.00 / 0.96 8.88e-4 / 0.0 2.0 / 94.0
(2.0, 80) 2000     20.32 / 906.94 1.00 / 0.96 7.49e-4 / 0.0 3.0 / 148.0
Figure 4: (a) An iteration history of LoRAM-AGD for (7) with $d=1000$. (b) Runtime (in log scale) comparisons for different numbers $d$ of nodes.

The results in Table 2 and Table 3 show that:

  • (i)

    In case (a), the solutions of LoRAM-AGD are close to the ground truth, despite slightly higher errors than NoTears in terms of FDR and SHD.

  • (ii)

    In case (b), the solutions of LoRAM-AGD are almost identical to the ground truth $A^{\star}$ in (17) for all performance indicators (also see Section C.2).

  • (iii)

    In terms of computation time (see Figure 4), the proposed LoRAM-AGD achieves significant speedups (around 50 times faster for $d=2000$) compared to NoTears, and also has a smaller growth rate with respect to the problem dimension $d$, showing good scalability.

6 Discussion and Perspectives

This paper tackles the projection of matrices onto DAG matrices, motivated by the identification of linear causal graphs. The line of research built upon the LiNGAM algorithms [SHHK06, SIS+11] has recently been revisited through the formal characterization of DAGness in terms of a continuously differentiable constraint by [ZARX18]. The NoTears approach of [ZARX18], however, suffers from an $O(d^{3})$ complexity in the number $d$ of variables, precluding its usage for large-scale problems.

Unfortunately, this difficulty is not related to the complexity of the model (its number of parameters): the low-rank approach investigated by NoTears-low-rank [FZZ+20] also suffers from the same $O(d^{3})$ complexity, incurred in the gradient-based optimization phase.

The present paper addresses this difficulty by combining a sparsification mechanism with the low-rank model and using a new approximate gradient computation. This approximate gradient takes inspiration from the approach of [AMH11] for computing the action of matrix exponentials based on truncated Taylor expansions. The approximation eventually yields a complexity of $O(d^{2}r)$, where the rank parameter is small ($r\leq C\ll d$). The experimental validation of the approach shows that the approximate gradient entails no significant error with respect to the exact gradient, for LoRAM matrices with a bounded norm, in the considered range of graph sizes ($d$) and sparsity levels. The proposed algorithm, combining the approximate gradient with Nesterov's acceleration method [Nes83, Nes04], yields gains of orders of magnitude in computation time compared to NoTears on the same artificial benchmark problems. The performance indicators reveal almost no loss for the projection problem in the setting of case (b) (where the matrix to be projected is perturbed with anti-causal links), while minor losses in terms of false discovery rate (FDR) are incurred in the setting of case (a) (with random additive spurious links).

Further developments aim to extend the approach and address the identification of causal DAG matrices from observational data. A longer term perspective is to extend LoRAM to the non-linear case, building upon the introduction of latent causal variables and taking inspiration from the non-linear independent component analysis and generalized contrastive losses [HST19]. Another perspective relies on the use of auto-encoders to yield a compressed representation of high-dimensional data, while constraining the structure of the encoder and decoder modules to enforce the acyclic property.

Acknowledgement

The authors warmly thank Fujitsu Laboratories Ltd., which funded the first author, and in particular Hiroyuki Higuchi and Koji Maruhashi, for many discussions.

References

  • [ABGLP19] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • [AMH11] A. H. Al-Mohy and N. J. Higham. Computing the action of the matrix exponential, with an application to exponential integrators. SIAM Journal on Scientific Computing, 33(2):488–511, 2011.
  • [BB88] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
  • [BPE14] P. Bühlmann, J. Peters, and J. Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.
  • [CDHS18] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
  • [Chi96] D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data, pages 121–130. Springer, 1996.
  • [FZZ+20] Z. Fang, S. Zhu, J. Zhang, Y. Liu, Z. Chen, and Y. He. Low rank directed acyclic graphs and causal structure learning. arXiv preprint arXiv:2006.05691, 2020.
  • [Hab18] H. E. Haber. Notes on the matrix exponential and logarithm. Santa Cruz Institute for Particle Physics, University of California: Santa Cruz, CA, USA, 2018.
  • [HST19] A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • [KGG+18] D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. arXiv preprint arXiv:1803.04929, 2018.
  • [MW15] S. L. Morgan and C. Winship. Counterfactuals and causal inference. Cambridge University Press, 2015.
  • [Nes83] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2)O(1/k^{2}). Soviet Mathematics Doklady, 27:372–376, 1983.
  • [Nes04] Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Publishing Company, Incorporated, 1 edition, 2004. doi:10.1007/978-1-4419-8853-9.
  • [NGZ20] I. Ng, A. Ghassami, and K. Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
  • [PBM16] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
  • [Pea09] J. Pearl. Causality. Cambridge University Press, 2009.
  • [PJS17] J. Peters, D. Janzing, and B. Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • [SB09] M. Stephens and D. J. Balding. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics, 10(10):681–690, 2009.
  • [SG21] A. Sauer and A. Geiger. Counterfactual generative networks. In International Conference on Learning Representations (ICLR), 2021.
  • [SHHK06] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.
  • [SIS+11] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. The Journal of Machine Learning Research, 12:1225–1248, 2011.
  • [YCGY19] Y. Yu, J. Chen, T. Gao, and M. Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019.
  • [ZARX18] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf.
  • [ZDA+20] X. Zheng, C. Dan, B. Aragam, P. Ravikumar, and E. Xing. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR, 2020.

Appendix A Proofs (Sections 2–4)

Proof of Corollary 2.

By condition (i), all entries of the matrix $B:=\sigma(A)$ are nonnegative. Hence, for all $k\geq 1$,

$\operatorname{tr}(B^{k})\geq 0.$   (18)

By combining (18) with condition (ii), namely $\operatorname{supp}(B)=\operatorname{supp}(A)$, we see that $\operatorname{tr}(B^{k})$ equals the sum of the edge-weight products along all $k$-cycles in the graph of $A$ (the graph whose adjacency matrix is $\operatorname{supp}(A)$), by classical graph theory (which can be shown by induction). Therefore, if $A$ is a DAG matrix, i.e., there are no cycles (of any length) in the graph of $A$, then $\operatorname{tr}(B^{k})=0$ for all $k\geq 1$, hence $\operatorname{tr}(\exp(B))=\operatorname{tr}(I_{d\times d})=d$; conversely, if $\operatorname{tr}(\exp(B))=d$, then $\sum_{k\geq 1}\frac{1}{k!}\operatorname{tr}(B^{k})=0$, hence $\operatorname{tr}(B^{k})=0$ for all $k\geq 1$ (due to (18)), which implies that there are no cycles of any length $k\geq 1$ in the graph of $A$. $\square$

Proof of Proposition 3.

(i) All entries of any matrix $\bar{A}\in\mathbb{R}^{d\times d}_{+}$ are nonnegative, hence $\operatorname{tr}(\bar{A}^{k})\geq 0$ for all $k\geq 1$. Therefore, $\operatorname{tr}(\exp(\bar{A}))=\operatorname{tr}(I_{d\times d})+\sum_{k\geq 1}\frac{1}{k!}\operatorname{tr}(\bar{A}^{k})\geq d$. Moreover, Corollary 2 shows that $\operatorname{tr}(\exp(\bar{A}))=d$ if and only if $\bar{A}$ is a DAG matrix.

(ii) First, we show the nonconvexity of $\operatorname{tr}(\exp(\cdot))$ on $\mathbb{R}^{2\times 2}$ (for $d=2$) using property (i) and Corollary 2: let $A=\begin{pmatrix}0&1\\ 0&0\end{pmatrix}$ and $B=\begin{pmatrix}0&0\\ 1&0\end{pmatrix}$. Notice that $A$ and $B$ are two different DAG matrices. Then, consider any matrix on the line segment between $A$ and $B$:

$M_{\alpha}=\alpha A+(1-\alpha)B=\begin{pmatrix}0&\alpha\\ 1-\alpha&0\end{pmatrix}\ \text{ for }\alpha\in(0,1).$

Then $M_{\alpha}$ is a weighted adjacency matrix of a graph with $2$-cycles. Hence $M_{\alpha}$ is nonnegative and is not a DAG matrix, and it follows from property (i) above that $\tilde{h}(M_{\alpha})>d$. Property (i) also shows that $\tilde{h}(A)=\tilde{h}(B)=d$, since $A$ and $B$ are DAG matrices. Therefore, we have

$\tilde{h}(M_{\alpha})>d=\alpha\tilde{h}(A)+(1-\alpha)\tilde{h}(B).$

Hence $\tilde{h}$ is nonconvex along the segment $\{M_{\alpha}=\alpha A+(1-\alpha)B:\alpha\in(0,1)\}\subset\mathbb{R}^{2\times 2}$. To extend this example to $\mathbb{R}^{d\times d}$, it suffices to consider a similar pair of DAG matrices $A^{\prime}$ and $B^{\prime}$ which differ only in their first $2\times 2$ submatrices, such that $A^{\prime}_{:2,:2}=A$ and $B^{\prime}_{:2,:2}=B$.

(iii) First, the following definition and property are needed for the differential calculus of the matrix exponential function.

Definition 12.

For any pair of matrices $A,B\in\mathbb{R}^{d\times d}$, the commutator of $A$ and $B$ is defined and denoted as

$\operatorname{ad}_{A}(B):=[A,B]=AB-BA.$

The operator $\operatorname{ad}_{A}$ is linear. More generally, the powers of the commutator $\operatorname{ad}_{A}(\cdot)$ are defined as follows: $(\operatorname{ad}_{A})^{0}=I$, and for any $k\geq 1$,

$(\operatorname{ad}_{A})^{k}(B)=\underbrace{[A,\dots,[A,[A,B]]\dots]}_{k\text{ nested commutators}}.$

Proposition 13 ([Hab18], Theorem 2.b).

For any $A\in\mathbb{R}^{d\times d}$ and any $\xi\in\mathbb{R}^{d\times d}$, the derivative of $t\mapsto\exp(A+t\xi)$ at $t=0$ is $\frac{\mathrm{d}}{\mathrm{d}t}\exp(A+t\xi)\big|_{t=0}=e^{A}\tilde{f}(\operatorname{ad}_{A})(\xi)$, where $\tilde{f}(z)=\frac{1-e^{-z}}{z}=\sum_{n=0}^{\infty}\frac{(-1)^{n}}{(n+1)!}z^{n}$.

Now, we calculate the directional derivative of $\tilde{h}$ along a direction $\xi\in\mathbb{R}^{d\times d}$. Note that the differential of $\operatorname{tr}(\cdot)$ is the trace itself everywhere; hence, by the chain rule and Proposition 13, for any $\xi\in\mathbb{R}^{d\times d}$, the directional derivative of $\tilde{h}$ along $\xi$ is

$\frac{\mathrm{d}}{\mathrm{d}t}\operatorname{tr}(e^{A+t\xi})\big|_{t=0}=\operatorname{tr}\Big(\frac{\mathrm{d}}{\mathrm{d}t}(e^{A+t\xi})\big|_{t=0}\Big)=\operatorname{tr}(e^{A}\tilde{f}(\operatorname{ad}_{A})(\xi))=\operatorname{tr}(e^{A}\xi)+\sum_{k=1}^{\infty}\frac{(-1)^{k}}{(k+1)!}\operatorname{tr}(e^{A}(\operatorname{ad}_{A})^{k}(\xi)).$   (19)

Next, we show that the last term of (19) is zero. For $k=1$, by cycling the three matrices inside the trace, we have

$\operatorname{tr}(e^{A}(-\operatorname{ad}_{A}(\xi)))=\operatorname{tr}(e^{A}(\xi A-A\xi))=\operatorname{tr}(Ae^{A}\xi-e^{A}A\xi)=\operatorname{tr}(\operatorname{ad}_{A}(e^{A})\xi)=0$   (20)

for any $\xi$, where (20) holds because $\operatorname{ad}_{A}(e^{A})=[A,e^{A}]=0$, since $A$ and $e^{A}$ commute (for any $A$). Furthermore, we deduce that for all $k\geq 2$:

$\operatorname{tr}\big(e^{A}(-1)^{k}(\operatorname{ad}_{A})^{k}(\xi)\big)=(-1)^{k}\operatorname{tr}\big(e^{A}\operatorname{ad}_{A}((\operatorname{ad}_{A})^{k-1}(\xi))\big)=0,$

because (20) holds for any direction, including $\tilde{\xi}:=(\operatorname{ad}_{A})^{k-1}(\xi)$. Therefore, equation (19) becomes $\frac{\mathrm{d}}{\mathrm{d}t}\tilde{h}(A+t\xi)\big|_{t=0}=\operatorname{tr}(e^{A}\xi)$, and through the identification $\frac{\mathrm{d}}{\mathrm{d}t}\tilde{h}(A+t\xi)\big|_{t=0}=\operatorname{tr}({\xi}^{\mathrm{T}}\nabla\tilde{h}(A))$, the gradient reads $\nabla\tilde{h}(A)={(\exp(A))}^{\mathrm{T}}$. $\square$

Proof of Proposition 7.

From Definition 6, the residual matrix $\xi:=A_{\Omega}^{*}-Z_{0}\in\mathbb{R}^{d\times d}$ satisfies $\|\xi\|_{\max}\leq\epsilon^{\star}_{r,\Omega}\|Z_{0}\|_{\max}$. Therefore, by the Taylor expansion and Proposition 3-(iii), there exists a constant $C_{1}\geq 0$ such that

$\operatorname{tr}(\exp(\sigma(A_{\Omega}^{*})))-d=\underbrace{\operatorname{tr}(\exp(\sigma(Z_{0})))-d}_{=0}+\operatorname{tr}(e^{\sigma(Z_{0})}\sigma(\xi))+C_{1}\|\sigma(\xi)\|_{\mathrm{F}}^{2}$
$=\operatorname{tr}(e^{\sigma(Z_{0})}\sigma(\xi))+C_{1}\|\sigma(\xi)\|_{\mathrm{F}}^{2}$   (21)
$\leq\|\xi\|_{\max}\big(\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)+C_{1}\|Z_{0}\|_{0}\|\xi\|_{\max}^{2}$
$\leq\epsilon^{\star}_{r,\Omega}\|Z_{0}\|_{\max}\big(\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\big)+C_{1}\|Z_{0}\|_{0}{\epsilon^{\star}_{r,\Omega}}^{2}\|Z_{0}\|_{\max}^{2},$

where (21) holds since $Z_{0}$ is a DAG matrix (in view of Theorem 1), and $\|Z_{0}\|_{0}$ is the number of nonzeros of $Z_{0}$. It follows that, with relative error $\epsilon^{\star}_{r,\Omega}\leq 1$ and $\|Z_{0}\|_{\max}\leq 1$ (without loss of generality),

$\operatorname{tr}(\exp(\sigma(A_{\Omega}^{*})))-d\leq\epsilon^{\star}_{r,\Omega}\Big(C_{1}\|Z_{0}\|_{0}+\sum_{ij}[e^{\sigma(Z_{0})}]_{ij}\Big)\|Z_{0}\|_{\max},$

which entails the result. \hfill\square

Proof of Lemma 10.

The Hadamard product $\odot$ is commutative, hence

$\mathrm{D}\sigma_{2}(Y)[\xi]=\frac{\mathrm{d}}{\mathrm{d}t}(Y+t\xi)\odot(Y+t\xi)\big|_{t=0}=2Y\odot\xi.$

$\hat{\mathrm{D}}\sigma_{\mathrm{abs}}$ (4b) is a subdifferential of $\sigma_{\mathrm{abs}}$ (4), since the sign function is a subdifferential of $z\to|z|$ and $\sigma_{\mathrm{abs}}(\cdot)$ is an elementwise matrix operator. $\square$

Appendix B Details of computational methods

B.1 Computational cost of the exact gradient

The computation of the function value and gradient of $h$ (3) mainly involves two types of operations: (i) the computation of the $d\times d$ sparse matrix $\sigma(A_{\Omega}(X,Y))$, given Definition 4 and $\sigma$, and (ii) the computation of the matrix exponential-related terms (10) or (11).

The exact computation of the gradient $\nabla h(X,Y)$ (Theorem 11) proceeds as follows (a dense reference implementation of these steps is sketched below):

(i) compute $A_{\Omega}=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})$ (2): this step costs $2|\Omega|r$ flops; see Algorithm 3 in Section B.2. The output $A_{\Omega}$ is a $d\times d$ sparse matrix. A byproduct of this step is $\mathrm{D}\sigma(A_{\Omega})$ ($2A_{\Omega}$ or $\mathrm{sign}(A_{\Omega})$);
(ii) compute $\exp(\sigma(A_{\Omega}))$: this step costs $O(d^{3})$;
(iii) compute $M:=\exp(\sigma(A_{\Omega}))\odot\mathrm{D}\sigma(A_{\Omega})$, which costs $|\Omega|$ flops. The output $M$ is a $d\times d$ sparse matrix;
(iv) compute the sparse-dense matrix multiplication $(M,Y)\mapsto MY$: this step costs $2|\Omega|r$ flops.

Therefore, the total cost is $O(d^{3}+4|\Omega|r)$.
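
For reference, a dense Python computation of $\nabla h$ following Theorem 11 with $\sigma=\sigma_{2}$ (the expm call is the $O(d^{3})$ bottleneck that Algorithm 1 avoids); the mask is a 0/1 matrix encoding $\mathcal{P}_{\Omega}$:

    import numpy as np
    from scipy.linalg import expm

    def exact_grad_h(X, Y, mask):
        A = mask * (X @ Y.T)          # step (i): A_Omega = P_Omega(X Y^T)
        E = expm(A * A)               # step (ii): exp(sigma_2(A_Omega)), O(d^3)
        S = 2.0 * E.T * A             # S_2 = 2 exp(sigma_2(A_Omega))^T ⊙ A_Omega, cf. (10)
        return S @ Y, S.T @ X         # gradient (9): (S Y, S^T X)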

In scenarios with large and sparse graphs, i.e., $r\ll d$ and $|\Omega|=\rho d^{2}$ with $\rho\ll 1$, the complexity of computing the exact gradient of the learning criterion is thus cubic in $d$, incurred by the matrix exponential in step (ii); this step is necessary because of the Hadamard product that lies between the action of $\exp(\sigma(A_{\Omega}))$ and the thin factor matrix $X$ (respectively, $Y$) in (10) or (11): one needs to compute the $d\times d$ Hadamard product (before the matrix multiplication with $X$), which requires forming the $d\times d$ matrix exponential $\exp(\sigma(A_{\Omega}))$ explicitly.

For this reason, LoRAM, as well as NoTears-low-rank [FZZ+20], faces a cubic complexity for the exact computation of the gradient of $h$ (3), despite the significant reduction in model complexity.

B.2 Computation of the LoRAM matrix

Algorithm 3 LoRAM matrix representation
Input: Thin factor matrices $(X,Y)\in\mathbb{R}^{d\times r}\times\mathbb{R}^{d\times r}$, index set $\Omega=(I,J)$
Output: $Z=\mathcal{P}_{\Omega}(X{Y}^{\mathrm{T}})\in\mathbb{R}^{d\times d}$ and byproducts $Z^{\mathrm{abs}}=\mathrm{sign}(Z)$, $Z^{\mathrm{sq}}=2Z$
1:  Initialize: $Z=0$
2:  for $s=1,\dots,|\Omega|$ do
3:     for $k=1,\dots,r$ do
4:        $Z_{I(s),J(s)}=Z_{I(s),J(s)}+X_{I(s),k}Y_{J(s),k}$
5:     end for
6:  end for
7:  Elementwise operations: $Z^{\mathrm{sq}}=2Z$, $Z^{\mathrm{abs}}=\mathrm{sign}(Z)$

Appendix C Experiments

C.1 Choice of the parameters $\lambda$ and $\epsilon^{\star}_{r,\Omega}$

The proposed algorithms (Algorithms 1–2) are tested within the penalty method (7) for a parameter $\lambda>0$.

In the context of the projection problem (6), the following features are notable factors that influence the choice of an optimal $\lambda$: (i) the type of graph underlying the input matrix $Z_{0}$, and (ii) the sparsity of the graph matrix $Z_{0}$. Note that the scale of $Z_{0}$ is irrelevant to the choice of $\lambda$, since we rescale the input graph matrix $Z_{0}\in\mathbb{R}^{d\times d}$ as $Z_{0}\leftarrow\frac{c_{0}}{10\|Z_{0}\|_{\mathrm{F}}}Z_{0}$ for a constant $c_{0}=10^{-1}$, without loss of generality (see Section 4.1), so that the scale of $Z_{0}$ always stays at the level of $\frac{c_{0}}{10}$ (in Frobenius norm). Therefore, we are able to fix a parameter set $\Lambda$ for the following experimental settings: (i) the type of graph is fixed to be the acyclic ER graph set, as in [ZARX18], and (ii) the sparsity level of $Z_{0}$ is fixed around $\rho\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2},\ 5\cdot 10^{-2}\}$. Concretely, we select the penalty parameter $\lambda$, among $N_{\lambda}=5$ values in the fixed set $\Lambda=\{1.0,2.0,\dots,5.0\}$, with respect to the objective function value of the proximal mapping (7).

On the other hand, the choice of the hard threshold $\epsilon^{\star}_{r,\Omega}$ in (8) is straightforward, according to Definition 6 of the relative error of LoRAM w.r.t. a given subset $\mathcal{D}_{d\times d}^{\star}$ of DAGs. For the set of graphs in our experiments, we use a moderate value $\epsilon^{\star}_{r,\Omega}=5\cdot 10^{-2}$, which has the effect of eliminating only very weakly weighted edges.

C.2 Additional experimental details

Figure 5: Projection accuracy in TPR, FDR and SHD for different values of $r$ (the number of columns of $X$ and $Y$ in (2)). Number of nodes $d=500$. The sparsity of the input non-DAG matrix $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$ (17) is $\rho^{\star}\in\{10^{-3},\ 5\cdot 10^{-3},\ 10^{-2}\}$.
Table 4: Case (b): the noise graph $E=\sigma_{E}{A^{\star}}^{\mathrm{T}}$ creates cause-effect confusions, for $\sigma_{E}=0.4$. (FDR, TPR, FPR, SHD) are the evaluation scores of the solution compared to the ground-truth DAG matrix $A^{\star}$.
Algorithm | $(\lambda,r)$ | $d$ | Runtime (sec) | FDR | TPR | FPR | SHD
LoRAM (5.0, 40) 100 1.82 0.0 1.0 0.0 0.0
(5.0, 40) 200 2.20 2.5e-2 0.98 5.0e-5 1.0
(5.0, 40) 400 2.74 2.5e-2 0.98 5.0e-5 4.0
(5.0, 40) 600 3.40 1.7e-2 0.98 3.4e-5 6.0
(5.0, 40) 800 4.23 7.8e-3 0.99 1.6e-5 5.0
(2.0, 80) 1000 7.63 0.0 1.0 0.0 0.0
(2.0, 80) 1500 13.34 8.9e-4 1.0 1.8e-6 2.0
(2.0, 80) 2000 20.32 7.5e-4 1.0 1.5e-6 3.0
NoTears - 100 0.67 0.0 1.00 0.0 0.0
- 200 3.64 0.0 0.95 0.0 2.0
- 400 16.96 0.0 0.98 0.0 4.0
- 600 42.65 0.0 0.96 0.0 16.0
- 800 83.68 0.0 0.97 0.0 22.0
- 1000 136.94 0.0 0.96 0.0 36.0
- 1500 437.35 0.0 0.96 0.0 94.0
- 2000 906.94 0.0 0.96 0.0 148.0
Figure 6: First $2.5\times 10^{5}$ weighted edges of the solution (red) overlapping the edges of $A^{\star}$ (blue). Graph dimension $d=1000$, rank parameter $r=80$. The recovery scores of the solution are: TPR (higher is better) $=1.0$, FPR (lower is better) $=0$, SHD (lower is better) $=0$. The input graph matrix is $Z_{0}=A^{\star}+\sigma_{E}{A^{\star}}^{\mathrm{T}}$.