
Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks

Simge Küçükyavuz, Department of Industrial Engineering and Management Sciences, Northwestern University ([email protected]). These authors contributed equally to this work.    Ali Shojaie, Department of Biostatistics, University of Washington ([email protected]).    Hasan Manzour, Department of Industrial and Systems Engineering, University of Washington ([email protected]).    Linchuan Wei, Department of Industrial Engineering and Management Sciences, Northwestern University ([email protected]).    Hao-Hsiang Wu, Department of Management Science, National Yang Ming Chiao Tung University ([email protected]).
Abstract

Bayesian Networks (BNs) represent conditional probability relations among a set of random variables (nodes) in the form of a directed acyclic graph (DAG), and have found diverse applications in knowledge discovery. We study the problem of learning the sparse DAG structure of a BN from continuous observational data. The central problem can be modeled as a mixed-integer program with an objective function composed of a convex quadratic loss function and a regularization penalty subject to linear constraints. The optimal solution to this mathematical program is known to have desirable statistical properties under certain conditions. However, the state-of-the-art optimization solvers are not able to obtain provably optimal solutions to the existing mathematical formulations for medium-size problems within reasonable computational times. To address this difficulty, we tackle the problem from both computational and statistical perspectives. On the one hand, we propose a concrete early stopping criterion to terminate the branch-and-bound process in order to obtain a near-optimal solution to the mixed-integer program, and establish the consistency of this approximate solution. On the other hand, we improve the existing formulations by replacing the linear “big-$M$” constraints that represent the relationship between the continuous and binary indicator variables with second-order conic constraints. Our numerical results demonstrate the effectiveness of the proposed approaches.

Keywords: Mixed-integer conic programming, Bayesian networks, directed acyclic graphs, early stopping criterion, consistency.

1 Introduction

A Bayesian network (BN) is a probabilistic graphical model consisting of a labeled directed acyclic graph (DAG) $\mathcal{G}=(V,E)$, in which the vertex set $V=\{V_{1},\dots,V_{m}\}$ corresponds to $m$ random variables, and the edge set $E$ prescribes a decomposition of the joint probability distribution of the random variables based on their parents in $\mathcal{G}$. The edge set $E$ encodes Markov relations on the nodes in the sense that each node is conditionally independent of its non-descendants given its parents. BNs have been used in knowledge discovery [49, 9], classification [1], feature selection [22], latent variable discovery [29] and genetics [37]. They also play a vital part in causal inference [40].

In this paper, we propose reformulations of the mixed-integer quadratic programs (MIQPs) for learning the optimal DAG structure of BNs given $n$ continuous observations from a system of linear structural equation models (SEMs). While there exist exact integer-programming (IP) formulations for learning DAG structure with discrete data [11, 12, 27, 50, 4, 34, 35, 5, 13, 14], the development of tailored computational tools for learning the optimal DAG structure from continuous data has received less attention. In principle, exact methods developed for discrete data can be applied to continuous data. However, such methods result in formulations with exponentially many binary variables. A common practice to circumvent the exponential number of binary variables is to limit the in-degree of each node [12, 14, 5], but this may result in sub-optimal solutions. In contrast, MIQP formulations for learning DAGs corresponding to linear SEMs require only a polynomial number of binary variables. This is because for BNs with linear SEMs, the score function — i.e., the penalized negative log-likelihood (PNL) — can be written explicitly as a function of the coefficients of the linear SEMs [45, 52, 38, 32]. In contrast to the existing MIQPs [38, 32], our reformulations exploit the convex quadratic objective and the relationship between the continuous and binary variables to improve the strength of the continuous relaxations.

Continuous BNs with linear SEMs have witnessed a growing interest in the statistics and computer science communities [52, 43, 31, 23, 47]. In particular, it has been shown that the solution obtained from solving the PNL augmented by $\ell_{0}$ regularization, which introduces a penalty on the number of non-zero arc weights in the estimated DAG, achieves desirable statistical properties [41, 52, 31]. Moreover, if the model is identifiable [41, 31], that is, when the true causal graph can be identified from the joint distribution, then such a solution is guaranteed to uncover the true causal DAG when the sample size $n$ is large enough. However, given the difficulty of obtaining exact solutions, existing approaches for learning DAGs from linear SEMs have primarily relied on heuristics, using techniques such as coordinate descent [21, 2, 26] and non-convex continuous optimization [61]. Unfortunately, these heuristics are not guaranteed to achieve the desirable properties of the global optimal solution. Moreover, it is difficult to evaluate the statistical properties of a sub-optimal solution with no optimality guarantees [28]. To bridge this gap, in this paper we develop mathematical formulations for learning optimal BNs from linear SEMs using a PNL objective with $\ell_{0}$ regularization. By connecting the optimality gap of the mixed-integer program to the statistical properties of the solution, we also establish an early stopping criterion under which we can terminate the branch-and-bound procedure and attain a solution which asymptotically recovers the true parameters with high probability.

Our work is related to recent efforts to develop exact tailored methods for DAG learning from continuous data. [57] show that the $A^{\ast}$-lasso algorithm, tailored for DAG structure learning from continuous data with $\ell_{1}$ regularization, which introduces a penalty on the sum of absolute values of the arc weights, is more effective than previous approaches based on dynamic programming [e.g., 46] that are suitable for both discrete and continuous data. [38] develop a mathematical program for DAG structure learning with $\ell_{1}$ regularization. [32] improve and extend the formulation of [38] for DAG learning from continuous data with both $\ell_{0}$ and $\ell_{1}$ regularizations. The numerical experiments of [32] demonstrate that as the number of nodes grows, their MIQP formulation outperforms $A^{\ast}$-lasso and the existing IP methods; this improvement is both in terms of reducing the IP optimality gap, when the algorithm is stopped due to a time limit, and in terms of computational time, when the instances can be solved to optimality. In light of these recent efforts, the current paper makes important contributions to this problem at the intersection of statistics and optimization.

  • The statistical properties of optimal PNL with $\ell_{0}$ regularization have been studied extensively [31, 52]. However, it is often difficult to obtain an optimal solution, and no results have been established on the statistical properties of approximate solutions. In this paper, we give an early stopping criterion for the branch-and-bound process under which the approximate solution gives consistent estimates of the true coefficients of the linear SEM. Our result leverages the statistical consistency of the PNL estimate with $\ell_{0}$ regularization [52, 41] along with the properties of the branch-and-bound method, wherein both lower and upper bound values on the objective function are available at each iteration. By connecting these two properties, we obtain a concrete early stopping criterion, as well as a proof of consistency of the approximate solution. To the best of our knowledge, this result is the first of its kind for DAG learning.

  • In spite of recent progress, a key challenge in learning DAGs from linear SEMs is enforcing bounds on arc weights. This is commonly modeled using the standard “big-$M$ constraint” approach [38, 32]. As shown by [32], this strategy leads to poor continuous relaxations for the problem, which in turn results in slow lower bound improvement in the branch-and-bound tree. In particular, [32] establish that all existing big-$M$ formulations achieve the same continuous relaxation objective function under a mild condition (see Proposition 4). To circumvent this issue, we present a mixed-integer second-order cone program (MISOCP), which gives a tighter continuous relaxation than existing big-$M$ formulations under certain conditions discussed in detail in Section 5.3. This formulation can be solved by powerful state-of-the-art optimization packages. Our numerical results show the superior performance of MISOCP compared to the existing big-$M$ formulations in terms of improving the lower bound and reducing the optimality gap. We also compare our method against state-of-the-art benchmarks [9, 24] on both identifiable and non-identifiable instances, and show that our method provides the best estimation among all methods in most of the networks, especially in the non-identifiable cases.

The rest of the paper is organized as follows. In Section 2, we define the DAG structure learning problem corresponding to linear SEMs, and give a general framework for the problem. In Section 3, we present our early stopping criterion and establish the asymptotic properties of the solution obtained under this stopping rule. We review existing mathematical formulations in Section 4, and present our proposed mathematical formulations in Section 5. Results of comprehensive numerical studies are presented in Section 6. We end the paper with a summary in Section 7.

2 Problem setup: Penalized DAG estimation with linear SEMs

Let $\mathcal{M}=(V,E)$ be an undirected, and possibly cyclic, super-structure graph with node set $V=\{1,2,\dots,m\}$ and edge set $E\subseteq V\times V$; let $\overrightarrow{\mathcal{M}}=(V,\overrightarrow{E})$ be the corresponding bi-directional graph with $\overrightarrow{E}=\{(j,k),(k,j)\,|\,(j,k)\in E\}$. We refer to undirected edges as edges and directed edges as arcs.

We assume that causal effects of continuous random variables in a DAG $\mathcal{G}_{0}$ are represented by $m$ linear regressions of the form

X_{k}=\sum_{j\in pa^{\mathcal{G}_{0}}_{k}}\beta_{jk}X_{j}+\epsilon_{k},\quad k=1,\dots,m, (1)

where $X_{k}$ is the random variable associated with node $k$; $pa^{\mathcal{G}_{0}}_{k}$ represents the parents of node $k$ in $\mathcal{G}_{0}$, i.e., the set of nodes with arcs pointing to $k$; the latent random variable $\epsilon_{k}$ denotes the unexplained variation in node $k$; and the BN parameter $\beta_{jk}$ specifies the effect of node $j$ on node $k$ for $j\in pa^{\mathcal{G}_{0}}_{k}$. The above model is known as a linear SEM [40].

Let $\mathcal{X}=(\mathcal{X}_{1},\dots,\mathcal{X}_{m})$ be the $n\times m$ data matrix with $n$ rows representing i.i.d. samples from each random variable, and $m$ columns representing the random variables $X_{1},\ldots,X_{m}$. The linear SEM (1) can be compactly written in matrix form as $\mathcal{X}=\mathcal{X}B+\mathcal{E}$, where $B=[\beta]\in\mathbb{R}^{m\times m}$ is a matrix with $\beta_{kk}=0$ for $k=1,\dots,m$, $\beta_{jk}=0$ for all $(j,k)\notin E$, and $\mathcal{E}$ is the $n\times m$ ‘noise’ matrix. Then, $\mathcal{G}(B)$ denotes the directed graph on $m$ nodes such that arc $(j,k)$ appears in $\mathcal{G}(B)$ if and only if $\beta_{jk}\neq 0$. Throughout the paper, we use $B$ and $\beta$ to denote the matrix of coefficients and its vectorized version, respectively.
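To make the matrix form concrete, the following is a minimal numpy sketch (a toy illustration, not part of the paper's method) that generates samples from the linear SEM $\mathcal{X}=\mathcal{X}B+\mathcal{E}$ for a hypothetical 3-node chain DAG; since $I-B$ is invertible for any DAG (a permutation of $B$ is strictly triangular), one can solve for $\mathcal{X}=\mathcal{E}(I-B)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 3  # samples, nodes

# Hypothetical coefficient matrix B for the chain DAG 1 -> 2 -> 3:
# beta_{jk} is the effect of node j on node k, with zero diagonal.
B = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])

# Solve X = X B + E for X: X = E (I - B)^{-1}, with i.i.d. Gaussian noise E.
E = rng.normal(size=(n, m))
X = E @ np.linalg.inv(np.eye(m) - B)

# Column k of X indeed equals sum_j beta_{jk} X_j + eps_k.
assert np.allclose(X[:, 1], B[0, 1] * X[:, 0] + E[:, 1])
```

The same construction works for any acyclic $B$; only the coefficient matrix above is an assumed example.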

A key challenge when estimating DAGs by minimizing the loss function (2) is that the true DAG is generally not identifiable from observational data. However, for certain SEM distributions the true DAG is identifiable, meaning that the true causal graph can be recovered from the joint distribution alone. Two important examples are linear SEMs with possibly non-Gaussian homoscedastic noise variables [41], and linear SEMs with unequal noise variances that are known up to a constant [31]. In these special cases, the true DAG can be identified from observational data without requiring the (strong) ‘faithfulness’ assumption, which is known to be restrictive in high dimensions [51, 48]. Given these important implications, in this paper we focus on learning Bayesian networks corresponding to the above identifiable linear SEMs, i.e., settings where the error variances are either equal or known up to a constant.

The negative log likelihood for an identifiable linear SEM (1) with equal noise variances is proportional to

l(\beta;\mathcal{X})=n\,\operatorname{tr}\left\{(I-B)(I-B)^{\top}\widehat{\Sigma}\right\}; (2)

here $\widehat{\Sigma}=n^{-1}\mathcal{X}^{\top}\mathcal{X}$ is the empirical covariance matrix, and $I$ is the identity matrix [45, 52].
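For illustration, the loss (2) can be evaluated directly; the sketch below (the function name is ours, not from the paper) also checks numerically that the trace form equals the residual sum of squares $\|\mathcal{X}-\mathcal{X}B\|_{F}^{2}$ of the $m$ node-wise regressions, which follows from $\widehat{\Sigma}=n^{-1}\mathcal{X}^{\top}\mathcal{X}$ and the cyclic property of the trace:

```python
import numpy as np

def neg_log_likelihood(B, X):
    """Loss (2): n * tr{(I - B)(I - B)^T Sigma_hat} with Sigma_hat = X^T X / n."""
    n, m = X.shape
    Sigma_hat = X.T @ X / n
    I = np.eye(m)
    return n * np.trace((I - B) @ (I - B).T @ Sigma_hat)

# Sanity check on random data: the trace form equals ||X - X B||_F^2,
# the total residual sum of squares of the m regressions.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
B = rng.normal(scale=0.3, size=(4, 4))
np.fill_diagonal(B, 0.0)
assert np.isclose(neg_log_likelihood(B, X), np.linalg.norm(X - X @ B, "fro") ** 2)
```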

To learn sparse DAGs, van de Geer and Bühlmann [52] propose to augment the negative log-likelihood with an $\ell_{0}$ regularization term. Given a super-structure $\mathcal{M}$, the optimization problem corresponding to this penalized negative log-likelihood (PNL$_{\mathcal{M}}$) is given by

PNL$_{\mathcal{M}}$: \min_{B\in\mathbb{R}^{m\times m}} \quad \mathcal{L}(\beta):=l(\beta;\mathcal{X})+\lambda_{n}\|\beta\|_{0} (3a)
\text{s.t.} \quad \mathcal{G}(B)\ \text{induces a DAG from}\ \overrightarrow{\mathcal{M}}, (3b)

where the tuning parameter $\lambda_{n}$ controls the degree of the $\ell_{0}$ regularization

\|\beta\|_{0}:=\sum_{(j,k)\in\overrightarrow{E}}\mathbbm{1}(\beta_{jk}),

where $\mathbbm{1}(\beta_{jk})$ is an indicator function with value one if $\beta_{jk}\neq 0$, and zero otherwise. The constraint (3b) stipulates that the resulting directed subgraph is a DAG induced from $\overrightarrow{\mathcal{M}}$. When $\mathcal{M}$ corresponds to a complete graph, PNL$_{\mathcal{M}}$ reduces to the original PNL of van de Geer and Bühlmann [52].
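The objective (3a) is thus the loss (2) plus $\lambda_{n}$ times the count of nonzero coefficients. A small sketch of evaluating this score for a candidate $B$ (function name and the numerical tolerance `tol` are our assumptions; the DAG constraint (3b) is not checked here and must be enforced by the optimization model):

```python
import numpy as np

def pnl_score(B, X, lam_n, tol=1e-8):
    """Objective (3a): loss (2) plus lam_n * ||beta||_0, the number of
    nonzero off-diagonal coefficients. Constraint (3b) is not checked."""
    n, m = X.shape
    Sigma_hat = X.T @ X / n
    I = np.eye(m)
    loss = n * np.trace((I - B) @ (I - B).T @ Sigma_hat)
    off_diag = B - np.diag(np.diag(B))
    return loss + lam_n * int(np.sum(np.abs(off_diag) > tol))

# Toy evaluation: the empty graph pays no l0 penalty but a large loss.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
empty_score = pnl_score(np.zeros((3, 3)), X, lam_n=2.0)
```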

The choice of $\ell_{0}$ regularization in (3) is deliberate. Although $\ell_{1}$ regularization has attractive computational and statistical properties in high-dimensional regression [8], many of these advantages disappear in the context of DAG structure learning [21, 2]. By considering $\ell_{0}$ regularization, [52] establish the consistency of PNL under appropriate assumptions. More specifically, for a Gaussian SEM, they show that the estimated DAG has (asymptotically) the same number of edges as the DAG with the minimal number of edges (the minimal-edge I-MAP), and establish the consistency of PNL for learning sparse DAGs. These results are formally stated in Proposition 1 in the next section.

Remark 1.

A Tikhonov ($\ell_{2}$) regularization term, $\mu\|\beta\|_{2}^{2}$ for a given $\mu>0$, can also be added to the objective (3a) to obtain more stable solutions [7].

In our earlier work [32], we observe that existing mathematical formulations are slow to converge to a provably optimal solution, $\beta^{\star}$, of (3) using state-of-the-art optimization solvers. Therefore, the solution process needs to be terminated early, yielding a feasible solution $\hat{\beta}$ with a positive optimality gap, i.e., a positive difference between the upper bound on $\operatorname{\mathcal{L}}(\beta^{\star})$ provided by $\operatorname{\mathcal{L}}(\hat{\beta})$ and the lower bound on $\operatorname{\mathcal{L}}(\beta^{\star})$ provided by the best continuous relaxation obtained by the branch-and-bound algorithm upon termination. However, the statistical properties of such a sub-optimal solution are not well-understood. Therefore, there exists a gap between theory and computation: while the optimal solution has nice statistical properties, the properties of solutions obtained from approximate computational algorithms are not known. Moreover, due to the non-convex and complex nature of the problem, characterizing the properties of solutions provided by heuristics is especially challenging. In the next section, we bridge this gap by developing a concrete early stopping criterion and establishing the consistency of the solution obtained using this criterion.

3 Early stopping criterion for DAG learning

In this section, we establish a sufficient condition for the approximate solution of PNL$_{\mathcal{M}}$, $\hat{\beta}$, to be consistent for the true coefficients, $\beta^{0}$; that is, $\|\beta^{0}-\hat{\beta}\|_{2}^{2}=\mathcal{O}\left(s^{0}\log(m)/n\right)$, where $s^{0}$ is the number of arcs in the true DAG, and $x\asymp y$ means that $x$ and $y$ are of the same asymptotic order. This result is obtained by leveraging an important property of the branch-and-bound process for integer programming that provides both lower and upper bounds on the objective function value $\operatorname{\mathcal{L}}(\beta^{\star})$ upon early stopping, as well as the consistency results for the PNL estimate with $\ell_{0}$ regularization. Using the insight from this new result, we then propose a concrete stopping criterion for terminating the branch-and-bound process that results in consistent parameter estimates.

Let $\mathrm{LB}$ and $\mathrm{UB}$, respectively, denote the lower and upper bounds on the optimal objective function value (3a) obtained from solving (3) under an early stopping criterion (i.e., when the obtained solution is not necessarily optimal). We define the difference between the upper and lower bounds as the absolute optimality gap: $\mathrm{GAP}=\mathrm{UB}-\mathrm{LB}$. Let $\hat{\mathcal{G}}$ and $\hat{\beta}$ denote the structure of the DAG and the coefficients of the arcs from optimization model (3) under the early stopping condition with sample size $n$ and regularization parameter $\lambda_{n}$. Let $\mathcal{G}^{\star}$ and $\beta^{\star}$ denote the DAG structure and arc coefficients obtained from the optimal solution of (3), and let $\mathcal{G}^{0}$ and $\beta^{0}$ denote the true DAG structure and the true arc coefficients, respectively. We denote the number of arcs in $\hat{\mathcal{G}}$, $\mathcal{G}^{0}$, and $\mathcal{G}^{\star}$ by $\hat{s}$, $s^{0}$, and $s^{\star}$, respectively. The score value in (3a) of each solution is denoted by $\operatorname{\mathcal{L}}(\phi)$, where $\phi\in\{\beta^{\star},\hat{\beta},\beta^{0}\}$.
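These quantities map directly onto what a branch-and-bound solver exposes at any point during the search. A minimal, solver-agnostic sketch of the resulting termination check (function names are ours, not a specific solver's API; the consistency-preserving choice of threshold is the subject of Proposition 2):

```python
def optimality_gap(upper_bound, lower_bound):
    """Absolute optimality gap GAP = UB - LB maintained by branch-and-bound."""
    return upper_bound - lower_bound

def should_stop_early(upper_bound, lower_bound, gap_threshold):
    """Terminate the search once GAP falls below a chosen threshold."""
    return optimality_gap(upper_bound, lower_bound) <= gap_threshold

# E.g., with incumbent value UB = 10.2 and relaxation bound LB = 9.9,
# a threshold of 0.5 allows the search to stop at GAP = 0.3.
stop = should_stop_early(10.2, 9.9, 0.5)
```

In practice the same effect is obtained by passing the threshold to the solver's absolute-gap termination parameter.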

Next, we present our main result, which extends van de Geer and Bühlmann's consistency result for PNL$_{\mathcal{M}}$ from the optimal, but computationally unattainable, estimator $\beta^{\star}$ to an approximate estimator, $\hat{\beta}$, obtained from early stopping. In the following (including the statement of our main result, Proposition 2), we assume that the super-structure $\mathcal{M}$ is known a priori; the setting where $\mathcal{M}$ is estimated from data is discussed at the end of the section. We begin by stating the key result from [52] and the required assumptions. Throughout, we consider a Gaussian linear SEM of the form (1). We denote the variance of the error terms $\epsilon_{j}$ by $\sigma_{jj}^{2}$ and the true covariance matrix of the random variables $(X_{1},\ldots,X_{m})$ by the $m\times m$ matrix $\Sigma$.

Assumption 1.

Suppose $m<c_{0}n/\log(n)$ for some constant $c_{0}>0$, and that $\max_{j=1,\ldots,m}\sigma_{jj}^{2}\leq\sigma_{0}^{2}$ for some constant $\sigma_{0}^{2}$. Moreover, the smallest and largest eigenvalues of $\Sigma$, $\kappa_{\min}(\Sigma)$ and $\kappa_{\max}(\Sigma)$, satisfy

\left(\frac{c_{0}}{\log(n)}\right)^{1/2}<\underline{\kappa}\leq\kappa_{\min}(\Sigma)<\kappa_{\max}(\Sigma)\leq\overline{\kappa}<\infty

for constants $\underline{\kappa}$ and $\overline{\kappa}$.

Assumption 2.

Let, as in [52], $\widetilde{\Omega}(\pi)$ be the precision matrix of the vector of noise variables for an SEM given a permutation $\pi$ of the nodes. Denoting the diagonal entries of this matrix by $\tilde{\omega}_{jj}$, there exists a constant $\omega_{0}>0$ such that if $\widetilde{\Omega}(\pi)$ is not a multiple of the identity matrix, then

m^{-1}\sum_{j=1}^{m}\left((\tilde{\omega}_{jj})^{2}-1\right)^{2}>1/\omega_{0}.
Proposition 1.

(Theorem 5.1 in [52]) Suppose Assumptions 1 and 2 hold and let $\alpha_{0}:=\min\{\frac{4}{m},0.05\}$. Then for an $\ell_{0}$ regularization parameter $\lambda\asymp\log(m)/n$, it holds with probability at least $1-\alpha_{0}$ that

\|\beta^{\star}-\beta^{0}\|_{2}^{2}+\lambda s^{\star}=\mathcal{O}\left(\lambda s^{0}\right).

Here, $\lambda=\lambda_{n}/n$, because the loss function (2) is that of [52] scaled by the sample size $n$. The next result establishes the consistency of the approximate estimator, $\hat{\beta}$, obtained using our proposed early stopping strategy.

Proposition 2.

Suppose Assumptions 1 and 2 hold, and let $\alpha_{0}=\min\{\frac{4}{m},0.05\}$ and $\lambda\asymp\log(m)/n$. Then, the estimator $\hat{\beta}$ obtained from early stopping of the branch-and-bound process such that $\mathrm{GAP}=\mathcal{O}\left(n\lambda s^{0}\right)=\mathcal{O}\left(\log(m)s^{0}\right)$ satisfies

\left\|\hat{\beta}-\beta^{0}\right\|_{2}^{2}=\mathcal{O}\left(\frac{\log(m)}{n}s^{0}\right)

with probability at least $1-\alpha_{0}$.

Proof.

First, by the triangle inequality and the fact that $2ab\leq a^{2}+b^{2}$ for all $a,b\in\mathbb{R}$,

\|\hat{\beta}-\beta^{0}\|_{2}^{2} \leq \left(\|\hat{\beta}-\beta^{\star}\|_{2}+\|\beta^{\star}-\beta^{0}\|_{2}\right)^{2}
= \|\hat{\beta}-\beta^{\star}\|_{2}^{2}+\|\beta^{\star}-\beta^{0}\|_{2}^{2}+2\|\hat{\beta}-\beta^{\star}\|_{2}\|\beta^{\star}-\beta^{0}\|_{2}
\leq 2\|\hat{\beta}-\beta^{\star}\|_{2}^{2}+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}. (4)

Recall that $\beta$ denotes the vectorized coefficient matrix $B$. In a slight abuse of notation, we use $\mathcal{X}$ to denote both the vectorized data and its block-diagonal version acting on $\beta$, and $\mathcal{E}$ to denote the vectorized error matrix. Then $\ell(\beta;\mathcal{X})$ can be written as $\ell(\beta;\mathcal{X})=\|\mathcal{X}-\mathcal{X}\beta\|_{2}^{2}$ (see Eq. 1). A Taylor series expansion of $\ell(\hat{\beta};\mathcal{X})$ around $\ell(\beta^{\star};\mathcal{X})$ gives

\|\mathcal{X}(\hat{\beta}-\beta^{\star})\|_{2}^{2}=\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})-2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{X}(\beta^{\star}-\beta^{0})+2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{E}. (5)

But Proposition 2.1 in [53] states that, for every $0<\xi<1$,

\left\|\widehat{\Sigma}-\Sigma\right\|_{2}\leq\left(\frac{m}{n}\right)^{1/2} (6)

with probability $1-\xi$. Thus, letting $\xi=\alpha_{0}$, (6) holds in our setting with probability $1-\alpha_{0}$.

Since, by Assumption 1, $\kappa_{\min}(\Sigma)\geq\underline{\kappa}>\left(\frac{c_{0}}{\log(n)}\right)^{1/2}$, by Weyl's theorem we have

\kappa_{\min}(\mathcal{X}^{\top}\mathcal{X})=n\kappa_{\min}\left(\widehat{\Sigma}\right)>n\underline{\kappa}-(mn)^{1/2}>n\underline{\kappa}-n\left(\frac{c_{0}}{\log(n)}\right)^{1/2}=n\left(\underline{\kappa}-\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\right),

which means that $\kappa_{\min}(\mathcal{X}^{\top}\mathcal{X})>0$ with probability $1-\alpha_{0}$.

Denoting $c^{\prime}_{n}\equiv\left(\underline{\kappa}-\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\right)^{-1}$, for large enough $n$ we have that, with probability $1-\alpha_{0}$,

\|\hat{\beta}-\beta^{\star}\|_{2}^{2}\leq n^{-1}c^{\prime}_{n}\|\mathcal{X}(\hat{\beta}-\beta^{\star})\|_{2}^{2}. (7)

Combining (5) and (7), and using the triangle inequality again, we get

\|\hat{\beta}-\beta^{\star}\|_{2}^{2} \leq n^{-1}c^{\prime}_{n}\|\mathcal{X}(\hat{\beta}-\beta^{\star})\|_{2}^{2} (8)
= n^{-1}c^{\prime}_{n}\left(\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})-2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{X}(\beta^{\star}-\beta^{0})+2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{E}\right)
\leq n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})-2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{X}(\beta^{\star}-\beta^{0})+2(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{E}\right|
\leq n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right|+2n^{-1}c^{\prime}_{n}\left|(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{X}(\beta^{\star}-\beta^{0})\right|+2n^{-1}c^{\prime}_{n}\left|(\hat{\beta}-\beta^{\star})^{\top}\mathcal{X}^{\top}\mathcal{E}\right|
\leq n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right|+2n^{-1}c^{\prime}_{n}\kappa_{\max}(\mathcal{X}^{\top}\mathcal{X})\|\hat{\beta}-\beta^{\star}\|_{2}\|\beta^{\star}-\beta^{0}\|_{2}+2n^{-1}c^{\prime}_{n}\|\hat{\beta}-\beta^{\star}\|_{2}\|\mathcal{X}^{\top}\mathcal{E}\|_{2},

where, as before, $\kappa_{\max}$ denotes the maximum eigenvalue of the matrix.

Using a similar argument as the one used above for the minimum eigenvalue of $\mathcal{X}^{\top}\mathcal{X}$, by (6) we have that, with probability $1-\alpha_{0}$,

\kappa_{\max}(\mathcal{X}^{\top}\mathcal{X})=n\kappa_{\max}\left(\widehat{\Sigma}\right)\leq n\kappa_{\max}(\Sigma)+n\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\leq n\left(\overline{\kappa}+\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\right).

Plugging the above bound into (8), we get

\|\hat{\beta}-\beta^{\star}\|_{2}^{2} \leq n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right| (9)
+2c^{\prime}_{n}\left(\overline{\kappa}+\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\right)\|\hat{\beta}-\beta^{\star}\|_{2}\|\beta^{\star}-\beta^{0}\|_{2}+2n^{-1}c^{\prime}_{n}\|\hat{\beta}-\beta^{\star}\|_{2}\|\mathcal{X}^{\top}\mathcal{E}\|_{2}.

Now, let $Z=\|\hat{\beta}-\beta^{\star}\|_{2}$, $\Pi=2c^{\prime}_{n}\left[\left(\overline{\kappa}+\left(\frac{c_{0}}{\log(n)}\right)^{1/2}\right)\|\beta^{\star}-\beta^{0}\|_{2}+n^{-1}\|\mathcal{X}^{\top}\mathcal{E}\|_{2}\right]$, and $\Gamma=n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right|$. Then, the inequality in (9) can be written as $Z^{2}\leq\Pi Z+\Gamma$. Since $Z$, $\Gamma$ and $\Pi$ are non-negative, solving this quadratic inequality for $Z$ shows that we must have $Z\leq\left(\Pi+\sqrt{\Pi^{2}+4\Gamma}\,\right)/2$.

Next, let $\mathcal{T}$ be the event under which $\Pi=o(1)$. On this event, we have $Z\leq\left(o(1)+\sqrt{o(1)+4\Gamma}\,\right)/2$, or $Z^{2}\leq\Gamma+o(1)$; that is,

\left\|\hat{\beta}-\beta^{\star}\right\|_{2}^{2}\leq n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right|+o(1). (10)

Plugging (10) into (4), on the event $\mathcal{T}$ we have

\|\hat{\beta}-\beta^{0}\|_{2}^{2} \leq 2n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})\right|+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}+o(1) (11)
= 2n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})+(\lambda\hat{s}-\lambda s^{\star})-(\lambda\hat{s}-\lambda s^{\star})\right|+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}+o(1)
\leq 2n^{-1}c^{\prime}_{n}\left|\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})+\lambda\hat{s}-\lambda s^{\star}\right|+2n^{-1}c^{\prime}_{n}\left|\lambda\hat{s}-\lambda s^{\star}\right|+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}+o(1).

Then, using the fact that $\ell(\hat{\beta};\mathcal{X})-\ell(\beta^{\star};\mathcal{X})+\lambda\hat{s}-\lambda s^{\star}=\operatorname{\mathcal{L}}(\hat{\beta})-\operatorname{\mathcal{L}}(\beta^{\star})\leq\mathrm{GAP}$, we can write (11) as

\|\hat{\beta}-\beta^{0}\|_{2}^{2} \leq 2n^{-1}c^{\prime}_{n}\left|\operatorname{\mathcal{L}}(\hat{\beta})-\operatorname{\mathcal{L}}(\beta^{\star})\right|+2n^{-1}c^{\prime}_{n}\lambda s^{\star}+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}+o(1)
\leq 2n^{-1}c^{\prime}_{n}\mathrm{GAP}+2\|\beta^{\star}-\beta^{0}\|_{2}^{2}+2n^{-1}c^{\prime}_{n}\lambda s^{\star}+o(1), (12)

where, in the first inequality, we also use the fact that the penalized estimation procedure chooses at most $n$ nonzero entries. Hence, since $m<n$ by Assumption 1,

2n^{-1}c^{\prime}_{n}\left|\lambda\hat{s}-\lambda s^{\star}\right|\leq 2n^{-1}c^{\prime}_{n}\lambda s^{\star}+2c^{\prime}_{n}\lambda=2n^{-1}c^{\prime}_{n}\lambda s^{\star}+\mathcal{O}\left(\frac{\log(m)}{n}\right)=2n^{-1}c^{\prime}_{n}\lambda s^{\star}+o(1).

Now, by Proposition 1, we know that with probability at least $1-\alpha_{0}$, $\|\beta^{\star}-\beta^{0}\|_{2}^{2}=\mathcal{O}\left(s^{0}\log(m)/n\right)$ and $\lambda s^{\star}=\mathcal{O}\left(s^{0}\log(m)/n\right)$. Moreover, using Gaussian concentration inequalities for $n^{-1}\|\mathcal{X}^{\top}\mathcal{E}\|_{2}$ [e.g., Lemma 6.2 in 8], the probability of the event $\mathcal{T}$ is lower bounded by the probability that $\|\beta^{\star}-\beta^{0}\|_{2}^{2}=\mathcal{O}\left(s^{0}\log(m)/n\right)$, which is $1-\alpha_{0}$. Thus, if we stop the branch-and-bound algorithm when

\mathrm{GAP}=\mathcal{O}\left(n\lambda s^{0}\right)=\mathcal{O}\left(\log(m)s^{0}\right),

then the first two terms in (12) are both of order $\mathcal{O}\left(s^{0}\log(m)/n\right)$, while the third term, $2n^{-1}c^{\prime}_{n}\lambda s^{\star}$, is of a smaller order (by an $n^{-1}$ factor). This guarantees that, with probability at least $1-\alpha_{0}$, $\|\hat{\beta}-\beta^{0}\|_{2}^{2}=\mathcal{O}\left(s^{0}\log(m)/n\right)$, as desired. ∎

Proposition 2 suggests that the branch-and-bound algorithm can be stopped by setting a threshold $c^{\star}n\lambda s^{0}$ on the value of $\mathrm{GAP}=\mathrm{UB}-\mathrm{LB}$ for a constant $c^{\star}>0$, say $c^{\star}=1$. Such a solution then achieves the same desirable statistical properties (in terms of parameter consistency) as the optimal solution $\beta^{\star}$. While $\lambda$ can be chosen data-adaptively (as discussed in Section 6), both of these choices depend on the value of $s^{0}$, which is not known in practice. However, one can find an upper bound on $s^{0}$ from the number of edges in the super-structure $\mathcal{M}$. In particular, if $\mathcal{M}$ is the moral graph [40] with $s_{m}$ edges, then $s^{0}\leq s_{m}$. While in many applications $s_{m}\asymp s^{0}$, this is not always guaranteed. Thus, to ensure consistent estimation when replacing $s^{0}$ with $s_{m}$ and setting $c^{\star}=1$ in practice, we use the more conservative threshold of $\lambda s^{0}\asymp s^{0}\log(m)/n$. With this choice, the first and third terms in (12) are of the same (vanishing) order, and the consistency rate is driven by the convergence rate of $\|\beta^{\star}-\beta^{0}\|_{2}^{2}$. We investigate the performance of this choice in Section 6.4.
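As a concrete illustration of this rule, the conservative threshold can be computed from quantities available before solving: $n$, $m$, and the moral-graph edge count $s_{m}$ standing in for the unknown $s^{0}$. A sketch under the choices $\lambda=\log(m)/n$ and $c^{\star}=1$ (the function name and the example numbers are ours):

```python
import math

def conservative_gap_threshold(n, m, s_m):
    """Conservative early-stopping threshold lambda * s_m = s_m * log(m) / n,
    where the moral-graph edge count s_m upper-bounds the true arc count s^0."""
    lam = math.log(m) / n  # regularization scale lambda ~ log(m)/n
    return lam * s_m

# E.g., n = 1000 samples, m = 30 nodes, moral graph with s_m = 60 edges:
threshold = conservative_gap_threshold(1000, 30, 60)
# stop branch-and-bound once GAP = UB - LB <= threshold
```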

The above results, including the specific choice of early stopping criterion, are also valid if the super-structure $\mathcal{M}$ corresponding to the moral graph is not known a priori. That is because the moral graph can be consistently estimated from data using recent developments in graphical modeling; see Drton and Maathuis [16] for a review of the literature. While some of the existing algorithms based on the $\ell_{1}$-penalty require an additional irrepresentability condition [33, 44], this assumption can be relaxed by instead using an adaptive lasso penalty or by thresholding the initial lasso estimates [8].

In light of Proposition 2, it is of great interest to develop algorithms that converge expeditiously to a solution with a small optimality gap. One approach is to obtain better lower bounds in the branch-and-bound process from stronger mathematical formulations of (3). To this end, we next review existing formulations of (3).

4 Existing Formulations of DAG Learning with Linear SEMs

In this section, we review known mathematical formulations for DAG learning with linear SEMs. We first outline the necessary notation below.

Index Sets
$V=\{1,2,\dots,m\}$: index set of random variables;
$\mathcal{D}=\{1,2,\dots,n\}$: index set of samples.
Input
$\mathcal{M}=(V,E)$: an undirected super-structure graph (e.g., the moral graph);
$\overrightarrow{\mathcal{M}}=(V,\overrightarrow{E})$: the bi-directional graph corresponding to the undirected graph $\mathcal{M}$;
$\mathcal{X}=(\mathcal{X}_{1},\dots,\mathcal{X}_{m})$, where $\mathcal{X}_{v}=(x_{1v},x_{2v},\dots,x_{nv})^{\top}$ and $x_{dv}$ denotes the $d$th sample ($d\in\mathcal{D}$) of random variable $X_{v}$; note $\mathcal{X}\in\mathbb{R}^{n\times m}$;
$\lambda_{n}$: tuning parameter (penalty coefficient for $\ell_{0}$ regularization).

Continuous optimization variables
$\beta_{jk}$: weight of arc $(j,k)$, representing the regression coefficient, $\forall(j,k)\in\overrightarrow{E}$.

Binary optimization variables
$z_{jk}=1$ if arc $(j,k)$ exists in a DAG; otherwise $0$, $\forall(j,k)\in\overrightarrow{E}$;
$g_{jk}=1$ if $\beta_{jk}\neq 0$; otherwise $0$, $\forall(j,k)\in\overrightarrow{E}$.

Let $F(\beta,g)=\sum_{k\in V}\sum_{d\in\mathcal{D}}\big(x_{dk}-\sum_{(j,k)\in\overrightarrow{E}}\beta_{jk}x_{dj}\big)^{2}+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}$. The $\mathrm{PNL}_{\mathcal{M}}$ problem can be cast as the following optimization model:

$$\begin{aligned}
\min_{B\in\mathbb{R}^{m\times m},\,g\in\{0,1\}^{|\overrightarrow{E}|}}\quad & F(\beta,g), && &&\text{(13a)}\\
\text{s.t.}\quad & \mathcal{G}(B)\ \text{induces a DAG from}\ \overrightarrow{\mathcal{M}}, && &&\text{(13b)}\\
& \beta_{jk}(1-g_{jk})=0, &&\forall(j,k)\in\overrightarrow{E}, &&\text{(13c)}\\
& g_{jk}\in\{0,1\}, &&\forall(j,k)\in\overrightarrow{E}. &&\text{(13d)}
\end{aligned}$$

The objective function (13a) is an expanded version of $\mathcal{L}(\beta)$ in $\mathrm{PNL}_{\mathcal{M}}$, where we use the indicator variable $g_{jk}$ to encode the $\ell_{0}$ regularization. The constraints in (13b) rule out cycles. The constraints in (13c) are non-linear and stipulate that $\beta_{jk}\neq 0$ only if $g_{jk}=1$.

There are two sources of difficulty in solving (13a)-(13d): (i) the acyclic nature of the DAG imposed by the combinatorial constraints in (13b); (ii) the set of nonlinear constraints in (13c), which stipulates that $\beta_{jk}\neq 0$ only if there exists an arc $(j,k)$ in $\mathcal{G}(B)$. In Section 4.1, we discuss related studies that address the former, whereas in Section 4.2 we present relevant literature for the latter.

4.1 Linear encodings of the acyclicity constraints (13b)

There are several ways to ensure that the estimated graph does not contain any cycles. The first approach is to add a constraint for each cycle in the graph, so that at least one arc in the cycle must be absent from $\mathcal{G}(B)$. A cutting plane (CP) method is used to solve this formulation, which may require generating an exponential number of constraints. Another way to rule out cycles is to impose constraints such that the nodes follow a topological order [38]. A topological ordering assigns the nodes of a graph distinct positions 1 to $m$ such that the graph contains an arc $(j,k)$ only if node $j$ appears before node $k$ in the ordering. We refer to this formulation as topological ordering (TO). The TO formulation has $\mathcal{O}(m^{2})$ variables and $\mathcal{O}(|\overrightarrow{E}|)$ constraints. We give these formulations in the Appendix for completeness.

The layered network (LN) formulation for learning from continuous data, proposed by [32], is shown empirically to perform better than the TO formulation because it reduces the number of binary variables, and it is proven to obtain the same continuous relaxation bounds. Therefore, smaller quadratic programs are solved that provide the same relaxation bounds as larger quadratic programs. This formulation is closely related to the generation number approach proposed in [11] for discrete data. In this paper, we focus on the LN formulation and refer the reader to the Appendix and [32] for a detailed comparison of these formulations and their sizes. Next, we give the LN encoding of the acyclicity constraints. Define decision variables $z_{jk}\in\{0,1\}$ for all $(j,k)\in\overrightarrow{E}$, where $z_{jk}$ takes value 1 if there is an arc $(j,k)$ in the network:

$$\begin{aligned}
\textbf{LN}\qquad g_{jk}&\leq z_{jk} &&\forall(j,k)\in\overrightarrow{E}, &&\text{(14a)}\\
z_{jk}-(m-1)z_{kj}&\leq\psi_{k}-\psi_{j} &&\forall(j,k)\in\overrightarrow{E}. &&\text{(14b)}
\end{aligned}$$

Let $\psi_{k}$ be the layer value for node $k$. The set of constraints in (14b) ensures that if there is an arc from node $j$ to node $k$ (i.e., $z_{jk}=1$ and $z_{kj}=0$), then $\psi_{k}\geq\psi_{j}+1$. This rules out any cycles.

Constraint (14b) written for $(k,i)\in\overrightarrow{E}$ imposes that if $z_{ij}=1$ and $z_{jk}=1$, i.e., if $\psi_{k}>\psi_{i}$, then $z_{ik}=1$, even if $\beta_{ik}=g_{ik}=0$ in an optimal solution. Thus, the additional binary vector $z$, along with the set of constraints in (14a), is needed to correctly encode the $\ell_{0}$ regularization. The LN formulation has $\mathcal{O}(|\overrightarrow{E}|)$ variables and constraints. Note that $|\overrightarrow{E}|$ is much smaller than $m^{2}$ for sparse skeleton/moral graphs.
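The acyclicity encoding can be sanity-checked directly. The pure-Python sketch below (hypothetical helper names) verifies the constraints in (14b) for a given layer assignment $\psi$ and arc indicators $z$, and enumerates layer assignments to confirm that a directed 3-cycle admits no feasible layers while an acyclic arc selection does.

```python
from itertools import product

def ln_feasible(psi, z, m, arcs):
    """Check constraints (14b): z[j,k] - (m-1)*z[k,j] <= psi[k] - psi[j]
    for every ordered pair (j,k) in the bi-directed edge set `arcs`."""
    return all(z.get((j, k), 0) - (m - 1) * z.get((k, j), 0) <= psi[k] - psi[j]
               for (j, k) in arcs)

def exists_layers(z, m, arcs):
    """Brute-force search over layer assignments psi in {0,...,m-1}^V."""
    nodes = sorted({v for a in arcs for v in a})
    for vals in product(range(m), repeat=len(nodes)):
        psi = dict(zip(nodes, vals))
        if ln_feasible(psi, z, m, arcs):
            return True
    return False

# Bi-directed edge set over three nodes (both orientations of each edge).
arcs = [(1, 2), (2, 1), (2, 3), (3, 2), (1, 3), (3, 1)]
# Acyclic selection 1->2, 2->3, 1->3: psi = (0, 1, 2) is feasible.
# Cyclic selection 1->2, 2->3, 3->1: no layer assignment satisfies (14b).
```

Enumerating $\psi$ is only for illustration; in the formulation, the solver chooses the layers.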

4.2 Linear encodings of the non-convex constraints (13c)

The nonconvexity of the set of constraints in (13c) causes challenges in obtaining provably optimal solutions with existing optimization software. Therefore, we consider convex representations of this set of constraints. First, we present a linear encoding of the constraints in (13c). Although the existing compact (i.e., polynomial-sized) TO and LN formulations discussed in Section 4.1 differ in their approach to ruling out cycles, one commonality among them is that they replace the non-linear constraint (13c) by the so-called big-$M$ constraints given by

$$-Mg_{jk}\leq\beta_{jk}\leq Mg_{jk},\quad\forall(j,k)\in\overrightarrow{E},\qquad\text{(15)}$$

for a large enough $M$. Unfortunately, these big-$M$ constraints (15) are poor approximations of (13c), especially in this problem, because no natural and tight value for $M$ exists. Although a few techniques have been proposed for obtaining the big-$M$ parameter for the sparse regression problem [e.g., 7, 6, 25, 39], the resulting parameters are often too large in practice. Further, finding a tight big-$M$ parameter is itself a difficult problem for DAG structure learning.
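The weakness of the big-$M$ relaxation can be seen in one line: after relaxing $g\in\{0,1\}$ to $[0,1]$, the smallest indicator value allowed by (15) is $|\beta|/M$, so the implied penalty $\lambda_{n}|\beta|/M$ shrinks as $M$ grows. A small sketch with illustrative numbers:

```python
def relaxed_indicator(beta, M):
    """Smallest g in [0, 1] satisfying -M*g <= beta <= M*g, i.e., the
    continuous relaxation of the big-M constraints (15) for one arc."""
    return abs(beta) / M

# The implied l0 penalty contribution lam * g vanishes as M grows,
# which is why a loose big-M yields a weak continuous relaxation bound.
lam = 2.0  # illustrative penalty coefficient
penalties = [lam * relaxed_indicator(0.8, M) for M in (1.0, 10.0, 100.0)]
```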

Consider (13a)-(13d) with (13c) replaced by the linear big-$M$ constraints (15) and the objective function written in matrix form. We denote the resulting formulation, which has a convex quadratic objective and linear constraints, by MIQP:

$$\begin{aligned}
\textbf{MIQP}\quad\min_{B\in\mathbb{R}^{m\times m},\,g\in\{0,1\}^{|\overrightarrow{E}|}}\quad & \operatorname{tr}\left[(I-B)(I-B)^{\top}\mathcal{X}^{\top}\mathcal{X}\right]+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk} &&\text{(16a)}\\
\text{s.t.}\quad & \text{(13b)},\ \text{(15)}, &&\text{(16b)}\\
& g_{jk}\in\{0,1\}\quad\forall(j,k)\in\overrightarrow{E}. &&\text{(16c)}
\end{aligned}$$

Depending on which type of constraints is used in lieu of (13b), as explained in Section 4.1, MIQP (16) results in three different formulations: MIQP+CP, which uses (23); MIQP+TO, which uses (24); and MIQP+LN, which uses (14).

To discuss the challenges of the big-$M$ approach, we give a definition followed by two propositions.

Definition 1.

A formulation $A$ is said to be stronger than formulation $B$ if $\mathcal{R}(A)\subset\mathcal{R}(B)$, where $\mathcal{R}(A)$ and $\mathcal{R}(B)$ correspond to the feasible regions of the continuous relaxations of $A$ and $B$, respectively.

Proposition 3.

(Proposition 3 in [32]) The MIQP+TO and MIQP+CP formulations are stronger than the MIQP+LN formulation.

As a consequence of Definition 1, the optimal objective value of the continuous relaxation of a stronger formulation provides a lower bound on the true optimal MIQP objective that is greater than or equal to the bound from the weaker formulation, owing to the smaller set of feasible solutions. However, the next proposition shows that, perhaps surprisingly, the continuous relaxations of the MIQP+TO and MIQP+CP formulations, while stronger according to Definition 1, give the same optimal objective function value as the weaker MIQP+LN formulation (and hence the same lower bound on the true optimal objective).

Proposition 4.

(Proposition 5 in [32]) Let $\beta^{\star}_{jk}$ denote the optimal coefficient associated with an arc $(j,k)\in\overrightarrow{E}$ from problem (3). For the same variable branching in the branch-and-bound process, the continuous relaxations of the MIQP+LN formulation for $\ell_{0}$ regularization attain the same optimal objective function value as MIQP+TO and MIQP+CP, if $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{\star}_{jk}|$.

Proposition 3 implies that the MIQP+TO and MIQP+CP formulations are stronger than the MIQP+LN formulation. Nonetheless, Proposition 4 establishes that, for sufficiently large values of $M$, the stronger formulations for the DAG learning problem attain the same continuous relaxation objective value as the weaker formulation throughout the branch-and-bound tree. The optimal solution to the continuous relaxation of MIQP formulations of DAG structure learning may not be at an extreme point of the convex hull of feasible points. Hence, stronger formulations do not necessarily ensure better lower bounds for formulations of this problem involving the nonlinear objective. This is in contrast to a mixed-integer program (MIP) with a linear objective, whose continuous relaxation is a linear program (LP); in that case, there exists an optimal solution that is an extreme point of the corresponding feasible set. As a result, a better lower bound can be obtained from a stronger formulation that better approximates the convex hull of the set of solutions to a mixed-integer linear program, which generally leads to faster convergence. A prime example is the traveling salesman problem (TSP), for which stronger formulations attain better computational performance [36]. In contrast, the numerical results of [32] show empirically that MIQP+LN has better computational performance because it is a compact formulation with the fewest constraints and the same continuous relaxation bounds.

Our next result, which is adapted from [15] to the DAG structure learning problem, shows that the continuous relaxation of MIQP is equivalent to the problem in which the $\ell_{0}$-regularization term is replaced with an $\ell_{1}$-regularization term (i.e., $\|\beta\|_{1}=\sum_{(j,k)\in\overrightarrow{E}}|\beta_{jk}|$) with a particular choice of the $\ell_{1}$ penalty. This motivates us to consider a tighter continuous relaxation of MIQP. Let $(\beta^{R},g^{R})$ be an optimal solution to the continuous relaxation of MIQP.

Proposition 5.

For $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{R}_{jk}|$, the continuous relaxation of MIQP (16), in which the binary variables are relaxed, is equivalent to the problem in which the $\ell_{0}$-regularization term is replaced with an $\ell_{1}$-regularization term with penalty parameter $\tilde{\lambda}=\frac{\lambda_{n}}{M}$.

Proof.

For $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{R}_{jk}|$, the value of $g^{R}_{jk}$ is $\frac{|\beta^{R}_{jk}|}{M}$ in an optimal solution to the continuous relaxation of MIQP (16); otherwise, we could reduce the value of the decision variable $g^{R}$ without violating any constraints while reducing the objective function value. Note that since $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{R}_{jk}|$, we have $\frac{|\beta_{jk}^{R}|}{M}\leq 1$ for all $(j,k)\in\overrightarrow{E}$. To show that the set of constraints in (13b) is satisfied, consider the set of CP constraints: the constraints $\sum_{(j,k)\in\mathcal{C}_{A}}\frac{|\beta^{R}_{jk}|}{M}\leq|\mathcal{C}_{A}|-1$ for all $\mathcal{C}_{A}\in\mathcal{C}$ hold because $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{R}_{jk}|$. This implies that $g_{jk}^{R}=\frac{|\beta_{jk}^{R}|}{M}$ is the optimal solution. Thus, the objective function reduces to $\ell_{1}$ regularization with coefficient $\frac{\lambda_{n}}{M}$.

Finally, Proposition 4 establishes that for $M\geq 2\max_{(j,k)\in\overrightarrow{E}}|\beta^{\star}_{jk}|$, the objective function values of the continuous relaxations of MIQP+CP, MIQP+LN, and MIQP+TO are equivalent. This implies that the continuous relaxations of all formulations are equivalent, which completes the proof. ∎

Despite the promising performance of MIQP+LN, its continuous relaxation provides a weak lower bound due to the big-$M$ constraints. A natural strategy to circumvent this issue is to improve the big-$M$ value. Nonetheless, neither existing methods that ensure a valid big-$M$ value nor heuristic techniques [38, 25] lead to tight big-$M$ values. For instance, the big-$M$ values produced by the heuristic of [38] always satisfy the condition in Proposition 5, and exact techniques are expected to produce even larger values. Therefore, we next develop tighter approximations of (13c) directly.

5 New Perspectives for Mathematical Formulations of DAG Learning

In this section, we discuss improved mathematical formulations for learning the DAG structure of a BN based on convex (instead of linear) encodings of the constraints in (13c).

Problem (13) is an MIQP with the non-convex complementarity constraints (13c), a class of problems that has received a fair amount of attention from the operations research community over the last decade [17, 18, 19, 20, 25, 30, 55, 56]. There has also been recent interest in leveraging these developments to solve sparse regression problems with $\ell_{0}$ regularization [42, 15, 58, 3, 54].

Next, we review applications of MIQPs with complementarity constraints of the form (13c) for solving sparse regression with $\ell_{0}$ regularization. [20] develop a so-called projected perspective relaxation method to solve the perspective relaxation of mixed-integer nonlinear programs with a convex objective function and complementarity constraints. This reformulation requires that the corresponding binary variables not be involved in other constraints; it is therefore suitable for $\ell_{0}$ sparse regression, but cannot be applied to DAG structure learning. [42] show how a broad class of $\ell_{0}$-regularized problems, including sparse regression as a special case, can be formulated exactly as optimization problems. The authors use the Tikhonov regularization term $\mu\|\beta\|_{2}^{2}$ and convex analysis to construct an improved convex relaxation using the reverse Huber penalty. In a similar vein, [6] exploit Tikhonov regularization and develop an efficient algorithm by reformulating the sparse regression problem as a saddle-point optimization problem with an outer linear integer optimization problem and an inner dual quadratic optimization problem, which is capable of solving high-dimensional sparse regressions. [58] apply the perspective formulation to the sparse regression problem with both $\ell_{0}$ and Tikhonov regularization, and establish that the continuous relaxation of the perspective formulation is equivalent to the continuous relaxation of the formulation of [6]. [15] propose a perspective relaxation for the $\ell_{0}$ sparse regression formulation and establish that the popular sparsity-inducing concave penalty known as the minimax concave penalty [59] and the reverse Huber penalty [42] can be obtained as special cases of the perspective relaxation; thus, the relaxations of the formulations in [59, 42, 6, 58] are all equivalent. The authors obtain an optimal perspective relaxation that is no weaker than any other perspective relaxation. Among these approaches, the optimal perspective relaxation of [15] is the only one that does not explicitly require Tikhonov regularization.

The perspective formulation, which is in essence a fractional non-linear program, can be cast either as a mixed-integer second-order cone program (MISOCP) or as a semi-infinite mixed-integer linear program (SIMILP). Both formulations can be solved directly by state-of-the-art optimization packages. [15] and [3] solve the continuous relaxations and then use a heuristic approach (e.g., rounding techniques) to obtain a feasible solution (an upper bound). In this paper, we directly solve the MISOCP and SIMILP formulations for learning sparse DAG structures.

Next, we present how the perspective formulation can be applied to DAG structure learning with $\ell_{0}$ regularization, and cast the resulting problem as an MISOCP and as an SIMILP. To this end, we express the objective function (16a) as follows:

$$\begin{aligned}
&\operatorname{tr}[(I-B)(I-B)^{\top}\mathcal{X}^{\top}\mathcal{X}]+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}\\
&\quad=\operatorname{tr}[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+BB^{\top}\mathcal{X}^{\top}\mathcal{X}]+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}, \qquad\text{(17)}
\end{aligned}$$

using $(I-B)(I-B)^{\top}=I-B-B^{\top}+BB^{\top}$.

Let $\delta\in\mathbb{R}_{+}^{m}$ be a vector such that $\mathcal{X}^{\top}\mathcal{X}-D_{\delta}\succeq 0$, where $D_{\delta}=\operatorname{diag}(\delta_{1},\dots,\delta_{m})$ and $A\succeq 0$ means that matrix $A$ is positive semi-definite. By splitting the quadratic term $\mathcal{X}^{\top}\mathcal{X}=(\mathcal{X}^{\top}\mathcal{X}-D_{\delta})+D_{\delta}$ in (17), the objective function can be expressed as

$$\operatorname{tr}\left[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+BB^{\top}(\mathcal{X}^{\top}\mathcal{X}-D_{\delta})\right]+\operatorname{tr}\left(BB^{\top}D_{\delta}\right)+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}. \qquad\text{(18)}$$

Let $Q=\mathcal{X}^{\top}\mathcal{X}-D_{\delta}$. (In the presence of Tikhonov regularization with tuning parameter $\mu>0$, we let $Q=\mathcal{X}^{\top}\mathcal{X}+\mu I-D_{\delta}$, as described in Remark 1.) Then, a Cholesky decomposition can be applied to decompose $Q$ as $q^{\top}q$ (note $Q\succeq 0$). As a result, $\operatorname{tr}\left(BB^{\top}Q\right)=\operatorname{tr}\left(BB^{\top}q^{\top}q\right)=\sum_{i=1}^{m}\sum_{j=1}^{m}\big(\sum_{(\ell,j)\in\overrightarrow{E}}\beta_{\ell j}q_{i\ell}\big)^{2}$. The separable component can also be expressed as $\operatorname{tr}\left(BB^{\top}D_{\delta}\right)=\sum_{j=1}^{m}\sum_{(j,k)\in\overrightarrow{E}}\delta_{j}\beta_{jk}^{2}$. Using this notation, the objective (18) can be written as

$$\operatorname{tr}\left[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+BB^{\top}Q\right]+\sum_{j=1}^{m}\sum_{(j,k)\in\overrightarrow{E}}\delta_{j}\beta_{jk}^{2}+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}.$$
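As a sanity check of the split $\mathcal{X}^{\top}\mathcal{X}=Q+D_{\delta}$, the self-contained sketch below (toy two-variable data and hypothetical helper names) numerically verifies that $\operatorname{tr}(BB^{\top}\mathcal{X}^{\top}\mathcal{X})=\operatorname{tr}(BB^{\top}Q)+\operatorname{tr}(BB^{\top}D_{\delta})$ and that the separable part equals $\sum_{j}\sum_{(j,k)}\delta_{j}\beta_{jk}^{2}$.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Toy data: X is 3 samples x 2 variables (illustrative values only).
X = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.0]]
XtX = matmul(transpose(X), X)                       # [[1.25, 0.5], [0.5, 1.25]]
delta = [0.5, 0.5]                                  # chosen so XtX - D remains PSD
D = [[delta[0], 0.0], [0.0, delta[1]]]
Q = [[XtX[i][j] - D[i][j] for j in range(2)] for i in range(2)]

B = [[0.0, 0.7], [0.0, 0.0]]                        # single arc 1 -> 2, beta_12 = 0.7
BBt = matmul(B, transpose(B))
lhs = trace(matmul(BBt, XtX))                       # tr(BB^T X^T X)
rhs = trace(matmul(BBt, Q)) + trace(matmul(BBt, D)) # tr(BB^T Q) + tr(BB^T D_delta)
```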

The Perspective Reformulation (PRef) of MIQP is then given by

$$\begin{aligned}
\textbf{PRef}\quad\min_{B\in\mathbb{R}^{m\times m},\,g\in\{0,1\}^{|\overrightarrow{E}|}}\quad & \operatorname{tr}\big[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+BB^{\top}Q\big]+\sum_{j=1}^{m}\sum_{(j,k)\in\overrightarrow{E}}\delta_{j}\frac{\beta_{jk}^{2}}{g_{jk}}+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}, &&\text{(19a)}\\
\text{s.t.}\quad & \text{(16b)}-\text{(16c)}. &&\text{(19b)}
\end{aligned}$$

The objective function (19a) is formally undefined when some $g_{jk}=0$. More precisely, we use the convention that $\frac{\beta^{2}_{jk}}{g_{jk}}=0$ when $\beta_{jk}=g_{jk}=0$ and $\frac{\beta^{2}_{jk}}{g_{jk}}=+\infty$ when $\beta_{jk}\neq 0$ and $g_{jk}=0$ [19]. The continuous relaxation of PRef, referred to as the perspective relaxation, is much stronger than the continuous relaxation of MIQP under certain conditions, discussed in detail in Section 5.3 [42]. However, an issue with PRef is that its objective function is nonlinear due to the fractional term. There are two ways to reformulate PRef: as a mixed-integer second-order cone program (MISOCP) (Section 5.1), or as a semi-infinite mixed-integer linear program (SIMILP) (Section 5.2).

5.1 Mixed-integer second-order conic program

Let $s_{jk}$ be additional variables representing $\beta_{jk}^{2}$. Then, the MISOCP formulation is given by

$$\begin{aligned}
\textbf{MISOCP}\quad\min_{B\in\mathbb{R}^{m\times m},\,s\in\mathbb{R}^{|\overrightarrow{E}|},\,g\in\{0,1\}^{|\overrightarrow{E}|}}\quad & \operatorname{tr}\left[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+BB^{\top}Q\right]+\sum_{j=1}^{m}\sum_{(j,k)\in\overrightarrow{E}}\delta_{j}s_{jk}+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}, &&\text{(20a)}\\
\text{s.t.}\quad & s_{jk}g_{jk}\geq\beta_{jk}^{2} &&\forall(j,k)\in\overrightarrow{E}, \quad\text{(20b)}\\
& 0\leq s_{jk}\leq M^{2}g_{jk} &&\forall(j,k)\in\overrightarrow{E}, \quad\text{(20c)}\\
& \text{(16b)}-\text{(16c)}. &&\text{(20d)}
\end{aligned}$$

Here, the constraints in (20b) imply that $\beta_{jk}\neq 0$ only when $g_{jk}=1$. The constraints in (20b) are second-order conic representable because they can be written in the form $\sqrt{4\beta_{jk}^{2}+(s_{jk}-g_{jk})^{2}}\leq s_{jk}+g_{jk}$. The set of constraints in (20c) is valid because $|\beta_{jk}|\leq Mg_{jk}$ implies $\beta_{jk}^{2}\leq M^{2}g^{2}_{jk}=M^{2}g_{jk}$, using $g_{jk}^{2}=g_{jk}$ for $g_{jk}\in\{0,1\}$. The constraints in (20c) are not required, yet they improve computational efficiency, especially when we restrict the big-$M$ value; [58] report similar behavior for sparse regression. When we relax $g_{jk}\in\{0,1\}$ to $g_{jk}\in[0,1]$, we obtain the continuous relaxation of MISOCP (20). Let us denote the feasible regions of the continuous relaxations of MISOCP (20) and MIQP (16) by $\mathcal{R}_{\mathrm{MISOCP}}$ and $\mathcal{R}_{\mathrm{MIQP}}$, and their objective function values by $\mathrm{OFV}(\mathcal{R}_{\mathrm{MISOCP}})$ and $\mathrm{OFV}(\mathcal{R}_{\mathrm{MIQP}})$, respectively. For a more general problem than ours, [10] give a detailed proof establishing that $\mathcal{R}_{\mathrm{MISOCP}}\subset\mathcal{R}_{\mathrm{MIQP}}$ and $\mathrm{OFV}(\mathcal{R}_{\mathrm{MISOCP}})\geq\mathrm{OFV}(\mathcal{R}_{\mathrm{MIQP}})$. Therefore, we can obtain stronger lower bounds from MISOCP than from MIQP under suitable choices of $D_{\delta}$, as described in Section 5.3.
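The claimed second-order cone representation of (20b) can be checked pointwise: for $s,g\geq 0$, the rotated-cone inequality $s\,g\geq\beta^{2}$ holds exactly when $\sqrt{4\beta^{2}+(s-g)^{2}}\leq s+g$. A quick numerical sketch:

```python
import math

def rotated_cone(s, g, beta):
    """Rotated second-order cone constraint (20b): s * g >= beta^2, s, g >= 0."""
    return s * g >= beta ** 2

def soc_form(s, g, beta):
    """Equivalent standard SOC form: sqrt(4*beta^2 + (s - g)^2) <= s + g."""
    return math.sqrt(4 * beta ** 2 + (s - g) ** 2) <= s + g

# Both predicates agree on any nonnegative (s, g) and real beta, since
# (s + g)^2 - (s - g)^2 = 4*s*g.
```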

5.2 Semi-infinite mixed-integer linear program

An alternative approach to reformulating PRef is via the perspective cuts developed by [17, 18]. To apply perspective cuts, we use the reformulation idea first proposed in [17]: we introduce a dummy decision matrix $D$ to separate the separable and non-separable parts of the objective function, and add the constraints $d_{jk}=\beta_{jk}$, where $d_{jk}$ is the $(j,k)$ element of $D$ and $\beta$ is the decision variable in the optimization problem. Following this approach, MIQP can be reformulated as the SIMILP:

$$\begin{aligned}
\textbf{SIMILP}\quad\min_{B\in\mathbb{R}^{m\times m},\,v\in\mathbb{R}^{|\overrightarrow{E}|},\,g\in\{0,1\}^{|\overrightarrow{E}|}}\quad & \operatorname{tr}\left[(I-B-B^{\top})\mathcal{X}^{\top}\mathcal{X}+DD^{\top}Q\right]+\sum_{j=1}^{m}\sum_{(j,k)\in\overrightarrow{E}}\delta_{j}v_{jk}+\lambda_{n}\sum_{(j,k)\in\overrightarrow{E}}g_{jk}, &&\text{(21a)}\\
\text{s.t.}\quad & d_{jk}=\beta_{jk} &&\forall(j,k)\in\overrightarrow{E}, \quad\text{(21b)}\\
& v_{jk}\geq 2\bar{\beta}_{jk}\beta_{jk}-\bar{\beta}_{jk}^{2}g_{jk} &&\forall\bar{\beta}_{jk}\in[-M,M],\ \forall(j,k)\in\overrightarrow{E}, \quad\text{(21c)}\\
& \text{(16b)}-\text{(16c)}, &&\text{(21d)}\\
& v_{jk}\geq 0 &&\forall(j,k)\in\overrightarrow{E}. \quad\text{(21e)}
\end{aligned}$$

The set of constraints in (21c) is known as perspective cuts; note that there are infinitely many of them. Although the problem cannot be solved directly, it lends itself to a delayed cut generation approach: a (small) finite subset of the constraints in (21c) is kept, the current solution $(\beta^{\star},g^{\star},v^{\star})$ of the relaxation is obtained, and all inequalities violated by the relaxation solution are added with $\bar{\beta}_{jk}=\frac{\beta^{\star}_{jk}}{g^{\star}_{jk}}$ (with the convention $\frac{0}{0}=0$). This process is repeated until the termination criteria are met, and can be implemented using the cut callback functions available in off-the-shelf solvers such as Gurobi or CPLEX.
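The separation step of this delayed cut generation loop can be sketched in a few lines. Given the relaxation values $(\beta^{\star},g^{\star},v^{\star})$ for one arc, the routine returns the tangent point $\bar{\beta}=\beta^{\star}/g^{\star}$ of a violated perspective cut in (21c), or `None` when no cut of this family is violated (function name and tolerance are illustrative):

```python
def separate_perspective_cut(beta_r, g_r, v_r, tol=1e-6):
    """Given a relaxation solution (beta*, g*, v*) for one arc, return the
    tangent point beta_bar of a violated perspective cut
        v >= 2 * beta_bar * beta - beta_bar**2 * g,
    or None if the current point satisfies all cuts of this family."""
    beta_bar = 0.0 if g_r == 0 else beta_r / g_r   # convention 0/0 = 0
    if v_r < 2 * beta_bar * beta_r - beta_bar ** 2 * g_r - tol:
        return beta_bar
    return None
```

In a solver callback, a returned `beta_bar` would be turned into the linear cut $v_{jk}\geq 2\bar{\beta}\,\beta_{jk}-\bar{\beta}^{2}g_{jk}$ and added to the relaxation.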

5.3 Selecting δ\delta

In the MISOCP and SIMILP formulations, an important question is how to identify a valid $\delta$. A natural choice is $\delta=\lambda_{\min}e$, where $\lambda_{\min}$ is the minimum eigenvalue of $\mathcal{X}^{\top}\mathcal{X}$ and $e$ is a column vector of ones. The issue with this approach is that if $\lambda_{\min}=0$, then $\operatorname{diag}(\delta)$ becomes the trivial zero matrix, in which case the MISOCP formulation reduces to the big-$M$ formulation. [18] present an effective approach for obtaining a valid $\delta$ by solving the following semidefinite program (SDP):

$$\max_{\delta\in\mathbb{R}^{|V|}}\left\{\sum_{i\in V}\delta_{i}\,:\,\mathcal{X}^{\top}\mathcal{X}-\operatorname{diag}(\delta)\succeq 0,\ \delta_{i}\geq 0\right\}. \qquad\text{(22)}$$

This formulation can attain a non-zero $D_{\delta}$ even if $\lambda_{\min}=0$. Numerical results in [18] show that this method compares favorably with the minimum-eigenvalue approach. [60] propose an SDP approach that obtains a $D_{\delta}$ making the continuous relaxation of MISOCP (20) as tight as possible.
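For intuition, the minimum-eigenvalue choice can be computed in closed form in the $2\times 2$ case. The sketch below (illustrative numbers, hypothetical helper name) sets $\delta=\lambda_{\min}e$ and leaves $\mathcal{X}^{\top}\mathcal{X}-\operatorname{diag}(\delta)$ positive semi-definite; the SDP (22) can of course recover a larger, non-uniform $\delta$.

```python
import math

def min_eig_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return half_trace - radius

# Toy Gram matrix XtX = [[1.25, 0.5], [0.5, 1.25]]: lambda_min = 0.75, so
# delta = (0.75, 0.75) keeps XtX - diag(delta) = [[0.5, 0.5], [0.5, 0.5]] PSD
# (its eigenvalues are 0 and 1).
```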

Similar to [15], our formulation does not require adding a Tikhonov regularization term. In this case, PRef is effective when $\mathcal{X}^{\top}\mathcal{X}$ is sufficiently diagonally dominant. When $n\geq m$ and the rows of $\mathcal{X}$ are independent, $\mathcal{X}^{\top}\mathcal{X}$ is positive definite with probability one [15]. On the other hand, when $n<m$, $\mathcal{X}^{\top}\mathcal{X}$ is not full-rank. Therefore, a Tikhonov regularization term with a sufficiently large $\mu$ should be added to make $\mathcal{X}^{\top}\mathcal{X}+\mu I\succ 0$ [15], in order to benefit from the strengthening provided by PRef.

6 Experiments

In this section, we report the results of our numerical experiments, which compare the different formulations and evaluate the effect of different tuning parameters and estimation strategies. Our experiments are performed on a cluster operating on UNIX with an Intel Xeon E5-2640v4 2.4GHz processor. All formulations are implemented in the Python programming language, with Gurobi 8.1 as the solver. Unless otherwise stated, a time limit of $50m$ seconds, where $m$ denotes the number of nodes, and an MIQP relative optimality gap of $0.01$ are imposed across all experiments, after which runs are aborted. The relative optimality gap is calculated as $\mathrm{RGAP}:=\frac{\mathrm{UB}(X)-\mathrm{LB}(X)}{\mathrm{UB}(X)}$, where $\mathrm{UB}(X)$ denotes the objective value of the best feasible integer solution (incumbent) and $\mathrm{LB}(X)$ the best lower bound obtained during the branch-and-bound process for formulation $X\in\{\mathrm{MIQP},\mathrm{SIMILP},\mathrm{MISOCP}\}$.
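For clarity, the termination metric is simply (hypothetical helper name):

```python
def rgap(ub, lb):
    """Relative optimality gap RGAP = (UB - LB) / UB used as the
    stopping criterion in the experiments (assumes UB > 0)."""
    return (ub - lb) / ub
```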

Unless otherwise stated, we set $\lambda_{n}=\log(n)$, which corresponds to the Bayesian information criterion (BIC) score. To select the big-$M$ parameter in all formulations, we use the proposal of Park and Klabjan [38]. Specifically, given $\lambda_{n}$, we solve each problem without the cycle prevention constraints and obtain $\beta^{R}$. We then use $M=2\max_{(j,k)\in\overrightarrow{E}}|\beta^{R}_{jk}|$. Although this choice does not guarantee a valid value of $M$, the results in [38] and [32] computationally confirm that it gives a large enough value.

The goals of our computational study are twofold. First, we compare the various mathematical formulations to determine which gives the best performance (Subsection 6.1), and we examine sensitivity to the model parameters (Subsection 6.2) and to the choice of regularization term (Subsection 6.3). Second, in Subsection 6.4 we use the best-performing formulation to investigate the implications of the early stopping condition on the quality of the solution with respect to the true graph. To perform such a study, we use synthetic data so that the true graph is available. In Subsection 6.5, we compare our algorithm against two state-of-the-art benchmark algorithms on publicly available datasets.

We use the package pcalg in R to generate random graphs. First, we create a DAG with the randomDAG function and assign random arc weights (i.e., $\beta$) from a uniform distribution $\mathcal{U}[0.1,1]$. Next, the resulting DAG and random coefficients are fed into the rmvDAG function to generate multivariate data based on linear SEMs (the columns of matrix $\mathcal{X}$) with a standard normal error distribution. We consider $m\in\{10,20,30,40\}$ nodes and $n=100$ samples. The average outgoing degree of each node, denoted by $d$, is set to two. We generate 10 random Erdős-Rényi graphs for each setting $(m,n,d)$. Across all our instances, the minimum eigenvalue of $\mathcal{X}^{\top}\mathcal{X}$ is 3.26 and the maximum eigenvalue is 14.21.
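A Python analogue of this generation pipeline (randomDAG plus rmvDAG) can be sketched as follows: visit the nodes in topological order and set each variable to a weighted sum of its parents plus standard normal noise. The function name and interface are illustrative, not the actual pcalg code.

```python
import random

def sample_linear_sem(order, weights, n, seed=0):
    """Generate n samples from the linear SEM X_v = sum_j beta_jv X_j + eps_v,
    visiting nodes in topological `order`; `weights[(j, v)]` holds beta_jv.
    Errors are standard normal, as in the rmvDAG pipeline described above."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = {}
        for v in order:
            mean = sum(w * x[j] for (j, u), w in weights.items() if u == v)
            x[v] = mean + rng.gauss(0.0, 1.0)
        data.append(x)
    return data

# Example: a single arc 1 -> 2 with beta_12 = 1, so Var(X_2) = 2 in the limit.
samples = sample_linear_sem([1, 2], {(1, 2): 1.0}, n=100)
```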

Two types of problem instances are considered: (i) a set of instances with the known moral graph corresponding to the true DAG; (ii) a set of instances with a complete undirected graph, i.e., assuming no prior knowledge. We refer to the first class as moral instances and to the second as complete instances. The observational data, $\mathcal{X}$, are the same for both classes. The function moralize(graph) in the pcalg R-package is used to generate the moral graph from the true DAG. Although the moral graph can be consistently estimated from data using penalized estimation procedures with polynomial complexity [e.g., 31], the quality of the moral graph affects all optimization models. Therefore, we use the true moral graph in our experiments, unless otherwise stated.

6.1 Comparison of Mathematical Formulations

We use the following MIQP-based metrics to measure the quality of a solution: relative optimality gap (RGAP), computation time in seconds (Time), upper bound (UB), lower bound (LB), objective function value (OFV) of the initial continuous relaxation, and the number of explored nodes in the branch-and-bound tree (# BB). An in-depth comparison of the existing MIQP formulations, which rely on linear encodings of the constraints in (13c), is conducted by [32]. The authors conclude that the MIQP+LN formulation outperforms the other MIQP formulations, and the promising performance of MIQP+LN can be attributed to its size: (1) MIQP+LN has fewer binary variables and constraints than MIQP+TO, and (2) MIQP+LN is a compact (polynomial-size) formulation, in contrast to MIQP+CP, which has an exponential number of constraints. Therefore, in this paper, we analyze the formulations based on convex encodings of the constraints in (13c).

6.1.1 Comparison of MISOCP formulations

We next experiment with MISOCP formulations. For the set of constraints in (13b), we use the LN, TO, and CP constraints discussed in Section 4.1, resulting in three formulations denoted MISOCP+LN, MISOCP+TO, and MISOCP+CP, respectively. The MISOCP+TO formulation fails to find a feasible solution for instances with 30 and 40 nodes; see Table 1. For moral instances, the optimality gaps for MISOCP+TO are 0.000 and 0.021 for instances with 10 and 20 nodes, respectively; for complete instances, the optimality gaps for the MISOCP+TO formulation are 0.009 and 0.272 for instances with 10 and 20 nodes, respectively. Moreover, Table 1 illustrates that MISOCP+LN performs better than MISOCP+TO even for small instances (i.e., 10 and 20 nodes).

Table 1: Optimality gaps for MISOCP+TO and MISOCP+LN formulations
Moral Complete
m MISOCP+TO MISOCP+LN MISOCP+TO MISOCP+LN
10 0.000 0.000 0.009 0.008
20 0.021 0.006 0.272 0.195
30 - 0.010 - 0.195
40 - 0.042 - 0.436

“-” denotes that no feasible solution (i.e., no UB) is obtained within the time limit, so the optimality gap cannot be computed.

For MISOCP+CP, instead of incorporating all constraints given by (23), we begin with no constraints of type (23). Given an integer solution with cycles, we detect a cycle and impose a new cycle prevention constraint to remove it. Depth-first search (DFS) can detect a cycle in a directed graph with complexity O(|V| + |E|). Gurobi's LazyCallback function is used, which allows adding cycle prevention constraints within the branch-and-bound algorithm whenever an integer solution with cycles is found. The same approach is used by [38] to solve the corresponding MIQP+CP. Note that the Gurobi solver follows a branch-and-cut implementation and adds many general-purpose and special-purpose cutting planes.
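As an illustrative sketch (not the authors' exact implementation), the separation routine such a lazy callback needs is an iterative DFS that either certifies acyclicity or returns the arcs of one cycle, which can then be cut off by a new constraint of type (23):

```python
def find_cycle(m, arcs):
    """Return one directed cycle as a list of arcs, or None if the graph on
    nodes 0..m-1 with the given arc set is acyclic. Runs in O(|V| + |E|)."""
    adj = {v: [] for v in range(m)}
    for j, k in arcs:
        adj[j].append(k)
    color = [0] * m          # 0 = unvisited, 1 = on current DFS path, 2 = done
    parent = {}
    for root in range(m):
        if color[root]:
            continue
        color[root] = 1
        stack = [(root, iter(adj[root]))]
        while stack:
            v, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:              # all successors of v processed
                color[v] = 2
                stack.pop()
            elif color[nxt] == 0:        # tree arc: descend
                parent[nxt] = v
                color[nxt] = 1
                stack.append((nxt, iter(adj[nxt])))
            elif color[nxt] == 1:        # back arc (v, nxt) closes a cycle
                cycle, u = [(v, nxt)], v
                while u != nxt:          # walk tree arcs back to nxt
                    cycle.append((parent[u], u))
                    u = parent[u]
                return cycle
    return None
```

In a lazy callback, the returned arc set 𝒞_A would be translated into the cut Σ_{(j,k)∈𝒞_A} g_jk ≤ |𝒞_A| − 1.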

Figures 1(a) and 1(b) show that MISOCP+LN outperforms MISOCP+CP in terms of relative optimality gap and computational time. In addition, MISOCP+LN attains better upper and lower bounds than MISOCP+CP (see Figures 1(c) and 1(d)). MISOCP+CP requires the solution of a second-order cone program (SOCP) after each cut, which reduces its computational efficiency and results in higher optimality gaps than MISOCP+LN. MISOCP+TO requires many binary variables, which makes it inefficient as the network becomes denser and larger, as shown in Table 1. Therefore, we do not include the MISOCP+TO results in Figure 1.

Figure 1: Optimization-based measures for MISOCP+LN (left bar) and MISOCP+CP (right bar) formulations for n = 100. Panels: (a) RGAPs; (b) time (in seconds); (c) best upper bounds; (d) best lower bounds.
Figure 2: Optimization-based measures for MISOCP+LN, MIQP+LN, and SIMILP+LN formulations for n = 100. Panels: (a) RGAPs; (b) time (in seconds); (c) best upper bounds; (d) best lower bounds.
Figure 3: Optimization-based measures for MISOCP+LN and MIQP+LN formulations for n = 100. Panels: (a) RGAPs; (b) time (in seconds); (c) best upper bounds; (d) best lower bounds; (e) number of branch-and-bound nodes; (f) continuous relaxation objective function.

6.1.2 Comparison of MISOCP versus SIMILP

Our computational experiments show that the SIMILP formulation generally performs poorly compared to MISOCP+LN and MIQP+LN in terms of optimality gap, upper bound, and computational time. We report the results for the SIMILP+LN, MISOCP+LN, and MIQP+LN formulations in Figure 2. We only consider the LN formulation because it is the best-performing model among the alternatives for both the MISOCP and MIQP formulations.

Figures 2(a) and 2(b) show the relative optimality gaps and computational times for these three formulations. Figures 2(c) and 2(d) demonstrate that SIMILP+LN attains lower bounds that are comparable with those of the other two formulations. In particular, for complete instances with a large number of nodes, SIMILP+LN attains better lower bounds than MIQP+LN. Nonetheless, SIMILP+LN fails to obtain good upper bounds. Therefore, the relative optimality gap is considerably larger for SIMILP+LN.

The poor performance of SIMILP+LN might be explained as follows. State-of-the-art optimization packages (e.g., Gurobi, CPLEX) use many heuristics to obtain a good feasible solution (i.e., upper bound) for a compact formulation. In contrast, SIMILP is not a compact formulation, and we build it gradually by adding violated constraints iteratively. Therefore, a feasible solution to the original formulation is not available while solving the relaxations with a subset of the constraints (21c). Moreover, the optimization solvers capable of solving MISOCP formulations have seen noticeable improvement due to theoretical developments in this field. In particular, Gurobi reports 20% and 38% improvements in solution time for versions 8 and 8.1, respectively. In addition, Gurobi v8.1 reports over four times faster solution times than CPLEX for solving MISOCPs on their benchmark instances.

6.1.3 Comparison of MISOCP versus MIQP formulations

In this section, we demonstrate the benefit of using the second-order conic formulation MISOCP+LN instead of the linear big-M formulation MIQP+LN. As before, we only consider the LN formulation for this purpose. Figures 3(a) and 3(b) show that MISOCP+LN performs better than MIQP+LN in terms of the average relative optimality gap for all numbers of nodes m ∈ {10, 20, 30, 40}. The only exception is m = 40 for moral instances, for which MIQP+LN performs better than MISOCP+LN. Nonetheless, we observe that MISOCP+LN clearly outperforms MIQP+LN for complete instances, which are more difficult to solve.

Figures 3(c) and 3(d) show the performance of both formulations in terms of the resulting upper and lower bounds on the objective function. We observe that MISOCP+LN attains better lower bounds, especially for complete instances. However, MISOCP+LN cannot always obtain a better upper bound. In other words, MISOCP+LN is more effective in improving the lower bound than the upper bound, as expected.

Figures 3(e) and 3(f) show that MISOCP+LN uses fewer branch-and-bound nodes and achieves better continuous relaxation values than MIQP+LN.

6.2 Analyzing the Choices of λ_n and M

We now experiment with different values of λ_n and M to assess the effects of these parameters on the performance of MISOCP+LN and MIQP+LN. First, we consider multiple values λ_n ∈ {log(n), 2log(n), 4log(n)}, while keeping the value of M the same (i.e., M = 2 max_{(j,k)∈E} |β⋆_jk|). Table 2 shows that as λ_n increases, MISOCP+LN consistently performs better than MIQP+LN in terms of the relative optimality gap, computational time, the number of branch-and-bound nodes, and the continuous relaxation objective function value. Indeed, the difference becomes even more pronounced for the more difficult cases (i.e., complete instances). For instance, for λ_n = 4log(n) = 18.4, the relative optimality gap reduces from 0.465 to 0.374, an improvement of over 24%. In addition, MISOCP+LN allows more instances to be solved to optimality within the time limit. For example, for moral instances with m = 40 and λ_n = 18.4, eight out of ten instances are solved to optimality using MISOCP+LN, while only two instances are solved to optimality by MIQP+LN.

Table 2: Computational results for different values of λ_n = t log(n) for t ∈ {1, 2, 4}; * indicates that the problem is solved to the optimality tolerance. Superscript i indicates that out of ten runs, i instances finish before hitting the time limit. Time is averaged over instances that solve within the time limit; RGAP is averaged over instances that reach the time limit. Better RGAPs are in bold.

               Moral                                                      Complete
         RGAP          Time        # nodes          Relaxation OFV   RGAP         Time        # nodes         Relaxation OFV
m   λ_n  MISOCP MIQP   MISOCP MIQP MISOCP  MIQP     MISOCP  MIQP     MISOCP MIQP  MISOCP MIQP MISOCP  MIQP    MISOCP  MIQP
10  4.6  *      *      3      2    1306    3715     738.7   664.9    *      *     65     74   38850   114433  724.4   629.3
10  9.2  *      *      4      3    1116    2936     784.6   693.5    *      *     31     39   15736   55543   772.5   662.2
10  18.4 *      *      3      2    1269    2457     857.0   747.5    *      *     26     29   18223   41197   844.5   720.2
20  4.6  *      *      69     51   46513   76261    1474.2  1325.8   .195   .275  1000   1000 101509  238765  1404.9  1144.5
20  9.2  *      *      26     27   10695   31458    1589.6  1406.8   .152   .250  1000   1000 152206  274514  1526.9  1238.6
20  18.4 *      *      24     36   9574    33788    1763.7  1552.7   .1132  .208  944    1000 159789  277687  1697.1  1395.0
30  4.6  .0108  .0118  378    527  121358  514979   2230.1  2037.7   .298   .441  1500   1500 38474   64240   2024.0  1569.7
30  9.2  *      *      104    291  33371   248190   2392.4  2168.5   .239   .395  1500   1500 59034   71475   2217.5  1741.5
30  18.4 *      *      48     74   15649   57909    2608.3  2383.8   .215   .318  1500   1500 74952   96586   2449.2  2006.9
40  4.6  .0426  .0374  1551   1615 664496  2565247  2979.3  2748.6   .436   .545  2000   2000 23083   49050   2582.0  1946.3
40  9.2  .0248  .0364  1125   1336 353256  1347702  3200.7  2923.5   .397   .473  2000   2000 29279   73917   2869.9  2216.9
40  18.4 .0248  .0352  1099   1375 434648  1137666  3521.8  3225.4   .374   .465  2000   2000 31298   60697   3240.1  2633.1

Table 3: Computational results for different values of γ; * indicates that the problem is solved to the optimality tolerance. Superscript i indicates that out of ten runs, i instances finish before hitting the time limit. Time is averaged over instances that solve within the time limit; RGAP is averaged over instances that reach the time limit. Better RGAPs are in bold.

               Moral                                                      Complete
         RGAP          Time        # nodes          Relaxation OFV   RGAP         Time        # nodes         Relaxation OFV
m   γ    MISOCP MIQP   MISOCP MIQP MISOCP  MIQP     MISOCP  MIQP     MISOCP MIQP  MISOCP MIQP MISOCP  MIQP    MISOCP  MIQP
10  2    *      *      3      2    1306    3715     738.7   664.9    *      *     65     74   38850   114433  724.4   629.3
10  5    *      *      5      2    1433    3026     717.9   647.1    *      *     81     82   42675   130112  705.1   607.8
10  10   *      *      5      2    1523    2564     712.5   641.1    *      *     74     100  35576   174085  699.8   600.3
20  2    *      *      69     51   46513   76261    1474.2  1325.8   .195   .275  1000   1000 101509  238765  1404.9  1144.5
20  5    *      *      103    156  65951   209595   1438.2  1274.2   .211   .308  1000   1000 97940   225050  1375.3  1080.9
20  10   *      *      215    207  150250  349335   1427.7  1256.6   .230   .310  1000   1000 90864   257998  1366.3  1058.2
30  2    .0108  .0118  378    527  121358  514979   2230.1  2037.7   .298   .441  1500   1500 38474   64240   2024.0  1569.7
30  5    .0118  .0148  571    620  164852  527847   2173.9  1950.3   .336   .474  1501   1500 33120   64339   1969.4  1448.4
30  10   .0248  .0148  630    638  202635  585234   2156.5  1919.6   .349   .480  1500   1500 30579   77100   1951.2  1404.0
40  2    .0426  .0374  1551   1615 664496  2565247  2979.3  2748.6   .436   .545  2000   2000 23083   49050   2582.0  1946.3
40  5    .0456  .0472  1643   1634 638323  1347868  2895.6  2635.0   .579   .580  2000   2000 12076   30858   2488.0  1751.7
40  10   .0564  .0572  1639   1632 599281  1584187  2869.2  2595.6   .585   .594  2000   2000 11847   30222   2456.1  1679.6

Finally, we study the influence of the big-M parameter. Instead of the coefficient γ = 2 used in [38], we experiment with M = γ max_{(j,k)∈E} |β^R_jk| for γ ∈ {2, 5, 10} in Table 3, where β^R denotes the optimal solution of each optimization problem without the cycle-removal constraints. The larger the big-M parameter, the less effective both models become. However, comparing the continuous relaxation objective function values, we observe that MISOCP+LN tightens the formulation using the conic constraints, whereas MIQP+LN has no means of tightening the formulation other than the big-M constraints, which have a weak relaxation. In most cases, the MISOCP+LN formulation allows more instances to be solved to optimality than MIQP+LN. For larger m, because Gurobi solves larger SOCP relaxations at each branch-and-bound node, the MISOCP+LN formulation explores far fewer branch-and-bound nodes and stops with a similar RGAP at termination. For M > 2 max_{(j,k)∈E} |β^R_jk|, MISOCP+LN outperforms MIQP+LN in all measures in most cases.

6.3 The Effect of Tikhonov Regularization

In this subsection, we consider the effect of adding a Tikhonov regularization term to the objective (see Remark 1) by considering μ ∈ {0, log(n), 2log(n)}, while keeping the values of λ_n = log(n) and M the same as before. Table 4 demonstrates that for all instances with μ > 0, MISOCP+LN outperforms MIQP+LN. For complete instances with m = 40 and μ = 9.2, MISOCP+LN improves the optimality gap from 0.445 to 0.367, an improvement of over 21%. The reason for this improvement is that μ > 0 makes the quadratic coefficient matrix more diagonally dominant; therefore, the conic constraints become more effective in tightening the formulation and yield a better optimality gap. MISOCP+LN also allows more instances to be solved to optimality than MIQP+LN.

Table 4: Computational results for different values of μ; * indicates that the problem is solved to the optimality tolerance. Superscript i indicates that out of ten runs, i instances finish before hitting the time limit. Time is averaged over instances that solve within the time limit; RGAP is averaged over instances that reach the time limit. Better RGAPs are in bold.

               Moral                                                      Complete
         RGAP          Time        # nodes          Relaxation OFV   RGAP         Time        # nodes         Relaxation OFV
m   μ    MISOCP MIQP   MISOCP MIQP MISOCP  MIQP     MISOCP  MIQP     MISOCP MIQP  MISOCP MIQP MISOCP  MIQP    MISOCP  MIQP
10  0    *      *      3      2    1306    3715     738.7   664.9    *      *     65     74   38850   114433  724.4   629.3
10  4.6  *      *      4      2    1043    2758     802.0   708.5    *      *     69     72   38778   119825  789.3   675.7
10  9.2  *      *      4      2    1067    2231     858.0   748.1    *      *     72     74   36326   114383  843.2   712.3
20  0    *      *      69     51   46513   76261    1474.2  1325.8   .195   .275  1000   1000 101509  238765  1404.9  1144.5
20  4.6  *      *      45     45   15111   55302    1604.1  1426.5   .167   .242  1000   1000 102467  249490  1551.7  1267.1
20  9.2  *      *      43     55   15384   62297    1716.8  1515.7   .142   .223  1000   1000 94360   258194  1668.3  1355.1
30  0    .0108  .0118  378    527  121358  514979   2230.1  2037.7   .298   .441  1500   1500 38474   64240   2024.0  1569.7
30  4.6  .0089  .0118  310    392  76668   358544   2432.5  2187.7   .237   .387  1500   1500 45473   69258   2286.4  1788.5
30  9.2  .0099  .0108  67     377  12410   320632   2612.6  2311.4   .209   .367  1500   1500 41241   68661   2484.3  1915.7
40  0    .0426  .0374  1551   1615 664496  2565247  2979.3  2748.6   .436   .545  2000   2000 23083   49050   2582.0  1946.3
40  4.6  .0278  .0294  1331   1620 422654  1303301  3281.6  2972.8   .354   .471  2000   2000 13209   30995   2985.4  2261.3
40  9.2  .0208  .0286  870    1507 239214  1762210  3575.4  3165.3   .367   .445  2000   2000 13884   54638   3321.7  2468.7

6.4 Practical Implications of Early Stopping

In this subsection, we evaluate the quality of the estimated DAGs obtained from MISOCP+LN by comparing them with the ground-truth DAG. To this end, we use three measures: the average structural Hamming distance (SHD), which counts the number of arc differences (additions, deletions, or reversals) required to transform the estimated DAG into the true DAG; the average false positive rate (FPR), which is the proportion of edges appearing in the estimated DAG but not the true DAG; and the average true positive rate (TPR), which is the proportion of edges appearing in both the true DAG and the estimated DAG. Since Gurobi enforces a minimum relative gap of RGAP = 1e-4, a solution obtained within this relative gap is considered optimal. Finally, because the convergence of the branch-and-bound process may be slow in some cases, we set a time limit of 100m seconds.
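For concreteness, the three measures can be computed from the true and estimated arc sets as in the following Python sketch. The function name is ours, and two conventions are our assumptions that may differ in detail from the implementation behind the tables: a reversed arc counts as a single SHD error, and FPR is normalized by the number of possible non-arcs.

```python
def dag_metrics(m, true_arcs, est_arcs):
    """SHD (additions + deletions + reversals, each reversal counted once),
    TPR = correctly recovered arcs / true arcs, and
    FPR = extra arcs / possible non-arcs, for DAGs on m nodes."""
    T, E = set(true_arcs), set(est_arcs)
    reversals = {(j, k) for (j, k) in E - T if (k, j) in T}
    additions = (E - T) - reversals
    # arcs missing from E whose reversal is not in E (those are already counted)
    deletions = {(j, k) for (j, k) in T - E if (k, j) not in E}
    shd = len(additions) + len(deletions) + len(reversals)
    tpr = len(T & E) / len(T)
    fpr = len(E - T) / (m * (m - 1) - len(T))   # our normalization choice
    return shd, tpr, fpr
```

For example, estimating {(0,1), (2,1), (0,3)} when the truth is {(0,1), (1,2), (2,3)} yields one reversal, one addition, and one deletion, i.e., SHD = 3.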

To test the quality of the solution obtained with an early stopping criterion, we set the absolute optimality gap parameter as GAP = (log(m)/n) s_m and the ℓ0 regularization parameter as λ_n = log(m), as suggested by the discussion following Proposition 2 for achieving a consistent estimate. We compare the resulting suboptimal solution to the solution obtained by setting GAP = UB − LB = 0, i.e., the truly optimal solution.

Table 5 shows the numerical results for the average solution time (in seconds) for instances solved within the time limit, the number of instances that were not solved within the time limit, the actual absolute optimality gap at termination, and the average FPR, TPR, and SHD of the resulting DAGs, across 10 runs for moral instances. Table 5 indicates that the average SHD for GAP = (log(m)/n) s_m is close to that of the truly optimal solution, and the average FPR and TPR are the same for GAP = (log(m)/n) s_m and GAP = 0, except for m = 10. Note that a lower GAP generally leads to a better SHD score. From a computational standpoint, we observe that the early stopping criterion yields consistent solutions faster. In particular, the average solution time reduces by 25% for m = 30. The number of instances solved before hitting the 100m time limit is the same for GAP = 0 and GAP = (log(m)/n) s_m. Furthermore, stopping early does not sacrifice much of the quality of the resulting DAG, as can be seen from the SHD scores.

Table 5: Structural Hamming distance (SHD), false positive rate (FPR), and true positive rate (TPR) for early stopping with n = 100, λ_n = log(m), and GAP ≤ τ for moral instances. Superscript i indicates that out of ten runs, i instances finish before hitting the time limit. Time is averaged over instances that solve within the time limit; GAP and RGAP are averaged over instances that terminate early.
              τ = 0                                        τ = (log(m)/n) s_m
m    s_m      Time      GAP     RGAP    SHD    FPR   TPR   Time      GAP     RGAP    SHD    FPR   TPR
10   19       1.28^10   0.00    0.000   0.75   0.04  1.00  1.28^10   0.06    0.000   0.77   0.02  1.00
20   58       6.15^9    0.71    0.000   1.50   0.01  1.00  6.04^9    1.44    0.001   2.00   0.01  1.00
30   109      37.40^7   13.96   0.004   1.67   0.00  1.00  27.63^7   15.80   0.005   1.66   0.00  1.00
40   138      935^2     63.06   0.131   5.00   0.01  1.00  640.15^2  66.34   0.132   5.00   0.01  1.00

6.5 Comparison to Other Benchmarks

In this section, we compare the performance of MISOCP against the state-of-the-art benchmarks. These experiments are executed on a laptop with a Windows 10 operating system, an Intel Core i7-8750H 2.2-GHz CPU, and 8 GB of DRAM, using Python 3.8 with the Gurobi 9.1.1 optimizer.

The benchmarks considered in this section include the top-down approach (EqVarDAG-TD) and the high-dimensional top-down approach (EqVarDAG-HD-TD) of [9], as well as the high-dimensional bottom-up approach (EqVarDAG-HD-BU) of [24]. By taking advantage of the conditions for identifiability in linear SEM models, these benchmark procedures offer polynomial-time algorithms for learning DAGs by iteratively identifying a source (top-down) or sink (bottom-up) node based on solving a series of covariance selection problems.

We compare the performance of the methods on twelve publicly available networks from [32] and the Bayesian Network Repository (bnlearn). The number of nodes in these networks ranges from m = 6 to m = 70. We generate data from both identifiable and non-identifiable error distributions. In the case of identifiable distributions (ID), we generate the data using random arc weights β from 𝒰[−1, −0.1] ∪ 𝒰[0.1, 1] and n = 500 samples with standard normal errors. The data for the non-identifiable (NID) error distributions were generated similarly, but with normal errors whose non-equal variances were chosen randomly from {0.5, 1, 1.5}.

As an input superstructure to MISOCP, other than the true moral graphs, we also consider a superstructure estimate based on the empirical correlation matrix (CorEst). This estimate, which is guaranteed to be a superset of the DAG skeleton under the faithfulness assumption, was obtained by testing whether each correlation coefficient is nonzero at the 0.05 significance level; the p-values were obtained using Fisher's Z-transformation for correlation coefficients. The MISOCP with true and correlation-based superstructures are denoted MISOCP-True and MISOCP-CorEst, respectively, in Table 6. A time limit of 50m seconds, λ = 2log(n), and a Gurobi RGAP of 0.01 are imposed across the experiments.
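The screening step can be sketched as follows; this is the standard Fisher Z-test of a zero correlation, using only the Python standard library (the helper name is ours):

```python
import math

def fisher_z_test(r, n, alpha=0.05):
    """Two-sided test of H0: rho = 0 for a sample correlation r computed from
    n observations, via Fisher's Z-transformation z = atanh(r)."""
    z = math.atanh(r)
    stat = math.sqrt(n - 3) * abs(z)          # approximately N(0, 1) under H0
    p_value = math.erfc(stat / math.sqrt(2))  # = 2 * (1 - Phi(stat))
    return p_value, p_value < alpha           # keep the edge if H0 is rejected
```

Applying this test to every pair of variables and keeping the rejected pairs yields the CorEst superstructure.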

Measures of performance of the benchmark algorithms are summarized in columns EqVarDAG-TD, EqVarDAG-HD-TD, and EqVarDAG-HD-BU of Table 6. The column Time reports the solution time in seconds. For all datasets, the true networks can be used to evaluate the quality of the estimated networks. We report SHD, TPR, and FPR for all the estimated networks. Given that the true causal network cannot be recovered in the setting of non-identifiable data (NID), we also report the SHD between the undirected skeleton of the true DAG and the corresponding skeleton of the estimated network; this is denoted SHDs in Table 6.

We observe that most of the EqVarDAG methods solve the problem within a second. With respect to the quality of the estimation, EqVarDAG-TD provides better SHD performance than EqVarDAG-HD-TD and EqVarDAG-HD-BU. The column RGAP reports the relative gap at early termination; the symbol (*) denotes that the problem is solved to the optimality tolerance. Compared with the benchmarks, MISOCP with a CorEst or true superstructure requires longer solution times; however, MISOCP consistently attains favorable SHD and SHDs scores on every network. Moreover, MISOCP provides the best estimates among all methods on most of the networks.

Finally, we highlight that in the non-identifiable datasets (NID), MISOCP clearly outperforms the benchmarks. This is, of course, not surprising, as the benchmark algorithms rely on the identifiability assumption and are not guaranteed to work if this assumption is violated. In contrast, in this case, MISOCP is guaranteed to find a member of the Markov equivalence class.

Table 6: The comparison between MISOCP and the state-of-the-art EqVarDAG methods of [9] and [24].
EqVarDAG-HD-BU EqVarDAG-HD-TD EqVarDAG-TD MISOCP-CorEst MISOCP-True
Data Network (m) Time SHD SHDs TPR FPR Time SHD SHDs TPR FPR Time SHD SHDs TPR FPR Time RGAP SHD SHDs TPR FPR Time RGAP SHD SHDs TPR FPR
ID Dsep(6) ≤1 3 3 1 .001 ≤1 3 3 1 .001 ≤1 0 0 1 0 ≤1 * 0 0 1 0 ≤1 * 0 0 1 0
Asia(8) ≤1 0 0 1 0 ≤1 0 0 1 0 ≤1 0 0 1 0 ≤1 * 0 0 1 0 ≤1 * 0 0 1 0
Bowling(9) ≤1 5 4 1 .002 ≤1 1 1 1 .001 ≤1 0 0 1 0 ≤1 * 0 0 1 0 ≤1 * 0 0 1 0
InsSmall(15) ≤1 11 8 1 .005 ≤1 2 2 1 .001 ≤1 2 2 1 .001 ≥750 .05 0 0 1 0 4 * 0 0 1 0
Rain(14) ≤1 5 4 1 .002 ≤1 2 2 1 .001 ≤1 0 0 1 0 155 * 0 0 1 0 1 * 0 0 1 0
Cloud(16) ≤1 15 12 1 .006 ≤1 6 6 1 .003 ≤1 0 0 1 0 ≥800 .101 0 0 1 0 ≤1 * 0 0 1 0
Funnel(18) ≤1 10 9 1 .004 ≤1 4 4 1 .002 ≤1 0 0 1 0 5 * 0 0 1 0 ≤1 * 0 0 1 0
Galaxy(20) ≤1 16 13 1 .007 ≤1 9 9 1 .004 ≤1 1 1 1 .001 ≥1000 .048 0 0 1 0 2 * 0 0 1 0
Insurance(27) 2 27 23 .962 .011 2 11 11 1 .005 ≤1 0 0 1 0 ≥1350 .216 0 0 1 0 176 * 0 0 1 0
Factors(27) 2 25 25 .977 .01 2 17 17 .985 .007 ≤1 3 3 .956 0 ≥1350 .157 0 0 1 0 ≥1350 .036 0 0 1 0
Hailfinder(56) 10 78 70 .985 .033 11 19 19 1 .008 ≤1 1 1 .985 0 ≥2800 .21 13 12 .955 .004 ≥2800 .052 0 0 1 0
Hepar2(70) 17 147 139 .959 .062 19 70 70 .968 .029 2 1 1 .992 0 ≥3500 .371 134 122 .878 .052 ≥3500 .096 0 0 1 0
NID Dsep(6) ≤1 4 3 1 .002 ≤1 2 1 1 .001 ≤1 1 1 .833 0 ≤1 * 1 1 .833 0 ≤1 * 1 1 .833 0
Asia(8) ≤1 16 13 .875 .006 ≤1 6 4 .875 .002 ≤1 6 4 .875 .002 ≤1 * 4 2 .875 .001 ≤1 * 6 3 .875 .002
Bowling(9) ≤1 7 6 .909 .003 ≤1 7 5 .909 .003 ≤1 4 2 .909 .001 ≤1 * 2 1 .909 .001 ≤1 * 4 2 .909 .001
InsSmall(15) ≤1 24 19 .96 .01 ≤1 8 7 .96 .003 ≤1 8 7 .96 .003 ≥750 .029 7 5 .88 .002 12 * 2 1 .96 .001
Rain(14) ≤1 17 12 .944 .007 ≤1 10 7 .944 .004 ≤1 4 1 .944 .001 43 * 7 4 .833 .002 2 * 4 1 .944 .001
Cloud(16) ≤1 19 14 .895 .007 ≤1 20 14 .895 .007 ≤1 12 6 .895 .004 ≥800 .062 8 5 .947 .003 ≤1 * 8 2 .947 .003
Funnel(18) ≤1 12 11 .944 .005 ≤1 6 6 .944 .002 ≤1 1 1 .944 0 8 * 1 1 .944 0 ≤1 * 1 1 .944 0
Galaxy(20) ≤1 36 28 .955 .015 ≤1 27 20 .955 .011 ≤1 17 10 .955 .007 ≥1000 .02 8 5 .909 .003 2 * 6 3 .955 .002
Insurance(27) 2 40 32 1 .017 2 23 19 .981 .009 ≤1 13 10 .9615 .005 ≥1350 .191 11 9 .923 .003 ≥1350 .055 6 4 .962 .002
Factors(27) 2 12 11 .971 .004 2 32 24 .882 .010 ≤1 32 24 .765 .007 ≥1350 .111 19 15 .853 .004 ≥1350 .062 22 16 .868 .006
Hailfinder(56) 8 103 88 .97 .043 8 90 70 .97 .038 2 59 40 .894 .022 ≥2800 .156 20 11 .985 .008 ≥2800 .096 15 8 .985 .006
Hepar2(70) 14 110 95 .9919 .048 16 73 61 .992 .031 2 52 40 .862 .015 ≥3500 .199 33 23 .935 .011 ≥3500 .112 36 18 .96 .014

7 Conclusion

In this paper, we study the problem of learning an optimal directed acyclic graph (DAG) from continuous observational data, where the causal effect among the random variables is linear. The central problem is a quadratic optimization problem with regularization. We present a mixed-integer second-order conic program (MISOCP), which entails a tighter relaxation than the existing formulations with linear constraints. Our numerical results show that MISOCP successfully improves the lower bound and results in better optimality gaps than formulations based on big-M constraints, especially for dense and large instances. Moreover, we establish an early stopping criterion under which we can terminate the branch-and-bound process and achieve a solution that is asymptotically optimal. In addition, we show that our method outperforms two state-of-the-art algorithms, especially on non-identifiable datasets.

Acknowledgments

We thank the AE and three anonymous reviewers for their detailed comments that improved the paper. Simge Küçükyavuz and Linchuan Wei were supported, in part, by ONR grant N00014-19-1-2321 and NSF grant CIF-2007814. Ali Shojaie was supported by NSF grant DMS-1561814 and NIH grant R01GM114029. Hao-Hsiang Wu was supported, in part, by MOST Taiwan grant 109-2222-E-009-005-MY2.

Appendix A Alternative linear encodings of constraints (13b)

There are several ways to ensure that the estimated graph does not contain any cycles. The first approach is to add a constraint for each cycle in the graph so that at least one arc in the cycle must not exist in 𝒢(B). A cutting plane (CP) method is used to solve such a formulation, which may require generating an exponential number of constraints. In particular, let 𝒞 be the set of all possible directed cycles and 𝒞_A ∈ 𝒞 be the set of arcs defining a cycle. The CP formulation removes cycles by imposing the following constraints for (13b):

CP  Σ_{(j,k)∈𝒞_A} g_jk ≤ |𝒞_A| − 1,  ∀ 𝒞_A ∈ 𝒞. (23)

This formulation has exponentially many constraints.

Another way to rule out cycles is to impose constraints such that the nodes follow a topological order [38]. A topological ordering assigns each node a unique position from 1 to m such that the graph contains an arc (j,k) only if node j appears before node k in the order. We refer to this formulation as topological ordering (TO). Define decision variables z_jk ∈ {0,1} for all (j,k) ∈ E and o_rs ∈ {0,1} for all r, s ∈ {1, …, m}. The variable z_jk takes value 1 if there is an arc (j,k) in the network, and o_rs takes value 1 if the topological order of node r equals s. The TO formulation rules out cycles in the graph via the following constraints:

TO  g_jk ≤ z_jk,  ∀(j,k) ∈ E, (24a)
    z_jk − m z_kj ≤ Σ_{s∈V} s (o_ks − o_js),  ∀(j,k) ∈ E, (24b)
    Σ_{s∈V} o_rs = 1,  ∀r ∈ V, (24c)
    Σ_{r∈V} o_rs = 1,  ∀s ∈ V. (24d)

This formulation has 𝒪(m^2) variables and 𝒪(|E|) constraints.
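The o variables in TO encode exactly the node positions produced by a topological sort: positions with pos[j] < pos[k] for every selected arc (j,k) exist precisely when the selected arcs form a DAG, which is what constraints (24b)-(24d) enforce. The following Python sketch (ours, for intuition only) computes such positions with Kahn's algorithm:

```python
def topological_order(m, arcs):
    """Kahn's algorithm: return pos such that pos[j] < pos[k] for every arc
    (j, k), or None if the graph on nodes 0..m-1 contains a directed cycle."""
    indeg = [0] * m
    adj = {v: [] for v in range(m)}
    for j, k in arcs:
        adj[j].append(k)
        indeg[k] += 1
    queue = [v for v in range(m) if indeg[v] == 0]  # current source nodes
    pos, t = [None] * m, 0
    while queue:
        v = queue.pop()
        pos[v] = t                     # assign the next position to a source
        t += 1
        for w in adj[v]:               # remove v and expose new sources
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return pos if t == m else None     # t < m iff some cycle was never emptied
```

On a cyclic input the queue of sources empties before all nodes are placed, mirroring the infeasibility of (24b)-(24d) when the z variables select a cycle.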

References

  • Aliferis et al. [2010] Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11(Jan):171–234, 2010.
  • Aragam and Zhou [2015] Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse Gaussian Bayesian networks. Journal of Machine Learning Research, 16:2273–2328, 2015.
  • Atamtürk and Gómez [2019] Alper Atamtürk and Andres Gómez. Rank-one convexification for sparse regression. arXiv preprint arXiv:1901.10334, 2019.
  • Bartlett and Cussens [2013] Mark Bartlett and James Cussens. Advances in Bayesian network learning using integer programming. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, pages 182–191, Arlington, Virginia, USA, 2013. AUAI Press.
  • Bartlett and Cussens [2017] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 244:258–271, 2017.
  • Bertsimas and Van Parys [2020] Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 48(1):300–323, 2020.
  • Bertsimas et al. [2016] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
  • Bühlmann and van de Geer [2011] Peter Bühlmann and Sara A van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011.
  • Chen et al. [2019] Wenyu Chen, Mathias Drton, and Y Samuel Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, 09 2019.
  • Cui et al. [2013] XT Cui, XJ Zheng, SS Zhu, and XL Sun. Convex relaxations and MIQCQP reformulations for a class of cardinality-constrained portfolio selection problems. Journal of Global Optimization, 56(4):1409–1423, 2013.
  • Cussens [2010] James Cussens. Maximum likelihood pedigree reconstruction using integer programming. In WCB@ICLP, pages 8–19, 2010.
  • Cussens [2011] James Cussens. Bayesian network learning with cutting planes. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11, pages 153–160, Arlington, Virginia, USA, 2011. AUAI Press.
  • Cussens et al. [2017a] James Cussens, David Haws, and Milan Studený. Polyhedral aspects of score equivalence in Bayesian network structure learning. Mathematical Programming, 164(1-2):285–324, 2017a.
  • Cussens et al. [2017b] James Cussens, Matti Järvisalo, Janne H Korhonen, and Mark Bartlett. Bayesian network structure learning with integer programming: Polytopes, facets and complexity. Journal of Artificial Intelligence Research, 58:185–229, 2017b.
  • Dong et al. [2015] Hongbo Dong, Kun Chen, and Jeff Linderoth. Regularization vs. relaxation: A conic optimization perspective of statistical variable selection. arXiv preprint arXiv:1510.06083, 2015.
  • Drton and Maathuis [2017] Mathias Drton and Marloes H Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.
  • Frangioni and Gentile [2006] Antonio Frangioni and Claudio Gentile. Perspective cuts for a class of convex 0–1 mixed integer programs. Mathematical Programming, 106(2):225–236, 2006.
  • Frangioni and Gentile [2007] Antonio Frangioni and Claudio Gentile. SDP diagonalizations and perspective cuts for a class of nonseparable MIQP. Operations Research Letters, 35(2):181–185, 2007.
  • Frangioni and Gentile [2009] Antonio Frangioni and Claudio Gentile. A computational comparison of reformulations of the perspective relaxation: SOCP vs. cutting planes. Operations Research Letters, 37(3):206–210, 2009.
  • Frangioni et al. [2011] Antonio Frangioni, Claudio Gentile, Enrico Grande, and Andrea Pacifici. Projected perspective reformulations with applications in design problems. Operations Research, 59(5):1225–1232, 2011.
  • Fu and Zhou [2013] Fei Fu and Qing Zhou. Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. Journal of the American Statistical Association, 108(501):288–300, 2013.
  • Gao et al. [2015] Tian Gao, Ziheng Wang, and Qiang Ji. Structured feature selection. In Proceedings of the IEEE International Conference on Computer Vision, pages 4256–4264, 2015.
  • Ghoshal and Honorio [2017] Asish Ghoshal and Jean Honorio. Information-theoretic limits of Bayesian network structure learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 767–775, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
  • Ghoshal and Honorio [2018] Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1466–1475. PMLR, 09–11 Apr 2018.
  • Gómez and Prokopyev [2021] Andrés Gómez and Oleg Prokopyev. A mixed-integer fractional optimization approach to best subset selection. INFORMS Journal on Computing, 33(2):551–565, 2021.
  • Han et al. [2016] Sung Won Han, Gong Chen, Myun-Seok Cheon, and Hua Zhong. Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. Journal of the American Statistical Association, 111(515):1004–1019, 2016.
  • Hemmecke et al. [2012] Raymond Hemmecke, Silvia Lindner, and Milan Studený. Characteristic imsets for learning Bayesian network structure. International Journal of Approximate Reasoning, 53(9):1336–1349, 2012.
  • Koivisto [2006] Mikko Koivisto. Advances in exact Bayesian structure discovery in Bayesian networks. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, UAI’06, pages 241–248, Arlington, Virginia, USA, 2006. AUAI Press.
  • Lazic et al. [2013] Nevena Lazic, Christopher Bishop, and John Winn. Structural expectation propagation (SEP): Bayesian structure learning for networks with latent variables. In Artificial Intelligence and Statistics, pages 379–387, 2013.
  • Liu et al. [2021] Peijing Liu, Salar Fattahi, Andrés Gómez, and Simge Küçükyavuz. A graph-based decomposition method for convex quadratic optimization with indicators. arXiv preprint arXiv:2110.12547, 2021.
  • Loh and Bühlmann [2014] Po-Ling Loh and Peter Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15(1):3065–3105, 2014.
  • Manzour et al. [2021] Hasan Manzour, Simge Küçükyavuz, Hao-Hsiang Wu, and Ali Shojaie. Integer programming for learning directed acyclic graphs from continuous data. INFORMS Journal on Optimization, 3(1):46–73, 2021.
  • Meinshausen and Bühlmann [2006] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Oates et al. [2016a] Chris. J. Oates, Jim Q. Smith, and Sach Mukherjee. Estimating causal structure using conditional DAG models. Journal of Machine Learning Research, 17(54):1–23, 2016a.
  • Oates et al. [2016b] Chris J Oates, Jim Q Smith, Sach Mukherjee, and James Cussens. Exact estimation of multiple directed acyclic graphs. Statistics and Computing, 26(4):797–811, 2016b.
  • Öncan et al. [2009] Temel Öncan, İ Kuban Altınel, and Gilbert Laporte. A comparative analysis of several asymmetric traveling salesman problem formulations. Computers & Operations Research, 36(3):637–654, 2009.
  • Ott et al. [2004] Sascha Ott, Seiya Imoto, and Satoru Miyano. Finding optimal models for small gene networks. In Pacific Symposium on Biocomputing, volume 9, pages 557–567. World Scientific, 2004.
  • Park and Klabjan [2017] Young Woong Park and Diego Klabjan. Bayesian network learning via topological order. Journal of Machine Learning Research, 18(99):1–32, 2017.
  • Park and Klabjan [2020] Young Woong Park and Diego Klabjan. Subset selection for multiple linear regression via optimization. Journal of Global Optimization, 77(3):543–574, 2020.
  • Pearl [2009] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
  • Peters and Bühlmann [2013] Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2013.
  • Pilanci et al. [2015] Mert Pilanci, Martin J Wainwright, and Laurent El Ghaoui. Sparse learning via Boolean relaxations. Mathematical Programming, 151(1):63–87, 2015.
  • Raskutti and Uhler [2018] Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018.
  • Saegusa and Shojaie [2016] Takumi Saegusa and Ali Shojaie. Joint estimation of precision matrices in heterogeneous populations. Electronic Journal of Statistics, 10(1):1341–1392, 2016.
  • Shojaie and Michailidis [2010] Ali Shojaie and George Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.
  • Silander and Myllymäki [2006] Tomi Silander and Petri Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Conference on Uncertainty in Artificial Intelligence, pages 445–452, 2006.
  • Solus et al. [2021] Liam Solus, Yuhao Wang, and Caroline Uhler. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795–814, 2021.
  • Sondhi and Shojaie [2019] Arjun Sondhi and Ali Shojaie. The reduced PC-algorithm: Improved causal structure learning in large random networks. Journal of Machine Learning Research, 20(164):1–31, 2019.
  • Spirtes et al. [2000] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
  • Studený and Haws [2013] Milan Studený and David C Haws. On polyhedral approximations of polytopes for learning Bayesian networks. Journal of Algebraic Statistics, 4(1), 2013.
  • Uhler et al. [2013] Caroline Uhler, Garvesh Raskutti, Peter Bühlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2):436–463, 2013.
  • van de Geer and Bühlmann [2013] Sara van de Geer and Peter Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536–567, 2013.
  • Vershynin [2012] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.
  • Wei et al. [2020] Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. On the convexification of constrained quadratic optimization problems with indicator variables. In Daniel Bienstock and Giacomo Zambelli, editors, Integer Programming and Combinatorial Optimization, pages 433–447, Cham, 2020. Springer International Publishing.
  • Wei et al. [2021] Linchuan Wei, Alper Atamtürk, Andrés Gómez, and Simge Küçükyavuz. On the convex hull of convex quadratic optimization problems with indicators. arXiv preprint arXiv:2201.00387, 2021.
  • Wei et al. [2022] Linchuan Wei, Andrés Gómez, and Simge Küçükyavuz. Ideal formulations for constrained convex optimization problems with indicator variables. Mathematical Programming, 192(1):57–88, 2022.
  • Xiang and Kim [2013] Jing Xiang and Seyoung Kim. A* lasso for learning a sparse Bayesian network structure for continuous variables. In Advances in Neural Information Processing Systems, pages 2418–2426, 2013.
  • Xie and Deng [2020] Weijun Xie and Xinwei Deng. Scalable algorithms for the sparse ridge regression. SIAM Journal on Optimization, 30(4):3359–3386, 2020.
  • Zhang [2010] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
  • Zheng et al. [2014] Xiaojin Zheng, Xiaoling Sun, and Duan Li. Improving the performance of MIQP solvers for quadratic programs with cardinality and minimum threshold constraints: A semidefinite program approach. INFORMS Journal on Computing, 26(4):690–703, 2014.
  • Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: continuous optimization for structure learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 9492–9503, Red Hook, NY, USA, 2018. Curran Associates Inc.