
Learning Sparse Fixed-Structure Gaussian Bayesian Networks

Arnab Bhattacharyya
National University of Singapore
[email protected]
   Davin Choo
National University of Singapore
[email protected]
   Rishikesh Gajjala
Indian Institute of Science, Bangalore
[email protected]
   Sutanu Gayen
National University of Singapore
[email protected]
   Yuhao Wang
National University of Singapore
[email protected]
Abstract

Gaussian Bayesian networks (a.k.a. linear Gaussian structural equation models) are widely used to model causal interactions among continuous variables. In this work, we study the problem of learning a fixed-structure Gaussian Bayesian network up to a bounded error in total variation distance. We analyze the commonly used node-wise least squares regression LeastSquares and prove that it has near-optimal sample complexity. We also study a couple of new algorithms for the problem:

  • BatchAvgLeastSquares takes the average of several batches of least squares solutions at each node, so that one can interpolate between the batch size and the number of batches. We show that BatchAvgLeastSquares also has near-optimal sample complexity.

  • CauchyEst takes the median of solutions to several batches of linear systems at each node. We show that the algorithm specialized to polytrees, CauchyEstTree, has near-optimal sample complexity.

Experimentally, we show that for uncontaminated, realizable data, the LeastSquares algorithm performs best, but in the presence of contamination or DAG misspecification, CauchyEst/CauchyEstTree and BatchAvgLeastSquares respectively perform better. (Code is available at https://github.com/YohannaWANG/CauchyEst.)

1 Introduction

Linear structural equation models (SEMs) are widely used to model interactions among multiple variables, and they have found applications in diverse areas, including economics, genetics, sociology, psychology and education; see [Mue99, Mul09] for pointers to the extensive literature in this area. Gaussian Bayesian networks are a popularly used form of SEMs where: (i) there are no hidden/unobserved variables, (ii) each variable is a linear combination of its parents plus a Gaussian noise term, and (iii) all noise terms are independent. The structure of a Bayesian network refers to the directed acyclic graph (DAG) that prescribes the parent nodes for each node in the graph.

This work studies the task of learning a Gaussian Bayesian network given its structure. The problem is quite fundamental and has been subject to extensive prior work. The usual formulation of this problem is in terms of parameter estimation, where one wants a consistent estimator that exactly recovers the parameters of the Bayesian network in the limit, as the number of samples approaches infinity. In contrast, we consider the problem from the viewpoint of distribution learning [KMR+94]. Analogous to Valiant's famous PAC model for learning Boolean functions [Val84], the goal here is to learn, with high probability, a distribution \widehat{\mathcal{P}} that is close to the ground-truth distribution \mathcal{P}, using an efficient algorithm. In this setting, pointwise convergence of the parameters is no longer a requirement; the aim is rather to approximately learn the induced distribution. Indeed, this relaxed objective may be achievable when exact parameter recovery is not (e.g., for ill-conditioned systems), and it can be the more relevant requirement for downstream inference tasks. Diakonikolas [Dia16] surveys the current state-of-the-art in distribution learning from an algorithmic perspective.

Consider a Gaussian Bayesian network \mathcal{P} with n variables, with parameters given by coefficients a_{i\leftarrow j} and variances \sigma_{i}. (We write i\leftarrow j to emphasize that X_{j} is the parent of X_{i} in the Bayesian network.) For each i\in[n], the linear structural equation for variable X_{i}, with parent indices \pi_{i}\subseteq[n], is as follows:

Xi=ηi+jπiaijXj,ηiN(0,σi2)X_{i}=\eta_{i}+\sum_{j\in\pi_{i}}a_{i\leftarrow j}X_{j},\qquad\eta_{i}\sim N(0,\sigma_{i}^{2})

for some unknown parameters aija_{i\leftarrow j}’s and σi\sigma_{i}’s. If a variable XiX_{i} has no parents, then XiN(0,σi2)X_{i}\sim N(0,\sigma_{i}^{2}) is simply an independent Gaussian. Our goal is to efficiently learn a distribution 𝒬\mathcal{Q} such that

dTV(𝒫,𝒬)ε\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\varepsilon

while minimizing the number of samples we draw from 𝒫\mathcal{P} as a function of nn, ε\varepsilon and relevant structure parameters. Here, dTV\mathrm{d_{TV}} denotes the total variation or statistical distance between two distributions:

dTV(𝒫,𝒬)=12n|𝒫(x)𝒬(x)|𝑑x.\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})=\frac{1}{2}\int_{\mathbb{R}^{n}}\left\lvert\mathcal{P}(x)-\mathcal{Q}(x)\right\rvert dx.
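To make the structural equation model above concrete, here is a minimal Python sketch of ancestral sampling from a Gaussian Bayesian network; it is illustrative only (the function name sample_gaussian_bn and the data layout are our assumptions, not the paper's code), and it assumes the variables are indexed in topological order.

```python
import numpy as np

def sample_gaussian_bn(parents, coef, sigma2, n_samples, rng=None):
    """Ancestral sampling: X_i = sum_{j in pi_i} a_{i<-j} X_j + eta_i with eta_i ~ N(0, sigma_i^2).

    parents[i] : list of parent indices of variable i (all smaller than i, topological order)
    coef[i]    : coefficients a_{i<-j}, aligned with parents[i]
    sigma2[i]  : noise variance sigma_i^2
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(parents)
    X = np.zeros((n_samples, n))
    for i in range(n):  # parents are already sampled when we reach node i
        noise = rng.normal(0.0, np.sqrt(sigma2[i]), size=n_samples)
        if parents[i]:
            X[:, i] = X[:, parents[i]] @ np.asarray(coef[i]) + noise
        else:
            X[:, i] = noise  # a root node is simply N(0, sigma_i^2)
    return X

# Example: X_0 ~ N(0, 1) and X_1 = 0.8 * X_0 + N(0, 0.5)
samples = sample_gaussian_bn([[], [0]], [[], [0.8]], [1.0, 0.5], n_samples=1000)
```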

One way to approach the problem is to simply view 𝒫\mathcal{P} as an nn-dimensional multivariate Gaussian. In this case, it is known that the Gaussian 𝒬N(0,Σ^)\mathcal{Q}\sim N(0,\widehat{\Sigma}) defined by the empirical covariance matrix Σ^\widehat{\Sigma}, computed with 𝒪(n2/ε2)\mathcal{O}(n^{2}/\varepsilon^{2}) samples from 𝒫\mathcal{P}, is ε\varepsilon-close in TV distance to 𝒫\mathcal{P} with constant probability [ABDH+20, Appendix C]. This sample complexity is also necessary for learning general nn-dimensional Gaussians and hence general Gaussian Bayesian networks on nn variables.
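By contrast, this baseline simply fits the empirical covariance and ignores the DAG entirely; a one-function sketch (illustrative, not the paper's code) of that estimator:

```python
import numpy as np

def fit_dense_gaussian(X):
    """Baseline: estimate N(0, Sigma_hat) from a sample matrix X of shape (m, n).

    This needs Theta(n^2 / eps^2) samples in general, which is what motivates
    exploiting the sparse DAG structure instead.
    """
    m = X.shape[0]
    return (X.T @ X) / m  # empirical covariance (the mean is known to be zero)
```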

We focus therefore on the setting where the structure of the network is sparse.

Assumption 1.1.

Each variable in the Bayesian network has at most d parents, i.e., \left\lvert\pi_{i}\right\rvert\leq d for all i\in[n].

Sparsity is a common and very useful assumption for statistical learning problems; see the book [HTW19] for an overview of the role of sparsity in regression. More specifically, in our context, the assumption of bounded in-degree is popular (e.g., see [Das97, BCD20]) and also very natural, as it reflects the belief that in the correct model, each linear structural equation involves a small number of variables. (More generally, one would want to assume a bound on the "complexity" of each equation. In our context, as each equation is linear, the complexity is simply proportional to the in-degree.)

1.1 Our contributions

  1.

    Analysis of MLE LeastSquares and a distributed-friendly generalization BatchAvgLeastSquares

    The standard algorithm for parameter estimation of Gaussian Bayesian networks is to perform node-wise least squares regression. It is easy to see that LeastSquares is the maximum likelihood estimator. However, to the best of our knowledge, no explicit sample complexity bound was previously known for this estimator. We show that learning \mathcal{P} to within TV distance \varepsilon using LeastSquares requires only \widetilde{\mathcal{O}}(nd/\varepsilon^{2}) samples, which is nearly optimal.

    We also give a generalization dubbed BatchAvgLeastSquares which solves multiple batches of least squares problems (on smaller systems of equations), and then returns their mean. As each batch is independent from the others, they can be solved separately before their solutions are combined. Our analysis will later show that we can essentially interpolate between “batch size” and “number of batches” while maintaining a similar total sample complexity – LeastSquares is the special case of a single batch. Notably, we do not require any additional assumptions about the coefficients or variance terms in the analyses of LeastSquares and BatchAvgLeastSquares.

  2.

    New algorithms based on Cauchy random variables: CauchyEst and CauchyEstTree

    We develop a new algorithm CauchyEst. At each node, CauchyEst solves several small linear systems of equations and takes the component-wise median of the obtained solution to obtain an estimate of the coefficients for the corresponding structural equation. In the special case of bounded-degree polytree structures, where the underlying undirected Bayesian network is acyclic, we specialize the algorithm to CauchyEstTree for which we give theoretical guarantees. Polytrees are of particular interest because inference on polytree-structured Bayesian networks can be performed efficiently [KP83, Pea86]. On polytrees, we show that the sample complexity of CauchyEstTree is also 𝒪~(nd/ε2)\widetilde{\mathcal{O}}(nd/\varepsilon^{2}). Somewhat surprisingly, our analysis (as the name of the algorithm reflects) involves Cauchy random variables which usually do not arise in the context of regression.

  3.

    Hardness results

    In Section 5, we show that our sample complexity upper-bound is nearly optimal in terms of the dependence on the parameters nn, dd, and ϵ\epsilon. In particular, we show that learning the coefficients and noises of a linear structural equation model (on nn variables, with in-degree at most dd) up to an ϵ\epsilon-error in dTV\mathrm{d_{TV}}-distance with probability 2/3 requires Ω(ndϵ2)\Omega(nd\epsilon^{-2}) samples in general. We use a packing argument based on Fano’s inequality to achieve this result.

  4.

    Extensive experiments on synthetic and real-world networks

    We experimentally compare the algorithms studied here as well as other well-known methods for undirected Gaussian graphical model regression, and investigate how the distance between the true and learned distributions changes with the number of samples. We find that the LeastSquares estimator performs the best among all algorithms on uncontaminated datasets. However, CauchyEst, CauchyEstTree and BatchMedLeastSquares outperform LeastSquares and BatchAvgLeastSquares by a large margin when a fraction of the samples are contaminated. In the non-realizable/agnostic learning case, BatchAvgLeastSquares, CauchyEst, and CauchyEstTree perform better than the other algorithms.

1.2 Outline of paper

In Section 2, we relate KL divergence with TV distance and explain how to decompose the KL divergence into nn terms so that it suffices for us to estimate the parameters for each variable independently. We also give an overview of our two-phased recovery approach and explain how to use recovered coefficients to estimate the variances via VarianceRecovery. For estimating coefficients, Section 3 covers the estimators based on linear least squares (LeastSquares and BatchAvgLeastSquares) while Section 4 presents our new Cauchy-based algorithms (CauchyEst and CauchyEstTree). To complement our algorithmic results, we provide hardness results in Section 5 and experimental evaluation in Section 6.

For a cleaner exposition, we defer some formal proofs to Appendix B.

1.3 Further related work

Bayesian networks, both in discrete and continuous settings, were formally introduced by Pearl [Pea88] in 1988 to model uncertainty in AI systems. For the continuous case, Pearl considered the case when each node is a linear function of its parents added with an independent Gaussian noise [Pea88, Chapter 7]. The parameter learning problem – recovering the distribution of nodes conditioned on its parents from data – is well-studied in practice, and maximum likelihood estimators are known for various simple settings such as when the conditional distribution is Gaussian or the variables are discrete-valued. See for example the implementation of fit in the R package bnlearn [Scu20].

The focus of our paper is to give formal guarantees for parameter learning in the PAC framework introduced by Valiant [Val84] in 1984. Subsequently, Haussler [Hau18] generalized this framework for studying parameter and density estimation problems of continuous distributions. Dasgupta [Das97] first looked at the problem of parameter learning for fixed-structure Bayesian networks in the discrete and continuous settings and gave finite sample complexity bounds for these problems based on the VC-dimensions of the hypothesis classes. In particular, he gave an algorithm for learning the parameters of a Bayes net on n binary variables of bounded in-degree in \mathrm{d_{KL}} distance using a number of samples quadratic in n. Subsequently, tight (linear) sample complexity upper and lower bounds were shown for this problem [BGMV20, BGPV20, CDKS20]. To the best of our knowledge, a finite PAC-style bound for fixed-structure Gaussian Bayesian networks was not known previously.

The question of structure learning for Gaussian Bayesian networks has been extensively studied. A number of works [PB14, GH17, CDW19, PK20, Par20, GDA20] have proposed increasingly general conditions for ensuring identifiability of the network structure from observations. Structure learning algorithms that work for high-dimensional Gaussian Bayesian networks have also been proposed by others (e.g., see [AZ15, AGZ19, GZ20]).

2 Preliminaries

In this section, we discuss why we upper bound the total variational distance using KL divergence and give a decomposition of the KL divergence into nn terms, one associated with each variable in the Bayesian network. This decomposition motivates why our algorithms and analysis focus on recovering parameters for a single variable. We also present our general two-phased recovery approach and explain how to estimate variances using recovered coefficients in Section 2.6.

2.1 Notation

A Bayesian network (Bayes net in short) 𝒫\mathcal{P} is a joint distribution 𝒫\mathcal{P} over nn variables X1,,XnX_{1},\ldots,X_{n} defined by the underlying directed acyclic graph (DAG) GG. The DAG G=(V,E)G=(V,E) encodes the dependence between the variables where V={X1,,Xn}V=\left\{X_{1},\ldots,X_{n}\right\} and (Xj,Xi)E(X_{j},X_{i})\in E if and only if XjX_{j} is a parent of XiX_{i}. For any variable XiX_{i}, we use the set πi[n]\pi_{i}\subseteq[n] to represent the indices of XiX_{i}’s parents. Under this notation, each variable XiX_{i} of 𝒫\mathcal{P} is independent of XiX_{i}’s non-descendants conditioned on πi\pi_{i}. Therefore, using Bayes rule in the topological order of GG, we have

𝒫(X1,,Xn)=i=1nPr𝒫(Xiπi)\mathcal{P}(X_{1},\ldots,X_{n})=\prod_{i=1}^{n}\Pr_{\mathcal{P}}(X_{i}\mid\pi_{i})

Without loss of generality, by renaming variables, we may assume that each variable X_{i} only has ancestors with indices smaller than i. We also define p_{i}=\left\lvert\pi_{i}\right\rvert as the number of parents of X_{i} and d_{avg}=\frac{1}{n}\sum_{i=1}^{n}p_{i} to be the average in-degree. Furthermore, a DAG G is a polytree if the undirected version of G is acyclic.

We study the realizable setting where our unknown probability distribution \mathcal{P} is Markov with respect to the given Bayesian network. We denote the true (hidden) parameters associated with \mathcal{P} by \alpha^{*}=(\alpha^{*}_{1},\ldots,\alpha^{*}_{n}). Our algorithms recover parameter estimates \widehat{\alpha}=(\widehat{\alpha}_{1},\ldots,\widehat{\alpha}_{n}) such that the induced probability distribution \mathcal{Q} is close in total variational distance to \mathcal{P}. For each i\in[n], \alpha_{i}^{*}=(A_{i},\sigma_{i}) is the set of ground-truth parameters associated with variable X_{i}, where A_{i} is the vector of coefficients associated with \pi_{i} and \sigma_{i}^{2} is the variance of \eta_{i}; \widehat{\alpha}_{i}=(\widehat{A}_{i},\widehat{\sigma}_{i}) is our estimate of \alpha_{i}^{*}.

In the course of the paper, we will often focus on the perspective of a single variable of interest. This allows us to drop a subscript for a cleaner discussion. Let us denote such a variable of interest by Y\in V and use the index y\in[n]. Without loss of generality, by renaming variables, we may further assume that the parents of Y are X_{1},\ldots,X_{p}. By Assumption 1.1, we know that p\leq d. We can write Y=\eta_{y}+\sum_{i=1}^{p}a_{y\leftarrow i}X_{i} where \eta_{y}\sim N(0,\sigma_{y}^{2}). We use the matrix M\in\mathbb{R}^{p\times p} to denote the covariance matrix of the parents of Y, where M_{i,j}=\mathbb{E}\left[X_{i}X_{j}\right], and write M=LL^{\top} for the Cholesky decomposition of M. Under this notation, the vector (X_{1},\ldots,X_{p})\sim N(0,M) is distributed as a multivariate Gaussian. Our goal is then to produce estimates \widehat{a}_{y\leftarrow i} for each a_{y\leftarrow i}. For notational convenience, we group the coefficients into A=\left[a_{y\leftarrow 1},\ldots,a_{y\leftarrow p}\right]^{\top} and \widehat{A}=\left[\widehat{a}_{y\leftarrow 1},\ldots,\widehat{a}_{y\leftarrow p}\right]^{\top}. The vector \Delta=\widehat{A}-A captures the entry-wise gap between our estimates and the ground truth.

We write [n][n] to mean {1,2,,n}\{1,2,\ldots,n\} and |S|\left\lvert S\right\rvert to mean the size of a set SS. For a matrix MM, Mi,jM_{i,j} denotes its (i,j)(i,j)-th entry. We use \left\lVert\cdot\right\rVert to both mean the operator/spectral norm for matrices and L2L_{2}-norm for vectors, which should be clear from context. We hide absolute constant multiplicative factors and multiplicative factors poly-logarithmic in the argument using standard notations: 𝒪()\mathcal{O}(\cdot), Ω()\Omega(\cdot), and 𝒪~()\widetilde{\mathcal{O}}(\cdot).

2.2 Basic facts and results

We begin by stating some standard facts and results.

Fact 2.1.

Suppose XN(0,σ2)X\sim N(0,\sigma^{2}). Then, for any t>0t>0, Pr(X>t)exp(t22σ2)\Pr\left(X>t\right)\leq\exp\left(-\frac{t^{2}}{2\sigma^{2}}\right).

Fact 2.2.

Consider any matrix Bn×mB\in\mathbb{R}^{n\times m} with rows B1,,BnmB_{1},\ldots,B_{n}\in\mathbb{R}^{m}. For any i[m]i\in[m] and any vector vmv\in\mathbb{R}^{m} with i.i.d. N(0,σ2)N(0,\sigma^{2}) entries, we have that (Bv)i=BivN(0,σ2Bi2)(Bv)_{i}=B_{i}v\sim N(0,\sigma^{2}\cdot\left\lVert B_{i}\right\rVert^{2}).

Fact 2.3 (Theorem 2.2 in [Gut09]).

Let X_{1},\ldots,X_{p}\sim N(0,LL^{\top}) be p i.i.d. n-dimensional multivariate Gaussians with covariance LL^{\top}\in\mathbb{R}^{n\times n} (i.e., L\in\mathbb{R}^{n\times p}). If X\in\mathbb{R}^{p\times n} is the matrix formed by stacking X_{1},\ldots,X_{p} as rows of X, then X=GL^{\top} where G\in\mathbb{R}^{p\times p} is a random matrix with i.i.d. N(0,1) entries. (The transformation stated in [Gut09, Theorem 2.2, page 120] is for a single multivariate Gaussian vector, thus we need to take the transpose when we stack them in rows. Note that G and G^{\top} are identically distributed.)
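Fact 2.3 is used repeatedly in the analyses below; the following is a quick numerical sanity check (illustrative, not part of the paper) of the identity X = GL^{\top} for samples stacked as rows.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 0.5], [0.5, 1.0]])  # covariance matrix of the parents
L = np.linalg.cholesky(M)               # M = L L^T

k = 200_000
G = rng.standard_normal((k, 2))         # i.i.d. N(0, 1) entries
X = G @ L.T                             # each row of X is distributed as N(0, M)

print(np.round((X.T @ X) / k, 2))       # empirical covariance, close to M
```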

Lemma 2.4 (Equation 2.19 in [Wai19]).

Let Y=k=1nZk2Y=\sum_{k=1}^{n}Z_{k}^{2}, where each ZkN(0,1)Z_{k}\sim N(0,1). Then, Yχn2Y\sim\chi^{2}_{n} and for any 0<t<10<t<1, we have Pr(|Y/n1|t)2exp(nt2/8)\Pr\left(\left\lvert Y/n-1\right\rvert\geq t\right)\leq 2\exp\left(-nt^{2}/8\right).

Lemma 2.5 (Consequence of Corollary 3.11 in [BVH+16]).

Let Gn×mG\in\mathbb{R}^{n\times m} be a matrix with i.i.d. N(0,1)N(0,1) entries where nmn\leq m. Then, for some universal constant CC, Pr(G2(n+m))nexp(Cm)\Pr\left(\left\lVert G\right\rVert\geq 2(\sqrt{n}+\sqrt{m})\right)\leq\sqrt{n}\cdot\exp\left(-C\cdot m\right).

Lemma 2.6.

Let Gk×dG\in\mathbb{R}^{k\times d} be a matrix with i.i.d. N(0,1)N(0,1) entries. Then, for any constant 0<c1<1/20<c_{1}<1/2 and kd/c12k\geq d/c_{1}^{2},

Pr((GG)1op1(12c1)2k)1exp(kc122)\Pr\left(\left\lVert(G^{\top}G)^{-1}\right\rVert_{op}\leq\frac{1}{\left(1-2c_{1}\right)^{2}k}\right)\geq 1-\exp\left(-\frac{kc_{1}^{2}}{2}\right)
Proof.

See Appendix B. ∎

Lemma 2.7.

Let Gk×pG\in\mathbb{R}^{k\times p} be a matrix with i.i.d. N(0,1)N(0,1) entries and ηk\eta\in\mathbb{R}^{k} be a vector with i.i.d. N(0,σ2)N(0,\sigma^{2}) entries, where GG and η\eta are independent. Then, for any constant c2>0c_{2}>0,

Pr(Gη2<2σc2kp)12pexp(2k)pexp(c222)\Pr\left(\left\lVert G^{\top}\eta\right\rVert_{2}<2\sigma c_{2}\sqrt{kp}\right)\geq 1-2p\exp\left(-2k\right)-p\exp\left(-\frac{c_{2}^{2}}{2}\right)
Proof.

See Appendix B. ∎

The next result gives the non-asymptotic convergence of medians of Cauchy random variables. We use this result in the analysis of CauchyEst, and it may be of independent interest.

Lemma 2.8.

[Non-asymptotic convergence of Cauchy median] Consider a collection of mm i.i.d. Cauchy(0,1)\mathrm{Cauchy}(0,1) random variables X1,,XmX_{1},\ldots,X_{m}. Given a threshold 0<τ<10<\tau<1, we have

Pr(median{X1,,Xm}[τ,τ])2exp(mτ28)\Pr\left(\mathrm{median}\left\{X_{1},\ldots,X_{m}\right\}\not\in[-\tau,\tau]\right)\leq 2\exp\left(-\frac{m\tau^{2}}{8}\right)
Proof.

See Appendix B. ∎
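The bound in Lemma 2.8 is easy to check empirically; the small simulation below (illustrative, not part of the paper) compares the empirical failure probability of the Cauchy median against the stated bound.

```python
import numpy as np

rng = np.random.default_rng(1)
m, tau, trials = 400, 0.2, 10_000

# Fraction of trials where the median of m Cauchy(0, 1) draws falls outside [-tau, tau]
medians = np.median(rng.standard_cauchy((trials, m)), axis=1)
empirical = np.mean(np.abs(medians) > tau)
bound = 2 * np.exp(-m * tau ** 2 / 8)   # Lemma 2.8

print(f"empirical failure rate {empirical:.4f} vs bound {bound:.4f}")
```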

2.3 Distance and divergence between probability distributions

Recall that we are given sample access to an unknown probability distribution 𝒫\mathcal{P} over the values of (X1,,Xn)n(X_{1},\ldots,X_{n})\in\mathbb{R}^{n} and the corresponding structure of a Bayesian network GG on nn variables. We denote α\alpha^{*} as the true (hidden) parameters, inducing the probability distribution 𝒫\mathcal{P}, which we estimate by α^\widehat{\alpha}. In this work, we aim to recover parameters α^\widehat{\alpha} such that our induced probability distribution 𝒬\mathcal{Q} is as close as possible to 𝒫\mathcal{P} in total variational distance.

Definition 2.9 (Total variational (TV) distance).

Given two probability distributions \mathcal{P} and \mathcal{Q} over \mathbb{R}^{n}, the total variational distance between them is defined as \mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})=\sup_{A\subseteq\mathbb{R}^{n}}\left\lvert\mathcal{P}(A)-\mathcal{Q}(A)\right\rvert=\frac{1}{2}\int_{\mathbb{R}^{n}}\left\lvert\mathcal{P}-\mathcal{Q}\right\rvert dx, where the supremum ranges over measurable sets A.

Instead of directly dealing with total variational distance, we will bound the Kullback–Leibler (KL) divergence and then appeal to Pinsker's inequality [Tsy08, Lemma 2.5, page 88] to upper bound \mathrm{d_{TV}} via \mathrm{d_{KL}}. We will later show algorithms that achieve \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq\varepsilon.

Definition 2.10 (Kullback–Leibler (KL) divergence).

Given two probability distributions 𝒫\mathcal{P} and 𝒬\mathcal{Q} over n\mathbb{R}^{n}, the KL divergence between them is defined as dKL(𝒫,𝒬)=An𝒫(A)log(𝒫(A)𝒬(A))𝑑A\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\int_{A\in\mathbb{R}^{n}}\mathcal{P}(A)\log\left(\frac{\mathcal{P}(A)}{\mathcal{Q}(A)}\right)dA.

Fact 2.11 (Pinsker’s inequality).

For distributions 𝒫\mathcal{P} and 𝒬\mathcal{Q}, dTV(𝒫,𝒬)dKL(𝒫,𝒬)/2\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\sqrt{\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})/2}.

Thus, if s(\varepsilon) samples suffice to learn a distribution \mathcal{Q} such that \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq\varepsilon, then s(2\varepsilon^{2}) samples suffice to ensure \mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\varepsilon.

2.4 Decomposing the KL divergence

For a set of parameters \alpha=(\alpha_{1},\ldots,\alpha_{n}), denote \alpha_{i} as the subset of parameters that are relevant to the variable X_{i}. Following the approach of [Das97] (which analyzes the non-realizable setting where the distribution \mathcal{P} may not correspond to the causal structure of the given Bayesian network; as we study the realizable setting, we have a much simpler derivation), we decompose \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q}) into n terms that can be computed by analyzing the quality of the recovered parameters for each variable X_{i}.

For notational convenience, we write xx to mean (x1,,xn)(x_{1},\ldots,x_{n}), πi(x)\pi_{i}(x) to mean the values given to parents of variable XiX_{i} by xx, and 𝒫(x)\mathcal{P}(x) to mean 𝒫(X1=x1,,Xn=xn)\mathcal{P}(X_{1}=x_{1},\ldots,X_{n}=x_{n}). Let us define

dCP(αi,α^i)=xi,πi(x)𝒫(xi,πi(x))log(𝒫(xiπi(x))𝒬(xiπi(x)))𝑑xi𝑑πi(x)\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})=\int_{x_{i},\pi_{i}(x)}\mathcal{P}(x_{i},\pi_{i}(x))\log\left(\frac{\mathcal{P}(x_{i}\mid\pi_{i}(x))}{\mathcal{Q}(x_{i}\mid\pi_{i}(x))}\right)dx_{i}d\pi_{i}(x)

where \widehat{\alpha}_{i} and \alpha^{*}_{i} represent the parameters relevant to variable X_{i} from \widehat{\alpha} and \alpha^{*} respectively. By the Bayesian network decomposition of joint probabilities and marginalization, one can show that

dKL(𝒫,𝒬)=i=1ndCP(αi,α^i)\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\sum_{i=1}^{n}\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})

See Appendix A for the full derivation details.

2.5 Bounding dCP\mathrm{d_{CP}} for an arbitrary variable

We now analyze \mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i}) with respect to our estimates \widehat{\alpha}_{i}=(\widehat{A}_{i},\widehat{\sigma}_{i}) and the hidden true parameters \alpha^{*}_{i}=(A_{i},\sigma_{i}) for any i\in[n]. For derivation details, see Appendix A.

With respect to variable XiX_{i}, one can derive dCP(αi,α^i)=ln(σ^iσi)+σi2σ^i22σ^i2+ΔiMiΔi2σ^i2\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})=\ln\left(\frac{\widehat{\sigma}_{i}}{\sigma_{i}}\right)+\frac{\sigma_{i}^{2}-\widehat{\sigma}_{i}^{2}}{2\widehat{\sigma}_{i}^{2}}+\frac{\Delta_{i}^{\top}M_{i}\Delta_{i}}{2\widehat{\sigma}_{i}^{2}}. Thus,

dKL(𝒫,𝒬)=i=1ndCP(αi,α^i)=i=1nln(σ^iσi)+σi2σ^i22σ^i2+ΔiMiΔi2σ^i2\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\sum_{i=1}^{n}\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})=\sum_{i=1}^{n}\ln\left(\frac{\widehat{\sigma}_{i}}{\sigma_{i}}\right)+\frac{\sigma_{i}^{2}-\widehat{\sigma}_{i}^{2}}{2\widehat{\sigma}_{i}^{2}}+\frac{\Delta_{i}^{\top}M_{i}\Delta_{i}}{2\widehat{\sigma}_{i}^{2}} (1)

where M_{i} is the covariance matrix associated with variable X_{i}, \alpha_{i}^{*}=(A_{i},\sigma_{i}) are the coefficients and variance associated with variable X_{i}, \widehat{\alpha}_{i}=(\widehat{A}_{i},\widehat{\sigma}_{i}) are the estimates for \alpha_{i}^{*}, and \Delta_{i}=\widehat{A}_{i}-A_{i}.
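Read this way, Eq. (1) is a per-node error term that can be evaluated directly from recovered parameters. A hedged sketch of that computation follows; the function names and the way parameters are grouped are illustrative assumptions, not the paper's code.

```python
import numpy as np

def d_cp(A, sigma2, A_hat, sigma2_hat, M):
    """Per-node term of Eq. (1):
       ln(sigma_hat / sigma) + (sigma^2 - sigma_hat^2) / (2 sigma_hat^2)
                             + Delta^T M Delta / (2 sigma_hat^2)."""
    delta = np.asarray(A_hat) - np.asarray(A)
    return (0.5 * np.log(sigma2_hat / sigma2)
            + (sigma2 - sigma2_hat) / (2 * sigma2_hat)
            + delta @ M @ delta / (2 * sigma2_hat))

def d_kl(params, params_hat, covs):
    """Sum of the per-node terms, i.e. d_KL(P, Q) as in Eq. (1)."""
    return sum(d_cp(A, s2, A_h, s2_h, M)
               for (A, s2), (A_h, s2_h), M in zip(params, params_hat, covs))
```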

Proposition 2.12 (Implication of KL decomposition).

Let ε0.17\varepsilon\leq 0.17 be a constant. Suppose α^i\widehat{\alpha}_{i} has the following properties for each i[n]i\in[n]:

|ΔiMiΔi|σi2εpindavg\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\leq\sigma_{i}^{2}\cdot\frac{\varepsilon\cdot p_{i}}{n\cdot d_{avg}} (Condition 1)
(1εpindavg)σi2σ^i2(1+εpindavg)σi2\left(1-\sqrt{\frac{\varepsilon\cdot p_{i}}{n\cdot d_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}\leq\left(1+\sqrt{\frac{\varepsilon\cdot p_{i}}{n\cdot d_{avg}}}\right)\cdot\sigma_{i}^{2} (Condition 2)

Then, \mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})\leq 3\cdot\frac{\varepsilon\cdot p_{i}}{n\cdot d_{avg}} for all i\in[n]. Thus, \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\sum_{i=1}^{n}\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})\leq 3\varepsilon. (For a cleaner argument, we bound \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq 3\varepsilon. This is qualitatively the same as showing \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq\varepsilon since one can repeat the entire analysis with \varepsilon^{\prime}=\varepsilon/3.)

Proof.

Consider an arbitrary fixed i\in[n]. Denote \gamma=\sigma_{i}^{2}/\widehat{\sigma}_{i}^{2}. Observe that \gamma-1-\ln(\gamma)\leq(\gamma-1)^{2} for \gamma\geq 0.316\ldots (this inequality is also used in [ABDH+20, Lemma 2.9]). Since p_{i}\leq n\cdot d_{avg}=\sum_{i}p_{i}, Condition 2 implies that \gamma\geq 1/(1+\sqrt{\varepsilon})\geq 1/2. Then,

ln(σ^iσi)+σi2σ^i22σ^i2\displaystyle\ln\left(\frac{\widehat{\sigma}_{i}}{\sigma_{i}}\right)+\frac{\sigma_{i}^{2}-\widehat{\sigma}_{i}^{2}}{2\widehat{\sigma}_{i}^{2}} =12(σi2σ^i21ln(σi2σ^i2))\displaystyle=\frac{1}{2}\cdot\left(\frac{\sigma_{i}^{2}}{\widehat{\sigma}_{i}^{2}}-1-\ln\left(\frac{\sigma_{i}^{2}}{\widehat{\sigma}_{i}^{2}}\right)\right)
=12(γ1ln(γ))\displaystyle=\frac{1}{2}\cdot\left(\gamma-1-\ln\left(\gamma\right)\right)
12(γ1)2\displaystyle\leq\frac{1}{2}\cdot\left(\gamma-1\right)^{2} By Condition 2
12(11ϵpindavg1)2\displaystyle\leq\frac{1}{2}\cdot\left(\frac{1}{1-\sqrt{\frac{\epsilon p_{i}}{nd_{avg}}}}-1\right)^{2} Since (1ϵpindavg)σi2σ^i2\left(1-\sqrt{\frac{\epsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}
2ϵpindavg\displaystyle\leq\frac{2\epsilon p_{i}}{nd_{avg}} Holds when 0ϵpindavg140\leq\frac{\epsilon p_{i}}{nd_{avg}}\leq\frac{1}{4}

Meanwhile,

ΔiMiΔi2σ^i2\displaystyle\frac{\Delta_{i}^{\top}M_{i}\Delta_{i}}{2\widehat{\sigma}_{i}^{2}} |ΔiMiΔi|2σ^i2\displaystyle\leq\frac{\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert}{2\widehat{\sigma}_{i}^{2}}
piϵndavgσi22σ^i2\displaystyle\leq\frac{p_{i}\epsilon}{nd_{avg}}\cdot\frac{\sigma_{i}^{2}}{2\widehat{\sigma}_{i}^{2}} By Condition 1
piϵ2ndavg11ϵpindavg\displaystyle\leq\frac{p_{i}\epsilon}{2nd_{avg}}\cdot\frac{1}{1-\sqrt{\frac{\epsilon p_{i}}{nd_{avg}}}} Since (1ϵpindavg)σi2σ^i2\left(1-\sqrt{\frac{\epsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}
ϵpindavg\displaystyle\leq\frac{\epsilon p_{i}}{nd_{avg}} Holds when 0ϵpindavg140\leq\frac{\epsilon p_{i}}{nd_{avg}}\leq\frac{1}{4}

Putting together, we see that dCP(αi,α^i)3ϵpindavgd_{CP}(\alpha^{*}_{i},\widehat{\alpha}_{i})\leq\frac{3\epsilon p_{i}}{nd_{avg}}. ∎

2.6 Two-phased recovery approach

Algorithm 1 states our two-phased recovery approach. We estimate the coefficients of the Bayesian network in the first phase and use them to recover the variances in the second phase.

Algorithm 1 Two-phased recovery algorithm
1:Input: DAG GG and sample parameters m1m_{1} and m2m_{2}
2:Draw m=m1+m2m=m_{1}+m_{2} independent samples of (X1,,Xn)(X_{1},\ldots,X_{n}).
3:A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n}\leftarrow Run a coefficient recovery algorithm using first m1m_{1} samples.
4:σ^12,,σ^n2\widehat{\sigma}_{1}^{2},\ldots,\widehat{\sigma}_{n}^{2}\leftarrow Run VarianceRecovery using last m2m_{2} samples and A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n}
5:return A^1,,A^n,σ^12,,σ^n2\widehat{A}_{1},\ldots,\widehat{A}_{n},\widehat{\sigma}_{1}^{2},\ldots,\widehat{\sigma}_{n}^{2}
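In code, Algorithm 1 is a thin driver around the two phases; a minimal sketch under an illustrative interface in which coefficient_recovery and variance_recovery are callables implementing the two phases (e.g., any of the estimators below):

```python
def two_phase_recovery(samples, parents, coefficient_recovery, variance_recovery, m1):
    """Sketch of Algorithm 1 (illustrative interface, not the paper's code).

    samples : (m1 + m2) x n array of i.i.d. draws of (X_1, ..., X_n)
    parents : parents[i] lists the parent indices of variable i
    """
    first, second = samples[:m1], samples[m1:]
    A_hat = coefficient_recovery(first, parents)             # phase 1: coefficients
    sigma2_hat = variance_recovery(second, parents, A_hat)   # phase 2: variances
    return A_hat, sigma2_hat
```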

Motivated by Proposition 2.12, we estimate the parameters for each variable in an independent fashion (given the samples, the parameters related to each variable can be estimated in parallel). We will provide various coefficient recovery algorithms in the subsequent sections. These algorithms will recover coefficients \widehat{A}_{i} that satisfy Condition 1 for each variable X_{i}. We evaluate them empirically in Section 6. For variance recovery, we use VarianceRecovery for each variable Y by computing the empirical variance of Y-X\widehat{A} such that the recovered variance satisfies Condition 2. (The exception is our experiments with contaminated data in Section 6, where we use the classical median absolute deviation (MAD) estimator; see Appendix C for a description.)

Algorithm 2 VarianceRecovery: Variance recovery algorithm given coefficient estimates
1:Input: DAG GG, coefficient estimates, and m2𝒪(ndavgεlog(nδ))m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) samples
2:for variable YY with pp parents and coefficient estimate A^\widehat{A} do \triangleright If p=0p=0, then A^=0\widehat{A}=0.
3:     Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
4:     for s=1,,m2s=1,\ldots,m_{2} do
5:         Define Y(s)Y^{(s)} as the sths^{th} sample of YY.
6:         Define X(s)=[X1(s),,Xp(s)]X^{(s)}=[X^{(s)}_{1},\ldots,X^{(s)}_{p}] as the sths^{th} sample of X1,,XpX_{1},\ldots,X_{p} placed in a row vector.
7:         Define Z(s)=(Y(s)X(s)A^)2Z^{(s)}=\left(Y^{(s)}-X^{(s)}\widehat{A}\right)^{2}.
8:     end for
9:     Estimate \widehat{\sigma}_{y}^{2}=\frac{1}{m_{2}}\sum_{s=1}^{m_{2}}Z^{(s)}
10:end for
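Under the same illustrative interface, VarianceRecovery reduces to averaging the squared residuals Z^{(s)} at each node; a sketch:

```python
import numpy as np

def variance_recovery(samples, parents, A_hat):
    """Sketch of VarianceRecovery (Algorithm 2): the variance estimate for each node is
    the empirical mean of the squared residual (Y - X A_hat)^2 over the m2 samples."""
    m2 = samples.shape[0]
    sigma2_hat = []
    for i, pa in enumerate(parents):
        pred = samples[:, pa] @ np.asarray(A_hat[i]) if pa else np.zeros(m2)
        z = (samples[:, i] - pred) ** 2   # Z^(s) in Algorithm 2
        sigma2_hat.append(z.mean())       # (1/m2) * sum_s Z^(s)
    return sigma2_hat
```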

To analyze VarianceRecovery, we first prove guarantees for an arbitrary variable and then take union bound over nn variables.

Lemma 2.13.

Consider Algorithm 2. Fix any arbitrary variable of interest YY with pp parents, parameters (A,σy)(A,\sigma_{y}), and associated covariance matrix MM. Suppose we have coefficient estimates A^\widehat{A} such that |ΔMΔ|σy2pεndavg\left\lvert\Delta^{\top}M\Delta\right\rvert\leq\sigma_{y}^{2}\cdot\frac{p\varepsilon}{nd_{avg}}. Suppose 0ε3220.170\leq\varepsilon\leq 3-2\sqrt{2}\leq 0.17. With k=32ndavgεplog(2δ)k=\frac{32nd_{avg}}{\varepsilon p}\log\left(\frac{2}{\delta}\right) samples, we recover variance estimate σ^y\widehat{\sigma}_{y} such that

Pr((1εpndavg)σy2σ^y2(1+εpndavg)σy2)1δ\Pr\left(\left(1-\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\sigma_{y}^{2}\leq\widehat{\sigma}_{y}^{2}\leq\left(1+\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\sigma_{y}^{2}\right)\geq 1-\delta
Proof.

We first argue that \widehat{\sigma}_{y}^{2}\sim\frac{\sigma_{y}^{2}+\Delta^{\top}M\Delta}{k}\cdot\chi^{2}_{k}, then apply standard concentration bounds for \chi^{2} random variables (see Lemma 2.4).

For any sample s[k]s\in[k], we see that Y(s)X(s)A^=X(s)A+ηy(s)X(s)A^=ηy(s)X(s)ΔY^{(s)}-X^{(s)}\widehat{A}=X^{(s)}A+\eta_{y}^{(s)}-X^{(s)}\widehat{A}=\eta_{y}^{(s)}-X^{(s)}\Delta, where Δ=A^Ap\Delta=\widehat{A}-A\in\mathbb{R}^{p} is an unknown constant vector (because we do not actually know AA). For fixed Δ\Delta, we see that X(s)ΔN(0,ΔMΔ)X^{(s)}\Delta\sim N(0,\Delta^{\top}M\Delta). Since ηy(s)N(0,σy2)\eta_{y}^{(s)}\sim N(0,\sigma_{y}^{2}) and X(s)X^{(s)} are independent, we have that Y(s)X(s)A^N(0,σy2+ΔMΔ)Y^{(s)}-X^{(s)}\widehat{A}\sim N(0,\sigma_{y}^{2}+\Delta^{\top}M\Delta). So, for any sample s[k]s\in[k], Z(s)=(Y(s)X(s)A^)2(σy2+ΔMΔ)χ12Z^{(s)}=(Y^{(s)}-X^{(s)}\widehat{A})^{2}\sim\left(\sigma_{y}^{2}+\Delta^{\top}M\Delta\right)\cdot\chi^{2}_{1}. Therefore, σ^y=1ks=1kZ(s)(σy2+ΔMΔ)/kχk2\widehat{\sigma}_{y}=\frac{1}{k}\sum_{s=1}^{k}Z^{(s)}\sim(\sigma_{y}^{2}+\Delta^{\top}M\Delta)/k\cdot\chi^{2}_{k}. Let us define

γ=σ^y2σy2(11+ΔMΔσy2)χk2k\gamma=\frac{\widehat{\sigma}_{y}^{2}}{\sigma_{y}^{2}}\cdot\left(\frac{1}{1+\frac{\Delta^{\top}M\Delta}{\sigma_{y}^{2}}}\right)\sim\frac{\chi^{2}_{k}}{k}

Since pndavgp\leq nd_{avg}, if ε322\varepsilon\leq 3-2\sqrt{2}, then εpndavg3223+22\frac{\varepsilon p}{nd_{avg}}\leq 3-2\sqrt{2}\leq 3+2\sqrt{2}. We first make two observations:

  1.

    For 0εpndavg3220\leq\frac{\varepsilon p}{nd_{avg}}\leq 3-2\sqrt{2}, (1+εpndavg)(11+ΔMΔσy2)1+εp4ndavg\left(1+\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\left(\frac{1}{1+\frac{\Delta^{\top}M\Delta}{\sigma_{y}^{2}}}\right)\geq 1+\sqrt{\frac{\varepsilon p}{4nd_{avg}}}.

  2.

    For 0εpndavg3+220\leq\frac{\varepsilon p}{nd_{avg}}\leq 3+2\sqrt{2}, (1εpndavg)(11+ΔMΔσy2)1εp4ndavg\left(1-\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\left(\frac{1}{1+\frac{\Delta^{\top}M\Delta}{\sigma_{y}^{2}}}\right)\leq 1-\sqrt{\frac{\varepsilon p}{4nd_{avg}}}.

Using Lemma 2.4 with the above discussion, we have

Pr(σ^y2σy21+εpndavg or σ^y2σy21εpndavg)\displaystyle\;\Pr\left(\frac{\widehat{\sigma}_{y}^{2}}{\sigma_{y}^{2}}\geq 1+\sqrt{\frac{\varepsilon p}{nd_{avg}}}\text{\quad or \quad}\frac{\widehat{\sigma}_{y}^{2}}{\sigma_{y}^{2}}\leq 1-\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)
=\displaystyle= Pr(γ(1+εpndavg)(11+ΔMΔσy2) or γ(1εpndavg)(11+ΔMΔσy2))\displaystyle\;\Pr\left(\gamma\geq\left(1+\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\left(\frac{1}{1+\frac{\Delta^{\top}M\Delta}{\sigma_{y}^{2}}}\right)\text{\quad or \quad}\gamma\leq\left(1-\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)\cdot\left(\frac{1}{1+\frac{\Delta^{\top}M\Delta}{\sigma_{y}^{2}}}\right)\right)
\displaystyle\leq Pr(γ1+εp4ndavg or γ1εp4ndavg)\displaystyle\;\Pr\left(\gamma\geq 1+\sqrt{\frac{\varepsilon p}{4nd_{avg}}}\text{\quad or \quad}\gamma\leq 1-\sqrt{\frac{\varepsilon p}{4nd_{avg}}}\right)
=\displaystyle= Pr(|γ1|εp4ndavg)\displaystyle\;\Pr\left(\left\lvert\gamma-1\right\rvert\geq\sqrt{\frac{\varepsilon p}{4nd_{avg}}}\right)
\displaystyle\leq  2exp(kεp32ndavg)\displaystyle\;2\exp\left(-\frac{k\varepsilon p}{32nd_{avg}}\right)

The claim follows by setting k=32ndavgεplog(2δ)k=\frac{32nd_{avg}}{\varepsilon p}\log\left(\frac{2}{\delta}\right). ∎

Corollary 2.14 (Guarantees of VarianceRecovery).

Consider Algorithm 2. Suppose 0ε3220.170\leq\varepsilon\leq 3-2\sqrt{2}\leq 0.17 and we have coefficient estimates A^i\widehat{A}_{i} such that |ΔiMiΔi|σi2εpindavg\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\leq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}} for all i[n]i\in[n]. With m2𝒪(ndavgεlog(nδ))m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) samples, we recover variance estimate σ^i\widehat{\sigma}_{i} such that

Pr(i[n],(1εpindavg)σi2σ^i2(1+εpindavg)σi2)1δ\Pr\left(\forall i\in[n],\left(1-\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}\leq\left(1+\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\right)\geq 1-\delta

The total running time is 𝒪(n2davg2εlog(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}}{\varepsilon}\log\left(\frac{1}{\delta}\right)\right).

Proof.

For each i\in[n], apply Lemma 2.13 with \delta^{\prime}=\delta/n and m_{2}=\frac{32nd_{avg}}{\varepsilon}\log\left(\frac{2n}{\delta}\right)\geq\max_{i\in[n]}\frac{32nd_{avg}}{\varepsilon p_{i}}\log\left(\frac{2n}{\delta}\right), then take the union bound over all n variables.

The computational complexity for a variable with pp parents is 𝒪(m2p)\mathcal{O}(m_{2}\cdot p). Since i=1npi=ndavg\sum_{i=1}^{n}p_{i}=nd_{avg}, the total runtime is 𝒪(m2ndavg)\mathcal{O}(m_{2}\cdot n\cdot d_{avg}). ∎

In Section 5, we show that the sample complexity is nearly optimal in terms of the dependence on n and \varepsilon. We remark that we use a single batch of samples for all the nodes; this is possible as we can obtain high-probability bounds on the error events at each node.

3 Coefficient estimators based on linear least squares

In this section, we provide the algorithms LeastSquares and BatchAvgLeastSquares for recovering the coefficients in a Bayesian network using linear least squares. As discussed in Section 2.6, we will recover the coefficients for each variable such that Condition 1 is satisfied. To do so, we estimate the coefficients associated with each individual variable using independent samples. At each node, LeastSquares computes an estimate by solving the linear least squares problem with respect to a collection of sample observations. In Section 3.2, we generalize this approach via BatchAvgLeastSquares by allowing any interpolation between "batch size" and "number of batches" – LeastSquares is the special case of a single batch. Since the solution for each batch can be computed independently before the results are combined, BatchAvgLeastSquares facilitates parallelism.

3.1 Vanilla least squares

Consider an arbitrary variable YY with pp parents. Using m1m_{1} independent samples, we form matrix Xm1×pX\in\mathbb{R}^{m_{1}\times p}, where the rthr^{th} row consists of sample values X1(r),,Xp(r)X_{1}^{(r)},\ldots,X_{p}^{(r)}, and the column vector B=[Y(1),,Y(m1)]m1B=[Y^{(1)},\ldots,Y^{(m_{1})}]^{\top}\in\mathbb{R}^{m_{1}}. Then, we define A^=(XX)1XB\widehat{A}=(X^{\top}X)^{-1}X^{\top}B as the solution to the least squares problem XA^=BX\widehat{A}=B. The pseudocode of LeastSquares is given in Algorithm 3 and Theorem 3.1 states its guarantees.

Algorithm 3 LeastSquares: Coefficient recovery algorithm for general Bayesian networks
1:Input: DAG GG and m1𝒪(ndavgεln(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\cdot\ln\left(\frac{n}{\delta}\right)\right) samples
2:for variable YY with p1p\geq 1 parents do
3:     Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
4:     Form matrix Xm1×pX\in\mathbb{R}^{m_{1}\times p}, where the rthr^{th} row consists of sample values X1(r),,Xp(r)X_{1}^{(r)},\ldots,X_{p}^{(r)}
5:     Form column vector B=[Y(1),,Y(m1)]m1B=[Y^{(1)},\ldots,Y^{(m_{1})}]^{\top}\in\mathbb{R}^{m_{1}}
6:     Define A^=(XX)1XB\widehat{A}=(X^{\top}X)^{-1}X^{\top}B as the solution to the least squares problem XA^=BX\widehat{A}=B
7:end for
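Per node, Algorithm 3 is a standard least squares solve; a sketch of LeastSquares under the same illustrative data layout, using numpy.linalg.lstsq (which computes the same minimizer as (X^{\top}X)^{-1}X^{\top}B, but more stably than forming the inverse explicitly):

```python
import numpy as np

def least_squares_coefficients(samples, parents):
    """Sketch of LeastSquares (Algorithm 3): node-wise linear least squares.

    samples : m1 x n array; parents[i] lists the parent indices of variable i.
    Returns A_hat, where A_hat[i] estimates the coefficients of node i.
    """
    A_hat = []
    for i, pa in enumerate(parents):
        if not pa:
            A_hat.append(np.zeros(0))     # no parents, nothing to estimate
            continue
        X = samples[:, pa]                # m1 x p design matrix of parent values
        B = samples[:, i]                 # observed values of the node itself
        A_hat.append(np.linalg.lstsq(X, B, rcond=None)[0])
    return A_hat
```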
Theorem 3.1 (Distribution learning using LeastSquares).

Let \varepsilon,\delta\in(0,1). Suppose G is a fixed directed acyclic graph on n variables with degree at most d. Given \mathcal{O}\left(\frac{nd_{avg}}{\varepsilon^{2}}\log\left(\frac{n}{\delta}\right)\right) samples from an unknown Bayesian network \mathcal{P} over G, if we use LeastSquares for coefficient recovery in Algorithm 1, then with probability at least 1-\delta, we recover a Bayesian network \mathcal{Q} over G such that \mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\varepsilon in \mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon^{2}}\log\left(\frac{1}{\delta}\right)\right) time. (In particular, this gives a \mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq\epsilon guarantee for learning centered multivariate Gaussians using \widetilde{O}(n^{2}\epsilon^{-1}) samples. See e.g. [ABDH+20] for an analogous \mathrm{d_{KL}}(\mathcal{Q},\mathcal{P})\leq\epsilon guarantee.)

Our analysis begins by proving guarantees for an arbitrary variable.

Lemma 3.2.

Consider Algorithm 3. Fix an arbitrary variable YY with pp parents, parameters (A,σy)(A,\sigma_{y}), and associated covariance matrix MM. With k4c22(1c1)4ndavgεk\geq\frac{4c_{2}^{2}}{(1-c_{1})^{4}}\cdot\frac{nd_{avg}}{\varepsilon} samples, for any constants 0<c1<1/20<c_{1}<1/2 and c2>0c_{2}>0, we recover the coefficients A^\widehat{A} such that

Pr(|ΔMΔ|σy2pεndavg)exp(kc122)+2pexp(2k)+pexp(c222)\Pr\left(\left\lvert\Delta^{\top}M\Delta\right\rvert\geq\sigma_{y}^{2}\cdot\frac{p\varepsilon}{nd_{avg}}\right)\leq\exp\left(-\frac{kc_{1}^{2}}{2}\right)+2p\exp\left(-2k\right)+p\exp\left(-\frac{c_{2}^{2}}{2}\right)
Proof.

Since |ΔMΔ|=|ΔLLΔ|=LΔ2\left\lvert\Delta^{\top}M\Delta\right\rvert=\left\lvert\Delta^{\top}LL^{\top}\Delta\right\rvert=\left\lVert L^{\top}\Delta\right\rVert^{2}, it suffices to bound LΔ\left\lVert L^{\top}\Delta\right\rVert.

Without loss of generality, the parents of YY are X1,,XpX_{1},\ldots,X_{p}. Define Xk×pX\in\mathbb{R}^{k\times p}, BkB\in\mathbb{R}^{k}, and A^p\widehat{A}\in\mathbb{R}^{p} as in Algorithm 3. Let η=[ηy(1),,ηy(k)]k\eta=[\eta_{y}^{(1)},\ldots,\eta_{y}^{(k)}]\in\mathbb{R}^{k} be the instantiations of Gaussian ηy\eta_{y} in the kk samples. By the structural equations, we know that B=XA+ηB=XA+\eta. So,

\widehat{A}=(X^{\top}X)^{-1}X^{\top}B=(X^{\top}X)^{-1}X^{\top}(XA+\eta)=A+(X^{\top}X)^{-1}X^{\top}\eta

By 2.3, we can express X=GLX=GL^{\top} where matrix Gk×pG\in\mathbb{R}^{k\times p} is a random matrix with i.i.d. N(0,1)N(0,1) entries. Since Δ=A^A\Delta=\widehat{A}-A, we see that Δ=(L)1(GG)1Gη\Delta=(L^{\top})^{-1}(G^{\top}G)^{-1}G^{\top}\eta. Rearranging, we have LΔ=(GG)1GηL^{\top}\Delta=(G^{\top}G)^{-1}G^{\top}\eta and so LΔ(GG)1Gη\|L^{\top}\Delta\|\leq\|(G^{\top}G)^{-1}\|\cdot\|G^{\top}\eta\|. Combining Lemma 2.6 and Lemma 2.7, which bound (GG)1\|(G^{\top}G)^{-1}\| and Gη\|G^{\top}\eta\| respectively, we get

Pr(LΔ>2σyc2p(12c1)2k)exp(kc122)+2pexp(2k)+pexp(c222)\Pr\left(\|L^{\top}\Delta\|>\dfrac{2\sigma_{y}c_{2}\sqrt{p}}{\left(1-2c_{1}\right)^{2}\sqrt{k}}\right)\leq\exp\left(-\frac{kc_{1}^{2}}{2}\right)+2p\exp\left(-2k\right)+p\exp\left(-\frac{c_{2}^{2}}{2}\right) (2)

for any constants 0<c1<1/20<c_{1}<1/2 and c2>0c_{2}>0. The claim follows by setting k=4c22(1c1)4ndavgεk=\frac{4c_{2}^{2}}{(1-c_{1})^{4}}\cdot\frac{nd_{avg}}{\varepsilon}. ∎

We can now establish Condition 1 of Proposition 2.12 for LeastSquares.

Lemma 3.3.

Consider Algorithm 3. With m1𝒪(ndavgεln(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\cdot\ln\left(\frac{n}{\delta}\right)\right) samples, we recover the coefficients A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

The total running time is 𝒪(n2davg2dεln(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon}\ln\left(\frac{1}{\delta}\right)\right).

Proof.

By setting c1=1/4c_{1}=1/4, c2=2ln(3n/δ)c_{2}=\sqrt{2\ln\left(3n/\delta\right)}, and k=32ndavgεln(3nδ)4c22(1c1)4ndavgεk=\frac{32nd_{avg}}{\varepsilon}\ln\left(\frac{3n}{\delta}\right)\geq\frac{4c_{2}^{2}}{(1-c_{1})^{4}}\cdot\frac{nd_{avg}}{\varepsilon} in Lemma 3.2, we have

Pr(|ΔiMiΔi|σi2piεndavg)exp(kc122)+pexp(2k)+pexp(c222)δ3n+δ3n+δ3n=δn\Pr\left(\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{p_{i}\varepsilon}{nd_{avg}}\right)\leq\exp\left(-\frac{kc_{1}^{2}}{2}\right)+p\exp\left(-2k\right)+p\exp\left(-\frac{c_{2}^{2}}{2}\right)\leq\frac{\delta}{3n}+\frac{\delta}{3n}+\frac{\delta}{3n}=\frac{\delta}{n}

for any i[n]i\in[n]. The claim holds by a union bound over all nn variables.

The computational complexity for a variable with pp parents is 𝒪(p2m1)\mathcal{O}(p^{2}\cdot m_{1}). Since maxi[n]pid\max_{i\in[n]}p_{i}\leq d and i=1npi=ndavg\sum_{i=1}^{n}p_{i}=nd_{avg}, the total runtime is 𝒪(m1ndavgd)\mathcal{O}(m_{1}\cdot n\cdot d_{avg}\cdot d). ∎

Theorem 3.1 follows from combining the guarantees of LeastSquares and VarianceRecovery (given in Lemma 3.3 and Corollary 2.14 respectively) via Proposition 2.12.

Proof of Theorem 3.1.

We will show sample and time complexities before giving the proof for the dTV\mathrm{d_{TV}} distance.

Let m1𝒪(ndavgεln(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\cdot\ln\left(\frac{n}{\delta}\right)\right) and m2𝒪(ndavgεlog(nδ))m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right). Then, the total number of samples needed is m=m1+m2𝒪(ndavgεlog(nδ))m=m_{1}+m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right). LeastSquares runs in 𝒪(n2davg2dεln(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon}\ln\left(\frac{1}{\delta}\right)\right) time while VarianceRecovery runs in 𝒪(n2davg2εlog(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}}{\varepsilon}\log\left(\frac{1}{\delta}\right)\right) time. Therefore, the overall running time is 𝒪(n2davg2dεlog(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon}\log\left(\frac{1}{\delta}\right)\right).

By Lemma 3.3, LeastSquares recovers coefficients A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

By Corollary 2.14 and using the recovered coefficients from LeastSquares, VarianceRecovery recovers variance estimates σ^i2\widehat{\sigma}_{i}^{2} such that

Pr(i[n],(1εpindavg)σi2σ^i2(1+εpindavg)σi2)1δ\Pr\left(\forall i\in[n],\left(1-\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}\leq\left(1+\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\right)\geq 1-\delta

As our estimated parameters satisfy Condition 1 and Condition 2, Proposition 2.12 tells us that dKL(𝒫,𝒬)3ε\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq 3\varepsilon. Thus, dTV(𝒫,𝒬)dKL(𝒫,𝒬)/23ε/2\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\sqrt{\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})/2}\leq\sqrt{3\varepsilon/2}. The claim follows by setting ε=3ε/2\varepsilon^{\prime}=\sqrt{3\varepsilon/2} throughout. ∎

3.2 Interpolating between batch size and number of batches

We now discuss a generalization of LeastSquares. In a nutshell, for each variable with p\geq 1 parents, BatchAvgLeastSquares solves b\geq 1 batches of linear systems made up of k>p samples each and then uses the mean of the recovered solutions as an estimate for the coefficients. Note that one can interpolate between different values of k and b, as long as k\geq p (so that each batch's least squares system has a well-defined solution). The pseudocode of BatchAvgLeastSquares is provided in Algorithm 4 and its guarantees are given in Theorem 3.4.

Algorithm 4 BatchAvgLeastSquares: Coefficient recovery for general Bayesian networks
1:Input: DAG GG and m1=𝒪(ndavgε(d+ln(nεδ)))m_{1}=\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right) samples \triangleright kΩ(d+ln(nεδ))k\in\Omega\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right), b=m1kb=\frac{m_{1}}{k}
2:for variable YY with p1p\geq 1 parents do
3:     Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
4:     for s=1,,bs=1,\ldots,b do
5:         Form matrix Xk×pX\in\mathbb{R}^{k\times p}, where the rthr^{th} row consists of sample values X1(s,r),,Xp(s,r)X_{1}^{(s,r)},\ldots,X_{p}^{(s,r)}
6:         Form column vector B=[Y(s,1),,Y(s,k)]kB=[Y^{(s,1)},\ldots,Y^{(s,k)}]^{\top}\in\mathbb{R}^{k}
7:         Define A~(s)=(XX)1XB\widetilde{A}^{(s)}=(X^{\top}X)^{-1}X^{\top}B as the solution to the least squares problem XA~(s)=BX\widetilde{A}^{(s)}=B
8:     end for
9:     Define A^=1bs=1bA~(s)\widehat{A}=\frac{1}{b}\sum_{s=1}^{b}\widetilde{A}^{(s)}
10:end for
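A sketch of BatchAvgLeastSquares under the same illustrative interface; swapping the aggregation function from the mean to the coordinate-wise median yields the BatchMedLeastSquares variant discussed next.

```python
import numpy as np

def batch_avg_least_squares(samples, parents, k, aggregate=np.mean):
    """Sketch of BatchAvgLeastSquares (Algorithm 4): for each node, split the samples
    into b = m1 // k batches of size k (k at least the number of parents), solve a
    least squares problem per batch, and aggregate the batch solutions."""
    m1 = samples.shape[0]
    b = m1 // k                           # assumes m1 >= k
    A_hat = []
    for i, pa in enumerate(parents):
        if not pa:
            A_hat.append(np.zeros(0))
            continue
        batch_solutions = []
        for s in range(b):
            X = samples[s * k:(s + 1) * k, pa]   # k x p design matrix for batch s
            B = samples[s * k:(s + 1) * k, i]
            batch_solutions.append(np.linalg.lstsq(X, B, rcond=None)[0])
        A_hat.append(aggregate(np.stack(batch_solutions), axis=0))
    return A_hat
```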

In Section 6, we also experiment with a variant of BatchAvgLeastSquares dubbed BatchMedLeastSquares, where \widehat{A} is defined to be the coordinate-wise median of the \widetilde{A}^{(s)} vectors. However, in the theoretical analysis below, we only analyze BatchAvgLeastSquares.

Theorem 3.4 (Distribution learning using BatchAvgLeastSquares).

Let ε,δ(0,1)\varepsilon,\delta\in(0,1). Suppose GG is a fixed directed acyclic graph on nn variables with degree at most dd. Given 𝒪(ndavgε2(d+ln(nεδ)))\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon^{2}}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right) samples from an unknown Bayesian network 𝒫\mathcal{P} over GG, if we use BatchAvgLeastSquares for coefficient recovery in Algorithm 1, then with probability at least 1δ1-\delta, we recover a Bayesian network 𝒬\mathcal{Q} over GG such that dTV(𝒫,𝒬)ε\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\varepsilon in 𝒪(n2davg2dε2(d+ln(nεδ)))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon^{2}}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right) time.

Our approach for analyzing BatchAvgLeastSquares is the same as our approach for LeastSquares: we prove guarantees for an arbitrary variable and then take a union bound over the n variables. At a high level, for each node Y, for every fixing of the randomness in generating X_{1},\dots,X_{p}, we show that each \widetilde{A}^{(s)} is Gaussian. Since the b batches are independent, \frac{1}{b}\sum_{s}\widetilde{A}^{(s)} is also Gaussian. Its variance is itself a random variable but can be bounded with high probability using concentration inequalities.

Lemma 3.5.

Consider Algorithm 4. Fix any arbitrary variable of interest YY with pp parents, parameters (A,σy)(A,\sigma_{y}), and associated covariance matrix MM. With k>Ck(p+ln(nεδ))k>C_{k}\cdot\left(p+\ln\left(\frac{n}{\varepsilon\delta}\right)\right) and kb=Ckb(ndavgε(d+ln(nεδ)))kb=C_{kb}\cdot\left(\frac{nd_{avg}}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right), for some universal constants CkC_{k} and CkbC_{kb}, we recover coefficients estimates A^\widehat{A} such that

Pr(|ΔMΔ|σy2εpndavg)1δ\Pr\left(\left\lvert\Delta^{\top}M\Delta\right\rvert\leq\sigma_{y}^{2}\cdot\frac{\varepsilon p}{nd_{avg}}\right)\geq 1-\delta
Proof.

Without loss of generality, the parents of Y are X_{1},\ldots,X_{p}. For s\in[b], define X^{(s)}\in\mathbb{R}^{k\times p}, B^{(s)}\in\mathbb{R}^{k}, and \widetilde{A}^{(s)}\in\mathbb{R}^{p} as the quantities involved in the s^{th} batch of Algorithm 4. Let \eta^{(s)}=[\eta_{y}^{(s,1)},\ldots,\eta_{y}^{(s,k)}]\in\mathbb{R}^{k} be the instantiations of the Gaussian \eta_{y} in the k samples for the s^{th} batch. By the structural equations, we know that B^{(s)}=X^{(s)}A+\eta^{(s)}. So,

A~(s)\displaystyle\widetilde{A}^{(s)} =((X(s))X(s))1(X(s))B\displaystyle=((X^{(s)})^{\top}X^{(s)})^{-1}(X^{(s)})^{\top}B
=((X(s))X(s))1(X(s))(X(s)A+η(s))\displaystyle=((X^{(s)})^{\top}X^{(s)})^{-1}(X^{(s)})^{\top}(X^{(s)}A+\eta^{(s)})
=A+((X(s))X(s))1(X(s))η(s)\displaystyle=A+((X^{(s)})^{\top}X^{(s)})^{-1}(X^{(s)})^{\top}\eta^{(s)}

By 2.3, we can express X(s)=G(s)LX^{(s)}=G^{(s)}L^{\top} where matrix G(s)k×pG^{(s)}\in\mathbb{R}^{k\times p} is a random matrix with i.i.d. N(0,1)N(0,1) entries. So, we see that

LΔ=1bs=1b((G(s))G(s))1(G(s))η(s)L^{\top}\Delta=\frac{1}{b}\sum_{s=1}^{b}((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\eta^{(s)}

For any i[p]i\in[p], 2.2 tells us that

(LΔ)i=1bs=1b(((G(s))G(s))1(G(s))η(s))iN(0,σy2b2s=1b(((G(s))G(s))1(G(s)))i2)\left(L^{\top}\Delta\right)_{i}=\frac{1}{b}\sum_{s=1}^{b}\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\eta^{(s)}\right)_{i}\sim N\left(0,\frac{\sigma_{y}^{2}}{b^{2}}\sum_{s=1}^{b}\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert^{2}\right)

We can upper bound each (((G(s))G(s))1(G(s)))i\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert term as follows:

(((G(s))G(s))1(G(s)))i((G(s))G(s))1(G(s))((G(s))G(s))1G(s)\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert\leq\left\lVert((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right\rVert\leq\left\lVert((G^{(s)})^{\top}G^{(s)})^{-1}\right\rVert\cdot\left\lVert G^{(s)}\right\rVert

When k4pk\geq 4p, Lemma 2.6 tells us that Pr(((G(s))G(s))14k)exp(k32)\Pr\left(\left\lVert((G^{(s)})^{\top}G^{(s)})^{-1}\right\rVert\geq\frac{4}{k}\right)\leq\exp\left(-\frac{k}{32}\right). Meanwhile, Lemma 2.5 tells us that Pr(G(s)2(k+p))pexp(Ck)\Pr\left(\left\lVert G^{(s)}\right\rVert\geq 2(\sqrt{k}+\sqrt{p})\right)\leq\sqrt{p}\cdot\exp\left(-C\cdot k\right) for some universal constant CC. Let \mathcal{E} be the event that (((G(s))G(s))1(G(s)))i<8(k+p)k\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert<\frac{8(\sqrt{k}+\sqrt{p})}{k} for any s[b]s\in[b]. Applying union bound with the conclusions from Lemma 2.6 and Lemma 2.5, we have

Pr(¯)\displaystyle\Pr\left(\overline{\mathcal{E}}\right) =Pr(s[b],(((G(s))G(s))1(G(s)))i8(k+p)k)\displaystyle=\Pr\left(\exists s\in[b],\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert\geq\frac{8(\sqrt{k}+\sqrt{p})}{k}\right)
bexp(k32)+bpexp(Ck)\displaystyle\leq b\cdot\exp\left(-\frac{k}{32}\right)+b\cdot\sqrt{p}\cdot\exp\left(-C\cdot k\right)

Conditioned on event \mathcal{E}, standard Gaussian tail bounds (e.g. see 2.1) give us

Pr(|LΔ|i>σyεndavg|)\displaystyle\Pr\left(\left\lvert L^{\top}\Delta\right\rvert_{i}>\sigma_{y}\cdot\sqrt{\frac{\varepsilon}{nd_{avg}}}\mathrel{\Big{|}}\mathcal{E}\right) exp(σy2εndavg2σy2b2s=1b(((G(s))G(s))1(G(s)))i2)\displaystyle\leq\exp\left(-\frac{\sigma_{y}^{2}\cdot\frac{\varepsilon}{nd_{avg}}}{2\cdot\frac{\sigma_{y}^{2}}{b^{2}}\sum_{s=1}^{b}\left\lVert\left(((G^{(s)})^{\top}G^{(s)})^{-1}(G^{(s)})^{\top}\right)_{i}\right\rVert^{2}}\right)
exp(εbk2128ndavg(k+p)2)\displaystyle\leq\exp\left(-\frac{\varepsilon\cdot b\cdot k^{2}}{128\cdot n\cdot d_{avg}\cdot(\sqrt{k}+\sqrt{p})^{2}}\right)
exp(εbk512ndavg)\displaystyle\leq\exp\left(-\frac{\varepsilon\cdot b\cdot k}{512\cdot n\cdot d_{avg}}\right)

where the second last inequality is because of event \mathcal{E} and the last inequality is because (k+p)2(2k)2=4k(\sqrt{k}+\sqrt{p})^{2}\leq(2\sqrt{k})^{2}=4k, since kpk\geq p.

Thus, applying a union bound over all pp entries of LΔL^{\top}\Delta and accounting for Pr(¯)\Pr(\overline{\mathcal{E}}), we have

Pr(LΔ>σyεpndavg)\displaystyle\;\Pr\left(\left\lVert L^{\top}\Delta\right\rVert>\sigma_{y}\cdot\sqrt{\frac{\varepsilon p}{nd_{avg}}}\right)
\displaystyle\leq Pr(LΔ>σyεpndavg|)+Pr(¯)\displaystyle\;\Pr\left(\left\lVert L^{\top}\Delta\right\rVert>\sigma_{y}\cdot\sqrt{\frac{\varepsilon p}{nd_{avg}}}\mathrel{\Big{|}}\mathcal{E}\right)+\Pr\left(\mathcal{\overline{E}}\right)
\displaystyle\leq pexp(εbk512ndavg)+bexp(k32)+bpexp(Ck)\displaystyle\;p\cdot\exp\left(-\frac{\varepsilon\cdot b\cdot k}{512\cdot n\cdot d_{avg}}\right)+b\cdot\exp\left(-\frac{k}{32}\right)+b\cdot\sqrt{p}\cdot\exp\left(-C\cdot k\right)

for some universal constant CC.

The claim follows by observing that |ΔMΔ|=|ΔLLΔ|=LΔ2\left\lvert\Delta^{\top}M\Delta\right\rvert=\left\lvert\Delta^{\top}LL^{\top}\Delta\right\rvert=\left\lVert L^{\top}\Delta\right\rVert^{2} and applying the assumptions on kk and bb. ∎

We can now establish Condition 1 of Proposition 2.12 for BatchAvgLeastSquares.

Lemma 3.6 (Coefficient recovery guarantees of BatchAvgLeastSquares).

Consider Algorithm 4. With m1𝒪(ndavgε(d+ln(nεδ)))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right) samples, where kΩ(d+ln(nεδ))k\in\Omega\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right) and b=m1kb=\frac{m_{1}}{k}, we recover coefficient estimates A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

The total running time is 𝒪(m1ndavgd)\mathcal{O}(m_{1}\cdot n\cdot d_{avg}\cdot d).

Proof.

For each i[n]i\in[n], apply Lemma 3.5 with δ=δ/n\delta^{\prime}=\delta/n, then take the union bound over all nn variables.

The computational complexity for a variable with pp parents is 𝒪(bp2k)=𝒪(p2m1)\mathcal{O}(b\cdot p^{2}\cdot k)=\mathcal{O}(p^{2}\cdot m_{1}). Since maxi[n]pid\max_{i\in[n]}p_{i}\leq d and i=1npi=ndavg\sum_{i=1}^{n}p_{i}=nd_{avg}, the total runtime is 𝒪(m1ndavgd)\mathcal{O}(m_{1}\cdot n\cdot d_{avg}\cdot d). ∎
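To make the per-node procedure concrete, the following NumPy sketch illustrates the batch-averaging strategy analyzed above (this is our own illustrative code with hypothetical names, not the released implementation):

import numpy as np

def batch_avg_least_squares_node(X_pa, y, k):
    # X_pa: (m1 x p) samples of the parents of Y; y: length-m1 samples of Y; k: batch size with k >= p
    m1, p = X_pa.shape
    b = m1 // k                                      # number of batches
    estimates = []
    for s in range(b):
        Xs, ys = X_pa[s * k:(s + 1) * k], y[s * k:(s + 1) * k]
        estimates.append(np.linalg.lstsq(Xs, ys, rcond=None)[0])   # least-squares solution on batch s
    return np.mean(estimates, axis=0)                # average of the b batch solutions

Setting the batch size k equal to m1 (a single batch) recovers the plain least-squares estimator.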

Theorem 3.4 follows from combining the guarantees of BatchAvgLeastSquares and VarianceRecovery (given in Lemma 3.6 and Corollary 2.14 respectively) via Proposition 2.12.

Proof of Theorem 3.4.

We first establish the sample and time complexities, and then bound the $\mathrm{d_{TV}}$ distance.

Let m1𝒪(ndavgε(d+ln(nεδ)))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right) and m2𝒪(ndavgεlog(nδ))m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right). Then, the total number of samples needed is m=m1+m2𝒪(ndavgε(d+ln(nεδ)))m=m_{1}+m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right). BatchAvgLeastSquares runs in 𝒪(m1ndavgd)\mathcal{O}(m_{1}nd_{avg}d) time while VarianceRecovery runs in 𝒪(n2davg2εlog(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}}{\varepsilon}\log\left(\frac{1}{\delta}\right)\right) time. Therefore, the overall running time is 𝒪(n2davg2dε(d+ln(nεδ)))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d}{\varepsilon}\left(d+\ln\left(\frac{n}{\varepsilon\delta}\right)\right)\right).

By Lemma 3.6, BatchAvgLeastSquares recovers coefficients A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

By Corollary 2.14 and using the recovered coefficients from BatchAvgLeastSquares, VarianceRecovery recovers variance estimates σ^i2\widehat{\sigma}_{i}^{2} such that

Pr(i[n],(1εpindavg)σi2σ^i2(1+εpindavg)σi2)1δ\Pr\left(\forall i\in[n],\left(1-\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}\leq\left(1+\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\right)\geq 1-\delta

As our estimated parameters satisfy Condition 1 and Condition 2, Proposition 2.12 tells us that dKL(𝒫,𝒬)3ε\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq 3\varepsilon. Thus, dTV(𝒫,𝒬)dKL(𝒫,𝒬)/23ε/2\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\sqrt{\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})/2}\leq\sqrt{3\varepsilon/2}. The claim follows by setting ε=3ε/2\varepsilon^{\prime}=\sqrt{3\varepsilon/2} throughout. ∎

4 Coefficient recovery algorithm based on Cauchy random variables

In this section, we provide novel algorithms CauchyEst and CauchyEstTree for recovering the coefficients in polytree Bayesian networks. We will show that CauchyEstTree has near-optimal sample complexity, and later in Section 6, we will see that both these algorithms outperform LeastSquares and BatchAvgLeastSquares on randomly generated Bayesian networks. Of technical interest, our analysis involves Cauchy random variables, which are somewhat of a rarity in statistical learning. As in LeastSquares and BatchAvgLeastSquares, CauchyEst and CauchyEstTree use independent samples to recover the coefficients associated to each individual variable in an independent fashion.

Consider an arbitrary variable $Y$ with $p$ parents. The intuition is as follows: if $\eta_{y}=0$, then one can form a linear system using $p$ samples and solve for the coefficients $a_{y\leftarrow i}$ exactly for each $i\in\pi(y)$. Unfortunately, $\eta_{y}$ is non-zero in general. Instead of exactly recovering $A$, we partition the $m_{1}$ independent samples into $k=\lfloor m_{1}/p\rfloor$ batches of $p$ samples each and form intermediate estimates $\widetilde{A}^{(1)},\ldots,\widetilde{A}^{(k)}$ by solving a system of linear equations for each batch (see Algorithm 5). Then, we “combine” these intermediate estimates to obtain our estimate $\widehat{A}$.

Algorithm 5 Batch coefficient recovery algorithm for variable with pp parents
1:Input: DAG GG, a variable YY with pp parents, and pp samples
2:Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
3:Form matrix Xp×pX\in\mathbb{R}^{p\times p} where the rthr^{th} row equals [X1(r),,Xp(r)][X_{1}^{(r)},\ldots,X_{p}^{(r)}], corresponding to Y(r)Y^{(r)}.
4:Define $\widetilde{A}=\left[\widetilde{a}_{y\leftarrow 1},\ldots,\widetilde{a}_{y\leftarrow p}\right]^{\top}$ as any solution to the linear system $X\widetilde{A}=\left[Y^{(1)},\ldots,Y^{(p)}\right]^{\top}$.
5:return A~\widetilde{A}
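For concreteness, a single batch estimate amounts to one $p\times p$ linear solve; the following NumPy sketch is our own illustration of Algorithm 5 (the function name is hypothetical):

import numpy as np

def batch_estimate(X_batch, y_batch):
    # X_batch: (p x p) matrix whose r-th row contains the parents' values in the r-th sample of the batch
    # y_batch: length-p vector of the corresponding Y values
    # Any solution of X_batch @ A_tilde = y_batch; lstsq also handles the measure-zero singular case
    return np.linalg.lstsq(X_batch, y_batch, rcond=None)[0]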

Consider an arbitrary copy of recovered coefficients A~\widetilde{A}. Let Δ=[Δ1,,Δp]=A~A\Delta=\left[\Delta_{1},\ldots,\Delta_{p}\right]^{\top}=\widetilde{A}-A be a vector measuring the gap between these recovered coefficients and the ground truth, where Δi=a~yiayi\Delta_{i}=\widetilde{a}_{y\leftarrow i}-a_{y\leftarrow i}. Lemma 4.1 gives a condition where a vector is term-wise Cauchy. Using this, Lemma 4.2 shows that each entry of the vector LΔL^{\top}\Delta is distributed according to σyCauchy(0,1)\sigma_{y}\cdot\mathrm{Cauchy}(0,1), although the entries may be correlated with each other in general.

Lemma 4.1.

Consider the matrix equation $AB=E$ where $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times 1}$, and $E\in\mathbb{R}^{n\times 1}$, such that the entries of $A$ and $E$ are independent Gaussians, the elements in each column of $A$ have the same variance, and all entries of $E$ have the same variance. That is, $A_{\cdot,i}\sim N(0,\sigma_{i}^{2})$ for each $i\in[n]$ and $E_{i}\sim N(0,\sigma_{n+1}^{2})$ for each $i\in[n]$. Then, for all $i\in[n]$, we have that $B_{i}\sim\frac{\sigma_{n+1}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1)$.

Proof.

As the event that AA is singular has measure zero, we can write B=A1EB=A^{-1}E. By Cramer’s rule,

A1=1det(A)adj(A)=1det(A)CA^{-1}=\frac{1}{\det(A)}\cdot{\rm adj}(A)=\frac{1}{\det(A)}\cdot C^{\top}

where $\det(A)$ is the determinant of $A$, $\mathrm{adj}(A)$ is the adjugate/adjoint matrix of $A$, and $C$ is the cofactor matrix of $A$. Recall that $\det(A)$ can be expanded with respect to the elements of $C$: for any column $i\in[n]$,

det(A)=A1,iC1,i+A2,iC2,i++An,iCn,i\det(A)=A_{1,i}\cdot C_{1,i}+A_{2,i}\cdot C_{2,i}+\ldots+A_{n,i}\cdot C_{n,i}

Fix a column $i\in[n]$ and condition on the cofactors $C_{1,i},\ldots,C_{n,i}$, which depend only on the entries of $A$ outside column $i$ and are therefore independent of both the $i$-th column of $A$ and of $E$. Conditionally, $\det(A)\sim N\left(0,\sigma_{i}^{2}\left(C_{1,i}^{2}+\ldots+C_{n,i}^{2}\right)\right)$. Thus, for any $i\in[n]$,

B_{i}=\left(\frac{1}{\det(A)}C^{\top}E\right)_{i}=\frac{\sum_{k=1}^{n}C_{k,i}E_{k}}{\sum_{k=1}^{n}C_{k,i}A_{k,i}}\sim\frac{N\left(0,\sigma_{n+1}^{2}\left(C_{1,i}^{2}+\ldots+C_{n,i}^{2}\right)\right)}{N\left(0,\sigma_{i}^{2}\left(C_{1,i}^{2}+\ldots+C_{n,i}^{2}\right)\right)}=\frac{\sigma_{n+1}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1)

where the last step uses that, conditioned on the cofactors, the numerator and denominator are independent centered Gaussians, so their ratio is a Cauchy random variable whose scale is the ratio of their standard deviations. Since this conditional distribution does not depend on the cofactors, the unconditional distribution is the same. ∎
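As a quick numerical sanity check of Lemma 4.1 (an illustrative simulation with arbitrarily chosen variances, not part of the proof), one can estimate the scale of $B_{1}$ from its interquartile range, using the fact that the IQR of $\mathrm{Cauchy}(0,c)$ is $2c$:

import numpy as np

rng = np.random.default_rng(0)
n = 4
col_sds = np.array([0.5, 1.0, 2.0, 3.0])     # sigma_1, ..., sigma_n: standard deviation of each column of A
sd_e = 1.5                                   # sigma_{n+1}: standard deviation of the entries of E
samples = []
for _ in range(20000):
    A = rng.standard_normal((n, n)) * col_sds    # column i of A has standard deviation col_sds[i]
    E = rng.standard_normal(n) * sd_e
    samples.append(np.linalg.solve(A, E)[0])     # B_1 = (A^{-1} E)_1
q25, q75 = np.percentile(samples, [25, 75])
print((q75 - q25) / 2, sd_e / col_sds[0])        # empirical Cauchy scale vs. sigma_{n+1}/sigma_1 = 3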

Lemma 4.2.

Consider a batch estimate A~\widetilde{A} from Algorithm 5. Then, LΔL^{\top}\Delta is entry-wise distributed as σyCauchy(0,1)\sigma_{y}\cdot\mathrm{Cauchy}(0,1), where Δ=A~A\Delta=\widetilde{A}-A. Note that the entries of LΔL^{\top}\Delta may be correlated in general.

Proof.

Observe that each row of XX is an independent sample drawn from a multivariate Gaussian N(0,M)N(0,M). By denoting η=[ηy(1),,ηy(p)]\eta=\left[\eta_{y}^{(1)},\ldots,\eta_{y}^{(p)}\right]^{\top} as the pp samples of ηy\eta_{y}, we can write XA~=XA+ηX\widetilde{A}=XA+\eta and thus XΔ=ηX\Delta=\eta by rearranging terms. By 2.3, we can express X=GLX=GL^{\top} where matrix Gp×pG\in\mathbb{R}^{p\times p} is a random matrix with i.i.d. N(0,1)N(0,1) entries. By substituting X=GLX=GL^{\top} into XΔ=ηX\Delta=\eta, we have LΔ=G1ηL^{\top}\Delta=G^{-1}\eta.111111Note that event that GG is singular has measure 0.

By applying Lemma 4.1 with the following parameters: A=G,B=LΔ,E=ηA=G,B=L^{\top}\Delta,E=\eta, we conclude that each entry of LΔL^{\top}\Delta is distributed as σyCauchy(0,1)\sigma_{y}\cdot\mathrm{Cauchy}(0,1). However, note that these entries are generally correlated. ∎

If we have direct access to the matrix $L$, then one can do the following (see Algorithm 6): for each coordinate $i\in[p]$, take the median of the values $\left(L^{\top}\widetilde{A}^{(s)}\right)_{i}$ across the batches $s$ to form $\texttt{MED}_{i}$, and then estimate $\left[\widehat{a}_{y\leftarrow 1},\ldots,\widehat{a}_{y\leftarrow p}\right]^{\top}=(L^{\top})^{-1}[\texttt{MED}_{1},\ldots,\texttt{MED}_{p}]^{\top}$. (The typical strategy of averaging independent estimates does not work here, since a Cauchy random variable does not even have a finite mean.) Because the sample median of i.i.d. Cauchy random variables concentrates around the true median, one can show that each $\widehat{a}_{y\leftarrow i}$ converges to the true coefficient $a_{y\leftarrow i}$ as before. Unfortunately, we do not have $L$ and can only hope to estimate it by some matrix $\widehat{L}$ obtained from the empirical covariance matrix $\widehat{M}$.

Algorithm 6 CauchyEst: Coefficient recovery algorithm for general Bayesian networks
1:Input: DAG GG and mm samples
2:for variable YY with p1p\geq 1 parents do \triangleright By 1.1, pdp\leq d
3:     Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
4:     Let M^\widehat{M} be the empirical covariance matrix with respect to X1,,XpX_{1},\ldots,X_{p}.
5:     Compute the Cholesky decomposition M^=L^L^\widehat{M}=\widehat{L}\widehat{L}^{\top} of M^\widehat{M}.
6:     for s=1,,m/ps=1,\ldots,\lfloor m/p\rfloor do
7:         Using pp samples and Algorithm 5, compute a batch estimate A~(s)\widetilde{A}^{(s)}.
8:     end for
9:     For each $i\in[p]$, define $\texttt{MED}_{i}=\mathrm{median}\{(\widehat{L}^{\top}\widetilde{A}^{(1)})_{i},\ldots,(\widehat{L}^{\top}\widetilde{A}^{(\lfloor m/p\rfloor)})_{i}\}$.
10:     return $\widehat{A}=\left[\widehat{a}_{y\leftarrow 1},\ldots,\widehat{a}_{y\leftarrow p}\right]^{\top}=(\widehat{L}^{\top})^{-1}\left[\texttt{MED}_{1},\ldots,\texttt{MED}_{p}\right]^{\top}$.
11:end for
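The per-node computation of Algorithm 6 can be sketched in a few lines of NumPy (an illustrative sketch with our own names; it assumes at least $p$ samples per batch and may differ in details from the released implementation):

import numpy as np

def cauchy_est_node(X_pa, y):
    # X_pa: (m x p) samples of the parents of Y; y: length-m samples of Y (assumes m >= p)
    m, p = X_pa.shape
    M_hat = X_pa.T @ X_pa / m                    # empirical covariance of the (zero-mean) parents
    L_hat = np.linalg.cholesky(M_hat)            # M_hat = L_hat @ L_hat.T
    batches = [np.linalg.lstsq(X_pa[s * p:(s + 1) * p], y[s * p:(s + 1) * p], rcond=None)[0]
               for s in range(m // p)]           # per-batch estimates from Algorithm 5
    meds = np.median([L_hat.T @ A for A in batches], axis=0)   # entry-wise medians in L_hat-space
    return np.linalg.solve(L_hat.T, meds)        # A_hat = (L_hat^T)^{-1} [MED_1, ..., MED_p]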

4.1 Special case of polytree Bayesian networks

If the Bayesian network is a polytree, then $L$ is diagonal. In this case, we specialize CauchyEst to CauchyEstTree and are able to give theoretical guarantees. We begin with a simple corollary, which tells us that the $i^{th}$ entry of $\Delta$ is distributed according to $\frac{\sigma_{y}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1)$.

Corollary 4.3.

Consider a batch estimate A~\widetilde{A} from Algorithm 5. If the Bayesian network is a polytree, then Δi=(A~A)iσyσiCauchy(0,1)\Delta_{i}=(\widetilde{A}-A)_{i}\sim\frac{\sigma_{y}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1).

Proof.

Observe that each row of XX is an independent sample drawn from a multivariate Gaussian N(0,M)N(0,M). By denoting η=[ηy(1),,ηy(p)]\eta=\left[\eta_{y}^{(1)},\ldots,\eta_{y}^{(p)}\right]^{\top} as the pp samples of ηy\eta_{y}, we can write XA~=XA+ηX\widetilde{A}=XA+\eta and thus XΔ=ηX\Delta=\eta by rearranging terms. Since the parents of any variable in a polytree are not correlated, each element in the ithi^{th} column of XX is a N(0,σi2)N(0,\sigma_{i}^{2}) Gaussian random variable.

By applying Lemma 4.1 with the parameters $A=X$, $B=\Delta$, and $E=\eta$, we conclude that $\Delta_{i}=(\widetilde{A}-A)_{i}\sim\frac{\sigma_{y}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1)$. ∎

For each $i\in\pi(y)$, we combine the $k$ independent copies $\widetilde{a}_{y\leftarrow i}^{(1)},\ldots,\widetilde{a}_{y\leftarrow i}^{(k)}$ using the median. For an arbitrary batch $s\in[k]$ and parent index $i\in\pi(y)$, observe that $\Delta_{i}^{(s)}=\widetilde{a}_{y\leftarrow i}^{(s)}-a_{y\leftarrow i}$. Since $a_{y\leftarrow i}$ is just an unknown constant,

a^yi=medians[k]{a~yi(s)}=medians[k]{Δi(s)}+ayi\widehat{a}_{y\leftarrow i}=\mathrm{median}_{s\in[k]}\left\{\widetilde{a}_{y\leftarrow i}^{(s)}\right\}=\mathrm{median}_{s\in[k]}\left\{\Delta_{i}^{(s)}\right\}+a_{y\leftarrow i}

Since the $\Delta_{i}^{(s)}$ terms are i.i.d. and distributed as $\frac{\sigma_{y}}{\sigma_{i}}\cdot\mathrm{Cauchy}(0,1)$ (by Corollary 4.3), the term $\mathrm{median}_{s\in[k]}\left\{\Delta_{i}^{(s)}\right\}$ converges to 0 for sufficiently large $k$, and thus $\widehat{a}_{y\leftarrow i}$ converges to the true coefficient $a_{y\leftarrow i}$.

The goal of this section is to prove Theorem 4.4 given CauchyEstTree, whose pseudocode we provide in Algorithm 7.

Algorithm 7 CauchyEstTree: Coefficient recovery algorithm for polytree Bayesian networks
1:Input: A polytree GG and m1𝒪(ndavgdεlog(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) samples
2:for variable YY with p1p\geq 1 parents do \triangleright By 1.1, pdp\leq d
3:     Without loss of generality, by renaming variables, let X1,,XpX_{1},\ldots,X_{p} be the parents of YY.
4:     for s=1,,m1/ps=1,\ldots,\lfloor m_{1}/p\rfloor do
5:         Using pp samples and Algorithm 5, compute a batch estimate A~(s)\widetilde{A}^{(s)}.
6:     end for
7:     For each iπ(y)i\in\pi(y), define estimate a^yi=median{a~yi(1),,a~yi(m1/p)}\widehat{a}_{y\leftarrow i}=\mathrm{median}\left\{\widetilde{a}_{y\leftarrow i}^{(1)},\ldots,\widetilde{a}_{y\leftarrow i}^{(\lfloor m_{1}/p\rfloor)}\right\}.
8:     return A^=[a^y1,,a^yp]\widehat{A}=[\widehat{a}_{y\leftarrow 1},\ldots,\widehat{a}_{y\leftarrow p}]^{\top}
9:end for
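A minimal NumPy sketch of the per-node computation in Algorithm 7 is given below (our own illustration with hypothetical names). Compared with the CauchyEst sketch after Algorithm 6, the polytree specialization skips the Cholesky factor entirely and takes coordinate-wise medians of the per-batch estimates directly.

import numpy as np

def cauchy_est_tree_node(X_pa, y):
    # X_pa: (m1 x p) samples of the parents of Y; y: length-m1 samples of Y (assumes m1 >= p)
    m1, p = X_pa.shape
    batches = [np.linalg.lstsq(X_pa[s * p:(s + 1) * p], y[s * p:(s + 1) * p], rcond=None)[0]
               for s in range(m1 // p)]          # per-batch estimates from Algorithm 5
    return np.median(batches, axis=0)            # coordinate-wise medians give a_hat_{y <- i}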
Theorem 4.4 (Distribution learning using CauchyEstTree).

Let ε,δ(0,1)\varepsilon,\delta\in(0,1). Suppose GG is a fixed directed acyclic graph on nn variables with degree at most dd. Given 𝒪(ndavgdεlog(nεδ))\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\varepsilon\delta}\right)\right) samples from an unknown Bayesian network 𝒫\mathcal{P} over GG, if we use CauchyEstTree for coefficient recovery in Algorithm 1, then with probability at least 1δ1-\delta, we recover a Bayesian network 𝒬\mathcal{Q} over GG such that dTV(𝒫,𝒬)ε\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\varepsilon in 𝒪(n2davg2dω1εlog(nδ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d^{\omega-1}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) time.

Note that for polytrees, $d_{avg}$ is at most $1$, i.e., a constant. As before, we will first prove guarantees for an arbitrary variable and then take a union bound over the $n$ variables.

Lemma 4.5.

Consider Algorithm 7. Fix an arbitrary variable of interest $Y$ with $p$ parents, parameters $(A,\sigma_{y})$, and associated covariance matrix $M$. With $k=\frac{8nd_{avg}}{\varepsilon}\log\left(\frac{2}{\delta}\right)$ batches of $p$ samples each, we recover a coefficient estimate $\widehat{A}$ such that

Pr(|ΔMΔ|σy2εpndavg)1δ\Pr\left(\left\lvert\Delta^{\top}M\Delta\right\rvert\leq\sigma_{y}^{2}\cdot\frac{\varepsilon p}{nd_{avg}}\right)\geq 1-\delta
Proof.

Since $M=LL^{\top}$, it suffices to bound $\|L^{\top}\Delta\|$. Because the Bayesian network is a polytree, $L$ is diagonal with positive diagonal entries, so the $i$-th entry of $L^{\top}\Delta$ equals $L_{ii}\cdot\mathrm{median}_{s\in[k]}\{\Delta_{i}^{(s)}\}=\mathrm{median}_{s\in[k]}\{(L^{\top}\Delta^{(s)})_{i}\}$. By Lemma 4.2, each $(L^{\top}\Delta^{(s)})_{i}$ is distributed as $\sigma_{y}\cdot\mathrm{Cauchy}(0,1)$, and the $k$ copies are independent since the batches use disjoint samples. Setting $k=\frac{8nd_{avg}}{\varepsilon}\log\left(\frac{2}{\delta}\right)$ and $0<\tau=\sqrt{\varepsilon/(nd_{avg})}<1$ in Lemma 2.8, we see that

Pr(median of k i.i.d. Cauchy(0,1) random variables[εndavg,εndavg])δ\Pr\left(\text{median of $k$ i.i.d.\ $\mathrm{Cauchy}(0,1)$ random variables}\not\in\left[-\sqrt{\frac{\varepsilon}{nd_{avg}}},\sqrt{\frac{\varepsilon}{nd_{avg}}}\right]\right)\leq\delta

That is, each entry of $L^{\top}\Delta$ has absolute value at most $\sigma_{y}\cdot\sqrt{\frac{\varepsilon}{nd_{avg}}}$, except with probability $\delta$; a union bound over the $p$ entries (which is absorbed into the logarithmic factor in the choice of $k$, since $p\leq n$) lets this hold for all entries simultaneously. Summing across all $p$ entries of $L^{\top}\Delta$, we see that

|ΔMΔ|=|ΔLLΔ|=LΔ2pσy2εndavg=σy2εpndavg|\Delta^{\top}M\Delta|=|\Delta^{\top}LL^{\top}\Delta|=\|L^{\top}\Delta\|^{2}\leq p\cdot\sigma_{y}^{2}\cdot\frac{\varepsilon}{nd_{avg}}=\sigma_{y}^{2}\cdot\frac{\varepsilon p}{nd_{avg}}

We can now establish Condition 1 of Proposition 2.12 for CauchyEstTree.

Lemma 4.6.

Consider Algorithm 7. Suppose the Bayesian network is a polytree. With m1𝒪(ndavgdεlog(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) samples, we recover coefficient estimates A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

The total running time is 𝒪(n2davg2dω1εlog(nδ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d^{\omega-1}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) where ω\omega is the matrix multiplication exponent.

Proof.

For each $i\in[n]$, apply Lemma 4.5 with $\delta^{\prime}=\delta/n$, i.e., with $k=\frac{8nd_{avg}}{\varepsilon}\log\left(\frac{2n}{\delta}\right)$ batches for each variable; since each batch uses $p_{i}\leq d$ samples, $m_{1}\in\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right)$ samples suffice. Then take the union bound over all $n$ variables.

The runtime of Algorithm 5 is the time to find the inverse of a p×pp\times p matrix, which is 𝒪(pω)\mathcal{O}(p^{\omega}) for some 2<ω<32<\omega<3. Therefore, the computational complexity for a variable with pp parents is 𝒪(pω1m1)\mathcal{O}(p^{\omega-1}\cdot m_{1}). Since maxi[n]pid\max_{i\in[n]}p_{i}\leq d and i=1npi=ndavg\sum_{i=1}^{n}p_{i}=nd_{avg}, the total runtime is 𝒪(m1ndavgdω2)\mathcal{O}(m_{1}\cdot n\cdot d_{avg}\cdot d^{\omega-2}). ∎

We are now ready to prove Theorem 4.4.

Theorem 4.4 follows from combining the guarantees of CauchyEstTree and VarianceRecovery (given in Lemma 4.6 and Corollary 2.14 respectively) via Proposition 2.12.

Proof of Theorem 4.4.

We first establish the sample and time complexities, and then bound the $\mathrm{d_{TV}}$ distance.

Let m1𝒪(ndavgdεlog(nδ))m_{1}\in\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) and m2𝒪(ndavgεlog(nδ))m_{2}\in\mathcal{O}\left(\frac{nd_{avg}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right). Then, the total number of samples needed is m=m1+m2𝒪(ndavgdεlog(nεδ))m=m_{1}+m_{2}\in\mathcal{O}\left(\frac{nd_{avg}d}{\varepsilon}\log\left(\frac{n}{\varepsilon\delta}\right)\right). CauchyEstTree runs in 𝒪(n2davg2dω1εlog(nδ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d^{\omega-1}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right) time while VarianceRecovery runs in 𝒪(n2davg2εlog(1δ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}}{\varepsilon}\log\left(\frac{1}{\delta}\right)\right) time, where ω\omega is the matrix multiplication exponent. Therefore, the overall running time is 𝒪(n2davg2dω1εlog(nδ))\mathcal{O}\left(\frac{n^{2}d_{avg}^{2}d^{\omega-1}}{\varepsilon}\log\left(\frac{n}{\delta}\right)\right).

By Lemma 4.6, CauchyEstTree recovers coefficients A^1,,A^n\widehat{A}_{1},\ldots,\widehat{A}_{n} such that

Pr(i[n],|ΔiMiΔi|σi2εpindavg)δ\Pr\left(\forall i\in[n],\left\lvert\Delta_{i}^{\top}M_{i}\Delta_{i}\right\rvert\geq\sigma_{i}^{2}\cdot\frac{\varepsilon p_{i}}{nd_{avg}}\right)\leq\delta

By Corollary 2.14 and using the recovered coefficients from CauchyEstTree, VarianceRecovery recovers variance estimates σ^i2\widehat{\sigma}_{i}^{2} such that

Pr(i[n],(1εpindavg)σi2σ^i2(1+εpindavg)σi2)1δ\Pr\left(\forall i\in[n],\left(1-\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\leq\widehat{\sigma}_{i}^{2}\leq\left(1+\sqrt{\frac{\varepsilon p_{i}}{nd_{avg}}}\right)\cdot\sigma_{i}^{2}\right)\geq 1-\delta

As our estimated parameters satisfy Condition 1 and Condition 2, Proposition 2.12 tells us that dKL(𝒫,𝒬)3ε\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})\leq 3\varepsilon. Thus, dTV(𝒫,𝒬)dKL(𝒫,𝒬)/23ε/2\mathrm{d_{TV}}(\mathcal{P},\mathcal{Q})\leq\sqrt{\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})/2}\leq\sqrt{3\varepsilon/2}. The claim follows by setting ε=3ε/2\varepsilon^{\prime}=\sqrt{3\varepsilon/2} throughout. ∎

5 Hardness for learning Gaussian Bayesian networks

In this section, we present our hardness results. We first show a tight lower bound for the simpler case of learning Gaussian product distributions in total variation distance (Theorem 5.2). Next, we show a tight lower bound for learning Gaussian Bayes nets with respect to total variation distance (Theorem 5.3). In both cases, the hardness applies to the problem of learning the covariance matrix of a centered multivariate Gaussian, which is equivalent to recovering the coefficients and noise variances of the underlying linear structural equation model.

We will need the following fact relating the total variation distance between multivariate Gaussians to a Frobenius-norm distance $\left\lVert\cdot\right\rVert_{F}$ between their covariance matrices.

Fact 5.1 ([DMR18]).

There exist two universal constants $\frac{1}{100}\leq c_{1}\leq c_{2}\leq\frac{3}{2}$ such that for any two covariance matrices $\Sigma_{1}$ and $\Sigma_{2}$,

c1dTV(N(0,Σ1),N(0,Σ2))Σ11Σ2IFc2.c_{1}\leq\frac{\mathrm{d_{TV}}(N(0,\Sigma_{1}),N(0,\Sigma_{2}))}{\left\lVert\Sigma^{-1}_{1}\Sigma_{2}-I\right\rVert_{F}}\leq c_{2}.
Theorem 5.2.

Given samples from an $n$-fold Gaussian product distribution $P$, learning a $\widehat{P}$ such that $\mathrm{d_{TV}}(P,\widehat{P})=O(\epsilon)$ with success probability $2/3$ requires $\Omega(n\epsilon^{-2})$ samples in general.

Proof.

Let $\mathcal{C}\subseteq\{0,1\}^{n}$ be a set with the following properties: 1) $|\mathcal{C}|=2^{\Theta(n)}$ and 2) any two distinct elements of $\mathcal{C}$ are at Hamming distance $\Theta(n)$. The existence of such a set is well known. We create a class of Gaussian product distributions $\mathcal{P}_{\mathcal{C}}$ based on $\mathcal{C}$ as follows. For each $s\in\mathcal{C}$, we define a distribution $P_{s}\in\mathcal{P}_{\mathcal{C}}$ such that if the $i$-th bit of $s$ is 0, we use the distribution $N(0,1)$ for the $i$-th component of $P_{s}$; if the $i$-th bit is 1, we use the distribution $N(0,1+\frac{\epsilon}{\sqrt{n}})$. Then for any $P_{s}\neq P_{t}$, $\mathrm{d_{KL}}(P_{s},P_{t})=O(\epsilon^{2})$, as the per-coordinate calculation below shows. Fano’s inequality then tells us that guessing a random distribution from $\mathcal{P}_{\mathcal{C}}$ correctly with probability $2/3$ requires $\Omega(n\epsilon^{-2})$ samples.
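Concretely, for a coordinate on which $s$ and $t$ differ, the standard formula for the KL divergence between centered Gaussians gives (we spell out this routine calculation for completeness)

\mathrm{d_{KL}}\left(N(0,1),\,N\left(0,1+\tfrac{\epsilon}{\sqrt{n}}\right)\right)=\frac{1}{2}\left(\frac{1}{1+\epsilon/\sqrt{n}}-1+\ln\left(1+\frac{\epsilon}{\sqrt{n}}\right)\right)=\Theta\left(\frac{\epsilon^{2}}{n}\right),

and the same bound holds with the arguments swapped. Summing over the $\Theta(n)$ coordinates on which $s$ and $t$ differ (the remaining coordinates contribute zero) yields $\mathrm{d_{KL}}(P_{s},P_{t})=O(\epsilon^{2})$.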

Fact 5.1 tells us that for any $P_{s}\neq P_{t}\in\mathcal{P}_{\mathcal{C}}$, $\mathrm{d_{TV}}(P_{s},P_{t})\geq c_{3}\epsilon$ for some constant $c_{3}$. Consider any algorithm for learning a random distribution $P=N(0,\Sigma)$ from $\mathcal{P}_{\mathcal{C}}$ to within $\mathrm{d_{TV}}$-distance at most $c_{4}\epsilon$ for a sufficiently small constant $c_{4}$. Let the learned distribution be $\widehat{P}=N(0,\widehat{\Sigma})$. By the triangle inequality, Fact 5.1, and an appropriate choice of $c_{4}$, $P$ must be the unique distribution from $\mathcal{P}_{\mathcal{C}}$ satisfying $||\widehat{\Sigma}^{-1}\Sigma-I||_{F}\leq c_{5}\epsilon$ for an appropriate choice of $c_{5}$. We can find this unique distribution by computing $||\widehat{\Sigma}^{-1}\Sigma^{\prime}-I||_{F}$ for every covariance matrix $\Sigma^{\prime}$ from $\mathcal{P}_{\mathcal{C}}$, and thereby guess the random distribution correctly. Hence, the lower bound follows. ∎

Now, we present the lower-bound for learning general Bayes nets.

Theorem 5.3.

For any 0<ϵ<10<\epsilon<1 and n,dn,d such that dn/2d\leq n/2, there exists a DAG GG over [n][n] of in-degree dd such that learning a Gaussian Bayes net P^\widehat{P} on GG such that dTV(P,P^)ϵ\mathrm{d_{TV}}(P,\widehat{P})\leq\epsilon with success probability 2/3 needs Ω(ndϵ2)\Omega(nd\epsilon^{-2}) samples in general.

Figure 1: Bipartite DAG on $n$ vertices with maximum degree $d$. For $i\in\{1,\ldots,d\}$, $X_{i}=\eta_{i}$ where $\eta_{i}\sim N(0,1)$. For $j\in\{d+1,\ldots,n\}$, $X_{j}=\eta_{j}+\sum_{i=1}^{d}a_{i\rightarrow j}X_{i}$ where $\eta_{j}\sim N(0,1)$. Furthermore, each $X_{j}$ is associated with a $d$-bit string, and each coefficient $a_{i\rightarrow j}$ is either $\frac{1}{\sqrt{d(n-d)}}$ or $\frac{1+\epsilon}{\sqrt{d(n-d)}}$, depending on the $i^{th}$ bit of the associated $d$-bit string.

Let $\mathcal{C}\subseteq\{0,1\}^{d}$ be a set with the following properties: 1) $|\mathcal{C}|=2^{\Theta(d)}$ and 2) any two distinct elements of $\mathcal{C}$ are at Hamming distance $\Theta(d)$. The existence of such a set is well known. We define a class of distributions based on $\mathcal{C}$ and the graph $G$ shown in Fig. 1 as follows. Each vertex of each distribution has $N(0,1)$ noise, and hence no learning is required for the noises. Each coefficient $a_{i\rightarrow j}$ takes one of the two values $\left\{\frac{1}{\sqrt{d(n-d)}},\frac{1+\epsilon}{\sqrt{d(n-d)}}\right\}$, corresponding to the bits $\{0,1\}$ respectively. For each $s\in\mathcal{C}$, we define $A_{s}$ to be the vector of coefficients corresponding to the bit-pattern of $s$ as above. We thus have $2^{\Theta(d)}$ possible bit-patterns, which we use to define each conditional distribution $(X_{j}\mid X_{1},X_{2},\dots,X_{d})$ for $j\in\{d+1,\ldots,n\}$. This gives a class $\mathcal{Q}_{\mathcal{C}}$ of $|\mathcal{C}|^{(n-d)}$ distributions for the overall Bayes net. We prune some of the distributions to get the desired subclass $\mathcal{P}_{\mathcal{C}}\subseteq\mathcal{Q}_{\mathcal{C}}$: $\mathcal{P}_{\mathcal{C}}$ is a largest subset in which any pair of distributions differ in at least $(n-d)/2$ of the $(n-d)$ bit-patterns defining the $(X_{j}\mid X_{1},X_{2},\dots,X_{d})$’s.

Claim 5.4.

|𝒫𝒞|2Θ(d(nd))|\mathcal{P}_{\mathcal{C}}|\geq 2^{\Theta(d(n-d))}.

Proof.

Let $\ell=|\mathcal{C}|=2^{\Theta(d)}$ and $m=n-d$. Then, there are $N=\ell^{m}$ possible coefficient-vectors/distributions in $\mathcal{Q}_{\mathcal{C}}$. We create an undirected graph $H$ with a vertex for each coefficient-vector and an edge between any pair of vertices differing in at least $m/2$ bit-patterns. Note that $H$ is $r$-regular for $r={m\choose 0.5m}(\ell-1)^{0.5m}+\dots+{m\choose 0.5m+j}(\ell-1)^{0.5m+j}+\dots+(\ell-1)^{m}$. Turán’s theorem guarantees a clique of size $\alpha=(1-\frac{r}{N})^{-1}=\frac{N}{N-r}$. We define $\mathcal{P}_{\mathcal{C}}$ to be the vertex set of such a clique.

To show that α\alpha is as large as desired, it suffices to show NrN/2Θ(dm)N-r\leq N/2^{\Theta(dm)}. The result follows by noting Nr=1+(m1)(1)++(mj)(1)j++(m0.5m)(1)0.5mm(4(1))0.5mN-r=1+{m\choose 1}(\ell-1)+\dots+{m\choose j}(\ell-1)^{j}+\dots+{m\choose 0.5m}(\ell-1)^{0.5m}\leq m\cdot(4(\ell-1))^{0.5m}. ∎

From (1), for any two distributions $P_{a},P_{b}$ in this model with coefficient-vectors $a,b$ respectively, $\mathrm{d_{KL}}(P_{a},P_{b})=\frac{1}{2}||a-b||_{2}^{2}$. Hence, any pair of distinct distributions in $\mathcal{P}_{\mathcal{C}}$ has $\mathrm{d_{KL}}=\Theta(\epsilon^{2})$. Fano’s inequality then tells us that identifying a random distribution $P\in\mathcal{P}_{\mathcal{C}}$ correctly with probability $2/3$ requires at least $\Omega(d(n-d)\epsilon^{-2})$ samples. We also need the following fact about the $\mathrm{d_{TV}}$-distance among the members of $\mathcal{P}_{\mathcal{C}}$.

Claim 5.5.

Let Pa,Pb𝒫𝒞P_{a},P_{b}\in\mathcal{P}_{\mathcal{C}} be two distinct distributions (i.e. PaPbP_{a}\neq P_{b}) with coefficient-vectors a,ba,b respectively. Then, dTV(Pa,Pb)Θ(ϵ)\mathrm{d_{TV}}(P_{a},P_{b})\in\Theta(\epsilon).

Proof.

By Pinsker’s inequality, we have $\mathrm{d_{TV}}(P_{a},P_{b})\in\mathcal{O}(\epsilon)$. Here we show $\mathrm{d_{TV}}(P_{a},P_{b})\in\Omega(\epsilon)$. Let $m=n-d$. By construction, $a$ and $b$ differ in $m^{\prime}\geq m/2$ conditional distributions. Let $a^{\prime}\subseteq a,b^{\prime}\subseteq b$ be the coefficient-vectors restricted to the coordinates where they differ. Let $P_{a^{\prime}},\Sigma_{a^{\prime}}$ and $P_{b^{\prime}},\Sigma_{b^{\prime}}$ be the corresponding marginal distributions on $(m^{\prime}+d)$ variables, and their covariance matrices. In the following, we show $||\Sigma_{a^{\prime}}^{-1}\Sigma_{b^{\prime}}-I||_{F}=\Omega(\epsilon)$, which together with Fact 5.1 proves the claim.

Let $M_{a^{\prime}}=\begin{bmatrix}0_{m^{\prime}\times m^{\prime}}&0_{m^{\prime}\times d}\\ A_{d\times m^{\prime}}&0_{d\times d}\end{bmatrix}$ be the adjacency matrix for $P_{a^{\prime}}$, where the sources appear last in the rows and columns, and in the matrix $A$ each entry $A_{ij}\in\{\frac{1}{\sqrt{md}},\frac{1+\epsilon}{\sqrt{md}}\}$ denotes the coefficient from source $i$ to sink $j$. Similarly, we define $M_{b^{\prime}}$ using a coefficient matrix $B_{d\times m^{\prime}}$. Let $\{A_{i}:1\leq i\leq m^{\prime}\}$ and $\{B_{i}:1\leq i\leq m^{\prime}\}$ denote the columns of $A$ and $B$. Then for every $i$, $A_{i}$ and $B_{i}$ differ in at least $\Theta(d)$ coordinates by construction.

By definition $\Sigma_{b^{\prime}}=\begin{bmatrix}\bullet&B^{T}\\ B&I_{d\times d}\end{bmatrix}$ and $\Sigma_{a^{\prime}}^{-1}=\begin{bmatrix}I_{m^{\prime}\times m^{\prime}}&-A^{T}\\ -A&\bullet\end{bmatrix}$, where $\bullet$ denotes submatrices that are not relevant to our discussion. (The missing symmetric submatrix of $\Sigma_{b^{\prime}}$ is the identity matrix plus the entries $\langle B_{i},B_{j}\rangle$; the missing symmetric submatrix of $\Sigma_{a^{\prime}}^{-1}$ is the identity matrix plus the inner products of the rows of $A$.) Let $J=\Sigma_{a^{\prime}}^{-1}\Sigma_{b^{\prime}}=\begin{bmatrix}\bullet&X_{m^{\prime}\times d}\\ \bullet&\bullet\end{bmatrix}$. It can be checked that $X_{ij}=B_{i}(j)-A_{i}(j)$ for every $1\leq i\leq m^{\prime}$ and $1\leq j\leq d$. Now, for every $i$, at each of the $\Theta(d)$ coordinates where $A_{i}$ and $B_{i}$ differ, we have $X_{ij}=\pm\frac{\epsilon}{\sqrt{md}}$. Hence their total contribution to $||J-I||_{F}^{2}$ is at least $m^{\prime}\cdot\Theta(d)\cdot\frac{\epsilon^{2}}{md}=\Omega(\epsilon^{2})$. ∎

Proof of Theorem 5.3.

Consider any algorithm which learns a random distribution $P=N(0,\Sigma)$ from $\mathcal{P}_{\mathcal{C}}$ to within $\mathrm{d_{TV}}$ distance $c_{3}\epsilon$, for a small enough constant $c_{3}$. Let the learned distribution be $\widehat{P}=N(0,\widehat{\Sigma})$. Then, by Fact 5.1 and the triangle inequality for $\mathrm{d_{TV}}$, among all covariance matrices $\Sigma^{\prime}$ of distributions in $\mathcal{P}_{\mathcal{C}}$, only $\Sigma^{\prime}=\Sigma$ satisfies $||\widehat{\Sigma}^{-1}\Sigma^{\prime}-I||_{F}\leq c_{4}\epsilon$, for an appropriate choice of $c_{4}$. This identifies the random distribution, and hence the lower bound follows. ∎

6 Experiments

General Setup

For our experiments, we explored both polytree networks (generated using random Prüfer sequences) and $G(n,p)$ Erdős–Rényi graphs with bounded expected degrees (i.e. $p=d/n$ for some bounded degree parameter $d$) using the Python package networkx [HSSC08]. Our non-zero edge weights are uniformly drawn from the range $(-2,-1]\cup[1,2)$. Once the graph is generated, the i.i.d. data $X\in\mathbb{R}^{m\times n}$ (with $n$ variables and sample size $m\in\{1000,2000,\ldots,5000\}$) is generated by sampling the model $X=B^{T}X+\eta$, where $\eta\sim N(0,I_{n\times n})$ and $B$ is a strictly upper triangular matrix. (We do not report results on the varied-variance synthetic data, because its performance is close to that of the equal-variance synthetic data.) We report the KL divergence between the ground truth and our learned distribution using Eq. 1, averaged over 20 random repetitions. All experiments were conducted on an Intel Core i7-9750H 2.60GHz CPU.
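The synthetic data generation described above can be sketched as follows (an illustrative NumPy/networkx sketch under the stated setup; the exact generation code in our repository may differ in details):

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d, m = 100, 5, 1000                                   # number of nodes, expected degree, sample size

# Erdos-Renyi skeleton, oriented from lower- to higher-indexed nodes so that the result is a DAG
skeleton = nx.gnp_random_graph(n, d / n, seed=0)
B = np.zeros((n, n))                                     # strictly upper-triangular weight matrix
for u, v in skeleton.edges():
    u, v = min(u, v), max(u, v)
    B[u, v] = rng.uniform(1.0, 2.0) * rng.choice([-1.0, 1.0])   # weight magnitude drawn from [1, 2)

# X = B^T X + eta with eta ~ N(0, I)  implies  X = (I - B^T)^{-1} eta
A = np.linalg.inv(np.eye(n) - B.T)
X = rng.standard_normal((m, n)) @ A.T                    # (m x n) matrix of i.i.d. samples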

Algorithms

The algorithms used in our experiments are as follows: Graphical Lasso [FHT08], the MLE (empirical) estimator, CLIME [CLL11], LeastSquares, BatchAvgLeastSquares, BatchMedLeastSquares, CauchyEstTree, and CauchyEst. We use BatchAvg_LS+x and BatchMed_LS+x to denote the BatchAvgLeastSquares and BatchMedLeastSquares algorithms respectively with a batch size of $p+x$ at each node, where $p$ is the number of parents of that node.

Synthetic data

Fig. 2 compares the KL divergence between the ground truth and the learned distribution over 100 variables across the eight estimators mentioned above. The first three estimators are designed for undirected graph structure learning; for this reason, we do not use Eq. 1 but the standard formula in [Duc07, page 13] for the KL divergence between multivariate Gaussian distributions. Fig. 2(a) and Fig. 2(b) show the results on ER graphs, while Fig. 2(c) shows the results for random tree graphs. The performances of MLE and CLIME are very close to each other, and thus their curves overlap in Fig. 2(a). In Fig. 2(b), we take a closer look at the results from Fig. 2(a) for the LeastSquares, BatchMedLeastSquares, BatchAvgLeastSquares, CauchyEstTree, and CauchyEst estimators for a clearer comparison. In our experiments, we find that these five outperform the GLASSO, CLIME and MLE (empirical) estimators. On a degree-5 ER graph, CauchyEst performs better than CauchyEstTree, while LeastSquares performs best. On random tree graphs with in-degree 1 (see Fig. 2(c)), we find that the performances of CauchyEstTree and CauchyEst are very close to each other and their plots overlap.

In our experiments, CauchyEst outperforms CauchyEstTree when the $G(n,p)$ graph is generated with a larger degree parameter $d$ (e.g. $d>5$), in which case the resulting graph is unlikely to be a polytree.

Figure 2: Experiment on well-conditioned uncontaminated data. (a) Eight algorithms evaluated on an ER graph with $d=5$. (b) A closer look at some of the algorithms in the plot of Fig. 2(a). (c) A closer look at some of the algorithms evaluated on random tree graphs.
Real-world datasets

We also evaluated our algorithms on four real Gaussian Bayesian networks from the R package bnlearn [Scu09]. The ECOLI70 graph provided by [SS05] contains 46 nodes and 70 edges. The MAGIC-NIAB graph from [SHBM14] contains 44 nodes and 66 edges. The MAGIC-IRRI graph contains 64 nodes and 102 edges, and the ARTH150 graph [ORS07] contains 107 nodes and 150 edges. The experimental results in Fig. 3 show that the error of LeastSquares is smaller than that of CauchyEst and CauchyEstTree on all of the above datasets.

Figure 3: Experiment results over bnlearn real graphs. (a) ECOLI70, 46 nodes. (b) MAGIC-NIAB, 44 nodes. (c) MAGIC-IRRI, 64 nodes. (d) ARTH150, 107 nodes.
Contaminated synthetic data

The contaminated synthetic data is generated in the following way: starting from the well-conditioned data over $n=100$ node graphs, we randomly choose $5\%$ of the samples and contaminate 5 nodes within them. The well-conditioned data has $N(0,1)$ noise at every variable, while the contaminated portion has noise drawn from either $N(1000,1)$ or $\mathrm{Cauchy}(1000,1)$. In our experiments in Fig. 4 and Fig. 5, CauchyEst, CauchyEstTree, and BatchMedLeastSquares outperform BatchAvgLeastSquares and LeastSquares by a large margin. With more than 1000 samples, BatchMedLeastSquares with a batch size of $p+20$ performs similarly to CauchyEst and CauchyEstTree, but performs worse with fewer samples. Comparing LeastSquares and BatchAvgLeastSquares on either a random tree or an ER graph, the experiment in Fig. 4(a) on a random tree graph shows that LeastSquares performs better than BatchAvgLeastSquares when the sample size is smaller than 2000, while BatchAvgLeastSquares performs relatively better with more samples. The results in Fig. 5(a) on ER degree-5 graphs are slightly different from Fig. 4(a): there, BatchAvgLeastSquares performs better than LeastSquares by a large margin. We also observe that the performances of CauchyEst, CauchyEstTree, and BatchMedLeastSquares are better than those of the above two estimators and are consistent across different types of graphs. For all five algorithms, we use the median absolute deviation for robust variance recovery [Hub04] in the contaminated case only (see Algorithm 8 in the Appendix).

This is because both LeastSquares and BatchAvgLeastSquares use the sample covariance (of the entire dataset or of its batches) in the coefficient estimators for the unknown distribution. The presence of a small proportion of outliers in the data can have a large distorting influence on the sample covariance, making these estimators sensitive to atypical observations. In contrast, our CauchyEstTree and BatchMedLeastSquares estimators are built around the sample median and hence are resistant to outliers in the data.
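For reference, a standard MAD-based variance estimate looks as follows (a minimal sketch of the idea only; Algorithm 8 in the Appendix is the procedure we actually use and may differ in details):

import numpy as np

def robust_noise_variance(residuals):
    # residuals: samples of Y - A_hat^T X for a single node
    mad = np.median(np.abs(residuals - np.median(residuals)))   # median absolute deviation
    return (1.4826 * mad) ** 2    # the factor 1.4826 makes the MAD consistent for Gaussian noise [Hub04]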

Figure 4: Experiment results on contaminated data (random tree with Gaussian noise). (a) All algorithms. (b) BatchMed_LS, CauchyEst, and CauchyEstTree.
Figure 5: Experiment results on contaminated data (ER graph with Cauchy noise). (a) All algorithms. (b) BatchMed_LS, CauchyEst, and CauchyEstTree.
Contaminated real-world datasets

To test robustness on real data under contamination, we manually contaminate $5\%$ of the samples at 5 nodes of the observational data in ECOLI70 and ARTH150. The results are reported in Fig. 6 and Fig. 7. In our experiments, CauchyEst and CauchyEstTree outperform BatchAvgLeastSquares and LeastSquares by a large margin, and are therefore stable in both the contaminated and the well-conditioned case. Note that, unlike in the well-conditioned case, CauchyEstTree here performs slightly better than CauchyEst. This is because the Cholesky decomposition used in the CauchyEst estimator is computed from the contaminated empirical covariance matrix.

Figure 6: ECOLI70 under the contaminated condition. (a) ECOLI70, 5/46 noisy nodes. (b) CauchyEst and CauchyEstTree.
Figure 7: ARTH150 under the contaminated condition. (a) ARTH150, 5/107 noisy nodes. (b) CauchyEst and CauchyEstTree.
Ill-conditioned synthetic data

The ill-conditioned data is generated in the following way: we partition the node set $V$ into well-conditioned and ill-conditioned nodes. The well-conditioned nodes have $N(0,1)$ noise, while the ill-conditioned nodes have $N(0,10^{-20})$ noise. In our experiments, we choose 3 ill-conditioned nodes out of 100 nodes. Synthetic data is sampled from either a random tree or an Erdős–Rényi (ER) model with an expected number of neighbors $d=5$. Experiments over ill-conditioned Gaussian Bayesian networks, averaged over 20 random repetitions, are presented in Fig. 8. In the ill-conditioned settings, we sometimes run into numerical issues when computing the Cholesky decomposition of the empirical covariance matrix $\hat{M}$ in our CauchyEst estimator. Thus, we only show the comparison between LeastSquares, BatchAvgLeastSquares, BatchMedLeastSquares, and CauchyEstTree. Here also, the error of LeastSquares decreases faster than that of the other three estimators. The performance of BatchMedLeastSquares is worse than that of BatchAvgLeastSquares but slightly better than that of CauchyEstTree.

Figure 8: Experiment results on ill-conditioned data. (a) ER graph, $d=5$. (b) Random tree.
Agnostic Learning

Our theoretical results treat the case where the data is generated by a distribution consistent with the given DAG. In this section, we explore learning from non-realizable inputs, so that there is a non-zero KL divergence between the input distribution $P$ and every distribution consistent with the given DAG.

We conduct agnostic learning experiments by fitting a random sub-graph of the ground truth graph. To do this, we first generate a 100-node ground truth graph $G$ (either a random tree or a random ER graph) and obtain the edited graph $G^{*}$ by removing 4 random edges in the tree case or 9 random edges in the ER case. Our algorithms then fit the data generated from the original Bayes net on $G$ using the edited graph $G^{*}$. Fig. 9 reports the KL divergence achieved by our five estimators. We find that the BatchAvgLeastSquares estimator performs slightly better than all other estimators in both cases.

Figure 9: Agnostic learning. (a) Random tree: 4 edges removed. (b) ER graph, $d=5$: 9 edges removed.
Effect of changing batch size
Figure 10: Effect of changing batch size over Batch Average. (a) ER graph, $d=2$. (b) ER graph, $d=5$.
Figure 11: Effect of changing batch size over Batch Median (ER). (a) ER graph, $d=2$. (b) ER graph, $d=5$.

Next, we examine in more detail the trade-off between the batch size (e.g. batch sizes $p+5$, $p+20$, and $p+100$) and the KL divergence of our BatchAvgLeastSquares and BatchMedLeastSquares estimators. (We cap the batch size at $p+100$ so that the simulated data still yields enough batches for the batch mean and the batch median to converge.) As shown in Fig. 10 and Fig. 11, when the batch size increases, the results move closer to those of the LeastSquares estimator. In other words, LeastSquares can be seen as the special case of either BatchAvgLeastSquares or BatchMedLeastSquares with a single batch, so with larger batch sizes the performances of BatchAvgLeastSquares and BatchMedLeastSquares approach that of LeastSquares. At the other extreme, the CauchyEstTree estimator can be seen as the estimator with a batch size of exactly $p$. Therefore, with a smaller batch size (e.g. batch size $p+5$), BatchAvgLeastSquares and BatchMedLeastSquares perform closer to the CauchyEstTree estimator.

Runtime comparison

We now give the amount of time spent by each algorithm to process a degree-5 ER graph on 100 nodes with 500 samples. LeastSquares algorithm takes 0.0096 seconds, BatchAvgLeastSquares with a batch size of p+20p+20 takes 0.0415 seconds, BatchMedLeastSquares with a batch size of p+20p+20 takes 0.0290 seconds, CauchyEstTree takes 0.6063 seconds, and CauchyEst takes 0.6307 seconds. The timings given above are representative of the relative running times of these algorithms across different graph sizes.

Takeaways

In summary, the LeastSquares estimator performs the best among all algorithms on uncontaminated datasets (both real and synthetic) generated from Gaussian Bayesian networks. This holds even when the data is ill-conditioned. However, if a fraction of the samples are contaminated, CauchyEst, CauchyEstTree and BatchMedLeastSquares outperform LeastSquares and BatchAvgLeastSquares by a large margin under different noise and graph types. If the data is not generated according to the input graph (i.e., the non-realizable/agnostic learning setting), then BatchAvgLeastSquares, CauchyEst, and CauchyEstTree have a better tradeoff between the error and sample complexity than the other algorithms, although we do not have a formal explanation.

Acknowledgements

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2021-08-013).

References

  • [ABDH+20] Hassan Ashtiani, Shai Ben-David, Nicholas JA Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
  • [AGZ19] Bryon Aragam, Jiaying Gu, and Qing Zhou. Learning large-scale bayesian networks with the sparsebn package. Journal of Statistical Software, 91(1):1–38, 2019.
  • [AZ15] Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse gaussian bayesian networks. The Journal of Machine Learning Research, 16(1):2273–2328, 2015.
  • [BCD20] Johannes Brustle, Yang Cai, and Constantinos Daskalakis. Multi-item mechanisms without item-independence: Learnability via robustness. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 715–761, 2020.
  • [BGMV20] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, and NV Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. Advances in Neural Information Processing Systems, 33, 2020.
  • [BGPV20] Arnab Bhattacharyya, Sutanu Gayen, Eric Price, and NV Vinodchandran. Near-optimal learning of tree-structured distributions by chow-liu. arXiv preprint arXiv:2011.04144, 2020. ACM STOC 2021.
  • [BVH+16] Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, 44(4):2479–2506, 2016.
  • [CDKS20] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing bayesian networks. IEEE Transactions on Information Theory, 66(5):3132–3170, 2020.
  • [CDW19] Wenyu Chen, Mathias Drton, and Y Samuel Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, 2019.
  • [CLL11] Tony Cai, Weidong Liu, and Xi Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
  • [Das97] Sanjoy Dasgupta. The sample complexity of learning fixed-structure bayesian networks. Mach. Learn., 29(2-3):165–180, 1997.
  • [Dia16] Ilias Diakonikolas. Learning structured distributions. Handbook of Big Data, 267, 2016.
  • [DMR18] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians. arXiv preprint arXiv:1810.08693, 2018.
  • [Duc07] John Duchi. Derivations for linear algebra and optimization. Berkeley, California, 3(1):2325–5870, 2007.
  • [FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • [GDA20] Ming Gao, Yi Ding, and Bryon Aragam. A polynomial-time algorithm for learning nonparametric causal graphs. Advances in Neural Information Processing Systems, 33, 2020.
  • [GH17] Asish Ghoshal and Jean Honorio. Learning identifiable Gaussian bayesian networks in polynomial time and sample complexity. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6460–6469, 2017.
  • [Gut09] Allan Gut. An Intermediate Course in Probability. Springer New York, 2009.
  • [GZ20] Jiaying Gu and Qing Zhou. Learning big gaussian bayesian networks: Partition, estimation and fusion. Journal of Machine Learning Research, 21(158):1–31, 2020.
  • [Hau18] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. CRC Press, 2018.
  • [HSSC08] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008.
  • [HTW19] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2019.
  • [Hub04] Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.
  • [JNG+19] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
  • [KMR+94] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 273–282, 1994.
  • [KP83] H. Kim and J. Pearl. A computational model for combined causal and diagnostic reasoning in inference systems. In Proceedings of the 8th International Joint Conference on Artificial Intelligence (IJCAI), 1983.
  • [Mue99] Ralph O Mueller. Basic principles of structural equation modeling: An introduction to LISREL and EQS. Springer Science & Business Media, 1999.
  • [Mul09] Stanley A Mulaik. Linear causal modeling with structural equations. CRC press, 2009.
  • [ORS07] Rainer Opgen-Rhein and Korbinian Strimmer. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC systems biology, 1(1):1–10, 2007.
  • [Par20] Gunwoong Park. Identifiability of additive noise models using conditional variances. Journal of Machine Learning Research, 21(75):1–34, 2020.
  • [PB14] Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
  • [Pea86] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
  • [Pea88] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, San Francisco, Calif, 2nd edition, 1988.
  • [PK20] Gunwoong Park and Youngwhan Kim. Identifiability of Gaussian linear structural equation models with homogeneous and heterogeneous error variances. Journal of the Korean Statistical Society, 49(1):276–292, 2020.
  • [RV09] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 62(12):1707–1739, 2009.
  • [Scu09] Marco Scutari. Learning bayesian networks with the bnlearn r package. arXiv preprint arXiv:0908.3817, 2009.
  • [Scu20] Marco Scutari. bnlearn, 2020. Version 4.6.1.
  • [SHBM14] Marco Scutari, Phil Howell, David J Balding, and Ian Mackay. Multiple quantitative trait analysis using bayesian networks. Genetics, 198(1):129–137, 2014.
  • [SS05] Juliane Schafer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology, 4(1), 2005.
  • [Tsy08] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • [Val84] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [Wai19] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Appendix A Details on decomposition of KL divergence

In this section, we provide the full derivation of Eq. 1.

For notational convenience, we write xx to mean (x1,,xn)(x_{1},\ldots,x_{n}), πi(x)\pi_{i}(x) to mean the values given to parents of variable XiX_{i} by xx, and 𝒫(x)\mathcal{P}(x) to mean 𝒫(X1=x1,,Xn=xn)\mathcal{P}(X_{1}=x_{1},\ldots,X_{n}=x_{n}). Observe that

dKL(𝒫,𝒬)\displaystyle\;\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})
=\displaystyle= x𝒫(x)log(𝒫(x)𝒬(x))𝑑x\displaystyle\;\int_{x}\mathcal{P}(x)\log\left(\frac{\mathcal{P}(x)}{\mathcal{Q}(x)}\right)dx
=\displaystyle= x𝒫(x)log(Πi=1n𝒫(xiπi(x))Πi=1n𝒬(xiπi(x)))𝑑x\displaystyle\;\int_{x}\mathcal{P}(x)\log\left(\frac{\Pi_{i=1}^{n}\mathcal{P}(x_{i}\mid\pi_{i}(x))}{\Pi_{i=1}^{n}\mathcal{Q}(x_{i}\mid\pi_{i}(x))}\right)dx ()(\star)
=\displaystyle= i=1nx𝒫(x)log(𝒫(xiπi(x))𝒬(xiπi(x)))𝑑x\displaystyle\;\sum_{i=1}^{n}\int_{x}\mathcal{P}(x)\log\left(\frac{\mathcal{P}(x_{i}\mid\pi_{i}(x))}{\mathcal{Q}(x_{i}\mid\pi_{i}(x))}\right)dx
=\displaystyle= i=1nxi,πi(x)𝒫(xi,πi(x))log(𝒫(xiπi(x))𝒬(xiπi(x)))𝑑xi𝑑πi(x)\displaystyle\;\sum_{i=1}^{n}\int_{x_{i},\pi_{i}(x)}\mathcal{P}(x_{i},\pi_{i}(x))\log\left(\frac{\mathcal{P}(x_{i}\mid\pi_{i}(x))}{\mathcal{Q}(x_{i}\mid\pi_{i}(x))}\right)dx_{i}d\pi_{i}(x) Marginalization

where ()(\star) is due to the Bayesian network decomposition of joint probabilities. Let us define

dCP(αi,α^i)=xi,πi(x)𝒫(xi,πi(x))log(𝒫(xiπi(x))𝒬(xiπi(x)))𝑑xi𝑑πi(x)\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})=\int_{x_{i},\pi_{i}(x)}\mathcal{P}(x_{i},\pi_{i}(x))\log\left(\frac{\mathcal{P}(x_{i}\mid\pi_{i}(x))}{\mathcal{Q}(x_{i}\mid\pi_{i}(x))}\right)dx_{i}d\pi_{i}(x)

where each $\widehat{\alpha}_{i}$ and $\alpha^{*}_{i}$ denote the parameters relevant to variable $X_{i}$ from $\widehat{\alpha}$ and $\alpha^{*}$ respectively. Under this notation, we can write $\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\sum_{i=1}^{n}\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})$.

Fix a variable of interest YY with parents X1,,XpX_{1},\ldots,X_{p}, each with coefficient cic_{i}, and variance σ2\sigma^{2}. That is, Y=ηy+i=1pciXiY=\eta_{y}+\sum_{i=1}^{p}c_{i}X_{i} for some ηyN(0,σ2)\eta_{y}\sim N(0,\sigma^{2}) that is independent of X1,,XpX_{1},\ldots,X_{p}. By denoting X=xX=x (i.e. X1=x1,,Xp=xpX_{1}=x_{1},\ldots,X_{p}=x_{p}) and c=(c1,,cp)c=(c_{1},\ldots,c_{p}), we can write the conditional distribution density of YY as

Pr(yx,c,σ)=1σ2πexp(12σ2(yi=1pciXi)2)\Pr\left(y\mid x,c,\sigma\right)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^{2}}\cdot\left(y-\sum_{i=1}^{p}c_{i}X_{i}\right)^{2}\right)

We now analyze $\mathrm{d_{CP}}(\alpha^{*}_{y},\widehat{\alpha}_{y})$ with respect to our estimates $\widehat{\alpha}_{y}=(\widehat{A},\widehat{\sigma}_{y})$ and the hidden true parameters $\alpha^{*}_{y}=(A,\sigma_{y})$, where $\widehat{A}=(\widehat{a}_{y\leftarrow 1},\ldots,\widehat{a}_{y\leftarrow p})$ and $A=(a_{y\leftarrow 1},\ldots,a_{y\leftarrow p})$.

With respect to variable YY, we see that

dCP(αy,α^y)\displaystyle\;\mathrm{d_{CP}}(\alpha^{*}_{y},\widehat{\alpha}_{y})
=\displaystyle=\int_{x,y}\mathcal{P}(x,y)\ln\left(\frac{\frac{1}{\sigma_{y}\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_{y}^{2}}\cdot\left(y-\sum_{i=1}^{p}a_{y\leftarrow i}X_{i}\right)^{2}\right)}{\frac{1}{\widehat{\sigma}_{y}\sqrt{2\pi}}\exp\left(-\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\left(y-\sum_{i=1}^{p}\widehat{a}_{y\leftarrow i}X_{i}\right)^{2}\right)}\right)dy\ dx
=\displaystyle= ln(σ^yσy)12σy2𝔼x,y(yi=1payiXi)2+12σ^y2𝔼x,y(yi=1pa^yiXi)2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-\sum_{i=1}^{p}a_{y\leftarrow i}X_{i}\right)^{2}+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-\sum_{i=1}^{p}\widehat{a}_{y\leftarrow i}X_{i}\right)^{2}
=\displaystyle= ln(σ^yσy)12σy2𝔼x,y(yAX)2+12σ^y2𝔼x,y(yA^X)2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-A^{\top}X\right)^{2}+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-\widehat{A}^{\top}X\right)^{2}

By defining Δ=A^A\Delta=\widehat{A}-A, we can see that for any instantiation of y,X1,,Xpy,X_{1},\ldots,X_{p},

(yA^X)2\displaystyle\;\left(y-\widehat{A}^{\top}X\right)^{2}
=\displaystyle= (y(Δ+A)X)2\displaystyle\;\left(y-(\Delta+A)^{\top}X\right)^{2} By definition of Δ\Delta
=\displaystyle= ((yAX)ΔX)2\displaystyle\;\left((y-A^{\top}X)-\Delta^{\top}X\right)^{2}
=\displaystyle= (yAX)22(yAX)(ΔX)+(ΔX)2\displaystyle\;(y-A^{\top}X)^{2}-2(y-A^{\top}X)(\Delta^{\top}X)+\left(\Delta^{\top}X\right)^{2}
=\displaystyle= (yAX)22(yΔXAXΔX)+(ΔX)2\displaystyle\;(y-A^{\top}X)^{2}-2\left(y\Delta^{\top}X-A^{\top}X\Delta^{\top}X\right)+\left(\Delta^{\top}X\right)^{2}
=\displaystyle= (yAX)22(yXΔAXXΔ)+ΔXXΔ\displaystyle\;(y-A^{\top}X)^{2}-2\left(yX^{\top}\Delta-A^{\top}XX^{\top}\Delta\right)+\Delta^{\top}XX^{\top}\Delta Since ΔX\Delta^{\top}X is just a number

Denote the covariance matrix with respect to X1,,XpX_{1},\ldots,X_{p} as Mp×pM\in\mathbb{R}^{p\times p}, where Mi,j=𝔼[XiXj]M_{i,j}=\mathbb{E}\left[X_{i}X_{j}\right]. Then, we can further simplify dCP(αy,α^y)\mathrm{d_{CP}}(\alpha^{*}_{y},\widehat{\alpha}_{y}) as follows:

dCP(αy,α^y)\displaystyle\;\mathrm{d_{CP}}(\alpha^{*}_{y},\widehat{\alpha}_{y})
=\displaystyle= ln(σ^yσy)12σy2𝔼x,y(yAX)2+12σ^y2𝔼x,y(yA^X)2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-A^{\top}X\right)^{2}+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-\widehat{A}^{\top}X\right)^{2} From above
=\displaystyle= ln(σ^yσy)12σy2𝔼x,y(yAX)2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\mathbb{E}_{x,y}\left(y-A^{\top}X\right)^{2}
+12σ^y2𝔼x,y[(yAX)22(yXΔAXXΔ)+ΔXXΔ]\displaystyle\;+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\mathbb{E}_{x,y}\left[(y-A^{\top}X)^{2}-2\left(yX^{\top}\Delta-A^{\top}XX^{\top}\Delta\right)+\Delta^{\top}XX^{\top}\Delta\right] From above
=\displaystyle= ln(σ^yσy)12σy2𝔼x,yηy2+12σ^y2𝔼x,y[ηy22(ηyXΔ)+ΔXXΔ]\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\mathbb{E}_{x,y}\eta_{y}^{2}+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\mathbb{E}_{x,y}\left[\eta_{y}^{2}-2\left(\eta_{y}X^{\top}\Delta\right)+\Delta^{\top}XX^{\top}\Delta\right] ()({\dagger})
=\displaystyle= ln(σ^yσy)12σy2σy2+12σ^y2[σy20+ΔMΔ]\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2\sigma_{y}^{2}}\cdot\sigma_{y}^{2}+\frac{1}{2\widehat{\sigma}_{y}^{2}}\cdot\left[\sigma_{y}^{2}-0+\Delta^{\top}M\Delta\right] ()(\ast)
=\displaystyle= ln(σ^yσy)12+σy22σ^y2+ΔMΔ2σ^y2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)-\frac{1}{2}+\frac{\sigma_{y}^{2}}{2\widehat{\sigma}_{y}^{2}}+\frac{\Delta^{\top}M\Delta}{2\widehat{\sigma}_{y}^{2}}
=\displaystyle= ln(σ^yσy)+σy2σ^y22σ^y2+ΔMΔ2σ^y2\displaystyle\;\ln\left(\frac{\widehat{\sigma}_{y}}{\sigma_{y}}\right)+\frac{\sigma_{y}^{2}-\widehat{\sigma}_{y}^{2}}{2\widehat{\sigma}_{y}^{2}}+\frac{\Delta^{\top}M\Delta}{2\widehat{\sigma}_{y}^{2}}

where ()({\dagger}) is because y=ηy+AXy=\eta_{y}+A^{\top}X while ()(\ast) is because ηyN(0,σy2)\eta_{y}\sim N(0,\sigma_{y}^{2}), 𝔼x,y(ηyXΔ)=𝔼x,yηy𝔼x,yXΔ=0\mathbb{E}_{x,y}\left(\eta_{y}X^{\top}\Delta\right)=\mathbb{E}_{x,y}\eta_{y}\cdot\mathbb{E}_{x,y}X^{\top}\Delta=0, and 𝔼x,yΔXXΔ=Δ(𝔼x,yXX)Δ=ΔMΔ\mathbb{E}_{x,y}\Delta^{\top}XX^{\top}\Delta=\Delta^{\top}(\mathbb{E}_{x,y}XX^{\top})\Delta=\Delta^{\top}M\Delta.

In conclusion, we have

\mathrm{d_{KL}}(\mathcal{P},\mathcal{Q})=\sum_{i=1}^{n}\mathrm{d_{CP}}(\alpha^{*}_{i},\widehat{\alpha}_{i})=\sum_{i=1}^{n}\left[\ln\left(\frac{\widehat{\sigma}_{i}}{\sigma_{i}}\right)+\frac{\sigma_{i}^{2}-\widehat{\sigma}_{i}^{2}}{2\widehat{\sigma}_{i}^{2}}+\frac{\Delta_{i}^{\top}M_{i}\Delta_{i}}{2\widehat{\sigma}_{i}^{2}}\right] (3)

where MiM_{i} is the covariance matrix of the parents of XiX_{i}, αi=(Ai,σi)\alpha_{i}^{*}=(A_{i},\sigma_{i}) collects the true coefficients and noise standard deviation associated with variable XiX_{i}, α^i=(A^i,σ^i)\widehat{\alpha}_{i}=(\widehat{A}_{i},\widehat{\sigma}_{i}) are the corresponding estimates, and Δi=A^iAi\Delta_{i}=\widehat{A}_{i}-A_{i}.
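
As a quick numerical sanity check of Equation (3) (this snippet is not part of our released code), one can build a small Gaussian Bayesian network, perturb its parameters, and compare the node-wise sum in Equation (3) against the closed-form KL divergence between the two induced zero-mean Gaussians. The Python sketch below assumes numpy; the toy DAG, coefficients, noise scales, and perturbation are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n = 4

# Weighted adjacency matrix of a toy DAG in topological order: entry A[i, j]
# is the coefficient of edge X_j -> X_i, so A is strictly lower triangular.
# These particular numbers are arbitrary illustrative choices.
A_true = np.array([[0.0, 0.0, 0.0, 0.0],
                   [0.8, 0.0, 0.0, 0.0],
                   [0.3, -0.5, 0.0, 0.0],
                   [0.0, 0.7, 0.2, 0.0]])
sig_true = np.array([1.0, 0.5, 0.7, 1.2])   # noise standard deviations

# A perturbed estimate that respects the same DAG structure.
A_hat = A_true + 0.05 * rng.standard_normal((n, n)) * (A_true != 0)
sig_hat = sig_true * (1.0 + 0.05 * rng.standard_normal(n))

def covariance(A, sig):
    # X = A X + eta with eta ~ N(0, diag(sig^2)), hence X = (I - A)^{-1} eta.
    Minv = np.linalg.inv(np.eye(len(sig)) - A)
    return Minv @ np.diag(sig ** 2) @ Minv.T

Sigma_P = covariance(A_true, sig_true)   # covariance of the true distribution P
Sigma_Q = covariance(A_hat, sig_hat)     # covariance of the estimate Q

# Closed-form KL divergence between two zero-mean Gaussians.
kl_direct = 0.5 * (np.trace(np.linalg.solve(Sigma_Q, Sigma_P)) - n
                   + np.log(np.linalg.det(Sigma_Q) / np.linalg.det(Sigma_P)))

# Node-wise decomposition of Equation (3); Delta_i is supported on the parents
# of X_i, so the quadratic form with the full covariance Sigma_P equals the one
# with the parent covariance M_i.
kl_nodewise = sum(
    np.log(sig_hat[i] / sig_true[i])
    + (sig_true[i] ** 2 - sig_hat[i] ** 2) / (2 * sig_hat[i] ** 2)
    + (A_hat[i] - A_true[i]) @ Sigma_P @ (A_hat[i] - A_true[i]) / (2 * sig_hat[i] ** 2)
    for i in range(n))

print(kl_direct, kl_nodewise)   # the two values should coincide up to float error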

Appendix B Deferred proofs

This section provides the formal proofs that were deferred from the main text for readability. For convenience, we will restate the statements before proving them.

The next two lemmata, Lemma 2.6 and Lemma 2.7, are used in the proof of Lemma 3.2, which is in turn used in the proof of Theorem 3.1. We also use Lemma 2.6 in the proof of Lemma 3.5. The proof of Lemma 2.6 uses the standard result of Lemma B.1.

Lemma B.1 ([RV09]; Theorem 6.1 and Equation 6.10 in [Wai19]).

Let d\ell\geq d and G×dG\in\mathbb{R}^{\ell\times d} be a matrix with i.i.d. N(0,1)N(0,1) entries. Denote σmin(G)\sigma_{\min}(G) as the smallest singular value of GG. Then, for any 0<t<10<t<1, we have Pr(σmin(G)(1t)d)exp(t2/2)\Pr\left(\sigma_{\min}(G)\leq\sqrt{\ell}(1-t)-\sqrt{d}\right)\leq\exp\left(-\ell t^{2}/2\right).

See 2.6

Proof of Lemma 2.6.

Observe that GGG^{\top}G is symmetric and positive semidefinite, thus (GG)1(G^{\top}G)^{-1} is also symmetric and the eigenvalues of GGG^{\top}G equal the singular values of GGG^{\top}G. Also, note that the event that GGG^{\top}G is singular has measure 0. (Consider fixing all but one arbitrary entry of GG; the event of this independent N(0,1)N(0,1) entry making det(GG)=0\det(G^{\top}G)=0 has measure 0.)

By definition of the operator norm, (GG)1\left\lVert(G^{\top}G)^{-1}\right\rVert equals the square root of the maximum eigenvalue of

((GG)1)((GG)1)=((GG)1)2,((G^{\top}G)^{-1})^{\top}((G^{\top}G)^{-1})=((G^{\top}G)^{-1})^{2},

where the equality is because (GG)1(G^{\top}G)^{-1} is symmetric. Since GGG^{\top}G is invertible, (GG)1\|(G^{\top}G)^{-1}\| equals the maximum eigenvalue of (GG)1(G^{\top}G)^{-1}, which is the inverse of the minimum eigenvalue λmin(GG)\lambda_{\min}(G^{\top}G) of GGG^{\top}G, which is in turn equal to the square of the minimum singular value σmin(G)\sigma_{\min}(G) of GG.

Therefore, the following holds with probability at least 1exp(kc12/2)1-\exp\left(-kc_{1}^{2}/2\right):

\left\lVert\left(G^{\top}G\right)^{-1}\right\rVert=\frac{1}{\lambda_{\min}(G^{\top}G)}=\frac{1}{\sigma_{\min}^{2}(G)}\leq\frac{1}{\left(\sqrt{k}(1-c_{1})-\sqrt{d}\right)^{2}}\leq\frac{1}{\left(1-2c_{1}\right)^{2}k}

where the first inequality is due to Lemma B.1 (applied with t=c1t=c_{1}) and the last inequality uses dc1k\sqrt{d}\leq c_{1}\sqrt{k}, which holds when kd/c12k\geq d/c_{1}^{2}. ∎
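
The bound of Lemma 2.6 can also be probed empirically. The following Python sketch (illustrative only, not part of our codebase; the dimension d, the constant c_{1}, and the random seed are arbitrary) draws matrices with k\geq d/c_{1}^{2} rows of i.i.d. N(0,1) entries and counts how often \left\lVert(G^{\top}G)^{-1}\right\rVert exceeds 1/((1-2c_{1})^{2}k); by Lemma 2.6, this should happen with probability at most \exp(-kc_{1}^{2}/2).

import numpy as np

rng = np.random.default_rng(1)
d, c1 = 10, 0.25
k = int(np.ceil(d / c1 ** 2))        # ensures k >= d / c1^2
trials, violations = 2000, 0

for _ in range(trials):
    G = rng.standard_normal((k, d))                      # i.i.d. N(0,1) entries
    sigma_min = np.linalg.svd(G, compute_uv=False)[-1]   # sigma_min(G)
    # ||(G^T G)^{-1}|| = 1 / sigma_min(G)^2
    if 1.0 / sigma_min ** 2 > 1.0 / ((1 - 2 * c1) ** 2 * k):
        violations += 1

# Empirical failure frequency vs. the bound exp(-k * c1^2 / 2) from Lemma 2.6.
print(violations / trials, np.exp(-k * c1 ** 2 / 2))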

See 2.7

Proof of Lemma 2.7.

Let us denote grkg_{r}\in\mathbb{R}^{k} as the rthr^{th} row of GG^{\top}. Then, we see that Gη22=r=1pgr,η2\|G^{\top}\eta\|_{2}^{2}=\sum_{r=1}^{p}\langle g_{r},\eta\rangle^{2}. For any row rr, we see that gr,η=η2gr,η/η2\langle g_{r},\eta\rangle=\|\eta\|_{2}\cdot\langle g_{r},\eta/\|\eta\|_{2}\rangle. We will bound values of η2\|\eta\|_{2} and |gr,η/η2||\langle g_{r},\eta/\|\eta\|_{2}\rangle| separately.

It is well-known (e.g. see [JNG+19, Lemma 2]) that the norm of a Gaussian vector concentrates around its mean. So, Pr(η22σk)2exp(2k)\Pr\left(\|\eta\|_{2}\geq 2\sigma\sqrt{k}\right)\leq 2\exp\left(-2k\right). Since grN(0,Ik)g_{r}\sim N(0,I_{k}) and η\eta are independent, we see that gr,η/η2N(0,1)\langle g_{r},\eta/\|\eta\|_{2}\rangle\sim N(0,1). By standard Gaussian tail bounds, we have that Pr[|gr,η/η2|c2]exp(c22/2)\Pr\left[|\langle g_{r},\eta/\|\eta\|_{2}\rangle|\geq c_{2}\right]\leq\exp\left(-c_{2}^{2}/2\right).

By applying a union bound over these two events, we see that, for any fixed row, |gr,η|<2σc2k|\langle g_{r},\eta\rangle|<2\sigma c_{2}\sqrt{k} except with probability at most 2exp(2k)+exp(c22/2)2\exp\left(-2k\right)+\exp\left(-c_{2}^{2}/2\right). The claim follows from applying a union bound over all pp rows. ∎
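
A similar simulation illustrates the per-row bound used in this proof. The sketch below (again illustrative only; the values of k, p, \sigma, and c_{2} are arbitrary) draws G with i.i.d. N(0,1) entries and an independent \eta\sim N(0,\sigma^{2}I_{k}), and compares the empirical frequency of \max_{r}|\langle g_{r},\eta\rangle|\geq 2\sigma c_{2}\sqrt{k} with the union bound p(2\exp(-2k)+\exp(-c_{2}^{2}/2)).

import numpy as np

rng = np.random.default_rng(2)
k, p, sigma, c2 = 200, 5, 1.5, 4.0
trials, failures = 5000, 0

for _ in range(trials):
    G = rng.standard_normal((k, p))          # i.i.d. N(0,1) entries
    eta = sigma * rng.standard_normal(k)     # eta ~ N(0, sigma^2 I_k), independent of G
    if np.max(np.abs(G.T @ eta)) >= 2 * sigma * c2 * np.sqrt(k):
        failures += 1

# Empirical failure frequency vs. p * (2*exp(-2k) + exp(-c2^2/2)) from the proof.
print(failures / trials, p * (2 * np.exp(-2 * k) + np.exp(-c2 ** 2 / 2)))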

See 2.8

Proof of Lemma 2.8.

Let S>τ=i=1m𝟙Xi>τS_{>\tau}=\sum_{i=1}^{m}\mathbbm{1}_{X_{i}>\tau} be the number of values that are larger than τ\tau, where 𝔼[𝟙Xi>τ]=Pr(Xτ)\mathbb{E}[\mathbbm{1}_{X_{i}>\tau}]=\Pr(X\geq\tau). Similarly, let S<τS_{<-\tau} be the number of values that are smaller than τ-\tau. If S>τ<m/2S_{>\tau}<m/2 and S<τ<m/2S_{<-\tau}<m/2, then we see that median{X1,,Xm}[τ,τ]\mathrm{median}\left\{X_{1},\ldots,X_{m}\right\}\in[-\tau,\tau].

For a random variable XCauchy(0,1)X\sim\mathrm{Cauchy}(0,1), we know that Pr(Xx)=1/2+arctan(x)/π\Pr(X\leq x)=1/2+\arctan(x)/\pi. For 0<τ<10<\tau<1, concavity of arctan\arctan on [0,1][0,1] gives arctan(τ)τarctan(1)=πτ/4\arctan(\tau)\geq\tau\arctan(1)=\pi\tau/4, so Pr(Xτ)=1/2arctan(τ)/π1/2τ/4\Pr(X\geq\tau)=1/2-\arctan(\tau)/\pi\leq 1/2-\tau/4. Since 𝔼[S>τ]m(1/2τ/4)\mathbb{E}[S_{>\tau}]\leq m(1/2-\tau/4), additive Chernoff bounds give

Pr(S>τm2)exp(2m2τ216m)=exp(mτ28)\Pr\left(S_{>\tau}\geq\frac{m}{2}\right)\leq\exp\left(-\frac{2m^{2}\tau^{2}}{16m}\right)=\exp\left(-\frac{m\tau^{2}}{8}\right)

Similarly, we have Pr(S<τm/2)exp(mτ2/8)\Pr\left(S_{<-\tau}\geq m/2\right)\leq\exp\left(-m\tau^{2}/8\right). The claim follows from a union bound over the events S>τm/2S_{>\tau}\geq m/2 and S<τm/2S_{<-\tau}\geq m/2. ∎
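
The concentration of the sample median of Cauchy random variables is likewise easy to observe numerically. The sketch below (illustrative only; m, \tau, and the seed are arbitrary) estimates the probability that the median of m i.i.d. \mathrm{Cauchy}(0,1) samples falls outside [-\tau,\tau] and compares it with the bound 2\exp(-m\tau^{2}/8) of Lemma 2.8.

import numpy as np

rng = np.random.default_rng(3)
m, tau = 501, 0.2                 # odd m, so the median is a single sample
trials, failures = 5000, 0

for _ in range(trials):
    x = rng.standard_cauchy(m)    # i.i.d. Cauchy(0, 1) samples
    if abs(np.median(x)) > tau:
        failures += 1

# Empirical frequency vs. the bound 2*exp(-m * tau^2 / 8) from Lemma 2.8.
print(failures / trials, 2 * np.exp(-m * tau ** 2 / 8))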

Appendix C Median absolute deviation

In this section, we give pseudo-code for the well-known Median Absolute Deviation (MAD) estimator (see [Hub04] for example), which we use for component-wise variance recovery in the contaminated setting. The scale factor 1/Φ1(3/4)1.48261/\Phi^{-1}(3/4)\approx 1.4826 below makes the estimator consistent for the standard deviation of a Gaussian.

Algorithm 8 MAD: Variance recovery in the contaminated setting
1:Input: Contaminated samples {x1,x2,,xm}\{x_{1},x_{2},\dots,x_{m}\} from a univariate Gaussian
2:μ^median{x1,x2,,xm}\widehat{\mu}\leftarrow\mathrm{median}\left\{x_{1},x_{2},\dots,x_{m}\right\}.
3:σ^1.4826median{|x1μ^|,|x2μ^|,,|xmμ^|}\widehat{\sigma}\leftarrow 1.4826\cdot\mathrm{median}\left\{|x_{1}-\widehat{\mu}|,|x_{2}-\widehat{\mu}|,\dots,|x_{m}-\widehat{\mu}|\right\}.
4:return σ^\widehat{\sigma}
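
For concreteness, a direct Python translation of Algorithm 8 could look as follows (a minimal sketch assuming numpy and scipy; the contaminated sample in the usage example is made up for illustration).

import numpy as np
from scipy.stats import norm

def mad_sigma(x):
    """Algorithm 8 (MAD): robust estimate of the standard deviation of a Gaussian."""
    x = np.asarray(x, dtype=float)
    mu_hat = np.median(x)
    # 1 / Phi^{-1}(3/4) ~ 1.4826 makes the estimator consistent for sigma
    # when the uncontaminated samples are Gaussian.
    return np.median(np.abs(x - mu_hat)) / norm.ppf(0.75)

# Usage: 10% of the samples are contaminated by a far-away cluster.  The MAD
# estimate stays close to the true sigma = 2, while the naive sample standard
# deviation is badly inflated by the outliers.
rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(0.0, 2.0, size=900),
                          rng.normal(50.0, 1.0, size=100)])
print(mad_sigma(samples), samples.std())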