
Robust Estimation of Tree Structured Ising Models

Ashish Katiyar    Vatsal Shah    Constantine Caramanis
(The University of Texas at Austin)
Abstract

We consider the task of learning Ising models when the signs of different random variables are flipped independently with possibly unequal, unknown probabilities. In this paper, we focus on the problem of robust estimation of tree-structured Ising models. Without any additional assumptions or side information, this is an open problem. We first prove that the problem is unidentifiable; however, the unidentifiability is limited to a small equivalence class of trees formed by leaf nodes exchanging positions with their neighbors. Next, we propose an algorithm to solve the above problem with logarithmic sample complexity in the number of nodes and polynomial run-time complexity. Lastly, we empirically demonstrate that, as expected, existing algorithms are not inherently robust in the proposed setting whereas our algorithm correctly recovers the underlying equivalence class.

1 Introduction

Undirected graphical models or Markov Random Fields (MRFs) are a powerful tool for describing high dimensional distributions using an associated dependency graph $\mathbb{G}$, which encodes the conditional dependencies between random variables. In this paper, we focus on a special class of MRFs called the Ising model, first introduced in [15] to represent spin systems in quantum physics [7]. Recently, Ising models have also proven quite popular in biology [16], engineering [9, 31], computer vision [30], and in the optimization and OR communities, including in finance [37] and social networks [26]. The special class of tree-structured Ising models is beneficial for applications in statistical physics over non-amenable graphs. A detailed description and further references can be found in [25].

In this paper, we explore the problem of learning the underlying graph of tree-structured Ising models with independent, unknown, unequal error probabilities. Such scenarios are relatively common in networks where certain malfunctioning sensors can report flipped bits. The presence of noise breaks down the conditional independence structure of the noiseless setting, and the noisy samples may no longer follow an Ising model distribution.

In 2011, [8] highlighted the importance of robustness in Ising models. Recent works [13, 14, 21] have tried to address this problem. However, they assume side information in the form of the error probabilities, which is typically unavailable and difficult to estimate in practical settings. In closely related work on tree-structured Ising models, [27, 28] address this problem by building on the seminal work of [10], popularly known as the Chow-Liu algorithm, which proposes the maximum-weight spanning tree of the pairwise mutual informations as an estimate of the tree structure. [27] considers the simplified case where each node has an equal probability of error, and [28] assumes that the noise does not alter the order of the mutual informations. Both assumptions imply that, asymptotically, Chow-Liu converges to the correct tree. However, these assumptions do not arise naturally and are difficult to check from noisy data alone. To the best of our knowledge, there is no analysis of what happens beyond this limiting assumption of order preservation of the mutual informations.

In fact, Section 5.1 of [6] provides an example of the unidentifiability of the problem for a graph on 3 nodes and states that the problem is ill-defined. We reconsider this problem and show that for the special class of tree structured Ising models, although the problem is not identifiable, the unidentifiability is limited to an equivalence class of trees. Thus, more appropriately, one can cast the problem of learning in the presence of unknown, unequal noise as the problem of learning this equivalence class.

Key Contributions

1. We show that the problem of learning tree structured Ising models when the observations flip with independent, unknown, possibly unequal probability is unidentifiable (Theorem 2).

2. As depicted in Figure 1, the unidentifiability is restricted to the equivalence class of trees obtained by permuting within the leaf nodes and their neighbors (Theorem 1).

3. We provide an algorithm to recover this equivalence class from noisy samples. The sample complexity is logarithmic and the time complexity is polynomial in the number of nodes (Section 4).

4. We experimentally demonstrate that existing algorithms like Chow-Liu or Sparsitron for learning Ising models are not inherently robust against this noise model whereas our algorithm correctly recovers the equivalence class (Section 5).

2 Related Work

Efficient algorithms for structure learning of Ising models can be divided into three main categories based on their assumptions: i) special graph structures [1, 10, 11, 32, 5], ii) nature of interaction between variables such as correlation decay property (CDP) [3, 4, 6, 20, 29], iii) bounded degree/width [2, 12, 18, 24, 35, 33]. However, these algorithms assume access to uncorrupted samples.

In the last decade, there has been a lot of research on robust estimation of graphical models [17, 19, 22, 23, 34, 36], and [17] explicitly considered partial identifiability in the presence of noise for the Gaussian case. However, extending the above frameworks to robust structure learning of Ising models remains a challenge. [13, 14, 21] have tried to solve the problem of robust estimation of general Ising models under the assumption of access to the probability of error for each node. Recently, [27, 28] proposed algorithms to estimate the underlying graph structure of tree-structured Ising models in the presence of noise under the strong assumption that the probability of error does not alter the order of the mutual informations for the tree. Both these assumptions are restrictive and impractical. In this paper, we present the first algorithm that can robustly recover the underlying tree structured Ising model (up to an equivalence class) in the presence of corruption via unknown, unequal, independent noise.

3 Identifiability Result

Problem Setup:

Let $\mathcal{X}=\{X_{i}^{*}:i\in[n]\}$ be a set of random variables with support on $\{-1,1\}$ and $\mathbf{X}=[X_{1}^{*},X_{2}^{*}\dots X_{n}^{*}]$ be the vector of these random variables. Suppose the conditional independence structure of the variables of $\mathcal{X}$ is given by a tree $T^{*}$. This implies that the distribution of $\mathcal{X}$ can be represented by an Ising model. In our model, we have observations where each $X_{i}^{*}$ flips with probability $q_{i}$. We denote the probability of error by the vector $\mathbf{q}=[q_{1},q_{2},\dots q_{n}]$ and the noisy random variables by $\mathcal{X}^{e}=\{X_{i}^{e}:i\in[n]\}$. The error in $X_{i}^{*}$ disrupts the tree structured conditional independence, and the graphical model of $\mathcal{X}^{e}$ is a complete graph if $q_{i}>0$ $\forall i\in[n]$. In fact, $\mathcal{X}^{e}$ need not be an Ising model. Given samples of $\mathcal{X}^{e}$, we want to find the tree structure $T^{*}$. The ideas of the analysis for this problem are based on graph structure properties introduced in [17], where they were applied in the different context of Gaussian graphical models.

Equivalence Class for $T^{*}$

Given a tree $T^{*}$, we define its equivalence class, denoted by $\mathcal{T}_{T^{*}}$. Let the set of all nodes that are connected to one or more leaf nodes of $T^{*}$ be denoted by $\mathcal{A}$. Construct sets $\mathcal{L}_{a_{1}},\mathcal{L}_{a_{2}},\dots,\mathcal{L}_{a_{|\mathcal{A}|}}$ from the nodes $a_{i}\in\mathcal{A}$ such that set $\mathcal{L}_{a_{i}}$ contains all the leaf nodes connected to $a_{i}$ in $T^{*}$ along with $a_{i}$ itself. Sample one node from each newly constructed set $\mathcal{L}_{(.)}$ and collect them in a subset $\mathcal{S}^{m}$. There exist $q=|\mathcal{L}_{a_{1}}||\mathcal{L}_{a_{2}}|\dots|\mathcal{L}_{a_{|\mathcal{A}|}}|$ such unique subsets. Create a tree $T^{m}$ by setting every node in $\mathcal{S}^{m}$ as an internal node and all the remaining nodes in its corresponding set $\mathcal{L}_{(.)}$ as leaf nodes, in the same position where $\mathcal{L}_{(.)}$ was in $T^{*}$. Therefore, there exists a one-to-one correspondence between a tree $T^{m}$ and its corresponding set $\mathcal{S}^{m}$. We define the set of all such trees as $\mathcal{T}_{T^{*}}$:

$\mathcal{T}_{T^{*}}=\{T^{m}\mid m\in\{1,2,\ldots q\}\}.$

Figure 1: $\mathcal{T}_{T^{*}}$ for the above $T^{*}$ is obtained by permuting the nodes within the dotted regions. Noisy samples from this Ising model can be used to estimate the tree only up to the equivalence class $\mathcal{T}_{T^{*}}$.

Figure 1 gives an example of $\mathcal{T}_{T^{*}}$.

Model Assumptions
Assumption 1.

(Bounded Mean) The absolute value of the mean satisfies $|\mathbb{E}[X_{i}]|\leq\mu_{max}<1$ $\forall i\in[n]$.

Assumption 2.

(Bounded Correlation) The correlation $\rho_{i,j}$ of any two nodes $X_{i}$ and $X_{j}$ connected by an edge satisfies $\rho_{min}\leq|\rho_{i,j}|\leq\rho_{max}$, where $0<\rho_{min}\leq\rho_{max}<1$.

Assumption 3.

(Bounded error probability) The error probability satisfies $0\leq q_{i}\leq q_{max}<0.5$ $\forall i\in[n]$.

These assumptions arise naturally. Assumption 1 ensures that no variable approaches a constant and hence gets disconnected from the tree. The lower bound in Assumption 2 likewise ensures that every node is connected. The upper bound in Assumption 2 ensures that no two nodes are duplicates of each other. Assumption 3 ensures that a noisy node does not become independent of every other node due to the error.

Limited unidentifiability of the problem

In Theorem 1, we prove that it is possible to recover $\mathcal{T}_{T^{*}}$ from the samples of $\mathcal{X}^{e}$. Further, in Theorem 2 we prove that, given the distribution of $\mathcal{X}^{e}$, for each tree in $\mathcal{T}_{T^{*}}$ there exists an Ising model such that, for some noise vector, its noisy distribution is the same as that of $\mathcal{X}^{e}$.

Theorem 1.

Suppose $\mathcal{X}^{\prime}$ and $\mathcal{X}$ are sets of binary valued random variables satisfying Assumption 1 whose conditional independence structures are given by trees $T^{\prime}$ and $T^{*}$ respectively, satisfying Assumption 2. Assume that each node in both these models is allowed to be flipped independently with probability satisfying Assumption 3. Let $\mathcal{E}^{*}$ and $\mathcal{E}^{\prime}$ represent the noisy distributions of $\mathcal{X}$ and $\mathcal{X}^{\prime}$ respectively. If $\mathcal{E}^{\prime}=\mathcal{E}^{*}$, then $T^{\prime}\in\mathcal{T}_{T^{*}}$.

Proof.

The proof of this theorem relies on this key observation: the probability distribution of the noisy samples completely defines the categorization of any set of 4 nodes as a star/non-star shape. Any set of 4 nodes forms a non-star if there exists at least one edge which, when removed, splits the tree into two subtrees such that exactly 2 of the 4 nodes lie in one subtree and the other 2 nodes lie in the other subtree. The nodes in the same subtree form a pair. If the set is not a non-star, it is categorized as a star. For example, in Figure 1, $\{0,1,7,12\}$ form a non-star and $\{0,7,12,13\}$ form a star.

Once we prove this key observation, the rest of the proof follows easily and we refer the reader to parts (ii) and (iii) of the proof of Theorem 2 from [17]. We denote the correlation between two nodes $X_{i}$ and $X_{j}$ in the non-noisy setting by $\rho_{i,j}$ and in the noisy setting by $\rho^{\prime}_{i,j}$. Similarly, the covariance is denoted by $\Sigma_{i,j}$ and $\Sigma^{\prime}_{i,j}$. We utilize the correlation decay property of tree structured Ising models, which is stated in Lemma 1.

Lemma 1.

(Correlation Decay) Any 2 nodes $X_{i_{1}}$ and $X_{i_{k}}$ have the conditional independence relation specified by a tree structured Ising model such that the path between them is $(X_{i_{1}}\rightarrow X_{i_{2}}\rightarrow X_{i_{3}}\dots\rightarrow X_{i_{k}})$ if and only if their correlation is given by:

$\rho_{i_{1}i_{k}}=\prod_{l=2}^{k}\rho_{i_{l-1},i_{l}}.$ (1)

The proof of this lemma is provided in Appendix A. We also prove that $\mathbb{E}[X_{i}^{e}]=(1-2q_{i})\mathbb{E}[X_{i}]$ and $\Sigma^{\prime}_{i,j}=(1-2q_{i})(1-2q_{j})\Sigma_{i,j}$ in Appendix B.
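As a quick numerical sanity check of these two relations (our own illustration, not part of the paper), the snippet below flips the signs of a toy pair of $\pm 1$ variables with probabilities $q_{1},q_{2}$ and compares the empirical noisy moments against the predicted scaling. The relations hold for any $\pm 1$ variables, so no Ising sampler is needed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated +/-1 pair (illustrative only; any +/-1 variables work here).
m = 200_000
x1 = rng.choice([-1, 1], size=m)
x2 = np.where(rng.random(m) < 0.8, x1, -x1)    # x2 agrees with x1 80% of the time
X = np.stack([x1, x2], axis=1)

q = np.array([0.10, 0.25])                     # per-node flip probabilities
Xe = np.where(rng.random(X.shape) < q, -X, X)  # independent sign flips

cov, cov_e = np.cov(X.T), np.cov(Xe.T)
print(cov_e[0, 1], (1 - 2 * q[0]) * (1 - 2 * q[1]) * cov[0, 1])  # approximately equal
print(Xe.mean(axis=0), (1 - 2 * q) * X.mean(axis=0))             # approximately equal
```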

Categorizing a set of 4 nodes as star/non-star

We first look at a graphical model on 3 nodes $X_{1},X_{2},X_{3}$ whose conditional independence is given by a chain with $X_{2}\perp X_{3}|X_{1}$. By Lemma 1, we have $\Sigma_{2,3}\Sigma_{1,1}=\Sigma_{1,2}\Sigma_{1,3}$.

Suppose the signs of $X_{1},X_{2},X_{3}$ flip independently with probabilities $q_{1},q_{2},q_{3}$ respectively. Substituting the values of $\Sigma_{2,3},\Sigma_{1,1},\Sigma_{1,2}$ and $\Sigma_{1,3}$ in terms of their noisy counterparts gives us:

$(1-2q_{1})^{2}=1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}.$ (2)

If we had prior knowledge of the underlying conditional independence relation, this quadratic equation, which depends only on quantities measurable from noisy data, could be solved to estimate the probability of error of $X_{1}$.

We prove in Appendix C that Equation (2) gives a valid solution for any configuration of 3 nodes in a tree structured Ising model. Therefore, in the absence of the knowledge that $X_{2}\perp X_{3}|X_{1}$, we can estimate a probability of error for each $X_{i}$ which forces the underlying graph structure to make the other 2 nodes independent conditioned on $X_{i}$. Thus, irrespective of the true underlying conditional independence relation, we can always find a probability of error for each node which makes any other pair of nodes conditionally independent. We use this concept to classify a tree on 4 nodes as star or non-star shaped.

We follow the notation where $\hat{q}_{i}^{j,k}$ denotes the estimated probability of error of $X_{i}$ which enforces $X_{j}\perp X_{k}|X_{i}$.
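In code, $\hat{q}_{i}^{j,k}$ can be computed directly from a (population or sample) covariance matrix of the noisy variables by solving Equation (2) and keeping the root below 0.5. The sketch below is our own illustration; the function name and the clipping of the right-hand side are our choices, not the paper's.

```python
import numpy as np

def q_hat(cov_noisy, i, j, k):
    """Estimated flip probability of node i that would enforce X_j indep. of X_k given X_i,
    obtained by solving Equation (2) for (1 - 2*q_i)^2."""
    rhs = 1.0 - cov_noisy[i, i] + cov_noisy[i, j] * cov_noisy[i, k] / cov_noisy[j, k]
    rhs = np.clip(rhs, 0.0, 1.0)        # guard against finite-sample fluctuations
    return 0.5 * (1.0 - np.sqrt(rhs))   # root of the quadratic with q_i < 0.5
```

Comparing $\hat{q}_{1}^{2,3}$, $\hat{q}_{1}^{2,4}$ and $\hat{q}_{1}^{3,4}$ computed this way is exactly the consistency test used in the conditions below.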

Condition for star/non-star shape:

Any set of 4 nodes $\{X_{1},X_{2},X_{3},X_{4}\}$ is categorized as a non-star with $(X_{1},X_{2})$ forming one pair and $(X_{3},X_{4})$ forming another pair if and only if:

$\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}\neq\hat{q}_{1}^{3,4},\quad\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,4}\neq\hat{q}_{2}^{3,4},$
$\hat{q}_{3}^{2,4}=\hat{q}_{3}^{1,4}\neq\hat{q}_{3}^{1,2},\quad\hat{q}_{4}^{2,3}=\hat{q}_{4}^{1,3}\neq\hat{q}_{4}^{1,2}.$

From Equation (2), this is equivalent to the condition that $\frac{\rho_{1,3}^{\prime}}{\rho_{1,4}^{\prime}}=\frac{\rho_{2,3}^{\prime}}{\rho_{2,4}^{\prime}},\ \frac{\rho_{1,2}^{\prime}}{\rho_{1,4}^{\prime}}\neq\frac{\rho_{3,2}^{\prime}}{\rho_{3,4}^{\prime}}$.

Any set of 4 nodes $\{X_{1},X_{2},X_{3},X_{4}\}$ is categorized as a star if and only if:

$\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}=\hat{q}_{1}^{3,4},\quad\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,4}=\hat{q}_{2}^{3,4},$
$\hat{q}_{3}^{2,4}=\hat{q}_{3}^{1,4}=\hat{q}_{3}^{1,2},\quad\hat{q}_{4}^{2,3}=\hat{q}_{4}^{1,3}=\hat{q}_{4}^{1,2}.$

This is equivalent to the condition that $\frac{\rho_{1,3}^{\prime}}{\rho_{1,4}^{\prime}}=\frac{\rho_{2,3}^{\prime}}{\rho_{2,4}^{\prime}}=\frac{\rho_{1,2}^{\prime}}{\rho_{1,4}^{\prime}}$.

In order to see how these conditions correspond to a star/non-star shape, let us consider a chain on 4 nodes as shown in Figure 2. Let each $X_{i}$ be flipped with probability $q_{i}$. With access only to the noisy samples, we estimate the probability of error for each node in order to find the underlying tree. The key idea is that when we estimate the probability of error for a given node, it should be consistent across different conditional independence relations. For instance, in the present case, the error estimates $\hat{q}_{2}^{1,3}$ and $\hat{q}_{2}^{1,4}$ of $X_{2}$ satisfy $\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,4}=q_{2}$. We show that $\hat{q}_{2}^{3,4}\neq\hat{q}_{2}^{1,3}$ (Lemma 2(b)). We also prove that $\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}\neq\hat{q}_{1}^{3,4}$ (Lemma 2). These imply that $X_{3}\not\perp X_{4}|X_{2}$ and $X_{3}\not\perp X_{4}|X_{1}$. By symmetry, we have $X_{1}\not\perp X_{2}|X_{3}$ and $X_{1}\not\perp X_{2}|X_{4}$. These statements imply that $X_{1},X_{2},X_{3}$ and $X_{4}$ form a chain with $(X_{1},X_{2})$ on one side of the chain and $(X_{3},X_{4})$ on the other side.

Next, we consider the case when the 4 nodes form a star structured graphical model as in Figure 3. Under the same noisy observation setting, we prove that $\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}=\hat{q}_{1}^{3,4}$, $\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,4}=\hat{q}_{2}^{3,4}$, $\hat{q}_{3}^{1,2}=\hat{q}_{3}^{1,4}=\hat{q}_{3}^{2,4}$ and $\hat{q}_{4}^{1,3}=\hat{q}_{4}^{1,2}=\hat{q}_{4}^{3,2}$ (Lemma 3). Thus, we can conclude that the underlying graphical model is star structured.

Figure 2: A chain structure.
Figure 3: A Star structure.
Lemma 2.

Let the graphical model on $X_{1},X_{2},X_{3}$ and $X_{4}$ form a chain as shown in Figure 2. Suppose the bits of each $X_{i}$ are flipped with probability $q_{i}<0.5$. Then the following holds:

(a) $\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}$,  (b) $\hat{q}_{2}^{1,3}\neq\hat{q}_{2}^{3,4}$, $\hat{q}_{1}^{2,3}\neq\hat{q}_{1}^{3,4}$

Lemma 3.

Let the graphical model on $X_{1},X_{2},X_{3}$ and $X_{4}$ form a star as shown in Figure 3. Suppose the bits of each $X_{i}$ are flipped with probability $q_{i}<0.5$. Then the following holds:

$\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}=\hat{q}_{1}^{4,3}$

The proof of these lemmas and the details of extending these results to generic trees require basic algebraic manipulations and can be found in Appendix D. ∎

Theorem 2.

Let $\mathcal{E}^{e}$ denote the probability distribution of $\mathcal{X}^{e}$ when the error probability of all the neighbors of leaf nodes is non-zero. For any $T^{\prime}\in\mathcal{T}_{T^{*}}$, there exists a set of random variables $\mathcal{X}^{\prime}$ with conditional independence given by $T^{\prime}$ and a corresponding error probability vector $\mathbf{\hat{q}}$ such that $\mathcal{E}^{\prime}=\mathcal{E}^{e}$, where $\mathcal{E}^{\prime}$ denotes the noisy distribution of $\mathcal{X}^{\prime}$.

We prove this theorem by explicit calculation of $\mathbf{\hat{q}}$. We utilize Lemma 1 to enforce the conditional independence relations in any tree $T^{\prime}\in\mathcal{T}_{T^{*}}$. The proof is included in Appendix E.

Interestingly, these unidentifiability results for noisy tree structured Ising models match those for noisy tree structured Gaussian graphical models established in [17], in spite of the two being graphical models on different classes of random variables. We remark that the algorithm proposed in [17] to learn the equivalence class operates in the infinite sample regime and does not come with sample complexity results. In the next section, we develop an efficient algorithm to learn the equivalence class using a number of samples scaling logarithmically in the number of nodes.

4 Algorithm

In this section, we provide a detailed description of our algorithm (Algorithm 1) that recovers the set $\mathcal{T}_{T^{*}}$ from noisy samples. We emphasize that the edges we learn in this algorithm are from any one tree from the equivalence class $\mathcal{T}_{T^{*}}$ and not necessarily from $T^{*}$ (as we have demonstrated, identifying $T^{*}$ itself is not possible). We illustrate our algorithm (as well as each subroutine) using a toy example (Figure 4).

Algorithm 1 Algorithm to learn the equivalence class $\mathcal{T}_{T^{*}}$ of the tree
1: procedure FindTree
2:    for $X_{i}$ in $\mathcal{X}$ do
3:        $EC\leftarrow$ FindEC($X_{i},\mathcal{X}\setminus X_{i}$)
4:        if $|EC|>1$ then
5:            break
6:    $\mathcal{X}_{last}\leftarrow EC$
7:    if $|\mathcal{X}\setminus\mathcal{X}_{last}|>0$ then
8:        Recurse($X_{i},EC[0],\mathcal{X}_{last}$)

Algorithm 2 Algorithm to recursively find all the edges within a subtree

1: procedure Recurse($X_{i},X_{leaf},\mathcal{X}_{last}$)
2:    $subtrees\leftarrow$ SplitTree($X_{i},X_{leaf},\mathcal{X}_{last}$)
3:    for $subtree$ in $subtrees$ do
4:        $EC\leftarrow$ FindEC($X_{i},subtree$)
5:        $\mathcal{X}_{last}^{\prime}=\mathcal{X}_{last}\cup subtrees\setminus subtree\cup EC$
6:        Recurse($EC[0],X_{i},\mathcal{X}_{last}^{\prime}$)

We first introduce three notions which we use extensively in the algorithm: (i) equivalence clusters, (ii) proximal sets $\mathcal{P}_{i}^{1},\mathcal{P}_{i}^{2}$, (iii) categorizing a set of 4 nodes as star/non-star using finite samples.

Equivalence Cluster:

As illustrated in Figure 4(a), an equivalence cluster is defined as a set containing an internal node and all the leaf nodes connected to it. We denote the equivalence cluster containing a node $X_{i}$ by $EC(X_{i})$.

Defining proximal sets:

Due to correlation decay, the correlation between an arbitrary pair of nodes can decay exponentially in the number of nodes, which would lead to exponential sample complexity. To avoid this, we only consider nodes with correlations above a constant threshold when making the star/non-star categorization. We define two proximal sets for each node $X_{i}$, $\mathcal{P}_{i}^{1}$ and $\mathcal{P}_{i}^{2}$, where $\mathcal{P}_{i}^{1}$ is given by:

$\mathcal{P}_{i}^{1}=\left\{X_{j}:\Sigma^{\prime}_{i,j}\geq t_{1}\right\}.$

In the above expression, we set $t_{1}=(1-2q_{max})^{2}(1-\mu_{max}^{2})\rho_{min}^{4}$, thereby guaranteeing that the set contains at least all the nodes within a radius of 4 of that node (on the true unknown graph). This is because the minimum standard deviation of any node is $\sqrt{1-\mu_{max}^{2}}$, the minimum correlation of any 2 nodes at distance 4 in the absence of noise is $\rho_{min}^{4}$, and the noise can decrease the covariance by a factor of at most $(1-2q_{max})^{2}$.

Figure 4: Illustration of the algorithm.

It is possible that $X_{j}\in\mathcal{P}_{i}^{1}$ even if $X_{j}$ and $X_{i}$ are not neighbors in the true graph. We define the second set $\mathcal{P}_{i}^{2}$ to guarantee that, in this case, at least the first node on the path from $X_{i}$ to $X_{j}$ is in $\mathcal{P}_{j}^{2}$. Using the correlation decay property, the noisy covariance expression, and the bounds $1-\mu_{max}^{2}\leq\Sigma_{ii}\leq 1$, $1-2q_{max}\leq 1-2q_{i}\leq 1$ and $\rho_{i,j}\leq\rho_{max}$, this set is given by:

$\mathcal{P}_{i}^{2}=\left\{X_{j}:\Sigma^{\prime}_{i,j}\geq t_{2}\right\},$

where $t_{2}=\min\left\{t_{1},\frac{t_{1}(1-2q_{max})\sqrt{1-\mu_{max}^{2}}}{\rho_{max}}\right\}$. Using the sample covariance values, we construct sets $\tilde{\mathcal{P}}_{i}^{1}$ and $\tilde{\mathcal{P}}_{i}^{2}$ to include at least all the nodes in $\mathcal{P}_{i}^{1}$ and $\mathcal{P}_{i}^{2}$ with high probability. We denote the sample covariance between nodes $X^{e}_{i}$ and $X^{e}_{j}$ obtained from the noisy observations by $\tilde{\Sigma}_{i,j}$. The sets $\tilde{\mathcal{P}}_{i}^{1}$ and $\tilde{\mathcal{P}}_{i}^{2}$ are defined as:

$\tilde{\mathcal{P}}_{i}^{1}=\left\{X_{j}^{e}:\tilde{\Sigma}_{i,j}\geq 0.5t_{1}\right\},\qquad\tilde{\mathcal{P}}_{i}^{2}=\left\{X_{j}^{e}:\tilde{\Sigma}_{i,j}\geq 0.5t_{2}\right\}.$
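The following sketch (our own, with hypothetical function and variable names) builds these proximal sets from the sample covariance matrix of the noisy observations, using the thresholds $t_{1}$ and $t_{2}$ exactly as defined above.

```python
import numpy as np

def proximal_sets(cov_sample, q_max, mu_max, rho_min, rho_max):
    """Construct the proximal sets P1_i and P2_i from the noisy sample
    covariance matrix, thresholding at 0.5*t_1 and 0.5*t_2 respectively."""
    t1 = (1 - 2 * q_max) ** 2 * (1 - mu_max ** 2) * rho_min ** 4
    t2 = min(t1, t1 * (1 - 2 * q_max) * np.sqrt(1 - mu_max ** 2) / rho_max)
    n = cov_sample.shape[0]
    P1 = [{j for j in range(n) if j != i and cov_sample[i, j] >= 0.5 * t1} for i in range(n)]
    P2 = [{j for j in range(n) if j != i and cov_sample[i, j] >= 0.5 * t2} for i in range(n)]
    return P1, P2
```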
Classifying into star/non-star shape:

It is easy to see (refer to Equation (16) in Appendix D.4) that when any set of 4 nodes $\{X_{1},X_{2},X_{3},X_{4}\}$ forms a non-star such that $(X_{1},X_{2})$ form a pair, we have:

$\dfrac{\rho_{1,3}^{\prime}\rho_{2,4}^{\prime}}{\rho_{1,2}^{\prime}\rho_{3,4}^{\prime}}\leq\rho_{max}^{2},\qquad\dfrac{\rho_{1,3}^{\prime}\rho_{2,4}^{\prime}}{\rho_{1,4}^{\prime}\rho_{2,3}^{\prime}}=1.$

Therefore, using the finite sample estimates of the noisy correlations, the categorization of 4 nodes $\{X_{1},X_{2},X_{3},X_{4}\}$ into star/non-star such that $(X_{1},X_{2})$ form a pair is given in Table 1.
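A minimal sketch of this test follows, assuming the noisy sample correlation matrix is stored as a NumPy array. The function name and the loop over the three possible pairings are our own choices, while the two ratios and the threshold $(1+\rho_{max}^{2})/2$ follow Table 1.

```python
def categorize(rho, nodes, rho_max):
    """Classify the 4 nodes in `nodes` as a star or a non-star (with its pairing),
    using the noisy sample correlation matrix `rho` and the Table 1 thresholds."""
    t3 = (1 + rho_max ** 2) / 2
    i, j, k, l = nodes
    # candidate pairings: (i,j)|(k,l), (i,k)|(j,l), (i,l)|(j,k)
    for a, b, c, d in [(i, j, k, l), (i, k, j, l), (i, l, j, k)]:
        r_same = rho[a, c] * rho[b, d] / (rho[a, d] * rho[b, c])
        r_cross = rho[a, c] * rho[b, d] / (rho[a, b] * rho[c, d])
        if r_same > t3 and r_cross < t3:
            return "non-star", {a, b}, {c, d}
    return "star", None, None
```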

We first describe the key steps of the algorithm using the example from Figure 4.

Category    $\dfrac{\tilde{\rho}_{1,3}\tilde{\rho}_{2,4}}{\tilde{\rho}_{1,4}\tilde{\rho}_{2,3}}$    $\dfrac{\tilde{\rho}_{1,3}\tilde{\rho}_{2,4}}{\tilde{\rho}_{1,2}\tilde{\rho}_{3,4}}$
Non-star    $>\dfrac{1+\rho_{max}^{2}}{2}$    $<\dfrac{1+\rho_{max}^{2}}{2}$
Star        $>\dfrac{1+\rho_{max}^{2}}{2}$    $>\dfrac{1+\rho_{max}^{2}}{2}$
Table 1: Star/Non-star categorization

Initialization:

The algorithm starts by learning the edges of some leaf node (edge (14,15) in Figure 4). It searches for a leaf node by looking for an equivalence cluster with more than one node. This is achieved using the FindEC subroutine which, given a node, finds all the nodes in its equivalence cluster.

Recursion:

After finding a pair of a leaf and an internal node, the algorithm calls Recurse to find all the remaining edges recursively. This is done in 2 steps: (i) split the nodes in the proximal set of the internal node into different subtrees connected to the internal node using SplitTree (subtree 1 = nodes 0-6, subtree 2 = nodes 7-13), (ii) find the nodes in each subtree which have an edge with the internal node. This uses the simple observation that when we consider the internal node along with one such subtree, the internal node acts like a leaf node for the tree on just these nodes (node 14 is a leaf node for the tree on nodes 1, 2, 3, 4, 5, 6, 14). Hence, FindEC can be used on this subset of nodes. Now we have a pair of a leaf node and an internal node on a subset of nodes, and we call Recurse just on this subset of nodes.

The runtime of the algorithm is $\mathcal{O}(n^{3})$.

Description of subroutines

SplitTree

This function takes in a pair of leaf and internal nodes, $X_{leaf}$ and $X_{i}$ respectively, from a subtree. To ensure that we only consider nodes in the subtree, it also takes in a set $\mathcal{X}_{last}$ which contains the nodes assigned to other subtrees in the previous recursive call. It splits the nodes from the common proximal sets $\tilde{\mathcal{P}}^{1}_{leaf}$ and $\tilde{\mathcal{P}}_{i}^{1}$ in the subtree into smaller subtrees that have an edge with $X_{i}$. The property used is that 2 nodes $X_{k_{1}}$ and $X_{k_{2}}$ lie in the same smaller subtree if and only if $\{X_{i},X_{leaf},X_{k_{1}},X_{k_{2}}\}$ form a non-star such that the nodes $(X_{k_{1}},X_{k_{2}})$ form a pair. To make sure that no node outside the subtree is considered, we do not split any node within $\mathcal{X}_{last}$ or any node which pairs with a node from $\mathcal{X}_{last}$ when considered along with $X_{leaf}$ and $X_{i}$. This is an $\mathcal{O}(n^{2})$ operation.

FindEC

This function takes in a node $X_{i}$ and a set of nodes $\mathcal{X}_{sub}$ as inputs and returns all the nodes of $EC(X_{i})$ from $\mathcal{X}_{sub}\cup X_{i}$. Two nodes are in the same equivalence cluster if every categorization of a set of 4 nodes as star/non-star either results in a star or in a non-star where these two nodes form a pair. Essentially, for all the nodes in the proximal set of $X_{i}$, we verify whether they satisfy this condition. We also add edges by arbitrarily choosing any node from $EC(X_{i})$ as the center node, as we are only interested in recovering $\mathcal{T}_{T^{*}}$. This is an $\mathcal{O}(n^{2})$ operation.
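A schematic sketch of this subroutine is shown below, reusing the `categorize` and `proximal_sets` helpers from the earlier sketches. For simplicity it uses the second proximal sets on both sides (the discussion below explains why $\tilde{\mathcal{P}}^{2}$ is needed); the names and the bookkeeping are our own simplifications, not the full pseudo-code of Appendix F.

```python
def find_ec(i, candidates, rho, P2, rho_max):
    """Return the nodes of `candidates` judged to lie in the equivalence cluster of
    node i: node j stays in the cluster unless some 4-node test yields a non-star
    in which (i, j) do NOT form a pair."""
    ec = [i]
    for j in candidates:
        if j == i or j not in P2[i]:
            continue
        others = sorted((P2[i] & P2[j]) - {i, j})
        separated = False
        for a in range(len(others)):
            for b in range(a + 1, len(others)):
                shape, pair1, pair2 = categorize(rho, [i, j, others[a], others[b]], rho_max)
                if shape == "non-star" and {i, j} not in (pair1, pair2):
                    separated = True
        if not separated:
            ec.append(j)
    return ec
```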

Recurse

This function takes the same arguments as SplitTree and calls SplitTree with those arguments to obtain the different smaller subtrees. It then uses FindEC to find the edges between $X_{i}$ and these smaller subtrees. When a smaller subtree is considered along with $X_{i}$, $X_{i}$ acts as its leaf node, and the node that $X_{i}$ has an edge with acts as an internal node. For each of the smaller subtrees, it adds to $\mathcal{X}_{last}$ the nodes of the remaining smaller subtrees, along with the equivalence cluster from the current subtree which had an edge with $X_{i}$, as that has already been learnt. It then calls the Recurse function on these smaller subtrees.

Need for 2 proximal sets $\tilde{\mathcal{P}}_{i}^{1}$ and $\tilde{\mathcal{P}}_{i}^{2}$:

Consider the implementation of FindEC. For each node $X_{j}\in\tilde{\mathcal{P}}_{i}^{1}$, we check if it lies in $EC(X_{i})$. For all pairs of nodes $X_{k_{1}},X_{k_{2}}\in\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}^{1}_{j}$, we could check whether the condition for $X_{j}\in EC(X_{i})$ is satisfied. If $X_{j}\not\in EC(X_{i})$ but no node on the path from $X_{i}$ to $X_{j}$ is in $\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}^{1}_{j}$, we would incorrectly conclude that $X_{j}\in EC(X_{i})$. To avoid this corner case, instead of considering $X_{k_{1}},X_{k_{2}}\in\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}^{1}_{j}$, we consider $X_{k_{1}},X_{k_{2}}\in\tilde{\mathcal{P}}_{i}^{2}\cap\tilde{\mathcal{P}}^{2}_{j}$. By the definition of $\tilde{\mathcal{P}}_{i}^{2}$, $\tilde{\mathcal{P}}_{i}^{2}\cap\tilde{\mathcal{P}}^{2}_{j}$ contains a node from the path from $X_{i}$ to $X_{j}$. This results in $\{X_{i},X_{j},X_{k_{1}},X_{k_{2}}\}$ forming a non-star where $(X_{i},X_{j})$ do not form a pair, and hence we correctly conclude that $X_{j}\not\in EC(X_{i})$.

$\tilde{\mathcal{P}}_{i}^{2}$ plays a crucial role in SplitTree too. Consider a recursive call to Recurse such that, in the previous recursion, nodes in $\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}^{1}_{leaf}$ of that step were added to $\mathcal{X}_{last}$. However, in the current step there might be nodes in $\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}^{1}_{leaf}$ from other subtrees which are not in $\mathcal{X}_{last}$. To ensure that any such node $X_{k}$ is not considered while creating smaller subtrees, we consider the nodes in $\tilde{\mathcal{P}}_{k}^{2}\cap\mathcal{X}_{last}$ and ensure that they do not pair with $X_{k}$ when considered with the present recursive call's $X_{i}$ and $X_{leaf}$.

Setting the radius as 4:

The radius is set as 4 to make sure that each node is considered with at least its nearest neighboring nodes when classified as star/non-star. This would ideally require a radius of 3 but to account for the unidentifiability between a leaf node and its neighbor, we consider a radius of 4.

The detailed description of the complete algorithm including these subroutines- their pseudo-code, proof of correctness and time complexity analysis is presented in Appendix F.

Next, we present the sample complexity result of our algorithm.

Theorem 3.

Consider an Ising model on $n$ nodes whose graph structure is the tree $T^{*}$ and which satisfies Assumptions 1 and 2. We get noisy samples from this Ising model where the samples from each node are flipped independently with unknown, possibly unequal probability satisfying Assumption 3. Algorithm 1 correctly recovers $\mathcal{T}_{T^{*}}$ with probability at least $1-\tau$ if the number of samples $m$ is lower bounded as follows:

$m\geq\frac{128}{\delta^{2}}\log\left(\frac{6n^{2}}{\tau}\right)$

where $\delta=\frac{t_{2}^{3}(1-t_{3})}{128}$, $t_{2}=\min\left\{t_{1},\frac{t_{1}(1-2q_{max})\sqrt{1-\mu_{max}^{2}}}{\rho_{max}}\right\}$, $t_{1}=(1-2q_{max})^{2}(1-\mu_{max}^{2})\rho_{min}^{4}$, $t_{3}=\frac{1+\rho_{max}^{2}}{2}$.
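For concreteness, the bound can be evaluated numerically. The short helper below is our own transcription of the stated constants; it is deliberately conservative, since the constants in the theorem are worst-case, and the example arguments are hypothetical.

```python
import numpy as np

def sample_bound(n, tau, q_max, mu_max, rho_min, rho_max):
    """Number of samples sufficient for Theorem 3 (direct transcription)."""
    t1 = (1 - 2 * q_max) ** 2 * (1 - mu_max ** 2) * rho_min ** 4
    t2 = min(t1, t1 * (1 - 2 * q_max) * np.sqrt(1 - mu_max ** 2) / rho_max)
    t3 = (1 + rho_max ** 2) / 2
    delta = t2 ** 3 * (1 - t3) / 128
    return 128.0 / delta ** 2 * np.log(6 * n ** 2 / tau)

# Example with hypothetical constants: sample_bound(15, 0.05, 0.15, 0.1, 0.3, 0.8)
```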

The proof is included in Appendix G.

Running the algorithm in absence of the knowledge of $\rho_{max}$ and $\rho_{min}$:

When we only have access to noisy samples, $q_{max}$ and $\mu_{max}$, we can set $\rho_{min}=\epsilon$ and $\rho_{max}=1-\epsilon$ and solve for $\epsilon$ given an error budget $\tau$. This is a conservative estimate of $\epsilon$, and in practice the algorithm works for lower values of $\epsilon$.

5 Experiments

Experimental Setup

The probability distribution of Ising models is represented by

$\mathbb{P}(\boldsymbol{X}=\boldsymbol{x})\propto\exp(\boldsymbol{x}^{\top}\boldsymbol{W}\boldsymbol{x}/2+\boldsymbol{b}^{\top}\boldsymbol{x}),$

where $\boldsymbol{W}$ is the symmetric weight matrix with 0 on the diagonal and $\boldsymbol{b}$ is the bias vector. The support of $\boldsymbol{W}$ defines the Ising model structure. Therefore, a chain structured Ising model with nodes labeled consecutively satisfies $W_{i,j}\neq 0$ if and only if $|i-j|=1$. A star structured Ising model with node 1 as the internal node satisfies $W_{i,j}\neq 0$ if and only if $i=1,j\neq 1$ or $j=1,i\neq 1$.

Figure 5: $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.15$. Comparing the performance of Chow-Liu and our algorithm over 50 runs for increasing number of samples.

We sample each non-zero element of $\boldsymbol{W}$ uniformly from $[W_{min},W_{max}]$. The probability of error for each node is sampled uniformly from $[0,q_{max}]$.
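A minimal sketch of this setup for a small chain is given below, assuming exact sampling by enumerating all $2^{n}$ states. This is our own illustrative choice (practical only for small $n$, and not necessarily how the experiments were run); the function names are ours.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def chain_W(n, w_min, w_max):
    """Chain-structured weight matrix: W[i, j] != 0 iff |i - j| == 1."""
    W = np.zeros((n, n))
    for i in range(n - 1):
        W[i, i + 1] = W[i + 1, i] = rng.uniform(w_min, w_max)
    return W

def sample_ising(W, b, m):
    """Exact sampling by enumerating all 2^n states (fine for small n)."""
    n = W.shape[0]
    states = np.array(list(product([-1, 1], repeat=n)), dtype=float)
    logp = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ b
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return states[rng.choice(len(states), size=m, p=p)]

n, m = 8, 50_000
W, b = chain_W(n, 0.7, 1.2), np.zeros(n)
X = sample_ising(W, b, m)
q = rng.uniform(0, 0.15, size=n)                  # per-node error probabilities
Xe = np.where(rng.random(X.shape) < q, -X, X)     # noisy observations
```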

A comparison with the Chow-Liu algorithm is presented in Subsection 5.1. In Subsection 5.2, we demonstrate the performance of our algorithm as compared to the Sparsitron algorithm. Subsections 5.3, 5.4 and 5.5 report the impact of the maximum error probability $q_{max}$, the maximum edge weight $W_{max}$ and the minimum edge weight $W_{min}$ respectively on the algorithm's performance. All of these experiments are done for chain structured Ising models with $\boldsymbol{b}=0$.

We next demonstrate the performance of our algorithm on star-structured Ising models in Subsection 5.6 as the number of nodes increases. To begin with, we stick to $\boldsymbol{b}=0$.

Finally, we study the impact of $\boldsymbol{b}$ (the non-zero mean) on the algorithm's performance for both star structured and chain structured Ising models in Subsection 5.7.

Figure 6: $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.15$. Comparing the performance of Sparsitron and our algorithm over 50 runs for increasing number of samples.

5.1 Our Algorithm vs Chow-Liu

Figure 5 depicts that, as the number of samples increases, the fraction of correct recoveries using our algorithm approaches 1. Chow-Liu incorrectly converges to a tree not in the equivalence class $\mathcal{T}_{T^{*}}$ due to the unequal noise across the nodes.

5.2 Our Algorithm vs Sparsitron

Figure 7: $W_{min}=0.7$, $W_{max}=1.2$. The performance of our algorithm on a chain of 15 nodes over 50 runs for increasing number of samples for different values of $q_{max}$.

We set $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.15$, $\boldsymbol{b}=0$. In Figure 6, we compare the performance of our algorithm with the Sparsitron algorithm for a chain-structured Ising model on 15 nodes. To evaluate the Sparsitron algorithm, we take the output weight matrix and find its maximum weight spanning tree. We call the algorithm a success if this tree is from the equivalence class $\mathcal{T}_{T^{*}}$. We can see that Sparsitron has a low success rate.

5.3 Effect of the Maximum Error Probability

We set $W_{min}=0.7$, $W_{max}=1.2$, $\boldsymbol{b}=0$ and vary $q_{max}$ over the values $\{0.1,0.15,0.20\}$. We evaluate on a chain structured Ising model on 15 nodes. Figure 7 illustrates the effect of $q_{max}$ on the performance of our algorithm. As expected, an increase in the probability of error makes it harder to recover the equivalence class.

5.4 Effect of the Maximum Weight

Next, we look at the effect of $W_{max}$. We set $W_{min}=0.7$, $q_{max}=0.15$, $\boldsymbol{b}=0$ and vary $W_{max}$ over the values $\{1.0,1.2,1.4\}$. We evaluate on a chain structured Ising model on 15 nodes. Intuitively, a high $W_{max}$ makes it difficult to differentiate between a star and a non-star, therefore the algorithm is expected to perform better for lower $W_{max}$. This is indeed what happens, as shown in Figure 8.

Figure 8: $W_{min}=0.7$, $q_{max}=0.15$. The performance of our algorithm on a chain of 15 nodes over 50 runs for increasing number of samples for different values of $W_{max}$.

Figure 9: $W_{max}=1.1$, $q_{max}=0.15$. The performance of our algorithm on a chain of 15 nodes over 50 runs for increasing number of samples for different values of $W_{min}$.

5.5 Effect of the Minimum Weight

We also look at the effect of $W_{min}$. We set $W_{max}=1.1$, $q_{max}=0.15$, $\boldsymbol{b}=0$ and vary $W_{min}$ over the values $\{0.6,0.7,0.8\}$. We evaluate on a chain structured Ising model on 15 nodes. We expect that smaller edge weights make it more difficult to accurately estimate the correlation values, resulting in higher errors. This is illustrated in Figure 9.

5.6 Algorithm performance for Star-structured tree Ising Model

So far, all the experiments have used chain structured Ising models. In this subsection, we consider the other extreme tree structure: the star structure, with one internal node and the remaining leaf nodes connected to it. We set $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.1$, $\boldsymbol{b}=0$. We consider three different graph sizes: 11 nodes, 13 nodes and 15 nodes. The performance of the algorithm is illustrated in Figure 10.

Figure 10: $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.1$. The performance of our algorithm on a star of 11, 13 and 15 nodes over 50 runs for increasing number of samples.

As we can see, compared to chain structured Ising models, star-structured Ising models require a much smaller number of samples. This is due to the small radius, which results in high absolute values of the correlations. Also, the performance is only slightly impacted by the number of nodes, which can also be attributed to the small radius.

5.7 Effect of bias on the algorithm performance.

We observe a very interesting phenomenon when we study the impact of the bias term $\boldsymbol{b}$. For chain structured Ising models, the algorithm performs better for lower absolute values of $\boldsymbol{b}$. However, for star structured Ising models, the algorithm performs better for higher absolute values of $\boldsymbol{b}$. For both cases, we consider a tree on 11 nodes and set $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.1$. $\mu_{max}$ is estimated empirically. We set all the entries in $\boldsymbol{b}$ to be equal.

For the chain structured Ising model, we consider 3 cases: (i) all the entries in $\boldsymbol{b}$ equal to 0.0, (ii) all equal to 0.02, and (iii) all equal to 0.04. The result is provided in Figure 11.

Figure 11: $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.1$. The performance of our algorithm on a chain of 11 nodes over 50 runs for increasing number of samples for varying bias values.

Figure 12: $W_{min}=0.7$, $W_{max}=1.2$, $q_{max}=0.1$. The performance of our algorithm on a star of 11 nodes over 50 runs for increasing number of samples for varying bias values.

For the star structured Ising model, we consider 3 cases: (i) all the entries in $\boldsymbol{b}$ equal to 0.0, (ii) all equal to 0.2, and (iii) all equal to 0.4. The result is provided in Figure 12.

We now give an intuitive justification of this observation. A higher value of the bias results in a smaller threshold for the proximal sets. For chain structured Ising models, due to correlation decay, we can expect the correlation values for some pairs of random variables to be close to the threshold. Estimating them accurately then requires more samples.

Star structured Ising models, on the other hand, have higher correlation values due to their smaller diameter of 2. A low threshold ensures that no node is mistakenly left out when constructing the proximal sets. Therefore, the performance is better for higher bias values.

Appendix A Proof of Lemma 1

We prove this by induction on the number of nodes $k$ in the path $(X_{i_{1}}\rightarrow X_{i_{2}}\rightarrow X_{i_{3}}\dots\rightarrow X_{i_{k}})$ for any 2 nodes $X_{i_{1}},X_{i_{k}}$.

Base Case $k=3$:

The path is $(X_{i_{1}}\rightarrow X_{i_{2}}\rightarrow X_{i_{3}})$, therefore we have $X_{i_{1}}\perp X_{i_{3}}|X_{i_{2}}$. For random variables with a support size of 2, this is true if and only if they are conditionally uncorrelated, that is,

$\mathbb{E}[X_{i_{1}}X_{i_{3}}|X_{i_{2}}]=\mathbb{E}[X_{i_{1}}|X_{i_{2}}]\,\mathbb{E}[X_{i_{3}}|X_{i_{2}}].$ (3)

$\mathbb{E}[X_{i_{1}}|X_{i_{2}}]$ is linear in $X_{i_{2}}$ since the support size of $X_{i_{2}}$ is 2, and therefore we only need to fit 2 points, $\mathbb{E}[X_{i_{1}}|X_{i_{2}}=1]$ and $\mathbb{E}[X_{i_{1}}|X_{i_{2}}=-1]$, to completely represent the conditional expectation. Therefore, the linear least squares error (LLSE) estimator of $X_{i_{1}}$ given $X_{i_{2}}$ is also the minimum mean squared error estimator $\mathbb{E}[X_{i_{1}}|X_{i_{2}}]$. Utilizing the standard result for the LLSE, we have:

$\mathbb{E}[X_{i_{1}}|X_{i_{2}}]=\mathbb{E}[X_{i_{1}}]+\Sigma_{i_{1},i_{2}}\Sigma_{i_{2},i_{2}}^{-1}(X_{i_{2}}-\mathbb{E}[X_{i_{2}}]).$ (4)

Similarly we have:

$\mathbb{E}[X_{i_{3}}|X_{i_{2}}]=\mathbb{E}[X_{i_{3}}]+\Sigma_{i_{3},i_{2}}\Sigma_{i_{2},i_{2}}^{-1}(X_{i_{2}}-\mathbb{E}[X_{i_{2}}]).$ (5)

Substituting $\mathbb{E}[X_{i_{1}}|X_{i_{2}}]$ and $\mathbb{E}[X_{i_{3}}|X_{i_{2}}]$ from Equations (4) and (5) in Equation (3), we get:

$\mathbb{E}[X_{i_{1}}X_{i_{3}}|X_{i_{2}}]=\mathbb{E}[X_{i_{1}}]\mathbb{E}[X_{i_{3}}]+\mathbb{E}[X_{i_{1}}]\Sigma_{i_{3},i_{2}}\Sigma_{i_{2},i_{2}}^{-1}(X_{i_{2}}-\mathbb{E}[X_{i_{2}}])+\mathbb{E}[X_{i_{3}}]\Sigma_{i_{1},i_{2}}\Sigma_{i_{2},i_{2}}^{-1}(X_{i_{2}}-\mathbb{E}[X_{i_{2}}])+\Sigma_{i_{1},i_{2}}\Sigma_{i_{3},i_{2}}\left(\Sigma_{i_{2},i_{2}}^{-1}(X_{i_{2}}-\mathbb{E}[X_{i_{2}}])\right)^{2},$

$\mathbb{E}[X_{i_{1}}X_{i_{3}}]=\mathbb{E}\big[\mathbb{E}[X_{i_{1}}X_{i_{3}}|X_{i_{2}}]\big]=\mathbb{E}[X_{i_{1}}]\mathbb{E}[X_{i_{3}}]+\Sigma_{i_{1},i_{2}}\Sigma_{i_{3},i_{2}}\Sigma_{i_{2},i_{2}}^{-1}.$

Therefore we get $\Sigma_{i_{1},i_{3}}\Sigma_{i_{2},i_{2}}=\Sigma_{i_{1},i_{2}}\Sigma_{i_{3},i_{2}}$, which implies $\rho_{i_{1}i_{3}}=\rho_{i_{1}i_{2}}\rho_{i_{2}i_{3}}$.

Inductive Case:

Let the statement be true for any path involving $k$ nodes. For a path $(X_{i_{1}}\rightarrow X_{i_{2}}\rightarrow X_{i_{3}}\dots\rightarrow X_{i_{(k+1)}})$ we have $X_{i_{1}}\perp X_{i_{(k+1)}}|X_{i_{k}}$. Therefore, the same calculation as in the base case holds by replacing $X_{i_{2}}$ with $X_{i_{k}}$ and $X_{i_{3}}$ with $X_{i_{(k+1)}}$, giving $\rho_{i_{1}i_{(k+1)}}=\rho_{i_{1}i_{k}}\rho_{i_{k}i_{(k+1)}}$. By the inductive assumption, $\rho_{i_{1}i_{k}}=\prod_{l=2}^{k}\rho_{i_{l-1},i_{l}}$, therefore $\rho_{i_{1}i_{(k+1)}}=\prod_{l=2}^{k+1}\rho_{i_{l-1},i_{l}}$.

Appendix B Proof of Covariance of noisy variables.

Lemma 4.

Consider 2 random variables $X_{i}$ and $X_{j}$ with support on $\{-1,1\}$ whose covariance is denoted by $\Sigma_{i,j}$. Now consider the noisy versions of these random variables, $X_{i}^{e}$ and $X_{j}^{e}$, whose covariance is denoted by $\Sigma_{i,j}^{\prime}$. Then we have:

$\mathbb{E}[X_{i}^{e}]=(1-2q_{i})\mathbb{E}[X_{i}],\qquad\Sigma_{i,j}^{\prime}=(1-2q_{i})(1-2q_{j})\Sigma_{i,j}.$
Proof.

By the noise model we have:

$\mathbb{E}[X_{i}^{e}]=(1-q_{i})\mathbb{E}[X_{i}]+q_{i}\mathbb{E}[-X_{i}]=(1-2q_{i})\mathbb{E}[X_{i}].$ (6)

We also have:

$\mathbb{E}[X_{i}^{e}X_{j}^{e}]=(1-q_{i})(1-q_{j})\mathbb{E}[X_{i}X_{j}]+(1-q_{j})q_{i}\mathbb{E}[-X_{i}X_{j}]+(1-q_{i})q_{j}\mathbb{E}[-X_{i}X_{j}]+q_{i}q_{j}\mathbb{E}[X_{i}X_{j}]=(1-2q_{i})(1-2q_{j})\mathbb{E}[X_{i}X_{j}].$ (7)

Therefore,

$\Sigma_{i,j}^{\prime}=\mathbb{E}[X_{i}^{e}X_{j}^{e}]-\mathbb{E}[X_{i}^{e}]\mathbb{E}[X_{j}^{e}]=(1-2q_{i})(1-2q_{j})(\mathbb{E}[X_{i}X_{j}]-\mathbb{E}[X_{i}]\mathbb{E}[X_{j}])=(1-2q_{i})(1-2q_{j})\Sigma_{i,j}.$ (8)

We can use Equation (6) to calculate the variance of every random variable in terms of the variance of its noisy counterpart as follows:

$\Sigma_{i,i}=1-\mathbb{E}[X_{i}]^{2}=1-\frac{\mathbb{E}[X_{i}^{e}]^{2}}{(1-2q_{i})^{2}}=1-\frac{1-\Sigma^{\prime}_{i,i}}{(1-2q_{i})^{2}}.$ (9)

Appendix C Proof that the Quadratic gives a valid solution

Consider the quadratic in Equation (2). We prove that this equation always has a valid solution $q_{1}<0.5$ for any set of 3 nodes in a tree structured graphical model.

Whenever $0<1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}<1$, the solution is of the form $q_{1}=\eta,1-\eta$ where $0\leq\eta<0.5$. From Equations (8) and (9), we have:

$1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}=(1-2q_{1})^{2}\left(1-\Sigma_{1,1}+\frac{\Sigma_{1,2}\Sigma_{1,3}}{\Sigma_{2,3}}\right).$ (10)

The different possible configurations of any 3 nodes $X_{1}$, $X_{2}$ and $X_{3}$ in any tree structured graphical model are shown in Figure 13.

Figure 13: Different possible configurations of any set of 3 nodes.

For case (a) we have $\Sigma_{2,2}\Sigma_{1,3}=\Sigma_{1,2}\Sigma_{2,3}$ by Lemma 1. This gives us:

$1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}=(1-2q_{1})^{2}\left(1-\Sigma_{1,1}+\frac{\Sigma_{1,2}^{2}}{\Sigma_{2,2}}\right).$

Using the assumption that the absolute value of the correlation is upper bounded away from 1 and lower bounded away from 0, we have $0<\Sigma_{1,2}^{2}<\Sigma_{1,1}\Sigma_{2,2}$. Also, $0<\Sigma_{1,1}\leq 1$ and $0<(1-2q_{1})^{2}\leq 1$. Therefore, for case (a), $0<1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}<1$ and the quadratic equation has valid roots. By symmetry, the quadratic equation gives valid roots for case (b) too.

Case (c) is the underlying truth, therefore the quadratic equation recovers the true underlying error.

For case (d), we have $\Sigma_{k,k}\Sigma_{1,3}=\Sigma_{1,k}\Sigma_{3,k}$, $\Sigma_{k,k}\Sigma_{1,2}=\Sigma_{1,k}\Sigma_{2,k}$ and $\Sigma_{k,k}\Sigma_{2,3}=\Sigma_{2,k}\Sigma_{3,k}$. This gives us:

$1-\Sigma^{\prime}_{1,1}+\frac{\Sigma^{\prime}_{1,2}\Sigma^{\prime}_{1,3}}{\Sigma^{\prime}_{2,3}}=(1-2q_{1})^{2}\left(1-\Sigma_{1,1}+\frac{\Sigma_{1,k}^{2}}{\Sigma_{k,k}}\right).$ (11)

The same arguments as case (a) hold true for case (d) with node 2 replaced by node $k$. Therefore, the quadratic has a valid solution in this case too.

Appendix D Proof of Lemma 2, Lemma 3 and Star/Non-star Condition for Generic Trees

D.1 Proof of Lemma 2(a)

Proof. Note that $\hat{q}_{1}^{2,3}$ and $\hat{q}_{1}^{2,4}$ are given by solving an equation similar to (2). As the solution to the quadratic is defined completely by the covariance terms, all we need to prove is:

$\dfrac{\Sigma_{1,2}^{\prime}\Sigma_{1,3}^{\prime}}{\Sigma_{2,3}^{\prime}}=\dfrac{\Sigma_{1,2}^{\prime}\Sigma_{1,4}^{\prime}}{\Sigma_{2,4}^{\prime}}\iff\dfrac{\Sigma_{1,3}^{\prime}}{\Sigma_{2,3}^{\prime}}=\dfrac{\Sigma_{1,4}^{\prime}}{\Sigma_{2,4}^{\prime}}.$

By substituting the value of $\Sigma_{i,j}^{\prime}$ from Equation (8), we now need to prove that:

$\dfrac{\Sigma_{1,3}}{\Sigma_{2,3}}=\dfrac{\Sigma_{1,4}}{\Sigma_{2,4}}\iff\dfrac{\rho_{1,3}}{\rho_{2,3}}=\dfrac{\rho_{1,4}}{\rho_{2,4}}.$

Using the correlation decay property, we get that $\rho_{1,3}=\rho_{1,2}\rho_{2,3}$, $\rho_{1,4}=\rho_{1,2}\rho_{2,3}\rho_{3,4}$ and $\rho_{2,4}=\rho_{2,3}\rho_{3,4}$. Therefore LHS = RHS = $\rho_{1,2}$.

D.2 Proof of Lemma 2(b)

Proof. Using the same arguments as in the proof of Lemma 2(a), we can conclude that we need to prove:

$\dfrac{\Sigma_{1,3}^{\prime}\Sigma_{2,4}^{\prime}}{\Sigma_{2,1}^{\prime}\Sigma_{3,4}^{\prime}}\neq 1,\qquad\dfrac{\Sigma_{2,3}^{\prime}\Sigma_{1,4}^{\prime}}{\Sigma_{1,2}^{\prime}\Sigma_{3,4}^{\prime}}\neq 1.$

Substituting $\Sigma_{i,j}^{\prime}$ from Equation (8), we get:

$\dfrac{\Sigma_{1,3}^{\prime}\Sigma_{2,4}^{\prime}}{\Sigma_{2,1}^{\prime}\Sigma_{3,4}^{\prime}}=\dfrac{\rho_{1,3}\rho_{2,4}}{\rho_{2,1}\rho_{3,4}},\qquad\dfrac{\Sigma_{2,3}^{\prime}\Sigma_{1,4}^{\prime}}{\Sigma_{1,2}^{\prime}\Sigma_{3,4}^{\prime}}=\dfrac{\rho_{2,3}\rho_{1,4}}{\rho_{1,2}\rho_{3,4}}.$

Using the correlation decay property, we get that:

$\dfrac{\rho_{1,3}\rho_{2,4}}{\rho_{2,1}\rho_{3,4}}=\dfrac{\rho_{2,3}\rho_{1,4}}{\rho_{1,2}\rho_{3,4}}=\rho_{2,3}^{2}\leq\rho_{max}^{2}<1.$ (12)

D.3 Proof of Lemma 3

Proof. This is equivalent to proving that

$\dfrac{\Sigma_{1,3}^{\prime}}{\Sigma_{2,3}^{\prime}}=\dfrac{\Sigma_{1,4}^{\prime}}{\Sigma_{2,4}^{\prime}},\qquad\dfrac{\Sigma_{1,2}^{\prime}}{\Sigma_{2,4}^{\prime}}=\dfrac{\Sigma_{1,3}^{\prime}}{\Sigma_{3,4}^{\prime}},$

which is again equivalent to:

$\dfrac{\rho_{1,3}}{\rho_{2,3}}=\dfrac{\rho_{1,4}}{\rho_{2,4}},\qquad\dfrac{\rho_{1,2}}{\rho_{2,4}}=\dfrac{\rho_{1,3}}{\rho_{3,4}}.$

Using the correlation decay property, it is easy to see that:

$\dfrac{\rho_{1,3}}{\rho_{2,3}}=\dfrac{\rho_{1,4}}{\rho_{2,4}}=\rho_{1,2},\qquad\dfrac{\rho_{1,2}}{\rho_{2,4}}=\dfrac{\rho_{1,3}}{\rho_{3,4}}.$

D.4 Proof of Star/Non-star Condition for Generic Trees

We show how to utilize the result on a set of 4 nodes to classify any set of 4 nodes as star/non-star in a generic tree.

If any 4 nodes $\{X_{1},X_{2},X_{3},X_{4}\}$ in a tree graphical model form a non-star shape such that $(X_{1},X_{2})$ form a pair and are not arranged in a chain, there exist nodes $X_{k}$ and $X_{k^{\prime}}$ such that the conditional independence structure is given by either Figure 14(a) or 14(b).

Figure 14: Possible conditional independence relations for non-star shape if they don’t form a chain

For the conditional independence in Figure 14(a), we know that:

$\hat{q}_{4}^{2,3}=\hat{q}_{4}^{k,2}$ by Lemma 3 on $\{X_{2},X_{3},X_{4},X_{k}\}$,
$\hat{q}_{4}^{1,3}=\hat{q}_{4}^{k,1}$ by Lemma 3 on $\{X_{1},X_{3},X_{4},X_{k}\}$,
$\hat{q}_{4}^{k,2}=\hat{q}_{4}^{k,1}\neq\hat{q}_{4}^{1,2}$ by Lemma 2(a) and Lemma 2(b) on $\{X_{1},X_{2},X_{k},X_{4}\}$. (13)

This gives us $\hat{q}_{4}^{2,3}=\hat{q}_{4}^{1,3}\neq\hat{q}_{4}^{1,2}$. Similarly, we have $\hat{q}_{3}^{2,4}=\hat{q}_{3}^{1,4}\neq\hat{q}_{3}^{1,2}$.

We also know that:

$\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,k}\neq\hat{q}_{2}^{k,3}$ by Lemma 2(a) and Lemma 2(b) on $\{X_{1},X_{2},X_{k},X_{3}\}$,
$\hat{q}_{2}^{1,4}=\hat{q}_{2}^{1,k}\neq\hat{q}_{2}^{k,4}$ by Lemma 2(a) and Lemma 2(b) on $\{X_{1},X_{2},X_{k},X_{4}\}$,
$\hat{q}_{2}^{k,3}=\hat{q}_{2}^{3,4}=\hat{q}_{2}^{k,4}$ by Lemma 3 on $\{X_{2},X_{3},X_{4},X_{k}\}$,
$\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,k}\neq\hat{q}_{1}^{k,3}$ by Lemma 2(a) and Lemma 2(b) on $\{X_{1},X_{2},X_{k},X_{3}\}$,
$\hat{q}_{1}^{2,4}=\hat{q}_{1}^{2,k}\neq\hat{q}_{1}^{k,4}$ by Lemma 2(a) and Lemma 2(b) on $\{X_{1},X_{2},X_{k},X_{4}\}$,
$\hat{q}_{1}^{k,3}=\hat{q}_{1}^{3,4}=\hat{q}_{1}^{k,4}$ by Lemma 3 on $\{X_{1},X_{3},X_{4},X_{k}\}$. (14)

These equations imply $\hat{q}_{2}^{1,3}=\hat{q}_{2}^{1,4}\neq\hat{q}_{2}^{3,4}$ and $\hat{q}_{1}^{2,3}=\hat{q}_{1}^{2,4}\neq\hat{q}_{1}^{3,4}$. If the conditional independence is as shown in Figure 14(b), we have:

q^12,3=q^1k,2=q^12,4=q^1k,4By Lemma 3 on\displaystyle{\hat{q}}_{1}^{2,3}={\hat{q}}_{1}^{k^{\prime},2}={\hat{q}}_{1}^{2,4}={\hat{q}}_{1}^{k^{\prime},4}\text{By Lemma \ref{lemma:star_equality} on } (15)
{X1,X2,X3,Xk} and on {X1,X2,X4,Xk},\displaystyle\text{$\{X_{1},X_{2},X_{3},X_{k^{\prime}}\}$ and on $\{X_{1},X_{2},X_{4},X_{k^{\prime}}\}$},
q^1k,4q^1k,4 By Lemma 2(b) on {X1,Xk,Xk,X4},\displaystyle{\hat{q}}_{1}^{k^{\prime},4}\neq{\hat{q}}_{1}^{k,4}\text{ By Lemma \ref{lemma:equality_inequality}(b) on $\{X_{1},X_{k},X_{k^{\prime}},X_{4}\}$},
q^1k,4=q^13,4 By Lemma 3 on {X1,Xk,X3,X4}.\displaystyle{\hat{q}}_{1}^{k,4}={\hat{q}}_{1}^{3,4}\text{ By Lemma \ref{lemma:star_equality} on $\{X_{1},X_{k},X_{3},X_{4}\}$}.

These equations give us q^12,3=q^12,4q^13,4{\hat{q}}_{1}^{2,3}={\hat{q}}_{1}^{2,4}\neq{\hat{q}}_{1}^{3,4}. Furthermore, by Equation (12), we have:

ρ1,3ρ2,4ρ1,2ρ3,4ρmax2<1\displaystyle\frac{\rho_{1,3}^{\prime}\rho_{2,4}^{\prime}}{\rho_{1,2}^{\prime}\rho_{3,4}^{\prime}}\leq\rho_{max}^{2}<1 (16)
ρ1,3ρ2,4ρ1,4ρ2,3=1\displaystyle\frac{\rho_{1,3}^{\prime}\rho_{2,4}^{\prime}}{\rho_{1,4}^{\prime}\rho_{2,3}^{\prime}}=1

By symmetry, the remaining conditions in Equation (3) are also satisfied.

Figure 15: Possible conditional independence relations for a star shape.

When the 4 nodes form a star structure in the tree, their conditional independence is given either by Figure 3 or there exists a node XkX_{k} such that the conditional independence is as shown in Figure 15. Lemma 3 proves that Equation (3) is satisfied if the conditional independence is given by Figure 3. If the conditional independence is given by Figure 15, we have:

q^12,3=q^12,k=q^1k,3 By Lemma 3 on {X1,X2,X3,Xk},\displaystyle{\hat{q}}_{1}^{2,3}={\hat{q}}_{1}^{2,k}={\hat{q}}_{1}^{k,3}\text{ By Lemma \ref{lemma:star_equality} on $\{X_{1},X_{2},X_{3},X_{k}\}$}, (17)
q^14,3=q^14,k=q^1k,3 By Lemma 3 on {X1,X3,X4,Xk},\displaystyle{\hat{q}}_{1}^{4,3}={\hat{q}}_{1}^{4,k}={\hat{q}}_{1}^{k,3}\text{ By Lemma \ref{lemma:star_equality} on $\{X_{1},X_{3},X_{4},X_{k}\}$},
q^12,4=q^12,k=q^1k,4 By Lemma 3 on {X1,X2,X4,Xk}.\displaystyle{\hat{q}}_{1}^{2,4}={\hat{q}}_{1}^{2,k}={\hat{q}}_{1}^{k,4}\text{ By Lemma \ref{lemma:star_equality} on $\{X_{1},X_{2},X_{4},X_{k}\}$}.

This implies that q^12,3=q^14,3=q^14,2{\hat{q}}_{1}^{2,3}={\hat{q}}_{1}^{4,3}={\hat{q}}_{1}^{4,2}. By symmetry, all the remaining conditions of Equation (3) are also satisfied.

This completes the proof that, with access only to the noisy probability distribution, it is possible to correctly categorize any set of 4 nodes as a star or a non-star shape.
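As an illustration of this categorization, here is a minimal Python sketch of the ratio test implied by Equation (3) and Equation (16); the function name, the matrix input format, and the threshold convention t = (1 + rho_max^2)/2 are illustrative choices rather than part of the formal statement. Since the sign-flip noise rescales each correlation by node-wise factors that cancel in these ratios, the same test applies to the noisy correlations.

import numpy as np

def classify_quadruple(rho, nodes, t):
    # rho   : matrix of pairwise correlations (noisy or non-noisy)
    # nodes : indices of the 4 nodes to classify
    # t     : threshold strictly between rho_max^2 and 1
    i, j, k, l = nodes
    pairings = [((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))]
    surviving = []
    for (a, b), (c, d) in pairings:
        # cross-pair ratio: equals 1 for a star and for the true pairing of a
        # non-star, and is bounded away from 1 for the two wrong pairings
        r = (rho[a, c] * rho[b, d]) / (rho[a, d] * rho[b, c])
        surviving.append(min(r, 1.0 / r) >= t)
    if all(surviving):
        return "star"
    # for exact correlations exactly one pairing survives; it gives the pairs
    return ("non-star", pairings[surviving.index(True)])

# Example: a chain X0 - X1 - X2 - X3 with edge correlations 0.7, 0.6, 0.8;
# correlation decay makes every rho[a, b] the product of the edge correlations
# on the path between a and b.
edge = [0.7, 0.6, 0.8]
rho = np.ones((4, 4))
for a in range(4):
    for b in range(4):
        for e in range(min(a, b), max(a, b)):
            rho[a, b] *= edge[e]
print(classify_quadruple(rho, (0, 1, 2, 3), t=(1 + 0.8**2) / 2))   # non-star with pairs (0,1),(2,3)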

Appendix E Proof of Theorem 2

Given the noisy covariance Σi,j\Sigma_{i,j}^{\prime} and an estimate of the error probability vector 𝐪^\mathbf{\hat{q}}, we estimate the non-noisy covariance as:

Σ^i,j=Σi,j(12q^i)(12q^j)=Σi,j(12qi)(12qj)(12q^i)(12q^j)ij.\hat{\Sigma}_{i,j}=\dfrac{\Sigma_{i,j}^{\prime}}{(1-2\hat{q}_{i})(1-2\hat{q}_{j})}=\dfrac{\Sigma_{i,j}(1-2q_{i})(1-2q_{j})}{(1-2\hat{q}_{i})(1-2\hat{q}_{j})}\forall i\neq j. (18)

For the error probability vector 𝐪^\mathbf{\hat{q}}, using Equation (9) the non-noisy variance is estimated as:

Σ^i,i=11Σi,i(12q^i)2\hat{\Sigma}_{i,i}=1-\frac{1-\Sigma_{i,i}^{\prime}}{(1-2\hat{q}_{i})^{2}} (19)

To check if any conditional independence relation XiXj|XkX_{i}\perp X_{j}|X_{k} is true, we need to verify if it satisfies the correlation decay equation Σ^i,jΣ^k,k=Σ^i,kΣ^k,j\hat{\Sigma}_{i,j}\hat{\Sigma}_{k,k}=\hat{\Sigma}_{i,k}\hat{\Sigma}_{k,j}.
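A small sketch of this check, taking the noisy covariance matrix and a candidate error-probability vector as inputs (the function names and the numerical tolerance are illustrative):

import numpy as np

def denoised_covariance(Sigma_noisy, q_hat):
    # Equations (18) and (19): rescale off-diagonal entries and re-derive variances.
    scale = 1.0 - 2.0 * q_hat
    Sigma_hat = Sigma_noisy / np.outer(scale, scale)
    np.fill_diagonal(Sigma_hat, 1.0 - (1.0 - np.diag(Sigma_noisy)) / scale**2)
    return Sigma_hat

def satisfies_cond_independence(Sigma_hat, i, j, k, tol=1e-9):
    # Correlation decay check for X_i ⟂ X_j | X_k.
    return abs(Sigma_hat[i, j] * Sigma_hat[k, k] - Sigma_hat[i, k] * Sigma_hat[k, j]) < tol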

We first consider TT^{\prime} where only one leaf node exchanges position with its neighbor. Suppose in the original tree, node X1X_{1} is a leaf node connected to node X2X_{2}.

Consider the error vector 𝐪^\mathbf{\hat{q}}:

q^i\displaystyle\hat{q}_{i} =qi  i1,2,\displaystyle=q_{i}\text{ $\forall$ }i\neq 1,2, (20)
q^1\displaystyle\hat{q}_{1} =12(1(12q1)Σ1,22Σ2,2Σ1,1+1),\displaystyle=\frac{1}{2}\left(1-(1-2q_{1})\sqrt{\frac{\Sigma_{1,2}^{2}}{\Sigma_{2,2}}-\Sigma_{1,1}+1}\right),
q^2\displaystyle\hat{q}_{2} =0.\displaystyle=0.

To prove that this error vector results in TT^{\prime}, we need to prove that any node XkX1,X2X_{k}\neq X_{1},X_{2} which satisfies X1Xk|X2X_{1}\perp X_{k}|X_{2} in TT^{*} must satisfy X2Xk|X1X_{2}\perp X_{k}|X_{1} in TT^{\prime}. We note that:

Σ^1,2\displaystyle\hat{\Sigma}_{1,2} =Σ1,2(12q1)(12q2)(12q^1),Σ^1,k=Σ1,k(12q1)(12q^1)\displaystyle=\frac{\Sigma_{1,2}(1-2q_{1})(1-2q_{2})}{(1-2\hat{q}_{1})},\hat{\Sigma}_{1,k}=\frac{\Sigma_{1,k}(1-2q_{1})}{(1-2\hat{q}_{1})} (21)
Σ^2,k\displaystyle\hat{\Sigma}_{2,k} =Σ2,k(12q2),Σ^1,1=1(1Σ1,1)(12q1)2(12q^1)2\displaystyle=\Sigma_{2,k}(1-2q_{2}),\hat{\Sigma}_{1,1}=1-\frac{(1-\Sigma_{1,1})(1-2q_{1})^{2}}{(1-2\hat{q}_{1})^{2}}

Using Σ1,kΣ2,2=Σ1,2Σ2,k\Sigma_{1,k}\Sigma_{2,2}=\Sigma_{1,2}\Sigma_{2,k}, it is easy to check that Σ^2,kΣ^1,1=Σ^1,kΣ^1,2\hat{\Sigma}_{2,k}\hat{\Sigma}_{1,1}=\hat{\Sigma}_{1,k}\hat{\Sigma}_{1,2}.

Furthermore, we need to prove that any pair of nodes Xk1,Xk2X1,X2X_{k_{1}},X_{k_{2}}\neq X_{1},X_{2} such that Xk1Xk2|X2X_{k_{1}}\perp X_{k_{2}}|X_{2} in TT^{*} satisfy Xk1Xk2|X1X_{k_{1}}\perp X_{k_{2}}|X_{1} in TT^{\prime}. Doing similar substitutions by replacing node 2 by node k1k_{1} and node kk by node k2k_{2} gives us Σ^1,1Σ^k1,k2=Σ^1,k1Σ^1,k2\hat{\Sigma}_{1,1}\hat{\Sigma}_{k_{1},k_{2}}=\hat{\Sigma}_{1,k_{1}}\hat{\Sigma}_{1,k_{2}} which proves that Xk1Xk2|X1X_{k_{1}}\perp X_{k_{2}}|X_{1}.

The remaining conditional independence relations not involving X1X_{1} and X2X_{2} remain intact because the error probability for each of the remaining nodes is set to its original value.

Now, consider a tree TT^{\prime} in which a set of leaf nodes 𝒮\mathcal{S}^{\prime} exchange positions with their neighbors. For this case consider the error probability vector 𝐪^\mathbf{\hat{q}} such that:

q^i=\displaystyle\hat{q}_{i}= 12(1(12qi)Σi,j2Σj,jΣi,i+1),\displaystyle\frac{1}{2}\left(1-(1-2q_{i})\sqrt{\frac{\Sigma_{i,j}^{2}}{\Sigma_{j,j}}-\Sigma_{i,i}+1}\right),
 i𝒮,j=Parent(i)\displaystyle\text{ $\forall$ }i\in\mathcal{S}^{\prime},j=Parent(i)
q^j=\displaystyle\hat{q}_{j}= 0  i𝒮,j=Parent(i)\displaystyle 0\text{ $\forall$ }i\in\mathcal{S}^{\prime},j=Parent(i)
q^k=\displaystyle\hat{q}_{k}= qk, otherwise.\displaystyle q_{k},\text{ otherwise.}

This assignment is obtained by applying the single-leaf construction above to each leaf node in 𝒮\mathcal{S}^{\prime}, one at a time.
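As a numerical sanity check of the single-leaf construction in Equation (20), the following sketch builds a toy chain Xk – X2 – X1 with zero external field (so all variances are 1 and the noisy variances are unchanged), applies the noise model, and verifies that after denoising with the adversarial error vector the relation X2 ⟂ Xk | X1 holds instead of X1 ⟂ Xk | X2. The specific numbers are arbitrary.

import numpy as np

# Toy chain X_k -- X_2 -- X_1, zero external field: Sigma_ii = 1 and the
# off-diagonal entries are products of edge correlations (correlation decay).
rho_12, rho_2k = 0.7, 0.6
Sigma = np.array([[1.0,             rho_12, rho_12 * rho_2k],   # order: X_1, X_2, X_k
                  [rho_12,          1.0,    rho_2k         ],
                  [rho_12 * rho_2k, rho_2k, 1.0            ]])
q = np.array([0.1, 0.2, 0.05])            # true (unknown) error probabilities

# Noisy covariance: off-diagonal entries scale by (1 - 2 q_i)(1 - 2 q_j).
s = 1.0 - 2.0 * q
Sigma_noisy = Sigma * np.outer(s, s)
np.fill_diagonal(Sigma_noisy, 1.0)

# Adversarial q_hat from Equation (20): exchanges leaf X_1 with its neighbor X_2.
q_hat = q.copy()
q_hat[0] = 0.5 * (1 - (1 - 2 * q[0]) * np.sqrt(Sigma[0, 1]**2 / Sigma[1, 1] - Sigma[0, 0] + 1))
q_hat[1] = 0.0

# Denoise with q_hat (Equations (18)-(19)) and test both conditional independences.
sh = 1.0 - 2.0 * q_hat
Sigma_hat = Sigma_noisy / np.outer(sh, sh)
np.fill_diagonal(Sigma_hat, 1.0 - (1.0 - np.diag(Sigma_noisy)) / sh**2)

print(Sigma_hat[0, 2] * Sigma_hat[1, 1] - Sigma_hat[0, 1] * Sigma_hat[1, 2])  # X_1 ⟂ X_k | X_2: no longer 0
print(Sigma_hat[1, 2] * Sigma_hat[0, 0] - Sigma_hat[0, 1] * Sigma_hat[0, 2])  # X_2 ⟂ X_k | X_1: ≈ 0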

Appendix F Pseudo-code, Proof of Correctness, Time-complexity

F.1 Pseudo-code, proof of correctness and Time-complexity for FindEC

The pseudo-code is presented in Algorithm 3.

Algorithm 3 Algorithm to find the Equivalence Cluster in 𝒳subXi\mathcal{X}_{sub}\cup X_{i} containing node XiX_{i}
1:procedure FindEC(XiX_{i}, 𝒳sub\mathcal{X}_{sub})
2:     if |𝒳sub|==0|\mathcal{X}_{sub}|==0 then
3:         return [XiX_{i} ]      
4:     if |𝒳sub|==1|\mathcal{X}_{sub}|==1 then
5:         .append((Xi,𝒳sub[0]))\mathcal{E}.append((X_{i},\mathcal{X}_{sub}[0]))
6:         return Xi𝒳subX_{i}\cup\mathcal{X}_{sub}      
7:     if |𝒳sub|==2|\mathcal{X}_{sub}|==2 then
8:         .append((Xi,𝒳sub[0]))\mathcal{E}.append((X_{i},\mathcal{X}_{sub}[0]))
9:         .append((𝒳sub[1],𝒳sub[0]))\mathcal{E}.append((\mathcal{X}_{sub}[1],\mathcal{X}_{sub}[0]))
10:         return Xi𝒳subX_{i}\cup\mathcal{X}_{sub}      
11:     ECEC\leftarrow[ ]
12:     candidate_nodes𝒫~i1𝒳candidate\_nodes\leftarrow\tilde{\mathcal{P}}_{i}^{1}\cap\mathcal{X}
13:     for XjX_{j} in candidate_nodescandidate\_nodes do  
14:         Xk1candidate_nodesXjX_{k_{1}}\in candidate\_nodes\setminus X_{j}
15:         IsInECTrueIsInEC\leftarrow True
16:         test_nodes𝒫~i2𝒫~j2𝒫~k12𝒳subtest\_nodes\leftarrow\tilde{\mathcal{P}}^{2}_{i}\cap\tilde{\mathcal{P}}^{2}_{j}\cap\tilde{\mathcal{P}}_{k_{1}}^{2}\cap\mathcal{X}_{sub}
17:         for Xk2X_{k_{2}} in test_nodestest\_nodes do
18:              if min(ρi,k1ρj,k2ρi,k2ρj,k1,ρi,k2ρj,k1ρi,k1ρj,k2)<tmin(\frac{\rho_{i,k_{1}}^{\prime}\rho_{j,k_{2}}^{\prime}}{\rho_{i,k_{2}}^{\prime}\rho_{j,k_{1}}^{\prime}},\frac{\rho_{i,k_{2}}^{\prime}\rho_{j,k_{1}}^{\prime}}{\rho_{i,k_{1}}^{\prime}\rho_{j,k_{2}}^{\prime}})<t then
19:                  IsInECFalseIsInEC\leftarrow False                        
20:         if IsInEC==TrueIsInEC==True then
21:              EC.EC.append(Xj)(X_{j})
22:     .append((Xi,EC[0]))\mathcal{E}.append((X_{i},EC[0]))
23:     if len(EC)>1len(EC)>1 then
24:         for m=1m=1 to len(EC)-1len(EC)-1 do
25:              .append((EC[m],EC[0]))\mathcal{E}.append((EC[m],EC[0]))
26:     return XiECX_{i}\cup EC

Proof of correctness

By the definition of 𝒫i1~\tilde{\mathcal{P}_{i}^{1}}, candidate_nodescandidate\_nodes contains at least all the nodes within a radius of 4 of XiX_{i}. Therefore, it contains all the nodes in the equivalence cluster containing XiX_{i}. Now we check for each node in candidate_nodescandidate\_nodes whether it is in the equivalence cluster of XiX_{i}.

For any node XjX_{j} to be in the equivalence cluster of XiX_{i}, every set of 4 nodes {Xi,Xj,Xk1,Xk2}\{X_{i},X_{j},X_{k_{1}},X_{k_{2}}\} needs to be either a star shape or a non-star shape where (Xi,Xj)(X_{i},X_{j}) form a pair.

In order to check this, it is sufficient to check the condition for the sets of 4 nodes {Xi,Xj,Xk1,Xk2}\{X_{i},X_{j},X_{k_{1}},X_{k_{2}}\} obtained from one arbitrary Xk1candidate_nodesXjX_{k_{1}}\in candidate\_nodes\setminus X_{j} and every Xk2𝒫~i2𝒫~j2𝒫~k12𝒳subX_{k_{2}}\in\tilde{\mathcal{P}}^{2}_{i}\cap\tilde{\mathcal{P}}^{2}_{j}\cap\tilde{\mathcal{P}}_{k_{1}}^{2}\cap\mathcal{X}_{sub}. Suppose XjX_{j} is not in the same equivalence cluster as XiX_{i}. This implies that there exist nodes on the path from XiX_{i} to XjX_{j}. By the definition of the set 𝒫~2\tilde{\mathcal{P}}^{2}, the node on this path with a potential edge with XiX_{i} is in 𝒫~i2𝒫~j2𝒫~k12𝒳sub\tilde{\mathcal{P}}^{2}_{i}\cap\tilde{\mathcal{P}}^{2}_{j}\cap\tilde{\mathcal{P}}_{k_{1}}^{2}\cap\mathcal{X}_{sub}; choosing it as Xk2X_{k_{2}} results in {Xi,Xj,Xk1,Xk2}\{X_{i},X_{j},X_{k_{1}},X_{k_{2}}\} forming a non-star in which (Xi,Xj)(X_{i},X_{j}) do not form a pair, so the ratio test rejects XjX_{j}.

Time-complexity

As there are 2 for loops, it is easy to see that this subroutine is 𝒪(n2)\mathcal{O}(n^{2}).
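For concreteness, the main loop of FindEC can be read as the following Python sketch; the proximal sets 𝒫̃¹ and 𝒫̃² are assumed to be precomputed and passed in as dictionaries of sets, rho is the matrix of noisy correlations, and the small-|𝒳sub| base cases of Algorithm 3 are omitted. All names are illustrative.

def find_ec(i, X_sub, rho, P1, P2, t, E):
    # Returns the equivalence cluster containing X_i inside X_sub ∪ {X_i};
    # discovered edges are appended to the edge list E.
    ec = []
    candidates = [v for v in X_sub if v in P1[i]]
    for j in candidates:
        k1 = next((v for v in candidates if v != j), None)   # one arbitrary other candidate
        if k1 is None:                                        # nothing to test against
            ec.append(j)
            continue
        test_nodes = [v for v in X_sub
                      if v in P2[i] and v in P2[j] and v in P2[k1] and v not in (i, j, k1)]
        in_ec = True
        for k2 in test_nodes:
            r = (rho[i, k1] * rho[j, k2]) / (rho[i, k2] * rho[j, k1])
            if min(r, 1.0 / r) < t:          # (X_i, X_j) cannot be a pair of this quadruple
                in_ec = False
                break
        if in_ec:
            ec.append(j)
    if ec:
        E.append((i, ec[0]))                 # X_i connects to a representative of the cluster
        for v in ec[1:]:
            E.append((v, ec[0]))
    return [i] + ec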

F.2 Pseudo-code, proof of correctness and Time-complexity for SplitTree

Algorithm 4 Algorithm to split nodes close to XiX_{i} in 𝒳𝒳last\mathcal{X}\setminus\mathcal{X}_{last} which have an edge with EC(Xi)EC(X_{i}). 𝒳last\mathcal{X}_{last} is such that EC(Xi)Xleaf𝒳lastEC(X_{i})\cup X_{leaf}\in\mathcal{X}_{last}.
1:procedure SplitTree(Xi,Xleaf,𝒳lastX_{i},X_{leaf},\mathcal{X}_{last})
2:     close_nodes𝒫~i1𝒫~ext1𝒳¯lastclose\_nodes\leftarrow\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}_{ext}^{1}\cap\bar{\mathcal{X}}_{last}
3:     split|close_nodes|×|close_nodes|split\leftarrow|close\_nodes|\times|close\_nodes| array of 0
4:     for XpX_{p} in close_nodesclose\_nodes do
5:         for XqXpX_{q}\neq X_{p} in 𝒫~i1𝒫~ext1𝒫~p2\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}_{ext}^{1}\cap\tilde{\mathcal{P}}^{2}_{p} do
6:              if {Xi,Xleaf,Xp,Xq}\{X_{i},X_{leaf},X_{p},X_{q}\} is non-star and Xi,XleafX_{i},X_{leaf} form a pair then
7:                  if XqX_{q} is in 𝒳last\mathcal{X}_{last} then
8:                       splitp,:0split_{p,:}\leftarrow 0
9:                       Break from the inner loop
10:                  else
11:                       splitp,q1split_{p,q}\leftarrow 1                                               
12:     Merge rows of splitsplit with 1 in the same column
13:     Return unique rows of splitsplit

The pseudo-code is provided in Algorithm 4.

Proof of Correctness

𝒳last\mathcal{X}_{last} is such that EC(Xi),Xleaf𝒳lastEC(X_{i}),X_{leaf}\in\mathcal{X}_{last}. Furthermore, if any node Xk𝒳lastX_{k}\in\mathcal{X}_{last}, then the nodes of the equivalence cluster that has an edge with EC(Xleaf)EC(X_{leaf}) in the connected component obtained by removing the edges with EC(Xi)EC(X_{i}) are also in 𝒳last\mathcal{X}_{last}. Let 𝒳cc\mathcal{X}_{cc} denote the set of nodes in those subtrees, obtained by removing XiX_{i} from the original tree, that don’t contain any node from 𝒳last\mathcal{X}_{last}. We split the nodes in 𝒳cc𝒫i1~𝒫~ext1\mathcal{X}_{cc}\cap\tilde{\mathcal{P}_{i}^{1}}\cap\tilde{\mathcal{P}}_{ext}^{1} into subtrees. We first check whether a node XkX_{k} belongs to 𝒳cc\mathcal{X}_{cc}. A node Xk𝒫i1~𝒫~ext1X_{k}\in\tilde{\mathcal{P}_{i}^{1}}\cap\tilde{\mathcal{P}}_{ext}^{1} satisfies Xk𝒳ccX_{k}\not\in\mathcal{X}_{cc} if and only if Xk𝒳lastX_{k}\in\mathcal{X}_{last} or there exists Xk𝒳last𝒫~k2X_{k^{\prime}}\in\mathcal{X}_{last}\cap\tilde{\mathcal{P}}_{k}^{2} such that {Xi,Xleaf,Xk,Xk}\{X_{i},X_{leaf},X_{k},X_{k^{\prime}}\} form a non-star and (XkX_{k}, XkX_{k^{\prime}}) form a pair. This would happen when XkX_{k^{\prime}} is a node from the equivalence cluster that has an edge with EC(Xleaf)EC(X_{leaf}). It is easy to see that {Xi,Xleaf,Xk}\{X_{i},X_{leaf},X_{k}\} are in each other’s proximal sets. By the definition of 𝒫~2\tilde{\mathcal{P}}^{2}, Xk𝒫~k2X_{k^{\prime}}\in\tilde{\mathcal{P}}^{2}_{k}. Also, XkX_{k^{\prime}} is within a radius of 4 from XiX_{i} and XleafX_{leaf}; therefore Xk𝒫~i2𝒫~ext2X_{k^{\prime}}\in\tilde{\mathcal{P}}^{2}_{i}\cap\tilde{\mathcal{P}}^{2}_{ext}.

For the nodes in 𝒳cc\mathcal{X}_{cc}, two nodes Xk1,Xk2X_{k_{1}},X_{k_{2}} lie in the same subtree if {Xi,Xleaf,Xk1,Xk2}\{X_{i},X_{leaf},X_{k_{1}},X_{k_{2}}\} form a non-star such that (Xk1,Xk2)(X_{k_{1}},X_{k_{2}}) are in the same pair. This follows from the definition of the non-star shape. To implement this, we construct a square matrix splitsplit of size |𝒫~i1𝒫~ext1𝒳¯last|×|𝒫~i1𝒫~ext1𝒳¯last||\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}_{ext}^{1}\cap\bar{\mathcal{X}}_{last}|\times|\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}_{ext}^{1}\cap\bar{\mathcal{X}}_{last}| which contains 1 in position (j1,j2)(j_{1},j_{2}) if the j1thj_{1}^{th} and j2thj_{2}^{th} nodes in 𝒫~i1𝒫~ext1𝒳¯last\tilde{\mathcal{P}}_{i}^{1}\cap\tilde{\mathcal{P}}_{ext}^{1}\cap\bar{\mathcal{X}}_{last} are Xk1X_{k_{1}} and Xk2X_{k_{2}} respectively, the nodes {Xi,Xleaf,Xk1,Xk2}\{X_{i},X_{leaf},X_{k_{1}},X_{k_{2}}\} are in each other’s 𝒫~2\tilde{\mathcal{P}}^{2}, and the set is classified as a non-star such that (Xi,Xleaf)(X_{i},X_{leaf}) form a pair. It has 0 at other positions. If it turns out that a node is not in 𝒳cc\mathcal{X}_{cc}, the row corresponding to that node is set to 0. Then, we merge the nodes which have 1 in common columns until no two rows have a 1 in a common column. The creation of the splitsplit matrix and the merging are both 𝒪(n2)\mathcal{O}(n^{2}) operations. Therefore, the whole function runs in 𝒪(n2)\mathcal{O}(n^{2}) time.
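The merging step ("merge rows of split with 1 in the same column") amounts to grouping the rows into connected components; a small sketch using a union-find structure is given below. This is an illustration, not necessarily the paper's exact implementation.

import numpy as np

def merge_split_rows(split):
    # Group the rows of the 0/1 matrix `split` so that two rows sharing a 1
    # in the same column end up in the same group (i.e. the same subtree).
    n = split.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path compression
            x = parent[x]
        return x

    for q in range(n):                             # column q
        rows = np.flatnonzero(split[:, q])
        for p in rows[1:]:
            parent[find(p)] = find(rows[0])        # union the rows sharing column q

    groups = {}
    for p in range(n):
        if split[p].any():                         # rows zeroed out are not in X_cc
            groups.setdefault(find(p), []).append(p)
    return list(groups.values())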

F.3 Proof of Correctness of Recurse

Lemma 5.

Recurse correctly recovers all the edges between the nodes in 𝒳cc\mathcal{X}_{cc} as well as the edges between 𝒳cc\mathcal{X}_{cc} and XiX_{i}.

Proof.

We prove this lemma by induction on the number of nodes in 𝒳cc\mathcal{X}_{cc}.

Base Case (|𝒳cc|=2|\mathcal{X}_{cc}|=2):

Note that |𝒳cc|=1|\mathcal{X}_{cc}|=1 is not possible because, if that were the case, the single node in 𝒳cc\mathcal{X}_{cc} would be in EC(Xi)EC(X_{i}) and hence in 𝒳last\mathcal{X}_{last}, which would imply |𝒳cc|=0|\mathcal{X}_{cc}|=0, a contradiction.

Also, note that when |𝒳cc|=2|\mathcal{X}_{cc}|=2, the nodes in 𝒳cc\mathcal{X}_{cc} have to form a non-star when considered with XiX_{i} and XleafX_{leaf}, otherwise its nodes would be in EC(Xi)EC(X_{i}). By the correctness of SplitTree, 𝒳cc\mathcal{X}_{cc} is correctly recognized as the only subtree in step 2. By the correctness of FindEC, the algorithm correctly identifies that the nodes in 𝒳cc\mathcal{X}_{cc} form an equivalence cluster and that this equivalence cluster has an edge with EC(Xi)EC(X_{i}). 𝒳last\mathcal{X}_{last}^{\prime} now contains 𝒳cc\mathcal{X}_{cc}. The next recursion call terminates after step 2 as subtreessubtrees is empty.

Inductive Step:

Suppose Lemma 5 holds for |𝒳cc|k|\mathcal{X}_{cc}|\leq k.

Let us look at what happens when |𝒳cc|=k+1|\mathcal{X}_{cc}|=k+1.

  • In step 2, the algorithm correctly splits the nodes in 𝒳cc𝒫i1~𝒫~ext1\mathcal{X}_{cc}\cap\tilde{\mathcal{P}_{i}^{1}}\cap\tilde{\mathcal{P}}_{ext}^{1} into subtreessubtrees. Suppose this results in jj subtrees. Each of these subtrees has at least 2 nodes by the same argument as for the base case.

  • By the correctness of FindEC, step 4 correctly finds the equivalence cluster ECEC in each subtree which has an edge with EC(Xi)EC(X_{i}). This proves the part of the lemma that the edges between 𝒳cc\mathcal{X}_{cc} and EC(Xi)EC(X_{i}) are correctly identified.

  • For each subtreesubtree 𝒳last=𝒳last(subtreessubtree)EC\mathcal{X}_{last}^{\prime}=\mathcal{X}_{last}\cup(subtrees\setminus subtree)\cup EC. Therefore 𝒳last𝒳last\mathcal{X}_{last}^{\prime}\supset\mathcal{X}_{last}.

  • For the next recursion step, Xi+=EC[0]X_{i}^{+}=EC[0] and 𝒳cc+𝒳cc\mathcal{X}_{cc}^{+}\subset\mathcal{X}_{cc} as EC𝒳ccEC\subseteq\mathcal{X}_{cc} whereas EC𝒳cc+EC\not\subseteq\mathcal{X}_{cc}^{+}.

  • By the inductive assumption, all the edges within 𝒳cc+\mathcal{X}_{cc}^{+} are correctly recovered. Furthermore, the edges between ECEC and 𝒳cc+\mathcal{X}_{cc}^{+} are correctly recovered.

  • This is true for all subtreesubtreessubtree\in subtrees obtained in step 2.

  • Therefore, all the edges within 𝒳cc\mathcal{X}_{cc} are correctly recovered.

F.4 Proof of correctness and time complexity of the complete algorithm

If the tree has just one equivalence cluster, the algorithm discovers it by the correctness of FindEC and terminates. If the tree has more than one equivalence cluster, there is at least one equivalence cluster with more than one node. By the correctness of FindEC, the algorithm correctly finds one such equivalence cluster ECEC. It updates 𝒳last\mathcal{X}_{last} with this equivalence cluster and then calls the Recurse function. The set 𝒳cc\mathcal{X}_{cc} for this call to the Recurse function is 𝒳EC\mathcal{X}\setminus EC. By the correctness of Recurse, the algorithm correctly recovers all the edges within 𝒳EC\mathcal{X}\setminus EC and the edges between 𝒳EC\mathcal{X}\setminus EC and ECEC. Therefore, it correctly recovers the equivalence class of the tree.
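For orientation, the control flow described above can be summarized by the following schematic sketch. FindEC and SplitTree are the subroutines defined earlier and are passed in as callables; the exact bookkeeping of 𝒳last, the representative nodes, and the proximal sets is simplified relative to the actual pseudo-code, so this is only an illustration of the recursion structure.

def find_tree(X, find_ec, split_tree, E):
    # Initialization: call FindEC from each node until an equivalence cluster
    # with more than one node is found (at most n calls).
    ec = [X[0]]
    for x in X:
        ec = find_ec(x, [v for v in X if v != x])
        if len(ec) > 1:
            break
    if len(ec) == len(X):                 # the whole tree is a single equivalence cluster
        return E
    recurse(ec[0], ec[1], set(ec), find_ec, split_tree, E)
    return E

def recurse(x_i, x_leaf, X_last, find_ec, split_tree, E):
    # Step 2: split the unprocessed nodes close to X_i into subtrees.
    subtrees = split_tree(x_i, x_leaf, X_last)
    for sub in subtrees:
        # Step 4: the equivalence cluster in this subtree that has an edge with EC(X_i).
        ec = find_ec(sub[0], [v for v in sub if v != sub[0]])
        E.append((x_i, ec[0]))            # edge between the two equivalence clusters
        others = {v for s in subtrees if s is not sub for v in s}
        recurse(ec[0], x_i, X_last | others | set(ec), find_ec, split_tree, E)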

Time-complexity

In the initialization step, FindEC is called at most nn times and hence takes 𝒪(n3)\mathcal{O}(n^{3}) time. Within the Recurse function, for each call to SplitTree, FindEC is called at least once unless subtreessubtrees is empty, in which case Recurse terminates. Therefore, the number of times SplitTree is called within Recurse is upper bounded by one more than the number of times FindEC is called within Recurse. Whenever FindEC is called from within the Recurse function, it discovers at least one edge, so FindEC can be called at most n2n-2 times within Recurse. Hence the total time complexity of Recurse is 𝒪(n3)\mathcal{O}(n^{3}), and the total time complexity of FindTree is 𝒪(n3)\mathcal{O}(n^{3}).

Appendix G Sample Complexity Analysis

We first calculate how accurate we need ρ~i,j\tilde{\rho}_{i,j} to be in order to make sure that any set of 44 nodes in each other’s 𝒫2\mathcal{P}^{2} is correctly classified as a star or a non-star. We then use Hoeffding’s inequality to obtain a high-probability bound on the sample complexity.

Consider a set of 4 nodes {Xi1,Xi2,Xi3,Xi4}\{X_{i_{1}},X_{i_{2}},X_{i_{3}},X_{i_{4}}\} which forms a non-star such that (Xi1,Xi2)(X_{i_{1}},X_{i_{2}}) form a pair. For these nodes to be correctly classified, we need:

ρ~i1,i3ρ~i2,i4ρ~i1,i4ρ~i2,i3>t3,\displaystyle\frac{\tilde{\rho}_{i_{1},i_{3}}\tilde{\rho}_{i_{2},i_{4}}}{\tilde{\rho}_{i_{1},i_{4}}\tilde{\rho}_{i_{2},i_{3}}}>t_{3},
ρ~i1,i3ρ~i2,i4ρ~i1,i2ρ~i3,i4<t3,\displaystyle\frac{\tilde{\rho}_{i_{1},i_{3}}\tilde{\rho}_{i_{2},i_{4}}}{\tilde{\rho}_{i_{1},i_{2}}\tilde{\rho}_{i_{3},i_{4}}}<t_{3},

where t3=(1+ρmax2)/2t_{3}=(1+\rho_{max}^{2})/2 and ρ~i,j\tilde{\rho}_{i,j} is the empirical correlation obtained from noisy samples between nodes XiX_{i} and XjX_{j}. We look at the first condition. This is equivalent to the condition:

Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3>t3,\displaystyle\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}>t_{3},
Σ~i1,i3Σ~i2,i4Σ~i1,i2Σ~i3,i4<t3.\displaystyle\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{2}}\tilde{\Sigma}_{i_{3},i_{4}}}<t_{3}.

This would be true if:

|Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3Σi1,i3Σi2,i4Σi1,i4Σi2,i3|<1t32\left|\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}-\frac{{\Sigma}_{i_{1},i_{3}}^{\prime}{\Sigma}_{i_{2},i_{4}}^{\prime}}{{\Sigma}_{i_{1},i_{4}}^{\prime}{\Sigma}_{i_{2},i_{3}}^{\prime}}\right|<\frac{1-t_{3}}{2} (22)

as Σi1,i3Σi2,i4Σi1,i4Σi2,i3=1\frac{{\Sigma}_{i_{1},i_{3}}^{\prime}{\Sigma}_{i_{2},i_{4}}^{\prime}}{{\Sigma}_{i_{1},i_{4}}^{\prime}{\Sigma}_{i_{2},i_{3}}^{\prime}}=1. Let Σi,j=Σ~i,j+δi,j{\Sigma}_{i,j}^{\prime}=\tilde{\Sigma}_{i,j}+\delta_{i,j}. Then, we have:

Σi1,i3Σi2,i4Σi1,i4Σi2,i3\displaystyle\frac{{\Sigma}_{i_{1},i_{3}}^{\prime}{\Sigma}_{i_{2},i_{4}}^{\prime}}{{\Sigma}_{i_{1},i_{4}}^{\prime}{\Sigma}_{i_{2},i_{3}}^{\prime}} =(Σ~i1,i3+δi1,i3)(Σ~i2,i4+δi2,i4)(Σ~i1,i4+δi1,i4)(Σ~i2,i3+δi2,i3)\displaystyle=\frac{(\tilde{\Sigma}_{i_{1},i_{3}}+\delta_{i_{1},i_{3}})(\tilde{\Sigma}_{i_{2},i_{4}}+\delta_{i_{2},i_{4}})}{(\tilde{\Sigma}_{i_{1},i_{4}}+\delta_{i_{1},i_{4}})(\tilde{\Sigma}_{i_{2},i_{3}}+\delta_{i_{2},i_{3}})}
=Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+δi1,i3Σ~i1,i3)(1+δi2,i4Σ~i2,i4)(1+δi1,i4Σ~i1,i4)(1+δi2,i3Σ~i2,i3)\displaystyle=\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\dfrac{\left(1+\frac{\delta_{i_{1},i_{3}}}{\tilde{\Sigma}_{i_{1},i_{3}}}\right)\left(1+\frac{\delta_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{2},i_{4}}}\right)}{\left(1+\frac{\delta_{i_{1},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}}\right)\left(1+\frac{\delta_{i_{2},i_{3}}}{\tilde{\Sigma}_{i_{2},i_{3}}}\right)}
=Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+δi1,i3Σ~i1,i3+δi2,i4Σ~i2,i4+δi1,i3δi2,i4Σ~i1,i3Σ~i2,i4)(1+δi1,i4Σ~i1,i4+δi2,i3Σ~i2,i3+δi2,i3δi1,i4Σ~i2,i3Σ~i1,i4)\displaystyle=\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\dfrac{\left(1+\frac{\delta_{i_{1},i_{3}}}{\tilde{\Sigma}_{i_{1},i_{3}}}+\frac{\delta_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{2},i_{4}}}+\frac{\delta_{i_{1},i_{3}}\delta_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}\right)}{\left(1+\frac{\delta_{i_{1},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}}+\frac{\delta_{i_{2},i_{3}}}{\tilde{\Sigma}_{i_{2},i_{3}}}+\frac{\delta_{i_{2},i_{3}}\delta_{i_{1},i_{4}}}{\tilde{\Sigma}_{i_{2},i_{3}}\tilde{\Sigma}_{i_{1},i_{4}}}\right)}
Setting δ=maxi1,i2|δi1i2|,Σ~m=mini,j|Σ~i,j|, and if δΣ~m0.4, we get \displaystyle\text{Setting }\delta=max_{i_{1},i_{2}}|\delta_{i_{1}i_{2}}|,\tilde{\Sigma}_{m}=\min_{i,j}|\tilde{\Sigma}_{i,j}|,\text{ and if }\frac{\delta}{\tilde{\Sigma}_{m}}\leq 0.4,\text{ we get }\dots
Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+2δΣ~m+δ2Σ~m2)(12δΣ~m)\displaystyle\leq\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\dfrac{\left(1+\frac{2\delta}{\tilde{\Sigma}_{m}}+\frac{\delta^{2}}{\tilde{\Sigma}_{m}^{2}}\right)}{\left(1-\frac{2\delta}{\tilde{\Sigma}_{m}}\right)}
Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+2.5δΣ~m)(12δΣ~m)\displaystyle\leq\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\dfrac{\left(1+\frac{2.5\delta}{\tilde{\Sigma}_{m}}\right)}{\left(1-\frac{2\delta}{\tilde{\Sigma}_{m}}\right)}
=Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+2.5δΣ~m)(1+2δΣ~m+(2δΣ~m)2)\displaystyle=\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\left(1+\frac{2.5\delta}{\tilde{\Sigma}_{m}}\right)\left(1+\frac{2\delta}{\tilde{\Sigma}_{m}}+\left(\frac{2\delta}{\tilde{\Sigma}_{m}}\right)^{2}\dots\right)
<Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+2.5δΣ~m)(1+6δΣ~m)\displaystyle<\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\left(1+\frac{2.5\delta}{\tilde{\Sigma}_{m}}\right)\left(1+\frac{6\delta}{\tilde{\Sigma}_{m}}\right)
<Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(1+16δΣ~m)\displaystyle<\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\left(1+\frac{16\delta}{\tilde{\Sigma}_{m}}\right)

Similarly, we can show the other side of the inequality:

Σi1,i3Σi2,i4Σi1,i4Σi2,i3Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3(116δΣ~m)\displaystyle\frac{{\Sigma}_{i_{1},i_{3}}^{\prime}{\Sigma}_{i_{2},i_{4}}^{\prime}}{{\Sigma}_{i_{1},i_{4}}^{\prime}{\Sigma}_{i_{2},i_{3}}^{\prime}}\geq\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\left(1-\frac{16\delta}{\tilde{\Sigma}_{m}}\right)

Using |Σi,j|0.5t2|\Sigma_{i,j}|\geq 0.5t_{2} along with some basic algebraic manipulation yields:

|Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i3Σi1,i3Σi2,i4Σi1,i4Σi2,i3|\displaystyle\left|\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}-\frac{{\Sigma}_{i_{1},i_{3}}^{\prime}{\Sigma}_{i_{2},i_{4}}^{\prime}}{{\Sigma}_{i_{1},i_{4}}^{\prime}{\Sigma}_{i_{2},i_{3}}^{\prime}}\right| <Σ~i1,i3Σ~i2,i4Σ~i1,i4Σ~i2,i316δΣ~m\displaystyle<\frac{\tilde{\Sigma}_{i_{1},i_{3}}\tilde{\Sigma}_{i_{2},i_{4}}}{\tilde{\Sigma}_{i_{1},i_{4}}\tilde{\Sigma}_{i_{2},i_{3}}}\cdot\frac{16\delta}{\tilde{\Sigma}_{m}}
<16δΣ~m3\displaystyle<\frac{16\delta}{\tilde{\Sigma}_{m}^{3}}
<128δt23\displaystyle<\frac{128\delta}{t_{2}^{3}}

Comparing this with Equation (22), we get:

δ<t23(1t3)128.\delta<\frac{t_{2}^{3}(1-t_{3})}{128}.

It is easy to check that the δ\delta obtained for the other condition of non-star shape and for star shape is also the same.

Now we look at the concentration of the empirical correlations computed from noisy samples of the tree-structured Ising model.

Since XiX_{i}^{\prime}, XjX_{j}^{\prime} and XiXjX_{i}^{\prime}X_{j}^{\prime} have support on {1,1}\{-1,1\}, we can use Hoeffding’s inequality to obtain the concentration bounds for them. Suppose we draw mm independent samples from the noisy distribution. Let Z1i,Z2i,ZmiZ^{i}_{1},Z^{i}_{2},\dots Z^{i}_{m} denote the mm samples of XiX^{\prime}_{i}, Z1j,Z2j,ZmjZ^{j}_{1},Z^{j}_{2},\dots Z^{j}_{m} denote the mm samples of XjX^{\prime}_{j} and Z1ij,Z2ij,ZmijZ^{ij}_{1},Z^{ij}_{2},\dots Z^{ij}_{m} denote the mm samples of XjXiX^{\prime}_{j}X^{\prime}_{i}. Now, if we have:

|1mk=1mZkij𝔼[XiXj]|<δ/2,\displaystyle|\frac{1}{m}\sum_{k=1}^{m}Z^{ij}_{k}-\mathbb{E}{\left[X_{i}^{\prime}X_{j}^{\prime}\right]}|<\delta/2,
|1m2(k=1mZki)(k=1mZkj)𝔼[Xi]𝔼[Xj]|<δ/2,\displaystyle|\frac{1}{m^{2}}(\sum_{k=1}^{m}Z^{i}_{k})(\sum_{k=1}^{m}Z^{j}_{k})-\mathbb{E}{\left[X_{i}^{\prime}\right]}\mathbb{E}{\left[X_{j}^{\prime}\right]}|<\delta/2,

then |Σi,jΣ~i,j|<δ|\Sigma^{\prime}_{i,j}-\tilde{\Sigma}_{i,j}|<\delta.
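A small sketch of this empirical estimate, taking an m × n matrix of ±1 noisy samples (names illustrative):

import numpy as np

def empirical_covariance(Z):
    # Z has one ±1 sample per row; Sigma_tilde[i, j] = mean(X'_i X'_j) - mean(X'_i) mean(X'_j).
    means = Z.mean(axis=0)
    second_moments = (Z.T @ Z) / Z.shape[0]
    return second_moments - np.outer(means, means)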

Let Xi¯\bar{X_{i}^{\prime}}, Xj¯\bar{X_{j}^{\prime}} and XiXj¯\overline{X_{i}^{\prime}X_{j}^{\prime}} denote the sample means of XiX_{i}^{\prime}, XjX_{j}^{\prime} and XiXjX_{i}^{\prime}X_{j}^{\prime} respectively. Further, let Xi¯=𝔼[Xi]+αi\bar{X_{i}^{\prime}}=\mathbb{E}{\left[X_{i}^{\prime}\right]}+\alpha_{i} and Xj¯=𝔼[Xj]+αj\bar{X_{j}^{\prime}}=\mathbb{E}{\left[X_{j}^{\prime}\right]}+\alpha_{j} and αmax=max{|αi|,|αj|}\alpha_{max}=\max\{|\alpha_{i}|,|\alpha_{j}|\}

|Xi¯Xj¯𝔼[Xi]𝔼[Xj]|\displaystyle|\bar{X_{i}^{\prime}}\bar{X_{j}^{\prime}}-\mathbb{E}{\left[X_{i}^{\prime}\right]}\mathbb{E}{\left[X_{j}^{\prime}\right]}| =|(𝔼[Xi]+αi)(𝔼[Xj]+αj)𝔼[Xi]𝔼[Xj]|\displaystyle=|(\mathbb{E}{\left[X_{i}^{\prime}\right]}+\alpha_{i})(\mathbb{E}{\left[X_{j}^{\prime}\right]}+\alpha_{j})-\mathbb{E}{\left[X_{i}^{\prime}\right]}\mathbb{E}{\left[X_{j}^{\prime}\right]}| (23)
=|αi𝔼[Xj]+αj𝔼[Xi]+αiαj|\displaystyle=|\alpha_{i}\mathbb{E}{\left[X_{j}^{\prime}\right]}+\alpha_{j}\mathbb{E}{\left[X_{i}^{\prime}\right]}+\alpha_{i}\alpha_{j}|
|αi||𝔼[Xj]|+|αj||𝔼[Xi]|+|αi||αj|\displaystyle\leq|\alpha_{i}||\mathbb{E}{\left[X_{j}^{\prime}\right]}|+|\alpha_{j}||\mathbb{E}{\left[X_{i}^{\prime}\right]}|+|\alpha_{i}||\alpha_{j}|
4αmax (As |𝔼[Xi]|,|𝔼[Xj]|<1,|αi|<2)\displaystyle\leq 4\alpha_{max}\text{ (As $|\mathbb{E}{\left[X_{i}^{\prime}\right]}|,|\mathbb{E}{\left[X_{j}^{\prime}\right]}|<1,|\alpha_{i}|<2$)}

Therefore, when |Xi¯𝔼[Xi]|<δ/8|\bar{X_{i}^{\prime}}-\mathbb{E}{\left[X_{i}^{\prime}\right]}|<\delta/8 and |Xj¯𝔼[Xj]|<δ/8|\bar{X_{j}^{\prime}}-\mathbb{E}{\left[X_{j}^{\prime}\right]}|<\delta/8, we have |Xi¯Xj¯𝔼[Xi]𝔼[Xj]|<δ/2|\bar{X_{i}^{\prime}}\bar{X_{j}^{\prime}}-\mathbb{E}{\left[X_{i}^{\prime}\right]}\mathbb{E}{\left[X_{j}^{\prime}\right]}|<\delta/2

By Hoeffding’s inequality we have:

P(|XiXj¯𝔼[XiXj]|>δ2)\displaystyle P(|\overline{X_{i}^{\prime}X_{j}^{\prime}}-\mathbb{E}{\left[X_{i}^{\prime}X_{j}^{\prime}\right]}|>\frac{\delta}{2}) 2exp(m(δ/2)22),\displaystyle\leq 2\exp\left(\frac{-m(\delta/2)^{2}}{2}\right), (24)
P(|Xi¯𝔼[Xi]|>δ8)\displaystyle P(|\bar{X_{i}^{\prime}}-\mathbb{E}{\left[X_{i}^{\prime}\right]}|>\frac{\delta}{8}) 2exp(m(δ/8)22),\displaystyle\leq 2\exp\left(\frac{-m(\delta/8)^{2}}{2}\right),
P(|Xj¯𝔼[Xj]|>δ8)\displaystyle P(|\bar{X_{j}^{\prime}}-\mathbb{E}{\left[X_{j}^{\prime}\right]}|>\frac{\delta}{8}) 2exp(m(δ/8)22).\displaystyle\leq 2\exp\left(\frac{-m(\delta/8)^{2}}{2}\right).

Therefore,

P(|Σi,jΣ~i,j|>δ)\displaystyle P(|\Sigma^{\prime}_{i,j}-\tilde{\Sigma}_{i,j}|>\delta) 2exp(m(δ/2)22)+4exp(m(δ/8)22)\displaystyle\leq 2\exp\left(\frac{-m(\delta/2)^{2}}{2}\right)+4\exp\left(\frac{-m(\delta/8)^{2}}{2}\right) (25)
6exp(m(δ/8)22)\displaystyle\leq 6\exp\left(\frac{-m(\delta/8)^{2}}{2}\right)

We define a bad event \mathcal{B} as follows:

:={|Σ~i,jΣi,j|>δ for any i,j[n],ij}\displaystyle\mathcal{B}:=\left\{|\tilde{\Sigma}_{i,j}-\Sigma_{i,j}^{\prime}|>\delta\text{ for any }i,j\in[n],i\neq j\right\} (26)

Using a union bound on Equation (25), we bound the probability of \mathcal{B} as:

P()(n2)(6exp(m(δ/8)22)).P(\mathcal{B})\leq(n^{2})\left(6\exp\left(\frac{-m(\delta/8)^{2}}{2}\right)\right). (27)

Therefore, for the probability of a mistake to be bounded by τ\tau, it suffices for the number of samples to satisfy:

m128δ2log(6n2τ)m\geq\frac{128}{\delta^{2}}\log\left(\frac{6n^{2}}{\tau}\right) (28)
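Putting the pieces together, the required number of samples can be evaluated directly. In the sketch below, rho_max and the correlation lower bound t2 (defined earlier in the paper) are treated as known problem parameters, and the values in the final line are arbitrary.

import numpy as np

def required_samples(rho_max, t2, n, tau):
    # Threshold and accuracy from the analysis above, then Equation (28).
    t3 = (1.0 + rho_max**2) / 2.0
    delta = t2**3 * (1.0 - t3) / 128.0
    return int(np.ceil(128.0 / delta**2 * np.log(6.0 * n**2 / tau)))

# e.g. rho_max = 0.8, t2 = 0.3, n = 100 nodes, failure probability tau = 0.01;
# note the logarithmic dependence on the number of nodes n.
print(required_samples(0.8, 0.3, 100, 0.01))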
