Robust Estimation of Tree Structured Ising Models
Abstract
We consider the task of learning Ising models when the signs of different random variables are flipped independently with possibly unequal, unknown probabilities. In this paper, we focus on the problem of robust estimation of tree-structured Ising models. Without any side information, this is an open problem. We first prove that the problem is unidentifiable; however, we show that this unidentifiability is limited to a small equivalence class of trees obtained by leaf nodes exchanging positions with their neighbors. Next, we propose an algorithm that solves the problem with sample complexity logarithmic in the number of nodes and polynomial run-time complexity. Lastly, we empirically demonstrate that, as expected, existing algorithms are not inherently robust in the proposed setting, whereas our algorithm correctly recovers the underlying equivalence class.
1 Introduction
Undirected graphical models or Markov Random Fields (MRFs) are a powerful tool for describing high dimensional distributions using an associated dependency graph, which encodes the conditional dependencies between random variables. In this paper, we focus on a special class of MRFs called the Ising model, first introduced in [15] to represent spin systems in statistical physics [7]. Recently, Ising models have also proven quite popular in biology [16], engineering [9, 31], computer vision [30], and in the optimization and OR communities, including finance [37] and social networks [26]. The special class of tree-structured Ising models is of particular interest for applications in statistical physics over non-amenable graphs. A detailed description and further references can be found in [25].
In this paper, we explore the problem of learning the underlying graph of tree-structured Ising models with independent, unknown, unequal error probabilities. Such scenarios are relatively common in networks where certain malfunctioning sensors can report flipped bits. The presence of noise breaks down the conditional independence structure of the noiseless setting, and the noisy samples may no longer follow an Ising model distribution.
In 2011, [8] highlighted the importance of robustness in Ising models. Recent works [13, 14, 21] have tried to address this problem. However, they assume side information about the error probabilities, which is unavailable and difficult to estimate in most practical settings. In closely related work on tree-structured Ising models, [27, 28] address this problem by building on the seminal work of [10], popularly known as the Chow-Liu algorithm, which estimates the tree structure as the maximum-weight spanning tree under pairwise mutual information. In [27], they consider the simplified case where each node has an equal probability of error, and [28] assumes that the error does not alter the order of the pairwise mutual informations. Both assumptions imply that, asymptotically, Chow-Liu converges to the correct tree. However, these assumptions do not arise naturally and are difficult to verify from access to only noisy data. To the best of our knowledge, there does not exist an analysis of what happens beyond this limiting assumption of order preservation of mutual information.
In fact, Section 5.1 of [6] provides an example of the unidentifiability of the problem for a graph on 3 nodes and concludes that the problem is ill-defined. We reconsider this problem and show that for the special class of tree-structured Ising models, although the problem is not identifiable, the unidentifiability is limited to a small equivalence class of trees. Thus, one can more appropriately cast the problem of learning in the presence of unknown, unequal noise as the problem of learning this equivalence class.
Key Contributions
1. We show that the problem of learning tree-structured Ising models when the observations are flipped with independent, unknown, possibly unequal probabilities is unidentifiable (Theorem 2).
2. We show that this unidentifiability is limited to a small equivalence class of trees, obtained by leaf nodes exchanging positions with their neighbors, and that this equivalence class can be recovered from the noisy distribution (Theorem 1).
3. We provide an algorithm to recover this equivalence class from noisy samples. The sample complexity is logarithmic and the time complexity is polynomial in the number of nodes (Section 4).
4. We experimentally demonstrate that existing algorithms like Chow-Liu or Sparsitron for learning Ising models are not inherently robust against this noise model, whereas our algorithm correctly recovers the equivalence class (Section 5).
2 Related Work
Efficient algorithms for structure learning of Ising models can be divided into three main categories based on their assumptions: i) special graph structures [1, 10, 11, 32, 5], ii) nature of interaction between variables such as correlation decay property (CDP) [3, 4, 6, 20, 29], iii) bounded degree/width [2, 12, 18, 24, 35, 33]. However, these algorithms assume access to uncorrupted samples.
In the last decade, there has been a lot of research on robust estimation of graphical models [17, 19, 22, 23, 34, 36], and [17] explicitly considered partial identifiability in the presence of noise for the Gaussian case. However, extending the above frameworks to robust structure learning of Ising models remains a challenge. [13, 14, 21] have tried to solve the problem of robust estimation of general Ising models under the assumption of access to the probability of error for each node. Recently, [27, 28] proposed algorithms to estimate the underlying graph structure of tree-structured Ising models in the presence of noise under the strong assumption that the noise does not alter the order of the pairwise mutual informations for the tree. Both these assumptions are restrictive and impractical. In this paper, we present the first algorithm that can robustly recover the underlying tree-structured Ising model (up to an equivalence class) in the presence of corruption via unknown, unequal, independent noise.
3 Identifiability Result
Problem Setup:
Let $X_1, \dots, X_n$ be random variables with support on $\{-1, +1\}$ and let $X$ be the vector of these random variables. Suppose the conditional independence structure of the variables of $X$ is given by a tree $T^*$. This implies that the distribution of $X$ can be represented by an Ising model. In our model, we have observations where the sign of each $X_i$ flips independently with probability $q_i$. We denote the probability of error by the vector $q$ and the noisy random variables by $Y_1, \dots, Y_n$, collected in the vector $Y$. The error disrupts the tree structured conditional independence, and the graphical model of $Y$ is, in general, a complete graph when the $q_i$ are non-zero. In fact, $Y$ need not be an Ising model. Given samples of $Y$, we want to find the tree structure $T^*$. The ideas of the analysis for this problem are based on graph structure properties introduced in [17], where they were applied in the different context of Gaussian graphical models.
Equivalence Class for $T^*$
Given a tree $T^*$, let us define its equivalence class, denoted by $\mathcal{T}_{T^*}$. Let the set of all nodes that are connected to one or more leaf nodes of $T^*$ be denoted by $\mathcal{B} = \{b_1, \dots, b_m\}$. Construct sets $\mathcal{F}_1, \dots, \mathcal{F}_m$ from these nodes such that set $\mathcal{F}_i$ contains all the leaf nodes connected to $b_i$ in $T^*$ along with $b_i$ itself. Sample one node from each newly constructed set and create a subset $\mathcal{S} = \{s_1, \dots, s_m\}$; there exist $\prod_{i=1}^{m} |\mathcal{F}_i|$ such unique subsets. Create a tree $T_{\mathcal{S}}$ by setting each node $s_i \in \mathcal{S}$ as the internal node and all the remaining nodes of $\mathcal{F}_i$ as leaf nodes connected to it, in the same position where $\mathcal{F}_i$ was in $T^*$. Therefore, there is a one-to-one correspondence between a tree $T_{\mathcal{S}}$ and its corresponding set $\mathcal{S}$. We define the set of all such trees as $\mathcal{T}_{T^*}$.

Figure 1 gives an example of the equivalence class $\mathcal{T}_{T^*}$.
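As a concrete illustration, exchanging the positions of a leaf node and its neighboring internal node amounts to relabeling those two nodes in the tree. The following sketch (with a hypothetical adjacency-dictionary representation of the tree) constructs one member of the equivalence class this way:

```python
def swap_leaf_with_neighbor(adj, leaf, internal):
    """Return the tree obtained by exchanging the positions of `leaf` and its
    neighboring internal node `internal`; the result is a member of the
    equivalence class of the tree described by `adj` (dict: node -> set of neighbors)."""
    relabel = {leaf: internal, internal: leaf}
    return {relabel.get(u, u): {relabel.get(v, v) for v in nbrs}
            for u, nbrs in adj.items()}

# Example: chain 0 - 1 - 2 - 3, where 0 and 3 are leaves.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(swap_leaf_with_neighbor(chain, leaf=3, internal=2))
# {0: {1}, 1: {0, 3}, 3: {1, 2}, 2: {3}}  -> the chain 0 - 1 - 3 - 2
```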
Model Assumptions
Assumption 1.
(Bounded Mean) The absolute value of the mean of every node is bounded: $|\mathbb{E}[X_i]| \le \mu_{\max} < 1$ for all $i$.
Assumption 2.
(Bounded Correlation) The correlation of any two nodes $X_i$ and $X_j$ connected by an edge satisfies $\rho_{\min} \le |\rho_{ij}| \le \rho_{\max}$, where $0 < \rho_{\min} \le \rho_{\max} < 1$.
Assumption 3.
(Bounded error probability) The error probability of every node satisfies $0 \le q_i \le q_{\max} < 0.5$.
These assumptions arise naturally. Assumption 1 ensures that no variable approaches a constant and hence gets disconnected from the tree. The lower bound in Assumption 2 also ensures that every node is connected. The upper bound in Assumption 2 ensures that no two nodes are duplicates of each other. Assumption 3 ensures that no noisy node becomes independent of every other node due to the error.
Limited unidentifiability of the problem
In Theorem 1, we prove that it is possible to recover the equivalence class $\mathcal{T}_{T^*}$ from the samples of $Y$. Further, in Theorem 2 we prove that, given the distribution of $Y$, there exists an Ising model for each tree in $\mathcal{T}_{T^*}$ such that, for some noise vector, its noisy distribution is the same as that of $Y$.
Theorem 1.
Suppose $X$ and $X'$ are sets of binary valued random variables satisfying Assumption 1 whose conditional independence structures are given by trees $T$ and $T'$ respectively, both satisfying Assumption 2. Assume that each node in both these distributions is allowed to be flipped independently with a probability satisfying Assumption 3. Let $Y$ and $Y'$ represent the noisy versions of $X$ and $X'$ respectively. If the distributions of $Y$ and $Y'$ are the same, then $T' \in \mathcal{T}_{T}$.
Proof.
The proof of this theorem relies on this key observation: the probability distribution of the noisy samples completely determines the categorization of any set of 4 nodes as a star/non-star shape. Any set of 4 nodes forms a non-star if there exists at least one edge which, when removed, splits the tree into two subtrees such that exactly 2 of the 4 nodes lie in one subtree and the other 2 nodes lie in the other subtree. The nodes in the same subtree form a pair. If the set is not a non-star, it is categorized as a star. Figure 1 contains examples of sets of 4 nodes forming a non-star as well as a star.
Once we prove this key observation, the rest of the proof follows easily and we refer the reader to parts (ii) and (iii) of the proof of Theorem 2 from [17]. We denote the correlation between two nodes $X_i$ and $X_j$ in the non-noisy setting by $\rho_{ij}$ and in the noisy setting by $\tilde{\rho}_{ij}$. Similarly, the covariances are denoted by $\Sigma_{ij}$ and $\tilde{\Sigma}_{ij}$. We utilize the correlation decay property of tree structured Ising models which is stated in Lemma 1.
Lemma 1.
(Correlation Decay) Any 2 nodes $X_i$ and $X_j$ have the conditional independence relation specified by a tree structured Ising model such that the path between them is $(X_i, X_{k_1}, X_{k_2}, \dots, X_{k_m}, X_j)$ if and only if their correlation is given by:
$$\rho_{ij} = \rho_{i k_1}\, \rho_{k_1 k_2} \cdots \rho_{k_m j}. \qquad (1)$$
For example, for a chain $X_1 - X_2 - X_3 - X_4$, Lemma 1 gives $\rho_{14} = \rho_{12}\,\rho_{23}\,\rho_{34}$.
Categorizing a set of 4 nodes as star/non-star
We first look at a graphical model on 3 nodes $X_1, X_2, X_3$ whose conditional independence is given by a chain with $X_1 \perp X_3 \mid X_2$. By Lemma 1, we have $\rho_{13} = \rho_{12}\,\rho_{23}$, equivalently $\Sigma_{13} = \Sigma_{12}\Sigma_{23}/\Sigma_{22}$.
Suppose the signs of $X_1, X_2, X_3$ flip independently with probabilities $q_1, q_2, q_3$ respectively. Substituting the values of $\Sigma_{13}$, $\Sigma_{12}$, $\Sigma_{23}$ and $\Sigma_{22}$ in terms of their noisy counterparts (Lemma 4 in Appendix B and Equation (9)) gives us a quadratic in $(1-2q_2)$:
$$\tilde{\Sigma}_{13}\left[(1 - 2q_2)^2 - (\mathbb{E}[Y_2])^2\right] = \tilde{\Sigma}_{12}\, \tilde{\Sigma}_{23}. \qquad (2)$$
If we had prior knowledge about the underlying conditional independence relation, this quadratic equation, which depends only on quantities measurable from noisy data, could be solved to estimate the probability of error of $X_2$.
We prove in Appendix C that Equation (2) gives a valid solution for any configuration of 3 nodes in a tree structured Ising model. Therefore, in the absence of the knowledge that $X_1 \perp X_3 \mid X_2$, we can estimate a probability of error for each of the 3 nodes which enforces the underlying graph structure to represent the other 2 nodes as independent conditioned on that node. Thus, irrespective of the true underlying conditional independence relation, we can always find a probability of error for each node which makes the other pair of nodes conditionally independent. We use this concept to classify a tree on 4 nodes as star or non-star shaped.
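For concreteness, the following sketch (using the notation above; the function and argument names are ours) solves Equation (2) for the flip probability of the conditioning node from noisy moments:

```python
import numpy as np

def estimate_error_prob(cov_noisy, mean_noisy, i, j, k):
    """Estimate q_k, the flip probability of node k, assuming X_i ⊥ X_j | X_k.

    Solves the quadratic (2): cov_noisy[i,j] * ((1-2*q_k)**2 - mean_noisy[k]**2)
    = cov_noisy[i,k] * cov_noisy[k,j], where cov_noisy and mean_noisy are the
    covariance matrix and mean vector computed from the noisy samples Y."""
    one_minus_2q_sq = cov_noisy[i, k] * cov_noisy[k, j] / cov_noisy[i, j] \
                      + mean_noisy[k] ** 2
    one_minus_2q = np.sqrt(one_minus_2q_sq)   # valid configurations give a value in (0, 1]
    return 0.5 * (1.0 - one_minus_2q)         # keep the root with q_k < 0.5
```

Appendix C shows that, for any placement of the 3 nodes in the tree, the quantity under the square root lies in $(0, 1]$, so the returned value is a valid probability.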
We follow the notation where $q_k^{(ij)}$ denotes the estimated probability of error of $X_k$ which enforces $X_i \perp X_j \mid X_k$.
Condition for star/non-star shape:
Any set of 4 nodes $\{X_i, X_j, X_k, X_l\}$ is categorized as a non-star with $(X_i, X_j)$ forming one pair and $(X_k, X_l)$ forming another pair if and only if the error probability estimated for a node from the different conditional independence relations consistent with this pairing is the same.
From Equation (2), this is equivalent to the condition that
$$|\tilde{\rho}_{ik}\,\tilde{\rho}_{jl}| = |\tilde{\rho}_{il}\,\tilde{\rho}_{jk}| < |\tilde{\rho}_{ij}\,\tilde{\rho}_{kl}|.$$
Any set of 4 nodes is categorized as a star if and only if the error probability estimated for every node is the same across all the conditional independence relations involving the other 3 nodes.
This is equivalent to the condition that $|\tilde{\rho}_{ik}\,\tilde{\rho}_{jl}| = |\tilde{\rho}_{il}\,\tilde{\rho}_{jk}| = |\tilde{\rho}_{ij}\,\tilde{\rho}_{kl}|$.
In order to see how these conditions correspond to a star/non-star shape, let us consider a chain on 4 nodes $X_1 - X_2 - X_3 - X_4$ as shown in Figure 3. Let each $X_i$ be flipped with probability $q_i$. With access only to the noisy samples, we estimate the probability of error for each node in order to find the underlying tree. The key idea is that when we estimate the probability of error for a given node, it should be consistent across different conditional independence relations. For instance, in the present case, the error estimates $q_2^{(13)}$ and $q_2^{(14)}$ of $X_2$ satisfy $q_2^{(13)} = q_2^{(14)}$. We show that this equality holds (Lemma 2(a)), and we also prove that the estimate $q_2^{(34)}$ is not consistent with them (Lemma 2(b)). These imply that $X_1 \perp X_3 \mid X_2$ and $X_1 \perp X_4 \mid X_2$. By symmetry, we have $X_2 \perp X_4 \mid X_3$ and $X_1 \perp X_4 \mid X_3$. These conditional independence statements imply that $X_2$ and $X_3$ form a chain with $X_1$ on one side of the chain and $X_4$ on the other side of the chain.
Next, we consider the case when the 4 nodes form a star structured graphical model as in Figure 3. Under the same noisy observation setting, we prove that the error estimate of each node is consistent across all the conditional independence relations involving the other 3 nodes (Lemma 3). Thus, we can conclude that the underlying graphical model is star structured.


Lemma 2.
Let the graphical model on $X_1, X_2, X_3$ and $X_4$ form a chain as shown in Figure 3. Suppose the bits of each $X_i$ are flipped with probability $q_i$. Then the following holds: (a) $q_2^{(13)} = q_2^{(14)}$; (b) $q_2^{(34)} \neq q_2^{(13)}$.
Lemma 3.
Let the graphical model on $X_1, X_2, X_3$ and $X_4$ form a star as shown in Figure 3. Suppose the bits of each $X_i$ are flipped with probability $q_i$. Then, for every node, the error estimates obtained from the conditional independence relations involving the other 3 nodes coincide; for example, $q_1^{(23)} = q_1^{(24)} = q_1^{(34)}$.
The proof of these lemmas and the details of extending these results to generic trees require basic algebraic manipulations and can be found in Appendix D. ∎
Theorem 2.
Let $P_Y$ denote the probability distribution of $Y$ when the error probability of all the neighbors of leaf nodes of $T^*$ is non-zero. For any tree $T' \in \mathcal{T}_{T^*}$, there exists a set of random variables $X'$ with conditional independence given by $T'$ and a corresponding error probability vector $q'$ such that $P_{Y'} = P_Y$, where $P_{Y'}$ denotes the noisy distribution of $X'$.
We prove this theorem by explicit calculation of $q'$. We utilize Lemma 1 to enforce the conditional independence relations in any tree $T' \in \mathcal{T}_{T^*}$. The proof is included in Appendix E.
Interestingly, these unidentifiability results for noisy tree structured Ising models match the ones for noisy tree structured Gaussian graphical models proposed in [17], in spite of the two being graphical models on different classes of random variables. We remark that the algorithm proposed in [17] to learn the equivalence class operates in the infinite-sample regime and does not provide sample complexity results. In the next section, we develop an efficient algorithm to learn the equivalence class using a number of samples scaling logarithmically in the number of nodes.
4 Algorithm
In this section, we provide a detailed description of our algorithm (Algorithm 1) that can recover the set $\mathcal{T}_{T^*}$ from noisy samples. We emphasize that the edges we learn in this algorithm are from any one tree from the equivalence class and not necessarily from $T^*$ (as we have demonstrated, identifying $T^*$ itself is not possible). We illustrate our algorithm (as well as each subroutine) using a toy example (Figure 4).
We first introduce three notions which we use extensively in the algorithm: (i) equivalence clusters, (ii) proximal sets, and (iii) categorizing a set of 4 nodes as star/non-star using finite samples.
Equivalence Cluster:
As illustrated in Figure 4(a), an equivalence cluster is defined as a set containing an internal node and all the leaf nodes connected to it. We denote the equivalence cluster containing a node $X_i$ by $\mathrm{EC}(X_i)$.
Defining proximal sets:
Due to correlation decay, the correlation between an arbitrary pair of nodes can go down exponentially in the number of nodes, which would lead to exponential sample complexity. To avoid this, we only consider nodes with correlations above a constant threshold while making the star/non-star categorization. We define two proximal sets for each node, $\mathcal{A}_i$ and $\mathcal{B}_i$, where $\mathcal{A}_i$ is given by:
$$\mathcal{A}_i = \left\{ j : |\tilde{\Sigma}_{ij}| \ge t_1 \right\}.$$
In the above expression, we set $t_1 = (1 - 2q_{\max})^2\, \rho_{\min}^4\, (1 - \mu_{\max}^2)$, thereby guaranteeing that the set contains at least all the nodes within a radius of 4 of that node (on the true unknown graph). This is because the minimum variance of any node is $1 - \mu_{\max}^2$, the minimum absolute correlation of any 2 nodes at distance 4 in the absence of noise is $\rho_{\min}^4$, and the noise can decrease the covariance by a factor of at most $(1 - 2q_{\max})^2$.

It is possible that $j \in \mathcal{A}_i$ even if $X_i$ and $X_j$ are not neighbors in the true graph. We define the second set to guarantee that, in this case, at least the first node on the path from $X_i$ to $X_j$ is also included. Using the correlation decay property, the noisy covariance expression, and the bounds from Assumptions 1-3, this set is given by:
$$\mathcal{B}_i = \left\{ j : |\tilde{\Sigma}_{ij}| \ge t_2 \right\},$$
where $t_2 < t_1$ is a constant determined by these bounds. Using the sample covariance values, we construct sets $\hat{\mathcal{A}}_i$ and $\hat{\mathcal{B}}_i$ that include at least all the nodes in $\mathcal{A}_i$ and $\mathcal{B}_i$ with high probability. We denote the sample covariance between nodes $i$ and $j$ obtained from the noisy observations by $\hat{\Sigma}_{ij}$. The sets $\hat{\mathcal{A}}_i$ and $\hat{\mathcal{B}}_i$ are defined by applying the thresholds $t_1$ and $t_2$ to $|\hat{\Sigma}_{ij}|$ with a slack that accounts for the finite-sample estimation error.
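A sketch of this construction (the thresholds t1, t2 and the slack eps stand in for the exact constants above):

```python
import numpy as np

def proximal_sets(samples, t1, t2, eps):
    """Build the empirical proximal sets from noisy samples.

    samples: (N, n) array of +/-1 observations of the noisy variables Y.
    t1, t2 : covariance thresholds for the two proximal sets (t2 < t1).
    eps    : finite-sample slack so the true sets are contained w.h.p.
    Returns two dicts mapping each node to its proximal set."""
    cov = np.cov(samples, rowvar=False)          # empirical noisy covariance
    n = cov.shape[0]
    A_hat, B_hat = {}, {}
    for i in range(n):
        A_hat[i] = {j for j in range(n) if j != i and abs(cov[i, j]) >= t1 - eps}
        B_hat[i] = {j for j in range(n) if j != i and abs(cov[i, j]) >= t2 - eps}
    return A_hat, B_hat
```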
Classifying into star/non-star shape:
It is easy to see (refer to Equation (16) in Appendix D.4) that when any set of 4 nodes $\{X_i, X_j, X_k, X_l\}$ forms a non-star such that $(X_i, X_j)$ form a pair, we have:
$$|\tilde{\rho}_{ik}\,\tilde{\rho}_{jl}| = |\tilde{\rho}_{il}\,\tilde{\rho}_{jk}| < |\tilde{\rho}_{ij}\,\tilde{\rho}_{kl}|.$$
Therefore, using the finite sample estimates of the noisy correlations, the categorization of 4 nodes into star/non-star such that $(X_i, X_j)$ form a pair is given in Table 1.
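The following sketch implements this test on the empirical noisy correlation matrix (the slack eps and the exact comparisons of Table 1 are simplified here):

```python
import numpy as np

def classify_quad(corr, i, j, k, l, eps):
    """Classify the 4 nodes {i, j, k, l} as 'star' or 'non-star' from the empirical
    noisy correlation matrix `corr`, using the product test described above.
    Returns ('non-star', pairing) or ('star', None); eps is a finite-sample slack."""
    # The three ways of splitting {i, j, k, l} into two pairs.
    splits = [((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))]
    prods = [abs(corr[a, b] * corr[c, d]) for (a, b), (c, d) in splits]
    top = int(np.argmax(prods))
    others = [p for m, p in enumerate(prods) if m != top]
    # Non-star: one product clearly dominates and the other two agree.
    if prods[top] - max(others) >= eps and abs(others[0] - others[1]) < eps:
        return "non-star", splits[top]
    return "star", None
```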
We first describe the key steps of the algorithm using the example from Figure 4.
Initialization:
Category | Condition on the empirical noisy correlations
---|---
Non-star with pairs $(X_i, X_j)$ and $(X_k, X_l)$ | $\big|\,|\hat{\rho}_{ik}\hat{\rho}_{jl}| - |\hat{\rho}_{il}\hat{\rho}_{jk}|\,\big| < \epsilon$ and $|\hat{\rho}_{ij}\hat{\rho}_{kl}| - |\hat{\rho}_{ik}\hat{\rho}_{jl}| \ge \epsilon$
Star | all three products $|\hat{\rho}_{ij}\hat{\rho}_{kl}|$, $|\hat{\rho}_{ik}\hat{\rho}_{jl}|$, $|\hat{\rho}_{il}\hat{\rho}_{jk}|$ are within $\epsilon$ of each other
The algorithm starts by learning the edges of a leaf node (edge (14,15)). It searches for a leaf node by looking for an equivalence cluster with more than one node. This is achieved using the FindEC subroutine which, given a node, can find all the nodes in its equivalence cluster.
Recursion:
After finding a pair of a leaf and an internal node, the algorithm calls Recurse to find all the remaining edges recursively. This is done in 2 steps: (i) split the nodes in the proximal set of the internal node into the different subtrees connected to the internal node using SplitTree (subtree 1 = nodes 0-6, subtree 2 = nodes 7-13); (ii) find the nodes in each subtree which have an edge with the internal node. This is done by the simple observation that when we consider the internal node along with one such subtree, the internal node acts like a leaf node for the tree on just these nodes (node 14 is a leaf node for the tree on 1, 2, 3, 4, 5, 6, 14). Hence, FindEC can be used on this subset of nodes. Now, we have a pair of a leaf node and an internal node on a subset of nodes, and we call Recurse just on this subset of nodes.
The runtime of the algorithm is polynomial in the number of nodes; the exact complexity analysis is given in Appendix F.
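The overall control flow can be sketched as follows (a high-level structural sketch only: the star/non-star test, the proximal sets, and the internals of FindEC and SplitTree are abstracted behind callables, and all names are ours):

```python
def find_tree(nodes, find_ec, split_tree):
    """Recover the edges of one tree from the equivalence class.

    nodes      : list of node labels.
    find_ec    : callable (node, candidates) -> set containing the node and the rest
                 of its equivalence cluster restricted to `candidates`.
    split_tree : callable (leaf, internal, excluded) -> list of node sets, one per
                 subtree hanging off `internal`."""
    node_set = set(nodes)
    edges = set()

    def recurse(leaf, internal, excluded):
        for subtree in split_tree(leaf, internal, excluded):
            # Within `subtree`, `internal` behaves like a leaf; find the cluster
            # of the subtree that has an edge with it.
            cluster = find_ec(internal, set(subtree) | {internal}) - {internal}
            center = next(iter(cluster))          # arbitrary representative
            edges.add((internal, center))
            edges.update((center, u) for u in cluster if u != center)
            recurse(internal, center,
                    excluded | (node_set - set(subtree)) | cluster)

    # Initialization: find any equivalence cluster with more than one node.
    for v in nodes:
        cluster = find_ec(v, node_set)
        if len(cluster) > 1:
            internal = next(iter(cluster))
            leaf = next(u for u in cluster if u != internal)
            edges.update((internal, u) for u in cluster if u != internal)
            recurse(leaf, internal, set(cluster) - {internal})
            return edges
    return edges
```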
Description of subroutines
SplitTree
This function takes in a pair of leaf and internal nodes from a subtree. To ensure that we only consider nodes in the subtree, it also takes in a set which contains the nodes assigned to other subtrees in previous recursive calls. It splits the nodes from the common proximal sets of the leaf and internal nodes that lie in the subtree into smaller subtrees that have an edge with the internal node. The property used is that 2 nodes lie in the same smaller subtree if and only if, together with the leaf and the internal node, they form a non-star in which the 2 nodes form a pair. To make sure that no node outside the subtree is considered, we do not split any node which pairs with a node from the excluded set when considered along with the leaf and the internal node. This requires time quadratic in the size of the proximal sets.
FindEC
This function takes in a node $X_i$ and a set of nodes as inputs and returns all the nodes of $\mathrm{EC}(X_i)$ from that set. Two nodes are in the same equivalence cluster if every categorization of a set of 4 nodes containing them as star/non-star either results in a star or in a non-star where these 2 nodes form a pair. Essentially, for all the nodes in the proximal set of $X_i$, we verify whether they satisfy this condition. We also add edges by arbitrarily choosing any node from $\mathrm{EC}(X_i)$ as the center node, as we are only interested in recovering $\mathcal{T}_{T^*}$. This requires time quadratic in the size of the proximal set.
Recurse
This function takes the same arguments as SplitTree and calls SplitTree with those arguments to obtain the different smaller subtrees. It then uses FindEC to find the edges between the internal node and these smaller subtrees. When a smaller subtree is considered along with the internal node, the internal node acts as its leaf node and the node it has an edge with acts as an internal node. For each of the smaller subtrees, it adds the nodes of the remaining smaller subtrees to the excluded set, along with the equivalence cluster from the current subtree which had an edge with the internal node, as that has already been learnt. It then calls Recurse on each of these smaller subtrees.
Need for 2 proximal sets $\hat{\mathcal{A}}$ and $\hat{\mathcal{B}}$:
Consider the implementation of FindEC for a node $X_i$. For each node $X_j$ in the proximal set, we check if it lies in $\mathrm{EC}(X_i)$. For all pairs of nodes $X_k, X_l \in \hat{\mathcal{A}}_i$, we could check whether the condition for $X_j \in \mathrm{EC}(X_i)$ is satisfied. If $X_j \notin \mathrm{EC}(X_i)$ but no node on the path from $X_i$ to $X_j$ is in $\hat{\mathcal{A}}_i$, we would incorrectly conclude that $X_j \in \mathrm{EC}(X_i)$. To avoid this corner case, instead of considering pairs from $\hat{\mathcal{A}}_i$, we consider pairs from $\hat{\mathcal{B}}_i$. By the definition of $\mathcal{B}_i$, $\hat{\mathcal{B}}_i$ contains a node from the path from $X_i$ to $X_j$. This results in a set of 4 nodes forming a non-star in which $X_i$ and $X_j$ do not form a pair, and hence FindEC correctly concludes that $X_j \notin \mathrm{EC}(X_i)$.
$\hat{\mathcal{B}}$ plays a crucial role in SplitTree too. Consider a recursive call to Recurse such that, in the previous recursion, the nodes in $\hat{\mathcal{A}}$ of that step were added to the excluded set. However, in the current step there might be nodes in $\hat{\mathcal{A}}$ from other subtrees which are not in the excluded set. To ensure that any such node is not considered while creating smaller subtrees, we also consider the nodes in $\hat{\mathcal{B}}$ and ensure that they do not pair with the excluded nodes when considered with the present recursive call's leaf and internal nodes.
Setting the radius as 4:
The radius is set as 4 to make sure that each node is considered with at least its nearest neighboring nodes when classified as star/non-star. This would ideally require a radius of 3 but to account for the unidentifiability between a leaf node and its neighbor, we consider a radius of 4.
The detailed description of the complete algorithm including these subroutines- their pseudo-code, proof of correctness and time complexity analysis is presented in Appendix F.
Next, we present the sample complexity result of our algorithm.
Theorem 3.
Consider an Ising model on $n$ nodes whose graph structure is the tree $T^*$ and which satisfies Assumptions 1 and 2. We get noisy samples from this Ising model where the samples from each node are flipped independently with unknown, unequal probabilities satisfying Assumption 3. Algorithm 1 correctly recovers $\mathcal{T}_{T^*}$ with probability at least $1 - \delta$ if the number of samples is lower bounded by a quantity scaling as $O\!\left(\log\frac{n}{\delta}\right)$, with constants that depend only on $\rho_{\min}$, $\rho_{\max}$, $\mu_{\max}$ and $q_{\max}$.
The proof is included in Appendix G.
Running the algorithm in the absence of the knowledge of $\mu_{\max}$ and $q_{\max}$:
When we only have access to noisy samples, $\rho_{\min}$, and $\rho_{\max}$, we can set $\mu_{\max}$ and $q_{\max}$ to conservative upper bounds and solve for the number of samples given an error budget $\delta$. This would be a conservative estimate of the number of samples and, in practice, the algorithm would work for lower values.
5 Experiments
Experimental Setup
The probability distribution of an Ising model is represented by
$$P(X = x) \propto \exp\!\left(\tfrac{1}{2}\, x^\top W x + \theta^\top x\right), \qquad x \in \{-1, +1\}^n,$$
where $W$ is the symmetric weight matrix with 0 on the diagonal and $\theta$ is the bias vector. The support of $W$ defines the Ising model structure. Therefore, a chain structured Ising model with nodes labeled consecutively satisfies $W_{ij} \neq 0$ if and only if $|i - j| = 1$. A star structured Ising model with node 1 as the internal node satisfies $W_{ij} \neq 0$ ($i \neq j$) if and only if $i = 1$ or $j = 1$.

We sample each non-zero element of $W$ uniformly from an interval $[w_{\min}, w_{\max}]$. The probability of error for each node is sampled uniformly from $[0, q_{\max}]$.
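A sketch of this data-generating process (Gibbs sampling is our choice here; the weight and error ranges in the example are placeholders, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sample_ising(W, theta, n_samples, burn_in=1000, thin=10):
    """Gibbs sampler for P(x) ∝ exp(0.5 x^T W x + theta^T x), x in {-1,+1}^n."""
    n = len(theta)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for t in range(burn_in + n_samples * thin):
        for i in range(n):
            # P(x_i = +1 | rest) = sigmoid(2 * (sum_j W_ij x_j + theta_i)), W_ii = 0.
            field = W[i] @ x - W[i, i] * x[i] + theta[i]
            x[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)

def add_flip_noise(samples, q):
    """Independently flip the sign of node i with probability q[i]."""
    flips = rng.random(samples.shape) < q
    return np.where(flips, -samples, samples)

# Example: chain on 5 nodes with random edge weights and flip probabilities.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = rng.uniform(0.3, 0.8)   # hypothetical weight range
theta = np.zeros(n)
q = rng.uniform(0.0, 0.2, size=n)                        # hypothetical error range
Y = add_flip_noise(gibbs_sample_ising(W, theta, n_samples=2000), q)
```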
Comparison with the Chow-Liu algorithm is presented in subsection 5.1. In subsection 5.2, we demonstrate the performance of our algorithm as compared to the Sparsitron algorithm. Subsections 5.3, 5.4 and 5.5 report the impact of the maximum error probability $q_{\max}$, the maximum edge weight $w_{\max}$ and the minimum edge weight $w_{\min}$, respectively, on the algorithm's performance. All of these experiments are done for chain structured Ising models with zero bias ($\theta = 0$).
We next demonstrate the performance of our algorithm on star-structured Ising models in subsection 5.6 as the number of nodes increases. To begin with, we stick to $\theta = 0$.
Finally, we study the impact of $\theta$ (the non-zero mean) on the algorithm performance for both star-structured and chain-structured Ising models in subsection 5.7.

5.1 Our Algorithm vs Chow-Liu
Figure 5 depicts that as the number of samples increases, the fraction of correct recoveries using our algorithm approaches 1. Chow-Liu incorrectly converges to a tree not in the equivalence class due to the unequal noise in the nodes.
5.2 Our Algorithm vs Sparsitron

In Figure 6, we compare the performance of our algorithm with the Sparsitron algorithm for a chain-structured Ising model on 15 nodes. To evaluate the Sparsitron algorithm, we take the output weight matrix and find the maximum weight spanning tree. We call the algorithm a success if this tree is from the equivalence class $\mathcal{T}_{T^*}$. We can see that the Sparsitron has a low success rate.
5.3 Effect of the Maximum Error Probability
We fix the remaining parameters and vary $q_{\max}$. We evaluate on a chain structured Ising model on 15 nodes. Figure 7 illustrates the effect of $q_{\max}$ on the performance of our algorithm. As expected, an increase in the probability of error makes it harder to recover the equivalence class.
5.4 Effect of the Maximum Weight
Next, we look at the effect of $w_{\max}$. We fix the remaining parameters and vary $w_{\max}$. We evaluate on a chain structured Ising model on 15 nodes.
Intuitively, a high $w_{\max}$ makes it difficult to differentiate between a star and a non-star, therefore the algorithm is expected to perform better for lower $w_{\max}$. This is indeed what happens, as shown in Figure 8.
Figure 8: The performance of our algorithm on a chain of 15 nodes over 50 runs for an increasing number of samples, for different values of $w_{\max}$.
Figure 9: The performance of our algorithm on a chain of 15 nodes over 50 runs for an increasing number of samples, for different values of $w_{\min}$.
5.5 Effect of the Minimum Weight
We also look at the effect of $w_{\min}$. We fix the remaining parameters and vary $w_{\min}$. We evaluate on a chain structured Ising model on 15 nodes. We can expect that smaller edge weights make it difficult to accurately estimate the correlation values, resulting in higher errors. This is illustrated in Figure 9.
5.6 Algorithm performance for Star-structured tree Ising Model
Till now, all the experiments have used chain structured Ising models. In this subsection, we consider the other extreme tree structure: a star structure with one internal node and the remaining leaf nodes connected to the internal node. We fix the remaining parameters and consider three different graph sizes - 11 nodes, 13 nodes and 15 nodes. The performance of the algorithm is illustrated in Figure 10.

As we can see, compared to the chain structured Ising models, star-structured Ising models require a much smaller number of samples. This is due to the small radius, which results in high absolute values of the correlations. Also, the performance is only slightly impacted by the number of nodes, which can also be attributed to the small radius.
5.7 Effect of bias on the algorithm performance.
We observe a very interesting phenomenon when we study the impact of the bias term $\theta$. For chain structured Ising models, the algorithm performs better for lower absolute values of $\theta$. However, for star structured Ising models, the algorithm performs better for higher absolute values of $\theta$. For both the cases, we consider a tree on 11 nodes; $\mu_{\max}$ is estimated empirically. We set all the entries in $\theta$ to be equal.
For the chain structured Ising model, we consider 3 cases - (i) all the entries in $\theta$ equal to 0.0, (ii) all the entries in $\theta$ equal to 0.02, and (iii) all the entries in $\theta$ equal to 0.04. The result is provided in Figure 11.
Figure 11: The performance of our algorithm on a chain of 11 nodes over 50 runs for an increasing number of samples, for varying bias values.
Figure 12: The performance of our algorithm on a star of 11 nodes over 50 runs for an increasing number of samples, for varying bias values.
For the star structured Ising model, we consider 3 cases - (i) all the entries in $\theta$ equal to 0.0, (ii) all the entries in $\theta$ equal to 0.2, and (iii) all the entries in $\theta$ equal to 0.4. The result is provided in Figure 12.
We now give an intuitive justification of this observation. A higher value of the bias results in a smaller threshold for the proximal sets. For chain structured Ising models, due to correlation decay, we can expect the correlation values for some pairs of random variables to be close to the threshold. Estimating them accurately requires more samples.
Star structured Ising models, on the other hand, have higher correlation values due to a smaller diameter of 2. A low threshold ensures that no node is mistakenly left out when constructing the proximal sets. Therefore, the performance is better for higher bias values.
Appendix A Proof of Lemma 1
We prove this by induction on the number of nodes in the path between any 2 nodes $X_i$ and $X_j$.
Base Case:
The path is $(X_1, X_2, X_3)$; therefore we have $X_1 \perp X_3 \mid X_2$. For random variables with a support size of 2, this is true if and only if they are conditionally uncorrelated, that is,
$$\mathbb{E}[X_1 X_3 \mid X_2] = \mathbb{E}[X_1 \mid X_2]\; \mathbb{E}[X_3 \mid X_2]. \qquad (3)$$
$\mathbb{E}[X_1 \mid X_2]$ is linear in $X_2$ since the support size of $X_2$ is 2, and therefore we need to fit only 2 points, $X_2 = +1$ and $X_2 = -1$, to completely represent the conditional expectation. Therefore the linear least squares error (LLSE) estimator of $X_1$ given $X_2$ is also the minimum mean squared error estimator $\mathbb{E}[X_1 \mid X_2]$. Utilizing the standard result for LLSE, we have:
$$\mathbb{E}[X_1 \mid X_2] = \mathbb{E}[X_1] + \frac{\Sigma_{12}}{\Sigma_{22}}\left(X_2 - \mathbb{E}[X_2]\right). \qquad (4)$$
Similarly, we have:
$$\mathbb{E}[X_3 \mid X_2] = \mathbb{E}[X_3] + \frac{\Sigma_{23}}{\Sigma_{22}}\left(X_2 - \mathbb{E}[X_2]\right). \qquad (5)$$
Substituting $\mathbb{E}[X_1 \mid X_2]$ and $\mathbb{E}[X_3 \mid X_2]$ from Equations (4) and (5) in Equation (3) and taking the expectation over $X_2$, we get $\Sigma_{13} = \frac{\Sigma_{12}\,\Sigma_{23}}{\Sigma_{22}}$. Therefore we get $\frac{\Sigma_{13}}{\sqrt{\Sigma_{11}\Sigma_{33}}} = \frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}}\cdot\frac{\Sigma_{23}}{\sqrt{\Sigma_{22}\Sigma_{33}}}$, which implies $\rho_{13} = \rho_{12}\,\rho_{23}$.
Inductive Case:
Let the statement be true for any path involving $m$ nodes. For a path $(X_1, X_2, \dots, X_{m+1})$ we have $X_1 \perp X_{m+1} \mid X_m$. Therefore the same calculation as the base case holds true by replacing $X_2$ with $X_m$ and $X_3$ with $X_{m+1}$. Therefore $\rho_{1,m+1} = \rho_{1,m}\,\rho_{m,m+1}$. By the inductive assumption, $\rho_{1,m} = \rho_{12}\,\rho_{23}\cdots\rho_{m-1,m}$, therefore $\rho_{1,m+1} = \rho_{12}\,\rho_{23}\cdots\rho_{m,m+1}$.
Appendix B Proof of Covariance of noisy variables.
Lemma 4.
Consider 2 random variables $X_i$ and $X_j$ with support on $\{-1, +1\}$ whose covariance is denoted by $\Sigma_{ij}$. Now consider the noisy versions of these random variables, $Y_i$ and $Y_j$, whose covariance is denoted by $\tilde{\Sigma}_{ij}$. Then we have:
$$\tilde{\Sigma}_{ij} = (1 - 2q_i)(1 - 2q_j)\,\Sigma_{ij}.$$
Proof.
By the noise model we have:
$$\mathbb{E}[Y_i] = (1 - q_i)\,\mathbb{E}[X_i] + q_i\,\mathbb{E}[-X_i] = (1 - 2q_i)\,\mathbb{E}[X_i]. \qquad (6)$$
We also have, since the sign flips of the two nodes are independent of each other and of $X$:
$$\mathbb{E}[Y_i Y_j] = (1 - 2q_i)(1 - 2q_j)\,\mathbb{E}[X_i X_j]. \qquad (7)$$
Therefore,
$$\tilde{\Sigma}_{ij} = \mathbb{E}[Y_i Y_j] - \mathbb{E}[Y_i]\,\mathbb{E}[Y_j] = (1 - 2q_i)(1 - 2q_j)\left(\mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\,\mathbb{E}[X_j]\right) = (1 - 2q_i)(1 - 2q_j)\,\Sigma_{ij}. \qquad (8)$$
∎
We can use Equation (6) to calculate the variance of every random variable in terms of its noisy counterpart as follows:
$$\Sigma_{ii} = 1 - (\mathbb{E}[X_i])^2 = 1 - \frac{(\mathbb{E}[Y_i])^2}{(1 - 2q_i)^2}. \qquad (9)$$
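A quick numerical sanity check of Lemma 4 (a sketch; the correlation and flip probabilities below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated +/-1 variables: x_j agrees with x_i with probability (1 + rho) / 2.
N, rho, q_i, q_j = 200_000, 0.6, 0.1, 0.3
x_i = rng.choice([-1, 1], size=N)
x_j = np.where(rng.random(N) < (1 + rho) / 2, x_i, -x_i)

# Independent sign flips with probabilities q_i and q_j.
y_i = np.where(rng.random(N) < q_i, -x_i, x_i)
y_j = np.where(rng.random(N) < q_j, -x_j, x_j)

cov_clean = np.cov(x_i, x_j)[0, 1]
cov_noisy = np.cov(y_i, y_j)[0, 1]
print(cov_noisy, (1 - 2 * q_i) * (1 - 2 * q_j) * cov_clean)   # approximately equal
```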
Appendix C Proof that the Quadratic gives a valid solution
Consider the quadratic in Equation (2). We prove that this equation always has a valid solution for any set of 3 nodes in a tree structured graphical model.
The different possible configurations of any 3 nodes $X_1$, $X_2$ and $X_3$ in a tree structured graphical model are shown in Figure 13.

For case (a), where $X_1$ lies between $X_2$ and $X_3$, we have $\rho_{23} = \rho_{12}\,\rho_{13}$ by Lemma 1. This gives us:
$$\frac{\tilde{\Sigma}_{12}\,\tilde{\Sigma}_{23}}{\tilde{\Sigma}_{13}} = (1 - 2q_2)^2\,\frac{\Sigma_{12}\,\Sigma_{23}}{\Sigma_{13}} = (1 - 2q_2)^2\,\rho_{12}^2\,\Sigma_{22}.$$
Using the assumption that the absolute value of the correlation is upper bounded away from 1 and lower bounded away from 0, we have $0 < \rho_{12}^2 < 1$. Also, $\Sigma_{22} = 1 - (\mathbb{E}[X_2])^2$ and $(\mathbb{E}[Y_2])^2 = (1 - 2q_2)^2 (\mathbb{E}[X_2])^2$. Therefore, for case (a), $0 < \frac{\tilde{\Sigma}_{12}\,\tilde{\Sigma}_{23}}{\tilde{\Sigma}_{13}} + (\mathbb{E}[Y_2])^2 < 1$ and the quadratic equation has valid roots. By symmetry, the quadratic equation gives valid roots for case (b) too.
Case (c) is the underlying truth, therefore the quadratic equation recovers the true underlying error.
For case (d), in which the 3 nodes are connected through an additional node $X_h$, we have $\rho_{12} = \rho_{1h}\,\rho_{h2}$, $\rho_{23} = \rho_{2h}\,\rho_{h3}$ and $\rho_{13} = \rho_{1h}\,\rho_{h3}$. This gives us:
$$\frac{\tilde{\Sigma}_{12}\,\tilde{\Sigma}_{23}}{\tilde{\Sigma}_{13}} = (1 - 2q_2)^2\,\rho_{2h}^2\,\Sigma_{22}. \qquad (11)$$
The same arguments as case (a) hold true for case (d) with $\rho_{12}$ replaced by $\rho_{2h}$. Therefore, the quadratic has a valid solution in this case too.
Appendix D Proof of Lemma 2, Lemma 3 and Star/Non-star Condition for Generic Trees
D.1 Proof of Lemma 2(a)
Proof. Note that $q_2^{(13)}$ and $q_2^{(14)}$ are given by solving an equation similar to (2). As the solution to the quadratic is defined completely by the covariance terms, all we need to prove is:
$$\frac{\tilde{\Sigma}_{12}\,\tilde{\Sigma}_{23}}{\tilde{\Sigma}_{13}} = \frac{\tilde{\Sigma}_{12}\,\tilde{\Sigma}_{24}}{\tilde{\Sigma}_{14}}, \quad \text{equivalently} \quad \tilde{\Sigma}_{23}\,\tilde{\Sigma}_{14} = \tilde{\Sigma}_{24}\,\tilde{\Sigma}_{13}.$$
By substituting the value of $\tilde{\Sigma}_{ij}$ from Equation (8), we now need to prove that $\Sigma_{23}\,\Sigma_{14} = \Sigma_{24}\,\Sigma_{13}$.
Using the correlation decay property, we get that $\rho_{14} = \rho_{12}\,\rho_{23}\,\rho_{34}$, $\rho_{13} = \rho_{12}\,\rho_{23}$ and $\rho_{24} = \rho_{23}\,\rho_{34}$. Therefore LHS = RHS $= \rho_{12}\,\rho_{23}^2\,\rho_{34}\,\sqrt{\Sigma_{11}\Sigma_{22}\Sigma_{33}\Sigma_{44}}$.
D.2 Proof of Lemma 2(b)
D.3 Proof of Lemma 3
Proof. This is equivalent to proving that
which is again equivalent to:
Using the correlation decay property it is easy to see that:
D.4 Proof of Star/Non-star Condition for Generic Trees
We show how to utilize the result on a chain or star of 4 nodes to classify any set of 4 nodes as star/non-star in a generic tree.
If any 4 nodes in a tree graphical model form a non-star shape such that $(X_i, X_j)$ form a pair and the 4 nodes are not arranged in a chain, there exist nodes $X_u$ and $X_v$ such that the conditional independence structure is given by either Figure 14(a) or 14(b).

For the conditional independence in Figure 14(a), we know that:
(13)
This gives us . Similarly, we have .
We also know that:
(14)
These equations imply and . If the conditional independence is as shown in Figure 14(b), we have:
(15)
These equations give us . Furthermore, by Equation (12), we have:
$$|\tilde{\rho}_{ik}\,\tilde{\rho}_{jl}| = |\tilde{\rho}_{il}\,\tilde{\rho}_{jk}| < |\tilde{\rho}_{ij}\,\tilde{\rho}_{kl}|. \qquad (16)$$
By symmetry, the remaining conditions in Equation (3) are also satisfied.

When the 4 nodes form a star structure in the tree, their conditional independence is given by either Figure 3 or there exists a node such that the conditional independence is as shown in Figure 15. Lemma 3 proves that Equation 3 is satisfied if the conditional independence is given by Figure 3. If the conditional independence is given by Figure 15, we have:
(17)
This implies that . By symmetry, all the remaining conditions of Equation 3 are also satisfied.
This completes the proof that just by having access to the noisy probability distribution, it is possible to categorize any set of 4 nodes as a star/non-star shape.
Appendix E Proof of Theorem 2
Given the noisy covariances and an estimate $q'$ of the error probability vector, we estimate the non-noisy covariance as:
$$\Sigma'_{ij} = \frac{\tilde{\Sigma}_{ij}}{(1 - 2q'_i)(1 - 2q'_j)}. \qquad (18)$$
For the error probability vector $q'$, using Equation (9), the non-noisy variance is estimated as:
$$\Sigma'_{ii} = 1 - \frac{(\mathbb{E}[Y_i])^2}{(1 - 2q'_i)^2}. \qquad (19)$$
To check if any conditional independence relation is true, we need to verify whether it satisfies the correlation decay equation of Lemma 1.
We first consider a tree $T' \in \mathcal{T}_{T^*}$ where only one leaf node exchanges position with its neighbor. Suppose in the original tree, node $X_l$ is a leaf node connected to node $X_b$.
Consider the error vector $q'$:
(20)
To prove that this error vector results in , we need to prove that any node which satisfies in must satisfy in . We note that:
(21)
Using , it is easy to check that .
Furthermore, we need to prove that any pair of nodes such that in satisfy in . Doing similar substitutions by replacing node 2 by node and node by node gives us which proves that .
The remaining conditional independences, not involving $X_l$ and $X_b$, remain intact as the error probability for the remaining nodes is set to the original probability of error.
Now, consider a tree in which a set of leaf nodes exchange positions with their neighbors. For this case consider the error probability vector such that:
This is obtained by performing the same procedure on each leaf node one by one.
Appendix F Pseudo-code, Proof of Correctness, Time-complexity
F.1 Pseudo-code, proof of correctness and time-complexity for FindEC
The pseudo-code is presented in Algorithm 3.
Proof of correctness
By the definition of $\mathcal{A}_i$, $\hat{\mathcal{A}}_i$ contains at least all the nodes within a radius of 4 of $X_i$. Therefore, it contains all the nodes in the equivalence cluster containing $X_i$. Now we check for each node in $\hat{\mathcal{A}}_i$ whether it is in the equivalence cluster of $X_i$.
For any node $X_j$ to be in the equivalence cluster of $X_i$, every set of 4 nodes containing $X_i$ and $X_j$ needs to either be a star shape or a non-star shape where $X_i$ and $X_j$ form a pair.
In order to check this, it is sufficient to check the condition for sets of 4 nodes $\{X_i, X_j, X_k, X_l\}$ for one arbitrarily chosen $X_k$ and every $X_l \in \hat{\mathcal{B}}_i$. Suppose $X_j$ is not in the same equivalence cluster as $X_i$. This implies that there exist nodes on the path from $X_i$ to $X_j$. By the definition of the set $\mathcal{B}_i$, the node on this path with a potential edge with $X_i$ is in $\hat{\mathcal{B}}_i$. This will result in a set of 4 nodes forming a non-star in which $X_i$ and $X_j$ do not form a pair.
Time-complexity
As there are 2 for loops, it is easy to see that this subroutine runs in time quadratic in the size of the proximal set.
F.2 Pseudo-code, proof of correctness and Time-complexity for SplitTree
The pseudo code is provided in Algorithm 4.
Proof of Correctness
is such that . Furthermore, if any node then the nodes from the equivalence cluster that has an edge with from that connected component obtained by removing edges with are also in . Let denote the set of nodes in the subtrees of the original tree obtained by removing that don’t contain any node from . We split the nodes in into subtrees. We first check if a node . and if and only if or there exists such that form a non-star and (, ) form a pair. This would happen when is a node from the equivalence cluster that has an edge with . It is easy to see that are in each other’s proximal sets. By the definition of , . Also is within a radius of 4 from , therefore .
For the remaining nodes, 2 nodes lie in the same subtree if, together with the leaf and internal nodes, they form a non-star such that the 2 nodes are in the same pair. This is true from the definition of the non-star shape. To implement this, we construct a square matrix with one row and one column per candidate node, which contains 1 in position $(u, v)$ if the nodes $u$ and $v$ are in each other's proximal sets and the corresponding set of 4 nodes is classified as a non-star such that $u$ and $v$ form a pair. It has 0 at all other positions. If it turns out that a node should be excluded, the row corresponding to that node is set to 0. Then, we merge the nodes which have 1 in common columns till no 2 rows have a 1 in a common column. The creation of the matrix and the merging are both quadratic-time operations, therefore the whole function runs in time quadratic in the number of candidate nodes.
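The merging step can be sketched as follows (a sketch assuming a callable same_pair(u, v) that runs the non-star test with u and v paired; merging rows with shared columns is equivalent to taking connected components of the pairing relation):

```python
def split_into_subtrees(candidates, same_pair):
    """Group candidate nodes into subtrees: u and v end up in the same group when
    the non-star test pairs them together (same_pair(u, v) is True).  A simple
    union-find computes the transitive closure of that pairing relation.
    Node labels are assumed comparable (e.g. integers)."""
    parent = {u: u for u in candidates}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for u in candidates:
        for v in candidates:
            if u < v and same_pair(u, v):
                parent[find(u)] = find(v)

    groups = {}
    for u in candidates:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())
```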
F.3 Proof of Correctness of Recurse
Lemma 5.
Recurse correctly recovers all the edges between the nodes of the current subtree as well as the edges between the subtree and the internal node passed to it.
Proof.
We prove this lemma by induction on the number of nodes in the subtree.
Base Case ():
Note that is not possible because if that were the case, would be in and hence in which would imply which is a contradiction.
Also, note that when , the nodes in have to form a non-star when considered with and otherwise its nodes are going to be in . By the correctness of SplitTree, is correctly recognized as the only subtree in step 2. By the correctness of FindEC, the algorithm correctly identifies that the nodes in form an equivalence cluster and that this equivalence cluster has an edge with . now contains . The next recursion call terminates after step 2 as is empty.
Inductive Step:
Suppose Lemma 5 is correct for all smaller subtrees.
Let us look at what happens for a larger subtree.
- In step 2, the algorithm correctly splits the nodes into smaller subtrees. Each of these subtrees has at least 2 nodes by the same argument as for the base case.
- By the correctness of FindEC, step 4 correctly finds the equivalence cluster in each subtree which has an edge with the internal node. This proves the part of the lemma that the edges between the internal node and the subtrees are correctly identified.
- For each smaller subtree, the nodes of the remaining subtrees, along with the equivalence cluster from the current subtree that has an edge with the internal node, are added to the excluded set for the next recursive call.
- By the inductive assumption, all the edges within each smaller subtree are correctly recovered. Furthermore, the edges between each smaller subtree and its internal node are correctly recovered.
- This is true for all the smaller subtrees obtained in step 2.
- Therefore, all the edges within the current subtree are correctly recovered.
∎
F.4 Proof of correctness and time complexity of the complete algorithm
If the tree has just one equivalence cluster, the algorithm discovers it by the correctness of FindEC and terminates. If the tree has more than one equivalence cluster, there is at least one equivalence cluster with more than one node. By the correctness of FindEC, the algorithm correctly finds one such equivalence cluster. It updates the edge set with this equivalence cluster. Then it calls the Recurse function. By the correctness of Recurse, the algorithm correctly recovers all the remaining edges. Therefore, it correctly recovers the equivalence class of the tree.
Time-complexity
In the initialization step, FindEC is called at most $n$ times. Within the Recurse function, for each call to SplitTree, FindEC is called at least once unless the returned subtree is empty, in which case Recurse terminates. Therefore, the number of times SplitTree is called within Recurse is upper bounded by the number of times FindEC is called within Recurse plus 1. Whenever FindEC is called from within the Recurse function, it discovers at least one edge. Therefore FindEC can be called at most $n - 1$ times within Recurse (a tree on $n$ nodes has $n - 1$ edges). Since each call to FindEC and SplitTree takes time at most quadratic in the number of nodes, the total time complexity of Recurse, and hence of FindTree, is polynomial in $n$.
Appendix G Sample Complexity Analysis
We first calculate how accurately we need to estimate the noisy correlations in order to make sure that any set of 4 nodes in each other's proximal sets is correctly classified as a star or a non-star. We then use Hoeffding's concentration bound to get a high probability bound on the sample complexity.
Consider a set of 4 nodes $\{X_i, X_j, X_k, X_l\}$ which forms a non-star such that $(X_i, X_j)$ form a pair. For these nodes to be correctly classified, the empirical products of correlations must satisfy the conditions of Table 1,
where $\hat{\rho}_{ij}$ is the empirical correlation between nodes $i$ and $j$ obtained from the noisy samples. We look at the first condition. This is equivalent to the condition:
This would be true if:
(22)
as . Let . Then, we have:
Similarly, we can show the other side of inequality.
Using along with some basic algebraic manipulation yields:
Comparing this with Equation (22), we get:
It is easy to check that the required accuracy obtained for the other condition of the non-star shape and for the star shape is also the same.
Now we look at the concentration of empirical correlation random variable for the noisy samples from the tree structured Ising model.
Since $Y_i$, $Y_j$ and $Y_i Y_j$ have support on $\{-1, +1\}$, we can use Hoeffding's inequality to obtain concentration bounds for them. Suppose we draw $N$ independent samples from the noisy distribution. Let $\{Y_i^{(t)}\}_{t=1}^{N}$ denote the samples of $Y_i$, $\{Y_j^{(t)}\}_{t=1}^{N}$ denote the samples of $Y_j$ and $\{Y_i^{(t)} Y_j^{(t)}\}_{t=1}^{N}$ denote the samples of $Y_i Y_j$. Now, if we have:
then .
Let , and denote the sample means of , and respectively. Further, let and and
(23)
Therefore, when and , we have
By Hoeffding’s inequality we have:
(24)
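For reference, a standard form of Hoeffding's inequality for i.i.d. $\pm 1$-valued samples $Z^{(1)}, \dots, Z^{(N)}$, which is the form we assume is applied to each of the three sample means above, is:
$$\mathbb{P}\!\left(\left|\frac{1}{N}\sum_{t=1}^{N} Z^{(t)} - \mathbb{E}[Z]\right| \ge \epsilon\right) \;\le\; 2\exp\!\left(-\frac{N\epsilon^2}{2}\right).$$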
Therefore,
(25)
We define a bad event as follows:
(26)
Using the union bound on Equation (25), we bound the probability of this bad event by:
(27)
Therefore, for the probability of a mistake to be bounded by $\delta$, the number of samples is given by:
(28)
References
- [1] A. Anandkumar, D. J. Hsu, F. Huang, and S. M. Kakade. Learning mixtures of tree graphical models. In Advances in Neural Information Processing Systems, pages 1052–1060, 2012.
- [2] G. Bresler. Efficiently learning ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. ACM, 2015.
- [3] G. Bresler, D. Gamarnik, and D. Shah. Hardness of parameter estimation in graphical models. In Advances in Neural Information Processing Systems, pages 1062–1070, 2014.
- [4] G. Bresler, D. Gamarnik, and D. Shah. Structure learning of antiferromagnetic ising models. In Advances in Neural Information Processing Systems, pages 2852–2860, 2014.
- [5] G. Bresler and M. Karzand. Learning a tree-structured ising model in order to make predictions. arXiv preprint arXiv:1604.06749, 2016.
- [6] G. Bresler, E. Mossel, and A. Sly. Reconstruction of markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 343–356. Springer, 2008.
- [7] S. G. Brush. History of the lenz-ising model. Reviews of modern physics, 39(4):883, 1967.
- [8] Y. Chen. Learning sparse ising models with missing data.
- [9] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 129–136. IEEE, 2010.
- [10] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968.
- [11] S. Dasgupta. Learning polytrees. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 134–141. Morgan Kaufmann Publishers Inc., 1999.
- [12] C. Daskalakis, N. Dikkala, and G. Kamath. Testing ising models. IEEE Transactions on Information Theory, 2019.
- [13] S. Goel, D. M. Kane, and A. R. Klivans. Learning ising models with independent failures. arXiv preprint arXiv:1902.04728, 2019.
- [14] L. Hamilton, F. Koehler, and A. Moitra. Information theoretic properties of markov random fields, and their algorithmic applications. In Advances in Neural Information Processing Systems, pages 2463–2472, 2017.
- [15] E. Ising. Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31(1):253–258, 1925.
- [16] A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman. Towards an integrated protein–protein interaction network: A relational markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.
- [17] A. Katiyar, J. Hoffmann, and C. Caramanis. Robust estimation of tree structured Gaussian graphical models. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3292–3300. PMLR, 2019.
- [18] A. Klivans and R. Meka. Learning graphical models using multiplicative weights. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 343–354. IEEE, 2017.
- [19] M. Kolar and E. P Xing. Estimating sparse precision matrices from data with missing values. 2012.
- [20] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of markov networks using $\ell_1$-regularization. In Advances in neural Information processing systems, pages 817–824, 2007.
- [21] E. M. Lindgren, V. Shah, Y. Shen, A. G. Dimakis, and A. Klivans. On robust learning of ising models.
- [22] H. Liu, F. Han, M. Yuan, J. Lafferty, L. Wasserman, et al. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.
- [23] P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
- [24] A. Y. Lokhov, M. Vuffray, S. Misra, and M. Chertkov. Optimal structure and parameter learning of ising models. Science advances, 4(3):e1700791, 2018.
- [25] F. Martinelli, A. Sinclair, and D. Weitz. The ising model on trees: Boundary conditions and mixing time. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 628–639. IEEE, 2003.
- [26] N. G. d. Mesnards and T. Zaman. Detecting influence campaigns in social networks using the ising model. arXiv preprint arXiv:1805.10244, 2018.
- [27] K. E. Nikolakakis, D. S. Kalogerias, and A. D. Sarwate. Learning tree structures from noisy data. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1771–1782. PMLR, 2019.
- [28] K. E. Nikolakakis, D. S. Kalogerias, and A. D. Sarwate. Non-parametric structure learning on hidden tree-shaped distributions. arXiv preprint arXiv:1909.09596, 2019.
- [29] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
- [30] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 860–867. Citeseer, 2005.
- [31] E. Schneidman, M. J. Berry II, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007, 2006.
- [32] N. Srebro. Maximum likelihood bounded tree-width markov networks. Artificial intelligence, 143(1):123–138, 2003.
- [33] M. Vuffray, S. Misra, A. Lokhov, and M. Chertkov. Interaction screening: Efficient and sample-optimal learning of ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.
- [34] L. Wang and Q. Gu. Robust gaussian graphical model estimation with arbitrary corruption. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3617–3626. JMLR. org, 2017.
- [35] S. Wu, S. Sanghavi, and A. G. Dimakis. Sparse logistic regression learns all discrete pairwise graphical models. arXiv preprint arXiv:1810.11905, 2018.
- [36] E. Yang and A. C. Lozano. Robust gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.
- [37] W.-X. Zhou and D. Sornette. Self-organizing ising model of financial markets. The European Physical Journal B, 55(2):175–181, 2007.