A Note on the Identifiability of the Degree-Corrected Stochastic Block Model
Abstract
In this short note, we address the identifiability issues inherent in the Degree-Corrected Stochastic Block Model (DCSBM). We provide a rigorous proof demonstrating that the parameters of the DCSBM are identifiable up to a scaling factor and a permutation of the community labels, under a mild condition.
Keywords: Clustering, Community Detection, Networks.
1 Introduction
The Stochastic Block Model (SBM) is a foundational framework for modeling community structures in networks. It groups nodes into distinct communities, with the probability of an edge between two nodes determined solely by their community memberships (Holland and Leinhardt,, 1981; Nowicki and Snijders,, 2001; Bickel and Chen,, 2009; Choi et al.,, 2012). SBM extends the Erdős-Rényi model (Erdős and Rényi,, 1959), which assumes a uniform edge probability across all node pairs, thereby offering a simpler framework without community differentiation. In contrast, SBM captures both inter-community and intra-community connection patterns, making it a valuable tool for community detection (Abbe,, 2017; Zhao,, 2017). However, SBM assumes homogeneity within each community, limiting its applicability to real-world networks where significant degree variability exists among nodes within the same community. To address this limitation, the Degree-Corrected Stochastic Block Model (DCSBM) was introduced. DCSBM extends SBM by incorporating degree heterogeneity, allowing nodes within a community to exhibit varying degrees of connectivity (Karrer and Newman,, 2011; Ball et al.,, 2011). This enhancement improves the model’s flexibility, making DCSBM particularly suited for networks such as social, biological, and communication systems, where nodes often exhibit varying connectivity patterns even within the same community.
It is well known that parameters in an SBM, like those in other clustering models such as Gaussian mixture models, are not identifiable due to the possible permutation of community labels. For the DCSBM, this issue is further compounded by the non-identifiability of degree parameters due to scaling factors (Karrer and Newman,, 2011; Ball et al.,, 2011; Zhao et al.,, 2012). Despite these challenges, researchers generally agree that these identifiability issues pose only minor technical obstacles and that no additional identifiability problems exist for the DCSBM. However, formal references addressing the identifiability of the DCSBM are limited. In this note, we point out a small gap in a commonly used argument and provide a rigorous proof confirming that the parameter system of the DCSBM is identifiable up to a scaling factor and a permutation of the community labels, under a mild condition that each community has at least three members. To our best knowledge, this condition has never been explicitly stated in the literature. In addition, we show that this condition is necessary using a counterexample. Together, our results fill a critical gap in the theoretical understanding of the identifiability of the DCSBM.
2 Model Description
Consider an undirected graph with nodes. The adjacency matrix represents the presence or absence of edges between nodes. The DCSBM with nodes and communities is characterized by the following components with technical conditions.
-
1.
Community Membership Matrix : is an binary matrix where if node belongs to community , and 0 otherwise. We require that each community has at least 1 member. We use to denote the community label of node . That is, if and only if .
-
2.
Degree Parameter Matrix : is an diagonal matrix where represents the degree parameter for node .
-
3.
Community Connection Matrix : is a symmetric full rank matrix where represents the intensity of connection between nodes in community and community .
A triple defines a DCSBM, where the off-diagonal elements of the adjacency matrix are generated independently with , and
(1) |
We assume that all triples satisfy the conditions listed above throughout this paper.
Now, we introduce useful notations and technical lemmas. We use to denote the set of integers . A partition of is a collection of nonempty, mutually disjoint subsets of whose union is . A permutation matrix is a square binary matrix with exactly one entry of 1 in each row and each column, with all other entries being 0. A permutation matrix corresponds to a permutation of a set with elements. The operator is defined as , mapping a vector to a diagonal matrix such that for a vector . Finally, we use to denote an -dimensional vector where all entries are equal to 1.
Every equivalence relation on a set defines a partition of that set, and vice versa. For a matrix , we define its -th row and -th row to be equivalent if they are identical. Using this equivalence relation, any matrix with rows defines a partition of . The following lemmas will be useful in our theoretical derivations.
Lemma 1.
An community membership matrix defines a partition of based on row equivalence. Two community membership matrices and define the same partition if and only if for a permutation matrix .
Proof of Lemma 1: Define subset as . It is straightforward to verify that is a partition of . Similarly, we can define another partition based on . The partitions and are identical if and only if there exists a permutation such that . This permutation induces a permutation matrix with , and it follows that .
Lemma 2.
Let and be two community membership matrices, and and be two full row-rank matrices where . The matrices and define the same partition of based on row equivalence if and only if for a permutation matrix .
Proof of Lemma 2: All rows of are distinct, since has full row rank. The rows of are replications of rows of , and the -th row of is identical to the -th row of if and only if . As a result, and define the same partition of based on row equivalence. The same conclusion holds for and .
Thus, and define the same partition of if and only if and define the same partition, and by Lemma 1, if and only if for a permutation matrix .
For a matrix , we define its -th row and -th row to be proportionally equivalent if they are proportional to each other up to a positive scaling factor. Any matrix defines a partition via this proportional equivalence.
Lemma 3.
The matrices , , and are defined as in Lemma 2. and are two positive diagonal matrices. The matrices and define the same partition of based on row proportional equivalence if and only if for a permutation matrix . Moreover, if is the partition defined by and , whenever .
Proof of Lemma 3: No two rows of are proportional to each other since has full row rank. Therefore, the partition defined by via row equivalence and the partition defined by via row proportional equivalence are the same, and the first conclusion follows Lemma 2. To show the second statement, we write
where is a positive diagonal matrix. It is straightforward to check when , the -th and -th rows of are identical, and the -th and -th rows of are identical, which implies .
3 The Identifiability Issue on the DCSBM
3.1 A folklore
In the literature on model identifiability, the expected values of the diagonal entries of are typically assumed to follow the same form as the off-diagonal entries, i.e., . Under this assumption, the derivations in Karrer and Newman, (2011) suggest that the parameters are identifiable up to degrees of freedom, corresponding to scaling factors. The following proposition, though rarely stated explicitly in the literature, has been employed to provide a resolution to the identifiability issue of the DCSBM.
Proposition 1.
Two parameter systems and satisfy
(2) |
if and only if , , and where is a permutation matrix, and is a positive diagonal matrix.
In this proposition, indicates the membership matrices in two parametrization systems are up to a permutation . When two systems use the same labels, i.e., , they represent the same DCSBM if and only if , and for a diagonal matrix containing scaling factors, one for each community.
Proof of Proposition 1: To show the “if” part, we calculate
To obtain equation (2), it suffices to show
We verify it by checking each entry. As both and are diagonal matrices, the entry on the left is
When , both and are zero. When , , it is easy to check , so
Now we show the “only if” part. First, we calculate the rank of as
The first equal sign follows that is a positive definite diagonal matrix. The second equal sign holds because the membership matrix contains identity matrix as a submatrix (up to a permutation).
Now consider the eigenvalue decomposition , where is a diagonal matrix contains all nonzero eigenvalues of , is a matrix whose columns are eigenvectors corresponding to nonzero eigenvalues. By the eigenvalue decomposition, we can write
(3) |
where is a full rank matrix. We conduct similar operations with , and express via both parameter system
(4) |
where . (4) implies that and define the same partition via row proportional equivalence. By Lemma 3, we have for a permutation matrix . In addition, we denote the partition by where . Lemma 3 also implies that , when .
Let be a diagonal matrix, where for any with , which is well-defined by the previous paragraph. It can be verified directly that , and .
Note that we have shown . Equation (2) with leads to
which indicate , or equivalently . We have thus shown that if two parameter sets produce the same expected adjacency matrices, then , , and .
Proposition 1 states that the equation (2) implies the community structure of the DCSBM is well-defined, only up to a label permutation. However, this result is insufficient to address the identifiability issue of the DCSBM, where the diagonal elements in are always assumed to be zero.
Let denote the natural projection from the linear space of matrices to the subspace of all matrices with zero diagonals. In the DCSBM, it is typically assumed that, instead if (1), the expected adjacency matrix satisfies
In other words, two systems and define the same DCSBM if and only if
(5) |
It is important to note that equation (5) imposes a weaker constraint than (2). Consequently, Proposition 1 does not suffice to fully resolve the identifiability issue in the DCSBM. The counterexample provided below further illustrates this point.
3.2 A counterexample
The following counter example shows that, without additional conditions, two parameter systems with completely different community structures may define the same DCSBM.
Consider two parameter sets with different community structures:
Then two matrices
share the same off-diagonal components. In this case, even if we observe infinitely many copies of and find , there is no way to decide whether the model is generated by or .
3.3 Identifiability of the DCSBM
The identifiability issue of the DCSBM can be resolved under a mild condition on the size of the smallest community.
Condition 1.
For a DCSBM parametrized by , assume that each community has at least three members. That is, , where the inequality holds entry-wisely.
Theorem 1.
For the “only if” portion, it suffices to show that equation (5) and Condition 1 imply that Then the conclusion follows Proposition 1.
Note that equation (5) implies that there exists a diagonal matrix such that
which leads to
(6) |
Let denote the -th coordinate vector in , defined as the vector with a 1 in the -th position and 0 elsewhere. Let and be two nodes with . Multiplying on both sides of (6), we have
because . The last equation further implies
(7) |
where .
If has a nonzero component, say , then has at least 3 nonzero components by Condition 1 since for all nodes with . In contrast, the left hand side of equation (7) has at most nonzero components. Thus, equation (7) indicates , and hence . We can repeat the argument with arbitrary and from the same community and conclude .
4 Conclusion
We have clarified identifiability issues on the DCSBM. Our result implies that the community structure of the DCSBM is well-defined under a mild condition on the size of the smallest community. It is worth mentioning that all the theoretical results discussed in this paper rely solely on the mean structure of the model. Therefore, these results are applicable not only to the DCSBM with a Bernoulli distribution on links, but also to networks with a Poisson distribution on links, or even networks with more flexible continuous link weights.
Acknowledgment
This work was partially supported by the National Science Foundation.
References
- Abbe, (2017) Abbe, E. (2017). Community detection and stochastic block models: recent developments. Journal of Machine Learning Research, 18(1):6446–6531.
- Ball et al., (2011) Ball, B., Karrer, B., and Newman, M. E. (2011). Efficient and principled method for detecting communities in networks. Physical Review E, 84(3):036103.
- Bickel and Chen, (2009) Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106:21068–21073.
- Choi et al., (2012) Choi, D. S., Wolfe, P. J., and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273–284.
- Erdős and Rényi, (1959) Erdős, P. and Rényi, A. (1959). On random graphs i. Publ. math. debrecen, 6(290-297):18.
- Holland and Leinhardt, (1981) Holland, P. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs:. J. of Amer. Statist Assoc., 76(373):62–65.
- Karrer and Newman, (2011) Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107.
- Nowicki and Snijders, (2001) Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087.
- Zhao, (2017) Zhao, Y. (2017). A survey on theoretical advances of community detection in networks. Wiley Interdisciplinary Reviews: Computational Statistics, 9(5).
- Zhao et al., (2012) Zhao, Y., Levina, E., and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40(4):2266 – 2292.