
Beyond Asymptotics: Practical Insights into Community Detection in Complex Networks

Tianjun Ke (Department of Statistics, Columbia University) and Zhiyu Xu (Department of Statistics, Columbia University)
Abstract

The stochastic block model (SBM) is a fundamental tool for community detection in networks, yet the finite-sample performance of inference methods remains underexplored. We evaluate key algorithms—spectral methods, variational inference, and Gibbs sampling—under varying conditions, including signal-to-noise ratios, heterogeneous community sizes, and multimodality. Our results highlight significant performance variations: spectral methods, especially SCORE, excel in computational efficiency and scalability, while Gibbs sampling dominates in small, well-separated networks. Variational Expectation-Maximization strikes a balance between accuracy and cost in larger networks but struggles with optimization in highly imbalanced settings. These findings underscore the practical trade-offs among methods and provide actionable guidance for algorithm selection in real-world applications. Our results also call for further theoretical investigation in SBMs with complex structures. The code can be found at https://github.com/Toby-X/SBM_computation.

** Equal contribution.

1 Introduction

The idea of using networks to find latent community structures has been ubiquitous in biology (Chen and Yuan, 2006; Marcotte et al., 1999), sociology (Fortunato, 2010) and machine learning (Linden et al., 2003). Among various approaches, the stochastic block model (SBM, Holland et al., 1983) stands out as the simplest generative model for identifying community structures. In undirected networks, symmetric adjacency matrices are used to represent connections, where the (i,j)-th entry is 1 if and only if there is an edge between node i and node j, and 0 otherwise. For an undirected network with N nodes and K latent communities, the SBM assumes that the expectation of the observed network has the following structure,

\mathbf{A}^{*}:=\mathbb{E}\mathbf{A}=\mathbf{Z}\mathbf{B}\mathbf{Z}^{\top},

where \mathbf{A}\in\{0,1\}^{N\times N} is the observed adjacency matrix, \mathbf{Z}\in\{0,1\}^{N\times K} is the community structure matrix, and \mathbf{B}\in[0,1]^{K\times K} is a symmetric matrix whose (k,\ell)-th entry represents the probability that there is an edge between community k and community \ell. Each row of \mathbf{Z} contains exactly one entry equal to 1, reflecting that each node belongs to precisely one community.
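As a concrete illustration, the following sketch draws an adjacency matrix from this generative model; it is our own minimal NumPy code, with function and variable names chosen only for illustration.

```python
import numpy as np

def simulate_sbm(Z, B, seed=None):
    """Draw a symmetric adjacency matrix A with E[A] = Z B Z^T (off the diagonal)."""
    rng = np.random.default_rng(seed)
    P = Z @ B @ Z.T                                               # expected adjacency matrix Z B Z^T
    upper = np.triu((rng.random(P.shape) < P).astype(int), k=1)   # sample each node pair once
    return upper + upper.T                                        # symmetrize; no self-loops

# Toy example: N = 6 nodes, K = 2 communities.
Z = np.eye(2, dtype=int)[[0, 0, 0, 1, 1, 1]]                      # one-hot membership matrix
B = np.array([[0.8, 0.1],
              [0.1, 0.7]])
A = simulate_sbm(Z, B, seed=0)
```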

The popularity of the SBM lies in its sharp information-theoretic phase transitions for statistical inference, making it a useful tool for understanding the limits of information extraction in networks and for evaluating algorithmic efficiency in community detection. Recent theoretical advancements in SBMs have been extensively reviewed by Abbe (2018). Notably, Lei et al. (2024) show that, under a specific asymptotic regime, the information-theoretic threshold for exact community structure recovery is the same for polynomial-time algorithms and the maximum likelihood estimator (MLE) in bipartite networks, with straightforward extensions to networks with K latent communities. These results suggest that computationally efficient polynomial-time methods are sufficient for SBMs, obviating the need for more computationally intensive approaches. Despite these advances, several gaps remain. First, the theoretical results are confined to asymptotic settings, leaving the finite-sample performance of these algorithms unclear. Second, it remains uncertain which polynomial-time method, such as Approximate Message Passing (Feng et al., 2022) or Spectral Clustering (Von Luxburg, 2007), is most effective in practice. Factors such as sample size, the number of latent communities, community size imbalance, and inter-community connectivity can significantly affect algorithmic performance, yet their effects remain underexplored.

To address this gap, we conduct a comprehensive investigation into the finite-sample performance of various statistical inference methods for community detection in SBMs across a wide range of challenging scenarios. Specifically, we evaluate performance across extensive signal-to-noise ratios (SNRs), heterogeneous community sizes, and multimodal connectivity structures. Our findings provide actionable insights into the practical applicability of these algorithms while highlighting gaps in the theoretical understanding of finite-sample performance, particularly in the presence of noisy or unbalanced data.

2 Methods

In this section, we briefly discuss the methods considered in this paper. Spectral methods, being polynomial-time algorithms, are the most computationally efficient. Variational methods follow in terms of computational cost, while Gibbs sampling requires the most intensive computation.

2.1 Gibbs Sampling

Gibbs sampling uses the standard hierarchical specification of the stochastic block model (Nowicki and Snijders, 2001; Golightly and Wilkinson, 2005; McDaid et al., 2013):

\mathbf{Z}_{i,:}\sim\text{Mul}(\bm{\pi}),
B_{ij}\sim\text{Beta}(a,b),
\bm{\pi}\sim\text{Dir}(\alpha_{1},\dots,\alpha_{K}),
A_{ij}\,|\,\mathbf{Z},\mathbf{B}\sim\text{Ber}\left(\mathbf{Z}_{i,:}^{\top}\mathbf{B}\mathbf{Z}_{j,:}\right).

The derivation of conditional distributions can be found in Appendix A.3.

2.2 Variational Bayes

We adopt a general variational Bayes method from Zhang (2020), which considers the optimization problem:

\max_{Q\in\mathcal{S}_{\mathrm{MF}}}F(Q):=\max_{Q\in\mathcal{S}_{\mathrm{MF}}}\int\log\phi_{\mathrm{Vec}(\mathbf{A}^{*})}(\bm{y})\,dQ(\bm{z},\mathbf{B})-D\left(Q_{\bm{z}}Q_{\mathbf{B}}Q_{\lambda}\|\Pi(\bm{z},\mathbf{B})\right),\qquad(2.1)

where \mathcal{S}_{\mathrm{MF}} is the mean-field variational class and \phi_{\bm{\theta}}(\bm{y}):=\frac{1}{(\sqrt{2\pi})^{n^{2}}}\exp\left(-\frac{1}{2}\|\bm{y}-\bm{\theta}\|^{2}\right). Thanks to the mean-field assumption, we can obtain a variational algorithm for the stochastic block model with explicit update formulas. The complete description of the variational class and the algorithm can be found in Appendix A.4.

2.3 Variational-EM

The variational Expectation-Maximization method (Variational-EM) (Leger, 2016) considers

\mathbf{Z}\sim\text{Mul}(\bm{\alpha}),\qquad\mathbf{A}_{ij}\,|\,Z_{iq}Z_{j\ell}=1\sim\mathcal{F}_{q\ell},

where \bm{\alpha}^{\top}\bm{1}=1 and \mathcal{F}_{q\ell} is the model for communities q and \ell. The variational form of the likelihood considered is

J=\sum_{i,q}\tau_{iq}\log\left(\alpha_{q}\right)+\sum_{i,j;i\neq j}\sum_{q,\ell}\tau_{iq}\tau_{j\ell}\log f_{q\ell}\left(A_{ij}\right),

where \bm{\tau}_{i} collects the variational parameters of the multinomial distribution that approximates \mathbf{Z}_{i,:}\,|\,\mathbf{A}. Details can be found in Appendix A.5.
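For the Bernoulli model used later, the E and M steps take simple closed forms. The following sketch is our own illustrative NumPy implementation (not the blockmodels package of Leger, 2016); for brevity it uses a random start, whereas our experiments initialize variational-EM with spectral clustering (Appendix A.1).

```python
import numpy as np

def vem_bernoulli(A, K, n_iter=100, tol=1e-6, seed=0):
    """Illustrative variational-EM for the Bernoulli SBM; A is symmetric with zero diagonal."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    A = A.astype(float)
    tau = rng.dirichlet(np.ones(K), size=N)          # variational membership probabilities tau_iq
    for _ in range(n_iter):
        # M step: update alpha and the Bernoulli parameters pi_{q,l}.
        alpha = np.maximum(tau.mean(axis=0), 1e-12)
        pair_w = tau.T @ tau - np.einsum('iq,il->ql', tau, tau)   # sum_{i != j} tau_iq tau_jl
        pair_e = tau.T @ A @ tau                                  # sum_{i != j} tau_iq tau_jl A_ij
        pi = np.clip(pair_e / np.maximum(pair_w, 1e-12), 1e-6, 1 - 1e-6)
        # E step: one fixed-point pass on tau, computed in the log domain for stability
        # (each unordered pair is counted once).
        log_tau = (np.log(alpha)
                   + A @ tau @ np.log(pi).T
                   + (1.0 - A - np.eye(N)) @ tau @ np.log(1.0 - pi).T)
        new_tau = np.exp(log_tau - log_tau.max(axis=1, keepdims=True))
        new_tau /= new_tau.sum(axis=1, keepdims=True)
        converged = np.abs(new_tau - tau).max() < tol
        tau = new_tau
        if converged:
            break
    return tau.argmax(axis=1), tau                   # hard labels and soft memberships
```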

2.4 Spectral Methods

Spectral clustering has been successfully applied in various fields for its simplicity and accuracy (Von Luxburg, 2007). In the context of the stochastic block model (SBM), the top-K eigenspace of the adjacency matrix exhibits a simplex structure, where each cluster corresponds to a vertex of the simplex (Chen and Gu, 2024). Building on this insight, spectral-based methods have been extended to degree-corrected stochastic block models (Karrer and Newman, 2011), leading to the development of approaches such as SCORE (Jin, 2015), one-class SVM (Mao et al., 2018), and regularized spectral clustering (Qin and Rohe, 2013). Although these methods are designed to address broader scenarios than the standard SBM, it remains an open question whether they can effectively handle challenging settings such as community imbalance and sparsity. Given the diversity of spectral clustering techniques, we detail the specific methods considered in this paper and their terminology for clarity.

Spectral Clustering.

For vanilla spectral clustering, we first perform the top-K eigendecomposition of the adjacency matrix \mathbf{A},

\mathbf{A}\approx\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{\top},

where \mathbf{U} contains the top-K eigenvectors and \mathbf{\Lambda} is the diagonal matrix of the top-K eigenvalues. Then, we apply K-means clustering to the rows of \mathbf{U} to estimate \mathbf{Z}.
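A minimal sketch of this pipeline (our own illustrative code, assuming SciPy and scikit-learn are available):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(A, K, seed=0):
    """Vanilla spectral clustering: top-K eigenvectors of A, then K-means on their rows."""
    vals, U = eigsh(A.astype(float), k=K, which='LM')   # top-K eigenpairs (largest magnitude)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)
    return labels, U
```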

SCORE.

SCORE normalizes the eigenvector matrix by dividing each column by the first column element-wise

\mathbf{U}_{\text{SCORE}}=\mathbf{U}/\mathbf{U}_{:,1}.

K-means clustering is then applied to \mathbf{U}_{\text{SCORE}} to estimate \mathbf{Z}.
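A sketch of the SCORE step (our own illustrative code); dropping the constant leading ratio column gives the (K-1)-dimensional representation discussed in Section 3.2.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def score(A, K, seed=0):
    """SCORE (Jin, 2015): K-means on entrywise eigenvector ratios (illustrative sketch)."""
    vals, U = eigsh(A.astype(float), k=K, which='LM')
    U = U[:, np.argsort(np.abs(vals))[::-1]]   # put the leading (Perron) eigenvector first
    R = U[:, 1:] / U[:, [0]]                   # ratio columns; the first column would be constant 1
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(R)
```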

L_{2} Normalization.

In this method, each row of \mathbf{U} is normalized by its L_{2} norm:

\bm{u}_{i,:}^{L_{2}}=\bm{u}_{i,:}/\|\bm{u}_{i,:}\|,\quad i\in[N],

where \bm{u}_{i,:} is the i-th row of \mathbf{U}. The resulting normalized matrix, denoted \mathbf{U}_{L_{2}}, is then clustered using K-means to estimate \mathbf{Z}.
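The normalization itself is a single step; a sketch (our own illustrative code, reusing the eigenvector matrix U computed in the vanilla spectral clustering snippet above):

```python
import numpy as np
from sklearn.cluster import KMeans

def l2_normalize_and_cluster(U, K, seed=0):
    """Row-normalize the top-K eigenvector matrix, then run K-means (illustrative sketch)."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U_l2 = U / np.maximum(norms, 1e-12)        # guard against all-zero rows
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U_l2)
```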

Regularized Spectral Clustering.

This method constructs a regularized graph Laplacian:

\mathbf{L}_{\tau}=\mathbf{D}_{\tau}^{-1/2}\mathbf{A}\mathbf{D}_{\tau}^{-1/2},

where \mathbf{D}_{\tau}=\mathbf{D}+\tau\mathbf{I}, \mathbf{D} is the diagonal degree matrix with \mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}, and \tau is set to \sum_{i=1}^{N}\mathbf{D}_{ii} as suggested in Qin and Rohe (2013).
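A sketch of the resulting procedure (our own illustrative code; the default regularizer follows the choice stated above):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def regularized_spectral_clustering(A, K, tau=None, seed=0):
    """K-means on the top-K eigenvectors of the regularized Laplacian (illustrative sketch)."""
    A = A.astype(float)
    d = A.sum(axis=1)                          # degrees D_ii
    if tau is None:
        tau = d.sum()                          # regularizer as specified in the text above
    inv_sqrt = 1.0 / np.sqrt(d + tau)
    L_tau = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    vals, U = eigsh(L_tau, k=K, which='LA')    # top-K eigenvectors of L_tau
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)
```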

3 Numerical Experiments

3.1 Simulation Settings

We consider a broad range of scenarios to evaluate the performance of the methods. For the number of nodes, we use N=250, 500, 1000, 2000, ranging from moderate-sized graphs to larger ones. For the number of communities, we set K=5, 10, 20, ranging from simpler structures to more complex configurations. Notably, even in a balanced setting, the combination of N=250 and K=20 poses a significant challenge due to the limited information available for distinguishing a large number of diverse communities. To introduce heterogeneity in the cluster sizes, we adopt the approach outlined in Zhang et al. (2022). Specifically, let \bm{\alpha}=(\alpha_{1},\dots,\alpha_{K}), where

v_{1},\dots,v_{K}\stackrel{\text{i.i.d.}}{\sim}\text{Uniform}(0,1),\quad\alpha_{k}=\frac{v_{k}^{\beta}}{\sum_{i=1}^{K}v_{i}^{\beta}}.

Here, \beta is a heterogeneity parameter controlling the imbalance in cluster sizes: a larger \beta results in a greater imbalance in \bm{\alpha}. We set \beta=0, 5, 10 in our experiments to explore varying degrees of heterogeneity. For the kernel probability matrix, we follow a standard assumption in the literature (Lei et al., 2024). The elements of the kernel matrix \mathbf{B} are given by

B_{ij}=\begin{cases}\frac{3}{2}\rho,&\text{if }i=j,\\ \frac{1}{2}\rho,&\text{otherwise},\end{cases}

where \rho\in(0,2/3). We note that \rho=N^{-1} is the exact-recovery threshold in asymptotic theory (Lei et al., 2024). To evaluate performance under varying levels of information, we select \rho=N^{-1}, N^{-0.5}, N^{-0.1}; in the results below we write \rho=N^{-b} and refer to the exponent b. These values cover scenarios ranging from weak to strong signal levels, with b=1 corresponding to the threshold for exact recovery. However, exact recovery may not always be attainable in finite samples, which is the focus of this study. Specific initialization choices can be found in Appendix A.1.
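The full data-generating procedure of this section is summarized in the sketch below (our own illustrative NumPy code; function and variable names are ours).

```python
import numpy as np

def simulate_setting(N, K, beta, b, seed=None):
    """Generate one simulated network following Section 3.1.

    beta : heterogeneity parameter for the community proportions.
    b    : sparsity exponent, so that rho = N**(-b).
    """
    rng = np.random.default_rng(seed)
    v = rng.uniform(size=K)
    alpha = v**beta / np.sum(v**beta)          # alpha_k = v_k^beta / sum_i v_i^beta
    labels = rng.choice(K, size=N, p=alpha)    # community memberships
    rho = N**(-b)
    B = np.full((K, K), 0.5 * rho) + np.diag(np.full(K, rho))   # 1.5*rho on, 0.5*rho off the diagonal
    P = B[labels][:, labels]                   # expected adjacency matrix
    upper = np.triu((rng.random((N, N)) < P).astype(int), k=1)
    return upper + upper.T, labels, alpha

A, labels, alpha = simulate_setting(N=500, K=5, beta=5, b=0.5, seed=0)
```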

3.2 Results

Figure 1: ARI comparisons of all methods over 100 different random seeds. L2 denotes the one-class SVM method. VEMB and VEMG denote variational-EM with the Bernoulli model and the Gaussian model, respectively. Other methods are represented by their abbreviations. SCORE dominates the other spectral methods, with RSC performing the worst. Variational Bayes has inferior performance compared to the variational-EM methods. Variational-EM deteriorates in performance as N increases in challenging scenarios. Gibbs sampling dominates only in simple networks despite being an exact method.

In this project, we evaluate the performance of clustering algorithms using the adjusted Rand index (ARI; Hubert and Arabie, 1985) and normalized mutual information (NMI; Strehl and Ghosh, 2002) as metrics. ARI measures the pairwise agreement between the estimated and true cluster assignments and takes values in [-1,1], with higher values indicating more accurate clustering. As ARI and NMI yield similar results in our experiments, we focus our analysis on ARI; NMI results are provided in Appendix A.2.
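Both metrics are available in scikit-learn; a small sketch of how they can be computed for a pair of labelings (toy data for illustration):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 2]       # ground-truth communities
labels_pred = [1, 1, 1, 0, 0, 2]       # estimated communities (same partition, relabelled)

# Both scores are invariant to label permutations; here both equal 1.0.
ari = adjusted_rand_score(labels_true, labels_pred)
nmi = normalized_mutual_info_score(labels_true, labels_pred)
```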

The results, shown in Figure 1, reveal that the performance of the chosen inference methods varies substantially across scenarios. Note that all methods fail in the sparsest case (b=1, i.e., \rho=N^{-1}), so we exclude it from the figure. This highlights a gap between finite-sample performance and the asymptotic guarantees discussed in Lei et al. (2024); we hypothesize that approaching the asymptotic information-theoretic limit requires significantly larger sample sizes.

One of the most peculiar phenomena occurs when K=10 or 20 and b=0.5. In this case, the performance of all algorithms improves when \beta increases to 5, whereas all methods fail when \beta=0. Since \beta=0 represents balanced cluster sizes, this failure likely arises because the sample size for each individual cluster is too small to extract meaningful information; for instance, spectral clustering with N=5000 achieves an ARI of around 0.15. This indicates that increasing \beta does not necessarily make the task harder, as latent communities with larger sample sizes can be learned more easily under imbalance. Hence, an improvement in ARI is observed as we increase the imbalance of the communities.

Variational Bayes exhibits significantly worse performance under \beta=0 than the other methods. In other scenarios, its performance typically falls between the spectral methods and variational-EM. We suspect that, under \beta=0, the evidence lower bound (ELBO) deviates more substantially from the posterior distribution, resulting in poor performance; another explanation is that variational Bayes may converge more slowly in the balanced scenario. Hence, we recommend variational-EM over variational Bayes, as it has similar computational complexity and better performance.

Among the eigendecomposition-based methods, regularized spectral clustering performs the worst, despite claims that it mitigates sparsity. Vanilla spectral clustering and L_{2} normalization perform similarly, while SCORE dominates. Although the vanilla SBM does not incorporate degree heterogeneity, using L_{2} normalization or SCORE does not degrade performance. The superiority of SCORE may stem from its projection of the K-dimensional eigenspace into a (K-1)-dimensional space, effectively improving the estimation of larger clusters by trading off smaller ones. This is especially evident in scenarios with K=5 and b=0.1 for N=250 or 500, where SCORE consistently outperforms all other methods. In a nutshell, we recommend L_{2} normalization over vanilla spectral clustering for its similar performance and its generalizability to degree-corrected stochastic block models. While SCORE dominates all spectral methods in our settings, it may run into stability issues on real-world datasets due to its unique normalization scheme, so further investigation is needed. Nonetheless, SCORE seems to be a plausible recommendation.

Variational-EM (VEM) generally outperforms vanilla spectral clustering, despite its reliance on non-convex optimization. We suspect this results from its initialization with spectral clustering: the subsequent optimization is able to improve upon the local maximum near its starting point. Since spectral clustering gives robust results, VEM also gives robust results that improve on spectral clustering. Another interesting observation is that in challenging scenarios (\beta>0 and b>0.1), VEM's performance deteriorates with increasing N, both in median accuracy and in variability. We attribute this to two factors. First, the data generation mechanism we adopt yields drastically different datasets across seeds, especially when N is large; as larger N amplifies this variability, an increase in the variation of estimation accuracy is expected. Second, VEM may suffer from its optimization landscape when the data have a more complicated structure: the inability to find the global optimum of the non-convex problem may contribute to its performance deterioration.

We initially conjectured that Gibbs sampling would dominate, because the variational methods rely on approximations that introduce error and the spectral methods do not utilize the likelihood information from the model. However, Gibbs sampling outperforms only when both K and N are small. We propose two potential explanations for this phenomenon. First, when K is large, Gibbs sampling may struggle to sample equally from all modes in practice, leading to estimation bias. Second, when N is large, the performance of Gibbs sampling is only slightly worse than that of VEM, which may be attributed to the influence of an uninformative prior. As K increases, the effective sample size for each parameter shrinks, giving the prior greater sway over the posterior and subsequently degrading Gibbs sampling performance relative to VEM.

4 Conclusion

In this work, we investigated the efficacy of various statistical methods for community detection in stochastic block models under noisy, heterogeneous, and multimodal conditions. Our findings highlight significant differences in performance across methods, emphasizing the importance of context when selecting an inference algorithm.

For small networks with simple structures, Gibbs sampling consistently outperforms other methods. For larger networks with more communities, variational Expectation-Maximization achieves the best balance between computational cost and accuracy. In scenarios requiring scalability, spectral clustering emerges as a practical choice.

Future work could explore more effective initialization schemes for Gibbs sampling and variational-EM, especially using SCORE. Additionally, designing metrics or simulation settings that ensure proper discovery of all clusters may better evaluate these methods. Our work also calls for deeper theoretical analysis of recovering all communities under imbalance, where we conjecture that a lower bound on the sample size of the smallest community is required.

References

  • Abbe (2018) Abbe, E. (2018). Community detection and stochastic block models: recent developments. Journal of Machine Learning Research 18 1–86.
  • Chen and Yuan (2006) Chen, J. and Yuan, B. (2006). Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics 22 2283–2290.
  • Chen and Gu (2024) Chen, L. and Gu, Y. (2024). A spectral method for identifiable grade of membership analysis with binary responses. Psychometrika 1–32.
  • Feng et al. (2022) Feng, O. Y., Venkataramanan, R., Rush, C., Samworth, R. J. et al. (2022). A unifying tutorial on approximate message passing. Foundations and Trends® in Machine Learning 15 335–536.
  • Fortunato (2010) Fortunato, S. (2010). Community detection in graphs. Physics reports 486 75–174.
  • Golightly and Wilkinson (2005) Golightly, A. and Wilkinson, D. J. (2005). Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics 61 781–788.
  • Holland et al. (1983) Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social networks 5 109–137.
  • Hubert and Arabie (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classification 2 193–218.
  • Jin (2015) Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics 43 57–89.
  • Karrer and Newman (2011) Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics 83 016107.
  • Leger (2016) Leger, J.-B. (2016). Blockmodels: A r-package for estimating in latent block model and stochastic block model, with various probability functions, with or without covariates. arXiv preprint arXiv:1602.07587 .
  • Lei et al. (2024) Lei, J., Zhang, A. R. and Zhu, Z. (2024). Computational and statistical thresholds in multi-layer stochastic block models. The Annals of Statistics 52 2431–2455.
  • Linden et al. (2003) Linden, G., Smith, B. and York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7 76–80.
  • Mao et al. (2018) Mao, X., Sarkar, P. and Chakrabarti, D. (2018). Overlapping clustering models, and one (class) svm to bind them all. Advances in Neural Information Processing Systems 31.
  • Marcotte et al. (1999) Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O. and Eisenberg, D. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science 285 751–753.
  • McDaid et al. (2013) McDaid, A. F., Murphy, T. B., Friel, N. and Hurley, N. J. (2013). Improved bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60 12–31.
  • Nowicki and Snijders (2001) Nowicki, K. and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American statistical association 96 1077–1087.
  • Qin and Rohe (2013) Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. Advances in neural information processing systems 26.
  • Strehl and Ghosh (2002) Strehl, A. and Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research 3 583–617.
  • Von Luxburg (2007) Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing 17 395–416.
  • Zhang et al. (2022) Zhang, A. R., Cai, T. T. and Wu, Y. (2022). Heteroskedastic PCA: Algorithm, optimality, and applications. The Annals of Statistics 50 53–80.
  • Zhang (2020) Zhang, F. (2020). Theoretical Guarantees of Variational Inference and Its Applications. Ph.D. thesis, The University of Chicago.

Appendix A Appendix

A.1 Initialization

For Gibbs sampling, we use a=b=2 and \alpha_{i}=1 for all i=1,\ldots,K. Variational Bayes uses a uniform initialization of the memberships and the standard initialization of the variational parameters specified in Algorithm 1. Variational-EM is initialized with spectral clustering for both models. We run each scenario with 100 different random seeds and report the results in Figures 1 and 2.

A.2 NMI

Figure 2: NMI comparisons of all methods over 100 different random seeds. L2 denotes the one-class SVM method. VEMB and VEMG denote variational-EM with the Bernoulli model and the Gaussian model, respectively. Other methods are represented by their abbreviations. Results are similar to ARI.

A.3 Gibbs Sampling

Let

n_{k}=\sum_{i=1}^{N}\mathbb{I}(\mathbf{Z}_{i,:}=\bm{e}_{k}),\quad k=1,2,\dots,K,
n_{k\ell}=\sum_{i\neq j}\mathbb{I}(\mathbf{Z}_{i,:}=\bm{e}_{k},\mathbf{Z}_{j,:}=\bm{e}_{\ell})=n_{k}n_{\ell}-n_{k}\mathbb{I}(k=\ell),
A[k\ell]=\sum_{(i,j):\mathbf{Z}_{i,:}=\bm{e}_{k},\mathbf{Z}_{j,:}=\bm{e}_{\ell}}A_{ij},\quad k,\ell=1,2,\dots,K.

The full conditional distribution of each parameter can be derived as

\bm{\pi}\,|\,\cdot\sim\text{Dir}(\alpha_{1}+n_{1},\dots,\alpha_{K}+n_{K}),
B_{k\ell}\,|\,\cdot\sim\text{Beta}(a+A[k\ell],\,b+n_{k\ell}-A[k\ell]),
P(\mathbf{Z}_{i,:}=\bm{e}_{k}\,|\,\cdot)\propto\pi_{k}\cdot\left(\prod_{j\neq i}B_{kz_{j}}^{A_{ij}}(1-B_{kz_{j}})^{1-A_{ij}}\right)\cdot\left(\prod_{j\neq i}B_{z_{j}k}^{A_{ji}}(1-B_{z_{j}k})^{1-A_{ji}}\right).
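A direct NumPy transcription of one sampling sweep might look as follows (our own illustrative sketch; A is assumed symmetric with a zero diagonal, and variable names are ours).

```python
import numpy as np

def gibbs_sweep(A, z, B, pi, a, b, alpha, rng):
    """One sweep of the Gibbs sampler using the full conditionals above (illustrative sketch)."""
    N, K = A.shape[0], len(pi)
    # Update each membership z_i in turn from its full conditional.
    for i in range(N):
        logp = np.log(pi).copy()
        Ai = np.delete(A[i], i)
        for k in range(K):
            Bk = np.delete(B[k, z], i)         # B_{k, z_j} for j != i
            # the two products in the conditional coincide when A and B are symmetric
            logp[k] += 2 * np.sum(Ai * np.log(Bk) + (1 - Ai) * np.log(1 - Bk))
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # Update pi | z ~ Dirichlet(alpha + counts).
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(alpha + counts)
    # Update B_{kl} | z, A ~ Beta(a + A[kl], b + n_kl - A[kl]), kept symmetric.
    for k in range(K):
        for l in range(k, K):
            n_kl = counts[k] * counts[l] - (counts[k] if k == l else 0)
            A_kl = A[np.ix_(z == k, z == l)].sum()   # diagonal of A is zero, so i = j adds nothing
            B[k, l] = B[l, k] = rng.beta(a + A_kl, b + n_kl - A_kl)
    return z, B, pi
```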

A.4 Variational Bayes

The mean-field variational class is specified as

\mathcal{S}_{\mathrm{MF}}=\Big\{Q:Q(\bm{z},\mathbf{B},\lambda)=Q_{\bm{z}}(\bm{z})Q_{\mathbf{B}}(\mathbf{B})Q_{\lambda}(\lambda),\ \bm{z}\in\overline{\mathcal{Z}},\ \mathbf{B}\in\mathbb{R}^{k\times k},\ \lambda\in\mathbb{R},\ Q_{\bm{z}}\in\mathcal{S}_{\bm{z}},\ Q_{\mathbf{B}}\in\mathcal{S}_{\mathbf{B}},\ Q_{\lambda}\in\mathcal{S}_{\lambda}\Big\},

where \overline{\mathcal{Z}}:=\{\bm{z}\in[k]^{n}:\sum_{i=1}^{n}\mathbb{I}\{z_{i}=t\}>0,\ \forall t\in[k]\} is the support of \bm{z}, and \mathcal{S}_{\bm{z}},\mathcal{S}_{\mathbf{B}},\mathcal{S}_{\lambda} are the variational families: in Equation (2.1), they denote distribution families for \bm{z}\in\overline{\mathcal{Z}}, \mathbf{B}\in\mathbb{R}^{k\times k} and \lambda\in\mathbb{R}, respectively. Zhang (2020) considers the following point-mass distribution family for \mathcal{S}_{\bm{z}}:

\mathcal{S}_{\bm{z}}=\left\{Q_{\bm{z}}:Q_{\bm{z}}(\bm{z}=\widetilde{\bm{z}})=1,\text{ for some }\widetilde{\bm{z}}\in\overline{\mathcal{Z}}\right\}.

In addition, \mathcal{S}_{\mathbf{B}} is chosen to be a normal distribution family and \mathcal{S}_{\lambda} a parametric distribution family with parameter a>0. Here we slightly abuse notation and use \mathbf{B} to also denote the vectorized matrix \mathrm{Vec}(\mathbf{B}). The distribution families are defined as follows:

\begin{gathered}\mathcal{S}_{\mathbf{B}}=\left\{Q_{\mathbf{B}}:Q_{\mathbf{B}}=N(\bm{\mu},\bm{\Sigma}),\text{ for }\bm{\mu}\in\mathbb{R}^{k^{2}},\bm{\Sigma}\in\mathbb{R}^{k^{2}\times k^{2}},\bm{\Sigma}\succ 0\right\},\\ \mathcal{S}_{\lambda}=\left\{Q_{\lambda}:\frac{dQ_{\lambda}}{d\lambda}=\sqrt{\frac{\beta}{\pi}}\lambda^{-3/2}\exp\left(\sqrt{2a\beta}-\frac{\beta}{\lambda}-\frac{a\lambda}{2}\right),\text{ for some }a>0\right\}.\end{gathered}

The complete procedure is given in Algorithm 1.

Algorithm 1 Variational Algorithm for Stochastic Block Model
  Input: Number of clusters K, initial labels \bm{z}^{[0]}\in\overline{\mathcal{Z}}, maximum iteration number M, tolerance level \epsilon.
  Initialize: Iteration number t=0, objective function L_{0}=\infty, change of objective function \nabla L_{0}=\infty, a^{[0]}=n^{2} and \delta^{[0]}=1+\sqrt{\frac{2\beta}{n^{2}}}. The initial \bm{\mu} and \bm{\Sigma} are computed as follows:
\begin{gathered}\mu_{cd}^{[0]}=\left(\delta^{[0]}\right)^{-1}\left(n_{c}^{[0]}\right)^{-1}\left(n_{d}^{[0]}\right)^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}A_{ij}\mathbb{I}\left\{z_{i}^{[0]}=c,z_{j}^{[0]}=d\right\},\quad 1\leq c,d\leq k;\\ \Sigma_{cd}^{[0]}=\left(\delta^{[0]}\right)^{-1}\left(n_{c}^{[0]}\right)^{-1}\left(n_{d}^{[0]}\right)^{-1},\quad 1\leq c,d\leq k;\end{gathered}
where n_{c}^{[0]}=\sum_{i=1}^{n}\mathbb{I}\left\{z_{i}^{[0]}=c\right\}.
  while t<M and \nabla L_{t}>\epsilon do
     t\leftarrow t+1.
     Update \bm{z}:
     for i=1,\cdots,n in a randomized order do
        for c=1,\cdots,k do
           Set z_{j}(i)^{[t]} to be the current label for z_{j} and n_{c}(i)=\sum_{j\neq i}\mathbb{I}\left\{z_{j}(i)^{[t]}=c\right\}.
           Compute
v_{ic}=-k\log\left(1+n_{c}(i)^{-1}\right)-2\sum_{j\neq i}A_{ij}\mu_{cz_{j}(i)^{[t]}}+\delta^{[t-1]}\left[\sum_{j\neq i}\mu_{cz_{j}(i)^{[t]}}^{2}+\frac{1}{2}\mu_{cc}^{2}\right]+\frac{1}{2}\left(\sum_{r=1}^{k}\frac{n_{r}(i)}{n_{r}^{[t-1]}}+\frac{1}{n_{c}^{[t-1]}}\right)^{2}.
        end for
        Set z_{i}^{[t]}=\operatorname{argmin}_{1\leq c\leq k}\left\{v_{ic}\right\}.
     end for
     Update a and \delta: compute n_{c}^{[t]}=\sum_{i=1}^{n}\mathbb{I}\left\{z_{i}^{[t]}=c\right\}, then
a^{[t]}=\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\mu^{[t-1]}_{z_{i}^{[t]}z_{j}^{[t]}}\right)^{2}+\left(\delta^{[t-1]}\right)^{-1}\left(\sum_{c=1}^{k}\frac{n_{c}^{[t]}}{n_{c}^{[t-1]}}\right)^{2},\quad\delta^{[t]}=1+\sqrt{\frac{2\beta}{a^{[t]}}}.
     Update \mu and \Sigma:
\begin{gathered}\mu_{cd}^{[t]}=\left(\delta^{[t]}\right)^{-1}\left(n_{c}^{[t]}\right)^{-1}\left(n_{d}^{[t]}\right)^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}A_{ij}\mathbb{I}\left\{z_{i}^{[t]}=c,z_{j}^{[t]}=d\right\},\quad 1\leq c,d\leq k\\ \Sigma_{cd}^{[t]}=\left(\delta^{[t]}\right)^{-1}\left(n_{c}^{[t]}\right)^{-1}\left(n_{d}^{[t]}\right)^{-1},\quad 1\leq c,d\leq k\end{gathered}
     Update L_{t} and \nabla L_{t}:
\begin{gathered}L_{t}=\frac{k(k+1)}{4}\log\frac{\delta^{[t]}}{4\beta e^{2}}+\sqrt{\frac{a^{[t]}\beta}{2}}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}A_{ij}\mu_{z_{i}^{[t]}z_{j}^{[t]}}^{[t]}+(D+1)\left(\frac{1}{2}k(k+1)+n\log k\right),\\ \nabla L_{t}=\left|L_{t-1}-L_{t}\right|.\end{gathered}
  end while
  Output: \widehat{z}^{(k)}=z^{[t]},\ \widehat{a}^{(k)}=a^{[t]},\ \widehat{\mu}^{(k)}=\mu^{[t]},\ \widehat{\Sigma}^{(k)}=\Sigma^{[t]},\ \widehat{L}^{(k)}=L_{t}.

A.5 Variational-EM

The EM procedure is described as follows:

  1. E step: maximization with respect to the variational parameters \bm{\tau}.

  2. M step: maximization with respect to the original parameters \bm{\alpha} and the function parameters in \mathcal{F}.

We consider Bernoulli and Gaussian models for \mathcal{F}, as described below. For both models, each node is assigned to the community with the largest variational probability.

A.5.1 Bernoulli

This is the common stochastic block model. The model is defined as follows:

\mathcal{F}_{q\ell}=\text{Ber}(\pi_{q\ell}),

where \pi_{q\ell}\in[0,1] for q,\ell=1,\ldots,K.

A.5.2 Gaussian

This model assumes a normal distribution of the community links, defined as:

\mathcal{F}_{q\ell}=\mathcal{N}(\mu_{q\ell},\sigma^{2}),

where \mu_{q\ell}\in\mathbb{R} and \sigma^{2}\in\mathbb{R}^{+} for q,\ell=1,\ldots,K.