This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

An Eigengap Ratio Test for Determining the Number of Communities in Network Data

Yujia Wu School of Statistics and Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, China Jingfei Zhang Goizueta Business School, Emory University, Atlanta, GA Wei Lan School of Statistics and Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, China Chih-Ling Tsai Graduate School of Management, University of California, Davis, CA
Abstract

To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. However, to our knowledge, existing methods for estimating the number of network communities often require model estimations or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that address these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.

KEY WORDS: Block Models; Community Detection; Consistency; Eigengap-ratio Test; Type-I Tracy-Widom Distribution

1 INTRODUCTION

Network data analysis has been extensively explored across various disciplines; see for example, Kolaczyk (2009), Goldenberg et al. (2010),Valente (2010) and Katona et al. (2011). Notably, Lorrain & White (1971) and Granovetter (1973) identified that the generation of network links often follows a block (or community) structure. Since then, numerous researchers have focused on block models to characterize the block structure of network generation mechanisms and enhance the utilization of network data. The widely studied block models include the stochastic block model (Holland et al., 1983; Wasserman & Faust, 1994; Snijders & Nowicki, 1997; Nowicki & Snijders, 2001; Bickel & Chen, 2009; Rohe et al., 2011; Choi et al., 2012; Jin, 2015; Zhang & Zhou, 2016; Zhang et al., 2020a), the degree-corrected stochastic block model (Karrer & Newman, 2011; Zhao et al., 2012; Qin & Rohe, 2013; Zhang & Amini, 2023; Hu et al., 2021; Jin et al., 2023), the mixed membership model (Airoldi et al., 2008; White & Murphy, 2016; Ma et al., 2021) and the degree-corrected mixed membership model (Jin et al., 2017; Zhang et al., 2020b; Ke & Wang, 2022; Han et al., 2023; Jin et al., 2024), among others.

In this paper, we consider an undirected network with no self-loops, and assume that its associated adjacency matrix A=(Aij)n×nA=(A_{ij})\in\mathbb{R}^{n\times n} is symmetric with diagonal elements all equal to zero. In addition, we assume that the network is generated from a Bernoulli-distributed random graph model. That is,

AijBernoulli(Pij),for everyi<j,A_{ij}\sim Bernoulli(P_{ij}),\quad\text{for every}\quad i<j, (1.1)

where P=(Pij)n×nP=(P_{ij})\in\mathbb{R}^{n\times n} is the symmetric edge probability matrix. That is, PijP_{ij} is the probability of connection between nodes ii and jj, iji\neq j. In existing block models, the probability matrix PP typically has a rank of KK, where KK is the number of communities. We next briefly describe the four common block models mentioned above.

(a) The stochastic block model (SBM). The SBM (Holland et al., 1983) assumes that the network has KK communities and that each node belongs to one specific community. Let g=(g(1),,g(n))g=(g(1),\cdots,g(n))^{\top} denote the membership vector, where g(i){1,,K}g(i)\in\{1,\cdots,K\} represents the community to which node ii belongs, for i{1,,n}i\in\{1,\cdots,n\}. The SBM assumes that, for every iji\leq j,

Pij=Qg(i)g(j),P_{ij}=Q_{g(i)g(j)}, (1.2)

where Q=(Qkl)K×KQ=(Q_{kl})\in\mathbb{R}^{K\times K} is the K×KK\times K symmetric community probability matrix. The diagonal elements of PP, PiiP_{ii}, is set to be Qg(i)g(i)Q_{g(i)g(i)} for any i=1,,ni=1,\cdots,n. Under (1.2), one can show that matrix PP is of rank KK.

(b) The degree-corrected stochastic block model (DCSBM). The SBM assumes that the edge connecting any two nodes depends only on the communities in which they are located, and nodes within the same community have equal expected degrees. To allow for greater node degree heterogeneity, Karrer & Newman (2011) proposed the degree-corrected stochastic block model (DCSBM) by introducing a vector of nodal degree parameters ω=(ω1,ω2,,ωn)\omega=(\omega_{1},\omega_{2},\cdots,\omega_{n})^{\top} that measures the degree variation of nodes. Accordingly, the DCSBM assumes

Pij=ωiωjQg(i)g(j),P_{ij}=\omega_{i}\omega_{j}Q_{g(i)g(j)}, (1.3)

for every iji\leq j, where gg and QQ are defined in (1.2). The diagonal elements of PP, PiiP_{ii}, is set to be ωi2Qg(i)g(i)\omega_{i}^{2}Q_{g(i)g(i)} for any i=1,,ni=1,\cdots,n. Identifying DCSBM requires constraints such as i=1nωiI{g(i)=u}=nu\sum_{i=1}^{n}\omega_{i}I\{g(i)=u\}=n_{u} for each community 1uK1\leq u\leq K, where nun_{u} is the number of nodes in community uu. Under (1.3), one can show that matrix PP is of rank KK (Jin et al., 2023).

(c) The mixed membership model (MM). Both SBM and DCSBM require that each node belongs to a single community and the KK communities are nonoverlapping. To address this limitation, Airoldi et al. (2008) developed the mixed membership stochastic block model (MM) for handling the case of overlapping communities. For each node ii, the MM introduces a KK-dimensional probability mass vector, πi=(πi,1,,πi,K)K×1\pi_{i}=(\pi_{i,1},\cdots,\pi_{i,K})^{\top}\in\mathbb{R}^{K\times 1}, where πi,k\pi_{i,k} represents the probability of node ii belonging to community kk for 1kK1\leq k\leq K and kπi,k=1\sum_{k}\pi_{i,k}=1. Then, the MM assumes that

Pij=πiQπj,P_{ij}=\pi_{i}^{\top}Q\pi_{j}, (1.4)

where QQ is defined in (1.2). Under (1.4), one can show that matrix PP is of rank KK.

(d) The degree-corrected mixed membership model (DCMM). Jin et al. (2017) introduced the degree-corrected mixed membership model (DCMM), which is an extension of DCSBM that accommodates overlapping communities. Specifically, DCMM assumes that

Pij=ωiωjπiQπj,P_{ij}=\omega_{i}\omega_{j}\pi_{i}^{\top}Q\pi_{j}, (1.5)

where ωi\omega_{i}s, πi\pi_{i}s and QQ are defined in (1.2)-(1.4). The DCMM is the most general class of models, and it includes the SBM, DCSBM and MM as special cases. One can show that the edge probability matrix PP under DCMM is also of rank KK (Jin et al., 2023).

Given the number of communities KK, many algorithmic solutions have been proposed to estimate the community label gg or (π1,,πn)(\pi_{1},\ldots,\pi_{n}) in the above block models. They include the modularity-based methods (Newman & Girvan, 2004; Newman, 2006; Sengupta & Chen, 2018), likelihood-based methods (Bickel & Chen, 2009; Amini et al., 2013; Choi et al., 2012; Zhao et al., 2012; Wang et al., 2023), variational methods (Daudin et al., 2008; Airoldi et al., 2008), spectral clustering methods (Rohe et al., 2011; Jin, 2015; Lei & Rinaldo, 2015; Joseph & Yu, 2016; Qin & Rohe, 2013; Sarkar & Bickel, 2015; Sengupta & Chen, 2015), among others.

In practice, however, the number of communities KK is often unknown a priori. In recent years, many methods have been proposed for estimating KK. These methods include the spectral methods (Le & Levina, 2015), stepwise selection methods (Zhang & Amini, 2023; Jin et al., 2023), sequential testing methods (Lei, 2016; Hu et al., 2021), network cross-validation methods (Chen & Lei, 2018) and likelihood-based methods (Daudin et al., 2008; Wang & Bickel, 2017; Saldana et al., 2017; Noroozi et al., 2019; Hu et al., 2020; Ma et al., 2021). Han et al. (2023) proposed a sequential test that can estimate matrix ranks and it can be applied to infer KK in block models. To our knowledge, these methods encounter at least one of the following challenges or restrictions.

First, the networks need to be assumed dense. For example, Lei (2016) and Hu et al. (2021) assumed that the edge probabilities QklQ_{kl}s have a constant order of O(1)O(1) for every 1k,lK1\leq k,l\leq K. Second, the number of network communities KK needs to be finite as nn tends to infinity; see, for example, Zhang & Amini (2023) and Han et al. (2023). Third, the unknown network distribution parameters need to be estimated. For example, Lei (2016) and Hu et al. (2021) estimated QQ, ω\omega and gg before applying their methods to estimate KK. Similar approaches can be found in the literature such as Noroozi et al. (2019); Zhang & Amini (2023); Hu et al. (2021); Ma et al. (2021); Jin et al. (2023). Fourth, tuning parameters are needed; see, for example, Wang & Bickel (2017), Hu et al. (2020), Ma et al. (2021) and Han et al. (2023).

Table 1: Comparisons among our proposed method, Lei (2016), Hu et al. (2021), Han et al. (2023) and Jin et al. (2023). The symbol ““\checkmark”” means applicable, and “×\times” stands for not applicable.
Feature Proposed method Lei (2016) Hu et al. (2021) Han et al. (2023) Jin et al. (2023)
Dense networks with Qkl=O(1)Q_{kl}=O(1) \checkmark \checkmark \checkmark \checkmark \checkmark
Sparse networks with Qkl=o(1)Q_{kl}=o(1) \checkmark ×\times ×\times \checkmark \checkmark
Diverging number of communities \checkmark \checkmark \checkmark ×\times ×\times
Avoiding parameter estimations \checkmark ×\times ×\times \checkmark ×\times
No tuning parameters \checkmark \checkmark \checkmark ×\times \checkmark

In this paper, we propose an eigengap-ratio based test for estimating the rank of PP, which can in turn be used to determine the number of communities in block models such as SBM, DCSBM, MM and DCMM. The proposed test is straightforward to implement and requires only the calculations of the eigenvalues of AA. It accommodates sparse networks, and allows for a diverging KK without the need for parameter tuning; see Table 1 for a comparison of our proposed method with some recent methods. The main idea behind the proposed test statistic is as follows. Let KK and K0K_{0} denote the true and hypothetical ranks of PP, respectively, where KK is unknown. To test whether K=K0K=K_{0}, we construct the test statistic as a ratio of two eigengaps, that is, the difference between pairs of eigenvalues derived from the adjacency matrix AA. A key advantage of this approach is that it only requires a partial eigen-decomposition of the adjacency matrix AA. Despite the simple form of the test statistic, deriving its null distribution is challenging, as AijA_{ij}s are Bernoulli random variables with distinct probability parameters PijP_{ij}s, which may tend to zero as nn increases. We show that the asymptotic null distribution of the test statistic follows a function of the type-I Tracy-Widom distribution constructed from the Wigner matrix. Additionally, we demonstrate that the proposed test is asymptotically powerful, with the test statistic diverging at a rate no slower than n2/3n^{2/3} under the alternative hypothesis.

It is important to emphasize that our proposed test is designed to estimate the rank of the probability matrix PP and does not assume a specific block model or require the estimation of model parameters. Consequently, our test statistic does not provide a goodness-of-fit assessment for any specific block model. In contrast, block model-specific test methods such as those studied in Lei (2016), Hu et al. (2021), Ma et al. (2021) and Jin et al. (2023), use test statistics to assess the goodness-of-fit of a presumed block model given a hypothetical number of communities. Due to the nature of their design, these tests typically rely on consistent estimators of the block model parameters, such as group membership gg or mixing proportions πi\pi_{i}s.

2 METHOD AND THEORETICAL PROPERTIES

2.1 Hypothesis Testing

Consider a network of nn nodes with the symmetric adjacency matrix AA generated from (1.1). The aim of this paper is to test the following hypothesis:

H0:K=K0v.s.H1:K0<KKmax,H_{0}:K=K_{0}\quad\text{v.s.}\quad{H_{1}:K_{0}<K\leq K_{\max}}, (2.1)

where KK and K0K_{0} denote the true and the hypothetical ranks of PP, respectively, and KmaxK_{\max} is a pre-specified maximum value of KK. In this paper, we set Kmax=max{K0+4,n2/5}K_{\max}=\max\{K_{0}+4,n^{2/5}\}; see discussions after Condition (C2). We consider the one-sided alternative, K>K0K>K_{0}, since a network with KK communities can also be represented as a network with K0>KK_{0}>K communities by splitting one or more true communities. The one-sided alternative was also considered in Lei (2016), Chen & Lei (2018), Wang & Bickel (2017) and Hu et al. (2021).

2.2 An Eigengap-Ratio Based Test

Recall A=(Aij)n×nA=(A_{ij})\in\mathbb{R}^{n\times n} is the adjacency matrix with no self-loops. We define

A~=AE(A),\tilde{A}=A-E(A), (2.2)

where E(A)E(A) is the expected value of AA, and A~\tilde{A} is the noise matrix with zero-mean entries and zero diagonals elements. It is straightforward to see that entries of A~\tilde{A} have finite variances. Given E(A)=Pdiag(P)E(A)=P-\text{diag}(P), (2.2) can then be rewritten as

A=P+A~diag(P),A=P+\tilde{A}-\text{diag}(P), (2.3)

where diag(P)\text{diag}(P) represents the diagonal matrix constructed from (P11,,Pnn)(P_{11},\cdots,P_{nn})^{\top}.

Under the null hypothesis of H0H_{0}: K=K0K=K_{0}, the rank of PP is KK, which implies that PP has KK nonzero eigenvalues. Let λk(A)\lambda_{k}(A) denote the kk-th largest eigenvalue of matrix AA. We can show that the eigenvalues λk(A)\lambda_{k}(A) for kK+1k\geq K+1 are determined by an nKn-K principal submatrix of A~\tilde{A}, denoted by A~GI\tilde{A}_{G_{I}}, whose eigenvalues follow the type-I Tracy-Widom distribution. This result motivates the proposal of an eigengap-ratio test statistic, defined as:

T=λK0+1(A)λKmax+1(A)λKmax+1(A)λKmax+2(A).T=\frac{\lambda_{K_{0}+1}(A)-\lambda_{K_{\max}+1}(A)}{\lambda_{K_{\max}+1}(A)-\lambda_{K_{\max+2}}(A)}. (2.4)

Given that the true rank is KK, if K0<KKmaxK_{0}<K\leq K_{\max}, we expect λK0+1(A)λKmax+1(A)\lambda_{K_{0}+1}(A)-\lambda_{K_{\max}+1}(A) to be much larger than λKmax+1(A)λKmax+2(A)\lambda_{K_{\max}+1}(A)-\lambda_{K_{\max}+2}(A). In contrast, when K0=KK_{0}=K, the ratio of these two eigengaps is not expected to be large. In Theorem 1 of Section 2.3, we show that the asymptotic distribution of TT under K0=KK_{0}=K is a function of the type-I Tracy-Widom distribution. Specifically, we first carefully bound the difference between λk(A)\lambda_{k}(A) and λkK(A~GI)\lambda_{k-K}(\tilde{A}_{G_{I}}) for kK+1k\geq K+1, and then demonstrate that the distribution of n2/3(σ2λkK(A~GI)L)n^{2/3}(\sigma_{2}\lambda_{k-K}(\tilde{A}_{G_{I}})-L) converges to the type-I Tracy-Widom distribution, where σ2\sigma_{2} and LL are scaling and location parameters that do not depend on kk.

The construction of TT in (2.4) offers two key advantages. First, both λk(A)λkK(A~GI)\lambda_{k}(A)-\lambda_{k-K}(\tilde{A}_{G_{I}}) and the asymptotic distribution of n2/3(σ2λkK(A~GI)L)n^{2/3}(\sigma_{2}\lambda_{k-K}(\tilde{A}_{G_{I}})-L) involve scaling and location parameters that are very difficult to characterize or estimate. However, these parameters do not depend on kk and can be directly cancelled out in the eigengap-ratio statistic, eliminating the need for their estimation. Second, by using the eigengap ratio rather than individual eigenvalues, we avoid the requirement that λk(A)\lambda_{k}(A) converge to λkK(A~GI)\lambda_{k-K}(\tilde{A}_{G_{I}}) (up to a scaling parameter) faster than n2/3n^{2/3} for kK+1k\geq K+1. This simplifies the theoretical analysis and relaxes the assumptions necessary to establish the asymptotic null distribution.

Remark 1.

Eigengap-ratio statistics have been studied in factor analysis and signal analysis. For example, Onatski (2009) proposed an eigengap-ratio test to detect the number of factors by testing the rank of the factor loading matrix, and Ding & Yang (2022) developed a maximum eigengap-ratio test to examine the number of signals by testing the rank of the signal matrix in a signal-plus-noise model. However, in these models, tthe number of factors and signals are usually assumed to be fixed, and both the loading matrix and signal matrix are often presumed to be dense. In contrast, our test allows for a sparse PP and permits KK to diverge with nn.

Remark 2.

The complexity of finding all nn eigenvalues of AA is roughly O(n3)O(n^{3}), and this can be computationally costly. In contrast, our test statistic TT, for any given KmaxK_{\max}, only requires calculating the first Kmax+2K_{\max}+2 eigenvalues of AA. Hence, the complexity can be reduced to O(n2Kmax)O(n^{2}K_{\max}) or even to O(Kmax)O(K_{\max}) if the network is sparse.

2.3 Asymptotic Null Distribution

The Tracy-Widom law, introduced by Tracy & Widom (1994, 1996) to characterize the distributions of the eigenvalues of Wigner matrices, has been extensively studied in a series of papers (Lee & Yin, 2014; Lee & Schnelli, 2015; Schnelli & Xu, 2022). Specifically, let WW be a symmetric Wigner matrix wtih iid Gaussian entires of mean zero and variance n1n^{-1}. Then, n2/3(λ1(W)2)n^{2/3}(\lambda_{1}(W)-2) converges to the Tracy-Widom distribution. Under the null hypothesis H0H_{0}, we employ the Tracy-Widom law to derive the asymptotic null distribution of TT. In addition, we reparameterize PP as P=ρnPnP=\rho_{n}P_{n}, where ρn0\rho_{n}\rightarrow 0 and PnP_{n} is fixed. This allows the connecting probability matrix to scale with nn, a commonly approach in the literature (Bickel & Chen, 2009; Zhao et al., 2012). We next present some regularity conditions.

  • (C1)

    Assume the connecting probability matrix P=(Pij)n×nP=(P_{ij})_{n\times n} is symmetric with Pij(0,1)P_{ij}\in(0,1) for any 1i,jn1\leq i,j\leq n.

  • (C2)

    Assume n2/3ρnn^{2/3}\rho_{n}\to\infty, and K=O(n1/12ψ)K=O(n^{1/12-\psi}), where ψ\psi is a constant satisfying 0<ψ1/120<\psi\leq 1/12.

  • (C3)

    Assume that λ~K(P)c3n/K\tilde{\lambda}_{K}(P)\geq c^{*}_{3}n/K, where λ~K(P)\tilde{\lambda}_{K}(P) is the KK-th largest eigenvalue of PP in absolute value and c3>0c^{*}_{3}>0 is a constant.

Condition (C1) requires the entries of the edge probability matrix PP to be in (0,1)(0,1), a condition similarly considered in Lei (2016), Wang & Bickel (2017) and Hu et al. (2021). Condition (C2) allows the edge probabilities PijP_{ij}s to decay to zero slower than O(n2/3)O(n^{-2/3}), accommodating sparse networks. This condition is less restrictive than those of Lei (2016) and Hu et al. (2021), which assume dense networks where Pij=O(1)P_{ij}=O(1). However, it is stronger than the conditions in Han et al. (2023) and Jin et al. (2023) to ensure that the distribution of n2/3(σ2λkK(A~GI)L)n^{2/3}(\sigma_{2}\lambda_{k-K}(\tilde{A}_{G_{I}})-L) converges to the type-I Tracy-Widom distribution. It also allows the number of communities KK to diverge at a rate no faster than O(n1/12)O(n^{1/12}), which is less stringent than the fixed KK considered in Han et al. (2023). By setting Kmax=max{K0+4,n2/5}K_{\max}=\max\{K_{0}+4,n^{2/5}\}, it holds that KKmaxK\leq K_{\max}. Condition (C3) imposes a lower bound on λ~K(P)\tilde{\lambda}_{K}(P), so it is bounded away from zero as nn\to\infty. A similar condition was also considered in Theorem 3.1 of Lei & Rinaldo (2015).

Based on the above conditions, we show the asymptotic null distribution of the test statistic TT in the following theorem.

Theorem 1.

Assume that Conditions (C1)-(C3) hold. Under the null hypothesis H0:K=K0H_{0}:K=K_{0}, we have FT(x)FTW(x)F_{T}(x){\sim}F_{T_{W}}(x) for any xx\in\mathbb{R}, as nn\to\infty, where FT(x)F_{T}(x) is the cumulative distribution function for the test statistic TT, and FTW(x)F_{T_{W}}(x) is the cumulative distribution function of TWT_{W}, defined as

TW=λ1(W)λKmaxK0+1(W)λKmaxK0+1(W)λKmaxK0+2(W).T_{W}=\frac{\lambda_{1}(W)-\lambda_{K_{max}-K_{0}+1}(W)}{\lambda_{K_{max}-K_{0}+1}(W)-\lambda_{K_{max}-K_{0}+2}(W)}.

Here, WW is a Wigner matrix with i.i.d. entries of mean zero and variance 1/n1/n, and \sim denotes asymptotic equivalence.

The above theorem indicates that the distribution of the test statistic TT is asymptotically equivalent to that of TWT_{W}. This allows us to employ a function of the Tracy-Widom distributions to test the null hypothesis H0H_{0}. Specifically, for a given nominal level α\alpha, let cα=FTW1(1α)c_{\alpha}=F_{T_{W}}^{-1}(1-\alpha) be the upper α\alpha-quantile of TWT_{W}. We then tabulate the critical value cαc_{\alpha} by approximating FTWF_{T_{W}} via a numerical method, such as Monte Carlo simulation, and reject H0H_{0} if T>cαT>c_{\alpha}.

2.4 Asymptotic Power

Under the alternative hypothesis H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}, the eigenvalue λK0+1(A)\lambda_{K_{0}+1}(A) is expected to be large, and the limiting distribution of TT no longer converges to that of TWT_{W}. Indeed, we demonstrate that TT diverges with nn under H1H_{1}, with strong discriminatory power against the alternative hypothesis.

Theorem 2.

Under Conditions (C1)-(C3) and the alternative hypothesis H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}, it holds that P(T>Cn2/3)1P(T>Cn^{2/3})\to 1 as nn\to\infty for any positive constant CC.

The above theorem implies that for a given nominal level α\alpha, P(T>cα)1P(T>c_{\alpha})\to 1 under the alternative hypothesis H1H_{1}. In particular, Theorem 2 indicates that TT diverges at a rate no slower than n2/3n^{2/3} under the alternative, making our test more powerful than those of Han et al. (2023) and Hu et al. (2021). Specifically, the test statistic in Han et al. (2023) is greater than a constant under the alternative, while the test statistic of Hu et al. (2021) diverges at an order of O(logn)O(\log n) under the alternative. Additionally, the test statistic of Hu et al. (2021) lacks power under the planted partition stochastic block model, that is, a stochastic block model with equal community sizes, equal QllQ_{ll}s, and equal QlkQ_{lk}s for lkl\neq k. We next demonstrate through simulation studies that TT performs well in both size and power.

3 SIMULATION STUDIES

To evaluate the performance of the proposed test in finite samples, we conduct simulation studies based on three models SBM, DCSBM and DCMM, to test the following hypotheses:

H0:K=K0v.s.H1:K0<KKmax.H_{0}:K=K_{0}\quad\text{v.s.}\quad H_{1}:K_{0}<K\leq K_{\max}.

The MM model is not included as it is a special case of DCMM.

To simulate from the DCMM in (1.5) with Pij=ωiωjπiQπjP_{ij}=\omega_{i}\omega_{j}\pi_{i}^{\top}Q\pi_{j}, we follow the approach in Han et al. (2023) to generate πi\pi_{i} and the approach in Zhao et al. (2012) to generate ωi\omega_{i}. Specifically, let PMK={e1,,eK}\text{PM}_{K}=\{e_{1},\cdots,e_{K}\}, where KK is the true number of communities and eke_{k} is an nn-dimensional unit vector with the kkth element equal to 1 and others equal to 0, and MMK={(0.2,0.8,0,,0K2),(0.8,0.2,0,,0K2),(1K,,1KK)}\text{MM}_{K}=\Big{\{}(0.2,0.8,\underbrace{0,\cdots,0}_{K-2}),(0.8,0.2,\underbrace{0,\cdots,0}_{K-2}),(\underbrace{\frac{1}{K},\cdots,\frac{1}{K}}_{K})\Big{\}}. We generate ωi\omega_{i} independently from a distribution with unit expectation, that is,

ωi={ηi,w.p.0.8,9/11,w.p.0.1,13/11,w.p.0.1,\omega_{i}=\left\{\begin{array}[]{lr}\eta_{i},&\text{w.p.}\quad 0.8,\\ 9/11,&\text{w.p.}\quad 0.1,\\ 13/11,&\text{w.p.}\quad 0.1,\end{array}\right.

where ηi\eta_{i} is uniformly distributed on the interval [4/5,6/5][4/5,6/5]. We further set Ω1=In\Omega_{1}=I_{n} and Ω2=diag{ω1,ω2,,ωn}\Omega_{2}=\text{diag}\{\omega_{1},\omega_{2},\cdots,\omega_{n}\} to represent the two types of degree parameters, where InI_{n} is an nn-dimensional identity matrix.

The network is set as n=3,000n=3,000 with KK communities, the size of each community is 3000/K3000/K, and K{3,5,10,15,20}K\in\{3,5,10,15,20\}. For the SBM setting, we set Ω=Ω1\Omega=\Omega_{1} and πiPMK\pi_{i}\in\text{PM}_{K}. For the DCSBM setting, we set Ω=Ω2\Omega=\Omega_{2} and πiPMK\pi_{i}\in\text{PM}_{K}. Under the DCMM setting, we set Ω=Ω2\Omega=\Omega_{2} and πiPMKMMK\pi_{i}\in\text{PM}_{K}\cup\text{MM}_{K}. This model assigns M=n(K10.03)M=n(K^{-1}-0.03) pure nodes to each community and then equally allocates (nMK)/3(n-MK)/3 nodes across the mixed membership with the three probability mass vectors in MMK\text{MM}_{K}.

We consider both dense and sparse networks to assess the performance of the proposed test. We also compare the proposed test TT with five competing tests: two in Lei (2016), one in Han et al. (2023) and two in Hu et al. (2021), denoted as TLeiT_{Lei}, TLei,BootT_{Lei,Boot}, THanT_{Han}, THuT_{Hu}, and THu,BootaugT^{aug}_{Hu,Boot}, respectively. Here, TLei,BootT_{Lei,Boot} and THu,BootaugT^{aug}_{Hu,Boot} are tests that apply a bootstrap correction and an augmented-bootstrap procedure to TLeiT_{Lei} and THuT_{Hu}, respectively. See more details in Lei (2016) and Hu et al. (2021). Since the methods of Lei (2016) and Hu et al. (2021) are not applicable for DCMM, we omit them in simulation studies for DCMM. To obtain the limiting distribution TWT_{W} in our test, we generate 1000 Wigner matrices W=(Wij)n×nW=(W_{ij})\in\mathbb{R}^{n\times n}, where the WijW_{ij}s are independently generated from N(0,1/n)N(0,1/n) for 1ijn1\leq i\leq j\leq n. This allows us to approximate the distribution of TWT_{W} and determine the critical value at a given nominal level.

3.1 Simulations Under Dense Networks

We consider two dense edge probability matrices, modified from Hu et al. (2021) and Han et al. (2023), denoted as Q1Q_{1} and Q2Q_{2}, respectively. In Q1Q_{1}, the edge probability between any two communities uu and vv is set to 0.1(1+4×I(u=v))0.1(1+4\times I(u=v)). In Q2Q_{2}, we define Q2,kl=0.1|kl|Q_{2,kl}=0.1^{|k-l|} if lll\neq l, and Q2,kl=(K+1k)/KQ_{2,kl}=(K+1-k)/K otherwise. The edge probabilities in Q1Q_{1} and Q2Q_{2} do not decrease with nn, and represent dense network scenarios. We implement Q1Q_{1} for SBM and DCSBM, incorporating the methods of Lei (2016), Hu et al. (2021), and Han et al. (2023) for comparison. Additionally, we implement Q2Q_{2} for DCMM, and compare with Han et al. (2023).

Tables 2-3 report the empirical sizes and powers for SBM and DCSBM, respectively, under the edge probability matrix Q1Q_{1}. The simulation results reveal two important findings. First, the empirical sizes of TT, THanT_{Han} and THu,bootaugT^{aug}_{Hu,boot} are close to the 5% nominal level, although THu,bootaugT^{aug}_{Hu,boot} performs slightly worse than TT and THanT_{Han}. Under SBM, when K>5K>5, the sizes of THuT_{Hu}, TLeiT_{Lei} and TLei,bootT_{Lei,boot} deviate notably from the nominal level. Under DCSBM, when K>5K>5, THuT_{Hu} exhibits size distortions, which can be much larger than 5%. Additionally, the empirical sizes of TLeiT_{Lei} and TLei,bootT_{Lei,boot} deviate significantly from the 5% nominal level, except for TLei,bootT_{Lei,boot} under K=K0=3K=K_{0}=3 in Table 3. This is because TLeiT_{Lei} and TLei,bootT_{Lei,boot} are not designed for DCSBM. Second, the empirical powers of TT and THu,bootaugT^{aug}_{Hu,boot} are generally equal to 1, with TT performs more powerfully than THu,bootaugT^{aug}_{Hu,boot}, while THuT_{Hu} is less powerful. Notably, THanT_{Han} shows almost no power. This could be due to the fact that the assumptions in Theorem 3.2 of Han et al. (2023) for ensuring the asymptotic power of THanT_{Han} are not satisfied under the edge probability matrix Q1Q_{1}. In addition, under SBM, the powers. of TLeiT_{Lei}, and TLei,bootT_{Lei,boot} are generally equal to 1. However, under DCSBM, due to the size distortions of TLeiT_{Lei} and TLei,bootT_{Lei,boot}, we do not further analyze their empirical powers.

Table 4 reports the empirical sizes and powers of DCMM under the edge probability matrix Q2Q_{2}. The simulation results show two findings. First, the empirical sizes of TT are close to the nominal level 5%. However, THanT_{Han} exhibits size distortion when K>5K>5. This finding indicates that the test THanT_{Han} may not be suitable for large KK. Second, the empirical powers of TT get larger as KK0K-K_{0} increases. Since the sizes of THanT_{Han} are distorted for K>5K>5 and the powers of THanT_{Han} are not steady for K5K\leq 5, we do not further analyze their empirical powers. In sum, our proposed test generally outperforms other competing methods under SBM, DCSBM and DCMM with dense networks.

Table 2: Dense SBM under Q1Q_{1}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.053 1 1 1 1 0.052 0.068 0.048 0.035 0.025
K0=5K_{0}=5 * 0.060 1 1 1 * 0.045 0.030 0.047 0.045
K0=10K_{0}=10 * * 0.042 1 1 * * 0.047 0.030 0.043
K0=15K_{0}=15 * * * 0.058 1 * * * 0.038 0.045
K0=20K_{0}=20 * * * * 0.048 * * * * 0.045
THuT_{Hu} THu,bootaugT^{aug}_{Hu,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.068 0.115 0.828 0.993 1 0.067 0.993 1 1 1
K0=5K_{0}=5 * 0.080 0.580 0.976 1 * 0.042 1 1 1
K0=10K_{0}=10 * * 0.130 0.500 0.985 * * 0.067 1 1
K0=15K_{0}=15 * * * 0.240 0.565 * * * 0.035 0.915
K0=20K_{0}=20 * * * * 0.398 * * * * 0.045
TLeiT_{Lei} TLei,bootT_{Lei,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.065 1 1 1 1 0.052 1 1 1 1
K0=5K_{0}=5 * 0.072 1 1 1 * 0.097 1 1 1
K0=10K_{0}=10 * * 0.182 1 1 * * 0.137 1 1
K0=15K_{0}=15 * * * 0.614 1 * * * 0.731 1
K0=20K_{0}=20 * * * * 0.846 * * * * 0.930
Table 3: Dense DCSBM under Q1Q_{1}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.035 1 1 1 1 0.047 0.055 0.044 0.035 0.050
K0=5K_{0}=5 * 0.053 1 1 1 * 0.065 0.050 0.055 0.044
K0=10K_{0}=10 * * 0.048 1 1 * * 0.053 0.050 0.050
K0=15K_{0}=15 * * * 0.033 1 * * * 0.050 0.050
K0=20K_{0}=20 * * * * 0.043 * * * * 0.052
THuT_{Hu} THu,bootaugT^{aug}_{Hu,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.035 0.275 0.840 0.987 1 0.028 0.855 1 1 1
K0=5K_{0}=5 * 0.042 0.465 0.845 1 * 0.053 0.892 1 1
K0=10K_{0}=10 * * 0.183 0.766 1 * * 0.071 0.905 1
K0=15K_{0}=15 * * * 0.302 1 * * * 0.062 0.995
K0=20K_{0}=20 * * * * 0.711 * * * * 0.041
TLeiT_{Lei} TLei,bootT_{Lei,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.110 1 1 0.995 0.990 0.065 1 0.588 0.573 0.645
K0=5K_{0}=5 * 1 0.936 0.978 0.984 * 0.007 0 0.008 0.688
K0=10K_{0}=10 * * 0.267 0.820 0.810 * * 0.084 0.012 0.006
K0=15K_{0}=15 * * * 0.115 0.612 * * * 0.015 0
K0=20K_{0}=20 * * * * 0.162 * * * * 0.026
Table 4: Dense DCMM under Q2Q_{2}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.057 0.345 0.610 0.865 0.965 0.033 1 1 0.405 0.785
K0=5K_{0}=5 * 0.057 0.330 0.440 0.645 * 0.040 1 0.405 0.785
K0=10K_{0}=10 * * 0.037 0.185 0.395 * * 0.113 0.415 0.785
K0=15K_{0}=15 * * * 0.040 0.185 * * * 0.343 0.780
K0=20K_{0}=20 * * * * 0.036 * * * * 0.680

3.2 Simulations Under Sparse Networks

In this section, we consider two types of sparse edge probability matrix Q~1\tilde{Q}_{1} and Q~2\tilde{Q}_{2}. In Q~1\tilde{Q}_{1}, the edge probability between any two communities uu and vv is n5/9(1+4×I(u=v))n^{-5/9}(1+4\times I(u=v)). In Q~2\tilde{Q}_{2}, we set Q~2,kl=5n5/90.1|kl|\tilde{Q}_{2,kl}=5n^{-5/9}0.1^{|k-l|} if klk\not=l, and Q~2,kl=5n5/9(K+1k)/K\tilde{Q}_{2,kl}=5n^{-5/9}(K+1-k)/K otherwise. Analogous to Section 3.1, we implement Q~1\tilde{Q}_{1} for SBM and DCSBM and Q~2\tilde{Q}_{2} for DCMM.

Tables 5-6 report the empirical sizes and powers for SBM and DCSBM, respectively, under the edge probability matrix Q~1\tilde{Q}_{1}. The results reveal the following two important findings. First, both TT and THanT_{Han} control sizes well. In addition, the sizes of THuT_{Hu} and TLeiT_{Lei} deviate notably from the nominal level. This is likely due to the fact that the tests in Lei (2016) and Hu et al. (2021) are not designed for sparse networks. After applying a bootstrap correction and an augmented-bootstrap procedure, although TLei,bootT_{Lei,boot} and THu,bootaugT^{aug}_{Hu,boot} give several reasonable empirical sizes, though both tests lack theoretical justification in sparse networks. Second, our proposed test TT is powerful against alternatives, and its empirical power steadily increases as KK0K-K_{0} gets large. This is consistent with our theoretical findings. In contrast, the empirical powers of THanT_{Han} are low since the two assumptions in Theorem 3.2 of Han et al. (2023) for ensuring the asymptotic power of THanT_{Han} are not satisfied under Q~1\tilde{Q}_{1}. Since most of the empirical sizes of the tests proposed by Lei (2016) and Hu et al. (2021) are distorted, we do not further analyze their empirical powers.

Under the sparse DCMM, the results in Table 7 show similar qualitative findings as in Table 4. For example, the empirical sizes of TT are close to the nominal level 5% and the empirical sizes of THanT_{Han} are distorted when K>5K>5. In addition, the empirical powers of TT get larger as KK0K-K_{0} increases. Since the sizes of THanT_{Han} are distorted for K>5K>5 and the powers of THanT_{Han} are not steady for K5K\leq 5, we do not further analyze their empirical powers.

To make a more comprehensive comparison, we also carry out simulation studies with different network settings in the supplementary material. Specifically, we consider equal-sized communities with nk=300n_{k}=300 and the total number of nodes in the network is n=Knkn=Kn_{k}, with K{2,4,6,8,10}K\in\{2,4,6,8,10\}. The simulation results of SBM and DCSBM under this network setting and the simulation results of DCMM under edge probabilities Q1Q_{1} and Q~1\tilde{Q}_{1} are given in Tables S.1-S.6 of the supplementary material, which show similar findings as in Tables 2-7. In summary, simulation studies demonstrate that our proposed test performs well across three block models under both dense and sparse networks. These findings are consistent with our theoretical results.

Table 5: Sparse SBM under Q~1\tilde{Q}_{1}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.057 0.675 0.802 0.905 0.971 0.052 0.058 0.034 0.040 0.050
K0=5K_{0}=5 * 0.047 0.772 0.820 0.956 * 0.057 0.060 0.058 0.043
K0=10K_{0}=10 * * 0.055 0.540 0.832 * * 0.038 0.035 0.038
K0=15K_{0}=15 * * * 0.053 0.280 * * * 0.047 0.056
K0=20K_{0}=20 * * * * 0.033 * * * * 0.040
THuT_{Hu} THu,bootaugT^{aug}_{Hu,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.247 0.858 0.992 0.990 0.995 0.062 0.697 0.914 1 1
K0=5K_{0}=5 * 0.393 1 1 1 * 0.042 0.832 0.986 1
K0=10K_{0}=10 * * 0.990 1 1 * * 0.018 0.878 1
K0=15K_{0}=15 * * * 1 1 * * * 0.025 0.996
K0=20K_{0}=20 * * * * 1 * * * * 0.085
TLeiT_{Lei} TLei,bootT_{Lei,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.993 1 1 1 1 0.072 0.980 0.700 0.978 1
K0=5K_{0}=5 * 1 1 1 1 * 0.155 0.597 0.830 1
K0=10K_{0}=10 * * 1 1 1 * * 0.010 0.924 1
K0=15K_{0}=15 * * * 1 1 * * * 0.072 1
K0=20K_{0}=20 * * * * 1 * * * * 0
Table 6: Sparse DCSBM under Q~1\tilde{Q}_{1}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.038 0.398 0.535 0.888 1 0.042 0.050 0.048 0.033 0.043
K0=5K_{0}=5 * 0.057 0.243 0.492 0.898 * 0.067 0.056 0.046 0.038
K0=10K_{0}=10 * * 0.046 0.482 0.610 * * 0.047 0.036 0.040
K0=15K_{0}=15 * * * 0.040 0.530 * * * 0.062 0.053
K0=20K_{0}=20 * * * * 0.032 * * * * 0.043
THuT_{Hu} THu,bootaugT^{aug}_{Hu,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.018 0.843 0.940 0.673 0.568 0.060 0.475 0.786 0.873 1
K0=5K_{0}=5 * 0.138 1 0.932 1 * 0.082 0.708 0.612 0.965
K0=10K_{0}=10 * * 0.850 1 1 * * 0.053 0.536 1
K0=15K_{0}=15 * * * 1 1 * * * 0.151 0.910
K0=20K_{0}=20 * * * * 0.849 * * * * 0.016
TLeiT_{Lei} TLei,bootT_{Lei,boot}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.173 1 1 1 1 0.040 0.963 0.560 0.745 0.350
K0=5K_{0}=5 * 0.142 1 1 1 * 0.157 0.491 0.161 0.975
K0=10K_{0}=10 * * 0.826 1 1 * * 0.077 0.242 0.117
K0=15K_{0}=15 * * * 0.036 1 * * * 0.036 1
K0=20K_{0}=20 * * * * 0 * * * * 0
Table 7: Sparse DCMM under Q~2\tilde{Q}_{2}. Proportion of rejections at nominal level α=0.05\alpha=0.05 for hypothesis test H0:K=K0H_{0}:K=K_{0} v.s. H1:K0<KKmaxH_{1}:K_{0}<K\leq K_{\max}.
TT THanT_{Han}
K 3 5 10 15 20 3 5 10 15 20
K0=3K_{0}=3 0.043 0.280 0.465 0.655 0.895 0.047 1 1 0.900 0.960
K0=5K_{0}=5 * 0.037 0.250 0.455 0.595 * 0.040 1 0.900 0.960
K0=10K_{0}=10 * * 0.050 0.190 0.395 * * 0.420 0.905 0.960
K0=15K_{0}=15 * * * 0.037 0.130 * * * 0.793 0.955
K0=20K_{0}=20 * * * * 0.041 * * * * 0.933

4 REAL DATA ANALYSES

To illustrate the utility of our proposed method, we analyze three real-world networks. The first dataset is the facebook network in Simmons College that was first studied in Traud et al. (2012). The Simmons College network contains 1,137 nodes, and its edge density is approximately 3.76%. The second dataset is a political blog network that was first studied by Adamic & Glance (2005) and later analyzed by Lei (2016), Han et al. (2023) and Hu et al. (2021). This network contains 1,222 nodes, and its edge density is approximately 2.24%. The third dataset is a Sina Weibo user network that was analyzed by Wu et al. (2022). This network has 2,580 nodes, and the edge density of this network is approximately 0.41%.

4.1 Simmons College network

The Simmon college network contains all the friendship links between Facebook users within the Simmons College, recorded on a specific day in September 2005 (Traud et al., 2012). Traud et al. (2012) observed that the community structure of this network is strongly correlated with graduation year, meaning that students in the same year are more likely to be friends. Hence, the true number of clusters in this network can be considered as K=4K=4. It was pointed out in Jin et al. (2021) that the community structure in the Simmons College network is very weak and community detection algorithms usally yield high misclassification errors.

We test K0=1,2,3,4,5,6K_{0}=1,2,3,4,5,6 using our proposed method. The resulting test statistics are T=105.85,58.49,29.90,24.19,17.64T=105.85,58.49,29.90,24.19,17.64 and 14.25, with corresponding 5% critical values c0.05=74.20,82.99,96.65,87.62,75.30c_{0.05}=74.20,82.99,96.65,87.62,75.30 and 63.99. Accordingly, TT rejects H0:K=1H_{0}:K=1 and does not reject H0:K=2H_{0}:K=2, H0:K=3H_{0}:K=3, H0:K=4H_{0}:K=4, H0:K=5H_{0}:K=5, and H0:K=6H_{0}:K=6. Hence, TT identifies at least two communities. We also employ TLei,bootT_{Lei,boot}, THu,bootaugT_{Hu,boot}^{aug} and THanT_{Han} to conduct the same hypothesis testing. The test results are THan=7.80,1.62,0.17,1.28,0.79T_{Han}=7.80,1.62,-0.17,1.28,0.79 and 0.02-0.02, with the critical value c0.05,Han=1.65c_{0.05,Han}=1.65. Therefore, THanT_{Han} yields the same results as TT, rejecting H0:K=1H_{0}:K=1 but not H0:K=2H_{0}:K=2, H0:K=3H_{0}:K=3, H0:K=4H_{0}:K=4, H0:K=5H_{0}:K=5, and H0:K=6H_{0}:K=6. Additionally, we find TLei,boot=36.28,23.81,20.74,6.56,3.74,2.68T_{Lei,boot}=36.28,23.81,20.74,6.56,3.74,2.68, and THu,bootaug>200T_{Hu,boot}^{aug}>200 for all K0=1,2,3,4,5,6K_{0}=1,2,3,4,5,6, with correspondingly critical values c0.05,Lei=1.45c_{0.05,Lei}=1.45 and c0.05,Hu=3.41c_{0.05,Hu}=3.41, respectively. Therefore, TLei,bootT_{Lei,boot} and THu,bootaugT_{Hu,boot}^{aug} reject all six null hypotheses.

4.2 Political Blog network

This dataset records links between internet blogs over the period of two months before the 2004 U.S. presidential election. The nodes are blogs, and the edges represent web links between the blogs. We consider the largest connected component of this network, consisting of 1,222 nodes, as commonly used in existing studies (Lei, 2016; Hu et al., 2021; Han et al., 2023). Note that all the blogs fall into two communities based on political stance, namely “conservative” and “liberal”.

Using this network dataset, Lei (2016) tested for H0:K=2H_{0}:K=2 under the SBM. In particular, he obtained TLei=1172.30T_{Lei}=1172.30 and TLei,Boot=491.50T_{Lei,Boot}=491.50. Given the 5% nominal level, both are much larger than the critical value c0.05,Lei=1.45c_{0.05,Lei}=1.45, and thus strongly reject the null hypothesis of H0:K=2H_{0}:K=2 under the SBM. In Hu et al. (2021), the test of H0:K=2H_{0}:K=2 was not rejected under the DCSBM. In addition, Han et al. (2023) tested H0:K=1H_{0}:K=1 and H0:K=2H_{0}:K=2 separately. They found that THan=2.71T_{Han}=2.71 under H0:K=1H_{0}:K=1 and THan=0.89T_{Han}=-0.89 under H0:K=2H_{0}:K=2, where the critical value is c0.05,Han=1.65c_{0.05,Han}=1.65. Accordingly, THanT_{Han} rejected H0:K=1H_{0}:K=1 and did not reject H0:K=2H_{0}:K=2. Their result supports the finding in Hu et al. (2021).

We also test H0:K=1H_{0}:K=1 and H0:K=2H_{0}:K=2 using our proposed test. Under H0:K=1H_{0}:K=1, we have T=34.25T=34.25, which is larger than the critical value c0.05=30.16c_{0.05}=30.16. Under H0:K=2H_{0}:K=2, we obtain that T=6.84T=6.84, which is smaller than the critical value c0.05=22.18c_{0.05}=22.18. Consequently, we reject H0:K=1H_{0}:K=1 and do not reject H0:K=2H_{0}:K=2. This finding agrees with that of Han et al. (2023), which corroborates the prior knowledge that all blogs can be divided into politically “conservative” and “liberal” communities.

4.3 Sina Weibo Data

This dataset includes friendship links between 2,580 Sina Weibo users, and Aij=1A_{ij}=1 if user ii and user jj follow each other (Wu et al., 2022). According to the analysis of Wu et al. (2022), users can be partitioned into four quadrants based on two nodal influence indices. For each node ii, the two indices are ’inward influence,’ which measures node ii’s receptivity to being influenced by others, and ’outward influence,’ which gauges the influence node ii exerts on others. We report the results of testing K0=1,2,3,4,5K_{0}=1,2,3,4,5, as all tests with K0>5K_{0}>5 yield the same conclusion. The resulting test statistics are T=105.43,42.83,28.12,24.25T=105.43,42.83,28.12,24.25 and 22.67, with corresponding 5% critical values c0.05=76.22,53.86,50.11,32.78c_{0.05}=76.22,53.86,50.11,32.78 and 32.04. Thus, TT rejects H0:K=1H_{0}:K=1 and does not reject H0:K=2H_{0}:K=2, H0:K=3H_{0}:K=3, H0:K=4H_{0}:K=4 and H0:K=5H_{0}:K=5. Hence, TT identifies at least two communities. We also employ TLei,bootT_{Lei,boot}, THan,bootaugT_{Han,boot}^{aug} and THanT_{Han} to conduct hypothesis testing. The test results show TLei,boot>100T_{Lei,boot}>100 and THan,bootaug>200T_{Han,boot}^{aug}>200 for all K0=1,2,3,4,5K_{0}=1,2,3,4,5 with their correspondingly critical values c0.05,Lei=1.45c_{0.05,Lei}=1.45 and c0.05,Hu=3.41c_{0.05,Hu}=3.41, respectively. For THanT_{Han}, the test statistics are 19.26, 7.85, 8.66, 8.92 and 3.65, with critical values c0.05,Han=1.65c_{0.05,Han}=1.65. Therefore, TLei,bootT_{Lei,boot}, THan,bootaugT_{Han,boot}^{aug} and THanT_{Han} reject all five null hypotheses.

5 CONCLUDING REMARKS

Two possible avenues for future research are identified. First, generalize the proposed method to accommodate the nonparametric graphical models (Wolfe & Olhede, 2013). Second, study the validity of the eigengap-ratio test for correlated binary data such as the network generated by a notable latent space model (Hoff et al., 2002). We believe these efforts will further increase the applicability of the proposed method.

SUPPLEMENTARY MATERIAL

The supplementary material includes five sections. Section S.1 provides technical notation. Section S.2 introduces a lemma used to prove Theorem 1. Sections S.3–S.4 present the proofs of Theorem 1 and Theorem 2, respectively. Section S.5 reports the additional simulation results.

References

  • Adamic & Glance (2005) Adamic, L. A. & Glance, N. (2005). The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery. ACM, New York.
  • Airoldi et al. (2008) Airoldi, E. M., Blei, D., Fienberg, S. & Xing, E. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9.
  • Amini et al. (2013) Amini, A. A., Chen, A., Bickel, P. J. & Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics 41.
  • Bickel & Chen (2009) Bickel, P. J. & Chen, A. (2009). A nonparametric view of network models and newman–girvan and other modularities. Proceedings of the National Academy of Sciences 106, 21068–21073.
  • Chen & Lei (2018) Chen, K. & Lei, J. (2018). Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association 113, 241–251.
  • Choi et al. (2012) Choi, D. S., Wolfe, P. J. & Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika 99, 273–284.
  • Daudin et al. (2008) Daudin, J.-J., Picard, F. & Robin, S. (2008). A mixture model for random graphs. Statistics and Computing 18, 173–183.
  • Ding & Yang (2022) Ding, X. & Yang, F. (2022). Tracy-widom distribution for heterogeneous gram matrices with applications in signal detection. IEEE Transactions on Information Theory 68, 6682–6715.
  • Goldenberg et al. (2010) Goldenberg, A., Zheng, A. X., Fienberg, S. E., Airoldi, E. M. et al. (2010). A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 129–233.
  • Granovetter (1973) Granovetter, M. S. (1973). The strength of weak ties. American Journal of Sociology 78, 1360–1380.
  • Han et al. (2023) Han, X., Yang, Q. & Fan, Y. (2023). Universal rank inference via residual subsampling with application to large networks. The Annals of Statistics 51, 1109–1133.
  • Hoff et al. (2002) Hoff, P. D., Raftery, A. E. & Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97, 1090–1098.
  • Holland et al. (1983) Holland, P. W., Laskey, K. B. & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks 5, 109–137.
  • Hu et al. (2020) Hu, J., Qin, H., Yan, T. & Zhao, Y. (2020). Corrected bayesian information criterion for stochastic block models. Journal of the American Statistical Association 115, 1771–1783.
  • Hu et al. (2021) Hu, J., Zhang, J., Qin, H., Yan, T. & Zhu, J. (2021). Using maximum entry-wise deviation to test the goodness of fit for stochastic block models. Journal of the American Statistical Association 116, 1373–1382.
  • Jin (2015) Jin, J. (2015). Fast community detection by score. The Annals of Statistics 43, 57–89.
  • Jin et al. (2017) Jin, J., Ke, Z. T. & Luo, S. (2017). Estimating network memberships by simplex vertex hunting. arXiv preprint arXiv:1708.07852 12.
  • Jin et al. (2021) Jin, J., Ke, Z. T. & Luo, S. (2021). Improvements on score, especially for weak signals. Sankhya A , 1–36.
  • Jin et al. (2024) Jin, J., Ke, Z. T. & Luo, S. (2024). Mixed membership estimation for social networks. Journal of Econometrics 239, 105369.
  • Jin et al. (2023) Jin, J., Ke, Z. T., Luo, S. & Wang, M. (2023). Optimal estimation of the number of network communities. Journal of the American Statistical Association 118, 2101–2116.
  • Joseph & Yu (2016) Joseph, A. & Yu, B. (2016). Impact of regularization on spectral clustering. The Annals of Statistics 44, 1765–1791.
  • Karrer & Newman (2011) Karrer, B. & Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical Review E-Statistical, Nonlinear, and Soft Matter Physics 83, 016107.
  • Katona et al. (2011) Katona, Z., Zubcsek, P. P. & Sarvary, M. (2011). Network effects and personal influences: The diffusion of an online social network. Journal of Marketing Research 48, 425–443.
  • Ke & Wang (2022) Ke, Z. T. & Wang, J. (2022). Optimal network membership estimation under severe degree heterogeneity. arXiv preprint arXiv:2204.12087 .
  • Kolaczyk (2009) Kolaczyk, E. D. . (2009). Statistical analysis of network data: Methods and models. New York: Springer.
  • Le & Levina (2015) Le, C. M. & Levina, E. (2015). Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827 .
  • Lee & Schnelli (2015) Lee, J. O. & Schnelli, K. (2015). Edge universality for deformed wigner matrices. Reviews in Mathematical Physics 27, 1550018.
  • Lee & Yin (2014) Lee, J. O. & Yin, J. (2014). A necessary and sufficient condition for edge universality of wigner matrices. Duke Mathematical Journal 163, 117–173.
  • Lei (2016) Lei, J. (2016). A goodness-of-fit test for stochastic block models. The Annals of Statistics 44, 401–424.
  • Lei & Rinaldo (2015) Lei, J. & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics , 215–237.
  • Lorrain & White (1971) Lorrain, F. & White, H. C. (1971). Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology 1, 49–80.
  • Ma et al. (2021) Ma, S., Su, L. & Zhang, Y. (2021). Determining the number of communities in degree-corrected stochastic block models. Journal of Machine Learning Research 22, 3217–3279.
  • Newman (2006) Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103, 8577–8582.
  • Newman & Girvan (2004) Newman, M. E. & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E 69, 026113.
  • Noroozi et al. (2019) Noroozi, M., Rimal, R. & Pensky, M. (2019). Estimation and clustering in popularity adjusted stochastic block model. arXiv preprint arXiv:1902.00431 .
  • Nowicki & Snijders (2001) Nowicki, K. & Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96, 1077–1087.
  • Onatski (2009) Onatski, A. (2009). A formal statistical test for the number of factors in the approximate factor models. Econometrica 77, 1447–1480.
  • Qin & Rohe (2013) Qin, T. & Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. Advances in Neural Information Processing Systems 26.
  • Rohe et al. (2011) Rohe, K., Chatterjee, S. & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39, 1878–1915.
  • Saldana et al. (2017) Saldana, D. F., Yu, Y. & Feng, Y. (2017). How many communities are there? Journal of Computational and Graphical Statistics 26, 171–181.
  • Sarkar & Bickel (2015) Sarkar, P. & Bickel, P. J. (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics 43, 962–990.
  • Schnelli & Xu (2022) Schnelli, K. & Xu, Y. (2022). Convergence rate to the tracy–widom laws for the largest eigenvalue of wigner matrices. Communications in Mathematical Physics 393, 839–907.
  • Sengupta & Chen (2015) Sengupta, S. & Chen, Y. (2015). Spectral clustering in heterogeneous networks. Statistica Sinica , 1081–1106.
  • Sengupta & Chen (2018) Sengupta, S. & Chen, Y. (2018). A block model for node popularity in networks with community structure. Journal of the Royal Statistical Society Series B: Statistical Methodology 80, 365–386.
  • Snijders & Nowicki (1997) Snijders, T. A. & Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification 14, 75–100.
  • Tracy & Widom (1994) Tracy, C. A. & Widom, H. (1994). Level-spacing distributions and the airy kernel. Communications in Mathematical Physics 159, 151–174.
  • Tracy & Widom (1996) Tracy, C. A. & Widom, H. (1996). On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics 177, 727–754.
  • Traud et al. (2012) Traud, A. L., Mucha, P. J. & Porter, M. A. (2012). Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 4165–4180.
  • Valente (2010) Valente, T. W. (2010). Social Networks and Health: Models, Methods, and Applications 1st Edition. Oxford University Press.
  • Wang et al. (2023) Wang, J., Zhang, J., Liu, B., Zhu, J. & Guo, J. (2023). Fast network community detection with profile-pseudo likelihood methods. Journal of the American Statistical Association 118, 1359–1372.
  • Wang & Bickel (2017) Wang, Y. R. & Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics 45, 500–528.
  • Wasserman & Faust (1994) Wasserman, S. & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge University Press.
  • White & Murphy (2016) White, A. & Murphy, T. B. (2016). Mixed-membership of experts stochastic blockmodel. Network Science 4, 48–80.
  • Wolfe & Olhede (2013) Wolfe, P. J. & Olhede, S. C. (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936 .
  • Wu et al. (2022) Wu, Y., Lan, W., Zou, T. & Tsai, C.-L. (2022). Inward and outward network influence analysis. Journal of Business & Economic Statistics 40, 1617–1628.
  • Zhang & Zhou (2016) Zhang, A. Y. & Zhou, H. H. (2016). Minimax rates of community detection in stochastic block models. The Annals of Statistics 44, 2252–2280.
  • Zhang et al. (2020a) Zhang, J., Sun, W. W. & Li, L. (2020a). Mixed-effect time-varying network model and application in brain connectivity analysis. Journal of the American Statistical Association 115, 2022–2036.
  • Zhang & Amini (2023) Zhang, L. & Amini, A. (2023). Adjusted chi-square test for degree-corrected block models: Experiments in r. The Annals of Statistics 51, 2366–2385.
  • Zhang et al. (2020b) Zhang, Y., Levina, E. & Zhu, J. (2020b). Detecting overlapping communities in networks using spectral methods. SIAM Journal on Mathematics of Data Science 2, 265–283.
  • Zhao et al. (2012) Zhao, Y., Levina, E. & Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics 40, 2266–2292.