An Eigengap Ratio Test for Determining the Number of Communities in Network Data
Abstract
To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others.
A critical step in applying these models effectively is determining the number of communities in the network. However, to our knowledge, existing methods for estimating the number of network communities often require model estimations or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that address these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.
KEY WORDS: Block Models; Community Detection; Consistency; Eigengap-ratio Test; Type-I Tracy-Widom Distribution
1 INTRODUCTION
Network data analysis has been extensively explored across various disciplines; see for example, Kolaczyk (2009), Goldenberg et al. (2010),Valente (2010) and Katona et al. (2011). Notably, Lorrain & White (1971) and Granovetter (1973) identified that the generation of network links often follows a block (or community) structure. Since then, numerous researchers have focused on block models to characterize the block structure of network generation mechanisms and enhance the utilization of network data. The widely studied block models include the stochastic block model (Holland et al., 1983; Wasserman & Faust, 1994; Snijders & Nowicki, 1997; Nowicki & Snijders, 2001; Bickel & Chen, 2009; Rohe et al., 2011; Choi et al., 2012; Jin, 2015; Zhang & Zhou, 2016; Zhang et al., 2020a), the degree-corrected stochastic block model (Karrer & Newman, 2011; Zhao et al., 2012; Qin & Rohe, 2013; Zhang & Amini, 2023; Hu et al., 2021; Jin et al., 2023), the mixed membership model (Airoldi et al., 2008; White & Murphy, 2016; Ma et al., 2021) and the degree-corrected mixed membership model (Jin et al., 2017; Zhang et al., 2020b; Ke & Wang, 2022; Han et al., 2023; Jin et al., 2024), among others.
In this paper, we consider an undirected network with no self-loops, and assume that its associated adjacency matrix is symmetric with diagonal elements all equal to zero. In addition, we assume that the network is generated from a Bernoulli-distributed random graph model. That is,
(1.1) |
where is the symmetric edge probability matrix. That is, is the probability of connection between nodes and , . In existing block models, the probability matrix typically has a rank of , where is the number of communities. We next briefly describe the four common block models mentioned above.
(a) The stochastic block model (SBM). The SBM (Holland et al., 1983) assumes that the network has communities and that each node belongs to one specific community. Let denote the membership vector, where represents the community to which node belongs, for . The SBM assumes that, for every ,
(1.2) |
where is the symmetric community probability matrix. The diagonal elements of , , is set to be for any . Under (1.2), one can show that matrix is of rank .
(b) The degree-corrected stochastic block model (DCSBM). The SBM assumes that the edge connecting any two nodes depends only on the communities in which they are located, and nodes within the same community have equal expected degrees. To allow for greater node degree heterogeneity, Karrer & Newman (2011) proposed the degree-corrected stochastic block model (DCSBM) by introducing a vector of nodal degree parameters that measures the degree variation of nodes. Accordingly, the DCSBM assumes
(1.3) |
for every , where and are defined in (1.2). The diagonal elements of , , is set to be for any . Identifying DCSBM requires constraints such as for each community , where is the number of nodes in community . Under (1.3), one can show that matrix is of rank (Jin et al., 2023).
(c) The mixed membership model (MM). Both SBM and DCSBM require that each node belongs to a single community and the communities are nonoverlapping. To address this limitation, Airoldi et al. (2008) developed the mixed membership stochastic block model (MM) for handling the case of overlapping communities. For each node , the MM introduces a -dimensional probability mass vector, , where represents the probability of node belonging to community for and . Then, the MM assumes that
(1.4) |
where is defined in (1.2). Under (1.4), one can show that matrix is of rank .
(d) The degree-corrected mixed membership model (DCMM). Jin et al. (2017) introduced the degree-corrected mixed membership model (DCMM), which is an extension of DCSBM that accommodates overlapping communities. Specifically, DCMM assumes that
(1.5) |
where s, s and are defined in (1.2)-(1.4). The DCMM is the most general class of models, and it includes the SBM, DCSBM and MM as special cases. One can show that the edge probability matrix under DCMM is also of rank (Jin et al., 2023).
Given the number of communities , many algorithmic solutions have been proposed to estimate the community label or in the above block models. They include the modularity-based methods (Newman & Girvan, 2004; Newman, 2006; Sengupta & Chen, 2018), likelihood-based methods (Bickel & Chen, 2009; Amini et al., 2013; Choi et al., 2012; Zhao et al., 2012; Wang et al., 2023), variational methods (Daudin et al., 2008; Airoldi et al., 2008), spectral clustering methods (Rohe et al., 2011; Jin, 2015; Lei & Rinaldo, 2015; Joseph & Yu, 2016; Qin & Rohe, 2013; Sarkar & Bickel, 2015; Sengupta & Chen, 2015), among others.
In practice, however, the number of communities is often unknown a priori. In recent years, many methods have been proposed for estimating . These methods include the spectral methods (Le & Levina, 2015), stepwise selection methods (Zhang & Amini, 2023; Jin et al., 2023), sequential testing methods (Lei, 2016; Hu et al., 2021), network cross-validation methods (Chen & Lei, 2018) and likelihood-based methods (Daudin et al., 2008; Wang & Bickel, 2017; Saldana et al., 2017; Noroozi et al., 2019; Hu et al., 2020; Ma et al., 2021). Han et al. (2023) proposed a sequential test that can estimate matrix ranks and it can be applied to infer in block models. To our knowledge, these methods encounter at least one of the following challenges or restrictions.
First, the networks need to be assumed dense. For example, Lei (2016) and Hu et al. (2021) assumed that the edge probabilities s have a constant order of for every . Second, the number of network communities needs to be finite as tends to infinity; see, for example, Zhang & Amini (2023) and Han et al. (2023). Third, the unknown network distribution parameters need to be estimated. For example, Lei (2016) and Hu et al. (2021) estimated , and before applying their methods to estimate . Similar approaches can be found in the literature such as Noroozi et al. (2019); Zhang & Amini (2023); Hu et al. (2021); Ma et al. (2021); Jin et al. (2023). Fourth, tuning parameters are needed; see, for example, Wang & Bickel (2017), Hu et al. (2020), Ma et al. (2021) and Han et al. (2023).
Feature | Proposed method | Lei (2016) | Hu et al. (2021) | Han et al. (2023) | Jin et al. (2023) |
---|---|---|---|---|---|
Dense networks with | |||||
Sparse networks with | |||||
Diverging number of communities | |||||
Avoiding parameter estimations | |||||
No tuning parameters |
In this paper, we propose an eigengap-ratio based test for estimating the rank of , which can in turn be used to determine the number of communities in block models such as SBM, DCSBM, MM and DCMM. The proposed test is straightforward to implement and requires only the calculations of the eigenvalues of . It accommodates sparse networks, and allows for a diverging without the need for parameter tuning; see Table 1 for a comparison of our proposed method with some recent methods. The main idea behind the proposed test statistic is as follows. Let and denote the true and hypothetical ranks of , respectively, where is unknown. To test whether , we construct the test statistic as a ratio of two eigengaps, that is, the difference between pairs of eigenvalues derived from the adjacency matrix . A key advantage of this approach is that it only requires a partial eigen-decomposition of the adjacency matrix . Despite the simple form of the test statistic, deriving its null distribution is challenging, as s are Bernoulli random variables with distinct probability parameters s, which may tend to zero as increases. We show that the asymptotic null distribution of the test statistic follows a function of the type-I Tracy-Widom distribution constructed from the Wigner matrix. Additionally, we demonstrate that the proposed test is asymptotically powerful, with the test statistic diverging at a rate no slower than under the alternative hypothesis.
It is important to emphasize that our proposed test is designed to estimate the rank of the probability matrix and does not assume a specific block model or require the estimation of model parameters. Consequently, our test statistic does not provide a goodness-of-fit assessment for any specific block model. In contrast, block model-specific test methods such as those studied in Lei (2016), Hu et al. (2021), Ma et al. (2021) and Jin et al. (2023), use test statistics to assess the goodness-of-fit of a presumed block model given a hypothetical number of communities. Due to the nature of their design, these tests typically rely on consistent estimators of the block model parameters, such as group membership or mixing proportions s.
2 METHOD AND THEORETICAL PROPERTIES
2.1 Hypothesis Testing
Consider a network of nodes with the symmetric adjacency matrix generated from (1.1). The aim of this paper is to test the following hypothesis:
(2.1) |
where and denote the true and the hypothetical ranks of , respectively, and is a pre-specified maximum value of . In this paper, we set ; see discussions after Condition (C2). We consider the one-sided alternative, , since a network with communities can also be represented as a network with communities by splitting one or more true communities. The one-sided alternative was also considered in Lei (2016), Chen & Lei (2018), Wang & Bickel (2017) and Hu et al. (2021).
2.2 An Eigengap-Ratio Based Test
Recall is the adjacency matrix with no self-loops. We define
(2.2) |
where is the expected value of , and is the noise matrix with zero-mean entries and zero diagonals elements. It is straightforward to see that entries of have finite variances. Given , (2.2) can then be rewritten as
(2.3) |
where represents the diagonal matrix constructed from .
Under the null hypothesis of : , the rank of is , which implies that has nonzero eigenvalues. Let denote the -th largest eigenvalue of matrix . We can show that the eigenvalues for are determined by an principal submatrix of , denoted by , whose eigenvalues follow the type-I Tracy-Widom distribution. This result motivates the proposal of an eigengap-ratio test statistic, defined as:
(2.4) |
Given that the true rank is , if , we expect to be much larger than . In contrast, when , the ratio of these two eigengaps is not expected to be large. In Theorem 1 of Section 2.3, we show that the asymptotic distribution of under is a function of the type-I Tracy-Widom distribution. Specifically, we first carefully bound the difference between and for , and then demonstrate that the distribution of converges to the type-I Tracy-Widom distribution, where and are scaling and location parameters that do not depend on .
The construction of in (2.4) offers two key advantages. First, both and the asymptotic distribution of involve scaling and location parameters that are very difficult to characterize or estimate. However, these parameters do not depend on and can be directly cancelled out in the eigengap-ratio statistic, eliminating the need for their estimation. Second, by using the eigengap ratio rather than individual eigenvalues, we avoid the requirement that converge to (up to a scaling parameter) faster than for . This simplifies the theoretical analysis and relaxes the assumptions necessary to establish the asymptotic null distribution.
Remark 1.
Eigengap-ratio statistics have been studied in factor analysis and signal analysis. For example, Onatski (2009) proposed an eigengap-ratio test to detect the number of factors by testing the rank of the factor loading matrix, and Ding & Yang (2022) developed a maximum eigengap-ratio test to examine the number of signals by testing the rank of the signal matrix in a signal-plus-noise model. However, in these models, tthe number of factors and signals are usually assumed to be fixed, and both the loading matrix and signal matrix are often presumed to be dense. In contrast, our test allows for a sparse and permits to diverge with .
Remark 2.
The complexity of finding all eigenvalues of is roughly , and this can be computationally costly. In contrast, our test statistic , for any given , only requires calculating the first eigenvalues of . Hence, the complexity can be reduced to or even to if the network is sparse.
2.3 Asymptotic Null Distribution
The Tracy-Widom law, introduced by Tracy & Widom (1994, 1996) to characterize the distributions of the eigenvalues of Wigner matrices, has been extensively studied in a series of papers (Lee & Yin, 2014; Lee & Schnelli, 2015; Schnelli & Xu, 2022). Specifically, let be a symmetric Wigner matrix wtih iid Gaussian entires of mean zero and variance . Then, converges to the Tracy-Widom distribution. Under the null hypothesis , we employ the Tracy-Widom law to derive the asymptotic null distribution of . In addition, we reparameterize as , where and is fixed. This allows the connecting probability matrix to scale with , a commonly approach in the literature (Bickel & Chen, 2009; Zhao et al., 2012). We next present some regularity conditions.
-
(C1)
Assume the connecting probability matrix is symmetric with for any .
-
(C2)
Assume , and , where is a constant satisfying .
-
(C3)
Assume that , where is the -th largest eigenvalue of in absolute value and is a constant.
Condition (C1) requires the entries of the edge probability matrix to be in , a condition similarly considered in Lei (2016), Wang & Bickel (2017) and Hu et al. (2021). Condition (C2) allows the edge probabilities s to decay to zero slower than , accommodating sparse networks. This condition is less restrictive than those of Lei (2016) and Hu et al. (2021), which assume dense networks where . However, it is stronger than the conditions in Han et al. (2023) and Jin et al. (2023) to ensure that the distribution of converges to the type-I Tracy-Widom distribution. It also allows the number of communities to diverge at a rate no faster than , which is less stringent than the fixed considered in Han et al. (2023). By setting , it holds that . Condition (C3) imposes a lower bound on , so it is bounded away from zero as . A similar condition was also considered in Theorem 3.1 of Lei & Rinaldo (2015).
Based on the above conditions, we show the asymptotic null distribution of the test statistic in the following theorem.
Theorem 1.
Assume that Conditions (C1)-(C3) hold. Under the null hypothesis , we have for any , as , where is the cumulative distribution function for the test statistic , and is the cumulative distribution function of , defined as
Here, is a Wigner matrix with i.i.d. entries of mean zero and variance , and denotes asymptotic equivalence.
The above theorem indicates that the distribution of the test statistic is asymptotically equivalent to that of . This allows us to employ a function of the Tracy-Widom distributions to test the null hypothesis . Specifically, for a given nominal level , let be the upper -quantile of . We then tabulate the critical value by approximating via a numerical method, such as Monte Carlo simulation, and reject if .
2.4 Asymptotic Power
Under the alternative hypothesis , the eigenvalue is expected to be large, and the limiting distribution of no longer converges to that of . Indeed, we demonstrate that diverges with under , with strong discriminatory power against the alternative hypothesis.
Theorem 2.
Under Conditions (C1)-(C3) and the alternative hypothesis , it holds that as for any positive constant .
The above theorem implies that for a given nominal level , under the alternative hypothesis . In particular, Theorem 2 indicates that diverges at a rate no slower than under the alternative, making our test more powerful than those of Han et al. (2023) and Hu et al. (2021). Specifically, the test statistic in Han et al. (2023) is greater than a constant under the alternative, while the test statistic of Hu et al. (2021) diverges at an order of under the alternative. Additionally, the test statistic of Hu et al. (2021) lacks power under the planted partition stochastic block model, that is, a stochastic block model with equal community sizes, equal s, and equal s for . We next demonstrate through simulation studies that performs well in both size and power.
3 SIMULATION STUDIES
To evaluate the performance of the proposed test in finite samples, we conduct simulation studies based on three models SBM, DCSBM and DCMM, to test the following hypotheses:
The MM model is not included as it is a special case of DCMM.
To simulate from the DCMM in (1.5) with , we follow the approach in Han et al. (2023) to generate and the approach in Zhao et al. (2012) to generate . Specifically, let , where is the true number of communities and is an -dimensional unit vector with the th element equal to 1 and others equal to 0, and . We generate independently from a distribution with unit expectation, that is,
where is uniformly distributed on the interval . We further set and to represent the two types of degree parameters, where is an -dimensional identity matrix.
The network is set as with communities, the size of each community is , and . For the SBM setting, we set and . For the DCSBM setting, we set and . Under the DCMM setting, we set and . This model assigns pure nodes to each community and then equally allocates nodes across the mixed membership with the three probability mass vectors in .
We consider both dense and sparse networks to assess the performance of the proposed test. We also compare the proposed test with five competing tests: two in Lei (2016), one in Han et al. (2023) and two in Hu et al. (2021), denoted as , , , , and , respectively. Here, and are tests that apply a bootstrap correction and an augmented-bootstrap procedure to and , respectively. See more details in Lei (2016) and Hu et al. (2021). Since the methods of Lei (2016) and Hu et al. (2021) are not applicable for DCMM, we omit them in simulation studies for DCMM. To obtain the limiting distribution in our test, we generate 1000 Wigner matrices , where the s are independently generated from for . This allows us to approximate the distribution of and determine the critical value at a given nominal level.
3.1 Simulations Under Dense Networks
We consider two dense edge probability matrices, modified from Hu et al. (2021) and Han et al. (2023), denoted as and , respectively. In , the edge probability between any two communities and is set to . In , we define if , and otherwise. The edge probabilities in and do not decrease with , and represent dense network scenarios. We implement for SBM and DCSBM, incorporating the methods of Lei (2016), Hu et al. (2021), and Han et al. (2023) for comparison. Additionally, we implement for DCMM, and compare with Han et al. (2023).
Tables 2-3 report the empirical sizes and powers for SBM and DCSBM, respectively, under the edge probability matrix . The simulation results reveal two important findings. First, the empirical sizes of , and are close to the 5% nominal level, although performs slightly worse than and . Under SBM, when , the sizes of , and deviate notably from the nominal level. Under DCSBM, when , exhibits size distortions, which can be much larger than 5%. Additionally, the empirical sizes of and deviate significantly from the 5% nominal level, except for under in Table 3. This is because and are not designed for DCSBM. Second, the empirical powers of and are generally equal to 1, with performs more powerfully than , while is less powerful. Notably, shows almost no power. This could be due to the fact that the assumptions in Theorem 3.2 of Han et al. (2023) for ensuring the asymptotic power of are not satisfied under the edge probability matrix . In addition, under SBM, the powers. of , and are generally equal to 1. However, under DCSBM, due to the size distortions of and , we do not further analyze their empirical powers.
Table 4 reports the empirical sizes and powers of DCMM under the edge probability matrix . The simulation results show two findings. First, the empirical sizes of are close to the nominal level 5%. However, exhibits size distortion when . This finding indicates that the test may not be suitable for large . Second, the empirical powers of get larger as increases. Since the sizes of are distorted for and the powers of are not steady for , we do not further analyze their empirical powers. In sum, our proposed test generally outperforms other competing methods under SBM, DCSBM and DCMM with dense networks.
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.053 | 1 | 1 | 1 | 1 | 0.052 | 0.068 | 0.048 | 0.035 | 0.025 | |
* | 0.060 | 1 | 1 | 1 | * | 0.045 | 0.030 | 0.047 | 0.045 | |
* | * | 0.042 | 1 | 1 | * | * | 0.047 | 0.030 | 0.043 | |
* | * | * | 0.058 | 1 | * | * | * | 0.038 | 0.045 | |
* | * | * | * | 0.048 | * | * | * | * | 0.045 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.068 | 0.115 | 0.828 | 0.993 | 1 | 0.067 | 0.993 | 1 | 1 | 1 | |
* | 0.080 | 0.580 | 0.976 | 1 | * | 0.042 | 1 | 1 | 1 | |
* | * | 0.130 | 0.500 | 0.985 | * | * | 0.067 | 1 | 1 | |
* | * | * | 0.240 | 0.565 | * | * | * | 0.035 | 0.915 | |
* | * | * | * | 0.398 | * | * | * | * | 0.045 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.065 | 1 | 1 | 1 | 1 | 0.052 | 1 | 1 | 1 | 1 | |
* | 0.072 | 1 | 1 | 1 | * | 0.097 | 1 | 1 | 1 | |
* | * | 0.182 | 1 | 1 | * | * | 0.137 | 1 | 1 | |
* | * | * | 0.614 | 1 | * | * | * | 0.731 | 1 | |
* | * | * | * | 0.846 | * | * | * | * | 0.930 |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.035 | 1 | 1 | 1 | 1 | 0.047 | 0.055 | 0.044 | 0.035 | 0.050 | |
* | 0.053 | 1 | 1 | 1 | * | 0.065 | 0.050 | 0.055 | 0.044 | |
* | * | 0.048 | 1 | 1 | * | * | 0.053 | 0.050 | 0.050 | |
* | * | * | 0.033 | 1 | * | * | * | 0.050 | 0.050 | |
* | * | * | * | 0.043 | * | * | * | * | 0.052 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.035 | 0.275 | 0.840 | 0.987 | 1 | 0.028 | 0.855 | 1 | 1 | 1 | |
* | 0.042 | 0.465 | 0.845 | 1 | * | 0.053 | 0.892 | 1 | 1 | |
* | * | 0.183 | 0.766 | 1 | * | * | 0.071 | 0.905 | 1 | |
* | * | * | 0.302 | 1 | * | * | * | 0.062 | 0.995 | |
* | * | * | * | 0.711 | * | * | * | * | 0.041 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.110 | 1 | 1 | 0.995 | 0.990 | 0.065 | 1 | 0.588 | 0.573 | 0.645 | |
* | 1 | 0.936 | 0.978 | 0.984 | * | 0.007 | 0 | 0.008 | 0.688 | |
* | * | 0.267 | 0.820 | 0.810 | * | * | 0.084 | 0.012 | 0.006 | |
* | * | * | 0.115 | 0.612 | * | * | * | 0.015 | 0 | |
* | * | * | * | 0.162 | * | * | * | * | 0.026 |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.057 | 0.345 | 0.610 | 0.865 | 0.965 | 0.033 | 1 | 1 | 0.405 | 0.785 | |
* | 0.057 | 0.330 | 0.440 | 0.645 | * | 0.040 | 1 | 0.405 | 0.785 | |
* | * | 0.037 | 0.185 | 0.395 | * | * | 0.113 | 0.415 | 0.785 | |
* | * | * | 0.040 | 0.185 | * | * | * | 0.343 | 0.780 | |
* | * | * | * | 0.036 | * | * | * | * | 0.680 |
3.2 Simulations Under Sparse Networks
In this section, we consider two types of sparse edge probability matrix and . In , the edge probability between any two communities and is . In , we set if , and otherwise. Analogous to Section 3.1, we implement for SBM and DCSBM and for DCMM.
Tables 5-6 report the empirical sizes and powers for SBM and DCSBM, respectively, under the edge probability matrix . The results reveal the following two important findings. First, both and control sizes well. In addition, the sizes of and deviate notably from the nominal level. This is likely due to the fact that the tests in Lei (2016) and Hu et al. (2021) are not designed for sparse networks. After applying a bootstrap correction and an augmented-bootstrap procedure, although and give several reasonable empirical sizes, though both tests lack theoretical justification in sparse networks. Second, our proposed test is powerful against alternatives, and its empirical power steadily increases as gets large. This is consistent with our theoretical findings. In contrast, the empirical powers of are low since the two assumptions in Theorem 3.2 of Han et al. (2023) for ensuring the asymptotic power of are not satisfied under . Since most of the empirical sizes of the tests proposed by Lei (2016) and Hu et al. (2021) are distorted, we do not further analyze their empirical powers.
Under the sparse DCMM, the results in Table 7 show similar qualitative findings as in Table 4. For example, the empirical sizes of are close to the nominal level 5% and the empirical sizes of are distorted when . In addition, the empirical powers of get larger as increases. Since the sizes of are distorted for and the powers of are not steady for , we do not further analyze their empirical powers.
To make a more comprehensive comparison, we also carry out simulation studies with different network settings in the supplementary material. Specifically, we consider equal-sized communities with and the total number of nodes in the network is , with . The simulation results of SBM and DCSBM under this network setting and the simulation results of DCMM under edge probabilities and are given in Tables S.1-S.6 of the supplementary material, which show similar findings as in Tables 2-7. In summary, simulation studies demonstrate that our proposed test performs well across three block models under both dense and sparse networks. These findings are consistent with our theoretical results.
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.057 | 0.675 | 0.802 | 0.905 | 0.971 | 0.052 | 0.058 | 0.034 | 0.040 | 0.050 | |
* | 0.047 | 0.772 | 0.820 | 0.956 | * | 0.057 | 0.060 | 0.058 | 0.043 | |
* | * | 0.055 | 0.540 | 0.832 | * | * | 0.038 | 0.035 | 0.038 | |
* | * | * | 0.053 | 0.280 | * | * | * | 0.047 | 0.056 | |
* | * | * | * | 0.033 | * | * | * | * | 0.040 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.247 | 0.858 | 0.992 | 0.990 | 0.995 | 0.062 | 0.697 | 0.914 | 1 | 1 | |
* | 0.393 | 1 | 1 | 1 | * | 0.042 | 0.832 | 0.986 | 1 | |
* | * | 0.990 | 1 | 1 | * | * | 0.018 | 0.878 | 1 | |
* | * | * | 1 | 1 | * | * | * | 0.025 | 0.996 | |
* | * | * | * | 1 | * | * | * | * | 0.085 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.993 | 1 | 1 | 1 | 1 | 0.072 | 0.980 | 0.700 | 0.978 | 1 | |
* | 1 | 1 | 1 | 1 | * | 0.155 | 0.597 | 0.830 | 1 | |
* | * | 1 | 1 | 1 | * | * | 0.010 | 0.924 | 1 | |
* | * | * | 1 | 1 | * | * | * | 0.072 | 1 | |
* | * | * | * | 1 | * | * | * | * | 0 |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.038 | 0.398 | 0.535 | 0.888 | 1 | 0.042 | 0.050 | 0.048 | 0.033 | 0.043 | |
* | 0.057 | 0.243 | 0.492 | 0.898 | * | 0.067 | 0.056 | 0.046 | 0.038 | |
* | * | 0.046 | 0.482 | 0.610 | * | * | 0.047 | 0.036 | 0.040 | |
* | * | * | 0.040 | 0.530 | * | * | * | 0.062 | 0.053 | |
* | * | * | * | 0.032 | * | * | * | * | 0.043 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.018 | 0.843 | 0.940 | 0.673 | 0.568 | 0.060 | 0.475 | 0.786 | 0.873 | 1 | |
* | 0.138 | 1 | 0.932 | 1 | * | 0.082 | 0.708 | 0.612 | 0.965 | |
* | * | 0.850 | 1 | 1 | * | * | 0.053 | 0.536 | 1 | |
* | * | * | 1 | 1 | * | * | * | 0.151 | 0.910 | |
* | * | * | * | 0.849 | * | * | * | * | 0.016 | |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
0.173 | 1 | 1 | 1 | 1 | 0.040 | 0.963 | 0.560 | 0.745 | 0.350 | |
* | 0.142 | 1 | 1 | 1 | * | 0.157 | 0.491 | 0.161 | 0.975 | |
* | * | 0.826 | 1 | 1 | * | * | 0.077 | 0.242 | 0.117 | |
* | * | * | 0.036 | 1 | * | * | * | 0.036 | 1 | |
* | * | * | * | 0 | * | * | * | * | 0 |
K | 3 | 5 | 10 | 15 | 20 | 3 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|---|---|---|---|---|
0.043 | 0.280 | 0.465 | 0.655 | 0.895 | 0.047 | 1 | 1 | 0.900 | 0.960 | |
* | 0.037 | 0.250 | 0.455 | 0.595 | * | 0.040 | 1 | 0.900 | 0.960 | |
* | * | 0.050 | 0.190 | 0.395 | * | * | 0.420 | 0.905 | 0.960 | |
* | * | * | 0.037 | 0.130 | * | * | * | 0.793 | 0.955 | |
* | * | * | * | 0.041 | * | * | * | * | 0.933 |
4 REAL DATA ANALYSES
To illustrate the utility of our proposed method, we analyze three real-world networks. The first dataset is the facebook network in Simmons College that was first studied in Traud et al. (2012). The Simmons College network contains 1,137 nodes, and its edge density is approximately 3.76%. The second dataset is a political blog network that was first studied by Adamic & Glance (2005) and later analyzed by Lei (2016), Han et al. (2023) and Hu et al. (2021). This network contains 1,222 nodes, and its edge density is approximately 2.24%. The third dataset is a Sina Weibo user network that was analyzed by Wu et al. (2022). This network has 2,580 nodes, and the edge density of this network is approximately 0.41%.
4.1 Simmons College network
The Simmon college network contains all the friendship links between Facebook users within the Simmons College, recorded on a specific day in September 2005 (Traud et al., 2012). Traud et al. (2012) observed that the community structure of this network is strongly correlated with graduation year, meaning that students in the same year are more likely to be friends. Hence, the true number of clusters in this network can be considered as . It was pointed out in Jin et al. (2021) that the community structure in the Simmons College network is very weak and community detection algorithms usally yield high misclassification errors.
We test using our proposed method. The resulting test statistics are and 14.25, with corresponding 5% critical values and 63.99. Accordingly, rejects and does not reject , , , , and . Hence, identifies at least two communities. We also employ , and to conduct the same hypothesis testing. The test results are and , with the critical value . Therefore, yields the same results as , rejecting but not , , , , and . Additionally, we find , and for all , with correspondingly critical values and , respectively. Therefore, and reject all six null hypotheses.
4.2 Political Blog network
This dataset records links between internet blogs over the period of two months before the 2004 U.S. presidential election. The nodes are blogs, and the edges represent web links between the blogs. We consider the largest connected component of this network, consisting of 1,222 nodes, as commonly used in existing studies (Lei, 2016; Hu et al., 2021; Han et al., 2023). Note that all the blogs fall into two communities based on political stance, namely “conservative” and “liberal”.
Using this network dataset, Lei (2016) tested for under the SBM. In particular, he obtained and . Given the 5% nominal level, both are much larger than the critical value , and thus strongly reject the null hypothesis of under the SBM. In Hu et al. (2021), the test of was not rejected under the DCSBM. In addition, Han et al. (2023) tested and separately. They found that under and under , where the critical value is . Accordingly, rejected and did not reject . Their result supports the finding in Hu et al. (2021).
We also test and using our proposed test. Under , we have , which is larger than the critical value . Under , we obtain that , which is smaller than the critical value . Consequently, we reject and do not reject . This finding agrees with that of Han et al. (2023), which corroborates the prior knowledge that all blogs can be divided into politically “conservative” and “liberal” communities.
4.3 Sina Weibo Data
This dataset includes friendship links between 2,580 Sina Weibo users, and if user and user follow each other (Wu et al., 2022). According to the analysis of Wu et al. (2022), users can be partitioned into four quadrants based on two nodal influence indices. For each node , the two indices are ’inward influence,’ which measures node ’s receptivity to being influenced by others, and ’outward influence,’ which gauges the influence node exerts on others. We report the results of testing , as all tests with yield the same conclusion. The resulting test statistics are and 22.67, with corresponding 5% critical values and 32.04. Thus, rejects and does not reject , , and . Hence, identifies at least two communities. We also employ , and to conduct hypothesis testing. The test results show and for all with their correspondingly critical values and , respectively. For , the test statistics are 19.26, 7.85, 8.66, 8.92 and 3.65, with critical values . Therefore, , and reject all five null hypotheses.
5 CONCLUDING REMARKS
Two possible avenues for future research are identified. First, generalize the proposed method to accommodate the nonparametric graphical models (Wolfe & Olhede, 2013). Second, study the validity of the eigengap-ratio test for correlated binary data such as the network generated by a notable latent space model (Hoff et al., 2002). We believe these efforts will further increase the applicability of the proposed method.
SUPPLEMENTARY MATERIAL
The supplementary material includes five sections. Section S.1 provides technical notation. Section S.2 introduces a lemma used to prove Theorem 1. Sections S.3–S.4 present the proofs of Theorem 1 and Theorem 2, respectively. Section S.5 reports the additional simulation results.
References
- Adamic & Glance (2005) Adamic, L. A. & Glance, N. (2005). The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery. ACM, New York.
- Airoldi et al. (2008) Airoldi, E. M., Blei, D., Fienberg, S. & Xing, E. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9.
- Amini et al. (2013) Amini, A. A., Chen, A., Bickel, P. J. & Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics 41.
- Bickel & Chen (2009) Bickel, P. J. & Chen, A. (2009). A nonparametric view of network models and newman–girvan and other modularities. Proceedings of the National Academy of Sciences 106, 21068–21073.
- Chen & Lei (2018) Chen, K. & Lei, J. (2018). Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association 113, 241–251.
- Choi et al. (2012) Choi, D. S., Wolfe, P. J. & Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika 99, 273–284.
- Daudin et al. (2008) Daudin, J.-J., Picard, F. & Robin, S. (2008). A mixture model for random graphs. Statistics and Computing 18, 173–183.
- Ding & Yang (2022) Ding, X. & Yang, F. (2022). Tracy-widom distribution for heterogeneous gram matrices with applications in signal detection. IEEE Transactions on Information Theory 68, 6682–6715.
- Goldenberg et al. (2010) Goldenberg, A., Zheng, A. X., Fienberg, S. E., Airoldi, E. M. et al. (2010). A survey of statistical network models. Foundations and Trends® in Machine Learning 2, 129–233.
- Granovetter (1973) Granovetter, M. S. (1973). The strength of weak ties. American Journal of Sociology 78, 1360–1380.
- Han et al. (2023) Han, X., Yang, Q. & Fan, Y. (2023). Universal rank inference via residual subsampling with application to large networks. The Annals of Statistics 51, 1109–1133.
- Hoff et al. (2002) Hoff, P. D., Raftery, A. E. & Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97, 1090–1098.
- Holland et al. (1983) Holland, P. W., Laskey, K. B. & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks 5, 109–137.
- Hu et al. (2020) Hu, J., Qin, H., Yan, T. & Zhao, Y. (2020). Corrected bayesian information criterion for stochastic block models. Journal of the American Statistical Association 115, 1771–1783.
- Hu et al. (2021) Hu, J., Zhang, J., Qin, H., Yan, T. & Zhu, J. (2021). Using maximum entry-wise deviation to test the goodness of fit for stochastic block models. Journal of the American Statistical Association 116, 1373–1382.
- Jin (2015) Jin, J. (2015). Fast community detection by score. The Annals of Statistics 43, 57–89.
- Jin et al. (2017) Jin, J., Ke, Z. T. & Luo, S. (2017). Estimating network memberships by simplex vertex hunting. arXiv preprint arXiv:1708.07852 12.
- Jin et al. (2021) Jin, J., Ke, Z. T. & Luo, S. (2021). Improvements on score, especially for weak signals. Sankhya A , 1–36.
- Jin et al. (2024) Jin, J., Ke, Z. T. & Luo, S. (2024). Mixed membership estimation for social networks. Journal of Econometrics 239, 105369.
- Jin et al. (2023) Jin, J., Ke, Z. T., Luo, S. & Wang, M. (2023). Optimal estimation of the number of network communities. Journal of the American Statistical Association 118, 2101–2116.
- Joseph & Yu (2016) Joseph, A. & Yu, B. (2016). Impact of regularization on spectral clustering. The Annals of Statistics 44, 1765–1791.
- Karrer & Newman (2011) Karrer, B. & Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical Review E-Statistical, Nonlinear, and Soft Matter Physics 83, 016107.
- Katona et al. (2011) Katona, Z., Zubcsek, P. P. & Sarvary, M. (2011). Network effects and personal influences: The diffusion of an online social network. Journal of Marketing Research 48, 425–443.
- Ke & Wang (2022) Ke, Z. T. & Wang, J. (2022). Optimal network membership estimation under severe degree heterogeneity. arXiv preprint arXiv:2204.12087 .
- Kolaczyk (2009) Kolaczyk, E. D. . (2009). Statistical analysis of network data: Methods and models. New York: Springer.
- Le & Levina (2015) Le, C. M. & Levina, E. (2015). Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827 .
- Lee & Schnelli (2015) Lee, J. O. & Schnelli, K. (2015). Edge universality for deformed wigner matrices. Reviews in Mathematical Physics 27, 1550018.
- Lee & Yin (2014) Lee, J. O. & Yin, J. (2014). A necessary and sufficient condition for edge universality of wigner matrices. Duke Mathematical Journal 163, 117–173.
- Lei (2016) Lei, J. (2016). A goodness-of-fit test for stochastic block models. The Annals of Statistics 44, 401–424.
- Lei & Rinaldo (2015) Lei, J. & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics , 215–237.
- Lorrain & White (1971) Lorrain, F. & White, H. C. (1971). Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology 1, 49–80.
- Ma et al. (2021) Ma, S., Su, L. & Zhang, Y. (2021). Determining the number of communities in degree-corrected stochastic block models. Journal of Machine Learning Research 22, 3217–3279.
- Newman (2006) Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103, 8577–8582.
- Newman & Girvan (2004) Newman, M. E. & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E 69, 026113.
- Noroozi et al. (2019) Noroozi, M., Rimal, R. & Pensky, M. (2019). Estimation and clustering in popularity adjusted stochastic block model. arXiv preprint arXiv:1902.00431 .
- Nowicki & Snijders (2001) Nowicki, K. & Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96, 1077–1087.
- Onatski (2009) Onatski, A. (2009). A formal statistical test for the number of factors in the approximate factor models. Econometrica 77, 1447–1480.
- Qin & Rohe (2013) Qin, T. & Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. Advances in Neural Information Processing Systems 26.
- Rohe et al. (2011) Rohe, K., Chatterjee, S. & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39, 1878–1915.
- Saldana et al. (2017) Saldana, D. F., Yu, Y. & Feng, Y. (2017). How many communities are there? Journal of Computational and Graphical Statistics 26, 171–181.
- Sarkar & Bickel (2015) Sarkar, P. & Bickel, P. J. (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics 43, 962–990.
- Schnelli & Xu (2022) Schnelli, K. & Xu, Y. (2022). Convergence rate to the tracy–widom laws for the largest eigenvalue of wigner matrices. Communications in Mathematical Physics 393, 839–907.
- Sengupta & Chen (2015) Sengupta, S. & Chen, Y. (2015). Spectral clustering in heterogeneous networks. Statistica Sinica , 1081–1106.
- Sengupta & Chen (2018) Sengupta, S. & Chen, Y. (2018). A block model for node popularity in networks with community structure. Journal of the Royal Statistical Society Series B: Statistical Methodology 80, 365–386.
- Snijders & Nowicki (1997) Snijders, T. A. & Nowicki, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification 14, 75–100.
- Tracy & Widom (1994) Tracy, C. A. & Widom, H. (1994). Level-spacing distributions and the airy kernel. Communications in Mathematical Physics 159, 151–174.
- Tracy & Widom (1996) Tracy, C. A. & Widom, H. (1996). On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics 177, 727–754.
- Traud et al. (2012) Traud, A. L., Mucha, P. J. & Porter, M. A. (2012). Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 4165–4180.
- Valente (2010) Valente, T. W. (2010). Social Networks and Health: Models, Methods, and Applications 1st Edition. Oxford University Press.
- Wang et al. (2023) Wang, J., Zhang, J., Liu, B., Zhu, J. & Guo, J. (2023). Fast network community detection with profile-pseudo likelihood methods. Journal of the American Statistical Association 118, 1359–1372.
- Wang & Bickel (2017) Wang, Y. R. & Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics 45, 500–528.
- Wasserman & Faust (1994) Wasserman, S. & Faust, K. (1994). Social network analysis: Methods and applications. Cambridge University Press.
- White & Murphy (2016) White, A. & Murphy, T. B. (2016). Mixed-membership of experts stochastic blockmodel. Network Science 4, 48–80.
- Wolfe & Olhede (2013) Wolfe, P. J. & Olhede, S. C. (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936 .
- Wu et al. (2022) Wu, Y., Lan, W., Zou, T. & Tsai, C.-L. (2022). Inward and outward network influence analysis. Journal of Business & Economic Statistics 40, 1617–1628.
- Zhang & Zhou (2016) Zhang, A. Y. & Zhou, H. H. (2016). Minimax rates of community detection in stochastic block models. The Annals of Statistics 44, 2252–2280.
- Zhang et al. (2020a) Zhang, J., Sun, W. W. & Li, L. (2020a). Mixed-effect time-varying network model and application in brain connectivity analysis. Journal of the American Statistical Association 115, 2022–2036.
- Zhang & Amini (2023) Zhang, L. & Amini, A. (2023). Adjusted chi-square test for degree-corrected block models: Experiments in r. The Annals of Statistics 51, 2366–2385.
- Zhang et al. (2020b) Zhang, Y., Levina, E. & Zhu, J. (2020b). Detecting overlapping communities in networks using spectral methods. SIAM Journal on Mathematics of Data Science 2, 265–283.
- Zhao et al. (2012) Zhao, Y., Levina, E. & Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics 40, 2266–2292.