Two–Sample Test for Stochastic Block Models via Maximum Entry–Wise Deviation
Abstract
The stochastic block model is a popular tool for detecting community structures in network data. Detecting the difference between two community structures is an important issue for stochastic block models. However, the two-sample test has been a largely under-explored domain, and too little work has been devoted to it. In this article, based on the maximum entry–wise deviation of the two centered and rescaled adjacency matrices, we propose a novel test statistic to test two samples of stochastic block models. We prove that the null distribution of the proposed test statistic converges in distribution to a Gumbel distribution, and we show the change of the two samples from stochastic block models can be tested via the proposed method. Then, we show that the proposed test has an asymptotic power guarantee against alternative models. One noticeable advantage of the proposed test statistic is that the number of communities can be allowed to grow linearly up to a logarithmic factor. Further, we extend the proposed method to the degree-corrected stochastic block model. Both simulation studies and real-world data examples indicate that the proposed method works well.
Keywords: Gumbel distribution; Network data; Stochastic block model; Two-sample test
1 Introduction
Network data analysis has become a popular research topic in many fields, including gene classification, social relationship investigation, and financial risk management. In the past decades, the majority of works mainly focused on the large-scale network data with community structure, see, e.g., Newman & Girvan (2004); Newman (2006); Steinhaeuser & Chawla (2010). In the network data analysis, the stochastic block model proposed by Holland et al. (1983) is a popular tool to fit the network data with community structure, see, e.g., Snijders & Nowicki (1997); Nowicki & Snijders (2001); Bickel & Chen (2009); Rohe et al. (2011); Choi et al. (2012); Jin (2015); Zhang & Zhou (2016). The stochastic block model with communities assumes that nodes of the network are clustered into communities, that is, there exists a mapping of community membership (also known as community membership label) , where . Formally, means that the node belongs to the community . For an unweighted and undirected graph , it can be represented by a binary symmetric adjacency matrix , that is, if there is a connection (or an edge) between node and node and otherwise. Given the community membership label , the stochastic block model assumes that the entries of the adjacency matrix are mutually independent Bernoulli random variables with probabilities for a certain symmetric probability matrix , where matrix is called as edge-probability matrix. Then the stochastic block model is completely and uniquely determined by the pair up to label permutations of nodes.
The fundamental issues in the stochastic block model are model selection and community detection. Given an adjacency matrix , the goal of model selection is to estimate the number of clusters or communities, and the goal of community detection is to cluster all nodes into different communities such that the connections between the nodes in the same community are dense and the connections between the nodes in the different communities are sparse. To enhance the flexibility of the model, variations of the stochastic block model were also proposed, such as the degree-corrected stochastic block model proposed by Karrer & Newman (2011) addressed the degree heterogeneity by introducing the additional node activeness parameters, and Airoldi et al. (2008) proposed the mixed membership stochastic block model, where a single node may belong to multiple communities. For the model selection, many methods are used to estimate the number of communities, including the sequential testing methods (Lei, 2016; Hu et al., 2021), and the likelihood-based methods (Saldña et al., 2017; Wang & Bickel, 2017; Hu et al., 2020). Meanwhile, the majority of efficient methods also have been proposed to recover the community structure, such as modularity (Newman, 2006), variational methods (Daudin et al., 2008), profile-likelihood maximization (Bickel & Chen, 2009), spectral clustering (Rohe et al., 2011; Jin, 2015), pseudo-likelihood maximization (Amini et al., 2013), and profile-pseudo likelihood methods (Wang et al., 2021). The corresponding asymptotic properties of estimation of community label have been obtained, see, e.g., Rohe et al. (2011); Zhao et al. (2012); Choi et al. (2012); Lei & Rinaldo (2015); Sarkar & Bickel (2015); Zhang & Zhou (2016); Wang et al. (2021).
In the past decades, the research interest of much literature focuses on the study of one sample of stochastic block models. Usually, in the social network, people surrounding a leader may tend to develop a closer relationship with another leader, see, e.g., Barnett & Onnela (2016). This phenomenon may lead to an issue, that is, whether the community structure (the number of communities , the community probability matrix , and the community membership label ) of the network will change over time or the environment.
As the simplest case of the multi-layer stochastic block model, two sample networks may also appear. For a multi-layer stochastic block model, it can be classified into a variety of more specific situations: First, all layer networks come from the identical stochastic block model ; second, each layer of the network comes from a different model but they have the same community structure, that is, is identical across all layers; third, each layer of the network comes from different models, that is, and are different in each layer. In fact, for the third case, it can be considered as a mixed model of multiple stochastic block models. However, how could we distinguish the three situations? Intuitively, we should judge whether two models are the same when we have two samples from the corresponding models. A common inference method to judge whether two models are the same is the hypothesis testing method. Tang et al. (2017) considered whether two independent finite-dimensional random dot product graphs are generated by the identical model. Using the adjacency spectral embedding for each sample, they constructed a statistic based on the kernel function. Ghoshdastidar & von Luxburg (2018) used the largest singular value of a scaled and centralized matrix to construct the statistic and proved that the null distribution convergences in distribution to a Tracy-Widom distribution. Ghoshdastidar et al. (2020) proposed two test statistics using the Frobenius norm and spectral norm to test whether two samples of networks are generated from the identical edge-probability matrix. However, this testing procedure requires choosing an appropriate threshold. Recently, Chen et al. (2021a) simplified the statistics in Ghoshdastidar & von Luxburg (2018) and proposed a test procedure for a two-sample network test. Based on the random matrix theory, Chen et al. (2021b) used the trace of a constructed matrix to obtain the statistic and proved that the asymptotic null distribution is the standard normal distribution.
For the methods mentioned above, one defect is that either the edge-probability matrices of the two populations differ greatly or multiple samples are required. Hence, when we only have two samples, the methods mentioned above may not work well as there is less information. In addition, it is worth noting that when the community structure changes slightly, the edge-probability matrix may change less, especially when the network size is relatively large. The above methods of testing may not work well, where and are the edge-probability matrices of two network models. It implies that we cannot test the difference between the two models when the difference is tiny but the network size is very large. Thus, detecting the difference between two samples of stochastic block models is an interesting research problem. Under the two-sample test of the stochastic block model, Wu et al. (2022) proposed a test method based on the locally smoothed adjacency matrix. To construct the smoothed adjacency matrix, their method separates a community into serval non-overlapping neighboring sub-communities and averages the entries of the adjacency matrix in non-overlapping local neighborhoods within communities. However, the procedure is complex and only applicable to a small number of communities. In this article, we want to construct a statistic that allows to diverge with .
In this article, based on the maximum entry-wise deviation of the two centered and rescaled adjacency matrices, we propose a two-sample test statistic to detect the change of two stochastic block models under two observed adjacency matrices. We show that the asymptotic null distribution of the test statistic is a Gumbel distribution when . This testing method allows to grow linearly with up to a logarithmic factor. It is well known that the number of communities must be less than the number of nodes in the stochastic block model. Since cannot grow faster, Rohe et al. (2014) called this scenario () the highest dimensional stochastic block model. Compared with the method in Wu et al.(2022), we relax the condition that the number of communities is fixed. Moreover, we also show that the proposed test is asymptotically powerful against serval alternative models, and does not need additional methods to improve the power of the proposed test. Finally, to improve the flexibility of the proposed method, we extend the proposed method to the degree-corrected stochastic block model.
Next, we formally describe this issue. Let and be two binary symmetric adjacency matrices from two stochastic block models parametrized by and , respectively. For any , let , i.e., no loop edge exists. In this article, we assume that the node label of the two networks is identical, instead of the community label of the node. Hence, the two networks and can be viewed as repeated observations in the same individuals. Then, given two sample networks and , the testing problem can be formulated as
(1) |
Intuitively, there are four mutually exclusive scenarios for the alternative hypothesis: (i) , (ii) , but , (iii) , but , and (iv) .
Our testing procedure for (1) goes as follows. First, we get consistent estimators of and , denoted by and , respectively. If , it means that cannot be true, so we can directly reject the null hypothesis. If , based on , we obtain the strongly consistent estimators and for the community membership labels and . Second, we estimate the entries of and by the sample proportions of each community based on and . Third, we use () to center and rescale adjacency matrix (), and sum for new matrix under specific rules, and get a combined information matrix, see (7). Finally, we obtain the test statistic by the maximum of the elements of the combined information matrix. The basic principle of this method is that if the null hypothesis is true, the entries of the combined information matrix asymptotically follow the normal distribution. According to the results of Zhou (2007), the asymptotic distribution of the maximum of elements after normalization is a Gumbel distribution. On the other hand, under the alternative hypothesis, the adjacency is incorrectly centered and scaled, and this deviation is magnified to a very large value by the normalization term. This implies that the test statistic can successfully separate and in (1).
The remainder of the article is organized as follows. In Section 2, we introduce the new test statistic and state its asymptotic null distribution and asymptotic power. In Section 3, we extend the proposed method to the degree-corrected stochastic block model. Simulation studies and real-world data examples are given in Sections 4 and 5, respectively. All technical proofs are postponed to the Appendix.
2 A NEW TWO-SAMPLE TEST FOR THE STOCHASTIC BLOCK MODEL
Consider a stochastic block model on nodes with the community label and probability matrix . Given a number of communities and a community label , the maximum likelihood estimator of is given by
(2) |
where for , stands for the label of a node, and . After here, and , where can be replaced by .
For an adjacency matrix , Hu et al. (2021) used a test statistic, based on the maximum entry-wise deviation, to test the following two hypothesis tests:
(1) , and
(2) ,
where and denote the true number of communities and the true community label, respectively, and and denote a hypothetical number of communities and hypothetical community label, respectively. Let
where , and , and is the number of nodes in block , and denotes the set of nodes that belong to community in but excluding node and as defined in (2). Under the null hypothesis , if , Hu et al. (2021) showed that
and proposed the following test statistic:
Then the corresponding level- rejection rule is
where is the th quantile of the Gumbel distribution with and .
In this article, We aim to develop a new test statistic that allows it can be used to test two samples. For two samples and from stochastic block models and , respectively, let and be obtained by some estimation methods, such as the recursive approach in Zhao et al. (2011), the sequential testing method given in Lei (2016), the likelihood-based method in Saldña et al. (2017), the corrected Bayesian information criterion in Hu et al. (2020), the test based on maximum entry-wise deviation in Hu et al. (2021), and the spectral methods in Le & Levina (2022). In this article, we use the corrected Bayesian information criterion in Hu et al. (2020) to get consistent estimators and . Recall that given the number of communities and , the community labels and can be consistently estimated, denote by and , by some existing strongly consistent community detection procedures, e.g., the majority voting algorithm in Gao et al. (2017) and the profile-pseudo likelihood method in Wang et al. (2021). In view of this, throughout the paper, we formally assume that
(3) |
which implies that we require that all estimators are strongly consistent. Hence, we can get the and by equation (2). Note that the community membership label and the probability matrix depend on the number of communities . If is a strongly consistent estimator, must be the consistent estimator. Then, Condition (3) can be relaxed to . In fact, if , then it is natural that and . Since and are consistent estimators, we can reject the null hypothesis with probability tending to 1. Meanwhile, For case (iv), there is indeed a problem of identifiability. Because both and are allowed to vary, the existing methods cannot reasonably solve the problem of identifiability. Thus, in this article, we mainly focus on cases (ii) and (iii) for four alternative hypotheses. Then, we assume . Without loss of generality, we write when . Next, we formally state the test statistic. As mentioned in the introduction, our test statistic is motivated by the contrast of (or ) and (or ), i.e.:
(4) |
Moreover, based on the maximum entry-wise deviation, the proposed test statistic has the following form:
2.1 The Asymptotic Null Distribution
To obtain the asymptotic result for the statistic , we first make the following assumptions:
Assumption 1.
The entries of and are uniformly bounded away from 0 and 1, and both and have no identical rows.
Assumption 2.
There exist , , and such that
and,
for all .
Assumption 1 requires that the entries of the probability matrices and are uniformly bounded away from 0 and 1. This assumption is similar to the corresponding condition in Lei (2016) and Hu et al. (2021). Meantime, Assumption 1 also requires that and are identifiable, which is a basic condition (Wang & Bickel, 2017). Assumption 2 not only requires a lower bound for the number of nodes in the smallest community, but also gives an upper bound for the number of nodes in the largest community. The lower bound requires that the number of nodes in the smallest community for () is at least proportional to (). This is a mild assumption and easy to be achieved. For example, this lower bound can be achieved when the community label is generated from a multinomial distribution with trials and parameter such that . For the assumption about the upper bound, Zhang & Zhou (2016) and Gao et al. (2017) also considered a similar condition, which is used to control the maximum within-group deviation between the () and its estimator ().
We now give the asymptotic property of the test statistic and delay the proof to the Appendix.
Theorem 1.
Suppose that Assumptions (1) and (2) hold. Then under the null hypothesis , as , if , we have
(5) |
where the right hand-side of (5) is the cumulative distribution function of the Gumbel distribution with and .
Note that Theorem 1 demonstrates that converges in distribution to a Gumbel distribution under .
Using the above Theorem 1, we can implement hypothesis test (1) as follows. First, we estimate the number of communities and , and the community membership labels and , respectively. Then, we compute the statistic
Since and are strongly consistent estimators, we have that intuitively follows the Gumbel distribution with and . To carry out the hypothesis test, we have a rejection rule:
where is the th quantile of the Gumbel distribution with and . In Section 4, by simulation studies, we investigate the finite sample performance of the proposed test statistic.
In general, the null distribution converges to the Gumbel distribution slowly, which is also confirmed in simulation studies. Bickel & Sarkar (2015) considered a bootstrap method for the correction of the distribution of the finite sample, which was also used by Lei (2016) and Hu et al. (2021). For our statistic, the bootstrap corrected test statistic is calculated as follows:
1. Using the consistent estimation method to estimate , then get the estimation and by the strongly consistent clustering method;
2. For , generate and from the edge-probability matrix , and calculate based on , ;
3. Using to estimate the location and scale parameters and of the Gumbel distribution through maximum likelihood method.
4. The bootstrap corrected test statistic is calculated as
where and .
2.2 THE ASYMPTOTIC POWER
In this subsection, we investigate the asymptotic power of the proposed test procedure. To guarantee good testing power, we need the following theoretical assumption.
Assumption 3.
The maximum grouped difference between and satisfies:
and
Assumption 3 specifies that under the alternative , the maximum grouped difference between and diverges faster than . This assumption is analogous to (A2′) in Hu et al. (2021). Then, the lower bound of the growth rate of the test statistic under the alternative hypothesis is given in the following Theorem.
Theorem 2.
Suppose that Assumptions (1), (2), and (3) hold. Then under the alternative hypothesis , as , if , we have
for some some consistent .
The proof is collected in the Appendix. The above Theorem 2 shows that diverges faster than as under the alternative hypothesis . Then, we have the following corollary saying that the asymptotic size is close to the nominal level and the asymptotic power is almost equal to 1.
Corollary 1.
Suppose that Assumptions (1), and (2) hold, given a nominal level , we have
and further, suppose that Assumption (3) holds, we have
Therefore, the asymptotic null distribution in Theorem 1 and the growth rate in Theorem 2 suggest that the null hypothesis and the alternative hypothesis are well separated, and our proposed test is asymptotically powerful against alternative hypothesis .
3 EXTENSION
In this section, we extend the proposed method to the degree-corrected stochastic block model. In reality, a network may contain many high-degree nodes and lower-degree nodes, that is, the degree heterogeneity of nodes. To address this issue, Karrer & Newman (2011) proposed a degree-corrected stochastic block model. Specifically, for an undirected network , this model assume that the edge satisfies , where is the degree parameter for node . To ensure the identifiability of the model, we assume that . For two sample networks and , the testing problem has the following form:
(6) |
where and are the node degree parameters for networks and , respectively.
Similar to the general stochastic block model, for the two-sample test of the degree-corrected stochastic block model, we consider the following test statistic:
where
(7) |
where is the maximum likelihood estimator of and . is similar to that of . Note that it is difficult to obtain the asymptotic null distribution of as the complex dependency between the entries . By the simulation studies, we find that the empirical distribution of is also the Gumbel distribution with and . Then, we have a rejection rule:
where is the th quantile of the Gumbel distribution with and . Similar to the general stochastic block model, we can obtain the bootstrap corrected statistic through the identical method.
4 SIMULATION
In this section, we evaluate the performance of the proposed test statistics in various simulation studies. Firstly, the number of communities is estimated by the Bayesian information criterion proposed by Hu et al. (2020), and the community membership label is estimated by strongly consistent estimation methods mentioned above, i.e., the profile-pseudo likelihood method in Wang et al. (2022), initialized by spectral clustering with permutations, is used to obtain the community membership label. In the simulation, we consider the test statistic and the bootstrap corrected test statistic . In comparative experiments, the maximum deviation method (referred to as TST-MD) proposed by Chen et al. (2021a) and the spectral-based method (referred to as TST-S) proposed by Chen et al. (2021b) are used.
4.1 Simulation 1: The Null Distribution Under the Stochastic Block Model
In the first simulation, we consider the finite sample null distribution of test statistic and empirically verify Theorem 1 for two samples from a common stochastic block model. Since the limiting distribution of the test statistic is proven to be a Gumbel distribution, the location and scale parameters can be estimated by generating some bootstrap samples to correct the test statistic at a lower cost. In all of our simulations, we use .
We generate data from the stochastic block model with . In this setting, we set with . The sample size is either or . We use the strongly consistent method to get and , then get and , e.g., the profile-pseudo likelihood method in Wang et al. (2021). In Figure 1, based on 1000 data replications, we plot the sample distribution of the test statistic with and without bootstrap correction. Figure 1 demonstrates that the empirical probability density function of biases upward, but the bias decrease as the sample size increases. Compared with the distribution of , the distribution of fits the true distribution better.


4.2 Simulation 2: Empirical Size for Hypothesis (1)
In this subsection, we study the empirical size under varying , , and . We set the edge probability between communities and as , where measures the sparsity of the network. Let , and the size of each block be 200. Table 1 reports the result from 200 data replications. From Table 1, ’s Type I errors are close to the nominal level when is small, ’s Type I errors are close to the nominal level even when is large. Compared with TST-MD and TST-S, the performances of and TST-S are better than TST-MD. In some cases, TST-MS does not work well. Inversely, the empirical size of and TST-S are close to the nominal level. It is worth noting that as the sample size increases, the empirical size of the gradually increases, which seems to be contrary to fundamental theory. In fact, as the sample size increases, the number of communities also increases. As a result, the commonly used community label estimation algorithms cannot exactly recover the community label. But by the bootstrap correction, the bootstrap corrected statistic has a good empirical size, which implies that the statistic with the bootstrap correction works well. We also notice that when the network is sparse, the test statistic does not have good empirical size, and tends to be a little oversized.
0.04 | 0.09 | 0.03 | 0.04 (0.09,0.03) | 0.08 (0.1,0.02) | 0.05 (0.01,0.05) | ||
0.09 | 0.06 | 0.07 | 0.05 (0.085,0.04) | 0.05 (0.04,0.03) | 0.04 (0.01,0.03) | ||
0.09 | 0.08 | 0.06 | 0.04 (0.01,0.04) | 0.05 (0,0.03) | 0.05 (0,0.04) | ||
0.09 | 0.09 | 0.05 | 0.04 (0,0.04) | 0.06 (0,0.04) | 0.06 (0,0.03) | ||
0.08 | 0.09 | 0.04 | 0.04 (0.005,0.05) | 0.04 (0,0.05) | 0.04 (0,0.03) | ||
0.20 | 0.10 | 0.06 | 0.06 (0,0.04) | 0.05 (0,0.04) | 0.04 (0,0.06) | ||
0.46 | 0.16 | 0.09 | 0.08 (0.01,0.04) | 0.05 (0,0.05) | 0.05 (0,0.04) | ||
0.77 | 0.14 | 0.10 | 0.10 (0.03,0.04) | 0.02 (0,0.05) | 0.08 (0,0.05) | ||
0.99 | 0.23 | 0.13 | 0.05 (0.04,0.05) | 0.02 (0.005,0.06) | 0.04 (0,0.05) |
4.3 Simulation 3: Empirical Power for Hypothesis (1)
In this subsection, we investigate the empirical power under hypothesis test (1). We consider two kinds of alternatives: i) but and ii) but . Similar to simulation 2, let . The sample size is either or . For the first alternative, we generate with edge possibility within community and between communities, and with edge possibility within community and between communities for . For the second alternative, we generate and with edge probability within community and between communities for . We set and from the multinomial distribution with trials and probability . Similarly, all simulations are run 200 times. For both alternatives, Tables 2 - 3 report the empirical power for hypothesis test (1). We observe that the proposed test statistics and TST-MD can successfully detect all alternative hypotheses. In the second alternative hypothesis, however, the empirical power of TST-S is less than 1. As discussed in the introduction, if the change of community structure is small, there will be little difference between the two edge-probability matrices. As a result, the TST-S will not separate the null hypothesis and the alternative hypothesis well.
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | |
1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) | 1 (1,1,1) |
1 (1,1,0.04) | 1 (1,1,0.17) | 1 (1,1,0.54) | 1 (1,1,0.04) | 1 (1,1,0.27) | 1 (1,1,0.57) | |
1 (1,1,0.06) | 1 (1,1,0.21) | 1 (1,1,0.51) | 1 (1,1,0.02) | 1 (1,1,0.39) | 1 (1,1,0.55) | |
1 (1,1,0.045) | 1 (1,1,0.16) | 1 (1,1,0.45) | 1 (1,1,0.09) | 1 (1,1,0.35) | 1 (1,1,0.63) | |
1 (1,1,0.09) | 1 (1,1,0.13) | 1 (1,1,0.32) | 1 (1,1,0.07) | 1 (1,1,0.33) | 1 (1,1,0.58) | |
1 (1,1,0.06) | 1 (1,1,0.08) | 1 (1,1,0.19) | 1 (1,1,0.05) | 1 (1,1,018) | 1 (1,1,0.5) | |
1 (1,0.91,0.09) | 1 (1,1,0.06) | 1 (1,1,0.09) | 1 (1,1,0.08) | 1 (1,1,0.07) | 1 (1,1,0.09) | |
1 (1,1,0.07) | 1 (1,0.65,0.07) | 1 (1,0.99,0.09) | 1 (1,1,0.08) | 1 (1,1,0.07) | 1 (1,1,0.1) | |
1 (1,1,0.12) | 1 (1,0.82,0.06) | 1 (1,0.79,0.06) | 1 (1,1,0.09) | 1 (1,1,0.07) | 1 (1,1,0.1) | |
1 (1,1,0.1) | 1 (1,0.95,0.09) | 1 (1,0.75,0.07) | 1 (1,1,0.73) | 1 (1,1,0.06) | 1 (1,1,0.09) |
4.4 Simulation 4: The Null Distribution Under the Degree-Corrected Stochastic Block Model
In this subsection, we investigate the asymptotic null distribution of the proposed statistic . Similar to Simulation 1, we also consider the bootstrap corrected statistic. The bootstrap corrected method is similar to the general stochastic block model and is given in the Appendix.
As shown in Zhao et al. (2012), we use the same approach to generate the degree corrected parameters . The details are as follows:
where is a uniformly random variable on the interval . Similar to Simulation 1, we set with and . We also consider sample sizes and . Based on 1000 data replications, we report the results in Figure 2. Figure 2 shows the empirical distribution of the test statistic is the Gumbel distribution by a location and scale shift. With the sample size increasing, the deviation between the empirical distribution of and the limiting distribution does not decrease. After using the bootstrap correction, the empirical distribution of is close to the limiting distribution. It also implies that the empirical size of the test statistic is smaller than the nominal level.


4.5 Simulation 5: Empirical Size for Hypothesis (6)
In this subsection, we consider the empirical size under the framework of the degree-corrected stochastic block model. The basic settings are similar to Simulation 2 except . The degree corrected parameters are generated by the method in Simulation 4. Table 4 shows the simulation results. It is observed that the probability of Type I error is close to the nominal level. At the same time, the empirical size of the statistic is less than that of , which is also consistent with the results of Simulation 4.
0.01 | 0.07 | 0.05 | 0.04 | 0.05 | 0.06 | ||
0.01 | 0.01 | 0.04 | 0.04 | 0.06 | 0.04 | ||
0.04 | 0.01 | 0.04 | 0.05 | 0.04 | 0.05 | ||
0.06 | 0.01 | 0.05 | 0.06 | 0.04 | 0.05 |
4.6 Simulation 6: Empirical Power for Hypothesis (6)
In this simulation, we investigate the testing power for the two-sample test under the degree-corrected stochastic block model. We only consider the case of . The probability matrix and the community label are generated the same as in Simulation 3. We consider the settings that and . The degree corrected parameters are similar to Simulation 4. The empirical results are shown in Table 5. From Table 5, we can know that the proposed test statistic can successfully detect the alternative hypotheses as the test has good power.
1 | 1 | 1 | 1 | 1 | 1 | ||
1 | 1 | 1 | 1 | 1 | 1 | ||
1 | 1 | 1 | 1 | 1 | 1 | ||
0.98 | 1 | 1 | 1 | 1 | 1 |
5 DATA EXAMPLE
5.1 Gene co-expression data
In this subsection, we apply the proposed method to the gene dataset of developing rhesus monkeys’ tissue from the medial prefrontal cortex. This dataset was originally collected by Bakken et al. (2016). Other work analyzing the data suggests this is an appropriate dataset, as other work already has good evidence that gene co-expression patterns in monkey tissues in this brain region change significantly as they develop. For the prenatal period, a 6-layer network is considered, which corresponds to the 6 age stage. We label the 6-layer network as E40 to E120 to indicate the number of embryonic days of age. For the postnatal period, a 5-layer network is considered, which indicates the 5 layers within the medial prefrontal cortex. We label the 5-layer network as L2 to L6. With this data, we aim to show whether two gene co-expression networks in two different periods can be considered to come from a common stochastic block model using the proposed method.
Preprocessing Procedure. The microarray dataset contains genes measured among many samples across the layer. In this article, since we consider the difference between the two samples, we only choose the gene co-expression networks in two periods E40 and E50, that is, the development occurs from 40 days to 50 days in the embryo. Similar to other work, e.g., Langfelder & Horvath (2008), we preprocess the adjacency matrix as follows. First, to calculate the adjacency matrix, for a layer , we construct the Pearson correlation matrix. Define the co-expression similarity as the correlation coefficient between the profiles of nodes and : . In order to avoid the outliers, let . Then, using a thresholding procedure, the co-expression similarity matrix is transformed into the adjacency matrix . An unweighted network adjacency between gene expression profiles and can be defined by hard thresholding the co-expression similarity as
where is the “hard” threshold parameter. Thus, two genes are linked () if the absolute correlation between their expression profiles exceeds the hard threshold . In this article, we set to get two adjacency matrices and . Lastly, we remove all the genes corresponding to nodes whose total degree for and is less than 90. The purpose of this is to remove those nodes with few connections, which process can be seen in Lei & Lin (2022). Finally, we have two adjacency matrices , each representing a network corresponding to 4722 genes.
Result and Interpretation. The following results show the difference between the gene co-expression networks of E40 and E50 using the proposed method based on the maximum entry-wise deviation. Prior to using our method, we select the number of communities to be , which is considered reasonable, as described in Lei & Lin (2022). Then, we use and as test statistic, and obtain and , respectively. Since, for the Gumbel distribution, we reject at the level of 0.05. As the discussion in Lei & Lin (2022), as the development occurs from 40 days to 50 days in the embryo, there are different gene communities that are most connected. From 40 days in embryo, community 1 was highly coordinated (i.e. densely connected), and to 40 days in embryo, community 3 was highly coordinated. Hence, it can be considered that the two networks in two periods E40 and E50 come from two different stochastic block models. The results are similar to that of Liu et al. (2018) and Lei & Lin (2022).
5.2 International trade data
Now, we study an international trade dataset originally analyzed by Westveld & Hoff (2011), containing yearly international trade data between countries from 1981 to 2000. For this network, an adjacency matrix can be formed by first considering a weight matrix with in given year , where denotes the value of exports from country to country in year . Finally, we define if , and otherwise, where denotes the 50th percentile of in year . In this article, we focus on the international trade networks in 1995 and 1999. Thus, we can obtain two sample networks and in 1995 and 1999. For the network in 1995, the number of communities is estimated to be 7 in Hu et al. (2021). For the network in 1999, using the corrected Bayesian information criterion, the number of communities in 1999 is also estimated to be . Then, we set and continue to implement the proposed method. We obtain and . Since, for the Gumbel distribution, we reject at the level of 0.05. Hence, we can assert that the trade situation of different countries has changed in the four years from 1995 to 1999.
6 DISSCUSSION
In this article, we have proposed a novel two-sample test statistic based on the maximum entry-wise deviation of staggered centered and rescale observed adjacency matrix, and have demonstrated that the asymptotic null distribution of the test statistic is a Gumbel distribution when . This test has extended the method of Hu et al.(2021) to two samples. In our study, the difference between two networks is assessed by the sum of two residual matrices, where the centered and rescaled matrix of another sample is obtained using the estimation of the label and the probability matrix of one sample. Empirically, we have demonstrated that the size and the power of the test are valid. In this article, we assume that the numbers of communities for and are equal. In reality, we should determine whether is equal to . In fact, we can independently estimate the number of communities by some existing methods, such as the sequentially testing methods (Lei, 2016: Hu et al., 2021) and methods based on the model selection (Hu et al., 2020). If , the proposed method can be directly used. When , we can consider the new community structure by combining two stochastic block models. Figure 3 gives a diagram. Let be membership matrices. Note that each row of (or ) has exactly one entry that is nonzero. For two nodes and , if and only if , where be the combined membership matrix. Hence, we can use and to obtain , then obtain and by and . Then, we can use our proposed method to test the difference between two samples after combining two community structures.

It is worth noting that the proposed testing method works well when the network is dense and is small in our simulation studies. However, when the network is sparse or is large, condition (3) may be violated, i.e., the exact community label estimator is hard to be obtained. Hence, it would be of interest to investigate the possibility of the sparse network and big in future work. In addition, note that although the proposed method can identify whether the two models are the same through two samples, we cannot know whether this difference is caused by changes in or , which is crucial in practical applications. Intuitively, we can judge whether is equal to by the strong consistent estimator and . However, there is no theoretical guarantee. This issue will be considered in future work. In order to better improve the two-sample test of the stochastic block model, we will continue to study this issue in future work.
7 APPENDIX
We start with three lemmas that will be used in the proof. The following Poisson approximation result is essentially a special case of Theorem 1 in Arratia et al. (1989).
Lemma 1.
lemma 1.[Arratia et al. (1989)] Let be an index set and be a set of subsets of , that is, . Let also be random variables. For a given , set . Then
where
and is the algebra generated by . In particular, if is independent of for each then .
Lemma 2.
lemma 2.[Chen (1990)] Suppose are i.i.d random variables with and . Set . Let and satisfy that and . If for some , then
for any .
Lemma 3.
lemma 3.[Cai & Jiang (2011)] Suppose are i.i.d random variables with and and for some and . Set and . Then, for any with and and with ,
as .
7.1 Proof of Theorem 1
Although the idea of proof of the Theorem 1 is similar to Theorem 1 in Hu et al. (2021), there are some differences in some details, such as decomposition for some quantities. Next, we prove Theorem 1.
Since, , we denote and for simplicity. By Bernstein’s inequality, we have
for any , According to Assumption 2, we can get
uniformly in and as .
Notice that
similarly,
where and .
Let
where
If and , we have
Thus, to prove Theorem 1 (5), it is sufficient to show
Let . Then . Note that and . By Lemma 1, we have
where . By Lemma 3, we have
Hence,
Meanwhile,
7.2 Proof of Theorem 2
Although the idea of proof of the Theorem 2 is similar to Theorem 2 in Hu et al. (2021), there are some differences in some details, such as decomposition for some quantities. Next, we prove Theorem 2.
Similar to the proof of Theorem 1, we have
Let
By Hoeffding’s inequality, we have
Hence, . Denote
by , we have
Similarly, we have
where , and
Hence,
By Assumptions 2 and 3, we have .
Let , as , we have
7.3 The Bootstrap corrected test statistic under the degree-corrected stochastic block model
For two sample networks and , under the framework of the degree-corrected stochastic block model, the bootstrap corrected test statistic is calculated as the following:
1. Estimating and by the consistent clustering method and and using their maximum likelihood estimators. Calculate the statistic using and ;
2. For , generate and from the edge-probability matrix , and calculate based on , ;
3. Using to estimate the location and scale parameters and of the Gumbel distribution through maximum likelihood method.
4. The bootstrap corrected test statistic is calculated as
where and .
References
- [1] Airoldi, E. M., Blei, D. M., Fienberg, S. E., & Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of machine learning research, 9:1981 - 2014.
- [2] Amini, A. A., Chen, A., Bickel, P. J., & Levina, E. (2013). Pseudo-Likelihood Methods for Community Detection in Large Sparse Networks. The Annals of Statistics, 41:2097 - 2122.
- [3] Arratia, R., Goldstein, L., & Gordon, L. (1989). Two moments suffice for Poisson approximations: The Chen-Stein method. The Annals of Probability, 17:9 - 25.
- [4] Bakken, T. E., Miller, J. A., Ding, S.-L., Sunkin, S. M., Smith, K. A., Ng, L., Szafer, A., Dalley, R. A., Royall, J. J., Lemon, T., et al (2016). A comprehensive transcriptional map of primate brain development. Nature, 535:367 - 375.
- [5] Barnett, I. & Onnela, J.-P. (2016). Change point detection in correlation networks. Scientic reports, 6:18893.
- [6] Bickel, P. J. & Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106:21068 - 21073.
- [7] Bickel, P. J., & Sarkar, P. (2015), Hypothesis testing for automated community detection in networks, Journal of the Royal Statistical Society, Series B, 78:253–273.
- [8] Cai, T. T. & Jiang, T. (2011). Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices. The Annals of Statistics, 39:1496 - 1525.
- [9] Chen, X. (1990). Probabilities of moderate deviations for -valued independent random vectors. Chinese Annals of Mathematics, 11:621 - 629.
- [10] Chen, L., Zhou, J. & Lin, L. (2021a). Hypothesis testing for populations of networks. Communications in Statistics - Theory and Methods, online.
- [11] Chen, L., Josephs, N., Lin, L., Zhou, J. & Kolaczyk, E. D. (2021b). A spectral-based framework for hypothesis testing in populations of networks. Statistica Sinica, online.
- [12] Choi, D. S., Wolfe, P. J., & Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika, 99:273 - 284.
- [13] Daudin, J.-J., Picard, F., & Robin, S. (2008). A mixture model for random graphs. Statistics and Computing, 18:173 - 183.
- [14] Gao, C., Ma, Z., Zhang, A. Y., & Zhou, H. H. (2017). Achieving optimal misclassification proportion in stochastic blockModels. The Journal of Machine Learning Research, 18:1980 - 2024.
- [15] Ghoshdastidar, D., & von Luxburg, U. (2018). Practical methods for graph two-sample testing. In Advances in Neural Information Processing Systems, 3019-3028. Montréal, Canada.
- [16] Ghoshdastidar, D., Gutzeit, M., Carpentier, A. & Von Luxburg, U.(2020). Two-sample hypothesis testing for inhomogeneous random graphs. The Annals of Statistics, 48: 2208 - 2229.
- [17] Holland, P. W., Laskey, K. B., & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5:109 - 137.
- [18] Hu, J., Qin, H., Yan, T., , & Zhao, Y. (2020). Corrected Bayesian Information Criterion for Stochastic Block Models. Journal of the American Statistical Association, 115:1771 - 1783.
- [19] Hu, J., Zhang, J., Qin, H., Yan, T., & Zhu, J. (2021). Using Maximum Entry-Wise Deviation to Test the Goodness of Fit for Stochastic Block Models. Journal of The American Statistical Association, 116:1373 - 1382.
- [20] Jin, J. (2015). Fast Community Detection by SCORE. The Annals of Statistics, 43:57- 89.
- [21] Karrer, B. & Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Physical Review. E, 83:016107.
- [22] Langfelder, P. & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9:559.
- [23] Le, C. M. & Levina, E. (2022). Estimating the number of communities in networks by spectral methods. Electronic Journal of Statistics, 16:3315 - 3342.
- [24] Lei, J. (2016). A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44:401 - 424.
- [25] Lei, J. & Lin, K. Z. (2022). Bias-adjusted spectral clustering in multi-layer stochastic block models. Journal of the American Statistical Association, online.
- [26] Lei, J. & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43:215 - 237.
- [27] Liu, F., Choi, D., Xie, L., & Roeder, K. (2018). Global spectral clustering in dynamic networks. Proceedings of the National Academy of Sciences, 115:927 - 932.
- [28] Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103:8577 - 8582.
- [29] Newman, M. E. J. & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review. E, 69:026113.
- [30] Nowicki, K. & Snijders, T. (2001). Estimation and prediction for stochastic block structures. Journal of The American Statistical Association, 96:1077 - 1087.
- [31] Rohe, K., Chatterjee, S., & Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39:1878 - 1915.
- [32] Rohe, K., Qin, T., & Fan, H. (2014). The highest dimensional stochastic blockmodel with a regularized estimator. Statistica Sinica, 24:1771 - 1786.
- [33] Saldña, D. F., Yu, Y., & Feng, Y. (2017). How Many Communities Are There? Journal of Computational and Graphical Statistics, 26:171 - 181.
- [34] Sarkar, P. & Bickel, P. J. (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics, 43:962 - 990.
- [35] Snijders, T. & Nowicki, K. (1997). Estimation and prediction for stochastic block structures for graphs with latent block structure. Journal of Classification, 14:75 - 100.
- [36] Steinhaeuser, K. & Chawla, N. V. (2010). Identifying and evaluating community structure in complex networks. Pattern Recognition Letters, 31:413 - 421.
- [37] Tang, M., Athreya, A., Sussman, A. L., Lyzinski, V. & Priebe, C. E. (2017). A nonparametric two-sample hypothesis testing problem for random graphs. Bernoulli, 23:1599 - 1630.
- [38] Wang, J., Zhang, J., Liu, B., Zhu, J., & Guo, J. (2021). Fast network community detection with profile-pseudo likelihood methods. Journal of the American Statistical Association, online.
- [39] Wang, Y. R. & Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics, 45:500 - 528.
- [40] Westveld, A. H. & Hoff, P. D. (2011). A Mixed effects model for longitudinal relational and network data with applications to international trade and conflict. The Annals of Applied Statistics, 5:843 - 872.
- [41] Wu, F., Kong, X. & Xu, C. (2022). Test on stochastic block model: local smoothing and extreme value theory. Journal of Systems Science and Complexity, 35:1535 - 1556.
- [42] Zhang, A. Y. & Zhou, H. H. (2016). Minimax rates of community detection in stochastic block models. The Annals of Statistics, 44:2252 - 2280.
- [43] Zhao, Y., Levina, E., & Zhu, J. (2011). Community extraction for social networks. Proceedings of the National Academy of Sciences of the United States of America, 108:7321 - 7326.
- [44] Zhao, Y., Levina, E., & Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40:2266 - 2292.
- [45] Zhou, W. (2007). Asymptotic distribution of the largest off-diagonal entry of correlation matrices. Transactions of the American Mathematical Society, 359:5345 - 5363.