This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Bayesian community detection for networks with covariates

Luyi Shenlabel=e1][email protected] [    Arash Aminilabel=e2][email protected] [    Nathaniel Josephslabel=e3][email protected] [    Lizhen Linlabel=e4][email protected] [ Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame e4 Department Statistics, UCLA Department of Biostatistics, Yale School of Public Health
(2023)
Abstract

The increasing prevalence of network data in a vast variety of fields and the need to extract useful information out of them have spurred fast developments in related models and algorithms. Among the various learning tasks with network data, community detection, the discovery of node clusters or “communities,” has arguably received the most attention in the scientific community. In many real-world applications, the network data often come with additional information in the form of node or edge covariates that should ideally be leveraged for inference. In this paper, we add to a limited literature on community detection for networks with covariates by proposing a Bayesian stochastic block model with a covariate-dependent random partition prior. Under our prior, the covariates are explicitly expressed in specifying the prior distribution on the cluster membership. Our model has the flexibility of modeling uncertainties of all the parameter estimates including the community membership. Importantly, and unlike the majority of existing methods, our model has the ability to learn the number of the communities via posterior inference without having to assume it to be known. Our model can be applied to community detection in both dense and sparse networks, with both categorical and continuous covariates, and our MCMC algorithm is very efficient with good mixing properties. We demonstrate the superior performance of our model over existing models in a comprehensive simulation study and an application to two real datasets.

Community detection,,
Networks with covariates,
Covariate-dependent random partition prior,
Gibbs sampler,
keywords:
volume: TBAissue: TBA
\startlocaldefs\endlocaldefs

, and

1 Introduction

The ubiquity of network data in modern science and engineering and the need to extract meaningful information out of them has spurred rapid developments in the models, theory, and algorithms for the inference of networks (Erdős and Rényi, 1959; Bickel and Chen, 2009; Wolfe and Olhede, 2013; Kolaczyk, 2009; Lovász, 2012; Kolaczyk et al., 2020). Among the specific learning tasks with network data, community detection, which aims to detect communities or clusters among nodes, has arguably received the most attention in the scientific community. Various models and algorithms have been developed for community detection in networks including modularity-based methods (Newman, 2006), spectral clustering algorithms (Luxburg, 2007; Rohe et al., 2011), stochastic block models (Holland et al., 1983; Karrer and Newman, 2011; Ball et al., 2011), optimization-based approaches via semidefinite programming (Amini and Levina, 2018), and various Bayesian models (Mørup and Schmidt, 2012; Amini et al., 2019), among others.

Besides the edge information of an observed network, there are often additional covariates or nodal information available in many real-world networks. These additional covariates should be ideally utilized when performing community detection. For example, in a Facebook network, one can obtain from an individual’s profile covariates including current city, workplace, hometown, education, and hobbies. Another example is the Weddell Sea trophic network, which describes the marine ecosystem of the Weddell Sea (Jacob et al., 2011). It is a predator-prey network that includes the average adult body mass for each of the species. If one were to only utilize the network information, it is hard to differentiate all the different feeding types. However, the body mass of each species shows a partial clustering by the group, as seen in Figure 1. Therefore, a better clustering should be achievable when both the network and covariates are incorporated.

Refer to caption
Figure 1: Log body mass for different species colored by their feeding type.

Such network data have motivated an emerging line of work that aims to deal with community detection problems that leverage both the network and the exogenous covariates. A node-coupled SBM is proposed in Weng and Feng (2022) in which cluster information or the block matrix is uniquely encoded by the covariates. Another model from Zhang et al. (2019) specifies that the link probability between a pair of nodes is contributed additively by the block probability in an SBM and a similarity measure between the covariates of a pair of nodes. A similar class of block models is proposed in Sweet (2015) that also accommodates covariates in an additive way such that the link probability is influenced by both block membership and covariates. A covariate-assisted spectral clustering algorithm is proposed in Binkiewicz et al. (2017) and later modified for degree-corrected block models in Hu and Wang (2022). Categorical covariates on the actor level are included in the model in Tallberg (2004), and the block affiliation probabilities are modeled conditional on the covariates via a multinomial probit model. Another prominent method in the frequentist literature is due to Zhang et al. (2016) in which a joint community detection criterion is proposed using both the adjacency matrix of the network and the node features, and their algorithm weights the edges according to feature similarities. Recently, the interplay between network information and covariates is investigated in an optimization framework for community detection in sparse networks in Yan and Sarkar (2021). From a Bayesian perspective, there are a few papers that are closely related to our work. We introduce them in Section 2.1 and provide a detailed discussion of their connections to our work (and to each other).

We add to this literature by proposing a Bayesian community detection procedure in which the effects of the covariates are incorporated via a covariate-dependent random partition prior on the node labels of an SBM. The covariates are explicitly expressed and incorporated in the prior probability of generating clusters. One of the distinctive features of our models compared with the ones already proposed in the literature is that ours has the ability to learn the number of communities via posterior inference without having to assume it to be known. The proposed model has the flexibility of assessing uncertainties for all the model parameters through an efficient MCMC algorithm for posterior inference. Note that there are several works in the literature that have employed the idea of a random partition prior or Bayesian nonparametric models for modeling network or relational data. We discuss these comparisons in Section 2.1 after introducing our model.

Our model can be applied in both dense and sparse regimes. In a sparse regime, as one of our simulation studies shows, our model outperforms other state-of-the-art methods such as that of Yan and Sarkar (2021), whose primary goal was to deal with sparse network condition with covariates. We also apply our methods to networks that have covariates with relatively high-dimensions. Our extensive simulations demonstrate our overall superior performance over existing methods in networks with continuous or categorical features, even when those methods are given the true number of communities.

The remainder of our paper is organized as follows. Section 2 is devoted to our model description and MCMC algorithms. In Section 3, we carry out several simulation studies in various settings to demonstrate the utility of our proposed model and algorithms. We also apply our model to two data examples in Section 4. We conclude in Section 5 with possiblities for future work.

2 Prior, model, and MCMC algorithm

Consider an observed network on nn nodes represented by an n×nn\times n adjacency matrix A=(Aij)A=(A_{ij}) with Aij=1A_{ij}=1 indicating the presence of a link between nodes ii and jj, and Aij=0A_{ij}=0 otherwise. Assume in addition that we have some covariate information xipx_{i}\in\mathbb{R}^{p} for each node i=1,,ni=1,\dots,n. The covariate information associated with the node are often referred to as nodal information or node features of the network, and are frequently encountered in modern network data. We let 𝒙=(x1,,xn)Tn×p\bm{x}=(x_{1},\ldots,x_{n})^{T}\in~{}\mathbb{R}^{n\times p} denote all of the node covariates of a given network. Our goal is to perform network community detection by incorporating both the network structure and the nodal information. The key challenge is how to jointly model these two sources of information. Below we propose a Bayesian model that incorporates the nodal information in the prior probability of cluster membership within an SBM.

Let 𝒛=(z1,,zn)n\bm{z}=(z_{1},\ldots,z_{n})\in\mathbb{N}^{n} be a node membership vector and L(𝒛)=max{zi:i[n]}L(\bm{z})=\max\{z_{i}:i\in[n]\} indicate the total number of clusters implied by 𝒛\bm{z}. We do not assume L(𝒛)L(\bm{z}) to be known a priori. Let S(𝒛)={i[n]:zi=}S_{\ell}(\bm{z})=\{i\in[n]:\,z_{i}=\ell\} be the set of indices of nodes belonging to the th\ell^{th} cluster according to 𝒛\bm{z}. For any subset S[n]S\subseteq[n] and 𝒙=(x1,,xn)Tn×p\bm{x}=(x_{1},\cdots,x_{n})^{T}\in~{}\mathbb{R}^{n\times p}, let g(S|𝒙)g(S\,|\,\bm{x}) be a nonnegative function that measures the homogeneity of the covariates {xi,iS}\{x_{i},~{}i\in S\}. That is, g(S|𝒙)g(S\,|\,\bm{x}) takes larger values when all of the xix_{i} with iSi\in S are more similar. One can think of SS as a potential cluster of nodes and g(S|𝒙)g(S\,|\,\bm{x}) as a measure of the quality of such cluster, with regards the nodal information

Inspired by Müller and Quintana (2010); Park and Dunson (2010); Müller et al. (2011), we consider the following covariate-dependent random partition model:

p(𝒛|𝒙)=1L(𝒛)g(S(𝒛)|𝒙)c(S(𝒛)).p(\bm{z}\,|\,\bm{x})\propto\prod^{L(\bm{z})}_{\ell=1}g\big{(}S_{\ell}(\bm{z})\,|\,\bm{x}\big{)}\cdot c\big{(}S_{\ell}(\bm{z})\big{)}\enskip. (1)

The non-negative function Sc(S)S\mapsto c(S) is known as the cohesion function of the product partition probability model. In a random partition model based on the Dirichlet process, with baseline probability measure G0G_{0} and concentration parameter α\alpha, one has c(S)=α(|S|1)!c(S)~{}=~{}\alpha(|S|-1)! (Ferguson, 1973; Sethuraman, 1994).

Borrowing from Müller et al. (2011), we define g(S|𝒙)g(S\,|\,\bm{x}) based on an auxiliary probability model q(|)q(\cdot\,|\,\cdot), where

g(S|𝒙)=iSq(xi|ξ)ν(ξ)dξ.g(S\,|\,\bm{x})=\int\prod_{i\in S}q(x_{i}\,|\,\xi)\,\nu(\xi)\,d\xi\enskip. (2)

Note that the covariates 𝒙\bm{x} are not random. The term iSq(xi|ξ)\prod_{i\in S}q(x_{i}\,|\,\xi) measures the effect or contribution of the covariates on the prior probability of cluster SS. One can choose q(xi|ξ)q(x_{i}\,|\,\xi) and ν(ξ)\nu(\xi) as a conjugate pair to facilitate the analytic evaluation of g(S|𝒙)g(S\,|\,\bm{x}). For the cohesion function, we adopt c(S)=α(|S|1)!c(S)=\alpha(|S|-1)!.

Combining equations (1) and (2), we have

p(𝒛|𝒙)=1L(𝒛)[iS(𝒛)q(xi|ξ)dν(ξ)]c(S(𝒛)).\displaystyle p(\bm{z}\,|\,\bm{x})\propto\prod^{L(\bm{z})}_{\ell=1}\bigg{[}\int\!\!\prod_{i\,\in\,S_{\ell}(\bm{z})}\!\!q(x_{i}\,|\,\xi_{\ell})\,d\nu(\xi_{\ell})\bigg{]}c\big{(}S_{\ell}(\bm{z})\big{)}\enskip. (3)

In this model, ξ\xi_{\ell} can be considered the center of the nodal covarites in cluster \ell, and q(xi|ξ)q(x_{i}\,|\,\xi_{\ell}) a measure of how far the covariates in cluster SS_{\ell} are from its center ξ\xi_{\ell}. The model then averages over all possible centers ξν\xi_{\ell}\sim\nu.

The distribution in (3) can be written as the marginal of

p(𝒛,𝝃|𝒙)=1L(𝒛)[c(S(𝒛))iS(𝒛)q(xiξ)ν(ξ)]=L(𝒛)+1ν(ξ),\displaystyle p(\bm{z},\bm{\xi}\,|\,\bm{x})\propto\prod^{L(\bm{z})}_{\ell=1}\Big{[}c\big{(}S_{\ell}(\bm{z})\big{)}\!\!\prod_{i\,\in\,S_{\ell}(\bm{z})}q(x_{i}\mid\xi_{\ell})\,\nu(\xi_{\ell})\Big{]}\cdot\!\!\prod_{\ell=L(\bm{z})+1}^{\infty}\nu(\xi_{\ell})\enskip, (4)

where 𝝃=(ξ1,ξ2,)\bm{\xi}=(\xi_{1},\xi_{2},\dots). One can use (4) to derive a Gibbs sampler for sampling the prior. To simplify the notation, we let

L=L(𝒛i)andS=S(𝒛i)\displaystyle L=L(\bm{z}_{-i})\quad\text{and}\quad S_{\ell}=S_{\ell}(\bm{z}_{-i}) (5)

for [L]\ell\in[L] denote the number of clusters of 𝒛i=(zj,ji)\bm{z}_{-i}=(z_{j},j\neq i) and the clusters themselves. Let

ψk:={|Sk|k[L]αk=L+1.\displaystyle\psi_{k}:=\begin{cases}|S_{k}|&k\in[L]\\ \alpha&k=L+1\end{cases}\enskip. (6)

One can show that for k[L+1]k\in[L+1],

p(zi=k,ξL+1|𝒛i,𝝃1:L,𝒙)ν(ξL+1)ψkq(xi|ξk),\displaystyle p(z_{i}=k,\,\xi_{L+1}\,|\,\bm{z}_{-i},\,\bm{\xi}_{1:L},\bm{x})\;\propto\;\nu(\xi_{L+1})\cdot\psi_{k}\,q(x_{i}\,|\,\xi_{k})\enskip, (7)

where 𝝃1:L=(ξ1,,ξL)\bm{\xi}_{1:L}=(\xi_{1},\dots,\xi_{L}). This is equivalent to first drawing ξL+1ν()\xi_{L+1}\sim\nu(\cdot), and then drawing ziz_{i} as follows:

p(zi=k|𝒛i,𝝃1:L+1,𝒙)ψkq(xi|ξk).\displaystyle p(z_{i}=k\,|\,\bm{z}_{-i},\,\bm{\xi}_{1:L+1},\bm{x})\;\propto\;\psi_{k}\,q(x_{i}\,|\,\xi_{k})\enskip. (8)
Example.

For continuous features, we can take

q(x|ξ)=N(x;ξ,s2I)andν=N(0,τ2I),\displaystyle q(x\,|\,\xi)=N(x;\xi,s^{2}I)\quad\text{and}\quad\nu=N(0,\tau^{2}I)\enskip, (9)

where N(x;ξ,s2I)N(x;\xi,s^{2}I) denotes the density of a normal distribution with mean ξ\xi and covariance matrix s2Is^{2}I, evaluated at xx. Then, kq(xi|ξk)k\mapsto q(x_{i}\,|\,\xi_{k}) in (8) will be proportional to exp(xiξk2/2s2)\exp(-\|x_{i}-\xi_{k}\|^{2}/2s^{2}), which shows that if ξ\xi_{\ell} is the closest to xix_{i} among {ξk}\{\xi_{k}\}, then the inclusion of the covariate information increases the chance of assigning ziz_{i} to cluster \ell.

Example.

For categorical covariates, one can choose q(xi|ξk)q(x_{i}\,|\,\xi_{k}) to be a multinomial distribution and ν()\nu(\cdot) to be a Dirichlet distribution. Suppose there are RR categorical features and the rrth feature has ara_{r} categories for r=1,,Rr=1,\ldots,R. Then xi=(xi1,,xiR)x_{i}=(x_{i1},\ldots,x_{iR}), where xirx_{ir} is the rr-th feature of node ii, and xir{1,,ar}x_{ir}\in\{1,\ldots,a_{r}\}. Each ξk\xi_{k} collects parameters of RR multinomial vectors, that is, ξk=(ξrk)r=1R\xi_{k}=(\xi_{rk})_{r=1}^{R}, where the coordinates ξrk=(ξrk1,ξrk2,,ξrkar)\xi_{rk}=(\xi_{rk}^{1},\xi_{rk}^{2},\cdots,\xi_{rk}^{a_{r}}) are independent draws from Dir(γ𝟏ar)\text{Dir}(\gamma\bm{1}_{a_{r}}). We have

q(xiξk)=r=1Rc=1ar(ξrkc)1{xir=c}andν=i=1R𝖣𝗂𝗋(γ𝟏ar).\displaystyle q(x_{i}\mid\xi_{k})=\prod_{r=1}^{R}\prod_{c=1}^{a_{r}}(\xi_{rk}^{c})^{1\{x_{ir}=c\}}\quad\text{and}\quad\nu=\prod_{i=1}^{R}\operatorname{\mathsf{Dir}}(\gamma\bm{1}_{a_{r}})\enskip. (10)

We usually take γ=1\gamma=1.

An alternative approach to sample from the prior is to perform Gibbs sampling on the marginalized distribution (3). This leads to the following updates. For each k[L+1]k\in[L+1],

p(zi=k|𝒛i,𝒙)ψkg(Sk{i}|𝒙)g(Sk|𝒙),\displaystyle p(z_{i}=k\,|\,\bm{z}_{-i},\,\bm{x})\;\propto\;\psi_{k}\,\frac{g(S_{k}\cup\{i\}\,|\,\bm{x})}{g(S_{k}\,|\,\bm{x})}\enskip, (11)

where SL+1=S_{L+1}=\emptyset and g(|𝒙)=1g(\emptyset\,|\,\bm{x})=1. This approach is, in particular, useful when q(|)q(\cdot\,|\,\cdot) and ν()\nu(\cdot) are conjugate so that g(S|𝒙)g(S\,|\,\bm{x}) is easy to compute.

With the priors thus defined, the network is assumed to follow a SBM, that is,

p(A|𝜼,𝒛)=1i<jnηzi,zjAij(1ηzi,zj)1Aij,\displaystyle p(A\,|\,\bm{\eta},\bm{z})=\prod_{1\leq i<j\leq n}\eta_{z_{i},z_{j}}^{A_{ij}}(1-\eta_{z_{i},z_{j}})^{1-A_{ij}}\enskip, (12)

where 𝜼=(ηk,)\bm{\eta}=(\eta_{k,\ell}) is the connectivity matrix of SBM, with ηk,\eta_{k,\ell} representing the link probability between nodes in clusters kk and \ell.

It is possible to obtain closed forms for the full conditional distributions of the unknown model parameters 𝜼,𝒛\bm{\eta},\bm{z}, and 𝝃\bm{\xi} with appropriate choices of q()q(\cdot) and ν()\nu(\cdot), as demonstrated in Section 2.2. We note that since the prior random partition model (1) puts mass on all potential partitions of the nn nodes, the posterior distribution p(𝒛|A,𝒙)p(\bm{z}\,|\,A,\bm{x}) also puts mass on all such partitions; however, the posterior will be concentrated around certain partition(s), hence a posteriori, there is a most likely value of L(𝒛)L(\bm{z}), the number of communities in 𝒛\bm{z}. That is how the model learns the number of communities.

2.1 Comparison with literature

There are several works in the literature similar to our model that also use a Bayesian nonparametric approach for tasks related to node clustering.

A pioneering work in the area is that of Kemp et al. (2006). Motivated by the complex system of relations underlying semantic knowledge, Kemp et al. (2006) propose the infinite relational model (IRM) for discovering and clustering underlying structure in relational data sets. In this framework, the observed data are assumed from nn types (people, demographic features, answers to a personality test, etc.) and mm relationships (person ii likes person jj, feature xx causes answer yy, etc). The motivating example given in Kemp et al. (2006) is that of clustering people, represented by set T1T^{1}, based on social predicates, represented by set T2T^{2}. The observed data is the tensor T1×T1×T2{0,1}T^{1}\times T^{1}\times T^{2}\mapsto\{0,1\} whose (i,j,p)(i,j,p) entry determines whether persons ii and jj have social predicate type pp. The idea is to simultaneously cluster T1T^{1} and T2T^{2} so that the tensor is roughly constant within the resulting clusters. In modern language, the model proposed by Kemp et al. (2006) is the so-called tensor SBM (Kim et al., 2017; Wang and Zeng, 2019; Lei et al., 2020) but with a CRP prior on the labels of each dimension to allow for infinite clusters a priori. IRM was later explicitly adopted in Mørup and Schmidt (2012) for community detection in network data, as opposed to relational data. IRM is quite flexible and can, for example, be used to incorporate an attribute or feature taking finite values, by taking T2T^{2} above to be the levels of that attribute. However, this attribute should be interpreted as an edge feature, and moreover, IRM needs to have observations on the connectivity of persons (i,j)(i,j) for all possible levels of this attribute. Adding each feature then requires increasing the dimension of the tensor by one, and demanding lots of observations which are not available in practice in network problems.

More recently, nonparametric Bayesian network models that accommodate nodal covariates have been considered, but mostly with the mixed membership SBM (MMSBM) framework of Airoldi et al. (2008). For example, Kim et al. (2012) introduced the nonparametric metadata dependent relational (NMDR) model, that essentially couples the MMSBM likelihood with node-covariate-dependent prior on cluster labels. More specifically, they assume latent community vectors ηk\eta_{k} that interact with node features ϕi\phi_{i} to produce a score vkiv_{ki} that determines how likely node ii belongs to community kk, a priori. They assume that vkiv_{ki} are normal with mean ηk,ϕi\langle\eta_{k},\phi_{i}\rangle and then translate these real-valued affinities to probabilities πki\pi_{ki} via a logistic-stick breaking process (Ren et al., 2011). The πi=(πki)\pi_{i}=(\pi_{ki}) then determine the edge probabilities via 𝔼[Aij|πi,πj]=πiTWπj\mathbb{E}[A_{ij}\,|\,\pi_{i},\pi_{j}]=\pi_{i}^{T}W\pi_{j} where WW is the connectivity matrix.

Along the same lines, Zhao et al. (2017) extends the edge partition model (EPM) of Zhou (2015) to incorporate binary node features. The EPM has similarities to MMSBM with novel uses of a Bernoulli-Poisson likelihood coupled with a nonparametric partition model. More specifically, the latent Poisson component X=(Xij)X=(X_{ij}) still follows a MMSBM decomposition with 𝔼[Xij|ϕi,ϕj]=ϕiTΛϕj\mathbb{E}[X_{ij}\,|\,\phi_{i},\phi_{j}]=\phi_{i}^{T}\Lambda\phi_{j} where ϕi\phi_{i} are the soft community assignments and Λ\Lambda the connectivity matrix. Similar to Kim et al. (2012), the nodal covariate information is incorporated in constructing a prior on ϕi=(ϕik)\phi_{i}=(\phi_{ik}). The prior assumes ϕi\phi_{i} to be drawn from a Gamma distribution with mean 𝔼[ϕik]=cibk=1Lhkfi\mathbb{E}[\phi_{ik}]=c_{i}b_{k}\prod_{\ell=1}^{L}h_{\ell k}^{f_{i\ell}} where fif_{i\ell} is the \ellth binary feature of node ii. Note that by introducing ηk:=(loghk)\eta_{k}:=(\log h_{\ell k}) and fi=(fi)f_{i}=(f_{i\ell}), one can write 𝔼[ϕik]=cibkexp(ηk,fi)\mathbb{E}[\phi_{ik}]=c_{i}b_{k}\exp(\langle\eta_{k},f_{i}\rangle), showing that essentially the same inner product interaction of feature and latent community vectors as in Kim et al. (2012) is used by Zhao et al. (2017).

As in Kim et al. (2012) and Zhao et al. (2017), our model also incorporates node features into the partition prior; however, we are modeling hard community assignments rather than soft assignment vectors, making the problem somewhat more challenging. More importantly, our approach allows for a more general dependence of the partition on the features via a kernel q(xi|ξ)q(x_{i}\,|\,\xi), compared to the simple inner product approach used in both Kim et al. (2012) and Zhao et al. (2017). Our approach is not limited to binary features and by incorporating more complexity into q(xi|ξ)q(x_{i}\,|\,\xi), we can potentially model more complex feature/community interactions in the prior.

The last closely related work that we discuss is that of Newman and Clauset (2016). They consider node features (metadata) that take values in a finite discrete set (say 𝕏\mathbb{X}). Similar to our work, the node metadata is used to influence the prior on the community assignements. In our notation, they assume p(zi|xi)=γzi,xip(z_{i}\,|\,x_{i})=\gamma_{z_{i},x_{i}} where γ=(γk,x)[0,1]K×|𝕏|\gamma~{}=~{}(\gamma_{k,x})\in~{}[0,1]^{K\times|\mathbb{X}|} is a parameter matrix to be estimated form the data. Compared to the inner product model of Kim et al. (2012) and Zhao et al. (2017), this approach gives a more flexible model for the interaction of the communities and features. The drawback is that it is limited to discrete features and if |𝕏||\mathbb{X}| is large, there is potential for over-fitting without further regularization of the γ\gamma matrix. Newman and Clauset (2016) use an EM algorithm to estimate the parameters, and they assume the number of communities to be known.

2.2 Gibbs sampler

We now derive a Gibbs sampler to sample from the complete posterior distribution of (𝒛,𝝃,𝜼)(\bm{z},\bm{\xi},\bm{\eta}) given AA and 𝒙\bm{x}. The main challenge is deriving the updates for 𝒛\bm{z}.

We sample from 𝒛\bm{z}, 𝝃\bm{\xi}, and 𝜼\bm{\eta} through their full conditional distributions, which are given bellow, until reaching convergence, and then obtain a sample of adequate size of the posterior distribution for inference.

2.2.1 Initialization

We initialize the labels by drawing from a Chinese Restaurant Process (CRP),

𝒛𝖢𝖱𝖯(α).\displaystyle\bm{z}\sim\operatorname{\mathsf{CRP}}(\alpha)\enskip.

A CRP can be seen as a special case of our prior without any covariates. This follows from (11) by setting g(S|𝒙)=1g(S\,|\,\bm{x})=1. Once 𝒛\bm{z} is initialized, all the other parameters can be initialized by the Gibbs updates derived below.

Note that initializing the chain by sampling 𝒛\bm{z} from a CRP provides a random start without having to specify the number of the communities KK. In many algorithms, spectral clustering is often used to initialize 𝒛\bm{z}. For a Bayesian model, this is not a natural choice. Moreover, it requires the knowledge or an estimate of KK.

2.2.2 Sampling 𝒛\bm{z}

Let b(x;a,b)=xa1(1x)b1b(x;a,b)=x^{a-1}(1-x)^{b-1} and for simplicity, define

c~(S):=c(S)jSq(xj|ξ)ν(ξ).\displaystyle\widetilde{c}_{\ell}(S):=c(S)\prod_{j\in S}q(x_{j}\,|\,\xi_{\ell})\,\nu(\xi_{\ell})\enskip. (13)

Note that c~(S)\widetilde{c}_{\ell}(S) implicitly depends on ξ\xi_{\ell}. We have

p(A,𝒛,𝝃,𝜼𝒙)=p(A𝜼,𝒛)p(𝒛,𝝃𝒙)p(𝜼)\displaystyle p(A,\bm{z},\bm{\xi},\bm{\eta}\mid\bm{x})=p(A\mid\bm{\eta},\bm{z})\cdot p(\bm{z},\bm{\xi}\mid\bm{x})\cdot p(\bm{\eta})
1i<jnηzizjAij(1ηzizj)1Aij=1c~(S)1mb(ηm;β,β).\displaystyle\qquad\propto\prod_{1\leq i<j\leq n}\eta_{z_{i}z_{j}}^{A_{ij}}(1-\eta_{z_{i}z_{j}})^{1-A_{ij}}\prod_{\ell=1}^{\infty}\widetilde{c}_{\ell}(S^{\prime}_{\ell})\prod_{1\leq m\leq\ell\leq\infty}b(\eta_{\ell m};\beta,\beta)\enskip.

Here, S={i[n]:zi=}S^{\prime}_{\ell}=\{i\in[n]:\;z_{i}=\ell\} is the \ellth community of 𝒛\bm{z}. We assume that 𝒛\bm{z} has LL^{\prime} communities S1,S2,,SLS^{\prime}_{1},S^{\prime}_{2},\dots,S^{\prime}_{L^{\prime}} and let SL+1=SL+2==S^{\prime}_{L^{\prime}+1}=S^{\prime}_{L^{\prime}+2}=\cdots=\emptyset. The convention is that c()=1c(\emptyset)=1 while c(S)=αΓ(|S|)c(S)=\alpha\Gamma(|S|) when SS is nonempty. Similarly, j()=1\prod_{j\in\emptyset}(\,\cdots)=1. In the above, we assume that 𝝃=(ξ1,ξ2,)\bm{\xi}=(\xi_{1},\xi_{2},\dots) collects all possible ξ\xi_{\ell} and similarly for 𝜼=(ηm:,m)\bm{\eta}=(\eta_{\ell m}:\ell,m\in\mathbb{N}).

Fix ii and let S={j[n]i:zj=}S_{\ell}=\{j\in[n]\setminus i:z_{j}=\ell\} be the \ellth community of 𝒛i\bm{z}_{-i}. We assume that 𝒛i\bm{z}_{-i} has LL communities S1,S2,,SLS_{1},S_{2},\dots,S_{L} and by convention, let SL+1=SL+2==S_{L+1}=S_{L+2}=\dots=\emptyset. To get the communities of 𝒛\bm{z} from 𝒛i\bm{z}_{-i}, we either update SkS_{k} to Sk=Sk{i}S^{\prime}_{k}=S_{k}\cup\{i\} for some k[L]k\in[L], or update SL+1S_{L+1} to SL+1=SL+1{i}={i}S^{\prime}_{L+1}=S_{L+1}\cup\{i\}=\{i\}, generating a new community. With the convention we use, we can compactly write both cases as Sk=Sk{i}S^{\prime}_{k}=S_{k}\cup\{i\} for all k[L+1]k\in[L+1].

For any k[L+1]k\in[L+1], we obtain

p(zi=k,𝝃L+1,𝜼L+1,[L]|𝒛i,A,𝝃[L],𝜼[L],[L],𝒙)j:jiηk,zjAij(1ηk,zj)1Aijc~k(Sk{i})c~k(Sk)=1L+1c~(S)1mL+1b(ηm;β,β).\displaystyle\begin{split}&p\big{(}z_{i}=k,\,\bm{\xi}_{L+1},\,\bm{\eta}_{L+1,[L]}\,|\,\bm{z}_{-i},\,A,\,\bm{\xi}_{[L]},\,\bm{\eta}_{[L],[L]},\,\bm{x}\big{)}\propto\\ &\quad\prod_{j:j\neq i}\eta_{k,z_{j}}^{A_{ij}}(1-\eta_{k,z_{j}})^{1-A_{ij}}\cdot\frac{\widetilde{c}_{k}(S_{k}\cup\{i\})}{\widetilde{c}_{k}(S_{k})}\prod_{\ell=1}^{L+1}\widetilde{c}_{\ell}(S_{\ell})\prod_{1\leq m\leq\ell\leq L+1}\!\!\!b(\eta_{\ell m};\beta,\beta)\enskip.\end{split} (14)

Letting

Oi=j:jiAij1{zj=},n=j:ji1{zj=},\displaystyle O_{i\ell}=\sum_{j:j\neq i}A_{ij}1\{z_{j}=\ell\},\quad n_{\ell}=\sum_{j:j\neq i}1\{z_{j}=\ell\}\enskip, (15)

we have j:jiηk,zjAij(1ηk,zj)1Aij==1LηkOi(1ηk)nOi\prod_{j:j\neq i}\eta_{k,z_{j}}^{A_{ij}}(1-\eta_{k,z_{j}})^{1-A_{ij}}=\prod_{\ell=1}^{L}\eta_{k\ell}^{O_{i\ell}}(1-\eta_{k\ell})^{n_{\ell}-O_{i\ell}}.

Noting that =1Lc~(S)\prod_{\ell=1}^{L}\widetilde{c}_{\ell}(S_{\ell}) is a constant in (14), and similarly for any term in

1mL+1b(πm;β,β)\prod_{1\leq m\leq\ell\leq L+1}b(\pi_{\ell m};\beta,\beta)

that does not have an index equal to L+1L+1, we obtain

p(zi=k,𝝃L+1,𝜼L+1,[L]|𝒛i,A,𝝃[L],𝜼[L],[L],𝒙)=1LηkOi(1ηk)nOic~k(Sk{i})c~k(Sk)c~L+1(SL+1)m=1Lb(ηL+1,m;β,β).\displaystyle\begin{split}&p\big{(}z_{i}=k,\,\bm{\xi}_{L+1},\,\bm{\eta}_{L+1,[L]}\,|\,\bm{z}_{-i},\,A,\,\bm{\xi}_{[L]},\,\bm{\eta}_{[L],[L]},\,\bm{x}\big{)}\propto\\ &\quad\prod_{\ell=1}^{L}\eta_{k\ell}^{O_{i\ell}}(1-\eta_{k\ell})^{n_{\ell}-O_{i\ell}}\cdot\frac{\widetilde{c}_{k}(S_{k}\cup\{i\})}{\widetilde{c}_{k}(S_{k})}\widetilde{c}_{L+1}(S_{L+1})\prod_{m=1}^{L}b(\eta_{L+1,m};\beta,\beta)\enskip.\end{split} (16)

The product over mm runs up to LL since only ηL+1,[L]\eta_{L+1,[L]} is a variable while ηL+1,L+1\eta_{L+1,L+1} is a constant. This is because ziz_{i} can take the new value L+1L+1 but zjz_{j} with jij\neq i takes values in [L][L], hence we do not need to sample ηL+1,L+1\eta_{L+1,L+1} at this stage.

Since SL+1=S_{L+1}=\emptyset, we have

c~L+1(SL+1)\displaystyle\widetilde{c}_{L+1}(S_{L+1}) =ν(ξL+1),\displaystyle=\nu(\xi_{L+1})\enskip,
c~L+1(SL+1{i})\displaystyle\widetilde{c}_{L+1}(S_{L+1}\cup\{i\}) =αΓ(1)q(xi|ξL+1)ν(ξL+1).\displaystyle=\alpha\Gamma(1)\,q(x_{i}\,|\,\xi_{L+1})\,\nu(\xi_{L+1})\enskip.

Hence for k[L+1]k\in[L+1],

c~k(Sk{i})c~k(Sk)c~L+1(SL+1)\displaystyle\frac{\widetilde{c}_{k}(S_{k}\cup\{i\})}{\widetilde{c}_{k}(S_{k})}\widetilde{c}_{L+1}(S_{L+1}) =ν(ξL+1)ψkq(xiξk),\displaystyle=\nu(\xi_{L+1})\,\psi_{k}\,q(x_{i}\mid\xi_{k})\enskip,

where ψk\psi_{k} is defined in (6). Thus, we can compactly write

p(zi=k,𝝃L+1,𝜼L+1,[L]|𝒛i,A,𝝃[L],𝜼[L],[L],𝒙)ν(ξL+1)m=1Lb(ηL+1,m;β,β)ψkq(xi|ξk)=1LηkOi(1ηk)nOi.\displaystyle\begin{split}&p\big{(}z_{i}=k,\,\bm{\xi}_{L+1},\,\bm{\eta}_{L+1,[L]}\,|\,\bm{z}_{-i},\,A,\,\bm{\xi}_{[L]},\,\bm{\eta}_{[L],[L]},\,\bm{x}\big{)}\propto\\ &\quad\nu(\xi_{L+1})\prod_{m=1}^{L}b(\eta_{L+1,m};\beta,\beta)\cdot\psi_{k}q(x_{i}\,|\,\xi_{k})\prod_{\ell=1}^{L}\eta_{k\ell}^{O_{i\ell}}(1-\eta_{k\ell})^{n_{\ell}-O_{i\ell}}\enskip.\end{split} (17)

This is equivalent to the following. First draw ξL+1ν\xi_{L+1}\sim\nu and ηL+1,mBeta(β,β)\eta_{L+1,m}\sim\text{Beta}(\beta,\beta) for m[L]m\in[L], all independently. Then draw ziz_{i} from

p(zi=k|𝒛i,A,𝝃[L+1],𝜼[L+1],[L],𝒙)ψkq(xiξk)=1LηkOi(1ηk)nOi,k[L+1].\displaystyle\begin{split}&p\big{(}z_{i}=k\,|\,\bm{z}_{-i},\,A,\,\bm{\xi}_{[L+1]},\bm{\eta}_{[L+1],[L]},\bm{x}\big{)}\;\propto\;\\ &\qquad\qquad\psi_{k}\,q(x_{i}\mid\xi_{k})\prod_{\ell=1}^{L}\eta_{k\ell}^{O_{i\ell}}(1-\eta_{k\ell})^{n_{\ell}-O_{i\ell}},\quad k\in[L+1]\enskip.\end{split} (18)

For continuous features, we use (9) for q(|)q(\cdot\,|\,\cdot) and ν\nu, and for categorical variables we use (10) in the above updates.

Remark.

Note that update (18) is where a potentially new community (labeled L+1L+1) is created. This happens if the following conditions are met: (a) when sampling ziz_{i} according to (18), we happen to pick zi=L+1z_{i}=L+1, (b) L=LL^{\prime}=L, that is, the current number of communities is LL, and (c) currently ziz_{i} is not assigned to a singlton community. In such a case, the number of communities will increase from L=LL^{\prime}=L to L+1L+1. On the other hand, update (18) can also annihilate a community if the following holds: (a) When sampling ziz_{i}, we pick zi[L]z_{i}\in[L] and (b) L=L+1L^{\prime}=L+1 which means that ziz_{i} is currently assigned to a singleton community. In this case, the new number of communities will go from LL^{\prime} to L1=LL^{\prime}-1=L.

2.2.3 Sampling 𝝃\bm{\xi}

We have

p(𝝃|A,𝒛,𝒙,𝜼)\displaystyle p(\bm{\xi}\,|\,A,\bm{z},\bm{x},\bm{\eta}) =p(𝝃𝒛,𝒙)=1LHS(ξ),\displaystyle=p(\bm{\xi}\mid\bm{z},\bm{x})\propto\prod_{\ell=1}^{L^{\prime}}H_{S^{\prime}_{\ell}}(\xi_{\ell})\enskip,

where HSH_{S} is the distribution with density

HS(ξ)\displaystyle H_{S}(\xi) iSq(xi|ξ)ν(ξ).\displaystyle\propto\prod_{i\in S}q(x_{i}\,|\,\xi)\,\nu(\xi)\enskip. (19)

That is, ξ\xi_{\ell} are independent draws from HSH_{S^{\prime}_{\ell}}. We recall that S={i[n]:zi=}S^{\prime}_{\ell}=\{i\in[n]:\;z_{i}=\ell\}.

The details of sampling 𝝃\bm{\xi} are slightly different given different choices of q()q(\cdot) and ν()\nu(\cdot) depending on whether continuous or categorical features are available. For continuous features with the Gaussian choice (9), HS(ξ)iSN(xi;ξ,s2I)N(ξ;0,τ2I)H_{S}(\xi)\propto\prod_{i\in S}N(x_{i};\xi,s^{2}I)\cdot N(\xi;0,\tau^{2}I) which gives

HS=N(τ2iSxi|S|τ2+s2,s2τ2|S|τ2+s2I).\displaystyle H_{S}=N\left(\frac{\tau^{2}\sum_{i\in S}x_{i}}{|S|\tau^{2}+s^{2}},\frac{s^{2}\tau^{2}}{|S|\tau^{2}+s^{2}}I\right)\enskip.

For the categorical features with the choice (10), we have

HS(ξ)iSr=1Rc=1ar(ξrc)1{xir=c}r=1Rc=1ar(ξrc)γ1=r=1Rc=1ar(ξrc)αrc(S)1,\displaystyle H_{S}(\xi)\propto\prod_{i\in S}\prod_{r=1}^{R}\prod_{c=1}^{a_{r}}(\xi_{r}^{c})^{1\{x_{ir}=c\}}\cdot\prod_{r=1}^{R}\prod_{c=1}^{a_{r}}(\xi_{r}^{c})^{\gamma-1}=\prod_{r=1}^{R}\prod_{c=1}^{a_{r}}(\xi_{r}^{c})^{\alpha_{r}^{c}(S)-1}\enskip,

where αrc(S):=γ+iS1{xir=c}\alpha_{r}^{c}(S):=\gamma+\sum_{i\in S}1\{x_{ir}=c\}. That is, HSH_{S} is the product of Dirichlet distributions,

HS=r=1R𝖣𝗂𝗋(αr1(S),,αrar(S)).\displaystyle H_{S}=\prod_{r=1}^{R}\operatorname{\mathsf{Dir}}(\alpha_{r}^{1}(S),\dots,\alpha_{r}^{a_{r}}(S))\enskip.

2.2.4 Sampling 𝜼\bm{\eta}

Let us define the index sets

Γk={{(i,j):1i<jn}k={(i,j):1ijn}k,\displaystyle\Gamma_{k\ell}=\begin{cases}\{(i,j):1\leq i<j\leq n\}&k=\ell\\ \{(i,j):1\leq i\neq j\leq n\}&k\neq\ell\end{cases}\enskip, (20)

and block counts

Mk=(i,j)ΓkAij1{zi=k,zj=},Nk=(i,j)Γk1{zi=k,zj=}.\displaystyle M_{k\ell}=\sum_{(i,j)\in\Gamma_{k\ell}}A_{ij}1\{z_{i}=k,z_{j}=\ell\},\quad N_{k\ell}=\sum_{(i,j)\in\Gamma_{k\ell}}1\{z_{i}=k,z_{j}=\ell\}\enskip. (21)

Then, we have

p(𝜼A,𝒛,𝒙,𝝃)=p(𝜼A,𝒛)kηkMk+β1(1ηk)NkMk+β1.\displaystyle p(\bm{\eta}\mid A,\bm{z},\bm{x},\bm{\xi})=p(\bm{\eta}\mid A,\bm{z})\propto\prod_{k\leq\ell}\eta_{k\ell}^{M_{k\ell}+\beta-1}(1-\eta_{k\ell})^{N_{k\ell}-M_{k\ell}+\beta-1}\enskip.

Thus, ηk\eta_{k\ell} are independent draws from 𝖡𝖾𝗍𝖺(Mk+β,NkMk+β)\operatorname{\mathsf{Beta}}(M_{k\ell}+\beta,N_{k\ell}-M_{k\ell}+\beta).

Remark (Directed networks).

Although our current MCMC algorithm is only for undirected networks, our prior can be used for Bayesian community detection in directed networks with covariates. One can simply replace the undirected SBM likelihood in (12) with a directed SBM, by replacing i<ji<j with iji\neq j and removing the symmetry constraint on 𝛈\bm{\eta}. The resulting sampler will be almost identical, except for minor modifications to the counts (15) and (21) to account for the extra edge information.

3 Simulation study

In this section, we carry out multiple simulation studies in which we compare our methods, which we refer to as BCDC (Bayesian community detection for networks with covariates), with i) the covariate-assisted spectral clustering (CASC) algorithm (Binkiewicz et al., 2017), which uses both the network and the covariates information in a spectral clustering algorithm, ii) the covariated-assisted clustering on ratios of singular vectors (CASCORE) algorithm (Hu and Wang, 2022), which modfies CASC for degree heterogeneity, iii) kk-means algorithms (kk-means) applied only to the covariates, iv) spectral-clustering (SC) of the adjacency matrix, and v) a Bayesian SBM (BSBM), which is essentially a special case of our model with g(S𝒙)=1g(S\mid\bm{x})=1, therefore utilizing only the network information. For CASC, the core idea is to first construct a new Laplacian matrix Lx=L+τXXTL_{x}=L+\tau XX^{T}, where LL is the Laplacian matrix for the network and XX denotes the nn by pp feature matrix, and then apply the standard spectral clustering algorithm on LxL_{x}. Throughout, we select τ\tau based on the automated procedure given in (Binkiewicz et al., 2017, Section 2.3).

We consider simulation designs with (a) continuous features, (b) discrete or categorical features, (c) a mix of continuous and discrete covariates, (d) high-dimensional features, and (e) homophily effects with networks simulated from stochastic block models with varying connectivity patterns and sparsity levels. The performance of the estimated communities is measured by normalized mutual information (NMI), a measure ranging from 0 (random guessing) to 1 (perfect agreement). NMI is a measure of similarity of two partitions, and is widely used in the community detection literature. It allows comparisons of two partitions with different number of clusters while accounting for the issue of label invariance.

To define NMI, consider two partitions (labelings) on a set of objects and let (X,Y)(X,Y) be the two labels of a randomly drawn object. The joint probability distribution of (X,Y)(X,Y) is the normalized confusion matrix between the two partitions. We can define the mutual information I(X,Y)I(X,Y) and joint and marginal entropies—H(X,Y)H(X,Y), H(X)H(X) and H(Y)H(Y)—based on the aforementioned joint distribution, using standard definitions. It is common to define NMI as I(X,Y)/H(X,Y)I(X,Y)/H(X,Y). However, there are other variants and to be consistent with prior work, in particular Yan and Sarkar (2021), we will use the variant implemented in the R package NMI, namely, 2I(X,Y)/(H(X)+H(Y))2I(X,Y)/(H(X)+H(Y)), which is also referred to as symmetric uncertainty (Teukolsky et al. (1992, p. 634); Hall (1998)).

Overall, the simulation studies show that our method consistently outperforms the competitors and demonstrates the gain of our model by utilizing both the network and nodal information for detecting the community structures. The simulations were performed on a high-computing cluster. An R package for our samplers, as well as the code for these experiments, is available at the GitHub repository aaamini/bcdc (Shen et al., 2022). We ran our code on a high-performance cluster with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz with 28 cores and 256 GB RAM.

3.1 Continuous covariates

We first consider simulated networks with continuous covariates, and in particular, the Gaussian setting (9). We generate networks from an SBM having connectivity matrix η=(ηk)[0,1]K×K\eta=(\eta_{k\ell})\in[0,1]^{K\times K} with

ηk={pk=rpk.\displaystyle\eta_{k\ell}=\begin{cases}p&k=\ell\\ rp&k\neq\ell\end{cases}\enskip. (22)

The parameter r[0,1]r\in[0,1] controls the magnitude of disparity between the within and between connectivites and is a measure of network information for the community structure. In our simulations, we set p=0.1p=0.1 and vary rr. We consider n=150n=150 nodes with K=2K=2 communities of 100 and 50 nodes, respectively.

For each node, we generate d=2d=2 features, with one signal feature related to the community structure and one noise feature whose distribution is the same for all nodes. Letting xi2x_{i}\in\mathbb{R}^{2} be the feature vector for node ii and e1=(1,0)e_{1}=(1,0), we take

xi|ziN(μσzie1,I2),\displaystyle x_{i}\,|\,z_{i}\sim N\bigl{(}\mu\sigma_{z_{i}}e_{1},I_{2}\bigr{)}\enskip,

where σ1=+1\sigma_{1}=+1, σ2=1\sigma_{2}=-1 and zi{1,2}z_{i}\in\{1,2\} is the community label of node ii. Here μ[0,)\mu\in[0,\infty) is proportional to the signal-to-noise ratio of the covariate information.

Figure 2 shows the mean NMI with a 50% quantile band from BCDC and competing methods, averaged over 500 replications, under different settings of rr and μ\mu. For BCDC, we have used parameters α=10\alpha=10, β=1\beta=1 and τ=s=1\tau=s=1, and ran the sampler for 1000 iterations. Note that in our comparison, all of the competing methods were given an additional advantage by assuming the knowledge of the true number of communities. However, our method (red) consistently outperforms these other methods. An interesting notable case is that of high network information (r=0.3r=0.3) and pure noise covariate information (μ=0\mu=0). In this case, BCDC performs as well as BSBM which only operates on network information, while CASC performs much worse being misled by pure noise covariates.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: NMI results for five different methods on a 2-block SBM with continuous data. In all cases, p=0.1p=0.1, and we vary the network and covariate signal-to-noise ratios, rr and μ\mu, respectively.

3.2 Categorical covariates

We next consider a simulation study for networks with categorical covariates. For each node ii, we again generate d=2d=2 features with one signal feature related to the community structure and one noise feature whose distribution is the same for all nodes. We consider two designs:

  1. (0)

    We consider networks with n=150n=150 nodes and K=3K=3 equally-sized communities. The signal features are taken to be the true community labels and the noise features are uniformly distributed on {1,2,3}\{1,2,3\}.

  2. (0)

    We consider networks with n=150n=150 nodes and K=2K=2 communities of 100 and 50 nodes. We create two 4-category features. Let xi=(xi1,xi2){1,2,3,4}2x_{i}=(x_{i1},x_{i2})\in\{1,2,3,4\}^{2} be the feature vector for node ii. We use the following generative model

    θ1,θ2\displaystyle\theta_{1},\theta_{2} 𝖣𝗂𝗋(𝟏4),\displaystyle\sim\operatorname{\mathsf{Dir}}(\bm{1}_{4})\enskip,
    xi1|zi\displaystyle x_{i1}\,|\,z_{i} θzi,xi2|zi𝟏4/4,\displaystyle\sim\theta_{z_{i}},\quad x_{i2}\,|\,z_{i}\sim\bm{1}_{4}/4\enskip,

    where, for example, xi1θzix_{i1}\sim\theta_{z_{i}} means that xi1x_{i1} is a categorical variable with probability vector θzi\theta_{z_{i}}.

In both cases, we use an SBM with connectivity (22), setting p=0.1p=0.1 and varying rr from 0.1 to 0.8. For the parameters of our model, we again set α=10\alpha=10, and ran the chain for 1,500 iterations. As above, the results are given over 500 replicates.

Figure 3 shows NMI as a function of rr (the network information measure) under the two covariate designs. Once again, all methods except BCDC were given the true number of communities. The NMI values obtained under the proposed model are generally higher than those of the other models with a slightly larger variance, which is likely due to the additional uncertainty in estimating the number of the communities.

Refer to caption
Refer to caption
Figure 3: NMI results for five different methods on a 3-block (left) and 2-block (right) SBM with categorical data.

3.3 Mix of continuous and discrete covariates

Here, we perform a simulation for larger networks with more communities and a mix of continuous and categorical variables. We let the number of nodes nn vary from 300 to 1000 and set K=n/50K=n/50 communities. The features are chosen so that neither perfectly separates the clusters, but both are informative. In particular, we take

x1i|ziN(2(zimod2)1,1),\displaystyle x_{1i}\,|\,z_{i}\sim N\bigl{(}2(z_{i}\bmod 2)-1,1\bigr{)}\enskip,

i.e. x1iN(1,1)x_{1i}\sim N(1,1) for zi{2,4,}z_{i}\in\{2,4,\ldots\} and x1iN(1,1)x_{1i}\sim N(-1,1) for zi{1,3,}z_{i}\in\{1,3,\ldots\}. We also take x2i=1x_{2i}=1 for zi{1,2,K/2}z_{i}\in\{1,2,\ldots K/2\} and x2i=2x_{2i}=2 for zi{K/2+1,,K}z_{i}\in\{K/2+1,\ldots,K\}. As before, we use an SBM with connectivity (22), setting p=0.3p=0.3 and r=0.35r=0.35.

The results are shown in Figure 4. We find that BCDC is competitive with the other methods when n600n\leq 600, but is superior for larger networks with n>600n>600. We also show in Figure 4 the run time for each method. We see that BCDC scales linearly with nn and is faster than CASC for all of the networks.

Refer to caption
Refer to caption
Figure 4: NMI (left) and run time (right) for five different methods when the network has a mix of continuous and discrete covariates.

3.4 Sparse networks and high-dimensional features

We next consider a setting from Yan and Sarkar (2021), who proposed a covariate-regularized procedure for community detection in sparse graphs. This allows us to explore whether our model works for sparse networks, as well as networks with high-dimensional features. We consider the exact simulation setting as in Yan and Sarkar (2021), in which the networks are generated from a 3-block SBM on 800 nodes with block-size ratios 3:4:53:4:5. The true connectivity matrix is

B=0.01[1.61.20.161.21.60.020.160.021.2],B=0.01\begin{bmatrix}1.6&1.2&0.16\\ 1.2&1.6&0.02\\ 0.16&0.02&1.2\\ \end{bmatrix}\enskip,

leading to a very sparse network, with expected average degree 5.8\approx 5.8. The covariates are generated from 100-dimensional Gaussian distributions N(μzi,I100)N(\mu_{z_{i}},I_{100}), with centers that are only non-zero on the first two dimensions:

μ1=(0,2,𝟎98),μ2=(1,0.8,𝟎98),μ3=(1,0.8,𝟎98).\displaystyle\mu_{1}=(0,2,\bm{0}_{98}),\quad\mu_{2}=(-1,-0.8,\bm{0}_{98}),\quad\mu_{3}=(1,-0.8,\bm{0}_{98})\enskip.

In this setting, it is difficult to distinguish clusters 1 and 2 using the network information alone, and clusters 2 and 3 based on nodal covariates alone.

This experiment was repeated 100 times for independently generated samples. In each replicate, we ran the chain for 1,000 iterations. The results are shown in Figure 5. Note that our method consistently achieves a better NMI than all of the other methods. Although we did not carry out a direct comparison with the method from Yan and Sarkar (2021) since their work focuses on regularizing high-dimensional features, our NMI results are stable and seem to be higher than those reported in Yan and Sarkar (2021). Importantly, we also see in Figure 5 that BCDC is only slightly slower than all of the methods that use only the network or only the covariates, but is considerably faster than CASC and CASCORE, which also use both the network and the nodal information. This illustrates the disadvantage of CASC for larger sparse networks and highlights the efficiency of our MCMC algorithm. That CASC slows down for larger networks can be attributed to the addition of the dense τXXT\tau XX^{T} to the sparse Laplacian LL, resulting in an overall dense similarity matrix LxL_{x}.

Refer to caption
Refer to caption
Figure 5: Boxplots of NMI (left) and run time (right) for five different methods when the network is sparse and the covariates are high-dimensional.

3.5 Homophily

Finally, we consider a network model with homophily, which is the tendency for nodes to be connected when they share a nodal feature. For this, we sample a categorical covariate xix_{i} with two levels, and let

(gij=1|zi,zj,xi,xj)=Pzizj+β𝟏{xi=xj},\mathbb{P}(g_{ij}=1\,|\,z_{i},z_{j},x_{i},x_{j})=P_{z_{i}z_{j}}+\beta\bm{1}\{x_{i}=x_{j}\}\enskip,

where PzizjP_{z_{i}z_{j}} is an SBM with connectivity (22), setting p=0.3p=0.3 and r=0.7r=0.7. This creates 2K2K communities, and separates the effect of community structure from the effect of node-level covariates. We take K=3K=3 and vary β\beta in [0.2,0.2][-0.2,0.2]. Note that when β>0\beta>0, we say the nodes exhibit positive homophily. We run this for n=600n=600 and n=1200n=1200.

Refer to caption
Refer to caption
Figure 6: NMI results on a 3-block SBM with a homophily effect for n=600n=600 (left) and n=1200n=1200 (right).

The results are show in Figure 6. We find that BCDC has superior performance to the other methods even when they are provided the true number (2K2K) of communities. The performance increases as the homophily effect increases in magnitude, which we should expect because the homophily effect is an informative covariate that is not included with BSBM.

4 Real data analysis

In this section, we apply our model to the same two datasets from Yan and Sarkar (2021), a network representing Mexican political elites and a network representing the Weddell Sea ecosystem. We compare our the results from our models against several methods that use only the network, only the covariates, and both the network and covariates.

4.1 Performance measures

In addition to computing the NMI with the (alleged) ground truth labels, it is also helpful to compare the performance using the Bayesian information criterion (BIC) based on an SBM likelihood conditional on the labels. This is especially important because, unlike in the simulations, the “true” clusters are exogenously specified. The (conditional) BIC is defined as the log-marginal likelihood muliplied by 2-2. That is,

BIC(𝒛)\displaystyle\text{BIC}(\bm{z}) =2logp(A𝜼,𝝅,𝒛)𝑑𝜼𝑑𝝅\displaystyle=-2\log\int p(A\mid\bm{\eta},\bm{\pi},\bm{z})\,d\bm{\eta}\,d\bm{\pi} (23)
2logp(A𝜼^,𝝅^,𝒛)+c(K)log(n2),\displaystyle\approx-2\log p(A\mid\bm{\hat{\eta}},\bm{\hat{\pi}},\bm{z})+c(K)\log\binom{n}{2}\enskip, (24)

where KK is the number of communities in 𝒛\bm{z}, c(K)=12K(K+1)+(K1)c(K)=\frac{1}{2}K(K+1)+(K-1) is the degrees of freedom in the parameters (𝜼,𝝅)(\bm{\eta},\bm{\pi}), 𝝅\bm{\pi} is the label prior, and (𝜼^,𝝅^)(\bm{\hat{\eta}},\bm{\hat{\pi}}) is the maximum likelihood estimator of those parameters, i.e., the maximizer of (𝜼,𝝅)p(A𝜼,𝝅,𝒛)(\bm{\eta},\bm{\pi})~{}\mapsto~{}p(A~{}\mid~{}\bm{\eta},\bm{\pi},\bm{z}). We assume a uniform prior over 𝜼\bm{\eta} and 𝝅\bm{\pi}. Note that (24) is the well-known approximation to the BIC (Schwarz, 1978; Konishi and Kitagawa, 2008) and it shows the usefulness of BIC(𝒛)\text{BIC}(\bm{z}) as a measure of performance for real networks: Due to the presence of the complexity term c(K)log(n2)\approx c(K)\log(n^{2}), we get a good balance of the model fit and the number of communities. Label vectors 𝒛\bm{z} with smaller BIC(𝒛)\text{BIC}(\bm{z}) are thus more desirable from a block modeling standpoint, regardless of their relation to the ground truth.

We have

p(A𝜼,𝝅,𝒛)=kηkMk(1ηk)NkMkkπknk(𝒛),\displaystyle p(A\mid\bm{\eta},\bm{\pi},\bm{z})=\prod_{k\leq\ell}\eta_{k\ell}^{M_{k\ell}}(1-\eta_{k\ell})^{N_{k\ell}-M_{k\ell}}\prod_{k}\pi_{k}^{n_{k}(\bm{z})}\enskip,

where nk(𝒛)=i1{zi=k}n_{k}(\bm{z})=\sum_{i}1\{z_{i}=k\}, and MkM_{k\ell} and NkN_{k\ell} are as in (21). Hence, the exact BIC in our setting is

BIC(𝒛)=2[klogB(Mk+1,NkMk+1)+log𝑩(𝒏(z)+𝟏K)],\displaystyle\text{BIC}(\bm{z})=-2\Bigl{[}\sum_{k\leq\ell}\log B(M_{k\ell}+1,N_{k\ell}-M_{k\ell}+1)+\log\bm{B}(\bm{n}(z)+\bm{1}_{K})\Bigr{]}\enskip,

where 𝒏(𝒛)=(nk(𝒛))\bm{n}(\bm{z})=(n_{k}(\bm{z})) and 𝑩()\bm{B}(\cdot) is the multivariate Beta function.

4.2 Mexican Political Elites

The first dataset we consider involves Mexican political elites (Gil-Mendieta and Schmidt, 1996). In this network, the n=35n=35 vertices represent Mexican presidents and their close collaborators, and the 117 edges represent significant political, kinship, friendship, or business ties among them. The ground truth is a classification of the politicians according to their professional background: military and civilians. The covariate we include is the number of years since 1990 that the actor first got a significant governmental position. Figure 7 reveals that this covariate has some discriminatory power in the cluster labels. This is due to the fact that after the Mexican revolution at the beginning of the twentieth century, the political elite was dominated by the military, and later the civilians gradually succeeded the power.

Refer to caption
Figure 7: Node feature for the Mexican political network, which is the number of years since 1990 that the actor first got a significant governmental position.

Table 1 contains the NMI results of our method compared with the same comparison methods in the simulation section. We see that our method achieves the best results, and we visualize our estimated clusters in the network compared with the true labels in Figure 8. Again, for the other methods, we assume the knowledge of the true number of clusters while ours learns the number of the clusters via posterior inference for NMI comparisons.

Refer to caption
Refer to caption
Figure 8: Mexican political network, colored by true (left) and estimated (right) clusters.

As already pointed out in Yan and Sarkar (2021), node 35 has exactly one connection to each of the military and civilian groups, but obtained a governmental position in the 90s, which greatly hinted at a civilian background. By using the covariate, our method accurately captures this label. On the other hand, node 9 seized power in 1940 when the government was almost equally represented by civilian and military politicians, which makes detecting his group difficult, but has more edges to the military group than the civilian group. In this case, our method correctly assigns the military label to it by considering the graph structure.

Dataset bcdc casc cascore kk-means sc bsbm
Mexican politicians 0.43 0.37 0.28 0.26 0.37 0.10
Weddell Sea 0.44 0.25 0.15 0.35 0.33 0.23
Table 1: NMI results on the two real datasets.

We also notice that node 1 has five connections to the military and only one connection to the civilian, and node 1 seized power in 1911. Similarly for node 7, which has five connections to the military and three connections to the civilian and seized power in 1928. For these nodes, both the network and the covariates strongly indicate a closer relationship to the military, which is what our method assigns despite the true label showing civilian. Finally, our method assigns node 12 to its own cluster. This is likely because this is the highest-degree node with 5 military connections and 12 civilian connections. However, in researching node 12, we discovered that Miguel Alemán Valdés was the first civilian president after several military presidents, which suggests that there may have been a labeling error in the original publication of this dataset from Gil-Mendieta and Schmidt (1996). This could explain why our method has the best BIC, even better than the “true” labels, in Table 2.

Dataset true bcdc casc cascore kk-means sc bsbm
Mexican politicians 636 586 587 607 626 587 598
Weddell Sea 138k 33k 124k 119k 144k 100k 71k
Table 2: BIC results on the two real datasets.

4.3 Weddell Sea Ecosystem

The second dataset we consider is a predator-prey, directed network representing the marine food web of the Weddell Sea off of the Antarctic Peninsula, which was collected by Jacob et al. (2011). Since ecosystems are complex, interconnected environments, network analyses have emerged as a popular technique for untangling these connections. The Weddell Sea network has 487 nodes that signify different marine species, and there is a link between nodes ii and jj if species ii (predator) feeds on species jj (prey). Following Yan and Sarkar (2021), we construct a binary, undirected network from this directed network in which Aij=1A_{ij}=1 if there are at least 5 common prey between species ii and jj, and Aij=0A_{ij}=0 otherwise. The network is shown in Figure 9.

Refer to caption
Refer to caption
Figure 9: Visualization of the adjacency matrix for the Weddell Sea network clustered by the true feeding type (left) and estimated clusters (right). In each case, the rows and columns of the matrix are permutated so that nodes in the same cluster form contiguous blocks. Clusters on the right are ordered according to their size.

In Jacob et al. (2011), the authors analyze the relationship between the body size of each species and its feeding type: primary producer, herbivorous/detrivorous, detrivorous, carnivorous, carnivorous/necrovorous, and omnivorous. Figure 1 shows these body sizes grouped by feeding type, where, again following Yan and Sarkar (2021), we group detrivorous, carnivorous, carnivorous/necrovorous as “Carnivore” to obtain four groups. The adjacency matrix in Figure 9 (left) is also sorted by these groups. The authors of Jacob et al. (2011) found body size to be positively correlated with trophic level, but noted that “predators on intermediate trophic levels do not necessarily feed on smaller or prey similar in size but depending on their foraging strategy have a wider prey size range available.” Therefore, body size is insufficient on its own to distinguish the groups, and it would be preferable to consider the interconnectedness of the food web when tasked with clustering the species.

Table 1 shows that BCDC provides the best clustering results compared to the other methods. Therefore, using all of the available information provides an improvement in clustering accuracy over the use of just the network structure or the nodal information. As before, BCDC is the only method that did not know that there are four “true” groups. Interestingly, BCDC estimates many more clusters – 21 in total – which may explain its higher NMI, since we see qualitatively in Figure 9 (right) a more refined block structure. This is quantified and corroborated through BIC in Table 2, which again shows our method outperforms even the “true” clusters. All of this suggests there may be distinct sub-blocks within the Herbivore, Carnivore, and Omnivore classes.

5 Discussion

In this work, we proposed a Bayesian model for community detection in networks with covariates in which both the network and node features of the network are jointly utilized for estimating community structure. In particular, the contribution of nodal information is explicitly modeled in the prior distribution for the community labels via a covariate-dependent random partition prior. We proposed efficient MCMC algorithms for sampling the posterior distributions of all the parameters including the community labels and the number of the communities. Numerical studies demonstrated the overall superior performance of our model over many of the existing methods.

Compared to an almost exclusive literature of frequentist methods, our work is among the first in proposing a Bayesian approach for tackling the problem, which confers some notable advantages in terms of uncertainty quantification, as well as estimating all the model parameters. Notably, unlike the other methods in the literature, our model estimates the number of communities via posterior inference without any knowledge or prior information on the true number. Future work will be devoted to developing Bayesian models for community detection in degree-corrected SBMs and dynamic network models.

We can also easily extend our model to a partially-observed SBM in the spirit of Zhou (2015). Specifically, we can modify (12) to

P(A|𝜼,𝒛)=1i<jn[ηzi,zjAij(1ηzi,zj)1Aij]mij,\displaystyle P(A\,|\,\bm{\eta},\bm{z})=\prod_{1\leq i<j\leq n}\Bigl{[}\eta_{z_{i},z_{j}}^{A_{ij}}(1-\eta_{z_{i},z_{j}})^{1-A_{ij}}\Bigr{]}^{m_{ij}}\enskip,

where 𝒎=(mij){0,1}n×n\bm{m}=(m_{ij})\in\{0,1\}^{n\times n} is a (symmetric) observation mask, with mij=1m_{ij}=1 for the observed edges. This only effects sampling 𝒛\bm{z} and β\beta through the modified counts

Oi=j:jimijAij1{zj=},ni=j:jimij1{zj=},\displaystyle O_{i\ell}=\sum_{j:j\neq i}m_{ij}A_{ij}1\{z_{j}=\ell\},\quad n_{i\ell}=\sum_{j:j\neq i}m_{ij}1\{z_{j}=\ell\}\enskip,

replacing (15), and

Mk=(i,j)ΓkmijAij1{zi=k,zj=},Nk=(i,j)Γkmij1{zi=k,zj=},\displaystyle M_{k\ell}=\sum_{(i,j)\in\Gamma_{k\ell}}m_{ij}A_{ij}1\{z_{i}=k,z_{j}=\ell\},\quad N_{k\ell}=\sum_{(i,j)\in\Gamma_{k\ell}}m_{ij}1\{z_{i}=k,z_{j}=\ell\}\enskip,

replacing (21). This allows our model to also predict missing edges.

Finally, it may be of interest to test whether there is an association between the node covariates and inferred community structure. One approach to this is with Bayes factors comparing models with and without covariates, using, for example, the approach in Legramanti et al. (2020) for testing partition structures in SBMs. While this is a principled Bayesian approach, it only tests whether the set of covariates provides a more parsimonious clustering than without the covariates rather than identifying which covariates are significant. One idea for testing individual covariate significance is to test for a difference between the posterior distributions of the cluster centers 𝝃\bm{\xi} implied by 𝒛\bm{z}. Note that this corresponds to testing whether the priors ν\nu on auxiliary probability distributions qq have overlapping variances, but we leave a rigorous treatment of this idea to future work.

References

  • Airoldi et al. (2008) Airoldi, E. M., Blei, D., Fienberg, S., and Xing, E. (2008). “Mixed membership stochastic blockmodels.” Advances in neural information processing systems, 21.
  • Amini and Levina (2018) Amini, A. A. and Levina, E. (2018). “On semidefinite relaxations for the block model.” The Annals of Statistics, 46(1): 149 – 179.
    URL https://doi.org/10.1214/17-AOS1545
  • Amini et al. (2019) Amini, A. A., Paez, M. S., and Lin, L. (2019). “Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks.” arXiv e-prints, arXiv:1904.05330.
  • Ball et al. (2011) Ball, B., Karrer, B., and Newman, M. (2011). “Efficient and principled method for detecting communities in networks.” Physical Review E, 84(3): 036103.
  • Bickel and Chen (2009) Bickel, P. J. and Chen, A. (2009). “A nonparametric view of network models and Newman Girvan and other modularities.” Proceedings of the National Academy of Sciences of the Unites States of America, 106(50): 21068–21073.
  • Binkiewicz et al. (2017) Binkiewicz, N., Vogelstein, J. T., and Rohe, K. (2017). “Covariate-assisted spectral clustering.” Biometrika, 104(2): 361–377.
  • Erdős and Rényi (1959) Erdős, P. and Rényi, A. (1959). “On Random Graphs I.” Publicationes Mathematicae (Debrecen), 6: 290–297.
  • Ferguson (1973) Ferguson, T. S. (1973). “A Bayesian analysis of some nonparametric problems.” The annals of statistics, 209–230.
  • Gil-Mendieta and Schmidt (1996) Gil-Mendieta, J. and Schmidt, S. (1996). “The political network in Mexico.” Social Networks, 18(4): 355–381.
  • Hall (1998) Hall, M. A. (1998). “Correlation-based feature subset selection for machine learning.” Thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy at the University of Waikato.
  • Holland et al. (1983) Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983). “Stochastic blockmodels: First steps.” Social networks, 5(2): 109–137.
  • Hu and Wang (2022) Hu, Y. and Wang, W. (2022). “Covariate-Assisted Community Detection on Sparse Networks.” arXiv.
    URL https://arxiv.org/abs/2208.00257
  • Jacob et al. (2011) Jacob, U., Thierry, A., Brose, U., Arntz, W. E., Berg, S., Brey, T., Fetzer, I., Jonsson, T., Mintenbeck, K., Möllmann, C., et al. (2011). “The role of body size in complex food webs: A cold case.” Advances in ecological research, 45: 181–223.
  • Karrer and Newman (2011) Karrer, B. and Newman, M. E. J. (2011). “Stochastic blockmodels and community structure in networks.” Phys. Rev. E, 83: 016107.
    URL http://link.aps.org/doi/10.1103/PhysRevE.83.016107
  • Kemp et al. (2006) Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., and Ueda, N. (2006). “Learning Systems of Concepts with an Infinite Relational Model.” In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI’06, 381–388. AAAI Press.
  • Kim et al. (2017) Kim, C., Bandeira, A. S., and Goemans, M. X. (2017). “Community detection in hypergraphs, spiked tensor models, and sum-of-squares.” In 2017 International Conference on Sampling Theory and Applications (SampTA), 124–128. IEEE.
  • Kim et al. (2012) Kim, D. I., Hughes, M. C., and Sudderth, E. B. (2012). “The Nonparametric Metadata Dependent Relational Model.” In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress.
    URL http://icml.cc/2012/papers/771.pdf
  • Kolaczyk (2009) Kolaczyk, E. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Verlag.
  • Kolaczyk et al. (2020) Kolaczyk, E. D., Lin, L., Rosenberg, S., Walters, J., and Xu, J. (2020). “Averages of unlabeled networks: Geometric characterization and asymptotic behavior.” The Annals of Statistics, 48(1): 514 – 538.
    URL https://doi.org/10.1214/19-AOS1820
  • Konishi and Kitagawa (2008) Konishi, S. and Kitagawa, G. (2008). “Information criteria and statistical modeling.”
  • Legramanti et al. (2020) Legramanti, S., Rigon, T., and Durante, D. (2020). “Bayesian testing for exogenous partition structures in stochastic block models.” Sankhya A, 1–19.
  • Lei et al. (2020) Lei, J., Chen, K., and Lynch, B. (2020). “Consistent community detection in multi-layer network data.” Biometrika, 107(1): 61–73.
  • Lovász (2012) Lovász, L. (2012). Large Networks and Graph Limits, volume 60. American Mathematical Society Providence.
  • Luxburg (2007) Luxburg, U. V. (2007). “A tutorial on spectral clustering.” Statistics and Computing, 17(4): 395–416.
  • Mørup and Schmidt (2012) Mørup, M. and Schmidt, M. N. (2012). “Bayesian Community Detection.” Neural Comput., 24(9): 2434–2456.
    URL http://dx.doi.org/10.1162/NECO_a_00314
  • Müller and Quintana (2010) Müller, P. and Quintana, F. (2010). “Random partition models with regression on covariates.” Journal of statistical planning and inference, 140(10): 2801–2808.
  • Müller et al. (2011) Müller, P., Quintana, F., and Rosner, G. L. (2011). “A product partition model with regression on covariates.” Journal of Computational and Graphical Statistics, 20(1): 260–278.
  • Newman and Clauset (2016) Newman, M. E. and Clauset, A. (2016). “Structure and inference in annotated networks.” Nature communications, 7(1): 1–11.
  • Newman (2006) Newman, M. E. J. (2006). “Modularity and community structure in networks.” Proceedings of the National Academy of Sciences, 103(23): 8577–8582.
    URL http://www.pnas.org/content/103/23/8577.abstract
  • Park and Dunson (2010) Park, J.-H. and Dunson, D. B. (2010). “Bayesian generalized product partition model.” Statistica Sinica, 1203–1226.
  • Ren et al. (2011) Ren, L., Du, L., Carin, L., and Dunson, D. (2011). “Logistic Stick-Breaking Process.” J. Mach. Learn. Res., 12(null): 203–239.
  • Rohe et al. (2011) Rohe, K., Chatterjee, S., and Yu, B. (2011). “Spectral clustering and the high-dimensional stochastic block model.” Annals of Statistics, 39: 1878–1915.
  • Schwarz (1978) Schwarz, G. (1978). “Estimating the dimension of a model.” The annals of statistics, 461–464.
  • Sethuraman (1994) Sethuraman, J. (1994). “A CONSTRUCTIVE DEFINITION OF DIRICHLET PRIORS.” Statistica Sinica, 4(2): 639–650.
  • Shen et al. (2022) Shen, L., Amini, A. A., Josephs, N., and Lin, L. (2022). “BCDC model for community detection with node covariates.” https://github.com/aaamini/bcdc.
  • Sweet (2015) Sweet, T. M. (2015). “Incorporating Covariates Into Stochastic Blockmodels.” Journal of Educational and Behavioral Statistics, 40(6): 635–664.
    URL https://ideas.repec.org/a/sae/jedbes/v40y2015i6p635-664.html
  • Tallberg (2004) Tallberg, C. (2004). “A Bayesian Approach to Modeling Stochastic Blockstructures with Covariates.” The Journal of Mathematical Sociology, 29(1): 1–23.
    URL https://doi.org/10.1080/00222500590889703
  • Teukolsky et al. (1992) Teukolsky, S. A., Flannery, B. P., Press, W., and Vetterling, W. (1992). “Numerical recipes in C.” SMR, 693(1): 59–70.
  • Wang and Zeng (2019) Wang, M. and Zeng, Y. (2019). “Multiway clustering via tensor block models.” Advances in neural information processing systems, 32.
  • Weng and Feng (2022) Weng, H. and Feng, Y. (2022). “Community detection with nodal information: Likelihood and its variational approximation.” Stat, 11.
  • Wolfe and Olhede (2013) Wolfe, P. J. and Olhede, S. C. (2013). “Nonparametric graphon estimation.” ArXiv e-prints.
  • Yan and Sarkar (2021) Yan, B. and Sarkar, P. (2021). “Covariate Regularized Community Detection in Sparse Graphs.” Journal of the American Statistical Association, 116(534): 734–745.
    URL https://doi.org/10.1080/01621459.2019.1706541
  • Zhang et al. (2019) Zhang, Y., Chen, K., Sampson, A., Hwang, K., and Luna, B. (2019). “covariate Adjusted Stochastic Block Model.” Journal of Computational and Graphical Statistics, 28(2): 362–373.
    URL https://doi.org/10.1080/10618600.2018.1530117
  • Zhang et al. (2016) Zhang, Y., Levina, E., Zhu, J., et al. (2016). “Community detection in networks with node features.” Electronic Journal of Statistics, 10(2): 3153–3178.
  • Zhao et al. (2017) Zhao, H., Du, L., and Buntine, W. (2017). “Leveraging Node Attributes for Incomplete Relational Data.” In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 4072–4081. PMLR.
    URL https://proceedings.mlr.press/v70/zhao17a.html
  • Zhou (2015) Zhou, M. (2015). “Infinite edge partition models for overlapping community detection and link prediction.” In Artificial intelligence and statistics, 1135–1143. PMLR.
{acks}

[Acknowledgments] We are very grateful to the Editor, the Associate Editor and two reviewers for their valuable comments. LS and LL were supported by NSF grants DMS 2113642 and DMS 1654579. AA would like to acknowledge the support of NSF grant DMS-1945667. NJ was partially supported by NIH/NICHD grant 1DP2HD091799-01.