
Generative Hypergraph Clustering: From Blockmodels to Modularity

Philip S. Chodrow, Nate Veldt, and Austin R. Benson

Hypergraphs are a natural modeling paradigm for a wide range of complex relational systems, with nodes representing system components and hyperedges representing multiway interactions. A standard analysis task is to identify clusters of closely related or densely interconnected nodes. In the special case of graphs, in which edges connect exactly two nodes at a time, there are a number of probabilistic generative models and associated inference algorithms for clustering. Many of these are based on variants of the stochastic blockmodel, a random graph with flexible cluster structure. However, there are few models and algorithms for hypergraph clustering. Here, we propose a Poisson degree-corrected hypergraph stochastic blockmodel (DCHSBM), an expressive generative model of clustered hypergraphs with heterogeneous node degrees and edge sizes. Approximate maximum-likelihood inference in the DCHSBM naturally leads to a clustering objective that generalizes the popular modularity objective for graphs. We derive a general Louvain-type algorithm for this objective, as well as a faster, specialized “All-Or-Nothing” (AON) variant in which edges are expected to lie fully within clusters. This special case encompasses a recent proposal for modularity in hypergraphs, while also incorporating flexible resolution and edge-size parameters. We show that AON hypergraph Louvain is highly scalable, illustrating this with an experiment on a synthetic hypergraph of one million nodes. We also demonstrate through synthetic experiments that the detectability regimes for hypergraph community detection differ from those of methods based on dyadic graph projections. In particular, there are regimes in which hypergraph methods can recover planted partitions even though graph-based methods necessarily fail. We use our generative model to analyze different patterns of higher-order structure in school contact networks, U.S. congressional bill cosponsorship, U.S. congressional committees, product categories in co-purchasing behavior, and hotel locations from web browsing sessions, finding interpretable higher-order structure in each. We then study the behavior of our AON hypergraph Louvain algorithm, finding that it is able to recover ground-truth clusters in empirical data sets exhibiting the corresponding higher-order structure.

Philip S. Chodrow
Department of Mathematics
University of California, Los Angeles
[email protected]
Nate Veldt
Center for Applied Mathematics
Cornell University
[email protected]
Austin R. Benson
Department of Computer Science
Cornell University
[email protected]

1   Introduction

Graphs are a fundamental abstraction for complex relational systems throughout the sciences jackson-networks-book; Easley2010; newman-networks-book. A graph represents components of a system by a set of nodes, and interactions or relationships among these components using edges that connect pairs of nodes. Much of the structure in complex data, however, involves higher-order interactions and relationships between more than two entities at once Milo-2002-motifs; Benson-2016-hoo; Benson-2018-simplicial; lambiotte2019networks; battiston2020networks; torres2020and. Hypergraphs are now a burgeoning paradigm for modeling these and many other systems li2018submodular; de2020social; sahasrabuddhe2020modelling; veldt2020minimizing. A hypergraph still represents the components by a set of nodes, but the edges (often called hyperedges) may connect arbitrary numbers of nodes. A graph is a special case of a hypergraph, in which each edge connects exactly two nodes.

Graph clustering is a fundamental task in network science that seeks to describe large graphs by dividing their nodes into closely related or interconnected groups (also called clusters or communities) porter2009communities; Benson-2016-hoo; fortunato2016community; li2018submodular. Clustering methods for hypergraphs have applications in parallel computation Ballard:2016:HPS:3012407.3015144; kabiljo2017social, circuit design karypis1999multilevel, image segmentation Agarwal2005beyond, semisupervised learning Zhou2006learning; yadati2019hypergcn, and higher-order network analysis of gene expression tian2009gene, food webs li2017inhomogeneous, and online social communities neubauer2009towards; tsourakakis2017scalable.

A well-established graph clustering approach is to model the graph as a sample from a probabilistic generative model, in which case the clustering task can be recast as a statistical inference problem nowicki2001estimation; hoff2002latent; airoldi2008mixed; karrer2011stochastic; yang2013overlapping; peixoto2014hierarchical; athreya2017statistical. While generative modeling is a mainstay in graph clustering, generative techniques for hypergraphs are largely lacking. Indeed, while a small number of generative models of clustered hypergraphs have been proposed ghoshdastidar2014consistency; kim2018stochastic; Angelini2016; ke2019community, these models typically generate hypergraphs with edges of only one size. With a recent exception ke2019community, these models also do not model degree heterogeneity between nodes. Heterogeneity in edge size and node degree are both key features of empirical data Benson-2018-simplicial, and their omission limits the applicability of many of these models for practical data analysis. An alternative to generative hypergraph modeling is to transform the hypergraph into a dyadic graph via clique expansion, where a dyadic edge connects any pair of nodes that appear together in some hyperedge Zhou2006learning; Benson-2016-hoo. While this enables the use of a wide array of existing models and algorithms for graphs, the higher-order structure is lost chodrow2019configuration, and generative models of the resulting dyadic graph may rely on explicitly violated independence assumptions. Recently, nongenerative approaches based on the popular modularity clustering objective for graphs PhysRevE.69.026113 have been proposed for hypergraphs kumar2018; Kami2018; veldt2020parameterized, although their lack of connection to a generative model limits their interpretability.

Another approach to generative clustering is to use the representation of a hypergraph as a bipartite graph, and apply a generative model (e.g. larremore2014efficiently; gerlach2018network; yen2020community) to the latter representation. This approach, while appropriate in many data sets, involves a strong assumption: the memberships of any two nodes in a given hyperedge are independent, conditional on the model parameters. This assumption is natural for certain classes of data. For example, consider an event co-attendance network, with nodes representing music enthusiasts and hyperedges representing concerts. Node membership in a hyperedge corresponds to attendance at the specified event. To a reasonable approximation, the decision of two fans to attend a given concert may indeed be independent, conditioned on the popularity of the performers, the location of the venue, and so on. In other data sets, however, the conditional independence assumption is explicitly violated. Multiway social interaction networks give one important class of examples. Interactions such as gossip, for instance, normally take place only between trusted individuals. The presence of a single uninvited outsider may entirely prevent the interaction from taking place. The “all-or-nothing” structure of such interactions is an important violation of the conditional independence assumptions made by most bipartite generative models. These examples highlight that the task of matching assumptions to higher-order data is an ongoing challenge, for which we benefit from a diversity of distinct tools.

Here, we propose a generative approach to hypergraph clustering based on a degree-corrected hypergraph stochastic blockmodel (DCHSBM). This model generates clustered hypergraphs with heterogeneous degree distributions and hyperedge sizes. We outline an approximate coordinate-ascent maximum-likelihood estimation scheme for fitting this model to hypergraph data, and show that one stage of this scheme generalizes the well-studied modularity objective for graphs. We derive accompanying Louvain algorithms for this class of modularity-like objectives, which are highly scalable in an important special case. We show computationally that hypergraph clustering methods are able to detect planted clusters in regimes in which graph-based methods necessarily fail due to known theoretical limits. We also show that, in data sets with appropriately matched higher-order structure, our generative hypergraph techniques are able to recover clusters correlated to metadata at higher rates than graph-based techniques. Our results highlight the importance of matching generative models to data sets, and point toward a number of directions for further work in higher-order network science.

2   The Degree-Corrected Hypergraph Stochastic Blockmodel

The degree-corrected stochastic blockmodel is a generative model of graphs with both community structure and heterogeneous degree sequences karrer2011stochastic. We now extend this model to the case of hypergraphs.

For our model, let $n$ be the number of nodes in a hypergraph. Each node $i$ is assigned to one of $\bar{\ell}$ groups. We let $z_i \in [\bar{\ell}] = \{1, 2, \ldots, \bar{\ell}\}$ denote the group assignment of node $i$, and collect these assignments in a vector $\mathbf{z} \in [\bar{\ell}]^n$. (Here and elsewhere, we use the notation $[r] = \{1, \ldots, r\}$.) As in the dyadic degree-corrected SBM, each node $i$ is assigned a parameter $\theta_i$ governing its degree, and we collect these parameters in a vector $\bm{\theta} \in \mathbb{R}^n$. Let $\mathcal{R}$ represent the set of unordered node tuples, so that each $R \in \mathcal{R}$ is a set of nodes representing the location of a possible hyperedge. (Following the standard choice for the degree-corrected SBM in graphs, we allow $\mathcal{R}$ to include node tuples with repeated nodes.) Let $\mathbf{z}_R$ denote the vector of cluster labels for nodes in a given tuple $R$, and $\bm{\theta}_R$ the vector of degree parameters.

We use an affinity function $\Omega$ to control the probability of placing a hyperedge at a given node tuple $R$, which depends on the group memberships of the nodes in $R$. Formally, $\Omega$ maps the group assignments $\mathbf{z}_R$ to a nonnegative number. If $\Omega(\mathbf{z}_R)$ is large, there is a higher probability that a hyperedge forms between the nodes in $R$. In our model, the number of hyperedges placed at $R \in \mathcal{R}$ is distributed as $a_R \sim \text{Poisson}\left(b_R \pi(\bm{\theta}_R) \Omega(\mathbf{z}_R)\right)$, where $b_R$ denotes the number of distinct ways to order the nodes of $R$ and $\pi(\bm{\theta}_R) = \prod_{i \in R} \theta_i$ is the product of degree parameters. The probability of realizing a given value $a_R$ is then

$$\mathbb{P}(a_R \mid \mathbf{z}, \Omega, \bm{\theta}) = \frac{e^{-b_R \pi(\bm{\theta}_R) \Omega(\mathbf{z}_R)} \left(b_R \pi(\bm{\theta}_R) \Omega(\mathbf{z}_R)\right)^{a_R}}{a_R!}\;. \qquad (1)$$

This edge generation process has the following intuitive interpretation: for each of the $b_R$ possible orderings of nodes in $R$, we attempt to place a $\text{Poisson}\left(\pi(\bm{\theta}_R) \Omega(\mathbf{z}_R)\right)$ number of hyperedges on this tuple. The result is a weighted hyperedge on the unordered tuple $R$, whose weight can be any nonnegative integer. This is a helpful modeling feature, as many empirical hypergraphs contain multiple hyperedges between the same set of nodes. (Even in hypergraph data sets where we only know the presence or absence of hyperedges, but no weights, the Poisson-based model serves as a computationally convenient approximation to a Bernoulli-based model.) The probability of realizing a given hyperedge set $\mathbf{A} = (a_R)_{R \in \mathcal{R}}$ is then just the product of probabilities over each $R \in \mathcal{R}$.
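To make the sampling process concrete, here is a minimal sketch of edge generation under this model. The function names, and the restriction to tuples of distinct nodes, are illustrative simplifications; enumerating all $k$-tuples is feasible only for small $n$ and $k$.

```python
import itertools
import math
import numpy as np

def sample_dchsbm_edges(z, theta, omega, k, rng=None):
    """Sample weighted size-k hyperedges from the Poisson DCHSBM.

    z     : sequence of group labels z_i, one per node
    theta : sequence of degree parameters theta_i, one per node
    omega : function mapping a tuple of group labels to a nonnegative rate
    Returns a dict mapping node tuples R to positive integer weights a_R.
    """
    rng = rng or np.random.default_rng()
    edges = {}
    # For simplicity we enumerate unordered tuples of k *distinct* nodes;
    # the model in the text also allows tuples with repeated nodes.
    for R in itertools.combinations(range(len(z)), k):
        b_R = math.factorial(k)                  # orderings of k distinct nodes
        pi_R = math.prod(theta[i] for i in R)    # product of degree parameters
        a_R = rng.poisson(b_R * pi_R * omega(tuple(z[i] for i in R)))
        if a_R > 0:
            edges[R] = a_R                       # weighted hyperedge at R
    return edges

# Tiny example: two groups, an AON-style affinity favoring within-group edges.
z = [0, 0, 0, 1, 1, 1]
theta = [0.2] * 6
omega = lambda labels: 0.5 if len(set(labels)) == 1 else 0.01
print(sample_dchsbm_edges(z, theta, omega, k=3))
```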

2.1  Estimation of Degree and Affinity Parameters

There are many methods for inference in stochastic blockmodels and their relatives, including variational coordinate ascent airoldi2008mixed, variational belief propagation decelle2011asymptotic; zhang2014scalable, and Markov chain Monte Carlo nowicki2001estimation; peixoto2020merge. We perform approximate maximum-likelihood inference via coordinate ascent. We do so in order to exploit a recent connection between maximum-likelihood inference in stochastic blockmodels and the popular modularity objective for graph clustering newman2016equivalence. Our coordinate ascent framework, in which we alternate between estimating parameters and node labels, is a close relative of expectation-maximization (EM) algorithms for blockmodel inference decelle2011asymptotic. Standard versions of EM construct “soft” clusters, in which each node is given a weighted assignment in every possible cluster. “The” cluster for a given node is often taken to be the cluster in which the node has the largest weight. In contrast, our approach generates “hard” clusters in which each node belongs to exactly one cluster. Profile likelihood methods offer an alternative framework for maximum-likelihood inference in SBMs bickel2009nonparametric, and their development for hypergraphs is another promising avenue of future work.

In the maximum-likelihood framework, we learn estimates $\hat{\mathbf{z}}$ of the node labels, $\hat{\Omega}$ of the affinity function, and $\hat{\bm{\theta}}$ of the degree parameters by solving the optimization problem

$$\hat{\mathbf{z}},\; \hat{\Omega},\; \hat{\bm{\theta}} \equiv \operatorname*{argmax}_{\mathbf{z}, \Omega, \bm{\theta}}\; \mathbb{P}(\mathbf{A} \mid \mathbf{z}, \Omega, \bm{\theta})\;, \qquad (2)$$

where $\mathbf{A}$ is a given data set represented by a collection of (integer-weighted) hyperedges. As usual, it is easier to work with the log-likelihood, which has the same local optima. The log-likelihood is

$$\mathcal{L}(\mathbf{z}, \Omega, \bm{\theta}) = \sum_{R \in \mathcal{R}} \log \mathbb{P}(a_R \mid \mathbf{z}, \Omega, \bm{\theta}) = Q(\mathbf{z}, \Omega, \bm{\theta}) + K(\bm{\theta}) + C\;, \qquad (3)$$

where

$$Q(\mathbf{z}, \Omega, \bm{\theta}) \equiv \sum_{R \in \mathcal{R}} \left[ a_R \log \Omega(\mathbf{z}_R) - b_R \pi(\bm{\theta}_R) \Omega(\mathbf{z}_R) \right] \qquad (4)$$

$$K(\bm{\theta}) \equiv \sum_{R \in \mathcal{R}} a_R \log \pi(\bm{\theta}_R) \qquad (5)$$

$$C \equiv \sum_{R \in \mathcal{R}} \left[ a_R \log b_R - \log a_R! \right]\;. \qquad (6)$$

The first term $Q(\mathbf{z}, \Omega, \bm{\theta})$ is the only part of the log-likelihood that depends on the group assignments $\mathbf{z}$ and the affinity function $\Omega$. The second term depends on $\bm{\theta}$, while the third term depends only on the data $\mathbf{A}$ and can be disregarded for inferential purposes.

In the coordinate ascent approach to maximum likelihood, we alternate between two stages. In the first stage, we assume a current estimate $\hat{\mathbf{z}}$ and obtain new estimates of $\Omega$ and $\bm{\theta}$ by solving

$$\hat{\Omega}, \hat{\bm{\theta}} = \operatorname*{argmax}_{\Omega, \bm{\theta}} \mathcal{L}(\hat{\mathbf{z}}, \Omega, \bm{\theta})\;. \qquad (7)$$

The resulting $\hat{\Omega}, \hat{\bm{\theta}}$ can be viewed as maximum-likelihood estimates, conditioned on the current estimate $\hat{\mathbf{z}}$ of the label vector $\mathbf{z}$. In the second stage, we assume current estimates $\hat{\Omega}$ and $\hat{\bm{\theta}}$ and obtain a new estimate of $\mathbf{z}$ by solving

$$\hat{\mathbf{z}} = \operatorname*{argmax}_{\mathbf{z}} \mathcal{L}(\mathbf{z}, \hat{\Omega}, \hat{\bm{\theta}})\;. \qquad (8)$$

We alternate between these two stages until convergence.

There are several identifiability issues that must be addressed. First, permuting the group labels in $\mathbf{z}$ and $\Omega$ does not alter the value of the likelihood. We therefore impose an arbitrary order on group labels. Second, the number of possible groups $\bar{\ell}$ can in principle be larger than the number of groups present in $\mathbf{z}$. Such a case would correspond to the presence of groups which are statistically possible but empty in the given data realization. While other treatments are possible, we choose to disregard empty groups and treat $\bar{\ell}$ as equal to the number of distinct labels in an estimate of $\mathbf{z}$. A final form of unidentifiability relates to the scales of $\bm{\theta}$ and $\Omega$. For a fixed $\bm{\theta}$ and $\Omega$, we can construct $\bm{\theta}' \neq \bm{\theta}$ and $\Omega' \neq \Omega$ such that $\mathcal{L}(\mathbf{z}, \Omega, \bm{\theta}) = \mathcal{L}(\mathbf{z}, \Omega', \bm{\theta}')$ (LABEL:sec:unidentifiable). To enforce identifiability, we must therefore place a joint normalization condition on either $\bm{\theta}$ or $\Omega$. We choose to constrain $\bm{\theta}$ such that

$$\sum_{i=1}^{n} \theta_i\, \delta(z_i, \ell) = \text{vol}(\ell)\;, \qquad \ell = 1, \ldots, \bar{\ell}\;, \qquad (9)$$

where $\text{vol}(\ell) = \sum_{i=1}^{n} d_i \delta(z_i, \ell)$ and $d_i$ is the (weighted) number of hyperedges in which node $i$ appears. (Here, and throughout the rest of the text, $\delta$ is an indicator function evaluating to 1 if all of its inputs are equal, and to 0 otherwise.)

The usefulness of (9) is that, when $\mathbf{z}$ is known or estimated, the conditional maximum-likelihood estimates $\hat{\bm{\theta}}$ and $\hat{\Omega}$ in (7) take simple, closed forms. First, for a fixed label vector $\mathbf{z}$, when using the normalization (9), the maximum-likelihood estimate for $\bm{\theta}$ is

$$\hat{\bm{\theta}} = \mathbf{d}\;. \qquad (10)$$

(See LABEL:sec:proof_constant.) Second, conditioned on $\mathbf{z}$, if $\Omega$ takes a constant value $\omega$ on some set $Y$ of unordered tuples of labels (in full generality, we can have one such $\omega$ for every possible label arrangement for each hyperedge size in the data; later, we will place natural restrictions on $\Omega$), the maximum-likelihood estimate for $\omega$ is

$$\hat{\omega} = \frac{\sum_{\mathbf{y} \in Y} \sum_{R \in \mathcal{R}} a_R\, \delta(\mathbf{z}_R, \mathbf{y})}{\sum_{\mathbf{y} \in Y} \prod_{y \in \mathbf{y}} \text{vol}(y)}\;. \qquad (11)$$

(See LABEL:sec:omega_estimate.) Although (10) assumes that $\mathbf{z}$ was fixed, it is not necessary to know $\mathbf{z}$ in order to form the estimate $\hat{\bm{\theta}}$. However, forming the estimate $\hat{\omega}$ via (11) requires that we know or estimate $\mathbf{z}$. It is therefore important to remember that $\hat{\omega}$ is not a globally optimal estimate, but rather a locally optimal estimate conditioned on the currently estimated group labels.
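As a concrete illustration, the estimates (10) and (11) can be computed directly from a weighted edge list. The following sketch assumes hyperedges are stored as a dict from node tuples to weights, and that $Y$ is supplied as a set of sorted label tuples on which $\Omega$ is constant; all names are illustrative.

```python
from collections import Counter
import math

def degree_estimates(edges, n):
    """theta-hat = d: weighted number of hyperedges containing each node (Eq. 10)."""
    d = [0.0] * n
    for R, a_R in edges.items():
        for i in R:
            d[i] += a_R
    return d

def cluster_volumes(z, d):
    """vol(ell): total degree of the nodes carrying label ell."""
    vol = Counter()
    for i, ell in enumerate(z):
        vol[ell] += d[i]
    return vol

def omega_estimate(edges, z, Y, vol):
    """Conditional MLE of the constant value of Omega on Y (Eq. 11).

    Y is a set of sorted tuples of group labels; z_R is sorted before the
    membership test so that unordered label arrangements match.
    """
    num = sum(a_R for R, a_R in edges.items()
              if tuple(sorted(z[i] for i in R)) in Y)
    den = sum(math.prod(vol[y] for y in ys) for ys in Y)
    return num / den
```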

The formula (11) could also be inserted directly into the full likelihood maximization problem (2), eliminating the parameters corresponding to $\Omega$ and producing a lower-dimensional profile likelihood, which could then in principle be optimized directly. This approach has been successful for dyadic blockmodels bickel2009nonparametric, and the development of similar methods for hypergraph blockmodels would be of considerable interest. The advantage of our coordinate ascent framework is that we are able to develop fast heuristics for solving (8), by generalizing widely used algorithms for graph clustering (Section 4). Solving problem (7) in our framework can also be interpreted as evaluating the profile likelihood for a fixed cluster vector $\mathbf{z}$, highlighting the relationship between these approaches.

We now turn to the problem of inferring the label vector $\mathbf{z}$. This problem leads naturally to a class of modularity-type objectives for hypergraph clustering.

3   Hypergraph Modularities

Our results from the previous section imply that the degree parameters $\bm{\theta}$ and a piecewise-constant affinity function $\Omega$ can be efficiently estimated in closed form, provided an estimate of $\mathbf{z}$. This provides a solution to the first stage (7) of coordinate ascent. We now discuss the second stage (8). From (3), it suffices to optimize $Q$ with respect to $\mathbf{z}$. To do so, it is helpful to impose some additional structure on $\hat{\Omega}$.

3.1  Symmetric Modularities

We obtain an important class of objective functions by stipulating that $\Omega$ is symmetric with respect to permutations of node labels. In this case, $\Omega(\mathbf{z}_R)$ depends not on the specific labels $\mathbf{z}_R$ in a given node tuple $R$, but only on the number of repetitions of each. Statistically, the corresponding DCHSBM generates hypergraphs in which all groups are statistically identical, conditioned on the degrees of their constituent nodes. Symmetric affinity functions thus give a flexible generalization of the planted-partition stochastic blockmodel jerrum1998metropolis; condon2001algorithms to the setting of hypergraphs.

Define the function $\phi(\mathbf{z}) = \mathbf{p}$, where $p_j$ is the number of entries of $\mathbf{z}$ in the $j$th largest group of $\mathbf{z}$, with ties broken arbitrarily. For example, if $\mathbf{z} = (1, 1, 4, 1, 2, 3, 2)$, then $\mathbf{p} = (3, 2, 1, 1)$. We call $\mathbf{p}$ a partition vector. The symmetry assumption implies that $\Omega$ is a function of $\mathbf{z}_R$ only through $\mathbf{p} = \phi(\mathbf{z}_R)$. Accordingly, we abuse notation by writing $\Omega(\mathbf{p}) \equiv \Omega(\mathbf{z})$ when $\mathbf{p} = \phi(\mathbf{z})$.
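A minimal helper for computing partition vectors, reused in later sketches; the function name is illustrative.

```python
from collections import Counter

def partition_vector(z_R):
    """phi(z_R): group sizes within the tuple, sorted in decreasing order."""
    return tuple(sorted(Counter(z_R).values(), reverse=True))

# Example from the text: z = (1, 1, 4, 1, 2, 3, 2) induces p = (3, 2, 1, 1).
assert partition_vector((1, 1, 4, 1, 2, 3, 2)) == (3, 2, 1, 1)
```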

We now define generalized cuts and volumes corresponding to a possible partition vector $\mathbf{p}$ for tuples of $k$ nodes:

$$\text{cut}_{\mathbf{p}}(\mathbf{z}) \equiv \sum_{R \in \mathcal{R}^k} a_R\, \delta(\mathbf{p}, \phi(\mathbf{z}_R))\;, \qquad (12)$$

$$\text{vol}_{\mathbf{p}}(\mathbf{z}) \equiv \sum_{\mathbf{y} \in [\bar{\ell}]^k} \delta(\mathbf{p}, \phi(\mathbf{y})) \prod_{y \in \mathbf{y}} \text{vol}(y)\;, \qquad (13)$$

where $\mathcal{R}^k$ is the subset of tuples in $\mathcal{R}$ consisting of $k$ nodes. The function $\text{cut}_{\mathbf{p}}(\mathbf{z})$ counts the number of edges that are split by $\mathbf{z}$ into the specified partition $\mathbf{p}$, while the function $\text{vol}_{\mathbf{p}}(\mathbf{z})$ is a sum-product of volumes over all grouping vectors $\mathbf{y}$ that induce partition $\mathbf{p}$. Let $\mathcal{P}$ be the set of partition vectors on sets up to size $\bar{k}$, the maximum size of hyperedges. We show in LABEL:sec:mod_cut_vol_deriv that the symmetric modularity objective can then be written as

$$Q(\mathbf{z}, \Omega, \mathbf{d}) = \sum_{\mathbf{p} \in \mathcal{P}} \left[ \text{cut}_{\mathbf{p}}(\mathbf{z}) \log \Omega(\mathbf{p}) - \text{vol}_{\mathbf{p}}(\mathbf{z})\, \Omega(\mathbf{p}) \right]\;. \qquad (14)$$

For a partition vector $\mathbf{p}$ for tuples of $k$ nodes, direct calculation of $\text{vol}_{\mathbf{p}}(\mathbf{z})$ is a summation over $\bar{\ell}^k$ elements, which can be impractical when either $\bar{\ell}$ or $k$ is large. We show in LABEL:sec:eval_sums, however, that it is possible to evaluate these sums efficiently via a combinatorial identity. We also give a formula for updating the volume terms $\text{vol}_{\mathbf{p}}(\mathbf{z})$ when a candidate labeling is modified.
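For intuition, both quantities can be computed by brute force on small instances, reusing the `partition_vector` helper above. The enumeration in `vol_p` is shown only for illustration; a practical implementation would use the combinatorial identity referenced above rather than iterating over all $\bar{\ell}^k$ label tuples.

```python
import itertools
import math

def cut_p(edges, z, p):
    """cut_p(z): weighted number of size-k edges that z splits into partition p."""
    k = sum(p)
    return sum(a_R for R, a_R in edges.items()
               if len(R) == k and partition_vector([z[i] for i in R]) == p)

def vol_p(vol, labels, p):
    """vol_p(z) by direct enumeration over y in [ell-bar]^k (Eq. 13).

    Feasible only when len(labels) ** sum(p) is small.
    """
    k = sum(p)
    return sum(math.prod(vol[y] for y in ys)
               for ys in itertools.product(labels, repeat=k)
               if partition_vector(ys) == p)
```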

The objective function (14) is related to the multiway hypergraph cut problem studied by veldt2020hypergraph. They formulate the hypergraph cut objective in terms of splitting functions, which associate penalties when edges are split between two or more clusters. One then aims to minimize the sum of penalties subject to constraints that certain nodes must not lie in the same cluster. Symmetric affinity functions in our framework correspond to signature-based splitting functions in their terminology. Table 1 lists four of many families of affinity functions.

All-Or-Nothing (AON): $\Omega(\mathbf{p}) = \omega_{k1}$ if $\lVert\mathbf{p}\rVert_0 = 1$, and $\Omega(\mathbf{p}) = \omega_{k0}$ otherwise
Group Number (GN): $\Omega(\mathbf{p}) = f\left(\lVert\mathbf{p}\rVert_0, k\right)$
Relative Plurality (RP): $\Omega(\mathbf{p}) = g\left(p_1 - p_2, k\right)$
Pairwise (P): $\Omega(\mathbf{p}) = h\left(\sum_{i \neq j} p_i p_j, k\right)$

TABLE 1: Symmetric affinity functions. Throughout, $k = \lVert\mathbf{p}\rVert_1$ is the number of nodes in the partition $\mathbf{p}$, $\omega_{k0}$ and $\omega_{k1}$ are scalars, and $f$, $g$, and $h$ are arbitrary scalar functions.

The All-Or-Nothing affinity function distinguishes only whether or not a given edge is contained entirely within a single cluster. This affinity function is especially important for scalable computation, and we discuss it further below. The Group Number affinity depends only on the number of distinct groups represented in an edge, regardless of the number of incident nodes in each one. Special cases of the Group Number affinity arise frequently in applications. When $f\left(\lVert\mathbf{p}\rVert_0, k\right) = \exp(\lVert\mathbf{p}\rVert_0 - 1)$, the first term of the modularity objective corresponds to a hyperedge cut penalty that is known in the scientific computing literature as “connectivity$-1$” deveci2015hypergraph, the $K-1$ metric karypis2000multilevel, or the boundary cut hendrickson2000graph. It has also been called fanout in the database literature kabiljo2017social. The related Sum of External Degrees penalty karypis2000multilevel is also a special case of the Group Number affinity. The Relative Plurality affinity considers only the difference between the size of the largest group represented in an edge and the size of the next largest group. This rather specialized affinity function is especially appropriate in contexts where groups are expected to be roughly balanced, as we find, for example, in party affiliations in Congressional committees. Finally, the Pairwise affinity counts the number of pairs of nodes within the edge whose clusters differ. While this affinity function uses similar information to that used in dyadic graph models, there is no immediately apparent equivalence between any dyadic random graph and a DCHSBM with the Pairwise affinity function. There are many more symmetric affinity functions; see Table 3 of veldt2020hypergraph for several other splitting functions which can be used to define affinities.
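The four families in Table 1 are straightforward to express in code. The sketch below takes $\mathbf{p}$ as a decreasing tuple of positive group sizes, so that `len(p)` equals $\lVert\mathbf{p}\rVert_0$; the functions `f`, `g`, and `h` are user-supplied placeholders.

```python
def aon(p, omega_k1, omega_k0):
    """All-Or-Nothing: depends only on whether the edge is uncut."""
    return omega_k1 if len(p) == 1 else omega_k0

def group_number(p, f):
    """Group Number: depends only on the number of distinct groups."""
    return f(len(p), sum(p))

def relative_plurality(p, g):
    """Relative Plurality: gap between the two largest groups in the edge."""
    p2 = p[1] if len(p) > 1 else 0
    return g(p[0] - p2, sum(p))

def pairwise(p, h):
    """Pairwise: counts ordered pairs of nodes whose clusters differ."""
    k = sum(p)
    return h(k * k - sum(pi * pi for pi in p), k)  # sum_{i != j} p_i p_j
```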

An important subtlety was recently raised by zhang2020statistical concerning the relationship between blockmodels and modularity in newman2016equivalence, which also applies to our derivation of (14) above and (15) below. We derived the conditional maximum-likelihood estimates (10) and (11) of $\bm{\theta}$ and $\Omega$ under the assumption of a general, unconstrained affinity function $\Omega$. It is not guaranteed that these same estimates maximize the likelihood when additional constraints, such as the symmetry constraint $\Omega(\mathbf{z}) = \Omega(\phi(\mathbf{z}))$, are imposed. Indeed, as zhang2020statistical show for the case of dyadic graphs, the formulas (10) and (11) for estimating $\hat{\bm{\theta}}$ and $\hat{\Omega}$ are only exact under the symmetry assumption on $\Omega$ when $\text{vol}(\ell)$ is constant for each $\ell \in [\hat{\ell}]$. When the sizes of groups vary, as is typical in most data sets, (10) and (11) are instead approximations of the exact conditional maximum-likelihood estimates. The situation is reminiscent of the tendency of the graph modularity objective to generate clusters of approximately equal sizes gleich2016mining. The objectives and algorithms we develop below should therefore be understood as approximations to coordinate-ascent maximum-likelihood inference, which are exact only in the case that all clusters have equal volumes. See zhang2020statistical for a detailed discussion of these issues in the context of dyadic graphs.

3.2  All-Or-Nothing Modularity

The All-Or-Nothing affinity function is of special interest for modeling and computation. This affinity function is a natural choice for systems in which the occurrence of an interaction or relationship depends strongly on group homogeneity.

Inserting the All-Or-Nothing affinity function from Table 1 into (14) yields, after some algebra (LABEL:sec:mod_cut_vol_deriv_aon), the objective

$$Q(\mathbf{z}, \Omega, \mathbf{d}) = -\sum_{k=1}^{\bar{k}} \beta_k \left[ \text{cut}_k(\mathbf{z}) + \gamma_k \sum_{\ell=1}^{\bar{\ell}} \text{vol}(\ell)^k \right] + J(\bm{\omega})\;, \qquad (15)$$

where $\beta_k = \log \omega_{k1} - \log \omega_{k0}$, $\gamma_k = \beta_k^{-1}(\omega_{k1} - \omega_{k0})$, and $J(\bm{\omega})$ collects terms that do not depend on the partition $\mathbf{z}$. We collect $\{\beta_k\}$ and $\{\gamma_k\}$ into vectors $\bm{\beta}, \bm{\gamma} \in \mathbb{R}^{\bar{k}}$. We have also defined

$$\text{cut}_k(\mathbf{z}) \equiv m_k - \sum_{R \in \mathcal{R}^k} a_R\, \delta(\mathbf{z}_R)\;. \qquad (16)$$

In this expression, $m_k$ is the (weighted) number of hyperedges of size $k$, i.e., $m_k = \sum_{R \in \mathcal{R}^k} a_R$. The cut terms $\text{cut}_k(\mathbf{z})$ thus count the number of hyperedges of size $k$ which contain nodes in two or more distinct clusters. This calculation is a direct generalization of that from newman2016equivalence for graph modularity. Indeed, we recover the standard dyadic modularity objective by restricting to $k = 2$. We call (15) the All-Or-Nothing (AON) hypergraph modularity.

Recently, Kami2018 proposed a “strict modularity” objective for hypergraphs. This strict modularity is a special case of (15), obtained by choosing $\omega_{k0}$ and $\omega_{k1}$ such that $\beta_k = 1$ and $\gamma_k = \frac{m_k}{(\text{vol}(H))^k}$, where $\text{vol}(H) = \sum_{i=1}^{n} d_i$ is the sum of all node degrees in the hypergraph $H$. However, leaving these parameters free lends important flexibility to our proposed AON objective (15). Tuning $\bm{\beta}$ allows one to specify which hyperedge sizes are considered to be most relevant for clustering. In email communications, for example, a very large list of recipients may carry minimal information about their social relationships, and it may be desirable to down-weight large hyperedges by tuning $\bm{\beta}$. Tuning $\bm{\gamma}$ has the effect of modifying the sizes of clusters favored by the objective, in a direct generalization of the resolution parameter in dyadic modularity reichardt2006statistical; veldt2018cc. Importantly, it is not necessary to specify the values of these parameters a priori; instead, they can be adaptively estimated via (11).
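A minimal sketch of evaluating the AON objective (15), up to the $\mathbf{z}$-independent constant $J(\bm{\omega})$, given a labeling and parameter vectors; the strict-modularity choice of Kami2018 appears as a comment. Data structures and names are illustrative.

```python
from collections import Counter

def aon_modularity(edges, z, beta, gamma):
    """Evaluate Q in Eq. (15), up to the constant J(omega).

    edges : dict mapping node tuples R to integer weights a_R
    z     : node labels
    beta, gamma : dicts keyed by edge size k
    """
    vol = Counter()   # vol(ell): summed degrees per cluster
    cut = Counter()   # cut_k(z): weighted k-edges spanning >= 2 clusters
    sizes = set()
    for R, a_R in edges.items():
        sizes.add(len(R))
        for i in R:
            vol[z[i]] += a_R
        if len({z[i] for i in R}) > 1:
            cut[len(R)] += a_R
    return -sum(beta[k] * (cut[k] + gamma[k] * sum(v ** k for v in vol.values()))
                for k in sizes)

# Strict modularity (Kami2018) corresponds to beta_k = 1 and
# gamma_k = m_k / vol(H)**k, where m_k is the weighted number of k-edges
# and vol(H) is the sum of all node degrees.
```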

4   Hypergraph Maximum-Likelihood Louvain

In order to optimize the modularity objectives (14) and (15), we propose a family of agglomerative clustering algorithms. These algorithms greedily improve the specified objective through local updates to the node label vector $\mathbf{z}$. The structure of these algorithms is based on the widely used and highly performant Louvain heuristic for graphs blondel2008fast. The standard heuristic alternates between two phases. In the first phase, each node begins in its own singleton cluster. Then, each node $i$ is visited and moved to the cluster of the adjacent node $j$ which maximizes the increase in the objective $Q$. If no such move increases the objective, then $i$'s label is not changed. This process repeats until no moves exist which increase the objective. In the second phase, a “supernode” is formed for each label. The supernode represents the set of all nodes sharing that label. Then, the first phase is repeated, generating an updated labeling of supernodes, which are then aggregated in the second phase. The process repeats until no more improvement is possible. Since every step in the first phase improves the objective, the algorithm terminates with a locally optimal cluster vector $\mathbf{z}$.

This heuristic generalizes naturally to the setting of hypergraphs. However, the incorporation of heterogeneous hyperedge sizes and general affinity functions considerably complicates implementation. Here we provide a highly general Hypergraph Maximum-Likelihood Louvain (HMLL) algorithm for optimizing the symmetric modularity objective (14). For the important case of the All-Or-Nothing (AON) affinity, the simplified objective (15) admits a much simpler and faster specialized Louvain algorithm, which we describe in LABEL:sec:AON_louvain. As we show in subsequent experiments, this specialized algorithm is highly scalable and effective in recovering ground truth clusters in data sets with polyadic structure plausibly modeled by the AON affinity.

4.1  Symmetric Hypergraph Maximum-Likelihood Louvain

The first phase of our Symmetric HMLL algorithm mirrors standard graph Louvain: nodes begin in singleton clusters and in turn greedily move to adjacent clusters until no further improvement is possible. Phase two of graph Louvain reduces edges between clusters into weighted edges between supernodes in a structure-preserving way. In the hypergraph case, however, naively collapsing clusters and hyperedges would discard important information about hyperedge sizes and the way each hyperedge is partitioned across clusters. Therefore, in subsequent stages of our algorithm, we greedily improve the objective by moving entire sets of nodes in the original hypergraph, rather than by moving individual nodes. In this way, a set of nodes that was assigned to the same cluster in a previous iteration is essentially treated as a supernode and moved as a unit, without collapsing the hypergraph and losing needed information about hyperedge structure.

Our overall procedure is formalized in Algorithms 1 and 2. Algorithm 1 visits in turn each set of nodes $S_c$ that represents a cluster $c$ from a previous iteration. The algorithm evaluates the change $\Delta Q$ in the objective function $Q$ associated with moving all of $S_c$ to an adjacent cluster, and then carries out the move that gives the largest positive change in the objective. This is repeated until moving a set $S_c$ can no longer improve the objective. The entire Symmetric HMLL method is summarized by Algorithm 2, which starts by placing every node in a singleton cluster before calling Algorithm 1 for the first time.

Data: Hypergraph $H$, affinity function $\Omega$, current label vector $\mathbf{z}$
Result: Updated label vector $\mathbf{z}'$
$\mathcal{C} \leftarrow \text{unique}(\mathbf{z})$   // unique cluster labels in the initial clustering $\mathbf{z}$
$\mathbf{z}' \leftarrow \mathbf{z}$   // by greedily moving clusters of $\mathbf{z}$, form a new clustering $\mathbf{z}'$
improving ← true
while improving do
    improving ← false
    for $c$ in $\mathcal{C}$ do
        $S_c \leftarrow \{i : z_i = c\}$   // nodes with label $c$ in the original clustering $\mathbf{z}$
        $\mathcal{C}' \leftarrow \text{unique}(\mathbf{z}')$   // cluster labels in the current clustering $\mathbf{z}'$
        // set of clusters $\mathcal{A}_c$ in $\mathbf{z}'$ that are adjacent to $S_c$
        $\mathcal{A}_c \leftarrow \bigcup_{e \in E} \{\tilde{c} \in \mathcal{C}' : \exists\, v \in V \text{ with } z'_v = \tilde{c} \text{ and } i \in S_c \text{ with } v, i \in e\}$
        // maximum change $\Delta$ and maximizer $c'$ of the change in $Q$ from moving the set $S_c$ to an adjacent cluster $\tilde{c} \in \mathcal{A}_c$
        $(\Delta, c') \leftarrow \operatorname{argmax}_{\tilde{c} \in \mathcal{A}_c} \Delta Q(H, \Omega, \mathbf{z}', S_c, \tilde{c})$
        if $\Delta > 0$ then   // update $\mathbf{z}'$ if an improving move was found
            $z'_i \leftarrow c'$ for all $i \in S_c$
            improving ← true
        end if
    end for
end while
return $\mathbf{z}'$
Algorithm 1: SymmetricHMLLstep($H$, $\Omega$, $\mathbf{z}$)

Data: Hypergraph $H$, affinity function $\Omega$
Result: Label vector $\mathbf{z}$
$\mathbf{z}' \leftarrow \mathbf{z} \leftarrow (1, \ldots, n)$   // place each node in its own singleton cluster
do
    $\mathbf{z} \leftarrow \mathbf{z}'$
    $\mathbf{z}' \leftarrow \text{SymmetricHMLLstep}(H, \Omega, \mathbf{z})$
while $\mathbf{z} \neq \mathbf{z}'$
return $\mathbf{z}$
Algorithm 2: SymmetricHMLL($H$, $\Omega$)

Algorithm 1 relies on a function $\Delta Q$ which computes the change in the objective $Q$ associated with moving all nodes from $S_c$ to an adjacent cluster. Changes to the second (volume) term in the objective can be computed efficiently using combinatorial identities (LABEL:sec:eval_sums, LABEL:prop:recursion). Changes to the first (cut) term require summing across all hyperedges incident to a node or set of nodes. At each hyperedge, we must evaluate the affinity $\Omega(\mathbf{p})$ on the current partition $\mathbf{p}$, as well as the affinity $\Omega(\mathbf{p}')$ associated with the candidate updated partition $\mathbf{p}'$. This situation contrasts with the graph Louvain algorithm, in which it is sufficient to check whether a given edge joins nodes in the same or different clusters. The fact that we need to store and update the partition vector $\mathbf{p}$ for each hyperedge is what prevents us from collapsing a cluster of nodes into a monolithic supernode and recursively applying Algorithm 2 on a reduced data structure, as is customary in graph Louvain. Thus, while clusters of nodes move as a unit in Algorithm 1 as well, it is necessary in this case to operate on the full adjacency data $\mathcal{A}$ at all stages of the algorithm. This can make Algorithm 2 slow on hypergraphs of even modest size. Developing efficient algorithms for optimizing the general symmetric modularity objective or various special cases is an important avenue of future work.

4.2  All-Or-Nothing Hypergraph Maximum-Likelihood Louvain

When $\Omega$ is the All-Or-Nothing affinity function, considerable simplification is possible. For each edge we need not compute the full partition vector $\mathbf{p}$, but only check whether or not $\lVert\mathbf{p}\rVert_0 = 1$. Rather than a general affinity function $\Omega$, we instead supply the parameter vectors $\bm{\beta}$ and $\bm{\gamma}$ appearing in (15). This allows us to compute on considerably simplified data structures. In particular, we are able to follow the classical Louvain strategy of collapsing clusters into single, consolidated “supernodes,” and to restrict attention to hyperedges that span multiple supernodes. Because we do not need to track the precise way in which the hyperedges span multiple supernodes, we can forget much of the original adjacency data $\mathcal{A}$ and instead simply store the edge sizes of the hypergraph. These simplifications enable both significant memory savings and very rapid evaluation of the objective update function $\Delta Q$. We provide pseudocode for exploiting these simplifications in LABEL:sec:AON_louvain.
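To illustrate why AON moves are cheap, here is a minimal sketch of the objective change $\Delta Q$ for moving a single node (or a supernode with aggregated degree) to a new cluster: only edges incident to the moving node can change cut status, and the volume update involves just the source and destination clusters. Data structures and names are illustrative and differ from the pseudocode in LABEL:sec:AON_louvain.

```python
def delta_q_aon(edges_at, z, d, vol, i, c_new, beta, gamma):
    """Change in the AON objective (15) when node i moves to cluster c_new.

    edges_at : dict mapping each node to a list of (R, a_R) incident edges
    d        : node degrees; vol : dict of current cluster volumes
    """
    c_old = z[i]
    if c_old == c_new:
        return 0.0
    dQ = 0.0
    # Cut terms: an incident edge changes state only if it was, or becomes,
    # entirely contained in a single cluster.
    for R, a_R in edges_at[i]:
        others = [z[j] for j in R if j != i]
        uncut_before = all(c == c_old for c in others)
        uncut_after = all(c == c_new for c in others)
        dQ -= beta[len(R)] * a_R * (int(uncut_before) - int(uncut_after))
    # Volume terms: only vol(c_old) and vol(c_new) change.
    for k in beta:
        dQ -= beta[k] * gamma[k] * ((vol[c_new] + d[i]) ** k - vol[c_new] ** k
                                    + (vol[c_old] - d[i]) ** k - vol[c_old] ** k)
    return dQ
```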

4.3  Number of Clusters

As with most Louvain-style modularity methods, the user cannot directly control the number of clusters returned by HMLL. One way to influence the number of clusters is to manually set values for the affinity function $\Omega$, or for the parameters $\bm{\beta}$ and $\bm{\gamma}$ if using the All-Or-Nothing affinity. Rather than inferring these parameters from data, one can set them a priori and perform a single round of optimization over $\mathbf{z}$. This approach generalizes the common treatment of the resolution parameter in dyadic modularity maximization as a tuning hyperparameter rather than a quantity to be estimated from data reichardt2006statistical. Considerable experimentation may be required in order to obtain the desired number of clusters, and retrieving an exact number may not be possible.

Another approach to influencing the number of clusters is to impose a Bayesian prior on the community labels. In the simplest version of a Bayesian approach, one assumes that each node is independently assigned one of $\bar{\ell}$ labels with equal probability, prior to sampling edges. The probability of realizing a given label vector $\mathbf{z}$ is then $\bar{\ell}^{-n}$, which generates a term of the form $-n \log \bar{\ell}$ in the log-likelihood $\mathcal{L}$. This term may then be incorporated into Louvain implementations, with the result that greedy moves which reduce the total number of clusters $\bar{\ell}$ are strongly incentivized. The resulting regularized algorithm may then return label vectors $\mathbf{z}$ with slightly smaller numbers of distinct clusters. This can be useful when it is known a priori that the true number of clusters in the data is small. Our implementation of AON HMLL incorporates this optional regularization term. We use this term only in the synthetic detectability experiments presented in LABEL:fig:detectability.

5   Experiments with Synthetic Data

5.1  Runtime

Dyadic Louvain algorithms are known for being highly efficient on large graphs. Here, we show that AON HMLL can achieve performance on synthetic data similar to that of Graph MLL (GMLL), a variant of the standard dyadic Louvain algorithm in which we return the combination of resolution parameter and partition that yields the highest dyadic likelihood. For a fixed number of nodes $n$, we consider a DCHSBM-like hypergraph model with $\bar{\ell} = n/200$ clusters and $m = 10n$ hyperedges with size $k$ uniformly distributed between 2 and 4. Each $k$-edge is, with probability $p_k$, placed uniformly at random on any $k$ nodes within the same cluster. Otherwise, with probability $1 - p_k$, the edge is instead placed uniformly at random on any set of $k$ nodes. We use this model rather than a direct DCHSBM to avoid the computational burden of sampling edges at each $k$-tuple of nodes, which is prohibitive when $n$ is large. For the purpose of performance testing, we compute estimates of the parameter vectors $\bm{\beta}$ and $\bm{\gamma}$ (in the case of AON HMLL) and the resolution parameter $\gamma$ (in the case of GMLL) using ground-truth cluster labels. We emphasize that this is typically not possible in practical applications, since the ground-truth labels are not known. We make this choice in order to focus on a direct comparison of the runtimes of each algorithm in a situation in which both can succeed. In later sections, we study the ability of HMLL and GMLL to recover known groups in synthetic and empirical data when affinities and resolution parameters are not known.
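A sketch of the synthetic generator just described, under the simplifying assumption that the $\bar{\ell} = n/200$ clusters are contiguous blocks of equal size; names are illustrative.

```python
import numpy as np

def synthetic_testbed(n, p, rng=None):
    """Generate the Section 5.1 testbed: n nodes, n/200 equal clusters,
    and 10n hyperedges of size k in {2, 3, 4}.

    p : dict mapping k to the within-cluster placement probability p_k.
    """
    rng = rng or np.random.default_rng()
    z = np.arange(n) // 200                    # ground-truth labels
    edges = []
    for _ in range(10 * n):
        k = int(rng.integers(2, 5))            # size uniform on {2, 3, 4}
        if rng.random() < p[k]:
            c = int(rng.integers(n // 200))    # uniform within-cluster edge
            members = rng.choice(np.arange(c * 200, (c + 1) * 200),
                                 size=k, replace=False)
        else:
            members = rng.choice(n, size=k, replace=False)  # uniform edge
        edges.append(tuple(sorted(int(i) for i in members)))
    return z, edges
```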

[Figure: Runtime, Adjusted Rand Index, and number of clusters returned by GMLL and HMLL in a synthetic testbed with optimal affinity parameters. The within-cluster edge placement probabilities are $p_2 = 3/5$, $p_3 = 1/n^3$, and $p_4 = 1/n^4$. We also show, in light gray, the results obtained by using GMLL as a preprocessing step whose output partition is then refined by HMLL.]

The figure above shows the runtime, Adjusted Rand Index (ARI), and number of clusters returned on synthetic hypergraphs when $p_2 = 0.6$, $p_3 = 1/n^3$, and $p_4 = 1/n^4$. These parameters are chosen so that hyperedges of size three and four are rarely (if ever) contained completely inside clusters. Thus, hyperedges of different sizes provide different signal regarding ground-truth clusters. For this experiment, we implemented Graph MLL by computing a normalized clique projection, in which nodes are joined by weighted dyadic edges with weights

$$w_{ij} = \sum_{e : i, j \in e} \frac{1}{|e| - 1}\;. \qquad (17)$$

We also performed experiments on an unnormalized clique projection with $w_{ij} = |\{e : i, j \in e\}|$, but do not show these results because, in this experiment, the associated MLL algorithm consistently fails to recover labels correlated with the planted clusters.
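A sketch of both projections, assuming hyperedges are given as node tuples; `normalized=True` gives the weights in (17).

```python
from collections import defaultdict
from itertools import combinations

def clique_projection(edges, normalized=True):
    """Weighted dyadic projection of a hypergraph.

    With normalized=True each hyperedge e adds 1/(|e| - 1) to every node
    pair it contains (Eq. 17); otherwise each hyperedge adds 1.
    """
    w = defaultdict(float)
    for e in edges:
        nodes = sorted(set(e))
        contrib = 1.0 / (len(nodes) - 1) if normalized else 1.0
        for i, j in combinations(nodes, 2):
            w[(i, j)] += contrib
    return w
```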

On smaller instances, HMLL outperforms Graph MLL in recovering planted clusters, as measured by the ARI. For larger instances, the recovery results are comparable. Interestingly, although GMLL and HMLL obtain similar accuracy in this experiment, they do so in different ways, with HMLL tending to generate more, smaller clusters than GMLL. Importantly, the runtimes are nearly indistinguishable, indicating that dyadic clique projections are necessary neither for accuracy nor for performance. We observed other choices of the parameters $p_2$, $p_3$, and $p_4$ in which HMLL substantially outperformed GMLL in cluster recovery and vice versa; however, in each case the algorithms' respective runtimes tended to differ by only a small constant factor.

Interestingly, in this synthetic experiment, a combination of the two algorithms leads to the strongest recovery results. In addition to running each algorithm independently, we also ran a two-stage algorithm in which GMLL is used to generate an intermediate partition, which HMLL then refines. This procedure is similar to the warmstart approach of kaminski2020community. We emphasize again that these results are obtained on synthetic hypergraphs with pre-optimized affinity parameters, and so the effectiveness of the refinement strategy may not generalize to real data sets. Indeed, in the experiments on empirical data shown in LABEL:sec:empirical, we do not show results from the refinement procedure because its output partition was in each case essentially indistinguishable from the dyadic partition itself. This may reflect the fact that, in those experiments, the algorithms were not supplied a priori with the affinity parameters associated with the true data labels. Further investigation into the performance of hybrid strategies would be of considerable practical importance.

5.2  Dyadic Projections and the Detectability Threshold

Informally, an algorithm is able to detect communities in a random graph model with fixed labels $\mathbf{z}$ when the output labeling $\hat{\mathbf{z}}$ of that algorithm is, with probability bounded above zero, correlated with $\mathbf{z}$. Using arguments from statistical physics, decelle2011asymptotic conjectured the existence of a regime in the graph stochastic blockmodel in which no algorithm can successfully detect communities. This conjecture has since been refined and proven in various special cases; see abbe2017community for a survey. In the dyadic stochastic blockmodel with two equal-sized planted communities, a necessary condition for detectability in the large-graph limit is

$$\frac{(c_i - c_o)^2}{2(c_i + c_o)} \geq 1\;, \qquad (18)$$

where $c_i$ is the mean number of within-cluster edges attached to a node, and $c_o$ is the mean number of between-cluster edges attached to a node. If this condition is not satisfied, no algorithm can reliably detect communities in the associated graph stochastic blockmodel, even though the communities are statistically distinct. This bound limits direct inferential methods, such as Bayesian or maximum-likelihood techniques, as well as methods based on maximization of modularity or other graph objectives nadakuditi2012graph. Several recent papers have considered the detectability problem in the case of uniform hypergraphs ghoshdastidar2014consistency; Ghoshdastidar2017; Angelini2016. In our model, the presence of edges of multiple sizes complicates analysis. Here, we limit ourselves to an experimental demonstration that the regimes of detectability for the graph SBM and our DCHSBM can differ significantly.
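For reference, a one-line check of the bound (18); the example values are illustrative.

```python
def above_detectability_bound(c_i, c_o):
    """Necessary condition (18) for detectability in the two-block SBM."""
    return (c_i - c_o) ** 2 / (2 * (c_i + c_o)) >= 1

# Example: c_i = 5, c_o = 1 gives (5 - 1)^2 / 12 = 4/3 >= 1, so detection
# is not ruled out by the bound.
print(above_detectability_bound(5.0, 1.0))  # True
```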
