2022
[1,2,3,4]\fnmWei \surWei
1]\orgdivSchool of Mathematical Sciences, \orgnameBeihang University, \orgaddress\streetXueYuan Road No.37, \cityBeijing, \postcode100191, \stateBeijing, \countryChina
2]\orgdivKey Laboratory of Mathematics Informatics Behavioral Semantics, \orgnameMinistry of Education, \orgaddress\cityBeijing, \countryChina
3]\orgdivInstitute of Artificial Intelligence, \orgnameBeihang University, \orgaddress\streetXueYuan Road No.37, \cityBeijing, \postcode100191, \stateBeijing, \countryChina
4]\orgdivZhongguancun Laboratory, \orgaddress\cityBeijing, \countryChina
5]\orgdivthe Center for Humans and Machines, \orgnameMax Planck Institute for Human Development, \orgaddress\streetLentzeallee 94, \cityBerlin, \postcode14195, \countryGermany
Enhance Ambiguous Community Structure via Multi-strategy Community Related Link Prediction Method
Abstract
Most real-world networks suffer from incompleteness or incorrectness, which is an inherent characteristic of real-world networks. This imperfection might cause the deficiency of machine learning algorithms in complex networks like community detection. Therefore, proper measures for improving community detection performance and robustness are needed for more promising network analysis. In this paper, we propose a harmony-based aggregation preferred (HAP) link prediction method, which takes the result of community detection algorithm into consideration. The HAP method aims to detect boundary nodes among different communities and establish connections to clarify community structure. Furthermore, we design a two-step community enhancement framework with an automatic evolution process based on the HAP method. This methodology successfully clarifies ambiguous community structures by adding links to the current network. The experimental results on twelve real-world datasets with ground truth knowledge indicate that the proposed link prediction method outperforms other baseline methods in enhancement capacity and numerical stability. Furthermore, the proposed community enhancement method follows the expected evolution process.
keywords:
Complex network, community enhancement, link prediction, community detection, entropy1 Introduction
Since a great deal of real-world data could be expressed in complex network fashion (Strogatz(2001)networks), complex network analysis attracts more and more attention in many scientific disciplines. There are various approaches to unveiling the underlying information behind networks. Community detection has been considered as one of the most vital among these studies (deng(2016)mobility; wang(2013)discovering; qiao(2020)dynamic). The network community is defined as a group of nodes that are densely connected to each other while sparsely connected to the rest nodes (Newman(2004)community). Community structure appears at a high frequency and is of pivotal importance in network analysis. Social networks are paradigmatic examples of graphs with communities since people tend to form groups with similar interests or ideologies (Girvan(2002)community). In biological bodies, community structure in protein interaction networks could represent a group of proteins with similar functions (Lewis(2010)function).
Unfortunately, most real-world datasets are severely incomplete (Fortunato(2010)community). Online social networks like Facebook and Twitter, only a tiny part of the information can be collected. In gene interaction networks, links among genes are measured by costly experiments. The imperfection of real-world network datasets consistently leads to incorrect community detection outcomes. As a fundamental network analysis method, community detection methods’ skewed results will adversely impact downstream network analysis tasks (Fortunato(2016)guide). This circumstance leads to the critical need for community enhancement methods, which stands for rewiring the current network to improve the performance of existing community detection methods.
There are several challenges when designing community enhancement algorithms. As a preprocessing method, it should have the property of highly accessible computation cost. In addition, the model should not contain too many parameters since additional time is needed to learn these parameters for different graphs. Moreover, the process of enhancing community structure should be gentle and gradual, which is conducive to maintaining the algorithm’s stability and genericity. From this point of view, similarity-based link prediction methods come into our sight. As network augmentation methods, they have the merit of low computation cost, guaranteeing their application on large-scale networks for multiple rounds (Wang(2015)lpsurvey).
Research has been conducted to fix impaired network systems by predicting which node pairs are more likely to establish links, also known as the link prediction method (Lu(2011)linkprediction; liben(2007)CN; Cannistraci(2013)CAR; Adamic(2003)AA; Ravasz(2002)HDI; Katz(1953)KI; Li(2011)MERW). However, to the best of our knowledge, research on discovering new link prediction methods to enhance community structure has rarely been discussed. There is adequate evidence that we could expect link prediction methods to be implemented as a data augmentation procedure in the preprocessing process of community detection methods. It could play a non-negligible role in community detection tasks as the preprocessing process has been proven a critical part of the practice of machine learning algorithms (garcia(2015)data; lu(2016)preprocessing).
In this paper, we design a harmony-based aggregation preferred (HAP) link prediction method and propose a community enhancement algorithm based on this strategy. It has the desirable characteristics of feasible computation cost and universal applications. Our method comprises two distinct procedures. The revising process will suture fractured communities into a complete one, while the repaired communities will get augmented during the reinforcing process. The diagram of such a procedure is presented in figure 1. Experimental results on real-world datasets with ground truth community information show that our method performs better than baseline methods in most cases. The main contributions of our work are threefold:
-
•
We analyze the typical bias of community detection methods due to the incompleteness of networks, and we put forward the definition of revising and reinforcing edges. Both of them will help other researchers when designing their algorithms.
-
•
We design a new parameter-free, unsupervised link prediction strategy HAP for community enhancement tasks. The HAP method has low computational complexity and high universality of appliances. It could be easily applied as a plug-in module in preprocessing procedures for arbitrary community detection algorithms.
-
•
We conduct experiments on 12 datasets from different areas. The extensive experimental results show that our proposed method can achieve promising performance and outperform all baseline methods.

The remainder of this paper is organized as follows. Section 2 presents related research work. Next, section 3 provides the formal definition of the community detection problem and illustrates the details of our method, including the inductive biases. Furthermore, section LABEL:sec:expriment shows the experimental results and comparison with other baseline models. Finally, section LABEL:sec:conclusion gives the conclusion and discussion on future work.
2 Related Work
Both link prediction and community detection are of great significance in network analysis since they provide network topology information from various perspectives.
Several community detection algorithms were proposed based on altering network topology structures. Zhang et al. (zhang(2013)enhanced) designed an enhanced semi-supervised learning framework for community detection, but it required prior knowledge about nodes. Yang et al. (yang(2015)active) considered which prior information is critical for performance improvement and proposed an active link selection framework. Su et al. (Su(2019)enhance) proposed CSE method based on central and boundary node identification for community enhancement, which successfully removed the limitation of prior knowledge about nodes. Zhou et al. (zhou(2021)robustecd) proposed genetic algorithm and similarity ensemble-based community enhancement methods to explore the robustness under adversarial attack. For a big picture about community detection methods, readers are recommended to this comprehensive survey (Fortunato(2010)community).
For future reference, three well-known and representative community detection algorithms are introduced here:
-
•
Label Propagation (LPA) (Raghavan(2007)lpa): LPA method solely uses the network structure as its information, with each node adopting the label that most of its neighbours currently have at every iteration step. The significant advantage of this algorithm is that it has near-linear time complexity.
-
•
Infomap (Rosvall(2008)infomap): This method uses the probability flow of random walks on a network as a proxy for information flows in the real-world system. It is an information-based approach capable of revealing community structure in weighted and directed networks.
-
•
Louvain (Blondel(2008)louvain): It is a heuristic method based on modularity optimization. This algorithm first assigns different community labels to all nodes and then optimizes the modularity by aggregating those separate communities.
In link prediction oriented problems, the community detecting results could be regarded as a global attribute to provide extra information for link prediction algorithms. Soundarajan and Hopcroft (Soundarajan(2012)community) rewrote the classic CN index and RA index with community information, and the experimental results showed improvement. Rebaza and Lopes (Rebaza(2012)cluster) took intra-cluster and inter-cluster into consideration and proposed WIC measure, which can be extended on directed and asymmetric large-scale networks (Rebaza(2012)Twitter). Ai et al. (Ai(2019)recommend) presented a link prediction method based on complex network modelling and community detection results for personalized recommendation circumstances.
For the main focus of this paper, link prediction methods can be implemented to enhance ambiguous community structure. Yang et al. (Yang(2009)link_content) proposed a conditional model for link prediction and a discriminative model for content analysis. Chen et al. (Chen(2016)enhance) tested three traditional link prediction methods for enhancing community structure. Bacco et al. (Bacco(2017)multilayer) proposed a generative model for multilayer networks with interdependence among their layers. Jiang et al. (Jiang(2020)ambiguous) designed a strategy based on node centralities to establish clear boundaries among communities. Burgess et al. (Burgess(2016)consensus) proposed EdgeBoost structure and explored the improvement of community detection performance of three link prediction algorithms with six community detection methods.

3 Methods
In this section, we explain the proposed community enhancement algorithm in detail. The critical component of the proposed enhancement algorithm is the HAP link prediction method, based on which the inductive biases and the intuitions will be illustrated. Firstly we will formally define the problem and all the symbols used in this paper.
3.1 Problem and Definitions
Every complex network system can be presented as an ordered tuple , where is the set of nodes and represents the edge set of network . In this paper we mainly concern about undirected graph, i.e., .
Each node has a clustering label from ground truth knowledge for a network with community attributes. Here we focus on networks with non-overlapping communities. We define the community mapping function as . It comes in two forms, the ground truth community knowledge and the result of community detection algorithms . In a network with clusters, denotes the set of clustering labels. Both and can be interpreted as functions satisfying , namely , it has its ground truth community attribute and algorithm result . Our goal is to find an optimal link prediction algorithm that approaches as close to as possible. The closeness between and can be quantified nicely, this will be discussed later in chapter LABEL:sec:expriment.
For future reference, the main notations in this paper will be introduced here. stands for the set of nodes with the community attribute and stands for taking the cardinality of the given set. is the connection matrix satisfying . We let be the number of edges between cluster and . Given the ground truth community and algorithm’s output , edge is a revising edges if and . At the same time we provide the definition of reinforcing edges: edge is a reinforcing edge if and only if and . These two definitions will be used to quantify the fixing power of different link prediction methods and visualize the transformation from revising process to reinforcing process.
3.2 Inductive Biases
It is worth noticing that in most cases, community detection methods might yield a larger number of clusters on both real-world and synthesized networks. Here we experiment with three representative community detection methods (LPA, Infomap and Louvain) to verify this phenomenon, since a fair amount of recently developed community detection methods are based on the intuitions behind these three methodologies (okuda(2019)community; luo(2020)highly; roy(2021)nesifc).
Here we present the community detection results of the three algorithms on nine real-world network systems in figure 2. Notice the difference in cluster numbers between ground truth knowledge and algorithm results. Except for the Louvain method on the Eurosis network, community detection methods consistently output a significant more number of communities than ground truth. Especially in the Cora_OS dataset, the results of LPA and Infomap are at least 25 times larger than the actual value. Furthermore, according to (Burgess(2016)consensus), experimental results of these community detection methods consistently yield a higher number of communities on LFR (Lancichinetti(2008)LFR) benchmark graphs. All real-world datasets mentioned here will be formally introduced in section LABEL:sec:expriment.


To take a step further, we can observe that the emergence of additional clusters comes from the fracture of complete ground truth communities due to the incompleteness of networks. For example, as illustrated in figure 3, the Dolphin network’s larger ground truth community gets fragmented into four smaller communities by the Infomap method. Supposing the link prediction method could make connections among different parts of fractured subcommunities, community detection methods might achieve better performance by recognising and merging those subcommunities into a complete one.
If we could adjust the network topology structure by connecting nonexistent revising edges, the community detection algorithms would have a better chance to approach the ground truth community of networks.
In conclusion, the inductive biases in this paper are listed as follows.
-
•
The outputs of community detection algorithms are highly likely to be incorrect and contain more clusters than actual cases because detection methods tend to split large communities into smaller ones.
-
•
The connection of revising edges enhances the ambiguous community structure, which could help the downstream community detection algorithms perform better.
3.3 Algorithm Skeleton
It is commonly accepted that ambiguous community structure is challenging for community detection studies due to the subtle difference between inter-edges and intra-edges (Su(2019)enhance). The leading thought of the HAP link prediction method is to add links among fractured components of a complete community, thus turning misunderstood inter-edges into affirmative intra-edges.
As a community attribute related unsupervised link prediction method, the HAP method requires a community detection algorithm to trigger the community enhancement procedure.
From a general perspective, the proposed community structure enhancement method contains three main processes in each iteration: 1) community detection, 2) central and boundary nodes recognition, and 3) adding links. At the beginning of each iteration, an early-stopping criterion detects the procedure of community enhancement to determine the stopping point. The rest of this chapter mainly focuses on detailed information about each step. However, the community detection process will not be discussed here since it can be arbitrary methods given by the users.
3.4 Central Nodes Recognition
In order to achieve the goal of community enhancement, the HAP link prediction method first recognises central and boundary nodes. It then adds links to the network according to this information. The proposed community enhancement method has an evolutionary process which will be explained on short notice.
In central node recognition, most methods use centrality measures to define whether nodes are on the edge of communities or not (Su(2019)enhance; zhou(2021)robustecd). Such centrality measures can be defined through the average distance between intra-community nodes, which can be regarded as geometric distance centrality. Other methods, such as calculating the fraction of neighbour nodes with the same community attributes, could be regarded as the probability of a one-step random walk ending within the community.
These two measures are both successful in identifying boundary nodes. However, the distance-based centrality measure will bring unwanted calculation complexity. Moreover, nodes with a larger degree are more likely to yield a smaller average distance. In the meantime, the node’s neighbours with different community attributes are not fully considered. For one-step random walk measure, it has less computation complexity and takes the neighbour nodes’ community attributes into consideration. However, the linear fractional expression of connection might not utilize the neighbourhood information fully since the diversity of communities is not considered.

For the distance-based centrality measure, the centrality score for a node with community attribute of is defined as:
(1) |
Take figure 4 as an example, the distance-based centrality scores of nodes A and B equal . For one-step random walk measure, in (Su(2019)enhance), the centrality score of a node with community attribute of is defined as:
(2) |
where stands for the neighbour of node . The numerator stands for the number of nodes with the same community attribute while the denominator is the degree of node , the one-hop random walk based centrality scores of nodes A and B equal . For the above two measures, the centrality scores of nodes A and B are the same, but situations between nodes A and B are not identical where node A stands between two communities, while node B is on the overlapping section among three communities. To avoid this shortcoming, we should take the neighbour nodes’ community attributes into deeper consideration.
Undoubtedly, the most helpful and easily accessible information for centrality measurement is the neighbourhood of nodes. To fully record such knowledge, we define the neighbourhood community enumeration (NCE) of node as a multiset. Given the community mapping function and all its neighbour nodes :
(3) |
Such as in figure 4, the NCEs of node A and B are ( and represent for blue, green and red community respectively) and respectively. We apply the standard normalized Shannon entropy (Shannon(1948)entropy) to quantify the uncertainty of NCE. If a node has complex neighbourhood information, namely a large entropy value, it is more likely to be a boundary node of a community. For a node , given the community mapping function , its boundary score (BS) can be calculated as follow:
(4) |
In this equation, H is the Shannon Entropy function and if , its boundary score is set to . It can be easily proven that the maximum value of the numerator is . Thus the BS value always lies within interval . For example, in figure 4, node A has a boundary score of , and the boundary score of node B equals . With the help of Boundary Score, the definition of consistency score (CS) is defined as:
(5) |
The maximum value of CS of a node will be achieved when there is only one type of community in its neighbourhood. A higher CS value of a node suggests its neighbours’ community attributes perform less uncertainty. This consistency measure can be viewed as a node centrality index evaluating the uncertainty of its neighbourhood clustering information.
3.5 Link Connection
3.5.1 Harmony Similarity Measure
Due to the imperfection of community detection algorithms, link prediction methods should not take the algorithm results of community detection as the ground truth knowledge. Experimental results in section LABEL:sec:expriment validate that improper use of community attributes would skew the community enhancement.
At first, we provide the similarity function for the RA index (Zhou(2009)RA):
(6) |
Equation 6 is a measure to quantify the amount of information flow from node that is received by node .
Following this train of thought, in the HAP method, the consideration of information spreading is combined with community attributes. Instead of spreading to neighbours uniformly, the information flow prefers to transmit information within the community through reliable central nodes. That is to say, the information flow is not point-to-point but in a community-to-community pattern.
Here we define the harmony (HM) value to evaluate consistency between node , namely:
(7) |
where CS is the consistency score. Through equation 7, it is clearly seen that the HM index is a second-order similarity measure via entropy. The HM score of two nodes is higher if their common neighbours have a larger average CS value. Following the definition of consistency, a node with a higher consistency value indicates that it is more likely to be a central node of a community. That is to say, node pair will yield a large harmony value if they share a neighbourhood of central nodes, suggesting that and are more likely to belong to the same community.
3.5.2 Evolution Transformation
According to equation 7, nodes in the same community are expected to have higher HM scores. It is a desirable characteristic, but it tends to enhance the fractured communities instead of merging them at the beginning. Therefore this enhancement ability of link prediction is still insufficient. If the community enhancement procedure remains in the reinforcing process, it will consistently solidify small communities. This phenomenon will cause the community detection to output the same result, which is undesirable.
In order to initiate the revising process and automatically transform it into the reinforcing procedure, the variables in such a process must be carefully designed. Following the demonstration of figure 1, we could imagine that the minimum size of communities will get larger in revising process. Furthermore, in figure 3, we can observe that the connection among fractured components from a large ground truth community is more frequent than different communities in the ground truth. Combining these two factors, we can infer that merging small communities should have higher priority at the beginning. Furthermore, for nodes from different communities, the HAP method should tend to add a link between them if their communities connect at a relatively high frequency.
Here we define Community Size Attribute (CSA) with hub-preferred strategy:
(8) |
CM is the connection matrix. Equation 8 calculates the ratio between inter-edges and intra-edges for smaller communities when . When , we do not want to enhance the community when its size is relatively small as aforementioned. Here we empirically set in this situation, which performs satisfactorily in experiments. During the revising process, the minimum size of communities gets larger, leading to the CSA value for getting smaller, thus transforming into the reinforcing process.
Generally, the CSA index can be regarded as an indicator of community connection possibility. In contrast, the HM index explicitly points out the detail about which pair of nodes should build a connection. Combining these two indexes could achieve the community enhancement measure with the desired automatic transformation of such an evolutionary procedure. The similarity function of the HAP method is listed below:
(9) |
3.6 Early-stopping Criterion
In order to prevent a trivial solution that all pairs of nodes are connected, resulting in a gigantic community, we here present an early-stopping criterion for long-term stability for the outputs of community detection methods.
We need to determine when the community detection result stops improving to achieve this goal. Since no prior knowledge about the ground truth is available, we have to employ an indicator solely based on community detection methods’ output. The intuition is to determine when the enhancement measure steps into the reinforcing stage. At this stage, the result of community detection methods remains relatively stable.
Therefore, we propose an early-stopping method to calculate the NMI value between the latest and previous results. This method requires two hyperparameters, rounds of consideration and threshold value . To be specific, assume that the community detection method has yielded results, denoted as , the early-stopping method calculates a set of NMI , if the minimum value of this set is larger than , the iteration process will be stopped immediately.
3.7 Time Complexity Analysis
For notations, we assume that is the number of nodes and is the largest degree of nodes. As mentioned earlier, the HAP mainly concerns about two-hop neighbours, so its time complexity is where is the computational cost to determine the likelihood for one pair of nodes and stands for the time complexity of preparation.
According to equation 9, consists of calculating both CSA and HM values. For the CSA value, the time complexity of lookup is once we construct the CM matrix. Determining the CM matrix requires going through all the links in a graph for one time, which has a time complexity of . In addition, it takes to determine all the