This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\jyear

2022

[1,2,3,4]\fnmWei \surWei

1]\orgdivSchool of Mathematical Sciences, \orgnameBeihang University, \orgaddress\streetXueYuan Road No.37, \cityBeijing, \postcode100191, \stateBeijing, \countryChina

2]\orgdivKey Laboratory of Mathematics Informatics Behavioral Semantics, \orgnameMinistry of Education, \orgaddress\cityBeijing, \countryChina

3]\orgdivInstitute of Artificial Intelligence, \orgnameBeihang University, \orgaddress\streetXueYuan Road No.37, \cityBeijing, \postcode100191, \stateBeijing, \countryChina

4]\orgdivZhongguancun Laboratory, \orgaddress\cityBeijing, \countryChina

5]\orgdivthe Center for Humans and Machines, \orgnameMax Planck Institute for Human Development, \orgaddress\streetLentzeallee 94, \cityBerlin, \postcode14195, \countryGermany

Enhance Ambiguous Community Structure via Multi-strategy Community Related Link Prediction Method

\fnmQiming \surYang [email protected]    [email protected]    \fnmRuizhi \surZhang [email protected]    \fnmBowen \surPang [email protected]    \fnmXiangnan \surFeng [email protected] [ [ [ [ [
Abstract

Most real-world networks suffer from incompleteness or incorrectness, which is an inherent characteristic of real-world networks. This imperfection might cause the deficiency of machine learning algorithms in complex networks like community detection. Therefore, proper measures for improving community detection performance and robustness are needed for more promising network analysis. In this paper, we propose a harmony-based aggregation preferred (HAP) link prediction method, which takes the result of community detection algorithm into consideration. The HAP method aims to detect boundary nodes among different communities and establish connections to clarify community structure. Furthermore, we design a two-step community enhancement framework with an automatic evolution process based on the HAP method. This methodology successfully clarifies ambiguous community structures by adding links to the current network. The experimental results on twelve real-world datasets with ground truth knowledge indicate that the proposed link prediction method outperforms other baseline methods in enhancement capacity and numerical stability. Furthermore, the proposed community enhancement method follows the expected evolution process.

keywords:
Complex network, community enhancement, link prediction, community detection, entropy

1 Introduction

Since a great deal of real-world data could be expressed in complex network fashion (Strogatz(2001)networks), complex network analysis attracts more and more attention in many scientific disciplines. There are various approaches to unveiling the underlying information behind networks. Community detection has been considered as one of the most vital among these studies (deng(2016)mobility; wang(2013)discovering; qiao(2020)dynamic). The network community is defined as a group of nodes that are densely connected to each other while sparsely connected to the rest nodes (Newman(2004)community). Community structure appears at a high frequency and is of pivotal importance in network analysis. Social networks are paradigmatic examples of graphs with communities since people tend to form groups with similar interests or ideologies (Girvan(2002)community). In biological bodies, community structure in protein interaction networks could represent a group of proteins with similar functions (Lewis(2010)function).

Unfortunately, most real-world datasets are severely incomplete (Fortunato(2010)community). Online social networks like Facebook and Twitter, only a tiny part of the information can be collected. In gene interaction networks, links among genes are measured by costly experiments. The imperfection of real-world network datasets consistently leads to incorrect community detection outcomes. As a fundamental network analysis method, community detection methods’ skewed results will adversely impact downstream network analysis tasks (Fortunato(2016)guide). This circumstance leads to the critical need for community enhancement methods, which stands for rewiring the current network to improve the performance of existing community detection methods.

There are several challenges when designing community enhancement algorithms. As a preprocessing method, it should have the property of highly accessible computation cost. In addition, the model should not contain too many parameters since additional time is needed to learn these parameters for different graphs. Moreover, the process of enhancing community structure should be gentle and gradual, which is conducive to maintaining the algorithm’s stability and genericity. From this point of view, similarity-based link prediction methods come into our sight. As network augmentation methods, they have the merit of low computation cost, guaranteeing their application on large-scale networks for multiple rounds (Wang(2015)lpsurvey).

Research has been conducted to fix impaired network systems by predicting which node pairs are more likely to establish links, also known as the link prediction method (Lu(2011)linkprediction; liben(2007)CN; Cannistraci(2013)CAR; Adamic(2003)AA; Ravasz(2002)HDI; Katz(1953)KI; Li(2011)MERW). However, to the best of our knowledge, research on discovering new link prediction methods to enhance community structure has rarely been discussed. There is adequate evidence that we could expect link prediction methods to be implemented as a data augmentation procedure in the preprocessing process of community detection methods. It could play a non-negligible role in community detection tasks as the preprocessing process has been proven a critical part of the practice of machine learning algorithms (garcia(2015)data; lu(2016)preprocessing).

In this paper, we design a harmony-based aggregation preferred (HAP) link prediction method and propose a community enhancement algorithm based on this strategy. It has the desirable characteristics of feasible computation cost and universal applications. Our method comprises two distinct procedures. The revising process will suture fractured communities into a complete one, while the repaired communities will get augmented during the reinforcing process. The diagram of such a procedure is presented in figure 1. Experimental results on real-world datasets with ground truth community information show that our method performs better than baseline methods in most cases. The main contributions of our work are threefold:

  • We analyze the typical bias of community detection methods due to the incompleteness of networks, and we put forward the definition of revising and reinforcing edges. Both of them will help other researchers when designing their algorithms.

  • We design a new parameter-free, unsupervised link prediction strategy HAP for community enhancement tasks. The HAP method has low computational complexity and high universality of appliances. It could be easily applied as a plug-in module in preprocessing procedures for arbitrary community detection algorithms.

  • We conduct experiments on 12 datasets from different areas. The extensive experimental results show that our proposed method can achieve promising performance and outperform all baseline methods.

Refer to caption
Figure 1: Diagram of proposed community enhancement measure. The first subfigure reveals the biased community detection result by algorithms. Furthermore, the last subfigure illustrates the ground truth community. In the revising process the community enhancement measure adjusts biased result by adding inter-cluster links between inaccurately separated subcommunities. On the contrary, it enhances the correct result by adding intra-cluster edges in the reinforcing process.

The remainder of this paper is organized as follows. Section 2 presents related research work. Next, section 3 provides the formal definition of the community detection problem and illustrates the details of our method, including the inductive biases. Furthermore, section LABEL:sec:expriment shows the experimental results and comparison with other baseline models. Finally, section LABEL:sec:conclusion gives the conclusion and discussion on future work.

2 Related Work

Both link prediction and community detection are of great significance in network analysis since they provide network topology information from various perspectives.

Several community detection algorithms were proposed based on altering network topology structures. Zhang et al. (zhang(2013)enhanced) designed an enhanced semi-supervised learning framework for community detection, but it required prior knowledge about nodes. Yang et al. (yang(2015)active) considered which prior information is critical for performance improvement and proposed an active link selection framework. Su et al. (Su(2019)enhance) proposed CSE method based on central and boundary node identification for community enhancement, which successfully removed the limitation of prior knowledge about nodes. Zhou et al. (zhou(2021)robustecd) proposed genetic algorithm and similarity ensemble-based community enhancement methods to explore the robustness under adversarial attack. For a big picture about community detection methods, readers are recommended to this comprehensive survey (Fortunato(2010)community).

For future reference, three well-known and representative community detection algorithms are introduced here:

  • Label Propagation (LPA) (Raghavan(2007)lpa): LPA method solely uses the network structure as its information, with each node adopting the label that most of its neighbours currently have at every iteration step. The significant advantage of this algorithm is that it has near-linear time complexity.

  • Infomap (Rosvall(2008)infomap): This method uses the probability flow of random walks on a network as a proxy for information flows in the real-world system. It is an information-based approach capable of revealing community structure in weighted and directed networks.

  • Louvain (Blondel(2008)louvain): It is a heuristic method based on modularity optimization. This algorithm first assigns different community labels to all nodes and then optimizes the modularity by aggregating those separate communities.

In link prediction oriented problems, the community detecting results could be regarded as a global attribute to provide extra information for link prediction algorithms. Soundarajan and Hopcroft (Soundarajan(2012)community) rewrote the classic CN index and RA index with community information, and the experimental results showed improvement. Rebaza and Lopes (Rebaza(2012)cluster) took intra-cluster and inter-cluster into consideration and proposed WIC measure, which can be extended on directed and asymmetric large-scale networks (Rebaza(2012)Twitter). Ai et al. (Ai(2019)recommend) presented a link prediction method based on complex network modelling and community detection results for personalized recommendation circumstances.

For the main focus of this paper, link prediction methods can be implemented to enhance ambiguous community structure. Yang et al. (Yang(2009)link_content) proposed a conditional model for link prediction and a discriminative model for content analysis. Chen et al. (Chen(2016)enhance) tested three traditional link prediction methods for enhancing community structure. Bacco et al. (Bacco(2017)multilayer) proposed a generative model for multilayer networks with interdependence among their layers. Jiang et al. (Jiang(2020)ambiguous) designed a strategy based on node centralities to establish clear boundaries among communities. Burgess et al. (Burgess(2016)consensus) proposed EdgeBoost structure and explored the improvement of community detection performance of three link prediction algorithms with six community detection methods.

Refer to caption
Figure 2: The average number of clusters of community detection algorithms on nine network datasets. Each value is the average number of clusters in 10 independent iterations. Compared with the ground truth, LPA and Infomap community detection methods deliver a higher number of communities. On the other hand, the Louvain algorithm tends to detect relatively fewer communities than those two methods above but is still biased except for the Eurosis dataset.

3 Methods

In this section, we explain the proposed community enhancement algorithm in detail. The critical component of the proposed enhancement algorithm is the HAP link prediction method, based on which the inductive biases and the intuitions will be illustrated. Firstly we will formally define the problem and all the symbols used in this paper.

3.1 Problem and Definitions

Every complex network system can be presented as an ordered tuple G=(V,E)G=(V,E), where VV is the set of nodes and EV×VE\subseteq V\times V represents the edge set of network GG. In this paper we mainly concern about undirected graph, i.e., vi,vjV,(vi,vj)E(vj,vi)E\forall v_{i},v_{j}\in V,\ (v_{i},v_{j})\in E\Rightarrow(v_{j},v_{i})\in E.

Each node has a clustering label from ground truth knowledge for a network with community attributes. Here we focus on networks with non-overlapping communities. We define the community mapping function as CFC_{F}. It comes in two forms, the ground truth community knowledge CGC_{G} and the result of community detection algorithms CAC_{A}. In a network with KK clusters, C={C1,C2,,CK}C=\{C_{1},C_{2},\cdots,C_{K}\} denotes the set of clustering labels. Both CGC_{G} and CAC_{A} can be interpreted as functions satisfying CG,CA:VCC_{G},C_{A}:V\mapsto C, namely vV\forall v\in V, it has its ground truth community attribute CG(v)CC_{G}(v)\in C and algorithm result CA(v)CC_{A}(v)\in C. Our goal is to find an optimal link prediction algorithm that approaches CAC_{A} as close to CGC_{G} as possible. The closeness between CAC_{A} and CGC_{G} can be quantified nicely, this will be discussed later in chapter LABEL:sec:expriment.

For future reference, the main notations in this paper will be introduced here. NCiN_{C_{i}} stands for the set of nodes with the community attribute CiC_{i} and \|\cdot\| stands for taking the cardinality of the given set. CMCM is the connection matrix satisfying CMK×KCM\in\mathbb{Z}^{K\times K}. We let CM(i,j)CM(i,j) be the number of edges between cluster CiC_{i} and CjC_{j}. Given the ground truth community CGC_{G} and algorithm’s output CAC_{A}, edge e=(vi,vj)e=(v_{i},v_{j}) is a revising edges if CG(vi)=CG(vj)C_{G}(v_{i})=C_{G}(v_{j}) and CA(vi)CA(vj)C_{A}(v_{i})\neq C_{A}(v_{j}). At the same time we provide the definition of reinforcing edges: edge e=(vi,vj)e=(v_{i},v_{j}) is a reinforcing edge if and only if CG(vi)=CG(vj)C_{G}(v_{i})=C_{G}(v_{j}) and CA(vi)=CA(vj)C_{A}(v_{i})=C_{A}(v_{j}). These two definitions will be used to quantify the fixing power of different link prediction methods and visualize the transformation from revising process to reinforcing process.

3.2 Inductive Biases

It is worth noticing that in most cases, community detection methods might yield a larger number of clusters on both real-world and synthesized networks. Here we experiment with three representative community detection methods (LPA, Infomap and Louvain) to verify this phenomenon, since a fair amount of recently developed community detection methods are based on the intuitions behind these three methodologies (okuda(2019)community; luo(2020)highly; roy(2021)nesifc).

Here we present the community detection results of the three algorithms on nine real-world network systems in figure 2. Notice the difference in cluster numbers between ground truth knowledge and algorithm results. Except for the Louvain method on the Eurosis network, community detection methods consistently output a significant more number of communities than ground truth. Especially in the Cora_OS dataset, the results of LPA and Infomap are at least 25 times larger than the actual value. Furthermore, according to (Burgess(2016)consensus), experimental results of these community detection methods consistently yield a higher number of communities on LFR (Lancichinetti(2008)LFR) benchmark graphs. All real-world datasets mentioned here will be formally introduced in section LABEL:sec:expriment.

Refer to caption
(a) Ground Truth
Refer to caption
(b) Infomap Result
Figure 3: An example of the Dolphin network indicates that community detection algorithms turn a whole community into fragments. Compared with the ground truth information, the larger community in the Dolphin network gets fractured by the Infomap community detection method.

To take a step further, we can observe that the emergence of additional clusters comes from the fracture of complete ground truth communities due to the incompleteness of networks. For example, as illustrated in figure 3, the Dolphin network’s larger ground truth community gets fragmented into four smaller communities by the Infomap method. Supposing the link prediction method could make connections among different parts of fractured subcommunities, community detection methods might achieve better performance by recognising and merging those subcommunities into a complete one.

If we could adjust the network topology structure by connecting nonexistent revising edges, the community detection algorithms would have a better chance to approach the ground truth community of networks.

In conclusion, the inductive biases in this paper are listed as follows.

  • The outputs of community detection algorithms are highly likely to be incorrect and contain more clusters than actual cases because detection methods tend to split large communities into smaller ones.

  • The connection of revising edges enhances the ambiguous community structure, which could help the downstream community detection algorithms perform better.

3.3 Algorithm Skeleton

It is commonly accepted that ambiguous community structure is challenging for community detection studies due to the subtle difference between inter-edges and intra-edges (Su(2019)enhance). The leading thought of the HAP link prediction method is to add links among fractured components of a complete community, thus turning misunderstood inter-edges into affirmative intra-edges.

As a community attribute related unsupervised link prediction method, the HAP method requires a community detection algorithm to trigger the community enhancement procedure.

From a general perspective, the proposed community structure enhancement method contains three main processes in each iteration: 1) community detection, 2) central and boundary nodes recognition, and 3) adding links. At the beginning of each iteration, an early-stopping criterion detects the procedure of community enhancement to determine the stopping point. The rest of this chapter mainly focuses on detailed information about each step. However, the community detection process will not be discussed here since it can be arbitrary methods given by the users.

3.4 Central Nodes Recognition

In order to achieve the goal of community enhancement, the HAP link prediction method first recognises central and boundary nodes. It then adds links to the network according to this information. The proposed community enhancement method has an evolutionary process which will be explained on short notice.

In central node recognition, most methods use centrality measures to define whether nodes are on the edge of communities or not (Su(2019)enhance; zhou(2021)robustecd). Such centrality measures can be defined through the average distance between intra-community nodes, which can be regarded as geometric distance centrality. Other methods, such as calculating the fraction of neighbour nodes with the same community attributes, could be regarded as the probability of a one-step random walk ending within the community.

These two measures are both successful in identifying boundary nodes. However, the distance-based centrality measure will bring unwanted calculation complexity. Moreover, nodes with a larger degree are more likely to yield a smaller average distance. In the meantime, the node’s neighbours with different community attributes are not fully considered. For one-step random walk measure, it has less computation complexity and takes the neighbour nodes’ community attributes into consideration. However, the linear fractional expression of connection might not utilize the neighbourhood information fully since the diversity of communities is not considered.

Refer to caption
Figure 4: Diagram of nodes A and B with their neighbourhood information. The diversity of nodes’ colour indicates they belong to different communities.

For the distance-based centrality measure, the centrality score for a node uu with community attribute of CiC_{i} is defined as:

CSdisu=xNCidistance(u,x)NCi.CS^{dis}_{u}=\frac{\sum_{x\in N_{C_{i}}}distance(u,x)}{\|N_{C_{i}}\|}. (1)

Take figure 4 as an example, the distance-based centrality scores of nodes A and B equal 6/56/5. For one-step random walk measure, in (Su(2019)enhance), the centrality score of a node uu with community attribute of CiC_{i} is defined as:

CSrdmu=Γ(u)NCiΓ(u),CS^{rdm}_{u}=\frac{\|\Gamma(u)\cap N_{C_{i}}\|}{\|\Gamma(u)\|}, (2)

where Γ(u)\Gamma(u) stands for the neighbour of node uu. The numerator stands for the number of nodes with the same community attribute while the denominator is the degree of node uu, the one-hop random walk based centrality scores of nodes A and B equal 4/84/8. For the above two measures, the centrality scores of nodes A and B are the same, but situations between nodes A and B are not identical where node A stands between two communities, while node B is on the overlapping section among three communities. To avoid this shortcoming, we should take the neighbour nodes’ community attributes into deeper consideration.

Undoubtedly, the most helpful and easily accessible information for centrality measurement is the neighbourhood of nodes. To fully record such knowledge, we define the neighbourhood community enumeration (NCE) of node xx as a multiset. Given the community mapping function CFC_{F} and all its neighbour nodes Γ(x)\Gamma(x):

NCEF(x)={CF(i)|iΓ(x)}.NCE_{F}(x)=\{C_{F}(i)|i\in\Gamma(x)\}. (3)

Such as in figure 4, the NCEs of node A and B are {1,1,1,1,2,2,2,2}\{1,1,1,1,2,2,2,2\} (1,21,2 and 33 represent for blue, green and red community respectively) and {1,1,1,1,2,2,3,3}\{1,1,1,1,2,2,3,3\} respectively. We apply the standard normalized Shannon entropy (Shannon(1948)entropy) to quantify the uncertainty of NCE. If a node has complex neighbourhood information, namely a large entropy value, it is more likely to be a boundary node of a community. For a node xx, given the community mapping function CFC_{F}, its boundary score (BS) can be calculated as follow:

BS(x)=H(NCEF(x))log(Γ(x)).BS(x)=\frac{H(NCE_{F}(x))}{log(\|\Gamma(x)\|)}. (4)

In this equation, H is the Shannon Entropy function and if Γ(x)=1\|\Gamma(x)\|=1, its boundary score is set to 0. It can be easily proven that the maximum value of the numerator is log(Γ(x))log(\|\Gamma(x)\|). Thus the BS value always lies within interval [0,1][0,1]. For example, in figure 4, node A has a boundary score of 0.330.33, and the boundary score of node B equals 0.500.50. With the help of Boundary Score, the definition of consistency score (CS) is defined as:

CS(x)=1BS(x).CS(x)=1-BS(x). (5)

The maximum value of CS of a node will be achieved when there is only one type of community in its neighbourhood. A higher CS value of a node suggests its neighbours’ community attributes perform less uncertainty. This consistency measure can be viewed as a node centrality index evaluating the uncertainty of its neighbourhood clustering information.

3.5 Link Connection

3.5.1 Harmony Similarity Measure

Due to the imperfection of community detection algorithms, link prediction methods should not take the algorithm results of community detection as the ground truth knowledge. Experimental results in section LABEL:sec:expriment validate that improper use of community attributes would skew the community enhancement.

At first, we provide the similarity function for the RA index (Zhou(2009)RA):

s(x,y)=zΓ(x)Γ(y)1Γ(z).s(x,y)=\sum_{z\in\Gamma(x)\cap\Gamma(y)}\frac{1}{\|\Gamma(z)\|}. (6)

Equation 6 is a measure to quantify the amount of information flow from node xx that is received by node yy.

Following this train of thought, in the HAP method, the consideration of information spreading is combined with community attributes. Instead of spreading to neighbours uniformly, the information flow prefers to transmit information within the community through reliable central nodes. That is to say, the information flow is not point-to-point but in a community-to-community pattern.

Here we define the harmony (HM) value to evaluate consistency between node x,yx,y, namely:

HM(x,y)=1Γ(x)Γ(y)zΓ(x)Γ(y)CS(z),HM(x,y)=\frac{1}{\|\Gamma(x)\cap\Gamma(y)\|}\sum_{z\in\Gamma(x)\cap\Gamma(y)}CS(z), (7)

where CS is the consistency score. Through equation 7, it is clearly seen that the HM index is a second-order similarity measure via entropy. The HM score of two nodes is higher if their common neighbours have a larger average CS value. Following the definition of consistency, a node with a higher consistency value indicates that it is more likely to be a central node of a community. That is to say, node pair x,yx,y will yield a large harmony value if they share a neighbourhood of central nodes, suggesting that xx and yy are more likely to belong to the same community.

3.5.2 Evolution Transformation

According to equation 7, nodes in the same community are expected to have higher HM scores. It is a desirable characteristic, but it tends to enhance the fractured communities instead of merging them at the beginning. Therefore this enhancement ability of link prediction is still insufficient. If the community enhancement procedure remains in the reinforcing process, it will consistently solidify small communities. This phenomenon will cause the community detection to output the same result, which is undesirable.

In order to initiate the revising process and automatically transform it into the reinforcing procedure, the variables in such a process must be carefully designed. Following the demonstration of figure 1, we could imagine that the minimum size of communities will get larger in revising process. Furthermore, in figure 3, we can observe that the connection among fractured components from a large ground truth community is more frequent than different communities in the ground truth. Combining these two factors, we can infer that merging small communities should have higher priority at the beginning. Furthermore, for nodes x,yx,y from different communities, the HAP method should tend to add a link between them if their communities connect at a relatively high frequency.

Here we define Community Size Attribute (CSA) with hub-preferred strategy:

CSA(x,y)={CM(Cx,Cy)min{CM(Cx,Cx),CM(Cy,Cy)},CxCyNCxmaxiNCi,Cx=Cy{CSA(x,y)}=\begin{cases}\frac{CM(C_{x},C_{y})}{\min\{CM(C_{x},C_{x}),CM(C_{y},C_{y})\}},&{C_{x}\neq C_{y}}\\ {\frac{\sqrt{\|N_{C_{x}}\|}}{\max_{i}\|N_{C_{i}}\|}},&{C_{x}=C_{y}}\end{cases} (8)

CM is the connection matrix. Equation 8 calculates the ratio between inter-edges and intra-edges for smaller communities when CxCyC_{x}\neq C_{y}. When Cx=CyC_{x}=C_{y}, we do not want to enhance the community when its size is relatively small as aforementioned. Here we empirically set CSA=NCx/maxi{NCi}CSA=\sqrt{N_{C_{x}}}/\max_{i}\{N_{C_{i}}\} in this situation, which performs satisfactorily in experiments. During the revising process, the minimum size of communities gets larger, leading to the CSA value for CxCyC_{x}\neq C_{y} getting smaller, thus transforming into the reinforcing process.

Generally, the CSA index can be regarded as an indicator of community connection possibility. In contrast, the HM index explicitly points out the detail about which pair of nodes should build a connection. Combining these two indexes could achieve the community enhancement measure with the desired automatic transformation of such an evolutionary procedure. The similarity function of the HAP method is listed below:

HAP(x,y)=HM(x,y)CSA(x,y).HAP(x,y)=HM(x,y)\cdot CSA(x,y). (9)
Algorithm 1 HAP Method
1:Graph G=(V,E)G=(V,E), Community Mapping Function CFC_{F} and Number of Edge Adding LL.
2:New Graph G=(V,E)G=(V,E^{\prime}).
3:NonExistEdges(NEE)NonExistEdges(NEE)\longleftarrow\emptyset
4:for iVi\in V do
5:     for jΓ(i)j\in\Gamma(i) do
6:         if (i,j)E(i,j)\notin E then
7:              NEENEE(i,j)NEE\longleftarrow NEE\cup(i,j)
8:         end if
9:     end for
10:end for
11:HAPScoresHAP\ Scores\longleftarrow\emptyset
12:CMCM\longleftarrow Conduct CM matrix by CFC_{F}
13:for e=(i,j)NEEe=(i,j)\in NEE do
14:     HMHM\longleftarrow Calculate HM value by eq. 7
15:     CSACSA\longleftarrow Calculate CSA value by eq. 8
16:     HAPHMCSAHAP\leftarrow HM\cdot CSA
17:     HAPScoresHAPScores(HAP,e)HAP\ Scores\longleftarrow HAP\ Scores\cup(HAP,\ e)
18:end for
19:HAPScoresHAP\ Scores\longleftarrow Descend Order(HAPScoresHAP\ Scores)
20:EEE^{\prime}\longleftarrow E\ \cup Top LL Edges In(HAPScores)(HAP\ Scores)
21:return G=(V,E)G=(V,E^{\prime})

3.6 Early-stopping Criterion

In order to prevent a trivial solution that all pairs of nodes are connected, resulting in a gigantic community, we here present an early-stopping criterion for long-term stability for the outputs of community detection methods.

We need to determine when the community detection result stops improving to achieve this goal. Since no prior knowledge about the ground truth is available, we have to employ an indicator solely based on community detection methods’ output. The intuition is to determine when the enhancement measure steps into the reinforcing stage. At this stage, the result of community detection methods remains relatively stable.

Therefore, we propose an early-stopping method to calculate the NMI value between the latest and previous results. This method requires two hyperparameters, rounds of consideration RR and threshold value δ\delta. To be specific, assume that the community detection method has yielded NN results, denoted as {C1,,CN}\{C_{1},...,C_{N}\}, the early-stopping method calculates a set of NMI {NMI(CNR1,CN),,NMI(CN1,CN)}\{NMI(C_{N-R-1},C_{N}),...,NMI(C_{N-1},C_{N})\}, if the minimum value of this set is larger than δ\delta, the iteration process will be stopped immediately.

3.7 Time Complexity Analysis

For notations, we assume that nn is the number of nodes and kk is the largest degree of nodes. As mentioned earlier, the HAP mainly concerns about two-hop neighbours, so its time complexity is O(nk2f(k)+g)O(nk^{2}*f(k)+g) where f(k)f(k) is the computational cost to determine the likelihood for one pair of nodes and gg stands for the time complexity of preparation.

According to equation 9, f(k)f(k) consists of calculating both CSA and HM values. For the CSA value, the time complexity of lookup is O(1)O(1) once we construct the CM matrix. Determining the CM matrix requires going through all the links in a graph for one time, which has a time complexity of O(nk)O(n*k). In addition, it takes O(n)O(n) to determine all the N_{C_i}?values.Bothofthetimecomplexitybelongstothetermg.ForcalculationofHMvalue,accordingtoequation7,eachCSvalueisdeterminedbytraversingonehopneighbournodesforatmostk+knodes.Combined,thetimecomplexityofHMequalsO((k+k)*k),whichbelongstothetermf(k).Thatistosay,thetimecomplexityofHAPisO(nk^2*k^2+n*k+k)=O(nk^4).Accordingto(martinez(2016)survey),thecomputationcostofourmethodiscompetitiveagainstalllocalmethodssincekcanberegardedasaconstantduetothesparsityandincompletenessofnetworks.Furthermore,mostcommunitydetectionmethodshavemuchlargertimecomplexity,confirmingthatourenhancementmeasurecanbeimplementedasapluginmodule.Algorithm 222Algorithm 22CommunityEnhancementMet