This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

An Efficient and Exact Algorithm for Locally hh-Clique Densest Subgraph Discovery

Xiaojia Xu Renmin University of ChinaBeijingChina [email protected] Haoyu Liu Renmin University of ChinaBeijingChina [email protected] Xiaowei Lv Renmin University of ChinaBeijingChina [email protected] Yongcai Wang Renmin University of ChinaBeijingChina [email protected]  and  Deying Li Renmin University of ChinaBeijingChina [email protected]
Abstract.

Detecting locally, non-overlapping, near-clique densest subgraphs is a crucial problem for community search in social networks. As a vertex may be involved in multiple overlapped local cliques, detecting locally densest sub-structures considering hh-clique density, i.e., locally hh-clique densest subgraph (LhhCDS) attracts great interests. This paper investigates the LhhCDS detection problem and proposes an efficient and exact algorithm to list the top-kk non-overlapping, locally hh-clique dense, and compact subgraphs. We in particular jointly consider hh-clique compact number and LhhCDS and design a new “Iterative Propose-Prune-and-Verify” pipeline (IPPV) for top-kk LhhCDS detection. (1) In the proposal part, we derive initial bounds for hh-clique compact numbers; prove the validity, and extend a convex programming method to tighten the bounds for proposing LhhCDS candidates without missing any. (2) Then a tentative graph decomposition method is proposed to solve the challenging case where a clique spans multiple subgraphs in graph decomposition. (3) To deal with the verification difficulty, both a basic and a fast verification method are proposed, where the fast method constructs a smaller-scale flow network to improve efficiency while preserving the verification correctness. The verified LhhCDSes are returned, while the candidates that remained unsure reenter the IPPV pipeline. (4) We further extend the proposed methods to locally more general pattern densest subgraph detection problems. We prove the exactness and low complexity of the proposed algorithm. Extensive experiments on real datasets show the effectiveness and high efficiency of IPPV.

1. Introduction

Finding dense subgraphs can uncover highly connected and cohesive structures in graphs, making it an effective tool for understanding complex systems. The discovery of dense subgraphs and communities has numerous applications in diverse fields including social networks (Chen and Saad, 2012; Tsourakakis et al., 2013; Gibson et al., 2005), web analysis (Gionis et al., 2013; Angel et al., 2012), graph databases (Jin et al., 2009; Zhao and Tung, 2012), and biology (Saha et al., 2010; Li et al., 2022). In these applications, the identification of near-clique subgraphs holds significant importance, as it relaxes the requirement of complete connectivity within cliques and allows for a certain degree of sparsity or missing connections while still maintaining a high level of connectivity.

Given the importance of detecting large near-clique subgraphs (Tsourakakis, 2015), the hh-clique densest subgraph (CDS) problem that finds near-cliques formed by overlapping cliques has attracted great research attention (Tsourakakis, 2015; Mitzenmacher et al., 2015; Fang et al., 2019; Sun et al., 2020). This is due to the fact that a vertex is generally involved in multiple overlapping cliques, such as a person may be involved in cliques as family members, office mates, etc (Palla et al., 2005). By finding the subgraph with the highest density of hh-cliques, CDS uncovers the highly-connected component that exhibits strong internal interactions (Benson et al., 2016; Spirin and Mirny, 2003; Liu et al., 2018). Whereas, in the context of the real world, the discovery of a single CDS offers limited insights. Listing the top-kk CDSes is desired, but due to the substantial overlap inherent in hh-cliques (Wang et al., 2013; Yuan et al., 2015), the vanilla top-kk CDSes may refer to the same dense region, still providing limited structural insights.

Therefore, detecting the top-kk non-overlapping, locally maximal, dense, and compact, i.e., locally hh-clique densest subgraphs (LhhCDS) attracts great interest. For example, Figure 1 shows the relationships between a subset of characters in “Harry Potter”. The top-11 and top-22 L33CDSes are the blue and green subgraphs, respectively. The top-11 L33CDS is a family named Weasley, and the top-22 L33CDS is an organization named Death Eaters, which indicate the potential of LhhCDS discovery for mining diverse dense communities.

Refer to caption
Figure 1. Part of the “Harry Potter” Network

However, no efficient and exact algorithm is known for detecting LhhCDS yet. The closest work to LhhCDS discovery is the locally densest subgraph (LDS) discovery (Qin et al., 2015), which is a special case of LhhCDS with h=2h=2. But LDS only considers the density of edges, and it is challenging to generalize the techniques to arbitrary hh. Firstly, the hh-clique compactness is harder to evaluate, because the number of hh-cliques can be several orders of magnitude more than the number of edges. Secondly, an hh-clique spans over hh vertices, making the subgraph division more difficult. Thirdly, verification of LhhCDS is more complex than verifying LDS since the clique density and clique compactness are harder to evaluate and verify.

To address the above difficulties, we jointly consider the hh-clique compact number estimation and LhhCDS detection, so as to design a new iterative propose-prune-and-verify (IPPV) pipeline. IPPV is composed of the following iterative steps: (1) estimate bounds of hh-clique compact number to propose LhhCDS candidates efficiently; (2) use the bounds to prune vertices that are definitely not in any LhhCDS to narrow the candidate set; (3) use verification algorithm to ensure the exactness of IPPV for extracting LhhCDS; (4) remained candidates reenter the above steps until the top-kk LhhCDSes are found. To the best of our knowledge, this paper is the first to explore the LhhCDS detection. The key contributions of IPPV are as follows:

  1. (1)

    The initial hh-clique compact number bounds are proposed based on the structures of graphs, and we show that a convex programming, which provides hh-clique diminishingly dense decomposition, can be extended to tighten the bounds.

  2. (2)

    We propose a tentative graph decomposition method to deal with the case when a clique is spanning multiple subgraphs to generate correct decomposition proposals.

  3. (3)

    Efficient verification is the critical part since the verification is complex for verifying both the hh-clique density and hh-clique compactness. We propose a novel fast verification algorithm by carefully constructing a size-reduced flow network and using the maximum flow algorithm. We prove the correctness and efficiency of the proposed fast verification algorithm.

  4. (4)

    At last, we further extend the iterative propose-prune-and-verify pipeline to detect locally general pattern densest subgraphs. More than six patterns are investigated, showing the potential of detecting locally more general pattern densest subgraphs.

We theoretically verify the exactness and efficiency of the proposed algorithm and conduct extensive experiments with different quality measures on large real datasets to evaluate the algorithm.

2. related work

2.1. Densest Subgraph

The algorithms to the densest subgraph (DS) problem can be classified into two categories: exact algorithms and approximation algorithms. The DS problem can be solved in polynomial time by exact methods based on maximum flow, linear programming, or convex optimization. Picard et al. (Picard and Queyranne, 1982) and Goldberg (Goldberg, 1984) first introduced the maximum-flow-based exact algorithm for the densest subgraph problem. Charikar (Charikar, 2000) proposed an LP-based exact algorithm for the DS problem. The convex-optimization-based exact algorithm is proposed by Danisch et al. (Danisch et al., 2017) and can handle graphs containing tens of billions of edges. Fang et al. (Fang et al., 2019) improved the efficiency of the flow-based exact algorithm by locating the densest subgraph in a specific kk-core. Exact algorithms cannot scale well to large graphs, so a large number of works on faster approximation algorithms for the DS problem are also presented.

Charikar (Charikar, 2000) proposed a 2-approximation algorithm for the DS problem, which is known as the greedy peeling algorithm. Proving that the kmaxk_{max}-core is a 2-approximation solution to the DS problem, Fang et al. (Fang et al., 2019) improved the greedy peeling algorithm based on kmaxk_{max}-core. Inspired by the multiplicative weights update method, Boob et al. (Boob et al., 2019) designed an iterative version of the greedy peeling algorithm. Based on the MapReduce model, Bahmani et al. (Bahmani et al., 2012) proposed an 2(1+ϵ)2(1+\epsilon)-approximation algorithm, where ϵ>0\epsilon>0. Based on the dual of Charikar’s LP relaxation, Harb et al. (Harb et al., 2022) presented a new iterative algorithm for the DS problem. Chekuri et al. (Chekuri et al., 2022) proposed a flow-based approximation algorithm for the DS problem.
The DS problem has various variants focusing on different aspects and different types of graphs. Two recent surveys (Lanciano et al., 2023; Luo et al., 2023) detail different variations of the DS problem and their applications to different types of graphs, such as directed graphs (Charikar, 2000), labeled graphs (Fazzone et al., 2022), and uncertain graphs (Zou, 2013).

2.2. hh-clique Densest Subgraph

Tsourakakis (Tsourakakis, 2015) defined the notion of hh-clique density and introduced the hh-clique densest subgraph (CDS) problem. Mitzenmacher et al. (Mitzenmacher et al., 2015) presented a sampling scheme called the densest subgraph sparsifier, yielding a randomized algorithm that produces a well-approximate solution to the CDS problem. Fang et al. (Fang et al., 2019) proposed more efficient exact and approximation algorithms for the CDS problem. Sun et al. (Sun et al., 2020) aimed at developing near-optimal and exact algorithms for the CDS problem on large real-world graphs. They modified the Frank-Wolfe algorithm for CDS to their algorithm kClist++ and proved the effectiveness of the proposed algorithm.

2.3. Locally Densest Subgraph

The locally densest subgraph (LDS) problem is a variant of the densest subgraph (DS) problem. Qin et al. (Qin et al., 2015) proposed a method to discover the top-kk representative locally densest subgraphs of a graph. The method involves defining a parameter-free definition of an LDS, showing that the set of LDSes in a graph can be computed in polynomial time, and proposing three novel pruning strategies to reduce the search space of the algorithm. Trung et al. (Trung et al., 2023) observed the hierarchical structure of maximal ρ\rho-compact subgraphs and presented verification-free approaches to improve the efficiency of finding top-kk LDSes. Ma et al. (Ma et al., 2022) proposed a convex-programming-based algorithm called LDScvx to the LDS problem by introducing the concept of the compact number and using the relations of compactness to the LDS problem and a specific convex program. Capitalizing on previous results (Qin et al., 2015), Samusevich et al. (Samusevich et al., 2016) studied the local triangle densest subgraph (LTDS) problem, which extended the LDS model to triangle-based density. It’s worth noting that, in essence, LDS is a special instance of LhhCDS when h=2h=2; LTDS is a special instance of LhhCDS when h=3h=3.

3. Preliminaries

Given an undirected graph G=(V,E)G=(V,E), we use ψh(Vψh,Eψh)\psi_{h}(V_{\psi_{h}},E_{\psi_{h}}) to denote an hh-clique with |Vψh||V_{\psi_{h}}| vertices and |Eψh||E_{\psi_{h}}| edges. Ψh(G)\Psi_{h}(G) is the collection of hh-cliques of GG. dψh(G)d_{\psi_{h}}(G) denotes the hh-clique density of GG, dψh(G)=|Ψh(G)||V|d_{\psi_{h}}(G)=\frac{|\Psi_{h}(G)|}{|V|}, and degG(v,ψh)deg_{G}(v,\psi_{h}) is the hh-clique degree of vv, i.e., the number of hh-cliques containing vv. Given a subset SVS\subseteq V, G[S]=(S,E(S))G[S]=(S,E(S)) is the subgraph induced by SS, and E(S)=E(G)(S×S)E(S)=E(G)\cap(S\times S). Table 1 summarizes the main notations used in this paper.

Table 1. MAIN NOTATIONS
Notation Definition
G=(V,E)G=(V,E) a graph with vertex set VV and edge set EE
n,mn,m n=|V|,m=|E|n=|V|,m=|E|
G[S]G[S] the subgraph induced by SS
Ψh(G)\Psi_{h}(G) the collection of all hh-cliques of GG
ψh(Vψh,Eψh)\psi_{h}(V_{\psi_{h}},E_{\psi_{h}}) an hh-clique (VψhV_{\psi_{h}} is vertex set, EψhE_{\psi_{h}} is edge set)
dψh(G)d_{\psi_{h}}(G) the hh-clique density of GG, dψh(G)=|Ψh(G)||V|d_{\psi_{h}}(G)=\frac{|\Psi_{h}(G)|}{|V|}
ϕh(u)\phi_{h}(u) the hh-clique compact number of vertex uu
degG(v,ψh)deg_{G}(v,\psi_{h}) the hh-clique degree of vertex vv in GG
ϕ¯h(u)\overline{\phi}_{h}(u) an upper bound of ϕh(u)\phi_{h}(u) in GG
ϕ¯h(u)\underline{\phi}_{h}(u) a lower bound of ϕh(u)\phi_{h}(u) in GG
CP(G,h)CP(G,h) the convex programming of GG for hh-clique densest subgraph
α\alpha the weights distributed from hh-cliques to vertices
rr the weights received by each vertex

A densest subgraph in a local region not only means that such a subgraph is not included in any other denser subgraph, but also requires the inner density to be compact and evenly distributed. Qin et al. (Qin et al., 2015) proposed the concept of ρ\rho-compact, which gives a reasonable definition of locally densest subgraph. A graph GG is ρ\rho-compact when removing any subset SS from GG removes at least ρ×|S|\rho\times|S| edges. Considering the hh-clique density in a graph, we define an hh-clique ρ\rho-compact graph as:

Definition 0 (hh-clique ρ\rho-compact).

A graph G=(V,E)G=(V,E) is hh-clique ρ\rho-compact if GG is connected, and removing any subset of vertices SVS\subseteq V will result in the removal of at least ρ×|S|\rho\times|S| hh-cliques in GG, where ρ\rho is a non-negative real number.

If GG is hh-clique ρ\rho-compact, then the hh-clique degree of each vertex in GG is at least ρ\lceil\rho\rceil, because removing any vertex will remove at least ρ\rho hh-cliques. Besides, the hh-clique density of an hh-clique ρ\rho-compact graph is at least ρ\rho. For any ρ^>ρ\hat{\rho}>\rho, an hh-clique ρ^\hat{\rho}-compact graph is also an hh-clique ρ\rho-compact graph, so we define the hh-clique compactness of a graph GG as the largest ρ\rho such that GG is hh-clique ρ\rho-compact. A subgraph G[S]G[S] of GG is a maximal hh-clique ρ\rho-compact subgraph if none of the supergraphs of G[S]G[S] is hh-clique ρ\rho-compact.

Proposition 1.

If a graph GG has hh-clique density dψh(G)d_{\psi_{h}}(G), then the hh-clique compactness of the graph is at most dψh(G)d_{\psi_{h}}(G), i.e., ρdψh(G)\rho\leq d_{\psi_{h}}(G).

Proof.

Suppose the compactness of GG is higher than dψh(G)d_{\psi_{h}}(G), then removing all vertices in GG will result in the removal of more than dψh(G)×|V|d_{\psi_{h}}(G)\times|V| hh-cliques, which means that the hh-clique density of GG must be higher than dψh(G)d_{\psi_{h}}(G), and contradicts the fact that the hh-clique density of GG is dψh(G)d_{\psi_{h}}(G). ∎

Proposition 1 clarifies that the hh-clique compactness of a graph cannot be greater than dψh(G)d_{\psi_{h}}(G). We are then interested in finding the locally hh-clique dense and compact subgraph G[S]G[S] in GG. We formally define a locally hh-clique densest subgraph as follows.

Definition 0 (Locally hh-clique densest subgraph (LhhCDS)).

A subgraph G[S]G[S] of GG is a locally hh-clique densest subgraph of GG if the following two conditions hold:

  1. 1.

    G[S]G[S] is an hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph;

  2. 2.

    There does not exist a supergraph G[S]G[S^{\prime}] of G[S]G[S] (SSS^{\prime}\supsetneq S), such that G[S]G[S^{\prime}] is also hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact.

Proposition 2 (Disjoint property).

Suppose G[S]G[S] and G[S]G[S^{\prime}] are two LhhCDSes in GG, we have SS=S\cap S^{\prime}=\emptyset.

Proof.

Without loss of generality, we assume dψh(G[S])dψh(G[S])d_{\psi_{h}}(G[S])\geq d_{\psi_{h}}(G[S^{\prime}]). We prove the proposition by contradiction. Suppose SSS\cap S^{\prime}\neq\emptyset. According to the definition of LhhCDS, G[S]G[S]G[S]\nsubseteq G[S^{\prime}]. Since G[S]G[S] and G[S]G[S^{\prime}] are two LhhCDSes, the graph induced by SSS\cup S^{\prime} is a connected hh-clique dψh(G[S])d_{\psi_{h}}(G[S^{\prime}]) - compact graph which is larger than SS^{\prime}. This contradicts the fact that G[S]G[S^{\prime}] is an LhhCDS. ∎

Proposition 2 proves that all LhhCDSes in a graph GG are pairwise disjoint. Therefore, the number of LhhCDSes of GG is bounded by |V||V|, and the LhhCDSes can be used to identify all the non-overlapping hh-clique dense regions of a graph.

Most applications in the real world usually require finding the top-kk dense regions of a graph (Qin et al., 2015), so we focus on finding the top-kk LhhCDSes with the largest densities. When kk is large enough, all LhhCDSes can be found. We formulate the problem as follows.

Definition 0 (Locally hh-clique densest subgraph Problem (LhhCDS Problem)).

Given a graph GG, an integer hh, and an integer kk, the locally hh-clique densest subgraph problem is to compute the top-kk LhhCDSes ranked by the hh-clique density in GG.

Figure 2 shows an example of the LhhCDS. We use S1S_{1},S2S_{2}, and S3S_{3} to represent {v12,,v17}\{v_{12},...,v_{17}\}, {v2,,v6}\{v_{2},...,v_{6}\}, and {v8,,v11}\{v_{8},...,v_{11}\}. When h=3h=3, the top-11 L33CDS is G[S1]G[S_{1}], which has a 3-clique density of 136\frac{13}{6}, since there are thirteen 3-cliques in it. The top-11 and top-22 L44CDSes are G[S2]G[S_{2}] and G[S1]G[S_{1}]. They both have a 4-clique density of 11.

Refer to caption
Figure 2. An example of the locally hh-clique densest subgraph

Note that an edge in a graph GG is a 2-clique; therefore, the intensively studied LDS problem (Qin et al., 2015; Ma et al., 2022) can be seen as an instance of the LhhCDS problem when h=2h=2. Similarly, the LTDS problem (Samusevich et al., 2016) is exactly the L33CDS problem. Therefore, the LhhCDS problem studied in this paper provides a more general framework, and we boldly infer that our method can be generalized from hh-clique to general patterns, which means that we can give an algorithmic framework to solve a wider range of locally pattern densest problems.

4. LhhCDS Discovery

In this section, we focus on the design of an LhhCDS discovery algorithm. According to the concept of hh-clique compactness, each subgraph of a graph GG has its own compactness. However, a vertex may be contained in various subgraphs with different compactness. Therefore, we introduce the concept of hh-clique compact number for each vertex in a graph.

Definition 0 (hh-clique compact number).

Given a graph G=(V,E)G=(V,E), for each vertex uVu\in V, the hh-clique compact number of uu is the largest ρ\rho such that uu is contained in an hh-clique ρ\rho-compact subgraph of GG, denoted by ϕh(u)\phi_{h}(u).

In the following theorem, we prove the relationship between the LhhCDS and the hh-clique compact numbers of vertices within it.

Theorem 2.

Given an LhhCDS G[S]G[S] in GG, for each vertex uSu\in S, the hh-clique compact number of uu is equal to the hh-clique density of G[S]G[S], i.e., ϕh(u)=dψh(G[S])\phi_{h}(u)=d_{\psi_{h}}(G[S]).

Proof.

As G[S]G[S] is a maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph, for each uSu\in S, there exists no other subgraph G[S]G[S^{\prime}] containing uu such that G[S]G[S^{\prime}] is an hh-clique ρ\rho-compact subgraph with ρ>dψh(G[S])\rho>d_{\psi_{h}}(G[S]). We prove the claim by contradiction. Suppose G[S]G[S^{\prime}] is an hh-clique ρ\rho-compact subgraph with ρ>dψh(G[S])\rho>d_{\psi_{h}}(G[S]) and uSu\in S^{\prime}, we have dψh(G[S])ρ>dψh(G[S])d_{\psi_{h}}(G[S^{\prime}])\geq\rho>d_{\psi_{h}}(G[S]). First, SSS^{\prime}\subseteq S, because G[S]G[S] is a maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph and SSS^{\prime}\cap S\neq\emptyset. If we remove U=S\SU=S\backslash S^{\prime} from G[S]G[S], the number of hh-cliques removed is |Ψh(G[S])||Ψh(G[S])|=dψh(G[S])×|S|dψh(G[S])×|S|<dψh(G[S])×(|S||S|)=dψh(G[S])×|U||\Psi_{h}(G[S])|-|\Psi_{h}(G[S^{\prime}])|=d_{\psi_{h}}(G[S])\times|S|-d_{\psi_{h}}(G[S^{\prime}])\times|S^{\prime}|<d_{\psi_{h}}(G[S])\times(|S|-|S^{\prime}|)=d_{\psi_{h}}(G[S])\times|U|. This contradicts that G[S]G[S] is hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact. Hence, dψh(G[S])d_{\psi_{h}}(G[S]) is the hh-clique compact number of all vertices in SS. ∎

Based on Theorem 2, once we get the hh-clique compact number of each vertex in GG, we can arrange the vertices in descending order based on the hh-clique compact number and then check whether the vertices with the same hh-clique compact number satisfy the definition of LhhCDS to obtain top-kk LhhCDSes. For example, in Figure 2, we list the 33-clique compact numbers of all vertices of GG. It is obvious that G[S1]G[S_{1}] and G[S2]G[S_{2}] are both L33CDSes.

However, computing the hh-clique compact numbers directly is difficult. So we jointly consider hh-clique compact number and LhhCDS to design a new “iterative propose-prune-and-verify” pipeline for top-kk LhhCDS detection, which is called IPPV. In the proposal part, the true LhhCDSes are allowed to be encapsulated in the proposed candidates, but without missing true LhhCDSes. Proper graph decomposition methods should be designed, since a clique may span multiple subgraphs to be decomposed. In the verification part, each correct LhhCDS should be outputted, and LhhCDS candidates that can be further pruned should be indicated.

Refer to caption
Figure 3. Flow diagram of IPPV

Figure 3 gives the flow diagram of IPPV. It has four main parts: 1) calculate the initial bounds of the hh-clique compact numbers of vertices; 2) iteratively propose all LhhCDS candidates (generating approximate hh-clique compact numbers; decomposing the graph tentatively; grouping vertices and tightening bounds); 3) prune invalid vertices; 4) verify the locally densest property of all candidates to find top-kk LhhCDSes, and we, in particular, propose a basic algorithm and a fast algorithm for verification. As a general algorithm framework, all blue parts are extensions of existing methods, while all orange parts are our proof and innovation for this problem.

4.1. Initial hh-clique Compact Number Bounds

In order to derive LhhCDS candidates, we first give initial upper and lower bounds of hh-clique compact number ϕh(u)\phi_{h}(u). Specifically, we denote ϕ¯h(u)\overline{\phi}_{h}(u) and ϕ¯h(u)\underline{\phi}_{h}(u) as the upper and lower bound of ϕh(u)\phi_{h}(u) in GG. We use (k,ψh)(k,\psi_{h})-core(Fang et al., 2019), which is a cohesive subgraph model, to compute the initial bounds.

Definition 0 ((k,ψh)(k,\psi_{h})-core).

Given a graph GG, the (k,ψh)(k,\psi_{h})-core is the largest subgraph of GG, in which the hh-clique degree of each vertex is at least kk. The hh-clique-core number of a vertex uVu\in V, denoted by coreG(u,ψh)core_{G}(u,\psi_{h}), is the highest kk of (k,ψh)(k,\psi_{h})-core containing uu.

Proposition 3.

hh-clique compact number ϕh(u)\phi_{h}(u) has the following relations to the hh-clique-core number coreG(u,ψh)core_{G}(u,\psi_{h}).

  1. (1)

    A (k,ψh)(k,\psi_{h})-core subgraph is hh-clique kh\frac{k}{h}-compact. Thus, for any uVu\in V, ϕ¯h(u)\underline{\phi}_{h}(u) can be assigned as coreG(u,ψh)h\frac{core_{G}(u,\psi_{h})}{h};

  2. (2)

    If G[S]G[S] is an LhhCDS of GG, for all uSu\in S, coreG(u,ψh)dψh(G[S])core_{G}(u,\psi_{h})\geq d_{\psi_{h}}(G[S]). Thus, for any uVu\in V, ϕ¯h(u)\overline{\phi}_{h}(u) can be assigned as coreG(u,ψh)core_{G}(u,\psi_{h}).

Proof.

Any vertex in a (k,ψh)(k,\psi_{h})-core subgraph is contained in at least kk hh-cliques. By removal of any subset SS from the (k,ψh)(k,\psi_{h})-core, at least kh×|S|\frac{k}{h}\times|S| hh-cliques would be removed. For any uVu\in V, there is an hh-clique coreG(u,ψh)h\frac{core_{G}(u,\psi_{h})}{h}-compact subgraph of GG that contains uu, then coreG(u,ψh)h\frac{core_{G}(u,\psi_{h})}{h} is a lower bound of ϕh(u)\phi_{h}(u). The second relation can be obtained from the fact that an LhhCDS G[S]G[S] in a graph GG is a (dψh(G[S])\lceil d_{\psi_{h}}(G[S])\rceil,ψh\psi_{h})-core subgraph of GG. For any uVu\in V, if an LhhCDS contains uu, then coreG(u,ψh)core_{G}(u,\psi_{h}) is an upper bound of ϕh(u)\phi_{h}(u). ∎

Input: G=(V,E),hG=(V,E),h
Output: ϕ¯h,ϕ¯h\overline{\phi}_{h},\underline{\phi}_{h}
1 foreach uVu\in V do  compute coreG(u,ψh)core_{G}(u,\psi_{h}) ;
2 foreach uVu\in V do
3       ϕ¯h(u)coreG(u,ψh)\overline{\phi}_{h}(u)\leftarrow core_{G}(u,\psi_{h}); ϕ¯h(u)coreG(u,ψh)h\underline{\phi}_{h}(u)\leftarrow\frac{core_{G}(u,\psi_{h})}{h};
return ϕ¯h\overline{\phi}_{h}, ϕ¯h\underline{\phi}_{h};
Algorithm 1 Bound initialization: InitializeBd

According to Proposition 3, we can get the initial bounds of hh-clique compact number ϕh(u)\phi_{h}(u) of GG (Lines 2-3) by Algorithm 1.

4.2. Candidate LhhCDS Proposal

The initial upper and lower bounds for hh-clique compact numbers from hh-clique-core numbers are relatively loose. In this section, we focus on how to tighten the bounds and propose LhhCDS candidates.

4.2.1. Overall Algorithm for Candidate LhhCDS Proposal

The overall candidate LhhCDS proposal algorithm is given in Algorithm 2. Approximate hh-clique compact number is calculated via SEQ-kClist++ (Line 1); the preliminary partition of GG and recalculated values are obtained via TentativeGD (Line 2); tighter upper and lower bounds for hh-clique compact numbers and the further partition of GG (stable hh-clique group) are calculated via DeriveSG (Line 3). The sub-procedures introduce each of the above functions (Lines 5-33).

Input: G=(V,E)G=(V,E), number of iterations TT, ϕ¯h,ϕ¯h\overline{\phi}_{h},\underline{\phi}_{h}
Output: 𝒮,ϕ¯h,ϕ¯h\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h}
1 (α,r\alpha,r)\leftarrow SEQ-kClist++ (G,TG,T) ;
2 𝒮^,α,r\hat{\mathcal{S}},\alpha,r\leftarrow TentativeGD (G,α,rG,\alpha,r);
3 𝒮,ϕ¯h,ϕ¯h\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h}\leftarrow DeriveSG (𝒮^,α,r,ϕ¯h,ϕ¯h\hat{\mathcal{S}},\alpha,r,\overline{\phi}_{h},\underline{\phi}_{h});
4 return 𝒮,ϕ¯h,ϕ¯h\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h};
5 Procedure SEQ-kClist++(G,TG,T)
6       foreach  hh-clique ψh\psi_{h} in GG do  αu,ψh1h\alpha_{u,\psi_{h}}\leftarrow\frac{1}{h} , uVψh\forall u\in V_{\psi_{h}} ;
7       foreach  uVu\in V do  r(u)ψhΨh(G):uψhαu,ψhr(u)\leftarrow\sum_{\psi_{h}\in\Psi_{h}(G):u\in\psi_{h}}\alpha_{u,\psi_{h}} ;
8       foreach  iteration t=1,,Tt=1,...,T do
9             γt1t+1\gamma_{t}\leftarrow\frac{1}{t+1}; α(1γt)α\alpha\leftarrow(1-\gamma_{t})*\alpha; r(1γt)rr\leftarrow(1-\gamma_{t})*r;
10             foreach  hh-clique ψh\psi_{h} do
11                   vminargminvψhr(v)v_{min}\leftarrow argmin_{v\in\psi_{h}}r(v);
12                   αvmin,ψhαvmin,ψh+γt\alpha_{v_{min},\psi_{h}}\leftarrow\alpha_{v_{min},\psi_{h}}+\gamma_{t}; r(vmin)r(vmin)+γtr(v_{min})\leftarrow r(v_{min})+\gamma_{t};
13                  
14            
15      return (α,r)(\alpha,r);
16 Procedure TentativeGD(G,α,rG,\alpha,r)
17       sort vertices in VV in descending order according to rr;
18       P{p|p=argmaxpqndψh(G[V[1:q]])}P\leftarrow\{p|p=argmax_{p\leq q\leq n}d_{\psi_{h}}(G[V_{[1:q]}])\} ;
19       𝒮^\hat{\mathcal{S}}\leftarrow partition VV according to PP;
20       foreach ψhΨh(G)\psi_{h}\in\Psi_{h}(G) do
21             pmax{1il:ψhS^i}p\leftarrow\max\left\{1\leq i\leq l:\psi_{h}\cap\hat{S}_{i}\neq\emptyset\right\};
22             suψh\S^pαu,ψhs\leftarrow\sum_{u\in\psi_{h}\backslash\hat{S}_{p}}\alpha_{u,\psi_{h}};
23             uψh\S^p,αu,ψh0\forall u\in\psi_{h}\backslash\hat{S}_{p},\alpha_{u,\psi_{h}}\leftarrow 0;
24             uψhS^p,αu,ψhαu,ψh+s|ψhS^p|\forall u\in\psi_{h}\cap\hat{S}_{p},\alpha_{u,\psi_{h}}\leftarrow\alpha_{u,\psi_{h}}+\frac{s}{\left|\psi_{h}\cap\hat{S}_{p}\right|};
25      foreach uVu\in V do  r(u)ψhΨh(G):uψhαu,ψhr(u)\leftarrow\sum_{\psi_{h}\in\Psi_{h}(G):u\in\psi_{h}}\alpha_{u,\psi_{h}} ;
26       return 𝒮^\hat{\mathcal{S}}, α\alpha, rr;
27 Procedure DeriveSG(𝒮^,α,r,ϕ¯h,ϕ¯h\hat{\mathcal{S}},\alpha,r,\overline{\phi}_{h},\underline{\phi}_{h})
28       while 𝒮^\hat{\mathcal{S}} is not empty do
29             SS^{\prime}\leftarrow pop out the first candidate from 𝒮^\hat{\mathcal{S}}; SSSS\leftarrow S\cup S^{\prime};
30             if SS is a stable hh-clique group then  put SS into 𝒮\mathcal{S}; SS\leftarrow\emptyset ;
31            
32      foreach  S𝒮S\in\mathcal{S} do
33             foreach  uSu\in S do
34                   ϕ¯h(u)min{ϕ¯h(u),maxvSr(v)}\overline{\phi}_{h}(u)\leftarrow min\{\overline{\phi}_{h}(u),max_{v\in S}r(v)\};
35                   ϕ¯h(u)max{ϕ¯h(u),minvSr(v)}\underline{\phi}_{h}(u)\leftarrow max\{\underline{\phi}_{h}(u),min_{v\in S}r(v)\};
36            
37      return 𝒮\mathcal{S}, ϕ¯h\overline{\phi}_{h}, ϕ¯h\underline{\phi}_{h};
Algorithm 2 Candidate LhhCDS proposal: ProposeCL

4.2.2. Generate Approximate hh-clique Compact Number

Inspired by a classical convex programming (Danisch et al., 2017; Ma et al., 2022), we propose a convex programming for finding the diminishingly-hh-clique-dense decomposition, and prove that the optimal solution of our convex programming is exactly the hh-clique compact number of a graph GG.

Intuitively, the aim of CP(G,hG,h) is that each hh-clique ψhΨh(G)\psi_{h}\in\Psi_{h}(G) tries to distribute its unit weight among its hh vertices such that the sum of the weight received by the vertices are as even as possible. We use αu,ψh\alpha_{u,\psi_{h}} to represent the weight assigned to uu from hh-clique ψh\psi_{h} and r(u)r(u) to denote the sum of the weights assigned to uu from hh-cliques that contain uu. This intuition suggests that we can consider the objective function: QG,h(α):=uVr(u)2Q_{G,h}(\alpha):=\sum_{u\in V}r(u)^{2}, in which r(u)=ψhΨh(G):uψhαu,ψhr(u)=\sum_{\psi_{h}\in\Psi_{h}(G):u\in\psi_{h}}\alpha_{u,\psi_{h}}, for all uVu\in V. The convex programming is:

CP(G,h):=min{QG,h(α):α𝒟(G,h)},\mathrm{CP}(G,h):=\min\left\{Q_{G,h}(\alpha):\alpha\in\mathcal{D}(G,h)\right\},

where the domain is:

𝒟(G,h):={αψhΨh(G)+ψh:ψhΨh(G),uψhαu,ψh=1}.\mathcal{D}(G,h):=\left\{\alpha\in\prod_{\psi_{h}\in\Psi_{h}(G)}\mathbb{R}_{+}^{\psi_{h}}:\forall\psi_{h}\in\Psi_{h}(G),\sum_{u\in\psi_{h}}\alpha_{u,\psi_{h}}=1\right\}.

Here, we demonstrate that the hh-clique compact numbers can be derived from the optimal solution of CP(G,hG,h).

Theorem 4.

Let (α,r)(\alpha^{*},r^{*}) be an optimal solution to CP(G,hG,h). Then, uV\forall u\in V, ϕh(u)=r(u)\phi_{h}(u)=r^{*}(u), i.e., each r(u)r^{*}(u) in rr^{*} is exactly the hh-clique compact number of uu.

Proof.

For any vertex uVu\in V, let S+={vV|r(v)>r(u)}S^{+}=\{v\in V|r^{*}(v)>r^{*}(u)\}, S=={vV|r(v)=r(u)}S^{=}=\{v\in V|r^{*}(v)=r^{*}(u)\}, S={vV|r(v)<r(u)}S^{-}=\{v\in V|r^{*}(v)<r^{*}(u)\}, uS=u\in S^{=}. S+=S^{+=} denotes the vertices that are contained by hh-cliques that including vertices both in S+S^{+} and S=S^{=}. We prove G[S+S=]G[S^{+}\cup S^{=}] is an hh-clique r(u)r^{*}(u)-compact subgraph. First, removing S=S^{=} from G[S+S=]G[S^{+}\cup S^{=}] will result in the removal of r(u)×|S=|r^{*}(u)\times|S^{=}| cliques in G[S+S=]G[S^{+}\cup S^{=}]. We know that for all (v,w)E(S+×S=)(v,w)\in E\cap(S^{+}\times S^{=}), r(v)>r(w)r^{*}(v)>r^{*}(w) and αv,ψh(v,wψh)=0\alpha_{v,\psi_{h}(v,w\in\psi_{h})}=0. Otherwise, if there exists (v,w)E(S+×S=)(v,w)\in E\cap(S^{+}\times S^{=}) such that αv,ψh(v,wψh)>0\alpha_{v,\psi_{h}(v,w\in\psi_{h})}>0, there exists r(v)r(w)>ϵ>0r^{*}(v)-r^{*}(w)>\epsilon>0. We can reduce αv,ψh(v,wψh)\alpha_{v,\psi_{h}(v,w\in\psi_{h})} by ϵ\epsilon and increase αw,ψh(v,wψh)\alpha_{w,\psi_{h}(v,w\in\psi_{h})} by ϵ\epsilon, and the objective function be decreased by 2ϵ(r(v)r(w)ϵ)2\epsilon(r^{*}(v)-r^{*}(w)-\epsilon), which contradicts the optimality of rr^{*}. Similarly, we can prove that for all (v,w)E(S=×S)(v,w)\in E\cap(S^{=}\times S^{-}), r(v)>r(w)r^{*}(v)>r^{*}(w) and αv,ψh(v,wψh)=0\alpha_{v,\psi_{h}(v,w\in\psi_{h})}=0. Therefore, r(u)×|S=|=ψhΨh(G):vS=,vψhαv,ψh=|Ψh(G[S=])Ψh(G[S+=])|r^{*}(u)\times|S^{=}|=\sum_{\psi_{h}\in\Psi_{h}(G):v\in S^{=},v\in\psi_{h}}\alpha_{v,\psi_{h}}=|\Psi_{h}(G[S^{=}])\cup\Psi_{h}(G[S^{+=}])|. r(u)×|S=|r^{*}(u)\times|S^{=}| is exactly the number of hh-cliques to be removed when removing S=S^{=} from G[S+S=]G[S^{+}\cup S^{=}]. Meanwhile, for any SS+S=S^{\prime}\subseteq S^{+}\cup S^{=},we have that r(u)×|S|ψhΨh(G):vS,vψhαv,ψhψhΨh(G[S+S=]):vS,vψh1r^{*}(u)\times|S^{\prime}|\leq\sum_{\psi_{h}\in\Psi_{h}(G):v\in S^{\prime},v\in\psi_{h}}\alpha_{v,\psi_{h}}\leq\sum_{\psi_{h}\in\Psi_{h}(G[S^{+}\cup S^{=}]):v\in S^{\prime},v\in\psi_{h}}1, which means removing any SS+S=S^{\prime}\subseteq S^{+}\cup S^{=} from G[S+S=]G[S^{+}\cup S^{=}] will result in the removal of at least r(u)×|S|r^{*}(u)\times|S^{\prime}| hh-cliques. Therefore, G[S+S=]G[S^{+}\cup S^{=}] is an hh-clique r(u)r^{*}(u)-compact subgraph. Analogously, we can prove that for any other subset S′′S^{\prime\prime} containing uu, G[S′′]G[S^{\prime\prime}] is an hh-clique ρ\rho-compact subgraph, where ρr(u)\rho\leq r^{*}(u), by contradiction. Therefore, r(u)r^{*}(u) is the largest ρ\rho such that uu is contained in an hh-clique ρ\rho-compact subgraph of GG, which is exactly the hh-clique compact number of uu. ∎

Refer to caption
Figure 4. An example of the relationship between r(u)r^{*}(u) and ϕh(u)\phi_{h}(u) of a vertex uu in GG

Consider the convex programming CP(G,3G,3) for GG in Figure 2, we use v2v_{2} as an example, shown in Figure 4. The 33-clique compact number of v2v_{2} is 2, and the optimal solution r(v2)r^{*}(v_{2}) value is also 2. It is clear that for each uVu\in V, r(u)r^{*}(u) is exactly ϕh(u)\phi_{h}(u).

Theorem 5.

The locally hh-clique densest subgraph problem can be solved in polynomial time for any given positive integer hh.

Proof.

According to Theorem 4, the LhhCDS problem is equivalent to CP(G,hG,h). CP(G,hG,h) is a convex quadratic programming problem, which has polynomial time solutions (Goldfarb and Liu, 1991; Boyd and Vandenberghe, 2010). Therefore, the LhhCDS problem is polynomial-time solvable. ∎

However, CP(G,hG,h) is a problem with |Ψh||\Psi_{h}| equality constraints and n|Ψh|n\cdot|\Psi_{h}| variables. The time complexity of the classical polynomial-time solution of CP(G,hG,h) is O((n|Ψh|)3L)O((n\cdot|\Psi_{h}|)^{3}\cdot L) (Goldfarb and Liu, 1991), which requires long running time. Therefore, exactly attaining the (α,r)(\alpha^{*},r^{*}) is difficult, so we use the approximate solution (α,r)(\alpha,r) of CP(G,hG,h) to tighten the hh-clique compact bounds. Frank-Wolfe-based (FW-based) algorithm is efficient for finding approximate solutions to CP(GG) (Ma et al., 2022). However, FW-based algorithm for hh-clique densest requires a large amount of memory. SEQ-kClist++(Sun et al., 2020) is better for approximately calculating αu,ψh\alpha_{u,\psi_{h}} for each hh-clique ψh\psi_{h}, uψhu\in\psi_{h}, as well as r(u)r(u) for each vertex uu. All αu,ψh\alpha_{u,\psi_{h}} are initialized to 1h\frac{1}{h} (Line 6). r(u)r(u) stores the sum over all αu,ψh\alpha_{u,\psi_{h}}’s such that ψh\psi_{h} contains uu (Line 7). At each iteration, α\alpha and rr are modified simultaneously as follows. For each hh-clique ψh\psi_{h}, we find the minimum r(vmin)r(v_{min}) among ψh\psi_{h}, and new values for the αvmin,ψh\alpha_{v_{min},\psi_{h}} and the r(vmin)r(v_{min}) are computed as convex combinations (Lines 8-12).

4.2.3. Tentative Graph Decomposition

After getting approximate (α,r)(\alpha,r), we can derive a graph decomposition from the given (α,r\alpha,r).

Proposition 4.

Given an LhhCDS G[S]G[S] in GG, (u,v)E\forall(u,v)\in E, if uSu\in S and vV\Sv\in V\backslash S, we have ϕh(u)>ϕh(v)\phi_{h}(u)>\phi_{h}(v).

Considering vertices adjacent to G[S1]G[S_{1}] but not in G[S1]G[S_{1}] in Figure 2, such as v11v_{11} and v18v_{18}, their 33-clique compact numbers fulfill Proposition 4: ϕ3(v11)=12<136\phi_{3}(v_{11})=\frac{1}{2}<\frac{13}{6}, ϕ3(v18)=1<136\phi_{3}(v_{18})=1<\frac{13}{6}. Proposition 4 is helpful for choosing LhhCDSes from all subgraphs.

We then propose TentativeGD to generate tentative graph decomposition for proposing LhhCDS. The vertices in VV are sorted based on rr values descendingly (Line 15). The initial partition 𝒮^\hat{\mathcal{S}} of the graph is extracted based on the descending order (Lines 16-17). For each ψhΨh(G)\psi_{h}\in\Psi_{h}(G), if the clique ψh\psi_{h} is contained in multiple vertex sets, the vertex set with the largest set index will be recorded as pp, and the α\alpha value of ψh\psi_{h} will be redistributed to vertices in S^p\hat{S}_{p} (Lines 18-22). In other words, for the convenience of partition, the α\alpha value of ψh\psi_{h} straddling multiple vertex sets is redistributed to a vertex set with the lowest rr value. Finally, the rr values of all vertices in VV are recalculated (Line 23).

4.2.4. Stable hh-clique Group Derivation

After getting the initial bounds of hh-clique compact numbers in InitializeBd and a preliminary partition of the graph in TentativeGD, we consider obtaining tighter bounds of hh-clique compact numbers and a further partition of the graph, to calculate LhhCDS candidates. Inspired by two concepts, stable subset (Danisch et al., 2017) and stable group (Ma et al., 2022), for solving the hh-clique densest subgraph problem, we propose the definition of the stable hh-clique group.

Definition 0 (stable hh-clique group).

Given a feasible solution (α,r)(\alpha,r) to CP(G,hG,h), a stable hh-clique group with respect to (α,r)(\alpha,r) is a non-empty vertex group SVS\in V satisfying the following conditions:

  1. (1)

    For any vV\Sv\in V\backslash S, r(v)>maxuSr(u)r(v)>max_{u\in S}r(u) or r(v)<minuSr(u)r(v)<min_{u\in S}r(u);

  2. (2)

    For any vVv\in V, if r(v)>maxuSr(u)r(v)>max_{u\in S}r(u), ψh(u,vψh),αv,ψh=0\forall\psi_{h}(u,v\in\psi_{h}),\alpha_{v,\psi_{h}}=0;

  3. (3)

    For any vVv\in V, if r(v)<minuSr(u)r(v)<min_{u\in S}r(u), ψh(u,vψh),αu,ψh=0\forall\psi_{h}(u,v\in\psi_{h}),\alpha_{u,\psi_{h}}=0.

Refer to caption
Figure 5. The relationship between stable 33-clique subset \mathcal{B} and stable 33-clique group 𝒮\mathcal{S}

The concept of stable hh-clique subset \mathcal{B} is related to stable hh-clique group 𝒮\mathcal{S}, and the relationship between stable hh-clique subset and stable hh-clique group can be shown in Figure 5 with h=3h=3. All stable hh-clique groups are disjoint, and a stable hh-clique subset is the union of the previous stable hh-clique subset and the first stable hh-clique group outside this previous stable hh-clique subset. Either \mathcal{B} or 𝒮\mathcal{S} can form a consecutive subsequence of the whole sequence, and we only use the stable hh-clique group in our algorithm.

Theorem 7.

Given a feasible solution (α,r)(\alpha,r) to CP(G,hG,h) and a stable hh-clique group SS with respect to (α,r)(\alpha,r), for all vSv\in S, we have that minuSr(u)ϕh(v)maxuSr(u)min_{u\in S}r(u)\leq\phi_{h}(v)\leq max_{u\in S}r(u).

Proof.

According to Theorem 4, for all uV,r(u)=ϕh(u)u\in V,r^{*}(u)=\phi_{h}(u). Suppose there exists a vertex vSv\in S such that r(v)=ϕh(v)<minuSr(u)r(v)r^{*}(v)=\phi_{h}(v)<min_{u\in S}r(u)\leq r(v). Since uVr(u)=uVr(u)\sum_{u\in V}r(u)=\sum_{u\in V}r^{*}(u), correspondingly, there must exist another vertex wVw\in V, r(w)=ϕh(w)>r(w)r^{*}(w)=\phi_{h}(w)>r(w). The difference between r(w)r(w) and r(w)r^{*}(w) means that there exists ψh\psi_{h} containing both vv and ww, αv,ψh>0\alpha_{v,\psi_{h}}>0. Since SS is a stable hh-clique group, according to the third condition in Definition 6, r(w)>minuSr(u)r(w)>min_{u\in S}r(u). There exists ϵ>0\epsilon>0, we can increase r(v)r^{*}(v) by ϵ\epsilon and decrease r(w)r^{*}(w) by ϵ\epsilon to decrease the value of the objective function. This contradicts that rr^{*} is the optimal solution to CP(G,hG,h). By the same token, for all uV,r(u)=ϕh(u)maxuSr(u)u\in V,r^{*}(u)=\phi_{h}(u)\leq max_{u\in S}r(u). ∎

Based on Theorem 7, the stable hh-clique groups can give tighter bounds of hh-clique compact numbers, so we propose DeriveSG algorithm to derive the stable hh-clique groups, which are our LhhCDS candidates. In DeriveSG, the subsets in 𝒮^\hat{\mathcal{S}} are checked one by one; if the subset is a stable hh-clique group, it will be pushed into the set of stable hh-clique groups 𝒮\mathcal{S}; otherwise, in the next iteration, the current subset SS will be merged with the next subset SS^{\prime} (Lines 26–28). Then, the upper and lower bounds of hh-clique compact numbers are updated based on Theorem 7 (Lines 29–32).

4.3. Pruning for Candidate LhhCDS Derivation

We prove that the following proposition can help to prune invalid vertices that are certainly not contained by any LhhCDS.

Proposition 5.

For any vVv\in V, vv is not contained by any LhhCDS in GG if either of the following two conditions is satisfied.

  1. (1)

    If there exists (u,v)E(u,v)\in E, such that ϕ¯h(u)>ϕ¯h(v)\underline{\phi}_{h}(u)>\overline{\phi}_{h}(v), vv is invalid;

  2. (2)

    Let GG^{\prime} denote the graph after pruning all invalid vertices in condition (1). ϕ¯hG(u)\overline{\phi}_{h}^{G}(u) is the upper bound of ϕhG(u)\phi_{h}^{G}(u) in GG. For any uu in GG^{\prime}, if ϕ¯hG(v)<ϕ¯h(v)\overline{\phi}_{h}^{G^{\prime}}(v)<\underline{\phi}_{h}(v), vv is invalid.

Proof.

First, we prove condition (1). For any u,vVu,v\in V, (u,v)E(u,v)\in E, if ϕ¯h(u)>ϕ¯h(v)\underline{\phi}_{h}(u)>\overline{\phi}_{h}(v), then ϕh(u)>ϕh(v)\phi_{h}(u)>\phi_{h}(v). According to Proposition 4, vv is not contained in LhhCDS, i.e., vv is invalid.

For condition (2), ϕ¯hG(u)<ϕ¯h(u)\overline{\phi}_{h}^{G^{\prime}}(u)<\underline{\phi}_{h}(u) means that to form an hh-clique ϕ¯h(u)\underline{\phi}_{h}(u)-compact subgraph containing uu, some already pruned vertices are needed. So using the vertices in GG^{\prime} only cannot form an hh-clique ϕ¯h(u)\underline{\phi}_{h}(u)-compact subgraph containing uu. Therefore, uu cannot be contained by any LhhCDS in GG, i.e., vv is invalid. ∎

According to Proposition 5, we design Pruning Rule to prune invalid vertices by condition (1) and condition (2). An example can be seen in Figure 2 with h=3h=3. v9v_{9} and v11v_{11} can be pruned, because for edge (v6,v9)(v_{6},v_{9}), ϕ¯3(v6)=2>ϕ¯3(v9)=12\underline{\phi}_{3}(v_{6})=2>\overline{\phi}_{3}(v_{9})=\frac{1}{2}; for edge (v11,v12)(v_{11},v_{12}), ϕ¯3(v12)=136>ϕ¯3(v11)=12\underline{\phi}_{3}(v_{12})=\frac{13}{6}>\overline{\phi}_{3}(v_{11})=\frac{1}{2}. Analogously, the vertices v1v_{1}, v7v_{7}, v18v_{18}, v19v_{19}, and v20v_{20} are also pruned by condition (1).

We denote the graph after pruning by GG^{\prime}. Some vertices in GG^{\prime} become invalid vertices, because any LhhCDS in GG containing these vertices needs to include some already pruned vertices, which can not form an LhhCDS in GG^{\prime}. Therefore, we utilize condition (2). Based on Proposition 3, for any vertex uV(G)u\in V(G^{\prime}), coreG(u,ψh)core_{G}(u,\psi_{h}) provides an upper bound of ϕhG(u)\phi_{h}^{G^{\prime}}(u). For example, after v9v_{9} and v11v_{11} are pruned, the upper bounds of 33-clique compact numbers of v8v_{8} and v10v_{10} in graph GG^{\prime} are ϕ¯3G(v8)=ϕ¯3G(v10)=0<ϕ¯3(v8)=ϕ¯3(v10)=12\overline{\phi}_{3}^{G^{\prime}}(v_{8})=\overline{\phi}_{3}^{G^{\prime}}(v_{10})=0<\underline{\phi}_{3}(v_{8})=\underline{\phi}_{3}(v_{10})=\frac{1}{2}. So v8v_{8} and v10v_{10} are pruned using condition (2).

We propose Prune algorithm shown in Algorithm 3. GG is replicated to GG^{\prime} for pruning (Line 1). Condition (1) is used to remove invalid vertices in GG^{\prime} (Lines 2-3); after computing the hh-clique core numbers for all vertices in GG^{\prime} (Line 4), condition (2) is applied to further remove invalid vertices in GG^{\prime} (Lines 5-7). Finally, the LhhCDS candidates are updated from the intersection of hh-clique stable groups and the unpruned vertex sets (Line 8).

Input: G=(V,E)G=(V,E), 𝒮,ϕ¯h,ϕ¯h\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h}
Output: 𝒮\mathcal{S}
1 G=(V(G),E(G))GG^{\prime}=(V(G^{\prime}),E(G^{\prime}))\leftarrow G;
2 foreach (u,v)E(u,v)\in E do
3       if ϕ¯h(v)<ϕ¯h(u)\overline{\phi}_{h}(v)<\underline{\phi}_{h}(u) then  remove vv from GG^{\prime} ;
4      
5foreach uV(G)u\in V(G^{\prime}) do  compute coreG(u,ψh)core_{G^{\prime}}(u,\psi_{h}) ;
6 while there exists uV(G),coreG(u,ψh)<ϕ¯h(u)u\in V(G^{\prime}),core_{G^{\prime}}(u,\psi_{h})<\underline{\phi}_{h}(u) do
7       remove uu from GG^{\prime};
8       update hh-clique-core numbers of vertices adjacent to uu;
9foreach LhhCDS candidate S𝒮S\in\mathcal{S} do  SSV(G)S\leftarrow S\cap V(G^{\prime}) ;
return 𝒮\mathcal{S};
Algorithm 3 Pruning algorithm: Prune

4.4. LhhCDS Verification

Since candidate LhhCDSes are obtained approximately, we need to confirm whether the candidates are LhhCDSes.

Proposition 6.

An LhhCDS must satisfy the following properties:

  1. (1)

    Any subgraph of an LhhCDS cannot be denser than itself;

  2. (2)

    An LhhCDS itself is compact, and any supergraph of an LhhCDS cannot be more compact than itself.

Proof.

(2) of Proposition 6 can be directly obtained from the definition of LhhCDS. We prove (1) by contradiction. Suppose that there is a subgraph G[S]G[S^{\prime}] in an LhhCDS G[S]G[S], SSS^{\prime}\subset S, such that dψh(G[S])>dψh(G[S])d_{\psi_{h}}(G[S^{\prime}])>d_{\psi_{h}}(G[S]). By removal of the set U=S\SU=S\backslash S^{\prime}from G[S]G[S], we remove |Ψh(G[S])||Ψh(G[S])||\Psi_{h}(G[S])|-|\Psi_{h}(G[S^{\prime}])| hh-cliques. Note that |Ψh(G[S])||Ψh(G[S])|=dψh(G[S])|S|dψh(G[S])|S|<dψh(G[S])(|S||S|)=dψh(G[S])|U||\Psi_{h}(G[S])|-|\Psi_{h}(G[S^{\prime}])|=d_{\psi_{h}}(G[S])|S|-d_{\psi_{h}}(G[S^{\prime}])|S^{\prime}|<\linebreak d_{\psi_{h}}(G[S])(|S|-|S^{\prime}|)=d_{\psi_{h}}(G[S])|U|, which contradicts the fact that G[S]G[S] is an LhhCDS, i.e. hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact. ∎

We need to verify: 1) whether a candidate LhhCDS G[S]G[S] is self-densest and 2) whether G[S]G[S] is a maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph in GG. We use IsDensest (Sun et al., 2020) algorithm to check whether a candidate LhhCDS G[S]G[S] is self-densest. In this section, we focus on the verification of the second property, to verify whether G[S]G[S] is a connected component of maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraphs in GG. We design a basic verification algorithm, and we further propose a fast algorithm by reducing the scale of the flow network. The correctness of both algorithms is proved.

4.4.1. Basic Verification Algorithm

Given an LhhCDS candidate G[S]G[S], we propose an innovative flow network to derive maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph GG^{\prime} in GG. If G[S]G[S] is a connected component of GG^{\prime}, G[S]G[S] is indeed maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph and an LhhCDS in GG; otherwise, G[S]G[S] is not an LhhCDS. The flow network (V,E)\mathcal{F}(V_{\mathcal{F}},E_{\mathcal{F}}) is shown in Figure 6. The vertex set of \mathcal{F} is {s}VΨh{t}\{s\}\cup V\cup\Psi_{h}\cup\{t\}. The arc set of \mathcal{F} is given as follows. For each hh-clique ψhj\psi_{h}^{j}, we add hh incoming arcs of capacity 11 from the vertices which form ψhj\psi_{h}^{j}, and hh outgoing arcs of capacity of h1h-1 to the same set of vertices. For each vertex viVv_{i}\in V, we add an incoming arc of capacity degG(vi,ψh)deg_{G}(v_{i},\psi_{h}) from the source vertex ss, and an outgoing arc of capacity ρh\rho*h to the sink vertex tt. Given a parameter ρ\rho, we prove that the flow network in DeriveCompact can be used to derive maximal hh-clique ρ\rho-compact subgraphs in GG according to Theorem 8.

Refer to caption
Figure 6. The flow network of DeriveCompact(G,ρ,)(G,\rho,\emptyset)
Theorem 8.

If GG contains maximal hh-clique ρ\rho-compact subgraphs, then the result returned by DeriveCompact (G,ρ1|V|2,G,\rho-\frac{1}{|V|^{2}},\emptyset) is the set of all maximal hh-clique ρ\rho-compact subgraphs in GG.

Proof.

Based on Proposition 2, two LhhCDSes are disjoint. We use G[S1]G[S_{1}] to represent the union of all maximal hh-clique ρ\rho-compact subgraphs in GG. G[S2]G[S_{2}] denotes the subgraph returned by DeriveCompact (G,ρ1|V|2,G,\rho-\frac{1}{|V|^{2}},\emptyset), which is the largest subgraph in GG with maximum |Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}| (Goldberg, 1984)(Fang et al., 2019). We prove that G[S1]G[S_{1}] and G[S2]G[S_{2}] are the same. First, we prove that G[S2]G[S_{2}] is a subgraph of G[S1]G[S_{1}] by contradiction. Suppose a connected component G[S]G[S] of G[S2]G[S_{2}] is not hh-clique ρ\rho-compact, then there exists a subset SSS^{\prime}\subseteq S such that removing SS^{\prime} from SS will result in removing less hh-cliques than ρ×|S|\rho\times|S^{\prime}|, then |Ψh(G[S])||Ψh(G[S\S])|<ρ×|S|=ρ×(|S||S\S|)|\Psi_{h}(G[S])|-|\Psi_{h}(G[S\backslash S^{\prime}])|<\rho\times|S^{\prime}|=\rho\times(|S|-|S\backslash S^{\prime}|). We have |Ψh(G[S])|ρ×|S|<|Ψh(G[S\S])|ρ×|S\S||\Psi_{h}(G[S])|-\rho\times|S|<|\Psi_{h}(G[S\backslash S^{\prime}])|-\rho\times|S\backslash S^{\prime}|. Therefore, replacing G[S]G[S] by its subgraph G[S\S]G[S\backslash S^{\prime}] in G[S2]G[S_{2}] will enlarge the value of |Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}|, which contradicts the condition that G[S2]G[S_{2}] has the maximum |Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}|. Second, we prove that G[S1]G[S_{1}] is a subgraph of G[S2]G[S_{2}] by contradiction. Suppose G[S1]G[S_{1}] is not a subgraph of G[S2]G[S_{2}], according to the result before, we have S2S1S_{2}\subset S_{1}. There exists a subset SS\neq\emptyset and S=S1\S2S=S_{1}\backslash S_{2}. Removing SS from G[S1]G[S_{1}] will result in removing at least ρ×|S|\rho\times|S| hh-cliques, then |Ψh(G[S1])||Ψh(G[S2])|ρ×|S|=ρ×(|S1||S2|)|\Psi_{h}(G[S_{1}])|-|\Psi_{h}(G[S_{2}])|\geq\rho\times|S|=\rho\times(|S_{1}|-|S_{2}|). We have |Ψh(G[S1])|ρ×|S1||Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{1}])|-\rho\times|S_{1}|\geq|\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}|, so enlarging G[S2]G[S_{2}] to G[S1]G[S_{1}] will not decrease the value of |Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}|, which contradicts the condition that G[S2]G[S_{2}] has the maximum |Ψh(G[S2])|ρ×|S2||\Psi_{h}(G[S_{2}])|-\rho\times|S_{2}|. Therefore, the theorem is proved. ∎

Input: G(V,E),SG(V,E),S
Output: VerifyLhhCDS
1 ρdψh(G[S])\rho\leftarrow d_{\psi_{h}}(G[S]), VerifyLhhCDS \leftarrow True;
2 GG^{\prime}\leftarrow DeriveCompact (G,ρ1|V|2,G,\rho-\frac{1}{|V|^{2}},\emptyset);
3 return G[S]G[S] is a connected component in GG^{\prime};
4 Procedure DeriveCompact(G,ρ,PG,\rho,P)
5       cnt0cnt\leftarrow 0; Ψh\Psi_{h}\leftarrow all the instances of hh-clique ψh\psi_{h} in GG;
6       V{s}VΨhP{t}V_{\mathcal{F}}\leftarrow\{s\}\cup V\cup\Psi_{h}\cup P\cup\{t\};
7       foreach ψhΨh\psi_{h}\in\Psi_{h} do
8             foreach vψhv\in\psi_{h} do
9                   add an edge ψhv\psi_{h}\rightarrow v with capacity h1h-1;
10                   add an edge vψhv\rightarrow\psi_{h} with capacity 11;
11                  
12            
13      foreach ψhP\psi_{h}\in P do
14             cntcntcnt\leftarrow cnt of ψh\psi_{h};
15             foreach vψhv\in\psi_{h} and vGv\in G do
16                   add an edge ψhv\psi_{h}\rightarrow v with capacity h1h-1;
17                   add an edge vψhv\rightarrow\psi_{h} with capacity 1+hcntcnt1+\frac{h-cnt}{cnt};
18                   degG(v,ψh)degG(v,ψh)+1+hcntcntdeg_{G}(v,\psi_{h})\leftarrow deg_{G}(v,\psi_{h})+1+\frac{h-cnt}{cnt};
19            
20      foreach vVv\in V do
21             add an edge vtv\rightarrow t with capacity ρh\rho*h;
22             add an edge svs\rightarrow v with capacity degG(v,ψh)deg_{G}(v,\psi_{h});
23            
24      Compute the minimum sts-t cut (𝒮,𝒯)(\mathcal{S},\mathcal{T}) from the flow network (V,E)\mathcal{F}(V_{\mathcal{F}},E_{\mathcal{F}});
25       return G[𝒮s]G[\mathcal{S}\setminus s];
Algorithm 4 Basic LhhCDS verification algorithm

In Algorithm 4, we first derive all connected components of the hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph GG^{\prime} in GG by DeriveCompact (Line 2). If G[S]G[S] is a connected component of GG^{\prime}, the algorithm returns True (Line 3). In DeriveCompact, all the instances of hh-clique is collected (Line 5). To build a flow network (V,E)\mathcal{F}(V_{\mathcal{F}},E_{\mathcal{F}}), a vertex set VV_{\mathcal{F}} is created, and vertices in VV_{\mathcal{F}} are linked by directed edges with different capacities (Lines 6-19). Then, the minimum sts-t cut (𝒮,𝒯)(\mathcal{S},\mathcal{T}) is computed (Line 20).

4.4.2. Fast Verification Algorithm

Although the basic verification algorithm can successfully verify whether a given subset is LhhCDS, the scale of the flow network in algorithm 4 is large, and the running time is long in large-scale graphs. We prove that the verification can be done by verifying only the subgraph G[S]G[S] and the vertices around the subgraph G[S]G[S], which is denoted by G[T]G[T]. Since G[T]G[T] is much smaller than GG, checking the minimum cut in G[T]G[T] is much more efficient. Considering the complexity of the overlap of cliques, we propose a fast verification algorithm by constructing a smaller flow network based on G[T]G[T]. Based on the fact that only the hh-cliques at the boundary of G[T]G[T] affect hh-clique compact numbers in G[T]G[T] compared to the hh-clique compact numbers in GG, we use a set PP to record these hh-cliques. For each ψhPrP\psi_{h}^{P_{r}}\in P, the number of vertices that are contained in both ψhPr\psi_{h}^{P_{r}} and G[T]G[T] is cntPrcnt_{P_{r}}. The flow network (V,E)\mathcal{F}(V_{\mathcal{F}},E_{\mathcal{F}}) is shown in Figure 7. The vertex set of \mathcal{F} is {s}VΨhP{t}\{s\}\cup V\cup\Psi_{h}\cup P\cup\{t\}. We add the boundary hh-clique set PP into VV_{\mathcal{F}} to ensure that the results of solving the flow network of G[T]G[T] are precisely consistent with that of GG. The arc set of \mathcal{F} is given as follows. The arcs for vertices and hh-cliques in G[T]G[T] are the same as the former flow network. For each ψhPrP\psi_{h}^{P_{r}}\in P, we add cntPrcnt_{P_{r}} incoming arcs of capacity 1+hcntPrcntPr1+\frac{h-cnt_{P_{r}}}{cnt_{P_{r}}} from the vertices that both in ψhPr\psi_{h}^{P_{r}} and G[T]G[T], and cntPrcnt_{P_{r}} outgoing arcs of capacity of h1h-1 to the same set of vertices.

Refer to caption
Figure 7. The flow network of DeriveCompact(G,ρ,P)(G,\rho,P)
Input: G=(V,E)G=(V,E), S,Ψh(G),ϕ¯h,ϕ¯hS,\Psi_{h}(G),\overline{\phi}_{h},\underline{\phi}_{h}
Output: VerifyLhhCDS
1 ρdψh(G[S])\rho\leftarrow d_{\psi_{h}}(G[S]), VerifyLhhCDS \leftarrow True, Valid \leftarrow True;
2 UU\leftarrow an empty queue, PP\leftarrow\emptyset, TT\leftarrow\emptyset, WW\leftarrow\emptyset, cnt0cnt\leftarrow 0;
3 foreach uSu\in S do
4       if uTu\notin T then  push uu to UU, insert uu into TT ;
5       while UU is not empty do
6             vv\leftarrow pop out the front vertex in UU;
7             foreach ψhΨh(G)\psi_{h}\in\Psi_{h}(G) where vψhv\in\psi_{h} do
8                   Valid \leftarrow True;
9                   if ψhW\psi_{h}\notin W then
10                         foreach wψhw\in\psi_{h} do
11                               if ϕ¯h(w)<ρ\overline{\phi}_{h}(w)<\rho then  Valid \leftarrow False ;
12                              
13                        insert ψh\psi_{h} into WW;
14                  else  Valid \leftarrow False ;
15                   if Valid then
16                         cnt1cnt\leftarrow 1;
17                         foreach wψhw\in\psi_{h} and wvw\neq v do
18                               if wTw\notin T and ww is in any LhhCDS then
19                                     VerifyLhhCDS \leftarrow False;
20                              if ϕ¯h(w)ρ\underline{\phi}_{h}(w)\leq\rho then
21                                     if wTw\notin T then
22                                           push ww to UU, insert ww into TT;
23                                    cntcnt+1cnt\leftarrow cnt+1;
24                              
25                        if cnthcnt\neq h and ψhP\psi_{h}\notin P then
26                               insert ψh\psi_{h} and cntcnt into PP;
27                               VerifyLhhCDS \leftarrow False;
28                        
29                  
30            foreach (v,w)E(v,w)\in E do
31                   if wTw\notin T and ϕ¯h(w)>ρ\underline{\phi}_{h}(w)>\rho then
32                         VerifyLhhCDS \leftarrow False;
33                  else if wTw\notin T and ϕ¯h(w)>ρ\overline{\phi}_{h}(w)>\rho then
34                         push ww to UU, add ww into TT;
35                  
36            
37      
38if VerifyLhhCDS then  return True ;
39 GG^{\prime}\leftarrow DeriveCompact (G[T],ρ1|V(G[T])|2,PG[T],\rho-\frac{1}{|V(G[T])|^{2}},P);
return G[S]G[S] is a connected component in GG^{\prime};
Algorithm 5 Fast LhhCDS verification algorithm

In Algorithm 5, the hh-clique density of G[S]G[S] is assigned to ρ\rho (Line 1). Then, a breadth-first search is performed. UU is used to store the vertices to be traversed (Line 4). The first vertex vv from UU is popped out (Line 6), and all hh-cliques containing vv are iterated (Lines 7-25). For each ψh\psi_{h} that is not in WW, if any vertex ww in ψh\psi_{h} satisfying ϕ¯h(w)<ρ\overline{\phi}_{h}(w)<\rho, ψh\psi_{h} will not affect the hh-clique compact number of ww (Lines 9-13); if any vertex ww is in any outputted LhhCDS, False is assigned to VerifyLhhCDS (Lines 17-18); the number of vertices in ψh\psi_{h} satisfying ϕ¯h(w)ρ\underline{\phi}_{h}(w)\leq\rho is recorded and ψh\psi_{h} is added into PP (Lines 19-25). All neighbors of vv are iterated (Lines 26-30). For each neighbor ww that is not in TT, if ϕ¯h(w)>ρ\underline{\phi}_{h}(w)>\rho, False is assigned to VerifyLhhCDS (Line 28). If ϕ¯h(w)ρ<ϕ¯h(w)\underline{\phi}_{h}(w)\leq\rho<\overline{\phi}_{h}(w), ww will be added into UU and TT (Line 30). If VerifyLhhCDS is False, a subgraph G[T]G[T] induced by TT and peripheral hh-cliques in PP are used to compute all hh-clique ρ\rho-compact subgraphs in G[T]G[T] via min-cut (Line 32). Finally, True is returned if G[S]G[S] is maximal hh-clique ρ\rho-compact; otherwise, the algorithm returns False (Line 33). The flow network here is much smaller.

Theorem 9.

Given a graph GG and a self-densest subgraph G[S]G[S], G[S]G[S] is an LhhCDS of GG if and only if the fast LhhCDS verification algorithm returns True.

Proof.

On the one hand, if G[S]G[S] is an LhhCDS of GG, G[S]G[S] is still an LhhCDS in G[T]G[T], because only the hh-cliques in PP might increase the hh-clique compact numbers in G[T]G[T] compared to the hh-clique compact numbers in GG. Otherwise, there exists a vertex vv with hh-cliques in PP contained in the maximal hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph containing G[S]G[S], and we can construct a larger hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph in GG by adding vertices with ϕ¯h(w)>dψh(G[S])\underline{\phi}_{h}(w)>d_{\psi_{h}}(G[S]) connected to vv, which contradicts that G[S]G[S] is an LhhCDS. On the other hand, if G[S]G[S] is not an LhhCDS of GG, we will find a larger hh-clique dψh(G[S])d_{\psi_{h}}(G[S])-compact subgraph containing G[S]G[S] in G[T]G[T]. Therefore, G[S]G[S] is not an LhhCDS in G[T]G[T], and the algorithm returns False. Therefore, the algorithm returns True only when G[S]G[S] is an LhhCDS of GG. ∎

4.5. The LhhCDS Discovery Algorithm (IPPV)

Combining all the algorithms above, we derive the LhhCDS discovery algorithm, called the IPPV algorithm shown in Algorithm 6. An empty stack stst is initialized, and GG^{\prime} is assigned to GG (Line 1). The bounds of hh-clique compact numbers are initialized via InitializeBd (Line 2). LhhCDS candidates are derived via ProposeCL and Prune (Line 4-5). Next, the LhhCDS candidates in 𝒮\mathcal{S} are reversely pushed into stst, and the first LhhCDS candidate in stst, the one with the highest ϕh\phi_{h} value, is popped out (Lines 6-7). The LhhCDS candidate is verified by IsDensest (Line 8) and VerifyLhhCDS (line 9). If G[S]G[S] is an LhhCDS, it will be outputted, and kk is decreased by 1 (Line 10). If G[S]G[S] is not an LhhCDS but is self-densest, SS is updated as the top LhhCDS candidate from stst (Line 12). Then, G[S]G[S] is assigned to GG^{\prime} for the next iteration (Line 13). The above process is repeated until top-kk LhhCDSes are found (line 3) or the stack is empty (Line 11). Our algorithms can also be extended to find all LhhCDSes.

Input: G=(V,E)G=(V,E), number of iterations TT, an integer kk
Output: top-kk LhhCDSes
1 stst\leftarrow an empty stack; GGG^{\prime}\leftarrow G;
2 ϕ¯h,ϕ¯h\overline{\phi}_{h},\underline{\phi}_{h}\leftarrowInitializeBd(G,hG^{\prime},h);
3 while k>0k>0 do
4       𝒮,ϕ¯h,ϕ¯h\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h}\leftarrow ProposeCL (G,T,ϕ¯h,ϕ¯hG^{\prime},T,\overline{\phi}_{h},\underline{\phi}_{h}) ;
5       𝒮\mathcal{S}\leftarrow Prune (G,𝒮,ϕ¯h,ϕ¯hG^{\prime},\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h});
6       foreach S𝒮S\in\mathcal{S} reversely do  push SS into stst ;
7       SS\leftarrow pop out the top stable group from stst ;
8       if IsDensest (G[S]G[S]) then
9             if VerifyLhhCDS (G,S,ϕ¯h,ϕ¯hG,S,\overline{\phi}_{h},\underline{\phi}_{h}) then
10                   output G[S]G[S] ; kk1k\leftarrow k-1 ;
11            if stst is empty then  break ;
12             SS\leftarrow pop out the top stable group from stst ;
13            
14      GG[S]G^{\prime}\leftarrow G[S] ;
15      
Algorithm 6 Iterative propose-prune-and-verify algorithm based on convex programming (IPPV)
Theorem 10.

The LhhCDS discovery algorithm IPPV is an exact algorithm, i.e., it can output all LhhCDSes correctly.

Proof.

According to Theorem 7, the set of LhhCDS candidates proposed by IPPV is a superset of all LhhCDSes with tight bounds of hh-clique compact number. Based on Proposition 5, the pruning part only prunes vertices that are not in any LhhCDS and guarantees that the vertices in any LhhCDS will not be pruned. Theorem 8 and Theorem 9 state that the verification part verifies whether a candidate is exactly an LhhCDS without misjudgment. And all candidates are repeated through the above bound updating, pruning, and verification operations until kk LhhCDSes are outputted. Since the set of LhhCDS candidates are arranged in descending order, the outputted kk LhhCDSes must be the top-kk LhhCDSes. Therefore, the proposed algorithm is an exact algorithm to output the top-kk LhhCDSes. ∎

Complexity Analysis. We use TT to denote the number of iterations that SEQ-kClist++ needs. Each iteration of SEQ-kClist++ costs O(n+|Ψh|)O(n+|\Psi_{h}|). We use NCLN_{CL} to represent the total number of LhhCDS candidates, NCLnN_{CL}\ll n. Each iteration of verifying an LhhCDS candidate costs O(n+|Ψh|)O(n+|\Psi_{h}|). NFlowN_{Flow} is the number of times IsDensest and VerifyLhhCDS are called. The time complexity of max-flow computation, which is O((n+|Ψh|)2(n+|Ψh|h))O((n+|\Psi_{h}|)^{2}\cdot(n+|\Psi_{h}|\cdot h)) for IsDensest and VerifyLhhCDS when Dinic Algorithm is Applied. The time complexity of IPPV is O((T+NCL)(n+|Ψh|)+NFlow(n+|Ψh|)2(n+|Ψh|h))O((T+N_{CL})\cdot(n+|\Psi_{h}|)+N_{Flow}\cdot(n+|\Psi_{h}|)^{2}\cdot(n+|\Psi_{h}|\cdot h)). As (T+NCL)(n+|Ψh|)NFlow(n+|Ψh|)2(n+|Ψh|h)(T+N_{CL})\cdot(n+|\Psi_{h}|)\ll N_{Flow}\cdot(n+|\Psi_{h}|)^{2}\cdot(n+|\Psi_{h}|\cdot h), the time complexity of IPPV is O(NFlow(n+|Ψh|)2(n+|Ψh|h))O(N_{Flow}\cdot(n+|\Psi_{h}|)^{2}\cdot(n+|\Psi_{h}|\cdot h)). The memory complexity is (n+|Ψh|)(n+|\Psi_{h}|). Note that the time complexity of the classical exact solution of only computing hh-clique compact number is O((n|Ψh|)3L)O((n\cdot|\Psi_{h}|)^{3}\cdot L) (Goldfarb and Liu, 1991), which is much greater than that of our algorithm.

5. LhxhxPDS Discovery

A pattern (also known as a motif) (Wuchty et al., 2003; Leskovec et al., 2006; Hu et al., 2019) is a small connected subgraph that appears frequently in a larger graph, which can be considered as a basic module. Figure 8 shows all kinds of patterns with four vertices: 4a4a-pattern, \ldots ,4f4f-pattern.

Refer to caption
Figure 8. An example of all patterns with four vertices

We further show that the algorithm for the locally hh-clique densest subgraph discovery problem can be extended to solve the locally general pattern densest subgraph discovery problem, which contributes to a deeper understanding of the organizational principles and functional modules within complex networks.

5.1. Densest Supermodular Set Decomposition

In this section, we discuss the feasability of extending the hh-clique problem to a general pattern problem. The convex programming of hh-clique can be further generalized to the convex programming of supermodular sets, so that the convex programming for the general pattern densest subgraph problem and the corresponding compact number can be derived. A function f:2V+f:2^{V}\rightarrow\mathbb{R}_{+} is said to be supermodular if A,BV,f(A)+f(B)f(AB)+f(AB)\forall A,B\subseteq V,f(A)+f(B)\leq f(A\cup B)+f(A\cap B). Harb et al. (Harb et al., 2022) proposed the densest supermodular subset (DSS) problem: given a normalized, nonnegative monotone supermodular function f:2V+f:2^{V}\rightarrow\mathbb{R}_{+}, return SVS\subseteq V that maximizes f(S)|S|\frac{f(S)}{|S|}. According to our observation, when f(S)=|E(S)|f(S)=|E(S)| and f(S)=|Ψh(G[S])|f(S)=|\Psi_{h}(G[S])|, the DSS problem is the DS and CDS problem, respectively. When f(S)f(S) represents the number of a particular pattern in a graph, the problem is the densest problem of the proposed pattern. The convex program(Harb et al., 2022) for the densest supermodular set decomposition is CP(G):=min{uVr(u)2},\mathrm{CP}(G):=\min\left\{\sum_{u\in V}r(u)^{2}\right\}, subject to: r{xV|x0,x(S)f(S) for all SV,x(V)=f(V)}r\in\left\{x\in\mathbb{R}^{V}|x\geq 0,x(S)\geq f(S)\text{ for all }S\subseteq V,x(V)=f(V)\right\}.

With supermodularity, there is a property that each graph has a unique nested diminishingly decomposition for each type of density. The analysis of the generalization of CDS problem to DSS problem has triggered our thinking on the solution of locally general pattern densest problem.

5.2. Locally General Pattern Densest Subgraph Problem

Given an undirected graph G=(V,E)G=(V,E), ψhx(Vψhx,Eψhx)\psi_{hx}(V_{\psi_{hx}},E_{\psi_{hx}}) denotes a particular kind of pattern xx with hh vertices and Ψhx(G)\Psi_{hx}(G) is the collection of the hxhx-patterns of GG. dψhx(G)=|Ψhx(G)||V|d_{\psi_{hx}}(G)=\frac{|\Psi_{hx}(G)|}{|V|} denotes the hxhx-pattern density of GG. degG(v,ψhx)deg_{G}(v,\psi_{hx}) is the hxhx-pattern degree of vv, i.e., the number of hxhx-patterns containing vv. A graph G=(V,E)G=(V,E) is hxhx-pattern ρ\rho-compact if GG is connected, and removing any subset of vertices SVS\subseteq V will result in the removal of at least ρ×|S|\rho\times|S| hxhx-patterns in GG. We can formally define a locally hxhx-pattern densest subgraph as follows.

Definition 0 (Locally hxhx-pattern densest subgraph (LhxhxPDS)).

A subgraph G[S]G[S] of GG is an LhxhxPDS of GG if G[S]G[S] is hxhx-pattern dψhx(G)d_{\psi_{hx}}(G)-compact, and there does not exist a supergraph G[S]G[S^{\prime}] of G[S]G[S] (SSS^{\prime}\supsetneq S), such that G[S]G[S^{\prime}] is also hxhx-pattern dψhx(G)d_{\psi_{hx}}(G)-compact.

Similarly, we formulate the locally hxhx-pattern densest subgraph problem as follows.

Definition 0 (Locally hxhx-pattern densest subgraph Problem (LhxhxPDS Problem)).

Given a graph GG, an integer hh, a pattern xx and an integer kk, the LhxhxPDS problem is to compute the top-kk LhxhxPDSes ranked by the hxhx-pattern density in GG.

Input: G=(V,E)G=(V,E), number of iterations TT, an integer kk
Output: top-kk LhxhxPDS
1 stst\leftarrow an empty stack; GGG^{\prime}\leftarrow G;
2 ϕ¯hx,ϕ¯hx\overline{\phi}_{hx},\underline{\phi}_{hx}\leftarrow InitializeBd for hxhx-pattern (G,h,xG^{\prime},h,x);
3 while k>0k>0 do
4       𝒮,ϕ¯hx,ϕ¯hx\mathcal{S},\overline{\phi}_{hx},\underline{\phi}_{hx}\leftarrow ProposeCL for hxhx-pattern (G,T,ϕ¯hx,ϕ¯hxG^{\prime},T,\overline{\phi}_{hx},\underline{\phi}_{hx});
5       𝒮\mathcal{S}\leftarrow Prune for hxhx-pattern (G,𝒮,ϕ¯h,ϕ¯hG^{\prime},\mathcal{S},\overline{\phi}_{h},\underline{\phi}_{h});
6       foreach S𝒮S\in\mathcal{S} reversely do
7             push SS into stst;
8            
9      SS\leftarrow pop out the top stable group from stst;
10       if IsDensest (G[S]G[S])for hxhx-pattern then
11             if VerifyLhxhxPDS (G,S,ϕ¯hx,ϕ¯hxG,S,\overline{\phi}_{hx},\underline{\phi}_{hx}) then
12                   output G[S]G[S]; kk1k\leftarrow k-1;
13                  
14            if stst is empty then
15                   break;
16                  
17            SS\leftarrow pop out the top stable group from stst;
18            
19      GG[S]G^{\prime}\leftarrow G[S];
20      
Algorithm 7 The IPPV algorithm for LhxhxPDS
Refer to caption
(a) fb-pages-company
Refer to caption
(b) soc-hamsterster
Refer to caption
(c) soc-epinions
Refer to caption
(d) Email-Enron
Refer to caption
(e) loc-gowalla
Refer to caption
(f) CA-CondMat
Refer to caption
(g) CA-GrQc
Refer to caption
(h) Amazon
Figure 9. Running time of algorithms with different hh (= 3,4,5) and kk. Red is VerifyLhhCDS(basic), Green is VerifyLhhCDS(fast)
Refer to caption
Figure 10. Running time of each part of IPPV with h=3h=3 and k=20k=20

Here, we utilize our “iterative propose-prune-and-verify” pipeline to solve the LhxhxPDS problem. To apply the hxhx-pattern subgraph, there are some differences between Algorithm 6 and Algorithm 7 in the algorithmic details. In Algorithm 7, we need to count hxhx-pattern graphs for Seq-kClist++ algorithm and derive candidate LhxhxPDS algorithm. In pruning part, the computation of hxhx-pattern graph cores is different for diverse kinds of patterns. Unlike hh-clique, there may be more than one hxhx-pattern on a graph with hh vertices. In verification part, the methods for reducing the size of subgraph to compute the min-cut need small adjustments for different patterns. In general, the process of extending our algorithm to general patterns is concise and clear. In addition, our method may also support expressive graph models as directed graph, attributed graph, etc. These graph models also have meaningful patterns, such as directed closed loop in directed graph. The difficulty of generalizing IPPV to different graph models is related to the properties of the models.

6. Experiments

6.1. Experimental Setup

The datasets we use are undirected real-world graphs (Leskovec and Krevl, 2014; Rossi and Ahmed, 2015), including social networks, biological networks, web graphs, and collaboration networks. All datasets are listed in Table 2.

We compare the performances of the following algorithms:
IPPV : the top-kk LhhCDS discovery algorithm proposed by us.
LTDS (Samusevich et al., 2016) : the top-kk LTDS discovery algorithm based on the maximum-flow, which solves the LhhCDS problem with h=3h=3.
LDSflow (Qin et al., 2015) : the top-kk LDS discovery algorithm based on the maximum-flow, which solves the LhhCDS problem with h=2h=2.
Greedy : the top-kk CDS discovery algorithm based on KClist++ (Sun et al., 2020) using greedy approach. It has no guarantee on the locally densest property.

Table 2. Datasets used in our experiments
Name Abbr. |V||V| |E||E| |Ψ3||\Psi_{3}| |Ψ5||\Psi_{5}|
soc-hamsterster HA 2,426 16,630 53,251 298,013
CA-GrQc GQ 5,242 14,484 48,260 2,215,500
fb-pages-politician PP 5,908 41,706 174,632 2,002,250
fb-pages-company PC 14,113 52,126 56,005 207,829
web-webbase-2001 WB 16,062 25,593 21,115 382,674
CA-CondMat CM 23,133 93,439 173,361 511,088
soc-epinions EP 26,588 100,120 159,700 521,106
Email-Enron EN 36,692 183,831 727,044 5,809,356
loc-gowalla GW 196,591 950,327 2,273,138 14,570,875
DBLP DB 317,080 1,049,866 2,224,385 262,663,639
Amazon AM 334,863 925,872 667,129 61,551
soc-youtube YT 495,957 1,936,748 2,443,886 5,306,643
soc-lastfm LF 1,191,805 4,519,330 3,946,207 10,404,656
soc-flixster FX 2,523,386 7,918,801 7,897,122 96,315,278
soc-wiki-talk WT 2,394,385 4,659,565 9,203,519 382,777,822

All algorithms are implemented in C++ and compiled by g++ compiler at -O3 optimization level. All experiments are evaluated on a machine with Intel(R) Xeon(R) CPU 3.20GHz processor and 128GB memory, with Ubuntu operating system. Algorithms running for more than 48 hours are forcibly terminated.

6.2. Efficiency under Parameter Variations

In this section, we summarize the influence of different parameter changes on the running time.

6.2.1. Efficiency improvement by fast verification algorithm

We use VerifyLhhCDS(basic) to represent the IPPV algorithm with Algorithm 4 and VerifyLhhCDS(fast) to represent the IPPV algorithm with Algorithm 5. Their running times are compared in Figure 9. The fast verification algorithm with a smaller flow network is much faster than basic verification method. Especially as kk increases, the efficiency gap between the two algorithms becomes more apparent. We also compare the running time of the two verification algorithms in the total running time in Figure 10. The acceleration effect of the fast algorithm is obvious. The results demonstrate the importance and benefit of optimizing the verification algorithm.

6.2.2. Running time trends with varying kk

Parameter kk has a more pronounced impact on the running time of the algorithm than hh. The experiments in Figure 9 indicate a direct relationship, where an increase in kk corresponds to a proportional increase in execution time. This trend is consistently observed across different datasets, which strengthens the premise that kk is an important factor in computational complexity. The running time of both algorithms increases significantly for incremental values of kk. The only deviation is observed in the Email-Enron dataset, where the running time remains relatively static despite changes in kk, due to the fact that the total number of LhhCDSes in this dataset is smaller than kk.

6.2.3. Running time trends with varying hh

We took h=3,4,5h=3,4,5 to compare the impact of hh on the running time. The results are shown in Figure 9. When h=5h=5, the running time is generally longer. The reason is that when h=5h=5, the number of 55-cliques is larger, as shown in Table 2. On Amazon, the running time is shorter when h=5h=5, because the number of 55-cliques is smaller. The running time is proportional to the number of hh-cliques with different hh.

6.2.4. Influence of structural property

We discuss how the density property (|E|/|V||E|/|V|) of graph datasets influences the efficiency of IPPV. As shown in Figure 11, we select four datasets and randomly sample edges of these datasets in different proportions. According to the outputs from these synthetically generated graphs, the running time increases as the density of datasets increases. The reason is that when density is higher, the number of hh-cliques increases, lengthening the total running time.

Refer to caption
Figure 11. The running time on four datasets with varying density when h=3h=3 and k=5k=5

6.3. Efficiency v.s. Existing Algorithms

Since LDSflow (Qin et al., 2015) outputs LhhCDSes with h=2h=2, and LTDS (Samusevich et al., 2016) outputs LhhCDSes with h=3h=3, we compare IPPV with them by setting h=2h=2 and h=3h=3 respectively.

6.3.1. Efficiency: IPPV v.s. LDSflow

We set h=2h=2, k=5k=5 to observe the running time of the two algorithms. The results are shown in Figure 12. We compare IPPV with LDSflow on eight datasets, and IPPV has efficiency improvements on all the datasets. The bottleneck of LDSflow is the loose upper and lower bounds.

Refer to caption
Figure 12. Efficiency of IPPV (h=2) and LDSflow

6.3.2. Efficiency: IPPV v.s. LTDS

We set k=5k=5 and h=3h=3 in experiments and the results are shown in Table 3. IPPV and LTDS are compared on all the datasets. There are significant efficiency improvements on all datasets. The main bottleneck of LTDS is the time-consuming verification part, and the reason is that the upper and lower bounds of LTDS are not as tight as that of IPPV, so there will be more failures in the verification part. The running time of our algorithm is closely related to the size of the graph and the number of hh-cliques. A rise in the number of hh-cliques can cause an increase in running time.

Table 3. Efficiency of IPPV (h=3) and LTDS
Dataset IPPV (h=3) LTDS Speedup
soc-hamsterster 7.50(s) 46.54 6.20×\times
CA-GrQc 0.38 18.97 49.92 ×\times
fb-pages-politician 32.32 436.30 13.50×\times
fb-pages-company 2.56 51.48 20.11×\times
web-webbase-2001 0.14 12.20 87.14×\times
CA-CondMat 21.63 541.63 25.04×\times
soc-epinions 82.54 558.91 6.77×\times
Email-Enron 1369.84 2253.14 1.64×\times
loc-gowalla 5095.63 68216.14 13.39×\times
DBLP 360.49 4888.93 13.56×\times
Amazon 1118.08 1308.53 1.17×\times
soc-youtube 9070.89 42821.99 4.72×\times
soc-lastfm 11223.13 \geq172, 800 \geq 15.40 ×\times
soc-flixster 3018.62 \geq 172, 800 \geq 57.24 ×\times
soc-wiki-talk 57382.42 \geq 172, 800 \geq 3.011 ×\times

6.4. Characteristics of the Detected LhhCDSes

We adopt different quality measures to show the characteristics of the detected LhhCDSes, and then we compare our detected LhhCDSes with those detected by the Greedy algorithm.

6.4.1. Visualization of LhhCDSes with varying hh

We use a network of books about US politics which were sold by Amazon (Krebs, 2004) to visualize the characteristics of detected LhhCDSes with varying hh. The vertices represent different books. Figure 13(a) shows the books fall into neutral(green), liberal(blue), and conservative(red) categories. The edges represent frequent co-purchasing of books by the same buyers, which indicate “customers who bought this book also bought the other books” on Amazon. Figure 13(b)-Figure 13(e) visualize the detected LhhCDSes with varying hh. The set of steelblue vertices is the top-11 LhhCDS, and the set of orange vertices if exists, is the top-22 LhhCDS. The vertices in the dataset have clear attributes, which is convenient for us to measure whether IPPV has the ability to mine diverse communities. As visualized by the results, LhhCDSes with larger hh are closer to a clique. Besides, when hh is larger, LhhCDSes can find multiple dense communities in different categories. L44CDSes contain both liberal and conservative book communities, whereas LDSes (i.e., L22CDSes) only contain liberal book community. The results show the potential of IPPV to mine diverse categorical communities.

Refer to caption
(a) categories
Refer to caption
(b) h=2h=2
Refer to caption
(c) h=3h=3
Refer to caption
(d) h=4h=4
Refer to caption
(e) h=5h=5
Figure 13. LhhCDS case study on real network (the top-11 LhhCDS: steelblue; the top-22 LhhCDS: orange vertices)

6.4.2. Edge density of LhhCDSes with varying hh

We compare the average edge density ((2×|E|)/[|V|×(|V|1)](2\times|E|)/[|V|\times(|V|-1)]) of top-55 LhhCDSes for varying hh in Table 4. When hh is larger, LhhCDSes generally have higher edge density. This is aligned with our general purpose.

Table 4. Average edge density and diameter with varying hh
Average Edge Density Average Diameter
dataset hh=2 hh=3 hh=5 hh=7 hh=9 hh=2 hh=3 hh=5 hh=7 hh=9
PC 0.752 0.858 0.891 0.894 0.957 2.00 2.00 2.00 2.00 2.00
HA 0.805 0.869 0.987 0.982 0.995 1.80 1.60 1.20 1.40 1.40
PP 0.709 0.685 0.696 0.729 0.827 2.20 2.00 2.00 2.00 2.00
CM 0.984 0.984 0.986 0.986 0.986 1.20 1.20 1.40 1.40 1.40
EP 0.487 0.724 0.799 0.770 0.755 2.60 1.80 1.80 1.67 2.00
WB 0.987 0.987 0.997 0.997 0.997 1.40 1.40 1.20 1.20 1.20
GQ 0.972 0.972 0.972 0.975 OOM 1.60 1.60 1.60 1.60 OOM

6.4.3. Diameter of LhhCDSes with varying hh

We list the average diameter (the longest distance of all pairs of nodes in a graph) of top-55 LhhCDSes of different hh in Table 4. It is obvious that the diameters of the subgraphs produced by LhhCDS (h3h\geq 3) do not exceed 22. Coupled with the edge density results, while LhhCDS can find subgraphs with relatively higher edge density compared to LDS, the diameters of the subgraphs found by LhhCDS are as small as those found by LDS. This suggests that LhhCDSes are cohesive, and the nodes within LhhCDSes are highly interconnected.

Refer to caption
Figure 14. Subgraph statistics of hh-clique density and size

6.4.4. Size and hh-clique density of subgraphs: IPPV vs Greedy

Next, we compare the LhhCDSes detected by IPPV and the hh-clique densest subgraphs found by the Greedy algorithm. We select h=3,5h=3,5 on two datasets, and the results are shown in Figure 14. First, the results of the two algorithms overlap to a certain extent, among which the top-11 CDS is the same because the first LhhCDS must be the hh-clique densest subgraph in the whole graph. Second, there is a certain difference between the returned subgraphs of the Greedy algorithm and IPPV. For example, in CA-CondMat (hh=3), the hh-clique density of the second output subgraph is 51 for IPPV, and 78 for Greedy, but the returned subgraphs of Greedy is adjacent to the first output subgraph without the guarantee of the locally densest property. Therefore, the Greedy algorithm cannot solve the LhhCDS problem well. The two algorithms totally overlap if and only if the top-kk hh-clique densest subgraph belongs to different regions occasionally.

6.5. Clustering Coefficient of Different hh

Since near clique is an important criterion for evaluating dense subgraphs, we evaluate how LhhCDSes of different hh are close to the clique structure. In graph theory, clustering coefficient is a measure of the degree to which vertices in a graph tend to cluster together, which is a direct measure to the degree of near clique. For each vertex uVu\in V, which has kuk_{u} neighbors NuN_{u} (|Nu|=ku|N_{u}|=k_{u}), the clustering coefficient of uu is Cu=2|{evw:v,wNu,evwE}|ku(ku1)C_{u}=\frac{2|\{e_{vw}:v,w\in N_{u},e_{vw}\in E\}|}{k_{u}(k_{u}-1)}. We compare the average CuC_{u} of all the LhhCDSes of different hh in Table 5.

Table 5. Average clustering coefficient of different hh values
Average Clustering Coefficient
dataset hh=2 hh=3 hh=5 hh=7 hh=9
fb-pages-company 0.582 0.852 0.895 0.915 0.930
soc-hamsterster 0.480 0.910 0.990 0.984 0.995
fb-pages-politician 0.583 0.683 0.776 0.798 0.835
CA-CondMat 0.567 0.977 0.992 0.992 0.991
soc-epinions 0.231 0.722 0.701 0.705 0.773
web-webbase-2001 0.831 0.884 0.989 0.992 0.979
CA-GrQc 0.533 0.975 0.982 0.985 OOM

According to the results shown in Table 5, when hh is larger, the average CuC_{u} is generally larger, showing the LhhCDSes with larger hh are closer to clique. In addition, there is a big difference between h=3h=3 and h=2h=2 (L22CDS is LDS), which shows that LDS is less dense than other LhhCDS. Our algorithm is important for finding near-clique subgraphs, which cannot be replaced by LDS.

6.6. Memory Overheads

We compare the memory utilization for the IPPV and LTDS algorithms across all datasets (h=3h=3, k=5k=5). Figure 15 illustrates a clear correlation between memory usage and dataset size. IPPV strategically reduces the size of candidate subgraphs through a pruning mechanism prior to evaluating self-compactness. The verifying part often dominates the memory consumption.

Refer to caption
Figure 15. Memory usage of algorithms

6.7. The Number of Iterations

To choose the optimal number of iterations TT of SEQ-kClist++, we set different TT on the IPPV algorithm. We select T=5,10,15,20,40,T=5,10,15,20,40, 60,80,10060,80,100, as shown in Figure 16. The experiment on eight datasets shows that the optimal performance is between 1515 and 2020 iterations. In our experiments, we choose T=20T=20.

Refer to caption
Figure 16. Running time of eight datasets with different TT

6.8. Case Study of LhxhxPDS

We utilize the same real dataset (Krebs, 2004) to experimentally illustrate the LhxhxPDS problem. For each pattern depicted in Figure 8, we compute the results of L4x4xPDS. In Figure 17, the set of steelblue vertices is the top-11 LhxhxPDS, and the set of orange vertices if exists, is the top-22 LhxhxPDS of the pattern hxhx. It is evident that the L4x4xPDS corresponding to various patterns exhibit differences in terms of the number of L4x4xPDS, the number of vertices, and the position of vertices. To delve deeper into graph analysis, tasks such as community clustering can be extended to explore the LhxhxPDS subgraph.

Refer to caption
(a) 33-star
Refer to caption
(b) 44-path
Refer to caption
(c) c33-star
Refer to caption
(d) 44-loop
Refer to caption
(e) 22-triangle
Refer to caption
(f) 44-clique
Figure 17. L4x4xPDS case study on real network (the top-11 LhxhxPDS: steelblue; the top-22 LhxhxPDS: orange vertices)

7. CONCLUSION

In this paper, we study how to discover locally hh-clique densest subgraphs in a graph GG, i.e., the LhhCDS problem. We present IPPV, an iterative propose-prune-and-verify pipeline for top-kk LhhCDS detection. The hh-clique compact number bounds and graph decomposition method which help to derive LhhCDS candidates efficiently are proposed. A new optimized verification algorithm is designed, and its correctness is proved. The extension of our algorithm to solve the locally general pattern densest subgraph problem is feasible and promising. Extensive experiments on real datasets show the high efficiency and scalability of our proposed algorithm. When hh is large in large-scale graphs, there is still room for further optimizing IPPV. We will continue to optimize the algorithm and further explore the LhxhxPDS problem in our future work.

Acknowledgements.
Dr. Wang is supported in part by the National Natural Science Foundation of China Grant No. 61972404, Public Computing Cloud, Renmin University of China, and the Blockchain Lab. School of Information, Renmin University of China. Dr. Li is supported in part by the National Natural Science Foundation of China Grant No.12071478. The authors are grateful to Professor Lijun Chang for his help with the revision of this paper. The authors would like to thank the anonymous reviewers and shepherd for providing constructive feedback and valuable suggestions.

References

  • (1)
  • Angel et al. (2012) Albert Angel, Nick Koudas, Nikos Sarkas, Divesh Srivastava, Michael Svendsen, and Srikanta Tirthapura. 2012. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. The VLDB Journal 23 (2012), 175–199. https://api.semanticscholar.org/CorpusID:2185310
  • Bahmani et al. (2012) Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Densest Subgraph in Streaming and MapReduce. ArXiv abs/1201.6567 (2012).
  • Benson et al. (2016) Austin R. Benson, David F. Gleich, and Jure Leskovec. 2016. Higher-order organization of complex networks. Science 353 (2016), 163 – 166. https://api.semanticscholar.org/CorpusID:3635447
  • Boob et al. (2019) Digvijay Boob, Yu Gao, Richard Peng, Saurabh Sawlani, Charalampos E. Tsourakakis, Di Wang, and Junxing Wang. 2019. Flowless: Extracting Densest Subgraphs Without Flow Computations. Proceedings of The Web Conference 2020 (2019).
  • Boyd and Vandenberghe (2010) Stephen P. Boyd and Lieven Vandenberghe. 2010. Convex Optimization. IEEE Trans. Automat. Control 51 (2010), 1859–1859. https://api.semanticscholar.org/CorpusID:37925315
  • Charikar (2000) Moses Charikar. 2000. Greedy approximation algorithms for finding dense components in a graph. In International Workshop on Approximation Algorithms for Combinatorial Optimization.
  • Chekuri et al. (2022) Chandra Chekuri, Kent Quanrud, and Manuel R. Torres. 2022. Densest Subgraph: Supermodularity, Iterative Peeling, and Flow. In ACM-SIAM Symposium on Discrete Algorithms.
  • Chen and Saad (2012) Jie Chen and Yousef Saad. 2012. Dense Subgraph Extraction with Application to Community Detection. IEEE Transactions on Knowledge and Data Engineering 24 (2012), 1216–1230. https://api.semanticscholar.org/CorpusID:11360561
  • Danisch et al. (2017) Maximilien Danisch, T-H. Hubert Chan, and Mauro Sozio. 2017. Large Scale Density-friendly Graph Decomposition via Convex Programming. Proceedings of the 26th International Conference on World Wide Web (2017).
  • Fang et al. (2019) Yixiang Fang, Kaiqiang Yu, Reynold Cheng, Laks V. S. Lakshmanan, and Xuemin Lin. 2019. Efficient Algorithms for Densest Subgraph Discovery. Proc. VLDB Endow. 12 (2019), 1719–1732.
  • Fazzone et al. (2022) Adriano Fazzone, Tommaso Lanciano, Riccardo Denni, Charalampos E Tsourakakis, and Francesco Bonchi. 2022. Discovering polarization niches via dense subgraphs with attractors and repulsers. Proceedings of the VLDB Endowment 15, 13 (2022), 3883–3896.
  • Gibson et al. (2005) David Gibson, Ravi Kumar, and Andrew Tomkins. 2005. Discovering Large Dense Subgraphs in Massive Graphs. In Very Large Data Bases Conference. https://api.semanticscholar.org/CorpusID:120822
  • Gionis et al. (2013) A. Gionis, Flavio Paiva Junqueira, Vincent Leroy, Marco Serafini, and Ingmar Weber. 2013. Piggybacking on Social Networks. Proc. VLDB Endow. 6 (2013), 409–420. https://api.semanticscholar.org/CorpusID:2240241
  • Goldberg (1984) Andrew V. Goldberg. 1984. Finding a Maximum Density Subgraph. In Technical report,University of California, Berkeley.
  • Goldfarb and Liu (1991) Donald Goldfarb and Shucheng Liu. 1991. An O(n3L) primal interior point algorithm for convex quadratic programming. Mathematical Programming 49 (1991), 325–340. https://api.semanticscholar.org/CorpusID:29420601
  • Harb et al. (2022) Elfarouk Harb, Kent Quanrud, and Chandra Chekuri. 2022. Faster and Scalable Algorithms for Densest Subgraph and Decomposition. In Neural Information Processing Systems.
  • Hu et al. (2019) Jiafeng Hu, Reynold Cheng, Kevin Chen-Chuan Chang, Aravind Sankar, Yixiang Fang, and Brian Yee Hong Lam. 2019. Discovering Maximal Motif Cliques in Large Heterogeneous Information Networks. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (2019), 746–757. https://api.semanticscholar.org/CorpusID:174819659
  • Jin et al. (2009) Ruoming Jin, Yang Xiang, Ning Ruan, and David Fuhry. 2009. 3-HOP: a high-compression indexing scheme for reachability query. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (2009). https://api.semanticscholar.org/CorpusID:7625491
  • Krebs (2004) Valdis Krebs. 2004. Books about US politics. Unpublished. http://www.orgnet.com/
  • Lanciano et al. (2023) Tommaso Lanciano, Atsushi Miyauchi, Adriano Fazzone, and Francesco Bonchi. 2023. A survey on the densest subgraph problem and its variants. arXiv preprint arXiv:2303.14467 (2023).
  • Leskovec and Krevl (2014) Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
  • Leskovec et al. (2006) Jure Leskovec, Ajit Singh, and Jon M. Kleinberg. 2006. Patterns of Influence in a Recommendation Network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. https://api.semanticscholar.org/CorpusID:332896
  • Li et al. (2022) Ruiming Li, Jung-Yu Lee, Jinn-Moon Yang, and Tatsuya Akutsu. 2022. Densest subgraph-based methods for protein-protein interaction hot spot prediction. BMC Bioinformatics 23 (2022).
  • Liu et al. (2018) Yike Liu, Tara Safavi, Abhilash Dighe, and Danai Koutra. 2018. Graph summarization methods and applications: A survey. ACM computing surveys (CSUR) 51, 3 (2018), 1–34.
  • Luo et al. (2023) Wensheng Luo, Chenhao Ma, Yixiang Fang, and Laks VS Lakshman. 2023. A Survey of Densest Subgraph Discovery on Large Graphs. arXiv preprint arXiv:2306.07927 (2023).
  • Ma et al. (2022) Chenhao Ma, Reynold Cheng, Laks VS Lakshmanan, and Xiaolin Han. 2022. Finding locally densest subgraphs: a convex programming approach. Proceedings of the VLDB Endowment 15, 11 (2022), 2719–2732.
  • Mitzenmacher et al. (2015) Michael Mitzenmacher, Jakub W. Pachocki, Richard Peng, Charalampos E. Tsourakakis, and Shen Chen Xu. 2015. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015).
  • Palla et al. (2005) Gergely Palla, Imre Derényi, Illés J. Farkas, and Tamás Vicsek. 2005. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 (2005), 814–818. https://api.semanticscholar.org/CorpusID:3250746
  • Picard and Queyranne (1982) Jean-Claude Picard and Maurice Queyranne. 1982. A network flow solution to some nonlinear 0-1 programming problems, with applications to graph theory. Networks 12 (1982), 141–159.
  • Qin et al. (2015) Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang. 2015. Locally densest subgraph discovery. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 965–974.
  • Rossi and Ahmed (2015) Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI. https://networkrepository.com
  • Saha et al. (2010) Barna Saha, Allison Hoch, Samir Khuller, Louiqa Raschid, and Xiao-Ning Zhang. 2010. Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs. In Annual International Conference on Research in Computational Molecular Biology. https://api.semanticscholar.org/CorpusID:11280177
  • Samusevich et al. (2016) Raman Samusevich, Maximilien Danisch, and Mauro Sozio. 2016. Local triangle-densest subgraphs. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 33–40.
  • Spirin and Mirny (2003) Victor Spirin and Leonid A. Mirny. 2003. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America 100 (2003), 12123 – 12128. https://api.semanticscholar.org/CorpusID:136093
  • Sun et al. (2020) Bintao Sun, Maximilien Danisch, TH Hubert Chan, and Mauro Sozio. 2020. Kclist++: A simple algorithm for finding k-clique densest subgraphs in large graphs. Proceedings of the VLDB Endowment (PVLDB) (2020).
  • Trung et al. (2023) Tran Ba Trung, Lijun Chang, Nguyen Tien Long, Kai Yao, and Huynh Thi Thanh Binh. 2023. Verification-Free Approaches to Efficient Locally Densest Subgraph Discovery. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (2023), 1–13. https://api.semanticscholar.org/CorpusID:260171546
  • Tsourakakis (2015) Charalampos Tsourakakis. 2015. The k-clique densest subgraph problem. In Proceedings of the 24th international conference on world wide web. 1122–1132.
  • Tsourakakis et al. (2013) Charalampos E. Tsourakakis, Francesco Bonchi, A. Gionis, Francesco Gullo, and Maria A. Tsiarli. 2013. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (2013). https://api.semanticscholar.org/CorpusID:213308
  • Wang et al. (2013) Jia Wang, James Cheng, and Ada Wai-Chee Fu. 2013. Redundancy-aware maximal cliques. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (2013). https://api.semanticscholar.org/CorpusID:16494932
  • Wuchty et al. (2003) Stefan Wuchty, Zoltán N. Oltvai, and A L Barabasi. 2003. Evolutionary conservation of motif constituents in the yeast protein interaction network. Nature Genetics 35 (2003), 176–179. https://api.semanticscholar.org/CorpusID:1627007
  • Yuan et al. (2015) Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and W. Zhang. 2015. Diversified top-k clique search. The VLDB Journal 25 (2015), 171–196. https://api.semanticscholar.org/CorpusID:15668109
  • Zhao and Tung (2012) Feng Zhao and Anthony Kum Hoe Tung. 2012. Large Scale Cohesive Subgraphs Discovery for Social Network Visual Analysis. Proc. VLDB Endow. 6 (2012), 85–96. https://api.semanticscholar.org/CorpusID:12588941
  • Zou (2013) Zhaonian Zou. 2013. Polynomial-time algorithm for finding densest subgraphs in uncertain graphs. In Proceedings of MLG Workshop.