Scalable -clique Densest Subgraph Search
Abstract.
In this paper, we present a collection of novel and scalable algorithms designed to tackle the challenges inherent in the -clique densest subgraph problem (-) within network analysis. We propose , a novel algorithm based on the Frank-Wolfe approach for addressing -, effectively solving a distinct convex programming problem. is able to approximate - with near optimal guarantees. The notable advantage of lies in its time complexity, which is independent of the count of -cliques, resulting in remarkable efficiency in practical applications. Additionally, we present , a sampling-based algorithm with the capability to handle networks on an unprecedented scale, reaching up to edges. By leveraging the algorithm as a uniform -clique sampler, ensures the efficient processing of large-scale network data, accompanied by a detailed analysis of accuracy guarantees. Together, these contributions represent a significant advancement in the field of -clique densest subgraph discovery. In experimental evaluations, our algorithms demonstrate orders of magnitude faster performance compared to the current state-of-the-art solutions.
1. Introduction

Dense subgraph search plays a primary role in graph mining. Given a graph , a dense subgraph can be the subgraph with the highest edge-to-node ratio, called the densest subgraph, or the largest graph with all nodes mutually connected, called the maximum clique (whose size is denoted as ). Both types of dense subgraphs can be captured by a unified problem called -clique densest subgraph search (-) (Tsourakakis, 2015). Given an integer and graph , - reports a subgraph that maximizes the -clique density – the ratio between the total number of -cliques in and the number of nodes in . - reports the densest subgraph of when and the maximum clique when . When is small, - reports dense subgraphs on small motifs such as triangles, which is applicable to document summarization (Konar and Sidiropoulos, 2022) and network analysis (please see (Lee et al., 2010) as an entrance); when becomes larger, - reports near cliques (also called clique relaxations or defected cliques) where “almost” every pair of nodes are connected. Near clique search is important (Mitzenmacher et al., 2015) to predicting protein-protein interactions (Cui et al., 2008; Yu et al., 2006) and identifying over-represented motifs in DNA (Fratkin et al., 2006). Thus, the problem of - has attracted an increasing attention recently (Sun et al., 2020; He et al., 2023), exclusively on its computation efficiency and scalability.
Example 1.1.
Figure 1(a) shows a graph with triangles on nodes. Its -clique density is thus . The the subgraph induced by achieves the maximum -clique density of .
- is closely related to the problems of sampling, counting, or enumeration of -cliques in a graph. It is worth noting that while approximate - can resort to sampling, exact - necessitates an enumeration of all -cliques (Sun et al., 2020; He et al., 2023; Tsourakakis, 2015; Mitzenmacher et al., 2015). This complexity is lower bounded by the number of -cliques on , which limits the scalability. Having noticed this drawback, (He et al., 2023) propose to batch update by an index of called Succinct Clique Tree () (Jain and Seshadhri, 2020). is a general technique that can disjointly partition all cliques of into groups. is based on the observation that some cliques in the same group are identical to update, and thus these cliques can be computed together in a batch. However, the availability of such a batch update is not guaranteed, and thus the enumeration of -cliques may still be necessary in the worst case.
SOTA Approx. - | ++ (Sun et al., 2020) | (He et al., 2023) | (ours) | (Sun et al., 2020) | (He et al., 2023) | (ours) |
Time | ||||||
Memory |
This paper focuses on - approximately for better scalability. To find a - solution whose complexity is independent of the number of -cliques, we proposed to relax - into a new convex programming problem . We prove that the optimal solution of produces a near-optimal approximation of the -clique densest subgraph. On all datasets we tested in experiment, the optimal solution of can find the optimal -clique density exactly. More importantly, allows one to shift the dependency of the - complexity on to the “cardinality” of the of . As previous work (Jain and Seshadhri, 2020) shows, is linear to the number of edges of in practice. On the real-world networks we tested in experiments, is less than , and is on average. We compare the complexity of existing methods in Table 1, in which is the algorithm for finding the optimal solution of . To the best of our knowledge, this is the first approach whose complexity is independent of the number of -cliques. Our solution achieves -clique density comparable to the baselines while its running time is up to 4 orders of magnitude faster.
Sampling-based methods for - (Mitzenmacher et al., 2015; Sun et al., 2020; He et al., 2023) samples the -cliques uniformly. The sampled -cliques constructs a sparser graph (Mitzenmacher et al., 2015) and the -clique densest subgraph in the sampled sparser graph is an approximation of the -clique densest subgraph in the whole graph. Mitzenmacher et.al. (Mitzenmacher et al., 2015) proposed to sample a fixed proportion of the -cliques to make sure the accuracy (Mitzenmacher et al., 2015). However, the following works show that the sampling-based algorithms also have good performance when samples only a constant number of -cliques, but there is no theory to explain why (Sun et al., 2020; He et al., 2023). Further, sampling-based methods suffer from inefficient sampling process of the -cliques. To sample -cliques, existing algorithms either need to list all -cliques (Sun et al., 2020; Mitzenmacher et al., 2015) or build the -index (He et al., 2023), leading to the exponential time complexity of sampling. In this paper, we adopt -color path sampling algorithm (Ye et al., 2022) as a uniform -clique sampler, which has polynomial time complexity and dramatically improves the efficiency of sampling. The proposed algorithm is named as . Notably, to the best of our knowledge, is the first algorithm capable of effectively handling networks of up to edges in scale for large . Additionally, we provide a comprehensive analysis of the accuracy guarantees. Our analysis shows that the approximation has accuracy guarantees when the -clique density of the subgraph reported by the sampling-based algorithms is large enough. The theoretical analysis is applicable to all existing sampling-based algorithms. In summary, we make the following contributions.
-
•
: A Novel Frank-Wolfe Algorithm. We formulate a new convex programming problem for - approximation. The new convex programming problem allows for weight assignment beyond the -clique boundaries and is a near-optimal approximation for -. We propose a novel Frank-Wolfe algorithm for solving the new convex programming problem. We prove that can achieve a near-optimal solution. Moreover, the striking feature of is that the time complexity is independent of the count of -cliques, making it highly efficient in practice.
-
•
: Sampling-Based Algorithm for Large Networks. We develop a scalable sampling-based algorithm, namely . Specifically, is an efficient algorithm that has polynomial time complexity and is capable of handling large networks. We also present a theoretical analysis about the accuracy of , which is also applicable to existing sampling-based algorithms.
-
•
Extensive experiments. We evaluate our algorithms on 12 large real-life graphs. The results show that is up to orders of magnitude faster than the state-of-the-art algorithms ( and ++) using similar space. Our sampling based algorithm can obtain an good approximation and is also up to orders of magnitude faster than the state-of-the-art sampling-based algorithms. For reproductivity purpose, we release our source code at https://github.com/LightWant/densestSubgraph.
2. Preliminaries
Notations | Descriptions |
the graph , node set , edge set | |
the neighbors of in | |
a subset | |
the set of -cliques in | |
a SCT pair, , the set of all SCT pairs | |
degeneracy of network | |
size of SCT | |
-clique density of | |
the set of nodes of the -clique densest subgraph | |
(or ) | : the vector of weight assignment from -cliques (or SCT pair) to vertices; (or ): the weight of the -clique (or SCT pair ) assigned to |
the rank of in the graph decomposition, or the summary of weight assigned to i.e. | |
the first batch of , which is the set of nodes with the largest rank | |
: sample size : number of iterations(a small constant) : probability of a -color path being a -clique : probability of -clique being sampled |
Given an undirected graph with node set and edge set where each edge has . Denote by the neighbor set of in . For a subset of nodes in , we abuse to denote the induced subgraph of in whose node set is and edge set is . is a clique for every pair and of nodes in , . Consider a clique of . is a maximal clique if for any , is not a clique. Denote by the size of the maximum clique of and by the set of all cliques in . The degeneracy, denoted , is the smallest value for which every subgraph of has a vertex with degree at most in (Eppstein et al., 2013; Li et al., 2022). The value of degeneracy is often not very large in real-world networks (Eppstein et al., 2013; Li et al., 2022). For an integer , a clique is a -clique if . denotes the set of -cliques in the induced subgraph . For a subgraph , denote by the set of the -cliques in . The -clique density of a subgraph of is
The -clique Densest Subgraph Search (-). Given and integer . - reports the -clique densest subgraph of .
Definition 2.1 (-clique Densest Subgraph).
The -clique densest subgraph of is a subgraph such that for all and if , is a subgraph of .
Graph Decomposition () (Sun et al., 2020; Danisch et al., 2017). is a -clique density based decomposition of . In , iteratively (-th iteration, ) perform the following steps to produce the ranking for each node . Given a graph on node set and integer , let and be the collection of the -cliques of . For the next iterations, let be and let be where and is the subgraph with maximum in . Let for all . The iteration terminates when . It was proved (Sun et al., 2020) that for two iterations , . Clearly, and is the -clique densest subgraph. Thus, the - of equals to computing the first batch of , i.e. .
For example, for , Figure 1(a) shows the peeling process of where with in the first iteration. The vertices in have a rank value .
Convex programming (Sun et al., 2020; Danisch et al., 2017). In the context of below, three conditions () assign a weight of 1 to each -clique . This weight is then distributed among its nodes using weights (, ). Additionally, the total weights accumulated by each node are subject to the condition that the squared summation is minimized ().
It can be proved (Danisch et al., 2017; Sun et al., 2020) that the optimal solution of is exactly the ranking for the problem. Figure 1(b) shows how the constraints facilitates the weight distribution in the optimal solution. The -cliques on the top, to each has weight distributed to their 3 nodes linked to the second line ( to ). The weights collected by each node (shown in the box below the node) achieve the optimal squared sum in the objective function, which is identical to computed in the .
Lemma 2.2 ((Sun et al., 2020)).
Consider a -clique and let be the node in with the smallest ranking and the node with the largest ranking. Either or .
Lemma 2.3 ((Sun et al., 2020; Danisch et al., 2017)).
Let be the -clique densest subgraph of . Let be the optimal solution of and the first batch of nodes in . and for .
Apply Lemma 2.3 recursively along the peeling process of the can eventually reach the conclusion that the optimal solution of is exactly the ranking in .
Frank-Wolfe framework for solving (G). Algorithm 1 shows the framework tailored for the convex programming (He et al., 2023; Sun et al., 2020). iteratively (line 3) gets a clique in a random order (line 4) and then rebalances the weight distribution among all nodes in (lines 5-6). Figure 1(c) shows the flowchart of the framework. Specifically, for -, the ”Compute ” step in is repeating -1 and -2 until is empty. For the scanning based algorithm, . For the sampling based algorithm, is the set of -cliques uniformly sampled from .
-
-1
Fetches a -clique , then removes from
-
-2
Finds the node in that carries the lowest (breaks ties arbitrarily), then redistributes the weight of to increase .
Remark. All the SOTA methods, ++, , and , follows the framework of the convex programming (G), and are different in the ways of selecting a clique : and select based on a sample set of , and ++ and scans either directly or using an index (He et al., 2023; Sun et al., 2020). To remove the dependency of scanning-based method on while having the resulting density effectively bounded, we propose a new convex programming problem .
Succinct Clique Tree (Jain and Seshadhri, 2020) (). Given a graph , is an index of all cliques of . It is produced by a set of large cliques of , where each large clique is presented as a pair . Each pair consists of two disjoint node sets and . Denote by . The pair encodes the set of cliques . Since is a large clique, for any is a sub-clique encoded by . Let be the set of all pairs. Each clique is encoded by one and only one pair. is an index of in a special shape of search tree.
Lemma 2.4 ((Jain and Seshadhri, 2020)).
Let be a pair in of . We have . In other words, the combination of any subset of and the entire forms a clique in . In particular, is a clique. Besides, all cliques in are disjointly partitioned into [Theorem.4.1., (Jain and Seshadhri, 2020)].
We call the size of the index which is a parameter related only to the graph . The upper bound of is (Jain and Seshadhri, 2020). The value of is often not extremely large for real graphs (Jain and Seshadhri, 2020) as shown in Table 3.
Lemma 2.5.
Given a graph and its , an integer , let . For a pair , define all -cliques in as set . Then is a disjoint partition of all -cliques of . The cardinality .
Proof.
According to Lemma 2.4, covers all cliques. Thus, we can derive that also covers all -cliques. comes from by removing the pairs that and . When , the size of the cliques must be larger than . When , the size of the cliques in must be smaller than . comes from by removing the pairs that do not contain -cliques. ∎
In the following discussions, we focus on the -cliques of and thus abuse to denote unless otherwised specified. For easy reference, all the key notations in this paper are summarized in Table 2.
Lemma 2.6 ((Jain and Seshadhri, 2020)).
Given a pair , there are -cliques in total. For each , there are -cliques that contain . For each , there are -cliques that contain .
To form a -clique, we need to choose vertices from . With being chosen, we need to select vertices from vertices.


Example 2.7.
3. New Convex Programming for -DSS
To more efficiently compute -, we present a new convex programming problem based on . We first formulate .
3.1. SCT-based Convex Programming
assigns a weight for each node and each pair such that . The assigned weights must satisfy the following constraints.
Intuitive explanation. Recall that in , each clique in is allowed to have a weight of distributed among the nodes in , denoted as . Our new convex programming has made the following changes. Instead of each -clique having weight , we let each pair have weight because there are -cliques encoded by (Lemma 2.6). We still want the weight to be retained in nodes in to avoid an over-flattened weight distribution in . To do so, we impose a soft constraint of based on the fact that each node participates in -cliques (containing and ), which gives an upper bound on .
Let be the optimal solution of . Let be the ranking vector induced by . We first describe some properties of and . Then, we explain the relationship between and the -clique densest subgraph.
Properties of and . Lemma 3.1 shows that each entry is upper bounded by the number of -cliques that contain . The property is identical to the optimal solution of where each -clique only assigns weight to one of the nodes in the -clique.
Lemma 3.1.
Let be a feasible solution of , it has and , is upper bounded by the number of -cliques that contain .
Proof.
Theorem 3.2.
Given a pair , define if and if . must equal if there exists that and .
Proof.
Assume that , and . By re-assigning the weight from to , we can reduce the gap between and and reach a smaller value of the objective function, which contradicts the fact that is the optimal solution. ∎
Theorem 3.2 shows the property of that the weight are assigned as even as possible. Since is bounded by ( of ), is like a barrier to stop the weight flowing from the denser vertices to the .
We give the definition of relaxed stable subset.
Definition 3.3.
[relaxed stable subset] A subset of vertices is a relaxed stable subset if (1) . (2) For each pair that intersects both and , only when .
A relaxed stable subset has a larger rank than other nodes (condition (1) in Definition 3.3) and the pairs that intersect both and assign their weights to the nodes outside of as much as possible (condition (2) in Definition 3.3).
Recall that . We can derive that is a relaxed stable subset by Theorem 3.2 and Definition 3.3. For a pair that intersects and , the -cliques encoded by can be classified into three types: (1) all nodes in ; (2) all nodes in ; (3) the nodes intersecting both and . Each of the -clique has a weight of . Since the weight of are assigned to nodes in as much as possible, the type (1) -cliques assign the weights to , and the type (2) and (3) assign weights to . In other words, the nodes in only receive weights from the -cliques in .
Relationships between and . With the above analysis, we can derive Theorem 3.4 and 3.5 which explain the relationships between and the -clique densest subgraph .
Theorem 3.4.
The optimal solution of represents a subgraph that .
Proof.
We prove that for any with , . By the analysis above, the nodes in only receive weights from the -cliques in . Thus, we have
(1) | ||||
∎
Theorem 3.5.
The optimal solution of represents a subgraph that .
Proof.
Given two set of vertices and , the set of additional -cliques of to is . From
we can obtain
(2) |
Let be , which denotes the count of -cliques in that assign their weight to . According to the definition of SCT, each -clique exists in only one pair. Thus, we can consider each pair independently. Given a pair that there are -cliques in assign their weight to , each vertex in generates at least additional -cliques. Thus, we have at least additional -cliques for pair . Sum all the pairs together and we reach
(3) |
Equation (3) comes from the fact that and . The summary is greater than the size of nodes outside of because the same node may be in more than one pairs.
At last, we can obtain the result
∎
Theorem 3.5 provides a guarantee that the optimal solution of our new convex programming is a near-optimal approximation of the -clique densest subgraph.
The problem solved by . Actually, is the subgraph with the maximum value of , where is the set of pairs with all nodes in . The nodes sorted by is a graph decomposition ( in Section 2) with respect to . The statement is proved by Theorem 3.6 and Theorem 3.7.
Theorem 3.6.
The optimal solution of is the optimal solution of .
Proof.
Since the first level set is a relaxed stable subset, all weights of come from the pairs in and can not further distribute the weights to . Therefore, for is the minimum value that the objective function of can achieve. ∎
It is easy to derive that the dual of is .
Theorem 3.7.
Given a subgraph , denote by the set of pairs . The optimal solution of is the subgraph that maximizes .
Proof.
Given a subgraph , let if and otherwise. Then, we can easily derive that the objective function of is . Thus, each subgraph corresponds to a feasible of .
Given a feasible solution with value , we now prove that it can also construct a subgraph that . Given a parameter , let and . According to () and , we can derive that . Also, from we can derive that . Suppose . Then, we have which is a contradiction. Therefore, there must exist a such that .
Since the subgraphs and feasible solutions are mapped to each other, the subgraph with the maximum value maps to the optimal solution. ∎
From Theorem 3.6 and Theorem 3.7, we know that the optimal solution of finds the subgraph that maximize . Thus, the optimal solution of is intuitively a good approximation for the -clique densest subgraph. As shown in our experiments, the optimal solution of on all real-world datasets we tested can generate the -clique densest subgraph exactly.
3.2. FW-based Algorithm for SCT-CP(G)
Algorithm 2 tailors framework for . It replaces Lines 4-6 of Algorithm 1 with Algorithm 3 called for each to update the rankings .
Algorithm 3 allocates the weight of (Line 2) cliques of pair to its nodes . The allocation aims at bringing up the rankings of the lowest-ranking nodes in so as to minimize the squared sum of . This is done by first sorting all nodes in based on its original ranking and find the “plateaus” of different rankings (Line 1): nodes in share the highest ranking (also called the first batch ), nodes in the second largest ranking , , nodes in the smallest ranking . Denote by the number of different rankings of nodes in . The rankings , , , are strictly increasing. Lines 4-5 setup the upper bound for each node in the allocation based on and of . Lines 7-20 progressively use the budget of to bring the lowest ranking nodes up while satisfying the upper bounds set for each node. Specifically, (initially ) is the set of smallest ranking nodes in who can receive weights from the cliques without violating their upper bounds , we call them current potential nodes. Denote by the ranking differences of the smallest and the second smallest rankings of nodes in ; if the second smallest ranking does not exist, then we let be (Line 8). After that, Lines 10-20 allocates the weights of cliques to the current potential nodes. In Line 10, the highest ranking increment of the potential nodes before i) a node reaches its upper bound and then should be kicked out of current potential node set , ii) all nodes in reaches the ranking of , or iii) the weights of all cliques are used up. Each node in will then receive the weights of cliques (Lines 11-14) unless the (Lines 15-17) where a random set of nodes in will receive one extra weight. , , and the current potential set are updated accordingly (Lines 14,17,18). If by the time when , there are still potential nodes in , it means that their rankings have already reached , we then merge them to (Line 19).
Theorem 3.8.
The time complexity of is where is the upper bound of the size of SCT-tree, and is the degeneracy of the input graph . The space complexity is .
Proof.
In Algorithm 3, the while loop in line 7, line 9 and line 12 all has time complexity . Thus, the time complexity of Algorithm 3 is . Since is a clique, is bounded by . need to scan over and call for each pair. Thus, the time complexity of is . The main space cost of is the storage of the SCT-tree, which is (Jain and Seshadhri, 2020). ∎

Example 3.9.
Figure 3 illustrates an example to explain how works on the example graph in Figure 2(a). In Figure 2(a), it has five pairs of , including , , , , , . The five pairs are accessing in a random order. Initially, the elements of the vector are all zeros. The updated label denotes that the update is line 13 of Algorithm 3, and is line 16 of Algorithm 3.
Since there is only one -clique in and the size of is (line 10 of Algorithm 3), we have and the weight is assigned to in 1ine 16 of Algorithm 3. Then, for , there are -cliques. Since has the smallest weight, the weights are assigned to averagely (line 13 of Algorithm 3). The following updates are similar.
Remarks. Compared to the worst-case time complexity of (He et al., 2023), is much smaller for large real-world networks: as shown in Table 3, the values of of -tree and degeneracy on large real-graphs are not large (Jain and Seshadhri, 2020). More importantly, is independent of and the number of -cliques in graph .
3.3. Analysis of the Algorithm
Denote by the vector of variables of for . Let be the domain of specified in . Since the rankings are derived from , we denote the objective function of by
Theorem 3.10.
is an implementation of a Frank-Wolfe algorithm to the convex program of .
Proof.
solves the following problem. Given , find to minimize . Since each pair has constant count of weights and can only assign weights to the vertices in the pair (), we consider each pair independently. For each pair , to minimize , the weights should be assigned to the vertices with the smallest value of in . This is the work do for each pair. ∎
By Theorem 3.10, the result obtained by the algorithm is an exact solution for SCT-CP(G) (Danisch et al., 2017; Jaggi, 2013).
Corollary 3.11.
Let be the result obtained upon converges. Let be the vector of weight assignment of . is the optimal solution of .
Convergence analysis of . To analyze the convergence rate of , we use the curvature constant (Jaggi, 2013) as a measure of ”non-linearity” of the objective function . The curvature constant of a convex and differentiable function with respect to a compact domain is
(4) |
Lemma 3.12.
where , the maximum number of pairs covering the same node in .
Proof.
Theorem 3.13.
Let where is the results of in -th iterations. Let be difference between and the optimal solution . We have where . is the maximum number of -cliques that can cover one node .
Proof.
Observe that . The update can be seen as where denotes. Dimension of is , and we have . We have and will be increased by at most (, (G), the bound is given by the number of cliques in covering and , (G) the bound is given by the number of cliques in covering ).
Denote by the vector of pair. Denote by . For , we have
(5) | ||||
The above derivation is correct because can consider each pair independently. Thus, according to the Algorithm 2 in (Jaggi, 2013),
The last inequality comes directly from the definition of in Equation 4 from which we have . ∎
Theorem 3.14.
For each , where .
4. New Sampling-Based Algorithm
Since the hardness of -, FW-based solutions may still be costly when handling very large graphs. To further improve the efficiency, sampling-based solutions are often used which can typically obtain a good approximation of the -clique densest subgraph (Mitzenmacher et al., 2015; Sun et al., 2020; He et al., 2023). In this section, we propose a new but efficient sampling-based algorithm, called , which employs the algorithm porposed in (Ye et al., 2022) to sample -cliques. A remarkable feature of is that it has polynomial time complexity. To the best of our knowledge, is the first algorithm that runs in polynomial time, thus it can handle large graphs. We will also present a detailed theoretical analysis of the accuracy bound of .
4.1. The algorithm
For sampling-bases solutions, the state-of-the-arts are the and algorithm (Sun et al., 2020; He et al., 2023). counts all -cliques at first, and then samples -cliques uniformly. builds the SCT-Index at first, and then samples -cliques uniformly from the SCT-Index. Thus, both and suffer from exponential time complexity and are often intractable for handling large graphs. To overcome these limitations, we develop a poly-nominal algorithm, called .
First, we briefly introduce the algorithm which was originally proposed to estimate the number of -cliques in a graph (Ye et al., 2022). In , we use as a uniformly -clique sampler with polynomial running time. first colors the graph (using a linear-time greedy graph coloring algorithm) such that the vertices of each edge in the graph must have different colors. is an efficient algorithm for counting -cliques through sampling from a combinatoric structure called -color path. Specifically, a -color path is a path with vertices, and the vertices have different colors. Each -clique must be a -color path, i.e. the set of -cliques is a subset of -color paths (because the vertices in a clique must have different colors). is a polynomial dynamic programming algorithm which can uniformly samples from -color paths. Since the set of -cliques is a subset of -color paths, the -cliques can also be sampled uniformly. Note that checking if a -color-path is a -clique can be easily done in time by verifying whether any pair of node is connected. Similar to (Ye et al., 2022), we can regard the probability that a CCPATH is a -clique as a graph-related parameter which is often very high as shown in (Ye et al., 2022).
The details of are outlined in Algorithm 4. Algorithm 4 admits as a parameter for the size of samples (line 1). Firstly, the algorithm uniformly samples the -cliques and get the set of samples (lines 1-2). Then, the algorithm invokes ++ on in a small number of iterations, and then returns the approximation result (lines 3-10).
As shown in Algorithm 4, the proposed algorithm is very simple, but it is very efficient. Below, we first analyze the time and space complexity of Algorithm 4.
Proof.
As shown in Theorem 4.1, is completely free from the count of -cliques. As shown in our experiments, is very efficient and it can solve the -DSS problem on very large graphs that has vertices more than 1.8 . Below, we present the theoretical analysis of the accuracy guarantee of the algorithm. The results show that the accuracy of is based on a mild condition, which also explains the good accuracy of the previous works (Sun et al., 2020; He et al., 2023).
4.2. Analysis of the Algorithm
In this subsection, we analyze the accuracy bound of . For our analysis, we need the following Chernoff bound.
Theorem 4.2 (Chernoff bound).
Let , where are independent with each other. with probability and with probability . Let . is a number that . Then
Let be the set of vertices , where is the set of sampled -cliques (line 2 of Algorithm 4). Let be the ground truth of the -clique densest subgraph in , which has the largest value . Let and denotes the -clique density in the sampled -cliques. is the result returned by , which is expected to have the largest value of .
Let be the random variable that indicates whether the -clique is sampled or not, i.e. if or otherwise. Let be the probability of each -clique being sampled. Denote by the probability of a -color path being a -clique (line 2 of Algorithm 4), then we have . Subsequently, we can derive the following results.
Lemma 4.3.
.
Proof.
. ∎
Lemma 4.4.
.
Proof.
.
∎
Based on the above lemmas, we can obtain the following theorem.
Theorem 4.5.
Let and be small constant numbers. We can conclude that is an approximation of with probability if .
Proof.
Set the vertices set in Lemma 4.4 by . Then, we have . According to Chernoff bound, we have
This immediately gives
Since , then we have
(6) |
Since , we can derive that
(7) |
To ensure it has and and we can conclude that . ∎
Theorem 4.6.
In expectation, our algorithm can achieve an -approximation if .
Proof.
Theorem 4.6 proves that the number of iterations of is liner with respect to .
Discussions. Theorem 4.5 shows that if a subgraph reported by a sampling-based algorithm is dense enough, the subgraph should also be an good approximation. Note that Theorem 4.5 is based on the condition that should be large enough. It is important to note that this is a mild condition. The reasons are as follows. First, is not large. For example, is around when and . Second, when is small, the input graph should be very sparse, thus we can utilize exact algorithms to solve it. Third, is always not small. As shown in (Ye et al., 2022), is often a large value for real-world graphs. Fourth, returns the subgraph with the maximum value of (line 10 of Algorithm 4). Based on these reasons, such a condition is often easily satisfied for real-world graphs.
Networks | ||||
7,115 | 100,762 | 53 | 489,803 | |
26,475 | 53,381 | 22 | 8,312 | |
75,879 | 405,740 | 67 | 1,437,313 | |
196,591 | 950,327 | 51 | 930,005 | |
403,394 | 2,443,408 | 10 | 660,944 | |
425,957 | 1,049,866 | 113 | 166,725 | |
685,230 | 6,649,470 | 201 | 2,430,187 | |
1,157,827 | 2,987,625 | 51 | 1,397,529 | |
1,632,803 | 22,301,964 | 47 | 12,492,547 | |
1,696,415 | 11,095,298 | 111 | 12,548,404 | |
3,072,627 | 117,185,083 | 253 | 264,754,163 | |
65,608,366 | 1,806,067,135 | 304 | 3,876,765,479 |














Networks | |||||||||
++ | ++ | ++ | |||||||
35182.25 | 35182.25 | 35182.25 | 35182.25 | 35182.25 | 35182.25 | 35182.25 | 35182.25 | 35182.25 | |
2203.84 | 2203.84 | 2203.84 | 2203.84 | 2203.84 | 2203.84 | 2203.84 | 2203.84 | 2203.84 | |
482125.39 | 481818.44 | 481818.44 | 482125.39 | 482125.39 | 482125.39 | 482125.39 | 482125.39 | 482125.39 | |
115291.27 | 114200.51 | 115064.38 | 115291.27 | 115268.24 | 115291.27 | 115291.27 | 115291.27 | 115291.27 | |
86.00 | 26.08 | 52.33 | 86.00 | 80.37 | 76.86 | 86.00 | 83.69 | 83.69 | |
- | 360937368.00 | 360937368.00 | - | 360937368.00 | 360937368.00 | - | 360937368.00 | 360937368.00 | |
- | - | 1226107478.17 | - | - | 1230103452.99 | - | - | 1230103452.99 | |
15137.78 | 15045.44 | 15044.36 | 15137.78 | 15130.52 | 15116.24 | 15137.78 | 15134.27 | 15127.43 | |
137917.47 | 137917.47 | 137917.47 | 137917.47 | 137917.47 | 137917.47 | 137917.47 | 137917.47 | 137917.47 | |
- | 111767674.10 | 111767674.13 | - | 111861828.44 | 111861828.44 | - | 111882281.10 | 111882281.10 |
Networks | ++ | |||||
Time (s) | Time (s) | Time (s) | ||||
1 | 0.87 | 1 | 0.14 | 1 | 0.12 | |
1 | 0.01 | 1 | 0.01 | 1 | 0.00 | |
1 | 11.61 | 2 | 2.29 | 2 | 1.17 | |
1 | 5.34 | 5 | 2.67 | 5 | 0.96 | |
1 | 0.13 | 20 | 1.55 | 20 | 0.45 | |
- | - | 1 | 1286.87 | 1 | 0.04 | |
1 | 0.91 | 20 | 2.77 | 110 | 14.88 | |
1 | 10.40 | 1 | 2.47 | 1 | 1.62 |











Algorithms | ||||||||
Time | Time | Time | Time | |||||
- | - | - | - | |||||
- | - | - | - | |||||
1287748.0 | 11511.1 | 1115421.7 | 55815.2 | |||||
- | - | - | - | |||||
1287748.0 | 11545.5 | 1119546.2 | - | |||||
1287748.0 | 11545.5 | 1118411.5 | 68573.8 | |||||
1287748.0 | 11545.5 | 1119450.7 | - | |||||
1287748.0 | 11545.5 | 1119664.6 | - | |||||
1287748.0 | 11545.5 | 1118808.0 | 68748.8 |
5. Experiments
Algorithms. For the Frank-Wolfe based algorithms, we implement the algorithm which is based on Algorithm 3. We use the state-of-the-art algorithm ++ (Sun et al., 2020) and (He et al., 2023) as the baseline algorithms. ++ and are all implementations of Algorithm 1. ++ is a Frank-Wolfe based algorithm for - that scan over the -cliques individually through -clique listing. is a Frank-Wolfe based algorithm, which scan over the -cliques in batches through the SCT-index.
For the sampling-based algorithms, we implement the algorithm (Algorithm 4). For comparison, we use the state-of-the-art sampling-based algorithms (Sun et al., 2020) and (He et al., 2023) as the baselines. Given a parameter , ++ samples -cliques during the procedure of -clique listing, and samples -cliques through SCT-index.
The source code of ++ and is open available (Sun et al., 2020), which is implemented in C++. Since the code of and is not available, we implement them by ourselves in C++, which shows similar performance compared to the results reported in (He et al., 2023).
Datasets. The details of the datasets are shown in Table 3. The 5 columns of Table 3 are the dataset name, number of vertices, number of edges, degeneracy and the size of SCT respectively. All datasets are downloaded from https://snap.stanford.edu/ or http://konect.cc/.
We evaluate all algorithms on a server with an AMD 3990X CPU and 256GB memory running Linux CentOS 7 operating system.
5.1. Results of the FW-based algorihtms
Exp 1 : Runtime of various algorithms with varying . Figure 4 shows the running time of ++, , and for varying . To show the advantage of , we omit the time taken by clique enumeration of ++ and SCT construction.
As increases, the running time of tends to be small. This is because the size of the SCT decreases, i.e. the value of (Theorem 3.8) tends to be small as increases (Jain and Seshadhri, 2020). For example, on , we have when , while when . As shown in Figure 4, substantially outperforms both ++ and when on all the datasets. For example, on , our algorithm can achieve more than orders of magnitude faster than both ++ and when . These results demonstrate the high efficiency of the proposed algorithm.
Additionally, on the complex networks that the degeneracy is larger than (, and in Figure 4), our algorithm can consistently outperform the state-of-the-arts. On these networks, the count of cliques is very large. For example, has -cliques. The excellent performance of is due to the fact that the running time of is free from the count of -cliques, which confirms our theoretical analysis in Section 3.
Exp 2 : Runtime of various algorithms with varying . Figure 5 shows the performance of the Frank-Wolfe based algorithms when varies on Gowalla and Pokec. The results on the other datsets are consistent. We omit the time taken by clique enumeration and SCT construction. As expected, the running time of all algorithms is linear with respect to (w.r.t.) . Once again, our is much more efficient than the existing algorithms. These results further demonstrate the high efficiency of the proposed solution.
Exp 3 : -clique density with varying . In Table 4, only can handle all the tested networks. ++ can reach to the optimal results in only one iteration, but can not handle the complex networks with a large degeneracy, in which there exists a large number of -cliques, like . Both and can achieve a near-optimal approximation within a few iterations. These results further confirm the scalability of and the ability of to derive a good approximation within only few iterations.
Exp 4 : Running time needed to find . In the experiments, we find that can reach on all the tested datasets. Table 5 shows the number of iterations as well as the running time needed to find . We find that achieves the lowest running time to obtain on datasets. Although requires iterations to converge to on , it can get a approximation using only seconds. These results further confirm the high efficiency of the proposed algorithm.
Exp 5 : -clique density within fixed time. Figure 6 compares the -clique density of the results of and with the same running time. In Figure 6, the results show that can get larger -clique density than with the same running time. The results on other datasets is in Table 5 as described in Exp 4.
Exp 6 : Memory overheads. We plot the memory costs of the Frank-Wolfe based algorithms in Figure 7. As can be seen, and have similar memory costs. This is because the memory costs of and are mainly taken by storing the SCT, while the memory cost of ++ is by the storage of all -cliques. As increases, the count of -cliques grows and the size of SCT shrinks.These results indicate that our is space-efficient.
5.2. Results of the Sampling-based algorihtms
Exp 7 : Running time with varying . In Figure 8, we show the running time of the sampling-based algorithms for various values of . From Figure 8(a) to 8(d), we can see that and achieve comparable performance. However, for large networks in Figure 8(e)8(j), our algorithm substantially outperforms and . For example, on the dataset, our algorithm is at least one order of magnitude faster than the two baselines. We also note that on the largest graph , only our algorithm can obtain the results, while both of and fail to drive the results. These results confirm the high efficiency and scalability of our sampling-based algorithm.
Exp 8 : -clique density within fixed time. In Table 6, we compare -clique density achieved by the sampling based algorithms within fixed running time. As shown in Table 6, our one order of magnitude faster than existing algorithms and achieve good results. For example, on , can terminate in while needs around and needs about . Furthermore, our is the only algorithm that can obtain the results on the largest graph . These results confirm the high efficiency, effectiveness and scalability of our .
6. related work
Densest subgraph problem (DSP). Given a graph and a measure of density, DSP requires to find a subset of vertices whose induced subgraph maximizes the value of density. DSP is a famous problem that has been studied for over five decades, which has a lot of variants and applications (Lanciano et al., 2023). For the traditional Edge-based Densest Subgraph Problem, several key algorithmic approaches have emerged. These encompass maximum-flow-based algorithms (Goldberg, 1984), LP-based algorithms (Charikar, 2000b; Danisch et al., 2017) and peeling-based algorithms (Charikar, 2000b; Boob et al., 2020). There are also many variants of DSP. Densest -subgraph problem aims at finding the densest subgraph with size (Feige et al., 2001); Densest at-least(most)-k-subgraph problem aims at finding the densest subgraph with size at least(most) (Andersen and Chellapilla, 2009); Anchored Densest Subgraph Problem tries to find the densest subgraph that contains a given seed set (Dai et al., 2022); Fair densest subgraph is the densest subgraphs that has equal represented colors (Anagnostopoulos et al., 2020; Miyauchi et al., 2023); Motif-based densest subgraph is the generalized version of -clique densest subgraph, where the density is defined by the count of a given motif (Fang et al., 2019). On different graphs like directed graphs (Charikar, 2000b; Ma et al., 2020), temporal graphs (Bogdanov et al., 2011) and hypergraphs (Huang and Kahng, 1995), there also exists the corresponding DSPs.
Large near-clique detection.The maximum clique problem is an important problem which has a lot of applications (Tomita, 2017; Chang, 2020). However, maximum clique often has a very tight constraints for real-world applications. Thus, a lot of relaxed models for large near-clique detection were proposed (Tsourakakis, 2015; Batagelj and Zaversnik, 2003; Balasundaram et al., 2011; Chen et al., 2021). The -clique densest subgraph studied in this work is known as a large near-clique. State-of-the-art algorithms for solving - are primarily rooted in the Frank-Wolfe framework, because - can be formulated as a convex programming. Sun (Sun et al., 2020) introduced the first Frank-Wolfe-based algorithm, named ++. ++ operates by sequentially iterating over individual -cliques, and each -clique assigns a weight to the vertex with the smallest weight among the vertices. With an adequate number of iterations, it can converge to the optimal solution . Recent advancements by He et al. (He et al., 2023) have accelerated ++, and by the technique SCT-index. The SCT-index, building upon the SCT-tree data structure (Jain and Seshadhri, 2020), enables batch-wise iteration over -cliques, significantly enhancing efficiency. Other methods include flow-based algorithms (Mitzenmacher et al., 2015; Tsourakakis, 2015) and peeling-based algorithms (Fang et al., 2019), but they are not as efficient as the Frank-Wolfe based algorithms (He et al., 2023). There also exists a lot of other near-clique models (Batagelj and Zaversnik, 2003; Balasundaram et al., 2011; Chang, 2023). Maximum -core is the largest subgraph that each vertex has degree larger than (Batagelj and Zaversnik, 2003). Maximum -plex is the maximum subgraph that each vertex has at most non-neighbors (Balasundaram et al., 2011; Chang et al., 2022). Maximum -defective clique is the maximum subgraph that misses at most edges compared to clique (Chen et al., 2021; Chang, 2023; Gao et al., 2022).
7. conclusion
In this paper, we study the problem of -clique densest subgraph search. We propose a new Frank-Wolfe-based algorithm, whose time complexity is free from the count of -cliques. Thus, it is very efficient for processing large graphs that often have a extremely number of -cliques. We present a detailed theoretical analysis of our algorithms. To further improve the efficiency, we also propose a new and provable sampling-based algorithm. A nice feature of our algorithm is that it has polynomial time complexity. We conduct extensive experiments on 12 large real-world graphs to evaluate our algorithms, and the results demonstrate the high efficiency and scalability of our approaches.
References
- (1)
- Anagnostopoulos et al. (2020) Aris Anagnostopoulos, Luca Becchetti, Adriano Fazzone, Cristina Menghini, and Chris Schwiegelshohn. 2020. Spectral Relaxations and Fair Densest Subgraphs. In CIKM. 35–44.
- Andersen and Chellapilla (2009) Reid Andersen and Kumar Chellapilla. 2009. Finding Dense Subgraphs with Size Bounds. In WAW, Vol. 5427. Springer, 25–37.
- Balasundaram et al. (2011) Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. 2011. Clique Relaxations in Social Network Analysis: The Maximum k-Plex Problem. Oper. Res. 59, 1 (2011), 133–142.
- Batagelj and Zaversnik (2003) Vladimir Batagelj and Matjaz Zaversnik. 2003. An O(m) Algorithm for Cores Decomposition of Networks. CoRR cs.DS/0310049 (2003).
- Bogdanov et al. (2011) Petko Bogdanov, Misael Mongiovì, and Ambuj K. Singh. 2011. Mining Heavy Subgraphs in Time-Evolving Networks. In ICDM. IEEE Computer Society, 81–90.
- Boob et al. (2020) Digvijay Boob, Yu Gao, Richard Peng, Saurabh Sawlani, Charalampos E. Tsourakakis, Di Wang, and Junxing Wang. 2020. Flowless: Extracting Densest Subgraphs Without Flow Computations. In WWW. 573–583.
- Chang (2020) Lijun Chang. 2020. Efficient maximum clique computation and enumeration over large sparse graphs. VLDB J. 29, 5 (2020), 999–1022.
- Chang (2023) Lijun Chang. 2023. Efficient Maximum k-Defective Clique Computation with Improved Time Complexity. CoRR abs/2309.02635 (2023).
- Chang et al. (2022) Lijun Chang, Mouyi Xu, and Darren Strash. 2022. Efficient Maximum k-Plex Computation over Large Sparse Graphs. Proc. VLDB Endow. 16, 2 (2022), 127–139.
- Charikar (2000a) Moses Charikar. 2000a. Greedy approximation algorithms for finding dense components in a graph. In APPROX 2000, Vol. 1913. Springer, 84–95.
- Charikar (2000b) Moses Charikar. 2000b. Greedy approximation algorithms for finding dense components in a graph. In International workshop on approximation algorithms for combinatorial optimization. 84–95.
- Chen et al. (2021) Xiaoyu Chen, Yi Zhou, Jin-Kao Hao, and Mingyu Xiao. 2021. Computing maximum k-defective cliques in massive graphs. Comput. Oper. Res. 127 (2021), 105131.
- Cui et al. (2008) Guangyu Cui, Yu Chen, De-Shuang Huang, and Kyungsook Han. 2008. An algorithm for finding functional modules and protein complexes in protein-protein interaction networks. Journal of Biomedicine and Biotechnology 2008 (2008).
- Dai et al. (2022) Yizhou Dai, Miao Qiao, and Lijun Chang. 2022. Anchored Densest Subgraph. In SIGMOD. 1200–1213.
- Danisch et al. (2017) Maximilien Danisch, T.-H. Hubert Chan, and Mauro Sozio. 2017. Large Scale Density-friendly Graph Decomposition via Convex Programming. In WWW 2017. ACM, 233–242.
- Eppstein et al. (2013) David Eppstein, Maarten Löffler, and Darren Strash. 2013. Listing All Maximal Cliques in Large Sparse Real-World Graphs. ACM J. Exp. Algorithmics 18 (2013).
- Fang et al. (2019) Yixiang Fang, Kaiqiang Yu, Reynold Cheng, Laks V. S. Lakshmanan, and Xuemin Lin. 2019. Efficient Algorithms for Densest Subgraph Discovery. Proc. VLDB Endow. 12, 11 (2019), 1719–1732.
- Feige et al. (2001) Uriel Feige, Guy Kortsarz, and David Peleg. 2001. The Dense k-Subgraph Problem. Algorithmica 29, 3 (2001), 410–421.
- Fratkin et al. (2006) Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou. 2006. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22, 14 (07 2006), e150–e157.
- Gao et al. (2022) Jian Gao, Zhenghang Xu, Ruizhi Li, and Minghao Yin. 2022. An Exact Algorithm with New Upper Bounds for the Maximum k-Defective Clique Problem in Massive Sparse Graphs. In AAAI. 10174–10183.
- Goldberg (1984) Andrew V Goldberg. 1984. Finding a maximum density subgraph. (1984).
- He et al. (2023) Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, and Ying Zhang. 2023. Scaling Up k-Clique Densest Subgraph Detection. Proc. ACM Manag. Data 1, 1 (2023), 69:1–69:26.
- Huang and Kahng (1995) Dennis J.-H. Huang and Andrew B. Kahng. 1995. When clusters meet partitions: new density-based methods for circuit decomposition. In ED&TC. 60–64.
- Jaggi (2013) Martin Jaggi. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML 2013, Vol. 28. 427–435.
- Jain and Seshadhri (2020) Shweta Jain and C. Seshadhri. 2020. The Power of Pivoting for Exact Clique Counting. In WSDM. 268–276.
- Konar and Sidiropoulos (2022) Aritra Konar and Nicholas D. Sidiropoulos. 2022. The Triangle-Densest-K-Subgraph Problem: Hardness, Lovász Extension, and Application to Document Summarization. Proceedings of the AAAI Conference on Artificial Intelligence 36, 4 (Jun. 2022), 4075–4082.
- Lanciano et al. (2023) Tommaso Lanciano, Atsushi Miyauchi, Adriano Fazzone, and Francesco Bonchi. 2023. A Survey on the Densest Subgraph Problem and its Variants. CoRR abs/2303.14467 (2023).
- Lee et al. (2010) Victor E. Lee, Ning Ruan, Ruoming Jin, and Charu C. Aggarwal. 2010. A Survey of Algorithms for Dense Subgraph Discovery. In Managing and Mining Graph Data. Advances in Database Systems, Vol. 40. Springer, 303–336.
- Li et al. (2022) Rong-Hua Li, Qiushuo Song, Xiaokui Xiao, Lu Qin, Guoren Wang, Jeffrey Xu Yu, and Rui Mao. 2022. I/O-Efficient Algorithms for Degeneracy Computation on Massive Networks. IEEE Transactions on Knowledge and Data Engineering 34, 7 (2022), 3335–3348.
- Ma et al. (2020) Chenhao Ma, Yixiang Fang, Reynold Cheng, Laks VS Lakshmanan, Wenjie Zhang, and Xuemin Lin. 2020. Efficient algorithms for densest subgraph discovery on large directed graphs. In SIGMOD. 1051–1066.
- Mitzenmacher et al. (2015) Michael Mitzenmacher, Jakub Pachocki, Richard Peng, Charalampos E. Tsourakakis, and Shen Chen Xu. 2015. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling. In SIGKDD 2015. ACM, 815–824.
- Miyauchi et al. (2023) Atsushi Miyauchi, Tianyi Chen, Konstantinos Sotiropoulos, and Charalampos E. Tsourakakis. 2023. Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity. In SIGKDD. 1710–1721.
- Sun et al. (2020) Bintao Sun, Maximilien Danisch, T.-H. Hubert Chan, and Mauro Sozio. 2020. KClist++: A Simple Algorithm for k-Clique Densest Subgraphs in Large Graphs. Proc. VLDB Endow. 13, 10 (2020), 1628–1640.
- Tomita (2017) Etsuji Tomita. 2017. Efficient Algorithms for Finding Maximum and Maximal Cliques and Their Applications. In WALCOM, Vol. 10167. Springer, 3–15.
- Tsourakakis (2015) Charalampos E. Tsourakakis. 2015. The K-clique Densest Subgraph Problem. In WWW. 1122–1132.
- Ye et al. (2022) Xiaowei Ye, Rong-Hua Li, Qiangqiang Dai, Hongzhi Chen, and Guoren Wang. 2022. Lightning Fast and Space Efficient k-clique Counting. In WWW. 1191–1202.
- Yu et al. (2006) Haiyuan Yu, Alberto Paccanaro, Valery Trifonov, and Mark Gerstein. 2006. Predicting interactions in protein networks by completing defective cliques. Bioinform. 22, 7 (2006), 823–829.