This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Scalable kk-clique Densest Subgraph Search

Xiaowei Ye Beijing Institute of TechnologyBeijingChina [email protected] Miao Qiao The University of AucklandAucklandNew Zealand [email protected] Ronghua Li Beijing Institute of TechnologyBeijingChina [email protected] Qi Zhang Beijing Institute of TechnologyBeijingChina [email protected]  and  Guoren Wang Beijing Institute of TechnologyBeijingChina [email protected]
Abstract.

In this paper, we present a collection of novel and scalable algorithms designed to tackle the challenges inherent in the kk-clique densest subgraph problem (kk-𝖣𝖲𝖲\mathsf{DSS}) within network analysis. We propose 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}, a novel algorithm based on the Frank-Wolfe approach for addressing kk-𝖣𝖲𝖲\mathsf{DSS}, effectively solving a distinct convex programming problem. 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is able to approximate kk-𝖣𝖲𝖲\mathsf{DSS} with near optimal guarantees. The notable advantage of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} lies in its time complexity, which is independent of the count of kk-cliques, resulting in remarkable efficiency in practical applications. Additionally, we present 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}, a sampling-based algorithm with the capability to handle networks on an unprecedented scale, reaching up to 1.8×1091.8\times 10^{9} edges. By leveraging the 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} algorithm as a uniform kk-clique sampler, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} ensures the efficient processing of large-scale network data, accompanied by a detailed analysis of accuracy guarantees. Together, these contributions represent a significant advancement in the field of kk-clique densest subgraph discovery. In experimental evaluations, our algorithms demonstrate orders of magnitude faster performance compared to the current state-of-the-art solutions.

1. Introduction

Refer to caption
Figure 1. The three-step paradigm for kk-𝖣𝖲𝖲\mathsf{DSS}.

Dense subgraph search plays a primary role in graph mining. Given a graph GG, a dense subgraph can be the subgraph with the highest edge-to-node ratio, called the densest subgraph, or the largest graph with all nodes mutually connected, called the maximum clique (whose size is denoted as ω(G)\omega(G)). Both types of dense subgraphs can be captured by a unified problem called kk-clique densest subgraph search (kk-𝖣𝖲𝖲\mathsf{DSS}(Tsourakakis, 2015). Given an integer kk and graph GG, kk-𝖣𝖲𝖲\mathsf{DSS} reports a subgraph SS that maximizes the kk-clique density – the ratio between the total number of kk-cliques in SS and the number of nodes in SS. kk-𝖣𝖲𝖲\mathsf{DSS} reports the densest subgraph of GG when k=2k=2 and the maximum clique when k=ω(G)k=\omega(G). When kk is small, kk-𝖣𝖲𝖲\mathsf{DSS} reports dense subgraphs on small motifs such as triangles, which is applicable to document summarization (Konar and Sidiropoulos, 2022) and network analysis (please see (Lee et al., 2010) as an entrance); when kk becomes larger, kk-𝖣𝖲𝖲\mathsf{DSS} reports near cliques (also called clique relaxations or defected cliques) where “almost” every pair of nodes are connected. Near clique search is important (Mitzenmacher et al., 2015) to predicting protein-protein interactions (Cui et al., 2008; Yu et al., 2006) and identifying over-represented motifs in DNA (Fratkin et al., 2006). Thus, the problem of kk-𝖣𝖲𝖲\mathsf{DSS} has attracted an increasing attention recently (Sun et al., 2020; He et al., 2023), exclusively on its computation efficiency and scalability.

Example 1.1.

Figure 1(a) shows a graph G1G_{1} with 99 triangles on 77 nodes. Its 33-clique density is thus 97\frac{9}{7}. The the subgraph induced by {u1,u2,,u6}\{u_{1},u_{2},\cdots,u_{6}\} achieves the maximum 33-clique density of 43>97\frac{4}{3}>\frac{9}{7}.

kk-𝖣𝖲𝖲\mathsf{DSS} is closely related to the problems of sampling, counting, or enumeration of kk-cliques in a graph. It is worth noting that while approximate kk-𝖣𝖲𝖲\mathsf{DSS} can resort to sampling, exact kk-𝖣𝖲𝖲\mathsf{DSS} necessitates an enumeration of all kk-cliques (Sun et al., 2020; He et al., 2023; Tsourakakis, 2015; Mitzenmacher et al., 2015). This complexity is lower bounded by the number of kk-cliques on GG, which limits the scalability. Having noticed this drawback, 𝖲𝖢𝖳𝖫\mathsf{SCTL} (He et al., 2023) propose to batch update by an index of GG called Succinct Clique Tree (𝖲𝖢𝖳\mathsf{SCT}(Jain and Seshadhri, 2020). 𝖲𝖢𝖳{\mathsf{SCT}} is a general technique that can disjointly partition all cliques of GG into groups. 𝖲𝖢𝖳𝖫\mathsf{SCTL} is based on the observation that some cliques in the same group are identical to update, and thus these cliques can be computed together in a batch. However, the availability of such a batch update is not guaranteed, and thus the enumeration of kk-cliques may still be necessary in the worst case.

Table 1. Time and space complexities of SOTA approximate kk-𝖣𝖲𝖲\mathsf{DSS} approaches of scanning-based (𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++, 𝖲𝖢𝖳𝖫\mathsf{SCTL}) and sampling-based (𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample}, 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample}). δ\delta: the degeneracy of GG (shown in Table 3 for real dataset). θ\theta: the arboricity of GG that δ/2<θδ\delta/2<\theta\leq\delta. TT: the # of iterations. ck=|𝒞k(V)|c_{k}=|\mathcal{C}_{k}(V)|. cc: the # of samples from 𝒞k(V)\mathcal{C}_{k}(V). η\eta: the cardinality of 𝖲𝖢𝖳\mathsf{SCT}-tree. 𝖪𝖢𝗅𝗂𝗌𝗍++{\mathsf{KClist}++} and 𝖲𝖢𝖳𝖫\mathsf{SCTL} are dependent on ckc_{k}.
SOTA Approx. kk-𝖣𝖲𝖲\mathsf{DSS} 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ (Sun et al., 2020) 𝖲𝖢𝖳𝖫\mathsf{SCTL} (He et al., 2023) 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} (ours) 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} (Sun et al., 2020) 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} (He et al., 2023) 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} (ours)
Time O()O(\cdot) kmθk2+Tkckkm\theta^{k-2}+Tkc_{k} θ2η+Tkck\theta^{2}\eta+Tkc_{k} θ2η+Tηθ3\theta^{2}\eta+T\eta\theta^{3} kmθk2+Tkckm\theta^{k-2}+Tkc θ2η+Tkc\theta^{2}\eta+Tkc nδ2k+(δk+k2)t+Tktn\delta^{2}k+(\delta k+k^{2})t+Tkt
Memory O()O(\cdot) kckkc_{k} θη{\theta\eta} θη{\theta\eta} kckc θη\theta\eta ktkt

This paper focuses on kk-𝖣𝖲𝖲\mathsf{DSS} approximately for better scalability. To find a kk-𝖣𝖲𝖲\mathsf{DSS} solution whose complexity is independent of the number ckc_{k} of kk-cliques, we proposed to relax kk-𝖣𝖲𝖲\mathsf{DSS} into a new convex programming problem 𝖲𝖢𝖳-𝖢𝖯{\mathsf{SCT\text{-}CP}}. We prove that the optimal solution of 𝖲𝖢𝖳-𝖢𝖯{\mathsf{SCT\text{-}CP}} produces a near-optimal approximation of the kk-clique densest subgraph. On all datasets we tested in experiment, the optimal solution of 𝖲𝖢𝖳-𝖢𝖯{\mathsf{SCT\text{-}CP}} can find the optimal kk-clique density exactly. More importantly, 𝖲𝖢𝖳-𝖢𝖯{\mathsf{SCT\text{-}CP}} allows one to shift the dependency of the kk-𝖣𝖲𝖲\mathsf{DSS} complexity on ckc_{k} to the “cardinality” η\eta of the 𝖲𝖢𝖳\mathsf{SCT} of GG. As previous work (Jain and Seshadhri, 2020) shows, η\eta is linear to the number mm of edges of GG in practice. On the real-world networks we tested in experiments, η\eta is less than 3m3m, and is 0.31m0.31m on average. We compare the complexity of existing methods in Table 1, in which 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is the algorithm for finding the optimal solution of 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP}. To the best of our knowledge, this is the first approach whose complexity is independent of the number of kk-cliques. Our solution achieves kk-clique density comparable to the baselines while its running time is up to 4 orders of magnitude faster.

Sampling-based methods for kk-𝖣𝖲𝖲\mathsf{DSS} (Mitzenmacher et al., 2015; Sun et al., 2020; He et al., 2023) samples the kk-cliques uniformly. The sampled kk-cliques constructs a sparser graph (Mitzenmacher et al., 2015) and the kk-clique densest subgraph in the sampled sparser graph is an approximation of the kk-clique densest subgraph in the whole graph. Mitzenmacher et.al. (Mitzenmacher et al., 2015) proposed to sample a fixed proportion of the kk-cliques to make sure the accuracy (Mitzenmacher et al., 2015). However, the following works show that the sampling-based algorithms also have good performance when samples only a constant number of kk-cliques, but there is no theory to explain why (Sun et al., 2020; He et al., 2023). Further, sampling-based methods suffer from inefficient sampling process of the kk-cliques. To sample kk-cliques, existing algorithms either need to list all kk-cliques (Sun et al., 2020; Mitzenmacher et al., 2015) or build the 𝖲𝖢𝖳\mathsf{SCT}-index (He et al., 2023), leading to the exponential time complexity of sampling. In this paper, we adopt kk-color path sampling algorithm 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} (Ye et al., 2022) as a uniform kk-clique sampler, which has polynomial time complexity and dramatically improves the efficiency of sampling. The proposed algorithm is named as 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}. Notably, to the best of our knowledge, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is the first algorithm capable of effectively handling networks of up to 1.8×1091.8\times 10^{9} edges in scale for large kk. Additionally, we provide a comprehensive analysis of the accuracy guarantees. Our analysis shows that the approximation has accuracy guarantees when the kk-clique density of the subgraph reported by the sampling-based algorithms is large enough. The theoretical analysis is applicable to all existing sampling-based algorithms. In summary, we make the following contributions.

  • 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}: A Novel Frank-Wolfe Algorithm. We formulate a new convex programming problem for kk-𝖣𝖲𝖲\mathsf{DSS} approximation. The new convex programming problem allows for weight assignment beyond the kk-clique boundaries and is a near-optimal approximation for kk-𝖣𝖲𝖲\mathsf{DSS}. We propose a novel Frank-Wolfe algorithm 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} for solving the new convex programming problem. We prove that 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can achieve a near-optimal solution. Moreover, the striking feature of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is that the time complexity is independent of the count of kk-cliques, making it highly efficient in practice.

  • 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}: Sampling-Based Algorithm for Large Networks. We develop a scalable sampling-based algorithm, namely 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}. Specifically, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is an efficient algorithm that has polynomial time complexity and is capable of handling large networks. We also present a theoretical analysis about the accuracy of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}, which is also applicable to existing sampling-based algorithms.

  • Extensive experiments. We evaluate our algorithms on 12 large real-life graphs. The results show that 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is up to 44 orders of magnitude faster than the state-of-the-art algorithms (𝖲𝖢𝖳𝖫\mathsf{SCTL} and 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++) using similar space. Our sampling based algorithm 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} can obtain an good approximation and is also up to 44 orders of magnitude faster than the state-of-the-art sampling-based algorithms. For reproductivity purpose, we release our source code at https://github.com/LightWant/densestSubgraph.

2. Preliminaries

Table 2. Summary of notations
Notations Descriptions
G(V,E)G(V,E) the graph GG, node set VV, edge set EE
N(v,G)N(v,G) the neighbors of vv in GG
SS a subset SVS\subseteq V
𝒞k(S)\mathcal{C}_{k}(S) the set of kk-cliques in G(S)G(S)
P(Vh,Vp),V(P),P(V_{h},V_{p}),V(P),\mathbb{P} a SCT pair, V(P)=VhVpV(P)=V_{h}\cup V_{p}, the set of all SCT pairs
δ\delta degeneracy of network
η\eta size of SCT
ρk(S)\rho_{k}(S) kk-clique density of G(S)G(S)
VV^{*} the set of nodes of the kk-clique densest subgraph
α,αuC\alpha,\alpha^{C}_{u} (or αuP\alpha^{P}_{u}) α\alpha : the vector of weight assignment from kk-cliques (or SCT pair) to vertices; αuC\alpha^{C}_{u} (or αuP\alpha^{P}_{u}): the weight of the kk-clique CC (or SCT pair PP) assigned to uu
r(u)r(u) the rank of uu in the graph decomposition, or the summary of weight assigned to uu i.e. r(u)=C:uCαuCr(u)=\sum_{C:u\in C}{\alpha^{C}_{u}}
H(r)H(r) the first batch of rr, which is the set of nodes with the largest rank
t,T,p,pt,T,p^{\prime},p tt: sample size TT: number of iterations(a small constant) pp^{\prime}: probability of a kk-color path being a kk-clique p=tp|𝒞k(V)|p=\frac{tp^{\prime}}{|\mathcal{C}_{k}(V)|} : probability of kk-clique being sampled

Given an undirected graph G(V,E)G(V,E) with node set VV and edge set EE where each edge e(u,v)Ee(u,v)\in E has u,vVu,v\in V. Denote by N(v,G)={uV|(u,v)E}N(v,G)=\{u\in V|(u,v)\in E\} the neighbor set of vv in GG. For a subset SVS\subseteq V of nodes in GG, we abuse SS to denote the induced subgraph of SS in GG whose node set is SS and edge set is {(u,v)E|u,vS}\{(u,v)\in E|u,v\in S\}. SS is a clique for every pair uu and vv of nodes in SS, (u,v)E(u,v)\in E. Consider a clique CVC\subseteq V of GG. CC is a maximal clique if for any vVCv\in V\setminus C, C{v}C\cup\{v\} is not a clique. Denote by ω(G)\omega(G) the size of the maximum clique of GG and by 𝒞(G)\mathcal{C}(G) the set of all cliques in GG. The degeneracy, denoted δ\delta, is the smallest value for which every subgraph gg of GG has a vertex with degree at most δ\delta in gg (Eppstein et al., 2013; Li et al., 2022). The value of degeneracy is often not very large in real-world networks (Eppstein et al., 2013; Li et al., 2022). For an integer k[2,ω(G)]k\in[2,\omega(G)], a clique CC is a kk-clique if |C|=k|C|=k. 𝒞k(S)\mathcal{C}_{k}(S) denotes the set of kk-cliques in the induced subgraph SS. For a subgraph SVS\subseteq V, denote by 𝒞k(S)\mathcal{C}_{k}(S) the set of the kk-cliques in SS. The kk-clique density of a subgraph SS of GG is

ρk(S)=|𝒞k(S)||S|.\small\rho_{k}(S)=\frac{|\mathcal{C}_{k}(S)|}{|S|}.

The kk-clique Densest Subgraph Search (kk-𝖣𝖲𝖲\mathsf{DSS}). Given G(V,E)G(V,E) and integer kk. kk-𝖣𝖲𝖲\mathsf{DSS} reports the kk-clique densest subgraph of GG.

Definition 2.1 (kk-clique Densest Subgraph).

The kk-clique densest subgraph of GG is a subgraph VV^{*} such that ρk(V)ρk(V)\rho_{k}(V^{*})\geq\rho_{k}(V^{\prime}) for all VVV^{\prime}\subseteq V and if ρk(V)=ρk(V)\rho_{k}(V^{*})=\rho_{k}(V^{\prime}), VV^{\prime} is a subgraph of VV^{*}.

Graph Decomposition (𝖦𝖣\mathsf{GD}(Sun et al., 2020; Danisch et al., 2017). 𝖦𝖣{\mathsf{GD}} is a kk-clique density based decomposition of GG. In 𝖦𝖣{\mathsf{GD}}, iteratively (ii-th iteration, i1i\geq 1) perform the following steps to produce the ranking r(u)r(u) for each node uVu\in V. Given a graph GG on node set VV and integer kk, let V1=VV_{1}=V and 𝒞1\mathcal{C}^{1} be the collection of the kk-cliques of GG. For the next iterations, let Vi+1V_{i+1} be ViBiV_{i}\setminus B_{i} and let 𝒞i+1\mathcal{C}^{i+1} be {C|C𝒞i,CBi}\{C|C\in\mathcal{C}^{i},C\not\subseteq B_{i}\} where ρi(S)=|{C𝒞i|CS}||S|\rho_{i}(S)=\frac{|\{C\in\mathcal{C}^{i}|C\subseteq S\}|}{|S|} and BiB_{i} is the subgraph with maximum ρi\rho_{i} in ViV_{i}. Let r(u)=ρir(u)=\rho_{i} for all uBiu\in B_{i}. The iteration terminates when Vi=V_{i}=\emptyset. It was proved (Sun et al., 2020) that for two iterations i<ji<j, ρi>ρj\rho_{i}>\rho_{j}. Clearly, ρ1(S)=ρk(S)\rho_{1}(S)=\rho_{k}(S) and B1B_{1} is the kk-clique densest subgraph. Thus, the kk-𝖣𝖲𝖲\mathsf{DSS} of GG equals to computing the first batch of rr, i.e. H(r)=B1={uV|r(u)=max(r)}H(r)=B_{1}=\{u\in V|r(u)=max(r)\}.

For example, for k=3k=3, Figure 1(a) shows the peeling process of 𝖦𝖣{\mathsf{GD}} where S={u1,,u6}S^{*}=\{u_{1},\cdots,u_{6}\} with ρ1=ρk(S)=43\rho_{1}=\rho_{k}(S^{*})=\frac{4}{3} in the first iteration. The vertices in SS^{*} have a rank value r(u)=ρk(S)=43r(u)=\rho_{k}(S^{*})=\frac{4}{3}.

Convex programming 𝖢𝖯(G){\mathsf{CP}}(G) (Sun et al., 2020; Danisch et al., 2017). In the context of 𝖢𝖯(G){\mathsf{CP}}(G) below, three conditions (𝖢𝟣-𝟥\mathsf{C1\text{-}3}) assign a weight of 1 to each kk-clique C𝒞k(V)C\in\mathcal{C}_{k}(V). This weight is then distributed among its nodes uCu\in C using weights αuC0\alpha_{u}^{C}\geq 0 (𝖢𝟣\mathsf{C1}, 𝖢𝟥\mathsf{C3}). Additionally, the total weights r(u)r(u) accumulated by each node uu are subject to the condition that the squared summation is minimized (𝖢𝟤\mathsf{C2}).

𝖢𝖯(G)\displaystyle{\mathsf{CP}}(G) min𝗈𝖻𝗃=uVr2(u)\displaystyle\min{\mathsf{obj}}=\sum_{u\in V}r^{2}(u) s.t.\displaystyle s.t.
𝖢𝟣\mathsf{C1} αuC0\displaystyle\alpha^{C}_{u}\geq 0 C𝒞k(V),uC\displaystyle\forall C\in\mathcal{C}_{k}(V),\forall u\in C
𝖢𝟤\mathsf{C2} r(u)=C:uC𝒞k(V)αuC\displaystyle\ r(u)={{\sum_{C:u\in C\in\mathcal{C}_{k}(V)}{\alpha^{C}_{u}}}} uV\displaystyle\forall u\in V
𝖢𝟥\mathsf{C3} uCαuC=1\displaystyle\ \sum_{u\in C}{\alpha_{u}^{C}}=1 C𝒞k(V)\displaystyle\forall C\in\mathcal{C}_{k}(V)

It can be proved (Danisch et al., 2017; Sun et al., 2020) that the optimal solution of 𝖢𝖯(G){\mathsf{CP}}(G) is exactly the ranking r(u)r(u) for the 𝖦𝖣{\mathsf{GD}} problem. Figure 1(b) shows how the constraints 𝖢𝟣𝖢𝟥\text{{$\mathsf{C1}$}}-\text{{$\mathsf{C3}$}} facilitates the weight distribution in the optimal solution. The 33-cliques on the top, C0C_{0} to C8C_{8} each has weight 11 distributed to their 3 nodes linked to the second line (u0u_{0} to u6u_{6}). The weights r(u)r(u) collected by each node uu (shown in the box below the node) achieve the optimal squared sum in the objective function, which is identical to r(u)r(u) computed in the 𝖦𝖣{\mathsf{GD}}.

Lemma 2.2 ((Sun et al., 2020)).

Consider a kk-clique CC and let xx be the node in CC with the smallest ranking and yy the node with the largest ranking. Either r(y)=r(x)r(y)=r(x) or αyC=0\alpha_{y}^{C}=0.

Lemma 2.3 ((Sun et al., 2020; Danisch et al., 2017)).

Let VV^{*} be the kk-clique densest subgraph of GG. Let r(u),uVr(u),u\in V be the optimal solution of 𝖢𝖯(G){\mathsf{CP}}(G) and S=H(r)S=H(r) the first batch of nodes in rr. S=VS=V^{*} and r(u)=ρk(V)r(u)=\rho_{k}(V^{*}) for uS\forall u\in S.

Apply Lemma 2.3 recursively along the peeling process of the 𝖦𝖣\mathsf{GD} can eventually reach the conclusion that the optimal solution of 𝖢𝖯(G){\mathsf{CP}}(G) is exactly the ranking rr in 𝖦𝖣{\mathsf{GD}}.

Input: The graph G(V,E)G(V,E), clique size kk, # of iterations TT
Output: V^\hat{V}^{*}, an approximation of VV^{*}
1 Initialize r(u)r(u) for each uVu\in V;
2 Let 𝒞𝒞\mathcal{C}\mathcal{C} be either 𝒞k(V)\mathcal{C}_{k}(V) or a sample set of 𝒞k(V)\mathcal{C}_{k}(V);
3 for t1,2,,Tt\leftarrow 1,2,\cdots,T do
4       foreach kk-clique C𝒞𝒞C\in\mathcal{C}\mathcal{C} do
5             uargminvCr(v)u\leftarrow\arg\min_{v\in C}{r(v)}, break ties arbitrarily;
6             r(u)r(u)+1r(u)\leftarrow r(u)+1;
7            
8      
9Sort VV by r(u)r(u) in non-increasing order;
10 Denote ViV_{i} by the previous ii vertices of the sorted VV;
11 return argmaxViρk(Vi)\arg\max_{V_{i}}\rho_{k}(V_{i});
12
Algorithm 1 The 𝖥𝖶\mathsf{FW} Framework

Frank-Wolfe (𝖥𝖶)({\mathsf{FW}}) framework for solving 𝖢𝖯\mathsf{CP} (G). Algorithm 1 shows the 𝖥𝖶\mathsf{FW} framework tailored for the convex programming 𝖢𝖯(G){\mathsf{CP}}(G) (He et al., 2023; Sun et al., 2020). 𝖥𝖶\mathsf{FW} iteratively (line 3) gets a clique CC in a random order (line 4) and then rebalances the weight distribution among all nodes in CC (lines 5-6). Figure 1(c) shows the flowchart of the 𝖥𝖶\mathsf{FW} framework. Specifically, for kk-𝖣𝖲𝖲\mathsf{DSS}, the ”Compute ss” step in 𝖥𝖶{\mathsf{FW}} is repeating 𝖥𝖶\mathsf{FW}-1 and 𝖥𝖶\mathsf{FW}-2 until 𝒞\mathcal{C} is empty. For the scanning based algorithm, 𝒞=𝒞k(V)\mathcal{C}=\mathcal{C}_{k}(V). For the sampling based algorithm, 𝒞\mathcal{C} is the set of kk-cliques uniformly sampled from 𝒞k(V)\mathcal{C}_{k}(V).

  1. 𝖥𝖶\mathsf{FW}-1

    Fetches a kk-clique C𝒞C\in\mathcal{C}, then removes CC from 𝒞\mathcal{C}

  2. 𝖥𝖶\mathsf{FW}-2

    Finds the node xx in CC that carries the lowest r(x)r(x) (breaks ties arbitrarily), then redistributes the weight of CC to increase r(x)r(x).

Remark. All the SOTA methods, 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++, 𝖲𝖢𝖳𝖫\mathsf{SCTL}, 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample}, follows the 𝖥𝖶\mathsf{FW} framework of the convex programming 𝖢𝖯\mathsf{CP} (G), and are different in the ways of selecting a clique CC: 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} select CC based on a sample set 𝒞𝒞\mathcal{C}\mathcal{C} of 𝒞k(V)\mathcal{C}_{k}(V), and 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and 𝖲𝖢𝖳𝖫\mathsf{SCTL} scans 𝒞k(V)\mathcal{C}_{k}(V) either directly or using an index (He et al., 2023; Sun et al., 2020). To remove the dependency of scanning-based method on |𝒞k(V)||\mathcal{C}_{k}(V)| while having the resulting density effectively bounded, we propose a new convex programming problem 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G).

Succinct Clique Tree (Jain and Seshadhri, 2020) (𝖲𝖢𝖳\mathsf{SCT}). Given a graph G(V,E)G(V,E), 𝖲𝖢𝖳\mathsf{SCT} is an index of all cliques of GG. It is produced by a set of large cliques of G(V,E)G(V,E), where each large clique is presented as a pair P=(Vh,Vp)P=(V_{h},V_{p}). Each pair P(Vh,Vp)P(V_{h},V_{p}) consists of two disjoint node sets VhVV_{h}\subseteq V and VpVV_{p}\subseteq V. Denote by V(P)=VhVpV(P)=V_{h}\cup V_{p}. The pair P(Vh,Vp)P(V_{h},V_{p}) encodes the set of cliques 𝒞(P)={VhC|CVp}\mathcal{C}(P)=\{V_{h}\cup C^{\prime}|C^{\prime}\subseteq V_{p}\}. Since PP is a large clique, VhCV_{h}\cup C^{\prime} for any CVpC^{\prime}\subseteq V_{p} is a sub-clique encoded by PP. Let \mathbb{P} be the set of all pairs. Each clique is encoded by one and only one pair. 𝖲𝖢𝖳\mathsf{SCT} is an index of \mathbb{P} in a special shape of search tree.

Lemma 2.4 ((Jain and Seshadhri, 2020)).

Let P(Vh,Vp)P(V_{h},V_{p}) be a pair in \mathbb{P} of 𝖲𝖢𝖳{\mathsf{SCT}}. We have |𝒞(P)|=2|Vp||\mathcal{C}(P)|=2^{|V_{p}|}. In other words, the combination of any subset of VpV_{p} and the entire VhV_{h} forms a clique in GG. In particular, V(P)V(P) is a clique. Besides, all cliques 𝒞(G)\mathcal{C}(G) in GG are disjointly partitioned into {𝒞(P)|P}\{\mathcal{C}(P)|P\in\mathbb{P}\} [Theorem.4.1., (Jain and Seshadhri, 2020)].

We call η=||\eta=|\mathbb{P}| the size of the 𝖲𝖢𝖳{\mathsf{SCT}} index which is a parameter related only to the graph GG. The upper bound of η\eta is O(n3α/3)O(n3^{\alpha/3}) (Jain and Seshadhri, 2020). The value of η\eta is often not extremely large for real graphs (Jain and Seshadhri, 2020) as shown in Table 3.

Lemma 2.5.

Given a graph GG and its 𝖲𝖢𝖳\mathsf{SCT} \mathbb{P}, an integer kk, let k={P(Vh,Vp)||Vh|k&|V(P)|k}\mathbb{P}_{k}=\{P(V_{h},V_{p})\in\mathbb{P}|\ |V_{h}|\leq k\ \&\ |V(P)|\geq k\}. For a pair P(Vh,Vp)kP(V_{h},V_{p})\in\mathbb{P}_{k}, define all kk-cliques in 𝒞(P)\mathcal{C}(P) as set 𝒞k(P)={VhC|CVp&|C|=k|Vh|}\mathcal{C}_{k}(P)=\{V_{h}\cup C^{\prime}|C^{\prime}\subseteq V_{p}\&|C^{\prime}|=k-|V_{h}|\}. Then {𝒞k(P)|Pk}\{\mathcal{C}_{k}(P)|P\in\mathbb{P}_{k}\} is a disjoint partition of all kk-cliques 𝒞k(V)\mathcal{C}_{k}(V) of GG. The cardinality |k|||=η|\mathbb{P}_{k}|\leq|\mathbb{P}|=\eta.

Proof.

According to Lemma 2.4, \mathbb{P} covers all cliques. Thus, we can derive that \mathbb{P} also covers all kk-cliques. k\mathbb{P}_{k} comes from \mathbb{P} by removing the pairs that |Vh|>k|V_{h}|>k and |V(P)|<k|V(P)|<k. When |Vh|>k|V_{h}|>k, the size of the cliques VhCV_{h}\cup C^{\prime} must be larger than kk. When |V(P)|<k|V(P)|<k, the size of the cliques in PP must be smaller than kk. k\mathbb{P}_{k} comes from \mathbb{P} by removing the pairs that do not contain kk-cliques.

In the following discussions, we focus on the kk-cliques of GG and thus abuse \mathbb{P} to denote k\mathbb{P}_{k} unless otherwised specified. For easy reference, all the key notations in this paper are summarized in Table 2.

Lemma 2.6 ((Jain and Seshadhri, 2020)).

Given a pair P(Vh,Vp)P(V_{h},V_{p})\in\mathbb{P}, there are (|Vp|k|Vh|)|V_{p}|\choose k-|V_{h}| kk-cliques in total. For each uVhu\in V_{h}, there are (|Vp|k|Vh|)|V_{p}|\choose k-|V_{h}| kk-cliques that contain uu. For each uVpu\in V_{p}, there are (|Vp|1k|Vh|1)|V_{p}|-1\choose k-|V_{h}|-1 kk-cliques that contain uu.

To form a kk-clique, we need to choose k|Vh|k-|V_{h}| vertices from VpV_{p}. With uVpu\in V_{p} being chosen, we need to select k|Vh|1k-|V_{h}|-1 vertices from |Vp|1|V_{p}|-1 vertices.

Refer to caption
(a) An example graph
Refer to caption
(b) An example of SCT-tree. In each path, the dotted nodes are composed of the set VhV_{h} and the solid nodes are composed of the set VpV_{p}.
Figure 2. Illustration of the SCT.
Example 2.7.

Figure 2(b) is an example of SCT, where each path from root to leaf is a pair (Vh,Vp)(V_{h},V_{p}). It has five pairs of (Vh,Vp)(V_{h},V_{p}), including ({u0},{u1,u3})(\{u_{0}\},\{u_{1},u_{3}\}), ({u1},{u2,u3,u6})(\{u_{1}\},\{u_{2},u_{3},u_{6}\}), ({u2}(\{u_{2}\}, {u3,u6})\{u_{3},u_{6}\}), ({u3}(\{u_{3}\}, {u4,u5,u6})\{u_{4},u_{5},u_{6}\}), ({u4},{u5,u6})(\{u_{4}\},\{u_{5},u_{6}\}). All the pairs are cliques, such as G({u3}G(\{u_{3}\}, {u4,u5,u6})\{u_{4},u_{5},u_{6}\}) is a 44-clique. All the cliques in Figure 2(a) are uniquely encoded by the SCT. For example, the 33-clique {u3,u5,u6}\{u_{3},u_{5},u_{6}\} is encoded in the pair ({u3},{u4,u5,u6})(\{u_{3}\},\{u_{4},u_{5},u_{6}\}).

3. New Convex Programming for kk-DSS

To more efficiently compute kk-𝖣𝖲𝖲\mathsf{DSS}, we present a new convex programming problem based on 𝖲𝖢𝖳\mathsf{SCT}. We first formulate 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G).

3.1. SCT-based Convex Programming

𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) assigns a weight αuP>0\alpha_{u}^{P}>0 for each node uu and each pair PP\in\mathbb{P} such that uV(P)u\in V(P). The assigned weights must satisfy the following constraints.

𝖲𝖢𝖳-𝖢𝖯(G)\displaystyle{\mathsf{SCT\text{-}CP}}(G) minimizeuVr(u)2\displaystyle\quad minimize\sum_{u\in V}{r(u)^{2}}
s.t.\displaystyle s.t.\quad 𝖢𝟣:r(u)=P s.t. uV(P)αuP, for uV\displaystyle\text{{$\mathsf{C1}$}}:r(u)=\sum_{P\in\mathbb{P}\text{ s.t. }u\in V(P)}{\alpha^{P}_{u}},\quad\quad\quad\text{ for }\forall u\in V
𝖢𝟤:αuP0, for P and uP\displaystyle\text{{$\mathsf{C2}$}}:\alpha^{P}_{u}\geq 0,\quad\quad\quad\ \quad\quad\text{ for }\forall P\in\mathbb{P}\text{ and }\forall u\in P
𝖢𝟥:uV(P)αuP=(|Vp|k|Vh|),for P(Vh,Vp)\displaystyle\text{{$\mathsf{C3}$}}:\sum_{u\in V(P)}{\alpha^{P}_{u}}={|V_{p}|\choose k-|V_{h}|},\ \ \text{for }\forall P(V_{h},V_{p})\in\mathbb{P}
𝖢𝟦:αuP(|Vp|1k|Vh|1),P(Vh,Vp),uVp\displaystyle\text{{$\mathsf{C4}$}}:\alpha^{P}_{u}\leq{|V_{p}|-1\choose k-|V_{h}|-1},\forall P(V_{h},V_{p})\in\mathbb{P},\forall u\in V_{p}

Intuitive explanation. Recall that in 𝖢𝖯(G){\mathsf{CP}}(G), each clique CC in 𝒞(G)\mathcal{C}(G) is allowed to have a weight of 11 distributed among the nodes uu in CC, denoted as αuC\alpha_{u}^{C}. Our new convex programming has made the following changes. Instead of each kk-clique having weight 11, we let each pair P(Vh,Vp)P(V_{h},V_{p}) have weight (|Vp|k|Vh|)|V_{p}|\choose k-|V_{h}| because there are (|Vp|k|Vh|)|V_{p}|\choose k-|V_{h}| kk-cliques encoded by PP (Lemma 2.6). We still want the weight to be retained in nodes in CC to avoid an over-flattened weight distribution in V(P)V(P). To do so, we impose a soft constraint of 𝖢𝟦,𝖲𝖢𝖳-𝖢𝖯(G)\text{{$\mathsf{C4}$}},{\mathsf{SCT\text{-}CP}}(G) based on the fact that each node uVpu\in V_{p} participates in (|Vp|1k|Vh|1){|V_{p}|-1\choose k-|V_{h}|-1} kk-cliques (containing uu and VhV_{h}), which gives an upper bound on αuP\alpha_{u}^{P}.

Let α\alpha^{*} be the optimal solution of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G). Let rr^{*} be the ranking vector induced by α\alpha^{*}. We first describe some properties of α\alpha^{*} and rr^{*}. Then, we explain the relationship between rr^{*} and the kk-clique densest subgraph.

Properties of α\alpha^{*} and rr^{*}. Lemma 3.1 shows that each entry r(u)r^{*}(u) is upper bounded by the number of kk-cliques that contain uu. The property is identical to the optimal solution of 𝖢𝖯(G){\mathsf{CP}}(G) where each kk-clique only assigns weight to one of the kk nodes in the kk-clique.

Lemma 3.1.

Let rr be a feasible solution of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G), it has uVr(u)=|𝒞k(V)|\sum_{u\in V}r(u)=|\mathcal{C}_{k}(V)| and uV\forall u\in V, r(u)r(u) is upper bounded by the number of kk-cliques that contain uu.

Proof.

In accordance with 𝖢𝟥\mathsf{C3} and 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G), all nodes in V(P)V(P) share the weight of |𝒞k(P)|=(|Vp|k|Vp|)|\mathcal{C}_{k}(P)|={|V_{p}|\choose k-|V_{p}|} kk-cliques. Since all kk-cliques in 𝒞k(V)\mathcal{C}_{k}(V) are disjointly partitioned by 𝖲𝖢𝖳\mathsf{SCT} (refer to Lemma 2.4), it follows that uVr(u)=|𝒞k(V)|\sum_{u\in V}r(u)=|\mathcal{C}_{k}(V)|. Bynce 𝖢𝟥\mathsf{C3}, 𝖢𝟦\mathsf{C4} and Lemma 2.6, we can achieve the upper bound of r(u)r(u). ∎

Theorem 3.2.

Given a pair P=(Vh,Vp)P=(V_{h},V_{p}), define upv=(|Vp|k|Vp|)up_{v}={|V_{p}|\choose k-|V_{p}|} if vVhv\in V_{h} and upv=(|Vp|1k|Vp|1)up_{v}={|V_{p}|-1\choose k-|V_{p}|-1} if vVpv\in V_{p}. αvP{\alpha^{*}}^{P}_{v} must equal upvup_{v} if there exists uu that r(u)>r(v)r^{*}(u)>r^{*}(v) and αuP>0{\alpha^{*}}^{P}_{u}>0.

Proof.

Assume that r(u)>r(v)r^{*}(u)>r^{*}(v), αuP>0{\alpha^{*}}^{P}_{u}>0 and αvP<upv{\alpha^{*}}^{P}_{v}<up_{v}. By re-assigning the weight from uu to vv, we can reduce the gap between r(u)r(u) and r(v)r(v) and reach a smaller value of the objective function, which contradicts the fact that α\alpha^{*} is the optimal solution. ∎

Theorem 3.2 shows the property of α\alpha^{*} that the weight are assigned as even as possible. Since αvP{\alpha^{*}}^{P}_{v} is bounded by upvup_{v} (𝖢𝟦\mathsf{C4} of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G)), upvup_{v} is like a barrier to stop the weight flowing from the denser vertices to the vv.

We give the definition of relaxed stable subset.

Definition 3.3.

[relaxed stable subset] A subset of vertices BB is a relaxed stable subset if (1) uB,vVB,r(u)>r(v)\forall u\in B,v\in V\setminus B,r(u)>r(v). (2) For each pair PP that intersects both BB and VBV\setminus B, uBP,αuP>0\exists u\in B\cap P,{\alpha^{*}}^{P}_{u}>0 only when vP(VB),αvP=upv\forall v\in P\cap(V\setminus B),{\alpha^{*}}^{P}_{v}=up_{v}.

A relaxed stable subset BB has a larger rank than other nodes (condition (1) in Definition 3.3) and the pairs that intersect both BB and VBV\setminus B assign their weights to the nodes outside of BB as much as possible (condition (2) in Definition 3.3).

Recall that H(r)={uV|r(u)=max(r)}H(r)=\{u\in V|r(u)=max(r)\}. We can derive that H(r)H(r^{*}) is a relaxed stable subset by Theorem 3.2 and Definition 3.3. For a pair PP that intersects H(r)H(r^{*}) and VH(r)V\setminus H(r^{*}), the kk-cliques encoded by PP can be classified into three types: (1) all nodes in H(r)H(r^{*}); (2) all nodes in VH(r)V\setminus H(r^{*}); (3) the nodes intersecting both H(r)H(r^{*}) and VH(r)V\setminus H(r^{*}). Each of the kk-clique has a weight of 11. Since the weight of PP are assigned to nodes in VH(r)V\setminus H(r^{*}) as much as possible, the type (1) kk-cliques assign the weights to H(r)H(r^{*}), and the type (2) and (3) assign weights to VH(r)V\setminus H(r^{*}). In other words, the nodes in H(r)H(r^{*}) only receive weights from the kk-cliques in H(r)H(r^{*}).

Relationships between rr^{*} and VV^{*}. With the above analysis, we can derive Theorem 3.4 and 3.5 which explain the relationships between rr^{*} and the kk-clique densest subgraph VV^{*}.

Theorem 3.4.

The optimal solution of 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP} represents a subgraph H(r)H(r^{*}) that VH(r)V^{*}\subseteq H(r^{*}).

Proof.

We prove that for any VVV^{\prime}\subset V with V(VV^)V^{\prime}\cap(V\setminus\hat{V*})\neq\emptyset, ρk(V)<ρk(H(r))\rho_{k}(V^{\prime})<\rho_{k}(H(r^{*})). By the analysis above, the nodes in H(r)H(r^{*}) only receive weights from the kk-cliques in H(r)H(r^{*}). Thus, we have

(1) ρk(H(r))\displaystyle\rho_{k}(H(r^{*})) uH(r)r(u)|H((r))|minuH(r)r(u)\displaystyle\geq\frac{\sum_{u\in H(r^{*})}{r(u)}}{|H((r^{*}))|}\geq\min_{u\in H(r^{*})}{r(u)}
>maxuVH(r)r(u)ρk(V).\displaystyle>\max_{u\in V\setminus H(r^{*})}{r(u)}\geq\rho_{k}(V^{\prime}).

Theorem 3.5.

The optimal solution rr^{*} of 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP} represents a subgraph V^=H(r)\hat{V^{*}}=H(r^{*}) that ρk(V^)/ρk(V)>11k|V|\rho_{k}(\hat{V^{*}})/\rho_{k}(V^{*})>1-\frac{1}{k|V^{*}|}.

Proof.

Given two set of vertices V1V_{1} and V2V_{2}, the set of additional kk-cliques of V2V_{2} to V1V_{1} is A(V1,V2)=𝒞k(V1V2)𝒞k(V1)A(V_{1},V_{2})=\mathcal{C}_{k}(V_{1}\cup V_{2})\setminus\mathcal{C}_{k}(V_{1}). From

ρk(V)=|𝒞k(V)||V|>ρk(V^)=|𝒞k(V)|+|A(V,V^V)||V|+|V^V|,\small\rho_{k}(V^{*})=\frac{|\mathcal{C}_{k}(V^{*})|}{|V^{*}|}>\rho_{k}(\hat{V^{*}})=\frac{|\mathcal{C}_{k}(V^{*})|+|A(V^{*},\hat{V^{*}}\setminus V^{*})|}{|V^{*}|+|\hat{V^{*}}\setminus V^{*}|},

we can obtain

(2) |A(V,V^V)|<|V^V||ρk(V)|.\small|A(V^{*},\hat{V^{*}}\setminus V^{*})|<|\hat{V^{*}}\setminus V^{*}||\rho_{k}(V^{*})|.

Let xx be |𝒞k(V)||V|ρk(V^)|\mathcal{C}_{k}(V^{*})|-|V^{*}|\rho_{k}({\hat{V^{*}}}), which denotes the count of kk-cliques in 𝒞k(V)\mathcal{C}_{k}(V^{*}) that assign their weight to V^V\hat{V^{*}}\setminus V^{*}. According to the definition of SCT, each kk-clique exists in only one pair. Thus, we can consider each pair independently. Given a pair P=(Vh,Vp)P=(V_{h},V_{p}) that there are xPx^{\prime}_{P} kk-cliques in 𝒞k(V)𝒞k(P)\mathcal{C}_{k}(V^{*})\cap\mathcal{C}_{k}(P) assign their weight to Vp(V^V)V_{p}\cap(\hat{V^{*}}\setminus V^{*}), each vertex in Vp(V^V)V_{p}\cap(\hat{V^{*}}\setminus V^{*}) generates at least kxPkx^{\prime}_{P} additional kk-cliques. Thus, we have at least k|Vp(V^V)|xPk|V_{p}\cap(\hat{V^{*}}\setminus V^{*})|x^{\prime}_{P} additional kk-cliques for pair PP. Sum all the pairs together and we reach

(3) |A(V,V^V)|kP(Vp,Vh)|Vp(V^V)|xPkx|V^V|.\small|A(V^{*},\hat{V^{*}}\setminus V^{*})|\geq k\sum_{P(V_{p},V_{h})\in\mathbb{P}}{|V_{p}\cap(\hat{V^{*}}\setminus V^{*})|x^{\prime}_{P}}\geq kx|\hat{V^{*}}\setminus V^{*}|.

Equation (3) comes from the fact that P(Vp,Vh)xP=x\sum_{P(V_{p},V_{h})\in\mathbb{P}}{x^{\prime}_{P}}=x and P(Vp,Vh)|Vp(V^V)||V^V|\sum_{P(V_{p},V_{h})\in\mathbb{P}}{|V_{p}\cap(\hat{V^{*}}\setminus V^{*})|}\geq|\hat{V^{*}}\setminus V^{*}|. The summary is greater than the size of nodes outside of VV^{*} because the same node may be in more than one pairs.

Combines Equation (2) and (3) and we get kx<ρk(V).kx<\rho_{k}(V^{*}). Then, from the definition of xx, we have

k(|𝒞k(V)||V|ρk(V^))<ρk(V).\small k\left(|\mathcal{C}_{k}(V^{*})|-|V^{*}|\rho_{k}({\hat{V^{*}}})\right)<\rho_{k}(V^{*}).

At last, we can obtain the result

ρk(V^)ρk(V)>11k|V|.\small\frac{\rho_{k}({\hat{V^{*}}})}{\rho_{k}(V^{*})}>1-\frac{1}{k|V^{*}|}.

Theorem 3.5 provides a guarantee that the optimal solution of our new convex programming is a near-optimal approximation of the kk-clique densest subgraph.

The problem solved by 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G). Actually, H(r)H(r^{*}) is the subgraph with the maximum value of ρ(S)=P(Vh,Vp)(S)(|Vp|k|Vh|)|S|\rho^{\prime}(S)=\frac{\sum_{P(V_{h},V_{p})\in\mathbb{P}(S)}{{|V_{p}|\choose k-|V_{h}|}}}{|S|}, where (S)\mathbb{P}(S) is the set of pairs with all nodes in SS. The nodes sorted by rr^{*} is a graph decomposition (𝖦𝖣\mathsf{GD} in Section 2) with respect to ρ(S)\rho^{\prime}(S). The statement is proved by Theorem 3.6 and Theorem 3.7.

Theorem 3.6.

The optimal solution α\alpha^{*} of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) is the optimal solution of 𝖲𝖢𝖳-𝖣𝖯(G){\mathsf{SCT\text{-}DP}}(G).

𝖲𝖢𝖳-𝖣𝖯(G)\displaystyle{\mathsf{SCT\text{-}DP}}(G) minimizeρ\displaystyle\quad minimize\ \rho^{\prime}
s.t.\displaystyle s.t.\quad 𝖢𝟣:ρr(u)=P s.t. uV(P)αuP,uV\displaystyle\text{{$\mathsf{C1}$}}:\rho^{\prime}\geq r(u)=\sum_{P\in\mathbb{P}\text{ s.t. }u\in V(P)}{\alpha^{P}_{u}},\quad\quad\quad\forall u\in V
𝖢𝟤:αuP0,P and uP\displaystyle\text{{$\mathsf{C2}$}}:\alpha^{P}_{u}\geq 0,\quad\quad\quad\ \quad\quad\quad\quad\forall P\in\mathbb{P}\text{ and }\forall u\in P
𝖢𝟥:uV(P)αuP=(|Vp|k|Vh|),P(Vh,Vp)\displaystyle\text{{$\mathsf{C3}$}}:\sum_{u\in V(P)}{\alpha^{P}_{u}}={|V_{p}|\choose k-|V_{h}|},\ \ \quad\quad\forall P(V_{h},V_{p})\in\mathbb{P}
𝖢𝟦:αuP(|Vp|1k|Vh|1),P(Vh,Vp),uVp\displaystyle\text{{$\mathsf{C4}$}}:\alpha^{P}_{u}\leq{|V_{p}|-1\choose k-|V_{h}|-1},\forall P(V_{h},V_{p})\in\mathbb{P},\forall u\in V_{p}
Proof.

Since the first level set H(r)H(r^{*}) is a relaxed stable subset, all weights of H(r)H(r^{*}) come from the pairs in H(r)H(r^{*}) and can not further distribute the weights to VH(r)V\setminus H(r). Therefore, r(u)r^{*}(u) for uH(r)u\in H(r) is the minimum value that the objective function of 𝖲𝖢𝖳-𝖣𝖯(G){\mathsf{SCT\text{-}DP}}(G) can achieve. ∎

It is easy to derive that the dual of 𝖲𝖢𝖳-𝖣𝖯(G){\mathsf{SCT\text{-}DP}}(G) is 𝖲𝖢𝖳-𝖫𝖯(G){\mathsf{SCT\text{-}LP}}(G).

𝖲𝖢𝖳-𝖫𝖯(G)\displaystyle{\mathsf{SCT\text{-}LP}}(G) maxP(Vh,Vp)(|Vp|k|Vh|)xP\displaystyle\quad\max\sum_{P(V_{h},V_{p})\in\mathbb{P}}{{|V_{p}|\choose k-|V_{h}|}x_{P}}
s.t.\displaystyle s.t.\quad 𝖢𝟣:xPyu,uP\displaystyle\text{{$\mathsf{C1}$}}:x_{P}\leq y_{u},\quad\quad\quad\forall u\in P
𝖢𝟤:uVyu=1\displaystyle\text{{$\mathsf{C2}$}}:\sum_{u\in V}y_{u}=1
𝖢𝟥:yu0,xP0,uP\displaystyle\text{{$\mathsf{C3}$}}:y_{u}\geq 0,x_{P}\geq 0,\forall u\in P\in\mathbb{P}
Theorem 3.7.

Given a subgraph SS, denote by (S)\mathbb{P}(S) the set of pairs {P|V(P)S,P}\{P|V(P)\in S,P\in\mathbb{P}\}. The optimal solution of 𝖲𝖢𝖳-𝖫𝖯(G){\mathsf{SCT\text{-}LP}}(G) is the subgraph that maximizes ρ(S)=P(Vh,Vp)(S)(|Vp|k|Vh|)|S|\rho^{\prime}(S)=\frac{\sum_{P(V_{h},V_{p})\in\mathbb{P}(S)}{{|V_{p}|\choose k-|V_{h}|}}}{|S|}.

Proof.

Given a subgraph SS, let yu=1|S|y_{u}=\frac{1}{|S|} if uSu\in S and yu=0y_{u}=0 otherwise. Then, we can easily derive that the objective function of 𝖲𝖢𝖳-𝖫𝖯(G){\mathsf{SCT\text{-}LP}}(G) is ρ(S)\rho^{\prime}(S). Thus, each subgraph corresponds to a feasible of 𝖲𝖢𝖳-𝖫𝖯(G){\mathsf{SCT\text{-}LP}}(G).

Given a feasible solution with value ρ\rho, we now prove that it can also construct a subgraph SS that ρ(S)ρ\rho^{\prime}(S)\geq\rho. Given a parameter zz, let (z)={P|xPz}\mathbb{P}(z)=\{P|x_{P}\geq z\} and S(z)={u|yuz}S(z)=\{u|y_{u}\geq z\}. According to 𝖢𝟣\mathsf{C1} (𝖲𝖢𝖳-𝖫𝖯\mathsf{SCT\text{-}LP}) and P(z)P\in\mathbb{P}(z), we can derive that uS(z),uV(P),P(z)u\in S(z),\forall u\in V(P),\forall P\in\mathbb{P}(z). Also, from uS(z),uV(P)u\in S(z),\forall u\in V(P) we can derive that P(z)P\in\mathbb{P}(z). Suppose z,P(Vh,Vp)(z)(|Vp|k|Vh|)|S(z)|<ρ\forall z,\frac{\sum_{P(V_{h},V_{p})\in\mathbb{P}(z)}{{|V_{p}|\choose k-|V_{h}|}}}{|S(z)|}<\rho. Then, we have P(Vh,Vp)(|Vp|k|Vh|)xP<ρ,\sum_{P(V_{h},V_{p})\in\mathbb{P}}{{|V_{p}|\choose k-|V_{h}|}x_{P}}<\rho, which is a contradiction. Therefore, there must exist a zz such that ρ(S(z))ρ\rho^{\prime}(S(z))\geq\rho.

Since the subgraphs and feasible solutions are mapped to each other, the subgraph with the maximum value ρ\rho^{\prime} maps to the optimal solution. ∎

From Theorem 3.6 and Theorem 3.7, we know that the optimal solution of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) finds the subgraph SS that maximize ρ(S)=P(Vh,Vp)(S)(|Vp|k|Vh|)|S|\rho^{\prime}(S)=\frac{\sum_{P(V_{h},V_{p})\in\mathbb{P}(S)}{{|V_{p}|\choose k-|V_{h}|}}}{|S|}. Thus, the optimal solution of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) is intuitively a good approximation for the kk-clique densest subgraph. As shown in our experiments, the optimal solution of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) on all real-world datasets we tested can generate the kk-clique densest subgraph exactly.

3.2. FW-based Algorithm for SCT-CP(G)

Input: Graph G(V,E)G(V,E), clique size kk, # of iterations TT, 𝖲𝖢𝖳\mathsf{SCT} \mathbb{P}
Output: V^\hat{V}^{*}, an approximation of VV^{*}
1 Initialize r(u)r(u) for each uVu\in V;
2 for t1,2,,Tt\leftarrow 1,2,\cdots,T do
3       foreach  pair PP\in\mathbb{P} do
4             r𝖯𝖡𝖴𝗉𝖽𝖺𝗍𝖾(r,P,k)r\leftarrow\mathsf{PBUpdate}(r,P,k)
5      
6Sort VV by r(u)r(u);
7 Denote ViV_{i} by the previous ii vertices of VV;
8 return argmaxViρk(Vi)\arg\max_{V_{i}}\rho_{k}(V_{i});
9
Algorithm 2 The Proposed 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} Algorithm
Input: Ranking rr, P(Vh,Vp)P(V_{h},V_{p}), clique size kk,
Output: Updated rr
1 Partition V(P)V(P) into {V1,V2,,Vs}\{V_{1},V_{2},...,V_{s}\} based on a sorted ranking rr. Nodes in ViV_{i} share the ii-th smallest ranking rir_{i}, i[1,s]i\in[1,s], r1<r2<<rsr_{1}<r_{2}<\cdots<r_{s}, Vs=H(r)V_{s}=H(r);
nC(|Vp|k|Vh|)n_{C}\leftarrow{|V_{p}|\choose k-|V_{h}|}; /* Total # of cliques in PP */
2 p|Vp|p\leftarrow|V_{p}|, h|Vh|h\leftarrow|V_{h}|;
3 foreach uVhu\in V_{h} do up(u)(pkh)up(u)\leftarrow{p\choose k-h};
4 foreach uVpu\in V_{p} do up(u)(p1kh1)up(u)\leftarrow{p-1\choose k-h-1};
5 i1i\leftarrow 1;
6 while nC>0n_{C}>0 and sis\geq i do
      /* Allocate the weight of nCn_{C} cliques to the nodes in ViV_{i} to make rr be evenly */
7       if sis\leq i then gap+gap\leftarrow+\infty else gapri+1rigap\leftarrow r_{i+1}-r_{i};
8       while nC>0n_{C}>0 and gap>0\ gap>0 and |Vi|>0|V_{i}|>0 do
9             wmin{minuViup(u),gap,nC|Vi|}w\leftarrow\min\{\min_{u\in V_{i}}up(u),gap,\lfloor\frac{n_{C}}{|V_{i}|}\rfloor\};
10             if w>0w>0 then
11                   for uVi\forall u\in V_{i} do
12                        r(u)r(u)+wr(u)\leftarrow r(u)+w; up(u)up(u)wup(u)\leftarrow up(u)-w;
13                  nCnCw×|Vi|n_{C}\leftarrow n_{C}-w\times|V_{i}|; gapgapwgap\leftarrow gap-w;
14                  
15            else
16                   r(u)r(u)+1r(u)\leftarrow r(u)+1 for a total of nCn_{C} nodes uu in ViV_{i} chosen uniformly at random;
17                   nC0n_{C}\leftarrow 0;
18                  
19            ViVi{u|up(u)=0&uVi}V_{i}\leftarrow V_{i}\setminus\{u|up(u)=0\&u\in V_{i}\};
20            
21      if si+1s\geq i+1 then Vi+1Vi+1ViV_{i+1}\leftarrow V_{i+1}\cup V_{i};
22       ii+1i\leftarrow i+1;
23      
24
25return rr;
26
Algorithm 3 𝖯𝖡𝖴𝗉𝖽𝖺𝗍𝖾\mathsf{PBUpdate}

Algorithm 2 tailors 𝖥𝖶\mathsf{FW} framework for 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G). It replaces Lines 4-6 of Algorithm 1 with Algorithm 3 called for each PP\in\mathbb{P} to update the rankings rr.

Algorithm 3 allocates the weight of nCn_{C} (Line 2) cliques of pair PP to its nodes V(P)V(P). The allocation aims at bringing up the rankings of the lowest-ranking nodes in PP so as to minimize the squared sum of r(u),uV(P)r(u),u\in V(P). This is done by first sorting all nodes in PP based on its original ranking and find the “plateaus” of different rankings (Line 1): nodes in VsV_{s} share the highest ranking rsr_{s} (also called the first batch H(r)H(r)), nodes in Vs1V_{s-1} the second largest ranking r2r_{2}, \cdots, nodes in V1V_{1} the smallest ranking r1r_{1}. Denote by ss the number of different rankings of nodes in V(P)V(P). The rankings r1r_{1}, r2r_{2}, \cdots, rsr_{s} are strictly increasing. Lines 4-5 setup the upper bound for each node in the allocation based on 𝖢𝟥\mathsf{C3} and 𝖢𝟦\mathsf{C4} of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G). Lines 7-20 progressively use the budget of nCn_{C} to bring the lowest ranking nodes up while satisfying the upper bounds set for each node. Specifically, ViV_{i} (initially i=1i=1) is the set of smallest ranking nodes uu in V(P)V(P) who can receive weights from the cliques without violating their upper bounds up(u)up(u), we call them current potential nodes. Denote by gapgap the ranking differences of the smallest and the second smallest rankings of nodes uu in V(P)V(P); if the second smallest ranking does not exist, then we let gapgap be ++\infty (Line 8). After that, Lines 10-20 allocates the weights of nCn_{C} cliques to the current potential nodes. In Line 10, ww the highest ranking increment of the potential nodes before i) a node reaches its upper bound and then should be kicked out of current potential node set ViV_{i}, ii) all nodes in ViV_{i} reaches the ranking of ri+1r_{i+1}, or iii) the weights of all nCn_{C} cliques are used up. Each node in ViV_{i} will then receive the weights of ww cliques (Lines 11-14) unless the nC<|Vi|n_{C}<|V_{i}| (Lines 15-17) where a random set of nCn_{C} nodes in ViV_{i} will receive one extra weight. nCn_{C}, gapgap, and the current potential set ViV_{i} are updated accordingly (Lines 14,17,18). If by the time when gap=0gap=0, there are still potential nodes in ViV_{i}, it means that their rankings have already reached ri+1r_{i+1}, we then merge them to Vi+1V_{i+1} (Line 19).

Theorem 3.8.

The time complexity of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is O(ηδ3)O(\eta\delta^{3}) where η\eta is the upper bound of the size of SCT-tree, and δ\delta is the degeneracy of the input graph GG. The space complexity is O(αη)O(\alpha\eta).

Proof.

In Algorithm 3, the while loop in line 7, line 9 and line 12 all has time complexity O(|V(P)|)O(|V(P)|). Thus, the time complexity of Algorithm 3 is O(|V(P)|3)O(|V(P)|^{3}). Since PP is a clique, |V(P)||V(P)| is bounded by δ\delta. 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} need to scan over \mathbb{P} and call 𝖯𝖡𝖴𝗉𝖽𝖺𝗍𝖾\mathsf{PBUpdate} for each pair. Thus, the time complexity of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is O(ηδ3)O(\eta\delta^{3}). The main space cost of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is the storage of the SCT-tree, which is O(αη)O(\alpha\eta) (Jain and Seshadhri, 2020). ∎

Refer to caption
Figure 3. Illustration of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} on the example graph for one iteration.
Example 3.9.

Figure 3 illustrates an example to explain how 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} works on the example graph in Figure 2(a). In Figure 2(a), it has five pairs of (Vh,Vp)(V_{h},V_{p}), including ({u0},{u1,u3})(\{u_{0}\},\{u_{1},u_{3}\}), ({u1},{u2,u3,u6})(\{u_{1}\},\{u_{2},u_{3},u_{6}\}), ({u2}(\{u_{2}\}, {u3,u6})\{u_{3},u_{6}\}), ({u3},{u4,u5,u6})(\{u_{3}\},\{u_{4},u_{5},u_{6}\}), ({u4},{u5,u6})(\{u_{4}\},\{u_{5},u_{6}\}). The five pairs are accessing in a random order. Initially, the elements of the vector rr are all zeros. The updated label AA denotes that the update is line 13 of Algorithm 3, and BB is line 16 of Algorithm 3.

Since there is only one 33-clique in ({u4},{u5,u6})(\{u_{4}\},\{u_{5},u_{6}\}) and the size of ViV_{i} is 33 (line 10 of Algorithm 3), we have nC|Vi|=0\lfloor\frac{n_{C}}{|V_{i}|}\rfloor=0 and the weight is assigned to u4u_{4} in 1ine 16 of Algorithm 3. Then, for ({u3},{u4,u5,u6})(\{u_{3}\},\{u_{4},u_{5},u_{6}\}), there are 33 33-cliques. Since {u3,u5,u6}\{u_{3},u_{5},u_{6}\} has the smallest weight, the 33 weights are assigned to {u3,u5,u6}\{u_{3},u_{5},u_{6}\} averagely (line 13 of Algorithm 3). The following updates are similar. \Box

Remarks. Compared to the worst-case time complexity O(k|𝒞k(V)|)O(k|\mathcal{C}_{k}(V)|) of 𝖲𝖢𝖳𝖫\mathsf{SCTL} (He et al., 2023), O(ηδ3)O(\eta\delta^{3}) is much smaller for large real-world networks: as shown in Table 3, the values of η\eta of 𝖲𝖢𝖳\mathsf{SCT}-tree and degeneracy δ\delta on large real-graphs are not large (Jain and Seshadhri, 2020). More importantly, O(ηδ3)O(\eta\delta^{3}) is independent of kk and the number of kk-cliques in graph GG.

3.3. Analysis of the Algorithm

Denote by α\alpha the vector of variables of αuP\alpha_{u}^{P} for P,uV(P)\forall P\in\mathbb{P},u\in V(P). Let 𝒟\mathcal{D} be the domain of α\alpha specified in 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G). Since the rankings r(u),rVr(u),r\in V are derived from α\alpha, we denote the objective function of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G) by

f(α)=uVr(u)2.f(\alpha)=\sum_{u\in V}r(u)^{2}.
Theorem 3.10.

𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is an implementation of a Frank-Wolfe algorithm to the convex program of 𝖲𝖢𝖳-𝖢𝖯(G){\mathsf{SCT\text{-}CP}}(G).

Proof.

𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} solves the following problem. Given α𝒟\alpha\in\mathcal{D}, find α^𝒟\hat{\alpha}\in\mathcal{D} to minimize α^,f(α)=2P:uPα^uPr(u)\langle\hat{\alpha},\nabla f(\alpha)\rangle=2\sum_{P\in\mathbb{P}:u\in P}{\hat{\alpha}^{P}_{u}\cdot r(u)}. Since each pair has constant count of weights and can only assign weights to the vertices V(P)V(P) in the pair (𝖢𝟥\mathsf{C3}), we consider each pair independently. For each pair PP, to minimize α^,f(α)\langle\hat{\alpha},\nabla f(\alpha)\rangle, the weights should be assigned to the vertices with the smallest value of rr in α^\hat{\alpha}. This is the work 𝖯𝖡𝖴𝗉𝖽𝖺𝗍𝖾\mathsf{PBUpdate} do for each pair. ∎

By Theorem 3.10, the result obtained by the 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm is an exact solution for SCT-CP(G) (Danisch et al., 2017; Jaggi, 2013).

Corollary 3.11.

Let rr^{\prime} be the result obtained upon 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} converges. Let α\alpha^{\prime} be the vector of weight assignment of rr^{\prime}. α\alpha^{\prime} is the optimal solution of 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP}.

Convergence analysis of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}. To analyze the convergence rate of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}, we use the curvature constant CfC_{f} (Jaggi, 2013) as a measure of ”non-linearity” of the objective function f(α)f(\alpha). The curvature constant CfC_{f} of a convex and differentiable function with respect to a compact domain 𝒟\mathcal{D} is

(4) Cf:=supα1,α3𝒟,γ[0,1],α2=α1+γ(α3α1)2γ2(f(α2)f(α1)α2α1,f(α1)).\small C_{f}:=\sup_{\alpha_{1},\alpha_{3}\in\mathcal{D},\atop{\gamma\in[0,1],\atop{\alpha_{2}=\alpha_{1}+\gamma(\alpha_{3}-\alpha_{1})}}}{\frac{2}{\gamma^{2}}}(f(\alpha_{2})-f(\alpha_{1})-\langle\alpha_{2}-\alpha_{1},\nabla f(\alpha_{1})\rangle).
Lemma 3.12.

Cfxmax|𝒞k|2C_{f}\leq x_{max}|\mathcal{C}_{k}|^{2} where xmax=maxuV|{P|uV(P)}|x_{max}=\max_{u\in V}|\{P\in\mathbb{P}|u\in V(P)\}|, the maximum number of pairs covering the same node in GG.

Proof.

Let Diam(𝒟)Diam(\mathcal{D}) be the squared Euclidean diameter of 𝒟\mathcal{D}, i.e. Diam(𝒟):=supα1,α2𝒟d(α1,α2)Diam(\mathcal{D}):=\sup_{\alpha_{1},\alpha_{2}\in\mathcal{D}}{d(\alpha_{1},\alpha_{2})}. L=supα𝒟2f(α)2L=\sup_{\alpha\in\mathcal{D}}{\|\nabla^{2}f(\alpha)\|_{2}} is the Lipschitz constant of f(α)f(\alpha). 2f(α)2\|\nabla^{2}f(\alpha)\|_{2} is the spectral norm of the Hessian matrix of f(α)f(\alpha) (Danisch et al., 2017; Jaggi, 2013). Here LL is bounded by xmaxx_{max}. Because α0\alpha\geq 0 and α1=|𝒞k(V)|||\alpha||_{1}=|\mathcal{C}_{k}(V)|, we have maxα𝒟α,α|𝒞k(V)|2\max_{\alpha\in\mathcal{D}}{\langle\alpha,\alpha\rangle}\leq|\mathcal{C}_{k}(V)|^{2} and Diam(𝒟)2Diam(\mathcal{D})^{2} is bounded by maxα𝒟α,α|𝒞k(V)|2\max_{\alpha\in\mathcal{D}}{\langle\alpha,\alpha\rangle}\leq|\mathcal{C}_{k}(V)|^{2}. According to (Jaggi, 2013), CfC_{f} is bounded by Diam(𝒟)2LDiam(\mathcal{D})^{2}L, thus we have Cfxmax|𝒞k|2C_{f}\leq x_{max}|\mathcal{C}_{k}|^{2}. ∎

Theorem 3.13.

Let st=αtαt1s^{t}=\alpha^{t}-\alpha^{t-1} where αt\alpha^{t} is the results of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} in tt-th iterations. Let s=ααt1s^{*}=\alpha^{*}-\alpha^{t-1} be difference between αt1\alpha^{t-1} and the optimal solution α\alpha^{*}. We have st,f(αt)s,f(αt)12βγtCf\langle\ s^{t},\nabla f(\alpha^{t})\rangle-\langle\ s^{*},\nabla f(\alpha^{t})\rangle\leq\frac{1}{2}\beta\gamma_{t}C_{f} where γt=1t,β=4kΔ|𝒞k|\gamma_{t}=\frac{1}{t},\beta=\frac{4\sqrt{k}\Delta}{\sqrt{|\mathcal{C}_{k}|}}. Δ\Delta is the maximum number of kk-cliques that can cover one node uVu\in V.

Proof.

Observe that αtt=t1tαt1t1+stt\frac{\alpha^{t}}{t}=\frac{t-1}{t}\frac{\alpha^{t-1}}{t-1}+\frac{s^{t}}{t}. The update can be seen as x(t)=(1γt)x(t1)+γtstx^{(t)}=(1-\gamma_{t})x^{(t-1)}+\gamma_{t}s^{t} where x(t)x^{(t)} denotesαtt\frac{\alpha^{t}}{t}. Dimension of α\alpha is P(Vh,Vp)|V(P)|k|𝒞k(V)|\sum_{P(V_{h},V_{p})\in\mathbb{P}}|V(P)|\leq k|\mathcal{C}_{k}(V)|, and we have stsαt+α2|𝒞k|\|s^{t}-s^{*}\|\leq\|\alpha^{t}\|+\|\alpha^{*}\|\leq{2|\mathcal{C}_{k}|}. We have f(α)αuP=2r(u)\frac{\partial f(\alpha)}{\partial\alpha^{P}_{u}}=2r(u) and r(u)r(u) will be increased by at most Δ\Delta (𝖢𝟦\mathsf{C4}, 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP} (G), the bound is given by the number of cliques in PP covering uVpu\in V_{p} and 𝖢𝟥\mathsf{C3}, 𝖲𝖢𝖳-𝖢𝖯\mathsf{SCT\text{-}CP} (G) the bound is given by the number of cliques in PP covering uVhu\in V_{h}).

Denote by st,i{s}^{t,i} the vector of ithi_{th} pair. Denote by αt,i:=αt+j=1i1st,i\alpha^{t,i}:=\alpha^{t}+\sum_{j=1}^{i-1}{s^{t,i}}. For t1t\geq 1, we have

(5) st,f(αtt)s,f(αtt)\displaystyle\langle\ s^{t},\nabla f(\frac{\alpha^{t}}{t})\rangle-\langle\ s^{*},\nabla f(\frac{\alpha^{t}}{t})\rangle
=1tsts,f(αt)=1ti=1ηst,is,i,Pif(αt)\displaystyle=\frac{1}{t}\langle\ s^{t}-s^{*},\nabla f(\alpha^{t})\rangle=\frac{1}{t}\sum_{i=1}^{\eta}{\langle\ {s}^{t,i}-{s}^{*,i},\nabla_{P_{i}}f({\alpha}^{t})\rangle}
=1ti=1ηst,is,i,Pif(αtαt,i)+1ti=1ηst,is,i,Pif(αt,i)\displaystyle=\frac{1}{t}\sum_{i=1}^{\eta}{\langle\ {s}^{t,i}-{s}^{*,i},\nabla_{P_{i}}f(\alpha^{t}-{\alpha}^{t,i})\rangle}+\frac{1}{t}\sum_{i=1}^{\eta}{\langle\ {s}^{t,i}-{s}^{*,i},\nabla_{P_{i}}f({\alpha}^{t,i})\rangle}
1ti=1ηst,is,i,Pif(αtαt,i)\displaystyle\leq\frac{1}{t}\sum_{i=1}^{\eta}{\langle\ {s}^{t,i}-{s}^{*,i},\nabla_{P_{i}}f(\alpha^{t}-{\alpha}^{t,i})\rangle}
=1ti=1ηst,is,i,Pif(αt)Pif(αt,i)\displaystyle=\frac{1}{t}\sum_{i=1}^{\eta}{\langle\ {s}^{t,i}-{s}^{*,i},\nabla_{P_{i}}f(\alpha^{t})-\nabla_{P_{i}}f({\alpha}^{t,i})\rangle}
=1tsts,f(αt)(P1f(αt,1),,Pηf(αt,η))\displaystyle=\frac{1}{t}{\langle\ {s}^{t}-{s}^{*},\nabla f(\alpha^{t})-\left(\nabla_{P_{1}}f({\alpha}^{t,1}),...,\nabla_{P_{\eta}}f({\alpha}^{t,\eta})\right)\rangle}
1tstsf(αt)(P1f(αt,1),,Pηf(αt,η))\displaystyle\leq\frac{1}{t}\|{s}^{t}-{s}^{*}\|\cdot\|\nabla f(\alpha^{t})-\left(\nabla_{P_{1}}f({\alpha}^{t,1}),...,\nabla_{P_{\eta}}f({\alpha}^{t,\eta})\right)\|
1tk|𝒞k(V)|2Δ2|𝒞k(V)|.\displaystyle\leq\frac{1}{t}\cdot\sqrt{k|\mathcal{C}_{k}(V)|}\cdot 2\Delta\cdot 2|\mathcal{C}_{k}(V)|.

The above derivation is correct because 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can consider each pair independently. Thus, according to the Algorithm 2 in (Jaggi, 2013),

β=1tk|𝒞k|2Δ2|𝒞k|/(12γtCf)=8k|𝒞k(V)|1.5ΔCf4kΔ|𝒞k(V)|.\beta=\frac{1}{t}\sqrt{k|\mathcal{C}_{k}|}2\Delta 2|\mathcal{C}_{k}|/\left(\frac{1}{2}\gamma_{t}C_{f}\right)=\frac{8\sqrt{k}|\mathcal{C}_{k}(V)|^{1.5}\Delta}{C_{f}}\leq\frac{4\sqrt{k}\Delta}{\sqrt{|\mathcal{C}_{k}(V)|}}.

The last inequality comes directly from the definition of CfC_{f} in Equation 4 from which we have Cf2|𝒞k|2C_{f}\geq 2|\mathcal{C}_{k}|^{2}. ∎

Theorem 3.14.

For each t1t\geq 1, f(αt)f(α)2xmax|𝒞k|2t+2(1+β)f(\alpha^{t})-f(\alpha^{*})\leq\frac{2x_{max}|\mathcal{C}_{k}|^{2}}{t+2}(1+\beta) where β=4kΔ|𝒞k(V)|\beta=\frac{4\sqrt{k}\Delta}{\sqrt{|\mathcal{C}_{k}(V)|}}.

Proof.

The proof can be obtained based on the results established in (Jaggi, 2013). Specifically, in (Jaggi, 2013), the authors show that for each t1t\geq 1, f(αt)f(α)2Cft+2(1+β)f(\alpha^{t})-f(\alpha^{*})\leq\frac{2C_{f}}{t+2}(1+\beta). As describe above, CfC_{f} is bounded by xmax|𝒞k|2x_{max}|\mathcal{C}_{k}|^{2} and β\beta is bounded by 4kΔ|𝒞k(V)|\frac{4\sqrt{k}\Delta}{\sqrt{|\mathcal{C}_{k}(V)|}}. ∎

4. New Sampling-Based Algorithm

Since the hardness of kk-𝖣𝖲𝖲\mathsf{DSS}, FW-based solutions may still be costly when handling very large graphs. To further improve the efficiency, sampling-based solutions are often used which can typically obtain a good approximation of the kk-clique densest subgraph (Mitzenmacher et al., 2015; Sun et al., 2020; He et al., 2023). In this section, we propose a new but efficient sampling-based algorithm, called 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}, which employs the 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} algorithm porposed in (Ye et al., 2022) to sample kk-cliques. A remarkable feature of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is that it has polynomial time complexity. To the best of our knowledge, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is the first algorithm that runs in polynomial time, thus it can handle large graphs. We will also present a detailed theoretical analysis of the accuracy bound of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}.

4.1. The 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm

For sampling-bases solutions, the state-of-the-arts are the 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} algorithm (Sun et al., 2020; He et al., 2023). 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} counts all kk-cliques at first, and then samples kk-cliques uniformly. 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} builds the SCT-Index at first, and then samples kk-cliques uniformly from the SCT-Index. Thus, both 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} suffer from exponential time complexity and are often intractable for handling large graphs. To overcome these limitations, we develop a poly-nominal algorithm, called 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}.

First, we briefly introduce the 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} algorithm which was originally proposed to estimate the number of kk-cliques in a graph (Ye et al., 2022). In 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}, we use 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} as a uniformly kk-clique sampler with polynomial running time. 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} first colors the graph (using a linear-time greedy graph coloring algorithm) such that the vertices of each edge in the graph must have different colors. 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} is an efficient algorithm for counting kk-cliques through sampling from a combinatoric structure called kk-color path. Specifically, a kk-color path is a path with kk vertices, and the kk vertices have kk different colors. Each kk-clique must be a kk-color path, i.e. the set of kk-cliques is a subset of kk-color paths (because the vertices in a clique must have different colors). 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} is a polynomial dynamic programming algorithm which can uniformly samples from kk-color paths. Since the set of kk-cliques is a subset of kk-color paths, the kk-cliques can also be sampled uniformly. Note that checking if a kk-color-path is a kk-clique can be easily done in O(k2)O(k^{2}) time by verifying whether any pair of node is connected. Similar to (Ye et al., 2022), we can regard the probability pp^{\prime} that a CCPATH is a kk-clique as a graph-related parameter which is often very high as shown in (Ye et al., 2022).

The details of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} are outlined in Algorithm 4. Algorithm 4 admits tt as a parameter for the size of samples (line 1). Firstly, the algorithm uniformly samples the kk-cliques and get the set of samples \mathbb{C} (lines 1-2). Then, the algorithm invokes 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ on \mathbb{C} in a small number of iterations, and then returns the approximation result (lines 3-10).

Input: The graph G(V,E)G(V,E), clique size kk, sample size tt, a smaller number TT
Output: An approximation of VV^{*}
1
2\mathbb{P}\leftarrow get tt uniform samples from all kk-color paths through the 𝖢𝖢𝖯𝖠𝖳𝖧\mathsf{CCPATH} algorithm;
3 \mathbb{C}\leftarrow the set of kk-cliques in \mathbb{P};
4
5r(u)0,uVr(u)\leftarrow 0,\forall u\in V;
6 for t1t\leftarrow 1 to TT do
7       foreach kk-clique CC\in\mathbb{C} do
8             uargminvCr(v)u\leftarrow\arg\min_{v\in C}{r(v)};
9             r(u)r(u)+1r(u)\leftarrow r(u)+1;
10            
11      
12Sort VV by r(u)r(u) in non-increasing order;
13 Denote ViV_{i} by the previous ii vertices of the sorted VV;
14 return argmaxViρk(Vi)\arg\max_{V_{i}}\rho_{k}^{\prime}(V_{i}) where ρk(Vi)=|𝒞k(Vi)||Vi|\rho_{k}^{\prime}(V_{i})=\frac{|\mathcal{C}_{k}(V_{i})\cap\mathbb{C}|}{|V_{i}|};
Algorithm 4 The Proposed 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} Algorithm

As shown in Algorithm 4, the proposed 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm is very simple, but it is very efficient. Below, we first analyze the time and space complexity of Algorithm 4.

Theorem 4.1.

The time complexity of Algorithm 4 is O(|V|δ2k+(δk+k2)t+Tkt)O(|V|\delta^{2}k+(\delta k+k^{2})t+Tkt). And the space complexity of Algorithm 4 is O(kt)O(kt).

Proof.

Note that by (Ye et al., 2022) the time complexity taken by sampling is O(|V|δ2k+(δk+k2)t)O(|V|\delta^{2}k+(\delta k+k^{2})t) (Ye et al., 2022). Since the size of \mathbb{C} is bounded by tt, the time complexity of 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ is O(Tkt)O(Tkt). The memory cost of Algorithm 4 is dominated by storing of \mathbb{C}, which uses at most O(kt)O(kt) space. ∎

As shown in Theorem 4.1, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is completely free from the count of kk-cliques. As shown in our experiments, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is very efficient and it can solve the kk-DSS problem on very large graphs that has vertices more than 1.8 ×109\times 10^{9}. Below, we present the theoretical analysis of the accuracy guarantee of the 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm. The results show that the accuracy of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is based on a mild condition, which also explains the good accuracy of the previous works (Sun et al., 2020; He et al., 2023).

4.2. Analysis of the Algorithm

In this subsection, we analyze the accuracy bound of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}. For our analysis, we need the following Chernoff bound.

Theorem 4.2 (Chernoff bound).

Let X=i=1nXiX=\sum_{i=1}^{n}{X_{i}}, where XiX_{i} are independent with each other. Xi=1X_{i}=1 with probability pip_{i} and Xi=0X_{i}=0 with probability 1pi1-p_{i}. Let E[X]=μ=i=1npiE[X]=\mu=\sum_{i=1}^{n}{p_{i}}. θ\theta is a number that 0<θ<10<\theta<1. Then

Pr(|Xμ|θμ)2exp(μθ23).\Pr(|X-\mu|\geq\theta\mu)\leq 2\exp(-\frac{\mu\theta^{2}}{3}).

Let RR be the set of vertices CC\cup_{C\in\mathbb{C}}{C}, where \mathbb{C} is the set of sampled kk-cliques (line 2 of Algorithm 4). Let G(R)G(R^{*}) be the ground truth of the kk-clique densest subgraph in G(R)G(R), which has the largest value ρk\rho_{k}. Let 𝒞k(U)=|𝒞k(U)|\mathcal{C}^{\prime}_{k}(U)=|\mathcal{C}_{k}(U)\cap\mathbb{C}| and ρk(U)=|𝒞k(U)||U|\rho_{k}^{\prime}(U)=\frac{|\mathcal{C}^{\prime}_{k}(U)|}{|U|} denotes the kk-clique density in the sampled kk-cliques. R~\tilde{R^{*}} is the result returned by 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}, which is expected to have the largest value of ρk\rho_{k}^{\prime}.

Let s(C)s(C) be the random variable that indicates whether the kk-clique CC is sampled or not, i.e. s(C)=1s(C)=1 if CC\in\mathbb{C} or s(C)=0s(C)=0 otherwise. Let pp be the probability of each kk-clique being sampled. Denote pp^{\prime} by the probability of a kk-color path being a kk-clique (line 2 of Algorithm 4), then we have p=tp|𝒞k(V)|p=\frac{tp^{\prime}}{|\mathcal{C}_{k}(V)|}. Subsequently, we can derive the following results.

Lemma 4.3.

E[𝒞k(U)]=p𝒞k(U)E[\mathcal{C}^{\prime}_{k}(U)]=p\mathcal{C}_{k}(U).

Proof.

E[𝒞k(U)]=E[|𝒞k(U)|]=E[C𝒞k(U)s(C)]=p𝒞k(U)E[\mathcal{C}^{\prime}_{k}(U)]=E[|\mathcal{C}_{k}(U)\cap\mathbb{C}|]=E[\sum_{C\in\mathcal{C}_{k}(U)}{s(C)}]=p\mathcal{C}_{k}(U). ∎

Lemma 4.4.

E[ρk(U)]=pρk(U)E[\rho_{k}^{\prime}(U)]=p\rho_{k}(U).

Proof.

E[ρk(U)]=E[|𝒞k(U)||U|]=p𝒞k(U)|U|=pρk(U)E[\rho_{k}^{\prime}(U)]=E[\frac{|\mathcal{C}^{\prime}_{k}(U)|}{|U|}]=\frac{p\mathcal{C}_{k}(U)}{|U|}=p\rho_{k}(U).

Based on the above lemmas, we can obtain the following theorem.

Theorem 4.5.

Let ϵ\epsilon and θ\theta be small constant numbers. We can conclude that ρk(R~)\rho_{k}(\tilde{R^{*}}) is an 12θ1-2\theta approximation of ρk(V)\rho_{k}(V^{*}) with probability 1ϵ1-\epsilon if ρk(R~)3lnϵ/4pθ2\rho_{k}(\tilde{R^{*}})\geq\frac{-3\ln{\epsilon/4}}{p\theta^{2}}.

Proof.

Set the vertices set UU in Lemma 4.4 by VV^{*}. Then, we have E[ρk(V)]=pρk(V)E[\rho_{k}^{\prime}(V^{*})]=p\rho_{k}(V^{*}). According to Chernoff bound, we have

Pr(|ρk(V)pρk(V)|θpρk(V))2exp(pρk(V)θ23).\small\Pr(|\rho_{k}^{\prime}(V^{*})-p\rho_{k}(V^{*})|\geq\theta p\rho_{k}(V^{*}))\leq 2\exp(-\frac{p\rho_{k}(V^{*})\theta^{2}}{3}).

This immediately gives

Pr(pρk(V)ρk(V)θpρk(V))2exp(pρk(V)θ23).\Pr(p\rho_{k}(V^{*})-\rho_{k}^{\prime}(V^{*})\geq\theta p\rho_{k}(V^{*}))\leq 2\exp(-\frac{p\rho_{k}(V^{*})\theta^{2}}{3}).

Since ρk(R~)ρk(V)\rho_{k}^{\prime}(\tilde{R^{*}})\geq\rho_{k}^{\prime}(V^{*}), then we have

(6) Pr(pρk(V)ρk(R~)θpρk(V))2exp(pρk(V)θ23).\Pr(p\rho_{k}(V^{*})-\rho_{k}^{\prime}(\tilde{R^{*}})\geq\theta p\rho_{k}(V^{*}))\leq 2\exp(-\frac{p\rho_{k}(V^{*})\theta^{2}}{3}).

Let the vertices set UU in Lemma 4.4 be R~\tilde{R^{*}}, we have

Pr(|ρk(R~)pρk(R~)|θpρk(R~))2exp(pρk(R~)θ23).\Pr(|\rho_{k}^{\prime}(\tilde{R^{*}})-p\rho_{k}(\tilde{R^{*}})|\geq\theta p\rho_{k}(\tilde{R^{*}}))\leq 2\exp(-\frac{p\rho_{k}(\tilde{R^{*}})\theta^{2}}{3}).

As a result, we have

Pr(ρk(R~)pρk(R~)θpρk(R~))2exp(pρk(R~)θ23).\Pr(\rho_{k}^{\prime}(\tilde{R^{*}})-p\rho_{k}(\tilde{R^{*}})\geq\theta p\rho_{k}(\tilde{R^{*}}))\leq 2\exp(-\frac{p\rho_{k}(\tilde{R^{*}})\theta^{2}}{3}).

Since ρk(V)ρk(R~)\rho_{k}(V^{*})\geq\rho_{k}(\tilde{R^{*}}), we can derive that

(7) Pr(ρk(R~)pρk(R~)θpρk(V))2exp(pρk(R~)θ23).\Pr(\rho_{k}^{\prime}(\tilde{R^{*}})-p\rho_{k}(\tilde{R^{*}})\geq\theta p\rho_{k}(V^{*}))\leq 2\exp(-\frac{p\rho_{k}(\tilde{R^{*}})\theta^{2}}{3}).

With Equations 6 and 7, we are able to derive that

(8) Pr\displaystyle\Pr (ρk(V)ρk(R~)2θρk(V))\displaystyle\left(\rho_{k}(V^{*})-\rho_{k}(\tilde{R^{*}})\geq 2\theta\rho_{k}(V^{*})\right)\leq
2exp(pρk(V)θ23)+2exp(pρk(R~)θ23).\displaystyle 2\exp(-\frac{p\rho_{k}(V^{*})\theta^{2}}{3})+2\exp(-\frac{p\rho_{k}(\tilde{R^{*}})\theta^{2}}{3}).

To ensure Pr(ρk(V)ρk(R~)ρk(V)2θ)ϵ,\Pr\left(\frac{\rho_{k}(V^{*})-\rho_{k}(\tilde{R^{*}})}{\rho_{k}(V^{*})}\geq 2\theta\right)\leq\epsilon, it has 2exp(pρk(V)θ23)ϵ22\exp(-\frac{p\rho_{k}(V^{*})\theta^{2}}{3})\leq\frac{\epsilon}{2} and 2exp(pρk(R~)θ23ϵ2,2\exp(-\frac{p\rho_{k}(\tilde{R^{*}})\theta^{2}}{3}\leq\frac{\epsilon}{2}, and we can conclude that pρk(V)pρk(R~)3lnϵ/4θ2p\rho_{k}(V^{*})\geq p\rho_{k}(\tilde{R^{*}})\geq\frac{-3\ln{\epsilon/4}}{\theta^{2}}. ∎

Theorem 4.6.

In expectation, our 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm can achieve an (ϵ,θ)(\epsilon,\theta)-approximation if t3|V|lnϵ/4pθ2t\geq\frac{-3|V|\ln{\epsilon/4}}{p^{\prime}\theta^{2}}.

Proof.

We can bound ρk(R~)\rho_{k}(\tilde{R^{*}}) by ρk(V)\rho_{k}(V) in Theorem 4.5,

pρk(R~)pρk(V)=tp|𝒞k(V)||𝒞k(V)||V|3lnϵ/4θ2.p\rho_{k}(\tilde{R^{*}})\geq p\rho_{k}(V)=\frac{tp^{\prime}}{|\mathcal{C}_{k}(V)|}\frac{|\mathcal{C}_{k}(V)|}{|V|}\geq\frac{-3\ln{\epsilon/4}}{\theta^{2}}.

Thus, we have t3|V|lnϵ/4pθ2.t\geq\frac{-3|V|\ln{\epsilon/4}}{p^{\prime}\theta^{2}}.

Theorem 4.6 proves that the number of iterations of 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is liner with respect to |V|/p|V|/p^{\prime}.

Discussions. Theorem 4.5 shows that if a subgraph reported by a sampling-based algorithm is dense enough, the subgraph should also be an good approximation. Note that Theorem 4.5 is based on the condition that ρk(R~)\rho_{k}(\tilde{R^{*}}) should be large enough. It is important to note that this is a mild condition. The reasons are as follows. First, 3lnϵ/4θ2\frac{-3\ln{\epsilon/4}}{\theta^{2}} is not large. For example, 3lnϵ/4θ2\frac{-3\ln{\epsilon/4}}{\theta^{2}} is around 17971797 when ϵ=0.01\epsilon=0.01 and θ=0.1\theta=0.1. Second, when pρk(V)p\rho_{k}(V^{*}) is small, the input graph should be very sparse, thus we can utilize exact algorithms to solve it. Third, pp is always not small. As shown in (Ye et al., 2022), pp^{\prime} is often a large value for real-world graphs. Fourth, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} returns the subgraph with the maximum value of ρk\rho^{\prime}_{k} (line 10 of Algorithm 4). Based on these reasons, such a condition is often easily satisfied for real-world graphs.

Table 3. Datasets (δ\delta is the degeneracy, η\eta is the size of SCT)
Networks |𝐕|\mathbf{|V|} |𝐄|\mathbf{|E|} δ\mathbf{\delta} η\eta
𝖶𝗂𝗄𝗂𝖵\mathsf{WikiV} 7,115 100,762 53 489,803
𝖢𝖺𝗂𝖽𝖺\mathsf{Caida} 26,475 53,381 22 8,312
𝖤𝗉𝗂𝗇𝗂𝗈𝗇\mathsf{Epinion} 75,879 405,740 67 1,437,313
𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla} 196,591 950,327 51 930,005
𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon} 403,394 2,443,408 10 660,944
𝖣𝖡𝖫𝖯\mathsf{DBLP} 425,957 1,049,866 113 166,725
𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan} 685,230 6,649,470 201 2,430,187
𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube} 1,157,827 2,987,625 51 1,397,529
𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec} 1,632,803 22,301,964 47 12,492,547
𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter} 1,696,415 11,095,298 111 12,548,404
𝖮𝗋𝗄𝗎𝗍\mathsf{Orkut} 3,072,627 117,185,083 253 264,754,163
𝖥𝗋𝗂𝖾𝗇𝖽\mathsf{Friend} 65,608,366 1,806,067,135 304 3,876,765,479
Refer to caption
(a) 𝖶𝗂𝗄𝗂𝖵\mathsf{WikiV}
Refer to caption
(b) 𝖢𝖺𝗂𝖽𝖺\mathsf{Caida}
Refer to caption
(c) 𝖤𝗉𝗂𝗇𝗂𝗈𝗇\mathsf{Epinion}
Refer to caption
(d) 𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla}
Refer to caption
(e) 𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon}
Refer to caption
(f) 𝖣𝖡𝖫𝖯\mathsf{DBLP}
Refer to caption
(g) 𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan}
Refer to caption
(h) 𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube}
Refer to caption
(i) 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec}
Refer to caption
(j) 𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter}
Figure 4. Running time of the Frank-Wolfe based algorithms (T=10T=10).
Refer to caption
(a) 𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla}
Refer to caption
(b) 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec}
Figure 5. Running time of different Frank-Wolfe based algorithms with varying TT.
Refer to caption
(a) 𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon}
Refer to caption
(b) 𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla}
Figure 6. The kk-clique density obtained by various algorithm within the same running time.
Table 4. The kk-clique density for various TT (k=7k=7).
Networks T=1T=1 T=5T=5 T=10T=10
𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ 𝖲𝖢𝖳𝖫\mathsf{SCTL} 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ 𝖲𝖢𝖳𝖫\mathsf{SCTL} 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ 𝖲𝖢𝖳𝖫\mathsf{SCTL} 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}
𝖶𝗂𝗄𝗂𝖵\mathsf{WikiV} 35182.25 35182.25 35182.25 35182.25 35182.25 35182.25 35182.25 35182.25 35182.25
𝖢𝖺𝗂𝖽𝖺\mathsf{Caida} 2203.84 2203.84 2203.84 2203.84 2203.84 2203.84 2203.84 2203.84 2203.84
𝖤𝗉𝗂𝗇𝗂𝗈𝗇\mathsf{Epinion} 482125.39 481818.44 481818.44 482125.39 482125.39 482125.39 482125.39 482125.39 482125.39
𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla} 115291.27 114200.51 115064.38 115291.27 115268.24 115291.27 115291.27 115291.27 115291.27
𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon} 86.00 26.08 52.33 86.00 80.37 76.86 86.00 83.69 83.69
𝖣𝖡𝖫𝖯\mathsf{DBLP} - 360937368.00 360937368.00 - 360937368.00 360937368.00 - 360937368.00 360937368.00
𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan} - - 1226107478.17 - - 1230103452.99 - - 1230103452.99
𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube} 15137.78 15045.44 15044.36 15137.78 15130.52 15116.24 15137.78 15134.27 15127.43
𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec} 137917.47 137917.47 137917.47 137917.47 137917.47 137917.47 137917.47 137917.47 137917.47
𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter} - 111767674.10 111767674.13 - 111861828.44 111861828.44 - 111882281.10 111882281.10
Table 5. The number of iterations and running time needed to find VV^{*} (k=7k=7).
Networks 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ 𝖲𝖢𝖳𝖫\mathsf{SCTL} 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL}
TT Time (s) TT Time (s) TT Time (s)
𝖶𝗂𝗄𝗂𝖵\mathsf{WikiV} 1 0.87 1 0.14 1 0.12
𝖢𝖺𝗂𝖽𝖺\mathsf{Caida} 1 0.01 1 0.01 1 0.00
𝖤𝗉𝗂𝗇𝗂𝗈𝗇\mathsf{Epinion} 1 11.61 2 2.29 2 1.17
𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla} 1 5.34 5 2.67 5 0.96
𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon} 1 0.13 20 1.55 20 0.45
𝖣𝖡𝖫𝖯\mathsf{DBLP} - - 1 1286.87 1 0.04
𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube} 1 0.91 20 2.77 110 14.88
𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec} 1 10.40 1 2.47 1 1.62
Refer to caption
Figure 7. Memory costs of the Frank-Wolfe based algorithms on 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec}.
Refer to caption
(a) 𝖤𝗉𝗂𝗇𝗂𝗈𝗇\mathsf{Epinion}
Refer to caption
(b) 𝖦𝗈𝗐𝖺𝗅𝗅𝖺\mathsf{Gowalla}
Refer to caption
(c) 𝖠𝗆𝖺𝗓𝗈𝗇\mathsf{Amazon}
Refer to caption
(d) 𝖣𝖡𝖫𝖯\mathsf{DBLP}
Refer to caption
(e) 𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan}
Refer to caption
(f) 𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube}
Refer to caption
(g) 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec}
Refer to caption
(h) 𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter}
Refer to caption
(i) 𝖮𝗋𝗄𝗎𝗍\mathsf{Orkut}
Refer to caption
(j) 𝖥𝗋𝗂𝖾𝗇𝖽\mathsf{Friend}
Figure 8. Running time of the sampling based algorithms for various kk. (t=500000t=500000).
Table 6. The kk-clique density obtained by different sampling algorithms within fixed time (k=5k=5).
Algorithms 𝖣𝖡𝖫𝖯\mathsf{DBLP} ρk(V)=1287748.0\rho_{k}(V^{*})=1287748.0 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec} ρk(V)=11545.5\rho_{k}(V^{*})=11545.5 𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter} ρk(V)=1119664.6\rho_{k}(V^{*})=1119664.6 𝖥𝗋𝗂𝖾𝗇𝖽\mathsf{Friend}
Time ρk(V^)\rho_{k}(\hat{V^{*}}) Time ρk(V^){\rho_{k}(\hat{V^{*}})} Time ρk(V^){\rho_{k}(\hat{V^{*}})} Time ρk(V^){\rho_{k}(\hat{V^{*}})}
𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} <0.1s<0.1s - <1s<1s - <1s<1s - <20s<20s -
𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} - - - -
𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} 1287748.0 11511.1 1115421.7 55815.2
𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} <1s<1s - <10s<10s - <10s<10s - <60s<60s -
𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} 1287748.0 11545.5 1119546.2 -
𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} 1287748.0 11545.5 1118411.5 68573.8
𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} <20s<20s 1287748.0 <30s<30s 11545.5 <100s<100s 1119450.7 <100s<100s -
𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} 1287748.0 11545.5 1119664.6 -
𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} 1287748.0 11545.5 1118808.0 68748.8

5. Experiments

Algorithms. For the Frank-Wolfe based algorithms, we implement the 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm which is based on Algorithm 3. We use the state-of-the-art algorithm 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ (Sun et al., 2020) and 𝖲𝖢𝖳𝖫\mathsf{SCTL} (He et al., 2023) as the baseline algorithms. 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and 𝖲𝖢𝖳𝖫\mathsf{SCTL} are all implementations of Algorithm 1. 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ is a Frank-Wolfe based algorithm for kk-𝖣𝖲𝖲\mathsf{DSS} that scan over the kk-cliques individually through kk-clique listing. 𝖲𝖢𝖳𝖫\mathsf{SCTL} is a Frank-Wolfe based algorithm, which scan over the kk-cliques in batches through the SCT-index.

For the sampling-based algorithms, we implement the 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm (Algorithm 4). For comparison, we use the state-of-the-art sampling-based algorithms 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} (Sun et al., 2020) and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} (He et al., 2023) as the baselines. Given a parameter tt, 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ samples tt kk-cliques during the procedure of kk-clique listing, and 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} samples tt kk-cliques through SCT-index.

The source code of 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} is open available (Sun et al., 2020), which is implemented in C++. Since the code of 𝖲𝖢𝖳𝖫\mathsf{SCTL} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} is not available, we implement them by ourselves in C++, which shows similar performance compared to the results reported in (He et al., 2023).

Datasets. The details of the datasets are shown in Table 3. The 5 columns of Table 3 are the dataset name, number of vertices, number of edges, degeneracy and the size of SCT respectively. All datasets are downloaded from https://snap.stanford.edu/ or http://konect.cc/.

We evaluate all algorithms on a server with an AMD 3990X CPU and 256GB memory running Linux CentOS 7 operating system.

5.1. Results of the FW-based algorihtms

Exp 1 : Runtime of various algorithms with varying kk. Figure 4 shows the running time of 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++, 𝖲𝖢𝖳𝖫\mathsf{SCTL}, and 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} for varying kk. To show the advantage of 𝖯𝖡𝖴𝗉𝖽𝖺𝗍𝖾\mathsf{PBUpdate}, we omit the time taken by clique enumeration of 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and SCT construction.

As kk increases, the running time of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} tends to be small. This is because the size of the SCT decreases, i.e. the value of |||\mathbb{P}| (Theorem 3.8) tends to be small as kk increases (Jain and Seshadhri, 2020). For example, on 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec}, we have ||=10859743|\mathbb{P}|=10859743 when k=4k=4, while ||=837568|\mathbb{P}|=837568 when k=9k=9. As shown in Figure 4, 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} substantially outperforms both 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and 𝖲𝖢𝖳𝖫\mathsf{SCTL} when k=9k=9 on all the datasets. For example, on 𝖣𝖡𝖫𝖯\mathsf{DBLP}, our 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm can achieve more than 55 orders of magnitude faster than both 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ and 𝖲𝖢𝖳𝖫\mathsf{SCTL} when k7k\geq 7. These results demonstrate the high efficiency of the proposed 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm.

Additionally, on the complex networks that the degeneracy is larger than 100100 (𝖣𝖡𝖫𝖯\mathsf{DBLP}, 𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan} and 𝖲𝗄𝗂𝗍𝗍𝖾𝗋\mathsf{Skitter} in Figure 4), our 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm can consistently outperform the state-of-the-arts. On these networks, the count of cliques is very large. For example, 𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan} has 9.4×10129.4\times 10^{12} 77-cliques. The excellent performance of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is due to the fact that the running time of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is free from the count of kk-cliques, which confirms our theoretical analysis in Section 3.

Exp 2 : Runtime of various algorithms with varying TT. Figure 5 shows the performance of the Frank-Wolfe based algorithms when TT varies on Gowalla and Pokec. The results on the other datsets are consistent. We omit the time taken by clique enumeration and SCT construction. As expected, the running time of all algorithms is linear with respect to (w.r.t.) TT. Once again, our 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is much more efficient than the existing algorithms. These results further demonstrate the high efficiency of the proposed solution.

Exp 3 : kk-clique density with varying TT. In Table 4, only 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can handle all the tested networks. 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ can reach to the optimal results in only one iteration, but can not handle the complex networks with a large degeneracy, in which there exists a large number of kk-cliques, like 𝖡𝖾𝗋𝗄𝗌𝗍𝖺𝗇\mathsf{Berkstan}. Both 𝖲𝖢𝖳𝖫\mathsf{SCTL} and 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can achieve a near-optimal approximation within a few iterations. These results further confirm the scalability of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} and the ability of 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} to derive a good approximation within only few iterations.

Exp 4 : Running time needed to find VV^{*}. In the experiments, we find that 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can reach VV^{*} on all the tested datasets. Table 5 shows the number of iterations as well as the running time needed to find VV^{*}. We find that 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} achieves the lowest running time to obtain VV^{*} on 66 datasets. Although 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} requires 110110 iterations to converge to VV^{*} on 𝖸𝗈𝗎𝗍𝗎𝖻𝖾\mathsf{Youtube}, it can get a 99.999%99.999\% approximation using only 2.02.0 seconds. These results further confirm the high efficiency of the proposed 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} algorithm.

Exp 5 : kk-clique density within fixed time. Figure 6 compares the kk-clique density of the results of 𝖲𝖢𝖳𝖫\mathsf{SCTL} and 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} with the same running time. In Figure 6, the results show that 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} can get larger kk-clique density than 𝖲𝖢𝖳𝖫\mathsf{SCTL} with the same running time. The results on other datasets is in Table 5 as described in Exp 4.

Exp 6 : Memory overheads. We plot the memory costs of the Frank-Wolfe based algorithms in Figure 7. As can be seen, 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} and 𝖲𝖢𝖳𝖫\mathsf{SCTL} have similar memory costs. This is because the memory costs of 𝖲𝖢𝖳𝖫\mathsf{SCTL} and 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} are mainly taken by storing the SCT, while the memory cost of 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ is by the storage of all kk-cliques. As kk increases, the count of kk-cliques grows and the size of SCT shrinks.These results indicate that our 𝖯𝖲𝖢𝖳𝖫\mathsf{PSCTL} is space-efficient.

5.2. Results of the Sampling-based algorihtms

Exp 7 : Running time with varying kk. In Figure 8, we show the running time of the sampling-based algorithms for various values of kk. From Figure 8(a) to 8(d), we can see that 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} achieve comparable performance. However, for large networks in Figure 8(e)\sim8(j), our 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} algorithm substantially outperforms 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample}. For example, on the 𝖯𝗈𝗄𝖾𝖼\mathsf{Pokec} dataset, our algorithm is at least one order of magnitude faster than the two baselines. We also note that on the largest graph 𝖥𝗋𝗂𝖾𝗇𝖽\mathsf{Friend}, only our algorithm can obtain the results, while both of 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} and 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} fail to drive the results. These results confirm the high efficiency and scalability of our sampling-based algorithm.

Exp 8 : kk-clique density within fixed time. In Table 6, we compare kk-clique density achieved by the sampling based algorithms within fixed running time. As shown in Table 6, our 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} one order of magnitude faster than existing algorithms and achieve good results. For example, on 𝖣𝖡𝖫𝖯\mathsf{DBLP}, 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} can terminate in 0.1s0.1s while 𝖲𝖢𝖳𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{SCTSample} needs around 1s1s and 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} needs about 20s20s. Furthermore, our 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample} is the only algorithm that can obtain the results on the largest graph 𝖥𝗋𝗂𝖾𝗇𝖽\mathsf{Friend}. These results confirm the high efficiency, effectiveness and scalability of our 𝖢𝖯𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{CPSample}.

6. related work

Densest subgraph problem (DSP). Given a graph and a measure of density, DSP requires to find a subset of vertices whose induced subgraph maximizes the value of density. DSP is a famous problem that has been studied for over five decades, which has a lot of variants and applications (Lanciano et al., 2023). For the traditional Edge-based Densest Subgraph Problem, several key algorithmic approaches have emerged. These encompass maximum-flow-based algorithms (Goldberg, 1984), LP-based algorithms (Charikar, 2000b; Danisch et al., 2017) and peeling-based algorithms (Charikar, 2000b; Boob et al., 2020). There are also many variants of DSP. Densest kk-subgraph problem aims at finding the densest subgraph with size kk (Feige et al., 2001); Densest at-least(most)-k-subgraph problem aims at finding the densest subgraph with size at least(most) kk (Andersen and Chellapilla, 2009); Anchored Densest Subgraph Problem tries to find the densest subgraph that contains a given seed set (Dai et al., 2022); Fair densest subgraph is the densest subgraphs that has equal represented colors (Anagnostopoulos et al., 2020; Miyauchi et al., 2023); Motif-based densest subgraph is the generalized version of kk-clique densest subgraph, where the density is defined by the count of a given motif (Fang et al., 2019). On different graphs like directed graphs (Charikar, 2000b; Ma et al., 2020), temporal graphs (Bogdanov et al., 2011) and hypergraphs (Huang and Kahng, 1995), there also exists the corresponding DSPs.

Large near-clique detection.The maximum clique problem is an important problem which has a lot of applications (Tomita, 2017; Chang, 2020). However, maximum clique often has a very tight constraints for real-world applications. Thus, a lot of relaxed models for large near-clique detection were proposed (Tsourakakis, 2015; Batagelj and Zaversnik, 2003; Balasundaram et al., 2011; Chen et al., 2021). The kk-clique densest subgraph studied in this work is known as a large near-clique. State-of-the-art algorithms for solving kk-𝖣𝖲𝖲\mathsf{DSS} are primarily rooted in the Frank-Wolfe framework, because kk-𝖣𝖲𝖲\mathsf{DSS} can be formulated as a convex programming. Sun (Sun et al., 2020) introduced the first Frank-Wolfe-based algorithm, named 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++. 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++ operates by sequentially iterating over individual kk-cliques, and each kk-clique assigns a weight to the vertex with the smallest weight among the kk vertices. With an adequate number of iterations, it can converge to the optimal solution VV^{*}. Recent advancements by He et al. (He et al., 2023) have accelerated 𝖪𝖢𝗅𝗂𝗌𝗍\mathsf{KClist}++, 𝖪𝖤𝖷𝖠𝖢𝖳\mathsf{KEXACT} and 𝖪𝖢𝖫𝖲𝖺𝗆𝗉𝗅𝖾\mathsf{KCLSample} by the technique SCT-index. The SCT-index, building upon the SCT-tree data structure (Jain and Seshadhri, 2020), enables batch-wise iteration over kk-cliques, significantly enhancing efficiency. Other methods include flow-based algorithms (Mitzenmacher et al., 2015; Tsourakakis, 2015) and peeling-based algorithms (Fang et al., 2019), but they are not as efficient as the Frank-Wolfe based algorithms (He et al., 2023). There also exists a lot of other near-clique models (Batagelj and Zaversnik, 2003; Balasundaram et al., 2011; Chang, 2023). Maximum kk-core is the largest subgraph that each vertex has degree larger than kk (Batagelj and Zaversnik, 2003). Maximum kk-plex is the maximum subgraph that each vertex has at most kk non-neighbors (Balasundaram et al., 2011; Chang et al., 2022). Maximum ss-defective clique is the maximum subgraph that misses at most ss edges compared to clique (Chen et al., 2021; Chang, 2023; Gao et al., 2022).

7. conclusion

In this paper, we study the problem of kk-clique densest subgraph search. We propose a new Frank-Wolfe-based algorithm, whose time complexity is free from the count of kk-cliques. Thus, it is very efficient for processing large graphs that often have a extremely number of kk-cliques. We present a detailed theoretical analysis of our algorithms. To further improve the efficiency, we also propose a new and provable sampling-based algorithm. A nice feature of our algorithm is that it has polynomial time complexity. We conduct extensive experiments on 12 large real-world graphs to evaluate our algorithms, and the results demonstrate the high efficiency and scalability of our approaches.

References

  • (1)
  • Anagnostopoulos et al. (2020) Aris Anagnostopoulos, Luca Becchetti, Adriano Fazzone, Cristina Menghini, and Chris Schwiegelshohn. 2020. Spectral Relaxations and Fair Densest Subgraphs. In CIKM. 35–44.
  • Andersen and Chellapilla (2009) Reid Andersen and Kumar Chellapilla. 2009. Finding Dense Subgraphs with Size Bounds. In WAW, Vol. 5427. Springer, 25–37.
  • Balasundaram et al. (2011) Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. 2011. Clique Relaxations in Social Network Analysis: The Maximum k-Plex Problem. Oper. Res. 59, 1 (2011), 133–142.
  • Batagelj and Zaversnik (2003) Vladimir Batagelj and Matjaz Zaversnik. 2003. An O(m) Algorithm for Cores Decomposition of Networks. CoRR cs.DS/0310049 (2003).
  • Bogdanov et al. (2011) Petko Bogdanov, Misael Mongiovì, and Ambuj K. Singh. 2011. Mining Heavy Subgraphs in Time-Evolving Networks. In ICDM. IEEE Computer Society, 81–90.
  • Boob et al. (2020) Digvijay Boob, Yu Gao, Richard Peng, Saurabh Sawlani, Charalampos E. Tsourakakis, Di Wang, and Junxing Wang. 2020. Flowless: Extracting Densest Subgraphs Without Flow Computations. In WWW. 573–583.
  • Chang (2020) Lijun Chang. 2020. Efficient maximum clique computation and enumeration over large sparse graphs. VLDB J. 29, 5 (2020), 999–1022.
  • Chang (2023) Lijun Chang. 2023. Efficient Maximum k-Defective Clique Computation with Improved Time Complexity. CoRR abs/2309.02635 (2023).
  • Chang et al. (2022) Lijun Chang, Mouyi Xu, and Darren Strash. 2022. Efficient Maximum k-Plex Computation over Large Sparse Graphs. Proc. VLDB Endow. 16, 2 (2022), 127–139.
  • Charikar (2000a) Moses Charikar. 2000a. Greedy approximation algorithms for finding dense components in a graph. In APPROX 2000, Vol. 1913. Springer, 84–95.
  • Charikar (2000b) Moses Charikar. 2000b. Greedy approximation algorithms for finding dense components in a graph. In International workshop on approximation algorithms for combinatorial optimization. 84–95.
  • Chen et al. (2021) Xiaoyu Chen, Yi Zhou, Jin-Kao Hao, and Mingyu Xiao. 2021. Computing maximum k-defective cliques in massive graphs. Comput. Oper. Res. 127 (2021), 105131.
  • Cui et al. (2008) Guangyu Cui, Yu Chen, De-Shuang Huang, and Kyungsook Han. 2008. An algorithm for finding functional modules and protein complexes in protein-protein interaction networks. Journal of Biomedicine and Biotechnology 2008 (2008).
  • Dai et al. (2022) Yizhou Dai, Miao Qiao, and Lijun Chang. 2022. Anchored Densest Subgraph. In SIGMOD. 1200–1213.
  • Danisch et al. (2017) Maximilien Danisch, T.-H. Hubert Chan, and Mauro Sozio. 2017. Large Scale Density-friendly Graph Decomposition via Convex Programming. In WWW 2017. ACM, 233–242.
  • Eppstein et al. (2013) David Eppstein, Maarten Löffler, and Darren Strash. 2013. Listing All Maximal Cliques in Large Sparse Real-World Graphs. ACM J. Exp. Algorithmics 18 (2013).
  • Fang et al. (2019) Yixiang Fang, Kaiqiang Yu, Reynold Cheng, Laks V. S. Lakshmanan, and Xuemin Lin. 2019. Efficient Algorithms for Densest Subgraph Discovery. Proc. VLDB Endow. 12, 11 (2019), 1719–1732.
  • Feige et al. (2001) Uriel Feige, Guy Kortsarz, and David Peleg. 2001. The Dense k-Subgraph Problem. Algorithmica 29, 3 (2001), 410–421.
  • Fratkin et al. (2006) Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou. 2006. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22, 14 (07 2006), e150–e157.
  • Gao et al. (2022) Jian Gao, Zhenghang Xu, Ruizhi Li, and Minghao Yin. 2022. An Exact Algorithm with New Upper Bounds for the Maximum k-Defective Clique Problem in Massive Sparse Graphs. In AAAI. 10174–10183.
  • Goldberg (1984) Andrew V Goldberg. 1984. Finding a maximum density subgraph. (1984).
  • He et al. (2023) Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, and Ying Zhang. 2023. Scaling Up k-Clique Densest Subgraph Detection. Proc. ACM Manag. Data 1, 1 (2023), 69:1–69:26.
  • Huang and Kahng (1995) Dennis J.-H. Huang and Andrew B. Kahng. 1995. When clusters meet partitions: new density-based methods for circuit decomposition. In ED&TC. 60–64.
  • Jaggi (2013) Martin Jaggi. 2013. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML 2013, Vol. 28. 427–435.
  • Jain and Seshadhri (2020) Shweta Jain and C. Seshadhri. 2020. The Power of Pivoting for Exact Clique Counting. In WSDM. 268–276.
  • Konar and Sidiropoulos (2022) Aritra Konar and Nicholas D. Sidiropoulos. 2022. The Triangle-Densest-K-Subgraph Problem: Hardness, Lovász Extension, and Application to Document Summarization. Proceedings of the AAAI Conference on Artificial Intelligence 36, 4 (Jun. 2022), 4075–4082.
  • Lanciano et al. (2023) Tommaso Lanciano, Atsushi Miyauchi, Adriano Fazzone, and Francesco Bonchi. 2023. A Survey on the Densest Subgraph Problem and its Variants. CoRR abs/2303.14467 (2023).
  • Lee et al. (2010) Victor E. Lee, Ning Ruan, Ruoming Jin, and Charu C. Aggarwal. 2010. A Survey of Algorithms for Dense Subgraph Discovery. In Managing and Mining Graph Data. Advances in Database Systems, Vol. 40. Springer, 303–336.
  • Li et al. (2022) Rong-Hua Li, Qiushuo Song, Xiaokui Xiao, Lu Qin, Guoren Wang, Jeffrey Xu Yu, and Rui Mao. 2022. I/O-Efficient Algorithms for Degeneracy Computation on Massive Networks. IEEE Transactions on Knowledge and Data Engineering 34, 7 (2022), 3335–3348.
  • Ma et al. (2020) Chenhao Ma, Yixiang Fang, Reynold Cheng, Laks VS Lakshmanan, Wenjie Zhang, and Xuemin Lin. 2020. Efficient algorithms for densest subgraph discovery on large directed graphs. In SIGMOD. 1051–1066.
  • Mitzenmacher et al. (2015) Michael Mitzenmacher, Jakub Pachocki, Richard Peng, Charalampos E. Tsourakakis, and Shen Chen Xu. 2015. Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling. In SIGKDD 2015. ACM, 815–824.
  • Miyauchi et al. (2023) Atsushi Miyauchi, Tianyi Chen, Konstantinos Sotiropoulos, and Charalampos E. Tsourakakis. 2023. Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity. In SIGKDD. 1710–1721.
  • Sun et al. (2020) Bintao Sun, Maximilien Danisch, T.-H. Hubert Chan, and Mauro Sozio. 2020. KClist++: A Simple Algorithm for k-Clique Densest Subgraphs in Large Graphs. Proc. VLDB Endow. 13, 10 (2020), 1628–1640.
  • Tomita (2017) Etsuji Tomita. 2017. Efficient Algorithms for Finding Maximum and Maximal Cliques and Their Applications. In WALCOM, Vol. 10167. Springer, 3–15.
  • Tsourakakis (2015) Charalampos E. Tsourakakis. 2015. The K-clique Densest Subgraph Problem. In WWW. 1122–1132.
  • Ye et al. (2022) Xiaowei Ye, Rong-Hua Li, Qiangqiang Dai, Hongzhi Chen, and Guoren Wang. 2022. Lightning Fast and Space Efficient k-clique Counting. In WWW. 1191–1202.
  • Yu et al. (2006) Haiyuan Yu, Alberto Paccanaro, Valery Trifonov, and Mark Gerstein. 2006. Predicting interactions in protein networks by completing defective cliques. Bioinform. 22, 7 (2006), 823–829.