Finding a Concise, Precise, and Exhaustive Set of Near Bi-Cliques in Dynamic Graphs
Abstract.
A variety of tasks on dynamic graphs, including anomaly detection, community detection, compression, and graph understanding, have been formulated as problems of identifying constituent (near) bi-cliques (i.e., complete bipartite graphs). Even when we restrict our attention to maximal ones, there can be exponentially many near bi-cliques, and thus finding all of them is practically impossible for large graphs. Then, two questions naturally arise: (Q1) What is a “good” set of near bi-cliques? That is, given a set of near bi-cliques in the input dynamic graph, how should we evaluate its quality? (Q2) Given a large dynamic graph, how can we rapidly identify a high-quality set of near bi-cliques in it? Regarding Q1, we measure how concisely, precisely, and exhaustively a given set of near bi-cliques describes the input dynamic graph. We combine these three perspectives systematically on the Minimum Description Length principle. Regarding Q2, we propose CutNPeel, a fast search algorithm for a high-quality set of near bi-cliques. By adaptively re-partitioning the input graph, CutNPeel reduces the search space and at the same time improves the search quality. Our experiments using six real-world dynamic graphs demonstrate that CutNPeel is (a) High-quality: providing near bi-cliques of up to better quality than its state-of-the-art competitors, (b) Fast: up to faster than the next-best competitor, and (c) Scalable: scaling to graphs with million edges. We also show successful applications of CutNPeel to graph compression and pattern discovery.
1. Introduction
Daily activities on the Web generate an enormous amount of data. Especially, many activities, such as e-mail communications, web surfing, online purchases, result in data in the form of graphs, such as e-mail networks, IP-IP communication networks, and user-business bipartite graphs. Most real-world graphs, including the aforementioned ones, evolve over time, and thus each of them is represented as a dynamic graph, i.e., a sequence of graphs over time.
A bi-clique is a complete bipartite graph. That is, a bi-clique consists of two disjoint sets of nodes where every node in a set is adjacent to every node in the other set. We use the term near bi-clique to denote a bipartite graph “close” to a bi-clique with few or no missing edges. We intentionally leave the concept near bi-clique flexible without rigid conditions (e.g., conditions in (Mishra et al., 2004; Bu et al., 2003; Sim et al., 2009)).
Bi-cliques are a fundamental concept in graph theory, and a variety of tasks on static and dynamic graphs are formulated as problems of finding (near) bi-cliques in them. Examples include:
-
•
Anomaly Detection (Shin et al., 2017, 2018, 2021a; Jiang et al., 2015): (Near) bi-cliques in real-world dynamic graphs signal anomalies, such as network intrusion (Shin et al., 2018), spam reviews (Shin et al., 2021a), edit wars on Wikipedia (Shin et al., 2018), and retweet boosting on Weibo (Jiang et al., 2015).
-
•
Community Detection (Kumar et al., 1999; Liu et al., 2006; Araujo et al., 2014): Relevant web pages often do not reference each other (Kumar et al., 1999), and thus finding (near) cliques can be ineffective for detecting them. Instead, related web pages can be detected by finding (near) bi-cliques formed between them and users visiting them (Liu et al., 2006). Near bi-cliques were also searched for detecting temporal communities in phone-call networks (Araujo et al., 2014).
-
•
Lossless Graph Compression (Koutra et al., 2014; Shah et al., 2015; Araujo et al., 2014): A static or dynamic graph can be described concisely but losslessly by (a) (near) bi-cliques and (b) the difference between and the graph described by the (near) bi-cliques (Koutra et al., 2014; Shah et al., 2015; Araujo et al., 2014).
-
•
Phylogenetic Tree Construction (Sanderson et al., 2003): A phylogenetic tree shows the evolutionary interrelationships among multiple species that are thought to have a common ancester. Identifying bi-cliques formed between genes and species containing the genes is helpful for accurate tree construction.
Additionally, (near) bi-cliques have been used for better understanding protein-protein interactions (Bu et al., 2003) and stock markets (Sim et al., 2009).





Due to the importance of (near) bi-cliques, a great number of search algorithms have been developed. Most of them aim to enumerate all maximal (near) bi-cliques that maximize certain criteria within a static or dynamic graph (Alexe et al., 2004; Makino and Uno, 2004; Liu et al., 2006; Kloster et al., 2019; Dias et al., 2005; Quine, 1955; Mishra et al., 2004; Bu et al., 2003; Sim et al., 2009; Das and Tirthapura, 2018). However, as the number of such subgraphs can be exponential in the number of nodes, finding all is practically impossible for large graphs and often unnecessary with many insignificant ones (e.g., too small or highly overlapped ones). Thus, several algorithms aim to find one or a predefined number of (near) bi-cliques that maximize certain criteria in a static or dynamic graph (Shin et al., 2018, 2021a, 2017; Jiang et al., 2015).
This paper focuses on finding a “good” set of near bi-cliques since finding all (maximal) near bi-cliques is infeasible for large graphs and finding one or few of them is insufficient for many applications. Then, two questions naturally arise:
-
(1)
What is a “good” set of near bi-cliques within a graph? How should we measure the quality of a set of near bi-cliques?
-
(2)
How can we rapidly detect a high-quality set of near bi-cliques, especially in large dynamic graphs?
Regarding the first question, we measure how concisely, precisely, and exhaustively a given set of near bi-cliques describes the input dynamic graph . That is, a high-quality set of near bi-cliques in consists of a tractable number of subgraphs close to bi-cliques that together cover a large portion of . We take these three perspectives into account simultaneously in a systematic way based on the Minimum Description Length principle (Grünwald, 2007; Galbrun, 2020).
Regarding the second question, we propose CutNPeel (Cutting aNd Peeling), a fast search algorithm for a high-quality set of near bi-cliques in a dynamic graph. As its name suggests, CutNPeel consists of the “cutting” step and the “peeling” step. Specifically, it first partitions the input dynamic graph to reduce the search space, and then it finds near bi-cliques within each partition one by one by a top-down search. Thanks to our adaptive re-partitioning scheme, surprisingly, the cutting step not only reduces the search space but also improves the quality of near bi-cliques by guiding the top-down search in the peeling step.
We demonstrate the effectiveness of CutNPeel, using six real-world dynamic graphs, though comparisons with four state-of-the-art algorithms (Araujo et al., 2014; Shah et al., 2015; Shin et al., 2018, 2021a) for finding near bi-cliques in dynamic graphs. In summary, CutNPeel has the following advantages:
-
•
High Quality: CutNPeel provides near bi-cliques of up to better quality than the second best method (Figure 1a).
-
•
Speed: CutNPeel is up to faster than the competitors that is the second best in terms of quality (Figure 1a).
-
•
Scalability: Empirically, CutNPeel scales near-linearly with the size of the input graph (Figure 1b).
-
•
Applicability: Using CutNPeel, we achieve the best lossless compression and discover meaningful patterns (Figure 1c-d).
Reproducibility: The source code and datasets used in this paper can be found at https://github.com/hyeonjeong1/cutnpeel.
In Section 2, we introduce some basic concepts. In Section 3, we formally define the problem of finding near bi-cliques in dynamic graphs. In Section 4, we present our proposed algorithm CutNPeel. In Section 5, we provide our experimental results. In Section 6, we discuss related studies. In Section 7, we conclude the paper.
2. Basic Concepts
In this section, we introduce some basic concepts used throughout the paper. We list some frequently-used notations in Table 1.
Symbol | Definition |
input dynamic graph | |
sets of source nodes, destination nodes, and timestamps in | |
set of objects in | |
set of edges in | |
subset of the object set where , , and | |
near bi-clique composed of | |
set of edges in | |
exact bi-clique composed of | |
number of edges in | |
set of near bi-cliques | |
objective function to be minimized | |
sets of exact bi-cliques, remaining edges, and missing edges | |
approximate saving in the objective due to | |
threshold in each -th iteration | |
decrement rate of thresholds and the number of iterations | |
set of partitions in |

Dynamic graphs: A dynamic graph is a sequence of graphs over timestamps . We denote at each time by , where is the set of source nodes, is the set of destination nodes, and is the set of edges. We let be the set of all source nodes and let be the set of all destination nodes. For simplicity, we assume , and are disjoint. That is, unipartite graphs are treated as bipartite graphs (i.e., a source and the same node as a destination are treated as different nodes). We use to denote the set of objects in . We use to indicate the edge at time , and we use to indicate the set of all edges in . For any subset of objects, we use to denote the subgraph of induced by and use to denote the edges in it.
Bi-cliques in dynamic graphs: A bi-clique is a complete bipartite graph. That is, a graph with two disjoint node sets and is a bi-clique if every node in is adjacent to every node , i.e., if . A temporal bi-clique is a bi-clique that appears once or repeatedly over time. For any set of objects, we use to denote the temporal bi-clique formed by the objects, and we use to denote the number of edges in it. That is, if where , , and , then where for all , and .
Near bi-cliques in dynamic graphs: We use the term near bi-clique to denote a bipartite graph “close” to a bi-clique with few or no of missing edges. Specifically, a subgraph is a near bi-clique if it satisfies . Similarly, a dynamic graph with objects and edges is a near temporal bi-clique if it satisfies . Since this paper focuses on dynamic graphs, from now on, we will use the term (near) bi-cliques to refer to (near) temporal bi-cliques for ease of explanation.
3. Problem Formulation
In this section, we present how we measure the quality of a set of near bi-cliques. Then, we define the problem of finding a high-quality set of near bi-cliques in a dynamic graph.
3.1. Quality of a Set of Near Bi-Cliques
In this subsection, we describe three sub-goals that any “good” set of near bi-cliques should pursue. We also define cost functions that we aim to minimize to achieve the sub-goals. The cost functions are commonly in bits so that we can directly compare and eventually balance them. To this end, given a dynamic graph and a set of near bi-cliques in it, we consider the following three sets:
-
•
Exact bi-cliques : the set of minimal exact bi-cliques that contain the near bi-cliques in . That is, .
-
•
Missing edges : the set of missing edges, which are included in exact bi-cliques in but not in .
-
•
Remaining edges : the set of remaining edges in that do not belong to any near bi-clique in .
As illustrated in Figure 2, the bi-cliques in may contain missing edges, and the near bi-cliques in are obtained by simply removing the missing edges from each bi-clique in .
Sub-goal 1. (Preciseness): The members of a “good” set of near bi-cliques should actually be close to exact bi-cliques. That is, the number of missing edges contained in the bi-cliques in should be minimized. To pursue preciseness, we aim to minimize Eq. (1), which corresponds to the number of bits to encode .
(1) |
where corresponds to the number of bits to encode one missing edge. We assume that the coordinate list (COO) format is used, and thus we list , and to encode an edge . Additionally, we assume that bits are required to encode a member of a set .
Sub-goal 2. (Exhaustiveness): The members of a “good” set of near bi-cliques should cover a large portion of . Equivalently, the number of remaining edges uncovered by any near bi-cliques in should be minimized. To achieve exhaustiveness, we aim to minimize Eq. (2), i.e., the number of bits to encode .
(2) |
Sub-goal 3. (Conciseness): Note that Eq. (1) and Eq. (2) can be trivially minimized by enumerating all edges, which are bi-cliques, in . However, larger bi-cliques are more likely to be useful in applications mentioned in Section 1, so larger ones should be preferred over trivial ones in a “good” set of near bi-cliques. To design a cost function in bits, we note that larger bi-cliques can be used to encode the edges in them more concisely with fewer bits. Specifically, all edges in a bi-clique where can be encoded together by listing the objects that form it, requiring only bits, instead of bits for encoding each edge separately. This saving in bits increases as bi-cliques become larger, as formalized in Lemma 3.1.
Lemma 0.
If a bi-clique is strictly bigger111The number of object of each type in is greater than or equal to that in , and the number of object of a type in is strictly greater than that in . than a bi-clique , then the number of bits per edge is smaller in than in , i.e.,
(3) |
Proof.
See Appendix A (Shin et al., 2021b). ∎
Based on this observation, we aim to minimize Eq. (4), which is the number of bits to encode the edges contained in the bi-cliques in , to pursue conciseness.
(4) |
where is the number of bits for encoding all edges in a bi-clique together.222Additional bits are used to encode the number of objects in . Note that in Eq. (4), edges that belong to multiple bi-cliques are encoded multiple times, increasing the cost. Thus, Eq. (4) penalizes highly-overlapped bi-cliques, aligning with our pursuit of conciseness.
Total cost and the MDL principle: Recall that the three cost functions (i.e., Eq. (1), Eq. (2), and Eq. (4)) are measured commonly in bits and thus directly comparable. Since , , and together exactly describe the input dynamic graph , as shown in Figure 2, minimizing their sum aligns with the Minimum Description Length (MDL) principle (Grünwald, 2007). Thus, the total cost for balancing our three sub-goals (i.e., preciseness, exhaustiveness, and conciseness) is defined as:
(5) |
where can be a function of and since , , and are directly obtained from and .
3.2. Problem Definition
Using the quality measure in Eq. (5), we formalize our problem as:
Problem 1 (Finding a Concise, Precise, and Exhaustive Set of Near Bi-Cliques in a Dynamic Graph).
-
(1) Given: a dynamic graph ,
-
(2) Find: near bi-cliques
-
(3) To minimize: .
4. Proposed Algorithm
In this section, we propose CutNPeel, a fast algorithm for finding a high-quality set of near bi-cliques in dynamic graphs. We first present its preliminary version Peel. Then, we present CutNPeel.
4.1. Peel: Preliminary Algorithm
In this subsection, we present Peel, a preliminary version of our proposed algorithm CutNPeel. We provide an overview of Peel and then describe how it finds a single near bi-clique in detail. After that, we analyze its complexity and lastly discuss its limitations.
4.1.1. Overview (Algorithm 1)
Peel repeatedly finds near bi-cliques one by one using a top-down search called PeelOne (line 1). Specifically, PeelOne returns objects , and the near bi-clique formed by them in the original input graph is added to the output set (line 1). Whenever Peel finds a near bi-clique, the edges in it are removed from the input graph to prevent Peel from finding the same near bi-clique (line 1), and we use to denote the dynamic graph with the remaining edges. Peel stops searching when in Eq. (6) is zero or negative for found objects (lines 1-1) and returns as the final output (line 1).
(6) |
where is the total cost when and ; and is that when and . We use in Eq. (6) to approximate , i.e., the amount of saving in the total cost due to a new near bi-clique . can be computed in time, as shown in Lemma 4.1, while providing a preciseness guarantee in Theorem 1.
Lemma 0.
Given the numbers of objects of each type and edges in and , can be computed in time.
4.1.2. Finding one near bi-clique (Algorithm 2)
We describe PeelOne, which is a subroutine of Peel for finding a single near bi-clique. Starting from the set of all objects in (line 2), PeelOne repeatedly removes an object (line 2) until an empty set is left. When choosing the object to be removed, PeelOne chooses one with the sparsest connectivity for a near bi-clique to remain. Specifically, it chooses an object that minimizes in Eq. (10) (line 2).
(10) |
where is the number of potential edges adjacent to , and is the number of existing edges among them. As the set of remaining objects changes, CutNPeel tracks in Eq. (6) (lines 2-2). Lastly, as its final output, PeelOne returns when is maximized (lines 2-2).
4.1.3. Relation with (Shin et al., 2018)
Peel is based on the top-down greedy search framework (Shin et al., 2018), which is originally designed for detecting dense tensors (i.e., multi-dimensional arrays). Peel uses the novel selection functions in Eq. (6) and Eq. (10), instead of those suggested in (Shin et al., 2018),333For example, and . and this change significantly improves the quality of detected near bi-cliques, as shown empirically in Section 5.2 and Appendix D (Shin et al., 2021b) (compare the relative cost of M-Zoom (Shin et al., 2018) in Figure 3 and that of Peel in Figure 6a in (Shin et al., 2021b)). Specifically, the selection functions suggested in (Shin et al., 2018) do not guarantee Eq. (8), and empirically, using them gives near bi-cliques that lack preciseness (see Figure 4 in Section 5.2).
4.1.4. Complexity
Theorem 2 (Time Complexity of Peel).
The time complexity of Algorithm 1 is .
Proof.
Eq. (7) and Eq. (9) imply for all . This and imply . That is, the number of detected near bi-cliques is at most . From Lemma 4.1 and the aforementioned connection with (Shin et al., 2018), it can be shown that finding one near bi-clique takes time (see Theorem 1 of (Shin et al., 2018)). Thus, the total time complexity is . ∎
Theorem 3 (Space Complexity of Peel).
The space complexity of Algorithm 1 is .
Proof.
Eq. (7) and Eq. (9) imply Eq. (11).
(11) |
where the right hand side upper bounds the number of bits required for storing . Eq.(11) and imply that storing the output requires bits. Additionally, storing and requires bits, and storing in Algorithm 2 requires bits. Thus, the total space complexity is . ∎
4.1.5. Limitations
Finding a large number of near bi-cliques using Peel suffers from high computational cost since Peel needs to access all remaining edges (i.e., ) to detect each near bi-clique. Moreover, Peel often fails to detect bi-cliques formed by objects whose connectivity outside the bi-cliques are sparse. This is because such objects are likely to be removed in the early stage of PeelOne, which prioritizes objects with dense connectivity.
4.2. CutNPeel: Proposed Algorithm
In this subsection, we present CutNPeel, our proposed algorithm for Problem 1. We first discuss the main ideas behind CutNPeel. Then, we present an outline of CutNPeel. After that, we describe its components in detail. Lastly, we compare it with Peel.
4.2.1. Main ideas
In order to address the two aforementioned limitations of Peel, CutNPeel first partitions the input dynamic graph and then finds near bi-cliques within each partition. Partitioning reduces the search space, and as a result, it reduces the computational cost for finding each near bi-clique. Partitioning is also helpful to detect bi-cliques formed by objects with sparse global connectivity (i.e., sparse connectivity outside the bi-cliques). This is because CutNPeel prioritizes objects based on their connectivity within partitions, instead of their global connectivity. By reducing the search space, partitioning may also have a harmful impact on search accuracy. In order to minimize such a harmful impact, CutNPeel uses adaptive re-partitioning while employing randomness.
4.2.2. Overview (Algorithm 3)
Given an input dynamic graph , the maximum number of iteration , and the threshold decrement rate , CutNPeel returns near bi-cliques in . As in Peel, CutNPeel starts a search from (line 3), and as it detects near bi-cliques, removes the edges belonging to the near bi-cliques from (lines 3 and 3), where denotes the graph with remaining edges. At each iteration, CutNPeel partitions into subgraphs based on an object set using Cut (line 3), which is described in detail later. Cut employs randomness so that different partitions are obtained at different iterations. Within each partition , CutNPeel repeatedly detects near bi-cliques using PeelOne (line 3) as in Peel. Since the current partitions may not be ideal, PeelOne stops finding near bi-cliques within a partition if in Eq. (6) is less than a threshold for a found near bi-clique (lines 3-3). CutNPeel terminates (line 3) if reaches 0 (line 3) or the number of iterations reaches .
4.2.3. Adaptive thresholding
The threshold decreases adaptively as iterations proceed. Specifically, the threshold at the -th iteration is set to of the maximum value, among those less than , at the -th iteration. The threshold balances exploration and exploitation. Specifically, in early iterations, CutNPeel puts more emphasis on exploration (i.e., exploring near bi-cliques in partitions in later iterations) by accepting near bi-cliques under strict conditions (i.e., with large ). In later iterations, it puts more emphasis on exploitation (i.e., finding near bi-cliques in current partitions) by accepting near bi-cliques under lenient conditions (i.e., with small ).
4.2.4. Cut for partitioning (Algorithm 4)
Below, we propose Cut, a dynamic-graph partitioning algorithm used by CutNPeel. When designing Cut, we pursue the following goals:
-
•
G1. Speed: Cut should be fast since it is performed repeatedly.
-
•
G2. Locality: Cut should locate each near bi-clique inside a partition, to make it detectable, rather than splitting it into partitions.
-
•
G3. Randomness: Cut should employ randomness so that more possibilities can be explored by repeatedly performing it.
Cut partitions a given dynamic graph based on a given object set . It first generates a uniform random -to- function for the set of all objects (line 4), bringing randomness for G3. Using the function , we compute a function (lines 4-4), based on which is partitioned (line 4). Specifically, for each object , is defined as:
where is the set of edges adjacent to and
(12) |
Our design of is inspired by min-hashing (Broder et al., 2000), and two objects in are more likely to have the same value as their connectivity to other objects are more similar. That is, Cut pursues G2 by making objects with similar connectivity be likely to belong to the same partition. Lastly, edges in are partitioned according to the partitions of their incident objects in (lines 4 and 4). Regarding G1, Cut takes linear time as formalized in Theorem 4.
Theorem 4 (Time Complexity of Cut).
The time complexity of Algorithm 4 is .
Proof.
Generating for all takes time. Computing for all takes time. Partitioning takes time, and partitioning takes time. Hence, it takes time in total. ∎
Comparison with Peel: Below, we assume for ease of explanation. For each near bi-clique, CutNPeel needs to access only the remaining edges within a partition (e.g., ), while Peel needs to access all remaining edges (i.e., ). However, a partition can be as large as the entire dynamic graph. Thus, the worst-case time complexity of CutNPeel is , as in Peel, if we assume so that performing Cut times takes time. Compared to Peel, CutNPeel additionally stores for all and for all , which take only space. Regarding preciseness, Eq. (8) still holds for every near bi-clique from CutNPeel.
5. Experiments
Name | Description | # of Objects | # of Edges |
Enron (Shetty and Adibi, 2004) | sender / receiver / time [week] | / / | |
Darpa (Lippmann et al., 2000) | Src IP / Dst IP / time [day] | / / | |
DDoS (UCSD, 2021) | Src IP / Dst IP / time [second] | / / | |
DBLP (dblp computer science bibliography, 2021) | author / venue / time [year] | / / | |
Yelp (Yelp, 2021) | user / business / time [month] | / / | |
Weeplaces (Weeplaces, 2021) | user / place / time [month] | / / |
In this section, we review our experiments to answer Q1-Q3.
-
•
Q1. Search Quality & Speed: How rapid is CutNPeel? Are identified near bi-cliques precise, exhaustive, and concise?
-
•
Q2. Scalability: How does the running time of CutNPeel scale with respect to the size of the input graph?
-
•
Q3. Application 1 - Compression: How concisely can we represent dynamic graphs using the outputs of CutNPeel?














5.1. Experiment Specifications
Machines: We used a machine with a 3.8GHz AMD Ryzen 3900X CPU and 128GB RAM for the scalability test in Section 5.3 and a machine with 3.7GHz i9-10900K CPU and 64GB RAM for the others.
Datasets: We used six real-world dynamic graph datasets that are briefly described in Table 2.
Competitors and parameters: We implemented CutNPeel in Java and set and unless otherwise stated. We compared CutNPeel with four competitors: TimeCrunch (Shah et al., 2015), Com2 (Araujo et al., 2014), M-Zoom (Shin et al., 2018), and D-Cube (Shin et al., 2021a). For the competitors, we used the official implementations provided by the authors, which are in Java (Com2, M-Zoom, and D-Cube) or MATLAB (TimeCrunch). For TimeCrunch, we used the step-wise heuristic, which performed best in (Shah et al., 2015), and excluded chain subgraphs, which cannot be considered as bi-cliques, from its outputs. For D-Cube, we set , which performed best in (Shin et al., 2021a). For M-Zoom, and D-Cube, we used the density function that minimized the cost function on each dataset. For all competitors, we report the results at the iteration when the cost function was minimized.
5.2. Q1. Search Quality & Speed
In this subsection, we compare the considered algorithms in terms of speed and the quality of output near bi-cliques.
Evaluation metric: For the quality, we use the total cost relative to the encoding cost of the input dynamic graph, i.e.,
(13) |
where the denominator is the number of bits to individually encode every edge in the input graph . Since the denominator is a constant for a given , Eq. (13) is proportional to the total cost . Thus, as explained in Section 3.1, the lower Eq. (13) is, the more concise, precise, and exhaustive detected near bi-cliques are.
Summary of comparison: As seen in Figure 3, CutNPeel gave near bi-cliques with the highest quality in all considered datasets. Specifically, the near bi-cliques detected by CutNPeel had up to better quality in terms of the relative cost. Moreover, CutNPeel was up to faster than TimeCrunch, which detected near bi-cliques with the second highest quality in most cases.
Detailed analysis: In Figure 5, we divide the relative cost in Eq. (13) into three portions related to preciseness (i.e., in Eq. (1)), exhaustiveness (i.e., in Eq. (2)), and conciseness (i.e., in Eq. (4)), respectively. The total cost was lowest in CutNPeel, and the near bi-cliques detected by it were especially superior in exhaustiveness and preciseness. While in CutNPeel was relatively high with many near bi-cliques, individual near bi-cliques detected by CutNPeel were precise with few missing edges, as shown in Figure 4, where we compare the size and preciseness of the near bi-cliques detected by the considered algorithms. Noticeably, the near bi-cliques detected by CutNPeel are precise with extremely small ratios of missing edges, and their count is tractable and especially much smaller than the number of edges in the input graph. M-Zoom and D-Cube tended to detect fewer larger subgraphs with higher ratios of missing edges than the other algorithms.
5.3. Q2. Scalability
In this subsection, we test the scalability of CutNPeel. To this end, we created Erdős-Rényi random graphs with various sizes and measured the running time of CutNPeel on them. The number of objects of each type was of the number of edges, and the largest graph had edges. As seen in Figure 1b in Section 1, the running time of CutNPeel scaled near linearly with the number of edges and objects in the input dynamic graph.
5.4. Q3. Application: Compression
In this subsection, we show a successful application of CutNPeel to lossless graph compression. We consider Com2 and TimeCrunch as competitors, which are, to the best of our knowledge, the only lossless compression methods for the entire history (rather than the current snapshot (Ko et al., 2020)) of a dynamic graph.
The relative cost in Eq. (13) is the ratio of the number of bits for encoding the input graph losslessly using detected near bi-cliques and the number of bits for encoding each edge separately. Thus, it can naturally be interpreted as the compression rate. Alternatively, the input graph can be encoded using near bi-cliques as suggested in (Araujo et al., 2014), and the compression rate can be computed accordingly. Thus, we measured it and Eq. (13) for all considered methods. Additionally, for TimeCrunch, we measured the compression rate in (Shah et al., 2015), which is based on an encoding method not applicable to the others.
As seen in Table 3, CutNPeel achieved the best compression in all considered datasets. Specifically, CutNPeel achieved up to better compression rates than the second best method.
Extra experiments: In the supplementary document (Shin et al., 2021b), we review extra experiments for Q4-Q6:
-
•
Q4. Ablation Study: How much do adaptive thresholds and partitioning contribute to the performance of CutNPeel?
-
•
Q5. Parameter Analysis: How do the decrement rate and the iteration number affect the performance of CutNPeel?
-
•
Q6. Application 2 - Pattern Discovery: What interesting patterns does CutNPeel detect in real-world data?




6. Related Work
In this section, we review studies of finding dense (near) (bi-)cliques in static and dynamic graphs.
6.1. Finding (near) (bi-)cliques in static graphs
Below, we focus on finding dense subgraphs in static graphs.
Enumerating exact (bi-)cliques: Finding (bi-)cliques (i.e., complete subgraphs), especially, enumerating all maximal (bi-)cliques, which are not a subset of any other clique, has been extensively studied (Makino and Uno, 2004; Eppstein et al., 2010; Tsukiyama et al., 1977; Chiba and Nishizeki, 1985; Alexe et al., 2004; Liu et al., 2006; Kloster et al., 2019; Dias et al., 2005; Quine, 1955). The problem of enumerating all maximal (bi-)cliques was shown to be NP-hard (Moon and Moser, 1965; Eppstein, 1994).
Enumerating near (bi-)cliques: Considerable attention has been paid to near (bi-)cliques with constraints on minimum degree (Batagelj and Zaversnik, 2003), edge density (Abello et al., 2002; Uno, 2010), and diameter (Balasundaram et al., 2011; Moradi and Balasundaram, 2018). Notably, -cores (Batagelj and Zaversnik, 2003) for all can be found in linear time (Batagelj and Zaversnik, 2003), while enumerating (maximal) quasi-cliques (Uno, 2010), (maximal) k-clubs (Moradi and Balasundaram, 2018), or (maximal) k-plexes (McClosky and Hicks, 2012; Zhou et al., 2020) is NP-hard. Similarly, near bi-cliques include a subgraph where (a) every node in one node set is adjacent to at least of the nodes in the other node set (Mishra et al., 2004), (b) every node in both node sets is adjacent to at least of the nodes in the counterpart node set (Bu et al., 2003), and (c) every node in both node sets is not adjacent to at most nodes in the counterpart node set (Sim et al., 2009). Regardless of the definitions, enumerating (maximal) near bi-cliques is NP-hard since their number can be exponential in the number of nodes.
Finding a partial set of near (bi-)cliques: There also have been extensive studies on finding one or a predefined number of near (bi-)cliques. A representative example is to find a subgraph that maximizes average degree (Goldberg, 1984; Khuller and Saha, 2009; Charikar, 2000) or the ratio between the number of contained (bi-)cliques and the number of nodes (Tsourakakis, 2015; Mitzenmacher et al., 2015). VoG (Koutra et al., 2014) searches for a partial set of near (bi-)cliques and chain subgraphs by which the input graph is summarized.
6.2. Finding near bi-cliques in dynamic graphs
Below, we focus on finding dense subgraphs and their occurrences over time in dynamic graphs.
Finding a predefined number of near bi-cliques: Shin et al. (Shin et al., 2018, 2021a) and Jiang et al. (Jiang et al., 2015) studied the problem of finding a given number of dense subtensors (Shin et al., 2018, 2021a; Jiang et al., 2015) in a tensor (i.e., a multidimensional array). For example, starting from the entire tensor, M-Zoom (Shin et al., 2018) and D-Cube (Shin et al., 2021a) greedily remove entries so that one of the proposed density measures is minimized. Since a dynamic graph is naturally expressed as a 3-way tensor, the above algorithms can be used for identifying dense subgraphs in a dynamic graph.
Summarizing dynamic graphs using near bi-cliques: Com2 (Araujo et al., 2014) and TimeCrunch (Shah et al., 2015) aim to concisely describe the input dynamic graph using dense subgraphs, including near bi-cliques. Com2 uses rank-1 CP deomposition to obtain the score (i.e., corresponding value of the factor matrices) of each object and sorts the objects of each type based on the score. Objects are added one by one to a near bi-clique greedily until the description length does not decrease, and the above step is repeated for detecting multiple near bi-cliques. TimeCrunch aims to identify (near) (bi-)cliques and chain subgraphs. To this end, TimeCrunch decomposes the input graph at each timestamp into such subgraphs using SlashBurn (Lim et al., 2014), and “stitch” some of the found ones over timestamps based on the description length. Among these candidates, some are selected using multiple heuristics.
Metric* | Method | Compression Rates in % (the lower, the better) | |||||
Enron | Darpa | DDoS | DBLP | Yelp | Weeplaces | ||
Eq. (13) | CutNPeel | 52.5 | 32.3 | 13.1 | 53.7 | 77.1 | 55.9 |
TimeCrunch | 87.7 | 37.2 | 31.5 | 74.2 | 100.6 | o.o.t. | |
Com2 | 95.5 | 100 | o.o.t. | 99.9 | o.o.t. | o.o.t. | |
(Shah et al., 2015)** | TimeCrunch | 86.3 | 36.7 | 16.3 | 79.7 | 100.0 | o.o.t. |
(Araujo et al., 2014) | CutNPeel | 48.3 | 27.1 | 8.0 | 52.4 | 78.0 | 52.4 |
TimeCrunch | 79.3 | 34.6 | 22.8 | 63.9 | 100.2 | o.o.t. | |
Com2 | 58.3 | 202.8 | o.o.t. | 57.5 | o.o.t. | o.o.t. | |
* Different metrics are based on different encoding methods. | |||||||
** The encoding method used in (Shah et al., 2015) is not applicable to CutNPeel and Com2. |
7. Conclusion
In this work, we consider the problem of finding a concise, precise, and exhaustive set of near bi-cliques in a dynamic graph. We formulate the problem as an optimization problem whose objective combines the three aspects (i.e., conciseness, preciseness, and exhaustiveness) in a systematic way based on the MDL principle. Our algorithmic contribution is to design CutNPeel for the problem. Compared to a widely-used top-down greedy search, CutNPeel reduces the search space and at the same time improves search accuracy, through a novel adaptive re-partitioning scheme. We summarize the strengths of CutNPeel as follows:
Acknowledgements: This work was supported by Samsung Electronics Co., Ltd., National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).
References
- (1)
- Abello et al. (2002) James Abello, Mauricio GC Resende, and Sandra Sudarsky. 2002. Massive quasi-clique detection. In LATIN. Springer.
- Alexe et al. (2004) Gabriela Alexe, Sorin Alexe, Yves Crama, Stephan Foldes, Peter L Hammer, and Bruno Simeone. 2004. Consensus algorithms for the generation of all maximal bicliques. Discrete Applied Mathematics 145, 1 (2004), 11–21.
- Araujo et al. (2014) Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos E Papalexakis, and Danai Koutra. 2014. Com2: fast automatic discovery of temporal (‘comet’) communities. In PAKDD.
- Balasundaram et al. (2011) Balabhaskar Balasundaram, Sergiy Butenko, and Illya V Hicks. 2011. Clique relaxations in social network analysis: The maximum k-plex problem. Operations Research 59, 1 (2011), 133–142.
- Batagelj and Zaversnik (2003) Vladimir Batagelj and Matjaz Zaversnik. 2003. An O (m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049 (2003).
- Broder et al. (2000) Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. 2000. Min-wise independent permutations. JCSS 60, 3 (2000), 630–659.
- Bu et al. (2003) Dongbo Bu, Yi Zhao, Lun Cai, Hong Xue, Xiaopeng Zhu, Hongchao Lu, Jingfen Zhang, Shiwei Sun, Lunjiang Ling, Nan Zhang, et al. 2003. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Research 31, 9 (2003), 2443–2450.
- Charikar (2000) Moses Charikar. 2000. Greedy approximation algorithms for finding dense components in a graph. In APPROX.
- Chiba and Nishizeki (1985) Norishige Chiba and Takao Nishizeki. 1985. Arboricity and subgraph listing algorithms. SIAM Journal on computing 14, 1 (1985), 210–223.
- Das and Tirthapura (2018) Apurba Das and Srikanta Tirthapura. 2018. Incremental maintenance of maximal bicliques in a dynamic bipartite graph. TMSCS 4, 3 (2018), 231–242.
- dblp computer science bibliography (2021) The dblp computer science bibliography. 2021. DBLP Data. https://dblp.uni-trier.de/db/
- Dias et al. (2005) Vânia MF Dias, Celina MH De Figueiredo, and Jayme L Szwarcfiter. 2005. Generating bicliques of a graph in lexicographic order. Theoretical Computer Science 337, 1-3 (2005), 240–248.
- Eppstein (1994) David Eppstein. 1994. Arboricity and bipartite subgraph listing algorithms. Inform. Process. Lett. 51, 4 (1994), 207–211.
- Eppstein et al. (2010) David Eppstein, Maarten Löffler, and Darren Strash. 2010. Listing all maximal cliques in sparse graphs in near-optimal time. In ISAAC.
- Galbrun (2020) Esther Galbrun. 2020. The minimum description length principle for pattern mining: A survey. arXiv preprint arXiv:2007.14009 (2020).
- Goldberg (1984) Andrew V Goldberg. 1984. Finding a maximum density subgraph. University of California Berkeley.
- Grünwald (2007) Peter D Grünwald. 2007. The minimum description length principle. MIT press.
- Jiang et al. (2015) Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, and Christos Faloutsos. 2015. A general suspiciousness metric for dense blocks in multimodal data. In ICDM.
- Khuller and Saha (2009) Samir Khuller and Barna Saha. 2009. On finding dense subgraphs. In ICALP.
- Kloster et al. (2019) Kyle Kloster, Blair D Sullivan, and Andrew van der Poel. 2019. Mining maximal induced bicliques using odd cycle transversals. In SDM.
- Ko et al. (2020) Jihoon Ko, Yunbum Kook, and Kijung Shin. 2020. Incremental Lossless Graph Summarization. In KDD.
- Koutra et al. (2014) Danai Koutra, U Kang, Jilles Vreeken, and Christos Faloutsos. 2014. VoG: Summarizing and understanding large graphs. In SDM.
- Kumar et al. (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. 1999. Trawling the web for emerging cyber-communities. Computer Networks 31, 11-16 (1999), 1481–1493.
- Lim et al. (2014) Yongsub Lim, U Kang, and Christos Faloutsos. 2014. Slashburn: Graph compression and mining beyond caveman communities. TKDE 26, 12 (2014), 3077–3089.
- Lippmann et al. (2000) Richard P Lippmann, David J Fried, Isaac Graf, Joshua W Haines, Kristopher R Kendall, David McClung, Dan Weber, Seth E Webster, Dan Wyschogrod, Robert K Cunningham, et al. 2000. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In DISCEX.
- Liu et al. (2006) Guimei Liu, Kelvin Sim, and Jinyan Li. 2006. Efficient mining of large maximal bicliques. In DaWaK.
- Makino and Uno (2004) Kazuhisa Makino and Takeaki Uno. 2004. New algorithms for enumerating all maximal cliques. In SWAT.
- McClosky and Hicks (2012) Benjamin McClosky and Illya V Hicks. 2012. Combinatorial algorithms for the maximum k-plex problem. Journal of Combinatorial Optimization 23, 1 (2012), 29–49.
- Mishra et al. (2004) Nina Mishra, Dana Ron, and Ram Swaminathan. 2004. A new conceptual clustering framework. Machine Learning 56, 1-3 (2004), 115–151.
- Mitzenmacher et al. (2015) Michael Mitzenmacher, Jakub Pachocki, Richard Peng, Charalampos Tsourakakis, and Shen Chen Xu. 2015. Scalable large near-clique detection in large-scale networks via sampling. In KDD.
- Moon and Moser (1965) John W Moon and Leo Moser. 1965. On cliques in graphs. Israel journal of Mathematics 3, 1 (1965), 23–28.
- Moradi and Balasundaram (2018) Esmaeel Moradi and Balabhaskar Balasundaram. 2018. Finding a maximum k-club using the k-clique formulation and canonical hypercube cuts. Optimization Letters 12, 8 (2018), 1947–1957.
- Quine (1955) Willard V Quine. 1955. A way to simplify truth functions. The American mathematical monthly 62, 9 (1955), 627–631.
- Sanderson et al. (2003) Michael J Sanderson, Amy C Driskell, Richard H Ree, Oliver Eulenstein, and Sasha Langley. 2003. Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Molecular biology and evolution 20, 7 (2003), 1036–1042.
- Shah et al. (2015) Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, and Christos Faloutsos. 2015. Timecrunch: Interpretable dynamic graph summarization. In KDD.
- Shetty and Adibi (2004) Jitesh Shetty and Jafar Adibi. 2004. The Enron email dataset database schema and brief statistical report. Information sciences institute technical report, University of Southern California 4, 1 (2004), 120–128.
- Shin et al. (2021b) Hyeonjeong Shin, Taehyung Kwon, Neil Shah, and Kijung Shin. 2021b. Finding a Concise, Precise, and Exhaustive Set of Near Bi-Cliques in Dynamic Graphs (Supplementary Document). https://github.com/hyeonjeong1/cutnpeel
- Shin et al. (2018) Kijung Shin, Bryan Hooi, and Christos Faloutsos. 2018. Fast, accurate, and flexible algorithms for dense subtensor mining. TKDD 12, 3 (2018), 1–30.
- Shin et al. (2017) Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. 2017. Densealert: Incremental dense-subtensor detection in tensor streams. In KDD.
- Shin et al. (2021a) Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. 2021a. Detecting Group Anomalies in Tera-Scale Multi-Aspect Data via Dense-Subtensor Mining. Frontiers in Big Data 3 (2021), 58.
- Sim et al. (2009) Kelvin Sim, Jinyan Li, Vivekanand Gopalkrishnan, and Guimei Liu. 2009. Mining maximal quasi-bicliques: Novel algorithm and applications in the stock market and protein networks. Statistical Analysis and Data Mining: The ASA Data Science Journal 2, 4 (2009), 255–273.
- Tsourakakis (2015) Charalampos Tsourakakis. 2015. The k-clique densest subgraph problem. In WWW.
- Tsukiyama et al. (1977) Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and Isao Shirakawa. 1977. A new algorithm for generating all the maximal independent sets. SIAM J. Comput. 6, 3 (1977), 505–517.
- UCSD (2021) The CAIDA UCSD. 2021. DDoS Attack 2007. https://www.caida.org/catalog/datasets/ddos-20070804_dataset
- Uno (2010) Takeaki Uno. 2010. An efficient algorithm for solving pseudo clique enumeration problem. Algorithmica 56, 1 (2010), 3–16.
- Weeplaces (2021) Weeplaces. 2021. Weeplaces Data. https://www.yongliu.org/datasets.html
- Yelp (2021) Yelp. 2021. Yelp Data. https://www.kaggle.com/yelp-dataset/yelp-dataset
- Zhou et al. (2020) Yi Zhou, Jingwei Xu, Zhenyu Guo, Mingyu Xiao, and Yan Jin. 2020. Enumerating maximal k-plexes with worst-case time guarantee. In AAAI.