
RECEIPT: REfine CoarsE-grained IndePendent Tasks for
Parallel Tip decomposition of Bipartite Graphs

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California
klakhoti, rajgopak, [email protected]

Cesar A. F. De Rose
School of Technology
Pontifical Catholic University of Rio Grande do Sul
[email protected]
Abstract.

Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks, etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies ((2,2)-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update (peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore the 2-hop neighborhood renders tip decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum-butterfly vertices makes derived parallel algorithms prone to heavy synchronization.

In this paper, we propose a novel parallel tip-decomposition algorithm – REfine CoarsE-grained IndePendent Tasks (RECEIPT) – that relaxes the peeling-order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and a dramatic reduction in synchronization. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce wedge exploration and execution time.

We perform detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms – achieving up to 1100× and 64× reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1× self-relative speedup. Our implementation of RECEIPT is available at https://github.com/kartiklakhotia/RECEIPT.

Graph Decomposition, Bipartite Graph, Butterfly, Parallel Graph Analytics, Nucleus Decomposition

1. Introduction

Dense subgraph mining is a fundamental problem used for anomaly detection, spam filtering, social network analysis, trend summarizing and several other real-world applications (anomalyDet, ; spamDet, ; communityDet, ; yang2018mining, ; otherapp1, ; otherapp2, ). Many of these modern day applications use bipartite graphs to effectively represent two-way relationship structures, such as consumer-product purchase history (consumerProduct, ), user-group memberships in social networks (orkut, ), author-paper networks (authorPaper, ), etc. Consequently, mining cohesive structures in bipartite graphs has gained tremendous interest in recent years (wangButterfly, ; wangBitruss, ; zouBitruss, ; sariyucePeeling, ; shiParbutterfly, ).

Many techniques have been developed to uncover hierarchical dense structures in unipartite graphs, such as truss and core decomposition (spamDet, ; graphChallenge, ; trussVLDB, ; sariyuce2016fast, ; coreVLDB, ; coreVLDBJ, ; bonchi2019distance, ; wen2018efficient, ). Such off-the-shelf analytics can be conveniently utilized for discovering dense parts in projections of bipartite graphs as well (projection, ). However, this approach results in a loss of information and a blowup in the size of the projection graphs (sariyucePeeling, ). To prevent these issues, researchers have explored the role of butterflies ((2,2)-bicliques) to mine dense subgraphs directly in bipartite networks (sariyucePeeling, ; zouBitruss, ). A butterfly is the most basic unit of cohesion in bipartite graphs. Recently, Sariyuce et al. conceptualized the k-tip as a vertex-induced subgraph with at least k butterflies incident on every vertex in one of the bipartite vertex sets (fig.1). They show that k-tips can unveil hierarchical dense regions in bipartite graphs more effectively than unipartite approaches applied on projection graphs. As a space-efficient representation of the k-tip hierarchy, Sariyuce et al. further define the notion of the tip number of a vertex u as the largest k for which a k-tip contains u. In this paper, we study the problem of finding tip numbers of vertices in a bipartite graph, also known as tip decomposition.

Figure 1. Tip decomposition of vertex set U in a bipartite graph. u_4 and u_1 participate in 1 and 2 butterflies, respectively. Although u_3 participates in 5 butterflies in the original graph, only 3 of them are with u_2, with which it creates a 3-tip.

Tip decomposition can be employed in several real-world applications that utilize dense subgraphs. It can find groups of researchers (along with group hierarchies) with common affiliations from author-paper networks (sariyucePeeling, ). It can be used to detect communities of spam reviewers from user-rating graphs in databases of e-commerce companies; such reviewers collaboratively rate selected products, appearing in close-knit structures (mukherjee2012spotting, ; fei2013exploiting, ) that tip decomposition can unveil. It can also be used for document clustering, graph summarization and link prediction, as dense k-tips unveil groups of nodes with connections to common and similar sets of neighbors (dhillon2001co, ; leicht2006vertex, ; navlakha2008graph, ; communityDet, ).

Existing sequential and parallel tip decomposition algorithms (sariyucePeeling, ; shiParbutterfly, ) employ a bottom-up peeling approach. Vertices with the minimum butterfly count are peeled (deleted), the count of their 2-hop neighbors with shared butterflies is decremented, and the process is then iterated. However, exploring the 2-hop neighborhood of every vertex requires traversing a large number of wedges, rendering tip decomposition computationally intensive. For example, the TrU dataset has 140 million edges but peeling it requires traversing 211 trillion wedges, which is intractable for a sequential algorithm. For such high-complexity analytics, parallel computing is often used to scale to large datasets (Park_2016, ; smith2017truss, ; 10.1145/3299869.3319877, ). In the case of tip decomposition, the bottom-up peeling approach used by the existing parallel algorithms severely restricts their scalability. It mandates constructing levels of the k-tip hierarchy in non-decreasing order of tip numbers, thus imposing sequential restrictions on the order of vertex peeling. Note that the parallel threads need to synchronize at the end of every peeling iteration. Further note that the range of tip numbers can be quite large and the number of iterations required to peel all vertices with a given tip number is variable. Taken together, the conventional approach of parallelizing the workload within each iteration requires heavy synchronization, rendering it ineffective. For example, ParButterfly experiences ≈1.5 million synchronization rounds for peeling TrU (shiParbutterfly, ). These observations motivate the need for an algorithm that exploits the parallelism available across multiple levels of a k-tip hierarchy to reduce the amount of synchronization.

In this paper, we propose the REfine CoarsE-grained IndePendent Tasks (RECEIPT) algorithm, which adopts a novel two-step approach to drastically reduce the number of parallel peeling iterations and, in turn, the amount of synchronization. The key insight that drives the development of RECEIPT is that the tip number θ_u of a vertex u only depends on the butterflies shared between u and other vertices with tip numbers greater than or equal to θ_u. Thus, vertices with smaller tip numbers can be peeled in any order without affecting the correctness of the θ_u computation. To this end, RECEIPT divides tip decomposition into a two-step computation where each step exposes parallelism across a different dimension.

In the first step, it creates a few non-overlapping ranges that represent lower and upper bounds on the vertices' tip numbers. To find vertices with a lower bound θ, all vertices with upper bound smaller than θ can be peeled simultaneously, providing sufficient vertex workload per parallel peeling iteration. The small number of ranges ensures little synchronization in this step. In the second step, RECEIPT concurrently processes vertex subsets corresponding to different ranges to compute the exact tip numbers (a vertex subset for a given range is peeled by a single thread in the second step). The absence of overlap between tip number ranges enables each of these subsets to be peeled independently of other vertices in the graph.

RECEIPT’s two-step approach further enables development of novel optimizations that radically decrease the amount of wedge exploration. Equipped with these optimizations, we combine both computational efficiency and parallel performance for fast decomposition of large bipartite graphs. Overall, our contributions can be summarized as follows:

  (1) We propose a novel REfine CoarsE-grained IndePendent Tasks (RECEIPT) algorithm for tip decomposition in bipartite graphs. RECEIPT is the first algorithm that parallelizes workload across the levels of the subgraph hierarchy created by tip decomposition.

  (2) We show that RECEIPT is theoretically work efficient and dramatically reduces thread synchronization. As an example, it incurs only 1335 synchronization rounds while processing TrU, which is 1105× less than the existing parallel algorithms.

  (3) We develop novel optimizations enabled by the two-step approach of RECEIPT. These optimizations drastically reduce the amount of wedge exploration and improve the computational efficiency of RECEIPT. For instance, we traverse 3297B wedges to tip decompose TrU, which is 64× less than the state-of-the-art.

We conduct detailed experiments using some of the largest public real-world bipartite graphs. RECEIPT extends the limits of current practice by feasibly computing tip decomposition for these datasets. For example, it can process the TrU graph in 46 minutes whereas the state-of-the-art does not finish in 10 days. Using 36 threads, we achieve up to 17.1× parallel speedup.

2. Background

In this section, we will discuss state-of-the-art algorithms for counting per-vertex butterflies (sec.2.1). Counting is used to initialize the support (running count of incident butterflies) of each vertex during tip-decomposition, and also constitutes a crucial optimization in RECEIPT. Hence, it is imperative to analyze counting algorithms.

We will also discuss the bottom-up peeling approach for tip decomposition used by the existing algorithms (sec.2.2). Table 1 lists the notations used in this paper. Note that we decompose either the U or the V vertex set at a time. For clarity of description, we assume that U is the primary vertex set to process, and we use the word "wedge" to imply a wedge with endpoints in U. However, for empirical analysis, we will individually tip decompose both vertex sets of each graph dataset.

Table 1. Frequently used notations
G(W=(U,V), E) – bipartite graph G with vertex set W and edge set E; U and V are the two disjoint vertex sets in G
n, m – number of vertices and edges in G; n = |W|, m = |E|
α – arboricity of G (chibaArboricity, )
N_u – neighboring vertices of u
d_u – degree of vertex u; d_u = |N_u|
⋈_u / ⋈_U – support (# butterflies) of u / of vertices in set U
⋈_{u1,u2} = ⋈_{u2,u1} – # butterflies shared between u1 and u2
∧_U – # wedges with endpoints in set U
θ_u – tip number of vertex u
θ^max – maximum tip number of a vertex
P – number of vertex subsets created by RECEIPT
T – number of threads

2.1. Per-vertex butterfly counting

A butterfly ((2,2)-biclique/quadrangle) is a combination of two wedges with common endpoints. A simple way to count butterflies is to explore all wedges and combine the ones with common endpoints. However, counting per-vertex butterflies using this procedure is extremely expensive, with $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v\right)$ complexity (if we use vertices in U as endpoints).

Chiba and Nishizeki (chibaArboricity, ) proposed an efficient vertex-priority quadrangle counting algorithm which traverses $\mathcal{O}\left(\sum_{(u,v)\in E}\min(d_u,d_v)\right)=\mathcal{O}\left(\alpha\cdot m\right)$ wedges with $\mathcal{O}(1)$ work per wedge. Wang et al. (wangButterfly, ) further propose a cache-efficient version of the vertex-priority algorithm that reorders the vertices in decreasing order of degree. Their algorithm only traverses wedges where the degree of the last vertex is greater than the degrees of the start and middle vertices. It can be easily parallelized by concurrently processing multiple start vertices (shiParbutterfly, ; wangButterfly, ), as shown in alg.1. In RECEIPT, we adopt this parallel variant for per-vertex counting by adding the contributions from traversed wedges to the start, middle and end points. Each thread is provided a $\Theta(|W|)$-sized array for wedge aggregation (line 5, alg.1). This is similar to the best-performing batch aggregation mode in the ParButterfly framework (shiParbutterfly, ).

Algorithm 1 Counting per-vertex butterflies (pvBcnt)
1: Input: Bipartite Graph G(W=(U,V), E)
2: Output: Butterfly counts ⋈_u ∀ u ∈ W
3: Relabel vertices in G in descending order of degree
4: for each u ∈ U ∪ V do in parallel
5:     Sort N_u in ascending order of new labels
6:     ⋈_u ← 0
7: for each sp ∈ U ∪ V do in parallel
8:     Initialize wdg_arr array to all zeros
9:     nze ← {ϕ}, nzw ← {ϕ}
10:    for each mp ∈ N_sp
11:        for each ep ∈ N_mp
12:            if (ep ≥ mp) or (ep ≥ sp) then break
13:            wdg_arr[ep] ← wdg_arr[ep] + 1
14:            nze ← nze ∪ {ep}, nzw ← nzw ∪ {(mp, ep)}
15:    for each u ∈ nze ▷ same-side contribution
16:        bcnt ← (wdg_arr[u] choose 2)
17:        Atomically add bcnt to ⋈_u and ⋈_sp
18:    for each (u, v) ∈ nzw ▷ opposite-side contribution
19:        bcnt ← wdg_arr[v] − 1
20:        Atomically add bcnt to ⋈_u
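For concreteness, the following is a minimal sequential C++ sketch of the counting scheme of alg.1. The adjacency representation, the assumption that vertices have already been relabeled in descending order of degree (smaller id = higher degree, adjacency lists sorted ascending), and all identifiers are illustrative choices rather than the paper's actual implementation; the parallel variant would process start vertices concurrently with per-thread wedge arrays and atomic additions.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Per-vertex butterfly counting (sequential sketch of alg.1).
// adj[w]: neighbors of vertex w (both bipartite sides share one id space).
// cnt[w]: output, number of butterflies incident on w.
void pvBcnt(const std::vector<std::vector<int>>& adj,
            std::vector<uint64_t>& cnt) {
    const int n = static_cast<int>(adj.size());
    cnt.assign(n, 0);
    std::vector<uint64_t> wdg(n, 0);              // wedges keyed by endpoint
    std::vector<int> nze;                         // endpoints seen for this sp
    std::vector<std::pair<int, int>> nzw;         // (mid, end) wedge list
    for (int sp = 0; sp < n; ++sp) {              // wedge start point
        nze.clear(); nzw.clear();
        for (int mp : adj[sp]) {                  // wedge mid point
            for (int ep : adj[mp]) {              // wedge end point
                if (ep >= mp || ep >= sp) break;  // vertex-priority pruning
                if (wdg[ep]++ == 0) nze.push_back(ep);
                nzw.emplace_back(mp, ep);
            }
        }
        for (int ep : nze) {                      // same-side contribution
            uint64_t b = wdg[ep] * (wdg[ep] - 1) / 2;   // (wdg[ep] choose 2)
            cnt[ep] += b;
            cnt[sp] += b;
        }
        for (const auto& p : nzw)                 // opposite-side contribution
            cnt[p.first] += wdg[p.second] - 1;
        for (int ep : nze) wdg[ep] = 0;           // reset dense wedge array
    }
}
```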

2.2. Tip Decomposition

Sariyuce et al. (sariyucePeeling, ) introduced k-tips to identify vertex-induced subgraphs with a large number of butterflies. They formally define them as follows:

Definition 1.

A bipartite subgraph H = (U′, V, E) ⊆ G, induced on U′, is a k-tip iff

  • each vertex u ∈ U′ participates in at least k butterflies,

  • each vertex pair (u, u′) ∈ U′ is connected by a series of butterflies,

  • H is maximal, i.e., no other k-tip subsumes H.

k-tips are hierarchical – a k-tip overlaps with k′-tips for all k′ ≤ k. Therefore, storing all k-tips is inefficient and often infeasible. The tip number θ_u is defined as the maximum k for which u is present in a k-tip. Tip numbers provide a space-efficient representation of the k-tip hierarchy with quick retrieval. In this paper, we study the problem of finding tip numbers, known as tip decomposition.

Algorithms in current practice use Bottom-Up Peeling (BUP) for tip decomposition, as shown in alg.2. It initializes vertex support using per-vertex butterfly counting, and then iteratively peels the vertices with minimum support until no vertex remains. When a vertex u ∈ U is peeled, its support at that instant is recorded as its tip number θ_u. Further, for every vertex u′ with ⋈_{u,u′} > 0 shared butterflies, the support of u′ is decreased by ⋈_{u,u′} (never dropping below θ_u). Thus, tip numbers are assigned in a non-decreasing order. The complexity of bottom-up peeling (alg.2), dominated by wedge traversal (lines 8-10), is $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v\right)$.

Algorithm 2 Tip decomposition using bottom-up peeling (BUP)
1: Input: Bipartite graph G(W=(U,V), E)
2: Output: Tip numbers θ_u ∀ u ∈ U
3: {⋈_U, ⋈_V} ← pvBcnt(G) ▷ Initial count
4: while U ≠ {ϕ} do ▷ Peel
5:     let u ∈ U be the vertex with minimum support ⋈_u
6:     θ_u ← ⋈_u,  U ← U \ {u}
7:     update(u, θ_u, ⋈_U, G)
8: function update(u, θ_u, ⋈_U, G)
9:     Initialize hashmap wdg_arr to all zeros
10:    for each v ∈ N_u
11:        for each u′ ∈ N_v \ {u}
12:            wdg_arr[u′] ← wdg_arr[u′] + 1
13:    for each u′ ∈ wdg_arr ▷ Update Support
14:        ⋈_{u,u′} ← (wdg_arr[u′] choose 2) ▷ shared butterflies
15:        ⋈_{u′} ← max{θ_u, ⋈_{u′} − ⋈_{u,u′}}
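Below is a compact sequential C++ sketch of BUP, assuming the initial supports come from a per-vertex counting routine such as the pvBcnt sketch above. The ordered-set extraction of the minimum-support vertex and all identifiers are illustrative choices, not the implementation evaluated in this paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Sequential bottom-up peeling (sketch of alg.2) for the U side.
// adjU[u]: V-neighbors of u; adjV[v]: U-neighbors of v.
// sup[u]: initial butterfly count of u (e.g., from pvBcnt).
std::vector<uint64_t> bottomUpPeel(const std::vector<std::vector<int>>& adjU,
                                   const std::vector<std::vector<int>>& adjV,
                                   std::vector<uint64_t> sup) {
    const int nU = static_cast<int>(adjU.size());
    std::vector<uint64_t> tip(nU, 0);
    std::vector<bool> peeled(nU, false);
    std::set<std::pair<uint64_t, int>> order;          // (support, vertex)
    for (int u = 0; u < nU; ++u) order.insert({sup[u], u});

    while (!order.empty()) {
        int u = order.begin()->second;                 // minimum-support vertex
        uint64_t theta = order.begin()->first;
        order.erase(order.begin());
        tip[u] = theta;
        peeled[u] = true;

        std::map<int, uint64_t> wdg;                   // wedges shared with u
        for (int v : adjU[u])
            for (int up : adjV[v])
                if (up != u && !peeled[up]) ++wdg[up];

        for (const auto& kv : wdg) {                   // propagate updates
            int up = kv.first;
            uint64_t shared = kv.second * (kv.second - 1) / 2;  // shared butterflies
            order.erase({sup[up], up});
            uint64_t dec = (sup[up] > shared) ? sup[up] - shared : 0;
            sup[up] = std::max(theta, dec);            // floor at theta
            order.insert({sup[up], up});
        }
    }
    return tip;
}
```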

2.2.1. Challenges

Tip decomposition is computationally very expensive, and parallel computing is widely used to accelerate such workloads. However, the state-of-the-art parallel tip decomposition framework ParButterfly (shiParbutterfly, ; julienne, ) only utilizes parallelism within each peeling iteration by concurrently peeling all vertices with the minimum support value. This restrictive approach is adopted due to the following sequential dependency between iterations – support updates computed in an iteration guide the choice of vertices to peel in the subsequent iterations. As shown in (shiParbutterfly, ), ParButterfly is work-efficient with a complexity of $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v+\rho_v\log^{2}m\right)$, where $\rho_v$ is the number of peeling iterations. However, its scalability is limited in practice due to the following drawbacks:

  (1) Alg. 2 incurs a large number of iterations and low workload per individual iteration. The resulting synchronization and load imbalance render simple parallelization ineffective for acceleration.
    Objective 1 – Design an efficient parallel algorithm for tip decomposition that reduces thread synchronizations.

  (2) Alg. 2 explores all wedges with endpoints in U. This is computationally expensive and can make it infeasible, even for a scalable parallel algorithm, to execute in reasonable time.
    Objective 2 – Reduce the amount of wedge traversal.

3. REfine CoarsE-grained IndePendent Tasks Algorithm

Figure 2. Graphical illustration of tip decomposition using RECEIPT. Coarse-grained Decomposition partitions U into three vertex subsets U_1, U_2 and U_3 whose tip numbers belong to ranges R_1, R_2 and R_3, respectively. It processes each peeling iteration in parallel. Fine-grained Decomposition creates subgraphs induced on U_1, U_2 and U_3, initializes vertex support using ⋈^{init}, and peels each of them sequentially. It processes multiple subgraphs concurrently. Note that G has 38 wedges whereas the induced subgraphs collectively have only 11 wedges.

In this section, we present a novel shared-memory parallel algorithm for tip decomposition – REfine CoarsE-grained IndePendent Tasks (RECEIPT) – that drastically reduces the number of parallel peeling iterations (objective 1, sec.2.2.1). The fundamental insight underlying RECEIPT is that θ_u only depends on the number of butterflies that u shares with vertices having tip numbers no less than θ_u. Therefore, if we initialize the support ⋈_u to the total number of butterflies of u in G, only the following are required to compute θ_u:

  • The aggregate effect on ⋈_u of peeling all vertices v with θ_v < θ_u: ⋈_u would be ≥ θ_u after such deletions.

  • The effect on ⋈_u of deleting vertices v with θ_v = θ_u: ⋈_u would be ≤ θ_u after such deletions.

This insight allows us to eliminate the major bottleneck for parallel tip decomposition, i.e., the constraint of deleting only minimum-support vertices in any peeling iteration. In order to find vertices with tip number greater than or equal to θ, all vertices with tip numbers less than θ can be peeled simultaneously, providing sufficient parallelism. However, for every θ ∈ {0, 1, …, θ^max}, peeling all the vertices v with θ_v < θ and computing the corresponding support updates would make the algorithm extremely inefficient. To avoid this inefficiency, RECEIPT follows a 2-step approach as shown in fig.2.

In the first step, it divides the spectrum of tip numbers [0, θ^max] into P smaller ranges R_1, R_2, …, R_P, where P is a user-defined parameter (RECEIPT does not assume prior knowledge of the maximum tip number and computes an upper bound during execution). A range R_i is defined by the set of integers in [θ(i), θ(i+1)), with boundary conditions θ(1) = 0 and θ(P+1) > θ^max. Note that the ranges have no overlap, i.e., for any pair (i, j), i ≠ j implies R_i ∩ R_j = {ϕ}. Corresponding to each range R_i, RECEIPT CD also finds the subset of vertices U_i whose tip numbers belong to that range, i.e., U_i = {u ∈ U | θ_u ∈ R_i}. In other words, instead of finding the exact tip number θ_u of a vertex u, the first step in RECEIPT computes bounds on θ_u by finding its range affiliation using peeling. Therefore, this step is named Coarse-grained Decomposition (RECEIPT CD). The absence of overlap between the ranges allows each subset to be peeled independently of others for exact tip number computation in a later step.

RECEIPT CD differs starkly from the conventional bottom-up approach: instead of peeling vertices with minimum support, every iteration concurrently peels all vertices with support values in a broad range. Setting P ≪ θ^max ensures a large number of vertices peeled per iteration (sufficient parallel workload) and significantly fewer iterations (dramatically less synchronization) compared to parallel variants of bottom-up peeling (alg.2).

The next step finds the exact tip numbers of vertices and is termed Fine-grained Decomposition (RECEIPT FD). The key idea behind RECEIPT FD is as follows – for each vertex u, if we know the number of butterflies it shares with vertices in its own subset U_i and in subsets with higher tip number ranges (∪_{j>i} {U_j}), then every subset can be peeled independently of the others. RECEIPT FD exploits this independence to concurrently process multiple vertex subsets by simultaneously employing sequential bottom-up peeling on the subgraphs induced by these subsets. Setting P ≫ T ensures that RECEIPT FD can be efficiently parallelized. Thus, RECEIPT avoids strict sequential dependencies across peeling iterations and systematically parallelizes tip decomposition.

Since each butterfly has two vertices in U, butterflies with vertices in different subsets are not preserved in the induced subgraphs. Hence, butterfly counting on a subgraph induced on U_i will not account for butterflies shared between u ∈ U_i and vertices in subsets with higher tip number ranges. However, we note that when U_{i−1} is completely peeled in RECEIPT CD, the support of a vertex u ∈ U reflects the number of butterflies it shares with the remaining vertices, i.e., ∪_{j≥i} {U_j}. This is precisely the initial butterfly count for u as required in RECEIPT FD. Hence, we store these values during RECEIPT CD (in the ⋈^{init} vector, as shown in fig.2) and use them for support initialization in RECEIPT FD.

The two-step approach of RECEIPT can potentially double the workload, as each wedge may be traversed once in both steps. However, we note that since each subset is processed independently of the others, support updates are not communicated between the subsets. Hence, when peeling a subset U_i, we only need to traverse the wedges with both endpoints in U_i. Therefore, RECEIPT FD first creates an induced subgraph on U_i and only explores the wedges in that subgraph. This dramatically reduces the amount of work done in RECEIPT FD. For example, in fig.2, the original graph G has 38 wedges with endpoints in U, whereas the three subgraphs collectively have only 11 such wedges, which will be traversed during RECEIPT FD.

3.1. Coarse-grained Decomposition

Alg.3 depicts the pseudocode for Coarse-grained Decomposition (RECEIPT CD). It takes a bipartite graph G(U, V, E) as input and partitions the vertex set U into P subsets with different tip number ranges. Prior to creating vertex subsets, RECEIPT CD uses per-vertex counting (alg.1) to initialize the support of vertices in U.

The first step in computing a subset U_i is to find the tip number range R_i = [θ(i), θ(i+1)). Ideally, the ranges should be computed such that the number of wedges in the induced subgraphs is uniform across all P subsets (to ensure load balance during RECEIPT FD). However, the induced subgraphs are not known prior to actual partitioning. Secondly, exact tip numbers are not known either and hence, the vertices in U_i cannot be determined prior to partitioning for different candidate values of θ(i+1). Considering these challenges, RECEIPT CD uses two proxies for range determination (lines 16-21): (1) the number of wedges in the original graph as a proxy for the wedges in induced subgraphs, and (2) the current support of vertices as a proxy for tip numbers. Now, to balance the wedge counts across subsets, RECEIPT CD aggregates the wedge counts (in G) of vertices in bins corresponding to their current support and computes a prefix sum over the bins (Ref: proof of theorem 7). For any unique support value θ, the prefix output now represents the total wedge count of vertices with support ≤ θ (line 19). The upper bound θ(i+1) is then chosen such that the prefix output for θ(i+1) is close to (but no less than) the average wedges per subset (line 20).

After finding the range, RECEIPT CD iteratively peels the vertices and adds them to the subset U_i (lines 9-14). In each iteration, it peels activeSet – the set of all vertices with support in the entire range R_i. This is unlike bottom-up peeling, where vertices with a single (minimum) support value are peeled in an iteration. Thus, RECEIPT CD enjoys a higher workload per iteration, which enables efficient parallelization. For the first peeling iteration of every subset U_i, RECEIPT CD scans all vertices in U to initialize activeSet (line 9). In subsequent iterations, activeSet is constructed by tracking only those vertices whose support has been updated. For correctness of the parallel algorithm, the update routine used in RECEIPT CD (line 13) uses atomic operations to decrease vertex supports.

Apart from the tip number ranges and the partitions of U, RECEIPT CD also outputs an array ⋈^{init}_U, which is used to initialize support in RECEIPT FD (sec.3). Before peeling for a subset begins, it copies the current support of the remaining vertices into ⋈^{init}_U (line 7). Thus, for a vertex u ∈ U_i, ⋈^{init}_u indicates the support of u (a) after all vertices in U_{i−1} are peeled, and (b) before any vertex in U_i is peeled.

Algorithm 3 Coarse-grained Decomposition (RECEIPT CD)
1: Input: Bipartite graph G(U, V, E), # partitions P
2: Output: Ranges {θ(1), θ(2) … θ(P+1)}, Vertex Subsets {U_1, U_2 … U_P}, Support initialization vector ⋈^{init}_U
3: w[u] ← number of wedges in G with endpoint u
4: U_i ← {ϕ} ∀ i ∈ {1, 2 … P}
5: {⋈_U, ⋈_V} ← pvBcnt(G)
6: θ(1) ← 0, i ← 1
7: tgt ← (Σ_{u∈U} w[u]) / P ▷ average wedges per subset
8: while U ≠ {ϕ} and i ≤ P do
9:     for each u ∈ U do in parallel ▷ Support Init
10:        ⋈^{init}_u ← ⋈_u
11:    θ(i+1) ← findHi(U, ⋈_U, w, tgt) ▷ Upper Bound
12:    activeSet ← vertices with support in [θ(i), θ(i+1))
13:    while activeSet ≠ {ϕ} do ▷ Peel Range
14:        U_i ← U_i ∪ activeSet, U ← U \ activeSet
15:        for each u ∈ activeSet do in parallel
16:            update(u, θ(i), ⋈_U, G) ▷ Ref: alg.2
17:        activeSet ← vertices with support in [θ(i), θ(i+1))
18:    i ← i + 1
19: function findHi(U, ⋈_U, w, tgt)
20:    Initialize hashmap work to all zeros
21:    for each ⋈ ∈ ⋈_U
22:        work[⋈] ← Σ_{u∈U} (w[u] · 1(⋈_u ≤ ⋈))
23:    θ ← arg min(⋈) such that work[⋈] ≥ tgt
24:    return θ + 1
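The following is a minimal sequential C++ sketch of the range-upper-bound selection performed by findHi, assuming a dense support array and a boolean mask of vertices not yet assigned to a subset; all identifiers are illustrative, and the parallel implementation would use a parallel sort and prefix scan as described in the proof of theorem 7.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Range upper-bound selection (sketch of findHi in alg.3). Picks the smallest
// support value theta such that the cumulative wedge count (in G) of all
// remaining vertices with support <= theta reaches the target tgt, and
// returns theta + 1 as the exclusive upper bound of the next range.
// sup: current supports; w: static per-vertex wedge counts in G;
// alive[u]: u has not been assigned to a subset yet.
uint64_t findHi(const std::vector<uint64_t>& sup,
                const std::vector<uint64_t>& w,
                const std::vector<bool>& alive,
                uint64_t tgt) {
    std::map<uint64_t, uint64_t> work;        // support value -> wedge count
    for (size_t u = 0; u < sup.size(); ++u)
        if (alive[u]) work[sup[u]] += w[u];

    uint64_t prefix = 0;                      // prefix sum over sorted supports
    for (const auto& kv : work) {
        prefix += kv.second;
        if (prefix >= tgt) return kv.first + 1;
    }
    // Fewer wedges than tgt remain: cover everything that is left.
    return work.empty() ? 1 : work.rbegin()->first + 1;
}
```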

3.1.1. Adaptive Range Determination

Since range determination uses the current support of vertices as a proxy for tip numbers, tgt – the target wedge count for a subset U_i – is covered by the vertices added to U_i in the very first peeling iteration. Hence, Σ_{u∈U_i} w[u] ≥ tgt, where w[u] is the wedge count of u in G. Note that as the support of other vertices decreases after this iteration, more vertices may get added to U_i, and the total wedge count of vertices in the final subset U_i can be significantly higher than tgt. This could result in some subsets having very high workload. Moreover, it is possible that U gets completely peeled in far fewer than P subsets, thus restricting the parallelism available during RECEIPT FD. To avoid this scenario, we implement two-way adaptive range determination:

  (1) Instead of statically computing an average target tgt, we dynamically update tgt for every subset based on the wedge count of the remaining vertices in U and the remaining number of subsets to create. If some subsets cover a large number of wedges, the target for future subsets is automatically reduced, thereby preventing a situation where all vertices get peeled in much fewer than P subsets.

  (2) A subset U_i can cover a significantly larger number of wedges than the target tgt. RECEIPT CD assumes predictive local behavior, i.e., subset U_{i+1} will exhibit behavior similar to U_i. Therefore, to balance the wedge counts of subsets, RECEIPT dynamically scales the target wedge count for U_{i+1} with a scaling factor $s_i = \frac{tgt}{\sum_{u\in U_i}w[u]} \leq 1$. Note that s_i quantifies the overshooting of the target wedges in U_i (a sketch of the combined update follows below).

After P partitions, if some vertices still remain in U, RECEIPT CD puts all of them in a single subset U_{P+1} and increments P.
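A minimal sketch of the two-way adaptive target update is shown below, assuming the next target is computed from the wedges still uncovered and the scaling factor of the most recently completed subset; the function name and signature are illustrative.

```cpp
#include <algorithm>

// Two-way adaptive target update (sketch; names are illustrative).
// remainingWedges: total wedge count (in G) of vertices not yet assigned.
// remainingSubsets: number of subsets still to be created.
// prevTarget, covered: target and actually covered wedges of subset U_i.
// Returns the target wedge count for subset U_{i+1}.
double nextTarget(double remainingWedges, int remainingSubsets,
                  double prevTarget, double covered) {
    // Scaling factor s_i = tgt / (wedges covered by U_i), capped at 1.
    double s = (covered > 0.0) ? std::min(1.0, prevTarget / covered) : 1.0;
    // Dynamic average over what remains, shrunk when U_i overshot its target.
    return s * remainingWedges / std::max(1, remainingSubsets);
}
```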

3.2. Fine-grained Decomposition

Alg.4 presents the pseudocode for Fine-grained Decomposition (RECEIPT FD), which takes as input the vertex subsets and the tip number ranges created by RECEIPT CD, and computes the exact tip numbers. It creates a task queue of subset IDs from which threads are exclusively allocated vertex subsets to peel (line 4). Before peeling U_i, a thread initializes the support of the vertices in U_i from the ⋈^{init}_U vector and induces a subgraph G_i on W_i = (U_i, V) (lines 5-6). Thereafter, sequential bottom-up peeling is applied on G_i for tip decomposition of U_i (lines 7-10).

Algorithm 4 Fine-grained Decomposition (RECEIPT FD)
1: Input: Bipartite graph G(U, V, E), # partitions P, Vertex Subsets {U_1, U_2 … U_P}, Support initialization vector ⋈^{init}_U, # threads T
2: Output: Tip number θ_u for each u ∈ U
3: Insert the integers i ∈ {1, 2 … P} in a queue Q
4: for thread_id = 1, 2 … T do in parallel
5:     while Q is not empty do
6:         Atomically pop i from Q
7:         G_i ← subgraph induced by W_i = (U_i, V)
8:         ⋈_{U_i} ← ⋈^{init}_{U_i} ▷ Initialize Support
9:         while U_i ≠ {ϕ} do ▷ Peel
10:            u ← arg min_{u∈U_i} {⋈_u}
11:            θ_u ← ⋈_u,  U_i ← U_i \ {u}
12:            update(u, θ_u, ⋈_{U_i}, G_i)
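A minimal C++ sketch of the per-subset subgraph induction used in line 7 of alg.4 follows; the V-side adjacency representation and the membership mask are illustrative assumptions. Keeping the entire V side ensures that every butterfly shared between two vertices of U_i survives in G_i.

```cpp
#include <vector>

// Induce the subgraph G_i on W_i = (U_i, V): keep only edges whose U endpoint
// belongs to subset U_i (names are illustrative).
// adjV[v]: U-neighbors of v in the full graph; inUi[u]: membership in U_i.
std::vector<std::vector<int>> induceOnSubset(
        const std::vector<std::vector<int>>& adjV,
        const std::vector<bool>& inUi) {
    std::vector<std::vector<int>> adjVi(adjV.size());
    for (size_t v = 0; v < adjV.size(); ++v)
        for (int u : adjV[v])
            if (inUi[u]) adjVi[v].push_back(u);
    return adjVi;
}
```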

3.2.1. Parallelization Strategy

While adaptive range determination (sec.3.1.1) tries to create subsets with uniform wedge counts in G, the actual work per subset in FD depends on the wedges in the induced subgraphs G_i(U_i, V, E_i), which can be non-uniform. Therefore, to improve load balance across threads, we use parallelization strategies inspired by the Longest Processing Time scheduling rule, which is a known 4/3-approximation algorithm (graham1969bounds, ). However, the exact processing time for peeling an induced subgraph G_i is unknown. Instead, we use the number of wedges with endpoints in U_i as a proxy, along with the runtime task scheduling given below (a code sketch follows the list):

  • Dynamic task allocation → Threads atomically pop unique subset IDs from the task queue when they become idle during runtime (line 4). Thus, all threads are busy until every subset is scheduled.

  • Workload-aware Scheduling → We sort the subset IDs in the task queue in decreasing order of their wedge counts. Thus, the subsets with the highest workload (wedges) get scheduled first, and the threads processing them naturally receive fewer tasks in the future. Fig.3 shows how workload-aware scheduling can tremendously improve the efficiency of dynamic allocation.
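The sketch below illustrates this combination of workload-aware ordering and dynamic task allocation with OpenMP; the std::function callback standing in for per-subset peeling and all other identifiers are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Workload-aware dynamic scheduling of RECEIPT FD tasks (sketch).
// wedges[i] is the wedge count of subset i, used as a proxy for its peeling
// time; peelSubset(i) performs sequential bottom-up peeling of subset i.
void scheduleFD(const std::vector<uint64_t>& wedges,
                const std::function<void(int)>& peelSubset,
                int numThreads) {
    const int P = static_cast<int>(wedges.size());
    std::vector<int> queue(P);
    std::iota(queue.begin(), queue.end(), 0);
    // Workload-aware ordering: heaviest subsets first (LPT rule).
    std::sort(queue.begin(), queue.end(),
              [&](int a, int b) { return wedges[a] > wedges[b]; });

    std::atomic<int> head(0);                   // shared task-queue cursor
    #pragma omp parallel num_threads(numThreads)
    {
        while (true) {
            int pos = head.fetch_add(1);        // dynamic allocation: pop a task
            if (pos >= P) break;
            peelSubset(queue[pos]);             // one subset peeled by one thread
        }
    }
}
```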

Figure 3. Benefits of Workload-aware Scheduling (WaS) in a 2-thread (T_1 and T_2) system. The top row shows vertex subsets with the time required to peel them. Dynamic allocation without WaS finishes in 33 units of time compared to 25 units with WaS.

3.3. Analysis

In this section, we will prove the correctness of the tip numbers computed by RECEIPT. We will also analyze its computational complexity and show that RECEIPT is work-efficient. We will exploit comparisons with sequential BUP (alg.2) and hence first establish the following lemma:

Lemma 1.

In BUP, the support ⋈_u of a vertex u at any time t before the first vertex with tip number θ_u is peeled depends on the cumulative effect of all vertices peeled till t, and is independent of the order in which they are peeled.

Proof.

Let ⋈^G_u be the initial support of u (butterflies of u in G), S be the set of vertices peeled till t, and u′ ∈ S be the most recently peeled vertex. Since BUP assigns tip numbers in a non-decreasing order, no vertex with tip number ≥ θ_u is peeled till t. Hence, θ_{u′} < θ_u ≤ ⋈_u. Since θ_{u′} was the minimum vertex support in the latest peeling iteration, the support of u was never floored at a peeled vertex's tip number, and therefore ⋈_u = ⋈^G_u − Σ_{v∈S} ⋈_{v,u}. By commutativity of addition, this term is independent of the order in which the ⋈_{v,u} are subtracted, i.e., the order in which the vertices in S are peeled. ∎

Lemma 1 highlights that whether BUP peels a set of vertices S ⊆ U in its original order or in the same order as RECEIPT CD, the support of vertices with tip numbers higher than those of S would be the same. Next, we show that parallel processing does not affect the correctness of support updates.

Lemma 2.

Given a set S of vertices to be peeled in an iteration, the parallel peeling in RECEIPT CD (lines 12-13, alg.3) correctly updates the support of vertices.

Proof.

Let S be peeled in the j-th iteration, which is part of the peeling iterations for subset U_i. Parallel updates are correct if, for any vertex u not yet assigned to any subset, the support of u decreases by exactly Σ_{u′∈S} ⋈_{u′,u} in the j-th iteration. Since at most two vertices of a butterfly can be in S (the other two vertices of a butterfly are in V), for any vertex pair (u_1, u_2), exactly one of the following holds in the j-th iteration:

  • Neither vertex in the pair is peeled – no updates are propagated from u_1 to u_2 or vice-versa.

  • Exactly one vertex in the pair is peeled – without loss of generality, let u_1 ∈ S. Peeling u_1 deletes exactly ⋈_{u_1,u_2} unique butterflies incident on u_2, because they share ⋈_{u_1,u_2} butterflies and no vertex in U \ {u_1, u_2} (and hence in S) participates in these butterflies. The update routine called for u_1 (line 13, alg.3) also decreases the support of u_2 by exactly ⋈_{u_1,u_2}, as long as ⋈_{u_2} > θ(i). Since atomics are used to apply support updates, concurrent updates to the same vertex do not conflict (sec.3.1). Thus, the update routine calls for all vertices in S cumulatively decrease ⋈_{u_2} by exactly Σ_{u′∈S} ⋈_{u′,u_2}, as long as ⋈_{u_2} > θ(i). If ⋈_{u_2} ≤ θ(i) < θ(i+1), the vertex subset for u_2 is already determined and any updates to ⋈_{u_2} have no impact.

  • Both u_1 and u_2 are peeled – the vertex subsets of u_1 and u_2 are determined. Any update to ⋈_{u_1} or ⋈_{u_2} has no effect. ∎

Now, we will show that RECEIPT CD correctly computes the tip number range for every vertex. Finally, we will show that RECEIPT accurately computes the exact tip numbers for all vertices. For clarity of explanation, we use ⋈_u(j) to denote the support of a vertex u after the j-th peeling iteration in RECEIPT CD.

Lemma 3.

There cannot exist a vertex u such that u ∈ U_i and θ_u ≥ θ(i+1).

Proof.

Let j be the first iteration in RECEIPT CD that wrongly peels a set of vertices S_w and assigns them to subset U_i even though θ_u ≥ θ(i+1) ∀ u ∈ S_w. Let S_hi ⊇ S_w be the set of all vertices with tip numbers ≥ θ(i+1), and S_r be the set of vertices peeled up to iteration j−1. Since all vertices till the (j−1)-th iteration have been correctly peeled, θ_u < θ(i+1) ∀ u ∈ S_r. Hence, S_hi ⊆ U \ S_r.

Consider a vertex u ∈ S_w. Since u is peeled in iteration j, ⋈_u(j−1) < θ(i+1). From lemma 2, ⋈_u(j−1) correctly represents the number of butterflies shared between u and U \ S_r (the vertices remaining after j−1 iterations). Since S_hi ⊆ U \ S_r, u participates in at most ⋈_u(j−1) butterflies with vertices in S_hi. By the definition of tip number (sec.2.2), θ_u ≤ ⋈_u(j−1) < θ(i+1), which is a contradiction. Thus, no such u exists, S_w = {ϕ}, and all vertices in U_i have tip numbers less than θ(i+1). ∎

Lemma 4.

There cannot exist a vertex u such that u ∈ U_i and θ_u < θ(i).

Proof.

Let i be the smallest integer for which there exists a set S_w ≠ {ϕ} such that θ(i) ≤ θ_u < θ(i+1) ∀ u ∈ S_w, but u ∈ U_p, where p > i. Let j be the last iteration that peels vertices into U_i. Clearly, ⋈_u(j) ≥ θ(i+1) ∀ u ∈ S_w; otherwise u would be peeled in or before iteration j and would not be added to U_p.

From lemma 2, ⋈_u(j) correctly represents the butterfly count of u after the vertices in U_1 ∪ U_2 … ∪ U_i are deleted. In other words, every vertex in S_w participates in at least θ(i+1) butterflies with vertices in U_{i+1} ∪ U_{i+2} … ∪ U_P and hence is part of a θ(i+1)-tip (def.1). Therefore, by the definition of tip number, θ_u ≥ θ(i+1) ∀ u ∈ S_w, which is a contradiction. ∎

Theorem 5.

RECEIPT CD (alg.3) correctly computes the vertex-subsets corresponding to every tip number range.

Proof.

Follows directly from lemmas 3 and 4. ∎

Theorem 6.

RECEIPT correctly computes the tip numbers for all u ∈ U.

Proof.

Consider an arbitrary vertex u ∈ U_i. From theorem 5, θ(i) ≤ θ_u < θ(i+1). Let S = U_1 ∪ U_2 … ∪ U_{i−1} denote the set of vertices peeled before U_i in RECEIPT CD. For all vertices u′ ∈ S, θ_{u′} < θ_u and hence, S will be completely peeled before u in BUP as well. We now compare peeling U_i in RECEIPT FD (alg.4) to peeling U_i in the sequential algorithm BUP, and show that the two output identical tip numbers. It is well known that BUP correctly computes tip numbers (sariyuce2016peelingarxiv, ).

Note that the initial support of u in RECEIPT FD, i.e., ⋈^{init}_u, is the support of u in RECEIPT CD after S is peeled (sec.3.1, lines 6-7 of alg.3). From lemmas 1 and 2, this is equal to the support of u in BUP after S is peeled. Further, by theorem 5, any vertex in U_{i+1} ∪ U_{i+2} … ∪ U_P has a tip number strictly greater than all vertices in U_i, and will be peeled after U_i in BUP. Next, we note that the subgraph G_i is induced on subset U_i and the entire set V. Thus, every butterfly shared between any two vertices in U_i is present in G_i. Therefore, for any vertex u′ ∈ U_i peeled before u, the update ⋈_{u′,u} computed by BUP and RECEIPT FD will be the same. Hence, BUP and the sequential peeling of U_i in RECEIPT FD apply the same support updates to vertices in U_i and therefore follow the same order of vertex peeling. Thus, the final tip number θ_u computed by RECEIPT FD will be the same as that computed by BUP. ∎

It is important for a parallel algorithm to be not only scalable but also computationally efficient. The following theorem shows that for a reasonable upper bound on P (in practice, we use $P\ll\frac{\sum_{u\in U}\sum_{v\in N_u}d_v}{n\log n}$ for large graphs), RECEIPT is at least as efficient as the best sequential tip decomposition algorithm, BUP.

Theorem 7.

For $P=\mathcal{O}\left(\frac{\sum_{u\in U}\sum_{v\in N_u}d_v}{n\log n}\right)$ vertex subsets, RECEIPT is work-efficient with a computational complexity of $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v\right)$.

Proof.

RECEIPT CD – It initializes the vertex support using per-vertex butterfly counting of complexity $\mathcal{O}\left(\sum_{(u,v)\in E}\min(d_u,d_v)\right)$. Range computation for each subset requires constructing an $\mathcal{O}(|U|)$-size hashmap (work) whose keys are the unique support values in ⋈_U. In this hashmap, the wedge counts of all vertices with support ⋈ are accumulated in the value work[⋈]. Next, work is sorted on the keys and a parallel prefix sum is computed over the values, so that the final value work[⋈] represents the cumulative wedge count of all vertices with support less than or equal to ⋈. Parallel implementations of hashmap generation, sorting and prefix scan perform $\mathcal{O}(|U|\log|U|)=\mathcal{O}(n\log n)$ work. Computing the scaling factor s_i for adaptive range determination requires aggregating the wedge counts of vertices in U_i, contributing $\mathcal{O}(|U|)=\mathcal{O}(n)$ work over all subsets.

Constructing activeSet for the first peeling iteration of each subset requires an $\mathcal{O}(n)$ complexity parallel filtering on ⋈_U. Subsequent iterations construct activeSet by tracking support updates, doing $\mathcal{O}(1)$ work per update. For every vertex u ∈ U, the update routine is called once when u is peeled, and it traverses at most Σ_{v∈N_u} d_v wedges. At most one support update is generated per wedge, resulting in $\mathcal{O}(1)$ work per wedge (RECEIPT CD stores vertex supports in an array ⋈_U). Thus, the complexity of RECEIPT CD is $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v+Pn\log n\right)$.

RECEIPT FD – For every subset U_i, alg.4 creates an induced subgraph G_i(U_i, V_i, E_i) in parallel. This requires $\mathcal{O}(n+|E_i|)$ work for subset U_i and $\mathcal{O}(Pn+m)$ work over all P subsets. When u ∈ U_i is peeled, the corresponding call to update explores wedges in the induced subgraph G_i and generates support updates to other vertices in U_i. These are a subset of the wedges and support updates generated when u was peeled in RECEIPT CD. A Fibonacci heap can be used to extract the minimum-support vertex ($\mathcal{O}(\log n)$ work per vertex) and enables constant-complexity updates ($\mathcal{O}\left(\sum_{v\in N_u}d_v\right)$ work in the update call for u). Thus, the complexity of RECEIPT FD is $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v+Pn+n\log n\right)$.

Combining both steps, the total work done by RECEIPT is $\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v+Pn\log n\right)=\mathcal{O}\left(\sum_{u\in U}\sum_{v\in N_u}d_v\right)$ if $P=\mathcal{O}\left(\frac{\sum_{u\in U}\sum_{v\in N_u}d_v}{n\log n}\right)$. This is the same as sequential BUP and hence, RECEIPT is work-efficient. ∎

4. OPTIMIZATIONS

Despite its parallelism potential, RECEIPT may take hours or days to process large graphs such as TrU that contain hundreds of trillions of wedges (sec.1). To this end, we develop novel optimizations that exploit the properties of RECEIPT to dramatically improve its computational efficiency in practice, making it feasible to decompose graphs like TrU in minutes (objective 2, sec.2.2.1).

4.1. Hybrid Update Computation (HUC)

The complexity of RECEIPT is dominated by wedge traversal done during peeling in RECEIPT CD. In order to reduce this traversal, we exploit the following insights about the behavior of counting and peeling algorithms:

  • Butterfly counting is computationally efficient (low complexity) and easily parallelizable (sec.2.1).

  • Some peeling iterations in RECEIPT CD may peel a large number of vertices.

Given a vertex set (activeSet) to peel, we compute the cost of peeling, C_peel, as Σ_{u∈activeSet} Σ_{v∈N_u} d_v, which is the total wedge count of the vertices in activeSet. The cost of re-counting butterflies, C_rcnt, is computed as Σ_{(u,v)∈E} min(d_u, d_v), which is the bound on the wedge traversal of counting (sec.2). Thus, if C_peel exceeds C_rcnt, we re-compute the butterflies of all remaining vertices in U instead of computing the support updates obtained by peeling activeSet. This optimization is denoted Hybrid Update Computation (HUC).
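A minimal sketch of this cost comparison is shown below; the precomputed per-vertex wedge counts and the counting bound are assumed to be maintained over the remaining graph, and all identifiers are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Hybrid Update Computation decision (sketch; names are illustrative).
// wedgesOf[u] = sum of d_v over v in N_u (wedge count of u in the current
// graph); cntBound = sum over the remaining edges of min(d_u, d_v), i.e. the
// wedge-traversal bound of per-vertex butterfly counting.
// Returns true when a full re-count is cheaper than peeling activeSet.
bool shouldRecount(const std::vector<int>& activeSet,
                   const std::vector<uint64_t>& wedgesOf,
                   uint64_t cntBound) {
    uint64_t peelCost = 0;                  // C_peel
    for (int u : activeSet) peelCost += wedgesOf[u];
    return peelCost > cntBound;             // C_peel > C_rcnt
}
```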

We note that compared to RECEIPT CD, HUC is relatively less beneficial for RECEIPT FD because: (a) the induced subgraphs have significantly fewer wedges than the original graph, and (b) few vertices are typically deleted per peeling iteration in RECEIPT FD. Further, re-counting for a subset U_i in RECEIPT FD must account for butterflies with vertices in ∪_{j>i} {U_j}. This external contribution for a vertex u can be computed by deducting the butterfly count of u within G_i from ⋈^{init}_u.

4.2. Dynamic Graph Maintenance (DGM)

We note that after a vertex u is peeled in RECEIPT CD or RECEIPT FD, it is excluded from future computation in the respective step. However, the graph data structure (adjacency list/Compressed Sparse Row) still contains edges to u interleaved with other edges. Consequently, wedges incident on u are still explored after u is peeled. To prevent such wasteful exploration, we periodically update the data structures to remove edges incident on peeled vertices. We denote this optimization as Dynamic Graph Maintenance (DGM).

The cost of DGM is determined by the work done in the parallel compaction of all adjacency lists, which grows linearly with the number of edges in the graph. Therefore, if the adjacency lists are updated only after m wedges have been traversed since the previous update, DGM does not alter the theoretical complexity of RECEIPT and poses negligible practical overhead.
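The following is a minimal sketch of such a compaction trigger, assuming a simple adjacency-list representation and a flag array marking peeled vertices; identifiers are illustrative, and a CSR-based implementation would compact the edge array in parallel instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Dynamic Graph Maintenance trigger (sketch; names are illustrative).
// Adjacency lists are compacted only after at least m wedges have been
// traversed since the previous compaction, keeping the amortized cost
// linear in the number of edges.
struct DgmState {
    uint64_t wedgesSinceCompaction = 0;
    uint64_t m = 0;                               // number of edges in G
};

void maybeCompact(DgmState& st, uint64_t newlyTraversed,
                  std::vector<std::vector<int>>& adj,
                  const std::vector<bool>& peeled) {
    st.wedgesSinceCompaction += newlyTraversed;
    if (st.wedgesSinceCompaction < st.m) return;  // not enough traversal amortized yet
    st.wedgesSinceCompaction = 0;
    for (auto& nbrs : adj)                        // drop edges to peeled vertices
        nbrs.erase(std::remove_if(nbrs.begin(), nbrs.end(),
                                  [&](int w) { return peeled[w]; }),
                   nbrs.end());
}
```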

5. Experiments

5.1. Setup

We conduct the experiments on a 36-core dual-socket Linux server with two Intel Xeon E5-2695 v4 processors @ 2.1 GHz and 1 TB DRAM. All algorithms are implemented in C++14 and compiled using G++ 9.1.0 with the -O3 optimization flag. We use OpenMP v4.5 for multithreading.

Baselines: We compare RECEIPT against the sequential BUP algorithm (alg.2) and its parallel variant ParB (shiParbutterfly, ). ParB resembles ParButterfly with BATCH-mode peeling (we were unable to verify the correctness of the tip numbers generated by the public code for ParButterfly and hence implemented it ourselves for comparison). ParB uses the bucketing structure of Julienne (julienne, ) with 128 buckets, as given in (shiParbutterfly, ).

Datasets: We use six unweighted bipartite graphs obtained from the KOBLENZ collection (konect, ). To the best of our knowledge, these are some of the largest publicly available bipartite datasets. Within each graph, the number of wedges with endpoints in one vertex set can be significantly different from that of the other, as can be seen from ∧^{BUP} in table 3. We label the vertex set with the higher number of wedges as U and the other as V, and accordingly suffix "U" or "V" to the dataset name to identify which vertex set is decomposed.

Table 2. Bipartite datasets for evaluation with the corresponding number of butterflies (⋈_G) and wedges (∧_G) in billions, and maximum tip numbers for U (θ^max_U) and V (θ^max_V). d_U and d_V denote the average degree of vertices in U and V, respectively.
Dataset | Description | |U| | |V| | |E| | d_U / d_V | ⋈_G (in B) | ∧_G (in B) | θ^max_U | θ^max_V
ItU, ItV Pages and editors from Italian Wikipedia 2,255,875 137,693 12,644,802 5.6 / 91.8 298 361 1,555,462 5,328,302,365
DeU, DeV Users and tags from www.delicious.com 4,512,099 833,081 81,989,133 18.2 / 98.4 26,683 1,446 936,468,800 91,968,444,615
OrU, OrV Users’ group memberships in Orkut 2,783,196 8,730,857 327,037,487 117.5 / 37.5 22,131 2,528 88,812,453 29,285,249,823
LjU, LjV Users’ group memberships in Livejournal 3,201,203 7,489,073 112,307,385 35.1 / 15 3,297 2,703 4,670,317 82,785,273,931
EnU, EnV Pages and editors from English Wikipedia 21,504,191 3,819,691 122,075,170 5.7 / 32 2,036 6,299 37,217,466 96,241,348,356
TrU, TrV Internet domains and trackers in them 27,665,730 12,756,244 140,613,762 5.1 / 11 20,068 106,441 18,667,660,476 3,030,765,085,153
Figure 4. Tip number distribution in Trackers graph – y-axis shows percentage vertices with tip number less than abscissa.

Note that the maximum tip numbers are extremely high because a few high-degree vertices share a large number of common neighbors. However, most of the vertices' tip numbers lie in a much smaller range, as shown in fig.4. For example, 99.98% of the vertices in TrU have a tip number < 5M (0.027% of the maximum).

Implementation Details: Unless otherwise mentioned, the parallel algorithms use all 36 threads for execution (we did not observe any benefits from enabling the 2-way hyperthreading) and RECEIPT includes all workload optimizations discussed in sec.4. In RECEIPT FD and sequential BUP, we use a k-way min-heap for efficient retrieval of minimum-support vertices. We found it to be faster in practice than the bucketing structure of (sariyucePeeling, ) or Fibonacci heaps.

To analyze the impact of the number of vertex subsets on RECEIPT's performance, we vary P from 50 to 500. Fig.5 reports the effect of P on some large datasets that showed significant performance variation. Typically, RECEIPT CD dominates the overall execution time because of the larger number of wedges traversed compared to RECEIPT FD. The performance of RECEIPT CD improves with a decrease in P because of reduced synchronization. However, a small value of P reduces parallelism in RECEIPT FD and makes the induced subgraphs larger. Consequently, for P ≤ 100, we observed that RECEIPT FD became the bottleneck for some large graphs such as TrU and LjU. For all the datasets shown in fig.5, execution slows down when P is increased beyond 150. Based on these observations, we use P = 150 for all datasets in the rest of the experiments.

Figure 5. Execution time of RECEIPT vs P.

5.2. Results

5.2.1. Comparison with Baselines

Table 3. Comparing the execution time (t), # wedges traversed (∧) and # synchronization rounds (ρ) of RECEIPT and the baseline algorithms. ParB traverses the same # wedges as BUP and has missing entries due to out-of-memory errors. pvBcnt denotes per-vertex butterfly counting and t = ∞ denotes that the execution did not finish in 10 days.
             ItU      ItV     DeU      DeV      OrU        OrV      LjU        LjV     EnU        EnV     TrU        TrV
t (s)
  pvBcnt     0.3      0.3     8.3      8.3      45.6       45.6     5.1        5.1     6.9        6.9     7.8        7.8
  BUP        3,849    8.4     12,260   428      39,079     2,297    67,588     200     111,777    281     ∞          5,711
  ParB       3,677    8.1     -        377.7    -          1,510    -          132.5   -          198     -          3,524
  RECEIPT    56.8     3.1     402.4    32.4     1,865      136      911.1      23.7    1,383      31.1    2,784      530.6
∧ (billions)
  pvBcnt     0.19     0.19    20.3     20.3     74.8       74.8     4.7        4.7     6.3        6.3     6.3        6.3
  BUP        723      0.57    2,861    70.1     4,975      231.4    5,403      14.3    12,583     29.6    211,156    1,740
  RECEIPT    71       0.56    1,503    51.3     2,728      170.4    1,003      11.7    2,414      22.2    3,298      658.1
ρ
  ParB       377,904  10,054  670,189  127,328  1,136,129  334,064  1,479,495  83,423  1,512,922  83,800  1,476,015  342,672
  RECEIPT    967      280     1,113    406      1,160      639      1,477      456     1,724      453     1,335      1,381

Table 3 shows a detailed comparison of the various tip decomposition algorithms. Note that ρ represents the number of synchronization rounds. Each round consists of one iteration of peeling all vertices with minimum support (or with support in the minimum range, for RECEIPT). Wedge traversal by BUP can be computed without executing alg.2 by simply aggregating the 2-hop neighborhood sizes of vertices in U or V; a given wedge can be traversed twice. Similarly, ρ for ParB can be computed from a modified version of RECEIPT FD where we retrieve all vertices with minimum support in a single iteration. Each round involves multiple (but a constant number of) thread synchronizations, for example, once after computing the list of vertices to peel, once after computing the updates, etc. RECEIPT FD does not incur any synchronization, as the threads in alg.4 only synchronize once at the end of the computation. For the rest of this section, we will use the term large datasets to refer to graphs having a large number of wedges with endpoints in the vertex set being peeled, such as ItU, OrU, etc.

Execution Time (t): With up to 80.8× and 64.7× speedup over BUP and ParB, respectively, RECEIPT is radically faster than both baselines on all datasets. The speedups are typically higher for large datasets, which offer a larger amount of computation to parallelize and bigger benefits from the workload optimizations (sec.4). Only RECEIPT is able to successfully tip decompose TrU. In contrast, ParB achieves at most a 1.6× speedup over sequential BUP (for TrV).

Wedges Traversed (∧): For all datasets, RECEIPT traverses fewer wedges than the existing baselines. RECEIPT’s built-in optimizations achieve up to 64× reduction in wedges traversed. This enables RECEIPT to decompose the large datasets EnU and TrU in a few minutes, unlike the baselines, which take several days for the same.

Synchronization (ρ): In comparison with ParB, RECEIPT reduces the number of thread synchronization rounds by up to 1105×. This is primarily because RECEIPT CD peels vertices with a broad range of support values in every iteration. This drastic reduction in ρ is the primary contributor to RECEIPT’s parallel efficiency.

5.2.2. Effect of Workload Optimizations

Figs.6 and 7 show the effect of the HUC and DGM optimizations (sec.4.1 and 4.2) on the performance of RECEIPT. The execution time closely follows the trend in wedge workload.

Figure 6. Effect of workload optimizations on wedge traversal (normalized with respect to RECEIPT--). RECEIPT- is RECEIPT without DGM; RECEIPT-- is RECEIPT without DGM and HUC.
Figure 7. Effect of workload optimizations on execution time (normalized with respect to RECEIPT--). RECEIPT- is RECEIPT without DGM; RECEIPT-- is RECEIPT without DGM and HUC.

HUC reduces wedge traversal by opting to re-count butterflies instead of peeling in selective iterations. A comparison of the work required for peeling all vertices in U against the work required for counting all butterflies is therefore an indicator of the benefit of HUC. We define the ratio r = ∧^peel / ∧^cnt, where ∧^peel and ∧^cnt denote the number of wedges traversed for peeling all vertices in U (∧^BUP − ∧^pvBcnt in table 3) and for butterfly counting (∧^pvBcnt in table 3), respectively. Datasets with a large value of r (for example ItU, LjU, EnU and TrU, with r > 1000) benefit heavily from HUC, since peeling in high-workload iterations is replaced by re-counting, which dramatically reduces wedge traversal. In particular, for TrU (r > 33,500), HUC enables 57× and 42× reductions in wedge traversal and execution time, respectively. Conversely, in datasets with a small value of r (ItV, DeV, OrV, LjV and EnV, with r < 5 due to low ∧^peel), none of the iterations utilize re-counting. Consequently, the performance of RECEIPT- and RECEIPT-- is similar for these datasets.
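The ratio r can be checked directly from the Table 3 entries; the small program below does so for two of the datasets discussed above (wedge counts in billions). The exact criterion by which HUC chooses re-counting over peeling in an iteration is the one described in sec.4.1; this snippet only verifies the reported ratios.

```cpp
// Worked check of r = ∧^peel / ∧^cnt using Table 3 (wedge counts in billions).
#include <cstdio>

double hucRatio(double wedgesBUP, double wedgesPvBcnt) {
    return (wedgesBUP - wedgesPvBcnt) / wedgesPvBcnt;   // ∧^peel = ∧^BUP - ∧^pvBcnt
}

int main() {
    std::printf("TrU: r = %.0f\n", hucRatio(211156.0, 6.3));  // ~33516 (> 33,500)
    std::printf("ItV: r = %.1f\n", hucRatio(0.57, 0.19));     // 2.0 (< 5)
    return 0;
}
```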

DGM can potentially halve the wedge workload, since each wedge has two endpoints but needs to be traversed only by the endpoint that gets peeled first. However, the reduction is less than 2× in practice because, for many wedges, both endpoints are peeled between consecutive DGM data structure updates. DGM achieves 1.41× and 1.29× average reductions in wedge traversal and execution time, respectively.
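One way to realize the effect described above is to periodically compact the adjacency lists so that already-peeled endpoints are dropped; the sketch below illustrates this idea and is an assumption about the mechanism, not the authors' DGM data structure.

```cpp
// Sketch: drop already-peeled U-vertices from the V-side adjacency lists so
// that wedges whose other endpoint was peeled earlier are never traversed again.
#include <algorithm>
#include <cstdint>
#include <vector>

void compactAdjacency(std::vector<std::vector<uint32_t>>& adjV,
                      const std::vector<bool>& peeledU) {
    for (auto& nbrs : adjV)
        nbrs.erase(std::remove_if(nbrs.begin(), nbrs.end(),
                                  [&](uint32_t u) { return peeledU[u]; }),
                   nbrs.end());
}
```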

5.2.3. RECEIPT Breakup

In this section, we analyze the contribution of each step of RECEIPT to the total work and execution time. We further split RECEIPT CD (alg.2) into a peeling step (denoted RECEIPT CD) and a per-vertex butterfly counting (pvBcnt) step used for support initialization.

Figure 8. Breakup of wedges traversed in RECEIPT.

Fig.8 shows a breakdown of the wedges traversed during different steps as a percentage of the total wedges traversed by RECEIPT. As discussed in sec.3, RECEIPT CD traverses significantly more wedges than RECEIPT FD. For all datasets, RECEIPT FD incurs <15% of the total wedge traversal. Note that for a given graph, the number of wedges explored by pvBcnt is independent of the vertex set being peeled. For example, ItU and ItV both traverse 188 million wedges during pvBcnt. However, its percentage contribution depends on the total wedges explored during the entire tip decomposition, which can vary significantly between the U and V vertex sets (table 3).

Figure 9. Breakup of execution time of RECEIPT.

Fig.9 shows a breakdown of the execution time of the different steps as a percentage of the total execution time of RECEIPT. Intuitively, the step with the most workload, i.e., RECEIPT CD, has the largest contribution (>50% of the total execution time for all datasets). In most datasets with a small value of r = ∧^peel / ∧^cnt (ItV, DeV, OrV, LjV and EnV), even pvBcnt has a significant share of the execution time. Note that for some datasets, the contribution of RECEIPT FD to the execution time is larger than its contribution to the wedge traversal. This is due to the following reasons: (a) RECEIPT FD has computational overheads other than wedge exploration, such as creating induced subgraphs and applying support updates to a heap, and (b) it has lower parallel scalability compared to counting and RECEIPT CD (sec.5.2.4). Still, RECEIPT FD consumes <25% of the overall execution time for almost all datasets.

5.2.4. Parallel Scalability

Figure 10. Parallel Speedup of RECEIPT when peeling set U.
Figure 11. Parallel Speedup of RECEIPT when peeling set V.

To evaluate the scalability of RECEIPT, we measure its performance with 1, 2, 4, 9, 18 and 36 threads. Figs.10 and 11 show the parallel speedup obtained by RECEIPT over single-threaded execution. (We also developed a sequential version of RECEIPT with no synchronization primitives (atomics) and sequential implementations of basic kernels such as prefix scan, scatter, etc.; however, the observed performance difference between the sequential implementation and single-threaded execution of the parallel implementation was negligible.)

For most of the datasets, RECEIPT exhibits almost linear scalability. With 36 threads, RECEIPT achieves up to 17.1× self-relative speedup. In comparison, we observed that ParB exhibits at most 2.3× self-relative speedup with 36 threads. RECEIPT’s speedup is poor for ItV because it is a very small dataset that gets peeled in under 4 seconds. It requires very little computation (0.56 billion wedges) and hence, employing a large number of threads is not useful.

Typically, datasets with a small number of wedges (ItV, LjV, EnV) exhibit lower scalability because, compared to larger datasets, they experience a lower workload per synchronization round on average. For example, LjV traverses 86× fewer wedges than LjU but incurs only 3.2× fewer synchronizations. This increases the relative overhead of parallelization and restricts the parallel scalability of RECEIPT CD, which is the highest-workload step in RECEIPT (fig.8). For example, the parallel speedup of RECEIPT CD with 36 threads is 15.1× for LjU but only 7.1× for LjV.

In RECEIPT FD, parallel speedup is restricted by workload imbalance across the subgraphs. This is because RECEIPT CD tries to balance the total wedge counts of the vertex subsets as seen in the original graph, whereas the work done in RECEIPT FD is determined by the wedges in the induced subgraphs. Consequently, we observed that for some datasets, the parallel scalability of RECEIPT FD is poorer than that of RECEIPT CD. For example, for TrU with 36 threads, the parallel speedup of RECEIPT FD was only 5.3×, compared to 13.1× for RECEIPT CD, 12.5× for counting (pvBcnt) and 10.7× for RECEIPT overall.

Note that even sequential RECEIPT is much faster than BUP because of the following:

  1. Fewer support updates: updates to ⋈_u from all vertices peeled in an iteration are aggregated into a single update (see the sketch after this list).

  2. Less work: reduced wedge traversal due to the HUC and DGM optimizations (sec.4).
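The first point can be illustrated with a minimal sketch of the aggregation idea (an assumed shape, not the authors' code): the support decrements produced while peeling one batch of vertices are accumulated per 2-hop neighbor and applied as a single update to each neighbor, instead of one heap update per peeled vertex.

```cpp
// Sketch: aggregate per-neighbor support updates generated in one peeling iteration.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// 'updates' holds every (2-hop neighbor, butterfly decrement) pair produced while
// peeling the current batch; each neighbor's support is then touched only once.
void applyAggregatedUpdates(const std::vector<std::pair<uint32_t, uint64_t>>& updates,
                            std::vector<uint64_t>& support) {
    std::vector<uint64_t> acc(support.size(), 0);
    for (const auto& [w, delta] : updates) acc[w] += delta;          // aggregate
    for (uint32_t w = 0; w < support.size(); ++w)
        if (acc[w] > 0) support[w] -= std::min(acc[w], support[w]);  // single update
}
```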

We also observe that the slope of the speedup curve decreases from 18 to 36 threads. This could potentially be due to NUMA effects, as RECEIPT does not currently have NUMA-specific optimizations. Up to 18 threads, the execution is limited to a single socket, but 36 threads are spawned across two different sockets.

6. Related Work

Dense subgraph discovery is a crucial analytic used in several graph applications (anomalyDet, ; spamDet, ; communityDet, ; fang2020effective, ; otherapp1, ; otherapp2, ). Researchers have developed a wide array of techniques for mining dense regions in unipartite graphs (sariyuce2016fast, ; fang2019efficient, ; gibson2005discovering, ; sariyuce2018local, ; angel2014dense, ; trussVLDB, ; lee2010survey, ; coreVLDB, ; coreVLDBJ, ; wang2018efficient, ). Among these, motif-based techniques have gained tremendous interest in the graph community (PMID:16873465, ; trussVLDB, ; tsourakakis2017scalable, ; tsourakakis2014novel, ; sariyucePeeling, ; aksoy2017measuring, ; wang2010triangulation, ). Motifs like triangles represent a quantum of cohesion in graphs; hence, the number of motifs incident on a vertex or an edge indicates its involvement in dense subgraphs. Several recent works focus on efficiently finding such motifs in graphs (ahmed2015efficient, ; shiParbutterfly, ; wangButterfly, ; hu2018tricore, ; fox2018fast, ; ma2019linc, ).

Nucleus decomposition is one such popular motif-based dense graph mining technique. It considers the distribution of motifs across vertices or edges within a subgraph as an indicator of its density (10.1007/s10115-016-0965-5, ), resulting in better-quality dense subgraphs compared to motif counting (10.1007/s10115-016-0965-5, ; sariyuce2015finding, ). For example, truss decomposition mines subgraphs called k-trusses, where every edge participates in at least k−2 triangles within the subgraph. Truss decomposition also appears as one of the three tasks in the popular GraphChallenge (samsi2017static, ), which has resulted in highly efficient parallel solutions scalable to billion-edge graphs (date2017collaborative, ; voegele2017parallel, ; smith2017truss, ; green2017quickly, ). However, such solutions cannot be directly applied to bipartite graphs due to the absence of triangles. Chakravarthy et al. (chakaravarthy2018improved, ) propose a distributed truss decomposition algorithm that trades off computational efficiency to reduce synchronization. This approach requires triangle enumeration and cannot be adopted for tip decomposition due to prohibitive space and computational requirements.

The simplest non-trivial motif in a bipartite graph is the butterfly (2,2-biclique, quadrangle). Several algorithms covering various aspects of butterfly counting have been developed: in-memory or external memory (wangButterfly, ; wangRectangle, ), exact or approximate counting (sanei2018butterfly, ; sanei2019fleet, ), and parallel counting on various platforms (shiParbutterfly, ; wangButterfly, ; wangRectangle, ). In particular, the in-memory algorithms for exact counting are relevant to our work. Wang et al. (wangRectangle, ) count rectangles in bipartite graphs by traversing wedges, with O(∑_{u∈W} d_u²) complexity. Sanei-Mehri et al. (sanei2018butterfly, ) reduce this complexity to O(min(∑_{u∈U} d_u², ∑_{v∈V} d_v²)) by choosing the vertex set on which fewer wedges have endpoints.
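The wedge-traversal approach to exact counting can be summarized with the simplified sketch below; it illustrates the general idea rather than faithfully reimplementing any of the cited algorithms. Choosing the side with fewer wedge endpoints, as Sanei-Mehri et al. do, corresponds to calling it with the cheaper of the two orientations.

```cpp
// Sketch: exact butterfly counting by wedge traversal. For each vertex u on the
// chosen side, count wedges to every 2-hop neighbor w; each pair of wedges
// between u and w closes one butterfly.
#include <cstdint>
#include <unordered_map>
#include <vector>

uint64_t countButterflies(const std::vector<std::vector<uint32_t>>& adjU,   // chosen side
                          const std::vector<std::vector<uint32_t>>& adjV) { // other side
    uint64_t total = 0;
    for (uint32_t u = 0; u < adjU.size(); ++u) {
        std::unordered_map<uint32_t, uint64_t> wedges;   // w -> #wedges (u, *, w)
        for (uint32_t v : adjU[u])
            for (uint32_t w : adjV[v])
                if (w > u) ++wedges[w];                  // count each (u, w) pair once
        for (const auto& [w, c] : wedges) total += c * (c - 1) / 2;
    }
    return total;
}
```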

Before the aforementioned works, Chiba and Nishizeki (chibaArboricity, ) had proposed an O(α·m)-complexity vertex-priority-based quadrangle counting algorithm for generic graphs, where α denotes the arboricity of the graph. Wang et al. (wangButterfly, ) further propose a cache-optimized variant of this algorithm and use shared-memory parallelism for acceleration. Independently, Shi et al. (shiParbutterfly, ) develop provably efficient shared-memory parallel implementations of vertex-priority-based counting. All of these approaches are amenable to per-vertex or per-edge counting.

Butterfly-based decomposition, albeit highly effective in finding quality dense regions in bipartite graphs, is computationally much more expensive than counting. Sariyuce et al. (sariyucePeeling, ) defined k-tips and k-wings as subgraphs with at least k butterflies incident on every vertex and every edge, respectively, and correspondingly defined the problems of tip decomposition and wing decomposition. Their algorithms use bottom-up peeling with respective complexities of O(∑_{v∈V} d_v²) and O(∑_{(u,v)∈E} ∑_{w∈N_v} (d_u + d_w)). Independently, Zou (zouBitruss, ) defined the notion of bitruss, similar to the k-wing, and showed its utility for bipartite network analysis. Shi et al. (shiParbutterfly, ) propose the parallel bottom-up peeling used as a baseline in our evaluation. Their wing decomposition uses hash tables to store edges and has a complexity of O(∑_{(u,v)∈E} ∑_{w∈N_v} min(d_u, d_w)).

In their seminal paper, Chiba and Nishizeki (chibaArboricity, ) remarked that butterflies can be represented by storing the O(α·m) triples traversed during counting. In a very recent work, Wang et al. (wangBitruss, ) store these triples in the form of maximal bloom (biclique) structures, which enables quick retrieval of the butterflies shared between edges. Based on this indexing, the authors develop the Bit-BU algorithm for peeling bipartite networks. To the best of our knowledge, it is the most efficient sequential algorithm, performing wing decomposition in O(⋈_G) time. Yet, it takes more than 15 hours to process the Livejournal (Lj) dataset, whereas RECEIPT can tip decompose both LjU and LjV in less than 16 minutes. Moreover, the bloom-based indexing has a non-trivial space requirement of O(∑_{(u,v)∈E} min(d_u, d_v)), which in practice amounts to hundreds of gigabytes for datasets like Orkut (Or). Comparatively, RECEIPT has a space complexity of O(n·T + m) (same as parallel counting (wangButterfly, )) and consumes only a few gigabytes for all the datasets used in sec.5. We also note that the fundamental ideas employed in RECEIPT and Bit-BU are complementary: while RECEIPT attempts to exploit parallelism across the k-tip hierarchy and reduce the problem size for exact bottom-up peeling, Bit-BU tries to make peeling every edge more efficient. We believe that an amalgamation of our ideas with (wangBitruss, ) can produce highly scalable parallel solutions for peeling large bipartite graphs.

7. Extensions

In this section, we discuss opportunities for future research in the context of our work:

  • Parallel Edge Peeling: RECEIPT can be easily adapted for parallel wing decomposition (edge peeling) in bipartite graphs (wangBitruss, ; sariyucePeeling, ). The workload-reduction optimizations in RECEIPT can have an even greater impact on edge peeling due to its higher complexity and the smaller range of wing numbers. However, conflicts can arise during parallel edge peeling, as multiple edges of the same butterfly may be deleted in the same iteration. Only one of the peeled edges should update the support of the other edges in that butterfly, which can be achieved by imposing a priority ordering on the edges (a possible conflict rule is sketched after this list).

  • Distributed Tip Decomposition: Distributed-memory systems (such as multi-node clusters) offer a large amount of extensible computational resources and are widely used to scale high-complexity graph analytics (chakaravarthy2018improved, ; bhattarai2019ceci, ; lakhotia13planting, ).

    We believe that the fundamental concept of creating independent tip number ranges and vertex subsets can be very useful in exposing parallelism for distributed-memory algorithms. In the past, certain distributed algorithms for graph processing have achieved parallel scalability by creating independent parallel tasks (lakhotia13planting, ; arifuzzaman2013patric, ). However, the support updates generated by peeling may need to be communicated over the network. This can affect the scalability of the algorithm because of the high communication cost in clusters (lumsdaine2007challenges, ; mcsherry2015scalability, ). Further, execution on distributed systems may exhibit load imbalance due to the larger number of threads and the limitations of dynamic task scheduling across processes.

  • System Optimizations: In this work, we have focused primarily on algorithmic optimizations. However, other aspects such as memory access optimizations (wei2016speedup, ; lakhotia2018accelerating, ) and SIMD parallelism (10.1145/3183713.3196924, ) also have a significant impact on graph analytics. Enhancing memory access locality can additionally mitigate the NUMA effects that limit parallel speedup (sec.5.2.4).

    RECEIPT CD and RECEIPT FD exploit parallelism at vertex and subgraph granularity, respectively. Using fine-grained parallelism at edge or wedge granularity can further improve load balance.
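For the parallel edge peeling extension above, the following sketch illustrates one possible conflict rule; it is an assumption for illustration, not a mechanism specified in this paper: among the edges of a butterfly that are peeled in the same iteration, only the one with the smallest ID applies the support updates for that butterfly.

```cpp
// Sketch: designate a single updater among concurrently peeled edges of a butterfly.
#include <algorithm>
#include <cstdint>
#include <vector>

// 'butterflyEdges' are the IDs of the four edges of one butterfly; 'peeledNow'
// marks edges peeled in the current iteration. Edge e applies the update only
// if it is the lowest-ID edge of the butterfly peeled in this iteration.
bool isDesignatedUpdater(uint32_t e,
                         const std::vector<uint32_t>& butterflyEdges,
                         const std::vector<bool>& peeledNow) {
    uint32_t minPeeled = UINT32_MAX;
    for (uint32_t f : butterflyEdges)
        if (peeledNow[f]) minPeeled = std::min(minPeeled, f);
    return e == minPeeled;
}
```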

8. Conclusion

In this paper, we proposed RECEIPT, a novel shared-memory parallel algorithm for tip decomposition. RECEIPT is the first algorithm that exploits the massive parallelism across different levels of the k-tip hierarchy. Further, we developed pragmatic optimizations that drastically improve the computational efficiency of RECEIPT.

We empirically evaluated our approach on a shared-memory multicore server and showed that it can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms, with dramatic reductions in synchronization and wedge traversal. Using 36 threads, RECEIPT provides up to 17.1× self-relative speedup, making it much more scalable than the best available parallel algorithms for tip decomposition.

We also explored the generalizability of RECEIPT to wing decomposition (edge peeling) and several interesting avenues for future work in this direction. We believe that scalable algorithms for parallel systems such as many-core CPUs, GPUs and HPC clusters can enhance the applicability of tip and wing decomposition, and our work is a step in that direction.

References

  • [1] N. K. Ahmed, J. Neville, R. A. Rossi, and N. Duffield. Efficient graphlet counting for large networks. In 2015 IEEE International Conference on Data Mining, pages 1–10. IEEE, 2015.
  • [2] N. K. Ahmed, J. Neville, R. A. Rossi, N. G. Duffield, and T. L. Willke. Graphlet decomposition: Framework, algorithms, and applications. Knowledge and Information Systems, 50(3):689–722, Mar. 2017.
  • [3] S. G. Aksoy, T. G. Kolda, and A. Pinar. Measuring and modeling bipartite graphs with community structure. Journal of Complex Networks, 5(4):581–603, 2017.
  • [4] A. Angel, N. Koudas, N. Sarkas, D. Srivastava, M. Svendsen, and S. Tirthapura. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. The VLDB journal, 23(2):175–199, 2014.
  • [5] S. Arifuzzaman, M. Khan, and M. Marathe. Patric: A parallel algorithm for counting triangles in massive networks. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 529–538, 2013.
  • [6] B. Bhattarai, H. Liu, and H. H. Huang. Ceci: Compact embedding cluster index for scalable subgraph matching. In Proceedings of the 2019 International Conference on Management of Data, pages 1447–1462, 2019.
  • [7] F. Bonchi, A. Khan, and L. Severini. Distance-generalized core decomposition. In Proceedings of the 2019 International Conference on Management of Data, pages 1006–1023, 2019.
  • [8] V. T. Chakaravarthy, A. Goyal, P. Murali, S. S. Pandian, and Y. Sabharwal. Improved distributed algorithm for graph truss decomposition. In European Conference on Parallel Processing, pages 703–717. Springer, 2018.
  • [9] J. Chen and Y. Saad. Dense subgraph extraction with application to community detection. IEEE Transactions on knowledge and data engineering, 24(7):1216–1230, 2010.
  • [10] N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on computing, 14(1):210–223, 1985.
  • [11] K. Date, K. Feng, R. Nagi, J. Xiong, N. S. Kim, and W.-M. Hwu. Collaborative (cpu+ gpu) algorithms for triangle counting and truss decomposition on the minsky architecture: Static graph challenge: Subgraph isomorphism. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7. IEEE, 2017.
  • [12] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–274, 2001.
  • [13] L. Dhulipala, G. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 293–304, 2017.
  • [14] A. Epasto, S. Lattanzi, V. Mirrokni, I. O. Sebe, A. Taei, and S. Verma. Ego-net community mining applied to friend suggestion. Proceedings of the VLDB Endowment, 9(4):324–335, 2015.
  • [15] Y. Fang, Y. Yang, W. Zhang, X. Lin, and X. Cao. Effective and efficient community search over large heterogeneous information networks. Proceedings of the VLDB Endowment, 13(6):854–867, 2020.
  • [16] Y. Fang, K. Yu, R. Cheng, L. V. Lakshmanan, and X. Lin. Efficient algorithms for densest subgraph discovery. Proceedings of the VLDB Endowment, 12(11):1719–1732, 2019.
  • [17] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh. Exploiting burstiness in reviews for review spammer detection. In Seventh international AAAI conference on weblogs and social media, 2013.
  • [18] J. Fox, O. Green, K. Gabert, X. An, and D. A. Bader. Fast and adaptive list intersections on the gpu. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–7. IEEE, 2018.
  • [19] E. Fratkin, B. T. Naughton, D. L. Brutlag, and S. Batzoglou. Motifcut: regulatory motifs finding with maximum density subgraphs. Bioinformatics (Oxford, England), 22(14):e150—7, July 2006.
  • [20] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st international conference on Very large data bases, pages 721–732, 2005.
  • [22] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM journal on Applied Mathematics, 17(2):416–429, 1969.
  • [23] O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, et al. Quickly finding a truss in a haystack. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7. IEEE, 2017.
  • [24] S. Han, L. Zou, and J. X. Yu. Speeding up set intersections in graph algorithms using simd instructions. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, page 1587–1602, New York, NY, USA, 2018. Association for Computing Machinery.
  • [25] Y. Hu, H. Liu, and H. H. Huang. Tricore: Parallel triangle counting on gpus. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 171–182. IEEE, 2018.
  • [26] Z. Huang, D. D. Zeng, and H. Chen. Analyzing consumer-product graphs: Empirical findings and applications in recommender systems. Management science, 53(7):1146–1164, 2007.
  • [27] W. Khaouid, M. Barsky, V. Srinivasan, and A. Thomo. K-core decomposition of large networks on a single pc. Proceedings of the VLDB Endowment, 9(1):13–23, 2015.
  • [28] J. Kunegis. Konect: the koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350, 2013.
  • [29] K. Lakhotia, R. Kannan, Q. Dong, and V. Prasanna. Planting trees for scalable and efficient canonical hub labeling. Proceedings of the VLDB Endowment, 13(4).
  • [30] K. Lakhotia, R. Kannan, and V. Prasanna. Accelerating pagerank using partition-centric processing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 427–440, 2018.
  • [31] V. E. Lee, N. Ruan, R. Jin, and C. Aggarwal. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pages 303–336. Springer, 2010.
  • [32] E. A. Leicht, P. Holme, and M. E. Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.
  • [33] W. Li, M. Qiao, L. Qin, Y. Zhang, L. Chang, and X. Lin. Scaling distance labeling on small-world networks. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, page 1060–1077, New York, NY, USA, 2019. Association for Computing Machinery.
  • [34] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(01):5–20, 2007.
  • [35] C. Ma, R. Cheng, L. V. Lakshmanan, T. Grubenmann, Y. Fang, and X. Li. Linc: a motif counting algorithm for uncertain graphs. Proceedings of the VLDB Endowment, 13(2):155–168, 2019.
  • [36] F. D. Malliaros, C. Giatsidis, A. N. Papadopoulos, and M. Vazirgiannis. The core decomposition of networks: Theory, algorithms and applications. The VLDB Journal, pages 1–32, 2019.
  • [37] F. McSherry, M. Isard, and D. G. Murray. Scalability! But at what COST? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
  • [38] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA, October 2007.
  • [39] A. Mukherjee, B. Liu, and N. Glance. Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st international conference on World Wide Web, pages 191–200, 2012.
  • [40] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 419–432, 2008.
  • [41] M. E. Newman. The structure of scientific collaboration networks. Proceedings of the national academy of sciences, 98(2):404–409, 2001.
  • [42] M. E. J. Newman. Scientific collaboration networks. i. network construction and fundamental results. Phys. Rev. E, 64:016131, Jun 2001.
  • [43] H.-M. Park, S.-H. Myaeng, and U. Kang. Pte: Enumerating trillion triangles on distributed systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. ACM Press, 2016.
  • [44] P. Rozenshtein, A. Anagnostopoulos, A. Gionis, and N. Tatti. Event detection in activity networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1176–1185, 2014.
  • [45] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, et al. Static graph challenge: Subgraph isomorphism. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2017.
  • [47] S.-V. Sanei-Mehri, A. E. Sariyuce, and S. Tirthapura. Butterfly counting in bipartite networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2150–2159, 2018.
  • [48] S.-V. Sanei-Mehri, Y. Zhang, A. E. Sariyüce, and S. Tirthapura. Fleet: Butterfly estimation from a bipartite graph stream. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1201–1210, 2019.
  • [49] A. E. Sarıyüce and A. Pinar. Fast hierarchy construction for dense subgraphs. Proceedings of the VLDB Endowment, 10(3), 2016.
  • [50] A. E. Sariyuce and A. Pinar. Peeling bipartite networks for dense subgraph discovery. arXiv preprint arXiv:1611.02756, 2016.
  • [51] A. E. Sarıyüce and A. Pinar. Peeling bipartite networks for dense subgraph discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 504–512, 2018.
  • [52] A. E. Sariyüce, C. Seshadhri, and A. Pinar. Local algorithms for hierarchical dense subgraph discovery. Proceedings of the VLDB Endowment, 12(1):43–56, 2018.
  • [53] A. E. Sariyuce, C. Seshadhri, A. Pinar, and U. V. Catalyurek. Finding the hierarchy of dense subgraphs using nucleus decompositions. In Proceedings of the 24th International Conference on World Wide Web, pages 927–937, 2015.
  • [54] J. Shi and J. Shun. Parallel algorithms for butterfly computations. arXiv preprint arXiv:1907.08607, 2019.
  • [55] S. Smith, X. Liu, N. K. Ahmed, A. S. Tom, F. Petrini, and G. Karypis. Truss decomposition on shared-memory parallel systems. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2017.
  • [56] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 8–pp. IEEE, 2005.
  • [57] C. E. Tsourakakis. A novel approach to finding near-cliques: The triangle-densest subgraph problem. arXiv preprint arXiv:1405.1477, 2014.
  • [58] C. E. Tsourakakis, J. Pachocki, and M. Mitzenmacher. Scalable motif-aware graph clustering. In Proceedings of the 26th International Conference on World Wide Web, pages 1451–1460, 2017.
  • [59] C. Voegele, Y.-S. Lu, S. Pai, and K. Pingali. Parallel triangle counting and k-truss identification using graph-centric methods. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7. IEEE, 2017.
  • [60] J. Wang and J. Cheng. Truss decomposition in massive networks. Proceedings of the VLDB Endowment, 5(9), 2012.
  • [61] J. Wang, A. W.-C. Fu, and J. Cheng. Rectangle counting in large bipartite graphs. In 2014 IEEE International Congress on Big Data, pages 17–24. IEEE, 2014.
  • [62] K. Wang, X. Cao, X. Lin, W. Zhang, and L. Qin. Efficient computing of radius-bounded k-cores. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 233–244. IEEE, 2018.
  • [63] K. Wang, X. Lin, L. Qin, W. Zhang, and Y. Zhang. Vertex priority based butterfly counting for large-scale bipartite networks. Proceedings of the VLDB Endowment, 12(10):1139–1152, 2019.
  • [64] K. Wang, X. Lin, L. Qin, W. Zhang, and Y. Zhang. Efficient bitruss decomposition for large-scale bipartite graphs. arXiv preprint arXiv:2001.06111, 2020.
  • [65] N. Wang, J. Zhang, K.-L. Tan, and A. K. Tung. On triangulation-based dense neighborhood graph discovery. Proceedings of the VLDB Endowment, 4(2):58–68, 2010.
  • [66] H. Wei, J. X. Yu, C. Lu, and X. Lin. Speedup graph processing by graph ordering. In Proceedings of the 2016 International Conference on Management of Data, pages 1813–1828. ACM, 2016.
  • [67] D. Wen, L. Qin, Y. Zhang, X. Lin, and J. X. Yu. I/o efficient core graph decomposition: application to degeneracy ordering. IEEE Transactions on Knowledge and Data Engineering, 31(1):75–90, 2018.
  • [68] Y. Yang, L. Chu, Y. Zhang, Z. Wang, J. Pei, and E. Chen. Mining density contrast subgraphs. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 221–232. IEEE, 2018.
  • [69] Z. Zou. Bitruss decomposition of bipartite graphs. In International Conference on Database Systems for Advanced Applications, pages 218–233. Springer, 2016.