
thanks: * Zi Chen is the corresponding author of this paper.
thanks: L. Yuan and Z. Zhou are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: {zeyuzhou,longyuan}@njust.edu.cn.
X. Lin is with the School of Computer Science and Engineering, the University of New South Wales, Sydney, Australia. E-mail: [email protected].
Z. Chen is with the Software Engineering Institute, East China Normal University, Shanghai, China. E-mail: [email protected].
X. Zhao is with the College of Systems Engineering, National University of Defense Technology, Changsha, China. E-mail: [email protected].
F. Zhang is with the Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China. E-mail: [email protected].
thanks: Manuscript received xxx; revised xxx.

$\mathsf{GPUSCAN^{++}}$: Efficient Structural Graph Clustering on GPUs

Long Yuan, Zeyu Zhou, Xuemin Lin, Zi Chen, Xiang Zhao, Fan Zhang
Abstract

Structural clustering is one of the most popular graph clustering methods and has achieved significant performance improvements by utilizing GPUs. Nevertheless, the state-of-the-art GPU-based structural clustering algorithm, $\mathsf{GPUSCAN}$, still suffers from efficiency issues, since considerable extra cost is introduced for parallelization. Moreover, $\mathsf{GPUSCAN}$ assumes that the graph resides in GPU memory. However, GPU memory capacity is currently limited, while many real-world graphs are large and cannot fit in GPU memory, which makes $\mathsf{GPUSCAN}$ unable to handle large graphs. Motivated by this, we present a new GPU-based structural clustering algorithm, $\mathsf{GPUSCAN^{++}}$, in this paper. To address the efficiency issue, we propose a new progressive clustering method tailored for GPUs that not only avoids high parallelization costs but also fully exploits the computing resources of GPUs. To address the GPU memory limitation, we propose a partition-based structural clustering algorithm that can process large graphs with limited GPU memory. We conduct experiments on real graphs, and the experimental results demonstrate that our algorithm achieves up to 168 times speedup compared with the state-of-the-art GPU-based algorithm when the graph fits in GPU memory. Moreover, our algorithm scales to large graphs. For example, it can finish structural clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

Index Terms:
structural graph clustering, GPU, graph algorithm

1 Introduction

Graph clustering is one of the most fundamental problems in analyzing graph data. It has been extensively studied [14, 24, 27, 40, 42] and frequently utilized in many applications, such as natural language processing [3], recommendation systems [2], social and biological network analysis [13, 21], and load balancing in distributed systems [11].

One well-known method of graph clustering is structural clustering, introduced via the Structural Clustering Algorithm for Networks ($\mathsf{SCAN}$) [40]. Structural clustering defines the structural similarity between two vertices, and two vertices whose structural similarity is not less than a given parameter $\epsilon$ are considered similar. A core is a vertex that has at least $\mu$ similar neighbors, where $\mu$ is also a given parameter. A cluster contains the cores and the vertices that are structurally similar to the cores. If a vertex does not belong to any cluster but its neighbors belong to more than one cluster, then it is a hub; otherwise, it is an outlier. Fig. 1 shows the result of structural clustering with $\epsilon=0.6$ and $\mu=3$ on a graph $G$, in which the two clusters are colored gray, cores are colored black, and hubs and outliers are labeled. While most graph clustering approaches aim to partition the vertices into disjoint sets, structural clustering is also able to find hub vertices that connect different clusters and outlier vertices that lack strong ties to any cluster, which is important for mining various complex networks [40, 16, 38]. Structural clustering has been successfully used to cluster biological data [10, 21] and social media data [20, 18, 28, 29].

Figure 1: Clustering result of $\mathsf{SCAN}$ ($\epsilon=0.6$, $\mu=3$) on $G$

Motivation. In recent years, the rapid evolution of the Graphics Processing Unit (GPU) has attracted extensive attention in both industry and academia, and GPUs have been used successfully to speed up various graph algorithms [19, 15, 41, 31, 32]. The massively parallel hardware resources in GPUs also offer promising opportunities for accelerating structural clustering. In the literature, a GPU-based algorithm named $\mathsf{GPUSCAN}$ has been proposed [36]. By redesigning the computation steps of structural clustering, $\mathsf{GPUSCAN}$ significantly improves clustering performance compared with the state-of-the-art serial algorithm. However, it has the following drawbacks:

  • Prohibitive parallelization overhead. $\mathsf{GPUSCAN}$ aims to fully utilize the massively parallel hardware resources of the GPU. However, considerable extra computation is introduced to achieve this goal. Given a graph $G=(V,E)$, the state-of-the-art serial algorithm [7] can finish the clustering with $O(m\cdot\mathsf{deg}_{\mathsf{max}})$ work, while $\mathsf{GPUSCAN}$ needs $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$ work (we use the work-span framework to analyze parallel algorithms [9]: the work of an algorithm is the number of operations it performs, and the span is the length of its longest sequential dependence), where $m$ denotes the number of edges in the graph, $\mathsf{deg}_{\mathsf{max}}$ denotes the maximum vertex degree in $G$, and $c$ is an integer determined by the given graph $G$ and the parameters $\epsilon$ and $\mu$ that is not small, as verified in our experiments. This prohibitive parallelization overhead makes $\mathsf{GPUSCAN}$ inefficient for structural clustering.

  • GPU memory capacity limitation. Compared with common server machines equipped with hundreds of gigabytes or even terabytes of main memory, GPU memory capacity is currently very limited. $\mathsf{GPUSCAN}$ assumes that the graph resides in GPU memory. Nevertheless, many real-world graphs are large and may not fit entirely in GPU memory. This memory capacity limitation significantly restricts the scale of graphs that $\mathsf{GPUSCAN}$ can handle.

Motivated by this, in this paper, we aim to devise a new efficient GPU-based structural clustering algorithm that can handle large graphs.

Our solution. We address the drawbacks of $\mathsf{GPUSCAN}$ and propose new GPU-based algorithms for structural clustering in this paper. Specifically, to remove the prohibitive parallelization overhead, we devise a CSR-enhanced graph storage structure and adopt a progressive clustering paradigm with which unnecessary similarity computation can be pruned. Following this paradigm, we develop a new GPU-based $\mathsf{SCAN}$ algorithm with careful consideration of total work and parallelism, which are crucial to the performance of GPU programs. Our new algorithm significantly reduces the parallelization overhead. We theoretically prove that the work of our new algorithm is reduced to $O(\Sigma_{(u,v)\in E(G)}\mathsf{deg}(u)\cdot\log\mathsf{deg}(v))$, and experimental results show that our algorithm achieves an 85 times speedup on average compared with $\mathsf{GPUSCAN}$.

For the GPU memory capacity limitation, a direct solution is to use the Unified Virtual Memory (UVM) provided by recent GPUs. UVM utilizes the large host main memory of common server machines and allows GPU memory to be oversubscribed as long as there is sufficient host main memory to back the allocated memory. The driver and the runtime system handle data movement between host main memory and GPU memory without the programmer's involvement, which allows running GPU applications that would otherwise be infeasible due to the large size of datasets [12]. However, the performance of this approach is not competitive, due to the poor data locality caused by irregular memory accesses during structural clustering and the relatively slow I/O bus connecting host memory and GPU memory. Therefore, we propose a new partition-based structural clustering algorithm. Specifically, we partition the graph into several specified subgraphs. When conducting the clustering, we only need to load the specified subgraphs into GPU memory, and we can guarantee that the clustering results are the same as when loading the whole graph into GPU memory, which not only addresses the GPU memory limitation of $\mathsf{GPUSCAN}$ but also avoids the data movement overheads of the UVM-based approach.

Contributions. We make the following contributions:

(A) We theoretically analyze the performance of the state-of-the-art GPU-based structural clustering algorithm $\mathsf{GPUSCAN}$ and reveal the key reasons for its inefficiency. Following this analysis, we propose a new progressive clustering algorithm tailored for GPUs that not only avoids the prohibitive parallelization overhead but also fully exploits the computing power of GPUs.

(B) To overcome the GPU memory capacity limitation, we devise a new out-of-core GPU-based SCAN algorithm. By partitioning the original graph into a series of smaller subgraphs, our algorithm can significantly reduce the GPU memory requirement when conducting the clustering and scale to large data graphs beyond the GPU memory.

(C) We conduct extensive performance studies using ten real graphs. The experimental results show that our proposed algorithm achieves up to 168 times speedup (85 times on average) compared with the state-of-the-art GPU-based SCAN algorithm. Moreover, our algorithm can finish structural clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

2 Preliminaries

2.1 Problem Definition

Let $G=(V,E)$ be an undirected and unweighted graph, where $V(G)$ is the set of vertices and $E(G)$ is the set of edges. We denote the number of vertices by $n$ and the number of edges by $m$, i.e., $n=|V(G)|$ and $m=|E(G)|$. For a vertex $v\in V(G)$, we use $\mathsf{nbr}(v,G)$ to denote the set of neighbors of $v$. The degree of a vertex $v\in V(G)$, denoted by $\mathsf{deg}(v,G)$, is the number of neighbors of $v$, i.e., $\mathsf{deg}(v,G)=|\mathsf{nbr}(v,G)|$. For simplicity, we omit $G$ in the notations when the context is self-evident.

Definition 2.1: (Structural Neighborhood) Given a vertex $v$ in $G$, the structural neighborhood of $v$, denoted by $N[v]$, is defined as $N[v]=\{u\in V(G)\,|\,(u,v)\in E(G)\}\cup\{v\}$. $\Box$

Definition 2.2: (Structural Similarity) Given two vertices $u$ and $v$ in $G$, the structural similarity between $u$ and $v$, denoted by $\sigma(u,v)$, is defined as:

$\sigma(u,v)=\frac{|N[u]\cap N[v]|}{\sqrt{|N[u]||N[v]|}}$   (1)

$\Box$

Definition 2.3: ($\epsilon$-Neighborhood) Given a similarity threshold $0<\epsilon\leq 1$, the $\epsilon$-neighborhood of $u$, denoted by $N_{\epsilon}[u]$, is defined as the subset of $N[u]$ in which every vertex $v$ satisfies $\sigma(u,v)\geq\epsilon$, i.e., $N_{\epsilon}[u]=\{v\,|\,v\in N[u]\wedge\sigma(u,v)\geq\epsilon\}$. $\Box$

Note that the $\epsilon$-neighborhood of a given vertex $u$ contains $u$ itself, as $\sigma(u,u)=1$. When the number of $\epsilon$-neighbors of a vertex is large enough, it becomes a core vertex:

Definition 2.4: (Core) Given a structural similarity threshold $0<\epsilon\leq 1$ and an integer $\mu\geq 2$, a vertex $u$ is a core vertex if $|N_{\epsilon}[u]|\geq\mu$. $\Box$

Given a core vertex $u$, the vertices structurally reachable from $u$ are defined as follows:

Definition 2.5: (Structural Reachability) Given two parameters $0<\epsilon\leq 1$ and $\mu\geq 2$, for two vertices $u$ and $v$, $v$ is structurally reachable from vertex $u$ if there is a sequence of vertices $v_{1},v_{2},\ldots,v_{l}\in V$ such that: (i) $v_{1}=u$, $v_{l}=v$; (ii) $v_{1},v_{2},\ldots,v_{l-1}$ are core vertices; and (iii) $v_{i+1}\in N_{\epsilon}[v_{i}]$ for each $1\leq i\leq l-1$. $\Box$

Intuitively, a cluster is a set of vertices that are structurally reachable from a core vertex. Formally,

Definition 2.6: (Cluster) A cluster $C$ is a non-empty subset of $V$ such that:

  • (Connectivity) For any two vertices $v_{1},v_{2}\in C$, there exists a vertex $u\in C$ such that both $v_{1}$ and $v_{2}$ are structurally reachable from $u$.

  • (Maximality) If a core vertex $u\in C$, then all vertices that are structurally reachable from $u$ are also in $C$.

$\Box$

Definition 2.7: (Hub and Outlier) Given the set of clusters in graph $G$, a vertex $u$ that is not in any cluster is a hub vertex if its neighbors belong to two or more clusters, and an outlier vertex otherwise. $\Box$
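Definitions 2.1-2.7 together admit a simple serial clustering procedure, which the following Python sketch illustrates (an illustrative reference implementation, not the paper's GPU algorithm; for simplicity, a border vertex is assigned to the first cluster that reaches it):

```python
from collections import deque
from math import sqrt

def scan(adj, eps, mu):
    """Serial SCAN per Definitions 2.1-2.7. adj maps a vertex to its neighbor set."""
    N = {v: adj[v] | {v} for v in adj}                       # structural neighborhood N[v]
    sigma = lambda u, v: len(N[u] & N[v]) / sqrt(len(N[u]) * len(N[v]))
    # eps-neighborhood of each vertex (contains the vertex itself, as sigma(u,u)=1)
    eps_nbr = {u: {v for v in N[u] if sigma(u, v) >= eps} for u in adj}
    cores = {u for u in adj if len(eps_nbr[u]) >= mu}
    # grow clusters: expand from each unassigned core through eps-neighborhoods of cores
    cluster, cid = {}, 0
    for c in cores:
        if c in cluster:
            continue
        cluster[c] = cid
        queue = deque([c])
        while queue:
            u = queue.popleft()
            for v in eps_nbr[u]:
                if v not in cluster:
                    cluster[v] = cid
                    if v in cores:                           # only cores keep expanding
                        queue.append(v)
        cid += 1
    # among unassigned vertices: hub if neighbors span >= 2 clusters, else outlier
    role = {}
    for u in adj:
        if u in cluster:
            continue
        seen = {cluster[v] for v in adj[u] if v in cluster}
        role[u] = 'hub' if len(seen) >= 2 else 'outlier'
    return cluster, role
```

For instance, on a toy graph made of two 4-cliques bridged by a single vertex, the sketch reports the two cliques as clusters and the bridge vertex as a hub.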

Following the above definitions, it is clear that two vertices that are not adjacent in the graph have no effect on the clustering results even if they are similar. Therefore, we only need to consider vertex pairs incident to an edge. In the following, for an edge $(u,v)$, we use the similarity of the edge $(u,v)$ and the similarity of $u$ and $v$ interchangeably for brevity. We summarize the notations used in this paper in Table I.

Symbol Description
$G=(V,E)$: graph with vertex set $V$ and edge set $E$
$V(G)/E(G)$: all vertices/edges in $G$
$\mathsf{nbr}(u)/\mathsf{deg}(u)$: neighbors/degree of a vertex $u$
$n/m$: number of vertices/edges in $G$
$\mathsf{deg}_{\mathsf{max}}$: maximum degree in $G$
?⃝: the role of a vertex or the similarity of an edge is unknown
C⃝/!C⃝: the vertex is a core/non-core vertex
S⃝/!S⃝: the edge is similar/dissimilar
H⃝/O⃝: the vertex is a hub/outlier
TABLE I: Notations

Example 2.1: Consider $G$ shown in Fig. 1 with $\epsilon=0.6$ and $\mu=3$; the structural similarity of every pair of adjacent vertices is shown near the edge. $v_{0}$, $v_{1}$, $v_{4}$, $v_{7}$, $v_{9}$, $v_{10}$, $v_{11}$, $v_{12}$, and $v_{13}$ are core vertices, as each has at least $\mu-1=2$ similar neighbors in addition to itself. These core vertices form two clusters, which are marked in gray. $v_{8}$ is a hub, as its neighbors $v_{2}$ and $v_{9}$ are in different clusters, while $v_{3}$, $v_{5}$, and $v_{6}$ are outliers. Due to limited space, in the following examples we only show procedures on the subgraph $G^{\prime}$ of $G$. $\Box$

Problem Statement. Given a graph $G=(V,E)$ and two parameters $0<\epsilon\leq 1$ and $\mu\geq 2$, structural graph clustering aims to efficiently compute all the clusters $\mathbb{C}$ in $G$ and identify the corresponding role of each vertex. In this paper, we aim to accelerate the clustering performance of $\mathsf{SCAN}$ with GPUs.

2.2 GPU Architecture and CUDA Programming Platform

GPU Architecture. A GPU consists of multiple streaming multiprocessors (SMs). Each SM has a large number of GPU cores. The cores in each SM work in a single-instruction multiple-data (SIMD) fashion (single-instruction multiple-thread, SIMT, in NVIDIA's terms), i.e., they execute the same instructions at the same time on different input data. The smallest execution units of a GPU are warps. In NVIDIA's GPU architectures, a warp typically contains 32 threads. Threads in the same warp that finish their tasks earlier have to wait until the other threads finish their computations before the next warp of threads is swapped in. A GPU is equipped with several types of memory. All SMs in the GPU share a larger but slower global memory. Each SM has a small but fast-access programmable memory (called shared memory) that is shared by all of its cores. Moreover, data can be exchanged between a GPU's global memory and the host's main memory via a relatively slow I/O bus.

CUDA. CUDA (Compute Unified Device Architecture) is the most popular GPU programming platform. A CUDA program consists of code running on the CPU (host code) and code running on the GPU (kernels). Kernels are invoked by the host code. A kernel starts by transferring input data from main memory to the GPU's global memory and then processes the data on the GPU. When the processing finishes, results are copied from the GPU's global memory back to main memory. The GPU executes each kernel with a user-specified number of threads. At runtime, the threads are divided into several blocks for execution on the cores of an SM; each block contains several warps, and each warp of threads executes concurrently.

For thread safety in GPU programming, CUDA provides programmers with atomic operations, including $\mathsf{atomicAdd}$ and $\mathsf{atomicCAS}$. $\mathsf{atomicAdd}(address,\,val)$ reads the 16-bit, 32-bit, or 64-bit word $old$ located at address $address$ in global or shared memory, computes $(old+val)$, and stores the result back to memory at the same address. The read, computation, and write are performed in one atomic transaction, and the operation returns $old$. $\mathsf{atomicCAS}(address,\,compare,\,val)$ (Compare And Swap) reads the 16-bit, 32-bit, or 64-bit word $old$ located at address $address$ in global or shared memory, computes $(old==compare\;?\;val:old)$, and stores the result back to memory at the same address. These three steps are likewise performed in one atomic transaction, and the operation returns $old$.
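As an illustration of how richer atomics are commonly built on top of compare-and-swap, the following Python sketch emulates the CAS retry-loop idiom (the lock merely stands in for the hardware's one-transaction guarantee; the names `Word` and `atomic_min` are our own, not CUDA API):

```python
import threading

class Word:
    """A memory word with CAS semantics; the lock emulates the
    single-transaction guarantee that the GPU hardware provides."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def atomic_cas(self, compare, val):
        with self._lock:                      # one atomic transaction
            old = self.value
            if old == compare:
                self.value = val
            return old                        # always returns the old word

def atomic_min(word, val):
    """Retry loop: the standard way to derive other atomics (min, max, ...)
    from CAS; a thread retries whenever another thread wrote first."""
    old = word.value
    while old > val:
        seen = word.atomic_cas(old, val)
        if seen == old:                       # CAS succeeded
            break
        old = seen                            # word changed underneath us; retry
    return old
```

The retry loop mirrors the usual CUDA pattern `do { assumed = old; old = atomicCAS(addr, assumed, new); } while (assumed != old);`.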

3 Related work

3.1 CPU-based 𝖲𝖢𝖠𝖭\mathsf{SCAN}

Structural graph clustering ($\mathsf{SCAN}$) was first proposed in [40]. After that, a considerable number of follow-up works have been conducted in the literature. For online algorithms with fixed parameters $\epsilon$ and $\mu$, $\mathsf{SCAN}$++ [33], pSCAN [7], and $\mathsf{anySCAN}$ [43] speed up $\mathsf{SCAN}$ by pruning unnecessary structural similarity computations. $\mathsf{SparkSCAN}$ [44] conducts structural clustering on the Spark system. $\mathsf{pmSCAN}$ [30] considers the case in which the input graph is stored on disk. $\mathsf{ppSCAN}$ [8] parallelizes pruning-based algorithms and uses vectorized instructions to further improve performance.

For index-based algorithms, which aim to return the clustering result quickly for frequently issued queries, GS*-Index [39] designs a space-bounded index and proposes an efficient algorithm to answer clustering queries. [38] parallelizes GS*-Index and uses locality-sensitive hashing to design a novel approximate index construction algorithm. [26] studies the maintenance of computed structural similarities between vertices when the graph is updated and proposes an approximate algorithm that achieves $O(\log^{2}|V|+\log|V|\cdot\log\frac{M}{\delta^{*}})$ amortized cost per update for a specified failure probability $\delta^{*}$ and every sequence of $M$ updates. [22] studies structural clustering on directed graphs and proposes an index-based approach to computing the corresponding clustering results. However, all these algorithms are CPU-based and cannot be translated into efficient solutions on GPUs due to the inherent characteristics that distinguish a GPU from a CPU.

Besides $\mathsf{SCAN}$, other graph clustering methods have also been proposed in the literature. However, none of these methods is based on structural clustering, and thus they do not share the effectiveness of $\mathsf{SCAN}$ in detecting clusters, hubs, and outliers. [27, 1] provide a comprehensive survey of this topic.

3.2 GPU-based 𝖲𝖢𝖠𝖭\mathsf{SCAN}

For the GPU-based approach, $\mathsf{GPUSCAN}$ [37] is the state-of-the-art GPU-based $\mathsf{SCAN}$ solution, which conducts the clustering in three separate phases: (1) $\epsilon$-neighborhood identification. In Phase 1, $\mathsf{GPUSCAN}$ computes the similarity of each edge and determines the core vertices in the graph. (2) Cluster detection. After identifying the core vertices, $\mathsf{GPUSCAN}$ detects the clusters based on the cluster definition (Definition 2.6) in Section 2. (3) Hub and outlier classification. With the clustering results, the hub and outlier vertices are classified in Phase 3. The following example illustrates the three phases of $\mathsf{GPUSCAN}$.

Figure 2: $\mathsf{GPUSCAN}$ on $G^{\prime}$ ($\epsilon=0.6$, $\mu=3$): (a) Phase 1: $\epsilon$-neighborhood identification; (b) Phase 1: computing common neighbors of $v_{0}$ and $v_{1}$; (c) Phase 2: cluster detection; (d) Phase 3: hub and outlier classification

Example 3.1: Consider the graph $G$ shown in Fig. 1. Fig. 2 shows the clustering procedure of $\mathsf{GPUSCAN}$ with parameters $\epsilon=0.6$ and $\mu=3$. Due to limited space, we only show the steps related to the subgraph $G^{\prime}$ of $G$ in Fig. 1; the remaining part is processed similarly.

Fig. 2 (a)-(b) show the $\epsilon$-neighborhood identification phase. $\mathsf{GPUSCAN}$ uses the $E.u$ and $E.v$ arrays to represent the edges of the graph, a $\mathsf{flag}$ array in which each element stores whether an edge is similar or not, and the $A.u$ and $A.v$ arrays to store the neighbors of each vertex. $\mathsf{GPUSCAN}$ uses a warp to compute the similarity of each edge $(u,v)$. According to Definition 2.2, the key to computing the similarity of $(u,v)$ is finding the common neighbors in $N[u]$ and $N[v]$. To achieve this goal, for each $w\in\mathsf{nbr}(u)$, $\mathsf{GPUSCAN}$ uses 8 threads in a warp to check whether $w$ also appears among 8 consecutive elements of $\mathsf{nbr}(v)$. As a warp usually contains 32 threads, four elements of $\mathsf{nbr}(u)$ can be compared concurrently. After the comparisons of these four elements are complete, if the largest vertex id among these four elements of $\mathsf{nbr}(u)$ is bigger than the largest one among the 8 consecutive elements of $\mathsf{nbr}(v)$, the next 8 consecutive elements of $\mathsf{nbr}(v)$ are used to repeat the above comparisons. If the largest vertex id among these four elements of $\mathsf{nbr}(u)$ is smaller than the largest one among the 8 consecutive elements of $\mathsf{nbr}(v)$, the next 4 elements of $\mathsf{nbr}(u)$ are used to repeat the above comparisons. Otherwise, the next 4 elements of $\mathsf{nbr}(u)$ and the next 8 elements of $\mathsf{nbr}(v)$ are used to repeat the above comparisons. All common neighbors have been found when all the elements of $\mathsf{nbr}(u)$ and $\mathsf{nbr}(v)$ have been explored. Take edge $(v_{0},v_{1})$ as an example. Fig. 2 (b) shows how $\mathsf{GPUSCAN}$ computes their common neighbors. As $\mathsf{nbr}(v_{0})=\{v_{1},v_{2},v_{3},v_{4},v_{5},v_{6},v_{7}\}$, the first 4 elements $\{v_{1},v_{2},v_{3},v_{4}\}$ of $\mathsf{nbr}(v_{0})$ are compared first. Among the 32 threads in the warp, threads $T_{0}$-$T_{7}$ handle the comparison for $v_{1}$, and the remaining threads are assigned to handle the comparisons for $v_{2}$, $v_{3}$, and $v_{4}$ similarly. As $\mathsf{nbr}(v_{1})=\{v_{0},v_{2},v_{4},v_{7}\}$, threads $T_{0}$-$T_{7}$ find $v_{1}\notin\mathsf{nbr}(v_{1})$, while threads $T_{8}$-$T_{15}$ and $T_{24}$-$T_{31}$ find $v_{2},v_{4}\in\mathsf{nbr}(v_{1})$. After that, the threads find that the largest vertex id $v_{4}$ among the compared elements of $\mathsf{nbr}(v_{0})$ is smaller than the largest one $v_{7}$ in $\mathsf{nbr}(v_{1})$. Then, the remaining 3 elements $\{v_{5},v_{6},v_{7}\}$ of $\mathsf{nbr}(v_{0})$ are used to repeat the comparisons. In this round, threads $T_{16}$-$T_{23}$ find that $v_{7}$ is a common element of $\mathsf{nbr}(v_{0})$ and $\mathsf{nbr}(v_{1})$. Therefore, $\sigma(v_{0},v_{1})=(2+3)/\sqrt{5*8}=0.79>0.6$, which means $v_{0}$ and $v_{1}$ are similar, and the corresponding element in the $\mathsf{flag}$ array is set to 1.
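The warp-cooperative comparison above is, in essence, a blocked merge-based set intersection. The following serial Python sketch captures the same block-advancing logic (a simplification of the warp scheme: on the GPU, all `bu*bv` pairwise checks of a round happen concurrently across the 32 threads):

```python
def count_common_neighbors(nbr_u, nbr_v, bu=4, bv=8):
    """Count |nbr(u) & nbr(v)| by comparing a block of `bu` elements of
    nbr(u) against a block of `bv` elements of nbr(v), then advancing the
    block whose largest id is smaller. Both lists must be sorted by id."""
    common, i, j = 0, 0, 0
    while i < len(nbr_u) and j < len(nbr_v):
        block_u = nbr_u[i:i + bu]
        block_v = nbr_v[j:j + bv]
        common += len(set(block_u) & set(block_v))   # the bu*bv comparisons
        if block_u[-1] > block_v[-1]:
            j += bv                                  # advance in nbr(v)
        elif block_u[-1] < block_v[-1]:
            i += bu                                  # advance in nbr(u)
        else:
            i, j = i + bu, j + bv                    # advance in both
    return common
```

On the running example, `count_common_neighbors([1,2,3,4,5,6,7], [0,2,4,7])` finds the 3 common neighbors $v_{2}$, $v_{4}$, $v_{7}$, giving $\sigma(v_{0},v_{1})=(3+2)/\sqrt{8\cdot 5}\approx 0.79$.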

Fig. 2 (c) shows the cluster detection phase of $\mathsf{GPUSCAN}$. By the cluster definition (Definition 2.6), only similar edges can appear in the final clusters. Therefore, $\mathsf{GPUSCAN}$ builds two new arrays $E^{\prime}.u$ and $E^{\prime}.v$ by retrieving the similar edges from $E.u$ and $E.v$. Consider the initialization step in Fig. 2 (c): as $v_{0}$ and $v_{1}$ are similar while $v_{0}$ and $v_{2}$ are not, only $v_{0}$ and $v_{1}$ are kept in $E^{\prime}.u$ and $E^{\prime}.v$. Then, based on $E^{\prime}.u$ and $E^{\prime}.v$, $N_{\epsilon}[u]$ can be obtained easily. For example, for $v_{0}$, $N_{\epsilon}[v_{0}]=\{v_{0},v_{1},v_{4},v_{7}\}$ and $|N_{\epsilon}[v_{0}]|=4\geq 3$; thus $v_{0}$ is a core vertex. Similarly, $v_{4}$ and $v_{7}$ are also core vertices. Then, it removes the edges in $E^{\prime}.u$ and $E^{\prime}.v$ that are not incident to core vertices. Moreover, $\mathsf{GPUSCAN}$ represents all clusters as a spanning forest through a $\mathsf{parent}$ array. In the $\mathsf{parent}$ array, each element represents the parent of the corresponding vertex in the forest, and the element for each vertex in $E^{\prime}.u$ and $E^{\prime}.v$ is initialized as the vertex itself, as shown in the Initialization step of Fig. 2 (c).

After that, $\mathsf{GPUSCAN}$ detects the clusters in the graph iteratively. In each odd iteration, for each vertex $u$ in $E^{\prime}.u$, it finds the smallest neighbor $v$ of $u$ in $E^{\prime}.v$ and takes the smaller of $u$ and $v$ as the parent of $u$ in the $\mathsf{parent}$ array. Then, it replaces each vertex in $E^{\prime}.u$ and $E^{\prime}.v$ with its parent and removes the edges whose incident vertices are the same. Consider Iterations 1.1-1.4 in Fig. 2 (c): for $v_{1}$, its smallest neighbor in $E^{\prime}.v$ is $v_{0}$, which is smaller than itself; thus the parent of $v_{1}$ is replaced by $v_{0}$ in the $\mathsf{parent}$ array. The other vertices are processed similarly (Iterations 1.1 and 1.2). Then, $v_{1}$ is replaced by $v_{0}$ in $E^{\prime}.u$ and $E^{\prime}.v$ in Iteration 1.3, and only the edges $(v_{0},v_{1})$ and $(v_{1},v_{0})$ are left after removing the edges with identical incident vertices in Iteration 1.4. In each even iteration, $\mathsf{GPUSCAN}$ conducts the clustering in the same way as in an odd iteration, except that the largest neighbor is selected as the parent. Iterations 2.1-2.4 in Fig. 2 (c) show the even case. Note that to obtain $E^{\prime}.u$ and $E^{\prime}.v$ at the end of each iteration, $\mathsf{GPUSCAN}$ uses the $\mathsf{sort()}$ and $\mathsf{partition()}$ functions in the $\mathsf{Thrust}$ library provided by NVIDIA [23]. When $E^{\prime}.u$ and $E^{\prime}.v$ become empty, $\mathsf{GPUSCAN}$ explores each vertex and sets its corresponding element in the $\mathsf{parent}$ array to the root vertex of its tree in the forest. For example, in Iteration 2.2, the parent of $v_{4}$ is $v_{0}$, the parent of $v_{0}$ is $v_{1}$, and $v_{1}$ is its own parent; thus the parent of $v_{4}$ is set to $v_{1}$. After Phase 2 finishes, the vertices with the same parent are in the same cluster. In Fig. 2 (c), as $v_{0}$, $v_{1}$, $v_{2}$, $v_{4}$, and $v_{7}$ have the same parent, they are in the same cluster.
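The odd/even hooking-and-contraction procedure can be sketched serially as follows (an illustrative Python rendering of the idea, with `min`/`max` alternating as described above; the real algorithm performs the per-vertex and per-edge steps in parallel and uses Thrust's sort/partition for the contraction):

```python
def detect_clusters(core_edges, vertices):
    """Hook each vertex to an extremal neighbor (min in odd rounds, max in
    even rounds), contract the edge list by renaming endpoints to their
    parents, and finally follow parents to the root of each tree; vertices
    sharing a root share a cluster."""
    parent = {v: v for v in vertices}
    # keep both directions so every endpoint appears as a source
    edges = [(u, v) for u, v in core_edges] + [(v, u) for u, v in core_edges]
    it = 1
    while edges:
        pick = min if it % 2 == 1 else max
        # hooking: each u adopts the extremal vertex among itself and its nbrs
        for u in {u for u, _ in edges}:
            best = pick(v for x, v in edges if x == u)
            parent[u] = pick(u, best)
        # contraction: rename endpoints by parent, drop self-loops/duplicates
        edges = list({(parent[u], parent[v])
                      for u, v in edges if parent[u] != parent[v]})
        it += 1
    # finally point every vertex at the root of its tree in the forest
    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v
    return {v: root(v) for v in vertices}
```

The alternation between `min` and `max` mirrors the odd/even iterations above; an isolated vertex simply remains its own root.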

In Phase 3, $\mathsf{GPUSCAN}$ classifies the hub and outlier vertices. It first retrieves the edges $(u,v)$ from $A.u$ and $A.v$ such that $u\in A.u$ is not in a cluster and $v\in A.v$ is in a cluster, and keeps them in new arrays $E^{\prime}.u$ and $E^{\prime}.v$. Then, it replaces the vertices in $E^{\prime}.v$ with their parents. For a vertex $u\in E^{\prime}.u$, if it has two or more distinct neighbors in $E^{\prime}.v$, it is a hub vertex; otherwise, it is an outlier vertex. Consider the example shown in Fig. 2 (d): $v_{3}$, $v_{5}$, $v_{6}$, and $v_{8}$ are not in a cluster; thus the edges they form with their neighbors within clusters are retrieved and stored in $E^{\prime}.u$ and $E^{\prime}.v$ (Step 1). Then, $v_{0}$, $v_{2}$, and $v_{9}$ are replaced with their parents (in Step 2, the parent of $v_{9}$ is $v_{12}$, which is not shown in Fig. 2 due to limited space). Since $v_{8}$ has two neighbors $v_{1}$ and $v_{12}$ in $E^{\prime}.v$, it is a hub vertex. As $v_{3}$, $v_{5}$, and $v_{6}$ each have only one neighbor in $E^{\prime}.v$, they are outlier vertices. $\Box$
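This Phase-3 rule reduces to a few lines once the parent/root of every clustered vertex is known from Phase 2 (a hedged sketch; the function and argument names are ours):

```python
def classify(noncluster, nbrs, root):
    """A non-clustered vertex whose clustered neighbors map to >= 2 distinct
    cluster roots is a hub; otherwise it is an outlier.
    noncluster: vertices outside all clusters; nbrs: vertex -> neighbor list;
    root: clustered vertex -> its cluster's root (parent after Phase 2)."""
    out = {}
    for u in noncluster:
        clusters = {root[v] for v in nbrs[u] if v in root}  # distinct clusters seen
        out[u] = 'hub' if len(clusters) >= 2 else 'outlier'
    return out
```

On the example of Fig. 2 (d), $v_{8}$'s neighbors map to two distinct roots and it is classified as a hub, while $v_{3}$, $v_{5}$, and $v_{6}$ each see only one.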

Theorem 3.1: Given a graph $G$ and two parameters $\epsilon$ and $\mu$, the work of $\mathsf{GPUSCAN}$ to finish structural clustering is $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$, and the span is $O(\mathsf{deg}_{\mathsf{max}}+c\cdot\log^{2}m)$, where $c$ denotes the number of iterations in Phase 2. $\Box$

Proof: In Phase 1, for each edge $(u,v)$, $\mathsf{GPUSCAN}$ has to explore the neighbors of $u$ and $v$. Thus, the work and span of Phase 1 are $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v)))$ and $O(\mathsf{deg}_{\mathsf{max}})$, respectively. In Phase 2, the time used by $\mathsf{sort()}$ dominates each iteration, and the work and span of $\mathsf{sort()}$ are $O(m\cdot\log m)$ and $O(\log^{2}m)$, respectively [35]. Thus, the work and span of Phase 2 are $O(c\cdot m\cdot\log m)$ and $O(c\cdot\log^{2}m)$, respectively. In Phase 3, $\mathsf{GPUSCAN}$ explores the edges once, and only the neighbors of hubs and outliers are visited. Thus, the work and span of Phase 3 are $O(m)$ and $O(1)$. Therefore, the work and span of $\mathsf{GPUSCAN}$ are $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$ and $O(\mathsf{deg}_{\mathsf{max}}+c\cdot\log^{2}m)$.

Drawbacks of $\mathsf{GPUSCAN}$. $\mathsf{GPUSCAN}$ seeks to parallelize $\mathsf{SCAN}$ to benefit from the computational power provided by GPUs. However, it suffers from the following two drawbacks, which prevent it from fully utilizing the massive parallelism of GPUs.

  • High extra parallelization cost. As shown in [7], the time complexity of the best serial $\mathsf{SCAN}$ algorithm is $O(m\cdot\mathsf{deg}_{\mathsf{max}})$. On the other hand, Theorem 3.1 shows that the work of $\mathsf{GPUSCAN}$ is $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$. Obviously, considerable extra cost is introduced in $\mathsf{GPUSCAN}$ for parallelization.

  • Low scalability. $\mathsf{GPUSCAN}$ assumes that the input graph fits in GPU memory. However, real graphs such as social networks and web graphs are huge, while GPU memory is generally limited to a few gigabytes. The assumption that the data graph cannot exceed the GPU memory makes $\mathsf{GPUSCAN}$ unable to handle large graphs in practice.

4 Our Approach

We address the drawbacks of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} in two sections. In this section, we focus on the scenario in which the graphs can be stored in the GPU memory and aim to reduce the high computation cost introduced for parallelization. In the next section, we present our algorithm to address the problem that the graph cannot fit in the GPU memory.

4.1 A CSR-Enhanced Graph Layout

𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} simply uses the edge array and adjacent array to represent the input graph. The shortcomings of this representation are two-fold: (1) the degree of a vertex cannot be obtained easily, although it is a commonly used operation during the clustering; (2) the correspondence between the edge array and the adjacent array for the same edge is not maintained directly, which complicates the identification of cluster vertices in Phase 2. To avoid these problems, we adopt a CSR (Compressed Sparse Row)-enhanced graph layout, which contains five parts:

Figure 3: A CSR-Enhanced graph layout for GG^{\prime}
  • Vertex array. Each entry keeps the starting index for a specific vertex in the adjacent array.

  • Adjacent array. The adjacent array contains the adjacent vertices of each vertex. The adjacent vertices are stored in the increasing order of vertex ids.

  • Eid array. Each entry corresponds to an edge in the adjacent array and keeps the index of the edge in the degree-oriented edge array.

  • Degree-oriented edge array. The degree-oriented edge array contains the edges of the graph and the edges with the same source vertex are stored together. For each edge (u,v)(u,v) in the degree-oriented edge array, we guarantee 𝖽𝖾𝗀(u)<𝖽𝖾𝗀(v){\mathsf{deg}}(u)<{\mathsf{deg}}(v) or 𝖽𝖾𝗀(u)=𝖽𝖾𝗀(v){\mathsf{deg}}(u)={\mathsf{deg}}(v) and 𝗂𝖽(u)<𝗂𝖽(v){\mathsf{id}}(u)<{\mathsf{id}}(v). In this way, the workload for edges incident to a large degree vertex can be computed based on the vertex with smaller degree.

  • Similarity array. Each element corresponds to an edge in the degree-oriented edge array and maintains the similarity of the corresponding edge.

Example 4.1: Fig. 3 shows the CSR-Enhanced graph layout regarding v0v_{0}, v1v_{1}, \dots, v8v_{8} in GG^{\prime}. These five arrays are organized as discussed above. \Box
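To make the layout concrete, the following is a minimal sequential sketch (our own Python illustration, not the paper's GPU code; all function and variable names are assumptions) that builds the five arrays from an undirected edge list. Each edge in the degree-oriented edge array is oriented so that the smaller-degree endpoint comes first, with ties broken by vertex id, as described above.

```python
from collections import defaultdict

def build_csr_enhanced(n, edges):
    # adjacency lists, neighbors later sorted in increasing id order
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    vertex = [0] * (n + 1)          # vertex array: start index per vertex
    adjacent = []                   # adjacent array: sorted neighbor ids
    for v in range(n):
        nbrs = sorted(adj[v])
        vertex[v + 1] = vertex[v] + len(nbrs)
        adjacent.extend(nbrs)
    deg = lambda v: vertex[v + 1] - vertex[v]
    # degree-oriented edge array: smaller-degree endpoint first, ties by id
    edge_arr = sorted((u, v) if (deg(u), u) < (deg(v), v) else (v, u)
                      for u, v in edges)
    pos = {e: i for i, e in enumerate(edge_arr)}
    # eid array: for each adjacency entry, the index of its edge in edge_arr
    eid = []
    for u in range(n):
        for w in adjacent[vertex[u]:vertex[u + 1]]:
            key = (u, w) if (deg(u), u) < (deg(w), w) else (w, u)
            eid.append(pos[key])
    similarity = [None] * len(edge_arr)   # similarity array, one slot per edge
    return vertex, adjacent, eid, edge_arr, similarity
```

For instance, for an edge (2,3) with deg(3) < deg(2), the degree-oriented entry is stored as (3,2), so the workload of that edge is driven by the smaller-degree endpoint.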

4.2 A Progressive Structural Graph Clustering Approach

To reduce the high parallelization cost, the designed algorithm should avoid unnecessary computation during the clustering and fully exploit the unique characteristics of GPU structure. As shown in Section 3.2, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} first computes the structural similarity for each edge in Phase 1 and then identifies the roles of each vertex accordingly. However, based on Theorem 2, computing the structural similarity is a costly operation. To avoid unnecessary structural similarity computation, we adopt a progressive approach in which the lower bound (Nϵ¯[v]\underline{N_{\epsilon}}[v]) and upper bound (Nϵ¯[v]\overline{N_{\epsilon}}[v]) of Nϵ[v]N_{\epsilon}[v] are maintained. Clearly, we have the following lemma on Nϵ¯[v]\underline{N_{\epsilon}}[v] and Nϵ¯[v]\overline{N_{\epsilon}}[v]:

1
2for vVv\in V in parallel do
3      Nϵ¯[v]1\underline{N_{\epsilon}}[v]\leftarrow 1; Nϵ¯[v]𝖵𝖠[v+1]𝖵𝖠[v]+1\overline{N_{\epsilon}}[v]\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v]+1; 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf{?}}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf{?}}⃝;
4for (u,v)E(u,v)\in E in parallel do
5      𝗌𝗂𝗆(u,v)\footnotesize{\bf{?}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf{?}}⃝;
6𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(𝖦,μ,ϵ){\mathsf{identifyCore(G,\mu,\epsilon)}}
7 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(𝖦,ϵ){\mathsf{detectClusters(G,\epsilon)}}
8 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(G){\mathsf{classifyHubOutlier}}(G)
Algorithm 1 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++(G,μ,ϵ){\mathsf{{\mathsf{GPUSCAN^{++}}}}}(G,\mu,\epsilon)

Lemma 4.1: Given a graph GG and two parameters μ\mu and ϵ\epsilon, for a vertex vV(G)v\in V(G), if Nϵ¯[v]μ\underline{N_{\epsilon}}[v]\geq\mu, then vv is a core vertex. If Nϵ¯[v]<μ\overline{N_{\epsilon}}[v]<\mu, then vv is not a core vertex. \Box

According to Lemma 4.1, we can maintain Nϵ¯[v]\underline{N_{\epsilon}}[v] and Nϵ¯[v]\overline{N_{\epsilon}}[v] progressively and delay the structural similarity computation until necessary. Following this idea, our new algorithm, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}}, is shown in Algorithm 1. It first initializes Nϵ¯[v]\underline{N_{\epsilon}}[v] as 11 and Nϵ¯[v]\overline{N_{\epsilon}}[v] as 𝖽𝖾𝗀(v)+1{\mathsf{deg}}(v)+1 for each vertex vv (lines 1-2, where 𝖵𝖠\mathsf{VA} refers to the vertex array). The role of each vertex and the structural similarity of each edge are initialized as unknown (\footnotesize{\bf{?}}⃝\large\footnotesize{\bf{?}}⃝) (lines 1-4). Then, it identifies the core vertices (line 5), detects the clusters based on the core vertices (line 6), and classifies the hub vertices and outlier vertices (line 7).
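The bound maintenance behind Lemma 4.1 can be sketched sequentially as follows (a simplified single-threaded illustration with names of our own choosing; on the GPU the bound updates are performed with atomicAdd). Each vertex starts with lower bound 1 (it is always in its own ε-neighborhood) and upper bound deg(v)+1, and a core or non-core decision may be reached before all of its edge similarities are computed.

```python
def progressive_cores(n, deg, mu, edge_results):
    """deg: degree per vertex; edge_results: ((u, v), similar) outcomes
    in the order they become available."""
    lower = [1] * n                          # known-similar neighbors, incl. itself
    upper = [deg[v] + 1 for v in range(n)]   # still-possible similar neighbors
    role = ['?'] * n
    for (u, v), similar in edge_results:
        for x in (u, v):
            if similar:
                lower[x] += 1                # tighten lower bound
            else:
                upper[x] -= 1                # tighten upper bound
            if role[x] == '?':
                if lower[x] >= mu:
                    role[x] = 'C'            # core (Lemma 4.1, first case)
                elif upper[x] < mu:
                    role[x] = '!C'           # non-core (second case)
    return role
```

Once a vertex's role is fixed, similarity computations that would only refine its bounds further can be skipped, which is the source of the savings over computing every similarity up front.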

Identify core vertices. 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} first identifies the core vertices in the graph based on the given parameters μ\mu and ϵ\epsilon, which is shown in Algorithm 2.

1 for (u,v)E(G)(u,v)\in E(G) in warp-level parallel do
2      if 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bf{?}}⃝𝗋𝗈𝗅𝖾[v]=\footnotesize{\bf?}⃝{\mathsf{role}}[u]=\large\footnotesize{\bf{?}}⃝\vee{\mathsf{role}}[v]=\large\footnotesize{\bf?}⃝ then
3           𝗂𝗌𝖲𝗂𝗆𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{isSim}}\leftarrow{\mathsf{checkSim}}(u,v,\epsilon);
4           // the first thread in the warp
5           if 𝗂𝗌𝖲𝗂𝗆={\mathsf{isSim}}= true then
6                𝗌𝗂𝗆(u,v)\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bfS}⃝;
7                𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[u],1){\mathsf{atomicAdd}}({\mathsf{\underline{N_{\epsilon}}}}[u],1);𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[v],1){\mathsf{atomicAdd}}({\mathsf{\underline{N_{\epsilon}}}}[v],1);
8                if Nϵ¯[u]μ{\underline{N_{\epsilon}}}[u]\geq\mu then 𝗋𝗈𝗅𝖾[u]\footnotesize{\bfC}⃝{\mathsf{role}}[u]\leftarrow\large\footnotesize{\bfC}⃝;
9                if Nϵ¯[v]μ{\underline{N_{\epsilon}}}[v]\geq\mu then 𝗋𝗈𝗅𝖾[v]\footnotesize{\bfC}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bfC}⃝;
10          else
11                𝗌𝗂𝗆(u,v)\footnotesize{{\bf!S}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{{\bf!S}}⃝;
12                𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[u],1){\mathsf{atomicAdd}}({\mathsf{\overline{N_{\epsilon}}}}[u],-1);𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[v],1){\mathsf{atomicAdd}}({\mathsf{\overline{N_{\epsilon}}}}[v],-1);
13                if Nϵ¯[u]<μ{\overline{N_{\epsilon}}}[u]<\mu then 𝗋𝗈𝗅𝖾[u]\footnotesize{\bf{!C}}⃝{\mathsf{role}}[u]\leftarrow\large\footnotesize{\bf{!C}}⃝;
14                if Nϵ¯[v]<μ{\overline{N_{\epsilon}}}[v]<\mu then 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf!C}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf!C}⃝;
15
16 Procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{checkSim}}(u,v,\epsilon)
17      𝗌𝗎𝗆0{\mathsf{sum}}\leftarrow 0;
18      for w𝗇𝖻𝗋(u)w\in{\mathsf{nbr}}(u) in parallel do
19           𝗅𝗈𝗐𝖵𝖠[v];𝗁𝗂𝗀𝗁𝖵𝖠[v+1]{\mathsf{low}}\leftarrow{\mathsf{VA}}[v];\;{\mathsf{high}}\leftarrow{\mathsf{VA}}[v+1];
20          
21          while 𝗅𝗈𝗐<𝗁𝗂𝗀𝗁{\mathsf{low}}<{\mathsf{high}} do
22                𝗆𝗂𝖽(𝗅𝗈𝗐+𝗁𝗂𝗀𝗁)/2{\mathsf{mid}}\leftarrow({\mathsf{low}}+{\mathsf{high}})/2;
23                if w<𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]w<{\mathsf{AdjA}}[{\mathsf{mid}}] then
24                    𝗁𝗂𝗀𝗁𝗆𝗂𝖽;{\mathsf{high}}\leftarrow{\mathsf{mid}};
25                else if w>𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]w>{\mathsf{AdjA}}[{\mathsf{mid}}] then
26                    𝗅𝗈𝗐𝗆𝗂𝖽+1;{\mathsf{low}}\leftarrow{\mathsf{mid}}+1;
27               else
28                    𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝗌𝗎𝗆,1){\mathsf{atomicAdd}}({\mathsf{sum}},1); break;
29     du𝖵𝖠[u+1]𝖵𝖠[u]d_{u}\leftarrow{\mathsf{VA}}[u+1]-{\mathsf{VA}}[u];
30      dv𝖵𝖠[v+1]𝖵𝖠[v]d_{v}\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v];
31      return 𝗌𝗎𝗆+2ϵ(du+1)(dv+1){\mathsf{sum}}+2\geq\epsilon\cdot\sqrt{(d_{u}+1)\cdot(d_{v}+1)};
Algorithm 2 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(G,μ,ϵ){\mathsf{identifyCore}}(G,\mu,\epsilon)

To identify the core vertices, it assigns a warp for each edge (u,v)(u,v) to determine its structural similarity (line 1), which is similar to 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}. However, the detailed procedure to compute the similarity is significantly different from that of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}. Specifically, the similarity between uu and vv is computed only when the role of uu or vv is still unknown (line 2); otherwise, the computation is delayed to the following steps, since the roles of both vertices have already been determined and it is possible to obtain the clustering results without knowing the structural similarity of uu and vv. When the computation is necessary, it computes the structural similarity between uu and vv by invoking 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} (line 3). If uu and vv are similar, the similarity indicator between uu and vv is set to \footnotesize{\bfS}⃝\large\footnotesize{\bfS}⃝ (line 6) and Nϵ¯[u]{\underline{N_{\epsilon}}[u]}/Nϵ¯[v]{\underline{N_{\epsilon}}[v]} is increased by 11 (line 7). After that, if Nϵ¯[u]{\underline{N_{\epsilon}}[u]}/Nϵ¯[v]μ{\underline{N_{\epsilon}}[v]}\geq\mu, the role of uu/vv is set as \footnotesize{\bfC}⃝\large\footnotesize{\bfC}⃝ (lines 8-9). If uu and vv are dissimilar, Nϵ¯[u]{\overline{N_{\epsilon}}[u]}/Nϵ¯[v]{\overline{N_{\epsilon}}[v]} and the role of uu/vv are updated accordingly (lines 11-14).

(a) Find neighbor of v0v_{0} and v1v_{1}
(b) One thread per vertex in 𝗇𝖻𝗋(v1){\mathsf{nbr}}(v_{1})
(c) Parallel binary search in 𝗇𝖻𝗋(v0){\mathsf{nbr}}(v_{0})
Figure 4: Check Similarity between v0v_{0} and v1v_{1}

Procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} determines the structural similarity between uu and vv based on the given ϵ\epsilon. Without loss of generality, we assume 𝖽𝖾𝗀(u)<𝖽𝖾𝗀(v){\mathsf{deg}}(u)<{\mathsf{deg}}(v). For each neighbor ww of uu, it assigns a thread to check whether ww is also a neighbor of vv (line 17). Since the neighbors of vv are stored in increasing order of their ids in 𝖠𝖽𝗃𝖠\mathsf{AdjA} (the adjacent array), it looks up ww among the neighbors of vv in a binary-search manner (lines 18-26). Specifically, three indices 𝗅𝗈𝗐\mathsf{low}, 𝗁𝗂𝗀𝗁\mathsf{high}, and 𝗆𝗂𝖽\mathsf{mid} are maintained and initialized as 𝖵𝖠[v]{\mathsf{VA}}[v], 𝖵𝖠[𝗏+𝟣]{\mathsf{VA[v+1]}}, and (𝗅𝗈𝗐+𝗁𝗂𝗀𝗁)/2({\mathsf{low}}+{\mathsf{high}})/2, respectively. If ww is less than 𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]{\mathsf{AdjA}}[{\mathsf{mid}}], 𝗁𝗂𝗀𝗁\mathsf{high} is updated as 𝗆𝗂𝖽\mathsf{mid} (lines 21-22). If ww is larger than 𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]{\mathsf{AdjA}}[{\mathsf{mid}}], 𝗅𝗈𝗐\mathsf{low} is updated as 𝗆𝗂𝖽+1{\mathsf{mid}}+1 (lines 23-24). Otherwise, ww is a common neighbor, and the counter 𝗌𝗎𝗆\mathsf{sum} of common neighbors of uu and vv is increased by 1 (line 26). Finally, whether uu and vv are similar is returned based on Definition 2.1 (lines 27-29).

Example 4.2: Fig. 4 shows the procedure of 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} to check the similarity between v0v_{0} and v1v_{1} of GG^{\prime} shown in Fig. 1. In order to compute their similarity, the neighbors of v1v_{1} and v0v_{0} are first obtained from the CSR-Enhanced graph layout as shown in Fig. 4 (a). As 𝖽𝖾𝗀(v1)<𝖽𝖾𝗀(v0){\mathsf{deg}}(v_{1})<{\mathsf{deg}}(v_{0}), threads T0T_{0}, T1T_{1}, T2T_{2}, and T3T_{3} in the warp are used to check whether the neighbors v0,v2,v4,v7v_{0},v_{2},v_{4},v_{7} of v1v_{1} are also neighbors of v0v_{0}, which is shown in Fig. 4 (b). Take thread T0T_{0} as an example: since the neighbors of v0v_{0} are sorted based on their ids, thread T0T_{0} explores the neighbors of v0v_{0} in a binary-search manner, which is shown in Fig. 4 (c). After reaching the leaf node v1v_{1}, it is clear that v0v_{0} is not a neighbor of v0v_{0}. Similarly, threads T1T3T_{1}-T_{3} find that v2v_{2}, v4v_{4}, and v7v_{7} are neighbors of v0v_{0} as well. Therefore, the common neighbors of v0v_{0} and v1v_{1} are v2v_{2}, v4v_{4}, and v7v_{7}. Because 2+3>0.6×5×82+3>0.6\times\sqrt{5\times 8}, v0v_{0} and v1v_{1} are similar, and 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} returns 𝗍𝗋𝗎𝖾\mathsf{true}. \Box
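A sequential Python sketch of this procedure is given below (our stand-in for the warp-parallel CUDA kernel; names are ours). It counts common neighbors by binary-searching each neighbor of the smaller-degree endpoint in the sorted adjacency of the other endpoint, then applies the test sum+2 ≥ ε·√((deg(u)+1)(deg(v)+1)) from line 29 of Algorithm 2.

```python
import math

def check_sim(VA, AdjA, u, v, eps):
    """VA: vertex array, AdjA: adjacent array of the CSR-enhanced layout."""
    du, dv = VA[u + 1] - VA[u], VA[v + 1] - VA[v]
    if du > dv:
        u, v = v, u                   # iterate over the smaller neighbor list
    common = 0
    for w in AdjA[VA[u]:VA[u + 1]]:
        low, high = VA[v], VA[v + 1]  # binary search for w among nbr(v)
        while low < high:
            mid = (low + high) // 2
            if w < AdjA[mid]:
                high = mid
            elif w > AdjA[mid]:
                low = mid + 1
            else:
                common += 1           # w is a common neighbor
                break
    # +2 accounts for u and v themselves (closed neighborhoods)
    return common + 2 >= eps * math.sqrt((du + 1) * (dv + 1))
```

In the GPU kernel each of the binary searches runs on a separate thread of the warp, with the counter updated by atomicAdd; this sketch simply runs them one after another.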

1 for uVu\in V in parallel do
2      if 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ then 𝗉𝖺𝗋𝖾𝗇𝗍[u]u{\mathsf{parent}}[u]\leftarrow u; u.𝗁𝖾𝗂𝗀𝗁𝗍1u.{\mathsf{height}}\leftarrow 1;
3      else 𝗉𝖺𝗋𝖾𝗇𝗍[u]2{\mathsf{parent}}[u]\leftarrow-2;
4     
5for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]={\large\footnotesize{\bfC}⃝} in warp-level parallel do
6      upu_{p}\leftarrow 𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{root}}(u,{\mathsf{parent}});
7      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in parallel  do
8           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bfC}⃝𝗌𝗂𝗆(u,v)=\footnotesize{\bfS}⃝{\mathsf{role}}[v]=\large\footnotesize{\bfC}⃝\wedge{\mathsf{sim}}(u,v)=\large\footnotesize{\bfS}⃝ then
9                vp𝗋𝗈𝗈𝗍(v,𝗉𝖺𝗋𝖾𝗇𝗍)v_{p}\leftarrow{\mathsf{root}}(v,{\mathsf{parent}});
10                if upvpu_{p}\neq v_{p} then 𝗎𝗇𝗂𝗈𝗇(up,vp,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u_{p},v_{p},{\mathsf{parent}});
11               
12for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in warp-level parallel do
13      up𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍)u_{p}\leftarrow{\mathsf{root}}(u,{\mathsf{parent}});
14      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in warp-level parallel do
15           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bfC}⃝𝗌𝗂𝗆(u,v)=\footnotesize{\bf?}⃝{\mathsf{role}}[v]=\large\footnotesize{\bfC}⃝\wedge{\mathsf{sim}}(u,v)=\large\footnotesize{\bf?}⃝ then
16                vp𝗋𝗈𝗈𝗍(v,𝗉𝖺𝗋𝖾𝗇𝗍)v_{p}\leftarrow{\mathsf{root}}(v,{\mathsf{parent}});
17                if upvpu_{p}\neq v_{p} then
18                     if 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{checkSim}}(u,v,\epsilon) then
19                          𝗌𝗂𝗆(u,v)\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bfS}⃝; 𝗎𝗇𝗂𝗈𝗇(up,vp,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u_{p},v_{p},{\mathsf{parent}});
20                    else 𝗌𝗂𝗆(u,v)\footnotesize{\bf!S}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf!S}⃝;
21for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in parallel do
22      𝗉𝖺𝗋𝖾𝗇𝗍[u]𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{parent}}[u]\leftarrow{\mathsf{root}}(u,{\mathsf{parent}});
23     
24for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in warp-level parallel do
25      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in warp-level parallel do
26           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bf!C}⃝{\mathsf{role}}[v]=\large\footnotesize{\bf!C}⃝ then
27                if  𝗌𝗂𝗆(u,v)=\footnotesize{\bf?}⃝{\mathsf{sim}}(u,v)=\large\footnotesize{\bf?}⃝ then
28                     𝗌𝗂𝗆(u,v)𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ)?\footnotesize{\bfS}⃝:\footnotesize{\bf!S}⃝{\mathsf{sim}}(u,v)\leftarrow{\mathsf{checkSim}}(u,v,\epsilon)?\,\large\footnotesize{\bfS}⃝:\large\footnotesize{\bf!S}⃝;
29               if 𝗌𝗂𝗆(u,v)=\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)=\large\footnotesize{\bfS}⃝ then
30                     𝗉𝖺𝗋𝖾𝗇𝗍[v]𝗉𝖺𝗋𝖾𝗇𝗍[u]{\mathsf{parent}}[v]\leftarrow{\mathsf{parent}}[u];
31                    
32 Procedure 𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{root}}(u,{\mathsf{parent}})
33      𝗉𝗎𝗉𝖺𝗋𝖾𝗇𝗍[u]{\mathsf{p_{u}}}\leftarrow{\mathsf{parent}}[u]; 𝗇𝖾𝗑𝗍𝗉𝖺𝗋𝖾𝗇𝗍[𝗉𝗎]{\mathsf{next}}\leftarrow{\mathsf{parent}}[{\mathsf{p_{u}}}];
34      while 𝗉𝗎𝗇𝖾𝗑𝗍{\mathsf{p_{u}}}\neq{\mathsf{next}} do
35           𝗉𝗎𝗇𝖾𝗑𝗍{\mathsf{p_{u}}}\leftarrow{\mathsf{next}}; 𝗇𝖾𝗑𝗍𝗉𝖺𝗋𝖾𝗇𝗍[𝗉𝗎]{\mathsf{next}}\leftarrow{\mathsf{parent}}[{\mathsf{p_{u}}}];
36     return 𝗉𝗎{\mathsf{p_{u}}}
37 Procedure 𝗎𝗇𝗂𝗈𝗇(u,v,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u,v,{\mathsf{parent}})
38      if u.𝗁𝖾𝗂𝗀𝗁𝗍<v.𝗁𝖾𝗂𝗀𝗁𝗍u.{\mathsf{height}}<v.{\mathsf{height}} then 𝗌𝗐𝖺𝗉(u,v){\mathsf{swap}}(u,v);
39      𝖿𝗀{\mathsf{fg}}\leftarrow true;
40      while 𝖿𝗀\mathsf{fg} do
41           𝖿𝗀{\mathsf{fg}}\leftarrowfalse; 𝗋𝖾𝗌𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲(&𝗉𝖺𝗋𝖾𝗇𝗍[v],v,u){\mathsf{res}}\;\leftarrow\;{\mathsf{atomicCAS}}({\mathsf{\&parent}}[v],v,u);
42           if 𝗋𝖾𝗌v{\mathsf{res}}\neq v then v𝗋𝖾𝗌;𝖿𝗀v\leftarrow{\mathsf{res}};{\mathsf{fg}}\leftarrow true;
43          
44     if uvu.𝗁𝖾𝗂𝗀𝗁𝗍=v.𝗁𝖾𝗂𝗀𝗁𝗍u\neq v\wedge u.{\mathsf{height}}=v.{\mathsf{height}} then 𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖣𝖣(u.𝗁𝖾𝗂𝗀𝗁𝗍,1){\mathsf{atomicADD}}(u.{\mathsf{height}},1);
Algorithm 3 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(G,ϵ){\mathsf{detectClusters}}(G,\epsilon)

Detect clusters. After the core vertices have been identified, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} detects the clusters in the graph, which is shown in Algorithm 3. After the core vertex identification phase, every vertex is either a core vertex or a non-core vertex. Moreover, the similarities of some edges are known, but there are still edges whose similarities are unknown. Therefore, Algorithm 3 first detects the sub-clusters that can be determined by the already computed information (lines 1-9). After that, it computes the similarity for the edges whose similarities are unknown and obtains the complete clusters in the graph (lines 10-27). In this way, we can avoid some unnecessary similarity computations compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}.

Specifically, Algorithm 3 uses the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array to store the cluster id that each vertex belongs to (a negative cluster id indicates that the corresponding vertex is currently not in a cluster). Similar to 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, Algorithm 3 also uses a spanning forest to represent the clusters through the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array. However, Algorithm 3 uses a different algorithm to construct the forest, which makes 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} much more efficient than 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, as verified in our experiments. It initializes each core vertex uu as a separate cluster with cluster id uu and each non-core vertex with a negative cluster id (lines 1-3). Then, for each core vertex uu, it first finds the root id of its cluster by the procedure 𝗋𝗈𝗈𝗍\mathsf{root} (line 5). After that, for each neighbor vv of uu, if vv is also a core vertex and uu and vv are already known to be similar (line 7), then uu and vv belong to the same cluster following Definition 2.1. Thus, we union the subtrees represented by uu and vv in 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} through procedure 𝗎𝗇𝗂𝗈𝗇\mathsf{union} (lines 6-9).

After having explored the sub-clusters that can be determined by the computed information, for each core vertex uu, Algorithm 3 further explores the neighbors vv of uu such that vv is also a core vertex but the similarity between uu and vv is unknown (lines 10-13). If uu and vv are currently not in the same cluster (line 15), it computes the similarity between uu and vv (line 16). If uu and vv are similar, we union the subtrees represented by uu and vv in the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array (line 17). Otherwise, uu and vv are marked as dissimilar (line 18). When the clusters of the core vertices are determined, the root cluster id is assigned to each core vertex (lines 19-20). At last, Algorithm 3 explores the non-core neighbors vv of each core vertex uu; if the similarity between uu and vv is unknown, their similarity is computed (lines 24-25). If uu and vv are similar, vv is merged into the cluster represented by uu (lines 26-27). The procedures 𝗋𝗈𝗈𝗍\mathsf{root} and 𝗎𝗇𝗂𝗈𝗇\mathsf{union} are used to find the root of a tree in the forest and to union two sub-trees in the forest, respectively. As they are self-explanatory, we omit their descriptions. Note that u.𝗁𝖾𝗂𝗀𝗁𝗍u.{\mathsf{height}}/v.𝗁𝖾𝗂𝗀𝗁𝗍v.{\mathsf{height}} in line 34 represents the height of the tree rooted at uu/vv.
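The skeleton of this phase can be sketched with a plain sequential union-find (our simplified illustration; the GPU version performs the parent updates with atomicCAS and balances trees by height, both of which are omitted here):

```python
def root(parent, u):
    # follow parent pointers to the root of u's tree
    while parent[u] != u:
        u = parent[u]
    return u

def union(parent, u, v):
    # merge the trees containing u and v
    ru, rv = root(parent, u), root(parent, v)
    if ru != rv:
        parent[rv] = ru

def detect_clusters(n, cores, similar_edges):
    """cores: set of core vertex ids; similar_edges: edges known similar."""
    parent = {u: (u if u in cores else -2) for u in range(n)}
    # step 1: union similar core-core edges into sub-clusters
    for u, v in similar_edges:
        if u in cores and v in cores:
            union(parent, u, v)
    # step 2: flatten so every core points directly to its root cluster id
    for u in cores:
        parent[u] = root(parent, u)
    # step 3: attach similar non-core neighbors to a core's cluster
    for u, v in similar_edges:
        if u in cores and v not in cores:
            parent[v] = parent[u]
        elif v in cores and u not in cores:
            parent[u] = parent[v]
    return parent
```

In Algorithm 3 the first step only uses similarities that were already computed during core identification; extra checkSim calls happen between steps 1 and 2 and are skipped whenever two cores are already in the same tree.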

(a) Process core vertices v1v_{1} and v4v_{4}
(b) Process v4v_{4} and v0v_{0}
(c) Process v7v_{7} and v0v_{0}
(d) Process v7v_{7} and v1v_{1}
(e) Trace root vertex
(f) Cluster non-core v2v_{2}
Figure 5: Detect clusters in GG^{\prime}

Example 4.3: Fig. 5 shows the procedure to detect clusters in GG^{\prime}. Assume that we know that v0v_{0}, v1v_{1}, v4v_{4}, v7v_{7} are core vertices and that (v1,v0)(v_{1},v_{0}) and (v4,v1)(v_{4},v_{1}) are similar, but the similarities of (v4,v0)(v_{4},v_{0}), (v7,v0)(v_{7},v_{0}) and (v7,v1)(v_{7},v_{1}) are unknown. The parents of the four core vertices are initialized as themselves while the others are set as -2.

𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} first processes the core vertices in parallel. To avoid redundant computation, when clustering core vertices, only neighbors with smaller ids than their own are considered. Concurrently, for v1v_{1}, since (v1,v0)(v_{1},v_{0}) are similar, the parent of v1v_{1} becomes v0v_{0}, and for v4v_{4}, since (v4,v1)(v_{4},v_{1}) are similar, the parent of v4v_{4} becomes v1v_{1}, which is shown in Fig. 5 (a). After that, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} detects the clusters related to v4v_{4} and v7v_{7}, as the similarities between some of their neighbors and themselves have not been determined yet. For the neighbor v0v_{0} of v4v_{4}, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} finds that v0v_{0} and v4v_{4} are already in the same cluster by procedure 𝗋𝗈𝗈𝗍\mathsf{root}, so no further processing is needed, which is shown in Fig. 5 (b). For v7v_{7}, due to the atomic operation, the thread that explores its neighbor v0v_{0} executes first. It finds that they are not in the same cluster based on procedure 𝗋𝗈𝗈𝗍\mathsf{root}. Therefore, after determining that they are similar by procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim}, it changes the parent of v7v_{7} to v0v_{0}, which is shown in Fig. 5 (c). Then, v7v_{7} and v1v_{1} are processed by another thread similarly, which is shown in Fig. 5 (d). When the cluster of core vertices v0v_{0}, v1v_{1}, v4v_{4}, and v7v_{7} is detected, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} explores these vertices and sets their corresponding elements in the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array to the root vertex in the forest. Fig. 5 (e) illustrates this process, in which the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} of v4v_{4} changes to the traced root vertex v0v_{0}. At last, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} sets the parent of the non-core vertex v2v_{2} based on its similarity with the core vertex v1v_{1}, as shown in Fig. 5 (f). \Box

1 for uV𝗉𝖺𝗋𝖾𝗇𝗍[u]<0u\in V\wedge{\mathsf{parent}}[u]<0 in warp-level parallel do
2      𝗉𝖺𝗋𝖾𝗇𝗍[u]2{\mathsf{parent}}[u]\leftarrow-2; 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfO}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfO}⃝;
3      if |𝗇𝖻𝗋(u)|>1|{\mathsf{nbr}}(u)|>1 then
4           𝗇𝖼1{\mathsf{n_{c}}}\leftarrow-1;
5           for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in parallel  do
6                𝗂𝖽1{\mathsf{id}}\leftarrow-1;
7                if 𝗉𝖺𝗋𝖾𝗇𝗍[v]>=0{\mathsf{parent}}[v]>=0 then
8                     𝗂𝖽𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲(&𝗇𝖼,1,𝗉𝖺𝗋𝖾𝗇𝗍[v]){\mathsf{id}}\leftarrow{\mathsf{atomicCAS}}({\mathsf{\&n_{c}}},-1,{\mathsf{parent}}[v]);
9                     if 𝗂𝖽>=0𝗂𝖽𝗉𝖺𝗋𝖾𝗇𝗍[v]{\mathsf{id}}>=0\wedge{\mathsf{id}}\neq{\mathsf{parent}}[v] then
10                          𝗉𝖺𝗋𝖾𝗇𝗍[u]1{\mathsf{parent}}[u]\leftarrow-1; 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfH}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfH}⃝;
11                          𝖻𝗋𝖾𝖺𝗄{\mathsf{break}};
12                         
Algorithm 4 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(G){\mathsf{classifyHubOutlier}}(G)

Classify hub and outlier vertices. At last, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} classifies the vertices that are not in clusters into outliers and hubs, which is shown in Algorithm 4. Specifically, it assigns a warp for each vertex uu not in a cluster and initially considers uu as an outlier (line 2). Then it checks whether uu has more than one neighbor, and if this condition is satisfied, uu has the potential to become a hub (line 3). Then, Algorithm 4 uses the shared variable 𝗇𝖼{\mathsf{n_{c}}} of the threads in a warp to record the cluster that one of uu's neighbors belongs to and initializes it as 1-1 (line 4). Each thread in the warp maintains an exclusive variable 𝗂𝖽{\mathsf{id}} with an initial value of 1-1 (line 5). If a neighbor vv of uu is found to be in a cluster by a thread, 𝗇𝖼{\mathsf{n_{c}}} is updated to the cluster id of vv by 𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲\mathsf{atomicCAS} (lines 7-8). Once a thread finds that the cluster id recorded by 𝗇𝖼{\mathsf{n_{c}}} is different from the cluster id of the vv it processes, at least two neighbors of uu are in different clusters; therefore, uu is classified as a hub (lines 9-11).
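Stripped of the warp-level machinery, the classification rule reduces to the following check (a sequential sketch with names of our own choosing; the warp-parallel version detects the second distinct cluster id with atomicCAS instead of building a set):

```python
def classify(u, nbrs, parent):
    """u: a vertex not in any cluster; nbrs: its neighbors;
    parent: cluster id per vertex, negative = not in a cluster."""
    # collect the distinct clusters of u's clustered neighbors
    clusters = {parent[v] for v in nbrs if parent[v] >= 0}
    # hub if neighbors span at least two clusters, otherwise outlier
    return 'H' if len(clusters) >= 2 else 'O'
```

The warp can stop early as soon as a second distinct cluster id is observed, which is what the break in line 11 of Algorithm 4 achieves.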

4.3 Theoretical Analysis

Theorem 4.1: Given a graph GG and two parameters ϵ\epsilon and μ\mu, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} finishes the clustering with work O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and span O(log𝖽𝖾𝗀𝗆𝖺𝗑+logn)O(\log{\mathsf{deg}}_{{\mathsf{max}}}+\log n). \Box

Proof: In Phase 1 (Algorithm 2), the work and span of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} are O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and O(log𝖽𝖾𝗀max)O(\log{\mathsf{deg}}_{max}), respectively. In Phase 2 (Algorithm 3), 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} may check the similarity of all edges; thus, the work of Phase 2 is O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) as well. For the span, the height of the spanning forest is O(logn)O(\log n) due to the union-by-height strategy used in procedure 𝗎𝗇𝗂𝗈𝗇\mathsf{union} of Algorithm 3 [34]. Thus, the span of Phase 2 is O(logn)O(\log n). In Phase 3 (Algorithm 4), it is easy to derive that the work and span are O(m)O(m) and O(1)O(1), respectively. Therefore, the work of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} is O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and the span is O(log𝖽𝖾𝗀𝗆𝖺𝗑+logn)O(\log{\mathsf{deg}}_{{\mathsf{max}}}+\log n).

Compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, Theorem 4.1 shows that 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} reduces both the total work and the span during the clustering. Note that in Phase 1, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} uses the binary-search-based manner to check the similarity, which reduces the span of Phase 1 from O(𝖽𝖾𝗀𝗆𝖺𝗑)O({\mathsf{deg}}_{{\mathsf{max}}}) to O(log𝖽𝖾𝗀𝗆𝖺𝗑)O(\log{\mathsf{deg}}_{{\mathsf{max}}}) compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}.

5 A New Out-of-Core Algorithm

In the previous section, we focused on improving the clustering performance when the graph can fit in the GPU memory. However, in real applications, the graph data can be very large and the GPU memory is insufficient to load the whole graph. A straightforward solution is to directly use the Unified Virtual Memory (UVM) provided by GPUs. However, this approach is inefficient, as verified in our experiments. On the other hand, in most real graphs, such as social networks and web graphs, the number of edges is much larger than the number of vertices. For example, the largest dataset in SNAP contains 65 million nodes and 1.8 billion edges. It is therefore practical to assume that the vertices of the graph can be loaded in the GPU memory while the edges are stored in the CPU memory.

Following this assumption, we propose an out-of-core GPU 𝖲𝖢𝖠𝖭\mathsf{SCAN} algorithm that adopts a divide-and-conquer strategy. Specifically, we first divide the graph into small subgraphs whose size does not exceed the GPU memory size. Then, we conduct the clustering based on the divided subgraphs instead of the original graph. As the divided subgraphs are much smaller than the original graphs, the GPU memory requirement is significantly reduced. Before introducing our algorithm, we first define the local subgraphs as follows:

Definition 5.1: (Edge Extended Subgraph) Given a graph GG and a set of edges SE(G)S\subseteq E(G), the edge extended subgraph of SS consists of the edges in SS together with every edge (u,v)E(G)(u,v)\in E(G) that is incident to at least one edge in SS. \Box

Lemma 5.1: Given a graph GG and a set of edges SE(G)S\subseteq E(G), let GsG_{s} denote the edge extended subgraph of SS. Then, for any edge (u,v)S(u,v)\in S, σGs(u,v)=σ(u,v)\sigma_{G_{s}}(u,v)=\sigma(u,v), where σGs(u,v)\sigma_{G_{s}}(u,v) denotes the structural similarity computed based on GsG_{s}. \Box


Figure 6: Edge Extended Subgraph of GG^{\prime}

Example 5.1: Fig. 6 shows an edge extended subgraph of GG^{\prime}. As (v8,v9)(v_{8},v_{9}) and (v2,v8)(v_{2},v_{8}) share the same vertex v8v_{8} but (v8,v2)(v_{8},v_{2}) is not in SS, (v8,v2)(v_{8},v_{2}) should be added into the edge extended subgraph. \Box

Therefore, we divide GG into a series of edge extended subgraphs GS1,GS2,,GSkG_{S_{1}},G_{S_{2}},\dots,G_{S_{k}} such that S1S2Sk=E(G)S_{1}\cup S_{2}\cup\dots\cup S_{k}=E(G) and SiSj=S_{i}\cap S_{j}=\emptyset, where 1ijk1\leq i\neq j\leq k, and each edge extended subgraph can be loaded in the GPU memory. Following Lemma 5.1, we can correctly obtain the structural similarity for all edges in SiS_{i} based on GSiG_{S_{i}}. Moreover, as the vertices can be loaded in the GPU memory, we can identify the core vertices by iterating over the edge extended subgraphs and maintain the clustering information in the GPU memory accordingly.
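Definition 5.1 and Lemma 5.1 can be illustrated with a small sketch (our own Python illustration; `sigma` recomputes the structural similarity from scratch over closed neighborhoods, and all names are assumptions). Because the extended subgraph keeps every edge touching an endpoint of S, the neighborhoods of both endpoints of each edge in S are preserved, so the similarity is unchanged.

```python
import math

def edge_extended(edges, S):
    """Edges of S plus every edge of G incident to an edge in S (Def. 5.1)."""
    touched = {x for e in S for x in e}
    return set(S) | {(u, v) for (u, v) in edges if u in touched or v in touched}

def sigma(edges, u, v):
    """Structural similarity of (u, v) over closed neighborhoods."""
    nbr = lambda x: ({b for a, b in edges if a == x} |
                     {a for a, b in edges if b == x})
    nu, nv = nbr(u) | {u}, nbr(v) | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```

On a path-like graph, the extended subgraph of a single edge keeps only the edges around its endpoints, yet the similarity of that edge matches its value in the whole graph, which is exactly Lemma 5.1.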

1 Gs(Vs,Es,𝗌𝗂𝗆s);SGs;s0G_{s}(V_{s},E_{s},{\mathsf{sim}}_{s})\leftarrow{\emptyset};S\leftarrow G_{s};s\leftarrow 0
2
3/* Code Executed by CPU */
4 for (u,v)E(G)(u,v)\in E(G)  do
5      Gs(Vs,Es,𝗌𝗂𝗆s)G^{\prime}_{s}(V^{\prime}_{s},E^{\prime}_{s},{\mathsf{sim^{\prime}}}_{s})\leftarrow{\emptyset}
6      for w𝗇𝖻𝗋(u)/𝗇𝖻𝗋(v)w\in{\mathsf{nbr}}(u)/{\mathsf{nbr}}(v) do
7           if wVsw\notin V_{s} then
8                VsVs{w}V^{\prime}_{s}\leftarrow V^{\prime}_{s}\cup\{w\};
9               
10          if (u,w)Es(u,w)\notin E_{s} then
11                EsEs{(u,w)}E^{\prime}_{s}\leftarrow E^{\prime}_{s}\cup\{(u,w)\};
12               
13     if 𝖬𝖾𝗆𝗈𝖿(𝖦𝗌)+𝖬𝖾𝗆𝗈𝖿(𝖦𝗌)+15|V(G)|>Mg{\mathsf{Memof(G_{s})}}+{\mathsf{{Memof}(G^{\prime}_{s})}}+15|V(G)|>M_{g} then
14           ss+1;Gs(Vs,Es,𝗌𝗂𝗆s);SSGss\leftarrow s+1;G_{s}(V_{s},E_{s},{\mathsf{sim}}_{s})\leftarrow{\emptyset};S\leftarrow S\cup G_{s}
15          
16     VsVsVs;EsEsEsV_{s}\leftarrow V_{s}\cup V^{\prime}_{s};E_{s}\leftarrow E_{s}\cup E^{\prime}_{s}
17     
18     𝗌𝗂𝗆(u,v)\footnotesize{\bf{?}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf{?}}⃝; 𝗌𝗂𝗆s𝗌𝗂𝗆s{𝗌𝗂𝗆(u,v)}{\mathsf{sim}}_{s}\leftarrow{\mathsf{sim}}_{s}\cup\left\{{\mathsf{sim}}(u,v)\right\}
19     
20/* Code Executed by GPU */
21 for vVv\in V in parallel do
22      Nϵ¯[v]1\underline{N_{\epsilon}}[v]\leftarrow 1; Nϵ¯[v]𝖵𝖠[v+1]𝖵𝖠[v]+1\overline{N_{\epsilon}}[v]\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v]+1; 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf{?}}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf{?}}⃝; 𝗉𝖺𝗋𝖾𝗇𝗍[v]2{\mathsf{parent}}[v]\leftarrow-2;
23for GsSG_{s}\in S do
24      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(𝖦𝗌,μ,ϵ){\mathsf{identifyCore(G_{s},\mu,\epsilon)}}; 𝗌𝗍𝗈𝗋𝖾(𝖦𝗌){\mathsf{store(G_{s})}};
25for GsSG_{s}\in S do
26      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(𝖦𝗌,ϵ){\mathsf{detectClusters(G_{s},\epsilon)}}; 𝗌𝗍𝗈𝗋𝖾(𝖦𝗌){\mathsf{store(G_{s})}};
27for GsSG_{s}\in S do
28      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(Gs,ϵ){\mathsf{classifyHubOutlier}}(G_{s},\epsilon);
Algorithm 5 𝖦𝖯𝖴𝖲𝖢𝖠𝖭𝖮++(G,μ,ϵ,Mg){\mathsf{GPUSCAN^{++}_{O}}}(G,\mu,\epsilon,M_{g})

Algorithm. Following the above idea, our algorithm GPU-based out-of-core algorithm is shown in Algorithm 5. It first partitions the edges E(G)E(G) of GG into a series of disjoint sets and constructs the corresponding edge extended subgraphs that can be fit in the GPU memory based on the GPU memory capacity MgM_{g} following Definition 5 by CPU (line 3-12). After that, we initialize Nϵ¯[v]\underline{N_{\epsilon}}[v], Nϵ¯[v]\overline{N_{\epsilon}}[v], 𝗋𝗈𝗅𝖾{\mathsf{role}}, and 𝗉𝖺𝗋𝖾𝗇𝗍{\mathsf{parent}} array for each vertex in GPU memory following our assumption. After that, based on Lemma 5, the similarity for each edge can be correctly obtained by the edge extended subgraph alone. Accordingly, the clusters and the role of each vertex can also be determined by the edge extended subgraph locally. Therefore, we just load each edge extended subgraph into the GPU memory sequentially and repeat the three phases by invoking 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾\mathsf{identifyCore}, 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌\mathsf{detectClusters}, and 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋\mathsf{classifyHubOutlier} for the in-GPU-memory algorithm (lines 14-21). The correctness of Algorithm 5 can be easily obtained based on Lemma 5 and the correctness of Algorithm 1. The only thing that needs to explain is how to estimate the size of the edge extended subgraph in line 10. In Algorithm 5, GsG_{s} represents the current edge extended subgraph (line 1). Whenever a new edge (u,v)(u,v) is to be added into GsG_{s} (line 3), the edges incident to (u,v)(u,v) are computed and stored by GsG^{\prime}_{s}. According to the CSR-Enhanced graph layout introduced in Section 4.1, 𝖬𝖾𝗆𝗈𝖿(Gs)=25×|Es|+4×|Vs|{\mathsf{Memof}}(G_{s})=25\times|E_{s}|+4\times|V_{s}|. As adjacent array, Eid array, and degree-oriented edge array consumes 24×|Es|24\times|E_{s}| bytes together, the similarity array consumes |Es||E_{s}| byte, and vertex array consumes 4×|Vs|4\times|V_{s}| (we use 4-byte integer to store a vertex id). 
Memof(G′s) can be computed in the same way. Moreover, the per-vertex arrays N̲ϵ (2 bytes per element, as each element never exceeds the parameter μ, which is generally not very large), N̄ϵ (4 bytes per element), role (1 byte per element), parent (4 bytes per element), and height (4 bytes per element) take 15×|V(G)| bytes in total. If the sum of these three memory consumptions exceeds the GPU memory capacity Mg in line 10, we start a new empty edge extended subgraph in line 11; otherwise, G′s is merged into Gs in line 12.
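To make the size check in line 10 concrete, the following is a minimal host-side Python sketch of the memory-driven edge partitioning, not the paper's actual CPU code: the function names memof and partition_edges are ours, and for brevity the sketch applies the byte formulas above to one edge at a time, ignoring the incident-edge expansion G′s.

```python
def memof(num_edges, num_vertices):
    # CSR-Enhanced layout: 24 bytes/edge for the adjacency, Eid, and
    # degree-oriented edge arrays, 1 byte/edge for the similarity array,
    # plus 4 bytes per vertex id.
    return 25 * num_edges + 4 * num_vertices

def partition_edges(edges, num_total_vertices, gpu_mem):
    """Greedily pack edges into subgraphs whose estimated footprint,
    together with the resident per-vertex arrays, fits in gpu_mem."""
    # Per-vertex arrays kept in GPU memory for the whole run:
    # 2 (N_eps lower) + 4 (N_eps upper) + 1 (role) + 4 (parent) + 4 (height)
    fixed = 15 * num_total_vertices
    parts, cur_edges, cur_verts = [], [], set()
    for u, v in edges:
        new_verts = cur_verts | {u, v}
        if cur_edges and fixed + memof(len(cur_edges) + 1, len(new_verts)) > gpu_mem:
            parts.append(cur_edges)  # current subgraph is full: flush it
            cur_edges, cur_verts = [], set()
            new_verts = {u, v}
        cur_edges.append((u, v))
        cur_verts = new_verts
    if cur_edges:
        parts.append(cur_edges)
    return parts
```

Each returned part then plays the role of one edge extended subgraph that is loaded into GPU memory in turn.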

As verified in our experiments, Algorithm 5 can efficiently finish the SCAN clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

6 Experiments

This section presents our experimental results. The in-memory algorithms are evaluated on a machine with an Intel Xeon 2.4GHz CPU, 128 GB main memory, and an Nvidia Tesla V100 16GB GPU, running Ubuntu 16.04 LTS. The out-of-core algorithms are evaluated on a machine with an Nvidia GTX 1050 2GB GPU, running Ubuntu 16.04 LTS.

TABLE II: Datasets used in Experiments (n: number of vertices; m: number of edges; d̄: average degree; c: number of iterations of GPUSCAN's cluster detection)
Datasets      Name  n           m              d̄       c
Enwiki-2022   EW    6,492,490   159,047,205    24.50    91
IndoChina     CN    7,414,866   194,109,311    26.18    143
Hollywood     HW    2,180,759   228,985,632    105.00   35
Orkut         OR    3,072,626   234,370,166    76.28    28
Tech-P2P      TP    5,792,297   295,659,774    51.04    173
UK-2002       UK    18,520,486  298,113,762    16.10    201
EU-2015       EU    11,264,052  386,915,963    34.35    230
Soc-Twitter   ST    21,297,773  530,051,090    24.89    30
Twitter-2010  TW    41,652,230  1,468,365,182  35.25    -
Friendster    FR    65,608,366  1,806,067,135  27.53    -
Figure 7: Performance when varying ϵ (μ=6); one panel per dataset: (a) EW, (b) CN, (c) HW, (d) OR, (e) TP, (f) UK, (g) EU, (h) ST.
Figure 8: Performance when varying μ (ϵ=0.5); one panel per dataset: (a) EW, (b) CN, (c) HW, (d) OR, (e) TP, (f) UK, (g) EU, (h) ST.

Datasets. We evaluate the algorithms on ten real graphs. OR and FR are downloaded from SNAP [17]. TP and ST are downloaded from the Network Repository [25]. CN, EW, HW, UK, and EU are downloaded from WebGraph [6, 5, 4]. The details are shown in Table II. The first eight datasets are used to evaluate the in-memory algorithms, while the last two cannot fit into the memory of the Nvidia Tesla V100 GPU and are used to evaluate the out-of-core algorithms.

Algorithms. We evaluate the following algorithms:

  • GPUSCAN: The state-of-the-art GPU-based algorithm [37], which is introduced in Section 3.2.

  • GPUSCAN++: Our proposed GPU-based in-memory algorithm (Algorithm 1 in Section 4).

  • GPUSCAN++_UVM: The direct out-of-core algorithm, i.e., GPUSCAN++ executed on top of unified virtual memory (UVM).

  • GPUSCAN++_O: Our proposed GPU-based out-of-core algorithm (Algorithm 5 in Section 5).

We use CUDA 10.1 and GCC 4.8.5 to compile all code with the -O3 option. The time cost of the algorithms is measured as the elapsed wall-clock time during execution. We set the maximum running time for each test to 100,000 seconds; if a test does not finish within the time limit, we denote the corresponding running time as INF. For ϵ, we choose ϵ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8} with μ=6 as the default. For μ, we choose μ ∈ {3, 6, 11, 16, 31} with ϵ=0.5 as the default.

TABLE III: Time consumption of GPUSCAN and GPUSCAN++ in each phase with ϵ=0.5 and μ=6 (s)
Dataset  Phase 1 (GPUSCAN / GPUSCAN++ / Speedup)  Phase 2 (GPUSCAN / GPUSCAN++ / Speedup)  Phase 3 (GPUSCAN / GPUSCAN++ / Speedup)
EW   581.32 / 6.35 / 91.55      0.80 / 0.022 / 3.64         10.60 / 0.23 / 46.08
CN   846.88 / 4.31 / 196.49     94.50 / 0.025 / 3780.00     6.57 / 0.071 / 92.48
HW   203.84 / 9.33 / 21.85      76.579 / 0.010 / 7657.90    5.94 / 0.017 / 349.41
OR   108.43 / 6.02 / 18.01      1.14 / 0.013 / 87.69        8.49 / 0.12 / 70.75
TP   998.70 / 13.65 / 73.16     4.24 / 0.021 / 201.90       10.87 / 1.25 / 8.696
UK   315.31 / 6.95 / 45.37      65.33 / 0.058 / 1126.38     11.96 / 0.16 / 74.75
EU   2456.73 / 16.58 / 148.17   493.97 / 0.036 / 13721.39   13.09 / 0.237 / 55.23
ST   3030.11 / 40.04 / 75.68    11.39 / 0.064 / 117.97      18.20 / 0.30 / 60.67

6.1 Performance Studies on In-memory Algorithms

Exp-1: Performance when varying ϵ. In this experiment, we evaluate the performance of GPUSCAN and GPUSCAN++ by varying ϵ. We report the processing time on the first eight datasets in Fig. 7.

Fig. 7 shows that GPUSCAN consumes much more time than GPUSCAN++ across all values of ϵ. For example, on EU, GPUSCAN takes 3141.11s to finish the clustering when ϵ=0.2 while GPUSCAN++ only takes 18.93s, a speedup of 168 times. This is because GPUSCAN introduces lots of extra parallelization costs compared with GPUSCAN++, which is consistent with the time complexity analysis in Section 3 and Section 4. Moreover, Fig. 7 shows that the running times of both GPUSCAN and GPUSCAN++ decrease as ϵ increases. For GPUSCAN, this is because as ϵ increases, the number of core vertices and clusters decreases, which reduces the time for cluster detection; as Exp-3 shows, the cluster detection phase accounts for a large proportion of the total time. For GPUSCAN++, as ϵ increases, more vertex pairs are dissimilar; consequently, more unnecessary similarity computation is avoided, and the running time decreases as well.
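For intuition on why a larger ϵ prunes more work, recall the standard SCAN structural similarity [40] over closed neighborhoods. The following is a minimal Python sketch under our own naming (the function similar and the dict-of-sets adjacency representation are assumptions for illustration, not the paper's GPU code):

```python
import math

def similar(adj, u, v, eps):
    """Check whether u and v are structurally similar under threshold eps:
    |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|) >= eps,
    where N[x] is the closed neighborhood of x (neighbors plus x itself)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    common = len(nu & nv)
    # A pair needs at least eps * sqrt(|N[u]||N[v]|) common neighbors to
    # be similar, so a larger eps lets more pairs fail (and be skipped)
    # before any further computation on them.
    return common >= eps * math.sqrt(len(nu) * len(nv))
```

For instance, a pair sharing only a couple of neighbors passes a low threshold such as ϵ=0.2 but fails a high one such as ϵ=0.8, so high-ϵ runs discard more pairs early.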

Exp-2: Performance when varying μ. In this experiment, we evaluate the performance of GPUSCAN and GPUSCAN++ by varying μ. The results are shown in Fig. 8.

As shown in Fig. 8, GPUSCAN++ is much more efficient than GPUSCAN. For example, on UK, GPUSCAN takes 411.97s to finish the clustering when μ=3 while GPUSCAN++ only takes 8.866s. The reasons are the same as discussed in Exp-1.

Exp-3: Time consumption of each phase. In this experiment, we compare the time consumption of GPUSCAN and GPUSCAN++ in each phase with ϵ=0.5 and μ=6. The results are shown in Table III.

As shown in Table III, GPUSCAN++ achieves significant speedup over GPUSCAN in all three phases, especially in Phase 2. For example, on dataset CN, GPUSCAN takes 846.88s/94.50s/6.57s in Phase 1/Phase 2/Phase 3, respectively, while GPUSCAN++ takes 4.31s/0.025s/0.071s in these three phases, a speedup of 196.49/3780.00/92.48 times, respectively. The reasons are as follows. For Phase 1, the span of GPUSCAN is O(deg_max), while GPUSCAN++ reduces the span to O(log(deg_max)). For Phase 2, GPUSCAN has to construct E′.u and E′.v by sort() and partition() of the Thrust library in each iteration, which is time-consuming; moreover, it needs several iterations to finish the whole cluster detection, as shown in the last column of Table II. On the other hand, GPUSCAN++ constructs the spanning forest based on the parent array directly, and the time-consuming retrieval of E′.u and E′.v is avoided thanks to the CSR-Enhanced graph layout. For Phase 3, GPUSCAN still needs to retrieve E′.u and E′.v from A.u and A.v, while GPUSCAN++ avoids these time-consuming operations entirely.
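To make the Phase 2 contrast concrete, the sketch below illustrates the idea behind spanning-forest-based cluster detection: once every vertex stores a parent pointer, cluster labels can be resolved by pointer jumping, with no sort/partition passes at all. This is a simplified sequential Python sketch under our own naming (resolve_roots is a hypothetical function, not the paper's CUDA kernel):

```python
def resolve_roots(parent):
    """Collapse each vertex's parent chain down to its root by pointer
    jumping: repeatedly replace parent[v] with parent[parent[v]].

    On a GPU the inner loop runs as a data-parallel kernel over all
    vertices; each pass at least halves the remaining path lengths, so
    the labels converge in O(log n) rounds."""
    parent = list(parent)
    changed = True
    while changed:
        changed = False
        for v in range(len(parent)):
            grand = parent[parent[v]]
            if grand != parent[v]:
                parent[v] = grand  # jump over one level of the tree
                changed = True
    return parent
```

After convergence, two vertices belong to the same cluster tree exactly when they point to the same root, which serves directly as the cluster label.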

6.2 Performance Studies on Out-of-Core Algorithms

Exp-5: Performance when varying ϵ. In this experiment, we evaluate the performance of GPUSCAN++_UVM and GPUSCAN++_O when varying the value of ϵ on datasets TW and FR. The results are shown in Fig. 9.

Fig. 9 shows that GPUSCAN++_O is much more efficient than GPUSCAN++_UVM on these two datasets. For example, on TW, GPUSCAN++_UVM takes 30886.315s to finish the clustering when ϵ=0.2 while GPUSCAN++_O finishes in 3866.424s. On FR, GPUSCAN++_UVM cannot finish the clustering when ϵ=0.2, 0.3, or 0.4, while GPUSCAN++_O finishes in a reasonable time. This is because UVM provides a general mechanism to overcome the GPU memory limitation, but the data locality of structural clustering is poor, so GPUSCAN++_UVM incurs lots of time-consuming data movement between CPU memory and GPU memory. On the other hand, GPUSCAN++_O conducts the clustering based on the edge extended subgraphs, which not only avoids the time-consuming data movement but also overcomes the GPU memory limitation. Moreover, the running times of GPUSCAN++_O and GPUSCAN++_UVM decrease as the value of ϵ increases. This is because as ϵ increases, more vertex pairs are dissimilar; consequently, more unnecessary similarity computation is avoided in both algorithms.

Figure 9: Performance when varying ϵ (μ=6); panels: (a) TW, (b) FR.
Figure 10: Performance when varying μ (ϵ=0.5); panels: (a) TW, (b) FR.

Exp-6: Performance when varying μ. In this experiment, we evaluate the performance of GPUSCAN++_UVM and GPUSCAN++_O when varying the value of μ on datasets TW and FR. The results are shown in Fig. 10.

As shown in Fig. 10, GPUSCAN++_O is much more efficient than GPUSCAN++_UVM. For example, on TW, GPUSCAN++_UVM takes 19037.683s to finish the clustering when μ=3 while GPUSCAN++_O finishes in 2974.947s. The reasons are the same as discussed in Exp-5.

7 Conclusion

In this paper, we study the GPU-based structural clustering problem. Motivated by the fact that the state-of-the-art GPU-based structural clustering algorithm suffers from inefficiency and the GPU memory limitation, we propose new GPU-based structural clustering algorithms. To address the efficiency issue, we propose a new progressive clustering method tailored for GPUs that not only avoids extra parallelization costs but also fully exploits the computing resources of GPUs. To address the GPU memory limitation issue, we propose a partition-based algorithm for structural clustering that can process large graphs with limited GPU memory. We conduct experiments on ten real graphs, and the experimental results demonstrate the efficiency of our proposed algorithms.

References

  • [1] C. C. Aggarwal and H. Wang. A survey of clustering algorithms for graph data. In Managing and mining graph data, pages 275–301. Springer, 2010.
  • [2] A. Bellogín and J. Parapar. Using graph partitioning techniques for neighbour selection in user-based collaborative filtering. In Proceedings of ACM RecSys, pages 213–216, 2012.
  • [3] C. Biemann. Chinese whispers-an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the first workshop on graph based methods for natural language processing, pages 73–80, 2006.
  • [4] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711–726, 2004.
  • [5] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 587–596. ACM Press, 2011.
  • [6] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595–601, Manhattan, USA, 2004. ACM Press.
  • [7] L. Chang, W. Li, L. Qin, W. Zhang, and S. Yang. pSCAN: Fast and exact structural graph clustering. IEEE Transactions on Knowledge and Data Engineering, 29(2):387–401, 2017.
  • [8] Y. Che, S. Sun, and Q. Luo. Parallelizing pruning-based graph structural clustering. In Proceedings of ICPP, pages 77:1–77:10, 2018.
  • [9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2022.
  • [10] Y. Ding, M. Chen, Z. Liu, D. Ding, Y. Ye, M. Zhang, R. Kelly, L. Guo, Z. Su, S. C. Harris, et al. atbionet–an integrated network analysis tool for genomics and biomarker discovery. BMC genomics, 13(1):1–12, 2012.
  • [11] W. Fan, R. Jin, M. Liu, P. Lu, X. Luo, R. Xu, Q. Yin, W. Yu, and J. Zhou. Application driven graph partitioning. In Proceedings of SIGMOD, pages 1765–1779, 2020.
  • [12] P. Gera, H. Kim, P. Sao, H. Kim, and D. Bader. Traversing large graphs on gpus with unified memory. Proceedings of the VLDB Endowment, 13(7):1119–1133, 2020.
  • [13] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
  • [14] R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. nature, 433(7028):895–900, 2005.
  • [15] L. Hu, L. Zou, and Y. Liu. Accelerating triangle counting on gpu. In Proceedings of SIGMOD, pages 736–748, 2021.
  • [16] U. Kang and C. Faloutsos. Beyond ’caveman communities’: Hubs and spokes for graph compression and mining. In Proceedings of ICDM, pages 300–309, 2011.
  • [17] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [18] X. Li, H. Cai, Z. Huang, Y. Yang, and X. Zhou. Social event identification and ranking on flickr. World Wide Web, 18(5):1219–1245, 2015.
  • [19] W. Lin, X. Xiao, X. Xie, and X.-L. Li. Network motif discovery: A gpu approach. IEEE transactions on knowledge and data engineering, 29(3):513–528, 2016.
  • [20] V. Martha, W. Zhao, and X. Xu. A study on twitter user-follower network: A network based analysis. In Proceedings of ASONAM, pages 1405–1409, 2013.
  • [21] V.-S. Martha, Z. Liu, L. Guo, Z. Su, Y. Ye, H. Fang, D. Ding, W. Tong, and X. Xu. Constructing a robust protein-protein interaction network by integrating multiple public databases. In BMC bioinformatics, volume 12, pages 1–10. Springer, 2011.
  • [22] L. Meng, L. Yuan, Z. Chen, X. Lin, and S. Yang. Index-based structural clustering on directed graphs. In Proceedings of ICDE, pages 2832–2845, 2022.
  • [23] NVIDIA. Cuda thrust library. https://docs.nvidia.com/cuda/thrust/, 2021.
  • [24] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. nature, 435(7043):814–818, 2005.
  • [25] R. A. Rossi and N. K. Ahmed. Graph repository. http://www.graphrepository.com, 2013.
  • [26] B. Ruan, J. Gan, H. Wu, and A. Wirth. Dynamic structural clustering on graphs. In Proceedings of SIGMOD, pages 1491–1503, 2021.
  • [27] S. E. Schaeffer. Graph clustering. Computer science review, 1(1):27–64, 2007.
  • [28] M. Schinas, S. Papadopoulos, Y. Kompatsiaris, and P. A. Mitkas. Visual event summarization on social media using topic modelling and graph-based ranking algorithms. In Proceedings of ICMR, pages 203–210, 2015.
  • [29] M. Schinas, S. Papadopoulos, G. Petkos, Y. Kompatsiaris, and P. A. Mitkas. Multimodal graph-based event detection and summarization in social media streams. In Proceedings of SIGMM, pages 189–192, 2015.
  • [30] J. H. Seo and M. H. Kim. pm-scan: an I/O efficient structural clustering algorithm for large-scale graphs. In Proceedings of CIKM, pages 2295–2298, 2017.
  • [31] M. Sha, Y. Li, and K.-L. Tan. Self-adaptive graph traversal on gpus. In Proceedings of SIGMOD, pages 1558–1570, 2021.
  • [32] X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua. Graph processing on gpus: A survey. ACM Computing Surveys (CSUR), 50(6):1–35, 2018.
  • [33] H. Shiokawa, Y. Fujiwara, and M. Onizuka. Scan++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proceedings of the VLDB Endowment, 8(11):1178–1189, 2015.
  • [34] N. Simsiri, K. Tangwongsan, S. Tirthapura, and K.-L. Wu. Work-efficient parallel union-find. Concurrency and Computation: Practice and Experience, 30(4):e4333, 2018.
  • [35] D. P. Singh, I. Joshi, and J. Choudhary. Survey of gpu based sorting algorithms. International Journal of Parallel Programming, 46:1017–1034, 2018.
  • [36] T. R. Stovall, S. Kockara, and R. Avci. Gpuscan: Gpu-based parallel structural clustering algorithm for networks. IEEE Transactions on Parallel and Distributed Systems, 26(12):3381–3393, 2014.
  • [37] T. R. Stovall, S. Kockara, and R. Avci. GPUSCAN: gpu-based parallel structural clustering algorithm for networks. IEEE Transactions on Parallel and Distributed Systems, 26(12):3381–3393, 2015.
  • [38] T. Tseng, L. Dhulipala, and J. Shun. Parallel index-based structural graph clustering and its approximation. In Proceedings of SIGMOD, pages 1851–1864, 2021.
  • [39] D. Wen, L. Qin, Y. Zhang, L. Chang, and X. Lin. Efficient structural graph clustering: an index-based approach. Proceedings of the VLDB Endowment, 11(3):243–255, 2017.
  • [40] X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. SCAN: a structural clustering algorithm for networks. In Proceedings of SIGKDD, pages 824–833, 2007.
  • [41] C. Ye, Y. Li, B. He, Z. Li, and J. Sun. Gpu-accelerated graph label propagation for real-time fraud detection. In Proceedings of the SIGMOD, pages 2348–2356, 2021.
  • [42] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In Proceedings of SIGKDD, pages 555–564, 2017.
  • [43] W. Zhao, G. Chen, and X. Xu. Anyscan: An efficient anytime framework with active learning for large-scale network clustering. In Proceedings of ICDM, pages 665–674. IEEE Computer Society, 2017.
  • [44] Q. Zhou and J. Wang. Sparkscan: a structure similarity clustering algorithm on spark. In National Conference on Big Data Technology and Applications, pages 163–177. Springer, 2015.
[Uncaptioned image] Long Yuan is currently a professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his PhD degree from the University of New South Wales, Australia, and his M.S. and B.S. degrees both from Sichuan University, China. His research interests include graph data management and analysis. He has published papers in conferences and journals including VLDB, ICDE, WWW, The VLDB Journal, and TKDE.
[Uncaptioned image] Zeyu Zhou is currently a master student in the Department of Computer Science, Nanjing University of Science and Technology. His research interests include graph data management and analysis.
[Uncaptioned image] Xuemin Lin received the BSc degree in applied math from Fudan University, in 1984, and the PhD degree in computer science from the University of Queensland, in 1992. He is a professor with the School of Computer Science and Engineering, University of New South Wales. His current research interests include approximate query processing, spatial data analysis, and graph visualization. He is a fellow of the IEEE.
[Uncaptioned image] Zi Chen received the PhD degree from the Software Engineering Institute, East China Normal University, in 2021. She is currently a postdoctoral researcher at ECNU. Her research interest is big graph data query and analysis, including cohesive subgraph search on large-scale social networks and path planning on road networks.
[Uncaptioned image] Xiang Zhao received the PhD degree from the University of New South Wales, Australia, in 2014. He is currently a professor with the National University of Defense Technology, China. His research interests include graph data management and mining. He is a member of the IEEE. He has published papers in conferences and journals including VLDB, ICDE, WWW, The VLDB Journal, and TKDE.
[Uncaptioned image] Fan Zhang is a Professor at Guangzhou University. He was a research associate in the School of Computer Science and Engineering, University of New South Wales, from 2017 to 2019. He received the BEng degree from Zhejiang University in 2014, and the PhD from University of Technology Sydney in 2017. His research interests include graph algorithms and social networks. Since 2017, he has published more than 20 papers in top venues, e.g., SIGMOD, PVLDB, ICDE, IJCAI, AAAI, TKDE and VLDB Journal.