
thanks: * Zi Chen is the corresponding author of this paper.
thanks: L. Yuan and Z. Zhou are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: {zeyuzhou,longyuan}@njust.edu.cn.
X. Lin is with the School of Computer Science and Engineering, the University of New South Wales, Sydney, Australia. E-mail: [email protected].
Z. Chen is with the Software Engineering Institute, East China Normal University, Shanghai, China. E-mail: [email protected].
X. Zhao is with the College of Systems Engineering, National University of Defense Technology, Changsha, China. E-mail: [email protected].
F. Zhang is with the Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China. E-mail: [email protected].
thanks: Manuscript received xxx; revised xxx.

$\mathsf{GPUSCAN^{++}}$: Efficient Structural Graph Clustering on GPUs

Long Yuan, Zeyu Zhou, Xuemin Lin, Zi Chen, Xiang Zhao, Fan Zhang
Abstract

Structural clustering is one of the most popular graph clustering methods and has achieved significant performance improvements by utilizing GPUs. Nevertheless, the state-of-the-art GPU-based structural clustering algorithm, $\mathsf{GPUSCAN}$, still suffers from efficiency issues, since considerable extra cost is introduced for parallelization. Moreover, $\mathsf{GPUSCAN}$ assumes that the graph resides in GPU memory. However, GPU memory capacity is currently limited, while many real-world graphs are large and cannot fit in GPU memory, which makes $\mathsf{GPUSCAN}$ unable to handle large graphs. Motivated by this, we present a new GPU-based structural clustering algorithm, $\mathsf{GPUSCAN^{++}}$, in this paper. To address the efficiency issue, we propose a new progressive clustering method tailored for GPUs that not only avoids high parallelization costs but also fully exploits the computing resources of GPUs. To address the GPU memory limitation, we propose a partition-based structural clustering algorithm that can process large graphs with limited GPU memory. We conduct experiments on real graphs, and the experimental results demonstrate that our algorithm achieves up to 168 times speedup compared with the state-of-the-art GPU-based algorithm when the graph fits in GPU memory. Moreover, our algorithm scales to large graphs. For example, it can finish structural clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

Index Terms:
structural graph clustering, GPU, graph algorithm

1 Introduction

Graph clustering is one of the most fundamental problems in analyzing graph data. It has been extensively studied [14, 24, 27, 40, 42] and frequently utilized in many applications, such as natural language processing [3], recommendation systems [2], social and biological network analysis [13, 21], and load balancing in distributed systems [11].

One well-known method of graph clustering is structural clustering, introduced via the Structural Clustering Algorithm for Networks ($\mathsf{SCAN}$) [40]. Structural clustering defines the structural similarity between two vertices, and two vertices whose structural similarity is not less than a given parameter $\epsilon$ are considered similar. A core is a vertex that has at least $\mu$ similar neighbors, where $\mu$ is also a given parameter. A cluster contains the cores and the vertices that are structurally similar to the cores. If a vertex does not belong to any cluster but its neighbors belong to more than one cluster, then it is a hub; otherwise, it is an outlier. Fig. 1 shows the result of structural clustering with $\epsilon=0.6$ and $\mu=3$ on a graph $G$, in which the two clusters are colored gray, cores are colored black, and hubs and outliers are labeled. While most graph clustering approaches aim to partition the vertices into disjoint sets, structural clustering is also able to find hub vertices that connect different clusters and outlier vertices that lack strong ties to any cluster, which is important for mining various complex networks [40, 16, 38]. Structural clustering has been successfully used to cluster biological data [10, 21] and social media data [20, 18, 28, 29].

Figure 1: Clustering result of $\mathsf{SCAN}$ ($\epsilon=0.6$, $\mu=3$) on $G$

Motivation. In recent years, the rapid evolution of the Graphics Processing Unit (GPU) has attracted extensive attention in both industry and academia, and GPUs have been used successfully to speed up various graph algorithms [19, 15, 41, 31, 32]. The massively parallel hardware resources in GPUs also offer promising opportunities for accelerating structural clustering. In the literature, a GPU-based algorithm named $\mathsf{GPUSCAN}$ has been proposed [36]. By redesigning the computation steps of structural clustering, $\mathsf{GPUSCAN}$ significantly improves clustering performance compared with the state-of-the-art serial algorithm. However, it has the following drawbacks:

  • Prohibitive parallelization overhead. $\mathsf{GPUSCAN}$ aims to fully utilize the massively parallel hardware resources of the GPU. However, considerable extra computation is introduced to achieve this goal. Given a graph $G=(V,E)$, the state-of-the-art serial algorithm [7] can finish the clustering with $O(m\cdot\mathsf{deg}_{\mathsf{max}})$ work, while $\mathsf{GPUSCAN}$ needs $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$ work (we use the work-span framework to analyze parallel algorithms [9]: the work of an algorithm is the number of operations it performs, and the span is the length of its longest sequential dependence), where $m$ denotes the number of edges in the graph, $\mathsf{deg}_{\mathsf{max}}$ denotes the maximum vertex degree in $G$, and $c$ is an integer determined by the given graph $G$ and the parameters $\epsilon$ and $\mu$ that is not small, as verified in our experiments. This prohibitive parallelization overhead makes $\mathsf{GPUSCAN}$ inefficient for structural clustering.

  • GPU memory capacity limitation. Compared with common server machines equipped with hundreds of gigabytes or even terabytes of main memory, GPU memory capacity is currently very limited. $\mathsf{GPUSCAN}$ assumes that the graph resides in GPU memory. Nevertheless, many real-world graphs are large and may not fit entirely in GPU memory. This memory capacity limitation significantly restricts the scale of graphs that $\mathsf{GPUSCAN}$ can handle.

Motivated by this, in this paper, we aim to devise a new efficient GPU-based structural clustering algorithm that can handle large graphs.

Our solution. We address the drawbacks of $\mathsf{GPUSCAN}$ and propose new GPU-based algorithms for structural clustering in this paper. Specifically, to remove the prohibitive parallelization overhead, we devise a CSR-enhanced graph storage structure and adopt a progressive clustering paradigm with which unnecessary similarity computation can be pruned. Following this paradigm, we develop a new GPU-based $\mathsf{SCAN}$ algorithm with careful consideration of total work and parallelism, which are crucial to the performance of GPU programs. Our new algorithm significantly reduces the parallelization overhead. We theoretically prove that the work of our new algorithm is reduced to $O(\Sigma_{(u,v)\in E(G)}\mathsf{deg}(u)\cdot\log\mathsf{deg}(v))$, and experimental results show that our algorithm achieves an 85 times speedup on average compared with $\mathsf{GPUSCAN}$.

For the GPU memory capacity limitation, a direct solution is to use the Unified Virtual Memory (UVM) provided by recent GPUs. UVM utilizes the large host main memory of common server machines and allows GPU memory to be oversubscribed as long as there is sufficient host main memory to back the allocated memory. The driver and the runtime system handle data movement between host main memory and GPU memory without the programmer's involvement, which allows running GPU applications that would otherwise be infeasible due to the large size of datasets [12]. However, the performance of this approach is not competitive, due to the poor data locality caused by irregular memory accesses during structural clustering and the relatively slow I/O bus connecting host memory and GPU memory. Therefore, we propose a new partition-based structural clustering algorithm. Specifically, we partition the graph into several specified subgraphs. When conducting the clustering, we only need to load the specified subgraphs into GPU memory, and we can guarantee that the clustering results are the same as when loading the whole graph into GPU memory, which not only addresses the GPU memory limitation of $\mathsf{GPUSCAN}$ but also avoids the data movement overheads of the UVM-based approach.

Contributions. We make the following contributions:

(A) We theoretically analyze the performance of the state-of-the-art GPU-based structural clustering algorithm $\mathsf{GPUSCAN}$ and reveal the key reasons for its inefficiency. Following this analysis, we propose a new progressive clustering algorithm tailored for GPUs that not only avoids the prohibitive parallelization overhead but also fully exploits the computing power of GPUs.

(B) To overcome the GPU memory capacity limitation, we devise a new out-of-core GPU-based SCAN algorithm. By partitioning the original graph into a series of smaller subgraphs, our algorithm can significantly reduce the GPU memory requirement when conducting the clustering and scale to large data graphs beyond the GPU memory.

(C) We conduct extensive performance studies using ten real graphs. The experimental results show that our proposed algorithm achieves up to 168 times speedup (85 times on average) compared with the state-of-the-art GPU-based SCAN algorithm. Moreover, our algorithm can finish structural clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

2 Preliminaries

2.1 Problem Definition

Let $G=(V,E)$ be an undirected and unweighted graph, where $V(G)$ is the set of vertices and $E(G)$ is the set of edges. We denote the number of vertices by $n$ and the number of edges by $m$, i.e., $n=|V(G)|$ and $m=|E(G)|$. For a vertex $v\in V(G)$, we use $\mathsf{nbr}(v,G)$ to denote the set of neighbors of $v$. The degree of a vertex $v\in V(G)$, denoted by $\mathsf{deg}(v,G)$, is the number of neighbors of $v$, i.e., $\mathsf{deg}(v,G)=|\mathsf{nbr}(v,G)|$. For simplicity, we omit $G$ in the notations when the context is self-evident.

Definition 2.1: (Structural Neighborhood) Given a vertex $v$ in $G$, the structural neighborhood of $v$, denoted by $N[v]$, is defined as $N[v]=\{u\in V(G)\,|\,(u,v)\in E(G)\}\cup\{v\}$. $\Box$

Definition 2.2: (Structural Similarity) Given two vertices $u$ and $v$ in $G$, the structural similarity between $u$ and $v$, denoted by $\sigma(u,v)$, is defined as:

$\sigma(u,v)=\frac{|N[u]\cap N[v]|}{\sqrt{|N[u]||N[v]|}}$   (1)

$\Box$

Definition 2.3: ($\epsilon$-Neighborhood) Given a similarity threshold $0<\epsilon\leq 1$, the $\epsilon$-neighborhood of $u$, denoted by $N_{\epsilon}[u]$, is defined as the subset of $N[u]$ in which every vertex $v$ satisfies $\sigma(u,v)\geq\epsilon$, i.e., $N_{\epsilon}[u]=\{v\,|\,v\in N[u]\wedge\sigma(u,v)\geq\epsilon\}$. $\Box$

Note that the $\epsilon$-neighborhood of a given vertex $u$ contains $u$ itself, as $\sigma(u,u)=1$. When the number of $\epsilon$-neighbors of a vertex is large enough, it becomes a core vertex:

Definition 2.4: (Core) Given a structural similarity threshold $0<\epsilon\leq 1$ and an integer $\mu\geq 2$, a vertex $u$ is a core vertex if $|N_{\epsilon}[u]|\geq\mu$. $\Box$

Given a core vertex $u$, the vertices structurally reachable from $u$ are defined as follows:

Definition 2.5: (Structural Reachability) Given two parameters $0<\epsilon\leq 1$ and $\mu\geq 2$, for two vertices $u$ and $v$, $v$ is structurally reachable from vertex $u$ if there is a sequence of vertices $v_{1},v_{2},\ldots,v_{l}\in V$ such that: (i) $v_{1}=u$, $v_{l}=v$; (ii) $v_{1},v_{2},\ldots,v_{l-1}$ are core vertices; and (iii) $v_{i+1}\in N_{\epsilon}[v_{i}]$ for each $1\leq i\leq l-1$. $\Box$

Intuitively, a cluster is a set of vertices that are structurally reachable from a core vertex. Formally,

Definition 2.6: (Cluster) A cluster $C$ is a non-empty subset of $V$ such that:

  • (Connectivity) For any two vertices $v_{1},v_{2}\in C$, there exists a vertex $u\in C$ such that both $v_{1}$ and $v_{2}$ are structurally reachable from $u$.

  • (Maximality) If a core vertex $u\in C$, then all vertices that are structurally reachable from $u$ are also in $C$.

$\Box$

Definition 2.7: (Hub and Outlier) Given the set of clusters in graph $G$, a vertex $u$ that is not in any cluster is a hub vertex if its neighbors belong to two or more clusters, and an outlier vertex otherwise. $\Box$
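Definitions 2.1-2.7 together admit a simple serial clustering procedure, which the following Python sketch illustrates (an illustrative reference implementation, not the paper's GPU algorithm; for simplicity, a border vertex is assigned to the first cluster that reaches it):

```python
from collections import deque
from math import sqrt

def scan(adj, eps, mu):
    """Serial SCAN per Definitions 2.1-2.7. adj maps a vertex to its neighbor set."""
    N = {v: adj[v] | {v} for v in adj}                       # structural neighborhood N[v]
    sigma = lambda u, v: len(N[u] & N[v]) / sqrt(len(N[u]) * len(N[v]))
    # eps-neighborhood of each vertex (contains the vertex itself, as sigma(u,u)=1)
    eps_nbr = {u: {v for v in N[u] if sigma(u, v) >= eps} for u in adj}
    cores = {u for u in adj if len(eps_nbr[u]) >= mu}
    # grow clusters: expand from each unassigned core through eps-neighborhoods of cores
    cluster, cid = {}, 0
    for c in cores:
        if c in cluster:
            continue
        cluster[c] = cid
        queue = deque([c])
        while queue:
            u = queue.popleft()
            for v in eps_nbr[u]:
                if v not in cluster:
                    cluster[v] = cid
                    if v in cores:                           # only cores keep expanding
                        queue.append(v)
        cid += 1
    # among unassigned vertices: hub if neighbors span >= 2 clusters, else outlier
    role = {}
    for u in adj:
        if u in cluster:
            continue
        seen = {cluster[v] for v in adj[u] if v in cluster}
        role[u] = 'hub' if len(seen) >= 2 else 'outlier'
    return cluster, role
```

For instance, on a toy graph made of two 4-cliques bridged by a single vertex, the sketch reports the two cliques as clusters and the bridge vertex as a hub.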

Following the above definitions, it is clear that two vertices that are not adjacent in the graph have no effect on the clustering results even if they are similar. Therefore, we only need to consider vertex pairs incident to an edge. In the following, for an edge $(u,v)$, we use the similarity of the edge $(u,v)$ and the similarity of $u$ and $v$ interchangeably for brevity. We summarize the notations used in this paper in Table I.

Symbol Description
$G=(V,E)$: graph with vertex set $V$ and edge set $E$
$V(G)/E(G)$: all vertices/edges in $G$
$\mathsf{nbr}(u)/\mathsf{deg}(u)$: neighbors/degree of a vertex $u$
$n/m$: number of vertices/edges in $G$
$\mathsf{deg}_{\mathsf{max}}$: maximum degree in $G$
?⃝: the role of a vertex or the similarity of an edge is unknown
C⃝/!C⃝: the vertex is a core/non-core vertex
S⃝/!S⃝: the edge is similar/dissimilar
H⃝/O⃝: the vertex is a hub/outlier
TABLE I: Notations

Example 2.1: Consider $G$ shown in Fig. 1 with $\epsilon=0.6$ and $\mu=3$; the structural similarity of every pair of adjacent vertices is shown near the edge. $v_{0}$, $v_{1}$, $v_{4}$, $v_{7}$, $v_{9}$, $v_{10}$, $v_{11}$, $v_{12}$, and $v_{13}$ are core vertices, as each has at least $\mu-1=2$ similar neighbors in addition to itself. These core vertices form two clusters, which are marked in gray. $v_{8}$ is a hub, as its neighbors $v_{2}$ and $v_{9}$ are in different clusters, while $v_{3}$, $v_{5}$, and $v_{6}$ are outliers. Due to limited space, in the following examples we only show procedures on the subgraph $G^{\prime}$ of $G$. $\Box$

Problem Statement. Given a graph $G=(V,E)$ and two parameters $0<\epsilon\leq 1$ and $\mu\geq 2$, structural graph clustering aims to efficiently compute all the clusters $\mathbb{C}$ in $G$ and identify the corresponding role of each vertex. In this paper, we aim to accelerate the clustering performance of $\mathsf{SCAN}$ with GPUs.

2.2 GPU Architecture and CUDA Programming Platform

GPU Architecture. A GPU consists of multiple streaming multiprocessors (SMs). Each SM has a large number of GPU cores. The cores in each SM work in a single-instruction multiple-data (SIMD) fashion (single-instruction multiple-thread, SIMT, in NVIDIA's terms), i.e., they execute the same instructions at the same time on different input data. The smallest execution units of a GPU are warps. In NVIDIA's GPU architectures, a warp typically contains 32 threads. Threads in the same warp that finish their tasks earlier have to wait until the other threads finish their computations before the next warp of threads is swapped in. A GPU is equipped with several types of memory. All SMs in the GPU share a larger but slower global memory. Each SM has a small but fast-access programmable memory (called shared memory) that is shared by all of its cores. Moreover, data can be exchanged between a GPU's global memory and the host's main memory via a relatively slow I/O bus.

CUDA. CUDA (Compute Unified Device Architecture) is the most popular GPU programming platform. A CUDA program consists of code running on the CPU (host code) and code running on the GPU (kernels). Kernels are invoked by the host code. A kernel starts by transferring input data from main memory to the GPU's global memory and then processes the data on the GPU. When the processing finishes, results are copied from the GPU's global memory back to main memory. The GPU executes each kernel with a user-specified number of threads. At runtime, the threads are divided into several blocks for execution on the cores of an SM; each block contains several warps, and each warp of threads executes concurrently.

For thread safety in GPU programming, CUDA provides programmers with atomic operations, including $\mathsf{atomicAdd}$ and $\mathsf{atomicCAS}$. $\mathsf{atomicAdd}(address,\,val)$ reads the 16-bit, 32-bit, or 64-bit word $old$ located at address $address$ in global or shared memory, computes $(old+val)$, and stores the result back to memory at the same address. The read, computation, and write are performed in one atomic transaction, and the operation returns $old$. $\mathsf{atomicCAS}(address,\,compare,\,val)$ (Compare And Swap) reads the 16-bit, 32-bit, or 64-bit word $old$ located at address $address$ in global or shared memory, computes $(old==compare\;?\;val:old)$, and stores the result back to memory at the same address. These three steps are likewise performed in one atomic transaction, and the operation returns $old$.
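As an illustration of how richer atomics are commonly built on top of compare-and-swap, the following Python sketch emulates the CAS retry-loop idiom (the lock merely stands in for the hardware's one-transaction guarantee; the names `Word` and `atomic_min` are our own, not CUDA API):

```python
import threading

class Word:
    """A memory word with CAS semantics; the lock emulates the
    single-transaction guarantee that the GPU hardware provides."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def atomic_cas(self, compare, val):
        with self._lock:                      # one atomic transaction
            old = self.value
            if old == compare:
                self.value = val
            return old                        # always returns the old word

def atomic_min(word, val):
    """Retry loop: the standard way to derive other atomics (min, max, ...)
    from CAS; a thread retries whenever another thread wrote first."""
    old = word.value
    while old > val:
        seen = word.atomic_cas(old, val)
        if seen == old:                       # CAS succeeded
            break
        old = seen                            # word changed underneath us; retry
    return old
```

The retry loop mirrors the usual CUDA pattern `do { assumed = old; old = atomicCAS(addr, assumed, new); } while (assumed != old);`.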

3 Related work

3.1 CPU-based 𝖲𝖢𝖠𝖭\mathsf{SCAN}

Structural graph clustering ($\mathsf{SCAN}$) was first proposed in [40]. After that, a considerable number of follow-up works have been conducted in the literature. For online algorithms with fixed parameters $\epsilon$ and $\mu$, $\mathsf{SCAN}$++ [33], pSCAN [7], and $\mathsf{anySCAN}$ [43] speed up $\mathsf{SCAN}$ by pruning unnecessary structural similarity computations. $\mathsf{SparkSCAN}$ [44] conducts structural clustering on the Spark system. $\mathsf{pmSCAN}$ [30] considers the case in which the input graph is stored on disk. $\mathsf{ppSCAN}$ [8] parallelizes pruning-based algorithms and uses vectorized instructions to further improve performance.

For index-based algorithms, which aim to return the clustering result quickly for frequently issued queries, GS*-Index [39] designs a space-bounded index and proposes an efficient algorithm to answer clustering queries. [38] parallelizes GS*-Index and uses locality-sensitive hashing to design a novel approximate index construction algorithm. [26] studies the maintenance of computed structural similarities between vertices when the graph is updated and proposes an approximate algorithm that achieves $O(\log^{2}|V|+\log|V|\cdot\log\frac{M}{\delta^{*}})$ amortized cost per update for a specified failure probability $\delta^{*}$ and every sequence of $M$ updates. [22] studies structural clustering on directed graphs and proposes an index-based approach to computing the corresponding clustering results. However, all these algorithms are CPU-based and cannot be translated into efficient solutions on GPUs due to the inherent characteristics that distinguish a GPU from a CPU.

Besides $\mathsf{SCAN}$, other graph clustering methods have also been proposed in the literature. However, none of these methods is based on structural clustering, and thus they do not share the effectiveness of $\mathsf{SCAN}$ in detecting clusters, hubs, and outliers. [27, 1] provide a comprehensive survey of this topic.

3.2 GPU-based 𝖲𝖢𝖠𝖭\mathsf{SCAN}

For the GPU-based approach, $\mathsf{GPUSCAN}$ [37] is the state-of-the-art GPU-based $\mathsf{SCAN}$ solution, which conducts the clustering in three separate phases: (1) $\epsilon$-neighborhood identification. In Phase 1, $\mathsf{GPUSCAN}$ computes the similarity of each edge and determines the core vertices in the graph. (2) Cluster detection. After identifying the core vertices, $\mathsf{GPUSCAN}$ detects the clusters based on the cluster definition (Definition 2.6) in Section 2. (3) Hub and outlier classification. With the clustering results, the hub and outlier vertices are classified in Phase 3. The following example illustrates the three phases of $\mathsf{GPUSCAN}$.

Figure 2: $\mathsf{GPUSCAN}$ on $G^{\prime}$ ($\epsilon=0.6$, $\mu=3$): (a) Phase 1: $\epsilon$-neighborhood identification; (b) Phase 1: computing common neighbors of $v_{0}$ and $v_{1}$; (c) Phase 2: cluster detection; (d) Phase 3: hub and outlier classification

Example 3.1: Consider the graph $G$ shown in Fig. 1. Fig. 2 shows the clustering procedure of $\mathsf{GPUSCAN}$ with parameters $\epsilon=0.6$ and $\mu=3$. Due to limited space, we only show the steps related to the subgraph $G^{\prime}$ of $G$ in Fig. 1; the remaining part is processed similarly.

Fig. 2 (a)-(b) show the $\epsilon$-neighborhood identification phase. $\mathsf{GPUSCAN}$ uses the $E.u$ and $E.v$ arrays to represent the edges of the graph, a $\mathsf{flag}$ array in which each element stores whether an edge is similar or not, and the $A.u$ and $A.v$ arrays to store the neighbors of each vertex. $\mathsf{GPUSCAN}$ uses a warp to compute the similarity of each edge $(u,v)$. According to Definition 2.2, the key to computing the similarity of $(u,v)$ is finding the common neighbors in $N[u]$ and $N[v]$. To achieve this goal, for each $w\in\mathsf{nbr}(u)$, $\mathsf{GPUSCAN}$ uses 8 threads in a warp to check whether $w$ also appears among 8 consecutive elements of $\mathsf{nbr}(v)$. As a warp usually contains 32 threads, four elements of $\mathsf{nbr}(u)$ can be compared concurrently. After the comparisons of these four elements are complete, if the largest vertex id among these four elements of $\mathsf{nbr}(u)$ is bigger than the largest one among the 8 consecutive elements of $\mathsf{nbr}(v)$, the next 8 consecutive elements of $\mathsf{nbr}(v)$ are used to repeat the above comparisons. If the largest vertex id among these four elements of $\mathsf{nbr}(u)$ is smaller than the largest one among the 8 consecutive elements of $\mathsf{nbr}(v)$, the next 4 elements of $\mathsf{nbr}(u)$ are used to repeat the above comparisons. Otherwise, the next 4 elements of $\mathsf{nbr}(u)$ and the next 8 elements of $\mathsf{nbr}(v)$ are used to repeat the above comparisons. All common neighbors have been found when all the elements of $\mathsf{nbr}(u)$ and $\mathsf{nbr}(v)$ have been explored. Take edge $(v_{0},v_{1})$ as an example. Fig. 2 (b) shows how $\mathsf{GPUSCAN}$ computes their common neighbors. As $\mathsf{nbr}(v_{0})=\{v_{1},v_{2},v_{3},v_{4},v_{5},v_{6},v_{7}\}$, the first 4 elements $\{v_{1},v_{2},v_{3},v_{4}\}$ of $\mathsf{nbr}(v_{0})$ are compared first. Among the 32 threads in the warp, threads $T_{0}$-$T_{7}$ handle the comparison for $v_{1}$, and the remaining threads are assigned to handle the comparisons for $v_{2}$, $v_{3}$, and $v_{4}$ similarly. As $\mathsf{nbr}(v_{1})=\{v_{0},v_{2},v_{4},v_{7}\}$, threads $T_{0}$-$T_{7}$ find $v_{1}\notin\mathsf{nbr}(v_{1})$, while threads $T_{8}$-$T_{15}$ and $T_{24}$-$T_{31}$ find $v_{2},v_{4}\in\mathsf{nbr}(v_{1})$. After that, the threads find that the largest vertex id $v_{4}$ among the compared elements of $\mathsf{nbr}(v_{0})$ is smaller than the largest one $v_{7}$ in $\mathsf{nbr}(v_{1})$. Then, the remaining 3 elements $\{v_{5},v_{6},v_{7}\}$ of $\mathsf{nbr}(v_{0})$ are used to repeat the comparisons. In this round, threads $T_{16}$-$T_{23}$ find that $v_{7}$ is a common element of $\mathsf{nbr}(v_{0})$ and $\mathsf{nbr}(v_{1})$. Therefore, $\sigma(v_{0},v_{1})=(2+3)/\sqrt{5*8}=0.79>0.6$, which means $v_{0}$ and $v_{1}$ are similar, and the corresponding element in the $\mathsf{flag}$ array is set to 1.
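The warp-cooperative comparison above is, in essence, a blocked merge-based set intersection. The following serial Python sketch captures the same block-advancing logic (a simplification of the warp scheme: on the GPU, all `bu*bv` pairwise checks of a round happen concurrently across the 32 threads):

```python
def count_common_neighbors(nbr_u, nbr_v, bu=4, bv=8):
    """Count |nbr(u) & nbr(v)| by comparing a block of `bu` elements of
    nbr(u) against a block of `bv` elements of nbr(v), then advancing the
    block whose largest id is smaller. Both lists must be sorted by id."""
    common, i, j = 0, 0, 0
    while i < len(nbr_u) and j < len(nbr_v):
        block_u = nbr_u[i:i + bu]
        block_v = nbr_v[j:j + bv]
        common += len(set(block_u) & set(block_v))   # the bu*bv comparisons
        if block_u[-1] > block_v[-1]:
            j += bv                                  # advance in nbr(v)
        elif block_u[-1] < block_v[-1]:
            i += bu                                  # advance in nbr(u)
        else:
            i, j = i + bu, j + bv                    # advance in both
    return common
```

On the running example, `count_common_neighbors([1,2,3,4,5,6,7], [0,2,4,7])` finds the 3 common neighbors $v_{2}$, $v_{4}$, $v_{7}$, giving $\sigma(v_{0},v_{1})=(3+2)/\sqrt{8\cdot 5}\approx 0.79$.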

Fig. 2 (c) shows the cluster detection phase of $\mathsf{GPUSCAN}$. By the cluster definition (Definition 2.6), only similar edges can appear in the final clusters. Therefore, $\mathsf{GPUSCAN}$ builds two new arrays $E^{\prime}.u$ and $E^{\prime}.v$ by retrieving the similar edges from $E.u$ and $E.v$. Consider the initialization step in Fig. 2 (c): as $v_{0}$ and $v_{1}$ are similar while $v_{0}$ and $v_{2}$ are not, only $v_{0}$ and $v_{1}$ are kept in $E^{\prime}.u$ and $E^{\prime}.v$. Then, based on $E^{\prime}.u$ and $E^{\prime}.v$, $N_{\epsilon}[u]$ can be obtained easily. For example, for $v_{0}$, $N_{\epsilon}[v_{0}]=\{v_{0},v_{1},v_{4},v_{7}\}$ and $|N_{\epsilon}[v_{0}]|=4\geq 3$; thus $v_{0}$ is a core vertex. Similarly, $v_{4}$ and $v_{7}$ are also core vertices. Then, it removes the edges in $E^{\prime}.u$ and $E^{\prime}.v$ that are not incident to core vertices. Moreover, $\mathsf{GPUSCAN}$ represents all clusters as a spanning forest through a $\mathsf{parent}$ array. In the $\mathsf{parent}$ array, each element represents the parent of the corresponding vertex in the forest, and the element for each vertex in $E^{\prime}.u$ and $E^{\prime}.v$ is initialized as the vertex itself, as shown in the Initialization step of Fig. 2 (c).

After that, $\mathsf{GPUSCAN}$ detects the clusters in the graph iteratively. In each odd iteration, for each vertex $u$ in $E^{\prime}.u$, it finds the smallest neighbor $v$ of $u$ in $E^{\prime}.v$ and takes the smaller of $u$ and $v$ as the parent of $u$ in the $\mathsf{parent}$ array. Then, it replaces each vertex in $E^{\prime}.u$ and $E^{\prime}.v$ with its parent and removes the edges whose incident vertices are the same. Consider Iterations 1.1-1.4 in Fig. 2 (c): for $v_{1}$, its smallest neighbor in $E^{\prime}.v$ is $v_{0}$, which is smaller than itself; thus the parent of $v_{1}$ is replaced by $v_{0}$ in the $\mathsf{parent}$ array. The other vertices are processed similarly (Iterations 1.1 and 1.2). Then, $v_{1}$ is replaced by $v_{0}$ in $E^{\prime}.u$ and $E^{\prime}.v$ in Iteration 1.3, and only the edges $(v_{0},v_{1})$ and $(v_{1},v_{0})$ are left after removing the edges with identical incident vertices in Iteration 1.4. In each even iteration, $\mathsf{GPUSCAN}$ conducts the clustering in the same way as in an odd iteration, except that the largest neighbor is selected as the parent. Iterations 2.1-2.4 in Fig. 2 (c) show the even case. Note that to obtain $E^{\prime}.u$ and $E^{\prime}.v$ at the end of each iteration, $\mathsf{GPUSCAN}$ uses the $\mathsf{sort()}$ and $\mathsf{partition()}$ functions in the $\mathsf{Thrust}$ library provided by NVIDIA [23]. When $E^{\prime}.u$ and $E^{\prime}.v$ become empty, $\mathsf{GPUSCAN}$ explores each vertex and sets its corresponding element in the $\mathsf{parent}$ array to the root vertex of its tree in the forest. For example, in Iteration 2.2, the parent of $v_{4}$ is $v_{0}$, the parent of $v_{0}$ is $v_{1}$, and $v_{1}$ is its own parent; thus the parent of $v_{4}$ is set to $v_{1}$. After Phase 2 finishes, the vertices with the same parent are in the same cluster. In Fig. 2 (c), as $v_{0}$, $v_{1}$, $v_{2}$, $v_{4}$, and $v_{7}$ have the same parent, they are in the same cluster.
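The odd/even hooking-and-contraction procedure can be sketched serially as follows (an illustrative Python rendering of the idea, with `min`/`max` alternating as described above; the real algorithm performs the per-vertex and per-edge steps in parallel and uses Thrust's sort/partition for the contraction):

```python
def detect_clusters(core_edges, vertices):
    """Hook each vertex to an extremal neighbor (min in odd rounds, max in
    even rounds), contract the edge list by renaming endpoints to their
    parents, and finally follow parents to the root of each tree; vertices
    sharing a root share a cluster."""
    parent = {v: v for v in vertices}
    # keep both directions so every endpoint appears as a source
    edges = [(u, v) for u, v in core_edges] + [(v, u) for u, v in core_edges]
    it = 1
    while edges:
        pick = min if it % 2 == 1 else max
        # hooking: each u adopts the extremal vertex among itself and its nbrs
        for u in {u for u, _ in edges}:
            best = pick(v for x, v in edges if x == u)
            parent[u] = pick(u, best)
        # contraction: rename endpoints by parent, drop self-loops/duplicates
        edges = list({(parent[u], parent[v])
                      for u, v in edges if parent[u] != parent[v]})
        it += 1
    # finally point every vertex at the root of its tree in the forest
    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v
    return {v: root(v) for v in vertices}
```

The alternation between `min` and `max` mirrors the odd/even iterations above; an isolated vertex simply remains its own root.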

In Phase 3, $\mathsf{GPUSCAN}$ classifies the hub and outlier vertices. It first retrieves the edges $(u,v)$ from $A.u$ and $A.v$ such that $u\in A.u$ is not in a cluster and $v\in A.v$ is in a cluster, and keeps them in new arrays $E^{\prime}.u$ and $E^{\prime}.v$. Then, it replaces the vertices in $E^{\prime}.v$ with their parents. For a vertex $u\in E^{\prime}.u$, if it has two or more distinct neighbors in $E^{\prime}.v$, it is a hub vertex; otherwise, it is an outlier vertex. Consider the example shown in Fig. 2 (d): $v_{3}$, $v_{5}$, $v_{6}$, and $v_{8}$ are not in a cluster; thus the edges they form with their neighbors within clusters are retrieved and stored in $E^{\prime}.u$ and $E^{\prime}.v$ (Step 1). Then, $v_{0}$, $v_{2}$, and $v_{9}$ are replaced with their parents (in Step 2, the parent of $v_{9}$ is $v_{12}$, which is not shown in Fig. 2 due to limited space). Since $v_{8}$ has two neighbors $v_{1}$ and $v_{12}$ in $E^{\prime}.v$, it is a hub vertex. As $v_{3}$, $v_{5}$, and $v_{6}$ each have only one neighbor in $E^{\prime}.v$, they are outlier vertices. $\Box$
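This Phase-3 rule reduces to a few lines once the parent/root of every clustered vertex is known from Phase 2 (a hedged sketch; the function and argument names are ours):

```python
def classify(noncluster, nbrs, root):
    """A non-clustered vertex whose clustered neighbors map to >= 2 distinct
    cluster roots is a hub; otherwise it is an outlier.
    noncluster: vertices outside all clusters; nbrs: vertex -> neighbor list;
    root: clustered vertex -> its cluster's root (parent after Phase 2)."""
    out = {}
    for u in noncluster:
        clusters = {root[v] for v in nbrs[u] if v in root}  # distinct clusters seen
        out[u] = 'hub' if len(clusters) >= 2 else 'outlier'
    return out
```

On the example of Fig. 2 (d), $v_{8}$'s neighbors map to two distinct roots and it is classified as a hub, while $v_{3}$, $v_{5}$, and $v_{6}$ each see only one.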

Theorem 3.1: Given a graph $G$ and two parameters $\epsilon$ and $\mu$, the work of $\mathsf{GPUSCAN}$ to finish structural clustering is $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$, and the span is $O(\mathsf{deg}_{\mathsf{max}}+c\cdot\log^{2}m)$, where $c$ denotes the number of iterations in Phase 2. $\Box$

Proof: In Phase 1, for each edge $(u,v)$, $\mathsf{GPUSCAN}$ has to explore the neighbors of $u$ and $v$. Thus, the work and span of Phase 1 are $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v)))$ and $O(\mathsf{deg}_{\mathsf{max}})$, respectively. In Phase 2, the time used by $\mathsf{sort()}$ dominates each iteration, and the work and span of $\mathsf{sort()}$ are $O(m\cdot\log m)$ and $O(\log^{2}m)$, respectively [35]. Thus, the work and span of Phase 2 are $O(c\cdot m\cdot\log m)$ and $O(c\cdot\log^{2}m)$, respectively. In Phase 3, $\mathsf{GPUSCAN}$ explores the edges once, and only the neighbors of hubs and outliers are visited. Thus, the work and span of Phase 3 are $O(m)$ and $O(1)$. Therefore, the work and span of $\mathsf{GPUSCAN}$ are $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$ and $O(\mathsf{deg}_{\mathsf{max}}+c\cdot\log^{2}m)$.

Drawbacks of $\mathsf{GPUSCAN}$. $\mathsf{GPUSCAN}$ seeks to parallelize $\mathsf{SCAN}$ to benefit from the computational power provided by GPUs. However, it suffers from the following two drawbacks, which prevent it from fully utilizing the massive parallelism of GPUs.

  • High extra parallelization cost. As shown in [7], the time complexity of the best serial $\mathsf{SCAN}$ algorithm is $O(m\cdot\mathsf{deg}_{\mathsf{max}})$. On the other hand, Theorem 3.1 shows that the work of $\mathsf{GPUSCAN}$ is $O(\Sigma_{(u,v)\in E(G)}(\mathsf{deg}(u)+\mathsf{deg}(v))+c\cdot m\cdot\log m)$. Obviously, considerable extra cost is introduced in $\mathsf{GPUSCAN}$ for parallelization.

  • Low scalability. $\mathsf{GPUSCAN}$ assumes that the input graph fits in GPU memory. However, real graphs such as social networks and web graphs are huge, while GPU memory is generally limited to a few gigabytes. The assumption that the data graph cannot exceed the GPU memory makes $\mathsf{GPUSCAN}$ unable to handle large graphs in practice.

4 Our Approach

We address the drawbacks of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} in two sections. In this section, we focus on the scenario in which the graphs can be stored in the GPU memory and aim to reduce the high computation cost introduced for parallelization. In the next section, we present our algorithm to address the problem that the graph cannot fit in the GPU memory.

4.1 A CSR-Enhanced Graph Layout

𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} simply uses the edge array and adjacent array to represent the input graph. The shortcomings of this representation are two-fold: (1) the degree of a vertex cannot be obtained easily, although it is a commonly used operation during the clustering; (2) the correspondence between the edge array and the adjacent array for the same edge is not maintained directly, which complicates the identification of cluster vertices in Phase 2. To avoid these problems, we adopt a CSR (Compressed Sparse Row)-enhanced graph layout, which contains five parts:

Figure 3: A CSR-Enhanced graph layout for GG^{\prime}
  • Vertex array. Each entry keeps the starting index for a specific vertex in the adjacent array.

  • Adjacent array. The adjacent array contains the adjacent vertices of each vertex. The adjacent vertices are stored in the increasing order of vertex ids.

  • Eid array. Each entry corresponds to an edge in the adjacent array and keeps the index of the edge in the degree-oriented edge array.

  • Degree-oriented edge array. The degree-oriented edge array contains the edges of the graph and the edges with the same source vertex are stored together. For each edge (u,v)(u,v) in the degree-oriented edge array, we guarantee 𝖽𝖾𝗀(u)<𝖽𝖾𝗀(v){\mathsf{deg}}(u)<{\mathsf{deg}}(v) or 𝖽𝖾𝗀(u)=𝖽𝖾𝗀(v){\mathsf{deg}}(u)={\mathsf{deg}}(v) and 𝗂𝖽(u)<𝗂𝖽(v){\mathsf{id}}(u)<{\mathsf{id}}(v). In this way, the workload for edges incident to a large degree vertex can be computed based on the vertex with smaller degree.

  • Similarity array. Each element corresponds to an edge in the degree-oriented edge array and maintains the similarity of the corresponding edge.

Example 4.1: Fig. 3 shows the CSR-Enhanced graph layout regarding v0v_{0}, v1v_{1}, \dots, v8v_{8} in GG^{\prime}. These five arrays are organized as discussed above. \Box
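To make the layout concrete, the following is a minimal sequential sketch (our own Python illustration, not the paper's GPU code; all function and variable names are assumptions) that builds the five arrays from an undirected edge list. Each edge in the degree-oriented edge array is oriented so that the smaller-degree endpoint comes first, with ties broken by vertex id, as described above.

```python
from collections import defaultdict

def build_csr_enhanced(n, edges):
    # adjacency lists, neighbors later sorted in increasing id order
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    vertex = [0] * (n + 1)          # vertex array: start index per vertex
    adjacent = []                   # adjacent array: sorted neighbor ids
    for v in range(n):
        nbrs = sorted(adj[v])
        vertex[v + 1] = vertex[v] + len(nbrs)
        adjacent.extend(nbrs)
    deg = lambda v: vertex[v + 1] - vertex[v]
    # degree-oriented edge array: smaller-degree endpoint first, ties by id
    edge_arr = sorted((u, v) if (deg(u), u) < (deg(v), v) else (v, u)
                      for u, v in edges)
    pos = {e: i for i, e in enumerate(edge_arr)}
    # eid array: for each adjacency entry, the index of its edge in edge_arr
    eid = []
    for u in range(n):
        for w in adjacent[vertex[u]:vertex[u + 1]]:
            key = (u, w) if (deg(u), u) < (deg(w), w) else (w, u)
            eid.append(pos[key])
    similarity = [None] * len(edge_arr)   # similarity array, one slot per edge
    return vertex, adjacent, eid, edge_arr, similarity
```

For instance, for an edge (2,3) with deg(3) < deg(2), the degree-oriented entry is stored as (3,2), so the workload of that edge is driven by the smaller-degree endpoint.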

4.2 A Progressive Structural Graph Clustering Approach

To reduce the high parallelization cost, the designed algorithm should avoid unnecessary computation during the clustering and fully exploit the unique characteristics of GPU structure. As shown in Section 3.2, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN} first computes the structural similarity for each edge in Phase 1 and then identifies the roles of each vertex accordingly. However, based on Theorem 2, computing the structural similarity is a costly operation. To avoid unnecessary structural similarity computation, we adopt a progressive approach in which the lower bound (Nϵ¯[v]\underline{N_{\epsilon}}[v]) and upper bound (Nϵ¯[v]\overline{N_{\epsilon}}[v]) of Nϵ[v]N_{\epsilon}[v] are maintained. Clearly, we have the following lemma on Nϵ¯[v]\underline{N_{\epsilon}}[v] and Nϵ¯[v]\overline{N_{\epsilon}}[v]:

1
2for vVv\in V in parallel do
3      Nϵ¯[v]1\underline{N_{\epsilon}}[v]\leftarrow 1; Nϵ¯[v]𝖵𝖠[v+1]𝖵𝖠[v]+1\overline{N_{\epsilon}}[v]\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v]+1; 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf{?}}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf{?}}⃝;
4for (u,v)E(u,v)\in E in parallel do
5      𝗌𝗂𝗆(u,v)\footnotesize{\bf{?}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf{?}}⃝;
6𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(𝖦,μ,ϵ){\mathsf{identifyCore(G,\mu,\epsilon)}}
7 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(𝖦,ϵ){\mathsf{detectClusters(G,\epsilon)}}
8 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(G){\mathsf{classifyHubOutlier}}(G)
Algorithm 1 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++(G,μ,ϵ){\mathsf{{\mathsf{GPUSCAN^{++}}}}}(G,\mu,\epsilon)

Lemma 4.1: Given a graph GG and two parameters μ\mu and ϵ\epsilon, for a vertex vV(G)v\in V(G), if Nϵ¯[v]μ\underline{N_{\epsilon}}[v]\geq\mu, then vv is a core vertex. If Nϵ¯[v]<μ\overline{N_{\epsilon}}[v]<\mu, then vv is not a core vertex. \Box

According to Lemma 4.1, we can maintain Nϵ¯[v]\underline{N_{\epsilon}}[v] and Nϵ¯[v]\overline{N_{\epsilon}}[v] progressively and delay the structural similarity computation until necessary. Following this idea, our new algorithm, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}}, is shown in Algorithm 1. It first initializes Nϵ¯[v]\underline{N_{\epsilon}}[v] as 11 and Nϵ¯[v]\overline{N_{\epsilon}}[v] as 𝖽𝖾𝗀(v)+1{\mathsf{deg}}(v)+1 for each vertex vv (lines 1-2, where 𝖵𝖠\mathsf{VA} refers to the vertex array). The role of each vertex and the structural similarity of each edge are initialized as unknown (\footnotesize{\bf{?}}⃝\large\footnotesize{\bf{?}}⃝) (lines 1-4). Then, it identifies the core vertices (line 5), detects the clusters based on the core vertices (line 6), and classifies the hub vertices and outlier vertices (line 7).
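The bound maintenance behind Lemma 4.1 can be sketched sequentially as follows (a simplified single-threaded illustration with names of our own choosing; on the GPU the bound updates are performed with atomicAdd). Each vertex starts with lower bound 1 (it is always in its own ε-neighborhood) and upper bound deg(v)+1, and a core or non-core decision may be reached before all of its edge similarities are computed.

```python
def progressive_cores(n, deg, mu, edge_results):
    """deg: degree per vertex; edge_results: ((u, v), similar) outcomes
    in the order they become available."""
    lower = [1] * n                          # known-similar neighbors, incl. itself
    upper = [deg[v] + 1 for v in range(n)]   # still-possible similar neighbors
    role = ['?'] * n
    for (u, v), similar in edge_results:
        for x in (u, v):
            if similar:
                lower[x] += 1                # tighten lower bound
            else:
                upper[x] -= 1                # tighten upper bound
            if role[x] == '?':
                if lower[x] >= mu:
                    role[x] = 'C'            # core (Lemma 4.1, first case)
                elif upper[x] < mu:
                    role[x] = '!C'           # non-core (second case)
    return role
```

Once a vertex's role is fixed, similarity computations that would only refine its bounds further can be skipped, which is the source of the savings over computing every similarity up front.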

Identify core vertices. 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} first identifies the core vertices in the graph based on the given parameters μ\mu and ϵ\epsilon, which is shown in Algorithm 2.

1 for (u,v)E(G)(u,v)\in E(G) in warp-level parallel do
2      if 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bf{?}}⃝𝗋𝗈𝗅𝖾[v]=\footnotesize{\bf?}⃝{\mathsf{role}}[u]=\large\footnotesize{\bf{?}}⃝\vee{\mathsf{role}}[v]=\large\footnotesize{\bf?}⃝ then
3           𝗂𝗌𝖲𝗂𝗆𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{isSim}}\leftarrow{\mathsf{checkSim}}(u,v,\epsilon);
4           // the first thread in the warp
5           if 𝗂𝗌𝖲𝗂𝗆={\mathsf{isSim}}= true then
6                𝗌𝗂𝗆(u,v)\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bfS}⃝;
7                𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[u],1){\mathsf{atomicAdd}}({\mathsf{\underline{N_{\epsilon}}}}[u],1);𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[v],1){\mathsf{atomicAdd}}({\mathsf{\underline{N_{\epsilon}}}}[v],1);
8                if Nϵ¯[u]μ{\underline{N_{\epsilon}}}[u]\geq\mu then 𝗋𝗈𝗅𝖾[u]\footnotesize{\bfC}⃝{\mathsf{role}}[u]\leftarrow\large\footnotesize{\bfC}⃝;
9                if Nϵ¯[v]μ{\underline{N_{\epsilon}}}[v]\geq\mu then 𝗋𝗈𝗅𝖾[v]\footnotesize{\bfC}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bfC}⃝;
10          else
11                𝗌𝗂𝗆(u,v)\footnotesize{{\bf!S}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{{\bf!S}}⃝;
12                𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[u],1){\mathsf{atomicAdd}}({\mathsf{\overline{N_{\epsilon}}}}[u],-1);𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝖭ϵ¯[v],1){\mathsf{atomicAdd}}({\mathsf{\overline{N_{\epsilon}}}}[v],-1);
13                if Nϵ¯[u]<μ{\overline{N_{\epsilon}}}[u]<\mu then 𝗋𝗈𝗅𝖾[u]\footnotesize{\bf{!C}}⃝{\mathsf{role}}[u]\leftarrow\large\footnotesize{\bf{!C}}⃝;
14                if Nϵ¯[v]<μ{\overline{N_{\epsilon}}}[v]<\mu then 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf!C}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf!C}⃝;
15
16 Procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{checkSim}}(u,v,\epsilon)
17      𝗌𝗎𝗆0{\mathsf{sum}}\leftarrow 0;
18      for w𝗇𝖻𝗋(u)w\in{\mathsf{nbr}}(u) in parallel do
19           𝗅𝗈𝗐𝖵𝖠[v];𝗁𝗂𝗀𝗁𝖵𝖠[v+1]{\mathsf{low}}\leftarrow{\mathsf{VA}}[v];\;{\mathsf{high}}\leftarrow{\mathsf{VA}}[v+1];
20          
21          while 𝗅𝗈𝗐<𝗁𝗂𝗀𝗁{\mathsf{low}}<{\mathsf{high}} do
22                𝗆𝗂𝖽(𝗅𝗈𝗐+𝗁𝗂𝗀𝗁)/2{\mathsf{mid}}\leftarrow({\mathsf{low}}+{\mathsf{high}})/2;
23                if w<𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]w<{\mathsf{AdjA}}[{\mathsf{mid}}] then
24                    𝗁𝗂𝗀𝗁𝗆𝗂𝖽;{\mathsf{high}}\leftarrow{\mathsf{mid}};
25                else if w>𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]w>{\mathsf{AdjA}}[{\mathsf{mid}}] then
26                    𝗅𝗈𝗐𝗆𝗂𝖽+1;{\mathsf{low}}\leftarrow{\mathsf{mid}}+1;
27               else
28                    𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖽𝖽(𝗌𝗎𝗆,1){\mathsf{atomicAdd}}({\mathsf{sum}},1); break;
29     du𝖵𝖠[u+1]𝖵𝖠[u]d_{u}\leftarrow{\mathsf{VA}}[u+1]-{\mathsf{VA}}[u];
30      dv𝖵𝖠[v+1]𝖵𝖠[v]d_{v}\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v];
31      return 𝗌𝗎𝗆+2ϵ(du+1)(dv+1){\mathsf{sum}}+2\geq\epsilon\cdot\sqrt{(d_{u}+1)\cdot(d_{v}+1)};
Algorithm 2 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(G,μ,ϵ){\mathsf{identifyCore}}(G,\mu,\epsilon)

To identify the core vertices, it assigns a warp for each edge (u,v)(u,v) to determine its structural similarity (line 1), which is similar to 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}. However, the detailed procedure to compute the similarity is significantly different from that of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}. Specifically, the similarity between uu and vv is computed only when the role of uu or vv is still unknown (line 2); otherwise, the computation is delayed to the following steps, since the roles of both vertices have already been determined and it is possible to obtain the clustering results without knowing the structural similarity of uu and vv. When the computation is necessary, it computes the structural similarity between uu and vv by invoking 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} (line 3). If uu and vv are similar, the similarity indicator between uu and vv is set to \footnotesize{\bfS}⃝\large\footnotesize{\bfS}⃝ (line 6) and Nϵ¯[u]{\underline{N_{\epsilon}}[u]}/Nϵ¯[v]{\underline{N_{\epsilon}}[v]} is increased by 11 (line 7). After that, if Nϵ¯[u]{\underline{N_{\epsilon}}[u]}/Nϵ¯[v]μ{\underline{N_{\epsilon}}[v]}\geq\mu, the role of uu/vv is set as \footnotesize{\bfC}⃝\large\footnotesize{\bfC}⃝ (lines 8-9). If uu and vv are dissimilar, Nϵ¯[u]{\overline{N_{\epsilon}}[u]}/Nϵ¯[v]{\overline{N_{\epsilon}}[v]} and the role of uu/vv are updated accordingly (lines 11-14).

(a) Find neighbor of v0v_{0} and v1v_{1}
(b) One thread per vertex in 𝗇𝖻𝗋(v1){\mathsf{nbr}}(v_{1})
(c) Parallel binary search in 𝗇𝖻𝗋(v0){\mathsf{nbr}}(v_{0})
Figure 4: Check Similarity between v0v_{0} and v1v_{1}

Procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} determines the structural similarity between uu and vv based on the given ϵ\epsilon. Without loss of generality, we assume 𝖽𝖾𝗀(u)<𝖽𝖾𝗀(v){\mathsf{deg}}(u)<{\mathsf{deg}}(v). For each neighbor ww of uu, it assigns a thread to check whether ww is also a neighbor of vv (line 17). Since the neighbors of vv are stored in increasing order of their ids in 𝖠𝖽𝗃𝖠\mathsf{AdjA} (the adjacent array), it looks up ww among the neighbors of vv in a binary-search manner (lines 18-26). Specifically, three indices 𝗅𝗈𝗐\mathsf{low}, 𝗁𝗂𝗀𝗁\mathsf{high}, and 𝗆𝗂𝖽\mathsf{mid} are maintained and initialized as 𝖵𝖠[v]{\mathsf{VA}}[v], 𝖵𝖠[𝗏+𝟣]{\mathsf{VA[v+1]}}, and (𝗅𝗈𝗐+𝗁𝗂𝗀𝗁)/2({\mathsf{low}}+{\mathsf{high}})/2, respectively. If ww is less than 𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]{\mathsf{AdjA}}[{\mathsf{mid}}], 𝗁𝗂𝗀𝗁\mathsf{high} is updated as 𝗆𝗂𝖽\mathsf{mid} (lines 21-22). If ww is larger than 𝖠𝖽𝗃𝖠[𝗆𝗂𝖽]{\mathsf{AdjA}}[{\mathsf{mid}}], 𝗅𝗈𝗐\mathsf{low} is updated as 𝗆𝗂𝖽+1{\mathsf{mid}}+1 (lines 23-24). Otherwise, ww is a common neighbor, and the counter 𝗌𝗎𝗆\mathsf{sum} of common neighbors of uu and vv is increased by 1 (line 26). Finally, whether uu and vv are similar is returned based on Definition 2.1 (lines 27-29).

Example 4.2: Fig. 4 shows the procedure of 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} to check the similarity between v0v_{0} and v1v_{1} of GG^{\prime} shown in Fig. 1. In order to compute their similarity, the neighbors of v1v_{1} and v0v_{0} are first obtained from the CSR-Enhanced graph layout as shown in Fig. 4 (a). As 𝖽𝖾𝗀(v1)<𝖽𝖾𝗀(v0){\mathsf{deg}}(v_{1})<{\mathsf{deg}}(v_{0}), threads T0T_{0}, T1T_{1}, T2T_{2}, and T3T_{3} in the warp are used to check whether the neighbors v0,v2,v4,v7v_{0},v_{2},v_{4},v_{7} of v1v_{1} are also neighbors of v0v_{0}, which is shown in Fig. 4 (b). Take thread T0T_{0} as an example: since the neighbors of v0v_{0} are sorted based on their ids, thread T0T_{0} explores the neighbors of v0v_{0} in a binary-search manner, which is shown in Fig. 4 (c). After reaching the leaf node v1v_{1}, it is clear that v0v_{0} is not a neighbor of v0v_{0}. Similarly, threads T1T3T_{1}-T_{3} find that v2v_{2}, v4v_{4}, and v7v_{7} are neighbors of v0v_{0} as well. Therefore, the common neighbors of v0v_{0} and v1v_{1} are v2v_{2}, v4v_{4}, and v7v_{7}. Because 2+3>0.6×5×82+3>0.6\times\sqrt{5\times 8}, v0v_{0} and v1v_{1} are similar, and 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim} returns 𝗍𝗋𝗎𝖾\mathsf{true}. \Box
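A sequential Python sketch of this procedure is given below (our stand-in for the warp-parallel CUDA kernel; names are ours). It counts common neighbors by binary-searching each neighbor of the smaller-degree endpoint in the sorted adjacency of the other endpoint, then applies the test sum+2 ≥ ε·√((deg(u)+1)(deg(v)+1)) from line 29 of Algorithm 2.

```python
import math

def check_sim(VA, AdjA, u, v, eps):
    """VA: vertex array, AdjA: adjacent array of the CSR-enhanced layout."""
    du, dv = VA[u + 1] - VA[u], VA[v + 1] - VA[v]
    if du > dv:
        u, v = v, u                   # iterate over the smaller neighbor list
    common = 0
    for w in AdjA[VA[u]:VA[u + 1]]:
        low, high = VA[v], VA[v + 1]  # binary search for w among nbr(v)
        while low < high:
            mid = (low + high) // 2
            if w < AdjA[mid]:
                high = mid
            elif w > AdjA[mid]:
                low = mid + 1
            else:
                common += 1           # w is a common neighbor
                break
    # +2 accounts for u and v themselves (closed neighborhoods)
    return common + 2 >= eps * math.sqrt((du + 1) * (dv + 1))
```

In the GPU kernel each of the binary searches runs on a separate thread of the warp, with the counter updated by atomicAdd; this sketch simply runs them one after another.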

1 for uVu\in V in parallel do
2      if 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ then 𝗉𝖺𝗋𝖾𝗇𝗍[u]u{\mathsf{parent}}[u]\leftarrow u; u.𝗁𝖾𝗂𝗀𝗁𝗍1u.{\mathsf{height}}\leftarrow 1;
3      else 𝗉𝖺𝗋𝖾𝗇𝗍[u]2{\mathsf{parent}}[u]\leftarrow-2;
4     
5for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]={\large\footnotesize{\bfC}⃝} in warp-level parallel do
6      upu_{p}\leftarrow 𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{root}}(u,{\mathsf{parent}});
7      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in parallel  do
8           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bfC}⃝𝗌𝗂𝗆(u,v)=\footnotesize{\bfS}⃝{\mathsf{role}}[v]=\large\footnotesize{\bfC}⃝\wedge{\mathsf{sim}}(u,v)=\large\footnotesize{\bfS}⃝ then
9                vp𝗋𝗈𝗈𝗍(v,𝗉𝖺𝗋𝖾𝗇𝗍)v_{p}\leftarrow{\mathsf{root}}(v,{\mathsf{parent}});
10                if upvpu_{p}\neq v_{p} then 𝗎𝗇𝗂𝗈𝗇(up,vp,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u_{p},v_{p},{\mathsf{parent}});
11               
12for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in warp-level parallel do
13      up𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍)u_{p}\leftarrow{\mathsf{root}}(u,{\mathsf{parent}});
14      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in warp-level parallel do
15           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bfC}⃝𝗌𝗂𝗆(u,v)=\footnotesize{\bf?}⃝{\mathsf{role}}[v]=\large\footnotesize{\bfC}⃝\wedge{\mathsf{sim}}(u,v)=\large\footnotesize{\bf?}⃝ then
16                vp𝗋𝗈𝗈𝗍(v,𝗉𝖺𝗋𝖾𝗇𝗍)v_{p}\leftarrow{\mathsf{root}}(v,{\mathsf{parent}});
17                if upvpu_{p}\neq v_{p} then
18                     if 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ){\mathsf{checkSim}}(u,v,\epsilon) then
19                          𝗌𝗂𝗆(u,v)\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bfS}⃝; 𝗎𝗇𝗂𝗈𝗇(up,vp,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u_{p},v_{p},{\mathsf{parent}});
20                    else 𝗌𝗂𝗆(u,v)\footnotesize{\bf!S}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf!S}⃝;
21for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in parallel do
22      𝗉𝖺𝗋𝖾𝗇𝗍[u]𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{parent}}[u]\leftarrow{\mathsf{root}}(u,{\mathsf{parent}});
23     
24for uV𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfC}⃝u\in V\wedge{\mathsf{role}}[u]=\large\footnotesize{\bfC}⃝ in warp-level parallel do
25      for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in warp-level parallel do
26           if 𝗋𝗈𝗅𝖾[v]=\footnotesize{\bf!C}⃝{\mathsf{role}}[v]=\large\footnotesize{\bf!C}⃝ then
27                if  𝗌𝗂𝗆(u,v)=\footnotesize{\bf?}⃝{\mathsf{sim}}(u,v)=\large\footnotesize{\bf?}⃝ then
28                     𝗌𝗂𝗆(u,v)𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆(u,v,ϵ)?\footnotesize{\bfS}⃝:\footnotesize{\bf!S}⃝{\mathsf{sim}}(u,v)\leftarrow{\mathsf{checkSim}}(u,v,\epsilon)?\,\large\footnotesize{\bfS}⃝:\large\footnotesize{\bf!S}⃝;
29               if 𝗌𝗂𝗆(u,v)=\footnotesize{\bfS}⃝{\mathsf{sim}}(u,v)=\large\footnotesize{\bfS}⃝ then
30                     𝗉𝖺𝗋𝖾𝗇𝗍[v]𝗉𝖺𝗋𝖾𝗇𝗍[u]{\mathsf{parent}}[v]\leftarrow{\mathsf{parent}}[u];
31                    
32 Procedure 𝗋𝗈𝗈𝗍(u,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{root}}(u,{\mathsf{parent}})
33      𝗉𝗎𝗉𝖺𝗋𝖾𝗇𝗍[u]{\mathsf{p_{u}}}\leftarrow{\mathsf{parent}}[u]; 𝗇𝖾𝗑𝗍𝗉𝖺𝗋𝖾𝗇𝗍[𝗉𝗎]{\mathsf{next}}\leftarrow{\mathsf{parent}}[{\mathsf{p_{u}}}];
34      while 𝗉𝗎𝗇𝖾𝗑𝗍{\mathsf{p_{u}}}\neq{\mathsf{next}} do
35           𝗉𝗎𝗇𝖾𝗑𝗍{\mathsf{p_{u}}}\leftarrow{\mathsf{next}}; 𝗇𝖾𝗑𝗍𝗉𝖺𝗋𝖾𝗇𝗍[𝗉𝗎]{\mathsf{next}}\leftarrow{\mathsf{parent}}[{\mathsf{p_{u}}}];
36     return 𝗉𝗎{\mathsf{p_{u}}}
37 Procedure 𝗎𝗇𝗂𝗈𝗇(u,v,𝗉𝖺𝗋𝖾𝗇𝗍){\mathsf{union}}(u,v,{\mathsf{parent}})
38      if u.𝗁𝖾𝗂𝗀𝗁𝗍<v.𝗁𝖾𝗂𝗀𝗁𝗍u.{\mathsf{height}}<v.{\mathsf{height}} then 𝗌𝗐𝖺𝗉(u,v){\mathsf{swap}}(u,v);
39      𝖿𝗀{\mathsf{fg}}\leftarrow true;
40      while 𝖿𝗀\mathsf{fg} do
41           𝖿𝗀{\mathsf{fg}}\leftarrowfalse; 𝗋𝖾𝗌𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲(&𝗉𝖺𝗋𝖾𝗇𝗍[v],v,u){\mathsf{res}}\;\leftarrow\;{\mathsf{atomicCAS}}({\mathsf{\&parent}}[v],v,u);
42           if 𝗋𝖾𝗌v{\mathsf{res}}\neq v then v𝗋𝖾𝗌;𝖿𝗀v\leftarrow{\mathsf{res}};{\mathsf{fg}}\leftarrow true;
43          
44     if uvu.𝗁𝖾𝗂𝗀𝗁𝗍=v.𝗁𝖾𝗂𝗀𝗁𝗍u\neq v\wedge u.{\mathsf{height}}=v.{\mathsf{height}} then 𝖺𝗍𝗈𝗆𝗂𝖼𝖠𝖣𝖣(u.𝗁𝖾𝗂𝗀𝗁𝗍,1){\mathsf{atomicADD}}(u.{\mathsf{height}},1);
Algorithm 3 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(G,ϵ){\mathsf{detectClusters}}(G,\epsilon)

Detect clusters. After the core vertices have been identified, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} detects the clusters in the graph, which is shown in Algorithm 3. After the core vertex identification phase, every vertex is either a core vertex or a non-core vertex. Moreover, the similarities of some edges are known, but there are still edges whose similarities are unknown. Therefore, Algorithm 3 first detects the sub-clusters that can be determined by the already computed information (lines 1-9). After that, it computes the similarity for the edges whose similarities are unknown and obtains the complete clusters in the graph (lines 10-27). In this way, we can avoid some unnecessary similarity computations compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}.

Specifically, Algorithm 3 uses the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array to store the cluster id that each vertex belongs to (a negative cluster id indicates that the corresponding vertex is currently not in a cluster). Similar to 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, Algorithm 3 also uses a spanning forest to represent the clusters through the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array. However, Algorithm 3 uses a different algorithm to construct the forest, which makes 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} much more efficient than 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, as verified in our experiments. It initializes each core vertex uu as a separate cluster with cluster id uu and each non-core vertex with a negative cluster id (lines 1-3). Then, for each core vertex uu, it first finds the root id of its cluster by the procedure 𝗋𝗈𝗈𝗍\mathsf{root} (line 5). After that, for each neighbor vv of uu, if vv is also a core vertex and uu and vv are already known to be similar (line 7), then uu and vv belong to the same cluster following Definition 2.1. Thus, we union the subtrees represented by uu and vv in 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} through procedure 𝗎𝗇𝗂𝗈𝗇\mathsf{union} (lines 6-9).

After having explored the sub-clusters that can be determined by the computed information, for each core vertex uu, Algorithm 3 further explores the neighbors vv of uu such that vv is also a core vertex but the similarity between uu and vv is unknown (lines 10-13). If uu and vv are currently not in the same cluster (line 15), it computes the similarity between uu and vv (line 16). If uu and vv are similar, we union the subtrees represented by uu and vv in the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array (line 17). Otherwise, uu and vv are marked as dissimilar (line 18). When the clusters of the core vertices are determined, the root cluster id is assigned to each core vertex (lines 19-20). At last, Algorithm 3 explores the non-core neighbors vv of each core vertex uu; if the similarity between uu and vv is unknown, their similarity is computed (lines 24-25). If uu and vv are similar, vv is merged into the cluster represented by uu (lines 26-27). The procedures 𝗋𝗈𝗈𝗍\mathsf{root} and 𝗎𝗇𝗂𝗈𝗇\mathsf{union} are used to find the root of a tree in the forest and to union two sub-trees in the forest, respectively. As they are self-explanatory, we omit their descriptions. Note that u.𝗁𝖾𝗂𝗀𝗁𝗍u.{\mathsf{height}}/v.𝗁𝖾𝗂𝗀𝗁𝗍v.{\mathsf{height}} in line 34 represents the height of the tree rooted at uu/vv.
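The skeleton of this phase can be sketched with a plain sequential union-find (our simplified illustration; the GPU version performs the parent updates with atomicCAS and balances trees by height, both of which are omitted here):

```python
def root(parent, u):
    # follow parent pointers to the root of u's tree
    while parent[u] != u:
        u = parent[u]
    return u

def union(parent, u, v):
    # merge the trees containing u and v
    ru, rv = root(parent, u), root(parent, v)
    if ru != rv:
        parent[rv] = ru

def detect_clusters(n, cores, similar_edges):
    """cores: set of core vertex ids; similar_edges: edges known similar."""
    parent = {u: (u if u in cores else -2) for u in range(n)}
    # step 1: union similar core-core edges into sub-clusters
    for u, v in similar_edges:
        if u in cores and v in cores:
            union(parent, u, v)
    # step 2: flatten so every core points directly to its root cluster id
    for u in cores:
        parent[u] = root(parent, u)
    # step 3: attach similar non-core neighbors to a core's cluster
    for u, v in similar_edges:
        if u in cores and v not in cores:
            parent[v] = parent[u]
        elif v in cores and u not in cores:
            parent[u] = parent[v]
    return parent
```

In Algorithm 3 the first step only uses similarities that were already computed during core identification; extra checkSim calls happen between steps 1 and 2 and are skipped whenever two cores are already in the same tree.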

(a) Process core vertices v1v_{1} and v4v_{4}
(b) Process v4v_{4} and v0v_{0}
(c) Process v7v_{7} and v0v_{0}
(d) Process v7v_{7} and v1v_{1}
(e) Trace root vertex
(f) Cluster non-core v2v_{2}
Figure 5: Detect clusters in GG^{\prime}

Example 4.3: Fig. 5 shows the procedure to detect clusters in GG^{\prime}. Assume that we know that v0v_{0}, v1v_{1}, v4v_{4}, v7v_{7} are core vertices and that (v1,v0)(v_{1},v_{0}) and (v4,v1)(v_{4},v_{1}) are similar, but the similarities of (v4,v0)(v_{4},v_{0}), (v7,v0)(v_{7},v_{0}) and (v7,v1)(v_{7},v_{1}) are unknown. The parents of the four core vertices are initialized as themselves while the others are set as -2.

𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} first processes the core vertices in parallel. To avoid redundant computation, when clustering core vertices, only neighbors with smaller ids than their own are considered. Concurrently, for v1v_{1}, since (v1,v0)(v_{1},v_{0}) are similar, the parent of v1v_{1} becomes v0v_{0}, and for v4v_{4}, since (v4,v1)(v_{4},v_{1}) are similar, the parent of v4v_{4} becomes v1v_{1}, which is shown in Fig. 5 (a). After that, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} detects the clusters related to v4v_{4} and v7v_{7}, as the similarities between some of their neighbors and themselves have not been determined yet. For the neighbor v0v_{0} of v4v_{4}, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} finds that v0v_{0} and v4v_{4} are already in the same cluster by procedure 𝗋𝗈𝗈𝗍\mathsf{root}, so no further processing is needed, which is shown in Fig. 5 (b). For v7v_{7}, due to the atomic operation, the thread that explores its neighbor v0v_{0} executes first. It finds that they are not in the same cluster based on procedure 𝗋𝗈𝗈𝗍\mathsf{root}. Therefore, after determining that they are similar by procedure 𝖼𝗁𝖾𝖼𝗄𝖲𝗂𝗆\mathsf{checkSim}, it changes the parent of v7v_{7} to v0v_{0}, which is shown in Fig. 5 (c). Then, v7v_{7} and v1v_{1} are processed by another thread similarly, which is shown in Fig. 5 (d). When the cluster of core vertices v0v_{0}, v1v_{1}, v4v_{4}, and v7v_{7} is detected, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} explores these vertices and sets their corresponding elements in the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} array to the root vertex in the forest. Fig. 5 (e) illustrates this process, in which the 𝗉𝖺𝗋𝖾𝗇𝗍\mathsf{parent} of v4v_{4} changes to the traced root vertex v0v_{0}. At last, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} sets the parent of the non-core vertex v2v_{2} based on its similarity with the core vertex v1v_{1}, as shown in Fig. 5 (f). \Box

1 for uV𝗉𝖺𝗋𝖾𝗇𝗍[u]<0u\in V\wedge{\mathsf{parent}}[u]<0 in warp-level parallel do
2      𝗉𝖺𝗋𝖾𝗇𝗍[u]2{\mathsf{parent}}[u]\leftarrow-2; 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfO}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfO}⃝;
3      if |𝗇𝖻𝗋(u)|>1|{\mathsf{nbr}}(u)|>1 then
4           𝗇𝖼1{\mathsf{n_{c}}}\leftarrow-1;
5           for v𝗇𝖻𝗋(u)v\in{\mathsf{nbr}}(u) in parallel  do
6                𝗂𝖽1{\mathsf{id}}\leftarrow-1;
7                if 𝗉𝖺𝗋𝖾𝗇𝗍[v]>=0{\mathsf{parent}}[v]>=0 then
8                     𝗂𝖽𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲(&𝗇𝖼,1,𝗉𝖺𝗋𝖾𝗇𝗍[v]){\mathsf{id}}\leftarrow{\mathsf{atomicCAS}}({\mathsf{\&n_{c}}},-1,{\mathsf{parent}}[v]);
9                     if 𝗂𝖽>=0𝗂𝖽𝗉𝖺𝗋𝖾𝗇𝗍[v]{\mathsf{id}}>=0\wedge{\mathsf{id}}\neq{\mathsf{parent}}[v] then
10                          𝗉𝖺𝗋𝖾𝗇𝗍[u]1{\mathsf{parent}}[u]\leftarrow-1; 𝗋𝗈𝗅𝖾[u]=\footnotesize{\bfH}⃝{\mathsf{role}}[u]=\large\footnotesize{\bfH}⃝;
11                          𝖻𝗋𝖾𝖺𝗄{\mathsf{break}};
12                         
Algorithm 4 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(G){\mathsf{classifyHubOutlier}}(G)

Classify hub and outlier vertices. At last, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} classifies the vertices that are not in clusters into outliers and hubs, which is shown in Algorithm 4. Specifically, it assigns a warp for each vertex uu not in a cluster and initially considers uu as an outlier (line 2). Then it checks whether uu has more than one neighbor, and if this condition is satisfied, uu has the potential to become a hub (line 3). Then, Algorithm 4 uses the shared variable 𝗇𝖼{\mathsf{n_{c}}} of the threads in a warp to record the cluster that one of uu's neighbors belongs to and initializes it as 1-1 (line 4). Each thread in the warp maintains an exclusive variable 𝗂𝖽{\mathsf{id}} with an initial value of 1-1 (line 5). If a neighbor vv of uu is found to be in a cluster by a thread, 𝗇𝖼{\mathsf{n_{c}}} is updated to the cluster id of vv by 𝖺𝗍𝗈𝗆𝗂𝖼𝖢𝖠𝖲\mathsf{atomicCAS} (lines 7-8). Once a thread finds that the cluster id recorded by 𝗇𝖼{\mathsf{n_{c}}} is different from the cluster id of the vv it processes, at least two neighbors of uu are in different clusters; therefore, uu is classified as a hub (lines 9-11).
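Stripped of the warp-level machinery, the classification rule reduces to the following check (a sequential sketch with names of our own choosing; the warp-parallel version detects the second distinct cluster id with atomicCAS instead of building a set):

```python
def classify(u, nbrs, parent):
    """u: a vertex not in any cluster; nbrs: its neighbors;
    parent: cluster id per vertex, negative = not in a cluster."""
    # collect the distinct clusters of u's clustered neighbors
    clusters = {parent[v] for v in nbrs if parent[v] >= 0}
    # hub if neighbors span at least two clusters, otherwise outlier
    return 'H' if len(clusters) >= 2 else 'O'
```

The warp can stop early as soon as a second distinct cluster id is observed, which is what the break in line 11 of Algorithm 4 achieves.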

4.3 Theoretical Analysis

Theorem 4.1: Given a graph GG and two parameters ϵ\epsilon and μ\mu, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} finishes the clustering with work O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and span O(log𝖽𝖾𝗀𝗆𝖺𝗑+logn)O(\log{\mathsf{deg}}_{{\mathsf{max}}}+\log n). \Box

Proof: In Phase 1 (Algorithm 2), the work and span of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} are O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and O(log𝖽𝖾𝗀max)O(\log{\mathsf{deg}}_{max}), respectively. In Phase 2 (Algorithm 3), 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} may check the similarity of all edges; thus, the work of Phase 2 is O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) as well. For the span, the height of the spanning forest is O(logn)O(\log n) due to the union-by-height strategy used in procedure 𝗎𝗇𝗂𝗈𝗇\mathsf{union} of Algorithm 3 [34]. Thus, the span of Phase 2 is O(logn)O(\log n). In Phase 3 (Algorithm 4), it is easy to derive that the work and span are O(m)O(m) and O(1)O(1), respectively. Therefore, the work of 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} is O(Σ(u,v)E(G)𝖽𝖾𝗀(u)log𝖽𝖾𝗀(v))O(\Sigma_{(u,v)\in E(G)}{\mathsf{deg}}(u)\cdot\log{\mathsf{deg}}(v)) and the span is O(log𝖽𝖾𝗀𝗆𝖺𝗑+logn)O(\log{\mathsf{deg}}_{{\mathsf{max}}}+\log n).

Compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}, Theorem 4.1 shows that 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} reduces both the total work and the span during the clustering. Note that in Phase 1, 𝖦𝖯𝖴𝖲𝖢𝖠𝖭++\mathsf{GPUSCAN^{++}} uses the binary-search-based manner to check the similarity, which reduces the span of Phase 1 from O(𝖽𝖾𝗀𝗆𝖺𝗑)O({\mathsf{deg}}_{{\mathsf{max}}}) to O(log𝖽𝖾𝗀𝗆𝖺𝗑)O(\log{\mathsf{deg}}_{{\mathsf{max}}}) compared with 𝖦𝖯𝖴𝖲𝖢𝖠𝖭\mathsf{GPUSCAN}.

5 A New Out-of-Core Algorithm

In the previous section, we focused on improving the clustering performance when the graph can fit in the GPU memory. However, in real applications, the graph data can be very large and the GPU memory is insufficient to load the whole graph. A straightforward solution is to directly use the Unified Virtual Memory (UVM) provided by GPUs. However, this approach is inefficient, as verified in our experiments. On the other hand, in most real graphs, such as social networks and web graphs, the number of edges is much larger than the number of vertices. For example, the largest dataset in SNAP contains 65 million nodes and 1.8 billion edges. It is therefore practical to assume that the vertices of the graph can be loaded in the GPU memory while the edges are stored in the CPU memory.

Following this assumption, we propose an out-of-core GPU 𝖲𝖢𝖠𝖭\mathsf{SCAN} algorithm that adopts a divide-and-conquer strategy. Specifically, we first divide the graph into small subgraphs whose size does not exceed the GPU memory size. Then, we conduct the clustering based on the divided subgraphs instead of the original graph. As the divided subgraphs are much smaller than the original graphs, the GPU memory requirement is significantly reduced. Before introducing our algorithm, we first define the local subgraphs as follows:

Definition 5.1: (Edge Extended Subgraph) Given a graph GG and a set of edges SE(G)S\subseteq E(G), the edge extended subgraph of SS consists of the edges in SS together with every edge (u,v)E(G)(u,v)\in E(G) that is incident to at least one edge in SS. \Box

Lemma 5.1: Given a graph GG and a set of edges SE(G)S\subseteq E(G), let GsG_{s} denote the edge extended subgraph of SS. Then, for any edge (u,v)S(u,v)\in S, σGs(u,v)=σ(u,v)\sigma_{G_{s}}(u,v)=\sigma(u,v), where σGs(u,v)\sigma_{G_{s}}(u,v) denotes the structural similarity computed based on GsG_{s}. \Box


Figure 6: Edge Extended Subgraph of GG^{\prime}

Example 5.1: Fig. 6 shows an edge extended subgraph of GG^{\prime}. As (v8,v9)(v_{8},v_{9}) and (v2,v8)(v_{2},v_{8}) share the same vertex v8v_{8} but (v8,v2)(v_{8},v_{2}) is not in SS, (v8,v2)(v_{8},v_{2}) should be added into the edge extended subgraph. \Box

Therefore, we divide GG into a series of edge extended subgraphs GS1,GS2,,GSkG_{S_{1}},G_{S_{2}},\dots,G_{S_{k}} such that S1S2Sk=E(G)S_{1}\cup S_{2}\cup\dots\cup S_{k}=E(G) and SiSj=S_{i}\cap S_{j}=\emptyset, where 1ijk1\leq i\neq j\leq k, and each edge extended subgraph can be loaded in the GPU memory. Following Lemma 5.1, we can correctly obtain the structural similarity for all edges in SiS_{i} based on GSiG_{S_{i}}. Moreover, as the vertices can be loaded in the GPU memory, we can identify the core vertices by iterating over the edge extended subgraphs and maintain the clustering information in the GPU memory accordingly.
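Definition 5.1 and Lemma 5.1 can be illustrated with a small sketch (our own Python illustration; `sigma` recomputes the structural similarity from scratch over closed neighborhoods, and all names are assumptions). Because the extended subgraph keeps every edge touching an endpoint of S, the neighborhoods of both endpoints of each edge in S are preserved, so the similarity is unchanged.

```python
import math

def edge_extended(edges, S):
    """Edges of S plus every edge of G incident to an edge in S (Def. 5.1)."""
    touched = {x for e in S for x in e}
    return set(S) | {(u, v) for (u, v) in edges if u in touched or v in touched}

def sigma(edges, u, v):
    """Structural similarity of (u, v) over closed neighborhoods."""
    nbr = lambda x: ({b for a, b in edges if a == x} |
                     {a for a, b in edges if b == x})
    nu, nv = nbr(u) | {u}, nbr(v) | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```

On a path-like graph, the extended subgraph of a single edge keeps only the edges around its endpoints, yet the similarity of that edge matches its value in the whole graph, which is exactly Lemma 5.1.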

1 Gs(Vs,Es,𝗌𝗂𝗆s);SGs;s0G_{s}(V_{s},E_{s},{\mathsf{sim}}_{s})\leftarrow{\emptyset};S\leftarrow G_{s};s\leftarrow 0
2
3/* Code Executed by CPU */
4 for (u,v)E(G)(u,v)\in E(G)  do
5      Gs(Vs,Es,𝗌𝗂𝗆s)G^{\prime}_{s}(V^{\prime}_{s},E^{\prime}_{s},{\mathsf{sim^{\prime}}}_{s})\leftarrow{\emptyset}
6      for w𝗇𝖻𝗋(u)/𝗇𝖻𝗋(v)w\in{\mathsf{nbr}}(u)/{\mathsf{nbr}}(v) do
7           if wVsw\notin V_{s} then
8                VsVs{w}V^{\prime}_{s}\leftarrow V^{\prime}_{s}\cup\{w\};
9               
10          if (u,w)Es(u,w)\notin E_{s} then
11                EsEs{(u,w)}E^{\prime}_{s}\leftarrow E^{\prime}_{s}\cup\{(u,w)\};
12               
13     if 𝖬𝖾𝗆𝗈𝖿(𝖦𝗌)+𝖬𝖾𝗆𝗈𝖿(𝖦𝗌)+15|V(G)|>Mg{\mathsf{Memof(G_{s})}}+{\mathsf{{Memof}(G^{\prime}_{s})}}+15|V(G)|>M_{g} then
14           ss+1;Gs(Vs,Es,𝗌𝗂𝗆s);SSGss\leftarrow s+1;G_{s}(V_{s},E_{s},{\mathsf{sim}}_{s})\leftarrow{\emptyset};S\leftarrow S\cup G_{s}
15          
16     VsVsVs;EsEsEsV_{s}\leftarrow V_{s}\cup V^{\prime}_{s};E_{s}\leftarrow E_{s}\cup E^{\prime}_{s}
17     
18     𝗌𝗂𝗆(u,v)\footnotesize{\bf{?}}⃝{\mathsf{sim}}(u,v)\leftarrow\large\footnotesize{\bf{?}}⃝; 𝗌𝗂𝗆s𝗌𝗂𝗆s{𝗌𝗂𝗆(u,v)}{\mathsf{sim}}_{s}\leftarrow{\mathsf{sim}}_{s}\cup\left\{{\mathsf{sim}}(u,v)\right\}
19     
20/* Code Executed by GPU */
21 for vVv\in V in parallel do
22      Nϵ¯[v]1\underline{N_{\epsilon}}[v]\leftarrow 1; Nϵ¯[v]𝖵𝖠[v+1]𝖵𝖠[v]+1\overline{N_{\epsilon}}[v]\leftarrow{\mathsf{VA}}[v+1]-{\mathsf{VA}}[v]+1; 𝗋𝗈𝗅𝖾[v]\footnotesize{\bf{?}}⃝{\mathsf{role}}[v]\leftarrow\large\footnotesize{\bf{?}}⃝; 𝗉𝖺𝗋𝖾𝗇𝗍[v]2{\mathsf{parent}}[v]\leftarrow-2;
23for GsSG_{s}\in S do
24      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾(𝖦𝗌,μ,ϵ){\mathsf{identifyCore(G_{s},\mu,\epsilon)}}; 𝗌𝗍𝗈𝗋𝖾(𝖦𝗌){\mathsf{store(G_{s})}};
25for GsSG_{s}\in S do
26      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌(𝖦𝗌,ϵ){\mathsf{detectClusters(G_{s},\epsilon)}}; 𝗌𝗍𝗈𝗋𝖾(𝖦𝗌){\mathsf{store(G_{s})}};
27for GsSG_{s}\in S do
28      𝗅𝗈𝖺𝖽(𝖦𝗌){\mathsf{load(G_{s})}}; 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋(Gs,ϵ){\mathsf{classifyHubOutlier}}(G_{s},\epsilon);
Algorithm 5 𝖦𝖯𝖴𝖲𝖢𝖠𝖭𝖮++(G,μ,ϵ,Mg){\mathsf{GPUSCAN^{++}_{O}}}(G,\mu,\epsilon,M_{g})

Algorithm. Following the above idea, our algorithm GPU-based out-of-core algorithm is shown in Algorithm 5. It first partitions the edges E(G)E(G) of GG into a series of disjoint sets and constructs the corresponding edge extended subgraphs that can be fit in the GPU memory based on the GPU memory capacity MgM_{g} following Definition 5 by CPU (line 3-12). After that, we initialize Nϵ¯[v]\underline{N_{\epsilon}}[v], Nϵ¯[v]\overline{N_{\epsilon}}[v], 𝗋𝗈𝗅𝖾{\mathsf{role}}, and 𝗉𝖺𝗋𝖾𝗇𝗍{\mathsf{parent}} array for each vertex in GPU memory following our assumption. After that, based on Lemma 5, the similarity for each edge can be correctly obtained by the edge extended subgraph alone. Accordingly, the clusters and the role of each vertex can also be determined by the edge extended subgraph locally. Therefore, we just load each edge extended subgraph into the GPU memory sequentially and repeat the three phases by invoking 𝗂𝖽𝖾𝗇𝗍𝗂𝖿𝗒𝖢𝗈𝗋𝖾\mathsf{identifyCore}, 𝖽𝖾𝗍𝖾𝖼𝗍𝖢𝗅𝗎𝗌𝗍𝖾𝗋𝗌\mathsf{detectClusters}, and 𝖼𝗅𝖺𝗌𝗌𝗂𝖿𝗒𝖧𝗎𝖻𝖮𝗎𝗍𝗅𝗂𝖾𝗋\mathsf{classifyHubOutlier} for the in-GPU-memory algorithm (lines 14-21). The correctness of Algorithm 5 can be easily obtained based on Lemma 5 and the correctness of Algorithm 1. The only thing that needs to explain is how to estimate the size of the edge extended subgraph in line 10. In Algorithm 5, GsG_{s} represents the current edge extended subgraph (line 1). Whenever a new edge (u,v)(u,v) is to be added into GsG_{s} (line 3), the edges incident to (u,v)(u,v) are computed and stored by GsG^{\prime}_{s}. According to the CSR-Enhanced graph layout introduced in Section 4.1, 𝖬𝖾𝗆𝗈𝖿(Gs)=25×|Es|+4×|Vs|{\mathsf{Memof}}(G_{s})=25\times|E_{s}|+4\times|V_{s}|. As adjacent array, Eid array, and degree-oriented edge array consumes 24×|Es|24\times|E_{s}| bytes together, the similarity array consumes |Es||E_{s}| byte, and vertex array consumes 4×|Vs|4\times|V_{s}| (we use 4-byte integer to store a vertex id). 
Memof(G′s) can be computed in the same way. Moreover, the per-vertex arrays N̲ϵ (2 bytes per element, as each element never exceeds the parameter μ, which is generally not very large), N̄ϵ (4 bytes per element), role (1 byte per element), parent (4 bytes per element), and height (4 bytes per element) take 15×|V(G)| bytes in total. If the sum of these three memory consumptions exceeds the GPU memory capacity Mg in line 10, we start a new empty edge extended subgraph in line 11; otherwise, G′s is merged into Gs in line 12.
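To make the size check in line 10 concrete, the following is a minimal host-side Python sketch of the memory-driven edge partitioning, not the paper's actual CPU code: the function names memof and partition_edges are ours, and for brevity the sketch applies the byte formulas above to one edge at a time, ignoring the incident-edge expansion G′s.

```python
def memof(num_edges, num_vertices):
    # CSR-Enhanced layout: 24 bytes/edge for the adjacency, Eid, and
    # degree-oriented edge arrays, 1 byte/edge for the similarity array,
    # plus 4 bytes per vertex id.
    return 25 * num_edges + 4 * num_vertices

def partition_edges(edges, num_total_vertices, gpu_mem):
    """Greedily pack edges into subgraphs whose estimated footprint,
    together with the resident per-vertex arrays, fits in gpu_mem."""
    # Per-vertex arrays kept in GPU memory for the whole run:
    # 2 (N_eps lower) + 4 (N_eps upper) + 1 (role) + 4 (parent) + 4 (height)
    fixed = 15 * num_total_vertices
    parts, cur_edges, cur_verts = [], [], set()
    for u, v in edges:
        new_verts = cur_verts | {u, v}
        if cur_edges and fixed + memof(len(cur_edges) + 1, len(new_verts)) > gpu_mem:
            parts.append(cur_edges)  # current subgraph is full: flush it
            cur_edges, cur_verts = [], set()
            new_verts = {u, v}
        cur_edges.append((u, v))
        cur_verts = new_verts
    if cur_edges:
        parts.append(cur_edges)
    return parts
```

Each returned part then plays the role of one edge extended subgraph that is loaded into GPU memory in turn.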

As verified in our experiments, Algorithm 5 can efficiently finish the SCAN clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.

6 Experiments

This section presents our experimental results. The in-memory algorithms are evaluated on a machine with an Intel Xeon 2.4GHz CPU, 128 GB main memory, and an Nvidia Tesla V100 16GB GPU, running Ubuntu 16.04 LTS. The out-of-core algorithms are evaluated on a machine with an Nvidia GTX 1050 2GB GPU, running Ubuntu 16.04 LTS.

TABLE II: Datasets used in Experiments (n: number of vertices; m: number of edges; d̄: average degree; c: number of iterations of GPUSCAN's cluster detection)
Datasets      Name  n           m              d̄       c
Enwiki-2022   EW    6,492,490   159,047,205    24.50    91
IndoChina     CN    7,414,866   194,109,311    26.18    143
Hollywood     HW    2,180,759   228,985,632    105.00   35
Orkut         OR    3,072,626   234,370,166    76.28    28
Tech-P2P      TP    5,792,297   295,659,774    51.04    173
UK-2002       UK    18,520,486  298,113,762    16.10    201
EU-2015       EU    11,264,052  386,915,963    34.35    230
Soc-Twitter   ST    21,297,773  530,051,090    24.89    30
Twitter-2010  TW    41,652,230  1,468,365,182  35.25    -
Friendster    FR    65,608,366  1,806,067,135  27.53    -
Figure 7: Performance when varying ϵ (μ=6); one panel per dataset: (a) EW, (b) CN, (c) HW, (d) OR, (e) TP, (f) UK, (g) EU, (h) ST.
Figure 8: Performance when varying μ (ϵ=0.5); one panel per dataset: (a) EW, (b) CN, (c) HW, (d) OR, (e) TP, (f) UK, (g) EU, (h) ST.

Datasets. We evaluate the algorithms on ten real graphs. OR and FR are downloaded from SNAP [17]. TP and ST are downloaded from the Network Repository [25]. CN, EW, HW, UK, and EU are downloaded from WebGraph [6, 5, 4]. The details are shown in Table II. The first eight datasets are used to evaluate the in-memory algorithms, while the last two cannot fit into the memory of the Nvidia Tesla V100 GPU and are used to evaluate the out-of-core algorithms.

Algorithms. We evaluate the following algorithms:

  • GPUSCAN: The state-of-the-art GPU-based algorithm [37], which is introduced in Section 3.2.

  • GPUSCAN++: Our proposed GPU-based in-memory algorithm (Algorithm 1 in Section 4).

  • GPUSCAN++_UVM: The direct out-of-core algorithm, i.e., GPUSCAN++ executed on top of unified virtual memory (UVM).

  • GPUSCAN++_O: Our proposed GPU-based out-of-core algorithm (Algorithm 5 in Section 5).

We use CUDA 10.1 and GCC 4.8.5 to compile all code with the -O3 option. The time cost of the algorithms is measured as the elapsed wall-clock time during execution. We set the maximum running time for each test to 100,000 seconds; if a test does not finish within the time limit, we denote the corresponding running time as INF. For ϵ, we choose ϵ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8} with μ=6 as the default. For μ, we choose μ ∈ {3, 6, 11, 16, 31} with ϵ=0.5 as the default.

TABLE III: Time consumption of GPUSCAN and GPUSCAN++ in each phase with ϵ=0.5 and μ=6 (s)
Dataset  Phase 1 (GPUSCAN / GPUSCAN++ / Speedup)  Phase 2 (GPUSCAN / GPUSCAN++ / Speedup)  Phase 3 (GPUSCAN / GPUSCAN++ / Speedup)
EW   581.32 / 6.35 / 91.55      0.80 / 0.022 / 3.64         10.60 / 0.23 / 46.08
CN   846.88 / 4.31 / 196.49     94.50 / 0.025 / 3780.00     6.57 / 0.071 / 92.48
HW   203.84 / 9.33 / 21.85      76.579 / 0.010 / 7657.90    5.94 / 0.017 / 349.41
OR   108.43 / 6.02 / 18.01      1.14 / 0.013 / 87.69        8.49 / 0.12 / 70.75
TP   998.70 / 13.65 / 73.16     4.24 / 0.021 / 201.90       10.87 / 1.25 / 8.696
UK   315.31 / 6.95 / 45.37      65.33 / 0.058 / 1126.38     11.96 / 0.16 / 74.75
EU   2456.73 / 16.58 / 148.17   493.97 / 0.036 / 13721.39   13.09 / 0.237 / 55.23
ST   3030.11 / 40.04 / 75.68    11.39 / 0.064 / 117.97      18.20 / 0.30 / 60.67

6.1 Performance Studies on In-memory Algorithms

Exp-1: Performance when varying ϵ. In this experiment, we evaluate the performance of GPUSCAN and GPUSCAN++ by varying ϵ. We report the processing time on the first eight datasets in Fig. 7.

Fig. 7 shows that GPUSCAN consumes much more time than GPUSCAN++ across all values of ϵ. For example, on EU, GPUSCAN takes 3141.11s to finish the clustering when ϵ=0.2 while GPUSCAN++ only takes 18.93s, a speedup of 168 times. This is because GPUSCAN introduces lots of extra parallelization costs compared with GPUSCAN++, which is consistent with the time complexity analysis in Section 3 and Section 4. Moreover, Fig. 7 shows that the running times of both GPUSCAN and GPUSCAN++ decrease as ϵ increases. For GPUSCAN, this is because as ϵ increases, the number of core vertices and clusters decreases, which reduces the time for cluster detection; as Exp-3 shows, the cluster detection phase accounts for a large proportion of the total time. For GPUSCAN++, as ϵ increases, more vertex pairs are dissimilar; consequently, more unnecessary similarity computation is avoided, and the running time decreases as well.
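For intuition on why a larger ϵ prunes more work, recall the standard SCAN structural similarity [40] over closed neighborhoods. The following is a minimal Python sketch under our own naming (the function similar and the dict-of-sets adjacency representation are assumptions for illustration, not the paper's GPU code):

```python
import math

def similar(adj, u, v, eps):
    """Check whether u and v are structurally similar under threshold eps:
    |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|) >= eps,
    where N[x] is the closed neighborhood of x (neighbors plus x itself)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    common = len(nu & nv)
    # A pair needs at least eps * sqrt(|N[u]||N[v]|) common neighbors to
    # be similar, so a larger eps lets more pairs fail (and be skipped)
    # before any further computation on them.
    return common >= eps * math.sqrt(len(nu) * len(nv))
```

For instance, a pair sharing only a couple of neighbors passes a low threshold such as ϵ=0.2 but fails a high one such as ϵ=0.8, so high-ϵ runs discard more pairs early.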

Exp-2: Performance when varying μ. In this experiment, we evaluate the performance of GPUSCAN and GPUSCAN++ by varying μ. The results are shown in Fig. 8.

As shown in Fig. 8, GPUSCAN++ is much more efficient than GPUSCAN. For example, on UK, GPUSCAN takes 411.97s to finish the clustering when μ=3 while GPUSCAN++ only takes 8.866s. The reasons are the same as discussed in Exp-1.

Exp-3: Time consumption of each phase. In this experiment, we compare the time consumption of GPUSCAN and GPUSCAN++ in each phase with ϵ=0.5 and μ=6. The results are shown in Table III.

As shown in Table III, GPUSCAN++ achieves significant speedup over GPUSCAN in all three phases, especially in Phase 2. For example, on dataset CN, GPUSCAN takes 846.88s/94.50s/6.57s in Phase 1/Phase 2/Phase 3, respectively, while GPUSCAN++ takes 4.31s/0.025s/0.071s in these three phases, a speedup of 196.49/3780.00/92.48 times, respectively. The reasons are as follows. For Phase 1, the span of GPUSCAN is O(deg_max), while GPUSCAN++ reduces the span to O(log(deg_max)). For Phase 2, GPUSCAN has to construct E′.u and E′.v by sort() and partition() of the Thrust library in each iteration, which is time-consuming; moreover, it needs several iterations to finish the whole cluster detection, as shown in the last column of Table II. On the other hand, GPUSCAN++ constructs the spanning forest based on the parent array directly, and the time-consuming retrieval of E′.u and E′.v is avoided thanks to the CSR-Enhanced graph layout. For Phase 3, GPUSCAN still needs to retrieve E′.u and E′.v from A.u and A.v, while GPUSCAN++ avoids these time-consuming operations entirely.
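To make the Phase 2 contrast concrete, the sketch below illustrates the idea behind spanning-forest-based cluster detection: once every vertex stores a parent pointer, cluster labels can be resolved by pointer jumping, with no sort/partition passes at all. This is a simplified sequential Python sketch under our own naming (resolve_roots is a hypothetical function, not the paper's CUDA kernel):

```python
def resolve_roots(parent):
    """Collapse each vertex's parent chain down to its root by pointer
    jumping: repeatedly replace parent[v] with parent[parent[v]].

    On a GPU the inner loop runs as a data-parallel kernel over all
    vertices; each pass at least halves the remaining path lengths, so
    the labels converge in O(log n) rounds."""
    parent = list(parent)
    changed = True
    while changed:
        changed = False
        for v in range(len(parent)):
            grand = parent[parent[v]]
            if grand != parent[v]:
                parent[v] = grand  # jump over one level of the tree
                changed = True
    return parent
```

After convergence, two vertices belong to the same cluster tree exactly when they point to the same root, which serves directly as the cluster label.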

6.2 Performance Studies on Out-of-Core Algorithms

Exp-5: Performance when varying ϵ. In this experiment, we evaluate the performance of GPUSCAN++_UVM and GPUSCAN++_O when varying the value of ϵ on datasets TW and FR. The results are shown in Fig. 9.

Fig. 9 shows that GPUSCAN++_O is much more efficient than GPUSCAN++_UVM on these two datasets. For example, on TW, GPUSCAN++_UVM takes 30886.315s to finish the clustering when ϵ=0.2 while GPUSCAN++_O finishes in 3866.424s. On FR, GPUSCAN++_UVM cannot finish the clustering when ϵ=0.2, 0.3, or 0.4, while GPUSCAN++_O finishes in a reasonable time. This is because UVM provides a general mechanism to overcome the GPU memory limitation, but the data locality of structural clustering is poor, so GPUSCAN++_UVM incurs lots of time-consuming data movement between CPU memory and GPU memory. On the other hand, GPUSCAN++_O conducts the clustering based on the edge extended subgraphs, which not only avoids the time-consuming data movement but also overcomes the GPU memory limitation. Moreover, the running times of GPUSCAN++_O and GPUSCAN++_UVM decrease as the value of ϵ increases. This is because as ϵ increases, more vertex pairs are dissimilar; consequently, more unnecessary similarity computation is avoided in both algorithms.

Figure 9: Performance when varying ϵ (μ=6); panels: (a) TW, (b) FR.
Figure 10: Performance when varying μ (ϵ=0.5); panels: (a) TW, (b) FR.

Exp-6: Performance when varying μ. In this experiment, we evaluate the performance of GPUSCAN++_UVM and GPUSCAN++_O when varying the value of μ on datasets TW and FR. The results are shown in Fig. 10.

As shown in Fig. 10, GPUSCAN++_O is much more efficient than GPUSCAN++_UVM. For example, on TW, GPUSCAN++_UVM takes 19037.683s to finish the clustering when μ=3 while GPUSCAN++_O finishes in 2974.947s. The reasons are the same as discussed in Exp-5.

7 Conclusion

In this paper, we study the GPU-based structural clustering problem. Motivated by the fact that the state-of-the-art GPU-based structural clustering algorithm suffers from inefficiency and the GPU memory limitation, we propose new GPU-based structural clustering algorithms. To address the efficiency issue, we propose a new progressive clustering method tailored for GPUs that not only avoids extra parallelization costs but also fully exploits the computing resources of GPUs. To address the GPU memory limitation issue, we propose a partition-based algorithm for structural clustering that can process large graphs with limited GPU memory. We conduct experiments on ten real graphs, and the experimental results demonstrate the efficiency of our proposed algorithms.

References

  • [1] C. C. Aggarwal and H. Wang. A survey of clustering algorithms for graph data. In Managing and mining graph data, pages 275–301. Springer, 2010.
  • [2] A. Bellogín and J. Parapar. Using graph partitioning techniques for neighbour selection in user-based collaborative filtering. In Proceedings of ACM RecSys, pages 213–216, 2012.
  • [3] C. Biemann. Chinese whispers-an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the first workshop on graph based methods for natural language processing, pages 73–80, 2006.
  • [4] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):711–726, 2004.
  • [5] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 587–596. ACM Press, 2011.
  • [6] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595–601, Manhattan, USA, 2004. ACM Press.
  • [7] L. Chang, W. Li, L. Qin, W. Zhang, and S. Yang. pSCAN: Fast and exact structural graph clustering. IEEE Transactions on Knowledge and Data Engineering, 29(2):387–401, 2017.
  • [8] Y. Che, S. Sun, and Q. Luo. Parallelizing pruning-based graph structural clustering. In Proceedings of ICPP, pages 77:1–77:10, 2018.
  • [9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2022.
  • [10] Y. Ding, M. Chen, Z. Liu, D. Ding, Y. Ye, M. Zhang, R. Kelly, L. Guo, Z. Su, S. C. Harris, et al. atbionet–an integrated network analysis tool for genomics and biomarker discovery. BMC genomics, 13(1):1–12, 2012.
  • [11] W. Fan, R. Jin, M. Liu, P. Lu, X. Luo, R. Xu, Q. Yin, W. Yu, and J. Zhou. Application driven graph partitioning. In Proceedings of SIGMOD, pages 1765–1779, 2020.
  • [12] P. Gera, H. Kim, P. Sao, H. Kim, and D. Bader. Traversing large graphs on gpus with unified memory. Proceedings of the VLDB Endowment, 13(7):1119–1133, 2020.
  • [13] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
  • [14] R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. nature, 433(7028):895–900, 2005.
  • [15] L. Hu, L. Zou, and Y. Liu. Accelerating triangle counting on gpu. In Proceedings of SIGMOD, pages 736–748, 2021.
  • [16] U. Kang and C. Faloutsos. Beyond ’caveman communities’: Hubs and spokes for graph compression and mining. In Proceedings of ICDM, pages 300–309, 2011.
  • [17] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [18] X. Li, H. Cai, Z. Huang, Y. Yang, and X. Zhou. Social event identification and ranking on flickr. World Wide Web, 18(5):1219–1245, 2015.
  • [19] W. Lin, X. Xiao, X. Xie, and X.-L. Li. Network motif discovery: A gpu approach. IEEE transactions on knowledge and data engineering, 29(3):513–528, 2016.
  • [20] V. Martha, W. Zhao, and X. Xu. A study on twitter user-follower network: A network based analysis. In Proceedings of ASONAM, pages 1405–1409, 2013.
  • [21] V.-S. Martha, Z. Liu, L. Guo, Z. Su, Y. Ye, H. Fang, D. Ding, W. Tong, and X. Xu. Constructing a robust protein-protein interaction network by integrating multiple public databases. In BMC bioinformatics, volume 12, pages 1–10. Springer, 2011.
  • [22] L. Meng, L. Yuan, Z. Chen, X. Lin, and S. Yang. Index-based structural clustering on directed graphs. In Proceedings of ICDE, pages 2832–2845, 2022.
  • [23] NVIDIA. Cuda thrust library. https://docs.nvidia.com/cuda/thrust/, 2021.
  • [24] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. nature, 435(7043):814–818, 2005.
  • [25] R. A. Rossi and N. K. Ahmed. Graph repository. http://www.graphrepository.com, 2013.
  • [26] B. Ruan, J. Gan, H. Wu, and A. Wirth. Dynamic structural clustering on graphs. In Proceedings of SIGMOD, pages 1491–1503, 2021.
  • [27] S. E. Schaeffer. Graph clustering. Computer science review, 1(1):27–64, 2007.
  • [28] M. Schinas, S. Papadopoulos, Y. Kompatsiaris, and P. A. Mitkas. Visual event summarization on social media using topic modelling and graph-based ranking algorithms. In Proceedings of ICMR, pages 203–210, 2015.
  • [29] M. Schinas, S. Papadopoulos, G. Petkos, Y. Kompatsiaris, and P. A. Mitkas. Multimodal graph-based event detection and summarization in social media streams. In Proceedings of SIGMM, pages 189–192, 2015.
  • [30] J. H. Seo and M. H. Kim. pm-scan: an I/O efficient structural clustering algorithm for large-scale graphs. In Proceedings of CIKM, pages 2295–2298, 2017.
  • [31] M. Sha, Y. Li, and K.-L. Tan. Self-adaptive graph traversal on gpus. In Proceedings of SIGMOD, pages 1558–1570, 2021.
  • [32] X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua. Graph processing on gpus: A survey. ACM Computing Surveys (CSUR), 50(6):1–35, 2018.
  • [33] H. Shiokawa, Y. Fujiwara, and M. Onizuka. Scan++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proceedings of the VLDB Endowment, 8(11):1178–1189, 2015.
  • [34] N. Simsiri, K. Tangwongsan, S. Tirthapura, and K.-L. Wu. Work-efficient parallel union-find. Concurrency and Computation: Practice and Experience, 30(4):e4333, 2018.
  • [35] D. P. Singh, I. Joshi, and J. Choudhary. Survey of gpu based sorting algorithms. International Journal of Parallel Programming, 46:1017–1034, 2018.
  • [36] T. R. Stovall, S. Kockara, and R. Avci. Gpuscan: Gpu-based parallel structural clustering algorithm for networks. IEEE Transactions on Parallel and Distributed Systems, 26(12):3381–3393, 2014.
  • [37] T. R. Stovall, S. Kockara, and R. Avci. GPUSCAN: gpu-based parallel structural clustering algorithm for networks. IEEE Transactions on Parallel and Distributed Systems, 26(12):3381–3393, 2015.
  • [38] T. Tseng, L. Dhulipala, and J. Shun. Parallel index-based structural graph clustering and its approximation. In Proceedings of SIGMOD, pages 1851–1864, 2021.
  • [39] D. Wen, L. Qin, Y. Zhang, L. Chang, and X. Lin. Efficient structural graph clustering: an index-based approach. Proceedings of the VLDB Endowment, 11(3):243–255, 2017.
  • [40] X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. SCAN: a structural clustering algorithm for networks. In Proceedings of SIGKDD, pages 824–833, 2007.
  • [41] C. Ye, Y. Li, B. He, Z. Li, and J. Sun. Gpu-accelerated graph label propagation for real-time fraud detection. In Proceedings of the SIGMOD, pages 2348–2356, 2021.
  • [42] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In Proceedings of SIGKDD, pages 555–564, 2017.
  • [43] W. Zhao, G. Chen, and X. Xu. Anyscan: An efficient anytime framework with active learning for large-scale network clustering. In Proceedings of ICDM, pages 665–674. IEEE Computer Society, 2017.
  • [44] Q. Zhou and J. Wang. Sparkscan: a structure similarity clustering algorithm on spark. In National Conference on Big Data Technology and Applications, pages 163–177. Springer, 2015.
[Uncaptioned image] Long Yuan is currently a professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his PhD degree from the University of New South Wales, Australia, and his M.S. and B.S. degrees both from Sichuan University, China. His research interests include graph data management and analysis. He has published papers in conferences and journals including VLDB, ICDE, WWW, The VLDB Journal, and TKDE.
[Uncaptioned image] Zeyu Zhou is currently a master student in the Department of Computer Science, Nanjing University of Science and Technology. His research interests include graph data management and analysis.
[Uncaptioned image] Xuemin Lin received the BSc degree in applied math from Fudan University, in 1984, and the PhD degree in computer science from the University of Queensland, in 1992. He is a professor with the School of Computer Science and Engineering, University of New South Wales. His current research interests include approximate query processing, spatial data analysis, and graph visualization. He is a fellow of the IEEE.
[Uncaptioned image] Zi Chen received the PhD degree from the Software Engineering Institute, East China Normal University, in 2021. She is currently a postdoctoral researcher at ECNU. Her research interest is big graph data query and analysis, including cohesive subgraph search on large-scale social networks and path planning on road networks.
[Uncaptioned image] Xiang Zhao received the PhD degree from the University of New South Wales, Australia, in 2014. He is currently a professor with the National University of Defense Technology, China. His research interests include graph data management and mining. He is a member of the IEEE. He has published papers in conferences and journals including VLDB, ICDE, WWW, The VLDB Journal, and TKDE.
[Uncaptioned image] Fan Zhang is a Professor at Guangzhou University. He was a research associate in the School of Computer Science and Engineering, University of New South Wales, from 2017 to 2019. He received the BEng degree from Zhejiang University in 2014, and the PhD from University of Technology Sydney in 2017. His research interests include graph algorithms and social networks. Since 2017, he has published more than 20 papers in top venues, e.g., SIGMOD, PVLDB, ICDE, IJCAI, AAAI, TKDE and VLDB Journal.