GraphVine: A Data Structure to Optimize Dynamic Graph Processing on GPUs
Abstract.
Graph processing on GPUs is gaining momentum due to the high throughput observed compared to traditional CPUs, attributed to the vast number of processing cores on GPUs that can exploit the parallelism in graph analytics. This paper presents a graph data structure for dynamic graph processing on GPUs. Unlike static graphs, dynamic graphs mutate over their lifetime through vertex and/or edge batch updates. The proposed work aims to provide fast batch updates and graph querying without consuming excessive GPU memory. Experimental results show improvements of 1968-1269024% in initialization timings, 30-30047% in batch edge insert timings, and 50-25262% in batch edge delete timings, while consuming less memory when the batch size is large.
1. Introduction
Graph processing is vital to several problems, from forming the basic structure of social, communication and supply-chain networks to solving optimization problems (by mapping them to graph problems). Graph problems are highly parallelizable, since multiple threads can work on different graph entities (edges, vertices, sub-graphs, etc.) at once, making them ideal for GPU processing. A graph data structure stores the input graph to be processed and forms the backbone of graph processing. The insert, delete and query (search) performance of the underlying data structure determines the performance of the required graph processing.
The two general classes of graph processing are static and dynamic graph processing. Static graph processing involves working with input graphs with a fixed initial set of vertices and edges, making management of the graph data straightforward. Such graphs are generally stored using traditional representations like adjacency matrices and adjacency lists. However, fitting the graphs within the GPU memory using these methods becomes impossible when the graphs grow too large. Hence, more compact representation schemes such as CSR (Compressed Sparse Row) (Blaß and Philippsen, 2019), CRS (Compressed Row Storage), and CCS (Compressed Column Storage) are used. These representations reduce the storage requirement to a fraction of what adjacency matrices use and provide much better query timings than adjacency lists, making them a good fit for static graph processing. For dynamic graphs, however, these representations prove costly, since the data structure needs a rebuild for every single update to the graph. It is therefore essential to develop graph data structures for dynamic graph processing that provide proper memory management along with efficient inserts, deletes, and searches.
Owing to the complexity associated with GPU processing, there are only a few works in the literature on dynamic graph processing on GPUs. Current state-of-the-art works include GPMA and GPMA+ (Sha et al., 2017), diff-CSR (Malhotra et al., 2017), cuSTINGER (Green and Bader, 2016), Hornet (Busato et al., 2018), aimGraph (Winter et al., 2017), faimGraph (Winter et al., 2018), and SlabHash (Awad et al., 2020). Among these, faimGraph and SlabHash are the only ones supporting vertex addition and deletion; the remaining works consider a scenario wherein the vertex data remains static and only the edge data mutates. Depending on the input graphs, the performance of each of these frameworks varies drastically, with each performing well only on a subset of graphs while performing poorly on the others. The prior work section below discusses each of these techniques in more detail.
Our proposed graph data structure provides fast vertex and edge inserts, deletes, and querying on dynamic graphs. The adjacency list takes the form of edge blocks, each holding multiple edges. A sufficient number of edge blocks gets allocated to each source vertex on demand. After delete operations, the data structure performs memory reclamation, and the reclaimed memory is reused in future batch updates as needed. Our contributions in this work are:
• We propose a graph data structure that performs fast vertex and edge inserts, deletes, and querying.
• We propose a novel pre-allocated edge queue, where edge blocks are allocated in GPU memory beforehand, so that future requests from adjacency lists for additional edge blocks are served immediately.
2. Prior Work
There have only been a handful of dynamic graph data structures. A significant challenge encountered by each was to come up with an adjacency list structure that accommodates future batch updates.
David Bader et al. (Green and Bader, 2016) proposed cuSTINGER, which supports dynamic edge inserts and deletes while considering the number of vertices as static. Since the number of vertices is static, it uses a Structure of Arrays (SoA) representation to store the edge data associated with each source vertex: each attribute of the edge data, like weight and last modified time, is stored in its own array. An Array of Structures (AoS) representation could cause considerable overhead on CUDA-capable GPUs in terms of memory requests, whereas the structure-of-arrays representation offers better locality when multiple threads perform updates on the same source vertex. cuSTINGER simplifies the insert and delete procedures by separating them, unlike its predecessors that performed them concurrently. Edge inserts in cuSTINGER involve an initial duplicate check within the batch as well as against the existing graph, followed by insertion into the adjacency list associated with the source vertex (the structure-of-arrays representation mentioned above represents the adjacency list of a particular vertex). If the adjacency list fills up during an edge update, the entire list gets copied to a larger location, followed by inserts of the new edges. Edge deletes involve locating the edge, marking it as deleted, and replacing it with the edge at the end of the list. Vertex inserts and deletes are treated as a series of edge inserts and deletes.
The same team proposed a follow-up to cuSTINGER, Hornet (Busato et al., 2018), which performs fast inserts and deletes for large update batches, the main weakness of cuSTINGER. Like its predecessor, Hornet only supports edge updates and considers the number of vertices static. Each vertex in Hornet is associated with two attributes: the number of neighbours of the vertex and a pointer to its adjacency list. The edge blocks responsible for storing the edge details in Hornet have sizes that are powers of 2. Initialization involves locating empty blocks for each vertex’s adjacency list based on its degree. Adjacency lists of all the nodes are initially kept in host memory; later, all the initialized data is copied to GPU memory in bulk, maximizing the PCI-Express bandwidth. During inserts, if an edge block is full, the vertex is allocated a new edge block with double the size of the previous block. This is done by querying empty blocks from an array of B+ trees (the entry at position index is associated with edge blocks of size 2^index), where each entry in the array has pointers to all B+ trees managing blocks of that particular size. All previous entries get copied to the new edge block, the empty location within the edge block is then located using a vectorized bit-tree, and the new edge is inserted. Edge deletes follow a similar approach, the difference being that if the size-reduced adjacency could fit in a smaller edge block, the new block is fetched from the B+ tree array and the contents of the current block are copied to the new, smaller block. Duplicate removal within the update batch and against the graph is also performed, as in cuSTINGER. Vertex inserts and deletes are performed as a series of edge inserts and deletes.
Martin Winter et al. (Winter et al., 2017) proposed aimGraph, which supports edge inserts and deletes while keeping the number of vertices constant. During initialization, aimGraph allocates the entire GPU memory in one memory allocation call, and all subsequent allocation calls made by the internal memory manager of aimGraph allocate space for edge blocks from this initially allocated GPU memory. This single memory allocation saves significant time, as round-trips to the CPU for cudaMalloc calls are no longer necessary and calls to the internal memory manager suffice for additional edge block storage. Since aimGraph does not support vertex inserts or deletes, vertex data is considered static: the memory manager allocates the static data during initialization, and preprocessing of the input graph determines the memory allocated for the adjacency list of each vertex. Edge inserts in future batch updates query the memory manager for additional space if required. During subsequent deletes, the memory manager does not reclaim empty blocks, assuming they will be reused during future insert operations to the same source vertex.
The same team later proposed faimGraph (Winter et al., 2018), which supports edge and vertex inserts and deletes. Like aimGraph, faimGraph allocates the entire GPU memory using one single cudaMalloc call, and the internal memory manager handles further memory allocations. Vertex data in faimGraph grows from top to bottom, while edge data grows from bottom to top, for efficient utilization of the available memory space. faimGraph also introduces two new data structures, the vertex queue and the page queue: upon deallocating vertex and edge blocks, the pointers to these blocks get added to their respective queues. On subsequent allocation requests, the respective queue gets searched first, and if it is not empty, the pointer is passed to the requesting thread, saving a significant amount of time. Since the vertex data is dynamic, a Structure of Arrays (SoA) representation could prove troublesome: as the number of vertices increases or decreases, changes to the arrays within the structure cause considerable overhead when the number of vertex inserts and deletes becomes high. An Array of Structures (AoS) representation fits better here, as it does not need such changes. Due to the availability of the vertex and page queues, memory allocation is achieved in O(1) time, provided pointers to blocks are available on the respective queues. Vertex inserts comprise a duplicate check followed by a straightforward insert, while vertex deletes involve deleting all edges associated with the vertex, both incoming and outgoing. Compaction of the adjacency lists is done after vertex deletes. Edge inserts involve inserting edges into the concerned adjacency list while keeping it sorted, in O(d) time, where d is the number of edges in the adjacency list of the source vertex. Edge deletes are straightforward: after deleting the edge, it is replaced with the last edge of the adjacency list, followed by shifting elements to maintain the sorted order. Like aimGraph, the graphs taken for evaluation were from the 10th DIMACS Implementation Challenge. Due to its efficient memory reclamation and usage, faimGraph significantly improved performance over aimGraph and cuSTINGER, while also providing a much better memory management technique than aimGraph.
Other works on dynamic graphs on GPUs include Accelerating Dynamic Graph Analytics on GPUs (Sha et al., 2017) and Fast Dynamic Graph Algorithms (Malhotra et al., 2017). The former proposes GPMA (GPU-based Packed Memory Array) and GPMA+, which follow a self-balancing binary tree structure divided into multiple levels; edge updates are performed through a locking mechanism across the levels of the tree. The latter proposes a modified Compressed Sparse Row (CSR) format called diff-CSR to support vertex and edge inserts and deletes, where subsequent edge inserts during a batch update are reflected only in the diff-CSR array and not in the original CSR array. Hornet and faimGraph significantly outperformed both of these works, as they did cuSTINGER and aimGraph.
Our proposed work is the first to support dynamic vertex and edge updates while consuming far less memory than faimGraph, which helps it handle large graphs and leaves room for additional graph analytic computations on the GPU.
3. Motivation
In Accelerating Dynamic Graph Analytics on GPUs, not much is said about the memory management unit of GPMA and GPMA+. It also tends to use a large amount of memory, and a significant effort is spent redistributing the entries evenly across the self-balanced binary tree. These problems get amplified when the update batch is enormous. In Fast Dynamic Graph Algorithms, the proposed diff-CSR mechanism could cause large insert and delete overheads with a large update batch, due to the need to rearrange and concatenate the diff-CSR across multiple updates. cuSTINGER, on the other hand, performs individual cudaMalloc calls from the CPU to allocate additional edge blocks to adjacency lists, causing significant overheads, especially if the update batch is large. It also incurs huge penalties when relocating adjacency lists that become full, while over-allocating space, causing poor memory efficiency. Hornet takes care of many of these issues, but due to its high initialization overheads, it performs relatively poorly compared to cuSTINGER on small update batches. Both aimGraph and faimGraph allocate the entire GPU memory during initialization, meaning that performing any advanced analytic computation on the GPU alongside them is impossible. They are also slower than Hornet for large batches, since an adjacency list in Hornet is stored in one single edge block, while in aimGraph and faimGraph it is stored across multiple edge blocks. faimGraph, on the other hand, is much better in terms of memory efficiency than aimGraph and Hornet, as aimGraph does not return an empty edge block (in the hope that the source vertex associated with it reuses it again), while Hornet tends to over-allocate space. Through this work, we aim to devise a solution that addresses all these issues. In the next section, we present the details of the proposed GraphVine data structure along with the details of its initialization, edge/vertex updates (inserts and deletes) and search operations.
4. GraphVine structure
The proposed work performs fast insert, delete, and query operations regardless of batch size, all while consuming less memory. Figure 1 gives an overview of the proposed graph data structure. The vertex dictionary holds the vertex data and is a single contiguous array. The adjacency list associated with each vertex comprises an edge sentinel node that holds metadata regarding the adjacency list. Edge blocks hold the edge data, and the edge sentinel node has a pointer to the edge blocks. The following sections take an in-depth look at each component.

4.1. Vertex dictionary and adjacency list structure
As mentioned above, the vertex dictionary is a single contiguous array holding vertex data. Each vertex entry in the vertex dictionary holds the vertex id and a pointer to the adjacency list associated with that vertex. Adding more attributes to each vertex entry would hurt spatial locality on the GPU and, in turn, performance. The size of the vertex dictionary at any stage is the smallest power of 2 greater than or equal to the number of vertices. For example, for a graph with 453 vertices, the vertex dictionary size is 512.
Each vertex entry in the vertex dictionary holds a pointer to the edge sentinel node of that vertex, which forms the basic adjacency list structure. The edge sentinel node holds the vertex’s edge count, the edge block address and the offset within that edge block where the latest insertion happened. Multiple edge blocks hold the edge data in the form of a Complete Binary Tree (CBT), as in Figure 2, and the edge sentinel node holds the address of the root of the CBT. Each edge block follows an Array of Structures (AoS) format, with each structure denoting an edge entry. The edge entry may have multiple attributes depending on the graph’s requirements. The number of edge blocks an adjacency list requires depends on the degree of its vertex.
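To make this layout concrete, the sketch below shows one possible set of CUDA/C++ structure definitions for a vertex dictionary entry, an edge sentinel node, and an edge block; the field names, types, and the fixed EDGES_PER_BLOCK constant are illustrative assumptions, not the exact definitions used in GraphVine.

```cuda
#include <cstdint>

// Assumed edge block capacity; in GraphVine it is derived from the average
// degree of the non-zero-degree vertices in the first batch (Section 4.3).
#define EDGES_PER_BLOCK 16

struct EdgeEntry {
    uint64_t destination;              // destination vertex id
    // further attributes (weight, timestamp, ...) may be added as required
};

struct EdgeBlock {                     // one node of the Complete Binary Tree
    EdgeEntry edges[EDGES_PER_BLOCK];  // Array of Structures layout
    EdgeBlock *left;                   // left child in the CBT
    EdgeBlock *right;                  // right child in the CBT
};

struct EdgeSentinel {                  // adjacency list metadata
    uint64_t   edge_count;             // number of edges of this vertex
    EdgeBlock *cbt_root;               // root of the CBT of edge blocks
    EdgeBlock *last_insert_block;      // block that received the latest insertion
    uint32_t   last_insert_offset;     // offset of that insertion within the block
};

struct VertexEntry {                   // one slot of the vertex dictionary
    uint64_t      vertex_id;
    EdgeSentinel *sentinel;            // pointer to this vertex's adjacency list
};
```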

Since the adjacency list follows a CBT structure, the tree height for a given number of edge blocks is the lowest possible. The tree is not a Binary Search Tree (BST), eliminating the need for height-balanced BSTs like AVL trees. An AVL tree would be beneficial for search operations with a single thread, but since we have the luxury of launching multiple threads per adjacency list, an unsorted binary tree works just fine. During an edge update, if all edge blocks in the current adjacency list are full, additional edge blocks are allocated from the edge queue per the new requirement of the source vertex. The next section discusses the edge queue in detail.
4.2. Edge Queue
Since dynamic graphs mutate, the number of edge blocks required by an adjacency list may change over time. Using cudaMalloc calls to allocate edge blocks to an adjacency list on demand is very costly. Instead, we pre-allocate edge blocks and push their addresses onto an edge queue. Adjacency lists requiring additional edge blocks pop edge block addresses from the queue, cutting out the CPU round-trips required by a cudaMalloc call. In the event of edge deletes, if edge blocks of an adjacency list become empty, their addresses are pushed back to the edge queue for future reuse.
Upon launch of the graph data structure, at least half of the GPU memory gets reserved through a single cudaMalloc call for use as the edge queue. Once the graph data structure exhausts 80% of the edge queue, another cudaMalloc call pushes 25% more edge block addresses to the queue if there is sufficient memory; otherwise, as many blocks as possible are pushed.
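The host-side sketch below illustrates this reservation policy under our own assumptions: a single device pool is carved into fixed-size edge blocks whose addresses are tracked in a hypothetical EdgeQueueHost structure (the actual queue lives on the device), with growth triggered at the 80% threshold described above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host-side bookkeeping for the pre-allocated edge queue.
// block_bytes must be set by the caller before reserving.
struct EdgeQueueHost {
    std::vector<void*> block_addrs;  // device addresses of pre-allocated edge blocks
    size_t block_bytes = 0;          // size of one edge block in bytes
    size_t capacity = 0;             // number of blocks currently in the queue
    size_t used = 0;                 // number of blocks popped so far
};

// Reserve roughly half of the free GPU memory in one cudaMalloc call.
void reserve_initial_blocks(EdgeQueueHost &q) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    q.capacity = (free_bytes / 2) / q.block_bytes;
    void *pool = nullptr;
    cudaMalloc(&pool, q.capacity * q.block_bytes);       // single allocation
    for (size_t i = 0; i < q.capacity; ++i)              // carve the pool into blocks
        q.block_addrs.push_back(static_cast<char*>(pool) + i * q.block_bytes);
}

// Once 80% of the queue is consumed, try to push 25% more blocks,
// or as many as still fit in the remaining GPU memory.
void maybe_grow(EdgeQueueHost &q) {
    if (q.used < (q.capacity * 8) / 10) return;          // below the 80% threshold
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    size_t want = q.capacity / 4;                        // 25% of current capacity
    size_t fit  = free_bytes / q.block_bytes;
    size_t add  = want < fit ? want : fit;
    if (add == 0) return;
    void *pool = nullptr;
    if (cudaMalloc(&pool, add * q.block_bytes) != cudaSuccess) return;
    for (size_t i = 0; i < add; ++i)
        q.block_addrs.push_back(static_cast<char*>(pool) + i * q.block_bytes);
    q.capacity += add;
}
```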
4.3. Initialization and batch updates
The initialization phase determines several parameters of the graph data structure. The edge block size, i.e., the number of edge entries within an edge block, is set to the average degree of the non-zero-degree vertices in the first batch update. Experimentation revealed that including zero-degree vertices in this calculation has a detrimental effect on determining the correct edge block size and on performance. As explained in the edge queue section above, half the GPU memory is reserved for the edge blocks using a single cudaMalloc call, keeping future batch updates in mind; the experimental results show this reservation to be more than sufficient. Though the edge block size remains constant, pushing additional edge blocks to the edge queue is repeated for each batch update according to the requirements of the adjacency lists.
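As an illustration, a hypothetical host-side helper could derive the edge block size from the CSR offsets of the first batch as follows; rounding the average up is our own assumption.

```cuda
#include <cstdint>
#include <cstddef>

// Edge block size = average degree of the non-zero-degree vertices in the
// first batch, computed from the batch's CSR offsets (offsets has |V|+1 entries).
size_t edge_block_size_from_csr(const uint64_t *offsets, size_t num_vertices) {
    uint64_t total_edges = 0, non_zero = 0;
    for (size_t v = 0; v < num_vertices; ++v) {
        uint64_t degree = offsets[v + 1] - offsets[v];
        if (degree > 0) { total_edges += degree; ++non_zero; }
    }
    if (non_zero == 0) return 1;                         // degenerate batch
    return (total_edges + non_zero - 1) / non_zero;      // ceiling of the average
}
```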
In dynamic graph processing, the graph starts with an initial set of vertices and edges, followed by batch updates that insert or delete vertices and edges from the graph. The proposed graph data structure assumes the batch updates are in a Compressed Sparse Row (CSR) format. Otherwise, the proposed work converts the input graph format into CSR before commencing inserts or deletes.
4.4. Dynamic vertex updates
As mentioned before, the proposed graph data structure supports dynamic vertex updates. The size of the vertex dictionary is always a power of two. The initialization of the vertex dictionary on the GPU proceeds as follows. First, one cudaMalloc call allocates space in GPU memory for the vertex dictionary, with a capacity equal to the smallest power of two greater than or equal to the number of vertices in the first update batch. Second, another cudaMalloc call allocates space for the edge sentinel nodes. Third, a GPU kernel gets launched with the number of threads equal to the number of vertices in the first batch, which assigns one edge sentinel node to each source vertex, completing the basic adjacency list structure. Vertices added to the vertex dictionary in later batch updates follow the same initialization procedure, with a new edge sentinel node allocated for each.
In future batch updates, once the allocated size becomes insufficient to store new vertices, the existing values get copied to a new vertex dictionary twice the previous size, which is again a power of two. One cudaMalloc call allocates the new location for the vertex dictionary, and one GPU kernel call copies the contents of the previous vertex dictionary to the new location; the memory of the previous vertex dictionary is then deallocated. If the batch update performs vertex deletes, the graph data structure does not reclaim memory, since the adjacency lists account for the majority of the space taken by the data structure.
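A minimal sketch of this growth step is shown below, assuming the VertexEntry record sketched earlier; the kernel launch configuration and helper names are our own.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

struct EdgeSentinel;                                     // adjacency metadata (opaque here)
struct VertexEntry { uint64_t vertex_id; EdgeSentinel *sentinel; };

// One thread copies one vertex entry from the old dictionary to the new one.
__global__ void copy_vertex_dictionary(const VertexEntry *src, VertexEntry *dst, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Double the capacity (still a power of two), copy, and free the old dictionary.
void grow_vertex_dictionary(VertexEntry **dict, size_t *capacity, size_t used) {
    size_t new_capacity = *capacity * 2;
    VertexEntry *new_dict = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&new_dict), new_capacity * sizeof(VertexEntry));
    size_t threads = 256, blocks = (used + threads - 1) / threads;
    copy_vertex_dictionary<<<blocks, threads>>>(*dict, new_dict, used);
    cudaDeviceSynchronize();
    cudaFree(*dict);                                     // deallocate the old dictionary
    *dict = new_dict;
    *capacity = new_capacity;
}
```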
4.5. Dynamic edge updates
Edge updates are much more common on dynamic graphs than vertex updates. The proposed dynamic graph data structure assumes that the batched edge updates are in CSR format; if not, they get converted into the corresponding CSR format before batch processing. The update batch then gets copied to the GPU, and multiple threads are launched depending on the operation, namely inserts or deletes. The edge insert procedure is discussed below.
4.5.1. Dynamic edge inserts
Once the update batch in CSR format is ready, the edge inserts commence. For the first batch, the number of edge blocks a source vertex requires is calculated depending on its degree. This data gets pushed into a vector whose size equals the number of vertices in the graph. Then a prefix sum is computed over this vector, followed by the calculation of a space-remaining vector on the CPU side, which keeps track of the number of empty edge entries in the last edge block that received insertions for each source vertex. The proposed batched edge insertion algorithm is vertex-centric; hence the GPU kernel responsible for the edge inserts is launched with the number of threads equal to the number of vertices in the graph, meaning that each source vertex gets a thread allocated for the insert operation.
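A sketch of this pre-processing step using Thrust is given below; the function and vector names are our own, and we assume an inclusive prefix sum so that thread i later reads the half-open range [prefix[i-1], prefix[i]) of edge queue slots (thread 0 starts at 0).

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <cstdint>

// blocks_needed[v] = ceil(degree(v) / edge_block_size), derived from CSR offsets.
struct BlocksNeeded {
    uint64_t block_size;
    __host__ __device__ uint64_t operator()(uint64_t end, uint64_t start) const {
        uint64_t degree = end - start;
        return (degree + block_size - 1) / block_size;
    }
};

void preprocess_batch(const thrust::device_vector<uint64_t> &csr_offsets,   // |V|+1 entries
                      uint64_t edge_block_size,
                      thrust::device_vector<uint64_t> &blocks_needed,       // |V| entries (out)
                      thrust::device_vector<uint64_t> &blocks_prefix)       // |V| entries (out)
{
    size_t num_vertices = csr_offsets.size() - 1;
    blocks_needed.resize(num_vertices);
    blocks_prefix.resize(num_vertices);

    // Per-vertex edge block requirement.
    thrust::transform(csr_offsets.begin() + 1, csr_offsets.end(),
                      csr_offsets.begin(), blocks_needed.begin(),
                      BlocksNeeded{edge_block_size});

    // Inclusive prefix sum used for contention-free popping from the edge queue.
    thrust::inclusive_scan(blocks_needed.begin(), blocks_needed.end(), blocks_prefix.begin());
}
```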


If the degree of a source vertex in the update batch is zero, the associated thread does nothing. For a source vertex with a non-zero degree, the thread pops edge blocks from the edge queue depending on its requirement. All the launched threads can pop edge blocks from this queue simultaneously thanks to the prefix sum vector of the edge block counts of each source vertex, as in Figure 3. CUDA gives each launched thread a distinct identifier, so each thread only needs to check the values in the prefix sum array at its corresponding index to determine which indices in the edge queue it must pop. For example, a thread with ID #1 checks the prefix sum array of edge blocks at indices 0 and 1 to get the number of edge blocks required and the indices in the edge queue that need popping, while a thread with ID #0 checks the prefix sum vector at index 0 only, since it is the first entry in the prefix sum vector. This method allows popping edge blocks in parallel without resorting to atomic operations to avoid race conditions. Our initial implementation used CUDA’s built-in atomicCAS (Compare And Swap) operation on the edge queue, and the performance penalty was very high compared to the prefix sum generation and parallel popping mechanism.
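The kernel sketch below shows how each per-vertex thread can derive its edge queue slots from the inclusive prefix sum without atomics; the parameter names, the queue layout, and the fact that only the first popped block is recorded here are our own simplifications.

```cuda
#include <cstdint>

struct EdgeBlock;                                        // defined elsewhere

__global__ void pop_edge_blocks(const uint64_t *blocks_prefix,  // inclusive prefix sum, |V| entries
                                EdgeBlock *const *edge_queue,   // pre-allocated block addresses
                                uint64_t queue_front,           // current front index of the queue
                                EdgeBlock **first_block_of,     // out: first popped block per vertex
                                uint64_t num_vertices)
{
    uint64_t v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;

    // Thread 0 starts at queue slot 0; thread v starts where thread v-1 ended.
    uint64_t start = (v == 0) ? 0 : blocks_prefix[v - 1];
    uint64_t count = blocks_prefix[v] - start;
    if (count == 0) { first_block_of[v] = nullptr; return; }   // zero-degree vertex

    // Slots [queue_front + start, queue_front + start + count) are reserved for
    // this vertex; the insertion routine links all of them into the vertex's CBT.
    first_block_of[v] = edge_queue[queue_front + start];
}
```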
Once each thread acquires the edge block addresses it requires, it generates a Complete Binary Tree (CBT) structure from the edge blocks. For the first batch, the CBT generation is straightforward and similar to any generic one. The thread then updates the edge sentinel node with the address of the root of the CBT. The thread then calculates the start and end index of the edges of the source vertex it is associated with, using the CSR offset and edge fields. It then iterates through each edge block in the CBT and inserts the edges in sequence, starting from the edge block that forms the root of the CBT. Once an edge block gets full, the edges get inserted into the next, empty edge block in the CBT. Since we calculated the edge blocks required by an adjacency list in the batch pre-processing stage, we do not run out of edge blocks and have exactly the number required for the insertions. After inserting the final edge of the adjacency list, we update the last insert edge offset and last insert edge block entries in the edge sentinel node. Finally, a GPU kernel gets launched that updates the front pointer of the edge queue.
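For the first batch, a generic level-order linking of the popped blocks produces the CBT; the sketch below uses the standard array-to-heap mapping (children of block i are blocks 2i+1 and 2i+2), which we assume is what the straightforward construction above amounts to.

```cuda
#include <cstdint>

#define EDGES_PER_BLOCK 16
struct EdgeEntry { uint64_t destination; };
struct EdgeBlock { EdgeEntry edges[EDGES_PER_BLOCK]; EdgeBlock *left, *right; };

// Link an array of popped edge blocks into a Complete Binary Tree and return
// its root, which is then stored in the edge sentinel node.
__host__ __device__ EdgeBlock *link_cbt(EdgeBlock **blocks, uint64_t count) {
    for (uint64_t i = 0; i < count; ++i) {
        blocks[i]->left  = (2 * i + 1 < count) ? blocks[2 * i + 1] : nullptr;
        blocks[i]->right = (2 * i + 2 < count) ? blocks[2 * i + 2] : nullptr;
    }
    return count ? blocks[0] : nullptr;
}
```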
The edge insert procedure is quite different for subsequent batch updates to the adjacency list of a source vertex. The number of edge blocks required by a source vertex is calculated considering the space remaining in the edge blocks of the adjacency list from the previous batch update. The number of threads launched by the GPU kernel for edge inserts remains the same, equal to the number of vertices in the graph. The parallel popping of edge blocks happens as usual, followed by adding the new edge blocks to the CBT structure. Unlike the first batch’s CBT generation, inserting the new edge blocks into the CBT is not straightforward. Hence, for each new edge block, we generate a bit string that decides its location in the existing CBT. For example, in Figure 4, for the newly generated edge block that becomes the ninth edge block in the CBT, the bit string generated would be 001, meaning that we traverse left twice from the root (0 and 1 mean traverse left and right, respectively) and finally insert the new edge block as the right child of the edge block reached by going left twice from the root. The process repeats for each new edge block of that adjacency list. Then, using the last insert edge offset and last insert edge block values stored in the edge sentinel node, edge insertions start from the empty spaces in the last edge block of the previous batch insertion. Once that edge block becomes full, inserts continue on the newly inserted edge blocks until all the edges are inserted, followed by the updates to the edge queue.
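The bit string for a new block can be obtained from its position in the CBT with the standard binary-heap trick: write the 1-based position in binary and drop the leading 1. The snippet below is our own sketch of this computation and reproduces the example from Figure 4.

```cuda
#include <cstdint>
#include <cstdio>

// Returns the path bits for the edge block at 1-based CBT position `position`:
// reading the bits from most to least significant, 0 means "go left" and 1 means
// "go right"; the final bit selects the child slot where the new block attaches.
__host__ __device__ uint32_t cbt_path_bits(uint32_t position, uint32_t *num_bits) {
    uint32_t bits = 0, n = 0;
    for (uint32_t p = position; p > 1; p >>= 1, ++n)
        bits |= (p & 1u) << n;          // collect bits from the bottom of the tree up
    *num_bits = n;
    return bits;
}

int main() {
    // The ninth edge block (9 = 0b1001) yields the path "001": traverse left
    // twice from the root, then attach as the right child, matching Figure 4.
    uint32_t n = 0, bits = cbt_path_bits(9, &n);
    for (uint32_t i = 0; i < n; ++i)
        printf("%u", (bits >> (n - 1 - i)) & 1u);
    printf("\n");
    return 0;
}
```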
We also explored the possibility of parallelizing edge block insertions into the CBT within an adjacency list. In this case, the number of threads launched by the GPU kernel equals the number of edge blocks required by all the vertices in the current update batch. Each thread is then responsible for appropriately assigning its edge block’s pointers to keep the CBT structure intact, and each thread only needs to fill the edge entries allocated to it. This procedure required much more bookkeeping in the batch pre-processing stage and also delivered worse performance than the approach that uses one thread per adjacency list. The performance loss was due to the additional work each thread needed to do to keep the CBT structure intact.
4.5.2. Dynamic edge deletes
During dynamic edge deletes, the graph data structure assumes the update batch to be in CSR format, as in the edge insert case. The number of threads launched by the GPU delete kernel equals the number of vertices in the graph. Each thread then performs an in-order traversal over the edge blocks of the CBT starting from the root and, during the traversal, checks the edge entries within the edge block currently being visited. If the edge entries match entries in the delete batch, they are marked as deleted. Simply marking edge entries as deleted creates holes in the edge blocks of the CBT as time progresses and could increase the memory footprint; we plan to devise proper compaction and memory reclamation mechanisms to address this issue in the future.
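A per-vertex delete kernel along these lines is sketched below; the structure definitions, the explicit traversal stack, and the DELETED_MARK sentinel used to mark holes are illustrative assumptions rather than GraphVine's exact implementation.

```cuda
#include <cstdint>

#define EDGES_PER_BLOCK 16
#define MAX_CBT_DEPTH 32
#define DELETED_MARK 0xFFFFFFFFFFFFFFFFull

struct EdgeEntry { uint64_t destination; };
struct EdgeBlock { EdgeEntry edges[EDGES_PER_BLOCK]; EdgeBlock *left, *right; };
struct EdgeSentinel { uint64_t edge_count; EdgeBlock *cbt_root; };

// One thread per source vertex: traverse its CBT in order and mark edges that
// appear in the delete batch (given in CSR form) as deleted.
__global__ void batched_edge_delete(EdgeSentinel *const *sentinels,  // one per source vertex
                                    const uint64_t *batch_offsets,   // CSR offsets of the delete batch
                                    const uint64_t *batch_dests,     // CSR destinations of the delete batch
                                    uint64_t num_vertices)
{
    uint64_t v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;

    uint64_t begin = batch_offsets[v], end = batch_offsets[v + 1];
    if (begin == end || sentinels[v] == nullptr) return;

    // Iterative in-order traversal using a small explicit stack.
    EdgeBlock *stack[MAX_CBT_DEPTH];
    int top = 0;
    EdgeBlock *node = sentinels[v]->cbt_root;
    while (node != nullptr || top > 0) {
        while (node != nullptr) { stack[top++] = node; node = node->left; }
        node = stack[--top];
        for (int i = 0; i < EDGES_PER_BLOCK; ++i)               // each entry in the block
            for (uint64_t j = begin; j < end; ++j)              // against this vertex's batch slice
                if (node->edges[i].destination == batch_dests[j])
                    node->edges[i].destination = DELETED_MARK;  // mark as deleted (leaves a hole)
        node = node->right;
    }
}
```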
4.6. Querying operations
The proposed work supports fast search queries for edges stored in the graph data structure. Given a set of source and destination vertex pairs, a search pre-processing GPU kernel gets launched with one thread that performs an in-order traversal of the CBT associated with the source vertex. During the traversal, the addresses of all the edge blocks in the CBT get pushed to a vector. Later, another GPU kernel gets launched with the number of threads equal to the degree of the source vertex, the idea being that each thread searches only one edge entry in one edge block of the CBT. Inside the kernel, each thread, depending on its unique identifier, can pinpoint the edge block and the edge entry within that block it needs to search. If any thread finds the searched edge, true is returned to the host. The search query timings are very low since one thread gets launched for each edge entry.
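The second kernel of the query path could look like the sketch below, with one thread inspecting exactly one edge entry of one edge block; the flattened block vector, the padding of the thread count to whole blocks, and the benign-race flag write are our assumptions.

```cuda
#include <cstdint>

#define EDGES_PER_BLOCK 16
struct EdgeEntry { uint64_t destination; };
struct EdgeBlock { EdgeEntry edges[EDGES_PER_BLOCK]; EdgeBlock *left, *right; };

// flat_blocks holds the CBT's edge block addresses collected by the
// single-thread pre-processing kernel; each thread checks one entry.
__global__ void search_adjacency(EdgeBlock *const *flat_blocks,
                                 uint64_t total_entries,     // number of entries to inspect
                                 uint64_t target,            // destination vertex searched for
                                 int *found)                 // set to 1 if the edge exists
{
    uint64_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_entries) return;
    uint64_t block_index = tid / EDGES_PER_BLOCK;            // which edge block to inspect
    uint64_t entry_index = tid % EDGES_PER_BLOCK;            // which entry inside that block
    if (flat_blocks[block_index]->edges[entry_index].destination == target)
        *found = 1;                                          // benign race: all writers write 1
}
```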
5. Performance Evaluation
In this section, we compare the performance of the proposed method to existing state-of-the-art graph data structures, namely Hornet and faimGraph. We have omitted cuSTINGER, aimGraph, GPMA, and diff-CSR from the comparison, as they are considered obsolete and fall far behind Hornet and faimGraph in batched edge insert and delete timings, as well as in real-world graph processing algorithms like BFS, SSSP, and PageRank.
S. No | Input Graph | |V| | |E| | Max Degree | Avg Degree | Size |
---|---|---|---|---|---|---|
G1 | co-papers-dblp | 540K | 30M | 3K | 56 | 200MB |
G2 | co-papers-citeseer | 434K | 32M | 1K | 73 | 207MB |
G3 | hugetrace-00020 | 16M | 48M | 3 | 2 | 398MB |
G4 | channel-500x100x100-b050 | 5M | 86M | 18 | 17 | 663MB |
G5 | delaunay_n24 | 17M | 100M | 26 | 5 | 839MB |
G6 | inf-europe_osm | 51M | 108M | 13 | 2 | 949MB |
G7 | rgg_n_2_24_s0 | 17M | 266M | 40 | 15 | 2.2GB |
Table 1: Properties of the input graphs used for evaluation.
We compare the batched vertex and edge insert and delete timings for the performance evaluation. Later, we compare the memory consumption of Hornet, faimGraph, and the proposed work. Table 1 lists the properties of the input graphs taken for comparison; all the graphs are part of the 10th DIMACS Implementation Challenge. The experimental setup included an NVIDIA RTX 2060 12GB GPU with an Intel Core i3-12100F CPU and 16GB of DDR4-3200 RAM. The GPU has 2,176 CUDA cores and 34 SMs (Streaming Multiprocessors).
5.1. Initialization
The initialization phase involves the necessary preprocessing of the input graph. The edge block size, which determines the number of edge entries an edge block can hold, is set to the average degree of the non-zero-degree vertices in the first batch; experimental results have shown that this average stays roughly the same over future batches. The vertex dictionary gets initialized with a size equal to the smallest power of two greater than or equal to the number of vertices, and edge sentinel nodes are attached to each source vertex in the vertex dictionary. Then one cudaMalloc call reserves half the GPU memory for edge blocks and pushes their addresses into the edge queue.
Input | Hornet | faimGraph | GraphVine |
---|---|---|---|
G1 | 643.973ms | 16.981ms | 0.575ms |
G2 | 538.074ms | 19.915ms | 0.589ms |
G3 | 15617ms | 43.831ms | 2.227ms |
G4 | 4929ms | 50.772ms | 1.355ms |
G5 | 16059ms | 62.199ms | 2.386ms |
G6 | 51298ms | 119.028ms | 4.042ms |
G7 | 17552ms | 140.11ms | 3.180ms |
Table 2: Comparison of the initialization timings of the different graph data structures. The timings are in milliseconds (ms).
Figure 5 shows the initialization timings for the different graph data structures across the input graphs. Experiments showed that the initialization timings for the graph data structures for different batch update sizes remained the same. The proposed work significantly improved the initialization timings, where it gave 111895%, 91253%, 701157%, 363663%, 672951%, 1269024%, and 551849% improvements over Hornet and 2853%, 3281%, 1968%, 3647%, 2506%, 2844%, and 4306% improvements over faimGraph on G1, G2, G3, G4, G5, G6, and G7, respectively.
We attribute this significant improvement in initialization timings to the very few cudaMalloc calls we perform (3 calls) to reserve GPU memory, namely, for the vertex dictionary, the edge sentinel nodes, and the edge blocks.

5.2. Batched edge insert timings
The batched edge inserts involve inserting an edge batch into the graph data structure. Once the graph data structure gets initialized with the vertex dictionary and the edge sentinel nodes, the edge batch update in CSR format serves as input to the data structure. The prefix sum vector for edge blocks gets calculated for the parallel popping of edge blocks from the edge queue. Then the batch update is copied to the GPU, followed by a GPU kernel call for edge insertion. We conducted experiments on different batch sizes for the graph data structures across the different graphs, followed by a bulk build of the data structure where all the edges in the input graph get inserted in one go.
Figures 6, 7, 8, and 9 show the batched edge insert timings for different batch sizes. For a batch size of 100K edges, the proposed work showed improvements in edge insert timings of 1407%, 1227%, 492%, 809%, and 290% over Hornet for G1, G2, G5, G6, and G7, respectively. However, Hornet performed better on G3 and G4 by 107% and 16%, respectively. Compared to faimGraph at the same batch size of 100K, the proposed work gave improved timings of 41%, 74%, and 234% on G1, G2, and G5, respectively, while faimGraph performed better by 205%, 111%, 585%, and 71% on G3, G4, G6, and G7, respectively.
When the edge insert batch size increased to 1M edges, the proposed work gave improvements over Hornet by 3611%, 3384%, 510%, 57%, 3322%, 6813%, and 1427% for G1, G2, G3, G4, G5, G6, and G7, respectively. Compared to faimGraph, the proposed work gave improvements of 305%, 357%, 30%, 145%, 47%, and 6% for G1, G2, G3, G4, G5, and G7, respectively. faimGraph, however, performed better than the proposed work on G6 by 58%.


For the edge insert batch size of 10M, the proposed work performed better than Hornet by 864%, 709%, 7133%, 61%, 8115%, 30047%, and 2749% on G1, G2, G3, G4, G5, G6, and G7, respectively. faimGraph has a maximum batch size cap of 1M, so we could not run faimGraph with a batch size of 10M.

When it came to bulk builds, where the entire input graph gets inserted into the graph data structure in one go, the proposed work gave improvements of 326%, 235%, 8592%, 1068%, and 5450% on G1, G2, G3, G4, and G5, respectively. The graphs G6 and G7 ran out of memory and crashed on Hornet, while the proposed work had no trouble handling them. faimGraph could not perform bulk builds due to its maximum batch size cap of 1M edges.

Input | Batch Size | Hornet | faimGraph | GraphVine |
---|---|---|---|---|
G1 | 100K | 13.083ms | 1.224ms | 0.868ms |
1M | 90.699ms | 9.909ms | 2.444ms | |
10M | 358.091ms | NA | 37.11ms | |
Bulk | 531.632ms | NA | 124.647ms | |
G2 | 100K | 10.9ms | 1.431ms | 0.821ms |
1M | 78.501ms | 10.312ms | 2.253ms | |
10M | 304.825ms | NA | 37.658ms | |
Bulk | 432.848ms | NA | 129.029ms | |
G3 | 100K | 1.539ms | 1.048ms | 3.197ms |
1M | 35.181ms | 7.504ms | 5.762ms | |
10M | 2129.69ms | NA | 29.444ms | |
Bulk | 11747.7ms | NA | 135.142ms | |
G4 | 100K | 1.013ms | 2.153ms | 1.183ms |
1M | 7.179ms | 11.184ms | 4.563ms | |
10M | 65.983ms | NA | 40.887ms | |
Bulk | 3967.16ms | NA | 339.383ms | |
G5 | 100K | 20.245ms | 1.730ms | 3.415ms |
1M | 198.051ms | 8.531ms | 5.787ms | |
10M | 2011.45ms | NA | 24.485ms | |
Bulk | 15878.7ms | NA | 286.08ms | |
G6 | 100K | 94.263ms | 1.512ms | 10.362ms |
1M | 827.766ms | 7.572ms | 11.973ms | |
10M | 8235.38ms | NA | 27.317ms | |
Bulk | NA | NA | 190.068ms | |
G7 | 100K | 13.02ms | 1.945ms | 3.333ms |
1M | 115.439ms | 8.058ms | 7.559ms | |
10M | 1161.95ms | NA | 40.779ms | |
Bulk | NA | NA | 992.54ms |
Table 3: Comparison of the batched edge insert timings of the different graph data structures across different batch sizes. The timings are in milliseconds (ms).
The experimental results suggest that the performance of the proposed method increases drastically as the batch size increases, attributed to more time spent in edge insertions compared to fixing the CBT structure during an update for large batches. For a fixed batch size, the proposed work gave the best performance when the average and maximum degrees of the graph were either very far apart or very close. If the average and maximum degrees were close, the height of the CBT of adjacencies would be less, meaning less time is spent fixing the CBT structure. If the average and maximum degrees were far apart, the time spent for CBT creation is amortized by the large number of edges inserted into that adjacency. The proposed work gave the worst results when the number of vertices in the input graph was high but still managed to perform better than Hornet and come close to faimGraph as the batch sizes increased.
5.3. Batched edge delete timings
The batched edge deletes involve deleting a group of edges inserted into the graph data structure in a previous batch update. The delete edge batch update in CSR format serves as input to the data structure, similar to edge insertion. Similar to inserts, experiments were conducted on different batch sizes for the graph data structures across the different graphs, followed by a bulk build of the data structure.

Figures 10, 11, 12, and 13 show the batched edge delete timings for the different batch sizes. For a batch size of 100K edges, the proposed work gave improvements over Hornet by 596%, 810%, and 288% on G5, G6, and G7, respectively. Hornet had better timings over the proposed work by 333%, 42%, 188%, and 24% on G1, G2, G3, and G4, respectively. Compared to faimGraph, the proposed work gave better timings by 70% on G4. faimGraph gave better timings over the proposed work by 14770%, 7068%, 388%, 106%, 1132%, and 66% on G1, G2, G3, G5, G6, and G7, respectively.

The edge delete batch size of 1M edges gave the proposed work improvements over Hornet by 377%, 1847%, 6494%, and 1707% on G3, G5, G6, and G7, respectively. The input graph G2 had a tie, while Hornet outperformed the proposed work by 66% and 21% on G1 and G4, respectively. Compared to faimGraph, the proposed work improved timings by 110% and 50% on G4 and G7, respectively. The proposed work lost out to faimGraph by 7652%, 785%, 8%, 511%, and 117% on G1, G2, G3, G5, and G6, respectively.


When the edge delete batch size increased to 10M edges, the proposed work gave improved timings of 7706%, 4%, 4585%, 25262%, and 3257% over Hornet on G3, G4, G5, G6, and G7, respectively. Hornet gave improved timings of 37% and 74% on G1 and G2, respectively. As before, faimGraph could not process batches larger than 1M.
Bulk edge delete batches gave the proposed work improvements over Hornet by 13717%, 576%, and 9537% on G3, G4 and G5, respectively. Hornet had improved timings over the proposed work by 11% and 81% on G1 and G2, respectively. The input graphs G6 and G7 ran out of memory and crashed on Hornet, while the proposed work ran them flawlessly.
Observations from the experiments suggest that the proposed work gave the best results when the average degree of the graph was close to its maximum since, in such cases, the height of an adjacency’s CBT would be the least. The proposed work gave poor performance when there was a stark difference in the average and maximum degrees, attributed to the increased height of the CBT of each adjacency and hence the additional time required to traverse the entire CBT. Another observation was that the performance impact is low if only a few such adjacencies exist, and issues arise when most adjacencies have a significant height.
Input | Batch Size | Hornet | faimGraph | GraphVine |
---|---|---|---|---|
G1 | 100K | 12.046ms | 1.020ms | 52.207ms |
1M | 91.146ms | 8.115ms | 151.679ms | |
10M | 372.862ms | NA | 513.151ms | |
Bulk | 563.705ms | NA | 629.137ms | |
G2 | 100K | 10.832ms | 1.125ms | 15.432ms |
1M | 79.254ms | 9.108ms | 80.649ms | |
10M | 315.28ms | NA | 550.459ms | |
Bulk | 455.584ms | NA | 827.061ms | |
G3 | 100K | 1.504ms | 0.888ms | 4.336ms |
1M | 35.143ms | 6.801ms | 7.367ms | |
10M | 2159.95ms | NA | 27.670ms | |
Bulk | 11706.7ms | NA | 84.723ms | |
G4 | 100K | 1.074ms | 2.276ms | 1.342ms |
1M | 6.214ms | 15.871ms | 7.559ms | |
10M | 75.283ms | NA | 72.130ms | |
Bulk | 4011.89ms | NA | 593.154ms | |
G5 | 100K | 20.25ms | 1.972ms | 4.077ms |
1M | 201.747ms | 7.114ms | 10.359ms | |
10M | 2039.19ms | NA | 43.523ms | |
Bulk | 16104.4ms | NA | 167.104ms | |
G6 | 100K | 98.674ms | 0.879ms | 10.837ms |
1M | 831.202ms | 5.791ms | 12.605ms | |
10M | 8442.63ms | NA | 33.288ms | |
Bulk | NA | NA | 161.929ms | |
G7 | 100K | 14.144ms | 2.194ms | 3.641ms |
1M | 118.606ms | 9.885ms | 6.563ms | |
10M | 1177.21ms | NA | 35.059ms | |
Bulk | NA | NA | 738.115ms |
Table 4: Comparison of the batched edge delete timings of the different graph data structures across different batch sizes. The timings are in milliseconds (ms).
5.4. Memory consumption
An essential aspect of a graph data structure is its memory consumption, since it determines the size of the input graphs that the data structure can handle. If the graph data structure consumes too much memory, the program will crash on large graphs, limiting its capability. Here we compare the memory footprint of the graph data structures across the different graphs. Using the built-in cudaMemGetInfo function, we measure the GPU memory usage before the graph data structure launch and after edge insertions for different batch sizes, followed by a bulk build.

Figures 14, 15, 16 and 17 show the memory consumption for the different batch sizes. For a batch size of 100K, the proposed work consumed 12% and 13% less memory than Hornet on G1 and G2. However, the proposed work consumed more memory than Hornet by 326%, 159%, 239%, 323%, and 158% on G3, G4, G5, G6 and G7, respectively. Compared to faimGraph, the proposed work consumed 20%, 17%, 51%, 70%, 60%, 24%, and 43% more memory on G1, G2, G3, G4, G5, G6 and G7, respectively.

For a batch size of 1M, the proposed work consumed 27% less memory than Hornet for both G1 and G2, but consumed 265%, 130%, 226%, 313%, and 147% more memory on G3, G4, G5, G6 and G7, respectively. Compared to faimGraph, the proposed work consumed 8%, 8%, 40%, 73%, 59%, and 42% more memory on G1, G2, G3, G4, G5, and G7, respectively. Nevertheless, the proposed work consumed 24% less memory than faimGraph on G6.


For a batch size of 10M, Hornet consumed 128% and 126% more memory than the proposed work on G1 and G2, respectively. However, the proposed work consumed 86%, 29%, 84%, 175%, and 77% more memory than Hornet on G3, G4, G5, G6, and G7, respectively. We have no data for faimGraph at this batch size since it does not support batch updates larger than 1M.
For bulk builds, the proposed work came out on top, with Hornet consuming 250%, 247%, 42%, 168%, and 109% more memory on G1, G2, G3, G4, and G5, respectively. The input graphs G6 and G7 ran out of memory and crashed on Hornet, while the proposed work ran without any issues and consumed memory close to that of the 10M batch size.
Input | Batch Size | Hornet | faimGraph | GraphVine |
---|---|---|---|---|
G1 | 100K | 497MB | 378MB | 445MB |
1M | 583MB | 422MB | 456MB | |
10M | 1347MB | NA | 592MB | |
Bulk | 3124MB | NA | 890MB | |
G2 | 100K | 502MB | 379MB | 446MB |
1M | 578MB | 420MB | 455MB | |
10M | 1330MB | NA | 588MB | |
Bulk | 3197MB | NA | 921MB | |
G3 | 100K | 696MB | 1955MB | 2966MB |
1M | 766MB | 1985MB | 2797MB | |
10M | 1590MB | NA | 2961MB | |
Bulk | 5059MB | NA | 3550MB | |
G4 | 100K | 754MB | 1150MB | 1958MB |
1M | 832MB | 1184MB | 1921MB | |
10M | 1590MB | NA | 2055MB | |
Bulk | 8590MB | NA | 3201MB | |
G5 | 100K | 976MB | 2065MB | 3315MB |
1M | 1013MB | 2072MB | 3302MB | |
10M | 1882MB | NA | 3465MB | |
Bulk | 10102MB | NA | 4831MB | |
G6 | 100K | 1712MB | 5821MB | 7245MB |
1M | 1751MB | 5834MB | 7247MB | |
10M | 2679MB | NA | 7385MB | |
Bulk | NA | NA | 8908MB | |
G7 | 100K | 1979MB | 3578MB | 5120MB |
1M | 2064MB | 3592MB | 5115MB | |
10M | 2986MB | NA | 5297MB | |
Bulk | NA | NA | 9142MB |
Table 5: Comparison of the memory consumption of the different graph data structures across different batch sizes. Memory usage is denoted in megabytes (MB).
An observation made was that the memory consumption of both the proposed work and faimGraph increased drastically when the number of vertices in the graph got large. For the proposed work, more vertices mean more edge sentinel nodes, and hence higher memory usage. The memory usage of all the data structures increased with the batch size. However, Hornet had a far steeper increase in memory usage for large batches than for small ones, despite consuming significantly less memory than the proposed work and faimGraph on small batches for graphs with a large number of vertices. The increase in memory consumption for Hornet was so significant that it ran out of memory for bulk builds on G6 and G7. The proposed work consumed more memory than the competition for small batches, showed a linear increase in memory consumption as the batch size got bigger, and had no issues running G6 and G7 while still having 3GB of GPU memory available.
6. Conclusion & Future Scope
The proposed graph data structure delivers fast insert, delete, and query performance on dynamic graphs and significantly outperforms the existing state-of-the-art graph data structures on large batch sizes. We recorded over 300000% and 2500% improvements in initialization timings over Hornet and faimGraph, respectively, across all graphs. For batched edge insertions, we observed more than 1000% improvement over Hornet and greater than 150% improvement over faimGraph in multiple instances. However, for small batch sizes, the proposed work gave worse timings than the competition on some graphs. The batched edge deletes followed a similar pattern, with over 5000% improvements over Hornet for large batch sizes, while faimGraph outperformed both Hornet and the proposed work for small batch sizes. The proposed work also consumed far less memory than the competition as the batch sizes got larger.
There is still scope for improving the batched edge insertion times by splitting the GPU kernel responsible for insertions into two, where the first kernel fixes the CBT structure while the second kernel performs the edge insertions, allowing us to launch more threads with the second kernel and potentially reduce insert timings. The memory footprint can also be reduced by dynamically changing the edge block size, which could reduce internal fragmentation within edge blocks; this is in line with the fact that, over future batch updates, attributes of the graph like the average degree of vertices could change. We also plan to optimize the batched edge deletes further and integrate the proposed work with real-world graph algorithms like Breadth-First Search, Single-Source Shortest Path, and Dijkstra's shortest path.
Acknowledgements.
This material is based upon work supported by the National Science Foundation under Grant No. #nnnnnnn and Grant No. #mmmmmmm. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References
- Awad et al. (2020) Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, and John D. Owens. 2020. Dynamic Graphs on the GPU. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 739–748. https://doi.org/10.1109/IPDPS47924.2020.00081
- Blaß and Philippsen (2019) Thorsten Blaß and Michael Philippsen. 2019. Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-Capable GPU. In Proceedings of the 12th Workshop on General Purpose Processing Using GPUs (Providence, RI, USA) (GPGPU ’19). Association for Computing Machinery, New York, NY, USA, 22–31. https://doi.org/10.1145/3300053.3319416
- Busato et al. (2018) Federico Busato, Oded Green, Nicola Bombieri, and David A. Bader. 2018. Hornet: An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs. In 2018 IEEE High Performance extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC.2018.8547541
- Green and Bader (2016) Oded Green and David A. Bader. 2016. cuSTINGER: Supporting dynamic graph algorithms for GPUs. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC.2016.7761622
- Malhotra et al. (2017) Gaurav Malhotra, Hitish Chappidi, and Rupesh Nasre. 2017. Fast Dynamic Graph Algorithms. In International Workshop on Languages and Compilers for Parallel Computing.
- Sha et al. (2017) Mo Sha, Yuchen Li, Bingsheng He, and Kian-Lee Tan. 2017. Accelerating Dynamic Graph Analytics on GPUs. Proc. VLDB Endow. 11, 1 (sep 2017), 107–120. https://doi.org/10.14778/3151113.3151122
- Winter et al. (2018) Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2018. faimGraph: High Performance Management of Fully-Dynamic Graphs Under Tight Memory Constraints on the GPU. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 754–766. https://doi.org/10.1109/SC.2018.00063
- Winter et al. (2017) Martin Winter, Rhaleb Zayer, and Markus Steinberger. 2017. Autonomous, Independent Management of Dynamic Graphs on GPUs. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.