Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design

Shuwen Lu, Zhihui Zhang, Cong Guo, Jingwen Leng Yangjie Zhou, Minyi Guo The authors are with the Department of Computer Science and Engineering of Shanghai Jiao Tong University, Shanghai 201100, China.Email:{lushuwen, zhihui.zhang, guocong, leng-jw, yj_zhou}@sjtu.edu.cn, [email protected] first two authors contributed equally to this work. Zhihui Zhang, Jingwen Leng and Minyi Guo are corresponding authors of this paper.

Abstract

Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization.

In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator. Our evaluation demonstrates that SwitchBlade achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.

Index Terms:

GNN, bandwidth, multi-threading.

I Introduction

Graph neural networks (GNNs) have gained significant momentum as researchers have begun to integrate the concept of graphs into deep learning (DL) [34]. By merging the end-to-end hierarchical learning capabilities of DL with the structural representation power of graphs, GNNs have achieved improved accuracy across various domains, including molecular science [10], recommendation systems [39], and transportation [3]. To translate the algorithmic advancements of GNNs into practical applications, effective and efficient execution is crucial [23]. However, general-purpose processors, such as CPUs and GPUs, struggle with performance and energy inefficiency when executing graph-related operations [36, 42]. As a result, GNN-dedicated accelerators have been proposed.

Numerous efforts have focused on accelerating one of the most popular GNN models, Graph Convolution Networks (GCNs) [20, 14]. GCNs comprise two primary operators: the sparse graph adjacency matrix and the dense vertex feature matrix, which are typically organized into a two-stage computation. Based on this formulation, various designs, including inter- and intra-stage optimizations, have been proposed to achieve high performance [37, 40, 18, 19, 8, 9, 21].

TABLE I: Operations in popular GNN models.

Model	Aggregation ( $a_{i}$ )	Combination ( $h^{l+1}_{i})$
GCN [20]	$\sum_{j\in N(i)}h_{j}^{l}d^{-1/2}_{j}$	ReLU $(d^{-1/2}_{i}W^{l}a_{i})$
GAT [29]	$\sum_{j\in N(i)}\alpha_{ij}W^{l}h_{j}^{l}$	ReLU $(a_{i})$
SAGE-Pool[14]	$\max_{j\in N(i)}(W_{pool}^{l}h_{j}^{l}+b)$	ReLU $(W^{l}(h^{l}_{i}~{}\|\|~{}a_{i}))$
GG-NN [22]	$\sum_{j\in N(i)}(W^{l}h_{j}^{l}+b)$	GRU $(h^{l}_{i},a_{i})$

•

where $\alpha_{ij}$ is the attention coefficient whose calculation is omitted in this table, $||$ is matrix concatenation, MLP is Multi-Layer Perceptron [24], and GRU is Gated Recurrent Unit [5].

GNNs, however, encompass a broad category that varies in both the number and combination of operators [6]. As shown in Tbl. I, several popular GNN models exhibit quite different characteristics in each stage, despite being divisible into two-stage forms. Thus, prior GCN-specific accelerators may not offer the same performance and efficiency when executing other models. Additionally, designing dedicated accelerators for every GNN model is impractical due to the high hardware development costs and the vast GNN model space. As a result, a generic solution is desirable for the diverse range of GNN models. Though uniform architectures have been proposed to address these challenges [2, 15], they suffer from long-distance data movement [1], leading to increased latency and energy consumption.

Another challenge in GNN acceleration is the high bandwidth requirement. Since GNNs combine deep learning with graph processing [34], they inherit the characteristics of large feature maps and poor data locality [26]. Both characteristics necessitate substantial reads and writes to off-chip DRAM. Moreover, current GNN models are primarily executed in an operator-by-operator paradigm, where all operators read and write to DRAM, and modern GNN models typically comprise ten or more operators in one layer [35]. As a result, the current GNN model execution leads to massive off-chip data access [4]. Such high requirements make hardware bandwidth a potential bottleneck in GNN execution. Existing works addressing GNN acceleration satisfy this high bandwidth requirement through emerging yet costly techniques, such as Processing-In-Memory (PIM) [33, 1]. This work aims to provide a more cost-effective solution to overcome this challenge.

Though considerable effort has been expended, none of the prior works address both variety and bandwidth challenges simultaneously. To fill this gap, this work seeks to tackle both challenges within a single system. Our core idea is to optimize GNN bandwidth requirements without making any assumptions about the targeted GNN model structure. Guided by this idea, we propose three generic methods.

The first method is partition-level operator fusion (PLOF). Operator fusion has been proven to effectively mitigate the high bandwidth requirement challenge for conventional neural networks [4]. However, previous efforts have neglected the irregular graph-traversal-based operators (GTRs), which are central to GNNs. In this work, we propose a new graph partition-level operator fusion that fuses operators in arbitrary GNN models into three phases to alleviate high bandwidth requirements. This is based on two observations. First, all operators can be reorganized into a three-phase paradigm by borrowing the programming model of traditional graph processing [11]. Second, graphs are typically partitioned into smaller components for on-chip loading and processing due to their large data size. As a result, we can transfer data only at phase boundaries rather than operator boundaries.

The second method is shard-level multi-threading (SLMT). Although operators can be fused together to reduce memory footprint, they are still executed sequentially. Since each operator utilizes only one part of hardware resources, including bandwidth, resource utilization for other parts can be low during that operator’s execution. To enhance the utilization of different hardware resources simultaneously, we introduce shard-level multi-threading, where different shards of the graph are assigned to different hardware units. This parallelizes the execution of multiple fused phases across different shards, allowing for more efficient use of hardware resources and further reducing the memory bandwidth requirements.

The third method is fine-grained graph partitioning (FGGP) on host. The above two methods deploying multiple threads for concurrent processing improves hardware utilization yet increases the on-chip memory pressure. To mitigate the memory-concurrency contention, we further propose fine-grained graph partitioning running on host device. FGGP generates denser partitions to store more effective data under the same memory budget so as to improve the graph data reuse and reduce the bandwidth requirement.

All three proposed methods enhance hardware performance for GNN acceleration without making any assumptions about the GNN model structure. As a result, they offer a fresh perspective to design more flexible architectures that achieve both excellent applicability and high performance.

Refer to caption — Figure 1: The workflow of SwitchBlade.

We propose a systematic framework, SwitchBlade, which comprises a compiler, a graph partitioner, and a hardware accelerator to implement the above methods, as illustrated in Fig. 1. The compiler is responsible for implementing PLOF. It maps GNN models written in high-level frameworks such as DGL and PyG into our PLOF phases, which are expressed in the instruction set architecture (ISA) of our hardware accelerator. Additionally, the compiler provides the model information to our graph partitioner, which implements FGGP and generates graph partitions. The accelerator ultimately executes the PLOF phases and processes the graph partitions concurrently using SLMT.

We evaluate SwitchBlade in comparison to the NVIDIA V100 GPU and demonstrate that it achieves an average speedup of 1.85 $\times$ and energy savings of 19.03 $\times$ across a diverse range of GNN models. Furthermore, SwitchBlade attains comparable or even superior performance on GCN models when compared to state-of-the-art GCN accelerators like HyGCN [37], but with significantly greater flexibility.

The main contributions of this work are as follows:

•

We propose a set of generic methods that make no assumptions about the underlying GNNs, addressing the bandwidth and variety challenges of generic GNN acceleration. These methods span algorithmic, software, and hardware aspects.
•

We develop SwitchBlade, a comprehensive full-stack framework that implements the proposed three methods. The framework consists of a compiler, a graph partitioner, and a hardware accelerator.
•

We evaluate the performance of SwitchBlade and demonstrate its effectiveness by achieving a 1.85 $\times$ speedup and 19.03 $\times$ energy savings compared to the NVIDIA V100 GPU. Furthermore, SwitchBlade exhibits comparable performance against a prior state-of-the-art GCN accelerator, showcasing its flexibility and adaptability.

II Background

In this section, we provide an overview of Graph Neural Networks (GNNs) and the dual-sliding-window graph partitioning method as the foundation for SwitchBlade.

II-A Graph Neural Networks

Graph Neural Networks (GNNs) combine the power of deep learning with traditional graph processing, leading to improved accuracy in a wide range of domains that depend on graph structures, such as molecular science[10], recommendation systems[39], and transportation [3]. This improvement is achieved by replacing prior hand-crafted or intuition-based methods (e.g., node2vec [12]) with end-to-end learning capabilities.

Similar to conventional deep neural networks (DNNs), GNNs consist of layers. A layer $l$ takes the vertex and edge embedding matrix as input, along with the adjacency matrix representing the graph structure, and produces a new embedding matrix for layer $l+1$ [34]. The primary distinction between GNNs and DNNs lies in the graph-traversal operators, which exhibit irregular computation and memory access patterns[36]. We now introduce the primitive operators and their role in various GNN models.

Primitive Operators.

Each GNN layer typically comprises two types of primitive operators: the graph-traversal operators (GTRs, detailed below) and neural network operators. The latter can be further categorized as either dense matrix multiplication operators (DMMs) [24] and element-wise operators (ELWs) such as ADD, EXP, and ReLU. The combination of these three operator types covers all forms of GNN computation. Most GNN libraries, including popular ones like DGL [31] and PyG [7], support GNN programming by providing efficient implementations for these three primitive operators.

GTRs can be generalized into two types: ScatterOp and GatherOp. They are considered as vectorized graph propagation operators in traditional graph processing models (e.g., GAS [11]). The ScatterOp distributes the embedding of each vertex to its outgoing (or incoming) edges, while the GatherOp collects the embeddings of all incoming (or outgoing) edges of each vertex and reduces them into a fixed-length vector using a reduction function. Examples of reduction functions include $max$ , $sum$ , and $mean$ , which align vertices’ embeddings for subsequent operators, as different vertices may have varying edge numbers.

II-B Dual-Sliding-Window Graph Partitioning

Traditional graph processing [32] often has a memory footprint that can reach tens of gigabytes, including vertex and edge features, which exceeds both on-chip and off-chip memory capacities. Numerous graph partitioning techniques have been studied to reduce memory footprints [16, 30].

One notable approach is the dual-sliding-window-based graph partitioning (DSW-GP) method [43]. This method is popular due to its regular memory access patterns and has been adopted by several graph processing systems [27]. DSW-GP first divides vertices into disjoint destination intervals and then creates smaller shards containing edges under each destination interval. Fig. 2 illustrates an example graph partitioned into 4 destination intervals and 16 shards. During partitioning, it is ensured that each shard can fit into the targeted memory space. During computation, operations are performed on each interval and shard instead of the entire graph, with shards being iterated interval-wise—referred to as window sliding.

GNNs, as an extension of traditional graph processing, also demand a large amount of memory concerning both graph scale and embedding size. To reduce memory footprint in GNN acceleration, we consider employing DSW-GP in this work.

Input:

\mathit{G}

, input graph;

\mathit{M}

, targeted GNN model.

Output:

\mathit{S}

, the resulted shards.

\mathit{shardHeight\leftarrow}

calShardHeight

\mathit{(G,M)}

;

\mathit{shardNumInterval\leftarrow G.vTotal~{}/~{}shardHeight}

;

\mathit{I\leftarrow}

initIntervals

\mathit{(G,M)}

;

7 for each $\mathit{i}$ in $\mathit{I}$ do

8 for $\mathit{sidx}$ in range $\mathit{(shardNumInterval)}$ do

\mathit{srcBegin\leftarrow sidx\times shardHeight}

;

\mathit{srcEnd\leftarrow srcBegin+shardHeight}

;

11 if $\mathit{srcEnd>G.vTotal}$ then

\mathit{srcEnd\leftarrow G.vTotal}

;

\mathit{s\leftarrow}

initShard

\mathit{(i)}

;

15 setShardSource

\mathit{(s,srcBegin,srcEnd)}

;

16 finalizeShard

\mathit{(i,s)}

;

18finalizeInterval

\mathit{(i)}

;

Algorithm 1 DSW-GP

III Motivation

In this section, we discuss the motivation behind designing our SwitchBlade GNN accelerator framework. We first analyze the characteristics of GNNs to identify two primary challenges in GNN acceleration. Next, we review existing solutions and find that none of them address both challenges simultaneously. Finally, we present the goal of our SwitchBlade framework, inspired by the above insights.

III-A Challenges of GNN Acceleration

By analyzing GNN characteristics, we identify two significant challenges in GNN acceleration:

High model variety.

GNNs exhibit a high degree of variety in model structure. Unlike graph processing and deep learning, which are dominated by GEMM/CONV and SpMV/SpMM operations [41], GNNs do not have a single computational hotspot. Instead, they are more ad-hoc, depending on the targeted applications [34]. This characteristic leads to the challenge of accommodating the diverse GNN model structures when designing accelerators.

High bandwidth demand.

Previous studies have shown that graph processing and deep learning applications are often limited by off-chip memory bandwidth [4]. While deep learning achieves high bandwidth utilization when transferring large amounts of weights and feature maps, graph processing experiences lower bandwidth utilization due to irregularity or sparsity in the graphs. As a combination of both graph processing and deep learning [34], GNNs demand even higher bandwidth, exceeding today’s bandwidth capacity.

III-B Limitations of Existing Solutions

Although numerous GNN accelerators have been proposed in recent years, achieving significant performance improvements compared to general-purpose architectures, none of them fully address the two challenges mentioned above. One group of prior works focuses on the bandwidth challenge, either alleviating bandwidth demand by designing dedicated cores that exploit graph sparsity[37] or satisfying bandwidth demand by incorporating Process-In-Memory (PIM) in the accelerator[33, 1]. However, most of these solutions lack flexibility in supporting more powerful yet complex GNNs, as their designs typically make strong assumptions about the targeted GNN models, limiting their applicability to a broader range of GNNs. Another group of works [15, 2] offers high flexibility for various GNNs but fails to address the bandwidth challenge, resulting in long-distance data movement problems and limited performance [1].

III-C Our Goal.

The limitations of prior work motivate us to tackle the challenges of bandwidth requirement and model variety concurrently. Our approach explores generic and cross-stack optimizations for GNN computation, making few assumptions about GNN model structures. We integrate these optimizations into a single framework called SwitchBlade, which we will detail in the following sections.

IV SwitchBlade Design

IV-A SwitchBlade Overview

Fig. 1 illustrates the SwitchBlade workflow, which encompasses three core methods that form the backbone of SwitchBlade. Specifically, a GNN model $M$ written in a high-level language such as DGL or PyG is first compiled by our GNN compiler to process a graph $G$ . This spans across the algorithm, software, and hardware layers.

Input: Dst. intervals

\mathit{I}

and shards

\mathit{S}

by DSW-GP.

Output: None, updating the intervals

\mathit{I}

in situ.

4for each $\mathit{interval}$ in $\mathit{I}$ do

\mathit{interval\leftarrow}

ScatterPhase

\mathit{(interval)}

;

6 for each $\mathit{shard}$ in $\mathit{S[interval]}$ do

\mathit{interval\leftarrow}

GatherPhase

\mathit{(interval,shard)}

;

\mathit{interval\leftarrow}

ApplyPhase

\mathit{(interval)}

;

Algorithm 2 Partition-Level Operator Fusion

IV-B Partition-Level Operator Fusion (PLOF)

PLOF is co-designed with compiling-level operator fusion and dual-sliding-window-based graph processing style to eliminate redundant DRAM accesses between GNN operators. In contrast to prior operator fusion for Deep Neural Networks (DNNs), which only considers the program of the DNN model [4], PLOF also takes the graph data structure into account to introduce a new fusion paradigm. However, due to the irregularity of the graph, determining which operators should be fused and at which abstraction level remains challenging.

To achieve PLOF’s goal, we propose fusing operators that process vertices or edges at the graph interval and shard level. The GNN model is first divided into multiple phases, and the input graph is partitioned into intervals and shards. Then, all phases will iterate either the intervals or shards according to the DSW-GP to complete the GNN computation. In this scenario, the total off-chip memory access is roughly $n_{p}\times M$ instead of $n_{o}\times M$ , where $n_{p}$ is the phase number, $n_{o}$ is the operator number, and $M$ is the off-chip memory access of one operator. Consequently, each phase can be regarded as a fused operator, which alleviates the high bandwidth requirement.

We employ a template program to separate a GNN model and iterate the graph, defining three phases: ScatterPhase, GatherPhase, and ApplyPhase. The pseudo code is presented in Alg. 2. The template is inspired by traditional graph processing [11], where a graph analytic algorithm can be represented using a similar three-stage GAS programming model. Instead of the GAS model operating on a single vertex or edge, we batch them into vertex intervals and edge shards produced by DSW-GP to further improve locality and expose parallelism. A GNN model can thus be represented by phases of the template program operating on different parts of the graph. Fig. 6-d shows an example of the PLOF program written in our ISA (Sec. V-A) corresponding to the high-level language in Fig. 6-a.

Compiler Support.

Assigning each operator in a GNN model to the appropriate phase is a non-trivial task due to the large number of operators involved. A GNN model typically consists of multiple layers, with each layer containing tens of operators, including GTRs. Consequently, it can be challenging for programmers to take advantage of the proposed operator fusion technique. However, the semantics of each operator can be inferred from specific operators within the model, offering an opportunity to automate the phase construction process through software support. To achieve this, we incorporate the process into our compiler, which will be detailed later in Sec. V-C.

IV-C Shard-Level Multi-Threading (SLMT)

We propose SLMT to balance hardware flexibility and performance. In contrast, previous works [37] primarily focus on designing hardwired accelerators dedicated to specific GNN models. Although these designs achieve remarkable performance and efficiency, they suffer from limited flexibility. To address the variety challenge, we carefully trade off hardware flexibility and performance via SLMT.

The SLMT exploits shard-level parallelism, a new parallelism type dedicated to graph-related computation. We can parallelize shards since most operators in GatherPhase for shard processing have minimal dependency between shards, except for the GatherOp. Additionally, the shard is an ideal abstraction for parallelization due to its suitable size for the targeted memory and adjustability.

To implement SLMT, we construct a GNN accelerator with simultaneous multi-threading [28]. The approach allows the hardware to automatically schedule different shards to issue their GatherPhase instructions, enabling those shards to utilize different hardware resources, such as functional units and bandwidth, simultaneously. Fig. 3 demonstrates this process, where multiple shards are processed concurrently by two shard-threads (sThreads). As shown in the figure, SLMT optimizes hardware resource utilization during shard processing, which is the central aspect of the overall GNN computation. We will describe the architectural details later in Sec. V-B.

IV-D Fine-Grained Graph Partitioning (FGGP)

FGGP is proposed to reduce redundant and unnecessary data transfers for partitioning-based GNN computation by exploiting graph sparsity. Generally, FGGP achieves this goal in two ways. First, FGGP can significantly increase the destination interval size (also the shard width) under memory constraints. We use Fig. 4-a as an example to illustrate the benefit. Initially, the source vertex 5 would be loaded twice under the first and second destination intervals. By increasing the interval size from 3 to 6, the source vertex 5 will only be loaded once, saving bandwidth. In this context, we identify the multiple loads of a single source vertex as redundant data transfer. The second aspect of FGGP is to skip unused source vertices for each shard, eliminating useless data transfers.

The core of FGGP lies in generating shards at the level of individual edges and source vertices during graph partitioning. To generate a shard, rather than forming a list of consecutive source vertices and simply assuming each source is fully connected to the vertices in the destination interval, we propose dynamically adding each edge with its associated source vertex to the shard until it is full. As a result, the source vertex lists of our shards can be discontinuous, as shown in Fig. 4-b. This method not only skips unused sources but also decouples the interval size from the memory constraint. Therefore, we can specify an interval size that far exceeds the memory size available for storing the shard.

We implement a graph partitioner to realize FGGP. To generate shards, the partitioner requires both the adjacency matrix of the input graph and the data dimensions from the compiler. We will discuss the graph partitioner in more detail later in Sec. V-D.

V SwitchBlade Implementation

To actualize the proposed methods, we develop a GNN accelerator framework comprising the GNN Accelerator (GA), GNN Compiler (GC), and Graph Partitioner (GP). Furthermore, we design an instruction set architecture (ISA) to serve as the interface connecting these three components of our framework.

In particular, the GA features a versatile instruction-driven pipeline equipped with various domain-specific functional units. Additionally, the GA employs SLMT to automatically pipeline GNN execution, ensuring efficient execution across a wide range of GNN models. The GC is devised to generate ISA code in the form of PLOF phases, accepting arbitrary GNN models written in high-level frameworks (e.g., DGL [31], PyG [7]) as input, automatically mapping them into PLOF phases, and generating the corresponding ISA code. The GP employs FGGP to create denser shards from the input graph in accordance with the GA specification and GNN model information derived from the GC. We delve into the details of our developed framework in the following sections.

TABLE II: Example of SwitchBlade instructions.

Type

Opname

Operand

Compute

ELW (ADD, MUL, RELU), DMM (GEMM),

GTR (GTHR.SUM.F, SCTR.F, SCTR.B)

Dimensions, Memory Symbols

Memory

Load/Store Src/Dst (LD.D, LD.S, ST.D)

V-A Instruction Set Architecture (ISA)

We first introduce the ISA of SwitchBlade, which serves as the foundation for our GA, GC, and GP. Tbl. II presents an example of the instructions, while Fig. 5 illustrates the GNN accelerator architecture.

Initially, we categorize the instructions in the ISA into two types: Compute and Memory. Each instruction type targets a specific architectural component within the hardware. Compute instructions are further classified into three sub-types, corresponding to the three primitive operator types in GNNs (Sec. II). Generally, compute instructions are directed to the functional unit depicted in Fig. 5. Memory instructions, on the other hand, handle data transfer of vertices and edges between the on-chip embedding buffer and off-chip DRAM and are issued to the load-store unit (LSU) shown in Fig. 5.

For each instruction, we define three fields: opname, data-dimension, and memory-symbol. The opname and data-dimension fields specify the targeted operator and the associated parameters of the input and output data. To apply a single program to multiple intervals and shards with varying sizes, we also establish a set of macros representing the parameters of intervals and shards in the second field. For instance, we use E for the number of edges in the current shard. These macros should be decoded at runtime by the hardware controller. The third field, memory-symbol, indicates the memory addresses of the input and output data. We define memory-symbols as another kind of macro, offering three symbol types—D (destination), S (source), and E (edge)—to denote the data types of input and output, which may correspond to different hardware stores. To obtain the actual addresses, the hardware controller calculates the address at runtime, also accounting for the varying sizes of intervals and shards.

V-B GNN Accelerator (GA)

The GA is an instruction-driven platform designed to perform various GNN computations as expressed by the ISA. To address the challenge of supporting diverse operator numbers and orders, we utilize the SLMT as the core of GA. Based on SLMT, we achieve a more flexible design where different functional units connect to various buffers in parallel, enabling versatile data access. Our GA design is illustrated in Fig. 5.

V-B1 Functional Unit

The GA comprises two types of dedicated functional units for GNN primitive operators (Sec. II-A). The first, called a vector unit (VU), includes multiple cores operating in the Single-Instruction-Multiple-Data (SIMD) paradigm. It targets the lightweight ELW and GTR operators. For ELW, each core operates independently on different embeddings. For GTR, each core is responsible for one destination vertex in GatherOp or one edge in ScatterOp. The second, a matrix unit (MU), is an output-stationary systolic multiply-accumulate (MAC) array targeting compute-intensive DMM operators for increased efficiency. In summary, this functional unit covers all three operator types in GNNs.

V-B2 Controller

The Controller implements the SLMT as described in Sec. IV-C. It maintains multiple program counters (PCs), each corresponding to a thread. To simplify control flow, we employ one interval thread (iThread) and multiple shard threads (sThreads). The iThread executes only the ScatterPhase and ApplyPhase, processing destination vertices in intervals, while sThreads execute only the GatherPhase, processing source vertices and edges in shards. For each instruction fetched by a PC, the controller decodes it and enqueues it into a queue specified by its instruction types. Each queue will dequeue and issue an instruction when the targeted component is ready.

To adhere to the execution logic in Alg. 2, we introduce a Phase Scheduler (PS) to manage phase switching comprehensively. Whenever a PC reaches the end of a phase, the PS is activated to determine the next phase for the PC or to pause or resume a thread by accessing the shard metadata in the Graph Buffer. For instance, when an sThread completes a shard, the PS resets its PC and assigns it another shard if any are available within the interval. If no shards are left within this interval, the PS pauses the sThread. If all sThreads are paused, the PS resumes the iThread and sets its PC to ApplyPhase.

V-B3 Embedding Buffer

We employ two on-chip scratchpad memory (SPM) pieces as Embedding Buffers to store vertex and edge embeddings and intermediate data. These buffers connect in parallel to the functional units through a crossbar, enhancing data access flexibility and allowing for arbitrary operator numbers and orders to be executed on the GA. One embedding buffer, called DstBuffer, stores data of destination vertices in intervals, corresponding to memory-symbol D in the ISA. The other embedding buffer, named SrcEdgeBuffer, stores data of edges and source vertices in shards, corresponding to memory-symbol S and E in the ISA. As the buffer holds various data for intervals and shards, we develop a compiler (Sec. V-C) to effectively manage and access this buffer.

To supply data for the SLMT, we logically divide the SrcEdgeBuffer into $num\_sthread$ parts to hold multiple shards simultaneously. Each sThread privately accesses shard data in the SrcEdgeBuffer using different base addresses, while all threads share access to interval data in the DstBuffer.

V-B4 Graph Buffer

The Graph Buffer is composed of three parts: MetaBuffer, DataBuffer, and LSU. The DataBuffer stores vectors representing connections between vertices and edges of a shard, using Coordinate Format (COO) to support GTR operators performed by the Functional Unit. The MetaBuffer stores scalars indicating current shard sizes, including the number of source vertices and edges. Both DataBuffer and MetaBuffer are designed to hold multiple shards simultaneously, supporting the SLMT. The LSU handles shard and embedding transfers, receiving memory instructions from the controller, translating them into low-level transactions using shard and interval data in the DataBuffer, and sending the transactions to the downstream DRAM interface.

We implement a straightforward prefetch mechanism to supply data to graph buffers. A 1-bit flag is assigned to each shard, indicating whether the data requires an update. If the bit is set to $1$ , the current data is deemed outdated, prompting the LSU to automatically load the subsequent shard into the buffer and change the bit to $0$ . The system remains idle until the phase scheduler in the controller switches the bit back to $1$ due to phase changes.

V-C PLOF Compiler

We develop the Graph Compiler (GC) to automatically map a GNN model, written in a high-level framework, to the PLOF template program. Fig. 6 illustrates the workflow of GC. The entire compilation process consists of three steps, as described below.

V-C1 Constructing Unified Computational Graph

In this phase, GC extracts a unified computational graph from high-level frameworks. This unified graph replaces framework-specific graph operators—such as apply_edges and update_all in DGL, and scatter in PyG—with more generic GTR operators, as presented in Sec. II.

V-C2 Constructing PLOF Phases

GC decomposes the unified computational graph into groups of PLOF phases, as outlined in Sec. IV-B. Specifically, GC traverses the unified graph from each GTR operator along both in-edges and out-edges, respectively, until encountering another GTR. During traversals, GC labels each visited edge with src, dst, or edge, depending on the starting GTR operator type. Subsequently, GC performs reverse topology sorting on the labeled unified graph, cutting the foremost edge of each successive edge block—marked with dst labels—and other corresponding unvisited edges.

Next, GC determines the phase to which each operator belongs. This process also relies on the data labels marked earlier. Specifically, GC executes reverse topology sorting, during which it appends visited operators to ApplyPhase until encountering an operator whose in-edge lacks a dst label. Then, GC appends visited operators to GatherPhase until encountering an operator whose out-edge does not have an edge label. Finally, GC appends all remaining operators to ScatterPhase.

V-C3 Code Generation and Post-Processing

GC generates compute and memory instructions for operators in accordance with the ISA. Specifically, GC first creates one compute instruction for each operator. The opname and data-dimension are directly derived from the current operator, while the memory-symbols are jointly determined by the current and adjacent operators. For the destination memory-symbol, GC establishes its type based on the operator label marked in Sec. V-C2 and assigns a new number to the type. The source memory-symbols, therefore, are the destination symbols of its depending instructions. GC also inserts corresponding memory instructions into the phases if the input or output memory-symbols are not produced or consumed in this PLOF phases.

After instruction generation, GC performs memory-symbol liveness analysis to manage the on-chip buffer. GC first calculates the size of each symbol and then merges two symbols of the same size if the former is no longer in use. Consequently, more useful data can be stored on-chip, resulting in better buffer utilization. At this stage, GC completes code generation for GNNs from high-level frameworks.

During execution, GC produces parameters for graph partitioning. Specifically, GC calculates two parameters: dim_src and dim_edge. The former represents the total data-dimensions of all source vertex memory-symbols in each GatherPhase, while the latter signifies the total data-dimensions of all edge memory-symbols. These parameters are determined by accumulating the data-dimensions of corresponding symbols. These parameters allow the downstream GP to perform FGGP, as detailed in the following Sec. V-D.

Input:

\mathit{G}

, input graph;

\mathit{M}

, targeted GNN model.

Output:

\mathit{S}

, the resulted shards.

\mathit{I\leftarrow}

initIntervals

\mathit{(G,M)}

;

4 for each $\mathit{i}$ in $\mathit{I}$ do

\mathit{s\leftarrow}

initShard

\mathit{(i)}

;

\mathit{srcPtr}

\leftarrow 0

;

7 while $\mathit{srcPtr<G.vTotal}$ do

\mathit{dstList\leftarrow}

acquireNeiList

\mathit{(G,i,srcPtr)}

;

9 if $\mathit{dstList.size}$ $>0$ then

10 if probeShardSize $\mathit{(s,dstList,M)}$ then

11 finalizeShard

\mathit{(i,s)}

;

\mathit{s\leftarrow}

initShard

\mathit{(i)}

;

14 appendShardSource

\mathit{(s,srcPtr,dstList)}

;

\mathit{srcPtr\leftarrow srcPtr}

+1

;

18 finalizeShard

\mathit{(i,s)}

;

20finalizeInterval

\mathit{(i)}

;

Algorithm 3 DSW-GP with proposed FGGP

V-D Graph Partitioner (GP)

GP implement our FGGP based on DSW-GP. Alg. 3 presents the pseudocode for our GP. The primary distinction between our GP and the original DSW-GP, as shown in Alg. 1, lies in the inner loop, where we iterate through source vertices for each interval. For every source, GP first executes acquireNeiList to obtain the dstList of adjacent destination vertices, along with the associated edge under the current interval. If dstList is empty, GP directly skips the source; otherwise, GP performs probeShardSize to check whether there is extra space for the source and its associated edges in the shard, based on the following rule:

\begin{split}\textit{num\_src}\times\textit{dim\_src}+\textit{num\_edge}&\times\textit{dim\_edge}\\ &\leq\frac{\textit{mem\_capacity}}{\textit{num\_sThread}},\end{split}

(1)

where $num\_src$ and $num\_edge$ represent the numbers of source vertices and edges of a shard, $dim\_src$ and $dim\_edge$ denote the total dimensions of all source vertex and edge memory-symbols in the generated phases from GC, and $num\_sthread$ is the number of sThreads running on GA. If Equ. 1 is satisfied, GP performs appendSource to include the current source vertex and its associated edges into the shard. Otherwise, GP finalizes the current shard and initializes a new one before appending the source vertex.

TABLE III: System configurations.

Computing

Unit

On-Chip

Memory

Off-Chip

Memory

V100

GPU

\times

SIMT64 cores

@1.25GHz

Totally 34MB

900GB/s

HBM-2

HyGCN

\times

SIMD32 cores,

\times

\times

128 systolic

MAC array @1GHz

128KB(Input), 2MB(Edge),

2MB(Weight), 4MB(Output),

8MB(Aggregation)

256GB/s

HBM-1

Switch-

Blade

\times

SIMD32 cores,

\times

128 systolic

MAC array @1GHz

8MB(DB), 1MB(SEB),

2MB(Weight), 128KB(GB)

256GB/s

HBM-1

VI Methodology

Simulation Method.

We implement, synthesize, and validate the SwitchBlade components using Verilog HDL with Synopsys Design Compiler and TSMC 28 nm standard cell library at 1 GHz. Based on the synthesis, we construct a C++-based cycle-level simulator with aligned latency for end-to-end SwitchBlade evaluation. We integrate Ramulator [17] to obtain accurate latency for off-chip memory behaviors. Our simulator is validated against DGL built-in models [31] to ensure correct functionality. For on-chip scratchpad memory, we use Synopsys Memory Compiler to estimate area and power. We measure HBM access energy at 7 pJ/bit [38].

TABLE IV: Graph datasets [13] for evaluation.

Dataset	Vertex#	Edge#	Description
ak2010 (AK)	45,293	108,549	Redistrict Set
coAuthorsDBLP (AD)	299,068	977,676	Citation Networks
hollywood (HW)	1,139,905	57,515,616	Collaboration Networks
cit-Patents (CP)	3,774,768	16,518,948	Patent Networks
soc-LiveJournal (SL)	4,847,571	43,369,619	Social Networks

Benchmark Datasets and Models.

We select five graphs from real-world workloads [13] with varying graph sizes and sparsity levels, detailed in Tbl. IV. Additionally, we choose four popular and diverse GNN models [6] with mathematical expressions shown in Tbl. I. For each model, we stack two identical layers with a dimension of 128 for input, hidden, and output embeddings for simplicity.

Baselines.

We compare SwitchBlade with a general-purpose processor and a GNN accelerator. For the general-purpose processor, we use the DGL-0.7 [31] library running on an NVIDIA V100 GPU[25] with 32 GB memory and an Intel Xeon E5-2630 v4 with 256 GB memory. For the GNN accelerator, we reproduce HyGCN, a state-of-the-art accelerator specifically designed for the GCN model, and compare its performance against SwitchBlade under the GCN. Tbl. III summarizes the configuration details. We set three sThreads for our shard-level multi-threading to match the three types of hardware resources: VU, MU, and bandwidth.

VII Evaluation Results

VII-A Overall Results

Latency.

Fig. 7 compares the latency of different platforms. On average, SwitchBlade achieves a $1.85\times$ speedup over the baseline V100 GPU. SwitchBlade exhibits higher speedup on GAT, SAGE, and GGNN than GCN due to the presence of more operators in these models, providing greater opportunities for operator fusion and multi-thread scheduling. Moreover, SwitchBlade attains a $1.28\times$ speedup over HyGCN on GCN workloads, which is notable considering HyGCN deploys a hardwired pipeline tailored for GCN.

Energy.

To ensure a fair comparison, we convert the results from 28nm to 12nm [26]. Fig. 8 displays the results. SwitchBlade achieves an average $19.03\times$ energy saving over the baseline V100 GPU and $0.82\times$ over HyGCN. The high energy efficiency compared to the GPU is attributed to reduced on-chip data access and efficient domain-specific functional units in the SwitchBlade accelerator. In comparison to HyGCN, SwitchBlade also holds a slight advantage due to its simpler MU micro-architecture.

Area and Power.

Tbl. V summarizes the area and power of SwitchBlade components under the TSMC 28 nm standard library. The total area and power of SwitchBlade are $28.25~{}mm^{2}$ and $6.06~{}W$ , respectively, accounting for only $3.47\%$ and $2.43\%$ of the baseline V100 GPU with $815~{}mm^{2}$ and $250~{}W$ under the 12 nm technology node. Among the components, the SRAM-based SPM (SEB, DB, and TB) consumes the majority, representing $76\%$ and $58\%$ of the total area and power, respectively.

TABLE V: Area and power breakdown of SwitchBlade.

TSMC 28nm	MU	VU	CTRL	RAM	Total
Area / %	15.46	6.37	2.11	76.06	28.25 mm2
Power / %	24.02	14.95	2.66	58.38	6.06 W

In the following subsections, we evaluate the effectiveness and sensitivity of each individual method proposed in this paper.

VII-B Partition-Level Operator Fusion

We assess the effectiveness of PLOF by measuring the reduction in total data transfer between on-chip and off-chip memory. Fig. 9 presents the results. PLOF is effective for all types of GNN workloads and significantly reduces data transfer compared to the operator-by-operator execution paradigm on the GPU platform. This substantial reduction contributes to the speedup and energy reduction over the baseline V100 GPU.

VII-C Shard-Level Multi-Threading

Hardware Utilization.

To evaluate the effectiveness of SLMT, we measure the overall hardware utilization by averaging the individual utilization of DRAM bandwidth, VU, and MU. Fig. 11 presents the results. SwitchBlade achieves higher overall utilization across all workloads when employing 3 SLMT sThreads rather than 1, which is regarded as SLMT turned off.

sThread Number.

We also assess the performance and resource utilization under different concurrent sThread numbers. Fig. 11 displays the results. Generally, the execution latency decreases initially and then increases, reaching optimal performance with two sThreads on each workload. The decrease in latency is expected as multiple sThreads can concurrently exercise different hardware execution units (VU, MU, and bandwidth), resulting in higher overall hardware utilization and performance. The reason for the increase in latency after two sThreads is the reduced data parallelism and hardware functional unit efficiency due to limited on-chip memory space for each sThread. Consequently, we observe minimal performance improvement with more than three sThreads based on the three types of hardware execution units.

VII-D Fine-Grained Graph Partitioning

Data Density.

To evaluate the effectiveness of mitigating the larger on-chip SPM requirement from SLMT, we first measure the average buffer occupancy rate for SEB and DB on different datasets, calculated as:

\textit{occupancy\_rate}=\frac{1}{W}\sum^{i}_{i\in[1,W]}\frac{\textit{data\_accessed}_{i}}{\textit{total\_buffer\_space}}

where $W$ is the total number of writes to the targeted buffer. Fig. 13 presents the results. The occupancy rate of SwitchBlade reaches nearly $99\%$ using FGGP, whereas prior graph partitioning with sparsity elimination in HyGCN only achieves a $44\%$ occupancy rate. As a result, SwitchBlade can attain similar performance using $31\%$ less on-chip SPM.

Data Reuse.

We also measure the total data transfer and corresponding performance improvement from FGGP under the same on-chip SPM budget as HyGCN. Fig. 13 displays the results. By increasing the DB size from 8MB to 13MB, we achieve an additional $10\%$ data-transfer reduction through FGGP, leading to a further $1.1\times$ speedup. FGGP gains less speedup on the HW graph because HW is relatively dense, making the bandwidth less critical than functional units and rendering the FGGP optimization less effective.

VIII Related Work

In the past three years, GNN hardware acceleration has become a prominent research topic. Some studies have achieved high performance and efficiency through model-specific optimizations. HyGCN [37] is the first to decouple the two operators in GCNs into stages, designing two dedicated compute engines to enable a two-stage pipeline execution. GReTA [18] extends this concept with a multi-stage pipeline and bidirectional dataflow. GraphACT [40] and ReGraphX [1] also incorporate CPU and 3D ReRAM techniques in their multi-stage acceleration. AWB-GCN [8] proposes a unified SpMM architecture for GCN’s two operators, improving hardware utilization. GCNAX [21] presents a reconfigurable loop ordering and fusion design for GCN, while I-GCN [9] explores data locality with specific reorder mechanisms to enhance accelerator performance. These approaches achieve high performance and efficiency by making strong assumptions about the targeted GNN models and designing specialized hardware architectures. Although they can run other models that do not match their hardware designs by bypassing some stages, this increases data movement and subsequently reduces performance and efficiency.

Other studies focus on more flexible hardware or scheduling designs. EnGN [15] introduces a unified SIMD architecture that adopts a ring-based dataflow for SpMM to improve hardware utilization. Auten et al. [2] connect all components with a crossbar switch in a hardware block. However, these works do not capture the generic GNN characteristics and suffer from long-distance data movement problems, leading to low performance and energy efficiency. In contrast, we make no assumptions about the targeted GNN models and propose three methods to enhance GNN accelerator performance. Our approach involves co-designing the software and hardware of the compiler, architecture, and graph partitioner as a whole.

IX Conclusion

In this paper, we present SwitchBlade, a comprehensive framework designed to address both variety and bandwidth challenges in GNN acceleration. Our approach achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to GPUs, while also delivering performance on par with state-of-the-art model-specific GNN accelerators. To accomplish these results, we introduce a partition-level operator fusion technique that significantly reduces the high bandwidth requirements associated with GNN execution. Furthermore, we propose a shard-level multi-threading approach to enhance the overall utilization of bandwidth and other functional units. Lastly, to tackle the increased on-chip memory contention caused by multi-threading, we present a fine-grained graph partitioning method for refining shard data. Importantly, all three methods are model-agnostic, making them suitable for a wide range of GNN models. Our experimental results demonstrate the effectiveness of the proposed methods in SwitchBlade for enhancing the performance of various GNN models and datasets.

References

[1] A. I. Arka, J. R. Doppa, P. P. Pande, B. K. Joardar, and K. Chakrabarty, “Regraphx: Noc-enabled 3d heterogeneous reram architecture for training graph neural networks,” in Design, Automation & Test in Europe Conference & Exhibition, DATE 2021, Grenoble, France, February 1-5, 2021. IEEE, 2021, pp. 1667–1672. [Online]. Available: https://doi.org/10.23919/DATE51398.2021.9473949
[2] A. Auten, M. Tomei, and R. Kumar, “Hardware acceleration of graph neural networks,” in 57th ACM/IEEE Design Automation Conference, DAC 2020, San Francisco, CA, USA, July 20-24, 2020. IEEE, 2020, pp. 1–6. [Online]. Available: https://doi.org/10.1109/DAC18072.2020.9218751
[3] C. Chen, K. Li, S. G. Teo, X. Zou, K. Wang, J. Wang, and Z. Zeng, “Gated residual recurrent graph neural networks for traffic prediction,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 2019, pp. 485–492. [Online]. Available: https://doi.org/10.1609/aaai.v33i01.3301485
[4] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: an automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018, A. C. Arpaci-Dusseau and G. Voelker, Eds. USENIX Association, 2018, pp. 578–594. [Online]. Available: https://www.usenix.org/conference/osdi18/presentation/chen
[5] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang, and W. Daelemans, Eds. ACL, 2014, pp. 1724–1734. [Online]. Available: https://doi.org/10.3115/v1/d14-1179
[6] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson, “Benchmarking graph neural networks,” CoRR, vol. abs/2003.00982, 2020. [Online]. Available: https://arxiv.org/abs/2003.00982
[7] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[8] T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y. Li, P. Haghi, A. Tumeo, S. Che, S. K. Reinhardt, and M. C. Herbordt, “AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing,” in 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17-21, 2020. IEEE, 2020, pp. 922–936. [Online]. Available: https://doi.org/10.1109/MICRO50266.2020.00079
[9] T. Geng, C. Wu, Y. Zhang, C. Tan, C. Xie, H. You, M. Herbordt, Y. Lin, and A. Li, “I-gcn: A graph convolutional network accelerator with runtime locality enhancement through islandization,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 1051–1063.
[10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 1263–1272.
[11] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, C. Thekkath and A. Vahdat, Eds. USENIX Association, 2012, pp. 17–30. [Online]. Available: https://www.usenix.org/conference/osdi12/technical-sessions/presentation/gonzalez
[12] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 855–864.
[13] Gunrock, “Gunrock Benchmark Dataset,” https://github.com/gunrock/gunrock/tree/master/dataset/large, 2018.
[14] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 1024–1034. [Online]. Available: http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs
[15] L. He, “Engn: A high-throughput and energy-efficient accelerator for large graph neural networks,” CoRR, vol. abs/1909.00155, 2019. [Online]. Available: http://arxiv.org/abs/1909.00155
[16] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
[17] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible DRAM simulator,” IEEE Comput. Archit. Lett., vol. 15, no. 1, pp. 45–49, 2016. [Online]. Available: https://doi.org/10.1109/LCA.2015.2414456
[18] K. Kiningham, P. Levis, and C. Re, “GReTA: Hardware Optimized Graph Processing for GNNs,” in Proceedings of the Workshop on Resource-Constrained Machine Learning (ReCoML 2020), March 2020.
[19] K. Kiningham, C. Ré, and P. A. Levis, “GRIP: A graph neural network accelerator architecture,” CoRR, vol. abs/2007.13828, 2020. [Online]. Available: https://arxiv.org/abs/2007.13828
[20] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=SJU4ayYgl
[21] J. Li, A. Louri, A. Karanth, and R. Bunescu, “Gcnax: A flexible and energy-efficient accelerator for graph convolutional neural networks,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 775–788.
[22] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, “Gated graph sequence neural networks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1511.05493
[23] L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai, “NeuGraph: Parallel deep neural network computation on large graphs,” in 2019 USENIX Annual Technical Conference, USENIX ATC 2019, Renton, WA, USA, July 10-12, 2019, D. Malkhi and D. Tsafrir, Eds. USENIX Association, 2019, pp. 443–458. [Online]. Available: https://www.usenix.org/conference/atc19/presentation/ma
[24] M. Minsky and S. A. Papert, Perceptrons: An introduction to computational geometry. MIT press, 2017.
[25] NVIDIA, “TESLA V100 GPU Architecture Whitepaper,” https://composter.com.ua/documents/Volta-Architecture-Whitepaper.pdf, 2017.
[26] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. M. Burns, and Ö. Özturk, “Energy efficient architecture for graph analytics accelerators,” in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016. IEEE Computer Society, 2016, pp. 166–177. [Online]. Available: https://doi.org/10.1109/ISCA.2016.24
[27] L. Song, Y. Zhuo, X. Qian, H. H. Li, and Y. Chen, “GraphR: Accelerating graph processing using ReRAM,” in IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Vienna, Austria, February 24-28, 2018. IEEE Computer Society, 2018, pp. 531–543. [Online]. Available: https://doi.org/10.1109/HPCA.2018.00052
[28] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous multithreading: Maximizing on-chip parallelism,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA ’95, Santa Margherita Ligure, Italy, June 22-24, 1995, D. A. Patterson, Ed. ACM, 1995, pp. 392–403. [Online]. Available: https://doi.org/10.1145/223982.224449
[29] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
[30] L. Wang, Y. Xiao, B. Shao, and H. Wang, “How to partition a billion-node graph,” in 2014 IEEE 30th International Conference on Data Engineering. IEEE, 2014, pp. 568–579.
[31] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, “Deep graph library: A graph-centric, highly-performant package for graph neural networks,” arXiv preprint arXiv:1909.01315, 2019.
[32] Y. Wang, A. A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: a high-performance graph processing library on the GPU,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, R. Asenjo and T. Harris, Eds. ACM, 2016, pp. 11:1–11:12. [Online]. Available: https://doi.org/10.1145/2851141.2851145
[33] Z. Wang, Y. Guan, G. Sun, D. Niu, Y. Wang, H. Zheng, and Y. Han, “GNN-PIM: A processing-in-memory architecture for graph neural networks,” in Advanced Computer Architecture - 13th Conference, ACA 2020, Kunming, China, August 13-15, 2020, Proceedings, ser. Communications in Computer and Information Science, D. Dong, X. Gong, C. Li, D. Li, and J. Wu, Eds., vol. 1256. Springer, 2020, pp. 73–86. [Online]. Available: https://doi.org/10.1007/978-981-15-8135-9\_6
[34] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Networks Learn. Syst., vol. 32, no. 1, pp. 4–24, 2021. [Online]. Available: https://doi.org/10.1109/TNNLS.2020.2978386
[35] Z. Xie, M. Wang, Z. Ye, Z. Zhang, and R. Fan, “Graphiler: Optimizing graph neural networks with message passing data flow graph,” in Proceedings of Machine Learning and Systems 2022, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022, D. Marculescu, Y. Chi, and C. Wu, Eds. mlsys.org, 2022. [Online]. Available: https://proceedings.mlsys.org/paper/2022/hash/a87ff679a2f3e71d9181a67b7542122c-Abstract.html
[36] M. Yan, Z. Chen, L. Deng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Characterizing and understanding gcns on GPU,” IEEE Comput. Archit. Lett., vol. 19, no. 1, pp. 22–25, 2020. [Online]. Available: https://doi.org/10.1109/LCA.2020.2970395
[37] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Hygcn: A GCN accelerator with hybrid architecture,” in IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22-26, 2020. IEEE, 2020, pp. 15–29. [Online]. Available: https://doi.org/10.1109/HPCA47549.2020.00012
[38] M. Yan, X. Hu, S. Li, A. Basak, H. Li, X. Ma, I. Akgun, Y. Feng, P. Gu, L. Deng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Alleviating irregularity in graph analytics acceleration: a hardware/software co-design approach,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019. ACM, 2019, pp. 615–628. [Online]. Available: https://doi.org/10.1145/3352460.3358318
[39] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, 2018, pp. 974–983.
[40] H. Zeng and V. K. Prasanna, “Graphact: Accelerating GCN training on CPU-FPGA heterogeneous platforms,” in FPGA ’20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020, S. Neuendorffer and L. Shannon, Eds. ACM, 2020, pp. 255–265. [Online]. Available: https://doi.org/10.1145/3373087.3375312
[41] G. Zhang, N. Attaluri, J. S. Emer, and D. Sanchez, “Gamma: Leveraging gustavson’s algorithm to accelerate sparse matrix multiplication,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 687–701.
[42] Z. Zhang, J. Leng, L. Ma, Y. Miao, C. Li, and M. Guo, “Architectural implications of graph neural networks,” IEEE Comput. Archit. Lett., vol. 19, no. 1, pp. 59–62, 2020. [Online]. Available: https://doi.org/10.1109/LCA.2020.2988991
[43] X. Zhu, W. Han, and W. Chen, “GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in 2015 USENIX Annual Technical Conference, USENIX ATC ’15, July 8-10, Santa Clara, CA, USA, S. Lu and E. Riedel, Eds. USENIX Association, 2015, pp. 375–386. [Online]. Available: https://www.usenix.org/conference/atc15/technical-session/presentation/zhu

X Biography Section