HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning

Hanning Chen¹, Yang Ni¹, Ali Zakeri¹, Zhuowen Zou¹, Sanggeon Yun¹, Fei Wen², Behnam Khaleghi³, Narayan Srinivasa⁴, Hugo Latapie⁵, and Mohsen Imani¹ ¹Department of Computer Science, University of California, Irvine, CA, USA
²Department of Electrical and Computer Engineering, Texas A&M University, TX, USA
³Qualcomm, CA, USA
⁴Intel Labs, CA, USA
⁵Cisco Systems, CA, USA
^∗ Corresponding author: [email protected]

Abstract.

In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network (GCN) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome for acceleration. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed.

Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named $\mathsf{HDReason}$ targeting FPGA platforms. On the algorithm level, $\mathsf{HDReason}$ achieves a balance between high reasoning accuracy, strong model interpretability, and less computation complexity. In terms of architecture, $\mathsf{HDReason}$ offers reconfigurability, high training throughput, and low energy consumption. When compared with NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6 $\times$ speedup and 65 $\times$ energy efficiency improvement. When conducting cross-models and cross-platforms comparison, $\mathsf{HDReason}$ yields an average 4.2 $\times$ higher performance and 3.4 $\times$ better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform.

1. Introduction

Knowledge Graph (KG), usually on massive scales (toutanova2015observed, ; dettmers2018convolutional, ; miller1995wordnet, ; mahdisoltani2014yago3, ), helping computers understand the relationship between each item of data. Each pair of connected vertices and the edge in between can be organized into a single triple that indicates a certain relation between two entities (also known as a fact). Knowledge Graph Completion (KGC), as a fundamental KG reasoning technique, infers new knowledge from incomplete facts by leveraging the semantics embedded in the graph structure. Therefore, it reasons upon vast arrays of information in a structured and meaningful way, crucial to various domains including question-answering, recommendation systems, and drug discovery (huang2019knowledge, ; zhang2018variational, ; wang2019knowledge, ; guo2020survey, ; zeng2022toward, ). Despite being widely applied for graph reasoning, KGC poses significant challenges, both algorithmically and in terms of hardware acceleration.

From an algorithmic perspective, existing algorithms, nam-ely the embedding, Graph Neural Network (GNN), and Reinforcement Learning (RL) based methods, all have their own drawbacks. Embedding-based techniques often yield low-quality results due to the limited representational capacity of their model (bordes2013translating, ; yang2015embedding, ). GNN-based methods are specially designed to extract features from graph structure (vashishth2019composition, ; schlichtkrull2018modeling, ; shang2019end, ), powerful in discovering more underlying semantics for higher quality reasoning. However, due to the extensive use of multi-hop and nonlinear vertex aggregation, they are computationally heavy, less transparent, and incur costly inference and training overheads when deployed on conventional hardware. RL-based approaches (xiong2017deeppath, ; stoica2020contextual, ; hildebrandt2020reasoning, ; wu2021findings, ; wang2020adrl, ) suffer from long latency and stability issues due to the exploration-exploitation dilemma of RL, making them less desirable for real-world applications (hildebrandt2020reasoning, ). They are also limited to single-direction reasoning only (as defined in Section 2.2) (wu2021findings, ; wang2020adrl, ).

The hardware accelerator challenges are equally daunting. As the amount of data grows, the size of KGs grows substantially and thereby necessitates hardware acceleration to make any existing KGC algorithm feasible. GPU platforms (wang2021gnnadvisor, ; yang2022gnnlab, ; yan2020characterizing, ) have always been a go-to choice for implementing and verifying new algorithms. However, deploying GNN-based approaches on GPUs (or TPUs) introduces a significant memory bottleneck. Challenges in GPU acceleration also arise with RL-based methods due to the training and inference being highly intertwined, which leads to large off-chip data traffic and limited parallelism at the algorithmic level (cho2019fa3c, ). On the other hand, FPGAs can take advantage of customization when co-designing the algorithm and hardware, thereby exploiting more parallelism in the algorithm, improving the overall energy efficiency, and reducing the substantial memory traffic (geng2020awb, ; sarkar2023flowgnn, ).

Unfortunately, current FPGA-based accelerators face multiple challenges when targeting KGC applications (sarkar2023flowgnn, ; geng2020awb, ; zeng2020graphact, ; lin2022hp, ). First, the vertex and edge embedding training play a pivotal role during the KGC process due to natural semantic attributes in KG. Existing FPGA accelerators for GNN-based methods, however, only focus on GNN propagation weight training. Second, most existing designs, including both FPGA and non-FPGA based (kiningham2022grip, ; geng2020awb, ; yan2020hygcn, ), rely on matrix-matrix multiplication accelerations which make it hard to support models that involve edge embedding. For example, the State-of-the-art design that supports edge-embedding (i.e., FlowGNN (sarkar2023flowgnn, )) only supports inference instead of training. Third, GNN and RL-based algorithms are not robust to quantization where model quality significantly deteriorates, limiting their usability in resource-constrained platforms like FPGA. To build successful KGC acceleration platforms based on FPGA, all three challenges need to be resolved. Therefore we need to redesign both algorithms and hardware acceleration to build an efficient and effective KGC framework.

In light of these challenges, Hyperdimensional Computing (HDC) emerges as a promising solution. Motivated by human brains that process, represent, and memorize information by high-dimensional neural signals, HDC is centered around the distributed high-dimensional vector representation, i.e., hypervectors (ge2020classification, ; karunaratne2020memory, ). These hypervectors are known to be holographic, meaning that information is evenly encoded in all vector elements, providing robustness to model quantization and representation redundancy. This holographicness also allows HDC to represent and manipulate both symbolic and numerical data in a transparent and interpretable way, with efficient and natural memorization and reasoning in its model (kleyko2023survey, ; zou2022biohd, ; imani2017voicehd, ; ni2022neurally, ). Unlike traditional Deep Neural Network (DNN) approaches, HDC is inherently hardware-friendly, especially for customizable hardware platforms like FPGAs (Hanning_ICCAD2022, ; F5-HD, ; imani2021revisiting, ; zou2021scalable, ; imani2019sparsehd, ; ni2023brain, ). The HDC model easily attains computation parallelism along hypervector directions due to its holographic nature. Moreover, the HDC model can achieve high learning accuracy at low bit precision, facilitating its mapping into FPGA computing logic.

These advantages help recent works in HDC (poduval2022graphd, ; nunes2022graphhd, ; kang2022relhd, ) to achieve transparent and efficient graph reconstruction, node classification, and graph matching; manifesting its potential for other learning and reasoning tasks over graphs. While HDC has been widely applied, it is not without its limitations, especially when deployed on traditional GPU/CPU platforms. The limited computation parallelism of CPUs restricts the HDC model’s learning throughput. The GPU’s limited on-chip memory and fixed datapath increase hypervector storage costs, thereby reducing memory efficiency. Additionally, for large-scale graph reasoning tasks, the GPU’s power efficiency is low. These limitations typically involve issues of scalability and efficiency in handling large-scale graph reasoning tasks. Furthermore, previous attempts to accelerate HDC on FPGAs have not explicitly considered the stringent demands of large-scale graph reasoning on computing and memory resources, leading to suboptimal solutions (imani2021revisiting, ; salamat2020accelerating, ). Specifically, previous HDC FPGA accelerator did not consider processing large-scale graph dataset with high sparsity and computation imbalance. Considering the large gap between the size of knowledge graph dataset and FPGA on-chip storage, we need to design customized datapath and scheduling algorithm to achieve high reasoning throughput.

To address these gaps, our work presents a novel KGC algorithm based on HDC, leveraging its intriguing properties for efficient and effective reasoning. We explore an algorithm-hardware codesign approach, combining the strengths of FPGA and HDC, to create an energy and resource-efficient graph reasoning framework named $\mathsf{HDReason}$ , which fully supports and accelerates the end-to-end KGC. The main contributions in this paper are listed as follows:

•

To the best of our knowledge, $\mathsf{HDReason}$ is the first FPGA-based framework for KGC acceleration. $\mathsf{HDReason}$ is also the first neurosymbolic HDC model targeting KGC applications, by leveraging hypervector operations for vertex and relation embedding vector encoding and transparent vertex neighbor information memorization.
•

On the algorithm part, $\mathsf{HDReason}$ uses brain-inspired HDC to reduce computation complexity while maintaining high reasoning accuracy and strong model interpretability. Thanks to the HDC model’s transparency, $\mathsf{HDReason}$ model is capable of discovering underlying semantics without bulky computation and training. HDC model’s high parallelism and low computation complexity also lift previous limitations on accelerator architecture design.
•

On the hardware side, targeting $\mathsf{HDReason}$ model, we propose several hardware optimizations including reusing encoded hypervectors, balancing vertex-to-vertex computation, and computing training backward gradients in the forward path (i.e., forward/backward co-optimizations). The proposed tuning techniques help $\mathsf{HDReason}$ achieve significant performance improvement compared to existing hardware platforms.
•

Through hardware and software codesign, the HDReason model supports low-bit precision while maintaining high reasoning accuracy. Its robustness is evident in maintaining accuracy even with limited hypervector dimensions during the reasoning process.

We evaluate $\mathsf{HDReason}$ performance on several large-scale KG datasets. In terms of reasoning efficiency, $\mathsf{HDReason}$ provides on average 4.2 $\times$ (and 8.3 $\times$ ) speedup and 3.4 $\times$ (and 9.1 $\times$ ) energy efficiency improvement with similar accuracy as compared to the state-of-the-art GCN training platform HP-GNN (lin2022hp, ) (and GraphACT (zeng2020graphact, )). When compared with RTX 4090 GPU, our accelerator achieves on average 10.6 $\times$ speedup and 65 $\times$ energy efficiency improvement.

2. Background

2.1. HDC Basics

HDC Encoding As a foundation of HDC, it maps data from the original space into the high-dimensional space (i.e., hyperspace) (imani2020dual, ; montagna2018pulp, ). Among a variety of HDC encoding methods (ge2020classification, ), we choose the state-of-the-art kernel-based HDC encoding (imani2020dual, ). It is inspired by kernel tricks, where the dot products between encoded hypervectors approximate values of a certain kernel. Suppose the original dimension of the data is $d$ and the hyperspace dimension is $D$ , as is shown in Figure 1(a). The mathematical equation of the encoding process is: $\vec{H}=tanh(\vec{e}\text{ }\mathbf{H^{B}})$ , where $\vec{e}$ is the original data with dimension $1\times d$ and $\textbf{H}^{\textbf{B}}$ is a matrix of base hypervectors (Base HDV) with dimension $d\times D$ . Each element of the Base HDV is generated randomly using the standard Gaussian distribution $\mathcal{N}(0,1)$ and stays constant.

Refer to caption — Figure 1. (a) HDC encoding example. (b) HDC memorization in graph learning. (c) Vertex neighbor reconstruction. (d) Score function example, TransE.

Hypervector Operations HDC has a set of well-defined hypervector operations, with the most important two being: (1) Bundling (or element-wise addition ” $+$ ”), which represents and memorizes a set of hypervectors. (2) Binding (or element-wise multiplication ” $\circ$ ”), which associates hypervectors for different concepts or symbols.

Graph Structure Memorization Bundling and binding operations give HDC the ability to represent rich combinations of symbols with hypervectors of the same size as each constituent, and more importantly, to efficiently memorize and later retrieve those symbols. Specifically in graph-related tasks, prior works utilize HDC to memorize each vertex’s neighbor information and perform graph learning (poduval2022graphd, ; nunes2022graphhd, ; kang2022relhd, ). Compared to GCN, these solutions serve as efficient, robust, and scalable alternatives. Figure 1(b) illustrates an example of memorizing neighbor information into a single hypervector. Suppose we have vertex $i$ and its neighbor vertex is represented as $N(i)$ . Then we have the mathematical equation of vertex neighbor memorization as:

(1)

\vec{M^{v}_{i}}=\Sigma_{j\in N(i)}\vec{H_{j}^{v}}

Here $\vec{M_{i}}$ is the memory hypervector of vertex $i$ .

Vertex Neighbor Reconstruction After generating memory hypervectors for all vertices, we can perform several different learning tasks such as node classification and graph matching (nunes2022graphhd, ; poduval2022graphd, ), similar to DNN based model like GNN. Furthermore, a property unique to HDC is that the neighbor information of each vertex can be reconstructed (poduval2022graphd, ), specifically, we can determine whether two vertices are connected. As is shown in Figure 1 (c), to test whether two vertices are connected, we have:

(2)

R_{ij}=\delta(\vec{M^{v}_{i}},\vec{H^{v}_{j}})

Here $R_{ij}$ is the possibility that vertex $i$ and vertex $j$ are connected. $\delta$ is the vector distance measuring function such as Hamming distance function or cosine similarity function.

2.2. Knowledge Graph Link Prediction

Table 1. Features of HDReason in comparison with prior works.

Accelerator	GraphACT (zeng2020graphact, )	HP-GNN (lin2022hp, )	FlowGNN (10071015, )	PyG (vashishth2019composition, )*	HDReason
Platfrom	FPGA	FPGA	FPGA	GPU	FPGA
Model Type	GNN	GNN	GNN	GNN	HDC
Reasoning Support	No	No	No	Yes	Yes
Edge Embedding	No	No	Yes	Yes	Yes
Embedding Training	No	No	No	Yes	Yes
Interpretability	Low	Low	Low	Low	High
Energy Efficiency	High	High	High	Low	High
Computation Reuse	No	No	No	No	Yes

Figure 2(a) illustrates the basic concepts of KG and provides an example of the KGC task. A KG can be formally represented as a directed graph $G=\{(v,\,r,\,u)\>|\>u,v\in\xi,\>r\in R\}$ , where $\xi$ is the set of entities (vertices) and $R$ is the set of relations (vashishth2019composition, ). Each directed edge in a KG represents a factual statement and is defined as a fact triple $l=(v,\,r,\,u)$ . As shown in Figure 2, (Anne Hathaway, born_in, New York City) is a fact triple, which presents the statement Anne Hathaway was born in the New York City. Notice that relations are directed edges to distinguish the subject and object in factual statements.

In various KG-related reasoning applications, we focus on the KGC task due to its elevated computational demands and training challenges (chen2020review, ). The KGC task is also referred to as KG link prediction. As a low-level logic reasoning (gao2021quatde, ), KGC aims to infer nonexistent relations from available knowledge. An example task is to find a new triple like (Anne Hathaway, born_in, ?) in the context of Figure 2(a).

Prior works generally handle two types of KGC: single direction reasoning and double direction reasoning. The former is limited to finding a possible connection from the source vertex to the destination vertex. And the latter requires finding both positive connection (v,r,?) and negative connection (?,r,u).

2.3. Score Function

The score function measures the distance between two nodes given its relation type (zheng2020dgl, ) in a KG link prediction task. For a subject $\vec{e^{v}_{i}}$ and the relation $\vec{e^{r}_{k}}$ , if vertex $j$ is connected with vertex $i$ via relation $r_{k}$ , the mapping function f is defined as:

(3)

\vec{e^{v}_{j}}\approx f(\vec{e^{v}_{i}},\vec{e^{r}_{k}})

Figure 1(d) provides an example of TransE (bordes2013translating, ), a type of score function. TransE treats the relationship as a translation vector so that the embedded entities are connected by relation r with low error. The TransE score function is defined as:

(4)

score_{j}=N_{x}(\vec{e^{v}_{i}}+\vec{e^{r}_{k}}-\vec{e^{v}_{j}})

Here $N_{x}$ stands for the norm function. $\vec{e^{v}_{i}}$ , $\vec{e^{r}_{k}}$ , and $\vec{e^{v}_{j}}$ can be also treated as subject vector, relation vector, and object vector. From a semantic point of view, equations 3 and 4 stand for the subject being connected with the object via a specific relation. Given the query subject and relation, we use the TransE score function to calculate the score of every vertex in the graph. A larger score means that this vertex has a higher chance of being connected with the query subject vertex via the query relation.

2.4. KGC Challenge and Motivation

In table 1, we summarize prior efforts that try to speed up GNN models. Historically, most GNN-based model accelerations leveraged GPUs built on the PyG framework (vashishth2019composition, ). While GPUs are versatile computing platforms, they suffer from static data paths and reduced energy efficiency (lacey2016deep, ). In contrast, FPGAs offer a balance between computational parallelism and energy consumption, making them suitable for enhancing GNN-based models. Nevertheless, it is challenging to apply existing FPGA-based GCN training accelerators directly to solve KGC tasks (zeng2020graphact, ; lin2022hp, ). One major challenge is that KG datasets incorporate semantic data for both vertices and edges. The cutting-edge FPGA-based accelerator, FlowGNN (sarkar2023flowgnn, ), does support these embeddings, but solely for GNN model inference. Furthermore, current FPGA accelerators typically focus on GNN propagation weight training, without supporting the crucial vertex and relation embedding training. Given the semantic richness of KG datasets, this training is central to the entire reasoning process. KGC training on GPU suffers from large-scale GPU memory usage for updating large-scale embeddings, extensive epochs for learning semantic information, and complexities in integrating accelerated score functions and graph models on FPGA platforms. A hardware-software approach is vital for an optimal FPGA framework tailored for KGC training.

3. HDReason Model

Table 2. HDReason Notation

Notation	Definition	Dimension
$\|V\|$	KG Vertex Size	$\sim$
$\|R\|$	KG Relation Size	$\sim$
d	The original embedding dimension	$\sim$
D	The hyperspace dimension	$\sim$
$\textbf{e}^{\textbf{v}}$	The original space vertex embedding	$\|V\|\times d$
$\textbf{e}^{\textbf{r}}$	The original space relation embedding	$\|R\|\times d$
$\textbf{H}^{\textbf{B}}$	The encoding base hypervector matrix	$d\times D$
$\textbf{H}^{\textbf{v}}$	The vertex hypervector matrix	$\|V\|\times D$
$\textbf{H}^{\textbf{r}}$	The relation hypervector matrix	$\|R\|\times D$
$\textbf{M}^{\textbf{v}}$	The vertex memory hypervector matrix	$\|V\|\times D$
$\textbf{M}_{\textbf{q}}^{\textbf{v}}$	Query subject hypervector	$1\times D$
$\textbf{M}_{\textbf{q}}^{\textbf{r}}$	Query Relation hypervector	$1\times D$
$N(\cdot)$	The normalization function	$\sim$
P	The scores for each vertex	$1\times\|V\|$

In this section, we extend the original HDC graph learning model (nunes2022graphhd, ) to conduct KGC or link prediction. Figure 2(b) illustrates the reasoning process. Table 2 lists the notations used in the $\mathsf{HDReason}$ model. A KG denoted as G( $\xi$ , $R$ ) is consists of $|V|$ vertices and $|R|$ relations.

3.1. HDReason Link Prediction

The embedding vectors for all vertices comprise the embedding matrix $\textbf{e}^{\textbf{v}}$ , with a dimension of $|V|\times d$ . Similarly for the relations, we have an embedding matrix $\textbf{e}^{\textbf{r}}$ of size $|R|\times d$ . To map all vertices and relations from the original embedding space into hyperspace, we have the matrix multiplication equation:

(5)

\mathbf{H^{v}}=tanh(\mathbf{e^{v}}\mathbf{H^{B}})

(6)

\mathbf{H^{r}}=tanh(\mathbf{e^{r}}\mathbf{H^{B}})

$\textbf{H}^{\textbf{v}}$ and $\textbf{H}^{\textbf{r}}$ are the encoded vertex and relation hypervector matrix with dimension $|V|\times D$ and $|R|\times D$ respectively.

After generating the vertex and relation hypervector matrix, we use an aggregation operation over the vertex and its neighbors to generate the corresponding memory hypervector as stated in equation 1. Given that each edge now also has a hyperspace mapping, it’s essential to first execute binding operations between each vertex hypervector and its related relation hypervectors, ensuring the vertex data is correctly linked with its relevant relation information:

(7)

\vec{M^{v}_{i}}=\Sigma_{(j,r)\in N(i)}\vec{H^{v}_{j}}\circ\vec{H^{r}_{r}}

, where $\circ$ represents element-wise Hadamard product. Equation 7 can also be rewritten in a matrix-to-matrix multiplication (M2MM) format:

(8)

\mathbf{M^{v}}=\Sigma_{r\in R}((\mathbf{A^{r}}\mathbf{H^{v}})\circ\mathbf{E^{r}})

$\mathbf{A^{r}}$ is the adjacency matrix over relation r, with a dimension of $|V|\times|V|$ . If there exists an entity ( $v_{i}$ , $r$ , $v_{j}$ ) $\in$ $\xi$ , then $A^{r}_{ij}=1$ , otherwise $A^{r}_{ij}=0$ . Matrix $\mathbf{E^{r}}$ with dimension $|V|\times D$ is the concatenation of relation embedding hypervectors across all vertices:

(9)

\vec{E^{r}_{i}}=\vec{e_{r}}\quad i\in[0:|V|-1]

After obtaining vertex memory hypervectors, we use a score function such as the backend decoder to perform link prediction tasks. This encoder-decoder structure is widely adopted by previous GNN-based works (schlichtkrull2018modeling, ; shang2019end, ; vashishth2019composition, ). Here we choose to use TransE (bordes2013translating, ) as the score function. Assuming the input query is subject $\vec{e^{v}_{i}}$ and relation $\vec{e^{r}_{k}}$ . To find the object vertex that is connected to subject vertex $v^{i}$ via relation $r^{k}$ , we use the following vertex score function:

(10)

\vec{P}=Sigmoid(|M^{v}_{i}+H^{r}_{k}-\mathbf{M^{v}}|_{1}+bias)

The score vector P has a shape of $|V|\times 1$ , with its element representing the possibility of a vertex connecting with vertex $v_{i}$ through relation $r$ . In the equation above, the vectors $M^{v}_{i}$ and $H^{r}_{k}$ are first broadcasted to the same shape as the matrix $\mathbf{M^{v}}$ . The L1 normalization is carried out only on one axis of the matrix, giving a vector form.

3.2. HDReason Model Training

On top of previous HDC graph learning works (poduval2022graphd, ; nunes2022graphhd, ; kang2022relhd, ), we introduce the original space embedding to vertices and relations; we also add a score function to the HDC graph model. Now we can conduct backpropagation and train the original embedding model ( $\textbf{e}^{\textbf{v}}$ and $\textbf{e}^{\textbf{r}}$ ):

(11)

\nabla_{\mathbf{e^{v}}}\mathbf{L}=\frac{\partial\mathbf{L}}{\partial\mathbf{P}}\frac{\partial\mathbf{P}}{\partial\mathbf{M^{v}}}\frac{\partial\mathbf{M^{v}}}{\partial\mathbf{H^{v}}}\frac{\partial\mathbf{H^{v}}}{\partial\mathbf{e}^{v}}

(12)

\nabla_{\mathbf{e^{r}}}\mathbf{L}=\frac{\partial\mathbf{L}}{\partial\mathbf{P}}\frac{\partial\mathbf{P}}{\partial\mathbf{H^{r}}}\frac{\partial\mathbf{H^{r}}}{\partial\mathbf{e}^{r}}

Here L is the training loss and P includes link prediction scores for vertices. The dimension of P and L is $1\times|V|$ .

Although both $\mathsf{HDReason}$ and previous GNN model (schlichtkrull2018modeling, ; vashishth2019composition, ; shang2019end, ) use front-end encoder (graph model) and backend score function structure, the base hypervector matrix remains fixed in hyperdimensional computation. This leads to more efficient learning since HDC only updates the original embedding of vertex and relation.

3.3. The Interpretability of HDReason

Compared to conventional GNN models (schlichtkrull2018modeling, ; vashishth2019composition, ; shang2019end, ), HDC boasts enhanced memorization and information-retrieval proficiencies, allowing for the reconstruction of neighboring data for each vertex, as showcased in (poduval2022graphd, ). As a result, the memorization hypervector ( $\textbf{M}^{\textbf{v}}$ ) of each vertex symbolically represents its neighboring vertex’s details. In contrast to the standard GNN model, $\mathsf{HDReason}$ offers superior interpretability and is more aligned with the principles of artificial general intelligence (AGI) as indicated in (latapie2021metamodel, ).

4. FPGA Acceleration Design

4.1. Architecture Overview

This section introduces the architecture of $\mathsf{HDReason}$ , tailored for a CPU-FPGA heterogeneous computing platform, as illustrated in Figure 3. Our design is comprised of two parts: (1) the CPU component (host side) and (2) the FPGA component (kernel side), the latter one is where the original vertex and relation embeddings are managed and updated. The CPU-based scheduler is responsible for offloading vertex and relation data to the FPGA kernel. Within the FPGA kernel, the Encoder IP maps the vertex and relation embedding from the original space into the hyperspace and then performs the HDC memorization process. Every vertex’s resulting memorization hypervector is stored in the high bandwidth memory (HBM). The Score function IP then calculates and dispatches the reasoning outcome back to the host CPU. During the backpropagation phase, the bulk of gradient computations are offloaded to the Training IP.

4.2. Graph Memorization Acceleration

The KG memorization includes two parts. The first part is an out-of-order (OoO) scheduler running on CPU trying to balance different vertices’ computations. The second part is a hardware acceleration IP conducting hypervector encoding and vertex neighbor aggregation.

4.2.1. Density-aware Scheduler on CPU

A scheduler running on the host CPU is used to balance the computation between different vertices on the FPGA. Figure 4 presents a scheduling example. A KG (Figure 4(a)) generally could be represented in two different formats: the triple format used in Figure 4(b), or the adjacency matrix format in Figure 4(c). As mentioned in previous work (10071015, ), if we keep both vertex and relation embedding for KGC (Figure 4(d)), the adjacency matrix becomes a 3D matrix (Figure 4(c)). However, existing hardware, neither GPU nor FPGA can support 3D matrix operations. Therefore modern GCN acceleration frameworks use scatter and reduce operations instead of sparse matrix-multplication (SpMM) to implement GNN (sarkar2023flowgnn, ). However, directly applying scatter and gather operation kernel on the sparse graph will bring computation imbalance problem (song2022sextans, ). Our proposed scheduler can instantly balance the computation workloads of different vertices’ aggregation operations, and thereby help sustain a high throughput on the kernel FPGA.

The key idea of our scheduler is to use triples representation and adjacency representation together during the offloading process. Suppose that the maximum vertex parallelism on the FPGA kernel is $\textbf{N}_{\textbf{c}}$ . To balance the computation of each vertex, we store the index of vertices with the same neighbor size (Figure 4(e)) into a list and offload these vertices to the FPGA when the length of the list equals $N_{c}$ . In that way, each vertex’s aggregation has the same level of computation complexity, as their neighbor sizes are identical (Figure 4(f)).

Beyond ensuring balanced computations across vertices, the scheduler’s pivotal role is to minimize redundant encoding operations. Typically, HDC models begin by encoding all data into hypervectors before making any predictions. Yet, in the context of KGC, the sheer volume of triples often surpasses the count of vertices and relations. Thus, using the same encoding-prediction paradigm may incur redundant hypervector encoding operations. To counter this inefficiency, we opt to store already encoded vertex hypervectors directly in the FPGA’s memory, like HBM, while executing HDC encoding only for unencoded vertices. On the CPU side, we maintain a HashMap that associates a vertex index (as the key) with its respective hypervector address (as the value)in the FPGA memory. If the vertex is not encoded, the scheduler will add the vertex’s original embedding vector into the data buffer ( $\textbf{B}_{\textbf{d}}$ ), otherwise, the corresponding FPGA memory address is added ( $\textbf{f}_{\textbf{1}}$ ). Meanwhile, a control signal ( $\textbf{f}_{\textbf{2}}$ ) indicating whether this vertex has been encoded and which vertex it is connected to is added to the control signal buffer ( $\textbf{B}_{\textbf{c}}$ ). After filling the buffers, the CPU offloads the HDC computation into the FPGA kernel.

4.2.2. Encoder IP Architecture Design

Figure 5 (a) shows the encoder IP architecture design. In every epoch, the host CPU offloads vertex embedding into the kernel FPGA via direct memory access (DMA) IP. The hyperdimensional encoding includes two steps. In the first step, each vertex embedding ( $1\times d$ ) is multiplied with the base hypervector matrix via a systolic array IP ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$1$}}\cr}}}$ ). Figure 5 presents the microarchitecture of each PE IP. Here we also pipeline the vertex-to-vertex computation into $|L|$ stages, where $|L|$ is the number of vertices that have not been encoded. The value of $|L|$ varies across different FPGA kernel calls. Subsequently, each encoded hypervector is passed through kernel function (here we choose tanh) ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$2$}}\cr}}}$ ). The encoded vertex hypervectors ( $\textbf{H}_{\textbf{v}}$ ) are then pushed into a FIFO awaiting the Dispatcher IP’s processing ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$3$}}\cr}}}$ ).

To improve the HDC model’s training throughput, it is important to reuse encoded hypervectors. In our design, the Dispatcher IP plays a pivotal role in achieving this goal. The Dispatcher IP in Figure 5(a) has three functionalities. The first is writing the newly encoded vertex hypervectors into the FPGA memory (e.g. HBM in Figure 5(a)) and returning the assigned address to the host CPU ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$8$}}\cr}}}$ ). The second is dispatching each vertex and its corresponding neighbor hypervectors and relation hypervectors into the Memorization Computing IP ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$6$}}\cr}}}$ ). The third is managing the on-chip memory storage (such as UltraRAM) and performing replacement policy when necessary ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$4$}}\cr}}}$ and $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$5$}}\cr}}}$ ). Ideally we prefer to store all encoded hypervector on-chip to avoid redundant computation. However, the vertex size of the graph dataset is usually too large to store all vertex hypervectors on-chip. Therefore, we need a mechanism to efficient manage on-chip hypervectors like cache replacement mechanism (hmmu, ; openmem, ; softhint, ). The strategy that we choose in Figure 5 is to store all the relation embedding hypervectors ( $\textbf{H}_{\textbf{r}}$ ) and only part of the vertex hypervectors. A HashTable is implemented with content-addressable memory (CAM) inside the Dispatcher IP. The key of the HashTable is the vertex id and the value is its corresponding hypervector address on Ultarscale RAM. If the encoded vertex id already resides in CAM, the hypervector will be loaded into the $H^{v}$ FIFO, while the corresponding relation hypervectors will also be loaded into the $H^{r}$ FIFO. If there is no room for a vertex hypervector in the on-chip UltraRAM, the Dispatcher IP will choose a victim hypervectors inside UltraRAM and replace it with the target vertex hypervectors from HBM. Here we choose the classic replacement algorithm such as last recent use (LRU), last frequently use (LFU), and random replacement policy (lee1999existence, ).

There are $N_{c}$ Memorization Computing IP on-chip that can perform neighbor aggregation operation for $N_{c}$ vertices concurrently. We pipeline the aggregation operation of each vertex’s neighbors. The binding operation between the vertex hypervector and relation hypervector is parallelized among computing units (CU), whose microarchitecture is shown in Figure 5(c). Please note that we also need to compute the gradient of memory hypervector $\frac{\partial M^{v}}{\partial H^{v}}$ :

(13)

\frac{\partial\mathbf{M^{v}}}{\partial\mathbf{H^{v}}}=\Sigma_{r\in R}\mathbf{A_{r}}\mathbf{E^{r}}

Here matrix $A_{r}$ is a $|V|\times|V|$ matrix and is the same in equation 8. The gradient of each vertex’s memory hypervector ( $M^{v}$ ) is the sum of its connected edges’ relation hypervectors. Therefore, the CUs can execute the forward memorization and backward gradient computation together.

4.3. Score Function Acceleration

After generating memory hypervectors ( $\textbf{M}^{\textbf{v}}$ ), the next step is to perform reasoning tasks over specific triples. For each inference or training epoch, suppose the batch size is $|B|$ . As is shown in Figure 6.(a), at the initial stages ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$a$}}\cr}}}$ and $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$b$}}\cr}}}$ ), we load query vertex hypervector matrix ( $\textbf{M}_{\textbf{q}}^{\textbf{v}}$ ) and query relation hypervector matrix ( $\textbf{M}_{\textbf{q}}^{\textbf{r}}$ ) onto the on-chip buffer from FPGA storage (HBM). As our goal is to predict whether each triple exists or not, we pipeline the score function calculation process and only load one memory hypervector at a time ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$c$}}\cr}}}$ ), instead of loading the entire memory hypervector ( $M^{v}$ ). In Figure 6, we assume that the Score function IP is evaluating vertex i over the query batch. Since the query batch contains $|B|$ subject and relation combinations, we replicate the loaded memory hypervector into $|B|$ on-chip buffers ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$d$}}\cr}}}$ ). Next, the query hypervectors ( $M_{q}^{v}$ and $M_{q}^{r}$ ) and the memory hypervectors ( $M^{v}_{i}$ ) are loaded into $|B|$ Score Engine units, with each unit calculating the corresponding prediction score for each query.

Figure 6(b) presents a single Score Engine IP’s microarchitecture design. Suppose the illustrated Score Engine unit in Figure 6(b) has an index of $j$ , then it calculates $j^{th}$ vertex of the batch’s score over vertex $i$ . The first step is to perform hypervector addition over subject HDV (query vertex HDV $M^{v}_{j}$ ) and relation HDV ( $M^{r}_{j}$ ) to obtain the object HDV ( $M^{o}_{j}$ ). The second step is to calculate the hypervector distance between $M^{o}_{j}$ and $M^{v}_{i}$ . The intermediate prediction result( $P^{I}_{j}$ ) is loaded into the L1 Norm IP for L1 normalization over the hyperspace, resulting in the normalized prediction ( $N^{p}_{j}$ ) and its gradient with respect to the queried vertex and relation hypervector. Notice that these two gradients are actually equal in value since they are added together in the score function: $\frac{\partial N^{p}_{j}}{\partial M^{v}_{j}}=\frac{\partial N^{p}_{j}}{\partial M^{r}_{j}}$ . As mentioned in section 4.2, to reduce the computation cost of back-propagation and avoid unnecessary data movement, inside the L1 Norm IP, we perform gradient computation of $\frac{\partial N^{p}_{j}}{\partial M^{v}_{j}}$ during the forward propagation, as shown in Figure 6(c) and Figure 6(d). Inside L1 Norm IP, there are $D$ Norm Units in total. Each of them extracts the absolute values and signs of elements of the input hypervector. The results are then fed into Tree Adder IP and a concatenate IP (Concat IP) to generate norm prediction $N^{p}_{j}$ and $\frac{\partial N^{p}_{j}}{\partial M^{v}_{j}}$ respectively.

Until now, after the execution of Score Engine IP, we have each batch member’s both forward path and backward path results. Next we pass the forward path result ( $N^{p}_{j}$ ) and backward gradient result ( $\frac{\partial N^{p}_{j}}{\partial M^{v}_{j}}$ ) to a Concat IP ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$e$}}\cr}}}$ ) and Tree Adder IP ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$f$}}\cr}}}$ ) respectively. To generate the final result, we need to concatenate each batch member’s L1 norm result ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$g$}}\cr}}}$ ) and accumulate all batch member’s gradient hypervectors ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$h$}}\cr}}}$ ). In the end, we will pass the forward path result ( $N^{p}$ ) back to the CPU for further processing, such as the Sigmoid function ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$i$}}\cr}}}$ ). The gradient result ( $\frac{\partial N^{p}_{j}}{\partial M^{v}_{j}}$ ) instead, will be stored inside the FPGA storage (such as HBM) for later training back-propagation usage ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$j$}}\cr}}}$ ).

4.4. Symbolic Training Acceleration

Figure 7 presents the vertex embedding training computing flow. For the readability of the figure, we omit the detail of relation embedding training as it is similar to Figure 7. For illustration convenience, we rewrite the backpropagation equation:

(14)

\nabla_{\mathbf{e^{v}}}\mathbf{L}=\frac{\partial\mathbf{L}}{\partial\mathbf{N}}\frac{\partial\mathbf{N}}{\partial\mathbf{N^{p}}}\frac{\partial\mathbf{N^{p}}}{\partial\mathbf{M^{v}}}\frac{\partial\mathbf{M^{v}}}{\partial\mathbf{H^{v}}}\frac{\partial\mathbf{H^{v}}}{\partial\mathbf{e}^{v}}

We first rewrite the first two terms as:

(15)

\mathbf{\delta^{v}}=\frac{\partial\mathbf{L}}{\partial\mathbf{N}}\frac{\partial\mathbf{N}}{\partial\mathbf{N^{p}}}

We will calculate each batch member’s $\delta^{v}$ on CPU and concatenate them together. The generated new matrix is named $\mathbf{\delta}$ whose dimension is $|B|\times|V|$ where $|B|$ is the batch size and $|V|$ is the vertex size of the graph. Since graph size $|V|$ could be very large (over 10K), we cut $\delta$ into several chunks ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$A$}}\cr}}}$ ). The size of each chunk is $|B|\times T$ . In Figure 7, we suppose the chunk index is i. So this chunk could be represented as $\delta[i*T:(i+1)*T-1]$ . We also pipeline the computation between different chunks which means we will update T vertex embedding at each period.

Next we pass the $\delta[i*T:(i+1)*T-1]$ to the kernel FPGA ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$B$}}\cr}}}$ ). In section 4.2 and 4.3, we already compute the gradient result of $\frac{\partial N^{p}}{\partial M^{v}}$ and $\frac{\partial M^{v}}{\partial H^{v}}$ during the forward propagation and stored them inside the FPGA storage (such as HBM). Also the gradient of $\frac{\partial\mathbf{H^{v}}}{\partial\mathbf{e}^{v}}$ is equal to the transpose of base HDV ( $\textbf{H}^{\textbf{B}}$ ). We only need to load precomputed matrices into the on-chip buffer ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$C$}}\cr}}}$ , $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$E$}}\cr}}}$ , and $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$G$}}\cr}}}$ ) and use two systolic arrays (SA) IP and one element-wise multiplication (MUL) IP to perform the actual computation ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$B$}}\cr}}}$ , $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$D$}}\cr}}}$ , and $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$F$}}\cr}}}$ ). After the computation of systolic array IP 2 (SA2), we will get T vertex’s gradient result $\nabla$ L which will be passed back to the host CPU ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$H$}}\cr}}}$ ). The last step is updating the T vertex embedding model ( $\mathbin{\vphantom{\circledast}\text{\ooalign{\smash{\raisebox{-2.58334pt}{\scalebox{2.3}{$\bullet$}}}\cr\smash{{\color[rgb]{1,1,1}\bf\footnotesize$I$}}\cr}}}$ ) which is also the last stage of the whole pipeline computation.

5. Experiments

5.1. Experimental Setup

Table 3. KGC datasets statistics.

Dataset	Entities	Relations	Train	Valid	Test	Avg. degree
FB15K-237	14541	237	272115	17535	20466	18.71
WN18RR	40943	11	86835	3034	3134	2.12
WN18	40943	18	141442	5000	5000	3.45
YAGO3-10	123182	37	1079040	5000	5000	8.76

We benchmark $\mathsf{HDReason}$ ’s KGC accuracy over two standard datasets: FB15K-237 (toutanova2015observed, ) and WN18RR (dettmers2018convolutional, ) which are widely used for graph-related tasks. For $\mathsf{HDReason}$ ’s hardware acceleration performance, we also added two large-scale datasets: WN18 (miller1995wordnet, ) and YAGO3-10 (mahdisoltani2014yago3, ). Table 3 lists the properties of these four KG datasets. Our platform comprises an Intel i9-12900KF CPU as the host, along with a Xilinx Alveo U280 (or U50 depending on the configuration) card to implement the FPGA kernels. We implement the kernel accelerator using SystemVerilog and synthesize it using Xilinx Vivado. The communication between the host CPU and kernel FPGA is supported by the Xilinx Vitis platform via PCIe direct memory access (PCIe DMA) (kathail2020xilinx, ).

5.2. HDReason Reasoning Results

Table 4. Knowledge Graph Model Comparison.

Model	CompGCN (vashishth2019composition, )	SACN (shang2019end, )	R-GCN (schlichtkrull2018modeling, )	TransE (bordes2013translating, )	HDR
d	100	100	100	150	128
D	150	100	100	$\sim$	256
layer	2	1	2	$\sim$	$\sim$
$f_{score}$	TransE	Conv-TranE	DistMult (yang2015embedding, )	$\sim$	TransE
Training Part	embedding and weight			only embedding

Figure 8.(a) and Figure 8.(b) present the $\mathsf{HDReason}$ ’s reasoning accuracy over FB15K-237 and WN18RR. For double direction reasoning, we compare our HDC-based model with other state-of-the-art graph models including TransE (bordes2013translating, ), R-GCN (schlichtkrull2018modeling, ), SACN (shang2019end, ), and Comp-GCN (vashishth2019composition, ). For single direction reasoning, we compare our HDC-based model with other state-of-the-art reinforcement learning (RL) models including MINERVA (das2017go, ), C-MINERVA (stoica2020contextual, ), R2D2 (hildebrandt2020reasoning, ), RARL (wu2021findings, ), and ADRL (wang2020adrl, ). Table 4 summarizes critical model parameters including original vertex embedding dimension d and final embedding dimension D (after encoding and aggregation). Figure 8.(a) shows that $\mathsf{HDReason}$ achieves relatively high reasoning accuracy compared to state-of-the-art GCNs such as CompGCN and SACN. Besides, $\mathsf{HDReason}$ only needs to train vertex and relation embedding, which simplifies the model training computation complexity.

Figure 9.(a) illustrates the HDR model’s high robustness during KGC tasks. After encoding and memorization, we attempt to drop some hypervector dimensions before entering the score function. HDC model achieves significant robustness when dropping dimensions with low entropy. This attribute is also reported by previous HDC work (zou2021scalable, ). This characteristic allows the HDC model to be deployed on resource-constrained platforms by maintaining only a portion of the model’s hypervector weight. On the other hand, Figure 9.(b) demonstrates that, when compared to a traditional GNN-based model (specifically SACN), HDR supports aggressive low-bit quantization. Here we try to quantize HDR and SACN into different fixed-point numbers using QPyTorch (zhang2019qpytorch, ). When quantizing the HDR model from floating-point numbers to 4-bit fixed-point numbers, it experiences only a 5% drop in accuracy, whereas the GNN-based model experiences a 45% accuracy drop. This indicates that HDR is well-suited for low-bit precision computation platforms, such as FPGA.

5.3. FPGA Resource Utilization

Table 5. Resource usage on Xilinx Alveo U50 FPGA.

	LUT	FF	BRAM	UltraRAM	DSP
Available	872K	1743K	1344	640	5952
Encoder IP	281.6K	152K	184	135	1024
Score Function IP	238.9K	417.1K	0	0	0
Training IP	7.6K	8.7K	0	0	1536
HBM	544	437	2	0	0
Others	91.2K	88.9K	124	0	0
Total	620K	667.2K	310	135	2560
Percentage	71.1%	38.2%	23.1%	21%	43%
Freqeuency	200 MHz
Power	36.1 W

Table 6. HDReason’s single batch training absolute results of latency and energy consumption on FPGA and GPU.

	Xilinx Alveo U50					NVIDIA RTX 3090
	Tech: UltraScale+ 14 nm FinFET; Freq: 200MHz					Tech: Ampere TSMC 8nm; Freq: 1.66GHz
	Memory Type: HBM2; Memory Bandwidth: 460GB/s					Memory Type: GDDR6 Memory Bandwidth: 936.2 GB/s
	FB15K-237	WN18RR	WN18	YAGO3-10		FB15K-237	WN18RR	WN18	YAGO3-10*
Latency (ms)	6.21	9.01	10.03	30.31	Latency (ms)	60.01	91.01	93.62	219.6
Energy (J)	0.21	0.29	0.31	0.93	Energy (J)	20.88	30.48	30.89	65.31
Memory (MB)	33	84	86	245	Memory (MB)	9608	23360	18690	22498

Table 5 presents the FPGA resource utilization of $\mathsf{HDReason}$ . The vertex initial embedding dimension (d) is 96 and the encoded hypervector dimension (D) is 256. The training batch size is 128 and pipeline chunk size T is 32. We use 8 HBM pseudo channels (PCs) and AXI with 256-bit data width: 4 PCs are used to store vertex ( $\textbf{H}^{\textbf{v}}$ ) and memorization hypervectors ( $\textbf{M}^{\textbf{v}}$ ); while another 4 PCs hold gradients including $\frac{\partial N^{p}}{\partial M^{v}}$ and $\frac{\partial M^{v}}{\partial H^{v}}$ . We save relation hypervectors ( $\textbf{H}^{\textbf{r}}$ ) in the on-chip UltraRAM. The ”Other” section in Table 5 includes all the Xilinx IP resource usage such as AXI Interconnect IP and PCIe DMA (feist2012vivado, ).

5.4. HDreason Acceleration on FPGA

Table 6 tabulates the execution time, energy, and memory usage of $\mathsf{HDReason}$ on FPGA and GPU. Our GPU implementation uses PyTorch (paszke2019pytorch, ) and PyG (fey2019fast, ). For a fair comparison, we specify both FPGA and GPU’s training batch size as 128 except YAGO3-10 on GPU. We collect the FPGA power using Xilinx Power Estimator (XPE) and GPU power using NVIDIA Management Library (NVML). When testing $\mathsf{HDReason}$ on RTX 3090 using the YAGO3-10 dataset, due to limited GPU memory (24 GB) and large-scale graph vertex size (over 120K vertices), we decrease the training batch size from 128 to 32. As for speedup, our accelerator shows, on average, over 9 $\times$ improvement compared to GPU even Alveo U50 adopts less advanced fabrication technology compared to RTX 3090 (14 nm FinFET on FPGA vs 8 nm TSMC on GPU). Our FPGA-based accelerator shows notable advantages over GPU in terms of speed, energy, and memory efficiency. This large advantage comes from our hardware optimization shown in Figure 8.(c), including reusing encoded hypervectors, a density-aware scheduler to balance the computation, and computing backward gradients in the forward path. For energy efficiency, our design shows over 95 $\times$ improvement compared to RTX 3090. For memory efficiency, as we cut the large vertex embedding gradient computation into small chunks and pipeline computation between different chunks, we avoid using unnecessary intermediate memory on the FPGA. Therefore, our FPGA accelerator supports a high batch size on large-scale graph datasets (like YAGO3-10).

Figure 8.(d) shows the execution time breakdown during each single batch training. Here, CPU time includes data communication time between host CPU and kernel FPGA, CPU gradient computation (equation 15), and vertex and relation embedding update. Since we compute partial backward gradient in the forward path, in Figure 8.(d), actual training computation (Training) only takes a very small part of the overall training process. The memorization (Mem) execution time instead takes over 50% total execution time. In section 5.5, we will discuss how to decrease the memorization execution time by tuning the accelerator’s parameters.

5.5. HDReason Memorization Improvement

In this section, we discuss the effects of vertex replacement policy and on-chip storage usage over memorization speed. Compared to previous GCN acceleration works, our accelerator reuses encoded hypervectors such that the main computation overhead switches from matrix multiplication to the data transfer between FPGA and HBM. In Figure 10, we show the changes in memorization speed and FPGA HBM data communication with different policies and memory usage. In Figure 10, those UltraRAMs are only used to store vertex hypervectors ( $\textbf{H}^{\textbf{v}}$ ). Due to the limited on-chip storage, the FPGA will need to replace hypervectors currently stored on-chip with what the incoming ones needed for performing aggregation operations. This process is like a cache miss in the CPU memory hierarchy. Those access misses will bring unnecessary data communication between FPGA and HBM. As is shown in Figure 10, when on-chip UltraRAM usage is small (such as 64 UltraRAM), then data transfer between FPGA and HBM is large. As expected, with the increased usage of on-chip UltraRAM, both the memorization time and FPGA-HBM communication time decrease. This trend is especially noticeable for large-scale datasets like YAGO3-10 (Figure 10.(d)).

In Figure 10, we present the influence of different replacement policies. We tried three different replacement policies including random replacement (Random), last frequently use (LFU), and last recent use (LRU). On average, LFU achieves the best performance and 8% better than Random. To further extend our findings, future work could involve designing specific replacement policies targeting KG datasets.

5.6. Cross Models and Platforms Comparison

Figure 11 presents the speedup and energy efficiency comparison between different models on different hardware platforms. Specifically, we show 4 different graph models on 8 different hardware platforms. For 8 hardware platforms, we try 2 different CPUs (Intel i9-12900 KF and AMD Ryzen 5955 WX), 3 different GPUs (NVIDIA RTX 3090, RTX 4090, and A100), and 5 different FPGAs (Xilinx Alveo U50, U200, U250, U280, and Kintex7 KC705). On the GPU platforms, we choose to use PyG to optimize models’ execution speed and GPU memory usage.

For FPGA platforms, the configuration of our design on Alveo U50 is the same as Table 5. For Alveo U280, with more resources, we increase the HBM channels from 8 PCs to 16 PCs, set the AXI data width as 512, increase the number of Memorization Computing IP ( $\textbf{N}_{\textbf{c}}$ ) from 16 to 32, training chunk size (T) from 32 to 64, and use 256 UltraRAMs to store vertex hypervectors on-chip. We approximate the HDR’s performance on existing state-of-the-art HDC FPGA accelerator (LookHD) (imani2021revisiting, ). We also approximate the performance of 4 different models on Alveo U200 and U250 based on state-of-the-art GNN training FPGA acceleration works (GraphACT (zeng2020graphact, ) and HP-GNN (lin2022hp, )).

To ensure a fair comparison, we compare the small FPGA design (Alveo U50) with GraphACT which is based on Alveo U200, and compare the large FPGA design (Alveo U280) with HP-GNN which is based on Alveo U250. Compared to GraphACT (Alveo U200), $\mathsf{HDReason}$ on Alveo U50 shows on average 9 $\times$ speedup and 10 $\times$ energy efficiency improvement. When comparing with HP-GNN (Alveo U250), $\mathsf{HDReason}$ on Alveo U280 shows on average 3.5 $\times$ speedup and 4.6 $\times$ energy efficiency improvement. When comparing with RTX 4090, our Alveo U280 accelerator achieves on average 10.6 $\times$ speedup and 65 $\times$ more energy efficiency. When conducting cross models and platforms comparison, $\mathsf{HDReason}$ on Xilinx Alveo U280 (and Xilinx Alveo U50) provides on average 4.2 $\times$ (and 8.3 $\times$ ) speedup and 3.4 $\times$ (and 9.1 $\times$ ) energy efficiency improvement with similar accuracy as compared to the GCN training platform HP-GNN (and GraphACT).

6. Related Works

Knowledge Graph Completion (KGC) In contrast to conventional graph learning applications like vertex and graph classification, KGC allows for the derivation of new knowledge and insights from pre-existing data (chen2020review, ). Prior works focused on KG reasoning acceleration have predominantly employed graph embedding-based algorithms (bordes2013translating, ; sun2019rotate, ; yang2014embedding, ). These have been accelerated using both CPU (zhu2019graphvite, ; arm2020torchkge, ) and GPU (zheng2020dgl, ) platforms. Nevertheless, embedding-based KG reasoning algorithms tend to struggle with achieving high prediction accuracy for more complex KGs. To overcome this limitation and enhance reasoning accuracy, SOTA KG reasoning models attempt to combine embedding models with graph convolution neural networks (GCN) (vashishth2019composition, ; shang2019end, ; schlichtkrull2018modeling, ). However, the training overhead associated with GCN-based models is significant on existing hardware platforms.

GCN Accelerators In recent years, domain-specific accelerators (DSAs) targeting the graph convolution neural network (GCN) model have been highly successful at top system and architecture conferences. Prior works have primarily focused on accelerating GCN inference (yan2020hygcn, ; li2021gcnax, ; you2022gcod, ; huang2022accelerating, ; li2022hyperscale, ; geng2020awb, ; zhou2022model0, ; zhang2022decgnn, ; sarkar2023flowgnn, ), as well as graph neural network (GNN) training (zeng2020graphact, ; lin2022hp, ; chen2021rubik, ). However, these earlier efforts focus on traditional tasks such as vertex and graph classification, with little emphasis placed on KGC applications.

Hyperdimension Computing In recent years, hyperdimensional computing (HDC) has exhibited significant potential in learning applications surpassing conventional machine learning methods like deep neural networks (DNN) (imani2020dual, ; zou2022biohd, ; jiao2022brain, ; chang2023recent, ; ge2020classification, ; imani2021revisiting, ; HDPG_Yang, ; ni2022algorithm, ). In particular, HDC has demonstrated substantial promise in competing with graph convolution neural networks (GCN) in the graph learning domain (poduval2022graphd, ; nunes2022graphhd, ; kang2022relhd, ). Despite this progress, previous HDC-based works have not endeavored to undertake KGC tasks.

7. Conclusion

In this paper, we conduct algorithm and hardware co-design and propose the first HDC-based KGC acceleration platform named $\mathsf{HDReason}$ . Regarding algorithm, $\mathsf{HDReason}$ achieves relatively high accuracy when compared to state-of-the-art embedding and GCN models, as demonstrated through experiments using real-life KG datasets. Regarding hardware, the paper describes the design of an FPGA accelerator with multiple optimizations targeting $\mathsf{HDReason}$ model. $\mathsf{HDReason}$ on Xilinx Alveo U280 provides, on average, a 4.2 $\times$ speedup and 3.4 $\times$ energy efficiency improvement with similar accuracy as compared to the GCN training platform when conducting cross-model and platform comparisons.

Acknowledgements

This work was supported in part by the DARPA Young Faculty Award, the National Science Foundation (NSF) under Grants #2127780, #2319198, #2321840, #2312517, and #2235472, the Semiconductor Research Corporation (SRC), the Office of Naval Research through the Young Investigator Program Award, and Grants #N00014-21-1-2225 and N00014-22-1-2067. Additionally, support was provided by the Air Force Office of Scientific Research under Award #FA9550-22-1-0253, along with generous gifts from Xilinx and Cisco.

References

(1) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” Advances in neural information processing systems, vol. 26, 2013.
(2) A. Boschin, “Torchkge: Knowledge graph embedding in python and pytorch,” in International Workshop on Knowledge Graph: Mining Knowledge Graph for Deep Insights, Aug 2020.
(3) C.-Y. Chang, Y.-C. Chuang, C.-T. Huang, and A.-Y. Wu, “Recent progress and development of hyperdimensional computing (hdc) for edge intelligence,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2023.
(4) H. Chen, M. Issa, Y. Ni, and M. Imani, “Darl: Distributed reconfigurable accelerator for hyperdimensional reinforcement learning,” in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022, pp. 1–9.
(5) X. Chen, Y. Wang, X. Xie, X. Hu, A. Basak, L. Liang, M. Yan, L. Deng, Y. Ding, Z. Du, Y. Chen, and Y. Xie, “Rubik: A hierarchical architecture for efficient graph neural network training,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 4, pp. 936–949, 2021.
(6) X. Chen, S. Jia, and Y. Xiang, “A review: Knowledge reasoning over knowledge graph,” Expert Systems with Applications, vol. 141, p. 112948, 2020.
(7) H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, “Fa3c: Fpga-accelerated deep reinforcement learning,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 499–513.
(8) R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum, “Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning,” arXiv preprint arXiv:1711.05851, 2017.
(9) T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2d knowledge graph embeddings,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
(10) T. Feist, “Vivado design suite,” White Paper, vol. 5, p. 30, 2012.
(11) M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” arXiv preprint arXiv:1903.02428, 2019.
(12) H. Gao, K. Yang, Y. Yang, R. Y. Zakari, J. W. Owusu, and K. Qin, “Quatde: Dynamic quaternion embedding for knowledge graph completion,” arXiv preprint arXiv:2105.09002, 2021.
(13) L. Ge and K. K. Parhi, “Classification using hyperdimensional computing: A review,” IEEE Circuits and Systems Magazine, vol. 20, no. 2, pp. 30–47, 2020.
(14) T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y. Li, P. Haghi, A. Tumeo, S. Che, S. Reinhardt, and M. C. Herbordt, “Awb-gcn: A graph convolutional network accelerator with runtime workload rebalancing,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 922–936.
(15) Q. Guo, F. Zhuang, C. Qin, H. Zhu, X. Xie, H. Xiong, and Q. He, “A survey on knowledge graph-based recommender systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 8, pp. 3549–3568, 2020.
(16) M. Hildebrandt, J. A. Q. Serna, Y. Ma, M. Ringsquandl, M. Joblin, and V. Tresp, “Reasoning on knowledge graphs with debate dynamics,” arXiv preprint arXiv:2001.00461, 2020.
(17) X. Huang, J. Zhang, D. Li, and P. Li, “Knowledge graph embedding based question answering,” in Proceedings of the twelfth ACM international conference on web search and data mining, 2019, pp. 105–113.
(18) Y. Huang, L. Zheng, P. Yao, Q. Wang, X. Liao, H. Jin, and J. Xue, “Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1029–1042.
(19) M. Imani, D. Kong, A. Rahimi, and T. Rosing, “Voicehd: Hyperdimensional computing for efficient speech recognition,” in 2017 IEEE international conference on rebooting computing (ICRC). IEEE, 2017, pp. 1–8.
(20) M. Imani, S. Pampana, S. Gupta, M. Zhou, Y. Kim, and T. Rosing, “Dual: Acceleration of clustering algorithms using digital-based processing in-memory,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 356–371.
(21) M. Imani, S. Salamat, B. Khaleghi, M. Samragh, F. Koushanfar, and T. Rosing, “Sparsehd: Algorithm-hardware co-optimization for efficient high-dimensional computing,” in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 190–198.
(22) M. Imani, Z. Zou, S. Bosch, S. A. Rao, S. Salamat, V. Kumar, Y. Kim, and T. Rosing, “Revisiting hyperdimensional learning for fpga and low-power architectures,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 221–234.
(23) X. Jiao, A. Rahimi, C. Fermüller, and J. Y. Aloimonos, “Brain-inspired hyperdimensional computing: Algorithms, models, and architectures,” Frontiers in Neuroscience, vol. 16, 2022.
(24) J. Kang, M. Zhou, A. Bhansali, W. Xu, A. Thomas, and T. Rosing, “Relhd: A graph-based learning on fefet with hyperdimensional computing,” in 2022 IEEE 40th International Conference on Computer Design (ICCD). IEEE, 2022, pp. 553–560.
(25) G. Karunaratne, M. Le Gallo, G. Cherubini, L. Benini, A. Rahimi, and A. Sebastian, “In-memory hyperdimensional computing,” Nature Electronics, vol. 3, no. 6, pp. 327–337, 2020.
(26) V. Kathail, “Xilinx vitis unified software platform,” in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 173–174.
(27) K. Kiningham, P. Levis, and C. Ré, “Grip: A graph neural network accelerator architecture,” IEEE Transactions on Computers, vol. 72, no. 4, pp. 914–925, 2022.
(28) D. Kleyko, D. Rachkovskij, E. Osipov, and A. Rahimi, “A survey on hyperdimensional computing aka vector symbolic architectures, part ii: Applications, cognitive models, and challenges,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–52, 2023.
(29) G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on fpgas: Past, present, and future,” arXiv preprint arXiv:1602.04283, 2016.
(30) H. Latapie, O. Kilic, G. Liu, R. Kompella, A. Lawrence, Y. Sun, J. Srinivasa, Y. Yan, P. Wang, and K. R. Thórisson, “A metamodel and framework for artificial general intelligence from theory to practice,” Journal of Artificial Intelligence and Consciousness, vol. 8, no. 02, pp. 205–227, 2021.
(31) D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies,” in Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 1999, pp. 134–143.
(32) J. Li, A. Louri, A. Karanth, and R. Bunescu, “Gcnax: A flexible and energy-efficient accelerator for graph convolutional neural networks,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 775–788.
(33) S. Li, D. Niu, Y. Wang, W. Han, Z. Zhang, T. Guan, Y. Guan, H. Liu, L. Huang, Z. Du, F. Xue, Y. Fang, H. Zheng, and X. Yuan, “Hyperscale fpga-as-a-service architecture for large-scale distributed graph neural network,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 946–961.
(34) Y.-C. Lin, B. Zhang, and V. Prasanna, “Hp-gnn: generating high throughput gnn training implementation on cpu-fpga heterogeneous platform,” in Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2022, pp. 123–133.
(35) F. Mahdisoltani, J. Biega, and F. Suchanek, “Yago3: A knowledge base from multilingual wikipedias,” in 7th biennial conference on innovative data systems research. CIDR Conference, 2014.
(36) G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
(37) F. Montagna, A. Rahimi, S. Benatti, D. Rossi, and L. Benini, “Pulp-hd: Accelerating brain-inspired high-dimensional computing on a parallel ultra-low power platform,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
(38) Y. Ni, H. Chen, P. Poduval, Z. Zou, P. Mercati, and M. Imani, “Brain-inspired trustworthy hyperdimensional computing with efficient uncertainty quantification,” in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 01–09.
(39) Y. Ni, M. Issa, D. Abraham, M. Imani, X. Yin, and M. Imani, “Hdpg: Hyperdimensional policy-based reinforcement learning for continuous control,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, ser. DAC ’22, 2022, p. 1141–1146. [Online]. Available: https://doi.org/10.1145/3489517.3530668
(40) Y. Ni, Y. Kim, T. Rosing, and M. Imani, “Algorithm-hardware co-design for efficient brain-inspired hyperdimensional learning on edge,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2022, pp. 292–297.
(41) Y. Ni, N. Lesica, F.-G. Zeng, and M. Imani, “Neurally-inspired hyperdimensional classification for efficient and robust biosignal processing,” in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022, pp. 1–9.
(42) I. Nunes, M. Heddes, T. Givargis, A. Nicolau, and A. Veidenbaum, “Graphhd: Efficient graph classification using hyperdimensional computing,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2022, pp. 1485–1490.
(43) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
(44) P. Poduval, H. Alimohamadi, A. Zakeri, F. Imani, M. H. Najafi, T. Givargis, and M. Imani, “Graphd: Graph-based hyperdimensional memorization for brain-like cognitive learning,” Frontiers in Neuroscience, vol. 16, p. 5, 2022.
(45) S. Salamat, M. Imani, B. Khaleghi, and T. Rosing, “F5-hd: Fast flexible fpga-based framework for refreshing hyperdimensional computing,” in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019, pp. 53–62.
(46) S. Salamat, M. Imani, and T. Rosing, “Accelerating hyperdimensional computing on fpgas by exploiting computational reuse,” IEEE Transactions on Computers, vol. 69, no. 8, pp. 1159–1171, 2020.
(47) R. Sarkar, S. Abi-Karam, Y. He, L. Sathidevi, and C. Hao, “Flowgnn: A dataflow architecture for real-time workload-agnostic graph neural network inference,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 1099–1112.
(48) R. Sarkar, S. Abi-Karam, Y. He, L. Sathidevi, and C. Hao, “Flowgnn: A dataflow architecture for real-time workload-agnostic graph neural network inference,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 1099–1112.
(49) M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15. Springer, 2018, pp. 593–607.
(50) C. Shang, Y. Tang, J. Huang, J. Bi, X. He, and B. Zhou, “End-to-end structure-aware convolutional networks for knowledge base completion,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 3060–3067.
(51) L. Song, Y. Chi, A. Sohrabizadeh, Y.-k. Choi, J. Lau, and J. Cong, “Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication,” in Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2022, pp. 65–77.
(52) G. Stoica, O. Stretcu, E. A. Platanios, T. Mitchell, and B. Póczos, “Contextual parameter generation for knowledge graph link prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 03, 2020, pp. 3000–3008.
(53) Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, “Rotate: Knowledge graph embedding by relational rotation in complex space,” arXiv preprint arXiv:1902.10197, 2019.
(54) K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in Proceedings of the 3rd workshop on continuous vector space models and their compositionality, 2015, pp. 57–66.
(55) S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar, “Composition-based multi-relational graph convolutional networks,” arXiv preprint arXiv:1911.03082, 2019.
(56) H. Wang, M. Zhao, X. Xie, W. Li, and M. Guo, “Knowledge graph convolutional networks for recommender systems,” in The world wide web conference, 2019, pp. 3307–3313.
(57) Q. Wang, Y. Hao, and J. Cao, “Adrl: An attention-based deep reinforcement learning framework for knowledge graph reasoning,” Knowledge-Based Systems, vol. 197, p. 105910, 2020.
(58) Y. Wang, B. Feng, G. Li, S. Li, L. Deng, Y. Xie, and Y. Ding, “Gnnadvisor: An adaptive and efficient runtime system for gnn acceleration on gpus,” in 15th USENIX symposium on operating systems design and implementation (OSDI 21), 2021.
(59) F. Wen, M. Qin, P. Gratz, and N. Reddy, “Openmem: Hardware/software cooperative management for mobile memory system,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 109–114.
(60) F. Wen, M. Qin, P. Gratz, and N. Reddy, “Software hint-driven data management for hybrid memory in mobile systems,” ACM Trans. Embed. Comput. Syst., vol. 21, no. 1, jan 2022.
(61) F. Wen, M. Qin, P. V. Gratz, and A. L. N. Reddy, “Hardware memory management for future mobile hybrid memory systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 11, pp. 3627–3637, 2020.
(62) C. Wu, F. Wu, and Y. Huang, “Findings of the association for computational linguistics: Acl-ijcnlp 2021,” in Association for Computational Linguistics, 2021, pp. 4408–4413.
(63) W. Xiong, T. Hoang, and W. Y. Wang, “Deeppath: A reinforcement learning method for knowledge graph reasoning,” arXiv preprint arXiv:1707.06690, 2017.
(64) M. Yan, Z. Chen, L. Deng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Characterizing and understanding gcns on gpu,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 22–25, 2020.
(65) M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, “Hygcn: A gcn accelerator with hybrid architecture,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 15–29.
(66) B. Yang, S. W.-t. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” in Proceedings of the International Conference on Learning Representations (ICLR) 2015, 2015.
(67) B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” arXiv preprint arXiv:1412.6575, 2014.
(68) J. Yang, D. Tang, X. Song, L. Wang, Q. Yin, R. Chen, W. Yu, and J. Zhou, “Gnnlab: a factored system for sample-based gnn training over gpus,” in Proceedings of the Seventeenth European Conference on Computer Systems, 2022, pp. 417–434.
(69) H. You, T. Geng, Y. Zhang, A. Li, and Y. Lin, “Gcod: Graph convolutional network acceleration via dedicated algorithm and accelerator co-design,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 460–474.
(70) H. Zeng and V. Prasanna, “Graphact: Accelerating gcn training on cpu-fpga heterogeneous platforms,” in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 255–265.
(71) X. Zeng, X. Tu, Y. Liu, X. Fu, and Y. Su, “Toward better drug discovery with knowledge graph,” Current opinion in structural biology, vol. 72, pp. 114–126, 2022.
(72) B. Zhang, H. Zeng, and V. K. Prasanna, “Decgnn: A framework for mapping decoupled gnn models onto cpu-fpga heterogeneous platform,” in Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2022, pp. 154–154.
(73) T. Zhang, Z. Lin, G. Yang, and C. De Sa, “Qpytorch: A low-precision arithmetic simulation framework,” in 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS). IEEE, 2019, pp. 10–13.
(74) Y. Zhang, H. Dai, Z. Kozareva, A. Smola, and L. Song, “Variational reasoning for question answering with knowledge graph,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
(75) D. Zheng, X. Song, C. Ma, Z. Tan, Z. Ye, J. Dong, H. Xiong, Z. Zhang, and G. Karypis, “Dgl-ke: Training knowledge graph embeddings at scale,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 739–748.
(76) H. Zhou, B. Zhang, R. Kannan, V. Prasanna, and C. Busart, “Model-architecture co-design for high performance temporal gnn inference on fpga,” arXiv preprint arXiv:2203.05095, 2022.
(77) Z. Zhu, S. Xu, M. Qu, and J. Tang, “Graphvite: A high-performance cpu-gpu hybrid system for node embedding,” in The World Wide Web Conference. ACM, 2019, pp. 2494–2504.
(78) Z. Zou, H. Chen, P. Poduval, Y. Kim, M. Imani, E. Sadredini, R. Cammarota, and M. Imani, “Biohd: an efficient genome sequence search platform using hyperdimensional memorization,” in Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 656–669.
(79) Z. Zou, Y. Kim, F. Imani, H. Alimohamadi, R. Cammarota, and M. Imani, “Scalable edge-based hyperdimensional learning system with brain-like neural adaptation,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.