
HAHE: Hierarchical Attention for Hyper-Relational Knowledge Graphs in Global and Local Level

Haoran Luo1, Haihong E1 , Yuhao Yang2, Yikai Guo3, Mingzhi Sun1,
Tianyu Yao1, Zichen Tang1, Kaiyang Wan1, Meina Song1, Wei Lin4
1School of Computer Science, Beijing University of Posts and Telecommunications, China
2School of Automation Science and Electrical Engineering, Beihang University, China
3Beijing Institute of Computer Technology and Application, China
4Inspur Group Co., Ltd., China
{luohaoran, ehaihong}@bupt.edu.cn
  Corresponding author.
Abstract

Link prediction on Hyper-relational Knowledge Graphs (HKGs) is a worthwhile endeavor. An HKG consists of hyper-relational facts (H-Facts), each composed of a main triple and several auxiliary attribute-value qualifiers, which can effectively represent factually comprehensive information. The internal structure of an HKG can be represented globally as a hypergraph and locally as semantic sequences. However, existing research seldom models the graphical and sequential structure of HKGs simultaneously, limiting the representation of HKGs. To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE), which includes global-level and local-level attention. The global-level attention models the graphical structure of an HKG using hypergraph dual-attention layers, while the local-level attention learns the sequential structure inside H-Facts via heterogeneous self-attention layers. Experimental results indicate that HAHE achieves state-of-the-art performance on link prediction tasks over standard HKG datasets. In addition, HAHE addresses the issue of HKG multi-position prediction for the first time, increasing the applicability of the HKG link prediction task. Our code is publicly available at https://github.com/LHRLAB/HAHE.

1 Introduction

Knowledge graphs (KGs) are semantic networks that define relationships between entities. Early KG research (Bordes et al., 2013; Sun et al., 2019; Balazevic et al., 2019) uses binary relationships, often expressed as a triple-based fact (subject, relation, object). Yet, n-ary relational facts (containing more than two entities) are abundant in real-world KGs like Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014). Rosso et al. (2020) represent an n-ary relational fact as a hyper-relational fact (H-Fact) consisting of a main triple $(s,r,o)$ and several auxiliary attribute-value qualifiers $\{(a_i:v_i)\}$; KGs composed of H-Facts are called hyper-relational knowledge graphs (HKGs).

Figure 1: An example of hyper-relational fact structure.
Figure 2: The Global-level Hypergraph-based representation and Local-level Sequence-based representation based on three examples of H-Facts in HKGs.

As shown in Figure 1, an H-Fact can describe a real-world fact. Unlike traditional triple-based facts, H-Facts do not simply raise the number of entities in a fact from two to n; they structurally and effectively represent the n-ary relational facts prevalent in reality. Globally, they extend the ordinary graph structure to a hypergraph (Zhou et al., 2006) structure. Locally, they define five heterogeneous roles $s, r, o, a, v$ within facts to capture the semantic information of the fact 'Barack Obama held position as US president', as illustrated in Figure 2.

Recent research has demonstrated various embedding strategies for hyper-relational representations. However, current approaches consider only global hypergraph structures or only local semantic sequence structures. For instance, StarE (Galkin et al., 2020) employs the information transfer function of graph neural networks (GNNs) to unidirectionally pass auxiliary key-value pair information into the main triples' relations, thereby capturing the graph structure but insufficiently modeling the interactions among multiple entities and relations within H-Facts. In contrast, GRAN (Wang et al., 2021) first incorporates the Transformer encoder (Vaswani et al., 2017) into HKG embedding, capturing the fully connected semantic information locally inside H-Facts while disregarding the global structure. This inadequate representation of HKG structure constrains HKG embeddings, which makes simultaneously representing the global and local structure of HKGs with hierarchical attention a promising research direction.

To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE) that incorporates global-level and local-level attention. We update the global node embeddings using the HKG hypergraph structure. The previous hypergraph attention network (Bai et al., 2021) simply converts a hypergraph into a regular graph by fully connecting the nodes within each hyperedge and then applies GAT (Veličković et al., 2018) layers for node embedding updates, so it cannot distinguish which nodes comprise a hyperedge. Consequently, we design hypergraph dual-attention layers that aggregate node embedding information into hyperedge embeddings through an attention mechanism. After obtaining the hyperedge embeddings, we feed them back to the nodes through another attention step to update the node embeddings. In this way, nodes can learn more distant information from the whole HKG, and the hypergraph dual-attention method significantly enhances learning capacity. The updated node information is then passed to the local-level attention. Inspired by GRAN's heterogeneous attention (Wang et al., 2021), we define five types of nodes and fourteen types of edges in a single H-Fact and develop heterogeneous self-attention layers with both node-bias and edge-bias attention to learn the semantic content of H-Facts. The last step outputs the link prediction results through an MLP-based decoder for one-position or multi-position link prediction tasks on HKGs.

Experiments on link prediction were performed on three standard HKG datasets: JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020). The state-of-the-art results indicate that HAHE is effective on the link prediction task. In addition, adequate ablation experiments were designed to highlight the importance of the global-level and local-level attention, and HAHE is also applied to the HKG multi-position prediction task, i.e., predicting two or more entities or relations simultaneously in a single H-Fact, hence increasing the applicability of the HKG link prediction task. Finally, we make our code publicly available and discuss the limitations and future work of HKG embedding representation.

2 Related Work

Early approaches consider hyper-relational facts (H-Facts) mainly as graph structure, focusing more on the topological relations of entities. For example, m-TransH (Wen et al., 2016) projects the entities onto the relation hyperplane. RAE, NaLP, NeuInfer, and N-TuckER (Zhang et al., 2018; Guan et al., 2019, 2020; Liu et al., 2020) optimize hyper-relational knowledge graph (HKG) embedding based on m-TransH. However, none of them adopts the hyper-relational structure.

HINGE (Rosso et al., 2020) first proposes embedding the hyper-relational representation with attribute-value qualifiers using a CNN, and Hyper2 (Yan et al., 2022) initializes relation and entity embeddings as vectors in the Poincaré ball to improve model accuracy. Yet, neither of these methods considers the graphical structure or the semantic sequences of HKGs.

StarE (Galkin et al., 2020) employs a GNN as the message-passing mechanism to encode entities and relations, with a Transformer as the decoder to produce the result. HyTransformer (Yu and Yang, 2021) applies layer normalization and dropout to replace StarE's encoder. MSeaHKG (Di and Chen, 2022) argues that the message-passing function significantly impacts model performance and therefore replaces the static message-passing function in StarE with a dynamic one. These models consider the graph structure of hyper-relational facts but disregard the semantic sequences.

GRAN (Wang et al., 2021) improves upon the Transformer (Vaswani et al., 2017). It replaces the Transformer's self-attention with edge-biased fully-connected attention and accurately collects semantic information. Despite this, it ignores the graph structure.

Unlike earlier models, HAHE considers graph structure and semantic sequences simultaneously and employs hierarchical attention for link prediction on HKGs. Moreover, it is the first to improve upon previous methods by modeling the structure of HKGs via global-level embeddings, and it can perform multi-position prediction.

Figure 3: The overview of the HAHE model for Global-level and Local-level Representation of HKGs.

3 Preliminaries

This section presents important concepts and techniques for Hyper-relational Knowledge Graphs (HKGs), including the definition of HKGs, hypergraph learning, the global and local structure of HKGs, and multi-position prediction on HKGs.

3.1 Hyper-relational Knowledge Graphs

HKGs comprise hyper-relational facts (H-Facts). Typically, an H-Fact can be represented as $\mathcal{H}=\{((s,r,o),\{(a_{i}:v_{i})\}^{m}_{i=1})\mid s,o,v_{1},\ldots,v_{m}\in\mathcal{E},\ r,a_{1},\ldots,a_{m}\in\mathcal{R}\}$, where $(s,r,o)$ represents the main triple and $\{(a_{i}:v_{i})\}^{m}_{i=1}$ represents $m$ auxiliary attribute-value qualifiers.

The link prediction (LP) task on HKGs is to predict missing elements from H-Facts, where missing elements can be entities $\in\{s,o,v_{1},\ldots,v_{m}\}$ or relations $\in\{r,a_{1},\ldots,a_{m}\}$.
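For concreteness, the following is a minimal sketch of how an H-Fact and a one-position link prediction query could be represented in code; the field names and the example qualifiers are illustrative assumptions and are not taken from the released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An H-Fact: a main triple (s, r, o) plus m auxiliary attribute-value qualifiers.
@dataclass
class HFact:
    subject: str
    relation: str
    obj: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)  # [(a_i, v_i), ...]

    def arity(self) -> int:
        # Number of entities involved: s, o, and one value per qualifier.
        return 2 + len(self.qualifiers)

# Illustrative fact in the spirit of Figure 1 (qualifiers are made up for the example).
fact = HFact(
    subject="Barack Obama",
    relation="held position",
    obj="US president",
    qualifiers=[("start time", "2009"), ("replacing", "George W. Bush")],
)

# A one-position link prediction query masks a single element, e.g. the object:
query = HFact(subject="Barack Obama", relation="held position", obj="[MASK]",
              qualifiers=fact.qualifiers)
print(fact.arity(), query.obj)
```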

3.2 Hypergraph Learning on HKGs

Since there are more than two entities in an H-Fact, we introduce hypergraph learning (Feng et al., 2019). A hypergraph of an HKG, $\mathcal{G}_{H}=\{\mathcal{E}_{H},\mathcal{H}_{H},\mathcal{I}_{H}\}$, contains a node set $\mathcal{E}_{H}=\mathcal{E}$, a hyperedge set $\mathcal{H}_{H}=\mathcal{H}$, and a special incidence matrix $\mathcal{I}_{H}$ that records the weights of each hyperedge. $\mathcal{I}_{H}$ is a $|\mathcal{E}_{H}|\times|\mathcal{H}_{H}|$ matrix defined as follows:

$$h(v,e)=\begin{cases}1, & \text{if }v\in e,\\ 0, & \text{if }v\notin e,\end{cases} \qquad (1)$$

where $v\in\mathcal{E}_{H}$, $e\in\mathcal{H}_{H}$, and $h$ is a function that gives the entries of $\mathcal{I}_{H}$. For a node $v\in\mathcal{E}_{H}$, its degree is defined as $d(v)=\sum_{e\in\mathcal{H}_{H}}h(v,e)$, which represents the number of times the node (entity) appears in different hyperedges (H-Facts) in the whole HKG.
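A minimal sketch of building the incidence matrix $\mathcal{I}_{H}$ and the node degrees from a small set of H-Facts follows; the toy facts and the indexing scheme are assumptions made only for illustration.

```python
import numpy as np

# Hyperedges: each H-Fact contributes one hyperedge containing all its entities
# (subject, object, and every qualifier value).
hyperedges = [
    ["Barack Obama", "US president", "2009"],          # toy H-Fact 1
    ["Barack Obama", "Columbia University", "1983"],   # toy H-Fact 2
]

entities = sorted({v for e in hyperedges for v in e})
ent2id = {v: i for i, v in enumerate(entities)}

# Incidence matrix I_H of shape |E_H| x |H_H| (Eq. 1): h(v, e) = 1 iff v is in e.
I_H = np.zeros((len(entities), len(hyperedges)), dtype=np.float32)
for j, edge in enumerate(hyperedges):
    for v in edge:
        I_H[ent2id[v], j] = 1.0

# Node degree d(v) = sum_e h(v, e): how many H-Facts an entity appears in.
degree = I_H.sum(axis=1)
print(dict(zip(entities, degree)))  # e.g. "Barack Obama" has degree 2
```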

3.3 Multi-position Prediction on HKGs

Multi-position prediction is a new and meaningful task with more practicality than the one-position link prediction task on HKGs. For HKG link prediction, in a main triple we can predict any one element given the other two, i.e., we can predict $(s,r,?)$, $(s,?,o)$, or $(?,r,o)$. In an auxiliary attribute-value qualifier, if we know either the attribute or the value, we can predict the other, i.e., $(a,?)$ or $(?,v)$. In practical link prediction, however, some queries have two or more prediction positions, for example $(s,r,?,a_{1},?,a_{2},v_{2})$ (one prediction position in the main triple and another in the first attribute-value qualifier) or $(s,r,o,a_{1},?,a_{2},?)$ (both predicted positions in the auxiliary attribute-value qualifiers). We refer to link prediction tasks with two or more prediction positions as multi-position prediction tasks on HKGs.

4 Methodology

This section introduces our hyper-relational knowledge graph (HKG) embedding model HAHE, including the global and local representations, the two hierarchical attention layers, and the MLP decoder.

4.1 Global and Local Representation

HKGs $\mathcal{G}=\{\mathcal{E},\mathcal{R},\mathcal{H}\}$ consist of multiple hyper-relational facts (H-Facts) $\mathcal{H}$ with entities $\mathcal{E}$ and relations $\mathcal{R}$. Since each H-Fact represents an n-ary relation (n > 2) and carries rich, heterogeneous semantic information, we model HKGs in terms of a global-level hypergraph-based representation and a local-level sequence-based representation, with hypergraph dual-attention layers and heterogeneous self-attention layers respectively. The overview of HAHE is illustrated in Figure 3. For the global-level representation, we define $\mathcal{G}_{H}=\{\mathcal{E}_{H},\mathcal{H}_{H},\mathcal{I}_{H}\}$ to represent the graph structure of entities. Unlike in a regular graph, the hyperedges can connect more than two entity nodes. Moreover, we use the incidence matrix $\mathcal{I}_{H}$ to represent the association information between nodes and hyperedges. For the local-level representation, every H-Fact has the structure of one main triple and several auxiliary attribute-value qualifiers, $\mathcal{H}=\{((s,r,o),\{(a_{i}:v_{i})\}^{m}_{i=1})\mid s,o,v_{1},\ldots,v_{m}\in\mathcal{E},\ r,a_{1},\ldots,a_{m}\in\mathcal{R}\}$, which represents the semantic information of facts. We can fully connect the entities and relations in an H-Fact and represent them as a heterogeneous semantic sequence structure, containing five kinds of nodes $s,r,o,a,v$ and 14 kinds of edges $s\text{-}r$, $s\text{-}o$, $r\text{-}o$, $s\text{-}a$, $s\text{-}v$, $r\text{-}a$, $r\text{-}v$, $o\text{-}a$, $o\text{-}v$, $a_{i}\text{-}a_{j}$, $v_{i}\text{-}v_{j}$, $a_{i}\text{-}v_{i}$, $a_{i}\text{-}v_{j}$, where $i,j$ are the serial numbers of different qualifiers.

(a) Hypergraph Dual-Attention. (b) Heterogeneous Self-Attention.
Figure 4: The structure of Hypergraph Dual-Attention Layers and Heterogeneous Self-Attention Layers in HAHE.

4.2 Hypergraph Dual-Attention Layers

As shown in Figure 4(a), entity embeddings first pass through Hypergraph Dual-Attention Layers to learn hypergraph structural information at the global level. Previous hypergraph attention network methods create a transformed ordinary graph by fully joining the nodes within the same hyperedge and then apply GAT. The hypergraph representation is lost because the ordinary graph after this transformation cannot distinguish whether two nodes lie within the same hyperedge or in different hyperedges. We first initialize the entities as nodes with embeddings $\boldsymbol{h}_{v_{i}}\in\mathbb{R}^{d}$, where $v_{i}\in\mathcal{E}_{H}$ and $d$ is the embedding dimension, and initialize the H-Facts as hyperedges with embeddings $\boldsymbol{h}_{e_{i}}\in\mathbb{R}^{d}$, where $e_{i}\in\mathcal{H}_{H}$. The node and hyperedge embeddings are then projected into the same space to obtain $\mathbf{W}\boldsymbol{h}_{v_{i}}$ and $\mathbf{W}\boldsymbol{h}_{e_{i}}$, where $\mathbf{W}\in\mathbb{R}^{d\times d}$. The attention from nodes to hyperedges (N-to-H Attention) is performed as follows:

$$\alpha_{ij}=att\left(\mathbf{W}\boldsymbol{h}_{e_{i}},\mathbf{W}\boldsymbol{h}_{v_{j}}\right),\quad v_{j}\in e_{i}, \qquad (2)$$

where $\alpha_{ij}$ indicates the importance of node $v_{j}$'s features to hyperedge $e_{i}$, and $att$ is the N-to-H attention function, for which we choose a single-layer neural network with a concatenation operation. $v_{j}\in e_{i}$ means we only calculate the attention where node $v_{j}$ belongs to hyperedge $e_{i}$, which is indexed via the hypergraph incidence matrix $\mathcal{I}_{H}$. Then, the information of the nodes is aggregated into the hyperedges:

$$\tilde{\boldsymbol{h}}_{e_{i}}=\sigma\left(\sum_{v_{j}\in e_{i}}\frac{\exp\left(\operatorname{LR}\left(\alpha_{ij}\right)\right)}{\sum_{v_{k}\in e_{i}}\exp\left(\operatorname{LR}\left(\alpha_{ik}\right)\right)}\mathbf{W}\boldsymbol{h}_{v_{j}}\right), \qquad (3)$$

where $\tilde{\boldsymbol{h}}_{e_{i}}\in\mathbb{R}^{d}$ is the updated hyperedge embedding, and $\operatorname{LR}$ and $\sigma$ are activation functions. After that, we aggregate the information of the updated hyperedges back to the nodes in a similar way with hyperedge-to-node attention (H-to-N Attention) as follows:

$$\beta_{ij}=att\left(\mathbf{W}\boldsymbol{h}_{v_{i}},\mathbf{W}\boldsymbol{h}_{e_{j}}\right),\quad e_{j}\ni v_{i}, \qquad (4)$$
$$\tilde{\boldsymbol{h}}_{v_{i}}=\sigma\left(\sum_{e_{j}\ni v_{i}}\frac{\exp\left(\operatorname{LR}\left(\beta_{ij}\right)\right)}{\sum_{e_{k}\ni v_{i}}\exp\left(\operatorname{LR}\left(\beta_{ik}\right)\right)}\tilde{\boldsymbol{h}}_{e_{j}}\right), \qquad (5)$$

where $\beta_{ij}$ denotes the importance of the updated hyperedge $e_{j}$ to node $v_{i}$, and $\tilde{\boldsymbol{h}}_{v_{i}}$ is the updated node embedding. In this way, we update the global hypergraph entity embeddings and achieve node-hyperedge-node dual-attention message passing. Although we introduce more hyperedge embeddings than ordinary GNNs, they are only used as intermediate variables for the hypergraph attention computation. PyG makes dual-attention easy to implement, keeping it as scalable to large graphs as ordinary GNNs.
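For illustration, the following is a condensed PyTorch sketch of one hypergraph dual-attention layer in a dense incidence-matrix formulation. It is a simplification under stated assumptions, not the exact implementation: the released code is built on PyG message passing, multi-head attention is omitted, and the H-to-N step here uses the updated hyperedge embeddings for both scoring and aggregation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphDualAttention(nn.Module):
    """One global-level layer: node -> hyperedge attention (Eqs. 2-3),
    then hyperedge -> node attention (Eqs. 4-5)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)          # shared projection W
        self.att_n2h = nn.Linear(2 * dim, 1, bias=False)  # att(.) for N-to-H
        self.att_h2n = nn.Linear(2 * dim, 1, bias=False)  # att(.) for H-to-N

    def _attend(self, q, k, att, mask):
        # Score every (q_i, k_j) pair with a single-layer network over the
        # concatenation (Eq. 2 / Eq. 4), masked by the incidence structure.
        Nq, Nk = q.size(0), k.size(0)
        pair = torch.cat([q.unsqueeze(1).expand(Nq, Nk, -1),
                          k.unsqueeze(0).expand(Nq, Nk, -1)], dim=-1)
        score = F.leaky_relu(att(pair).squeeze(-1))        # LR(.)
        score = score.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(score, dim=-1)               # normalize over members
        return alpha @ k                                   # weighted aggregation

    def forward(self, h_v, h_e, incidence):
        # h_v: [num_nodes, d], h_e: [num_hyperedges, d]
        # incidence: [num_nodes, num_hyperedges] binary matrix I_H
        pv, pe = self.W(h_v), self.W(h_e)
        # N-to-H: aggregate member nodes into each hyperedge embedding.
        h_e_new = F.elu(self._attend(pe, pv, self.att_n2h, incidence.t()))
        # H-to-N: send updated hyperedge embeddings back to their nodes.
        h_v_new = F.elu(self._attend(pv, h_e_new, self.att_h2n, incidence))
        return h_v_new, h_e_new

# Toy usage: 4 entities, 2 H-Facts, embedding dimension 8.
layer = HypergraphDualAttention(dim=8)
h_v, h_e = torch.randn(4, 8), torch.randn(2, 8)
I_H = torch.tensor([[1., 0.], [1., 1.], [0., 1.], [1., 0.]])
h_v_new, h_e_new = layer(h_v, h_e, I_H)
print(h_v_new.shape, h_e_new.shape)  # torch.Size([4, 8]) torch.Size([2, 8])
```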

After the hypergraph dual-attention layers, the updated node embeddings are distributed to the H-Fact sequences together with the relation embeddings, forming sequence embeddings that are fed into the Heterogeneous Self-Attention Layers.

Dataset H-Facts H-Facts with Q(%) Entities Relations Train Valid Test Arity
JF17K 100,947 46,320(45.9%) 28,645 501 76,379 - 24,568 2-6
WikiPeople 369,866 9,482(2.6%) 34,839 178 294,439 37,715 37,712 2-7
WD50K 236,507 32,167(13.6%) 47,156 532 166,435 23,913 46,159 2-67
Table 1: Dataset statistics, where the columns respectively indicate the number of all H-Facts, H-facts with qualifiers, entities, relations, H-facts in train/valid/test sets, and the range of arity of H-facts.

4.3 Heterogeneous Self-Attention Layers

Each element in the sequence, $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$, has one of five roles, $s,r,o,a,v$, and a total of 14 kinds of edges with the other elements $\boldsymbol{x}_{j}\in\mathbb{R}^{d}$ in the sequence. So, as shown in Figure 4(b), we design heterogeneous self-attention layers with both node-bias and edge-bias to learn the local semantic information of the H-Facts. The elements in the sequence pass through this attention layer and update their embeddings as follows:

$$\gamma_{ij}=\frac{\left(\mathbf{W}_{role(i)}^{Q}\boldsymbol{x}_{i}+\mathbf{b}_{ij}^{Q}\right)^{\top}\left(\mathbf{W}_{role(j)}^{K}\boldsymbol{x}_{j}+\mathbf{b}_{ij}^{K}\right)}{\sqrt{d}}, \qquad (6)$$
$$\tilde{\boldsymbol{x}_{i}}=\sum_{j=1}^{n}\frac{\exp\left(\gamma_{ij}\right)}{\sum_{k=1}^{n}\exp\left(\gamma_{ik}\right)}\left(\mathbf{W}_{role(j)}^{V}\boldsymbol{x}_{j}+\mathbf{b}_{ij}^{V}\right), \qquad (7)$$

where $\gamma_{ij}$ is the importance between one element $\boldsymbol{x}_{i}$ in the sequence and another $\boldsymbol{x}_{j}$, and $\mathbf{W}_{role(i)}^{Q},\mathbf{W}_{role(i)}^{K},\mathbf{W}_{role(i)}^{V}\in\mathbb{R}^{d\times d}$ are the linear weight matrices of the query, key, and value; the five different kinds of $\boldsymbol{x}_{i}$ pass through different weight matrices indexed by the $role$ function as the node-bias, and $\mathbf{b}_{ij}^{Q},\mathbf{b}_{ij}^{K},\mathbf{b}_{ij}^{V}\in\mathbb{R}^{d}$ are designed as the edge-bias. The sequence embeddings $\tilde{\boldsymbol{x}_{i}}\in\mathbb{R}^{d}$ are then updated by learning the semantic information inside the H-Facts.
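A simplified single-head PyTorch sketch of the node-bias and edge-bias attention in Eqs. (6)-(7) is given below. The role and edge-type indexing, the extra padding edge type, and the omission of multi-head attention, residual connections, and feed-forward layers are all simplifying assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn

NUM_ROLES = 5        # s, r, o, a, v
NUM_EDGE_TYPES = 15  # 14 heterogeneous edge types + 1 padding/self type (assumed)

class HeterogeneousSelfAttention(nn.Module):
    """Single-head sketch of Eqs. 6-7: role-specific (node-bias) projections
    plus learned per-edge-type biases (edge-bias) on queries, keys and values."""

    def __init__(self, dim: int):
        super().__init__()
        # One query/key/value projection matrix per role -> node-bias.
        self.WQ = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        self.WK = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        self.WV = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        # One bias vector per edge type -> edge-bias.
        self.bQ = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.bK = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.bV = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.dim = dim

    def forward(self, x, roles, edge_types):
        # x: [n, d] element embeddings of one H-Fact sequence
        # roles: [n] role index of each element (0..4 for s, r, o, a, v)
        # edge_types: [n, n] edge-type index between every pair of elements
        q = torch.einsum("nij,nj->ni", self.WQ[roles], x)   # role-specific queries
        k = torch.einsum("nij,nj->ni", self.WK[roles], x)   # role-specific keys
        v = torch.einsum("nij,nj->ni", self.WV[roles], x)   # role-specific values
        bq, bk, bv = self.bQ(edge_types), self.bK(edge_types), self.bV(edge_types)
        # Eq. 6: pairwise scores with edge-biased queries and keys.
        scores = ((q.unsqueeze(1) + bq) * (k.unsqueeze(0) + bk)).sum(-1) / self.dim ** 0.5
        attn = torch.softmax(scores, dim=-1)                 # [n, n]
        # Eq. 7: aggregate edge-biased values.
        return torch.einsum("ij,ijd->id", attn, v.unsqueeze(0) + bv)

# Toy usage: one H-Fact sequence (s, r, o, a1, v1) -> 5 elements.
layer = HeterogeneousSelfAttention(dim=16)
x = torch.randn(5, 16)
roles = torch.tensor([0, 1, 2, 3, 4])                  # s, r, o, a, v
edge_types = torch.randint(0, NUM_EDGE_TYPES, (5, 5))  # illustrative edge-type map
print(layer(x, roles, edge_types).shape)               # torch.Size([5, 16])
```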

4.4 MLP Decoder

Finally, from the updated sequence embeddings, the embedding at the position to be predicted, $\tilde{\boldsymbol{x}}_{p}$, is selected and passed through an MLP decoder to obtain the prediction distribution and the link prediction results.

$$\boldsymbol{P}=\operatorname{softmax}\left(\boldsymbol{MLP}(\tilde{\boldsymbol{x}}_{p})\,\mathbf{E}^{\top}\right), \qquad (8)$$

where $\boldsymbol{MLP}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, $\mathbf{E}\in\mathbb{R}^{|\mathcal{E}|\times d}$ shares the initial element embedding matrix, and $\boldsymbol{P}\in\mathbb{R}^{|\mathcal{E}|}$ is obtained after the softmax operation; it denotes the similarity probability of $\tilde{\boldsymbol{x}}_{p}$ with each element in the HKG, from which the link prediction answers are obtained.
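A minimal sketch of the scoring step in Eq. (8) follows, assuming a small two-layer MLP and a shared entity embedding matrix; the layer sizes, dimensions, and random inputs are purely illustrative.

```python
import torch
import torch.nn as nn

dim, num_entities, hidden = 16, 1000, 32

# Shared entity embedding matrix E (|E| x d), also used to initialize the encoder.
E = nn.Parameter(torch.randn(num_entities, dim))

# MLP: R^d -> R^d, applied to the embedding at the masked position.
mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

x_p = torch.randn(1, dim)              # updated embedding at the predicted position
logits = mlp(x_p) @ E.t()              # similarity with every entity (Eq. 8)
P = torch.softmax(logits, dim=-1)      # probability over all candidate entities
print(P.shape, P.argmax(dim=-1))       # top-ranked entity id
```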

4.5 Learning Strategy

The model is trained with the final loss, computed from the similarity between the prediction target and all entities:

$$\mathcal{L}=-\sum_{t=1}^{|\mathcal{E}|}{\mathbf{y}_{t}\log\boldsymbol{P}_{t}}, \qquad (9)$$

where $\mathbf{y}_{t}$ is the $t$-th entry of the label vector $\mathbf{y}$ and $\boldsymbol{P}_{t}$ is the $t$-th entry of $\boldsymbol{P}$.

For multi-position prediction, thanks to the fully connected attention mechanism, HAHE can accomplish the multi-position prediction task on HKGs by masking two or more entities or relations at two or more positions in the same hyper-relational fact. After passing through the MLP decoder, HAHE obtains the prediction distribution $\boldsymbol{P}_{i}\ (i\geq 2)$ for each masked position and selects the entity or relation with the highest similarity among all HKG elements as the prediction result for that position.
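The following sketch illustrates the objective in Eq. (9) on a one-hot label and the masking view of multi-position prediction, where each masked position yields its own distribution. The tensors are random placeholders rather than model outputs, and the (optional) soft-label smoothing is only noted as an assumption.

```python
import torch
import torch.nn.functional as F

num_entities = 1000

# --- Training objective (Eq. 9): cross-entropy between the one-hot (or soft)
# --- label y and the predicted distribution P over all entities.
P = torch.softmax(torch.randn(1, num_entities), dim=-1)   # stand-in decoder output
target = torch.tensor([42])                                # index of the true entity
y = F.one_hot(target, num_entities).float()               # may be smoothed by the soft-label setting
loss = -(y * torch.log(P)).sum(dim=-1).mean()
print(loss)

# --- Multi-position prediction: mask several positions in one H-Fact and read
# --- an independent distribution P_i from each masked position's embedding.
P_multi = torch.softmax(torch.randn(2, num_entities), dim=-1)  # two masked positions
predictions = P_multi.argmax(dim=-1)                           # best candidate at each position
print(predictions)
```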

Model JF17K Wikipeople WD50K
subject / object all entities subject / object all entities subject / object all entities
MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10
m-TransH 0.206 0.206 0.462 0.102 0.069 0.168 0.063 0.063 0.300 - - - - - - - - -
RAE 0.215 0.215 0.466 0.310 0.219 0.504 0.058 0.058 0.306 0.172 0.102 0.320 - - - - - -
NaLP 0.221 0.165 0.331 0.366 0.290 0.516 0.408 0.331 0.546 0.338 0.272 0.466 - - - 0.224 0.158 0.330
NeuInfer 0.449 0.361 0.624 0.473 0.397 0.618 0.476 0.415 0.585 0.333 0.259 0.477 0.243 0.176 0.377 0.228 0.162 0.341
HINGE 0.431 0.342 0.611 0.517 0.436 0.675 0.342 0.272 0.463 0.350 0.282 0.467 - - - 0.232 0.164 0.343
StarE 0.574 0.496 0.725 0.542 0.454 0.685 0.491 0.398 0.592 0.378 0.265 0.542 0.349 0.271 0.496 - - -
Hyper2 0.583 0.500 0.746 - - - 0.461 0.391 0.597 - - - - - - - - -
HyTransformer 0.582 0.501 0.742 - - - 0.501 0.426 0.634 - - - 0.356 0.281 0.498 - - -
GRAN 0.617 0.539 0.770 0.656 0.582 0.799 0.503 0.438 0.620 0.479 0.410 0.604 - - - 0.309 0.24 0.441
MSeaHKG - - - 0.577 0.481 0.711 - - - 0.395 0.291 0.554 - - - - - -
HAHE 0.623 0.554 0.806 0.668 0.597 0.816 0.509 0.447 0.639 0.495 0.420 0.631 0.368 0.291 0.516 0.402 0.327 0.546
HAHE w/o global 0.621 0.548 0.787 0.659 0.588 0.797 0.501 0.434 0.629 0.483 0.407 0.612 0.356 0.280 0.501 0.390 0.315 0.531
HAHE w/o node-bias 0.620 0.546 0.787 0.659 0.587 0.797 0.474 0.429 0.622 0.487 0.405 0.611 0.354 0.283 0.506 0.396 0.313 0.529
HAHE w/o edge-bias 0.620 0.545 0.786 0.657 0.586 0.796 0.503 0.435 0.629 0.483 0.407 0.612 0.357 0.281 0.513 0.391 0.316 0.533
Table 2: Comparison of HAHE with other models in terms of entity prediction accuracy on JF17K, WikiPeople and WD50K. Results of the baseline models are mainly taken from the original papers. Best results in each task are in bold.

5 Experiments

This section introduces the experimental settings, results, and analysis. We answer the following research questions (RQs). RQ1: Can HAHE outperform other hyper-relational knowledge graph (HKG) embedding models on HKG datasets? RQ2: How does the hierarchical attention mechanism contribute to HAHE? RQ3: How does the hypergraph dual-attention mechanism contribute to HAHE at the global level? RQ4: How does the heterogeneity of nodes and edges in hyper-relational facts contribute to HAHE at the local level? RQ5: How does HAHE perform on multi-position prediction tasks on HKGs?

5.1 Experimental Setup

5.1.1 Datasets

We conduct experiments on three hyper-relational datasets, JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020), as shown in Table 1. Among them, JF17K is extracted from Freebase (Bollacker et al., 2008). WikiPeople is obtained by filtering out the statements containing literals from the original WikiPeople dataset, which is derived from Wikidata (Vrandečić and Krötzsch, 2014) and concerns entities of type human. WD50K is a high-quality hyper-relational dataset with a larger share of hyper-relational facts carrying auxiliary attribute-value qualifiers.

5.1.2 Baselines

We compare HAHE against a sizable collection of previous hyper-relational approaches namely: (i) m-TransH (Wen et al., 2016) (ii) RAE (Zhang et al., 2018) (iii) NaLP (Guan et al., 2019) (iv) NeuInfer (Guan et al., 2020) (v) HINGE (Rosso et al., 2020) (vi) StarE (Galkin et al., 2020) (vii) Hyper2 (Yan et al., 2022) (viii) HyTransformer (Yu and Yang, 2021) (ix) GRAN (Wang et al., 2021) (x) MSeaHKG (Di and Chen, 2022).

5.1.3 Ablations

To evaluate the significance of HAHE's three main modules, the hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity, we obtain seven simplified model variants: six obtained by removing one or two modules from the full model (HAHE-node, HAHE-edge, HAHE-node&edge, HAHE-global, HAHE-global&node, HAHE-global&edge) and a basic variant obtained by removing all three modules.

Figure 5: Ablation results. (a) Hits@1 results for the ablation study with all HAHE variants. (b) MRR results with different entity degrees for the global-level analysis. (c) Hits@1 results of subject/object prediction with different arities for the local-level analysis. (d) Hits@1 results of prediction in qualifiers with different arities for the local-level analysis.

5.1.4 Evaluation Metrics

Each model predicts entities and relations separately. We further split each prediction task into subject/object prediction in main triples and all-entity prediction over whole H-Facts, to test the model's main-triple prediction ability. MRR (mean reciprocal rank) and Hits@K (the proportion of test cases whose correct answer is ranked in the top K) for K = 1, 10 are used to evaluate each link prediction task.
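For reference, a small sketch of how MRR and Hits@K can be computed from the rank of the correct answer in each test query is shown below; filtered-ranking details are omitted and the example ranks are made up.

```python
def mrr_and_hits(ranks, ks=(1, 10)):
    """ranks: 1-based rank of the correct entity/relation for each test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits

# Toy example: four queries whose correct answers were ranked 1, 3, 2 and 15.
print(mrr_and_hits([1, 3, 2, 15]))
# (0.475, {1: 0.25, 10: 0.75})
```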

5.1.5 Hyperparameters and Environment

The model was trained for 300 epochs using the Adam optimizer with a batch size of 1024 examples on a single GeForce GTX 1080Ti for each dataset. Appendix A shows HAHE's optimal hyperparameter settings, and Appendix B gives the training details.

5.2 Main Results (RQ1)

In this experiment, we evaluate our model on the link prediction task. For entity prediction, the results of our model and each of its variants are reported in Table 2. For relation prediction, the results are shown in Appendix C. We observe that HAHE outperforms the other current methods on all three datasets. On JF17K, for the prediction of subject and object, HAHE reports an improvement of 0.6 (0.9%) MRR points, 1.5 (2.7%) H@1, and 3.6 (4.6%) H@10 over the best previous approach. For the prediction of all entities, HAHE reports a gain of 1.2 (1.8%) MRR, 1.5 (2.5%) H@1, and 1.7 (2.1%) H@10 over the next-best approach. We also observe improvements of varying degrees on the other two datasets. On WD50K, the latest high-quality HKG dataset, our model shows the largest improvement over the existing SOTA model GRAN, with an MRR improvement of about 9 points, which suggests that the model is better suited to hypergraph-structured knowledge graphs containing hyper-relational facts beyond binary relations.

5.3 Ablation Study (RQ2)

The hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity are the three components required for HAHE's operation. We evaluate seven different variants of HAHE to determine whether each component is helpful. When evaluating each model variant with a variety of hyperparameters, the result of the optimal prediction was recorded. For the different HAHE variants in Figure 5(a), it can be observed that hypergraph dual-attention, node heterogeneity, and edge heterogeneity all contribute to the accuracy of our complete model. In addition, we outline the specific results of the three primary HAHE variants, each lacking one necessary component, in Table 2. Through this comparison, our experimental results intuitively demonstrate the effectiveness of HAHE.

We then performed a more refined ablation analysis to explore the significance of the hypergraph dual-attention mechanism at the global level and the heterogeneity of nodes and edges at the local level, respectively.

5.4 Analysis of Hypergraph Dual-attention Mechanism in Global Level (RQ3)

To investigate the hypergraph dual-attention mechanism, we group the entity evaluation results on JF17K by the degree of entities in the HKG hypergraph structure. As shown in Figure 5(b), the hypergraph dual-attention mechanism improves the prediction accuracy of entities with different degrees. Owing to the attention mechanism, entity feature information supports the message passing of global information through the hyper-relational facts (hyperedges). For entities with higher degrees, the hypergraph dual-attention mechanism can better capture the global features of the hyper-relational facts. Entities with lower degrees contribute little to capturing global information, but entities with higher degrees compensate for this and improve their prediction accuracy.

5.5 Analysis of Heterogeneity of Nodes and Edges in Local Level (RQ4)

The experimental results show that both node-bias and edge-bias with heterogeneity help distinguish the different types of entities in an H-Fact and the different types of relationships between them at the local level. Moreover, we find that node-bias plays a more prominent role than edge-bias in the subject/object prediction tasks, as shown in Figure 5(c), because the entity roles in the main triple are more diverse than those in the qualifiers. On the element prediction task in qualifiers, edge-bias performs better than node-bias, as shown in Figure 5(d), because edge-bias, unlike node-bias, can distinguish corresponding attribute-value pairs $(a_{i},v_{i})$ from non-corresponding attribute-value pairs $(a_{j},v_{i})$.

5.6 Results of Multi-position Prediction Tasks on HKGs (RQ5)

JF17K, WikiPeople, and WD50K were used to test our model on multi-position prediction. As the example in Figure 6 shows, our model outputs the embedding for each masked position and calculates a joint probability distribution over candidate answer tuples. Because increasing the number of predicted positions leads to far more candidate answer tuples than in one-position link prediction, we set an evaluation threshold to keep only the higher-scoring answer tuples when computing the evaluation results.
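One possible reading of this procedure is sketched below, under the assumption that a candidate answer tuple is scored by the product of its per-position probabilities; the exact scoring and threshold used in the released code may differ.

```python
import itertools
import torch

num_entities, threshold, top_k = 50, 1e-3, 5

# Per-position distributions P_1, P_2 output by the decoder for two masked positions.
P1 = torch.softmax(torch.randn(num_entities), dim=-1)
P2 = torch.softmax(torch.randn(num_entities), dim=-1)

# Joint score of a candidate answer tuple (e1, e2): product of the marginals (assumption).
scored = (((e1, e2), (P1[e1] * P2[e2]).item())
          for e1, e2 in itertools.product(range(num_entities), repeat=2))

# Keep only the higher-scoring tuples above the evaluation threshold.
kept = sorted((t for t in scored if t[1] >= threshold), key=lambda t: -t[1])[:top_k]
print(kept)
```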

Figure 6: Case study of Multi-position Prediction on WD50K.
JF17K Wikipeople WD50K
2-position prediction
av sro/av all av sro/av all av sro/av all
Ent-Ent 0.039 0.220 0.198 0.265 0.129 0.140 0.326 0.157 0.184
Ent-Rel 0.261 0.592 0.540 0.405 0.409 0.409 0.527 0.481 0.490
Rel-Rel 0.350 0.106 0.159 0.656 0.544 0.561 0.940 0.210 0.407
3-position prediction
av sro/av all av sro/av all av sro/av all
Ent-Ent 0.011 0.012 0.011 0.329 0.036 0.046 0.169 0.350 0.052
Ent-Rel 0.022 0.126 0.115 0.432 0.132 0.144 0.332 0.166 0.193
Rel-Rel 0.029 0.005 0.009 0.647 0.162 0.192 0.884 0.039 0.230
Table 3: The MRR results of Multi-position Prediction.

As shown in Table 3, we evaluated 2-position prediction and 3-position prediction on the three HKG datasets and used MRR to rank the answer tuples. The test set has three categories: Ent-Ent, Ent-Rel, and Rel-Rel. Ent-Ent and Rel-Rel indicate that all predicted positions are entities or relations respectively, while Ent-Rel indicates that entities and relations are missing jointly. In addition, whether the main triple is complete divides the samples into two further categories: "av" indicates that all predicted positions are in the auxiliary qualifiers, while "sro/av" indicates that one predicted position is in the main triple and the others are in qualifiers; "all" includes both categories. According to the results of the multi-position prediction tasks, HAHE performs better on the high-quality WD50K dataset, and prediction becomes more difficult when more positions are predicted or when one position lies in the main triple.

6 Conclusion

In this paper, we present HAHE, a model with hierarchical attention at the global and local levels. HAHE outperforms other baselines on link prediction tasks on hyper-relational knowledge graphs (HKGs). The experimental results demonstrate that our hypergraph dual-attention layers and heterogeneous self-attention layers are effective in learning the global and local structure of HKGs. We also use HAHE to solve the HKG multi-position prediction task and analyze its results for the first time.

Acknowledgements

This work is supported by the National Science Foundation of China (Grant No. 62176026) and Beijing Natural Science Foundation (M22009). This work is also supported by the BUPT Postgraduate Innovation and Entrepreneurship Project led by Haoran Luo.

Limitations

For HKG one-position link prediction tasks, HAHE shows the best performance on all three datasets. However, because HAHE is based on hypergraph learning, it improves more on WD50K, a high-quality hyper-relational link prediction dataset, and less on the WikiPeople dataset where triples are the majority, so HAHE favors facts with higher arity. In the future, we will consider extending our approach to triples within a unified architecture.

For HKG multi-position link prediction tasks, our model is effective when predicting multiple missing pieces of auxiliary information, which is a frequent situation in practical applications. However, its prediction accuracy needs further improvement when the main relation is missing.

Ethics Statement

This paper investigates the problem of knowledge graph link prediction, aiming to complete incomplete hyper-relational knowledge graphs using deep learning methods and thereby better support knowledge graphs in assisted decision-making and intelligent question-answering applications. Therefore, we believe it does not violate any ethical principles.

References

Appendix

Appendix A Hyperparameter Settings

We use grid search to select the optimal hyperparameter settings for the network. The average Hits@1 over all-entity prediction is chosen as the evaluation metric. The tunable hyperparameters and their candidate values are first determined according to the structure of our model and listed in Table 4.

Afterwards, the different hyperparameter choices are combined, and the predictive metrics after 50 epochs of training are used to judge the merit of each hyperparameter combination. The optimal hyperparameter combination of the model is obtained by a circular traversal of all hyperparameter combinations. The optimal hyperparameter values are shown in bold.
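A schematic of this circular traversal via grid search is shown below; the value grids are a subset of Table 4 and `train_and_eval` is only a placeholder for the actual 50-epoch training and evaluation run.

```python
import itertools

# A subset of the search space in Table 4 (illustrative).
grid = {
    "embedding_dim": [128, 256, 512, 1024],
    "global_layers": [0, 1, 2, 3, 4],
    "local_layers":  [4, 8, 12, 16, 24],
    "learning_rate": [0.0001, 0.0005, 0.001],
}

def train_and_eval(config, epochs=50):
    """Placeholder: train HAHE for `epochs` epochs and return all-entity Hits@1."""
    return 0.0  # replaced by the real training loop

best_cfg, best_hits1 = None, -1.0
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    hits1 = train_and_eval(config)          # 50-epoch run used to rank combinations
    if hits1 > best_hits1:
        best_cfg, best_hits1 = config, hits1
print(best_cfg, best_hits1)
```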

Hyperparameter JF17K Wikipeople WD50K
Embedding dimension $\{128,\textbf{256},512,1024\}$ $\{128,\textbf{256},512,1024\}$ $\{128,\textbf{256},512,1024\}$
Global layers $\{0,1,\textbf{2},3,4\}$ $\{0,1,\textbf{2},3,4\}$ $\{0,1,\textbf{2},3,4\}$
Global dropout $\{0.0,0.1,\textbf{0.2},0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$
Global activation $\{relu,\textbf{elu},gelu,tanh\}$ $\{relu,\textbf{elu},gelu,tanh\}$ $\{relu,\textbf{elu},gelu,tanh\}$
Global attention heads $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$
Local layers $\{4,8,\textbf{12},16,24\}$ $\{4,8,\textbf{12},16,24\}$ $\{4,8,\textbf{12},16,24\}$
Local dropout $\{0.0,0.1,\textbf{0.2},0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$
Local attention heads $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$
Decoder activation $\{relu,elu,\textbf{gelu},tanh\}$ $\{relu,elu,\textbf{gelu},tanh\}$ $\{relu,elu,\textbf{gelu},tanh\}$
Hidden size $\{128,\textbf{256},512,1024,2048,4096\}$ $\{128,\textbf{256},512,1024,2048,4096\}$ $\{128,\textbf{256},512,1024,2048,4096\}$
Batch size $\{128,256,512,\textbf{1024},2046\}$ $\{128,256,512,\textbf{1024},2046\}$ $\{\textbf{64},128,256,512,1024\}$
Learning rate $\{0.0001,\textbf{0.0005},0.001\}$ $\{0.0001,\textbf{0.0005},0.001\}$ $\{0.0001,\textbf{0.0005},0.001\}$
Weight decay $\{\textbf{0.01},0.02\}$ $\{\textbf{0.01},0.02\}$ $\{\textbf{0.01},0.02\}$
Soft label for entity $\{0.5,0.6,0.7,0.8,\textbf{0.9}\}$ $\{0.0,0.1,\textbf{0.2},0.3,0.4\}$ $\{0.0,0.1,\textbf{0.2},0.3,0.4\}$
Soft label for relation $\{\textbf{0.0},0.3,0.6,0.9\}$ $\{0.0,\textbf{0.1},0.2,0.3\}$ $\{0.0,\textbf{0.1},0.2,0.3\}$
Table 4: Hyperparameter Search.
Model JF17K Wikipeople WD50K
main relation all relations main relation all relations main relation all relations
MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10
m-TransH - - - - - - - - - - - - - - - - - -
RAE - - - - - - - - - - - - - - - - - -
NaLP 0.639 0.547 0.822 0.825 0.762 0.927 0.482 0.320 0.482 0.735 0.595 0.938 - - - - - -
NeuInfer 0.936 0.901 0.989 - - - 0.950 0.915 0.997 - - - - - - - - -
HINGE - - - 0.861 0.832 0.910 - - - 0.765 0.686 0.900 - - - - - -
StarE - - - 0.901 0.884 0.963 - - - 0.378 0.265 0.542 - - - - - -
Hyper2 0.950 0.933 0.976 - - - 0.947 0.914 0.987 - - - - - - - - -
HyTransformer - - - - - - - - - - - - - - - - - -
GRAN 0.992 0.988 0.988 0.996 0.993 0.999 0.957 0.942 0.976 0.960 0.946 0.977 - - - - - -
MSeaHKG - - - 0.933 0.894 0.972 - - - 0.831 0.787 0.972 - - - - - -
HAHE 0.993 0.989 0.998 0.996 0.994 0.999 0.957 0.941 0.978 0.958 0.942 0.978 0.916 0.885 0.964 0.927 0.900 0.969
HAHE w/o global 0.992 0.988 0.997 0.996 0.994 0.998 0.946 0.930 0.967 0.947 0.931 0.968 0.915 0.884 0.961 0.927 0.900 0.966
HAHE w/o node-bias 0.989 0.984 0.995 0.994 0.991 0.997 0.926 0.906 0.952 0.926 0.906 0.953 0.915 0.883 0.963 0.922 0.899 0.969
HAHE w/o edge-bias 0.992 0.987 0.998 0.995 0.993 0.999 0.948 0.932 0.970 0.949 0.933 0.970 0.915 0.882 0.963 0.926 0.898 0.968
Table 5: Comparison of HAHE with other models in terms of relation prediction accuracy on JF17K, WikiPeople and WD50K. Results of the baseline models are mainly taken from the original papers. Best results in each task are in bold.

Appendix B Model Training Details

We train for 300 epochs on each dataset with the optimal combination of hyperparameters. The prediction results are evaluated on the test sets using the best model. HAHE and all its variants were trained on a single 11 GB GTX 1080Ti GPU. With our optimal hyperparameter settings, the time required to complete training on the three datasets JF17K, WikiPeople, and WD50K is 5h, 12h, and 8h, respectively.

Appendix C Results of Relation Prediction

As shown in Table 5, previous HKG embedding models have already achieved very good results in relation prediction, with some metrics even reaching 99%.