
HAHE: Hierarchical Attention for Hyper-Relational Knowledge Graphs in Global and Local Level

Haoran Luo1, Haihong E1 , Yuhao Yang2, Yikai Guo3, Mingzhi Sun1,
Tianyu Yao1, Zichen Tang1, Kaiyang Wan1, Meina Song1, Wei Lin4
1School of Computer Science, Beijing University of Posts and Telecommunications, China
2School of Automation Science and Electrical Engineering, Beihang University, China
3Beijing Institute of Computer Technology and Application, China
4Inspur Group Co., Ltd., China
{luohaoran, ehaihong}@bupt.edu.cn
  Corresponding author.
Abstract

Link prediction on Hyper-relational Knowledge Graphs (HKGs) is a worthwhile endeavor. An HKG consists of hyper-relational facts (H-Facts), each composed of a main triple and several auxiliary attribute-value qualifiers, which can effectively represent factually comprehensive information. The internal structure of an HKG can be represented globally as a hypergraph and locally as semantic sequences. However, existing research seldom models the graphical and sequential structure of HKGs simultaneously, limiting the representation of HKGs. To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE), which includes global-level and local-level attention. The global-level attention models the graphical structure of an HKG using hypergraph dual-attention layers, while the local-level attention learns the sequential structure inside H-Facts via heterogeneous self-attention layers. Experimental results indicate that HAHE achieves state-of-the-art performance on link prediction tasks over standard HKG datasets. In addition, HAHE addresses the issue of HKG multi-position prediction for the first time, increasing the applicability of the HKG link prediction task. Our code is publicly available at https://github.com/LHRLAB/HAHE.

1 Introduction

Knowledge graphs (KGs) are semantic networks that define relationships between entities. Early KG research (Bordes et al., 2013; Sun et al., 2019; Balazevic et al., 2019) uses binary relationships, often expressed as a triple-based fact (subject, relation, object). Yet, n-ary relational facts (containing more than two entities) are abundant in real-world KGs like Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014). Rosso et al. (2020) represent an n-ary relational fact as a hyper-relational fact (H-Fact) consisting of a main triple $(s,r,o)$ and several auxiliary attribute-value qualifiers $\{(a_i:v_i)\}$; KGs composed of H-Facts are called hyper-relational knowledge graphs (HKGs).

Figure 1: An example of hyper-relational fact structure.
Figure 2: The Global-level Hypergraph-based representation and Local-level Sequence-based representation based on three examples of H-Facts in HKGs.

As shown in Figure 1, an H-Fact can describe a real-world fact. Unlike traditional triple-based facts, H-Facts do not simply raise the number of entities in a fact from two to n; they structurally and effectively represent the n-ary relational facts prevalent in reality. Globally, they extend the ordinary graph structure to a hypergraph (Zhou et al., 2006) structure. Locally, they define five heterogeneous roles $s, r, o, a, v$ within facts to capture the semantic information of the fact 'Barack Obama held position as US president', as illustrated in Figure 2.

Recent research has demonstrated various embedding strategies for hyper-relational representations. However, current approaches consider only global hypergraph structures or only local semantic sequence structures. For instance, StarE (Galkin et al., 2020) employs the information transfer function of graph neural networks (GNNs) to unidirectionally pass auxiliary key-value pair information into the main triples' relations, thereby capturing the graph structure but insufficiently modeling the interactions among multiple entities and relations within H-Facts. In contrast, GRAN (Wang et al., 2021) first incorporates the Transformer encoder (Vaswani et al., 2017) into HKG embedding, capturing the fully connected semantic information locally inside H-Facts while disregarding the global structure. This inadequate representation of HKG structure constrains HKG embeddings, which makes simultaneously representing the global and local structure of HKGs with hierarchical attention a promising research direction.

To overcome this limitation, we propose a novel Hierarchical Attention model for HKG Embedding (HAHE) that incorporates global-level and local-level attention. We update the global node embeddings using the HKG hypergraph structure. The previous hypergraph attention network (Bai et al., 2021) simply converts a hypergraph into a regular graph by fully connecting the nodes within each hyperedge and then applies GAT (Veličković et al., 2018) layers for node embedding updates, so it cannot distinguish which nodes comprise a hyperedge. Consequently, we design hypergraph dual-attention layers that aggregate node embedding information into hyperedge embeddings through an attention mechanism. After obtaining the hyperedge embeddings, we feed them back to the nodes through another attention step to update the node embeddings. In this way, nodes can learn more distant information from the whole HKG, and the hypergraph dual-attention method significantly enhances learning capacity. The updated node information is then passed to the local-level attention. Inspired by GRAN's heterogeneous attention (Wang et al., 2021), we define five types of nodes and fourteen types of edges in a single H-Fact and develop heterogeneous self-attention layers with both node-bias and edge-bias attention to learn the semantic content of H-Facts. The last step outputs the link prediction results through an MLP-based decoder for one-position or multi-position link prediction tasks on HKGs.

Experiments on link prediction were performed on three standard HKG datasets: JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020). The state-of-the-art results indicate that HAHE is effective on the link prediction task. In addition, adequate ablation experiments were designed to highlight the importance of the global-level and local-level attention, and HAHE is also applied to the HKG multi-position prediction task, i.e., predicting two or more entities or relations simultaneously in a single H-Fact, hence increasing the applicability of the HKG link prediction task. Finally, we make our code publicly available and discuss the limitations and future work of HKG embedding representation.

2 Related Work

Early approaches consider hyper-relational facts (H-Facts) mainly as graph structure, focusing more on the topological relations of entities. For example, m-TransH (Wen et al., 2016) projects the entities onto the relation hyperplane. RAE, NaLP, NeuInfer, and N-TuckER (Zhang et al., 2018; Guan et al., 2019, 2020; Liu et al., 2020) optimize hyper-relational knowledge graph (HKG) embedding based on m-TransH. However, none of them adopts the hyper-relational structure.

HINGE (Rosso et al., 2020) first proposes embedding the hyper-relational representation with attribute-value qualifiers using a CNN, and Hyper2 (Yan et al., 2022) initializes relation and entity embeddings as vectors in the Poincaré ball to improve model accuracy. Yet, neither of these methods considers the graphical structure or the semantic sequences of HKGs.

StarE (Galkin et al., 2020) employs a GNN as the message-passing mechanism to encode entities and relations, with a Transformer as the decoder to produce the result. HyTransformer (Yu and Yang, 2021) applies layer normalization and dropout to replace StarE's encoder. MSeaHKG (Di and Chen, 2022) argues that the message-passing function significantly impacts model performance and therefore replaces the static message-passing function in StarE with a dynamic one. These models consider the graph structure of hyper-relational facts but disregard the semantic sequences.

GRAN (Wang et al., 2021) improves upon the Transformer (Vaswani et al., 2017). It replaces the Transformer's self-attention with edge-biased fully-connected attention and accurately collects semantic information. Despite this, it ignores the graph structure.

Unlike earlier models, HAHE considers graph structure and semantic sequences simultaneously and employs hierarchical attention for link prediction on HKGs. Moreover, it is the first to improve upon previous methods by modeling the structure of HKGs via global-level embeddings, and it can perform multi-position prediction.

Figure 3: The overview of the HAHE model for Global-level and Local-level Representation of HKGs.

3 Preliminaries

This section presents important concepts and techniques for Hyper-relational Knowledge Graphs (HKGs), including the definition of HKGs, hypergraph learning, the global and local structure of HKGs, and multi-position prediction on HKGs.

3.1 Hyper-relational Knowledge Graphs

HKGs comprise hyper-relational facts (H-Facts). Typically, an H-Fact can be represented as $\mathcal{H}=\{((s,r,o),\{(a_{i}:v_{i})\}^{m}_{i=1})\mid s,o,v_{1},\ldots,v_{m}\in\mathcal{E},\ r,a_{1},\ldots,a_{m}\in\mathcal{R}\}$, where $(s,r,o)$ represents the main triple and $\{(a_{i}:v_{i})\}^{m}_{i=1}$ represents $m$ auxiliary attribute-value qualifiers.

The link prediction (LP) task on HKGs is to predict missing elements from H-Facts, where missing elements can be entities $\in\{s,o,v_{1},\ldots,v_{m}\}$ or relations $\in\{r,a_{1},\ldots,a_{m}\}$.
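For concreteness, the following is a minimal sketch of how an H-Fact and a one-position link prediction query could be represented in code; the field names and the example qualifiers are illustrative assumptions and are not taken from the released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An H-Fact: a main triple (s, r, o) plus m auxiliary attribute-value qualifiers.
@dataclass
class HFact:
    subject: str
    relation: str
    obj: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)  # [(a_i, v_i), ...]

    def arity(self) -> int:
        # Number of entities involved: s, o, and one value per qualifier.
        return 2 + len(self.qualifiers)

# Illustrative fact in the spirit of Figure 1 (qualifiers are made up for the example).
fact = HFact(
    subject="Barack Obama",
    relation="held position",
    obj="US president",
    qualifiers=[("start time", "2009"), ("replacing", "George W. Bush")],
)

# A one-position link prediction query masks a single element, e.g. the object:
query = HFact(subject="Barack Obama", relation="held position", obj="[MASK]",
              qualifiers=fact.qualifiers)
print(fact.arity(), query.obj)
```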

3.2 Hypergraph Learning on HKGs

Since there are more than two entities in an H-Fact, we introduce hypergraph learning (Feng et al., 2019). A hypergraph of an HKG, $\mathcal{G}_{H}=\{\mathcal{E}_{H},\mathcal{H}_{H},\mathcal{I}_{H}\}$, contains a node set $\mathcal{E}_{H}=\mathcal{E}$, a hyperedge set $\mathcal{H}_{H}=\mathcal{H}$, and a special incidence matrix $\mathcal{I}_{H}$ that records the weights of each hyperedge. $\mathcal{I}_{H}$ is a $|\mathcal{E}_{H}|\times|\mathcal{H}_{H}|$ matrix defined as follows:

$$h(v,e)=\begin{cases}1, & \text{if }v\in e,\\ 0, & \text{if }v\notin e,\end{cases} \qquad (1)$$

where $v\in\mathcal{E}_{H}$, $e\in\mathcal{H}_{H}$, and $h$ is a function that gives the entries of $\mathcal{I}_{H}$. For a node $v\in\mathcal{E}_{H}$, its degree is defined as $d(v)=\sum_{e\in\mathcal{H}_{H}}h(v,e)$, which represents the number of times the node (entity) appears in different hyperedges (H-Facts) in the whole HKG.
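A minimal sketch of building the incidence matrix $\mathcal{I}_{H}$ and the node degrees from a small set of H-Facts follows; the toy facts and the indexing scheme are assumptions made only for illustration.

```python
import numpy as np

# Hyperedges: each H-Fact contributes one hyperedge containing all its entities
# (subject, object, and every qualifier value).
hyperedges = [
    ["Barack Obama", "US president", "2009"],          # toy H-Fact 1
    ["Barack Obama", "Columbia University", "1983"],   # toy H-Fact 2
]

entities = sorted({v for e in hyperedges for v in e})
ent2id = {v: i for i, v in enumerate(entities)}

# Incidence matrix I_H of shape |E_H| x |H_H| (Eq. 1): h(v, e) = 1 iff v is in e.
I_H = np.zeros((len(entities), len(hyperedges)), dtype=np.float32)
for j, edge in enumerate(hyperedges):
    for v in edge:
        I_H[ent2id[v], j] = 1.0

# Node degree d(v) = sum_e h(v, e): how many H-Facts an entity appears in.
degree = I_H.sum(axis=1)
print(dict(zip(entities, degree)))  # e.g. "Barack Obama" has degree 2
```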

3.3 Multi-position Prediction on HKGs

Multi-position prediction is a new and meaningful task with more practicality than the one-position link prediction task on HKGs. For HKG link prediction, in a main triple we can predict any one element given the other two, i.e., we can predict $(s,r,?)$, $(s,?,o)$, or $(?,r,o)$. In an auxiliary attribute-value qualifier, if we know either the attribute or the value, we can predict the other, i.e., $(a,?)$ or $(?,v)$. In practical link prediction, however, some queries have two or more prediction positions, for example $(s,r,?,a_{1},?,a_{2},v_{2})$ (one prediction position in the main triple and another in the first attribute-value qualifier) or $(s,r,o,a_{1},?,a_{2},?)$ (both predicted positions in the auxiliary attribute-value qualifiers). We refer to link prediction tasks with two or more prediction positions as multi-position prediction tasks on HKGs.

4 Methodology

This section introduces our hyper-relational knowledge graph (HKG) embedding model HAHE, including the global and local representations, the two hierarchical attention layers, and the MLP decoder.

4.1 Global and Local Representation

HKGs $\mathcal{G}=\{\mathcal{E},\mathcal{R},\mathcal{H}\}$ consist of multiple hyper-relational facts (H-Facts) $\mathcal{H}$ with entities $\mathcal{E}$ and relations $\mathcal{R}$. Since each H-Fact represents an n-ary relation (n > 2) and carries rich, heterogeneous semantic information, we model HKGs in terms of a global-level hypergraph-based representation and a local-level sequence-based representation, with hypergraph dual-attention layers and heterogeneous self-attention layers respectively. The overview of HAHE is illustrated in Figure 3. For the global-level representation, we define $\mathcal{G}_{H}=\{\mathcal{E}_{H},\mathcal{H}_{H},\mathcal{I}_{H}\}$ to represent the graph structure of entities. Unlike in a regular graph, the hyperedges can connect more than two entity nodes. Moreover, we use the incidence matrix $\mathcal{I}_{H}$ to represent the association information between nodes and hyperedges. For the local-level representation, every H-Fact has the structure of one main triple and several auxiliary attribute-value qualifiers, $\mathcal{H}=\{((s,r,o),\{(a_{i}:v_{i})\}^{m}_{i=1})\mid s,o,v_{1},\ldots,v_{m}\in\mathcal{E},\ r,a_{1},\ldots,a_{m}\in\mathcal{R}\}$, which represents the semantic information of facts. We can fully connect the entities and relations in an H-Fact and represent them as a heterogeneous semantic sequence structure, containing five kinds of nodes $s,r,o,a,v$ and 14 kinds of edges $s\text{-}r$, $s\text{-}o$, $r\text{-}o$, $s\text{-}a$, $s\text{-}v$, $r\text{-}a$, $r\text{-}v$, $o\text{-}a$, $o\text{-}v$, $a_{i}\text{-}a_{j}$, $v_{i}\text{-}v_{j}$, $a_{i}\text{-}v_{i}$, $a_{i}\text{-}v_{j}$, where $i,j$ are the serial numbers of different qualifiers.

(a) Hypergraph Dual-Attention. (b) Heterogeneous Self-Attention.
Figure 4: The structure of Hypergraph Dual-Attention Layers and Heterogeneous Self-Attention Layers in HAHE.

4.2 Hypergraph Dual-Attention Layers

As shown in Figure 4(a), entity embeddings first pass through Hypergraph Dual-Attention Layers to learn hypergraph structural information at the global level. Previous hypergraph attention network methods create a transformed ordinary graph by fully joining the nodes within the same hyperedge and then apply GAT. The hypergraph representation is lost because the ordinary graph after this transformation cannot distinguish whether two nodes lie within the same hyperedge or in different hyperedges. We first initialize the entities as nodes with embeddings $\boldsymbol{h}_{v_{i}}\in\mathbb{R}^{d}$, where $v_{i}\in\mathcal{E}_{H}$ and $d$ is the embedding dimension, and initialize the H-Facts as hyperedges with embeddings $\boldsymbol{h}_{e_{i}}\in\mathbb{R}^{d}$, where $e_{i}\in\mathcal{H}_{H}$. The node and hyperedge embeddings are then projected into the same space to obtain $\mathbf{W}\boldsymbol{h}_{v_{i}}$ and $\mathbf{W}\boldsymbol{h}_{e_{i}}$, where $\mathbf{W}\in\mathbb{R}^{d\times d}$. The attention from nodes to hyperedges (N-to-H Attention) is performed as follows:

$$\alpha_{ij}=att\left(\mathbf{W}\boldsymbol{h}_{e_{i}},\mathbf{W}\boldsymbol{h}_{v_{j}}\right),\quad v_{j}\in e_{i}, \qquad (2)$$

where $\alpha_{ij}$ indicates the importance of node $v_{j}$'s features to hyperedge $e_{i}$, and $att$ is the N-to-H attention function, for which we choose a single-layer neural network with a concatenation operation. $v_{j}\in e_{i}$ means we only calculate the attention where node $v_{j}$ belongs to hyperedge $e_{i}$, which is indexed via the hypergraph incidence matrix $\mathcal{I}_{H}$. Then, the information of the nodes is aggregated into the hyperedges:

$$\tilde{\boldsymbol{h}}_{e_{i}}=\sigma\left(\sum_{v_{j}\in e_{i}}\frac{\exp\left(\operatorname{LR}\left(\alpha_{ij}\right)\right)}{\sum_{v_{k}\in e_{i}}\exp\left(\operatorname{LR}\left(\alpha_{ik}\right)\right)}\mathbf{W}\boldsymbol{h}_{v_{j}}\right), \qquad (3)$$

where $\tilde{\boldsymbol{h}}_{e_{i}}\in\mathbb{R}^{d}$ is the updated hyperedge embedding, and $\operatorname{LR}$ and $\sigma$ are activation functions. After that, we aggregate the information of the updated hyperedges back to the nodes in a similar way with hyperedge-to-node attention (H-to-N Attention) as follows:

$$\beta_{ij}=att\left(\mathbf{W}\boldsymbol{h}_{v_{i}},\mathbf{W}\boldsymbol{h}_{e_{j}}\right),\quad e_{j}\ni v_{i}, \qquad (4)$$
$$\tilde{\boldsymbol{h}}_{v_{i}}=\sigma\left(\sum_{e_{j}\ni v_{i}}\frac{\exp\left(\operatorname{LR}\left(\beta_{ij}\right)\right)}{\sum_{e_{k}\ni v_{i}}\exp\left(\operatorname{LR}\left(\beta_{ik}\right)\right)}\tilde{\boldsymbol{h}}_{e_{j}}\right), \qquad (5)$$

where $\beta_{ij}$ denotes the importance of the updated hyperedge $e_{j}$ to node $v_{i}$, and $\tilde{\boldsymbol{h}}_{v_{i}}$ is the updated node embedding. In this way, we update the global hypergraph entity embeddings and achieve node-hyperedge-node dual-attention message passing. Although we introduce more hyperedge embeddings than ordinary GNNs, they are only used as intermediate variables for the hypergraph attention computation. PyG makes dual-attention easy to implement, keeping it as scalable to large graphs as ordinary GNNs.
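For illustration, the following is a condensed PyTorch sketch of one hypergraph dual-attention layer in a dense incidence-matrix formulation. It is a simplification under stated assumptions, not the exact implementation: the released code is built on PyG message passing, multi-head attention is omitted, and the H-to-N step here uses the updated hyperedge embeddings for both scoring and aggregation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphDualAttention(nn.Module):
    """One global-level layer: node -> hyperedge attention (Eqs. 2-3),
    then hyperedge -> node attention (Eqs. 4-5)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)          # shared projection W
        self.att_n2h = nn.Linear(2 * dim, 1, bias=False)  # att(.) for N-to-H
        self.att_h2n = nn.Linear(2 * dim, 1, bias=False)  # att(.) for H-to-N

    def _attend(self, q, k, att, mask):
        # Score every (q_i, k_j) pair with a single-layer network over the
        # concatenation (Eq. 2 / Eq. 4), masked by the incidence structure.
        Nq, Nk = q.size(0), k.size(0)
        pair = torch.cat([q.unsqueeze(1).expand(Nq, Nk, -1),
                          k.unsqueeze(0).expand(Nq, Nk, -1)], dim=-1)
        score = F.leaky_relu(att(pair).squeeze(-1))        # LR(.)
        score = score.masked_fill(mask == 0, float("-inf"))
        alpha = torch.softmax(score, dim=-1)               # normalize over members
        return alpha @ k                                   # weighted aggregation

    def forward(self, h_v, h_e, incidence):
        # h_v: [num_nodes, d], h_e: [num_hyperedges, d]
        # incidence: [num_nodes, num_hyperedges] binary matrix I_H
        pv, pe = self.W(h_v), self.W(h_e)
        # N-to-H: aggregate member nodes into each hyperedge embedding.
        h_e_new = F.elu(self._attend(pe, pv, self.att_n2h, incidence.t()))
        # H-to-N: send updated hyperedge embeddings back to their nodes.
        h_v_new = F.elu(self._attend(pv, h_e_new, self.att_h2n, incidence))
        return h_v_new, h_e_new

# Toy usage: 4 entities, 2 H-Facts, embedding dimension 8.
layer = HypergraphDualAttention(dim=8)
h_v, h_e = torch.randn(4, 8), torch.randn(2, 8)
I_H = torch.tensor([[1., 0.], [1., 1.], [0., 1.], [1., 0.]])
h_v_new, h_e_new = layer(h_v, h_e, I_H)
print(h_v_new.shape, h_e_new.shape)  # torch.Size([4, 8]) torch.Size([2, 8])
```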

After the hypergraph dual-attention layers, the updated node embeddings are distributed to the H-Fact sequences together with the relation embeddings, forming sequence embeddings that are fed into the Heterogeneous Self-Attention Layers.

Dataset H-Facts H-Facts with Q(%) Entities Relations Train Valid Test Arity
JF17K 100,947 46,320(45.9%) 28,645 501 76,379 - 24,568 2-6
WikiPeople 369,866 9,482(2.6%) 34,839 178 294,439 37,715 37,712 2-7
WD50K 236,507 32,167(13.6%) 47,156 532 166,435 23,913 46,159 2-67
Table 1: Dataset statistics, where the columns respectively indicate the number of all H-Facts, H-facts with qualifiers, entities, relations, H-facts in train/valid/test sets, and the range of arity of H-facts.

4.3 Heterogeneous Self-Attention Layers

Each element in the sequence, $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$, has one of five roles, $s,r,o,a,v$, and a total of 14 kinds of edges with the other elements $\boldsymbol{x}_{j}\in\mathbb{R}^{d}$ in the sequence. So, as shown in Figure 4(b), we design heterogeneous self-attention layers with both node-bias and edge-bias to learn the local semantic information of the H-Facts. The elements in the sequence pass through this attention layer and update their embeddings as follows:

$$\gamma_{ij}=\frac{\left(\mathbf{W}_{role(i)}^{Q}\boldsymbol{x}_{i}+\mathbf{b}_{ij}^{Q}\right)^{\top}\left(\mathbf{W}_{role(j)}^{K}\boldsymbol{x}_{j}+\mathbf{b}_{ij}^{K}\right)}{\sqrt{d}}, \qquad (6)$$
$$\tilde{\boldsymbol{x}_{i}}=\sum_{j=1}^{n}\frac{\exp\left(\gamma_{ij}\right)}{\sum_{k=1}^{n}\exp\left(\gamma_{ik}\right)}\left(\mathbf{W}_{role(j)}^{V}\boldsymbol{x}_{j}+\mathbf{b}_{ij}^{V}\right), \qquad (7)$$

where $\gamma_{ij}$ is the importance between one element $\boldsymbol{x}_{i}$ in the sequence and another $\boldsymbol{x}_{j}$, and $\mathbf{W}_{role(i)}^{Q},\mathbf{W}_{role(i)}^{K},\mathbf{W}_{role(i)}^{V}\in\mathbb{R}^{d\times d}$ are the linear weight matrices of the query, key, and value; the five different kinds of $\boldsymbol{x}_{i}$ pass through different weight matrices indexed by the $role$ function as the node-bias, and $\mathbf{b}_{ij}^{Q},\mathbf{b}_{ij}^{K},\mathbf{b}_{ij}^{V}\in\mathbb{R}^{d}$ are designed as the edge-bias. The sequence embeddings $\tilde{\boldsymbol{x}_{i}}\in\mathbb{R}^{d}$ are then updated by learning the semantic information inside the H-Facts.
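A simplified single-head PyTorch sketch of the node-bias and edge-bias attention in Eqs. (6)-(7) is given below. The role and edge-type indexing, the extra padding edge type, and the omission of multi-head attention, residual connections, and feed-forward layers are all simplifying assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn

NUM_ROLES = 5        # s, r, o, a, v
NUM_EDGE_TYPES = 15  # 14 heterogeneous edge types + 1 padding/self type (assumed)

class HeterogeneousSelfAttention(nn.Module):
    """Single-head sketch of Eqs. 6-7: role-specific (node-bias) projections
    plus learned per-edge-type biases (edge-bias) on queries, keys and values."""

    def __init__(self, dim: int):
        super().__init__()
        # One query/key/value projection matrix per role -> node-bias.
        self.WQ = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        self.WK = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        self.WV = nn.Parameter(torch.randn(NUM_ROLES, dim, dim) * dim ** -0.5)
        # One bias vector per edge type -> edge-bias.
        self.bQ = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.bK = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.bV = nn.Embedding(NUM_EDGE_TYPES, dim)
        self.dim = dim

    def forward(self, x, roles, edge_types):
        # x: [n, d] element embeddings of one H-Fact sequence
        # roles: [n] role index of each element (0..4 for s, r, o, a, v)
        # edge_types: [n, n] edge-type index between every pair of elements
        q = torch.einsum("nij,nj->ni", self.WQ[roles], x)   # role-specific queries
        k = torch.einsum("nij,nj->ni", self.WK[roles], x)   # role-specific keys
        v = torch.einsum("nij,nj->ni", self.WV[roles], x)   # role-specific values
        bq, bk, bv = self.bQ(edge_types), self.bK(edge_types), self.bV(edge_types)
        # Eq. 6: pairwise scores with edge-biased queries and keys.
        scores = ((q.unsqueeze(1) + bq) * (k.unsqueeze(0) + bk)).sum(-1) / self.dim ** 0.5
        attn = torch.softmax(scores, dim=-1)                 # [n, n]
        # Eq. 7: aggregate edge-biased values.
        return torch.einsum("ij,ijd->id", attn, v.unsqueeze(0) + bv)

# Toy usage: one H-Fact sequence (s, r, o, a1, v1) -> 5 elements.
layer = HeterogeneousSelfAttention(dim=16)
x = torch.randn(5, 16)
roles = torch.tensor([0, 1, 2, 3, 4])                  # s, r, o, a, v
edge_types = torch.randint(0, NUM_EDGE_TYPES, (5, 5))  # illustrative edge-type map
print(layer(x, roles, edge_types).shape)               # torch.Size([5, 16])
```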

4.4 MLP Decoder

Finally, from the updated sequence embeddings, the embedding at the position to be predicted, $\tilde{\boldsymbol{x}}_{p}$, is selected and passed through an MLP decoder to obtain the prediction distribution and the link prediction results.

$$\boldsymbol{P}=\operatorname{softmax}\left(\boldsymbol{MLP}(\tilde{\boldsymbol{x}}_{p})\,\mathbf{E}^{\top}\right), \qquad (8)$$

where $\boldsymbol{MLP}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, $\mathbf{E}\in\mathbb{R}^{|\mathcal{E}|\times d}$ shares the initial element embedding matrix, and $\boldsymbol{P}\in\mathbb{R}^{|\mathcal{E}|}$ is obtained after the softmax operation; it denotes the similarity probability of $\tilde{\boldsymbol{x}}_{p}$ with each element in the HKG, from which the link prediction answers are obtained.
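A minimal sketch of the scoring step in Eq. (8) follows, assuming a small two-layer MLP and a shared entity embedding matrix; the layer sizes, dimensions, and random inputs are purely illustrative.

```python
import torch
import torch.nn as nn

dim, num_entities, hidden = 16, 1000, 32

# Shared entity embedding matrix E (|E| x d), also used to initialize the encoder.
E = nn.Parameter(torch.randn(num_entities, dim))

# MLP: R^d -> R^d, applied to the embedding at the masked position.
mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

x_p = torch.randn(1, dim)              # updated embedding at the predicted position
logits = mlp(x_p) @ E.t()              # similarity with every entity (Eq. 8)
P = torch.softmax(logits, dim=-1)      # probability over all candidate entities
print(P.shape, P.argmax(dim=-1))       # top-ranked entity id
```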

4.5 Learning Strategy

The model is trained with the final loss, computed from the similarity between the prediction target and all entities:

$$\mathcal{L}=-\sum_{t=1}^{|\mathcal{E}|}{\mathbf{y}_{t}\log\boldsymbol{P}_{t}}, \qquad (9)$$

where $\mathbf{y}_{t}$ is the $t$-th entry of the label vector $\mathbf{y}$ and $\boldsymbol{P}_{t}$ is the $t$-th entry of $\boldsymbol{P}$.

For multi-position prediction, thanks to the fully connected attention mechanism, HAHE can accomplish the multi-position prediction task on HKGs by masking two or more entities or relations at two or more positions in the same hyper-relational fact. After passing through the MLP decoder, HAHE obtains the prediction distribution $\boldsymbol{P}_{i}\ (i\geq 2)$ for each masked position and selects the entity or relation with the highest similarity among all HKG elements as the prediction result for that position.
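The following sketch illustrates the objective in Eq. (9) on a one-hot label and the masking view of multi-position prediction, where each masked position yields its own distribution. The tensors are random placeholders rather than model outputs, and the (optional) soft-label smoothing is only noted as an assumption.

```python
import torch
import torch.nn.functional as F

num_entities = 1000

# --- Training objective (Eq. 9): cross-entropy between the one-hot (or soft)
# --- label y and the predicted distribution P over all entities.
P = torch.softmax(torch.randn(1, num_entities), dim=-1)   # stand-in decoder output
target = torch.tensor([42])                                # index of the true entity
y = F.one_hot(target, num_entities).float()               # may be smoothed by the soft-label setting
loss = -(y * torch.log(P)).sum(dim=-1).mean()
print(loss)

# --- Multi-position prediction: mask several positions in one H-Fact and read
# --- an independent distribution P_i from each masked position's embedding.
P_multi = torch.softmax(torch.randn(2, num_entities), dim=-1)  # two masked positions
predictions = P_multi.argmax(dim=-1)                           # best candidate at each position
print(predictions)
```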

Model JF17K Wikipeople WD50K
subject / object all entities subject / object all entities subject / object all entities
MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10
m-TransH 0.206 0.206 0.462 0.102 0.069 0.168 0.063 0.063 0.300 - - - - - - - - -
RAE 0.215 0.215 0.466 0.310 0.219 0.504 0.058 0.058 0.306 0.172 0.102 0.320 - - - - - -
NaLP 0.221 0.165 0.331 0.366 0.290 0.516 0.408 0.331 0.546 0.338 0.272 0.466 - - - 0.224 0.158 0.330
NeuInfer 0.449 0.361 0.624 0.473 0.397 0.618 0.476 0.415 0.585 0.333 0.259 0.477 0.243 0.176 0.377 0.228 0.162 0.341
HINGE 0.431 0.342 0.611 0.517 0.436 0.675 0.342 0.272 0.463 0.350 0.282 0.467 - - - 0.232 0.164 0.343
StarE 0.574 0.496 0.725 0.542 0.454 0.685 0.491 0.398 0.592 0.378 0.265 0.542 0.349 0.271 0.496 - - -
Hyper2 0.583 0.500 0.746 - - - 0.461 0.391 0.597 - - - - - - - - -
HyTransformer 0.582 0.501 0.742 - - - 0.501 0.426 0.634 - - - 0.356 0.281 0.498 - - -
GRAN 0.617 0.539 0.770 0.656 0.582 0.799 0.503 0.438 0.620 0.479 0.410 0.604 - - - 0.309 0.24 0.441
MSeaHKG - - - 0.577 0.481 0.711 - - - 0.395 0.291 0.554 - - - - - -
HAHE 0.623 0.554 0.806 0.668 0.597 0.816 0.509 0.447 0.639 0.495 0.420 0.631 0.368 0.291 0.516 0.402 0.327 0.546
HAHE w/o global 0.621 0.548 0.787 0.659 0.588 0.797 0.501 0.434 0.629 0.483 0.407 0.612 0.356 0.280 0.501 0.390 0.315 0.531
HAHE w/o node-bias 0.620 0.546 0.787 0.659 0.587 0.797 0.474 0.429 0.622 0.487 0.405 0.611 0.354 0.283 0.506 0.396 0.313 0.529
HAHE w/o edge-bias 0.620 0.545 0.786 0.657 0.586 0.796 0.503 0.435 0.629 0.483 0.407 0.612 0.357 0.281 0.513 0.391 0.316 0.533
Table 2: Comparison of HAHE with other models in terms of entity prediction accuracy on JF17K, WikiPeople and WD50K. Results of the baseline models are mainly taken from the original papers. Best results in each task are in bold.

5 Experiments

This section introduces the experimental settings, results, and analysis. We answer the following research questions (RQs). RQ1: Can HAHE outperform other hyper-relational knowledge graph (HKG) embedding models on HKG datasets? RQ2: How does the hierarchical attention mechanism contribute to HAHE? RQ3: How does the hypergraph dual-attention mechanism contribute to HAHE at the global level? RQ4: How does the heterogeneity of nodes and edges in hyper-relational facts contribute to HAHE at the local level? RQ5: How does HAHE perform on multi-position prediction tasks on HKGs?

5.1 Experimental Setup

5.1.1 Datasets

We conduct experiments on three hyper-relational datasets, JF17K (Wen et al., 2016), WikiPeople (Guan et al., 2019), and WD50K (Galkin et al., 2020), as shown in Table 1. Among them, JF17K is extracted from Freebase (Bollacker et al., 2008). WikiPeople is obtained by filtering out the statements containing literals from the original WikiPeople dataset, which is derived from Wikidata (Vrandečić and Krötzsch, 2014) and concerns entities of type human. WD50K is a high-quality hyper-relational dataset with a larger share of hyper-relational facts carrying auxiliary attribute-value qualifiers.

5.1.2 Baselines

We compare HAHE against a sizable collection of previous hyper-relational approaches namely: (i) m-TransH (Wen et al., 2016) (ii) RAE (Zhang et al., 2018) (iii) NaLP (Guan et al., 2019) (iv) NeuInfer (Guan et al., 2020) (v) HINGE (Rosso et al., 2020) (vi) StarE (Galkin et al., 2020) (vii) Hyper2 (Yan et al., 2022) (viii) HyTransformer (Yu and Yang, 2021) (ix) GRAN (Wang et al., 2021) (x) MSeaHKG (Di and Chen, 2022).

5.1.3 Ablations

To evaluate the significance of HAHE's three main modules, the hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity, we obtain seven simplified model variants: six obtained by removing one or two modules from the full model (HAHE-node, HAHE-edge, HAHE-node&edge, HAHE-global, HAHE-global&node, HAHE-global&edge) and a basic variant obtained by removing all three modules.

Figure 5: Ablation results. (a) Hits@1 results for the ablation study with all HAHE variants. (b) MRR results with different entity degrees for the global-level analysis. (c) Hits@1 results of subject/object prediction with different arities for the local-level analysis. (d) Hits@1 results of prediction in qualifiers with different arities for the local-level analysis.

5.1.4 Evaluation Metrics

Each model predicts entities and relations separately. We further split each prediction task into subject/object prediction in main triples and all-entity prediction over whole H-Facts, to test the model's main-triple prediction ability. MRR (mean reciprocal rank) and Hits@K (the proportion of test cases whose correct answer is ranked in the top K) for K = 1, 10 are used to evaluate each link prediction task.
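For reference, a small sketch of how MRR and Hits@K can be computed from the rank of the correct answer in each test query is shown below; filtered-ranking details are omitted and the example ranks are made up.

```python
def mrr_and_hits(ranks, ks=(1, 10)):
    """ranks: 1-based rank of the correct entity/relation for each test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits

# Toy example: four queries whose correct answers were ranked 1, 3, 2 and 15.
print(mrr_and_hits([1, 3, 2, 15]))
# (0.475, {1: 0.25, 10: 0.75})
```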

5.1.5 Hyperparameters and Environment

The model was trained for 300 epochs using the Adam optimizer with a batch size of 1024 examples on a single GeForce GTX 1080Ti for each dataset. Appendix A shows HAHE's optimal hyperparameter settings, and Appendix B gives the training details.

5.2 Main Results (RQ1)

In this experiment, we evaluate our model on the link prediction task. For entity prediction, the results of our model and each of its variants are reported in Table 2. For relation prediction, the results are shown in Appendix C. We observe that HAHE outperforms the other current methods on all three datasets. On JF17K, for the prediction of subject and object, HAHE reports an improvement of 0.6 (0.9%) MRR points, 1.5 (2.7%) H@1, and 3.6 (4.6%) H@10 over the best previous approach. For the prediction of all entities, HAHE reports a gain of 1.2 (1.8%) MRR, 1.5 (2.5%) H@1, and 1.7 (2.1%) H@10 over the next-best approach. We also observe improvements of varying degrees on the other two datasets. On WD50K, the latest high-quality HKG dataset, our model shows the largest improvement over the existing SOTA model GRAN, with an MRR improvement of about 9 points, which suggests that the model is better suited to hypergraph-structured knowledge graphs containing hyper-relational facts beyond binary relations.

5.3 Ablation Study (RQ2)

The hypergraph dual-attention mechanism, node heterogeneity, and edge heterogeneity are the three components required for HAHE's operation. We evaluate seven different variants of HAHE to determine whether each component is helpful. When evaluating each model variant with a variety of hyperparameters, the result of the optimal prediction was recorded. For the different HAHE variants in Figure 5(a), it can be observed that hypergraph dual-attention, node heterogeneity, and edge heterogeneity all contribute to the accuracy of our complete model. In addition, we outline the specific results of the three primary HAHE variants, each lacking one necessary component, in Table 2. Through this comparison, our experimental results intuitively demonstrate the effectiveness of HAHE.

We then performed a more refined ablation analysis to explore the significance of the hypergraph dual-attention mechanism at the global level and the heterogeneity of nodes and edges at the local level, respectively.

5.4 Analysis of Hypergraph Dual-attention Mechanism in Global Level (RQ3)

To investigate the hypergraph dual-attention mechanism, we group the entity evaluation results on JF17K by the degree of entities in the HKG hypergraph structure. As shown in Figure 5(b), the hypergraph dual-attention mechanism improves the prediction accuracy of entities with different degrees. Owing to the attention mechanism, entity feature information supports the message passing of global information through the hyper-relational facts (hyperedges). For entities with higher degrees, the hypergraph dual-attention mechanism can better capture the global features of the hyper-relational facts. Entities with lower degrees contribute little to capturing global information, but entities with higher degrees compensate for this and improve their prediction accuracy.

5.5 Analysis of Heterogeneity of Nodes and Edges in Local Level (RQ4)

The experimental results show that both node-bias and edge-bias with heterogeneity help distinguish the different types of entities in an H-Fact and the different types of relationships between them at the local level. Moreover, we find that node-bias plays a more prominent role than edge-bias in the subject/object prediction tasks, as shown in Figure 5(c), because the entity roles in the main triple are more diverse than those in the qualifiers. On the element prediction task in qualifiers, edge-bias performs better than node-bias, as shown in Figure 5(d), because edge-bias, unlike node-bias, can distinguish corresponding attribute-value pairs $(a_{i},v_{i})$ from non-corresponding attribute-value pairs $(a_{j},v_{i})$.

5.6 Results of Multi-position Prediction Tasks on HKGs (RQ5)

JF17K, WikiPeople, and WD50K were used to test our model on multi-position prediction. As the example in Figure 6 shows, our model outputs the embedding for each masked position and calculates a joint probability distribution over candidate answer tuples. Because increasing the number of predicted positions leads to far more candidate answer tuples than in one-position link prediction, we set an evaluation threshold to keep only the higher-scoring answer tuples when computing the evaluation results.
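One possible reading of this procedure is sketched below, under the assumption that a candidate answer tuple is scored by the product of its per-position probabilities; the exact scoring and threshold used in the released code may differ.

```python
import itertools
import torch

num_entities, threshold, top_k = 50, 1e-3, 5

# Per-position distributions P_1, P_2 output by the decoder for two masked positions.
P1 = torch.softmax(torch.randn(num_entities), dim=-1)
P2 = torch.softmax(torch.randn(num_entities), dim=-1)

# Joint score of a candidate answer tuple (e1, e2): product of the marginals (assumption).
scored = (((e1, e2), (P1[e1] * P2[e2]).item())
          for e1, e2 in itertools.product(range(num_entities), repeat=2))

# Keep only the higher-scoring tuples above the evaluation threshold.
kept = sorted((t for t in scored if t[1] >= threshold), key=lambda t: -t[1])[:top_k]
print(kept)
```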

Figure 6: Case study of Multi-position Prediction on WD50K.
JF17K Wikipeople WD50K
2-position prediction
av sro/av all av sro/av all av sro/av all
Ent-Ent 0.039 0.220 0.198 0.265 0.129 0.140 0.326 0.157 0.184
Ent-Rel 0.261 0.592 0.540 0.405 0.409 0.409 0.527 0.481 0.490
Rel-Rel 0.350 0.106 0.159 0.656 0.544 0.561 0.940 0.210 0.407
3-position prediction
av sro/av all av sro/av all av sro/av all
Ent-Ent 0.011 0.012 0.011 0.329 0.036 0.046 0.169 0.350 0.052
Ent-Rel 0.022 0.126 0.115 0.432 0.132 0.144 0.332 0.166 0.193
Rel-Rel 0.029 0.005 0.009 0.647 0.162 0.192 0.884 0.039 0.230
Table 3: The MRR results of Multi-position Prediction.

As shown in Table 3, we evaluated 2-position prediction and 3-position prediction on the three HKG datasets and used MRR to rank the answer tuples. The test set has three categories: Ent-Ent, Ent-Rel, and Rel-Rel. Ent-Ent and Rel-Rel indicate that all predicted positions are entities or relations respectively, while Ent-Rel indicates that entities and relations are missing jointly. In addition, whether the main triple is complete divides the samples into two further categories: "av" indicates that all predicted positions are in the auxiliary qualifiers, while "sro/av" indicates that one predicted position is in the main triple and the others are in qualifiers; "all" includes both categories. According to the results of the multi-position prediction tasks, HAHE performs better on the high-quality WD50K dataset, and prediction becomes more difficult when more positions are predicted or when one position lies in the main triple.

6 Conclusion

In this paper, we present HAHE, a model with hierarchical attention at the global and local levels. HAHE outperforms other baselines on link prediction tasks on hyper-relational knowledge graphs (HKGs). The experimental results demonstrate that our hypergraph dual-attention layers and heterogeneous self-attention layers are effective in learning the global and local structure of HKGs. We also use HAHE to solve the HKG multi-position prediction task and analyze its results for the first time.

Acknowledgements

This work is supported by the National Science Foundation of China (Grant No. 62176026) and Beijing Natural Science Foundation (M22009). This work is also supported by the BUPT Postgraduate Innovation and Entrepreneurship Project led by Haoran Luo.

Limitations

For HKG one-position link prediction tasks, HAHE shows the best performance on all three datasets. However, because HAHE is based on hypergraph learning, it improves more on WD50K, a high-quality hyper-relational link prediction dataset, and less on the WikiPeople dataset where triples are the majority, so HAHE favors facts with higher arity. In the future, we will consider extending our approach to triples within a unified architecture.

For HKG multi-position link prediction tasks, our model is effective when predicting multiple missing pieces of auxiliary information, which is a frequent situation in practical applications. However, its prediction accuracy needs further improvement when the main relation is missing.

Ethics Statement

This paper investigates the problem of knowledge graph link prediction, aiming to complete incomplete hyper-relational knowledge graphs using deep learning methods and thereby better support knowledge graphs in assisted decision-making and intelligent question-answering applications. Therefore, we believe it does not violate any ethical principles.

References

Appendix

Appendix A Hyperparameter Settings

We use grid search to select the optimal hyperparameter settings for the network. The average Hits@1 over all-entity prediction is chosen as the evaluation metric. The tunable hyperparameters and their candidate values are first determined according to the structure of our model and listed in Table 4.

Afterwards, the different hyperparameter choices are combined, and the predictive metrics after 50 epochs of training are used to judge the merit of each hyperparameter combination. The optimal hyperparameter combination of the model is obtained by a circular traversal of all hyperparameter combinations. The optimal hyperparameter values are shown in bold.
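A schematic of this circular traversal via grid search is shown below; the value grids are a subset of Table 4 and `train_and_eval` is only a placeholder for the actual 50-epoch training and evaluation run.

```python
import itertools

# A subset of the search space in Table 4 (illustrative).
grid = {
    "embedding_dim": [128, 256, 512, 1024],
    "global_layers": [0, 1, 2, 3, 4],
    "local_layers":  [4, 8, 12, 16, 24],
    "learning_rate": [0.0001, 0.0005, 0.001],
}

def train_and_eval(config, epochs=50):
    """Placeholder: train HAHE for `epochs` epochs and return all-entity Hits@1."""
    return 0.0  # replaced by the real training loop

best_cfg, best_hits1 = None, -1.0
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    hits1 = train_and_eval(config)          # 50-epoch run used to rank combinations
    if hits1 > best_hits1:
        best_cfg, best_hits1 = config, hits1
print(best_cfg, best_hits1)
```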

Hyperparameter JF17K Wikipeople WD50K
Embedding dimension $\{128,\textbf{256},512,1024\}$ $\{128,\textbf{256},512,1024\}$ $\{128,\textbf{256},512,1024\}$
Global layers $\{0,1,\textbf{2},3,4\}$ $\{0,1,\textbf{2},3,4\}$ $\{0,1,\textbf{2},3,4\}$
Global dropout $\{0.0,0.1,\textbf{0.2},0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$
Global activation $\{relu,\textbf{elu},gelu,tanh\}$ $\{relu,\textbf{elu},gelu,tanh\}$ $\{relu,\textbf{elu},gelu,tanh\}$
Global attention heads $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$
Local layers $\{4,8,\textbf{12},16,24\}$ $\{4,8,\textbf{12},16,24\}$ $\{4,8,\textbf{12},16,24\}$
Local dropout $\{0.0,0.1,\textbf{0.2},0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$ $\{0.0,\textbf{0.1},0.2,0.3,0.4,0.5\}$
Local attention heads $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$ $\{\textbf{4},8,12,16\}$
Decoder activation $\{relu,elu,\textbf{gelu},tanh\}$ $\{relu,elu,\textbf{gelu},tanh\}$ $\{relu,elu,\textbf{gelu},tanh\}$
Hidden size $\{128,\textbf{256},512,1024,2048,4096\}$ $\{128,\textbf{256},512,1024,2048,4096\}$ $\{128,\textbf{256},512,1024,2048,4096\}$
Batch size $\{128,256,512,\textbf{1024},2046\}$ $\{128,256,512,\textbf{1024},2046\}$ $\{\textbf{64},128,256,512,1024\}$
Learning rate $\{0.0001,\textbf{0.0005},0.001\}$ $\{0.0001,\textbf{0.0005},0.001\}$ $\{0.0001,\textbf{0.0005},0.001\}$
Weight decay $\{\textbf{0.01},0.02\}$ $\{\textbf{0.01},0.02\}$ $\{\textbf{0.01},0.02\}$
Soft label for entity $\{0.5,0.6,0.7,0.8,\textbf{0.9}\}$ $\{0.0,0.1,\textbf{0.2},0.3,0.4\}$ $\{0.0,0.1,\textbf{0.2},0.3,0.4\}$
Soft label for relation $\{\textbf{0.0},0.3,0.6,0.9\}$ $\{0.0,\textbf{0.1},0.2,0.3\}$ $\{0.0,\textbf{0.1},0.2,0.3\}$
Table 4: Hyperparameter Search.
Model JF17K Wikipeople WD50K
main relation all relations main relation all relations main relation all relations
MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10 MRR H@1 H@10
m-TransH - - - - - - - - - - - - - - - - - -
RAE - - - - - - - - - - - - - - - - - -
NaLP 0.639 0.547 0.822 0.825 0.762 0.927 0.482 0.320 0.482 0.735 0.595 0.938 - - - - - -
NeuInfer 0.936 0.901 0.989 - - - 0.950 0.915 0.997 - - - - - - - - -
HINGE - - - 0.861 0.832 0.910 - - - 0.765 0.686 0.900 - - - - - -
StarE - - - 0.901 0.884 0.963 - - - 0.378 0.265 0.542 - - - - - -
Hyper2 0.950 0.933 0.976 - - - 0.947 0.914 0.987 - - - - - - - - -
HyTransformer - - - - - - - - - - - - - - - - - -
GRAN 0.992 0.988 0.988 0.996 0.993 0.999 0.957 0.942 0.976 0.960 0.946 0.977 - - - - - -
MSeaHKG - - - 0.933 0.894 0.972 - - - 0.831 0.787 0.972 - - - - - -
HAHE 0.993 0.989 0.998 0.996 0.994 0.999 0.957 0.941 0.978 0.958 0.942 0.978 0.916 0.885 0.964 0.927 0.900 0.969
HAHE w/o global 0.992 0.988 0.997 0.996 0.994 0.998 0.946 0.930 0.967 0.947 0.931 0.968 0.915 0.884 0.961 0.927 0.900 0.966
HAHE w/o node-bias 0.989 0.984 0.995 0.994 0.991 0.997 0.926 0.906 0.952 0.926 0.906 0.953 0.915 0.883 0.963 0.922 0.899 0.969
HAHE w/o edge-bias 0.992 0.987 0.998 0.995 0.993 0.999 0.948 0.932 0.970 0.949 0.933 0.970 0.915 0.882 0.963 0.926 0.898 0.968
Table 5: Comparison of HAHE with other models in terms of relation prediction accuracy on JF17K, WikiPeople and WD50K. Results of the baseline models are mainly taken from the original papers. Best results in each task are in bold.

Appendix B Model Training Details

We train for 300 epochs on each dataset with the optimal combination of hyperparameters. The prediction results are evaluated on the test sets using the best model. HAHE and all its variants were trained on a single 11 GB GTX 1080Ti GPU. With our optimal hyperparameter settings, the time required to complete training on the three datasets JF17K, WikiPeople, and WD50K is 5h, 12h, and 8h, respectively.

Appendix C Results of Relation Prediction

As shown in Table 5, previous HKG embedding models have already achieved very good results in relation prediction, with some metrics even reaching 99%.