
Knowledge Graph Reasoning with Relational Digraph

Yongqi Zhang, 4Paradigm Inc., Beijing, China, [email protected]  and  Quanming Yao, EE, Tsinghua University, Beijing, China, [email protected]
(2022)
Abstract.

Reasoning on the knowledge graph (KG) aims to infer new facts from existing ones. Methods based on the relational path have shown strong, interpretable, and transferable reasoning ability. However, paths are naturally limited in capturing local evidence in graphs. In this paper, we introduce a novel relational structure, i.e., relational directed graph (r-digraph), which is composed of overlapped relational paths, to capture the KG's local evidence. Since r-digraphs are more complex than paths, how to efficiently construct and effectively learn from them is challenging: directly encoding the r-digraphs does not scale well, and capturing query-dependent information within them is hard. We propose a variant of graph neural network, i.e., RED-GNN, to address the above challenges. Specifically, RED-GNN makes use of dynamic programming to recursively encode multiple r-digraphs with shared edges, and utilizes a query-dependent attention mechanism to select the strongly correlated edges. We demonstrate that RED-GNN is not only efficient but also achieves significant performance gains over existing methods in both inductive and transductive reasoning tasks. Besides, the learned attention weights in RED-GNN exhibit interpretable evidence for KG reasoning. The code is available at https://github.com/AutoML-Research/RED-GNN/. Correspondence is to Q. Yao.

Knowledge graph, Graph embedding, Knowledge graph reasoning, Graph representation learning, Graph neural network
journalyear: 2022; copyright: none; conference: Proceedings of the ACM Web Conference 2022, April 25–29, 2022, Virtual Event, Lyon, France; booktitle: Proceedings of the ACM Web Conference 2022 (WWW '22), April 25–29, 2022, Virtual Event, Lyon, France; price: 15.00; isbn: 978-1-4503-9096-5/22/04; doi: 10.1145/3485447.3512008; ccs: Information systems, Web searching and information discovery; ccs: Computing methodologies, Knowledge representation and reasoning; ccs: Computing methodologies, Machine learning algorithms

1. Introduction

Knowledge graph (KG), which contains the interactions among real-world objects, people, concepts, etc., bridges artificial intelligence and human knowledge (Battaglia et al., 2018; Ji et al., 2020; Hogan et al., 2021). The interactions are represented as facts in the triple form (subject entity, relation, object entity) to indicate the relation between entities. Real-world KGs are large and highly incomplete (Wang et al., 2017; Ji et al., 2020), thus inferring new facts is challenging. KG reasoning simulates such a process to deduce new facts from existing ones (Chen et al., 2020). It has wide application in semantic search (Berant and Liang, 2014), recommendation (Cao et al., 2019), question answering (Abujabal et al., 2018), etc. In this paper, we focus on learning relational structures for reasoning on queries of the form (subject entity, relation, ?).

Over the last decade, triple-based models have gained much attention for learning semantic information in KGs (Wang et al., 2017). These models reason directly on triples with entity and relation embeddings, such as TransE (Bordes et al., 2013), ConvE (Dettmers et al., 2017), ComplEx (Trouillon et al., 2017), RotatE (Sun et al., 2019), QuatE (Zhang et al., 2019a), AutoSF (Zhang et al., 2020b), etc. Since the triples are learned independently, these models cannot explicitly capture the local evidence (Niepert, 2016; Xiong et al., 2017; Yang et al., 2017; Teru et al., 2020), i.e., the local structures around the query triples, which can serve as evidence for KG reasoning (Niepert, 2016).

Learning on paths can help to better capture local evidence in graphs since paths preserve sequential connections between nodes (Perozzi et al., 2014; Grover and Leskovec, 2016). The relational path was the first attempt to capture both the semantic and local evidence for reasoning (Lao and Cohen, 2010). DeepPath (Xiong et al., 2017), MINERVA (Das et al., 2017) and M-walk (Shen et al., 2018) use reinforcement learning (RL) to sample relational paths that are strongly correlated with the queries. Due to the sparsity of KGs, the RL approaches are hard to train on large-scale KGs (Chen et al., 2020). PathCon (Wang et al., 2021) samples all the relational paths between two entities and uses an attention mechanism to weight the different paths, but it is expensive for entity query tasks. The rule-based methods, such as RuleN (Meilicke et al., 2018), Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019) and RNNLogic (Qu et al., 2021), generalize relational paths as logical rules, which learn to infer relations by logical composition of relations and can provide interpretable insights. Besides, the logical rules can handle inductive reasoning, where unseen entities appear at inference time, which is common in real-world applications (Yang et al., 2017; Sadeghian et al., 2019; Zhang and Chen, 2019).

Subgraphs are naturally more informative than paths in capturing the local evidence (Battaglia et al., 2018; Ding et al., 2021; Xiao et al., 2021). Their effectiveness has been empirically verified in, e.g., graph-based recommendation (Zhang and Chen, 2019; Zhao et al., 2017) and node representation learning (Hamilton et al., 2017). With the success of graph neural networks (GNNs) (Gilmer et al., 2017; Kipf and Welling, 2016) in modeling graph-structured data, GNNs have been introduced to capture subgraph structures in KGs. R-GCN (Schlichtkrull et al., 2018), CompGCN (Vashishth et al., 2019) and KE-GCN (Yu et al., 2021) propose to update the representations of entities by aggregating all the neighbors in each layer. However, they cannot distinguish the structural roles of different neighbors and are not interpretable. DPMPN (Xu et al., 2019) learns to reduce the size of the subgraph for reasoning on large-scale KGs by pruning the entities irrelevant to a given query, rather than learning specific local structures. p-GAT (Harsha V. et al., 2020) jointly learns a graph attention network and a Markov logic network to bridge the gap between embeddings and rules. However, a set of rules must be pre-defined and an expensive EM algorithm is needed for optimization. Recently, GraIL (Teru et al., 2020) proposes to predict the relation from the local enclosing subgraph and shows the inductive ability of subgraphs. However, it suffers from both effectiveness and efficiency problems due to the limitation of the enclosing subgraph.

Inspired by the interpretable and transferable abilities of path-based methods and the structure-preserving property of subgraphs, we introduce a novel relational structure into KG, called r-digraph, to combine the best of both worlds. The r-digraph generalizes relational paths to subgraphs by preserving the overlapped relational paths and the structures of relations for reasoning. Different from relational paths, which have simple structures, how to efficiently construct and learn from the r-digraphs is challenging since the construction process is expensive (Teru et al., 2020; Wang et al., 2021). Inspired by saving computation costs in overlapping sub-problems with dynamic programming, we propose RED-GNN, a variant of GNN (Gilmer et al., 2017), to recursively encode multiple RElational Digraphs (r-digraphs) with shared edges and select the strongly correlated edges through query-dependent attention weights. Empirically, RED-GNN shows significant gains over the state-of-the-art reasoning methods on both inductive and transductive benchmarks. Besides, the training and inference processes are efficient, the number of model parameters is small, and the learned structures are interpretable.

2. Related Works

A knowledge graph is in the form of $\mathcal{K}=\{\mathcal{V},\mathcal{R},\mathcal{F}\}$, where $\mathcal{V}$, $\mathcal{R}$ and $\mathcal{F}=\{(e_s,r,e_o)\,|\,e_s,e_o\in\mathcal{V}, r\in\mathcal{R}\}$ are the sets of entities, relations and fact triples, respectively. Let $e_q$ be the query entity, $r_q$ the query relation, and $e_a$ the answer entity. Given the query $(e_q,r_q,?)$, the reasoning task is to predict the missing answer entity $e_a$. Generally, all the entities in $\mathcal{V}$ are candidates for $e_a$ (Chen et al., 2020; Das et al., 2017; Yang et al., 2017).

The key for KG reasoning is to capture the local evidence, i.e., entities and relations around the query triple $(e_q,r_q,e_a)$. We regard $(e_q,r_q,e_a)$ as the missing link, so the local evidence does not contain this triple. Examples of such local evidence explored in the literature are relational paths (Das et al., 2017; Lao et al., 2011; Xiong et al., 2017; Wang et al., 2021) and subgraphs (Schlichtkrull et al., 2018; Teru et al., 2020; Vashishth et al., 2019). In this part, we introduce the path-based methods and GNN-based methods that leverage structures in $\mathcal{F}$ for reasoning.

2.1. Path-based methods

Relational path (Definition 1) is a set of triples that are sequentially connected. The path-based methods learn to predict the triple $(e_q,r_q,e_a)$ from a set of relational paths as local evidence. DeepPath (Xiong et al., 2017) learns to generate the relational path from $e_q$ to $e_a$ by reinforcement learning (RL). To improve the efficiency, MINERVA (Das et al., 2017) and M-walk (Shen et al., 2018) learn to generate multiple paths starting from $e_q$ by RL. The scores are indicated by the arrival frequency on different $e_a$'s. Due to the complex structure of KGs, the reward is very sparse, making it hard to train the RL models (Chen et al., 2020). PathCon (Wang et al., 2021) samples all the paths connecting the two entities to predict the relation between them, which is expensive for the reasoning task $(e_q,r_q,e_a)$.

Definition 1 (Relational path (Lao et al., 2011; Xiong et al., 2017; Zhang et al., 2020a)).

A relational path with length $L$ is a set of $L$ triples $(e_0,r_1,e_1),(e_1,r_2,e_2),\dots,(e_{L-1},r_L,e_L)$ that are connected head-to-tail sequentially.

Instead of directly using paths, the rule-based methods learn logical rules as a generalized form of relational paths. The logical rules are formed as compositions of relations that infer a specific relation, which provides better interpretation and can be transferred to unseen entities. The rules are learned either by mining methods like RuleN (Meilicke et al., 2018), by an EM algorithm like RNNLogic (Qu et al., 2021), or by end-to-end training, such as Neural LP (Yang et al., 2017) and DRUM (Sadeghian et al., 2019), to generate highly correlated relational paths between $e_q$ and $e_a$. However, the rules only capture sequential evidence, and thus cannot learn more complex patterns such as subgraph structures.

(a) Knowledge graph. (b) $\mathcal{G}_{\text{Sam,Spider-2}|3}$. (c) Recursive encoding.
Figure 1. Graphical illustration. In (c), the subgraph formed by the gray ellipses is $\mathcal{G}_{\text{Sam,Spider-2}|3}$ and the subgraph formed by the yellow rectangles is $\mathcal{G}_{\text{Sam,U.S.}|3}$. Dashed edges mean the reverse relations of the corresponding color (best viewed in color).

2.2. GNN-based methods

As mentioned in Section 1, subgraphs can preserve richer information than relational paths since higher node degrees are allowed in a subgraph. GNNs have shown strong power in modeling graph-structured data (Battaglia et al., 2018). This inspires recent works, such as R-GCN (Schlichtkrull et al., 2018), CompGCN (Vashishth et al., 2019), and KE-GCN (Yu et al., 2021), to extend GNNs to KGs and aggregate the entities' and relations' representations under the message passing framework (Gilmer et al., 2017) as

(1) $\bm{h}_{e_o}^{\ell}=\delta\Big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e_o)\in\mathcal{F}}\phi\big(\bm{h}^{\ell-1}_{e_s},\bm{h}_{r}^{\ell}\big)\Big),$

which aggregates the messages $\phi(\cdot,\cdot)$ over the 1-hop neighbor edges $(e_s,r,e_o)\in\mathcal{F}$ of entity $e_o$ with dimension $d$. $\bm{W}^{\ell}\in\mathbb{R}^{d\times d}$ is a weighting matrix, $\delta$ is the activation function, and $\bm{h}_r^{\ell}$ is the relation representation in the $\ell$-th layer. After $L$ layers of aggregation, the representations $\bm{h}_e^{L}$, which capture the local structures of entities $e\in\mathcal{V}$, work jointly with a decoder scoring function to measure the triples. Since the message passing function (1) aggregates information from all the neighbors and is independent of the query, R-GCN and CompGCN cannot capture the explicit local evidence for specific queries and are not interpretable.
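To make Eq. (1) concrete, below is a minimal PyTorch sketch of one layer of this relational message passing. The tensor layout (edge_index, edge_type) and the choices of $\phi$ as element-wise addition and $\delta$ as ReLU are our own illustrative assumptions, not the exact implementations of R-GCN or CompGCN.

```python
import torch
import torch.nn as nn

class RelMessagePassingLayer(nn.Module):
    """One layer of Eq. (1): h_o = delta(W * sum_{(s,r,o)} phi(h_s, h_r)).
    A sketch with phi taken as addition and delta as ReLU."""
    def __init__(self, num_rel, dim):
        super().__init__()
        self.rel_emb = nn.Embedding(num_rel, dim)   # h_r^l for this layer
        self.W = nn.Linear(dim, dim, bias=False)    # W^l

    def forward(self, h, edge_index, edge_type):
        # h: [num_ent, dim]; edge_index: [2, num_edges] holding (e_s, e_o) pairs
        src, dst = edge_index
        msg = h[src] + self.rel_emb(edge_type)      # phi(h_s, h_r) = h_s + h_r
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, msg)                 # sum messages into each e_o
        return torch.relu(self.W(agg))              # delta(W * sum(...))
```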

Instead of using all the neighbors, DPMPN (Xu et al., 2019) designs one GNN to aggregate the embeddings of entities, and another GNN to dynamically expand and prune the inference subgraph from the query entity $e_q$. A query-dependent attention is applied over the sampled entities for pruning. This approach shows an interpretable reasoning process via the attention flow on the pruned subgraph, but it still requires embeddings to guide the pruning and thus cannot generalize to unseen entities. Besides, it cannot capture the explicit local evidence supporting a given query triple.

p-GAT (Harsha V. et al., 2020) and pLogicNet (Qu and Tang, 2019) jointly learn an embedding-based model and a Markov logic network (MLN) with a variational EM algorithm. The embedding model generates evidence for the MLN to update, and the MLN expands the training data for the embedding model. The MLN bridges the gap between embedding models and logic rules, but it requires a pre-defined set of rules for initialization, and training with the variational EM algorithm is expensive.

Recently, GraIL (Teru et al., 2020) proposes to extract the enclosing subgraph $\mathcal{G}_{(e_q,e_a)}$ between the query entity $e_q$ and the answer entity $e_a$. To learn on the enclosing subgraph, a relational GNN (Schlichtkrull et al., 2018) with query-dependent attention is applied over the edges in $\mathcal{G}_{(e_q,e_a)}$ to control the importance of edges for different queries, but the learned attention weights are not interpretable (see Appendix B). After $L$ layers of aggregation, the graph-level representation aggregates all the entities $e\in\mathcal{V}$ in the subgraph and is used to score the triple $(e_q,r_q,e_a)$. Since the subgraphs need to be explicitly extracted and scored for different triples, the computation cost is very high.

3. Relational Digraph (r-digraph)

The relational paths have shown strong transferable and interpretable reasoning ability on KGs (Das et al., 2017; Yang et al., 2017; Sadeghian et al., 2019). However, they are limited in capturing more complex dependencies in the KG since the nodes in paths are only sequentially connected. GNN-based methods can learn different subgraph structures, but none of the existing methods can efficiently learn subgraph structures that are both interpretable and inductive like rules. Hence, we are motivated to define a new kind of structure to explore the important local evidence.

Before defining r-digraph, we first introduce a special type of directed graph in Definition 1.

Definition 1 (Layered st-graph (Battista et al., 1998)).

The layered st-graph is a directed graph with exactly one source node (s) and one sink node (t). All the edges are directed, connecting nodes between consecutive layers and pointing from the $\ell$-th layer to the $(\ell+1)$-th layer.

Here, we adopt the common approach of augmenting the triples with reverse and identity relations (Vashishth et al., 2019; Sadeghian et al., 2019). Then, all the relational paths with length less than or equal to $L$ between $e_q$ and $e_a$ can be represented as relational paths $e_q\rightarrow_{r^1}\cdot\rightarrow_{r^2}\cdots\rightarrow_{r^L}e_a$ with length $L$. In this way, they can be formed as paths in a layered st-graph, with the single source entity $e_q$ and sink entity $e_a$. Such a structure preserves all the relational paths between $e_q$ and $e_a$ up to length $L$, and maintains the subgraph structures. Based on this observation, we define the r-digraph in Definition 2.

Definition 2 (r-digraph).

The r-digraph $\mathcal{G}_{e_q,e_a|L}$ is a layered st-graph with the source entity $e_q$ and the sink entity $e_a$. The entities in the same layer are different from each other. Any path pointing from $e_q$ to $e_a$ in the r-digraph is a relational path $e_q\rightarrow_{r^1}\cdot\rightarrow_{r^2}\cdots\rightarrow_{r^L}e_a$ with length $L$, where $r^{\ell}$ connects an entity in the $(\ell-1)$-th layer to an entity in the $\ell$-th layer. We define $\mathcal{G}_{e_q,e_a|L}=\emptyset$ if there is no relational path connecting $e_q$ and $e_a$ with length $L$.

Figure 1(b) provides an example of the r-digraph $\mathcal{G}_{\text{Sam,Spider-2}|3}$, which is used to infer the new triple (Sam, directed, Spider-2) in the exemplar KG in Figure 1(a). Inspired by the reasoning ability of relational paths (Das et al., 2017; Yang et al., 2017; Sadeghian et al., 2019), we aim to leverage the r-digraph for KG reasoning. However, different from the relational paths, which have simple structures that can be learned with sequential models (Xiong et al., 2017; Das et al., 2017), how to efficiently construct and how to effectively learn from the r-digraphs are challenging.

In the paper, $\mathcal{E}_{e_q,e_a|L}^{\ell}$ are the edges and $\mathcal{V}_{e_q,e_a|L}^{\ell}=\{e_o\,|\,(e_s,r,e_o)\in\mathcal{E}_{e_q,e_a|L}^{\ell}\}$ are the entities in the $\ell$-th layer of the r-digraph

$\mathcal{G}_{e_q,e_a|L}=\mathcal{E}_{e_q,e_a|L}^{1}\otimes\cdots\otimes\mathcal{E}_{e_q,e_a|L}^{L},$

with $\otimes$ denoting the layer-wise connection. We define the union operator as $\mathcal{G}_{e_{q_1},e_{a_1}|L}\cup\mathcal{G}_{e_{q_2},e_{a_2}|L}=\big(\mathcal{E}_{e_{q_1},e_{a_1}|L}^{1}\cup\mathcal{E}_{e_{q_2},e_{a_2}|L}^{1}\big)\otimes\cdots\otimes\big(\mathcal{E}_{e_{q_1},e_{a_1}|L}^{L}\cup\mathcal{E}_{e_{q_2},e_{a_2}|L}^{L}\big)$. Given an entity $e$, we denote $\hat{\mathcal{E}}_{e}^{\ell}$, $\check{\mathcal{E}}_{e}^{\ell}$ and $\mathcal{V}_{e}^{\ell}$ as the sets of out-edges, in-edges and entities, respectively, visible in $\ell$ steps walking from $e$. $\mathcal{E}_{e_q,e_a|L}^{\ell}$, $\mathcal{V}_{e_q,e_a|L}^{\ell}$ and $\hat{\mathcal{E}}_{e_q}^{\ell}$, $\mathcal{V}_{e_q}^{\ell}$ are graphically shown in Figure 1.

4. The Proposed Model

Here, we show how GNNs can be tailor-made to efficiently and effectively learn from the r-digraphs. Extracting subgraph structures and then learning the subgraph representation is a common practice for subgraph encoding in the literature, such as GraphSAGE (Hamilton et al., 2017) and GraIL (Teru et al., 2020). Given a query triple $(e_q,r_q,e_a)$, the subgraph encoding generally contains three processes:

  (i) extract the neighborhoods of both $e_q$ and $e_a$;

  (ii) take the intersection to construct the subgraph;

  (iii) run message passing and use the graph-level representation as the subgraph encoding.

When working on the r-digraph $\mathcal{G}_{e_q,e_a|L}$, the same approach can be customized as in Algorithm 1. First, we get the neighborhoods of both $e_q$ and $e_a$ in steps 2-5. Second, we take the intersection of the neighborhoods of $e_q$ and $e_a$ in steps 6-8 to induce the r-digraph $\mathcal{G}_{e_q,e_a|L}$ layer by layer. Third, if the r-digraph is empty, we set the representation as $\bm{0}$ in step 9. Otherwise, the message passing is conducted layer by layer in steps 10-12. Since $e_a$ is the single sink entity, the final-layer representation $\bm{h}_{e_a}^{L}(e_q,r_q)$ is used as the subgraph representation to encode the r-digraph $\mathcal{G}_{e_q,e_a|L}$. We name this simple solution RED-Simp.

Algorithm 1 RED-Simp: Message passing on a single r-digraph
1: initialize $\bm{h}_{e_q}^{0}(e_q,r_q)=\bm{0}$ and the entity sets $\mathcal{V}_{e_q}^{0}=\{e_q\}$, $\mathcal{V}_{e_a}^{0}=\{e_a\}$;
2: for $\ell=1,2,\dots,L$ do
3:    get the $\ell$-hop out-edges $\hat{\mathcal{E}}_{e_q}^{\ell}=\{(e_s,r,e_o)\in\mathcal{F}\,|\,e_s\in\mathcal{V}_{e_q}^{\ell-1}\}$ and entities $\mathcal{V}_{e_q}^{\ell}=\{e_o\,|\,(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}\}$ of $e_q$;
4:    get the $\ell$-hop in-edges $\check{\mathcal{E}}_{e_a}^{\ell}=\{(e_s,r,e_o)\in\mathcal{F}\,|\,e_o\in\mathcal{V}_{e_a}^{\ell-1}\}$ and entities $\mathcal{V}_{e_a}^{\ell}=\{e_s\,|\,(e_s,r,e_o)\in\check{\mathcal{E}}_{e_a}^{\ell}\}$ of $e_a$;
5: end for
6: for $\ell=0,1,\dots,L$ do
7:    intersection $\mathcal{E}_{e_q,e_a|L}^{\ell}=\hat{\mathcal{E}}_{e_q}^{\ell}\cap\check{\mathcal{E}}_{e_a}^{L-\ell}$ and $\mathcal{V}_{e_q,e_a|L}^{\ell}=\mathcal{V}_{e_q}^{\ell}\cap\mathcal{V}_{e_a}^{L-\ell}$;
8: end for
9: if $\mathcal{V}_{e_q,e_a|L}^{L}=\emptyset$, return $\bm{h}_{e_a}^{L}(e_q,r_q)=\bm{0}$;
10: for $\ell=1,2,\dots,L$ do
11:    message passing for entities $e_o\in\mathcal{V}_{e_q,e_a|L}^{\ell}$: $\bm{h}_{e_o}^{\ell}(e_q,r_q)=\delta\big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e_o)\in\mathcal{E}_{e_q,e_a|L}^{\ell}}\phi(\bm{h}^{\ell-1}_{e_s}(e_q,r_q),\bm{h}_r^{\ell})\big)$;
12: end for
13: return $\bm{h}_{e_a}^{L}(e_q,r_q)$.

However, Algorithm 1 is very expensive. First, we need to conduct the two-directional sampling in steps 3 and 4 and take the intersection to extract the r-digraph. Second, given a query $(e_q,r_q,?)$, we need to run this algorithm for $|\mathcal{V}|$ different triples with different answer entities $e_a\in\mathcal{V}$. It needs $O\big(|\mathcal{V}|\cdot(\min(\bar{D}^L,|\mathcal{F}|L)+d\bar{E}L)\big)$ time (see Section 4.3) to predict a given query $(e_q,r_q,?)$, where $\bar{D}$ is the average degree of entities in $\mathcal{V}$ and $\bar{E}$ is the average number of edges in the $\mathcal{E}_{e_q,e_a|L}^{\ell}$'s. These limitations also exist in PathCon (Wang et al., 2021) and GraIL (Teru et al., 2020). To improve efficiency, we propose to encode multiple r-digraphs recursively in Section 4.1.
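For reference, the following is a minimal Python sketch of the construction part of Algorithm 1 (steps 2-8): expanding from $e_q$ and $e_a$ and intersecting the visible sets layer by layer. The KG is assumed to be a plain list of $(e_s,r,e_o)$ triples already augmented with reverse and identity relations; this data layout and the function name are our own illustrative choices, not the released implementation.

```python
from collections import defaultdict

def build_r_digraph(triples, e_q, e_a, L):
    """Per-layer edge sets of G_{e_q,e_a|L}, a sketch of Algorithm 1 (steps 2-8).
    Assumes `triples` already contains reverse and identity (self-loop) relations,
    so the sets of visible entities grow monotonically with the number of steps."""
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for (s, r, o) in triples:
        out_adj[s].append((s, r, o))
        in_adj[o].append((s, r, o))

    fwd = [{e_q}]                   # entities visible from e_q within l steps
    for _ in range(L):
        fwd.append({o for s in fwd[-1] for (_, _, o) in out_adj[s]})

    bwd = [{e_a}]                   # entities visible from e_a within l backward steps
    for _ in range(L):
        bwd.append({s for o in bwd[-1] for (s, _, _) in in_adj[o]})

    # layer l keeps edges whose source is within l-1 steps of e_q and whose
    # target is within L-l steps of e_a, i.e. edges lying on a length-L path
    layers = []
    for l in range(1, L + 1):
        layers.append([(s, r, o) for src in fwd[l - 1]
                       for (s, r, o) in out_adj[src] if o in bwd[L - l]])
    return layers                   # all layers empty => G_{e_q,e_a|L} is empty
```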

4.1. Recursive r-digraph encoding

In Algorithm 1, when evaluating $(e_q,r_q,e_a)$ with different $e_a\in\mathcal{V}$ but the same query $(e_q,r_q,?)$, the neighboring edges $\hat{\mathcal{E}}_{e_q}^{\ell},\ell=1\dots L$ of $e_q$ are shared. We have the following observation:

Proposition 1.

The set of edges $\hat{\mathcal{E}}_{e_q}^{\ell}$ visible from $e_q$ in $\ell$ steps equals $\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$, namely $\hat{\mathcal{E}}_{e_q}^{\ell}$ is the union of the $\ell$-th layer edges in the r-digraphs between $e_q$ and all the entities $e_a\in\mathcal{V}$.

Proposition 1 indicates that the $\ell$-th layer edges $\mathcal{E}_{e_q,e_a|L}^{\ell}$ with different answer entities $e_a$ share the same set of edges in $\hat{\mathcal{E}}_{e_q}^{\ell}$. A common approach to save computation cost in overlapping sub-problems is dynamic programming. This has been used to aggregate node representations on large-scale graphs (Hamilton et al., 2017) or to propagate the question representation on KGs (Zhang et al., 2018). Inspired by the efficiency gained by dynamic programming, we recursively construct the r-digraph between $e_q$ and any entity $e_o$ as

(2) $\mathcal{G}_{e_q,e_o|\ell}=\cup_{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}}\;\mathcal{G}_{e_q,e_s|\ell-1}\otimes\big\{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}\big\}.$

Once the representations of $\mathcal{G}_{e_q,e_s|\ell-1}$ for all the entities $e_s\in\mathcal{V}_{e_q}^{\ell-1}$ in the $(\ell-1)$-th layer are ready, we can encode $\mathcal{G}_{e_q,e_o|\ell}$ by combining $\mathcal{G}_{e_q,e_s|\ell-1}$ with the shared edges $(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}$ in the $\ell$-th layer. Based on Proposition 1 and Eq. (2), we are motivated to recursively encode multiple r-digraphs with the shared edges in $\hat{\mathcal{E}}_{e_q}^{\ell}$ layer by layer. The full process is in Algorithm 2.

Algorithm 2 RED-GNN: recursive r-digraph encoding.
1: initialize $\bm{h}_{e_q}^{0}(e_q,r_q)=\bm{0}$ and the entity set $\mathcal{V}_{e_q}^{0}=\{e_q\}$;
2: for $\ell=1\dots L$ do
3:    collect the $\ell$-hop edges $\hat{\mathcal{E}}_{e_q}^{\ell}=\{(e_s,r,e_o)\in\mathcal{F}\,|\,e_s\in\mathcal{V}_{e_q}^{\ell-1}\}$ and entities $\mathcal{V}_{e_q}^{\ell}=\{e_o\,|\,(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}\}$;
4:    message passing for entities $e_o\in\mathcal{V}_{e_q}^{\ell}$: $\bm{h}_{e_o}^{\ell}(e_q,r_q)=\delta\big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}}\phi(\bm{h}^{\ell-1}_{e_s}(e_q,r_q),\bm{h}_r^{\ell})\big)$;
5: end for
6: assign $\bm{h}_{e_a}^{L}(e_q,r_q)=\bm{0}$ for all $e_a\notin\mathcal{V}_{e_q}^{L}$;
7: return $\bm{h}_{e_a}^{L}(e_q,r_q)$ for all $e_a\in\mathcal{V}$.

Initially, only $e_q$ is visible in $\mathcal{V}_{e_q}^{0}$. In the $\ell$-th layer, we collect the edges $\hat{\mathcal{E}}_{e_q}^{\ell}$ and entities $\mathcal{V}_{e_q}^{\ell}$ in step 3. Then, the message passing is constrained to the edges in $\hat{\mathcal{E}}_{e_q}^{\ell}$ to obtain the representations of $e_o\in\mathcal{V}_{e_q}^{\ell}$. Finally, the representations $\bm{h}_{e_a}^{L}(e_q,r_q)$, encoding $\mathcal{G}_{e_q,e_a|L}$ for all the entities $e_a\in\mathcal{V}$, are returned in step 7. The recursive encoding is more efficient thanks to the shared edges in $\hat{\mathcal{E}}_{e_q}^{\ell}$ and fewer loops. It learns the same representations as Algorithm 1, as guaranteed by Proposition 2.
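Below is a simplified PyTorch sketch of Algorithm 2. It keeps a boolean mask of the entities visited so far and propagates representations only through the shared edges $\hat{\mathcal{E}}_{e_q}^{\ell}$; $\phi$ is taken as addition as in Eq. (1), and the query-dependent attention of Eqs. (3)-(4) is omitted for brevity. The dense edge-list layout and names are our own illustrative choices, and identity (self-loop) triples are assumed to be included.

```python
import torch
import torch.nn as nn

class REDGNNSketch(nn.Module):
    """Simplified sketch of Algorithm 2 (without attention): starting from e_q,
    expand one hop per layer and aggregate only over the visible edges."""
    def __init__(self, num_rel, dim, L):
        super().__init__()
        self.L, self.dim = L, dim
        self.rel_emb = nn.ModuleList([nn.Embedding(num_rel, dim) for _ in range(L)])
        self.W = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(L)])

    def forward(self, triples, num_ent, e_q):
        # triples: LongTensor [num_edges, 3] of (e_s, r, e_o), with self-loops
        h = torch.zeros(num_ent, self.dim)            # h^0; all-zero start
        visited = torch.zeros(num_ent, dtype=torch.bool)
        visited[e_q] = True                           # V_{e_q}^0 = {e_q}
        for l in range(self.L):
            src, rel, dst = triples[:, 0], triples[:, 1], triples[:, 2]
            mask = visited[src]                       # edges in the l+1 hop set of e_q
            src, rel, dst = src[mask], rel[mask], dst[mask]
            msg = h[src] + self.rel_emb[l](rel)       # phi(h_s, h_r) = h_s + h_r
            agg = torch.zeros(num_ent, self.dim)
            agg.index_add_(0, dst, msg)
            h = torch.relu(self.W[l](agg))            # entities never reached stay 0
            visited[dst] = True                       # V_{e_q}^{l+1}
        return h                                      # h^L(e_q, r_q) for every e_a
```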

Proposition 2.

Given the same triple $(e_q,r_q,e_a)$, the structures encoded in $\bm{h}_{e_a}^{L}(e_q,r_q)$ by Algorithm 1 and Algorithm 2 are identical.

4.2. Interpretable reasoning with r-digraph

As the construction of $\mathcal{G}_{e_q,e_a|L}$ is independent of the query relation $r_q$, how to encode $r_q$ is another problem to address. Given different triples sharing the same r-digraph, e.g., (Sam, starred, Spider-2) and (Sam, directed, Spider-2), the local evidence used for reasoning is different. To capture the query-dependent knowledge in the r-digraphs and discover interpretable local evidence, we use the attention mechanism (Veličković et al., 2017) and encode $r_q$ into the attention weights to control the importance of different edges in $\mathcal{G}_{e_q,e_a|L}$. The message passing function is specified as

(3) $\bm{h}_{e_o}^{\ell}(e_q,r_q)=\delta\Big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}}\alpha_{e_s,r,e_o|r_q}^{\ell}\big(\bm{h}_{e_s}^{\ell-1}(e_q,r_q)+\bm{h}_{r}^{\ell}\big)\Big),$

where the attention weight $\alpha_{e_s,r,e_o|r_q}^{\ell}$ on edge $(e_s,r,e_o)$ is

(4) $\alpha_{e_s,r,e_o|r_q}^{\ell}=\sigma\left((\bm{w}_{\alpha}^{\ell})^{\top}\text{ReLU}\left(\bm{W}_{\alpha}^{\ell}\cdot\big(\bm{h}_{e_s}^{\ell-1}(e_q,r_q)\oplus\bm{h}_{r}^{\ell}\oplus\bm{h}_{r_q}^{\ell}\big)\right)\right),$

with $\bm{w}_{\alpha}^{\ell}\in\mathbb{R}^{d_{\alpha}}$, $\bm{W}_{\alpha}^{\ell}\in\mathbb{R}^{d_{\alpha}\times 3d}$, and $\oplus$ the concatenation operator. The sigmoid function $\sigma$ is used rather than softmax attention (Veličković et al., 2017) to ensure that multiple edges can be selected in the same neighborhood.
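The attention of Eq. (4) and the attended aggregation of Eq. (3) can be sketched in PyTorch as below. The per-edge tensors (h_src for $\bm{h}_{e_s}^{\ell-1}$, h_rel for $\bm{h}_r^{\ell}$, h_query for $\bm{h}_{r_q}^{\ell}$, and the destination indices dst) are assumed to be gathered for the edges in $\hat{\mathcal{E}}_{e_q}^{\ell}$; all names here are illustrative.

```python
import torch
import torch.nn as nn

class QueryAttnLayer(nn.Module):
    """Sketch of Eqs. (3)-(4): sigmoid edge attention conditioned on r_q."""
    def __init__(self, dim, dim_attn):
        super().__init__()
        self.W_alpha = nn.Linear(3 * dim, dim_attn)         # W_alpha^l
        self.w_alpha = nn.Linear(dim_attn, 1, bias=False)   # w_alpha^l
        self.W = nn.Linear(dim, dim, bias=False)             # W^l

    def forward(self, h_src, h_rel, h_query, dst, num_ent):
        # h_src, h_rel, h_query: [num_edges, dim]; dst: [num_edges]
        alpha = torch.sigmoid(self.w_alpha(torch.relu(       # Eq. (4); sigmoid, not softmax
            self.W_alpha(torch.cat([h_src, h_rel, h_query], dim=-1)))))
        msg = alpha * (h_src + h_rel)                         # Eq. (3) message term
        agg = torch.zeros(num_ent, h_src.size(-1))
        agg.index_add_(0, dst, msg)
        return torch.relu(self.W(agg))                        # delta = ReLU here
```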

After $L$ layers of aggregation by (3), the representation $\bm{h}_{e_a}^{L}(e_q,r_q)$ encodes the essential information for scoring $(e_q,r_q,e_a)$. Hence, we design a simple scoring function

(5) $f(e_q,r_q,e_a)=\bm{w}^{\top}\bm{h}_{e_a}^{L}(e_q,r_q),$

with $\bm{w}\in\mathbb{R}^{d}$. We associate the multi-class log-loss (Lacroix et al., 2018) with each training triple $(e_q,r_q,e_a)$, i.e.,

(6) $\sum\nolimits_{(e_q,r_q,e_a)\in\mathcal{T}_{\text{tra}}}\Big(-f(e_q,r_q,e_a)+\log\big(\sum\nolimits_{\forall e\in\mathcal{V}}e^{f(e_q,r_q,e)}\big)\Big).$

The first part in (6) is the score of the positive triple $(e_q,r_q,e_a)$ in $\mathcal{T}_{\text{tra}}$, the set of training queries, and the second part contains the scores of all triples with the same query $(e_q,r_q,?)$. The model parameters $\bm{\Theta}=\big\{\{\bm{W}^{\ell}\},\{\bm{w}_{\alpha}^{\ell}\},\{\bm{W}_{\alpha}^{\ell}\},\{\bm{h}_{r}^{\ell}\},\bm{w}\big\}$ are randomly initialized and optimized by minimizing (6) with stochastic gradient descent (Kingma and Ba, 2014).
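Given the final representations $\bm{h}_{e_a}^{L}(e_q,r_q)$ for all candidate entities of one query, the scoring function (5) and the multi-class log-loss (6) reduce to a linear read-out followed by a cross-entropy over all entities; a small sketch with our own variable names:

```python
import torch
import torch.nn.functional as F

def query_loss(h_final, w, e_a):
    """h_final: [num_ent, dim], representations h^L(e_q, r_q) for one query (e_q, r_q, ?);
    w: read-out vector of Eq. (5); e_a: index of the answer entity."""
    scores = h_final @ w                           # Eq. (5): f(e_q, r_q, e) for all e
    # Eq. (6): -f(e_q, r_q, e_a) + log sum_e exp(f(e_q, r_q, e)) == cross-entropy
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([e_a]))
```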

We provide Theorem 1 to show that if a set of relational paths is strongly correlated with the query triple, the paths can be identified by the attention weights in RED-GNN, and are thus interpretable. This theorem is also empirically illustrated in Section 5.4.

Theorem 1.

Given a triple $(e_q,r_q,e_a)$, let $\mathcal{P}$ be a set of relational paths $e_q\rightarrow_{r_i^1}\cdot\rightarrow_{r_i^2}\cdots\rightarrow_{r_i^L}e_a$ that are generated by a set of rules between $e_q$ and $e_a$ with the form

$r_i^1(X,Z_1)\wedge r_i^2(Z_1,Z_2)\wedge\cdots\wedge r_i^L(Z_{L-1},Y)\rightarrow r_q(X,Y),$

where $X,Y,Z_1,\dots,Z_{L-1}$ are variables that are bound to unique entities. Denote $\mathcal{G}_{\mathcal{P}}$ as the r-digraph constructed by $\mathcal{P}$. There exist a parameter setting $\bm{\Theta}$ and a threshold $\theta\in(0,1)$ for RED-GNN such that $\mathcal{G}_{\mathcal{P}}$ equals the r-digraph formed by the edges of $\mathcal{G}_{e_q,e_a|L}$ whose attention weights satisfy $\alpha_{e_s,r,e_o|r_q}^{\ell}>\theta$.

4.3. Inference complexity

In this part, we compare the inference complexity of different GNN-based methods. When reasoning for the query $(e_q,r_q,?)$, we need to evaluate $|\mathcal{V}|$ triples with different answer entities $e\in\mathcal{V}$. We denote the average degree as $\bar{D}$ and the average number of edges in each layer of the r-digraph as $\bar{E}$.

  • RED-Simp (Algorithm 1). For construction, the main cost, which is $O\big(\min(\bar{D}^L,|\mathcal{F}|L)\big)$ with worst case $O(|\mathcal{F}|L)$, comes from extracting the neighborhoods of $e_q$ and $e_a$. Along with encoding, the total cost is $O\big(|\mathcal{V}|\cdot(\min(\bar{D}^L,|\mathcal{F}|L)+d\bar{E}L)\big)$.

  • RED-GNN (Algorithm 2). Since the full computation is conducted on $\hat{\mathcal{E}}_{e_q}^{\ell}$ and $\mathcal{V}_{e_q}^{\ell}$, the cost is the number of edges times the dimension $d$, i.e., $O\big(d\cdot\min(\bar{D}^L,|\mathcal{F}|L)\big)$ with $d\ll|\mathcal{V}|$.

  • CompGCN. All the representations are aggregated in one pass with cost $O(d|\mathcal{F}|L)$. Then the scores for all the entities are computed with cost $O(|\mathcal{V}|d)$. The total cost is $O(d|\mathcal{F}|L+d|\mathcal{V}|)$.

  • DPMPN. Denote the dimension and number of layers of the non-attentive GNN as $d_1,L_1$, and the average number of sampled edges in the pruning procedure as $\bar{D}$. The main cost comes from the non-attentive GNN with cost $O(d_1|\mathcal{F}|L_1)$. The cost on the sampled subgraph is $O(d\bar{D}^L)$. Thus, the total computation cost is $O(d_1|\mathcal{F}|L_1+d\bar{D}^L)$.

  • GraIL. Denote $\bar{V}$ as the average number of entities in the subgraphs. The complexity of constructing the enclosing subgraph is $O(\bar{E}\log\bar{V})$ with $\bar{E}$ edges and $\bar{V}$ entities. The cost of the GNN module is $O(d\bar{E}L)$. The overall cost is then $O\big(|\mathcal{V}|(\bar{E}\log\bar{V}+d\bar{E}L)\big)$.

In comparison, we have RED-GNN $\approx$ CompGCN $<$ DPMPN $<$ RED-Simp $<$ GraIL in terms of inference complexity. The empirical evaluation is provided in Section 5.3.

5. Experiments

All the experiments are written in Python with the PyTorch framework (Paszke et al., 2017) and run on an RTX 2080Ti GPU with 11GB memory.

5.1. Inductive reasoning

Inductive reasoning is a hot research topic (Hamilton et al., 2017; Sadeghian et al., 2019; Teru et al., 2020; Yang et al., 2017) as new entities keep emerging in real-world applications, such as new users, new items and new concepts (Zhang and Chen, 2019). Reasoning on unseen entities requires the model to capture the semantic and local evidence while ignoring the identity of entities.

Table 1. Inductive reasoning. Best performance is indicated by the bold face numbers.
WN18RR FB15k-237 NELL-995
V1 V2 V3 V4 V1 V2 V3 V4 V1 V2 V3 V4
MRR RuleN .668 .645 .368 .624 .363 .433 .439 .429 .615 .385 .381 .333
Neural LP .649 .635 .361 .628 .325 .389 .400 .396 .610 .361 .367 .261
DRUM .666 .646 .380 .627 .333 .395 .402 .410 .628 .365 .375 .273
GraIL .627 .625 .323 .553 .279 .276 .251 .227 .481 .297 .322 .262
RED-GNN .701 .690 .427 .651 .369 .469 .445 .442 .637 .419 .436 .363
Hit@1 (%) RuleN 63.5 61.1 34.7 59.2 30.9 34.7 34.5 33.8 54.5 30.4 30.3 24.8
Neural LP 59.2 57.5 30.4 58.3 24.3 28.6 30.9 28.9 50.0 24.9 26.7 13.7
DRUM 61.3 59.5 33.0 58.6 24.7 28.4 30.8 30.9 50.0 27.1 26.2 16.3
GraIL 55.4 54.2 27.8 44.3 20.5 20.2 16.5 14.3 42.5 19.9 22.4 15.3
RED-GNN 65.3 63.3 36.8 60.6 30.2 38.1 35.1 34.0 52.5 31.9 34.5 25.9
Hit@10 (%) RuleN 73.0 69.4 40.7 68.1 44.6 59.9 60.0 60.5 76.0 51.4 53.1 48.4
Neural LP 77.2 74.9 47.6 70.6 46.8 58.6 57.1 59.3 87.1 56.4 57.6 53.9
DRUM 77.7 74.7 47.7 70.2 47.4 59.5 57.1 59.3 87.3 54.0 57.7 53.1
GraIL 76.0 77.6 40.9 68.7 42.9 42.4 42.4 38.9 56.5 49.6 51.8 50.6
RED-GNN 79.9 78.0 52.4 72.1 48.3 62.9 60.3 62.1 86.6 60.1 59.4 55.6
Table 2. Transductive reasoning. Best performance is indicated by the bold face numbers. ‘-’ means unavailable results and results for methods with ‘*’ are copied from the original papers.
type models Family UMLS WN18RR FB15k-237 NELL-995
MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10
triple ConvE* - - - .94 92. 96. .43 39. 49. .325 23.7 50.1 - - -
RotatE .921 86.6 98.8 .925 86.3 99.3 .477 42.8 57.1 .337 24.1 53.3 .508 44.8 60.8
QuatE .941 89.6 99.1 .944 90.5 99.3 .480 44.0 55.1 .350 25.6 53.8 .533 46.6 64.3
path MINERVA .885 82.5 96.1 .825 72.8 96.8 .448 41.3 51.3 .293 21.7 45.6 .513 41.3 63.7
Neural LP .924 87.1 99.4 .745 62.7 91.8 .435 37.1 56.6 .252 18.9 37.5 out   of  memory
DRUM .934 88.1 99.6 .813 67.4 97.6 .486 42.5 58.6 .343 25.5 51.6 out   of  memory
RNNLogic* .842 77.2 96.5 .483 44.6 55.8 .344 25.2 53.0
GNN pLogicNet* .842 77.2 96.5 .441 39.8 53.7 .332 23.7 52.8
CompGCN .933 88.3 99.1 .927 86.7 99.4 .479 44.3 54.6 .355 26.4 53.5 out   of  memory
DPMPN .981 97.4 98.1 .930 89.9 98.2 .482 44.4 55.8 .369 28.6 53.0 .513 45.2 61.5
RED-GNN .992 98.8 99.7 .964 94.6 99.0 .533 48.5 62.4 .374 28.3 55.8 .543 47.6 65.1

Setup. We follow the general inductive setting (Sadeghian et al., 2019; Teru et al., 2020; Yang et al., 2017) where new entities appear in testing and the relations are the same as those in training. Specifically, training and testing contain two KGs, $\mathcal{K}_{\text{tra}}=\{\mathcal{V}_{\text{tra}},\mathcal{R},\mathcal{F}_{\text{tra}}\}$ and $\mathcal{K}_{\text{tst}}=\{\mathcal{V}_{\text{tst}},\mathcal{R},\mathcal{F}_{\text{tst}}\}$, with the same set of relations but disjoint sets of entities. Three sets of triples $\mathcal{T}_{\text{tra}}/\mathcal{T}_{\text{val}}/\mathcal{T}_{\text{tst}}$, augmented with reverse relations, are provided. $\mathcal{F}_{\text{tra}}$ is used to predict $\mathcal{T}_{\text{tra}}$ and $\mathcal{T}_{\text{val}}$ for training and validation, respectively. In testing, $\mathcal{F}_{\text{tst}}$ is used to predict $\mathcal{T}_{\text{tst}}$. Same as (Sadeghian et al., 2019; Teru et al., 2020; Yang et al., 2017), we use the filtered ranking metrics, i.e., mean reciprocal rank (MRR), Hit@1 and Hit@10 (Bordes et al., 2013), where larger values indicate better performance.

Baselines. Since training and testing contain disjoint sets of entities, all the methods requiring entity embeddings (Bordes et al., 2013; Harsha V. et al., 2020; Sun et al., 2019; Vashishth et al., 2019; Xu et al., 2019; Zhang et al., 2019a) cannot be applied here. We mainly compare with four methods: 1) RuleN (Meilicke et al., 2018), a discrete rule induction method; 2) Neural LP (Yang et al., 2017), the first differentiable method for rule learning; 3) DRUM (Sadeghian et al., 2019), an improvement of Neural LP (Yang et al., 2017); and 4) GraIL (Teru et al., 2020), which designs the enclosing subgraph for inductive reasoning. MINERVA (Das et al., 2017), PathCon (Wang et al., 2021) and RNNLogic (Qu et al., 2021) could potentially work in this setting, but customized source code for the inductive setting is not available, so they are not compared.

Hyper-parameters. For RED-GNN, we tune the learning rate in $[10^{-4},10^{-2}]$, weight decay in $[10^{-5},10^{-2}]$, dropout rate in $[0,0.3]$, batch size in $\{5,10,20,50,100\}$, dimension $d$ in $\{32,48,64,96\}$, attention dimension $d_{\alpha}$ in $\{3,5\}$, number of layers $L$ in $\{3,4,5\}$, and activation function $\delta$ in {identity, tanh, ReLU}. Adam (Kingma and Ba, 2014) is used as the optimizer. The best hyper-parameter settings are selected by the MRR metric on $\mathcal{T}_{\text{val}}$ with a maximum of 50 training epochs. For RuleN, we use its implementation with the default setting. For Neural LP and DRUM, we tune the learning rate in $[10^{-4},10^{-2}]$, dropout rate in $[0,0.3]$, batch size in $\{20,50,100\}$, dimension $d$ in $\{64,128\}$, number of RNN layers in $\{1,2\}$, and number of steps in $\{2,3,4,5\}$. For GraIL, we tune the learning rate in $[10^{-5},10^{-2}]$, weight decay in $[10^{-6},10^{-3}]$, batch size in $\{8,16,32\}$, dropout rate in $[0,0.4]$, edge dropout rate in $[0,0.6]$, the GNN aggregator among {sum, MLP, GRU} and the number of hops among $\{2,3,4\}$. The training epochs are all set to 50.

Tie policy. In evaluation, the tie policy is important. Specifically, when triples receive the same rank, choosing the largest or the smallest rank within a tie leads to rather different results (Sun et al., 2020). Since we give the same score 0 to all triples with $\mathcal{G}_{e_q,e_a|L}=\emptyset$, ties are a genuine concern. Hence, we use the average rank among the tied triples, as suggested in (Rossi et al., 2021).
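For completeness, here is a small sketch of the filtered ranking with the average-rank tie policy used for MRR and Hit@k; `scores` holds the model scores over all candidate entities of one query, `known` the other true answers to filter out, and `answer` the target entity (illustrative names, not the released evaluation code):

```python
import numpy as np

def filtered_rank(scores, answer, known):
    """Average rank of `answer`, filtering other known answers and averaging over ties."""
    s = np.asarray(scores, dtype=float).copy()
    s[[e for e in known if e != answer]] = -np.inf   # filtered setting
    target = s[answer]
    higher = np.sum(s > target)                      # strictly better candidates
    ties = np.sum(s == target) - 1                   # other candidates tied with the answer
    return higher + 1 + ties / 2.0                   # average rank within the tie

def mrr_and_hits(ranks, k=10):
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), (ranks <= k).mean()
```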

Datasets. We use the benchmark datasets in (Teru et al., 2020), created from WN18RR (Dettmers et al., 2017), FB15k-237 (Toutanova and Chen, 2015) and NELL-995 (Xiong et al., 2017). Each dataset includes four versions with different groups of triples. Please refer to (Teru et al., 2020) for more details.

Results. The performance is shown in Table 1. First, GraIL is the worst among all the methods since the enclosing subgraphs do not capture well the relational structures that generalize to unseen entities (more details in Appendix B). Second, there is no absolute winner among the rule-based methods as different rules adapt differently to these datasets. In comparison, RED-GNN outperforms the baselines across all the benchmarks. Based on Theorem 1, the attention weights help to adaptively learn correlated relational paths for different datasets while preserving the structural patterns. In some cases, the Hit@10 of RED-GNN is slightly worse than that of the rule-based methods since it may overfit to the top-ranked samples.

5.2. Transductive reasoning

Transductive reasoning, also known as KG completion (Trouillon et al., 2017; Wang et al., 2017), is another general setting in the literature. It evaluates the models’ ability to learn the patterns on an incomplete KG.

Setup. In this setting, a KG $\mathcal{K}=\{\mathcal{V},\mathcal{R},\mathcal{F}\}$ and the query triples $\mathcal{T}_{\text{val}}/\mathcal{T}_{\text{tst}}$, augmented with reverse relations, are given. For the triple-based methods, the triples in $\mathcal{F}$ are used for training, and $\mathcal{T}_{\text{val}}/\mathcal{T}_{\text{tst}}$ are used for inference. For the others, 3/4 of the triples in $\mathcal{F}$ are used to extract paths/subgraphs to predict the remaining 1/4 of the triples during training, and the full set $\mathcal{F}$ is then used to predict $\mathcal{T}_{\text{val}}/\mathcal{T}_{\text{tst}}$ during inference (Yang et al., 2017; Sadeghian et al., 2019). We use the same filtered ranking metrics with the same tie policy as in Section 5.1, namely MRR, Hit@1 and Hit@10.

Baselines. We compare RED-GNN with the triple-based methods ConvE (Dettmers et al., 2017), RotatE (Sun et al., 2019) and QuatE (Zhang et al., 2019a); the path-based methods MINERVA (Das et al., 2017), Neural LP (Yang et al., 2017), DRUM (Sadeghian et al., 2019) and RNNLogic (Qu et al., 2021); the MLN-based method pLogicNet (Qu and Tang, 2019); and the GNN-based methods CompGCN (Vashishth et al., 2019) and DPMPN (Xu et al., 2019). RuleN (Meilicke et al., 2018) is not compared here since it has been shown to be worse than DRUM (Sadeghian et al., 2019) and RNNLogic (Qu et al., 2021) in this setting. p-GAT is not compared as its results were obtained with a problematic evaluation framework (Sun et al., 2020). GraIL (Teru et al., 2020) is not compared since it is computationally intractable on large graphs (see Section 5.3).

Hyper-parameters. The tuning ranges of the hyper-parameters of RED-GNN are the same as those for inductive reasoning. For RotatE and QuatE, we tune the dimension in $\{100,200,500,1000\}$, batch size in $\{256,512,1024,2048\}$, weight decay in $[10^{-5},10^{-2}]$, and number of negative samples in $\{64,128,256,512,1024\}$, with 100000 training iterations. For MINERVA, Neural LP, DRUM and DPMPN, we use their provided default settings. For CompGCN, we choose ConvE as the decoder and circular correlation as the operator, as suggested in (Vashishth et al., 2019), and tune the learning rate in $[10^{-4},10^{-2}]$, number of layers in $\{1,2,3\}$ and dropout rate in $[0,0.3]$ with 300 training epochs.

Datasets. Five datasets are used, including Family (Kok and Domingos, 2007), UMLS (Kok and Domingos, 2007), WN18RR (Dettmers et al., 2017), FB15k-237 (Toutanova and Chen, 2015) and NELL-995 (Xiong et al., 2017) (data from https://github.com/alisadeghian/DRUM/tree/master/datasets and https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch/benchmarks/NELL-995). We provide the statistics of entities, relations and the split of triples in Table 3.

Table 3. Statistics of the transductive reasoning datasets. Note that NELL-995* is different from the version in (Das et al., 2017), whose training triples contain the validation and test triples.
$|\mathcal{V}|$ $|\mathcal{R}|$ $|\mathcal{F}|$ $|\mathcal{T}_{\text{val}}|$ $|\mathcal{T}_{\text{tst}}|$
Family 3,007 12 23,483 2,038 2,835
UMLS 135 46 5,327 569 633
WN18RR 40,943 11 86,835 3,034 3,134
FB15k-237 14,541 237 272,115 17,535 20,466
NELL-995* 74,536 200 149,678 543 2,818

Results. As in Table 2, the triple-based methods are better than the path-based ones on Family and UMLS, and are comparable with DRUM and RNNLogic on WN18RR and FB15k-237. The entity embeddings can implicitly preserve local information around entities, while the path-based methods may lose the structural patterns. CompGCN performs similarly to the triple-based methods since it mainly relies on the aggregated embeddings and the decoder scoring function. Neural LP, DRUM and CompGCN run out of memory on NELL-995 with 74k entities due to the use of the full adjacency matrix. For DPMPN, the entities in the pruned subgraph are more informative than those in CompGCN, leading to better performance. RED-GNN is better than all the baselines as indicated by the MRR metric. These results demonstrate that the r-digraph can not only transfer well to unseen entities, but also capture the important patterns in incomplete KGs without using entity embeddings.

5.3. Complexity analysis

We compare the complexity in terms of running time and parameter size of different methods in this part. We show the training time and the inference time on 𝒯tst\mathcal{T}_{\text{tst}} for each method in Figure 2(a), learning curves in Figure 2(b), and model parameters in Figure 2(c).

Inductive reasoning. We compare RuleN, Neural LP, DRUM, GraIL and RED-GNN on WN18RR (V1), FB15k-237 (V1) and NELL-995 (V1). Both training and inference are very efficient in RuleN. Neural LP and DRUM have similar cost but are more expensive than RED-GNN due to the use of the full adjacency matrix. For GraIL, both training and inference are very expensive since it requires bidirectional sampling to extract the subgraph and then computes on it for each triple. As for model parameters, RED-GNN and GraIL have similar amounts of parameters, fewer than Neural LP and DRUM. Overall, RED-GNN is more efficient than the differentiable methods Neural LP, DRUM and GraIL.

(a) Running time. (b) Learning curve. (c) Model parameters.
Figure 2. Running time analysis and learning curve. Left: inductive setting; Right: transductive setting.

Transductive reasoning. We compare RotatE, MINERVA, CompGCN, DPMPN and RED-GNN on Family, WN18RR and NELL-995. Due to its simple training framework on triples, RotatE is the fastest method. MINERVA is more efficient than the GNN-based methods since the sampled paths can be efficiently modeled as sequences. CompGCN, with fewer layers, has similar cost to RED-GNN as its computation is on the whole graph. For DPMPN, the pruning is expensive and two GNNs work together, so it is more expensive than RED-GNN. GraIL is hundreds of times more expensive than RED-GNN and thus intractable on the larger KGs in this setting. Since RED-GNN does not learn entity embeddings, it has far fewer parameters than the other four methods.

5.4. Case study: learned r-digraphs

We visualize some exemplar r-digraphs learned by RED-GNN with $L=3$ on the Family and UMLS datasets. Given a triple $(e_q,r_q,e_a)$, we remove the edges in $\mathcal{G}_{e_q,e_a|L}$ whose attention weights are less than $0.5$ and extract the remaining parts. Figure 3(a) shows one triple on which DRUM fails. As shown, inferring id-1482 as the son of id-1480 requires the knowledge that id-1480 is the only brother of the uncle id-1432 in the local structure. Figure 3(b) shows examples with the same $e_q$ and $e_a$ sharing the same r-digraph $\mathcal{G}_{e_q,e_a|L}$. As shown, RED-GNN can learn distinctive structures for different query relations, which echoes Theorem 1. The examples in Figure 3 demonstrate that RED-GNN is interpretable. We provide more examples and the visualization algorithm in Appendix A.
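The structures visualized in Figure 3 are obtained by simply thresholding the learned attention weights; a sketch of this post-processing step, assuming each r-digraph edge is stored together with its weight $\alpha$ (names illustrative, and the full visualization procedure is in Appendix A):

```python
def prune_by_attention(layers, threshold=0.5):
    """Keep only r-digraph edges whose attention weight reaches `threshold`.
    `layers` is a list over layers of (e_s, r, e_o, alpha) tuples."""
    return [[(s, r, o) for (s, r, o, alpha) in layer if alpha >= threshold]
            for layer in layers]
```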

(a) Family. (b) UMLS.
Figure 3. Visualization of the learned structures. Dashed lines mean inverse relations. The query triples are indicated by the red rectangles. Due to space limitation, entities in the UMLS dataset are shown as black circles (best viewed in color).

5.5. Ablation study

Variants of RED-GNN. Table 4 shows the performance of different variants. First, we study the impact of removing $r_q$ from the attention (denoted Attn-w.o.-$r_q$). Specifically, we remove $\bm{h}_{r_q}^{\ell}$ from the attention weight $\alpha_{e_s,r,e_o|r_q}^{\ell}$ in (4) and change the scoring function to $\bm{w}^{\top}(\bm{h}_{e_a}^{L}(e_q,r_q)\oplus\bm{h}_{r_q})$, with $\bm{w}\in\mathbb{R}^{2d}$. Since the attention $\alpha_{e_s,r,e_o|r_q}^{\ell}$ aims to figure out the important edges in $\mathcal{G}_{e_q,e_a|L}$, the learned structure is less informative without the control of the query relation $r_q$, and thus this variant performs poorly.

Table 4. Comparison of different variants of RED-GNN.
WN18RR (V1) FB15k-237 (V1) NELL (V1)
methods MRR H@10 MRR H@10 MRR H@10
Attn-w.o.-$r_q$ .659 78.3 .268 37.6 .517 73.4
RED-Simp .683 79.6 .311 45.3 .563 75.8
RED-GNN .701 79.9 .369 48.3 .637 86.6

Second, we replace Algorithm 2 in RED-GNN with the simple solution in Algorithm 1, i.e., RED-Simp. Due to the efficiency issue, the loss function (6), which requires computing the scores over many negative triples, cannot be used. Hence, we use the margin ranking loss with one negative sample, as in GraIL (Teru et al., 2020). As in Table 4, RED-Simp is weaker than RED-GNN since the multi-class log-loss is better than loss functions with negative sampling (Lacroix et al., 2018; Ruffinelli et al., 2020; Zhang et al., 2019b). RED-Simp still outperforms GraIL in Table 1 since the r-digraphs are better structures for reasoning. The running time of RED-Simp is shown in Figures 2(a) and 2(b). Algorithm 1 is much more expensive than Algorithm 2 but is cheaper than GraIL since GraIL needs the Dijkstra algorithm to label the entities.

Depth of models. In Figure 4, we show the testing MRR with different numbers of layers or steps $L$ on the left $y$-axis. The coverage (in %) of testing triples $(e_q,r_q,e_a)$ where $e_a$ is visible from $e_q$ within $L$ steps, i.e., $e_a\in\mathcal{V}_{e_q}^{L}$, is shown on the right $y$-axis. Intuitively, when $L$ increases, more triples are covered and the paths or subgraphs between $e_q$ and $e_a$ contain richer information, but they become harder to learn. As shown, the performance of DRUM, Neural LP and MINERVA decreases for $L\geq 4$. CompGCN runs out of memory when $L>3$, and it is also hard for it to capture complex structures with $L=3$. When $L$ is too small, e.g., $L\leq 2$, RED-GNN performs poorly, mainly due to the limited information encoded in such small r-digraphs. RED-GNN achieves the best performance for $L\geq 3$, where the r-digraphs contain richer information and the important information for reasoning can be effectively learned by (3). Since the computation cost increases significantly with $L$, we tune $L\in\{3,4,5\}$ to balance efficiency and effectiveness in practice.
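The coverage curve in Figure 4 counts, for each $L$, the fraction of test triples whose answer entity is reachable from the query entity within $L$ hops; a small BFS sketch over a list of $(e_s,r,e_o)$ triples (our own illustrative representation):

```python
from collections import defaultdict

def coverage_within_L(triples, test_triples, L):
    """Fraction of test triples (e_q, r_q, e_a) with e_a reachable from e_q in <= L hops."""
    adj = defaultdict(set)
    for (s, _, o) in triples:
        adj[s].add(o)
    covered = 0
    for (e_q, _, e_a) in test_triples:
        frontier, seen = {e_q}, {e_q}
        for _ in range(L):
            frontier = {o for s in frontier for o in adj[s]} - seen
            seen |= frontier
            if e_a in seen:
                covered += 1
                break
    return covered / max(len(test_triples), 1)
```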

Figure 4. The MRR performance with different $L$ and the coverage of triples within $L$ steps.
Table 5. The per-distance evaluation for MRR on WN18RR vs. the length of the shortest path.
distance 1 2 3 4 5 >5
ratios (%) 34.9 9.3 21.5 7.5 8.9 17.9
CompGCN .993 .327 .337 .062 .061 .016
DPMPN .982 .381 .333 .102 .057 .001
RED-GNN .993 .563 .536 .186 .089 .005

Per-distance performance. Note that given an r-digraph with $L$ layers, information between two nodes that are not reachable within $L$ hops cannot be propagated by Algorithm 2. This may raise a concern about the predictive ability of RED-GNN, especially for triples not reachable within $L$ steps. We demonstrate that this is not a problem here. Specifically, given a triple $(e_q,r_q,e_a)$, we compute the shortest distance from $e_q$ to $e_a$ and group the MRR performance by distance. We compare CompGCN ($L=2$), DPMPN ($L=5$) and RED-GNN ($L=5$) in Table 5. All the models perform worse on triples with larger distance and cannot model well the triples that are far away. RED-GNN has the best per-distance performance within distance $5$.

6. Conclusion

In this paper, we introduce a novel relational structure, i.e., the r-digraph, as a generalization of relational paths for KG reasoning. Individually computing on each r-digraph is expensive for the reasoning task $(e_q,r_q,?)$. Hence, inspired by solving overlapping sub-problems with dynamic programming, we propose RED-GNN, a variant of GNN, to efficiently construct and effectively learn from the r-digraphs. We show that RED-GNN achieves state-of-the-art performance on both inductive and transductive KG reasoning benchmarks. The training and inference of RED-GNN are very efficient compared with the other GNN-based baselines. Besides, interpretable structures for reasoning can be learned by RED-GNN.

Lastly, a concurrent work, NBFNet (Zhu et al., 2021), proposes to recursively encode all the paths between the query entity and multiple answer entities, which is similar to RED-GNN. However, NBFNet uses the full adjacency matrix for propagation, which is very expensive and requires 128GB of GPU memory. In future work, we can leverage the pruning technique in (Xu et al., 2019) or the distributed programming in (Cohen et al., 2019) to apply RED-GNN to KGs of extremely large scale.

References

  • Abujabal et al. (2018) A. Abujabal, R. Saha Roy, M. Yahya, and G. Weikum. 2018. Never-ending learning for open-domain question answering over knowledge bases. In The WebConf. 1053–1062.
  • Battaglia et al. (2018) P. Battaglia, J. B Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, et al. 2018. Relational inductive biases, deep learning, and graph networks. Technical Report. arXiv:1806.01261.
  • Battista et al. (1998) G. D. Battista, P. Eades, R. Tamassia, and I. G Tollis. 1998. Graph drawing: algorithms for the visualization of graphs. Prentice Hall PTR.
  • Berant and Liang (2014) J. Berant and P. Liang. 2014. Semantic parsing via paraphrasing. In ACL. 1415–1425.
  • Bordes et al. (2013) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS. 2787–2795.
  • Cao et al. (2019) Y. Cao, X. Wang, X. He, Z. Hu, and T. Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The WebConf. 151–161.
  • Chen et al. (2020) X. Chen, S. Jia, and Y. Xiang. 2020. A review: Knowledge reasoning over knowledge graph. Expert Systems with Applications 141 (2020), 112948.
  • Cohen et al. (2019) W. W Cohen, H. Sun, R A. Hofer, and M. Siegler. 2019. Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base. In ICLR.
  • Das et al. (2017) R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. 2017. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In ICLR.
  • Dettmers et al. (2017) T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2017. Convolutional 2D knowledge graph embeddings. In AAAI.
  • Ding et al. (2021) Y. Ding, Q. Yao, H. Zhao, and T. Zhang. 2021. Diffmg: Differentiable meta graph search for heterogeneous graph neural networks. In SIGKDD. 279–288.
  • Gilmer et al. (2017) J. Gilmer, S. S Schoenholz, P. F Riley, O. Vinyals, and G. E Dahl. 2017. Neural Message Passing for Quantum Chemistry. In ICML. 1263–1272.
  • Grover and Leskovec (2016) A. Grover and J. Leskovec. 2016. Node2vec: Scalable feature learning for networks. In SIGKDD. ACM, 855–864.
  • Hamilton et al. (2017) W. Hamilton, Z. Ying, and J. Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS. 1024–1034.
  • Harsha V. et al. (2020) L. V. Harsha V., G. Jia, and S. Kok. 2020. Probabilistic logic graph attention networks for reasoning. In Companion of The WebConf. 669–673.
  • Hogan et al. (2021) A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, et al. 2021. Knowledge graphs. CSUR 54, 4 (2021), 1–37.
  • Hornik (1991) K. Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural networks 4, 2 (1991), 251–257.
  • Ji et al. (2020) S. Ji, S. Pan, E. Cambria, P. Marttinen, and P. Yu. 2020. A survey on knowledge graphs: Representation, acquisition and applications. TKDE (2020).
  • Kingma and Ba (2014) D. P Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. Technical Report. arXiv:1412.6980.
  • Kipf and Welling (2016) T. Kipf and M. Welling. 2016. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Kok and Domingos (2007) S. Kok and P. Domingos. 2007. Statistical predicate invention. In ICML. 433–440.
  • Lacroix et al. (2018) T. Lacroix, N. Usunier, and G. Obozinski. 2018. Canonical Tensor Decomposition for Knowledge Base Completion. In ICML. 2863–2872.
  • Lao and Cohen (2010) N. Lao and W. W Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine learning 81, 1 (2010), 53–67.
  • Lao et al. (2011) N. Lao, T. Mitchell, and W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In EMNLP. 529–539.
  • Meilicke et al. (2018) C. Meilicke, M. Fink, Y. Wang, D. Ruffinelli, R. Gemulla, and H. Stuckenschmidt. 2018. Fine-grained evaluation of rule-and embedding-based systems for knowledge graph completion. In ISWC. Springer, 3–20.
  • Niepert (2016) M. Niepert. 2016. Discriminative gaifman models. In NIPS, Vol. 29. 3405–3413.
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In ICLR.
  • Perozzi et al. (2014) B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. Deepwalk: Online learning of social representations. In SIGKDD. ACM, 701–710.
  • Qu et al. (2021) M. Qu, J. Chen, L. Xhonneux, Y. Bengio, and J. Tang. 2021. RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs. In ICLR.
  • Qu and Tang (2019) M. Qu and J. Tang. 2019. Probabilistic Logic Neural Networks for Reasoning. NeurIPS 32, 7712–7722.
  • Rossi et al. (2021) A. Rossi, D. Firmani, A. Matinata, P. Merialdo, and D. Barbosa. 2021. Knowledge Graph Embedding for Link Prediction: A Comparative Analysis. TKDD (2021).
  • Ruffinelli et al. (2020) D. Ruffinelli, S. Broscheit, and R. Gemulla. 2020. You can teach an old dog new tricks! on training knowledge graph embeddings. In ICLR.
  • Sadeghian et al. (2019) A. Sadeghian, M. Armandpour, P. Ding, and D Wang. 2019. DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs. In NeurIPS. 15347–15357.
  • Schlichtkrull et al. (2018) M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. 2018. Modeling relational data with graph convolutional networks. In ESWC. Springer, 593–607.
  • Shen et al. (2018) Y. Shen, J. Chen, P. Huang, Y. Guo, and J. Gao. 2018. M-Walk: Learning to walk over graphs using Monte Carlo tree search. In NeurIPS.
  • Sun et al. (2019) Z. Sun, Z. Deng, J. Nie, and J. Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR.
  • Sun et al. (2020) Z. Sun, S. Vashishth, S. Sanyal, P. Talukdar, and Y. Yang. 2020. A Re-evaluation of Knowledge Graph Completion Methods. In ACL. 5516–5522.
  • Teru et al. (2020) K. K Teru, E. Denis, and W. Hamilton. 2020. Inductive Relation Prediction by Subgraph Reasoning. In ICML.
  • Toutanova and Chen (2015) K. Toutanova and D. Chen. 2015. Observed versus latent features for knowledge base and text inference. In PWCVSMC. 57–66.
  • Trouillon et al. (2017) T. Trouillon, C. R Dance, É. Gaussier, J. Welbl, S. Riedel, and G. Bouchard. 2017. Knowledge graph completion via complex tensor factorization. JMLR 18, 1 (2017), 4735–4772.
  • Vashishth et al. (2019) S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar. 2019. Composition-based multi-relational graph convolutional networks. In ICLR.
  • Veličković et al. (2017) P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. 2017. Graph attention networks. In ICLR.
  • Wang et al. (2021) H. Wang, H. Ren, and J. Leskovec. 2021. Relational Message Passing for Knowledge Graph Completion. In SIGKDD. 1697–1707.
  • Wang et al. (2017) Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. TKDE 29, 12 (2017), 2724–2743.
  • Xiao et al. (2021) W. Xiao, H. Zhao, V. W Zheng, and Y. Song. 2021. Neural PathSim for Inductive Similarity Search in Heterogeneous Information Networks. In CIKM. 2201–2210.
  • Xiong et al. (2017) W. Xiong, T. Hoang, and W. Wang. 2017. DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning. In EMNLP. 564–573.
  • Xu et al. (2019) X. Xu, W. Feng, Y. Jiang, X. Xie, Z. Sun, and Z. Deng. 2019. Dynamically Pruned Message Passing Networks for Large-Scale Knowledge Graph Reasoning. In ICLR.
  • Yang et al. (2017) F. Yang, Z. Yang, and W. Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. In NeurIPS. 2319–2328.
  • Yu et al. (2021) D. Yu, Y. Yang, R. Zhang, and Y. Wu. 2021. Knowledge Embedding Based Graph Convolutional Network. In The WebConf. 1619–1628.
  • Zhang and Chen (2019) M. Zhang and Y. Chen. 2019. Inductive Matrix Completion Based on Graph Neural Networks. In ICLR.
  • Zhang et al. (2019a) S. Zhang, Y. Tay, L. Yao, and Q. Liu. 2019a. Quaternion knowledge graph embeddings. In NeurIPS.
  • Zhang et al. (2018) Y. Zhang, H. Dai, Z. Kozareva, A. J Smola, and L. Song. 2018. Variational reasoning for question answering with knowledge graph. In AAAI.
  • Zhang et al. (2020a) Y. Zhang, Q. Yao, and L. Chen. 2020a. Interstellar: Searching Recurrent Architecture for Knowledge Graph Embedding. In NeurIPS, Vol. 33.
  • Zhang et al. (2020b) Y. Zhang, Q. Yao, W. Dai, and L. Chen. 2020b. AutoSF: Searching scoring functions for knowledge graph embedding. In ICDE. IEEE, 433–444.
  • Zhang et al. (2019b) Y. Zhang, Q. Yao, Y. Shao, and L. Chen. 2019b. NSCaching: simple and efficient negative sampling for knowledge graph embedding. In ICDE. IEEE, 614–625.
  • Zhao et al. (2017) H. Zhao, Q. Yao, J. Li, Y. Song, and K. L. Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In SIGKDD. 635–644.
  • Zhu et al. (2021) Z. Zhu, Z. Zhang, L. Xhonneux, and J. Tang. 2021. Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction. In NeurIPS.

Appendix A Visualization

Algorithm 3 describes how we visualize the r-digraph learned between the query entity $e_q$ and the answer entity $e_a$. The key step is to backtrack, starting from $e_a$, the neighboring edges whose attention weights are larger than a pre-defined threshold $\theta>0$, i.e., steps 3-6. The remaining structure $\mathcal{G}_{e_q,e_a|L}(\theta)$ is returned as the structure selected by the attention weights.

Algorithm 3 Visualization.
0:  entities $\mathcal{V}$, triples $\mathcal{F}$, query triple $(e_q, r_q, e_a)$, depth $L$, parameters $\bm{\Theta}$, threshold $\theta$.
1:  Run Algorithm 2 to obtain the attention weights $\alpha_{e_s,r,e_o|r_q}^{\ell}$, $\ell=1\dots L$, on the edges in each layer.
2:  Initialize $\mathcal{V}_{e_q,e_a|L}^{L}(\theta)=\{e_a\}$.
3:  for $\ell=L, L-1, \dots, 1$ do
4:     collect the edges $\mathcal{E}_{e_q,e_a|L}^{\ell}(\theta)=\{(e_s,r,e_o)\,|\,e_o\in\mathcal{V}_{e_q,e_a|L}^{\ell}(\theta),\ \alpha_{e_s,r,e_o|r_q}^{\ell}\geq\theta\}$.
5:     collect the entities $\mathcal{V}_{e_q,e_a|L}^{\ell-1}(\theta)=\{e_s\,|\,(e_s,r,e_o)\in\mathcal{E}_{e_q,e_a|L}^{\ell}(\theta)\}$.
6:  end for
7:  return $\mathcal{G}_{e_q,e_a|L}(\theta)=\mathcal{E}_{e_q,e_a|L}^{1}(\theta)\otimes\mathcal{E}_{e_q,e_a|L}^{2}(\theta)\otimes\cdots\otimes\mathcal{E}_{e_q,e_a|L}^{L}(\theta)$.
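Below is a minimal Python sketch of this backtracking procedure. The per-layer layout of edges and attention weights, as well as the function name, are our own assumptions for illustration; they do not mirror the released implementation.

```python
def backtrack_rdigraph(att, L, e_a, theta):
    """Backtrack the edges selected by attention, following Algorithm 3.

    att:   dict mapping layer l (1..L) to a list of (e_s, r, e_o, alpha)
           tuples, i.e., the attention weights produced by Algorithm 2.
    e_a:   answer entity where the backtracking starts.
    theta: attention threshold.
    Returns {l: kept edges}, i.e., the layers of G_{e_q,e_a|L}(theta).
    """
    kept = {}
    frontier = {e_a}                                   # V^L(theta) = {e_a}
    for l in range(L, 0, -1):                          # l = L, L-1, ..., 1
        edges = [(s, r, o) for (s, r, o, a) in att[l]
                 if o in frontier and a >= theta]      # step 4
        kept[l] = edges
        frontier = {s for (s, _, _) in edges}          # step 5: V^{l-1}(theta)
    return kept
```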
Figure 5. Visualization of the learned structure on Family.
Figure 6. Visualization of the learned structure on UMLS.

We provide additional visualized results on the Family and UMLS datasets in Figure 5 and Figure 6. The learned structures exhibit consistent and semantically meaningful patterns across different query triples.

Appendix B Problem analysis on GraIL

As mentioned in the main text, efficiency is the primary issue of GraIL (Teru et al., 2020). In this part, we further analyze it in terms of effectiveness.

Enclosing subgraph in (Teru et al., 2020) vs. r-digraph. When constructing the enclosing subgraph, the bidirectional information between $e_q$ and $e_a$ is preserved: it contains all the relational paths from $e_q$ to $e_a$ as well as all the paths from $e_a$ to $e_q$. This is the key difference between the enclosing subgraph and the r-digraph. In this view, the limitations of the enclosing subgraph can be summarized as follows.

  • The entities in the enclosing subgraph require distance labels, whereas in the r-digraph the distance is implicitly encoded by the layers.

  • The order of relations inside the enclosing subgraph is mixed, making it hard to learn the order of relations, e.g., the difference between brother$\wedge$mother$\rightarrow$uncle and mother$\wedge$brother$\rightarrow$aunt.

As a result, the enclosing subgraph, even though it contains more edges than the r-digraph, does not provide a valuable inductive bias for KG reasoning.

Non-interpretable. We show the computation graphs of GraIL and RED-GNN in the yellow space and the green space of Figure 7, respectively. Even though attention is applied in the GNN framework of GraIL, how the weights cooperate across different layers is unclear. Thus, the authors did not provide interpretations, nor can we come up with a way to interpret the reasoning results of GraIL.

Figure 7. The computation graph of GraIL (yellow space) and RED-GNN (green space).

Empirical evidence. To show that GraIL fails to learn the relational structures, we conduct a small ablation study here. Specifically, we change the message function in GraIL from

$$\bm{a}_{t}^{k}=\sum\nolimits_{r=1}^{R}\sum\nolimits_{s\in\mathcal{N}_{r}(t)}\alpha_{r,r_{t},s,t}^{k}\bm{W}_{r}^{k}\bm{h}_{s}^{k-1}$$

to

$$\bm{a}_{t}^{k}=\sum\nolimits_{r=1}^{R}\sum\nolimits_{s\in\mathcal{N}_{r}(t)}\alpha_{r_{t},s,t}^{k}\bm{W}^{k}\bm{h}_{s}^{k-1},$$

where $\alpha_{r_{t},s,t}^{k}=\text{MLP}(\bm{h}_{s}^{k-1},\bm{h}_{t}^{k-1},\bm{e}_{r_{t}}^{a})$. This removes the relation when aggregating over the edges, so the model can only focus on the graph structure.

We name this variant GraIL no rel. The results are shown in Table 6. We observe that the performance decay is marginal when the relation is removed from the edges. This means that GraIL mainly learns from the graph structure, and the relations play little role. Thus, we claim that GraIL fails to learn the relational dependencies in the ill-defined enclosing subgraph.
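The following PyTorch-style sketch illustrates the two message functions above: the original one with per-relation weights $\bm{W}_r^k$ and the GraIL no rel variant with a single shared weight. The tensor layout is a simplifying assumption of ours and the attention weights are treated as precomputed; this is not the released GraIL code.

```python
import torch

def grail_message(h, edges, alpha, W_rel, W_shared, per_relation=True):
    """Aggregate a_t^k for GraIL (per_relation=True) or GraIL no rel (False).

    h:        [num_entities, d]            entity states h^{k-1}
    edges:    LongTensor [num_edges, 3]    columns (s, r, t)
    alpha:    [num_edges, 1]               attention weights on the edges
    W_rel:    [num_relations, d, d]        per-relation weights W_r^k
    W_shared: [d, d]                       single weight W^k (no-rel variant)
    """
    s, r, t = edges[:, 0], edges[:, 1], edges[:, 2]
    if per_relation:
        msg = torch.einsum('eij,ej->ei', W_rel[r], h[s])  # W_r^k h_s^{k-1}
    else:
        msg = h[s] @ W_shared.T  # relation identity on the edge is discarded
    msg = alpha * msg
    return torch.zeros_like(h).index_add_(0, t, msg)  # sum over edges into t
```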

We also change the aggregation function in RED-GNN from (3) to

$$\bm{h}_{e_{o}}^{\ell}(e_q,r_q)=\delta\Big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}}\alpha_{e_s,r_q}^{\ell}\big(\bm{h}_{e_s}^{\ell-1}(e_q,r_q)+\bm{v}^{\ell}\big)\Big),$$

where $\bm{v}^{\ell}\in\mathbb{R}^{d}$ is shared within the same layer, so as to remove the relation on the edges. We denote this variant as RED-no rel. Note that RED-no rel is different from the variant Attn-w.o.-$r_q$ in Table 4. As shown in Table 6, the performance drops dramatically, indicating that RED-GNN mainly relies on the relational dependencies on the edges as evidence for reasoning.
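For comparison, a corresponding sketch of the RED-GNN aggregation with and without the relation embedding is given below. Using a sum for the message function $\phi$, ReLU for $\delta$, and this tensor layout are our own simplifying assumptions for illustration.

```python
import torch

def red_gnn_layer(h_prev, edges, alpha, h_rel, W, v=None, use_relation=True):
    """One aggregation layer: RED-GNN (use_relation=True) vs. RED-no rel.

    h_prev: [num_entities, d]           entity states h^{l-1}(e_q, r_q)
    edges:  LongTensor [num_edges, 3]   columns (e_s, r, e_o)
    alpha:  [num_edges, 1]              query-dependent attention weights
    h_rel:  [num_relations, d]          relation embeddings of this layer
    W:      [d, d] layer weight;  v: [d] shared vector for RED-no rel
    """
    src, rel, dst = edges[:, 0], edges[:, 1], edges[:, 2]
    if use_relation:
        msg = h_prev[src] + h_rel[rel]  # the message carries the edge relation
    else:
        msg = h_prev[src] + v           # RED-no rel: relation replaced by a shared vector
    msg = alpha * msg
    agg = torch.zeros_like(h_prev).index_add_(0, dst, msg)  # sum over incoming edges
    return torch.relu(agg @ W.T)        # delta(W . sum), with delta = ReLU here
```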

Table 6. Comparison of GraIL and RED-GNN with and without relation information. Evaluated by MRR.
methods         WN18RR (V1)    FB15k-237 (V1)    NELL-995 (V1)
GraIL no rel    0.620          0.255             0.475
GraIL           0.627          0.279             0.481
RED-no rel      0.557          0.198             0.384
RED-GNN         0.701          0.369             0.637

Appendix C Proofs

C.1. Proposition 1

We prove this proposition in both directions, i.e., $\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}\subseteq\hat{\mathcal{E}}_{e_q}^{\ell}$ and $\hat{\mathcal{E}}_{e_q}^{\ell}\subseteq\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$.

Proof.

First, we prove that $\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}\subseteq\hat{\mathcal{E}}_{e_q}^{\ell}$.

As in step 7 of Algorithm 1, $\mathcal{E}_{e_q,e_a|L}^{\ell}$ is defined as $\mathcal{E}_{e_q,e_a|L}^{\ell}=\{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}\,|\,e_o\in\mathcal{V}_{e_q,e_a|L}^{\ell}\}$. Thus we have $\mathcal{E}_{e_q,e_a|L}^{\ell}\subseteq\hat{\mathcal{E}}_{e_q}^{\ell}$, and hence $\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}\subseteq\hat{\mathcal{E}}_{e_q}^{\ell}$.

Second, we prove that $\hat{\mathcal{E}}_{e_q}^{\ell}\subseteq\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$.

We use proof by contradiction. Assume that $\hat{\mathcal{E}}_{e_q}^{\ell}\subseteq\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$ does not hold. Then there must exist a triple $(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}$ with $(e_s,r,e_o)\notin\mathcal{E}_{e_q,e_a|L}^{\ell}$ for all $e_a\in\mathcal{V}$. In other words, there exists an edge $(e_s,r,e_o)$ that can be visited in $\ell$ steps walking from $e_q$ but that appears in no r-digraph. Based on Definition 2 of the r-digraph, $(e_s,r,e_o)$ would then be an edge that is $\ell$ steps away from $e_q$ but does not belong to the $\ell$-th triple of any relational path $e_q\rightarrow_{r^1}\cdot\rightarrow_{r^2}\cdots\rightarrow_{r^L}e_a$, which is impossible. Therefore, we have $\hat{\mathcal{E}}_{e_q}^{\ell}\subseteq\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$.

Combining the two directions, we conclude that $\hat{\mathcal{E}}_{e_q}^{\ell}=\cup_{e_a\in\mathcal{V}}\mathcal{E}_{e_q,e_a|L}^{\ell}$. ∎

C.2. Proposition 2

Given a query triple $(e_q,r_q,e_a)$ and $L$, when $\mathcal{G}_{e_q,r_q|L}=\emptyset$, it is obvious that $\bm{h}_{e_a}^{L}(e_q,r_q)=\bm{0}$ in both Algorithm 1 and Algorithm 2. For $\mathcal{G}_{e_q,r_q|L}\neq\emptyset$, we prove by induction. We denote by $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)$ the representations learned by Algorithm 1 and by $\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)$ the representations learned by Algorithm 2. Then, we show that $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)=\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)$ for all $\ell=1\dots L$ and $e\in\mathcal{V}_{e_q,e_a|L}^{\ell}$.

Proof.
  • When $\ell=1$, $\hat{\bm{h}}_{e}^{1}(e_q,r_q)=\delta\big(\bm{W}^{1}\cdot\sum\nolimits_{(e_q,r,e)\in\mathcal{E}_{e_q,e_a|L}^{1}}\phi(\bm{0},\bm{h}_{r}^{1})\big)$ and $\breve{\bm{h}}_{e}^{1}(e_q,r_q)=\delta\big(\bm{W}^{1}\cdot\sum\nolimits_{(e_q,r,e)\in\mathcal{E}_{e_q}^{1}}\phi(\bm{0},\bm{h}_{r}^{1})\big)$ for $e\in\mathcal{V}_{e_q,e_a|L}^{1}$. It is obvious that $\{(e_q,r,e)\in\mathcal{E}_{e_q,e_a|L}^{1}\}\equiv\{(e_q,r,e)\in\mathcal{E}_{e_q}^{1}\,|\,e\in\mathcal{V}_{e_q,e_a|L}^{1}\}$. Thus, we have $\hat{\bm{h}}_{e}^{1}(e_q,r_q)=\breve{\bm{h}}_{e}^{1}(e_q,r_q)$ for all $e\in\mathcal{V}_{e_q,e_a|L}^{1}$.

  • Assume that $\hat{\bm{h}}_{e}^{\ell-1}(e_q,r_q)=\breve{\bm{h}}_{e}^{\ell-1}(e_q,r_q)$ for all $e\in\mathcal{V}_{e_q,e_a|L}^{\ell-1}$. We prove that $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)=\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)$ for all $e\in\mathcal{V}_{e_q,e_a|L}^{\ell}$.

    For Algorithm 1, we have

    (7) $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)=\delta\big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e)\in\mathcal{E}_{e_q,e_a|L}^{\ell}}\phi(\hat{\bm{h}}_{e_s}^{\ell-1}(e_q,r_q),\bm{h}_{r}^{\ell})\big).$

    For Algorithm 2, we have

    (8) $\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)=\delta\big(\bm{W}^{\ell}\cdot\sum\nolimits_{(e_s,r,e)\in\hat{\mathcal{E}}_{e_q}^{\ell}}\phi(\breve{\bm{h}}_{e_s}^{\ell-1}(e_q,r_q),\bm{h}_{r}^{\ell})\big).$

    As in step 7 of Algorithm 1, $\mathcal{E}_{e_q,e_a|L}^{\ell}=\{(e_s,r,e_o)\in\hat{\mathcal{E}}_{e_q}^{\ell}\,|\,e_o\in\mathcal{V}_{e_q,e_a|L}^{\ell}\}$. Then, for $e\in\mathcal{V}_{e_q,e_a|L}^{\ell}$, the ranges of the summations in (7) and (8) are the same, i.e., $\{(e_s,r,e)\in\mathcal{E}_{e_q,e_a|L}^{\ell}\}\equiv\{(e_s,r,e)\in\hat{\mathcal{E}}_{e_q}^{\ell}\,|\,e\in\mathcal{V}_{e_q,e_a|L}^{\ell}\}$, and all the entities $e_s$ involved belong to $\mathcal{V}_{e_q,e_a|L}^{\ell-1}$. Hence, based on the assumption, we have $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)=\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)$ for all $e\in\mathcal{V}_{e_q,e_a|L}^{\ell}$.

By induction, we have $\hat{\bm{h}}_{e}^{\ell}(e_q,r_q)=\breve{\bm{h}}_{e}^{\ell}(e_q,r_q)$ for all $\ell=1\dots L$ and $e\in\mathcal{V}_{e_q,e_a|L}^{\ell}$. Therefore, the representations $\hat{\bm{h}}_{e_a}^{L}(e_q,r_q)$ and $\breve{\bm{h}}_{e_a}^{L}(e_q,r_q)$ learned by Algorithm 1 and Algorithm 2, respectively, are identical. ∎

C.3. Theorem 1

Proof.

Based on Definition 2, any set $\mathcal{P}$ of relational paths $e_q\rightarrow_{r_i^1}\cdot\rightarrow_{r_i^2}\cdots\rightarrow_{r_i^L}e_a$ is contained in the r-digraph $\mathcal{G}_{e_q,e_a|L}$.

Denote by $\mathcal{G}_{\mathcal{P}}$ the r-digraph constructed by $\mathcal{P}$. In each layer, the attention weight is computed as

$$\alpha_{e_s,r,e_o|r_q}^{\ell}=\sigma\Big((\bm{w}_{\alpha}^{\ell})^{\top}\text{ReLU}\big(\bm{W}_{\alpha}^{\ell}\cdot(\bm{h}_{e_s}^{\ell-1}(e_q,r_q)\oplus\bm{h}_{r}^{\ell}\oplus\bm{h}_{r_q}^{\ell})\big)\Big)=\text{MLP}^{\ell}(e_s,r,e_q,r_q).$$

Then, we prove that $\mathcal{G}_{\mathcal{P}}$ can be extracted layer by layer, starting from the $L$-th layer and proceeding recursively down to the first layer.

  • In the $L$-th layer, denote by $\mathcal{T}_{+}^{L}$ the set of triples $(e_s,r_i^L,e_a)$ in the $L$-th layer of $\mathcal{G}_{\mathcal{P}}$, whose attention weights are $\alpha_{e_s,r_i^L,e_a|r_q}^{L}=\text{MLP}^{L}(e_s,r_i^L,r_q,e_q)$. Based on the universal approximation theorem (Hornik, 1991), there exists a set of parameters $\bm{w}_{\alpha}^{L},\bm{W}_{\alpha}^{L},\bm{h}_{r}^{L}$ that can learn a decision boundary $\theta$ such that $(e_s,r_i^L,e_a)\in\mathcal{T}_{+}^{L}$ if $\alpha_{e_s,r_i^L,e_a|r_q}^{L}>\theta$ and $(e_s,r_i^L,e_a)\notin\mathcal{T}_{+}^{L}$ otherwise. Then the $L$-th layer of $\mathcal{G}_{\mathcal{P}}$ can be extracted.

  • Similarly, denote by $\mathcal{T}_{+}^{L-1}$ the set of triples $(e_s,r_i^{L-1},e_o)$ that connect to the remaining entities in the $(L\!-\!1)$-th layer. There also exists a set of parameters $\bm{w}_{\alpha}^{L-1},\bm{W}_{\alpha}^{L-1},\bm{h}_{r}^{L-1}$ that can learn a decision boundary $\theta$ such that $(e_s,r_i^{L-1},e_o)\in\mathcal{T}_{+}^{L-1}$ if $\alpha_{e_s,r_i^{L-1},e_o|r_q}^{L-1}>\theta$ and $(e_s,r_i^{L-1},e_o)\notin\mathcal{T}_{+}^{L-1}$ otherwise. Besides, $\bm{w}_{\alpha}^{L-1},\bm{W}_{\alpha}^{L-1},\bm{h}_{r}^{L-1}$ and $\bm{w}_{\alpha}^{L},\bm{W}_{\alpha}^{L},\bm{h}_{r}^{L}$ are independent of each other. Thus, we can extract the $(L\!-\!1)$-th layer of $\mathcal{G}_{\mathcal{P}}$.

  • Finally, by recursive execution, $\mathcal{G}_{\mathcal{P}}$ can be extracted as the subgraph of $\mathcal{G}_{e_q,e_a|L}$ whose edges have attention weights $\alpha_{e_s,r,e_o|r_q}^{\ell}>\theta$.