
Complex Logical Reasoning over Knowledge Graphs
using Large Language Models

Nurendra Choudhary
Department of Computer Science
Virginia Tech, Arlington, VA, USA
[email protected]
Chandan K. Reddy
Department of Computer Science
Virginia Tech, Arlington, VA, USA
[email protected]
Abstract

Reasoning over knowledge graphs (KGs) is a challenging task that requires a deep understanding of the complex relationships between entities and the underlying logic of their relations. Current approaches rely on learning geometries to embed entities in vector space for logical query operations, but they suffer from subpar performance on complex queries and dataset-specific representations. In this paper, we propose a novel decoupled approach, Language-guided Abstract Reasoning over Knowledge graphs (LARK), that formulates complex KG reasoning as a combination of contextual KG search and logical query reasoning, to leverage the strengths of graph extraction algorithms and large language models (LLM), respectively. Our experiments demonstrate that the proposed approach outperforms state-of-the-art KG reasoning methods on standard benchmark datasets across several logical query constructs, with significant performance gain for queries of higher complexity. Furthermore, we show that the performance of our approach improves proportionally to the increase in size of the underlying LLM, enabling the integration of the latest advancements in LLMs for logical reasoning over KGs. Our work presents a new direction for addressing the challenges of complex KG reasoning and paves the way for future research in this area.

1 Introduction

Knowledge graphs (KGs) encode knowledge in a flexible triplet schema where two entity nodes are connected by relational edges. However, several real-world KGs, such as Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and NELL (Carlson et al., 2010), are often large-scale, noisy, and incomplete. Thus, reasoning over such KGs is a fundamental and challenging problem in AI research. The overarching goal of logical reasoning is to develop answering mechanisms for first-order logic (FOL) queries over KGs using the operators of existential quantification ($\exists$), conjunction ($\wedge$), disjunction ($\vee$), and negation ($\neg$). Current research on this topic primarily focuses on the creation of diverse latent space geometries, such as vectors (Hamilton et al., 2018), boxes (Ren et al., 2020), hyperboloids (Choudhary et al., 2021b), and probabilistic distributions (Ren & Leskovec, 2020), in order to effectively capture the semantic position and logical coverage of knowledge graph entities. Despite their success, these approaches are limited in their performance for the following reasons: (i) Complex queries: they rely on constrained formulations of FOL queries that lose information on complex queries that require chain reasoning (Choudhary et al., 2021a) and involve multiple relationships between entities in the KG; (ii) Generalizability: optimization for a particular KG may not generalize to other KGs, which limits the applicability of these approaches in real-world scenarios where KGs can vary widely in terms of their structure and content; and (iii) Scalability: intensive training times limit the scalability of these approaches to larger KGs and the incorporation of new data into existing KGs. To address these limitations, we aim to leverage the reasoning abilities of large language models (LLMs) in a novel framework, shown in Figure 1, called Language-guided Abstract Reasoning over Knowledge graphs (LARK).

[Figure 1 panels: (a) Input logical query, (b) Query prompt, (c) Decomposed prompt, (d) LLM answers.]
Figure 1: Example of LARK’s query chain decomposition and logically-ordered LLM answering for effective performance. LLMs are more adept at answering simple queries, and hence, we decompose the multi-operation complex logical query (a,b) into elementary queries with single operation (c) and then use a sequential LLM-based answering method to output the final answer (d).

In LARK, we utilize the logical queries to search for relevant subgraph contexts over knowledge graphs and perform chain reasoning over these contexts using logically-decomposed LLM prompts. To achieve this, we first abstract out the logical information from both the input query and the KG. Given the invariant nature of logic (logical queries follow the same set of rules and procedures irrespective of the KG context), this enables our method to focus on the logical formulation, avoid model hallucination (the model ignores semantic common-sense knowledge and infers only from the KG entities for answers), and generalize over different knowledge graphs. From this abstract KG, we extract relevant subgraphs using the entities and relations present in the logical query. These subgraphs serve as context prompts for input to LLMs. In the next phase, we need to effectively handle complex reasoning queries. From previous works (Zhou et al., 2023; Khot et al., 2023), we observe that LLMs are significantly less effective on complex prompts than on a sequence of simpler prompts. Thus, to simplify the query, we exploit its logical nature and deterministically decompose the multi-operation query into logically-ordered elementary queries, each containing a single operation (depicted in the transition from Figure 1(b) to 1(c)). Each of these decomposed logical queries is then converted to a prompt and processed through the LLM to generate the final set of answers (shown in Figure 1(d)). The logical queries are handled sequentially, and if query $y$ depends on query $x$, then $x$ is scheduled before $y$. Operations are scheduled in a logically-ordered manner to enable batching different logical queries together, and answers are stored in caches for easy access.

The proposed approach effectively integrates logical reasoning over knowledge graphs with the capabilities of LLMs, and to the best of our knowledge, is the first of its kind. Unlike previous approaches that rely on constrained formulations of first-order logic (FOL) queries, our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Our KG search model is inspired by retrieval-augmented techniques (Chen et al., 2022) but exploits the deterministic nature of knowledge graphs to simplify the retrieval of relevant subgraphs. Moreover, compared to other prompting methods (Wei et al., 2022; Zhou et al., 2023; Khot et al., 2023), our chain decomposition technique enhances the reasoning capabilities in knowledge graphs by leveraging the underlying chain of logical operations in complex queries, and by utilizing preceding answers amidst successive queries in a logically-ordered manner. To summarize, the primary contributions of this paper are as follows:

  1. We propose Language-guided Abstract Reasoning over Knowledge graphs (LARK), a novel model that utilizes the reasoning abilities of large language models to efficiently answer FOL queries over knowledge graphs.

  2. Our model uses entities and relations in queries to find pertinent subgraph contexts within abstract knowledge graphs, and then performs chain reasoning over these contexts using LLM prompts of decomposed logical queries.

  3. Our experiments on logical reasoning across standard KG datasets demonstrate that LARK outperforms the previous state-of-the-art approaches by 35%-84% MRR on 14 FOL query types based on the operations of projection ($p$), intersection ($\wedge$), union ($\vee$), and negation ($\neg$).

  4. We establish the advantages of chain decomposition by showing that LARK performs 20%-33% better on decomposed logical queries when compared to complex queries on the task of logical reasoning. Additionally, our analysis of LLMs shows the significant contribution of increasing scale and better design of underlying LLMs to the performance of LARK.

2 Related Work

Our work is at the intersection of two topics, namely, logical reasoning over knowledge graphs and reasoning prompt techniques in LLMs.

Logical Reasoning over KGs: Initial approaches in this area (Bordes et al., 2013; Nickel et al., 2011; Das et al., 2017; Hamilton et al., 2018) focused on capturing the semantic information of entities and the relational operations involved in the projection between them. However, further research in the area revealed a need for new geometries to encode the spatial and hierarchical information present in the knowledge graphs. To tackle this issue, models such as Query2Box (Ren et al., 2020), HypE (Choudhary et al., 2021b), PERM (Choudhary et al., 2021a), and BetaE (Ren & Leskovec, 2020) encoded the entities and relations as boxes, hyperboloids, Gaussian distributions, and beta distributions, respectively. Additionally, approaches such as CQD (Arakelyan et al., 2021) have focused on improving the performance of complex reasoning tasks through the answer composition of simple intermediate queries. In another line of research, HamQA (Dong et al., 2023) and QA-GNN (Yasunaga et al., 2021) have developed question-answering techniques that use knowledge graph neighborhoods to enhance the overall performance. We notice that previous approaches in this area have focused on enhancing KG representations for logical reasoning. Contrary to these existing methods, our work provides a systematic framework that leverages the reasoning ability of LLMs and tailors them toward the problem of logical reasoning over knowledge graphs.

Reasoning prompts in LLMs: Recent studies have shown that LLMs can learn various NLP tasks with just context prompts (Brown et al., 2020). Furthermore, LLMs have been successfully applied to multi-step reasoning tasks by providing intermediate reasoning steps, also known as Chain-of-Thought (Wei et al., 2022; Chowdhery et al., 2022), needed to arrive at an answer. Alternatively, certain studies have composed multiple LLMs or LLMs with symbolic functions to perform multi-step reasoning (Jung et al., 2022; Creswell et al., 2023), with a pre-defined decomposition structure. More recent studies such as least-to-most (Zhou et al., 2023), successive (Dua et al., 2022) and decomposed (Khot et al., 2023) prompting strategies divide a complex prompt into sub-prompts and answer them sequentially for effective performance. While this line of work is close to our approach, they do not utilize previous answers to inform successive queries. LARK is unique due to its ability to utilize logical structure in the chain decomposition mechanism, augmentation of retrieved knowledge graph neighborhood, and multi-phase answering structure that incorporates preceding LLM answers amidst successive queries.

3 Methodology

In this section, we will describe the problem setup of logical reasoning over knowledge graphs, and describe the various components of our model.

3.1 Problem Formulation

In this work, we tackle the problem of logical reasoning over knowledge graphs (KGs) $\mathcal{G}: E \times R$ that store entities ($E$) and relations ($R$). Without loss of generality, KGs can also be organized as a set of triplets $\langle e_1, r, e_2 \rangle \subseteq \mathcal{G}$, where each relation $r \in R$ is a Boolean function $r: E \times E \rightarrow \{True, False\}$ that indicates whether the relation $r$ exists between the pair of entities $(e_1, e_2) \in E \times E$. We consider four fundamental first-order logical (FOL) operations: projection ($p$), intersection ($\wedge$), union ($\vee$), and negation ($\neg$) to query the KG. These operations are defined as follows:

\begin{align}
q_p[Q_p] &\triangleq\ ?V_p : \{v_1, v_2, \ldots, v_k\} \subseteq E~\exists~a_1 \tag{1} \\
q_\wedge[Q_\wedge] &\triangleq\ ?V_\wedge : \{v_1, v_2, \ldots, v_k\} \subseteq E~\exists~a_1 \wedge a_2 \wedge \ldots \wedge a_i \tag{2} \\
q_\vee[Q_\vee] &\triangleq\ ?V_\vee : \{v_1, v_2, \ldots, v_k\} \subseteq E~\exists~a_1 \vee a_2 \vee \ldots \vee a_i \tag{3} \\
q_\neg[Q_\neg] &\triangleq\ ?V_\neg : \{v_1, v_2, \ldots, v_k\} \subseteq E~\exists~\neg a_1 \tag{4} \\
\text{where } Q_p, Q_\neg &= (e_1, r_1);~ Q_\wedge, Q_\vee = \{(e_1, r_1), (e_2, r_2), \ldots, (e_i, r_i)\};~\text{and}~ a_i = r_i(e_i, v_i) \notag
\end{align}

where $q_p, q_\wedge, q_\vee$, and $q_\neg$ are projection, intersection, union, and negation queries, respectively; and $V_p, V_\wedge, V_\vee$, and $V_\neg$ are the corresponding results of those queries (Arakelyan et al., 2021; Choudhary et al., 2021a). $a_i$ is a Boolean indicator which will be 1 if $e_i$ is connected to $v_i$ by relation $r_i$, and 0 otherwise. The goal of logical reasoning is to formulate the operations such that for a given query $q_\tau$ of query type $\tau$ with inputs $Q_\tau$, we are able to efficiently retrieve $V_\tau$ from entity set $E$, e.g., for a projection query $q_p[\text{(Nobel Prize, winners)}]$, we want to retrieve $V_p = \{\text{Nobel Prize winners}\} \subseteq E$.
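As an illustration of Equations (1)-(4), the four operations can be written as plain set operations over an explicit list of triplets. This is only a sketch of the query semantics, not of LARK or of any baseline; the toy KG, the entity and relation names, and the function names are all hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Toy KG with made-up facts, used only to illustrate the query semantics.
KG: Set[Triple] = {
    ("NobelPrize", "winners", "Curie"),
    ("NobelPrize", "winners", "Einstein"),
    ("Physics", "field_of", "Curie"),
    ("Physics", "field_of", "Einstein"),
    ("Chemistry", "field_of", "Curie"),
}
ENTITIES: Set[str] = {e for h, _, t in KG for e in (h, t)}


def project(kg: Set[Triple], entity: str, relation: str) -> Set[str]:
    """Projection q_p[(e1, r1)]: every v for which r1(e1, v) holds."""
    return {t for h, r, t in kg if h == entity and r == relation}


def intersect(kg: Set[Triple], pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Intersection: entities that satisfy every atom a_i = r_i(e_i, v)."""
    return set.intersection(*[project(kg, e, r) for e, r in pairs])


def union(kg: Set[Triple], pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Union: entities that satisfy at least one atom a_i."""
    return set.union(*[project(kg, e, r) for e, r in pairs])


def negate(kg: Set[Triple], entities: Set[str], entity: str, relation: str) -> Set[str]:
    """Negation: entities for which r1(e1, v) does not hold."""
    return entities - project(kg, entity, relation)
```

For example, `intersect(KG, [("NobelPrize", "winners"), ("Chemistry", "field_of")])` returns the entities that are both Nobel Prize winners and linked to Chemistry in the toy data.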

In conventional methods for logical reasoning, the query operations were typically expressed through a geometric function. For example, the intersection of queries was represented as an intersection of box representations in Query2Box (Ren et al., 2020). However, in our proposed approach, LARK, we leverage the advanced reasoning capabilities of large language models (LLMs) and prioritize efficient decomposition of logical chains within the query to enhance performance. This novel strategy seeks to overcome the limitations of traditional methods by harnessing the power of LLMs in reasoning over KGs.

3.2 Neighborhood Retrieval and Logical Chain Decomposition

The foundation of LARK’s reasoning capability is built on large language models. Nevertheless, the limited input length of LLMs restricts their ability to process the entirety of a knowledge graph. Furthermore, while the set of entities and relations within a knowledge graph is unique, the reasoning behind logical operations remains universal. Therefore, we specifically tailor the LLM prompts to account for the above distinctive characteristics of logical reasoning over knowledge graphs. To address this need, we adopt a two-step process:

  1. Query Abstraction: In order to make the process of logical reasoning over knowledge graphs more generalizable to different datasets, we propose to replace all the entities and relations in the knowledge graph and queries with a unique ID. This approach offers three significant advantages. Firstly, it reduces the number of tokens in the query, leading to improved LLM efficiency. Secondly, it allows us to solely utilize the reasoning ability of the language model, without relying on any external common sense knowledge of the underlying LLM. By avoiding the use of common sense knowledge, our approach mitigates the potential for model hallucination (which may lead to the generation of answers that are not supported by the KG). Finally, it removes any KG-specific information, thereby ensuring that the process remains generalizable to different datasets. While this may intuitively seem to result in a loss of information, our empirical findings, presented in Section 4.4, indicate that the impact on the overall performance is negligible.

  2. Neighborhood Retrieval: In order to effectively answer logical queries, it is not necessary for the LLM to have access to the entire knowledge graph. Instead, the relevant neighborhoods containing the answers can be identified. Previous approaches (Guu et al., 2020; Chen et al., 2022) have focused on semantic retrieval for web documents. However, we note that logical queries are deterministic in nature, and thus we perform a $k$-level depth-first traversal (where $k$ is determined by the query type, e.g., for 3-level projection ($3p$) queries, $k=3$) over the entities and relations present in the query. Let $E^1_\tau$ and $R^1_\tau$ denote the set of entities and relations in query $Q_\tau$ for a query type $\tau$, respectively. Then, the $k$-level neighborhood of query $q_\tau$, denoted by $\mathcal{N}_k(q_\tau[Q_\tau])$, is defined as:

    \begin{align}
    \mathcal{N}_1(q_\tau[Q_\tau]) &= \left\{(h,r,t) : \left(h \in E^1_\tau\right), \left(r \in R^1_\tau\right), \left(t \in E^1_\tau\right)\right\} \tag{5} \\
    E^k_\tau &= \{h, t : (h,r,t) \in \mathcal{N}_{k-1}(q_\tau[Q_\tau])\}, \quad R^k_\tau = \{r : (h,r,t) \in \mathcal{N}_{k-1}(q_\tau[Q_\tau])\} \tag{6} \\
    \mathcal{N}_k(q_\tau[Q_\tau]) &= \left\{(h,r,t) : \left(h \in E^k_\tau\right), \left(r \in R^k_\tau\right), \left(t \in E^k_\tau\right)\right\} \tag{7}
    \end{align}
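The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the released implementation: the ID scheme (`E0`, `R0`, ...) and the function names are hypothetical, and the retrieval loop expands level by level, keeping every triplet that touches an already-reached entity. That membership test is a simplification and does not follow Equations (5)-(7) term by term.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def abstract_kg(triples: Iterable[Triple]):
    """Replace every entity and relation name with an opaque ID.

    The 'E<n>' / 'R<n>' naming is an assumption made for this sketch.
    """
    ent_ids: Dict[str, str] = {}
    rel_ids: Dict[str, str] = {}
    abstracted: List[Triple] = []
    for h, r, t in triples:
        for e in (h, t):
            ent_ids.setdefault(e, f"E{len(ent_ids)}")
        rel_ids.setdefault(r, f"R{len(rel_ids)}")
        abstracted.append((ent_ids[h], rel_ids[r], ent_ids[t]))
    return abstracted, ent_ids, rel_ids


def k_level_neighborhood(triples: Iterable[Triple],
                         query_entities: Iterable[str],
                         k: int) -> Set[Triple]:
    """Collect triplets within k hops of the query entities."""
    triples = list(triples)
    reached = set(query_entities)   # entities seen so far
    kept: Set[Triple] = set()       # triplets in the neighborhood
    for _ in range(k):
        level = {(h, r, t) for h, r, t in triples
                 if h in reached or t in reached}
        kept |= level
        reached |= {e for h, _, t in level for e in (h, t)}
    return kept
```

Calling `k_level_neighborhood(triples, {"E0"}, k=3)` on an abstracted KG would return the context for a $3p$ query anchored at `E0`; triplets in disconnected parts of the graph are never added.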

We have taken steps to make our approach more generalizable and efficient by abstracting the query and limiting input context for LLMs. However, the complexity of a query still remains a concern. The complexity of a query type $\tau$, denoted by $\mathcal{O}(q_\tau)$, is determined by the number of entities and relations it involves, i.e., $\mathcal{O}(q_\tau) \propto |E_\tau| + |R_\tau|$. In other words, the size of the query in terms of its constituent elements is a key factor in determining its computational complexity. This observation is particularly relevant in the context of LLMs, as previous studies have shown that their performance tends to decrease as the complexity of the queries they handle increases (Khot et al., 2023). To address this, we propose a logical query chain decomposition mechanism in LARK which reduces a complex multi-operation query to multiple single-operation queries. Due to the exhaustive set of operations, we apply the following strategy for decomposing the various query types:

  • Reduce a $k$-level projection query to $k$ one-level projection queries, e.g., a $3p$ query with one entity and three relations $e_1 \xrightarrow{r_1} \xrightarrow{r_2} \xrightarrow{r_3} A$ is decomposed to $e_1 \xrightarrow{r_1} A_1, A_1 \xrightarrow{r_2} A_2, A_2 \xrightarrow{r_3} A$.

  • Reduce a $k$-intersection query to $k$ projection queries and an intersection query, e.g., a $3i$ query with intersection of three projection queries $(e_1 \xrightarrow{r_1}) \wedge (e_2 \xrightarrow{r_2}) \wedge (e_3 \xrightarrow{r_3}) = A$ is decomposed to $e_1 \xrightarrow{r_1} A_1, e_2 \xrightarrow{r_2} A_2, e_3 \xrightarrow{r_3} A_3, A_1 \wedge A_2 \wedge A_3 = A$. Similarly, reduce a $k$-union query to $k$ projection queries and a union query.

The complete decomposition of the exhaustive set of query types used in previous work (Ren & Leskovec, 2020) and our empirical studies can be found in Appendix A.
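The two decomposition rules can be sketched as small deterministic functions. The tuple layout and function names below are hypothetical choices made for illustration; only the $A_1, A_2, \ldots, A$ naming of intermediate answers follows the examples in the text.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]  # (head entity or answer variable, relation, answer variable)


def decompose_projection(entity: str, relations: List[str]) -> List[Step]:
    """Turn a k-level projection into k one-level projections.

    Each step consumes the answer variable produced by the previous one;
    the last step writes to the final answer 'A'.
    """
    steps: List[Step] = []
    head = entity
    for i, rel in enumerate(relations, start=1):
        var = "A" if i == len(relations) else f"A{i}"
        steps.append((head, rel, var))
        head = var
    return steps


def decompose_set_query(pairs: List[Tuple[str, str]], op: str = "and"):
    """Turn a k-intersection (op='and') or k-union (op='or') into
    k projections plus one final set operation over their answers."""
    projections = [(e, r, f"A{i}") for i, (e, r) in enumerate(pairs, start=1)]
    final = (op, [var for _, _, var in projections], "A")
    return projections, final
```

For a $3p$ query, `decompose_projection("e1", ["r1", "r2", "r3"])` yields the chain $e_1 \xrightarrow{r_1} A_1$, $A_1 \xrightarrow{r_2} A_2$, $A_2 \xrightarrow{r_3} A$ from the first rule.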

Figure 2: An overview of the LARK model. The model takes the logical query and infers the query type from it. The query abstraction function maps the entities and relations to abstract IDs, and the neighborhood retrieval mechanism collects the relevant subgraphs from the overall knowledge graph. The chains of the abstracted complex query are then logically decomposed to simpler single-operation queries. The retrieved neighborhood and decomposed queries are further converted into LLM prompts using a template and then processed in the LLM to get the final set of answers for evaluation.

3.3 Chain Reasoning Prompts

In the previous section, we outlined our approach to limit the neighborhood and decompose complex queries into chains of simple queries. Leveraging these, we can now use the reasoning capability of LLMs to obtain the final set of answers for the query, as shown in Figure 2. To achieve this, we employ a prompt template that converts the neighborhood into a context prompt and the decomposed queries into question prompts. It is worth noting that certain queries in the decomposition depend on the responses of preceding queries, such as intersection relying on the preceding projection queries. Additionally, unlike previous prompting methods such as chain-of-thought (Wei et al., 2022) and decomposition (Khot et al., 2023) prompting, the answers need to be integrated at a certain position in the prompt. To address this issue, we maintain a placeholder in dependent queries and a temporary cache of preceding answers that can replace the placeholders in real-time. This also has the added benefit of maintaining the parallelizability of queries, as we can run batches of decomposed queries in phases instead of sequentially running each decomposed query. The specific prompt templates of the complex and decomposed logical queries for different query types are provided in Appendix B.
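The placeholder-and-cache mechanism described above can be sketched as follows. The `{A1}`-style placeholder syntax, the function names, and the `llm_batch` callable (which stands in for whatever batched LLM call is used) are assumptions made for this sketch; the actual prompt templates are those in Appendix B.

```python
import re
from typing import Callable, Dict, List

# Placeholders such as {A1} or {A} refer to answers of earlier sub-queries.
PLACEHOLDER = re.compile(r"\{(A\d*)\}")


def run_in_phases(questions: Dict[str, str],
                  llm_batch: Callable[[List[str]], List[str]]) -> Dict[str, str]:
    """Answer decomposed questions phase by phase.

    In each phase, every question whose placeholders are all present in the
    cache is filled in and sent to the LLM in one batch.
    """
    cache: Dict[str, str] = {}
    pending = dict(questions)
    while pending:
        ready = [name for name, q in pending.items()
                 if all(dep in cache for dep in PLACEHOLDER.findall(q))]
        if not ready:
            raise ValueError("unresolvable dependency between sub-queries")
        prompts = [PLACEHOLDER.sub(lambda m: cache[m.group(1)], pending[name])
                   for name in ready]
        for name, answer in zip(ready, llm_batch(prompts)):
            cache[name] = answer
            del pending[name]
    return cache
```

With a $2i$ query, the two projection questions have no placeholders and run together in the first phase, while the intersection question waits until both answers are cached. This gives two batched LLM calls instead of three sequential ones.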

3.4 Implementation Details

We implemented LARK in PyTorch (Paszke et al., 2019) on eight Nvidia A100 GPUs with 40 GB VRAM. In the case of LLMs, we chose the Llama2 model (Touvron et al., 2023) due to its public availability in the Hugging Face library (Wolf et al., 2020). For efficient inference over the large-scale models, we relied on the mixed-precision version of LLMs and the DeepSpeed library (Rasley et al., 2020) with ZeRO stage 3 optimization. The algorithm of our model is provided in Appendix D, and the implementation code for all our experiments, with exact configuration files and datasets for reproducibility, is publicly available (https://github.com/Akirato/LLM-KG-Reasoning). In our experiments, the highest complexity of a query required a 3-hop neighborhood around the entities and relations. Hence, we set the depth limit to 3 (i.e., $k=3$). Additionally, to make our process compatible with different datasets, we added a limit of $n$ tokens on the input, which is dependent on the LLM (for Llama2, $n=4096$). In practice, this implies that we stop the depth-first traversal when the context becomes longer than $n$.
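The token limit can be illustrated with a short sketch that serializes retrieved triplets hop by hop and stops once the budget would be exceeded. The function name, the `(h, r, t)` line format, and the whitespace-based token counter are assumptions for illustration; a real setup would count tokens with the tokenizer of the underlying LLM.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def build_context(levels: Sequence[Sequence[Triple]],
                  max_tokens: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> str:
    """Serialize triplets level by level until the token budget is reached.

    `levels[0]` holds the 1-hop triplets, `levels[1]` the 2-hop ones, etc.
    The default token counter splits on whitespace, which only approximates
    a real LLM tokenizer.
    """
    lines: List[str] = []
    used = 0
    for level in levels:
        for h, r, t in level:
            line = f"({h}, {r}, {t})"
            cost = count_tokens(line)
            if used + cost > max_tokens:
                return "\n".join(lines)  # budget reached: stop the traversal
            lines.append(line)
            used += cost
    return "\n".join(lines)
```

Filling the context in hop order means that, when the budget is tight, triplets closest to the query entities are kept first.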

4 Experimental Results

This section describes our experiments that aim to answer the following research questions (RQs):

  • RQ1. Does LARK outperform the state-of-the-art baselines on the task of logical reasoning over standard knowledge graph benchmarks?

  • RQ2. How does our combination of chain decomposition query and logically-ordered answer mechanism perform in comparison with the standard prompting techniques?

  • RQ3. How do the scale and design of LARK’s underlying LLM affect its performance?

  • RQ4. How would our model perform with support for increased token size?

  • RQ5. Does query abstraction affect the reasoning performance of our model?

4.1 Datasets and Baselines

We select the following standard benchmark datasets to investigate the performance of our model against state-of-the-art models on the task of logical reasoning over knowledge graphs:

  • FB15k (Bollacker et al., 2008) is based on Freebase, a large collaborative knowledge graph project that was originally developed by Metaweb and later acquired by Google. FB15k contains about 15,000 entities, 1,345 relations, and 592,213 triplets (statements that assert a fact about an entity).

  • FB15k-237 (Toutanova et al., 2015) is a subset of FB15k, containing 14,541 entities, 237 relations, and 310,116 triplets. Its relations are a subset of the relations in FB15k, and the dataset was created to address some of the limitations of FB15k, such as the presence of many irrelevant or ambiguous relations, and to provide a more challenging benchmark for knowledge graph completion models.

  • NELL995 (Carlson et al., 2010) was created using the Never-Ending Language Learning (NELL) system, which is a machine learning system that automatically extracts knowledge from the web by reading text and inferring new facts. NELL995 contains 9,959 entities, 200 relations, and 114,934 triplets. The relations in NELL995 cover a wide range of domains, including geography, sports, and politics.

Our criterion for selecting the above datasets was their ubiquity in previous works on this research problem. Further details on their token sizes are provided in Appendix E. For the baselines, we chose the following methods:

  • GQE (Hamilton et al., 2018) encodes a query as a single vector and represents entities and relations in a low-dimensional space. It uses translation and deep set operators, which are modeled as projection and intersection operators, respectively.

  • Query2Box (Q2B) (Ren et al., 2020) uses a box embedding model which is a generalization of the traditional vector embedding model and can capture richer semantics.

  • BetaE (Ren & Leskovec, 2020) uses a novel beta distribution to model the uncertainty in the representation of entities and relations. BetaE can capture both the point estimate and the uncertainty of the embeddings, which leads to more accurate predictions in knowledge graph completion tasks.

  • HQE (Choudhary et al., 2021b) uses the hyperbolic query embedding mechanism to model the complex queries in knowledge graph completion tasks.

  • HypE (Choudhary et al., 2021b) uses the hyperboloid model to represent entities and relations in a knowledge graph that simultaneously captures their semantic, spatial, and hierarchical features.

  • CQD (Arakelyan et al., 2021) decomposes complex queries into simpler sub-queries and applies a query-specific attention mechanism to the sub-queries.

4.2 RQ1. Efficacy on Logical Reasoning

To study the efficacy of our model on the task of logical reasoning, we compare it against the previous baselines on the following standard logical query constructs:

  1. Multi-hop Projection traverses multiple relations from a head entity in a knowledge graph to answer complex queries by projecting the query onto the target entities. In our experiments, we consider $1p$, $2p$, and $3p$ queries that denote 1-relation, 2-relation, and 3-relation hop from the head entity, respectively.

  2. Geometric Operations apply the operations of intersection ($\wedge$) and union ($\vee$) to answer the query. Our experiments use $2i$ and $3i$ queries that represent the intersection over 2 and 3 entities, respectively. Also, we study $2u$ queries that perform union over 2 entities.

  3. Compound Operations integrate multiple operations such as intersection, union, and projection to handle complex queries over a knowledge graph.

  4. Negation Operations negate the query by finding entities that do not satisfy the given logic. In our experiments, we examine $2in$, $3in$, $inp$, and $pin$ queries that negate $2i$, $3i$, $ip$, and $pi$ queries, respectively. We also analyze $pni$ (an additional variant of the $pi$ query), where the negation is over both entities in the intersection. It should be noted that BetaE is the only method in the existing literature that supports negation, and hence, we only compare against it in our experiments.

We present the results of our experimental study, which compares the Mean Reciprocal Rank (MRR) score of the retrieved candidate entities using different query constructions. MRR is calculated as the average of the reciprocal ranks of the candidate entities (more metrics such as HITS@K=1,3,10 are reported in Appendix C). In order to ensure a fair comparison, we selected the query constructions that were used in most of the previous works in this domain (Ren & Leskovec, 2020). An illustration of these query types is provided in Appendix A for better understanding. Our experiments show that LARK outperforms previous state-of-the-art baselines by 35%-84% on average across different query types, as reported in Table 1. We observe that the performance improvement is higher for simpler queries, where $1p > 2p > 3p$ and $2i > 3i$. This suggests that LLMs are better at capturing breadth across relations but may not be as effective at capturing depth over multiple relations. Moreover, our evaluation also encompasses testing against challenging negation queries, for which BetaE (Ren & Leskovec, 2020) remains the only existing approach. Even in this complex scenario, our findings, as illustrated in Table 2, indicate that LARK significantly outperforms the baseline by 140%. This affirms the superior reasoning capabilities of our model in tackling complex query scenarios. Another point of note is that certain baselines such as CQD are able to outperform LARK on the FB15k dataset for certain query types such as $1p$, $3i$, and $ip$. The reason for this is that FB15k suffers from data leakage from the training set to the validation and testing sets (Toutanova et al., 2015). This unfairly benefits the training-based baselines over the inference-only LARK model.
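For reference, MRR and HITS@K can be computed as in the sketch below. It assumes one ranked candidate list and one set of correct answers per query and scores the rank of the first correct candidate; this is a common simplification and may differ from the exact evaluation protocol used in the experiments. The function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked candidates, correct answers)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first correct candidate, or 0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


def hits_at_k(runs: List[Run], k: int) -> float:
    """Fraction of queries with a correct candidate in the top k."""
    return sum(any(c in g for c in r[:k]) for r, g in runs) / len(runs)
```

The tables in this section report these scores multiplied by 100.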

Table 1: Performance comparison between LARK and the baselines in terms of their efficacy of logical reasoning using MRR scores. The rows present various models and the columns correspond to different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
Dataset    Model              1p    2p    3p    2i    3i    ip    pi    2u    up
FB15k      GQE              54.6  15.3  10.8  39.7  51.4  27.6  19.1  22.1  11.6
FB15k      Q2B              68.0  21.0  14.2  55.1  66.5  39.4  26.1  35.1  16.7
FB15k      BetaE            65.1  25.7  24.7  55.8  66.5  43.9  28.1  40.1  25.2
FB15k      HQE              54.3  33.9  23.3  38.4  50.6  12.5  24.9  35.0  25.9
FB15k      HypE             67.3  43.9  33.0  49.5  61.7  18.9  34.7  47.0  37.4
FB15k      CQD              79.4  39.6  27.0  74.0  78.2  70.0  43.3  48.4  17.5
FB15k      LARK(complex)    73.6  46.5  32.0  66.9  61.8  24.8  47.2  47.7  37.5
FB15k      LARK(ours)       73.6  49.3  35.1  67.8  62.6  29.3  54.5  51.9  37.7
FB15k-237  GQE              35.0   7.2   5.3  23.3  34.6  16.5  10.7   8.2   5.7
FB15k-237  Q2B              40.6   9.4   6.8  29.5  42.3  21.2  12.6  11.3   7.6
FB15k-237  BetaE            39.0  10.9  10.0  28.8  42.5  22.4  12.6  12.4   9.7
FB15k-237  HQE              37.6  20.9  16.9  25.3  35.2  17.3   8.2  15.6  17.9
FB15k-237  HypE             49.0  34.3  23.7  33.9  44.0  18.6  30.5  41.0  26.0
FB15k-237  CQD              44.5  11.3   8.1  32.0  42.7  25.3  15.3  13.4   4.8
FB15k-237  LARK(complex)    70.0  34.0  21.5  43.4  42.2  18.7  38.4  49.2  25.1
FB15k-237  LARK(ours)       70.0  36.9  24.5  44.3  43.1  23.2  45.6  56.6  25.4
NELL995    GQE              32.8  11.9   9.6  27.5  35.2  18.4  14.4   8.5   8.8
NELL995    Q2B              42.2  14.0  11.2  33.3  44.5  22.4  16.8  11.3  10.3
NELL995    BetaE            53.0  13.0  11.4  37.6  47.5  24.1  14.3  12.2   8.5
NELL995    HQE              35.5  20.9  18.9  23.2  36.3   8.8  13.7  21.3  15.5
NELL995    HypE             46.0  30.6  27.9  33.6  48.6  31.8  13.5  20.7  26.4
NELL995    CQD              50.7  18.4  13.8  39.8  49.0  29.0  22.0  16.3   9.9
NELL995    LARK(complex)    83.2  39.8  27.6  49.3  48.0  18.7  19.6   8.3  36.8
NELL995    LARK(ours)       83.2  42.3  31.0  49.9  48.7  23.1  23.0  20.1  37.2
Table 2: Performance comparison between LARK and the baseline for negation query types using MRR scores. The best results for each query type in every dataset are highlighted in bold font. Our model’s performance is significantly higher on most negation queries. However, the performance is limited in 3in and pni queries due to their high number of tokens (shown in Appendix E).
Dataset    Model             2in   3in   inp   pin   pni
FB15k      BetaE            14.3  14.7  11.5   6.5  12.4
FB15k      LARK(complex)    16.5   6.2  32.5  22.8  10.5
FB15k      LARK(ours)       17.5   7.0  34.7  26.7  11.1
FB15k-237  BetaE             5.1   7.9   7.4   3.6   3.4
FB15k-237  LARK(complex)     6.1   3.4  21.6  12.8   2.9
FB15k-237  LARK(ours)        7.0   4.1  23.9  16.8   3.5
NELL995    BetaE             5.1   7.8  10.0   3.1   3.5
NELL995    LARK(complex)     8.9   5.3  23.0  10.4   6.3
NELL995    LARK(ours)       10.4   6.6  25.4  13.6   7.6

4.3 RQ2. Advantages of Chain Decomposition

The aim of this experiment is to investigate the advantages of using chain decomposed queries over standard complex queries. We employ the same experimental setup described in Section 4.2. Our results, in Tables 1 and 2, demonstrate that utilizing chain decomposition contributes to a significant improvement of 20%-33% in our model’s performance. This improvement is a clear indication of the LLMs’ ability to capture a broad range of relations and effectively utilize this capability for enhancing the performance on complex queries. This study highlights the potential of using chain decomposition to overcome the limitations of complex queries and improve the efficiency of logical reasoning tasks. This finding is a significant contribution to the field of natural language processing and has implications for various other applications such as question-answering systems and knowledge graph completion. Overall, our results suggest that chain-decomposed queries could be a promising approach for improving the performance of LLMs on complex logical reasoning tasks.

4.4 RQ3. Analysis of LLM scale

This experiment analyzes the impact of the size of the underlying LLMs and query abstraction on the overall LARK model performance. To examine the effect of LLM size, we compared two variants of the Llama2 model, with 7 billion and 13 billion parameters. Our evaluation results, presented in Table 3, show that the performance of the LARK model improves by 123% from Llama2-7B to Llama2-13B. This indicates that increasing the number of LLM parameters can enhance the performance of the LARK model.

Table 3: MRR scores of LARK on the FB15k-237 dataset with underlying LLMs of different sizes. The best results for each query type are highlighted in bold font.
LLM # Params 1p 2p 3p 2i 3i ip pi 2u up 2in 3in inp pin pni
Llama2 7B 73.1 33.2 20.6 10.6 25.2 25.9 17.2 20.8 24.3 4 1.8 14.2 7.4 1.9
13B 73.6 49.3 35.1 67.8 62.6 29.3 54.5 51.9 37.7 7.0 4.1 23.9 16.8 3.5
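The text does not state how the 123% figure is aggregated. One reading that reproduces it from Table 3 is the mean of the per-query relative MRR gains; the sketch below assumes that aggregation (it is not confirmed by the text) and uses only the values printed in Table 3. The names `relative_gain`, `gains`, and `mean_gain` are illustrative.

```python
# MRR scores copied from Table 3 (FB15k-237), in query-type order.
query_types = ["1p", "2p", "3p", "2i", "3i", "ip", "pi", "2u", "up",
               "2in", "3in", "inp", "pin", "pni"]
mrr_7b = [73.1, 33.2, 20.6, 10.6, 25.2, 25.9, 17.2, 20.8, 24.3,
          4.0, 1.8, 14.2, 7.4, 1.9]
mrr_13b = [73.6, 49.3, 35.1, 67.8, 62.6, 29.3, 54.5, 51.9, 37.7,
           7.0, 4.1, 23.9, 16.8, 3.5]


def relative_gain(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Assumption: the reported figure averages the per-query relative gains.
gains = {q: relative_gain(a, b)
         for q, a, b in zip(query_types, mrr_7b, mrr_13b)}
mean_gain = sum(gains.values()) / len(gains)

for q, g in gains.items():
    print(f"{q:>3}: {g:6.1f}%")
print(f"mean of per-query relative gains: {mean_gain:.1f}%")
```

Under this reading, the average is dominated by the intersection and union queries (2i, 3i, pi, 2u), while 1p changes very little between the two model sizes.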

4.5 RQ4. Study on Increased Token Limit of LLMs

From the dataset details provided in Appendix E, we observe that the token size of different query types fluctuates considerably, from 58 to over 100,000. Unfortunately, the token limit of Llama2, the base model in our experiments, is 4096. This limit is insufficient to demonstrate the full potential of LARK on our tasks. To address this limitation, we consider models with higher token limits, such as GPT-3.5 (OpenAI, 2023). However, these models are expensive to run, so we could not conduct a thorough analysis on the entire dataset. Nevertheless, to gain insight into LARK’s potential with an increased token limit, we randomly sampled 1000 queries per query type from each dataset, covering token lengths both over and under 4096, and compared our model on these queries with GPT-3.5 and Llama2 as the base. The evaluation results, displayed in Table 4, show that moving from Llama2 to GPT-3.5 improves the performance of LARK by 29%-40%, which suggests that increasing the token limit of LLMs has significant potential for further performance enhancement.

Table 4: MRR scores of LARK with Llama2 and GPT LLMs as the underlying base models. The best results for each query type in every dataset are highlighted in bold font.
FB15k
LLM 1p 2p 3p 2i 3i ip pi 2u up 2in 3in inp pin pni
Llama2-7B 23.4 21.5 22.6 3.4 3 26.1 18.4 14.8 3.9 9.5 4.7 21.7 26.4 5.8
Llama2-13B 23.8 22.8 24.2 3.5 3 23.3 30.8 30.7 3.9 12.4 6.6 28.4 51.4 7.7
GPT-3.5 36.1 34.6 36.8 17.0 14.4 35.4 46.7 39.3 19.5 18.8 10.0 43.1 56.7 11.6
FB15k-237
LLM 1p 2p 3p 2i 3i ip pi 2u up 2in 3in inp pin pni
Llama2-7B 23.1 27.4 31.5 5 4.1 26.6 20.9 15.3 5.6 26.6 8.8 33.7 31 21.1
Llama2-13B 23.5 29.2 33.8 5 4.1 23.7 35 31.7 5.6 34.7 12.3 44 60.4 28
GPT-3.5 35.7 44.2 51.2 24.8 20.2 36.0 53.1 40.6 28.1 52.5 18.7 66.8 66.6 42.4
NELL995
LLM 1p 2p 3p 2i 3i ip pi 2u up 2in 3in inp pin pni
Llama2-7B 28 24.4 27.6 3.7 3.2 24 8.4 14.5 5.7 14 7.7 23.1 21.3 13.4
Llama2-13B 28.4 26 29.5 3.7 3.2 21.5 14.1 25.4 5.7 18.3 10.8 30.1 30.2 17.7
GPT-3.5 43.1 39.4 44.8 18.3 15.5 32.6 21.4 38.5 28.3 27.7 16.4 45.7 45.9 26.8
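The sampling step behind this comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: `count_tokens` is a whitespace stand-in for the tokenizer of the underlying LLM, the function name `sample_by_token_limit` is hypothetical, and the text leaves open whether 1000 queries are drawn per bucket or in total (the sketch draws up to `n` per bucket).

```python
import random
from collections import defaultdict

TOKEN_LIMIT = 4096   # Llama2 token limit stated in the text
SAMPLE_SIZE = 1000   # queries sampled per query type, per the text


def count_tokens(prompt):
    # Assumption: whitespace split as a stand-in for the LLM's tokenizer.
    return len(prompt.split())


def sample_by_token_limit(queries, limit=TOKEN_LIMIT, n=SAMPLE_SIZE, seed=0):
    """Split prompts per query type into those under and over the token
    limit, then draw at most n prompts at random from each bucket.

    queries: iterable of (query_type, prompt) pairs.
    Returns {query_type: {"within": [...], "over": [...]}}.
    """
    buckets = defaultdict(lambda: {"within": [], "over": []})
    for qtype, prompt in queries:
        key = "within" if count_tokens(prompt) < limit else "over"
        buckets[qtype][key].append(prompt)
    rng = random.Random(seed)
    return {
        qtype: {k: rng.sample(v, min(n, len(v))) for k, v in parts.items()}
        for qtype, parts in buckets.items()
    }


# Toy usage with synthetic prompts of 10, 5000, and 3 tokens.
toy = [("2p", "tok " * 10), ("2p", "tok " * 5000), ("1p", "tok " * 3)]
sampled = sample_by_token_limit(toy, n=1)
```

Keeping the two buckets separate makes it possible to report scores for queries that fit the smaller model's context and for those that do not.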

4.6 RQ5. Effects of Query Abstraction

Figure 3: Effects of Query Abstraction.

Regarding the analysis of query abstraction, we considered a variant of LARK called ‘LARK (semantic)’, which retains semantic information in KG entities and relations. As shown in Figure 3, semantic information provides a minor performance enhancement of 0.01% for simple projection queries. However, in more complex queries, it results in a performance degradation of 0.7%-1.4%. The primary cause of this degradation is that the inclusion of semantic information pushes the prompt beyond the LLMs’ token limit, leading to a loss of neighborhood information. Hence, we assert that query abstraction is not only a valuable technique for mitigating model hallucination and achieving generalization across different KG datasets but can also enhance performance by reducing token size.
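The effect of abstraction on prompt size can be illustrated with a small sketch. The helper names (`build_vocab`, `abstract_triplets`, `render`) and the two toy triplets are illustrative assumptions, and character length is used only as a rough proxy for token count.

```python
def build_vocab(triplets):
    """Assign each entity and relation an integer ID in order of first appearance."""
    ent2id, rel2id = {}, {}
    for h, r, t in triplets:
        for e in (h, t):
            ent2id.setdefault(e, len(ent2id))
        rel2id.setdefault(r, len(rel2id))
    return ent2id, rel2id


def abstract_triplets(triplets, ent2id, rel2id):
    """Replace entity and relation names with their IDs."""
    return [(ent2id[h], rel2id[r], ent2id[t]) for h, r, t in triplets]


def render(triplets):
    """Serialize triplets in the (h,r,t) form used in the prompt context."""
    return ", ".join(f"({h},{r},{t})" for h, r, t in triplets)


# Toy triplets, for illustration only.
semantic = [
    ("Barack Obama", "/people/person/place_of_birth", "Honolulu"),
    ("Honolulu", "/location/location/containedby", "Hawaii"),
]
ent2id, rel2id = build_vocab(semantic)
abstracted = abstract_triplets(semantic, ent2id, rel2id)

print(render(semantic))
print(render(abstracted))
```

The abstracted context is much shorter than the semantic one, so more neighborhood triplets fit under a fixed token limit.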

5 Concluding Discussion

In this paper, we presented LARK, the first approach to integrate logical reasoning over knowledge graphs with the capabilities of LLMs. Our approach utilizes logically-decomposed LLM prompts to enable chain reasoning over subgraphs retrieved from knowledge graphs, allowing us to efficiently leverage the reasoning ability of LLMs. Through our experiments on logical reasoning across standard KG datasets, we demonstrated that LARK outperforms previous state-of-the-art approaches by a significant margin on 14 different FOL query types. Finally, our work also showed that the performance of LARK improves with increasing scale and better design of the underlying LLMs. We demonstrated that LLMs that can handle larger input token lengths can lead to significant performance improvements. Overall, our approach presents a promising direction for integrating LLMs with logical reasoning over knowledge graphs.

The proposed approach of using LLMs for complex logical reasoning over KGs is expected to pave a new way for improved reasoning over large, noisy, and incomplete real-world KGs. This can potentially have a significant impact on various applications such as natural language understanding, question answering systems, and intelligent information retrieval systems. For example, in healthcare, KGs can be used to represent patient data, medical knowledge, and clinical research, and logical reasoning over these KGs can enable better diagnosis, treatment, and drug discovery. However, there are also ethical considerations that should be taken into account. As with most AI-based technologies, there is a potential risk of inducing bias into the model, which can lead to unfair decisions and actions. Bias can be introduced in the KGs themselves, as they are often created semi-automatically from biased sources, and can be amplified by the logical reasoning process. Moreover, the large amount of data used to train LLMs can also introduce bias, as it may reflect societal prejudices and stereotypes. Therefore, it is essential to carefully monitor and evaluate the KGs and LLMs used in this approach to ensure fairness and avoid discrimination. The performance of this method also depends on the quality and completeness of the KGs used and on the limited token size of current LLMs. However, we also observe that the current trend of increasing LLM token limits will soon resolve some of these limitations.

References

  • Arakelyan et al. (2021) Erik Arakelyan, Daniel Daza, Pasquale Minervini, and Michael Cochez. Complex query answering with neural link predictors. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Mos9F9kDwkz.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp.  1247–1250, New York, NY, USA, 2008. Association for Computing Machinery. URL https://doi.org/10.1145/1376616.1376746.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pp.  1306–1313. AAAI Press, 2010.
  • Chen et al. (2022) Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. Decoupling knowledge from memorization: Retrieval-augmented prompt learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Q8GnGqT-GTJ.
  • Choudhary et al. (2021a) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan Reddy. Probabilistic entity representation model for reasoning over knowledge graphs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  23440–23451. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/c4d2ce3f3ebb5393a77c33c0cd95dc93-Paper.pdf.
  • Choudhary et al. (2021b) Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K. Reddy. Self-supervised hyperboloid representations from logical queries over knowledge graphs. In Proceedings of the Web Conference 2021, WWW ’21, pp.  1373–1384, New York, NY, USA, 2021b. Association for Computing Machinery. URL https://doi.org/10.1145/3442381.3449974.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Creswell et al. (2023) Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3Pf3Wg6o-A4.
  • Das et al. (2017) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp.  132–141, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://aclanthology.org/E17-1013.
  • Dong et al. (2023) Junnan Dong, Qinggang Zhang, Xiao Huang, Keyu Duan, Qiaoyu Tan, and Zhimeng Jiang. Hierarchy-aware multi-hop question answering over knowledge graphs. In Proceedings of the Web Conference 2023, WWW ’23, New York, NY, USA, 2023. Association for Computing Machinery. URL https://doi.org/10.1145/3543507.3583376.
  • Dua et al. (2022) Dheeru Dua, Shivanshu Gupta, Sameer Singh, and Matt Gardner. Successive prompting for decomposing complex questions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  1251–1265, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.81.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
  • Hamilton et al. (2018) Will Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/ef50c335cca9f340bde656363ebd02fd-Paper.pdf.
  • Jung et al. (2022) Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  1266–1279, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.82.
  • Khot et al. (2023) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_nGgzQjzaRy.
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp.  809–816, Madison, WI, USA, 2011. Omnipress.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv, 2023.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp.  8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pp.  3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. URL https://doi.org/10.1145/3394486.3406703.
  • Ren & Leskovec (2020) Hongyu Ren and Jure Leskovec. Beta embeddings for multi-hop logical reasoning in knowledge graphs. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  • Ren et al. (2020) Hongyu Ren, Weihua Hu, and Jure Leskovec. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgr4kSFDS.
  • Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp.  697–706, New York, NY, USA, 2007. Association for Computing Machinery. URL https://doi.org/10.1145/1242572.1242667.
  • Toutanova et al. (2015) Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  1499–1509, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL https://aclanthology.org/D15-1174.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-demos.6.
  • Yasunaga et al. (2021) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  535–546, Online, June 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.naacl-main.45.
  • Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM.

Appendix

Appendix A Query Decomposition of Different Query Types

Figure 4 provides the query decomposition of different query types considered in our empirical study as well as previous literature in the area.

Figure 4: Query Decomposition of different query types considered in our experiments.

Appendix B Prompt Templates of Different Query Types

The prompt templates for full complex logical queries with multiple operations and for decomposed elementary logical queries with a single operation are provided in Tables 5 and 6, respectively.

Table 5: Full Prompt Templates of Different Query Types.
Type | Logical Query | Template for Full Prompts
Context | $\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r; $(h_{1},r_{1},t_{1}),(h_{2},r_{2},t_{2}),(h_{3},r_{3},t_{3}),(h_{4},r_{4},t_{4}),(h_{5},r_{5},t_{5}),(h_{6},r_{6},t_{6}),(h_{7},r_{7},t_{7}),(h_{8},r_{8},t_{8})$
1p | $\exists X.r_{1}(X,e_{1})$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
2p | $\exists X.r_{1}(X,\exists Y.r_{2}(Y,e_{1}))$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$. Then, what are the entities connected to E by relation $r_{2}$?
3p | $\exists X.r_{1}(X,\exists Y.r_{2}(Y,\exists Z.r_{3}(Z,e_{1})))$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and the set of entities F is connected to entities in E by relation $r_{2}$. Then, what are the entities connected to F by relation $r_{3}$?
2i | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and the set of entities F is connected to entity $e_{2}$ by relation $r_{2}$. Then, what are the entities in the intersection of set E and F, i.e., entities present in both F and G?
3i | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, the set of entities F is connected to entity $e_{2}$ by relation $r_{2}$ and the set of entities G is connected to entity $e_{3}$ by relation $r_{3}$. Then, what are the entities in the intersection of set E, F and G, i.e., entities present in all E, F and G?
ip | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\wedge r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$, and G is the set of entities in the intersection of E and F. Then, what are the entities connected to entities in set G by relation $r_{3}$?
pi | $\exists X.[r_{1}(X,\exists Y.r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, F is the set of entities connected to entities in E by relation $r_{2}$, and G is the set of entities connected to entity $e_{2}$ by relation $r_{3}$. Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G?
2u | $\exists X.[r_{1}(X,e_{1})\vee r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$. Then, what are the entities in the union of set F and G, i.e., entities present in either F or G?
up | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\vee r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$. G is the set of entities in the union of E and F. Then, what are the entities connected to entities in G by relation $r_{3}$?
2in | $\exists X.[r_{1}(X,e_{1})\wedge\neg r_{2}(X,e_{2})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$ and F is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{2}$. Then, what are the entities in the intersection of set E and F, i.e., entities present in both F and G?
3in | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge\neg r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, F is the set of entities connected to entity $e_{2}$ by relation $r_{2}$, and F is the set of entities connected to entity $e_{3}$ by any relation other than relation $r_{3}$. Then, what are the entities in the intersection of set E and F, i.e., entities present in both F and G?
inp | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\wedge\neg r_{2}(Y,e_{2})])$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, and F is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{2}$. Then, what are the entities that are connected to the entities in the intersection of set E and F by relation $r_{3}$?
pin | $\exists X.[r_{1}(X,\exists Y.\neg r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, F is the set of entities connected to entities in E by relation $r_{2}$, and G is the set of entities connected to entity $e_{2}$ by any relation other than relation $r_{3}$. Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G?
pni | $\exists X.[r_{1}(X,\exists Y.\neg r_{2}(Y,e_{2}))\wedge\neg r_{3}(X,e_{3})]$ | Let us assume that the set of entities E is connected to entity $e_{1}$ by relation $r_{1}$, F is the set of entities connected to entities in E by any relation other than $r_{2}$, and G is the set of entities connected to entity $e_{2}$ by relation $r_{3}$. Then, what are the entities in the intersection of set F and G, i.e., entities present in both F and G?
Table 6: Decomposed Prompt Templates of Different Query Types.
Type | Logical Query | Template for Decomposed Prompts
Context | $\mathcal{N}_{k}(q_{\tau}[Q_{\tau}])$ | Given the following (h,r,t) triplets where entity h is related to entity t by relation r; $(h_{1},r_{1},t_{1}),(h_{2},r_{2},t_{2}),(h_{3},r_{3},t_{3}),(h_{4},r_{4},t_{4}),(h_{5},r_{5},t_{5}),(h_{6},r_{6},t_{6}),(h_{7},r_{7},t_{7}),(h_{8},r_{8},t_{8})$
1p | $\exists X.r_{1}(X,e_{1})$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
2p | $\exists X.r_{1}(X,\exists Y.r_{2}(Y,e_{1}))$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to any entity in [PP1] by relation $r_{2}$?
3p | $\exists X.r_{1}(X,\exists Y.r_{2}(Y,\exists Z.r_{3}(Z,e_{1})))$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to any entity in [PP1] by relation $r_{2}$?
 | | Which entities are connected to any entity in [PP2] by relation $r_{3}$?
2i | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{2}$?
 | | What are the entities in the intersection of entity sets [PP1] and [PP2]?
3i | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{2}$?
 | | Which entities are connected to $e_{3}$ by relation $r_{3}$?
 | | What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]?
ip | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\wedge r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{2}$?
 | | What are the entities in the intersection of entity sets [PP1] and [PP2]?
 | | What are the entities connected to any entity in [PP3] by relation $r_{3}$?
pi | $\exists X.[r_{1}(X,\exists Y.r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to [PP1] by relation $r_{2}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{3}$?
 | | What are the entities in the intersection of entity sets [PP2] and [PP3]?
2u | $\exists X.[r_{1}(X,e_{1})\vee r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{2}$?
 | | What are the entities in the union of entity sets [PP1] and [PP2]?
up | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\vee r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{2}$?
 | | What are the entities in the union of entity sets [PP1] and [PP2]?
 | | Which entities are connected to any entity in [PP3] by relation $r_{3}$?
2in | $\exists X.[r_{1}(X,e_{1})\wedge\neg r_{2}(X,e_{2})]$ | Which entities are connected to $e_{1}$ by any relation other than $r_{1}$?
 | | Which entities are connected to $e_{2}$ by any relation other than $r_{2}$?
 | | What are the entities in the intersection of entity sets [PP1] and [PP2]?
3in | $\exists X.[r_{1}(X,e_{1})\wedge r_{2}(X,e_{2})\wedge\neg r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by any relation other than $r_{1}$?
 | | Which entities are connected to $e_{2}$ by any relation other than $r_{2}$?
 | | Which entities are connected to $e_{3}$ by any relation other than $r_{3}$?
 | | What are the entities in the intersection of entity sets [PP1], [PP2] and [PP3]?
inp | $\exists X.r_{3}(X,\exists Y.[r_{1}(Y,e_{1})\wedge\neg r_{2}(Y,e_{2})])$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to $e_{2}$ by any relation other than $r_{2}$?
 | | What are the entities in the intersection of entity sets [PP1] and [PP2]?
 | | What are the entities connected to any entity in [PP3] by relation $r_{3}$?
pin | $\exists X.[r_{1}(X,\exists Y.\neg r_{2}(Y,e_{2}))\wedge r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to entity set in [PP1] by relation $r_{2}$?
 | | Which entities are connected to $e_{2}$ by any relation other than $r_{3}$?
 | | What are the entities in the intersection of entity sets [PP2] and [PP3]?
pni | $\exists X.[r_{1}(X,\exists Y.\neg r_{2}(Y,e_{2}))\wedge\neg r_{3}(X,e_{3})]$ | Which entities are connected to $e_{1}$ by relation $r_{1}$?
 | | Which entities are connected to any entity in [PP1] by any relation other than $r_{2}$?
 | | Which entities are connected to $e_{2}$ by relation $r_{3}$?
 | | What are the entities in the intersection of entity sets [PP2] and [PP3]?
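To make the role of the [PP1], [PP2], ... placeholders concrete, the sketch below fills the decomposed 3p templates and substitutes a cached answer into a later step. It is a minimal illustration: the helper names (`fill_placeholders`, `build_prompt`), the integer IDs, and the way an answer set is rendered into the question are assumptions, not details given in the text.

```python
import re

CONTEXT = ("Given the following (h,r,t) triplets where entity h is related "
           "to entity t by relation r; {triplets}")

# Decomposed templates for the 3p chain, following the wording of Table 6.
TEMPLATES_3P = [
    "Which entities are connected to {e1} by relation {r1}?",
    "Which entities are connected to any entity in [PP1] by relation {r2}?",
    "Which entities are connected to any entity in [PP2] by relation {r3}?",
]


def fill_placeholders(question, answers):
    """Replace each [PPi] with the cached answer of step i (1-indexed)."""
    def substitute(match):
        step = int(match.group(1))
        # Assumption: an answer set is rendered as a brace-delimited list.
        return "{" + ", ".join(str(a) for a in answers[step - 1]) + "}"
    return re.sub(r"\[PP(\d+)\]", substitute, question)


def build_prompt(triplets, question):
    """Prepend the neighborhood context to a single question."""
    rendered = ", ".join(f"({h},{r},{t})" for h, r, t in triplets)
    return CONTEXT.format(triplets=rendered) + "\n" + question


# Hypothetical abstracted IDs for the anchor entity and the three relations.
questions = [t.format(e1=7, r1=2, r2=5, r3=9) for t in TEMPLATES_3P]
step2 = fill_placeholders(questions[1], answers=[[11, 42]])
print(build_prompt([(7, 2, 11), (7, 2, 42)], step2))
```

Each question after the first is only well defined once the earlier answers are available, which is why the decomposed prompts must be issued sequentially.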

Appendix C Analysis of Logical Reasoning Performance using HITS Metric

Tables 7 and 8 present the HITS@K results (K = 1, 3, 10) of the baselines and our model. HITS@K measures the fraction of queries for which a correct candidate appears in the top-K results.
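For reference, the two metrics used in this paper can be computed as in the sketch below. This is a simplified illustration: it scores each query by its best-ranked correct answer, whereas evaluation protocols in the literature often rank each correct answer separately with filtering; the function names are illustrative.

```python
def hits_at_k(ranked, gold, k):
    """1.0 if any gold answer is among the top-k ranked candidates, else 0.0."""
    return float(any(c in gold for c in ranked[:k]))


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold answer (1-indexed); 0.0 if none is ranked."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def evaluate(predictions, k_values=(1, 3, 10)):
    """predictions: list of (ranked_candidates, gold_answer_set) pairs."""
    n = len(predictions)
    scores = {
        f"HITS@{k}": sum(hits_at_k(r, g, k) for r, g in predictions) / n
        for k in k_values
    }
    scores["MRR"] = sum(reciprocal_rank(r, g) for r, g in predictions) / n
    return scores


# Two toy queries: the first is answered at rank 2, the second is missed.
preds = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"q"})]
scores = evaluate(preds)
print(scores)
```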

Table 7: Performance comparison study between LARK and the baseline, focusing on their efficacy of logical reasoning using HITS@K=1,3,10 scores. The rows correspond to the models and columns denote the different query structures of multi-hop projections, geometric operations, and compound operations. The best results for each query type in every dataset are highlighted in bold font.
Dataset Variant 1p 2p 3p 2i 3i ip pi 2u up
HITS@1
FB15k Llama2-7B 74.6 26 18.5 59.9 47.7 2.4 5.7 5.8 5
complex 77.5 37.9 26.3 67.4 54.6 8.2 20.7 20.7 17.6
step 77.5 41.8 28.1 70.2 57.3 10.3 24.3 22.8 17.8
FB15k-237 Llama2-7B 77.2 28.5 17.7 10.9 22.6 10.8 8.7 10.5 13.2
complex 78.5 30.8 19.3 41.1 38.1 9.6 18.7 24.2 14.0
step 78.5 34.3 21.3 43.2 40.2 11.7 22.2 27.9 14.2
NELL995 Llama2-7B 86.4 28.3 19.6 10.2 24 8.6 3.5 1.5 15.9
complex 88.0 30.9 21.7 44.1 41.6 7.4 8.2 3.3 17
step 88.0 34.3 24.0 46.1 43.8 9.5 9.8 8.9 17.3
HITS@3
FB15k Llama2-7B 74 53.4 34.6 18.2 36.4 44.7 39.4 35.7 77.1
complex 77.7 57.6 37.9 68.5 61.3 39.6 84.8 82.9 81.7
step 77.7 57.4 40.1 69.4 62.5 48.4 91.2 92.7 82.6
FB15k-237 Llama2-7B 75.9 42.6 25.7 12.6 25.9 43.6 35.1 42.9 53.8
complex 78.3 45.9 28.1 47.2 43.7 38.7 75.6 89.4 57
step 78.3 45.9 29.8 48.2 44.6 47.3 80.0 93.6 57.6
NELL995 Llama2-7B 85.6 42.9 28.7 11.8 27.6 34.6 14.1 5.7 65
complex 87.8 46.8 31.6 50.7 47.9 29.8 32.9 13.2 69.4
step 87.8 45.7 33.5 51.3 48.7 38.1 39.6 35.8 70.3
HITS@10
FB15k Llama2-7B 73.6 53.9 35.7 18.1 36.3 44.6 39.5 35.7 77.1
complex 77.7 58.2 39.1 68.2 61.4 39.5 85 82.9 81.7
step 77.7 57.4 46.0 69.4 62.5 48.2 91.2 84.7 82.6
FB15k-237 Llama2-7B 75.2 43 26.5 12.6 25.9 43.6 35.1 42.9 53.8
complex 78.3 46.4 29 47.3 43.8 38.7 75.6 89.4 57
step 78.3 45.9 34.1 48.2 44.6 47.3 80.0 93.6 57.6
NELL995 Llama2-7B 84.9 43.4 29.2 11.8 27.6 34.6 14.1 5.7 65
complex 87.8 47.4 32.2 50.8 48 29.8 32.9 13.2 69.4
step 87.8 45.7 38.3 51.3 48.7 38.1 39.6 35.8 70.3
Table 8: Performance comparison between LARK and the baseline for negation query types using HITS@K=1,3,10 scores. The best results for each query type in every dataset are given in bold font.
Metric HITS@1 HITS@3 HITS@10
Dataset Variant 2in 3in inp pin pni 2in 3in inp pin pni 2in 3in inp pin pni
FB15k Llama2-7B 1.8 0.7 4.0 2.1 0.9 18.6 5.7 40.8 18.8 8.6 18.6 5.7 40.8 18.8 8.6
complex 6.7 2.4 14.2 7.8 3.3 26.6 9.5 59.2 30.3 12.3 26.6 9.5 59.3 30.3 12.4
step 7.4 2.7 14.9 9.1 3.4 31.0 12.1 64.8 38.7 14.4 31.0 12.1 64.8 38.7 14.4
FB15k-237 Llama2-7B 1.9 0.8 6.8 2.8 0.7 7.5 3.5 27.3 11.6 2.7 7.5 3.5 27.3 11.6 2.7
complex 2.7 1.4 9.8 4.6 1 10.8 5.8 39.6 18.7 3.9 10.8 5.8 39.6 18.7 3.9
step 3.2 1.7 10.6 5.8 1.1 12.6 7.4 43.3 23.9 4.6 12.6 7.4 43.3 23.9 4.6
NELL995 Llama2-7B 2.8 1.4 7.2 2.2 1.5 11.2 6 29.1 9.2 6.2 11.2 6 29.1 9.2 6.2
complex 3.9 2.3 10.2 3.7 2.2 16.1 9.4 41.8 15.1 9 16.1 9.4 41.8 15.1 9
step 4.6 2.8 11.1 4.7 2.7 18.5 12.0 46.0 19.3 10.9 18.5 12.0 46.0 19.3 10.9
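
For readers unfamiliar with the metric reported in the tables above, the following is a minimal sketch of HITS@K under a common simplified definition: the percentage of queries for which at least one correct answer appears among the top K predictions. The function name `hits_at_k` and its input layout are assumptions made for illustration; the exact evaluation protocol behind the reported numbers (for example, filtering or per-answer averaging) is not specified in this section and may differ.

```python
from typing import List, Set


def hits_at_k(ranked: List[List[str]], gold: List[Set[str]], k: int) -> float:
    """Percentage of queries with at least one gold answer in the top-k predictions.

    ranked[i] is the ordered prediction list for query i (best first);
    gold[i] is the set of correct answers for the same query.
    This is a simplified, illustrative definition, not the paper's evaluation code.
    """
    if not ranked or len(ranked) != len(gold):
        raise ValueError("ranked and gold must be non-empty and of equal length")
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(
        1
        for preds, answers in zip(ranked, gold)
        if any(p in answers for p in preds[:k])
    )
    return 100.0 * hits / len(ranked)


# Toy example with two queries (entity IDs are made up).
toy_ranked = [["e1", "e2", "e3"], ["e7", "e8", "e9"]]
toy_gold = [{"e2"}, {"e4"}]
toy_scores = {k: hits_at_k(toy_ranked, toy_gold, k) for k in (1, 3, 10)}
```

Under this definition the score cannot decrease as K grows, since the top-K list for a larger K contains the list for a smaller K.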

Appendix D Algorithm

The complete LARK procedure is provided in Algorithm 1.

Input: Logical query $q_{\tau}$, Knowledge Graph $\mathcal{G}: E \times R$;
Output: Answer entities $V_{\tau}$;
# Query Abstraction: Map entity and relations to IDs
$q_{\tau} = Abstract(q_{\tau})$;
$\mathcal{G} = Abstract(\mathcal{G})$;
# Neighborhood Retrieval
$\mathcal{N}_{k}(q_{\tau}[Q_{\tau}]) = \left\{(h,r,t)\right\}$, using Eq. (7)
# Query Decomposition
$q^{d}_{\tau} = Decomp(q_{\tau})$;
# Initialize Answer Cache
$ans = \{\}$;
for $i \in 1:length\left(q^{d}_{\tau}\right)$ do
      # Replace Answer Cache in Question
      $q^{d}_{\tau}[i] = replace(q^{d}_{\tau}[i], ans[i-1])$;
      $ans[i] = LLM\left(q^{d}_{\tau}[i]\right)$;
end for
return $ans[length\left(q^{d}_{\tau}\right)]$
Algorithm 1 LARK Algorithm
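
The chained answering loop at the end of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: query abstraction, neighborhood retrieval, and decomposition are assumed to have already produced the ordered list `sub_queries`; the `{prev}` placeholder, the function name `lark_chain`, and the stub `toy_llm` are hypothetical; and the first sub-query is sent unchanged because no earlier answer exists for it.

```python
from typing import Callable, Dict, List


def lark_chain(
    sub_queries: List[str],
    llm: Callable[[str], str],
    placeholder: str = "{prev}",  # hypothetical marker for the previous answer
) -> str:
    """Answer decomposed sub-queries in order, feeding each answer into the next."""
    if not sub_queries:
        raise ValueError("at least one sub-query is required")
    ans: Dict[int, str] = {}  # answer cache, indexed from 1 as in Algorithm 1
    for i, sub_query in enumerate(sub_queries, start=1):
        if i > 1:
            # Replace the placeholder with the cached answer of the previous step.
            sub_query = sub_query.replace(placeholder, ans[i - 1])
        ans[i] = llm(sub_query)
    # The answer to the last sub-query is the answer to the full query.
    return ans[len(sub_queries)]


def toy_llm(prompt: str) -> str:
    """Stand-in for an LLM call: looks the prompt up in a fixed table."""
    canned = {
        "Which entities are connected to 1 by relation 2?": "5",
        "Which entities are connected to 5 by relation 3?": "9",
    }
    return canned[prompt]


final_answer = lark_chain(
    [
        "Which entities are connected to 1 by relation 2?",
        "Which entities are connected to {prev} by relation 3?",
    ],
    toy_llm,
)
```

Passing the model as a plain callable keeps the loop independent of any particular LLM client, so a real model call can be swapped in for the stub.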
Table 9: Details of the token distribution for various query types in different datasets. The columns present the mean, median, minimum (Min), and maximum (Max) number of tokens in the queries of each query type. Column ‘Cov’ presents the percentage of queries (coverage) that contain fewer than 4096 tokens, which is the token limit of the Llama2 model.
Dataset FB15k FB15k-237 NELL
Type Mean Median Min Max Cov Mean Median Min Max Cov Mean Median Min Max Cov
1p 70.2 61 58 10338 100 82.1 61 58 30326 99.9 81.7 61 58 30250 99.9
2p 331.2 106 86 27549 97.1 1420.9 140 83 130044 89.7 893.4 136 83 108950 90.9
3p 785.2 165 103 80665 91 3579.8 329 103 208616 75.7 3052.6 389 100 164545 73.7
2i 1136.7 276 119 20039 86.3 4482.8 636 119 60655 67.7 4469.3 680 119 54916 67.3
3i 2575.4 860 145 29148 68.4 8760.2 2294 145 85326 48.3 8979.4 2856 145 76834 44.8
ip 1923.8 1235 135 21048 67.4 4035.8 2017 131 32795 50.5 4838 2676 131 33271 43.6
pi 1036.8 455 140 10937 85.8 1255.6 343 141 45769 83.4 1535.3 435 135 21125 79.9
2u 1325.4 790 121 14703 80.8 2109.5 868 123 60655 68.9 2294.9 1138 125 23637 65.7
up 115.3 112 110 958 100 113.7 112 110 981 100 113.2 112 110 427 100
2in 1169.1 548 123 18016 84.9 5264.7 1116 128 60281 61.8 3496 774 124 58032 71.6
3in 4070.3 2230 159 28679 46.6 13695.8 8344 175 88561 25.9 12575.9 7061 164 88250 28.1
inp 629 112 110 73457 91.8 1949.4 394 110 115169 78.4 696.7 112 110 89660 93.8
pin 400.7 154 129 6802 95.8 1106.5 242 129 44010 87.2 418.1 131 129 24062 96.7
pni 345.9 129 127 7938 96.6 547.1 129 127 18057 95.1 289.3 129 127 17489 97.9

Appendix E Query Token Distribution in Datasets

The quantitative details of the query token lengths are provided in Table 9, and their complete distribution plots are provided in Figure 5. From the results, we observe that the distribution of token lengths is positively skewed for most of the query types, which indicates that only a small number of samples have high token lengths. Thus, small improvements in the LLMs’ token limit can potentially lead to better coverage on most of the reasoning queries in standard KG datasets.

Figure 5: Probability distribution of the number of tokens in each query type. The figure contains 14 graphs for the 14 different query types. The x-axis and y-axis present the number of tokens in the query and their probability density, respectively.
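
The summary columns of Table 9 can be computed from per-query token counts as sketched below. This is a minimal illustration, assuming the token counts have already been obtained from a tokenizer; the function name `token_stats` and the toy counts are hypothetical and are not taken from the datasets.

```python
from statistics import mean, median
from typing import Dict, List


def token_stats(token_counts: List[int], limit: int = 4096) -> Dict[str, float]:
    """Mean, median, min, max, and coverage (percentage of queries under the limit)."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    # Coverage counts queries with strictly fewer tokens than the limit.
    covered = sum(1 for n in token_counts if n < limit)
    return {
        "mean": mean(token_counts),
        "median": median(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "cov": 100.0 * covered / len(token_counts),
    }


# Toy counts: three short queries and one very long one (positively skewed).
toy_stats = token_stats([58, 61, 61, 10338])
```

In the toy input the single long query pulls the mean far above the median, which is the same skew pattern visible in Table 9, and coverage rises as soon as `limit` exceeds that one outlier.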