Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data
Abstract.
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they often struggle with complex reasoning tasks and are prone to hallucination. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. KGs provide a structured representation of entities and their relationships, offering a rich source of information that can enhance the reasoning capabilities of LLMs. In this work, we develop different techniques that tightly integrate KG structures and semantics into LLM representations. Our results show that we are able to significantly improve the performance of LLMs in complex reasoning scenarios and to ground the reasoning process with KGs. We are the first to represent KGs with programming language code and to fine-tune pretrained LLMs with KGs. This integration facilitates more accurate and interpretable reasoning processes, paving the way for more advanced reasoning capabilities of LLMs.
1. Introduction
Large Language Models (LLMs) have achieved state-of-the-art performance in many Natural Language Processing (NLP) tasks ((Brown et al., 2020), (Chowdhery et al., 2022)). They have been successfully applied in a wide range of applications, from question answering to summarization and machine translation. However, due to limitations in the training data and training process, they suffer from hallucination (Ji et al., 2023), where generated text is nonfactual, nonsensical, or incoherent. This issue becomes particularly prevalent in tasks that require intricate or complex reasoning. To address hallucination, researchers have used methods such as prompting ((Wei et al., 2022), (Zhang et al., 2024b), (Wang et al., 2023), (Fu et al., 2023), (Press et al., 2023), (Yao et al., 2023b), (Yao et al., 2023a)), retrieval-augmented generation (RAG) (Lazaridou et al., 2022), and fine-tuning. These approaches often leverage external sources of information, which can come from the internet, third-party applications, databases, or knowledge graphs.
Knowledge Graphs (KGs) are structured representations of real world entities and the relationships among them, offering a rich source of factual information ((Lehmann et al., 2014) (Bollacker et al., 2008)). By grounding the reasoning processes of LLMs with KGs, we can enhance the factual accuracy of the generated text and reduce hallucinations. Researchers have explored various approaches to integrate KGs with LLMs, each with its own advantages and limitations. One approach involves using Graph Neural Networks (GNNs) to encode KGs into embeddings that capture the structural and semantic information of the graphs ((Choudhary et al., 2022), (Yasunaga et al., 2022)). These embeddings then serve as soft prompts to LLMs, guiding the generation process with knowledge from the KGs. However, this method requires very careful design and tuning, as it needs to align the representations learned by GNNs with the token-based processing of LLMs. Moreover, since such integration is specific per graph, it may require significant re-engineering for different tasks or graph types.
Another method uses semantic parsing to convert natural language queries into structured query languages like SPARQL ((Xu et al., 2023), (Brei et al., 2024), (Rangel et al., 2024), (Bustamante and Takeda, 2024)). In this approach, the LLM generates a SPARQL query based on the input prompt, which is then executed against the KG to retrieve relevant information. This method effectively uses the KG as an external knowledge base, and treats the KG and the LLM as separate components. As a result, the reasoning process is not fully integrated into the LLM, potentially limiting the model’s ability to perform complex reasoning during text generation.
Alternatively, researchers encode entities and relationships of KGs as natural language text ((Luo et al., 2024), (Dernbach et al., 2024)). They either incorporate them into the LLM’s input context or fine-tune the LLM with these text representations. This approach leverages the LLM’s natural language understanding ability to reason over the text-encoded knowledge. However, representing structured data as unstructured text poses challenges. Capturing the nuances of entities and relationships in natural language requires careful design to avoid ambiguities and ensure that the structural information is preserved, which can be non-trivial for complex graphs.
While there are different ways to leverage KGs to enhance LLMs’ performance, there are two aspects to using KGs for improving LLMs. One is grounding the LLMs with trustworthy information, and the other is providing them with examples of relations from which they can generalize. It is this second part that motivated our work. In this work, we propose an approach that represents knowledge graphs with programming language code. Programming languages are inherently structured and are designed to represent complex data and relationships efficiently. This allows for an accurate encoding of entity relationships that preserves the internal structure of the graph. More importantly, programming code is part of the pre-training data for many LLMs, meaning that the models are already equipped to parse and understand programming syntax and semantics. This reduces the need for additional specialized training to interpret the KG representations. By leveraging the structured representation of KGs in programming languages, LLMs can perform more sophisticated reasoning over the data. We investigate different methods of integrating entity relationships into LLMs. Our experimental results demonstrate that programming language (Python) representations of KGs outperform traditional natural language representations and structured JSON representations in complex reasoning tasks, leading to more accurate and reliable outputs.
The main contributions of this paper are:
• We introduce a novel representation of knowledge graphs with programming language code, which facilitates the seamless integration of structured knowledge into the language modeling process.
• By tightly integrating knowledge graphs into LLMs, our approach improves the reasoning accuracy of LLMs on complex tasks and effectively grounds the reasoning process, reducing the chance of hallucinations.
2. Related Work
There have been several attempts to apply LLMs to graph reasoning tasks. Wang et al. (Wang et al., 2024a), Guo et al. (Guo et al., 2023), and Ye et al. (Ye et al., 2024) employ the Graph2Text strategy of converting graph data into textual descriptions. However, these textual descriptions can result in very large contexts, and algorithms such as shortest path computations and bipartite graph matching require calculations across the entire context, making the task highly challenging. Chai et al. (Chai et al., 2023) have introduced GraphLLM, which combines three steps, namely node understanding, graph structure understanding, and graph-enhanced prefix tuning. Zhu et al. (Zhu et al., 2024) and Wang et al. (Wang et al., 2024b) proposed different methods for instruction fine-tuning LLMs to improve the performance of common graph tasks.
While the above works address general graph problems, other research has focused specifically on combining KGs with LLMs. One such approach is to use the LLM as an encoder to transform text-based entity nodes and relations, and then fuse the LLM and GNN-derived representations. Applications of this approach range from product recommendation (Choudhary et al. (Choudhary et al., 2022)) to biomedical question-answering (Yasunaga et al. (Yasunaga et al., 2022)). Luo & Pan (Luo et al., 2024) have proposed a reasoning on Graph (RoG) method that comprises two modules: a planning module and a retrieval-reasoning module. The planning module mines the KG and generates faithful relation paths for answering the question. The retrieval-reasoning module combines a breadth-first search and a probabilistic optimization over all paths. Dernbach et al. (Dernbach et al., 2024) developed a neighborhood partitioning and encoding scheme to accommodate real-world graph properties. Their encoding scheme transforms graph relations into alternate text representations, which in turn are used to fine-tune the LLM.
Edge et al. (Edge et al., 2024) have built an end-to-end system, called GraphRAG, that starts with a set of source documents and iteratively applies an LLM to extract the entities in each document. Next, entities are connected via the relationships extracted, and a knowledge graph is created. The knowledge graph is split into communities, and each community is summarized. These summaries are subsequently used by the LLM via RAG to help answer questions submitted to the system.
Nie et al. (Nie et al., 2024) provide a code-style in-context learning (ICL) method for knowledge base question answering (KBQA). They design seven meta-functions written in Python that cover the atomic operations used for querying databases. By using few-shot learning, they improve LLMs’ ability to query knowledge bases effectively.
There is also ongoing research work ((Ma et al., 2023), (Zhang et al., 2024a), (Aryabumi et al., 2024)) that studies the impact of mixing programming code in pre-training or instruction fine-tuning datasets on the performance of LLMs. Even though the programming code used by the research is generic, the results show promising improvements on various tasks, including reasoning tasks. There is another line of work ((Kim et al., 2023), (Zhang et al., 2024b)) that studies how to improve LLM Chain of Thought (CoT) reasoning ability through fine-tuning. However, creating good datasets for fine-tuning CoT is labor intensive, and the reasoning steps are described in natural language text, which can introduce ambiguity.
Yang et al. (Yang et al., 2024) studied how LLMs do multi-hop reasoning and found that LLMs perform latent multi-hop reasoning in certain relation types, but often struggle with later hops. Biran et al. (Biran et al., 2024) observed that later hops are resolved in the model’s deeper layers, and thus the LLM may no longer encode the necessary knowledge for answering the question. They propose a back-patching approach, essentially feeding the hidden representations from later layers back into earlier ones.
In this work, we continue the prior research on multi-hop queries, showing that we can significantly improve the performance of LLMs by integrating KG structures and semantics into LLM representations. Similarly to Nie et al. (2024), we also experiment with code-style representations. However, in our case, we represent the entity relations, not the atomic operations, in Python, and our goal is to improve the LLMs' ability to answer questions directly, not their ability to query the knowledge graph.
3. Methodology
Our work focuses on studying entity relationship representations of KGs for grounded LLM reasoning.
3.1. Knowledge Graph Definition
Let $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$ denote a knowledge graph, where $\mathcal{E}$ is the set of entities, $\mathcal{R}$ is the set of relationships, and $\mathcal{T}$ is the set of triplets that form the edges of the knowledge graph. A triplet $(e_1, r, e_2) \in \mathcal{T}$ if there is a directed edge between entity $e_1 \in \mathcal{E}$ and entity $e_2 \in \mathcal{E}$ through relationship $r \in \mathcal{R}$. A triplet also corresponds to a complete one-hop reasoning process. A two-hop compositional reasoning process can be represented as $((e_1, r_1, e_2), (e_2, r_2, e_3))$, where $(e_1, r_1, e_2) \in \mathcal{T}$ and $(e_2, r_2, e_3) \in \mathcal{T}$. Since $e_2$ exists in both triplets, it is the bridge entity. Similarly, a three-hop compositional reasoning process can be represented as $((e_1, r_1, e_2), (e_2, r_2, e_3), (e_3, r_3, e_4))$, where $(e_1, r_1, e_2)$, $(e_2, r_2, e_3)$, and $(e_3, r_3, e_4)$ are all in $\mathcal{T}$.
3.2. Knowledge Graph Representation for LLM Reasoning
To improve LLM multi-hop reasoning with knowledge graphs, we represent knowledge graphs in ways that are more compatible with LLM prompting and fine-tuning. When given a complex reasoning prompt, LLMs can detect entities and relationships and then implicitly infer the key entities and facts by following logical reasoning steps grounded by knowledge graphs. For instance, given the prompt "Who is the spouse of the composer of 'It Goes Like It Goes'?", LLMs can follow the reasoning steps "The composer of 'It Goes Like It Goes' is David Shire" and "The spouse of David Shire is Didi Conn", and infer that the correct answer is "Didi Conn". Different representations of knowledge graphs affect how effectively LLMs can perform such logical reasoning.
The most natural way is to use natural language to describe the triplets in knowledge graphs. Figure 4 shows the natural language representation for two-hop reasoning over the triplets $((e_1, r_1, e_2), (e_2, r_2, e_3))$, where $e_2$ is the bridge entity. For instance, the triplets (('It Goes Like It Goes', 'composer', 'David Shire'), ('David Shire', 'spouse', 'Didi Conn')) can be represented as ("The composer of 'It Goes Like It Goes' is David Shire", "The spouse of David Shire is Didi Conn"), and the two-hop reasoning can be represented as "The spouse of the composer of 'It Goes Like It Goes' is Didi Conn". As LLMs understand natural language very well, the natural language representation is the most straightforward way for either prompting or fine-tuning LLMs with knowledge graphs.
JSON (JavaScript Object Notation) is a lightweight data-interchange format (Bray, 2014). It is designed to store data in universal data structures, such as dictionaries and lists, that are supported by most programming languages. JSON is a pure data format and can be used to store structured data from knowledge graphs. Figure 4 shows the JSON representation of the knowledge graph triplets $(e_1, r_1, e_2)$ and $(e_2, r_2, e_3)$: the entities $e_1$ and $e_2$ are the keys, and the relationship/entity pairs $(r_1, e_2)$ and $(r_2, e_3)$ are the values. However, since JSON is designed to store data only, it is difficult to represent a multi-hop inference process in the JSON format. Alternatively, a comment or description field (with "comment" as the key and a natural language description of the multi-hop reasoning as the value) can be added to the JSON representation, but this is not a recommended practice in general.
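For concreteness, a minimal sketch of such a JSON-style encoding of the running example is shown below; the exact field layout is an illustrative assumption rather than the schema used in the figure.

```python
import json

# Illustrative JSON-style encoding of the two example triplets:
# ('It Goes Like It Goes', 'composer', 'David Shire') and ('David Shire', 'spouse', 'Didi Conn').
# Entities serve as keys; (relationship, entity) pairs form the values.
kg_json = {
    "It Goes Like It Goes": {"composer": "David Shire"},
    "David Shire": {"spouse": "Didi Conn"},
}
print(json.dumps(kg_json, indent=2))
```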
Programming language code is another major data source for LLM pre-training and fine-tuning. Knowledge graphs can be represented using various data structures supported by major programming languages such as Python. The triplets can be represented either as a static dictionary or added dynamically to predefined data structures as part of the code. Figure 4 shows the Python representation of knowledge graph triplets with a static dictionary data structure, together with the two-hop inference process based on the stored triplets: the relationships and entities are stored in the dictionary "relationships", with the relationships $r_1$ and $r_2$ as the keys and the entity pairs as the values, and the inference process is simply the retrieval of values by key. Figure 4 also shows the Python representation of knowledge graph triplets with a predefined Python class "KnowledgeBase" and an iterative multi-hop inference function "infer" that supports an arbitrary number of hops. The main data structure of "KnowledgeBase" is a dictionary "self.facts"; entities and relationships are added to the dictionary with the function "add_fact", and the "infer" function accepts any number of relationships through the parameter "*relations" and performs the corresponding multi-hop reasoning. We designed the dynamic, self-defined data structure based Python representation because it can easily be generalized to support multi-hop reasoning over subgraphs of KGs.
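The following is a minimal sketch consistent with the description above. The names "relationships", "KnowledgeBase", "add_fact", "infer", and "*relations" follow the text; other details, such as passing the start entity to "infer", are illustrative assumptions rather than the exact code of the figure.

```python
# Static dictionary representation: relationships r1, r2 are keys,
# (head entity, tail entity) pairs are values.
relationships = {
    "composer": ("It Goes Like It Goes", "David Shire"),
    "spouse": ("David Shire", "Didi Conn"),
}

# Two-hop inference is simple key-based retrieval.
bridge_entity = relationships["composer"][1]   # 'David Shire'
answer = relationships["spouse"][1]            # 'Didi Conn'


# Dynamic representation: facts are added to a KnowledgeBase, and an
# iterative `infer` function supports an arbitrary number of hops.
class KnowledgeBase:
    def __init__(self):
        self.facts = {}  # maps (entity, relation) -> entity

    def add_fact(self, head, relation, tail):
        self.facts[(head, relation)] = tail

    def infer(self, start_entity, *relations):
        """Follow the given relations hop by hop, starting from start_entity."""
        entity = start_entity
        for relation in relations:
            entity = self.facts.get((entity, relation))
            if entity is None:
                return None  # the reasoning chain cannot be completed
        return entity


kb = KnowledgeBase()
kb.add_fact("It Goes Like It Goes", "composer", "David Shire")
kb.add_fact("David Shire", "spouse", "Didi Conn")
print(kb.infer("It Goes Like It Goes", "composer", "spouse"))  # Didi Conn
```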
The example mentioned in the previous paragraphs can be easily represented as dictionaries in Python. Figure 5 shows four different representations for the same example. Using programming languages, knowledge graphs can be represented in a more controlled and unambiguous way; corner cases can be easily checked. However, the representation for the same triplet can be more verbose.
4. Experiments
We designed experiments to study how different representations of entity relationships in KGs affect the reasoning performance of LLMs across two different datasets.
4.1. Datasets
4.1.1. Dataset 1
For the first dataset, we use the same dataset used in the "Hopping Too Late" paper (Biran et al., 2024). The dataset includes two-hop relationships extracted from the publicly available knowledge base Wikidata (Vrandečić and Krötzsch, 2014). For the experiments, we split the dataset into eight equal-sized partitions based on the bridge entity $e_2$ in a round-robin fashion, so that any unique $e_2$ exists in only one partition (a minimal sketch of this partitioning scheme is given after Table 1). We partition the dataset in this way so that the LLMs only learn the relationships and the logical reasoning process rather than memorizing the entities. To avoid overrepresentation of the most popular $e_2$ in the training or testing dataset, we choose partition 2 as our training dataset and partition 4 as the testing dataset. The details of the dataset are listed in Table 1. For this dataset, there is no overlap of bridge entities between the training and testing datasets, and the overlap of relationship pairs is reported in Table 1.
Table 1. Train and test details for Dataset 1.

 | Train | Test | Intersection
---|---|---|---
Number of Hops | 2 | 2 | 2
Dataset Size | 10,262 | 10,255 | 0
Bridge Entities ($e_2$) | 324 | 880 | 0
Relations ($r_1$, $r_2$) | 191 | 207 | 167
No. of rows with ($r_1$, $r_2$) | 10,262 | 10,255 | 10,130
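A minimal sketch of the bridge-entity-based round-robin split described above is shown below; it assumes each row stores its bridge entity under an "e2" key, which is an illustrative layout rather than the exact preprocessing code.

```python
from collections import defaultdict

def split_by_bridge_entity(rows, num_partitions=8):
    """Round-robin split so that each unique bridge entity e2 lands in exactly
    one partition. `rows` is assumed to be a list of dicts with an 'e2' key."""
    partitions = [[] for _ in range(num_partitions)]
    rows_by_bridge = defaultdict(list)
    for row in rows:
        rows_by_bridge[row["e2"]].append(row)
    # Assign all rows sharing a bridge entity to the same partition, round-robin.
    for i, (bridge, grouped_rows) in enumerate(rows_by_bridge.items()):
        partitions[i % num_partitions].extend(grouped_rows)
    return partitions
```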
4.1.2. Dataset 2
For the second dataset, we use the dataset created by (Ho et al., 2020). This dataset also includes two-hop relationships extracted from the publicly available knowledge base Wikidata (Vrandečić and Krötzsch, 2014). Although both Dataset 1 and Dataset 2 are derived from the same knowledge base, the extracted entities and relationships for Dataset 2 are different from those in Dataset 1. For the training dataset, we select only compositional relationships and limit the number of instances per relationship pair to no more than 500 to avoid overrepresentation of particular relationships in the training data. We use the development dataset with compositional relationships for testing purposes. The details of the dataset are listed in Table 2. We choose compositional relationships so that the types of relationships are consistent with those in Dataset 1 and it is easier to compare reasoning performance across datasets. We did not further restrict the entities and relationships based on the overlap between the training and testing datasets, in order to respect the original design of the dataset (Ho et al., 2020).
Table 2. Train and test details for Dataset 2.

 | Train | Test | Intersection
---|---|---|---
Number of Hops | 2 | 2 | 2
Dataset Size | 10,733 | 5,236 | 0
Bridge Entities ($e_2$) | 4,170 | 3,650 | 782
Relations ($r_1$, $r_2$) | 80 | 110 | 66
No. of rows with ($r_1$, $r_2$) | 10,733 | 5,236 | 1,907
4.1.3. Dataset 3
This dataset is an extension of Dataset 1. We extended the two-hop relationships by adding a third hop $(e_3, r_3, e_4)$, resulting in three-hop relationships $((e_1, r_1, e_2), (e_2, r_2, e_3), (e_3, r_3, e_4))$, while keeping the entities as a subset of the entities in Dataset 1. This dataset was created to test whether models fine-tuned for two-hop reasoning can generalize to improve three-hop reasoning performance as well. The details of the dataset are listed in Table 3. The overlap of bridge entities between the training and testing datasets is minimal for this dataset, while there is a high percentage of relationship overlap between the training and testing datasets.
Table 3. Details of the three-hop test set (Dataset 3) and its intersection with the Dataset 1 training set.

 | Dataset 1 Train | Test | Intersection with Dataset 1 Train
---|---|---|---
Number of Hops | 2 | 3 | 0
Dataset Size | 10,262 | 1,007 | 0
Bridge Entities ($e_2$) | 324 | 261 | 28
Train vs Test | 324 | 233 | 15
Relations ($r_1$, $r_2$) | 191 | 81 | 70
No. of rows with ($r_1$, $r_2$) | 10,262 | 1,007 | 973
Relations ($r_2$, $r_3$) | 191* | 97 | 92

Note *: this number is for ($r_1$, $r_2$) in Dataset 1 Train.
The final training data for fine-tuning the LLMs include both one-hop prompts and responses based on the first hop from the datasets, and two-hop prompts and responses based on two-hop information from the datasets.
4.2. Large Language Models
To evaluate the performance of LLM reasoning, we chose the latest open-source models released by Meta: Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct (AI, 2024). The Llama-3.1 model family demonstrates stronger reasoning abilities compared to other open-source LLMs. For our experiments, we fine-tuned the smaller model (Llama-3.1-8B-Instruct) with the different entity relationship representations of multi-hop reasoning. During fine-tuning, we optimize the language modeling objective, which predicts the probability distribution of each token conditioned on its preceding tokens. We adopt LoRA (Hu et al., 2022) to perform parameter-efficient fine-tuning. During inference, we use greedy decoding for reproducibility.
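As a rough illustration of this setup (not the exact training script), the sketch below wraps Llama-3.1-8B-Instruct with LoRA adapters via Hugging Face peft and uses greedy decoding at inference time; the LoRA rank, target modules, and example prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # gated checkpoint; access assumed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Wrap the base model with LoRA adapters for parameter-efficient fine-tuning.
# Rank, alpha, and target modules are illustrative choices, not the paper's exact settings.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training would then optimize the usual causal language-modeling objective over
# prompts/responses built from the KG representations (e.g., with transformers' Trainer).

# Greedy decoding at inference time, for reproducibility.
prompt = "Given the incomplete statement: spouse of composer of 'It Goes Like It Goes' is _"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```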
4.3. Experiment Environment
We use the large language model checkpoints from Hugging Face Transformers (https://huggingface.co/docs/transformers/index) in all the experiments. Experiments were conducted on a single Nvidia A100 GPU with 40 GB of memory. When fine-tuning the LLMs, we ran the training process for one epoch in all experiments. We did not fine-tune the larger model (Llama-3.1-70B-Instruct) because of computational resource constraints.
4.4. Experiment Design
To test how well each representation of KGs helps with LLM reasoning, we designed the following three types of experiments.
4.4.1. LLM Reasoning without Context
To test the multi-hop reasoning ability of LLMs, we compared model performance under three conditions:
(1) Zero-shot prompting of the LLMs to predict $e_3$: given $e_1$, $r_1$, and $r_2$, we prompt the LLMs to predict $e_3$. This serves as our baseline experiment.
(2) One-shot prompting of the LLMs with a reasoning example. We conducted three experiments, one for each representation format of KGs.
(3) Zero-shot prompting of fine-tuned LLMs. We conducted three experiments, one for each representation format of KGs.
4.4.2. LLM Reasoning with Context
Since Retrieval Augmented Generation (RAG) is one of the most popular applications of LLMs, we designed experiments to compare the performance of LLMs when given input context. This set of experiments is designed to test how well LLMs perform in potential RAG applications.
4.4.3. Model Generalization to Longer Reasoning Path
We fine-tuned the LLMs using only one-hop and two-hop triplets. For this set of experiments, we tested how well the fine-tuned models can generalize to reasoning over longer paths.
4.5. Prompt Design
The design of the prompts for all the experiments is presented in Table 4. Examples are shown for two-hop reasoning prompts. Due to the paper length limit, the natural language explanations, JSON data structures, and Python code snippets for one-shot prompting are shown in Appendix A. Three-hop reasoning prompts follow the same logic and use the same prompt format.
Table 4. Prompt design for the experiments (two-hop examples).

 | Dataset 1 | Dataset 2
---|---|---
Multi-hop Question | $r_2$ of $r_1$ of $e_1$ is $e_3$ | What is $r_2$ of $r_1$ of $e_1$?
Zero Shot Prompt | Given the incomplete statement: $r_2$ of $r_1$ of $e_1$ is _ , provide answer and generate explanation for completing the statement | Given the question: What is $r_2$ of $r_1$ of $e_1$? generate explanation and provide answer to the question
One Shot Prompt for text representation | Given the incomplete statement: $r_2$ of $r_1$ of $e_1$ is _ , {"Answer":"$e_3$", "Explanation":"$r_1$ of $e_1$ is $e_2$. $r_2$ of $e_2$ is $e_3$. $r_2$ of $r_1$ of $e_1$ is $e_3$"} Given the incomplete statement: {statement} _ , provide answer and generate explanation for completing the statement | Given the question: What is $r_2$ of $r_1$ of $e_1$? {"Answer":"$e_3$", "Explanation":"$r_1$ of $e_1$ is $e_2$. $r_2$ of $e_2$ is $e_3$. $r_2$ of $r_1$ of $e_1$ is $e_3$"} Given the question: {question}? generate explanation and provide answer to the question
One Shot Prompt for JSON representation | Given the incomplete statement: $r_2$ of $r_1$ of $e_1$ is _ , {"Answer":"$e_3$", "JSON structure":"{JSON data}"} Given the incomplete statement: {statement} _ , provide answer and generate JSON structure for completing the statement | Given the question: What is $r_2$ of $r_1$ of $e_1$? {"Answer":"$e_3$", "JSON structure":"{JSON data}"} Given the question: {question}? generate JSON structure and provide answer to the question
One Shot Prompt for Python representation | Given the incomplete statement: $r_2$ of $r_1$ of $e_1$ is _ , {"Answer":"$e_3$", "Python code snippet":"{Python code}"} Given the incomplete statement: {statement} _ , provide answer and generate python code for completing the statement | Given the question: What is $r_2$ of $r_1$ of $e_1$? {"Answer":"$e_3$", "Python code snippet":"{Python code}"} Given the question: {question}? generate python code and provide answer to the question
Prompt with Context | Given context: {context} and the uncompleted statement: {statement} _ , provide answer and generate explanation for completing the statement | Given context: {context} and the question: {question} generate explanation and provide answer to the question
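For illustration only, a one-shot prompt for the Python representation could be assembled roughly as follows; the exemplar snippet is simplified, and the exact strings are those given in Table 4 and Appendix A.

```python
import json

# Illustrative assembly of a one-shot prompt for the Python-representation setting.
exemplar = (
    "Given the incomplete statement: spouse of composer of 'It Goes Like It Goes' is _ , "
    + json.dumps({
        "Answer": "Didi Conn",
        "Python code snippet": "relationships = {'composer': ('It Goes Like It Goes', 'David Shire'), "
                               "'spouse': ('David Shire', 'Didi Conn')}",
    })
)

def build_one_shot_prompt(statement: str) -> str:
    # The new (incomplete) statement is appended after the worked exemplar.
    return (f"{exemplar}\n"
            f"Given the incomplete statement: {statement} _ , "
            "provide answer and generate python code for completing the statement")
```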
4.6. Metrics
The main metric for measuring the performance of LLM reasoning is the accuracy of the multi-hop reasoning conditioned on the correctness of each individual hop, denoted as $P(C \mid C_1, \ldots, C_n)$, where $C$ is the correctness of the multi-hop reasoning and $C_i$ is the correctness of the $i$-th hop. We use this metric instead of the overall accuracy of the results because the final reasoning output is affected by multiple factors, including whether the LLM has knowledge of each individual entity and whether it can infer each hop correctly. Our main metric measures the latent multi-hop reasoning performance. We also provide the overall accuracy of the results as a reference for potential applications.
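A minimal sketch of how this conditional accuracy can be computed, assuming each evaluation record stores per-hop and final correctness flags (the field names are illustrative):

```python
def conditional_multihop_accuracy(results):
    """Accuracy of the final multi-hop answer, conditioned on every individual
    hop being answered correctly. Each result is assumed to be a dict with a
    boolean list 'hops_correct' (one entry per hop) and a boolean 'final_correct'."""
    conditioned = [r for r in results if all(r["hops_correct"])]
    if not conditioned:
        return 0.0
    return sum(r["final_correct"] for r in conditioned) / len(conditioned)
```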
4.7. Experimental Results
We compare the different approaches of grounding LLM reasoning using different knowledge graph representations by answering the following research questions.
4.7.1. RQ1: Do different representations of entity relationships affect LLM multi-hop reasoning? Can we fine-tune the LLM with proper entity relationship representations to improve its reasoning capabilities?
Since LLMs have already demonstrated abilities in latent multi-hop reasoning ((Yang et al., 2024)), we designed experiments to compare how different approaches and different representations of multi-hop entity relationships can further improve LLMs’ multi-hop reasoning ability.
Impact of Entity Relationship Representations for LLM Prompting. Table 5 and Table 6 present the performance of Llama-3.1-8B-Instruct with different representations across the two datasets. The one-shot prompting results show that Python representations of entity relationships outperform both the plain natural language text representation and the JSON representation, and that both the Python and natural language representations for one-shot prompting perform better than zero-shot prompting. The two-hop reasoning performance of one-shot prompting with the dynamic Python representation is approximately 78% higher than that of zero-shot prompting on Dataset 1 (Table 1). Similarly, the performance of one-shot prompting with the static Python representation is about 60% higher than that of zero-shot prompting on Dataset 2 (Table 2). However, the performance of one-shot prompting with the JSON representation is worse than zero-shot prompting. We hypothesize that the structured JSON representation is not native to the LLM reasoning process.
Impact of Entity Relationship Representations for LLM Fine-Tuning. Table 5 and Table 6 also show the performance of the fine-tuned Llama-3.1-8B-Instruct model with different representations on the two datasets. In this case, the LLMs fine-tuned with Python representations perform better than the LLMs fine-tuned with either the JSON representation or the plain natural language text representation. The performance of the LLM fine-tuned with the JSON representation is only slightly worse than that of the LLM fine-tuned with the natural language representation. For comparison, we also provide the performance numbers for zero-shot prompting of Llama-3.1-70B-Instruct. All LLMs fine-tuned with any of the entity relationship representations outperform the larger model in latent multi-hop reasoning.
Table 5. Two-hop reasoning results on Dataset 1.

Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | 1st & 2nd hop correct, final incorrect | 1st & 2nd hop correct, final correct | 1st & 2nd hop correct, final accuracy
---|---|---|---|---|---|---|---
Llama-3.1-8B | NA | NA | Zero Shot | 16.3% | 989 | 604 | 38.2%
Llama-3.1-70B | NA | NA | Zero Shot | 33.9% | 615 | 1,979 | 76.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 16.9% | 965 | 776 | 44.6%
Llama-3.1-8B | NA | JSON | One Shot | 7.78% | 995 | 352 | 26.1%
Llama-3.1-8B | NA | Python Static | One Shot | 17.5% | 408 | 609 | 59.9%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 19.1% | 341 | 772 | 67.9%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 25.6% | 145 | 1,005 | 87.4%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 20.8% | 82 | 569 | 87.4%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 26.2% | 102 | 1,093 | 91.5%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 26.5% | 119 | 882 | 88.1%
Table 6. Two-hop reasoning results on Dataset 2.

Model | KG Representation for fine-tuning | KG Representation for prompt | Prompt | Accuracy | 1st & 2nd hop correct, final incorrect | 1st & 2nd hop correct, final correct | 1st & 2nd hop correct, final accuracy
---|---|---|---|---|---|---|---
Llama-3.1-8B | NA | NA | Zero Shot | 1.85% | 51 | 14 | 21.5%
Llama-3.1-70B | NA | NA | Zero Shot | 10.7% | 164 | 195 | 54.3%
Llama-3.1-8B | NA | Natural Language | One Shot | 3.99% | 82 | 25 | 23.4%
Llama-3.1-8B | NA | JSON | One Shot | 1.80% | 19 | 4 | 17.4%
Llama-3.1-8B | NA | Python Static | One Shot | 4.98% | 50 | 29 | 36.7%
Llama-3.1-8B | NA | Python Dynamic | One Shot | 4.12% | 35 | 12 | 25.5%
Llama-3.1-8B tuned | Natural Language | NA | Zero Shot | 11.0% | 97 | 164 | 62.8%
Llama-3.1-8B tuned | JSON | NA | Zero Shot | 10.0% | 61 | 157 | 72.0%
Llama-3.1-8B tuned | Python Static | NA | Zero Shot | 12.1% | 44 | 191 | 81.3%
Llama-3.1-8B tuned | Python Dynamic | NA | Zero Shot | 12.3% | 38 | 203 | 84.2%
4.7.2. RQ2: Can fine-tuned LLMs generalize their reasoning ability to more hops than are covered in the training data?
Since we fine-tuned the LLMs using only one-hop and two-hop reasoning data, it is important to study whether such fine-tuning improves the LLMs' reasoning performance on longer reasoning paths. We created a three-hop dataset (shown in Table 3) to measure the performance of the fine-tuned LLMs. As shown in Table 7, the three-hop reasoning performance of all fine-tuned LLMs improves across all entity relationship representations compared with the baseline LLM without fine-tuning. Furthermore, LLMs fine-tuned with the Python representations outperform those fine-tuned with either the plain natural language representation or the JSON representation. As the relationship pairs $(r_2, r_3)$ have significant overlap with the $(r_1, r_2)$ pairs in the training data (as shown in Table 3), the relative performance of the fine-tuned models is consistent with that shown in Table 5.
Table 7. Three-hop reasoning results.

Model | KG Representation for tuning | Accuracy | % correct conditioned on 1st & 2nd hop correct | % correct conditioned on 2nd & 3rd hop correct | % correct conditioned on all three hops correct
---|---|---|---|---|---
Llama-3.1-8B | NA | 20.4% | 35.2% | 37.1% | 50.0%
Llama-3.1-70B | NA | 42.6% | 64.6% | 61.8% | 80.4%
Llama-3.1-8B tuned | Natural Language | 31.0% | 44.0% | 54.1% | 65.0%
Llama-3.1-8B tuned | JSON | 23.3% | 42.5% | 46.1% | 63.9%
Llama-3.1-8B tuned | Python Static | 30.9% | 47.7% | 59.3% | 70.6%
Llama-3.1-8B tuned | Python Dynamic | 38.0% | 51.6% | 57.4% | 67.0%
4.7.3. RQ3: How much can LLM in-context learning help or benefit from multi-hop reasoning?
Retrieval Augmented Generation (RAG) is one of the main applications of LLMs. However, even when supplied with correctly retrieved information, LLMs do not always generate the correct answer, particularly when multi-hop reasoning is required. To address this issue, we designed experiments to study the performance of fine-tuned LLMs when given an input context. In these experiments, we performed one-shot prompting of the LLMs, using the prompts listed in Table 4. The results are shown in Table 8. For Dataset 1, since the dataset provides no context, we use the first-hop and second-hop inferences as context and measure whether the model can infer the correct answers. For Dataset 2, we use the context given with the questions. Again, the LLMs fine-tuned with the different entity relationship representations outperform the baseline model without fine-tuning. Notably, the LLM fine-tuned with the dynamic Python representation greatly outperforms the baseline model; it even surpasses the much larger baseline model (Llama-3.1-70B-Instruct).
Table 8. Reasoning accuracy with context.

Model | KG Representation for tuning | Dataset 1 Accuracy (given 1st & 2nd hop as context) | Dataset 2 Accuracy (given context)
---|---|---|---
Llama-3.1-8B | NA | 88.6% | 10.9%
Llama-3.1-70B | NA | 96.1% | 41.9%
Llama-3.1-8B tuned | Natural Language | 87.9% | 44.7%
Llama-3.1-8B tuned | JSON | 92.3% | 46.6%
Llama-3.1-8B tuned | Python Static | 87.2% | 52.8%
Llama-3.1-8B tuned | Python Dynamic | 96.4% | 59.2%
4.8. Discussion
We designed a series of experiments to study how to seamlessly integrate entity relationships into LLMs to improve their multi-hop reasoning ability, and to study the impact of different entity relationship representations on the performance of LLMs. Our experimental results show that it is possible to integrate entity relationships into LLMs and ground the LLM multi-hop reasoning process with knowledge graphs. While all forms of the proposed entity relationship representations help improve LLM reasoning performance, they affect LLM performance differently. The natural language representation is straightforward and is the major form of pre-training data for LLMs. The JSON representation is best suited for storing structured data; however, it can be difficult for LLMs to directly integrate pure structured data. The Python representations store both the structured data and the inference process, providing a more controlled and unambiguous way of guiding LLMs through the reasoning process. This helps LLMs achieve better reasoning performance in all the cases we studied. In some cases, the fine-tuned small LLMs perform better than much larger LLMs, even though the deep neural networks of the larger LLMs can help the model make connections among multiple hops and infer the correct answers. Because we guide and fine-tune the models with an emphasis on multi-hop relationships rather than the facts of individual entities, the fine-tuned models can easily generalize to multi-hop reasoning over completely different entities, even when the number of hops is greater than in the training data. Our proposed fine-tuning approaches also help improve the results of in-context learning.
As synthetic data becomes increasingly important for LLM pre-training and fine-tuning, generating proper representations of structured data (especially knowledge graphs) and incorporating this type of data in LLM pre-training and fine-tuning can greatly improve LLM reasoning abilities and reduce hallucinations. As demonstrated by our experimental results, the representation of entity relationships is very important for LLM performance. Programming languages provide the flexibility to represent various entity relationships with native data structures. They can guide LLM reasoning in a more controlled and principled way and ground LLM inference with a knowledge base.
In our experiments, we only studied the simplest form of reasoning: two-hop and three-hop reasoning with compositional relationships. The reasoning process for real-life applications can be much more sophisticated. As shown in Figure 4, the Python representation of a knowledge graph with dynamic relationships provides the flexibility to define any relationships, even extending to a subgraph. Entities and relationships can be defined as classes themselves, allowing us to add attributes to the entities and relationships. Correspondingly, the “infer” function can be redefined to include code that checks these attributes. In future work, we will study the graph representation of entity relationships and its impact on more complex reasoning cases.
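As a purely illustrative sketch of this direction (not part of our experiments), entities and relationships could themselves be classes carrying attributes, with an "infer" function that additionally checks those attributes; all names and fields below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Relationship:
    name: str
    head: Entity
    tail: Entity
    attributes: dict = field(default_factory=dict)

class AttributedKnowledgeBase:
    def __init__(self):
        self.facts = {}  # (entity name, relation name) -> Relationship

    def add_fact(self, relationship: Relationship):
        self.facts[(relationship.head.name, relationship.name)] = relationship

    def infer(self, start: str, *relations, required_attr=None):
        """Multi-hop inference that can additionally check edge attributes."""
        entity = start
        for relation in relations:
            fact = self.facts.get((entity, relation))
            if fact is None:
                return None
            if required_attr and required_attr not in fact.attributes:
                return None  # reject hops lacking the required attribute
            entity = fact.tail.name
        return entity
```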
5. Concluding Remarks
We proposed different representations of entity relationships in knowledge graphs to improve the multi-hop reasoning capabilities of LLMs. We conducted a series of experiments to study how different representations of entity relationships affect LLM reasoning ability. We showed that introducing programming language representations of the entity relationships helps improve LLM multi-hop reasoning ability and reduce hallucination.
The programming language representation of the entity relationships provides a controlled and unambiguous way for LLM multi-hop reasoning. By leveraging the native data structures inherent in programming languages, we can effectively model complex entity relationships, while iterative inference functions guide the logical reasoning process. This approach not only enhances reasoning accuracy but also facilitates generalization to more sophisticated reasoning use cases. However, accurately measuring the performance of LLM reasoning beyond two-hop and three-hop compositional relationships can be challenging, as the reasoning process becomes increasingly complex. As part of the future work, we would like to study the programming language representations of more sophisticated relationships in order to solve more complex reasoning tasks. We will experiment at both the pre-training and fine-tuning stages of LLMs to evaluate the performance impact. We hope that our work will inspire other researchers to further push the frontier of LLM research.
References
- AI (2024) Meta AI. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
- Aryabumi et al. (2024) Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. To Code, or Not To Code? Exploring Impact of Code in Pre-training. arXiv:2408.10914 [cs.CL] https://arxiv.org/abs/2408.10914
- Biran et al. (2024) Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. 2024. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. arXiv:2406.12775 [cs.CL] https://arxiv.org/abs/2406.12775
- Bollacker et al. (2008) Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference. https://api.semanticscholar.org/CorpusID:207167677
- Bray (2014) Tim Bray. 2014. The javascript object notation (json) data interchange format.
- Brei et al. (2024) Felix Brei, Johannes Frey, and Lars-Peter Meyer. 2024. Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance. arXiv:2405.17076 [cs.AI] https://arxiv.org/abs/2405.17076
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL] https://arxiv.org/abs/2005.14165
- Bustamante and Takeda (2024) Diego Bustamante and Hideaki Takeda. 2024. SPARQL Generation with Entity Pre-trained GPT for KG Question Answering. arXiv:2402.00969 [cs.CL] https://arxiv.org/abs/2402.00969
- Chai et al. (2023) Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, and Yang Yang. 2023. GraphLLM: Boosting Graph Reasoning Ability of Large Language Model. arXiv:2310.05845 [cs.CL] https://arxiv.org/abs/2310.05845
- Choudhary et al. (2022) Nurendra Choudhary, Nikhil Rao, Karthik Subbian, and Chandan Reddy. 2022. Graph-based multilingual language model: Leveraging product relations for search relevance. In KDD 2022. https://www.amazon.science/publications/graph-based-multilingual-language-model-leveraging-product-relations-for-search-relevance
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL] https://arxiv.org/abs/2204.02311
- Dernbach et al. (2024) Stefan Dernbach, Khushbu Agarwal, Alejandro Zuniga, Michael Henry, and Sutanay Choudhury. 2024. GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding. arXiv:2402.06764 [cs.AI] https://arxiv.org/abs/2402.06764
- Edge et al. (2024) Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL] https://arxiv.org/abs/2404.16130
- Fu et al. (2023) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=yf1icZHC-l9
- Guo et al. (2023) Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking. arXiv:2305.15066 [cs.AI] https://arxiv.org/abs/2305.15066
- Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online), 6609–6625. https://www.aclweb.org/anthology/2020.coling-main.580
- Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
- Kim et al. (2023) Seungone Kim, Se June Joo, Doyoung Kim, Joel Jang, Seonghyeon Ye, Jamin Shin, and Minjoon Seo. 2023. The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning. arXiv:2305.14045 [cs.CL] https://arxiv.org/abs/2305.14045
- Lazaridou et al. (2022) Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv:2203.05115 [cs.CL] https://arxiv.org/abs/2203.05115
- Lehmann et al. (2014) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, and Christian Bizer. 2014. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal 6 (01 2014). https://doi.org/10.3233/SW-140134
- Luo et al. (2024) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. arXiv:2310.01061 [cs.CL] https://arxiv.org/abs/2310.01061
- Ma et al. (2023) Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. 2023. At Which Training Stage Does Code Data Help LLMs Reasoning? arXiv:2309.16298 [cs.CL] https://arxiv.org/abs/2309.16298
- Nie et al. (2024) Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-Style In-Context Learning for Knowledge-Based Question Answering. arXiv:2309.04695 [cs.CL] https://arxiv.org/abs/2309.04695
- Press et al. (2023) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. 2023. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5687–5711. https://doi.org/10.18653/v1/2023.findings-emnlp.378
- Rangel et al. (2024) Julio C. Rangel, Tarcisio Mendes de Farias, Ana Claudia Sima, and Norio Kobayashi. 2024. SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph. arXiv:2402.04627 [cs.AI] https://arxiv.org/abs/2402.04627
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85. https://doi.org/10.1145/2629489
- Wang et al. (2024a) Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024a. Can Language Models Solve Graph Problems in Natural Language? arXiv:2305.10037 [cs.CL] https://arxiv.org/abs/2305.10037
- Wang et al. (2024b) Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley. 2024b. InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment. arXiv:2402.08785 [cs.CL] https://arxiv.org/abs/2402.08785
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
- Xu et al. (2023) Silei Xu, Shicheng Liu, Theo Culhane, Elizaveta Pertseva, Meng-Hsi Wu, Sina Semnani, and Monica Lam. 2023. Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5778–5791. https://doi.org/10.18653/v1/2023.emnlp-main.353
- Yang et al. (2024) Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, and Sebastian Riedel. 2024. Do Large Language Models Latently Perform Multi-Hop Reasoning? arXiv:2402.16837 [cs.CL] https://arxiv.org/abs/2402.16837
- Yao et al. (2023a) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023a. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=5Xc1ecxO1h
- Yao et al. (2023b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
- Yasunaga et al. (2022) Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy Liang, and Jure Leskovec. 2022. Deep Bidirectional Language-Knowledge Graph Pretraining. arXiv:2210.09338 [cs.CL] https://arxiv.org/abs/2210.09338
- Ye et al. (2024) Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. 2024. Language is All a Graph Needs. arXiv:2308.07134 [cs.CL] https://arxiv.org/abs/2308.07134
- Zhang et al. (2024a) Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, and Linda Ruth Petzold. 2024a. Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning. arXiv:2405.20535 [cs.AI] https://arxiv.org/abs/2405.20535
- Zhang et al. (2024b) Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. 2024b. Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. arXiv:2406.09136 [cs.CL] https://arxiv.org/abs/2406.09136
- Zhu et al. (2024) Kerui Zhu, Bo-Wei Huang, Bowen Jin, Yizhu Jiao, Ming Zhong, Kevin Chang, Shou-De Lin, and Jiawei Han. 2024. Investigating Instruction Tuning Large Language Models on Graphs. arXiv:2408.05457 [cs.CL] https://arxiv.org/abs/2408.05457