
Can Language Models Act as Knowledge Bases at Scale?

Qiyuan He     Yizhong Wang     Wenya Wang    
Nanyang Technological University
University of Washington
[email protected], [email protected], [email protected]
Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating responses to complex queries through large-scale pre-training. However, the efficacy of these models in memorizing and reasoning over large-scale structured knowledge, especially world knowledge that explicitly covers abundant factual information, remains questionable. Addressing this gap, our research investigates whether LLMs can effectively store, recall, and reason with knowledge on a scale comparable to the latest knowledge bases (KBs) such as Wikidata. Specifically, we focus on three crucial aspects of this viability: (1) the efficiency of LLMs of different sizes in memorizing the exact knowledge in a large-scale KB; (2) the flexibility of recalling the memorized knowledge in response to natural language queries; (3) the capability to infer new knowledge through reasoning. Our findings indicate that while LLMs hold promise as large-scale KBs capable of retrieving and responding with flexibility, enhancements in their reasoning capabilities are necessary to fully realize their potential. Our datasets and source code can be obtained from https://github.com/hyanique/LMKB-at-Scale.


1 Introduction

Access to knowledge is critical for language models (LMs) to perform well on many tasks and serve users reliably. Existing studies have found that language models, after pre-training, encode a large amount of factual knowledge as well as implicit linguistic knowledge from general corpora, making them a crucial component for tasks that require natural language understanding Bommasani et al. (2022); Li et al. (2022a). This suggests the potential of using language models as knowledge bases Petroni et al. (2019); AlKhamissi et al. (2022). However, existing studies mainly focus on probing Li et al. (2022b); Chen et al. (2022); Sung et al. (2021) and utilizing Roberts et al. (2020); Moiseev et al. (2022) the knowledge LMs gain from pre-training, which shows deficiencies when handling long-tail knowledge that appears less frequently Kandpal et al. (2023), due to knowledge imbalance, conflict, and noise in the pre-training corpora Carlini et al. (2023); Razeghi et al. (2022); Tänzer et al. (2022).

Meanwhile, knowledge bases (KBs), commonly utilized in many knowledge-intensive tasks such as dialogue Li et al. (2022c); Galetzka et al. (2021), question answering Baek et al. (2023); Saxena et al. (2020); Qiu et al. (2020), and recommendation systems Wu et al. (2013), are known for their ability to compactly organize information on a large scale, providing clean and balanced knowledge. For example, Wikidata contains over 108M entities about the world (https://www.wikidata.org/wiki/Wikidata:Statistics). Operations over larger KBs incur greater computational costs and therefore pose a significant challenge for extracting subgraphs from the KB Cordella et al. (2004); Grohe and Schweitzer (2020) or grounding semantic logic forms over the KB Lan and Jiang (2020); Bhutani et al. (2019) to perform downstream tasks. In addition, the rigid format of KBs limits their flexibility to handle complex natural language queries.

In this work, we propose to explicitly train large language models to memorize world knowledge from Wikidata Pellissier Tanon et al. (2016) at a large scale and systematically study the viability of using the resulting LMs as the knowledge base. Given their high capacity, we hypothesize that LMs can store information from a knowledge base at a rather large scale and provide more flexibility in querying and reasoning. Specifically, we aim to answer the following three questions: (1) How quickly and how well can LMs of different sizes memorize large-scale knowledge of different frequencies through training? (2) How flexible are these trained LMs when used to answer queries in natural language rather than the structured triplets used during training? (3) Can LMs infer new knowledge that does not exist in the KB, and what kind of reasoning capabilities are involved? We distinguish our work from those that train LMs on small-scale KBs with popular facts Heinzerling and Inui (2021) or convert knowledge triplets to synthetic sentences using manually curated templates Heinzerling and Inui (2021); Petroni et al. (2019), which only work for a limited set of relations.

We start by proposing an efficient learning algorithm based on importance sampling Alain et al. (2016); Katharopoulos and Fleuret (2018); Zhang et al. (2019) to train LMs to memorize knowledge more efficiently. To answer the first question, we evaluate the memorization capacity of LMs of different sizes as well as their performance on both popular and long-tail world knowledge. We observe that LMs are capable of memorizing information from a knowledge base at a large scale, with larger models learning faster. In addition, infrequent knowledge is more challenging to memorize, irrespective of the size of the language model.

To answer the second question on LMs’ flexibility in handling natural language queries, we further finetune the trained LMs using PopQA Mallen et al. (2023), a natural language QA dataset that requires long-tail Wikidata knowledge. With minimal finetuning, these LMs demonstrate superior performance over their counterparts that are not trained on the Wikidata KB. This indicates the power of LMs in flexibly retrieving and organizing long-tail knowledge, regardless of the presentation form, unveiling their potential for responding to various user queries.

To answer the third question from the perspective of incomplete KBs, we use the dataset released by Veseli et al. (2023a) containing general missing facts (triplets) and further curate two sets of missing facts targeting two kinds of reasoning capabilities, namely inverse reasoning, which switches the positions of the subject and object, and compositional reasoning, which conjoins two relations to form a new one. By evaluating LMs’ performance on inferring the missing facts, we study their inherent reasoning capabilities in addition to memorizing existing facts. Our results show that LMs are capable of inferring missing entities from existing knowledge to some extent via reasoning. However, they struggle with inverse reasoning more often than compositional reasoning, calling for further investigation into how to improve LMs’ reasoning capabilities, inverse reasoning in particular.

2 Training LMs on Large-Scale KB

2.1 KB Dataset

A basic knowledge base is a collection of facts in the form of (subject, relation, object) triplets, for example, Freebase Bollacker et al. (2008) and DBPedia Auer et al. (2007). To study the memorization capacity of language models at a large scale, we consider Wikidata Pellissier Tanon et al. (2016), one of the largest knowledge bases to date that is actively maintained by the community. Compared with pre-training corpora, Wikidata contains abundant world knowledge in a more compact and accurate form, covering both popular and long-tail knowledge that appears less frequently in the pre-training corpora of LMs.

Figure 1: Distribution of entity and relation occurrences in world knowledge $\mathcal{D}_0$. (a) Distribution of entities in $\mathcal{D}_0$, with the number of entities on a 1e6 scale and occurrence counts in powers of 2. (b) Distribution of relations in $\mathcal{D}_0$, with occurrence counts in powers of 10.

When preparing the KB dataset, we use the cleaned knowledge taken from the January 2022 snapshot of the Wikidata dump Kaiser and Christmann (2021) to avoid knowledge irrelevant to common question answering; specifically, we filter away URLs, images, geographical coordinates, and subject entities that do not have a corresponding Wikipedia page. If there are multiple objects for a given subject and relation, we randomly sample a single instance from the available objects to avoid knowledge ambiguity. After filtering, we obtain a dataset of 46M (subject, relation, object) triplets, with the distribution of 10.5M entities (subjects or objects) and 2,157 relations shown in Figure 1(a) and Figure 1(b). We denote this dataset as $\mathcal{D}_0$. We can observe that over 4M entities appear only once or twice, and around 500 relations appear 1-10 times. Meanwhile, around 250 relations occur more than 10K times, and 530K entities occur more than 16 times. These statistics show that $\mathcal{D}_0$ covers adequate popular knowledge as well as a non-negligible portion of long-tail knowledge.
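A minimal sketch of this preprocessing step is shown below; the input format and the Wikipedia-page lookup are illustrative assumptions, not the exact dump schema.

```python
import random
from collections import defaultdict

def build_kb(raw_triplets, has_wikipedia_page):
    """Filter raw (subject, relation, object) triplets and resolve multi-object facts.

    raw_triplets: iterable of (subject, relation, object) strings, assumed to be
        already stripped of URLs, images, and geographical coordinates.
    has_wikipedia_page: callable returning True if the subject entity has a
        corresponding Wikipedia page (a hypothetical lookup, e.g. a sitelink table).
    """
    grouped = defaultdict(list)
    for subj, rel, obj in raw_triplets:
        if not has_wikipedia_page(subj):
            continue  # drop subjects without a Wikipedia page
        grouped[(subj, rel)].append(obj)

    # If a (subject, relation) pair has several objects, keep one at random
    # to avoid ambiguous supervision targets.
    return [(subj, rel, random.choice(objs)) for (subj, rel), objs in grouped.items()]
```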

To study how the model performs with respect to knowledge frequency inside the KB, we first calculate the number of occurrences of all entities and relations. Next, we define long-tail entities/relations as those in the top 15% when ranking all entities/relations by their number of occurrences in ascending order, and popular entities/relations as those in the top 5% when ranking them in descending order. Then, under each of the long-tail and popular categories, we randomly sample triplets under both the entity set and the relation set, resulting in four datasets denoted as $\mathcal{D}_{PopRel}$, $\mathcal{D}_{PopEnt}$, $\mathcal{D}_{TailRel}$, and $\mathcal{D}_{TailEnt}$. As $\mathcal{D}_0$ contains 2,157 relations, the amount of knowledge with long-tail relations is limited: $\mathcal{D}_0$ contains 323 long-tail relations that occur 1-2 times each, summing to 663 occurrences in total, whereas the top 5% of the 2,157 relations account for 40.8M occurrences; hence $\mathcal{D}_{TailRel}$ contains only 663 samples. The other three datasets contain 1K triplets each. Example triplets include (“Linlithgow Burgh Halls”, instance of, “Town hall”) from $\mathcal{D}_{PopRel}$ and (“Department of Agriculture, Water and the Environment”, external auditor, “Australian National Audit Office”) from $\mathcal{D}_{TailRel}$.
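A sketch of how such frequency-based splits can be constructed is given below; the thresholds follow the text, while the exact sampling criterion (here, a triplet qualifies if either its subject or its object falls in the target bucket) is our assumption.

```python
import random
from collections import Counter

def split_by_entity_frequency(triplets, tail_frac=0.15, pop_frac=0.05, n_samples=1000):
    """Rank entities by occurrence count and sample popular / long-tail triplets.

    A sketch of the construction of D_PopEnt and D_TailEnt; the relation-based
    splits are built the same way using relation counts instead of entity counts.
    """
    counts = Counter()
    for subj, _, obj in triplets:
        counts[subj] += 1
        counts[obj] += 1

    ranked = [e for e, _ in counts.most_common()]          # descending frequency
    popular = set(ranked[: int(len(ranked) * pop_frac)])    # top 5% most frequent
    tail = set(ranked[-int(len(ranked) * tail_frac):])      # bottom 15% least frequent

    pop_triplets = [t for t in triplets if t[0] in popular or t[2] in popular]
    tail_triplets = [t for t in triplets if t[0] in tail or t[2] in tail]
    return (random.sample(pop_triplets, min(n_samples, len(pop_triplets))),
            random.sample(tail_triplets, min(n_samples, len(tail_triplets))))
```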

2.2 Model Setup

We choose two language models, namely T5 Raffel et al. (2020) and LLaMA-2 Touvron et al. (2023), each with two different sizes: T5-base, T5-large, LLaMA-2-7b, and LLaMA-2-13b. Starting from their pre-trained checkpoints, we continue training these models on the filtered Wikidata KB $\mathcal{D}_0$ containing 46M knowledge triplets. See Appendix A for the detailed training setup.

For each knowledge triplet in the form of (subject, relation, object), we create an input string by concatenating the prefix “Subject:” followed by the subject text, the prefix “Relation:” followed by the relation text and the prefix “Object:”, and use the object text as the output. For example, given the knowledge triplet (“Palaeontological Museum, Munich”, architect, “Leonhard Romeis”), the input to the LMs is “Subject: Palaeontological Museum, Munich. Relation: architect. Object:” and the expected output is the object “Leonhard Romeis”.
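This serialization can be written as a small helper; the sketch below simply mirrors the prompt format described above.

```python
def triplet_to_example(subject, relation, obj):
    """Serialize a (subject, relation, object) triplet into an input/output pair
    following the memorization prompt format."""
    input_text = f"Subject: {subject}. Relation: {relation}. Object:"
    output_text = obj
    return input_text, output_text

# Example:
# triplet_to_example("Palaeontological Museum, Munich", "architect", "Leonhard Romeis")
# -> ("Subject: Palaeontological Museum, Munich. Relation: architect. Object:",
#     "Leonhard Romeis")
```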

The training objective is to maximize the probability of generating the correct object, $p_{LM}(x_{out} \mid x_{in})$, where $x_{out}$ is the object text, $x_{in}$ is the input text, and $p_{LM}$ denotes the probability distribution given by the language model.
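For an encoder-decoder model such as T5, this objective is the standard sequence-to-sequence cross-entropy. A minimal sketch of one training step using the HuggingFace interface (assuming the t5-base checkpoint):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

x_in = "Subject: Palaeontological Museum, Munich. Relation: architect. Object:"
x_out = "Leonhard Romeis"

inputs = tokenizer(x_in, return_tensors="pt")
labels = tokenizer(x_out, return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy loss,
# i.e. the negative log-likelihood -log p_LM(x_out | x_in) averaged over tokens.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()  # one gradient step of the memorization objective
```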

2.3 Importance Sampling

With the goal of injecting abundant and diverse information from a large-scale KB into LMs, it is imperative for the model to converge to a state where it can, in an ideal scenario, memorize every triplet within the training dataset. The traditional training process iterates through each data sample precisely once per epoch, inherently treating all data with uniform importance. This approach, however, leads to extended training durations and reduced convergence efficiency, particularly when dealing with large-scale KBs containing a significant amount of hard-to-memorize knowledge. To address this issue, inspired by the importance sampling algorithms proposed in Alain et al. (2016); Katharopoulos and Fleuret (2018), we allocate distinct importance weights to the training samples within $\mathcal{D}_0$. The importance weight is proportional to the prediction loss of each sample, serving as a measure of its memorization difficulty. This strategy prioritizes samples that are more challenging to memorize by assigning them greater importance, thereby increasing their likelihood of selection during each training iteration, leading to faster convergence Zhang et al. (2019); Xie et al. (2023).

The detailed algorithm is shown in Algorithm 1.

Algorithm 1 Knowledge memorization with importance sampling
Require: knowledge samples with importance $\mathcal{D}=\{(x_1,y_1;w_1),\dots,(x_n,y_n;w_n)\}$
Require: language model pre-trained on general corpora
Require: sampling ratio $\alpha\in(0,1)$
1:  initialize importance $w_1,\dots,w_n$ with 1e6
2:  for every training epoch $e$ do
3:      sample $\mathcal{S}=\{(x^s,y^s;w^s)\}\subset\mathcal{D}$ of size $n\times\alpha$ using importance $w_1,\dots,w_n$
4:      forward pass using $\{(x^s,y^s)\}$
5:      update importance $w^s$ using instance loss $\mathcal{L}(y^s,x^s)$
6:      backpropagation
7:  end for

As shown in the pseudo-code, we use the instance loss $\mathcal{L}(y^s, x^s)$ to measure a knowledge triplet’s importance and use this importance as the sampling probability within each batch, where $\mathcal{L}$ is the cross-entropy loss and $y^s$ is the correct output text given input $x^s$. Mathematically,

$$\mathcal{L}(y^{s},x^{s})=-\sum_{t=1}^{T}\log{p_{LM}(y_{t}^{s}\mid x^{s})}, \qquad (1)$$

with $T$ being the number of tokens in $y^s$ and $y_t^s$ being the $t$-th token of $y^s$. Hence, the higher the instance loss, the higher the chance for the instance to be sampled into the subset $\mathcal{S}$ for training, forcing the model to focus on learning hard samples more often.
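A minimal sketch of the weight bookkeeping and per-sample loss in Algorithm 1 (assuming PyTorch and HuggingFace-style label padding with -100; the class and function names are ours):

```python
import numpy as np
import torch

class ImportanceSampler:
    """Per-epoch weighted sampling over the KB, as in Algorithm 1.

    Weights start at a large constant so every triplet is likely to be visited
    at least once before its weight reflects its actual loss.
    """

    def __init__(self, num_samples, init_weight=1e6):
        self.weights = np.full(num_samples, init_weight, dtype=np.float64)

    def sample_epoch(self, ratio=0.3):
        """Draw n * ratio indices with probability proportional to their weights."""
        probs = self.weights / self.weights.sum()
        size = int(len(self.weights) * ratio)
        return np.random.choice(len(self.weights), size=size, replace=False, p=probs)

    def update(self, indices, per_sample_losses):
        """After the forward pass, set each sampled triplet's weight to its loss."""
        self.weights[indices] = np.asarray(per_sample_losses, dtype=np.float64)


def per_sample_loss(logits, labels, pad_token_id=-100):
    """Token-level cross-entropy summed per sequence, matching Equation (1)."""
    loss = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=pad_token_id, reduction="none")
    return loss.sum(dim=1)  # one scalar loss per triplet in the batch
```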

Figure 2: Learning curves of T5-base trained on $\mathcal{D}_1$ with and without importance sampling (ImSmp), evaluated using $\mathcal{D}_{1-Eval}$.

To verify our hypothesis, we conduct a preliminary experiment by randomly sampling 1% of the triplets from $\mathcal{D}_0$ and training a T5-base model to memorize this sampled dataset, with and without Algorithm 1. We denote this subset containing 426K triplets as $\mathcal{D}_1$. We further randomly sample 10K triplets from $\mathcal{D}_1$ as the corresponding evaluation set, denoted as $\mathcal{D}_{1-Eval}$. We configure the sampling ratio $\alpha$ to be 0.3. As shown in Figure 2, the model trained without importance sampling quickly reaches around 80% exact match and F1 score in the first 30K training steps, and its performance then slowly increases to around 95% exact match and F1 score over another 20K steps. With importance sampling, the model achieves roughly 80% exact match and F1 score after the first 20K steps, and over 95% exact match and F1 score after another 12K steps. We also note that training with importance sampling yields a significantly steeper learning curve than training without it. In what follows, we use importance sampling with the same $\alpha$ value when training LMs for all experiments.

2.4 Evaluation

To study the LM’s capacity for memorizing the structured knowledge base, we propose to use the exact match (EM) and F1 scores following Heinzerling and Inui (2021) over the entire training dataset. We call this the fixed-form information recall ability. Since it is not feasible to iteratively evaluate the LM on all 46M triplets in $\mathcal{D}_0$ throughout the training process due to the huge inference time, we opt to randomly sample 10K triplets from $\mathcal{D}_0$ as the evaluation set, denoted as $\mathcal{D}_2$.
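For reference, EM and token-level F1 can be computed as in standard QA evaluation; the sketch below uses a simplified normalization and is not necessarily the authors' exact script.

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace; a simplified normalization step."""
    return " ".join(text.lower().split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between the generated object text and the gold object."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```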

Figure 3: Evaluating the fixed-form information recall ability for LMs trained on $\mathcal{D}_0$; T5 models are on the upper row and LLaMA-2 models on the bottom row. (a) T5 on $\mathcal{D}_2$; (b) T5 on $\mathcal{D}_{PopEnt}$ and $\mathcal{D}_{TailEnt}$; (c) T5 on $\mathcal{D}_{PopRel}$ and $\mathcal{D}_{TailRel}$; (d) LLaMA-2 on $\mathcal{D}_2$; (e) LLaMA-2 on $\mathcal{D}_{PopEnt}$ and $\mathcal{D}_{TailEnt}$; (f) LLaMA-2 on $\mathcal{D}_{PopRel}$ and $\mathcal{D}_{TailRel}$.

To measure the model’s ability to flexibly retrieve memorized knowledge when queried with input and output formats different from training, we consider using natural language to query our model in the same way as the task of question answering (QA). We call this the free-form information recall ability. For implementation, we require that the knowledge used by the QA task be highly covered by the 46M triplets of world knowledge from Wikidata. Hence, we select the QA dataset constructed in PopQA Mallen et al. (2023). PopQA converts 14K knowledge triplets from Wikidata to their corresponding natural language questions and answers, covering long-tail information based on Wikipedia page views. With a random 8:2 split producing a train set of 11.4K samples and a validation set of 2.9K samples, we further finetune the model from the memorization checkpoints using the training split of PopQA and evaluate the performance on the validation set using the F1 score. We also compute the exact match and F1 score of the model’s generation accuracy over the PopQA knowledge triplets to check whether the model can access the relevant knowledge using fixed-form recall.

Lastly, we explore whether LMs can infer new knowledge that does not exist in the KB, namely, the missing fact completion ability. Since most knowledge graphs are incomplete, missing factual triplets or even entities Yang et al. (2022); Shi and Weninger (2018), the ability to automatically complete missing facts becomes especially demanding. First, we consider the missing facts dataset released by Veseli et al. (2023b), which contains 350 factual triplets missing from Wikidata with human-annotated ground truths. As we additionally seek to investigate the underlying reasoning capabilities involved in missing fact completion, we also curate two sets of missing knowledge triplets based on $\mathcal{D}_0$, emphasizing inverse reasoning and compositional reasoning, respectively. For a missing knowledge triplet that is not contained in $\mathcal{D}_0$, we query the model using the same input format as in fixed-form information recall and evaluate the output text against the object text using F1 scores. For pre-trained models without training on the knowledge base $\mathcal{D}_0$, we query the models with natural language inputs released alongside the triplets; see the respective tasks in Section 5.1 for details.

Next, we present the detailed evaluation and analysis to answer each of the three core questions, including (1) the efficiency of LMs with different sizes in memorizing the exact knowledge in the large-scale KB (Section 3); (2) the flexibility of recalling the memorized knowledge in response to natural language queries (Section 4); (3) the capability to infer new knowledge through reasoning (Section 5).

3 Fixed-Form Information Recall

As mentioned in Section 2.4, we measure the fixed-form information recall ability on a dataset $\mathcal{D}_2$ sub-sampled from the original training set $\mathcal{D}_0$ to avoid the huge inference cost. See Appendix A for additional training details. Specifically, we compute the exact match and F1 score on $\mathcal{D}_2$ along the training steps of T5-base, T5-large, LLaMA-2-7b and LLaMA-2-13b. The performance curves are shown in Figures 3(a) and 3(d).

The results show that the models can memorize a large portion of the 46M world knowledge triplets, with T5-large performing better than T5-base, and LLaMA-2-13b slightly more capable than LLaMA-2-7b in terms of memorization capacity. LMs with larger sizes are capable of memorizing more knowledge with higher efficiency. In particular, at the end of training, LLaMA-2-13b gives the highest F1 score of 81.64%, whereas T5-large reaches an F1 score of 63.07%.

In addition, we further evaluate the performance separately on popular and long-tail triplets, i.e., $\mathcal{D}_{PopEnt}$, $\mathcal{D}_{PopRel}$, $\mathcal{D}_{TailEnt}$ and $\mathcal{D}_{TailRel}$. The results are shown in Figures 3(b), 3(c), 3(e) and 3(f). These plots demonstrate that (1) all models are better at memorizing popular information than long-tail information; (2) for LLaMA-2 models, a larger model size does not lead to significantly better memorization capability when it comes to long-tail and popular knowledge; (3) different from LLaMA-2, T5-large is better than T5-base at learning both popular and long-tail knowledge, with an even more significant improvement for long-tail relations ($\mathcal{D}_{TailRel}$).

4 Free-Form Information Recall

To evaluate the model’s ability to perform free-form information recall when using natural language queries, as indicated in Section 2.4, we adopt the knowledge triplets and their corresponding natural language questions from PopQA:

Given a knowledge triplet (“Binary”, author, “Michael Crichton”) from Wikidata, PopQA converts it to a natural language question that asks for the object: “Who is the author of Binary?”. The correct answer in this case is “Michael Crichton”. To make LMs trained on knowledge triplets familiar with the natural language QA format, we further finetune these LMs by feeding them the question as input and training them to generate the correct answer. For T5, the input is the original question, such as “Who is the author of Binary?”. For LLaMA-2, the input is “Question: Who is the author of Binary? Answer:”. We then evaluate the generated output using the F1 score, which allows minor linguistic variations in the generated output while taking into account semantic similarity and flexibility. In addition to the free-form queries, we also evaluate how much of the PopQA knowledge in its original triplet form is memorized by the model at each checkpoint by querying the model using the subject and relation, following the same input format used for fixed-form information recall.
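A small helper illustrating the two prompt formats (a sketch; the wording matches the examples above):

```python
def qa_example(question, answer, model_family="t5"):
    """Build the finetuning input/output pair for a PopQA question.

    model_family: "t5" uses the bare question; "llama2" wraps it in the
    Question/Answer template described above.
    """
    if model_family == "llama2":
        input_text = f"Question: {question} Answer:"
    else:
        input_text = question
    return input_text, answer

# qa_example("Who is the author of Binary?", "Michael Crichton", "llama2")
# -> ("Question: Who is the author of Binary? Answer:", "Michael Crichton")
```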

Figure 4: PopQA finetuning performance and knowledge recall at various checkpoints through training on $\mathcal{D}_0$. The pre-trained models are represented by epoch 0.

The results on PopQA are shown in Figure 4. Each point on the x-axis indicates the number of epochs for each checkpoint when training LMs on the Wikidata triplets, i.e., $\mathcal{D}_0$. Starting from each of these checkpoints, we further finetune the LMs using the training data from PopQA for up to 30 epochs for T5 models and 15 epochs for LLaMA-2 models with early stopping (see Appendix A for details) and report the best F1 score on the evaluation set.

It is clear that training on $\mathcal{D}_0$ provides a sizable performance boost compared with using the originally pre-trained LMs (epoch 0). This suggests that LMs trained on large-scale knowledge bases are capable of performing some extent of free-form information recall, especially for a question-answering task that emphasizes long-tail knowledge. We also notice that memorizing more knowledge (as indicated by the triplet EM scores) generally leads to better performance. In addition, larger models, after being trained on $\mathcal{D}_0$, are able to recall more knowledge for this downstream task in a fixed form, and finetuning them yields better results.

5 Missing Fact Completion

Figure 5: Evaluating the ability to infer new knowledge across various model checkpoints through training on $\mathcal{D}_0$: (a) T5 on general missing facts; (b) T5 on inverse reasoning; (c) T5 on compositional reasoning; (d) LLaMA-2 on general missing facts; (e) LLaMA-2 on inverse reasoning; (f) LLaMA-2 on compositional reasoning. The x-axis indicates the number of epochs of training on $\mathcal{D}_0$ for each checkpoint; epoch 0 stands for the pre-trained checkpoints.

5.1 General Missing Facts

To evaluate how the model performs when completing missing facts in general, we consider knowledge triplets that are missing from $\mathcal{D}_0$. We query the model to generate an object given the subject and relation. To ensure the feasibility of this setting, we require that the subject and relation in question are both contained in the knowledge base. Hence, the model has to associate relevant information related to the subject and the relation in order to infer the object.

For implementation, we utilize the missing fact dataset Veseli et al. (2023b) consisting of 350 samples of knowledge missing from Wikidata. For each sample, we query the model using the subject and the relation, which are contained in Wikidata, and compare the generated output with the human-annotated object using the F1 score; when there are multiple ground-truth candidates, we compare the generated result with each of them and take the best F1 score. To clearly demonstrate the benefit of knowledge memorization, we further evaluate how the pre-trained LMs perform on these missing facts using the natural language queries provided by the dataset. For example, the missing fact triplet (“Tidö Castle”, headquarters location, “Västeras”) is associated with the natural language question “The headquarter of Tidö Castle is in” as input for T5, while the input for LLaMA-2 is “Question: The headquarter of Tidö Castle is in? Answer:”.
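A sketch of how a single missing-fact query can be issued to a KB-trained checkpoint with greedy decoding; the prompt construction follows Section 2.2, and the function name and local checkpoint path are ours.

```python
import torch

def complete_missing_fact(model, tokenizer, subject, relation, max_new_tokens=32):
    """Query a KB-trained model for the object of a (subject, relation, ?) fact."""
    prompt = f"Subject: {subject}. Relation: {relation}. Object:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage (assuming a T5 checkpoint finetuned on D_0 saved locally):
# from transformers import AutoTokenizer, T5ForConditionalGeneration
# tokenizer = AutoTokenizer.from_pretrained("t5-base")
# model = T5ForConditionalGeneration.from_pretrained("path/to/kb-trained-t5")
# print(complete_missing_fact(model, tokenizer, "Tidö Castle", "headquarters location"))
```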

As shown in Figures 5(a) and 5(d), training on $\mathcal{D}_0$ provides some performance increase. This suggests that training on large-scale knowledge bases can help LMs infer new facts better. However, the capability of LMs to infer new facts does not grow along with the memorization process on $\mathcal{D}_0$, and larger models like LLaMA-2 even perform worse than smaller models like T5. These observations indicate that the amount of knowledge learned by the models may not be the key factor determining their inference capability towards missing facts.

5.2 Inverse Reasoning

We define inverse reasoning as the ability to infer $(B, r', A)$ given the triplet $(A, r, B)$, where $A$ and $B$ represent two entities and $r, r'$ indicate relations. To study the model’s ability to conduct inverse reasoning over the knowledge base, we first curate a set of triplets in the form of $(A, r, B)$ originally contained in $\mathcal{D}_0$, denoted as $\mathcal{D}_{\rightarrow}$. Then, we curate the inverse set by mapping the original relation $r$ to its inverse $r'$ and switching the positions of $A$ and $B$, forming the triplets $(B, r', A)$. We denote this set as $\mathcal{D}_{\leftarrow}$. We query the model for the object entity $A$ given the subject entity $B$ and the inverse relation $r'$, and compute the F1 score on $\mathcal{D}_{\leftarrow}$. To show whether the model is capable of correctly recalling the original fact $(A, r, B)$ in the first place, we additionally query the model to generate $B$ given $A$ and $r$ on $\mathcal{D}_{\rightarrow}$. For the originally pre-trained LMs without access to Wikidata, we convert the knowledge triplets to natural language QA pairs as explained in Section 4.

For implementation, we select seven relation pairs $(r, r')$ from $\mathcal{D}_0$, as shown in Table 1 in Appendix B. For each relation pair, we apply the restriction that for a knowledge triplet $(A, r, B)$, the inverse knowledge $(B, r', A)$ is not contained in $\mathcal{D}_0$. For each relation, we randomly sample 150 triplets from $\mathcal{D}_0$, resulting in 1,050 samples for each of $\mathcal{D}_{\rightarrow}$ and $\mathcal{D}_{\leftarrow}$.
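A sketch of this construction is shown below; the example relation pair in the comment is illustrative, and the actual seven pairs are listed in Table 1 of Appendix B.

```python
import random

def build_inverse_sets(kb_triplets, inverse_map, per_relation=150):
    """Curate forward triplets (A, r, B) and their inverse queries (B, r', A).

    kb_triplets: set of (subject, relation, object) tuples, i.e. D_0.
    inverse_map: dict mapping a relation r to its inverse r',
        e.g. {"capital of": "capital"}  # illustrative pair, not from Table 1
    """
    forward, inverse = [], []
    for r, r_inv in inverse_map.items():
        candidates = [(a, rel, b) for (a, rel, b) in kb_triplets
                      if rel == r and (b, r_inv, a) not in kb_triplets]
        for a, _, b in random.sample(candidates, min(per_relation, len(candidates))):
            forward.append((a, r, b))       # memorized fact, used to check recall
            inverse.append((b, r_inv, a))   # held-out inverse fact to be inferred
    return forward, inverse
```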

As shown in Figures 5(b) and 5(e), for all models, we observe only a limited performance increase when answering the inverse knowledge $(B, r', A)$, despite the models showing increasing memorization accuracy of the forward knowledge $(A, r, B)$. We speculate that this lack of significant change in deduction results suggests that LMs can memorize knowledge well but fall short at handling the inverse of relations.

5.3 Compositional Reasoning

We define compositional reasoning as the ability to infer $(A, r_3, C)$ given $(A, r_1, B)$ and $(B, r_2, C)$ when $(A, r_1, B) \wedge (B, r_2, C) \Rightarrow (A, r_3, C)$. To study the model’s ability to conduct compositional reasoning over the knowledge base, we first curate a set of triplet pairs containing $(A, r_1, B)$ and $(B, r_2, C)$, denoted by $\mathcal{D}_{\wedge} = (\mathcal{D}^{1}_{\wedge}, \mathcal{D}^{2}_{\wedge})$. We then form the conclusive triplet set containing $(A, r_3, C)$, denoted by $\mathcal{D}_{\Rightarrow}$. To test the model’s performance, we query the model using entity $A$ and relation $r_3$, and compare the model’s output with the ground-truth entity $C$ on $\mathcal{D}_{\Rightarrow}$. To show whether the model is capable of correctly recalling the conditioned facts $(A, r_1, B)$ and $(B, r_2, C)$ in the first place, we additionally query the model to generate the objects for these conditioned facts on $\mathcal{D}_{\wedge}$. Again, for the pre-trained models, we convert the knowledge triplets to natural language QA pairs.

For implementation, we formulate eight reasoning rules of relation composition, $r_1 \wedge r_2 \Rightarrow r_3$, as shown in Table 3 in Appendix B.

For a compositional rule $(A, r_1, B) \wedge (B, r_2, C) \Rightarrow (A, r_3, C)$, we require that the prior knowledge triplets $(A, r_1, B)$ and $(B, r_2, C)$ exist in the knowledge dataset while the deduction result $(A, r_3, C)$ is missing. For each reasoning rule, we randomly sample 150 examples from $\mathcal{D}_0$, resulting in 1,200 samples for each of $\mathcal{D}_{\wedge}$ and $\mathcal{D}_{\Rightarrow}$.
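A sketch of the corresponding data curation; the rule in the comment is a hypothetical example of the form $r_1 \wedge r_2 \Rightarrow r_3$, while the eight actual rules are given in Table 3 of Appendix B.

```python
import random

def build_compositional_sets(kb_triplets, rules, per_rule=150):
    """Curate premise pairs (A, r1, B), (B, r2, C) and held-out conclusions (A, r3, C).

    kb_triplets: set of (subject, relation, object) tuples, i.e. D_0.
    rules: list of (r1, r2, r3) compositions,
        e.g. [("mother", "mother", "grandmother")]  # hypothetical rule
    """
    premises, conclusions = [], []
    by_subject_relation = {}
    for a, r, b in kb_triplets:
        by_subject_relation.setdefault((a, r), []).append(b)

    for r1, r2, r3 in rules:
        candidates = []
        for (a, rel), bs in by_subject_relation.items():
            if rel != r1:
                continue
            for b in bs:
                for c in by_subject_relation.get((b, r2), []):
                    # keep only conclusions that are actually missing from the KB
                    if (a, r3, c) not in kb_triplets:
                        candidates.append(((a, r1, b), (b, r2, c), (a, r3, c)))
        for p1, p2, concl in random.sample(candidates, min(per_rule, len(candidates))):
            premises.append((p1, p2))
            conclusions.append(concl)
    return premises, conclusions
```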

The results from Figure 5(c) and 5(f) show that training on the KB can assist the model in performing compositional reasoning. However, there is an upper threshold; memorizing prior knowledge beyond that point may not help the model perform compositional deduction.

6 Related Work

Infusing Knowledge into LM

Starting from the seminal work by Petroni et al. (2019) that first introduced the concept of using pre-trained language models as knowledge bases, many works have investigated this viability by finetuning and evaluating the models on downstream question-answering tasks Roberts et al. (2020); Guu et al. (2020); Moiseev et al. (2022). Notably, using salient span masking Moiseev et al. (2022), augmenting the learning objective Verga et al. (2020), and modifying the model architecture Zhang et al. (2021); Yasunaga et al. (2022) have been shown to improve LMs’ performance on various open-domain question answering tasks.

When explicitly studying the capacity to store factual information, many knowledge datasets have been proposed. Among them, LAMA Petroni et al. (2019) is based on factual and commonsense knowledge grounded to Wikipedia. Wikidata5M, derived by Wang et al. (2021), contains 4.9M Wikidata triplets. More recently, Cao et al. (2021) derived WIKI-UNI with a uniform distribution of object entities, and Keleg and Magdy (2023) proposed DLAMA to group factual information by cultural diversity.

Under the scope of investigating the infusion of KBs into LMs, Bosselut et al. (2019) focus on commonsense knowledge derived from ATOMIC Sap et al. (2019) and ConceptNet Speer et al. (2017), while Heinzerling and Inui (2021) study the memorization capacity of BERT-based models using popular Wikidata knowledge. AutoPrompt Shin et al. (2020) can be utilized to modify knowledge inputs Veseli et al. (2023b, a) for triplet completion. In addition, Mallen et al. (2023) propose to use retrieval-augmented LMs to help with long-tail factual knowledge.

Probing for Existing Knowledge

Given that pre-trained LMs are sensitive to input Jiang et al. (2020); Elazar et al. (2021) and that querying the model with even syntactic variations may lead to different outputs Longpre et al. (2021), many works have focused on probing techniques to extract knowledge stored inside LMs through pre-training. For example, Li et al. (2022b) study the extraction of knowledge in the setting of unsupervised knowledge-grounded conversation, and Alivanistos et al. (2023) utilize prompt generation and post-processing techniques to probe for knowledge, while others focus on extracting specific types of factual information, such as commonsense knowledge Davison et al. (2019), simile metaphors Chen et al. (2022) and biomedical facts Sung et al. (2021).

7 Conclusion

In this work, we systematically study the viability of using language models as large-scale knowledge bases. We propose an importance sampling algorithm to increase the efficiency of memorizing world knowledge from Wikidata. We investigate and evaluate three critical dimensions along this direction and conclude that large language models are able to recall a large amount of KB knowledge through training, both in the fixed form of the structured KB and in the free form of natural language queries, with increasing flexibility when querying the world knowledge. Nevertheless, there is a significant gap between the memorization of popular knowledge and long-tail knowledge regardless of model size. In addition, language models, after being trained on the large-scale KB, demonstrate consistent improvement in inferring new facts through some extent of reasoning. However, the amount of knowledge learned during training does not guarantee consistent improvement in reasoning capabilities, especially when it comes to inverse reasoning. These results point to future work in utilizing language models as knowledge bases at scale, as well as further investigation into improving LMs’ reasoning capability over world knowledge.

Limitations

This work focuses on the following three aspects of treating language models as knowledge bases: memorization and access of knowledge base information at scale, access of memorized knowledge in a flexible, natural language format, and inference of facts missing from the knowledge base used for training. AlKhamissi et al. (2022) proposed the following five abilities for a language model to be qualified as a knowledge base: (1) access of knowledge, (2) editing of knowledge, (3) consistency over semantically equivalent contexts, (4) reasoning over stored knowledge, and (5) explainability of internal mechanisms and interpretability of outputs under a post-hoc setting. We mainly address the ability of knowledge access while providing a preliminary study on the reasoning ability of language models through inverse and compositional reasoning. Another limitation of our work is that, due to limited computation resources, we are unable to train the models without importance sampling on the 46M triplets of $\mathcal{D}_0$. While further investigation of importance sampling may be required to improve credibility and robustness, we believe our experiments on the subset $\mathcal{D}_1$ randomly sampled from $\mathcal{D}_0$ provide preliminary evidence to support our hypothesis in Section 2.3 and serve as a good foundation for speeding up large-scale knowledge memorization.

Ethics Statement

Large language models are known to memorize information from their pre-training corpora. Therefore, probing for stored knowledge may enable privacy attacks against language models, such as training data extraction attacks Neel and Chang (2024); Staab et al. (2023); Hartmann et al. (2023). In this kind of attack, an adversary can reconstruct parts of the training samples when given access to the model, leading to potential exposure of sensitive information that should not be extractable under fair and ethical usage of language models. In addition, Karamolegkou et al. (2023) confirm that language models are able to memorize a substantial portion of copyrighted bestselling books published between 1930 and 2010, which demonstrates the risk of copyright violations when deploying language models.

For our work, the world knowledge dataset $\mathcal{D}_0$ is derived from Wikidata, which follows the CC0 (Creative Commons Public Domain) license (https://www.wikidata.org/wiki/Wikidata:Copyright). In this way, we reduce the concern of language models learning sensitive or copyrighted information when training on the corresponding knowledge dataset. However, we have limited control over information acquired during the pre-training of language models. It is possible to address this issue in future work by either using language models with sensitive and copyrighted information removed or deploying knowledge editing methods Zhang et al. (2024) to enforce data privacy and integrity.

References

  • Alain et al. (2016) Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. 2016. Variance reduction in sgd by distributed importance sampling. Statistics Research Repository, arXiv:1511.06481. Version 7.
  • Alivanistos et al. (2023) Dimitrios Alivanistos, Selene Báez Santamaría, Michael Cochez, Jan-Christoph Kalo, Emile van Krieken, and Thiviyan Thanapalasingam. 2023. Prompting as probing: Using language models for knowledge base construction. Computation Research Repository, arXiv:2208.11057. Version 3.
  • AlKhamissi et al. (2022) Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases. Computation Research Repository, arXiv:2204.06031.
  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: a nucleus for a web of open data. In Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, ISWC’07/ASWC’07, page 722–735, Berlin, Heidelberg. Springer-Verlag.
  • Baek et al. (2023) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE), pages 78–106, Toronto, Canada. Association for Computational Linguistics.
  • Bhutani et al. (2019) Nikita Bhutani, Xinyi Zheng, and H V Jagadish. 2019. Learning to answer complex questions over knowledge bases with query composition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, page 739–748, New York, NY, USA. Association for Computing Machinery.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 1247–1250, New York, NY, USA. Association for Computing Machinery.
  • Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the opportunities and risks of foundation models. Computation Research Repository, arXiv:2108.07258.
  • Bosselut et al. (2019) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.
  • Cao et al. (2021) Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. 2021. Knowledgeable or educated guess? revisiting language models as knowledge bases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1860–1874, Online. Association for Computational Linguistics.
  • Carlini et al. (2023) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models. Computation Research Repository, arXiv:2202.07646.
  • Chen et al. (2022) Weijie Chen, Yongzhu Chang, Rongsheng Zhang, Jiashu Pu, Guandan Chen, Le Zhang, Yadong Xi, Yijiang Chen, and Chang Su. 2022. Probing simile knowledge from pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5875–5887, Dublin, Ireland. Association for Computational Linguistics.
  • Cordella et al. (2004) L.P. Cordella, P. Foggia, C. Sansone, and M. Vento. 2004. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372.
  • Davison et al. (2019) Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics.
  • Elazar et al. (2021) Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.
  • Galetzka et al. (2021) Fabian Galetzka, Jewgeni Rose, David Schlangen, and Jens Lehmann. 2021. Space efficient context encoding for non-task-oriented dialogue generation with graph attention transformer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7028–7041, Online. Association for Computational Linguistics.
  • Grohe and Schweitzer (2020) Martin Grohe and Pascal Schweitzer. 2020. The graph isomorphism problem. Commun. ACM, 63(11):128–134.
  • Gugger et al. (2022) Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
  • Hartmann et al. (2023) Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. 2023. Sok: Memorization in general-purpose large language models. Computation Research Repository, arXiv:2310.18362.
  • Heinzerling and Inui (2021) Benjamin Heinzerling and Kentaro Inui. 2021. Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1772–1791, Online. Association for Computational Linguistics.
  • Jiang et al. (2020) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
  • Kaiser and Christmann (2021) Magdalena Kaiser and Philipp Christmann. 2021. Wikidata core for question answering. https://github.com/PhilippChr/wikidata-core-for-QA.
  • Kandpal et al. (2023) Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 15696–15707. PMLR.
  • Karamolegkou et al. (2023) Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. 2023. Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403–7412, Singapore. Association for Computational Linguistics.
  • Katharopoulos and Fleuret (2018) Angelos Katharopoulos and Francois Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR.
  • Keleg and Magdy (2023) Amr Keleg and Walid Magdy. 2023. DLAMA: A framework for curating culturally diverse facts for probing the knowledge of pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6245–6266, Toronto, Canada. Association for Computational Linguistics.
  • Lan and Jiang (2020) Yunshi Lan and Jing Jiang. 2020. Query graph generation for answering multi-hop complex questions from knowledge bases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 969–974, Online. Association for Computational Linguistics.
  • Li et al. (2022a) Xiang Lorraine Li, Adhiguna Kuncoro, Jordan Hoffmann, Cyprien de Masson d’Autume, Phil Blunsom, and Aida Nematzadeh. 2022a. A systematic investigation of commonsense knowledge in large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11838–11855, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2022b) Yanyang Li, Jianqiao Zhao, Michael Lyu, and Liwei Wang. 2022b. Eliciting knowledge from large pre-trained models for unsupervised knowledge-grounded conversation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10551–10564, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2022c) Yu Li, Baolin Peng, Yelong Shen, Yi Mao, Lars Liden, Zhou Yu, and Jianfeng Gao. 2022c. Knowledge-grounded dialogue generation with a unified knowledge representation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 206–218, Seattle, United States. Association for Computational Linguistics.
  • Longpre et al. (2021) Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Computation Research Repository, arXiv:1711.05101. Version 3.
  • Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada. Association for Computational Linguistics.
  • Moiseev et al. (2022) Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. 2022. SKILL: Structured knowledge infusion for large language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1581–1588, Seattle, United States. Association for Computational Linguistics.
  • Neel and Chang (2024) Seth Neel and Peter Chang. 2024. Privacy issues in large language models: A survey. Computation Research Repository, arXiv:2312.06717.
  • Pellissier Tanon et al. (2016) Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, and Lydia Pintscher. 2016. From freebase to wikidata: The great migration. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 1419–1428, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  • Qiu et al. (2020) Yunqi Qiu, Yuanzhuo Wang, Xiaolong Jin, and Kun Zhang. 2020. Stepwise reasoning for multi-relation question answering over knowledge graph with weak supervision. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, page 474–482, New York, NY, USA. Association for Computing Machinery.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA. Association for Computing Machinery.
  • Razeghi et al. (2022) Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
  • Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3027–3035.
  • Saxena et al. (2020) Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. 2020. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4498–4507, Online. Association for Computational Linguistics.
  • Shazeer and Stern (2018) Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.
  • Shi and Weninger (2018) Baoxu Shi and Tim Weninger. 2018. Open-world knowledge graph completion. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
  • Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
  • Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).
  • Staab et al. (2023) Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. 2023. Beyond memorization: Violating privacy via inference with large language models. Computation Research Repository, arXiv:2310.07298.
  • Sung et al. (2021) Mujeen Sung, Jinhyuk Lee, Sean Yi, Minji Jeon, Sungdong Kim, and Jaewoo Kang. 2021. Can language models be biomedical knowledge bases? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4723–4734, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Tänzer et al. (2022) Michael Tänzer, Sebastian Ruder, and Marek Rei. 2022. Memorisation versus generalisation in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7564–7578, Dublin, Ireland. Association for Computational Linguistics.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Computation Research Repository, arXiv:2307.09288.
  • Verga et al. (2020) Pat Verga, Haitian Sun, Livio Baldini Soares, and William W. Cohen. 2020. Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge. Computing Research Repository, arXiv:2007.00849.
  • Veseli et al. (2023a) Blerta Veseli, Simon Razniewski, Jan-Christoph Kalo, and Gerhard Weikum. 2023a. Evaluating the knowledge base completion potential of GPT. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6432–6443, Singapore. Association for Computational Linguistics.
  • Veseli et al. (2023b) Blerta Veseli, Sneha Singhania, Simon Razniewski, and Gerhard Weikum. 2023b. Evaluating language models for knowledge base completion. In The Semantic Web: 20th International Conference, ESWC 2023, Hersonissos, Crete, Greece, May 28–June 1, 2023, Proceedings, pages 227–243, Berlin, Heidelberg. Springer-Verlag.
  • Wang et al. (2021) Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics, 9:176–194.
  • Wu et al. (2013) Yinghui Wu, Shengqi Yang, and Xifeng Yan. 2013. Ontology-based subgraph querying. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 697–708.
  • Xie et al. (2023) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. 2023. Data selection for language models via importance resampling. Computing Research Repository, arXiv:2302.03169.
  • Yang et al. (2022) Haotong Yang, Zhouchen Lin, and Muhan Zhang. 2022. Rethinking knowledge graph evaluation under the open-world assumption. In Advances in Neural Information Processing Systems, volume 35, pages 8374–8385. Curran Associates, Inc.
  • Yasunaga et al. (2022) Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy S Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems, volume 35, pages 37309–37323. Curran Associates, Inc.
  • Zhang et al. (2019) Jiong Zhang, Hsiang-Fu Yu, and Inderjit S Dhillon. 2019. Autoassist: A framework to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Zhang et al. (2024) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. 2024. A comprehensive study of knowledge editing for large language models. Computing Research Repository, arXiv:2401.01286.
  • Zhang et al. (2021) Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. 2021. GreaseLM: Graph reasoning enhanced language models. In International Conference on Learning Representations.

Appendix A Additional training and evaluation details

Importance Sampling with $\mathcal{D}_1$

We train the T5-base model from its HuggingFace checkpoint (https://huggingface.co/t5-base) in FP32 with a batch size of 300 on two NVIDIA V100 GPUs. We use AdaFactor Shazeer and Stern (2018) as the optimizer with a constant learning rate of 1e-3. The evaluation batch size is 1024. We set the maximum number of training epochs to 100 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for 10 epochs or once the exact match score on $\mathcal{D}_{1-Val}$ exceeds 96%. The model reaches the exact match threshold for early stopping in both experiments, and the training time is around 2 hours and 5 hours with and without importance sampling, respectively.
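
For reference, the sketch below shows the core optimizer setup; it is a minimal illustration, not the released training script, and the single toy example stands in for the actual $\mathcal{D}_1$ question-answer pairs (the full training loop, evaluation on $\mathcal{D}_{1-Val}$, and early stopping follow the thresholds stated above).

```python
# Minimal sketch of the Adafactor configuration described above (constant
# learning rate 1e-3 with Adafactor's internal schedule disabled). The single
# toy example stands in for a D1 question-answer pair; the full training loop,
# evaluation, and early stopping are omitted.
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,              # constant learning rate, as reported above
    relative_step=False,  # disable the built-in relative-step schedule
    scale_parameter=False,
    warmup_init=False,
)

# Toy stand-in for one verbalized knowledge triplet from D1.
inputs = tokenizer("the capital of France is", return_tensors="pt")
labels = tokenizer("Paris", return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```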

Training on $\mathcal{D}_0$

We train T5 models from their HuggingFace checkpoints (https://huggingface.co/t5-large) on two NVIDIA A100 GPUs, with a batch size of 512 and an evaluation batch size of 1024 in FP32 for T5-base, and a batch size of 300 and an evaluation batch size of 512 in BF16 for T5-large. We use AdaFactor as the optimizer with a constant learning rate of 1e-3. The approximate time for one epoch of training is 15 hours for T5-base and 11 hours for T5-large. We also set the maximum number of training epochs to 50 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for ten epochs or once the exact match score on $\mathcal{D}_2$ exceeds 96%. Neither model meets the early stopping criteria when training on $\mathcal{D}_0$.

We train LLaMA-2 models from their HuggingFace checkpoints (https://huggingface.co/meta-llama/Llama-2-7b-hf, https://huggingface.co/meta-llama/Llama-2-13b-hf) on eight NVIDIA A800 GPUs in BF16 using DeepSpeed Rasley et al. (2020) and ZeRO Rajbhandari et al. (2020) with Accelerate Gugger et al. (2022). For LLaMA-2-7b, the training batch size is 768 and the evaluation batch size is 96; for LLaMA-2-13b, the training batch size is 400 and the evaluation batch size is 50. For both models, we use AdamW Loshchilov and Hutter (2019) as the optimizer with a constant learning rate of 1e-5 and set the maximum sequence length to 64. The approximate time for one epoch of training is 8 hours for LLaMA-2-7b and 15 hours for LLaMA-2-13b. We also set the maximum number of training epochs to 20 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for five epochs or once the exact match score on $\mathcal{D}_2$ exceeds 96%. Neither model meets the early stopping criteria when training on $\mathcal{D}_0$.
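
The sketch below illustrates the core of this configuration with Accelerate; it is a simplified single-example illustration rather than the released multi-GPU launch scripts, the DeepSpeed/ZeRO settings are assumed to be supplied through Accelerate's launcher (for example via accelerate config), and access to the gated LLaMA-2 checkpoints is assumed.

```python
# Sketch of the LLaMA-2 training setup described above: BF16, AdamW with a
# constant learning rate of 1e-5, and a maximum sequence length of 64.
# DeepSpeed/ZeRO are assumed to be configured through Accelerate's launcher;
# the single toy example stands in for the D0 dataloader.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator(mixed_precision="bf16")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Toy stand-in for verbalized knowledge triplets from D0.
texts = ["the capital of France is Paris"]
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=64,
                return_tensors="pt")
# Ignore pad positions in the language-modeling loss.
enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
train_loader = DataLoader([{k: v[0] for k, v in enc.items()}], batch_size=1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # lets Accelerate handle mixed precision / ZeRO
    optimizer.step()
    optimizer.zero_grad()
```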

Finetuning and Inference

We finetune T5-base in FP32 on two NVIDIA V100 GPUs, and T5-large in BF16 on two NVIDIA A100 GPUs. We set the training batch size to 256 and the evaluation batch size to 512, with the same optimizer and learning rate as in training. With a maximum of 30 epochs, we enforce an early stopping policy that terminates finetuning if the model shows no improvement on the validation set for ten epochs.

For LLaMA-2 models, we perform finetuning with the same configurations as training on $\mathcal{D}_0$, except that we set the maximum number of finetuning epochs to 15 with an early stopping policy that terminates finetuning if the model shows no improvement on the validation set for five epochs.

Appendix B Reasoning rules and triplet-to-text templates for inverse and compositional reasoning

In Table 1, we present the relations for the inverse reasoning rule $r$ inverse of $r'$ used in Section 5.2. The corresponding templates for converting triplets with these relations into natural language question-answer pairs can be found in Table 2; a short sketch after Table 2 illustrates the conversion.

Reasoning Rule: $r$ inverse of $r'$
$r$ | $r'$
sibling | sibling
shares border with | shares border with
father | child
mother | child
capital | capital of
part of | has part
country | contains
Table 1: Reasoning rules for inverse relations.
$relation$ | question text
“sibling” | the sibling of $subject$ is
“shares border with” | $subject$ shares border with
“child” | $subject$ has child
“capital of” | $subject$ is capital of
“has part” | $subject$ has part
“contains” | $subject$ contains
“father” | the father of $subject$ is
“mother” | the mother of $subject$ is
“capital” | the capital of $subject$ is
“part of” | $subject$ is part of
“country” | the country $subject$ belongs to is
Table 2: Templates for converting knowledge triplets to natural language text for Section 5.2. The first column is the $relation$ in the knowledge triplet $(subject, relation, object)$ and the second column is the question text querying for $object$ using $subject$ and $relation$ in natural language.
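
To make the conversion concrete, the following sketch (a minimal illustration, not the released preprocessing code) applies the inverse rules in Table 1 and the templates in Table 2 to a single triplet.

```python
# Illustrative sketch of how a knowledge triplet is turned into a
# question-answer pair with the inverse rules in Table 1 and the templates in
# Table 2: for a triplet (s, r, o), the rule "r inverse of r'" yields the
# inverse triplet (o, r', s).
INVERSE_RULES = {            # r -> r' (Table 1)
    "sibling": "sibling",
    "shares border with": "shares border with",
    "father": "child",
    "mother": "child",
    "capital": "capital of",
    "part of": "has part",
    "country": "contains",
}

TEMPLATES = {                # relation -> question text (Table 2)
    "sibling": "the sibling of {subject} is",
    "shares border with": "{subject} shares border with",
    "child": "{subject} has child",
    "capital of": "{subject} is capital of",
    "has part": "{subject} has part",
    "contains": "{subject} contains",
    "father": "the father of {subject} is",
    "mother": "the mother of {subject} is",
    "capital": "the capital of {subject} is",
    "part of": "{subject} is part of",
    "country": "the country {subject} belongs to is",
}

def to_qa(subject, relation, obj):
    """Render a triplet as a (question, answer) pair."""
    return TEMPLATES[relation].format(subject=subject), obj

def invert(subject, relation, obj):
    """Apply the inverse rule r -> r' to obtain the inverse triplet."""
    return obj, INVERSE_RULES[relation], subject

# Example: a memorized triplet and its inverse query.
print(to_qa("France", "capital", "Paris"))           # ('the capital of France is', 'Paris')
print(to_qa(*invert("France", "capital", "Paris")))  # ('Paris is capital of', 'France')
```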

In Table 3, we present the relations for the compositional reasoning rules $r_1 \wedge r_2 \Rightarrow r_3$ used in Section 5.3. The corresponding templates for converting triplets with these relations into natural language question-answer pairs can be found in Table 4; a short sketch after Table 4 illustrates the composition.

Reasoning Rule: $r_1 \wedge r_2 \Rightarrow r_3$
$r_1$ | $r_2$ | $r_3$
place of birth | country | country of birth
place of burial | country | country of burial
place of publication | country | country of publication
place of death | country | country of death
performer | languages spoken, written or signed | language of work or name
author | languages spoken, written or signed | language of work or name
father | father | grandfather
mother | mother | grandmother
Table 3: Reasoning rules for relation composition.
$r_1 \wedge r_2 \Rightarrow r_3$ | $relation$ | question text
$r_1$ | “place of birth” | the place of birth of $subject$ is
$r_1$ | “place of burial” | the place of burial of $subject$ is
$r_1$ | “place of publication” | the place of publication of $subject$ is
$r_1$ | “place of death” | the place of death of $subject$ is
$r_1$ | “author” | the author of $subject$ is
$r_1$ and $r_2$ | “father” | the father of $subject$ is
$r_1$ and $r_2$ | “mother” | the mother of $subject$ is
$r_2$ | “country” | the country $subject$ belongs to is
$r_2$ | “languages spoken, written or signed” | the languages spoken, written or signed by $subject$ is
$r_3$ | “country of birth” | the country of birth of $subject$ is
$r_3$ | “country of burial” | the country of burial of $subject$ is
$r_3$ | “country of publication” | the country of publication of $subject$ is
$r_3$ | “country of death” | the country of death of $subject$ is
$r_3$ | “language of work or name” | the language of $subject$ is
$r_3$ | “grandfather” | the grandfather of $subject$ is
$r_3$ | “grandmother” | the grandmother of $subject$ is
Table 4: Templates for converting knowledge triplets to natural language text for Section 5.3. The first column indicates where the $relation$ appears in the compositional rule $r_1 \wedge r_2 \Rightarrow r_3$, the second column is the $relation$ in the knowledge triplet $(subject, relation, object)$, and the third column is the question text querying for $object$ using $subject$ and $relation$ in natural language.
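
Analogously, the sketch below (again a minimal illustration rather than the released code) composes two triplets under a rule from Table 3 and renders the inferred fact as a question using the $r_3$ templates in Table 4; the example entities are hypothetical.

```python
# Illustrative sketch of how a compositional rule r1 ∧ r2 ⇒ r3 from Table 3
# combines two memorized triplets (s, r1, x) and (x, r2, o) into an inferred
# triplet (s, r3, o), which is then rendered as a question via Table 4.
COMPOSITION_RULES = {        # (r1, r2) -> r3 (Table 3)
    ("place of birth", "country"): "country of birth",
    ("place of burial", "country"): "country of burial",
    ("place of publication", "country"): "country of publication",
    ("place of death", "country"): "country of death",
    ("performer", "languages spoken, written or signed"): "language of work or name",
    ("author", "languages spoken, written or signed"): "language of work or name",
    ("father", "father"): "grandfather",
    ("mother", "mother"): "grandmother",
}

R3_TEMPLATES = {             # r3 -> question text (Table 4)
    "country of birth": "the country of birth of {subject} is",
    "country of burial": "the country of burial of {subject} is",
    "country of publication": "the country of publication of {subject} is",
    "country of death": "the country of death of {subject} is",
    "language of work or name": "the language of {subject} is",
    "grandfather": "the grandfather of {subject} is",
    "grandmother": "the grandmother of {subject} is",
}

def compose(triplet_1, triplet_2):
    """Infer (s, r3, o) from (s, r1, x) and (x, r2, o) if a rule applies."""
    s, r1, x1 = triplet_1
    x2, r2, o = triplet_2
    r3 = COMPOSITION_RULES.get((r1, r2))
    if r3 is None or x1 != x2:
        return None
    return s, r3, o

# Hypothetical example: two single-hop facts and the inferred two-hop fact.
fact_a = ("Marie Curie", "place of birth", "Warsaw")
fact_b = ("Warsaw", "country", "Poland")
s, r3, o = compose(fact_a, fact_b)
print(R3_TEMPLATES[r3].format(subject=s), "->", o)
# the country of birth of Marie Curie is -> Poland
```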

Appendix C Dataset and open-source projects

In preparing our own world knowledge dataset $\mathcal{D}_0$, whose scale is comparable to the latest KBs, we use the CC0-licensed English Wikidata Pellissier Tanon et al. (2016) as the source of world knowledge and an MIT-licensed code project released by Kaiser and Christmann (2021) to filter out knowledge irrelevant to common linguistic tasks. We further derive various subsets from $\mathcal{D}_0$ to study the memorization behavior of language models, as described in Sections 2.3, 3, 5.2 and 5.3.

Our experiments on free-form information in Section 4 are based on the PopQA dataset released by Mallen et al. (2023) under the MIT License. For general missing fact completion in Section 5.1, we use the portion of human-annotated missing facts from the dataset created by Veseli et al. (2023b), which is open-sourced in a public repository.

For experiments on LLaMA-2 models, we also employ a public fork of the HuggingFace Transformers library to address the left-padding problem that may affect inference results. The fork is licensed under Apache-2.0 and hosted at https://github.com/yizhongw/transformers/tree/left_padding.
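
For completeness, the snippet below sketches the left-padding setup that the fork targets: batched generation with a decoder-only model such as LLaMA-2 requires padding on the left so that each prompt ends immediately before the generated tokens. This is an illustrative configuration, not a description of the fork's internal changes.

```python
# Minimal sketch of left padding for batched generation with a decoder-only
# model: padding on the left keeps each prompt adjacent to the tokens being
# generated, whereas right padding would insert pad tokens between them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # key setting for batched inference

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

prompts = ["the capital of France is", "the author of Hamlet is"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```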